# 🤖 LiteRT-LM Web Server

Run **Gemma 4** models on embedded devices (Orange Pi 5, Raspberry Pi, etc.) using [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) with a REST API and Web UI.

---

## 📋 Requirements

- Python 3.10+
- [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) installed and working
- Python libraries:

```bash
pip install -r requirements.txt
```

`requirements.txt`:
```txt
fastapi
uvicorn
pydantic
huggingface_hub
```

---

## 📁 Project Structure

```text
.
├── app.py               # Simple REST API, single-turn
├── server.py            # Full REST API + Web UI, multi-turn sessions
├── templates/
│   └── index.html       # Web UI page (kept separate from server.py)
├── models/              # Directory containing .litertlm model files
│   ├── gemma-4-E2B-it.litertlm
│   └── gemma-4-E4B-it.litertlm
├── requirements.txt
└── README.md
```

---

## 🤖 Supported Models

| Model | Hugging Face Repo | Description |
|-------|-------------------|-------------|
| `gemma-4-E2B-it` | https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm | Edge 2B — faster and lighter |
| `gemma-4-E4B-it` | https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm | Edge 4B — smarter and heavier |

> **Note:** Use the `-it` (instruction-tuned) versions for chat and Q&A. Versions without `-it` are base models that only predict the next token and are not suited to conversation.

### Download Models

```bash
# Gemma 4 E2B (smaller, faster)
hf download litert-community/gemma-4-E2B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

# Gemma 4 E4B (larger, smarter)
hf download litert-community/gemma-4-E4B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/
```

> **Or** let the server automatically download the model when you select one that is not available locally.

---

## 🚀 Usage Guide

### Method 1: Run with Default Options

```bash
python server.py
```

The server will display a **model selection menu** before starting:

```text
====================================================
  LiteRT-LM Server — Select Model
====================================================
  [1] gemma-4-E2B-it
      Gemma 4 Edge 2B — smaller, faster
      ✓ available

  [2] gemma-4-E4B-it
      Gemma 4 Edge 4B — smarter, slower
      ✗ not downloaded

  [3] Use model from another path

Select model (1/2/3):
```

**Automatic model download:**
- If the selected model is not available locally, the server asks: `Do you want to download the model now? (y/n)`
- Answer `y` to download it from Hugging Face automatically
- Answer `n` to skip the download and fetch the model manually later
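
The availability check and download step could look roughly like the sketch below. It is not taken from `server.py`: the function names `model_status` and `download_model` are hypothetical, and the registry is a trimmed copy used only for illustration. The download uses `snapshot_download` from `huggingface_hub` with an `allow_patterns` filter, which mirrors the `--include '*.litertlm'` option of the CLI command.

```python
from pathlib import Path

# Hypothetical, trimmed copy of the registry; the real one lives in server.py.
AVAILABLE_MODELS = {
    "gemma-4-E2B-it": {
        "file": "gemma-4-E2B-it.litertlm",
        "repo": "litert-community/gemma-4-E2B-it-litert-lm",
    },
    "gemma-4-E4B-it": {
        "file": "gemma-4-E4B-it.litertlm",
        "repo": "litert-community/gemma-4-E4B-it-litert-lm",
    },
}


def model_status(models_dir, registry=AVAILABLE_MODELS):
    """Map each model name to True if its .litertlm file exists locally."""
    root = Path(models_dir)
    return {name: (root / spec["file"]).is_file() for name, spec in registry.items()}


def download_model(name, models_dir, registry=AVAILABLE_MODELS):
    """Fetch only the .litertlm files of a model's repo into models_dir."""
    # Imported lazily so the status check works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=registry[name]["repo"],
        allow_patterns=["*.litertlm"],
        local_dir=str(models_dir),
    )


if __name__ == "__main__":
    for name, present in model_status("models").items():
        print(f"{name}: {'available' if present else 'not downloaded'}")
```

Calling `download_model("gemma-4-E4B-it", "models")` would then place the file where the menu expects it, assuming the repo stores the `.litertlm` file at its top level.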

**Automatic port handling:**
- If port 8000 is already in use, the server prompts you for another port
- Press Enter at that prompt to let the server find an available port automatically (8001-8999)
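
One common way to implement this kind of port search is to try binding a socket to each candidate port. The sketch below illustrates that approach using only the standard library; the function names are hypothetical and the actual logic in `server.py` may differ.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def find_free_port(start=8001, end=8999, host="0.0.0.0"):
    """Return the first free port in [start, end], or None if all are taken."""
    for port in range(start, end + 1):
        if port_is_free(port, host):
            return port
    return None
```

A port reported as free can still be taken by another process before the server binds it, so the result is best treated as a hint rather than a guarantee.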

### Method 2: Run with Command Line Arguments

```bash
# Specify port
python server.py --port 8080

# Specify model path
python server.py --model /path/to/model.litertlm

# Combine both
python server.py --port 8080 --model ~/models/gemma-4-E2B-it.litertlm

# Show full help
python server.py --help
```

### Open the Web UI

```text
http://<ip-address>:<port>
```

The model name and port will be displayed when the server starts:

```text
====================================================
  🚀 Server is starting...
  📍 URL: http://localhost:8000
  📦 Model: gemma-4-E2B-it.litertlm
====================================================
```

---

## 📄 `app.py` — Simple REST API

A basic single-turn API without a model selection menu. Suitable for quick integrations or testing.

### Run

```bash
python app.py
```

### Endpoint

#### `POST /generate`

Send a prompt and receive a response. Each request is independent, with **no memory** between calls.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Who are you?"}'
```

**Response:**

```json
{
  "response": "I am Gemma 4, a Large Language Model...",
  "tokens": 42,
  "elapsed_s": 5.31,
  "tokens_per_sec": 7.91
}
```

---

## 🖥️ `server.py` — Full REST API + Web UI

The full version includes model selection at startup, multi-turn conversations, session management, and a browser-based chat interface.

---

### 🌐 Web UI

Open your browser and visit `http://<ip-address>:<port>` (port `8000` by default).

Features:
- **Model selection at startup** via CLI menu — model name displayed directly in the header
- User-friendly chat interface with Vietnamese language support
- Automatically creates a session when opening the page
- Remembers conversation context within the same session
- **New** button to start a new conversation
- **Clear** button to delete history and create a new session
- `Enter` to send, `Shift + Enter` for a new line
- **Markdown rendering**: responses are displayed with proper formatting (headings, lists, code blocks, tables, bold/italic, etc.)
- **Performance metrics**: `⚡ X tok/s` badge below each response, including token count and processing time

---

### 🔌 REST API

#### `GET /info`
Returns information about the currently running model and the number of active sessions.

```bash
curl http://localhost:8000/info
```

**Response:**

```json
{
  "model": "gemma-4-E2B-it",
  "sessions": 2
}
```

---

#### `POST /generate`
Single-turn request without context memory. Useful for standalone Q&A.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of Vietnam?"}'
```

**Response:**

```json
{
  "response": "The capital of Vietnam is Hanoi.",
  "tokens": 12,
  "elapsed_s": 1.45,
  "tokens_per_sec": 8.27
}
```

---

#### `POST /chat/new`
Create a new session. Returns a `session_id` for subsequent requests.

```bash
curl -X POST http://localhost:8000/chat/new
```

**Response:**

```json
{
  "session_id": "a3f2c1d4-..."
}
```

---

#### `POST /chat/{session_id}`
Send a message within a session. The model **remembers the entire conversation history** for that session.

```bash
curl -X POST http://localhost:8000/chat/a3f2c1d4-... \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me more about that"}'
```

**Response:**

```json
{
  "session_id": "a3f2c1d4-...",
  "response": "...",
  "tokens": 58,
  "elapsed_s": 7.12,
  "tokens_per_sec": 8.15
}
```

---

#### `DELETE /chat/{session_id}`
Delete a session and free the memory it used.

```bash
curl -X DELETE http://localhost:8000/chat/a3f2c1d4-...
```

**Response:**

```json
{
  "status": "cleared",
  "session_id": "a3f2c1d4-..."
}
```

---

#### `GET /chat/sessions/list`
List all active sessions.

```bash
curl http://localhost:8000/chat/sessions/list
```

**Response:**

```json
{
  "sessions": ["a3f2c1d4-...", "b7e9f2a1-..."],
  "count": 2
}
```
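
To make the session endpoints easier to reason about, here is a minimal sketch of an in-memory session store that supports the same four operations (create, append, delete, list). The class `SessionStore` and its method names are hypothetical. The real server may instead keep LiteRT-LM conversation objects per session, so treat this only as an illustration of the bookkeeping.

```python
import uuid


class SessionStore:
    """In-memory map of session_id -> list of {"role", "content"} turns."""

    def __init__(self):
        self._sessions = {}

    def new(self):
        # Mirrors POST /chat/new: create an empty history under a fresh id.
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = []
        return session_id

    def append(self, session_id, role, content):
        # Raises KeyError for unknown ids; an HTTP layer could map that to 404.
        self._sessions[session_id].append({"role": role, "content": content})

    def history(self, session_id):
        # Return a copy so callers cannot mutate the stored history.
        return list(self._sessions[session_id])

    def delete(self, session_id):
        # Returns True if a session was removed, False if it did not exist.
        return self._sessions.pop(session_id, None) is not None

    def summary(self):
        # Same shape as the GET /chat/sessions/list response.
        return {"sessions": list(self._sessions), "count": len(self._sessions)}
```

Because everything lives in a plain dictionary, all sessions are lost when the process restarts.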

---

## 💡 Example: Multi-turn Conversation via curl

```bash
# 1. Create a session
SESSION=$(curl -s -X POST http://localhost:8000/chat/new | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])")

# 2. Send the first message
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My name is Nam"}' | python3 -m json.tool

# 3. The model remembers context
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is my name?"}' | python3 -m json.tool

# 4. Delete the session when done
curl -X DELETE http://localhost:8000/chat/$SESSION
```
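
The same flow can be scripted in Python with only the standard library. This is a sketch under the assumption that the server runs on `http://localhost:8000` and returns the JSON shapes documented in the REST API section. The helper names `build_request`, `call`, and `demo` are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # adjust host and port to your server


def build_request(method, path, payload=None, base_url=BASE_URL):
    """Build a urllib Request for one of the documented endpoints."""
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method=method
    )


def call(method, path, payload=None, base_url=BASE_URL):
    """Send the request and decode the JSON reply."""
    request = build_request(method, path, payload, base_url)
    with urllib.request.urlopen(request) as resp:
        return json.loads(resp.read().decode("utf-8"))


def demo():
    """Create a session, send two messages, then delete the session."""
    session_id = call("POST", "/chat/new")["session_id"]
    try:
        call("POST", f"/chat/{session_id}", {"prompt": "My name is Nam"})
        reply = call("POST", f"/chat/{session_id}", {"prompt": "What is my name?"})
        print(reply["response"])
    finally:
        # Always free the session, even if a request above failed.
        call("DELETE", f"/chat/{session_id}")
```

Run `demo()` while the server is up. The `try`/`finally` ensures the session is deleted even when a request fails midway.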

---

## ⚙️ Configuration

### Command Line Arguments

| Argument | Description | Default |
|-----------|-------------|----------|
| `--port`, `-p` | Server port | `8000` |
| `--model`, `-m` | Full path to the `.litertlm` model file | None (select from menu) |
| `--help`, `-h` | Show help | - |
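
For reference, the table above maps onto `argparse` roughly as follows. This is a sketch, not the parser from `server.py`: the function name `build_parser` and the help strings are assumptions. `--help`/`-h` is provided by `argparse` automatically.

```python
import argparse


def build_parser():
    """Build a parser with the --port and --model options from the table."""
    parser = argparse.ArgumentParser(description="LiteRT-LM web server (sketch)")
    parser.add_argument("--port", "-p", type=int, default=8000, help="Server port")
    parser.add_argument(
        "--model",
        "-m",
        default=None,
        help="Full path to the .litertlm model file (omit to pick from the menu)",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"port={args.port} model={args.model}")
```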

### Configuration in Code

Parameters configured near the top of `server.py`:

| Variable | Description | Default |
|-----------|-------------|----------|
| `MODELS_DIR` | Directory containing models | `./models` |
| `AVAILABLE_MODELS` | List of models + Hugging Face repos | see file |
| `backend` | Inference backend | `litert_lm.Backend.CPU` |
| `host` | Listening address | `0.0.0.0` |

To add a new model to the menu, add an entry to the `AVAILABLE_MODELS` dictionary in `server.py`:

```python
AVAILABLE_MODELS = {
    "gemma-4-E2B-it": {
        "file": "gemma-4-E2B-it.litertlm",
        "repo": "litert-community/gemma-4-E2B-it-litert-lm",
        "desc": "Gemma 4 Edge 2B — smaller, faster",
    },
    "new-model-name": {
        "file": "new-model-name.litertlm",
        "repo": "org/repo-name",
        "desc": "Model description",
    },
}
```

To switch the backend to GPU (if supported by the device):

```python
engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU)
```
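
If GPU support on the device is uncertain, one option is to try the GPU backend first and fall back to CPU. The helper below is a generic sketch: `create_engine_with_fallback` is a hypothetical name, and it assumes that constructing an engine with an unsupported backend raises an exception, which this README does not confirm for LiteRT-LM.

```python
def create_engine_with_fallback(make_engine, backends):
    """Try each backend in order; return (engine, backend) for the first that works.

    make_engine is a callable that takes one backend and returns an engine.
    It is assumed to raise if that backend is unavailable.
    """
    errors = []
    for backend in backends:
        try:
            return make_engine(backend), backend
        except Exception as exc:  # broad on purpose: the exception type is unknown here
            errors.append(f"{backend}: {exc}")
    raise RuntimeError("no usable backend; tried " + "; ".join(errors))


# Intended usage inside server.py (not executed here):
# engine, used = create_engine_with_fallback(
#     lambda b: litert_lm.Engine(str(MODEL_PATH), backend=b),
#     [litert_lm.Backend.GPU, litert_lm.Backend.CPU],
# )
```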

### Run as a systemd Service (Linux)

See `SERVICE_README.md` for detailed instructions.

```bash
# Install the service
sudo bash install_service.sh

# Manage the service
sudo systemctl status litert-lm
sudo systemctl restart litert-lm
sudo journalctl -u litert-lm -f
```

---

## 📝 Notes

- Each session keeps its entire conversation history in RAM, so delete sessions you no longer need.
- The `mel_filterbank` warning during startup is normal — it is related to the Gemma 4 multimodal audio encoder and does not affect text generation.
- Generation speed depends on the hardware. On an Orange Pi 5 using CPU, expect around 5–15 tokens/second.
- The tokens/s figure counts tokens with `engine.tokenize()` when it is available; otherwise it falls back to an estimate of `len(text) // 4`.
- Markdown is rendered using https://marked.js.org/ directly in the browser, not on the server.
- Only use `-it` (instruction-tuned) models for chat — base models are not suitable for conversations.
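
The token counting fallback and the metrics in the API responses can be sketched as below. The function names are hypothetical, and the sketch assumes that the tokenizer callable returns a sequence with one element per token. The exact return type of `engine.tokenize()` is not documented here.

```python
def count_tokens(text, tokenize=None):
    """Use the tokenizer when given, else estimate about 4 characters per token."""
    if tokenize is not None:
        return len(tokenize(text))
    return len(text) // 4


def generation_stats(text, elapsed_s, tokenize=None):
    """Return the tokens, elapsed_s, and tokens_per_sec fields for a response."""
    tokens = count_tokens(text, tokenize)
    # Guard against division by zero for extremely fast or empty generations.
    tps = tokens / elapsed_s if elapsed_s > 0 else 0.0
    return {
        "tokens": tokens,
        "elapsed_s": round(elapsed_s, 2),
        "tokens_per_sec": round(tps, 2),
    }
```

The `len(text) // 4` estimate is coarse and can be noticeably off for non-English text, so the tokenizer path is preferable whenever it exists.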

---

## 📜 License

Copyright (c) 2026 Tran Thanh Tan / TTAI Solutions Software

All rights reserved.

No part of this software or its source code may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the copyright holder.