From 6d5a3bd72c7f90405d055722246c3b41bdb64292 Mon Sep 17 00:00:00 2001
From: Tony Tran
Date: Sun, 10 May 2026 16:51:48 +0700
Subject: [PATCH] Upload files to "/"

---
 README-en.md | 412 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 412 insertions(+)
 create mode 100644 README-en.md

diff --git a/README-en.md b/README-en.md
new file mode 100644
index 0000000..aa1f32c
--- /dev/null
+++ b/README-en.md
@@ -0,0 +1,412 @@

# πŸ€– LiteRT-LM Web Server

Run **Gemma 4** models on embedded devices (Orange Pi 5, Raspberry Pi, etc.) using [LiteRT-LM](https://github.com/google-ai-edge/litert-lm), with a REST API and Web UI.

---

## πŸ“‹ Requirements

- Python 3.10+
- [litert-lm](https://github.com/google-ai-edge/litert-lm) installed and working
- Python libraries:

```bash
pip install -r requirements.txt
```

`requirements.txt`:
```txt
fastapi
uvicorn
pydantic
huggingface_hub
```

---

## πŸ“ Project Structure

```text
.
β”œβ”€β”€ app.py               # Simple REST API, single-turn
β”œβ”€β”€ server.py            # Full REST API + Web UI, multi-turn sessions
β”œβ”€β”€ templates/
β”‚   └── index.html       # Web UI interface (separated from server.py)
β”œβ”€β”€ models/              # Directory containing .litertlm model files
β”‚   β”œβ”€β”€ gemma-4-E2B-it.litertlm
β”‚   └── gemma-4-E4B-it.litertlm
β”œβ”€β”€ requirements.txt
└── README.md
```

---

## πŸ€– Supported Models

| Model | Hugging Face Repo | Description |
|-------|-------------------|-------------|
| `gemma-4-E2B-it` | https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm | Edge 2B β€” faster and lighter |
| `gemma-4-E4B-it` | https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm | Edge 4B β€” smarter and heavier |

> **Note:** Use the `-it` (instruction-tuned) versions for chat/Q&A. Versions without `-it` are base models that only predict the next token and are not suitable for conversation.
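If you prefer scripting the download over the CLI, the repos in the table above can also be fetched with `huggingface_hub` (already in `requirements.txt`). The sketch below is illustrative: the helper names `model_path` and `download_model` are not part of `server.py`.

```python
from pathlib import Path

# Repo IDs from the table above; file names follow the models/ layout.
MODELS = {
    "gemma-4-E2B-it": "litert-community/gemma-4-E2B-it-litert-lm",
    "gemma-4-E4B-it": "litert-community/gemma-4-E4B-it-litert-lm",
}

def model_path(name: str, models_dir: str = "models") -> Path:
    """Local path where the .litertlm file is expected after download."""
    return Path(models_dir) / f"{name}.litertlm"

def download_model(name: str, models_dir: str = "models") -> Path:
    """Fetch only the *.litertlm file for `name` into models_dir/."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id=MODELS[name],
        allow_patterns="*.litertlm",   # skip non-model files in the repo
        local_dir=models_dir,
    )
    return model_path(name, models_dir)
```

Passing `allow_patterns` to `snapshot_download` mirrors the `--include '*.litertlm'` flag of the CLI commands below, so only the model weights are pulled.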

### Download Models

```bash
# Gemma 4 E2B (smaller, faster)
hf download litert-community/gemma-4-E2B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

# Gemma 4 E4B (larger, smarter)
hf download litert-community/gemma-4-E4B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/
```

> **Or** let the server download the model automatically when you select one that is not available locally.

---

## πŸš€ Usage Guide

### Method 1: Run with Default Options

```bash
python server.py
```

The server displays a **model selection menu** before starting:

```text
====================================================
  LiteRT-LM Server β€” Select Model
====================================================
  [1] gemma-4-E2B-it
      Gemma 4 Edge 2B β€” smaller, faster
      βœ“ available

  [2] gemma-4-E4B-it
      Gemma 4 Edge 4B β€” smarter, slower
      βœ— not downloaded

  [3] Use model from another path

Select model (1/2/3):
```

**Automatic model download:**
- If the selected model is not available, the server asks: `Do you want to download the model now? (y/n)`
- Select `y` to download it automatically from Hugging Face
- Or select `n` to download it manually later

**Automatic port handling:**
- If port 8000 is already in use, the server asks you to choose another port
- Or press Enter to automatically find an available port (8001–8999)

### Method 2: Run with Command Line Arguments

```bash
# Specify port
python server.py --port 8080

# Specify model path
python server.py --model /path/to/model.litertlm

# Combine both
python server.py --port 8080 --model ~/models/gemma-4-E2B-it.litertlm

# Show full help
python server.py --help
```

### Open the Web UI

```text
http://<server-ip>:<port>
```

The model name and port are displayed when the server starts:

```text
====================================================
 πŸš€ Server is starting...
+ πŸ“ URL: http://localhost:8000 + πŸ“¦ Model: gemma-4-E2B-it.litertlm +==================================================== +``` + +--- + +## πŸ“„ `app.py` β€” Simple REST API + +A basic single-turn API without a model selection menu. Suitable for quick integrations or testing. + +### Run + +```bash +python app.py +``` + +### Endpoint + +#### `POST /generate` + +Send a prompt and receive a response. Each request is independent, with **no memory** between calls. + +```bash +curl -X POST http://localhost:8000/generate \ + -H "Content-Type: application/json" \ + -d '{"prompt": "Who are you?"}' +``` + +**Response:** + +```json +{ + "response": "I am Gemma 4, a Large Language Model...", + "tokens": 42, + "elapsed_s": 5.31, + "tokens_per_sec": 7.91 +} +``` + +--- + +## πŸ–₯️ `server.py` β€” Full REST API + Web UI + +The full version includes model selection at startup, multi-turn conversations, session management, and a browser-based chat interface. + +--- + +### 🌐 Web UI + +Open your browser and visit `http://:8000` + +Features: +- **Model selection at startup** via CLI menu β€” model name displayed directly in the header +- User-friendly chat interface with Vietnamese language support +- Automatically creates a session when opening the page +- Remembers conversation context within the same session +- **New** button to start a new conversation +- **Clear** button to delete history and create a new session +- `Enter` to send, `Shift + Enter` for a new line +- **Markdown rendering**: responses are displayed with proper formatting (headings, lists, code blocks, tables, bold/italic, etc.) +- **Performance metrics**: `⚑ X tok/s` badge below each response, including token count and processing time + +--- + +### πŸ”Œ REST API + +#### `GET /info` +Returns information about the currently running model and the number of active sessions. 
+ +```bash +curl http://localhost:8000/info +``` + +**Response:** + +```json +{ + "model": "gemma-4-E2B-it", + "sessions": 2 +} +``` + +--- + +#### `POST /generate` +Single-turn request without context memory. Useful for standalone Q&A. + +```bash +curl -X POST http://localhost:8000/generate \ + -H "Content-Type: application/json" \ + -d '{"prompt": "What is the capital of Vietnam?"}' +``` + +**Response:** + +```json +{ + "response": "The capital of Vietnam is Hanoi.", + "tokens": 12, + "elapsed_s": 1.45, + "tokens_per_sec": 8.27 +} +``` + +--- + +#### `POST /chat/new` +Create a new session. Returns a `session_id` for subsequent requests. + +```bash +curl -X POST http://localhost:8000/chat/new +``` + +**Response:** + +```json +{ + "session_id": "a3f2c1d4-..." +} +``` + +--- + +#### `POST /chat/{session_id}` +Send a message within a session. The model **remembers the entire conversation history** for that session. + +```bash +curl -X POST http://localhost:8000/chat/a3f2c1d4-... \ + -H "Content-Type: application/json" \ + -d '{"prompt": "Tell me more about that"}' +``` + +**Response:** + +```json +{ + "session_id": "a3f2c1d4-...", + "response": "...", + "tokens": 58, + "elapsed_s": 7.12, + "tokens_per_sec": 8.15 +} +``` + +--- + +#### `DELETE /chat/{session_id}` +Delete a session and free memory. + +```bash +curl -X DELETE http://localhost:8000/chat/a3f2c1d4-... +``` + +**Response:** + +```json +{ + "status": "cleared", + "session_id": "a3f2c1d4-..." +} +``` + +--- + +#### `GET /chat/sessions/list` +List all active sessions. + +```bash +curl http://localhost:8000/chat/sessions/list +``` + +**Response:** + +```json +{ + "sessions": ["a3f2c1d4-...", "b7e9f2a1-..."], + "count": 2 +} +``` + +--- + +## πŸ’‘ Example: Multi-turn Conversation via curl + +```bash +# 1. Create a session +SESSION=$(curl -s -X POST http://localhost:8000/chat/new | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])") + +# 2. 
Send the first message +curl -s -X POST http://localhost:8000/chat/$SESSION \ + -H "Content-Type: application/json" \ + -d '{"prompt": "My name is Nam"}' | python3 -m json.tool + +# 3. The model remembers context +curl -s -X POST http://localhost:8000/chat/$SESSION \ + -H "Content-Type: application/json" \ + -d '{"prompt": "What is my name?"}' | python3 -m json.tool + +# 4. Delete the session when done +curl -X DELETE http://localhost:8000/chat/$SESSION +``` + +--- + +## βš™οΈ Configuration + +### Command Line Arguments + +| Argument | Description | Default | +|-----------|-------------|----------| +| `--port`, `-p` | Server port | `8000` | +| `--model`, `-m` | Full path to the `.litertlm` model file | None (select from menu) | +| `--help`, `-h` | Show help | - | + +### Configuration in Code + +Parameters configured near the top of `server.py`: + +| Variable | Description | Default | +|-----------|-------------|----------| +| `MODELS_DIR` | Directory containing models | `./models` | +| `AVAILABLE_MODELS` | List of models + Hugging Face repos | see file | +| `backend` | Inference backend | `litert_lm.Backend.CPU` | +| `host` | Listening address | `0.0.0.0` | + +To add a new model to the menu, append it to the `AVAILABLE_MODELS` dictionary in `server.py`: + +```python +AVAILABLE_MODELS = { + "gemma-4-E2B-it": { + "file": "gemma-4-E2B-it.litertlm", + "repo": "litert-community/gemma-4-E2B-it-litert-lm", + "desc": "Gemma 4 Edge 2B β€” smaller, faster", + }, + "new-model-name": { + "file": "new-model-name.litertlm", + "repo": "org/repo-name", + "desc": "Model description", + }, +} +``` + +To switch the backend to GPU (if supported by the device): + +```python +engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU) +``` + +### Run as a systemd Service (Linux) + +See detailed instructions in `SERVICE_README.md` + +```bash +# Install the service +sudo bash install_service.sh + +# Manage the service +sudo systemctl status litert-lm +sudo systemctl restart 
litert-lm +sudo journalctl -u litert-lm -f +``` + +--- + +## πŸ“ Notes + +- Each session stores the entire conversation history in RAM. It is recommended to delete sessions when no longer needed. +- The `mel_filterbank` warning during startup is normal β€” it is related to the Gemma 4 multimodal audio encoder and does not affect text generation. +- Generation speed depends on the hardware. On an Orange Pi 5 using CPU, expect around 5–15 tokens/second. +- Token/s uses `engine.tokenize()` if available, otherwise falls back to an estimate of `len(text) // 4`. +- Markdown is rendered using https://marked.js.org/ directly in the browser, not on the server. +- Only use `-it` (instruction-tuned) models for chat β€” base models are not suitable for conversations. + +--- + +## πŸ“œ License + +MIT