From 6d5a3bd72c7f90405d055722246c3b41bdb64292 Mon Sep 17 00:00:00 2001
From: Tony Tran
Date: Sun, 10 May 2026 16:51:48 +0700
Subject: [PATCH] Upload files to "/"

---
 README-en.md | 412 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 412 insertions(+)
 create mode 100644 README-en.md

diff --git a/README-en.md b/README-en.md
new file mode 100644
index 0000000..aa1f32c
--- /dev/null
+++ b/README-en.md
@@ -0,0 +1,412 @@

# πŸ€– LiteRT-LM Web Server

Run **Gemma 4** models on embedded devices (Orange Pi 5, Raspberry Pi, etc.) using [LiteRT-LM](https://github.com/google-ai-edge/litert-lm), with a REST API and Web UI.

---

## πŸ“‹ Requirements

- Python 3.10+
- [litert-lm](https://github.com/google-ai-edge/litert-lm) installed and working
- Python libraries:

```bash
pip install -r requirements.txt
```

`requirements.txt`:
```txt
fastapi
uvicorn
pydantic
huggingface_hub
```

---

## πŸ“ Project Structure

```text
.
β”œβ”€β”€ app.py               # Simple REST API, single-turn
β”œβ”€β”€ server.py            # Full REST API + Web UI, multi-turn sessions
β”œβ”€β”€ templates/
β”‚   └── index.html       # Web UI interface (separated from server.py)
β”œβ”€β”€ models/              # Directory containing .litertlm model files
β”‚   β”œβ”€β”€ gemma-4-E2B-it.litertlm
β”‚   └── gemma-4-E4B-it.litertlm
β”œβ”€β”€ requirements.txt
└── README.md
```

---

## πŸ€– Supported Models

| Model | Hugging Face Repo | Description |
|-------|-------------------|-------------|
| `gemma-4-E2B-it` | https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm | Edge 2B β€” faster and lighter |
| `gemma-4-E4B-it` | https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm | Edge 4B β€” smarter and heavier |

> **Note:** Use the `-it` (instruction-tuned) versions for chat/Q&A. Versions without `-it` are base models that only predict the next token and are not suitable for conversation.
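If you prefer scripting the download over the CLI, the repos in the table above can also be fetched with `huggingface_hub` (already in `requirements.txt`). The sketch below is illustrative: the helper names `model_path` and `download_model` are not part of `server.py`.

```python
from pathlib import Path

# Repo IDs from the table above; file names follow the models/ layout.
MODELS = {
    "gemma-4-E2B-it": "litert-community/gemma-4-E2B-it-litert-lm",
    "gemma-4-E4B-it": "litert-community/gemma-4-E4B-it-litert-lm",
}

def model_path(name: str, models_dir: str = "models") -> Path:
    """Local path where the .litertlm file is expected after download."""
    return Path(models_dir) / f"{name}.litertlm"

def download_model(name: str, models_dir: str = "models") -> Path:
    """Fetch only the *.litertlm file for `name` into models_dir/."""
    # Imported lazily so the path helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    snapshot_download(
        repo_id=MODELS[name],
        allow_patterns="*.litertlm",   # skip non-model files in the repo
        local_dir=models_dir,
    )
    return model_path(name, models_dir)
```

Passing `allow_patterns` to `snapshot_download` mirrors the `--include '*.litertlm'` flag of the CLI commands below, so only the model weights are pulled.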

### Download Models

```bash
# Gemma 4 E2B (smaller, faster)
hf download litert-community/gemma-4-E2B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

# Gemma 4 E4B (larger, smarter)
hf download litert-community/gemma-4-E4B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/
```

> **Or** let the server download the model automatically when you select one that is not available locally.

---

## πŸš€ Usage Guide

### Method 1: Run with Default Options

```bash
python server.py
```

The server displays a **model selection menu** before starting:

```text
====================================================
  LiteRT-LM Server β€” Select Model
====================================================
  [1] gemma-4-E2B-it
      Gemma 4 Edge 2B β€” smaller, faster
      βœ“ available

  [2] gemma-4-E4B-it
      Gemma 4 Edge 4B β€” smarter, slower
      βœ— not downloaded

  [3] Use model from another path

Select model (1/2/3):
```

**Automatic model download:**
- If the selected model is not available, the server asks: `Do you want to download the model now? (y/n)`
- Select `y` to download it automatically from Hugging Face
- Or select `n` to download it manually later

**Automatic port handling:**
- If port 8000 is already in use, the server asks you to choose another port
- Or press Enter to automatically find an available port (8001–8999)

### Method 2: Run with Command Line Arguments

```bash
# Specify port
python server.py --port 8080

# Specify model path
python server.py --model /path/to/model.litertlm

# Combine both
python server.py --port 8080 --model ~/models/gemma-4-E2B-it.litertlm

# Show full help
python server.py --help
```

### Open the Web UI

```text
http://<server-ip>:<port>
```

The model name and port are displayed when the server starts:

```text
====================================================
 πŸš€ Server is starting...
+ πŸ“ URL: http://localhost:8000 + πŸ“¦ Model: gemma-4-E2B-it.litertlm +==================================================== +``` + +--- + +## πŸ“„ `app.py` β€” Simple REST API + +A basic single-turn API without a model selection menu. Suitable for quick integrations or testing. + +### Run + +```bash +python app.py +``` + +### Endpoint + +#### `POST /generate` + +Send a prompt and receive a response. Each request is independent, with **no memory** between calls. + +```bash +curl -X POST http://localhost:8000/generate \ + -H "Content-Type: application/json" \ + -d '{"prompt": "Who are you?"}' +``` + +**Response:** + +```json +{ + "response": "I am Gemma 4, a Large Language Model...", + "tokens": 42, + "elapsed_s": 5.31, + "tokens_per_sec": 7.91 +} +``` + +--- + +## πŸ–₯️ `server.py` β€” Full REST API + Web UI + +The full version includes model selection at startup, multi-turn conversations, session management, and a browser-based chat interface. + +--- + +### 🌐 Web UI + +Open your browser and visit `http://:8000` + +Features: +- **Model selection at startup** via CLI menu β€” model name displayed directly in the header +- User-friendly chat interface with Vietnamese language support +- Automatically creates a session when opening the page +- Remembers conversation context within the same session +- **New** button to start a new conversation +- **Clear** button to delete history and create a new session +- `Enter` to send, `Shift + Enter` for a new line +- **Markdown rendering**: responses are displayed with proper formatting (headings, lists, code blocks, tables, bold/italic, etc.) +- **Performance metrics**: `⚑ X tok/s` badge below each response, including token count and processing time + +--- + +### πŸ”Œ REST API + +#### `GET /info` +Returns information about the currently running model and the number of active sessions. 
+ +```bash +curl http://localhost:8000/info +``` + +**Response:** + +```json +{ + "model": "gemma-4-E2B-it", + "sessions": 2 +} +``` + +--- + +#### `POST /generate` +Single-turn request without context memory. Useful for standalone Q&A. + +```bash +curl -X POST http://localhost:8000/generate \ + -H "Content-Type: application/json" \ + -d '{"prompt": "What is the capital of Vietnam?"}' +``` + +**Response:** + +```json +{ + "response": "The capital of Vietnam is Hanoi.", + "tokens": 12, + "elapsed_s": 1.45, + "tokens_per_sec": 8.27 +} +``` + +--- + +#### `POST /chat/new` +Create a new session. Returns a `session_id` for subsequent requests. + +```bash +curl -X POST http://localhost:8000/chat/new +``` + +**Response:** + +```json +{ + "session_id": "a3f2c1d4-..." +} +``` + +--- + +#### `POST /chat/{session_id}` +Send a message within a session. The model **remembers the entire conversation history** for that session. + +```bash +curl -X POST http://localhost:8000/chat/a3f2c1d4-... \ + -H "Content-Type: application/json" \ + -d '{"prompt": "Tell me more about that"}' +``` + +**Response:** + +```json +{ + "session_id": "a3f2c1d4-...", + "response": "...", + "tokens": 58, + "elapsed_s": 7.12, + "tokens_per_sec": 8.15 +} +``` + +--- + +#### `DELETE /chat/{session_id}` +Delete a session and free memory. + +```bash +curl -X DELETE http://localhost:8000/chat/a3f2c1d4-... +``` + +**Response:** + +```json +{ + "status": "cleared", + "session_id": "a3f2c1d4-..." +} +``` + +--- + +#### `GET /chat/sessions/list` +List all active sessions. + +```bash +curl http://localhost:8000/chat/sessions/list +``` + +**Response:** + +```json +{ + "sessions": ["a3f2c1d4-...", "b7e9f2a1-..."], + "count": 2 +} +``` + +--- + +## πŸ’‘ Example: Multi-turn Conversation via curl + +```bash +# 1. Create a session +SESSION=$(curl -s -X POST http://localhost:8000/chat/new | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])") + +# 2. 
Send the first message +curl -s -X POST http://localhost:8000/chat/$SESSION \ + -H "Content-Type: application/json" \ + -d '{"prompt": "My name is Nam"}' | python3 -m json.tool + +# 3. The model remembers context +curl -s -X POST http://localhost:8000/chat/$SESSION \ + -H "Content-Type: application/json" \ + -d '{"prompt": "What is my name?"}' | python3 -m json.tool + +# 4. Delete the session when done +curl -X DELETE http://localhost:8000/chat/$SESSION +``` + +--- + +## βš™οΈ Configuration + +### Command Line Arguments + +| Argument | Description | Default | +|-----------|-------------|----------| +| `--port`, `-p` | Server port | `8000` | +| `--model`, `-m` | Full path to the `.litertlm` model file | None (select from menu) | +| `--help`, `-h` | Show help | - | + +### Configuration in Code + +Parameters configured near the top of `server.py`: + +| Variable | Description | Default | +|-----------|-------------|----------| +| `MODELS_DIR` | Directory containing models | `./models` | +| `AVAILABLE_MODELS` | List of models + Hugging Face repos | see file | +| `backend` | Inference backend | `litert_lm.Backend.CPU` | +| `host` | Listening address | `0.0.0.0` | + +To add a new model to the menu, append it to the `AVAILABLE_MODELS` dictionary in `server.py`: + +```python +AVAILABLE_MODELS = { + "gemma-4-E2B-it": { + "file": "gemma-4-E2B-it.litertlm", + "repo": "litert-community/gemma-4-E2B-it-litert-lm", + "desc": "Gemma 4 Edge 2B β€” smaller, faster", + }, + "new-model-name": { + "file": "new-model-name.litertlm", + "repo": "org/repo-name", + "desc": "Model description", + }, +} +``` + +To switch the backend to GPU (if supported by the device): + +```python +engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU) +``` + +### Run as a systemd Service (Linux) + +See detailed instructions in `SERVICE_README.md` + +```bash +# Install the service +sudo bash install_service.sh + +# Manage the service +sudo systemctl status litert-lm +sudo systemctl restart 
litert-lm +sudo journalctl -u litert-lm -f +``` + +--- + +## πŸ“ Notes + +- Each session stores the entire conversation history in RAM. It is recommended to delete sessions when no longer needed. +- The `mel_filterbank` warning during startup is normal β€” it is related to the Gemma 4 multimodal audio encoder and does not affect text generation. +- Generation speed depends on the hardware. On an Orange Pi 5 using CPU, expect around 5–15 tokens/second. +- Token/s uses `engine.tokenize()` if available, otherwise falls back to an estimate of `len(text) // 4`. +- Markdown is rendered using https://marked.js.org/ directly in the browser, not on the server. +- Only use `-it` (instruction-tuned) models for chat β€” base models are not suitable for conversations. + +--- + +## πŸ“œ License + +MIT