# 🤖 LiteRT-LM Web Server

Run **Gemma 4** models on embedded devices (Orange Pi 5, Raspberry Pi, etc.) using [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) with a REST API and Web UI.

---

## 📋 Requirements

- Python 3.10+
- [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) installed and working
- Python libraries:

```bash
pip install -r requirements.txt
```

`requirements.txt`:

```txt
fastapi
uvicorn
pydantic
huggingface_hub
```

---

## 📁 Project Structure

```text
.
├── app.py                  # Simple REST API, single-turn
├── server.py               # Full REST API + Web UI, multi-turn sessions
├── templates/
│   └── index.html          # Web UI interface (separated from server.py)
├── models/                 # Directory containing .litertlm model files
│   ├── gemma-4-E2B-it.litertlm
│   └── gemma-4-E4B-it.litertlm
├── requirements.txt
└── README.md
```

---

## 🤖 Supported Models

| Model | Hugging Face Repo | Description |
|-------|-------------------|-------------|
| `gemma-4-E2B-it` | https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm | Edge 2B: faster and lighter |
| `gemma-4-E4B-it` | https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm | Edge 4B: smarter and heavier |

> **Note:** Use the `-it` (instruction-tuned) versions for chat/Q&A. Versions without `-it` are base models that only predict the next token and are not suitable for conversations.

### Download Models

```bash
# Gemma 4 E2B (smaller, ~faster)
hf download litert-community/gemma-4-E2B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

# Gemma 4 E4B (larger, ~smarter)
hf download litert-community/gemma-4-E4B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/
```

> **Or** let the server automatically download the model when you select one that is not available locally.
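
The availability check and automatic download described above could be sketched roughly as follows. This is a minimal sketch, not the code in `server.py`: the `MODEL_REPOS` mapping and the helpers `model_path`, `is_available`, and `ensure_model` are hypothetical names, and the sketch assumes each repo ships a file named `<model-name>.litertlm`, as the project structure suggests. It uses `snapshot_download` from `huggingface_hub` with `allow_patterns` to mirror the `--include '*.litertlm'` filter of the CLI commands.

```python
from pathlib import Path

MODELS_DIR = Path("models")

# Hypothetical mapping: model name -> Hugging Face repo (from the table above).
MODEL_REPOS = {
    "gemma-4-E2B-it": "litert-community/gemma-4-E2B-it-litert-lm",
    "gemma-4-E4B-it": "litert-community/gemma-4-E4B-it-litert-lm",
}


def model_path(name: str, models_dir: Path = MODELS_DIR) -> Path:
    """Expected local path of a model file (assumes `<name>.litertlm`)."""
    return models_dir / f"{name}.litertlm"


def is_available(name: str, models_dir: Path = MODELS_DIR) -> bool:
    """True if the model file already exists locally."""
    return model_path(name, models_dir).is_file()


def ensure_model(name: str, models_dir: Path = MODELS_DIR) -> Path:
    """Return the local model path, downloading the file first if it is missing."""
    if name not in MODEL_REPOS:
        raise KeyError(f"Unknown model: {name}")
    path = model_path(name, models_dir)
    if not path.is_file():
        # Imported lazily so the availability check works without the package.
        from huggingface_hub import snapshot_download

        snapshot_download(
            repo_id=MODEL_REPOS[name],
            allow_patterns=["*.litertlm"],  # same filter as --include
            local_dir=str(models_dir),
        )
        if not path.is_file():
            raise FileNotFoundError(f"Download finished but {path} is missing")
    return path
```

Calling `ensure_model("gemma-4-E2B-it")` returns immediately when the file is already in `models/` and only contacts Hugging Face otherwise.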

---

## 🚀 Usage Guide

### Method 1: Run with Default Options

```bash
python server.py
```

The server will display a **model selection menu** before starting:

```text
====================================================
  LiteRT-LM Server — Select Model
====================================================
  [1] gemma-4-E2B-it
      Gemma 4 Edge 2B — smaller, faster  ✓ available
  [2] gemma-4-E4B-it
      Gemma 4 Edge 4B — smarter, slower  ✗ not downloaded
  [3] Use model from another path

Select model (1/2/3):
```

**Automatic model download:**

- If the selected model is not available, the server will ask: `Do you want to download the model now? (y/n)`
- Select `y` to automatically download it from Hugging Face
- Or select `n` to download it manually later

**Automatic port handling:**

- If port 8000 is already in use, the server will ask you to choose another port
- Or press Enter to automatically find an available port (8001-8999)

### Method 2: Run with Command Line Arguments

```bash
# Specify port
python server.py --port 8080

# Specify model path
python server.py --model /path/to/model.litertlm

# Combine both
python server.py --port 8080 --model ~/models/gemma-4-E2B-it.litertlm

# Show full help
python server.py --help
```

### Open the Web UI

```text
http://<host>:<port>
```

The model name and port will be displayed when the server starts:

```text
====================================================
  🚀 Server is starting...
  📍 URL: http://localhost:8000
  📦 Model: gemma-4-E2B-it.litertlm
====================================================
```

---

## 📄 `app.py` — Simple REST API

A basic single-turn API without a model selection menu. Suitable for quick integrations or testing.

### Run

```bash
python app.py
```

### Endpoint

#### `POST /generate`

Send a prompt and receive a response. Each request is independent, with **no memory** between calls.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Who are you?"}'
```

**Response:**

```json
{
  "response": "I am Gemma 4, a Large Language Model...",
  "tokens": 42,
  "elapsed_s": 5.31,
  "tokens_per_sec": 7.91
}
```

---

## 🖥️ `server.py` — Full REST API + Web UI

The full version includes model selection at startup, multi-turn conversations, session management, and a browser-based chat interface.

---

### 🌐 Web UI

Open your browser and visit `http://<host>:8000`

Features:

- **Model selection at startup** via CLI menu; the model name is displayed directly in the header
- User-friendly chat interface with Vietnamese language support
- Automatically creates a session when opening the page
- Remembers conversation context within the same session
- **New** button to start a new conversation
- **Clear** button to delete history and create a new session
- `Enter` to send, `Shift + Enter` for a new line
- **Markdown rendering**: responses are displayed with proper formatting (headings, lists, code blocks, tables, bold/italic, etc.)
- **Performance metrics**: `⚡ X tok/s` badge below each response, including token count and processing time

---

### 🔌 REST API

#### `GET /info`

Returns information about the currently running model and the number of active sessions.

```bash
curl http://localhost:8000/info
```

**Response:**

```json
{
  "model": "gemma-4-E2B-it",
  "sessions": 2
}
```

---

#### `POST /generate`

Single-turn request without context memory. Useful for standalone Q&A.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of Vietnam?"}'
```

**Response:**

```json
{
  "response": "The capital of Vietnam is Hanoi.",
  "tokens": 12,
  "elapsed_s": 1.45,
  "tokens_per_sec": 8.27
}
```

---

#### `POST /chat/new`

Create a new session. Returns a `session_id` for subsequent requests.

```bash
curl -X POST http://localhost:8000/chat/new
```

**Response:**

```json
{
  "session_id": "a3f2c1d4-..."
}
```

---

#### `POST /chat/{session_id}`

Send a message within a session. The model **remembers the entire conversation history** for that session.

```bash
curl -X POST http://localhost:8000/chat/a3f2c1d4-... \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me more about that"}'
```

**Response:**

```json
{
  "session_id": "a3f2c1d4-...",
  "response": "...",
  "tokens": 58,
  "elapsed_s": 7.12,
  "tokens_per_sec": 8.15
}
```

---

#### `DELETE /chat/{session_id}`

Delete a session and free memory.

```bash
curl -X DELETE http://localhost:8000/chat/a3f2c1d4-...
```

**Response:**

```json
{
  "status": "cleared",
  "session_id": "a3f2c1d4-..."
}
```

---

#### `GET /chat/sessions/list`

List all active sessions.

```bash
curl http://localhost:8000/chat/sessions/list
```

**Response:**

```json
{
  "sessions": ["a3f2c1d4-...", "b7e9f2a1-..."],
  "count": 2
}
```

---

## 💡 Example: Multi-turn Conversation via curl

```bash
# 1. Create a session
SESSION=$(curl -s -X POST http://localhost:8000/chat/new \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])")

# 2. Send the first message
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My name is Nam"}' | python3 -m json.tool

# 3. The model remembers context
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is my name?"}' | python3 -m json.tool

# 4. Delete the session when done
curl -X DELETE http://localhost:8000/chat/$SESSION
```

---

## ⚙️ Configuration

### Command Line Arguments

| Argument | Description | Default |
|-----------|-------------|----------|
| `--port`, `-p` | Server port | `8000` |
| `--model`, `-m` | Full path to the `.litertlm` model file | None (select from menu) |
| `--help`, `-h` | Show help | - |

### Configuration in Code

Parameters configured near the top of `server.py`:

| Variable | Description | Default |
|-----------|-------------|----------|
| `MODELS_DIR` | Directory containing models | `./models` |
| `AVAILABLE_MODELS` | List of models + Hugging Face repos | see file |
| `backend` | Inference backend | `litert_lm.Backend.CPU` |
| `host` | Listening address | `0.0.0.0` |

To add a new model to the menu, append it to the `AVAILABLE_MODELS` dictionary in `server.py`:

```python
AVAILABLE_MODELS = {
    "gemma-4-E2B-it": {
        "file": "gemma-4-E2B-it.litertlm",
        "repo": "litert-community/gemma-4-E2B-it-litert-lm",
        "desc": "Gemma 4 Edge 2B — smaller, faster",
    },
    "new-model-name": {
        "file": "new-model-name.litertlm",
        "repo": "org/repo-name",
        "desc": "Model description",
    },
}
```

To switch the backend to GPU (if supported by the device):

```python
engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU)
```

### Run as a systemd Service (Linux)

See detailed instructions in `SERVICE_README.md`

```bash
# Install the service
sudo bash install_service.sh

# Manage the service
sudo systemctl status litert-lm
sudo systemctl restart litert-lm
sudo journalctl -u litert-lm -f
```

---

## 📝 Notes

- Each session stores the entire conversation history in RAM. Delete sessions when they are no longer needed.
- The `mel_filterbank` warning during startup is normal; it is related to the Gemma 4 multimodal audio encoder and does not affect text generation.
- Generation speed depends on the hardware. On an Orange Pi 5 using CPU, expect around 5–15 tokens/second.
- The tokens-per-second figure uses `engine.tokenize()` if available, otherwise it falls back to an estimate of `len(text) // 4`.
- Markdown is rendered using https://marked.js.org/ directly in the browser, not on the server.
- Only use `-it` (instruction-tuned) models for chat; base models are not suitable for conversations.

---

## 📜 License

Copyright (c) 2026 [Tran Thanh Tan / TTAI Solutions Software]

All rights reserved.

No part of this software or its source code may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the copyright holder.