# 🤖 LiteRT-LM Web Server

Run **Gemma 4** models on embedded devices (Orange Pi 5, Raspberry Pi, etc.) using [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) with a REST API and Web UI.

---

## 📋 Requirements

- Python 3.10+
- [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) installed and working
- Python libraries:

```bash
pip install -r requirements.txt
```

`requirements.txt`:

```txt
fastapi
uvicorn
pydantic
huggingface_hub
```

---

## 📁 Project Structure

```text
.
├── app.py              # Simple REST API, single-turn
├── server.py           # Full REST API + Web UI, multi-turn sessions
├── templates/
│   └── index.html      # Web UI interface (separated from server.py)
├── models/             # Directory containing .litertlm model files
│   ├── gemma-4-E2B-it.litertlm
│   └── gemma-4-E4B-it.litertlm
├── requirements.txt
└── README.md
```

---

## 🤖 Supported Models

| Model | Hugging Face Repo | Description |
|-------|-------------------|-------------|
| `gemma-4-E2B-it` | https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm | Edge 2B — faster and lighter |
| `gemma-4-E4B-it` | https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm | Edge 4B — smarter and heavier |

> **Note:** Use the `-it` (instruction-tuned) versions for chat/Q&A. Versions without `-it` are base models that only predict the next token and are not suitable for conversations.

### Download Models

```bash
# Gemma 4 E2B (smaller, faster)
hf download litert-community/gemma-4-E2B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

# Gemma 4 E4B (larger, smarter)
hf download litert-community/gemma-4-E4B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/
```

> **Or** let the server download the model automatically when you select one that is not available locally.

---

## 🚀 Usage Guide

### Method 1: Run with Default Options

```bash
python server.py
```

The server will display a **model selection menu** before starting:

```text
====================================================
  LiteRT-LM Server — Select Model
====================================================

  [1] gemma-4-E2B-it
      Gemma 4 Edge 2B — smaller, faster
      ✓ available

  [2] gemma-4-E4B-it
      Gemma 4 Edge 4B — smarter, slower
      ✗ not downloaded

  [3] Use model from another path

Select model (1/2/3):
```

**Automatic model download:**

- If the selected model is not available, the server will ask: `Do you want to download the model now? (y/n)`
- Select `y` to download it automatically from Hugging Face
- Or select `n` to download it manually later

**Automatic port handling:**

- If port 8000 is already in use, the server will ask you to choose another port
- Or press Enter to automatically find an available port (8001-8999)

### Method 2: Run with Command Line Arguments

```bash
# Specify port
python server.py --port 8080

# Specify model path
python server.py --model /path/to/model.litertlm

# Combine both
python server.py --port 8080 --model ~/models/gemma-4-E2B-it.litertlm

# Show full help
python server.py --help
```

### Open the Web UI

```text
http://<device-ip>:<port>
```

The model name and port are displayed when the server starts:

```text
====================================================
  🚀 Server is starting...
  📍 URL: http://localhost:8000
  📦 Model: gemma-4-E2B-it.litertlm
====================================================
```

---

## 📄 `app.py` — Simple REST API

A basic single-turn API without a model selection menu. Suitable for quick integrations or testing.
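For orientation, here is a minimal sketch of what `app.py` boils down to. The `litert_lm.Engine` / `litert_lm.Backend` names and the `engine.tokenize()` fallback are taken from the configuration and notes sections below; the `engine.generate()` call and the `MODEL_PATH` constant are assumptions for illustration, not the exact code in this repo.

```python
# Minimal single-turn /generate endpoint sketch.
# litert_lm.Engine / Backend are taken from the configuration section below;
# engine.generate() is a hypothetical stand-in for the real inference call.
import time

import litert_lm  # assumed Python binding for LiteRT-LM
import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

MODEL_PATH = "models/gemma-4-E2B-it.litertlm"  # adjust to your model file

app = FastAPI()
engine = litert_lm.Engine(MODEL_PATH, backend=litert_lm.Backend.CPU)

class GenerateRequest(BaseModel):
    prompt: str

@app.post("/generate")
def generate(req: GenerateRequest):
    start = time.time()
    text = engine.generate(req.prompt)  # hypothetical inference call
    elapsed = time.time() - start
    # Count tokens with the engine tokenizer if available, else estimate
    # (the len(text) // 4 fallback is described in the notes below).
    tokens = (len(engine.tokenize(text)) if hasattr(engine, "tokenize")
              else len(text) // 4)
    return {
        "response": text,
        "tokens": tokens,
        "elapsed_s": round(elapsed, 2),
        "tokens_per_sec": round(tokens / elapsed, 2) if elapsed > 0 else 0.0,
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```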
### Run

```bash
python app.py
```

### Endpoint

#### `POST /generate`

Send a prompt and receive a response. Each request is independent, with **no memory** between calls.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Who are you?"}'
```

**Response:**

```json
{
  "response": "I am Gemma 4, a Large Language Model...",
  "tokens": 42,
  "elapsed_s": 5.31,
  "tokens_per_sec": 7.91
}
```

---

## 🖥️ `server.py` — Full REST API + Web UI

The full version includes model selection at startup, multi-turn conversations, session management, and a browser-based chat interface.

---

### 🌐 Web UI

Open your browser and visit `http://<device-ip>:8000`

Features:

- **Model selection at startup** via CLI menu — model name displayed directly in the header
- User-friendly chat interface with Vietnamese language support
- Automatically creates a session when opening the page
- Remembers conversation context within the same session
- **New** button to start a new conversation
- **Clear** button to delete history and create a new session
- `Enter` to send, `Shift + Enter` for a new line
- **Markdown rendering**: responses are displayed with proper formatting (headings, lists, code blocks, tables, bold/italic, etc.)
- **Performance metrics**: `⚡ X tok/s` badge below each response, including token count and processing time

---

### 🔌 REST API

#### `GET /info`

Returns information about the currently running model and the number of active sessions.

```bash
curl http://localhost:8000/info
```

**Response:**

```json
{
  "model": "gemma-4-E2B-it",
  "sessions": 2
}
```

---

#### `POST /generate`

Single-turn request without context memory. Useful for standalone Q&A.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of Vietnam?"}'
```

**Response:**

```json
{
  "response": "The capital of Vietnam is Hanoi.",
  "tokens": 12,
  "elapsed_s": 1.45,
  "tokens_per_sec": 8.27
}
```

---

#### `POST /chat/new`

Create a new session. Returns a `session_id` for subsequent requests.

```bash
curl -X POST http://localhost:8000/chat/new
```

**Response:**

```json
{
  "session_id": "a3f2c1d4-..."
}
```

---

#### `POST /chat/{session_id}`

Send a message within a session. The model **remembers the entire conversation history** for that session.

```bash
curl -X POST http://localhost:8000/chat/a3f2c1d4-... \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me more about that"}'
```

**Response:**

```json
{
  "session_id": "a3f2c1d4-...",
  "response": "...",
  "tokens": 58,
  "elapsed_s": 7.12,
  "tokens_per_sec": 8.15
}
```

---

#### `DELETE /chat/{session_id}`

Delete a session and free memory.

```bash
curl -X DELETE http://localhost:8000/chat/a3f2c1d4-...
```

**Response:**

```json
{
  "status": "cleared",
  "session_id": "a3f2c1d4-..."
}
```

---

#### `GET /chat/sessions/list`

List all active sessions.

```bash
curl http://localhost:8000/chat/sessions/list
```

**Response:**

```json
{
  "sessions": ["a3f2c1d4-...", "b7e9f2a1-..."],
  "count": 2
}
```

---

## 💡 Example: Multi-turn Conversation via curl

```bash
# 1. Create a session
SESSION=$(curl -s -X POST http://localhost:8000/chat/new | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])")

# 2. Send the first message
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My name is Nam"}' | python3 -m json.tool

# 3. The model remembers context
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is my name?"}' | python3 -m json.tool

# 4. Delete the session when done
curl -X DELETE http://localhost:8000/chat/$SESSION
```
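The same flow works from Python. A minimal client sketch, assuming the `requests` library is installed (it is not in `requirements.txt`) and the server is running on `localhost:8000`:

```python
# Minimal multi-turn client sketch for the /chat endpoints documented above.
# Assumes `pip install requests` and a server on localhost:8000.
import requests

BASE = "http://localhost:8000"

# 1. Create a session
session_id = requests.post(f"{BASE}/chat/new").json()["session_id"]

# 2. Send messages within the session; the server keeps the history
for prompt in ["My name is Nam", "What is my name?"]:
    reply = requests.post(f"{BASE}/chat/{session_id}",
                          json={"prompt": prompt}).json()
    print(f"{prompt!r} -> {reply['response']} "
          f"({reply['tokens_per_sec']} tok/s)")

# 3. Delete the session to free memory on the server
requests.delete(f"{BASE}/chat/{session_id}")
```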
---

## ⚙️ Configuration

### Command Line Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--port`, `-p` | Server port | `8000` |
| `--model`, `-m` | Full path to the `.litertlm` model file | None (select from menu) |
| `--help`, `-h` | Show help | - |

### Configuration in Code

Parameters configured near the top of `server.py`:

| Variable | Description | Default |
|----------|-------------|---------|
| `MODELS_DIR` | Directory containing models | `./models` |
| `AVAILABLE_MODELS` | List of models + Hugging Face repos | see file |
| `backend` | Inference backend | `litert_lm.Backend.CPU` |
| `host` | Listening address | `0.0.0.0` |

To add a new model to the menu, append it to the `AVAILABLE_MODELS` dictionary in `server.py`:

```python
AVAILABLE_MODELS = {
    "gemma-4-E2B-it": {
        "file": "gemma-4-E2B-it.litertlm",
        "repo": "litert-community/gemma-4-E2B-it-litert-lm",
        "desc": "Gemma 4 Edge 2B — smaller, faster",
    },
    "new-model-name": {
        "file": "new-model-name.litertlm",
        "repo": "org/repo-name",
        "desc": "Model description",
    },
}
```

To switch the backend to GPU (if supported by the device):

```python
engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU)
```

### Run as a systemd Service (Linux)

See detailed instructions in `SERVICE_README.md`.

```bash
# Install the service
sudo bash install_service.sh

# Manage the service
sudo systemctl status litert-lm
sudo systemctl restart litert-lm
sudo journalctl -u litert-lm -f
```

---

## 📝 Notes

- Each session stores the entire conversation history in RAM. It is recommended to delete sessions when no longer needed.
- The `mel_filterbank` warning during startup is normal — it comes from the Gemma 4 multimodal audio encoder and does not affect text generation.
- Generation speed depends on the hardware. On an Orange Pi 5 using the CPU, expect around 5–15 tokens/second.
- The token/s metric uses `engine.tokenize()` if available, otherwise it falls back to an estimate of `len(text) // 4`.
- Markdown is rendered with [marked.js](https://marked.js.org/) directly in the browser, not on the server.
- Only use `-it` (instruction-tuned) models for chat — base models are not suitable for conversations.

---

## 📜 License

MIT