🤖 LiteRT-LM Web Server
Run Gemma 4 models on embedded devices (Orange Pi 5, Raspberry Pi, etc.) using LiteRT-LM with a REST API and Web UI.
📋 Requirements
- Python 3.10+
- LiteRT-LM installed and working
- Python libraries:
pip install -r requirements.txt
requirements.txt:
fastapi
uvicorn
pydantic
huggingface_hub
📁 Project Structure
.
├── app.py # Simple REST API, single-turn
├── server.py # Full REST API + Web UI, multi-turn sessions
├── templates/
│ └── index.html # Web UI interface (separated from server.py)
├── models/ # Directory containing .litertlm model files
│ ├── gemma-4-E2B-it.litertlm
│ └── gemma-4-E4B-it.litertlm
├── requirements.txt
└── README.md
🤖 Supported Models
| Model | Hugging Face Repo | Description |
|---|---|---|
| gemma-4-E2B-it | https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm | Edge 2B: faster and lighter |
| gemma-4-E4B-it | https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm | Edge 4B: smarter and heavier |
Note: The -it (instruction-tuned) versions are recommended for chat/Q&A. Versions without -it are base models that only predict the next token and are not suitable for conversations.
Download Models
# Gemma 4 E2B (smaller, faster)
hf download litert-community/gemma-4-E2B-it-litert-lm \
--include '*.litertlm' \
--local-dir models/
# Gemma 4 E4B (larger, smarter)
hf download litert-community/gemma-4-E4B-it-litert-lm \
--include '*.litertlm' \
--local-dir models/
Or let the server automatically download the model when you select one that is not available locally.
🚀 Usage Guide
Method 1: Run with Default Options
python server.py
The server will display a model selection menu before starting:
====================================================
LiteRT-LM Server — Select Model
====================================================
[1] gemma-4-E2B-it
Gemma 4 Edge 2B — smaller, faster
✓ available
[2] gemma-4-E4B-it
Gemma 4 Edge 4B — smarter, slower
✗ not downloaded
[3] Use model from another path
Select model (1/2/3):
Automatic model download:
- If the selected model is not available, the server will ask: Do you want to download the model now? (y/n)
- Select y to automatically download it from Hugging Face
- Or select n to download it manually later
Automatic port handling:
- If port 8000 is already in use, the server will ask you to choose another port
- Or press Enter to automatically find an available port (8001-8999)
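The port fallback described above could work roughly as follows. This is a minimal sketch and not the code in server.py: the helper names port_is_free and find_free_port are hypothetical, and only the 8001-8999 range and the 0.0.0.0 listening address come from this README.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def find_free_port(start=8001, end=8999, host="0.0.0.0"):
    """Return the first free port in [start, end], or None if all are taken."""
    for port in range(start, end + 1):
        if port_is_free(port, host):
            return port
    return None
```

A port reported as free can still be claimed by another process before the server binds to it, so a real implementation should still handle a bind failure at startup.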
Method 2: Run with Command Line Arguments
# Specify port
python server.py --port 8080
# Specify model path
python server.py --model /path/to/model.litertlm
# Combine both
python server.py --port 8080 --model ~/models/gemma-4-E2B-it.litertlm
# Show full help
python server.py --help
Open the Web UI
http://<ip-address>:<port>
The model name and port will be displayed when the server starts:
====================================================
🚀 Server is starting...
📍 URL: http://localhost:8000
📦 Model: gemma-4-E2B-it.litertlm
====================================================
📄 app.py — Simple REST API
A basic single-turn API without a model selection menu. Suitable for quick integrations or testing.
Run
python app.py
Endpoint
POST /generate
Send a prompt and receive a response. Each request is independent, with no memory between calls.
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "Who are you?"}'
Response:
{
"response": "I am Gemma 4, a Large Language Model...",
"tokens": 42,
"elapsed_s": 5.31,
"tokens_per_sec": 7.91
}
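For callers that prefer Python over curl, the same request can be built with the standard library alone. This is a sketch that assumes the server is reachable at the base URL used in the curl example; the function names and the 120-second timeout are illustrative and not part of app.py.

```python
import json
from urllib import request


def build_generate_request(prompt, base_url="http://localhost:8000"):
    """Build the POST /generate request shown in the curl example."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return request.Request(
        base_url.rstrip("/") + "/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, base_url="http://localhost:8000", timeout=120):
    """Send one single-turn prompt and return the decoded JSON reply."""
    req = build_generate_request(prompt, base_url)
    with request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires a running server):
#   reply = generate("Who are you?")
#   print(reply["response"], reply["tokens_per_sec"])
```

A generous timeout is used here because generation on embedded hardware can take several seconds per reply.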
🖥️ server.py — Full REST API + Web UI
The full version includes model selection at startup, multi-turn conversations, session management, and a browser-based chat interface.
🌐 Web UI
Open your browser and visit http://<ip-address>:<port> (port 8000 by default).
Features:
- Model selection at startup via CLI menu — model name displayed directly in the header
- User-friendly chat interface with Vietnamese language support
- Automatically creates a session when opening the page
- Remembers conversation context within the same session
- New button to start a new conversation
- Clear button to delete history and create a new session
- Enter to send, Shift + Enter for a new line
- Markdown rendering: responses are displayed with proper formatting (headings, lists, code blocks, tables, bold/italic, etc.)
- Performance metrics: ⚡ X tok/s badge below each response, including token count and processing time
🔌 REST API
GET /info
Returns information about the currently running model and the number of active sessions.
curl http://localhost:8000/info
Response:
{
"model": "gemma-4-E2B-it",
"sessions": 2
}
POST /generate
Single-turn request without context memory. Useful for standalone Q&A.
curl -X POST http://localhost:8000/generate \
-H "Content-Type: application/json" \
-d '{"prompt": "What is the capital of Vietnam?"}'
Response:
{
"response": "The capital of Vietnam is Hanoi.",
"tokens": 12,
"elapsed_s": 1.45,
"tokens_per_sec": 8.28
}
POST /chat/new
Create a new session. Returns a session_id for subsequent requests.
curl -X POST http://localhost:8000/chat/new
Response:
{
"session_id": "a3f2c1d4-..."
}
POST /chat/{session_id}
Send a message within a session. The model remembers the entire conversation history for that session.
curl -X POST http://localhost:8000/chat/a3f2c1d4-... \
-H "Content-Type: application/json" \
-d '{"prompt": "Tell me more about that"}'
Response:
{
"session_id": "a3f2c1d4-...",
"response": "...",
"tokens": 58,
"elapsed_s": 7.12,
"tokens_per_sec": 8.15
}
DELETE /chat/{session_id}
Delete a session and free memory.
curl -X DELETE http://localhost:8000/chat/a3f2c1d4-...
Response:
{
"status": "cleared",
"session_id": "a3f2c1d4-..."
}
GET /chat/sessions/list
List all active sessions.
curl http://localhost:8000/chat/sessions/list
Response:
{
"sessions": ["a3f2c1d4-...", "b7e9f2a1-..."],
"count": 2
}
💡 Example: Multi-turn Conversation via curl
# 1. Create a session
SESSION=$(curl -s -X POST http://localhost:8000/chat/new | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])")
# 2. Send the first message
curl -s -X POST http://localhost:8000/chat/$SESSION \
-H "Content-Type: application/json" \
-d '{"prompt": "My name is Nam"}' | python3 -m json.tool
# 3. The model remembers context
curl -s -X POST http://localhost:8000/chat/$SESSION \
-H "Content-Type: application/json" \
-d '{"prompt": "What is my name?"}' | python3 -m json.tool
# 4. Delete the session when done
curl -X DELETE http://localhost:8000/chat/$SESSION
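The same create, send, delete flow can be wrapped in a small Python client. This is a sketch built only on the endpoints documented above; the ChatSession class and its opener parameter are hypothetical names introduced here, and opener exists only so the client can be exercised without a running server.

```python
import json
from urllib import request


class ChatSession:
    """Thin client for POST /chat/new, POST /chat/{id} and DELETE /chat/{id}."""

    def __init__(self, base_url="http://localhost:8000", opener=request.urlopen):
        self.base_url = base_url.rstrip("/")
        self._open = opener  # injectable so the client can be tested offline
        self.session_id = self._call("POST", "/chat/new")["session_id"]

    def _call(self, method, path, payload=None):
        """Send one request and return the decoded JSON body."""
        data = None if payload is None else json.dumps(payload).encode("utf-8")
        req = request.Request(
            self.base_url + path,
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        with self._open(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    def send(self, prompt):
        """Send a message within this session."""
        return self._call("POST", f"/chat/{self.session_id}", {"prompt": prompt})

    def close(self):
        """Delete the session on the server to free its memory."""
        return self._call("DELETE", f"/chat/{self.session_id}")


# Example (requires a running server):
#   chat = ChatSession()
#   chat.send("My name is Nam")
#   print(chat.send("What is my name?")["response"])
#   chat.close()
```

Calling close() when the conversation ends matters on small devices, since each session keeps its full history in RAM.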
⚙️ Configuration
Command Line Arguments
| Argument | Description | Default |
|---|---|---|
| --port, -p | Server port | 8000 |
| --model, -m | Full path to the .litertlm model file | None (select from menu) |
| --help, -h | Show help | - |
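Arguments like these are commonly declared with argparse. The sketch below shows one way the table could map to a parser; it is an assumption about the approach, not a copy of server.py, and build_parser is a hypothetical name. The --help/-h flag is added by argparse automatically.

```python
import argparse


def build_parser():
    """Declare the --port/-p and --model/-m options from the table above."""
    parser = argparse.ArgumentParser(description="LiteRT-LM web server")
    parser.add_argument("--port", "-p", type=int, default=8000,
                        help="Server port")
    parser.add_argument("--model", "-m", default=None,
                        help="Full path to the .litertlm model file")
    return parser


# Parsing an explicit list keeps the example independent of sys.argv.
args = build_parser().parse_args(["--port", "8080"])
print(args.port, args.model)  # 8080 None
```

A model value of None is what would signal the server to fall back to the interactive selection menu.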
Configuration in Code
Parameters configured near the top of server.py:
| Variable | Description | Default |
|---|---|---|
| MODELS_DIR | Directory containing models | ./models |
| AVAILABLE_MODELS | List of models + Hugging Face repos | see file |
| backend | Inference backend | litert_lm.Backend.CPU |
| host | Listening address | 0.0.0.0 |
To add a new model to the menu, append it to the AVAILABLE_MODELS dictionary in server.py:
AVAILABLE_MODELS = {
"gemma-4-E2B-it": {
"file": "gemma-4-E2B-it.litertlm",
"repo": "litert-community/gemma-4-E2B-it-litert-lm",
"desc": "Gemma 4 Edge 2B — smaller, faster",
},
"new-model-name": {
"file": "new-model-name.litertlm",
"repo": "org/repo-name",
"desc": "Model description",
},
}
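To illustrate how an entry in this dictionary can drive the startup menu, the sketch below checks whether each model file exists under MODELS_DIR. The menu_lines helper and its output format are hypothetical; only the dictionary keys and the MODELS_DIR default come from this README.

```python
from pathlib import Path

MODELS_DIR = Path("./models")

AVAILABLE_MODELS = {
    "gemma-4-E2B-it": {
        "file": "gemma-4-E2B-it.litertlm",
        "repo": "litert-community/gemma-4-E2B-it-litert-lm",
        "desc": "Gemma 4 Edge 2B: smaller, faster",
    },
}


def menu_lines(models, models_dir):
    """Yield one menu line per model, flagging whether its file exists locally."""
    for index, (name, info) in enumerate(models.items(), start=1):
        present = (Path(models_dir) / info["file"]).is_file()
        status = "available" if present else "not downloaded"
        yield f"[{index}] {name} ({status})"


for line in menu_lines(AVAILABLE_MODELS, MODELS_DIR):
    print(line)
```

Because the check uses the "file" field, a new entry only shows as available once a file with exactly that name is present in the models directory.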
To switch the backend to GPU (if supported by the device):
engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU)
Run as a systemd Service (Linux)
See detailed instructions in SERVICE_README.md
# Install the service
sudo bash install_service.sh
# Manage the service
sudo systemctl status litert-lm
sudo systemctl restart litert-lm
sudo journalctl -u litert-lm -f
📝 Notes
- Each session stores the entire conversation history in RAM. It is recommended to delete sessions when no longer needed.
- The mel_filterbank warning during startup is normal; it is related to the Gemma 4 multimodal audio encoder and does not affect text generation.
- Generation speed depends on the hardware. On an Orange Pi 5 using CPU, expect around 5–15 tokens/second.
- Token/s uses engine.tokenize() if available, otherwise falls back to an estimate of len(text) // 4.
- Markdown is rendered using https://marked.js.org/ directly in the browser, not on the server.
- Only use -it (instruction-tuned) models for chat; base models are not suitable for conversations.
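The token counting fallback mentioned in the notes can be sketched as below. The function names are hypothetical, FakeEngine is a stand-in used only for this example, and the code assumes engine.tokenize() returns a sized sequence, which this README does not confirm. Rounding to two decimals is inferred from the sample responses.

```python
def count_tokens(engine, text):
    """Count tokens with engine.tokenize() when present, else estimate."""
    tokenize = getattr(engine, "tokenize", None)
    if callable(tokenize):
        return len(tokenize(text))  # assumes tokenize() returns a sized sequence
    return len(text) // 4  # rough fallback: about four characters per token


def tokens_per_sec(tokens, elapsed_s):
    """Throughput rounded to two decimals; 0.0 if no time was measured."""
    return round(tokens / elapsed_s, 2) if elapsed_s > 0 else 0.0


class FakeEngine:
    """Stand-in for a LiteRT-LM engine, used only for this example."""

    def tokenize(self, text):
        return text.split()


print(count_tokens(FakeEngine(), "The capital of Vietnam is Hanoi."))  # 6
print(count_tokens(object(), "The capital of Vietnam is Hanoi."))      # 8
print(tokens_per_sec(42, 5.31))                                        # 7.91
```

The character-based estimate is only approximate, so tok/s figures reported through the fallback path should be read as rough indicators.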
📜 License
Copyright (c) 2026
[Tran Thanh Tan / TTAI Solutions Software]
All rights reserved.
No part of this software or its source code may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the copyright holder.