# 🤖 LiteRT-LM Web Server
Run Gemma 4 models on embedded devices (Orange Pi 5, Raspberry Pi, etc.) using [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) with a REST API and Web UI.
## 📋 Requirements

- Python 3.10+
- [litert-lm](https://github.com/google-ai-edge/litert-lm) installed and working
- Python libraries:

```bash
pip install -r requirements.txt
```

`requirements.txt`:

```
fastapi
uvicorn
pydantic
huggingface_hub
```
## 📁 Project Structure

```
.
├── app.py              # Simple REST API, single-turn
├── server.py           # Full REST API + Web UI, multi-turn sessions
├── templates/
│   └── index.html      # Web UI interface (separated from server.py)
├── models/             # Directory containing .litertlm model files
│   ├── gemma-4-E2B-it.litertlm
│   └── gemma-4-E4B-it.litertlm
├── requirements.txt
└── README.md
```
## 🤖 Supported Models

| Model | Hugging Face Repo | Description |
|---|---|---|
| `gemma-4-E2B-it` | [litert-community/gemma-4-E2B-it-litert-lm](https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm) | Edge 2B — faster and lighter |
| `gemma-4-E4B-it` | [litert-community/gemma-4-E4B-it-litert-lm](https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm) | Edge 4B — smarter and heavier |
> **Note:** It is recommended to use the `-it` (instruction-tuned) versions for chat/Q&A. Versions without `-it` are base models that only predict the next token and are not suitable for conversations.
### Download Models

```bash
# Gemma 4 E2B (smaller, ~faster)
hf download litert-community/gemma-4-E2B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

# Gemma 4 E4B (larger, ~smarter)
hf download litert-community/gemma-4-E4B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/
```
Or let the server automatically download the model when you select one that is not available locally.
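Under the hood, that automatic download can be little more than a `snapshot_download` call from `huggingface_hub` (already in `requirements.txt`). A minimal sketch; the exact logic in `server.py` may differ:

```python
from pathlib import Path

from huggingface_hub import snapshot_download

MODELS_DIR = Path("models")

def ensure_model(repo_id: str, filename: str) -> Path:
    """Download the .litertlm file from Hugging Face only if it is missing locally."""
    target = MODELS_DIR / filename
    if not target.exists():
        snapshot_download(
            repo_id=repo_id,                # e.g. "litert-community/gemma-4-E2B-it-litert-lm"
            allow_patterns=["*.litertlm"],  # skip everything except the model file
            local_dir=MODELS_DIR,
        )
    return target
```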
## 🚀 Usage Guide

### Method 1: Run with Default Options

```bash
python server.py
```

The server will display a model selection menu before starting:

```
====================================================
  LiteRT-LM Server — Select Model
====================================================
  [1] gemma-4-E2B-it
      Gemma 4 Edge 2B — smaller, faster
      ✓ available
  [2] gemma-4-E4B-it
      Gemma 4 Edge 4B — smarter, slower
      ✗ not downloaded
  [3] Use model from another path

Select model (1/2/3):
```
**Automatic model download:**

- If the selected model is not available locally, the server will ask: `Do you want to download the model now? (y/n)`
- Select `y` to download it automatically from Hugging Face
- Or select `n` to download it manually later

**Automatic port handling** (see the sketch below):

- If port 8000 is already in use, the server will ask you to choose another port
- Or press Enter to automatically find an available port (8001-8999)
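The fallback search can be done with a plain `socket` bind test. A sketch of the idea, assuming the 8001-8999 range mentioned above; `server.py` may implement it differently:

```python
import socket

def find_free_port(start: int = 8001, end: int = 8999) -> int:
    """Return the first port in [start, end] that can be bound, i.e. is not in use."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("0.0.0.0", port))
                return port
            except OSError:
                continue  # port busy, try the next one
    raise RuntimeError(f"No free port in {start}-{end}")
```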
### Method 2: Run with Command Line Arguments

```bash
# Specify port
python server.py --port 8080

# Specify model path
python server.py --model /path/to/model.litertlm

# Combine both
python server.py --port 8080 --model ~/models/gemma-4-E2B-it.litertlm

# Show full help
python server.py --help
```
### Open the Web UI

Visit `http://<ip-address>:<port>` in your browser. The model name and port are displayed when the server starts:

```
====================================================
  🚀 Server is starting...
  📍 URL: http://localhost:8000
  📦 Model: gemma-4-E2B-it.litertlm
====================================================
```
## 📄 app.py — Simple REST API

A basic single-turn API without a model selection menu. Suitable for quick integrations or testing.

### Run

```bash
python app.py
```

### Endpoint

#### `POST /generate`

Send a prompt and receive a response. Each request is independent, with no memory between calls.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Who are you?"}'
```

Response:

```json
{
  "response": "I am Gemma 4, a Large Language Model...",
  "tokens": 42,
  "elapsed_s": 5.31,
  "tokens_per_sec": 7.91
}
```
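If you prefer Python to curl, an equivalent client needs only the standard library. This is a usage sketch, assuming the server listens on `localhost:8000`:

```python
import json
import urllib.request

def generate(prompt: str, url: str = "http://localhost:8000/generate") -> dict:
    """POST a single-turn prompt to /generate and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(generate("Who are you?")["response"])
```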
## 🖥️ server.py — Full REST API + Web UI

The full version includes model selection at startup, multi-turn conversations, session management, and a browser-based chat interface.

### 🌐 Web UI

Open your browser and visit `http://<ip-address>:8000`.

Features:

- Model selection at startup via CLI menu — model name displayed directly in the header
- User-friendly chat interface with Vietnamese language support
- Automatically creates a session when the page is opened
- Remembers conversation context within the same session
- **New** button to start a new conversation
- **Clear** button to delete history and create a new session
- `Enter` to send, `Shift + Enter` for a new line
- Markdown rendering: responses are displayed with proper formatting (headings, lists, code blocks, tables, bold/italic, etc.)
- Performance metrics: a `⚡ X tok/s` badge below each response, including token count and processing time
### 🔌 REST API

#### `GET /info`

Returns information about the currently running model and the number of active sessions.

```bash
curl http://localhost:8000/info
```

Response:

```json
{
  "model": "gemma-4-E2B-it",
  "sessions": 2
}
```
#### `POST /generate`

Single-turn request without context memory. Useful for standalone Q&A.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of Vietnam?"}'
```

Response:

```json
{
  "response": "The capital of Vietnam is Hanoi.",
  "tokens": 12,
  "elapsed_s": 1.45,
  "tokens_per_sec": 8.27
}
```
#### `POST /chat/new`

Creates a new session and returns a `session_id` for subsequent requests.

```bash
curl -X POST http://localhost:8000/chat/new
```

Response:

```json
{
  "session_id": "a3f2c1d4-..."
}
```
#### `POST /chat/{session_id}`

Send a message within a session. The model remembers the entire conversation history for that session.

```bash
curl -X POST http://localhost:8000/chat/a3f2c1d4-... \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me more about that"}'
```

Response:

```json
{
  "session_id": "a3f2c1d4-...",
  "response": "...",
  "tokens": 58,
  "elapsed_s": 7.12,
  "tokens_per_sec": 8.15
}
```
#### `DELETE /chat/{session_id}`

Deletes a session and frees its memory.

```bash
curl -X DELETE http://localhost:8000/chat/a3f2c1d4-...
```

Response:

```json
{
  "status": "cleared",
  "session_id": "a3f2c1d4-..."
}
```
#### `GET /chat/sessions/list`

Lists all active sessions.

```bash
curl http://localhost:8000/chat/sessions/list
```

Response:

```json
{
  "sessions": ["a3f2c1d4-...", "b7e9f2a1-..."],
  "count": 2
}
```
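Taken together, these four endpoints suggest a simple in-memory registry keyed by UUID. A sketch of that data structure; the names here are hypothetical and the real layout lives in `server.py`:

```python
import uuid

# Hypothetical module-level registry: session_id -> conversation history.
SESSIONS: dict[str, list[dict[str, str]]] = {}

def new_session() -> str:
    """POST /chat/new: create an empty history and hand back its id."""
    session_id = str(uuid.uuid4())
    SESSIONS[session_id] = []
    return session_id

def append_turn(session_id: str, role: str, text: str) -> None:
    """POST /chat/{session_id}: record one turn so later prompts keep context."""
    SESSIONS[session_id].append({"role": role, "content": text})

def clear_session(session_id: str) -> None:
    """DELETE /chat/{session_id}: drop the history and free its RAM."""
    SESSIONS.pop(session_id, None)
```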
## 💡 Example: Multi-turn Conversation via curl

```bash
# 1. Create a session
SESSION=$(curl -s -X POST http://localhost:8000/chat/new | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])")

# 2. Send the first message
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My name is Nam"}' | python3 -m json.tool

# 3. The model remembers context
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is my name?"}' | python3 -m json.tool

# 4. Delete the session when done
curl -X DELETE http://localhost:8000/chat/$SESSION
```
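The same walkthrough in Python, standard library only, assuming the server is on `localhost:8000`:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def call(method: str, path: str, payload: dict | None = None) -> dict:
    """Send a JSON request to the server and return the parsed JSON reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(
        BASE + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# 1. Create a session
session_id = call("POST", "/chat/new")["session_id"]

# 2. Send the first message
print(call("POST", f"/chat/{session_id}", {"prompt": "My name is Nam"})["response"])

# 3. The model remembers context
print(call("POST", f"/chat/{session_id}", {"prompt": "What is my name?"})["response"])

# 4. Delete the session when done
call("DELETE", f"/chat/{session_id}")
```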
## ⚙️ Configuration

### Command Line Arguments

| Argument | Description | Default |
|---|---|---|
| `--port`, `-p` | Server port | `8000` |
| `--model`, `-m` | Full path to the `.litertlm` model file | None (select from menu) |
| `--help`, `-h` | Show help | - |
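The table maps onto a standard `argparse` setup. A sketch of what `server.py` plausibly declares (`--help`/`-h` comes free with `argparse`):

```python
import argparse

parser = argparse.ArgumentParser(description="LiteRT-LM web server")
parser.add_argument("--port", "-p", type=int, default=8000,
                    help="Server port (default: 8000)")
parser.add_argument("--model", "-m", default=None,
                    help="Full path to the .litertlm model file (default: select from menu)")
args = parser.parse_args()
```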
### Configuration in Code

These parameters are configured near the top of `server.py`:

| Variable | Description | Default |
|---|---|---|
| `MODELS_DIR` | Directory containing models | `./models` |
| `AVAILABLE_MODELS` | List of models + Hugging Face repos | see file |
| `backend` | Inference backend | `litert_lm.Backend.CPU` |
| `host` | Listening address | `0.0.0.0` |
To add a new model to the menu, append it to the `AVAILABLE_MODELS` dictionary in `server.py`:

```python
AVAILABLE_MODELS = {
    "gemma-4-E2B-it": {
        "file": "gemma-4-E2B-it.litertlm",
        "repo": "litert-community/gemma-4-E2B-it-litert-lm",
        "desc": "Gemma 4 Edge 2B — smaller, faster",
    },
    "new-model-name": {
        "file": "new-model-name.litertlm",
        "repo": "org/repo-name",
        "desc": "Model description",
    },
}
```
To switch the backend to GPU (if supported by the device):

```python
engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU)
```
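GPU support varies by board and driver, so guarding the switch with a CPU fallback is a sensible pattern. A sketch using only the calls shown above, not necessarily what `server.py` does:

```python
# Try the GPU backend first; fall back to CPU if the device or driver rejects it.
try:
    engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU)
except Exception:
    engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.CPU)
```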
### Run as a systemd Service (Linux)

See detailed instructions in `SERVICE_README.md`.

```bash
# Install the service
sudo bash install_service.sh

# Manage the service
sudo systemctl status litert-lm
sudo systemctl restart litert-lm
sudo journalctl -u litert-lm -f
```
## 📝 Notes

- Each session stores the entire conversation history in RAM, so delete sessions when they are no longer needed.
- The `mel_filterbank` warning during startup is normal — it comes from the Gemma 4 multimodal audio encoder and does not affect text generation.
- Generation speed depends on the hardware. On an Orange Pi 5 using the CPU, expect around 5–15 tokens/second.
- The tokens/s metric uses `engine.tokenize()` if available, otherwise it falls back to an estimate of `len(text) // 4` (see the sketch after this list).
- Markdown is rendered with [marked.js](https://marked.js.org/) directly in the browser, not on the server.
- Only use `-it` (instruction-tuned) models for chat — base models are not suitable for conversations.
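In code, that metric reduces to roughly the following, assuming `engine` is the loaded `litert_lm.Engine` and `run_generation` is a hypothetical stand-in for the actual inference call:

```python
import time

start = time.perf_counter()
text = run_generation(prompt)  # hypothetical: whatever call produces the response text
elapsed_s = time.perf_counter() - start

# Exact count when the engine exposes a tokenizer, rough estimate otherwise.
if hasattr(engine, "tokenize"):
    tokens = len(engine.tokenize(text))
else:
    tokens = len(text) // 4

tokens_per_sec = round(tokens / elapsed_s, 2)
```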
## 📜 License

MIT