🤖 LiteRT-LM Web Server

Run Gemma 4 models on embedded devices (Orange Pi 5, Raspberry Pi, etc.) using LiteRT-LM with a REST API and Web UI.


📋 Requirements

  • Python 3.10+
  • LiteRT-LM installed and working
  • Python libraries:
pip install -r requirements.txt

requirements.txt:

fastapi
uvicorn
pydantic
huggingface_hub

📁 Project Structure

.
├── app.py               # Simple REST API, single-turn
├── server.py            # Full REST API + Web UI, multi-turn sessions
├── templates/
│   └── index.html       # Web UI interface (separated from server.py)
├── models/              # Directory containing .litertlm model files
│   ├── gemma-4-E2B-it.litertlm
│   └── gemma-4-E4B-it.litertlm
├── requirements.txt
└── README.md

🤖 Supported Models

| Model | Hugging Face Repo | Description |
| --- | --- | --- |
| gemma-4-E2B-it | https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm | Edge 2B, faster and lighter |
| gemma-4-E4B-it | https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm | Edge 4B, smarter and heavier |

Note: Use the -it (instruction-tuned) versions for chat and Q&A. Versions without -it are base models that only predict the next token and are not suitable for conversations.

Download Models

# Gemma 4 E2B (smaller, faster)
hf download litert-community/gemma-4-E2B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

# Gemma 4 E4B (larger, smarter)
hf download litert-community/gemma-4-E4B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

Or let the server automatically download the model when you select one that is not available locally.
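The download-if-missing behaviour can be sketched in Python with huggingface_hub, which is already listed in requirements.txt. This is an illustrative sketch, not the code in server.py: the helper name ensure_model is hypothetical, and it assumes the .litertlm file sits at the root of the Hugging Face repo under the same name used in models/.

```python
from pathlib import Path

MODELS_DIR = Path("models")  # same directory as in the project structure above


def ensure_model(repo_id: str, filename: str, models_dir: Path = MODELS_DIR) -> Path:
    """Return the local path to `filename`, downloading it only if it is missing.

    Hypothetical helper: assumes `filename` exists at the root of `repo_id`.
    """
    target = models_dir / filename
    if target.exists():
        return target  # already downloaded, no network access needed

    # Imported lazily so the existence check works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    downloaded = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        local_dir=models_dir,
    )
    return Path(downloaded)
```

For example, ensure_model("litert-community/gemma-4-E2B-it-litert-lm", "gemma-4-E2B-it.litertlm") would return models/gemma-4-E2B-it.litertlm, fetching it first if needed.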


🚀 Usage Guide

Method 1: Run with Default Options

python server.py

The server will display a model selection menu before starting:

====================================================
  LiteRT-LM Server — Select Model
====================================================
  [1] gemma-4-E2B-it
      Gemma 4 Edge 2B — smaller, faster
      ✓ available

  [2] gemma-4-E4B-it
      Gemma 4 Edge 4B — smarter, slower
      ✗ not downloaded

  [3] Use model from another path

Select model (1/2/3):

Automatic model download:

  • If the selected model is not available, the server will ask: Do you want to download the model now? (y/n)
  • Select y to automatically download it from Hugging Face
  • Or select n to download it manually later

Automatic port handling:

  • If port 8000 is already in use, the server will ask you to choose another port
  • Or press Enter to automatically find an available port (8001-8999)
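One way such a port search could work is to try binding a socket to each candidate port in turn. The sketch below uses only the standard library; the function names are hypothetical and the real server.py may implement the check differently.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently bind to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
    return True


def find_free_port(start: int = 8001, end: int = 8999, host: str = "0.0.0.0") -> int:
    """Return the first free port in [start, end], scanning upward."""
    for port in range(start, end + 1):
        if port_is_free(port, host):
            return port
    raise RuntimeError(f"no free port found in {start}-{end}")
```

The check is inherently racy: another process can take the port between the probe and the moment the server binds it, so the server should still handle a bind failure.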

Method 2: Run with Command Line Arguments

# Specify port
python server.py --port 8080

# Specify model path
python server.py --model /path/to/model.litertlm

# Combine both
python server.py --port 8080 --model ~/models/gemma-4-E2B-it.litertlm

# Show full help
python server.py --help
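For reference, the flags above map naturally onto argparse. This is a minimal sketch of an equivalent parser, assuming the short forms and defaults listed in the Configuration section; build_parser is a hypothetical name, not necessarily what server.py uses.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented --port and --model options."""
    parser = argparse.ArgumentParser(description="LiteRT-LM web server")
    parser.add_argument("--port", "-p", type=int, default=8000,
                        help="Server port (default: 8000)")
    parser.add_argument("--model", "-m", default=None,
                        help="Full path to the .litertlm model file; "
                             "if omitted, select from the menu")
    return parser
```

Calling build_parser().parse_args() in the entry point yields an object with port and model attributes; --help is provided by argparse automatically.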

Open the Web UI

http://<ip-address>:<port>

The model name and port will be displayed when the server starts:

====================================================
  🚀 Server is starting...
  📍 URL: http://localhost:8000
  📦 Model: gemma-4-E2B-it.litertlm
====================================================

📄 app.py — Simple REST API

A basic single-turn API without a model selection menu. Suitable for quick integrations or testing.

Run

python app.py

Endpoint

POST /generate

Send a prompt and receive a response. Each request is independent, with no memory between calls.

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Who are you?"}'

Response:

{
  "response": "I am Gemma 4, a Large Language Model...",
  "tokens": 42,
  "elapsed_s": 5.31,
  "tokens_per_sec": 7.91
}

🖥️ server.py — Full REST API + Web UI

The full version includes model selection at startup, multi-turn conversations, session management, and a browser-based chat interface.


🌐 Web UI

Open your browser and visit http://<ip-address>:<port> (port 8000 by default).

Features:

  • Model selection at startup via CLI menu — model name displayed directly in the header
  • User-friendly chat interface with Vietnamese language support
  • Automatically creates a session when opening the page
  • Remembers conversation context within the same session
  • New button to start a new conversation
  • Clear button to delete history and create a new session
  • Enter to send, Shift + Enter for a new line
  • Markdown rendering: responses are displayed with proper formatting (headings, lists, code blocks, tables, bold/italic, etc.)
  • Performance metrics: ⚡ X tok/s badge below each response, including token count and processing time

🔌 REST API

GET /info

Returns information about the currently running model and the number of active sessions.

curl http://localhost:8000/info

Response:

{
  "model": "gemma-4-E2B-it",
  "sessions": 2
}

POST /generate

Single-turn request without context memory. Useful for standalone Q&A.

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of Vietnam?"}'

Response:

{
  "response": "The capital of Vietnam is Hanoi.",
  "tokens": 12,
  "elapsed_s": 1.45,
  "tokens_per_sec": 8.27
}

POST /chat/new

Create a new session. Returns a session_id for subsequent requests.

curl -X POST http://localhost:8000/chat/new

Response:

{
  "session_id": "a3f2c1d4-..."
}

POST /chat/{session_id}

Send a message within a session. The model remembers the entire conversation history for that session.

curl -X POST http://localhost:8000/chat/a3f2c1d4-... \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me more about that"}'

Response:

{
  "session_id": "a3f2c1d4-...",
  "response": "...",
  "tokens": 58,
  "elapsed_s": 7.12,
  "tokens_per_sec": 8.15
}

DELETE /chat/{session_id}

Delete a session and free memory.

curl -X DELETE http://localhost:8000/chat/a3f2c1d4-...

Response:

{
  "status": "cleared",
  "session_id": "a3f2c1d4-..."
}

GET /chat/sessions/list

List all active sessions.

curl http://localhost:8000/chat/sessions/list

Response:

{
  "sessions": ["a3f2c1d4-...", "b7e9f2a1-..."],
  "count": 2
}
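The session endpoints above imply an in-memory store keyed by session ID. The following is a minimal sketch of such a store, assuming UUID session IDs and a list of role/text messages per session; the class and method names are hypothetical and the actual data structure in server.py may differ.

```python
import uuid


class SessionStore:
    """Keeps each session's conversation history in RAM."""

    def __init__(self) -> None:
        self._sessions: dict[str, list[dict[str, str]]] = {}

    def new(self) -> str:
        """Create an empty session and return its ID (POST /chat/new)."""
        session_id = str(uuid.uuid4())
        self._sessions[session_id] = []
        return session_id

    def append(self, session_id: str, role: str, text: str) -> None:
        """Record one message; raises KeyError for an unknown session."""
        self._sessions[session_id].append({"role": role, "text": text})

    def history(self, session_id: str) -> list[dict[str, str]]:
        """Return a copy of the session's messages in order."""
        return list(self._sessions[session_id])

    def delete(self, session_id: str) -> bool:
        """Drop a session (DELETE /chat/{session_id}); True if it existed."""
        return self._sessions.pop(session_id, None) is not None

    def ids(self) -> list[str]:
        """List active session IDs (GET /chat/sessions/list)."""
        return list(self._sessions)
```

Because nothing is persisted, all sessions are lost when the server restarts.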

💡 Example: Multi-turn Conversation via curl

# 1. Create a session
SESSION=$(curl -s -X POST http://localhost:8000/chat/new | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])")

# 2. Send the first message
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My name is Nam"}' | python3 -m json.tool

# 3. The model remembers context
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is my name?"}' | python3 -m json.tool

# 4. Delete the session when done
curl -X DELETE http://localhost:8000/chat/$SESSION
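The same flow can be written as a small Python client using only the standard library. This is a sketch that assumes the server is reachable at http://localhost:8000 and returns the JSON fields shown earlier; the helper names are hypothetical.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # adjust host and port to match your server


def make_request(method, path, payload=None, base_url=BASE_URL):
    """Build a request; a JSON body is attached only when payload is given."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(base_url + path, data=data,
                                  headers=headers, method=method)


def call(method, path, payload=None, base_url=BASE_URL):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(make_request(method, path, payload, base_url)) as resp:
        return json.load(resp)


def demo():
    """Create a session, send two related prompts, then delete the session."""
    session_id = call("POST", "/chat/new")["session_id"]
    try:
        for prompt in ("My name is Nam", "What is my name?"):
            reply = call("POST", f"/chat/{session_id}", {"prompt": prompt})
            print(reply["response"])
    finally:
        # Always free the session's memory, even if a request failed.
        call("DELETE", f"/chat/{session_id}")
```

Run demo() with the server started to reproduce the four curl steps.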

⚙️ Configuration

Command Line Arguments

| Argument | Description | Default |
| --- | --- | --- |
| --port, -p | Server port | 8000 |
| --model, -m | Full path to the .litertlm model file | None (select from menu) |
| --help, -h | Show help | - |

Configuration in Code

Parameters configured near the top of server.py:

| Variable | Description | Default |
| --- | --- | --- |
| MODELS_DIR | Directory containing models | ./models |
| AVAILABLE_MODELS | List of models + Hugging Face repos | see file |
| backend | Inference backend | litert_lm.Backend.CPU |
| host | Listening address | 0.0.0.0 |

To add a new model to the menu, append it to the AVAILABLE_MODELS dictionary in server.py:

AVAILABLE_MODELS = {
    "gemma-4-E2B-it": {
        "file": "gemma-4-E2B-it.litertlm",
        "repo": "litert-community/gemma-4-E2B-it-litert-lm",
        "desc": "Gemma 4 Edge 2B — smaller, faster",
    },
    "new-model-name": {
        "file": "new-model-name.litertlm",
        "repo": "org/repo-name",
        "desc": "Model description",
    },
}
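To illustrate how a dictionary of this shape could drive the startup menu, the sketch below checks whether each entry's file exists in the models directory and formats an availability line per model. The function names and exact output layout are hypothetical; only the "file" key and the two status labels come from this README.

```python
from pathlib import Path


def model_status(available_models: dict, models_dir: Path) -> list[tuple[str, bool]]:
    """Pair each model name with whether its 'file' exists in models_dir."""
    return [
        (name, (models_dir / info["file"]).is_file())
        for name, info in available_models.items()
    ]


def format_menu(available_models: dict, models_dir: Path) -> str:
    """Render one numbered line per model with its availability."""
    lines = []
    statuses = model_status(available_models, models_dir)
    for index, (name, present) in enumerate(statuses, start=1):
        mark = "✓ available" if present else "✗ not downloaded"
        lines.append(f"[{index}] {name}: {mark}")
    return "\n".join(lines)
```

Entries are listed in dictionary insertion order, so a newly added model appears at the end of the menu.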

To switch the backend to GPU (if supported by the device):

engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU)

Run as a systemd Service (Linux)

See SERVICE_README.md for detailed instructions.

# Install the service
sudo bash install_service.sh

# Manage the service
sudo systemctl status litert-lm
sudo systemctl restart litert-lm
sudo journalctl -u litert-lm -f

📝 Notes

  • Each session stores its entire conversation history in RAM. Delete sessions that are no longer needed.
  • The mel_filterbank warning during startup is normal — it is related to the Gemma 4 multimodal audio encoder and does not affect text generation.
  • Generation speed depends on the hardware. On an Orange Pi 5 using CPU, expect around 5 to 15 tokens/second.
  • The tokens-per-second figure counts tokens with engine.tokenize() if available; otherwise it falls back to an estimate of len(text) // 4.
  • Markdown is rendered using https://marked.js.org/ directly in the browser, not on the server.
  • Only use -it (instruction-tuned) models for chat — base models are not suitable for conversations.
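The token-counting fallback described in the notes above can be sketched as follows. It assumes engine.tokenize(text) returns a sequence with one item per token, which this README does not confirm; the function names are hypothetical.

```python
def count_tokens(engine, text: str) -> int:
    """Count tokens with engine.tokenize() when present, else estimate.

    Assumption: tokenize(text) returns a sequence of tokens or token IDs.
    """
    tokenize = getattr(engine, "tokenize", None)
    if callable(tokenize):
        return len(tokenize(text))
    return len(text) // 4  # rough estimate: about four characters per token


def tokens_per_sec(tokens: int, elapsed_s: float) -> float:
    """Throughput rounded to two decimals; 0.0 if no time was measured."""
    if elapsed_s <= 0:
        return 0.0
    return round(tokens / elapsed_s, 2)
```

With the figures from the app.py example response, tokens_per_sec(42, 5.31) gives 7.91.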

📜 License

Copyright (c) 2026

[Tran Thanh Tan / TTAI Solutions Software]

All rights reserved.

No part of this software or its source code may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the copyright holder.