# 🤖 LiteRT-LM Web Server
Run Gemma 4 models on embedded devices (Orange Pi 5, Raspberry Pi, etc.) using [LiteRT-LM](https://github.com/google-ai-edge/litert-lm) with a REST API and Web UI.
## 📋 Requirements

- Python 3.10+
- [litert-lm](https://github.com/google-ai-edge/litert-lm) installed and working
- Python libraries:

```bash
pip install -r requirements.txt
```

`requirements.txt`:

```
fastapi
uvicorn
pydantic
huggingface_hub
```
## 📁 Project Structure

```
.
├── app.py              # Simple REST API, single-turn
├── server.py           # Full REST API + Web UI, multi-turn sessions
├── templates/
│   └── index.html      # Web UI interface (separated from server.py)
├── models/             # Directory containing .litertlm model files
│   ├── gemma-4-E2B-it.litertlm
│   └── gemma-4-E4B-it.litertlm
├── requirements.txt
└── README.md
```
## 🤖 Supported Models

| Model | Hugging Face Repo | Description |
|---|---|---|
| `gemma-4-E2B-it` | [litert-community/gemma-4-E2B-it-litert-lm](https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm) | Edge 2B — faster and lighter |
| `gemma-4-E4B-it` | [litert-community/gemma-4-E4B-it-litert-lm](https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm) | Edge 4B — smarter and heavier |
> **Note:** It is recommended to use the `-it` (instruction-tuned) versions for chat/Q&A. Versions without `-it` are base models that only predict the next token and are not suitable for conversations.
### Download Models

```bash
# Gemma 4 E2B (smaller, ~faster)
hf download litert-community/gemma-4-E2B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

# Gemma 4 E4B (larger, ~smarter)
hf download litert-community/gemma-4-E4B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/
```
Or let the server automatically download the model when you select one that is not available locally.
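Under the hood, that automatic download can be little more than a `snapshot_download` call from `huggingface_hub` (already in `requirements.txt`). A minimal sketch; the exact logic in `server.py` may differ:

```python
from pathlib import Path

from huggingface_hub import snapshot_download

MODELS_DIR = Path("models")

def ensure_model(repo_id: str, filename: str) -> Path:
    """Download the .litertlm file from Hugging Face only if it is missing locally."""
    target = MODELS_DIR / filename
    if not target.exists():
        snapshot_download(
            repo_id=repo_id,                # e.g. "litert-community/gemma-4-E2B-it-litert-lm"
            allow_patterns=["*.litertlm"],  # skip everything except the model file
            local_dir=MODELS_DIR,
        )
    return target
```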
## 🚀 Usage Guide

### Method 1: Run with Default Options

```bash
python server.py
```

The server will display a model selection menu before starting:

```
====================================================
  LiteRT-LM Server — Select Model
====================================================
  [1] gemma-4-E2B-it
      Gemma 4 Edge 2B — smaller, faster
      ✓ available
  [2] gemma-4-E4B-it
      Gemma 4 Edge 4B — smarter, slower
      ✗ not downloaded
  [3] Use model from another path

Select model (1/2/3):
```
**Automatic model download:**

- If the selected model is not available locally, the server will ask: `Do you want to download the model now? (y/n)`
- Select `y` to download it automatically from Hugging Face
- Or select `n` to download it manually later

**Automatic port handling** (see the sketch below):

- If port 8000 is already in use, the server will ask you to choose another port
- Or press Enter to automatically find an available port (8001-8999)
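The fallback search can be done with a plain `socket` bind test. A sketch of the idea, assuming the 8001-8999 range mentioned above; `server.py` may implement it differently:

```python
import socket

def find_free_port(start: int = 8001, end: int = 8999) -> int:
    """Return the first port in [start, end] that can be bound, i.e. is not in use."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("0.0.0.0", port))
                return port
            except OSError:
                continue  # port busy, try the next one
    raise RuntimeError(f"No free port in {start}-{end}")
```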
### Method 2: Run with Command Line Arguments

```bash
# Specify port
python server.py --port 8080

# Specify model path
python server.py --model /path/to/model.litertlm

# Combine both
python server.py --port 8080 --model ~/models/gemma-4-E2B-it.litertlm

# Show full help
python server.py --help
```
### Open the Web UI

Visit `http://<ip-address>:<port>` in your browser. The model name and port are displayed when the server starts:

```
====================================================
  🚀 Server is starting...
  📍 URL: http://localhost:8000
  📦 Model: gemma-4-E2B-it.litertlm
====================================================
```
## 📄 app.py — Simple REST API

A basic single-turn API without a model selection menu. Suitable for quick integrations or testing.

### Run

```bash
python app.py
```

### Endpoint

#### `POST /generate`

Send a prompt and receive a response. Each request is independent, with no memory between calls.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Who are you?"}'
```

Response:

```json
{
  "response": "I am Gemma 4, a Large Language Model...",
  "tokens": 42,
  "elapsed_s": 5.31,
  "tokens_per_sec": 7.91
}
```
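If you prefer Python to curl, an equivalent client needs only the standard library. This is a usage sketch, assuming the server listens on `localhost:8000`:

```python
import json
import urllib.request

def generate(prompt: str, url: str = "http://localhost:8000/generate") -> dict:
    """POST a single-turn prompt to /generate and return the parsed JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(generate("Who are you?")["response"])
```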
## 🖥️ server.py — Full REST API + Web UI

The full version includes model selection at startup, multi-turn conversations, session management, and a browser-based chat interface.

### 🌐 Web UI

Open your browser and visit `http://<ip-address>:8000`.

Features:

- Model selection at startup via CLI menu — model name displayed directly in the header
- User-friendly chat interface with Vietnamese language support
- Automatically creates a session when the page is opened
- Remembers conversation context within the same session
- **New** button to start a new conversation
- **Clear** button to delete history and create a new session
- `Enter` to send, `Shift + Enter` for a new line
- Markdown rendering: responses are displayed with proper formatting (headings, lists, code blocks, tables, bold/italic, etc.)
- Performance metrics: a `⚡ X tok/s` badge below each response, including token count and processing time
### 🔌 REST API

#### `GET /info`

Returns information about the currently running model and the number of active sessions.

```bash
curl http://localhost:8000/info
```

Response:

```json
{
  "model": "gemma-4-E2B-it",
  "sessions": 2
}
```
#### `POST /generate`

Single-turn request without context memory. Useful for standalone Q&A.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of Vietnam?"}'
```

Response:

```json
{
  "response": "The capital of Vietnam is Hanoi.",
  "tokens": 12,
  "elapsed_s": 1.45,
  "tokens_per_sec": 8.27
}
```
#### `POST /chat/new`

Creates a new session and returns a `session_id` for subsequent requests.

```bash
curl -X POST http://localhost:8000/chat/new
```

Response:

```json
{
  "session_id": "a3f2c1d4-..."
}
```
#### `POST /chat/{session_id}`

Send a message within a session. The model remembers the entire conversation history for that session.

```bash
curl -X POST http://localhost:8000/chat/a3f2c1d4-... \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me more about that"}'
```

Response:

```json
{
  "session_id": "a3f2c1d4-...",
  "response": "...",
  "tokens": 58,
  "elapsed_s": 7.12,
  "tokens_per_sec": 8.15
}
```
#### `DELETE /chat/{session_id}`

Deletes a session and frees its memory.

```bash
curl -X DELETE http://localhost:8000/chat/a3f2c1d4-...
```

Response:

```json
{
  "status": "cleared",
  "session_id": "a3f2c1d4-..."
}
```
#### `GET /chat/sessions/list`

Lists all active sessions.

```bash
curl http://localhost:8000/chat/sessions/list
```

Response:

```json
{
  "sessions": ["a3f2c1d4-...", "b7e9f2a1-..."],
  "count": 2
}
```
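Taken together, these four endpoints suggest a simple in-memory registry keyed by UUID. A sketch of that data structure; the names here are hypothetical and the real layout lives in `server.py`:

```python
import uuid

# Hypothetical module-level registry: session_id -> conversation history.
SESSIONS: dict[str, list[dict[str, str]]] = {}

def new_session() -> str:
    """POST /chat/new: create an empty history and hand back its id."""
    session_id = str(uuid.uuid4())
    SESSIONS[session_id] = []
    return session_id

def append_turn(session_id: str, role: str, text: str) -> None:
    """POST /chat/{session_id}: record one turn so later prompts keep context."""
    SESSIONS[session_id].append({"role": role, "content": text})

def clear_session(session_id: str) -> None:
    """DELETE /chat/{session_id}: drop the history and free its RAM."""
    SESSIONS.pop(session_id, None)
```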
## 💡 Example: Multi-turn Conversation via curl

```bash
# 1. Create a session
SESSION=$(curl -s -X POST http://localhost:8000/chat/new | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])")

# 2. Send the first message
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My name is Nam"}' | python3 -m json.tool

# 3. The model remembers context
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is my name?"}' | python3 -m json.tool

# 4. Delete the session when done
curl -X DELETE http://localhost:8000/chat/$SESSION
```
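The same walkthrough in Python, standard library only, assuming the server is on `localhost:8000`:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def call(method: str, path: str, payload: dict | None = None) -> dict:
    """Send a JSON request to the server and return the parsed JSON reply."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    req = urllib.request.Request(
        BASE + path,
        data=data,
        headers={"Content-Type": "application/json"},
        method=method,
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# 1. Create a session
session_id = call("POST", "/chat/new")["session_id"]

# 2. Send the first message
print(call("POST", f"/chat/{session_id}", {"prompt": "My name is Nam"})["response"])

# 3. The model remembers context
print(call("POST", f"/chat/{session_id}", {"prompt": "What is my name?"})["response"])

# 4. Delete the session when done
call("DELETE", f"/chat/{session_id}")
```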
## ⚙️ Configuration

### Command Line Arguments

| Argument | Description | Default |
|---|---|---|
| `--port`, `-p` | Server port | `8000` |
| `--model`, `-m` | Full path to the `.litertlm` model file | None (select from menu) |
| `--help`, `-h` | Show help | - |
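The table maps onto a standard `argparse` setup. A sketch of what `server.py` plausibly declares (`--help`/`-h` comes free with `argparse`):

```python
import argparse

parser = argparse.ArgumentParser(description="LiteRT-LM web server")
parser.add_argument("--port", "-p", type=int, default=8000,
                    help="Server port (default: 8000)")
parser.add_argument("--model", "-m", default=None,
                    help="Full path to the .litertlm model file (default: select from menu)")
args = parser.parse_args()
```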
### Configuration in Code

These parameters are configured near the top of `server.py`:

| Variable | Description | Default |
|---|---|---|
| `MODELS_DIR` | Directory containing models | `./models` |
| `AVAILABLE_MODELS` | List of models + Hugging Face repos | see file |
| `backend` | Inference backend | `litert_lm.Backend.CPU` |
| `host` | Listening address | `0.0.0.0` |
To add a new model to the menu, append it to the `AVAILABLE_MODELS` dictionary in `server.py`:

```python
AVAILABLE_MODELS = {
    "gemma-4-E2B-it": {
        "file": "gemma-4-E2B-it.litertlm",
        "repo": "litert-community/gemma-4-E2B-it-litert-lm",
        "desc": "Gemma 4 Edge 2B — smaller, faster",
    },
    "new-model-name": {
        "file": "new-model-name.litertlm",
        "repo": "org/repo-name",
        "desc": "Model description",
    },
}
```
To switch the backend to GPU (if supported by the device):

```python
engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU)
```
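GPU support varies by board and driver, so guarding the switch with a CPU fallback is a sensible pattern. A sketch using only the calls shown above, not necessarily what `server.py` does:

```python
# Try the GPU backend first; fall back to CPU if the device or driver rejects it.
try:
    engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU)
except Exception:
    engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.CPU)
```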
### Run as a systemd Service (Linux)

See detailed instructions in `SERVICE_README.md`.

```bash
# Install the service
sudo bash install_service.sh

# Manage the service
sudo systemctl status litert-lm
sudo systemctl restart litert-lm
sudo journalctl -u litert-lm -f
```
## 📝 Notes

- Each session stores the entire conversation history in RAM, so delete sessions when they are no longer needed.
- The `mel_filterbank` warning during startup is normal — it comes from the Gemma 4 multimodal audio encoder and does not affect text generation.
- Generation speed depends on the hardware. On an Orange Pi 5 using the CPU, expect around 5–15 tokens/second.
- The tokens/s metric uses `engine.tokenize()` if available, otherwise it falls back to an estimate of `len(text) // 4` (see the sketch after this list).
- Markdown is rendered with [marked.js](https://marked.js.org/) directly in the browser, not on the server.
- Only use `-it` (instruction-tuned) models for chat — base models are not suitable for conversations.
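In code, that metric reduces to roughly the following, assuming `engine` is the loaded `litert_lm.Engine` and `run_generation` is a hypothetical stand-in for the actual inference call:

```python
import time

start = time.perf_counter()
text = run_generation(prompt)  # hypothetical: whatever call produces the response text
elapsed_s = time.perf_counter() - start

# Exact count when the engine exposes a tokenizer, rough estimate otherwise.
if hasattr(engine, "tokenize"):
    tokens = len(engine.tokenize(text))
else:
    tokens = len(text) // 4

tokens_per_sec = round(tokens / elapsed_s, 2)
```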
## 📜 License

MIT