admin 1a0477966b Update README-en.md

2026-05-21 13:58:25 +07:00

🤖 LiteRT-LM Web Server

Run Gemma 4 models on embedded devices (Orange Pi 5, Raspberry Pi, etc.) with LiteRT-LM, exposed through a REST API and a Web UI.

📋 Requirements

Python 3.10+
LiteRT-LM installed and working
Python libraries:

pip install -r requirements.txt

requirements.txt:

fastapi
uvicorn
pydantic
huggingface_hub

📁 Project Structure

.
├── app.py               # Simple REST API, single-turn
├── server.py            # Full REST API + Web UI, multi-turn sessions
├── templates/
│   └── index.html       # Web UI interface (separated from server.py)
├── models/              # Directory containing .litertlm model files
│   ├── gemma-4-E2B-it.litertlm
│   └── gemma-4-E4B-it.litertlm
├── requirements.txt
└── README.md

🤖 Supported Models

| Model | Hugging Face Repo | Description |
|---|---|---|
| `gemma-4-E2B-it` | https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm | Edge 2B: faster and lighter |
| `gemma-4-E4B-it` | https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm | Edge 4B: smarter and heavier |

Note: Use the -it (instruction-tuned) versions for chat and Q&A. Versions without -it are base models that only predict the next token and are not suitable for conversations.

Download Models

# Gemma 4 E2B (smaller, faster)
hf download litert-community/gemma-4-E2B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

# Gemma 4 E4B (larger, smarter)
hf download litert-community/gemma-4-E4B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

Alternatively, let the server download the model automatically when you select one that is not available locally.

🚀 Usage Guide

Method 1: Run with Default Options

python server.py

The server will display a model selection menu before starting:

====================================================
  LiteRT-LM Server — Select Model
====================================================
  [1] gemma-4-E2B-it
      Gemma 4 Edge 2B — smaller, faster
      ✓ available

  [2] gemma-4-E4B-it
      Gemma 4 Edge 4B — smarter, slower
      ✗ not downloaded

  [3] Use model from another path

Select model (1/2/3):

Automatic model download:

If the selected model is not available, the server will ask: Do you want to download the model now? (y/n)
Select y to automatically download it from Hugging Face
Or select n to download it manually later
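
The download prompt described above could be sketched roughly as follows. This is an illustration, not the code in server.py: `ensure_model`, `hf_download`, and the injectable `confirm`/`download` parameters are hypothetical names. The only external call is `huggingface_hub.snapshot_download`, used here with the same `*.litertlm` filter as the manual `hf download` commands.

```python
from pathlib import Path
from typing import Callable, Optional


def hf_download(repo: str, models_dir: Path) -> None:
    """Fetch only the .litertlm files of a Hugging Face repo into models_dir."""
    # Imported lazily so the rest of the sketch runs without the package installed.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=repo, allow_patterns=["*.litertlm"], local_dir=str(models_dir))


def ensure_model(
    filename: str,
    repo: str,
    models_dir: Path,
    confirm: Callable[[str], str] = input,
    download: Callable[[str, Path], None] = hf_download,
) -> Optional[Path]:
    """Return the local model path, downloading it first if the user agrees."""
    path = models_dir / filename
    if path.exists():
        return path  # already available, nothing to ask
    answer = confirm("Do you want to download the model now? (y/n) ")
    if answer.strip().lower() != "y":
        return None  # the user will download it manually later
    models_dir.mkdir(parents=True, exist_ok=True)
    download(repo, models_dir)
    return path if path.exists() else None
```

Passing `confirm` and `download` as parameters keeps the decision logic testable without a terminal or network access.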

Automatic port handling:

If port 8000 is already in use, the server will ask you to choose another port
Or press Enter to automatically find an available port (8001-8999)
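
One plausible way to implement the automatic port search is to try binding each port in the 8001-8999 range and take the first one that succeeds. The sketch below uses only the standard library; `port_is_free` and `find_free_port` are hypothetical helper names, and server.py may do this differently.

```python
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False  # typically "address already in use"
    return True


def find_free_port(start: int = 8001, end: int = 8999, host: str = "0.0.0.0") -> int:
    """Scan start..end (inclusive) and return the first port that can be bound."""
    for port in range(start, end + 1):
        if port_is_free(port, host):
            return port
    raise RuntimeError(f"no free port between {start} and {end}")
```

A port found this way can still be taken by another process before the server binds it, so the server should still handle a bind failure at startup.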

Method 2: Run with Command Line Arguments

# Specify port
python server.py --port 8080

# Specify model path
python server.py --model /path/to/model.litertlm

# Combine both
python server.py --port 8080 --model ~/models/gemma-4-E2B-it.litertlm

# Show full help
python server.py --help

Open the Web UI

http://<ip-address>:<port>

The model name and port will be displayed when the server starts:

====================================================
  🚀 Server is starting...
  📍 URL: http://localhost:8000
  📦 Model: gemma-4-E2B-it.litertlm
====================================================

📄 `app.py` — Simple REST API

A basic single-turn API without a model selection menu. Suitable for quick integrations or testing.

Run

python app.py

Endpoint

`POST /generate`

Send a prompt and receive a response. Each request is independent, with no memory between calls.

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Who are you?"}'

Response:

{
  "response": "I am Gemma 4, a Large Language Model...",
  "tokens": 42,
  "elapsed_s": 5.31,
  "tokens_per_sec": 7.91
}
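
The same request can be sent from Python with only the standard library. This is a minimal client sketch: the helper names `build_generate_request` and `generate` are made up for illustration, and the 120-second timeout is an assumption chosen because generation on small boards can be slow.

```python
import json
import urllib.request


def build_generate_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /generate request without sending it."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, base_url: str = "http://localhost:8000", timeout: float = 120.0) -> dict:
    """Send the prompt and return the decoded JSON response as a dict."""
    request = build_generate_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage, with the server running:
#   result = generate("Who are you?")
#   print(result["response"], result["tokens_per_sec"])
```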

🖥️ `server.py` — Full REST API + Web UI

The full version includes model selection at startup, multi-turn conversations, session management, and a browser-based chat interface.

🌐 Web UI

Open your browser and visit http://<ip-address>:<port> (port 8000 unless you chose another one).

Features:

Model selection at startup via CLI menu — model name displayed directly in the header
User-friendly chat interface with Vietnamese language support
Automatically creates a session when opening the page
Remembers conversation context within the same session
New button to start a new conversation
Clear button to delete history and create a new session
Enter to send, Shift + Enter for a new line
Markdown rendering: responses are displayed with proper formatting (headings, lists, code blocks, tables, bold/italic, etc.)
Performance metrics: ⚡ X tok/s badge below each response, including token count and processing time

🔌 REST API

`GET /info`

Returns information about the currently running model and the number of active sessions.

curl http://localhost:8000/info

Response:

{
  "model": "gemma-4-E2B-it",
  "sessions": 2
}

`POST /generate`

Single-turn request without context memory. Useful for standalone Q&A.

curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of Vietnam?"}'

Response:

{
  "response": "The capital of Vietnam is Hanoi.",
  "tokens": 12,
  "elapsed_s": 1.45,
  "tokens_per_sec": 8.27
}

`POST /chat/new`

Create a new session. Returns a session_id for subsequent requests.

curl -X POST http://localhost:8000/chat/new

Response:

{
  "session_id": "a3f2c1d4-..."
}

`POST /chat/{session_id}`

Send a message within a session. The model remembers the entire conversation history for that session.

curl -X POST http://localhost:8000/chat/a3f2c1d4-... \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me more about that"}'

Response:

{
  "session_id": "a3f2c1d4-...",
  "response": "...",
  "tokens": 58,
  "elapsed_s": 7.12,
  "tokens_per_sec": 8.15
}

`DELETE /chat/{session_id}`

Delete a session and free memory.

curl -X DELETE http://localhost:8000/chat/a3f2c1d4-...

Response:

{
  "status": "cleared",
  "session_id": "a3f2c1d4-..."
}

`GET /chat/sessions/list`

List all active sessions.

curl http://localhost:8000/chat/sessions/list

Response:

{
  "sessions": ["a3f2c1d4-...", "b7e9f2a1-..."],
  "count": 2
}
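
To make the session endpoints above easier to picture, here is a small in-memory store that could sit behind them. It is purely illustrative: `SessionStore` and its methods are hypothetical, the role/content message format is an assumption, and the real server.py may keep LiteRT-LM conversation objects rather than plain lists.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SessionStore:
    """Hypothetical in-RAM store mapping session_id to a list of chat turns."""

    sessions: Dict[str, List[dict]] = field(default_factory=dict)

    def new(self) -> str:
        """POST /chat/new: create an empty session and return its id."""
        session_id = str(uuid.uuid4())
        self.sessions[session_id] = []
        return session_id

    def append(self, session_id: str, role: str, content: str) -> None:
        """Record one turn; raises KeyError for an unknown session."""
        self.sessions[session_id].append({"role": role, "content": content})

    def history(self, session_id: str) -> List[dict]:
        """Return a copy of the turns stored for this session."""
        return list(self.sessions[session_id])

    def delete(self, session_id: str) -> bool:
        """DELETE /chat/{session_id}: drop the history, report whether it existed."""
        return self.sessions.pop(session_id, None) is not None

    def summary(self) -> dict:
        """GET /chat/sessions/list: ids of all sessions plus their count."""
        return {"sessions": list(self.sessions), "count": len(self.sessions)}
```

Because everything lives in a dictionary, memory use grows with every stored turn until the session is deleted or the server restarts.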

💡 Example: Multi-turn Conversation via curl

# 1. Create a session
SESSION=$(curl -s -X POST http://localhost:8000/chat/new | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])")

# 2. Send the first message
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My name is Nam"}' | python3 -m json.tool

# 3. The model remembers context
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is my name?"}' | python3 -m json.tool

# 4. Delete the session when done
curl -X DELETE http://localhost:8000/chat/$SESSION
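
The curl sequence above translates to Python as a small context manager that creates a session on entry and deletes it on exit. This is a sketch using only the standard library; `http_transport` and `ChatSession` are hypothetical names, and the transport is passed in as a function so the class can be exercised without a running server.

```python
import json
import urllib.request
from typing import Callable, Optional

# A transport takes (method, path, payload) and returns the decoded JSON reply.
Transport = Callable[[str, str, Optional[dict]], dict]


def http_transport(base_url: str = "http://localhost:8000", timeout: float = 120.0) -> Transport:
    """Build a function that sends one JSON request to the server."""

    def call(method: str, path: str, payload: Optional[dict] = None) -> dict:
        data = json.dumps(payload).encode("utf-8") if payload is not None else None
        headers = {"Content-Type": "application/json"} if data is not None else {}
        req = urllib.request.Request(base_url + path, data=data, headers=headers, method=method)
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))

    return call


class ChatSession:
    """Creates a session on entry and deletes it on exit."""

    def __init__(self, call: Transport):
        self.call = call
        self.session_id = ""

    def __enter__(self) -> "ChatSession":
        self.session_id = self.call("POST", "/chat/new", None)["session_id"]
        return self

    def send(self, prompt: str) -> str:
        """Send one message in this session and return the model's reply text."""
        reply = self.call("POST", f"/chat/{self.session_id}", {"prompt": prompt})
        return reply["response"]

    def __exit__(self, *exc_info) -> None:
        # Always free the server-side history, even if send() raised.
        self.call("DELETE", f"/chat/{self.session_id}", None)


# Usage, with the server running:
#   with ChatSession(http_transport()) as chat:
#       chat.send("My name is Nam")
#       print(chat.send("What is my name?"))
```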

⚙️ Configuration

Command Line Arguments

| Argument | Description | Default |
|---|---|---|
| `--port`, `-p` | Server port | `8000` |
| `--model`, `-m` | Full path to the `.litertlm` model file | None (select from menu) |
| `--help`, `-h` | Show help | - |
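
For reference, the arguments listed above map naturally onto `argparse`. The sketch below is an assumption about how such a parser could look, not a copy of server.py; `build_parser` is a hypothetical name, and `--help`/`-h` is provided by argparse automatically.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Parser with the documented options; -h/--help is added by argparse."""
    parser = argparse.ArgumentParser(description="LiteRT-LM Web Server")
    parser.add_argument("--port", "-p", type=int, default=8000, help="Server port")
    parser.add_argument(
        "--model",
        "-m",
        type=Path,
        default=None,  # None means: show the model selection menu
        help="Full path to the .litertlm model file",
    )
    return parser


# Usage:
#   args = build_parser().parse_args()
#   if args.model is None: show the menu, otherwise load args.model directly
```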

Configuration in Code

Parameters configured near the top of server.py:

| Variable | Description | Default |
|---|---|---|
| `MODELS_DIR` | Directory containing models | `./models` |
| `AVAILABLE_MODELS` | List of models + Hugging Face repos | see file |
| `backend` | Inference backend | `litert_lm.Backend.CPU` |
| `host` | Listening address | `0.0.0.0` |

To add a new model to the menu, append it to the AVAILABLE_MODELS dictionary in server.py:

AVAILABLE_MODELS = {
    "gemma-4-E2B-it": {
        "file": "gemma-4-E2B-it.litertlm",
        "repo": "litert-community/gemma-4-E2B-it-litert-lm",
        "desc": "Gemma 4 Edge 2B — smaller, faster",
    },
    "new-model-name": {
        "file": "new-model-name.litertlm",
        "repo": "org/repo-name",
        "desc": "Model description",
    },
}
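
As a rough illustration of how entries in this dictionary could turn into the startup menu, the sketch below checks whether each entry's `file` exists in the models directory and formats the lines accordingly. `menu_lines` is a hypothetical helper; the actual rendering code in server.py may differ.

```python
from pathlib import Path
from typing import Dict, List


def menu_lines(available_models: Dict[str, dict], models_dir: Path) -> List[str]:
    """Format one numbered menu entry per model, marking which files exist locally."""
    lines: List[str] = []
    for index, (name, entry) in enumerate(available_models.items(), start=1):
        present = (models_dir / entry["file"]).exists()
        lines.append(f"  [{index}] {name}")
        lines.append(f"      {entry['desc']}")
        lines.append("      ✓ available" if present else "      ✗ not downloaded")
    return lines


# Usage (assumes AVAILABLE_MODELS and MODELS_DIR as defined in server.py):
#   print("\n".join(menu_lines(AVAILABLE_MODELS, Path(MODELS_DIR))))
```

Entries appear in insertion order, so a model appended to the dictionary gets the next menu number.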

To switch the backend to GPU (if supported by the device):

engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU)

Run as a systemd Service (Linux)

See SERVICE_README.md for detailed instructions.

# Install the service
sudo bash install_service.sh

# Manage the service
sudo systemctl status litert-lm
sudo systemctl restart litert-lm
sudo journalctl -u litert-lm -f

📝 Notes

Each session stores the entire conversation history in RAM. Delete sessions you no longer need to free memory.
The mel_filterbank warning during startup is normal — it is related to the Gemma 4 multimodal audio encoder and does not affect text generation.
Generation speed depends on the hardware. On an Orange Pi 5 using CPU, expect around 5–15 tokens/second.
The tokens-per-second figure counts tokens with engine.tokenize() when it is available; otherwise it falls back to an estimate of len(text) // 4.
Markdown is rendered using https://marked.js.org/ directly in the browser, not on the server.
Only use -it (instruction-tuned) models for chat — base models are not suitable for conversations.
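
The token-counting fallback mentioned in the notes above could look roughly like this. The sketch assumes that `engine.tokenize(text)` returns a sequence with one item per token, which this README does not state; `count_tokens` and `throughput` are hypothetical helper names.

```python
from typing import Any


def count_tokens(engine: Any, text: str) -> int:
    """Count tokens with engine.tokenize() if possible, else estimate len(text) // 4."""
    tokenize = getattr(engine, "tokenize", None)
    if callable(tokenize):
        try:
            # Assumption: tokenize() returns a sequence with one item per token.
            return len(tokenize(text))
        except Exception:
            pass  # fall through to the rough estimate
    return len(text) // 4


def throughput(tokens: int, elapsed_s: float) -> float:
    """Tokens per second, rounded to two decimals; 0.0 if no time elapsed."""
    return round(tokens / elapsed_s, 2) if elapsed_s > 0 else 0.0
```

The `len(text) // 4` estimate treats four characters as one token, so reported speeds are approximate whenever the fallback is used.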

📜 License

[Tran Thanh Tan / TTAI Solutions Software]

No part of this software or its source code may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the copyright holder.
