# 🤖 LiteRT-LM Web Server

Run **Gemma 4** models on embedded devices (Orange Pi 5, Raspberry Pi, etc.) using [LiteRT-LM](https://github.com/google-ai-edge/litert-lm), with a REST API and Web UI.

---

## 📋 Requirements

- Python 3.10+
- [litert-lm](https://github.com/google-ai-edge/litert-lm) installed and working
- Python libraries:

```bash
pip install -r requirements.txt
```

`requirements.txt`:

```txt
fastapi
uvicorn
pydantic
huggingface_hub
```

---

## 📁 Project Structure

```text
.
├── app.py              # Simple REST API, single-turn
├── server.py           # Full REST API + Web UI, multi-turn sessions
├── templates/
│   └── index.html      # Web UI (separated from server.py)
├── models/             # Directory containing .litertlm model files
│   ├── gemma-4-E2B-it.litertlm
│   └── gemma-4-E4B-it.litertlm
├── requirements.txt
└── README.md
```

---

## 🤖 Supported Models

| Model | Hugging Face Repo | Description |
|-------|-------------------|-------------|
| `gemma-4-E2B-it` | [litert-community/gemma-4-E2B-it-litert-lm](https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm) | Edge 2B — faster and lighter |
| `gemma-4-E4B-it` | [litert-community/gemma-4-E4B-it-litert-lm](https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm) | Edge 4B — smarter and heavier |

> **Note:** Use the `-it` (instruction-tuned) versions for chat/Q&A. Versions without `-it` are base models that only predict the next token and are not suitable for conversation.

### Download Models

```bash
# Gemma 4 E2B (smaller, faster)
hf download litert-community/gemma-4-E2B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

# Gemma 4 E4B (larger, smarter)
hf download litert-community/gemma-4-E4B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/
```

> **Or** let the server download the model automatically when you select one that is not available locally.
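
The auto-download step can be sketched with `huggingface_hub` (already in `requirements.txt`). The `ensure_model` helper below is hypothetical — the actual logic in `server.py` may differ — but `snapshot_download` is the real library call:

```python
from pathlib import Path

from huggingface_hub import snapshot_download


def ensure_model(file_name: str, repo_id: str, models_dir: str = "models") -> Path:
    """Return the local model path, downloading it from Hugging Face if missing."""
    path = Path(models_dir) / file_name
    if path.exists():
        return path
    # Fetch only the .litertlm weights, straight into the models directory
    snapshot_download(repo_id, allow_patterns=["*.litertlm"], local_dir=models_dir)
    return path
```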

---

## 🚀 Usage Guide

### Method 1: Run with Default Options

```bash
python server.py
```

The server will display a **model selection menu** before starting:

```text
====================================================
   LiteRT-LM Server — Select Model
====================================================
  [1] gemma-4-E2B-it
      Gemma 4 Edge 2B — smaller, faster
      ✓ available

  [2] gemma-4-E4B-it
      Gemma 4 Edge 4B — smarter, slower
      ✗ not downloaded

  [3] Use model from another path

Select model (1/2/3):
```

**Automatic model download:**

- If the selected model is not available, the server asks: `Do you want to download the model now? (y/n)`
- Enter `y` to download it automatically from Hugging Face
- Or enter `n` to download it manually later

**Automatic port handling:**

- If port 8000 is already in use, the server asks you to choose another port
- Or press Enter to automatically find an available port (8001-8999)
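
The fallback port scan can be implemented with a plain socket probe. This is a sketch under assumptions — `server.py` may do it differently:

```python
import socket


def find_free_port(start: int = 8001, end: int = 8999) -> int:
    """Return the first port in [start, end] that can be bound on all interfaces."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                s.bind(("0.0.0.0", port))
                return port  # bind succeeded, so the port is free
            except OSError:
                continue  # port busy, try the next one
    raise RuntimeError(f"No free port in {start}-{end}")
```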

### Method 2: Run with Command Line Arguments

```bash
# Specify the port
python server.py --port 8080

# Specify the model path
python server.py --model /path/to/model.litertlm

# Combine both
python server.py --port 8080 --model ~/models/gemma-4-E2B-it.litertlm

# Show full help
python server.py --help
```

### Open the Web UI

```text
http://<ip-address>:<port>
```

The model name and port are displayed when the server starts:

```text
====================================================
  🚀 Server is starting...
  📍 URL:   http://localhost:8000
  📦 Model: gemma-4-E2B-it.litertlm
====================================================
```

---

## 📄 `app.py` — Simple REST API

A basic single-turn API without a model selection menu. Suitable for quick integrations or testing.

### Run

```bash
python app.py
```

### Endpoint

#### `POST /generate`

Send a prompt and receive a response. Each request is independent, with **no memory** between calls.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Who are you?"}'
```

**Response:**

```json
{
  "response": "I am Gemma 4, a Large Language Model...",
  "tokens": 42,
  "elapsed_s": 5.31,
  "tokens_per_sec": 7.91
}
```
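
For a Python caller without extra dependencies, the same request can be built with the standard library. `build_request` is a hypothetical helper, not part of `app.py`:

```python
import json
import urllib.request


def build_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST /generate request carrying the prompt as JSON."""
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


# Usage (with app.py or server.py running):
#   with urllib.request.urlopen(build_request("Who are you?")) as resp:
#       print(json.load(resp)["response"])
```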

---

## 🖥️ `server.py` — Full REST API + Web UI

The full version adds model selection at startup, multi-turn conversations, session management, and a browser-based chat interface.

---

### 🌐 Web UI

Open your browser and visit `http://<ip-address>:8000`

Features:

- **Model selection at startup** via CLI menu — the model name is displayed in the header
- User-friendly chat interface with Vietnamese language support
- Automatically creates a session when the page is opened
- Remembers conversation context within the same session
- **New** button to start a new conversation
- **Clear** button to delete the history and create a new session
- `Enter` to send, `Shift + Enter` for a new line
- **Markdown rendering**: responses are displayed with proper formatting (headings, lists, code blocks, tables, bold/italic, etc.)
- **Performance metrics**: a `⚡ X tok/s` badge below each response, including token count and processing time

---

### 🔌 REST API

#### `GET /info`

Returns information about the currently running model and the number of active sessions.

```bash
curl http://localhost:8000/info
```

**Response:**

```json
{
  "model": "gemma-4-E2B-it",
  "sessions": 2
}
```

---

#### `POST /generate`

Single-turn request without context memory. Useful for standalone Q&A.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of Vietnam?"}'
```

**Response:**

```json
{
  "response": "The capital of Vietnam is Hanoi.",
  "tokens": 12,
  "elapsed_s": 1.45,
  "tokens_per_sec": 8.27
}
```

---

#### `POST /chat/new`

Creates a new session and returns a `session_id` for subsequent requests.

```bash
curl -X POST http://localhost:8000/chat/new
```

**Response:**

```json
{
  "session_id": "a3f2c1d4-..."
}
```

---

#### `POST /chat/{session_id}`

Send a message within a session. The model **remembers the entire conversation history** for that session.

```bash
curl -X POST http://localhost:8000/chat/a3f2c1d4-... \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me more about that"}'
```

**Response:**

```json
{
  "session_id": "a3f2c1d4-...",
  "response": "...",
  "tokens": 58,
  "elapsed_s": 7.12,
  "tokens_per_sec": 8.15
}
```

---

#### `DELETE /chat/{session_id}`

Deletes a session and frees its memory.

```bash
curl -X DELETE http://localhost:8000/chat/a3f2c1d4-...
```

**Response:**

```json
{
  "status": "cleared",
  "session_id": "a3f2c1d4-..."
}
```

---

#### `GET /chat/sessions/list`

Lists all active sessions.

```bash
curl http://localhost:8000/chat/sessions/list
```

**Response:**

```json
{
  "sessions": ["a3f2c1d4-...", "b7e9f2a1-..."],
  "count": 2
}
```

---

## 💡 Example: Multi-turn Conversation via curl

```bash
# 1. Create a session
SESSION=$(curl -s -X POST http://localhost:8000/chat/new | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])")

# 2. Send the first message
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My name is Nam"}' | python3 -m json.tool

# 3. The model remembers context
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is my name?"}' | python3 -m json.tool

# 4. Delete the session when done
curl -X DELETE http://localhost:8000/chat/$SESSION
```

---

## ⚙️ Configuration

### Command Line Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--port`, `-p` | Server port | `8000` |
| `--model`, `-m` | Full path to the `.litertlm` model file | None (select from menu) |
| `--help`, `-h` | Show help | - |
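
The table above maps onto a small `argparse` setup. This is a sketch — the actual parser in `server.py` may differ:

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="LiteRT-LM web server")
    parser.add_argument("--port", "-p", type=int, default=8000, help="Server port")
    parser.add_argument("--model", "-m", default=None,
                        help="Full path to the .litertlm model file (default: select from menu)")
    return parser.parse_args(argv)
```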

### Configuration in Code

Parameters configured near the top of `server.py`:

| Variable | Description | Default |
|----------|-------------|---------|
| `MODELS_DIR` | Directory containing models | `./models` |
| `AVAILABLE_MODELS` | List of models + Hugging Face repos | see file |
| `backend` | Inference backend | `litert_lm.Backend.CPU` |
| `host` | Listening address | `0.0.0.0` |

To add a new model to the menu, append it to the `AVAILABLE_MODELS` dictionary in `server.py`:

```python
AVAILABLE_MODELS = {
    "gemma-4-E2B-it": {
        "file": "gemma-4-E2B-it.litertlm",
        "repo": "litert-community/gemma-4-E2B-it-litert-lm",
        "desc": "Gemma 4 Edge 2B — smaller, faster",
    },
    "new-model-name": {
        "file": "new-model-name.litertlm",
        "repo": "org/repo-name",
        "desc": "Model description",
    },
}
```

To switch the backend to GPU (if supported by the device):

```python
engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU)
```

### Run as a systemd Service (Linux)

See detailed instructions in `SERVICE_README.md`.

```bash
# Install the service
sudo bash install_service.sh

# Manage the service
sudo systemctl status litert-lm
sudo systemctl restart litert-lm
sudo journalctl -u litert-lm -f
```
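
If `SERVICE_README.md` is not at hand, a minimal unit file might look like the following. Every path, the user name, and the `WorkingDirectory` below are assumptions to adapt to your setup:

```ini
# /etc/systemd/system/litert-lm.service (hypothetical paths)
[Unit]
Description=LiteRT-LM Web Server
After=network.target

[Service]
User=pi
WorkingDirectory=/home/pi/litert-lm-server
ExecStart=/usr/bin/python3 server.py --port 8000 --model models/gemma-4-E2B-it.litertlm
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After creating the file, run `sudo systemctl daemon-reload && sudo systemctl enable --now litert-lm`.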

---

## 📝 Notes

- Each session stores the entire conversation history in RAM. Delete sessions when they are no longer needed.
- The `mel_filterbank` warning during startup is normal — it relates to the Gemma 4 multimodal audio encoder and does not affect text generation.
- Generation speed depends on the hardware. On an Orange Pi 5 using the CPU, expect around 5–15 tokens/second.
- The tokens/s metric uses `engine.tokenize()` if available, otherwise it falls back to an estimate of `len(text) // 4`.
- Markdown is rendered with [marked.js](https://marked.js.org/) directly in the browser, not on the server.
- Only use `-it` (instruction-tuned) models for chat — base models are not suitable for conversation.
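
The token-count fallback described above can be sketched as follows; the optional `tokenize` hook stands in for `engine.tokenize()` and is an assumption about the engine's interface:

```python
def estimate_tokens(text: str, tokenize=None) -> int:
    """Count tokens with the engine's tokenizer if available, else assume ~4 chars/token."""
    if tokenize is not None:
        return len(tokenize(text))
    return len(text) // 4
```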

---
## 📜 License
MIT