# 🤖 LiteRT-LM Web Server

Run **Gemma 4** models on embedded devices (Orange Pi 5, Raspberry Pi, etc.) using [LiteRT-LM](https://github.com/google-ai-edge/litert-lm), with a REST API and Web UI.

---

## 📋 Requirements

- Python 3.10+
- [litert-lm](https://github.com/google-ai-edge/litert-lm) installed and working
- Python libraries:

```bash
pip install -r requirements.txt
```

`requirements.txt`:

```txt
fastapi
uvicorn
pydantic
huggingface_hub
```

---

## 📁 Project Structure

```text
.
├── app.py              # Simple REST API, single-turn
├── server.py           # Full REST API + Web UI, multi-turn sessions
├── templates/
│   └── index.html      # Web UI (separated from server.py)
├── models/             # Directory containing .litertlm model files
│   ├── gemma-4-E2B-it.litertlm
│   └── gemma-4-E4B-it.litertlm
├── requirements.txt
└── README.md
```

---

## 🤖 Supported Models

| Model | Hugging Face Repo | Description |
|-------|-------------------|-------------|
| `gemma-4-E2B-it` | [litert-community/gemma-4-E2B-it-litert-lm](https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm) | Edge 2B — faster and lighter |
| `gemma-4-E4B-it` | [litert-community/gemma-4-E4B-it-litert-lm](https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm) | Edge 4B — smarter and heavier |

> **Note:** Use the `-it` (instruction-tuned) versions for chat/Q&A. Versions without `-it` are base models that only predict the next token and are not suitable for conversation.

### Download Models

```bash
# Gemma 4 E2B (smaller, faster)
hf download litert-community/gemma-4-E2B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/

# Gemma 4 E4B (larger, smarter)
hf download litert-community/gemma-4-E4B-it-litert-lm \
  --include '*.litertlm' \
  --local-dir models/
```

> **Or** let the server download the model automatically when you select one that is not available locally.
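
The auto-download step can be sketched with `huggingface_hub` (already in `requirements.txt`). The `ensure_model` helper below is hypothetical — the actual logic in `server.py` may differ — but `snapshot_download` is the real library call:

```python
from pathlib import Path

from huggingface_hub import snapshot_download


def ensure_model(file_name: str, repo_id: str, models_dir: str = "models") -> Path:
    """Return the local model path, downloading it from Hugging Face if missing."""
    path = Path(models_dir) / file_name
    if path.exists():
        return path
    # Fetch only the .litertlm weights, straight into the models directory
    snapshot_download(repo_id, allow_patterns=["*.litertlm"], local_dir=models_dir)
    return path
```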

---

## 🚀 Usage Guide

### Method 1: Run with Default Options

```bash
python server.py
```

The server will display a **model selection menu** before starting:

```text
====================================================
   LiteRT-LM Server — Select Model
====================================================
  [1] gemma-4-E2B-it
      Gemma 4 Edge 2B — smaller, faster
      ✓ available

  [2] gemma-4-E4B-it
      Gemma 4 Edge 4B — smarter, slower
      ✗ not downloaded

  [3] Use model from another path

Select model (1/2/3):
```

**Automatic model download:**

- If the selected model is not available, the server asks: `Do you want to download the model now? (y/n)`
- Enter `y` to download it automatically from Hugging Face
- Or enter `n` to download it manually later

**Automatic port handling:**

- If port 8000 is already in use, the server asks you to choose another port
- Or press Enter to automatically find an available port (8001-8999)
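
The fallback port scan can be implemented with a plain socket probe. This is a sketch under assumptions — `server.py` may do it differently:

```python
import socket


def find_free_port(start: int = 8001, end: int = 8999) -> int:
    """Return the first port in [start, end] that can be bound on all interfaces."""
    for port in range(start, end + 1):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
            s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            try:
                s.bind(("0.0.0.0", port))
                return port  # bind succeeded, so the port is free
            except OSError:
                continue  # port busy, try the next one
    raise RuntimeError(f"No free port in {start}-{end}")
```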

### Method 2: Run with Command Line Arguments

```bash
# Specify the port
python server.py --port 8080

# Specify the model path
python server.py --model /path/to/model.litertlm

# Combine both
python server.py --port 8080 --model ~/models/gemma-4-E2B-it.litertlm

# Show full help
python server.py --help
```

### Open the Web UI

```text
http://<ip-address>:<port>
```

The model name and port are displayed when the server starts:

```text
====================================================
  🚀 Server is starting...
  📍 URL:   http://localhost:8000
  📦 Model: gemma-4-E2B-it.litertlm
====================================================
```

---

## 📄 `app.py` — Simple REST API

A basic single-turn API without a model selection menu. Suitable for quick integrations or testing.

### Run

```bash
python app.py
```

### Endpoint

#### `POST /generate`

Send a prompt and receive a response. Each request is independent, with **no memory** between calls.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Who are you?"}'
```

**Response:**

```json
{
  "response": "I am Gemma 4, a Large Language Model...",
  "tokens": 42,
  "elapsed_s": 5.31,
  "tokens_per_sec": 7.91
}
```
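
For a Python caller without extra dependencies, the same request can be built with the standard library. `build_request` is a hypothetical helper, not part of `app.py`:

```python
import json
import urllib.request


def build_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST /generate request carrying the prompt as JSON."""
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps({"prompt": prompt}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


# Usage (with app.py or server.py running):
#   with urllib.request.urlopen(build_request("Who are you?")) as resp:
#       print(json.load(resp)["response"])
```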

---

## 🖥️ `server.py` — Full REST API + Web UI

The full version adds model selection at startup, multi-turn conversations, session management, and a browser-based chat interface.

---

### 🌐 Web UI

Open your browser and visit `http://<ip-address>:8000`

Features:

- **Model selection at startup** via CLI menu — the model name is displayed in the header
- User-friendly chat interface with Vietnamese language support
- Automatically creates a session when the page is opened
- Remembers conversation context within the same session
- **New** button to start a new conversation
- **Clear** button to delete the history and create a new session
- `Enter` to send, `Shift + Enter` for a new line
- **Markdown rendering**: responses are displayed with proper formatting (headings, lists, code blocks, tables, bold/italic, etc.)
- **Performance metrics**: a `⚡ X tok/s` badge below each response, including token count and processing time

---

### 🔌 REST API

#### `GET /info`

Returns information about the currently running model and the number of active sessions.

```bash
curl http://localhost:8000/info
```

**Response:**

```json
{
  "model": "gemma-4-E2B-it",
  "sessions": 2
}
```

---

#### `POST /generate`

Single-turn request without context memory. Useful for standalone Q&A.

```bash
curl -X POST http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of Vietnam?"}'
```

**Response:**

```json
{
  "response": "The capital of Vietnam is Hanoi.",
  "tokens": 12,
  "elapsed_s": 1.45,
  "tokens_per_sec": 8.27
}
```

---

#### `POST /chat/new`

Creates a new session and returns a `session_id` for subsequent requests.

```bash
curl -X POST http://localhost:8000/chat/new
```

**Response:**

```json
{
  "session_id": "a3f2c1d4-..."
}
```

---

#### `POST /chat/{session_id}`

Send a message within a session. The model **remembers the entire conversation history** for that session.

```bash
curl -X POST http://localhost:8000/chat/a3f2c1d4-... \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Tell me more about that"}'
```

**Response:**

```json
{
  "session_id": "a3f2c1d4-...",
  "response": "...",
  "tokens": 58,
  "elapsed_s": 7.12,
  "tokens_per_sec": 8.15
}
```

---

#### `DELETE /chat/{session_id}`

Deletes a session and frees its memory.

```bash
curl -X DELETE http://localhost:8000/chat/a3f2c1d4-...
```

**Response:**

```json
{
  "status": "cleared",
  "session_id": "a3f2c1d4-..."
}
```

---

#### `GET /chat/sessions/list`

Lists all active sessions.

```bash
curl http://localhost:8000/chat/sessions/list
```

**Response:**

```json
{
  "sessions": ["a3f2c1d4-...", "b7e9f2a1-..."],
  "count": 2
}
```

---

## 💡 Example: Multi-turn Conversation via curl

```bash
# 1. Create a session
SESSION=$(curl -s -X POST http://localhost:8000/chat/new | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])")

# 2. Send the first message
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "My name is Nam"}' | python3 -m json.tool

# 3. The model remembers context
curl -s -X POST http://localhost:8000/chat/$SESSION \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is my name?"}' | python3 -m json.tool

# 4. Delete the session when done
curl -X DELETE http://localhost:8000/chat/$SESSION
```

---

## ⚙️ Configuration

### Command Line Arguments

| Argument | Description | Default |
|----------|-------------|---------|
| `--port`, `-p` | Server port | `8000` |
| `--model`, `-m` | Full path to the `.litertlm` model file | None (select from menu) |
| `--help`, `-h` | Show help | - |
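
The table above maps onto a small `argparse` setup. This is a sketch — the actual parser in `server.py` may differ:

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="LiteRT-LM web server")
    parser.add_argument("--port", "-p", type=int, default=8000, help="Server port")
    parser.add_argument("--model", "-m", default=None,
                        help="Full path to the .litertlm model file (default: select from menu)")
    return parser.parse_args(argv)
```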

### Configuration in Code

Parameters configured near the top of `server.py`:

| Variable | Description | Default |
|----------|-------------|---------|
| `MODELS_DIR` | Directory containing models | `./models` |
| `AVAILABLE_MODELS` | List of models + Hugging Face repos | see file |
| `backend` | Inference backend | `litert_lm.Backend.CPU` |
| `host` | Listening address | `0.0.0.0` |

To add a new model to the menu, append it to the `AVAILABLE_MODELS` dictionary in `server.py`:

```python
AVAILABLE_MODELS = {
    "gemma-4-E2B-it": {
        "file": "gemma-4-E2B-it.litertlm",
        "repo": "litert-community/gemma-4-E2B-it-litert-lm",
        "desc": "Gemma 4 Edge 2B — smaller, faster",
    },
    "new-model-name": {
        "file": "new-model-name.litertlm",
        "repo": "org/repo-name",
        "desc": "Model description",
    },
}
```

To switch the backend to GPU (if supported by the device):

```python
engine = litert_lm.Engine(str(MODEL_PATH), backend=litert_lm.Backend.GPU)
```

### Run as a systemd Service (Linux)

See detailed instructions in `SERVICE_README.md`.

```bash
# Install the service
sudo bash install_service.sh

# Manage the service
sudo systemctl status litert-lm
sudo systemctl restart litert-lm
sudo journalctl -u litert-lm -f
```
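
If `SERVICE_README.md` is not at hand, a minimal unit file might look like the following. Every path, the user name, and the `WorkingDirectory` below are assumptions to adapt to your setup:

```ini
# /etc/systemd/system/litert-lm.service (hypothetical paths)
[Unit]
Description=LiteRT-LM Web Server
After=network.target

[Service]
User=pi
WorkingDirectory=/home/pi/litert-lm-server
ExecStart=/usr/bin/python3 server.py --port 8000 --model models/gemma-4-E2B-it.litertlm
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

After creating the file, run `sudo systemctl daemon-reload && sudo systemctl enable --now litert-lm`.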

---

## 📝 Notes

- Each session stores the entire conversation history in RAM. Delete sessions when they are no longer needed.
- The `mel_filterbank` warning during startup is normal — it relates to the Gemma 4 multimodal audio encoder and does not affect text generation.
- Generation speed depends on the hardware. On an Orange Pi 5 using the CPU, expect around 5–15 tokens/second.
- The tokens/s metric uses `engine.tokenize()` if available, otherwise it falls back to an estimate of `len(text) // 4`.
- Markdown is rendered with [marked.js](https://marked.js.org/) directly in the browser, not on the server.
- Only use `-it` (instruction-tuned) models for chat — base models are not suitable for conversation.
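
The token-count fallback described above can be sketched as follows; the optional `tokenize` hook stands in for `engine.tokenize()` and is an assumption about the engine's interface:

```python
def estimate_tokens(text: str, tokenize=None) -> int:
    """Count tokens with the engine's tokenizer if available, else assume ~4 chars/token."""
    if tokenize is not None:
        return len(tokenize(text))
    return len(text) // 4
```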

---
## 📜 License
MIT