From 4f8b5650774ae1d1fc1dfe818a1225f697a343cf Mon Sep 17 00:00:00 2001
From: Tony Tran
Date: Sun, 19 Apr 2026 07:27:53 +0000
Subject: [PATCH] update

---
 README.md        | 291 ++++++++++-------------------------------------
 requirements.txt |   3 +-
 2 files changed, 65 insertions(+), 229 deletions(-)

diff --git a/README.md b/README.md
index 643f6fd..ab3079d 100644
--- a/README.md
+++ b/README.md
@@ -1,270 +1,105 @@
-# 🤖 LiteRT-LM Web Server
-
-Run the **Gemma 4** model on embedded devices (Orange Pi 5, Raspberry Pi, etc.) via [LiteRT-LM](https://github.com/google-ai-edge/litert-lm), with a REST API and a Web UI.
-
+---
+license: apache-2.0
+base_model:
+- google/gemma-4-E4B-it
+tags:
+  - litert-lm
---

-## 📋 Requirements
+# litert-community/gemma-4-E4B-it-litert-lm

-- Python 3.10+
-- [`litert-lm`](https://github.com/google-ai-edge/litert-lm) installed and working
+Main Model Card: [google/gemma-4-E4B-it](https://huggingface.co/google/gemma-4-E4B-it)

-```bash
-pip install litert-lm
-```
+This model card provides the Gemma 4 E4B model in a format that is ready for deployment on Android, iOS, desktop, IoT, and the web.

-- Python libraries:
+Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. This particular Gemma 4 model is small, which makes it ideal for on-device use cases. By running this model on-device, users get private access to generative AI technology without requiring an internet connection.

-```bash
-pip install -r requirements.txt
-```
+These models are provided in the `.litertlm` format for use with the LiteRT-LM framework. LiteRT-LM is a specialized orchestration layer built directly on top of LiteRT, Google's high-performance multi-platform runtime trusted by millions of Android and edge developers. LiteRT provides the foundational hardware acceleration, via XNNPACK for CPU and ML Drift for GPU. LiteRT-LM adds the specialized GenAI libraries and APIs, such as KV-cache management, prompt templating, and function calling. This integrated stack is the same technology powering the Google AI Edge Gallery showcase app.

-- Models
+The model file size is 3.65 GB, which includes a text decoder with 2.24 GB of weights and 0.67 GB of embedding parameters. The LiteRT-LM framework always keeps the main weights in memory, while the embedding parameters are memory-mapped, which enables significant working-memory savings on some platforms, as seen in the detailed data below. The vision and audio models are loaded as needed to further reduce memory consumption.

-+ https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm
+## Try Gemma 4 E4B

-+ https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm
+<table>
---- +| [](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) | [](https://apps.apple.com/us/app/google-ai-edge-gallery/id6749645337) | [](https://ai.google.dev/edge/litert-lm/cli) | [](https://ai.google.dev/edge/litert-lm/cli) | [](https://huggingface.co/spaces/tylermullen/Gemma4) | +| :---: | :---: | :---: | :---: | :---: | +| [Android](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) | [iOS](https://apps.apple.com/us/app/google-ai-edge-gallery/id6749645337) | [Desktop](https://ai.google.dev/edge/litert-lm/cli) | [IoT](https://ai.google.dev/edge/litert-lm/cli) | [Web](https://huggingface.co/spaces/tylermullen/Gemma4) | -## 📁 Cấu trúc +
-## 📁 Structure
-
-```
-.
-├── app.py       # simple REST API, single-turn
-├── server.py    # full REST API + Web UI, multi-turn sessions
-└── README.md
-```
----

+## Build with Gemma 4 E4B and LiteRT-LM

-## 🚀 Usage
+Ready to integrate this into your product? Get started [here](https://ai.google.dev/edge/litert-lm/overview).
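+
+As a minimal Python quickstart, the sketch below pulls the model file from this repo and loads it for text generation. It assumes that the `litert_lm` package pinned in requirements.txt exposes the `Engine` API used by this repo's `app.py`/`server.py`, and the `.litertlm` filename is a guess based on the repo's naming; verify both against the repo's file list and the LiteRT-LM docs linked above.
+
+```python
+from huggingface_hub import hf_hub_download  # added to requirements.txt in this patch
+
+import litert_lm  # assumed Python bindings; see the LiteRT-LM docs for the exact API
+
+# Download the .litertlm file (~3.65 GB) from Hugging Face; filename assumed.
+model_path = hf_hub_download(
+    repo_id="litert-community/gemma-4-E4B-it-litert-lm",
+    filename="gemma-4-E4B-it.litertlm",
+)
+
+# Load the model on CPU; litert_lm.Backend.GPU is available on supported devices.
+engine = litert_lm.Engine(model_path, backend=litert_lm.Backend.CPU)
+
+# Hypothetical generation call; consult the API reference for the real entry point.
+print(engine.generate("Why is the sky blue?"))
+```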

-### Step 1 — Place the model in the same directory
+## Gemma 4 E4B Performance on LiteRT-LM

-```
-gemma-4-E2B-it.litertlm   ← model file
-app.py
-server.py
-```
+All benchmarks were measured using 1024 prefill tokens and 256 decode tokens, with a context length of 2048 tokens, via LiteRT-LM. The model can support a context length of up to 32K tokens. Inference on CPU is accelerated via the LiteRT XNNPACK delegate with 4 threads. Time-to-first-token does not include load time. Benchmarks were run with caches enabled and initialized; during the first run, latency and memory usage may differ. Model size is the size of the file on disk.

-If the model is stored elsewhere, edit the `MODEL_PATH` variable at the top of each file.
+CPU memory was measured using `rusage::ru_maxrss` on Android, Linux, and Raspberry Pi; `task_vm_info::phys_footprint` on iOS and macOS; and `process_memory_counters::PrivateUsage` on Windows.

----
+**Android**

-## 📄 `app.py` — simple REST API
+*Note: On [supported Android devices](https://developers.google.com/ml-kit), Gemma 4 is available through Android AI Core as [Gemini Nano](https://developer.android.com/ai/gemini-nano#architecture), which is the recommended path for production applications.*

-A basic version, suitable for quick integration or testing.
-### Run

+| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
+| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+| S26 Ultra | CPU | 195 | 17.7 | 5.3 | 3654 | 3283 |
+| S26 Ultra | GPU | 1,293 | 22.1 | 0.8 | 3654 | 710 |

-```bash
-python app.py
-```
+**iOS**

-The server starts at `http://0.0.0.0:8000`

-### Endpoint
+| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU/GPU Memory (MB) |
+| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+| iPhone 17 Pro | CPU | 159 | 9.7 | 6.5 | 3654 | 961 |
+| iPhone 17 Pro | GPU | 1,189 | 25.1 | 0.9 | 3654 | 3380 |

-#### `POST /generate`
+**Linux**

-Send a prompt and receive a reply. Each request is independent, with **no memory** between calls.
+| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
+| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+| Arm 2.3 & 2.8GHz | CPU | 82 | 17.5 | 12.6 | 3654 | 3139 |
+| NVIDIA GeForce RTX 4090 | GPU | 7,260 | 91.2 | 0.2 | 3654 | 1119 |

-```bash
-curl -X POST http://localhost:8000/generate \
-  -H "Content-Type: application/json" \
-  -d '{"prompt": "Who are you?"}'
-```
+**macOS**

-**Response:**
+| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU/GPU Memory (MB) |
+| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+| MacBook Pro M4 Max | CPU | 277 | 27.0 | 3.7 | 3654 | 890 |
+| MacBook Pro M4 Max | GPU | 2,560 | 101.1 | 0.4 | 3654 | 3217 |

-```json
-{
-  "response": "I am Gemma 4, a Large Language Model..."
-}
-```

+**Windows**

----
+| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
+| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+| Intel LunarLake | CPU | 173 | 16.8 | 5.98 | 3654 | 9372 |
+| Intel LunarLake | GPU | 1,202 | 25.13 | 0.89 | 3654 | 7147 |

-## 🖥️ `server.py` — full REST API + Web UI
-The full version, with multi-turn conversation support, session management, and a chat interface in the browser.

+**IoT**

-### Run
+| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
+| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+| Raspberry Pi 5 16GB | CPU | 51 | 3.2 | 20.5 | 3654 | 3069 |

-```bash
-python server.py
-```
-The server starts at `http://0.0.0.0:8000`
----

+## Gemma 4 E4B on Web

-### 🌐 Web UI
+Running Gemma inference on the web is currently supported through the [LLM Inference Engine](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js) and uses the *gemma-4-E4B-it-web.task* model file. Try it out [live in your browser](https://huggingface.co/spaces/tylermullen/Gemma4) (Chrome with WebGPU recommended). To start developing with it, download [the web model](https://huggingface.co/litert-community/gemma-4-E4B-it-litert-lm/blob/main/gemma-4-E4B-it-web.task) and run it with our [sample web page](https://github.com/google-ai-edge/mediapipe-samples/blob/main/examples/llm_inference/js/README.md), or follow the [guide](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js) to add it to your own app.

-Open a browser and go to:
+Benchmarked in Chrome on a MacBook Pro 2024 (Apple M4 Max) with 1024 prefill tokens and 256 decode tokens; in this configuration the model can support context lengths of up to 128K.

-```
-http://<ip-address>:8000
-```
+| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Initialization time (sec) | Model size (MB) | CPU Memory (GB) | GPU Memory (GB) |
+| :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+| Web | GPU | 1,598 | 44.4 | 1.5 | 2964 | 1.1 | 3.3 |

-Features:
-- An intuitive chat interface with Vietnamese-language support
-- Automatically creates a session when the page is opened
-- Remembers the conversation context within a session
-- A **New** button to start a new conversation
-- A **Clear** button to clear the history and create a new session
-- `Enter` to send, `Shift + Enter` for a new line
+<br>

----
+ * GPU memory was measured as the "GPU Process" memory for all of Chrome while running; it was 130 MB when inactive, before any model loading took place.
+ * CPU memory was measured for the entire tab while running; it was 55 MB when inactive, before any model loading took place.
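+
+If you want to fetch the web model programmatically rather than from the browser, a small sketch using `huggingface_hub` (the filename comes from the download link above):
+
+```python
+from huggingface_hub import hf_hub_download
+
+# Download the .task file used by the LLM Inference web API, then host it
+# alongside your own page or the sample web page linked above.
+task_path = hf_hub_download(
+    repo_id="litert-community/gemma-4-E4B-it-litert-lm",
+    filename="gemma-4-E4B-it-web.task",
+)
+print(task_path)
+```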

-### 🔌 REST API
-
-#### `POST /generate`
-Single-turn, does not remember context. Use it when you only need a one-off question and answer.
-
-```bash
-curl -X POST http://localhost:8000/generate \
-  -H "Content-Type: application/json" \
-  -d '{"prompt": "What is the capital of Vietnam?"}'
-```
-
----
-
-#### `POST /chat/new`
-Creates a new session. Returns a `session_id` to use for subsequent requests.
-
-```bash
-curl -X POST http://localhost:8000/chat/new
-```
-
-**Response:**
-
-```json
-{
-  "session_id": "a3f2c1d4-..."
-}
-```
-
----
-
-#### `POST /chat/{session_id}`
-Sends a message within a session. The model **remembers the full conversation history** within that session.
-
-```bash
-curl -X POST http://localhost:8000/chat/a3f2c1d4-... \
-  -H "Content-Type: application/json" \
-  -d '{"prompt": "Tell me more about that"}'
-```
-
-**Response:**
-
-```json
-{
-  "session_id": "a3f2c1d4-...",
-  "response": "..."
-}
-```
-
----
-
-#### `DELETE /chat/{session_id}`
-Deletes a session and frees its memory.
-
-```bash
-curl -X DELETE http://localhost:8000/chat/a3f2c1d4-...
-```
-
-**Response:**
-
-```json
-{
-  "status": "cleared",
-  "session_id": "a3f2c1d4-..."
-}
-```
-
----
-
-#### `GET /chat/sessions/list`
-Lists all active sessions.
-
-```bash
-curl http://localhost:8000/chat/sessions/list
-```
-
-**Response:**
-
-```json
-{
-  "sessions": ["a3f2c1d4-...", "b7e9f2a1-..."],
-  "count": 2
-}
-```
-
----
-
-## 💡 Example: Multi-turn conversation via curl
-
-```bash
-# 1. Create a session
-SESSION=$(curl -s -X POST http://localhost:8000/chat/new | python3 -c "import sys,json; print(json.load(sys.stdin)['session_id'])")
-
-# 2. Send the first message
-curl -s -X POST http://localhost:8000/chat/$SESSION \
-  -H "Content-Type: application/json" \
-  -d '{"prompt": "My name is Nam"}' | python3 -m json.tool
-
-# 3. The model remembers the context
-curl -s -X POST http://localhost:8000/chat/$SESSION \
-  -H "Content-Type: application/json" \
-  -d '{"prompt": "What is my name?"}' | python3 -m json.tool
-
-# 4. Delete the session when done
-curl -X DELETE http://localhost:8000/chat/$SESSION
-```
-
----
-
-## ⚙️ Configuration
-
-Parameters that can be adjusted at the top of each file:
-
-| Variable | Description | Default |
-|------|-------|---------|
-| `MODEL_PATH` | Path to the model file | `gemma-4-E2B-it.litertlm` |
-| `backend` | Inference backend | `litert_lm.Backend.CPU` |
-| `host` | Listen address | `0.0.0.0` |
-| `port` | Port | `8000` |
-
-To switch the backend to GPU (if the device supports it):
-
-```python
-engine = litert_lm.Engine(MODEL_PATH, backend=litert_lm.Backend.GPU)
-```
-
----
-
-## 🧪 Quick test
-
-```bash
-# Check that the server is running
-curl http://localhost:8000/generate \
-  -X POST \
-  -H "Content-Type: application/json" \
-  -d '{"prompt": "Hello!"}'
-```
-
----
-
-## 📝 Notes
-
-- Each session keeps its entire conversation history in RAM. Delete sessions when they are no longer needed.
-- The `mel_filterbank` warning at startup is normal; it relates to the audio encoder of the multimodal Gemma 4 and does not affect text generation.
-- Generation speed depends on the hardware. On an Orange Pi 5 with CPU, expect roughly 5–15 tokens/second.
-
----
-
-## 📜 License
-
-MIT
+
diff --git a/requirements.txt b/requirements.txt
index f2ad05e..c698f11 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -1,3 +1,4 @@
 fastapi
 uvicorn
-litert-lm-api-nightly
+litert-lm
+huggingface_hub