orangepivietnam/orangepi-rag

Fork 0

T

admin 8c827179e3 update gitignore

2026-06-14 17:07:34 +07:00

markdown

Add files

2026-06-11 23:53:48 +07:00

rag_index

add rag test

2026-06-12 10:57:45 +07:00

raw

Add files

2026-06-11 23:53:48 +07:00

static

update frontend

2026-06-14 17:04:17 +07:00

templates

update frontend

2026-06-14 17:04:17 +07:00

.env.example

update

2026-06-13 11:01:35 +07:00

.gitignore

update gitignore

2026-06-14 17:07:34 +07:00

articles.jsonl

Add files

2026-06-11 23:53:48 +07:00

chunks.jsonl

Add files

2026-06-11 23:53:48 +07:00

crawl_blog.py

update crawl blog

2026-06-12 11:37:39 +07:00

crawl_orangepi_blog.py

Add files

2026-06-11 23:53:48 +07:00

errors.jsonl

Add files

2026-06-11 23:53:48 +07:00

keywords_example.json

update crawl blog

2026-06-12 11:37:39 +07:00

orangepi_models.json

Add files

2026-06-11 23:53:48 +07:00

rag_app.py

Merge branch 'main' of https://git.ttcorp.net/admin/orangepi-rag

2026-06-12 22:20:10 +07:00

rag_chat.db

update

2026-06-13 11:01:35 +07:00

README.md

them web app va README

2026-06-12 22:13:19 +07:00

requirements-cpu.txt

update

2026-06-12 22:49:19 +07:00

requirements.txt

them web app va README

2026-06-12 22:13:19 +07:00

summary.json

Add files

2026-06-11 23:53:48 +07:00

urls.json

Add files

2026-06-11 23:53:48 +07:00

web_app.py

update

2026-06-13 11:01:35 +07:00

README.md

Blog RAG Toolkit

Bộ công cụ RAG (Retrieval-Augmented Generation) hoàn chỉnh: crawl blog, trích xuất từ khóa, chia nhỏ nội dung, truy vấn bằng LLM, và giao diện web để chat.

1. Cài đặt

git clone <repo-url>
cd orangepi-rag
pip install -r requirements.txt

Yêu cầu: Python 3.10+, tài khoản Firecrawl (cho crawl), tài khoản LLM — OpenAI / Together.ai / Groq / Ollama (cho truy vấn).

2. Cấu hình API

Tạo file .env ở thư mục gốc dự án:

# ─── BẮT BUỘC cho crawl ───
FIRECRAWL_API_KEY=fc-...

# ─── BẮT BUỘC cho RAG query ───
OPENAI_API_KEY=sk-...

# ─── TÙY CHỌN ───
# Thay đổi LLM provider (mặc định: OpenAI)
# LLM_BASE_URL=https://api.together.xyz/v1
# LLM_MODEL=meta-llama/Llama-3-70b-chat-hf

Lấy Firecrawl key tại: https://www.firecrawl.dev Lấy OpenAI key tại: https://platform.openai.com/api-keys

Bước 1 — Crawl dữ liệu blog

1.1 Tạo file từ khóa

Tạo file keywords.json chứa các từ khóa cần trích xuất từ blog:

[
  {
    "category": "hardware",
    "keywords": ["Raspberry Pi", "Orange Pi", "Arduino", "ESP32"]
  },
  {
    "category": "software",
    "keywords": ["Docker", "Ubuntu", "Home Assistant", "MQTT"]
  }
]

Xem file mẫu tại keywords_example.json.

1.2 Tìm sitemap URL

Blog WordPress thường có sitemap tại:

https://example.com/post-sitemap.xml (Yoast SEO)
https://example.com/sitemap.xml (generic)

1.3 Chạy crawl

# Test thử 5 bài viết
python crawl_blog.py \
  --sitemap https://example.com/post-sitemap.xml \
  --limit 5 \
  --out-dir ./blog_data

# Crawl toàn bộ blog
python crawl_blog.py \
  --sitemap https://example.com/post-sitemap.xml \
  --all \
  --keywords keywords.json \
  --out-dir ./blog_data

# Crawl với tùy chỉnh
python crawl_blog.py \
  --sitemap https://example.com/post-sitemap.xml \
  --all \
  --keywords keywords.json \
  --out-dir ./blog_data \
  --sleep 1.5 \
  --max-words 500 \
  --overlap-words 80 \
  --language vi

1.4 Kết quả

Sau khi crawl xong, thư mục blog_data/ sẽ chứa:

blog_data/
├── articles.jsonl        # Mỗi dòng = 1 bài viết (title, text, keywords, ...)
├── chunks.jsonl          # Mỗi dòng = 1 đoạn nhỏ (~650 từ) cho embedding
├── keywords.json         # File từ khóa đã dùng
├── urls.json             # Danh sách URL tìm được từ sitemap
├── raw/<slug>.json       # Response gốc từ Firecrawl
├── markdown/<slug>.md    # Markdown đã làm sạch
├── errors.jsonl          # Các URL lỗi
└── summary.json          # Tổng kết crawl

1.5 Tham số đầy đủ

Tham số	Mặc định	Mô tả
`--sitemap`	(bắt buộc)	URL sitemap
`--out-dir`	`./blog_data`	Thư mục output
`--keywords`	`<out-dir>/keywords.json`	File từ khóa JSON
`--limit N`	5	Crawl N bài đầu tiên
`--all`	—	Crawl toàn bộ
`--sleep SEC`	1.0	Nghỉ giữa mỗi request (giây)
`--force`	—	Crawl lại kể cả đã có cache
`--max-words N`	650	Số từ tối đa mỗi chunk
`--overlap-words N`	100	Số từ overlap giữa các chunk
`--language`	`en`	Mã ngôn ngữ mặc định

Bước 2 — Xây dựng chỉ mục & truy vấn

2.1 Xây dựng chỉ mục FAISS

python rag_app.py \
  --build \
  --data-dir ./blog_data \
  --index-dir ./rag_index

Kết quả:

rag_index/
├── faiss.index           # Chỉ mục vector FAISS
└── chunks.jsonl          # Bản sao chunks cho retrieval

2.2 Truy vấn đơn lẻ

python rag_app.py \
  --query "Cài Docker trên Raspberry Pi như thế nào?" \
  --data-dir ./blog_data \
  --index-dir ./rag_index

2.3 Chat interactive (terminal)

python rag_app.py \
  --interactive \
  --data-dir ./blog_data \
  --index-dir ./rag_index

Gõ câu hỏi, nhận câu trả lời. Nhấn Ctrl+C để thoát.

2.4 Kiểm tra retrieval không cần LLM

python rag_app.py \
  --query "Home Assistant" \
  --retrieve-only \
  --data-dir ./blog_data \
  --index-dir ./rag_index

Chỉ hiển thị các chunk liên quan nhất, không gọi LLM.

2.5 Tham số đầy đủ

Tham số	Mặc định	Mô tả
`--data-dir`	`.`	Thư mục chứa chunks.jsonl
`--index-dir`	`./rag_index`	Thư mục chỉ mục FAISS
`--build`	—	Xây dựng chỉ mục
`--query`	—	Câu hỏi cần trả lời
`--interactive`	—	Chế độ chat terminal
`--retrieve-only`	—	Chỉ test retrieval, không dùng LLM
`--top-k`	5	Số chunk trả về
`--embed-model`	`paraphrase-multilingual-MiniLM-L12-v2`	Mô hình embedding
`--llm-model`	`gpt-4o-mini`	Tên mô hình LLM
`--llm-base-url`	`https://api.openai.com/v1`	URL API LLM

Bước 3 — Giao diện web

3.1 Khởi động server

python web_app.py \
  --data-dir ./blog_data \
  --index-dir ./rag_index \
  --port 5000

Mở trình duyệt: http://localhost:5000

3.2 Sử dụng

Nhấn + để tạo phiên chat mới
Gõ câu hỏi vào ô nhập, nhấn Enter để gửi
Xem câu trả lời + nguồn bài viết
Tạo nhiều phiên để hỏi nhiều chủ đề khác nhau
Xóa lịch sử hoặc xóa phiên bằng nút trên header

3.3 Tính năng

Tính năng	Mô tả
Quản lý phiên	Tạo, chuyển đổi, xóa nhiều phiên chat
Lịch sử chat	Lưu vào SQLite, giữ lại khi reload trang
Nhớ ngữ cảnh	10 tin nhắn cuối được đưa vào prompt để giữ context
Tránh lạc đề	LLM được hướng dẫn chỉ trả lời trong phạm vi dữ liệu
Trích nguồn	Mỗi câu trả lời có link đến bài viết gốc
Responsive	Giao diện thích ứng desktop và mobile

3.4 Tham số

Tham số	Mặc định	Mô tả
`--host`	`0.0.0.0`	Host để bind
`--port`	`5000`	Port
`--debug`	—	Chế độ debug
`--data-dir`	`.`	Thư mục dữ liệu
`--index-dir`	`./rag_index`	Thư mục chỉ mục

3.5 Biến môi trường web

# Trong file .env
RAG_DATA_DIR=./blog_data
RAG_INDEX_DIR=./rag_index
RAG_LLM_MODEL=gpt-4o-mini
RAG_LLM_BASE_URL=https://api.openai.com/v1
RAG_TOP_K=5
RAG_MAX_HISTORY=10          # Số tin nhắn giữ context

3.6 API endpoints

Method	Path	Mô tả
`GET`	`/api/sessions`	Danh sách phiên
`POST`	`/api/sessions`	Tạo phiên mới
`DELETE`	`/api/sessions/<id>`	Xóa phiên
`GET`	`/api/sessions/<id>/messages`	Lịch sử tin nhắn
`POST`	`/api/sessions/<id>/messages`	Gửi tin nhắn, nhận câu trả lời
`POST`	`/api/sessions/<id>/clear`	Xóa lịch sử phiên
`GET`	`/api/stats`	Thống kê hệ thống

Tham khảo

Cấu trúc thư mục hoàn chỉnh

orangepi-rag/
├── .env                     # API keys (FIRECRAWL, OPENAI)
├── requirements.txt         # Python dependencies
├── crawl_blog.py            # Crawler tổng quát
├── crawl_orangepi_blog.py   # Crawler orangepi.vn
├── rag_app.py               # RAG query (CLI)
├── web_app.py               # Giao diện web (Flask)
├── keywords_example.json    # Mẫu file từ khóa
├── templates/
│   └── index.html           # HTML template
├── static/
│   ├── style.css            # CSS
│   └── app.js               # JavaScript
├── blog_data/               # Dữ liệu crawl được
│   ├── articles.jsonl
│   ├── chunks.jsonl
│   ├── keywords.json
│   ├── urls.json
│   ├── raw/
│   ├── markdown/
│   ├── errors.jsonl
│   └── summary.json
├── rag_index/               # Chỉ mục FAISS
│   ├── faiss.index
│   └── chunks.jsonl
└── rag_chat.db              # SQLite chat history

Lưu ý khi dùng LLM provider khác

# Together.ai
LLM_BASE_URL=https://api.together.xyz/v1
LLM_MODEL=meta-llama/Llama-3-70b-chat-hf
OPENAI_API_KEY=...

# Groq
LLM_BASE_URL=https://api.groq.com/openai/v1
LLM_MODEL=llama-3.1-70b-versatile
OPENAI_API_KEY=...

# Ollama (chạy local)
LLM_BASE_URL=http://localhost:11434/v1
LLM_MODEL=llama3
OPENAI_API_KEY=ollama