OrangePi RAG Dataset
A Vietnamese-language RAG (Retrieval-Augmented Generation) data pipeline that crawls, extracts, and chunks blog articles from orangepi.vn, the official Orange Pi distributor in Vietnam.
Dataset Summary
| Metric | Value |
|---|---|
| Articles | 199 |
| Chunks | 472 |
| Models | 36 |
| Language | vi |
| Last crawl | 2026-06-11 |
Output Files
| File | Description |
|---|---|
| articles.jsonl | Full article records (title, description, markdown, text, product mentions, topic, metadata) |
| chunks.jsonl | Overlapping text chunks (~650 words, ~100 overlap) with metadata for embedding |
| urls.json | Discovered sitemap URLs with lastmod timestamps |
| raw/<slug>.json | Raw Firecrawl API scrape response per article |
| markdown/<slug>.md | Cleaned markdown per article |
| orangepi_models.json | Canonical Orange Pi model dictionary with aliases |
| errors.jsonl | Failed URLs and error details |
| summary.json | Crawl summary statistics |
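urls.json holds the URLs discovered from the sitemap together with their lastmod timestamps. As a rough illustration of that discovery step, the sketch below parses a standard sitemap document with the Python stdlib XML parser. The function name parse_sitemap and the url/lastmod record shape are assumptions made for this example; the actual layout written by crawl_orangepi_blog.py may differ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_data):
    """Return a list of {"url", "lastmod"} dicts from sitemap XML (str or bytes).

    Hypothetical helper: the record shape is an assumption, not the
    confirmed format of urls.json.
    """
    root = ET.fromstring(xml_data)
    entries = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url_el.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append(
                {"url": loc, "lastmod": lastmod.strip() if lastmod else None}
            )
    return entries
```

Entries without a lastmod element are kept with lastmod set to None, so a caller can still decide whether to scrape them.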
Chunk metadata
Each chunk in chunks.jsonl includes:
- chunk_id: unique ID ({article_id}__chunk_{seq})
- article_id: source article reference
- content: chunk text (markdown)
- section: nearest heading context
- metadata.product_mentions: canonical Orange Pi models mentioned
- metadata.topic: inferred topic (e.g., "home assistant", "linux", "docker")
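The fields above can be consumed with a few lines of stdlib Python. The following is a minimal sketch, assuming each line of chunks.jsonl is one JSON object with metadata as a nested object and product_mentions as a list of strings. The helper names iter_jsonl, split_chunk_id, and chunks_mentioning are hypothetical and not part of the pipeline.

```python
import json


def iter_jsonl(lines):
    """Yield one parsed record per non-empty line (works on an open file)."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def split_chunk_id(chunk_id):
    """Split '{article_id}__chunk_{seq}' into (article_id, seq as int)."""
    article_id, sep, seq = chunk_id.rpartition("__chunk_")
    if not sep:
        raise ValueError(f"unexpected chunk_id format: {chunk_id!r}")
    return article_id, int(seq)


def chunks_mentioning(chunks, model):
    """Keep chunks whose metadata.product_mentions contains the given model."""
    return [
        c for c in chunks
        if model in c.get("metadata", {}).get("product_mentions", [])
    ]

# Typical use (assumes chunks.jsonl is in the current directory):
#   with open("chunks.jsonl", encoding="utf-8") as f:
#       chunks = list(iter_jsonl(f))
#   opi5_chunks = chunks_mentioning(chunks, "Orange Pi 5")
```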
Usage
Prerequisites
- Python 3.10+
- A Firecrawl API key
Install
git clone <repo-url>
cd orangepi-rag
# No external dependencies beyond Python stdlib
Set API key
export FIRECRAWL_API_KEY="fc-..."
Or place it in /home/admin/.hermes/.env:
FIRECRAWL_API_KEY=fc-...
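For readers who want to reuse the same key lookup in their own scripts, here is a minimal sketch of resolving the key from the environment with a fallback to the .env file. The precedence (environment variable first, then file) and the helper names parse_env_text and resolve_api_key are assumptions for illustration; the crawler's own lookup order is not documented here.

```python
import os
from pathlib import Path
from typing import Optional


def parse_env_text(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_api_key(env_file: str = "/home/admin/.hermes/.env") -> Optional[str]:
    """Return FIRECRAWL_API_KEY from the environment, else from env_file."""
    key = os.environ.get("FIRECRAWL_API_KEY")
    if key:
        return key  # assumed precedence: environment wins over the file
    path = Path(env_file)
    if path.is_file():
        return parse_env_text(path.read_text(encoding="utf-8")).get("FIRECRAWL_API_KEY")
    return None
```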
Run crawl
# Quick test — process first 5 articles
python3 crawl_orangepi_blog.py --limit 5
# Full crawl — all discovered articles
python3 crawl_orangepi_blog.py --all
# Re-scrape everything (overwrites existing raw files)
python3 crawl_orangepi_blog.py --all --force
Options
| Argument | Default | Description |
|---|---|---|
| --limit N | 5 | Process first N articles |
| --all | - | Process all discovered articles |
| --out-dir PATH | /mnt/ssd/orangepi-rag | Output directory |
| --models PATH | <out-dir>/orangepi_models.json | Model dictionary path |
| --sitemap URL | https://orangepi.vn/post-sitemap.xml | Sitemap URL |
| --sleep SEC | 1.0 | Delay between Firecrawl calls |
| --force | - | Re-scrape cached articles |
| --max-words N | 650 | Target words per chunk |
| --overlap-words N | 100 | Overlap words between chunks |
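To make the effect of --max-words and --overlap-words concrete, the sketch below splits text into overlapping word windows. It is a simplified illustration only: it splits on whitespace and ignores headings, whereas the real pipeline also tracks the nearest section heading. The function name chunk_words is hypothetical.

```python
def chunk_words(text, max_words=650, overlap_words=100):
    """Split text into windows of up to max_words that share overlap_words.

    Illustrative sliding-window chunker; not the pipeline's actual code.
    """
    if not 0 <= overlap_words < max_words:
        raise ValueError("overlap_words must satisfy 0 <= overlap_words < max_words")
    words = text.split()
    step = max_words - overlap_words  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each window advances by 550 words, so a 1,500-word article yields three chunks starting at words 0, 550, and 1,100.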
Model Detection
The pipeline uses orangepi_models.json to detect canonical Orange Pi product names in article text. The dictionary supports aliases per model (e.g., "Orange Pi 5", "OrangePi 5", "OPi 5") and longest-match-first resolution to prevent false double-counts.
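One way to get longest-match-first behavior is a single regex alternation with aliases sorted by length, sketched below. The dictionary shape (canonical name mapped to a list of aliases) and the function name detect_models are assumptions for illustration; the real orangepi_models.json schema and matching code may differ.

```python
import re


def detect_models(text, models):
    """Return canonical model names found in text, in order of first mention.

    `models` is assumed to map canonical name -> list of aliases.
    """
    if not models:
        return []
    alias_to_canonical = {}
    for canonical, aliases in models.items():
        for alias in [canonical, *aliases]:
            alias_to_canonical[alias.lower()] = canonical
    # Longest aliases first, so "Orange Pi 5 Plus" is tried before "Orange Pi 5".
    ordered = sorted(alias_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    found = []
    for match in pattern.finditer(text):
        canonical = alias_to_canonical[match.group(0).lower()]
        if canonical not in found:
            found.append(canonical)
    return found
```

Because finditer consumes the matched span, a mention of "Orange Pi 5 Plus" is not also counted as "Orange Pi 5".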
Use Cases
- Semantic search over Vietnamese Orange Pi knowledge
- Q&A bots for Orange Pi tutorials, OS installs, hardware guides
- Product recommendation based on article content
- Fine-tuning Vietnamese embedding models on SBC/embedded computing content
License
Data sourced from orangepi.vn. Check their site for content usage terms.
Description
A Retrieval-Augmented Generation (RAG) system for the orangepi.vn blog: an AI assistant that answers questions about Orange Pi products based on real data.
https://orangepi.vn