# OrangePi RAG Dataset

A **Vietnamese-language** RAG (Retrieval-Augmented Generation) data pipeline that crawls, extracts, and chunks blog articles from [orangepi.vn](https://orangepi.vn), the official Orange Pi distributor in Vietnam.

## Dataset Summary

| Metric | Value |
|-------------|-------|
| Articles | 199 |
| Chunks | 472 |
| Models | 36 |
| Language | vi |
| Last crawl | 2026-06-11 |

## Output Files

| File | Description |
|------|-------------|
| `articles.jsonl` | Full article records (title, description, markdown, text, product mentions, topic, metadata) |
| `chunks.jsonl` | Overlapping text chunks (~650 words, ~100 overlap) with metadata for embedding |
| `urls.json` | Discovered sitemap URLs with `lastmod` timestamps |
| `raw/.json` | Raw Firecrawl API scrape response per article |
| `markdown/.md` | Cleaned markdown per article |
| `orangepi_models.json` | Canonical Orange Pi model dictionary with aliases |
| `errors.jsonl` | Failed URLs and error details |
| `summary.json` | Crawl summary statistics |

### Chunk metadata

Each chunk in `chunks.jsonl` includes:

- `chunk_id`: unique ID (`{article_id}__chunk_{seq}`)
- `article_id`: source article reference
- `content`: chunk text (markdown)
- `section`: nearest heading context
- `metadata.product_mentions`: canonical Orange Pi models mentioned
- `metadata.topic`: inferred topic (e.g., "home assistant", "linux", "docker")

## Usage

### Prerequisites

- Python 3.10+
- A [Firecrawl](https://www.firecrawl.dev) API key

### Install

```bash
git clone
cd orangepi-rag
# No external dependencies beyond Python stdlib
```

### Set API key

```bash
export FIRECRAWL_API_KEY="fc-..."
```

Or place it in `/home/admin/.hermes/.env`:

```
FIRECRAWL_API_KEY=fc-...
```

### Run crawl

```bash
# Quick test: process first 5 articles
python3 crawl_orangepi_blog.py --limit 5

# Full crawl: all discovered articles
python3 crawl_orangepi_blog.py --all

# Re-scrape everything (overwrites existing raw files)
python3 crawl_orangepi_blog.py --all --force
```

### Options

| Argument | Default | Description |
|----------|---------|-------------|
| `--limit N` | 5 | Process first N articles |
| `--all` | - | Process all discovered articles |
| `--out-dir PATH` | `/mnt/ssd/orangepi-rag` | Output directory |
| `--models PATH` | `/orangepi_models.json` | Model dictionary path |
| `--sitemap URL` | `https://orangepi.vn/post-sitemap.xml` | Sitemap URL |
| `--sleep SEC` | 1.0 | Delay between Firecrawl calls |
| `--force` | - | Re-scrape cached articles |
| `--max-words N` | 650 | Target words per chunk |
| `--overlap-words N` | 100 | Overlap words between chunks |

## Model Detection

The pipeline uses `orangepi_models.json` to detect canonical Orange Pi product names in article text. The dictionary supports aliases per model (e.g., `"Orange Pi 5"`, `"OrangePi 5"`, `"OPi 5"`) and longest-match-first resolution to prevent false double-counts.

## Use Cases

- **Semantic search** over Vietnamese Orange Pi knowledge
- **Q&A bots** for Orange Pi tutorials, OS installs, hardware guides
- **Product recommendation** based on article content
- **Fine-tuning** Vietnamese embedding models on SBC/embedded computing content

## License

Data sourced from [orangepi.vn](https://orangepi.vn). Check their site for content usage terms.
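
The `--max-words` and `--overlap-words` options describe a sliding window over words. The sketch below shows one way such a window could work; it is not taken from `crawl_orangepi_blog.py`, which may also respect headings or paragraph boundaries. The function names `chunk_words` and `make_chunk_id` are hypothetical, and the unpadded sequence number in the ID is an assumption about the `{article_id}__chunk_{seq}` format.

```python
def chunk_words(text: str, max_words: int = 650, overlap_words: int = 100) -> list[str]:
    """Split text into word windows that share `overlap_words` words with their neighbor."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap_words < max_words:
        raise ValueError("overlap_words must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap_words  # how far the window advances each time
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks


def make_chunk_id(article_id: str, seq: int) -> str:
    """Build an ID in the documented shape; plain integer `seq` is an assumption."""
    return f"{article_id}__chunk_{seq}"
```

With the defaults, the window advances 550 words at a time, so a 1,500-word article would yield three chunks, each sharing its first 100 words with the end of the previous one.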
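
The longest-match-first resolution described under Model Detection can be illustrated with a single regular expression whose alternatives are sorted by length. This is a minimal sketch under stated assumptions: the real layout of `orangepi_models.json` is not shown in this README, so the `MODELS` mapping (canonical name to alias list), the "Orange Pi 5 Plus" aliases, and the function names are all hypothetical.

```python
import re

# Assumed shape: canonical name -> list of aliases (the real JSON layout may differ).
MODELS = {
    "Orange Pi 5": ["Orange Pi 5", "OrangePi 5", "OPi 5"],
    "Orange Pi 5 Plus": ["Orange Pi 5 Plus", "OrangePi 5 Plus", "OPi 5 Plus"],
}


def build_matcher(models: dict[str, list[str]]) -> tuple[re.Pattern, dict[str, str]]:
    """Compile one pattern that tries longer aliases before shorter ones."""
    alias_to_canonical = {
        alias.lower(): canonical
        for canonical, aliases in models.items()
        for alias in aliases
    }
    # Longest first, so "OPi 5 Plus" wins over its prefix "OPi 5".
    ordered = sorted(alias_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern, alias_to_canonical


def detect_models(text: str, models: dict[str, list[str]]) -> list[str]:
    """Return canonical model names in order of first appearance, without duplicates."""
    pattern, alias_to_canonical = build_matcher(models)
    found: list[str] = []
    for match in pattern.finditer(text):
        canonical = alias_to_canonical[match.group(0).lower()]
        if canonical not in found:
            found.append(canonical)
    return found
```

Because `finditer` returns non-overlapping matches, text that mentions only "Orange Pi 5 Plus" is counted once for that model and not a second time for "Orange Pi 5".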
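
For downstream use, `chunks.jsonl` can be read one JSON object per line and filtered on the fields listed under Chunk metadata. The helper names below are hypothetical, and the sketch assumes `metadata.product_mentions` is stored as a list of canonical model names, as the field description suggests.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_chunks(path: str) -> Iterator[dict]:
    """Yield one chunk record per non-empty line of a JSONL file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def chunks_mentioning(path: str, model: str) -> list[dict]:
    """Keep chunks whose metadata lists the given canonical model name."""
    return [
        chunk
        for chunk in iter_chunks(path)
        if model in chunk.get("metadata", {}).get("product_mentions", [])
    ]
```

A call such as `chunks_mentioning("/mnt/ssd/orangepi-rag/chunks.jsonl", "Orange Pi 5")` would then return the records to embed or index for that model, assuming the default output directory was used.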