OrangePi RAG Dataset

A Vietnamese-language RAG (Retrieval-Augmented Generation) data pipeline that crawls, extracts, and chunks blog articles from orangepi.vn — the official Orange Pi distributor in Vietnam.

Dataset Summary

Metric       Value
Articles     199
Chunks       472
Models       36
Language     vi
Last crawl   2026-06-11

Output Files

File                   Description
articles.jsonl         Full article records (title, description, markdown, text, product mentions, topic, metadata)
chunks.jsonl           Overlapping text chunks (~650 words each, ~100 words of overlap) with metadata for embedding
urls.json              Discovered sitemap URLs with lastmod timestamps
raw/<slug>.json        Raw Firecrawl API scrape response per article
markdown/<slug>.md     Cleaned markdown per article
orangepi_models.json   Canonical Orange Pi model dictionary with aliases
errors.jsonl           Failed URLs and error details
summary.json           Crawl summary statistics
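
As an illustration of the urls.json entry above, here is a minimal sketch of how URL and lastmod pairs could be pulled out of a standard sitemap using only the Python standard library. The function name `parse_sitemap` and the record keys `url` and `lastmod` are assumptions made for this example; the crawler's actual file layout may differ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return a list of {"url", "lastmod"} records (hypothetical layout)."""
    root = ET.fromstring(xml_text)
    records = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url_el.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:  # skip entries without a usable location
            records.append(
                {"url": loc, "lastmod": lastmod.strip() if lastmod else None}
            )
    return records
```

Entries without a lastmod element are kept with `lastmod` set to `None`, so a caller can decide whether to treat them as always stale.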

Chunk metadata

Each chunk in chunks.jsonl includes:

  • chunk_id — unique ID ({article_id}__chunk_{seq})
  • article_id — source article reference
  • content — chunk text (markdown)
  • section — nearest heading context
  • metadata.product_mentions — canonical Orange Pi models mentioned
  • metadata.topic — inferred topic (e.g., "home assistant", "linux", "docker")
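
The fields listed above can be sketched as a small record builder plus a reader for JSON Lines input. This is an illustration only: the helper names `make_chunk`, `iter_chunks`, and `chunks_mentioning` are hypothetical, and the unpadded integer used for `seq` is an assumption (the source only gives the `{article_id}__chunk_{seq}` pattern).

```python
import json


def make_chunk(article_id, seq, content, section, product_mentions, topic):
    """Build one chunk record with the documented fields."""
    return {
        "chunk_id": f"{article_id}__chunk_{seq}",  # seq formatting is assumed
        "article_id": article_id,
        "content": content,
        "section": section,
        "metadata": {
            "product_mentions": list(product_mentions),
            "topic": topic,
        },
    }


def iter_chunks(lines):
    """Yield chunk dicts from an iterable of JSONL lines, skipping blanks."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def chunks_mentioning(chunks, model):
    """Keep only chunks whose metadata lists the given canonical model."""
    return [
        c for c in chunks
        if model in c.get("metadata", {}).get("product_mentions", [])
    ]
```

To read the real file, pass an open file handle, for example `iter_chunks(open("chunks.jsonl", encoding="utf-8"))`, and filter the result before embedding.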

Usage

Prerequisites

Install

git clone <repo-url>
cd orangepi-rag
# No external dependencies beyond Python stdlib

Set API key

export FIRECRAWL_API_KEY="fc-..."

Or place it in /home/admin/.hermes/.env:

FIRECRAWL_API_KEY=fc-...
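
The two ways of supplying the key can be combined in a small lookup. The sketch below assumes the environment variable takes precedence over the .env file, which the source does not state; `read_env_file` and `load_api_key` are hypothetical helpers, not functions from the crawler.

```python
import os
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines; ignores blanks and # comments."""
    values = {}
    p = Path(path)
    if not p.is_file():
        return values
    for line in p.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_api_key(env_path="/home/admin/.hermes/.env", environ=None):
    """Return the Firecrawl key from the environment, else from the .env file."""
    environ = os.environ if environ is None else environ
    key = environ.get("FIRECRAWL_API_KEY") or read_env_file(env_path).get(
        "FIRECRAWL_API_KEY"
    )
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set")
    return key
```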

Run crawl

# Quick test — process first 5 articles
python3 crawl_orangepi_blog.py --limit 5

# Full crawl — all discovered articles
python3 crawl_orangepi_blog.py --all

# Re-scrape everything (overwrites existing raw files)
python3 crawl_orangepi_blog.py --all --force

Options

Argument            Default                                Description
--limit N           5                                      Process first N articles
--all               (none)                                 Process all discovered articles
--out-dir PATH      /mnt/ssd/orangepi-rag                  Output directory
--models PATH       <out-dir>/orangepi_models.json         Model dictionary path
--sitemap URL       https://orangepi.vn/post-sitemap.xml   Sitemap URL
--sleep SEC         1.0                                    Delay between Firecrawl calls (seconds)
--force             (none)                                 Re-scrape cached articles
--max-words N       650                                    Target words per chunk
--overlap-words N   100                                    Overlap words between chunks
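
To show how --max-words and --overlap-words interact, here is a minimal sketch of word-based overlapping chunking with the same defaults. It is not the crawler's implementation: the function name `chunk_words` is hypothetical, and the sketch splits on whitespace only, ignoring the heading context that the real chunks record.

```python
def chunk_words(text, max_words=650, overlap_words=100):
    """Split text into windows of max_words that share overlap_words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap_words < max_words:
        raise ValueError("overlap_words must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap_words  # how far each window advances
    chunks = []
    start = 0
    while start < len(words):
        end = start + max_words
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):  # last window reached the end of the text
            break
        start += step
    return chunks
```

With the defaults each window advances by 550 words, so a 1,500-word article yields three chunks, and the last 100 words of one chunk reappear at the start of the next.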

Model Detection

The pipeline uses orangepi_models.json to detect canonical Orange Pi product names in article text. Each model in the dictionary can list several aliases (e.g., "Orange Pi 5", "OrangePi 5", "OPi 5"). Aliases are resolved longest-match-first, so text that matches a longer model name is not also counted under a shorter name contained in it.
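
One way to get longest-match-first behavior is a single regular expression whose alternatives are sorted by length. The sketch below assumes the dictionary maps each canonical name to a list of aliases; that layout, the function names, and the "Orange Pi 5 Plus" entry used in the usage note are illustrative assumptions rather than the contents of orangepi_models.json.

```python
import re


def build_alias_pattern(models):
    """models: {canonical_name: [alias, ...]} (assumed layout)."""
    alias_to_canonical = {}
    for canonical, aliases in models.items():
        for alias in [canonical, *aliases]:
            alias_to_canonical[alias.lower()] = canonical
    if not alias_to_canonical:
        return None, alias_to_canonical
    # Longest aliases first, so the regex prefers them at any position.
    ordered = sorted(alias_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern, alias_to_canonical


def detect_models(text, models):
    """Return canonical names in order of first appearance, without repeats."""
    pattern, alias_to_canonical = build_alias_pattern(models)
    if pattern is None:
        return []
    found = {}
    for match in pattern.finditer(text):
        canonical = alias_to_canonical.get(match.group(0).lower())
        if canonical is not None:
            found.setdefault(canonical, True)
    return list(found)
```

Given entries for both "Orange Pi 5" and "Orange Pi 5 Plus", the text "OPi 5 Plus" is consumed by the longer alias in one match, so the shorter model is not reported for the same span. The word-boundary lookarounds also keep an alias from matching inside a longer token.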

Use Cases

  • Semantic search over Vietnamese Orange Pi knowledge
  • Q&A bots for Orange Pi tutorials, OS installs, hardware guides
  • Product recommendation based on article content
  • Fine-tuning Vietnamese embedding models on SBC/embedded computing content

License

Data is sourced from orangepi.vn. Check the site for its content usage terms.

Description
A Retrieval-Augmented Generation (RAG) system for the orangepi.vn blog: an AI assistant that answers questions about Orange Pi products based on real-world data.
https://orangepi.vn