commit 7a55bf9b622f13178130a776bd7567544a27fa4b
Author: Tony Tran
Date:   Thu Jun 11 23:53:08 2026 +0700

    first commit

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..5b57863
--- /dev/null
+++ b/README.md
@@ -0,0 +1,106 @@
+# OrangePi RAG Dataset
+
+A **Vietnamese-language** RAG (Retrieval-Augmented Generation) data pipeline that crawls, extracts, and chunks blog articles from [orangepi.vn](https://orangepi.vn) — the official Orange Pi distributor in Vietnam.
+
+## Dataset Summary
+
+| Metric | Value |
+|-------------|-------|
+| Articles | 199 |
+| Chunks | 472 |
+| Models | 36 |
+| Language | vi |
+| Last crawl | 2026-06-11 |
+
+## Output Files
+
+| File | Description |
+|------|-------------|
+| `articles.jsonl` | Full article records (title, description, markdown, text, product mentions, topic, metadata) |
+| `chunks.jsonl` | Overlapping text chunks (~650 words, ~100 overlap) with metadata for embedding |
+| `urls.json` | Discovered sitemap URLs with `lastmod` timestamps |
+| `raw/.json` | Raw Firecrawl API scrape response per article |
+| `markdown/.md` | Cleaned markdown per article |
+| `orangepi_models.json` | Canonical Orange Pi model dictionary with aliases |
+| `errors.jsonl` | Failed URLs and error details |
+| `summary.json` | Crawl summary statistics |
+
+### Chunk metadata
+
+Each chunk in `chunks.jsonl` includes:
+
+- `chunk_id` — unique ID (`{article_id}__chunk_{seq}`)
+- `article_id` — source article reference
+- `content` — chunk text (markdown)
+- `section` — nearest heading context
+- `metadata.product_mentions` — canonical Orange Pi models mentioned
+- `metadata.topic` — inferred topic (e.g., "home assistant", "linux", "docker")
+
+## Usage
+
+### Prerequisites
+
+- Python 3.10+
+- A [Firecrawl](https://www.firecrawl.dev) API key
+
+### Install
+
+```bash
+git clone
+cd orangepi-rag
+# No external dependencies beyond Python stdlib
+```
+
+### Set API key
+
+```bash
+export FIRECRAWL_API_KEY="fc-..."
+```
+
+Or place it in `/home/admin/.hermes/.env`:
+
+```
+FIRECRAWL_API_KEY=fc-...
+```
+
+### Run crawl
+
+```bash
+# Quick test — process first 5 articles
+python3 crawl_orangepi_blog.py --limit 5
+
+# Full crawl — all discovered articles
+python3 crawl_orangepi_blog.py --all
+
+# Re-scrape everything (overwrites existing raw files)
+python3 crawl_orangepi_blog.py --all --force
+```
+
+### Options
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--limit N` | 5 | Process first N articles |
+| `--all` | — | Process all discovered articles |
+| `--out-dir PATH` | `/mnt/ssd/orangepi-rag` | Output directory |
+| `--models PATH` | `/orangepi_models.json` | Model dictionary path |
+| `--sitemap URL` | `https://orangepi.vn/post-sitemap.xml` | Sitemap URL |
+| `--sleep SEC` | 1.0 | Delay between Firecrawl calls |
+| `--force` | — | Re-scrape cached articles |
+| `--max-words N` | 650 | Target words per chunk |
+| `--overlap-words N` | 100 | Overlap words between chunks |
+
+## Model Detection
+
+The pipeline uses `orangepi_models.json` to detect canonical Orange Pi product names in article text. The dictionary supports aliases per model (e.g., `"Orange Pi 5"`, `"OrangePi 5"`, `"OPi 5"`) and longest-match-first resolution to prevent false double-counts.
+
+## Use Cases
+
+- **Semantic search** over Vietnamese Orange Pi knowledge
+- **Q&A bots** for Orange Pi tutorials, OS installs, hardware guides
+- **Product recommendation** based on article content
+- **Fine-tuning** Vietnamese embedding models on SBC/embedded computing content
+
+## License
+
+Data sourced from [orangepi.vn](https://orangepi.vn). Check their site for content usage terms.
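
The README describes two text-processing steps without showing them: splitting articles into overlapping word windows (`--max-words`, `--overlap-words`) and resolving model aliases longest-match-first. The following is a minimal sketch of both ideas, not the code in `crawl_orangepi_blog.py`, which is not part of this diff. The function names `chunk_words` and `detect_models`, the whitespace-based word split, and the `{canonical: [aliases]}` dictionary layout are assumptions made for illustration; the actual `orangepi_models.json` schema may differ.

```python
import re


def chunk_words(text, max_words=650, overlap_words=100):
    """Split text into windows of at most max_words words.

    Each window after the first repeats the last overlap_words words
    of the previous one. Words are split on whitespace (an assumption).
    """
    if overlap_words >= max_words:
        raise ValueError("overlap_words must be smaller than max_words")
    words = text.split()
    step = max_words - overlap_words
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once this window has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


def detect_models(text, models):
    """Return canonical model names found in text.

    models maps a canonical name to its aliases (hypothetical layout).
    Aliases are tried longest first, and each matched span is blanked
    out so a shorter alias cannot match inside a longer one.
    """
    pairs = sorted(
        ((alias, canonical)
         for canonical, aliases in models.items()
         for alias in aliases),
        key=lambda pair: len(pair[0]),
        reverse=True,
    )
    remaining = text
    found = []
    for alias, canonical in pairs:
        # Require non-word characters (or text boundaries) around the alias.
        pattern = re.compile(
            r"(?<!\w)" + re.escape(alias) + r"(?!\w)", re.IGNORECASE
        )
        if pattern.search(remaining):
            if canonical not in found:
                found.append(canonical)
            remaining = pattern.sub(lambda m: " " * len(m.group(0)), remaining)
    return found
```

With the README's defaults, a 1,000-word article yields two chunks: words 0 to 649 and words 550 to 999. Blanking matched spans is one way to get the "no double-count" behaviour: once `Orange Pi 5 Plus` has matched, the shorter alias `Orange Pi 5` no longer finds anything at that position.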