first commit

2026-06-11 23:53:08 +07:00
commit 7a55bf9b62
1 changed file with 106 additions and 0 deletions
@@ -0,0 +1,106 @@
+# OrangePi RAG Dataset
+
+A **Vietnamese-language** RAG (Retrieval-Augmented Generation) data pipeline that crawls, extracts, and chunks blog articles from [orangepi.vn](https://orangepi.vn), the official Orange Pi distributor in Vietnam.
+
+## Dataset Summary
+
+| Metric       | Value |
+|-------------|-------|
+| Articles     | 199   |
+| Chunks       | 472   |
+| Models       | 36    |
+| Language     | vi    |
+| Last crawl   | 2026-06-11 |
+
+## Output Files
+
+| File | Description |
+|------|-------------|
+| `articles.jsonl` | Full article records (title, description, markdown, text, product mentions, topic, metadata) |
+| `chunks.jsonl` | Overlapping text chunks (~650 words each, with ~100 words of overlap between consecutive chunks) plus metadata for embedding |
+| `urls.json` | Discovered sitemap URLs with `lastmod` timestamps |
+| `raw/<slug>.json` | Raw Firecrawl API scrape response per article |
+| `markdown/<slug>.md` | Cleaned markdown per article |
+| `orangepi_models.json` | Canonical Orange Pi model dictionary with aliases |
+| `errors.jsonl` | Failed URLs and error details |
+| `summary.json` | Crawl summary statistics |
+
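As a rough illustration of how `urls.json` could be derived from the sitemap, the sketch below parses sitemap XML into URL and `lastmod` pairs using only the standard library. The function name `parse_sitemap`, the output keys, and the sample URLs are assumptions made for this example; the actual layout of `urls.json` is not documented here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (version 0.9).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> list[dict]:
    """Return one {"url", "lastmod"} dict per <url> entry (hypothetical layout)."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("sm:url", SITEMAP_NS):
        loc = node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = node.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:
            entries.append(
                {"url": loc, "lastmod": lastmod.strip() if lastmod else None}
            )
    return entries


# Made-up sample input; these URLs are placeholders, not real articles.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://orangepi.vn/example-post</loc>
    <lastmod>2026-06-01T08:00:00+07:00</lastmod>
  </url>
  <url>
    <loc>https://orangepi.vn/another-post</loc>
  </url>
</urlset>"""

for entry in parse_sitemap(SAMPLE):
    print(entry["url"], entry["lastmod"])
```

Entries without a `<lastmod>` element get `None`, so a caller can decide whether to treat them as always stale.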
+### Chunk metadata
+
+Each chunk in `chunks.jsonl` includes:
+
+- `chunk_id` — unique ID (`{article_id}__chunk_{seq}`)
+- `article_id` — source article reference
+- `content` — chunk text (markdown)
+- `section` — nearest heading context
+- `metadata.product_mentions` — canonical Orange Pi models mentioned
+- `metadata.topic` — inferred topic (e.g., "home assistant", "linux", "docker")
+
+## Usage
+
+### Prerequisites
+
+- Python 3.10+
+- A [Firecrawl](https://www.firecrawl.dev) API key
+
+### Install
+
+```bash
+git clone <repo-url>
+cd orangepi-rag
+# No external dependencies beyond Python stdlib
+```
+
+### Set API key
+
+```bash
+export FIRECRAWL_API_KEY="fc-..."
+```
+
+Or place it in `/home/admin/.hermes/.env`:
+
+```
+FIRECRAWL_API_KEY=fc-...
+```
+
+### Run crawl
+
+```bash
+# Quick test — process first 5 articles
+python3 crawl_orangepi_blog.py --limit 5
+
+# Full crawl — all discovered articles
+python3 crawl_orangepi_blog.py --all
+
+# Re-scrape everything (overwrites existing raw files)
+python3 crawl_orangepi_blog.py --all --force
+```
+
+### Options
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--limit N` | 5 | Process first N articles |
+| `--all` | — | Process all discovered articles |
+| `--out-dir PATH` | `/mnt/ssd/orangepi-rag` | Output directory |
+| `--models PATH` | `<out-dir>/orangepi_models.json` | Model dictionary path |
+| `--sitemap URL` | `https://orangepi.vn/post-sitemap.xml` | Sitemap URL |
+| `--sleep SEC` | 1.0 | Delay between Firecrawl calls |
+| `--force` | — | Re-scrape cached articles |
+| `--max-words N` | 650 | Target words per chunk |
+| `--overlap-words N` | 100 | Overlap words between chunks |
+
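To make `--max-words` and `--overlap-words` concrete, here is a simplified sliding-window sketch. It splits on whitespace only and ignores headings, so it does not reproduce the `section` tracking the real pipeline appears to do; `chunk_words` is a hypothetical name.

```python
def chunk_words(text: str, max_words: int = 650, overlap_words: int = 100) -> list[str]:
    """Split text into windows of up to max_words words.

    Each window after the first repeats the last overlap_words words
    of the previous window.
    """
    if max_words <= 0 or overlap_words < 0:
        raise ValueError("max_words must be positive and overlap_words non-negative")
    if overlap_words >= max_words:
        raise ValueError("overlap_words must be smaller than max_words")

    words = text.split()
    step = max_words - overlap_words  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

With the defaults, each window advances by 550 words, so a 1,200-word article yields two chunks covering words 0 to 649 and 550 to 1199.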
+## Model Detection
+
+The pipeline uses `orangepi_models.json` to detect canonical Orange Pi product names in article text. The dictionary lists several aliases per model (e.g., `"Orange Pi 5"`, `"OrangePi 5"`, `"OPi 5"`). Aliases are resolved longest-match-first, so a mention of a longer model name is not also counted as a shorter model whose name it contains.
+
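One way to implement longest-match-first alias resolution is a single regular expression whose alternatives are sorted by length. The sketch below assumes a dictionary mapping each canonical name to a list of aliases; that layout, the `MODELS` sample entries, and the function names are illustrative and may differ from the real `orangepi_models.json`.

```python
import re

# Hypothetical dictionary layout: canonical name -> list of aliases.
MODELS = {
    "Orange Pi 5": ["Orange Pi 5", "OrangePi 5", "OPi 5"],
    "Orange Pi 5 Plus": ["Orange Pi 5 Plus", "OrangePi 5 Plus", "OPi 5 Plus"],
}


def build_matcher(models: dict[str, list[str]]):
    """Compile one regex that tries longer aliases before shorter ones."""
    alias_to_canonical = {
        alias.lower(): canonical
        for canonical, aliases in models.items()
        for alias in aliases
    }
    # Regex alternation picks the first alternative that matches,
    # so sorting by length gives longest-match-first behaviour.
    ordered = sorted(alias_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern, alias_to_canonical


def detect_models(text: str, models: dict[str, list[str]] = MODELS) -> list[str]:
    """Return canonical model names in order of first mention, without duplicates."""
    pattern, alias_to_canonical = build_matcher(models)
    found: list[str] = []
    for match in pattern.finditer(text):
        canonical = alias_to_canonical[match.group(0).lower()]
        if canonical not in found:
            found.append(canonical)
    return found
```

Because matches do not overlap, text consumed by a longer alias is never matched again by a shorter one, and the `(?!\w)` guard keeps an alias from matching inside a longer token.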
+## Use Cases
+
+- **Semantic search** over Vietnamese Orange Pi knowledge
+- **Q&A bots** for Orange Pi tutorials, OS installs, hardware guides
+- **Product recommendation** based on article content
+- **Fine-tuning** Vietnamese embedding models on SBC/embedded computing content
+
+## License
+
+The data is sourced from [orangepi.vn](https://orangepi.vn). Check that site for its content usage terms before reusing the articles.