# OrangePi RAG Dataset

A **Vietnamese-language** RAG (Retrieval-Augmented Generation) data pipeline that crawls, extracts, and chunks blog articles from [orangepi.vn](https://orangepi.vn), the official Orange Pi distributor in Vietnam.

## Dataset Summary

| Metric | Value |
|-------------|-------|
| Articles | 199 |
| Chunks | 472 |
| Models | 36 |
| Language | vi |
| Last crawl | 2026-06-11 |

## Output Files

| File | Description |
|------|-------------|
| `articles.jsonl` | Full article records (title, description, markdown, text, product mentions, topic, metadata) |
| `chunks.jsonl` | Overlapping text chunks (~650 words each, ~100-word overlap) with metadata for embedding |
| `urls.json` | Discovered sitemap URLs with `lastmod` timestamps |
| `raw/<slug>.json` | Raw Firecrawl API scrape response per article |
| `markdown/<slug>.md` | Cleaned markdown per article |
| `orangepi_models.json` | Canonical Orange Pi model dictionary with aliases |
| `errors.jsonl` | Failed URLs and error details |
| `summary.json` | Crawl summary statistics |
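
To illustrate how `urls.json` could be derived from the sitemap, here is a minimal sketch using only the standard library. The function name `parse_sitemap`, the output record shape (`url`, `lastmod`), and the sample URLs are assumptions for illustration; the actual crawler may structure this differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap snippet; these URLs are placeholders, not real articles.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://orangepi.vn/example-post</loc><lastmod>2026-06-01T08:00:00+00:00</lastmod></url>
  <url><loc>https://orangepi.vn/another-post</loc></url>
</urlset>"""


def parse_sitemap(xml_text: str) -> list[dict]:
    """Return one {'url', 'lastmod'} record per <url> entry (assumed shape)."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        loc = url_el.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url_el.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        if loc:  # skip entries without a location
            entries.append({"url": loc, "lastmod": lastmod.strip() if lastmod else None})
    return entries


if __name__ == "__main__":
    for entry in parse_sitemap(SAMPLE):
        print(entry["url"], entry["lastmod"])
```

Entries without a `<lastmod>` element keep `None`, so a caller can decide whether to treat them as always stale.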

### Chunk metadata

Each chunk in `chunks.jsonl` includes:

- `chunk_id`: unique ID (`{article_id}__chunk_{seq}`)
- `article_id`: source article reference
- `content`: chunk text (markdown)
- `section`: nearest heading context
- `metadata.product_mentions`: canonical Orange Pi models mentioned
- `metadata.topic`: inferred topic (e.g., "home assistant", "linux", "docker")
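
As a rough sketch of how these fields might be consumed, the snippet below parses JSONL lines and keeps the chunks that mention a given model. It assumes `metadata.product_mentions` is a list of canonical model names; the helper names `iter_chunks` and `chunks_mentioning` are hypothetical and not part of the pipeline.

```python
import json
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSONL lines into chunk dicts, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def chunks_mentioning(chunks: Iterable[dict], model: str) -> list[dict]:
    """Keep chunks whose metadata.product_mentions contains `model`.

    Assumes product_mentions is a list of strings (not confirmed by the README).
    """
    return [
        chunk for chunk in chunks
        if model in chunk.get("metadata", {}).get("product_mentions", [])
    ]


if __name__ == "__main__":
    # Path is relative to the output directory.
    with open("chunks.jsonl", encoding="utf-8") as handle:
        hits = chunks_mentioning(iter_chunks(handle), "Orange Pi 5")
    print(f"{len(hits)} chunks mention Orange Pi 5")
```

Because `iter_chunks` accepts any iterable of lines, the same helper works on an open file handle or on an in-memory list.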

## Usage

### Prerequisites

- Python 3.10+
- A [Firecrawl](https://www.firecrawl.dev) API key

### Install

```bash
git clone <repo-url>
cd orangepi-rag
# No external dependencies beyond Python stdlib
```

### Set API key

```bash
export FIRECRAWL_API_KEY="fc-..."
```

Or place it in `/home/admin/.hermes/.env`:

```
FIRECRAWL_API_KEY=fc-...
```

### Run crawl

```bash
# Quick test: process the first 5 articles
python3 crawl_orangepi_blog.py --limit 5

# Full crawl: process all discovered articles
python3 crawl_orangepi_blog.py --all

# Re-scrape everything (overwrites existing raw files)
python3 crawl_orangepi_blog.py --all --force
```

### Options

| Argument | Default | Description |
|----------|---------|-------------|
| `--limit N` | 5 | Process the first N articles |
| `--all` | - | Process all discovered articles |
| `--out-dir PATH` | `/mnt/ssd/orangepi-rag` | Output directory |
| `--models PATH` | `<out-dir>/orangepi_models.json` | Model dictionary path |
| `--sitemap URL` | `https://orangepi.vn/post-sitemap.xml` | Sitemap URL |
| `--sleep SEC` | 1.0 | Delay in seconds between Firecrawl calls |
| `--force` | - | Re-scrape cached articles |
| `--max-words N` | 650 | Target words per chunk |
| `--overlap-words N` | 100 | Overlap words between chunks |
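
To make the effect of `--max-words` and `--overlap-words` concrete, here is a simplified sliding-window sketch. It splits on whitespace only and ignores headings, whereas the real pipeline also tracks a `section` per chunk, so treat it as an approximation; `chunk_words` is a hypothetical name.

```python
def chunk_words(text: str, max_words: int = 650, overlap_words: int = 100) -> list[str]:
    """Split text into windows of up to max_words words.

    Each window repeats the last overlap_words words of the previous one.
    """
    if not 0 <= overlap_words < max_words:
        raise ValueError("overlap_words must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap_words  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(1500))
    for i, chunk in enumerate(chunk_words(sample)):
        print(i, len(chunk.split()))
```

With the defaults, each window advances 550 words, so a 1,500-word article yields three chunks starting at words 0, 550, and 1100.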

## Model Detection

The pipeline uses `orangepi_models.json` to detect canonical Orange Pi product names in article text. Each model in the dictionary can list several aliases (e.g., `"Orange Pi 5"`, `"OrangePi 5"`, `"OPi 5"`), and aliases are resolved longest match first, so a longer model name is not also counted as the shorter name it contains.
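
One way longest-match-first resolution could work is sketched below. The dictionary shape (canonical name mapped to a list of aliases), the sample `MODELS` entries, and the function names are assumptions; the actual schema of `orangepi_models.json` and the matching code are not shown in this README.

```python
import re

# Hypothetical dictionary shape: canonical name -> aliases.
MODELS = {
    "Orange Pi 5": ["OrangePi 5", "OPi 5"],
    "Orange Pi 5 Plus": ["OrangePi 5 Plus", "OPi 5 Plus"],
}


def build_alias_pattern(models: dict[str, list[str]]) -> tuple[re.Pattern, dict[str, str]]:
    """Compile one regex over all aliases, with the longest alias first."""
    alias_to_canonical = {
        alias.lower(): canonical
        for canonical, aliases in models.items()
        for alias in [canonical, *aliases]
    }
    # Regex alternation tries branches left to right, so sorting by length
    # makes the longest alias win at any given position.
    ordered = sorted(alias_to_canonical, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(a) for a in ordered) + r")(?!\w)",
        re.IGNORECASE,
    )
    return pattern, alias_to_canonical


def detect_models(text: str, models: dict[str, list[str]]) -> list[str]:
    """Return canonical model names in order of first appearance."""
    pattern, alias_to_canonical = build_alias_pattern(models)
    found = []
    for match in pattern.finditer(text):
        canonical = alias_to_canonical[match.group(0).lower()]
        if canonical not in found:
            found.append(canonical)
    return found


if __name__ == "__main__":
    print(detect_models("Install Ubuntu on OPi 5 Plus, then compare with orangepi 5.", MODELS))
```

Because `finditer` returns non-overlapping matches, text consumed by the longer alias is never re-scanned for the shorter one, which is what avoids the double count.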

## Use Cases

- **Semantic search** over Vietnamese Orange Pi knowledge
- **Q&A bots** for Orange Pi tutorials, OS installs, hardware guides
- **Product recommendation** based on article content
- **Fine-tuning** Vietnamese embedding models on SBC/embedded computing content

## License

The data is sourced from [orangepi.vn](https://orangepi.vn). Check the site for its content usage terms.