update crawl blog
This commit is contained in:
@@ -1,106 +1,215 @@
|
||||
# OrangePi RAG Dataset
|
||||
# Blog RAG Toolkit
|
||||
|
||||
A **Vietnamese-language** RAG (Retrieval-Augmented Generation) data pipeline that crawls, extracts, and chunks blog articles from [orangepi.vn](https://orangepi.vn) — the official Orange Pi distributor in Vietnam.
|
||||
A complete RAG (Retrieval-Augmented Generation) pipeline: **crawl** any blog, **extract** keywords, **chunk** content, and **query** with an LLM.
|
||||
|
||||
## Dataset Summary
|
||||
## Components
|
||||
|
||||
| Metric | Value |
|
||||
|-------------|-------|
|
||||
| Articles | 199 |
|
||||
| Chunks | 472 |
|
||||
| Models | 36 |
|
||||
| Language | vi |
|
||||
| Last crawl | 2026-06-11 |
|
||||
| File | Purpose |
|
||||
|------|---------|
|
||||
| `crawl_blog.py` | Generic blog crawler (sitemap + Firecrawl) |
|
||||
| `crawl_orangepi_blog.py` | OrangePi.vn-specific crawler |
|
||||
| `rag_app.py` | RAG query application (FAISS + LLM) |
|
||||
| `keywords_example.json` | Sample keyword dictionary |
|
||||
|
||||
## Output Files
|
||||
## Quick Start
|
||||
|
||||
| File | Description |
|
||||
|------|-------------|
|
||||
| `articles.jsonl` | Full article records (title, description, markdown, text, product mentions, topic, metadata) |
|
||||
| `chunks.jsonl` | Overlapping text chunks (~650 words, ~100 overlap) with metadata for embedding |
|
||||
| `urls.json` | Discovered sitemap URLs with `lastmod` timestamps |
|
||||
| `raw/<slug>.json` | Raw Firecrawl API scrape response per article |
|
||||
| `markdown/<slug>.md` | Cleaned markdown per article |
|
||||
| `orangepi_models.json` | Canonical Orange Pi model dictionary with aliases |
|
||||
| `errors.jsonl` | Failed URLs and error details |
|
||||
| `summary.json` | Crawl summary statistics |
|
||||
|
||||
### Chunk metadata
|
||||
|
||||
Each chunk in `chunks.jsonl` includes:
|
||||
|
||||
- `chunk_id` — unique ID (`{article_id}__chunk_{seq}`)
|
||||
- `article_id` — source article reference
|
||||
- `content` — chunk text (markdown)
|
||||
- `section` — nearest heading context
|
||||
- `metadata.product_mentions` — canonical Orange Pi models mentioned
|
||||
- `metadata.topic` — inferred topic (e.g., "home assistant", "linux", "docker")
|
||||
|
||||
## Usage
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- Python 3.10+
|
||||
- A [Firecrawl](https://www.firecrawl.dev) API key
|
||||
|
||||
### Install
|
||||
### 1. Install
|
||||
|
||||
```bash
|
||||
git clone <repo-url>
|
||||
cd orangepi-rag
|
||||
# No external dependencies beyond Python stdlib
|
||||
pip install -r requirements.txt
|
||||
```
|
||||
|
||||
### Set API key
|
||||
### 2. Set API key
|
||||
|
||||
```bash
|
||||
export FIRECRAWL_API_KEY="fc-..."
|
||||
# or put in .env file:
|
||||
echo "FIRECRAWL_API_KEY=fc-..." > .env
|
||||
```
|
||||
|
||||
Or place it in `/home/admin/.hermes/.env`:
|
||||
|
||||
```
|
||||
FIRECRAWL_API_KEY=fc-...
|
||||
```
|
||||
|
||||
### Run crawl
|
||||
### 3. Crawl a blog
|
||||
|
||||
```bash
|
||||
# Quick test — process first 5 articles
|
||||
python3 crawl_orangepi_blog.py --limit 5
|
||||
# Crawl 5 articles from any WordPress blog
|
||||
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --limit 5
|
||||
|
||||
# Full crawl — all discovered articles
|
||||
python3 crawl_orangepi_blog.py --all
|
||||
# Crawl all articles with custom keywords
|
||||
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --keywords keywords.json
|
||||
|
||||
# Re-scrape everything (overwrites existing raw files)
|
||||
python3 crawl_orangepi_blog.py --all --force
|
||||
# Output to custom directory
|
||||
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --out-dir ./my_blog_data
|
||||
```
|
||||
|
||||
### 4. Build index & query
|
||||
|
||||
```bash
|
||||
# Build FAISS index
|
||||
python rag_app.py --build --data-dir ./my_blog_data --index-dir ./my_index
|
||||
|
||||
# Query (requires OPENAI_API_KEY)
|
||||
export OPENAI_API_KEY="sk-..."
|
||||
python rag_app.py --query "How to install Docker?" --data-dir ./my_blog_data --index-dir ./my_index
|
||||
|
||||
# Interactive chat
|
||||
python rag_app.py --interactive --data-dir ./my_blog_data --index-dir ./my_index
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## crawl_blog.py — Generic Blog Crawler
|
||||
|
||||
Crawls any blog that exposes a sitemap (WordPress, Yoast, etc.).
|
||||
|
||||
### Usage
|
||||
|
||||
```bash
|
||||
python crawl_blog.py --sitemap <SITEMAP_URL> [options]
|
||||
```
|
||||
|
||||
### Options
|
||||
|
||||
| Argument | Default | Description |
|
||||
|----------|---------|-------------|
|
||||
| `--sitemap` | (required) | Sitemap URL |
|
||||
| `--out-dir` | `./blog_data` | Output directory |
|
||||
| `--keywords` | `<out-dir>/keywords.json` | Keywords JSON path |
|
||||
| `--limit N` | 5 | Process first N articles |
|
||||
| `--all` | — | Process all discovered articles |
|
||||
| `--out-dir PATH` | `/mnt/ssd/orangepi-rag` | Output directory |
|
||||
| `--models PATH` | `<out-dir>/orangepi_models.json` | Model dictionary path |
|
||||
| `--sitemap URL` | `https://orangepi.vn/post-sitemap.xml` | Sitemap URL |
|
||||
| `--all` | — | Process all articles |
|
||||
| `--sleep SEC` | 1.0 | Delay between Firecrawl calls |
|
||||
| `--force` | — | Re-scrape cached articles |
|
||||
| `--max-words N` | 650 | Target words per chunk |
|
||||
| `--overlap-words N` | 100 | Overlap words between chunks |
|
||||
| `--language` | `en` | Default language code |
|
||||
|
||||
## Model Detection
|
||||
### Output files
|
||||
|
||||
The pipeline uses `orangepi_models.json` to detect canonical Orange Pi product names in article text. The dictionary supports aliases per model (e.g., `"Orange Pi 5"`, `"OrangePi 5"`, `"OPi 5"`) and longest-match-first resolution to prevent false double-counts.
|
||||
| File | Description |
|
||||
|------|-------------|
|
||||
| `articles.jsonl` | Article records with keyword mentions |
|
||||
| `chunks.jsonl` | Chunked content for embedding |
|
||||
| `keywords.json` | Keyword dictionary used |
|
||||
| `urls.json` | Discovered URLs |
|
||||
| `raw/<slug>.json` | Raw Firecrawl responses |
|
||||
| `markdown/<slug>.md` | Cleaned markdown |
|
||||
| `errors.jsonl` | Failed URLs |
|
||||
| `summary.json` | Crawl summary |
|
||||
|
||||
## Use Cases
|
||||
---
|
||||
|
||||
- **Semantic search** over Vietnamese Orange Pi knowledge
|
||||
- **Q&A bots** for Orange Pi tutorials, OS installs, hardware guides
|
||||
- **Product recommendation** based on article content
|
||||
- **Fine-tuning** Vietnamese embedding models on SBC/embedded computing content
|
||||
## keywords.json — Keyword Dictionary
|
||||
|
||||
Defines keywords to extract from crawled content. Supports categorized or flat format.
|
||||
|
||||
### Categorized format (recommended)
|
||||
|
||||
```json
|
||||
[
|
||||
{
|
||||
"category": "hardware",
|
||||
"keywords": ["Raspberry Pi", "Arduino", "ESP32"]
|
||||
},
|
||||
{
|
||||
"category": "software",
|
||||
"keywords": ["Docker", "Ubuntu", "Home Assistant"]
|
||||
}
|
||||
]
|
||||
```
|
||||
|
||||
### Flat format
|
||||
|
||||
```json
|
||||
["Raspberry Pi", "Docker", "Home Assistant", "MQTT"]
|
||||
```
|
||||
|
||||
See `keywords_example.json` for a complete template.
|
||||
|
||||
---
|
||||
|
||||
## rag_app.py — RAG Query Application
|
||||
|
||||
FAISS-based vector search + LLM generation.
|
||||
|
||||
### Usage
|
||||
|
||||
```bash
|
||||
# Build index (one-time)
|
||||
python rag_app.py --build --data-dir ./blog_data --index-dir ./index
|
||||
|
||||
# Single query
|
||||
python rag_app.py --query "Câu hỏi của bạn" --data-dir ./blog_data --index-dir ./index
|
||||
|
||||
# Interactive chat
|
||||
python rag_app.py --interactive --data-dir ./blog_data --index-dir ./index
|
||||
|
||||
# Test retrieval only (no LLM needed)
|
||||
python rag_app.py --query "test" --retrieve-only --data-dir ./blog_data --index-dir ./index
|
||||
```
|
||||
|
||||
### Options
|
||||
|
||||
| Argument | Default | Description |
|
||||
|----------|---------|-------------|
|
||||
| `--data-dir` | `.` | Directory with chunks.jsonl |
|
||||
| `--index-dir` | `./rag_index` | FAISS index directory |
|
||||
| `--build` | — | Build index from chunks |
|
||||
| `--query` | — | Query to answer |
|
||||
| `--interactive` | — | Interactive chat mode |
|
||||
| `--retrieve-only` | — | Test retrieval without LLM |
|
||||
| `--top-k` | 5 | Number of chunks to retrieve |
|
||||
| `--embed-model` | `paraphrase-multilingual-MiniLM-L12-v2` | Embedding model |
|
||||
| `--llm-model` | `gpt-4o-mini` | LLM model name |
|
||||
| `--llm-base-url` | `https://api.openai.com/v1` | LLM API base URL |
|
||||
|
||||
### LLM API configuration
|
||||
|
||||
Set in `.env`:
|
||||
|
||||
```bash
|
||||
OPENAI_API_KEY=sk-...
|
||||
# Or for other providers:
|
||||
# LLM_BASE_URL=https://api.together.xyz/v1
|
||||
# LLM_MODEL=meta-llama/Llama-3-70b-chat-hf
|
||||
```
|
||||
|
||||
Compatible with any OpenAI-format API: OpenAI, Together.ai, Groq, Ollama, etc.
|
||||
|
||||
---
|
||||
|
||||
## crawl_orangepi_blog.py — OrangePi-specific Crawler
|
||||
|
||||
Specialized crawler for orangepi.vn with Orange Pi model detection.
|
||||
|
||||
```bash
|
||||
python crawl_orangepi_blog.py --limit 5
|
||||
python crawl_orangepi_blog.py --all
|
||||
```
|
||||
|
||||
Uses `orangepi_models.json` for product mention detection (36 Orange Pi models with aliases).
|
||||
|
||||
---
|
||||
|
||||
## Architecture
|
||||
|
||||
```
|
||||
Blog (sitemap)
|
||||
│
|
||||
▼
|
||||
crawl_blog.py ──► Firecrawl API ──► articles.jsonl
|
||||
│ chunks.jsonl
|
||||
│ keywords.json
|
||||
│ raw/*.json
|
||||
│ markdown/*.md
|
||||
▼
|
||||
rag_app.py
|
||||
│
|
||||
├──► SentenceTransformer (embeddings)
|
||||
├──► FAISS (vector index)
|
||||
└──► LLM API (generation)
|
||||
│
|
||||
▼
|
||||
Answer + sources
|
||||
```
|
||||
|
||||
## License
|
||||
|
||||
Data sourced from [orangepi.vn](https://orangepi.vn). Check their site for content usage terms.
|
||||
Data sourced from respective blogs. Check each site for content usage terms.
|
||||
|
||||
Reference in New Issue
Block a user