update crawl blog

2026-06-12 11:37:39 +07:00
parent 3ebf6f450d
commit 65d2cae6ca
5 changed files with 982 additions and 71 deletions
@@ -1,106 +1,215 @@
-# OrangePi RAG Dataset
+# Blog RAG Toolkit

-A **Vietnamese-language** RAG (Retrieval-Augmented Generation) data pipeline that crawls, extracts, and chunks blog articles from [orangepi.vn](https://orangepi.vn) — the official Orange Pi distributor in Vietnam.
+A complete RAG (Retrieval-Augmented Generation) pipeline: **crawl** any blog, **extract** keywords, **chunk** content, and **query** with an LLM.

-## Dataset Summary
+## Components

-| Metric       | Value |
-|-------------|-------|
-| Articles     | 199   |
-| Chunks       | 472   |
-| Models       | 36    |
-| Language     | vi    |
-| Last crawl   | 2026-06-11 |
+| File | Purpose |
+|------|---------|
+| `crawl_blog.py` | Generic blog crawler (sitemap + Firecrawl) |
+| `crawl_orangepi_blog.py` | OrangePi.vn-specific crawler |
+| `rag_app.py` | RAG query application (FAISS + LLM) |
+| `keywords_example.json` | Sample keyword dictionary |

-## Output Files
+## Quick Start

-| File | Description |
-|------|-------------|
-| `articles.jsonl` | Full article records (title, description, markdown, text, product mentions, topic, metadata) |
-| `chunks.jsonl` | Overlapping text chunks (~650 words, ~100 overlap) with metadata for embedding |
-| `urls.json` | Discovered sitemap URLs with `lastmod` timestamps |
-| `raw/<slug>.json` | Raw Firecrawl API scrape response per article |
-| `markdown/<slug>.md` | Cleaned markdown per article |
-| `orangepi_models.json` | Canonical Orange Pi model dictionary with aliases |
-| `errors.jsonl` | Failed URLs and error details |
-| `summary.json` | Crawl summary statistics |
-
-### Chunk metadata
-
-Each chunk in `chunks.jsonl` includes:
-
- `chunk_id` — unique ID (`{article_id}__chunk_{seq}`)
- `article_id` — source article reference
- `content` — chunk text (markdown)
- `section` — nearest heading context
- `metadata.product_mentions` — canonical Orange Pi models mentioned
- `metadata.topic` — inferred topic (e.g., "home assistant", "linux", "docker")
-
-## Usage
-
-### Prerequisites
-
- Python 3.10+
- A [Firecrawl](https://www.firecrawl.dev) API key
-
-### Install
+### 1. Install

 ```bash
-git clone <repo-url>
-cd orangepi-rag
-# No external dependencies beyond Python stdlib
+pip install -r requirements.txt
 ```

-### Set API key
+### 2. Set API key

 ```bash
 export FIRECRAWL_API_KEY="fc-..."
+# or put in .env file:
+echo "FIRECRAWL_API_KEY=fc-..." > .env
 ```

-Or place it in `/home/admin/.hermes/.env`:
-
-```
-FIRECRAWL_API_KEY=fc-...
-```
-
-### Run crawl
+### 3. Crawl a blog

 ```bash
-# Quick test — process first 5 articles
-python3 crawl_orangepi_blog.py --limit 5
+# Crawl 5 articles from any WordPress blog
+python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --limit 5

-# Full crawl — all discovered articles
-python3 crawl_orangepi_blog.py --all
+# Crawl all articles with custom keywords
+python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --keywords keywords.json

-# Re-scrape everything (overwrites existing raw files)
-python3 crawl_orangepi_blog.py --all --force
+# Output to custom directory
+python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --out-dir ./my_blog_data
+```
+
+### 4. Build index & query
+
+```bash
+# Build FAISS index
+python rag_app.py --build --data-dir ./my_blog_data --index-dir ./my_index
+
+# Query (requires OPENAI_API_KEY)
+export OPENAI_API_KEY="sk-..."
+python rag_app.py --query "How to install Docker?" --data-dir ./my_blog_data --index-dir ./my_index
+
+# Interactive chat
+python rag_app.py --interactive --data-dir ./my_blog_data --index-dir ./my_index
+```
+
+---
+
+## crawl_blog.py — Generic Blog Crawler
+
+Crawls any blog that exposes a sitemap (WordPress, Yoast, etc.).
+
+### Usage
+
+```bash
+python crawl_blog.py --sitemap <SITEMAP_URL> [options]
 ```

 ### Options

 | Argument | Default | Description |
 |----------|---------|-------------|
+| `--sitemap` | (required) | Sitemap URL |
+| `--out-dir` | `./blog_data` | Output directory |
+| `--keywords` | `<out-dir>/keywords.json` | Keywords JSON path |
 | `--limit N` | 5 | Process first N articles |
-| `--all` | — | Process all discovered articles |
-| `--out-dir PATH` | `/mnt/ssd/orangepi-rag` | Output directory |
-| `--models PATH` | `<out-dir>/orangepi_models.json` | Model dictionary path |
-| `--sitemap URL` | `https://orangepi.vn/post-sitemap.xml` | Sitemap URL |
+| `--all` | — | Process all articles |
 | `--sleep SEC` | 1.0 | Delay between Firecrawl calls |
 | `--force` | — | Re-scrape cached articles |
 | `--max-words N` | 650 | Target words per chunk |
 | `--overlap-words N` | 100 | Overlap words between chunks |
+| `--language` | `en` | Default language code |

-## Model Detection
+### Output files

-The pipeline uses `orangepi_models.json` to detect canonical Orange Pi product names in article text. The dictionary supports aliases per model (e.g., `"Orange Pi 5"`, `"OrangePi 5"`, `"OPi 5"`) and longest-match-first resolution to prevent false double-counts.
+| File | Description |
+|------|-------------|
+| `articles.jsonl` | Article records with keyword mentions |
+| `chunks.jsonl` | Chunked content for embedding |
+| `keywords.json` | Keyword dictionary used |
+| `urls.json` | Discovered URLs |
+| `raw/<slug>.json` | Raw Firecrawl responses |
+| `markdown/<slug>.md` | Cleaned markdown |
+| `errors.jsonl` | Failed URLs |
+| `summary.json` | Crawl summary |

-## Use Cases
+---

- **Semantic search** over Vietnamese Orange Pi knowledge
- **Q&A bots** for Orange Pi tutorials, OS installs, hardware guides
- **Product recommendation** based on article content
- **Fine-tuning** Vietnamese embedding models on SBC/embedded computing content
+## keywords.json — Keyword Dictionary
+
+Defines keywords to extract from crawled content. Supports categorized or flat format.
+
+### Categorized format (recommended)
+
+```json
+[
+  {
+    "category": "hardware",
+    "keywords": ["Raspberry Pi", "Arduino", "ESP32"]
+  },
+  {
+    "category": "software",
+    "keywords": ["Docker", "Ubuntu", "Home Assistant"]
+  }
+]
+```
+
+### Flat format
+
+```json
+["Raspberry Pi", "Docker", "Home Assistant", "MQTT"]
+```
+
+See `keywords_example.json` for a complete template.
+
+---
+
+## rag_app.py — RAG Query Application
+
+FAISS-based vector search + LLM generation.
+
+### Usage
+
+```bash
+# Build index (one-time)
+python rag_app.py --build --data-dir ./blog_data --index-dir ./index
+
+# Single query
+python rag_app.py --query "Câu hỏi của bạn" --data-dir ./blog_data --index-dir ./index
+
+# Interactive chat
+python rag_app.py --interactive --data-dir ./blog_data --index-dir ./index
+
+# Test retrieval only (no LLM needed)
+python rag_app.py --query "test" --retrieve-only --data-dir ./blog_data --index-dir ./index
+```
+
+### Options
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--data-dir` | `.` | Directory with chunks.jsonl |
+| `--index-dir` | `./rag_index` | FAISS index directory |
+| `--build` | — | Build index from chunks |
+| `--query` | — | Query to answer |
+| `--interactive` | — | Interactive chat mode |
+| `--retrieve-only` | — | Test retrieval without LLM |
+| `--top-k` | 5 | Number of chunks to retrieve |
+| `--embed-model` | `paraphrase-multilingual-MiniLM-L12-v2` | Embedding model |
+| `--llm-model` | `gpt-4o-mini` | LLM model name |
+| `--llm-base-url` | `https://api.openai.com/v1` | LLM API base URL |
+
+### LLM API configuration
+
+Set in `.env`:
+
+```bash
+OPENAI_API_KEY=sk-...
+# Or for other providers:
+# LLM_BASE_URL=https://api.together.xyz/v1
+# LLM_MODEL=meta-llama/Llama-3-70b-chat-hf
+```
+
+Compatible with any OpenAI-format API: OpenAI, Together.ai, Groq, Ollama, etc.
+
+---
+
+## crawl_orangepi_blog.py — OrangePi-specific Crawler
+
+Specialized crawler for orangepi.vn with Orange Pi model detection.
+
+```bash
+python crawl_orangepi_blog.py --limit 5
+python crawl_orangepi_blog.py --all
+```
+
+Uses `orangepi_models.json` for product mention detection (36 Orange Pi models with aliases).
+
+---
+
+## Architecture
+
+```
+Blog (sitemap)
+    │
+    ▼
+crawl_blog.py ──► Firecrawl API ──► articles.jsonl
+    │                                  chunks.jsonl
+    │                                  keywords.json
+    │                                  raw/*.json
+    │                                  markdown/*.md
+    ▼
+rag_app.py
+    │
+    ├──► SentenceTransformer (embeddings)
+    ├──► FAISS (vector index)
+    └──► LLM API (generation)
+            │
+            ▼
+        Answer + sources
+```

 ## License

-Data sourced from [orangepi.vn](https://orangepi.vn). Check their site for content usage terms.
+Data sourced from respective blogs. Check each site for content usage terms.