Files
orangepi-rag/README.md
T
2026-06-12 11:37:39 +07:00

5.5 KiB

Blog RAG Toolkit

A complete RAG (Retrieval-Augmented Generation) pipeline: crawl any blog, extract keywords, chunk content, and query with an LLM.

Components

File Purpose
crawl_blog.py Generic blog crawler (sitemap + Firecrawl)
crawl_orangepi_blog.py OrangePi.vn-specific crawler
rag_app.py RAG query application (FAISS + LLM)
keywords_example.json Sample keyword dictionary

Quick Start

1. Install

pip install -r requirements.txt

2. Set API key

export FIRECRAWL_API_KEY="fc-..."
# or put in .env file:
echo "FIRECRAWL_API_KEY=fc-..." > .env

3. Crawl a blog

# Crawl 5 articles from any WordPress blog
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --limit 5

# Crawl all articles with custom keywords
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --keywords keywords.json

# Output to custom directory
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --out-dir ./my_blog_data

4. Build index & query

# Build FAISS index
python rag_app.py --build --data-dir ./my_blog_data --index-dir ./my_index

# Query (requires OPENAI_API_KEY)
export OPENAI_API_KEY="sk-..."
python rag_app.py --query "How to install Docker?" --data-dir ./my_blog_data --index-dir ./my_index

# Interactive chat
python rag_app.py --interactive --data-dir ./my_blog_data --index-dir ./my_index

crawl_blog.py — Generic Blog Crawler

Crawls any blog that exposes a sitemap (WordPress, Yoast, etc.).

Usage

python crawl_blog.py --sitemap <SITEMAP_URL> [options]

Options

Argument Default Description
--sitemap (required) Sitemap URL
--out-dir ./blog_data Output directory
--keywords <out-dir>/keywords.json Keywords JSON path
--limit N 5 Process first N articles
--all Process all articles
--sleep SEC 1.0 Delay between Firecrawl calls
--force Re-scrape cached articles
--max-words N 650 Target words per chunk
--overlap-words N 100 Overlap words between chunks
--language en Default language code

Output files

File Description
articles.jsonl Article records with keyword mentions
chunks.jsonl Chunked content for embedding
keywords.json Keyword dictionary used
urls.json Discovered URLs
raw/<slug>.json Raw Firecrawl responses
markdown/<slug>.md Cleaned markdown
errors.jsonl Failed URLs
summary.json Crawl summary

keywords.json — Keyword Dictionary

Defines keywords to extract from crawled content. Supports categorized or flat format.

[
  {
    "category": "hardware",
    "keywords": ["Raspberry Pi", "Arduino", "ESP32"]
  },
  {
    "category": "software",
    "keywords": ["Docker", "Ubuntu", "Home Assistant"]
  }
]

Flat format

["Raspberry Pi", "Docker", "Home Assistant", "MQTT"]

See keywords_example.json for a complete template.


rag_app.py — RAG Query Application

FAISS-based vector search + LLM generation.

Usage

# Build index (one-time)
python rag_app.py --build --data-dir ./blog_data --index-dir ./index

# Single query
python rag_app.py --query "Câu hỏi của bạn" --data-dir ./blog_data --index-dir ./index

# Interactive chat
python rag_app.py --interactive --data-dir ./blog_data --index-dir ./index

# Test retrieval only (no LLM needed)
python rag_app.py --query "test" --retrieve-only --data-dir ./blog_data --index-dir ./index

Options

Argument Default Description
--data-dir . Directory with chunks.jsonl
--index-dir ./rag_index FAISS index directory
--build Build index from chunks
--query Query to answer
--interactive Interactive chat mode
--retrieve-only Test retrieval without LLM
--top-k 5 Number of chunks to retrieve
--embed-model paraphrase-multilingual-MiniLM-L12-v2 Embedding model
--llm-model gpt-4o-mini LLM model name
--llm-base-url https://api.openai.com/v1 LLM API base URL

LLM API configuration

Set in .env:

OPENAI_API_KEY=sk-...
# Or for other providers:
# LLM_BASE_URL=https://api.together.xyz/v1
# LLM_MODEL=meta-llama/Llama-3-70b-chat-hf

Compatible with any OpenAI-format API: OpenAI, Together.ai, Groq, Ollama, etc.


crawl_orangepi_blog.py — OrangePi-specific Crawler

Specialized crawler for orangepi.vn with Orange Pi model detection.

python crawl_orangepi_blog.py --limit 5
python crawl_orangepi_blog.py --all

Uses orangepi_models.json for product mention detection (36 Orange Pi models with aliases).


Architecture

Blog (sitemap)
    │
    ▼
crawl_blog.py ──► Firecrawl API ──► articles.jsonl
    │                                  chunks.jsonl
    │                                  keywords.json
    │                                  raw/*.json
    │                                  markdown/*.md
    ▼
rag_app.py
    │
    ├──► SentenceTransformer (embeddings)
    ├──► FAISS (vector index)
    └──► LLM API (generation)
            │
            ▼
        Answer + sources

License

Data sourced from respective blogs. Check each site for content usage terms.