Files

T

2026-06-12 11:37:39 +07:00

5.5 KiB

Raw Blame History

Blog RAG Toolkit

A complete RAG (Retrieval-Augmented Generation) pipeline: crawl any blog, extract keywords, chunk content, and query with an LLM.

Components

File	Purpose
`crawl_blog.py`	Generic blog crawler (sitemap + Firecrawl)
`crawl_orangepi_blog.py`	OrangePi.vn-specific crawler
`rag_app.py`	RAG query application (FAISS + LLM)
`keywords_example.json`	Sample keyword dictionary

Quick Start

1. Install

pip install -r requirements.txt

2. Set API key

export FIRECRAWL_API_KEY="fc-..."
# or put in .env file:
echo "FIRECRAWL_API_KEY=fc-..." > .env

3. Crawl a blog

# Crawl 5 articles from any WordPress blog
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --limit 5

# Crawl all articles with custom keywords
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --keywords keywords.json

# Output to custom directory
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --out-dir ./my_blog_data

4. Build index & query

# Build FAISS index
python rag_app.py --build --data-dir ./my_blog_data --index-dir ./my_index

# Query (requires OPENAI_API_KEY)
export OPENAI_API_KEY="sk-..."
python rag_app.py --query "How to install Docker?" --data-dir ./my_blog_data --index-dir ./my_index

# Interactive chat
python rag_app.py --interactive --data-dir ./my_blog_data --index-dir ./my_index

crawl_blog.py — Generic Blog Crawler

Crawls any blog that exposes a sitemap (WordPress, Yoast, etc.).

Usage

python crawl_blog.py --sitemap <SITEMAP_URL> [options]

Options

Argument	Default	Description
`--sitemap`	(required)	Sitemap URL
`--out-dir`	`./blog_data`	Output directory
`--keywords`	`<out-dir>/keywords.json`	Keywords JSON path
`--limit N`	5	Process first N articles
`--all`	—	Process all articles
`--sleep SEC`	1.0	Delay between Firecrawl calls
`--force`	—	Re-scrape cached articles
`--max-words N`	650	Target words per chunk
`--overlap-words N`	100	Overlap words between chunks
`--language`	`en`	Default language code

Output files

File	Description
`articles.jsonl`	Article records with keyword mentions
`chunks.jsonl`	Chunked content for embedding
`keywords.json`	Keyword dictionary used
`urls.json`	Discovered URLs
`raw/<slug>.json`	Raw Firecrawl responses
`markdown/<slug>.md`	Cleaned markdown
`errors.jsonl`	Failed URLs
`summary.json`	Crawl summary

keywords.json — Keyword Dictionary

Defines keywords to extract from crawled content. Supports categorized or flat format.

Categorized format (recommended)

[
  {
    "category": "hardware",
    "keywords": ["Raspberry Pi", "Arduino", "ESP32"]
  },
  {
    "category": "software",
    "keywords": ["Docker", "Ubuntu", "Home Assistant"]
  }
]

Flat format

["Raspberry Pi", "Docker", "Home Assistant", "MQTT"]

See keywords_example.json for a complete template.

rag_app.py — RAG Query Application

FAISS-based vector search + LLM generation.

Usage

# Build index (one-time)
python rag_app.py --build --data-dir ./blog_data --index-dir ./index

# Single query
python rag_app.py --query "Câu hỏi của bạn" --data-dir ./blog_data --index-dir ./index

# Interactive chat
python rag_app.py --interactive --data-dir ./blog_data --index-dir ./index

# Test retrieval only (no LLM needed)
python rag_app.py --query "test" --retrieve-only --data-dir ./blog_data --index-dir ./index

Options

Argument	Default	Description
`--data-dir`	`.`	Directory with chunks.jsonl
`--index-dir`	`./rag_index`	FAISS index directory
`--build`	—	Build index from chunks
`--query`	—	Query to answer
`--interactive`	—	Interactive chat mode
`--retrieve-only`	—	Test retrieval without LLM
`--top-k`	5	Number of chunks to retrieve
`--embed-model`	`paraphrase-multilingual-MiniLM-L12-v2`	Embedding model
`--llm-model`	`gpt-4o-mini`	LLM model name
`--llm-base-url`	`https://api.openai.com/v1`	LLM API base URL

LLM API configuration

Set in .env:

OPENAI_API_KEY=sk-...
# Or for other providers:
# LLM_BASE_URL=https://api.together.xyz/v1
# LLM_MODEL=meta-llama/Llama-3-70b-chat-hf

Compatible with any OpenAI-format API: OpenAI, Together.ai, Groq, Ollama, etc.

crawl_orangepi_blog.py — OrangePi-specific Crawler

Specialized crawler for orangepi.vn with Orange Pi model detection.

python crawl_orangepi_blog.py --limit 5
python crawl_orangepi_blog.py --all

Uses orangepi_models.json for product mention detection (36 Orange Pi models with aliases).

Architecture

Blog (sitemap)
    │
    ▼
crawl_blog.py ──► Firecrawl API ──► articles.jsonl
    │                                  chunks.jsonl
    │                                  keywords.json
    │                                  raw/*.json
    │                                  markdown/*.md
    ▼
rag_app.py
    │
    ├──► SentenceTransformer (embeddings)
    ├──► FAISS (vector index)
    └──► LLM API (generation)
            │
            ▼
        Answer + sources

License

Data sourced from respective blogs. Check each site for content usage terms.

5.5 KiB Raw Blame History

Blog RAG Toolkit

Components

Quick Start

1. Install

2. Set API key

3. Crawl a blog

4. Build index & query

crawl_blog.py — Generic Blog Crawler

Usage

Options

Output files

keywords.json — Keyword Dictionary

Categorized format (recommended)

Flat format

rag_app.py — RAG Query Application

Usage

Options

LLM API configuration

crawl_orangepi_blog.py — OrangePi-specific Crawler

Architecture

License

5.5 KiB

Raw Blame History