5.5 KiB
5.5 KiB
Blog RAG Toolkit
A complete RAG (Retrieval-Augmented Generation) pipeline: crawl any blog, extract keywords, chunk content, and query with an LLM.
Components
| File | Purpose |
|---|---|
crawl_blog.py |
Generic blog crawler (sitemap + Firecrawl) |
crawl_orangepi_blog.py |
OrangePi.vn-specific crawler |
rag_app.py |
RAG query application (FAISS + LLM) |
keywords_example.json |
Sample keyword dictionary |
Quick Start
1. Install
pip install -r requirements.txt
2. Set API key
export FIRECRAWL_API_KEY="fc-..."
# or put in .env file:
echo "FIRECRAWL_API_KEY=fc-..." > .env
3. Crawl a blog
# Crawl 5 articles from any WordPress blog
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --limit 5
# Crawl all articles with custom keywords
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --keywords keywords.json
# Output to custom directory
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --out-dir ./my_blog_data
4. Build index & query
# Build FAISS index
python rag_app.py --build --data-dir ./my_blog_data --index-dir ./my_index
# Query (requires OPENAI_API_KEY)
export OPENAI_API_KEY="sk-..."
python rag_app.py --query "How to install Docker?" --data-dir ./my_blog_data --index-dir ./my_index
# Interactive chat
python rag_app.py --interactive --data-dir ./my_blog_data --index-dir ./my_index
crawl_blog.py — Generic Blog Crawler
Crawls any blog that exposes a sitemap (WordPress, Yoast, etc.).
Usage
python crawl_blog.py --sitemap <SITEMAP_URL> [options]
Options
| Argument | Default | Description |
|---|---|---|
--sitemap |
(required) | Sitemap URL |
--out-dir |
./blog_data |
Output directory |
--keywords |
<out-dir>/keywords.json |
Keywords JSON path |
--limit N |
5 | Process first N articles |
--all |
— | Process all articles |
--sleep SEC |
1.0 | Delay between Firecrawl calls |
--force |
— | Re-scrape cached articles |
--max-words N |
650 | Target words per chunk |
--overlap-words N |
100 | Overlap words between chunks |
--language |
en |
Default language code |
Output files
| File | Description |
|---|---|
articles.jsonl |
Article records with keyword mentions |
chunks.jsonl |
Chunked content for embedding |
keywords.json |
Keyword dictionary used |
urls.json |
Discovered URLs |
raw/<slug>.json |
Raw Firecrawl responses |
markdown/<slug>.md |
Cleaned markdown |
errors.jsonl |
Failed URLs |
summary.json |
Crawl summary |
keywords.json — Keyword Dictionary
Defines keywords to extract from crawled content. Supports categorized or flat format.
Categorized format (recommended)
[
{
"category": "hardware",
"keywords": ["Raspberry Pi", "Arduino", "ESP32"]
},
{
"category": "software",
"keywords": ["Docker", "Ubuntu", "Home Assistant"]
}
]
Flat format
["Raspberry Pi", "Docker", "Home Assistant", "MQTT"]
See keywords_example.json for a complete template.
rag_app.py — RAG Query Application
FAISS-based vector search + LLM generation.
Usage
# Build index (one-time)
python rag_app.py --build --data-dir ./blog_data --index-dir ./index
# Single query
python rag_app.py --query "Câu hỏi của bạn" --data-dir ./blog_data --index-dir ./index
# Interactive chat
python rag_app.py --interactive --data-dir ./blog_data --index-dir ./index
# Test retrieval only (no LLM needed)
python rag_app.py --query "test" --retrieve-only --data-dir ./blog_data --index-dir ./index
Options
| Argument | Default | Description |
|---|---|---|
--data-dir |
. |
Directory with chunks.jsonl |
--index-dir |
./rag_index |
FAISS index directory |
--build |
— | Build index from chunks |
--query |
— | Query to answer |
--interactive |
— | Interactive chat mode |
--retrieve-only |
— | Test retrieval without LLM |
--top-k |
5 | Number of chunks to retrieve |
--embed-model |
paraphrase-multilingual-MiniLM-L12-v2 |
Embedding model |
--llm-model |
gpt-4o-mini |
LLM model name |
--llm-base-url |
https://api.openai.com/v1 |
LLM API base URL |
LLM API configuration
Set in .env:
OPENAI_API_KEY=sk-...
# Or for other providers:
# LLM_BASE_URL=https://api.together.xyz/v1
# LLM_MODEL=meta-llama/Llama-3-70b-chat-hf
Compatible with any OpenAI-format API: OpenAI, Together.ai, Groq, Ollama, etc.
crawl_orangepi_blog.py — OrangePi-specific Crawler
Specialized crawler for orangepi.vn with Orange Pi model detection.
python crawl_orangepi_blog.py --limit 5
python crawl_orangepi_blog.py --all
Uses orangepi_models.json for product mention detection (36 Orange Pi models with aliases).
Architecture
Blog (sitemap)
│
▼
crawl_blog.py ──► Firecrawl API ──► articles.jsonl
│ chunks.jsonl
│ keywords.json
│ raw/*.json
│ markdown/*.md
▼
rag_app.py
│
├──► SentenceTransformer (embeddings)
├──► FAISS (vector index)
└──► LLM API (generation)
│
▼
Answer + sources
License
Data sourced from respective blogs. Check each site for content usage terms.