# Blog RAG Toolkit

A complete RAG (Retrieval-Augmented Generation) pipeline: **crawl** any blog, **extract** keywords, **chunk** content, and **query** with an LLM.

## Components

| File | Purpose |
|------|---------|
| `crawl_blog.py` | Generic blog crawler (sitemap + Firecrawl) |
| `crawl_orangepi_blog.py` | OrangePi.vn-specific crawler |
| `rag_app.py` | RAG query application (FAISS + LLM) |
| `keywords_example.json` | Sample keyword dictionary |

## Quick Start

### 1. Install

```bash
pip install -r requirements.txt
```

### 2. Set API key

```bash
export FIRECRAWL_API_KEY="fc-..."
# or put in .env file:
echo "FIRECRAWL_API_KEY=fc-..." > .env
```

### 3. Crawl a blog

```bash
# Crawl 5 articles from any WordPress blog
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --limit 5

# Crawl all articles with custom keywords
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --keywords keywords.json

# Output to custom directory
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --out-dir ./my_blog_data
```

### 4. Build index & query

```bash
# Build FAISS index
python rag_app.py --build --data-dir ./my_blog_data --index-dir ./my_index

# Query (requires OPENAI_API_KEY)
export OPENAI_API_KEY="sk-..."
python rag_app.py --query "How to install Docker?" --data-dir ./my_blog_data --index-dir ./my_index

# Interactive chat
python rag_app.py --interactive --data-dir ./my_blog_data --index-dir ./my_index
```

---

## crawl_blog.py — Generic Blog Crawler

Crawls any blog that exposes a sitemap (WordPress, Yoast, etc.).

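As a rough illustration of the sitemap discovery step, the sketch below pulls `<loc>` URLs out of a sitemap XML string using only the standard library. The function name `extract_sitemap_urls` and the sample XML are hypothetical; the actual logic in `crawl_blog.py` may differ, for example in how it fetches the sitemap or handles sitemap index files.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text, limit=None):
    """Return page URLs listed in a <urlset> sitemap, in document order."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    # limit=None mirrors "--all"; an integer mirrors "--limit N".
    return urls if limit is None else urls[:limit]


# Hypothetical sample sitemap (no network access needed).
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/post-a/</loc></url>
  <url><loc>https://example.com/post-b/</loc></url>
  <url><loc>https://example.com/post-c/</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE, limit=2):
        print(url)
```

Keeping the parser separate from the HTTP fetch makes it easy to test against saved sitemap files before spending Firecrawl credits.
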
### Usage

```bash
python crawl_blog.py --sitemap [options]
```

### Options

| Argument | Default | Description |
|----------|---------|-------------|
| `--sitemap` | (required) | Sitemap URL |
| `--out-dir` | `./blog_data` | Output directory |
| `--keywords` | `/keywords.json` | Keywords JSON path |
| `--limit N` | 5 | Process first N articles |
| `--all` | — | Process all articles |
| `--sleep SEC` | 1.0 | Delay between Firecrawl calls |
| `--force` | — | Re-scrape cached articles |
| `--max-words N` | 650 | Target words per chunk |
| `--overlap-words N` | 100 | Overlap words between chunks |
| `--language` | `en` | Default language code |

### Output files

| File | Description |
|------|-------------|
| `articles.jsonl` | Article records with keyword mentions |
| `chunks.jsonl` | Chunked content for embedding |
| `keywords.json` | Keyword dictionary used |
| `urls.json` | Discovered URLs |
| `raw/.json` | Raw Firecrawl responses |
| `markdown/.md` | Cleaned markdown |
| `errors.jsonl` | Failed URLs |
| `summary.json` | Crawl summary |

---

## keywords.json — Keyword Dictionary

Defines keywords to extract from crawled content. Supports categorized or flat format.

### Categorized format (recommended)

```json
[
  {
    "category": "hardware",
    "keywords": ["Raspberry Pi", "Arduino", "ESP32"]
  },
  {
    "category": "software",
    "keywords": ["Docker", "Ubuntu", "Home Assistant"]
  }
]
```

### Flat format

```json
["Raspberry Pi", "Docker", "Home Assistant", "MQTT"]
```

See `keywords_example.json` for a complete template.

---

## rag_app.py — RAG Query Application

FAISS-based vector search + LLM generation.

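To show what the vector search step does conceptually, here is a minimal brute-force cosine-similarity top-k search in plain Python. It is a stand-in, not FAISS itself and not the code in `rag_app.py`; the toy 3-dimensional vectors, the chunk ids, and the `top_k_chunks` function are all hypothetical. In the real pipeline the embeddings would come from the SentenceTransformer model and the lookup from the FAISS index.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunk_vecs, k=5):
    """Return up to k (chunk_id, score) pairs, highest similarity first."""
    scored = [(cid, cosine(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumed for illustration); real ones have far more dimensions.
CHUNKS = {
    "docker-install#0": [0.9, 0.1, 0.0],
    "mqtt-basics#0": [0.0, 0.2, 0.9],
    "docker-compose#1": [0.7, 0.6, 0.1],
}

if __name__ == "__main__":
    # A query vector pointing along the first axis, standing in for an embedded question.
    for chunk_id, score in top_k_chunks([1.0, 0.0, 0.0], CHUNKS, k=2):
        print(f"{chunk_id}: {score:.3f}")
```

The `k` argument plays the same role as `--top-k`: it caps how many chunks are passed to the LLM as context.
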
### Usage

```bash
# Build index (one-time)
python rag_app.py --build --data-dir ./blog_data --index-dir ./index

# Single query
python rag_app.py --query "Your question" --data-dir ./blog_data --index-dir ./index

# Interactive chat
python rag_app.py --interactive --data-dir ./blog_data --index-dir ./index

# Test retrieval only (no LLM needed)
python rag_app.py --query "test" --retrieve-only --data-dir ./blog_data --index-dir ./index
```

### Options

| Argument | Default | Description |
|----------|---------|-------------|
| `--data-dir` | `.` | Directory with chunks.jsonl |
| `--index-dir` | `./rag_index` | FAISS index directory |
| `--build` | — | Build index from chunks |
| `--query` | — | Query to answer |
| `--interactive` | — | Interactive chat mode |
| `--retrieve-only` | — | Test retrieval without LLM |
| `--top-k` | 5 | Number of chunks to retrieve |
| `--embed-model` | `paraphrase-multilingual-MiniLM-L12-v2` | Embedding model |
| `--llm-model` | `gpt-4o-mini` | LLM model name |
| `--llm-base-url` | `https://api.openai.com/v1` | LLM API base URL |

### LLM API configuration

Set in `.env`:

```bash
OPENAI_API_KEY=sk-...
# Or for other providers:
# LLM_BASE_URL=https://api.together.xyz/v1
# LLM_MODEL=meta-llama/Llama-3-70b-chat-hf
```

Compatible with any OpenAI-format API: OpenAI, Together.ai, Groq, Ollama, etc.

---

## crawl_orangepi_blog.py — OrangePi-specific Crawler

Specialized crawler for orangepi.vn with Orange Pi model detection.

```bash
python crawl_orangepi_blog.py --limit 5
python crawl_orangepi_blog.py --all
```

Uses `orangepi_models.json` for product mention detection (36 Orange Pi models with aliases).

---

## Architecture

```
Blog (sitemap)
    │
    ▼
crawl_blog.py ──► Firecrawl API ──► articles.jsonl
    │                               chunks.jsonl
    │                               keywords.json
    │                               raw/*.json
    │                               markdown/*.md
    ▼
rag_app.py
    │
    ├──► SentenceTransformer (embeddings)
    ├──► FAISS (vector index)
    └──► LLM API (generation)
              │
              ▼
        Answer + sources
```

## License

Data sourced from respective blogs.
Check each site for content usage terms.