# Blog RAG Toolkit

A complete RAG (Retrieval-Augmented Generation) pipeline: **crawl** any blog, **extract** keywords, **chunk** content, and **query** with an LLM.

## Components

| File | Purpose |
|------|---------|
| `crawl_blog.py` | Generic blog crawler (sitemap + Firecrawl) |
| `crawl_orangepi_blog.py` | OrangePi.vn-specific crawler |
| `rag_app.py` | RAG query application (FAISS + LLM) |
| `keywords_example.json` | Sample keyword dictionary |

## Quick Start

### 1. Install

```bash
pip install -r requirements.txt
```

### 2. Set API key

```bash
export FIRECRAWL_API_KEY="fc-..."
# or put in .env file:
echo "FIRECRAWL_API_KEY=fc-..." > .env
```

### 3. Crawl a blog

```bash
# Crawl 5 articles from any WordPress blog
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --limit 5

# Crawl all articles with custom keywords
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --keywords keywords.json

# Output to custom directory
python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --out-dir ./my_blog_data
```

### 4. Build index & query

```bash
# Build FAISS index
python rag_app.py --build --data-dir ./my_blog_data --index-dir ./my_index

# Query (requires OPENAI_API_KEY)
export OPENAI_API_KEY="sk-..."
python rag_app.py --query "How to install Docker?" --data-dir ./my_blog_data --index-dir ./my_index

# Interactive chat
python rag_app.py --interactive --data-dir ./my_blog_data --index-dir ./my_index
```

---

## crawl_blog.py — Generic Blog Crawler

Crawls any blog that exposes a sitemap (WordPress, Yoast, etc.).

### Usage

```bash
python crawl_blog.py --sitemap <SITEMAP_URL> [options]
```

### Options

| Argument | Default | Description |
|----------|---------|-------------|
| `--sitemap` | (required) | Sitemap URL |
| `--out-dir` | `./blog_data` | Output directory |
| `--keywords` | `<out-dir>/keywords.json` | Keywords JSON path |
| `--limit N` | 5 | Process first N articles |
| `--all` | — | Process all articles |
| `--sleep SEC` | 1.0 | Delay between Firecrawl calls |
| `--force` | — | Re-scrape cached articles |
| `--max-words N` | 650 | Target words per chunk |
| `--overlap-words N` | 100 | Overlap words between chunks |
| `--language` | `en` | Default language code |

### Output files

| File | Description |
|------|-------------|
| `articles.jsonl` | Article records with keyword mentions |
| `chunks.jsonl` | Chunked content for embedding |
| `keywords.json` | Keyword dictionary used |
| `urls.json` | Discovered URLs |
| `raw/<slug>.json` | Raw Firecrawl responses |
| `markdown/<slug>.md` | Cleaned markdown |
| `errors.jsonl` | Failed URLs |
| `summary.json` | Crawl summary |

---

## keywords.json — Keyword Dictionary

Defines keywords to extract from crawled content. Supports categorized or flat format.

### Categorized format (recommended)

```json
[
  {
    "category": "hardware",
    "keywords": ["Raspberry Pi", "Arduino", "ESP32"]
  },
  {
    "category": "software",
    "keywords": ["Docker", "Ubuntu", "Home Assistant"]
  }
]
```

### Flat format

```json
["Raspberry Pi", "Docker", "Home Assistant", "MQTT"]
```

See `keywords_example.json` for a complete template.

---

## rag_app.py — RAG Query Application

FAISS-based vector search + LLM generation.

### Usage

```bash
# Build index (one-time)
python rag_app.py --build --data-dir ./blog_data --index-dir ./index

# Single query
python rag_app.py --query "Câu hỏi của bạn" --data-dir ./blog_data --index-dir ./index

# Interactive chat
python rag_app.py --interactive --data-dir ./blog_data --index-dir ./index

# Test retrieval only (no LLM needed)
python rag_app.py --query "test" --retrieve-only --data-dir ./blog_data --index-dir ./index
```

### Options

| Argument | Default | Description |
|----------|---------|-------------|
| `--data-dir` | `.` | Directory with chunks.jsonl |
| `--index-dir` | `./rag_index` | FAISS index directory |
| `--build` | — | Build index from chunks |
| `--query` | — | Query to answer |
| `--interactive` | — | Interactive chat mode |
| `--retrieve-only` | — | Test retrieval without LLM |
| `--top-k` | 5 | Number of chunks to retrieve |
| `--embed-model` | `paraphrase-multilingual-MiniLM-L12-v2` | Embedding model |
| `--llm-model` | `gpt-4o-mini` | LLM model name |
| `--llm-base-url` | `https://api.openai.com/v1` | LLM API base URL |

### LLM API configuration

Set in `.env`:

```bash
OPENAI_API_KEY=sk-...
# Or for other providers:
# LLM_BASE_URL=https://api.together.xyz/v1
# LLM_MODEL=meta-llama/Llama-3-70b-chat-hf
```

Compatible with any OpenAI-format API: OpenAI, Together.ai, Groq, Ollama, etc.

---

## crawl_orangepi_blog.py — OrangePi-specific Crawler

Specialized crawler for orangepi.vn with Orange Pi model detection.

```bash
python crawl_orangepi_blog.py --limit 5
python crawl_orangepi_blog.py --all
```

Uses `orangepi_models.json` for product mention detection (36 Orange Pi models with aliases).

---

## Architecture

```
Blog (sitemap)
    │
    ▼
crawl_blog.py ──► Firecrawl API ──► articles.jsonl
    │                                  chunks.jsonl
    │                                  keywords.json
    │                                  raw/*.json
    │                                  markdown/*.md
    ▼
rag_app.py
    │
    ├──► SentenceTransformer (embeddings)
    ├──► FAISS (vector index)
    └──► LLM API (generation)
            │
            ▼
        Answer + sources
```

## License

Data sourced from respective blogs. Check each site for content usage terms.