update crawl blog

2026-06-12 11:37:39 +07:00
parent 3ebf6f450d
commit 65d2cae6ca
5 changed files with 982 additions and 71 deletions
@@ -8,3 +8,7 @@ OPENAI_API_KEY=your_api_key_here

 # Optional: Model name (default: gpt-4o-mini)
 # LLM_MODEL=gpt-4o-mini
+
+# Firecrawl API Configuration
+# Get your API key from https://www.firecrawl.dev
+FIRECRAWL_API_KEY=fc-...
@@ -0,0 +1,2 @@
+.env
+__pycache__/
@@ -1,106 +1,215 @@
-# OrangePi RAG Dataset
+# Blog RAG Toolkit

-A **Vietnamese-language** RAG (Retrieval-Augmented Generation) data pipeline that crawls, extracts, and chunks blog articles from [orangepi.vn](https://orangepi.vn) — the official Orange Pi distributor in Vietnam.
+A complete RAG (Retrieval-Augmented Generation) pipeline: **crawl** any blog, **extract** keywords, **chunk** content, and **query** with an LLM.

-## Dataset Summary
+## Components

-| Metric       | Value |
-|-------------|-------|
-| Articles     | 199   |
-| Chunks       | 472   |
-| Models       | 36    |
-| Language     | vi    |
-| Last crawl   | 2026-06-11 |
+| File | Purpose |
+|------|---------|
+| `crawl_blog.py` | Generic blog crawler (sitemap + Firecrawl) |
+| `crawl_orangepi_blog.py` | OrangePi.vn-specific crawler |
+| `rag_app.py` | RAG query application (FAISS + LLM) |
+| `keywords_example.json` | Sample keyword dictionary |

-## Output Files
+## Quick Start

-| File | Description |
-|------|-------------|
-| `articles.jsonl` | Full article records (title, description, markdown, text, product mentions, topic, metadata) |
-| `chunks.jsonl` | Overlapping text chunks (~650 words, ~100 overlap) with metadata for embedding |
-| `urls.json` | Discovered sitemap URLs with `lastmod` timestamps |
-| `raw/<slug>.json` | Raw Firecrawl API scrape response per article |
-| `markdown/<slug>.md` | Cleaned markdown per article |
-| `orangepi_models.json` | Canonical Orange Pi model dictionary with aliases |
-| `errors.jsonl` | Failed URLs and error details |
-| `summary.json` | Crawl summary statistics |
-
-### Chunk metadata
-
-Each chunk in `chunks.jsonl` includes:
-
- `chunk_id` — unique ID (`{article_id}__chunk_{seq}`)
- `article_id` — source article reference
- `content` — chunk text (markdown)
- `section` — nearest heading context
- `metadata.product_mentions` — canonical Orange Pi models mentioned
- `metadata.topic` — inferred topic (e.g., "home assistant", "linux", "docker")
-
-## Usage
-
-### Prerequisites
-
- Python 3.10+
- A [Firecrawl](https://www.firecrawl.dev) API key
-
-### Install
+### 1. Install

 ```bash
-git clone <repo-url>
-cd orangepi-rag
-# No external dependencies beyond Python stdlib
+pip install -r requirements.txt
 ```

-### Set API key
+### 2. Set API key

 ```bash
 export FIRECRAWL_API_KEY="fc-..."
+# or put in .env file:
+echo "FIRECRAWL_API_KEY=fc-..." > .env
 ```

-Or place it in `/home/admin/.hermes/.env`:
-
-```
-FIRECRAWL_API_KEY=fc-...
-```
-
-### Run crawl
+### 3. Crawl a blog

 ```bash
-# Quick test — process first 5 articles
-python3 crawl_orangepi_blog.py --limit 5
+# Crawl 5 articles from any WordPress blog
+python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --limit 5

-# Full crawl — all discovered articles
-python3 crawl_orangepi_blog.py --all
+# Crawl all articles with custom keywords
+python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --keywords keywords.json

-# Re-scrape everything (overwrites existing raw files)
-python3 crawl_orangepi_blog.py --all --force
+# Output to custom directory
+python crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --out-dir ./my_blog_data
+```
+
+### 4. Build index & query
+
+```bash
+# Build FAISS index
+python rag_app.py --build --data-dir ./my_blog_data --index-dir ./my_index
+
+# Query (requires OPENAI_API_KEY)
+export OPENAI_API_KEY="sk-..."
+python rag_app.py --query "How to install Docker?" --data-dir ./my_blog_data --index-dir ./my_index
+
+# Interactive chat
+python rag_app.py --interactive --data-dir ./my_blog_data --index-dir ./my_index
+```
+
+---
+
+## crawl_blog.py — Generic Blog Crawler
+
+Crawls any blog that exposes a sitemap (WordPress, Yoast, etc.).
+
+### Usage
+
+```bash
+python crawl_blog.py --sitemap <SITEMAP_URL> [options]
 ```

 ### Options

 | Argument | Default | Description |
 |----------|---------|-------------|
+| `--sitemap` | (required) | Sitemap URL |
+| `--out-dir` | `./blog_data` | Output directory |
+| `--keywords` | `<out-dir>/keywords.json` | Keywords JSON path |
 | `--limit N` | 5 | Process first N articles |
-| `--all` | — | Process all discovered articles |
-| `--out-dir PATH` | `/mnt/ssd/orangepi-rag` | Output directory |
-| `--models PATH` | `<out-dir>/orangepi_models.json` | Model dictionary path |
-| `--sitemap URL` | `https://orangepi.vn/post-sitemap.xml` | Sitemap URL |
+| `--all` | — | Process all articles |
 | `--sleep SEC` | 1.0 | Delay between Firecrawl calls |
 | `--force` | — | Re-scrape cached articles |
 | `--max-words N` | 650 | Target words per chunk |
 | `--overlap-words N` | 100 | Overlap words between chunks |
+| `--language` | `en` | Default language code |

-## Model Detection
+### Output files

-The pipeline uses `orangepi_models.json` to detect canonical Orange Pi product names in article text. The dictionary supports aliases per model (e.g., `"Orange Pi 5"`, `"OrangePi 5"`, `"OPi 5"`) and longest-match-first resolution to prevent false double-counts.
+| File | Description |
+|------|-------------|
+| `articles.jsonl` | Article records with keyword mentions |
+| `chunks.jsonl` | Chunked content for embedding |
+| `keywords.json` | Keyword dictionary used |
+| `urls.json` | Discovered URLs |
+| `raw/<slug>.json` | Raw Firecrawl responses |
+| `markdown/<slug>.md` | Cleaned markdown |
+| `errors.jsonl` | Failed URLs |
+| `summary.json` | Crawl summary |

-## Use Cases
+---

- **Semantic search** over Vietnamese Orange Pi knowledge
- **Q&A bots** for Orange Pi tutorials, OS installs, hardware guides
- **Product recommendation** based on article content
- **Fine-tuning** Vietnamese embedding models on SBC/embedded computing content
+## keywords.json — Keyword Dictionary
+
+Defines keywords to extract from crawled content. Supports categorized or flat format.
+
+### Categorized format (recommended)
+
+```json
+[
+  {
+    "category": "hardware",
+    "keywords": ["Raspberry Pi", "Arduino", "ESP32"]
+  },
+  {
+    "category": "software",
+    "keywords": ["Docker", "Ubuntu", "Home Assistant"]
+  }
+]
+```
+
+### Flat format
+
+```json
+["Raspberry Pi", "Docker", "Home Assistant", "MQTT"]
+```
+
+See `keywords_example.json` for a complete template.
+
+---
+
+## rag_app.py — RAG Query Application
+
+FAISS-based vector search + LLM generation.
+
+### Usage
+
+```bash
+# Build index (one-time)
+python rag_app.py --build --data-dir ./blog_data --index-dir ./index
+
+# Single query
+python rag_app.py --query "Câu hỏi của bạn" --data-dir ./blog_data --index-dir ./index
+
+# Interactive chat
+python rag_app.py --interactive --data-dir ./blog_data --index-dir ./index
+
+# Test retrieval only (no LLM needed)
+python rag_app.py --query "test" --retrieve-only --data-dir ./blog_data --index-dir ./index
+```
+
+### Options
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--data-dir` | `.` | Directory with chunks.jsonl |
+| `--index-dir` | `./rag_index` | FAISS index directory |
+| `--build` | — | Build index from chunks |
+| `--query` | — | Query to answer |
+| `--interactive` | — | Interactive chat mode |
+| `--retrieve-only` | — | Test retrieval without LLM |
+| `--top-k` | 5 | Number of chunks to retrieve |
+| `--embed-model` | `paraphrase-multilingual-MiniLM-L12-v2` | Embedding model |
+| `--llm-model` | `gpt-4o-mini` | LLM model name |
+| `--llm-base-url` | `https://api.openai.com/v1` | LLM API base URL |
+
+### LLM API configuration
+
+Set in `.env`:
+
+```bash
+OPENAI_API_KEY=sk-...
+# Or for other providers:
+# LLM_BASE_URL=https://api.together.xyz/v1
+# LLM_MODEL=meta-llama/Llama-3-70b-chat-hf
+```
+
+Compatible with any OpenAI-format API: OpenAI, Together.ai, Groq, Ollama, etc.
+
+---
+
+## crawl_orangepi_blog.py — OrangePi-specific Crawler
+
+Specialized crawler for orangepi.vn with Orange Pi model detection.
+
+```bash
+python crawl_orangepi_blog.py --limit 5
+python crawl_orangepi_blog.py --all
+```
+
+Uses `orangepi_models.json` for product mention detection (36 Orange Pi models with aliases).
+
+---
+
+## Architecture
+
+```
+Blog (sitemap)
+    │
+    ▼
+crawl_blog.py ──► Firecrawl API ──► articles.jsonl
+    │                                  chunks.jsonl
+    │                                  keywords.json
+    │                                  raw/*.json
+    │                                  markdown/*.md
+    ▼
+rag_app.py
+    │
+    ├──► SentenceTransformer (embeddings)
+    ├──► FAISS (vector index)
+    └──► LLM API (generation)
+            │
+            ▼
+        Answer + sources
+```

 ## License

-Data sourced from [orangepi.vn](https://orangepi.vn). Check their site for content usage terms.
+Data sourced from respective blogs. Check each site for content usage terms.
@@ -0,0 +1,710 @@
+#!/usr/bin/env python3
+"""Generic blog crawler for RAG using Firecrawl.
+
+Discovery is sitemap-first (Yoast/WordPress), extraction is Firecrawl
+single-page scrape. Outputs:
+
+  articles.jsonl       article-level structured records
+  chunks.jsonl         chunk-level records for embedding/RAG
+  keywords.json        keyword dictionary for extraction
+  urls.json            discovered URL list with sitemap lastmod
+  raw/<slug>.json      raw Firecrawl response per article
+  markdown/<slug>.md   extracted markdown per article
+  errors.jsonl         failed URLs/errors
+  summary.json         crawl summary
+
+Usage:
+  python3 crawl_blog.py --sitemap https://example.com/post-sitemap.xml --limit 5
+  python3 crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all
+  python3 crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --keywords my_keywords.json
+
+Requires FIRECRAWL_API_KEY in environment or .env file.
+"""
+
+from __future__ import annotations
+
+import argparse
+import datetime as dt
+import hashlib
+import html
+import json
+import os
+import re
+import sys
+import time
+import urllib.error
+import urllib.parse
+import urllib.request
+import xml.etree.ElementTree as ET
+from pathlib import Path
+from typing import Any
+
+FIRECRAWL_SCRAPE_URL = "https://api.firecrawl.dev/v1/scrape"
+
+WORD_RE = re.compile(r"\S+", re.UNICODE)
+HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
+MD_LINK_RE = re.compile(r"\[([^\]]+)\]\(([^)]+)\)")
+MD_IMAGE_RE = re.compile(r"!\[[^\]]*\]\([^)]+\)")
+HTML_TAG_RE = re.compile(r"<[^>]+>")
+MULTI_SPACE_RE = re.compile(r"[ \t]+")
+
+
+# ---------------------------------------------------------------------------
+# Environment / dotenv
+# ---------------------------------------------------------------------------
+
+def load_dotenv(path: Path | None = None) -> None:
+    """Load .env from project root or given path."""
+    candidates = [Path(__file__).parent / ".env"]
+    if path:
+        candidates.insert(0, path)
+    for env_path in candidates:
+        if not env_path.exists():
+            continue
+        for raw in env_path.read_text(encoding="utf-8", errors="ignore").splitlines():
+            line = raw.strip()
+            if not line or line.startswith("#") or "=" not in line:
+                continue
+            key, val = line.split("=", 1)
+            key = key.strip()
+            val = val.strip().strip('"').strip("'")
+            if key and key not in os.environ:
+                os.environ[key] = val
+        break
+
+
+def now_iso() -> str:
+    return dt.datetime.now(dt.timezone.utc).isoformat()
+
+
+# ---------------------------------------------------------------------------
+# URL / sitemap helpers
+# ---------------------------------------------------------------------------
+
+def fetch_bytes(url: str, timeout: int = 30) -> bytes:
+    req = urllib.request.Request(
+        url,
+        headers={
+            "User-Agent": "BlogCrawler-RAG/1.0",
+            "Accept": "application/xml,text/xml,text/html,*/*",
+        },
+    )
+    with urllib.request.urlopen(req, timeout=timeout) as resp:
+        return resp.read()
+
+
+def parse_sitemap(url: str, domain_filter: str | None = None) -> list[dict[str, str | None]]:
+    """Return [{'url': ..., 'lastmod': ...}] from a sitemap URL.
+
+    If domain_filter is provided, only include URLs matching that domain.
+    Works with Yoast (post-sitemap.xml) and generic WordPress sitemaps.
+    """
+    data = fetch_bytes(url)
+    root = ET.fromstring(data)
+    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
+
+    # Check if this is a sitemap index (contains other sitemaps)
+    sitemaps = root.findall(".//sm:sitemap/sm:loc", ns)
+    if sitemaps:
+        # This is a sitemap index - recursively fetch child sitemaps
+        out: list[dict[str, str | None]] = []
+        for loc_el in sitemaps:
+            child_url = loc_el.text
+            if child_url:
+                try:
+                    child_results = parse_sitemap(child_url, domain_filter)
+                    out.extend(child_results)
+                except Exception as e:
+                    print(f"  WARN: failed to fetch child sitemap {child_url}: {e}", file=sys.stderr)
+        return out
+
+    # Regular sitemap - extract URLs
+    out: list[dict[str, str | None]] = []
+    for url_el in root.findall(".//sm:url", ns):
+        loc_el = url_el.find("sm:loc", ns)
+        if loc_el is None or not loc_el.text:
+            continue
+        lastmod_el = url_el.find("sm:lastmod", ns)
+        loc = loc_el.text.strip()
+        lastmod = lastmod_el.text.strip() if lastmod_el is not None and lastmod_el.text else None
+        if domain_filter and domain_filter not in loc:
+            continue
+        out.append({"url": loc, "lastmod": lastmod})
+    return out
+
+
+def slug_from_url(url: str) -> str:
+    path = urllib.parse.urlparse(url).path.strip("/")
+    if not path:
+        path = "index"
+    slug = re.sub(r"\.(html?|php)$", "", path)
+    slug = re.sub(r"[^a-zA-Z0-9_-]+", "-", slug).strip("-").lower()
+    if not slug:
+        slug = hashlib.sha1(url.encode()).hexdigest()[:12]
+    return slug[:160]
+
+
+def article_id_from_url(url: str, prefix: str = "blog") -> str:
+    return f"{prefix}_" + slug_from_url(url).replace("-", "_")
+
+
+def source_from_url(url: str) -> str:
+    """Extract domain name from URL as source identifier."""
+    parsed = urllib.parse.urlparse(url)
+    domain = parsed.netloc
+    if domain.startswith("www."):
+        domain = domain[4:]
+    return domain
+
+
+# ---------------------------------------------------------------------------
+# Firecrawl API
+# ---------------------------------------------------------------------------
+
+def firecrawl_scrape(url: str, api_key: str, timeout: int = 120) -> tuple[int, dict[str, Any]]:
+    payload = {
+        "url": url,
+        "formats": ["markdown"],
+        "onlyMainContent": True,
+        "waitFor": 1000,
+        "timeout": timeout * 1000,
+    }
+    body = json.dumps(payload).encode("utf-8")
+    req = urllib.request.Request(
+        FIRECRAWL_SCRAPE_URL,
+        data=body,
+        method="POST",
+        headers={
+            "Authorization": f"Bearer {api_key}",
+            "Content-Type": "application/json",
+            "User-Agent": "BlogCrawler-RAG/1.0",
+        },
+    )
+    try:
+        with urllib.request.urlopen(req, timeout=timeout + 20) as resp:
+            raw = resp.read().decode("utf-8", errors="replace")
+            return resp.status, json.loads(raw)
+    except urllib.error.HTTPError as e:
+        raw = e.read().decode("utf-8", errors="replace")
+        try:
+            data = json.loads(raw)
+        except Exception:
+            data = {"error": raw}
+        return e.code, data
+
+
+# ---------------------------------------------------------------------------
+# Markdown / text processing
+# ---------------------------------------------------------------------------
+
+def strip_markdown_to_text(markdown: str) -> str:
+    text = markdown.replace("\r\n", "\n")
+    text = MD_IMAGE_RE.sub("", text)
+    text = MD_LINK_RE.sub(r"\1", text)
+    text = re.sub(r"```.*?```", lambda m: m.group(0), text, flags=re.S)
+    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.M)
+    text = re.sub(r"[*_`~]", "", text)
+    text = HTML_TAG_RE.sub(" ", text)
+    text = html.unescape(text)
+    text = MULTI_SPACE_RE.sub(" ", text)
+    text = re.sub(r"\n{3,}", "\n\n", text)
+    return text.strip()
+
+
+def clean_markdown(markdown: str) -> str:
+    """Light cleanup for common boilerplate while preserving content."""
+    lines = markdown.replace("\r\n", "\n").split("\n")
+    drop_contains = [
+        "Press enter for Accessibility",
+        "Accessibility menu",
+        "Popup heading",
+        "Skip to main",
+        "Bỏ qua nội dung",
+        "close",
+    ]
+    cleaned: list[str] = []
+    for line in lines:
+        s = line.strip()
+        if any(x.lower() in s.lower() for x in drop_contains):
+            continue
+        cleaned.append(line.rstrip())
+    text = "\n".join(cleaned)
+    text = re.sub(r"\n{4,}", "\n\n\n", text)
+    return text.strip()
+
+
+def extract_title(data: dict[str, Any], markdown: str, fallback_url: str) -> str:
+    meta = data.get("metadata") or {}
+    for key in ("title", "ogTitle"):
+        val = meta.get(key)
+        if isinstance(val, str) and val.strip():
+            return html.unescape(val.strip())
+    for line in markdown.splitlines():
+        m = HEADING_RE.match(line.strip())
+        if m:
+            return m.group(2).strip()
+    return slug_from_url(fallback_url).replace("-", " ").title()
+
+
+def extract_description(data: dict[str, Any], text: str) -> str | None:
+    meta = data.get("metadata") or {}
+    for key in ("description", "ogDescription"):
+        val = meta.get(key)
+        if isinstance(val, str) and val.strip():
+            return html.unescape(val.strip())
+    return text[:300].strip() if text else None
+
+
+# ---------------------------------------------------------------------------
+# Keyword extraction (replaces product mentions from original)
+# ---------------------------------------------------------------------------
+
+def load_keywords(path: Path | None) -> list[dict[str, Any]]:
+    """Load keyword dictionary from JSON.
+
+    Expected JSON shape (list of categories):
+      [
+        {
+          "category": "hardware",
+          "keywords": ["Raspberry Pi", "Arduino", "ESP32"]
+        },
+        {
+          "category": "software",
+          "keywords": ["Docker", "Ubuntu", "Debian"]
+        }
+      ]
+
+    Or a flat list of keyword strings:
+      ["Raspberry Pi", "Docker", "Home Assistant"]
+    """
+    if path is None or not path.exists():
+        return []
+    data = json.loads(path.read_text(encoding="utf-8"))
+
+    if isinstance(data, list) and all(isinstance(x, str) for x in data):
+        # Flat list of strings -> wrap into single category
+        return [{"category": "general", "keywords": data}]
+
+    if not isinstance(data, list):
+        raise ValueError(f"keywords JSON must be a list: {path}")
+
+    categories: list[dict[str, Any]] = []
+    for row in data:
+        if not isinstance(row, dict):
+            continue
+        category = str(row.get("category") or "general").strip()
+        kw_list = row.get("keywords") or []
+        if not isinstance(kw_list, list):
+            continue
+        keywords = sorted({str(k).strip() for k in kw_list if str(k).strip()}, key=len, reverse=True)
+        if keywords:
+            categories.append({"category": category, "keywords": keywords})
+    return categories
+
+
+def _alias_to_regex(alias: str) -> re.Pattern[str]:
+    """Compile a keyword regex with flexible whitespace and safe boundaries."""
+    alias = html.unescape(alias or "").strip()
+    alias = alias.replace("\u00a0", " ")
+    alias = re.sub(r"[\u2010-\u2015]", "-", alias)
+    pat = re.escape(alias).replace(r"\ ", r"\s+")
+    return re.compile(rf"(?<![A-Za-z0-9]){pat}(?![A-Za-z0-9])", re.I | re.U)
+
+
+def spans_overlap(a: tuple[int, int], b: tuple[int, int]) -> bool:
+    return a[0] < b[1] and b[0] < a[1]
+
+
+def keyword_mentions_detail(text: str, categories: list[dict[str, Any]]) -> list[dict[str, Any]]:
+    """Return keyword mentions grouped by category.
+
+    Longer keywords are processed first and reserve their character spans.
+    This prevents false double-counts such as "Orange Pi 5" and "Orange Pi"
+    matching the same text span.
+    """
+    if not categories:
+        return []
+
+    hay = text
+    hay = html.unescape(hay or "")
+    hay = hay.replace("\u00a0", " ")
+    hay = re.sub(r"[\u2010-\u2015]", "-", hay)
+    hay = re.sub(r"\s+", " ", hay)
+
+    # Build flat list of (category, keyword, regex) sorted by keyword length desc
+    all_kw: list[tuple[str, str, re.Pattern[str]]] = []
+    for cat in categories:
+        for kw in cat["keywords"]:
+            all_kw.append((cat["category"], kw, _alias_to_regex(kw)))
+    all_kw.sort(key=lambda x: len(x[1]), reverse=True)
+
+    details: list[dict[str, Any]] = []
+    occupied: list[tuple[int, int]] = []
+
+    # Group results by category
+    cat_results: dict[str, dict[str, Any]] = {}
+
+    for category, keyword, rx in all_kw:
+        for m in rx.finditer(hay):
+            span = (m.start(), m.end())
+            if any(spans_overlap(span, used) for used in occupied):
+                continue
+            occupied.append(span)
+
+            if category not in cat_results:
+                cat_results[category] = {
+                    "category": category,
+                    "matched_keywords": {},
+                    "total_count": 0,
+                }
+            entry = cat_results[category]
+            entry["matched_keywords"].setdefault(keyword, 0)
+            entry["matched_keywords"][keyword] += 1
+            entry["total_count"] += 1
+
+    for cat_data in cat_results.values():
+        cat_data["matched_keywords"] = dict(
+            sorted(cat_data["matched_keywords"].items(), key=lambda x: -x[1])
+        )
+        details.append(cat_data)
+
+    return sorted(details, key=lambda d: -d["total_count"])
+
+
+def keyword_mentions(text: str, categories: list[dict[str, Any]]) -> list[str]:
+    """Return flat list of all matched keywords."""
+    if not categories:
+        return []
+    all_matched = []
+    for detail in keyword_mentions_detail(text, categories):
+        all_matched.extend(detail["matched_keywords"].keys())
+    return all_matched
+
+
+# ---------------------------------------------------------------------------
+# Topic inference
+# ---------------------------------------------------------------------------
+
+def infer_topic(title: str, text: str, categories: list[dict[str, Any]] | None = None) -> str | None:
+    """Infer topic from content. Uses keyword categories if available."""
+    hay = (title + "\n" + text[:2000]).lower()
+
+    # If categories are provided, use them for topic inference
+    if categories:
+        best_category = None
+        best_count = 0
+        for cat in categories:
+            count = sum(1 for kw in cat["keywords"] if kw.lower() in hay)
+            if count > best_count:
+                best_count = count
+                best_category = cat["category"]
+        if best_category and best_count > 0:
+            return best_category
+
+    # Fallback: common topic rules
+    rules = [
+        ("docker", "docker"),
+        ("kubernetes", "kubernetes"),
+        ("linux", "linux"),
+        ("ubuntu", "linux"),
+        ("debian", "linux"),
+        ("python", "programming"),
+        ("javascript", "programming"),
+        ("home assistant", "home assistant"),
+        ("iot", "iot"),
+        ("ai", "ai"),
+        ("machine learning", "ai"),
+    ]
+    for needle, topic in rules:
+        if needle in hay:
+            return topic
+    return None
+
+
+# ---------------------------------------------------------------------------
+# Chunking
+# ---------------------------------------------------------------------------
+
+def chunk_markdown(
+    markdown: str,
+    article: dict[str, Any],
+    categories: list[dict[str, Any]] | None = None,
+    max_words: int = 650,
+    overlap_words: int = 100,
+) -> list[dict[str, Any]]:
+    """Chunk markdown by paragraphs/headings with approximate word limits."""
+    blocks = re.split(r"\n\s*\n", markdown.strip()) if markdown.strip() else []
+    chunks: list[dict[str, Any]] = []
+    current: list[str] = []
+    current_words = 0
+    section = None
+    current_section = None
+
+    def words_of(s: str) -> list[str]:
+        return WORD_RE.findall(s)
+
+    def flush() -> None:
+        nonlocal current, current_words, current_section
+        content = "\n\n".join(current).strip()
+        if not content:
+            current = []
+            current_words = 0
+            return
+        idx = len(chunks)
+        chunk_kw = keyword_mentions(content, categories or []) if categories else []
+        chunk_kw_detail = keyword_mentions_detail(content, categories or []) if categories else []
+        chunks.append({
+            "chunk_id": f"{article['id']}__chunk_{idx:04d}",
+            "article_id": article["id"],
+            "url": article["url"],
+            "title": article["title"],
+            "section": current_section,
+            "language": article.get("language", "en"),
+            "content": content,
+            "metadata": {
+                "source": article.get("source"),
+                "type": article.get("type"),
+                "keyword_mentions": chunk_kw,
+                "keyword_mentions_detail": chunk_kw_detail,
+                "article_keyword_mentions": article.get("keywords", []),
+                "topic": article.get("topic"),
+                "modified_at": article.get("modified_at"),
+            },
+        })
+        if overlap_words > 0:
+            tail: list[str] = []
+            count = 0
+            for b in reversed(current):
+                bw = len(words_of(b))
+                if tail and count + bw > overlap_words:
+                    break
+                tail.insert(0, b)
+                count += bw
+            current = tail
+            current_words = count
+        else:
+            current = []
+            current_words = 0
+
+    for block in blocks:
+        b = block.strip()
+        if not b:
+            continue
+        m = HEADING_RE.match(b.splitlines()[0].strip())
+        if m:
+            section = m.group(2).strip()
+        bw = len(words_of(b))
+        if current and current_words + bw > max_words:
+            flush()
+        if not current:
+            current_section = section
+        if bw > max_words * 1.5:
+            words = words_of(b)
+            start = 0
+            while start < len(words):
+                part = " ".join(words[start:start + max_words])
+                if current and current_words + len(words_of(part)) > max_words:
+                    flush()
+                current.append(part)
+                current_words += len(words_of(part))
+                flush()
+                start += max_words - overlap_words
+            continue
+        current.append(b)
+        current_words += bw
+    if current:
+        flush()
+    return chunks
+
+
+# ---------------------------------------------------------------------------
+# JSONL helpers
+# ---------------------------------------------------------------------------
+
+def append_jsonl(path: Path, record: dict[str, Any]) -> None:
+    with path.open("a", encoding="utf-8") as f:
+        f.write(json.dumps(record, ensure_ascii=False, sort_keys=True) + "\n")
+
+
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+
+def main(argv: list[str] | None = None) -> int:
+    parser = argparse.ArgumentParser(
+        description="Generic blog crawler for RAG (Firecrawl + sitemap)",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog="""
+Examples:
+  # Crawl 5 articles from a blog
+  python3 crawl_blog.py --sitemap https://example.com/post-sitemap.xml --limit 5
+
+  # Crawl all articles
+  python3 crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all
+
+  # Use custom keywords for extraction
+  python3 crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --keywords keywords.json
+
+  # Output to custom directory
+  python3 crawl_blog.py --sitemap https://example.com/post-sitemap.xml --all --out-dir ./my_data
+        """,
+    )
+    parser.add_argument("--sitemap", required=True, help="Sitemap URL (e.g. https://example.com/post-sitemap.xml)")
+    parser.add_argument("--out-dir", type=Path, default=Path("./blog_data"), help="Output directory")
+    parser.add_argument("--keywords", type=Path, default=None, help="Keywords JSON path; defaults to <out-dir>/keywords.json")
+    parser.add_argument("--limit", type=int, default=None, help="Only process first N article URLs")
+    parser.add_argument("--all", action="store_true", help="Process all discovered article URLs")
+    parser.add_argument("--sleep", type=float, default=1.0, help="Delay between Firecrawl calls (seconds)")
+    parser.add_argument("--force", action="store_true", help="Re-scrape even if raw file exists")
+    parser.add_argument("--max-words", type=int, default=650, help="Target words per chunk")
+    parser.add_argument("--overlap-words", type=int, default=100, help="Overlap words between chunks")
+    parser.add_argument("--language", default="en", help="Default language code for articles")
+    args = parser.parse_args(argv)
+
+    if not args.all and args.limit is None:
+        args.limit = 5
+
+    load_dotenv()
+    api_key = os.environ.get("FIRECRAWL_API_KEY")
+    if not api_key:
+        print("ERROR: FIRECRAWL_API_KEY is not set in environment or .env file", file=sys.stderr)
+        return 2
+
+    out_dir: Path = args.out_dir
+    out_dir.mkdir(parents=True, exist_ok=True)
+
+    # Load keywords
+    keywords_path: Path = args.keywords or (out_dir / "keywords.json")
+    categories = load_keywords(keywords_path)
+
+    # Create output directories
+    raw_dir = out_dir / "raw"
+    md_dir = out_dir / "markdown"
+    raw_dir.mkdir(parents=True, exist_ok=True)
+    md_dir.mkdir(parents=True, exist_ok=True)
+
+    # Discover URLs from sitemap
+    source_domain = source_from_url(args.sitemap)
+    urls = parse_sitemap(args.sitemap)
+    urls_path = out_dir / "urls.json"
+    urls_path.write_text(
+        json.dumps({"sitemap": args.sitemap, "source": source_domain, "count": len(urls), "urls": urls}, ensure_ascii=False, indent=2),
+        encoding="utf-8",
+    )
+
+    selected = urls if args.all else urls[: args.limit]
+    articles_path = out_dir / "articles.jsonl"
+    chunks_path = out_dir / "chunks.jsonl"
+    errors_path = out_dir / "errors.jsonl"
+
+    if args.force:
+        articles_path.unlink(missing_ok=True)
+        chunks_path.unlink(missing_ok=True)
+        errors_path.unlink(missing_ok=True)
+
+    print(f"Source: {source_domain}")
+    print(f"Discovered article URLs: {len(urls)}")
+    print(f"Processing: {len(selected)}")
+    print(f"Output: {out_dir}")
+    print(f"Keywords: {keywords_path} ({sum(len(c['keywords']) for c in categories)} keywords in {len(categories)} categories)")
+
+    ok = 0
+    failed = 0
+    total_chunks = 0
+
+    for idx, item in enumerate(selected, 1):
+        url = str(item["url"])
+        lastmod = item.get("lastmod")
+        slug = slug_from_url(url)
+        raw_path = raw_dir / f"{slug}.json"
+        md_path = md_dir / f"{slug}.md"
+        print(f"[{idx}/{len(selected)}] {url}")
+
+        # Use cached raw if available
+        if raw_path.exists() and not args.force:
+            try:
+                raw_data = json.loads(raw_path.read_text(encoding="utf-8"))
+                status = int(raw_data.get("_http_status", 200))
+            except Exception as e:
+                append_jsonl(errors_path, {"url": url, "error": f"read cached raw failed: {e}", "at": now_iso()})
+                failed += 1
+                continue
+        else:
+            status, raw_data = firecrawl_scrape(url, api_key)
+            raw_data["_http_status"] = status
+            raw_data["_source_url"] = url
+            raw_data["_scraped_at"] = now_iso()
+            raw_path.write_text(json.dumps(raw_data, ensure_ascii=False, indent=2), encoding="utf-8")
+            time.sleep(args.sleep)
+
+        if status >= 400 or not raw_data.get("success"):
+            append_jsonl(errors_path, {"url": url, "http_status": status, "error": raw_data, "at": now_iso()})
+            print(f"  ERROR status={status} success={raw_data.get('success')}")
+            failed += 1
+            continue
+
+        data = raw_data.get("data") or {}
+        markdown = data.get("markdown") or ""
+        markdown = clean_markdown(markdown)
+        text = strip_markdown_to_text(markdown)
+        if len(text) < 100:
+            append_jsonl(errors_path, {"url": url, "http_status": status, "error": "too little text", "text_chars": len(text), "at": now_iso()})
+            print(f"  ERROR too little text chars={len(text)}")
+            failed += 1
+            continue
+
+        title = extract_title(data, markdown, url)
+        desc = extract_description(data, text)
+        kw_detail = keyword_mentions_detail(title + "\n" + text, categories)
+        kw_list = [kw for cat in kw_detail for kw in cat["matched_keywords"].keys()]
+
+        article = {
+            "id": article_id_from_url(url),
+            "url": url,
+            "source": source_domain,
+            "type": "blog_article",
+            "title": title,
+            "description": desc,
+            "published_at": None,
+            "modified_at": lastmod,
+            "language": args.language,
+            "markdown": markdown,
+            "text": text,
+            "tags": [],
+            "keywords": kw_list,
+            "keyword_mentions_detail": kw_detail,
+            "topic": infer_topic(title, text, categories),
+            "scraped_at": raw_data.get("_scraped_at") or now_iso(),
+            "metadata": data.get("metadata") or {},
+        }
+        chunks = chunk_markdown(markdown, article, categories=categories, max_words=args.max_words, overlap_words=args.overlap_words)
+
+        md_path.write_text(markdown, encoding="utf-8")
+        append_jsonl(articles_path, article)
+        for chunk in chunks:
+            append_jsonl(chunks_path, chunk)
+
+        ok += 1
+        total_chunks += len(chunks)
+        print(f"  OK title={title!r} markdown_chars={len(markdown)} chunks={len(chunks)} keywords={kw_list[:5]}")
+
+    summary = {
+        "sitemap": args.sitemap,
+        "source": source_domain,
+        "discovered": len(urls),
+        "processed": len(selected),
+        "ok": ok,
+        "failed": failed,
+        "chunks": total_chunks,
+        "out_dir": str(out_dir),
+        "keywords_path": str(keywords_path),
+        "keywords_loaded": sum(len(c["keywords"]) for c in categories),
+        "finished_at": now_iso(),
+    }
+    (out_dir / "summary.json").write_text(json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8")
+    print("SUMMARY", json.dumps(summary, ensure_ascii=False))
+    return 0 if failed == 0 else 1
+
+
+if __name__ == "__main__":
+    raise SystemExit(main())
@@ -0,0 +1,86 @@
+[
+  {
+    "category": "hardware",
+    "keywords": [
+      "Raspberry Pi",
+      "Orange Pi",
+      "Arduino",
+      "ESP32",
+      "ESP8266",
+      "BeagleBone",
+      "NVIDIA Jetson",
+      "STM32",
+      "GPIO",
+      "SPI",
+      "I2C",
+      "UART"
+    ]
+  },
+  {
+    "category": "operating_system",
+    "keywords": [
+      "Linux",
+      "Ubuntu",
+      "Debian",
+      "Raspberry Pi OS",
+      "Armbian",
+      "DietPi",
+      "Manjaro",
+      "Fedora",
+      "CentOS",
+      "FreeBSD",
+      "OpenWrt",
+      "Home Assistant OS"
+    ]
+  },
+  {
+    "category": "software",
+    "keywords": [
+      "Docker",
+      "Kubernetes",
+      "Home Assistant",
+      "MQTT",
+      "Node-RED",
+      "Jellyfin",
+      "Plex",
+      "Pi-hole",
+      "AdGuard",
+      "Nginx",
+      "Apache",
+      "Samba",
+      "Kodi",
+      "OctoPrint"
+    ]
+  },
+  {
+    "category": "ai_ml",
+    "keywords": [
+      "TensorFlow",
+      "PyTorch",
+      "OpenCV",
+      "YOLO",
+      "LLM",
+      "GPT",
+      "machine learning",
+      "deep learning",
+      "neural network",
+      "NPU",
+      "inference"
+    ]
+  },
+  {
+    "category": "networking",
+    "keywords": [
+      "VPN",
+      "WireGuard",
+      "ZeroTier",
+      "Tailscale",
+      "firewall",
+      "router",
+      "WiFi",
+      "Bluetooth",
+      "Zigbee",
+      "LoRa"
+    ]
+  }
+]