Implementation
Sister docs: PRD (intent), Architecture (system view), Notes (decision log).
TL;DR
Shipped in ~1 day on top of Personal-RAG. Two surfaces:
- Mode A: fukuro-audit Claude Skill with ideation + production audit branches; reads _canon for vendor truth
- Mode C: crawler daemon, launchd daily, sole writer to _canon, ~180 docs / 3,400 chunks / 28 MB
- +12 ms p50 routing overhead; hallucination rate on vendor facts dropped ~22% → <3%
- $0/month: runs on M2 Max alongside the existing tako daemon
- Mode B was scoped and dropped before any code (1 data point insufficient + overlap with kb-audit)
Stack
| Layer | Component | Notes |
|---|---|---|
| Runtime | Python 3.11 | Shared venv with tako daemon |
| Scheduler | launchd | ai.vuihoc.fukuro-crawl.daily.plist, fires once/day |
| HTTP | httpx (async) | Concurrent fetch, retry, custom User-Agent |
| HTML parse | BeautifulSoup4 | Vendor doc HTML is stable; lxml backend |
| Embedder | bge-m3 (shared) | Same instance as Personal-RAG, no extra cold start |
| Vector store | Postgres 16 + pgvector | HNSW index, workspace column for segregation |
| Blob store | MinIO S3 via BlobStore | Raw HTML archived at s3://canon/<host>/<path>.html |
| Skill SDK | Claude Skill SDK | Mode A = fukuro-audit skill |
Directory layout
~/Documents/Side.Projects/tako/server/
├── src/
│   ├── server_local.py   # MCP server (existing); added kb_search_canon tool
│   ├── workspaces.py     # added _canon registration
│   └── routing.py        # added canon-first rule for spec/price intents
└── fukuro/
    ├── crawler.py        # Mode C daemon entrypoint
    ├── allowlist.yml     # source of truth for what can enter _canon
    ├── fetchers/
    │   ├── anthropic_docs.py
    │   ├── xai_docs.py
    │   ├── hf_model_card.py
    │   └── arxiv_pdf.py
    ├── parser.py         # BeautifulSoup extraction
    └── guard.py          # allowlist enforcement (final check before write)

~/.claude/skills/fukuro-audit/
├── SKILL.md              # Mode A skill definition
└── prompts/
    ├── ideation.md
    └── production.md

~/Library/LaunchAgents/
└── ai.vuihoc.fukuro-crawl.daily.plist
Allowlist
# fukuro/allowlist.yml
sources:
  - host: docs.anthropic.com
    pattern: "/**"
    cadence: daily
    fetcher: anthropic_docs
  - host: docs.x.ai
    pattern: "/**"
    cadence: daily
    fetcher: xai_docs
  - host: huggingface.co
    pattern: "/{org}/{model}"   # model cards ONLY, not blog/spaces/datasets
    cadence: daily
    fetcher: hf_model_card
  - host: arxiv.org
    pattern: "/pdf/*"
    cadence: on_reference       # one-shot, triggered when cited in _personal
    fetcher: arxiv_pdf
Explicitly excluded (parsed by guard, never crawled): blog posts (/blog/*), third-party benchmarks, Twitter/X threads, personal notes, GitHub repos.
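The allowlist uses three pattern shapes: `/**` (any path), `/pdf/*` (one trailing segment), and `/{org}/{model}` (exactly two segments). The source does not show how `Rule.matches` is implemented, so the following is only a minimal sketch of one way it could work. The `Rule` dataclass, `pattern_to_regex`, and `RESERVED_FIRST_SEGMENTS` are hypothetical names. The deny set is there because a bare two-segment template would otherwise accept paths such as `/blog/<post>`.

```python
import re
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical: first path segments that fit {org} but are not model cards.
RESERVED_FIRST_SEGMENTS = frozenset({"blog", "spaces", "datasets"})


def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate an allowlist pattern into an anchored regex over URL paths."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")        # any depth, "/" included
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]+")     # exactly one path segment
            i += 1
        elif pattern[i] == "{":
            i = pattern.index("}", i) + 1
            out.append("[^/]+")     # named placeholder, one segment
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(out) + "/?$")


@dataclass(frozen=True)
class Rule:
    host: str
    pattern: str

    def matches(self, url: str) -> bool:
        parsed = urlparse(url)
        if parsed.hostname != self.host:
            return False
        segments = [s for s in parsed.path.split("/") if s]
        # Placeholder patterns must not swallow reserved site sections.
        if "{" in self.pattern and segments and segments[0] in RESERVED_FIRST_SEGMENTS:
            return False
        return pattern_to_regex(self.pattern).match(parsed.path) is not None
```

With this sketch, `Rule("huggingface.co", "/{org}/{model}")` accepts a two-segment model path and rejects both `/blog/...` and deeper paths such as `/{org}/{model}/tree/main`.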
Guard (the integrity contract)
# fukuro/guard.py
from urllib.parse import urlparse

# Rule (the parsed allowlist entry type) is defined elsewhere in fukuro;
# the annotation below is quoted so this excerpt imports on its own.

ALLOWED_HOSTS = {"docs.anthropic.com", "docs.x.ai", "huggingface.co", "arxiv.org"}


class IntegrityError(Exception):
    """Raised when a URL or caller violates the _canon integrity contract."""


def guard_url(url: str, allowlist: "list[Rule]") -> None:
    """Final check before any _canon write. Raises on violation."""
    parsed = urlparse(url)
    # netloc keeps any port or userinfo, so such URLs fail closed here.
    if parsed.netloc not in ALLOWED_HOSTS:
        raise IntegrityError(f"host {parsed.netloc} not in allowlist")
    if not any(rule.matches(url) for rule in allowlist):
        raise IntegrityError(f"url {url} matches no allowlist pattern")


def guard_workspace_write(workspace: str, caller: str) -> None:
    """Enforced server-side in tako: only crawler service account writes _canon."""
    if workspace == "_canon" and caller != "fukuro-crawler":
        raise IntegrityError(f"caller {caller} cannot write _canon workspace")
Both checks fire on every ingest attempt. No bypass path. This is the integrity contract.
Workspace registration (tako server change)
# tako/src/workspaces.py: added _canon
WORKSPACE_MAP = {
    "ll":        {"path": "~/Documents/KB/ll/",        "writable_by": "*"},
    "mindx":     {"path": "~/Documents/KB/mindx/",     "writable_by": "*"},
    "_personal": {"path": "~/Documents/KB/_personal/", "writable_by": "*"},
    "_shared":   {"path": "~/Documents/KB/_shared/",   "writable_by": "*"},
    "_canon":    {"path": "~/Documents/KB-s3/_canon/", "writable_by": "fukuro-crawler"},
    "_secrets":  {"path": "~/Documents/KB/_secrets/",  "writable_by": "vault-only"},
}
And the new MCP tool:
@mcp.tool()
async def kb_search_canon(query: str, top_k: int = 5) -> list[dict]:
    """Search the _canon workspace (external vendor authoritative docs).

    Use for: AI vendor pricing, model specs, context windows, API parameters,
    quantization sizes, deprecation notices. NOT for Marc's personal notes."""
    qvec = embed_model.encode([f"query: {query}"])[0]
    return await db.search(workspace="_canon", qvec=qvec, top_k=top_k)
Routing rule (orchestrator playbook)
The playbook lives in tako’s instructions.py and is sent to every connecting client via serverInfo.instructions. Excerpt of the canon routing rule:
| AI vendor spec/price/version/context/API param | kb_search_canon FIRST, kb_search_personal as supplement only if canon empty or score < 0.65 |

_shared vs _canon distinction:
  _shared = Marc's own research notes + cheatsheets (subjective, curated)
  _canon  = external vendor authoritative spec (objective, crawled)
  e.g. "RAG pattern" → _shared (Marc's interpretation)
       "Anthropic Haiku 4.5 price" → _canon (vendor truth)
The classifier is intentionally simple: keyword + intent match. No LLM in the hot path.
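To make the "keyword + intent match" idea concrete, here is a minimal sketch of a deterministic canon-first check. The keyword lists, the function names, and the `score` key on each hit are assumptions for illustration. Only the 0.65 threshold and the two example queries come from the routing rule above.

```python
import re

# Hypothetical keyword lists; the real playbook lives in tako's instructions.py.
VENDOR_TERMS = re.compile(
    r"\b(anthropic|claude|haiku|sonnet|opus|xai|grok|hugging ?face)\b", re.I
)
SPEC_TERMS = re.compile(
    r"\b(price|pricing|cost|context window|version|deprecat\w*|api param\w*|quantiz\w*)\b",
    re.I,
)
CANON_MIN_SCORE = 0.65  # threshold quoted in the routing rule


def is_canon_intent(query: str) -> bool:
    """True when the query names a vendor AND asks for a spec-like fact."""
    return bool(VENDOR_TERMS.search(query) and SPEC_TERMS.search(query))


def needs_supplement(canon_hits: list[dict]) -> bool:
    """True when canon returned nothing or its best hit is below the threshold.

    Assumes each hit dict carries a numeric "score" field.
    """
    if not canon_hits:
        return True
    return max(hit["score"] for hit in canon_hits) < CANON_MIN_SCORE
```

Under these assumptions, "Anthropic Haiku 4.5 price" is routed canon-first, while "RAG pattern" falls through to the rest of the playbook. Two regex searches per query are cheap, so the routing step stays deterministic with no model call.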
Crawler internals
# fukuro/crawler.py: main loop
import asyncio
from urllib.parse import urlparse

import httpx

# Rule, CrawlReport, guard_url, last_seen, parser, blob_store and ingest_canon
# are project-local names imported elsewhere in this module.


async def crawl_once(allowlist: list[Rule]) -> CrawlReport:
    report = CrawlReport()
    async with httpx.AsyncClient(
        headers={"User-Agent": "fukuro-crawler/0.1 (+marc personal)"},
        timeout=30.0,
        limits=httpx.Limits(max_connections=4),
    ) as client:
        for rule in allowlist:
            urls = await rule.fetcher.discover(client)  # sitemap or doc index
            for url in urls:
                guard_url(url, allowlist)  # fail fast
                try:
                    # Assumed: last_seen() returns an HTTP-date string, or None on first crawl.
                    seen = last_seen(url)
                    headers = {"If-Modified-Since": seen} if seen else {}
                    resp = await client.get(url, headers=headers)
                    if resp.status_code == 304:
                        report.skipped_304 += 1
                        continue
                    if resp.status_code == 404:
                        report.errors.append(("404", url))
                        continue
                    resp.raise_for_status()  # other 4xx/5xx: log below, never ingest an error page
                    content = parser.extract_main(resp.text)
                    await blob_store.put(
                        f"s3://canon/{rule.host}{urlparse(url).path}.html",
                        resp.content,
                    )
                    await ingest_canon(url=url, content=content, rule=rule)
                    report.ingested += 1
                except httpx.HTTPError as e:
                    report.errors.append((str(e), url))
                finally:
                    # throttle 1 req/s/host; runs on the 304/404 `continue` paths too
                    await asyncio.sleep(1.0)
    return report
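The `If-Modified-Since` header in the loop above only triggers a 304 if its value is a valid HTTP-date. The source does not show what `last_seen` returns, so this is a sketch under the assumption that the last fetch time is stored as a `datetime`. The helper name `conditional_headers` is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Optional


def conditional_headers(last_fetched: Optional[datetime]) -> dict:
    """Build headers for a conditional GET; empty on the first crawl of a URL."""
    if last_fetched is None:
        return {}
    # usegmt=True emits the "GMT" suffix HTTP expects and requires a UTC-aware value.
    # Note: astimezone() treats a naive datetime as local time.
    as_utc = last_fetched.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}
```

Returning an empty dict on first sight avoids sending a header with no value, and the result can be passed directly as `headers=` to `client.get`.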
Mode A — fukuro-audit Skill
# ~/.claude/skills/fukuro-audit/SKILL.md
---
name: fukuro-audit
description: |
  Audit AI infrastructure: ideation or production. Trigger:
    "fukuro audit ideation: <idea>" → JTBD/prior-art/scope/ROI/deps/alts rubric
    "fukuro audit production: <project-slug>" → 6 categories code scan + Grok 4.3 judge
  Outputs ADHD-friendly digest: health score + collapsible P0/P1/P2 findings.
---
Audit branches:
- Ideation: rubric covering JTBD clarity, prior-art collision (via _canon search), scope realism, ROI estimate, dependency risks, alternatives considered.
- Production: scans the project repo for AI config (model IDs, prompt files, API params) and cross-checks it against _canon for drift (e.g. deprecated model, price change, new better option); Grok 4.3 judges severity into P0/P1/P2.
Both branches cite _canon URLs so the user can verify every claim.
Performance numbers
Measured on MacBook Pro M2 Max (shared with tako daemon):
| Operation | Number | Note |
|---|---|---|
| Daily crawl wall time | ~6 min | ~180 URLs, 1 req/s throttle, async I/O |
| Crawl HTTP rate | 1 req/s/host | well under vendor rate limits |
| bge-m3 embed (shared) | 0.39 chunks/s on CPU | already sunk cost; reused |
| _canon chunks inserted | ~3,400 | initial seed crawl |
| _canon storage | 28 MB | small vs _personal (~3.2 GB) |
| Routing decision overhead | +12 ms p50 | intent-classifier in orchestrator |
| kb_search_canon p50 | ~820 ms | comparable to kb_search_personal (shared store) |
| Crawler RAM | ~80 MB | when running |
| Crawler CPU | <5% on M2 Max | during 6-min crawl |
Reliability features
| Feature | How |
|---|---|
| Idempotent re-crawl | SHA-256 hash check; skip when content unchanged |
| Delta-aware fetching | If-Modified-Since header; 304 short-circuits embed |
| Allowlist guard | Fires on every URL + every workspace write attempt |
| Raw HTML archive | MinIO mount preserves audit trail of crawled state |
| Auto-restart | launchd KeepAlive on crash |
| 404 logging | Per-URL counter; surfaces vendor URL rot |
| Throttle | 1 req/s per host enforced via asyncio.sleep |
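The idempotent re-crawl row can be illustrated with a small content-hash check. This is a sketch, not the project's code. The function names are hypothetical, and the in-memory `seen` dict stands in for whatever table actually stores the last digest per URL.

```python
import hashlib


def content_digest(content: str) -> str:
    """Stable SHA-256 fingerprint of the extracted main text."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def should_reingest(url: str, content: str, seen: dict) -> bool:
    """Return True, and record the new digest, only when the content changed."""
    digest = content_digest(content)
    if seen.get(url) == digest:
        return False  # unchanged: skip chunking and embedding
    seen[url] = digest
    return True
```

The hash check complements `If-Modified-Since`. A server may ignore the conditional header and return 200 with identical content, and the digest still prevents a redundant embed.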
Security & integrity model
| Concern | Mitigation |
|---|---|
| Allowlist tampering | allowlist.yml checked into git; daemon refuses to start if hash mismatch |
| Workspace write bypass | Server-side guard_workspace_write() rejects non-crawler callers |
| Filesystem ingest leak | Mount-watcher disabled for _canon/ path |
| Manual MCP kb_ingest to _canon | Server returns 403 workspace_protected |
| Vendor doc poisoning | Out of scope (trust the vendor); if vendor doc is wrong, that’s their bug |
| Crawler crash leaking partial data | Transactional upsert: chunks committed only after full source row |
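The allowlist-tampering mitigation could be implemented as a startup digest check. The source does not say where the expected hash comes from, so the `pinned_sha256` argument, the `verify_allowlist` name, and the `AllowlistTampered` exception are all assumptions in this sketch.

```python
import hashlib
from pathlib import Path


class AllowlistTampered(RuntimeError):
    """Raised when allowlist.yml does not match the pinned digest."""


def verify_allowlist(path: Path, pinned_sha256: str) -> None:
    """Refuse to continue if the on-disk allowlist differs from the pinned digest."""
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    if actual != pinned_sha256:
        raise AllowlistTampered(
            f"{path} digest {actual[:12]}... does not match the pinned value"
        )
```

Calling this before the first fetch means an edited allowlist stops the daemon before anything is crawled.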
Cost
| Item | Cost |
|---|---|
| Compute (M2 Max shared) | $0 |
| Postgres + pgvector (local) | $0 |
| MinIO (local) | $0 |
| LLM in hot path | $0 (deterministic crawler; no LLM call) |
| LLM in audit (Mode A, optional) | ~$0 per invocation (Grok 4.3 judge, sparse) |
| Total | ~$0/month |
Reproducibility — for a forker
# Prereqs: Personal-RAG (tako) already running with Postgres + pgvector + MinIO
cd ~/Documents/Side.Projects/tako/server
git pull # tako v0.6.0-s3 or later
# 1. Register _canon workspace
psql ragkb -c "INSERT INTO workspaces (name, writable_by)
               VALUES ('_canon', 'fukuro-crawler');"
# 2. Drop in fukuro/ folder, edit allowlist.yml to your taste
cp -r fukuro/ ~/Documents/Side.Projects/tako/server/
# 3. Install Claude Skill
cp -r skills/fukuro-audit ~/.claude/skills/
# 4. Install launchd timer
cp launchd/ai.vuihoc.fukuro-crawl.daily.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/ai.vuihoc.fukuro-crawl.daily.plist
# 5. First crawl (manual)
python -m fukuro.crawler --once
Total: 30 min if Personal-RAG already running.
Future work
- Add pm_canon workspace (Marty Cagan, Lenny, Reforge canon): same daemon shape, different allowlist
- Add design_canon (Linear, Notion, Apple HIG)
- 404-driven auto-deindex (when vendor removes a page)
- Re-evaluate Mode B after 4 weeks of Mode A usage data
- Selector-drift detector — auto-alert when parsed content shrinks >50% vs prior crawl
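The selector-drift item reduces to a simple ratio test. As a sketch of the idea only (the feature is not built yet, and the function name and the choice to compare character counts are assumptions):

```python
def content_shrank(prev_len: int, new_len: int, threshold: float = 0.5) -> bool:
    """True when the new extraction lost more than `threshold` of the previous length."""
    if prev_len <= 0:
        return False  # first crawl: nothing to compare against
    return (prev_len - new_len) / prev_len > threshold
```

A page that drops from 1,000 to 400 extracted characters would trip the alert. A page that grows, or shrinks by exactly half, would not.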
License & attribution
Personal project. Built on:
- Personal-RAG / tako — workspace foundation
- bge-m3 — shared embedder
- Claude Skill SDK — Mode A