AI-Canon-Crawler — PRD
Size S · P1 · Foundation
Status: ✅ Shipped 2026-05-24 (see Implementation for build details)
Actual: ~1 day from design lock → daemon live
1. Problem
Personal-RAG has ~50k sources mixed together: personal notes, project READMEs, meeting transcripts, vendor docs. When asked a factual question about an external AI vendor (pricing, context window, quantization size, API parameter), the retriever surfaces personal notes from months ago before the vendor’s current docs.
Concrete bug: query “What’s the current price of Claude Haiku 4.5?” → top result was a February note speculating about pricing (score 0.81). The vendor’s actual pricing page ranked second (0.74).
Pain: ~22% hallucination rate on vendor-fact questions (judged sample). The retriever isn’t broken — semantic similarity is what it was asked to optimize. The problem is trust tier: personal notes and vendor docs were the same kind of object to the retriever.
Why now: hallucinations on vendor facts were starting to leak into blog drafts and project decisions. A fix at the embedding layer wouldn’t help — the issue is upstream of retrieval.
2. Goal & Success Metrics
Goal: When a question is about an external AI vendor’s spec/price/version, the answer comes from the vendor’s current docs — not from a stale personal note.
Metrics (targets vs. achieved):
| Metric | Target | Achieved | Note |
|---|---|---|---|
| Hallucination rate on vendor facts | <5% | <3% | Judged on 50-question held-out set |
| _canon workspace size | 100+ docs | ~180 docs / 3,400 chunks | Anthropic + xAI + HF model cards |
| Crawl wall time | <15 min | ~6 min | Daily delta crawl |
| Routing overhead p50 | <30 ms | +12 ms | Tool-routing decision before retrieval |
| Storage footprint | <100 MB | 28 MB | Small vs personal workspace (~3.2 GB) |
3. User journey
- User asks Claude (any client): “What’s the current Haiku 4.5 input price?”
- MCP orchestrator detects spec/price/version intent → routes to kb_search_canon first.
- _canon returns vendor’s pricing page chunks with high confidence.
- If _canon is empty, fall back to kb_search_personal.
- Claude synthesizes answer citing the vendor URL.
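The canon-first routing with personal fallback can be sketched as follows. This is a minimal illustration, not the orchestrator's actual playbook rule: the keyword pattern, the `route` function, and the injected search callables are all hypothetical stand-ins for kb_search_canon and kb_search_personal.

```python
import re
from typing import Callable, List

# Hypothetical intent pattern; the real playbook rule is not shown in this PRD.
VENDOR_FACT_PATTERN = re.compile(
    r"\b(price|pricing|cost|context window|version|parameter|quantization|specs?)\b",
    re.IGNORECASE,
)

SearchFn = Callable[[str], List[dict]]


def is_vendor_fact_query(query: str) -> bool:
    """Cheap keyword check for spec/price/version intent."""
    return bool(VENDOR_FACT_PATTERN.search(query))


def route(query: str, search_canon: SearchFn, search_personal: SearchFn) -> List[dict]:
    """Try the canon workspace first for vendor-fact queries, else personal."""
    if is_vendor_fact_query(query):
        hits = search_canon(query)
        if hits:  # an empty canon result falls through to personal
            return hits
    return search_personal(query)
```

Passing the two search functions in as arguments keeps the routing decision testable without a live MCP server.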
Parallel: daemon runs daily, crawls allowlist URLs, deltas only, embeds via shared bge-m3, upserts into _canon workspace.
4. Scope (MoSCoW) — final
Must — DONE:
- ✅ Dedicated _canon workspace in Postgres (extends tako schema)
- ✅ MCP tool kb_search_canon registered server-side
- ✅ Crawler daemon (Mode C), daily launchd timer
- ✅ Allowlist enforced on every crawl tick (no off-list ingests)
- ✅ Routing rule in MCP orchestrator playbook (spec/price/version → canon first)
- ✅ fukuro-audit Claude Skill (Mode A), ideation + production audit branches
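Allowlist enforcement on every crawl tick could look roughly like the check below. The hosts and path prefixes are placeholders, and `is_allowed` / `filter_tick` are hypothetical names; the actual allowlist code lives in Implementation.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist entries: (exact host, required path prefix).
ALLOWLIST = (
    ("docs.vendor-a.example", "/en/docs/"),
    ("vendor-b.example", "/models/"),
)


def is_allowed(url: str) -> bool:
    """Accept only https URLs whose host and path prefix match an entry."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    # Exact host match, so look-alike hosts with a suffix do not pass.
    return any(host == h and parts.path.startswith(p) for h, p in ALLOWLIST)


def filter_tick(urls):
    """Run at the start of each crawl tick; off-list URLs are dropped."""
    return [u for u in urls if is_allowed(u)]
```

Matching the parsed hostname exactly, rather than substring-matching the raw URL, is the design choice that keeps a look-alike domain from slipping through.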
Should — DONE:
- ✅ Idempotent re-ingest via SHA-256 hash check (shared with Personal-RAG)
- ✅ Overwrite-on-recrawl policy (canon is always latest vendor truth)
- ✅ Classifier label _canon distinct from _shared in tako orchestration
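The hash check and the overwrite-on-recrawl policy combine into a three-way decision per fetched page. The sketch below assumes a simple URL → hash mapping as the stored state; `plan_upsert` and that mapping are illustrative, not the shared Personal-RAG code.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the page text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_upsert(url: str, text: str, stored_hashes: dict) -> str:
    """Return 'skip', 'insert' or 'overwrite' and record the new hash.

    stored_hashes is a hypothetical {url: sha256} mapping standing in
    for whatever the real store keeps per document.
    """
    new = content_hash(text)
    old = stored_hashes.get(url)
    if old == new:
        return "skip"  # unchanged page: re-ingest is a no-op
    stored_hashes[url] = new
    # Changed content replaces the old version; canon keeps only the latest.
    return "insert" if old is None else "overwrite"
```

Only the "insert" and "overwrite" outcomes would need re-embedding, which is what makes a daily delta crawl cheap.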
Could — DROPPED:
- ❌ Mode B (continuous evidence gathering during conversation): dropped per scope cut. Reasoning: only 1 data point of demand (the original hallucination bug), and overlap with the existing kb-audit weekly job made the marginal value unclear. Re-evaluate after Mode A/C have 4 weeks of usage.
- ⏸️ pm_canon/design_canon workspaces: pattern proven, deferred to dedicated PRDs
- ⏸️ Web UI for browsing _canon content: Claude clients are sufficient
Won’t (M1):
- Multi-vendor allowlist expansion beyond Anthropic/xAI/HF (one at a time, measured)
- Auto-discovery of new vendor doc URLs (allowlist is a feature, not a limit)
- Real-time crawl on every conversation (daily is enough; cost/value not justified)
5. Architecture (final)
Two-mode design (after Mode B drop):
- Mode A: fukuro-audit Claude Skill, invoked by user; audits ideation or production projects against canon
- Mode C: Crawler daemon, runs daily via launchd; sole writer to _canon
See Architecture for diagrams.
6. Tech Stack — final choices
| Layer | Choice | Reason |
|---|---|---|
| Crawler runtime | Python 3.11 | Shared with tako daemon, single venv |
| Scheduler | launchd | Already managing tako mount-watcher + backups; one less moving part |
| HTML fetch | httpx (async) | Concurrent allowlist fetch; connection retries available via the transport's retries option |
| Parser | BeautifulSoup | Stable, sufficient for vendor doc HTML |
| Embedder | bge-m3 (shared) | Same model as Personal-RAG; no extra cold start |
| Vector store | Postgres 16 + pgvector | Shared HNSW index, one DB per workspace tag |
| Blob storage | MinIO S3 (BlobStore) | Reuses tako mount path; canonical raw HTML archived |
| Skill SDK | Claude Skill SDK | Mode A = fukuro-audit skill |
Cost posture: $0/month. Daemon runs on M2 Max alongside tako. No external infra.
7. Milestones — actual
| Phase | What shipped |
|---|---|
| Design | Mode A/B/C scoped on paper; Mode B dropped before any code written |
| Workspace | _canon added to tako server v0.6.0-s3 (schema + classifier + MCP tool) |
| Mode A | fukuro-audit Claude Skill — ideation + production branches |
| Mode C | Crawler + allowlist + daily launchd timer; first crawl ingested ~180 docs |
| Routing | MCP orchestrator playbook updated; _shared vs _canon distinction documented |
Ship DoD passed:
- ✅ Hallucination drop measured on 50-question set (<3%)
- ✅ Crawl runs nightly, deltas only, <15 min wall time
- ✅ _canon is the only workspace the crawler can write to (verified by code path)
- ✅ Routing rule live and tested with mixed-intent queries
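One way to make the "crawler writes only to _canon" property checkable in code is a thin writer wrapper that rejects any other workspace. This is an assumed shape, not the shipped code path: `CanonWriter`, `WorkspaceWriteError`, and the `upsert_fn` signature are hypothetical.

```python
CANON_WORKSPACE = "_canon"


class WorkspaceWriteError(PermissionError):
    """Raised when the crawler tries to write outside _canon."""


class CanonWriter:
    """Wraps a storage upsert so the only reachable target is _canon."""

    def __init__(self, upsert_fn):
        # upsert_fn(workspace, doc_id, chunks) is a hypothetical storage call.
        self._upsert = upsert_fn

    def write(self, workspace: str, doc_id: str, chunks: list) -> None:
        if workspace != CANON_WORKSPACE:
            raise WorkspaceWriteError(f"crawler may not write to {workspace!r}")
        self._upsert(CANON_WORKSPACE, doc_id, chunks)
```

If the crawler only ever holds a `CanonWriter`, a misrouted write fails loudly instead of landing in another workspace.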
8. Cost & Quota
| Item | Cost |
|---|---|
| Compute (M2 Max, shared with tako) | $0 |
| Postgres + pgvector (local) | $0 |
| MinIO S3 mount (local) | $0 |
| External LLM calls | $0 (crawler is deterministic; no LLM in hot path) |
| Total | ~$0/month |
9. Risks & open questions — outcomes
Risks identified at design:
- Allowlist drift (vendor changes URL structure) → daemon logs 404s loudly; manual allowlist patch when it happens
- _canon content leaking into _personal if classifier misfires → mitigated by workspace-level segregation (DB-level, not tag-level)
- Crawl bandwidth hitting vendor rate limits → throttled to 1 req/s per host; well under public limits
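A 1 req/s per-host throttle can be sketched with a per-host "next allowed time" table. The class below is a hypothetical, single-threaded illustration (an async crawler would need an awaitable sleep and a lock); the clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time
from urllib.parse import urlsplit


class PerHostThrottle:
    """Spaces requests to the same host at least min_interval seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time the next request may start

    def wait(self, url: str) -> float:
        """Block until the URL's host is ready; return the seconds waited."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # Reserve the next slot relative to when this request actually starts.
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay
```

Because the table is keyed by host, requests to different vendors do not delay each other.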
Resolved:
- Mode B value question → resolved by drop. Two modes is enough.
- _shared vs _canon ambiguity → resolved with explicit rule in orchestrator playbook: _shared = Marc’s curated notes (subjective), _canon = vendor truth (objective)
Open (M2):
- Q: When to add pm_canon/design_canon? → Wait for measurable pain in those domains; don’t build speculatively.
- Q: How to detect a stale _canon entry if vendor silently removes a page? → 404-tracking + auto-deindex; not yet implemented.
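Since 404-tracking with auto-deindex is not yet implemented, the following is only one possible shape for it. It assumes a page should be de-indexed after several consecutive "gone" responses rather than on the first one; the threshold of 3, the inclusion of 410, and the `StaleTracker` name are all assumptions.

```python
from collections import defaultdict


class StaleTracker:
    """Counts consecutive 404/410 responses per URL across crawl ticks."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold  # hypothetical: misses before de-indexing
        self._misses = defaultdict(int)

    def record(self, url: str, status_code: int) -> bool:
        """Record one crawl result; return True when url should be de-indexed."""
        if status_code in (404, 410):
            self._misses[url] += 1
            return self._misses[url] >= self.threshold
        if 200 <= status_code < 300:
            # A successful fetch resets the streak.
            self._misses.pop(url, None)
        # Other statuses (e.g. 5xx) neither count as a miss nor reset the streak.
        return False
```

Requiring a streak rather than a single 404 avoids dropping a page during a brief vendor outage or a transient URL restructure.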
10. Definition of Done
Ship done: ✅ 2026-05-24 — _canon workspace live, crawler daemon running daily, fukuro-audit skill installed, routing rule deployed, hallucination metric measured.
Production-stable done (4-week criterion):
- ⏳ 4 consecutive weeks of daily crawl with no manual intervention
- ⏳ Hallucination rate stays <5% over a rolling 50-question sample
- ⏳ No _canon entries written by any path other than the crawler
See also
- Architecture — Mode A vs Mode C, crawl pipeline, routing
- Implementation — allowlist code, classifier wiring, perf
- Notes — chronological decisions including the Mode B drop