Notes & Decision Log

Format: YYYY-MM-DD — context — decision/finding.

Decisions

2026-05-24 — Auto-fix policy live. Weekly/monthly tiers now route verified findings through: Grok 4.3 propose → safety heuristics filter (≤5 lines, no high-risk path, scoped to 1 file, no rm/DROP) → Grok 4.3 judge re-verify → git stash --include-untracked snapshot → apply. Telegram digest shows the revert SHA for one-click rollback. Manual review-and-merge gate retained for high-risk types (URL/port/credential/infra topology). ADHD format finalized: severity emoji first, action verb opens, time-box estimate, bundled 2×/day.
2026-05-23 — PRODUCTION SWAP: Haiku 4.5 → Grok 4.3 as judge. After Eval-Framework bake-off on a 99-finding holdout. Strict-substring scoring said Haiku 13% / Sonnet 60% / Opus 67% / Grok 80%. LLM-judged real-accuracy (the right metric) said Grok 99%. Strict scorer counted alternate valid findings as failures — the “Eval Scorer Strict Trap”. Production cost at personal scale ~$0.61/mo; at SMB scale (15K audits/mo) ~$90/mo vs SaaS audit tools $500–2,000/mo. Shipped same afternoon.
2026-05-20 — Audit corpus profile measured: VN 21% / mixed 28% / EN 40% → effective ~50/50 bilingual. Confirms Qwen dense primary (bge-m3 retrieval + Qwen detector both multilingual SOTA). Code-mixed 28% = the hardest bucket; flagged as a dedicated eval subtask for Eval-Framework. 4B + Haiku verifier remains the daily sweet spot (80% accuracy, 3s, 2.5 GB RAM).
2026-02-x — Event-triggered tier added. Post-commit hook on the KB-s3 mount (Personal-RAG canonical write target) fires a scoped audit on the changed files within minutes. Catches drift introduced by manual file edits before it propagates into AI assertion chains.
2025-11-15 — 4-tier cron complete. Initial weekly cadence extended to daily (light, 4B+Haiku) + weekly (8B+Haiku) + monthly (32B alone, Layer 3 cross-workspace). Tier-specific models because: most findings are caught by smaller scope at higher frequency; cross-workspace contradictions are slower-evolving and tolerate monthly.
2025-11-09 — Hardening pass. PII redaction (12 regex patterns + email partial-redact, 6 unit tests, 100% pass). launchd RunAtLoad: true + skip-if-recent state file for Mac off/sleep catch-up. Fingerprint-based suppression added with expiry (later deprecated — see gotcha below).
2025-11-08 — Day-one deploy. 3 layers (intra-file static / cross-file LLM / cross-workspace LLM) + weekly cron + Telegram digest. Initial corpus: 79 files. First run surfaced 25+ real findings including the Oracle A1 hallucination, the rag-kb hardware drift (M1 16 GB → M2 Max 64 GB), workspace count inconsistency (6 vs 4), Inko port mismatch (8081 vs 9090), Python version drift on health-coach (3.10 vs 3.11). All 25+ confirmed real on manual review.
2025-11-08 (early) — Build vs Buy decision: build. SaaS audit tools (Promptfoo + Inspect AI bundle, $500–2,000/mo) were not viable because: (a) KB contains internal product docs + customer data + IP (data residency / VN DPD / NDA), (b) heterogeneous source formats (memory + CLAUDE.md + project NOTES + meeting transcripts + Confluence dump) that no SaaS tool parses in full, (c) cost-quality frontier favors build at SMB scale (~$90/mo vs $500–2,000/mo), payback 1–2 months, (d) prompt iteration in one afternoon vs vendor roadmap dependency.
2025-11-08 (early) — Re-use Personal-RAG retrieval, don’t build own index. bge-m3 + Postgres + pgvector already index the corpus; duplicating would waste 3.2 GB of disk and the bilingual SOTA retrieval quality already measured (Hit@3 = 97.8%). Audit reads through Personal-RAG for cross-source lookup.
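
Sketch for the 2026-05-24 auto-fix entry: one way the safety-heuristics filter could look in Python. The ProposedFix shape, the path fragments in HIGH_RISK_PATH_PARTS, and the reading of "≤5 lines" as added plus removed lines are assumptions for illustration. The log names the rules but not how they are implemented.

```python
import re
from dataclasses import dataclass

# Assumed values: the log lists the rule categories, not concrete paths.
HIGH_RISK_PATH_PARTS = (".env", "secrets", "credentials", "infra/")
MAX_CHANGED_LINES = 5  # assumption: added + removed lines together
# Word-boundary match so "confirm" does not trip the "rm" rule.
DESTRUCTIVE = re.compile(r"\b(?:rm|DROP)\b")


@dataclass
class ProposedFix:
    """Hypothetical container for a patch proposed by the model."""
    files: list[str]          # paths the patch touches
    added_lines: list[str]    # lines the patch adds
    removed_lines: list[str]  # lines the patch removes


def passes_safety_heuristics(fix: ProposedFix) -> tuple[bool, str]:
    """Return (ok, reason). Any failure should route to manual review."""
    if len(fix.files) != 1:
        return False, "not scoped to exactly one file"
    if any(part in fix.files[0] for part in HIGH_RISK_PATH_PARTS):
        return False, "high-risk path"
    if len(fix.added_lines) + len(fix.removed_lines) > MAX_CHANGED_LINES:
        return False, "more than 5 changed lines"
    if any(DESTRUCTIVE.search(line) for line in fix.added_lines):
        return False, "destructive command in patch"
    return True, "ok"
```

A fix that fails any check would fall back to the manual review-and-merge gate instead of reaching the judge re-verify and stash-snapshot steps.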

Gotchas

2026-05-23 — Eval Scorer Strict Trap. Strict-substring scoring on multi-valid-output tasks counts alternate valid findings as failures. Grok 4.3 was real-99% but strict-80% — easy to under-pick a backend if you tune-before-LLM-judge. Cross-ref: same lesson hit on Dojo eval the same week. Verify with LLM-judge BEFORE tuning prompt/model.
2026-01-x — Fingerprint suppression breaks. First version stored (file, finding_type, evidence_hash) fingerprints in suppressions.json with expiry. Added 14 suppressions for legitimate-but-flagged findings. Next run: 7 NEW findings, all logically equivalent to suppressed ones but with different fingerprints, because the LLM rephrases findings on every run (slightly different wording, different example quote, different hash). Was playing whack-a-mole. Fix: deprecated fingerprint suppression. Instead, add an explanatory comment in the source file itself. The LLM reads the file, sees the explanation, stops flagging. Durable, no hash dependency. Deeper lesson: bad suppression hides the problem; good context teaches the auditor.
2025-11-x — macOS cron doesn’t catch up missed runs. If the Mac is asleep at 03:00, cron never fires. Fix: use launchd with RunAtLoad: true + state-file skip-if-recent guard. On wake/boot the audit fires, state file prevents duplicate runs in the same window. Defense-in-depth: three independent catch-up layers.
2025-11-x — Hooks running binaries inside ~/Documents/ hang silently until Full Disk Access is granted per-binary in System Settings. Fix: grant FDA to /bin/bash and the venv python, or keep scripts outside ~/Documents/.
2025-11-08 — Default cron was sending findings on first run before redaction was wired up. Caught immediately on review; rotated the leaked API key that was in one of the sources. Mandatory: redaction unit test in CI before any deploy that touches LLM calls.
2025-11-08 — MLX OOM on Qwen 32B with full corpus. First monthly run blew up with a Metal “Impacting Interactivity” crash. Fix: lowered --max-seq to 4096 for 32B tier + gradient-checkpoint mode + only run when no other heavy app is open. Cross-ref the broader mlx-lm LoRA Metal crash lesson: same root cause (competing with Personal-RAG daemon).
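
Sketch for the launchd catch-up gotcha above: a minimal skip-if-recent guard that an audit entry point could call when launchd fires it on wake or boot. The state-file location and the 20-hour window are hypothetical; the log states neither.

```python
import time
from pathlib import Path
from typing import Optional

# Hypothetical path and window, chosen only for illustration.
STATE_FILE = Path.home() / ".knowledge-audit" / "last_run"
MIN_INTERVAL_SECONDS = 20 * 60 * 60


def should_run(state_file: Path = STATE_FILE,
               min_interval: float = MIN_INTERVAL_SECONDS,
               now: Optional[float] = None) -> bool:
    """True if no run was recorded within the last `min_interval` seconds."""
    now = time.time() if now is None else now
    try:
        last = float(state_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return True  # missing or unreadable record: run the audit
    return (now - last) >= min_interval


def mark_run(state_file: Path = STATE_FILE,
             now: Optional[float] = None) -> None:
    """Record the run time so later triggers in the same window skip."""
    now = time.time() if now is None else now
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(str(now))
```

The entry point would check should_run() first, exit early when it returns False, and call mark_run() only after the audit completes, so a crashed run does not suppress the next trigger.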

Production-swap retrospective: Haiku 4.5 → Grok 4.3

The single most impactful decision in the project’s life. Sequence:

Shipped with Haiku 4.5 as judge (cheap, fast, familiar). Strict-substring scorer reported 13% accuracy on the audit task. Assumed “audit must be hard for small models”.
Built Eval-Framework for unrelated reasons.
Ran knowledge-audit task through Eval-Framework’s bake-off with LLM-judge scoring → real accuracy numbers were very different from strict-substring.
Grok 4.3 emerged as the dominant pick: 99% LLM-judged accuracy (vs Haiku's 13% strict-substring score), at 1/3 the cost of Sonnet alone, and 30+ points above Opus's 67% strict-substring score.
Same afternoon: swap in production, re-run last 4 weekly findings against new judge for backfill, confirm no regression.
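
Sketch of why the two scorers in the sequence above can disagree so sharply. The data is a toy set invented for illustration (loosely modeled on the day-one findings), and token_overlap_judge is only a deterministic stand-in for the LLM judge, which in production is a model call.

```python
import re

# Toy data, invented for illustration: each output is a valid finding,
# but only one repeats the reference string verbatim.
OUTPUTS = [
    "Port mismatch: README says 8081, config says 9090",
    "Python version drift: 3.10 vs 3.11",
    "Workspace count differs between files (6 and 4)",
]
EXPECTED = ["8081 vs 9090", "3.10 vs 3.11", "6 vs 4"]

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def strict_substring_score(outputs, expected):
    """Fraction of outputs containing the reference string verbatim."""
    hits = sum(exp.lower() in out.lower() for out, exp in zip(outputs, expected))
    return hits / len(outputs)


def token_overlap_judge(output, reference):
    """Stand-in judge: accept if every number in the reference appears."""
    found = set(NUMBER.findall(output))
    return all(tok in found for tok in NUMBER.findall(reference))


def judged_score(outputs, expected, judge=token_overlap_judge):
    """Fraction of outputs the judge accepts as matching the reference."""
    hits = sum(judge(out, exp) for out, exp in zip(outputs, expected))
    return hits / len(outputs)


strict = strict_substring_score(OUTPUTS, EXPECTED)
judged = judged_score(OUTPUTS, EXPECTED)
```

On this toy set the strict scorer accepts only the verbatim match while the stand-in judge accepts all three, which is the same shape of gap as the strict-vs-judged numbers in the log.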

Cost of the swap: ~5 hours. Value: confidence in every future audit output went from “I should manually verify ~half” to “I trust the surfaced findings”.

Lesson encoded as Marc-rule: production LLM judge default = Grok 4.3 per Dojo eval, NOT Haiku. Defaulting to Haiku because of SDK familiarity is a real bias and has happened twice.

Reference links

The triggering blog post: Why your AI memory lies to you
The production-swap blog post: Knowledge-Audit for RAG-KB: From 80% to 99% with Grok 4.3
Personal-RAG (substrate): /projects/personal-rag-kb
Eval-Framework (verified the swap): /projects/eval-framework
launchd RunAtLoad: https://developer.apple.com/documentation/xpc/xpc_services_xpc_launchd_plist
Anthropic Claude API: https://docs.anthropic.com/
xAI Grok API: https://docs.x.ai/

Working-session log