● Shipped P0 Size M Foundation

Knowledge-Audit — Notes

Chronological decision log, Grok 4.3 production-swap retrospective, gotchas.

Notes & Decision Log

Format: YYYY-MM-DD — context — decision/finding.

Decisions

  • 2026-05-24: Auto-fix policy live. Weekly and monthly tiers now route verified findings through: Grok 4.3 proposes a fix → safety heuristics filter (≤5 lines, no high-risk path, scoped to 1 file, no rm/DROP) → Grok 4.3 judge re-verifies → git stash --include-untracked snapshot → apply. The Telegram digest shows the revert SHA for one-click rollback. The manual review-and-merge gate is retained for high-risk finding types (URL, port, credential, infra topology). ADHD format finalized: severity emoji first, opens with an action verb, time-box estimate, bundled 2×/day.
  • 2026-05-23: PRODUCTION SWAP: Haiku 4.5 → Grok 4.3 as judge, after an Eval-Framework bake-off on a 99-finding holdout. Strict-substring scoring gave Haiku 13% / Sonnet 60% / Opus 67% / Grok 80%. LLM-judged real accuracy (the right metric) gave Grok 99%. The strict scorer counted alternate valid findings as failures: the “Eval Scorer Strict Trap”. Production cost at personal scale is ~$0.61/mo; at SMB scale (15K audits/mo) it is ~$90/mo vs $500–2,000/mo for SaaS audit tools. Shipped the same afternoon.
  • 2026-05-20: Audit corpus profile measured: VN 21% / mixed 28% / EN 40% → effective ~50/50 bilingual. Confirms Qwen dense as primary (bge-m3 retrieval and the Qwen detector are both multilingual SOTA). Code-mixed (28%) is the hardest bucket; flagged as a dedicated eval subtask for Eval-Framework. 4B + Haiku verifier remains the daily sweet spot (80% accuracy, 3 s, 2.5 GB RAM).
  • 2026-02-x: Event-triggered tier added. Post-commit hook on the KB-s3 mount (Personal-RAG canonical write target) fires a scoped audit on the changed files within minutes. Catches drift introduced by manual file edits before it propagates into AI assertion chains.
  • 2025-11-15: 4-tier cron complete. Initial weekly cadence extended to daily (light, 4B+Haiku) + weekly (8B+Haiku) + monthly (32B alone, Layer 3 cross-workspace). Models are tier-specific because most findings are caught by a smaller scope at higher frequency, while cross-workspace contradictions evolve slowly and tolerate a monthly cadence.
  • 2025-11-09: Hardening pass. PII redaction (12 regex patterns + email partial-redact, 6 unit tests, 100% pass). launchd RunAtLoad: true + skip-if-recent state file for catch-up after the Mac has been off or asleep. Fingerprint-based suppression added with expiry (later deprecated; see the fingerprint gotcha below).
  • 2025-11-08: Day-one deploy. 3 layers (intra-file static / cross-file LLM / cross-workspace LLM) + weekly cron + Telegram digest. Initial corpus: 79 files. First run surfaced 25+ real findings including the Oracle A1 hallucination, the rag-kb hardware drift (M1 16 GB → M2 Max 64 GB), workspace count inconsistency (6 vs 4), Inko port mismatch (8081 vs 9090), Python version drift on health-coach (3.10 vs 3.11). All 25+ confirmed real on manual review.
  • 2025-11-08 (early): Build vs Buy decision: build. SaaS audit tools (Promptfoo + Inspect AI bundle, $500–2,000/mo) were ruled out because: (a) the KB contains internal product docs + customer data + IP (data residency / VN DPD / NDA), (b) source formats are heterogeneous (memory + CLAUDE.md + project NOTES + meeting transcripts + Confluence dump) and no SaaS tool parses all of these, (c) the cost-quality frontier favors build at SMB scale (~$90/mo vs $500–2,000/mo), payback 1–2 months, (d) prompt iteration takes one afternoon vs dependency on a vendor roadmap.
  • 2025-11-08 (early): Re-use Personal-RAG retrieval, don’t build a separate index. bge-m3 + Postgres + pgvector already index the corpus; duplicating that would waste 3.2 GB of disk and throw away the bilingual SOTA retrieval quality already measured (Hit@3 = 97.8%). Audit reads through Personal-RAG for cross-source lookup.
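
The safety heuristics gate from the 2026-05-24 entry could look roughly like the sketch below. It is a minimal illustration, not the project's actual filter: the ProposedFix shape, the high-risk path list, and the reading of "≤5 lines" as added plus removed lines are all assumptions, since the notes name the rules but not their exact definitions.

```python
import re
from dataclasses import dataclass

# Hypothetical values: the notes list the rules, not the concrete patterns.
HIGH_RISK_PATH_PARTS = (".env", "secrets", "infra/", "deploy/")
DESTRUCTIVE_PATTERN = re.compile(r"\brm\b|\bDROP\b")
MAX_CHANGED_LINES = 5  # assumed to mean added + removed lines


@dataclass
class ProposedFix:
    """One proposed edit: the files it touches and the lines it adds or removes."""
    files: list
    added: list
    removed: list


def passes_safety_heuristics(fix):
    """Return (ok, reason). Any failed rule sends the fix to manual review."""
    if len(fix.files) != 1:
        return False, "not scoped to exactly one file"
    if any(part in fix.files[0] for part in HIGH_RISK_PATH_PARTS):
        return False, "high-risk path"
    if len(fix.added) + len(fix.removed) > MAX_CHANGED_LINES:
        return False, "more than 5 changed lines"
    if any(DESTRUCTIVE_PATTERN.search(line) for line in fix.added):
        return False, "destructive command in added text"
    return True, "ok"


fix = ProposedFix(["notes/rag-kb.md"], ["RAM: 64 GB"], ["RAM: 16 GB"])
print(passes_safety_heuristics(fix))  # (True, 'ok')
```

Running the gate before the judge re-verify step keeps the cheap deterministic checks ahead of the second LLM call, which matches the order given in the pipeline above.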

Gotchas

  • 2026-05-23: Eval Scorer Strict Trap. Strict-substring scoring on multi-valid-output tasks counts alternate valid findings as failures. Grok 4.3 scored 99% real but 80% strict, so it is easy to under-pick a backend if you tune before running an LLM judge. Cross-ref: the same lesson hit on the Dojo eval the same week. Verify with an LLM judge BEFORE tuning the prompt or model.
  • 2026-01-x: Fingerprint suppression breaks. The first version stored (file, finding_type, evidence_hash) fingerprints in suppressions.json with expiry. Added 14 suppressions for legitimate-but-flagged findings. The next run produced 7 NEW findings, all logically equivalent to suppressed ones but with different fingerprints: the LLM rephrases findings on every run (slightly different wording, a different example quote, a different hash). It was whack-a-mole. Fix: deprecated fingerprint suppression. Instead, add an explanatory <!-- NOTE: ... --> comment in the source file itself. The LLM reads the file, sees the explanation, and stops flagging. Durable, no hash dependency. Deeper lesson: bad suppression hides the problem; good context teaches the auditor.
  • 2025-11-x: macOS cron doesn’t catch up on missed runs. If the Mac is asleep at 03:00, the cron job never fires. Fix: use launchd with RunAtLoad: true plus a state-file skip-if-recent guard. On wake/boot the audit fires, and the state file prevents duplicate runs in the same window. Defense-in-depth: three independent catch-up layers.
  • 2025-11-x: Hooks running binaries inside ~/Documents/ hang silently until Full Disk Access is granted per-binary in System Settings. Fix: grant FDA to /bin/bash and the venv python, or keep scripts outside ~/Documents/.
  • 2025-11-08: The default cron sent findings on the first run before redaction was wired up. Caught immediately on review; rotated the leaked API key that was in one of the sources. Mandatory: a redaction unit test in CI before any deploy that touches LLM calls.
  • 2025-11-08: MLX OOM on Qwen 32B with the full corpus. The first monthly run crashed with a Metal “Impacting Interactivity” error. Fix: lowered --max-seq to 4096 for the 32B tier, enabled gradient-checkpoint mode, and only run when no other heavy app is open. Cross-ref the broader mlx-lm LoRA Metal crash lesson: same root cause (competing with the Personal-RAG daemon).
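
The skip-if-recent guard from the launchd gotcha can be sketched as two small helpers. This is an assumed shape, not the project's script: the state-file location, its JSON layout, and the 20-hour window are placeholders, because the notes only say that a state file prevents duplicate runs in the same window.

```python
import json
import time
from pathlib import Path

# Hypothetical defaults: the notes give neither the path nor the window length.
STATE_FILE = Path.home() / ".knowledge-audit" / "last_run.json"
MIN_INTERVAL_S = 20 * 60 * 60  # treat a run under ~20 h old as "recent"


def should_run(state_file, min_interval_s, now=None):
    """True if no run is recorded or the last one is older than the window."""
    now = time.time() if now is None else now
    try:
        last_run = json.loads(state_file.read_text())["last_run"]
    except (FileNotFoundError, KeyError, TypeError, ValueError):
        return True  # missing or unreadable state: run rather than skip
    return now - last_run >= min_interval_s


def record_run(state_file, now=None):
    """Persist the completion time so the next trigger can skip."""
    now = time.time() if now is None else now
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps({"last_run": now}))
```

A launchd job with RunAtLoad set would call should_run(STATE_FILE, MIN_INTERVAL_S) first, exit early when it returns False, and call record_run(STATE_FILE) only after the audit completes, so a crashed run is retried on the next trigger.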

Production-swap retrospective: Haiku 4.5 → Grok 4.3

The single most impactful decision in the project’s life. Sequence:

  1. Shipped with Haiku 4.5 as judge (cheap, fast, familiar). Strict-substring scorer reported 13% accuracy on the audit task. Assumed “audit must be hard for small models”.
  2. Built Eval-Framework for unrelated reasons.
  3. Ran knowledge-audit task through Eval-Framework’s bake-off with LLM-judge scoring → real accuracy numbers were very different from strict-substring.
  4. Grok 4.3 emerged as the dominant pick: 99% real accuracy (vs Haiku’s 13% strict-substring score), at 1/3 the cost of Sonnet alone, beating Opus by 30+ points.
  5. Same afternoon: swap in production, re-run last 4 weekly findings against new judge for backfill, confirm no regression.
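
The gap between the two scorers in steps 1 and 3 can be reproduced on a toy example. The sketch below is illustrative only: the two cases are made up (loosely modeled on the day-one findings), and stand_in_judge is a deterministic placeholder for the LLM-judge call, which is not described in these notes.

```python
from typing import Callable, NamedTuple


class Case(NamedTuple):
    expected: str     # reference finding from the holdout
    key_facts: tuple  # facts any valid finding must state (used by the stand-in judge)
    output: str       # what the model under test reported


def strict_substring(case):
    """Pass only if the reference wording appears verbatim in the output."""
    return case.expected.lower() in case.output.lower()


def stand_in_judge(case):
    """Placeholder for an LLM judge: accepts any output that states all key facts."""
    return all(fact.lower() in case.output.lower() for fact in case.key_facts)


def accuracy(cases, scorer: Callable[[Case], bool]) -> float:
    return sum(scorer(c) for c in cases) / len(cases)


CASES = [
    Case("port mismatch 8081 vs 9090", ("8081", "9090"),
         "Port mismatch 8081 vs 9090 for Inko"),
    Case("python version drift 3.10 vs 3.11", ("3.10", "3.11"),
         "health-coach pins Python 3.11 in one file and 3.10 in another"),
]

print(accuracy(CASES, strict_substring))  # 0.5
print(accuracy(CASES, stand_in_judge))    # 1.0
```

The second case is a correct finding phrased differently from the reference, which is exactly what the strict scorer counts as a failure.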

Cost of the swap: ~5 hours. Value: confidence in every future audit output went from “I should manually verify ~half” to “I trust the surfaced findings”.

Lesson encoded as Marc-rule: production LLM judge default = Grok 4.3 per the Dojo eval, NOT Haiku. Defaulting to Haiku because of SDK familiarity is a real bias and has happened twice.

Working-session log

Date | Hours | What | Outcome
2025-11-08 | ~8 h | Initial build (3 layers + weekly cron + Telegram + PII redaction) | Day-one deploy, 25+ findings
2025-11-09 | ~3 h | Hardening (launchd catch-up, fingerprint suppression, more redactions) | Stable weekly cadence
2025-11-15 | ~2 h | 4-tier cron split (daily/weekly/monthly + event) | Multi-cadence shipped
2026-01-x | ~1 h | Deprecate fingerprint suppression, switch to in-source comments | More durable, lower maintenance
2026-02-x | ~2 h | Post-commit hook on KB-s3 mount (event tier) | Sub-minute drift detection
2026-05-20 | ~1 h | Corpus profile measurement | Confirmed Qwen + bge-m3 stack
2026-05-23 | ~5 h | Grok 4.3 production swap (Eval-Framework bake-off + ship same day) | 80% → 99% real accuracy
2026-05-24 | ~3 h | Auto-fix policy + safety heuristics + ADHD format finalize | Weekly/monthly hands-off mode
Total | ~25 h | | Production-ready, $5/mo