Notes & Decision Log
Format: YYYY-MM-DD — context — decision/finding.
Decisions
- 2026-05-22 — 4-tier triage routing design. Daily = local 4B + Haiku verifier ($0.15/mo); Weekly = local 8B + Haiku ($0.30/mo); Monthly = 32B alone ($0); On-demand audit = Grok 4.3 ($0.61/mo). Total portfolio AI cost ~$1.10/mo. Distill attempt for the local tiers BLOCKED (4 Metal crashes; adapter scored 65% vs the 70% baseline, so training HURT the model). Queue retrain for the weekend with clean GPU state (quit competing apps, reboot, `--save-every 25 --max-seq 1024 --grad-checkpoint`).
- 2026-05-21 (F3c) — Production judge picked: Grok 4.3 at 99.0% LLM-judge accuracy on holdout-99, $0.61/mo at audit volume. Verified by running the frozen holdout exactly once with the `--final-decision` flag; appended to `holdout-runs.log`. Beat Opus 4.7 by +32pp; beat Haiku-as-direct-auditor (which collapsed to 13%) by +86pp. Correction: I wrongly defaulted to Haiku twice out of SDK familiarity; the production LLM-judge default per the dojo eval is Grok 4.3.
- 2026-05-21 (F3b) — Scoring breakthrough. Strict substring scored Grok 4.3 at 79.8% on holdout-99. Re-checking the 20 “failures” with an LLM-judge revealed 19/20 were valid alternate findings the rubric had simply not enumerated. Real accuracy = 99.0%, a +19.2pp gap between strict and LLM-judge scoring on the same outputs. Decision: LLM-judge becomes the default scorer for any multi-valid-output task. Strict stays available for single-canonical-answer tasks.
- 2026-05-21 (F3a) — Stratified eval set: 192 cases, 5 buckets × 3 languages. Bootstrapped via Haiku from a corpus sample, then hand-balanced. Split 93 dev (iterable) + 99 holdout (FROZEN). CLI guard refuses non-`--final-decision` runs on `holdout-*` paths. Tamper-evident `holdout-runs.log` records every final-decision run.
- 2026-05-21 (F2) — Pairwise Cohen’s κ added after first bake-off. Revealed Sonnet ↔ Opus = 0.66 (same Anthropic family → correlated bias) vs Grok ↔ Sonnet = 0.52 (independent providers). Decision: ensembles must avoid same-family pairs. Without κ this would have been invisible.
- 2026-05-21 (F2) — Gemini 2.5 Flash returned empty `candidates` on 99/99 holdout cases. Suspected safety filter or output format incompatibility; not investigated further (cheap to skip). Adapter normalized to `text=""` so the scorer flags 0% accuracy loudly. Lesson: sanity-test ALL providers with `--smoke` before committing to a bake-off.
- 2026-05-21 (F2) — LLM-as-judge scorer: Sonnet 4.6 verdicts (VALID/INVALID per finding). Cached by `(finding_hash, case_id)`, so re-runs across prompt iterations are free. +30% cost overhead on top of generation; acceptable at audit volume.
- 2026-05-21 (F1) — YAML for task config, not JSON. Multi-line system prompts + comments without escape pain. Pydantic schema validation on load. Adapter pattern = 30 LOC per provider (lazy abstraction: copy-paste 2 instances before generalizing; see Decision #3 in PM catches).
- 2026-05-21 (F1) — 6 cross-cloud judges from day 1: Anthropic Haiku/Sonnet/Opus, xAI Grok 4-fast + Grok 4.3, OpenAI GPT-5.4-mini, Gemini 2.5 Flash. Reason: build-vs-buy and migration questions are inherently cross-provider; vendor-lock-in to a single SDK from the start = the original sin.
- 2026-05-21 (F1) — Bootstrap CI 1000 resamples (95%) built in from day 1. Single accuracy number is misleading; CI overlap reveals “no statistical difference” between Grok 4-fast and GPT-5.4-mini (would have been a coin-flip pick otherwise).
- 2026-05-21 (F1) — Asyncio fan-out with semaphore=8. Provider SDKs are async-native, and cancellation is cleaner than with threading. 6 judges × 93 cases = 558 calls complete in ~12 min.
- 2026-05-21 (F1) — Postgres `eval_runs` audit log sharing Personal-RAG’s Postgres instance. Cheap drift queries via SQL; one source of truth for “how did this judge score last month”. `git_sha` column → reproducibility.
- 2026-05-20 — Distill attempt INCOMPLETE. 4 Metal crashes during mlx-lm LoRA training; adapter at iter 400 = 65% vs base = 70% → training HURT the model. Cause: competing with tako daemon + heavy apps for GPU. Queue weekend retrain with clean GPU state + `--save-every 25 --max-seq 1024 --grad-checkpoint`. Production ships Grok 4.3 cloud in the meantime.
- 2026-01-21 — Foundation hypothesis drafted: eval framework converts 3 PM questions (Migration / Prompt engineering / Build-vs-Buy) from “2-3 weeks guessing public benchmarks” → “evidence-based answers in a few hours”. 8 personal projects to apply against (Knowledge-Audit / Mail-Assistant / Mac-Translator / Voice-Assistant / Email-Filter / Personal-RAG / Diagram-Engine / Eval-Framework itself). Build narrow first (one task) → ship → extend.
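The bootstrap CI decision (F1) can be illustrated with a minimal percentile-bootstrap sketch. This is not the framework's actual code: the function names, the fixed seed, and the 0/1 per-case outcome encoding are assumptions made for the example. The only figures taken from the log are the 1000 resamples, the 95% level, and the 79-of-99 strict-scorer result.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean accuracy over per-case 0/1 outcomes.

    Returns (point_estimate, lower, upper). Hypothetical helper, not the
    framework's real API.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so reports are reproducible
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample n cases with replacement and record the accuracy.
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(outcomes) / n, means[lo_idx], means[hi_idx]


def intervals_overlap(a, b):
    """True when two (point, lower, upper) results have overlapping CIs."""
    return a[1] <= b[2] and b[1] <= a[2]


if __name__ == "__main__":
    # 79 passes out of 99 cases, matching the strict-scorer figure in the log.
    strict = [1] * 79 + [0] * 20
    point, lower, upper = bootstrap_ci(strict)
    print(f"accuracy={point:.3f} 95% CI=[{lower:.3f}, {upper:.3f}]")
```

Overlapping intervals are a quick screen for "no clear difference" rather than a formal significance test; a paired bootstrap on per-case differences would be the stricter check.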
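The asyncio fan-out with a semaphore of 8 might look roughly like the sketch below. `fan_out` and `call_judge` are hypothetical names; a real adapter would wrap a provider SDK call where the stand-in coroutine is passed in. Capturing exceptions per call instead of failing the whole batch is an assumption of this sketch, not something the log specifies.

```python
import asyncio


async def fan_out(judges, cases, call_judge, max_concurrency=8):
    """Run call_judge(judge, case) for every pair, at most max_concurrency at once.

    Returns a list of (judge, case, result_or_exception) in submission order.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def one(judge, case):
        async with sem:  # blocks while max_concurrency calls are in flight
            try:
                return judge, case, await call_judge(judge, case)
            except Exception as exc:  # keep one failure from sinking the batch
                return judge, case, exc

    # gather preserves the order in which the coroutines were passed in.
    return await asyncio.gather(*(one(j, c) for j in judges for c in cases))


if __name__ == "__main__":
    async def fake_judge(judge, case):
        # Stand-in for a provider SDK call.
        await asyncio.sleep(0.01)
        return f"{judge} scored case {case}"

    results = asyncio.run(fan_out(["judge-a", "judge-b"], range(5), fake_judge))
    print(len(results), "calls completed")
```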
7 PM-bias catches (single-day sprint)
Logged from the F1→F3 build session. Each one would have wasted budget, leaked compliance, or shipped the wrong roadmap if accepted blindly. Documented = catchable.
| # | AI bias direction | PM counter-prompt | Money/Time at risk |
|---|---|---|---|
| 1 | Tune model > Verify metric | “5 failed cases manually: do they really fail?” | $50–280/mo + 4–6h |
| 2 | Frame relative > absolute | “vs baseline / no-treatment?” | 8–12h dev + electricity |
| 3 | DRY > YAGNI | “How many real instances? <2 = defer” | 5h + double rewrite cost |
| 4 | More compute > Pareto | “ROI per pp? 80/20 split?” | $30/mo recurring + UX latency |
| 5 | Author-context > audience-fit | “Set scope upfront; strip personal anecdotes” | Marketing credibility erosion |
| 6 | Use-all-context > scope-filter | “Public: strip PII + codenames” | Compliance + competitive intel |
| 7 | Generalize N=1 > per-domain eval | “Is the evidence specific to this task?” | Production accuracy drop |
Full narrative in 7-pm-decisions-building-eval-framework.
Gotchas
- 2026-05-21 — Cohen’s κ NaN on perfect agreement. When both judges pass all 99 cases, `cohen_kappa_score` returns NaN (no disagreement = undefined). Fix: clamp to 1.0 with `note=degenerate` in the report; don’t propagate NaN into downstream stats.
- 2026-05-21 — Holdout integrity is a discipline failure, not a code failure. An operator under deadline pressure will absolutely “just run holdout one more time”. Fix: CLI guard refuses `holdout-*` paths without an explicit `--final-decision` flag; every final-decision run is appended to the tamper log; repeat runs print a loud overfit warning.
- 2026-05-21 — Bootstrap resample wall time scales linearly with N. 1000 resamples × 99 cases is fine; 10K resamples × 1000 cases would take on the order of minutes. Cap at 1000 by default; configurable.
- 2026-05-21 — Gemini’s empty `candidates` array masquerades as a successful API call (HTTP 200, no exception). Without an adapter-layer check, the scorer would silently score `text=""` → 0%. Fix: adapter raises a structured `EmptyResponseError` that the reporter renders with a ⚠ flag.
- 2026-05-21 — xAI rate limits hit at concurrency >8 on Grok 4.3. Semaphore set to 8; doesn’t bottleneck because LLM-judge scoring runs in parallel after generation.
- 2026-05-21 — OpenAI Responses API for GPT-5.4-mini requires `reasoning.effort` config that doesn’t exist in legacy `chat.completions`. Adapter wraps both shapes.
- 2026-05-21 — Anthropic `system` param is top-level, not in the messages list. Easy to forget when porting from the xAI/OpenAI adapter.
- 2026-05-21 — Verdict cache key must include the `case_id`, not just the finding hash: the same finding text can be VALID in one case and INVALID in another (depends on source context). Initial design without `case_id` produced silently-wrong scores.
- 2026-05-20 — mlx-lm LoRA training crashes with “Impacting Interactivity” warning when competing with tako daemon + heavy apps. Quit competing apps + reboot before training. Use `--save-every 25 --max-seq 1024 --grad-checkpoint` to bound RAM pressure.
- 2026-05-20 — Strict substring scorer ate 19 valid Grok findings out of 20 “failures”, the trap that motivated Eval-Framework’s default scorer change. Always LLM-judge a sample of “failures” before tuning the model.
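The κ gotcha above can be reproduced without any dependency. The sketch below implements Cohen’s κ by hand as a stand-in for scikit-learn’s `cohen_kappa_score` (so the degenerate case is visible in the formula) and wraps it with the clamp described in the fix. `cohen_kappa` and `safe_kappa` are illustrative names, and the `(value, note)` return shape is an assumption about what the report might carry.

```python
import math


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences.

    Returns NaN when expected agreement is 1 (both raters used one and the
    same label everywhere), because the formula becomes 0/0.
    """
    a, b = list(a), list(b)
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    p_expected = sum(
        (a.count(label) / n) * (b.count(label) / n) for label in set(a) | set(b)
    )
    if p_expected == 1.0:
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


def safe_kappa(a, b):
    """Return (kappa, note); clamp the undefined perfect-agreement case to 1.0."""
    kappa = cohen_kappa(a, b)
    if math.isnan(kappa):
        # p_expected == 1 only happens when both raters agree on every case.
        return 1.0, "degenerate"
    return kappa, None


if __name__ == "__main__":
    all_pass = ["pass"] * 99
    print(safe_kappa(all_pass, all_pass))
```

Keeping the note alongside the value lets a report distinguish a clamped 1.0 from a genuinely computed one.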
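The verdict-cache gotcha can be sketched as an in-memory cache keyed on `(finding_hash, case_id)`. Everything here is hypothetical scaffolding: the `VerdictCache` class, the SHA-256 hash, the whitespace/case normalization, and the `judge_fn` callback are assumptions, and the real cache may well be persisted rather than held in a dict.

```python
import hashlib


def finding_hash(text):
    """Stable hash of a finding's text (normalization is an assumption)."""
    return hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()


class VerdictCache:
    """Caches judge verdicts per (finding_hash, case_id) pair."""

    def __init__(self):
        self._store = {}
        self.misses = 0  # number of times the judge actually had to be called

    def get_or_judge(self, finding, case_id, judge_fn):
        # case_id is part of the key: identical finding text can be VALID for
        # one case and INVALID for another, depending on the source context.
        key = (finding_hash(finding), case_id)
        if key not in self._store:
            self.misses += 1
            self._store[key] = judge_fn(finding, case_id)
        return self._store[key]


if __name__ == "__main__":
    def toy_judge(finding, case_id):
        # Stand-in for an LLM-judge call; verdict depends on the case.
        return "VALID" if case_id == "case-1" else "INVALID"

    cache = VerdictCache()
    text = "Stale reference to a removed config key"
    print(cache.get_or_judge(text, "case-1", toy_judge))
    print(cache.get_or_judge(text, "case-2", toy_judge))
    print(cache.get_or_judge(text, "case-1", toy_judge), "misses:", cache.misses)
```

Keying on the finding hash alone would return the first case's verdict for every later case that happens to produce the same text, which is the silent mis-scoring the note describes.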
Reference links
- Anthropic SDK: https://github.com/anthropics/anthropic-sdk-python
- xAI SDK: https://github.com/xai-org/xai-sdk-python
- OpenAI SDK: https://github.com/openai/openai-python
- Google Gen AI SDK: https://github.com/googleapis/python-genai
- scikit-learn Cohen’s kappa: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html
- Bootstrap CI (Efron 1979): https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
- Companion blog: Eval-Framework PM journey
- Companion blog: 7 PM decisions
- Companion blog: LLM bake-off methodology
Working-session log
| Date | Hours | What | Outcome |
|---|---|---|---|
| 2026-01-21 | ~1 h | Foundation hypothesis + JTBD framing | Doc drafted; 8-project portfolio scoped |
| 2026-05-20 | ~6 h | Distill attempt (mlx-lm LoRA) | INCOMPLETE: 4 Metal crashes, adapter regressed to 65% vs 70% base; weekend retrain queued |
| 2026-05-21 morning | ~2 h | F1 — CLI scaffold + task YAML + Anthropic + xAI adapters | First single-judge score run works |
| 2026-05-21 midday | ~2 h | F2 — OpenAI + Gemini adapters + parallel fan-out + strict scorer | First bake-off table renders |
| 2026-05-21 afternoon | ~2 h | F3a — 192-case stratified eval set bootstrap; dev/holdout split | Eval set frozen |
| 2026-05-21 late | ~2 h | F3b — LLM-judge scorer + bootstrap CI + Cohen’s κ | Scoring breakthrough discovered (79.8% → 99.0%) |
| 2026-05-21 evening | ~2 h | F3c — Holdout-99 final-decision run; Grok 4.3 shipped to production | Production cost $0.61/mo verified |
| F1+F2+F3 subtotal | ~10 h | — | Foundation ready, first project served |
| 2026-05-21 night | ~4 h | F4 — Apply to Personal-RAG (93-query held-out retrieval eval) | Hit@3=97.8%, MRR=0.948, reranker shipped |
| 2026-05-22 | ~2 h | 4-tier triage routing design + 7-decisions blog draft | Design done; distill retry queued for weekend |