Notes & Decision Log

Format: YYYY-MM-DD — context — decision/finding.

Decisions

2026-05-22 — 4-tier triage routing design. Daily = local 4B + Haiku verifier (~~$0.15/mo); Weekly = local 8B + Haiku (~~$0.30/mo); Monthly = 32B alone (~~$0); On-demand audit = Grok 4.3 (~~$0.61/mo). Total portfolio AI cost ~$1.10/mo. Distill attempt for the local tiers BLOCKED (4 Metal crashes, adapter 65% regression vs 70% baseline = HURTING the model). Queue retrain for the weekend with clean GPU state (quit competing apps, reboot, --save-every 25 --max-seq 1024 --grad-checkpoint).
2026-05-21 (F3c) — Production judge picked: Grok 4.3 at 99.0% LLM-judge accuracy on holdout-99, $0.61/mo at audit volume. Verified by running the frozen holdout exactly once with --final-decision flag; appended to holdout-runs.log. Beat Opus 4.7 by +32pp; beat Haiku-as-direct-auditor (which collapsed to 13%) by +86pp. Correction: em sai 2 lần default Haiku do SDK familiarity — production LLM-judge default per dojo eval is Grok 4.3.
2026-05-21 (F3b) — Scoring breakthrough. Strict substring scored Grok 4.3 at 79.8% on holdout-99. Re-checking the 20 “failures” with an LLM-judge revealed 19/20 were valid alternate findings the rubric had simply not enumerated. Real accuracy = 99.0%. +19.2pp gap between strict and LLM-judge on the same outputs. Decision: LLM-judge becomes the default scorer for any multi-valid-output task. Strict stays available for single-canonical-answer tasks.
2026-05-21 (F3a) — Stratified eval set: 192 cases, 5 buckets × 3 languages. Bootstrapped via Haiku from a corpus sample, then hand-balanced. Split 93 dev (iterable) + 99 holdout (FROZEN). CLI guard refuses non---final-decision runs on holdout-* paths. Tamper-evident holdout-runs.log records every final-decision run.
2026-05-21 (F2) — Pairwise Cohen’s κ added after first bake-off. Revealed Sonnet ↔ Opus = 0.66 (same Anthropic family → correlated bias) vs Grok ↔ Sonnet = 0.52 (independent providers). Decision: ensembles must avoid same-family pairs. Without κ this would have been invisible.
2026-05-21 (F2) — Gemini 2.5 Flash returned empty candidates on 99/99 holdout cases. Suspected safety filter or output format incompatibility; not investigated further (cheap to skip). Adapter normalized to text="" so scorer flags 0% accuracy loudly. Lesson: sanity-test ALL providers with --smoke before committing to a bake-off.
2026-05-21 (F2) — LLM-as-judge scorer: Sonnet 4.6 verdicts (VALID/INVALID per finding). Cached by (finding_hash, case_id) — re-runs across prompt iteration are free. +30% cost overhead on top of generation; acceptable at audit volume.
2026-05-21 (F1) — YAML for task config, not JSON. Multi-line system prompts + comments without escape pain. Pydantic schema validation on load. Adapter pattern = 30 LOC per provider (lazy abstraction; copy-paste 2 instances before generalizing — Decision #3 in PM catches).
2026-05-21 (F1) — 6 cross-cloud judges from day 1: Anthropic Haiku/Sonnet/Opus, xAI Grok 4-fast + Grok 4.3, OpenAI GPT-5.4-mini, Gemini 2.5 Flash. Reason: build-vs-buy and migration questions are inherently cross-provider; vendor-lock-in to a single SDK from the start = the original sin.
2026-05-21 (F1) — Bootstrap CI 1000 resamples (95%) built in from day 1. Single accuracy number is misleading; CI overlap reveals “no statistical difference” between Grok 4-fast and GPT-5.4-mini (would have been a coin-flip pick otherwise).
2026-05-21 (F1) — Asyncio fan-out with semaphore=8. Provider SDKs are async-native; cleaner cancellation than threading. 6 judges × 93 cases = 558 calls completes in ~12 min.
2026-05-21 (F1) — Postgres eval_runs audit log sharing Personal-RAG’s Postgres instance. Cheap drift queries via SQL; one source of truth for “how did this judge score last month”. git_sha column → reproducibility.
2026-05-20 — Distill attempt INCOMPLETE. 4 Metal crashes during mlx-lm LoRA training; adapter at iter 400 = 65% vs base = 70% → training HURT the model. Cause: competing with tako daemon + heavy apps for GPU. Queue weekend retrain with clean GPU state + --save-every 25 --max-seq 1024 --grad-checkpoint. Production ships Grok 4.3 cloud in the meantime.
2026-01-21 — Foundation hypothesis drafted: eval framework converts 3 PM questions (Migration / Prompt engineering / Build-vs-Buy) from “2-3 weeks guessing public benchmarks” → “evidence-based answers in a few hours”. 8 personal projects to apply against (Knowledge-Audit / Mail-Assistant / Mac-Translator / Voice-Assistant / Email-Filter / Personal-RAG / Diagram-Engine / Eval-Framework itself). Build narrow first (one task) → ship → extend.

7 PM-bias catches (single-day sprint)

Logged from the F1→F3 build session. Each one would have wasted budget, leaked compliance, or shipped the wrong roadmap if accepted blindly. Documented = catchable.

#	AI bias direction	PM counter-prompt	Money/Time at risk
1	Tune model > Verify metric	”5 failed cases manually — do they really fail?”	$50–280/mo + 4–6h
2	Frame relative > absolute	”vs baseline / no-treatment?“	8–12h dev + electricity
3	DRY > YAGNI	”How many real instances? <2 = defer”	5h + double rewrite cost
4	More compute > Pareto	”ROI per pp? 80/20 split?”	$30/mo recurring + UX latency
5	Author-context > audience-fit	”Set scope upfront — strip personal anecdotes”	Marketing credibility erosion
6	Use-all-context > scope-filter	”Public — strip PII + codenames”	Compliance + competitive intel
7	Generalize N=1 > per-domain eval	”Is the evidence specific to this task?”	Production accuracy drop

Full narrative in 7-pm-decisions-building-eval-framework.

Gotchas

2026-05-21 — Cohen’s κ NaN on perfect agreement. When both judges pass all 99 cases, cohen_kappa_score returns NaN (no disagreement = undefined). Fix: clamp to 1.0 with note=degenerate in the report; don’t propagate NaN into downstream stats.
2026-05-21 — Holdout integrity is a discipline failure, not a code failure. Operator under deadline pressure will absolutely “just run holdout one more time”. Fix: CLI guard refuses holdout-* paths without explicit --final-decision flag; every final-decision run appended to tamper log; repeat runs print a loud overfit warning.
2026-05-21 — Bootstrap resample wall time scales linearly with N. 1000 resamples × 99 cases is fine; 10K × 1000 cases would be ~minutes. Cap at 1000 by default; configurable.
2026-05-21 — Gemini’s empty candidates array masquerades as a successful API call (HTTP 200, no exception). Without an adapter-layer check, scorer would silently score text="" → 0%. Fix: adapter raises a structured EmptyResponseError that the reporter renders with a ⚠ flag.
2026-05-21 — xAI rate limits hit at concurrency >8 on Grok 4.3. Semaphore set to 8; doesn’t bottleneck because LLM-judge scoring runs in parallel after generation.
2026-05-21 — OpenAI Responses API for GPT-5.4-mini requires reasoning.effort config that doesn’t exist in legacy chat.completions. Adapter wraps both shapes.
2026-05-21 — Anthropic system param is top-level, not in messages list. Easy to forget when porting from xAI/OpenAI adapter.
2026-05-21 — Verdict cache key must include the case_id not just the finding hash — same finding text can be VALID in one case and INVALID in another (depends on source context). Initial design without case_id produced silently-wrong scores.
2026-05-20 — mlx-lm LoRA training crashes with “Impacting Interactivity” warning when competing with tako daemon + heavy apps. Quit competing apps + reboot before training. Use --save-every 25 --max-seq 1024 --grad-checkpoint to bound RAM pressure.
2026-05-20 — Strict substring scorer ate 19 valid Grok findings out of 20 “failures” — the trap that motivated Eval-Framework’s default scorer change. Always LLM-judge a sample of “failures” before tuning the model.

Reference links

Anthropic SDK: https://github.com/anthropics/anthropic-sdk-python
xAI SDK: https://github.com/xai-org/xai-sdk-python
OpenAI SDK: https://github.com/openai/openai-python
Google Gen AI SDK: https://github.com/googleapis/python-genai
scikit-learn Cohen’s kappa: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html
Bootstrap CI (Efron 1979): https://en.wikipedia.org/wiki/Bootstrapping_(statistics)
Companion blog: Eval-Framework PM journey
Companion blog: 7 PM decisions
Companion blog: LLM bake-off methodology

Working-session log