← Back to project
● Shipped P0 Size M Foundation

Eval-Framework — Notes

Chronological decision log, gotchas, working-session hours, and the 7 PM-bias catches.

Notes & Decision Log

Format: YYYY-MM-DD — context — decision/finding.

Decisions

  • 2026-05-224-tier triage routing design. Daily = local 4B + Haiku verifier ($0.15/mo); Weekly = local 8B + Haiku ($0.30/mo); Monthly = 32B alone ($0); On-demand audit = Grok 4.3 ($0.61/mo). Total portfolio AI cost ~$1.10/mo. Distill attempt for the local tiers BLOCKED (4 Metal crashes, adapter 65% regression vs 70% baseline = HURTING the model). Queue retrain for the weekend with clean GPU state (quit competing apps, reboot, --save-every 25 --max-seq 1024 --grad-checkpoint).
  • 2026-05-21 (F3c)Production judge picked: Grok 4.3 at 99.0% LLM-judge accuracy on holdout-99, $0.61/mo at audit volume. Verified by running the frozen holdout exactly once with --final-decision flag; appended to holdout-runs.log. Beat Opus 4.7 by +32pp; beat Haiku-as-direct-auditor (which collapsed to 13%) by +86pp. Correction: em sai 2 lần default Haiku do SDK familiarity — production LLM-judge default per dojo eval is Grok 4.3.
  • 2026-05-21 (F3b)Scoring breakthrough. Strict substring scored Grok 4.3 at 79.8% on holdout-99. Re-checking the 20 “failures” with an LLM-judge revealed 19/20 were valid alternate findings the rubric had simply not enumerated. Real accuracy = 99.0%. +19.2pp gap between strict and LLM-judge on the same outputs. Decision: LLM-judge becomes the default scorer for any multi-valid-output task. Strict stays available for single-canonical-answer tasks.
  • 2026-05-21 (F3a)Stratified eval set: 192 cases, 5 buckets × 3 languages. Bootstrapped via Haiku from a corpus sample, then hand-balanced. Split 93 dev (iterable) + 99 holdout (FROZEN). CLI guard refuses non---final-decision runs on holdout-* paths. Tamper-evident holdout-runs.log records every final-decision run.
  • 2026-05-21 (F2)Pairwise Cohen’s κ added after first bake-off. Revealed Sonnet ↔ Opus = 0.66 (same Anthropic family → correlated bias) vs Grok ↔ Sonnet = 0.52 (independent providers). Decision: ensembles must avoid same-family pairs. Without κ this would have been invisible.
  • 2026-05-21 (F2)Gemini 2.5 Flash returned empty candidates on 99/99 holdout cases. Suspected safety filter or output format incompatibility; not investigated further (cheap to skip). Adapter normalized to text="" so scorer flags 0% accuracy loudly. Lesson: sanity-test ALL providers with --smoke before committing to a bake-off.
  • 2026-05-21 (F2)LLM-as-judge scorer: Sonnet 4.6 verdicts (VALID/INVALID per finding). Cached by (finding_hash, case_id) — re-runs across prompt iteration are free. +30% cost overhead on top of generation; acceptable at audit volume.
  • 2026-05-21 (F1)YAML for task config, not JSON. Multi-line system prompts + comments without escape pain. Pydantic schema validation on load. Adapter pattern = 30 LOC per provider (lazy abstraction; copy-paste 2 instances before generalizing — Decision #3 in PM catches).
  • 2026-05-21 (F1)6 cross-cloud judges from day 1: Anthropic Haiku/Sonnet/Opus, xAI Grok 4-fast + Grok 4.3, OpenAI GPT-5.4-mini, Gemini 2.5 Flash. Reason: build-vs-buy and migration questions are inherently cross-provider; vendor-lock-in to a single SDK from the start = the original sin.
  • 2026-05-21 (F1)Bootstrap CI 1000 resamples (95%) built in from day 1. Single accuracy number is misleading; CI overlap reveals “no statistical difference” between Grok 4-fast and GPT-5.4-mini (would have been a coin-flip pick otherwise).
  • 2026-05-21 (F1)Asyncio fan-out with semaphore=8. Provider SDKs are async-native; cleaner cancellation than threading. 6 judges × 93 cases = 558 calls completes in ~12 min.
  • 2026-05-21 (F1)Postgres eval_runs audit log sharing Personal-RAG’s Postgres instance. Cheap drift queries via SQL; one source of truth for “how did this judge score last month”. git_sha column → reproducibility.
  • 2026-05-20Distill attempt INCOMPLETE. 4 Metal crashes during mlx-lm LoRA training; adapter at iter 400 = 65% vs base = 70% → training HURT the model. Cause: competing with tako daemon + heavy apps for GPU. Queue weekend retrain with clean GPU state + --save-every 25 --max-seq 1024 --grad-checkpoint. Production ships Grok 4.3 cloud in the meantime.
  • 2026-01-21Foundation hypothesis drafted: eval framework converts 3 PM questions (Migration / Prompt engineering / Build-vs-Buy) from “2-3 weeks guessing public benchmarks” → “evidence-based answers in a few hours”. 8 personal projects to apply against (Knowledge-Audit / Mail-Assistant / Mac-Translator / Voice-Assistant / Email-Filter / Personal-RAG / Diagram-Engine / Eval-Framework itself). Build narrow first (one task) → ship → extend.

7 PM-bias catches (single-day sprint)

Logged from the F1→F3 build session. Each one would have wasted budget, leaked compliance, or shipped the wrong roadmap if accepted blindly. Documented = catchable.

#AI bias directionPM counter-promptMoney/Time at risk
1Tune model > Verify metric”5 failed cases manually — do they really fail?”$50–280/mo + 4–6h
2Frame relative > absolute”vs baseline / no-treatment?“8–12h dev + electricity
3DRY > YAGNI”How many real instances? <2 = defer”5h + double rewrite cost
4More compute > Pareto”ROI per pp? 80/20 split?”$30/mo recurring + UX latency
5Author-context > audience-fit”Set scope upfront — strip personal anecdotes”Marketing credibility erosion
6Use-all-context > scope-filter”Public — strip PII + codenames”Compliance + competitive intel
7Generalize N=1 > per-domain eval”Is the evidence specific to this task?”Production accuracy drop

Full narrative in 7-pm-decisions-building-eval-framework.

Gotchas

  • 2026-05-21Cohen’s κ NaN on perfect agreement. When both judges pass all 99 cases, cohen_kappa_score returns NaN (no disagreement = undefined). Fix: clamp to 1.0 with note=degenerate in the report; don’t propagate NaN into downstream stats.
  • 2026-05-21Holdout integrity is a discipline failure, not a code failure. Operator under deadline pressure will absolutely “just run holdout one more time”. Fix: CLI guard refuses holdout-* paths without explicit --final-decision flag; every final-decision run appended to tamper log; repeat runs print a loud overfit warning.
  • 2026-05-21Bootstrap resample wall time scales linearly with N. 1000 resamples × 99 cases is fine; 10K × 1000 cases would be ~minutes. Cap at 1000 by default; configurable.
  • 2026-05-21Gemini’s empty candidates array masquerades as a successful API call (HTTP 200, no exception). Without an adapter-layer check, scorer would silently score text="" → 0%. Fix: adapter raises a structured EmptyResponseError that the reporter renders with a ⚠ flag.
  • 2026-05-21xAI rate limits hit at concurrency >8 on Grok 4.3. Semaphore set to 8; doesn’t bottleneck because LLM-judge scoring runs in parallel after generation.
  • 2026-05-21OpenAI Responses API for GPT-5.4-mini requires reasoning.effort config that doesn’t exist in legacy chat.completions. Adapter wraps both shapes.
  • 2026-05-21Anthropic system param is top-level, not in messages list. Easy to forget when porting from xAI/OpenAI adapter.
  • 2026-05-21Verdict cache key must include the case_id not just the finding hash — same finding text can be VALID in one case and INVALID in another (depends on source context). Initial design without case_id produced silently-wrong scores.
  • 2026-05-20mlx-lm LoRA training crashes with “Impacting Interactivity” warning when competing with tako daemon + heavy apps. Quit competing apps + reboot before training. Use --save-every 25 --max-seq 1024 --grad-checkpoint to bound RAM pressure.
  • 2026-05-20Strict substring scorer ate 19 valid Grok findings out of 20 “failures” — the trap that motivated Eval-Framework’s default scorer change. Always LLM-judge a sample of “failures” before tuning the model.

Working-session log

DateHoursWhatOutcome
2026-01-21~1 hFoundation hypothesis + JTBD framingDoc drafted; 8-project portfolio scoped
2026-05-20~6 hDistill attempt (mlx-lm LoRA)INCOMPLETE — 4 Metal crashes, 65% regression; weekend retrain queued
2026-05-21 morning~2 hF1 — CLI scaffold + task YAML + Anthropic + xAI adaptersFirst single-judge score run works
2026-05-21 midday~2 hF2 — OpenAI + Gemini adapters + parallel fan-out + strict scorerFirst bake-off table renders
2026-05-21 afternoon~2 hF3a — 192-case stratified eval set bootstrap; dev/holdout splitEval set frozen
2026-05-21 late~2 hF3b — LLM-judge scorer + bootstrap CI + Cohen’s κScoring breakthrough discovered (79.8% → 99.0%)
2026-05-21 evening~2 hF3c — Holdout-99 final-decision run; Grok 4.3 shipped to productionProduction cost $0.61/mo verified
F1+F2+F3 subtotal~10 hFoundation ready, first project served
2026-05-21 night~4 hF4 — Apply to Personal-RAG (93-query held-out retrieval eval)Hit@3=97.8%, MRR=0.948, reranker shipped
2026-05-22~2 h4-tier triage routing design + 7-decisions blog draftDesign done; distill retry queued for weekend