Eval-Framework — PRD

Size M · P0 · Foundation
Status: ✅ F1+F2+F3 shipped (2026-05-21, 10-hour single-day sprint) · F4 applied to Personal-RAG (2026-05-21)
Originally planned: 5 hours / Actual: 10 hours concentrated work + ongoing application across other projects

1. Problem

Every LLM-powered feature I build hits the same three decision points:

Migration — should we move from model A (current) to model B (just released)?
Prompt engineering — after swapping models, which prompts need rewriting? Will the old prompts break?
Build vs buy — are local LLMs (open-weight) ready to replace cloud APIs?

Without an eval framework, the answer to all three is a gut guess plus a 5-case A/B chat session. Guessing wrong in production means a silent accuracy drop, a support-ticket spike weeks later, and a hard-to-attribute regression.

Pain: a single bad LLM swap (e.g. Haiku → Sonnet because “Sonnet is bigger”) can add a recurring $200-1,200/month at SMB scale and open compliance gaps. Decision quality compounds; engineering velocity matters less if the directional choices are wrong.

Why now: I am simultaneously building 8 personal projects that touch LLMs (Knowledge-Audit, Mail-Assistant, Mac-Translator, Voice-Assistant, Email-Filter, Personal-RAG, Diagram-Engine, Eval-Framework itself). One foundation pattern, reused 8 times.

2. Goal & Success Metrics

Goal: For any task T and a list of candidate judges [J1..Jn], run a single command → get per-judge accuracy + bootstrap CI 95% + per-stratum breakdown + cost/case + pairwise Cohen’s κ in <20 minutes wall time.

Metrics — actual achieved:

Metric	Target	Achieved	Note
Time to first verified production model decision	≤1 week	1 day	Knowledge-Audit task; Grok 4.3 picked + shipped
Bake-off wall time (6 judges × N=30 dev)	<30 min	~12 min	Parallel provider calls
Bake-off cost (6 judges × N=30 dev)	<$10	~$3	Cheap enough for weekly runs on new vendor releases
Holdout-99 evaluation cost (single judge)	<$5	~$0.50	LLM-judge scoring adds ~30% on top
Production judge cost (Grok 4.3 at audit volume)	<$5/mo	$0.61/mo	At personal audit volume
Eval set stratification	Yes	5 buckets × 3 languages = 15 strata	Reveals language/bucket bias
Holdout integrity	Frozen	Never used during tuning; one-pass final decision	Anti-overfit
True-accuracy verification on holdout	LLM-judge ≥ strict by ≥10pp on multi-valid tasks	99.0% LLM-judge vs 79.8% strict	+19.2pp delta = scorer was masking real accuracy
Projects using framework	8	2 verified (Knowledge-Audit, Personal-RAG) + 6 queued	Phased rollout
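
The per-stratum breakdown in the table above can be sketched as a simple group-by over scored cases. This is a minimal illustration, not the framework's actual reporter: the function name `per_stratum_accuracy`, the tuple layout, and the demo rows are all assumptions made up for the example.

```python
from collections import defaultdict


def per_stratum_accuracy(results):
    """Accuracy per (bucket, language) stratum.

    `results` is an iterable of (bucket, language, passed) tuples;
    this layout is an assumption for illustration.
    """
    totals = defaultdict(lambda: [0, 0])  # stratum -> [passed, total]
    for bucket, language, passed in results:
        cell = totals[(bucket, language)]
        cell[0] += int(passed)
        cell[1] += 1
    return {stratum: p / n for stratum, (p, n) in totals.items()}


# Hypothetical scored cases, for illustration only.
demo = [
    ("bucket_a", "VN", True),
    ("bucket_a", "VN", True),
    ("bucket_a", "EN", True),
    ("bucket_a", "EN", False),
]
breakdown = per_stratum_accuracy(demo)
for stratum, acc in sorted(breakdown.items()):
    print(stratum, f"{acc:.0%}")
```

A judge that looks strong on the pooled number but weak in one language shows up immediately in a table like this, which is the point of stratifying.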

3. User journey

Single user (me, the PM). The journey for each LLM decision:

Author opens a YAML task config (tasks/<task_name>.yaml) — system prompt, scoring rubric, eval set path.
Author runs eval-framework bake-off --task T --judges J1,J2,J3 --eval-set dev-93 --scorer llm_judge.
CLI parallel-fans-out provider calls → collects raw outputs → LLM-judge scores each → bootstrap CI → renders table.
Author inspects per-stratum breakdown — does winner survive across languages/buckets?
If yes → re-run once on holdout-99 for final accuracy + cost numbers → commit decision in project notes.
Production deploys the picked judge. A weekly drift-monitoring cron re-evaluates a sample of 30 cases; a regression of ≥5pp triggers a Telegram P0 alert.
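
The drift rule in the last step (alert on a regression of ≥5pp) reduces to a small comparison. The sketch below assumes accuracies are stored as percentages; the names `regression_pp` and `should_alert` are hypothetical, and sending the Telegram message is deliberately left out.

```python
def regression_pp(baseline_pct, current_pct):
    """Accuracy drop in percentage points (positive means regression)."""
    # Round to avoid float noise flipping a borderline comparison.
    return round(baseline_pct - current_pct, 6)


def should_alert(baseline_pct, current_pct, threshold_pp=5.0):
    """True when the weekly sample regressed by at least `threshold_pp`."""
    return regression_pp(baseline_pct, current_pct) >= threshold_pp


# Hypothetical weekly readings, for illustration only.
print(should_alert(99.0, 94.0))  # drop of 5.0pp
print(should_alert(99.0, 95.0))  # drop of 4.0pp
```

With a 30-case sample each case is worth about 3.3pp, so in practice the threshold trips once two more cases fail than at baseline.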

4. Scope (MoSCoW) — final

Must — DONE:

✅ Generic YAML task config (system prompt + scoring rubric per task)
✅ 4 cross-cloud provider adapters: Anthropic, xAI (Grok), OpenAI, Google Gemini
✅ 2 scoring modes: strict_substring + llm_judge (default)
✅ Bake-off N judges on the same task + bootstrap CI 95% (1000 resamples)
✅ CLI (eval-framework bake-off, eval-framework score, eval-framework verify)
✅ Pairwise Cohen’s κ matrix across judges (binary classifier agreement)
✅ Per-stratum breakdown (bucket × language)
✅ Holdout protection — separate dev/holdout paths; holdout never auto-runs
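
As a rough illustration of the bootstrap CI item above, a percentile bootstrap over per-case pass/fail outcomes looks like the following. It is a sketch under assumptions: the function name `bootstrap_ci`, the fixed seed, and the 27-of-30 demo data are invented for the example, and the shipped implementation may use a different bootstrap variant.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy.

    `outcomes` is a list of 0/1 per-case results. Returns
    (point_estimate, lower, upper) for a (1 - alpha) interval.
    """
    rng = random.Random(seed)  # fixed seed keeps the demo reproducible
    n = len(outcomes)
    # Resample n cases with replacement, n_resamples times.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int(round((alpha / 2) * n_resamples))]
    upper = means[int(round((1 - alpha / 2) * n_resamples)) - 1]
    return sum(outcomes) / n, lower, upper


# Hypothetical judge: 27 of 30 dev cases correct.
point, lower, upper = bootstrap_ci([1] * 27 + [0] * 3)
print(f"accuracy={point:.3f}  95% CI=[{lower:.3f}, {upper:.3f}]")
```

At N=30 the interval is wide, which is why two judges a few points apart often cannot be separated on the dev set alone.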

Should — DONE:

✅ Cost/case + cost/run reporting per judge
✅ p95 latency per judge
✅ Per-case error inspection (show input + expected + actual + judge verdict)
✅ Weekly drift cron + Telegram alert on regression ≥ 5pp

Could — partial:

✅ Python API (programmatic invoke from notebooks)
⏸️ Local MLX endpoint adapter — config-shaped but not wired in F3 (distill attempt blocked: 4 Metal crashes, adapter regressed to 65% vs 70% baseline; queued for weekend retrain)
⏸️ Self-consistency (N>1) + adaptive escalation — designed (Decision #4 in notes) but not shipped; bad ROI at current volume
❌ Promptfoo CI integration — deferred, Eval-Framework’s CLI already covers the same surface

Won’t (F1-F3) — kept:

Multi-user / SaaS surface — single-user by design
Web UI — CLI + notebook API sufficient
Auto-prompt optimization — out of scope; PM-curated prompts only

5. Architecture (final)

5 components: Eval set (stratified, dev/holdout) → Judge adapter layer (4 provider clients) → Runner (parallel fan-out + retry) → Scorer (llm_judge default, strict_substring fallback) → Reporter (CI 95%, κ, per-stratum, cost). See Architecture.

6. Tech Stack — final choices

Layer	Original spec	Implemented	Reason for change
Runner	bash + jq	Python 3.11 + asyncio	Type safety, async fan-out, native provider SDKs
Task config	JSON	YAML + Pydantic schema	Comments + multi-line system prompts
Judge adapter	Hand-rolled HTTP	Official provider SDKs (`anthropic`, `xai-sdk`, `openai`, `google-genai`)	Less retry/auth boilerplate
Scoring	Strict substring match	LLM-judge by default (Sonnet 4.6 verdict) + `strict_substring` available	F3 breakthrough: strict masked 19/20 valid alternate findings
Production judge	Default Haiku (I got this wrong twice)	Grok 4.3	Bake-off won; Haiku as direct auditor = 13%
Statistics	None	bootstrap CI 1000 resamples + Cohen’s κ pairwise	Single accuracy number is misleading
Eval store	CSV	YAML + Postgres audit log	YAML for hand-curation; Postgres for run history + drift queries
Drift monitor	Manual	launchd cron + Telegram bot	Weekly sampled 30 cases; ADHD-friendly P0 alert format
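
To make the Scoring row concrete, here is a minimal sketch of why a strict substring match can mask valid answers. The function `strict_substring_score` and both example strings are hypothetical; the framework's real scorer and rubric are not shown in this document.

```python
def strict_substring_score(expected, output):
    """Pass only if the expected phrase appears verbatim (case-insensitive)."""
    return expected.strip().lower() in output.lower()


# Hypothetical case: the judge reports the same finding in different words.
expected = "missing citation"
verbatim_output = "Finding: missing citation in paragraph 2."
paraphrased_output = "Finding: the claim in paragraph 2 has no supporting source."

print(strict_substring_score(expected, verbatim_output))     # exact phrase present
print(strict_substring_score(expected, paraphrased_output))  # valid, but scored as a miss
```

An LLM-judge scorer can accept the paraphrased finding as equivalent, which is the kind of gap behind the strict vs LLM-judge delta reported above.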

Cost posture: ~$3 per full bake-off (6 judges × N=30 dev) → cheap enough to run weekly on every vendor release. Production cost dominated by serving (Grok 4.3 at audit volume = $0.61/mo).

7. Milestones — actual

Hour	What shipped
0-2 (F1)	CLI scaffold (`click`-based), task YAML schema, Anthropic + xAI adapters
2-4 (F2)	OpenAI + Gemini adapters, parallel fan-out, basic strict-substring scorer
4-6 (F3a)	Stratified 192-case eval set bootstrap via Haiku (5×3 strata), dev/holdout split
6-8 (F3b)	LLM-judge scorer (Sonnet 4.6), bootstrap CI, Cohen’s κ matrix
8-10 (F3c)	First production bake-off: 6 judges × dev-93 → Grok 4.3 wins; holdout-99 confirm; scoring breakthrough discovered (79.8% strict → 99.0% LLM-judge)
+4h, 2026-05-21 (F4)	Applied to Personal-RAG: 93-query held-out personal eval, Hit@3 = 97.8%, MRR = 0.948
2026-05-22 (F4b)	4-tier triage routing design (daily=4B+Haiku, weekly=8B+Haiku, monthly=32B, on-demand=Grok 4.3) — distill attempt blocked, retrain weekend

F3 DoD passed:

✅ 6-judge bake-off runs end-to-end in <20 min wall
✅ Holdout-99 frozen; final decision = single pass
✅ Per-stratum breakdown reveals language bias (Grok 4.3: VN 88% / EN 63% / mixed 71%)
✅ Cohen’s κ reveals correlated bias (Sonnet ↔ Opus = 0.66, same Anthropic family)
✅ Production judge picked + shipped with verified cost number ($0.61/mo)

8. Cost & Quota

Item	Free?	Actual usage
Anthropic API (Haiku/Sonnet/Opus + judge)	❌	~$1/bake-off
xAI Grok API (Grok 4-fast / Grok 4.3)	❌	~$1/bake-off
OpenAI GPT-5.4-mini	❌	~$0.50/bake-off
Google Gemini Flash	✅ free tier	$0
Postgres 16 (local, shared with Personal-RAG)	✅	<100 MB eval audit log
Production judge serving (Grok 4.3 @ audit vol)	❌	$0.61/mo

Recurring cost: <$5/month total (production serving + weekly drift monitor). Bake-offs run on-demand (~$3 each, 1-2x/month when new vendors ship).

9. Risks & open questions — outcomes

Original risks:

Holdout leakage if dev iteration uses holdout cases → mitigated by separate file paths + CLI guard that refuses --eval-set holdout-* unless --final-decision flag set
LLM-judge cost spirals at scale → measured: scoring adds ~30% on top of generation cost; acceptable at audit volume
Gemini API integration fragile → confirmed during bake-off: Gemini Flash returned empty candidates on 99/99 holdout cases; sanity-test ALL providers before bake-off
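
The holdout guard described in the first risk can be sketched as a plain check that runs before any evaluation starts. The function name `check_eval_set` and the error type are assumptions; in the real CLI this logic would sit behind the `--eval-set` and `--final-decision` options.

```python
def check_eval_set(eval_set, final_decision=False):
    """Refuse holdout-* eval sets unless the final-decision flag is set."""
    if eval_set.startswith("holdout-") and not final_decision:
        raise PermissionError(
            f"refusing to run on {eval_set!r} without --final-decision"
        )
    return eval_set


print(check_eval_set("dev-93"))
print(check_eval_set("holdout-99", final_decision=True))
try:
    check_eval_set("holdout-99")
except PermissionError as err:
    print("blocked:", err)
```

Making the holdout path fail loudly by default means an accidental tuning run against it needs a deliberate extra flag.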

Current risks:

Distill attempt blocked (4 Metal crashes 2026-05-20; adapter scored 65% vs 70% baseline, i.e. it hurt accuracy) → cloud Grok 4.3 shipped as production default; local distill retry queued for weekend
Judge model staleness (Sonnet 4.6 as scorer judge) — if Sonnet itself regresses, all scores shift; mitigation: weekly drift cron re-runs on holdout sample
Cohen’s κ degenerate on perfect-agreement judges (κ = NaN when both pass all cases) — fixed with min-disagreement floor
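
The degenerate case in the last risk follows directly from the definition κ = (p_o − p_e) / (1 − p_e): when both judges pass every case, p_e = 1 and the denominator is zero. The sketch below reproduces that for binary verdicts and returns NaN explicitly; the function name `cohens_kappa` is an assumption, and the min-disagreement floor used in the actual fix is not reproduced here.

```python
import math


def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary (0/1) verdicts."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n               # each judge's pass rate
    p_e = pa * pb + (1 - pa) * (1 - pb)           # agreement expected by chance
    if p_e == 1.0:
        return float("nan")  # both judges constant and identical: 0/0
    return (p_o - p_e) / (1 - p_e)


print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0]))  # perfect agreement, mixed labels
print(cohens_kappa([1, 1, 0, 0], [1, 0, 1, 0]))  # agreement at chance level
print(math.isnan(cohens_kappa([1, 1, 1], [1, 1, 1])))  # degenerate case
```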

Original open Qs:

Q1: Will LLM-judge scorer disagree with human eval? → spot-check on 20 cases showed 95% human agreement with Sonnet judge verdict — acceptable
Q2: Is bootstrap CI worth the compute? → yes; revealed that the Grok 4-fast and Grok 4.3 intervals overlap (overlapping CIs = no difference can be established at N=30)
Q3: Local LLMs ready for production? → not yet (distill 65% < 70% baseline); cloud Grok 4.3 cheaper at current volume anyway

10. Definition of Done

F1+F2+F3 Done: ✅ 2026-05-21 — CLI shipped, 6-model bake-off runs end-to-end, holdout-99 frozen, first verified production pick (Grok 4.3 for Knowledge-Audit at 99% / $0.61/mo).

F4 applied to Personal-RAG: ✅ 2026-05-21 — 93-query held-out personal-workspace eval; Hit@3=97.8%, MRR=0.948; reranker swap measured +3.2pp Hit@1 and shipped.
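
For reference, the two retrieval metrics quoted above follow their standard definitions, sketched here. The function names and the demo ranks are hypothetical; `ranks` holds the 1-based position of the first relevant result per query, or `None` when nothing relevant was retrieved.

```python
def hit_at_k(ranks, k):
    """Fraction of queries whose first relevant result is in the top k."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; a query with no relevant result contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks for four queries, for illustration only.
demo_ranks = [1, 2, None, 4]
print(hit_at_k(demo_ranks, 3))  # 2 of 4 queries hit in the top 3
print(mrr(demo_ranks))          # (1 + 1/2 + 0 + 1/4) / 4
```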

Portfolio-DoD (target M2):

⏳ Mail-Assistant inbox classification eval (queued)
⏳ Voice-Assistant tool-use accuracy eval (queued)
⏳ Mac-Translator translation fidelity eval (queued)
⏳ Local MLX adapter retry (weekend distill retrain with clean GPU state)
