Eval-Framework — PRD
Size M · P0 · Foundation
Status: ✅ F1+F2+F3 shipped (2026-05-21, 10-hour single-day sprint) · F4 applied to Personal-RAG (2026-05-21)
Originally planned: 5 hours / Actual: 10 hours concentrated work + ongoing application across other projects
1. Problem
Every LLM-powered feature I build hits the same three decision points:
- Migration — should we move from model A (current) to model B (just released)?
- Prompt engineering — after swapping models, which prompts need rewriting? Will the old prompts break?
- Build vs buy — are local LLMs (open-weight) ready to replace cloud APIs?
Without an eval framework, the answer to all three is a gut guess plus a 5-case A/B chat session. Guessing wrong in production means a silent accuracy drop, a support-ticket spike weeks later, and a regression that is hard to attribute.
Pain: a single bad LLM swap (e.g. Haiku → Sonnet because “Sonnet is bigger”) can add $200-1,200/month in recurring cost at SMB scale and create compliance exposure. Decision quality compounds; engineering velocity matters less if the directional choices are wrong.
Why now: I am simultaneously building 8 personal projects that touch LLMs (Knowledge-Audit, Mail-Assistant, Mac-Translator, Voice-Assistant, Email-Filter, Personal-RAG, Diagram-Engine, Eval-Framework itself). One foundation pattern, reused 8 times.
2. Goal & Success Metrics
Goal: For any task T and a list of candidate judges [J1..Jn], run a single command → get per-judge accuracy + bootstrap CI 95% + per-stratum breakdown + cost/case + pairwise Cohen’s κ in <20 minutes wall time.
Metrics — actual achieved:
| Metric | Target | Achieved | Note |
|---|---|---|---|
| Time to first verified production model decision | ≤1 week | 1 day | Knowledge-Audit task; Grok 4.3 picked + shipped |
| Bake-off wall time (6 judges × N=30 dev) | <30 min | ~12 min | Parallel provider calls |
| Bake-off cost (6 judges × N=30 dev) | <$10 | ~$3 | Cheap enough for weekly runs on new vendor releases |
| Holdout-99 evaluation cost (single judge) | <$5 | ~$0.50 | LLM-judge scoring adds ~30% on top |
| Production judge cost (Grok 4.3 at audit volume) | <$5/mo | $0.61/mo | At personal audit volume |
| Eval set stratification | Yes | 5 buckets × 3 languages = 15 strata | Reveals language/bucket bias |
| Holdout integrity | Frozen | Never used during tuning; one-pass final decision | Anti-overfit |
| True-accuracy verification on holdout | LLM-judge ≥ strict by ≥10pp on multi-valid tasks | 99.0% LLM-judge vs 79.8% strict | +19.2pp delta = scorer was masking real accuracy |
| Projects using framework | 8 | 2 verified (Knowledge-Audit, Personal-RAG) + 6 queued | Phased rollout |
3. User journey
Single user (me, the PM). The journey for each LLM decision:
- Author opens a YAML task config (`tasks/<task_name>.yaml`): system prompt, scoring rubric, eval set path.
- Author runs `eval-framework bake-off --task T --judges J1,J2,J3 --eval-set dev-93 --scorer llm_judge`.
- CLI parallel-fans-out provider calls → collects raw outputs → LLM-judge scores each → bootstrap CI → renders table.
- Author inspects per-stratum breakdown: does the winner survive across languages/buckets?
- If yes → re-run once on `holdout-99` for final accuracy + cost numbers → commit decision in project notes.
- Production deploys the picked judge. Weekly drift-monitoring cron re-evaluates a sampled 30 cases; ≥5pp regression → Telegram P0 alert.
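The fan-out step in the journey above can be sketched roughly as follows. This is a minimal illustration, not the shipped runner: `call_judge` is a hypothetical stand-in for a real provider SDK call, and the retry count, backoff, and concurrency limit are assumed values.

```python
import asyncio
from typing import Optional


async def call_judge(judge: str, case: str) -> str:
    """Hypothetical stand-in for a provider SDK call; returns a canned answer."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return f"{judge}:{case}"


async def with_retry(judge: str, case: str, attempts: int = 3) -> Optional[str]:
    """Try one (judge, case) call up to `attempts` times; None if all fail."""
    for attempt in range(attempts):
        try:
            return await call_judge(judge, case)
        except Exception:
            await asyncio.sleep(0.1 * 2 ** attempt)  # simple exponential backoff
    return None


async def fan_out(judges, cases, max_concurrency: int = 8):
    """Run every judge on every case concurrently, bounded by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(judge: str, case: str) -> Optional[str]:
        async with sem:
            return await with_retry(judge, case)

    tasks = {j: [asyncio.create_task(bounded(j, c)) for c in cases] for j in judges}
    # gather preserves input order, so outputs line up with `cases`
    return {j: list(await asyncio.gather(*ts)) for j, ts in tasks.items()}


results = asyncio.run(fan_out(["J1", "J2"], ["case-a", "case-b"]))
```

Keeping outputs aligned with the case order makes the later scoring and per-stratum steps a simple zip over cases and answers.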
4. Scope (MoSCoW) — final
Must — DONE:
- ✅ Generic YAML task config (system prompt + scoring rubric per task)
- ✅ 4 cross-cloud provider adapters: Anthropic, xAI (Grok), OpenAI, Google Gemini
- ✅ 2 scoring modes: `strict_substring` + `llm_judge` (default)
- ✅ Bake-off N judges on the same task + bootstrap CI 95% (1000 resamples)
- ✅ CLI (`eval-framework bake-off`, `eval-framework score`, `eval-framework verify`)
- ✅ Pairwise Cohen’s κ matrix across judges (binary classifier agreement)
- ✅ Per-stratum breakdown (bucket × language)
- ✅ Holdout protection — separate dev/holdout paths; holdout never auto-runs
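For reference, the bootstrap CI item above can be illustrated with a percentile bootstrap over per-case pass/fail outcomes. This is a sketch under assumptions: the percentile method, the fixed seed, and the 24-of-30 sample are illustrative choices, not details confirmed by the framework.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for accuracy over 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample n cases with replacement, n_resamples times, and sort the means
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[round((alpha / 2) * n_resamples)]
    hi = means[round((1 - alpha / 2) * n_resamples) - 1]
    return sum(outcomes) / n, lo, hi


# Hypothetical judge: 24 of 30 dev cases scored correct
outcomes = [1] * 24 + [0] * 6
acc, lo, hi = bootstrap_ci(outcomes)
print(f"accuracy={acc:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

At N=30 the interval is wide, which is why two judges a few points apart can be indistinguishable on the dev set.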
Should — DONE:
- ✅ Cost/case + cost/run reporting per judge
- ✅ p95 latency per judge
- ✅ Per-case error inspection (show input + expected + actual + judge verdict)
- ✅ Weekly drift cron + Telegram alert on regression ≥ 5pp
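The drift rule in the last item (alert on a regression of 5pp or more) reduces to a small comparison. A sketch, assuming the baseline is the accuracy recorded at decision time and the sample is a list of pass/fail verdicts; the function names are hypothetical and alert delivery is left out.

```python
def drift_regression_pp(baseline_acc: float, sample_outcomes: list) -> float:
    """Regression in percentage points: positive means accuracy dropped."""
    current_acc = sum(sample_outcomes) / len(sample_outcomes)
    return (baseline_acc - current_acc) * 100


def should_alert(baseline_acc: float, sample_outcomes: list,
                 threshold_pp: float = 5.0) -> bool:
    """True when the sampled accuracy fell by at least threshold_pp."""
    return drift_regression_pp(baseline_acc, sample_outcomes) >= threshold_pp


# Hypothetical weekly samples of 30 cases against a 90% baseline
steady = [True] * 27 + [False] * 3    # 90.0%: no drift
degraded = [True] * 25 + [False] * 5  # 83.3%: about 6.7pp below baseline
print(should_alert(0.90, steady), should_alert(0.90, degraded))
```

With only 30 sampled cases, each failure moves accuracy by about 3.3pp, so a 5pp threshold trips on two extra failures.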
Could — partial:
- ✅ Python API (programmatic invoke from notebooks)
- ⏸️ Local MLX endpoint adapter: config-shaped but not wired in F3 (distill attempt blocked: 4 Metal crashes, adapter scored 65% vs 70% baseline, queued for weekend retrain)
- ⏸️ Self-consistency (N>1) + adaptive escalation — designed (Decision #4 in notes) but not shipped; bad ROI at current volume
- ❌ Promptfoo CI integration — deferred, Eval-Framework’s CLI already covers the same surface
Won’t (F1-F3) — kept:
- Multi-user / SaaS surface — single-user by design
- Web UI — CLI + notebook API sufficient
- Auto-prompt optimization — out of scope; PM-curated prompts only
5. Architecture (final)
5 components: Eval set (stratified, dev/holdout) → Judge adapter layer (4 provider clients) → Runner (parallel fan-out + retry) → Scorer (llm_judge default, strict_substring fallback) → Reporter (CI 95%, κ, per-stratum, cost). See Architecture.
6. Tech Stack — final choices
| Layer | Original spec | Implemented | Reason for change |
|---|---|---|---|
| Runner | bash + jq | Python 3.11 + asyncio | Type safety, async fan-out, native provider SDKs |
| Task config | JSON | YAML + Pydantic schema | Comments + multi-line system prompts |
| Judge adapter | Hand-rolled HTTP | Official provider SDKs (anthropic, xai-sdk, openai, google-genai) | Less retry/auth boilerplate |
| Scoring | Strict substring match | LLM-judge by default (Sonnet 4.6 verdict) + strict_substring available | F3 breakthrough: strict masked 19/20 valid alternate findings |
| Production judge | Default Haiku (I got this wrong 2x) | Grok 4.3 | Bake-off won; Haiku as direct auditor = 13% |
| Statistics | None | bootstrap CI 1000 resamples + Cohen’s κ pairwise | Single accuracy number is misleading |
| Eval store | CSV | YAML + Postgres audit log | YAML for hand-curation; Postgres for run history + drift queries |
| Drift monitor | Manual | launchd cron + Telegram bot | Weekly sampled 30 cases; ADHD-friendly P0 alert format |
Cost posture: ~$3 per full bake-off (6 judges × N=30 dev) → cheap enough to run weekly on every vendor release. Production cost dominated by serving (Grok 4.3 at audit volume = $0.61/mo).
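To make the Scoring row in the table above concrete, here is a sketch of how a strict substring scorer can reject a valid answer that is worded differently. The normalization (lowercase, strip) and both example strings are assumptions for illustration; the real `strict_substring` implementation may differ.

```python
def strict_substring(expected: str, actual: str) -> bool:
    """Pass only if the expected phrase appears verbatim (case-insensitive)."""
    return expected.strip().lower() in actual.lower()


expected = "missing citation"

# Same wording as the reference: passes
verbatim = "Finding: Missing citation in paragraph 2."
# Same finding, different wording: fails under strict matching
paraphrase = "The claim in paragraph 2 has no source attached."

print(strict_substring(expected, verbatim))    # True
print(strict_substring(expected, paraphrase))  # False
```

An LLM judge scores the second answer on meaning rather than wording, which is the gap the 79.8% strict vs 99.0% LLM-judge numbers point at.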
7. Milestones — actual
| Hour | What shipped |
|---|---|
| 0-2 (F1) | CLI scaffold (click-based), task YAML schema, Anthropic + xAI adapters |
| 2-4 (F2) | OpenAI + Gemini adapters, parallel fan-out, basic strict-substring scorer |
| 4-6 (F3a) | Stratified 192-case eval set bootstrap via Haiku (5×3 strata), dev/holdout split |
| 6-8 (F3b) | LLM-judge scorer (Sonnet 4.6), bootstrap CI, Cohen’s κ matrix |
| 8-10 (F3c) | First production bake-off: 6 judges × dev-93 → Grok 4.3 wins; holdout-99 confirm; scoring breakthrough discovered (79.8% strict → 99.0% LLM-judge) |
| +4h, 2026-05-21 (F4) | Applied to Personal-RAG: 93-query held-out personal eval, Hit@3 = 97.8%, MRR = 0.948 |
| 2026-05-22 (F4b) | 4-tier triage routing design (daily=4B+Haiku, weekly=8B+Haiku, monthly=32B, on-demand=Grok 4.3); distill attempt blocked, retrain queued for weekend |
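The stratified dev/holdout split from the F3a row can be sketched like this. It is an illustration only: the `bucket` and `language` field names, the bucket indices, the 50% holdout fraction, and the synthetic 60-case set are assumptions, and it does not reproduce the actual 93/99 split.

```python
import random
from collections import defaultdict


def stratified_split(cases, holdout_fraction=0.5, seed=0):
    """Split each (bucket, language) stratum separately so both sets cover all strata."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for case in cases:
        by_stratum[(case["bucket"], case["language"])].append(case)

    dev, holdout = [], []
    for key in sorted(by_stratum):  # sorted for reproducibility
        group = by_stratum[key]
        rng.shuffle(group)
        cut = round(len(group) * holdout_fraction)
        holdout.extend(group[:cut])
        dev.extend(group[cut:])
    return dev, holdout


# Synthetic set: 5 buckets x 3 languages x 4 cases = 60 cases
cases = [
    {"id": f"{b}-{lang}-{i}", "bucket": b, "language": lang}
    for b in range(5) for lang in ("vn", "en", "mixed") for i in range(4)
]
dev, holdout = stratified_split(cases)
print(len(dev), len(holdout))
```

Splitting per stratum rather than over the whole pool keeps small strata from landing entirely on one side.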
F3 DoD passed:
- ✅ 6-judge bake-off runs end-to-end in <20 min wall
- ✅ Holdout-99 frozen; final decision = single pass
- ✅ Per-stratum breakdown reveals language bias (Grok 4.3: VN 88% / EN 63% / mixed 71%)
- ✅ Cohen’s κ reveals correlated bias (Sonnet ↔ Opus = 0.66, same Anthropic family)
- ✅ Production judge picked + shipped with verified cost number ($0.61/mo)
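The per-stratum breakdown in the DoD list above is a group-by over scored cases. A minimal sketch, assuming each scored row carries a `language` label and a boolean `correct` field (both names hypothetical); the four rows are made-up data.

```python
from collections import defaultdict


def per_stratum_accuracy(rows, key):
    """Accuracy per distinct value of `key` (e.g. language or bucket)."""
    totals = defaultdict(lambda: [0, 0])  # value -> [correct, total]
    for row in rows:
        totals[row[key]][0] += int(row["correct"])
        totals[row[key]][1] += 1
    return {value: c / n for value, (c, n) in totals.items()}


# Made-up scored rows for one judge
rows = [
    {"language": "vn", "correct": True},
    {"language": "vn", "correct": True},
    {"language": "en", "correct": True},
    {"language": "en", "correct": False},
]
breakdown = per_stratum_accuracy(rows, "language")
print(breakdown)
```

Passing `"bucket"` as the key gives the other axis; a judge that wins overall but collapses in one stratum shows up immediately.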
8. Cost & Quota
| Item | Free? | Actual usage |
|---|---|---|
| Anthropic API (Haiku/Sonnet/Opus + judge) | ❌ | ~$1/bake-off |
| xAI Grok API (Grok 4-fast / Grok 4.3) | ❌ | ~$1/bake-off |
| OpenAI GPT-5.4-mini | ❌ | ~$0.50/bake-off |
| Google Gemini Flash | ✅ free tier | $0 |
| Postgres 16 (local, shared with Personal-RAG) | ✅ | <100 MB eval audit log |
| Production judge serving (Grok 4.3 @ audit vol) | ❌ | $0.61/mo |
Recurring cost: <$5/month total (production serving + weekly drift monitor). Bake-offs run on-demand (~$3 each, 1-2x/month when new vendors ship).
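As a quick check on the numbers above, the bake-off figure works out to under two cents per judged case. The arithmetic below uses only values stated in this section; treating the ~$3 as spread evenly across judges is a simplifying assumption.

```python
BAKE_OFF_COST_USD = 3.00  # "~$3" per full bake-off
JUDGES = 6
DEV_CASES = 30

# Average cost of one judge answering one case (even spread assumed)
cost_per_judged_case = BAKE_OFF_COST_USD / (JUDGES * DEV_CASES)


def monthly_bake_off_cost(runs_per_month: int) -> float:
    """On-demand bake-off spend, separate from the recurring serving cost."""
    return runs_per_month * BAKE_OFF_COST_USD


print(f"${cost_per_judged_case:.4f} per judged case")
print(f"${monthly_bake_off_cost(1):.2f} to ${monthly_bake_off_cost(2):.2f} per month on bake-offs")
```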
9. Risks & open questions — outcomes
Original risks:
- Holdout leakage if dev iteration uses holdout cases → mitigated by separate file paths + CLI guard that refuses `--eval-set holdout-*` unless `--final-decision` flag set
- LLM-judge cost spirals at scale → measured: scoring adds ~30% on top of generation cost; acceptable at audit volume
- Gemini API integration fragile → confirmed during bake-off: Gemini Flash returned empty `candidates` on 99/99 holdout cases; sanity-test ALL providers before bake-off
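The holdout guard from the first risk above could look roughly like this. The real CLI is click-based; this sketch swaps in `argparse` from the standard library to stay dependency-free, and the function names and error type are assumptions. Only the two flags come from the text.

```python
import argparse
import fnmatch


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="eval-framework bake-off")
    parser.add_argument("--eval-set", required=True)
    parser.add_argument("--final-decision", action="store_true")
    return parser


def check_holdout_guard(eval_set: str, final_decision: bool) -> None:
    """Refuse holdout-* eval sets unless the run is explicitly a final decision."""
    if fnmatch.fnmatchcase(eval_set, "holdout-*") and not final_decision:
        raise PermissionError(
            f"{eval_set} is a holdout set; pass --final-decision to run it"
        )


def parse(argv):
    args = build_parser().parse_args(argv)
    check_holdout_guard(args.eval_set, args.final_decision)
    return args


# Dev runs pass; a holdout run needs the explicit flag
parse(["--eval-set", "dev-93"])
parse(["--eval-set", "holdout-99", "--final-decision"])
```

Making the holdout run require a separate flag turns an accidental leak into a deliberate, visible action.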
Current risks:
- Distill attempt blocked (4 Metal crashes 2026-05-20; adapter scored 65% vs 70% baseline, i.e. it hurt accuracy) → cloud Grok 4.3 shipped as production default; local distill retry queued for weekend
- Judge model staleness (Sonnet 4.6 as scorer judge) — if Sonnet itself regresses, all scores shift; mitigation: weekly drift cron re-runs on holdout sample
- Cohen’s κ degenerate on perfect-agreement judges (κ = NaN when both pass all cases) — fixed with min-disagreement floor
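The κ degeneracy in the last risk comes from the formula κ = (p_o - p_e) / (1 - p_e): when both judges pass every case, p_e = 1 and the result is 0/0. The exact form of the min-disagreement floor is not spelled out here, so the sketch below uses one possible reading, a small floor on the denominator, which makes the degenerate case come out as 0.0.

```python
def cohens_kappa(a, b, floor: float = 1e-9) -> float:
    """Cohen's kappa for two binary (0/1) verdict lists of equal length."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa, pb = sum(a) / n, sum(b) / n              # each judge's pass rate
    p_e = pa * pb + (1 - pa) * (1 - pb)          # agreement expected by chance
    # Assumed floor: avoids 0/0 when both judges give one verdict throughout
    return (p_o - p_e) / max(1 - p_e, floor)


perfect = cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0])  # agree on a mixed set
degenerate = cohens_kappa([1, 1, 1], [1, 1, 1])     # both pass everything
chance = cohens_kappa([1, 0, 1, 0], [1, 1, 0, 0])   # agree half the time
print(perfect, degenerate, chance)
```

Reporting 0.0 rather than NaN keeps the κ matrix renderable, at the cost of hiding that the pair simply had no disagreements to measure.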
Original open Qs:
- Q1: Will LLM-judge scorer disagree with human eval? → spot-check on 20 cases showed 95% human agreement with Sonnet judge verdict — acceptable
- Q2: Is bootstrap CI worth the compute? → yes; it revealed that the Grok 4-fast and Grok 4.3 CIs overlap, so the gap between them is not distinguishable at N=30
- Q3: Local LLMs ready for production? → not yet (distill 65% < 70% baseline); cloud Grok 4.3 cheaper at current volume anyway
10. Definition of Done
F1+F2+F3 Done: ✅ 2026-05-21 — CLI shipped, 6-model bake-off runs end-to-end, holdout-99 frozen, first verified production pick (Grok 4.3 for Knowledge-Audit at 99% / $0.61/mo).
F4 applied to Personal-RAG: ✅ 2026-05-21 — 93-query held-out personal-workspace eval; Hit@3=97.8%, MRR=0.948; reranker swap measured +3.2pp Hit@1 and shipped.
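For readers unfamiliar with the retrieval metrics quoted above, Hit@k and MRR can be computed as below. This sketch assumes one relevant document per query; the document IDs and the three example queries are made up.

```python
def hit_at_k(ranked_ids, relevant_id, k: int) -> bool:
    """True if the relevant document is among the top k results."""
    return relevant_id in ranked_ids[:k]


def reciprocal_rank(ranked_ids, relevant_id) -> float:
    """1/rank of the relevant document, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


# Made-up queries: (ranked results, the single relevant document)
queries = [
    (["d1", "d7", "d3"], "d1"),  # found at rank 1
    (["d4", "d2", "d9"], "d2"),  # found at rank 2
    (["d5", "d6", "d8"], "d0"),  # not retrieved
]
hit3 = sum(hit_at_k(r, rel, 3) for r, rel in queries) / len(queries)
mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)
print(f"Hit@3={hit3:.3f}  MRR={mrr:.3f}")
```

Hit@3 only asks whether the right document made the top three, while MRR also rewards ranking it first, so a reranker swap tends to move MRR and Hit@1 more than Hit@3.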
Portfolio-DoD (target M2):
- ⏳ Mail-Assistant inbox classification eval (queued)
- ⏳ Voice-Assistant tool-use accuracy eval (queued)
- ⏳ Mac-Translator translation fidelity eval (queued)
- ⏳ Local MLX adapter retry (weekend distill retrain with clean GPU state)
See also
- Architecture — pipeline + adapter pattern + scoring layer
- Implementation — code structure, schema, perf, reproducibility
- Notes — chronological decision log + 7 PM-bias catches
- Enterprise — 5 use cases for B2B SaaS / Fintech / EdTech / Healthcare / Multi-tenant