
Eval-Framework — PRD

Product spec, scope, milestones, and success metrics for the personal eval framework.

Size M · P0 · Foundation

Status: ✅ F1+F2+F3 shipped (2026-05-21, 10-hour single-day sprint) · F4 applied to Personal-RAG (2026-05-21)

Originally planned: 5 hours / Actual: 10 hours concentrated work + ongoing application across other projects

1. Problem

Every LLM-powered feature I build hits the same three decision points:

  1. Migration: should we move from model A (current) to model B (just released)?
  2. Prompt engineering: after swapping models, which prompts need rewriting? Will the old prompts break?
  3. Build vs buy: are local LLMs (open-weight) ready to replace cloud APIs?

Without an eval framework, the answer to all three = guess from gut + a 5-case A/B chat session. Guessing wrong in production = silent accuracy drop, support ticket spike weeks later, hard-to-attribute regression.

Pain: a single bad LLM swap (e.g. Haiku → Sonnet because “Sonnet is bigger”) can add $200-1,200/month in recurring cost at SMB scale and create compliance leaks. Decision quality compounds; engineering velocity matters less if the directional choices are wrong.

Why now: I am simultaneously building 8 personal projects that touch LLMs (Knowledge-Audit, Mail-Assistant, Mac-Translator, Voice-Assistant, Email-Filter, Personal-RAG, Diagram-Engine, Eval-Framework itself). One foundation pattern, reused 8 times.

2. Goal & Success Metrics

Goal: For any task T and a list of candidate judges [J1..Jn], run a single command → get per-judge accuracy + bootstrap CI 95% + per-stratum breakdown + cost/case + pairwise Cohen’s κ in <20 minutes wall time.
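
As a rough illustration of the "bootstrap CI 95%" part of the goal, here is a minimal percentile-bootstrap sketch over per-case 0/1 outcomes. The function name, the fixed seed, and the 24-of-30 example data are assumptions for illustration, not the framework's actual code or results.

```python
import random
from statistics import mean


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over a list of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample cases with replacement and record each resample's accuracy.
    stats = sorted(mean(rng.choices(outcomes, k=n)) for _ in range(n_resamples))
    lo = stats[int(round((alpha / 2) * n_resamples))]
    hi = stats[int(round((1 - alpha / 2) * n_resamples)) - 1]
    return mean(outcomes), lo, hi


# Hypothetical judge: 24 of 30 dev cases scored correct.
outcomes = [1] * 24 + [0] * 6
acc, lo, hi = bootstrap_ci(outcomes)
print(f"accuracy={acc:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

At N=30 the interval is wide, which is why two judges with different point accuracies can still be statistically indistinguishable.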

Metrics — actual achieved:

| Metric | Target | Achieved | Note |
|---|---|---|---|
| Time to first verified production model decision | ≤1 week | 1 day | Knowledge-Audit task; Grok 4.3 picked + shipped |
| Bake-off wall time (6 judges × N=30 dev) | <30 min | ~12 min | Parallel provider calls |
| Bake-off cost (6 judges × N=30 dev) | <$10 | ~$3 | Cheap enough for weekly runs on new vendor releases |
| Holdout-99 evaluation cost (single judge) | <$5 | ~$0.50 | LLM-judge scoring adds ~30% on top |
| Production judge cost (Grok 4.3 at audit volume) | <$5/mo | $0.61/mo | At personal audit volume |
| Eval set stratification | Yes | 5 buckets × 3 languages = 15 strata | Reveals language/bucket bias |
| Holdout integrity | Frozen | Never used during tuning; one-pass final decision | Anti-overfit |
| True-accuracy verification on holdout | LLM-judge ≥ strict by ≥10pp on multi-valid tasks | 99.0% LLM-judge vs 79.8% strict | +19.2pp delta = scorer was masking real accuracy |
| Projects using framework | 8 | 2 verified (Knowledge-Audit, Personal-RAG) + 6 queued | Phased rollout |

3. User journey

Single user (me, the PM). The journey for each LLM decision:

  1. Author opens a YAML task config (tasks/<task_name>.yaml) — system prompt, scoring rubric, eval set path.
  2. Author runs eval-framework bake-off --task T --judges J1,J2,J3 --eval-set dev-93 --scorer llm_judge.
  3. CLI parallel-fans-out provider calls → collects raw outputs → LLM-judge scores each → bootstrap CI → renders table.
  4. Author inspects per-stratum breakdown — does winner survive across languages/buckets?
  5. If yes → re-run once on holdout-99 for final accuracy + cost numbers → commit decision in project notes.
  6. Production deploys the picked judge. Weekly drift-monitoring cron re-evaluates a sampled 30 cases; ≥5pp regression → Telegram P0 alert.
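
The drift rule in step 6 can be sketched as a small check. This is an assumption-laden illustration: `drift_alert`, the 0.99 baseline, and the sample data are hypothetical, and the real cron presumably also handles sampling and the Telegram call.

```python
def drift_alert(baseline_acc, sampled_outcomes, threshold_pp=5.0):
    """Compare a weekly sample of 0/1 outcomes against a baseline accuracy.

    Returns (regressed, drop_pp), where drop_pp is the drop in percentage points.
    """
    current_acc = sum(sampled_outcomes) / len(sampled_outcomes)
    drop_pp = (baseline_acc - current_acc) * 100
    return drop_pp >= threshold_pp, drop_pp


# Hypothetical week: 28 of 30 sampled cases pass against a 99% baseline.
regressed, drop_pp = drift_alert(0.99, [1] * 28 + [0] * 2)
print(f"regressed={regressed}  drop={drop_pp:.1f}pp")
```

With 30 sampled cases each case is worth about 3.3pp, so against a near-perfect baseline two misses are enough to cross a 5pp threshold.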

4. Scope (MoSCoW) — final

Must — DONE:

  • ✅ Generic YAML task config (system prompt + scoring rubric per task)
  • ✅ 4 cross-cloud provider adapters: Anthropic, xAI (Grok), OpenAI, Google Gemini
  • ✅ 2 scoring modes: strict_substring + llm_judge (default)
  • ✅ Bake-off N judges on the same task + bootstrap CI 95% (1000 resamples)
  • ✅ CLI (eval-framework bake-off, eval-framework score, eval-framework verify)
  • ✅ Pairwise Cohen’s κ matrix across judges (binary classifier agreement)
  • ✅ Per-stratum breakdown (bucket × language)
  • ✅ Holdout protection — separate dev/holdout paths; holdout never auto-runs
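
To show why llm_judge is the default over strict_substring, here is a minimal sketch of a strict substring scorer. The function body and the example strings are assumptions for illustration; the framework's real scorer may normalize text differently.

```python
def strict_substring(expected: str, actual: str) -> bool:
    """Pass only if the expected string appears verbatim (case-insensitive)."""
    return expected.strip().lower() in actual.lower()


# Hypothetical case: both outputs describe the same issue in different words.
expected = "missing null check"
outputs = [
    "There is a missing null check in parse().",
    "parse() dereferences a value that may be None.",
]
for text in outputs:
    print(strict_substring(expected, text), "-", text)
```

The second output is arguably a valid finding but fails the substring test, which is the kind of masking an LLM-judge scorer is meant to catch on multi-valid tasks.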

Should — DONE:

  • ✅ Cost/case + cost/run reporting per judge
  • ✅ p95 latency per judge
  • ✅ Per-case error inspection (show input + expected + actual + judge verdict)
  • ✅ Weekly drift cron + Telegram alert on regression ≥ 5pp

Could — partial:

  • ✅ Python API (programmatic invoke from notebooks)
  • ⏸️ Local MLX endpoint adapter: config-shaped but not wired in F3 (distill attempt blocked: 4 Metal crashes, adapter scored 65% vs 70% baseline, queued for weekend retrain)
  • ⏸️ Self-consistency (N>1) + adaptive escalation — designed (Decision #4 in notes) but not shipped; bad ROI at current volume
  • ❌ Promptfoo CI integration: deferred; Eval-Framework’s CLI already covers the same surface

Won’t (F1-F3) — kept:

  • Multi-user / SaaS surface — single-user by design
  • Web UI — CLI + notebook API sufficient
  • Auto-prompt optimization — out of scope; PM-curated prompts only

5. Architecture (final)

5 components: Eval set (stratified, dev/holdout) → Judge adapter layer (4 provider clients) → Runner (parallel fan-out + retry) → Scorer (llm_judge default, strict_substring fallback) → Reporter (CI 95%, κ, per-stratum, cost). See Architecture.
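
The Runner stage (parallel fan-out + retry) could look roughly like the asyncio sketch below. `call_judge` is a hypothetical stand-in for a provider SDK call with no network involved, and the retry counts, backoff, and concurrency cap are assumed values rather than the framework's settings.

```python
import asyncio


async def call_judge(judge: str, case: str) -> str:
    """Hypothetical stand-in for a provider SDK call (no network involved)."""
    await asyncio.sleep(0.01)
    return f"{judge}:{case}"


async def with_retry(fn, *args, attempts=3, base_delay=0.05):
    """Retry an async call with exponential backoff; re-raise on the last attempt."""
    for attempt in range(attempts):
        try:
            return await fn(*args)
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)


async def bake_off(judges, cases, max_concurrency=8):
    """Fan out every (judge, case) pair, capped by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(judge, case):
        async with sem:
            return judge, case, await with_retry(call_judge, judge, case)

    # gather() returns results in the same order the coroutines were passed in.
    return await asyncio.gather(*(one(j, c) for j in judges for c in cases))


results = asyncio.run(bake_off(["judge-a", "judge-b"], ["case-1", "case-2", "case-3"]))
print(len(results), "raw outputs collected")
```

The semaphore keeps the fan-out from exceeding provider rate limits while still letting the judges run concurrently.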

6. Tech Stack — final choices

| Layer | Original spec | Implemented | Reason for change |
|---|---|---|---|
| Runner | bash + jq | Python 3.11 + asyncio | Type safety, async fan-out, native provider SDKs |
| Task config | JSON | YAML + Pydantic schema | Comments + multi-line system prompts |
| Judge adapter | Hand-rolled HTTP | Official provider SDKs (anthropic, xai-sdk, openai, google-genai) | Less retry/auth boilerplate |
| Scoring | Strict substring match | LLM-judge by default (Sonnet 4.6 verdict) + strict_substring available | F3 breakthrough: strict masked 19/20 valid alternate findings |
| Production judge | Default Haiku (I was wrong 2x) | Grok 4.3 | Won the bake-off; Haiku as direct auditor = 13% |
| Statistics | None | Bootstrap CI 1000 resamples + Cohen’s κ pairwise | Single accuracy number is misleading |
| Eval store | CSV | YAML + Postgres audit log | YAML for hand-curation; Postgres for run history + drift queries |
| Drift monitor | Manual | launchd cron + Telegram bot | Weekly sampled 30 cases; ADHD-friendly P0 alert format |
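
For a sense of what a validated task config buys over raw JSON, here is a minimal sketch using stdlib dataclasses as a stand-in for the Pydantic schema. All field names and the example values are assumptions; the dict represents what a parsed tasks/<task_name>.yaml might contain.

```python
from dataclasses import dataclass

VALID_SCORERS = {"llm_judge", "strict_substring"}


@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical task config shape; field names are illustrative only."""
    name: str
    system_prompt: str
    scoring_rubric: str
    eval_set_path: str
    scorer: str = "llm_judge"

    def __post_init__(self):
        # Fail fast on bad configs instead of midway through a paid run.
        if self.scorer not in VALID_SCORERS:
            raise ValueError(f"unknown scorer: {self.scorer!r}")
        if not self.system_prompt.strip():
            raise ValueError("system_prompt must not be empty")


# Dict as it might look after parsing the YAML file (values are made up).
raw = {
    "name": "example-task",
    "system_prompt": "You are an auditor.\nReport one finding per case.",
    "scoring_rubric": "Pass if the finding matches any valid alternate.",
    "eval_set_path": "evals/example/dev.yaml",
}
cfg = TaskConfig(**raw)
print(cfg.name, cfg.scorer)
```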

Cost posture: ~$3 per full bake-off (6 judges × N=30 dev) → cheap enough to run weekly on every vendor release. Production cost dominated by serving (Grok 4.3 at audit volume = $0.61/mo).

7. Milestones — actual

| Hour | What shipped |
|---|---|
| 0-2 (F1) | CLI scaffold (click-based), task YAML schema, Anthropic + xAI adapters |
| 2-4 (F2) | OpenAI + Gemini adapters, parallel fan-out, basic strict-substring scorer |
| 4-6 (F3a) | Stratified 192-case eval set bootstrap via Haiku (5×3 strata), dev/holdout split |
| 6-8 (F3b) | LLM-judge scorer (Sonnet 4.6), bootstrap CI, Cohen’s κ matrix |
| 8-10 (F3c) | First production bake-off: 6 judges × dev-93 → Grok 4.3 wins; holdout-99 confirm; scoring breakthrough discovered (79.8% strict → 99.0% LLM-judge) |
| +4h, 2026-05-21 (F4) | Applied to Personal-RAG: 93-query held-out personal eval, Hit@3 = 97.8%, MRR = 0.948 |
| 2026-05-22 (F4b) | 4-tier triage routing design (daily=4B+Haiku, weekly=8B+Haiku, monthly=32B, on-demand=Grok 4.3); distill attempt blocked, retrain weekend |

F3 DoD passed:

  • ✅ 6-judge bake-off runs end-to-end in <20 min wall
  • ✅ Holdout-99 frozen; final decision = single pass
  • ✅ Per-stratum breakdown reveals language bias (Grok 4.3: VN 88% / EN 63% / mixed 71%)
  • ✅ Cohen’s κ reveals correlated bias (Sonnet ↔ Opus = 0.66, same Anthropic family)
  • ✅ Production judge picked + shipped with verified cost number ($0.61/mo)
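
The per-stratum breakdown referenced in the DoD amounts to grouping per-case verdicts by (bucket, language). A minimal sketch follows; the tuple layout, bucket names, and sample rows are hypothetical and not taken from the real eval set.

```python
from collections import defaultdict


def per_stratum_accuracy(results):
    """results: iterable of (bucket, language, correct) tuples."""
    totals = defaultdict(lambda: [0, 0])  # stratum -> [hits, cases]
    for bucket, language, correct in results:
        cell = totals[(bucket, language)]
        cell[0] += int(correct)
        cell[1] += 1
    return {stratum: hits / n for stratum, (hits, n) in totals.items()}


# Hypothetical verdicts for one judge.
rows = [
    ("bucket-a", "vn", True),
    ("bucket-a", "vn", True),
    ("bucket-a", "en", True),
    ("bucket-a", "en", False),
    ("bucket-b", "vn", False),
]
breakdown = per_stratum_accuracy(rows)
for stratum, acc in sorted(breakdown.items()):
    print(stratum, f"{acc:.0%}")
```

A judge that wins on the aggregate number can still lose badly in one language or bucket, which is what this view is meant to expose.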

8. Cost & Quota

| Item | Free? | Actual usage |
|---|---|---|
| Anthropic API (Haiku/Sonnet/Opus + judge) | | ~$1/bake-off |
| xAI Grok API (Grok 4-fast / Grok 4.3) | | ~$1/bake-off |
| OpenAI GPT-5.4-mini | | ~$0.50/bake-off |
| Google Gemini Flash | ✅ free tier | $0 |
| Postgres 16 (local, shared with Personal-RAG) | | <100 MB eval audit log |
| Production judge serving (Grok 4.3 @ audit vol) | | $0.61/mo |

Recurring cost: <$5/month total (production serving + weekly drift monitor). Bake-offs run on-demand (~$3 each, 1-2x/month when new vendors ship).

9. Risks & open questions — outcomes

Original risks:

  • Holdout leakage if dev iteration uses holdout cases → mitigated by separate file paths + CLI guard that refuses --eval-set holdout-* unless --final-decision flag set
  • LLM-judge cost spirals at scale → measured: scoring adds ~30% on top of generation cost; acceptable at audit volume
  • Gemini API integration fragile → confirmed during bake-off: Gemini Flash returned empty candidates on 99/99 holdout cases; sanity-test ALL providers before bake-off
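
The holdout guard from the first risk can be sketched as an argument check. The real CLI is click-based; argparse is used here only as a stdlib stand-in, and the error wording is an assumption. The two flag names come from the text above.

```python
import argparse


def parse_args(argv):
    """Parse bake-off flags and refuse holdout sets without explicit opt-in."""
    parser = argparse.ArgumentParser(prog="eval-framework bake-off")
    parser.add_argument("--eval-set", required=True)
    parser.add_argument("--final-decision", action="store_true")
    args = parser.parse_args(argv)
    if args.eval_set.startswith("holdout-") and not args.final_decision:
        # parser.error prints the message to stderr and exits with status 2.
        parser.error("holdout sets require --final-decision")
    return args


args = parse_args(["--eval-set", "dev-93"])
print(args.eval_set, args.final_decision)
```

Putting the check in argument parsing means a holdout run cannot start by accident during dev iteration; it has to be requested on purpose.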

Current risks:

  • Distill attempt blocked (4 Metal crashes 2026-05-20; adapter scored 65% vs 70% baseline, i.e. it hurt accuracy) → cloud Grok 4.3 shipped as production default; local distill retry queued for weekend
  • Judge model staleness (Sonnet 4.6 as scorer judge) — if Sonnet itself regresses, all scores shift; mitigation: weekly drift cron re-runs on holdout sample
  • Cohen’s κ degenerate on perfect-agreement judges (κ = NaN when both pass all cases) — fixed with min-disagreement floor
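
The degenerate κ case is easy to reproduce. Below is a minimal sketch of Cohen's κ for two binary judges that returns None when chance agreement is 1. Returning None is a stand-in chosen for this example, not the min-disagreement floor mentioned above, whose details are not specified here. The verdict lists are made up.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary verdicts (0/1)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n            # each judge's pass rate
    p_e = pa1 * pb1 + (1 - pa1) * (1 - pb1)      # agreement expected by chance
    if p_e == 1.0:
        # Both judges gave the same constant verdict: kappa is 0/0, undefined.
        return None
    return (p_o - p_e) / (1 - p_e)


judge_a = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge_b = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(judge_a, judge_b)
print(f"kappa={kappa:.3f}")
print(cohens_kappa([1, 1, 1], [1, 1, 1]))  # degenerate case
```

Here the judges agree on 8 of 10 cases with expected chance agreement 0.52, so κ = 0.28 / 0.48 ≈ 0.583.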

Original open Qs:

  • Q1: Will LLM-judge scorer disagree with human eval? → spot-check on 20 cases showed 95% human agreement with Sonnet judge verdict — acceptable
  • Q2: Is bootstrap CI worth the compute? → yes; it revealed that the Grok 4-fast and Grok 4.3 CIs overlap (no statistically detectable difference at N=30)
  • Q3: Local LLMs ready for production? → not yet (distill 65% < 70% baseline); cloud Grok 4.3 cheaper at current volume anyway

10. Definition of Done

F1+F2+F3 Done: ✅ 2026-05-21 — CLI shipped, 6-model bake-off runs end-to-end, holdout-99 frozen, first verified production pick (Grok 4.3 for Knowledge-Audit at 99% / $0.61/mo).

F4 applied to Personal-RAG: ✅ 2026-05-21 — 93-query held-out personal-workspace eval; Hit@3=97.8%, MRR=0.948; reranker swap measured +3.2pp Hit@1 and shipped.
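
For reference, Hit@k and MRR as used in the Personal-RAG numbers can be computed from the rank of the first relevant result per query. This is a minimal sketch with made-up ranks; the function names and the convention of None for "no relevant result retrieved" are assumptions.

```python
def hit_at_k(ranks, k):
    """Share of queries whose first relevant result is ranked k or better.

    ranks: 1-based rank of the first relevant result per query, or None.
    """
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


def mrr(ranks):
    """Mean reciprocal rank; queries with no relevant result contribute 0."""
    return sum(1 / r for r in ranks if r is not None) / len(ranks)


# Hypothetical ranks for five queries.
ranks = [1, 1, 2, None, 3]
print(f"Hit@1={hit_at_k(ranks, 1):.3f}  Hit@3={hit_at_k(ranks, 3):.3f}  MRR={mrr(ranks):.3f}")
```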

Portfolio-DoD (target M2):

  • ⏳ Mail-Assistant inbox classification eval (queued)
  • ⏳ Voice-Assistant tool-use accuracy eval (queued)
  • ⏳ Mac-Translator translation fidelity eval (queued)
  • ⏳ Local MLX adapter retry (weekend distill retrain with clean GPU state)

See also

  • Architecture — pipeline + adapter pattern + scoring layer
  • Implementation — code structure, schema, perf, reproducibility
  • Notes — chronological decision log + 7 PM-bias catches
  • Enterprise — 5 use cases for B2B SaaS / Fintech / EdTech / Healthcare / Multi-tenant