A production-grade personal eval framework for every AI feature in the portfolio. It turns three recurring PM questions (Should we migrate from model A to model B? Which prompts break after the swap? Are local LLMs ready to replace cloud APIs?) from “2-3 weeks of guessing against public benchmarks” into “evidence-based answers in a few hours” on a task-specific eval set.
At a glance
- 7-model bake-off shipped: Claude Haiku 4.5, Sonnet 4.6, Opus 4.7, Grok 4-fast-reasoning, Grok 4.3, GPT-5.4-mini, Gemini 2.5 Flash
- Stratified eval set: 192 cases bootstrapped via Haiku — 5 buckets × 3 languages (15 strata) — split 93 dev + 99 holdout (FROZEN)
- Scoring breakthrough: strict substring scoring = 79.8%, LLM-judged scoring = 99.0% true accuracy on holdout-99 — same outputs, same judge model, different scorer
- Production judge picked: Grok 4.3 wins Knowledge-Audit at 99% — beating Opus 4.7 (67%) and Haiku as direct auditor (last at 13%)
- Production cost: $0.61/month for Grok 4.3 at audit volume; at SMB scale (15K audits/month) the all-in cost is ~$90/month
- Foundation for 8 personal projects — Knowledge-Audit, Mail-Assistant, Mac-Translator, Voice-Assistant, Email-Filter, Personal-RAG, Diagram-Engine, Eval-Framework itself
- Methodology = 5 components — stratified eval, single task × multi judges, LLM-as-judge for multi-valid-output tasks, pairwise Cohen’s κ for correlated bias detection, frozen holdout protection
- Shipped F1+F2+F3 in a 10-hour day (2026-05-21) — single-day sprint from CLI scaffold to verified production decision
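The gap between strict substring scoring and LLM-judged scoring reported above is what you would expect on tasks with several valid outputs. The sketch below illustrates the effect on toy data. All names (`substring_score`, `judged_score`, `toy_judge`) and the example strings are hypothetical, and `toy_judge` is a keyword-based stand-in for a real LLM call, not the framework's scorer.

```python
from typing import Callable, List, Tuple


def substring_score(output: str, expected: str) -> bool:
    """Strict scorer: pass only if the expected string appears verbatim."""
    return expected.strip().lower() in output.strip().lower()


def judged_score(output: str, expected: str,
                 judge: Callable[[str, str], bool]) -> bool:
    """Judge scorer: delegate the 'same meaning?' decision to a judge callable.

    In a real run `judge` would wrap an LLM call (assumption); here it can be
    any function returning True when the two answers are equivalent.
    """
    return judge(output, expected)


def accuracy(results: List[bool]) -> float:
    """Fraction of passing cases; 0.0 for an empty list."""
    return sum(results) / len(results) if results else 0.0


def toy_judge(output: str, expected: str) -> bool:
    """Toy stand-in for an LLM judge: accepts a few hand-listed paraphrases."""
    paraphrases = {"outdated": {"outdated", "stale", "out of date"}}
    accepted = paraphrases.get(expected.lower(), {expected.lower()})
    return any(term in output.lower() for term in accepted)


# Same outputs, two scorers: only the second output is a paraphrase.
cases: List[Tuple[str, str]] = [
    ("The note is outdated.", "outdated"),
    ("This entry is stale and should be refreshed.", "outdated"),
]
strict = accuracy([substring_score(out, exp) for out, exp in cases])
judged = accuracy([judged_score(out, exp, toy_judge) for out, exp in cases])
```

On this toy set the strict scorer rejects the paraphrased answer while the judge accepts it, so the measured accuracy differs even though the model outputs are identical.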
Stack
Python 3.11 · pytest (test runner + parametrize for eval cases) · Pydantic (task config schema) · Anthropic SDK · xAI SDK (Grok) · OpenAI SDK (GPT-5.4-mini) · Google Gemini SDK · Postgres 16 (eval-run audit log) · scikit-learn (pairwise Cohen’s kappa for judge agreement) · NumPy (bootstrap CI, 1,000 resamples)
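The two statistics named in the stack, pairwise Cohen's kappa and a bootstrap confidence interval, can be sketched in a few lines. This version uses only the standard library so it runs anywhere; the stack lists scikit-learn and NumPy for the real thing. Function names, the percentile method, and the convention of returning 1.0 when both judges use a single identical label are assumptions, not the framework's code.

```python
import random
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def cohen_kappa(a: List[str], b: List[str]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters' labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[lbl] * count_b[lbl] for lbl in count_a) / (n * n)  # chance
    if p_e == 1.0:
        return 1.0  # assumed convention: both raters used one identical label
    return (p_o - p_e) / (1 - p_e)


def pairwise_kappa(verdicts: Dict[str, List[str]]) -> Dict[Tuple[str, str], float]:
    """Kappa for every unordered pair of judges, keyed by sorted judge names."""
    return {
        (j1, j2): cohen_kappa(verdicts[j1], verdicts[j2])
        for j1, j2 in combinations(sorted(verdicts), 2)
    }


def bootstrap_ci(correct: List[bool], resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for accuracy over per-case pass/fail results."""
    if not correct:
        raise ValueError("need at least one result")
    rng = random.Random(seed)                 # fixed seed for reproducibility
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(resamples))
    lo_idx = int(resamples * alpha / 2)       # symmetric tail indices
    return means[lo_idx], means[resamples - 1 - lo_idx]
```

High kappa between two judges is not automatically good news: if both share the same blind spot, they agree for the wrong reason, which is why the pairwise matrix is read alongside per-judge accuracy rather than instead of it.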
Documentation
| Doc | Read this for |
|---|---|
| PRD | What & why — problem framing, JTBD, scope, milestones, success metrics |
| Architecture | System diagrams, eval pipeline, judge adapter pattern, scoring layer |
| Implementation | Tech stack, code structure, schema, perf, reproducibility steps |
| Notes | Chronological decision log + gotchas + the 7 PM-bias catches |
| Enterprise | 5 enterprise use cases — B2B SaaS chatbot eval, fintech reconciliation, EdTech moderation, healthcare triage, multi-tenant LLM migration |
Quickstart for users
# 1. Bake-off across N judges on a task
eval-framework bake-off \
  --task knowledge_audit \
  --judges grok-4.3,claude-haiku-4.5,claude-sonnet-4.6,gpt-5.4-mini \
  --eval-set holdout-99 \
  --scorer llm_judge
# 2. Output: per-judge accuracy + bootstrap CI 95% + Cohen's κ matrix +
#    cost/case + p95 latency + per-stratum breakdown
Adding a new judge = 1 file (~30 LOC provider client). Adding a new task = 1 YAML config (system prompt + scoring rubric + eval set path).
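A "one file per judge" design usually implies a small adapter interface plus a registry. The sketch below shows one way that could look. Everything here is hypothetical: the `Judge` protocol, the `JUDGES` registry, `register`, `run_case`, and the `TaskConfig` field names are illustrative, `EchoJudge` is an offline fake rather than a real provider client, and a dataclass stands in for the Pydantic/YAML task config to keep the example dependency-free.

```python
from dataclasses import dataclass
from typing import Dict, Protocol


@dataclass(frozen=True)
class TaskConfig:
    """Stand-in for the YAML task config; field names are assumptions."""
    name: str
    system_prompt: str
    scoring_rubric: str
    eval_set_path: str


class Judge(Protocol):
    """Minimal interface every provider adapter would implement."""
    name: str

    def complete(self, system_prompt: str, user_input: str) -> str: ...


JUDGES: Dict[str, Judge] = {}


def register(judge: Judge) -> Judge:
    """Add an adapter to the registry, rejecting duplicate names."""
    if judge.name in JUDGES:
        raise ValueError(f"judge {judge.name!r} is already registered")
    JUDGES[judge.name] = judge
    return judge


class EchoJudge:
    """Offline fake provider; a real adapter would call the vendor SDK here."""
    name = "echo-judge"

    def complete(self, system_prompt: str, user_input: str) -> str:
        return f"[{self.name}] {user_input}"


register(EchoJudge())


def run_case(task: TaskConfig, judge_name: str, user_input: str) -> str:
    """Look up the judge by name and run one eval case through it."""
    return JUDGES[judge_name].complete(task.system_prompt, user_input)
```

With this shape, the bake-off loop only ever sees `Judge.complete`, so provider-specific details (auth, retries, response parsing) stay inside each adapter file.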
Project status
| Phase | Milestone |
|---|---|
| F1 | CLI scaffold + task config schema + Anthropic + xAI adapters |
| F2 | OpenAI + Gemini adapters; LLM-judge scorer; bootstrap CI |
| F3 | Stratified 192-case eval set; dev/holdout split; pairwise κ; first verified production pick (Grok 4.3 for Knowledge-Audit) |
| F4 | Apply to Personal-RAG (retrieval quality eval) — DONE 2026-05-21, Hit@3=97.8% / MRR=0.948 |
| F5 | Mail-Assistant inbox classification eval — queued |
| F6 | Voice-Assistant tool-use accuracy eval — queued |
| F7 | 4-tier triage routing (daily=4B+Haiku, weekly=8B+Haiku, monthly=32B, on-demand=Grok 4.3): design done; distillation blocked by 4 Metal crashes, retraining planned for a weekend |
Total build time: F1+F2+F3 shipped in ~10 hours (single day, 2026-05-21). F4 = +4 h.
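For readers unfamiliar with the two retrieval metrics quoted for F4, here is a minimal sketch of how Hit@k and MRR are commonly computed. The function names and the input shape (a ranked list of document IDs plus a set of relevant IDs per query) are assumptions. Whether the framework truncates MRR at k is not stated in this README, so this version scores the full ranking.

```python
from typing import List, Set, Tuple


def hit_at_k(ranked_ids: List[str], relevant: Set[str], k: int = 3) -> bool:
    """True if any relevant document appears in the top-k results."""
    return any(doc_id in relevant for doc_id in ranked_ids[:k])


def reciprocal_rank(ranked_ids: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(queries: List[Tuple[List[str], Set[str]]],
                       k: int = 3) -> Tuple[float, float]:
    """Return (Hit@k rate, mean reciprocal rank) over all queries."""
    if not queries:
        raise ValueError("need at least one query")
    n = len(queries)
    hit_rate = sum(hit_at_k(ranked, rel, k) for ranked, rel in queries) / n
    mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in queries) / n
    return hit_rate, mrr
```

Hit@k only asks whether a relevant document made the cut, while MRR also rewards ranking it near the top, so the two numbers can move independently.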
Foundation for downstream projects
This framework is shared infrastructure for every LLM-powered side project. Each project gets:
- A reproducible eval set (frozen holdout)
- A bake-off command to pick the right model on its real task
- A scoring rubric (LLM-judge by default) that survives prompt iteration
- A production cost number (per-case + per-month) before commit
- A weekly drift-monitoring cron with Telegram alerts (≥5pp regression → P0)
Every model decision in the portfolio runs through this framework. No “default to Haiku because I know the SDK” choices anymore.
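The drift-monitoring rule from the list above (a regression of 5 percentage points or more is a P0) reduces to a small threshold check. This is a sketch under assumptions: the function names are hypothetical, accuracies are fractions in [0, 1], and the Telegram delivery step is left out entirely.

```python
from typing import Optional


def regression_pp(baseline_acc: float, current_acc: float) -> float:
    """Accuracy drop in percentage points (positive means it got worse).

    Rounded to 6 decimals so float noise does not flip a boundary case
    such as 0.95 -> 0.90, which should count as exactly 5.0pp.
    """
    return round((baseline_acc - current_acc) * 100, 6)


def drift_severity(baseline_acc: float, current_acc: float,
                   p0_threshold_pp: float = 5.0) -> Optional[str]:
    """Return "P0" when the drop meets the threshold, otherwise None."""
    if regression_pp(baseline_acc, current_acc) >= p0_threshold_pp:
        return "P0"
    return None
```

A weekly job would call `drift_severity` with the frozen-holdout baseline and the latest run, and send an alert only when the result is not `None`.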