TL;DR: I just shipped Eval-Framework — an eval harness for every personal AI feature. 10 hours in a single day. Bake-off across 6 models. Verified Grok 4.3 at 99% on an audit task. Insights for any PM making LLM decisions without measurement.
Context — JTBD (Jobs-To-Be-Done)
When: I’m building 8 personal projects that touch LLMs (Knowledge-Audit, Mail-Assistant, Mac-Translator, Voice-Assistant, Email-Filter, Personal-RAG, Diagram-Engine, Eval-Framework).
I want to: pick the right model for each task, without “guess from gut” or “default to Haiku because I’m used to it”.
So that: decisions are data-driven, not vendor-locked, and I can swap models when something new ships without breaking production.
→ An eval framework is a foundational tool for every LLM-powered product decision.
User research n=1 — three real PM questions
PMs running LLM features keep hitting three recurring questions:
- Migration: “Should we move from model A (current) to model B (just released)?”
- Prompt engineering: “After swapping models, which prompts need rewriting? Will the old prompts break?”
- Build vs buy: “Are local LLMs (open-weight) ready to replace cloud APIs?”
All three share a pattern: you need evidence-based numbers before committing. No eval = guessing. Guessing wrong in production = silent accuracy drop, user churn, support ticket spike a few weeks later.
Existing alternatives — and why they fail
- Promptfoo, DeepEval, Inspect AI: production-grade but overkill for one developer. Setup cost > value at personal scope.
- OpenAI Evals: vendor lock-in, no cross-provider support.
- EleutherAI lm-evaluation-harness: academic benchmarks (MMLU, GSM8K) that don’t reflect task-specific accuracy on my audit task.
- Manual A/B chat: n=5, not stratified, biased toward easy cases.
→ None fit “personal + cross-provider + task-specific + 5h to ship”.
Product hypothesis
An eval framework converts three PM questions (Migration / Prompt engineering / Build vs Buy) from “2-3 weeks of guessing against public benchmarks” → “evidence-based answer in a few hours” using an eval set specific to your own product.
Verifiable components:
- Migration A→B question: Eval-Framework CLI, one command (bake-off --judges A,B) → 20 minutes → 95% CI, per-bucket breakdown, $/case
- Prompt v3 vs v4 question: same eval set, swap system_prompt, measure regression delta
- Local vs cloud question: the judges list includes a local MLX endpoint (not wired in v1 but config-ready) alongside cloud APIs
Tested by building: a 6-model bake-off on a real audit task in one day.
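To make the bake-off idea concrete, here is a minimal sketch of the loop behind it: every model runs on the same cases, and the report carries overall accuracy, a per-bucket breakdown, and $/case. This is not the actual Eval-Framework code. The function name bake_off, the case fields, the toy models, and the prices are all assumptions for illustration, and the scorer here is a plain substring check standing in for whatever scorer the task uses.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical eval case shape: bucket label, input text, expected substring.
Case = Dict[str, str]


def bake_off(models: Dict[str, Callable[[str], str]],
             cases: List[Case],
             usd_per_case: Dict[str, float]) -> Dict[str, dict]:
    """Run every model on the same cases and summarize the results."""
    report = {}
    for name, model in models.items():
        hits_by_bucket = defaultdict(list)
        for case in cases:
            output = model(case["input"])
            # Placeholder scorer: substring match. An LLM-judge could go here.
            hits_by_bucket[case["bucket"]].append(case["expected"] in output)
        all_hits = [h for hits in hits_by_bucket.values() for h in hits]
        report[name] = {
            "accuracy": sum(all_hits) / len(all_hits),
            "per_bucket": {b: sum(h) / len(h) for b, h in hits_by_bucket.items()},
            "usd_per_case": usd_per_case[name],
        }
    return report


# Toy data and toy models (stand-ins for real API calls, not real results).
ANSWERS = {"2+2": "4", "3+3": "6", "17*3": "51", "19*7": "133"}
cases = [
    {"bucket": "easy", "input": "2+2", "expected": "4"},
    {"bucket": "easy", "input": "3+3", "expected": "6"},
    {"bucket": "hard", "input": "17*3", "expected": "51"},
    {"bucket": "hard", "input": "19*7", "expected": "133"},
]


def model_a(prompt: str) -> str:
    return f"The answer is {ANSWERS[prompt]}."


def model_b(prompt: str) -> str:
    # Only handles the easy (addition) bucket.
    return f"The answer is {ANSWERS[prompt]}." if "+" in prompt else "I am not sure."


report = bake_off({"A": model_a, "B": model_b}, cases, {"A": 0.002, "B": 0.0005})
for name, row in report.items():
    print(name, row)
```

The per-bucket view is the part that earns its keep: model B looks like a coin flip overall, but the breakdown shows it is perfect on one bucket and useless on the other.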
MVP scope (v1)
- Generic task config (YAML — system prompt + scoring rubric per task)
- 4 cross-cloud judge providers: xAI Grok, Anthropic Claude, OpenAI GPT, Google Gemini
- 2 scoring modes: LLM-judge + substring match
- Bake-off of N judges on the same task, with a 95% bootstrap CI
- CLI + Python API
→ Ship narrow and deep, defer breadth.
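For a sense of what "generic task config" means in practice, here is a rough Python mirror of the fields such a YAML file might carry. Only system_prompt, the rubric, and the two scoring modes come from the scope list. The class name TaskConfig, the other field names, and the validation rules are assumptions for illustration, not the real schema.

```python
from dataclasses import dataclass, field
from typing import List

# The two scoring modes from the MVP scope.
VALID_MODES = {"llm_judge", "substring"}


@dataclass
class TaskConfig:
    """Hypothetical in-memory shape of one task's YAML config."""
    name: str
    system_prompt: str
    rubric: str
    mode: str = "llm_judge"
    judges: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Fail fast on typos in the config instead of mid-run.
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown scoring mode: {self.mode!r}")
        if self.mode == "llm_judge" and not self.rubric.strip():
            raise ValueError("llm_judge mode needs a non-empty rubric")


audit_task = TaskConfig(
    name="knowledge-audit",
    system_prompt="You are an auditor. Report one finding with a supporting quote.",
    rubric="Pass if the finding is a real issue supported by the quoted text.",
    judges=["model-a", "model-b"],  # placeholder model ids
)
print(audit_task.mode)
```

Validating at load time is a cheap design choice that pays off in a bake-off: a bad config should fail before any paid API call goes out.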
Headline design principle — “Scorer = LLM-judge by default”
The original v0 used strict substring match: pass if the output contains the expected key_quote_substr.
The triggering event: Grok 4.3 on holdout-99 scored 79.8% under strict matching (79/99). Re-checking the 20 “failures” with an LLM-judge showed 19/20 were valid alternate findings, so real accuracy = 98/99 = 99.0%.
→ Multi-valid-output tasks (audit, summarization, translation) require LLM-judge scoring. Strict matching masks true performance.
→ Eval-Framework v1 default scorer = mode: llm_judge with the rubric in the task config.
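The gap between the two scorers is easy to see in a small sketch. The functions below are illustrative, not the framework's actual code: ask_judge is a hypothetical callable that would wrap a real LLM API, and fake_judge is a toy stand-in so the example runs offline. The arithmetic at the end assumes holdout-99 means 99 cases, which is consistent with 79.8% and 20 rechecked failures.

```python
from typing import Callable


def substring_score(output: str, key_quote_substr: str) -> bool:
    """Strict v0-style scoring: pass only if the expected quote appears verbatim."""
    return key_quote_substr in output


def llm_judge_score(output: str, rubric: str,
                    ask_judge: Callable[[str], str]) -> bool:
    """Rubric-based scoring. ask_judge is a hypothetical wrapper around an LLM API."""
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Candidate output:\n{output}\n\n"
        "Reply with PASS or FAIL."
    )
    return ask_judge(prompt).strip().upper().startswith("PASS")


def fake_judge(prompt: str) -> str:
    # Toy stand-in for a real model: accepts any candidate that quotes evidence.
    candidate = prompt.split("Candidate output:\n", 1)[1]
    return "PASS" if "'" in candidate else "FAIL"


# A valid finding that is different from the one the eval set expected.
output = "Finding: the guide recommends a removed API. Quote: 'endpoint removed in v3'."
rubric = "Pass if the finding is a real issue supported by a quote from the text."

strict = substring_score(output, "license expired in 2021")
judged = llm_judge_score(output, rubric, fake_judge)
print(strict, judged)

# The numbers from the post, assuming holdout-99 has 99 cases.
total_cases, strict_passes, rescued_by_judge = 99, 79, 19
strict_accuracy = round(100 * strict_passes / total_cases, 1)
judged_accuracy = round(100 * (strict_passes + rescued_by_judge) / total_cases, 1)
print(strict_accuracy, judged_accuracy)
```

The strict scorer rejects the output because it found a different (still valid) issue, while the rubric-based scorer accepts it. That is exactly the failure mode that hid 19 correct answers.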
Outcome measured (smoke task: Knowledge-Audit)
| Pre-Eval-Framework | Post-Eval-Framework |
|---|---|
| “Haiku verifier OK” (guess) | Haiku as direct auditor = 13% accuracy ⚠️ |
| Production: 4B + Haiku hybrid ~80% | Production: Grok 4.3 = 99% verified |
| Cost: $0.15/mo | Cost: $1.80/mo (acceptable) |
| Decision time: weeks | Decision time: 1 day |
Take-aways for other PMs
- An eval framework is a “measurement system” for AI features. Same role as A/B test infra for web.
- Build narrow first — one task (audit) → ship → then extend to more tasks (Mail-Assistant, Voice-Assistant).
- Scorer matters more than judge model. Strict substring = artifact-prone. LLM-judge = expensive but accurate.
- Statistical CI (bootstrap 1000 resamples) = essential. A single accuracy number is misleading.
- Reusability ≠ refactor-first. Copy-paste across 2 projects first → extract the patterns that repeat → then DRY them up (lazy abstraction).
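On the bootstrap point, a percentile bootstrap over per-case pass/fail results fits in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the framework's implementation: the function name bootstrap_ci, the fixed seed, and the percentile method are my choices here, and the 98-of-99 input simply reuses the audit result under the assumption that holdout-99 has 99 cases.

```python
import random
from typing import List, Tuple


def bootstrap_ci(hits: List[bool], resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (point accuracy, CI lower bound, CI upper bound), percentile method."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(hits)
    # Resample the cases with replacement and record each resample's accuracy.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(resamples))
    lower = means[round((alpha / 2) * resamples)]
    upper = means[round((1 - alpha / 2) * resamples) - 1]
    return sum(hits) / n, lower, upper


# 98 passes out of 99 cases.
hits = [True] * 98 + [False]
point, lo, hi = bootstrap_ci(hits)
print(f"accuracy={point:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

Two models whose intervals overlap heavily are not meaningfully different on this eval set, whatever their headline numbers say. That is the case a single accuracy figure hides.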
Repo + next
Code: personal repo (private).
Roadmap:
- Apply Eval-Framework to Personal-RAG (retrieval quality eval) — top priority because Personal-RAG serves other projects
- Mail-Assistant inbox classification eval
- Voice-Assistant tool-use accuracy
- Promptfoo CI integration (deferred)
→ Sequence: Personal-RAG → Mail-Assistant → Voice-Assistant → Mac-Translator.
Eval framework for Enterprise?
The same eval pattern scales up to enterprise tasks: KYC document extraction accuracy, contract clause classification, chatbot intent matching, claim summarization fidelity. Each task is defined by a single config file (system prompt + scoring rubric + eval set), then run as a bake-off across models on the company’s real product corpus.
Additional capabilities at enterprise scale: drift monitoring cron + Slack/email alerts when production regresses past a threshold, cost-quality frontier mapping for budget approval, audit trail per eval run for compliance.
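Two of these capabilities reduce to small, testable functions. The sketch below is hypothetical: the names, the 5-point threshold, and the toy cost and accuracy numbers are placeholders rather than measured results, and the alert function only returns a message where a real setup would post it to Slack or email.

```python
from typing import List, Optional, Tuple

# (model name, USD per case, accuracy): toy values, not measurements.
ModelRow = Tuple[str, float, float]


def pareto_frontier(models: List[ModelRow]) -> List[str]:
    """Keep models that no other model beats on both cost and accuracy."""
    frontier = []
    for name, cost, acc in models:
        dominated = any(
            c <= cost and a >= acc and (c < cost or a > acc)
            for _, c, a in models
        )
        if not dominated:
            frontier.append(name)
    return frontier


def regression_alert(baseline: float, current: float,
                     threshold: float = 0.05) -> Optional[str]:
    """Return an alert message if accuracy dropped by more than the threshold."""
    drop = baseline - current
    if drop > threshold:
        return f"accuracy regressed by {drop:.1%} (baseline {baseline:.1%})"
    return None


candidates = [
    ("small", 0.001, 0.80),
    ("mid", 0.004, 0.92),
    ("big", 0.010, 0.99),
    ("pricey", 0.012, 0.90),  # costs more than "big" and scores lower
]
print(pareto_frontier(candidates))
print(regression_alert(baseline=0.99, current=0.91))
```

The frontier is what goes into a budget conversation: every model on it is the best available at its price, and everything off it can be dropped without discussion.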