Eval-Framework: why PMs need to measure before picking an LLM

TL;DR: I just shipped Eval-Framework — an eval harness for every personal AI feature. 10 hours in a single day. Bake-off across 6 models. Verified Grok 4.3 at 99% on an audit task. Insights for any PM making LLM decisions without measurement.

Context — JTBD (Jobs-To-Be-Done)

When: I’m building 8 personal projects that touch LLMs (Knowledge-Audit, Mail-Assistant, Mac-Translator, Voice-Assistant, Email-Filter, Personal-RAG, Diagram-Engine, Eval-Framework).

I want to: pick the right model for each task, without “going with my gut” or “defaulting to Haiku because I’m used to it”.

So that: decisions are data-driven, not vendor-locked, and I can swap models when something new ships without breaking production.

→ An eval framework is a foundational tool for every LLM-powered product decision.

User research n=1 — three real PM questions

PMs running LLM features keep running into the same three questions:

Migration: “Should we move from model A (current) to model B (just released)?”
Prompt engineering: “After swapping models, which prompts need rewriting? Will the old prompts break?”
Build vs buy: “Are local LLMs (open-weight) ready to replace cloud APIs?”

All three share a pattern: you need evidence-based numbers before committing. No eval = guessing. Guessing wrong in production = silent accuracy drop, user churn, support ticket spike a few weeks later.

Existing alternatives — and why they fail

Promptfoo, DeepEval, Inspect AI: production-grade but overkill for one developer. Setup cost > value at personal scope.
OpenAI Evals: vendor lock-in, no cross-provider support.
EleutherAI lm-evaluation-harness: built around academic benchmarks (MMLU, GSM8K), which don’t reflect task-specific accuracy on my audit task.
Manual A/B chat: n=5, not stratified, biased toward easy cases.

→ None fit “personal + cross-provider + task-specific + 5h to ship”.

Product hypothesis

An eval framework turns the three PM questions (Migration / Prompt engineering / Build vs buy) from “2-3 weeks of guessing against public benchmarks” into “an evidence-based answer in a few hours”, using an eval set specific to your own product.

Verifiable components:

Migration A→B question: Eval-Framework CLI, one command (bake-off --judges A,B) → 20 minutes → 95% CI, per-bucket breakdown, $/case
Prompt v3 vs v4 question: same eval set, swap system_prompt, measure regression delta
Local vs cloud question: the judges list includes a local MLX endpoint (not wired in v1 but config-ready) alongside cloud APIs
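
The per-bucket breakdown and $/case figures in the first item can be sketched roughly as follows. This is a minimal illustration, not the actual Eval-Framework code: CaseResult, summarize, the bucket names, and the model labels are all hypothetical, and the numbers are made up to show the shape of the report.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CaseResult:
    model: str       # model under test (placeholder label)
    bucket: str      # stratum of the eval set, e.g. "easy" / "hard"
    passed: bool     # scorer verdict for this case
    cost_usd: float  # API cost attributed to this case


def summarize(results):
    """Aggregate overall accuracy, per-bucket accuracy, and mean $/case per model."""
    by_model = defaultdict(list)
    for r in results:
        by_model[r.model].append(r)

    summary = {}
    for model, rows in by_model.items():
        buckets = defaultdict(list)
        for r in rows:
            buckets[r.bucket].append(r.passed)
        summary[model] = {
            "accuracy": sum(r.passed for r in rows) / len(rows),
            "per_bucket": {b: sum(v) / len(v) for b, v in buckets.items()},
            "cost_per_case": sum(r.cost_usd for r in rows) / len(rows),
        }
    return summary


# Made-up results for two placeholder models, not measurements.
demo = [
    CaseResult("model-a", "easy", True, 0.002),
    CaseResult("model-a", "hard", False, 0.002),
    CaseResult("model-b", "easy", True, 0.004),
    CaseResult("model-b", "hard", True, 0.004),
]
report = summarize(demo)
for model, stats in report.items():
    print(model, stats)
```

The same summary works for the prompt v3 vs v4 question if the "model" label is replaced by a prompt version label.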

Tested by building: a 6-model bake-off on a real audit task in one day.

MVP scope (v1)

Generic task config (YAML — system prompt + scoring rubric per task)
4 cross-cloud judge providers: xAI Grok, Anthropic Claude, OpenAI GPT, Google Gemini
2 scoring modes: LLM-judge + substring match
Bake-off of N judges on the same task + bootstrap 95% CI
CLI + Python API

→ Ship narrow and deep, defer breadth.
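
To make the generic task config concrete, here is one guess at what a parsed config might hold once the YAML is loaded. The field names system_prompt, mode, and judges echo terms used in this post; TaskConfig, name, rubric, the task text, and the validation rules are assumptions for illustration, not the real schema.

```python
from dataclasses import dataclass, field

# The two scoring modes named in the MVP scope.
VALID_MODES = ("llm_judge", "substring")


@dataclass
class TaskConfig:
    name: str                 # hypothetical task identifier
    system_prompt: str        # prompt sent to every model in the bake-off
    mode: str = "llm_judge"   # scoring mode
    rubric: str = ""          # scoring rubric, used by the LLM-judge
    judges: list = field(default_factory=list)  # model labels to compare

    def validate(self):
        """Reject unknown modes and LLM-judge configs without a rubric."""
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown scoring mode: {self.mode!r}")
        if self.mode == "llm_judge" and not self.rubric.strip():
            raise ValueError("llm_judge mode needs a rubric")
        return self


# Placeholder content, not the real Knowledge-Audit config.
audit_task = TaskConfig(
    name="knowledge-audit",
    system_prompt="You are an auditor. Report one finding with a supporting quote.",
    rubric="Pass if the finding is a real issue supported by the quoted text.",
    judges=["model-a", "model-b"],
).validate()
print(audit_task.mode)
```

Validating at load time keeps a misconfigured task from silently burning API calls during a bake-off.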

Headline design principle — “Scorer = LLM-judge by default”

The original v0 used strict substring match: pass if the output contains the expected key_quote_substr.

The triggering event: Grok 4.3 on holdout-99 scored 79.8% under strict matching (79 of 99 cases). Re-checking the 20 “failures” with an LLM-judge showed that 19 of the 20 were valid alternate findings. Real accuracy = 98/99 = 99.0%.
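
The arithmetic behind those two numbers, as a quick check. It assumes holdout-99 contains 99 cases, which is inferred from the set's name and from 79.8% leaving exactly 20 failures; the variable names are illustrative.

```python
total_cases = 99       # assumed size of the holdout-99 set
strict_pass = 79       # cases passing strict substring match
strict_fail = total_cases - strict_pass  # the 20 "failures" that were re-checked
valid_alternates = 19  # re-checked failures the LLM-judge accepted as valid

strict_acc = strict_pass / total_cases
judged_acc = (strict_pass + valid_alternates) / total_cases

print(f"strict: {strict_acc:.1%}, LLM-judge: {judged_acc:.1%}")
# prints "strict: 79.8%, LLM-judge: 99.0%"
```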

→ Multi-valid-output tasks (audit, summarization, translation) require LLM-judge scoring. Strict matching masks true performance.

→ Eval-Framework v1 default scorer = mode: llm_judge with the rubric in the task config.
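
A minimal sketch of the two scoring modes side by side, to show how a valid alternate finding fails strict matching but can pass a rubric. Only key_quote_substr comes from this post; the function names, the prompt wording, and fake_judge (a stand-in for a real model call) are assumptions.

```python
from typing import Callable


def substring_score(output: str, key_quote_substr: str) -> bool:
    """Strict mode: pass only if the expected quote appears verbatim."""
    return key_quote_substr in output


def llm_judge_score(output: str, rubric: str, ask_judge: Callable[[str], str]) -> bool:
    """LLM-judge mode: ask a judge model to apply the rubric and parse PASS/FAIL."""
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Candidate output:\n{output}\n\n"
        "Answer PASS or FAIL."
    )
    verdict = ask_judge(prompt)
    return verdict.strip().upper().startswith("PASS")


def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge call: passes any candidate mentioning 'expired'."""
    candidate = prompt.split("Candidate output:\n", 1)[1]
    return "PASS" if "expired" in candidate else "FAIL"


# Made-up case: the model reports a different, but still valid, finding.
output = "The certificate on page 3 expired in 2021."
expected_quote = "signature is missing"
rubric = "Pass if the finding points to a real defect in the document."

strict_ok = substring_score(output, expected_quote)
judged_ok = llm_judge_score(output, rubric, fake_judge)
print(strict_ok, judged_ok)
```

Passing the judge in as a callable keeps the scorer provider-agnostic: any cloud or local model can sit behind ask_judge.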

Outcome measured (smoke task: Knowledge-Audit)

Pre-Eval-Framework	Post-Eval-Framework
“Haiku verifier OK” (guess)	Haiku as direct auditor = 13% accuracy ⚠️
Production: 4B + Haiku hybrid ~80%	Production: Grok 4.3 = 99% verified
Cost: $0.15/mo	Cost: $1.80/mo (acceptable)
Decision time: weeks	Decision time: 1 day

Take-aways for other PMs

An eval framework is a “measurement system” for AI features. Same role as A/B test infra for web.
Build narrow first — one task (audit) → ship → then extend to more tasks (Mail-Assistant, Voice-Assistant).
Scorer matters more than judge model. Strict substring = artifact-prone. LLM-judge = expensive but accurate.
Statistical CI (bootstrap 1000 resamples) = essential. A single accuracy number is misleading.
Reusability ≠ refactor-first. Copy-paste across 2 projects first → extract the patterns that repeat → then DRY (lazy abstraction).
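
For the bootstrap point above, a percentile bootstrap over per-case pass/fail outcomes can be sketched in a few lines. This is a generic version under stated assumptions, not the framework's code: bootstrap_ci, its parameters, and the fixed seed are illustrative, and the sample data simply mirrors 98 passes out of 99 cases.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes.

    Returns (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(outcomes)
    # Resample n cases with replacement and record the accuracy each time.
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int(n_resamples * alpha / 2)
    hi_idx = min(n_resamples - 1, int(n_resamples * (1 - alpha / 2)))
    return sum(outcomes) / n, means[lo_idx], means[hi_idx]


# Illustrative data: 98 passes and 1 failure.
outcomes = [1] * 98 + [0]
point, lo, hi = bootstrap_ci(outcomes)
print(f"accuracy {point:.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```

With only 99 cases the interval is noticeably wide, which is the reason to report it next to the headline number when comparing models that are a few points apart.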

Repo + next

Code: personal repo (private).

Roadmap:

Apply Eval-Framework to Personal-RAG (retrieval quality eval) — top priority because Personal-RAG serves other projects
Mail-Assistant inbox classification eval
Voice-Assistant tool-use accuracy
Promptfoo CI integration (deferred)

→ Sequence: Personal-RAG → Mail-Assistant → Voice-Assistant → Mac-Translator.

Eval framework for Enterprise?

The same eval pattern scales up to enterprise tasks: KYC document extraction accuracy, contract clause classification, chatbot intent matching, claim summarization fidelity. Each task is defined by a single config file (system prompt + scoring rubric + eval set), then run as a bake-off across models on the company’s real product corpus.

Additional capabilities at enterprise scale: drift monitoring cron + Slack/email alerts when production regresses past a threshold, cost-quality frontier mapping for budget approval, audit trail per eval run for compliance.
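
The drift-monitoring idea reduces to a threshold check that a scheduled job could run after each eval. A minimal sketch, assuming a stored baseline accuracy and a tolerated drop; check_drift, max_drop, and the message format are hypothetical, and delivery to Slack or email is left out.

```python
from typing import Optional


def check_drift(baseline_acc: float, current_acc: float, max_drop: float = 0.05) -> Optional[str]:
    """Return an alert message if accuracy fell by more than max_drop, else None."""
    drop = baseline_acc - current_acc
    if drop > max_drop:
        return (
            f"ALERT: accuracy dropped {drop:.1%} "
            f"(baseline {baseline_acc:.1%}, current {current_acc:.1%})"
        )
    return None


# Made-up numbers: a 9-point drop triggers an alert, a 2-point drop does not.
alert = check_drift(0.99, 0.90)
quiet = check_drift(0.99, 0.97)
print(alert)
print(quiet)
```

In practice the comparison would likely use the confidence interval rather than the point estimate, so that noise on a small eval set does not trigger alerts.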