
Eval-Framework: why PMs need to measure before picking an LLM

Building a personal eval framework: from a 20-case smoke test → 192 stratified cases → a bake-off across 6 models. When 'guessing' which LLM is best gets expensive and wrong.

TL;DR: I just shipped Eval-Framework, an eval harness for all of my personal AI features. It took 10 hours in a single day, including a bake-off across 6 models that verified Grok 4.3 at 99% on an audit task. Below are the insights for any PM making LLM decisions without measurement.

Context — JTBD (Jobs-To-Be-Done)

When: I’m building 8 personal projects that touch LLMs (Knowledge-Audit, Mail-Assistant, Mac-Translator, Voice-Assistant, Email-Filter, Personal-RAG, Diagram-Engine, Eval-Framework).

I want to: pick the right model for each task, without “going with my gut” or “defaulting to Haiku because I’m used to it”.

So that: decisions are data-driven, not vendor-locked, and I can swap models when something new ships without breaking production.

→ An eval framework is a foundational tool for every LLM-powered product decision.

User research n=1 — three real PM questions

PMs running LLM features keep hitting three recurring questions:

  1. Migration: “Should we move from model A (current) to model B (just released)?”
  2. Prompt engineering: “After swapping models, which prompts need rewriting? Will the old prompts break?”
  3. Build vs buy: “Are local LLMs (open-weight) ready to replace cloud APIs?”

All three share a pattern: you need evidence-based numbers before committing. No eval = guessing. Guessing wrong in production = silent accuracy drop, user churn, support ticket spike a few weeks later.

Existing alternatives — and why they fail

  • Promptfoo, DeepEval, Inspect AI: production-grade but overkill for one developer. Setup cost > value at personal scope.
  • OpenAI Evals: vendor lock-in, no cross-provider support.
  • EleutherAI lm-evaluation-harness (lm-eval-harness): academic benchmarks (MMLU, GSM8K) that don’t reflect task-specific accuracy on my audit task.
  • Manual A/B chat: n=5, not stratified, biased toward easy cases.

→ None fit “personal + cross-provider + task-specific + 5h to ship”.

Product hypothesis

An eval framework converts three PM questions (Migration / Prompt engineering / Build vs Buy) from “2-3 weeks of guessing against public benchmarks” → “evidence-based answer in a few hours” using an eval set specific to your own product.

Verifiable components:

  • Migration A→B question: one Eval-Framework CLI command, bake-off --judges A,B → 20 minutes → accuracy with a 95% CI, per-bucket breakdown, $/case
  • Prompt v3 vs v4 question: same eval set, swap system_prompt, measure regression delta
  • Local vs cloud question: the judges list includes a local MLX endpoint (not wired in v1 but config-ready) alongside cloud APIs

Tested by building: a 6-model bake-off on a real audit task in one day.
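
The bake-off loop behind these components can be sketched roughly as follows. This is a minimal illustration, not the Eval-Framework implementation: `Case`, `Reply`, `bake_off`, and the stub models are hypothetical names, and each model is assumed to be a plain callable that returns its text plus a cost in USD.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Case(NamedTuple):
    bucket: str    # stratum the case belongs to (hypothetical field)
    prompt: str
    expected: str


class Reply(NamedTuple):
    text: str
    cost_usd: float


def bake_off(cases: List[Case],
             models: Dict[str, Callable[[str], Reply]],
             score: Callable[[str, str], bool]) -> dict:
    """Run every model on the same cases; report accuracy, per-bucket accuracy, $/case."""
    report = {}
    for name, call in models.items():
        hits = defaultdict(list)  # bucket -> list of pass/fail booleans
        cost = 0.0
        for case in cases:
            reply = call(case.prompt)
            cost += reply.cost_usd
            hits[case.bucket].append(score(reply.text, case.expected))
        all_hits = [h for bucket_hits in hits.values() for h in bucket_hits]
        report[name] = {
            "accuracy": sum(all_hits) / len(all_hits),
            "per_bucket": {b: sum(h) / len(h) for b, h in hits.items()},
            "usd_per_case": cost / len(cases),
        }
    return report


# Stub models standing in for real API clients (assumption: no network calls here).
ANSWERS_A = {"q1": "alpha", "q2": "beta", "q3": "gamma"}
ANSWERS_B = {"q1": "alpha", "q2": "wrong", "q3": "wrong"}

cases = [Case("easy", "q1", "alpha"), Case("easy", "q2", "beta"), Case("hard", "q3", "gamma")]
models = {
    "A": lambda prompt: Reply(ANSWERS_A[prompt], 0.002),
    "B": lambda prompt: Reply(ANSWERS_B[prompt], 0.001),
}
report = bake_off(cases, models, lambda out, expected: expected in out)
```

Because every model sees the same cases and the same scorer, the same loop covers the prompt v3 vs v4 question: keep one model and pass two callables that differ only in their system prompt.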

MVP scope (v1)

  • Generic task config (YAML — system prompt + scoring rubric per task)
  • 4 cross-cloud judge providers: xAI Grok, Anthropic Claude, OpenAI GPT, Google Gemini
  • 2 scoring modes: LLM-judge + substring match
  • Bake-off of N judges on the same task, with a bootstrap 95% CI
  • CLI + Python API

→ Ship narrow and deep, defer breadth.
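
To make the “generic task config” idea concrete, here is one way the parsed config could look on the Python side. The field names (`name`, `system_prompt`, `rubric`, `mode`, `judges`) and the `validate` helper are assumptions for illustration; the real YAML schema is not shown in this post.

```python
from dataclasses import dataclass, field
from typing import List

VALID_MODES = {"llm_judge", "substring"}  # the two scoring modes listed above


@dataclass
class TaskConfig:
    """Hypothetical in-memory form of one task's YAML config."""
    name: str
    system_prompt: str
    rubric: str = ""
    mode: str = "llm_judge"
    judges: List[str] = field(default_factory=list)

    def validate(self) -> None:
        if self.mode not in VALID_MODES:
            raise ValueError(f"unknown scoring mode: {self.mode!r}")
        if self.mode == "llm_judge" and not self.rubric.strip():
            raise ValueError("llm_judge scoring needs a rubric")
        if not self.judges:
            raise ValueError("at least one judge is required")


# Example values are placeholders, not the real Knowledge-Audit prompt or rubric.
audit = TaskConfig(
    name="knowledge_audit",
    system_prompt="You audit notes for factual errors.",
    rubric="PASS if the finding points at a real issue in the source text.",
    judges=["grok", "claude", "gpt", "gemini"],
)
audit.validate()
```

Keeping the prompt, rubric, and judge list in one object per task is what lets a new task reuse the runner unchanged.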

Headline design principle — “Scorer = LLM-judge by default”

The original v0 used strict substring match: pass if the output contains the expected key_quote_substr.

The triggering event: Grok 4.3 on holdout-99 scored 79.8% under strict matching (79 of 99 cases). Re-checking the 20 “failures” with an LLM-judge showed that 19 of them were valid alternate findings, so the real accuracy is 98/99 = 99.0%.

→ Multi-valid-output tasks (audit, summarization, translation) require LLM-judge scoring. Strict matching masks true performance.

→ Eval-Framework v1 default scorer = mode: llm_judge with the rubric in the task config.
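
The difference between the two scoring modes can be sketched like this. It is an illustration under assumptions: `substring_score`, `llm_judge_score`, and `stub_judge` are hypothetical names, the judge is injected as a plain callable instead of a real API client, and the counts 79 and 19 are inferred from the 79.8% and 19/20 figures on the 99-case holdout.

```python
from typing import Callable


def substring_score(output: str, key_quote_substr: str) -> bool:
    """Strict mode: pass only if the expected quote appears verbatim."""
    return key_quote_substr in output


def llm_judge_score(output: str, rubric: str,
                    ask_judge: Callable[[str], str]) -> bool:
    """LLM-judge mode: ask a judge model for a PASS/FAIL verdict against the rubric."""
    prompt = (
        f"Rubric:\n{rubric}\n\n"
        f"Candidate output:\n{output}\n\n"
        "Answer with PASS or FAIL."
    )
    return ask_judge(prompt).strip().upper().startswith("PASS")


def stub_judge(prompt: str) -> str:
    # Stand-in for a real model call; always accepts, for demonstration only.
    return "PASS"


# A valid alternate finding: correct, but phrased differently from the expected quote.
output = "The note misstates the release year."
strict_ok = substring_score(output, "wrong launch date")
judged_ok = llm_judge_score(output, "PASS if the finding is a real issue.", stub_judge)

# Arithmetic behind the headline numbers (counts inferred from the text above).
total, strict_pass, rescued = 99, 79, 19
strict_acc = round(100 * strict_pass / total, 1)              # 79.8
judged_acc = round(100 * (strict_pass + rescued) / total, 1)  # 99.0
```

The strict scorer rejects the alternate finding because the expected quote is absent, while the judge path can accept it; that gap is what produced the 79.8% vs 99.0% difference.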

Outcome measured (smoke task: Knowledge-Audit)

Pre-Eval-Framework | Post-Eval-Framework
“Haiku verifier OK” (guess) | Haiku as direct auditor = 13% accuracy ⚠️
Production: 4B + Haiku hybrid ~80% | Production: Grok 4.3 = 99% verified
Cost: $0.15/mo | Cost: $1.80/mo (acceptable)
Decision time: weeks | Decision time: 1 day

Take-aways for other PMs

  1. An eval framework is a “measurement system” for AI features. Same role as A/B test infra for web.
  2. Build narrow first — one task (audit) → ship → then extend to more tasks (Mail-Assistant, Voice-Assistant).
  3. Scorer matters more than judge model. Strict substring = artifact-prone. LLM-judge = expensive but accurate.
  4. Statistical CI (bootstrap 1000 resamples) = essential. A single accuracy number is misleading.
  5. Reusability ≠ refactor-first. Copy-paste across 2 projects first → extract the shared patterns → then DRY (lazy abstraction).
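
For take-away 4, a percentile bootstrap over per-case pass/fail outcomes is enough to get a 95% CI. The sketch below uses only the standard library; `bootstrap_ci` is a hypothetical name, and the 98-of-99 outcome vector is inferred from the 99.0% result, not taken from the actual run logs.

```python
import random
from typing import List, Tuple


def bootstrap_ci(outcomes: List[int], n_resamples: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    n = len(outcomes)
    # Resample the cases with replacement and recompute accuracy each time.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(outcomes) / n, means[lo_idx], means[hi_idx]


outcomes = [1] * 98 + [0] * 1  # 1 = pass, 0 = fail
acc, lower, upper = bootstrap_ci(outcomes)
print(f"accuracy={acc:.3f}  95% CI=[{lower:.3f}, {upper:.3f}]")
```

With only 99 cases the interval is noticeably wide, which is why two models a couple of points apart should not be ranked on the point estimate alone.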

Repo + next

Code: personal repo (private).

Roadmap:

  • Apply Eval-Framework to Personal-RAG (retrieval quality eval) — top priority because Personal-RAG serves other projects
  • Mail-Assistant inbox classification eval
  • Voice-Assistant tool-use accuracy
  • Promptfoo CI integration (deferred)

→ Sequence: Personal-RAG → Mail-Assistant → Voice-Assistant → Mac-Translator.

Eval framework for Enterprise?

The same eval pattern scales up to enterprise tasks: KYC document extraction accuracy, contract clause classification, chatbot intent matching, claim summarization fidelity. Each task is defined by a single config file (system prompt + scoring rubric + eval set) and evaluated with a bake-off across models on the company’s real product corpus.

Additional capabilities at enterprise scale: drift monitoring cron + Slack/email alerts when production regresses past a threshold, cost-quality frontier mapping for budget approval, audit trail per eval run for compliance.