Architecture
Sister docs: PRD (intent), Implementation (code deep-dive), Notes (decision log).
System view
flowchart TB
classDef user fill:#cce0e8,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
classDef core fill:#faedd6,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
classDef adapter fill:#e0d5ed,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
classDef store fill:#f4d6db,stroke:#1a1a1d,color:#1a1a1d,stroke-width:2px
subgraph CLI["👤 Operator (PM)"]
Cmd["eval-framework bake-off
--task T --judges J1,J2,J3
--eval-set dev-93 --scorer llm_judge"]
Notebook["Python API
(jupyter notebooks)"]
end
Cmd --> Runner
Notebook --> Runner
subgraph Core["🏗️ Eval-Framework core"]
Runner["Runner
(asyncio parallel fan-out + retry)"]
Config["Task config
(YAML + Pydantic schema)"]
Scorer["Scorer
llm_judge (default) /
strict_substring"]
Stats["Stats
bootstrap CI 1000x
Cohen's κ pairwise"]
Reporter["Reporter
(per-stratum table,
cost, p95 latency)"]
Runner --> Config
Runner --> Scorer
Scorer --> Stats
Stats --> Reporter
end
subgraph Adapters["🔌 Judge adapter layer"]
AnthAd["Anthropic adapter
(Haiku/Sonnet/Opus 4.x)"]
XaiAd["xAI adapter
(Grok 4-fast, Grok 4.3)"]
OaiAd["OpenAI adapter
(GPT-5.4-mini)"]
GeminiAd["Gemini adapter
(Flash 2.5)"]
MlxAd["Local MLX adapter
(config-shaped, F7 wired)"]
end
Runner -->|parallel| AnthAd
Runner -->|parallel| XaiAd
Runner -->|parallel| OaiAd
Runner -->|parallel| GeminiAd
Runner -.->|deferred| MlxAd
Scorer --> JudgeLLM["LLM-Judge
(Sonnet 4.6 default)
VALID / INVALID verdict"]
subgraph Stores["🗄️ Eval stores"]
Tasks["tasks/
knowledge_audit.yaml
mail_classify.yaml
..."]
EvalDev["eval/dev-93.yaml
(iterable)"]
EvalHold["eval/holdout-99.yaml
FROZEN — one-pass only"]
PG["Postgres 16
eval_runs audit log
(shared w/ Personal-RAG)"]
end
Config -.reads.-> Tasks
Runner -.reads.-> EvalDev
Runner -.reads.-> EvalHold
Reporter -.writes.-> PG
subgraph Monitor["📡 Drift monitor (launchd weekly)"]
DriftCron["eval-framework verify
--task T --eval-set holdout-30-sample"]
Telegram["Telegram bot
P0 alert if acc drops ≥ 5pp"]
DriftCron --> Telegram
end
DriftCron -.uses.-> Adapters
class Cmd,Notebook user
class Runner,Config,Scorer,Stats,Reporter core
class AnthAd,XaiAd,OaiAd,GeminiAd,MlxAd,JudgeLLM adapter
class Tasks,EvalDev,EvalHold,PG store
The 5-component methodology
The framework is opinionated around 5 components discovered during the F3 bake-off. Each is necessary; dropping any one degrades decision quality:
| # | Component | Purpose | Failure mode if dropped |
|---|---|---|---|
| 1 | Stratified eval set | 5 buckets × 3 languages = 15 strata; reveals per-segment bias | Aggregate accuracy hides catastrophic failure modes (e.g. Gemini 99/99 empty) |
| 2 | Single task × multi judges | Apples-to-apples comparison with same SYSTEM_PROMPT + user format | Confounded comparisons; can’t attribute delta to model vs prompt |
| 3 | LLM-as-judge scorer (for multi-valid-output) | Sonnet 4.6 returns a VALID/INVALID verdict per finding | Strict substring undercounts (79.8% strict vs 99.0% LLM-judge on holdout); kills good models |
| 4 | Pairwise Cohen’s κ | Detects correlated bias across same-family judges | Ensembles built on Sonnet+Opus = duplicate errors, false confidence |
| 5 | Frozen holdout | Held-out queries never used during tuning; final decision = 1 pass | Overfit to dev → production regression at first real query |
Pipeline
[0] Author defines task YAML:
# tasks/knowledge_audit.yaml
name: knowledge_audit
system_prompt: |
  You are an auditor. Extract claims from <source>...
  Compare across sources. Output JSON list of contradictions or [].
user_template: "Sources:\n{sources}\n\nFindings (JSON):"
scoring:
  mode: llm_judge
  judge_model: claude-sonnet-4-6
  rubric: |
    For each finding, output VALID iff it is a real contradiction
    (not paraphrase, not over-inference, quote is verbatim).
  pass_rule: positive_at_least_one_valid_or_empty_on_negative
│
▼
[1] CLI parses:
eval-framework bake-off --task knowledge_audit
--judges grok-4.3,claude-haiku-4.5,claude-opus-4.7,...
--eval-set dev-93
--scorer llm_judge
│
▼
[2] Runner loads eval set (dev-93.yaml) → list[EvalCase]
Each case = {id, sources, expected_findings, stratum:{bucket, lang}}
│
▼
[3] Parallel fan-out (asyncio.gather, semaphore concurrency=8):
for judge in judges:
    for case in cases:
        adapter.complete(system_prompt, user_template.format(sources=case.sources))
            → raw_output
Retry on transient errors (429, 502, timeout): exponential backoff, 3 attempts.
Capture: raw_output, latency_ms, input_tokens, output_tokens, cost.
│
▼
[4] Scorer (mode=llm_judge):
For each (judge, case):
    Parse raw_output → list[finding]
    For each finding:
        Sonnet 4.6 verdict(finding, case.sources) → VALID | INVALID
    pass = pass_rule(verdicts, case.expected_type)
Caches Sonnet verdicts by (finding_hash, case_id) → re-runs free.
│
▼
[5] Stats:
For each judge:
    acc = sum(pass) / N
    ci_95 = bootstrap_resample(pass_vector, 1000)
    per_stratum_acc = group_by(stratum) → acc per (bucket, lang)
    cost_per_case = total_cost / N
    p95_latency = percentile(latencies, 95)
Cohen's κ pairwise:
    For each (j1, j2):
        κ = cohen_kappa_score(pass_j1, pass_j2)
│
▼
[6] Reporter renders to stdout + writes to Postgres:
┌────────────────┬──────┬──────────┬──────────┬─────────┬──────┐
│ Judge │ Acc │ CI 95% │ $/case │ p95 ms │ Rank │
├────────────────┼──────┼──────────┼──────────┼─────────┼──────┤
│ Grok 4.3 │ 83.3%│ [76, 89] │ $0.0021 │ 1840 │ 1 │
│ Opus 4.7 │ 66.7%│ [59, 74] │ $0.0089 │ 2110 │ 2 │
│ Sonnet 4.6 │ 60.0%│ [52, 68] │ $0.0034 │ 1320 │ 3 │
│ GPT-5.4-mini │ 53.3%│ [45, 61] │ $0.0011 │ 980 │ 4 │
│ Grok 4-fast │ 50.0%│ [42, 58] │ $0.0008 │ 720 │ 5 │
│ Haiku 4.5 │ 13.3%│ [ 8, 20] │ $0.0004 │ 460 │ 6 │
│ Gemini 2.5 Fl. │ 0.0% │ [ 0, 4] │ $0 │ 510 │ 7 ⚠ │
└────────────────┴──────┴──────────┴──────────┴─────────┴──────┘
Per-stratum + κ matrix printed below.
eval_runs.id = 4271 (Postgres audit log).
│
▼
[7] (Operator) Inspect top 2-3 → re-run on holdout-99 ONCE:
eval-framework bake-off --task knowledge_audit
--judges grok-4.3
--eval-set holdout-99
--final-decision
→ 99.0% LLM-judge / 79.8% strict / $0.61/mo projected production cost
→ Commit decision in project notes; deploy.
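Step [3] can be sketched roughly as follows. This is a minimal illustration, not the framework's actual runner: the names `call_with_retry` and `fan_out`, the dict-shaped cases, and the choice of `TimeoutError`/`ConnectionError` as stand-ins for 429/502/timeout responses are all assumptions made for the example.

```python
import asyncio

# Assumption: transient provider errors surface as these exception types.
TRANSIENT = (TimeoutError, ConnectionError)


async def call_with_retry(fn, *args, attempts=3, base_delay=0.5):
    """Await fn(*args), retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await fn(*args)
        except TRANSIENT:
            if attempt == attempts - 1:
                raise  # retries exhausted: let the caller record the error
            await asyncio.sleep(base_delay * 2 ** attempt)


async def fan_out(adapters, cases, system_prompt, user_template,
                  concurrency=8, base_delay=0.5):
    """Run every (judge, case) pair concurrently, bounded by a semaphore."""
    sem = asyncio.Semaphore(concurrency)

    async def one(judge_id, adapter, case):
        user = user_template.format(sources=case["sources"])
        async with sem:
            try:
                out = await call_with_retry(
                    adapter.complete, system_prompt, user, base_delay=base_delay
                )
            except Exception as exc:
                out = exc  # keep the failure attached to its (judge, case)
        return judge_id, case["id"], out

    tasks = [one(j, a, c) for j, a in adapters.items() for c in cases]
    return await asyncio.gather(*tasks)
```

Returning the exception in place of the output keeps a failed case tied to its judge and case id, so a later stage can mark it as an error instead of silently dropping it.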
Judge adapter pattern
All adapters implement a thin JudgeAdapter protocol:
from typing import Protocol

class JudgeAdapter(Protocol):
    model_id: str
    provider: str

    async def complete(
        self,
        system_prompt: str,
        user_message: str,
        max_tokens: int = 2048,
        temperature: float = 0.0,
    ) -> CompletionResult:
        """Returns (text, input_tokens, output_tokens, latency_ms, cost_usd)."""
        ...
Adding a new judge = ~30 LOC: import provider SDK, map kwargs, compute cost from token counts via a static price table. No core framework changes.
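The cost step of an adapter might look like the sketch below. The `CompletionResult` fields follow the tuple named in the protocol docstring; the `PRICE_PER_MTOK` table, the `example-model` id, its prices, and the `compute_cost` helper are hypothetical placeholders, not real rates or framework names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompletionResult:
    text: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost_usd: float


# Hypothetical static price table: USD per million tokens as (input, output).
PRICE_PER_MTOK = {"example-model": (1.00, 5.00)}


def compute_cost(model_id: str, input_tokens: int, output_tokens: int) -> float:
    """Cost = in_tokens x $/Mtok_in + out_tokens x $/Mtok_out."""
    price_in, price_out = PRICE_PER_MTOK[model_id]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


result = CompletionResult(
    text="[]",
    input_tokens=1000,
    output_tokens=200,
    latency_ms=840.0,
    cost_usd=compute_cost("example-model", 1000, 200),
)
```

Keeping prices in one static table means a price change is a one-line edit and never touches adapter logic.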
Provider quirks handled at adapter layer:
- Anthropic: system is a top-level param, not a message
- xAI: OpenAI-compatible Chat Completions; temperature=0 allowed
- OpenAI GPT-5.4-mini: Responses API; reasoning.effort defaults handled
- Gemini: system instruction via the system_instruction field; an empty candidates array is a real (frequent) failure mode, so the adapter returns CompletionResult(text="", ...) and the scorer can flag it
Scoring layer
Two modes, selected per task config:
strict_substring
def score(raw_output: str, expected_substr: str) -> bool:
    return expected_substr.lower() in raw_output.lower()
Use when: single canonical answer (e.g. “What’s the latency p95?” → "840 ms").
Fails on: multi-valid-output tasks. F3 lesson: on Knowledge-Audit, strict scored Grok 4.3 at 79.8% but 19/20 “failures” were valid alternate findings the rubric had simply not listed.
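A small illustration of both behaviours, using made-up strings (the example finding and expected substring are invented for this sketch, not taken from the eval set):

```python
def score(raw_output: str, expected_substr: str) -> bool:
    return expected_substr.lower() in raw_output.lower()


# Single canonical answer: the substring check works as intended.
canonical_ok = score("The p95 latency is 840 ms.", "840 ms")

# Multi-valid output: a correct finding phrased differently from the
# expected string is rejected, which is the undercounting described above.
alternate_ok = score(
    '[{"claim": "Doc A says launch is in May, Doc B says June"}]',
    "launch date conflict",
)

print(canonical_ok, alternate_ok)  # True False
```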
llm_judge (default)
from asyncio import gather

async def score(raw_output: str, case: EvalCase, rubric: str, judge_model: str) -> bool:
    findings = parse_json_list(raw_output)
    # judge: adapter resolved from judge_model; rubric_system: system prompt built from rubric
    verdicts = await gather(
        *(judge.complete(rubric_system, fmt(finding, case)) for finding in findings)
    )
    return apply_pass_rule(verdicts, case.expected_type)
Use when: multi-valid-output tasks (audit, summarization, translation, classification with overlapping labels).
Cost: ~30% on top of generation cost.
Cache: judge verdicts cached by (finding_hash, case_id) — re-runs across iteration loops are free.
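One way such a cache could work is sketched below. It is an in-memory illustration only: the `VerdictCache` class, the SHA-256 over canonical JSON as `finding_hash`, and the `judge_fn` callback are assumptions, and the source does not say where the real cache is stored.

```python
import asyncio
import hashlib
import json


class VerdictCache:
    """Memoize judge verdicts by (finding_hash, case_id)."""

    def __init__(self):
        self._store = {}
        self.judge_calls = 0  # counts real judge invocations (cache misses)

    @staticmethod
    def finding_hash(finding: dict) -> str:
        # Canonical JSON so key order does not change the hash.
        canonical = json.dumps(finding, sort_keys=True, ensure_ascii=False)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    async def verdict(self, finding: dict, case_id: str, judge_fn) -> str:
        key = (self.finding_hash(finding), case_id)
        if key not in self._store:
            self.judge_calls += 1
            self._store[key] = await judge_fn(finding)
        return self._store[key]


async def demo():
    async def fake_judge(finding):  # stand-in for the real judge call
        return "VALID"

    cache = VerdictCache()
    finding = {"claim": "A contradicts B", "quote": "x"}
    await cache.verdict(finding, "case-1", fake_judge)
    await cache.verdict(finding, "case-1", fake_judge)  # served from cache
    return cache.judge_calls
```

Including `case_id` in the key matters because the same finding text can be valid against one case's sources and invalid against another's.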
Default judge model: Grok 4.3 (per dojo eval; I wrongly defaulted to Haiku twice out of SDK familiarity). Config-overridable per task.
Holdout protection
The CLI refuses to run on holdout-* eval sets unless the explicit --final-decision flag is set. This is the only mechanism that prevents “just one more iteration on holdout” overfit.
# runner.py
if eval_set_path.name.startswith("holdout-") and not args.final_decision:
    raise GuardError(
        f"Refusing to run on {eval_set_path.name} without --final-decision. "
        "Use dev-* for iteration; holdout is one-pass only."
    )
After a --final-decision run, the holdout-set hash, judge list, and run id are appended to a tamper-evident log (eval/holdout-runs.log). Re-running on the same holdout is allowed but flagged loudly in the report (overfit warning).
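The source does not say how the log is made tamper-evident. One common technique is a hash chain, sketched here under that assumption; the entry fields, the JSON-lines layout, and the helper names `append_entry`, `verify`, and `prior_runs` are all hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # prev_hash used by the first entry


def _entry_hash(prev_hash: str, record: dict) -> str:
    payload = json.dumps(record, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(lines, holdout_hash, judges, run_id):
    """Return a new list of log lines with one chained entry appended."""
    prev = json.loads(lines[-1])["entry_hash"] if lines else GENESIS
    record = {"holdout_hash": holdout_hash, "judges": sorted(judges), "run_id": run_id}
    entry = {**record, "prev_hash": prev, "entry_hash": _entry_hash(prev, record)}
    return lines + [json.dumps(entry, sort_keys=True)]


def verify(lines) -> bool:
    """Recompute the chain; any edited or removed entry breaks it."""
    prev = GENESIS
    for line in lines:
        entry = json.loads(line)
        record = {k: entry[k] for k in ("holdout_hash", "judges", "run_id")}
        if entry["prev_hash"] != prev or entry["entry_hash"] != _entry_hash(prev, record):
            return False
        prev = entry["entry_hash"]
    return True


def prior_runs(lines, holdout_hash) -> int:
    """Count earlier runs on the same holdout, e.g. to drive the overfit warning."""
    return sum(json.loads(line)["holdout_hash"] == holdout_hash for line in lines)
```

Because each entry commits to the hash of the one before it, editing an old entry invalidates every entry after it.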
Stratification
The 192-case bootstrap was generated by Haiku from a corpus sample, then hand-balanced to 5 buckets × 3 languages:
| Bucket | Languages | Count |
|---|---|---|
| positive_explicit (clear contradiction) | VN / EN / mixed | 13 each = 39 |
| positive_subtle (semantic contradiction) | VN / EN / mixed | 13 each = 39 |
| negative_paraphrase (looks-like-contradiction, actually same) | VN / EN / mixed | 13 each = 39 |
| negative_no_overlap (sources don’t share topic) | VN / EN / mixed | 13 each = 39 |
| edge_quote_mismatch (quote attribution wrong) | VN / EN / mixed | 12 each = 36 |
| Total | | 192 |
Split: 93 dev + 99 holdout (per-stratum proportional).
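A per-stratum proportional split could be implemented along these lines. This is a sketch under assumptions: the largest-remainder rounding, the seeded shuffle, and the dict-shaped cases are choices made for the example, and the source does not describe how the actual 93/99 split was produced.

```python
import random
from collections import defaultdict


def stratified_split(cases, dev_total, seed=0):
    """Split cases into (dev, holdout), keeping each stratum's share proportional."""
    by_stratum = defaultdict(list)
    for case in cases:
        by_stratum[(case["bucket"], case["lang"])].append(case)

    n = len(cases)
    quotas = {s: len(g) * dev_total / n for s, g in by_stratum.items()}
    take = {s: int(q) for s, q in quotas.items()}  # floor of each quota

    # Largest-remainder rounding: hand leftover dev slots to the strata
    # whose quota was cut the most by flooring.
    leftover = dev_total - sum(take.values())
    ranked = sorted(quotas, key=lambda s: (quotas[s] - take[s], s), reverse=True)
    for s in ranked[:leftover]:
        take[s] += 1

    rng = random.Random(seed)
    dev, holdout = [], []
    for s in sorted(by_stratum):
        group = list(by_stratum[s])
        rng.shuffle(group)
        dev.extend(group[: take[s]])
        holdout.extend(group[take[s]:])
    return dev, holdout
```

With the table's stratum sizes (twelve strata of 13 and three of 12) and `dev_total=93`, this rounding yields exactly 93 dev and 99 holdout cases, with 6 or 7 dev cases per stratum.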
Drift monitor (weekly cron)
[launchd weekly Sat 03:00]
│
▼
eval-framework verify --task knowledge_audit
--eval-set holdout-30-sample
--judges grok-4.3
│
▼
Postgres: write eval_runs row (run_type='drift_check')
│
▼
SQL: compare acc to last 4-week median
│
├── Δacc ≤ -5pp → Telegram P0:
│ "🔴 Knowledge-Audit Grok 4.3 regressed:
│ 98% → 91% (Δ -7pp). Run full holdout-99 to confirm.
│ eval_runs.id=5023"
│
└── Δacc > -5pp → silent
ADHD-friendly format per memory feedback_adhd_delivery: severity-tagged, action-verb first, no fluff.
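The comparison step can be sketched in Python as below. The flow above runs it as SQL against eval_runs, so this is only an illustration of the logic; the function name `drift_alert` and its percent-valued inputs are assumptions, and it treats a drop of 5pp or more against the 4-week median as an alert, matching the example message (98% → 91%).

```python
from statistics import median


def drift_alert(current_acc, previous_accs, threshold_pp=5.0):
    """Return an alert string if accuracy fell by threshold_pp or more, else None.

    Accuracies are percentages (e.g. 91.0); previous_accs is oldest-first.
    """
    baseline = median(previous_accs[-4:])  # last 4 weekly runs
    delta_pp = current_acc - baseline
    if delta_pp <= -threshold_pp:
        return (
            f"🔴 regressed: {baseline:.0f}% → {current_acc:.0f}% "
            f"(Δ {delta_pp:+.0f}pp). Run full holdout-99 to confirm."
        )
    return None  # within tolerance: stay silent


print(drift_alert(91.0, [97.0, 98.0, 98.0, 99.0]))
```

Using the median of the last four runs, not just the previous run, keeps one unusually good or bad week from shifting the baseline.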
Data flow — judge call
Runner (asyncio task per (judge, case))
│
▼
Adapter.complete(system, user)
│
├── SDK retry-with-backoff (3 attempts)
├── Capture: text, in_tokens, out_tokens, ms
└── Cost = in_tokens × $/Mtok_in + out_tokens × $/Mtok_out
│
▼
Scorer.score(text, case)
│
├── parse → list[finding]
├── for finding: judge.complete(rubric_prompt, finding)
│ ← cached by (finding_hash, case_id)
└── apply_pass_rule(verdicts, case.expected_type) → bool
│
▼
Stats accumulator:
pass_vector[judge].append(bool)
cost_total[judge] += cost
latencies[judge].append(ms)
per_stratum[judge][stratum].append(bool)
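The apply_pass_rule step is not spelled out in this document. Going only by the rule's name (positive_at_least_one_valid_or_empty_on_negative), one plausible reading is sketched below; the `"positive"`/`"negative"` values for `expected_type` and the exact semantics are assumptions.

```python
def apply_pass_rule(verdicts: list[str], expected_type: str) -> bool:
    """Assumed reading of positive_at_least_one_valid_or_empty_on_negative."""
    if expected_type == "positive":
        # A positive case passes if the judge accepted at least one finding.
        return any(v == "VALID" for v in verdicts)
    if expected_type == "negative":
        # A negative case passes only if the model reported no findings at all.
        return len(verdicts) == 0
    raise ValueError(f"unknown expected_type: {expected_type!r}")


print(apply_pass_rule(["INVALID", "VALID"], "positive"))  # True
print(apply_pass_rule([], "negative"))                    # True
print(apply_pass_rule(["INVALID"], "negative"))           # False
```

Under this reading an empty output on a positive case fails, which matches how the Gemini empty-candidates failure ends up scored as 0%.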
Component responsibilities
| Component | Owns | Doesn’t own |
|---|---|---|
| Task YAML | system prompt, scoring rubric, judge model defaults | Eval cases, judge selection |
| Eval set YAML | Cases with id, sources, expected, stratum | Task semantics |
| Runner | Parallel fan-out, retry, semaphore-bounded concurrency | Scoring, statistics |
| Adapter | Provider SDK calls, cost computation, error normalization | Prompt engineering, scoring |
| Scorer | Strict vs LLM-judge dispatch, pass-rule application | Generation, statistics |
| Stats | Bootstrap CI, Cohen’s κ, per-stratum aggregation | Adapter, scoring |
| Reporter | Table render + Postgres audit-log writes | All compute |
| Holdout guard | Refuse holdout runs without --final-decision; append to tamper log | All else |
| Drift cron | Weekly verify + alert | Decision-making (operator confirms) |
Failure modes & recovery
| Failure | Detect | Recovery | Time |
|---|---|---|---|
| Provider 429 rate limit | Adapter | Exp backoff retry (3 attempts) | <30s |
| Provider 5xx | Adapter | Retry; if all fail → mark case as error, exclude from acc | <60s |
| Gemini empty candidates | Adapter returns text="" | Scorer flags as fail; report shows 0% acc loudly | immediate |
| Judge model deprecated | Adapter raises on first call | Update judge_model in task YAML | <5 min |
| Holdout leak attempt | CLI guard raises | Operator re-runs with explicit --final-decision if intended | immediate |
| Cohen’s κ NaN (perfect agreement with no label variance) | Stats | Set κ to 1.0 with note=degenerate | immediate |
| Postgres audit log down | Reporter | Render to stdout still works; warning printed | n/a |
| Bootstrap resample slow (N>1000) | Stats | Capped at 1000; configurable | n/a |
| Drift cron false positive | Telegram alert | Operator runs full holdout-99 to confirm before acting | <30 min |
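The two Stats rows above (the capped bootstrap and the degenerate κ) can be illustrated with a dependency-free sketch. The pipeline names cohen_kappa_score, which suggests a library call; the hand-rolled `cohen_kappa` and `bootstrap_ci` here are stand-ins written for this example, and the percentile-interval method is an assumption.

```python
import random


def bootstrap_ci(pass_vector, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy over a vector of pass/fail booleans."""
    rng = random.Random(seed)
    n = len(pass_vector)
    means = sorted(
        sum(rng.choices(pass_vector, k=n)) / n for _ in range(n_resamples)
    )

    def pick(q):
        return means[round(q * (n_resamples - 1))]

    return pick(alpha / 2), pick(1 - alpha / 2)


def cohen_kappa(a, b):
    """Cohen's kappa for two binary pass vectors; returns (kappa, note)."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    pa, pb = sum(a) / n, sum(b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)               # chance agreement
    if p_e == 1.0:
        # Both vectors are constant and identical: the formula divides by
        # zero, so report 1.0 with a degenerate note instead of NaN.
        return 1.0, "degenerate"
    return (p_o - p_e) / (1 - p_e), None


passes = [True] * 25 + [False] * 5
low, high = bootstrap_ci(passes)
kappa, note = cohen_kappa([True, True, False, False], [True, False, False, False])
```

The resample count is a parameter, so the 1000 cap stays configurable as the table notes.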
Why these choices
| Decision | Alternative considered | Why this won |
|---|---|---|
| LLM-judge default scorer | Strict substring as default | F3 lesson: strict masked 19/20 valid findings on Knowledge-Audit; 19.2pp accuracy gap |
| Grok 4.3 default judge (NOT Haiku) | Haiku (wrongly chosen as default twice out of SDK familiarity) | Bake-off: Haiku as direct auditor = 13%, Grok 4.3 = 99% on holdout. Production-verified. |
| YAML task config | JSON / Python | Multi-line system prompts + comments without escape pain |
| asyncio fan-out | Threading | Provider SDKs are async-native; cleaner cancellation |
| Postgres for audit log | SQLite / JSON files | Shared infra with Personal-RAG; cheap drift queries via SQL |
| Cohen’s κ pairwise | Fleiss’ κ across all | Pairwise reveals same-family bias (Sonnet ↔ Opus = 0.66); aggregate hides it |
| Bootstrap CI 1000 resamples | Wilson interval / asymptotic | Robust on small N (30 dev); same code path scales to 99 holdout |
| Stratified by 5 buckets × 3 langs | Random sample | Reveals language bias (Grok VN 88% / EN 63%); essential for VN-heavy corpora |
| Frozen holdout + CLI guard | Convention only | Operator self-discipline fails under deadline pressure; guard prevents accidents |
| 30-LOC adapter pattern | Heavy abstract base class | Lazy abstraction (YAGNI > DRY) — proven across 4 providers without rewrite |
| Weekly drift cron | Per-deploy CI eval | Side projects don’t have CI; weekly cron is the SMB-grade safety net |
| Single-day F1+F2+F3 ship | Phased over a week | Tight feedback loop forces brutal scope cuts; no premature polish |
See also
- Sequence diagrams for bake-off + scoring in Implementation
- 7 PM-bias catches in Notes — the decision-quality layer above the architecture
- Enterprise adaptations of this same architecture in Enterprise