Enterprise patterns
The personal version is the strictest constraint case: one operator, one machine, $5/mo budget, audit-trail in a local Postgres. Relaxing those constraints unlocks B2B applications without rewriting the architecture — the methodology (stratified eval → bake-off → LLM-judge scorer → Cohen’s κ → frozen holdout) stays identical. This page documents five concrete adaptations.
What stays vs. what changes
The 5-component methodology — stratified eval set, single task × multi judges, LLM-judge scorer, pairwise Cohen’s κ, frozen holdout — is identical across all enterprise use cases below. The deltas are around scale, governance, integration with vendor LLM ops, drift SLA, and compliance evidence, not around eval mechanics.
Migration matrix: Personal → Enterprise
| Aspect | Personal | Enterprise |
|---|---|---|
| Tasks evaluated | 1 task/project, 8 projects | N task types × M production features, organisation-wide |
| Eval set size | 93 dev + 99 holdout (192 total) | 1K–10K cases per task, stratified by tenant/segment/region |
| Authoring | Hand-curated YAML | Mix of hand-curated golden sets + automated mining from production logs + crowd-sourced labels |
| Judge providers | 4 cross-cloud (Anthropic, xAI, OpenAI, Google) | Same + private fine-tuned models + on-prem deployments + vertical-specialised vendors |
| Scorer | LLM-judge default (Grok 4.3) | Same, plus human-in-the-loop sampling for high-stakes domains; SME rubrics signed off by domain experts |
| Statistics | Bootstrap CI 1000 resamples + pairwise κ | Same + per-segment regression tests + significance vs. control group + multi-armed-bandit for A/B routing |
| Drift monitoring | Weekly cron + Telegram | Real-time streaming eval on sampled prod traffic + PagerDuty + SLO dashboards |
| Compliance evidence | None | Per-run signed audit trail, eval-run hash on-chain or in WORM storage, exportable for auditor review (PCI/SOC2/HIPAA) |
| Cost model | <$5/mo total | Per-eval-run usage-based pricing or seat-based platform fee; bake-off cost is a line item in vendor migration ROI |
| Holdout governance | CLI guard + tamper log | Air-gapped holdout storage, dual-control release, holdout-rotation policy after each model migration |
| Integration | CLI + Postgres | CI/CD gate (block deploy on regression ≥X pp) + vendor LLM ops platforms (LangSmith/Weights & Biases/Braintrust) + ticketing |
| Latency SLA on bake-off | Best-effort (~12 min) | Streaming results within SLA; partial results acceptable if provider X is degraded |
| Org scale | 1 person | Centralised AI-ops team + per-product PM-owned eval sets; governance forum reviews drift breaches |
The architecture diagram doesn’t change — only the labels on each component scale up.
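The "Bootstrap CI 1000 resamples" entry in the Statistics row can be sketched as a percentile bootstrap over per-case judge verdicts. This is a minimal illustration, not the framework's own code: the function name `bootstrap_ci`, the 0/1 verdict encoding, and the fixed seed are all assumptions.

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-case scores (1 = pass, 0 = fail).

    Returns (point_estimate, ci_lower, ci_upper).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so reruns are reproducible
    n = len(scores)
    # Resample the eval set with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(scores) / n, lower, upper

# Hypothetical verdicts on a 99-case holdout: 84 passes, 15 failures.
holdout_verdicts = [1] * 84 + [0] * 15
accuracy, ci_low, ci_high = bootstrap_ci(holdout_verdicts)
```

At enterprise scale the same function would run once per stratum (tenant, segment, region), so that a wide interval on a small stratum is visible rather than averaged away.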
Use case A — B2B SaaS chatbot eval (vendor like CX Genie / Botpress Cloud)
Problem
Conversational AI vendors selling to mid-market and enterprise customers ship monthly model + prompt updates. Today most rely on a static QA team poking the bot ad-hoc, then “shipping and praying”. Customer-reported regressions surface 2-6 weeks post-ship; by then 3-5 cohorts of users have experienced degraded quality, churn risk has spiked, and a costly emergency rollback is needed. The vendor’s own engineering org also can’t prove to enterprise procurement that “our v4.2 is better than v4.1” — the buyer asks for evidence and gets a marketing deck.
Industry datum: 38% of enterprise chatbot procurement RFPs in 2024 now require eval-driven dev evidence (Gartner Chatbot Magic Quadrant 2025).
Persona
Vendor PMs + AI engineers (the seller). Enterprise procurement + risk teams (the buyer). Customer success owners renewing accounts where bot quality is at risk.
Why eval matters
- Ship-then-pray vs ship-then-measure: without a 7-metric scorecard run pre-deploy, no defensible “we tested it” claim
- Customer-specific golden sets: each enterprise customer trains the bot on their own corpus → each gets a unique eval set; vendor must evaluate per-customer regression before pushing a shared model update
- Procurement evidence: bake-off output + per-stratum breakdown is the artifact procurement accepts as proof
What changes from personal version
- 7-metric scorecard per case: response accuracy, hallucination rate, escalation appropriateness, deflection rate, latency p95, cost/turn, context-quality score (RAG retrieval relevance). Personal version measures 1-2 metrics per task; enterprise scorecard demands all 7 with per-stratum breakdown.
- Per-customer golden sets: 200-500 cases per enterprise tenant, hand-curated from real conversations + edge cases. ~50-200 customers = ~10K-100K eval cases total.
- Eval-driven dev gate: every prompt iteration runs against the customer’s golden set + a shared cross-customer regression suite. Block deploy if any customer regresses ≥3pp.
- A/B production routing: pilot model variant on 5% of traffic with online eval; promote when the lower bound of the 95% CI on the variant-minus-control difference is above zero.
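The A/B promotion rule in the last bullet can be sketched as follows, assuming the interval in question is a normal-approximation (Wald) CI on the difference in pass rates between variant and control. The function names and the example counts are hypothetical; a production version might prefer a bootstrap or a sequential test.

```python
import math

def diff_ci_lower(pass_variant, n_variant, pass_control, n_control, z=1.96):
    """Lower bound of the ~95% Wald CI on (variant pass rate - control pass rate)."""
    p_v = pass_variant / n_variant
    p_c = pass_control / n_control
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_v * (1 - p_v) / n_variant + p_c * (1 - p_c) / n_control)
    return (p_v - p_c) - z * se

def should_promote(pass_variant, n_variant, pass_control, n_control):
    """Promote only when the whole interval sits above zero."""
    return diff_ci_lower(pass_variant, n_variant, pass_control, n_control) > 0.0
```

With 1,000 judged turns per arm, a 90% vs 85% pass rate clears the bar, while 86% vs 85% does not: the point estimate is positive but the interval still includes zero.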
Stack mapping (Eval-Framework primitives → enterprise extension)
| Eval-Framework primitive | Enterprise mapping |
|---|---|
| Task YAML | One task type per metric (response_accuracy, hallucination, escalation, …) — 7 YAMLs |
| Stratified eval set | Per-customer × per-intent stratification (5-10 intents × N customers × language) |
| LLM-judge scorer | SME-reviewed rubric per metric; judge model varies (Sonnet 4.6 for accuracy, Grok 4.3 for hallucination per dojo eval) |
| Cohen’s κ | Detect correlated bias when ensembling Anthropic + OpenAI judges |
| Frozen holdout | Per-customer holdout rotates quarterly; air-gapped storage |
| Drift cron | Real-time streaming eval on 1% production traffic; alert on ≥3pp regression |
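The drift row above (sample 1% of production traffic, alert on a regression of 3pp or more) could look roughly like this. Hash-based sampling is an assumption chosen here because it is deterministic per request ID; the names `in_sample` and `should_alert` are illustrative only.

```python
import hashlib

def in_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(rate * 2**64)

def regression_pp(baseline_acc: float, live_acc: float) -> float:
    """Drop in accuracy, in percentage points (positive means worse)."""
    return (baseline_acc - live_acc) * 100.0

def should_alert(baseline_acc: float, live_acc: float, threshold_pp: float = 3.0) -> bool:
    """Fire when live accuracy has fallen by at least `threshold_pp` points."""
    return regression_pp(baseline_acc, live_acc) >= threshold_pp
```

Because the decision depends only on the request ID, the same conversation is either always or never in the sample, which keeps replays and audits consistent.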
Cost estimate (mid-market chatbot vendor, 100 enterprise customers)
- Eval cases: 100 customers × 300 cases avg = 30K cases
- Bake-off frequency: weekly per customer (4 judges × 300 cases = 1,200 calls/customer/week)
- LLM-judge scoring: ~30% on top
- All-in eval compute: ~$8K/mo (~$96K/year) vs $300K-1M/year in prevented churn from regression incidents = roughly 3-10× ROI on compute alone
- Eval team: 2 AI ops engineers + 1 product analyst = ~$600K/year fully loaded
Compliance angle
- SOC2 Type II: eval-run audit log proves “change management with rollback readiness” controls
- EU AI Act (high-risk chatbots in financial/health): documented eval methodology + holdout governance becomes mandatory evidence
- Customer contract clauses: “we will not degrade metric Y by more than X pp without 30-day notice”, enforceable only with eval infrastructure
Use case B — Fintech AI reconciliation eval (payment platform like LivePayments)
Problem
Payment platforms process T+1 settlement reconciliation: matching incoming bank statements against expected payouts across multi-currency, multi-rail (SWIFT / ACH / SEPA / local rails), with FX, fees, and chargeback adjustments. An LLM-assisted matcher classifies ambiguous cases (suspected duplicates, near-matches, dispute candidates). When the matcher is wrong, money sits in suspense accounts, regulators flag breaks, and ops teams burn hours on manual reconciliation. Vendors face audit pressure to prove any model swap (e.g. moving from Haiku 4.5 → Sonnet 4.6 for accuracy lift) didn’t quietly regress on edge cases.
Industry datum: PCI-DSS v4.0 explicitly requires “documented testing of AI/ML components used in payment processing” before production rollout.
Persona
Payment ops engineers, treasury ops managers, compliance/risk officers, external auditors. Vendor PM responsible for the matcher feature.
Why eval matters
- Auditability requirement: every model swap must produce signed eval evidence — bake-off result + per-segment regression + holdout-run hash
- Asymmetric error cost: a false-positive match (auto-clearing when actually a duplicate) costs $X in chargeback exposure; a false-negative (flagging a real match as ambiguous) costs $0.50 in ops time. Scorer must weight asymmetrically.
- FX edge cases: mid-day FX rate shifts create near-match candidates that look like duplicates; eval set must over-sample these
What changes from personal version
- Segment-weighted accuracy: weight per case by transaction value tier + currency + rail. Personal version treats all cases equally; here a $1M T+1 settlement weighs 10K× a $100 retail txn.
- Cost-asymmetric scoring rubric: LLM-judge prompt encodes “false-positive 100× cost of false-negative” so reported accuracy reflects business risk.
- Live shadow eval: run candidate model in parallel with production matcher on real flow (no auto-action); compare verdicts; promote only when shadow agrees with production on 99.5%+ of cases AND beats production on disputed-case accuracy.
- Holdout rotation: holdout rotated quarterly with dual-control sign-off (engineering + compliance both must approve release).
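Segment-weighted accuracy and cost-asymmetric scoring from the bullets above can be illustrated with a small sketch. It weights each case by transaction value only (the source also mentions currency and rail) and uses the 100:1 false-positive to false-negative ratio as a default; the `Case` type and function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Case:
    value_usd: float       # transaction value, used as the case weight
    predicted_match: bool  # matcher verdict: True = auto-clear as a match
    is_true_match: bool    # ground-truth label

def weighted_accuracy(cases):
    """Accuracy where each case counts in proportion to its transaction value."""
    total = sum(c.value_usd for c in cases)
    correct = sum(c.value_usd for c in cases if c.predicted_match == c.is_true_match)
    return correct / total

def asymmetric_cost(cases, fp_cost=100.0, fn_cost=1.0):
    """Total penalty with false positives costing 100x a false negative by default."""
    fp = sum(1 for c in cases if c.predicted_match and not c.is_true_match)
    fn = sum(1 for c in cases if not c.predicted_match and c.is_true_match)
    return fp * fp_cost + fn * fn_cost

cases = [
    Case(1_000_000, predicted_match=True, is_true_match=False),  # wrongly auto-cleared
    Case(100, predicted_match=True, is_true_match=True),
    Case(100, predicted_match=False, is_true_match=True),        # needlessly flagged
]
```

On these three hypothetical cases the unweighted accuracy is 1/3, but the value-weighted accuracy is about 0.0001 because the single error sits on the $1M settlement.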
Stack mapping
| Eval-Framework primitive | Enterprise mapping |
|---|---|
| Task YAML | reconciliation_match, dispute_classify, fx_edge_case |
| Stratified eval set | (currency × rail × value-tier × FX-volatility) — ~50 strata |
| LLM-judge scorer | Cost-asymmetric rubric; SME (treasury ops lead) signs off |
| Frozen holdout | Air-gapped; quarterly rotation; signed release ceremony |
| Drift cron | Daily on a 1K-case sample; PagerDuty on ≥1pp regression on high-value-tier stratum |
Cost estimate (regional payment platform, 5M txns/day)
- Eval cases: 10K hand-curated + 50K mined from production
- Daily drift: 1K-case sample × 3 judges = 3K calls/day × $0.003 = $9/day = $270/mo
- Bake-off (4 judges × 10K cases = 40K calls): ~$120/run × 4 runs/mo = $480/mo
- All-in: ~$750/mo vs $50M+ daily settlement value at risk = trivially worth it
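The arithmetic behind this estimate can be reproduced with a few lines, assuming the $0.003 per-call rate from the drift bullet also applies to bake-off calls and that the bake-off runs four times per month (as the "× 4" implies). The function and parameter names are illustrative.

```python
def monthly_eval_cost(drift_cases_per_day, drift_judges,
                      bakeoff_cases, bakeoff_judges, bakeoff_runs_per_month,
                      usd_per_call=0.003, days_per_month=30):
    """Return the monthly drift, bake-off, and total judge-call spend in USD."""
    drift = drift_cases_per_day * drift_judges * usd_per_call * days_per_month
    bakeoff = bakeoff_cases * bakeoff_judges * usd_per_call * bakeoff_runs_per_month
    return {"drift": drift, "bakeoff": bakeoff, "total": drift + bakeoff}

# Figures from the estimate above: 1K-case daily sample x 3 judges,
# 10K-case bake-off x 4 judges, 4 runs per month.
estimate = monthly_eval_cost(1_000, 3, 10_000, 4, 4)
```

Keeping the estimate as a function makes it easy to see which input dominates: here the bake-off cadence, not the daily drift sample, drives most of the spend.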
Compliance angle
- PCI-DSS v4.0: documented AI/ML component testing evidence; per-run audit trail accepted by QSA
- SOX: model change control with rollback readiness; eval-run hash stored in WORM compliance vault
- Local payment regulators (e.g. MAS, SBV, BNM): pre-rollout impact assessment proven via stratified holdout result
Use case C — EdTech content-moderation eval (student-data product)
Problem
EdTech platforms serving K-12 (preschool through high school) handle highly regulated minor-data and must moderate every piece of user-generated content (forum posts, chat, assignment submissions, photo uploads). The moderation model classifies content as safe / age-appropriate-warning / blocked. False negatives (harmful content reaches a minor) trigger regulatory fines + reputational catastrophe; false positives (over-blocking benign student work) frustrate teachers + parents + erode adoption. When the vendor swaps moderation models, every regulator and every parent rep wants evidence the new model isn’t more dangerous.
Industry datum: COPPA (US), GDPR-K (EU), Singapore PDPA-minor amendments all require demonstrable testing of AI moderation tools handling minor data; “we tested it” is no longer acceptable — evidence is.
Persona
EdTech product owners, school-board procurement, parent representatives on advisory boards, regulators (FTC / DfE / MOE / KOMINFO equivalents), trust & safety engineers.
Why eval matters
- Compliance evidence for parents and regulators: the eval report itself becomes a public-facing artifact (“our moderation model achieves 99.2% recall on harmful content across 8 categories, audited quarterly”)
- Age-appropriate stratification: 4-year-old vs 14-year-old language norms differ wildly; eval set MUST stratify by age cohort
- PII filter sub-eval: a separate task evaluates the PII redactor that runs before content reaches the moderation model (defense-in-depth)
What changes from personal version
- Multi-category recall metric: 8-10 harm categories (self-harm, bullying, sexual, violence, drugs, hate, doxxing, scam) each with separate recall target (>99% for self-harm, >95% for others). Personal version measures single-task accuracy; here each category is its own eval task.
- Age-cohort stratification: 4 cohorts (4-6, 7-10, 11-13, 14-18) × multiple languages × content type (text/image/audio) = 100+ strata. Reveals “moderator works great on teens but misses preschool euphemisms”.
- Per-school golden sets: top school-district customers contribute curated cases representing their student population; vendor maintains a federated holdout per district.
- Human-in-the-loop sampling: 1% of flagged content reviewed by trust & safety humans → labels feed back into next eval set (active learning).
- Public eval report: a quarterly published methodology + headline numbers (similar to Apple’s annual transparency report).
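The multi-category recall metric above, with its stricter target for self-harm, might be checked like this. The tuple layout `(category, is_harmful, was_flagged)` and the helper names are assumptions made for the sketch.

```python
from collections import defaultdict

RECALL_TARGETS = {"self-harm": 0.99}  # stricter target from the bullet above
DEFAULT_TARGET = 0.95                 # target for every other category

def recall_by_category(cases):
    """cases: iterable of (category, is_harmful, was_flagged) tuples."""
    tp, fn = defaultdict(int), defaultdict(int)
    for category, is_harmful, was_flagged in cases:
        if not is_harmful:
            continue  # recall only looks at truly harmful items
        if was_flagged:
            tp[category] += 1
        else:
            fn[category] += 1
    categories = set(tp) | set(fn)
    return {c: tp[c] / (tp[c] + fn[c]) for c in categories}

def failing_categories(cases):
    """Categories whose recall does not strictly exceed their target."""
    recalls = recall_by_category(cases)
    return sorted(c for c, r in recalls.items()
                  if r <= RECALL_TARGETS.get(c, DEFAULT_TARGET))
```

A model with 98% recall on self-harm fails this check even if every other category is perfect, which is the point of treating each category as its own eval task.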
Stack mapping
| Eval-Framework primitive | Enterprise mapping |
|---|---|
| Task YAML | One per harm category (8-10 YAMLs) + PII filter + age-appropriate language |
| Stratified eval set | (category × age-cohort × language × content-type) — 100+ strata |
| LLM-judge scorer | Trust & safety SME rubric per category; conservative side (false-positive better than false-negative on self-harm) |
| Frozen holdout | Per-district holdout with parental-consent governance; quarterly rotation |
| Drift cron | Real-time eval on 0.1% of production moderation decisions; SLO dashboard per harm category |
Cost estimate (regional K-12 EdTech, 2M MAU)
- Eval cases: 20K hand-curated by T&S team + 100K mined-and-reviewed
- Real-time eval: 0.1% × 10M moderation decisions/day = 10K LLM-judge calls/day = ~$30/day = $900/mo
- Quarterly bake-off across vendor models: ~$500/quarter
- All-in: ~$1.2K/mo vs (regulatory fine exposure + churn risk + brand damage = unbounded)
Compliance angle
- COPPA / GDPR-K: documented testing evidence for AI tools processing minor data
- EU AI Act (high-risk: education): moderation model classified high-risk → mandatory pre-rollout impact assessment + ongoing monitoring documented
- Parental transparency: public eval report becomes a trust signal in renewals + procurement
Use case D — Healthcare symptom triage eval (clinical decision support)
Problem
Clinical decision support tools that triage incoming patient messages (telemedicine intake, nurse-line, ER pre-triage) classify symptoms into N priority levels (e.g. immediate-ER, urgent-care-within-4h, GP-within-24h, self-care). Models help nurses scale to higher patient volume, but false-negatives are catastrophic (sending a heart-attack patient home as “self-care”). Vendors must prove triage accuracy across the full risk distribution before clinical deployment, and prove it again after every model update.
Industry datum: FDA’s “Predetermined Change Control Plan” (PCCP) framework (2024) requires vendors of AI/ML medical devices to submit a documented eval-and-monitoring plan for any model updates intended to ship post-clearance.
Persona
Clinical product managers, medical directors, FDA / EMA / CDSCO / equivalent regulators, hospital chief medical informatics officers (CMIOs), nurse line operations.
Why eval matters
- False-negative cost asymmetry: missing a true emergency = patient harm + malpractice exposure; over-triaging to ER = mild inconvenience + cost. Scorer must encode this asymmetry explicitly.
- LLM-judge with clinician rubric: strict-match scoring fails because there are often 2-3 acceptable triage levels for ambiguous presentations; LLM-judge with clinician-authored rubric captures clinically-equivalent answers.
- Pre-clearance + post-clearance evidence: every model update requires PCCP-compliant testing documentation.
What changes from personal version
- Clinician-authored LLM-judge rubric: senior physicians draft the verdict prompt: “A triage of `urgent-care-within-4h` is VALID for this presentation if the symptom complex falls within X clinical guidelines.” Personal version uses a generic rubric; here every rubric is SME-signed.
- False-negative-weighted accuracy: report a safety score = 1 - false_negative_rate_on_emergencies, alongside aggregate accuracy. Personal version reports aggregate; here safety score is the headline.
- Specialty stratification: pediatric / geriatric / obstetric / cardiac / respiratory / mental-health — different presentations, different triage thresholds, different judge rubrics.
- Counterfactual eval: for each holdout case, run the model with key clinical details perturbed (age ±10y, vitals ±20%) — robustness check.
- Dual judges: every case scored by 2 independent SME-rubric judges; disagreement → escalate to clinician review (active learning).
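The safety score and the "2-3 acceptable triage levels" idea from this section can be sketched together. The sketch assumes `immediate-ER` is the only emergency level and that each case carries a clinician-approved set of acceptable levels; the `TriageCase` type is hypothetical.

```python
from dataclasses import dataclass

EMERGENCY = "immediate-ER"

@dataclass
class TriageCase:
    gold: str              # reference triage level
    predicted: str         # model's triage level
    acceptable: frozenset  # levels the clinician rubric treats as equivalent

def aggregate_accuracy(cases):
    """Share of cases whose prediction falls inside the acceptable set."""
    return sum(c.predicted in c.acceptable for c in cases) / len(cases)

def safety_score(cases):
    """1 - false-negative rate on emergencies (missed emergency = anything below ER)."""
    emergencies = [c for c in cases if c.gold == EMERGENCY]
    if not emergencies:
        raise ValueError("no emergency cases in this eval set")
    missed = sum(1 for c in emergencies if c.predicted != EMERGENCY)
    return 1 - missed / len(emergencies)

cases = [
    TriageCase(EMERGENCY, EMERGENCY, frozenset({EMERGENCY})),
    TriageCase(EMERGENCY, "self-care", frozenset({EMERGENCY})),  # missed emergency
    TriageCase("GP-within-24h", "urgent-care-within-4h",
               frozenset({"GP-within-24h", "urgent-care-within-4h"})),
    TriageCase("self-care", "self-care", frozenset({"self-care"})),
]
```

On these four hypothetical cases aggregate accuracy is 0.75 while the safety score is 0.5, which shows why the safety score, not the aggregate, is reported as the headline.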
Stack mapping
| Eval-Framework primitive | Enterprise mapping |
|---|---|
| Task YAML | One per specialty (6-8 YAMLs); rubric signed by specialty SME |
| Stratified eval set | (specialty × age × severity × demographics × language) — 200+ strata |
| LLM-judge scorer | Clinician-authored rubric per specialty; dual-judge with escalation |
| Cohen’s κ | Inter-judge κ tracked over time; <0.7 = rubric clarification needed |
| Frozen holdout | Curated by medical advisory board; rotation tied to clinical guideline updates (annual) |
| Drift cron | Daily on 500-case stratified sample; clinical incident triggers immediate full-holdout re-run |
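The inter-judge κ row can be made concrete with a plain implementation of Cohen's κ for two judges, plus the 0.7 threshold from the table. This is a standard-formula sketch; the helper name `needs_rubric_review` is an assumption.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labelling the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each judge's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1 - expected)

def needs_rubric_review(kappa, threshold=0.7):
    """Flag the rubric for clarification when agreement drops below the threshold."""
    return kappa < threshold
```

For example, judges labelling four cases as `ER, ER, GP, GP` and `ER, GP, GP, GP` agree on 3 of 4 cases but reach only κ = 0.5, so the rubric would be flagged.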
Cost estimate (regional telemedicine platform, 500K patient encounters/month)
- Eval cases: 5K-10K per specialty hand-curated by medical advisory board = ~50K total
- Bake-off pre-update: 3 candidate models × 50K = 150K calls + dual-judge = ~$1.5K/update × 4 updates/year = $6K/year
- Daily drift: 500 cases × 2 judges = 1K calls/day = ~$3/day = ~$90/mo (at the ~$0.003/call rate used in use case B)
- Medical advisory board honorarium: ~$50K/year
- All-in: ~$57K/year vs a single missed-emergency malpractice settlement of $1M-10M = trivially worth it
Compliance angle
- FDA PCCP: documented eval + monitoring plan for SaMD (Software as a Medical Device) post-clearance changes
- EU MDR + AI Act (high-risk: medical): pre-rollout impact assessment + post-market surveillance
- HIPAA: eval cases are de-identified per Safe Harbor; eval-run audit log is itself ePHI-handling-compliant
Use case E — Multi-tenant LLM migration eval (SaaS with N clients, e.g. PCF / NewLife / Ilham / BBL pattern)
Problem
A multi-tenant SaaS platform serves N enterprise clients on a shared infrastructure but with per-client configuration (different system prompts, different RAG corpora, different domain vocabulary, different regulatory regimes). When the platform upgrades the underlying LLM (e.g. Haiku 4.5 → Sonnet 4.6 for cost-quality lift), each client’s experience changes independently — some improve, some regress, depending on how their config interacts with the new model. Without per-client eval, a blanket migration silently regresses 1-2 clients → support escalation → renewal risk → emergency rollback. The PM needs evidence: “we evaluated the migration per-client and only N/M clients meet our regression bar; we will not migrate the remaining clients until we re-tune their prompts.”
Internal precedent: this matches the LL multi-tenant pattern recorded in ll_multitenant_requirements: PCF / NewLife / Ilham / BBL require the same feature shape but with per-client config; never an if-else override, always config/strategy/flag.
Persona
Platform PM (the migrator). Per-client AM / CSM (the relationship owner). Each client’s internal stakeholder (the consumer). Engineering owns the rollout mechanics.
Why eval matters
- Per-client regression rate: prove for each tenant that the new model is non-regressive on their golden set before flipping their config
- Per-client prompt re-tune evidence: when migration shows regression, eval the prompt rewrite candidates and pick the one that recovers parity
- Phased rollout governance: weekly cohort of “next clients to migrate” picked based on eval evidence, not gut
What changes from personal version
- Per-tenant eval set: 200-500 cases per client, mined from their conversation history + edge cases their CSM has filed. ~10-50 clients = ~2K-25K total cases.
- Per-tenant pass criterion: each client has a configured “must not regress more than X pp on metric Y”; CLI gate refuses to flip their config flag unless eval evidence shows compliance.
- Shared cross-tenant regression suite: a separate “common” eval set that all clients share, to catch model-wide regressions that no single-client eval would surface.
- Migration cohort UI: each week the PM sees a dashboard — “5 clients passed migration eval, 2 failed, 3 pending” — and decides next cohort.
- Per-tenant prompt re-tune workflow: for failing clients, eval candidate prompt rewrites (`v2.1.PCF`, `v2.1.NewLife`, …) before re-running migration eval.
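The per-tenant pass criterion and the passed / failed / pending view described above might be driven by configuration along these lines. Tenant names, thresholds, and the dictionary layout are hypothetical; the point is that the gate reads per-tenant config rather than branching on tenant identity.

```python
def regression_pp(baseline: float, candidate: float) -> float:
    """Drop from baseline to candidate, in percentage points."""
    return (baseline - candidate) * 100.0

def migration_cohort(criteria, baseline_scores, candidate_scores):
    """Sort tenants into passed / failed / pending for the next migration cohort.

    criteria: {tenant: {"metric": str, "max_regression_pp": float}}
    *_scores: {tenant: {metric: score in [0, 1]}}
    """
    result = {"passed": [], "failed": [], "pending": []}
    for tenant, criterion in criteria.items():
        if tenant not in candidate_scores:
            result["pending"].append(tenant)  # migration eval has not run yet
            continue
        metric = criterion["metric"]
        drop = regression_pp(baseline_scores[tenant][metric],
                             candidate_scores[tenant][metric])
        bucket = "passed" if drop <= criterion["max_regression_pp"] else "failed"
        result[bucket].append(tenant)
    return result
```

A tenant lands in `failed` only against its own configured bar, so a 1pp drop can block one client while being acceptable for another.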
Stack mapping
| Eval-Framework primitive | Enterprise mapping |
|---|---|
| Task YAML | Per-feature × per-client overlay (base task + client config overlay) |
| Stratified eval set | (client × intent × language × complexity) per tenant |
| LLM-judge scorer | Per-client SME rubric; CSM signs off |
| Cohen’s κ | Cross-client judge agreement — detect when a judge is biased toward one client’s style |
| Frozen holdout | Per-client holdout; client’s compliance owner signs the release |
| Drift cron | Per-client weekly; alert routed to CSM + platform PM |
| Migration cohort gate | Custom CLI subcommand `eval-framework migrate --cohort <week>` blocks deploys without passing eval per tenant |
Cost estimate (mid-market SaaS, 30 enterprise tenants)
- Eval cases: 30 tenants × 300 cases avg = 9K + 1K shared = 10K
- Migration bake-off: 3 candidate models × 10K cases + dual-judge = ~$300/migration × 4 migrations/year = $1.2K/year
- Weekly per-client drift: 30 clients × 100 cases × 1 judge = 3K calls/week = ~$10/week = $40/mo
- Per-tenant prompt re-tune eval: ~$50/client/migration when regression hits ~30% of tenants = ~$1.5K/year
- All-in: ~$3K/year for migration infrastructure vs (a single emergency rollback + 1 lost renewal = $200K-1M = 70-300× ROI)
Compliance angle
- Per-client SOC2 reports: eval audit trail proves change-management controls per tenant
- Contract SLAs: many enterprise contracts include “non-regression of metric Y by more than X pp” clauses; eval is the enforcement mechanism
- Data residency: per-client holdouts may need to live in their region (EU client → EU storage); eval runs region-pinned
Cross-cutting patterns
These appear in 3+ use cases above and form a second-tier reusable layer beyond the personal Eval-Framework foundation:
- Per-segment / per-tenant eval orchestration: stratify + parallelise + report per-segment pass rates; gate deploys on the worst-segment regression
- Cost-asymmetric scoring: LLM-judge rubrics that encode business-risk asymmetry (false-positive vs false-negative cost) → reported accuracy reflects business impact, not raw correctness
- Dual-judge with escalation: 2 independent LLM judges per case; disagreement triggers human SME review and feeds active learning
- Air-gapped holdout governance: dual-control release ceremony; signed by engineering + compliance; rotation tied to regulatory cadence
- Compliance audit-trail export: per-run signed hash, exportable as auditor-consumable artifact (PDF + JSON), retained in WORM storage
- CI/CD migration gate: `eval-framework migrate --cohort` blocks deploys until per-tenant eval evidence passes; replaces “PM approves via Jira” with deterministic gate
- Real-time streaming eval: production traffic sampled → LLM-judge in stream → SLO dashboard + alerting; catches drift in hours, not weeks
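For the compliance audit-trail pattern, a per-run signed hash could be produced as below. This sketch assumes an HMAC-SHA256 over a canonical JSON encoding of the run record, with key management out of scope; regulated deployments may instead require asymmetric signatures, and the record fields shown are hypothetical.

```python
import hashlib
import hmac
import json

def canonical_bytes(run_record: dict) -> bytes:
    """Stable JSON encoding so the same record always hashes identically."""
    return json.dumps(run_record, sort_keys=True, separators=(",", ":")).encode("utf-8")

def sign_run(run_record: dict, key: bytes) -> dict:
    """Return a content hash plus a keyed signature for one eval run."""
    payload = canonical_bytes(run_record)
    return {
        "sha256": hashlib.sha256(payload).hexdigest(),
        "hmac_sha256": hmac.new(key, payload, hashlib.sha256).hexdigest(),
    }

def verify_run(run_record: dict, key: bytes, signature: dict) -> bool:
    """Recompute the signature and compare in constant time."""
    expected = sign_run(run_record, key)["hmac_sha256"]
    return hmac.compare_digest(expected, signature["hmac_sha256"])

# Hypothetical run record; real records would carry whatever the auditor requires.
record = {"run_id": "2025-01-holdout", "model": "candidate-a", "accuracy": 0.91}
signature = sign_run(record, key=b"demo-key-not-for-production")
```

The record and its signature would then be written together to WORM storage, so an auditor holding the key can confirm that a reported number matches the run that produced it.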
Building these once = ~8 weeks engineering. Then each new vertical = 2-4 weeks to launch instead of 8-12 weeks.
Go-to-market thinking
The architecture supports 3 plausible business models, each with different pricing / positioning:
| Model | Target | Pricing | Sales motion |
|---|---|---|---|
| B2B SaaS — AI eval platform | AI vendor teams (chatbot, fintech, edtech, healthtech) | Per-eval-run usage + platform seat fee | PLG signup → trial → upgrade. AE for regulated industries. |
| Compliance-evidence add-on | Regulated AI deployments (fintech/healthtech/edtech) | Per-eval-evidence-export + retained-audit-log fee | Enterprise direct sales, 6-month cycles |
| Open-source + managed | Devs / smaller vendors | Free OSS + managed cloud $X/mo | Inbound from GitHub stars; convert to managed for ops cost relief |
The B2B SaaS — AI eval platform model has the cleanest scaling story: eval volume grows with the vendor’s product success, so revenue is naturally aligned with customer outcomes. The compliance-evidence add-on is highest revenue per deal but requires domain expertise and SME networks per vertical. OSS is brand-building but slowest revenue.
What’s NOT in the personal version that enterprise needs
Realistic gap list — items that are zero-effort in personal version but real engineering investment for enterprise:
| Gap | Effort | Priority |
|---|---|---|
| Multi-tenant isolation (per-customer eval sets, per-customer compute quota) | 3-4 weeks | P0 |
| SSO / SAML for eval platform | 2 weeks | P0 |
| Air-gapped holdout storage + dual-control release | 2 weeks | P0 (regulated industries) |
| Audit-trail export (signed PDF + JSON, WORM retention) | 2-3 weeks | P0 (regulated industries) |
| Real-time streaming eval on production traffic | 4-6 weeks | P1 |
| CI/CD gate (block deploy on regression) | 1-2 weeks | P1 |
| Cost-asymmetric scorer DSL | 2 weeks | P1 |
| SME-rubric authoring UI + version control | 3-4 weeks | P1 |
| Dual-judge with escalation workflow | 2 weeks | P1 |
| Active-learning loop (human label → next eval set) | 4 weeks | P2 |
| Multi-region deployment + data residency | 2-3 weeks | P2 (EU / regulated clients) |
| Vendor LLM ops platform integrations (LangSmith / W&B / Braintrust) | 1-2 weeks each | P2 |
| SLO dashboard + PagerDuty integration | 2 weeks | P2 |
Total to enterprise-ready MVP: ~3-4 months of 2 engineers + 1 month design + ~$10K compliance audit prep.
See also
- Architecture — the unchanged 5-component methodology that scales across all use cases
- Implementation — the code that ships personal version; enterprise version extends adapters + adds governance, doesn’t rewrite the core
- PRD — original problem framing; enterprise framing is a superset
- Notes — 7 PM-bias catches that apply identically at enterprise scale (tune-before-verify, miss-baseline, refactor-early, bad-ROI-scale, author-context, PII-leak, blanket-generalize)