AI Evaluation Framework
How to measure whether your AI system is actually working: the metrics, methods, and mindset for continuous quality assessment.
Why Evaluation Matters More Than You Think
The difference between an AI demo and an AI product is evaluation. Demos work when you cherry-pick inputs. Products work when users throw anything at them.
Most teams skip evaluation because it’s hard and unglamorous. Then they wonder why production quality is inconsistent. This playbook gives you a practical framework to avoid that fate.
The Three Layers of Evaluation
Evaluation isn’t one thing — it’s three distinct activities that serve different purposes:
```
┌─────────────────────────────────────────────────────┐
│ Layer 1: Development Evaluation                     │
│ "Does my change improve things?"                    │
│ • Fast feedback loop (seconds)                      │
│ • Small, curated test sets                          │
│ • Run on every commit                               │
└─────────────────────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────┐
│ Layer 2: Pre-Production Evaluation                  │
│ "Is this safe to ship?"                             │
│ • Comprehensive test suites                         │
│ • Edge cases and adversarial inputs                 │
│ • Run before deployment                             │
└─────────────────────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────┐
│ Layer 3: Production Monitoring                      │
│ "Is it still working?"                              │
│ • Real user traffic                                 │
│ • Continuous quality sampling                       │
│ • Drift and regression detection                    │
└─────────────────────────────────────────────────────┘
```
Building Your Evaluation Dataset
Good evaluation requires good data. Here’s how to build it:
Start with Golden Examples
Create 50-100 examples where you know the correct answer. These should cover:
- Happy path cases — Common, straightforward inputs
- Edge cases — Unusual but valid inputs
- Failure cases — Inputs that should be rejected or flagged
- Adversarial cases — Attempts to break or manipulate the system
```json
{
  "id": "invoice-001",
  "input": {
    "document": "Invoice from Acme Corp...",
    "task": "extract_fields"
  },
  "expected_output": {
    "vendor": "Acme Corp",
    "amount": 1500.00,
    "date": "2026-01-15"
  },
  "category": "happy_path",
  "difficulty": "easy"
}
```
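Cases in this shape can be driven by a minimal harness (a sketch; `run_system` is a stand-in for whatever entry point your pipeline exposes, and the comparison here is strict equality):

```python
import json

def load_golden(path):
    """Load golden cases from a JSON-lines file, one case per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def run_golden_suite(cases, run_system):
    """Run every golden case through the system and report the pass rate.
    run_system(input_dict) -> output_dict is a stand-in for your pipeline."""
    passed = 0
    failures = []
    for case in cases:
        actual = run_system(case["input"])
        if actual == case["expected_output"]:
            passed += 1
        else:
            failures.append(case["id"])
    return passed / len(cases), failures
```

For fuzzy outputs, swap the strict equality for a field-aware comparison (tolerances on amounts, normalized dates).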
Grow from Production
Your golden set should evolve with production usage:
- Log everything — Every input/output pair
- Sample for review — Random sample + low-confidence cases
- Human label — Have humans judge quality
- Promote to golden — Add reviewed cases to your test set
Cadence: Review 20-50 cases weekly. Your evaluation set should grow by 10-20% monthly.
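The sample-and-promote loop above might look like this (a sketch; field names such as `confidence`, `human_verdict`, and `corrected_output` are assumptions, not prescriptions):

```python
import json
import random

def sample_for_review(logged_cases, confidence_threshold=0.7, sample_rate=0.05):
    """Pick production cases for human review: every low-confidence output
    plus a random slice of everything else."""
    return [case for case in logged_cases
            if case["confidence"] < confidence_threshold
            or random.random() < sample_rate]

def promote_to_golden(reviewed_cases, golden_path="golden.jsonl"):
    """Append human-approved cases to the golden set (JSONL, one per line)."""
    approved = [c for c in reviewed_cases
                if c.get("human_verdict") == "approved"]
    with open(golden_path, "a") as f:
        for case in approved:
            f.write(json.dumps({
                "input": case["input"],
                "expected_output": case["corrected_output"],
                "category": case.get("category", "production"),
            }) + "\n")
    return len(approved)
```

Appending to a JSONL file keeps each promoted case a self-contained record, which makes the monthly rotation of cases in and out straightforward.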
Metrics That Matter
Accuracy Metrics
Task completion rate: Did the system produce a valid output?
```python
def task_completion_rate(results):
    completed = sum(1 for r in results if r.output is not None)
    return completed / len(results)
```
Correctness rate: Of completed tasks, how many were correct?
```python
def correctness_rate(results, ground_truth):
    completed = [(r, gt) for r, gt in zip(results, ground_truth)
                 if r.output is not None]
    correct = sum(1 for r, gt in completed if matches(r.output, gt))
    # Denominator is completed tasks only, per the definition above
    return correct / len(completed)
```
Field-level accuracy: For structured outputs, measure per-field.
| Field | Accuracy | Notes |
|---|---|---|
| vendor | 98.5% | High confidence |
| amount | 94.2% | Number parsing issues |
| date | 89.1% | Format variations |
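Per-field numbers like these can be computed with a small helper (a sketch; it assumes predictions and ground truth are parallel lists of flat dicts):

```python
from collections import defaultdict

def field_level_accuracy(predictions, ground_truth):
    """Per-field accuracy for structured outputs. Both arguments are
    parallel lists of dicts sharing keys (e.g. vendor, amount, date)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gt in zip(predictions, ground_truth):
        for field, expected in gt.items():
            total[field] += 1
            if pred.get(field) == expected:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}
```

Exact equality is the simplest comparator; amounts and dates usually want a tolerance or normalization step instead.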
Quality Metrics
Format compliance: Does output match expected structure?
Hallucination rate: How often does the system make things up?
Consistency: Same input → same output?
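Consistency can be spot-checked by replaying the same input several times (a sketch; `model_fn` stands in for your system, and outputs must be hashable, so serialize structured outputs to strings first):

```python
def consistency_rate(model_fn, test_input, runs=5):
    """Fraction of repeated calls that agree with the modal output.
    1.0 means fully deterministic behavior on this input."""
    outputs = [model_fn(test_input) for _ in range(runs)]
    modal = max(set(outputs), key=outputs.count)
    return outputs.count(modal) / runs
```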
Operational Metrics
Latency: p50, p95, p99 response times
Cost per request: Total API spend / request count
Error rate: Rate of exceptions, timeouts, retries
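The latency percentiles can be computed from raw timings with a nearest-rank sketch (standard library only; no external dependencies assumed):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank - 1, 0)]

def latency_summary(latencies_ms):
    """p50/p95/p99 summary for a list of request latencies."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Nearest-rank is deliberately simple; interpolating variants give smoother values on small samples.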
Automated Evaluation Methods
Assertion-Based Testing
For structured outputs, write assertions:
```python
def test_invoice_extraction(output, expected):
    assert output.vendor == expected.vendor
    assert abs(output.amount - expected.amount) < 0.01
    assert parse_date(output.date) == parse_date(expected.date)
```
LLM-as-Judge
Use a separate LLM to evaluate quality:
JUDGE_PROMPT = """
You are evaluating the quality of an AI response.
Original query: {query}
AI response: {response}
Reference answer: {reference}
Rate the response on:
1. Accuracy (1-5): Does it contain correct information?
2. Completeness (1-5): Does it address the full query?
3. Clarity (1-5): Is it well-structured and clear?
Output JSON with scores and brief justification.
"""
Caveats:
- LLM judges have biases (prefer verbose responses, etc.)
- Use a different model than the one being evaluated
- Calibrate judge scores against human labels
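Calibration can start with a simple agreement check between judge and human scores on the same cases (a sketch using Pearson correlation; it assumes both score lists have nonzero variance):

```python
def judge_human_agreement(judge_scores, human_scores):
    """Pearson correlation between judge and human scores on the same
    cases. Near 1.0 means the judge tracks human judgment; near 0 means
    its scores carry little signal."""
    n = len(judge_scores)
    mean_j = sum(judge_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((j - mean_j) * (h - mean_h)
              for j, h in zip(judge_scores, human_scores))
    sd_j = sum((j - mean_j) ** 2 for j in judge_scores) ** 0.5
    sd_h = sum((h - mean_h) ** 2 for h in human_scores) ** 0.5
    return cov / (sd_j * sd_h)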
Embedding Similarity
For open-ended responses, compare semantic similarity:
```python
def semantic_similarity(response, reference):
    # embed() and cosine_similarity() are assumed helpers: any embedding
    # model, and the standard cosine between the two resulting vectors
    response_embedding = embed(response)
    reference_embedding = embed(reference)
    return cosine_similarity(response_embedding, reference_embedding)
```
Regression Detection
Baseline Snapshots
Before any change, capture current performance:
```yaml
baseline_2026_01_15:
  task_completion: 0.95
  correctness: 0.89
  latency_p50: 1.2s
  cost_per_request: $0.003
```
Continuous Comparison
After changes, compare against baseline:
```python
def detect_regression(current, baseline, threshold=0.02):
    """Flag metrics that dropped relative to baseline. Assumes
    higher-is-better metrics (completion, correctness); track latency
    and cost separately, where a rise is the regression."""
    regressions = []
    for metric in current:
        delta = baseline[metric] - current[metric]
        if delta > threshold:
            regressions.append({
                "metric": metric,
                "baseline": baseline[metric],
                "current": current[metric],
                "delta": delta,
            })
    return regressions
```
Alert Thresholds
Define what constitutes a problem:
| Metric | Warning | Critical |
|---|---|---|
| Correctness | -2% | -5% |
| Latency p95 | +20% | +50% |
| Error rate | +1% | +3% |
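The table above can be encoded as a small classifier (a sketch; I assume the correctness and error-rate thresholds are absolute percentage-point deltas and the latency thresholds are relative changes):

```python
def classify_change(metric, baseline, current):
    """Map a metric change to 'ok' | 'warning' | 'critical' per the
    alert-threshold table."""
    if metric == "correctness":
        delta = current - baseline                  # point drop is bad
        checks = [("critical", delta <= -0.05), ("warning", delta <= -0.02)]
    elif metric == "latency_p95":
        delta = (current - baseline) / baseline     # relative rise is bad
        checks = [("critical", delta >= 0.50), ("warning", delta >= 0.20)]
    elif metric == "error_rate":
        delta = current - baseline                  # point rise is bad
        checks = [("critical", delta >= 0.03), ("warning", delta >= 0.01)]
    else:
        raise ValueError(f"no thresholds defined for {metric}")
    for level, breached in checks:
        if breached:
            return level
    return "ok"
```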
Production Monitoring
Sampling Strategy
You can’t evaluate everything. Sample strategically:
```python
import random

def should_evaluate(request, response):
    # Always evaluate low-confidence outputs
    if response.confidence < 0.7:
        return True
    # Always evaluate new input types
    if is_novel_input(request):
        return True
    # Random sample of normal traffic (1%)
    if random.random() < 0.01:
        return True
    return False
```
Quality Dashboards
Track trends over time:
- Daily quality scores
- Weekly regression reports
- Monthly deep-dive analysis
Feedback Loops
Connect evaluation to improvement:
User feedback → Low score → Human review → Label → Training data
Common Evaluation Pitfalls
Overfitting to Test Set
If you only optimize for your evaluation set, you’ll overfit. Periodically:
- Add new cases from production
- Rotate cases in/out
- Use held-out test sets for final decisions
Ignoring Edge Cases
Edge cases are where production problems live. Ensure your evaluation set includes:
- Very short inputs
- Very long inputs
- Non-English characters
- Malformed data
- Adversarial prompts
Vanity Metrics
High accuracy on easy cases is meaningless. Segment your metrics:
- By difficulty level
- By input category
- By user segment
Manual-Only Evaluation
If evaluation requires humans for every assessment, you won't do it often enough. Automate as much as possible and reserve humans for:
- Labeling new golden examples
- Adjudicating ambiguous cases
- Calibrating automated judges
Implementation Checklist
Before claiming your evaluation framework is ready:
- Golden dataset with 50+ examples
- Automated test suite running on commits
- Pre-production evaluation gate
- Production monitoring with sampling
- Regression detection with alerts
- Human review queue for edge cases
- Monthly evaluation dataset refresh
Further Reading
- AI Architecture Patterns — How to structure systems for evaluability
- Production Readiness Checklist — Broader checklist including evaluation