AI Evaluation Framework
How to measure whether your AI system is actually working: the metrics, methods, and mindset for continuous quality assessment.
Why Evaluation Matters More Than You Think
The difference between an AI demo and an AI product is evaluation. Demos work when you cherry-pick inputs. Products work when users throw anything at them.
Most teams skip evaluation because it’s hard and unglamorous. Then they wonder why production quality is inconsistent. This playbook gives you a practical framework to avoid that fate.
The Three Layers of Evaluation
Evaluation isn’t one thing — it’s three distinct activities that serve different purposes:
```
┌─────────────────────────────────────────────────────┐
│ Layer 1: Development Evaluation                     │
│ "Does my change improve things?"                    │
│ • Fast feedback loop (seconds)                      │
│ • Small, curated test sets                          │
│ • Run on every commit                               │
└─────────────────────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────┐
│ Layer 2: Pre-Production Evaluation                  │
│ "Is this safe to ship?"                             │
│ • Comprehensive test suites                         │
│ • Edge cases and adversarial inputs                 │
│ • Run before deployment                             │
└─────────────────────────────────────────────────────┘
                           ▼
┌─────────────────────────────────────────────────────┐
│ Layer 3: Production Monitoring                      │
│ "Is it still working?"                              │
│ • Real user traffic                                 │
│ • Continuous quality sampling                       │
│ • Drift and regression detection                    │
└─────────────────────────────────────────────────────┘
```
Building Your Evaluation Dataset
Good evaluation requires good data. Here’s how to build it:
Start with Golden Examples
Create 50-100 examples where you know the correct answer. These should cover:
- Happy path cases — Common, straightforward inputs
- Edge cases — Unusual but valid inputs
- Failure cases — Inputs that should be rejected or flagged
- Adversarial cases — Attempts to break or manipulate the system
```json
{
  "id": "invoice-001",
  "input": {
    "document": "Invoice from Acme Corp...",
    "task": "extract_fields"
  },
  "expected_output": {
    "vendor": "Acme Corp",
    "amount": 1500.00,
    "date": "2026-01-15"
  },
  "category": "happy_path",
  "difficulty": "easy"
}
```
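Cases in this shape can be driven by a minimal harness (a sketch; `run_system` is a stand-in for whatever entry point your pipeline exposes, and the comparison here is strict equality):

```python
import json

def load_golden(path):
    """Load golden cases from a JSON-lines file, one case per line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def run_golden_suite(cases, run_system):
    """Run every golden case through the system and report the pass rate.
    run_system(input_dict) -> output_dict is a stand-in for your pipeline."""
    passed = 0
    failures = []
    for case in cases:
        actual = run_system(case["input"])
        if actual == case["expected_output"]:
            passed += 1
        else:
            failures.append(case["id"])
    return passed / len(cases), failures
```

For fuzzy outputs, swap the strict equality for a field-aware comparison (tolerances on amounts, normalized dates).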
Grow from Production
Your golden set should evolve with production usage:
- Log everything — Every input/output pair
- Sample for review — Random sample + low-confidence cases
- Human label — Have humans judge quality
- Promote to golden — Add reviewed cases to your test set
Cadence: Review 20-50 cases weekly. Your evaluation set should grow by 10-20% monthly.
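The sample-and-promote loop above might look like this (a sketch; field names such as `confidence`, `human_verdict`, and `corrected_output` are assumptions, not prescriptions):

```python
import json
import random

def sample_for_review(logged_cases, confidence_threshold=0.7, sample_rate=0.05):
    """Pick production cases for human review: every low-confidence output
    plus a random slice of everything else."""
    return [case for case in logged_cases
            if case["confidence"] < confidence_threshold
            or random.random() < sample_rate]

def promote_to_golden(reviewed_cases, golden_path="golden.jsonl"):
    """Append human-approved cases to the golden set (JSONL, one per line)."""
    approved = [c for c in reviewed_cases
                if c.get("human_verdict") == "approved"]
    with open(golden_path, "a") as f:
        for case in approved:
            f.write(json.dumps({
                "input": case["input"],
                "expected_output": case["corrected_output"],
                "category": case.get("category", "production"),
            }) + "\n")
    return len(approved)
```

Appending to a JSONL file keeps each promoted case a self-contained record, which makes the monthly rotation of cases in and out straightforward.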
Metrics That Matter
Accuracy Metrics
Task completion rate: Did the system produce a valid output?
```python
def task_completion_rate(results):
    completed = sum(1 for r in results if r.output is not None)
    return completed / len(results)
```
Correctness rate: Of completed tasks, how many were correct?
```python
def correctness_rate(results, ground_truth):
    completed = [(r, gt) for r, gt in zip(results, ground_truth)
                 if r.output is not None]
    correct = sum(1 for r, gt in completed if matches(r.output, gt))
    # Denominator is completed tasks only, per the definition above
    return correct / len(completed)
```
Field-level accuracy: For structured outputs, measure per-field.
| Field | Accuracy | Notes |
|---|---|---|
| vendor | 98.5% | High confidence |
| amount | 94.2% | Number parsing issues |
| date | 89.1% | Format variations |
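Per-field numbers like these can be computed with a small helper (a sketch; it assumes predictions and ground truth are parallel lists of flat dicts):

```python
from collections import defaultdict

def field_level_accuracy(predictions, ground_truth):
    """Per-field accuracy for structured outputs. Both arguments are
    parallel lists of dicts sharing keys (e.g. vendor, amount, date)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, gt in zip(predictions, ground_truth):
        for field, expected in gt.items():
            total[field] += 1
            if pred.get(field) == expected:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}
```

Exact equality is the simplest comparator; amounts and dates usually want a tolerance or normalization step instead.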
Quality Metrics
Format compliance: Does output match expected structure?
Hallucination rate: How often does the system make things up?
Consistency: Same input → same output?
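Consistency can be spot-checked by replaying the same input several times (a sketch; `model_fn` stands in for your system, and outputs must be hashable, so serialize structured outputs to strings first):

```python
def consistency_rate(model_fn, test_input, runs=5):
    """Fraction of repeated calls that agree with the modal output.
    1.0 means fully deterministic behavior on this input."""
    outputs = [model_fn(test_input) for _ in range(runs)]
    modal = max(set(outputs), key=outputs.count)
    return outputs.count(modal) / runs
```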
Operational Metrics
Latency: p50, p95, p99 response times
Cost per request: Total API spend / request count
Error rate: Rate of exceptions, timeouts, retries
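The latency percentiles can be computed from raw timings with a nearest-rank sketch (standard library only; no external dependencies assumed):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank - 1, 0)]

def latency_summary(latencies_ms):
    """p50/p95/p99 summary for a list of request latencies."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Nearest-rank is deliberately simple; interpolating variants give smoother values on small samples.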
Automated Evaluation Methods
Assertion-Based Testing
For structured outputs, write assertions:
```python
def test_invoice_extraction(output, expected):
    assert output.vendor == expected.vendor
    assert abs(output.amount - expected.amount) < 0.01
    assert parse_date(output.date) == parse_date(expected.date)
```
LLM-as-Judge
Use a separate LLM to evaluate quality:
JUDGE_PROMPT = """
You are evaluating the quality of an AI response.
Original query: {query}
AI response: {response}
Reference answer: {reference}
Rate the response on:
1. Accuracy (1-5): Does it contain correct information?
2. Completeness (1-5): Does it address the full query?
3. Clarity (1-5): Is it well-structured and clear?
Output JSON with scores and brief justification.
"""
Caveats:
- LLM judges have biases (prefer verbose responses, etc.)
- Use a different model than the one being evaluated
- Calibrate judge scores against human labels
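Calibration can start with a simple agreement check between judge and human scores on the same cases (a sketch using Pearson correlation; it assumes both score lists have nonzero variance):

```python
def judge_human_agreement(judge_scores, human_scores):
    """Pearson correlation between judge and human scores on the same
    cases. Near 1.0 means the judge tracks human judgment; near 0 means
    its scores carry little signal."""
    n = len(judge_scores)
    mean_j = sum(judge_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((j - mean_j) * (h - mean_h)
              for j, h in zip(judge_scores, human_scores))
    sd_j = sum((j - mean_j) ** 2 for j in judge_scores) ** 0.5
    sd_h = sum((h - mean_h) ** 2 for h in human_scores) ** 0.5
    return cov / (sd_j * sd_h)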
Embedding Similarity
For open-ended responses, compare semantic similarity:
```python
def semantic_similarity(response, reference):
    # embed() and cosine_similarity() are assumed helpers: any embedding
    # model, and the standard cosine between the two resulting vectors
    response_embedding = embed(response)
    reference_embedding = embed(reference)
    return cosine_similarity(response_embedding, reference_embedding)
```
Regression Detection
Baseline Snapshots
Before any change, capture current performance:
```yaml
baseline_2026_01_15:
  task_completion: 0.95
  correctness: 0.89
  latency_p50: 1.2s
  cost_per_request: $0.003
```
Continuous Comparison
After changes, compare against baseline:
```python
def detect_regression(current, baseline, threshold=0.02):
    """Flag metrics that dropped relative to baseline. Assumes
    higher-is-better metrics (completion, correctness); track latency
    and cost separately, where a rise is the regression."""
    regressions = []
    for metric in current:
        delta = baseline[metric] - current[metric]
        if delta > threshold:
            regressions.append({
                "metric": metric,
                "baseline": baseline[metric],
                "current": current[metric],
                "delta": delta,
            })
    return regressions
```
Alert Thresholds
Define what constitutes a problem:
| Metric | Warning | Critical |
|---|---|---|
| Correctness | -2% | -5% |
| Latency p95 | +20% | +50% |
| Error rate | +1% | +3% |
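The table above can be encoded as a small classifier (a sketch; I assume the correctness and error-rate thresholds are absolute percentage-point deltas and the latency thresholds are relative changes):

```python
def classify_change(metric, baseline, current):
    """Map a metric change to 'ok' | 'warning' | 'critical' per the
    alert-threshold table."""
    if metric == "correctness":
        delta = current - baseline                  # point drop is bad
        checks = [("critical", delta <= -0.05), ("warning", delta <= -0.02)]
    elif metric == "latency_p95":
        delta = (current - baseline) / baseline     # relative rise is bad
        checks = [("critical", delta >= 0.50), ("warning", delta >= 0.20)]
    elif metric == "error_rate":
        delta = current - baseline                  # point rise is bad
        checks = [("critical", delta >= 0.03), ("warning", delta >= 0.01)]
    else:
        raise ValueError(f"no thresholds defined for {metric}")
    for level, breached in checks:
        if breached:
            return level
    return "ok"
```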
Production Monitoring
Sampling Strategy
You can’t evaluate everything. Sample strategically:
```python
import random

def should_evaluate(request, response):
    # Always evaluate low-confidence outputs
    if response.confidence < 0.7:
        return True
    # Always evaluate new input types
    if is_novel_input(request):
        return True
    # Random sample of normal traffic (1%)
    if random.random() < 0.01:
        return True
    return False
```
Quality Dashboards
Track trends over time:
- Daily quality scores
- Weekly regression reports
- Monthly deep-dive analysis
Feedback Loops
Connect evaluation to improvement:
User feedback → Low score → Human review → Label → Training data
Common Evaluation Pitfalls
Overfitting to Test Set
If you only optimize for your evaluation set, you’ll overfit. Periodically:
- Add new cases from production
- Rotate cases in/out
- Use held-out test sets for final decisions
Ignoring Edge Cases
Edge cases are where production problems live. Ensure your evaluation set includes:
- Very short inputs
- Very long inputs
- Non-English characters
- Malformed data
- Adversarial prompts
Vanity Metrics
High accuracy on easy cases is meaningless. Segment your metrics:
- By difficulty level
- By input category
- By user segment
Manual-Only Evaluation
If evaluation requires humans for every assessment, you won't do it often enough. Automate as much as possible and reserve humans for:
- Labeling new golden examples
- Adjudicating ambiguous cases
- Calibrating automated judges
Implementation Checklist
Before claiming your evaluation framework is ready:
- Golden dataset with 50+ examples
- Automated test suite running on commits
- Pre-production evaluation gate
- Production monitoring with sampling
- Regression detection with alerts
- Human review queue for edge cases
- Monthly evaluation dataset refresh
Further Reading
- AI Architecture Patterns — How to structure systems for evaluability
- Production Readiness Checklist — Broader checklist including evaluation