AI Architecture Patterns That Actually Ship
The patterns we use to structure AI systems that survive production. Not theory — these are the patterns from real projects.
The Problem with Most AI Architecture
Most AI projects fail not because the model doesn’t work, but because everything around it doesn’t work. The architecture is an afterthought, bolted onto existing systems without considering the unique constraints of AI workloads.
This playbook covers the patterns we use repeatedly on production AI systems. They’re not academic — they come from shipping systems that handle real traffic.
Pattern 1: The Reliability Sandwich
Problem: LLM calls are slow and expensive, and they occasionally fail. You can’t treat them like regular API calls.
Solution: Wrap every LLM interaction in three layers:
┌─────────────────────────────────────┐
│ Input Validation                    │ ← Catch garbage before it costs you
├─────────────────────────────────────┤
│ Caching Layer                       │ ← Don't pay twice for the same answer
├─────────────────────────────────────┤
│ LLM Call                            │ ← The expensive part
├─────────────────────────────────────┤
│ Output Validation                   │ ← Ensure structure before using
├─────────────────────────────────────┤
│ Fallback Handler                    │ ← Graceful degradation
└─────────────────────────────────────┘
Input Validation:
- Check token counts before sending
- Validate required context is present
- Sanitize inputs that could cause injection
Caching:
- Hash the semantic content, not exact strings
- Set TTLs based on content freshness requirements
- Use embedding similarity for near-match caching
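The caching principles above can be sketched as a small in-memory cache: exact hits via a normalized hash, near hits via cosine similarity over caller-supplied embeddings. This is an illustrative toy, not a production cache; class and method names are assumptions.

```python
import hashlib
import math
import time

class SemanticCache:
    """Toy near-match cache: exact hits by normalized hash, near hits by
    cosine similarity over caller-supplied embeddings. Illustrative only."""

    def __init__(self, ttl_seconds=3600, similarity_threshold=0.95):
        self.ttl = ttl_seconds
        self.threshold = similarity_threshold
        self.entries = {}  # key -> (embedding, answer, stored_at)

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalize whitespace/case so trivially different strings collide
        canonical = " ".join(prompt.lower().split())
        return hashlib.sha256(canonical.encode()).hexdigest()

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, prompt: str, embedding):
        now = time.time()
        hit = self.entries.get(self._key(prompt))
        if hit and now - hit[2] < self.ttl:
            return hit[1]  # exact (normalized) match
        for emb, answer, stored_at in self.entries.values():
            if now - stored_at < self.ttl and self._cosine(embedding, emb) >= self.threshold:
                return answer  # near match within TTL
        return None

    def put(self, prompt: str, embedding, answer):
        self.entries[self._key(prompt)] = (embedding, answer, time.time())
```

In a real system the embeddings would come from your embedding model and the store would be shared (e.g. Redis); the TTL knob is where content freshness requirements plug in.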
Output Validation:
- Parse structured outputs immediately
- Validate against expected schema
- Check for hallucination signals
Fallback:
- Return cached stale data if available
- Fall back to simpler models
- Return honest “I don’t know” responses
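The five layers compose into one wrapper. A minimal sketch, assuming all collaborators (`call_llm`, `cache`, the validators, `fallback`) are caller-supplied callables with hypothetical names, not a real library:

```python
def reliability_sandwich(prompt, call_llm, cache, validate_input, validate_output, fallback):
    """Run one LLM interaction through the five layers of the sandwich."""
    if not validate_input(prompt):           # layer 1: reject garbage early
        return fallback(prompt, reason="invalid_input")

    cached = cache.get(prompt)               # layer 2: don't pay twice
    if cached is not None:
        return cached

    try:
        raw = call_llm(prompt)               # layer 3: the expensive part
        result = validate_output(raw)        # layer 4: enforce structure
    except Exception:
        return fallback(prompt, reason="llm_error")

    if result is None:                       # layer 5: graceful degradation
        return fallback(prompt, reason="invalid_output")

    cache.put(prompt, result)
    return result
```

The fallback receives a reason so it can choose between stale cache, a simpler model, or an honest "I don't know".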
Pattern 2: Confidence-Gated Routing
Problem: Different queries have different complexity. Using GPT-4 for everything is expensive. Using GPT-3.5 for everything produces bad results for hard queries.
Solution: Route based on estimated difficulty.
def route_query(query: str, context: dict) -> str:
    # Fast classifier to estimate query complexity
    complexity = estimate_complexity(query, context)
    if complexity < 0.3:
        return "fast_model"      # GPT-3.5, Claude Haiku
    elif complexity < 0.7:
        return "standard_model"  # GPT-4, Claude Sonnet
    else:
        return "premium_model"   # GPT-4 Turbo, Claude Opus
Key insight: The complexity classifier can be simple. Even a rule-based system that checks query length, presence of technical terms, and user history works well. You don’t need ML for the router.
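A rule-based `estimate_complexity` along those lines can be sketched as below. The signal names and weights are illustrative assumptions; tune them against your own traffic.

```python
# Hypothetical domain vocabulary; replace with terms from your own domain.
TECHNICAL_TERMS = {"regression", "kubernetes", "derivative", "amortized"}

def estimate_complexity(query: str, context: dict) -> float:
    """Rule-based complexity score in [0, 1]. A sketch, not a trained model."""
    words = query.lower().split()
    score = 0.0
    score += min(len(words) / 50, 0.4)            # longer queries tend to be harder
    if any(w.strip("?.,") in TECHNICAL_TERMS for w in words):
        score += 0.3                              # domain vocabulary present
    if context.get("user_escalations", 0) > 0:
        score += 0.3                              # user history of hard cases
    return min(score, 1.0)
```

Even this crude score separates "hi" from a long jargon-heavy question, which is all the router needs.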
Metrics to track:
- Quality score by route (are fast models actually handling easy queries well?)
- Cost savings vs. always using premium
- Latency distribution by route
Pattern 3: The Evaluation Flywheel
Problem: You don’t know if your system is getting better or worse over time.
Solution: Build evaluation into the architecture from day one.
┌─────────────────────────────────────────────────────┐
│                 Production Traffic                  │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                    Logging Layer                    │
│  • Input/Output pairs                               │
│  • Latency, tokens, cost                            │
│  • User feedback signals                            │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                 Evaluation Pipeline                 │
│  • Automated quality scoring                        │
│  • Regression detection                             │
│  • A/B comparison                                   │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                   Action Triggers                   │
│  • Alert on quality drops                           │
│  • Auto-rollback bad prompts                        │
│  • Flag cases for human review                      │
└─────────────────────────────────────────────────────┘
What to log (always):
- Full request/response (redacted for PII)
- Model used, temperature, tokens
- Latency breakdown (network, inference, parsing)
- User ID for feedback correlation
What to evaluate:
- Factual accuracy (can be automated with assertions)
- Format compliance (did it follow instructions?)
- User satisfaction (implicit from behavior, explicit from feedback)
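The "what to log" list maps naturally onto one record per call. A minimal sketch; the field names are illustrative assumptions, not a prescribed schema:

```python
import json
import time
from dataclasses import dataclass, asdict, field
from typing import Optional

@dataclass
class LLMCallRecord:
    """One evaluation-ready log line per LLM call."""
    user_id: str               # for feedback correlation
    model: str
    temperature: float
    prompt_redacted: str       # PII stripped before logging
    response: str
    tokens_in: int
    tokens_out: int
    latency_ms: dict           # breakdown, e.g. network / inference / parsing
    cost_usd: float
    feedback: Optional[str] = None
    ts: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

Emitting one structured line per call is what makes the downstream evaluation pipeline (scoring, regression detection, A/B comparison) possible without instrumenting anything twice.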
Pattern 4: Context Window Management
Problem: Context windows are finite. RAG systems often stuff in irrelevant context. More context ≠ better results.
Solution: Treat context like a budget you spend carefully.
Context Budget: 8000 tokens
┌────────────────────┬────────────────┐
│ System Prompt      │   800 tokens   │ ← Keep minimal
├────────────────────┼────────────────┤
│ User Query         │   200 tokens   │
├────────────────────┼────────────────┤
│ Retrieved Context  │  4000 tokens   │ ← This is where most waste happens
├────────────────────┼────────────────┤
│ Examples           │  1500 tokens   │ ← Few-shot if needed
├────────────────────┼────────────────┤
│ Output Buffer      │  1500 tokens   │ ← Reserve for response
└────────────────────┴────────────────┘
Context selection principles:
- Relevance scoring: Use embedding similarity, but also recency and source quality
- Deduplication: Similar chunks waste tokens
- Compression: Summarize verbose context before including
- Hierarchical retrieval: Get summaries first, details only if needed
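The relevance, deduplication, and budget principles combine into a greedy context packer. A sketch under stated assumptions: chunks arrive as `(score, text)` pairs, `count_tokens` is caller-supplied (e.g. a tiktoken wrapper), and the word-overlap dedupe stands in for proper embedding-based dedupe.

```python
def pack_context(chunks, budget_tokens, count_tokens):
    """Greedy budget spender: take chunks in relevance order, skip
    near-duplicates, never exceed the budget."""
    selected, seen, spent = [], set(), 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        fingerprint = frozenset(text.lower().split())
        # Dedupe: similar chunks waste tokens (crude Jaccard overlap check)
        if any(len(fingerprint & s) / max(len(fingerprint | s), 1) > 0.8 for s in seen):
            continue
        cost = count_tokens(text)
        if spent + cost > budget_tokens:
            continue  # over budget for this chunk; a smaller one may still fit
        selected.append(text)
        seen.add(fingerprint)
        spent += cost
    return selected
```

Everything skipped here is a candidate for the hierarchical path: keep a summary in budget, fetch details only if the model asks.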
Anti-pattern: Stuffing the context window with “just in case” information. Every token has a cost (latency, money, attention dilution).
Pattern 5: Structured Output Contracts
Problem: LLMs produce free-form text. Your code needs structured data.
Solution: Define explicit contracts and validate them.
// Define the contract
interface ExtractedInvoice {
  vendor: string;
  amount: number;
  currency: string;
  date: string;
  lineItems: Array<{
    description: string;
    quantity: number;
    unitPrice: number;
  }>;
  confidence: number;
}

// Validate the output
function parseInvoiceResponse(llmOutput: string): ExtractedInvoice | null {
  try {
    const parsed = JSON.parse(llmOutput);
    return invoiceSchema.parse(parsed); // Zod or similar
  } catch {
    return null; // Trigger fallback/retry
  }
}
Prompting for structure:
- Include the exact schema in the prompt
- Show 2-3 examples of valid outputs
- Specify what to do for missing/uncertain fields
- Request confidence scores for downstream routing
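The four prompting rules above can be mechanized in a small prompt builder. A sketch with assumed function and parameter names; the exact wording should be adapted to your model:

```python
import json

def build_extraction_prompt(schema_example: dict, examples: list, document: str) -> str:
    """Assemble a structure-forcing prompt: exact schema, valid example
    outputs, and explicit rules for missing/uncertain fields."""
    parts = [
        "Extract the invoice as JSON matching exactly this schema:",
        json.dumps(schema_example, indent=2),
        "Rules:",
        "- If a field is missing from the document, use null.",
        "- Include a `confidence` score between 0 and 1.",
        "- Output only JSON, no prose.",
    ]
    for ex in examples:  # 2-3 valid outputs, per the rules above
        parts.append("Example valid output:")
        parts.append(json.dumps(ex, indent=2))
    parts.append("Document:")
    parts.append(document)
    return "\n\n".join(parts)
```

Keeping the schema and examples in data (rather than pasted into a prompt string) means the prompt stays in sync with the validator on the other side of the contract.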
Pattern 6: Human-in-the-Loop Injection Points
Problem: AI systems need human oversight, but where?
Solution: Design explicit injection points for human review.
┌─────────────────────────────────────────────────────┐
│              Confidence Threshold Gate              │
│                                                     │
│  High confidence (>0.9)  → Auto-approve             │
│  Medium (0.7-0.9)        → Queue for spot-check     │
│  Low (<0.7)              → Require human review     │
└─────────────────────────────────────────────────────┘
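The gate itself is a few lines of code. This sketch adds one assumption beyond the diagram: a set of high-stakes categories that bypass the thresholds entirely, matching the "high-stakes decisions" injection point below.

```python
# Assumed category set, not from the source; adjust to your domain.
HIGH_STAKES = {"payment", "legal", "medical"}

def review_route(confidence: float, category: str = "general") -> str:
    """Map the gate's thresholds (0.7, 0.9) to a destination queue.
    High-stakes categories always get a human, regardless of confidence."""
    if category in HIGH_STAKES:
        return "human_review"
    if confidence > 0.9:
        return "auto_approve"
    if confidence >= 0.7:
        return "spot_check_queue"
    return "human_review"
```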
Where to inject humans:
- Low-confidence outputs
- High-stakes decisions (money, legal, medical)
- Novel input types (out of training distribution)
- User-escalated cases
Implementation tips:
- Make review queues first-class features, not afterthoughts
- Track reviewer agreement with AI (calibration data)
- Feed corrections back into evaluation datasets
Anti-Patterns to Avoid
The Monolithic Prompt
Putting everything in one giant prompt makes debugging impossible and changes risky. Break into composable pieces.
The Demo Architecture
What works for 10 requests/day breaks at 10,000. Build for production load from the start, even if you don’t need it yet.
The Blind Trust
Assuming model outputs are correct without validation. Every production issue we’ve seen involves unchecked outputs.
The Cost Afterthought
Not tracking per-request costs. You’ll be surprised when the bill arrives.
Implementation Checklist
Before shipping any AI feature:
- Input validation in place
- Output validation in place
- Caching layer implemented
- Fallback behavior defined
- Logging capturing full request/response
- Cost tracking per request
- Human review path exists
- Evaluation metrics defined
Further Reading
- Evaluation Framework — How to measure if your AI system is working
- Production Readiness Checklist — What “production-ready” actually means