AI Architecture Patterns That Actually Ship
The patterns we use to structure AI systems that survive production. Not theory — these are the patterns from real projects.
The Problem with Most AI Architecture
Most AI projects fail not because the model doesn’t work, but because everything around it doesn’t work. The architecture is an afterthought, bolted onto existing systems without considering the unique constraints of AI workloads.
This playbook covers the patterns we use repeatedly on production AI systems. They’re not academic — they come from shipping systems that handle real traffic.
Pattern 1: The Reliability Sandwich
Problem: LLM calls are slow and expensive, and they occasionally fail. You can’t treat them like regular API calls.
Solution: Wrap every LLM interaction in three layers:
┌─────────────────────────────────────┐
│ Input Validation                    │ ← Catch garbage before it costs you
├─────────────────────────────────────┤
│ Caching Layer                       │ ← Don't pay twice for the same answer
├─────────────────────────────────────┤
│ LLM Call                            │ ← The expensive part
├─────────────────────────────────────┤
│ Output Validation                   │ ← Ensure structure before using
├─────────────────────────────────────┤
│ Fallback Handler                    │ ← Graceful degradation
└─────────────────────────────────────┘
Input Validation:
- Check token counts before sending
- Validate required context is present
- Sanitize inputs that could cause injection
Caching:
- Hash the semantic content, not exact strings
- Set TTLs based on content freshness requirements
- Use embedding similarity for near-match caching
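The caching principles above can be sketched as a small in-memory cache: exact hits via a normalized hash, near hits via cosine similarity over caller-supplied embeddings. This is an illustrative toy, not a production cache; class and method names are assumptions.

```python
import hashlib
import math
import time

class SemanticCache:
    """Toy near-match cache: exact hits by normalized hash, near hits by
    cosine similarity over caller-supplied embeddings. Illustrative only."""

    def __init__(self, ttl_seconds=3600, similarity_threshold=0.95):
        self.ttl = ttl_seconds
        self.threshold = similarity_threshold
        self.entries = {}  # key -> (embedding, answer, stored_at)

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalize whitespace/case so trivially different strings collide
        canonical = " ".join(prompt.lower().split())
        return hashlib.sha256(canonical.encode()).hexdigest()

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, prompt: str, embedding):
        now = time.time()
        hit = self.entries.get(self._key(prompt))
        if hit and now - hit[2] < self.ttl:
            return hit[1]  # exact (normalized) match
        for emb, answer, stored_at in self.entries.values():
            if now - stored_at < self.ttl and self._cosine(embedding, emb) >= self.threshold:
                return answer  # near match within TTL
        return None

    def put(self, prompt: str, embedding, answer):
        self.entries[self._key(prompt)] = (embedding, answer, time.time())
```

In a real system the embeddings would come from your embedding model and the store would be shared (e.g. Redis); the TTL knob is where content freshness requirements plug in.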
Output Validation:
- Parse structured outputs immediately
- Validate against expected schema
- Check for hallucination signals
Fallback:
- Return cached stale data if available
- Fall back to simpler models
- Return honest “I don’t know” responses
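The five layers compose into one wrapper. A minimal sketch, assuming all collaborators (`call_llm`, `cache`, the validators, `fallback`) are caller-supplied callables with hypothetical names, not a real library:

```python
def reliability_sandwich(prompt, call_llm, cache, validate_input, validate_output, fallback):
    """Run one LLM interaction through the five layers of the sandwich."""
    if not validate_input(prompt):           # layer 1: reject garbage early
        return fallback(prompt, reason="invalid_input")

    cached = cache.get(prompt)               # layer 2: don't pay twice
    if cached is not None:
        return cached

    try:
        raw = call_llm(prompt)               # layer 3: the expensive part
        result = validate_output(raw)        # layer 4: enforce structure
    except Exception:
        return fallback(prompt, reason="llm_error")

    if result is None:                       # layer 5: graceful degradation
        return fallback(prompt, reason="invalid_output")

    cache.put(prompt, result)
    return result
```

The fallback receives a reason so it can choose between stale cache, a simpler model, or an honest "I don't know".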
Pattern 2: Confidence-Gated Routing
Problem: Different queries have different complexity. Using GPT-4 for everything is expensive. Using GPT-3.5 for everything produces bad results for hard queries.
Solution: Route based on estimated difficulty.
def route_query(query: str, context: dict) -> str:
    # Fast classifier to estimate query complexity
    complexity = estimate_complexity(query, context)
    if complexity < 0.3:
        return "fast_model"      # GPT-3.5, Claude Haiku
    elif complexity < 0.7:
        return "standard_model"  # GPT-4, Claude Sonnet
    else:
        return "premium_model"   # GPT-4 Turbo, Claude Opus
Key insight: The complexity classifier can be simple. Even a rule-based system that checks query length, presence of technical terms, and user history works well. You don’t need ML for the router.
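A rule-based `estimate_complexity` along those lines can be sketched as below. The signal names and weights are illustrative assumptions; tune them against your own traffic.

```python
# Hypothetical domain vocabulary; replace with terms from your own domain.
TECHNICAL_TERMS = {"regression", "kubernetes", "derivative", "amortized"}

def estimate_complexity(query: str, context: dict) -> float:
    """Rule-based complexity score in [0, 1]. A sketch, not a trained model."""
    words = query.lower().split()
    score = 0.0
    score += min(len(words) / 50, 0.4)            # longer queries tend to be harder
    if any(w.strip("?.,") in TECHNICAL_TERMS for w in words):
        score += 0.3                              # domain vocabulary present
    if context.get("user_escalations", 0) > 0:
        score += 0.3                              # user history of hard cases
    return min(score, 1.0)
```

Even this crude score separates "hi" from a long jargon-heavy question, which is all the router needs.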
Metrics to track:
- Quality score by route (are fast models actually handling easy queries well?)
- Cost savings vs. always using premium
- Latency distribution by route
Pattern 3: The Evaluation Flywheel
Problem: You don’t know if your system is getting better or worse over time.
Solution: Build evaluation into the architecture from day one.
┌─────────────────────────────────────────────────────┐
│                 Production Traffic                  │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                    Logging Layer                    │
│  • Input/Output pairs                               │
│  • Latency, tokens, cost                            │
│  • User feedback signals                            │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                 Evaluation Pipeline                 │
│  • Automated quality scoring                        │
│  • Regression detection                             │
│  • A/B comparison                                   │
└──────────────────────────┬──────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────┐
│                   Action Triggers                   │
│  • Alert on quality drops                           │
│  • Auto-rollback bad prompts                        │
│  • Flag cases for human review                      │
└─────────────────────────────────────────────────────┘
What to log (always):
- Full request/response (redacted for PII)
- Model used, temperature, tokens
- Latency breakdown (network, inference, parsing)
- User ID for feedback correlation
What to evaluate:
- Factual accuracy (can be automated with assertions)
- Format compliance (did it follow instructions?)
- User satisfaction (implicit from behavior, explicit from feedback)
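The "what to log" list maps naturally onto one record per call. A minimal sketch; the field names are illustrative assumptions, not a prescribed schema:

```python
import json
import time
from dataclasses import dataclass, asdict, field
from typing import Optional

@dataclass
class LLMCallRecord:
    """One evaluation-ready log line per LLM call."""
    user_id: str               # for feedback correlation
    model: str
    temperature: float
    prompt_redacted: str       # PII stripped before logging
    response: str
    tokens_in: int
    tokens_out: int
    latency_ms: dict           # breakdown, e.g. network / inference / parsing
    cost_usd: float
    feedback: Optional[str] = None
    ts: float = field(default_factory=time.time)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)
```

Emitting one structured line per call is what makes the downstream evaluation pipeline (scoring, regression detection, A/B comparison) possible without instrumenting anything twice.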
Pattern 4: Context Window Management
Problem: Context windows are finite. RAG systems often stuff in irrelevant context. More context ≠ better results.
Solution: Treat context like a budget you spend carefully.
Context Budget: 8000 tokens
┌────────────────────┬────────────────┐
│ System Prompt      │   800 tokens   │ ← Keep minimal
├────────────────────┼────────────────┤
│ User Query         │   200 tokens   │
├────────────────────┼────────────────┤
│ Retrieved Context  │  4000 tokens   │ ← This is where most waste happens
├────────────────────┼────────────────┤
│ Examples           │  1500 tokens   │ ← Few-shot if needed
├────────────────────┼────────────────┤
│ Output Buffer      │  1500 tokens   │ ← Reserve for response
└────────────────────┴────────────────┘
Context selection principles:
- Relevance scoring: Use embedding similarity, but also recency and source quality
- Deduplication: Similar chunks waste tokens
- Compression: Summarize verbose context before including
- Hierarchical retrieval: Get summaries first, details only if needed
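The relevance, deduplication, and budget principles combine into a greedy context packer. A sketch under stated assumptions: chunks arrive as `(score, text)` pairs, `count_tokens` is caller-supplied (e.g. a tiktoken wrapper), and the word-overlap dedupe stands in for proper embedding-based dedupe.

```python
def pack_context(chunks, budget_tokens, count_tokens):
    """Greedy budget spender: take chunks in relevance order, skip
    near-duplicates, never exceed the budget."""
    selected, seen, spent = [], set(), 0
    for score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        fingerprint = frozenset(text.lower().split())
        # Dedupe: similar chunks waste tokens (crude Jaccard overlap check)
        if any(len(fingerprint & s) / max(len(fingerprint | s), 1) > 0.8 for s in seen):
            continue
        cost = count_tokens(text)
        if spent + cost > budget_tokens:
            continue  # over budget for this chunk; a smaller one may still fit
        selected.append(text)
        seen.add(fingerprint)
        spent += cost
    return selected
```

Everything skipped here is a candidate for the hierarchical path: keep a summary in budget, fetch details only if the model asks.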
Anti-pattern: Stuffing the context window with “just in case” information. Every token has a cost (latency, money, attention dilution).
Pattern 5: Structured Output Contracts
Problem: LLMs produce free-form text. Your code needs structured data.
Solution: Define explicit contracts and validate them.
// Define the contract
interface ExtractedInvoice {
  vendor: string;
  amount: number;
  currency: string;
  date: string;
  lineItems: Array<{
    description: string;
    quantity: number;
    unitPrice: number;
  }>;
  confidence: number;
}

// Validate the output
function parseInvoiceResponse(llmOutput: string): ExtractedInvoice | null {
  try {
    const parsed = JSON.parse(llmOutput);
    return invoiceSchema.parse(parsed); // Zod or similar
  } catch {
    return null; // Trigger fallback/retry
  }
}
Prompting for structure:
- Include the exact schema in the prompt
- Show 2-3 examples of valid outputs
- Specify what to do for missing/uncertain fields
- Request confidence scores for downstream routing
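The four prompting rules above can be mechanized in a small prompt builder. A sketch with assumed function and parameter names; the exact wording should be adapted to your model:

```python
import json

def build_extraction_prompt(schema_example: dict, examples: list, document: str) -> str:
    """Assemble a structure-forcing prompt: exact schema, valid example
    outputs, and explicit rules for missing/uncertain fields."""
    parts = [
        "Extract the invoice as JSON matching exactly this schema:",
        json.dumps(schema_example, indent=2),
        "Rules:",
        "- If a field is missing from the document, use null.",
        "- Include a `confidence` score between 0 and 1.",
        "- Output only JSON, no prose.",
    ]
    for ex in examples:  # 2-3 valid outputs, per the rules above
        parts.append("Example valid output:")
        parts.append(json.dumps(ex, indent=2))
    parts.append("Document:")
    parts.append(document)
    return "\n\n".join(parts)
```

Keeping the schema and examples in data (rather than pasted into a prompt string) means the prompt stays in sync with the validator on the other side of the contract.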
Pattern 6: Human-in-the-Loop Injection Points
Problem: AI systems need human oversight, but where?
Solution: Design explicit injection points for human review.
┌─────────────────────────────────────────────────────┐
│              Confidence Threshold Gate              │
│                                                     │
│  High confidence (>0.9)  → Auto-approve             │
│  Medium (0.7-0.9)        → Queue for spot-check     │
│  Low (<0.7)              → Require human review     │
└─────────────────────────────────────────────────────┘
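The gate itself is a few lines of code. This sketch adds one assumption beyond the diagram: a set of high-stakes categories that bypass the thresholds entirely, matching the "high-stakes decisions" injection point below.

```python
# Assumed category set, not from the source; adjust to your domain.
HIGH_STAKES = {"payment", "legal", "medical"}

def review_route(confidence: float, category: str = "general") -> str:
    """Map the gate's thresholds (0.7, 0.9) to a destination queue.
    High-stakes categories always get a human, regardless of confidence."""
    if category in HIGH_STAKES:
        return "human_review"
    if confidence > 0.9:
        return "auto_approve"
    if confidence >= 0.7:
        return "spot_check_queue"
    return "human_review"
```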
Where to inject humans:
- Low-confidence outputs
- High-stakes decisions (money, legal, medical)
- Novel input types (out of training distribution)
- User-escalated cases
Implementation tips:
- Make review queues first-class features, not afterthoughts
- Track reviewer agreement with AI (calibration data)
- Feed corrections back into evaluation datasets
Anti-Patterns to Avoid
The Monolithic Prompt
Putting everything in one giant prompt makes debugging impossible and changes risky. Break into composable pieces.
The Demo Architecture
What works for 10 requests/day breaks at 10,000. Build for production load from the start, even if you don’t need it yet.
The Blind Trust
Assuming model outputs are correct without validation. Every production issue we’ve seen involves unchecked outputs.
The Cost Afterthought
Not tracking per-request costs. You’ll be surprised when the bill arrives.
Implementation Checklist
Before shipping any AI feature:
- Input validation in place
- Output validation in place
- Caching layer implemented
- Fallback behavior defined
- Logging capturing full request/response
- Cost tracking per request
- Human review path exists
- Evaluation metrics defined
Further Reading
- Evaluation Framework — How to measure if your AI system is working
- Production Readiness Checklist — What “production-ready” actually means