Production Readiness Checklist

What 'production-ready' actually means for AI systems. The checklist we use before any launch.

Author: Synapti Collective

Published: January 23, 2026

Read time: 8 min

Tags: production, checklist, deployment, operations

What Production-Ready Actually Means

“It works on my machine” is not production-ready. Neither is “it demos well.” Production-ready means the system can handle real users, real scale, and real failures, without you babysitting it.

This checklist covers what we verify before any AI system goes live.

The Checklist

1. Reliability

Failure Handling

All LLM calls have timeouts configured
Retry logic with exponential backoff
Fallback behavior defined for all failure modes
Circuit breakers prevent cascade failures
Documented which failures degrade gracefully and which fail hard
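
As a rough illustration of the first three items, here is a minimal sketch of a retry wrapper with a per-call timeout, exponential backoff with jitter, and a fallback. The names `call_with_retry` and `LLMCallError` are hypothetical, and the sketch assumes your client accepts a timeout in seconds and raises a single error type on failure. Circuit breaking is left out to keep it short.

```python
import random
import time
from typing import Callable, Optional


class LLMCallError(Exception):
    """Hypothetical error the wrapped client raises on failure or timeout."""


def call_with_retry(
    call: Callable[[float], str],
    *,
    timeout_s: float = 10.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    fallback: Optional[Callable[[], str]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run call(timeout_s), retrying failed attempts with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            # The timeout is passed down so no single attempt can hang forever.
            return call(timeout_s)
        except LLMCallError:
            if attempt == max_attempts - 1:
                break  # out of attempts, fall through to the fallback
            # The delay ceiling doubles each attempt, capped at max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Jitter: sleep a random amount up to the ceiling so that many
            # clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
    if fallback is not None:
        return fallback()
    raise LLMCallError(f"all {max_attempts} attempts failed")
```

The `sleep` parameter exists only so tests can skip the real waiting. In a real service the fallback might return a cached answer or a canned "try again shortly" message.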

Error Handling

All errors logged with context
User-facing errors are helpful, not technical
No stack traces exposed to users
Error rates monitored and alerted
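
One way to keep technical detail in the logs and out of the user's view is to translate exceptions at the boundary. The sketch below is illustrative only: `to_user_error`, the message table, and the field names are assumptions, not part of any framework.

```python
import logging
import uuid

logger = logging.getLogger("app.errors")

# Hypothetical mapping from internal exception types to user-safe messages.
USER_MESSAGES = {
    TimeoutError: "The request took too long. Please try again.",
    ValueError: "We could not process that input. Please check it and retry.",
}
DEFAULT_MESSAGE = "Something went wrong on our side. Please try again later."


def to_user_error(exc: Exception, *, request_id: str, user_id: str) -> dict:
    """Log the full error with context and return a user-safe payload."""
    error_id = uuid.uuid4().hex
    # The exception and its traceback (if any) go to the logs only.
    logger.error(
        "request failed",
        exc_info=exc,
        extra={"error_id": error_id, "request_id": request_id, "user_id": user_id},
    )
    message = next(
        (msg for exc_type, msg in USER_MESSAGES.items() if isinstance(exc, exc_type)),
        DEFAULT_MESSAGE,
    )
    # The user sees a friendly message plus an ID support can look up.
    return {"error_id": error_id, "message": message}
```

Returning the `error_id` gives users something to quote in a support request without exposing internals.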

Recovery

System restarts cleanly after failures
No data loss on crashes
Failed jobs can be retried

2. Performance

Latency

p50 latency meets user expectations
p95 latency acceptable (slow, but users will still wait for it)
p99 latency bounded (no infinite hangs)
Timeout thresholds set appropriately
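
To make the percentile checks concrete, here is a small sketch that computes p50, p95, and p99 from recorded latencies using the nearest-rank method and compares them with budgets. The function names and the budget values are placeholders. In production you would more likely read these from your metrics system's histograms.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_ms: Sequence[float], budgets_ms: Dict[str, float]) -> dict:
    """Compare p50/p95/p99 with per-percentile budgets (all in milliseconds)."""
    report = {}
    for name, pct in (("p50", 50), ("p95", 95), ("p99", 99)):
        value = percentile(samples_ms, pct)
        report[name] = {"value_ms": value, "within_budget": value <= budgets_ms[name]}
    return report
```

For example, `latency_report(samples, {"p50": 800, "p95": 3000, "p99": 10000})` flags which percentiles exceed their budget. The numbers here are invented and should come from your own user expectations.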

Throughput

Load tested at 2x expected peak
Rate limiting in place
Backpressure handling defined
Horizontal scaling path exists
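
Rate limiting can take many forms. A token bucket is one common choice, sketched below for a single process. The class name and parameters are illustrative, and a multi-instance deployment would need shared state (for example in a cache or gateway) that this sketch does not cover.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(
        self,
        rate_per_s: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if enough are available."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_s)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

When `allow()` returns False, the caller decides the backpressure behavior: reject with a "too many requests" response, queue the work, or shed load.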

Cost

Cost per request calculated
Cost monitoring in place
Budget alerts configured
Cost optimization opportunities identified
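
Cost per request is usually simple arithmetic over token counts. The sketch below assumes per-million-token pricing with separate input and output rates. The `Pricing` type, the function names, and the 80% alert threshold are all placeholders to replace with your provider's actual rates and your own budget policy.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Pricing:
    """Hypothetical price sheet, in USD per one million tokens."""
    input_per_million_usd: float
    output_per_million_usd: float


def request_cost_usd(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one request from its token counts."""
    total = (
        input_tokens * pricing.input_per_million_usd
        + output_tokens * pricing.output_per_million_usd
    )
    return total / 1_000_000


def over_budget(
    costs_today_usd: Iterable[float],
    daily_budget_usd: float,
    alert_fraction: float = 0.8,
) -> bool:
    """True once today's spend reaches the alert fraction of the daily budget."""
    return sum(costs_today_usd) >= alert_fraction * daily_budget_usd
```

Logging `request_cost_usd` alongside each request makes it possible to compare actual spend with pre-launch estimates.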

3. Security

Input Validation

All user inputs sanitized
Prompt injection mitigations in place
Input length limits enforced
File uploads validated (if applicable)
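
Here is a minimal sketch of basic input hygiene: normalizing text, stripping control characters, and enforcing a length limit. `clean_user_input`, `InputRejected`, and the 4000-character limit are assumptions for illustration. Note that this alone does not mitigate prompt injection, which needs separate defenses.

```python
import unicodedata

MAX_INPUT_CHARS = 4000  # hypothetical limit, tune to your context window and costs


class InputRejected(ValueError):
    """Raised when user input fails validation."""


def clean_user_input(text: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    """Normalize user text, drop control characters, and enforce a length limit."""
    if not isinstance(text, str):
        raise InputRejected("input must be text")
    normalized = unicodedata.normalize("NFC", text)
    # Drop control characters (Unicode category Cc) except newline and tab.
    cleaned = "".join(
        ch for ch in normalized
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    ).strip()
    if not cleaned:
        raise InputRejected("input is empty")
    if len(cleaned) > max_chars:
        raise InputRejected(f"input exceeds {max_chars} characters")
    return cleaned
```

Rejecting over-long input instead of silently truncating it keeps behavior predictable and makes the limit visible to the user.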

Output Safety

No PII in logs
No secrets in responses
Output filtering for harmful content
Compliance with content policies

Access Control

Authentication required
Authorization enforced
API keys rotatable
Audit logging enabled

4. Observability

Logging

All requests logged
All responses logged (with PII redaction)
Error context captured
Logs searchable and queryable
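
Logging requests and responses while keeping PII out of the logs usually means redacting before writing. The sketch below uses two deliberately simple regular expressions for emails and phone numbers. They are illustrative assumptions that will miss many real-world formats, so treat them as a starting point and not a complete PII detector.

```python
import json
import re

# Deliberately simple patterns, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def log_record(request_id: str, prompt: str, response: str, latency_ms: float) -> str:
    """Build one structured (JSON) log line with redacted prompt and response."""
    return json.dumps(
        {
            "request_id": request_id,
            "prompt": redact(prompt),
            "response": redact(response),
            "latency_ms": latency_ms,
        },
        sort_keys=True,
    )
```

Emitting one JSON object per line keeps the logs easy to search and query in most log tooling.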

Metrics

Latency histograms
Error rates
Token usage
Cost per request

Alerting

Error rate alerts
Latency alerts
Cost alerts
On-call rotation defined
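
An error rate alert boils down to a threshold over a recent window. The sketch below tracks the last N request outcomes in memory. The class name, window size, threshold, and minimum sample count are all illustrative, and in practice this logic usually lives in your monitoring system, not in application code.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error rate over the last `window` requests exceeds `threshold`."""

    def __init__(self, window: int = 100, threshold: float = 0.05, min_requests: int = 20) -> None:
        self.threshold = threshold
        self.min_requests = min_requests
        # A bounded deque silently drops the oldest outcome once full.
        self._outcomes = deque(maxlen=window)

    def record(self, ok: bool) -> None:
        """Record one request outcome (True for success, False for error)."""
        self._outcomes.append(bool(ok))

    def error_rate(self) -> float:
        if not self._outcomes:
            return 0.0
        return self._outcomes.count(False) / len(self._outcomes)

    def firing(self) -> bool:
        # Require a minimum sample so one early failure does not page anyone.
        return len(self._outcomes) >= self.min_requests and self.error_rate() > self.threshold
```

The same shape works for latency and cost alerts by swapping the recorded value and the threshold.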

5. Quality

Evaluation

Golden test set exists
Automated evaluation pipeline
Regression detection active
Quality dashboard available
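
A golden test set and regression check can start very small. In the sketch below, `pass_rate` runs a model function over (prompt, expected) pairs and `has_regressed` compares the result with a stored baseline. The names, the exact-match grader, and the 2-point tolerance are assumptions. Real graders are often fuzzier (rubrics, similarity scores, or model-based grading).

```python
from typing import Callable, Sequence, Tuple

GoldenSet = Sequence[Tuple[str, str]]  # (prompt, expected answer) pairs


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible grader: case-insensitive exact match."""
    return output.strip().lower() == expected.strip().lower()


def pass_rate(
    model: Callable[[str], str],
    golden_set: GoldenSet,
    grade: Callable[[str, str], bool] = exact_match,
) -> float:
    """Fraction of golden examples the model answers acceptably."""
    if not golden_set:
        raise ValueError("golden set is empty")
    passed = sum(1 for prompt, expected in golden_set if grade(model(prompt), expected))
    return passed / len(golden_set)


def has_regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the pass rate dropped below the baseline by more than the tolerance."""
    return current < baseline - tolerance
```

Running this in CI on every prompt or model change, and failing the build when `has_regressed` returns True, is one way to make regression detection active.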

Monitoring

Production quality sampling
User feedback collection
Drift detection
Human review queue

6. Operations

Deployment

Zero-downtime deployment
Rollback procedure documented
Feature flags for new functionality
Blue-green or canary deployment
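
For a canary rollout, each user should land consistently on the same side of the split. A minimal sketch, assuming you route by a stable user ID: hash the ID into a bucket and compare it with the canary percentage. The function name and the salt value are hypothetical.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "canary-v1") -> bool:
    """Deterministically assign roughly `percent`% of users to the canary."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Map the hash to one of 10,000 buckets for 0.01% granularity.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

Because the bucket depends only on the salt and the user ID, raising the percentage keeps existing canary users in the canary, and setting it to 0 acts as an instant rollback of traffic.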

Configuration

All config externalized (not hardcoded)
Secrets kept in a secrets manager, not in code or config files
Environment-specific configs
Config changes audited
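
Externalized config often means reading settings from the environment at startup and failing fast when something required is missing. The variable names (`APP_ENV`, `LLM_API_KEY`, `LLM_TIMEOUT_S`) and the `Settings` type below are illustrative assumptions, not a convention of any particular framework.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    environment: str
    llm_timeout_s: float
    # repr=False keeps the key out of logs and tracebacks that print settings.
    llm_api_key: str = field(repr=False)


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Load settings from environment variables, failing fast on missing values."""
    env = os.environ if env is None else env
    missing = [name for name in ("APP_ENV", "LLM_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing required config: {', '.join(missing)}")
    return Settings(
        environment=env["APP_ENV"],
        llm_timeout_s=float(env.get("LLM_TIMEOUT_S", "10")),
        llm_api_key=env["LLM_API_KEY"],
    )
```

Passing a plain dictionary as `env` makes the loader easy to test. In production the values would typically be injected by the deployment platform or a secrets manager.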

Documentation

Architecture documented
Runbook for common issues
On-call procedures
Escalation paths

7. Compliance

Data

Data retention policy implemented
PII handling documented
Data deletion capability
Data export capability (if required)

Legal

Terms of service updated
Privacy policy covers AI usage
Content policy defined
User consent mechanisms (if required)

Minimum Viable Production

Not everything is equally important. Here’s the absolute minimum before any launch:

Must Have

Timeout on all LLM calls
Basic error handling
Request/response logging
Cost tracking
One form of quality monitoring

Should Have

Automated evaluation
Alerting on errors
Rate limiting
Fallback behavior

Nice to Have

Advanced observability
Sophisticated quality analysis
Automated rollback

Red Flags

If you see any of these, you’re not ready:

“We will add monitoring later.” You will not, and you will regret it.
“The error handling is TODO.” Production will find every edge case.
“We don’t know how much it costs.” You will when the bill arrives.
“We tested it manually.” Manual testing does not scale.
“It works in staging.” Staging lies.

Pre-Launch Day Checklist

The day before launch:

Launch day:

Monitoring dashboards open
Team available for issues
Canary percentage set
Rollback ready

Post-launch:

Monitor for 24-48 hours
Review error logs
Check cost actuals vs. estimates
Gather initial user feedback