Production Readiness Checklist
What 'production-ready' actually means for AI systems. The checklist we use before any launch.
Author: Synapti Collective
Published: January 23, 2026
Read time: 8 min
What Production-Ready Actually Means
“It works on my machine” is not production-ready. Neither is “it demos well.” Production-ready means the system can handle real users, real scale, and real failures, without you babysitting it.
This checklist covers what we verify before any AI system goes live.
The Checklist
1. Reliability
Failure Handling
- All LLM calls have timeouts configured
- Retry logic with exponential backoff
- Fallback behavior defined for all failure modes
- Circuit breakers prevent cascade failures
- Documented which failures degrade gracefully and which fail hard
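As a rough illustration of the first three items, here is a minimal sketch of a retry wrapper with a timeout, exponential backoff with jitter, and an optional fallback. The `call` argument is a hypothetical wrapper around whatever LLM client you use, and the default values are placeholders, not recommendations.

```python
import random
import time
from typing import Callable, Optional


def call_with_retry(
    call: Callable[[float], str],
    *,
    timeout_s: float = 10.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    fallback: Optional[str] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run call(timeout_s), retrying failures with exponential backoff.

    `call` is a hypothetical wrapper around your LLM client. It must honour
    the timeout it receives and raise an exception when the request fails.
    """
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return call(timeout_s)
        except Exception as exc:  # narrow this to your client's retryable errors
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            # Delay doubles each attempt (0.5s, 1s, 2s, ...) up to max_delay_s.
            delay = min(max_delay_s, base_delay_s * (2 ** attempt))
            # Jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    if fallback is not None:
        return fallback  # degrade gracefully instead of failing hard
    raise RuntimeError("LLM call failed after retries") from last_error
```

Passing `fallback="Sorry, please try again in a moment."` turns a hard failure into a degraded response. Leaving it as `None` surfaces the error to the caller, which is the right choice when a wrong answer is worse than no answer.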
Error Handling
- All errors logged with context
- User-facing errors are helpful, not technical
- No stack traces exposed to users
- Error rates monitored and alerted
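One way to keep stack traces out of user-facing responses while still logging full context is to translate exceptions at the boundary. This is a sketch under assumptions: the message table, the logger name, and the response shape are all hypothetical.

```python
import logging
import uuid
from typing import Dict, Optional

logger = logging.getLogger("app.errors")  # hypothetical logger name

# Hypothetical mapping from exception type to a message users can act on.
USER_MESSAGES = {
    TimeoutError: "This is taking longer than usual. Please try again.",
    ValueError: "We could not process that input. Please check it and retry.",
}
DEFAULT_MESSAGE = "Something went wrong on our side. Please try again later."


def to_user_error(exc: Exception, request_id: Optional[str] = None) -> Dict[str, str]:
    """Log the full exception with context and return a safe payload."""
    request_id = request_id or uuid.uuid4().hex
    # The traceback goes to the logs, keyed by request_id for later lookup.
    logger.error("request failed", exc_info=exc, extra={"request_id": request_id})
    message = next(
        (msg for exc_type, msg in USER_MESSAGES.items() if isinstance(exc, exc_type)),
        DEFAULT_MESSAGE,
    )
    # The user sees only a helpful message and an id they can quote to support.
    return {"error": message, "request_id": request_id}
```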
Recovery
- System restarts cleanly after failures
- No data loss on crashes
- Failed jobs can be retried
2. Performance
Latency
- p50 latency meets user expectations
- p95 latency acceptable (slow, but users will still wait for it)
- p99 latency bounded (no infinite hangs)
- Timeout thresholds set appropriately
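To make the percentile items concrete, here is a small sketch that computes nearest-rank percentiles from recorded latencies and compares them against budgets. The budget values in the usage note are placeholders; set them from your own user expectations.

```python
import math
from typing import Dict, Sequence, Tuple


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def check_latency(
    samples_ms: Sequence[float], budgets_ms: Dict[int, float]
) -> Dict[int, Tuple[float, bool]]:
    """Map each percentile to (observed latency, whether it meets its budget)."""
    report = {}
    for pct, budget in budgets_ms.items():
        observed = percentile(samples_ms, pct)
        report[pct] = (observed, observed <= budget)
    return report
```

For example, `check_latency(samples, {50: 800, 95: 3000, 99: 10000})` reports each percentile next to a pass or fail flag. In production these numbers usually come from a metrics backend rather than raw samples, but the definition is the same.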
Throughput
- Load tested at 2x expected peak
- Rate limiting in place
- Backpressure handling defined
- Horizontal scaling path exists
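Rate limiting can be implemented in many ways. One common option is a token bucket, sketched below for a single process. The class name and parameters are our own, and a multi-instance deployment would need shared state (for example in a cache or gateway) rather than this in-memory version.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(
        self,
        rate_per_s: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start full so initial bursts are allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if available, else False."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the time that has passed, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate_per_s)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

When `allow()` returns `False`, the caller decides what backpressure means: reject with a "try again later" response, queue the request, or shed lower-priority work.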
Cost
- Cost per request calculated
- Cost monitoring in place
- Budget alerts configured
- Cost optimization opportunities identified
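Cost per request is simple arithmetic once token counts are logged. The sketch below assumes per-million-token pricing with separate input and output rates. The prices in the usage note are made-up placeholders, so substitute your provider's actual rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per one million tokens (values supplied by the caller)."""
    input_per_million_usd: float
    output_per_million_usd: float


def request_cost_usd(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of a single request given its token counts."""
    total = (
        input_tokens * pricing.input_per_million_usd
        + output_tokens * pricing.output_per_million_usd
    )
    return total / 1_000_000


def should_alert(spend_usd: float, budget_usd: float, alert_fraction: float = 0.8) -> bool:
    """True once spend reaches the given fraction of the budget."""
    return spend_usd >= alert_fraction * budget_usd
```

With hypothetical rates of $3 per million input tokens and $15 per million output tokens, a request with 1,000 input and 500 output tokens costs (1,000 × 3 + 500 × 15) / 1,000,000 = $0.0105. Multiply by expected daily volume to get the number your budget alert should be based on.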
3. Security
Input Validation
- All user inputs sanitized
- Prompt injection mitigations in place
- Input length limits enforced
- File uploads validated (if applicable)
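A minimal sketch of the sanitization and length-limit items follows. The limit and the exception name are hypothetical. Note that this does not mitigate prompt injection by itself; it only removes control characters and rejects empty or oversized input before it reaches the model.

```python
import unicodedata

MAX_INPUT_CHARS = 4000  # hypothetical limit; size it to your context budget


class InputRejected(ValueError):
    """Raised when user input fails validation."""


def clean_user_input(text: str, max_chars: int = MAX_INPUT_CHARS) -> str:
    """Strip control characters and enforce a length limit."""
    # Keep newlines and tabs, drop every other control character (category Cc).
    cleaned = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    ).strip()
    if not cleaned:
        raise InputRejected("input is empty")
    if len(cleaned) > max_chars:
        raise InputRejected(f"input exceeds {max_chars} characters")
    return cleaned
```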
Output Safety
- No PII in logs
- No secrets in responses
- Output filtering for harmful content
- Compliance with content policies
Access Control
- Authentication required
- Authorization enforced
- API keys rotatable
- Audit logging enabled
4. Observability
Logging
- All requests logged
- All responses logged (with PII redaction)
- Error context captured
- Logs searchable and queryable
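Logging responses with PII redaction might look roughly like the sketch below. The two regular expressions are deliberately simple and will miss many kinds of PII (names, addresses, IDs), so treat them as a starting point and not as a complete redactor. The record fields are hypothetical.

```python
import json
import re
from typing import Dict

# Simplified patterns: they catch common email and phone formats only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace emails and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def build_log_record(request_id: str, prompt: str, response: str) -> str:
    """Return a JSON log line with both sides of the exchange redacted."""
    record: Dict[str, str] = {
        "request_id": request_id,
        "prompt": redact(prompt),
        "response": redact(response),
    }
    # One JSON object per line keeps logs searchable and queryable.
    return json.dumps(record, sort_keys=True)
```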
Metrics
- Latency histograms
- Error rates
- Token usage
- Cost per request
Alerting
- Error rate alerts
- Latency alerts
- Cost alerts
- On-call rotation defined
5. Quality
Evaluation
- Golden test set exists
- Automated evaluation pipeline
- Regression detection active
- Quality dashboard available
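A golden test set with regression detection can start very small. The sketch below scores a hypothetical `generate` function against expected substrings and flags a drop relative to a stored baseline. Substring matching is a crude stand-in for a real grader, and the tolerance is a placeholder.

```python
from typing import Callable, Sequence, Tuple

GoldenCase = Tuple[str, str]  # (prompt, substring the answer must contain)


def pass_rate(cases: Sequence[GoldenCase], generate: Callable[[str], str]) -> float:
    """Fraction of golden cases whose output contains the expected text."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(
        1 for prompt, expected in cases
        if expected.lower() in generate(prompt).lower()
    )
    return passed / len(cases)


def is_regression(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when the pass rate fell more than `tolerance` below the baseline."""
    return current < baseline - tolerance
```

Running this in CI on every prompt or model change, and failing the build when `is_regression` returns `True`, covers the automated pipeline and regression detection items in their simplest form.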
Monitoring
- Production quality sampling
- User feedback collection
- Drift detection
- Human review queue
6. Operations
Deployment
- Zero-downtime deployment
- Rollback procedure documented
- Feature flags for new functionality
- Blue-green or canary deployment
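For canary deployments, one simple approach is to assign users to the new version by hashing a stable identifier. This is a sketch, not the only way to do it: many teams route at the load balancer instead. The function name is our own.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent`% of users in the canary.

    Hashing keeps each user on the same side across requests, and raising
    `percent` only ever adds users to the canary group.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    return bucket < percent
```

Setting `percent` to 0 is an instant rollback for canary traffic, which is why the percentage belongs in externalized config or a feature flag.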
Configuration
- All config externalized (not hardcoded)
- Secrets managed properly (never in code or plain-text config)
- Environment-specific configs
- Config changes audited
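Externalized config can be as simple as reading environment variables into a typed object at startup and failing fast when something required is missing. The variable names below (`LLM_MODEL`, `LLM_TIMEOUT_S`, `LLM_API_KEY`) are hypothetical.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    model_name: str
    timeout_s: float
    api_key: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from the environment, failing fast on missing values."""
    missing = [name for name in ("LLM_MODEL", "LLM_API_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("missing required settings: " + ", ".join(missing))
    return Settings(
        model_name=env["LLM_MODEL"],
        timeout_s=float(env.get("LLM_TIMEOUT_S", "30")),  # default is a placeholder
        api_key=env["LLM_API_KEY"],  # injected by your secrets manager, never committed
    )
```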
Documentation
- Architecture documented
- Runbook for common issues
- On-call procedures
- Escalation paths
7. Compliance
Data
- Data retention policy implemented
- PII handling documented
- Data deletion capability
- Data export capability (if required)
Legal
- Terms of service updated
- Privacy policy covers AI usage
- Content policy defined
- User consent mechanisms (if required)
Minimum Viable Production
Not everything is equally important. Here is how we prioritize, starting with the absolute minimum before any launch:
Must Have
- Timeout on all LLM calls
- Basic error handling
- Request/response logging
- Cost tracking
- At least one form of quality monitoring
Should Have
- Automated evaluation
- Alerting on errors
- Rate limiting
- Fallback behavior
Nice to Have
- Advanced observability
- Sophisticated quality analysis
- Automated rollback
Red Flags
If you see any of these, you’re not ready:
- “We will add monitoring later.” You will not, and you will regret it.
- “The error handling is TODO.” Production will find every edge case.
- “We don’t know how much it costs.” You will when the bill arrives.
- “We tested it manually.” Manual testing does not scale.
- “It works in staging.” Staging lies.
Launch Checklist
The day before launch:
- All checklist items verified
- Load test completed
- Rollback tested
- On-call scheduled
- Stakeholders notified
- Support prepared
Launch day:
- Monitoring dashboards open
- Team available for issues
- Canary percentage set
- Rollback ready
Post-launch:
- Monitor for 24-48 hours
- Review error logs
- Check cost actuals vs. estimates
- Gather initial user feedback
Further Reading
- AI Architecture Patterns. Patterns that make production easier.
- Evaluation Framework. Deep dive on quality monitoring.