Production Readiness Checklist
What 'production-ready' actually means for AI systems. The checklist we use before any launch.
Author Synapti Collective
Published January 23, 2026
Read time 8 min
What Production-Ready Actually Means
“It works on my machine” is not production-ready. Neither is “it demos well.” Production-ready means the system can handle real users, real scale, and real failures — without you babysitting it.
This checklist covers what we verify before any AI system goes live.
The Checklist
1. Reliability
Failure Handling
- All LLM calls have timeouts configured
- Retry logic with exponential backoff
- Fallback behavior defined for all failure modes
- Circuit breakers prevent cascade failures
- Graceful degradation vs. hard failures documented
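The first three items above compose naturally into one wrapper. Here is a minimal sketch of a timeout plus retry-with-jittered-backoff layer; `call` is a hypothetical zero-argument client function that accepts a `timeout` keyword (most LLM SDKs do), and the bare `Exception` should be narrowed to your SDK's transient error types:

```python
import random
import time

def call_llm_with_retries(call, max_attempts=3, base_delay=1.0, timeout=30.0):
    """Invoke an LLM call with a timeout, bounded retries, and jittered
    exponential backoff. On final failure, re-raise so the caller's
    fallback / circuit-breaker layer can take over."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call(timeout=timeout)
        except Exception:  # narrow this to your SDK's retryable errors
            if attempt == max_attempts:
                raise  # out of retries: surface to the fallback layer
            # Full jitter: sleep a random amount up to base * 2^(attempt-1)
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))
```

Jitter matters at scale: without it, a burst of failures retries in lockstep and hammers the provider again at the same instant.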
Error Handling
- All errors logged with context
- User-facing errors are helpful, not technical
- No stack traces exposed to users
- Error rates monitored and alerted
Recovery
- System restarts cleanly after failures
- No data loss on crashes
- Failed jobs can be retried
2. Performance
Latency
- p50 latency meets user expectations
- p95 latency acceptable (the slowest 5% of requests still tolerable to users)
- p99 latency bounded (no infinite hangs)
- Timeout thresholds set appropriately
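If you don't already get percentiles from your metrics stack, they are easy to sanity-check from raw samples. A minimal sketch using the nearest-rank method:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of latency samples (ms)."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # nearest-rank method
    return ordered[max(rank - 1, 0)]

# p50 tells you the typical experience; p99 tells you the worst one
# your real users hit every day.
```

Note that averaging latencies hides exactly the tail this checklist cares about; always look at p95/p99, not the mean.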
Throughput
- Load tested at 2x expected peak
- Rate limiting in place
- Backpressure handling defined
- Horizontal scaling path exists
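Rate limiting and backpressure are often one mechanism. A minimal token-bucket sketch (the rate and capacity numbers are placeholders; real deployments usually put this at the gateway rather than in application code):

```python
import time

class TokenBucket:
    """Token-bucket limiter: sustain `rate` requests/sec, allow bursts
    up to `capacity`. A False return is the backpressure signal."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller should shed load or queue with backpressure
```

When `allow()` returns False, return a 429 with a Retry-After hint rather than silently queueing unbounded work.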
Cost
- Cost per request calculated
- Cost monitoring in place
- Budget alerts configured
- Cost optimization opportunities identified
3. Security
Input Validation
- All user inputs sanitized
- Prompt injection mitigations in place
- Input length limits enforced
- File uploads validated (if applicable)
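A minimal pre-flight input check, as a sketch: the length limit is an assumption to tune against your context window, and this is a floor, not a complete prompt-injection defense:

```python
MAX_INPUT_CHARS = 8_000  # assumed limit; tune to your context budget

def validate_user_input(text):
    """Reject empty or oversized input and strip non-printable control
    characters before the text reaches a prompt template."""
    if not isinstance(text, str) or not text.strip():
        raise ValueError("empty input")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    # Keep newlines; drop other control characters that can smuggle
    # unexpected formatting into the prompt.
    return "".join(ch for ch in text if ch == "\n" or ch.isprintable())
```

Prompt-injection mitigation proper (instruction/data separation, output constraints) sits above this layer; cheap structural checks like these just shrink the attack surface.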
Output Safety
- No PII in logs
- No secrets in responses
- Output filtering for harmful content
- Compliance with content policies
Access Control
- Authentication required
- Authorization enforced
- API keys rotatable
- Audit logging enabled
4. Observability
Logging
- All requests logged
- All responses logged (with PII redaction)
- Error context captured
- Logs searchable and queryable
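"With PII redaction" deserves a concrete hook in the logging path. A sketch with two illustrative regexes; these catch only common shapes (emails, phone-like numbers), and production redaction should use a dedicated scanner:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_pii(text):
    """Redact obvious PII patterns before a request/response is logged.
    Best-effort only: regexes miss names, addresses, and IDs."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Run redaction at the logging boundary, not in application logic, so no code path can accidentally bypass it.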
Metrics
- Latency histograms
- Error rates
- Token usage
- Cost per request
Alerting
- Error rate alerts
- Latency alerts
- Cost alerts
- On-call rotation defined
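An error-rate alert is just a threshold over a sliding window. A minimal in-process sketch (real systems compute this in the metrics backend; window size and threshold here are assumptions):

```python
from collections import deque

class ErrorRateAlert:
    """Fire when the error rate over the last `window` requests
    reaches `threshold`."""
    def __init__(self, window=100, threshold=0.05):
        self.window = deque(maxlen=window)  # 1 = error, 0 = success
        self.threshold = threshold

    def record(self, ok):
        """Record one request outcome; return True if the alert fires."""
        self.window.append(0 if ok else 1)
        return (sum(self.window) / len(self.window)) >= self.threshold
```

Alert on rates, not raw counts: ten errors in a minute means something very different at 10 requests/sec than at 10,000.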
5. Quality
Evaluation
- Golden test set exists
- Automated evaluation pipeline
- Regression detection active
- Quality dashboard available
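A golden test set plus regression detection reduces to one gate in CI. A sketch of the shape, under assumptions: each case pairs an input with a `grader` callable that judges the output, and `model_fn` stands in for your system:

```python
def run_golden_set(model_fn, golden_cases, pass_threshold=0.9):
    """Run each (input, grader) case through the model and return
    (passed_gate, pass_rate). Fail the build when the rate drops
    below the threshold -- that is your regression detector."""
    passed = sum(1 for prompt, grade in golden_cases if grade(model_fn(prompt)))
    rate = passed / len(golden_cases)
    return rate >= pass_threshold, rate
```

Graders can be exact-match, rubric-based, or model-judged; the gate logic stays the same either way.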
Monitoring
- Production quality sampling
- User feedback collection
- Drift detection
- Human review queue
6. Operations
Deployment
- Zero-downtime deployment
- Rollback procedure documented
- Feature flags for new functionality
- Blue-green or canary deployment
Configuration
- All config externalized (not hardcoded)
- Secrets stored in a secrets manager, never in code or committed config
- Environment-specific configs
- Config changes audited
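Externalized config in its simplest form is reading from the environment at startup. A sketch; the variable names are assumptions to match to your deployment:

```python
import os

def load_config():
    """Build runtime config from environment variables so the same
    artifact runs in every environment. Missing secrets fail fast
    at startup instead of mid-request."""
    return {
        "model": os.environ.get("LLM_MODEL", "default-model"),
        "timeout_s": float(os.environ.get("LLM_TIMEOUT_S", "30")),
        "api_key": os.environ["LLM_API_KEY"],  # KeyError at boot if unset
    }
```

Failing fast on a missing secret is deliberate: a crash at deploy time is cheap, a 3 a.m. auth failure under load is not.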
Documentation
- Architecture documented
- Runbook for common issues
- On-call procedures
- Escalation paths
7. Compliance
Data
- Data retention policy implemented
- PII handling documented
- Data deletion capability
- Data export capability (if required)
Legal
- Terms of service updated
- Privacy policy covers AI usage
- Content policy defined
- User consent mechanisms (if required)
Minimum Viable Production
Not everything is equally important. Here’s the absolute minimum before any launch:
Must Have
- Timeout on all LLM calls
- Basic error handling
- Request/response logging
- Cost tracking
- One form of quality monitoring
Should Have
- Automated evaluation
- Alerting on errors
- Rate limiting
- Fallback behavior
Nice to Have
- Advanced observability
- Sophisticated quality analysis
- Automated rollback
Red Flags
If you see any of these, you’re not ready:
- “We’ll add monitoring later” — You won’t, and you’ll regret it
- “The error handling is TODO” — Production will find every edge case
- “We don’t know how much it costs” — You will when the bill arrives
- “We tested it manually” — Manual testing doesn’t scale
- “It works in staging” — Staging lies
Pre-Launch Day Checklist
The day before launch:
- All checklist items verified
- Load test completed
- Rollback tested
- On-call scheduled
- Stakeholders notified
- Support prepared
Launch day:
- Monitoring dashboards open
- Team available for issues
- Canary percentage set
- Rollback ready
Post-launch:
- Monitor for 24-48 hours
- Review error logs
- Check cost actuals vs. estimates
- Gather initial user feedback
Further Reading
- AI Architecture Patterns — Patterns that make production easier
- Evaluation Framework — Deep dive on quality monitoring