Production Readiness Checklist

What 'production-ready' actually means for AI systems. The checklist we use before any launch.

Author Synapti Collective
Published January 23, 2026
Read time 8 min
Tags: production, checklist, deployment, operations

What Production-Ready Actually Means

“It works on my machine” is not production-ready. Neither is “it demos well.” Production-ready means the system can handle real users, real scale, and real failures, without you babysitting it.

This checklist covers what we verify before any AI system goes live.

The Checklist

1. Reliability

Failure Handling

  • All LLM calls have timeouts configured
  • Retry logic with exponential backoff
  • Fallback behavior defined for all failure modes
  • Circuit breakers prevent cascade failures
  • Documented which failures degrade gracefully and which fail hard
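
A minimal sketch of the timeout, retry, and fallback items above. Everything named here is an assumption for illustration: `LLMCallError` stands in for whatever your client raises, and `call` is a hypothetical function that accepts a timeout in seconds and returns text. A circuit breaker is not shown.

```python
import random
import time
from typing import Callable, Optional


class LLMCallError(Exception):
    """Hypothetical error raised when an LLM call fails or times out."""


def call_with_retries(
    call: Callable[[float], str],
    *,
    timeout_s: float = 10.0,
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    fallback: Optional[str] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Run call(timeout_s), retrying failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call(timeout_s)
        except LLMCallError:
            if attempt == max_attempts - 1:
                break  # out of attempts; fall through to the fallback
            # Delay doubles each attempt (base, 2*base, 4*base, ...).
            delay = base_delay_s * (2 ** attempt)
            # Jitter spreads out retries from many clients failing at once.
            sleep(delay + random.uniform(0, delay / 2))
    if fallback is not None:
        return fallback
    raise LLMCallError(f"all {max_attempts} attempts failed")
```

Passing `sleep` as a parameter is a design choice that keeps the function testable without real waiting. The fallback here is a static string; in practice it might be a cached answer or a smaller model.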

Error Handling

  • All errors logged with context
  • User-facing errors are helpful, not technical
  • No stack traces exposed to users
  • Error rates monitored and alerted
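
One way to keep technical detail in the logs and out of the response is sketched below. `handle_request`, `GENERIC_MESSAGE`, and the response shape are hypothetical; the point is only that the stack trace goes to the logger while the user sees a plain message and a request ID they can quote to support.

```python
import logging
import uuid
from typing import Any, Callable, Dict

logger = logging.getLogger("app")

GENERIC_MESSAGE = "Something went wrong on our side. Please try again in a moment."


def handle_request(handler: Callable[[dict], Any], request: dict) -> Dict[str, Any]:
    """Run handler(request); log failures with context, return a safe response."""
    request_id = str(uuid.uuid4())
    try:
        result = handler(request)
        return {"ok": True, "request_id": request_id, "result": result}
    except Exception:
        # logger.exception attaches the stack trace to the log record only.
        logger.exception(
            "request failed",
            extra={"request_id": request_id, "user_id": request.get("user_id")},
        )
        # The user sees a helpful message and an ID, never the exception text.
        return {"ok": False, "request_id": request_id, "error": GENERIC_MESSAGE}
```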

Recovery

  • System restarts cleanly after failures
  • No data loss on crashes
  • Failed jobs can be retried

2. Performance

Latency

  • p50 latency meets user expectations
  • p95 latency acceptable (within what users will tolerate waiting)
  • p99 latency bounded (no infinite hangs)
  • Timeout thresholds set appropriately
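
To check the percentile items above against recorded latencies, something like the following could work. It uses the nearest-rank definition of a percentile, which is one of several common definitions; the function names and the budget values in the usage note are illustrative assumptions.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(samples_ms: Sequence[float], budgets_ms: Dict[int, float]) -> dict:
    """Compare observed percentiles against a per-percentile budget."""
    report = {}
    for pct, budget in budgets_ms.items():
        observed = percentile(samples_ms, pct)
        report[f"p{pct}"] = {
            "observed_ms": observed,
            "budget_ms": budget,
            "ok": observed <= budget,
        }
    return report
```

For example, `latency_report(samples, {50: 800, 95: 3000, 99: 10000})` would flag whichever percentile exceeds its budget. The budgets themselves have to come from your own user expectations.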

Throughput

  • Load tested at 2x expected peak
  • Rate limiting in place
  • Backpressure handling defined
  • Horizontal scaling path exists
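
Rate limiting can be implemented in many ways; a token bucket is one common choice and is sketched here as an illustration, not as the method this checklist prescribes. The class name and parameters are assumptions. This version is single-process and not thread-safe.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilling at `rate_per_s` tokens per second."""

    def __init__(
        self,
        rate_per_s: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)  # start full
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend `cost` tokens if available, else False."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

When `allow()` returns `False`, the caller decides what backpressure means: reject with a "try again later" response, queue the request, or shed lower-priority work.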

Cost

  • Cost per request calculated
  • Cost monitoring in place
  • Budget alerts configured
  • Cost optimization opportunities identified
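
Cost per request usually follows from token counts and per-token prices. The sketch below assumes prices are quoted per million tokens with separate input and output rates; the `Pricing` type, the example prices in the usage note, and the 80% alert threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical price sheet, in currency units per million tokens."""
    input_per_million: float
    output_per_million: float


def request_cost(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Cost of one request given its token counts."""
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


def should_alert(spend_so_far: float, budget: float, alert_fraction: float = 0.8) -> bool:
    """Fire a budget alert once spend reaches a fraction of the budget."""
    return spend_so_far >= alert_fraction * budget
```

With `Pricing(input_per_million=3.0, output_per_million=15.0)`, a request using 1,000 input tokens and 500 output tokens costs (1,000 × 3 + 500 × 15) / 1,000,000 = 0.0105.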

3. Security

Input Validation

  • All user inputs sanitized
  • Prompt injection mitigations in place
  • Input length limits enforced
  • File uploads validated (if applicable)
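
A minimal sketch of the sanitization and length-limit items, assuming plain-text input. `MAX_INPUT_CHARS` and `InputRejected` are hypothetical names, and the limit is a placeholder. Note that this does not mitigate prompt injection on its own; that needs separate measures.

```python
import unicodedata

MAX_INPUT_CHARS = 4000  # placeholder limit; choose one that fits your context window


class InputRejected(ValueError):
    """Raised when user input fails validation."""


def validate_user_input(text: object, max_chars: int = MAX_INPUT_CHARS) -> str:
    """Return cleaned text, or raise InputRejected."""
    if not isinstance(text, str):
        raise InputRejected("input must be text")
    # Drop control characters (Unicode category Cc) except newline and tab.
    cleaned = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    ).strip()
    if not cleaned:
        raise InputRejected("input is empty")
    if len(cleaned) > max_chars:
        raise InputRejected(f"input exceeds {max_chars} characters")
    return cleaned
```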

Output Safety

  • No PII in logs
  • No secrets in responses
  • Output filtering for harmful content
  • Compliance with content policies

Access Control

  • Authentication required
  • Authorization enforced
  • API keys rotatable
  • Audit logging enabled

4. Observability

Logging

  • All requests logged
  • All responses logged (with PII redaction)
  • Error context captured
  • Logs searchable and queryable
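
Redaction before logging might look like the sketch below. The two patterns are deliberately simple assumptions that catch common email and phone formats; they will miss other kinds of PII (names, addresses, IDs), so treat this as a starting point only.

```python
import re

# Simple illustrative patterns; real PII detection needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def log_record(request_id: str, prompt: str, response: str) -> dict:
    """Build a structured log entry with both sides of the exchange redacted."""
    return {
        "request_id": request_id,
        "prompt": redact(prompt),
        "response": redact(response),
    }
```

Emitting entries as structured records (for example, one JSON object per line) is what makes them searchable and queryable later.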

Metrics

  • Latency histograms
  • Error rates
  • Token usage
  • Cost per request

Alerting

  • Error rate alerts
  • Latency alerts
  • Cost alerts
  • On-call rotation defined
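
An error rate alert can be as simple as a rolling window over recent request outcomes. The class below is a hypothetical in-process sketch; most teams would express the same rule in their monitoring system instead. The window size, threshold, and minimum sample count are placeholder values.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the error rate over the last `window` requests."""

    def __init__(self, window: int = 100, threshold: float = 0.05, min_samples: int = 20) -> None:
        self.outcomes = deque(maxlen=window)  # True = success, False = error
        self.threshold = threshold
        self.min_samples = min_samples

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def record(self, ok: bool) -> bool:
        """Record one outcome; return True if the alert should fire."""
        self.outcomes.append(ok)
        # Too few samples would make one early error look like an outage.
        if len(self.outcomes) < self.min_samples:
            return False
        return self.error_rate() > self.threshold
```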

5. Quality

Evaluation

  • Golden test set exists
  • Automated evaluation pipeline
  • Regression detection active
  • Quality dashboard available
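
A golden test set with regression detection can start very small. In this sketch, each golden case is assumed to be a prompt paired with a check function, and `model_fn` is a hypothetical callable wrapping your system; the 0.02 tolerance is a placeholder.

```python
from typing import Callable, Sequence, Tuple

# A golden case: (prompt, check) where check(output) returns True on a pass.
GoldenCase = Tuple[str, Callable[[str], bool]]


def pass_rate(model_fn: Callable[[str], str], golden_set: Sequence[GoldenCase]) -> float:
    """Fraction of golden cases whose output passes its check."""
    results = [check(model_fn(prompt)) for prompt, check in golden_set]
    if not results:
        raise ValueError("golden set is empty")
    return sum(results) / len(results)


def has_regressed(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the pass rate dropped below the baseline by more than tolerance."""
    return current < baseline - tolerance
```

Running this in CI on every prompt or model change, and failing the build when `has_regressed` returns `True`, is one way to make regression detection "active" in the sense of the checklist.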

Monitoring

  • Production quality sampling
  • User feedback collection
  • Drift detection
  • Human review queue

6. Operations

Deployment

  • Zero-downtime deployment
  • Rollback procedure documented
  • Feature flags for new functionality
  • Blue-green or canary deployment
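
For canary deployments, one common approach is to route a fixed percentage of users to the new version by hashing a stable identifier. The function below is a sketch under that assumption; `in_canary` and the choice of user ID as the key are illustrative.

```python
import hashlib


def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place a user in the canary group.

    The same user always gets the same answer for a given percentage,
    and raising the percentage only ever adds users to the group.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # 0..99
    return bucket < percent
```

Setting `percent` to 0 doubles as an instant rollback switch, provided the value is read from configuration at request time.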

Configuration

  • All config externalized (not hardcoded)
  • Secrets managed properly (never hardcoded or committed)
  • Environment-specific configs
  • Config changes audited
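
Externalized configuration often means reading settings from environment variables at startup. The variable names (`LLM_API_KEY`, `LLM_MODEL`, `LLM_TIMEOUT_S`) and defaults below are hypothetical; a secrets manager would typically populate the environment.

```python
import os
from dataclasses import dataclass, field
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_key: str = field(repr=False)  # keep the secret out of repr() and logs
    model: str = "example-model"
    timeout_s: float = 10.0


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing fast if the key is missing."""
    api_key = env.get("LLM_API_KEY")
    if not api_key:
        raise RuntimeError("LLM_API_KEY is not set")
    return Settings(
        api_key=api_key,
        model=env.get("LLM_MODEL", "example-model"),
        timeout_s=float(env.get("LLM_TIMEOUT_S", "10")),
    )
```

Failing at startup when a required value is missing is a deliberate choice: it surfaces misconfiguration during deployment, not on the first user request.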

Documentation

  • Architecture documented
  • Runbook for common issues
  • On-call procedures
  • Escalation paths

7. Compliance

Data

  • Data retention policy implemented
  • PII handling documented
  • Data deletion capability
  • Data export capability (if required)

Legal

  • Terms of service updated
  • Privacy policy covers AI usage
  • Content policy defined
  • User consent mechanisms (if required)

Minimum Viable Production

Not everything is equally important. Here’s the absolute minimum before any launch:

Must Have

  • Timeout on all LLM calls
  • Basic error handling
  • Request/response logging
  • Cost tracking
  • One form of quality monitoring

Should Have

  • Automated evaluation
  • Alerting on errors
  • Rate limiting
  • Fallback behavior

Nice to Have

  • Advanced observability
  • Sophisticated quality analysis
  • Automated rollback

Red Flags

If you see any of these, you’re not ready:

  • “We will add monitoring later.” You will not, and you will regret it.
  • “The error handling is TODO.” Production will find every edge case.
  • “We don’t know how much it costs.” You will when the bill arrives.
  • “We tested it manually.” Manual testing does not scale.
  • “It works in staging.” Staging lies.

Pre-Launch Day Checklist

The day before launch:

  • All checklist items verified
  • Load test completed
  • Rollback tested
  • On-call scheduled
  • Stakeholders notified
  • Support prepared

Launch day:

  • Monitoring dashboards open
  • Team available for issues
  • Canary percentage set
  • Rollback ready

Post-launch:

  • Monitor for 24-48 hours
  • Review error logs
  • Check cost actuals vs. estimates
  • Gather initial user feedback

License: This playbook is licensed under CC BY-SA 4.0.

You're free to share and adapt this content for any purpose, including commercial use. Attribution required. Derivatives must use the same license.