Operations · Beginner

Production Readiness Checklist

What 'production-ready' actually means for AI systems. The checklist we use before any launch.

Author Synapti Collective
Published January 23, 2026
Read time 8 min
Tags: production, checklist, deployment, operations

What Production-Ready Actually Means

“It works on my machine” is not production-ready. Neither is “it demos well.” Production-ready means the system can handle real users, real scale, and real failures — without you babysitting it.

This checklist covers what we verify before any AI system goes live.

The Checklist

1. Reliability

Failure Handling

  • All LLM calls have timeouts configured
  • Retry logic with exponential backoff
  • Fallback behavior defined for all failure modes
  • Circuit breakers prevent cascade failures
  • Graceful degradation vs. hard failures documented
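The first two items above can be sketched together: a wrapper that retries a flaky call with exponential backoff and jitter, while passing a per-request timeout through to the client. This is a minimal sketch — `fn` stands in for whatever LLM client call you actually use, and the delay values are illustrative.

```python
import random
import time

def call_with_retries(fn, max_attempts=3, base_delay=0.5, timeout=10.0):
    """Retry a callable with exponential backoff and jitter.

    `fn` stands in for any LLM client call; the real call should also
    pass this per-request timeout through to the client library.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(timeout=timeout)
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: let the fallback layer take over
            # exponential backoff: 0.5s, 1s, 2s, ... plus jitter to avoid
            # synchronized retry storms from many clients at once
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.1)
            time.sleep(delay)
```

The final `raise` is where your fallback behavior and circuit breaker would hook in: after retries are exhausted, the error should reach a layer that degrades gracefully rather than the user.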

Error Handling

  • All errors logged with context
  • User-facing errors are helpful, not technical
  • No stack traces exposed to users
  • Error rates monitored and alerted
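One pattern that covers the first three items at once: log the full technical context server-side, and hand the user a friendly message plus a correlation ID they can quote to support. The `request_id` field and the wording here are illustrative, not a prescribed format.

```python
import logging
import uuid

logger = logging.getLogger("app")

def handle_error(exc, request_id=None):
    """Log full technical context; return a safe, helpful user message.

    `request_id` is a hypothetical correlation ID so support can find
    the log entry without the user ever seeing a stack trace.
    """
    request_id = request_id or uuid.uuid4().hex[:8]
    logger.error("request %s failed: %r", request_id, exc, exc_info=True)
    return {
        "error": "Something went wrong on our end. Please try again.",
        "reference": request_id,  # the user quotes this to support
    }
```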

Recovery

  • System restarts cleanly after failures
  • No data loss on crashes
  • Failed jobs can be retried

2. Performance

Latency

  • p50 latency meets user expectations
  • p95 latency acceptable (slow, but users will still wait)
  • p99 latency bounded (no request hangs indefinitely)
  • Timeout thresholds set appropriately
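Computing these percentiles from raw latency samples is a one-liner with the standard library. A sketch, assuming samples are in milliseconds; `statistics.quantiles` with `n=100` returns the 99 cut points between percentiles 1–99, so index `k - 1` is the k-th percentile.

```python
import statistics

def latency_percentiles(samples_ms):
    """Compute p50/p95/p99 from a list of latency samples (milliseconds)."""
    # quantiles(n=100) yields 99 cut points; cuts[k-1] is the k-th percentile
    cuts = statistics.quantiles(sorted(samples_ms), n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```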

Throughput

  • Load tested at 2x expected peak
  • Rate limiting in place
  • Backpressure handling defined
  • Horizontal scaling path exists
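Rate limiting with backpressure is often a token bucket: allow bursts up to a capacity, refill at a steady rate, and reject (or queue) when empty. A minimal single-process sketch — real deployments usually back this with Redis or an API gateway.

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter (a sketch, not production code).

    Allows bursts up to `capacity` requests, refilling at `rate`
    tokens per second.
    """
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A `False` return is your backpressure signal: respond with HTTP 429, or shed load before it reaches the model.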

Cost

  • Cost per request calculated
  • Cost monitoring in place
  • Budget alerts configured
  • Cost optimization opportunities identified
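Cost per request is simple arithmetic once you log token counts. A sketch — the per-1,000-token prices here are placeholders; look up your provider's current pricing rather than hardcoding it.

```python
def cost_per_request(prompt_tokens, completion_tokens,
                     price_in_per_1k, price_out_per_1k):
    """Estimate the dollar cost of one LLM call.

    Prices are per 1,000 tokens and are illustrative placeholders.
    """
    return (prompt_tokens / 1000) * price_in_per_1k \
         + (completion_tokens / 1000) * price_out_per_1k
```

Summing this over a day of logged traffic gives the baseline for budget alerts.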

3. Security

Input Validation

  • All user inputs sanitized
  • Prompt injection mitigations in place
  • Input length limits enforced
  • File uploads validated (if applicable)
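Length limits and basic sanitization are the cheap first layer of the list above. A minimal sketch — the limit is an illustrative number, and real prompt-injection defense needs far more than string checks (instruction hierarchy, output filtering, tool-call restrictions).

```python
MAX_INPUT_CHARS = 8000  # illustrative; tune to your context budget

def validate_input(text):
    """Reject empty or oversized input; strip control characters.

    A first-pass filter only -- not a prompt-injection defense.
    """
    if not isinstance(text, str) or not text.strip():
        raise ValueError("Input must be non-empty text.")
    if len(text) > MAX_INPUT_CHARS:
        raise ValueError(f"Input exceeds {MAX_INPUT_CHARS} characters.")
    # drop non-printable control characters, keeping newlines and tabs
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")
```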

Output Safety

  • No PII in logs
  • No secrets in responses
  • Output filtering for harmful content
  • Compliance with content policies
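Keeping PII out of logs and responses usually means a redaction pass at the boundary. The regexes below are deliberately crude illustrations (emails and US-style phone numbers only); real PII redaction should use a vetted library or service, not two patterns.

```python
import re

# Illustrative patterns only -- not a complete PII taxonomy
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact(text):
    """Mask obvious PII before a string reaches logs or responses."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)
```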

Access Control

  • Authentication required
  • Authorization enforced
  • API keys rotatable
  • Audit logging enabled

4. Observability

Logging

  • All requests logged
  • All responses logged (with PII redaction)
  • Error context captured
  • Logs searchable and queryable
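Searchable, queryable logs usually means one structured (JSON) line per request/response pair. A sketch, assuming you already have a redaction function and some log sink; `logger_write` and the field names here are illustrative.

```python
import json
import time

def log_request(logger_write, request_id, prompt, response,
                redactor=lambda s: s):
    """Emit one structured log line per request/response pair.

    `redactor` is a hook for PII redaction; `logger_write` stands in
    for whatever sink you use (stdout, a log shipper, etc.).
    """
    entry = {
        "ts": time.time(),
        "request_id": request_id,
        "prompt": redactor(prompt),
        "response": redactor(response),
    }
    logger_write(json.dumps(entry))
```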

Metrics

  • Latency histograms
  • Error rates
  • Token usage
  • Cost per request

Alerting

  • Error rate alerts
  • Latency alerts
  • Cost alerts
  • On-call rotation defined

5. Quality

Evaluation

  • Golden test set exists
  • Automated evaluation pipeline
  • Regression detection active
  • Quality dashboard available
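The core of the pipeline above is small: run the golden set, score it, and flag a regression when the score drops below a threshold. A sketch — `model_fn`, the exact-match scoring, and the 0.9 threshold are all placeholders for your real system and metric.

```python
def run_golden_set(model_fn, golden_cases, threshold=0.9):
    """Run a golden test set and flag regressions.

    `golden_cases` is a list of (input, expected) pairs. Exact-match
    scoring is a placeholder; most AI systems need a fuzzier metric.
    """
    passed = sum(1 for inp, expected in golden_cases
                 if model_fn(inp) == expected)
    score = passed / len(golden_cases)
    return {"score": score, "passed": passed,
            "regression": score < threshold}
```

Wire this into CI so a failing golden set blocks the deploy, not the postmortem.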

Monitoring

  • Production quality sampling
  • User feedback collection
  • Drift detection
  • Human review queue

6. Operations

Deployment

  • Zero-downtime deployment
  • Rollback procedure documented
  • Feature flags for new functionality
  • Blue-green or canary deployment
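The routing decision behind a canary deployment can be as small as hash-based bucketing: the same user always lands on the same side, so their experience is consistent while you ramp the percentage. Names here are illustrative; feature-flag services implement the same idea with more controls.

```python
import hashlib

def in_canary(user_id, percent):
    """Deterministically route `percent`% of users to the canary.

    Hashing keeps each user on the same side across requests,
    unlike random sampling per request.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < percent
```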

Configuration

  • All config externalized (not hardcoded)
  • Secrets managed properly (secret store or vault, never committed to code)
  • Environment-specific configs
  • Config changes audited
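Externalized config in its simplest form is reading from the environment and failing fast at startup when a required value is missing. A sketch — the variable names (`LLM_API_KEY`, `LLM_MODEL`, `LLM_TIMEOUT_S`) are illustrative, not a standard.

```python
import os

def load_config(env=os.environ):
    """Read config from the environment instead of hardcoding it.

    Missing required values should fail at startup, not deep
    inside a request.
    """
    try:
        api_key = env["LLM_API_KEY"]  # secret: injected, never committed
    except KeyError:
        raise RuntimeError("LLM_API_KEY is not set") from None
    return {
        "api_key": api_key,
        "model": env.get("LLM_MODEL", "default-model"),
        "timeout_s": float(env.get("LLM_TIMEOUT_S", "10")),
    }
```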

Documentation

  • Architecture documented
  • Runbook for common issues
  • On-call procedures
  • Escalation paths

7. Compliance

Data

  • Data retention policy implemented
  • PII handling documented
  • Data deletion capability
  • Data export capability (if required)

Legal

  • Terms of service updated
  • Privacy policy covers AI usage
  • Content policy defined
  • User consent mechanisms (if required)

Minimum Viable Production

Not everything is equally important. Here’s the absolute minimum before any launch:

Must Have

  • Timeout on all LLM calls
  • Basic error handling
  • Request/response logging
  • Cost tracking
  • One form of quality monitoring

Should Have

  • Automated evaluation
  • Alerting on errors
  • Rate limiting
  • Fallback behavior

Nice to Have

  • Advanced observability
  • Sophisticated quality analysis
  • Automated rollback

Red Flags

If you see any of these, you’re not ready:

  • “We’ll add monitoring later” — You won’t, and you’ll regret it
  • “The error handling is TODO” — Production will find every edge case
  • “We don’t know how much it costs” — You will when the bill arrives
  • “We tested it manually” — Manual testing doesn’t scale
  • “It works in staging” — Staging lies

Pre-Launch Day Checklist

The day before launch:

  • All checklist items verified
  • Load test completed
  • Rollback tested
  • On-call scheduled
  • Stakeholders notified
  • Support prepared

Launch day:

  • Monitoring dashboards open
  • Team available for issues
  • Canary percentage set
  • Rollback ready

Post-launch:

  • Monitor for 24-48 hours
  • Review error logs
  • Check cost actuals vs. estimates
  • Gather initial user feedback

License: This playbook is licensed under CC BY-SA 4.0.

You're free to share and adapt this content for any purpose, including commercial use. Attribution required. Derivatives must use the same license.