“Production-ready” is one of the most overloaded phrases in software. Engineering teams mean “it works on staging.” Product means “we can demo it.” Customers mean “it works at 2am on a Sunday when we need it.” Only one of those definitions matters.
Here’s the 25-item checklist we run through before any launch we put our name on — five categories of five items each. If you can’t tick every box, you’re not production-ready. You’re demo-ready.

Reliability
- SLOs published. A specific number, not “high uptime.” 99.9% availability leaves a downtime budget of about 43 minutes a month. Pick yours, write it down.
- Auto-scaling configured. Both horizontal and vertical. Tested by actually generating load, not just assumed because the platform says it can.
- Health checks at every layer. Load balancer → service → dependencies. Health checks return real signals, not just “process is alive.”
- Idempotent retries. Every external call (especially webhooks + payment) safe to fire twice. Tested on staging by deliberately retrying.
- Backups + restore drill. The first time you try to restore a backup must not be during an incident. Drill it quarterly.
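One way to make the "safe to fire twice" item concrete is to key each incoming event by an ID and skip the side effect on a repeat. The sketch below is only an illustration: `PaymentWebhookHandler`, the `evt_123` ID, and the in-memory dictionary are hypothetical stand-ins, and it is not safe for concurrent use as written.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentWebhookHandler:
    """Applies each webhook event at most once, keyed by its event ID."""

    # Hypothetical in-memory store. A real service would use a database
    # table or cache with a unique constraint so the check survives restarts.
    processed: dict[str, str] = field(default_factory=dict)
    charges_applied: int = 0

    def handle(self, event_id: str, amount_cents: int) -> str:
        # A repeated delivery returns the stored result and skips the side effect.
        if event_id in self.processed:
            return self.processed[event_id]
        self.charges_applied += 1  # stand-in for the real side effect
        result = f"applied:{amount_cents}"
        self.processed[event_id] = result
        return result


handler = PaymentWebhookHandler()
first = handler.handle("evt_123", 4999)
second = handler.handle("evt_123", 4999)  # deliberate retry of the same event
```

After the retry, `first` and `second` are the same stored result and `charges_applied` is still 1, which is the property the staging drill should confirm.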
Observability
- Structured logs. JSON, tagged with request ID, user ID, tenant ID. Searchable in one tool (Datadog, Loki, or CloudWatch; pick one).
- Metrics with real signal. Request rate, error rate, latency (p50/p95/p99). Not just CPU/RAM, which tell you little about what users actually experience.
- Distributed traces. Cross-service traces with OpenTelemetry. When something is slow, you can answer “which call?” in seconds.
- Alerting on burn rate, not point-in-time spikes. Burn-rate alerts fire when your error budget is being consumed faster than expected, not on every two-minute spike.
- On-call rota. A real human gets paged. The rota is documented, rotated weekly, and includes a backup.
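The burn-rate item comes down to simple arithmetic, sketched here under stated assumptions. Burn rate is the observed error rate divided by the error rate the SLO allows, so 1.0 means the budget runs out exactly at the end of the window. The function names are hypothetical, and the 14.4 threshold is one commonly cited value (it spends 2% of a 30-day budget in one hour), not something this checklist prescribes.

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    return (1.0 - slo) * days * 24 * 60


def burn_rate(errors: int, requests: int, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are accruing."""
    if requests == 0:
        return 0.0
    observed_error_rate = errors / requests
    allowed_error_rate = 1.0 - slo
    return observed_error_rate / allowed_error_rate


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    # Require both windows to exceed the threshold, so a brief spike
    # that has already ended does not page anyone.
    return long_window_rate >= threshold and short_window_rate >= threshold
```

For a 99.9% SLO, `error_budget_minutes(0.999)` comes out at roughly 43.2, and 20 errors in 1,000 requests is a burn rate of about 20. A short spike with a quiet long window, such as `should_page(2.0, 30.0)`, does not page.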
Security
- Authentication + RBAC. Real identity (not shared accounts), role-based authorization, audit-logged.
- Secrets management. Secrets in a real vault (AWS Secrets Manager, 1Password, HashiCorp Vault). Never in env files committed to Git.
- Encryption everywhere. TLS for in-transit, AES-256 for at-rest. Both turned on, both tested.
- Penetration tested. By a third party, with a written report. Findings remediated or formally accepted with a date.
- Audit logs. Who-did-what, immutable, retained per your industry’s requirements (90 days to 7 years).
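To show how role-based authorization and audit logging fit together, here is a minimal sketch in which every decision, allowed or denied, is recorded. The role names, permission strings, and `authorize` function are all hypothetical, and the Python list only stands in for a proper append-only store.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical role-to-permission mapping; real roles come from your product.
ROLE_PERMISSIONS = {
    "viewer": frozenset({"invoice:read"}),
    "admin": frozenset({"invoice:read", "invoice:refund"}),
}


@dataclass(frozen=True)
class AuditEntry:
    """One who-did-what record; frozen so an entry cannot be edited later."""
    timestamp: str
    user_id: str
    action: str
    allowed: bool


audit_log: list[AuditEntry] = []  # stand-in for an append-only store


def authorize(user_id: str, role: str, action: str) -> bool:
    """Return whether the role permits the action, recording the decision."""
    allowed = action in ROLE_PERMISSIONS.get(role, frozenset())
    audit_log.append(
        AuditEntry(datetime.now(timezone.utc).isoformat(), user_id, action, allowed)
    )
    return allowed
```

Logging denials as well as approvals is a deliberate choice here: failed attempts are often the entries an investigation needs most.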
Performance
- p95 latency budget. Under 200ms for the API, under 2s for first paint on the web. Measured continuously.
- CDN in front of static assets. CloudFront, Cloudflare, Fastly: any of them works. Images, JS, and CSS should not be served from your origin.
- Caching strategy documented. Application cache, CDN cache, DB cache. Each layer with explicit invalidation rules.
- Database indexes for hot paths. Slow query log running; the 5 slowest queries indexed.
- Bundle budget. The web app’s first JS payload under 200KB. CI fails if a PR pushes it over.
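The bundle budget is easy to enforce with a small script in the pipeline. The sketch below assumes "200KB" means 200 KiB of uncompressed size on disk; the checklist does not say whether the figure is raw or compressed, so adjust to match your own definition. The function names are hypothetical.

```python
from pathlib import Path

BUDGET_BYTES = 200 * 1024  # assumption: "200KB" read as 200 KiB, uncompressed


def bundle_exit_code(size_bytes: int, budget_bytes: int = BUDGET_BYTES) -> int:
    """Exit code for a CI step: 0 when under budget, 1 otherwise."""
    return 0 if size_bytes < budget_bytes else 1


def check_bundle(bundle_path: str) -> int:
    """Measure the file on disk and report it against the budget."""
    size = Path(bundle_path).stat().st_size
    code = bundle_exit_code(size)
    status = "OK" if code == 0 else "FAIL"
    print(f"{status}: {bundle_path} is {size} bytes (budget {BUDGET_BYTES})")
    return code
```

A CI step would pass the result of `check_bundle(...)` for your entry bundle to `sys.exit`, so a non-zero code fails the build.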
Operations
- CI/CD with automated tests. Every merge runs tests + deploys to staging. Production deploy is one button or one approval, never a manual sequence.
- Infrastructure as code. Terraform, Pulumi or CloudFormation. The whole stack can be recreated from a Git commit.
- Feature flags. Ship dark, roll out gradually, kill fast. Day-one investment, lifelong payback.
- Runbooks for the top 10 incidents. Database down, third-party API down, payment processor down. Five-minute response, not a thirty-minute scramble.
- Public status page. Customers find out about incidents from you, not from each other. Atlassian Statuspage or Better Uptime: cheap, fast, professional.
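"Ship dark, roll out gradually, kill fast" can be sketched as a percentage rollout with a kill switch. The example hashes the flag name and user ID into a stable bucket from 0 to 99, so a given user stays in or out as the percentage rises. `in_rollout` and its parameters are hypothetical; a hosted flag service would normally handle this for you.

```python
import hashlib


def in_rollout(flag: str, user_id: str, percent: int,
               kill_switch: bool = False) -> bool:
    """Deterministically decide whether a user sees a flagged feature."""
    if kill_switch or percent <= 0:
        return False  # shipped dark, or killed
    if percent >= 100:
        return True
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket 0..99
    return bucket < percent
```

Because the bucket depends only on the flag and the user, anyone included at 10% is still included at 50%, and setting `kill_switch=True` turns the feature off for everyone without a deploy.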
How to use the list
Print the checklist. Walk the team through it 4 weeks before launch. Anything red on the list either ships fixed, ships with a documented mitigation, or doesn’t ship. No “we’ll fix it in the first sprint after launch” — those items are still red on the list a quarter later.
We bake this into our Ongoing Maintenance retainer so the checklist stays green months after launch — not just on day one.
Takeaways
- “Production-ready” needs a written definition or it’s meaningless.
- Five categories: reliability, observability, security, performance, operations.
- The restore drill matters more than the backup itself.
- Health checks at every layer beat one big “is the app alive” check.
- If you can’t tick every item, write down the mitigation. Don’t pretend.