Production-ready in 25 items

“Production-ready” is one of the most overloaded phrases in software. Engineering teams mean “it works on staging.” Product means “we can demo it.” Customers mean “it works at 2am on a Sunday when we need it.” Only one of those definitions matters.

Here’s the 25-item checklist we run through before any launch we put our name on — five categories of five items each. If you can’t tick every box, you’re not production-ready. You’re demo-ready.

Reliability

SLOs published. A specific number, not “high uptime.” 99.9% availability leaves a downtime budget of about 43 minutes per 30-day month. Pick yours, write it down.
Auto-scaling configured. Both horizontal and vertical. Tested by actually generating load, not just assumed because the platform says it can.
Health checks at every layer. Load balancer → service → dependencies. Health checks return real signals, not just “process is alive.”
Idempotent retries. Every external call (especially webhooks and payments) is safe to fire twice. Tested on staging by deliberately retrying.
Backups + restore drill. The first time you try to restore a backup must not be during an incident. Drill it quarterly.
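
To make the idempotent-retries item concrete, here is a minimal Python sketch of deduplicating on an idempotency key. The class name, the in-memory dict, and the charge handler are all illustrative assumptions, not a prescribed implementation. A real service would keep the keys in a durable store and insert them atomically so that two concurrent retries cannot both run the handler.

```python
from typing import Any, Callable, Dict


class IdempotentProcessor:
    """Runs a handler at most once per idempotency key (hypothetical helper)."""

    def __init__(self, handler: Callable[[Dict[str, Any]], Any]) -> None:
        self._handler = handler
        # Assumption: an in-memory dict stands in for a durable key store.
        self._results: Dict[str, Any] = {}

    def process(self, idempotency_key: str, payload: Dict[str, Any]) -> Any:
        # A retry with a key we have already seen returns the stored result
        # instead of repeating the side effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = self._handler(payload)
        self._results[idempotency_key] = result
        return result


charges = []  # records every side effect so a double charge would be visible


def charge_card(payload: Dict[str, Any]) -> str:
    charges.append(payload["amount_cents"])
    return f"charge-{len(charges)}"


processor = IdempotentProcessor(charge_card)
first = processor.process("order-1001", {"amount_cents": 4999})
retry = processor.process("order-1001", {"amount_cents": 4999})  # deliberate retry
```

The deliberate retry is the staging test in miniature: the second call returns the same result as the first, and only one charge is recorded.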

Observability

Structured logs. JSON, tagged with request ID, user ID, tenant ID. Searchable in one tool (Datadog, Loki, or CloudWatch; pick one).
Metrics with real signal. Request rate, error rate, latency (p50/p95/p99). Not just CPU/RAM; those tell you nothing about user experience.
Distributed traces. Cross-service traces with OpenTelemetry. When something is slow, you can answer “which call?” in seconds.
Alerting on burn rate, not point-in-time spikes. Burn-rate alerts fire when your error budget is being consumed faster than expected, not on every two-minute spike.
On-call rota. A real human gets paged. The rota is documented, rotated weekly, and includes a backup.
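
The burn-rate item is easier to reason about with the arithmetic written out. The sketch below assumes a request-based SLO and a two-window rule: page only when both a long and a short window burn the budget faster than a threshold. The function names and the 14.4 threshold (a commonly cited starting point for a one-hour and five-minute window pair) are assumptions to tune, not part of the checklist.

```python
from typing import Tuple


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def burn_rate(errors: int, requests: int, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1.0 - slo)


def should_page(
    long_window: Tuple[int, int],   # (errors, requests) over e.g. the last hour
    short_window: Tuple[int, int],  # (errors, requests) over e.g. the last 5 min
    slo: float,
    threshold: float = 14.4,        # assumed starting value; tune per service
) -> bool:
    # Requiring both windows filters out brief spikes (short window only)
    # and incidents that have already recovered (long window only).
    long_burn = burn_rate(long_window[0], long_window[1], slo)
    short_burn = burn_rate(short_window[0], short_window[1], slo)
    return long_burn >= threshold and short_burn >= threshold


budget = error_budget_minutes(0.999)               # about 43.2 minutes
sustained = should_page((150, 10_000), (20, 1_000), 0.999)
brief_spike = should_page((10, 10_000), (50, 1_000), 0.999)
```

With these inputs the sustained failure pages and the brief spike does not, which is the behaviour the checklist item asks for.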

Security

Authentication + RBAC. Real identity (not shared accounts), role-based authorization, audit-logged.
Secrets management. Secrets in a real vault (AWS Secrets Manager, 1Password, HashiCorp Vault). Never in env files committed to Git.
Encryption everywhere. TLS for in-transit, AES-256 for at-rest. Both turned on, both tested.
Penetration tested. By a third party, with a written report. Findings remediated or formally accepted with a date.
Audit logs. Who-did-what, immutable, retained per your industry’s requirements (90 days to 7 years).
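
One way to approach “immutable” for audit logs is to make tampering detectable by chaining each entry to the hash of the previous one. The sketch below is an illustration under stated assumptions: the field names, the all-zero genesis value, and the in-memory list are hypothetical, and a production system would pair this with append-only storage and access controls.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # assumed placeholder hash for the first entry


def _digest(prev_hash: str, record: Dict[str, Any]) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(
    log: List[Dict[str, Any]], actor: str, action: str, target: str, timestamp: str
) -> Dict[str, Any]:
    """Append a who-did-what record linked to the previous entry's hash."""
    record = {"actor": actor, "action": action, "target": target, "ts": timestamp}
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"record": record, "prev_hash": prev_hash, "hash": _digest(prev_hash, record)}
    log.append(entry)
    return entry


def verify_chain(log: List[Dict[str, Any]]) -> bool:
    """Return False if any entry was edited, removed, or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log: List[Dict[str, Any]] = []
append_entry(audit_log, "alice", "role.grant", "user:bob", "2024-01-01T10:00:00Z")
append_entry(audit_log, "bob", "invoice.delete", "invoice:42", "2024-01-01T10:05:00Z")
intact = verify_chain(audit_log)
```

Editing any earlier record changes its hash, so every later link fails verification.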

Performance

p95 latency budget. Under 200 ms for the API, under 2 s for first paint on the web. Measured continuously.
CDN in front of static assets. CloudFront, Cloudflare, Fastly, anything. Images, JS, and CSS should not be served from your origin.
Caching strategy documented. Application cache, CDN cache, DB cache. Each layer with explicit invalidation rules.
Database indexes for hot paths. Slow query log running; the top 5 queries indexed.
Bundle budget. The web app’s first JS payload is under 200 KB. CI fails if a PR pushes it over.
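
For the p95 latency budget, it helps to be explicit about how the percentile is computed. The sketch below uses the nearest-rank method over a batch of samples; the function names and sample data are assumptions, and a monitoring stack may estimate percentiles differently (for example from histograms).

```python
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in 1..100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def within_budget(samples_ms: Sequence[float], budget_ms: float = 200, pct: int = 95) -> bool:
    """True when the chosen percentile is strictly under the latency budget."""
    return percentile(samples_ms, pct) < budget_ms


# Nineteen fast requests and one very slow one.
latencies_ms = [10.0] * 19 + [5000.0]
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
ok = within_budget(latencies_ms)
```

In this sample p95 is within budget while p99 exposes the slow request, which is one reason to track p50, p95, and p99 together.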

Operations

CI/CD with automated tests. Every merge runs tests + deploys to staging. Production deploy is one button or one approval, never a manual sequence.
Infrastructure as code. Terraform, Pulumi or CloudFormation. The whole stack can be recreated from a Git commit.
Feature flags. Ship dark, roll out gradually, kill fast. Day-one investment, lifelong payback.
Runbooks for the top 10 incidents. Database down, third-party API down, payment processor down. Five-minute response, not a thirty-minute scramble.
Public status page. Customers find out about incidents from you, not from each other. Atlassian Statuspage or Better Uptime: cheap, fast, professional.
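
“Ship dark, roll out gradually, kill fast” can be sketched as a percentage rollout with a kill switch. This is a minimal illustration, not a flag service: the function names and the flag name are hypothetical, and hosted flag tools provide the same idea with targeting rules and an admin UI.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int, killed: bool = False) -> bool:
    # Kill switch wins over everything, so turning a feature off is instant.
    if killed:
        return False
    # 0 means shipped dark; 100 means fully rolled out.
    return bucket(flag, user_id) < rollout_percent


users = [f"user-{n}" for n in range(200)]
dark = [u for u in users if is_enabled("new-checkout", u, 0)]
partial = [u for u in users if is_enabled("new-checkout", u, 10)]
wider = [u for u in users if is_enabled("new-checkout", u, 50)]
everyone = [u for u in users if is_enabled("new-checkout", u, 100)]
killed = [u for u in users if is_enabled("new-checkout", u, 100, killed=True)]
```

Because the bucket is derived from a hash instead of a random draw, a user who sees the feature at 10% still sees it at 50%, so widening a rollout never flips anyone back.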

How to use the list

Print the checklist. Walk the team through it 4 weeks before launch. Anything red on the list either ships fixed, ships with a documented mitigation, or doesn’t ship. No “we’ll fix it in the first sprint after launch” — those items are still red on the list a quarter later.

We bake this into our Ongoing Maintenance retainer so the checklist stays green months after launch — not just on day one.

Takeaways

“Production-ready” needs a written definition or it’s meaningless.
Five categories: reliability, observability, security, performance, operations.
The restore drill matters more than the backup itself.
Health checks at every layer beat one big “is the app alive” check.
If you can’t tick every item, write down the mitigation. Don’t pretend.

