SLOs that change how you ship

Service Level Objectives are one of those topics where the difference between textbook knowledge and operational knowledge is enormous. The textbook says “set a target, measure against it.” The reality is: SLOs are the single tool we’ve found that lets engineering and product teams have a non-emotional conversation about “ship faster” vs “stabilize.”

Done right, an SLO is a number you can defend with data, and an error budget you can spend. Done wrong, it’s a vanity dashboard that nobody acts on.

The three concepts (in plain English)

SLI — what you measure

A Service Level Indicator is the raw signal. “p99 latency on the checkout endpoint.” “Percentage of API requests that return 5xx.” “Time from page request to interactive.” Concrete, observable, comparable over time.
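As a minimal sketch, here is the "percentage of API requests that return 5xx" signal turned into a success-rate SLI. The `Request` record and `success_rate_sli` function are hypothetical; in practice these counts usually come from a load balancer or metrics pipeline. Counting 4xx responses as "good" is an assumption: client errors are usually not the service's fault, but some teams count them anyway.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the caller (hypothetical record)
    latency_ms: float  # server-side latency in milliseconds


def success_rate_sli(requests: list[Request]) -> float:
    """Fraction of requests that did NOT return a 5xx status.

    4xx responses count as "good" here (an assumption: they are usually
    client mistakes, not service failures).
    """
    if not requests:
        return 1.0  # no traffic in the window: treated as meeting the indicator
    good = sum(1 for r in requests if not 500 <= r.status <= 599)
    return good / len(requests)


sample = [
    Request(200, 120.0),
    Request(503, 40.0),   # server error: counts against the SLI
    Request(404, 80.0),   # client error: counted as good
    Request(200, 95.0),
]
print(success_rate_sli(sample))  # 0.75
```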

SLO — the target for that signal

The objective is the line you draw on the SLI graph. “p99 latency < 300ms for 99.5% of the time.” The percentage is a service-quality commitment you’re making to your users.
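One way to read "p99 latency < 300ms for 99.5% of the time" is time-based: split the period into windows, compute p99 in each, and count the share of windows under the threshold. The sketch below assumes one-minute windows, the nearest-rank percentile, and skipping windows with no traffic; all three are illustrative choices, and the function names are hypothetical.

```python
from __future__ import annotations


def p99(latencies_ms: list[float]) -> float:
    """99th percentile using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n) in exact integer arithmetic
    return ordered[rank - 1]                 # rank is 1-based


def latency_slo(windows: list[list[float]], threshold_ms: float = 300.0,
                objective: float = 0.995) -> tuple[bool, float]:
    """Evaluate "p99 < threshold_ms for `objective` of the time".

    Each inner list holds the latencies seen in one window (assumed to be
    one minute). Windows with no traffic are skipped.
    Returns (objective met?, share of good windows).
    """
    evaluated = [w for w in windows if w]
    if not evaluated:
        return True, 1.0
    good = sum(1 for w in evaluated if p99(w) < threshold_ms)
    ratio = good / len(evaluated)
    return ratio >= objective, ratio


fast_minute = [100.0] * 100               # p99 = 100ms
slow_minute = [100.0] * 98 + [400.0] * 2  # p99 = 400ms
windows = [fast_minute] * 996 + [slow_minute] * 4
print(latency_slo(windows))  # (True, 0.996)
```

Four slow minutes out of 1,000 still meets a 99.5% objective; six would not.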

Error budget — the inverse

If your SLO is 99.9% success, your error budget is 0.1%. In a 30-day window, that’s 43 minutes of downtime (or equivalent failed requests). The budget is what you’re ALLOWED to spend — and the entire point is that you spend it deliberately, not accidentally.
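The arithmetic behind the 43 minutes: a 30-day window is 30 × 24 × 60 = 43,200 minutes, and 0.1% of that is 43.2 minutes. The sketch below computes the same budget in minutes and in failed requests. The function names and the 2,000,000-request month are illustrative, not from any real system; the SLO is passed as a string so `Decimal` keeps it exact.

```python
from decimal import Decimal

MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: str, window_days: int = 30) -> Decimal:
    """Minutes of full outage the SLO tolerates in the window.

    The SLO is a string so Decimal keeps it exact; with floats,
    1 - 0.999 is not exactly 0.001.
    """
    return (1 - Decimal(slo)) * window_days * MINUTES_PER_DAY


def error_budget_requests(slo: str, total_requests: int) -> int:
    """Failed requests the SLO tolerates out of `total_requests`."""
    return int((1 - Decimal(slo)) * total_requests)  # int() truncates toward zero


print(error_budget_minutes("0.999"))              # 43.200
print(error_budget_requests("0.999", 2_000_000))  # 2000
```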

The conversation that SLOs unlock

Without an SLO, the “ship faster” vs “stabilize” conversation is a values argument. Product wants velocity; SRE wants reliability; nobody has a way to be objectively right.

With an SLO, the conversation is mechanical. Budget healthy? Ship faster — take more risks, roll out experiments aggressively. Budget depleted? Freeze risky work, pay down reliability debt. The data drives the decision; the team has cover for either path.
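The mechanical rule can be written down almost literally. This is a minimal sketch, assuming the decision depends only on the share of the budget consumed so far; `release_mode`, `ReleaseMode`, and the `freeze_threshold` parameter are hypothetical names, and where to set the threshold is a team policy choice.

```python
from enum import Enum


class ReleaseMode(Enum):
    SHIP_FASTER = "take more risks, roll out experiments aggressively"
    STABILIZE = "freeze risky work, pay down reliability debt"


def release_mode(budget_minutes: float, consumed_minutes: float,
                 freeze_threshold: float = 1.0) -> ReleaseMode:
    """Pick a release mode from error-budget consumption alone.

    freeze_threshold is the fraction of the budget at which risky work stops.
    1.0 freezes only once the budget is fully spent; a lower value
    (0.75 below is a hypothetical choice) freezes earlier and leaves headroom.
    """
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    consumed_fraction = consumed_minutes / budget_minutes
    if consumed_fraction >= freeze_threshold:
        return ReleaseMode.STABILIZE
    return ReleaseMode.SHIP_FASTER


print(release_mode(43.2, 12.0).name)                         # SHIP_FASTER
print(release_mode(43.2, 50.0).name)                         # STABILIZE
print(release_mode(43.2, 35.0, freeze_threshold=0.75).name)  # STABILIZE
```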

Setting your first SLO

Pick the user journey, not the system. “Can the user complete checkout?” is a journey. “Is the database up?” is a system. SLOs are about user-facing outcomes.
Pick TWO indicators for that journey. Usually a latency SLI and a success-rate SLI. One alone is gameable (a service that fails fast looks great on latency); two together describe quality properly.
Set the SLO at your current performance level, plus a buffer. If your current p99 latency is 280ms, set 300ms. Set targets you’re already meeting; tighten them over time as you improve.
Review monthly. Burn rate, missed minutes, what consumed the budget. Adjust if reality has changed.
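For the monthly review, a common definition of burn rate is the observed error ratio divided by the error ratio the SLO allows (1 − SLO): 1.0 spends the budget exactly by the end of the window, 3.0 spends it three times too fast. The sketch below computes that and ranks what consumed the budget. The function names, cause labels, and numbers are hypothetical.

```python
from __future__ import annotations


def burn_rate(failed: int, total: int, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    1.0 means the budget runs out exactly at the end of the window;
    above 1.0 it runs out early.
    """
    if total == 0:
        return 0.0  # no traffic, nothing burned
    allowed_error_ratio = 1.0 - slo
    return (failed / total) / allowed_error_ratio


def budget_consumed_by_cause(bad_minutes: dict[str, float],
                             budget_minutes: float) -> list[tuple[str, float]]:
    """Share of the budget each cause consumed, largest first."""
    shares = [(cause, minutes / budget_minutes) for cause, minutes in bad_minutes.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Illustrative numbers only.
print(round(burn_rate(failed=30, total=10_000, slo=0.999), 2))  # 3.0
review = budget_consumed_by_cause(
    {"dependency timeout": 4.32, "bad deploy": 21.6}, budget_minutes=43.2
)
for cause, share in review:
    print(f"{cause}: {share:.0%} of budget")
```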

Mistakes that turn SLOs into vanity metrics

Setting SLOs you can’t enforce. “99.99% uptime” looks great on a vendor pitch but means nothing if no one stops deploys when the budget is gone.
Measuring infrastructure, not user experience. Database CPU is not an SLI; checkout success is.
Setting one SLO for the whole product. Different surfaces have different reliability needs. Marketing pages can tolerate worse latency than the checkout endpoint.
Not reviewing budget burn weekly. The budget is most useful as a leading indicator. If you’re only looking at it once a quarter, you’re looking too late.
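Why weekly matters: even a crude projection turns budget burn into a leading indicator. This sketch assumes the average daily burn so far continues linearly (real burn is bursty), and the function names are hypothetical.

```python
import math


def days_until_exhausted(budget_minutes: float, consumed_minutes: float,
                         days_elapsed: float) -> float:
    """Days left before the budget runs out if the average burn so far continues."""
    if days_elapsed <= 0:
        raise ValueError("days_elapsed must be positive")
    if consumed_minutes <= 0:
        return math.inf  # nothing burned yet: no projected exhaustion
    daily_burn = consumed_minutes / days_elapsed
    remaining = max(budget_minutes - consumed_minutes, 0.0)
    return remaining / daily_burn


def exhausts_early(budget_minutes: float, consumed_minutes: float,
                   days_elapsed: float, window_days: float = 30) -> bool:
    """True if, at the current pace, the budget runs out before the window ends."""
    left = days_until_exhausted(budget_minutes, consumed_minutes, days_elapsed)
    return days_elapsed + left < window_days


# Week one of a 30-day window: half of a 43.2-minute budget is already gone.
print(round(days_until_exhausted(43.2, 21.6, days_elapsed=7), 1))  # 7.0
print(exhausts_early(43.2, 21.6, days_elapsed=7))                 # True
```

Read after week one, the projection says the budget is gone around day 14, with 16 days of the window still to run. A quarterly look would notice long after the freeze should have started.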

How we approach this

For products under our Ongoing Maintenance engagement, we publish SLO dashboards as part of the operating cadence. The monthly review is grounded in budget burn, not gut feel — which is what makes the product/engineering conversation tractable.

Takeaways

SLI is what you measure; SLO is your target; error budget is what you spend.
Pick user-journey SLIs, not infrastructure metrics.
Start at current performance + a buffer. Tighten over time.
Budget healthy = ship faster. Budget depleted = stabilize. The data picks.
