The Discipline of Error Budgets: Turning SLOs from Theater into Decisions

Most teams that adopt Service Level Objectives stop one step short of the thing that makes them useful. They define an SLO (say, 99.9% availability for the API over a 30-day window), put it on a dashboard, and congratulate themselves on their engineering maturity. Then nothing happens. Deploys go out the same way. Features ship at the same pace. Nobody acts on the number. The missing piece is the error budget, and the discipline of actually treating it as a budget.

An error budget is the complement of the SLO: 100% minus the target. If your SLO is 99.9% availability over 30 days, your error budget is 0.1% of the window's 43,200 minutes, which is roughly 43 minutes of downtime over the month, or about 1.4 minutes per day if you spread it evenly. Treat that budget like real money. You can spend it on planned maintenance, on risky deploys, on chaos experiments, on schema migrations that briefly take a service down. What you do not get to do is overspend it without consequences.
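The arithmetic is simple enough to sketch in a few lines. The function name `error_budget` is made up for illustration, and the sketch assumes a pure time-based availability SLO, which is the framing used above.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Allowed downtime for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    # The budget is the complement of the SLO, applied to the whole window.
    return window * (1.0 - slo)


budget = error_budget(0.999, timedelta(days=30))
minutes = budget.total_seconds() / 60
print(f"{minutes:.1f} minutes per 30 days")  # 43.2 minutes per 30 days
print(f"{minutes / 30:.2f} minutes per day")  # 1.44 minutes per day
```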

The conversation the budget enables

The point of the error budget is the conversation it forces. Before the budget exists, the conversation between product (which wants to ship features) and operations (which wants the system to be reliable) is lopsided: every new feature is a vague risk weighed against an undefined reliability target. After the budget exists, the conversation has explicit numbers. "This feature requires a deploy that we estimate has a 5% chance of triggering a 10-minute incident. That is 0.5 minutes of expected error budget consumption. We have 25 minutes of budget left for the month. Yes, ship it." Or: "We have already burned 40 of 43 minutes this month. A single 10-minute incident would put us well over. The next risky deploy has to wait."
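The two quoted decisions can be written down as a small check. This is a sketch, not a policy recommendation. The names `PlannedChange` and `should_ship` are hypothetical, and the rule used here (ship only if the worst-case incident still fits in the remaining budget) is an assumption chosen to match both examples above; a team could just as reasonably gate on expected cost or keep a reserve.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedChange:
    name: str
    incident_probability: float  # estimated chance the change triggers an incident
    incident_minutes: float      # estimated downtime if it does

    @property
    def expected_cost_minutes(self) -> float:
        """Probability-weighted budget consumption."""
        return self.incident_probability * self.incident_minutes


def should_ship(change: PlannedChange, remaining_minutes: float) -> bool:
    """Assumed rule: ship only if the worst case still fits in the budget."""
    return change.incident_minutes <= remaining_minutes


deploy = PlannedChange("feature deploy", incident_probability=0.05, incident_minutes=10.0)
print(f"expected cost: {deploy.expected_cost_minutes:.1f} min")  # expected cost: 0.5 min
print(should_ship(deploy, remaining_minutes=25.0))         # True: 25 minutes left
print(should_ship(deploy, remaining_minutes=43.0 - 40.0))  # False: 40 of 43 burned
```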

This is the move from "we should be careful" to "here are the numbers, and here is what they say." It is the same move that latency budgets make for performance, and it is what turns reliability from a vague aspiration into an engineering primitive that fits into the same backlog the features fit into.

Choosing the SLO

The most common mistake in SLO-setting is starting with a percentage that sounds good rather than one that reflects what users actually experience. 99.99% sounds impressive but corresponds to about 4 minutes of monthly downtime, a tenth of the budget that 99.9% allows. That is far harder to engineer and is rarely justified by user research. 99% sounds embarrassingly low but allows about 7 hours of monthly downtime, which most internal tools tolerate easily.

The right SLO is the one that, if you missed it, would mean users were materially worse off. For a developer API, that might be 99.9% — every nine matters because a customer hitting a failure has to write retry logic. For an internal admin tool, it might be 99.5%, because the user can come back in 10 minutes. For a billing webhook delivery service, it might be 99.99%, because every dropped delivery is a real revenue event.

SLOs should also distinguish between availability (the service responds) and quality (the response is correct and fast enough). Many teams collapse these into one number and lose the ability to act on the difference. A separate latency SLO ("99% of requests complete within 500 ms") and availability SLO ("99.9% of requests get a successful response") give you a clearer signal about what is actually broken.
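A minimal sketch of keeping the two indicators separate, computed over the same batch of requests. The `Request` record and both function names are hypothetical. Treating any non-5xx status as "successful" is an assumption, as is counting every request (including failed ones) in the latency indicator. The sample data is synthetic.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time to complete the request


def availability_sli(requests: Sequence[Request]) -> float:
    """Fraction of requests that succeeded (assumed: any non-5xx status)."""
    if not requests:
        raise ValueError("no requests in window")
    return sum(r.status < 500 for r in requests) / len(requests)


def latency_sli(requests: Sequence[Request], threshold_ms: float = 500.0) -> float:
    """Fraction of all requests that completed within the threshold."""
    if not requests:
        raise ValueError("no requests in window")
    return sum(r.latency_ms <= threshold_ms for r in requests) / len(requests)


# Synthetic window: mostly healthy, with a slow tail and one server error.
window = (
    [Request(200, 120.0)] * 980
    + [Request(200, 900.0)] * 19
    + [Request(503, 30.0)]
)

availability = availability_sli(window)
latency = latency_sli(window)
print(f"availability {availability:.3f}, target met: {availability >= 0.999}")
print(f"latency      {latency:.3f}, target met: {latency >= 0.99}")
```

In this window the availability target of 99.9% is met while the latency target of 99% is not. A single blended number would hide that difference.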

Burn rate alerting

The naive way to alert on an SLO is to alert when it has been violated. By that point, the violation has happened — the alert is a postmortem trigger, not a prevention tool. The better pattern is burn rate alerting: alert when the rate of error budget consumption is high enough that you will violate the SLO if it continues.

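The standard pattern pairs each burn-rate threshold with two windows: a long one to establish that the burn is significant and a short one to confirm it is still happening. A burn rate of 1x means consuming the budget at exactly the pace that exhausts it at the end of the SLO window. For a 30-day SLO, a fast-burn alert fires at 14.4x measured over 1 hour, confirmed over the last 5 minutes, and catches acute incidents; at that rate, 2% of the monthly budget is gone in an hour. A slow-burn alert fires at 6x measured over 6 hours, confirmed over the last 30 minutes, and catches slow degradations that would otherwise hide in the noise. Both fire pages. Anything slower than that is a ticket, not a page.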

Google's SRE Workbook has detailed worked examples of multi-window burn rate alerting that are worth reading once and referencing forever. The exact numbers depend on your SLO, but the pattern — burn rate, not violation — is the right one to internalize.
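A sketch of the check itself, under stated assumptions. The thresholds and window pairs below (14.4x over 1 hour with a 5-minute confirmation, 6x over 6 hours with a 30-minute confirmation) follow the Workbook's example for a 30-day SLO as best recalled here, so verify them against the source before relying on them. The names `BurnRateRule`, `burn_rate`, and `should_page` are hypothetical, and the error ratios are invented inputs that a metrics system would normally supply.

```python
from dataclasses import dataclass
from typing import Mapping, Sequence


@dataclass(frozen=True)
class BurnRateRule:
    long_window: str   # establishes that the burn is significant
    short_window: str  # confirms the burn is still happening
    threshold: float   # burn-rate multiple that triggers a page


def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio as a multiple of the ratio the SLO allows."""
    return error_ratio / (1.0 - slo)


def should_page(
    error_ratios: Mapping[str, float], slo: float, rules: Sequence[BurnRateRule]
) -> bool:
    """Page if any rule is over its threshold in BOTH of its windows."""
    return any(
        burn_rate(error_ratios[rule.long_window], slo) >= rule.threshold
        and burn_rate(error_ratios[rule.short_window], slo) >= rule.threshold
        for rule in rules
    )


PAGE_RULES = [
    BurnRateRule(long_window="1h", short_window="5m", threshold=14.4),
    BurnRateRule(long_window="6h", short_window="30m", threshold=6.0),
]

# Error ratios per window (failed requests / total requests), invented values.
acute = {"5m": 0.020, "30m": 0.012, "1h": 0.016, "6h": 0.004}
slow = {"5m": 0.008, "30m": 0.008, "1h": 0.009, "6h": 0.007}
quiet = {"5m": 0.0005, "30m": 0.0008, "1h": 0.002, "6h": 0.003}

for name, ratios in [("acute", acute), ("slow", slow), ("quiet", quiet)]:
    print(name, should_page(ratios, slo=0.999, rules=PAGE_RULES))
```

Requiring both windows is the design choice that matters: the long window keeps a brief blip from paging anyone, and the short window stops the alert from firing long after the problem has cleared.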

What you do when the budget is gone

The hardest part of error budgets is enforcing them. The whole point is that when the budget is exhausted, you stop the activities that consume it. In practice this means: no risky deploys. No new feature work that touches the foundation. The team's time goes to fixing the things that consumed the budget. This is unpopular precisely because it works.
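Enforcement is easier to hold to when the rule is written down somewhere a deploy pipeline can read it. The sketch below is one hypothetical encoding of the policy described above. The `ChangeKind` categories and the `is_allowed` function are illustrative names, not part of any real tool, and a real policy would likely need more categories and an escalation path.

```python
from enum import Enum


class ChangeKind(Enum):
    RELIABILITY_FIX = "reliability fix"
    RISKY_DEPLOY = "risky deploy"
    FOUNDATION_FEATURE = "feature work touching the foundation"


# Activities that stop once the budget is exhausted.
BLOCKED_WHEN_EXHAUSTED = {ChangeKind.RISKY_DEPLOY, ChangeKind.FOUNDATION_FEATURE}


def is_allowed(kind: ChangeKind, budget_minutes: float, consumed_minutes: float) -> bool:
    """Block budget-consuming work once the budget is gone; always allow fixes."""
    exhausted = consumed_minutes >= budget_minutes
    return not (exhausted and kind in BLOCKED_WHEN_EXHAUSTED)


print(is_allowed(ChangeKind.RISKY_DEPLOY, budget_minutes=43.2, consumed_minutes=20.0))     # True
print(is_allowed(ChangeKind.RISKY_DEPLOY, budget_minutes=43.2, consumed_minutes=45.0))     # False
print(is_allowed(ChangeKind.RELIABILITY_FIX, budget_minutes=43.2, consumed_minutes=45.0))  # True
```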

Teams without explicit budget enforcement drift into a state where they routinely miss the SLO and there are no consequences. Engineers learn that the SLO is theater. Reliability degrades. The next big incident comes faster than expected. Teams with enforcement learn the opposite lesson: that reliability is a feature, that it is bought with the same time and attention as any other feature, and that running out of budget is a useful forcing function for the work that is otherwise easy to defer.

The budget is not a target

A subtle anti-pattern: treating the error budget as something to spend down to zero every month. The budget is a ceiling, not a quota. If you finish the month with 30 minutes of budget left over, that is good. It means you have headroom for the inevitable bad month. It does not mean you should have shipped more risky changes.

A related anti-pattern: padding the SLO so the budget is comfortable. If your real availability is 99.95% and you set your SLO at 99% so the budget is always full, you have abolished the discipline. The SLO has to be meaningful enough that hitting it is genuinely uncertain, and missing it is genuinely a problem.

Where this fits in our stack

Each of our four products (DocuMint, CronPing, FlagBit, and WebhookVault) publishes its uptime SLO and tracks burn rate against it. CronPing's webhook delivery SLO is the strictest because it is the function customers most depend on; DocuMint's PDF generation SLO is more relaxed because customers are usually willing to retry. The budget is the mechanism that converts vague reliability ambitions into the specific operational decisions that determine whether the service actually stays up. Without the budget, an SLO is just a number. With it, the number is a contract.