Latency Budgets in Production: How to Set Them, Defend Them, and Update Them

Most APIs do not have a latency budget. They have a vague aspiration that the API should be "fast" and a vaguer aspiration that this should remain true as the system grows. The result is predictable: latency drifts upward, no single change is identifiable as the cause, and by the time someone notices, the regression has been compounding for months. A latency budget is the discipline that prevents this.

A latency budget is a contract: at a specified percentile, for a specified endpoint or class of endpoints, response time will not exceed a specified value. Concretely: "the p99 of POST /api/v1/checkout will not exceed 400 ms over a rolling 24-hour window." Once stated, every architectural decision can be checked against it. New downstream service? Account for its latency in the budget. New synchronous database call? Same. Without the budget, the conversation is "is 50 ms too much?" With the budget, the conversation is "we have 80 ms remaining in our 400 ms budget; this 50 ms call leaves 30 ms for everything else, which is too tight."

The first decision: which percentile

The percentile choice matters more than the number. Median (p50) latency budgets are nearly useless because they hide the long tail where users actually live. The p50 of a service that returns in 100 ms just over half the time and times out at 30 seconds the rest of the time is 100 ms, and the budget would say everything is fine. The right percentiles for almost all user-facing systems are p95, p99, and p99.9.
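To see how much a median can hide, here is a minimal sketch in Python. The sample data is hypothetical (1,000 requests, 51% fast and the rest hitting a 30-second timeout), and the nearest-rank percentile definition is an assumption; monitoring systems often interpolate or estimate percentiles from histograms instead.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical sample: 510 fast responses, 490 requests that hit the timeout.
latencies_ms = [100] * 510 + [30_000] * 490

p50 = percentile(latencies_ms, 50)  # looks healthy
p99 = percentile(latencies_ms, 99)  # shows the timeouts
print(f"p50={p50} ms, p99={p99} ms")
```

A budget written against p50 would pass on this data. One written against p99 would fail, which is the behaviour you want.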

p99 is the right default for most product surfaces. It catches the cases that real users see often enough to complain about while still being statistically stable enough to monitor reliably. p95 is appropriate for interactive surfaces where the experience needs to feel uniformly fast, like search-as-you-type. p99.9 is appropriate for high-stakes operations like checkout, where even 1 in 1000 requests timing out is a real revenue loss.

The mistake is to set a single percentile and ignore the rest. Set the budget at p99, but track p50, p95, p99, p99.9, and the absolute maximum, because each of them tells you something different. A drift in p50 with stable p99 means the typical request is getting slower while the tail is holding steady. A drift in p99 with stable p50 means the tail is getting worse while the typical request is unaffected. Both are real, and they have different causes.

The second decision: per-endpoint or per-class

You cannot set a single budget for the whole API. A simple GET that returns a single row should not have the same budget as a multi-step checkout that hits Stripe. The temptation to set a global budget is the temptation to manage by averages, and it produces the same problems averages always do.

The right granularity is usually per-endpoint for the dozen most important endpoints and per-class for the rest. A class might be "all read endpoints," "all write endpoints," "all webhook delivery endpoints." Each class has a budget that the endpoints in it must respect. The most-trafficked or most-revenue-critical endpoints get individual budgets. Everything else inherits from the class.
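The inheritance rule can be sketched as a lookup with a fallback. Only the 400 ms checkout budget comes from this article. The class names, the class numbers, and the `budget_for` helper are hypothetical and serve only as illustration.

```python
# p99 budgets in milliseconds. The class values are made-up placeholders.
CLASS_BUDGETS_MS = {
    "read": 150,
    "write": 300,
    "webhook": 1000,
}

# Individually budgeted endpoints, keyed by (method, path).
ENDPOINT_BUDGETS_MS = {
    ("POST", "/api/v1/checkout"): 400,
}


def budget_for(method, path, endpoint_class):
    """Return the endpoint's own budget if it has one, else its class budget."""
    return ENDPOINT_BUDGETS_MS.get((method, path), CLASS_BUDGETS_MS[endpoint_class])
```

Keeping a table like this in the repository, next to the route definitions, is one way to put the budgets where engineers will actually see them.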

Document the budgets where engineers will see them — in the service README, in the runbook, near the route definition itself. A budget that lives in a wiki nobody reads is not a budget, it is folklore.

Sub-budgets: where the time goes

Once you have a top-level budget, decompose it. A 400 ms p99 budget for a checkout endpoint might break down as: 50 ms for request parsing and authentication, 100 ms for two database round trips, 200 ms for the synchronous Stripe call, 30 ms for response serialization and network egress, 20 ms of slack. Now any change can be evaluated against the relevant sub-budget. The Stripe call growing from 200 to 250 ms is not just a 25% regression in the dependency; it is a 50 ms hit against 20 ms of slack, which the budget cannot absorb.
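The same arithmetic as a small sketch, using the numbers from the checkout example. The component names and the `slack_ms` and `fits` helpers are hypothetical. Treating sub-budgets as simply additive is a simplifying assumption that matches how the breakdown is described here.

```python
CHECKOUT_BUDGET_MS = 400

# Sub-budgets from the checkout example; whatever is left over is slack.
SUB_BUDGETS_MS = {
    "parse_and_auth": 50,
    "database": 100,
    "stripe": 200,
    "serialize_and_egress": 30,
}


def slack_ms(total_ms, parts):
    """Budget left over after every component's allocation is subtracted."""
    return total_ms - sum(parts.values())


def fits(total_ms, parts, component, new_ms):
    """Would the budget still hold if `component` took `new_ms`?"""
    proposed = {**parts, component: new_ms}
    return slack_ms(total_ms, proposed) >= 0


print(slack_ms(CHECKOUT_BUDGET_MS, SUB_BUDGETS_MS))             # 20
print(fits(CHECKOUT_BUDGET_MS, SUB_BUDGETS_MS, "stripe", 250))  # False
```

A check like this can run in code review or CI, so the question of whether a change fits gets a yes or no answer instead of a discussion.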

Sub-budgets are how budgets stop being theater and start being decision tools. Without them, the budget is set at the top level and ignored at the level where decisions actually happen. With them, an engineer evaluating a new approach can see immediately whether it fits.

The retry trap

Retries are the most common reason latency budgets are silently violated in production. A downstream call with a 200 ms timeout and three retries under exponential backoff does not have a 200 ms worst case. If the backoff waits are 200, 400, and 800 ms, the waits alone total 1,400 ms, and four attempts that each run to the 200 ms timeout add another 800 ms, so the worst case exceeds two seconds even though no single attempt takes longer than 200 ms. Most teams set the retry policy and the budget independently, never check the math, and then are surprised when p99 spikes during partial outages.
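One way to check the math is to compute the worst case directly from the retry policy. This sketch assumes that every attempt can run until its timeout and that the backoff wait doubles from a base delay. The 200 ms base delay is an assumption chosen to match the 200, 400, 800 sequence above, and `worst_case_ms` is a hypothetical helper.

```python
def worst_case_ms(timeout_ms, retries, base_backoff_ms, factor=2):
    """Worst-case latency when every attempt times out.

    Counts the first attempt plus `retries` further attempts, each lasting
    up to `timeout_ms`, plus the backoff waits between them.
    """
    attempts = retries + 1
    backoff_total = sum(base_backoff_ms * factor**i for i in range(retries))
    return attempts * timeout_ms + backoff_total


# 4 attempts x 200 ms, plus waits of 200 + 400 + 800 ms.
print(worst_case_ms(timeout_ms=200, retries=3, base_backoff_ms=200))  # 2200
```

Comparing this number against the endpoint's budget is the check that is usually skipped when the retry policy and the budget are set independently.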

The fix is to budget for the worst case, not the median case. If your retry policy can produce a 2-second worst case, then your latency budget for that endpoint must accommodate it, or your retry policy must be reduced. The third option, and usually the right one, is to make the remaining budget visible to the retry logic: if you have already used 350 ms of a 400 ms budget on the first attempt, do not spend 800 ms on a retry. Skip it, return the failure, and let the upstream caller decide what to do. This is the deadline-propagation pattern, and it is what lets retry policies and latency budgets coexist honestly.
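A minimal sketch of deadline propagation follows, assuming the downstream call is a Python callable that accepts a `timeout_ms` argument and raises `TimeoutError` when it times out. `Deadline`, `call_with_deadline`, and all parameter names are hypothetical. The rule of retrying only when the backoff wait plus a full attempt still fits is one possible policy, not the only one.

```python
import time


class Deadline:
    """Tracks how much of a request's latency budget is left."""

    def __init__(self, budget_ms, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_ms / 1000

    def remaining_ms(self):
        return max(0.0, (self._expires_at - self._clock()) * 1000)


def call_with_deadline(operation, deadline, attempt_timeout_ms=200,
                       max_attempts=4, base_backoff_ms=200, sleep=time.sleep):
    """Retry `operation` only while the remaining budget covers the backoff
    wait plus one more full attempt; otherwise fail fast."""
    last_error = None
    for attempt in range(max_attempts):
        # No wait before the first attempt; the wait doubles after that.
        wait_ms = base_backoff_ms * 2 ** (attempt - 1) if attempt else 0
        if deadline.remaining_ms() < wait_ms + attempt_timeout_ms:
            break  # not enough budget left: skip the retry
        sleep(wait_ms / 1000)
        try:
            return operation(timeout_ms=attempt_timeout_ms)
        except TimeoutError as exc:
            last_error = exc
    raise TimeoutError("latency budget exhausted") from last_error
```

The request handler would create one `Deadline(400)` when the request arrives and pass the same object to every downstream call, so a slow first step automatically shrinks what later steps are allowed to spend.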

Defending the budget over time

The budget at launch is the easy part. Defending it as the system grows is where most teams fail. Two patterns help:

Continuous monitoring with alerts at the budget, not at some round number. If your budget is p99 = 400 ms, alert at p99 = 380 ms with a warning and at 400 ms with a page. Alerting at "p99 > 1 second" is alerting at the disaster, not at the regression that caused it. By the time you see the disaster, it has been compounding for weeks.

Latency-aware code review. Any change that touches the request path on a budgeted endpoint should be evaluated against the budget. Most teams do this by feel. The teams that hit their budgets reliably do it by checklist. Adding a database call? Note the expected latency. Adding a downstream service call? Note its budget. Removing a cache? Note the cache hit rate and what the miss path costs. The discipline is small at each individual change and load-bearing in aggregate.
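The alerting thresholds from the first pattern can be expressed as a small rule. This is a sketch only: `alert_level` is a hypothetical helper, and the 20 ms warning margin is taken from the 380 ms and 400 ms example above. In practice the rule would live in your monitoring system's own configuration.

```python
def alert_level(p99_ms, budget_ms=400, warn_margin_ms=20):
    """Map an observed p99 to an alert level relative to the budget."""
    if p99_ms >= budget_ms:
        return "page"
    if p99_ms >= budget_ms - warn_margin_ms:
        return "warning"
    return "ok"


for observed in (350, 385, 410):
    print(observed, alert_level(observed))
```

Deriving both thresholds from the budget, instead of hard-coding a round number, means the alerts move with the budget whenever it is deliberately updated.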

When to update the budget

The wrong reason to update the budget is "we missed it for a quarter and want to stop being yelled at." The right reasons are: the workload has fundamentally changed (much larger payloads, much higher write rate, fundamentally different access patterns), or the user-experience research shows the old budget was set incorrectly (users were tolerating something we thought they would not, or vice versa).

Updates should be deliberate and documented. Write the new budget down. Note the date. Note the reason. The history of budget changes is itself a useful artifact when someone six months later asks why the system is slower than it used to be.

The deeper point is that latency budgets are not really about latency. They are about making distributed systems decisions visible and tradeable. Without a budget, every change looks small. With one, the cumulative cost of all the small changes is visible immediately, and the conversations that need to happen actually happen. That is what the budget buys. The number on it matters less than the discipline of having it.

If you operate a small SaaS, the four products in our studio — DocuMint, CronPing, FlagBit, and WebhookVault — each ship with documented latency budgets and the monitoring scaffolding to defend them. The budget is not a magical artifact. It is a habit, and the habit is what scales.