Designing API SLAs: What to Promise, What to Measure, and the Patterns That Survive Real Outages

An SLA is a promise to customers that becomes a constraint on engineering. Getting the promise wrong is expensive in either direction: too generous and you cannot deliver, too conservative and customers do not trust the number.

An API SLA is a public number that customers use to decide whether to build on you. It is also an operational constraint that determines how aggressively your team has to invest in reliability. The number sets the budget for failure and the threshold for compensation. Getting it wrong is expensive: a promise you cannot keep produces credits, churn, and reputation damage; a promise that is too cautious leaves customers preferring competitors who promised more.

Most SLA discussions focus on the headline percentage: 99.9, 99.95, 99.99. The headline number is less important than the details underneath it: what counts as downtime, what time window the measurement covers, what excluded events exist, and what the remedy is when the SLA is missed. A 99.9% SLA with bad definitions is worse than a 99.5% SLA with rigorous definitions, because the customer cannot trust what the headline number represents.

What to measure

The SLA needs to be defined in terms of measurable behaviors. The two common bases are availability (was the API reachable?) and error rate (did requests succeed?). Availability alone is insufficient because an API that returns 500 errors for a percentage of requests is degraded from the customer's perspective even though it is reachable.

The right definition is usually success rate over a measurement window: the number of non-5xx, non-timeout requests divided by total requests in the window. The granularity question is how the window is computed. A monthly aggregate (total successful requests divided by total requests) is a simple metric but smooths over short outages. A minute-granular window, where each minute counts as down if more than X% of its requests failed, gives a different number that better matches customer experience but is more demanding to deliver.
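To make the difference between the two window definitions concrete, here is a minimal sketch. The `MinuteBucket` type, the function names, and the 5% per-minute failure threshold are illustrative assumptions (the text leaves X unspecified), not a prescribed implementation.

```python
from dataclasses import dataclass


@dataclass
class MinuteBucket:
    total: int   # all requests received in this minute
    failed: int  # 5xx responses plus timeouts


def aggregate_success_rate(buckets):
    """Successful requests divided by total requests over the whole window."""
    total = sum(b.total for b in buckets)
    failed = sum(b.failed for b in buckets)
    return 1.0 if total == 0 else (total - failed) / total


def minute_granular_availability(buckets, max_failure_ratio=0.05):
    """Fraction of minutes in which at most max_failure_ratio of requests failed.

    The 5% default is an assumed value for X, chosen only for illustration.
    """
    if not buckets:
        return 1.0
    down = sum(
        1 for b in buckets
        if b.total > 0 and b.failed / b.total > max_failure_ratio
    )
    return (len(buckets) - down) / len(buckets)


# A 3-minute total outage during low traffic inside a 1000-minute window.
window = [MinuteBucket(total=1000, failed=0)] * 997 + [MinuteBucket(total=100, failed=100)] * 3
print(f"aggregate:       {aggregate_success_rate(window):.5f}")
print(f"minute-granular: {minute_granular_availability(window):.5f}")
```

Because the outage in this example falls in a low-traffic period, the aggregate rate stays well above the minute-granular figure, which is the smoothing effect described above.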

The error categorization matters. 4xx responses are customer errors and should not count against the SLA. 5xx and timeouts are server errors and should count. 429 rate limit responses are a gray area: they indicate the server is intentionally rejecting the request, which is not a failure of the server, but the customer experience is failure. The common resolution is to exclude 429 from SLA calculations while monitoring them separately as a customer-impact metric.
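One way to encode this categorization is sketched below. The `Outcome` enum and both function names are hypothetical, and a timeout is represented as a status of `None` purely as a convention for this example.

```python
from enum import Enum
from typing import Iterable, Optional, Tuple


class Outcome(Enum):
    SUCCESS = "success"          # counted in the total, not against the SLA
    SERVER_FAILURE = "failure"   # counted against the SLA
    EXCLUDED = "excluded"        # left out of the SLA calculation entirely


def classify(status: Optional[int]) -> Outcome:
    """Map an HTTP status to its SLA treatment. None means the request timed out."""
    if status is None or 500 <= status <= 599:
        return Outcome.SERVER_FAILURE
    if status == 429:
        return Outcome.EXCLUDED
    # 2xx, 3xx, and the remaining 4xx (customer errors) do not count against the SLA.
    return Outcome.SUCCESS


def sla_success_rate(statuses: Iterable[Optional[int]]) -> Tuple[float, int]:
    """Return (success rate excluding 429s, number of 429s seen)."""
    outcomes = [classify(s) for s in statuses]
    counted = [o for o in outcomes if o is not Outcome.EXCLUDED]
    rate_limited = len(outcomes) - len(counted)
    if not counted:
        return 1.0, rate_limited
    ok = sum(1 for o in counted if o is Outcome.SUCCESS)
    return ok / len(counted), rate_limited
```

Returning the 429 count alongside the rate keeps rate limiting visible as a separate customer-impact signal even though it is excluded from the SLA figure.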

The exclusion list needs to be precise. Maintenance windows, customer-caused incidents (DDoS from a specific customer's misconfiguration), and upstream provider outages (DNS, CDN, payment processor) are typical exclusions. The trap is excluding too much: a customer who experiences an outage caused by your upstream provider does not care that the cause was technically not yours. Generous exclusions produce SLA numbers that look great and customer experiences that do not.

What to promise

The number should be defensible against your actual operational history. The discipline is to compute the SLA you would have hit retroactively over the past 12-24 months with the proposed definition, then add a small safety margin. Promising 99.95 when your retroactive number is 99.92 is a guaranteed-to-fail SLA. Promising 99.5 when your retroactive number is 99.99 is leaving money on the table.
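The retroactive check can be sketched as follows. This assumes the "retroactive number" is the worst single month in the history (a conservative reading, since the SLA is assessed monthly); the candidate targets, the 0.02-point margin, and the sample history are all made-up illustrative values.

```python
from decimal import Decimal

# Candidate headline numbers to choose between (assumed list).
CANDIDATE_TARGETS = [Decimal(t) for t in ("99.0", "99.5", "99.9", "99.95", "99.99")]


def defensible_target(monthly_pct, margin=Decimal("0.02")):
    """Highest candidate target that the worst historical month clears with margin.

    monthly_pct: per-month success percentages computed with the proposed
    SLA definition. Returns None if no candidate is defensible.
    """
    worst = min(monthly_pct)
    viable = [t for t in CANDIDATE_TARGETS if worst - margin >= t]
    return max(viable) if viable else None


# Hypothetical six-month history; a real check would cover 12-24 months.
history = [Decimal(p) for p in ("99.97", "99.99", "99.92", "99.98", "100", "99.96")]
print(defensible_target(history))
```

With this sample history the worst month is 99.92, so 99.95 is ruled out and 99.9 is the highest target that survives the margin. `Decimal` is used so that comparisons at the boundary are exact rather than subject to floating-point rounding.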

The tiering question depends on whether you have a freemium model. The common pattern is to publish the same SLA for all paid tiers and offer enhanced credits for higher tiers. This is simpler than tier-specific SLAs and avoids the complexity of running separate SLA accounting per customer. The customer-facing message is simple: paid customers get a reliability commitment, free-tier customers get best effort.

The remedy structure should be proportional and capped. The standard pattern is a service credit, expressed as a percentage of the monthly fee, for each threshold missed: 10% credit for missing 99.95%, 25% for missing 99.5%, 50% for missing 99%. The credits stack but are capped at the monthly fee. The credit should be issued on explicit customer request rather than automatically, which prevents adversarial-claim behaviors and keeps the accounting simple.
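A minimal sketch of the credit calculation follows. It reads "stack" as meaning that every missed threshold adds its own credit, which is one plausible interpretation rather than the only one; the function name and the `cap_pct` parameter are hypothetical.

```python
# (threshold in percent, credit as percent of monthly fee), taken from the text above.
CREDIT_TIERS = [
    (99.95, 10),
    (99.5, 25),
    (99.0, 50),
]


def service_credit(measured_pct, monthly_fee, cap_pct=100):
    """Credit owed for one month, assuming credits for each missed tier add up."""
    credit_pct = sum(credit for threshold, credit in CREDIT_TIERS if measured_pct < threshold)
    # Cap at the monthly fee (100%) so the credit never exceeds what was paid.
    return monthly_fee * min(credit_pct, cap_pct) / 100


for measured in (99.97, 99.7, 99.2, 98.0):
    print(measured, service_credit(measured, monthly_fee=200))
```

Under this reading the three tiers sum to 85%, so the cap only becomes binding if more tiers or larger percentages are added later.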

The SLA language must be a contractual document, not a marketing claim. The legal review pays off when a customer disputes whether an SLA breach occurred. The document should specify the measurement method, the exclusion list, the remedy structure, and the dispute resolution process. The shorter the document, the better; one page of clear language beats five pages of legalese.

What to measure internally

The internal measurement is what the SLA accounting actually depends on. The two failure modes are unmeasured downtime (the SLA is missed but the team does not see it in dashboards) and measurement that does not match customer experience (the team's dashboard shows green while customers are reporting errors).

The internal measurement should include synthetic monitoring from multiple geographic regions, real user monitoring (RUM) when applicable, and aggregated server-side metrics. The three sources rarely agree perfectly, and the disagreement is informative: server-side metrics underreport network-level issues, synthetic monitoring underreports rare-condition errors, RUM is delayed and noisy.

The dashboard architecture matters. One synthetic check from one region with a one-minute interval will miss most short outages and many longer ones. The right level is three to five regions with 30-second intervals, plus aggregated server-side counts of 5xx responses and timeouts. The dashboards should make the disagreement between sources visible, not hide it behind aggregated green dots.

The alerting threshold should fire well before the SLA is at risk of being missed. A monthly SLA of 99.9 allows 43 minutes of downtime per month. An alert that fires after 30 minutes of degradation gives the team almost no time to respond. The right thresholds are short-window: 5 minutes of elevated error rate triggers paging, 1 hour of elevated error rate triggers incident response. The monthly accounting is a backstop, not the primary detection mechanism.
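The downtime allowance behind these numbers is simple arithmetic, sketched here for a 30-day month. The function name is illustrative.

```python
def allowed_downtime_minutes(sla_pct, days=30):
    """Minutes of full downtime that a given SLA percentage permits over `days`."""
    return days * 24 * 60 * (100 - sla_pct) / 100


for target in (99.0, 99.5, 99.9, 99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_minutes(target):.1f} min per 30 days")
```

For 99.9% this gives 43.2 minutes, so an alert that only fires after 30 minutes of degradation leaves roughly 13 minutes of budget for diagnosis and recovery.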

The error budget framing

The Google SRE book popularized the error budget framing: the reliability target (in the book, an internal SLO rather than the contractual SLA) defines an allowed amount of failure, and the team treats remaining budget as a resource that can be spent on deploys, experiments, and infrastructure changes. When budget is running low, the team prioritizes reliability work. When budget is healthy, the team can take more risks.

This framing is more useful than pure SLA accounting for engineering teams because it converts a binary did-we-miss-the-SLA question into a continuous how-much-budget-remains signal. The integration with deploys is the most concrete: when budget is low, risky deploys are deferred; when budget is healthy, deploys proceed normally. The discipline must be enforced, which is the hard part. Teams that talk about error budgets without actually changing behavior when budget runs low are doing SLA theater.

The budget arithmetic is simple. A monthly SLA of 99.9% allows about 43 minutes of downtime. If 20 minutes have been consumed in the first half of the month, the remaining budget is 23 minutes for the remaining 15 days. The burn rate is computed continuously, and alerts fire when it exceeds the pace that would consume the monthly budget before month-end.
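The same arithmetic as a small sketch, assuming a 30-day period and defining a burn rate of 1.0 as the pace that exhausts the budget exactly at period end. The function name and the returned field names are hypothetical.

```python
def budget_status(sla_pct, consumed_min, elapsed_days, period_days=30):
    """Summarize error budget consumption partway through a period."""
    if elapsed_days <= 0:
        raise ValueError("elapsed_days must be positive")
    budget = period_days * 24 * 60 * (100 - sla_pct) / 100
    # Share of budget used divided by share of time elapsed.
    burn_rate = (consumed_min / budget) / (elapsed_days / period_days)
    return {
        "budget_min": budget,
        "remaining_min": budget - consumed_min,
        "burn_rate": burn_rate,
        "exhausts_early": burn_rate > 1.0,
    }


# The example from the text: 20 minutes consumed by mid-month at 99.9%.
status = budget_status(sla_pct=99.9, consumed_min=20, elapsed_days=15)
print(status)
```

Without rounding, the budget is 43.2 minutes and 23.2 remain. The burn rate in this example is below 1.0, so the month is on track; consuming 30 minutes by the same point would push it above 1.0 and should trigger an alert.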

The communication during breaches

When the SLA is missed, the customer-facing communication determines whether the breach becomes a churn event. The required elements are acknowledgment of the breach (do not pretend it did not happen), root cause analysis (what went wrong), remediation (what changes prevent it from recurring), and credit application (the contractual remedy). The bundle should arrive within a defined window after the breach (one week is typical) regardless of how complete the root cause analysis is.

The postmortem document should be public for major breaches. The pattern matches well-known incident write-ups from Stripe, GitHub, and AWS: a few paragraphs of plain English describing what happened, what was learned, and what is being done about it. Customers value transparency over polished prose. The internal postmortem can be longer; the public version should be readable in 5 minutes and end with concrete commitments.

What an SLA does not buy

The SLA is a contractual promise about availability and error rates. It does not address latency (which requires a separate SLO), correctness (which requires separate guarantees), security (which requires separate guarantees), or feature behavior (which requires separate documentation). Customers who are concerned about any of these need to ask separately, and the answers are usually weaker than the availability SLA because they are harder to measure and harder to deliver.

The SLA also does not buy trust by itself. A customer who has experienced multiple breaches in the past quarter will not be reassured by the contractual remedy structure. The SLA is necessary but not sufficient. The team must actually deliver the reliability the SLA promises, and the historical track record over months and years is what customers actually use to evaluate trustworthiness.

Across our four products

We currently do not publish formal SLAs for DocuMint, CronPing, FlagBit, or WebhookVault. The early-stage decision was to focus on actual reliability before publishing numbers we could not yet defend operationally. The retroactive availability numbers across the past six months would support a 99.9% claim for all four products, but the variance from incident response capacity at a small team scale is high enough that we prefer to wait until the historical baseline supports a defensible 99.95% before formalizing the SLA.

The deeper observation is that the SLA is one of the topics where the customer-facing surface and the internal engineering reality are tightly coupled. The number on the marketing page sets the expectation; the architecture and operational practices set the deliverable. Getting them aligned is the work; the alignment is the asset.
