The RED Method: Observability Metrics Every Service Should Emit

Most observability discussions turn into long shopping lists: custom application metrics, business KPIs, distributed traces, structured logs, real user monitoring, synthetic monitoring, infrastructure metrics, vendor-specific dashboards. By the end of the meeting nobody knows what to instrument first, and the team ends up with either nothing or everything half-finished. The dashboards proliferate; the alerts become noise; the on-call rotation grumbles.

Tom Wilkie's RED method is a deliberate antidote to this sprawl. RED stands for Rate, Errors, Duration: three numbers per service, typically emitted as Prometheus counters and histograms, dashboarded uniformly, and alerted on consistently. It originated from Wilkie's experience at Weaveworks watching what actually catches production incidents across hundreds of services, and it has held up well enough to become a common starting point in the SRE community.

The three numbers

Rate is the number of requests per second your service is handling. It is a counter that increments on every request, scraped at regular intervals to compute a rate. The rate is interesting because changes in it predict load-related incidents and reveal upstream behavior changes that nothing else surfaces. A 5x spike in rate that nobody warned you about usually means an upstream batch process started running, a customer just integrated, or a retry storm is in progress.
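As a rough illustration of how scraped counter values turn into a rate, here is a minimal sketch. It is not Prometheus's actual rate() algorithm (which also extrapolates to the window edges); the function name and the sample data are made up for the example.

```python
from typing import List, Tuple

# One scrape: (unix timestamp in seconds, counter value at that moment).
Sample = Tuple[float, float]


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second increase over the window, tolerating counter resets."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A counter only goes up. A drop means the process restarted from zero,
        # so the whole current value counts as new increase.
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15 seconds, with a restart before the fourth one.
scrapes = [(0.0, 1000.0), (15.0, 1300.0), (30.0, 1600.0), (45.0, 100.0), (60.0, 400.0)]
print(per_second_rate(scrapes))
```

The reset handling is the part worth noticing: without it, a service restart would show up as a large negative rate instead of a brief dip.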

Errors is the number of requests per second your service is failing. It is also a counter, scraped to compute a rate, but split by error type or status code so that 5xx (server errors) and 4xx (client errors) can be tracked separately. The error rate is interesting because changes in it are usually the first signal that something is broken, and the error-rate-divided-by-request-rate gives you the error percentage that maps cleanly to SLO budgets.
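The error-percentage arithmetic is simple enough to sketch directly. The function names and the idea of expressing the result as a multiple of the allowed error budget are illustrative assumptions, not part of the RED method itself.

```python
def error_percentage(error_rate: float, request_rate: float) -> float:
    """Errors per second divided by requests per second, as a percentage."""
    if request_rate <= 0:
        return 0.0  # no traffic means no meaningful percentage
    return 100.0 * error_rate / request_rate


def budget_multiple(error_pct: float, allowed_error_pct: float) -> float:
    """How many times over (or under) the allowed error percentage we are running."""
    return error_pct / allowed_error_pct


# Hypothetical numbers: 0.5 failing requests/s out of 200 requests/s,
# against an SLO that tolerates 0.1% errors.
pct = error_percentage(error_rate=0.5, request_rate=200.0)
print(pct, budget_multiple(pct, allowed_error_pct=0.1))
```

With these made-up inputs the service is failing a quarter of a percent of requests, roughly two and a half times what the assumed SLO allows.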

Duration is the distribution of how long requests take. It is a histogram with buckets covering the latency range your service operates in, scraped to compute percentiles. The duration distribution is interesting because tail latency (p99, p99.9) is what users experience as "slow" or "broken," and changes in the distribution shape often precede outright failures.
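To make "scraped to compute percentiles" concrete, the sketch below estimates a quantile from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. It is written in the spirit of Prometheus's histogram_quantile but is a simplified stand-in, and the bucket data is invented.

```python
import math
from typing import List, Tuple

# (upper bound in seconds, cumulative count of observations <= that bound),
# sorted ascending, with a final +Inf bucket holding the total count.
Bucket = Tuple[float, int]


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Cannot interpolate into the +Inf bucket; report the
                # highest finite bound instead.
                return lower_bound
            in_bucket = cumulative - lower_count
            if in_bucket == 0:
                return upper_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    return lower_bound


# Hypothetical histogram of 1000 requests.
histogram = [(0.05, 600), (0.1, 900), (0.25, 980), (0.5, 998), (1.0, 1000), (math.inf, 1000)]
print(estimate_quantile(0.50, histogram), estimate_quantile(0.99, histogram))
```

Because the estimate assumes an even spread within each bucket, the p99 here can only be as precise as the 0.25s to 0.5s bucket it falls into.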

Three numbers, instrumented consistently across every service. That is the entire method.

What RED catches

The reason RED has held up is that it catches a strikingly large fraction of production incidents. Walk through the major incident categories.

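Code regressions: a deploy introduces a bug that throws on edge cases or returns error responses. The errors counter spikes immediately. Catch rate is near 100% for bugs that surface as errors; a bug that returns wrong data with a 200 falls under silent data corruption, discussed below.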

Capacity exhaustion: traffic exceeds what the service can handle. Duration p99 climbs first, then p95, then errors begin climbing as timeouts and rejections kick in. Catch rate is near 100%, usually with 5-15 minutes of warning before user impact becomes severe.

Dependency failures: a database, cache, or downstream service degrades. Duration climbs across all percentiles. Errors may or may not spike depending on whether the dependency returns errors or just gets slow. Catch rate is high, with the duration metric being the leading indicator.

Traffic surges: a customer integration goes live, a retry storm starts, a marketing campaign launches. Rate climbs sharply. Whether this becomes an incident depends on capacity headroom, but the metric tells you what is happening before downstream effects compound.

Bot attacks and abuse: rate climbs from a small set of IPs, errors may climb as rate limits engage. Both metrics surface the problem; the IP-level breakdown lives elsewhere but RED tells you to look.

The classes RED does not catch directly are: silent data corruption (numbers are wrong but the service returns 200), business-logic bugs that affect specific users (overall metrics look fine), and slow leaks that take days to manifest. These need different instrumentation. But the long tail of "service goes down" or "service gets slow" or "service starts erroring" is reliably caught by two counters and a histogram.

The instrumentation pattern

The RED instrumentation lives at the request boundary of your service, ideally in middleware so that every request flows through the same metric path. In Python with FastAPI, this is a single middleware function that increments a counter on entry, increments error counters on exception, and records duration on exit. In Go, it is similar with the standard net/http middleware pattern.
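The shape of such a middleware can be sketched without any framework, since FastAPI applications speak ASGI. The version below is a minimal illustration under stated assumptions: the in-memory REQUESTS and DURATIONS structures stand in for a real Prometheus counter and histogram, and REDMiddleware, demo_app, and drive are hypothetical names.

```python
import asyncio
import time
from collections import Counter, defaultdict

# In-memory stand-ins for a Prometheus counter and histogram (assumption).
REQUESTS = Counter()            # (method, path, status_code) -> request count
DURATIONS = defaultdict(list)   # (method, path) -> observed seconds


class REDMiddleware:
    """Wraps an ASGI app so every HTTP request passes through one metric path."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        # A real service should label by route template, not raw path.
        method, path = scope["method"], scope["path"]
        status = {"code": 500}  # assume failure until a response actually starts

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        start = time.perf_counter()
        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            # Runs on success and on exception, so nothing escapes unmeasured.
            DURATIONS[(method, path)].append(time.perf_counter() - start)
            REQUESTS[(method, path, status["code"])] += 1


async def demo_app(scope, receive, send):
    if scope["path"] == "/boom":
        raise RuntimeError("simulated handler bug")
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def drive(path):
    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        pass

    try:
        scope = {"type": "http", "method": "GET", "path": path}
        await REDMiddleware(demo_app)(scope, receive, send)
    except RuntimeError:
        pass  # a real server would turn this into a 500 response


asyncio.run(drive("/ok"))
asyncio.run(drive("/boom"))
print(dict(REQUESTS))
```

Rate is the sum over REQUESTS, errors are the entries with a 5xx status, and duration comes from DURATIONS. Recording in a finally block is the design choice that keeps exceptions from bypassing the metrics.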

Label the metrics consistently: service, endpoint or route, method, status_code. Avoid high-cardinality labels (user IDs, request IDs, raw URLs with parameters) — those belong in traces and logs, not in metrics, where they cause cardinality explosion that breaks Prometheus storage.
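Most frameworks can report the matched route template directly, and that is the better source for the endpoint label. Where only the raw URL is available, a fallback along the following lines keeps cardinality bounded. This is a sketch: endpoint_label and the ":id" and ":uuid" placeholders are illustrative choices, and it only recognizes numeric and UUID segments.

```python
import re

_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def endpoint_label(raw_url: str) -> str:
    """Collapse a raw URL into a low-cardinality endpoint label."""
    # Query strings and fragments never belong in a metric label.
    path = raw_url.split("?", 1)[0].split("#", 1)[0]
    segments = []
    for segment in path.split("/"):
        if segment.isdigit():
            segments.append(":id")      # numeric identifiers
        elif _UUID.match(segment):
            segments.append(":uuid")    # UUID identifiers
        else:
            segments.append(segment)
    return "/".join(segments) or "/"


print(endpoint_label("/users/42/orders?page=3"))
print(endpoint_label("/docs/3f2b8c1e-9d4a-4c6b-8e2f-1a2b3c4d5e6f"))
```

Every user and document then shares one time series per route instead of creating a new one per identifier.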

Bucket the duration histogram thoughtfully. Default buckets like 0.005s, 0.01s, 0.025s, 0.05s, 0.1s, 0.25s, 0.5s, 1s, 2.5s, 5s, 10s cover most APIs, but if your service has a tight SLO at 100ms, you want more buckets in the 50-200ms range and fewer in the 5-10s range that you would never operate at. The buckets you choose determine the percentile resolution at the values that matter.
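One way to see the effect of bucket choice is to ask how wide the bucket is at the latency you care about, since that width bounds how precisely a percentile can be estimated there. In this sketch TIGHT_SLO_BUCKETS is a hypothetical layout for a service with a 100ms SLO, not a recommendation.

```python
from bisect import bisect_left

# The default layout quoted above, in seconds.
DEFAULT_BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]

# Hypothetical layout: denser between 50ms and 200ms, sparser above 1s.
TIGHT_SLO_BUCKETS = [0.01, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.2, 0.3, 0.5, 1.0, 5.0]


def bucket_width_at(latency_seconds: float, bounds: list) -> float:
    """Width of the bucket a latency falls into (upper bounds are inclusive)."""
    i = bisect_left(bounds, latency_seconds)
    if i == len(bounds):
        return float("inf")  # lands in the implicit +Inf bucket
    lower = bounds[i - 1] if i > 0 else 0.0
    return bounds[i] - lower


# Resolution available for a request that took 140ms under each layout.
print(bucket_width_at(0.14, DEFAULT_BUCKETS))
print(bucket_width_at(0.14, TIGHT_SLO_BUCKETS))
```

Under the default layout a 140ms request sits in a bucket spanning 100ms to 250ms, while the denser layout narrows that to 125ms to 150ms for about the same number of buckets.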

The dashboard pattern

The RED dashboard for a service has six panels:

(1) Rate over time, broken down by endpoint. (2) Error rate over time, broken down by error type. (3) Error percentage over time. (4) Duration p50/p95/p99 over time, on a single chart. (5) Duration heatmap (the full distribution over time, color-coded by frequency). (6) Top 10 endpoints by error rate, as a table.
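For a templated dashboard, the six panels can be expressed as a small table of PromQL queries. The sketch below assumes metric names http_requests_total and http_request_duration_seconds, the label names suggested earlier, a "$service" dashboard variable, and that only 5xx responses count as errors. All of these are illustrative assumptions to adapt.

```python
# Label selector shared by every query; "$service" is a dashboard variable.
SELECTOR = 'service="$service"'
REQS = f"http_requests_total{{{SELECTOR}}}"
ERRS = f'http_requests_total{{{SELECTOR},status_code=~"5.."}}'
BUCKETS = f"http_request_duration_seconds_bucket{{{SELECTOR}}}"


def quantile(q: float) -> str:
    """PromQL for one latency percentile across all instances of the service."""
    return f"histogram_quantile({q}, sum by (le) (rate({BUCKETS}[5m])))"


# Panel title -> list of queries drawn on that panel.
PANELS = {
    "1. Rate by endpoint": [f"sum by (endpoint) (rate({REQS}[5m]))"],
    "2. Error rate by status code": [f"sum by (status_code) (rate({ERRS}[5m]))"],
    "3. Error percentage": [f"100 * sum(rate({ERRS}[5m])) / sum(rate({REQS}[5m]))"],
    "4. Duration p50/p95/p99": [quantile(0.5), quantile(0.95), quantile(0.99)],
    "5. Duration heatmap": [f"sum by (le) (rate({BUCKETS}[5m]))"],
    "6. Top 10 endpoints by error rate": [
        f"topk(10, sum by (endpoint) (rate({ERRS}[5m])))"
    ],
}

for title, queries in PANELS.items():
    print(title)
    for query in queries:
        print("   ", query)
```

Keeping the queries in one table like this is what lets the same dashboard be stamped out per service with only the variable changing.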

Six panels per service, every service. That uniformity is what makes the dashboards usable across an organization. When an alert fires, you open the dashboard for the affected service and the layout is exactly what you expect, without having to learn each service's idiosyncratic dashboard. The cognitive load reduction during incidents is enormous and almost never priced in by teams that build artisanal dashboards per service.

The alert pattern

RED metrics support a small set of alerts that catch most incidents:

(1) Error rate exceeds X% for Y minutes, where X and Y are tuned per service. (2) Duration p99 exceeds Z for Y minutes, where Z is the tail SLO. (3) Rate drops to zero (or near zero) for Y minutes, which catches the "service is silently dead" failure. Other alerts miss it because a deployment that lost its DNS routing receives no traffic, so there are no errors or slow requests to trip them.
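All three alerts share one shape: a threshold that must stay breached for a sustained period. A minimal sketch of that evaluation follows, with sustained_breach as a hypothetical helper and invented sample data. In practice the alerting system's own "for" duration does this job.

```python
from typing import List, Tuple

# (unix timestamp in seconds, metric value), oldest first.
Sample = Tuple[float, float]


def sustained_breach(
    samples: List[Sample], threshold: float, for_seconds: float, above: bool = True
) -> bool:
    """True if the most recent unbroken run of breaching samples spans for_seconds."""
    breach_start = None
    for timestamp, value in reversed(samples):
        breaching = value > threshold if above else value < threshold
        if not breaching:
            break  # the run of consecutive breaches ends here
        breach_start = timestamp
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


# Hypothetical one-minute samples.
error_pct = [(0, 0.2), (60, 2.5), (120, 3.0), (180, 2.8), (240, 3.1), (300, 2.9)]
request_rate = [(0, 120.0), (60, 0.0), (120, 0.0), (180, 0.0)]
flapping = [(0, 0.2), (60, 5.0), (120, 0.3), (180, 4.0)]

print(sustained_breach(error_pct, threshold=1.0, for_seconds=240))
print(sustained_breach(request_rate, threshold=0.1, for_seconds=120, above=False))
print(sustained_breach(flapping, threshold=1.0, for_seconds=120))
```

The duration requirement is what keeps a single bad scrape, like the flapping series here, from paging anyone.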

Three alert types per service. Tuned per service for the specific SLOs and traffic patterns. No more, no less. The alert library is small enough that on-call engineers know what each alert means without consulting a wiki, and the false-positive rate is low enough that paging is meaningful rather than a daily noise floor.

Where RED is not enough

RED is the baseline, not the ceiling. There are classes of services that need more.

Async workers and queue processors do not have request boundaries in the HTTP sense; the appropriate adaptation is RED on job-processing operations rather than on requests, with rate as jobs-per-second, errors as failed-jobs-per-second, and duration as job-execution-time.
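The same three signals can be attached to job handlers with a decorator. As before, this is a sketch under assumptions: the in-memory structures stand in for real metric objects, and red_job and send_email are hypothetical names.

```python
import time
from collections import Counter, defaultdict
from functools import wraps

# In-memory stand-ins for two counters and a histogram (assumption).
JOBS_TOTAL = Counter()            # job_type -> jobs started
JOBS_FAILED = Counter()           # job_type -> jobs that raised
JOB_SECONDS = defaultdict(list)   # job_type -> execution times in seconds


def red_job(job_type: str):
    """Decorator that records rate, errors, and duration for a job handler."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            JOBS_TOTAL[job_type] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                JOBS_FAILED[job_type] += 1
                raise  # let the queue's own retry logic see the failure
            finally:
                JOB_SECONDS[job_type].append(time.perf_counter() - start)

        return wrapper

    return decorator


@red_job("send_email")
def send_email(recipient: str) -> str:
    if "@" not in recipient:
        raise ValueError(f"invalid recipient: {recipient!r}")
    return f"sent to {recipient}"


send_email("a@example.com")
try:
    send_email("not-an-address")
except ValueError:
    pass

print(JOBS_TOTAL["send_email"], JOBS_FAILED["send_email"], len(JOB_SECONDS["send_email"]))
```

Labeling by job type plays the role that the endpoint label plays for HTTP services, so the same dashboard and alert templates still apply.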

Database-heavy services often need additional dependency-level metrics: query rate, error rate, duration to the database tier specifically. These are RED applied at a lower layer.

Business-critical metrics — payment success rate, signup conversion, feature adoption — belong on a separate business dashboard, not the technical RED dashboard. The two have different audiences and different urgency profiles.

Resource saturation is the territory of Brendan Gregg's USE method (Utilization, Saturation, Errors), which complements RED at the infrastructure layer. RED tells you that requests are slow; USE tells you that the disk queue is saturated, which explains why the requests are slow. Use both for full coverage.

The discipline

The reason RED earns its keep is the same reason most boring engineering practices earn theirs: consistency. Every service emits the same metrics with the same labels and the same buckets. Every dashboard has the same six panels in the same order. Every alert is one of the same three types. The uniformity makes the system understandable in aggregate and readable during the worst possible moments — three in the morning, partial information, an unfamiliar service, the engineer who knows it best on vacation.

Across DocuMint, CronPing, FlagBit, and WebhookVault, RED instrumentation is in middleware that we copy across services with minimal modification. The dashboards are templated. The alerts are the same three categories per product. The boring uniformity is the feature, not a bug — it is what makes the operational picture readable as the number of services grows.