Application Metrics: Counters, Gauges, Histograms, and When to Use Each
Most application metrics are the wrong type for the question they are trying to answer. A counter measures totals over time, a gauge measures a current value, and a histogram measures distributions. Picking the wrong one produces metrics that look right and tell you nothing.
During an outage we investigated last quarter, the metrics dashboard showed an average-latency graph at 80ms with no obvious anomalies. The actual problem was that p99 latency was 12 seconds and 1% of requests were timing out; the average masked the slow tail completely. The metric was instrumented as a gauge that averaged values across the scrape interval, so the question the team needed answered (what is the worst-case latency some customers are seeing) was not a question the metric could answer.
This is the canonical mistake with application metrics: picking the wrong metric type for the question. A counter, a gauge, and a histogram measure different things and produce different signals. Most metrics in most applications are the wrong type, and the cost shows up not as wrong dashboards but as dashboards that look right while hiding the actual problem. We have instrumented metrics across DocuMint, CronPing, FlagBit, and WebhookVault, and the same selection mistakes recur in every codebase.
The three metric types
A counter measures the total number of something that has happened since the application started. Counters only go up; the one exception is a reset to zero when the process restarts. The interesting derived signal from a counter is its rate, meaning how fast it is incrementing per unit time. Request counts, error counts, bytes processed, jobs completed: these are counters. The rate over five minutes is the useful operational signal; the absolute value is rarely interesting except for capacity accounting.
A gauge measures the current value of something. Gauges go up and down. The interesting signal from a gauge is its current reading and how it has changed recently. Active connections, queue depth, memory usage, temperature: these are gauges. The current value is operationally useful, and history shows trends and outliers.
A histogram measures the distribution of a series of values. The interesting signal from a histogram is the distribution shape — specifically, the percentiles. Request latency, response size, time spent in queue: these are histograms. The average is almost never the right signal; the p50, p95, p99, and max are the right signals. A histogram captures all of them from a single instrumentation point.
The latency mistake
Latency is the most common case where teams pick the wrong metric type. Average latency is mathematically a gauge or a counter-pair (sum and count, divided), and it is the wrong signal for almost every operational question about latency.
The right signal is the distribution. A service whose average latency looks healthy might have a p99 of 12 seconds, which means 1% of requests are having a terrible experience. An average folds those outliers into a single number, where they appear as a modest shift that is easy to dismiss. Percentiles surface them directly. The percentile chart for the same service would show a flat blue line at 80ms (p50), a slightly higher orange line (p95), and a red line shooting to 12 seconds (p99), and the operator immediately knows where to look.
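To make the point concrete, here is a minimal sketch on synthetic data. The numbers are invented for illustration and are not the figures from the incident above: 980 requests at 80ms and 20 requests at 12 seconds. The `percentile` helper is a hypothetical nearest-rank implementation, not any monitoring library's API.

```python
import math
import statistics

def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]

# Synthetic latencies in seconds: 980 fast requests and 20 very slow ones.
latencies = sorted([0.08] * 980 + [12.0] * 20)

mean = statistics.fmean(latencies)
p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)

print(f"mean={mean:.3f}s p50={p50}s p95={p95}s p99={p99}s")
```

On this data the mean comes out around 0.32 seconds, which reads as "a bit slow", while p50 and p95 stay at 80ms and p99 jumps to 12 seconds. Only the percentile view says that a specific slice of requests is in serious trouble.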
The fix is to instrument latency as a histogram from the start. Histograms record each value into a bucket, and the percentile is estimated at query time from the bucket counts. The cost is one extra time series per bucket and a small amount of CPU on aggregation. The benefit is that the operationally useful signal is available rather than averaged away.
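The bucket mechanism can be sketched in a few lines. This is an illustrative toy, not a real client library: the `Histogram` class and its method names are assumptions, and the quantile estimate uses linear interpolation inside the matching bucket, which is one common approach rather than the only one.

```python
import bisect

class Histogram:
    """Fixed-bucket histogram: stores per-bucket counts, not individual values."""

    def __init__(self, upper_bounds):
        self.upper_bounds = sorted(upper_bounds)
        # One extra slot at the end acts as the overflow bucket.
        self.counts = [0] * (len(self.upper_bounds) + 1)
        self.total = 0
        self.sum = 0.0

    def observe(self, value):
        # bisect_left finds the first bucket whose upper bound is >= value.
        self.counts[bisect.bisect_left(self.upper_bounds, value)] += 1
        self.total += 1
        self.sum += value

    def quantile(self, q):
        """Estimate the q-quantile (0 < q <= 1) by interpolating within a bucket."""
        if self.total == 0:
            return None
        target = q * self.total
        cumulative = 0
        for i, count in enumerate(self.counts):
            if count and cumulative + count >= target:
                if i == len(self.upper_bounds):
                    # Overflow bucket has no upper bound; report the last finite bound.
                    return self.upper_bounds[-1]
                lower = self.upper_bounds[i - 1] if i > 0 else 0.0
                upper = self.upper_bounds[i]
                return lower + (upper - lower) * (target - cumulative) / count
            cumulative += count
        return self.upper_bounds[-1]

# Synthetic data: 980 requests at 80ms, 20 requests at 8 seconds.
h = Histogram([0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0])
for value in [0.08] * 980 + [8.0] * 20:
    h.observe(value)

p50 = h.quantile(0.50)
p99 = h.quantile(0.99)
print(f"p50~{p50:.3f}s p99~{p99:.2f}s mean={h.sum / h.total:.3f}s")
```

The estimates are only as precise as the buckets: the true p99 here is 8 seconds, and the sketch can only say it lies somewhere between 5 and 10 seconds. That is still enough to show a tail problem that the mean alone would understate.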
The counter rate trap
Counters look simple — just increment when something happens — but the visualization side has a trap. Plotting the raw counter value produces a monotonically increasing line that conveys almost no operational information. The useful signal is the rate, computed by the monitoring system as the derivative of the counter over a time window.
The trap is that the time window matters. A 1-minute rate over a spike will show the spike. A 1-hour rate over the same spike will smooth it into invisibility. Picking the rate window deliberately is part of using counters correctly, and most dashboards use a default window that is too long to see the spikes that matter for operations.
A useful pattern is to dashboard the same counter at multiple rate windows: 5-minute rate for the operationally useful signal, 1-hour rate for the trend signal, 24-hour rate for the capacity-planning signal. The same underlying counter answers three different questions when read with three different windows.
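The effect of the window can be seen with a small sketch over synthetic counter samples. The `rate` function is a simplified, hypothetical stand-in for what a monitoring system computes; it assumes no counter reset inside the window and uses only the first and last samples in range.

```python
def rate(samples, now, window):
    """Per-second rate of a counter over [now - window, now].

    samples: list of (timestamp_seconds, counter_value), sorted by timestamp.
    Assumes the counter did not reset inside the window.
    """
    in_window = [(t, v) for t, v in samples if now - window <= t <= now]
    if len(in_window) < 2:
        return None
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0)

# Synthetic counter scraped every 15s for two hours: a steady 10 requests/s,
# plus a one-minute burst at 200 requests/s near the end.
samples = []
total = 0
for t in range(0, 7201, 15):
    samples.append((t, total))
    per_second = 200 if 7080 <= t < 7140 else 10
    total += per_second * 15

now = 7140  # the moment the burst ends
one_minute = rate(samples, now, 60)
five_minute = rate(samples, now, 300)
one_hour = rate(samples, now, 3600)
print(one_minute, five_minute, round(one_hour, 2))
```

Read at the end of the burst, the 1-minute rate shows the full 200 requests per second, the 5-minute rate shows 48, and the 1-hour rate shows roughly 13, which is barely distinguishable from the steady 10. The underlying counter is identical in all three cases.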
The gauge sampling trap
Gauges are scraped at the monitoring system's interval — typically every 15 to 60 seconds. Whatever value the gauge has at the scrape moment is what gets recorded. Anything that happens between scrapes is invisible.
For slowly-varying gauges like memory usage, this is fine. For rapidly-varying gauges like queue depth, this is a problem. A queue that fills and drains within a single scrape interval will appear in the dashboard as a flat zero, even though it spent 30 seconds with 10,000 items queued. The operator sees "no queue problem" while the application is actually queueing badly.
The fix for rapidly-varying gauges is to record the minimum, maximum, and average within the scrape interval, and let the dashboard show all three. The minimum tells you the floor, the maximum tells you the worst case, and the average tells you the typical level. Some client libraries and monitoring systems support this pattern under names like "summary statistics" or "min/max/avg recording"; where they do not, the application can track the three values itself and reset them on each scrape.
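A minimal sketch of the idea follows. `WindowedGauge` and its methods are hypothetical names, not a real library's API; the sketch is not thread-safe, and its average is a plain mean over the recorded updates rather than a time-weighted one.

```python
class WindowedGauge:
    """Gauge that tracks min, max, and mean of every value set since the last scrape."""

    def __init__(self, initial=0):
        self._current = initial
        self._reset()

    def _reset(self):
        # A new window starts from whatever the gauge currently reads.
        self._min = self._max = self._current
        self._sum = self._current
        self._n = 1

    def set(self, value):
        self._current = value
        self._min = min(self._min, value)
        self._max = max(self._max, value)
        self._sum += value
        self._n += 1

    def collect(self):
        """Called once per scrape: return the window's stats, then start a new window."""
        stats = {
            "last": self._current,
            "min": self._min,
            "max": self._max,
            "avg": self._sum / self._n,  # mean of updates, not time-weighted
        }
        self._reset()
        return stats

# A queue that fills to 10,000 items and drains again between two scrapes.
depth = WindowedGauge()
for n in list(range(0, 10001, 1000)) + list(range(9000, -1, -1000)):
    depth.set(n)

stats = depth.collect()
print(stats)
```

A plain gauge would report only the `last` value, which is zero here. The `max` of 10,000 is what tells the operator the queue backed up during the interval.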
The histogram bucket choice
Histograms have one significant configuration decision: the bucket boundaries. Buckets must be chosen in advance because the histogram counts how many values fell into each bucket; values are not stored individually, so the buckets determine what percentile resolution is available.
The standard pattern is exponentially-spaced buckets covering the expected range. For latency in a web application: 1ms, 2.5ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s. This covers four orders of magnitude with enough resolution at each scale to compute meaningful percentiles.
The mistake is using uniformly-spaced buckets (10ms, 20ms, 30ms, ...), which give high resolution at one scale and no resolution at others. A uniformly-spaced bucket scheme for latency that goes to 100ms in 10ms steps cannot distinguish a 200ms latency from a 12-second latency, because both fall in the same "above 100ms" overflow bucket. The exponential spacing handles the full operational range with a comparable number of buckets.
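The difference is easy to check directly. The sketch below uses the exponential boundaries listed above and a uniform 10ms scheme; `bucket_index` is a hypothetical helper that reports which bucket a value would be counted in.

```python
import bisect

# Upper bounds in seconds. The exponential list matches the boundaries in the text.
exponential = [0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
uniform = [i / 100 for i in range(1, 11)]  # 10ms, 20ms, ..., 100ms

def bucket_index(upper_bounds, value):
    """Index of the first bucket whose upper bound is >= value.

    A result equal to len(upper_bounds) means the overflow bucket.
    """
    return bisect.bisect_left(upper_bounds, value)

for latency in (0.2, 12.0):
    print(
        f"{latency}s -> uniform bucket {bucket_index(uniform, latency)}, "
        f"exponential bucket {bucket_index(exponential, latency)}"
    )
```

With the uniform scheme both values land in the overflow bucket. With the exponential scheme the 200ms request lands in the 250ms bucket, and the 12-second request is the only one in overflow, so the two remain distinguishable. If latencies above 10 seconds matter, the top boundary would need to be extended.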
The cardinality problem
Most metrics carry labels (also called tags or dimensions): customer_id, endpoint, region, status_code, and so on. Each unique combination of label values produces a separate time series. Histograms multiply this by the bucket count, so a histogram with 20 buckets labeled by 4 dimensions with 100 values each can produce up to 20 × 100^4 = 2 billion time series.
This is the cardinality problem, and it destroys monitoring system performance. The right approach is to be deliberate about which labels are necessary, to avoid labels with unbounded cardinality (customer_id is almost always wrong as a metric label), and to use aggregation patterns rather than per-customer metrics. Per-customer concerns belong in logs and traces, not in metrics.
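The arithmetic is worth running before adding a label. The sketch below computes the worst-case series count for a label scheme; the function name and the "trimmed" label values are illustrative assumptions, not measurements from any real system.

```python
from math import prod

def series_count(label_cardinalities, buckets=1):
    """Worst-case number of time series: every label combination times the bucket count."""
    return buckets * prod(label_cardinalities.values())

# The example from the text: 20 buckets, 4 labels with 100 values each.
worst = series_count(
    {"customer_id": 100, "endpoint": 100, "region": 100, "status_code": 100},
    buckets=20,
)

# A hypothetical trimmed scheme: no customer_id, fewer endpoints and regions,
# and status codes grouped into classes (2xx, 3xx, 4xx, 5xx, other).
trimmed = series_count(
    {"endpoint": 30, "region": 5, "status_class": 5},
    buckets=20,
)

print(f"worst case: {worst:,} series; trimmed: {trimmed:,} series")
```

Real systems rarely hit the full product, because not every combination occurs, but the worst case is the number to plan around. Dropping one high-cardinality label usually matters far more than trimming buckets.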
The deeper observation
The metric type and label scheme determine what questions the metric can answer. Picking the wrong type produces a metric that looks right and answers a different question than the one the operator is asking. The discipline of deliberately matching metric type to operational question is one of the highest-leverage observability investments a small team can make, and the cost is small — a few minutes per metric to think through what is actually being measured and what question the dashboard is going to ask. The teams that develop this discipline early avoid the worst kind of monitoring failure: dashboards that look correct while the application is on fire.