If you have ever watched a major service stay green on its own status page while every customer's monitoring dashboard is screaming red, you have experienced one of the most reliable patterns in modern operations: status pages lie. They lie partly because the technology is hard, but they lie mostly because the incentives are wrong. The page is owned by the marketing or support team, the metrics are owned by engineering, and there is a soft taboo against marking things red until the situation is "confirmed."
This essay is about how to build a status page that tells the truth, both technically and organizationally. The technical part is doable in a weekend. The organizational part is the actual work.
The two kinds of status data
Status pages exist in two flavors, and most of the lying happens because organizations confuse them.
Operator-driven status is what humans publish when they declare an incident. "We are investigating reports of degraded performance." This is useful for narrative, context, and apologies. It is necessarily late: a human has to notice, decide, and click "publish."
System-driven status is what your monitoring infrastructure detects in real time. Endpoint X is returning 5xx at a rate of 12% over the last 60 seconds. Region Y has 95th-percentile latency of 4 seconds. This is fast and uncaring, but it requires careful definition: what does "down" mean for a system that is partially degraded?
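To make "careful definition" concrete, here is a minimal sketch of how a window of recent requests could be turned into a status. The thresholds, the `Sample` type, and the `classify` function are assumptions for illustration; the right numbers depend on your own traffic and on what your customers actually notice.

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class Sample:
    status_code: int
    latency_s: float

# Hypothetical thresholds; tune them against your own traffic.
ERROR_RATE_DEGRADED = 0.05    # 5% of requests returning 5xx
ERROR_RATE_DOWN = 0.50        # half of all requests returning 5xx
P95_LATENCY_DEGRADED_S = 2.0  # 95th-percentile latency in seconds

def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = ceil(0.95 * len(ordered))
    return ordered[rank - 1]

def classify(window):
    """Map one window of samples (e.g. the last 60 seconds) to a status."""
    if not window:
        return "unknown"  # no traffic is not the same thing as healthy
    errors = sum(1 for s in window if s.status_code >= 500)
    error_rate = errors / len(window)
    if error_rate >= ERROR_RATE_DOWN:
        return "down"
    slow = p95([s.latency_s for s in window]) >= P95_LATENCY_DEGRADED_S
    if error_rate >= ERROR_RATE_DEGRADED or slow:
        return "degraded"
    return "ok"
```

With these assumed thresholds, a window where 12% of requests return 5xx is classified as "degraded" rather than "down", which is exactly the kind of decision you have to make explicitly before automating it.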
A truthful status page combines both. The system-driven layer detects and surfaces anomalies in seconds. The operator-driven layer adds explanation, scope, and resolution updates. Pages that show only operator updates lie by omission. Pages that show only metrics confuse customers with noise.
The synthetic monitoring trap
Most automated status pages work by hitting a health check endpoint every minute. /healthz returns 200, status is green; /healthz returns 5xx or times out, status is red.
This pattern fails in two predictable ways:
First, /healthz often does not exercise the same code path as a real customer request. It might check that the process is running and the database connection is alive, but not that the actual API endpoints work. The result: the system can be effectively down for users while the health check is happily returning 200.
Second, even when /healthz does test the real path, it tests only the path the operator knew to test. A subtle bug in a less-trafficked endpoint will not surface until a customer complains.
The fix is to derive status from the same telemetry that drives your alerts. If your alerting system says "p99 latency on /api/v1/invoices has exceeded 2 seconds for 5 minutes," your status page should say "DocuMint /invoices: degraded performance." The data source is the same; only the audience differs.
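One way to picture "same data source, different audience" is a single alert rule that carries both its engineering description and its customer-facing wording. The `AlertRule` type and both rendering functions below are hypothetical, not the API of any particular alerting tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlertRule:
    description: str    # what the on-call engineer sees
    feature: str        # customer-facing name of the affected feature
    customer_text: str  # plain wording for the status page

def engineer_message(rule, firing):
    """Message for the paging or alerting channel."""
    state = "FIRING" if firing else "RESOLVED"
    return f"[{state}] {rule.description}"

def status_page_line(rule, firing):
    """Line for the public status page, driven by the same rule state."""
    text = rule.customer_text if firing else "working normally"
    return f"{rule.feature}: {text}"

invoices_latency = AlertRule(
    description="p99 latency on /api/v1/invoices has exceeded 2 seconds for 5 minutes",
    feature="DocuMint /invoices",
    customer_text="degraded performance",
)

print(engineer_message(invoices_latency, firing=True))
print(status_page_line(invoices_latency, firing=True))
```

Because both messages are rendered from the same rule and the same firing state, the status page cannot stay green while the on-call engineer is being paged.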
The granularity problem
"All systems operational" means almost nothing if "all systems" includes your blog, your billing system, your API, your dashboard, and your auth service. The customer experience of the API being down is completely different from the experience of the dashboard being down, but a single composite indicator collapses them.
The right granularity is per-feature, not per-component. A customer cares whether they can create an invoice, send an invoice, or download a receipt. They do not care whether the failure is in the API gateway, the PDF rendering service, the database, or the storage layer. Group your status indicators by what the customer is trying to do, not by your internal architecture.
This requires you to maintain a translation layer: a mapping from internal services to user-facing features. It is annoying to keep up to date, but the work is justified the first time a customer says "the site says everything is fine but I cannot do the thing I came to do."
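A translation layer can start as a plain dictionary. The sketch below is illustrative: the service names and the feature-to-service mapping are made up, and it assumes each feature is only as healthy as its worst dependency.

```python
# Hypothetical mapping from customer-facing features to internal services.
FEATURE_DEPENDENCIES = {
    "Create an invoice": ["api-gateway", "database"],
    "Send an invoice": ["api-gateway", "database"],
    "Download a receipt": ["api-gateway", "pdf-renderer", "storage"],
}

# Ordered from best to worst so the worst status can be picked by position.
SEVERITY_ORDER = ["ok", "degraded", "down"]

def feature_status(service_status):
    """Roll per-service status up into per-feature status.

    service_status maps service name to "ok", "degraded", or "down".
    Services missing from the map are treated as "ok" here; that is a
    simplifying assumption, and a stricter version might flag them.
    """
    result = {}
    for feature, services in FEATURE_DEPENDENCIES.items():
        statuses = [service_status.get(name, "ok") for name in services]
        result[feature] = max(statuses, key=SEVERITY_ORDER.index)
    return result

print(feature_status({"pdf-renderer": "down"}))
```

In this example a PDF renderer outage marks only "Download a receipt" as down, while invoice creation and sending stay green, which matches what the customer actually experiences.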
Subdomain isolation
Host the status page on a domain or infrastructure that does not depend on the systems being monitored. The classic failure: status.example.com runs on the same load balancer as example.com, the load balancer fails, and the status page is unreachable at the moment customers most need it.
The cheap fix is a separate cloud provider. Run the status page on a different DNS provider, a different host, and a different network path. Pay $5/month for it. When everything else is on fire, the status page should be the one thing customers can reach.
For our products, we use CronPing's public status pages partly for this reason: the monitoring infrastructure is independent of the monitored services, the page lives on a separate subdomain with its own routing, and the badge that customers can embed in their READMEs reads from the same source of truth.
Incident severity, plainly named
Most status pages use levels like "operational, degraded, partial outage, major outage." These are technical-sounding but customer-confusing. Is "degraded" worse than "partial outage"? Is "investigating" a status or a stage?
The clearest taxonomy I have seen uses three plain levels:
- Working normally. The thing is doing what customers expect.
- Some users affected. Some operations fail or are slow. Most customers will not notice. Specific groups will.
- Most users affected. The system is unusable for the majority. This is the state everyone comes to the page to check for.
Plus an orthogonal stage indicator for active incidents: investigating, identified, monitoring, resolved. The stage tells customers what you are doing; the level tells them how bad it is.
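Keeping level and stage as two separate fields is easy to express in code. This is a sketch under my own naming; the `Level`, `Stage`, and `FeatureStatus` types are hypothetical and only mirror the taxonomy above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Level(Enum):
    WORKING_NORMALLY = "Working normally"
    SOME_USERS_AFFECTED = "Some users affected"
    MOST_USERS_AFFECTED = "Most users affected"

class Stage(Enum):
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"

@dataclass
class FeatureStatus:
    feature: str
    level: Level                   # how bad it is
    stage: Optional[Stage] = None  # what you are doing; None when no incident is active

    def render(self):
        if self.stage is None:
            return f"{self.feature}: {self.level.value}"
        return f"{self.feature}: {self.level.value} ({self.stage.value})"

print(FeatureStatus("Send an invoice", Level.SOME_USERS_AFFECTED, Stage.IDENTIFIED).render())
```

Because the two fields are independent, a feature can move from "Investigating" to "Monitoring" without its level changing, and a level can worsen without the stage resetting.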
Postmortems that earn trust
The status page during an incident is necessary; the postmortem after the incident is what builds trust. Customers can forgive an outage. They cannot forgive an outage that gets one update of "we are investigating" and then nothing, ever.
The minimum: a public writeup within a week of any incident that affected paying customers. What broke. What we did. How long it lasted. What we are changing so it does not recur. No legalese, no euphemisms, no passive voice ("an issue was experienced"). Just plain prose.
The Cloudflare and GitLab postmortems are gold standards. They are technical, specific, sometimes embarrassing, and they have built more customer trust than a thousand "Reliability is our top priority" press releases.
The two-incident principle
If your status page has been green for two months while you have personally fixed two things you would have called incidents at any healthy company, your page is lying. It is reporting what you wish were true, not what is true.
The cure is to lower the bar. If you spent more than fifteen minutes fixing a customer-affecting issue, that was an incident. Mark it on the timeline. Write a one-paragraph note. Move on.
Customers do not lose faith in companies that have incidents. They lose faith in companies that pretend not to have them. A status page with regular small incidents and prompt resolutions reads as a company that knows what it is doing. A status page with no incidents for six months reads as a company that is hiding something.
What I run for our products
Each of DocuMint, CronPing, FlagBit, and WebhookVault has automated checks against its real API endpoints, hit from outside our infrastructure. The status badge updates within 60 seconds of a failed check. Public status pages show the last 30 days of uptime per endpoint. Postmortems land within a week.
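For readers who want to compute a similar 30-day figure themselves, here is a minimal sketch. The function name and the `(timestamp, passed)` data shape are assumptions for illustration, not the code behind the pages described above, and it measures uptime as the share of passed checks rather than as wall-clock time.

```python
from datetime import datetime, timedelta, timezone

def uptime_percent(checks, now, window=timedelta(days=30)):
    """Percentage of passed checks inside the window ending at `now`.

    checks is an iterable of (timestamp, passed) tuples for one endpoint.
    Returns None when the window contains no checks, so that missing
    data is never reported as 100% uptime.
    """
    recent = [passed for ts, passed in checks if now - window <= ts <= now]
    if not recent:
        return None
    return 100.0 * sum(recent) / len(recent)

now = datetime(2024, 6, 30, tzinfo=timezone.utc)
checks = [
    (now - timedelta(days=45), False),  # outside the window, ignored
    (now - timedelta(days=3), True),
    (now - timedelta(days=2), True),
    (now - timedelta(days=1), False),
    (now - timedelta(hours=1), True),
]
print(uptime_percent(checks, now))
```

Returning None for an empty window is a deliberate choice: an endpoint nobody checked should show up as a gap on the page, not as a perfect record.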
I have caught my own bugs from this monitoring twice in the last month. That is the test. If your status page never makes you wince, it is not telling you anything new.
Build the page that catches your own bugs. The customer trust will follow.