Health Checks That Actually Predict Problems

Every service has a /healthz that returns {"status": "ok"}. Most of them are lying. The endpoint exists because Kubernetes asked for it, and it returns 200 OK because that's what Kubernetes was promised. It almost never reflects whether the service can actually serve traffic. The teams that turn health checks from a checkbox into a useful signal do a few small things differently.

Three checks, not one

The mistake is conflating three different questions into a single endpoint. Kubernetes already separates them: liveness ("should I restart this container?"), readiness ("should I send it traffic?"), and startup ("has it finished initializing?"). Most teams point all three probes at the same /healthz and the same code path, and lose all the information.

The right shape:

/livez: returns 200 if the process can respond at all. Should never check downstream dependencies. The whole point is to detect a deadlocked or wedged process, and adding a database query introduces a way to fail liveness for reasons that aren't liveness, which causes restart loops when the database goes down.
/readyz: returns 200 only if the service can actually do its work. Checks the database connection, any required external services, the auth provider if you have one, and any cache that's required for correct responses (not for speed). When this returns 503, the load balancer pulls the instance out of rotation.
/startupz: returns 200 once the process has finished its slow boot: caches loaded, connection pools opened, startup migrations run. Kubernetes holds off liveness and readiness probes until the startup probe succeeds, so a slow cold start doesn't get the container restarted.
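
As a rough illustration of the split, here is a minimal Python sketch using only the standard library. The names startup_complete, dependencies_ready, and probe_status are made up for this example, and dependencies_ready is a placeholder for whatever checks your service really needs.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import threading

# Hypothetical process-wide flag; the service's boot code would set it
# once caches are loaded and connection pools are open.
startup_complete = threading.Event()


def dependencies_ready() -> bool:
    """Placeholder for the real dependency checks behind /readyz."""
    return True


def probe_status(path: str) -> int:
    """Map a probe path to an HTTP status code."""
    if path == "/livez":
        # Answering at all is the liveness signal: no dependency checks here.
        return 200
    if path == "/startupz":
        return 200 if startup_complete.is_set() else 503
    if path == "/readyz":
        ok = startup_complete.is_set() and dependencies_ready()
        return 200 if ok else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(probe_status(self.path))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Blocks forever; call it from the service's entry point."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Keeping the decision in a plain function like probe_status makes the three answers easy to unit-test without starting a server.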

The first time you make this split, the failure modes get instantly clearer. "Service is up but not serving" stops being a contradiction.

Check what the request path actually uses

The most common readiness check is "can I reach the database." The most common production failure is "I can reach the database but my connection pool is exhausted." These are different bugs and the first check does not catch the second.

The honest readiness check executes the same primitives a real request would: acquire a connection from the pool, run a trivial query, release. If the pool is exhausted, this blocks or fails, and the readiness probe correctly reports unready. If your service uses Redis on the request path, ping Redis. If it calls a downstream API on every request, do not call that API in your health check — you'll cause a cascade when the downstream slows down. Instead, surface the downstream's last-known state from a circuit breaker or an internal metric.
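
A sketch of the acquire, query, release check might look like the following. The ConnectionPool class here is a toy stand-in built on sqlite3 and queue.Queue so the example runs on its own; in a real service you would call your actual pool, whose API will differ.

```python
import queue
import sqlite3


class ConnectionPool:
    """Tiny illustrative pool; stands in for the service's real pool."""

    def __init__(self, size: int) -> None:
        self._idle: queue.Queue = queue.Queue()
        for _ in range(size):
            self._idle.put(sqlite3.connect(":memory:", check_same_thread=False))

    def acquire(self, timeout: float) -> sqlite3.Connection:
        # Raises queue.Empty if no connection frees up within the timeout.
        return self._idle.get(timeout=timeout)

    def release(self, conn: sqlite3.Connection) -> None:
        self._idle.put(conn)


def database_ready(pool: ConnectionPool, timeout: float = 1.0) -> bool:
    """Acquire, run a trivial query, release: the same steps a request takes."""
    try:
        conn = pool.acquire(timeout)
    except queue.Empty:
        # Pool exhausted: the database may be reachable, but we have no capacity.
        return False
    try:
        conn.execute("SELECT 1").fetchone()
        return True
    except sqlite3.Error:
        return False
    finally:
        pool.release(conn)
```

Because the check goes through the pool, an exhausted pool reports unready even when a fresh connection to the database would have succeeded.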

Distinguish "I cannot work" from "this dependency is down"

If your service can still serve correct responses without Redis, only more slowly, then Redis being down is not a readiness failure; it's a degraded-mode signal. The mistake is having one boolean output where the truth is more granular.

The pattern: a readiness response should be a small JSON object with per-dependency status, and an aggregate that captures whether the service can serve traffic in some useful way. Something like:

{
  "ready": true,
  "degraded": ["redis"],
  "deps": {
    "database": "ok",
    "redis": "down",
    "queue": "ok"
  }
}

The HTTP status code is the load balancer's signal; the body is the operator's. A monitoring system can alert on degraded states even when the service is technically still receiving traffic. CronPing works as the inverse pattern: a heartbeat-based check that fires when an expected ping doesn't arrive, which catches background workers that liveness probes never see.
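
One way to assemble the per-dependency response shown above, sketched in Python. The REQUIRED set and the readiness_report function are assumptions for illustration; which dependencies count as hard requirements is a decision only your service can make.

```python
import json

# Assumption for illustration: which dependencies are hard requirements.
REQUIRED = {"database", "queue"}


def readiness_report(deps: dict[str, str]) -> tuple[int, str]:
    """Return (HTTP status, JSON body) for a map of dependency -> "ok"/"down"."""
    down = sorted(name for name, state in deps.items() if state != "ok")
    # Ready as long as no required dependency is down.
    ready = not any(name in REQUIRED for name in down)
    body = {
        "ready": ready,
        # Optional dependencies that are down: still serving, but degraded.
        "degraded": [name for name in down if name not in REQUIRED],
        "deps": deps,
    }
    return (200 if ready else 503), json.dumps(body)
```

With Redis down and everything else healthy, this returns status 200 and a body listing redis under degraded, so the load balancer keeps routing while monitoring can still alert.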

Time-bound your checks

A readiness check that hangs for 30 seconds is worse than one that fails immediately. The classic story: the database is slow, the readiness check waits on it, the Kubernetes probe hits its timeout (10 seconds in this example), and the pod is marked unready and pulled from rotation. Now traffic shifts to other pods, which are also slow, which also fail readiness, which removes more capacity. You have created a cascade by making your health check pessimistic.

Every external call inside a health check needs an aggressive timeout — 1 to 3 seconds at most. If the database can't respond in 1 second to a SELECT 1, something is wrong, and reporting unready quickly is correct. If you fear a slow downstream, return degraded rather than failing, and let humans decide whether to take action.
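
A minimal sketch of a time-bound check, assuming each dependency check is a plain callable that returns True or False. The check_with_timeout helper and the "timeout" status string are inventions for this example; a caller could map "timeout" to degraded rather than unready.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable

# Shared worker pool for health checks; the size is an arbitrary choice.
_executor = ThreadPoolExecutor(max_workers=4)


def check_with_timeout(check: Callable[[], bool], timeout: float = 1.0) -> str:
    """Run one dependency check, reporting "timeout" instead of hanging."""
    future = _executor.submit(check)
    try:
        return "ok" if future.result(timeout=timeout) else "down"
    except FutureTimeout:
        # The worker thread keeps running after we give up on it, so the
        # check itself should also set a client-side timeout to avoid
        # piling up stuck threads.
        return "timeout"
    except Exception:
        # Any error raised by the check counts as the dependency being down.
        return "down"
```

The wrapper only bounds how long the endpoint waits. It does not cancel the underlying call, which is why the driver-level timeout still matters.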

Memoize the answer

If your readiness probe runs every 5 seconds and your service has 100 replicas, that's 1,200 health-check requests per minute hitting your database. This sounds small until you realize it's also touching Redis, and the queue, and any other dependency you check. The fix is to compute the answer once every few seconds in the background, cache it, and have the endpoint return the cached value.

The cached answer is good enough for orchestration purposes. Kubernetes does not need real-time truth; it needs a recent enough signal to make routing decisions. A 2-second-old health status is virtually always more than fresh enough.
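
Here is one possible shape for the background refresh, as a sketch. CachedHealth is a hypothetical class, and the compute callable is assumed to run the real dependency checks and to catch its own errors (if it raised, the refresh thread in this sketch would stop and the cached value would go stale).

```python
import threading
from typing import Callable


class CachedHealth:
    """Recompute a health result in the background; serve the last value."""

    def __init__(self, compute: Callable[[], dict], interval: float = 2.0) -> None:
        self._compute = compute
        self._interval = interval
        self._lock = threading.Lock()
        self._value = compute()  # one synchronous run so reads never see None
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval):
            value = self._compute()
            with self._lock:
                self._value = value

    def get(self) -> dict:
        """What the endpoint handler returns; no dependency is touched here."""
        with self._lock:
            return self._value

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

With this in place, the load on your dependencies depends on the refresh interval per replica, not on how often the probe fires.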

Surface the things that don't fail loudly

The deadliest production failures are the ones that don't crash anything. Your queue depth is increasing. Your error rate has tripled but is still tolerable per request. Your latency is up 40 percent. None of these will trip a readiness probe, and none of them will get your attention until something else breaks.

The pattern is to have an additional operational health endpoint, used by your monitoring rather than by Kubernetes, that returns the metrics that matter to humans: queue depths, error rates over the last minute, p95 and p99 latency, recent restart count. This is where you put the early-warning signal. Real incidents almost never start with a 503; they start with a graph that has begun to bend.
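
To make that concrete, here is a small sketch of the error-rate and latency part of such an endpoint. RequestWindow is a hypothetical helper, it uses a simple nearest-rank percentile, and it is not thread-safe as written; many services would pull these numbers from an existing metrics library instead. Queue depth and restart count are left out because they depend on your infrastructure.

```python
import math
import time
from collections import deque
from typing import Optional


def _percentile(sorted_values: list, pct: int) -> Optional[float]:
    """Nearest-rank percentile; None when there are no samples."""
    if not sorted_values:
        return None
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


class RequestWindow:
    """Rolling window of (timestamp, latency_ms, is_error) request samples."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self._window = window_seconds
        self._samples: deque = deque()

    def record(self, latency_ms: float, is_error: bool,
               now: Optional[float] = None) -> None:
        stamp = time.monotonic() if now is None else now
        self._samples.append((stamp, latency_ms, is_error))

    def snapshot(self, now: Optional[float] = None) -> dict:
        now = time.monotonic() if now is None else now
        # Drop samples that have aged out of the window.
        while self._samples and self._samples[0][0] < now - self._window:
            self._samples.popleft()
        latencies = sorted(sample[1] for sample in self._samples)
        errors = sum(1 for sample in self._samples if sample[2])
        return {
            "requests": len(latencies),
            "error_rate": errors / len(latencies) if latencies else 0.0,
            "p95_ms": _percentile(latencies, 95),
            "p99_ms": _percentile(latencies, 99),
        }
```

The request path would call record once per request, and the operational endpoint would serialize snapshot() as its response body.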

The maintenance check

One last small pattern: a way to gracefully take an instance out of rotation without killing it. The right shape is a manually-toggleable maintenance flag (a file on disk, an environment variable, a row in a tiny config table) that, when set, causes /readyz to return 503 even though the service is fine. Operators set it before draining a node and clear it once they're done. Without this, "drain a host" requires either killing connections mid-flight or waiting for natural attrition.
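
The file-on-disk variant is probably the simplest to sketch. The path below is hypothetical, and readyz_status is an illustrative name; the only real requirement is that operators can create and remove the file on a running instance.

```python
import os

# Hypothetical path; any location operators can write to would do.
MAINTENANCE_FLAG = "/var/run/myservice/maintenance"


def readyz_status(dependencies_ok: bool, flag_path: str = MAINTENANCE_FLAG) -> int:
    """Return 503 while the flag file exists, even if every dependency is fine."""
    if os.path.exists(flag_path):
        # Operator asked for this instance to be drained.
        return 503
    return 200 if dependencies_ok else 503
```

An operator would create the file to start draining and delete it to put the instance back in rotation, with no restart in between.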

The shape that emerges

Three endpoints, not one. Each checks the things it actually needs to. Per-dependency status surfaced in the body, aggregate in the status code. Aggressive timeouts, cached answers, an operational endpoint for early warnings, a maintenance toggle for graceful drain. None of this is sophisticated. Most services don't have any of it. The transition from {"status": "ok"} to a check that actually predicts problems takes about two days of work and saves entire on-call shifts.