Here's the most common health check implementation I see:
@app.get("/healthz")
def health():
    return {"status": "ok"}
This tells the load balancer the process is alive and can receive HTTP connections. It says nothing about whether the process can actually serve traffic. The database might be unreachable. The connection pool might be saturated. The cache warming might not have completed. All of that is invisible behind a 200 OK.
The gap between "process is alive" and "ready to serve traffic" is where cascading failures hide.
The Three-Level Hierarchy
Health checks have three distinct jobs, and conflating them causes problems:
Liveness: Is the process running and not deadlocked? If liveness fails, the container runtime should kill and restart the process. The check should be cheap — just confirming the event loop or thread pool is responsive. A stuck process that can't answer even this check is a zombie and should be replaced.
Readiness: Can this instance serve requests right now? If readiness fails, the load balancer should stop sending traffic to this instance, but the instance should not be killed. The process is alive; it's just not ready. Database disconnected, cache miss storm, dependency unavailable — all are readiness failures, not liveness failures.
Startup: Has the process completed initialization? Kubernetes's startupProbe delays liveness checks until initialization is confirmed, preventing premature restarts during slow cold starts. Without it, a liveness probe failing during a slow database migration can cause Kubernetes to kill an instance that just needed more time.
What Kubernetes Actually Does With These
livenessProbe failure → the kubelet kills the container and restarts it, subject to the pod's restartPolicy. Use this only for truly dead processes.
readinessProbe failure → pod is removed from the Service endpoints. Traffic stops. Pod is not killed. When the probe recovers, traffic resumes. This is what you want for dependency failures.
startupProbe success → Kubernetes starts running liveness and readiness probes. Until startup succeeds, neither fires.
Most applications should have all three. Most applications have one — usually wired to the wrong job.
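To make the division of labor concrete, here is a small sketch that writes the three probes as a Python dict and estimates how long each one tolerates continuous failure before Kubernetes acts. The field names (httpGet, periodSeconds, failureThreshold, timeoutSeconds) are the ones the Kubernetes probe spec uses, and in a real deployment they live in the pod's YAML. The paths, port, and numbers are illustrative assumptions, and failure_window_seconds is a hypothetical helper, not a Kubernetes API.

```python
# Probe settings mirrored as a dict. Values are illustrative assumptions.
PROBES = {
    "startupProbe": {
        "httpGet": {"path": "/livez", "port": 8080},
        "periodSeconds": 5,
        "failureThreshold": 30,  # generous: slow cold starts are expected
    },
    "livenessProbe": {
        "httpGet": {"path": "/livez", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 3,
        "timeoutSeconds": 1,
    },
    "readinessProbe": {
        "httpGet": {"path": "/readyz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 2,
        "timeoutSeconds": 2,
    },
}

# What Kubernetes does once a probe has failed failureThreshold times in a row.
ON_FAILURE = {
    "startupProbe": "restart container",
    "livenessProbe": "restart container",
    "readinessProbe": "remove pod from Service endpoints",
}


def failure_window_seconds(probe):
    """Rough budget of continuous failure: failureThreshold * periodSeconds."""
    return probe["failureThreshold"] * probe["periodSeconds"]


for name, probe in PROBES.items():
    print(f"{name}: ~{failure_window_seconds(probe)}s, then {ON_FAILURE[name]}")
```

With these numbers the process gets roughly 150 seconds to finish starting, a hung process is restarted after about 30 seconds, and an instance that cannot reach its dependencies leaves rotation after about 20 seconds.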
The Deep Health Check Trap
The obvious fix for a shallow health check — add real dependency checks to /healthz — creates a different problem:
@app.get("/healthz")
def health():
    db.execute("SELECT 1")  # Check database
    redis.ping()            # Check cache
    return {"status": "ok"}
Now a brief database hiccup causes every instance to simultaneously fail its health check. The load balancer removes all backends from rotation. Total outage. The health check that was supposed to enable resilience caused the failure it was meant to detect.
This is the cascading health check failure pattern. It's more dangerous than a shallow health check because it converts a partial failure (one dependency is slow) into a total failure (all instances report unhealthy at the same moment).
The Right Split
Split the checks themselves, not just the endpoint names:
/livez → checks only that the process is responsive. Runs every few seconds. Returns 200 if the HTTP handler can answer. Nothing else.
/readyz → checks critical-path dependencies, with tight timeouts. Checks only what a request actually needs — not everything you have. Returns 503 if any required dependency is unreachable after the timeout. Wire your load balancer readiness probes here, not your liveness probes.
/healthz → pick one of the above and document which. (The Kubernetes API server itself exposes /livez and /readyz and has deprecated its own /healthz endpoint since v1.16.)
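As a rough sketch of that split, the handlers below keep /livez trivial and run each /readyz dependency check under a timeout. It uses only the Python standard library. The check_database placeholder, the REQUIRED_CHECKS registry, the passes_within helper, the 2-second budget, and the serve function are all assumptions made for illustration, not part of any framework.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from http.server import BaseHTTPRequestHandler, HTTPServer

# Small pool used only to bound how long a dependency check may take.
_executor = ThreadPoolExecutor(max_workers=4)


def check_database():
    """Hypothetical placeholder: swap in a real, cheap query such as SELECT 1."""
    return True


# Only dependencies that a request actually needs belong here (assumed names).
REQUIRED_CHECKS = {"database": check_database}


def passes_within(check, timeout_s):
    """Return True only if `check` finishes with a truthy result in time."""
    future = _executor.submit(check)
    try:
        return bool(future.result(timeout=timeout_s))
    except Exception:
        # Covers both a timeout and an exception raised by the check itself.
        # Note: a timed-out check keeps running in its worker thread.
        return False


def livez():
    """Liveness: being able to answer at all is the whole check."""
    return 200, {"status": "ok"}


def readyz(checks=None, timeout_s=2.0):
    """Readiness: 503 if any required dependency fails or times out."""
    checks = REQUIRED_CHECKS if checks is None else checks
    for check in checks.values():
        if not passes_within(check, timeout_s):
            return 503, {"status": "unavailable"}
    return 200, {"status": "ok"}


ROUTES = {"/livez": livez, "/readyz": readyz}


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        route = ROUTES.get(self.path)
        if route is None:
            self.send_error(404)
            return
        status, body = route()
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking helper for local experiments; never called on import."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Checks run one after another here, so the worst-case response time is the per-check timeout multiplied by the number of checks. Keep that total below the probe's own timeout, or run the checks concurrently.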
Caching the Readiness Result
Running a real database check on every probe invocation is expensive. Kubernetes probes fire every 10 seconds by default (periodSeconds), and intervals of 5-30 seconds are typical. At 10 replicas with 10s intervals, that's 1 check per second hitting your database from health check traffic alone.
Cache the result for 5-10 seconds:
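import time

# Last readiness result and when it was recorded (monotonic clock).
# "ts" starts at -inf so the first call always runs a real check.
_readiness_cache = {"ok": True, "ts": float("-inf")}

def check_readiness():
    now = time.monotonic()  # unaffected by wall-clock adjustments
    # Serve the cached result if it is less than 5 seconds old.
    if now - _readiness_cache["ts"] < 5:
        return _readiness_cache["ok"]
    try:
        # `db` is the application's database handle; how a query timeout is
        # passed depends on the driver.
        db.execute("SELECT 1", timeout=2)
        _readiness_cache.update({"ok": True, "ts": now})
    except Exception:
        # Any failure, including a timeout, marks the instance not ready.
        _readiness_cache.update({"ok": False, "ts": now})
    return _readiness_cache["ok"]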
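It is worth checking when the cache actually reduces load. Here is a back-of-the-envelope sketch, assuming every prober hits every replica at a fixed interval: with a TTL of T seconds, a replica runs at most one real check per T seconds, and never more often than it is probed. The function db_checks_per_second and its parameters are hypothetical names for this estimate, and the result is an upper bound, not a measurement.

```python
def db_checks_per_second(replicas, probers_per_replica, interval_s, cache_ttl_s=0.0):
    """Upper bound on real dependency checks per second across the fleet."""
    probe_rate = probers_per_replica / interval_s  # probes/s reaching one replica
    if cache_ttl_s > 0:
        # A cached replica refreshes at most once per TTL, and only when probed.
        per_replica = min(probe_rate, 1.0 / cache_ttl_s)
    else:
        per_replica = probe_rate
    return replicas * per_replica


# The example from the text: 10 replicas, one prober, 10 s interval, no cache.
print(db_checks_per_second(10, 1, 10))                  # 1.0
# Same setup with a 5 s cache: probes arrive less often than the TTL expires.
print(db_checks_per_second(10, 1, 10, cache_ttl_s=5))   # 1.0
# 20 probers per replica (for example, many load balancer nodes), no cache.
print(db_checks_per_second(10, 20, 10))                 # 20.0
# With the 5 s cache, each replica refreshes at most once per 5 s.
print(db_checks_per_second(10, 20, 10, cache_ttl_s=5))  # 2.0
```

Under these assumptions, a 5-second cache only pays off when an instance is probed more often than once every 5 seconds, which typically happens when several load balancer nodes or monitoring systems each run their own checks.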
What Health Checks Should Never Do
Run database migrations. Warm caches. Execute writes. Take longer than the probe timeout to respond (which causes spurious failures). Return detailed error messages to unauthenticated callers — exposing internal topology through health check error payloads is a reconnaissance aid for attackers.
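The last point can be sketched as follows: keep the failure detail in server-side logs and return only a generic status to the caller. The readiness_response function, the shape of the failures dict, and the logger name are assumptions for illustration.

```python
import logging

logger = logging.getLogger("healthcheck")


def readiness_response(failures):
    """Map check failures to an HTTP status and a deliberately generic body.

    `failures` maps a dependency name to an internal error detail, for
    example {"database": "connection refused: 10.0.3.7:5432"}.
    """
    if failures:
        for name, detail in failures.items():
            # Details stay in server-side logs, where access is controlled.
            logger.warning("readiness check failed: %s (%s)", name, detail)
        return 503, {"status": "unavailable"}
    return 200, {"status": "ok"}
```

The caller learns only that the instance is not ready. Hostnames, ports, and driver error strings never leave the process.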
A health check that returns 200 when the process is alive and 503 when it can't serve traffic, cached, with a 2-second timeout — that's the baseline. Everything else is optimization or noise.
Published by Anethoth.