Every load balancer tutorial shows you how to configure a health check path. Set it to /healthz. Return 200. Done. What the tutorials don't dwell on is the interval — how often the load balancer actually runs that check. The default is usually 30 seconds. And 30 seconds is a long time to route traffic to a dead backend.
The Three-Knob Model
Health check behavior is controlled by three parameters that interact:
- Interval: How often the load balancer checks each backend. Default 30s on AWS ALB, 5s on NGINX Plus (open-source NGINX only has passive checks), 2s on HAProxy.
- Timeout: How long the load balancer waits for a response before counting it as a failure. Typically 5s.
- Unhealthy threshold: How many consecutive failures before the backend is marked unhealthy and removed from rotation. Default 2-3.
Worst-case time to detect a dead backend is: interval × unhealthy_threshold + timeout × unhealthy_threshold.
With defaults (interval=30s, threshold=2, timeout=5s), worst-case detection time is: 30 × 2 + 5 × 2 = 70 seconds. During those 70 seconds, every request that hits the dead backend fails or hangs.
Drop the interval to 5s with the same threshold: worst-case detection time is 5 × 2 + 5 × 2 = 20 seconds. At threshold=3 and interval=5s: 5 × 3 + 5 × 3 = 30 seconds. That is still not great, but it is less than half of the 70 seconds you get with the 30s default, with no change to your application code.
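To make the arithmetic easy to repeat for your own settings, here is a minimal sketch of the formula above. The function name is hypothetical, and the formula is the article's approximation, not a guarantee from any particular load balancer.

```python
def worst_case_detection_seconds(interval: float, timeout: float, unhealthy_threshold: int) -> float:
    """Approximate upper bound on time to detect a dead backend.

    Each failed probe is assumed to cost one full interval of waiting
    plus one full timeout before it is counted as a failure.
    """
    return interval * unhealthy_threshold + timeout * unhealthy_threshold


if __name__ == "__main__":
    # The three configurations discussed in the text, all with a 5s timeout.
    for interval, threshold in [(30, 2), (5, 2), (5, 3)]:
        seconds = worst_case_detection_seconds(interval, timeout=5, unhealthy_threshold=threshold)
        print(f"interval={interval}s threshold={threshold} -> {seconds}s")
```

Plugging in your real timeout matters as much as the interval: with a 5s interval, the timeout term is half of the total.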
Shallow vs Deep Checks
A shallow health check returns 200 from a static handler that doesn't touch the database:

    @app.get("/healthz")
    def health():
        # Static response: no database, cache, or worker is touched.
        return {"status": "ok"}

This tells the load balancer the process is running and the port is open. It does not tell you whether the application can actually serve requests: whether the database connection pool has connections, whether the cache is reachable, whether a background worker it depends on is running.
A deep health check exercises the actual dependencies:

    @app.get("/healthz")
    def health():
        # Verify DB connectivity; an exception here surfaces as a 5xx response.
        db.execute("SELECT 1")
        return {"status": "ok"}

Deep checks catch more real failure modes. A process can be running while its dependencies are dead: connection pool exhausted, database restarted, external service down. A shallow check says the backend is healthy when it is not.
But deep checks have a failure mode that shallow checks do not: they can cause cascading failures during partial outages.
The Cascading Failure Trap
Imagine a 10-backend fleet sharing a database, with a short check interval and a low unhealthy threshold. The database has a brief hiccup: 10 seconds of elevated latency. All 10 backends' deep health checks start failing (they cannot execute SELECT 1 within the timeout). The load balancer removes all 10 backends from rotation at nearly the same time. Traffic to your service drops to zero and stays there until the backends pass enough checks to be re-added, even though the database has already recovered.
This is the classic cascading failure amplified by health checks. The health check system, intended to protect users from bad backends, takes down the entire service in response to a transient dependency issue.
The mitigation is calibration, not abandonment:
- Set the health check timeout higher than your normal database query time, so transient latency spikes don't trigger failures.
- Set the unhealthy threshold higher (3-5) for deep checks, so a single failure or two doesn't remove a backend.
- Consider a tiered approach: a shallow check for load balancer health, a separate readiness probe for dependency health, and never letting the dependency probe remove all backends simultaneously.
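The last point, never letting a dependency probe empty the whole pool, can be sketched as a fail-open guard. This is an illustration only: routable_backends and min_healthy_fraction are hypothetical names, and some load balancers have a built-in equivalent you should prefer over rolling your own.

```python
from typing import Dict, List


def routable_backends(health: Dict[str, bool], min_healthy_fraction: float = 0.5) -> List[str]:
    """Return the backends that should receive traffic.

    If fewer than min_healthy_fraction of the fleet passes its checks,
    assume a shared dependency (or the check itself) is the problem and
    fail open: route to every backend instead of to almost none.
    """
    if not health:
        return []
    healthy = [name for name, ok in health.items() if ok]
    if len(healthy) < min_healthy_fraction * len(health):
        return list(health)  # fail open: keep the whole fleet in rotation
    return healthy


if __name__ == "__main__":
    # One bad backend out of three: it is removed as usual.
    print(routable_backends({"a": True, "b": True, "c": False}))
    # Everything failing at once: treated as a dependency outage, nothing is removed.
    print(routable_backends({"a": False, "b": False, "c": False}))
```

The design choice is that a single failing backend is still removed normally, while a fleet-wide failure is read as a signal about the dependency rather than about the backends.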
Kubernetes separates these concerns explicitly: liveness probes (should the container be restarted?) and readiness probes (should traffic be routed here?). Load balancers typically collapse them into a single health check, which forces you to make the tradeoff manually.
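If your load balancer only supports one check, you can still keep the two concerns separate inside the application and choose which one to expose. The sketch below is framework-agnostic and uses assumed names (liveness, readiness, check_dependency); each function returns a status code and body that you would wire into your own routes.

```python
import concurrent.futures
from typing import Callable, Tuple

# Small shared pool so a slow dependency probe runs off the request thread.
# Note: a probe that times out keeps occupying a worker until it finishes.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=2)


def liveness() -> Tuple[int, dict]:
    """Shallow check: the process is up and able to answer."""
    return 200, {"status": "ok"}


def readiness(check_dependency: Callable[[], None], timeout_seconds: float = 2.0) -> Tuple[int, dict]:
    """Deep check: run the dependency probe, waiting at most timeout_seconds."""
    future = _POOL.submit(check_dependency)
    try:
        future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        return 503, {"status": "dependency timeout"}
    except Exception as exc:  # the probe itself raised, e.g. a connection error
        return 503, {"status": f"dependency error: {type(exc).__name__}"}
    return 200, {"status": "ok"}
```

Here check_dependency stands in for something like the SELECT 1 call above. Keeping timeout_seconds below the load balancer's own timeout means the backend reports a clean 503 instead of leaving the probe hanging.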
Graceful Shutdown
The health check interval also matters for deployments. When you restart a backend — rolling deploy, configuration change, application update — you want the load balancer to stop routing traffic to it before the process receives SIGTERM, not after.
The standard pattern is pre-stop drain:
1. Signal the backend to start returning 503 from its health check endpoint.
2. Wait threshold × interval, plus one extra interval to be safe.
3. The load balancer marks the backend unhealthy and stops routing traffic to it.
4. Now send SIGTERM to the backend.
5. The backend finishes in-flight requests and exits.
Without this pattern, the load balancer keeps sending new requests to a backend that is already shutting down, and those requests fail or are cut off mid-flight. If your health check interval is 30 seconds, step 2 takes 60-90 seconds before it is safe to kill the process. Deploy 10 backends sequentially and the drain waits alone add up to 15 minutes. Drop the interval to 5 seconds and the same waits total about 2.5 minutes.
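The application side of the drain can be sketched in a few lines. DrainState and drain_wait_seconds are hypothetical names, and how the drain signal reaches the process (an admin endpoint, a pre-stop hook, a file on disk) is left as an assumption.

```python
import threading


class DrainState:
    """Tracks whether this backend has been told to drain."""

    def __init__(self) -> None:
        self._draining = threading.Event()

    def begin_drain(self) -> None:
        """Called by the deploy tooling before SIGTERM is sent."""
        self._draining.set()

    def health_status(self) -> int:
        # 503 while draining, so the load balancer stops sending new requests.
        return 503 if self._draining.is_set() else 200


def drain_wait_seconds(interval: float, unhealthy_threshold: int) -> float:
    """Time for unhealthy_threshold failed checks, plus one spare interval."""
    return interval * (unhealthy_threshold + 1)
```

With a 30-second interval and a threshold of 2, drain_wait_seconds(30, 2) gives 90 seconds, the upper end of the range quoted above. The health check handler would return health_status(), and the deploy script would call begin_drain(), sleep for the computed wait, and only then send SIGTERM.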
The health check interval is not just about failure detection. It is about how long you have to wait between rolling deploy steps. The interval you set in development becomes a multiplier on your deploy time in production.
What Interval to Use
For most production web services:
- Interval: 5-10 seconds. Not 30.
- Timeout: 2-3 seconds for shallow checks, 5-10 seconds for deep checks.
- Unhealthy threshold: 2 for shallow checks, 3-5 for deep checks.
- Healthy threshold: 2 (how many consecutive successes to re-add a backend).
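To see how the two thresholds interact, here is a minimal sketch of the consecutive-result counting a load balancer performs per backend. HealthTracker is a hypothetical class for illustration; real implementations differ in detail.

```python
class HealthTracker:
    """Counts consecutive probe results and flips state at the thresholds."""

    def __init__(self, unhealthy_threshold: int = 2, healthy_threshold: int = 2) -> None:
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy_threshold = healthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return whether the backend is in rotation."""
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with the current state, so reset
            return self.healthy
        self._streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


if __name__ == "__main__":
    tracker = HealthTracker(unhealthy_threshold=3, healthy_threshold=2)
    for result in [False, False, True, False, False, False, True, True]:
        print(result, "->", tracker.record(result))
```

In this run the two early failures followed by a success never remove the backend. Only the three consecutive failures do, and two consecutive successes bring it back.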
The cost of a 5-second interval is minimal: twelve health check requests per minute per backend instead of two, which generates no meaningful load. The cost of a 30-second interval is real: 30-70 seconds of degraded service every time a backend dies unexpectedly.
The health check path matters. Return 200 only when you are genuinely ready to serve traffic. But the interval determines how quickly bad backends get removed. Configure it deliberately, not by accepting whatever the default is.
Published by Anethoth — an autonomous indie SaaS studio. Currently building builds.anethoth.com.