Container Health Checks: Liveness, Readiness, and Startup Probes Explained
Most container health checks collapse three different questions into one endpoint, and the result is restarts during initialization, traffic to unready instances, and outages caused by the health check itself. The three-probe pattern fixes this.
The single most common container misconfiguration we see across production deployments is a health check that collapses three different questions into one endpoint. The orchestrator asks "is this process alive?" and gets back an answer to "is everything in the world OK?", which is the wrong answer for almost every situation. The result is containers that get killed and restarted during ordinary startup, traffic that gets routed to instances that are not yet ready, and outage modes where the health check itself takes down the cluster faster than any real failure would have.
The fix is the three-probe pattern, which Kubernetes formalized and most container orchestrators implement in some form. Liveness, readiness, and startup are different questions, and getting each one wrong has different consequences. We have wired health checks across DocuMint, CronPing, FlagBit, and WebhookVault, and the patterns converge across the four products.
Three questions, three probes
The liveness probe answers one question: should this container be killed and restarted? It is a question about the process itself, not about its dependencies. A correct liveness response means the process is responsive and not deadlocked. It does not mean the database is reachable, the cache is warm, or downstream services are healthy. Those are different concerns.
The readiness probe answers a different question: should this container receive traffic right now? A container can be alive but not ready — for example, during initial startup while it loads a configuration, during a deliberate drain before a deploy, or during a temporary downstream outage that makes its responses unreliable. Readiness toggling does not restart the container; it just removes the container from the load balancer pool until readiness returns.
The startup probe answers a third question: has this container finished its initial startup? Kubernetes introduced it as an alpha feature in version 1.16 (it has been stable since 1.20) after years of teams misusing liveness probes for this purpose. Some applications take 30-90 seconds to load on cold start, and during that window a liveness probe with normal thresholds will kill the container repeatedly. The startup probe runs first, holds off the liveness and readiness probes until startup completes, and then hands off to the normal liveness regime.
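The mapping from probe result to orchestrator response can be written down as a small decision function. This is a deliberately simplified model, not how any orchestrator is implemented: it ignores failure thresholds and timing, and the function name and the returned action strings are made up for illustration.

```python
def orchestrator_action(started, alive, ready):
    """Simplified model of how an orchestrator reacts to the three probes.

    Real orchestrators only act after several consecutive failures; this
    sketch looks at a single snapshot of the three answers.
    """
    if not started:
        # Startup has not passed yet: liveness and readiness are not
        # consulted, and the container receives no traffic.
        return "wait_for_startup"
    if not alive:
        # Liveness failure: the only response is to kill and restart.
        return "restart"
    if not ready:
        # Readiness failure: keep the process, stop routing to it.
        return "remove_from_pool"
    return "route_traffic"
```

The point of the ordering is that each answer triggers a different response, and only one of them is destructive.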
The consequences of confusing them
When teams use a single /health endpoint for all three probes, the failure modes are predictable. A slow-starting application gets killed during initialization, gets restarted, gets killed during initialization again, and never enters service. The orchestrator reports the application as crashing without ever explaining that the crashes are induced by the health check itself.
An application with a /health endpoint that checks downstream dependencies will mark itself unhealthy when the database has a brief hiccup, get killed, and restart. The restart does not fix the database hiccup. The application now has a thundering herd of restarting containers competing for the recovering database. The cluster goes from "one slow dependency" to "no application running" in under a minute, and the post-mortem reads as a database outage when the actual cause was a misconfigured liveness check.
An application that uses the same endpoint for liveness and readiness loses the ability to drain. A deliberately draining instance cannot tell the load balancer "stop sending me traffic" without also telling the orchestrator "kill me." The two signals must be separable for graceful shutdown to work.
What the liveness probe should actually check
Liveness should verify that the process is alive and responsive. The simplest correct implementation is an HTTP endpoint that returns 200 OK if the request handler ran. If the handler ran, the process is alive. That is the entire useful information.
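A minimal sketch of such an endpoint, using only the Python standard library. The path /livez, the port, and the names liveness, ProbeHandler, and make_server are assumptions chosen for illustration; the only point is that the handler touches nothing outside the process.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def liveness():
    """Answer one question only: did this handler get to run?"""
    # No database, cache, or downstream call is made here on purpose.
    return 200, b"ok"


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/livez":  # hypothetical path, pick your own
            self.send_error(404)
            return
        status, body = liveness()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Probes fire every few seconds; keep them out of the access log.
        pass


def make_server(host="0.0.0.0", port=8080):
    # Build the server without starting it; the caller decides when to serve.
    return ThreadingHTTPServer((host, port), ProbeHandler)
```

Calling make_server().serve_forever() would start it. In a real service the liveness route would normally live on the same server as the application routes, so that a wedged request loop also stops answering the probe.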
What the liveness probe should not check: external dependencies, downstream services, cache warmth, configuration freshness, or any other state that can fail temporarily and recover on its own. The reason is that a liveness failure means "kill and restart," and killing-and-restarting is the wrong response to a transient dependency outage. The right response is to wait for the dependency to recover, which is what the application would do if the liveness check did not exist.
The deeper rule: the liveness check should only fail for conditions that a restart will fix. Process deadlock is one such condition (a fresh process will not be deadlocked). A memory leak that has grown close to the container limit is another (a fresh process starts at a clean baseline). A network partition between the application and the database is not (a restart will not heal the network).
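When the work that can deadlock happens in a background loop rather than in the request handler, one possible way to surface it is a heartbeat: the loop records a timestamp on each iteration, and liveness fails only when that timestamp goes stale. The sketch below is one such approach under that assumption; the Heartbeat class, its method names, and the 500 status are illustrative choices.

```python
import time


class Heartbeat:
    """Liveness signal for a worker loop that can stall or deadlock."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self._clock = clock            # injectable so it can be tested
        self._max_age = max_age_seconds
        self._last_beat = clock()

    def beat(self):
        # Called by the worker loop once per iteration.
        self._last_beat = self._clock()

    def liveness(self):
        # Fails only when the loop has stopped making progress, which is
        # a condition a restart can fix. No dependency is consulted.
        age = self._clock() - self._last_beat
        if age <= self._max_age:
            return 200, b"ok"
        return 500, b"stalled"
```

The maximum age needs to be comfortably longer than the slowest legitimate iteration, otherwise a slow dependency shows up as a liveness failure again.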
What the readiness probe should check
Readiness should verify that the container can usefully handle requests right now. This includes the process being alive (so it must include everything the liveness probe checks) plus the dependencies the application needs to be useful — the database is reachable, the configuration is loaded, the cache is hydrated to the degree the application needs.
The readiness probe is where it is appropriate to check dependencies. Failing readiness is the right response to a database outage: the container removes itself from the load balancer pool, traffic flows to instances that can still serve some subset of requests if the topology supports it, and when the database recovers the container automatically returns to service. None of this requires restarting the process.
The discipline is to be specific about what "ready" means. If 90% of the application's traffic does not need the cache, the readiness probe should not fail on cache failures: failing readiness for a cache outage would drop 100% of traffic to handle a 10% degradation. The right answer is to fail readiness only for conditions where most of the application's value is unavailable, and to let the application return graceful errors for the subset that genuinely cannot work.
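One way to encode that distinction is to check every dependency but let only the critical ones decide the status code. This is a sketch under assumptions: the readiness function, the shape of the checks mapping, and the dependency names in the usage example are all hypothetical.

```python
def readiness(checks, critical):
    """Return (status_code, report) for a readiness probe.

    checks:   mapping of dependency name -> zero-argument callable that
              returns True when the dependency is usable.
    critical: names whose failure should take the instance out of rotation.
    """
    report = {}
    for name, check in checks.items():
        try:
            report[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            report[name] = False
    # Only critical dependencies decide readiness; the rest are reported
    # for visibility but do not remove the instance from the pool.
    ready = all(report.get(name, False) for name in critical)
    return (200 if ready else 503), report
```

With checks for "database" and "cache" and only "database" marked critical, a cache outage returns 200 with the cache reported as down, while a database outage returns 503.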
What the startup probe should check
Startup should verify that the application has completed its initial loading. The shape of this check depends on the application. For a process that has to load a large model into memory, the startup probe might check that a file exists or a flag is set. For a process that needs to compile templates or warm a cache, it might check that warmup has completed.
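The "flag is set" variant can be sketched with a thread-safe flag that the warmup code sets once it finishes. The names startup_complete, warm_up, and startup are illustrative, and the warmup steps are passed in as plain callables so the sketch stays independent of any particular application.

```python
import threading

# Set exactly once, when initial loading has finished.
startup_complete = threading.Event()


def warm_up(load_steps):
    """Run each initialization step in order, then mark startup as done."""
    for step in load_steps:
        step()  # e.g. load a model, compile templates, hydrate a cache
    startup_complete.set()


def startup():
    """Startup probe: has initial loading finished?"""
    if startup_complete.is_set():
        return 200, b"started"
    return 503, b"starting"
```

If a step raises, the flag is never set and the probe keeps returning 503, so the orchestrator's startup budget decides when to give up.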
The startup probe runs with a longer total budget than liveness: 60 seconds is common, and 5 minutes is reasonable for genuinely slow-starting applications. Once it passes, the liveness probe takes over with normal thresholds. The startup probe does not run again for the life of that container; it only runs again if the container is restarted or replaced.
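In Kubernetes the startup budget is not a single setting: it is roughly failureThreshold multiplied by periodSeconds, plus any initialDelaySeconds. The helpers below do that arithmetic. They are a sketch that ignores per-probe timeouts, and the 1.5 safety factor is an assumption, not a recommendation from any specification.

```python
import math


def startup_budget_seconds(failure_threshold, period_seconds,
                           initial_delay_seconds=0):
    """Approximate time a container gets before a startup probe gives up."""
    return initial_delay_seconds + failure_threshold * period_seconds


def failure_threshold_for(worst_case_startup_seconds, period_seconds,
                          safety_factor=1.5):
    """Smallest failure threshold that covers the worst-case startup time.

    safety_factor is an assumed margin for slow nodes and cold caches.
    """
    needed = worst_case_startup_seconds * safety_factor
    return math.ceil(needed / period_seconds)
```

For example, a failure threshold of 30 with a 10 second period gives a 300 second budget, which is the 5 minute figure above; an application with a 90 second worst case and a 10 second period needs a threshold of 14 under the assumed margin.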
For applications that start quickly (under 5-10 seconds), the startup probe is unnecessary and you can rely on an initial delay on the liveness probe instead (initialDelaySeconds in Kubernetes). The startup probe earns its complexity only when startup time exceeds what the liveness thresholds tolerate.
The graceful shutdown coordination
The three-probe pattern is the precondition for graceful shutdown, which is a related concern and just as commonly misconfigured. A correct graceful shutdown sequence is: receive SIGTERM, mark readiness as failing (so the load balancer stops sending new requests), wait a few seconds for in-flight requests to complete, stop the HTTP listener, drain any background workers, close database connections, and exit cleanly.
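The sequence above can be sketched as follows. All names here (draining, shutdown, install_sigterm_handler, and the cleanup steps passed in by the caller) are hypothetical, and the 5 second drain window is an assumed default standing in for "a few seconds".

```python
import signal
import threading
import time

# Set when the instance should stop receiving new traffic.
draining = threading.Event()


def readiness():
    # Fails as soon as the drain begins, independent of liveness.
    return (503, b"draining") if draining.is_set() else (200, b"ready")


def liveness():
    # Stays healthy during the drain so the orchestrator does not kill us.
    return 200, b"ok"


def shutdown(steps, drain_seconds=5.0, sleep=time.sleep):
    """Fail readiness, wait out the drain window, then run cleanup in order.

    steps: callables such as stop-listener, drain-workers, close-database.
    """
    draining.set()
    sleep(drain_seconds)  # let in-flight requests finish
    for step in steps:
        step()


def install_sigterm_handler(steps, drain_seconds=5.0):
    def handle(signum, frame):
        # Run the sequence on a thread so the signal handler returns quickly.
        threading.Thread(target=shutdown, args=(steps, drain_seconds)).start()

    signal.signal(signal.SIGTERM, handle)
```

During the drain window readiness returns 503 while liveness still returns 200, which is exactly the "alive but not ready" state that a single shared endpoint cannot express.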
None of this works if the readiness probe is checking the same conditions as liveness. The container needs to be able to say "I am alive but not ready" during the drain window, and that requires separable probes. We have written about graceful shutdown patterns separately, but the three-probe pattern is the foundation.
The deeper observation
The three-probe pattern looks like extra complexity until you have lived through the failure modes of conflating them. After the first time a single misconfigured health check restarts an entire cluster during a database hiccup, the separation becomes obvious. The orchestrator can only respond correctly to a failure if the health check tells it what kind of failure occurred. Liveness, readiness, and startup are three different failures with three different correct responses, and the only way to get the correct response is to ask three different questions.