Heartbeats and Liveness: How Services Tell Each Other They're Alive

Two services each need to know that the other is alive. The naive solution is the heartbeat: every N seconds, service A sends a small message to service B saying "I am here." If service B does not see a heartbeat from service A for more than T seconds, service B concludes that service A is dead and takes some action: failing over, removing it from the load balancer, marking jobs as orphaned. The pattern is so common it is folklore, and the folklore conceals enough subtle problems that production systems built on naive heartbeats fail in ways that surprise their engineers.

The right pattern is not difficult. The wrong pattern is easy to fall into and produces failures that cluster around the worst possible moments. This piece is the field guide for building heartbeats that do what they look like they should do.

The interval-versus-timeout trap

The first decision in heartbeat design is the heartbeat interval and the dead-detection timeout. The naive choice is to heartbeat every 5 seconds and declare dead after 5 seconds. This is wrong. Network jitter, garbage collection pauses, brief CPU saturation, and routine TCP retransmissions can all delay a heartbeat by a few seconds without anything actually being wrong. With the timeout equal to the interval, any delay at all pushes a healthy service past the deadline, so false-positive dead-detections are guaranteed, and they then cascade into failover storms, leader-election cycles, or worker thrash.

The right rule is that the timeout should be at least 2-3x the interval, and ideally 5x for production systems. At 3-5x, a 5-second heartbeat means declaring dead after 15-25 seconds, and a 30-second heartbeat means declaring dead after 90-150 seconds. The multiplier is not there because you expect to miss five heartbeats in a row; it is there because you want a dead-detection to be statistically extremely improbable for a healthy service experiencing routine operational hiccups, and the only way to get that property is generous slack in the timeout.
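
The rule can be sketched as a small receiver-side tracker. This is a minimal illustration, not a production detector: the class name LivenessTracker, its fields, and the default multiplier of 4 are assumptions made for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LivenessTracker:
    """Hypothetical receiver-side tracker: timeout = interval * multiplier."""
    interval_s: float = 5.0          # how often peers are expected to heartbeat
    timeout_multiplier: float = 4.0  # slack factor; the text suggests 3-5x
    last_seen: Dict[str, float] = field(default_factory=dict)

    @property
    def timeout_s(self) -> float:
        return self.interval_s * self.timeout_multiplier

    def record(self, peer: str, now: Optional[float] = None) -> None:
        # Use a monotonic clock so wall-clock adjustments cannot fake a timeout.
        self.last_seen[peer] = time.monotonic() if now is None else now

    def is_alive(self, peer: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        seen = self.last_seen.get(peer)
        # A peer never heard from is treated as not alive.
        return seen is not None and (now - seen) <= self.timeout_s
```

With a 5-second interval and a multiplier of 4, a peer last seen at t=100 is still considered alive at t=112 and dead at t=121. The `now` parameter exists only so the logic can be exercised without real waiting.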

The interval choice itself is a tradeoff between detection latency and traffic overhead. Shorter intervals detect failures faster but consume more bandwidth and CPU, especially in systems where heartbeats fan out across many services. Longer intervals are cheaper but mean dead services keep receiving traffic for longer. The right interval depends on what you do when a service is detected dead — if the action is failover with no data loss, you can tolerate longer detection; if the action is rerouting in-flight requests with potential timeout impact on users, shorter detection is worth the overhead.

The network-partition problem

The hardest problem in heartbeat-based liveness is that you cannot distinguish a failed service from a partitioned one. From service B's perspective, "service A has not sent a heartbeat in 30 seconds" is consistent with service A being dead, with the network between them being broken, with service B's heartbeat-receiving thread being stuck, and with the heartbeat sender on service A being broken while the rest of the service is fine. All four conditions look identical from B's point of view, and all four require different remediations.

The honest answer is that single-source heartbeats cannot disambiguate these cases. Production systems that need to disambiguate use multiple heartbeat sources: service B receives heartbeats from service A directly, but also receives information from a third party (a coordinator, a quorum of peers, an external watchdog) about whether service A is reaching others. If service A is heartbeating to three observers and one stops seeing heartbeats while two continue, the likely diagnosis is a partition or a fault local to that one observer. If all three stop seeing heartbeats, a dead service becomes by far the most likely diagnosis, although a partition that isolates A from everyone looks the same.
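
A sketch of how multiple observers' reports might be combined. The Diagnosis enum and the diagnose function are hypothetical names for this example, and the results are labelled as suspicions because observer reports narrow the possibilities without proving them.

```python
from enum import Enum
from typing import Dict


class Diagnosis(Enum):
    ALIVE = "alive"
    PARTITION_SUSPECTED = "partition_suspected"
    DEAD_SUSPECTED = "dead_suspected"


def diagnose(observations: Dict[str, bool]) -> Diagnosis:
    """observations maps observer name -> True if it still sees the target's heartbeats."""
    if not observations:
        raise ValueError("need at least one observer")
    seeing = sum(1 for sees in observations.values() if sees)
    if seeing == len(observations):
        return Diagnosis.ALIVE
    if seeing == 0:
        # Everyone lost the target: most likely dead, though a partition
        # isolating the target from all observers looks the same.
        return Diagnosis.DEAD_SUSPECTED
    # Some observers still see it: the target is up, so the problem is on
    # the path to (or inside) the observers that do not.
    return Diagnosis.PARTITION_SUSPECTED
```

For example, `diagnose({"b": False, "c": True, "d": True})` returns `Diagnosis.PARTITION_SUSPECTED`.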

The split-brain failure mode follows from this directly. If service A and service B can both reach a third party, and they each conclude the other is dead because of a partition between them, both will trigger failover and the system has two active leaders. The protection is requiring quorum agreement before declaring a service dead, which is what etcd and Consul and ZooKeeper do under the hood and why their consensus protocols look more complex than naive heartbeats. For systems below that complexity threshold, the practical protection is making the failover action idempotent and reversible — assume false positives will happen and design the failover to recover from them.
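
The quorum requirement can be illustrated with a simple majority vote. This is a deliberately simplified sketch and not how etcd, Consul, or ZooKeeper are implemented; the function name and arguments are assumptions for the example.

```python
from typing import Dict


def quorum_says_dead(votes: Dict[str, bool], cluster_size: int) -> bool:
    """Return True only if a strict majority of the whole cluster votes 'dead'.

    votes maps voter name -> True if that voter considers the target dead.
    cluster_size counts every voter, including ones that have not reported,
    so a minority cut off by a partition can never reach quorum on its own.
    """
    dead_votes = sum(1 for is_dead in votes.values() if is_dead)
    return dead_votes > cluster_size // 2
```

Counting against the full cluster size rather than against the votes received is the important design choice: two voters agreeing is a quorum in a cluster of three but not in a cluster of five.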

The cascading-failure mode

Heartbeats produce a specific cascading-failure pattern when many services depend on the same heartbeat infrastructure. The pattern goes: load on the heartbeat infrastructure increases (because of legitimate traffic, a dependency change, or a memory leak), heartbeats start being dropped, services start declaring each other dead, dead-detection triggers failover and rerouting, the rerouting traffic adds more load, more heartbeats are dropped, and the cluster cycles through false-failover cascades until the whole system collapses.

The protection is rate-limiting and circuit-breaking the failover actions themselves. If your service detects N peers dead in a short window, that is more likely to be a heartbeat infrastructure problem than N peers actually being dead, and the right response is to pause failover decisions and alert humans rather than to act on every dead-detection. Many production load balancers and proxies implement a version of this by capping how many backends can be removed at once; Envoy's outlier detection, for example, has a max_ejection_percent setting. Custom systems often do not, and the lack is invisible until the day the cluster needs it.
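
A minimal sketch of a circuit breaker on failover decisions. The class name FailoverBreaker, the thresholds, and the manual reset are assumptions chosen for illustration; a real system would also emit an alert when the breaker trips.

```python
from collections import deque


class FailoverBreaker:
    """Hypothetical breaker: too many dead-detections in a window pauses failover."""

    def __init__(self, max_dead: int = 3, window_s: float = 60.0):
        self.max_dead = max_dead
        self.window_s = window_s
        self._events = deque()   # timestamps of recent dead-detections
        self.tripped = False

    def allow_failover(self, now: float) -> bool:
        """Record one dead-detection at time `now`; return True if failover may proceed."""
        self._events.append(now)
        # Drop detections that have aged out of the window.
        while self._events and now - self._events[0] > self.window_s:
            self._events.popleft()
        if len(self._events) > self.max_dead:
            # More simultaneous "deaths" than is plausible: suspect the
            # heartbeat infrastructure and stop acting until a human resets.
            self.tripped = True
        return not self.tripped

    def reset(self) -> None:
        self.tripped = False
        self._events.clear()
```

The breaker stays tripped after the window empties on purpose: resuming automatic failover is left as an explicit human decision.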

The related pathology is heartbeats themselves saturating the link. If you have a 1000-node cluster with all-to-all heartbeats every second, that is roughly a million heartbeats per second (1000 x 999) crossing the network, and the heartbeat overhead becomes a non-trivial fraction of the network's capacity. The protection is to bound the fan-out: nodes heartbeat to a small number of peers or to a coordinator, and the membership view is propagated from there to everyone. The peer-subset variant is what gossip protocols (Cassandra, Akka cluster, HashiCorp Serf) do, and the math justifies the operational complexity at scale.
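
The arithmetic behind that comparison is short enough to write down. The function names and the fan-out of 3 are assumptions for the example.

```python
def all_to_all_rate(nodes: int, interval_s: float) -> float:
    """Heartbeats per second when every node heartbeats to every other node."""
    return nodes * (nodes - 1) / interval_s


def bounded_fanout_rate(nodes: int, fanout: int, interval_s: float) -> float:
    """Heartbeats per second when each node heartbeats to only `fanout` targets."""
    return nodes * fanout / interval_s


print(all_to_all_rate(1000, 1.0))         # 999000.0
print(bounded_fanout_rate(1000, 3, 1.0))  # 3000.0
```

All-to-all traffic grows with the square of the cluster size, while bounded fan-out grows linearly, which is why the gap widens so quickly as nodes are added.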

What heartbeats should actually contain

The naive heartbeat is empty: a ping, with nothing in it but its existence. The richer heartbeat carries information that lets the receiver make better decisions. A timestamp, so the receiver can detect clock skew and reject stale heartbeats. A sequence number, so the receiver can detect lost heartbeats versus delayed heartbeats. A health status (healthy, degraded, draining), so the receiver can react differently to a service that is alive but unwilling to take new traffic. Load metrics (current request rate, queue depth, memory pressure), so the load balancer can route away from overloaded but technically alive services.

The discipline is to treat the heartbeat as a structured message, not a ping, and to include the smallest amount of data that lets the receiver make better decisions. Adding fields is cheap; the heartbeat is small and the operational value of richer information is large. The constraint is that the heartbeat sender has to actually compute the metrics it reports, which has to be done without affecting the heartbeat send latency.
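
As an illustration, a structured heartbeat and a receiver-side check might look like the following. The field names, the JSON encoding, and the classify function are assumptions for this sketch rather than a standard wire format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Heartbeat:
    sender: str
    seq: int          # increases by 1 with every heartbeat sent
    sent_at: float    # sender's wall-clock time, seconds since the epoch
    status: str       # "healthy", "degraded", or "draining"
    queue_depth: int  # one example load metric


def encode(hb: Heartbeat) -> bytes:
    return json.dumps(asdict(hb)).encode("utf-8")


def decode(raw: bytes) -> Heartbeat:
    return Heartbeat(**json.loads(raw))


def classify(prev_seq: int, hb: Heartbeat, now: float, max_age_s: float) -> str:
    """Compare a heartbeat against the last sequence number accepted from its sender."""
    if hb.seq <= prev_seq:
        return "duplicate_or_reordered"
    if now - hb.sent_at > max_age_s:
        return "stale"   # delayed in transit, or the clocks disagree
    if hb.seq > prev_seq + 1:
        return "gap"     # at least one earlier heartbeat never arrived
    return "ok"
```

The sequence number separates lost heartbeats (a gap) from late ones (stale or reordered), and the status field lets the receiver stop routing to a draining service without waiting for a timeout.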

Application-level versus infrastructure-level heartbeats

An important distinction is application-level heartbeats (the application sends a heartbeat saying "I am operating correctly") versus infrastructure-level heartbeats (the OS, container runtime, or load balancer determines liveness from low-level signals like TCP connections or process existence). The two are often confused and serve different purposes.

Infrastructure-level heartbeats answer "is the process running and accepting connections?" They are cheap, automatic, and miss the case where the process is alive but stuck in a deadlock or unable to do useful work. Kubernetes liveness probes that check a TCP port fall in this category — they detect crashes but not hangs.

Application-level heartbeats answer "is the application doing useful work?" They are more expensive (the application has to actively report) and catch hang-state failures that infrastructure-level checks miss. Kubernetes readiness probes that hit a custom /readyz endpoint that exercises real dependencies are application-level. Production systems should use both: infrastructure for the cheap "process exists" check, application for the expensive "service is functional" check.
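
One way to make a heartbeat application-level is to gate it on evidence of recent progress. The sketch below assumes a worker loop that calls mark_progress() regularly, including when it is idle; the ProgressGate name and the 30-second stall limit are hypothetical.

```python
import time
from typing import Optional


class ProgressGate:
    """Hypothetical gate: heartbeat only while the worker loop keeps making progress."""

    def __init__(self, max_stall_s: float = 30.0, now: Optional[float] = None):
        self.max_stall_s = max_stall_s
        self._last_progress = time.monotonic() if now is None else now

    def mark_progress(self, now: Optional[float] = None) -> None:
        # Called by the worker after each unit of work or each idle tick.
        self._last_progress = time.monotonic() if now is None else now

    def should_heartbeat(self, now: Optional[float] = None) -> bool:
        # Called by the heartbeat sender; False means the worker looks hung.
        now = time.monotonic() if now is None else now
        return (now - self._last_progress) <= self.max_stall_s
```

If the worker deadlocks, the process and its TCP port stay up, but should_heartbeat() turns False and the heartbeats stop, so the receiver's ordinary timeout catches the hang.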

The graceful-shutdown interaction

Heartbeats interact subtly with graceful shutdown. A service that is being shut down should tell its consumers it is going away before it stops accepting requests, not after. If it stops accepting requests first, the load balancer keeps sending traffic to it during the window between the moment it stopped accepting and the moment the load balancer notices, and those requests fail. Simply going silent is not enough either: consumers will not react until the full dead-detection timeout has elapsed. If the service instead advertises a draining status in its heartbeats while it is still serving, the load balancer redirects new traffic to other instances while in-flight requests complete on this instance, which is what you want.

The pattern is: receive shutdown signal, set health to "draining," continue heartbeating with the drain status for at least one full timeout cycle so all consumers see the drain, stop accepting new requests, drain in-flight, exit. Because the announcement window and the in-flight drain run one after the other, the total shutdown time is roughly heartbeat-timeout plus in-flight-drain-time, plus some slack. Skipping any step produces avoidable user-visible errors at deploy time.
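
The sequence can be sketched as a single function. Everything passed in (set_status, stop_accepting, in_flight) is a hypothetical hook that the surrounding service would supply, and the heartbeat sender is assumed to keep running in the background and to report whatever status was last set.

```python
import time
from typing import Callable


def graceful_shutdown(
    set_status: Callable[[str], None],
    stop_accepting: Callable[[], None],
    in_flight: Callable[[], int],
    heartbeat_timeout_s: float,
    drain_deadline_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Run the drain sequence; return True if all in-flight work finished in time."""
    set_status("draining")       # heartbeats continue, now carrying the drain status
    sleep(heartbeat_timeout_s)   # give every consumer time to observe the drain
    stop_accepting()             # only now refuse new requests
    deadline = clock() + drain_deadline_s
    while in_flight() > 0 and clock() < deadline:
        sleep(0.1)               # poll until in-flight work completes or time runs out
    return in_flight() == 0      # the caller exits the process after this returns
```

The sleep and clock parameters are injectable so the sequence can be tested without real waiting. A False return means the drain deadline expired with requests still in flight, and the caller has to decide whether to exit anyway.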

Our use across products

The four products in this studio have minimal heartbeat machinery because the architecture is monolithic: each product is one container, and Caddy handles the load-balancing layer. The relevant heartbeats are between Caddy and each product (Caddy uses HTTP health checks at /health) and between the supervisor and each container (Docker uses health-check probes for restart decisions). Both layers use generous timeouts and fail-open semantics, so transient hiccups do not cascade. The richer heartbeat patterns described above earn their place in multi-service architectures; for our scale, the simpler patterns are correct. CronPing is itself a heartbeat-checking service for customer cron jobs, and its design embodies the timeout-discipline lessons of this piece: generous slack, exponential backoff on missed pings, alerting only after sustained absence rather than single missed beats.

The summary

Heartbeats look simple and conceal subtle production problems. The interval-versus-timeout choice has a 2-5x rule that prevents false positives. Single-source heartbeats cannot disambiguate failure from partition; multi-source quorum can. Cascading failures from heartbeat infrastructure overload need rate-limiting on failover actions. Rich heartbeat payloads beat empty pings. Application-level beats infrastructure-level for hang detection. Graceful shutdown coordinates with heartbeats by advertising drain status before the service stops accepting requests. Most production systems get most of these wrong on first attempt and learn the lessons from outages. The patterns are simple once stated; the discipline is in stating them before building, not after.