Graceful Shutdown: The Pattern Most Services Get Wrong

Every service eventually shuts down. In production, the trigger is usually a deployment: the orchestrator sends SIGTERM, waits out a grace period, then sends SIGKILL. What happens between SIGTERM and the process exiting decides whether deployments are invisible to users or show up as a 5xx spike in your dashboards. Most services get this wrong in a way that stays hidden until traffic is heavy enough for even brief windows of misbehavior to leave fingerprints.

A correct graceful shutdown has six steps. Each step exists for a specific failure mode. Skip any of them and that failure mode appears.

Step 1: Mark unhealthy, keep accepting

The first thing to do on SIGTERM is not to stop. It is to make the readiness probe start failing while continuing to accept new requests for a brief window (typically 5-15 seconds, depending on your load balancer's reconciliation period). The reason is timing: load balancers learn that a backend is unhealthy by polling, so there is always a lag between the backend reporting unhealthy and the load balancer routing traffic away. If you stop accepting requests immediately on SIGTERM, every request that lands during this lag becomes a connection error.

The mechanic: the signal handler sets a flag, the readiness endpoint returns 503 while the flag is set, and the request handlers keep working exactly as before. Once the load balancer has seen a configured number of consecutive failed probes (three failures at a 2-5 second interval takes 6-15 seconds), it takes the backend out of rotation and stops sending it traffic. Only then is it safe to stop accepting.
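A minimal Python sketch of this mechanic, using only the standard library. The names `draining`, `on_sigterm`, `readiness_status`, `install`, and the `/readyz` probe path are hypothetical, chosen for illustration rather than taken from any framework.

```python
import signal
import threading
from http.server import BaseHTTPRequestHandler

# Set once SIGTERM arrives; never cleared.
draining = threading.Event()


def on_sigterm(signum, frame):
    # Only flip readiness. Do not close the listener or exit here:
    # the load balancer has not noticed anything yet.
    draining.set()


def readiness_status():
    """Status code the readiness probe should return."""
    return 503 if draining.is_set() else 200


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/readyz":      # hypothetical probe path
            status = readiness_status()
        else:
            status = 200                # real traffic is served as before
        body = b"draining\n" if status == 503 else b"ok\n"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def install():
    # signal.signal() must be called from the main thread.
    signal.signal(signal.SIGTERM, on_sigterm)
```

Calling `install()` at startup and serving `Handler` with `http.server.ThreadingHTTPServer` keeps real requests flowing after SIGTERM while the probe fails. Once a handler is installed, SIGTERM no longer terminates the process by itself, so the remaining steps are responsible for eventually exiting.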

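Skip this step and every deployment produces a burst of failed requests (surfaced as 502s or 503s, depending on the load balancer) roughly equal to the instance's request rate times the LB lag: at 50 requests per second and a 10-second lag, that is about 500 failures per restarted instance.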

Step 2: Stop the listener, finish in-flight

Once the LB has stopped sending traffic, close the listening socket. New connections are refused, but requests already in flight continue. Most frameworks have a built-in mechanism for this (Go's http.Server.Shutdown, uvicorn's graceful shutdown in Python, Tomcat's graceful shutdown in Java). Use it.

The key constraint is the deadline: you have a finite window before SIGKILL arrives (30 seconds by default in Kubernetes, set by terminationGracePeriodSeconds), and the Step 1 delay has already used part of it. All in-flight work must complete or be abandoned within what remains. If your typical request latency is 200ms, in-flight requests finish with plenty of room to spare; if it is 25 seconds, you have a different problem.
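Mechanisms like Go's http.Server.Shutdown do the in-flight bookkeeping for you. Where a framework does not, a minimal Python sketch of deadline-bounded draining looks like this: count requests that have started but not finished, and after the listener is closed, wait on that count with a timeout. `InFlight` is a hypothetical helper, not a library API.

```python
import threading


class InFlight:
    """Counts requests that have started but not yet finished."""

    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def __enter__(self):
        with self._cond:
            self._count += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()   # wake any waiting wait_idle()
        return False                      # never swallow handler exceptions

    @property
    def count(self):
        with self._cond:
            return self._count

    def wait_idle(self, timeout_s):
        """Block until nothing is in flight or timeout_s elapses.
        Returns True if all requests finished, False if the deadline won."""
        with self._cond:
            return self._cond.wait_for(lambda: self._count == 0, timeout=timeout_s)
```

Each handler wraps its body in `with inflight:`. On shutdown, stop the accept loop and close the listening socket (with `http.server`, `server.shutdown()` called from a thread other than the one running `serve_forever()`, then `server.server_close()`), and call `inflight.wait_idle(...)` with whatever remains of the grace period. With a 30-second grace period and a 10-second Step 1 delay, at most 20 seconds remain for this and every later step.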

Step 3: Stop polling and consuming

Background work is its own category. If your service consumes from a queue (SQS, Kafka, NATS, Redis Streams), stop polling on SIGTERM. Each consumed message is a commitment to process it; if you consume one and then die, the message is either redelivered (duplicating any work already done) or lost (data loss).

The mechanic: the consumer loop checks a shutdown flag at the top of each iteration. When set, the loop exits without consuming further messages. Already-claimed messages are processed to completion if the shutdown deadline allows; otherwise they are explicitly rejected (NACKed) so the queue redelivers them rather than waiting for the visibility timeout.
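The consumer loop can be sketched in Python against a hypothetical in-memory queue. `FakeQueue` and its `receive`/`ack`/`nack` methods stand in for a real broker client and are not any library's actual API; `Shutdown.begin()` is what the SIGTERM handler would call, passing the remaining time budget.

```python
import threading
import time
from collections import deque


class Shutdown:
    """Shutdown flag plus an absolute deadline in time.monotonic() seconds."""

    def __init__(self):
        self.stop = threading.Event()
        self.deadline = float("inf")

    def begin(self, budget_s):
        self.deadline = time.monotonic() + budget_s
        self.stop.set()

    def time_left(self):
        return self.deadline - time.monotonic()


class FakeQueue:
    """Hypothetical in-memory broker standing in for SQS, Kafka, NATS, etc."""

    def __init__(self, messages):
        self.pending = deque(messages)
        self.acked, self.nacked = [], []

    def receive(self, max_messages):
        batch = []
        while self.pending and len(batch) < max_messages:
            batch.append(self.pending.popleft())
        return batch

    def ack(self, msg):
        self.acked.append(msg)

    def nack(self, msg):
        # A real broker would make the message visible again right away.
        self.nacked.append(msg)


def consume(queue, handle, shutdown, batch_size=5):
    # The flag is checked before every poll, so nothing new is claimed
    # once shutdown has begun.
    while not shutdown.stop.is_set():
        batch = queue.receive(batch_size)
        if not batch:
            time.sleep(0.05)
            continue
        for i, msg in enumerate(batch):
            if shutdown.time_left() <= 0:
                # Out of time: hand claimed-but-unprocessed messages back
                # instead of leaving them to the visibility timeout.
                for leftover in batch[i:]:
                    queue.nack(leftover)
                return
            handle(msg)
            queue.ack(msg)
```

If SIGTERM arrives mid-batch with time to spare, the loop finishes the claimed batch and then exits at the top of the next iteration; with no time left, it NACKs the rest of the batch.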

Step 4: Drain the worker pool

Most services use a worker pool of some kind: thread pool, goroutine pool, asyncio task group. The pool needs to drain, not be killed. Drain means: wait for current tasks to complete, do not start new ones. The drain has its own deadline, usually a fraction of the overall shutdown deadline.

The pattern that works: the pool's submit method checks the shutdown flag and rejects new submissions; existing tasks run to completion or until their own deadline expires; the main goroutine/thread waits on the pool's terminated state with a timeout.

If the drain times out, you have a choice: log the orphaned tasks and exit (the queue will redeliver them), or wait for SIGKILL to end the process. Logging and exiting is almost always the right answer: SIGKILL leaves no record of what was abandoned, and the queue redelivers the work either way.
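Python's `concurrent.futures.ThreadPoolExecutor` already provides the rejection half of this pattern: after `shutdown()`, `submit()` raises `RuntimeError`. A sketch of a deadline-bounded drain on top of it, where `drain_pool` is a hypothetical helper and the caller is assumed to have kept the futures it submitted:

```python
import concurrent.futures as cf
import logging

log = logging.getLogger("shutdown")


def drain_pool(pool, futures, timeout_s):
    """Reject new work, wait up to timeout_s for submitted work,
    and log whatever did not finish. Returns (done, not_done)."""
    # After shutdown(), pool.submit() raises RuntimeError.
    # wait=False returns immediately so we, not the executor, own the deadline.
    pool.shutdown(wait=False)
    done, not_done = cf.wait(futures, timeout=timeout_s)
    for fut in not_done:
        # Orphans: log and move on; the queue will redeliver the message.
        log.warning("abandoning unfinished task %r at shutdown", fut)
    return done, not_done
```

Passing `wait=False` matters: `shutdown(wait=True)` would block until every submitted task finished, with no deadline at all.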

Step 5: Close persistent resources

After the workers are drained, close database connections, file handles, and outbound network connections. The order matters: anything you opened on startup, close in reverse order. The reason is dependency: if an in-flight request needs the database, and you close the database before the request finishes, the request fails. By draining workers first, you have already ensured nothing needs the database.
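In Python, `contextlib.ExitStack` gives reverse-order cleanup directly: registered callbacks run last-in, first-out when the stack is closed. A sketch with hypothetical resource names, where `Resource` stands in for a database pool, file, or HTTP client:

```python
import contextlib


class Resource:
    """Hypothetical resource that records when it is closed."""

    def __init__(self, name, closed_log):
        self.name = name
        self.closed_log = closed_log

    def close(self):
        self.closed_log.append(self.name)


def open_resources(closed_log):
    """Open resources in dependency order; the returned ExitStack
    closes them in reverse (last opened, first closed)."""
    stack = contextlib.ExitStack()
    for name in ("database", "cache", "http_client"):
        stack.callback(Resource(name, closed_log).close)
    return stack
```

Startup calls `open_resources(...)`; the shutdown path calls `stack.close()` only after the worker drain has returned, so nothing still needs these resources.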

Connection pools deserve specific care. A correctly closing pool waits for in-use connections to be returned, then closes them with a small timeout. A pool that just closes everything immediately produces "connection closed" errors in the requests that were holding them.
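A sketch of a pool that closes this way, in Python. `Pool` and `Conn` are hypothetical; real drivers and pools have their own close semantics, so check yours. Idle connections are closed immediately, borrowed ones are closed as they are returned, and only stragglers still out when the timeout expires are force-closed.

```python
import threading


class Conn:
    """Hypothetical connection; close() just records that it happened."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class Pool:
    def __init__(self, size):
        self._idle = [Conn() for _ in range(size)]
        self._in_use = set()
        self._closing = False
        self._cond = threading.Condition()

    def acquire(self):
        with self._cond:
            self._cond.wait_for(lambda: self._idle or self._closing)
            if self._closing:
                raise RuntimeError("pool is closing")
            conn = self._idle.pop()
            self._in_use.add(conn)
            return conn

    def release(self, conn):
        with self._cond:
            self._in_use.discard(conn)
            if self._closing:
                conn.close()              # shutting down: do not recycle
            else:
                self._idle.append(conn)
            self._cond.notify_all()

    def close(self, timeout_s):
        """Close idle connections now, wait up to timeout_s for borrowed
        ones to come back, then force-close any stragglers.
        Returns the number of connections force-closed."""
        with self._cond:
            self._closing = True
            self._cond.notify_all()       # make blocked acquire() calls fail fast
            for conn in self._idle:
                conn.close()
            self._idle.clear()
            self._cond.wait_for(lambda: not self._in_use, timeout=timeout_s)
            stragglers = list(self._in_use)
            for conn in stragglers:
                conn.close()              # their holders will now see errors
            self._in_use.clear()
            return len(stragglers)
```

A non-zero return value is worth logging: each force-closed connection is a request that saw a "connection closed" error.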

Step 6: Exit cleanly

The process exits with status 0. Do not exit non-zero just because you ran out of grace time; that signals failure to the orchestrator and may interfere with rolling-deploy logic.

If your shutdown sequence completed normally, exit 0. If it timed out partway through, log what was abandoned and exit 0 anyway. If something genuinely failed (panic, unrecoverable error during shutdown), exit non-zero.
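The exit-status rule fits in a small function. `shutdown_exit_code` and its arguments are hypothetical; the caller would pass the result to `sys.exit()`.

```python
import logging

log = logging.getLogger("shutdown")


def shutdown_exit_code(timed_out_steps, error):
    """Map the outcome of the shutdown sequence to a process exit status.

    timed_out_steps: names of steps that hit their deadline (hypothetical).
    error: an exception raised during shutdown, or None.
    """
    if error is not None:
        # Something genuinely failed: surface it to the orchestrator.
        log.error("shutdown failed: %r", error)
        return 1
    if timed_out_steps:
        # Ran out of grace: record what was abandoned, but this is still
        # an orderly exit, so report success.
        log.warning("shutdown timed out in: %s", ", ".join(timed_out_steps))
    else:
        log.info("shutdown complete")
    return 0
```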

The shutdown checklist

You can test your service's shutdown by running it under load and sending SIGTERM. The signs of correctness:

No 5xx errors after SIGTERM (modulo the LB lag window, which should be invisible if the readiness probe was flipped first).
No "connection closed" errors in client logs from in-flight requests.
No duplicate processing of queue messages (visible as duplicate side effects, e.g. two webhooks from one event).
No orphaned database connections lingering in the connection limit count.
Clean exit log with explicit shutdown completion.
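A rough Python harness for this test. `sigterm_under_load` is a hypothetical helper: `cmd` launches the service, and `send_request` is a callable you supply that sends one request and returns True on success. It assumes a POSIX system and assumes `send_request` goes through the load balancer to a deployment with at least one other replica, as in a real rollout; aimed directly at a single instance, requests are refused once the listener closes and will count as failures.

```python
import signal
import subprocess
import threading
import time


def sigterm_under_load(cmd, send_request, warmup_s=2.0, after_s=10.0, workers=4):
    """Start cmd, drive load, send SIGTERM mid-load, and report what
    happened to requests sent after the signal."""
    proc = subprocess.Popen(cmd)
    results = []                      # (monotonic timestamp, succeeded)
    lock = threading.Lock()
    stop = threading.Event()

    def drive():
        while not stop.is_set():
            try:
                ok = bool(send_request())
            except Exception:
                ok = False            # any exception counts as a failed request
            with lock:
                results.append((time.monotonic(), ok))

    threads = [threading.Thread(target=drive, daemon=True) for _ in range(workers)]
    for t in threads:
        t.start()
    time.sleep(warmup_s)              # let the service reach steady state
    sigterm_at = time.monotonic()
    proc.send_signal(signal.SIGTERM)
    time.sleep(after_s)               # keep load on through the whole shutdown
    stop.set()
    for t in threads:
        t.join()
    exit_code = proc.wait(timeout=30)
    with lock:
        after = [ok for ts, ok in results if ts >= sigterm_at]
    return {
        "exit_code": exit_code,
        "requests_after_sigterm": len(after),
        "failures_after_sigterm": after.count(False),
    }
```

This covers only the first and last checklist items; duplicate side effects and lingering database connections have to be checked on the downstream systems.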

Each of these is a specific failure mode that maps to a specific step. If you see "connection closed" errors, in-flight requests were cut off before they finished (Step 2). If you see duplicate side effects, your consumer or worker pool did not drain cleanly (Steps 3 and 4). The symptom tells you which step you skipped.

The blast radius of getting it wrong

None of these failure modes is catastrophic in isolation: a 503 here, a duplicate webhook there. Graceful shutdown matters because deployments compound the errors: if you deploy ten times a day and each deployment produces a small spike of 503s, those spikes add up to a measurable fraction of your error budget. Worse, the failures are correlated with deployments, which hides them in averages; they become visible only when incidents are lined up against deploy times.
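The arithmetic is easy to check with illustrative numbers (assumptions, not measurements). The sketch below assumes every request routed to a draining instance during the LB lag fails; in a rolling deploy over N instances that each receive 1/N of the traffic, that adds up to about one lag window of the service's total traffic per deploy. `deploy_failure_share` is a hypothetical helper.

```python
def deploy_failure_share(rps, lag_s, deploys_per_day, slo):
    """Fraction of a daily error budget consumed by deploy-time failures.

    rps: total requests per second across the service
    lag_s: seconds of LB lag per restarted instance
    slo: availability target, e.g. 0.999
    """
    failures_per_deploy = rps * lag_s          # N instances * (rps / N) * lag
    daily_failures = failures_per_deploy * deploys_per_day
    daily_budget = (1 - slo) * rps * 86_400    # allowed failures per day
    return daily_failures / daily_budget
```

At 100 requests per second, a 2-second lag, ten deploys a day, and a 99.9% SLO, deploys alone cause 2,000 failures against a daily budget of 8,640, about 23% of it.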

The investment to do this right is small and one-time per service. The improvement is permanent and shows up immediately in your deploy-time error metrics.

We use the pattern across the four developer APIs we run (DocuMint, CronPing, FlagBit, and WebhookVault). Each has the same six-step shutdown handler, parameterized for its specific resources. Deployments are silent in the dashboards even under load.