Postgres pg_cancel_backend vs pg_terminate_backend: Stopping Queries Without Breaking the Application
pg_cancel_backend interrupts a query while keeping the connection alive. pg_terminate_backend kills the connection entirely. The distinction matters during incidents and the wrong choice can convert a slow query into a stampede of reconnects.
The two backend-control functions in Postgres look superficially similar. Both take a process ID. Both can be called by a superuser or a member of the pg_signal_backend role. Both are used during incidents to stop a query that is misbehaving. The difference is what they do to the connection.
pg_cancel_backend(pid) sends SIGINT to the named backend process. The current query is interrupted with a "canceling statement due to user request" error and rolled back if it was inside a transaction. The connection itself remains open. The client receives an error and can immediately issue another query on the same connection.
pg_terminate_backend(pid) sends SIGTERM. The backend process is killed entirely. The connection drops. The client sees a connection-broken error and must reconnect before issuing another query. If the application uses a connection pool, the pool detects the dead connection on its next use and removes it.
When cancel is the right answer
The right default during an incident is pg_cancel_backend. The query is the problem. The connection is not. Cancel stops the work without forcing the application to reconnect, which is usually cheap and sometimes operationally significant.
If a long-running analytics query is blocking autovacuum or holding a lock that other queries need, cancel stops the query and releases the lock. The application that issued the query receives an error it can handle (most ORMs surface this as a generic query exception). If the application is well-written it logs the error and continues with the next request.
If the application is poorly written and treats a cancel error as a fatal connection problem, it may discard the connection on its own. The application bug shows up in pool churn metrics rather than in the database. The fix is in the application, not in the choice of cancel vs terminate.
When terminate is the right answer
Terminate is for connections that should not exist. The canonical case is an idle-in-transaction session that has been sitting on a transaction snapshot for hours, holding locks and blocking vacuum. The query is not running — there is no query to cancel. The transaction is open and the connection is consuming resources. Terminate kills the connection and resolves the problem.
Another case is a backend that is stuck in a state pg_cancel_backend does not address. Some C extension code paths do not check for the cancel signal at safe interruption points and the cancel has no effect. Terminate forces the issue at the kernel level. The cost is that other clean queries on the same connection are also lost, but if the connection is in a state where cancel does not work, the connection is already broken from a usefulness standpoint.
The idle_in_transaction_session_timeout configuration parameter automates the terminate decision for the most common case. Setting it to something like 60 seconds in production prevents most of the manual incidents that would otherwise require pg_terminate_backend.
The escalation pattern
The right discipline during an incident is to escalate. Try cancel first. If cancel returns true and the query state in pg_stat_activity changes to idle within a few seconds, the query was interrupted cleanly. If the query is still running after a sensible interval, escalate to terminate.
The functions return a boolean. True means the signal was delivered. False means the pid was invalid or the caller did not have permission. Neither return value guarantees the desired outcome. The actual evidence that a cancel worked is in pg_stat_activity, where the state column should change from "active" to "idle" and the query column should clear.
A script that operates on a list of misbehaving pids should send cancel, wait a few seconds, check pg_stat_activity, and only escalate to terminate for the pids that did not respond. This produces a much smaller blast radius than terminating everything immediately.
The reconnect storm risk
Terminating many connections at once produces a reconnect storm. If the application uses a connection pool with a fixed maximum, the pool tries to immediately replace the lost connections. If the same set of misbehaving connections are restored by the pool and immediately re-execute the same query, the underlying problem is not solved.
The risk is largest when terminating broad classes of connections during an incident. An admin who runs pg_terminate_backend in a loop across all idle-in-transaction connections may briefly improve database health and then see the same connections return within seconds with the same transactions reopened by an application that has not been fixed.
The mitigation is to identify the application or code path that is producing the broken connections, fix it or restart it, and only then clean up the orphaned database state. Terminate is a tactical tool for cleaning up state, not a strategic tool for fixing application bugs.
What the functions do not solve
Neither function can recover a query that has already done unwanted work. If a long-running UPDATE has already touched a million rows and is partway through committing, canceling rolls back the work in progress but does not undo any commit-visible side effects from earlier transactions in the same session.
Neither function can cancel an autovacuum process. Autovacuum workers are also backends and have process IDs, but pg_cancel_backend on an autovacuum worker is generally ineffective and pg_terminate_backend kills the worker without preventing autovacuum from launching another one on the same table immediately. The configuration knob for stopping autovacuum is in postgresql.conf or per-table storage parameters, not in runtime signals.
Neither function helps with queries running on replicas if the goal is to free locks on the primary. The primary's locks are held by the primary's backends, which are separate from the replica's backends. Cross-cluster lock cleanup requires action on the right cluster.
The audit trail question
Both functions are logged at the WARNING level by default, which usually means they appear in the postgres log but do not generate alerts. In an environment where these functions are called frequently during incidents, the log volume can be substantial.
A discipline that catches missed cancel-then-terminate escalations is to tag the cancel and terminate operations with the operator name in pg_stat_activity application_name and to log the diagnostic query that motivated the cancel. The audit trail then contains both the action and the reason, which is much more useful during incident reviews than the action alone.
The two functions are small but operationally consequential. Cancel for the query that is misbehaving and the connection that should keep working. Terminate for the connection that should not exist. The escalation discipline of try-cancel-then-terminate prevents most of the reconnect-storm risk that terminating broadly produces.
Read more essays and technical writing at anethoth.com — a notebook on databases, distributed systems, biology, and the engineering that holds the world together.