ClickHouse Memory Limits: When Background Merges Crash the Server

ClickHouse is built around merging. New data arrives as small data parts; older parts get merged into larger ones in the background; query performance stays roughly constant because the number of parts a query has to scan stays bounded. Merging is the slow, continuous price the system pays for fast inserts and fast scans.

Most of the time, merging happens invisibly. When it does not, the failure mode is unusually dramatic: a merge hits the server's memory ceiling, the process exits, the orchestrator restarts the container, the next merge attempt fails the same way, and the database goes into a restart loop that does not resolve until somebody changes the memory configuration.

What MEMORY_LIMIT_EXCEEDED actually means in ClickHouse

ClickHouse tracks memory allocations through its own allocator wrapper. Every significant allocation goes through a MemoryTracker that knows about the current usage and the configured limit. If an allocation would push usage above the limit, the allocation throws an exception with code 241 (MEMORY_LIMIT_EXCEEDED) and the operation that requested the memory fails.
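The check-before-allocate behaviour described above can be illustrated with a toy model. This is a minimal sketch, not ClickHouse's implementation: the class and method names (ToyMemoryTracker, alloc, free) are hypothetical, and only the error code 241 comes from the text.

```python
class MemoryLimitExceeded(Exception):
    """Toy stand-in for ClickHouse error code 241 (MEMORY_LIMIT_EXCEEDED)."""
    code = 241


class ToyMemoryTracker:
    """Illustrative model only; not ClickHouse's actual MemoryTracker."""

    def __init__(self, limit_bytes):
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def alloc(self, size_bytes):
        # Refuse the allocation up front if it would push usage past the limit.
        wanted = self.used_bytes + size_bytes
        if wanted > self.limit_bytes:
            raise MemoryLimitExceeded(
                f"would use {wanted} bytes, limit is {self.limit_bytes}"
            )
        self.used_bytes = wanted

    def free(self, size_bytes):
        self.used_bytes = max(0, self.used_bytes - size_bytes)


tracker = ToyMemoryTracker(limit_bytes=1000)
tracker.alloc(600)          # fits: 600 of 1000 bytes in use
try:
    tracker.alloc(500)      # 600 + 500 > 1000, so the request is refused
except MemoryLimitExceeded as exc:
    print("allocation refused:", exc)
```

The point of the model is that the refused request leaves the tracked usage unchanged, which is why a failed query is survivable: whatever asked for the memory fails, and everything else carries on.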

This is normally a survivable failure. A query that asks for too much memory gets cancelled and the error propagates to the client. The server keeps running.

The dangerous case is when the failing operation is a background merge. Background merges are not user-initiated, so the exception does not propagate to anything that can decide to abort. The merge task fails inside the background pool, the source parts are left unmerged, and the next attempt to merge the same parts hits the same memory ceiling and fails the same way. If the failure is consistent enough, the server's reaction to repeated background failures escalates to process exit, which becomes a container restart in a Docker deployment.

The default memory ceiling

ClickHouse's default max_server_memory_usage is derived from the host RAM via max_server_memory_usage_to_ram_ratio, which defaults to 0.9. The runtime detects the cgroup memory limit if one is set and uses that as the effective host RAM. For a containerized deployment with a 1.5 GB cgroup limit, the effective max_server_memory_usage works out to roughly 1.35 GB.
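The arithmetic behind that figure is short enough to write out. The sketch below assumes binary units (1 GB treated as 1024^3 bytes); the function name is hypothetical and the 0.9 ratio is the default quoted above.

```python
GIB = 1024 ** 3


def default_server_memory_limit(effective_ram_bytes, ratio=0.9):
    """Effective ceiling when no explicit max_server_memory_usage is set."""
    return int(effective_ram_bytes * ratio)


cgroup_limit = int(1.5 * GIB)              # the container's memory limit
limit = default_server_memory_limit(cgroup_limit)
print(f"{limit / GIB:.2f} GiB")            # prints "1.35 GiB"
```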

For small workloads, 1.35 GB is fine. For workloads that accumulate enough small parts that a single merge has to allocate hundreds of megabytes of buffers, it is too low. The first merge that crosses the ceiling produces MEMORY_LIMIT_EXCEEDED, the merge fails, and the parts stay unmerged. The next merge attempt tries to merge the same data, allocates the same buffers, and fails the same way.

The container restart loop is then a symptom, not a cause. The container is restarting because the ClickHouse process exits on its own after repeated background failures, not because the OS is killing it under memory pressure. The kernel OOM killer is not involved; ClickHouse is shutting itself down.

Three things that fix it

The first fix is the obvious one: raise the memory ceiling. The configuration setting is max_server_memory_usage, expressed in bytes. The right value depends on the container's cgroup limit and the host's available RAM. On a 7.6 GB host with a 3 GB container limit, 2.5 GB for max_server_memory_usage leaves 500 MB for allocations that bypass the tracker (some thread stacks, mmap mappings, non-server processes inside the container).
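As a worked example of that sizing, the byte value can be derived from the container limit and the headroom you want to keep. This is a sketch under the assumption that the figures are binary units (3 GiB limit, 512 MiB headroom); the helper name is hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def server_memory_setting(container_limit_bytes, headroom_bytes):
    """Bytes to put in max_server_memory_usage, leaving headroom for
    allocations the tracker does not see."""
    if headroom_bytes >= container_limit_bytes:
        raise ValueError("headroom must be smaller than the container limit")
    return container_limit_bytes - headroom_bytes


value = server_memory_setting(3 * GIB, 512 * MIB)
print(value)   # 2684354560, i.e. 2.5 GiB
```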

The second fix is to reduce the memory cost of a single merge. The merge_tree.merge_max_block_size setting controls the number of rows ClickHouse processes per merge block. Lowering it from the default 8192 to 2048 reduces the per-block memory footprint of each merge roughly in proportion, at a small cost to merge throughput. For a server that has been crashing on merges, this is the change with the highest leverage.

The third fix, which is sometimes the right call but more often a band-aid, is to free up disk space. ClickHouse's merge scheduling considers both the number of parts and the disk space available. If the disk is mostly full, the merge scheduler will be aggressive about merging old data to reclaim space, which produces larger merges that need more memory. Adding disk headroom does not directly address the memory problem, but it does reduce the frequency of expensive merges.

Configuration in practice

The standard ClickHouse pattern for memory configuration is a file in /etc/clickhouse-server/config.d/. The contents look like this:

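<clickhouse>
    <!-- Explicit server-wide ceiling in bytes: 2.5 GiB -->
    <max_server_memory_usage>2684354560</max_server_memory_usage>
    <!-- 0 turns off the ratio-based calculation so the byte value above is used -->
    <max_server_memory_usage_to_ram_ratio>0</max_server_memory_usage_to_ram_ratio>
    <merge_tree>
        <!-- Rows per merge block, lowered from the default 8192 -->
        <merge_max_block_size>2048</merge_max_block_size>
    </merge_tree>
</clickhouse>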

Setting max_server_memory_usage_to_ram_ratio to 0 disables the ratio-based calculation and forces ClickHouse to use the explicit byte value. Without this, the explicit value can be overridden by the ratio calculation if the ratio happens to compute a lower number.

Two configuration mistakes are easy to make. The first is overriding background_pool_size without considering its interaction with number_of_free_entries_in_pool_to_execute_mutation. The default for the latter is 20, and the sanity check at startup requires background_pool_size * background_merges_mutations_concurrency_ratio >= number_of_free_entries_in_pool_to_execute_mutation. Lowering background_pool_size to a small number without also lowering the mutation entries setting produces a startup error with code BAD_ARGUMENTS.
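The startup check can be rehearsed before deploying a config change. The sketch below simply restates the inequality from the paragraph above; the function name is hypothetical, the default of 20 comes from the text, and the defaults of 16 for background_pool_size and 2 for background_merges_mutations_concurrency_ratio are assumptions to verify against your ClickHouse version.

```python
def mutation_pool_settings_ok(
    background_pool_size=16,        # assumed default, verify for your version
    concurrency_ratio=2,            # assumed default, verify for your version
    free_entries_for_mutation=20,   # default quoted in the text
):
    """True if the settings satisfy the startup sanity check."""
    capacity = background_pool_size * concurrency_ratio
    return capacity >= free_entries_for_mutation


print(mutation_pool_settings_ok())                         # True: 32 >= 20
print(mutation_pool_settings_ok(background_pool_size=4))   # False: 8 < 20
print(mutation_pool_settings_ok(background_pool_size=4,
                                free_entries_for_mutation=8))  # True: 8 >= 8
```

The second call is the mistake described above: shrinking the pool on its own. The third shows the paired change that keeps the server able to start.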

The second mistake is setting max_server_memory_usage above the cgroup memory limit. ClickHouse will allocate up to the configured value, but the kernel kills the process once its memory usage exceeds the cgroup limit. The kill is silent from ClickHouse's perspective and produces a generic container restart with no useful error in the ClickHouse logs.
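A small pre-flight check can catch this mistake. The sketch below assumes cgroup v2, where the limit is exposed as the contents of a memory.max file (typically /sys/fs/cgroup/memory.max inside the container) holding either a byte count or the literal string "max". The function names are hypothetical, and the file contents are passed in as text so the logic can be tried anywhere.

```python
def parse_cgroup_v2_limit(memory_max_text):
    """Return the limit in bytes, or None if the cgroup is unlimited."""
    text = memory_max_text.strip()
    return None if text == "max" else int(text)


def setting_exceeds_cgroup(max_server_memory_usage, memory_max_text):
    """True if the configured ceiling leaves no room under the cgroup limit."""
    limit = parse_cgroup_v2_limit(memory_max_text)
    return limit is not None and max_server_memory_usage >= limit


three_gib = "3221225472\n"                                 # a 3 GiB limit
print(setting_exceeds_cgroup(2684354560, three_gib))       # False: 2.5 GiB fits
print(setting_exceeds_cgroup(4 * 1024 ** 3, three_gib))    # True: 4 GiB does not
print(setting_exceeds_cgroup(4 * 1024 ** 3, "max\n"))      # False: no limit set
```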

What this tells us about ClickHouse

The deeper observation is that ClickHouse's resource model is dominated by the background workload, not the foreground query workload. A naive analysis based on query response times suggests that ClickHouse should scale memory based on concurrent query count and query complexity. The actual scaling is closer to: memory grows with the rate of inserts, because the rate of inserts determines the rate of merges, and merges are the largest memory consumers in steady-state operation.

This is the opposite of the usual database model where memory primarily serves the buffer cache and connection state. ClickHouse's buffer cache is more passive (the OS page cache does most of the work) and connections are cheap. The memory budget goes to merges. Sizing a ClickHouse deployment based on query metrics misses the actual binding constraint.

The practical implication is that ClickHouse deployments need to be sized for steady-state merge load, with explicit configuration for max_server_memory_usage and merge_max_block_size that matches the container's resource envelope. The defaults are designed for a bare-metal server with plenty of memory. In a 1-2 GB container, the defaults are wrong in ways that produce restart loops, not graceful degradation.
