Why Your Background Jobs Silently Stop: The Redis Memory Eviction Trap

Your Sidekiq queue has been dropping jobs for three days. No errors. No alerts. The jobs simply never ran. Meanwhile Redis is running fine — memory stable, no crashes, no connection errors. Your application code enqueued the jobs successfully. The problem is that Redis deleted them before the worker had a chance to process them.

This is the allkeys-lru failure mode. It is quiet, it is intermittent, and it hits exactly when your system is under the most load — when you most need it to work correctly.

Why allkeys-lru is wrong for job queues

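Redis has eight eviction policies controlling what happens when memory reaches the maxmemory limit. Redis itself ships with noeviction, but instances provisioned as caches, including many managed offerings and shared configs, commonly run an LRU policy instead. allkeys-lru is the usual one: when memory is full, evict the (approximately) least-recently-used key from the entire keyspace.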

For a cache, this is correct behavior. Cache keys that haven't been accessed recently are good candidates for eviction — a cache miss just means a slower response, not lost data. The application can fetch the value again.

A job queue is not a cache. When Redis evicts a Sidekiq job, that job is gone. The application that enqueued it gets no error. The worker that would have processed it gets no error. The work simply does not happen. Depending on what the job was supposed to do — send an email, process a payment, update a record — the silence is the failure.

The same problem applies to Bull, BullMQ, Celery with Redis as broker, Resque, and any other queue built on Redis lists, sorted sets, or streams. They all store queue data as ordinary Redis keys. Under allkeys-lru, those keys are eviction candidates alongside everything else.
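To make the failure concrete, here is a toy model of an LRU-evicting store. It is a sketch, not Redis: the class `ToyLRUStore`, the key names, and the key-count limit are all hypothetical, and real Redis limits bytes and only approximates LRU by sampling. The point it illustrates is that the write that triggers eviction succeeds, and the job key disappears without any error.

```python
from collections import OrderedDict


class ToyLRUStore:
    """Toy key-value store that evicts the least-recently-used key when full.

    Hypothetical model only: real Redis limits bytes, not key counts, and
    samples keys to approximate LRU.
    """

    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.data = OrderedDict()   # oldest (coldest) key first
        self.evicted = []           # record of what was dropped

    def set(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        while len(self.data) > self.max_keys:
            old_key, _ = self.data.popitem(last=False)  # drop the coldest key
            self.evicted.append(old_key)

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # a read makes the key "recently used"
        return self.data[key]


store = ToyLRUStore(max_keys=3)
store.set("job:1", "send_welcome_email")  # enqueued, waiting for a worker
store.set("cache:a", "...")
store.set("cache:b", "...")
store.get("cache:a")                      # cache traffic keeps cache keys warm
store.get("cache:b")
store.set("cache:c", "...")               # store is full: the coldest key goes

print(store.get("job:1"))  # None: the job vanished, and set() never raised
print(store.evicted)       # ['job:1']
```

The waiting job is the coldest key precisely because nothing touches it until a worker picks it up.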

The symptoms that mislead

The failure mode looks nothing like what you expect from a broken queue. You expect errors. You get silence.

Redis memory usage stays stable — that's the point of eviction. Redis is working correctly according to its configuration. Your monitoring shows no Redis errors, no timeouts, no connection failures. Your application log shows jobs being enqueued successfully. Your worker logs show jobs being processed successfully — the ones that weren't evicted. The jobs that were evicted never appear in any log, because no component ever knew they were gone.

The tell, when you find it, is in the numbers: enqueue rate minus dequeue rate has been positive for days, yet your queue length appears stable or decreasing. Jobs are leaving the queue faster than your workers are processing them. They're being evicted.
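The arithmetic behind that tell can be written down as a conservation check. This is a minimal sketch with hypothetical numbers and a hypothetical function name; it assumes you already collect enqueue and processed counts and queue depth from your own metrics, and it ignores retries and in-flight jobs. The evicted_keys counter in Redis's INFO stats output is a useful cross-check: if it is climbing on a queue instance, keys are being evicted.

```python
def unaccounted_jobs(enqueued, processed, depth_start, depth_end):
    """Jobs that left the queue without a worker ever processing them.

    Conservation over one time window:
        depth_end = depth_start + enqueued - processed - lost
    so  lost      = depth_start + enqueued - processed - depth_end
    """
    return depth_start + enqueued - processed - depth_end


# Hypothetical numbers for a three-day window.
lost = unaccounted_jobs(
    enqueued=120_000,
    processed=110_000,
    depth_start=500,
    depth_end=400,
)
print(lost)  # 10100
```

A healthy queue gives a result near zero. A persistently positive result means jobs are disappearing somewhere other than your workers.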

The fix

Two changes, in this order:

1. Set maxmemory explicitly. Redis with no maxmemory set will use all available RAM until the system OOM-kills it. Set it to roughly 80% of the memory dedicated to this Redis instance. This leaves headroom for Redis's own overhead and avoids the OOM scenario, while hitting the limit stays a last resort, not a routine operation.

2. Set maxmemory-policy noeviction. With noeviction, Redis returns an error when a write command would exceed the memory limit. Your application code will see the error. Your worker will see the error. The job will not be silently deleted. It will fail loudly, which is the correct behavior for a queue. You can then decide whether to shed load, scale memory, or add a backpressure mechanism.

# In redis.conf (comments must sit on their own line) or via CONFIG SET
# Roughly 80% of an 8GB instance
maxmemory 6400mb
maxmemory-policy noeviction
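Once writes can fail, the enqueue path needs to do something with the failure. The following is a sketch of one possible backpressure mechanism, not a prescribed one. `push` is a hypothetical callable standing in for whatever your client or queue library uses to write the job, and `looks_like_oom` is a heuristic that assumes the client surfaces Redis's error text ("OOM command not allowed when used memory > 'maxmemory'.").

```python
import time


def looks_like_oom(exc):
    """Heuristic: Redis's out-of-memory reply mentions 'maxmemory'."""
    return "maxmemory" in str(exc)


def enqueue_with_backoff(push, payload, attempts=3, base_delay=0.1,
                         sleep=time.sleep):
    """Call push(payload); on an OOM-style error wait and retry, then re-raise.

    `push` is any callable that writes the job to Redis (hypothetical here).
    """
    for attempt in range(attempts):
        try:
            return push(payload)
        except Exception as exc:
            if not looks_like_oom(exc) or attempt == attempts - 1:
                raise  # unrelated error, or out of retries: fail loudly
            sleep(base_delay * 2 ** attempt)  # exponential backoff


# Usage with a stand-in push that fails twice before succeeding.
calls = {"n": 0}


def fake_push(payload):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError(
            "OOM command not allowed when used memory > 'maxmemory'."
        )
    return "queued"


delays = []
result = enqueue_with_backoff(fake_push, {"job": "send_email"},
                              sleep=delays.append)
print(result, delays)  # queued [0.1, 0.2]
```

Retrying only buys time for workers to drain the queue. If the final attempt still fails, the exception propagates so the caller can shed load or alert.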

Monitoring the config you set

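Verify the configuration took effect and watch the margins:

redis-cli INFO memory | grep -E '^(used_memory|used_memory_rss|maxmemory|maxmemory_policy):'

The important ratio is used_memory / maxmemory, because used_memory is the figure Redis checks against the limit. used_memory_rss is what the operating system sees, fragmentation included, so watch it against the instance's physical memory as well. When used_memory / maxmemory approaches 0.9, you need more memory or fewer keys, not a more aggressive eviction policy. If you find yourself wanting to switch to allkeys-lru to relieve memory pressure on a job queue, the actual problem is that the queue is too deep, which is a capacity problem, not a Redis configuration problem.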
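The same check can be scripted so it runs on a schedule instead of by hand. This sketch parses text in the shape of INFO memory output and reports both used_memory and used_memory_rss against maxmemory. The function names, the sample values, and the 0.9 default threshold wiring are illustrative assumptions. How you fetch the INFO text (redis-cli, a client library) is left out.

```python
def parse_info(text):
    """Parse 'key:value' lines from INFO output, skipping '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def memory_report(info, warn_at=0.9):
    """Compare used_memory and used_memory_rss against maxmemory."""
    limit = int(info["maxmemory"])
    if limit == 0:
        # maxmemory 0 means "no limit": there is no ratio to compute.
        return {"limited": False, "policy": info.get("maxmemory_policy")}
    used_ratio = int(info["used_memory"]) / limit
    rss_ratio = int(info["used_memory_rss"]) / limit
    return {
        "limited": True,
        "policy": info.get("maxmemory_policy"),
        "used_ratio": round(used_ratio, 3),
        "rss_ratio": round(rss_ratio, 3),
        "warn": max(used_ratio, rss_ratio) >= warn_at,
    }


# Hypothetical sample in the shape of `redis-cli INFO memory` output.
SAMPLE = """# Memory
used_memory:5704253440
used_memory_rss:6241124352
maxmemory:6710886400
maxmemory_policy:noeviction
"""

report = memory_report(parse_info(SAMPLE))
print(report)
```

Reporting the policy alongside the ratios catches the case where someone changes maxmemory-policy at runtime with CONFIG SET and the change never reaches redis.conf.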

The related policies that are also wrong

volatile-lru evicts only keys with a TTL set. This sounds safer for queues, since most queue keys don't have TTLs. But if any of your queue libraries do set TTLs (some do for delayed jobs), those keys become eviction candidates again. The policy boundary is fragile.

volatile-ttl evicts the key closest to expiration. Same problem: if your queuing library uses TTLs for any purpose, the policy becomes unpredictable.

The only safe policy for a job queue Redis is noeviction (fail loud). allkeys-random is no better than allkeys-lru: it evicts queue keys just as silently, and the most it offers is a failure you can detect by monitoring key counts. noeviction is the right default.
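The policy families can be summarized as a small lookup: which keys each maxmemory-policy value is allowed to evict. The table covers the eight policy names Redis accepts. The helper names and the scope labels are made up for this sketch.

```python
# The eight maxmemory-policy values, grouped by which keys they may evict.
EVICTION_SCOPE = {
    "noeviction": "none",
    "allkeys-lru": "all",
    "allkeys-lfu": "all",
    "allkeys-random": "all",
    "volatile-lru": "ttl_only",
    "volatile-lfu": "ttl_only",
    "volatile-random": "ttl_only",
    "volatile-ttl": "ttl_only",
}


def can_evict(policy, key_has_ttl):
    """True if `policy` may evict a key, given whether the key has a TTL."""
    scope = EVICTION_SCOPE[policy]
    if scope == "none":
        return False
    if scope == "all":
        return True
    return key_has_ttl  # volatile-* policies only touch keys with a TTL


def policies_that_can_drop(key_has_ttl):
    """All policies under which a queue key of this kind can be evicted."""
    return sorted(p for p in EVICTION_SCOPE if can_evict(p, key_has_ttl))


print(policies_that_can_drop(key_has_ttl=False))
# ['allkeys-lfu', 'allkeys-lru', 'allkeys-random']
print(len(policies_that_can_drop(key_has_ttl=True)))  # 7
```

A queue key with a TTL is exposed under seven of the eight policies, which is why the volatile-* boundary only protects you as long as your queue library never sets one.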

What this doesn't solve

Three things that remain unsolved after fixing eviction:

Persistence on restart. Redis with appendonly no and no RDB snapshot will lose all queue data on restart. If your job queue cannot tolerate this, configure AOF or RDB persistence — separate concern, same data loss risk.
Split-brain in Redis Sentinel. A Sentinel failover that promotes an out-of-date replica to primary can drop recently enqueued jobs and bring back jobs that the original primary had already handed to workers, so some work is lost and some runs twice. Idempotent job handling is the application's responsibility.
Queue depth as a symptom. noeviction converts silent data loss into loud errors. If those errors are frequent, the underlying issue is that your workers can't keep up with enqueue rate — a scaling problem that eviction policy cannot fix and will not hide after this change.
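For the Sentinel item above, idempotent handling usually comes down to remembering which job IDs have already completed. A minimal sketch, assuming every job carries a stable unique ID: the class name is hypothetical, and the in-memory set stands in for a durable store kept outside the queue's Redis, which a real deployment would need.

```python
class IdempotentRunner:
    """Run each job ID at most once per successful completion."""

    def __init__(self):
        # Assumption: in-memory for illustration only. A real system needs a
        # durable record that survives restarts and failovers.
        self.completed = set()

    def run(self, job_id, handler, *args):
        """Run handler(*args) unless job_id already completed.

        Returns True if the handler ran, False if the job was a duplicate.
        """
        if job_id in self.completed:
            return False
        handler(*args)
        # Recorded only after success, so a crashed handler can be retried.
        self.completed.add(job_id)
        return True


sent = []
runner = IdempotentRunner()
print(runner.run("job-42", sent.append, "welcome email"))  # True
print(runner.run("job-42", sent.append, "welcome email"))  # False (replayed)
print(sent)  # ['welcome email']
```

Recording completion after the handler returns gives at-least-once behavior when a handler crashes midway, so the handler's own side effects still need to tolerate a partial earlier run.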

The silence was the bug. The noise is the fix.

