Background Jobs Without a Job Queue
You don't need Redis, RabbitMQ, or Sidekiq for most background work. A 60-line Python loop and a SQLite table will carry you further than you think.
Every tutorial about background jobs starts the same way: install Redis, install Sidekiq or Celery or BullMQ, configure a worker pool, set up Sentinel for high availability, learn the YAML schema. Two thousand lines of infrastructure later, you have the ability to run a function asynchronously.
For four production APIs handling tens of thousands of jobs per day, we use approximately sixty lines of Python and a SQLite table. It has not failed in months. Here is what that looks like and why it works.
The schema
One table:
CREATE TABLE jobs (
id INTEGER PRIMARY KEY,
kind TEXT NOT NULL,
payload TEXT NOT NULL,
status TEXT NOT NULL DEFAULT 'pending',
attempts INTEGER NOT NULL DEFAULT 0,
max_attempts INTEGER NOT NULL DEFAULT 5,
scheduled_for TEXT,
locked_at TEXT,
locked_by TEXT,
result TEXT,
error TEXT,
created_at TEXT DEFAULT CURRENT_TIMESTAMP,
completed_at TEXT
);
CREATE INDEX idx_jobs_pending ON jobs(status, scheduled_for) WHERE status='pending';
The partial index is critical: 99% of rows over time are completed, and we never want to scan them when looking for work. The index only covers pending rows, so it stays small forever.
Enqueue is a single insert
def enqueue(kind: str, payload: dict, scheduled_for: datetime | None = None):
conn.execute(
"INSERT INTO jobs (kind, payload, scheduled_for) VALUES (?, ?, ?)",
(kind, json.dumps(payload), scheduled_for.isoformat() if scheduled_for else None)
)
conn.commit()
Schedule for the future by setting scheduled_for. The worker won't pick it up until that time. This single feature replaces 90% of why people reach for cron — recurring jobs become a pattern of "when this job finishes, enqueue the next one with scheduled_for = now + 1 hour."
The worker is a poll loop
def claim_one(worker_id: str) -> Optional[Job]:
cur = conn.execute("""
UPDATE jobs SET
status='running',
locked_at=CURRENT_TIMESTAMP,
locked_by=?,
attempts=attempts+1
WHERE id = (
SELECT id FROM jobs
WHERE status='pending'
AND (scheduled_for IS NULL OR scheduled_for <= CURRENT_TIMESTAMP)
ORDER BY id LIMIT 1
)
RETURNING id, kind, payload, attempts, max_attempts
""", (worker_id,))
row = cur.fetchone()
conn.commit()
return Job(*row) if row else None
SQLite supports UPDATE ... RETURNING as of 3.35 (March 2021). The atomic update-with-subquery is the entire concurrency control mechanism: two workers running this query simultaneously cannot claim the same job, because the UPDATE locks the row exclusively until commit. No advisory locks, no FOR UPDATE SKIP LOCKED, no Redis BRPOPLPUSH gymnastics.
The poll loop calls this every two seconds. If it returns nothing, sleep. If it returns a job, run it.
Retries are exponential and bounded
def fail(job: Job, error: str):
if job.attempts >= job.max_attempts:
conn.execute(
"UPDATE jobs SET status='dead', error=? WHERE id=?",
(error, job.id)
)
else:
backoff = min(60 * 2 ** (job.attempts - 1), 3600)
retry_at = datetime.utcnow() + timedelta(seconds=backoff)
conn.execute(
"UPDATE jobs SET status='pending', scheduled_for=?, error=? WHERE id=?",
(retry_at.isoformat(), error, job.id)
)
conn.commit()
First retry after one minute, second after two, third after four, capped at one hour. After max_attempts failures the job moves to dead status and stops retrying. We have a daily cron that emails a summary of dead jobs — most of the time the list is empty, and when it isn't, the cause is usually obvious.
Handling worker death
One failure mode the simple version misses: a worker dies in the middle of a job. The row is stuck in running with a locked_at timestamp from twenty minutes ago, but no worker is alive. Every minute we sweep:
UPDATE jobs SET status='pending', locked_by=NULL, locked_at=NULL
WHERE status='running' AND locked_at < datetime('now', '-10 minutes');
Ten minutes is longer than any of our jobs are supposed to take. If a real job is still running and gets reclaimed, that's a bug — set the timeout high enough to be impossible in practice but low enough that recovery is fast. Idempotent jobs (which yours should be) tolerate the rare double-execution gracefully.
Observability is just SQL
The dashboard is one query:
SELECT
status,
COUNT(*) as n,
AVG(julianday(completed_at) - julianday(created_at)) * 86400 as avg_seconds
FROM jobs
WHERE created_at > datetime('now', '-24 hours')
GROUP BY status;
Job counts by kind, latency distributions, dead job inspection — all SQL. The same database your application reads is the database your job queue reads. There is no second source of truth, no synchronization, no Redis-not-matching-Postgres weirdness.
When to graduate
This pattern works well up to a single worker host doing thousands of jobs per minute. Beyond that, you need horizontal scaling, and SQLite's single-writer model becomes a bottleneck. Three signals to upgrade:
- Workers are spending real time waiting for the database lock during enqueue or completion.
- The
jobstable is large enough that the partial index no longer fits in memory. - You need cross-host fan-out and a single SQLite file is no longer the unifying point.
The migration path is mechanical: switch to PostgreSQL with SELECT FOR UPDATE SKIP LOCKED, or to a real queue like Redis Streams or RabbitMQ. The schema and the worker loop translate directly. Most teams never reach this threshold; they assume they will and over-engineer from day one.
What we run
Across CronPing's alert dispatcher, DocuMint's PDF rendering pipeline, FlagBit's flag-evaluation cache warmer, and WebhookVault's replay engine, the same pattern is in production. Sixty lines, one SQLite table, sixty thousand jobs a day, no operational overhead.
The boring infrastructure is almost always enough. Reach for Redis when you have measured a problem. Until then, the simplest thing that could possibly work usually works.