Building a Job Queue Without Redis or RabbitMQ

Most job queue tutorials reach for Redis on page one and RabbitMQ on page two. They are not wrong — both are excellent at what they do — but they are heavy. Each adds a process to operate, a network hop to debug, a configuration file to forget about, and a class of failure modes that didn't exist before you adopted them.

For most small APIs, none of that complexity is necessary. You can build a durable, retryable, observable job queue in a single SQLite table and a worker loop. Here is how, and when you should reach for the heavier tools.

The core schema

One table is enough:

CREATE TABLE jobs (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  type TEXT NOT NULL,
  payload TEXT NOT NULL,
  status TEXT NOT NULL DEFAULT 'pending',
  attempts INTEGER NOT NULL DEFAULT 0,
  max_attempts INTEGER NOT NULL DEFAULT 3,
  scheduled_for TIMESTAMP NOT NULL,
  locked_at TIMESTAMP,
  locked_by TEXT,
  last_error TEXT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
  completed_at TIMESTAMP
);

CREATE INDEX idx_jobs_pending ON jobs(status, scheduled_for) WHERE status='pending';

The columns map directly to the operations a queue needs: store work (type, payload), track progress (status, attempts), schedule retries (scheduled_for), prevent double-processing (locked_at, locked_by), and capture failures (last_error).

Enqueuing

def enqueue(job_type: str, payload: dict, run_at: datetime = None):
    run_at = run_at or datetime.utcnow()
    db.execute(
        "INSERT INTO jobs (type, payload, scheduled_for) VALUES (?, ?, ?)",
        (job_type, json.dumps(payload), run_at)
    )
    db.commit()

That is it. The job exists. It will be picked up by the next worker that comes around. If your API process crashes before the worker runs, the job survives in the database. If your worker crashes mid-job, the lock expires and another worker takes it. SQLite's durability does the heavy lifting.

Worker loop

The worker is a loop that claims a job, runs it, and updates the status:

WORKER_ID = f"worker-{uuid.uuid4()}"
LOCK_TIMEOUT = 300  # seconds — kill long-stuck jobs

def claim_job():
    now = datetime.utcnow()
    expired = now - timedelta(seconds=LOCK_TIMEOUT)
    cur = db.execute("""
      UPDATE jobs SET locked_at=?, locked_by=?, attempts=attempts+1
      WHERE id = (
        SELECT id FROM jobs
        WHERE status='pending'
          AND scheduled_for <= ?
          AND (locked_at IS NULL OR locked_at < ?)
        ORDER BY scheduled_for LIMIT 1
      )
      RETURNING id, type, payload
    """, (now, WORKER_ID, now, expired))
    return cur.fetchone()

def run_worker():
    while True:
        job = claim_job()
        if not job:
            time.sleep(2)
            continue
        process(job)

def process(job):
    try:
        handler = HANDLERS[job['type']]
        handler(json.loads(job['payload']))
        db.execute(
            "UPDATE jobs SET status='completed', completed_at=? WHERE id=?",
            (datetime.utcnow(), job['id'])
        )
    except Exception as e:
        retry_or_fail(job, e)
    finally:
        db.commit()

The atomic UPDATE with a subquery is the trick: SQLite (since 3.35) supports UPDATE ... RETURNING. Two workers can race for the same job; only one wins. The other gets None and tries again on the next loop.

Retries with exponential backoff

def retry_or_fail(job, error):
    if job['attempts'] >= job.get('max_attempts', 3):
        db.execute(
            "UPDATE jobs SET status='failed', last_error=? WHERE id=?",
            (str(error), job['id'])
        )
    else:
        delay = 2 ** job['attempts']  # 2s, 4s, 8s, 16s...
        next_run = datetime.utcnow() + timedelta(seconds=delay)
        db.execute("""
          UPDATE jobs SET status='pending', scheduled_for=?,
            locked_at=NULL, locked_by=NULL, last_error=?
          WHERE id=?
        """, (next_run, str(error), job['id']))

Every retry doubles the wait. After three failures, the job moves to failed and stays there until you investigate.

Scheduled jobs (the cron killer)

The same table handles scheduled work. Want a job to run in two hours? Set scheduled_for to two hours from now. The worker only picks up jobs whose schedule is past. Want a recurring job? At the end of the handler, enqueue the next instance:

def daily_report_handler(payload):
    send_report(payload['team_id'])
    enqueue('daily_report', payload, run_at=datetime.utcnow() + timedelta(days=1))

You now have a cron system inside your queue. No new infrastructure.

Observability

Because everything is in one table, observability is just SQL:

-- pending count
SELECT COUNT(*) FROM jobs WHERE status='pending';
-- failed in last hour
SELECT * FROM jobs WHERE status='failed' AND created_at > datetime('now','-1 hour');
-- slow jobs
SELECT type, AVG(julianday(completed_at) - julianday(created_at)) * 86400 as avg_seconds
FROM jobs WHERE status='completed' GROUP BY type;

This is dramatically easier to debug than Redis lists or RabbitMQ topology. Every job's full history is one row away.

The locking detail

The locked_at column matters because workers crash. Without timeout-based unlock, a crashed worker would orphan its job forever. The query above expires locks older than 5 minutes, so another worker can claim them. Set the timeout long enough that legitimate slow jobs don't get re-picked, but short enough that crashes don't leave work stuck.

What you give up

This pattern works well to roughly thousands of jobs per minute on a single VPS. You give up:

Cross-machine concurrency. SQLite is one machine. If you scale workers across servers, you need a network database (Postgres, MySQL) instead — same pattern, same SQL, just a network hop.
Pub/sub semantics. If you need fan-out broadcast to dozens of subscribers, this is not the right tool. Use Redis or NATS.
Sub-second polling. The 2-second sleep at the end of the worker loop is fine for most jobs. If you need millisecond latency, you need a push-based system.

For everything else — invoices to generate, emails to send, webhooks to deliver, exports to render, monitors to check — this is enough.

The graduation path

Move to Redis-backed queues (Celery, RQ, Sidekiq) when you need horizontal worker scaling, when your job rate exceeds what a single machine can sustain, or when you need real fan-out broadcast. Move to Kafka when your job stream is also an event log other systems consume.

Until then, one SQLite table and 100 lines of Python is a job queue you can read, understand, and debug. Most teams reach for the heavy tools before they need them and pay the operational tax for years.

If you want a fully managed service that handles the monitoring side — pinging your jobs to confirm they actually ran, alerting when they don't — CronPing is built specifically for this. The pattern in this post handles the work itself; CronPing handles the meta-question of "did the work happen at all."