Designing Idempotent Cron Jobs: Patterns That Survive Overlapping Runs and Missed Windows
Cron jobs are one of the simplest abstractions in operating systems and one of the most frequently broken in production. The schedule looks deterministic (every five minutes, every day at 3 a.m., every Monday morning), but the execution is anything but. A job runs longer than its window and overlaps with the next invocation. A machine reboots through the scheduled time and the job doesn't run. The job crashes halfway through, leaving the system in an intermediate state that the next run doesn't know how to handle. Time-zone configuration drift makes the job run an hour earlier or later than expected. The standard mitigations (flock files, distributed locks, manual cleanup scripts) paper over these failures rather than designing for them. The patterns that actually work treat overlapping runs and missed windows as expected behavior, not as bugs to prevent.
The patterns in this post matter for any service that runs scheduled work, and the operational discipline they require is what CronPing exists to support. The cross-cutting concerns of authentication, idempotency, and observability also apply to background jobs in DocuMint, FlagBit, and WebhookVault.
The two failure modes
Cron jobs fail in two basic ways. The first is double execution: the job runs twice when it should run once. This happens when the previous run hasn't finished by the time the next scheduled invocation fires, when the cron daemon retries a perceived-failed run, when manual operator action triggers an extra run while the scheduled run is in flight, or when a distributed cron system fails to coordinate and multiple workers pick up the same scheduled job.
The second is missed execution: the job doesn't run when it should. This happens when the machine is rebooting through the scheduled time, when the scheduler service crashes and restarts past the window, when the job runs but errors out without producing useful output, when a configuration change accidentally deletes or modifies the schedule entry, or when a time-zone change moves the scheduled time outside the operator's expected window.
The first failure means work is done twice, which can be harmless (idempotent jobs) or catastrophic (sending duplicate emails, charging cards twice, deleting data twice). The second failure means work is not done at all, which can be tolerable for backups (the next backup catches up) or catastrophic for billing (a missed invoice cycle).
Pattern 1: idempotency by content-addressed key
The first pattern is making the job idempotent at the level of side effects, not at the level of the job invocation. The job calculates a deterministic key from its inputs — a date, an account ID, a logical period — and uses that key to ensure each side effect is performed at most once.
For a daily-billing job, the key is the (account_id, billing_date) pair. The job iterates over accounts, calculates the bill, and inserts a record into a billing_runs table that has a unique constraint on (account_id, billing_date). If the row already exists, the job skips that account. The actual side effect (charging the card or sending the invoice) is wrapped in the same transaction where it is a database write, or coordinated through a similar idempotency layer at the payment provider where it is an external call that a database rollback cannot undo.
The advantage of this pattern is that it tolerates double execution by making the second execution a no-op rather than relying on locking to prevent it. The cron job can run twice; only one set of bills will be sent. The pattern works equally well for backfill jobs (the operator wants to re-run a missed window without producing duplicates) and for catch-up jobs (the system wants to re-process work that was missed during downtime).
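A minimal sketch of the billing example, assuming SQLite as a stand-in for the production database. The billing_runs table and its key come from the text above; the function names and the charge callback are hypothetical.

```python
import sqlite3


def make_billing_db():
    """In-memory SQLite stand-in for the real database (an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE billing_runs ("
        " account_id INTEGER NOT NULL,"
        " billing_date TEXT NOT NULL,"
        " UNIQUE (account_id, billing_date))"
    )
    return conn


def bill_accounts(conn, account_ids, billing_date, charge):
    """Charge each account at most once per billing_date.

    `charge` is a hypothetical callback that performs the side effect.
    Returns the accounts that were actually charged in this run.
    """
    billed = []
    for account_id in account_ids:
        # One transaction per account: commits on success, rolls back on error.
        with conn:
            cur = conn.execute(
                "INSERT OR IGNORE INTO billing_runs (account_id, billing_date)"
                " VALUES (?, ?)",
                (account_id, billing_date),
            )
            if cur.rowcount == 0:
                continue  # row already exists: an earlier run handled this account
            charge(account_id, billing_date)
            billed.append(account_id)
    return billed
```

If charge raises, the claim row is rolled back and a later run retries that account. That keeps the database consistent, but an external charge that succeeded before the failure would be repeated, so in this sketch the same (account_id, billing_date) pair would also need to be passed to the payment provider as its idempotency key.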
Pattern 2: explicit time windows
The second pattern is parameterizing jobs by the time window they should process, rather than using "now" as the implicit input. A daily report job that runs at 1 a.m. is conceptually processing yesterday's data, not "now's" data. Making the time window explicit — via a command-line argument, an environment variable, or a database row — turns the job into something that can be re-run for any window, which is both a backfill capability and a debugging tool.
The cron entry calculates the window from the current time and passes it to the job. The job uses only the passed window, never the system clock. This gives several benefits at once: backfilling missed windows is a matter of running the job with the missed window's parameters; testing the job is a matter of running it against a known historical window; the job's behavior is deterministic given its inputs, which makes debugging tractable.
The pattern also handles the boundary case where a job runs slightly late or slightly early due to scheduler jitter. The job processes the assigned window regardless of when it actually executes, so a 1:02 a.m. start time still processes "yesterday's" data correctly.
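One way the explicit window could look for a daily report, as a sketch. It assumes the window is a UTC calendar day passed as a --date argument; the flag name, the function names, and the shape of the event records are all hypothetical.

```python
import argparse
from datetime import date, datetime, time, timedelta, timezone


def window_for(day):
    """Return the half-open UTC window [start, end) that covers `day`."""
    start = datetime.combine(day, time.min, tzinfo=timezone.utc)
    return start, start + timedelta(days=1)


def parse_args(argv):
    parser = argparse.ArgumentParser(description="Daily report for one explicit day")
    parser.add_argument(
        "--date",
        required=True,
        type=date.fromisoformat,
        help="day to process, as YYYY-MM-DD",
    )
    return parser.parse_args(argv)


def run_report(argv, events):
    """Select the events that fall inside the requested window.

    The system clock is never consulted here: the window comes only from argv.
    `events` is a hypothetical list of dicts with a timezone-aware "at" field.
    """
    args = parse_args(argv)
    start, end = window_for(args.date)
    return [e for e in events if start <= e["at"] < end]
```

The scheduler entry is then the only place that reads the clock: it computes yesterday's date and passes it as --date. A backfill is the same command with an older date.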
Pattern 3: overlap protection at the job level
The third pattern is preventing two instances of the same job from running concurrently when the work is not safely concurrent. The standard tool — flock or its language equivalent — works for single-machine cron, but breaks when the cron is distributed across multiple machines or run by an orchestrator like Kubernetes CronJobs.
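The pattern that scales is acquiring a database-backed lock at the start of the job and releasing it at the end. Postgres advisory locks are a good fit: call pg_try_advisory_lock(job_name_hash) at the start; if it returns false, log "another instance is running" and exit; otherwise do the work and call pg_advisory_unlock(job_name_hash) at the end. This lock is session-scoped, so it is held until it is explicitly unlocked or the connection closes. If the whole job fits in a single transaction, pg_try_advisory_xact_lock is the transaction-scoped variant, which releases on commit or rollback.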
The advantage over flock is that the lock lives in the shared database server rather than on one machine's filesystem, so it works across every machine that connects to the same primary. The advantage over Redis-based locks is that it's already in your database stack and doesn't require additional infrastructure. The advantage over manual flag-based locking (UPDATE jobs SET locked=true) is that it doesn't leak when the job crashes: the connection drops and the lock is released automatically.
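The advisory-lock flow might be sketched as follows. This assumes conn is a DB-API connection to Postgres (for example from psycopg) that is dedicated to the lock and in autocommit mode; the helper names are hypothetical, and hashing the job name with SHA-256 is just one way to derive the 64-bit key that the advisory lock functions take.

```python
import hashlib


def advisory_key(job_name):
    """Derive a stable signed 64-bit key from the job name."""
    digest = hashlib.sha256(job_name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def run_exclusively(conn, job_name, work):
    """Run `work` only if no other session holds this job's advisory lock.

    Returns True if the work ran, False if another instance held the lock.
    Assumes `conn` is used only for locking, so a failed transaction inside
    `work` cannot block the unlock statement.
    """
    key = advisory_key(job_name)
    cur = conn.cursor()
    cur.execute("SELECT pg_try_advisory_lock(%s)", (key,))
    (acquired,) = cur.fetchone()
    if not acquired:
        print(f"{job_name}: another instance is running, exiting")
        return False
    try:
        work()
        return True
    finally:
        # Session-level lock: release explicitly. If the process dies before
        # this line, Postgres releases the lock when the connection closes.
        cur.execute("SELECT pg_advisory_unlock(%s)", (key,))
```

Because the lock is tied to the session, the connection that took it has to stay open for the whole run. A connection pooler that hands each statement to a different server session would break that assumption.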
Pattern 4: dead-man-switch monitoring
The fourth pattern is monitoring the absence of a job's heartbeat rather than the presence of an alert. A cron job that fails silently — exits zero without doing the work, or doesn't run at all because the cron daemon crashed — is invisible to alerting systems that look for explicit failures.
The pattern is for the job to ping a monitoring endpoint at the start and at the end. The monitoring service knows the schedule and alerts when a ping is missed. If the job runs but doesn't reach the end-ping, the monitor knows the job started but didn't finish. If the job doesn't reach the start-ping at all, the monitor knows the job didn't run.
This is the pattern CronPing exists to make easy: a curl call at the start and end of every scheduled job, with the monitoring logic and alerting handled by the service. The pattern is generally useful regardless of which monitoring service implements it — Healthchecks.io, Cronitor, Dead Man's Snitch, or a custom solution. The discipline is making the job's success or failure observable through a positive heartbeat rather than waiting for an absence of errors.
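The job side of this pattern can be sketched in a few lines, using Python's standard library in place of curl. The /start and /end URL suffixes are hypothetical and do not describe any particular service's API; check the monitoring service's documentation for the real endpoints.

```python
import urllib.request


def ping(url, timeout=10.0):
    """Send a best-effort heartbeat; a monitoring outage must not fail the job."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            pass
    except OSError:
        # URLError and socket timeouts are OSError subclasses: swallow them.
        pass


def run_with_heartbeat(base_url, work, send=ping):
    """Ping at the start, do the work, ping again only if the work finished.

    `base_url` and the suffixes are placeholders for whatever the chosen
    monitoring service expects.
    """
    send(f"{base_url}/start")
    work()  # if this raises, the end ping is never sent
    send(f"{base_url}/end")
```

The end ping is deliberately not in a finally block: a run that crashes should look to the monitor like a job that started and never finished.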
Pattern 5: small-window-with-explicit-state
The fifth pattern is keeping each cron run small and tracking state explicitly between runs. A cron job that processes "all unprocessed records since the last run" and stores the high-water mark explicitly handles missed runs gracefully — the next run picks up from where the last successful run left off, regardless of how many windows were missed.
The state tracking goes in a table designed for the purpose: a job_state table with (job_name, last_run_at, last_processed_id) columns. The job reads its state at the start, does the work, and updates the state at the end, ideally in the same transaction as the work so that a crash cannot advance the mark past records that were never processed. The pattern handles the common edge cases: missed runs are caught up automatically (the job processes more than usual), backfill is supported (manually adjust the state and re-run), and the per-run work is bounded (a max-batch-size limit prevents runaway runs).
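A sketch of the high-water-mark loop, again assuming SQLite as a stand-in database. The job_state columns come from the text above; the records table, the handle callback, and the function names are hypothetical.

```python
import sqlite3


def make_state_db():
    """In-memory SQLite stand-in with a hypothetical records table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.execute(
        "CREATE TABLE job_state ("
        " job_name TEXT PRIMARY KEY,"
        " last_run_at TEXT,"
        " last_processed_id INTEGER NOT NULL)"
    )
    return conn


def process_batch(conn, job_name, handle, max_batch=1000):
    """Process up to max_batch records past the stored high-water mark.

    Returns the number of records handled in this run.
    """
    # One transaction: if handle raises, the state update is never committed.
    with conn:
        row = conn.execute(
            "SELECT last_processed_id FROM job_state WHERE job_name = ?",
            (job_name,),
        ).fetchone()
        last_id = row[0] if row else 0
        rows = conn.execute(
            "SELECT id, payload FROM records WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, max_batch),
        ).fetchall()
        for record_id, payload in rows:
            handle(record_id, payload)
            last_id = record_id
        # last_run_at is bookkeeping only; the mark that drives the next run
        # is last_processed_id.
        conn.execute(
            "INSERT OR REPLACE INTO job_state"
            " (job_name, last_run_at, last_processed_id)"
            " VALUES (?, datetime('now'), ?)",
            (job_name, last_id),
        )
    return len(rows)
```

After an outage, the next few runs each take a full batch until the backlog is drained, and a backfill is an UPDATE on last_processed_id followed by a normal run.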
What not to do
The anti-pattern that destroys the most production systems is "the job is supposed to run every day at 3 a.m., we'll just trust the cron daemon to make it happen." Without idempotency, time-window parameterization, overlap protection, dead-man monitoring, and explicit state, a cron job is a single point of failure with no observability and no recovery path. The first time it doesn't run for a few days — because of an outage, a misconfiguration, an upstream failure — the operator has to write a one-off recovery script that's untested and risky.
The other anti-pattern is the kitchen-sink cron job that does many things in sequence. The job runs the daily report, sends the daily email, archives old data, vacuums the database, and updates the analytics roll-ups. When any one of these fails, the rest don't run. When the job runs longer than expected, the next invocation overlaps. The fix is small jobs with explicit dependencies, not big jobs with implicit ordering.
The deeper observation
Cron is one of the oldest abstractions in Unix and one of the most deceptively simple. The schedule entry looks declarative ("run this every day at 3 a.m."), but the execution is procedural and faulty in all the ways procedural code is faulty. Designing cron jobs that survive in production means treating the scheduled time as approximate, the run as potentially repeatable, the work as potentially partial, and the operator as the eventual debugger. The patterns above are not optional embellishments; they're what makes scheduled work sustainable as the system grows. Teams that take this discipline seriously rarely get midnight pages from missed billing runs; teams that skip it get those pages periodically and call it the cost of running cron.