The Problem: Silent Failures

Cron jobs are the backbone of most backend systems: database backups, report generation, cache cleanup, email sends, data syncs. They run reliably for months, and then one day they stop. Nobody notices.

Unlike web endpoints that return error codes, cron jobs fail silently. There's no user staring at a loading screen to report the problem. The backup that was supposed to run at 3 AM just... didn't. And you find out a week later when you need that backup.

How Dead Man's Switch Monitoring Works

The concept is borrowed from trains: a physical switch that a train driver must hold down. If they let go (pass out, leave), the train stops automatically. Applied to cron jobs:

  1. Create a monitor with an expected schedule (e.g., "every 5 minutes")
  2. Get a unique ping URL (e.g., https://cronping.anethoth.com/ping/abc123)
  3. Add a curl to the end of your cron job: curl -s https://cronping.anethoth.com/ping/abc123
  4. If the ping doesn't arrive on schedule, you get alerted

The beauty of this approach: you don't need to parse log files or build custom health checks. If the job runs successfully, it pings. If it never starts, crashes, or hangs, the missing ping triggers an alert.
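
The check behind step 4 can be sketched in a few lines of shell. This is an illustration of the idea, not CronPing's actual implementation: the function name is_overdue and its arguments (Unix timestamps and durations in seconds) are assumptions made for the example.

```shell
# Hypothetical sketch of a dead man's switch check (not CronPing's real code).
# Arguments: last ping time (epoch seconds), expected interval (seconds),
# grace period (seconds), current time (epoch seconds).
# Succeeds (exit 0) when the next ping is overdue, i.e. an alert should fire.
is_overdue() {
    last_ping=$1
    interval=$2
    grace=$3
    now=$4
    deadline=$((last_ping + interval + grace))
    [ "$now" -gt "$deadline" ]
}

# Example: last ping at t=1000, job expected every 300 s, 60 s of grace.
# The deadline is 1360, so a check at t=1400 reports the ping as overdue.
if is_overdue 1000 300 60 1400; then
    echo "ALERT: ping overdue"
else
    echo "OK: still within schedule"
fi
```

The job never has to report that it is broken. It only has to stop reporting that it is healthy.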

Setting Up Monitoring: A Practical Example

Let's say you have a database backup that runs every 6 hours:

# Current crontab entry
0 */6 * * * /usr/local/bin/backup-db.sh

Step 1: Create a monitor via the CronPing API:

curl -X POST https://cronping.anethoth.com/api/v1/monitors \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Production DB Backup",
    "schedule": "0 */6 * * *",
    "grace_period_minutes": 15,
    "alert_webhook": "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
  }'

Step 2: Add the ping to your cron job:

# Updated crontab entry — ping on success
0 */6 * * * /usr/local/bin/backup-db.sh && curl -s https://cronping.anethoth.com/ping/abc123

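# Or even better: report failures explicitly too.
# if/else is used instead of "&& ... ||" so that a failed success ping
# is not reported as a failed backup.
0 */6 * * * if /usr/local/bin/backup-db.sh; then curl -s https://cronping.anethoth.com/ping/abc123; else curl -s https://cronping.anethoth.com/ping/abc123/fail; fi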
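
When the crontab line gets crowded, the success/failure logic can live in a small wrapper script instead. The sketch below reuses the example ping URL and the /fail suffix from this article. The function names ping_url_for and run_and_ping, and the curl timeout and retry values, are choices made for this example rather than anything CronPing prescribes.

```shell
# Sketch of a wrapper that runs a job and reports its result.
# PING_URL is the example URL from this article; replace it with your own.
PING_URL="https://cronping.anethoth.com/ping/abc123"

# Print the URL to hit for a given exit code: the plain ping URL on
# success (0), the /fail variant for anything else.
ping_url_for() {
    if [ "$2" -eq 0 ]; then
        printf '%s\n' "$1"
    else
        printf '%s/fail\n' "$1"
    fi
}

# Run the given command, ping according to its exit code, and pass the
# job's exit code through so cron and other tooling still see it.
run_and_ping() {
    "$@"
    status=$?
    # -f: treat HTTP errors as failures, -sS: quiet but still print errors,
    # --max-time/--retry: keep a slow monitoring endpoint from hanging the job.
    curl -fsS --max-time 10 --retry 3 -o /dev/null \
        "$(ping_url_for "$PING_URL" "$status")"
    return "$status"
}
```

Saved as a script that ends with run_and_ping "$@" (for instance under the hypothetical path /usr/local/bin/run-and-ping.sh), the crontab entry shrinks to:

0 */6 * * * /usr/local/bin/run-and-ping.sh /usr/local/bin/backup-db.sh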

Grace Periods: Avoiding False Alarms

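Jobs don't always run at the exact second they're scheduled. System load, network latency, or long-running previous jobs can cause delays. And because the ping is sent when the job finishes, it arrives a full runtime after the scheduled start. The grace period is how long the monitor waits after the expected time before alerting, so it should cover both normal delays and the job's typical runtime.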

General guidelines (starting points; if a job fits two categories, use the longer period):

  • Quick jobs (< 1 minute): 5-minute grace period
  • Medium jobs (1-10 minutes): 15-minute grace period
  • Heavy jobs (backups, data processing): 30-60 minute grace period
  • Hourly jobs: 10-minute grace period
  • Daily jobs: 30-minute grace period
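
The runtime-based guidelines above can be turned into a small helper for picking a starting value. This is only a sketch: the function name suggest_grace_minutes is made up for the example, and for heavy jobs it returns 30, the low end of the 30-60 minute range, which you may want to raise.

```shell
# Sketch: map a job's typical runtime (in seconds) to a starting grace
# period (in minutes), following the runtime-based guidelines above.
suggest_grace_minutes() {
    runtime_seconds=$1
    if [ "$runtime_seconds" -lt 60 ]; then
        echo 5      # quick job
    elif [ "$runtime_seconds" -le 600 ]; then
        echo 15     # medium job
    else
        echo 30     # heavy job: low end of the 30-60 minute range
    fi
}

suggest_grace_minutes 45     # prints 5
suggest_grace_minutes 300    # prints 15
suggest_grace_minutes 3600   # prints 30
```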

Common Cron Failure Modes

1. Environment Variables

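Cron runs with a minimal environment. PATH is usually limited to something like /usr/bin:/bin, and variables exported in your shell profile (.bashrc, .profile) are not loaded. Always use absolute paths in cron jobs (/usr/bin/python3, not python3), or set the variables you need at the top of the crontab.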
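
One way to catch these problems before deploying is to run the job by hand in a stripped-down environment. The sketch below approximates what cron provides. The helper name run_like_cron and the variable MY_APP_CONFIG are made up for the example, and the PATH value is a common default that may differ on your system.

```shell
# Sketch: run a command with a minimal environment similar to cron's,
# to surface missing PATH entries or variables before the job is deployed.
run_like_cron() {
    env -i \
        HOME="${HOME:-/}" \
        SHELL=/bin/sh \
        PATH=/usr/bin:/bin \
        /bin/sh -c "$1"
}

# A variable exported in the interactive shell is not visible to the job.
export MY_APP_CONFIG=/etc/myapp.conf
run_like_cron 'echo "PATH=$PATH CONFIG=${MY_APP_CONFIG:-unset}"'
```

If a script works in your terminal but fails under run_like_cron, it will most likely fail under cron too.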

2. Working Directory

Cron jobs start in the user's home directory, not the directory where the script lives. Use cd /path/to/project && ./script.sh explicitly.
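
An alternative to putting cd in the crontab is to have the script change to its own directory. A minimal sketch, with script_dir as a made-up helper name:

```shell
# Sketch: resolve the directory that contains a script, so the job can
# cd there no matter where cron started it.
script_dir() {
    # The subshell keeps the caller's working directory unchanged.
    (cd "$(dirname "$1")" && pwd)
}

# At the top of a cron script this would typically be used as:
#   cd "$(script_dir "$0")" || exit 1
```

The "|| exit 1" matters: if the cd fails, the script should stop rather than run its remaining commands in the wrong directory.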

3. Permission Issues

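Cron jobs run with the UID of the user who owns the crontab, and any files they create belong to that user. If your script writes to a directory owned by a different user, the write fails, and unless the job's output is captured somewhere, nobody sees the error. Check file permissions for the user the job actually runs as.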
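
A pre-flight check at the top of the script makes this failure loud instead of silent. The sketch below uses a made-up helper name, require_writable. The directory you pass it is whatever your job writes to.

```shell
# Sketch: fail before doing any work if the job's output directory is
# missing or not writable by the user the job runs as.
require_writable() {
    if [ ! -d "$1" ] || [ ! -w "$1" ]; then
        echo "ERROR: $1 is not a writable directory for user $(id -un)" >&2
        return 1
    fi
}

# Typical use at the top of a job (the path is an example):
#   require_writable /var/backups || exit 1
```

Combined with a ping that is only sent on success, an early exit like this turns a permissions problem into a missed ping and therefore an alert.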

4. Overlapping Executions

If a job runs longer than its schedule interval, you get overlapping instances. Use flock to prevent this; with -n, a new run exits immediately if the previous one still holds the lock:

*/5 * * * * flock -n /tmp/myjob.lock /usr/local/bin/my-job.sh

5. Disk Space

Cron jobs that write to disk (backups, logs) can eventually fill it up unless old files are rotated or pruned, and a full disk then breaks the job itself. Monitor disk usage alongside your cron jobs.
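
A job can also check free space itself before it starts writing. The sketch below relies on the POSIX output format of df -P. The helper names and the 90% threshold are assumptions for the example, and the column parsing assumes filesystem names without spaces.

```shell
# Sketch: report how full the filesystem holding a path is.
# df -P prints one POSIX-format line per filesystem; column 5 is the
# used capacity as a percentage (for example "42%").
disk_usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeeds when usage is below the given threshold (in percent).
disk_has_room() {
    usage=$(disk_usage_percent "$1")
    [ "$usage" -lt "$2" ]
}

if disk_has_room / 90; then
    echo "disk OK, safe to run the backup"
else
    echo "disk at or above 90%, skipping backup" >&2
fi
```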

Status Badges for Visibility

One powerful pattern: embed status badges in your documentation. CronPing generates SVG badges that show the current state of each monitor. Drop them in your GitHub README:

![Backup Status](https://cronping.anethoth.com/badge/YOUR_TOKEN)

This gives your entire team instant visibility into whether background jobs are healthy — without logging into a dashboard.

Alerting Best Practices

  1. Alert on missed pings, not just failures. A job that doesn't run is often worse than a job that runs and fails (because the failure might be logged somewhere).
  2. Use escalation. Slack notification first, email after 30 minutes, PagerDuty after 1 hour.
  3. Don't alert on every single miss. A brief hiccup is different from a persistent failure. Configure "alert after N consecutive misses" if your monitoring tool supports it.
  4. Monitor the monitor. If your monitoring service itself goes down, you won't know your jobs are failing. Use a secondary check, such as an independent uptime monitor pointed at the monitoring service.
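
Rules 2 and 3 are simple enough to sketch in shell. The function names, the space-separated history format, and the choice to escalate at exactly 30 and 60 minutes are assumptions for illustration. A real monitoring tool would typically expose these as configuration.

```shell
# Sketch of two alerting rules from the list above.

# Count the run of "miss" entries at the end of a check history,
# given oldest first, e.g.: ok ok miss miss
trailing_misses() {
    count=0
    for result in "$@"; do
        if [ "$result" = "miss" ]; then
            count=$((count + 1))
        else
            count=0
        fi
    done
    echo "$count"
}

# Pick the notification channel from minutes elapsed since the first
# alert: Slack first, email after 30 minutes, PagerDuty after 1 hour.
escalation_channel() {
    if [ "$1" -ge 60 ]; then
        echo pagerduty
    elif [ "$1" -ge 30 ]; then
        echo email
    else
        echo slack
    fi
}

# Alert only after 3 consecutive misses, starting with the first channel.
if [ "$(trailing_misses ok miss miss miss)" -ge 3 ]; then
    echo "alert via $(escalation_channel 0)"
fi
```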
