Building a Dead Man's Switch for Your Infrastructure

A dead man's switch is a mechanism that activates when its operator becomes incapacitated. Trains have them. Some nuclear command systems reportedly have them. Your cron jobs should have them too.

The concept is elegant: instead of monitoring for failure (which requires you to enumerate every possible failure mode), you monitor for silence. If a system that should be reporting regularly goes quiet, something is wrong. You don't need to know what went wrong. You just need to know that it went wrong.

The Pattern

Every dead man's switch has three components:

A heartbeat — a regular signal from the monitored system
An expected interval — how often the heartbeat should arrive
An alert — triggered when the interval passes without a heartbeat

That's it. No complex health checks. No multi-step diagnostic flows. Just: "I expect a ping every 5 minutes. If I don't get one, wake someone up."
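As a rough sketch of the receiving side, the three components reduce to a single comparison. The is_overdue function and the separate grace allowance are assumptions for illustration, not part of any particular service's API.

```shell
#!/bin/bash
# Hypothetical checker-side logic: is a heartbeat overdue?
# Usage: is_overdue LAST_HEARTBEAT INTERVAL GRACE NOW
# LAST_HEARTBEAT and NOW are Unix timestamps; INTERVAL and GRACE are seconds.
is_overdue() {
    # Overdue once NOW is past last heartbeat + interval + grace.
    [ "$4" -gt $(( $1 + $2 + $3 )) ]
}

# Example: last ping at t=1000, expected every 300s, with 60s of grace.
# The deadline is 1360, so a check at t=1400 raises the alert.
if is_overdue 1000 300 60 1400; then
    echo "ALERT: heartbeat overdue"
else
    echo "OK: heartbeat on time"
fi
```

Nothing in this check depends on why the heartbeat is missing, which is the point of the pattern.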

Implementation: The Wrapper Script

The simplest dead man's switch wraps your existing cron job:

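#!/bin/bash
# backup-with-heartbeat.sh

# Run the actual backup and remember its exit status
/usr/local/bin/backup.sh
status=$?

# If it succeeded, send the heartbeat
if [ "$status" -eq 0 ]; then
    curl -fsS --retry 3 https://cronping.anethoth.com/ping/YOUR_TOKEN
fi

# Exit with the backup's status so cron still sees a failure
exit "$status"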

The beauty of this pattern is what it catches:

The backup script crashes → no heartbeat → alert
The cron daemon stops → no heartbeat → alert
The server goes down → no heartbeat → alert
The backup hangs or runs past its expected interval → heartbeat arrives late or never → alert
Someone accidentally deletes the crontab → no heartbeat → alert

A traditional monitoring approach would need separate checks for each of these failure modes. The dead man's switch catches all of them with a single mechanism.

Real-World Patterns

Pattern 1: The Completion Ping

Ping after your job completes successfully. This is the basic pattern above. Good for jobs where you only care about successful completion.

0 2 * * * /usr/local/bin/backup.sh && curl -fsS https://cronping.anethoth.com/ping/TOKEN

Pattern 2: The Start-and-Finish Ping

Ping at the start and end of your job. This lets you detect jobs that start but never finish — hung processes, deadlocks, or resource exhaustion.

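# Signal that the job has started
curl -fsS https://cronping.anethoth.com/ping/TOKEN/start
/usr/local/bin/etl-pipeline.sh
EXIT_CODE=$?
# Signal that the job has finished, and report how it went
curl -fsS "https://cronping.anethoth.com/ping/TOKEN/end?exit_code=$EXIT_CODE"
# Exit with the job's status rather than curl's
exit "$EXIT_CODE"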

Pattern 3: The Exit Code Ping

Include the exit code in your heartbeat so your monitoring system can distinguish between "job ran successfully" and "job ran but failed." Both are heartbeats, but only one should be green.

/usr/local/bin/deploy.sh
curl -fsS "https://cronping.anethoth.com/ping/TOKEN?exit=$?"

Pattern 4: The Cascading Switch

For multi-step pipelines, chain dead man's switches. Each step's completion ping starts the clock on the next step's expected window:

Extract (10min window) → ping → Transform (30min window) → ping → Load (15min window) → ping

If Extract finishes but Transform never starts, you know exactly where the pipeline broke.
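One way to wire this up in a single script, as a sketch: each step gets its own monitor and pings it only on success. The token names, the extract/transform/load script paths, and the run_step helper are hypothetical. Only the ping URL format comes from the examples above.

```shell
#!/bin/bash
# Sketch of a cascading pipeline: one monitor (token) per step.
# Tokens, script paths and the run_step helper are hypothetical.
PING_BASE="https://cronping.anethoth.com/ping"
PING_CMD="curl -fsS --retry 3"   # set PING_CMD=echo for a dry run

# run_step TOKEN COMMAND [ARGS...]
# Runs the command and pings the step's monitor only if it succeeded.
run_step() {
    step_token=$1
    shift
    "$@" || return 1     # step failed: no ping, and the chain stops here
    # A failed ping is reported but does not abort the pipeline.
    $PING_CMD "$PING_BASE/$step_token" ||
        echo "warning: ping for $step_token failed" >&2
}

# Steps run in order. After a failure every later monitor stays silent,
# so the first silent monitor marks where the pipeline broke.
run_pipeline() {
    run_step EXTRACT_TOKEN   /usr/local/bin/extract.sh &&
    run_step TRANSFORM_TOKEN /usr/local/bin/transform.sh &&
    run_step LOAD_TOKEN      /usr/local/bin/load.sh
}
```

Setting PING_CMD=echo before calling run_step gives a dry run that prints the ping URLs instead of requesting them.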

Common Mistakes

1. Setting the grace period too tight.
If your backup takes 10-60 minutes depending on data volume, and you set a 15-minute grace period, you'll get false alarms when it legitimately takes 45 minutes. Set the grace period to 2x your worst-case expected duration. False alarms train you to ignore alerts.
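The rule of thumb can be turned into a small calculation. This sketch assumes you have a few observed run times in minutes. The sample values and the grace_minutes helper are made up for illustration.

```shell
#!/bin/bash
# Sketch: derive a grace period from observed run times (whole minutes).
# The 2x multiplier is the rule of thumb above; the durations are invented.
grace_minutes() {
    worst=0
    for duration in "$@"; do
        if [ "$duration" -gt "$worst" ]; then
            worst=$duration
        fi
    done
    echo $(( worst * 2 ))
}

grace_minutes 12 45 60 18    # prints 120
```

With a worst case of 60 minutes, the 15-minute grace period from the example is off by a factor of eight.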

2. Forgetting to monitor the monitor.
Your dead man's switch is itself infrastructure that can fail. What happens if CronPing goes down? Heartbeats get lost, and a real failure during the outage can go unnoticed. This is where redundancy matters. Use a monitoring service with published uptime guarantees and a status page you can check.
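One modest safeguard, sketched here under assumptions: if the ping itself fails, leave a local trace so that an outage of the monitoring service can be told apart from a job that never ran. The send_heartbeat helper and the log path are hypothetical.

```shell
#!/bin/bash
# Sketch: record locally when the monitoring service is unreachable.
# HEARTBEAT_LOG and the send_heartbeat helper are hypothetical.
PING_URL="https://cronping.anethoth.com/ping/YOUR_TOKEN"
PING_CMD="curl -fsS --retry 3 --max-time 10"
HEARTBEAT_LOG="${HEARTBEAT_LOG:-/tmp/heartbeat-failures.log}"

send_heartbeat() {
    if $PING_CMD "$PING_URL" >/dev/null; then
        return 0
    fi
    # The job ran but the ping did not get through: leave a trace.
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) ping failed: $PING_URL" >> "$HEARTBEAT_LOG"
    return 1
}
```

A non-empty log next to a missed-heartbeat alert suggests the job ran and the ping was what failed.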

3. Alerting to a channel nobody watches.
An alert that goes to a Slack channel with 500 unread messages is not an alert. Send critical dead man's switch alerts to PagerDuty, phone calls, or SMS. If it's worth monitoring, it's worth waking someone up for.

The Philosophy

Dead man's switches embody a fundamental principle of good engineering: design for the failure modes you can't predict. You can't enumerate every way a cron job might fail. But you can detect the absence of success, which is the universal signature of all failures.

This same principle applies beyond cron jobs. Health check endpoints, circuit breakers, and timeouts all rest on a similar idea. They don't try to understand failure. They just notice when success stops happening.

If you're running scheduled tasks without a dead man's switch, you're flying blind. Set one up today. CronPing makes it a single curl command. Your 2 AM self will thank you.

Want to try it? Create a free CronPing account, set up a monitor, and add one line to your crontab. It takes less time than reading this article did.