Jack
How to Monitor Your Cron Jobs in Production (So They Don't Silently Die)

Every production system has cron jobs. Database backups, report generation, cache warming, email digests — the list grows with your product. But here's the thing: cron jobs fail silently.

Your backup script has been failing for 3 weeks? Nobody knows until you need to restore. Your nightly ETL hasn't run since the last deploy? You'll find out when the CEO asks why the dashboard is stale.

The Dead Man's Switch Pattern

The most reliable way to monitor cron jobs is the dead man's switch (or heartbeat) pattern:

  1. Create a monitor with an expected schedule
  2. Your cron job pings the monitor after completing successfully
  3. If the monitor doesn't receive a ping within the expected window, fire an alert

This is fundamentally different from log monitoring because it catches jobs that never start — not just jobs that start and fail.

Implementation

Here's how it works with a simple HTTP endpoint:

# Your existing cron job
0 2 * * * /usr/local/bin/backup.sh

# Add monitoring - ping after success
0 2 * * * /usr/local/bin/backup.sh && curl -fsS https://cronping.anethoth.com/ping/YOUR_TOKEN

The && is critical — it only pings if the backup succeeds. If the script exits non-zero, no ping is sent, and you get alerted.

Grace Periods

Not every job runs at exactly the same time. A good monitoring system lets you set a grace period — extra time before an alert fires.

For a daily backup that usually takes 10 minutes:

  • Schedule: every 1440 minutes (24 hours)
  • Grace period: 30 minutes
  • Alert fires if no ping received within 24.5 hours of the last one
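
The window arithmetic for this example is easy to verify:

```shell
# Alert window = schedule + grace (the example values from above)
SCHEDULE_MIN=1440
GRACE_MIN=30
echo "$(( (SCHEDULE_MIN + GRACE_MIN) * 60 )) seconds"   # 88200 s = 24.5 hours
```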

Failure Modes

| Failure Mode | Log Monitoring | Heartbeat Monitoring |
| --- | --- | --- |
| Script errors out | Catches | Catches (no ping sent) |
| Script never starts | Nothing to log | Catches (no ping) |
| Server is down | Can't log | Catches (no ping) |
| Script hangs forever | No error logged | Catches (late ping) |
| Crontab deleted | Nothing happens | Catches (no ping) |

Heartbeat monitoring catches every failure mode because it monitors for the absence of a signal rather than the presence of an error.

Setting Up CronPing

I built CronPing to make this dead-simple:

# 1. Sign up
curl -X POST https://cronping.anethoth.com/api/v1/signup \
  -H 'Content-Type: application/json' \
  -d '{ "email": "you@example.com" }'

# 2. Create a monitor
curl -X POST https://cronping.anethoth.com/api/v1/monitors \
  -H 'Authorization: Bearer ch_xxx...' \
  -H 'Content-Type: application/json' \
  -d '{ "name": "nightly-backup", "schedule_minutes": 1440, "grace_minutes": 30 }'

# 3. Add the ping to your cron job
0 2 * * * /usr/local/bin/backup.sh && curl -fsS https://cronping.anethoth.com/ping/xxx

Free tier gives you 3 monitors — enough for most side projects.

Best Practices

  1. Always use && — only ping on success
  2. Use -fsS with curl — silent output (-s), but errors are still reported (-S) and HTTP failures exit non-zero (-f)
  3. Set realistic grace periods — too tight causes false alarms
  4. One monitor per job — don't reuse ping tokens
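
One more guard worth adding, since a hung script never exits non-zero on its own: wrap the job in `timeout` so a hang becomes a failure and the ping is skipped. The one-hour limit and the token below are example values:

```shell
# crontab entry (sketch): kill the backup if it runs past an hour.
# timeout's non-zero exit (124 on a timeout) makes the && short-circuit,
# so no ping is sent and the missed heartbeat triggers an alert.
0 2 * * * timeout 3600 /usr/local/bin/backup.sh && curl -fsS https://cronping.anethoth.com/ping/YOUR_TOKEN
```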

CronPing is free for up to 3 monitors. Try the cron expression helper to build your cron schedules.

Top comments (1)

Wes (ticktockbent)
Your point about restart policies not fixing init failures is the trap most folks miss.

The scale + reload sequence has a related one though. `docker compose up -d --scale web=5` returns once containers are created, not once they're ready to serve. The very next line, `nginx -s reload`, then adds those still-booting replicas to the upstream pool, so traffic in that window can land on a container mid-startup. The fix is a `healthcheck:` on `web` plus either `compose up --wait` or a polling loop before reloading nginx, so the pool only gets refreshed with replicas that have passed their probe.

How are you handling that gap in practice, or does the demo just eat the failed requests during a scale event?