Jack
How to Monitor Your Cron Jobs in Production (So They Don't Silently Die)

Every production system has cron jobs. Database backups, report generation, cache warming, email digests — the list grows with your product. But here's the thing: cron jobs fail silently.

Your backup script has been failing for 3 weeks? Nobody knows until you need to restore. Your nightly ETL hasn't run since the last deploy? You'll find out when the CEO asks why the dashboard is stale.

The Dead Man's Switch Pattern

The most reliable way to monitor cron jobs is the dead man's switch (or heartbeat) pattern:

  1. Create a monitor with an expected schedule
  2. Your cron job pings the monitor after completing successfully
  3. If the monitor doesn't receive a ping within the expected window, fire an alert

This is fundamentally different from log monitoring because it catches jobs that never start — not just jobs that start and fail.

Implementation

Here's how it works with a simple HTTP endpoint:

# Your existing cron job
0 2 * * * /usr/local/bin/backup.sh

# Add monitoring - ping after success
0 2 * * * /usr/local/bin/backup.sh && curl -fsS https://cronping.anethoth.com/ping/YOUR_TOKEN

The && is critical — it only pings if the backup succeeds. If the script exits non-zero, no ping is sent, and you get alerted.

Grace Periods

Not every job runs at exactly the same time. A good monitoring system lets you set a grace period — extra time before an alert fires.

For a daily backup that usually takes 10 minutes:

  • Schedule: every 1440 minutes (24 hours)
  • Grace period: 30 minutes
  • Alert fires if no ping received within 24.5 hours of the last one
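
The window arithmetic for this example is easy to verify:

```shell
# Alert window = schedule + grace (the example values from above)
SCHEDULE_MIN=1440
GRACE_MIN=30
echo "$(( (SCHEDULE_MIN + GRACE_MIN) * 60 )) seconds"   # 88200 s = 24.5 hours
```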

Failure Modes

| Failure Mode | Log Monitoring | Heartbeat Monitoring |
| --- | --- | --- |
| Script errors out | Catches | Catches (no ping sent) |
| Script never starts | Nothing to log | Catches (no ping) |
| Server is down | Can't log | Catches (no ping) |
| Script hangs forever | No error logged | Catches (late ping) |
| Crontab deleted | Nothing happens | Catches (no ping) |

Heartbeat monitoring catches every failure mode because it monitors for the absence of a signal rather than the presence of an error.

Setting Up CronPing

I built CronPing to make this dead-simple:

# 1. Sign up
curl -X POST https://cronping.anethoth.com/api/v1/signup \
  -H 'Content-Type: application/json' \
  -d '{ "email": "you@example.com" }'

# 2. Create a monitor
curl -X POST https://cronping.anethoth.com/api/v1/monitors \
  -H 'Authorization: Bearer ch_xxx...' \
  -H 'Content-Type: application/json' \
  -d '{ "name": "nightly-backup", "schedule_minutes": 1440, "grace_minutes": 30 }'

# 3. Add the ping to your cron job
0 2 * * * /usr/local/bin/backup.sh && curl -fsS https://cronping.anethoth.com/ping/xxx

Free tier gives you 3 monitors — enough for most side projects.

Best Practices

  1. Always use && — only ping on success
  2. Use -fsS with curl — silent output (-s), but errors are still reported (-S) and HTTP failures exit non-zero (-f)
  3. Set realistic grace periods — too tight causes false alarms
  4. One monitor per job — don't reuse ping tokens
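
One more guard worth adding, since a hung script never exits non-zero on its own: wrap the job in `timeout` so a hang becomes a failure and the ping is skipped. The one-hour limit and the token below are example values:

```shell
# crontab entry (sketch): kill the backup if it runs past an hour.
# timeout's non-zero exit (124 on a timeout) makes the && short-circuit,
# so no ping is sent and the missed heartbeat triggers an alert.
0 2 * * * timeout 3600 /usr/local/bin/backup.sh && curl -fsS https://cronping.anethoth.com/ping/YOUR_TOKEN
```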

CronPing is free for up to 3 monitors. Try the cron expression helper to build your cron schedules.

Top comments (1)

Wes (ticktockbent)
Your point about restart policies not fixing init failures is the trap most folks miss.

The scale + reload sequence has a related one though. `docker compose up -d --scale web=5` returns once containers are created, not once they're ready to serve. The very next line, `nginx -s reload`, then adds those still-booting replicas to the upstream pool, so traffic in that window can land on a container mid-startup. The fix is a `healthcheck:` on `web` plus either `compose up --wait` or a polling loop before reloading nginx, so the pool only gets refreshed with replicas that have passed their probe.

How are you handling that gap in practice, or does the demo just eat the failed requests during a scale event?