Your nightly billing sync ran at 2am. Sidekiq shows it completed. No exceptions, no retries, no dead queue entries. Your app looks healthy.
It processed zero invoices.
It's been doing this for eleven days.
This happens more than people admit. Sidekiq is excellent at handling failed jobs — its retry mechanism and dead queue are genuinely well designed. But "failed" in Sidekiq means "raised an exception." A job that connects to the database, queries 0 rows, and exits cleanly isn't a failed job. It's a successful job that did nothing. Sidekiq has no opinion on the difference.
This article covers how to close that gap.
Why Sidekiq's built-in monitoring isn't enough for scheduled jobs
Sidekiq ships with a web UI that shows queue depths, processed counts, failed jobs, and scheduled jobs. For a queue-based system, this is useful. But for scheduled jobs — the kind you run with sidekiq-cron or sidekiq-scheduler — you need something different.
The questions that matter for scheduled jobs are:
- Did it run on schedule? (Not just "has it ever run?")
- Did it actually process anything?
- Is it taking longer than usual?
Sidekiq's web UI answers none of these. It shows you the last enqueued time and whether the job class exists in the schedule. That's not the same as knowing whether it ran at 2am last Tuesday, and whether it exported 1,400 rows like it should have.
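For contrast, here's roughly what you can pull out of Sidekiq's own API from a console. This is a sketch assuming sidekiq-cron is installed and the job is named daily_export; note that everything available is either a global lifetime counter or an enqueue timestamp:

```ruby
require 'sidekiq/api'
require 'sidekiq/cron/job'

# Lifetime counters across every job class (not per-job, not per-run)
stats = Sidekiq::Stats.new
stats.processed  # jobs that finished without raising, ever
stats.failed     # jobs that raised, ever

# sidekiq-cron at least records when a job was last enqueued
job = Sidekiq::Cron::Job.find('daily_export')
job.last_enqueue_time  # enqueued, not ran; says nothing about output
```

Nothing in that API tells you the job completed on schedule or produced the output it was supposed to.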
The dead man's switch pattern
The fix is to invert the monitoring model. Instead of your monitoring system polling Sidekiq to check if jobs ran, you make your jobs proactively check in with an external service. If the external service stops receiving check-ins, it alerts you.
This is called a dead man's switch (or heartbeat monitoring). The idea: if the job dies or goes silent, the external service notices — because it's looking for a regular ping that never came.
Here's the three-signal implementation: start, success, fail.
```ruby
# app/workers/daily_export_worker.rb
require 'net/http'
require 'json'

class DailyExportWorker
  include Sidekiq::Job

  TOKEN = ENV['DEADMANCHECK_TOKEN']
  BASE  = "https://deadmancheck.io/ping/#{TOKEN}"

  def perform
    dmc_start              # begins the duration timer
    rows = run_export      # your actual work; returns rows processed
    dmc_success(rows)      # signals completion + row count
  rescue StandardError
    dmc_fail
    raise                  # re-raise so Sidekiq handles retries normally
  end

  private

  def dmc_start
    Net::HTTP.get(URI("#{BASE}/start"))
  rescue StandardError
    # a monitoring failure must never take down the job itself
  end

  def dmc_success(count)
    uri = URI(BASE)
    req = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
    req.body = { count: count }.to_json
    Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |h| h.request(req) }
  rescue StandardError
  end

  def dmc_fail
    Net::HTTP.get(URI("#{BASE}/fail"))
  rescue StandardError
  end
end
```
A few things worth noting:
- Each ping helper rescues its own errors silently (`rescue StandardError`). A monitoring outage should never kill a production job; the monitoring is less important than the job it watches.
- The `raise` after `dmc_fail` is intentional. Let Sidekiq handle its own retry logic; don't swallow the error just because you've notified the external service.
- It uses Ruby's stdlib `Net::HTTP`, so there's no extra gem to add to your Gemfile.
Works the same with sidekiq-cron or sidekiq-scheduler
If you're using sidekiq-cron or sidekiq-scheduler to run workers on a cron schedule, the `perform` method is already the right integration point. Your schedule config stays the same:
```yaml
# config/schedule.yml (sidekiq-scheduler)
daily_export:
  cron: "0 2 * * *"
  class: DailyExportWorker
  queue: default
```
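With sidekiq-cron the schedule format is nearly identical; the main difference is that you load the hash yourself when the server boots. A minimal sketch, assuming a Rails app and the same schedule.yml (`load_from_hash` is sidekiq-cron's public API):

```ruby
# config/initializers/sidekiq_cron.rb (sidekiq-cron)
require 'sidekiq/cron/job'

if Sidekiq.server?
  schedule = YAML.load_file(Rails.root.join('config', 'schedule.yml'))
  Sidekiq::Cron::Job.load_from_hash(schedule)
end
```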
Create one monitor per scheduled job and set its interval to your schedule length plus a buffer. For a daily job: 25 hours. For an hourly job: 70 minutes. The buffer prevents false alerts from minor timing drift.
Output assertions: the part most tutorials skip
Here's the thing about "job ran successfully": Sidekiq marks a job successful when it completes without an exception. That tells you about the job's execution. It tells you nothing about whether the job's output was valid.
If your export job queries a table that returns 0 rows (because an upstream pipeline broke two days ago), Sidekiq marks it done. Your success rate metrics stay green. You find out eleven days later when someone asks why their data is stale.
DeadManCheck lets you configure an output assertion: alert if the count in the ping is below a threshold. You set it to `count > 0`. Now a job that exports zero rows triggers an alert, even though Sidekiq considers it a success.
This is done through the POST body:
```ruby
# In dmc_success, POST the row count the job actually processed
req.body = { count: count }.to_json
```
Then in the monitor settings, configure: "alert if count is 0 or less."
The other cron monitoring tools — Cronitor, Healthchecks.io, Better Stack — check whether the ping arrived. They don't check what the ping reported. Output assertions are the difference between knowing your job ran and knowing your job worked.
Duration monitoring
The start ping does double duty: it starts a duration timer. When the success ping arrives, DeadManCheck records the elapsed time.
After 5 or more runs, it builds a rolling average. If a run takes significantly longer than the baseline — say, your 30-second export starts taking 8 minutes — it flags the anomaly.
This is a useful leading indicator. A slow job often means:
- A query that's hitting an un-indexed table after a data volume threshold was crossed
- A downstream API starting to time out
- A Redis or database connection pool under pressure
You find out before users notice latency in the actual product.
The full setup takes about 10 minutes
- Create a free account — no credit card needed, free for 5 monitors
- Add a new monitor, set the interval to match your schedule + buffer
- Copy the token into your environment as `DEADMANCHECK_TOKEN`
- Add the three helper methods to your worker (or a shared concern; a sketch follows this list)
- Set the output assertion threshold if your job processes records
- Deploy, trigger the job manually once, confirm the ping arrives in the dashboard
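If several workers need the same pings, extracting the helpers into a shared module keeps each worker thin. A minimal sketch; the module name `DeadManCheckPings` and the `perform_with_heartbeat` wrapper are illustrative conventions, not part of any gem:

```ruby
# app/workers/concerns/dead_man_check_pings.rb (hypothetical shared module)
require 'net/http'
require 'json'

module DeadManCheckPings
  BASE = "https://deadmancheck.io/ping/#{ENV['DEADMANCHECK_TOKEN']}"

  # Wraps the real work in start/success/fail pings.
  # The block must return the number of records processed.
  def perform_with_heartbeat
    dmc_start
    count = yield
    dmc_success(count)
  rescue StandardError
    dmc_fail
    raise # keep Sidekiq's retry behavior intact
  end

  private

  def dmc_start
    Net::HTTP.get(URI("#{BASE}/start"))
  rescue StandardError
  end

  def dmc_success(count)
    uri = URI(BASE)
    req = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json')
    req.body = { count: count }.to_json
    Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |h| h.request(req) }
  rescue StandardError
  end

  def dmc_fail
    Net::HTTP.get(URI("#{BASE}/fail"))
  rescue StandardError
  end
end

# A worker then shrinks to:
class DailyExportWorker
  include Sidekiq::Job
  include DeadManCheckPings

  def perform
    perform_with_heartbeat { run_export } # block returns the row count
  end
end
```

The block-return convention keeps the row count flowing into the success ping without each worker re-implementing the rescue-and-reraise logic.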
After that, you'll get an alert if:
- The job doesn't run on schedule (missed ping)
- The job raises an exception (fail ping)
- The job runs but processes nothing (output assertion)
- The job takes significantly longer than usual (duration anomaly)
That's the full set of failure modes — including the silent ones that Sidekiq alone won't catch.
DeadManCheck is open source and self-hostable. If you'd rather run the monitoring infrastructure yourself: GitHub →