Naveen Karasu

Day 9/60: Alerting Strategies -- Production Engineering

60 Day Production Engineering Challenge

Alert fatigue is the number one reason on-call rotations burn people out. Today I am covering the strategies that cut noise while keeping signal.

Symptom-Based Alerting with PromQL

Page on what users feel, not what servers report internally. Here is a burn rate alert that fires when your error budget is burning at 14.4x the sustainable rate, assuming a 99.9% SLO (the 0.001 in the expression):

# Critical burn rate: 14.4x uses 2% of a 30-day budget in 1 hour (all of it in about 50 hours)
(
  sum(rate(http_requests_total{code=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
and
(
  sum(rate(http_requests_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
) > (14.4 * 0.001)

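The dual window (1h AND 5m) means you only page when the error rate has stayed high long enough to eat a meaningful share of the budget AND the problem is still happening right now. The short window also lets the alert clear quickly once the errors stop.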
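To be evaluated by Prometheus, the expression above has to sit inside a rule file. A minimal sketch of that wrapper, assuming a 99.9% SLO measured over 30 days; the group name, alert name, and annotation text are hypothetical placeholders, not part of the original setup:

```yaml
groups:
  - name: slo-burn-rate            # hypothetical group name
    rules:
      - alert: ErrorBudgetFastBurn # hypothetical alert name
        # Both windows must exceed 14.4x the 0.001 error budget.
        # 14.4 = 0.02 * 720h / 1h, i.e. 2% of a 30-day budget spent in 1 hour.
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        labels:
          severity: critical
        annotations:
          summary: "Error budget is burning at more than 14.4x the sustainable rate"
```

Because both sides are aggregated with a bare `sum(...)`, each produces a single series with no labels, which is what lets the `and` match them against each other.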

Alertmanager Inhibition Rules

When a node dies, you do not need a separate alert for each of the fifty pods that were running on it. Inhibition suppresses the cascade:

# alertmanager.yml inhibition config
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, cluster]
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: "Pod.+"
    equal: [node]

One NodeDown critical alert. Zero PodCrashLoopBackOff warnings until the node recovers.
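Alertmanager 0.22 and later also accept a `source_matchers` / `target_matchers` syntax, and the older `source_match` family is marked deprecated. A sketch of the NodeDown rule in that form, assuming both the NodeDown alert and the pod alerts carry a `node` label with the same value (if NodeDown only has `instance` while the pod alerts have `node`, the `equal` check does not line up and nothing is suppressed):

```yaml
# alertmanager.yml, matcher-style inhibition (Alertmanager 0.22+)
inhibit_rules:
  - source_matchers:
      - alertname = NodeDown
    target_matchers:
      # Regex matcher: any alert whose name starts with "Pod"
      - 'alertname =~ "Pod.+"'
    # Only inhibit pod alerts that share the dead node's label value
    equal: [node]
```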

Catching Silent Failures with absent()

# Alert when no payment-service target is reporting as up
- alert: TargetVanished
  expr: absent(up{job="payment-service"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "payment-service target missing from Prometheus"
    runbook: "https://runbooks.internal/target-vanished"

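This is the one alert that catches what every other alert misses: the silent failure where metrics just stop arriving. A threshold expression over missing data returns nothing, so it never fires; absent() turns that "nothing" into a result you can alert on.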
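One limitation: absent() only returns a result when no series matches at all, so a single failing instance out of several will not trigger TargetVanished. A possible companion rule for that case, as a sketch; the alert name, severity, and summary wording are hypothetical:

```yaml
# Fires per instance while Prometheus still knows the target but scrapes fail
- alert: TargetDown              # hypothetical alert name
  expr: up{job="payment-service"} == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "payment-service instance {{ $labels.instance }} is failing scrapes"
```

This covers targets that are still in the scrape config but unreachable. A target that disappears from service discovery altogether leaves no `up` series behind, which is why the absent() rule is still needed for the whole-job case.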

Key Takeaways

Alert on symptoms. Use burn rates. Configure inhibition. Link runbooks. Test your pipeline.


Day 9/60 of the 60 Day Production Engineering Challenge
