Day 9/60: Alerting Strategies -- Production Engineering
60 Day Production Engineering Challenge
Alert fatigue is the number one reason on-call rotations burn people out. Today I am covering the strategies that cut noise while keeping signal.
Symptom-Based Alerting with PromQL
Page on what users feel, not what servers report internally. Here is a burn-rate alert for a 99.9% availability SLO (0.001 error budget) that fires when the budget is burning at 14.4x the sustainable rate:
# Critical burn rate: at 14.4x, roughly 2% of a 30-day error budget burns every hour
# (the whole budget would be gone in about 2 days)
(
  sum(rate(http_requests_total{code=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
and
(
  sum(rate(http_requests_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
) > (14.4 * 0.001)
The dual window (1h AND 5m) means you only page when the problem has been going on long enough to matter AND is still actively happening right now.
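To actually deploy this, the expression goes into a Prometheus alerting rule. A minimal sketch, assuming the same 99.9% SLO; the group name, alert name, severity label, and runbook URL are illustrative:

# prometheus rule sketch -- wraps the dual-window burn rate expression above
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning at >=14.4x -- users are seeing 5xx errors"
          runbook: "https://runbooks.internal/burn-rate"  # illustrative URL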
Alertmanager Inhibition Rules
When a node dies, you do not need fifty alerts, one for every pod that was running on it. Inhibition suppresses the cascade:
# alertmanager.yml inhibition config
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, cluster]
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: "Pod.+"
    equal: [node]
One NodeDown critical alert. Zero PodCrashLoopBackOff warnings until the node recovers.
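One thing to watch: the equal: [node] clause only inhibits when the NodeDown alert and the pod alerts carry identical node label values. A minimal sketch of a NodeDown rule that makes sure the label is there; the job name and label plumbing are assumptions about your scrape config:

# prometheus rule sketch -- assumes node-exporter targets carry a `node` label
# (e.g. via relabeling) that matches the `node` label on your pod alerts
- alert: NodeDown
  expr: up{job="node-exporter"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} has stopped responding to scrapes"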
Catching Silent Failures with absent()
# Alert when a target stops reporting entirely
- alert: TargetVanished
  expr: absent(up{job="payment-service"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "payment-service target missing from Prometheus"
    runbook: "https://runbooks.internal/target-vanished"
This is the one alert that catches what every other alert misses: the silent failure where metrics just stop arriving.
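That said, absent() still depends on Prometheus and Alertmanager being healthy enough to evaluate and route it. A common complement is a dead man's switch: an always-firing alert routed to an external heartbeat service, so silence from the pipeline itself becomes the page. A minimal sketch; the receiver wiring and heartbeat service are up to you:

# prometheus rule sketch -- always-firing heartbeat ("dead man's switch")
- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: "Always firing; route to an external heartbeat that pages when this stops arriving"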
Key Takeaways
Alert on symptoms. Use burn rates. Configure inhibition. Link runbooks. Test your pipeline.
Day 9/60 of the 60 Day Production Engineering Challenge