Day 9/60: Alerting Strategies -- Production Engineering
60 Day Production Engineering Challenge
Alert fatigue is the number one reason on-call rotations burn people out. Today I am covering the strategies that cut noise while keeping signal.
Symptom-Based Alerting with PromQL
Page on what users feel, not what servers report internally. Here is a burn-rate alert for a 99.9% availability SLO (0.001 error budget) that fires when the budget is burning at 14.4x the sustainable rate:
# Critical burn rate: at 14.4x, roughly 2% of a 30-day error budget burns every hour
# (the whole budget would be gone in about 2 days)
(
  sum(rate(http_requests_total{code=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
and
(
  sum(rate(http_requests_total{code=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
) > (14.4 * 0.001)
The dual window (1h AND 5m) means you only page when the problem has been going on long enough to matter AND is still actively happening right now.
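To actually deploy this, the expression goes into a Prometheus alerting rule. A minimal sketch, assuming the same 99.9% SLO; the group name, alert name, severity label, and runbook URL are illustrative:

# prometheus rule sketch -- wraps the dual-window burn rate expression above
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
          and
          (
            sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          ) > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error budget burning at >=14.4x -- users are seeing 5xx errors"
          runbook: "https://runbooks.internal/burn-rate"  # illustrative URL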
Alertmanager Inhibition Rules
When a node dies, you do not need fifty alerts, one for every pod that was running on it. Inhibition suppresses the cascade:
# alertmanager.yml inhibition config
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, cluster]
  - source_match:
      alertname: NodeDown
    target_match_re:
      alertname: "Pod.+"
    equal: [node]
One NodeDown critical alert. Zero PodCrashLoopBackOff warnings until the node recovers.
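One thing to watch: the equal: [node] clause only inhibits when the NodeDown alert and the pod alerts carry identical node label values. A minimal sketch of a NodeDown rule that makes sure the label is there; the job name and label plumbing are assumptions about your scrape config:

# prometheus rule sketch -- assumes node-exporter targets carry a `node` label
# (e.g. via relabeling) that matches the `node` label on your pod alerts
- alert: NodeDown
  expr: up{job="node-exporter"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Node {{ $labels.node }} has stopped responding to scrapes"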
Catching Silent Failures with absent()
# Alert when a target stops reporting entirely
- alert: TargetVanished
  expr: absent(up{job="payment-service"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "payment-service target missing from Prometheus"
    runbook: "https://runbooks.internal/target-vanished"
This is the one alert that catches what every other alert misses: the silent failure where metrics just stop arriving.
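That said, absent() still depends on Prometheus and Alertmanager being healthy enough to evaluate and route it. A common complement is a dead man's switch: an always-firing alert routed to an external heartbeat service, so silence from the pipeline itself becomes the page. A minimal sketch; the receiver wiring and heartbeat service are up to you:

# prometheus rule sketch -- always-firing heartbeat ("dead man's switch")
- alert: Watchdog
  expr: vector(1)
  labels:
    severity: none
  annotations:
    summary: "Always firing; route to an external heartbeat that pages when this stops arriving"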
Key Takeaways
Alert on symptoms. Use burn rates. Configure inhibition. Link runbooks. Test your pipeline.
Day 9/60 of the 60 Day Production Engineering Challenge