DEV Community

丁久

Posted on • Originally published at dingjiu1989-hue.github.io

SLI/SLO/Error Budgets: Defining SLIs, Setting SLOs, and Burn Rate Alerts

This article was originally published on AI Study Room. For the full version with working code examples and related articles, visit the original post.


Introduction

Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Error Budgets form the foundation of Site Reliability Engineering (SRE) practice. Originating from Google's SRE teams, these concepts provide a data-driven framework for balancing reliability with feature velocity. Rather than aiming for 100% uptime, SRE uses error budgets to make explicit trade-offs between reliability and innovation.

This article covers SLI definition, SLO setting methodology, error budget policies, and burn rate alert design.

Defining SLIs

An SLI is a carefully chosen metric that measures a specific aspect of service reliability. Common SLI categories include availability (ratio of successful requests), latency (request duration), throughput (requests per second), and durability (data persistence).

Good SLIs are measurable from the customer's perspective. For a web service, the availability SLI measures whether HTTP requests return successful responses (2xx or 3xx status codes), not whether the server process is running. This distinction is critical: the server could be operational while the application returns 500 errors.

SLIs should be captured as a ratio of good events to total events:

availability_sli = successful_requests / total_requests

latency_sli = fast_requests / total_requests
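The two ratios above can be sketched in Python as follows. The function name `ratio_sli`, the sample counts, and the choice to treat an empty window as fully good are illustrative assumptions, not part of any particular monitoring library.

```python
def ratio_sli(good_events: int, total_events: int) -> float:
    """Return the fraction of good events in a measurement window."""
    if good_events < 0 or total_events < 0 or good_events > total_events:
        raise ValueError("expected 0 <= good_events <= total_events")
    if total_events == 0:
        # Assumption: a window with no traffic counts as fully good.
        return 1.0
    return good_events / total_events


# Hypothetical counts for one window.
availability_sli = ratio_sli(good_events=99_950, total_events=100_000)
latency_sli = ratio_sli(good_events=98_700, total_events=100_000)
print(f"availability: {availability_sli:.4%}, latency: {latency_sli:.4%}")
```

For the latency SLI, "good" means a request finished under a chosen duration threshold, so the threshold itself is part of the SLI definition.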

SLI definition requires choosing measurement windows and aggregation methods. A 30-second measurement window catches transient issues, while a 5-minute window smooths noise. Rolling windows (30 days for monthly SLOs) provide stability.
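One way to picture a rolling window is a queue of per-interval counts from which old intervals fall off. The sketch below is a minimal in-memory illustration; the class name `RollingSLI` and its methods are hypothetical, and a real system would normally let the metrics backend do this aggregation.

```python
from collections import deque


class RollingSLI:
    """Aggregate (timestamp, good, total) samples over a trailing time window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._samples = deque()  # items: (timestamp, good, total)

    def record(self, timestamp: float, good: int, total: int) -> None:
        self._samples.append((timestamp, good, total))

    def value(self, now: float) -> float:
        # Drop samples that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()
        good = sum(sample[1] for sample in self._samples)
        total = sum(sample[2] for sample in self._samples)
        # Assumption: an empty window counts as fully good.
        return good / total if total else 1.0


sli = RollingSLI(window_seconds=300)  # 5-minute window
sli.record(timestamp=0, good=90, total=100)
sli.record(timestamp=200, good=100, total=100)
print(sli.value(now=250))  # both samples are still inside the window
print(sli.value(now=400))  # the first sample has aged out
```

A shorter window makes the first sample's failures dominate briefly and then vanish; a longer window dilutes them but keeps them visible for longer.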

Setting SLOs

An SLO sets a target value for an SLI over a defined period. Common targets are 99.9% (three nines), 99.99% (four nines), and 99.999% (five nines). Each additional nine reduces the allowed downtime by an order of magnitude:

  • 99.9% allows 8.76 hours of downtime per year.

  • 99.99% allows 52.56 minutes per year.

  • 99.999% allows 5.26 minutes per year.
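The figures in the list follow from multiplying the allowed failure fraction by the length of the period. A short sketch of that arithmetic, assuming a 365-day year (the function name `allowed_downtime_minutes` is illustrative):

```python
def allowed_downtime_minutes(slo_target: float, period_days: float = 365.0) -> float:
    """Minutes of full downtime permitted by an availability target over a period."""
    period_minutes = period_days * 24 * 60
    return (1.0 - slo_target) * period_minutes


for target in (0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.3%}: {minutes:.2f} minutes/year ({minutes / 60:.2f} hours)")
```

Passing `period_days=30` gives the corresponding monthly allowance for the same target.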

SLO targets should not be aspirational. A target that has never been met provides no useful signal. Start with a target slightly below current performance and tighten it over time as reliability improves.

Not all services need the same SLO. Critical user journeys (authentication, checkout, data access) should have higher SLOs than secondary features. Multi-tier SLOs, with a target (internal goal) and a minimum (customer commitment), provide a buffer between internal goals and contractual obligations.

Error Budgets

The error budget is the allowed amount of unreliability within the SLO period. For a 99.9% SLO over 30 days, the error budget is 0.1% of total events — or approximately 43 minutes of downtime.
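For request-based SLIs the budget is easier to track in events than in minutes. The following sketch computes how much of the budget is left; the function name and the sample request counts are assumptions made for illustration.

```python
def error_budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # No budget at all (100% target or no traffic): any failure overspends it.
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


# Hypothetical 30-day totals: 1,000,000 requests, 400 of them failed.
remaining = error_budget_remaining(0.999, good_events=999_600, total_events=1_000_000)
print(f"error budget remaining: {remaining:.1%}")
```

With a 99.9% target, 1,000,000 requests allow roughly 1,000 failures, so 400 failures leave about 60% of the budget.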

The error budget defines how much risk the team can take. While budget remains (the service is performing better than its SLO requires), the team can deploy new features. When the budget is depleted (the service has missed or is about to miss its SLO), releases are frozen until reliability improves.

Error budget policies encode these decisions in automated processes. A CI/CD pipeline gate checks error budget consumption before allowing a production deployment. This creates a direct feedback loop between reliability and feature velocity.
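A pipeline gate of this kind can be as small as a function that turns the remaining budget into an exit code. This is a sketch under stated assumptions: the function names, the `freeze_below` parameter, and the 10% margin in the demo are hypothetical, and fetching the remaining budget from a monitoring system is left out.

```python
def deployment_allowed(budget_remaining: float, freeze_below: float = 0.0) -> bool:
    """Allow a release only while more than `freeze_below` of the budget remains."""
    return budget_remaining > freeze_below


def gate_exit_code(budget_remaining: float, freeze_below: float = 0.0) -> int:
    """Map the decision to a process exit code: 0 continues the pipeline, 1 blocks it."""
    return 0 if deployment_allowed(budget_remaining, freeze_below) else 1


# Hypothetical budget readings, with a 10% safety margin before freezing.
for remaining in (0.60, 0.05, 0.0, -0.25):
    decision = "deploy" if gate_exit_code(remaining, freeze_below=0.10) == 0 else "freeze"
    print(f"budget remaining {remaining:+.0%}: {decision}")
```

Keeping a margin above zero freezes releases slightly before the budget is fully spent, which leaves room for the failures a rollback itself might cause.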

Burn Rate Alerts

Burn rate alerts detect excessive error budget consumption before the budget is exhausted. The burn rate is how fast the error budget is being consumed relative to the SLO period.

A burn rate of 1 means the budget will be fully consumed by the end of the period at the current rate. A burn rate of 2 means the budget will be exhausted in half the period. Multi-window, multi-burn-rate alerting uses fast-burn and slow-burn windows:

  • Fast-burn alerts (burn rate >= 14 over 1 hour): Catches severe outages immediately. Pages the on-call engineer.

  • Slow-burn alerts (burn rate >= 2 over 6 hours): Detects gradual degradation. Pages or creates a ticket for next-day investigation.

This approach ensures critical incidents are paged immediately while gradual issues are investigated before they exhaust the budget.
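The burn rate arithmetic and the two-window decision can be sketched as below. The thresholds (14 over 1 hour, 2 over 6 hours) come from the list above; the function names, the 30-day period, and the choice to route slow burns to a ticket rather than a page are assumptions for illustration.

```python
import math


def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1.0 - slo_target)


def hours_to_exhaustion(rate: float, period_days: float = 30.0) -> float:
    """Hours until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return math.inf
    return period_days * 24 / rate


def classify_burn(burn_1h: float, burn_6h: float) -> str:
    """Apply the fast-burn check first, then the slow-burn check."""
    if burn_1h >= 14:
        return "page"
    if burn_6h >= 2:
        return "ticket"  # assumption: slow burns open a ticket instead of paging
    return "ok"


# Hypothetical readings: 2% errors over the last hour, 0.5% over six hours.
fast = burn_rate(0.02, slo_target=0.999)
slow = burn_rate(0.005, slo_target=0.999)
print(f"1h burn rate {fast:.1f}, 6h burn rate {slow:.1f}: {classify_burn(fast, slow)}")
print(f"budget lasts {hours_to_exhaustion(fast):.0f} hours at the 1h rate")
```

At a burn rate of 2 the 720-hour period shrinks to 360 hours, which matches the "half the period" rule stated earlier.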

Implementation with Prometheus

Prometheus and the slo-exporter pattern implement SLO monitoring effectively:

groups:
  - name: slo-alerts
    rules:
      - alert: FastBurnRate
        expr: |
          (
            1 - (rate(http_requests_good_total[1h]) / rate(http_requests_total[1h]))
          ) > 14 * (1 - 0.999)
        for: 2m
        labels:
          severity: critical

The alert fires when the observed error rate exceeds the burn rate threshold multiplied by the allowed error rate (1 - SLO target); here that is 14 × 0.001 = 1.4%. Expressing the threshold this way normalizes alerts across different SLO targets.
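To see how the same burn rate threshold translates into different error rates per SLO target, the sketch below computes the threshold and renders the matching expression. The metric names are taken from the rule above; the helper names and the idea of generating expressions from Python are illustrative assumptions.

```python
def alert_error_rate_threshold(slo_target: float, burn_rate_threshold: float) -> float:
    """Error rate above which the burn rate alert fires."""
    return burn_rate_threshold * (1.0 - slo_target)


def burn_rate_expr(slo_target: float, burn_rate_threshold: float, window: str) -> str:
    """Render a PromQL expression with the same shape as the FastBurnRate rule."""
    error_ratio = (
        f"1 - (rate(http_requests_good_total[{window}])"
        f" / rate(http_requests_total[{window}]))"
    )
    return f"({error_ratio}) > {burn_rate_threshold:g} * (1 - {slo_target:g})"


for target in (0.99, 0.999, 0.9999):
    threshold = alert_error_rate_threshold(target, burn_rate_threshold=14)
    print(f"SLO {target:.2%}: alert above {threshold:.3%} errors")

print(burn_rate_expr(0.999, 14, "1h"))
```

The burn rate threshold stays at 14 in every case; only the absolute error rate that triggers the alert changes with the target.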

Conclusion

SLIs, SLOs, and error budgets transform reliability from a subjective goal into an objective discipline. Well-defined SLIs measure what matters from the customer perspective. Realistic SLOs provide clear targets. Error budgets enable data-driven decisions about feature releases. Burn rate alerts catch problems before they exhaust the error budget.

