Setting Realistic SLOs, SLAs, and Error Budgets

A practical guide to defining uptime targets that your team can actually hit, measure, and defend.

Why Most Uptime Targets Are Made Up

Four nines. Five nines. "We target 99.9%." These numbers get written into contracts and OKRs all the time without anyone doing the math on what they actually require, or checking whether the underlying infrastructure can deliver them.

Setting realistic SLOs, SLAs, and error budgets isn't just an SRE formality. It's how you avoid promising customers something you can't deliver, and how you give your engineering team a meaningful signal about when to ship and when to stabilize.

The Definitions That Actually Matter

SLI — Service Level Indicator

A measurable signal: request success rate, latency at the 99th percentile, uptime as measured from an external probe. This is raw data.

SLO — Service Level Objective

An internal target built on top of an SLI. Example: "99.5% of HTTP requests return a non-5xx response, measured over a rolling 30-day window." Your team owns this.

SLA — Service Level Agreement

A contractual commitment to a customer, usually with financial consequences (service credits, termination rights) if you miss it. Your SLA should always be weaker than your SLO — that gap is your operational buffer.

Error Budget

The allowable failure derived from your SLO. At 99.5% over 30 days, you can be out of SLO for 0.5% of 720 hours, roughly 3.6 hours of downtime or degradation, before you breach the objective. Spend it deliberately.
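The arithmetic behind that figure is worth making explicit. The following is a minimal Python sketch, assuming a purely time-based budget; the function name `error_budget` is illustrative and does not come from any particular library.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return how long the service may be out of SLO within the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    # The budget is simply the fraction of the window the SLO does not cover.
    return window * (1 - slo_percent / 100)


# Compare a few common targets over a 30-day window.
for target in (99.0, 99.5, 99.9, 99.99):
    budget = error_budget(target, timedelta(days=30))
    print(f"{target}% over 30 days -> {budget.total_seconds() / 60:.1f} minutes")
```

At 99.5% this works out to 216 minutes, which is the 3.6 hours quoted above. Each extra nine cuts the budget by a factor of ten, which is why the jump from 99.9% to 99.99% is so expensive.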

How to Set an SLO You Can Actually Hit

Don't start with the number you want. Start with the number your system has actually achieved.

Step 1: Measure first. Pull 90 days of real availability and error-rate data before writing any target down. If you don't have that data, deploy external monitoring now and wait before committing to customers.

Step 2: Apply a realistic haircut. Your historical best isn't your SLO. Take your observed availability, subtract a margin for incidents you haven't had yet, and round down conservatively. If you've sustained 99.7% over the last quarter with no major incidents, a 99.5% SLO is defensible. A 99.9% SLO is not.
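One way to make the haircut mechanical is sketched below. The 0.2-point margin and the 0.1-point rounding step are assumptions chosen only to reproduce the example above, and `propose_slo` is a hypothetical helper; pick a margin that reflects your own incident history.

```python
from decimal import Decimal, ROUND_FLOOR


def propose_slo(observed_percent: str,
                margin_points: str = "0.2",
                step: str = "0.1") -> Decimal:
    """Subtract a safety margin from observed availability, then round down.

    Values are passed as strings so Decimal keeps them exact
    (binary floats cannot represent 99.7 precisely).
    """
    observed = Decimal(observed_percent)
    margin = Decimal(margin_points)
    grid = Decimal(step)
    if margin < 0 or grid <= 0:
        raise ValueError("margin must be >= 0 and step must be > 0")
    candidate = observed - margin
    # Round down to the nearest multiple of `step`, never up.
    steps = (candidate / grid).to_integral_value(rounding=ROUND_FLOOR)
    return steps * grid


print(propose_slo("99.7"))   # observed 99.7% over the last quarter
```

Rounding down rather than to the nearest step is deliberate: the point of the haircut is to err on the conservative side.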

Step 3: Pick a meaningful window. Rolling 30-day windows are common and useful because they reflect recent performance without being skewed by a single bad day months ago. Avoid calendar-month windows, which reset your budget on the 1st no matter how badly the previous month ended.
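To illustrate how a rolling window behaves, here is a small sketch that recomputes availability each day over the trailing N days. It assumes you already have per-day counts of good and total requests; the function name and input shape are illustrative.

```python
from collections import deque
from typing import Iterable, Iterator, Tuple


def rolling_availability(daily_counts: Iterable[Tuple[int, int]],
                         window_days: int = 30) -> Iterator[float]:
    """Yield availability (%) over the trailing window, one value per day.

    daily_counts holds (good_requests, total_requests) per day, oldest first.
    """
    window = deque(maxlen=window_days)  # old days fall off automatically
    for good, total in daily_counts:
        window.append((good, total))
        good_sum = sum(g for g, _ in window)
        total_sum = sum(t for _, t in window)
        # With no traffic in the window there is nothing to fail.
        yield 100.0 * good_sum / total_sum if total_sum else 100.0


# A bad day followed by clean days, using a short 2-day window for clarity.
history = [(100, 100), (50, 100), (100, 100), (100, 100)]
print(list(rolling_availability(history, window_days=2)))
```

The bad day drags the figure down while it is inside the window and stops counting once it ages out, with no artificial reset at a month boundary.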

Step 4: Define what counts as a failure. "Downtime" is ambiguous. Specify: is a 10-second timeout a failure? Is degraded performance (e.g., p99 latency > 2 s) a failure? Write it down before an incident makes it political.
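Writing the definition down as code is one way to remove the ambiguity. The sketch below is an assumption about how such a policy might look: the `FailurePolicy` and `Request` types are hypothetical, and the 10-second and 2-second thresholds are borrowed from the examples above, applied per request rather than as a p99.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FailurePolicy:
    timeout_seconds: float = 10.0         # at or beyond this, the request failed
    slow_seconds: Optional[float] = 2.0   # None: slowness alone is not a failure


@dataclass(frozen=True)
class Request:
    status: Optional[int]     # None if no response arrived at all
    duration_seconds: float


def is_failure(req: Request, policy: FailurePolicy) -> bool:
    """Decide whether a single request counts against the error budget."""
    if req.status is None or req.status >= 500:
        return True
    if req.duration_seconds >= policy.timeout_seconds:
        return True
    if policy.slow_seconds is not None and req.duration_seconds > policy.slow_seconds:
        return True
    return False


strict = FailurePolicy()
lenient = FailurePolicy(slow_seconds=None)
slow_but_ok = Request(status=200, duration_seconds=3.0)
print(is_failure(slow_but_ok, strict), is_failure(slow_but_ok, lenient))
```

The same 3-second response is a failure under one policy and a success under the other, which is exactly the disagreement you want settled before an incident review.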

Step 5: Set your SLA below your SLO. If your SLO is 99.5%, a reasonable SLA commitment might be 99.0% or 99.2%. The gap gives you room to investigate and remediate without immediately triggering customer credits.
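The size of that gap can be expressed in minutes, which tends to be easier to reason about than tenths of a percent. A minimal sketch, assuming both targets are measured over the same window; `sla_buffer_minutes` is an illustrative name.

```python
def sla_buffer_minutes(slo_percent: float, sla_percent: float,
                       window_days: int = 30) -> float:
    """Minutes of extra failure allowed after breaching the SLO
    but before breaching the SLA, over the same window."""
    if sla_percent >= slo_percent:
        raise ValueError("the SLA target must be lower than the SLO target")
    window_minutes = window_days * 24 * 60
    return (slo_percent - sla_percent) / 100 * window_minutes


print(round(sla_buffer_minutes(99.5, 99.0)))  # SLO 99.5%, SLA 99.0%
print(round(sla_buffer_minutes(99.5, 99.2)))  # SLO 99.5%, SLA 99.2%
```

With a 99.5% SLO and a 99.0% SLA, the buffer is about 216 minutes per 30 days; tightening the SLA to 99.2% shrinks it to roughly 130 minutes.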

Measuring From the Right Place

Internal metrics (your load balancer's success rate, your APM tool) only tell you what your infrastructure sees. They miss DNS failures, CDN edges going dark, or a network path that's broken for users in a specific region.

External uptime monitoring — probing your endpoints from multiple geographic locations — gives you SLI data that's closer to what your customers actually experience. If your SLO is customer-facing availability, your SLI should be measured from outside your own network. Multi-region probes also help you distinguish a global outage from a regional blip, which matters when you're doing incident triage and burning error budget.
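As a rough illustration of the triage step, the sketch below classifies one round of probe results. The rule used here (all regions failing means global, some failing means regional) is a simplifying assumption, and the region names are made up; a real setup would likely require several consecutive failures before declaring anything.

```python
from typing import Mapping


def classify_outage(probe_ok: Mapping[str, bool]) -> str:
    """Classify one round of probe results keyed by probe region."""
    if not probe_ok:
        raise ValueError("no probe results to classify")
    failed = [region for region, ok in probe_ok.items() if not ok]
    if not failed:
        return "healthy"
    if len(failed) == len(probe_ok):
        return "global"
    return "regional"


# Hypothetical probe regions.
print(classify_outage({"us-east": True, "eu-west": False, "ap-south": True}))
print(classify_outage({"us-east": False, "eu-west": False, "ap-south": False}))
```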

Using Your Error Budget

Once you have an error budget, it needs to drive real decisions:

Budget healthy? Ship features, run load tests, do risky infrastructure migrations.
Half the budget spent? Slow down deploys. Require extra review on changes touching critical paths.
Budget exhausted? Freeze non-essential changes. Focus entirely on reliability work until the budget recovers (on a rolling window, that happens as old failures age out).
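The three tiers above can be encoded as a simple policy function that a deploy pipeline or dashboard could call. This is a sketch under assumptions: it reads the 50% mark as half of the budget consumed, and the function names and return labels are illustrative.

```python
def budget_consumed(bad_minutes: float, budget_minutes: float) -> float:
    """Fraction of the error budget used so far in the current window."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    return bad_minutes / budget_minutes


def deploy_policy(consumed: float) -> str:
    """Map budget consumption to one of the three tiers described above."""
    if consumed < 0:
        raise ValueError("consumed cannot be negative")
    if consumed >= 1.0:
        return "freeze"   # budget exhausted: reliability work only
    if consumed >= 0.5:
        return "slow"     # half spent: fewer deploys, extra review
    return "ship"         # healthy: features, load tests, migrations


# 216 minutes is the 30-day budget for a 99.5% SLO.
for bad in (30, 120, 240):
    print(bad, deploy_policy(budget_consumed(bad, 216)))
```

Keeping the thresholds in one small function makes the policy easy to review and hard to reinterpret in the middle of an incident.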

This is the core loop: the error budget makes reliability a shared engineering concern, not just an ops problem.

Common Mistakes to Avoid

Copying a competitor's SLA without understanding your stack. Their architecture isn't yours.
Using uptime as the only SLI. A service that responds 200 OK but returns garbage data is not "up."
Setting SLOs nobody checks. If the metric isn't in a dashboard someone looks at weekly, it doesn't exist operationally.
Forgetting dependencies. Your SLO is capped by the hard dependencies in your request path, and their failures compound. If your payment provider targets 99.9%, you can't credibly promise more than that for checkout.
Never revisiting targets. As your system matures, your SLOs should too. Review them quarterly.
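The dependency point can be checked with quick arithmetic. The sketch below multiplies the availabilities of components that must all work for a request to succeed. It assumes hard, serial dependencies that fail independently, which real systems only approximate; the function name is illustrative.

```python
import math
from typing import Iterable


def serial_availability(percents: Iterable[float]) -> float:
    """Best-case availability (%) of a chain where every component is required."""
    return 100.0 * math.prod(p / 100.0 for p in percents)


# Your own service at 99.9% plus a payment provider at 99.9%.
print(f"{serial_availability([99.9, 99.9]):.4f}%")
```

Two components at 99.9% each give a ceiling of about 99.8%, so the chain as a whole is weaker than its weakest link.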

Key Takeaways

Measure historical performance before setting any target.
SLA < SLO — the gap is intentional, not padding.
Define failure precisely before an incident forces you to.
Error budgets only work if they're connected to shipping and freeze decisions.
External, multi-region monitoring is the right source of truth for customer-facing SLIs.
Review your SLOs quarterly; they should evolve with your system.