What Five-Nines Actually Costs You
Ninety-nine point nine-nine-nine percent uptime allows roughly 5 minutes and 15 seconds of downtime per year. That's not a rounding error; it's a hard budget. Blow it on one bad deploy and you're done for the year.
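The arithmetic behind that figure is short enough to check yourself. The sketch below assumes an average year of 365.25 days; the function names are illustrative, not from any library.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumption: average year, leap days included


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year for an availability target of `nines` nines."""
    unavailability = 10 ** (-nines)  # five nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


def format_budget(seconds: float) -> str:
    """Render a budget in whole minutes plus remaining seconds."""
    minutes, secs = divmod(seconds, 60)
    return f"{int(minutes)} min {secs:.1f} s"


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(n, "nines:", format_budget(downtime_budget_seconds(n)))
```

For five nines this gives about 315.6 seconds, which is the 5 minutes and 15 seconds quoted above. Each extra nine cuts the budget by a factor of ten.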
Before chasing the number, confirm you actually need it. Five-nines is appropriate for payment processors, emergency-services infrastructure, and core API gateways. For a marketing site or an internal dashboard, the engineering cost almost certainly outweighs the benefit.
If you do need it, here's what it takes.
Eliminate Every Single Point of Failure
SPOFs kill five-nines faster than anything else. Audit your stack ruthlessly:
- Load balancers: run at least two in active-active, in separate availability zones.
- Databases: synchronous replication with automatic failover (e.g., PostgreSQL with Patroni, Amazon RDS Multi-AZ). Know your failover time; many "HA" setups take 30 to 60 seconds to promote a replica, which is roughly 10 to 20 percent of your annual budget spent in a single event.
- DNS: use two independent DNS providers. A single provider outage (it happens) shouldn't take you offline.
- Third-party dependencies: every external call is a potential SPOF. Circuit-break them, cache responses where valid, and design degraded-but-functional fallback paths.
- Deployment pipelines: a broken CI/CD system shouldn't prevent a hotfix from reaching production.
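To make the circuit-breaking advice for third-party dependencies concrete, here is a minimal sketch of the pattern. The class, its thresholds, and the `fallback` callable are all illustrative assumptions, not a specific library's API; a production breaker would also need thread safety and metrics.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after N consecutive failures, then allows a trial call after a cooldown."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after  # cooldown in seconds (hypothetical default)
        self.clock = clock              # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        # While the circuit is open and cooling down, skip the dependency entirely.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cooldown elapsed: fall through and let one trial call reach the dependency.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Any success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The `fallback` is where the degraded-but-functional path lives: a cached response, a default value, or a queued retry, depending on what the caller can tolerate.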
Go Multi-Region, Not Just Multi-AZ
A single cloud region can and does go down. Multi-AZ protects against datacenter-level failures; multi-region protects against the larger blast radius of a regional event.
Multi-region adds real complexity:
- Data replication lag: decide upfront whether you need active-active (hard, requires conflict resolution) or active-passive with fast failover.
- Global load balancing: use anycast DNS or a global load balancer (AWS Global Accelerator, Cloudflare, etc.) to route traffic to the healthy region.
- Consistent configuration: secrets, feature flags, and config must be available in every region independently. Don't let a central config service become a cross-region dependency.
- Failover testing: run regular gamedays where you cut traffic to a region and verify the other region absorbs load correctly.
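Before a gameday, it can help to check on paper whether the surviving regions could absorb the load at all. The sketch below is one simple way to do that; the region names and the requests-per-second figures are made up for illustration, and real capacity planning would also account for warm-up, caches, and database limits.

```python
from typing import Dict


def survives_region_loss(peak_rps: Dict[str, float],
                         capacity_rps: Dict[str, float]) -> Dict[str, bool]:
    """For each region, report whether the others can carry total peak traffic without it."""
    total_peak = sum(peak_rps.values())
    verdict = {}
    for lost in capacity_rps:
        remaining = sum(cap for name, cap in capacity_rps.items() if name != lost)
        verdict[lost] = remaining >= total_peak
    return verdict


if __name__ == "__main__":
    # Hypothetical numbers: two regions, 1000 rps of combined peak traffic.
    peak = {"us-east": 600.0, "eu-west": 400.0}
    capacity = {"us-east": 1200.0, "eu-west": 800.0}
    print(survives_region_loss(peak, capacity))
```

With these example numbers, losing `eu-west` is survivable but losing `us-east` is not, because `eu-west` alone cannot carry the combined peak. A gameday then confirms or refutes what the spreadsheet claims.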
Control Your Change Rate
The majority of outages are self-inflicted: a bad config push, a schema migration that locks tables, a dependency upgrade that wasn't tested under load. Five-nines demands extreme discipline around change:
- Progressive rollouts: canary or blue-green deployments so a bad release hits 1% of traffic before 100%.
- Feature flags: decouple deployment from release. You can ship code and turn it on later, and turn it off instantly if something breaks.
- Automated rollback triggers: define error-rate and latency thresholds that trigger an automatic rollback without human intervention.
- Schema migrations: never deploy a migration that locks a table in production. Use expand-contract patterns (add column, backfill, then drop old column across multiple releases).
- Freeze windows: real high-availability systems maintain change freeze windows around high-traffic events.
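An automated rollback trigger from the list above can be as simple as comparing canary metrics against thresholds. This is a minimal sketch; the `CanaryStats` type and every threshold value are assumptions chosen for illustration, and you would tune them to your own traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryStats:
    requests: int
    errors: int
    p99_latency_ms: float


def should_rollback(canary: CanaryStats, baseline: CanaryStats,
                    max_error_rate: float = 0.01,
                    max_latency_ratio: float = 1.5,
                    min_requests: int = 500) -> bool:
    """Decide whether the canary breaches the error-rate or latency threshold."""
    if canary.requests < min_requests:
        return False  # too little traffic to judge; keep observing
    error_rate = canary.errors / canary.requests
    if error_rate > max_error_rate:
        return True
    # Compare latency against the current production baseline, not a fixed number.
    return canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_ratio
```

The `min_requests` guard is a design choice: it avoids rolling back on a handful of unlucky requests, at the cost of a slower decision on low-traffic services.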
Know Before Your Users Do
Five-nines requires sub-minute detection of problems. If you find out about an outage from a customer tweet, you've already lost significant budget.
Synthetic monitoring from multiple locations
External uptime checks, run from geographically distributed vantage points, catch issues that internal metrics miss entirely: DNS failures, TLS certificate expiry, BGP route leaks, and CDN edge node problems. A check run from a single location can miss a regional routing issue that's affecting half your users.
This is where multi-region monitoring (like what Pingy runs) earns its keep: if probes from three continents all agree your endpoint is down, you have high-confidence signal with minimal false positives.
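The agreement-between-probes idea can be sketched as a simple quorum rule. This illustrates the general technique only and is not a description of how any particular monitoring product works; the location names and the quorum value are assumptions.

```python
from typing import Dict


def endpoint_is_down(probe_ok: Dict[str, bool], quorum: float = 0.5) -> bool:
    """Declare an outage only if more than `quorum` of the probe locations failed.

    `probe_ok` maps a probe location to True when its check succeeded.
    """
    if not probe_ok:
        return False  # no data is not evidence of an outage
    failures = sum(1 for ok in probe_ok.values() if not ok)
    return failures / len(probe_ok) > quorum


if __name__ == "__main__":
    print(endpoint_is_down({"frankfurt": False, "virginia": False, "singapore": False}))
    print(endpoint_is_down({"frankfurt": False, "virginia": True, "singapore": True}))
```

A single failing location is still worth recording, since it may point to the regional routing problem described above, but it should not page anyone as a global outage.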
Internal observability
- RED metrics (Rate, Errors, Duration) on every service boundary
- Structured logs shipped to a log aggregator that is not on the same infrastructure you're monitoring
- Alerting routed to an on-call system with escalation policies, not just a Slack channel someone might miss at 3 am
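As a rough illustration of the RED metrics mentioned above, the sketch below computes rate, error ratio, and a p99 duration from one window of request records. The `Request` type is hypothetical, and in practice these numbers usually come from a metrics library rather than hand-rolled code.

```python
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class Request:
    duration_ms: float
    ok: bool


def red_metrics(requests: Sequence[Request], window_seconds: float) -> Dict[str, float]:
    """Rate, Errors, Duration for one service boundary over one time window."""
    n = len(requests)
    if n == 0:
        return {"rate_rps": 0.0, "error_ratio": 0.0, "p99_ms": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    rank = (99 * n + 99) // 100  # nearest-rank p99: ceil(0.99 * n) in integer math
    return {
        "rate_rps": n / window_seconds,
        "error_ratio": sum(1 for r in requests if not r.ok) / n,
        "p99_ms": durations[rank - 1],
    }
```

Tracking a high percentile rather than the mean matters here: a mean can look healthy while one request in a hundred times out.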
Practice Failure Constantly
Resilience you haven't tested is resilience you don't have. Run chaos experiments in production (with careful blast-radius controls), not just staging. Conduct blameless post-mortems after every incident and track whether action items actually close.
Maintain a runbook for every class of failure your system could experience. When an alert fires at 2 am, the on-call engineer should be executing a known procedure, not improvising.
Key Takeaways
- Five-nines = ~5 minutes 15 seconds of downtime per year. Treat it as a strict budget.
- Eliminate SPOFs at every layer: compute, database, DNS, and third-party dependencies.
- Multi-AZ is table stakes; multi-region is required for true five-nines.
- Most outages are caused by changes: progressive rollouts, feature flags, and automated rollback are essential.
- You can't defend a budget you can't measure: external synthetic monitoring from multiple regions gives you ground-truth visibility before users complain.
- Untested resilience is theoretical. Chaos engineering and gamedays make it real.