I’ve worked in the network services industry since the 90s, and I still have a small web hosting business that survived the dot-com bubble burst. To this day, my partner and I lovingly keep it humming for our loyal customers, some of whom have been with us for over 20 years. Our service-level agreement (SLA) guarantees 99.9% uptime, which is purely fabricated. More or less, it’s based on our historical uptime, with a nod of confidence from our upstream data centre’s SLA.
As an analyst, I have never been comfortable with uptime as a metric. Claiming our service uptime will stay at about 99.9% is, at best, hypothetical and non-deterministic, backed up only by our track record, along with a prayer that we don’t under-perform in the future. In other words, uptime is a lagging indicator.
Uptime is typically expressed as a value ranging from 99% to as high as 99.999% (“five nines”). Five nines translates to roughly five minutes of downtime per year. And while SLAs use uptime as a contractual metric, they imply that every outage is detected, measured, and reported accurately. In fact, they typically exclude planned outages for maintenance, and seldom define partial outages, such as an FTP or SSH server going offline. Nor do they include non-impacting failures, such as a redundant power supply failing, or a monitoring agent crashing without disrupting service.
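To make the arithmetic concrete, here is a quick back-of-the-envelope Python sketch (assuming a 365.25-day year) showing how little downtime each additional nine allows:

```python
# How much downtime does each "nines" level allow per year?
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

for nines in range(2, 6):             # 99% ("two nines") up to 99.999% ("five nines")
    downtime_fraction = 10 ** -nines  # e.g. five nines leaves 0.001% of the year
    downtime_minutes = downtime_fraction * MINUTES_PER_YEAR
    label = f"{1 - downtime_fraction:.{nines - 2}%}"
    print(f"{label:>8} uptime -> {downtime_minutes:8.1f} minutes of downtime per year")
```

Running it shows the drop from two nines (about 3.7 days of allowed downtime) to five nines (about 5.3 minutes), which is why each additional nine is dramatically harder to honour.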
In reality, many failures simply go unnoticed and unreported. And while Wikipedia editors obsess over percentage calculations ranging from one to twelve nines, that precision gives the impression these numbers are deterministic. I know of no real-world service provider that is going to be that granular. “Five nines” might as well be 100%.
High availability (HA), in contrast, is a design philosophy. When designing HA computer clusters and their networks, thinking in terms of uptime leads to over-simplified assumptions and poor architectural decisions. HA is more about how fast and how cleanly a system recovers. Failover time, data consistency, and user experience matter far more than raw uptime.
Real uptime depends on a web of interdependent systems that include hardware, software, orchestration, monitoring, and human response. Uptime percentages don’t measure preparedness. System administrators must be ready for edge cases, degraded modes, and partial failures. A fault in one component might be isolated, or it might cascade across systems. Uptime metrics don’t capture this nuance.
Observability is foundational, and uptime percentages don’t address this. What we really need is for the system to detect and respond to faults before users ever notice. If you achieve this, then users will declare “it never goes down” which, in their minds, equates with a 100% SLA.
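To illustrate what “detect and respond before users notice” can look like in the small, here is a minimal watchdog sketch. The health endpoint, address, and `promote_standby()` hook are all hypothetical stand-ins; in production you would reach for established tooling (keepalived, Pacemaker, or a load balancer’s health checks) rather than a hand-rolled loop:

```python
import time
import urllib.request

HEALTH_URL = "http://10.0.0.10/healthz"  # hypothetical primary's health endpoint
CHECK_INTERVAL = 2.0                     # seconds between probes
FAILURE_THRESHOLD = 3                    # consecutive failures before acting

def is_healthy(url: str, timeout: float = 1.0) -> bool:
    """Probe the health endpoint; any error or non-200 response counts as a failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:                      # covers URLError, HTTPError and timeouts
        return False

def promote_standby() -> None:
    """Placeholder for the real failover action (VIP move, DNS update, etc.)."""
    print("FAILOVER: promoting standby to primary")

def watchdog() -> None:
    failures = 0
    while True:
        if is_healthy(HEALTH_URL):
            failures = 0
        else:
            failures += 1
            print(f"health check failed ({failures}/{FAILURE_THRESHOLD})")
            if failures >= FAILURE_THRESHOLD:
                promote_standby()
                break
        time.sleep(CHECK_INTERVAL)

if __name__ == "__main__":
    watchdog()
```

The failure threshold is the key design choice here: acting on a single failed probe detects faults faster, but causes flapping on transient network blips.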
To sum up, HA isn’t a checkbox on a to-do list—it’s a behaviour that requires continuous discipline!
A Practical Framework for System Architects
How, then, should system architects think about HA and uptime SLAs?
First off, designers shouldn’t be thinking about SLAs at all! That is for the marketing and legal departments to decide. High availability, on the other hand, is absolutely the domain of the architect. It is a multidimensional goal that requires deliberate trade-offs. Architects must define what “available” means in context, and build systems that behave accordingly. Here are my key design considerations:
Failure domains
- Design for fault isolation.
- Use segmentation, redundancy, and containment to prevent cascading failures.
Recovery behaviour
- Prioritise fast, predictable failover.
- Test it regularly.
- Automate where possible.
Observability
- Invest in telemetry, alerting, and anomaly detection.
- Visibility is the first step to resilience.
Operational maturity
- Build playbooks for degraded modes.
- Practice chaos engineering.
- Train teams to respond under pressure.
User experience
- Availability isn’t binary.
- A system might be “up” but unusable.
- Design for graceful degradation (see the sketch after this list).
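To ground that last point, here is a minimal sketch of one graceful-degradation pattern: serving slightly stale cached data when a backend dependency is down. `fetch_live_price`, the cache, and the numbers are hypothetical stand-ins; the control flow is the point:

```python
import time

# symbol -> (price, unix timestamp of the last successful fetch)
_cache: dict[str, tuple[float, float]] = {}

class BackendDown(Exception):
    """Raised when the primary pricing backend is unreachable."""

def fetch_live_price(symbol: str) -> float:
    """Stand-in for a call to a primary backend; here it always fails."""
    raise BackendDown(f"pricing backend unreachable for {symbol}")

def get_price(symbol: str, max_stale: float = 300.0) -> tuple[float, bool]:
    """Return (price, is_fresh); fall back to cached data when the backend is down."""
    try:
        price = fetch_live_price(symbol)
        _cache[symbol] = (price, time.time())
        return price, True
    except BackendDown:
        if symbol in _cache:
            price, fetched_at = _cache[symbol]
            if time.time() - fetched_at <= max_stale:
                return price, False  # degraded: stale but usable
        raise                        # no usable fallback: fail loudly

if __name__ == "__main__":
    _cache["ACME"] = (42.0, time.time())  # pretend an earlier fetch succeeded
    price, fresh = get_price("ACME")
    print(f"ACME: {price} ({'fresh' if fresh else 'stale, degraded mode'})")
```

The same shape applies to read-only modes, reduced feature sets, or queueing writes for later replay: the user sees a diminished service rather than an outage.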
Defining the Goal Posts
To translate high availability principles into actionable targets, system architects should define clear goal posts that reflect recovery expectations, fault isolation, and user impact. Here is how we should be setting meaningful HA targets:
Goal Post | Example
---|---
Recovery Time | Failover within 30 seconds
Failure Isolation | No single fault affects more than 5% of users
Observability | Detect anomalies within 10 seconds
Degraded Modes | Maintain core functionality during partial failure
User Impact | No more than 1% of users experience disruption
Operational Response | Incident triage within 5 minutes
These are not universal benchmarks; they’re starting points. The right goal posts depend on the business, its users, and their risk tolerance.
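One way to keep such goal posts honest is to encode them as data and check post-incident telemetry against them. The sketch below mirrors the example table; the incident record and its field names are hypothetical, and the qualitative “degraded modes” row is omitted:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HATargets:
    """Goal posts from the table above, expressed as machine-checkable numbers."""
    max_failover_seconds: float = 30.0
    max_fault_blast_radius: float = 0.05  # share of users one fault may affect
    max_detection_seconds: float = 10.0
    max_disrupted_users: float = 0.01     # share of users who may see disruption
    max_triage_minutes: float = 5.0

def evaluate(incident: dict, targets: HATargets) -> list[str]:
    """Return the goal posts an incident violated (an empty list means all met)."""
    misses = []
    if incident["failover_seconds"] > targets.max_failover_seconds:
        misses.append("recovery time")
    if incident["users_affected"] > targets.max_fault_blast_radius:
        misses.append("failure isolation")
    if incident["detection_seconds"] > targets.max_detection_seconds:
        misses.append("observability")
    if incident["disrupted_users"] > targets.max_disrupted_users:
        misses.append("user impact")
    if incident["triage_minutes"] > targets.max_triage_minutes:
        misses.append("operational response")
    return misses

# A hypothetical post-incident record, checked against the targets:
incident = {"failover_seconds": 22, "users_affected": 0.03, "disrupted_users": 0.005,
            "detection_seconds": 14, "triage_minutes": 4}
print(evaluate(incident, HATargets()))  # -> ['observability']
```

Reviewed this way, every incident becomes a measurement of the architecture against its own stated targets, rather than a line item in an uptime percentage.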
At the end of the day, high availability isn’t something you buy; it’s something you build, test, and evolve. As a system architect, your job isn’t to chase uptime percentages, but to engineer systems that behave predictably under pressure, recover gracefully, and serve users reliably. The goal isn’t perfection; it’s resilience.