100% reliability is physically impossible. "Reliability" is defined to include resistance to abnormal situations. A bridge might be 100% reliable as long as winds don't exceed 100 mph, an event that might have a 1% probability per year in the area where it is built. That is unlikely in any given year, but it is still something that is accounted for in reliability, uptime, and failure-rate metrics.
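To put a number on why a 1%-per-year event still has to be designed for, here is a rough sketch of the chance of seeing it at least once over a structure's life. The 50-year lifetime and the assumption that years are independent are mine, purely for illustration; the function name is made up too.

```python
def prob_at_least_once(annual_probability: float, years: int) -> float:
    """Chance that an event with the given yearly probability occurs
    at least once in `years` years, assuming independent years."""
    return 1.0 - (1.0 - annual_probability) ** years


# Hypothetical 50-year design life for the bridge (an assumption, not from the thread).
lifetime_risk = prob_at_least_once(0.01, 50)
print(f"Chance of at least one 100 mph wind event in 50 years: {lifetime_risk:.1%}")
```

Under those assumptions the lifetime chance comes out at a bit under 40%, which is why a "1% per year" event cannot be waved away.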
I don't even know if you can achieve 100% reliability given a set of plausible situations/usage, because you'd have to imagine every single possible failure mode for the thing that you're building.
That is - the threat of "destruction" proper (which you contrasted with "collapse without prior warning") is included in the reliability metrics for a device.
The point stands. Nothing that humans build can be 100% reliable - the only thing you can do is asymptotically approach it.
Now, that said - three nines per year is way too low for me, personally. Five nines is more comfortable (and if it's cheap I'd like to go higher).
That really depends on how often it happens. You can't expect 100% reliability, especially for any process that involves humans or physical items. And even if 100% reliability were attainable, the cost would probably be astronomical, so you have to make the tradeoff somewhere.
Don't let the perfect be the enemy of the good. 99.9XXXX reliability is good enough. Eventually you have enough nines that your remaining risks are things like "nuclear war", "dinosaur-killer-sized asteroid hitting the earth", et cetera.
Five nines is impossible. Really. It's just not going to happen.
Actually, it is going to happen. One individual part may well be available less than 99.999% of the time, but the overall system can certainly be designed for greater than 99.999% uptime.
Just get a second datacenter, get a second transit provider. Just as folks scale horizontally, you can build out reliability to many, many nines, such that when one component fails, the overall system integrity isn't impacted.
It's not easy, and not always cheap, but it's quite doable.
Furthermore, even 99.999% reliability is not good enough, and that's already incredibly hard to achieve. Five nines of reliability still means about 5 minutes of unreliable behaviour per year, which is unacceptable in the context of driving.
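The downtime figure is easy to check: N nines of availability leaves a fraction 10^-N of the year unavailable. A minimal sketch, assuming a 365.25-day year and treating "unreliable behaviour" as plain downtime (the function name is just for illustration):

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes, assuming a 365.25-day year


def downtime_minutes_per_year(nines: int) -> float:
    """Expected downtime per year for an availability with `nines` nines,
    e.g. nines=5 means 99.999% availability."""
    unavailability = 10.0 ** -nines
    return MINUTES_PER_YEAR * unavailability


for n in (3, 4, 5, 7):
    print(f"{n} nines: {downtime_minutes_per_year(n):.3f} minutes of downtime per year")
```

Five nines (99.999%) works out to roughly 5.26 minutes per year; three nines is closer to nine hours.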
I mean, yes, there are a lot of designs that fail, but the vast majority of infrastructure operates reliably for decades. Most waterslides don't decapitate people, and skyscraper collapses are exceedingly rare. Launch vehicles fail more frequently but they're phenomenally complicated machines that are produced in low volumes and flown infrequently.
First: We’re not talking about “out of nowhere” or during “routine” operation. Doing better than 99.99% uptime implies robustness to even extreme, unusual situations.
Second: Air travel could be much, much cheaper if it didn’t have to be nearly 100% reliable. This would be the right trade-off to make in almost any application that doesn’t almost guarantee deaths when it fails.
99% is absolutely reasonable for one layer of defense among many! That is one of the best methods to achieve truly high reliability, as is needed in this case: stack many reliable systems in such a way that they all must fail to get an overall failure. It is not perfect, of course, and things can always cascade, but it is a powerful technique.
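As a rough illustration of how far stacking goes, this sketch counts how many independent layers of a given reliability are needed before all of them failing together is rarer than a target allows. Independence is the big assumption here (cascading failures break it), and the function name is hypothetical.

```python
import math


def layers_needed(layer_reliability: float, target_reliability: float) -> int:
    """Smallest number of independent layers, each with `layer_reliability`,
    such that the chance of ALL failing is at most 1 - target_reliability.
    Both arguments must be strictly between 0 and 1."""
    layer_failure = 1.0 - layer_reliability
    allowed_failure = 1.0 - target_reliability
    return math.ceil(math.log(allowed_failure) / math.log(layer_failure))


# Three independent 99% layers are enough for five nines under this model.
print(layers_needed(0.99, 0.99999))
```

Targets that land exactly on a whole number of layers can round up by one because of floating-point error, so treat the result as an estimate rather than a guarantee.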
At some point, you do have to set a threshold for the acceptable probability of systemic failure, as nothing is perfect.
The idea behind orthogonal backups, however, is that since they are independent, very high reliability can be achieved with low-reliability components. For example, if you've got a main with 90% reliability, and a backup with 90% reliability, the combined reliability is 99%, provided their failures really are independent. This can be a lot easier and less expensive to achieve than making one component 99%.
The backup generators could have had a cheap extra seawall built around them, or could simply have been raised up on a 10-foot platform, or been built with snorkels like a jeep designed to cross streams.
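The 90% + 90% = 99% arithmetic generalizes: with independent components, the system fails only if every one of them fails. A small sketch under that independence assumption (the helper name is mine):

```python
import math


def combined_reliability(*component_reliabilities: float) -> float:
    """Reliability of a redundant system that works if ANY component works,
    assuming component failures are independent."""
    all_fail = math.prod(1.0 - r for r in component_reliabilities)
    return 1.0 - all_fail


main_plus_backup = combined_reliability(0.90, 0.90)
print(f"Main + backup at 90% each: {main_plus_backup:.2%}")
```

If the main and the backup share a failure cause (the same flood, the same power feed), the independence assumption no longer holds and the real figure will be lower.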
Building a heftier main seawall would have been an order of magnitude or two more expensive.
"100% is probably never the right reliability target: not only is it impossible to achieve, it’s typically more reliability than a service’s users want or notice. Match the profile of the service to the risk the business is willing to take."
But "three nines reliability" is still not the same thing as 100%. The contractor has a right to be concerned about the 100% figure making its way into a contract.