
It looks to me like this analysis is ignoring the age of the drives. Specifically, it might be comparing the current failure rate of years-old drives against the failure rate of months-old drives. This is likely to make newer disks look much better.

I'd love to see some attempts at calculating mean time to failure (which is admittedly difficult to estimate until you have had a disk long enough to see a whole generation of them fail).




> Since annual failure rate is a function mostly of age, it would be interesting to see a line chart of cumulative failure rate vs age

They showed some of this data in an earlier post https://www.backblaze.com/blog/how-long-do-disk-drives-last/

The "Drives have 3 distinct failure rates" graph is the most interesting, as it shows the result of the expected "bathtub curve" on the cumulative failure rate.


I don't really understand their methodology for computing failure rate. The page says they calculate the rate on a per annum basis as:

([#drives][# failures]) / [operating time across all drives]

Wat? The numerator and denominator seem unrelated. What is being measured here?

To me, it would make more sense to look at time to failure. Together with data on the age of the drive and the proportion of failures each year one could create an empirical distribution to characterise the likelihood of failure in each year of service. That would give a concrete basis from which to compare failure rates across different models.
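A rough sketch of that empirical distribution, assuming each drive is recorded as a hypothetical `(years_in_service, failed)` pair. The sample data is made up, and a drive last seen partway through a year is counted as at risk for that whole year, which is a simplification:

```python
from collections import Counter

def yearly_failure_rates(drives):
    """drives: list of (years_in_service, failed) tuples, one per drive.

    years_in_service is the age at failure if failed is True,
    otherwise the age at which the drive was last observed.
    Returns {year_of_service: failures_in_year / drives_entering_year}.
    """
    at_risk = Counter()
    failures = Counter()
    for age, failed in drives:
        last_year = int(age)  # year of service in which the drive was last seen
        for year in range(last_year + 1):
            at_risk[year] += 1
        if failed:
            failures[last_year] += 1
    return {year: failures[year] / at_risk[year] for year in sorted(at_risk)}

# Hypothetical data, not taken from the Backblaze tables.
sample = [(0.3, True), (1.5, False), (2.2, True), (2.8, False), (0.9, False)]
rates = yearly_failure_rates(sample)
```

Computing this per model would give one curve per model, which can then be compared at the same year of service.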


I barely had time to skim it, but I'm not sure I like how the ST12000NM0008 shows up in the table. I find it really hard to reason about what the real failure rate could end up being on those drives. For example, you've got about 45 days average on each drive, so the failure rate is multiplied by roughly 8 to extrapolate the annualized failure rate. Doesn't that overstate the estimated rate of failure, since drives will tend to fail more often at the start of their life?

I only guesstimated out of the table and didn't have time to look at the actual data, so it's possible I misread something.
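To make the extrapolation concrete, here is a small sketch. It assumes the annualized failure rate means failures per drive-year of operation; the fleet size and failure count are hypothetical, and only the 45-day average comes from the comment above:

```python
def annualized_failure_rate(failures, drive_days):
    """Failures per drive-year, assuming AFR = failures / (drive_days / 365)."""
    drive_years = drive_days / 365
    return failures / drive_years

# Hypothetical fleet: 1000 drives, 45 days of service each, 2 failures so far.
drives, days_each, failures = 1000, 45, 2

raw_fraction = failures / drives          # fraction failed in the 45-day window
scale = 365 / days_each                   # the "roughly 8" multiplier
afr = annualized_failure_rate(failures, drives * days_each)
```

Here `afr` equals `raw_fraction * scale`, so the annualized figure implicitly assumes the first 45 days are representative of the whole year.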


The largest error in this analysis is the arbitrary choice of a 0.1% failure rate for some arbitrary time period. For the analysis to make sense, it should use the probability of failure over the time it takes to replicate back to 3 copies. What makes this hard to compute is that drives don't have a linear death rate with respect to age; it's more "bathtub" or `U` shaped. This number should really be a function of the ages of all the disks.

Drives seem to have a sweet spot of reliability, after the first few months and up until 2-4 years, where the failure rate is really low.


I really appreciate these blog posts and have used the data they present when buying HDDs for personal use.

One thing that bothers me is that the data presented doesn't really take into account the age of the HDDs. For example, if a batch of HDDs of a particular model is 6 years old and has a failure rate of 12%, that really doesn't tell me much except that it's an old HDD.

What I'd like to know is, for a given model, what the blended failure rate is after 3mo, 6mo, 12mo, 24mo etc. of operational time. That would be a real apples-to-apples comparison.


I wonder if they have stats on failure rate based on a drive's manufacturing or installation date? It looks like there's a bit of a bathtub curve, and it would be interesting to see if that's attributable to individual drives having a tendency to fail quickly (if they're going to), or if drives are less likely to crap out once their model has been manufactured for a few years.

Yes, and because of that the numbers on the average time to failure are completely meaningless. The drives that don't ever fail skew the numbers completely. If a fantastically reliable drive were to have 5/5000 drives fail, but they all failed in the first month and then the rest carried on forever, then that would show here as having a lower "reliability" than a dire drive where 4000/5000 drives fail after a year.

I'd like to see instead something like mean time until 2% of the drives fail. That'd actually be comparable between drives. And yes, it would also mean that some drive types haven't reached 2% failure yet, so they'd be shown as ">X months".
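The 5/5000 versus 4000/5000 example can be sketched as follows. The function names are hypothetical, and time is in months:

```python
def mean_time_to_failure(failure_months):
    """Average age at failure, counting only drives that actually failed."""
    return sum(failure_months) / len(failure_months)

def months_until_fraction_failed(failure_months, fleet_size, fraction):
    """Month at which cumulative failures reach `fraction` of the fleet,
    or None if the fleet has not reached that fraction yet."""
    threshold = fraction * fleet_size
    failed = 0
    for month in sorted(failure_months):
        failed += 1
        if failed >= threshold:
            return month
    return None

reliable = [1] * 5       # 5 of 5000 drives fail in the first month
dire = [12] * 4000       # 4000 of 5000 drives fail at one year

mttf_reliable = mean_time_to_failure(reliable)   # 1 month: looks worse
mttf_dire = mean_time_to_failure(dire)           # 12 months: looks better
t2_reliable = months_until_fraction_failed(reliable, 5000, 0.02)  # None, i.e. ">X months"
t2_dire = months_until_fraction_failed(dire, 5000, 0.02)          # 12
```

The mean over failed drives ranks the two models the wrong way round, while the time-to-2% figure does not.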

This is what a Kaplan-Meier survival curve was meant for [0]. Please use it.

Also, it'd be great to see the confidence intervals on the annualised failure rates.

[0] https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator
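For anyone who wants to try it on the raw data, a minimal pure-Python sketch of the Kaplan-Meier estimator follows. The input format, a list of `(time, failed)` pairs where `failed=False` means the drive was still running or was retired at that time, is an assumption, and the sample data is made up:

```python
def kaplan_meier(observations):
    """observations: list of (time, failed) pairs, one per drive.

    failed=False marks a censored drive (still running or retired unfailed).
    Returns [(time, survival_probability)] at each distinct failure time.
    """
    observations = sorted(observations)
    at_risk = len(observations)
    survival = 1.0
    curve = []
    i = 0
    while i < len(observations):
        t = observations[i][0]
        deaths = 0
        leaving = 0
        # Group all drives whose observation ends at time t.
        while i < len(observations) and observations[i][0] == t:
            deaths += observations[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical drive ages in months.
curve = kaplan_meier([(1, True), (2, False), (3, True), (4, False), (5, True)])
```

Drives that have not failed still count toward the at-risk total up to their last observed age, which is what lets young and old cohorts share one curve.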


Andy for Backblaze here: A while back we did an analysis of drive failure over time, i.e. the bathtub curve. It is probably a good idea to update that, as I believe we are seeing lower failure rates upfront these days.

It'd be interesting and quite helpful to see the failure rate vs. drive age, per manufacturer.

For example, for less reliable manufacturers there might be an "if you get past the first N weeks, you are fine" pattern, or a failure cliff exactly 1 week past the warranty period, or something equally entertaining.


I'll try to get the datacenter techs to answer tomorrow, but here is my best off the cuff attempt:

> predict if a drive will fail

We have some heuristics (high numbers of time outs and high remapped sectors), but in the end most failures are sudden and catastrophic. It is more like statistical tendencies, the most obvious one being drive age.

> sparsely in rows

Others have noticed, I have to ask the OTHER Brian (Brian Beach did the lion's share of this drive stat collection and presentation).

> its you guys

Aww shucks. :-) But remember, we do this as a guide for ourselves, but we spend most days working on backup features and scaling, we don't have a lot of extra time. That's why we're sending this data out there, some smart grad student or PhD in Xerox PARC can hopefully figure out some good stuff we missed! Besides, I don't think math and statistics are our strength, we just happen to be sitting on one of the world's larger stockpiles of spinning drives with access to the computers with scripts. :-)


The problem with the statistics provided is that they don't account for how long the drives have been running. Eventually all hard drives will fail but what matters is how long it will take until they do. From the charts I can't understand which is the most reliable over time. Or am I just misinterpreting the data?

I've always had a question about those stats. "Drive days" seems a key figure, but 1000 drive days may mean 100 drives with 10 days each, or 10 drives with 100 days each. This kind of metric implies that the chance of failing is independent of the age of the drive. Is that so? Was that verified? One of the most failing drives around is the Seagate 4TB, which is quite an old model (it has existed since 2013, but it's still sold). How is the drive cohort composed? If it's composed of many old drives and many very new drives, it could mean that we're observing a "mean value" with very little significance.

It would be GREAT to have a "long form" CSV with a) drive model b) service start date (or service hours count) c) failed/not failed during quarter. THAT would help in understanding whether drives fail at random or because of old age (and/or what the correlation between age and failure is: is there a threshold effect, or is it linear?)
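The drive-days question can be illustrated with a toy model. The Weibull lifetime model and its parameters below are assumptions picked for illustration, not anything fitted to the published data. The function sums the cumulative hazard over the fleet, which approximates the expected number of failures when failures are rare:

```python
import math

def expected_failures(drives, days_each, shape, scale_days):
    """Sum of Weibull cumulative hazard H(t) = (t / scale) ** shape over the fleet.

    shape == 1 is a constant, age-independent hazard;
    shape < 1 means the hazard is highest when drives are young.
    """
    return drives * (days_each / scale_days) ** shape

# Both fleets have accumulated exactly 1000 drive days.
young_fleet = (100, 10)   # 100 drives, 10 days each
old_fleet = (10, 100)     # 10 drives, 100 days each

# Age-independent hazard: drive days are all that matters.
const_young = expected_failures(*young_fleet, shape=1.0, scale_days=1000)
const_old = expected_failures(*old_fleet, shape=1.0, scale_days=1000)

# Early-life failures dominate: same drive days, very different outcomes.
infant_young = expected_failures(*young_fleet, shape=0.5, scale_days=1000)
infant_old = expected_failures(*old_fleet, shape=0.5, scale_days=1000)
```

With `shape=1.0` both fleets give the same figure. With `shape=0.5` the young fleet is expected to see roughly three times as many failures for the same number of drive days.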


Interesting, thanks for posting! Could you talk quickly about why it's interesting to predict drive failure? Is it to understand how many replacement drives you might need to order in the short term, or is there value beyond stock management of drives?

Think of the classic "bathtub" curve (which says that young drives fail more frequently, old drives fail more frequently, and mid-age drives are most reliable).

That curve doesn't seem to match the data here. Or if it does, it says the "old" increase in failure rate happens at over 5 years.

I would guess Backblaze will replace these old drives because they are too small/too slow/use too much power before they replace them for being too unreliable.


> Based on Backblaze’s stats, high-quality disk drives fail at 0.5-4% per year. A 4% risk per year is a 2% chance in any given week. Two simultaneous failures would happen once every 48 years, so I should be fine, right?

Either I misunderstood or there are some typos, but this math seems all kinds of wrong.

A 4% risk per year (assuming failure risk is independent of disk age) is less than 0.1% per week. A 2% risk per week would be a 65% risk per year!!

2 simultaneous failures in the same week for just 2 disks (again with the huge assumption of age-independent risk) would be on the order of magnitude of less than 1:10^6, so more than 20k years (31.2k years tbc).

Of course you either change your drives every few years so the age-independent AFR still holds or you have to model the probability of failure using some exponential distribution like Poisson's. Exercise for the reader to estimate the numbers in that case.
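The arithmetic above can be checked in a few lines, under the same age-independent assumption and treating the two disks as failing independently:

```python
annual_risk = 0.04

# Weekly risk that compounds to 4% over 52 weeks.
weekly_risk = 1 - (1 - annual_risk) ** (1 / 52)

# What a 2% weekly risk would compound to over a year.
annual_from_2pct_weekly = 1 - (1 - 0.02) ** 52

# Both of two independent disks failing in the same given week.
both_same_week = weekly_risk ** 2

# Expected wait, in years, for such a week to occur.
years_between = 1 / both_same_week / 52

print(f"weekly risk: {weekly_risk:.4%}")
print(f"2% weekly compounds to: {annual_from_2pct_weekly:.1%} per year")
print(f"double failure roughly once every {years_between:,.0f} years")
```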


Is it possible for this data to ever be useful? Given the time necessary to acquire the data, and the rate at which improvements are made to drives, cannot we make the assumption that drives purchased today probably won't operate in exactly the same manner as drives purchased a year ago?

I don't mean to insult, just to ponder the relevance of such long-term studies on tech that changes so quickly.


> If every drive type, new and old, big and small, did better this year, maybe they changed something in their environment this year?

It can also be the case that newer drives this year are better than newer drives last year, while older drives are over a "hill" in the failure statistics, e.g. it could be the case that there are more 1st-year failures than 2nd-year failures (for a fixed number of drives starting the year).


Says the annual failure rate is 1.5%, but average time to failure is 2.5 years? Those numbers don't line up.

Are most drives retired without failing?
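The two numbers can coexist if the average only covers drives that failed inside a limited observation window. A sketch, assuming a constant hazard of 1.5% per year and a hypothetical 5-year window (neither the constant hazard nor the window length comes from the post):

```python
import math

def mean_age_at_observed_failure(annual_hazard, window_years):
    """Mean of an exponential lifetime conditioned on failing within the window:
    E[t | t < T] = 1/lam - T * exp(-lam*T) / (1 - exp(-lam*T))."""
    lam, T = annual_hazard, window_years
    return 1 / lam - T * math.exp(-lam * T) / (1 - math.exp(-lam * T))

def fraction_failed_in_window(annual_hazard, window_years):
    """Share of drives that fail before the window ends."""
    return 1 - math.exp(-annual_hazard * window_years)

avg_age = mean_age_at_observed_failure(0.015, 5)   # a little under 2.5 years
failed = fraction_failed_in_window(0.015, 5)       # well under 10% of drives
```

With a low hazard, failures land almost uniformly across the window, so the average age at failure sits near the midpoint while most drives never show up in that average at all.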


I have always wondered why they don't use techniques from survival analysis to be able to draw conclusions even from sets with lower failure rates. Or for that matter to avoid slight bias even for drives they do report.
