Hard Drive Failures

7th January, 2011

For a long time I have stated to clients that all hard disk drives (HDDs) will fail.

The reason for this statement is to motivate them to act: to ensure that they have adequate resilience in their storage and adequate backups, rather than an assumption that failure will never happen. However, is having implemented all these strategies enough reason to give you a warm and fuzzy feeling that all is well with your data?

If you have purchased hard drives, you may have seen the metrics MTBF (Mean Time Between Failures, stated in hours) or MTTDL (Mean Time To Data Loss, stated in years). Are these metrics useful, or are they predominantly used as part of marketing?

In a recent article, it was reported that these metrics are based not on experimentation but on modelling. A MTTDL of 2 million years does not mean that a drive you purchase will last that long; I have certainly deployed systems with quality HDDs in them that lasted one week. SUN claims a MTTDL of 2.4 million years for their ST5800 system, which raises two questions: is this adequate, and what does it mean? As a thought experiment, suppose 10 ST5800s were watched, and three lost data in the first year, four after 2.4 million years and three after 4.8 million years. This does give a MTTDL of 2.4 million years, but would we consider acceptable a system that had a 30% chance of data loss in the first year?
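The arithmetic behind the thought experiment can be checked with a short Python sketch. The list of lifetimes is simply the ten hypothetical systems described above; treating a loss "in the first year" as year 0 is an assumption made to keep the numbers round, and the function names are illustrative only.

```python
from statistics import mean

# Years until data loss for each of the ten hypothetical systems.
# Assumption: a loss "in the first year" is counted as year 0.
years_to_data_loss = [0.0] * 3 + [2.4e6] * 4 + [4.8e6] * 3


def mttdl(samples):
    """Mean time to data loss: the plain average of the observed lifetimes."""
    return mean(samples)


def fraction_lost_within(samples, horizon_years):
    """Share of systems that lost data at or before the given horizon."""
    return sum(1 for t in samples if t <= horizon_years) / len(samples)


if __name__ == "__main__":
    print("MTTDL (years):", mttdl(years_to_data_loss))
    print("Lost in first year:", fraction_lost_within(years_to_data_loss, 1.0))
```

The average comes out at 2.4 million years even though three of the ten systems are gone within a year, which is why a mean on its own says little about early failures.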

Other HDD failures and metrics

  • Silent data corruption: where the contents of a file change for an unknown reason;
  • Bit rot: a process that randomly flips bits, i.e. a 1 becomes a 0 and vice versa;
  • Non-recoverable error rate: the probability that a bit will be read incorrectly, regardless of how long it has been stored, e.g. 1 bit in 10^14 bits (12.5 terabytes).
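To put the non-recoverable error rate into perspective, the sketch below estimates the chance of hitting at least one unreadable bit while reading a given amount of data. It assumes the quoted rate of 1 in 10^14, decimal terabytes, and that bit errors are independent of one another; real drives will not follow this model exactly, so treat the figures as rough guidance. The function name is illustrative.

```python
import math

URE_RATE = 1e-14  # quoted rate: 1 unreadable bit per 10^14 bits read


def p_read_error(terabytes, ure_rate=URE_RATE):
    """Probability of at least one unrecoverable bit error in a full read.

    Assumes independent bit errors: P = 1 - (1 - rate) ** bits.
    log1p/expm1 keep the result accurate for such a tiny rate.
    """
    bits = terabytes * 1e12 * 8  # decimal terabytes to bits
    return -math.expm1(bits * math.log1p(-ure_rate))


if __name__ == "__main__":
    for tb in (1, 4, 12.5):
        print(f"{tb} TB read: {p_read_error(tb):.1%} chance of an error")
```

Under these assumptions, reading the full 12.5 terabytes gives roughly a 63% chance of at least one error, not a certainty, and the chance rises steadily with the amount of data read.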

This is not an exhaustive list of the things that can cause data loss or HDD failure; I have not touched upon mechanical failure, electrical failure, theft, or loss due to natural disasters like floods and lightning.

The bottom line in all of this is that failures will occur, and the larger data stores become, the greater the probability of failure. As such, you can never eliminate data loss, but you can lower the frequency of occurrence.

What can I do?

Is there anything that can be done to assist in preventing data loss? Doing nothing brings with it an assurance of failure and a high cost of data recovery, if recovery is possible at all. A better question is: what can I implement so that, when failure does eventually occur, strategies are in place that will reduce the impact on my business? Strategies could include:

  1. RAID-1 or RAID-5 storage;
  2. Local backups with offsite strategies or further backups to the cloud;
  3. A business continuity plan;
  4. Regular tests of your ability to restore.
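
One way to test a restore is to restore a backup into a scratch folder and compare checksums against the live data. The Python sketch below is a minimal illustration of that idea using SHA-256; the function names and folder layout are assumptions, it is not tied to any particular backup product, and files that have legitimately changed since the backup was taken will also be reported.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Compare every file under source_dir with its restored copy.

    Returns a list of (relative_path, problem) tuples; an empty list
    means every source file was restored with identical contents.
    """
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append((str(rel), "missing"))
        elif sha256_of(src) != sha256_of(restored):
            problems.append((str(rel), "checksum mismatch"))
    return problems
```

Calling verify_restore with the live folder and the scratch restore folder on a schedule, and alerting whenever the returned list is not empty, turns the restore test into a routine check.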