At work last year I built out a home-grown RAID solution to support my current project. RAID stands for Redundant Array of Independent Disks. RAID technology has been around since the 1970s and provides a way to use multiple drives together to form a larger “disk” or RAID-volume. The drawback to using multiple drives together is that you need all the individual drives to be operating for the RAID-volume to be available. The probability that at least one drive in the RAID-set will fail increases with the number of drives in the RAID-set.
In other words, the more drives you use to build a RAID-set, the less reliable the RAID-volume becomes. If a given drive has a 1 in 16 chance of failing over a given time frame, then a RAID-set built from 8 of these drives has, assuming the drives fail independently, a 1 - (15/16)^8 or roughly 40% chance that at least one drive, and with it the RAID-volume, will fail in that same period of time. (The quick estimate of 8/16, or 1 in 2, slightly overstates it.) Real-world drives are much less likely to fail, but an increase in the likelihood that your RAID-volume will fail is not good. To address this increased chance of failure, RAID technology also provides a level of data redundancy. Depending on the type of RAID in use, this redundancy typically allows one drive in the RAID-set to fail without losing access to the RAID-volume. So long as you replace the failed drive before any other drive fails, the RAID-volume will continue to operate, possibly with reduced performance.
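The arithmetic for the 8-drive example can be checked with a short Python sketch. It assumes the drives fail independently of one another, which real drives from the same batch may not do, and the function name is just something I made up for the illustration. Adding the per-drive chances gives 8/16, which is a handy upper bound; the exact figure under the independence assumption is 1 - (15/16)^8.

```python
def chance_any_drive_fails(per_drive_chance: float, drive_count: int) -> float:
    """Chance that at least one of drive_count drives fails.

    Assumes every drive fails independently with the same probability.
    """
    chance_all_survive = (1.0 - per_drive_chance) ** drive_count
    return 1.0 - chance_all_survive


if __name__ == "__main__":
    exact = chance_any_drive_fails(1 / 16, 8)
    upper_bound = 8 * (1 / 16)  # simply adding up the per-drive chances
    print(f"exact: {exact:.3f}, simple sum: {upper_bound:.3f}")
    # prints "exact: 0.403, simple sum: 0.500"
```

Either way the point stands: a volume with no redundancy gets noticeably less reliable as you add drives.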
Since we were building the RAID from scratch, I looked around to find some suitable drives and stumbled onto the Western Digital Caviar RE2 500GB (5000YS) drives. The “RE” stands for “RAID-edition”. The marketing copy for this drive explains that this model has an MTBF (Mean Time Between Failures) rating of 1.2 million hours, a 5-year warranty and it provides “optimum performance for enterprise applications.” In other words, a very reliable hard drive that is designed for use in a RAID.
All hard drives occasionally encounter read errors. Desktop hard drives are designed to make significant efforts to correct for read errors so that we don’t lose the data on our desktop hard drive due to a temporary, or “soft”, error. These attempts to correct the read error mean that the desktop drive will spend some extra time trying to correctly read the data and send it on to the computer.
In a RAID-set this extra time spent by the drive trying to correct the read error is a problem. First, the extra effort made by the drive is not necessary because the RAID controller is designed to detect and correct for minor read errors. Second, the extra time before the drive responds to the read request can cause the RAID controller to determine that the drive is dead. When the RAID controller determines that a drive is failing, or dead, the controller will drop the drive from the array, which is not a good thing at all.
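The difference between a drive that keeps retrying and one that gives up quickly can be sketched in a few lines of Python. This is only an illustration of the idea: the function names are hypothetical, and the 7-second recovery limit and 8-second controller timeout are numbers I picked for the example, not values taken from any drive or controller documentation.

```python
def drive_response(recovery_needed_s, recovery_limit_s=None):
    """Return (seconds_before_reply, data_ok) for one troublesome read.

    recovery_limit_s=None models a desktop drive that retries for as long
    as it takes. A number models a drive that stops retrying after that
    many seconds and reports the read error instead.
    """
    if recovery_limit_s is None or recovery_needed_s <= recovery_limit_s:
        return recovery_needed_s, True
    return recovery_limit_s, False


def controller_action(recovery_needed_s, recovery_limit_s, controller_timeout_s):
    """What a simplified RAID controller does with the drive's reply."""
    seconds, data_ok = drive_response(recovery_needed_s, recovery_limit_s)
    if seconds > controller_timeout_s:
        # The drive stayed silent too long, so it is presumed dead.
        return "drop drive from array"
    if not data_ok:
        # The drive reported the error promptly; redundancy covers the gap.
        return "rebuild sector from redundancy"
    return "use data from drive"


if __name__ == "__main__":
    # Assumed numbers: the read needs 30 s of retries, the controller waits 8 s.
    print(controller_action(30, None, 8))  # desktop-style drive
    print(controller_action(30, 7, 8))     # drive with a 7 s recovery limit
```

With these made-up numbers the desktop-style drive gets dropped from the array, while the time-limited drive reports the error in time and the controller repairs the read from the other drives.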
The marketing copy also notes that the RAID-edition drives have “RAID-specific, time-limited error recovery (TLER) – A feature pioneered by WD, significantly reduces drive fallout caused by the extended hard drive error-recovery modes common to desktop drives.” That’s great, a drive designed to work correctly in a RAID! We ordered 10 of them: 8 for the RAID, giving us a total of 3.5TB of RAID-5 storage, and 2 spares.
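The 3.5TB figure comes from the way RAID-5 works: one drive's worth of space goes to parity, so 8 drives give you the capacity of 7. A quick sketch of that arithmetic, with a function name of my own choosing and the drive makers' decimal convention of 1TB = 1000GB:

```python
def raid5_usable_gb(drive_count: int, drive_size_gb: int) -> int:
    """Usable capacity of a RAID-5 set built from equal-sized drives."""
    if drive_count < 3:
        raise ValueError("RAID-5 needs at least 3 drives")
    # One drive's worth of capacity is consumed by parity.
    return (drive_count - 1) * drive_size_gb


if __name__ == "__main__":
    usable_gb = raid5_usable_gb(8, 500)
    print(f"{usable_gb} GB usable, or {usable_gb / 1000} TB")
    # prints "3500 GB usable, or 3.5 TB"
```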
A few months went by, during which we had an ongoing, unexplained problem where the RAID would randomly drop a drive. Unexplained problems never sit well with me, so I asked around the lab. I happen to work in the lab where RAID technology was invented, so I bounced some questions off a few people and didn’t come to any real conclusion about the source of the problem. That is, until a coincidental Google search turned up a notice that Western Digital had admitted that several models of their RAID-edition drives will occasionally drop out of a RAID-set for no particular reason. Hmm, the special RAID-edition drives do not work correctly in a RAID. The irony fairies must be falling over themselves about that one.
Western Digital tracked down the problem to a bug in the firmware of this model of hard drive. Finally, I’d be able to resolve the random dropping drive problem. I downloaded the firmware update from the Western Digital site and noticed a warning that the drive should be backed up before applying the update. Not a problem. I figured that I could update the two spare drives and then, if that went well, swap one of the spare drives into the RAID-set, rebuild the RAID, rinse and repeat until all my drives were updated. It would take a bit of time, but it was better than dealing with restoring our data from the corporate backup servers.
Updating the firmware of a hard drive is a sensitive operation, so you are advised to boot using an MS-DOS disk to minimize the chances of your operating system fouling up the update. After hunting around for an MS-DOS boot disk and getting the firmware update software onto the boot disk, I attempted to apply the update to one of the spare drives. The updater refused to update the drive, indicating that the update was not applicable. A quick check with the updater instructions confirmed that the updater was indeed for the 5000YS model drive. OK. So I attempted to update the second spare drive, again without any success.
Enter Western Digital technical support. First-tier support was unable to help and I was quickly bounced up to second-tier support.
WD Support: “What is the error that the updater gives you?”
Me: “Update not applicable for this drive.”
WD Support: “Yeah, that’s the message that you get when the drive is older than the update. We time limit the updaters so they will not apply to older releases of the drive.”
Me: “So how do I update my RAID-edition drives so that they will function in a RAID?”
WD Support: “You need to return them for a replacement.”
Me: “So, I have a RAID full of data here. Are you really telling me that there is no way for me to update these drives without having a week of downtime while the drives are shipped to Western Digital and the replacements are sent to me?”
WD Support: “No, we have an advance RMA program where if you provide a credit card we will ship a drive to you and you have 30 days to return the old drive.”
Me: “Great, that means that I can integrate the updated drives into my RAID individually so that I don’t have to have any downtime.”
WD Support: “Yes. You can go to the Western Digital support web site and request up to 5 advance RMAs per day. You will need to make a new request for each drive.”
Me: “Huh? Are you really telling me that not only has Western Digital released a “RAID-edition” drive that does not function correctly in a RAID, but in addition, there is no way for my company to do an advance RMA of a RAID quantity of drives in one go?”
WD Support: “No, the only way to do an advance RMA is to enter each of the serial numbers individually with your credit card number. You can do 5 advance RMAs per day.”
Me: “?!?!”