When designing a highly reliable disk array system, it should be remembered that the disk array is only one part of a larger system and that the reliability of the whole system is dominated by its weakest link. The average reliability of many other components of the computer system is far lower than the reliability of the disk arrays discussed in this thesis [PCMagazine 1996, Hillo 1993, Gibson 1991].
Besides hard disks, a disk subsystem has components such as fans, power supplies, power cables, data cables, and a disk controller [Hillo 1993, Gibson 1991]. The importance of the fans, for example, was already noted earlier: the temperature of the disks rises rapidly if the fans do not operate or the ventilation is inadequate. Similarly, the importance of a reliable power supply is obvious. Besides meeting the normal reliability requirements, the power supply should provide a stable voltage for the disks regardless of their activity, as a disk can shut itself down if the voltage is not stable enough [Räsänen 1994, Seagate 1992]. The power and data cables are typically very reliable (at least when compared with other components) [Gibson 1991]. However, as a fault in the cabling can disable several disks at the same time, special care must be taken to arrange the disk array so that the risk of related faults is minimized.
One of the most unreliable parts of the disk subsystem is the disk controller [Hillo 1993, Gibson 1991]. In particular, a large amount of RAM (e.g., used for cache buffers) significantly reduces the reliability of the controller unless non-volatile, ECC-based memory is used [Hillo 1993].
The major difference between faults in the surrounding components of a disk subsystem and faults in the disk units themselves is that the former typically cause data unavailability rather than permanent data loss. A surrounding component can fail and cause temporary data unavailability while the data is not actually lost (i.e., the data can be made available again by repairing the faulty unit). However, some faults in the surrounding components may also cause data loss. For example, data stored temporarily in a disk controller (but not yet written to a disk) is lost during a power failure if the memory has no battery backup.
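The distinction between temporary unavailability and actual loss in the power failure example can be illustrated with a toy model. The following is a minimal sketch only: the `Controller` class, its methods, and the `lost_blocks` helper are hypothetical names introduced for illustration, and the model ignores everything about a real controller except whether its cache memory survives a power failure.

```python
class Controller:
    """Toy disk controller with a write-back cache (hypothetical model)."""

    def __init__(self, battery_backed):
        self.battery_backed = battery_backed
        self.cache = {}  # block number -> data not yet written to disk
        self.disk = {}   # block number -> data stored on disk

    def write(self, block, data):
        # The write is acknowledged as soon as the data is in the cache.
        self.cache[block] = data

    def flush(self):
        # Write all cached blocks to the disk and empty the cache.
        self.disk.update(self.cache)
        self.cache.clear()

    def power_failure(self):
        # Volatile memory loses its contents; battery-backed memory keeps them.
        if not self.battery_backed:
            self.cache.clear()

    def restore_power(self):
        # After the repair, any cached writes that survived are flushed.
        self.flush()


def lost_blocks(battery_backed):
    """Return the set of acknowledged blocks missing after a power failure."""
    controller = Controller(battery_backed)
    controller.write(0, "already flushed")
    controller.flush()
    controller.write(1, "still in cache")
    controller.power_failure()
    controller.restore_power()
    return {0, 1} - set(controller.disk)
```

In this model, block 0 is merely unavailable while the power is off, whereas block 1 is permanently lost unless the cache is battery backed.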
The other parts of the computer system (such as the host CPU, main memory, network interface, other I/O devices, and operating system) also have a significant impact on the total reliability. Typically, these components reduce the reliability of the system further. Only in highly reliable or highly available computer systems is the reliability of these other parts high enough (e.g., due to redundant components) that the impact of the disk subsystem reliability becomes significant.
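The weakest-link effect can be made concrete with a simple series model. The sketch below assumes that components fail independently with constant failure rates and that the failure of any single component brings the whole system down; these are simplifying assumptions made here, not results of this thesis. The MTBF figures in `EXAMPLE_MTBF_HOURS` are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical MTBF values in hours, chosen only for illustration.
EXAMPLE_MTBF_HOURS = {
    "disk array": 1_000_000.0,
    "disk controller": 100_000.0,
    "power supply": 200_000.0,
    "fans": 150_000.0,
}


def series_mtbf(mtbf_hours):
    """MTBF of a series system: failure rates of the components add up."""
    failure_rate = sum(1.0 / mtbf for mtbf in mtbf_hours.values())
    return 1.0 / failure_rate


def survival_probability(mtbf, hours):
    """Probability of no failure within `hours` at a constant failure rate."""
    return math.exp(-hours / mtbf)


if __name__ == "__main__":
    system = series_mtbf(EXAMPLE_MTBF_HOURS)
    weakest = min(EXAMPLE_MTBF_HOURS, key=EXAMPLE_MTBF_HOURS.get)
    print(f"system MTBF: {system:.0f} h (weakest component: {weakest})")
    print(f"one-year survival: {survival_probability(system, 8760.0):.3f}")
```

Under these assumptions the system MTBF is always lower than that of the least reliable component, so improving the disk array alone has little effect until the other components are improved as well.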
Only hardware-related components have been discussed here, but in practical systems a significant portion of faults is caused by software errors, for example in the operating system, the device drivers, or the disk array firmware.
One of the main causes of data loss in a modern computer system is neither physical equipment failure nor software error but human error. A disk array or any other reliable hardware configuration does not prevent a user from accidentally deleting the wrong files from the system.
Some human errors can be prevented by advanced hardware design. For example, if the disk array supports the hot swap concept, those disks that are currently in use should be protected against being pulled out accidentally. A typical example that can cause data loss in such a system is a serviceman accidentally pulling the wrong disk out of a crippled array. Pulling out the wrong disk destroys the consistency of the array, since no redundancy was left after the disk failure. This can be prevented by software-controlled physical locks that allow the serviceman to pull out only the failed disk.
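The locking policy described above can be sketched as follows. This is an illustration only: `HotSwapBays`, `DiskState`, and the method names are hypothetical, and a real array would drive physical locks through its firmware rather than through a Python object.

```python
from enum import Enum


class DiskState(Enum):
    IN_USE = "in use"
    FAILED = "failed"


class HotSwapBays:
    """Toy model of software-controlled locks on hot swap disk bays."""

    def __init__(self, n_disks):
        # Every disk starts in use, so every bay starts locked.
        self.states = [DiskState.IN_USE] * n_disks

    def mark_failed(self, slot):
        # Called when the array detects that the disk in `slot` has failed.
        self.states[slot] = DiskState.FAILED

    def may_unlock(self, slot):
        # Only a failed disk may be released; disks in use stay locked.
        return self.states[slot] is DiskState.FAILED

    def pull(self, slot):
        # Refuse to release a disk that the crippled array still depends on.
        if not self.may_unlock(slot):
            raise PermissionError(f"slot {slot} is in use and stays locked")
        return slot
```

With four disks and a failure in slot 2, `pull(2)` succeeds, while `pull(1)` raises `PermissionError` instead of removing the last remaining redundancy.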
Importance of backups
Improving the reliability of a computer system does not make backups obsolete. On the contrary, backups are still needed as a way to protect against human errors and against major accidents that could destroy an entire computer system. A good example of such an approach is a distributed computing and backup system where distant computers are mirrored to ensure survival even after a major catastrophe [Varhol 1991].