In this thesis, the performance and reliability of disk array subsystems have been studied. The main objective has been to emphasize the importance of latent fault detection and its effect on the reliability and data availability of disk arrays. Significant improvements in both reliability and data availability can be achieved when latent faults are detected using the algorithms proposed in this thesis, in comparison with conventional disk arrays where latent faults are discovered only when a user request happens to access the faulty areas.
This thesis categorizes faults in a disk by two properties: the fault severity (a sector fault or an entire disk unit fault) and the detection time (immediately detected or latent). A sector fault affects a limited area of the disk, causing one or a few sectors to have problems in maintaining data. In contrast, a disk unit fault makes a significant part of the disk, or the entire disk, inaccessible. Detection of a disk unit fault is by its nature fast, while a sector fault may either be detected immediately or remain latent. In a disk array, the disks are typically polled at intervals of a few seconds, so a disk unit fault is detected at the latest by the polling process, i.e., within seconds. Hence, a disk unit fault seldom remains undetected for long. A sector fault, on the other hand, is detected only when the faulty area is accessed. Unfortunately, this can take several weeks if the disk access pattern is unevenly distributed and the fault occurs in a rarely accessed area.
Modern disk arrays are designed to handle and recover from disk unit and sector faults on the fly. While the array is serving normal user disk requests, the data of the faulty disk can be reconstructed using the redundant information on the other disks and stored on a spare disk. The spare can be a hot spare already present in the array, or the faulty disk can be hot swapped for a new one. Similarly, a sector fault can be recovered from using appropriate recovery methods within a disk. However, current commercially available disk arrays are not yet equipped with a mechanism that would actively detect latent faults.
Sector faults have typically been ignored in the technical literature. They are considered to be of lesser importance than disk unit faults, as only one sector out of millions loses its data. However, the importance of even a single sector can be seen, for example, in a large database system where every sector counts. In such a database, even one lost sector may render the entire data set inconsistent.
Modern disks, especially those that comply with the SCSI-2 standard, are capable of handling sector repairs when a sector fault is detected. A disk typically has a logical representation of the disk space (a sequential list of logical sectors) that is separated from its physical structure (heads, tracks, and physical sectors). In the case of a sector fault, the faulty physical sector can be replaced with a spare sector without changing the logical representation of the disk. If the disk detects a sector fault during a write operation, the sector remapping can be done automatically. With a failed read operation, however, the disk is unable to recover the data by itself. In that case, the array configuration and its redundant information are needed, as the missing data is recovered using the data on the other disks of the array.
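The asymmetry between the write and read cases can be sketched as follows. This is an illustrative model only, not an actual drive or array implementation; the function name and the XOR-based reconstruction (as in a RAID-5 parity group) are assumptions for the sketch.

```python
def handle_sector_fault(op, data=None, peers=None):
    """Illustrative sketch: what data ends up on the remapped spare sector.

    op    -- "write" or "read"
    data  -- the buffer being written (write case)
    peers -- contents of the surviving sectors of the parity group,
             including the parity sector (read case); in a RAID-5
             layout the XOR of these reconstructs the missing sector
    """
    if op == "write":
        # The drive still holds the write buffer, so it can remap the
        # bad physical sector to a spare and retry the write by itself.
        return data
    # Read case: the sector contents are gone; they can only be rebuilt
    # from the redundant information on the other disks of the array.
    if peers is None:
        return None  # no redundancy available -> the data is lost
    rebuilt = 0
    for sector in peers:
        rebuilt ^= sector  # RAID-5 parity reconstruction
    return rebuilt
```

On a write fault the drive is self-sufficient; on a read fault a single disk without redundancy returns nothing, which is exactly why the array-level redundancy is needed.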
Latent faults in a disk array can be detected using scanning algorithms such as those proposed in this thesis. The basic scanning algorithm is an adaptation of the memory scrubbing algorithm that is commonly used for detecting faults in primary memory. However, scanning algorithms for latent fault detection in secondary memory are presented and analyzed for the first time in this thesis and in publications by the author.
The proposed disk scanning algorithms utilize the idle time of the system to scan the disk surface for latent faults. A scanning read request is issued to a disk only when the disk is detected to be idle. Hence, the additional delay experienced by a normal user disk request is not significant even when the disk is heavily loaded: any user request may have to wait for at most one scanning request to complete. As the size of a scanning request is typically about the same as that of a normal user request, the additional delay is nominal. The scanning algorithm may, however, increase seek delays. If longer scanning requests are used, a scanning request can be aborted when a user disk request arrives.
The two benefits of the disk scanning algorithm are faster detection of latent faults and improved data availability. As user requests typically access the disk space unevenly, the disk requests caused by normal user activity leave a significant part of the disk subsystem unaccessed for a long time. A problem arises from the fundamental error recovery principle of the disk array: a typical disk array is capable of recovering from only one fault in a group of disks. If a latent fault (even just a faulty sector) and a disk unit fault are present at the same time, the disk array loses its consistency, as there are two simultaneous faults and the repair mechanism is unable to restore all data.
The main assumption of the proposed scanning algorithms is that the extra disk accesses cause no additional wear on the disk. This is generally true when the disk is spinning continuously without spindowns due to inactivity. Typically, the scanning requests represent only a minor portion of the disk load. Hence, the additional activity will not cause extensive wear in the form of seeks around the disk. As the scanning process is only reading the disk (not writing) there is no danger of losing data due to a power failure.
This thesis increases understanding of the reliability of disk arrays. In particular, the analysis shows the importance of latent fault detection, and the proposed scanning algorithms yield significant improvements in reliability and data availability. The impact of the scanning algorithms on performance is shown to be usually marginal, since scanning is typically done while the system is otherwise idle.
The analysis of disk array reliability with dual fault types is also new in this thesis. With this analysis, analytical representations of disk array reliability and data availability have been presented. Simple formulae have been derived for array reliability (mean time to data loss, MTTDL) and data availability (mission success probability).
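The thesis's own formulae (EMM1/EMM2) are not reproduced here; as a point of reference, the classic single-group MTTDL approximation from the RAID literature reads MTTDL ≈ MTTF² / (N·(N−1)·MTTR) for a group of N disks tolerating one fault, assuming MTTR ≪ MTTF. A sketch:

```python
def mttdl_single_fault_group(n_disks, mttf_hours, mttr_hours):
    """Classic approximation of mean time to data loss for one parity
    group of n_disks, tolerating a single fault (assumes the repair
    time is much shorter than the disk mean time to failure)."""
    return mttf_hours ** 2 / (n_disks * (n_disks - 1) * mttr_hours)

# Example: a 5-disk RAID-5 group, 500 000 h MTTF per disk, 8 h repair
# gives an MTTDL of about 1.56e9 hours for the group.
```

The quadratic dependence on MTTF and the inverse dependence on the repair time are what make both fast fault detection and fast repair so important.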
The analysis is done for a generic array configuration. Hence, the resulting formulae are in a general form and can be used with an arbitrary number of disks in the array. The equations are also independent of the disk array architecture and repair methods (apart from the repair time and the number of disks involved in the repair).
The analysis is divided into two categories based on the repair process: hot swap or hot spare. The RAID-1 and RAID-5 arrays have been used as examples due to their popularity among disk arrays. The hot swap and hot spare methods are analyzed separately: the former assumes that the spare units are fault-free but the repair process needs human intervention (so the repair may start a long time after a fault is detected), while the latter can start the repair process immediately after fault detection but runs the risk of a faulty spare unit. Due to the complexity of the equations, the hot spare method is analyzed only using an approximation, while the hot swap method is also analyzed analytically.
In the reliability analysis of the hot spare system, it was noticed that the possibility of a spare disk fault has no significant effect on reliability (neither decreasing nor increasing it) compared with the hot swap system, provided that the active disk unit repair time is the same. This is in line with the results in the technical literature. The hot spare provides better reliability simply because the repair process can be started immediately after fault detection, unlike in the hot swap case where human intervention is needed.
The results also show that it is possible to use the first analytical model (EMM1) instead of the more complex model (EMM2) in analyzing hot spare disk arrays, as both provide very similar results when the same repair times and failure rates are used. This is due to the fact that the spare disk reliability has no significant effect on the disk array reliability.
Interesting results were found when interrelated faults were analyzed. When the second fault is assumed to occur with a higher probability than the first one (e.g., if the disks are from the same manufacturing batch, or they are located in the same cabinet where the temperature is increased due to a faulty fan), the reliability of the disk array drops dramatically. In the limit, a disk array originally built with D+1 redundancy behaves like a system of D+1 parallel units with no redundancy (i.e., a RAID-5 array would actually be only as reliable as a RAID-0 array). In practice, the situation may be even worse, because the probability of the first fault is also higher if the disks come from the same (inferior) manufacturing batch or are otherwise prone to faults.
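The correlated-fault effect can be illustrated with the standard single-repair Markov model from the RAID literature (not the thesis's EMM1/EMM2 equations); the correlation factor alpha, scaling the failure rate of the surviving disks after the first fault, is an assumption made for this sketch.

```python
def mttdl_correlated(n, mttf, mttr, alpha=1.0):
    """Mean time to data loss of an n-disk, single-fault-tolerant group.

    mttf, mttr -- per-disk mean time to failure / repair (same units)
    alpha      -- correlation factor: the remaining disks fail alpha
                  times faster once the first fault has occurred
    """
    lam = 1.0 / mttf            # failure rate of a healthy disk
    mu = 1.0 / mttr             # repair rate
    lam2 = alpha * lam          # elevated failure rate after one fault
    # Exact mean absorption time of the two-state Markov chain
    # (all-good -> one-failed-and-repairing -> data loss):
    return (mu + n * lam + (n - 1) * lam2) / (n * (n - 1) * lam2 * lam)
```

As alpha grows, the MTTDL tends toward mttf/n, the reliability of n disks in series with no redundancy at all, which matches the observation that a D+1 array degenerates toward RAID-0 behavior under strongly correlated faults.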
RAID-1 or RAID-5?
When the RAID-1 and RAID-5 disk arrays are compared, it is noticed that RAID-1 provides better reliability and better performability than RAID-5 in all cases where the number of data disks is the same. An additional benefit of the RAID-1 array over the RAID-5 array is the speed of the repair process. In a RAID-1 array, only two disks are involved in the repair process while, in a RAID-5 array, all disks are involved. This means that, in large disk arrays, the RAID-1 architecture can repair a disk fault significantly faster than the RAID-5 architecture. The main disadvantages of the RAID-1 architecture are the high number of disks, the correspondingly larger expected number of disk faults, and the higher initial cost. As RAID-1 uses D+D redundancy instead of the D+1 redundancy of RAID-5, the number of disks is almost doubled. This also almost doubles the expected number of disk faults in the RAID-1 array, yet its reliability is still higher. As the prices of hard disks are falling, the initial cost of the RAID-1 array should not be a significant problem for those who want a disk array with both good performance and reliability.
The main limitations of this analysis are that the array is assumed to tolerate only one fault in the disk group at any time and that only one array group is studied. The former limitation is a typical restriction of a conventional disk array, as systems that tolerate multiple faults in the same disk group are generally considered too expensive in terms of money and performance. The latter limitation restricts the usage of these results to arrays with a single group of disks and to arrays where the number of spare disks is sufficient to allow multiple repair processes to be started simultaneously.
The results of this thesis can be used to build more reliable disk array systems that fulfill given performability requirements even during the recovery phase. This can be done by minimizing the reliability bottlenecks caused by latent faults and by implementing a delayed repair method that reduces the performance degradation during the repair phase. This thesis should also increase awareness of the effect of latent faults on the reliability of disk arrays and hopefully lead to better and more reliable disk arrays in the future.
With the new equations, it is possible to optimize disk arrays with respect to cost, performance, and reliability. In particular, it is possible to analyze the worst-case scenarios where disk faults are interrelated or the disks come from the same manufacturing batch. In such cases, it is very likely that a second disk unit fault occurs soon after the first one.
This thesis also has a significant impact on disk array development. The proposed scanning algorithms can already be implemented today; indeed, some of the basic scanning ideas are already in use [Scritsmier 1996]. The ideas and the results of the reliability analysis can likewise be utilized when developing and optimizing new disk arrays.
The proposed scanning algorithms can also be used with non-redundant arrays and single disks. The scanning algorithms can detect early signs of media deterioration, indicated by an increased number of retries. This provides a mechanism for replacing deteriorated sectors before the data is lost; a quite similar implementation is already in use in Microsoft's Windows 95. Hence, reliability can be improved even in a non-redundant disk subsystem.
The analysis of this thesis can be expanded in various directions. For example, hard disk diagnostics, next-generation disk arrays, more sophisticated repair methods, and higher-level fault-resilient disk arrays can benefit from the ideas introduced here. Also, the cost-performability of disk arrays should be studied.
One especially interesting area where the proposed scanning algorithms could be utilized is hard disks and their internal diagnostics. As the disk itself knows its own activity best, it is natural that the scanning process should be performed entirely inside the disk. There would be several benefits in doing this. First, the array controller would be freed for other duties. Second, the disk itself has a better indication of media deterioration, as even the smallest problems are recognized. The main impact on disk design would be in the standardization of the disk interfaces. The disks would then be able to predict data deterioration early enough that data loss could be prevented even with a single disk.
New generations of disk arrays have been introduced to improve array performance and reliability. Their performance effects and repair processes need more investigation. The analysis that is done in this thesis should be expanded into these new array architectures as well as systems with multiple arrays.
Computer systems are increasingly used in continuously operating environments where no interruptions or downtime are tolerated, and therefore faulty disk units should be repaired online. At the same time, the response time requirements tolerate no performance degradation even during the recovery or degraded states. Hence, it should be possible to adjust the recovery process according to the performance (and reliability) requirements. For example, the recovery process could adapt its activity to the user activity or to the degree of completeness of the disk recovery. The repair process of a disk unit fault in a RAID-5 array might throttle its operation at the beginning, as user requests are already suffering from accessing a crippled array. As the repair process approaches completion, it can increase its activity, since more and more user requests fall in the already repaired area where performance is the same as in a fault-free array.
Some disk arrays can tolerate more than one simultaneous fault in the same disk group. In such arrays, latent sector faults are not as catastrophic as in arrays that tolerate only one fault at a time. However, latent faults dramatically decrease the reliability of those arrays as well. Hence, the scanning algorithm is vital even in those arrays, as they typically have extremely high reliability expectations. Thus, the effect of the proposed scanning algorithms in such environments should be analyzed.
In the future, the importance of high-performance data storage subsystems will increase with new applications that process large amounts of data. As has been shown, the performance gap between secondary memory and processing capacity is ever growing, and therefore the bottleneck of the system lies in the I/O subsystem. Hence, development efforts should be concentrated more on the data storage side to balance the performance of all components.
At the same time, the reliability and cost of the system should not be forgotten. The total reliability should be at least as good as with earlier systems (despite the larger number of components), but preferably much higher. The total cost of the system can also be taken into account if cost-performability is used instead of performability in the analysis of disk arrays. In principle, all costs should be minimized and all profits maximized. However, this is not so simple when performance and reliability must also be considered. Thus, special interest should be focused on the definition of cost-performability equations to obtain generic metrics similar to those of performability.
One of the main factors in cost-performability is the cost of lost data. Thus, the reliability of a disk array should be very high. This can be achieved mainly by introducing redundancy at all levels of the computer system and by using online self-diagnostics for early fault detection. Here, the proposed scanning algorithms are good examples of the future direction.