The performance and reliability of computer systems have improved rapidly in recent years. However, these improvements have not been uniform across all components. Some components, such as the CPU (measured by clock speed or processing capacity), primary memory (measured by storage capacity and access time), and secondary memory (measured by the storage capacity and reliability of a hard disk), have improved quickly, by 40-100% per year, while other components, such as I/O systems (measured by the number of operations per second) or overall reliability (measured by the mean time between software or hardware faults), have improved at a much lower rate (e.g., the average disk seek time has improved by only about 7% per year over the same period) [Lee 1991, Chen 1990, Chen 1990a].
The number of instructions executed per second has increased radically due to the rapid development of microprocessor technology. This is attributed to new manufacturing technologies, materials, and architectures. For example, by lowering the operating voltage from 5 V to 3.3 V, the processor clock rate can be doubled without increasing CPU power consumption. In addition, several parallel architectures have been introduced to distribute processing among distinct processing units. Hence, the processing capacity (measured by the number of operations per second) has increased significantly [Intel 1996a].
Similarly, the capacity of primary memory has increased in recent years. However, this growth has mainly been driven by the average size of application programs, which has increased by 25-50% per year.
Demand for a faster I/O subsystem has emerged to match the improvements in the other parts of the computer system. A faster disk subsystem is needed not only to keep up with the rapidly increasing performance of the other components but also to accommodate larger programs and data sets.
The performance discrepancy between the various components of a computer system has widened continuously in recent years. The performance of some components is illustrated in Figure 1 relative to their 1985 values (note the logarithmic scale) [IBM 1996a, IBM 1996b, IBM 1996c, Intel 1996, Intel 1996a, Seagate 1996c, Milligan 1994, Gray 1993]. These performance improvements have clearly been very uneven: the greatest improvements are in disk capacity and CPU speed, while the components of the disk subsystem have improved significantly less. In particular, properties limited by the physical constraints of the disk have improved only slightly. The I/O subsystem has unquestionably become the bottleneck of the computer system. Thus, the benefit of faster CPUs is easily lost to the slow I/O subsystem.
Figure 1. Trend of relative performance of various computer components
Several disk array concepts have been introduced to narrow the widening performance gap between the CPU and the disk I/O subsystem. The first ideas for overcoming the performance limitations of a physical disk were based on mirroring and interleaving techniques [Kim 1987, Livny 1987, Kim 1986, Kim 1985]. Currently, one of the most popular concepts is the redundant array of inexpensive disks (RAID) [DPT 1993, RAB 1993, Gibson 1991, Lee 1991, Chen 1990, Chen 1990a, Lee 1990, Chen 1989, Katz 1989, Chen 1988, Patterson 1988, Patterson 1987, Salem 1986]. In the RAID concept, several disks are used in parallel to improve throughput, transaction rate, and/or reliability.
Better throughput in a disk array is achieved by utilizing several disks for one large user I/O request. When a large contiguous data block is accessed, the bottleneck of a conventional disk subsystem is the sustained transfer rate from the disk surface [Reddy 1990a, Katz 1989, Ousterhout 1988]. By accessing several disks in parallel, higher transfer throughput can be achieved, as data can be fetched from or stored to several disks simultaneously. The disk bus transfer capacity is typically much higher (on the order of five to ten times) than that of a single disk. In large arrays, several disk buses are used to provide high bandwidth [Seagate 1996a, Seagate 1996b, Seagate 1996c, Milligan 1994, Gray 1993, Hillo 1993]. In addition, modern disks contain large internal buffers to store data temporarily while the disk bus is momentarily in use by other disks. Hence, the operations of different disks can be overlapped so that while one disk is transferring data over the disk bus, the others can perform seeks, gather data into their buffers, or write data from their buffers onto the disk.
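The effect of striping a large request over several disks can be sketched as follows. The transfer rates used here are illustrative assumptions (a single disk streaming at 5 MB/s and a shared bus at 40 MB/s, roughly the five-to-tenfold ratio mentioned above), not figures from any particular product.

```python
def striped_read_time(block_mb, n_disks, disk_rate_mbs=5.0, bus_rate_mbs=40.0):
    """Time (seconds) to read a large contiguous block striped over n_disks.

    Each disk streams its share of the block at disk_rate_mbs; the shared
    disk bus caps the aggregate rate at bus_rate_mbs.  Beyond the point
    where n_disks * disk_rate_mbs exceeds the bus rate, adding disks no
    longer helps a single large transfer.
    """
    aggregate_rate = min(n_disks * disk_rate_mbs, bus_rate_mbs)
    return block_mb / aggregate_rate

single = striped_read_time(100, 1)   # one disk: 100 MB / 5 MB/s = 20 s
array8 = striped_read_time(100, 8)   # eight disks saturate the bus: 2.5 s
```

With these assumed rates, eight disks reach the bus limit; a sixteen-disk stripe over the same single bus would be no faster, which is why large arrays employ several disk buses.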
When many disks are used in parallel, the number of faults increases significantly and reliability becomes an important factor. This thesis focuses on the question of how high reliability can be provided when high performance is also required. In particular, the effect of latent faults is studied.
Table 1. Various RAID configurations
Disk array types
The most common RAID array configurations are listed in Table 1. The second column briefly illustrates the structure of each array configuration as described in the technical literature [DPT 1993, Hillo 1993, RAB 1993, Gibson 1991, Lee 1991, Chen 1990, Chen 1990a, Lee 1990, Chen 1989, Katz 1989, Reddy 1989, Chen 1988, Patterson 1988, Patterson 1987, Salem 1986]. The number of disks needed to implement each array configuration is listed in the third and fourth columns. Here, D indicates the number of data disks (used for storing user data) and H is the number of parity disks (when the number of parity disks depends on the number of data disks, as in the case of Hamming coding, where 2^H ≥ D + H + 1) [Gibson 1991, Gibson 1989a, MacWilliams 1977, Peterson 1972]. The last column of the table illustrates the storage efficiency of the disk array architectures using an example of ten data disks. Besides the RAID types listed here, there are also other array types, such as two-dimensional parity and arrays with non-binary symbol codes [Stevens 1995, Schwarz 1994, Katz 1993, Mourad 1993a, RAB 1993, Lee 1992, Gibson 1991, Stonebraker 1989].
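The storage-efficiency column of Table 1 can be reproduced with a short sketch. The mapping of RAID levels to parity-disk counts below is the conventional one (mirroring doubles the disk count; RAID-5 adds a single rotated parity disk; Hamming coding uses the relation 2^H ≥ D + H + 1); it is an illustration, not the thesis's own table.

```python
def hamming_parity_disks(d):
    """Smallest H satisfying 2**H >= D + H + 1 (Hamming-coded array)."""
    h = 1
    while 2 ** h < d + h + 1:
        h += 1
    return h

def storage_efficiency(d, parity):
    """Fraction of the array's disks that hold user data."""
    return d / (d + parity)

d = 10                                   # the example of ten data disks
configs = {
    "RAID-0": 0,                         # striping only, no redundancy
    "RAID-1": d,                         # full mirror: one extra disk per data disk
    "RAID-2": hamming_parity_disks(d),   # Hamming-coded array: H = 4 for D = 10
    "RAID-5": 1,                         # single (rotated) parity disk
}
efficiency = {name: storage_efficiency(d, p) for name, p in configs.items()}
```

For ten data disks this yields efficiencies of 1.0, 0.5, about 0.71, and about 0.91, respectively, showing how sharply the redundancy cost varies between configurations.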
The performance of a disk array can be roughly expressed with two interrelated parameters: the maximum throughput of the system and the maximum number of simultaneous disk requests [Hou 1994, DPT 1993a, Geist 1993, Hillo 1993, Mourad 1993, Reddy 1991a, Chervenak 1990, Seltzer 1990, Seltzer 1990a, Olson 1989, Gibson 1989, Reddy 1989, Stonebraker 1988, Cox 1986, Kim 1986, Kim 1985]. These parameters are illustrated in Table 2. The different array configurations provide a variety of maximum throughput and maximum number of simultaneous read/write operations.
Table 2. Performance of RAID configurations
Importance of reliability
Originally, fault tolerant disk subsystems were used only in expensive mainframe architectures, such as online transaction processing (OLTP) databases in banking applications, while smaller computers, such as PCs, had no fault tolerance features. Nowadays, even desktop PCs may have several disks containing several gigabytes of data. As the number of disks in modern computer systems increases and users store more and more valuable data on them, the requirement for reliable data storage becomes pressing even in smaller computers. The RAID concept answers this need as well, allowing a wide range of architectures, with respect to cost, performance, and reliability, to be used.
Traditionally, disks in a disk array are considered to have only one type of fault, one that disables the entire disk [Hillo 1993, Gibson 1991, Cioffi 1990, Muntz 1990, Sierra 1990, Schulze 1988, Williams 1988]. In practice, however, there is a variety of possible faults in a disk unit (such as head, actuator, surface, motor, connector, controller, and bearing faults) [Haseltine 1996, ANSI 1994, ANSI 1986]. The two most common faults are the above-mentioned disk unit fault and the sector fault, which affects only a small part of the disk (a single sector). Actual reliability statistics are not revealed by disk or computer manufacturers [Hillo 1994, Gibson 1991].
Faults in a disk array are detected sooner or later, either by normal user disk requests or by some diagnostic activity [IBM 1996d, Ottem 1996, Scritsmier 1996, Kari 1994, Kari 1993, Kari 1993a, Kari 1993b]. In the case of a disk unit fault, fault detection is fast [ANSI 1997, ANSI 1994, ANSI 1986], but sector faults may remain undetected for a long time [Kari 1994, Kari 1993, Kari 1993a, Kari 1993b]. Latent faults have a significant effect on the reliability of the disk array, as a conventional disk array can tolerate only one simultaneous fault in a disk group. According to practical experience in the industry, a significant fraction of the cases in which a disk array has lost its consistency were caused by latent sector faults [Scritsmier 1996].
A disk array is conventionally repaired using a spare disk onto which the data of the faulty disk is reconstructed using the redundant information stored in the remaining disks. Depending on the array type, one or more disks in the array are involved in the repair process.
Sector faults are usually ignored by conventional reliability models and repair processes. This is partly because modern disks can automatically repair faulty sectors themselves during a write operation [ANSI 1997, Räsänen 1996, Scritsmier 1996, ANSI 1994, Räsänen 1994, Platt 1992, ANSI 1986]. Practical disk array implementations, in turn, can automatically recover lost data during a read operation using the redundant information stored in the other disks. However, because a disk array is generally accessed unevenly [Hou 1994, Ylinen 1994, Seltzer 1992, Miller 1991, Reddy 1991a, Chen 1990, Seltzer 1990a, Olson 1989, Reddy 1990a, Bhide 1988, Ousterhout 1988], it is not possible to detect all sector faults using only normal user disk requests.
Table 3. Reliability of sample RAID configurations (MTTF=100 000h, MTTR=24h)
Reliability of disk arrays
As most disk array configurations are built to tolerate only one fault before repair (in order to simplify data updates in the disk array), the reliability of the disk array is dominated by the repair time of the array [Hou 1994, Schwarz 1994, Burkhard 1993, Chandy 1993, Geist 1993, Gray 1993, Hillo 1993, Gibson 1991, Reddy 1991, Sierra 1990, Gibson 1989, Gibson 1989a, Chen 1988, Garcia-Molina 1988, Schulze 1988, Stonebraker 1988]. This is the case when sector faults are ignored [Kari 1994, Kari 1993, Kari 1993a, Kari 1993b]. If sector faults are included, latent sector faults are as significant as the repair time. When latent sector faults remain undetected for a long time, the reliability is dominated by the latent sector faults.
Reliability (expressed as Mean Time To Data Loss, MTTDL) is illustrated as a function of the disk array configuration in Table 3 [Hillo 1993, Gibson 1991, Gibson 1989] using the same disk repair and failure times. This example assumes that the number of data disks is 10 and that the Mean Time To Failure (MTTF) of a disk and the Mean Time To Repair (MTTR) of a disk array are 100 000 hours and 24 hours, respectively. A second method for expressing the reliability of a disk array is the mission success probability, which gives the probability that an array survives a given time without losing its consistency [Gibson 1991].
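The MTTDL figures of the kind shown in Table 3 come from the standard approximation for a single-fault-tolerant array: data is lost when a second disk fails within the repair window of the first. The sketch below applies that textbook formula to the example parameters above; note that it deliberately ignores latent sector faults, which is exactly the simplification examined in this thesis.

```python
def mttdl_single_fault(n_disks, mttf_h, mttr_h):
    """Standard MTTDL approximation for an array that tolerates one fault:
    MTTDL ~= MTTF^2 / (N * (N - 1) * MTTR), with N the total number of
    disks in the group.  Sector faults and correlated failures are ignored.
    """
    return mttf_h ** 2 / (n_disks * (n_disks - 1) * mttr_h)

# Table 3's example parameters: 10 data disks plus one parity disk,
# MTTF = 100 000 h per disk, MTTR = 24 h for the array.
raid5_mttdl = mttdl_single_fault(11, 100_000, 24)   # about 3.8 million hours
```

The resulting value, roughly 3.8 million hours, illustrates why such models look reassuring: any effect that lengthens the exposure window, such as an undetected sector fault, cuts directly into this figure.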
There is always a trade-off between cost, performance, and reliability, as shown in Tables 1, 2, and 3. Typically, the higher the required performance, the less reliable the service a disk array can provide, and vice versa [Hou 1994, Hillo 1993, Gibson 1991, Muntz 1990]. This is especially true for the write operation. Thus, no single array architecture is the best for all environments and user needs. For example, the following issues must be addressed when selecting a disk array architecture:
1. Cost issues:
• cost of redundancy
• cost of increased overall complexity
• cost of system failure or data loss
• cost of repair process
2. Performance issues:
• high I/O rate or high transfer capacity
• I/O patterns
• size of I/O requests
• ratio of read to write operations
• crippled array performance
• performance degradation due to repair process
• performance degradation due to preventive maintenance or fault detection
3. Reliability issues:
• mean time to data loss
• data availability
• recovery speed
• on-line or off-line recovery
• detection of latent faults
• effect of interrelated faults
Two terms have been introduced to express a combination of the above-mentioned parameters: performability and cost-performability [Trivedi 1994, Catania 1993, Pattipati 1993, Smith 1988, Furchgott 1984, Meyer 1980, Beaudry 1978].
Performability defines the combined performance and reliability of the system. The main problem in defining the performability formula is the different scales of the performance and reliability factors. The scale of the reliability (as expressed with MTTDL) can vary by several orders of magnitude while the scale of the performance (as expressed with the number of disk operations per second) is almost linearly related to the number of disks in the array.
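A common way to combine the two factors is to weight a performance reward rate by the probability of each operational state of the array. The sketch below illustrates this reward-based view of performability; the state probabilities and per-state I/O rates are purely illustrative assumptions, not results from this thesis.

```python
def performability(state_probs, state_rewards):
    """Reward-based performability: the expected performance (here, I/O
    operations per second) weighted by the probability of each array state.
    """
    return sum(p * r for p, r in zip(state_probs, state_rewards))

# Hypothetical states: fault-free / degraded (one disk lost) / repairing.
probs   = [0.97, 0.02, 0.01]          # assumed state probabilities
rewards = [1000.0, 600.0, 400.0]      # assumed I/O rate in each state
expected_iops = performability(probs, rewards)
```

This formulation makes the scale problem mentioned above concrete: the reward side scales roughly linearly with the number of disks, while the state probabilities, which depend on MTTDL, can change by orders of magnitude between configurations.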
Cost-performability expresses the trade-off between performability and the cost of operating such a disk array system. The cost side of the equation typically includes installation, running, and failure costs. An additional problem with cost-performability is the ambiguity of the cost definition. In contrast to reliability and performance, which can be expressed in exact, commonly agreed terms, cost is subjective: for example, each user may have a different opinion about the cost of data loss.
Practical disk array implementations and theoretical models
Practical implementations of disk array architectures can behave quite differently from their theoretical models. This is mainly because several simplifications are made when modeling disk array architectures. For example, simple fault models, uniform access patterns, and independent failure rates are commonly assumed in theoretical models, but these assumptions seldom hold in practical implementations.
Simple fault models, such as the assumption that there is only one type of fault in a disk, are often used for modeling a disk array [Chandy 1993, Geist 1993, Gibson 1991, Gibson 1989a]. Such models also assume fast fault detection. This makes an analytical approach to the reliability problem possible, as the number of states in the Markov model is reduced, especially in non-steady-state analysis. These simple models, however, ignore sector faults, which occur commonly but are more difficult to detect. As sector faults can remain latent for a long time, the recovery process starts much later, which increases the probability of another fault occurring in the same array before the first one is repaired. Hence, the reliability of a practical disk array system may be much lower than theoretically estimated.
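The effect of slow detection can be made concrete with a toy Markov model. The sketch below is not one of the thesis's reliability models; it is a minimal four-state chain with assumed, illustrative rates, solved by forward Euler integration, showing only that the probability of data loss grows when latent sector faults are detected slowly.

```python
def p_data_loss(t_hours, n=11, lam=1/100_000, lam_s=1/50_000,
                mu=1/24, delta=1/24, dt=1.0):
    """Probability of data loss by t_hours in a toy 4-state Markov model.

    States: 0 fault-free, 1 detected fault under repair,
            2 latent sector fault, 3 data loss (absorbing).
    lam   = disk unit fault rate per disk (assumed)
    lam_s = latent sector fault rate per disk (assumed)
    mu    = repair rate, delta = latent fault detection rate.
    """
    p = [1.0, 0.0, 0.0, 0.0]
    second = (n - 1) * (lam + lam_s)        # rate of a second fault in the group
    for _ in range(int(t_hours / dt)):
        d0 = -(n * lam + n * lam_s) * p[0] + mu * p[1]
        d1 = n * lam * p[0] + delta * p[2] - (mu + second) * p[1]
        d2 = n * lam_s * p[0] - (delta + second) * p[2]
        d3 = second * (p[1] + p[2])
        p = [p[0] + dt * d0, p[1] + dt * d1, p[2] + dt * d2, p[3] + dt * d3]
    return p[3]

fast = p_data_loss(8760, delta=1/24)    # latent faults found within ~a day
slow = p_data_loss(8760, delta=1/8760)  # latent faults linger ~a year
```

With identical fault and repair rates, slow detection yields a markedly higher one-year data-loss probability, which is the qualitative point of the thesis's much more detailed models.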
In a practical disk array system, disk requests are spread quite unevenly over the disk space. How the disk accesses are spread depends on, among other things, the type of the application, the operating system, and the number of simultaneous users. The main effect of this is that disk accesses tend to concentrate on certain areas (so-called hot spots) of the disk space while other areas are rarely accessed (so-called cold regions) [Hou 1994, Ylinen 1994, Seltzer 1992, Miller 1991, Reddy 1991a, Chen 1990, Reddy 1990, Seltzer 1990a, Olson 1989, Bhide 1988, Ousterhout 1988]. Therefore, fault detection by normal disk accesses alone is limited. In the case of a fault in a seldom accessed area, the reliability can decrease significantly due to this latent fault. The probability of latent faults (especially sector faults) increases as larger and larger disk arrays are used. The uneven access patterns also have some effect on the performance of the disk array.
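A small simulation illustrates why user requests alone cannot be relied on for fault detection. The Zipf-like popularity skew below is an assumed workload shape, not a measured trace; the point is only that under a hot-spot pattern, a large fraction of blocks is never touched, so faults there would remain latent.

```python
import itertools
import random

def untouched_fraction(n_blocks=10_000, n_requests=100_000, skew=1.2, seed=1):
    """Simulate hot-spot accesses and return the fraction of blocks that
    receive no request at all.  Block i gets weight 1/(i+1)**skew, a
    Zipf-like popularity distribution (illustrative assumption).
    """
    random.seed(seed)
    weights = [1.0 / (i + 1) ** skew for i in range(n_blocks)]
    cum = list(itertools.accumulate(weights))          # precompute for speed
    touched = set(random.choices(range(n_blocks), cum_weights=cum,
                                 k=n_requests))
    return 1.0 - len(touched) / n_blocks

cold = untouched_fraction()   # sizable fraction of blocks never accessed
```

Even with ten requests per block on average, the skew leaves a substantial cold region where a sector fault would go unnoticed until a scan, or a repair, finally reads it.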
Faults in a disk array are often assumed to be independent. However, there are several reasons why the faults can actually be interrelated. First, the operating environment is usually the same for all disks (e.g., they share the same cooling system, power supply, or disk controller) [Räsänen 1996, Räsänen 1994, Seagate 1992, Seagate 1992a, Gibson 1991]. Second, the disks in a disk array very likely come from the same manufacturing batch and are therefore all prone to the same manufacturing quality problems, in both hardware and software [Hillo 1994, Gray 1993, Hillo 1992]. Besides, due to their complexity, disk arrays are also prone to all kinds of software errors, which occur especially under heavy load [Kamunen 1994, Räsänen 1994]. Hence, the failure rate of a disk array is in practice much higher than theoretically estimated. In particular, the probability of a second fault occurring just after the first one is significantly higher in a practical system than in theory (where the faults are assumed independent).
In summary, the theoretical reliability and performance figures of a disk array are never achieved in practice; they are limited by the actual implementation. Moreover, simplifications in the system modeling reduce the accuracy of the theoretical results.
The main objective of this thesis is to study the effect of latent sector faults on the reliability of disk arrays. This is done by introducing and analyzing detailed reliability models for a disk array that include two types of faults: disk unit faults and sector faults. Two repair models are analyzed: an array with hot spare and an array with hot swap. Both analytical and approximation methods are used for analyzing the reliability. Also, the effect of interrelated faults is studied.
Two reliability metrics are used in this thesis: data availability and Mean Time To Data Loss (MTTDL). Data availability is used to illustrate the probability that the system maintains the data for a given time. This can be used, for example, to express the 1-year, 3-year, and 10-year mission success probabilities [Gibson 1991]. The second metric, MTTDL, is also used for expressing the quality of the disk array with respect to reliability and availability, giving the estimated time for which the system maintains data integrity [Gibson 1991, Pages 1986, Siewiorek 1982, Shooman 1968].
The second objective of this thesis is to study the effect of latent fault detection techniques in disk arrays. A novel idea is to scan a disk while it is otherwise idle and thus detect deteriorated areas of the disk before it loses its capability to maintain data storage. Moreover, this scanning process can be used to expedite the detection of latent sector faults. This is especially important in systems that tolerate only one fault before repair and in which latent faults are probable due to uneven access patterns and a large number of disks.
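The idea of idle-time scanning can be sketched as follows. This is a hedged illustration of the general approach, not the author's exact algorithms (which are presented in Chapter 5): a cursor sweeps over the disk during idle periods, issuing verify reads, and yields immediately when the disk becomes busy.

```python
def idle_scan(n_sectors, is_idle, verify, start=0):
    """One pass of an idle-time scanning sweep.

    is_idle(sector) -> bool: True if the disk is still free when the
                             scanner reaches this sector.
    verify(sector)  -> bool: True if the sector reads back correctly.
    Returns (next_cursor, bad_sectors); the sweep resumes at next_cursor
    on the next idle period, so user requests are never delayed.
    """
    bad = []
    cursor = start
    while cursor < n_sectors and is_idle(cursor):
        if not verify(cursor):
            bad.append(cursor)   # latent fault found before a user hits it
        cursor += 1
    return cursor % max(n_sectors, 1), bad   # wrap around at the disk end

# Hypothetical 100-sector disk with latent faults at sectors 7 and 42.
nxt, bad = idle_scan(100, is_idle=lambda s: True,
                     verify=lambda s: s not in {7, 42})
```

A faulty sector found this way can then be repaired immediately from the redundant information on the other disks, shrinking the window during which a second fault would cause data loss.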
The third objective of this thesis is to study the effect of the latent fault detection technique on performability. Latent fault detection increases the reliability of the disk array as the probability of data loss is significantly reduced when the latent faults are eliminated faster. On the other hand, latent fault detection reduces the performance of the disk array as part of its time is not available for user requests.
This thesis is divided into six parts: introduction, basic principles of latent fault detection, reliability modeling and analysis, performability analysis, a disk subsystem as a part of a computer system, and conclusions.
The first part of the thesis contains an introduction to disk arrays and some background information for this research. The first chapter gives an overview of the background of computer systems with disk arrays. In Chapter 2, previous studies in the area of performance and reliability evaluation of disks and disk arrays are briefly reviewed. The motivations of this thesis are then discussed in Chapter 3. Models of disks and disk arrays are reviewed in Chapter 4.
The second part of this thesis discusses the idea of detecting latent faults. In Chapter 5, the main principles of disk scanning algorithms are presented.
The main part of this thesis is the analysis of the disk arrays. Here, the reliability of two different disk array configurations (RAID-1 and RAID-5) is studied. In Chapter 6, the prerequisites of the reliability analysis are discussed. The assumptions are divided into three categories: reliability metrics, reliability measurements, and basic assumptions about the components in the reliability model. Then, two reliability models are formulated in Chapter 7. Both exact analytical and approximation approaches are used. The reliability models are analyzed in Chapter 8, where the new models are compared with results in the technical literature. Several aspects of reliability are discussed in that chapter by comparing and analyzing the reliability models. Nine reliability scenarios are also analyzed there.
The fourth part studies the performability of a disk array. Chapter 9 addresses the performability issues. A performability model is built and analyzed for RAID-5 arrays. Also, a simple performability approximation is derived for RAID-1 and RAID-5 arrays.
The fifth part discusses briefly the disk array as a part of a computer system. Reliability of an entire computer system including a disk array is reviewed in Chapter 10.
Finally, the conclusions are presented in Chapter 11.
The contributions of this thesis can be divided into the following three main areas:
• new studies on the effects of latent faults on the reliability of a disk array,
• new reliability models for disk arrays taking into account both disk unit faults and sector faults, and
• new scanning algorithms to detect latent sector faults.
Effect of latent faults on disk array reliability
The main contribution of this thesis is its emphasis on the adverse effect of latent faults on reliability. In the technical literature, reliability analysis has concentrated on disk unit faults, ignoring the possible effects of latent sector faults. As common disk array architectures can handle only one fault in a disk group before repair, latent faults decrease the reliability of the disk array dramatically. As shown in this thesis, user-initiated disk requests do not detect latent sector faults efficiently because of their uneven access patterns. This thesis illustrates the importance of latent fault detection by showing how significantly the reliability decreases if latent faults are not detected early enough. As shown in this thesis, latent fault detection can be performed with simple algorithms, thus dramatically increasing the reliability. A detailed analysis of the reliability of a disk array, covering various aspects of latent faults, is also included in this thesis.
Also, the possibility of delaying or slowing down the repair process is studied. This is important in systems where high data availability is needed, but performance degradation during the repair state must still be limited. In this thesis, the speed of the repair processes is studied in order to improve performability. The repair process can be delayed to reduce the performance degradation effect of the repair process on the user disk requests at the expense of reduced reliability.
New reliability models
This thesis also presents two novel reliability models for a disk array that include both disk unit and sector faults: one model for an array with hot swap and another for an array with hot spare disks. The hot swap model is simpler, as it assumes that the spare disks are fault-free, while the hot spare model allows faults to occur also in the spare disk. The former model is analyzed using both an exact analytical method and an approximation method, while the latter, due to its complexity, is analyzed using approximation only. These reliability models are also used in the performability analysis.
Methods to detect and remove latent faults
New methods to detect and remove latent faults in order to maintain high reliability and data availability are introduced and analyzed. Three conference publications on disk scanning algorithms by the author are referred to. These algorithms take an approach similar to that of the memory scrubbing algorithms used to detect latent faults in primary memory [Saleh 1990]. However, these publications were the first to propose such algorithms for scanning secondary storage (i.e., hard disks). As disks behave differently from primary memory, the new disk scanning algorithms are adapted to disk usage.