2. PREVIOUS STUDIES

The performance and reliability of disk subsystems have been widely studied in the technical literature [Hou 1994, Schwarz 1994, Burkhard 1993, Chandy 1993, Geist 1993, Hillo 1993, Mourad 1993, Gibson 1991, Kim 1991, Lee 1991, Ng 1991, Reddy 1991, Reddy 1991a, Chen 1990, Chen 1990a, Chervenak 1990, Lee 1990, Seltzer 1990, Seltzer 1990a, Chen 1989, Gibson 1989, Gibson 1989a, Olson 1989, Reddy 1989, Chen 1988, Garcia-Molina 1988, Schulze 1988, Stonebraker 1988, King 1987, Livny 1987, Cox 1986, Kim 1986, Kim 1985]. Some studies have concentrated on improving disk subsystem performance and others on reliability issues, while some have analyzed performance and reliability in combination. In addition, disk array repair methods, fault prevention mechanisms, performability, and cost-performability have been studied.

A short overview of various branches of performance and reliability analysis of disks and disk arrays is given in this chapter.

2.1 Improving disk subsystem performance

The performance of a disk subsystem has been traditionally improved using one of three approaches: improving the software, improving the disk hardware, or using an array of disks.

2.1.1 Software approach

The performance of the disk subsystem can be improved without changing the hardware. This can be done, for example, by optimizing the disk accesses, their order, or the time at which they are executed.

Caching

One of the easiest methods to improve disk subsystem performance is to keep the data of previously served disk requests in memory for later use. As the same disk locations are very likely to be accessed again, this disk caching can significantly reduce the number of actual disk requests. The average response time drops for two reasons: a significant portion of the requests is completed almost immediately, and the reduced disk load shortens the response times of the remaining disk requests. There have been several studies of caching algorithms, such as [Thiebaut 1992, Jhingran 1989, Nelson 1988, Ousterhout 1988, Koch 1987, Grossman 1985, Coffman 1973].
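
The caching idea can be illustrated with a minimal sketch of a least-recently-used (LRU) block cache. LRU is only one of many possible replacement policies and is not taken from the cited studies; the class name BlockCache and the callable read_block are hypothetical names introduced for this illustration.

```python
from collections import OrderedDict


class BlockCache:
    """Minimal LRU cache of disk blocks (illustrative sketch only)."""

    def __init__(self, capacity, read_block):
        self.capacity = capacity
        # Hypothetical callable: block number -> block data (the "real" disk read).
        self.read_block = read_block
        self.blocks = OrderedDict()  # block number -> data, oldest entry first
        self.hits = 0
        self.misses = 0

    def read(self, block_no):
        if block_no in self.blocks:
            # Cache hit: no disk access, mark the block as most recently used.
            self.blocks.move_to_end(block_no)
            self.hits += 1
            return self.blocks[block_no]
        # Cache miss: go to the disk and remember the result.
        self.misses += 1
        data = self.read_block(block_no)
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:
            # Evict the least recently used block.
            self.blocks.popitem(last=False)
        return data
```

Every hit is a request that never reaches the disk, which is the source of both effects described above: the hit itself is served from memory, and the remaining requests see a less loaded disk.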

Read-ahead

It is not always enough to cache the previous disk requests to speed up the current disk requests. The performance can be further improved by using the idle time of the disk subsystem to read in advance those disk areas that are most likely to be accessed next. It has been shown that a user is very likely to access locations near the previous location [Hou 1994, Bhide 1988]. Sequential disk reads especially benefit from this read-ahead feature.

Write-behind

In the write-behind scheme, the main idea is to store the disk write requests, rearrange them, and optimize the disk usage [Seltzer 1990a]. By rearranging the disk write requests, it is possible to shorten the average disk seek length. If a user writes several times to the same location, it is possible to reduce the number of actual disk writes by writing only the last update. With battery-backed memory, a physical write operation can be delayed for a long time if most requests are reads.
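
The two effects of write-behind (coalescing repeated writes and reordering the rest) can be sketched as follows. This is a simplified illustration under stated assumptions, not the scheme of [Seltzer 1990a]: pending writes are flushed in ascending block order as a stand-in for a seek-optimizing order, and the names WriteBehindBuffer and write_block are hypothetical.

```python
class WriteBehindBuffer:
    """Buffers writes, keeps only the latest data per block, flushes in block order."""

    def __init__(self):
        self.pending = {}  # block number -> most recent data written to it

    def write(self, block_no, data):
        # A later write to the same block replaces the earlier one,
        # so only the last update ever reaches the disk.
        self.pending[block_no] = data

    def flush(self, write_block):
        """Issue the pending writes in ascending block order.

        write_block is a hypothetical callable (block_no, data) that performs
        the physical write. Returns the number of physical writes issued.
        """
        count = 0
        for block_no in sorted(self.pending):
            write_block(block_no, self.pending[block_no])
            count += 1
        self.pending.clear()
        return count
```

Sorting by block number is assumed here only because it yields a single sweep of the arm; a real controller would also account for the current head position and rotational delay.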

File system enhancements

Improving the disk I/O is not the only method of improving the performance. Performance can be further improved by arranging the files in the file system so that the files of the same directory are located near each other, as it is very likely that they are accessed at the same time [Seltzer 1993, Rosenblum 1992, Seltzer 1992, Dibble 1989, King 1987, Koch 1987, Ousterhout 1985]. Hence, the operating system can optimize file placement by making the files contiguous and grouping them by directory.

2.1.2 Hardware approach

The main problem with these software enhancements is that they can improve disk performance only to a limited extent, as the disks have physical constraints such as rotation and seek delays and a limited transfer rate. Although disks have improved in these matters significantly in recent years, the improvement has not been as rapid as the development in the other parts of the computer architecture [IBM 1996a, Lee 1991, Chen 1990, Chen 1990a].

Besides the software enhancements, hardware improvements to the disk subsystem have also been proposed, as listed below. According to these ideas, it is possible to improve the architecture of a single disk or to use a set of physical disks as one logical disk.

Improving disk architecture

One of the main limitations of a conventional hard disk is that it can serve only one request at a time. This is due to the fact that it has only one read/write head that accesses the physical storage media.

Two types of proposals have been made to enhance the disk architecture: adding dependent heads or adding independent heads [Orji 1991, Cioffi 1990, Sierra 1990, Cox 1986, Coffman 1973].

In the former case, the heads are using the same physical arm and the distance between the heads is fixed (e.g., half of the distance between the first and the last tracks of the disk). The main benefit of this approach is that the average disk seek distance can be reduced by half as the first head handles the first half of the disk and the other one handles the second half. The main disadvantage of this approach is that only one request can be in process at any time.

In the latter case, the disk is equipped with two (or more) independent disk arms that can all access the entire disk space. The main benefit of such an arrangement is that it is possible to serve several disk requests at the same time. The disk itself or the disk controller can optimize the disk accesses (both seek and rotation delays) by using the head that is closest to the requested area. This also slightly improves the reliability, as the disk can still operate even if one of its heads is faulty.
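
For the dependent-head arrangement, the halving of the average seek distance can be checked with a small calculation. The sketch below assumes independent, uniformly distributed requests over an even number of tracks and measures seek distance in tracks; the function names are hypothetical and the model ignores rotation and settle times.

```python
from itertools import product


def mean_seek_single_head(tracks):
    """Mean arm travel between two independent, uniformly chosen tracks."""
    total = sum(abs(a - b) for a, b in product(range(tracks), repeat=2))
    return total / tracks**2


def mean_seek_two_fixed_heads(tracks):
    """Mean arm travel with two heads on one arm, half the track range apart.

    The first head serves tracks below tracks // 2 and the second head serves
    the rest, so the arm position needed for track t is t % (tracks // 2).
    Assumes an even number of tracks.
    """
    half = tracks // 2
    total = sum(abs(a % half - b % half)
                for a, b in product(range(tracks), repeat=2))
    return total / tracks**2


ratio = mean_seek_two_fixed_heads(200) / mean_seek_single_head(200)
```

Under these assumptions the ratio is very close to one half, because the arm only ever has to cover half of the track range. The arm still serves one request at a time, as noted above.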

Arrays of disks

An alternative to improving the physical properties of a single disk is to use several physical disks as one logical entity. The operating system or the device driver of the disk subsystem distributes the incoming requests among the appropriate disks depending on the arrangement of the disks. A typical example of this approach is disk mirroring.

2.1.3 Redundant array of inexpensive disks

The problems with the dedicated hardware approach are slow development and vendor dependency. As the special disks do not generally follow any standards, the implementation depends on a single manufacturer. Also, the single disk approach provides no protection against disk faults. In addition, the probability of error-free media decreases as the area of the disk increases. Therefore, RAID (Redundant Array of Inexpensive Disks) is used as an alternative to the SLED (Single Large Expensive Disk) approach. In RAID, the disk I/O performance is improved by combining a set of disks to work together as one logical disk. Several studies (as noted below) have been reported in this area.

The concept of redundant array of inexpensive disks (RAID) is one of the most popular approaches for disk arrays [DPT 1993, RAB 1993, Gibson 1991, Lee 1991, Chen 1990, Chen 1990a, Lee 1990, Chen 1989, Katz 1989, Chen 1988, Patterson 1988, Patterson 1987, Salem 1986]. The RAID concept was introduced to improve the performance and/or the reliability of the disk subsystem. This concept utilizes models for different disk array algorithms, like mirroring, striping, and striping with parity.

Several RAID models (e.g., RAID-0, RAID-1, RAID-2, RAID-3, RAID-4, RAID-5) have been developed for different purposes. A more detailed description of the different RAID models can be found in Chapter 4 and in [Hillo 1993, RAB 1993, Gibson 1991, Patterson 1987].

Besides these simple arrays, there are also more complex arrays such as RAID-0xRAID-1 (i.e., mirrored RAID-0 array) or RAID-5+ [Stevens 1995, Schwarz 1994, Hillo 1993, Katz 1993, Mourad 1993a, RAB 1993, Lee 1992, Stonebraker 1989]. At the University of California, Berkeley, a second generation RAID concept has been developed to enhance the array architecture [Katz 1993, Lee 1992].

The performance of disk arrays is mainly optimized to store and retrieve data when no error has occurred [Hou 1994, Geist 1993, Hillo 1993, Mourad 1993, Chen 1992, Kemppainen 1991, Lee 1991, Olson 1989, Reddy 1991a, Chen 1990, Chervenak 1990, Lee 1990]. The performance of the arrays in a degraded state is considered to be of lesser importance because this is assumed to be an infrequent state.

2.2 Improving disk subsystem reliability

Reliability is typically maintained passively [Hillo 1993, Gibson 1991]. Disk faults are detected by the normal disk requests, i.e., no special disk activity is done to detect faults. The reliability of the disk subsystem has been improved using two approaches: improving the reliability of a single disk and using redundant disks.

2.2.1 Improved disk reliability

The reliability of a single disk has improved significantly in recent years [Quantum 1996a, Seagate 1996c, Nilsson 1993, Faulkner 1991, Gibson 1991]. This improvement has been due to the enhancements in the disk mechanics (reduced physical size and improved materials) as well as the new algorithms to estimate the disk reliability from the field returns [Quantum 1996a, Nilsson 1993, Gibson 1991].

Two major problems with the reliability of a single disk are uneven reliability and high reliability requirements. It has been shown that the reliability of disks of the same type varies greatly (even by several orders of magnitude) among the individual disks and manufacturing batches [Hillo 1994, Hillo 1992, Gibson 1991]. On the other hand, the reliability of the disk subsystem becomes more and more important as the amount of data stored on the disks increases.

2.2.2 Redundant disk arrays

Reliability of the disk subsystem has mainly been improved by introducing disk array concepts [RAB 1993, Lee 1991, Lee 1990, Katz 1989, Chen 1988, Patterson 1988, Salem 1986]. The underlying observation is that the number of disk faults increases approximately linearly with the number of disks in the array. Hence, to survive the increased fault rate, the array should have enough redundancy to tolerate at least one disk fault.
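
The linear growth of the fault rate can be made concrete with a short calculation. The sketch assumes independent disks with identical, constant failure rates; the function names and the numbers used in the example are hypothetical and are not measurements from the cited studies.

```python
HOURS_PER_YEAR = 8760


def array_mttf_hours(disk_mttf_hours, disks):
    """Mean time to the first disk fault in an array without redundancy."""
    # With constant, independent failure rates the rates simply add up.
    return disk_mttf_hours / disks


def expected_faults_per_year(disk_mttf_hours, disks):
    """Expected number of disk faults per year when faulty disks are replaced."""
    return disks * HOURS_PER_YEAR / disk_mttf_hours


# Hypothetical example: 20 disks, each with an MTTF of 200 000 hours.
first_fault = array_mttf_hours(200_000, 20)
faults_per_year = expected_faults_per_year(200_000, 20)
```

In this example a single disk would be expected to fail once in about 23 years, whereas the array sees a disk fault roughly every 10 000 hours, which is why some redundancy is needed.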

A typical disk array tolerates only one fault in a disk group [Hou 1994, Schwarz 1994, Burkhard 1993, Chandy 1993, Holland 1993, RAB 1993, Gibson 1991, Reddy 1991, Muntz 1990]. The main reason for this is that a higher level of redundancy would cause lower performance. The reliability of such systems is kept high by expediting the repair process in order to minimize the time when the system has no available redundancy.

The repair process can be expedited by one of three methods: speeding up the repair process, starting the repair process earlier or detecting the faults earlier. Typically, the first two methods are used. The repair process is typically expedited by giving it priority over the normal user requests [Hillo 1993, Gibson 1991]. The repair process can be started earlier if the system has hot spare units that can be used for the recovery immediately after the fault is detected [Gibson 1991, Pages 1986, Siewiorek 1982, Shooman 1968]. If the repair is started a long time after the fault detection (e.g., the faulty unit must be replaced by a serviceman or the spare part must be ordered after the fault detection), the reliability decreases dramatically [Gibson 1991].

The same RAID concepts and RAID configurations (except RAID-0) can also be used for improving the reliability as well as the performance [Hillo 1993, RAB 1993, Gibson 1991]. However, the different RAID concepts have quite different behavior in terms of performance and reliability. For example, the RAID-1 array with two disks has the best reliability figures of all arrays that tolerate a single fault, but the performance or the cost may not be acceptable (e.g., the write throughput for a large disk I/O is not better than with a single disk). On the other hand, RAID-4 and RAID-5 arrays have significantly better performance (especially for read operations), but the reliability is much worse than that of a RAID-1 array as the large number of parallel data disks is secured with only one parity disk.

There are also proposals for systems that can survive two or more faults in the same group of disks [Stevens 1995, Schwarz 1994, Mourad 1993a, Gibson 1991]. However, they are usually considered only for systems with extremely high reliability requirements, where the reliability concerns override the performance.

2.3 Reliability analysis

The reliability analysis of disk arrays has been quite widely studied [Hou 1994, Schwarz 1994, Burkhard 1993, Chandy 1993, Geist 1993, Gray 1993, Hillo 1993, Gibson 1991, Reddy 1991, Sierra 1990, Gibson 1989, Gibson 1989a, Chen 1988, Garcia-Molina 1988, Schulze 1988, Stonebraker 1988]. The three main approaches are exact analysis, measurement, and simulation.

In the exact analytical approach, the reliability analysis is based on a Markov model of the disk array [Schwarz 1994, Hillo 1993, Gibson 1991]. The main problem with this analytical approach is that the analysis quickly gets complicated when the reliability model is made more accurate. Typically, the disk array is modeled with a non-steady state Markov model where the system has at least one sink (i.e., data loss state). Hence, the equations become complex even with simple models. It is also possible to obtain estimates of the reliability by using an approximation approach as presented in [Gibson 1991].
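
As an illustration of such a model, consider a three-state Markov chain for one group of N disks: all disks working, one disk faulty and under repair, and data loss (the sink). The sketch below assumes constant failure rate 1/MTTF per disk and constant repair rate 1/MTTR, which is a deliberately simple model chosen for this example rather than the model of any cited study. It computes the mean time to data loss (MTTDL) in closed form and compares it with the widely used approximation that holds when repair is much faster than failure; all function names are hypothetical.

```python
def mttdl_exact(n_disks, mttf, mttr):
    """Mean time to data loss of the three-state model, starting with all disks up.

    Obtained by solving the expected-time-to-absorption equations
        T0 = 1 / (N * lam) + T1
        T1 = (1 + mu * T0) / (mu + (N - 1) * lam)
    where lam = 1 / mttf and mu = 1 / mttr.
    """
    lam = 1.0 / mttf
    mu = 1.0 / mttr
    return ((2 * n_disks - 1) * lam + mu) / (n_disks * (n_disks - 1) * lam**2)


def mttdl_approx(n_disks, mttf, mttr):
    """Approximation valid when the repair time is much shorter than the MTTF."""
    return mttf**2 / (n_disks * (n_disks - 1) * mttr)
```

Even this minimal model needs a small system of equations; adding sector faults, spare disks, or several fault types multiplies the number of states, which is where the complexity mentioned above comes from.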

The second alternative is to measure existing disk array systems, but this is typically considered to be infeasible as the number of arrays is small and the mean time between faults is very long.

The third alternative is to use a simulation approach as in [Sahner 1987, Sahner 1986]. Here, the flexibility of the simulation programs makes it possible to create more complex models that exhibit various behavior patterns (such as non-constant failure rates or dependent faults).
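
A minimal Monte Carlo sketch shows why simulation is flexible: the lifetime distribution is just a function argument, so a non-constant failure rate needs no new mathematics. The model below is an assumption made for illustration (one group of disks, a fixed repair time, data loss when a second disk fails during a repair) and is unrelated to the tools of [Sahner 1987, Sahner 1986]; all names are hypothetical.

```python
import random


def simulate_time_to_data_loss(n_disks, draw_lifetime, repair_time, rng):
    """One simulation run: time at which two disks are faulty at the same time.

    draw_lifetime(rng) returns the lifetime of a fresh disk, so any
    distribution can be plugged in. Requires n_disks >= 2.
    """
    # Absolute time of the next failure of the disk in each slot.
    fail_at = [draw_lifetime(rng) for _ in range(n_disks)]
    while True:
        first = min(range(n_disks), key=fail_at.__getitem__)
        repaired_at = fail_at[first] + repair_time
        # Earliest failure among the remaining disks.
        second = min(fail_at[i] for i in range(n_disks) if i != first)
        if second < repaired_at:
            return second  # second fault before the repair finished: data loss
        # The replacement disk starts a fresh lifetime once the repair is done.
        fail_at[first] = repaired_at + draw_lifetime(rng)


def estimate_mttdl(n_disks, draw_lifetime, repair_time, runs, seed=1):
    """Average time to data loss over a number of independent runs."""
    rng = random.Random(seed)
    total = sum(simulate_time_to_data_loss(n_disks, draw_lifetime, repair_time, rng)
                for _ in range(runs))
    return total / runs


def exponential_lifetime(rng):
    # Constant failure rate, hypothetical MTTF of 1000 hours.
    return rng.expovariate(1.0 / 1000.0)


def weibull_lifetime(rng):
    # Shape > 1 gives a failure rate that increases with age (wear-out).
    return rng.weibullvariate(1000.0, 1.5)
```

Calling estimate_mttdl(4, weibull_lifetime, 10.0, 2000) instead of using exponential_lifetime switches the whole analysis to an age-dependent failure rate, a change that would require a different analytical model.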

2.4 Disk array repair algorithms

The repair time of a disk array is typically considered to be so short that its effect on the performance is quite insignificant, as stated for example in [Hillo 1993, Gibson 1991]. This is true in practice when the long-term average response time or the number of requests per second is studied, as the repair time is typically a few hours while the average time between faults in a disk array is tens or even hundreds of thousands of hours [Quantum 1996a, Seagate 1996c, Faulkner 1991, Gibson 1991].

Typically, the main principle of the disk array repair algorithms is “repair as fast as possible to minimize the risk of having the second fault before the repair”. This is a reasonable approach when only reliability is considered, but the higher reliability is achieved at the expense of worse performance during the repair time [Muntz 1990].

If the repair time can be selected so that the repair can be done during the low load period of the system, the performance degradation due to the repair process can be significantly lowered. Unfortunately, this is not always possible. First, the reliability may suffer too much if the repair process is delayed several hours or days because of the current heavy load in the system [Gibson 1991]. Second, it is not always possible to postpone the repair for a more suitable time as the system may be loaded continuously with the same load, i.e., with no idle period [TPC 1992, TPC 1992a]. Besides, the performance of the crippled array is often significantly worse than that of a fully working array.

2.5 Fault prevention algorithms

The third method mentioned above for improving reliability is to expedite fault detection. Typically, fault detection is not done actively in disk arrays, as it is assumed that only whole-disk faults occur and that they are rapidly detected by the normal user disk requests. Unfortunately, this is not the case when sector faults are also considered [Scritsmier 1996, Cioffi 1990, Sierra 1990, Schulze 1988, Williams 1988]. A sector fault can remain undetected for a long time [Kari 1994, Kari 1993, Kari 1993a, Kari 1993b].

Fault detection can be improved by using the idle time of the system to diagnose the system status. When there is nothing else to do, the disk subsystem can be gradually read in small blocks so that the user disk requests are not disturbed too much if a user disk request comes while the scanning request is still being processed [Kari 1994, Kari 1993, Kari 1993a, Kari 1993b].
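
The idle-time scanning idea can be sketched as a function that reads a few blocks at a time and gives way as soon as user requests arrive. This is a hypothetical illustration of the principle, not the algorithm of the cited studies; scan_step, is_idle, and read_ok are names assumed for this example.

```python
def scan_step(next_block, total_blocks, chunk, is_idle, read_ok):
    """Scan up to `chunk` blocks, starting at next_block, while the disk is idle.

    is_idle() and read_ok(block_no) are hypothetical callables supplied by the
    disk subsystem. Returns the block to continue from and the faulty blocks
    found in this step.
    """
    faulty = []
    for _ in range(chunk):
        if not is_idle():
            break  # a user request is waiting: stop scanning immediately
        if not read_ok(next_block):
            faulty.append(next_block)
        # Wrap around so that the whole disk is scanned again and again.
        next_block = (next_block + 1) % total_blocks
    return next_block, faulty
```

Keeping the chunk small bounds the extra delay seen by a user request that arrives in the middle of a scan step.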

The disk scanning algorithm uses the same basic principle as the memory scrubbing algorithm [Saleh 1990]. In the memory scrubbing algorithm, the idle time of the computer is used for scanning the primary memory of the computer to find defective areas.

2.6 Performability analysis

A new term, performability, has been introduced in order to combine the metrics of performance and reliability of a computer system [Trivedi 1994, Catania 1993, Pattipati 1993, Smith 1988, Furchgott 1984, Meyer 1980, Beaudry 1978]. The main idea for the combination is to allow comparisons of different configurations of alternative models when both the performance and the reliability are important.

The basic idea of performability is to use a Markov reward model. In the Markov reward model, the system is given a reward for every state of a normal Markov state model. The reward can be, for example, the performance of the system in that state as expressed with the number of operations per second. When the reward in each state and the probability of being in that state are known, the performability can be formed as the sum of rewards weighted with the probabilities of the conventional Markov model.
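
The weighted sum described above can be written out directly. In the sketch below the three states, their probabilities, and the rewards (operations per second) are hypothetical values chosen only to show the arithmetic.

```python
def performability(state_probs, rewards):
    """Sum of per-state rewards weighted by the state probabilities."""
    if len(state_probs) != len(rewards):
        raise ValueError("one reward is needed for each state")
    if abs(sum(state_probs) - 1.0) > 1e-9:
        raise ValueError("state probabilities must sum to 1")
    return sum(p * r for p, r in zip(state_probs, rewards))


# Hypothetical array: fault-free, degraded (one faulty disk), and failed states.
probs = [0.999, 0.0009, 0.0001]
ops_per_second = [100.0, 60.0, 0.0]
value = performability(probs, ops_per_second)
```

With these numbers the result is 99.9 + 0.054 + 0 = 99.954 operations per second, so both the reduced performance of the degraded state and the zero reward of the failed state lower the figure in proportion to how often the system is in those states.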

The combined analysis of performance and reliability (performability) has become one of the key issues in modern disk subsystems. Higher reliability usually causes performance degradation on disk write requests, as the same data must be stored in multiple locations [RAB 1993]. On the other hand, redundancy in the system is essential for a disk subsystem to survive media faults.

Cost-performability analysis

Cost-performability is analyzed in a similar way to performability, but the cost of the system is also taken into account. The costs include factors such as the initial installation cost, the cost of running the system, the cost of a system failure, and the cost of reestablishing the system after a failure.

2.7 Other related studies

The performance research and analysis of disk subsystems has generally been performed under steady state conditions. However, performance during a fault recovery process is also important, especially when the disk subsystem is used in a real-time environment where strict response time requirements must be met. For example, the mean response time is not sufficient to ensure that a disk subsystem can continue operations to store or retrieve data for a multimedia application such as audio or video playback and recording. Usually, performance is guaranteed only under either a steady or a degraded state, but not while the system is under repair of a disk fault [Gibson 1991, Muntz 1990].

One study that analyzes disk array performance during the repair process is [Muntz 1990]. In this analysis, the performance of the disk array is analyzed not only during the normal operation of the array, but also during its repair phase.