The main objective of this thesis is to study the effect of latent sector faults on disk array reliability. The motivation behind this objective is to improve the performability of a disk array subsystem without significantly increasing its cost: hardware investments should be minimized while performance and data availability are maximized. Typically, a user is unwilling to invest in unnecessary equipment, but, on the other hand, system reliability and data availability should be as high as possible. When a high performance requirement is added, optimization becomes important.
The optimization problem has two alternative formulations. In the first, the optimization concerns only the reliability and performance factors (or their combination, performability). This can be interpreted as "the best combined reliability and performance at any cost". It is usually not feasible to optimize only performance or only reliability, as the other would suffer too much. In the second, performability and system cost are optimized together (expressed as cost-performability). This resembles the more practical situation where, for economic reasons, it is not possible to add as much redundancy as desired.
Different users have different target values for reliability, performance, and cost. Hence, it should be possible to express the performability or cost-performability of a given array configuration with common equations as a function of these three parameters.
One of the main factors limiting system reliability (and therefore also performability and cost-performability) is latent faults. Typically, a disk array can survive at most one fault per disk group at any time. As a fault may remain undetected in a disk array for a long time, detecting latent faults becomes significantly more important for maintaining reliability.
The reliability of a conventional disk array, expressed as the mean time to data loss (MTTDL), is typically estimated to be of the order of millions of hours, as stated in [Gibson 1991]. These figures consider only disk unit faults, ignoring both sector faults and latent faults in general. When these faults are included, the reliability drops significantly [Kari 1994, Kari 1993, Kari 1993a, Kari 1993b]. Hence, a mechanism is needed to detect such faults in order to regain the reliability. Even so, the conventional estimates (with only disk unit faults considered) will very likely remain an upper bound on the achievable reliability.
For disk subsystems, the performability and cost-performability terms can be used for expressing the quality of the architecture [Trivedi 1994, Catania 1993, Pattipati 1993, Smith 1988, Furchgott 1984, Meyer 1980, Beaudry 1978]. Performability can be used for analyzing the behavior of a disk subsystem and finding the optimum combination of performance and reliability as a function of the number of disks in the array as shown in Figure 2. Here, a hypothetical system is measured with two parameters: performance (as measured with the number of I/O operations per second) and reliability (as expressed with MTTDL). The performance of a disk array improves with the number of disks as the system can serve more simultaneous disk requests. On the other hand, the reliability decreases with the number of disks as there is a higher number of parallel disks that can become faulty. The optimum performability can be defined as a function of performance and reliability.
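As a rough illustration of this trade-off, the two opposing trends can be put into a small numerical model. All figures below (per-disk throughput, MTBF, repair time, and the reliability floor) are hypothetical and only chosen to make the shape of the trade-off visible; the MTTDL expression is the classic single-parity-group estimate considering disk unit faults only.

```python
# Illustrative model of the performance/reliability trade-off in a
# disk array.  All numeric parameters are assumptions, not measurements.

MTBF = 500_000.0   # hours per disk (assumed)
MTTR = 24.0        # hours to repair a failed disk (assumed)

def iops(n, per_disk=50):
    # Performance model: independent disks serve requests in parallel.
    return n * per_disk

def mttdl(n):
    # Single-parity group, disk unit faults only: data is lost when a
    # second disk fails during the repair window of the first.
    return MTBF ** 2 / (n * (n - 1) * MTTR)

# One way to pick an "optimum": the largest (fastest) array that still
# meets a minimum reliability requirement.
floor = 1e8  # minimum acceptable MTTDL in hours (assumed requirement)
candidates = [n for n in range(2, 101) if mttdl(n) >= floor]
best = max(candidates, key=iops)
```

With these particular numbers the model selects a 10-disk array: an eleventh disk would raise throughput but push MTTDL below the assumed floor.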
In cost-performability, the cost of the system (expressed as the cost of the system installation, the running costs, and the possible damages due to data loss) is taken into consideration. In Figure 3, the cost-performability of the same system is illustrated. As the cost increases with the number of parallel disks, the optimum cost-performability point is not necessarily at the same location as the optimum performability point. Cost-performability is not discussed further in this thesis because, unlike performance and reliability, the term "cost" has no unambiguous definition.
The performance and reliability of computer systems have improved radically in recent years. It has been said that if development in the automobile industry had been as rapid as in the computer industry, a car would now cost only one dollar and run one million miles per gallon. However, the reliability of an average car is typically much higher than that of a conventional computer (especially when software problems are included).
Unfortunately, performance has not developed as rapidly for all components of a computer system. The slowest development has been in the I/O subsystems, such as network, mass storage, and user I/O. These systems have been outpaced by the rapid progress of the central processing unit (CPU). For example, the network transfer rate has improved tenfold in the last five years (e.g., from 10 Mbps to 100 Mbps). The user I/O rate has increased only for output devices, and most of that gain has been consumed by higher display resolutions. In contrast, the user input rate has not improved significantly since the early days of computing. The capacity of hard disks has increased significantly in recent years, but otherwise hard disk performance has shown only slight improvement. In Table 4, typical top-of-the-line personal computers of 1986 and 1996 are compared [Fujitsu 1996, Intel 1996, Intel 1996a, Nokia 1986].
Table 4. Comparison of a typical PC in 1986 and 1996
Figure 4 illustrates how the hard disk capacity has developed for non-removable hard disks as a function of time and the size of the disk [IBM 1996a]. The capacity of the hard disks has increased steadily, about 40% per year, while the form factor (i.e., the physical size of the disk) has reduced significantly.
The performance of hard disks has not improved as fast as their capacity. The performance improvement has been restricted by the physical limitations (such as rotation speed, seek delays, storage density, and head mass).
Three parameters are used for measuring and expressing the performance of a hard disk: rotation speed, seek time, and transfer rate.
The enhancements in the combined seek and rotation delays of a hard disk are illustrated in Figure 5 [IBM 1996b]. In the mid-1980's, the average seek time was still on the order of 100 milliseconds, but, due to smaller disk form factors and lighter materials, it is nowadays less than 10 milliseconds. The seek delay is thus only 10% of what it was ten years ago. The rotation speed has increased over the last ten years from about 1800 rpm up to 7200 rpm; the rotational latency has therefore been reduced by 75% over the same time span.
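The quoted 75% reduction in rotational latency can be checked directly, since the average rotational latency is the time of half a revolution:

```python
# Average rotational latency is half a revolution; verify the figures
# quoted above for the move from 1800 rpm to 7200 rpm.

def avg_rotational_latency_ms(rpm):
    return 0.5 * 60_000 / rpm  # half a revolution, in milliseconds

old_latency = avg_rotational_latency_ms(1800)  # about 16.7 ms
new_latency = avg_rotational_latency_ms(7200)  # about 4.2 ms
reduction = 1 - new_latency / old_latency      # 0.75, i.e. 75%
```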
The third parameter of a hard disk is the data transfer rate, which is limited by two factors: the internal transfer rate of the disk and the disk bus transfer rate. From the late 1980's through the mid-1990's, the internal data rate of disk drives increased about 40% per year [IBM 1996c]. The current internal data rate is about 10 MB/s, while the external data rate depends on the disk bus type, varying from 10 to 40 or even up to 100 MB/s [Seagate 1996a, Seagate 1996b]. Modern hard disks can utilize significantly higher bus transfer rates by buffering data and disconnecting themselves from the bus while performing internal disk I/O operations. Hence, several hard disks can be connected to the same bus, sharing the high-speed transfer channel.
The reliability of hard disks has been enhanced significantly in the last ten years. In the mid-1980's, the average MTBF for a hard disk was on the order of 20,000 to 40,000 hours, while current MTBF figures are around 500,000 to 1 million hours [Seagate 1996c, Quantum 1996a, Hillo 1993, Nilsson 1993, Faulkner 1991, Gibson 1991]. The main reasons are improved disk technology, the reduced size of the disks, and new methods of predicting MTBF figures (based on field returns). One million hours (about 100 years) of MTBF for a disk is a rather theoretical figure; the actual figure depends greatly on how the disks are used. For example, a set of 50 heavily loaded disks suffered 13 faults in three months, corresponding to an MTBF of less than 10,000 hours, while the "official" MTBF for these drives was around 300,000 hours [Hillo 1996, Räsänen 1996, Hillo 1994, Räsänen 1994].
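The field figure quoted above follows from the standard estimate: observed MTBF is the accumulated device-hours divided by the number of faults (assuming, as here, three months of continuous operation):

```python
# Observed MTBF from field data: accumulated device-hours / faults.
# 50 heavily loaded disks running continuously for three months,
# suffering 13 faults in that period.

disks = 50
hours = 3 * 30 * 24        # roughly three months at 24 hours per day
faults = 13

observed_mtbf = disks * hours / faults   # about 8300 hours
```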
Improved reliability and data availability have no value by themselves. On the contrary, most users are price-conscious and will not invest in additional equipment unless the investment brings a real benefit.
Ultimately, performability is a question of money and of managing risk for the desired performance and reliability of the system. As a disk array is generally purchased in the first place to protect valuable user data, and preferably to provide nonstop operation, the cost of a data loss can be assumed to be high. Thus, the probability of data loss should be minimized, but not at any price. It is unwise to let the complexity of the system grow too high, because the practical reliability may then be only a fraction of the theoretical estimate. Practical reliability is reduced, for example, by improper human operation and by software errors (caused by overly complex system software).
The financial effects of the disk array concept are twofold. First, the initial cost of the disk array subsystem is significantly higher, as more hard disks are needed and the disk controller (and its software) is more complex. Second, the probability of data loss is smaller, and therefore the expected damage due to a data loss is significantly less than in a non-fault-tolerant disk subsystem.
At a certain point, there is no need to improve data availability at the disk array level, as other components are then relatively less reliable than the disk array.
There are two points of view on computer reliability: the user's and the manufacturer's.
User’s view on reliability
From the user's point of view, the system either is or is not operable. Therefore, there is only marginal (if any) benefit in improving the MTTDL of a system, for example, from 1 million hours to 10 million hours. This is because the user typically observes only one disk array, and no normal computer system is designed to operate for such a long period of time (100 to 1000 years). Most computers become obsolete in a few years and are replaced with a new model long before reliability becomes a limiting factor. Hence, when only one machine is considered, reliability issues lose their significance, as the availability of the system remains high (in many practical cases almost one) over its entire useful lifetime.
Manufacturer’s view on reliability
In contrast, a manufacturer sees a completely different picture of reliability. For example, with 100,000 installations in the field, each containing one disk array subsystem, there is a significant difference in user complaints when MTTDL increases from one to ten million hours. In the former case, there will be about 880 cases of data loss per year (assuming the systems run 24 hours per day), but in the latter case only about 88. This can have a dramatic effect on the profit and reputation of a company.
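The figures above follow from a simple expected-value calculation over the installed base (assuming independent systems, each running continuously):

```python
# Expected data-loss cases per year across an installed base:
# installed systems * hours per year / MTTDL.

installed = 100_000
hours_per_year = 365 * 24   # 8760

def cases_per_year(mttdl_hours):
    return installed * hours_per_year / mttdl_hours

low_mttdl_cases = cases_per_year(1e6)    # about 876 ("about 880" above)
high_mttdl_cases = cases_per_year(1e7)   # about 88
```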
There are benefits in both good performance and good reliability (as well as in low cost); combined performability is a compromise between them. The benefits can be divided into three categories: improved data reliability (or data availability), improved performance, and reduced cost of operating the system (fewer data losses).
Improved data availability
Enhancements in the reliability of a disk array improve data availability as well as the nonstop operation of the system. A good example of a system that benefits from improved data availability is a database server supporting online transaction processing (OLTP). In such a system, continuous operation is important and the cost of a data loss is typically extremely high.
Improved performance (especially in the degraded state) is a valuable property for systems that must provide good performance at all times. A disk array should remain usable during the repair process, and there should be no need to shut down the system or disable user request processing while the repair is active. In this way, the system can provide nonstop service even during exceptional situations.
The final benefit of improved performability is the reduced cost of running the system. When the total life span and all costs of the system are considered, the user gets higher reliability and/or better performance for the same money. Alternatively, the same reliability and/or performance can be achieved at a lower cost.
Performance improvement is a widely studied aspect of disk arrays, and a large number of reports present ways to improve it [Hou 1994, Burkhard 1993, Holland 1993, Mourad 1993a, Reddy 1991, Muntz 1990]. Hence, this subject is not studied further in this thesis.
Data availability can be improved in three ways: using more reliable components, using higher levels of redundancy, or expediting the repair process. Component reliability is difficult to improve beyond a certain level, so data availability cannot be improved by enhancing the components alone. Alternatively, better availability can be achieved with higher levels of redundancy. Unfortunately, this usually also degrades performance, as updating data on a disk array becomes slower as redundancy increases. Thus, once a lower limit on performance is set, redundancy can improve data availability only up to a certain level. The remaining method for improving data availability is to expedite the repair process.
Most modern disk array architectures tolerate only one fault per disk group. Therefore, it is vital for data availability to minimize the time during which the array has a fault in it. By reducing the duration of a fault (i.e., by expediting the fault detection process and/or the fault repair process), the probability of a second fault occurring in the same disk group can be reduced radically.
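The effect of shortening the fault duration can be sketched with a simple model. Assuming exponentially distributed disk lifetimes (an assumption, as are the numbers below), the probability that one of the remaining G-1 disks of a group fails while the first fault persists for T hours is 1 - exp(-(G-1)T/MTBF):

```python
import math

# Probability of a second fault in a group of G disks while the first
# fault persists for T hours, assuming exponential disk lifetimes.
# The MTBF and the window lengths are illustrative assumptions.

MTBF = 500_000.0  # hours

def p_second_fault(group_size, window_hours):
    return 1.0 - math.exp(-(group_size - 1) * window_hours / MTBF)

latent_month = p_second_fault(8, 30 * 24)  # fault undetected for a month
repaired_day = p_second_fault(8, 24)       # fault found and fixed in a day
```

With these numbers, detecting and repairing the fault within a day instead of letting it stay latent for a month lowers the probability of a second fault by roughly a factor of thirty.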
It is typically quite difficult to expedite the repair process without affecting performance: when the repair time is minimized, disk utilization increases, and user disk requests are delayed significantly. Thus, performance requirements limit how much the repair process can be sped up to improve reliability.
The remaining way to improve reliability is therefore to reduce the time during which faults are present in the system. As the repair process is hard to expedite, attention should be focused on detecting existing faults, thus eliminating them as quickly as possible. Fault detection can be done either through conventional user disk requests or by a special diagnostic procedure. The problem with the former is that it depends on the user access patterns and therefore does not provide full coverage, as not all areas of a disk are accessed by user requests. Hence, an active scanning program is needed to detect faults also in rarely accessed areas.
The active scanning program inserts disk scanning requests among the user disk requests, thus increasing the delays of user requests. If the parameters are set properly, the performance degradation remains reasonable. However, in a congested system, even a slight increase in load can lead to significant delays.
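One possible scanning policy (a sketch of the idea only, not the algorithm studied in this thesis) is to give user requests strict priority and issue scan requests only in otherwise idle intervals, sweeping cyclically over all sectors so that every area of the disk is eventually read:

```python
from collections import deque

# Sketch of a priority-based scanning policy: user requests always go
# first, and scan requests fill idle intervals, cycling over all
# sectors so rarely accessed areas are also covered.

class DiskScanner:
    def __init__(self, total_sectors):
        self.next_sector = 0
        self.total = total_sectors

    def tick(self, user_queue):
        """Pick the next request for one scheduling interval."""
        if user_queue:
            return ("user", user_queue.popleft())
        request = ("scan", self.next_sector)
        self.next_sector = (self.next_sector + 1) % self.total
        return request
```

In a congested system the user queue is rarely empty, so a practical implementation would additionally enforce a minimum scanning rate to guarantee that the whole disk is covered within a bounded time.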
The best option would be for the system to detect faults even before they occur. This can be done by using an increased number of retries as an early warning sign of degradation [ANSI 1994, Räsänen 1994].