6. ASSUMPTIONS FOR RELIABILITY MODELS

The prerequisites for the reliability analysis are discussed in this chapter. Three topics are covered: reliability metrics, methods to evaluate reliability, and the assumptions made for the novel reliability models.

6.1 Reliability metrics

Two reliability metrics are used in this thesis: data availability and MTTDL. Data availability expresses the probability that the system maintains its data for a given time. This can be used, for example, for expressing the 1-year, 3-year, and 10-year mission success probabilities [Gibson 1991]. The second metric, MTTDL, is also used for expressing the quality of the disk array with respect to reliability and availability by expressing the estimated time for which the system maintains data integrity [Gibson 1991, Pages 1986, Siewiorek 1982, Shooman 1968].

Both data availability and MTTDL are functions of disk array configuration, Mean Time Between Failures (MTBF) or Mean Time To Failure (MTTF) of the disks, and Mean Time To Repair (MTTR) of the disk array. When the mission success probabilities or MTTDL are used, the array configurations can be compared using simple metrics.

6.2 Methods to evaluate reliability

Three alternative methods to evaluate the reliability of a disk array system are: the analytical approach, measurements of an existing system, and reliability simulation.

6.2.1 Analytical approach

The analytical approach for the reliability analysis is based on Markov models [Shooman 1968]. With a Markov state transition diagram, it is possible to present and analyze both steady and transient states of a disk array. The steady state analysis is significantly simpler than the transient state analysis as, in the former case, the problem can be solved using the balance equations while, in the latter case, a set of differential equations must be solved.

In a simple Markov model, the group of differential equations can be solved easily using, for example, the Laplace transformation, but, as the number of states in the model increases, the inverse Laplace transformation becomes more and more complex. This is shown in the next chapter. Even complex inverse Laplace transformations can be evaluated numerically, but the problem is to solve the equations in closed form. The numerical approach may be infeasible in practice if a large number of parameter combinations is used, for example, for studying the sensitivity of different parameters. Hence, complex Markov models may require some approximation or simplification in order to be solved analytically.
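As an illustration of the numerical route, the sketch below (not part of the thesis) integrates the transient state equations dP/dt = P·Q of a small three-state model directly with SciPy instead of inverting the Laplace transform; the rate matrix and all parameter values are assumptions chosen only for the example.

# A minimal sketch: numerical integration of the Kolmogorov forward equations
# dP/dt = P * Q of a small absorbing Markov model. The three-state rate matrix
# and the failure/repair rates below are illustrative assumptions only.
import numpy as np
from scipy.integrate import solve_ivp

lam, mu, n = 1e-5, 1e-2, 10              # assumed failure rate, repair rate, number of disks
Q = np.array([[-n * lam,        n * lam,                0.0],
              [      mu, -(mu + (n - 1) * lam), (n - 1) * lam],
              [     0.0,                  0.0,            0.0]])   # absorbing (data loss) state

sol = solve_ivp(lambda t, p: p @ Q, (0.0, 87600.0), [1.0, 0.0, 0.0],
                t_eval=[8760.0, 26280.0, 87600.0], rtol=1e-10, atol=1e-14)
for t, p in zip(sol.t, sol.y.T):
    print(f"t = {t:7.0f} h  data intact with probability {p[0] + p[1]:.6f}")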

6.2.1.1 Traditional Markov model

A traditional Markov model (TMM) for disk array reliability is illustrated in Figure 14 [Schwarz 1994, Hillo 1993, Gibson 1991]. As the model has just one fault type and at most one simultaneous fault in an array is tolerated, the model has only three states and it can be easily solved analytically. In the model, P_x(t) indicates the probability of the system being in state x (where x is the number of faulty disks) at time instance t. There are D+1 disks in total, and at least D disks must remain fault-free to maintain data consistency. The system moves from the fault-free state (state 0) to the single fault state (state 1) if any of the D+1 disks fails, with a total rate of (D+1)λ. The system consistency is lost if a second disk fault occurs, with rate Dλ, before the first one is repaired; the failed disk can be any of the remaining D disks. The system returns from the single fault state back to the fault-free state when the faulty disk has been replaced with a spare disk and the data has been reconstructed onto that disk. The repair rate of the faulty disk is μ.


Figure 14. Traditional Markov model for a disk array (D+1 is the number of disks, λ is the disk failure rate, μ is the repair rate, and P_x(t) defines the probability of the system being in state x at time t, where x defines the number of faulty disks in the disk array)

No meaningful steady state solution exists for the Markov model in Figure 14, since the model contains a sink (absorbing) state. However, the state transition equations are simple:

dP_0(t)/dt = -(D+1)λ·P_0(t) + μ·P_1(t), (7)

dP_1(t)/dt = (D+1)λ·P_0(t) - (Dλ + μ)·P_1(t), (8)

and

dP_2(t)/dt = Dλ·P_1(t) (9)

where the parameters are as shown in Figure 14. Also, it is generally assumed that the initial conditions are

P_0(0) = 1 (10)

whereas

P_1(0) = P_2(0) = 0. (11)

Equations (7)-(9) can then be solved using the Laplace transformation together with the initial conditions (10) and (11). This yields the following results [Gibson 1991]:

P_0(t) = [(r_1 + Dλ + μ)·e^(r_1·t) - (r_2 + Dλ + μ)·e^(r_2·t)] / (r_1 - r_2), (12)

P_1(t) = (D+1)λ·[e^(r_1·t) - e^(r_2·t)] / (r_1 - r_2), (13)

P_2(t) = 1 - P_0(t) - P_1(t), (14)

and

R(t) = P_0(t) + P_1(t) = [r_1·e^(r_2·t) - r_2·e^(r_1·t)] / (r_1 - r_2) (15)

where

r_1 = [-((2D+1)λ + μ) + sqrt(((2D+1)λ + μ)^2 - 4(D+1)Dλ^2)] / 2 (16)

and

r_2 = [-((2D+1)λ + μ) - sqrt(((2D+1)λ + μ)^2 - 4(D+1)Dλ^2)] / 2. (17)

The reliability (as measured with MTTDL) can be expressed as

MTTDL = ∫_0^∞ R(t) dt = ((2D+1)λ + μ) / ((D+1)·D·λ^2). (18)

Similarly, the mission success probabilities are obtained from the following equations:

R_1yr = R(8 760 h), (19)

R_3yr = R(26 280 h), (20)

and

R_10yr = R(87 600 h) (21)

where R_1yr, R_3yr, and R_10yr are the mission success probabilities for one, three, and ten years, respectively.
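As a sketch, the closed-form solution above can be evaluated, for example, as follows; the parameter values (number of disks, MTTF, MTTR) are illustrative assumptions, not values taken from the thesis.

# A minimal sketch evaluating the closed-form solution (12)-(21) above.
# The parameter values are illustrative assumptions.
import math

D = 9                 # assumed: D + 1 = 10 disks in the array
lam = 1.0 / 100_000   # assumed disk failure rate [1/h]
mu = 1.0 / 100        # assumed repair rate [1/h]

a = (2 * D + 1) * lam + mu
disc = math.sqrt(a * a - 4 * (D + 1) * D * lam * lam)
r1, r2 = (-a + disc) / 2.0, (-a - disc) / 2.0      # roots (16) and (17)

def R(t):
    # Reliability (15): probability that no data has been lost by time t.
    return (r1 * math.exp(r2 * t) - r2 * math.exp(r1 * t)) / (r1 - r2)

mttdl = ((2 * D + 1) * lam + mu) / ((D + 1) * D * lam * lam)   # equation (18)
print(f"MTTDL = {mttdl:.3e} h")
for years in (1, 3, 10):                                       # (19)-(21)
    print(f"R({years:2d} years) = {R(years * 8760.0):.6f}")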

In the above equations (7)-(21), the following assumptions are valid [Gibson 1991]:

• faults in the disk array are independent,

• the lifetime of the disks is exponentially distributed,

• the repair time is exponentially distributed,

• the system tolerates only one fault before repair, and

• there are D+1 disks in a fault-free system.

6.2.1.2 Approximation of Markov models

A Markov model can be approximated either by reducing the state diagram, by simplifying the equations, or by using iterative methods [Gibson 1991].

Iterative method

Another way to solve the Markov model is the iterative method [Gibson 1991]. In this case, only MTTDL is obtained, not R(t). The iterative method is as follows:

Beginning in a given state i, the expected time until the first transition into a different state j can be expressed as

T_i = t_i + Σ_k p_ik·T_k (22)

where

t_i = 1 / Σ_k λ_ik (23)

and

p_ik = λ_ik·t_i. (24)

The solution to this system of linear equations includes an expression for the expected time beginning in state 0 and ending on the transition into state 2, that is for MTTDL. For the Markov model in Figure 14, this system of equations is:

T_0 = 1/((D+1)λ) + T_1, (25)

T_1 = 1/(Dλ + μ) + [μ/(Dλ + μ)]·T_0 + [Dλ/(Dλ + μ)]·T_2, (26)

and

T_2 = 0. (27)

Solving the above equations (25)-(27) for MTTDL leads to

MTTDL = T_0 = ((2D+1)λ + μ) / ((D+1)·D·λ^2), which agrees with equation (18). (28)
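The linear system (25)-(27) is small enough to check symbolically; the following sketch does this with SymPy under the notation used above.

# A minimal sketch checking the iterative-method result (28) symbolically.
import sympy as sp

D, lam, mu, T0, T1 = sp.symbols('D lambda mu T0 T1', positive=True)

eqs = [sp.Eq(T0, 1 / ((D + 1) * lam) + T1),                        # (25)
       sp.Eq(T1, 1 / (D * lam + mu) + mu / (D * lam + mu) * T0)]   # (26), with T2 = 0 from (27)
sol = sp.solve(eqs, [T0, T1], dict=True)[0]
print(sp.simplify(sol[T0]))   # equivalent to ((2D+1)*lambda + mu) / ((D+1)*D*lambda**2)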

Simple approximation

Gibson [Gibson 1991] has also proposed other approximation and simplification methods. These simplifications take advantage of the knowledge that the average disk failure rate is significantly lower than the average repair rate (λ << μ). In that case, the system reliability can be approximated with the exponential function

R(t) ≈ e^(-t/MTTDL) (29)

where

MTTDL ≈ μ / ((D+1)·D·λ^2) = MTTF_disk^2 / ((D+1)·D·MTTR_disk), (30)

MTTF_disk is the mean time to failure of a disk, and MTTR_disk is the mean time to repair a disk fault in the disk array.
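A small sketch comparing the exact MTTDL of equation (18) with the simple approximation (30); the MTTF, MTTR, and array size values are illustrative assumptions.

# A minimal sketch comparing the exact MTTDL (18) with the approximation (30).
mttf, mttr, D = 100_000.0, 100.0, 9        # assumed hours, hours, and D (D + 1 = 10 disks)
lam, mu = 1.0 / mttf, 1.0 / mttr

exact = ((2 * D + 1) * lam + mu) / ((D + 1) * D * lam ** 2)
approx = mttf ** 2 / ((D + 1) * D * mttr)
print(f"exact  MTTDL = {exact:.4e} h")
print(f"approx MTTDL = {approx:.4e} h   deviation = {exact / approx - 1:+.3%}")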

6.2.2 Reliability measurements

The analytical results of the reliability model are sometimes verified using measurements in a real environment [Shooman 1968]. In many cases, such measurements are not possible as the number of systems may be low and the period of time to observe the system should be very long (MTTDL can be millions of hours) as stated in [Gibson 1991]. It is practically infeasible to wait tens of years to gather statistics while the number of installations is typically small because of the high price. In addition, the practical reliability figures may differ significantly from the analytical models, as reliability of a disk array is also strongly related to the software implementation of the disk array [Hillo 1996, Kamunen 1994, Räsänen 1994, Hillo 1993]. Therefore, it was decided not to use measurements from the existing disk array systems for comparison.

6.2.3 Reliability simulation

A simulation program is an alternative approach to solving a Markov model. Several simulation programs are available for this, and some of them have already been used for disk array analysis [Gibson 1991, Lee 1991, Orji 1991, Chen 1990, Lee 1990, Reddy 1990a, Pawlikowski 1990, Comdisco 1989, Sahner 1987, Sahner 1986]. As the transition rates from state to state vary by several orders of magnitude, the accuracy requirements for the simulation program become significant (e.g., the accuracy of the random number generator or the precision of the arithmetic). Also, the time needed for the reliability simulations may become infeasibly long, especially when the number of different parameter combinations is large. Therefore, the simulation approach was also rejected for use in this thesis.

6.3 Assumptions for novel reliability models

The basic principles of the novel approximation models are illustrated here. The following three issues related to the disk reliability models are also discussed: disk fault models, fault detection efficiency, and disk access patterns.

6.3.1 Novel approximation method

The Markov model can be simplified with a novel two-step approach. First, the Markov model is solved in the steady state case where the data loss state is ignored, as depicted in Figure 15.


Figure 15. Steady state approximation of the traditional Markov model (D+1 is the number of disks, λ is the disk failure rate, μ is the repair rate, and P'_x defines the approximation of the probability of the system being in state x, where x defines the number of faulty disks in the disk array)

From the steady state model, the probabilities of being in states 0 and 1 are obtained as follows:

P'_0 = μ / (μ + (D+1)λ) (31)

and

P'_1 = (D+1)λ / (μ + (D+1)λ). (32)

Then, it is possible to approximate the failure rate using the transition that is illustrated in Figure 16 [Laininen 1995].

The failure rate in the approximation can be expressed as

λ_loss ≈ P'_1·Dλ = (D+1)·D·λ^2 / (μ + (D+1)λ) (33)


Figure 16. Approximation of the failure rate (D+1 is the number of disks, λ is the disk failure rate, and P'_x defines the approximation of the probability of the system being in state x, where x defines the number of faulty disks in the disk array)

from which MTTDL for the simplified model can be achieved as

MTTDL' = 1/λ_loss = (μ + (D+1)λ) / ((D+1)·D·λ^2). (34)

This approximation is valid only when the disk repair rate is significantly higher than the total disk failure rate of the array (i.e., μ >> (D+1)λ).

When the exact and the simplified models are compared, the MTTDL ratio is

MTTDL / MTTDL' = ((2D+1)λ + μ) / ((D+1)λ + μ) (35)

while the error is

error = MTTDL/MTTDL' - 1 = Dλ / ((D+1)λ + μ). (36)

In a practical case, 1/μ (MTTR) is of the order of 100 hours, 1/λ (MTTF) is of the order of 100 000 hours, and D is of the order of tens. Hence, the error is around

error ≈ Dλ/μ ≈ 1%. (37)

The approximation underestimates the value of MTTDL because the error is always positive (MTTDL > MTTDL'). Thus, the approximation gives a slightly pessimistic estimate of the system reliability.
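The following sketch evaluates the two-step approximation with illustrative values of the order mentioned above (1/μ of the order of 100 h, 1/λ of the order of 100 000 h, D of the order of tens) and shows the roughly one percent pessimistic error.

# A minimal sketch of the two-step approximation; parameter values are
# illustrative assumptions of the magnitudes discussed above.
D, lam, mu = 10, 1.0 / 100_000, 1.0 / 100

mttdl_exact = ((2 * D + 1) * lam + mu) / ((D + 1) * D * lam ** 2)    # (28)
mttdl_approx = (mu + (D + 1) * lam) / ((D + 1) * D * lam ** 2)       # (34)
error = mttdl_exact / mttdl_approx - 1                               # (36)
print(f"relative error = {error:.2%}  (the approximation is pessimistic)")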

6.3.2 Disk array fault models

The faults in a disk array system can be divided into two major categories: disk related faults and other faults [Hillo 1994, Räsänen 1994, Cioffi 1990, Sierra 1990, Schulze 1988, Williams 1988]. In a disk related fault, part of the stored data is lost due to a fault in a disk unit. The other faults do not necessarily cause loss of data, but they may make the data unavailable. For example, a power failure or a data bus fault may not affect the actual data, but the data is momentarily unavailable.

In highly reliable systems, such as modern disk arrays, continuous availability of data is a critical factor of reliability. Hence, unavailability of data can be considered as fatal as actual data loss. On the other hand, a system that accepts short down times can also tolerate data unavailability. For example, if the data of a database can be restored using a backup and a log file together with a recovery procedure, it is acceptable that the disk array is unavailable for a short time because of software or hardware faults.

In the next three chapters, the main emphasis is on the reliability aspects of disks, omitting the effects of other components (such as the disk controller, cables, power supply, and software). Those components are briefly discussed in Chapter 10.

Transient and permanent disk faults

The first disk fault category divides the faults based on the duration of the fault. The duration of the fault can be short (i.e., transient or temporary fault) or long (i.e., permanent fault).

A transient fault can occur for two alternative reasons. First, a transient fault may be caused by a change of the system state (e.g., a temperature change may cause a disk to recalibrate itself, resulting in a disk access failure) [Kamunen 1996, Räsänen 1996, ANSI 1994, Räsänen 1994, Koolen 1992, McGregor 1992]. Typically, just by retrying the same request, it is possible to complete the request with no errors. Second, the fault may have a more permanent nature, but by altering the request slightly (e.g., by reading a sector with ECC enabled or reading the data slightly off the track), the fault can be bypassed or recovered.

Transient faults are often the first signs of media deterioration and they can be used for predicting permanent faults. However, not all transient faults indicate media deterioration. Instead, they can be due to, for example, a recalibration caused by a temperature change [Räsänen 1994, Koolen 1992, McGregor 1992]. It is very unlikely that transient faults which are not related to media deterioration will occur at the same location several times. Hence, by keeping a log of the locations of transient faults, it is possible to identify genuinely deteriorated areas.

In contrast to transient faults, it is not possible to repair a permanent fault by retrying. The cause for a permanent fault can be either mechanical (e.g., bearings, motor or disk arms) or electrical (e.g., controlling logic or bus interface logic).

A permanent fault may also be caused by damage to the magnetic material. The material can deteriorate either gradually or instantly. Instant degradation can be caused, for example, by a head crash or by the temperature of a disk rising over the so-called Curie point, at which the entire disk loses its storage capacity [Räsänen 1994]. The disk temperature can rise, for example, due to an unrelated fault such as a faulty fan.

Disk unit faults and sector faults

The second disk fault category divides the faults based on the fault magnitude: total or partial. A fault can affect an entire disk (i.e., a disk unit fault) or a small part of it (e.g., a sector fault). The size of the fault can also be intermediate (e.g., affecting one disk surface), but such faults are here considered to be disk unit faults, as a significant part of the data is affected.

6.3.2.1 Disk unit faults

Traditionally, disk unit failure rates have been modeled using constant failure rates (leading to an exponential fault distribution) [Schwarz 1994, Gibson 1991, Sierra 1990]. A constant failure rate is generally used for simplifying the reliability calculation of complex systems [Shooman 1968]. Besides the constant failure rate, more complex failure rate models have also been proposed (e.g., models leading to the Weibull distribution) [Gibson 1991].

The actual disk unit failure rate is much more complicated. First of all, most electronic devices, such as hard disks, follow the bathtub curve (with high "infant" and "old age" failure rates) as shown in Figure 17 [Schwarz 1994, Gibson 1991, Sierra 1990, Shooman 1968]. However, during the useful lifetime of a disk, the failure rate is assumed to be more or less constant. Second, disk unit faults are not as independent of each other as assumed by typical fault models [Räsänen 1994, Gray 1993, Hillo 1992, Gibson 1991]. Third, the disk unit failure rate varies considerably between disks of different manufacturing batches [Voutilainen 1996, Schwarz 1994, Gray 1993, Hillo 1992].

Faults in disks are tightly related to each other. For example, the sector fault probability is significantly higher in areas near known faults. Also, all sectors on the same disk surface suffer from the quality of a bad head. Furthermore, disks in the same storage cabinet are usually vulnerable to the same cable, power supply, and ventilation faults.

The quality of disks also depends very much on the manufacturing batch. Typically, the disk properties (such as the failure rate) within the same batch are similar, but between different batches the quality can vary significantly, as shown below. This affects the reliability in two ways. First, the actual reliability may vary dramatically and, for some units, the MTBF can be only a fraction of the average MTBF. Second, if the disk array is built using disks from the same manufacturing batch, there is a significantly larger probability of a second disk unit fault occurring just after the first one than the model of independent disk failure rates would predict [Kamunen 1996, Räsänen 1996, Hillo 1993, Hillo 1992]. This emphasizes the significance of a fast disk repair process and of selecting disks from different manufacturing batches.


Figure 17. The bathtub curve of conventional lifetime distribution for an electrical device

As a practical example, there was a batch of disks of which a significant portion (over 10%) failed within hours of initial start-up at the end-user site [Hillo 1992]. The reason for this was a change in the lubricating oil. It caused no problems in the factory, but transportation in a cold environment completely changed the properties of the lubricant. Another example is that about 15% of 130 hard disk units installed at one customer site got bad sectors within one year [Voutilainen 1996].

In practical systems, the reliability of a disk also depends heavily on how the disk is used. Several factors will shorten the life span of the disk. Some of them are listed in Table 5 [Hillo 1994, Räsänen 1994].

In this thesis, deterioration of the magnetic media is considered to be independent of the usage of the disk. This means that reading from or writing to a disk is assumed not to deteriorate the magnetic media. Instead, the media is assumed to deteriorate by itself, also causing unaccessed areas to become faulty [Cioffi 1990, Sierra 1990, Schulze 1988, Williams 1988]. The read process is considered to have no effect on the media deterioration, as the head does not touch the surface and thus no mechanical stress is applied to the magnetic storage media. However, other parts of the disk can suffer from extensive usage of the disk, as listed in Table 5.

Table 5. Actions that affect the disk life span

Start/Stop
  Effects or potential problems: the motor wears out; extensive stress on the bearings.
  Comments: typically about 10 000 - 100 000 starts/stops are allowed.

Temperature cycle up/down
  Effects or potential problems: extensive electrical or mechanical wear-out.
  Comments: one cycle between 10°C and 40°C corresponds to one day of operation at constant temperature.

Constant high temperature
  Effects or potential problems: expedites wear-out; risk of losing all data due to overheating.
  Comments: a temperature over the Curie point causes permanent damage [Kuhn 1997].

High seek activity
  Effects or potential problems: higher power consumption; increased temperature; seek motor wear-out.
  Comments: more heat and inadequate power may reduce the lifetime of the power supply and cause disk malfunction due to insufficient current.

Uneven access pattern
  Effects or potential problems: may cause extensive seeks; higher latent fault possibility.
  Comments: most applications access the disk space unevenly; an uneven access pattern may cause more seek activity.

Dust and humidity
  Effects or potential problems: electrical components wear out; ventilation problems.
  Comments: potential temperature increase.

Vibration
  Effects or potential problems: mechanical wear-out; mechanical defects.
  Comments: vibration may loosen components or connectors.

The actual media deterioration is caused by the impurity of the magnetic media and gradual degradation of the magnetic field [Sierra 1990]. If the magnetic media gradually loses its information, it can be refreshed (read and rewritten). Typically, the degradation and the impurity of the material are unevenly distributed.

The MTBF values of hard disks have improved rapidly in recent years. Ten years ago, the average MTBF was around 20 000 - 40 000 hours, but the official manufacturers' MTBF figures are currently around 500 000 - 1 000 000 hours [Quantum 1996a, Seagate 1996c]. However, the practical MTBF figures can be significantly lower (even as low as 10 000 - 65 000 hours), especially when the disks are heavily loaded [Räsänen 1996, Voutilainen 1996, Hillo 1994].

In this thesis, a constant failure rate is used for disk unit faults, with MTBF figures ranging from 10 000 hours to 1 000 000 hours. This range covers the MTBF figures that are commonly presented in the technical literature, in order to keep the results comparable with other studies such as [Schwarz 1994, Hillo 1993, Gibson 1991].

6.3.2.2 Sector faults

Media deterioration typically affects only a limited area of the disk surface. The minimum area that is normally affected is a sector [Haseltine 1996, Räsänen 1996, Scritsmier 1996, ANSI 1994, Seagate 1992, Seagate 1992a]. To survive such events, modern disks have a repair mechanism that replaces a faulty sector with a spare one while still maintaining the same logical representation, as described in Chapter 4.

Generally, faults are assumed to be independent of each other, but, with sector faults, the sectors near the faulty one (the previous or next sector on the same track, or sectors on the neighboring tracks) have a significantly higher fault probability. This has a major effect on the sector repair algorithms, as after a sector fault the repair process should also check the nearby sectors. Such an enhancement can be implemented as a part of the repair procedure in a scanning algorithm.
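A possible way to exploit this observation is sketched below. This is not the thesis's algorithm: the disk is modelled simply as a set of latent bad sector numbers, and reconstruction and remapping are represented by clearing the corresponding entries.

# A minimal, self-contained sketch of a sector repair step that also probes
# the neighbours of a failed sector (same track and adjacent tracks), since
# they have an elevated fault probability.
def repair_with_neighbour_check(bad_sectors, failed, sectors_per_track, span=2):
    candidates = {failed + d for d in range(-span, span + 1)}
    candidates |= {failed - sectors_per_track, failed + sectors_per_track}
    repaired = {s for s in candidates if s in bad_sectors or s == failed}
    bad_sectors -= repaired      # stands for: reconstruct from redundancy and remap to spares
    return repaired

latent = {1000, 1001, 1064}      # assumed latent faults (one on the next track)
print(repair_with_neighbour_check(latent, 1001, sectors_per_track=63))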

The repair process of a sector fault is significantly faster than the repair of a disk unit fault. Typically, a sector repair takes of the order of hundreds of milliseconds, while reconstructing an entire disk may easily take more than an hour [Räsänen 1994, Kamunen 1994, Hillo 1993].

Based on practical experience, typical hard disks have about the same number of sector faults as entire disk unit faults (i.e., a sector fault on any sector of a disk is encountered as often as a faulty disk unit) [Kamunen 1996, Räsänen 1996]. Hence, when both fault types are considered in this thesis, the faults in a disk are split evenly between sector faults and disk unit faults. For example, if the conventional disk array model uses a 100 000 hour MTBF for the disk faults (i.e., sector faults are ignored and all faults are considered to be disk unit faults), the disk failure rate is λ = 1/100 000 h. Then, in the enhanced model, the disk unit failure rate is λ/2 = 1/200 000 h and the sector failure rate per sector is λ/(2S), where S is the number of sectors in the disk.
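As a numerical sketch of this split (the number of sectors S below is an arbitrary assumption):

# A minimal sketch of the fault-rate split described above.
mtbf_total = 100_000.0          # hours, all faults counted as disk unit faults
S = 2_000_000                   # assumed number of sectors on the disk

lam_total = 1.0 / mtbf_total                 # conventional model
lam_disk_unit = lam_total / 2.0              # enhanced model: disk unit faults
lam_sector = lam_total / 2.0 / S             # enhanced model: per-sector fault rate
print(f"{lam_disk_unit:.2e} 1/h per disk unit, {lam_sector:.2e} 1/h per sector")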

6.3.3 Disk unit fault detection efficiency

Detection of disk unit faults is typically fast. Both transient and permanent disk unit faults are detected quickly by the disk control protocol [ANSI 1994, Räsänen 1994, Seagate 1992, Seagate 1992a]. The disk controller can report a disk unit fault immediately to the upper layers when an access to a disk fails. It is also possible that the disk or the disk controller first retries a few times by itself to recover from the problem before reporting the fault to the upper layers (such as the disk driver or the operating system). The SCSI protocol specifies the maximum response time within which the disk must reply to the disk controller [ANSI 1994, Seagate 1992, Seagate 1992a, ANSI 1986]. This time is significantly shorter than the average access time of a disk. In practice, if the disk does not give the initial acknowledgement to a controller request within a fraction of a second, it is considered that an error has occurred and a special error recovery procedure should be started.

Latent disk unit faults can be reduced by polling. This is mainly used for disks that are accessed only very seldom. An array controller polls the disks at a constant interval to check that they are still alive [Kamunen 1996, Räsänen 1996, Hillo 1994]. With a poll interval of the order of seconds, latent disk unit faults can be practically eliminated, as the MTBF of a normal disk is significantly higher (of the order of tens or hundreds of thousands of hours).

In this thesis, the latent disk unit faults are ignored. As it has been shown above, the disk unit faults can be detected so quickly that it is possible to consider that there are no latent disk unit faults. Even in the case of a very large disk array with a hundred disks, a disk unit fault is detected in a few tens of seconds [Räsänen 1996].

6.3.4 Sector fault detection efficiency

The fault detection of an accessed sector is related to two factors: parameter settings in a disk and the success probability of the disk access.

Parameter setting

Depending on the parameter settings, the disk either tries to recover from the media deterioration by itself or reports the fault to the higher levels immediately when the disk access fails for the first time, as described in Chapter 4 [Räsänen 1996, ANSI 1994, Antony 1992, Seagate 1992a]. The number of retries should be limited because a large number of retries would significantly delay other disk requests when a faulty sector is accessed. Because of the response time requirements, the retry count is typically low, one or two retries, but it can be up to ten if the response time is not critical [Räsänen 1996].
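A minimal sketch of such a limited-retry policy is shown below; read_once() is a hypothetical placeholder for a single low-level read attempt and is not an API defined in the thesis.

# A minimal sketch of a limited-retry read policy.
def read_with_retries(read_once, sector, max_retries=2):
    for attempt in range(1 + max_retries):
        ok, data = read_once(sector)    # read_once is a hypothetical callable
        if ok:
            # attempt > 0 means a recoverable error occurred; such events are
            # early warning signs of media deterioration and worth logging.
            return data, attempt
    raise IOError(f"sector {sector}: unrecovered read error, report to upper layers")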

Detection success probability

When a faulty sector is accessed, the fault is detected with a certain probability. Typically, the fault detection probability of a single access is quite close to one, but two special cases should be considered:

• False fault detection: In some cases, the disk may report an error even when the media has no actual problem [ANSI 1994, Seagate 1992a]. The reason for this is typically a transient fault in the disk.

• Undetected faults: Some of the sector faults may not be detected during every access. Early warning signs of potential sector faults are given by recoverable errors; if the disk hides its retries, these warning signs may be missed [Räsänen 1996].

For simplicity, it is assumed in this thesis that the sector faults are detected with 100% probability whenever the sector is accessed. In practice, actively used sectors are accessed so often that their fault detection probability would be close to one in any case. The only problem would be the seldom accessed sectors.

If the fault detection probability (per access) were less than 100%, the sector fault detection rate would be lower and several accesses to the same sector would be required to detect a fault. This would mainly decrease the latent fault detection rate. It should be noted that the sector fault detection rate will be zero in all cases if the sector is not accessed at all.

6.3.5 Other faults

Other faults (such as in the disk controller, main computer, etc.) are ignored in this phase. They will be discussed later in Chapter 10.

6.3.6 Sector fault detection rate

There are two ways to detect sector faults: by regular user disk requests and by scanning.

6.3.6.1 Detection with user disk requests

The regular user disk accesses can also be used for detecting latent sector faults. While the user disk requests are accessing the disk space, they are simultaneously reporting the status of those sectors. If a problem is found, it can be fixed using the sector repair technique and redundant information on the other disks as described in Chapter 4 [Scritsmier 1996, ANSI 1994, Räsänen 1994, Platt 1992, ANSI 1986].

It should be noted that only read requests may detect a latent fault [Räsänen 1996, Scritsmier 1996]. When data is written to the disk, the old content of the sector has no value (except in the RAID-5 case). If write requests were used for detecting sector faults, a read would be required after every write operation (using the WRITE AND VERIFY command), causing extra overhead and performance degradation [Räsänen 1996, Scritsmier 1996, ANSI 1994]. Thus, sector fault detection by user requests is a function of the read activity and the distribution of the read requests.

The sector fault detection rate of the user read requests can be expressed as a function

λ_user = f(a_user, d_user, p_d) (38)

where a_user is the user read activity (measured as the number of operations in a unit of time), d_user is the user read distribution, and p_d is the probability of sector fault detection. Parameter a_user describes the total activity of the user disk read requests that are issued to a disk and is directly related to the user activity. Parameter d_user describes the distribution of the user disk requests. When the requests are concentrated on certain disk locations, fewer distinct sectors are actually accessed and therefore fewer sector faults can be detected. The sector fault detection probability is assumed to be one (p_d = 1).

Simplification of the fault detection model

For simplicity, it is assumed that the sector fault detection rate of the user read requests is a constant that depends only on the above mentioned parameters (i.e., user read activity and distribution). This leads to an exponential distribution of the fault detection time. In practice, the fault detection rate depends heavily on the location of the fault as, when the fault falls into a commonly (rarely) accessed sector, the fault detection rate is much higher (lower) than average. The constant fault detection rate is used for simplifying the analytical model that is presented in the following chapter.

6.3.6.2 Detection with scanning disk requests

The read requests of a scanning algorithm are specifically used for detecting latent sector faults. Actually, there is no difference between the user read requests and the scanning read requests from the point of view of a disk if a normal READ command is used [ANSI 1994, Seagate 1992a]. Only the distribution of the read requests is different (i.e., the scanning read requests go through the entire disk periodically).

The sector fault detection rate by the read requests of the scanning algorithm can be expressed as a function

λ_scan = f(a_scan, d_scan) (39)

where a_scan describes the activity of the scanning read requests that are issued to a disk (measured as the number of operations in a unit of time) and d_scan describes the distribution of the scanning disk requests. The scanning requests are distributed evenly over the disk space so that all sectors are accessed in every scanning cycle.

The average sector fault detection time (by the scanning requests) is half of the time it takes to scan the entire disk. For example, if the disk is scanned through every 24 hours, a sector fault is detected by the scanning algorithm on average within 12 hours of its occurrence.

Simplification of the fault detection model

For simplicity, it is assumed that the sector fault detection rate of the scanning read requests is a constant that depends only on the scanning read activity. The constant sector fault detection rate leads to an exponential distribution of the detection time (i.e., there is a non-zero probability that the sector fault detection takes much longer than one scanning cycle). This assumption underestimates the reliability. However, the constant fault detection rate is used for simplifying the analytical model that is presented in the following chapter.
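The following sketch illustrates the difference between the two models with assumed values: detection uniformly within one 24-hour scan cycle versus an exponential detection time with the same 12-hour mean.

# A minimal sketch comparing the deterministic scanning model with the
# constant-rate (exponential) approximation used above.
import random

random.seed(1)
cycle_h = 24.0
mean_h = cycle_h / 2.0
n = 100_000
uniform = [random.uniform(0.0, cycle_h) for _ in range(n)]          # actual scanning
exponential = [random.expovariate(1.0 / mean_h) for _ in range(n)]  # approximation
late = sum(t > cycle_h for t in exponential) / n
print(f"means: {sum(uniform)/n:.2f} h vs {sum(exponential)/n:.2f} h")
print(f"P(detection later than one scan cycle) under the approximation: {late:.1%}")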

6.3.6.3 Combined sector fault detection rate

The sector fault can be detected either by a user disk request or a scanning disk request. Thus, the combined fault detection rate can be expressed as follows

λ_detect = λ_user + λ_scan. (40)

In practice, the combined sector fault detection rate is dominated by the scanning process, as it accesses all sectors in a matter of hours, while it may take several days or even weeks for the user disk requests to access all sectors.
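As a small numerical illustration of equation (40) with assumed rates (scanning covering the disk once a day, user reads alone covering it in about a week):

# A minimal sketch of the combined detection rate (40); both rates are assumptions.
lam_scan = 1.0 / 12.0          # 1/h, scanning: 24 h cycle, 12 h mean detection
lam_user = 1.0 / 168.0         # 1/h, assumed mean detection by user reads alone
lam_detect = lam_user + lam_scan
print(f"combined mean detection time = {1.0 / lam_detect:.1f} h (scanning dominated)")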

6.3.7 User access patterns

A user access pattern affects the detection rate of the latent sector faults. Typically, it is assumed that the user access pattern is uniformly distributed over the entire disk space [Hou 1994, Seltzer 1992, Miller 1991, Reddy 1991a, Chen 1990, Reddy 1990a, Seltzer 1990a, Olson 1989, Bhide 1988, Ousterhout 1988]. The uniform access pattern is used for simplifying the performance analysis. From the point of view of the reliability analysis, the access pattern has traditionally been considered to be insignificant as the normal reliability analysis approach does not consider sector faults but only disk unit faults.

A practical disk access pattern typically differs significantly from any simple mathematical model. An example of such an access pattern is illustrated in Figure 18 [Kari 1992]. The characteristics of these access patterns are high peaks in certain areas and no accesses at all to other areas. Similar observations have also been made in [Mourad 1993].

Actual user access patterns also have very high data locality, where the next disk access is close to the previous one (e.g., a read followed by a write of the same location) [Hou 1994]. This has a major effect on the performance, but it also affects the reliability, as the mechanical parts wear less when the seeks are generally shorter. On the other hand, the latent sector fault detection rate is much lower, as fewer distinct sectors are accessed.


Figure 18: Example of an actual disk access pattern (the density function)

This thesis uses four user access patterns, as listed in Table 6. These patterns represent various concentrations of the user requests. The uniform access pattern provides a reference model where the user accesses are spread evenly over the disk space, and it therefore has the highest detection rate for latent sector faults, while the other access patterns concentrate more and more of the requests into fewer and fewer sectors. Here, the common 80/20 rule (i.e., the so-called Zipf's law) is used as a basis [Bhide 1988]. Hence, the Single-80/20, Double-80/20, and Triple-80/20 access patterns aim to represent practical user access patterns more accurately than the conventional uniform access pattern [Kari 1992, Bhide 1988]. A similar access pattern division has been used earlier with a 70/30 rule instead of 80/20 [Gibson 1991, Kim 1987].

Practical access patterns probably fall between Double-80/20 and Triple-80/20. For example, in Triple-80/20, 0.8% of a modern 1 GB hard disk is 8 MB of disk space, which easily covers disk index tables, directory entries, and database index tables.

Table 6: Four distributions of the access patterns used in the analysis

Type of access pattern / Request distribution (b_i of the requests falls into c_i of the area):

Uniform:       100% of requests fall evenly over 100% of the area.

Single-80/20:  20% of requests fall into 80% of the area; 80% of requests fall into 20% of the area.

Double-80/20:  4% of requests fall into 64% of the area; 32% of requests fall into 32% of the area; 64% of requests fall into 4% of the area.

Triple-80/20:  0.8% of requests fall into 51.2% of the area; 9.6% of requests fall into 38.4% of the area; 38.4% of requests fall into 9.6% of the area; 51.2% of requests fall into 0.8% of the area.

The uniform distribution and the various 80/20 distributions are illustrated in Figure 19 as a function of the disk space. For example, 90% of all disk requests fall in the disk space area that represents about 90%, 60%, 30%, and 11% of all disk sectors in Uniform, Single-80/20, Double-80/20, and Triple-80/20 access patterns, respectively.

The estimated coverage (i.e., the number of distinct accessed sectors divided by the total number of sectors on the disk) can be expressed with the following equations for the Uniform, Single-80/20, Double-80/20, and Triple-80/20 access patterns, respectively [Laininen 1995]:

C_Uniform = 1 - (1 - 1/S)^S_a, (41)

C_Single = Σ_{i=1..2} c_i·(1 - (1 - b_i/(c_i·S))^S_a), (42)


Figure 19. Distribution function of the different access patterns as a function of disk space

C_Double = Σ_{i=1..3} c_i·(1 - (1 - b_i/(c_i·S))^S_a), (43)

and

C_Triple = Σ_{i=1..4} c_i·(1 - (1 - b_i/(c_i·S))^S_a). (44)

where S_a is the total number of requested sectors and S is the total number of sectors in the disk, while b_i and c_i are specified in Table 6. By applying the values from Table 6, the following results are obtained with the numerical values:

C_Uniform ≈ 1 - e^(-S_a/S), (45)

C_Single ≈ 0.8·(1 - e^(-0.25·S_a/S)) + 0.2·(1 - e^(-4·S_a/S)), (46)

C_Double ≈ 0.64·(1 - e^(-0.0625·S_a/S)) + 0.32·(1 - e^(-S_a/S)) + 0.04·(1 - e^(-16·S_a/S)), (47)

and

C_Triple ≈ 0.512·(1 - e^(-0.015625·S_a/S)) + 0.384·(1 - e^(-0.25·S_a/S)) + 0.096·(1 - e^(-4·S_a/S)) + 0.008·(1 - e^(-64·S_a/S)). (48)


Figure 20. Percentage of all sectors accessed as a function of the total number of accessed sectors

The estimated coverage of user access patterns is illustrated in Figure 20 as a function of the relative number of accessed sectors. The more uneven the access pattern, the larger the number of accesses needed to achieve the same coverage. For example, 90% of the disk space is accessed using (on average) 2.3, 8.3, 30, and 105 times the number of sectors in the disk when the access pattern is Uniform, Single-80/20, Double-80/20, and Triple-80/20, respectively. Hence, the sector fault detection rate (by the user disk accesses) depends heavily on the access pattern.
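The following sketch evaluates the coverage approximations (45)-(48) and searches for the relative number of accesses needed for 90% coverage; the resulting values approximately reproduce the 2.3, 8.3, 30, and 105 figures quoted above.

# A minimal sketch evaluating the coverage approximations (45)-(48).
import math

PATTERNS = {                     # (b_i, c_i) pairs from Table 6
    "Uniform":      [(1.0, 1.0)],
    "Single-80/20": [(0.2, 0.8), (0.8, 0.2)],
    "Double-80/20": [(0.04, 0.64), (0.32, 0.32), (0.64, 0.04)],
    "Triple-80/20": [(0.008, 0.512), (0.096, 0.384),
                     (0.384, 0.096), (0.512, 0.008)],
}

def coverage(ratio, pattern):
    # Fraction of distinct sectors accessed after S_a = ratio * S requests.
    return sum(c * (1.0 - math.exp(-b * ratio / c)) for b, c in PATTERNS[pattern])

for name in PATTERNS:
    ratio = 0.0
    while coverage(ratio, name) < 0.90:     # crude linear search for 90 % coverage
        ratio += 0.01
    print(f"{name:13s}: 90 % of the sectors reached at about S_a/S = {ratio:.1f}")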

The estimated coverage is insensitive to the number of sectors in the disk, as illustrated in Table 7. The coverage remains practically the same regardless of the number of sectors in the disk when the S_a/S ratio is kept the same. Thus, the analysis can be done without considering the actual size of the disk, because the ratio of the number of disk requests to the disk capacity (the S_a/S ratio) provides the necessary information. This is very useful, as the results of this analysis can be used for all sizes of disks as long as the access patterns are similar. However, actual access patterns in practice tend to become more and more unevenly distributed as the size of the disk increases.

Table 7: Accuracy of the coverage estimation as a function of the number of sectors in the disk
(coverage / [relative error %])

Access pattern    S_a = 100, S = 100      S_a = 10 000, S = 10 000    S_a = 1 000 000, S = 1 000 000
Uniform           0.633968 [+0.2922%]     0.632139 [+0.0029%]         0.632121 [0%]
Single-80/20      0.373780 [+0.1297%]     0.373301 [+0.0013%]         0.373296 [0%]
Double-80/20      0.281657 [+0.2145%]     0.281060 [+0.0021%]         0.281054 [0%]
Triple-80/20      0.195353 [+0.1194%]     0.195122 [+0.0012%]         0.195120 [0%]

Access pattern    S_a = 500, S = 100      S_a = 50 000, S = 10 000    S_a = 5 000 000, S = 1 000 000
Uniform           0.993430 [+0.0169%]     0.993264 [+0.0017%]         0.993262 [0%]
Single-80/20      0.771155 [+0.0465%]     0.770800 [+0.0046%]         0.770796 [0%]
Double-80/20      0.529709 [+0.0188%]     0.529611 [+0.0019%]         0.529610 [0%]
Triple-80/20      0.416635 [+0.0420%]     0.416461 [+0.0042%]         0.416460 [0%]
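The coverage values of Table 7 can be reproduced from the exact form of equations (41)-(44), for example as follows:

# A minimal sketch reproducing the coverage values of Table 7 with the exact
# form C = sum_i c_i * (1 - (1 - b_i/(c_i*S))**S_a).
PATTERNS = {
    "Uniform":      [(1.0, 1.0)],
    "Single-80/20": [(0.2, 0.8), (0.8, 0.2)],
    "Double-80/20": [(0.04, 0.64), (0.32, 0.32), (0.64, 0.04)],
    "Triple-80/20": [(0.008, 0.512), (0.096, 0.384),
                     (0.384, 0.096), (0.512, 0.008)],
}

def coverage(S_a, S, pattern):
    return sum(c * (1.0 - (1.0 - b / (c * S)) ** S_a) for b, c in PATTERNS[pattern])

for S in (100, 10_000, 1_000_000):
    row = "  ".join(f"{name} {coverage(S, S, name):.6f}" for name in PATTERNS)
    print(f"S_a = S = {S:9d}: {row}")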

The mean number of accessed sectors needed to detect a sector fault can be estimated based on equations (41)-(44). For example, the mean number of requests needed to detect a sector fault in the Triple-80/20 access pattern can be obtained from the following equation

N_Triple = ∫_0^∞ (1 - C_Triple(S_a)) dS_a = S · Σ_{i=1..4} c_i^2/b_i (49)

from which the normalized value is obtained by

N_Triple/S = Σ_{i=1..4} c_i^2/b_i ≈ 34.3. (50)

Similarly, it is possible to obtain the normalized mean number of accessed sectors needed to detect a sector fault for the other access patterns. The numerical results are listed in Table 8. In this table, the analytical results are also compared with simulation results.

In practice, this table shows that the user access activity and the scanning algorithm activity can be compared relative to each other independently of the actual number of sectors in the disk. There is no need to consider the actual size of the hard disks. Also, this allows the disk reliability (both the disk unit fault rate and the sector fault rate) to be compared with the user access patterns and the scanning algorithm without any actual knowledge of the size of the disks.

Table 8: Estimated relative number of sectors to be accessed to detect a sector fault
(relative number of accessed sectors to detect a sector fault, analytical and simulation results)

Access pattern       Analytical result   Simulation result   Error between the results
Scanning algorithm   …                   -                   -
Uniform              …                   …                   +0.319%
Single-80/20         …                   …                   +0.543%
Double-80/20         …                   …                   -0.630%
Triple-80/20         …                   …                   +0.132%
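The analytical entries of Table 8 can be estimated as sketched below, assuming that the fault lies on a uniformly chosen sector, so that the normalized mean number of accesses is Σ c_i^2/b_i for the user patterns and 0.5 for the scanning algorithm; the printed figures are derived here and are not copied from the thesis.

# A minimal sketch estimating the normalized mean number of accessed sectors
# needed to detect a sector fault under the stated assumption.
PATTERNS = {
    "Scanning":     None,                 # whole disk read once per cycle
    "Uniform":      [(1.0, 1.0)],
    "Single-80/20": [(0.2, 0.8), (0.8, 0.2)],
    "Double-80/20": [(0.04, 0.64), (0.32, 0.32), (0.64, 0.04)],
    "Triple-80/20": [(0.008, 0.512), (0.096, 0.384),
                     (0.384, 0.096), (0.512, 0.008)],
}

for name, pattern in PATTERNS.items():
    mean = 0.5 if pattern is None else sum(c * c / b for b, c in pattern)
    print(f"{name:13s}: {mean:7.3f} x the number of sectors on the disk")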

6.3.8 Conclusions for assumptions

The following list summarizes the conclusions for the assumptions.

• It is possible to approximate a non-steady-state Markov model in two phases when the repair rates are significantly higher than the failure rates.

• Disks are assumed to have two fault types: disk unit faults and sector faults.

• Disk unit and sector failure rates are constant.

• Repair rates of disk unit and sector faults are constant.

• After a disk unit fault, the next disk operation detects the fault in a matter of seconds.

• After a sector fault, the next read request to that sector detects the fault; this may take a long time.

• Sector faults are independent of the usage of the disk (i.e., reading from or writing to a disk does not deteriorate the disk).

• User disk requests do not access the disk evenly. Four different user access patterns are used: Uniform, Single-80/20, Double-80/20, and Triple-80/20.

• Reliability analysis is independent of the actual size of disks.
