7. NOVEL RELIABILITY MODELING APPROACHES

In this chapter, two new enhanced reliability models are built that also include sector faults. The first reliability model (Enhanced Markov Model 1, EMM1) is based on the hot swap principle, while the second reliability model (Enhanced Markov Model 2, EMM2) is based on the hot spare principle [Chandy 1993, Hillo 1993, RAB 1993, Shooman 1968]. The former model is analyzed both analytically (referred to as EMM1) and approximately (referred to as EMM1A), while the latter model is analyzed only approximately (referred to as EMM2A).

7.1 Simplifications of Markov models

The Markov models that are used for the enhanced analysis have the following simplifications:

• Exponential repair times;

• Exponential failure times;

• Only zero, one, or two simultaneous sector faults (sector faults that occur at disk addresses other than that of the first fault are ignored); and

• Only zero, one, or two simultaneous disk faults.

These assumptions are made to simplify the models so that an analytical approach can be used. With these simplifications, the number of states in the Markov models is reduced significantly. This is important because even a simple non-steady state Markov model complicates the analytical approach radically, as will be seen later in this chapter.

7.2 Markov models

Three reliability models are used in this chapter: a traditional reliability model and two enhanced models.

7.2.1 Traditional reliability model

The traditional Markov model (TMM) used for analyzing conventional disk arrays was presented in the previous chapter in Figure 14. Equations (18), (19), (20), and (21) express MTTDL and mission success probabilities for 1, 3, and 10 year missions, respectively.

7.2.2 Enhanced reliability model with no on-line spare disks

The first enhanced Markov model (EMM1) of disk arrays is illustrated in Figure 21. The model is derived from TMM, which contains only three states as described, for example, in [Geist 1993, Gibson 1991].

Failure models

In EMM1, there are two different fault types (sector faults and disk unit faults), each having its own state in the Markov model. Only when there are at least two faults at the same time in the same disk group does the disk array lose its consistency, and data is then lost. Consistency can be lost in four alternative scenarios:

• After a disk unit fault, a second disk unit fault (in the same disk group) occurs before the repair process of the first disk is completed; or

• After a disk unit fault, a sector fault (in the same disk group) occurs on any sector before the disk repair process is completed; or

• After a sector fault, any other disk (in the same disk group) fails before the sector fault has been detected and repaired; or

• After a sector fault, any other disk (in the same disk group) also has a sector fault at the corresponding sector before the first sector fault has been detected and repaired.

Transition rules

The transition rules of the Markov model for EMM1 are illustrated in Figure 21 and can be expressed as follows:

• The system moves from the fault-free state (p_00) to the sector fault state (p_01) if any of the sectors in any of the disks becomes faulty (with total rate Undisplayed Graphic);

• The system moves back from the sector fault state (p_01) to the fault-free state (p_00) when the faulty sector is detected and repaired (with rate Undisplayed Graphic);

• The system moves from the sector fault state (p_01) to the disk fault state (p_10) if a disk fault occurs at the same disk as the sector fault (with rate Undisplayed Graphic);

• The system moves from the sector fault state (p_01) to the data loss state (p_f) if there is a sector fault at the corresponding sector or a disk unit fault in any disk other than the one that has the sector fault (with total rate Undisplayed Graphic);

• The system moves from the fault-free state (p_00) to the disk fault state (p_10) if any of the disks becomes faulty (with rate Undisplayed Graphic);

• The system returns from the disk fault state (p_10) to the fault-free state (p_00) when the faulty disk is replaced and repaired (with rate Undisplayed Graphic); and

• The system moves from the disk fault state (p_10) to the data loss state (p_f) if there is another disk unit fault or a sector fault on any of the remaining disks (with rate Undisplayed Graphic).

Figure 21. Markov model for EMM1 (p_xy(t) indicates the probability of the system being at state xy at time t, where x defines the number of faulty disk units and y defines the number of faulty sectors in the disk array. p_f(t) indicates the probability of data loss. The rest of the parameters are defined in Table 9.)

In EMM1 as illustrated in Figure 21, p_xy(t) indicates the probability of the system being at state xy (where x is the number of faulty disks and y is the number of faulty sectors in the system) at time t, and p_f(t) is the probability of data loss due to two (or more) simultaneous faults. The other parameters are listed in Table 9.
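The seven transition rules above can be turned into a small numerical check. The following is a minimal sketch, not the analytical solution derived in this chapter: it computes the mean time to data loss of the four-state chain by solving the linear equations for the expected time to absorption. The state labels "00", "01", "10", and "f", the function names, and all rate values are assumptions made for illustration only; the actual rates depend on the parameters of Table 9.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


# State "xy": x faulty disks, y faulty sectors; "f" is the data loss state.
TRANSIENT = ["00", "01", "10"]


def emm1_mttdl(rates):
    """Mean time to data loss when starting from the fault-free state "00".

    rates maps (from_state, to_state) to a transition rate.
    """
    a = []
    for i in TRANSIENT:
        out = sum(rate for (src, _dst), rate in rates.items() if src == i)
        # Expected time to absorption T_i: out_i*T_i - sum_j rate_ij*T_j = 1
        a.append([(out if i == j else 0.0) - rates.get((i, j), 0.0)
                  for j in TRANSIENT])
    return solve_linear(a, [1.0] * len(TRANSIENT))[0]


# Hypothetical rates per hour, chosen only to exercise the code.
example_rates = {
    ("00", "01"): 1e-3,  # any sector in any disk becomes faulty
    ("01", "00"): 1e-1,  # faulty sector detected and repaired
    ("01", "10"): 1e-6,  # disk fault on the disk that has the sector fault
    ("01", "f"): 5e-6,   # second fault on another disk
    ("00", "10"): 6e-6,  # any disk becomes faulty
    ("10", "00"): 1e-1,  # faulty disk replaced and repaired
    ("10", "f"): 1e-3,   # second disk fault or any sector fault
}
print(f"MTTDL = {emm1_mttdl(example_rates):.3e} hours")
```

Passing the rates as a dictionary keyed by state pairs keeps the sketch independent of how each aggregate rate is composed from the disk and sector parameters.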

Table 9. Parameters for EMM1, EMM1A, and EMM2A

Parameter | Parameter description | Comments
D | number of (data) disks in an array | D disks are needed for data consistency
Undisplayed Graphic | number of sectors in a disk | each sector is treated independently
Undisplayed Graphic | disk unit failure rate |
Undisplayed Graphic | disk unit failure rate after the first disk unit fault | this failure rate is greater than (or equal to) the failure rate of the first disk unit fault ( Undisplayed Graphic)
Undisplayed Graphic | sector failure rate |
Undisplayed Graphic | spare disk failure rate | failure rate for an online spare disk
Undisplayed Graphic | disk repair rate | includes both disk unit fault detection and repair time
Undisplayed Graphic | sector repair rate | includes both sector fault detection time and repair time
Undisplayed Graphic | spare disk repair rate | includes delayed spare disk fault detection time, new disk ordering, and disk replacement time
Undisplayed Graphic | disk repair rate when spare disk is missing | includes spare disk fault detection time, new disk ordering, and disk replacement time

The transition state equations of EMM1 can be expressed as:

where the initial conditions are

(55)

and

. (56)
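For orientation, the transition rules listed for Figure 21 imply state equations of the following general form. This is a sketch under stated assumptions rather than a reproduction of the numbered equations: the letter p for the state probabilities and the aggregate rates a_{i,j} (the total rate from state i to state j) are placeholder notation introduced here, not symbols from Table 9, and the array is assumed to start in the fault-free state.

```latex
\begin{aligned}
\frac{dp_{00}(t)}{dt} &= -(a_{00,01} + a_{00,10})\,p_{00}(t) + a_{01,00}\,p_{01}(t) + a_{10,00}\,p_{10}(t) \\
\frac{dp_{01}(t)}{dt} &= a_{00,01}\,p_{00}(t) - (a_{01,00} + a_{01,10} + a_{01,f})\,p_{01}(t) \\
\frac{dp_{10}(t)}{dt} &= a_{00,10}\,p_{00}(t) + a_{01,10}\,p_{01}(t) - (a_{10,00} + a_{10,f})\,p_{10}(t) \\
\frac{dp_{f}(t)}{dt} &= a_{01,f}\,p_{01}(t) + a_{10,f}\,p_{10}(t) \\
p_{00}(0) &= 1, \qquad p_{01}(0) = p_{10}(0) = p_{f}(0) = 0
\end{aligned}
```

The right-hand sides sum to zero, so the total probability p_{00}(t) + p_{01}(t) + p_{10}(t) + p_{f}(t) stays equal to one for all t.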

Equations (51)-(54) can then be solved using the Laplace transformation with the help of equations (55) and (56). First, the equations are transformed from the time domain (t) to the Laplace domain (s) as follows

 

and

(64)

where

(65)

and Undisplayed Graphic (i=0, 1, or 2) are the three roots of the following equation

. (66)

Term Undisplayed Graphic can be used for expressing the total reliability of EMM1.

MTTDL of EMM1

MTTDL of EMM1 can be expressed as

(67)

when .
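The link between the reliability of the model and its MTTDL is a standard identity, restated here as a reminder. The symbols R(t) for the probability that no data loss has occurred by time t, and \hat{R}(s) for its Laplace transform, are notation assumed for this sketch, as is the letter p for the state probabilities.

```latex
R(t) = p_{00}(t) + p_{01}(t) + p_{10}(t) = 1 - p_{f}(t),
\qquad
\mathrm{MTTDL} = \int_{0}^{\infty} R(t)\,dt = \lim_{s \to 0} \hat{R}(s),
\qquad
\hat{R}(s) = \int_{0}^{\infty} e^{-st} R(t)\,dt
```

Evaluating the Laplace-domain reliability at s = 0 therefore gives the MTTDL without transforming back to the time domain.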

Mission success probabilities of EMM1

Similarly, the mission success probabilities of EMM1 for one-, three-, and ten-year missions can be expressed as

, (68)

, (69)

and

. (70)
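The mission success probabilities can also be checked numerically. The sketch below is not the closed-form solution of this chapter: it evaluates the probability of no data loss at time t as the first row sum of the matrix exponential of the generator restricted to the three non-failed states, using scaling and squaring with a truncated Taylor series. The function names, the 8760-hour year, and all rate values are assumptions for illustration.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(a, terms=20):
    """exp(a) by scaling and squaring with a truncated Taylor series."""
    n = len(a)
    norm = max(sum(abs(x) for x in row) for row in a)
    squarings = max(0, math.ceil(math.log2(norm)) + 1) if norm > 0 else 0
    scale = 2.0 ** squarings
    scaled = [[x / scale for x in row] for row in a]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term holds scaled**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(rr, tt)]
                  for rr, tt in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def mission_success(q_transient, hours):
    """Probability of no data loss within `hours`, starting fault-free.

    q_transient is the generator restricted to the non-failed states, with
    the fault-free state first; entry [i][j] is the rate from i to j and
    each diagonal entry is minus the total outflow (including data loss).
    """
    p = mat_exp([[x * hours for x in row] for row in q_transient])
    return sum(p[0])


HOURS_PER_YEAR = 8760.0  # assumption: 365-day year

# Hypothetical rates per hour; state order is 00, 01, 10.
q = [
    [-(1e-3 + 6e-6), 1e-3, 6e-6],
    [1e-1, -(1e-1 + 1e-6 + 5e-6), 1e-6],
    [1e-1, 0.0, -(1e-1 + 1e-3)],
]
for years in (1, 3, 10):
    print(years, mission_success(q, years * HOURS_PER_YEAR))
```

Because repair rates are several orders of magnitude larger than failure rates, the system is stiff; scaling and squaring handles this with a modest number of matrix products, whereas a fixed-step integrator would need very many small steps.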

7.2.2.1 Approximation of EMM1

The Markov model illustrated in Figure 21 can be approximated using the same simplification logic that was explained in the previous chapter. This approximation model is called EMM1A. The process is done in two phases as illustrated in Figure 22: steady state simplification (A) and transient state analysis (B).

Steady state simplification

The steady state equations for EMM1A are expressed as:

, (71)

, (72)

and

Undisplayed Graphic (73)

while

. (74)

Figure 22. Two phase approximation of EMM1A ( Undisplayed Graphic indicates the approximation of the probability of the system being at state xy, where x defines the number of faulty disk units and y defines the number of faulty sectors in the disk array. Undisplayed Graphic indicates the probability of data loss. The rest of the parameters are defined in Table 9.)

Solving the above equations (71)-(74) in the steady state leads to the following probabilities of the system being in different states:

, (75)

, (76)

and

(77)

where

. (78)

Transient state analysis

As Undisplayed Graphic and Undisplayed Graphic, the failure rate of the EMM1A model can be approximated as follows

. (79)

From this, the MTTDL is obtained as

. (80)
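The two-phase idea can be sketched in a few lines. This is an illustration under assumptions, not the closed-form result above: phase A solves the steady-state balance equations of the three non-failed states with the data loss transitions ignored (all other transitions are assumed to be kept), and phase B weights the two data loss rates by those steady-state probabilities. The function and parameter names are hypothetical aggregate rates, not Table 9 symbols.

```python
def emm1a_mttdl(sector_fault, sector_repair, disk_after_sector,
                loss_from_sector, disk_fault, disk_repair, loss_from_disk):
    """Approximate MTTDL of the EMM1 chain by the two-phase method.

    sector_fault:      rate 00 -> 01
    sector_repair:     rate 01 -> 00
    disk_after_sector: rate 01 -> 10
    loss_from_sector:  rate 01 -> data loss
    disk_fault:        rate 00 -> 10
    disk_repair:       rate 10 -> 00
    loss_from_disk:    rate 10 -> data loss
    """
    # Phase A: steady state without the data loss transitions.
    # Balance at 01: sector_fault*p00 = (sector_repair + disk_after_sector)*p01
    w01 = sector_fault / (sector_repair + disk_after_sector)
    # Balance at 10: disk_fault*p00 + disk_after_sector*p01 = disk_repair*p10
    w10 = (disk_fault + disk_after_sector * w01) / disk_repair
    p00 = 1.0 / (1.0 + w01 + w10)  # normalization: p00 + p01 + p10 = 1
    p01, p10 = w01 * p00, w10 * p00

    # Phase B: data loss rate seen from the quasi-steady state.
    failure_rate = p01 * loss_from_sector + p10 * loss_from_disk
    return 1.0 / failure_rate


# Hypothetical rates per hour, chosen only to exercise the code.
print(emm1a_mttdl(sector_fault=1e-3, sector_repair=1e-1,
                  disk_after_sector=1e-6, loss_from_sector=5e-6,
                  disk_fault=6e-6, disk_repair=1e-1, loss_from_disk=1e-3))
```

The approximation is expected to be close to the exact value only when the repair rates are much larger than the failure rates, which is the regime the conditions preceding equation (79) refer to.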

Similarly, the mission success probabilities are expressed as

 

7.2.3 Enhanced reliability model with one on-line spare disk

The second enhanced Markov model (EMM2) of a disk array is illustrated in Figure 23. Here, the model has one spare disk that is used for quick repair of disk unit faults. The first disk repair can be started immediately after the fault detection. The second disk fault can be repaired after the spare disk is replaced. It has been shown that one spare disk is quite sufficient for a disk array [Gibson 1991].

This Markov model is analyzed only approximately because of its complexity. The same approach is used here as with EMM1A.

Failure models

In EMM2, there are three different fault types (sector, active disk unit, and spare disk unit faults). The states are divided so that a spare disk unit fault can exist at the same time as a sector fault or an active disk unit fault. Only when there are at least two faults at the same time in the same disk group of the active disks does the disk array lose its consistency, and data is then lost. Consistency can be lost in four alternative scenarios:

• After an active disk unit fault, a second active disk unit fault (in the same disk group) occurs before the repair process of the first disk is completed; or

• After an active disk unit fault, a sector fault (in the same disk group) occurs on any sector of active disks before the disk repair process is completed; or

• After a sector fault of an active disk, any other active disk (in the same disk group) fails before the sector fault has been detected and repaired; or

• After a sector fault of an active disk, any other active disk (in the same disk group) also has a sector fault at the corresponding sector before the first sector fault has been detected and repaired.

Transition rules

The transition rules of EMM2 illustrated in Figure 23 are:

• The system moves from the fault-free state (p_000) to the sector fault state (p_001) when any of the sectors in any of the active disks becomes faulty (with total rate Undisplayed Graphic);

• The system moves back from the sector fault state (p_001) to the fault-free state (p_000) when the faulty sector is detected and repaired (with rate Undisplayed Graphic);

• The system moves from the sector fault state (p_001) to the active disk fault state (p_010) when a disk unit fault occurs at the same disk as the sector fault (with rate Undisplayed Graphic);

• The system moves from the sector fault state (p_001) to the spare disk and sector fault state (p_101) when the spare disk becomes faulty (with rate Undisplayed Graphic);

• The system moves back from the spare disk and sector fault state (p_101) to the sector fault state (p_001) when the spare disk unit fault is detected and a new spare disk is installed (with rate Undisplayed Graphic);

• The system moves from the sector fault state (p_001) to the data loss state (p_f) when there is a sector fault at a corresponding sector or a disk unit fault in any active disk other than the one that has the sector fault (with total rate Undisplayed Graphic);

• The system moves from the fault-free state (p_000) to the active disk fault state (p_010) when any of the active disks becomes faulty (with total rate Undisplayed Graphic);

• The system moves from the active disk fault state (p_010) to the spare disk fault state (p_100) when the faulty disk is logically replaced with the on-line spare disk and data is reconstructed to that disk (with rate Undisplayed Graphic);

• The system moves from the active disk fault state (p_010) to the data loss state (p_f) when there is another disk unit fault or any sector fault on the active disks (with total rate Undisplayed Graphic);

• The system moves from the fault-free state (p_000) to the spare disk fault state (p_100) when the spare disk becomes faulty (with rate Undisplayed Graphic);

• The system moves back from the spare disk fault state (p_100) to the fault-free state (p_000) when the spare disk fault is detected and a new spare disk is installed (with rate Undisplayed Graphic);

• The system moves from the spare disk fault state (p_100) to the spare disk and sector fault state (p_101) when any of the sectors in any of the active disks becomes faulty (with total rate Undisplayed Graphic);

• The system moves back from the spare disk and sector fault state (p_101) to the spare disk fault state (p_100) when the faulty sector is detected and repaired (with rate Undisplayed Graphic);

• The system moves from the spare disk and sector fault state (p_101) to the spare disk and active disk fault state (p_110) when a disk fault occurs at the same disk as the sector fault (with rate Undisplayed Graphic);

• The system moves from the spare disk and sector fault state (p_101) to the data loss state (p_f) when there is a sector fault at a corresponding sector or a disk unit fault in any active disk other than the one that has the sector fault (with total rate Undisplayed Graphic);

• The system moves from the spare disk fault state (p_100) to the spare disk and active disk fault state (p_110) when any of the active disks becomes faulty (with total rate Undisplayed Graphic);

• The system moves from the active disk fault state (p_010) to the spare disk and active disk fault state (p_110) when the spare disk becomes faulty during the disk array repair process (with rate Undisplayed Graphic);

• The system moves back from the spare disk and active disk fault state (p_110) to the active disk fault state (p_010) when a new spare disk is installed in the array (with rate Undisplayed Graphic); and

• The system moves from the spare disk and active disk fault state (p_110) to the data loss state (p_f) when there is another disk unit fault or any sector fault on the active disks (with total rate Undisplayed Graphic).

In the model illustrated in Figure 23, p_wxy indicates the probability of the system being at state wxy (where w is the number of faulty spare disks, x is the number of faulty disks, and y is the number of faulty sectors in the system) and p_f is the probability of data loss due to two (or more) simultaneous faults in the active disks. The other parameters are listed in Table 9.
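The nineteen transition rules above fully determine the structure of the chain, so they can be written down as data and checked numerically. The following sketch is only an illustration and is not the approximation used in this chapter: it encodes the transitions as state pairs (state "wxy" as defined above, "f" for data loss) and solves for the mean time to data loss from the fault-free state. The function names and every rate value are hypothetical.

```python
# (from_state, to_state) pairs, in the order the rules are listed.
EMM2_TRANSITIONS = [
    ("000", "001"), ("001", "000"), ("001", "010"), ("001", "101"),
    ("101", "001"), ("001", "f"), ("000", "010"), ("010", "100"),
    ("010", "f"), ("000", "100"), ("100", "000"), ("100", "101"),
    ("101", "100"), ("101", "110"), ("101", "f"), ("100", "110"),
    ("010", "110"), ("110", "010"), ("110", "f"),
]
TRANSIENT = ["000", "001", "010", "100", "101", "110"]


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def emm2_mttdl(rates):
    """Mean time to data loss from state "000"; rates maps pairs to rates."""
    if set(rates) != set(EMM2_TRANSITIONS):
        raise ValueError("rates must cover exactly the EMM2 transitions")
    a = []
    for i in TRANSIENT:
        out = sum(rate for (src, _dst), rate in rates.items() if src == i)
        # Expected time to absorption T_i: out_i*T_i - sum_j rate_ij*T_j = 1
        a.append([(out if i == j else 0.0) - rates.get((i, j), 0.0)
                  for j in TRANSIENT])
    return solve_linear(a, [1.0] * len(TRANSIENT))[0]


# Hypothetical rates per hour: every fault transition 1e-5, every repair 1e-1.
example_rates = {pair: 1e-5 for pair in EMM2_TRANSITIONS}
for pair in [("001", "000"), ("101", "100"),   # sector repaired
             ("100", "000"), ("101", "001"),   # new spare disk installed
             ("010", "100"),                   # data rebuilt onto the spare
             ("110", "010")]:                  # spare installed during repair
    example_rates[pair] = 1e-1
print(f"MTTDL = {emm2_mttdl(example_rates):.3e} hours")
```

Keeping the topology in one list makes it easy to confirm that each rule has exactly one rate attached before any numerical work is done.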

Figure 23. Markov model for EMM2 (p_wxy(t) indicates the probability of the system being at state wxy at time t, where w defines the number of faulty spare disk units, x defines the number of faulty disk units, and y defines the number of faulty sectors in the disk array. p_f(t) indicates the probability of data loss. The rest of the parameters are defined in Table 9.)

Steady state simplification

The approximation of EMM2 is done in two phases, of which the steady state part is illustrated in Figure 24. This approximation is called EMM2A. The steady state equations can be expressed as follows:

where the initial condition is

. (90)

Figure 24. Steady state part of the Markov model of EMM2A ( Undisplayed Graphic indicates the approximation of the probability of the system being at state wxy, where w defines the number of faulty spare disk units, x defines the number of faulty disk units, and y defines the number of faulty sectors in the disk array. The rest of the parameters are defined in Table 9.)

Equations (84)-(89) can then be solved with the help of equation (90). Thus, the probabilities are

, (91)

, (92)

, (93)

, (94)

, (95)

and

(96)

where

. (97)

Transient state analysis

As Undisplayed Graphic, Undisplayed Graphic, Undisplayed Graphic, and Undisplayed Graphic, the failure rate of the approximation, based on Figure 25, is obtained as follows

(98)

from which the MTTDL is obtained as

. (99)

Figure 25. Transient state part of the Markov model of EMM2A ( Undisplayed Graphic indicates the approximation of the probability of the system being at state wxy, where w defines the number of faulty spare disk units, x defines the number of faulty disk units, and y defines the number of faulty sectors in the disk array. Undisplayed Graphic indicates the probability of data loss. The rest of the parameters are defined in Table 9.)

The mission success probabilities are then expressed as

 
