7. NOVEL RELIABILITY MODELING APPROACHES

In this chapter, two new enhanced reliability models are built that also include sector faults. The first reliability model (Enhanced Markov Model 1, EMM1) is based on the hot swap principle, while the second reliability model (Enhanced Markov Model 2, EMM2) is based on the hot spare principle [Chandy 1993, Hillo 1993, RAB 1993, Shooman 1968]. The former model is analyzed both analytically (referred to as EMM1) and approximately (referred to as EMM1A), while the latter model is analyzed only with approximations (referred to as EMM2A).

7.1 Simplifications of Markov models

The Markov models that are used for the enhanced analysis have the following simplifications:
• Exponential repair times;
• Exponential failure times;
• Only zero, one, or two simultaneous sector faults (sector faults that occur at other disk addresses than the first one are ignored); and
• Only zero, one, or two simultaneous disk faults.

These assumptions are made to simplify the models so that an analytical approach can be used. With these simplifications, the number of states in the Markov models can be reduced significantly. This is important because even a simple non-steady state Markov model complicates the analytical approach radically, as will be seen later in this chapter.

7.2 Markov models

Three reliability models are used in this chapter: a traditional reliability model and two enhanced models.

7.2.1 Traditional reliability model

The traditional Markov model (TMM) used for analyzing conventional disk arrays was presented in the previous chapter in Figure 14. Equations (18), (19), (20), and (21) express MTTDL and the mission success probabilities for 1, 3, and 10 year missions, respectively.

7.2.2 Enhanced reliability model with no on-line spare disks

The first enhanced Markov model (EMM1) of disk arrays is illustrated in Figure 21.
Here, the model is derived from TMM, which contains only three states, as stated for example in [Geist 1993, Gibson 1991].

Failure models

In EMM1, there are two different fault types (sector faults and disk unit faults), each having its own state in the Markov model. Only when there are at least two faults at the same time in the same disk group does the disk array lose its consistency, and data is then lost. There are four alternative scenarios in which the consistency can be lost:
• After a disk unit fault, a second disk unit fault (in the same disk group) occurs before the repair process of the first disk is completed; or
• After a disk unit fault, a sector fault (in the same disk group) occurs on any sector before the disk repair process is completed; or
• After a sector fault, any other disk (in the same disk group) fails before the sector fault has been detected and repaired; or
• After a sector fault, any other disk (in the same disk group) also has a sector fault at the corresponding sector before the first sector fault has been detected and repaired.

Transition rules

The transition rules of the Markov model for EMM1 are illustrated in Figure 21 and can be expressed as follows:
• The system moves from the fault-free state ( ) to the sector fault state ( ) if any of the sectors in any of the disks becomes faulty (with total rate );
• The system moves back from the sector fault state ( ) to the fault-free state ( ) when the faulty sector is detected and repaired (with rate );
• The system moves from the sector fault state ( ) to the disk fault state ( ) if a disk fault occurs at the same disk as the sector fault (with rate );
• The system moves from the sector fault state ( ) to the data loss state ( ) if there is a sector fault at the corresponding sector or a disk unit fault in any disk other than the one that has the sector fault (with total rate );
• The system moves from the fault-free state ( ) to the disk fault state ( ) if any of the disks becomes faulty (with rate );
• The system returns from the disk fault state ( ) to the fault-free state ( ) when the faulty disk is replaced and repaired (with rate ); and
• The system moves from the disk fault state ( ) to the data loss state ( ) if there is another disk unit fault or a sector fault on any of the remaining disks (with rate ).

Figure 21. Markov model for EMM1 ( indicates the probability of the system being at state xy at time t, where x defines the number of faulty disk units and y defines the number of faulty sectors in the disk array. indicates the probability of data loss. The rest of the parameters are defined in Table 9.)

In EMM1 as illustrated in Figure 21, indicates the probability of the system being at state xy (where x is the number of faulty disks and y is the number of faulty sectors in the system) at time t and is the probability of data loss due to two (or more) simultaneous faults. The other parameters are listed in Table 9.

Table 9. Parameters for EMM1, EMM1A, and EMM2A
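
The transition rules of EMM1 can also be explored numerically. The following is a minimal Python sketch, not the Laplace-based derivation used in this chapter: it encodes the seven transitions listed above and computes MTTDL as the mean time to absorption in the data loss state, starting from the fault-free state. All rate names and values in `RATES` are hypothetical placeholders chosen for illustration; they are not the parameters of Table 9.

```python
# Hypothetical transition rates (per hour). The names and values are
# illustrative assumptions; the actual parameters are those of Table 9.
RATES = {
    "sector_fault": 1e-4,     # fault-free  -> sector fault
    "sector_repair": 1e-2,    # sector fault -> fault-free
    "same_disk_fault": 1e-6,  # sector fault -> disk fault (same disk fails)
    "sector_to_loss": 1e-5,   # sector fault -> data loss
    "disk_fault": 1e-5,       # fault-free  -> disk fault
    "disk_repair": 1e-1,      # disk fault  -> fault-free
    "disk_to_loss": 1e-4,     # disk fault  -> data loss
}


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def emm1_mttdl(r):
    """Mean time to data loss from the fault-free state.

    Transient states in order: fault-free, sector fault, disk fault.
    The mean absorption times t_i satisfy
        out_i * t_i - sum_j q_ij * t_j = 1,
    where out_i is the total rate leaving state i (including data loss).
    """
    out_free = r["sector_fault"] + r["disk_fault"]
    out_sector = r["sector_repair"] + r["same_disk_fault"] + r["sector_to_loss"]
    out_disk = r["disk_repair"] + r["disk_to_loss"]
    a = [
        [out_free, -r["sector_fault"], -r["disk_fault"]],
        [-r["sector_repair"], out_sector, -r["same_disk_fault"]],
        [-r["disk_repair"], 0.0, out_disk],
    ]
    return solve_linear(a, [1.0, 1.0, 1.0])[0]


print("MTTDL with the hypothetical rates [h]:", emm1_mttdl(RATES))
```

Setting `sector_fault` to zero removes the sector fault path, so the sketch then reduces to a three-state model of the TMM kind (fault-free, disk fault, data loss), which gives a simple consistency check on the linear system.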

The transition state equations of EMM1 can be expressed as: where the initial conditions are (55) and . (56)

Equations (51)-(54) can then be solved using the Laplace transformation with the help of equations (55) and (56). First, the equations are moved from the t-state to the s-state as follows
and (64) where (65) and (i=0, 1, or 2) are the three roots of the following equation . (66)

Term can be used for expressing the total reliability of EMM1.

MTTDL of EMM1

MTTDL of EMM1 can be expressed as (67) when .

Mission success probabilities of EMM1

Similarly, the mission success probabilities for the one, three, and ten year missions of EMM1 can be expressed as , (68) , (69) and . (70)

7.2.2.1 Approximation of EMM1

The Markov model illustrated in Figure 21 can be approximated using the same simplification logic that was explained in the previous chapter. This approximation model is called EMM1A. The process is done in two phases as illustrated in Figure 22: steady state simplification (A) and transient state analysis (B).

Figure 22. Two phase approximation of EMM1A ( indicates the approximation of the probability of the system being at state xy, where x defines the number of faulty disk units and y defines the number of faulty sectors in the disk array. indicates the probability of data loss. The rest of the parameters are defined in Table 9.)

Steady state simplification

The steady state equations for EMM1A are expressed as: , (71) , (72) and (73) while . (74)

Solving the above equations (71)-(74) in the steady state leads to the following probabilities of the system being in different states: , (75) , (76) and (77) where . (78)

Transient state analysis

As and , we get an approximation for the failure rate of the EMM1A model as follows . (79)

From this we get for MTTDL . (80)

Similarly, the mission success probabilities are expressed as

7.2.3 Enhanced reliability model with one on-line spare disk

The second enhanced Markov model (EMM2) of a disk array is illustrated in Figure 23. Here, the model has one spare disk that is used for quick repair of disk unit faults. The first disk repair can be started immediately after the fault detection. The second disk fault can be repaired after the spare disk is replaced. It has been shown that one spare disk is quite sufficient for a disk array [Gibson 1991]. This Markov model is analyzed only using an approximation due to the complexity of the model. The same approach is used here as with EMM1A.

Failure models

In EMM2, there are three different fault types (sector, active disk unit, and spare disk unit faults). The states are divided so that a spare disk unit fault can exist at the same time as a sector fault or an active disk unit fault. Only when there are at least two faults at the same time in the same disk group of the active disks does the disk array lose its consistency, and data is then lost. There are four alternative scenarios in which the consistency can be lost:
• After an active disk unit fault, a second active disk unit fault (in the same disk group) occurs before the repair process of the first disk is completed; or
• After an active disk unit fault, a sector fault (in the same disk group) occurs on any sector of the active disks before the disk repair process is completed; or
• After a sector fault of an active disk, any other active disk (in the same disk group) fails before the sector fault has been detected and repaired; or
• After a sector fault of an active disk, any other active disk (in the same disk group) also has a sector fault at the corresponding sector before the first sector fault has been detected and repaired.

Transition rules

The transition rules of EMM2 illustrated in Figure 23 are:
• The system moves from the fault-free state ( ) to the sector fault state ( ) when any of the sectors in any of the active disks becomes faulty (with total rate );
• The system moves back from the sector fault state ( ) to the fault-free state ( ) when the faulty sector is detected and repaired (with rate );
• The system moves from the sector fault state ( ) to the active disk fault state ( ) when a disk unit fault occurs at the same disk as the sector fault (with rate );
• The system moves from the sector fault state ( ) to the spare disk and sector fault state ( ) when the spare disk becomes faulty (with rate );
• The system moves back from the spare disk and sector fault state ( ) to the sector fault state ( ) when the spare disk unit fault is detected and a new spare disk is installed (with rate );
• The system moves from the sector fault state ( ) to the data loss state ( ) when there is a sector fault at a corresponding sector or a disk unit fault in any other active disk than the one that has the sector fault (with total rate );
• The system moves from the fault-free state ( ) to the active disk fault state ( ) when any of the active disks becomes faulty (with total rate );
• The system moves from the active disk fault state ( ) to the spare disk fault state ( ) when the faulty disk is logically replaced with the on-line spare disk and data is reconstructed to that disk (with rate );
• The system moves from the active disk fault state ( ) to the data loss state ( ) when there is another disk unit fault or any sector fault on the active disks (with total rate );
• The system moves from the fault-free state ( ) to the spare disk fault state ( ) when the spare disk becomes faulty (with rate );
• The system moves back from the spare disk fault state ( ) to the fault-free state ( ) when the spare disk fault is detected and a new spare disk is installed (with rate );
• The system moves from the spare disk fault state ( ) to the spare disk and sector fault state ( ) when any of the sectors in any of the active disks becomes faulty (with total rate );
• The system moves back from the spare disk and sector fault state ( ) to the spare disk fault state ( ) when the faulty sector is detected and repaired (with rate );
• The system moves from the spare disk and sector fault state ( ) to the spare disk and active disk fault state ( ) when a disk fault occurs at the same disk as the sector fault (with rate );
• The system moves from the spare disk and sector fault state ( ) to the data loss state ( ) when there is a sector fault at a corresponding sector or a disk unit fault in any other active disk than the one that has the sector fault (with total rate );
• The system moves from the spare disk fault state ( ) to the spare disk and active disk fault state ( ) when any of the active disks becomes faulty (with total rate );
• The system moves from the active disk fault state ( ) to the spare disk and active disk fault state ( ) when the spare disk becomes faulty during the disk array repair process (with rate );
• The system moves back from the spare disk and active disk fault state ( ) to the active disk fault state ( ) when a new spare disk is installed in the array (with rate ); and
• The system moves from the spare disk and active disk fault state ( ) to the data loss state ( ) when there is another disk unit fault or any sector fault on the active disks (with total rate ).

In the model illustrated in Figure 23, indicates the probability of the system being at state w,x,y (where w is the number of faulty spare disks, x is the number of faulty disks, and y is the number of faulty sectors in the system) and is the probability of data loss due to two (or more) simultaneous faults in the active disks. The other parameters are listed in Table 9.

Steady state simplification

The approximation of EMM2 is done in two phases, of which the steady state part is illustrated in Figure 24.
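
The two-phase idea can be sketched numerically on the EMM2 transition structure listed above. The Python sketch below rests on stated assumptions: phase A ignores the transitions to the data loss state and solves the steady state probabilities of the six remaining states, and phase B weights each state's data loss rate by its steady state probability to obtain one constant failure rate, from which MTTDL and an exponential mission success probability follow. The state labels follow the w,x,y convention of the text. All rate names and values are hypothetical, and reusing the same rate for analogous transitions is an assumption, not something taken from Table 9.

```python
import math

# Hypothetical rates per hour (illustrative assumptions, not Table 9 values).
SECTOR_FAULT, SECTOR_REPAIR = 1e-4, 1e-2
DISK_FAULT, SAME_DISK_FAULT, REBUILD_TO_SPARE = 1e-5, 1e-6, 1e-1
SPARE_FAULT, SPARE_REPLACE = 1e-5, 1e-2
LOSS_AFTER_SECTOR, LOSS_AFTER_DISK = 1e-5, 1e-4

# State label w,x,y: faulty spare disks, faulty active disks, faulty sectors.
STATES = ["000", "001", "010", "100", "101", "110"]

# Transitions between non-failed states (phase A).
EDGES = {
    ("000", "001"): SECTOR_FAULT, ("001", "000"): SECTOR_REPAIR,
    ("001", "010"): SAME_DISK_FAULT, ("001", "101"): SPARE_FAULT,
    ("101", "001"): SPARE_REPLACE, ("000", "010"): DISK_FAULT,
    ("010", "100"): REBUILD_TO_SPARE, ("000", "100"): SPARE_FAULT,
    ("100", "000"): SPARE_REPLACE, ("100", "101"): SECTOR_FAULT,
    ("101", "100"): SECTOR_REPAIR, ("101", "110"): SAME_DISK_FAULT,
    ("100", "110"): DISK_FAULT, ("010", "110"): SPARE_FAULT,
    ("110", "010"): SPARE_REPLACE,
}

# Transitions into the data loss state (used only in phase B).
LOSS = {"001": LOSS_AFTER_SECTOR, "101": LOSS_AFTER_SECTOR,
        "010": LOSS_AFTER_DISK, "110": LOSS_AFTER_DISK}


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def steady_state(states, edges):
    """Phase A: balance equations, with one replaced by sum(p) = 1."""
    n = len(states)
    idx = {s: i for i, s in enumerate(states)}
    a = [[0.0] * n for _ in range(n)]
    for (src, dst), rate in edges.items():
        a[idx[dst]][idx[src]] += rate  # probability flow into dst
        a[idx[src]][idx[src]] -= rate  # probability flow out of src
    b = [0.0] * n
    a[n - 1] = [1.0] * n  # normalization replaces a redundant equation
    b[n - 1] = 1.0
    return dict(zip(states, solve_linear(a, b)))


def approx_failure_rate(pi, loss):
    """Phase B: data loss rates weighted by steady state probabilities."""
    return sum(pi[state] * rate for state, rate in loss.items())


def mission_success(rate, hours):
    """Survival probability, assuming a constant failure rate."""
    return math.exp(-rate * hours)


pi = steady_state(STATES, EDGES)
rate = approx_failure_rate(pi, LOSS)
print("approximate MTTDL [h]:", 1.0 / rate)
print("1-year mission success:", mission_success(rate, 8760.0))
```

The approximation is reasonable only when the repair rates dominate the fault rates, so that the system spends nearly all of its time near the fault-free state; the hypothetical values above were chosen to satisfy that condition.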
The approximation of this model is called EMM2A. The steady state equations can be expressed as follows:

Figure 23. Markov model for EMM2 ( indicates the probability of the system being at state w,x,y at time t, where w defines the number of faulty spare disk units, x defines the number of faulty disk units, and y defines the number of faulty sectors in the disk array. indicates the probability of data loss. The rest of the parameters are defined in Table 9.)

where the initial condition is . (90)

Equations (84)-(89) can then be solved with the help of equation (90). Thus, the probabilities are , (91) , (92) , (93) , (94) , (95) and (96) where . (97)

Figure 24. Steady state part of the Markov model of EMM2A ( indicates the approximation of the probability of the system being at state w,x,y, where w defines the number of faulty spare disk units, x defines the number of faulty disk units, and y defines the number of faulty sectors in the disk array. The rest of the parameters are defined in Table 9.)

Transient state analysis

As , , , and , we get the failure rate for the approximation, based on Figure 25, as follows (98) from which we get for MTTDL . (99)

Figure 25. Transient state part of the Markov model of EMM2A ( indicates the approximation of the probability of the system being at state w,x,y, where w defines the number of faulty spare disk units, x defines the number of faulty disk units, and y defines the number of faulty sectors in the disk array. indicates the probability of data loss. The rest of the parameters are defined in Table 9.)

The mission success probabilities are then expressed as