Reliability and Performance of Hierarchical RAID with Multiple Controllers

Sung Hoon Baek, Bong Wan Kim, Eui Joung Joung and Chong Won Park
Dept. of Computer System, Electronics and Telecommunications Research Institute
Yuseong-gu P.O.Box 106, Daejeon, Korea
shbaek|kimbw|euijoung|cwpark@etri.re.kr

(More detailed information on this paper can be obtained from http://moogok.etri.re.kr/shbaek/)

ABSTRACT

Redundant arrays of inexpensive disks (RAID) offer fault tolerance against disk failures. However, a storage system with more disks suffers from lower reliability and performance. A RAID architecture tolerating multiple disk failures shows severe performance degradation in comparison to RAID Level 5 due to the complexity of implementation. We present a new RAID architecture that tolerates at least three disk failures and offers throughput similar to RAID Level 5. We call it the hierarchical RAID; it is hierarchically composed of RAID Levels. Furthermore, we formally derive the mean-time-to-data-loss (MTTDL) of traditional RAID and the hierarchical RAID using Markov processes for a detailed comparison.

Categories and Subject Descriptors

C.4 [Computer Systems Organization]: Performance of Systems - Fault Tolerance

General Terms

Reliability

Keywords

High reliability, Markov process, three-failure-tolerant array, hierarchical RAID

1. INTRODUCTION

In the last few years, we have seen a huge disparity between I/O subsystem performance and the steadily growing processing power of computer systems. Myers has reported that processor power has doubled every 2.25 years since 1978 [10]; however, I/O performance has not kept pace with the gains in processing power. As the gap between the performance of processors and of I/O systems becomes large, the overall performance of a computer system will depend on the I/O bottleneck [8], [16]. Therefore, it is essential to balance the I/O bandwidth and the computational power.
Improving I/O performance through data declustering and disk striping in disk array systems [16], [17], [13] has been one of the main research topics for computer architects in recent years. Patterson et al. proposed Redundant Arrays of Inexpensive Disks (RAID), defined by five different levels (Levels 1-5) depending on the data and parity placement scheme [13], [14]. RAID offers large capacity and good performance using a number of disks, and is a reliable storage system that prevents data loss by means of single-disk redundancy, even if a disk fails.

Nowadays, the demand for huge storage capacity required by video on demand, internet data centers, data warehousing, digital imaging, nonlinear video editing, and so on increases the number of disks in a RAID. As these trends accelerate, traditional RAID cannot protect against the simultaneous loss of more than one disk [4]. As a result, much research has arisen on disk array systems that will not lose data even when multiple disks fail simultaneously [2], [3], [7], [11], [12].

Previous works [2], [3], [11] have presented fault-tolerant schemes for a disk array tolerating two disk failures, and other works [7], [12] have presented three-disk-tolerant schemes offering better reliability. These schemes at least double the complexity of implementation in comparison to RAID Level 5, so a disk array offering better reliability suffers from severe performance degradation; consequently, practical engineers hesitate to adopt these schemes in a commercial RAID system.

RAID Level 3+1 and Level 5+1 have been introduced to resolve the performance degradation while attaining high reliability [15]. RAID Level 3+1 is a mirror of RAID Level 3, and RAID Level 5+1 is a mirror of RAID Level 5. However, these Levels have the drawback that they require too many redundant disks, and their disk utilization is lower than 50%.
Fig. 1 shows the structure of RAID Level 3+1 and 5+1. These architectures are a kind of hierarchical RAID that mixes multiple RAID Levels.

Figure 1: Structure of RAID Level 3+1 (a) and Level 5+1 (b): a host computer connects to two RAID controllers (each a RAID Level 3 or Level 5 array) combined by mirroring.

Wilkes et al. [18] have presented AutoRAID, another kind of hierarchical storage architecture, in which a two-level storage hierarchy is implemented. In the upper level of this hierarchy, two copies of active data are stored to provide full redundancy and better performance. In the lower level, RAID Level 5 parity protection is used to provide lower storage cost. Accordingly, AutoRAID shows better performance than RAID Level 5. The hierarchical approach used in AutoRAID [18], RAID Level 3+1, and Level 5+1 thus enhances RAID technology. However, current hierarchical RAID technologies are limited to mixing RAID Level 1 with one other RAID Level, and their performance is bound to a single disk array controller.

We present a new hierarchical RAID that consists of multiple arbitrary RAID Levels with multiple disk array controllers, beyond the above scope. It offers good performance and reliability without sacrificing disk utilization, thus eliminating the drawbacks of traditional disk arrays. For example, our hierarchical RAID Level 5×5 consists of an upper-level RAID Level 5 of virtual disks, each of which is composed of a lower-level RAID Level 5 inside an individual disk array controller. Hierarchical RAID tolerates at least three disk failures, and its performance is similar to that of RAID Level 5 because its operations are dispersed across the disk array controllers.
For a fair comparison of the reliability of hierarchical RAID and RAID, we present a rigorous analysis of mean-time-to-data-loss (MTTDL) for various hierarchical RAID Levels and RAID Levels. Previous works [14], [5], [4], [7] have tried to show the reliability of RAID analytically, but they did not provide a rigorous analysis. Gibson et al. [7] presented reliabilities obtained by simulation. Patterson et al. [14], [5] approximated their analyses in a vague manner. Burkhard et al. [4] showed the reliability of the two-disk-fault-tolerant disk array using a Markov process, but approximated the reliabilities of the other disk arrays. Our analysis uses no approximation and no simulation; we use an exact Markov process for all RAID Levels.

2. HIRAID: OUR HIERARCHICAL RAID

Figure 2: HiRAID Level X×Y: the host computer is connected to the upper-level RAID Level X comprised of virtual disks, each of which is composed of a lower-level RAID Level Y (a RAID controller with its member disks, reached through upper and lower channels).

Figure 3: HiRAID Level 1×5×5: a mirror of HiRAID Level 5×5. Each lower-level RAID controller has two RAID groups. Four RAID groups compose the RAID Level 5, two of which compose the RAID Level 1 that is provided by software of the host computer (e.g., a banking computer mirrored between a primary site and a disaster recovery site).

Traditional RAID consists of only disks and a single array controller, but HiRAID is RAID over RAID. In other words, HiRAID consists of virtual disks, each of which is itself composed of a RAID.
Fig. 2 shows the block diagram of one embodiment of a HiRAID comprised of multiple RAID controllers. The RAID controller of RAID Level X connected to the host computer is connected to the RAID controllers composing RAID Level Y, where X and Y denote conventional RAID Levels such as 0, 1, 3, 5, or 6. Hereinafter, this is referred to as HiRAID Level X×Y. For example, X×Y can be 0×3, 1×3, 3×1, 0×5, 5×1, 1×5, 5×5, and so on. HiRAID Levels 1×3, 1×5, 0×3, and 0×5 are equal to RAID Levels 3+1, 5+1, 3+0, and 5+0 [15], respectively. Each HiRAID Level offers different characteristics in performance, reliability, and disk utilization; hence HiRAID Level 1×5 is not equal to HiRAID Level 5×1, and X×Y is not commutative.

HiRAID does not need to be composed of multiple RAID controllers, nor does its depth need to be 2. HiRAID can be made up of a single controller or of several controllers including partial RAID groups. Fig. 3 shows an exemplary diagram of a more complex HiRAID, HiRAID Level 1×5×5, whose depth is 3. Each lower-level RAID controller has two RAID groups, and the upper-level mirroring is provided by software of the host computer.

HiRAID Level 5×5 uses the concept of product codes, which are commonly used in communications and which appeared in Mann's architecture [9], where data is stored in the members of a cluster using RAID Level 5 striping and parity techniques, allowing data to be recovered when a cluster member is down. Each member possesses local disk drives employing RAID Level 5 technology, so the technique is applied twice to achieve a computer system with high reliability and lower cost. Mann's architecture shows wider bandwidth than HiRAID, which provides a single I/O path to the host computer, because each member of the cluster has a local I/O channel.
However, HiRAID operates as a general disk drive; accordingly, it can be a general storage device or a storage component of a complex storage structure such as Mann's architecture. HiRAID is adequate for a system such as a tertiary storage device of a huge data center that requires a single I/O channel and a large number of disks. A system requiring less than several tens of disks is not suitable for HiRAID: HiRAID increases reliability and offers acceptable performance, but it is not adequate for every storage application.

The main concept of HiRAID is to use a plurality of controllers to increase reliability while retaining performance. Since many controllers are required, it is important for a system adopting HiRAID to obtain controllers at a low price. Prices of recent commercial commodity controllers range from very expensive down to roughly the price of a disk.

3. RELIABILITY

Each HiRAID Level offers reliability and performance characteristics that are better than those of traditional RAID. In this section, we compare the reliability of HiRAID with that of RAID by means of Markov processes. The reliability of RAID can be measured as mean-time-to-data-loss (MTTDL). The following subsections formally derive the MTTDL of all RAID Levels and HiRAID Levels without any approximation or simulation. Our approach is based on reliability engineering [1].

3.1 The MTTDL of the traditional RAID

If the failure rate and repair rate of a disk are characterized by the exponential distribution and disks fail independently, the Markov-process approach for the failure/repair model can be used [1]. The Markov diagram for RAID is shown in Fig. 4. Each branch of a state denotes the transition rate to the next state. For example, (N-2)λ of the two-disk-down state is the failure rate (the transition rate to the three-disk-down state) when two disks have failed.
Similarly, 2μ of the two-disk-down state is the repair rate (the transition rate to the single-disk-down state) when two disks have failed, where N is the number of total disks, the repair rate per disk is μ = 1/MTTR (mean-time-to-repair), and the failure rate per disk is λ = 1/MTTF (mean-time-to-failure). The mean-time-to-data-loss (MTTDL) can be obtained not only from the differential equations but also from the fundamental matrix M, defined by Equation (1) [1]:

M = [I - Q]^{-1}   (1)

where Q is the truncated matrix [1]. The truncated stochastic transitional probability matrices Q_5, Q_6, and Q_7 for the single-disk, two-disk, and three-disk fault-tolerant RAID (the last an imaginary RAID that has three redundant disks for tolerating three disk failures), respectively, are

Q_5 = \begin{bmatrix} 1-N\lambda & N\lambda \\ \mu & 1-(N-1)\lambda-\mu \end{bmatrix}   (2)

Q_6 = \begin{bmatrix} 1-N\lambda & N\lambda & 0 \\ \mu & 1-(N-1)\lambda-\mu & (N-1)\lambda \\ 0 & 2\mu & 1-(N-2)\lambda-2\mu \end{bmatrix}   (3)

Q_7 = \begin{bmatrix} 1-N\lambda & N\lambda & 0 & 0 \\ \mu & 1-(N-1)\lambda-\mu & (N-1)\lambda & 0 \\ 0 & 2\mu & 1-(N-2)\lambda-2\mu & (N-2)\lambda \\ 0 & 0 & 3\mu & 1-(N-3)\lambda-3\mu \end{bmatrix}   (4)

From Equations (1) and (2), the fundamental matrix M_5 of RAID Level 5 is

M_5 = [m_{ij}] = \begin{bmatrix} m_{11} & m_{12} \\ m_{21} & m_{22} \end{bmatrix} = \frac{1}{N(N-1)\lambda^2} \begin{bmatrix} (N-1)\lambda+\mu & N\lambda \\ \mu & N\lambda \end{bmatrix}   (5)

Figure 4: Markov diagrams of (a) single-disk fault tolerance (RAID Level 5: N-1), (b) two-disk fault tolerance (RAID Level 6: N-2), (c) three-disk fault tolerance (N-3), and (d) mirroring (RAID Level 1: N/2). States progress from all-disks-up through k-disks-down to data loss, with failure transitions (N-k)λ (weighted by P'(k+1) in the mirroring case) and repair transitions kμ.

The element m_{ij} of M_5 is the average time spent in state j, given that the process starts in state i, before reaching the data-loss state.
It is assumed that the system starts in state 1 (all disks up); therefore the MTTDL is the sum of the average times spent in every state j given that the process starts in state 1:

MTTDL = \sum_{j=1}^{d+1} m_{1j}   (6)

where d+1 is the column size of the fundamental matrix. The MTTDL of RAID Level 5 is

MTTDL_5 = m_{11} + m_{12} = \frac{(2N-1)\lambda + \mu}{N(N-1)\lambda^2} = \frac{1}{N\lambda} + \frac{1}{(N-1)\lambda} + \frac{\mu}{N(N-1)\lambda^2} = MTTDL_0 + \frac{1}{(N-1)\lambda} + \frac{\mu}{N(N-1)\lambda^2}   (7)

The MTTDL of RAID Level 6 and of the three-disk-failure-tolerant RAID can be derived in the same way:

MTTDL_6 = \frac{(3N^2-6N+2)\lambda^2 + (3N-2)\lambda\mu + 2\mu^2}{N(N-1)(N-2)\lambda^3} = MTTDL_5 + \frac{N(N-1)\lambda^2 + 2N\lambda\mu + 2\mu^2}{N(N-1)(N-2)\lambda^3}   (8)

MTTDL_7 = \frac{(4N^3-18N^2+22N-6)\lambda^3 + (6N^2-14N+6)\lambda^2\mu + (8N-6)\lambda\mu^2 + 6\mu^3}{N(N-1)(N-2)(N-3)\lambda^4} = MTTDL_6 + \frac{N(N-1)(N-2)\lambda^3 + 3N(N-1)\lambda^2\mu + 6N\lambda\mu^2 + 6\mu^3}{N(N-1)(N-2)(N-3)\lambda^4}   (9)

The MTTDLs of the single-disk, two-disk, and three-disk fault-tolerant RAID can be obtained simply, whereas the MTTDL of RAID Level 1 is somewhat more complex. The transitional probability from the single-disk-down state to the two-disk-down state is neither zero nor one, because there is no data loss if the second failed disk is not the mirror of the first failed disk. Fig. 4(d) shows the Markov diagram of RAID Level 1. Transitions from the single-disk-down state up to the (d-1)-disk-down state carry the probability P'(n) that one more disk fails without data loss in the (n-1)-disk-down state; (N-k)P'(k+1)λ is thus the effective no-loss failure rate in the k-disk-down state. P'(n) is defined by

P'(n) = P(n \mid n-1) = \frac{P(n \cap n-1)}{P(n-1)} = \frac{P(n)}{P(n-1)} = \begin{cases} 1, & n = 1 \\ P(n)/P(n-1), & n \ge 2 \end{cases}   (10)

where P(n) is the safety probability that there is no data loss when n disks fail simultaneously. In mirroring, the data is safe if each of the N/2 mirroring groups contains at most one faulty disk.
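The fundamental-matrix computation of Equations (1)-(9) can be checked mechanically. The sketch below is illustrative (the parameter values are our own choices; exact rational arithmetic avoids round-off): it builds A = I - Q for the d-fault-tolerant chain, solves A t = 1 so that t[0] equals the row sum of Equation (6), and verifies the closed forms of Equations (7) and (8).

```python
from fractions import Fraction

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact Fractions."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def mttdl(N, lam, mu, d):
    """MTTDL of a d-fault-tolerant array of N disks via the fundamental
    matrix of Eq. (1): with A = I - Q (Eqs. (2)-(4)), the vector t of
    expected times to data loss satisfies A t = 1, and MTTDL = t[0]."""
    n = d + 1
    A = [[Fraction(0)] * n for _ in range(n)]
    for k in range(1, n + 1):
        A[k - 1][k - 1] = (N - k + 1) * lam + (k - 1) * mu  # diagonal of I - Q
        if k <= d:
            A[k - 1][k] = -(N - k + 1) * lam               # failure transition
            A[k][k - 1] = -k * mu                          # repair transition
    return solve_linear(A, [Fraction(1)] * n)[0]

# example rates (per day): MTTF = 5 years, MTTR = 1 week
lam, mu = Fraction(1, 5 * 365), Fraction(1, 7)
N = 10
assert mttdl(N, lam, mu, 1) == ((2 * N - 1) * lam + mu) / (N * (N - 1) * lam**2)
assert mttdl(N, lam, mu, 2) == (
    (3 * N * N - 6 * N + 2) * lam**2 + (3 * N - 2) * lam * mu + 2 * mu**2
) / (N * (N - 1) * (N - 2) * lam**3)
```

Because the arithmetic is exact, the asserts compare the chain solution and the closed forms as identical rationals, not merely to within floating-point tolerance.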
Hence the safety probability function P_1(n) of RAID Level 1 is

P_1(n) = \frac{\hat{C}(N/2,\,n)\,2^n}{\hat{C}(N,\,n)}, \quad \hat{C}(n,r) = \begin{cases} \binom{n}{r}, & n \ge r \\ 0, & \text{otherwise} \end{cases}   (11)

From Equations (1), (10), and (11) and Fig. 4(d), the inverse of the fundamental matrix M_1 of RAID Level 1 can be written as Equation (12):

M_1^{-1} = [I - Q] = \begin{bmatrix} N\lambda & -N P'_1(1)\lambda & 0 & \cdots & 0 \\ -\mu & (N-1)\lambda+\mu & -(N-1)P'_1(2)\lambda & \cdots & 0 \\ 0 & -2\mu & \ddots & & \vdots \\ \vdots & & -(d-1)\mu & (N-d+1)\lambda+(d-1)\mu & -(N-d+1)P'_1(d)\lambda \\ 0 & \cdots & 0 & -d\mu & (N-d)\lambda+d\mu \end{bmatrix}   (12)

Equation (12) generalizes to Equation (13):

M_1^{-1} = I - Q_1 = [a_{m,n}]_{(d+1)\times(d+1)}
a_{k+1,k} = -k\mu, for all k such that 1 ≤ k ≤ d
a_{k,k} = (N-k+1)\lambda + (k-1)\mu, for all k such that 1 ≤ k ≤ d+1
a_{k,k+1} = -(N-k+1)P'_1(k)\lambda, for all k such that 1 ≤ k ≤ d
a_{m,n} = 0, for all m, n such that n ≥ m+2 or n ≤ m-2   (13)

where d is the maximum number of fault-tolerable disks; for RAID Level 1, d = N/2. A symbolic expression for the inverse of M_1^{-1} cannot be simply derived from Equation (13), so we compute the MTTDL numerically.

Figure 5: Maximum tolerable failures of the HiRAID Levels 5×0 (a), 0×5 (b), and 5×5 (c). The number of virtual disks comprising the upper-level RAID is 3 and the number of disks comprising a lower-level RAID is 4 (i = 3, j = 4).

Figure 6: Maximum tolerable failures of HiRAID Level 1×5 (i = 2, j = 6) and Level 5×1 (i = 6, j = 2), which are the special cases of HiRAID Level 5×5 with i = 2 and j = 2, respectively.

3.2 The MTTDL of the HiRAID

HiRAID Level 5×5 tolerates at least three and at most i+j-1 disk failures, where i is the number of virtual disks comprising the upper-level RAID and j is the number of physical disks comprising each lower-level RAID. The total number of disks, including check and information disks, equals i·j. Fig. 5 and Fig. 6 show the cases of maximum tolerable failures of HiRAID. The derivation of the MTTDL of HiRAID Level 5×5 is also complex.
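The safety probability of Equation (11) can be sanity-checked by exhaustive enumeration over all failure patterns. The sketch below is illustrative (the pairing of disks 2k and 2k+1 into mirror pairs is our own encoding, not from the paper):

```python
from itertools import combinations
from math import comb

def p1(N, n):
    """Safety probability of RAID Level 1 (Eq. (11)): n simultaneous
    failures lose no data iff every mirror pair has at most one failure."""
    if n > N // 2:
        return 0.0
    return comb(N // 2, n) * 2**n / comb(N, n)

def p1_bruteforce(N, n):
    """Count safe n-failure patterns directly over N/2 mirror pairs."""
    safe = 0
    for failed in combinations(range(N), n):
        pairs = {d // 2 for d in failed}   # mirror pair of disk d
        if len(pairs) == n:                # all failures hit distinct pairs
            safe += 1
    return safe / comb(N, n)

for N in (4, 6, 8):
    for n in range(1, N // 2 + 2):
        assert abs(p1(N, n) - p1_bruteforce(N, n)) < 1e-12
```

Any single failure is always safe (p1(N, 1) = 1), and more than N/2 failures can never be safe, matching d = N/2 in the text.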
The derivation is similar to that for RAID Level 1 if P_1(n) in Equation (13) is replaced by P_{5×5}(n), which can be written as follows. First,

P_{5×5}(1) = P_{5×5}(2) = P_{5×5}(3) = 1

P_{5×5}(4) is the sum of: the probability that four disks fail in four separate groups without data loss; the probability that two disks fail in the same group and each of the other two fails in a different group without data loss; the probability that three disks fail in the same group and the fourth fails in a different group without data loss; and the probability that all four disks fail in the same group without data loss:

P_{5×5}(4) = \frac{\binom{i}{4}\binom{j}{1}^4 + \binom{i}{3}\binom{3}{1}\binom{j}{2}\binom{j}{1}^2 + \binom{i}{2}\binom{2}{1}\binom{j}{3}\binom{j}{1} + \binom{i}{1}\binom{j}{4}}{\binom{ij}{4}}   (14)

Table 1: The safety probability P(n) and the maximum number of fault-tolerable disks d of the HiRAID Levels (C denotes the truncated binomial Ĉ of Equation (11))

Level | P(n) | d
0×5 | C(i,n)·j^n / C(ij,n) | i
5×0 | i·C(j,n) / C(ij,n) | j
1×5 | (2·C(j,n) + 2·C(j,n-1)·j) / C(2j,n) | j+1
5×1 | (C(i,n)·2^n + C(i,n-1)·(n-1)·2^{n-2}) / C(2i,n) | i+1

By similar reasoning,

P_{5×5}(5) = \frac{\binom{i}{5}\binom{j}{1}^5 + \binom{i}{4}\binom{4}{1}\binom{j}{2}\binom{j}{1}^3 + \binom{i}{3}\binom{3}{1}\binom{j}{3}\binom{j}{1}^2 + \binom{i}{2}\binom{2}{1}\binom{j}{4}\binom{j}{1} + \binom{i}{1}\binom{j}{5}}{\binom{ij}{5}}   (15)

From Equations (14) and (15), we can generalize P_{5×5} as Equation (16):

P_{5×5}(n) = \frac{\binom{i}{n} j^n + \sum_{k=1}^{n-1} \binom{i}{n-k}\,(n-k)\,\binom{j}{k+1}\, j^{\,n-k-1}}{\binom{ij}{n}}   (16)

If the internal structure of the lower-level RAID is unknown to the upper level: P_{0×5}(n) is the probability that n disks fail in n separate groups without data loss; P_{5×0}(n) is the probability that all n disks fail in the same group without data loss; P_{1×5}(n) is obtained by setting i = 2 in Equation (16); and P_{5×1}(n) is obtained by setting j = 2 in Equation (16). Table 1 summarizes these probability functions. P_1(n) is derived by setting i = N/2 and j = 2 in P_{5×5}(n). HiRAID Level 0×5 is a group of RAID Level 5 arrays.
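Equation (16) can likewise be checked against a direct enumeration of failure patterns. The sketch below is illustrative (the group layout, disk d belonging to group d // j, is our own encoding); data survives iff at most one lower-level group contains two or more failed disks, since the upper-level RAID 5 tolerates one dead virtual disk.

```python
from itertools import combinations
from math import comb

def p55(i, j, n):
    """Eq. (16): probability that n simultaneous failures cause no data
    loss in HiRAID 5x5 (i RAID-5 groups of j disks each)."""
    num = comb(i, n) * j**n + sum(
        comb(i, n - k) * (n - k) * comb(j, k + 1) * j**(n - k - 1)
        for k in range(1, n))
    return num / comb(i * j, n)

def p55_bruteforce(i, j, n):
    """Count safe patterns: at most one group with >= 2 failed disks."""
    safe = 0
    for failed in combinations(range(i * j), n):
        per_group = [0] * i
        for d in failed:
            per_group[d // j] += 1
        if sum(1 for c in per_group if c >= 2) <= 1:
            safe += 1
    return safe / comb(i * j, n)

# up to three failures are always survivable, as the text states
assert p55(4, 4, 3) == 1.0
# cross-check the closed form against enumeration on small arrays
for i, j in [(2, 2), (2, 3), (3, 2), (3, 3)]:
    for n in range(1, i + j):
        assert abs(p55(i, j, n) - p55_bruteforce(i, j, n)) < 1e-12
```

For example, p55(3, 2, 4) = 0.8 and p55(2, 3, 4) = 0.4, matching the corresponding entries of Table 2 and illustrating that i > j gives the larger safety probability.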
However, the MTTDL of HiRAID Level 0×5 is not the MTTDL of RAID Level 5 divided by the number of groups, because the reliability function of RAID Level 5, R_5(t), is not exponentially distributed. R_5(t) is defined as the number of arrays surviving at time t divided by the total number surviving or failed:

R_5(t) = \frac{s_1 e^{s_2 t} - s_2 e^{s_1 t}}{s_1 - s_2}

where

s_1 = \frac{1}{2}\left\{ -\left[(2N-1)\lambda + \mu\right] + \sqrt{\lambda^2 + (4N-2)\lambda\mu + \mu^2} \right\}
s_2 = \frac{1}{2}\left\{ -\left[(2N-1)\lambda + \mu\right] - \sqrt{\lambda^2 + (4N-2)\lambda\mu + \mu^2} \right\}   (17)

This equation is derived in the appendix.

The MTTDL of HiRAID is obtained from Equations (6), (11), and (13) by substituting the corresponding safety probability function P(n) and d = i+j-1. The inverse fundamental matrix M^{-1} of HiRAID has a very large dimension, and the MTTDL is derived from the inverse of M^{-1}. It is hard to derive this inverse symbolically, hence we use a numerical method.

3.3 Comparison

Fig. 7 shows the MTTDLs of RAID and HiRAID; the MTTR is taken as one week.

Figure 7: The MTTDLs of RAID and HiRAID (Levels 5×1, 5×5, N-4, 1, 1×5, N-3, 0×5, 6, 5, 5×0, and 0) versus the number of total disks and the number of information disks, when the MTTR is one week, the MTTF is five years, and i = j for HiRAID Levels 5×0, 0×5, and 5×5. "N-3" and "N-4" denote imaginary RAIDs tolerating three and four disk failures, respectively, whose MTTDLs are obtained from Equation (13) with d = 3 or d = 4 and P(n) = 1 for all n. Level 0 is a disk array without redundant disks, whose MTTDL equals MTTF/N.

It takes just a few hours for a failed disk to be reconstructed if spare disks are prepared. However, if the system has no spare disks and no daily service, it can take from several days to several weeks to obtain an available new disk module. We have chosen the MTTF as five years.
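The numerical evaluation used for these comparisons can be sketched as follows. The parameter values are illustrative choices of ours; the solver simply implements the tridiagonal system of Equation (13), with the failure branch from each state split between the safe transition (weight P'(k), Equation (10)) and absorption into data loss, and with the safety probability of Equation (16) for HiRAID Level 5×5.

```python
from math import comb

def p55(i, j, n):
    """Eq. (16) safety probability of HiRAID Level 5x5."""
    num = comb(i, n) * j**n + sum(
        comb(i, n - k) * (n - k) * comb(j, k + 1) * j**(n - k - 1)
        for k in range(1, n))
    return num / comb(i * j, n)

def mttdl_numeric(N, lam, mu, d, P=lambda n: 1.0):
    """Numerical MTTDL from Eq. (13): solve (I - Q) t = 1, return t[0]."""
    n = d + 1
    A = [[0.0] * n for _ in range(n)]
    for k in range(1, n + 1):
        A[k - 1][k - 1] = (N - k + 1) * lam + (k - 1) * mu
        if k <= d:
            Pp = 1.0 if k == 1 else P(k) / P(k - 1)   # P'(k) of Eq. (10)
            A[k - 1][k] = -(N - k + 1) * Pp * lam
            A[k][k - 1] = -k * mu
    # Thomas algorithm: forward elimination, then back substitution
    b = [1.0] * n
    for r in range(1, n):
        f = A[r][r - 1] / A[r - 1][r - 1]
        A[r][r] -= f * A[r - 1][r]
        b[r] -= f * b[r - 1]
    t = [0.0] * n
    t[n - 1] = b[n - 1] / A[n - 1][n - 1]
    for r in range(n - 2, -1, -1):
        t[r] = (b[r] - A[r][r + 1] * t[r + 1]) / A[r][r]
    return t[0]

i = j = 4                            # a 4x4 HiRAID 5x5 of 16 disks (example)
lam, mu = 1 / (5 * 365), 1 / 7       # per-day rates: MTTF 5 years, MTTR 1 week
raid5 = mttdl_numeric(i * j, lam, mu, 1)
hiraid = mttdl_numeric(i * j, lam, mu, i + j - 1, lambda n: p55(i, j, n))
assert hiraid > 100 * raid5          # HiRAID 5x5 far outlives plain RAID 5
```

With d = 1 and P(n) = 1 this reduces exactly to the RAID Level 5 chain, which gives a convenient consistency check against Equation (7).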
Disks in a heavily loaded system fail more frequently, even though disk manufacturers claim an MTTF of about 23 years. Fig. 8 shows the MTTDLs for a fairer comparison, with an MTTR of one day and an MTTF of 23 years.

HiRAID Level 5×1 offers the best reliability. HiRAID Level 5×5 offers better reliability than the three-disk-failure-tolerant RAID, which can hardly be implemented in a real system due to its complexity. As the number of disks increases, the reliability of HiRAID decreases more slowly than that of the RAIDs. HiRAID Level 5×5 tolerates disk errors that occur in any arbitrary three disks without regard to their locations, and in addition tolerates more than three disk faults when they occur in restricted disk locations. Accordingly, HiRAID Level 5×5 always offers better reliability than the three-disk-failure-tolerant RAID. We can also see that Level 5+1 (Level 1×5) is less reliable than Level 1 and Level 5×1, so the reciprocal configuration of Level 1×5 is more reliable. Level 0×5 is likewise more reliable than Level 5×0.

Figure 8: The MTTDLs of RAID and HiRAID when the MTTR is one day, the MTTF is 23 years, and i = j for HiRAID Levels 5×0, 0×5, and 5×5.

In this comparison of MTTDLs, we have not considered support components such as controllers and communication lines. HiRAID Level 5×5 tolerates errors in any lower-level controller, in any line between the upper-level controller and a lower-level controller, and in any line between a lower-level controller and a disk. It cannot tolerate an upper-level controller error; however, the redundant-controller schemes of current RAID techniques can resolve that problem.

4. PERFORMANCE

While the disk utilization of HiRAID Level 5×1 is poor (less than 50%), HiRAID Level 5×5 offers reasonable disk utilization, (i-1)(j-1)/(ij), which is 81% when i = j = 10. Furthermore, its performance is similar to that of RAID Level 5. Fig. 9 and Fig. 10, drawn in UML [6], show the small-write operations of RAID Level 5 and HiRAID Level 5×5. The complexity of HiRAID Level 5×5 is about three times that of RAID Level 5; however, the operations of HiRAID are dispersed across three RAID controllers. From Fig. 9(b), the operation of RAID Level 5, O_5, can be written as

O_5 = 3·W.tr + 2·R.tr + 2·R.c + XOR

where W.tr, R.tr, R.c, and XOR denote write-transfer, read-transfer, read-command, and exclusive-OR operations, respectively.

Figure 9: The small-write operation of RAID Level 5 between the host computer and the RAID controller over the upper channel (host interface) and the lower channel (disk interface), with concurrent internal transfers of two disks. Diagram (b) is a sequence diagram matching Diagram (a).

From Fig. 10(b), the operations of controller1, controller2, and controller3 of HiRAID Level 5×5 can be written, respectively, as

O_c1 = 3·W.tr + 2·R.tr + 2·R.c + XOR
O_c2 = 3·W.tr + 3·R.tr + 3·R.c + XOR
O_c3 = 3·W.tr + 3·R.tr + 3·R.c + XOR

Controller2 and controller3 each have one more read-transfer and read-command, which together comprise a single additional read transaction. This additional read transaction resides in the upper channel interfacing the upper-level RAID, but it does not bottleneck overall performance because the lower channel carries more read/write transactions.
Also, the large-write and read operations of HiRAID Level 5×5 are similar to those of RAID Level 5, except for the slight extra latency it takes for commands to pass through the upper-level RAID controller.

Latency can be defined as the elapsed time between a command from the host computer and the completion response to that command, because the next command can be transmitted only after the completion response of the storage controller. The small-write latency of HiRAID is the elapsed time from the write command (W1.c) of the host computer to the completion response (W1.completion) of RAID controller1 (refer to Fig. 10). After the completion response, residual operations of the lower-level controllers still remain. Even though controller1 receives the next command from the host computer after the completion response, the next command can be executed promptly without data inconsistency. Accordingly, from Fig. 10, the small-write latency of HiRAID Level 5×5 is "R1.c + R2.c + Int + R3.tr + R1.tr + R2.tr + W2.tr + XOR + W3.tr" and the small-write latency of RAID Level 5 is "R1.c + R2.c + Int + R1.tr + R2.tr + W2.tr + XOR + W3.tr", where "Int" denotes the internal transfer time of a disk drive. The small-write latency of HiRAID Level 5×5 has just one more read-transfer (R3.tr) than that of RAID Level 5.

Figure 10: The small-write operation of HiRAID Level 5×5 across RAID controller1, controller2, and controller3. W.tr, R.tr, R.c, and XOR denote write-transfer, read-transfer, read-command, and exclusive-OR operations, respectively. Diagram (b) is a sequence diagram matching Diagram (a).

5. PROPERTY

Table 2: The probability P(4) that there is no data loss when 4 disks fail simultaneously

i \ j | 2     | 3     | 4     | 5     | 6
2     | 0     | 0.400 | 0.486 | 0.524 | 0.545
3     | 0.800 | 0.786 | 0.782 | 0.780 | 0.779
4     | 0.914 | 0.891 | 0.881 | 0.876 | 0.873
5     | 0.952 | 0.934 | 0.926 | 0.921 | 0.918
6     | 0.970 | 0.956 | 0.949 | 0.945 | 0.943

HiRAID Level 5×5 has excellent reliability and good performance, and it has several remarkable properties. In HiRAID Level 5×5, i > j generally offers better reliability than i < j. Fig. 7 shows that the MTTDL of HiRAID Level 5×1 is always larger than that of HiRAID Level 1×5; Level 5×5 reduces to Level 5×1 when j = 2 and to Level 1×5 when i = 2. We can also see this property from the safety probability function P(n): the larger P(n) is, the larger the MTTDL. For example, Table 2 shows the probabilities for n = 4 (the entries with i·j = 12 were highlighted in the original table). P(4) is always larger for i > j than for the transposed configuration, and we find this property of P(n) for all n. According to these characteristics, i > j generally offers better reliability than i < j.

Lemma 1. The disk utilization of HiRAID Level 5×5 is maximized when the dimension parameters i and j are equal.

Proof. If the number of total disks N equals ij and i = j, the utilization U_1 is

U_1 = \frac{(i-1)(j-1)}{ij} = \frac{N - 2\sqrt{N} + 1}{N} = 1 - \frac{2}{\sqrt{N}} + \frac{1}{N}

When i ≠ j, the utilization U_2 can be written as

U_2 = \frac{(i-1)(N/i - 1)}{N} = 1 - \frac{i}{N} - \frac{1}{i} + \frac{1}{N}

U_1 - U_2 = -\frac{2}{\sqrt{N}} + \frac{i}{N} + \frac{1}{i} = \frac{(i - \sqrt{ij})^2}{iN} \ge 0

Hence U_1 is always larger than U_2 when i ≠ j.

Lemma 2. X×Y is not commutative.

Proof. HiRAID Level 5×1 shows better reliability than HiRAID Level 1×5 in Fig. 7; thus X×Y is not commutative.

6. CONCLUSIONS

We have formally derived the mean-time-to-data-loss (MTTDL) of traditional RAID and HiRAID using Markov processes.
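Lemma 1 can also be checked numerically. The sketch below (N = 144 is an arbitrary example of ours) searches all factorizations N = i·j and confirms that the square layout maximizes utilization:

```python
def utilization(i, j):
    """Disk utilization of HiRAID Level 5x5: one parity disk per lower-level
    group (j-1 usable of j) and one parity virtual disk at the upper level
    (i-1 usable of i), giving (i-1)(j-1)/(i*j)."""
    return (i - 1) * (j - 1) / (i * j)

# For a fixed total N = i*j, utilization is maximal at i = j = sqrt(N) (Lemma 1)
N = 144
best = max(((i, N // i) for i in range(2, N) if N % i == 0),
           key=lambda ij: utilization(*ij))
assert best == (12, 12)

# the 81% figure quoted for i = j = 10
assert abs(utilization(10, 10) - 0.81) < 1e-12
```

The search also makes the asymmetry of Lemma 2 visible in a small way: utilization(3, 4) equals utilization(4, 3), so the i > j preference comes from reliability (Table 2), not from utilization.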
We have also presented a hierarchical RAID offering high reliability and good performance, whereas a traditional RAID offering good reliability shows bad performance. HiRAID has the following advantages over traditional RAID and the other methods proposed for improving reliability.

1. HiRAID Level 5×5 tolerates disk errors that occur in any arbitrary three disks. Its MTTDL is about 10^6 times that of RAID Level 5 when the MTTF equals 23 years, the MTTR is one day, and the number of information disks equals 100.

2. Its performance is similar to that of RAID Level 5, except for the slight extra latency it takes for commands to pass through the upper-level RAID controller.

3. It can be composed of traditional RAID controllers without modification, requiring no additional algorithms.

4. It offers reasonable disk utilization, (i-1)(j-1)/(ij), which is 81% when i = j = 10.

HiRAID shows the above merits; however, it is not suitable for every storage application. An application that requires many disks, moderate bandwidth, and sufficient reliability is appropriate for it. Also, it requires several controllers, so the controller price should be low for HiRAID to be applied in a practical system.

Meanwhile, we have introduced a rigorous methodology for obtaining the correct reliabilities of various disk array systems using Markov processes. Our analysis also handles the probability of surviving disk failures that occur only in restricted disk locations of a disk array, as in RAID Level 1.

7. REFERENCES

[1] R. Billinton and R. N. Allan. Reliability Evaluation of Engineering Systems: Concepts and Techniques. Perseus Publishing, 1992.

[2] M. Blaum, J. Brady, J. Bruck, and J. Menon. EVENODD: An efficient scheme for tolerating double disk failures in RAID architectures. IEEE Trans. Computers, 44(2):192-202, February 1995.

[3] M. Blaum, J. Bruck, and A. Vardy. MDS array codes with independent parity symbols. IEEE Trans. Information Theory, 42(2):529-542, March 1996.

[4] W. A. Burkhard and J. Menon. Disk array storage system reliability. In 23rd Int. Symp. on Fault-Tolerant Computing, pages 432-441. IEEE Comput. Soc., August 1993.

[5] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. RAID: High-performance, reliable secondary storage. ACM Computing Surveys, 26(2):145-185, June 1994.

[6] B. P. Douglass. Real-Time UML: Developing Efficient Objects for Embedded Systems. Addison Wesley, 1998.

[7] G. A. Gibson, L. Hellerstein, R. M. Karp, R. H. Katz, and D. A. Patterson. Failure correction techniques for large disk arrays. In Third Int. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 123-132. ACM, 1989.

[8] R. H. Katz, G. Gibson, and D. A. Patterson. Disk system architectures for high performance computing. In Proceedings of the IEEE, pages 1842-1858. IEEE, December 1989.

[9] B. E. Mann, P. J. Trasatti, M. D. Carlozzi, J. A. Ywoskus, and E. J. McGrath. Loosely coupled mass storage computer cluster. United States Patent 5,862,312, January 1999.

[10] G. Myers, A. Yu, and D. House. Microprocessor technology trends. In Proceedings of the IEEE, pages 1605-1622. IEEE, December 1986.

[11] C. I. Park. Efficient placement of parity and data to tolerate two disk failures in disk array systems. IEEE Trans. on Parallel & Distributed Systems, 6(11):177-184, November 1995.

[12] C. W. Park and Y. Y. Han. Efficient placement of parity and data to tolerate two disk failures in disk array systems. Lecture Notes in Computer Science (ASIAN 2000), 1961:58-68, November 2000.

[13] D. A. Patterson, P. Chen, and R. H. Katz. Introduction to redundant arrays of inexpensive disks (RAID). In 34th IEEE Computer Society International Conference, pages 112-117. IEEE Comput. Soc., 1989.

[14] D. A. Patterson, G. A. Gibson, and R. H. Katz. A case for redundant arrays of inexpensive disks (RAID). In SIGMOD Int. Conf. on Data Management, pages 109-116. ACM, September 1988.
[15] RAB. The RAIDbook: A Source Book for RAID Technology, sixth edition. The RAID Advisory Board, Lino Lakes, MN, 1999.

[16] A. L. N. Reddy and P. Banerjee. An evaluation of multiple-disk I/O systems. IEEE Trans. Computers, 38(12):1680-1690, December 1989.

[17] K. Salem and H. Garcia-Molina. Disk striping. In Int. Conf. on Data Engineering, pages 336-342. IEEE Comput. Soc., 1986.

[18] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID hierarchical storage system. ACM Trans. Computer Systems, 14(1):108-136, February 1996.

APPENDIX

A. THE RELIABILITY FUNCTION OF THE RAID LEVEL 5

Using continuous Markov processes [1], we can derive the reliability function R_5(t) of RAID Level 5. We first assume independent and exponentially distributed failure and repair rates:

[\dot{P}_1(t),\; \dot{P}_2(t),\; \dot{P}_3(t)] = [P_1(t),\; P_2(t),\; P_3(t)] \begin{bmatrix} -N\lambda & N\lambda & 0 \\ \mu & -\{(N-1)\lambda + \mu\} & (N-1)\lambda \\ 0 & 0 & 0 \end{bmatrix}   (18)

P_1(t) is the probability that all disks are operative at time t, P_2(t) is the probability that exactly one disk has failed at time t, and P_3(t) is the probability that the system has failed due to two disk failures. Assume that the system starts in state 1; then P_1(0) = 1 and P_2(0) = P_3(0) = 0. The set of differential equations (18) can be solved using Laplace transforms:

s P_1(s) - P_1(0) = -N\lambda P_1(s) + \mu P_2(s)
s P_2(s) - P_2(0) = N\lambda P_1(s) - \{(N-1)\lambda + \mu\} P_2(s)

From the above equations,

P_1(t) = \frac{[s_1 + (N-1)\lambda + \mu]\, e^{s_1 t} - [s_2 + (N-1)\lambda + \mu]\, e^{s_2 t}}{s_1 - s_2}
P_2(t) = \frac{N\lambda\, e^{s_1 t} - N\lambda\, e^{s_2 t}}{s_1 - s_2}

where

s_1 = \frac{1}{2}\left\{ -\left[(2N-1)\lambda + \mu\right] + \sqrt{\lambda^2 + (4N-2)\lambda\mu + \mu^2} \right\}
s_2 = \frac{1}{2}\left\{ -\left[(2N-1)\lambda + \mu\right] - \sqrt{\lambda^2 + (4N-2)\lambda\mu + \mu^2} \right\}

The reliability of the system is the sum of the probabilities of the two safe states:

R_5(t) = P_1(t) + P_2(t) = \frac{s_1 e^{s_2 t} - s_2 e^{s_1 t}}{s_1 - s_2}

The MTTDL can also be derived from the reliability function:

MTTDL_5 = \int_0^\infty R_5(t)\,dt = \frac{(2N-1)\lambda + \mu}{N(N-1)\lambda^2}   (19)

The MTTDL of HiRAID Level 0×5 can be derived not only from Equation (1) but also from the reliability function.
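Equations (17) and (19) can be cross-checked by crude numerical integration. The sketch below uses illustrative rate values of our own, and the step size and integration horizon are arbitrary choices (the slow decay constant here is a few thousand days, so a 10^6-day horizon captures essentially the whole integral):

```python
from math import sqrt, exp

def r5(t, N, lam, mu):
    """Reliability function of RAID Level 5, Eq. (17)."""
    b = (2 * N - 1) * lam + mu
    disc = sqrt(lam**2 + (4 * N - 2) * lam * mu + mu**2)
    s1, s2 = (-b + disc) / 2, (-b - disc) / 2
    return (s1 * exp(s2 * t) - s2 * exp(s1 * t)) / (s1 - s2)

N, lam, mu = 10, 1 / (5 * 365), 1 / 7    # per-day rates: MTTF 5 y, MTTR 1 wk
mttdl_closed = ((2 * N - 1) * lam + mu) / (N * (N - 1) * lam**2)  # Eq. (19)

# left Riemann sum of R5 over [0, T] as a sanity check of Eq. (19)
dt, T = 10.0, 1e6
approx = sum(r5(k * dt, N, lam, mu) for k in range(int(T / dt))) * dt
assert abs(approx - mttdl_closed) / mttdl_closed < 1e-2
```

The check also makes the non-exponential shape visible: R5'(0) = 0, so R5(t) stays flat near t = 0, unlike a pure exponential with the same mean.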
HiRAID Level 0×5 is a series system of RAID Level 5 arrays; hence the MTTDL of a HiRAID Level 0×5 comprised of i groups of RAID Level 5 can be written as Equation (20), where N equals j. The MTTDL of mirroring can also be calculated from Equation (20), where i equals the number of information disks and N equals 2:

R_{0×5}(t) = \{R_5(t)\}^i, \quad MTTDL_{0×5} = \int_0^\infty R_{0×5}(t)\,dt   (20)
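Equation (20) can be evaluated numerically. The sketch below uses illustrative values of i, j, and the rates; it confirms the point made in the appendix that the series system's MTTDL is smaller than a single group's, yet is not simply MTTDL5 divided by i, because R5(t) is not exponential:

```python
from math import sqrt, exp

def r5(t, N, lam, mu):
    """RAID Level 5 reliability function, Eq. (17)."""
    b = (2 * N - 1) * lam + mu
    disc = sqrt(lam**2 + (4 * N - 2) * lam * mu + mu**2)
    s1, s2 = (-b + disc) / 2, (-b - disc) / 2
    return (s1 * exp(s2 * t) - s2 * exp(s1 * t)) / (s1 - s2)

def mttdl_0x5(i, j, lam, mu, dt=10.0, T=1e6):
    """Eq. (20): MTTDL of i series RAID-5 groups of j disks each,
    by numerically integrating R5(t)**i (left Riemann sum)."""
    return sum(r5(k * dt, j, lam, mu) ** i for k in range(int(T / dt))) * dt

lam, mu = 1 / (5 * 365), 1 / 7       # per-day rates: MTTF 5 years, MTTR 1 week
single = mttdl_0x5(1, 12, lam, mu)   # one RAID-5 group of 12 disks
series = mttdl_0x5(4, 12, lam, mu)   # HiRAID 0x5 with i = 4 such groups

# the series system fails sooner than one group, but not exactly i times sooner
assert series < single
assert 0.2 < series / single < 0.3
```

For an exponential R5 the ratio would be exactly 1/i = 0.25; the flat start of R5(t) (its derivative vanishes at t = 0) nudges the ratio slightly above that value here.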