This chapter studies the performability of a disk array subsystem. The focus is on performability alone, because the definition of “cost” in cost-performability is somewhat ambiguous. A simple performance model of the disk array is used, because more accurate models are outside the scope of this thesis.
Figure 37. Simple Markov model for performability of TMM (the model parameters are the number of disks, the disk failure rate, and the repair rate; each state i, where i is the number of faulty disks in the disk array, has a probability of the system being in state i at time t and a reward function at state i at time t)
Performability (i.e., combined performance and reliability) of a system can be expressed using Markov reward models [Trivedi 1994, Catania 1993, Pattipati 1993, Smith 1988, Furchgott 1984, Meyer 1980, Beaudry 1978]. Figure 37 illustrates a performability model for TMM presented in Chapter 6. This is a typical RAID-5 array with a D+1 redundancy scheme. For each state i (i=0,1, and 2), two parameters are defined: probability and reward.
The first parameter, probability, defines the probability of the system being in state i at a given time t. This is the same probability as used for the reliability analysis in Chapters 6, 7, and 8.
The second parameter, reward, defines the reward that the system earns while being in state i at a given time t. The reward function can be, for example, the performance of the system. In a disk array, the performance can be expressed as the number of I/O operations per second. For example, in state 0 of Figure 37, the reward function specifies the number of user I/O operations that the disk array can perform in the fault-free state. In state 1, the reward function specifies the number of user I/O operations that the crippled disk array can perform while it is either waiting for the repair process to start or while the repair process is ongoing. In state 2, the reward function is zero because the data is lost and the disk array has failed.
For simplicity, it is assumed in this thesis that the reward function is constant, i.e., it depends only on the state i and not on the time t.
The performability (or computational availability) in state i at a given time t is then the product of the two parameters above, i.e., the reward of state i multiplied by the probability of state i.
The total performability at a given time t is the sum of the performabilities of all states i.
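In symbols, the two relations above can be sketched as follows. The names p_i(t) for the state probability, w_i(t) for the reward, and P_i(t) and P(t) for the state and total performability are notation assumed here for illustration.

```latex
P_i(t) = p_i(t)\, w_i(t), \qquad
P(t) = \sum_{i} P_i(t) = \sum_{i} p_i(t)\, w_i(t)
```

With the constant reward assumed in this thesis, w_i(t) reduces to w_i.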
Steady state performability
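If the Markov model describes a steady-state system, the state probabilities no longer depend on time, and the performability is the sum over all states i of the steady-state probability of state i multiplied by the reward of state i.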
Non-steady state performability
In a non-steady-state system, the probability of state i changes over time. Eventually, the system will fail (in Figure 37, it ends up in state 2). The cumulative performability of a system with non-repairable faults can be expressed as equation (109): the sum over all states i of the reward of state i multiplied by the cumulative reliability of state i.
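A sketch of this cumulative relation in assumed notation, with w_i the constant reward of state i and CR_i the cumulative reliability of state i. Writing CR_i as the time integral of the state probability p_i(t) is an assumption, chosen because it makes the sum equal to the MTTDL when the rewards of all working states equal one.

```latex
CP = \sum_{i} w_i \, CR_i, \qquad
CR_i = \int_{0}^{\infty} p_i(t)\, dt
```

For the failed state the integral does not converge, but its reward is zero, so that term is dropped from the sum.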
The performability of a RAID-5 disk array that is modeled with TMM can be expressed using the above equation (109) and the probabilities of states 0, 1, and 2 as expressed in Chapter 6 in equations (12) - (14). The cumulative reliabilities of TMM are:
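The value of the cumulative reliability of state 2 has no effect, since the reward function of state 2 is zero. The performability of TMM is then the sum of the cumulative reliabilities of states 0 and 1, each weighted by the reward function of its state. The reward functions depend on the type of the operation (read or write).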
Note that the performability of TMM equals the MTTDL of TMM if the reward functions of states 0 and 1 both equal one. As the reward of the fault-free state is typically greater than or equal to the reward of the crippled state, an upper bound for the performability is obtained by multiplying the MTTDL of the array by the reward function of the fault-free state. Hence, the performability can be approximated as the MTTDL of the array multiplied by the reward function of the fault-free state.
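The relation between the exact performability and the MTTDL-based upper bound can be illustrated with a minimal Python sketch. It assumes the standard three-state chain for a D+1 array (state 0 to state 1 at rate (D+1)λ, state 1 back to state 0 at rate μ, state 1 to state 2 at rate Dλ), which may differ in detail from equations (12)-(14) of Chapter 6. All function names and numeric values are hypothetical.

```python
def tmm_cumulative_reliabilities(d, lam, mu):
    """Expected total time spent in states 0 and 1 before data loss.

    Assumed chain (not taken from the thesis equations):
    0 -> 1 at (d + 1) * lam, 1 -> 0 at mu, 1 -> 2 at d * lam.
    """
    cr0 = (mu + d * lam) / (d * (d + 1) * lam ** 2)
    cr1 = 1.0 / (d * lam)
    return cr0, cr1


def tmm_performability(d, lam, mu, w0, w1):
    """Cumulative performability: rewards weighted by cumulative reliabilities."""
    cr0, cr1 = tmm_cumulative_reliabilities(d, lam, mu)
    return w0 * cr0 + w1 * cr1


def tmm_mttdl(d, lam, mu):
    """MTTDL is the total expected time spent in the working states."""
    cr0, cr1 = tmm_cumulative_reliabilities(d, lam, mu)
    return cr0 + cr1


def approximate_performability(d, lam, mu, w0):
    """Upper-bound estimate: MTTDL times the fault-free reward."""
    return w0 * tmm_mttdl(d, lam, mu)


if __name__ == "__main__":
    # Hypothetical parameters: 9 data disks, rates per hour.
    d, lam, mu = 9, 1e-5, 0.5
    w0, w1 = 10.0, 0.8  # hypothetical read rewards (fault-free, crippled)
    exact = tmm_performability(d, lam, mu, w0, w1)
    approx = approximate_performability(d, lam, mu, w0)
    print("exact:", exact)
    print("approximation:", approx)
    print("relative error:", (approx - exact) / exact)
```

When the repair rate is much higher than the failure rate, almost all time is spent in state 0, so the two values differ only slightly.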
The performability of a RAID-5 disk array that is modeled with EMM1 can be expressed using the above equation (109) and the probabilities of states 00, 01, 10, and f as expressed in Chapter 7 in equations (61) - (64). The cumulative reliabilities of EMM1 are:
The value of the cumulative reliability of state f has no effect, since the reward function of state f is zero. The performability of EMM1 is then the sum of the cumulative reliabilities of states 00, 01, and 10, each weighted by the reward function of its state. The reward functions depend on the type of the operation (read or write).
The performance of disk arrays can be modeled using a simple performance model, as in [Hillo 1993, Gibson 1991, Kemppainen 1991]. Here, the reward functions are modeled using the performance of the disk array estimated for either read or write operations, but not for mixed read and write operations. A more accurate performance model of disk arrays is considered to be outside the scope of this thesis.
In a RAID-5 array, a total of D+1 disks is used for building an array of D data disks. There are D disks in the crippled array. If I/O requests are assumed not to span several disks, each request requires the following number of disk accesses:
• 1 disk operation to read from a fault-free disk array;
• D disk operations to read from a crippled disk array (in the worst case);
• 4 disk operations to write to a fault-free disk array;
• D+1 disk operations to write to a crippled disk array (in the worst case); and
• D+1 disk operations to reconstruct a faulty disk block.
The above equations (112) and (117) for performability apply to the analysis of RAID-5 arrays. However, later in this chapter, it is shown that a good estimate of the performability can be obtained from the MTTDL of the array and the reward function of the fault-free state. Hence, reward functions for the RAID-1 array are also included here.
In a RAID-1 array, a total of 2D disks is used for building an array of D data disks. There are 2D-1 disks in the crippled array. If I/O requests are assumed not to span several disks, each request requires the following number of disk accesses:
• 1 disk operation to read from a fault-free disk array;
• 1 disk operation to read from a crippled disk array;
• 2 disk operations to write to a fault-free disk array;
• 2 disk operations to write to a crippled disk array (in the worst case); and
• 2 disk operations to reconstruct a faulty disk block.
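The access counts listed above for both array types can be collected in a small lookup, sketched here in Python. The function name and dictionary keys are hypothetical, and the D-dependent RAID-5 counts (D, D+1, D+1) are an assumption inferred from the worst-case description, in which a crippled read touches all remaining disks and a crippled write or a block reconstruction additionally writes to one disk.

```python
def disk_operations(raid_level, data_disks):
    """Disk accesses per single-disk request, keyed by request type.

    The RAID-5 worst-case counts that depend on data_disks are inferred,
    not quoted from the source.
    """
    d = data_disks
    if d < 1:
        raise ValueError("data_disks must be at least 1")
    if raid_level == 5:
        return {
            "read_fault_free": 1,
            "read_crippled": d,          # read all remaining disks
            "write_fault_free": 4,       # read/write of data and parity
            "write_crippled": d + 1,     # read remaining disks, write one
            "reconstruct_block": d + 1,  # read remaining disks, write one
        }
    if raid_level == 1:
        return {
            "read_fault_free": 1,
            "read_crippled": 1,
            "write_fault_free": 2,
            "write_crippled": 2,
            "reconstruct_block": 2,      # read the mirror, write the new disk
        }
    raise ValueError("only RAID-1 and RAID-5 are covered")


if __name__ == "__main__":
    for level in (5, 1):
        print("RAID-%d:" % level, disk_operations(level, data_disks=9))
```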
In a disk array, the maximum number of I/O operations depends on the array configuration, the type of the operation, and the properties of the disks. The array configuration specifies how many parallel read and write operations can be performed, as illustrated in the introduction in Chapter 1. In this thesis, the performance is expressed relative to a single disk: a relative performance of one corresponds to one fully working disk serving user requests. For example, a fault-free RAID-1 with two disks has a relative performance of two for read operations and one for write operations.
Effect of the scanning algorithm
The effect of the scanning algorithm is studied by reserving a certain capacity for the scanning algorithm. For every disk, a certain share of the capacity (the scanning activity) is reserved for scanning, and the remaining capacity is available for other requests (user requests or repair).
Effect of the repair process
The repair process decreases the maximum number of user operations in the crippled array. The degree of degradation depends on the activity of the repair process. For example, when a disk array with a total of ten disks is repaired using 20% of the capacity (the repair activity), the theoretical remaining capacity is 8 units. This is further reduced if a read or write request requires several disk operations. For example, a write to a crippled RAID-5 array of this size requires 10 disk operations. Hence, the relative performance is only 8/10 = 0.8. For comparison, the relative write performance of a fault-free array of the same size would be 2.5.
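The arithmetic of this example can be reproduced with a short Python sketch. It mirrors only the reasoning of the paragraph above (remaining capacity divided by the number of disk operations per request) and is not the reward function of Table 15. The function and parameter names are hypothetical.

```python
def relative_performance(total_disks, ops_per_request, reserved_fraction=0.0):
    """Relative performance compared with one fully working disk.

    total_disks       -- disks counted as capacity units
    ops_per_request   -- disk operations needed per user request
    reserved_fraction -- share of capacity reserved for repair or scanning
    """
    if not 0.0 <= reserved_fraction <= 1.0:
        raise ValueError("reserved_fraction must be between 0 and 1")
    if ops_per_request < 1:
        raise ValueError("ops_per_request must be at least 1")
    remaining_capacity = total_disks * (1.0 - reserved_fraction)
    return remaining_capacity / ops_per_request


if __name__ == "__main__":
    # Crippled write: ten disks, 20% repair activity, 10 operations per write.
    print("crippled write:", relative_performance(10, 10, 0.2))
    # Fault-free write: ten disks, 4 operations per write.
    print("fault-free write:", relative_performance(10, 4))
```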
Reward functions of RAID-5 and RAID-1
The relative reward functions of RAID-5 and RAID-1 arrays are given in Table 15. From the point of view of performance, three different states are assumed:
• all disks working (state 0 in TMM and states 00 and 01 in EMM1);
• one disk unit failed (state 1 in TMM and state 10 in EMM1); and
• data lost (state 2 in TMM and state f in EMM1).
Sector faults are considered not to degrade the performance.
Table 15. Relative reward functions of RAID-5 and RAID-1
In a RAID-5 array, all disks are involved in the repair process. As the worst-case scenario is used here, a read operation to a crippled array requires access to all remaining disks. Hence, the relative performance is one, from which the repair activity is deducted. Similarly, in the worst case, a write operation requires reading all remaining disks once and writing to one disk. From this relative performance, the repair activity is deducted.
In a RAID-1 array, only two disks are involved in the repair process. When the array has D data disks (2D disks in total), D-1 data disks are not affected by a disk unit fault. For a read operation, there are 2D-1 disks available, and the performance is further reduced by the repair process in one disk. For a write operation, there are D-1 data disks that are not affected by the disk unit fault and one data disk that is affected by the repair process.
The performability of a RAID-5 array modeled using TMM and EMM1 is illustrated in Figure 38. Here, the same default parameters are used as in Chapter 8. The figure shows that the approximation (performability equals MTTDL multiplied by the reward function of the fault-free state) provides accurate results. Hence, the same approximation principle is used for the RAID-1 array. Both performability models also provide similar results, which correspond to the reliability results.
Figure 38. Performability of RAID-5 array as a function of the number of disks in the array
Effect of the repair activity
The effect of the repair activity is studied in a configuration where the repair time depends on the number of disks and the repair activity. In Figure 39, the performability of a RAID-5 array is illustrated as a function of the repair activity. Here, a RAID-5 array of 50 data disks is studied in two configurations: hot swap (repair starts 8 hours after the disk failure) and hot spare (repair starts immediately after the disk failure). The read operation provides four times higher performability than the write operation, as its reward function in state 00 is four times higher. The repair time is assumed to be two hours with 100% repair activity and proportionally longer if the repair activity is less than 100%. The hot spare configuration provides significantly better performability than the hot swap configuration, as its total repair time is shorter. The performability of state 10 of EMM1 has only a marginal effect (less than 1%) on the total performability of the RAID-5 array. This is because the failure rates in EMM1 are much smaller than the repair rates, and therefore the system is mainly in state 00.
The only factor that may limit the disk repair activity is the performance requirement during the repair time. If no minimum performance requirement is set for the repair time, the repair can and should be done at full speed; otherwise, the repair activity should be limited to guarantee the minimum performance. When the total performability is considered, the reliability increase due to faster repair is much more significant than the minor performance degradation during the repair time.
Figure 39. Performability of RAID-5 array modeled with EMM1 as a function of repair activity
Effect of the scanning algorithm
The effect of the scanning algorithm on the performability is studied by varying the scanning activity. It is assumed that scanning all disks in the array takes 24 hours with 5% scanning activity. The performability of a RAID-5 array is presented in Figure 40 as a function of the scanning activity. In this sample configuration, the optimum performability is achieved when the scanning algorithm uses 20% of the capacity for scanning the disk surface in the hot swap configuration and 30% in the hot spare configuration. The hot swap configuration reaches its peak performability earlier because its reliability is dominated more strongly by the longer disk repair time, whereas in the hot spare configuration increased scanning activity, and the sector fault detection it provides, keeps improving the reliability for longer. Eventually, in both cases, the performability starts to decrease as the scanning activity approaches 100%. This is expected, since less and less capacity of the array remains for user disk requests while the reliability no longer increases, because it is limited by the repair time of the disk unit failure.
Figure 40. Performability of RAID-5 array modeled with EMM1 as a function of scanning activity
RAID-5 vs. RAID-1
The performability of RAID-1 and RAID-5 arrays is compared in Figure 41. The performability of the RAID-5 array is obtained using equations (112) and (117) above, while the performability of the RAID-1 array is approximated as the MTTDL of RAID-1 multiplied by the appropriate reward function. The MTTDL of a RAID-1 array is approximated by dividing the MTTDL of a two-disk RAID-1 by the number of data disks in the array, while the reward function is proportional to the number of disks in the array. As a result, the performability of the RAID-1 array is constant: the performance of the disk array increases with the number of disks, but the reliability decreases at the same rate. The same effect is found with RAID-0 arrays, where the performability also remains constant but at a much lower level because a RAID-0 array has no redundancy. In contrast, the performance of the RAID-5 array increases almost linearly with the number of disks in the array, but the reliability decreases more rapidly because more and more disks are protected by just a single disk.
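Why the RAID-1 performability stays constant can be checked with a minimal Python sketch of the approximation described above. The function names and the two-disk MTTDL value are hypothetical. The fault-free rewards (2D for reads, D for writes) are an assumption extrapolated from the two-disk example, where the relative performance is two for reads and one for writes.

```python
def raid1_mttdl(data_disks, mttdl_two_disk):
    """Approximate MTTDL: two-disk MTTDL divided by the number of data disks."""
    if data_disks < 1:
        raise ValueError("data_disks must be at least 1")
    return mttdl_two_disk / data_disks


def raid1_performability(data_disks, mttdl_two_disk, operation="read"):
    """MTTDL multiplied by the fault-free reward (assumed 2D read, D write)."""
    if operation == "read":
        reward = 2 * data_disks
    elif operation == "write":
        reward = data_disks
    else:
        raise ValueError("operation must be 'read' or 'write'")
    return reward * raid1_mttdl(data_disks, mttdl_two_disk)


if __name__ == "__main__":
    mttdl_pair = 1.0e9  # hypothetical MTTDL of a two-disk RAID-1, in hours
    for d in (1, 10, 50):
        print(d,
              raid1_performability(d, mttdl_pair, "read"),
              raid1_performability(d, mttdl_pair, "write"))
```

The number of data disks cancels out of the product, so the result depends only on the two-disk MTTDL and the type of operation.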
Figure 41. Performability of RAID-1 and RAID-5 arrays
The conclusions of the performability analysis are gathered in the following list:
• Performability of a disk array can be well approximated by multiplying the MTTDL of the array by the reward function of the fault-free state when the repair rates are much higher than the failure rates.
• Performability of RAID-0 and RAID-1 arrays is constant regardless of the number of disks in the array. Higher performance is achieved with a larger number of disks, but at the expense of reduced reliability.
• Performability of a RAID-5 array decreases as the number of disks increases. This is because the reliability drops faster than the performance increases.
• A RAID-1 array provides better performability than a RAID-5 array with the same number of data disks. The penalty for the higher performability of the RAID-1 array is the larger number of disks in the array and the higher number of failed disks.
• A scanning algorithm can improve performability. At first, the scanning algorithm increases the performability, as the disk array reliability increases while the performance degradation remains moderate. When the scanning activity increases further, the reliability no longer increases, because the reliability bottleneck becomes the disk unit faults, while the performance of the array keeps dropping. Thus, the performability also decreases.
• Increasing the speed of the repair process affects the performability by improving the reliability, while the effect on the average performance is marginal. The only reason to limit the speed of the repair process is to guarantee a certain performance even with a crippled array.