4 MODERN DISKS AND DISK ARRAYS
This chapter consists of two parts: a quick overview of the properties of modern SCSI disks and a review of disk array configurations.
The two most commonly used disk types in modern computers are based on the IDE (Integrated Drive Electronics) and SCSI (Small Computer System Interface) interfaces. In many cases, disk manufacturers offer similar disks with both disk interfaces. The main difference between the two interfaces is the wider variety of functions offered by the SCSI interface. In this thesis, the interest is focused on SCSI disks as they are widely used in disk arrays.
There are currently two SCSI standards: SCSI-1 and SCSI-2 [ANSI 1994, ANSI 1986]. The third generation of the SCSI standard (SCSI-3) is under development [ANSI 1997, T10 1997, ANSI 1996, ANSI 1995]. These three standards specify the interfaces not only for disks but also for other devices (such as CD ROMs, tape streamers, printers, and local area networks).
The SCSI commands are divided into two categories: mandatory and optional commands. The mandatory commands are required to be recognized by all SCSI devices, while a manufacturer may or may not implement the optional commands. Some of the commands are device specific and thus used only with certain devices, or they behave differently with different devices. In addition, there are some vendor-specific SCSI commands or fields in the SCSI commands [ANSI 1994, Seagate 1992a]. For example, statistical information can be obtained from a disk with a standard command, but the information is manufacturer specific.
A significant part of this thesis relies on the enhanced properties of modern SCSI disk standards. As a normal SCSI disk by itself keeps track of its operation and logs events during normal operation, it is possible to implement the scanning algorithms that are discussed in this thesis.
The main operating principles of disks can be found in [ANSI 1996, Seagate 1996a, Seagate 1996b, ANSI 1995, ANSI 1994, Seagate 1992, Seagate 1992a, Conner 1992, Sierra 1990, ANSI 1986, Cox 1986]. In a disk that complies with the SCSI-2 standard, the disk storage is represented as a contiguous set of blocks, usually sectors [ANSI 1994, Seagate 1992a]. This is different from some older disk standards where a disk was characterized by three parameters: the number of heads, the number of tracks, and the number of sectors per track. In a SCSI disk, the actual structure of the disk is hidden from normal users, but there is a SCSI command to query detailed information about the physical disk geometry.
There are two major benefits of having a linear storage architecture. First, it hides the physical structure of the disk simplifying the handling of the disk by the operating system. Second, the linear data representation allows the operating system or the disk to repair sector faults without modifying the logical representation.
The major disadvantage of the linear storage architecture appears in performance optimization. As the operating system generally does not know the physical structure of a disk, it cannot adjust its disk requests to the mechanical limitations. For example, a read-ahead algorithm may suffer additional head-switching and seek delays if the last requested sectors fall on a different track or surface.
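The difference between the older geometry-based addressing and the linear block representation can be sketched as a simple address conversion. The geometry values below are illustrative only; a real SCSI disk reports its own (hidden) geometry on request:

```python
# Illustrative geometry, not that of any particular disk.
HEADS = 16              # recording surfaces
SECTORS_PER_TRACK = 63  # sectors on each track

def chs_to_lba(cylinder: int, head: int, sector: int) -> int:
    """Map an old-style (cylinder, head, sector) address to a linear
    block address. By convention sectors are numbered from 1,
    cylinders and heads from 0."""
    return (cylinder * HEADS + head) * SECTORS_PER_TRACK + (sector - 1)

print(chs_to_lba(0, 0, 1))  # 0: the first block of the disk
print(chs_to_lba(0, 1, 1))  # 63: first block of the second surface
```

With this linear view, the operating system only sees the block numbers; the mapping (and any remapping of faulty sectors) stays inside the disk.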
A SCSI disk tries to minimize the risk of media deterioration. During a low-level format operation, the disk scans the surface and maps out those areas that have been diagnosed as faulty [ANSI 1994, Seagate 1992a]. The actual procedure is vendor specific. For example, not only the actual faulty area may be omitted, but also its nearby regions can be rejected in order to minimize the probability of encountering media deterioration in the future.
A SCSI disk may also encounter defective blocks after a low-level format operation. This can happen, for example, during a read or a write operation or while performing a diagnostic operation. In such an event, the defective block can be replaced with a special SCSI command: REASSIGN BLOCKS [ANSI 1997, Kamunen 1996, Räsänen 1996, ANSI 1994, Räsänen 1994, Platt 1992, Seagate 1992a, ANSI 1986]. This command replaces the specified defective area (i.e., sector) with a spare one while maintaining the same logical representation. Again, the actual repair procedure is vendor specific.
Modern SCSI disks gather various statistical information during their normal operation. This information is divided into two categories: counters and detailed statistics. For example, the following kind of information is provided in ERROR COUNTER PAGES information elements of SCSI disks [ANSI 1994, Seagate 1992a]:
• Errors corrected without substantial delay
• Errors corrected with possible delays
• Total errors (e.g., rewrites or rereads)
• Total errors corrected
• Total times correction algorithm processed
• Total bytes processed
• Total uncorrected errors
This information is typically used for observing the behavior of the disk and its operational quality. The error counters can provide early warning signs of impending disk faults.
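A minimal sketch of how such counters could be used as an early warning sign follows. The field name and the threshold are assumptions for illustration; the real counters are read with the SCSI LOG SENSE command and their exact semantics are vendor specific:

```python
# Hypothetical early-warning check on the corrected-error counters.
def warn_on_error_trend(prev: dict, curr: dict, threshold: int = 10) -> bool:
    """Return True if the number of corrected errors grew by more than
    the allowed threshold between two consecutive samples."""
    delta = curr["errors_corrected_total"] - prev["errors_corrected_total"]
    return delta > threshold

prev = {"errors_corrected_total": 120}   # previous LOG SENSE sample
curr = {"errors_corrected_total": 145}   # current LOG SENSE sample
print(warn_on_error_trend(prev, curr))   # True: 25 new corrections
```

The point is that the absolute counter values matter less than their rate of growth between samples.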
Most recent error information
There is also more detailed, vendor-specific information on the LAST N ERROR EVENTS PAGE [ANSI 1994, Seagate 1992a]. This can be used for obtaining more detailed information on the most recent errors. The number of reported errors depends on the manufacturer.
A disadvantage of the LAST N ERROR EVENTS PAGE function is the masking effect. As the disk has limited capacity to store detailed error information, some of the older errors may be masked by new ones. In practice, the masking effect does not significantly reduce the fault detection capability. Typically, this information is read much more frequently than it is updated: it can be read every few seconds, while there are normally only a few major errors in a month or a year [ANSI 1994, Räsänen 1994, Gibson 1991]. Hence, the system can detect most of the errors reported on the LAST N ERROR EVENTS PAGE without any major risk of missing error reports.
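The masking effect can be modeled as a fixed-size ring buffer that keeps only the last N events; the value of N below is an illustrative assumption, as the real capacity is manufacturer specific:

```python
from collections import deque

# The disk stores only the last N error events; older entries are
# masked (pushed out) by newer ones.
N = 4
last_n_errors = deque(maxlen=N)  # oldest entries fall out automatically

for event in ["e1", "e2", "e3", "e4", "e5"]:
    last_n_errors.append(event)

print(list(last_n_errors))  # ['e2', 'e3', 'e4', 'e5'] -- 'e1' was masked
```

As long as the page is polled faster than errors occur, no event falls out of the buffer unread, which is why the masking effect is rarely a problem in practice.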
Automatic error reporting and recovery
The second method of getting more information about internal problems is to configure the disk to report all internal errors and recoveries. For example, it is possible to specify that the disk should report if a read operation was successful, but [ANSI 1994, Räsänen 1994, Seagate 1992a]:
• error correction was needed,
• negative/positive head offset was needed,
• reassignment is recommended, or
• rewrite is recommended.
It is also possible to configure the SCSI disk to recover a possibly defective area at the slightest sign of a problem (by setting the ENABLE EARLY RECOVERY mode). This instructs the disk to use the most expedited form of error recovery at the expense of a higher risk of error mis-detection and mis-correction (i.e., the disk may consider a sector faulty even when it is actually not).
The third alternative, useful especially with a scanning algorithm, is to use the VERIFY command [Scritsmier 1996, ANSI 1994, Seagate 1992a]. This command reads the information from the medium, but the data is not transferred from the disk (i.e., the data is read from the disk surface into the internal buffers, the consistency of the data is checked, and errors are reported). This can speed up latent fault detection, as the disk controller is not loaded with the scanning data and the disk performs most of the work internally.
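A scanning pass built on VERIFY could be sketched as follows. The `verify` callable stands in for the actual SCSI VERIFY command and is purely hypothetical; only the addresses and a success/failure status cross the interface, never user data:

```python
# Sketch of a latent-fault scan: sweep the whole disk in fixed-size
# ranges using a VERIFY-style check, collecting ranges that fail.
def scan_disk(total_blocks: int, chunk: int, verify) -> list:
    """Issue verify(lba, length) over the whole disk and return the
    starting addresses of the ranges that reported a medium error."""
    bad_ranges = []
    for lba in range(0, total_blocks, chunk):
        length = min(chunk, total_blocks - lba)
        if not verify(lba, length):       # False: medium error reported
            bad_ranges.append(lba)
    return bad_ranges

# Simulated disk with a single latent fault at block 2500.
ok = lambda lba, length: not (lba <= 2500 < lba + length)
print(scan_disk(total_blocks=10_000, chunk=1_000, verify=ok))  # [2000]
```

In a real implementation the scan would run in the background at low priority, interleaved with the normal user load.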
The disk subsystem architectures that are used as examples in this thesis are based on practical implementations of the RAID concept, such as those reported in [Hillo 1993, Kemppainen 1991]. Figure 6 depicts a sample configuration of such an approach. This is a typical hardware disk array (HDA), where a disk controller implements the RAID algorithms and, from the point of view of the operating system, the array appears as a single logical disk.
An HDA is considered superior to a software disk array (SDA) [Kamunen 1994, Räsänen 1994, Hillo 1993]. An SDA has three major disadvantages. First, it depends strongly on the operating system (i.e., the array software is implemented as part of the operating system, hence requiring a special driver for every new operating system). Second, it is inferior to the HDA in handling faults in the array (e.g., booting from an SDA with one disk failed may be impossible). Third, the SDA suffers more from the high interrupt load caused by disk I/O, especially in more complex arrays such as RAID-5. Typically, a general-purpose operating system is not as efficient in handling a large number of interrupts as a dedicated real-time operating system on a controller board [Räsänen 1994, Kamunen 1994].
Figure 6. An example of a disk array configuration
Most disk array controllers are based on SCSI buses and SCSI disks. The transfer rate of a SCSI bus is sufficient to serve a large set of disks, especially in an OLTP environment where the average size of disk requests is typically small [IBM 1996c, Seagate 1996a, Ylinen 1994, Hillo 1993, Kari 1992, Miller 1991]. The number of disk buses is increased mainly to allow larger disk array configurations and a higher number of parallel I/Os, not to increase the data transfer capacity.
The main application environments that have been kept in mind while doing this research are database and file servers. First, the database servers are typical examples of systems with nonstop operation and strict response time requirements that must be met even during a degraded state. Second, both database and file servers have similar access patterns where disk accesses are spread unevenly over the disk space.
A hierarchical fault tolerant architecture supports several parallel disk array controllers in one or several computers. This provides a three-level fault tolerant architecture as illustrated in Figure 7 [Novell 1997]. The lowest level of fault tolerance is the device level, i.e., the disks themselves. The second level consists of mirrored disk controllers. Finally, the third level mirrors entire servers. For example, dual disk controllers can be used to survive controller faults, or mirrored computer systems can be used to survive faults in any other critical part.
In this research, the focus is on the device level fault tolerance (level I). Hence, the controller mirroring (level II) and server mirroring (level III) techniques are not discussed further in this thesis.
Figure 7. Three level hierarchical fault tolerant architecture
Six main RAID architectures were already listed in Chapter 1; here, they are briefly described. Besides these “standard” RAID architectures (RAID-0, RAID-1, RAID-2, RAID-3, RAID-4, and RAID-5), there are several RAID variants, but they are outside the scope of this thesis. The intention here is not to give a complete description of all RAID architectures but to give a short overview of the main principles. More detailed descriptions of the various RAID architectures can be found in [DPT 1993, RAB 1993, Lee 1991, Lee 1990, Katz 1989, Chen 1988, Patterson 1988, Salem 1986].
A single disk (sometimes denoted as RAID-0/1, striped array with only one disk) is often used in the models for comparison purposes. Some disk array controllers also support single disks for compatibility reasons [Hillo 1993, Kemppainen 1991].
Disk striping (RAID-0) provides no reliability improvement, but it is very often used in systems where reliability is not vital (e.g., recovery after a disk failure can be done using backups and log files). The reliability of such a disk subsystem decreases dramatically as the number of disks increases [Hillo 1993, Gibson 1991].
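The reliability penalty can be made concrete with a back-of-the-envelope calculation: with no redundancy, the array fails when any disk fails, so under the usual assumption of independent, exponentially distributed disk lifetimes the array MTTF is the disk MTTF divided by the number of disks. The figures below are illustrative only:

```python
# Simple MTTF estimate for a non-redundant (RAID-0) array, assuming
# independent disk failures with exponentially distributed lifetimes.
def raid0_mttf(disk_mttf_hours: float, disks: int) -> float:
    """Array fails on the first disk failure: MTTF_array = MTTF_disk / N."""
    return disk_mttf_hours / disks

print(raid0_mttf(500_000, 1))   # 500000.0 hours for a single disk
print(raid0_mttf(500_000, 10))  # 50000.0 -- ten disks, one tenth the MTTF
```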
Figure 8 illustrates the basic principle of the RAID-0 array. It has two main parameters: the number of disks (D) and the size of the stripe unit. Depending on the size of a request, one or several disks are accessed. The optimum size of the stripe unit depends on the user disk requests, the average disk access time and the data transfer rate [Hillo 1993, Chen 1990a].
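The RAID-0 address mapping can be sketched as follows; the parameter values are illustrative, not taken from any particular controller:

```python
# Sketch of RAID-0 striping: logical blocks are distributed round-robin
# over D disks in units of stripe_unit blocks.
def raid0_map(lba: int, disks: int, stripe_unit: int):
    """Return (disk index, block address on that disk) for a logical
    block address in a striped array."""
    stripe_no = lba // stripe_unit                         # which stripe unit overall
    disk = stripe_no % disks                               # round-robin over the disks
    offset = (stripe_no // disks) * stripe_unit + lba % stripe_unit
    return disk, offset

# With 5 disks and a stripe unit of 8 blocks:
print(raid0_map(0, 5, 8))   # (0, 0)
print(raid0_map(8, 5, 8))   # (1, 0) -- next stripe unit lands on the next disk
print(raid0_map(43, 5, 8))  # (0, 11) -- the sixth stripe unit wraps around
```

A request larger than one stripe unit thus spans several disks and can be served by them in parallel, which is the source of the RAID-0 performance gain.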
Figure 8. RAID-0 array with five disks
Disk mirroring (RAID-1) is a traditional method of achieving fault tolerance in a disk subsystem, as described for example in [RAB 1993, Gibson 1991]. Due to the two identical copies of the data, disk mirroring suffers from a high cost of redundancy (i.e., 50% of the disk space is spent on redundancy). One of the benefits of the mirroring concept is its way of handling write requests: mirroring has significantly smaller write overhead than, for example, the RAID-5 array. Under a sustained write load, the RAID-1 array performs as well as a single disk [Hillo 1993, Kemppainen 1991]. The RAID-1 array also has twice the read bandwidth of a single disk. In Figure 9, the basic principle of the RAID-1 array is illustrated.
It is possible to combine the mirrored and striped array architectures (i.e., RAID-0×RAID-1). The main idea is to mirror a RAID-0 array with an identical one. This allows the mirrored approach to reach capacities similar to those of the other RAID arrays (e.g., 2×50 disks).
RAID-2 arrays are designed for environments where high data transfer rates are required. As illustrated in Figure 10, the data is striped across multiple disks while some of the drives are dedicated to store additional ECC information that is calculated over the data disks.
Figure 9. RAID-1 array with two disks
Figure 10. RAID-2 array with eight data disks and four parity disks
As with RAID-2, the data is striped across multiple disks in the RAID-3 array. In this case, only one parity disk is used as illustrated in Figure 11. The size of the stripe unit can be either one bit or one byte. The error detection in the RAID-3 relies on the ECC embedded in each of the disks.
The parity is computed horizontally over the recorded bits BIT(n) … BIT(n+4) [RAB 1993]. For example, for the first parity row, PARITY0 is calculated as

PARITY0 = BIT0 ⊕ BIT1 ⊕ BIT2 ⊕ BIT3 ⊕ BIT4,

where ⊕ indicates the exclusive-or function. If, for example, the second data disk (storing BIT1) is faulty, the data is recovered as follows:

BIT1 = BIT0 ⊕ BIT2 ⊕ BIT3 ⊕ BIT4 ⊕ PARITY0.
The same principles are applicable also for byte oriented RAID-3 as well as with RAID-4 and RAID-5 arrays.
This example also identifies the main problem of the RAID-3, RAID-4, and RAID-5 arrays: if more than one bit/byte/block is missing in a row, the algorithm is no longer capable of reconstructing the data.
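The horizontal parity computation and single-failure recovery can be sketched bitwise as follows:

```python
# Horizontal XOR parity over one row of data bits, and recovery of a
# single missing bit from the survivors plus the parity.
from functools import reduce
from operator import xor

data = [1, 0, 1, 1, 0]          # BIT0 ... BIT4 of one row
parity = reduce(xor, data)      # PARITY0 = BIT0 xor ... xor BIT4
print(parity)                   # 1

# The disk storing BIT1 fails: recover it from the others plus parity.
survivors = data[:1] + data[2:]
recovered = reduce(xor, survivors + [parity])
print(recovered == data[1])     # True

# With two bits missing, the single parity gives one equation for two
# unknowns, so the row can no longer be reconstructed.
```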
The RAID-4 array uses the same principle as the RAID-3 array with the exception that, instead of a bit or a byte, the RAID-4 array uses a larger stripe unit (typically a multiple of the sector size). This allows the controller to issue several simultaneous read operations and one write operation to the RAID-4 array. Figure 12 illustrates the RAID-4 array configuration.
In RAID-4 and RAID-5 arrays, a small user write to, say, BLOCK2 is served by first reading the old BLOCK2 and the old PARITY0, computing the new parity from them and the new data, and finally writing the new BLOCK2 and the new PARITY0 onto the disks. Hence, a user write operation can very often be reduced to two disk read and two disk write operations for RAID-4 and RAID-5 arrays. An additional benefit is that the remaining disks can serve other disk requests at the same time.
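This read-modify-write cycle can be sketched for a single block as follows; the names and byte values are illustrative only:

```python
# Sketch of the RAID-4/5 small-write (read-modify-write) parity update:
# new parity = old parity XOR old data XOR new data.
def small_write(old_block: bytes, old_parity: bytes, new_block: bytes) -> bytes:
    """Compute the new parity block. Two reads (old block, old parity)
    and two writes (new block, new parity) serve one user write."""
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_block, new_block))

old_block  = bytes([0b1010])
old_parity = bytes([0b0110])
new_block  = bytes([0b0001])
new_parity = small_write(old_block, old_parity, new_block)
print(bin(new_parity[0]))  # 0b1101
```

Note that only the old data and old parity need to be read; the other data blocks of the stripe are untouched and remain free to serve concurrent requests.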