RAID-II: A High-Bandwidth Network File Server

Ann L. Drapeau, Ken W. Shirriff, John H. Hartman, Ethan L. Miller, Srinivasan Seshan, Randy H. Katz, Ken Lutz, David A. Patterson¹, Edward K. Lee², Peter M. Chen³, Garth A. Gibson⁴

Abstract

In 1989, the RAID (Redundant Arrays of Inexpensive Disks) group at U.C. Berkeley built a prototype disk array called RAID-I. The bandwidth delivered to clients by RAID-I was severely limited by the memory system bandwidth of the disk array's host workstation. We designed our second prototype, RAID-II, to deliver more of the disk array bandwidth to file server clients. A custom-built crossbar memory system called the XBUS board connects the disks directly to the high-speed network, allowing data for large requests to bypass the server workstation. RAID-II runs Log-Structured File System (LFS) software to optimize performance for bandwidth-intensive applications. The RAID-II hardware with a single XBUS controller board delivers 20 megabytes/second for large, random read operations and up to 31 megabytes/second for sequential read operations. A preliminary implementation of LFS on RAID-II delivers 21 megabytes/second on large read requests and 15 megabytes/second on large write operations.

1 Introduction

It is essential for future file servers to provide high-bandwidth I/O because of a trend toward bandwidth-intensive applications like multi-media, CAD, object-oriented databases and scientific visualization. Even in well-established application areas such as scientific computing, the size of data sets is growing rapidly due to reductions in the cost of secondary storage and the introduction of faster supercomputers. These developments require faster I/O systems to transfer the increasing volume of data. High performance file servers will increasingly incorporate disk arrays to provide greater disk bandwidth.
RAIDs, or Redundant Arrays of Inexpensive Disks [13], [5], use a collection of relatively small, inexpensive disks to achieve high performance and high reliability in a secondary storage system. RAIDs provide greater disk bandwidth to a file by striping or interleaving the data from a single file across a group of disk drives, allowing multiple disk transfers to occur in parallel. RAIDs ensure reliability by calculating error-correcting codes across a group of disks; this redundancy information can be used to reconstruct the data on disks that fail. In this paper, we examine the efficient delivery of bandwidth from a file server that includes a disk array.

¹ Computer Science Division, University of California, Berkeley, CA 94720
² DEC Systems Research Center, Palo Alto, CA 94301-1044
³ Computer Science and Engineering Division, University of Michigan, Ann Arbor, MI 48109-2122
⁴ School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3891

In 1989, the RAID group at U.C. Berkeley built an initial RAID prototype, called RAID-I [2]. The prototype was constructed using a Sun 4/280 workstation with 128 megabytes of memory, four dual-string SCSI controllers, twenty-eight 5.25-inch SCSI disks and specialized disk striping software. We were interested in the performance of a disk array constructed of commercially available components on two workloads: small, random operations typical of file servers in a workstation environment and large transfers typical of bandwidth-intensive applications. We anticipated that there would be both hardware and software bottlenecks that would restrict the bandwidth achievable by RAID-I. The development of RAID-I was motivated by our desire to gain experience with disk array software, disk controllers and the SCSI protocol.

Experiments with RAID-I show that it performs well when processing small, random I/Os, achieving approximately 275 four-kilobyte random I/Os per second.
However, RAID-I proved woefully inadequate at providing high-bandwidth I/O, sustaining at best 2.3 megabytes/second to a user-level application. By comparison, a single disk on RAID-I can sustain 1.3 megabytes/second. The bandwidth of nearly 26 of the 28 disks in the array is effectively wasted because it cannot be delivered to clients.

There are several reasons why RAID-I is ill-suited for high-bandwidth I/O. The most serious is the memory contention experienced on the Sun 4/280 workstation during I/O operations. The copy operations that move data between kernel DMA buffers and buffers in user space saturate the memory system when I/O bandwidth reaches 2.3 megabytes/second. Second, because all I/O on the Sun 4/280 goes through the CPU's virtually addressed cache, data transfers experience interference from cache flushes. Finally, disregarding the workstation's memory bandwidth limitation, high-bandwidth performance is also restricted by the low backplane bandwidth of the Sun 4/280's VME system bus, which becomes saturated at 9 megabytes/second.

The problems RAID-I experienced are typical of many workstations that are designed to exploit large, fast caches for good processor performance, but fail to support adequate primary memory or I/O bandwidth [8], [12]. In such workstations, the memory system is designed so that only the CPU has a fast, high-bandwidth path to memory. For busses or backplanes farther away from the CPU, available memory bandwidth drops quickly. Thus, file servers incorporating such workstations perform poorly on bandwidth-intensive applications.

This paper appeared in the 21st International Symposium on Computer Architecture, April 1994, pages 234–244.

The design of hardware and software for our second prototype, RAID-II [1], was motivated by a desire to deliver more of the disk array's bandwidth to the file server's clients.
Toward this end, the RAID-II hardware contains two data paths: a high-bandwidth path that handles large transfers efficiently using a custom-built crossbar interconnect between the disk system and the high-bandwidth network, and a low-bandwidth path used for control operations and small data transfers. The software for RAID-II was also designed to deliver disk array bandwidth. RAID-II runs LFS, the Log-Structured File System [14], developed by the Sprite operating systems group at Berkeley. LFS treats the disk array as a log, combining small accesses together and writing large blocks of data sequentially to avoid inefficient small write operations. A single XBUS board with 24 attached disks and no file system delivers up to 20 megabytes/second for random read operations and 31 megabytes/second for sequential read operations. With LFS software, the board delivers 21 megabytes/second on sequential read operations and 15 megabytes/second on sequential write operations. Not all of this data is currently delivered to clients because of client and network limitations described in Section 3.4.

This paper describes the design, implementation and performance of the RAID-II prototype. Section 2 describes the RAID-II hardware, including the design choices made to deliver bandwidth, architecture and implementation details, and hardware performance measurements. Section 3 discusses the implementation and performance of LFS running on RAID-II. Section 4 compares other high performance I/O systems to RAID-II, and Section 5 discusses future directions. Finally, we summarize the contributions of the RAID-II prototype.

2 RAID-II Hardware

Figure 1 illustrates the RAID-II storage architecture. The RAID-II storage server prototype spans three racks, with the two outer racks containing disk drives and the center rack containing custom-designed crossbar controller boards and commercial disk controller boards.
The center rack also contains a workstation, called the host, which is shown in the figure as logically separate. Each XBUS controller board has a HIPPI connection to a high-speed Ultra Network Technologies ring network that may also connect supercomputers and client workstations. The host workstation has an Ethernet interface that allows transfers between the disk array and clients connected to the Ethernet network.

In this section, we discuss the prototype hardware. First, we describe the design decisions that were made to deliver disk array bandwidth. Next, we describe the architecture and implementation, followed by microbenchmark measurements of hardware performance.

2.1 Delivering Disk Array Bandwidth

2.1.1 High and Low Bandwidth Data Paths

Disk array bandwidth on our first prototype was limited by low memory system bandwidth on the host workstation. RAID-II was designed to avoid similar performance limitations. It uses a custom-built memory system called the XBUS controller to create a high-bandwidth data path. This data path allows RAID-II to transfer data directly between the disk array and the high-speed, 100 megabytes/second HIPPI network without sending data through the host workstation memory, as occurs in traditional disk array storage servers like RAID-I. Rather, data are temporarily stored in memory on the XBUS board and transferred using a high-bandwidth crossbar interconnect between disks, the HIPPI network, the parity computation engine and memory.

There is also a low-bandwidth data path in RAID-II that is similar to the data path in RAID-I. This path transfers data across the workstation's backplane between the XBUS board and host memory. The data sent along this path include file metadata (data associated with a file other than its contents) and small data transfers. Metadata are needed by file system software for file management, name translation and for maintaining consistency between file caches on the host workstation and the XBUS board.
The low-bandwidth data path is also used for servicing data requests from clients on the 10 megabits/second Ethernet network attached to the host workstation.

We refer to the two access modes on RAID-II as the high-bandwidth mode for accesses that use the high-performance data path and standard mode for requests serviced by the low-bandwidth data path. Any client request can be serviced using either access mode, but we maximize utilization and performance of the high-bandwidth data path if smaller requests use the Ethernet network and larger requests use the HIPPI network.

2.1.2 Scaling to Provide Greater Bandwidth

The bandwidth of the RAID-II storage server can be scaled by adding XBUS controller boards to a host workstation. To some extent, adding XBUS boards is like adding disks to a conventional file server. An important distinction is that adding an XBUS board to RAID-II increases the I/O bandwidth available to the network, whereas adding a disk to a conventional file server only increases the I/O bandwidth available to that particular file server. The latter is less effective, since the file server's memory system and backplane will soon saturate if it is a typical workstation.

Eventually, adding XBUS controllers to a host workstation will saturate the host's CPU, since the host manages all disk and network transfers. However, since the high-bandwidth data path of RAID-II is independent of the host workstation, we can use a more powerful host to continue scaling storage server bandwidth.

2.2 Architecture and Implementation of RAID-II

Figure 2 illustrates the architecture of the RAID-II file server. The file server's backplane consists of two high-bandwidth (HIPPI) data busses and a low-latency (VME) control bus. The backplane connects the high-bandwidth network interfaces, several XBUS controllers and a host workstation that controls the operation of the XBUS controller boards.
Each XBUS board contains interfaces to the HIPPI backplane and VME control bus, memory, a parity computation engine and four VME interfaces that connect to disk controller boards.

[Figure 1 diagram labels: High Bandwidth Network (100 MB/s Ultranet); Supercomputer; Client Workstations; Host Workstation; Ethernet (10 Mb/s); RAID-II File Server; HIPPI, VME and SCSI links.]

Figure 1: The RAID-II File Server and its Clients. RAID-II, shown by dotted lines, is composed of three racks. The host workstation is connected to one or more XBUS controller boards (contained in the center rack of RAID-II) over the VME backplane. Each XBUS controller has a HIPPI network connection that connects to the Ultranet high-speed ring network; the Ultranet also provides connections to client workstations and supercomputers. The XBUS boards also have connections to SCSI disk controller boards. In addition, the host workstation has an Ethernet network interface that connects it to client workstations.

Figure 3 illustrates the physical packaging of the RAID-II file server, which spans three racks. To minimize the design effort, we used commercially available components whenever possible. Thinking Machines Corporation (TMC) provided a board set for the HIPPI interface to the Ultranet ring network; Interphase Corporation provided VME-based, dual-SCSI, Cougar disk controllers; Sun Microsystems provided the Sun 4/280 file server; and IBM donated disk drives and DRAM. The center rack is composed of three chassis. On top is a VME chassis containing eight Interphase Cougar disk controllers. The center chassis was provided by TMC and contains the HIPPI interfaces and our custom XBUS controller boards. The bottom VME chassis contains the Sun 4/280 host workstation. Two outer racks each contain 72 IBM SCSI disks (3.5-inch, 320 megabytes each) and their power supplies. Each of these racks contains eight shelves of nine disks.
RAID-II has a total capacity of 46 gigabytes, although the performance results in the next section are for a single XBUS board controlling 24 disks. We will achieve full array connectivity by adding XBUS boards and increasing the number of disks per SCSI string.

Figure 4 is a block diagram of the XBUS disk array controller board, which implements a 4x8, 32-bit wide crossbar interconnect called the XBUS. The crossbar connects four memory modules to eight system components called ports. The XBUS ports include two interfaces to HIPPI network boards, four VME interfaces to Interphase Cougar disk controller boards, a parity computation engine, and a VME interface to the host workstation. Each XBUS transfer involves one of the four memory modules and one of the eight XBUS ports. Each port was intended to support 40 megabytes/second of data transfer for a total of 160 megabytes/second of sustainable XBUS bandwidth.

The XBUS is a synchronous, multiplexed (address/data), crossbar-based interconnect that uses a centralized, priority-based arbitration scheme. Each of the eight 32-bit XBUS ports operates at a cycle time of 80 nanoseconds. The memory is interleaved in sixteen-word blocks. The XBUS supports only two types of bus transactions: reads and writes. Our implementation of the crossbar interconnect was fairly expensive, using 192 16-bit transceivers. Using surface mount packaging, the implementation required 120 square inches or approximately 20% of the XBUS controller's board area. An advantage of the implementation is that the 32-bit XBUS ports are relatively inexpensive.
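As a rough check on the XBUS figures quoted above, the short Python sketch below recomputes the per-port and aggregate bandwidth. The function names are illustrative and do not come from the paper. The raw per-port figure assumes one 32-bit word moves on every 80-nanosecond cycle, which ignores the cycles a multiplexed address/data bus spends on addresses and arbitration, so it should be read only as an upper bound.

```python
# Back-of-the-envelope XBUS bandwidth check (illustrative names, not from the paper).


def raw_port_bandwidth_mb_s(width_bits: int, cycle_ns: float) -> float:
    """Upper bound for one port, ASSUMING one word transfers on every cycle."""
    bytes_per_cycle = width_bits / 8
    cycles_per_second = 1e9 / cycle_ns
    return bytes_per_cycle * cycles_per_second / 1e6  # decimal megabytes/second


def aggregate_bandwidth_mb_s(memory_modules: int, per_transfer_mb_s: float) -> float:
    """Each transfer occupies one memory module, so the module count bounds concurrency."""
    return memory_modules * per_transfer_mb_s


if __name__ == "__main__":
    # Raw ceiling for a 32-bit port with an 80 ns cycle time.
    print(raw_port_bandwidth_mb_s(32, 80))
    # Four memory modules, each sustaining the 40 MB/s design target.
    print(aggregate_bandwidth_mb_s(4, 40))
```

Under these assumptions the raw port ceiling sits somewhat above the 40 megabytes/second design target, and four concurrent transfers at that target give the 160 megabytes/second total stated in the text.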
[Figure 2 diagram labels: two HIPPI links (1 Gb/s each); TMC-HIPPIS bus and TMC-HIPPID bus (64 bits, 100 MB/s each); TMC-VME bus (32 bits, 40 MB/s); host workstation with Ethernet; HIPPIS and HIPPID interfaces; four XBUS disk array controllers; Cougar disk controllers, each with SCSI0 and SCSI1 strings, attached over VME; RAID-II main chassis.]

Figure 2: Architecture of RAID-II File Server. The host workstation may have several XBUS controller boards attached to its VME backplane. Each XBUS controller board contains interfaces to HIPPI network source and destination boards, buffer memory, a high-bandwidth crossbar, a parity engine, and interfaces to four SCSI disk controller boards.

Two XBUS ports implement an interface between the XBUS and the TMC HIPPI boards. Each port is unidirectional, designed to sustain 40 megabytes/second of data transfer and bursts of 100 megabytes/second into 32 kilobyte FIFO interfaces.

Four of the XBUS ports are used to connect the XBUS board to four VME busses, each of which may connect to one or two dual-string Interphase Cougar disk controllers. In our current configuration, we connect three disks to each SCSI string, two strings to each Cougar controller, and one Cougar controller to each XBUS VME interface for a total of 24 disks per XBUS board. The Cougar disk controllers can transfer data at 8 megabytes/second, for a maximum disk bandwidth to a single XBUS board of 32 megabytes/second in our present configuration.

Of the remaining two XBUS ports, one interfaces to a parity computation engine. The last port is the VME control interface linking the XBUS board to the host workstation. It provides the host with access to the XBUS board's memory as well as its control registers.
This makes it possible for file server software running on the host to access network headers, file data and metadata in the XBUS memory.

Figure 3: The physical packaging of the RAID-II File Server. Two outer racks contain 144 disks and their power supplies. The center rack contains three chassis: the top chassis holds VME disk controller boards; the center chassis contains XBUS controller boards and HIPPI interface boards, and the bottom VME chassis contains the Sun 4/280 workstation.

[Figure 4 diagram labels: four 8 MB DRAM modules with 16-word interleave; 4 x 8 crossbar interconnect, 40 MB/s per port; ports HIPPIS, HIPPID, XOR, Server VME, VME0, VME1, VME2, VME3.]

Figure 4: Structure of XBUS controller board. The board contains four memory modules connected by a 4x8 crossbar interconnect to eight XBUS ports. Two of these XBUS ports are interfaces to the HIPPI network. Another is a parity computation engine. One is a VME network interface for sending control information, and the remaining four are VME interfaces connecting to commercial SCSI disk controller boards.

2.3 Hardware Performance

In this section, we present raw performance of the RAID-II [1] hardware, that is, the performance of the hardware without the overhead of a file system. In Section 3.4, we show how much of this raw hardware performance can be delivered by the file system to clients.

Figure 5 shows RAID-II performance for random reads and writes. We refer to these performance measurements as hardware system level experiments, since they involve all the components of the system from the disks to the HIPPI network. For these experiments, the disk system is configured as a RAID Level 5 [13] with one parity group of 24 disks. (This scheme delivers high bandwidth but exposes the array to data loss during dependent failure modes such as a SCSI controller failure. Techniques for maximizing reliability are beyond the scope of this paper [4], [16], [6].)
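The parity that a RAID Level 5 parity group relies on can be sketched in a few lines of Python. The paper does not spell out the parity engine's operation beyond labelling the port "XOR" in Figure 4, so the sketch below assumes the conventional bytewise XOR parity; the function names, block size and block contents are all hypothetical.

```python
# Bytewise XOR parity over one stripe, as conventionally used in RAID Level 5.
# Illustrative sketch only; names and sizes are assumptions, not the RAID-II code.
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) walks the blocks column by column, one byte position at a time.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def reconstruct(surviving_blocks):
    """Rebuild the single missing block from the survivors (data plus parity)."""
    return xor_blocks(surviving_blocks)


if __name__ == "__main__":
    # 23 data blocks plus 1 parity block: one parity group spanning 24 disks.
    data = [bytes([i] * 8) for i in range(1, 24)]
    parity = xor_blocks(data)
    lost = 5  # pretend the disk holding data[5] has failed
    survivors = data[:lost] + data[lost + 1:] + [parity]
    assert reconstruct(survivors) == data[lost]
    print("reconstructed block matches")
```

Because XOR is its own inverse, the same routine serves both to compute parity on writes and to rebuild a lost block, which is why a single failed disk in the group can be tolerated.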
For reads, data are read from the disk array into the memory on the XBUS board; from there, data are sent over HIPPI, back to the XBUS board, and into XBUS memory. For writes, data originate in XBUS memory, are sent over the HIPPI and then back to the XBUS board to XBUS memory; parity is computed, and then both data and parity are written to the disk array. For both tests, the system is configured with four Interphase Cougar disk controllers, each with two strings of three disks. For both reads and writes, subsequent fixed size operations are at random locations. Figure 5 shows that, for large requests, hardware system level read and write performance reaches about 20 megabytes/second. The dip in read performance for requests of size 768 kilobytes occurs because at that size the striping scheme in this experiment involves a second string on one of the controllers; there is some contention on the controller that results in lower performance when both strings are used. Writes are slower than reads due to the increased disk and memory activity associated with computing and writing parity. While an order of magnitude faster than our previous prototype, RAID-I, this is still well below our target bandwidth of 40 megabytes/second. Below, we show that system performance is limited by that of the commercial disk controller boards and our disk interfaces.

[Figure 5 plot: throughput (MB/s, axis 0 to 25) versus request size (KB, axis 0 to 1500), with one series for writes and one for reads.]

Figure 5: Hardware System Level Read and Write Performance. RAID-II achieves approximately 20 megabytes/second for both random reads and writes.

Table 1 shows peak performance of the system when sequential read and write operations are performed. These measurements were obtained using the four Cougar boards attached to the XBUS VME interfaces, and in addition, using a fifth Cougar board attached to the XBUS VME control bus interface. For requests of size 1.6 megabytes, read performance is 31 megabytes/second, compared to 23 megabytes/second for writes. Write performance is worse than read performance because of the extra accesses required during writes to calculate and write parity and because sequential reads benefit from the read-ahead performed into track buffers on the disks; writes have no such advantage on these disks.

Operation            Peak Performance (megabytes/second)
Sequential reads     31
Sequential writes    23

Table 1: Peak read and write performance for an XBUS board on sequential operations. This experiment was run using the four XBUS VME interfaces to disk controller boards, with a fifth disk controller attached to the XBUS VME control bus interface.

While one of the main goals in the design of RAID-II was to provide better performance than our first prototype on high-bandwidth operations, we also want the file server to perform well on small, random operations. Table 2 compares the I/O rates achieved on our two disk array prototypes, RAID-I and RAID-II, using a test program that performed random 4 kilobyte reads. In each case, fifteen disks were accessed. The tests used Sprite operating system read and write system calls; the RAID-II test did not use LFS. In each case, a separate process issued 4 kilobyte, randomly distributed I/O requests to each active disk in the system. The performance with a single disk indicates the difference between the Seagate Wren IV disks used in RAID-I and the IBM 0661 disks used in RAID-II. The IBM disk drives have faster rotation and seek times, making random operations quicker. With fifteen disks active, both systems perform well, with RAID-II achieving over 400 small accesses per second. In both prototypes, the small, random I/O rates were limited by the large number of context switches required on the Sun 4/280 workstation to handle request completions. RAID-I's performance was also limited by the memory bandwidth on the Sun 4/280, since all data went into host memory. Since data need not be transferred to the host workstation in RAID-II, it delivers a higher percentage (78%) of the potential I/O rate from its fifteen disks than does RAID-I (67%).

System     1 disk I/O rate (I/Os/sec)    15 disks I/O rate (I/Os/sec)
RAID-I     27                            274
RAID-II    36                            422

Table 2: Peak I/O rates in operations per second for random, 4 kilobyte reads. Performance for a single disk and for fifteen disks, with a separate process issuing random I/O operations to each disk. The IBM 0661 disk drives used in RAID-II can perform more I/Os per second than the Seagate Wren IV disks used in RAID-I because they have shorter seek and rotation times. Superior I/O rates on RAID-II are also the result of the architecture, which does not require data to be transferred through the host workstation.

The remaining performance measurements in this section are for specific components of the XBUS board. Figure 6 shows the performance of the HIPPI network and boards. Data are transferred from the XBUS memory to the HIPPI source board, and then to the HIPPI destination board and back to XBUS memory. Because the network is configured as a loop, there is minimal network protocol overhead. This test focuses on the HIPPI interfaces' raw hardware performance; it does not include measurements for sending data over the Ultranet. In the loopback mode, the overhead of sending a HIPPI packet is about 1.1 milliseconds, mostly due to setting up the HIPPI and XBUS control registers across the slow VME link. (By comparison, an Ethernet packet takes approximately 0.5 millisecond to transfer.) Due to this control overhead, small requests result in low performance on the HIPPI network. For large requests, however, the XBUS and HIPPI boards support 38 megabytes/second in both directions, which is very close to the maximum bandwidth of the XBUS ports. The HIPPI loopback bandwidth of 38 megabytes/second is three times greater than FDDI and two orders of magnitude greater than Ethernet. Clearly the HIPPI part of the XBUS is not the limiting factor in determining hardware system level performance.

[Figure 6 plot: throughput (MB/s, axis 0 to 40) versus request size (KB, axis ticks from 1 to 100000 in powers of ten).]

Figure 6: HIPPI Loopback Performance. RAID-II achieves 38.5 megabytes/second in each direction.

[Figure 7 plot: throughput (MB/s, axis 0 to 6) versus number of disks (axis 0 to 5).]

Figure 7: Disk read performance for varying number of disks on a single SCSI string. Cougar string bandwidth is limited to about 3 megabytes/second, less than that of three disks. The dashed line indicates the performance if bandwidth scaled linearly.

Disk performance is responsible for the lower-than-expected hardware system level performance of RAID-II. One performance limitation is the Cougar disk controller, which only supports about 3 megabytes/second on each of two SCSI strings, as shown in Figure 7. The other limitation is our relatively slow, synchronous VME interface ports, which only support 6.9 megabytes/second on read operations and 5.9 megabytes/second on write operations. This low performance is largely due to the difficulty of synchronizing the asynchronous VME bus for interaction with the XBUS.

3 The RAID-II File System

In the last section, we presented measurements of the performance of the RAID-II hardware without the overhead of a file system. Although some applications may use RAID-II as a large raw disk, most will require a file system with abstractions such as directories and files. The file system should deliver to applications as much of the bandwidth of the RAID-II disk array hardware as possible. RAID-II's file system is a modified version of the Sprite Log-Structured File System (LFS) [14]. In this section, we first describe LFS and explain why it is particularly well-suited for use with disk arrays.
Second, we describe the features of RAID-II that require special attention in the LFS implementation. Next, we illustrate how LFS on RAID-II handles typical read requests from a client. Finally, we discuss the measured performance and implementation status of LFS on RAID-II.

3.1 The Log-Structured File System

LFS [14] is a disk storage manager that writes all file data and metadata to a sequential append-only log; files are not assigned fixed blocks on disk. LFS minimizes the number of small write operations to disk; data for small writes are buffered in main memory and are written to disk asynchronously when a log segment fills. This is in contrast to traditional file systems, which assign files to fixed blocks on disk. In traditional file systems, a sequence of random file writes results in inefficient small, random disk accesses.

Because it avoids small write operations, the Log-Structured File System avoids a weakness of redundant disk arrays that affects traditional file systems. Under a traditional file system, disk arrays that use large block interleaving (Level 5 RAID [13]) perform poorly on small write operations because each small write requires four disk accesses: reads of the old data and parity blocks and writes of the new data and parity blocks. By contrast, large write operations in disk arrays are efficient since they don't require the reading of old data or parity. LFS eliminates small writes, grouping them into efficient large, sequential write operations.

A secondary benefit of LFS is that crash recovery is very quick. LFS periodically performs checkpoint operations that record the current state of the file system. To recover from a file system crash, the LFS server need only process the log from the position of the last checkpoint. By contrast, a UNIX file system consistency checker traverses the entire directory structure in search of lost data.
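A toy model may help make the two LFS properties described in this section concrete: small writes are buffered and written as whole segments, and recovery replays only the log written since the last checkpoint. The Python sketch below is purely illustrative and is not the Sprite LFS code; the class name, the four-block segment size and the in-memory "disk" are all assumptions made for the example.

```python
# Toy log-structured store (illustrative only; NOT the Sprite LFS implementation).


class ToyLog:
    def __init__(self, segment_blocks=4):
        self.segment_blocks = segment_blocks  # blocks per segment (assumed value)
        self.disk = []             # segments written so far, in log order
        self.buffer = []           # in-memory segment currently being filled
        self.index = {}            # (file, block) -> (segment number, slot)
        self.checkpoint = (0, {})  # (segments covered, index snapshot)
        self.disk_writes = 0       # count of large sequential writes issued

    def write(self, file, block, data):
        """Buffer a small write; flush only when the segment fills."""
        self.buffer.append((file, block, data))
        if len(self.buffer) == self.segment_blocks:
            self.flush()

    def flush(self):
        """Append the buffered segment to the log in one large write."""
        if not self.buffer:
            return
        seg_no = len(self.disk)
        self.disk.append(list(self.buffer))
        for slot, (file, block, _) in enumerate(self.buffer):
            self.index[(file, block)] = (seg_no, slot)
        self.buffer.clear()
        self.disk_writes += 1

    def take_checkpoint(self):
        """Record the current index and how much of the log it covers."""
        self.flush()
        self.checkpoint = (len(self.disk), dict(self.index))

    def recover(self):
        """Rebuild the index from the checkpoint plus the log tail only."""
        covered, snapshot = self.checkpoint
        index = dict(snapshot)
        for seg_no in range(covered, len(self.disk)):
            for slot, (file, block, _) in enumerate(self.disk[seg_no]):
                index[(file, block)] = (seg_no, slot)
        return index

    def read(self, file, block):
        """Return the latest flushed copy of a block (buffered data is not visible)."""
        seg_no, slot = self.index[(file, block)]
        return self.disk[seg_no][slot][2]
```

In this model, eight one-block writes produce only two disk writes, and `recover()` scans just the segments appended after `take_checkpoint()` rather than the whole log, which mirrors the reason checkpoint-based recovery stays fast as the file system grows.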
Because a RAID has a very large storage capacity, a standard UNIX-style consistency check on boot would be unacceptably slow. For a 1 gigabyte file system, it takes a few seconds to perform an LFS file system check, compared with approximately 20 minutes to check the consistency of a typical UNIX file system of comparable size.

3.2 LFS on RAID-II

An implementation of LFS for RAID-II must efficiently handle two architectural features of the disk array: the separate low- and high-bandwidth data paths, and the separate memories on the host workstation and the XBUS board.

To handle the two data paths, we changed the original Sprite LFS software to separate the handling of data and metadata. We use the high-bandwidth data path to transfer data between the disks and the HIPPI network, bypassing host workstation memory. The low-bandwidth VME connection between the XBUS and the host workstation is used to send metadata to the host for file name lookup and location of data on disk. The low-bandwidth path is also used for small data transfers to clients on the Ethernet attached to the host.

LFS on RAID-II also must manage separate memories on the host workstation and the XBUS board. The host memory cache contains metadata as well as files that have been read into workstation memory for transfer over the Ethernet. The cache is managed with a simple Least Recently Used replacement policy. Memory on the XBUS board is used for prefetch buffers, pipelining buffers, HIPPI network buffers, and write buffers for LFS segments. The file system keeps the two caches consistent and copies data between them as required.

To hide some of the latency required to service large requests, LFS performs prefetching into XBUS memory buffers. LFS on RAID-II uses XBUS memory to pipeline network and disk operations. For read operations, while one block of data is being sent across the network, the next blocks are being read off the disk.
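The overlap of disk reads and network sends described above follows a familiar bounded-buffer pattern, which the Python sketch below illustrates. It is a generic producer/consumer example under stated assumptions, not the RAID-II file system code: `read_block` and `send_block` are hypothetical callables standing in for the RAID driver and the network code, `num_buffers` stands in for a fixed pool of XBUS memory buffers, and a single reader thread is used for simplicity.

```python
# Producer/consumer sketch of pipelined disk reads and network sends.
# Hypothetical stand-ins throughout; this is not the RAID-II implementation.
import queue
import threading


def pipelined_read(read_block, send_block, num_blocks, num_buffers=2):
    """Overlap read_block(i) calls with send_block(data) calls.

    num_buffers bounds how far disk reads may run ahead of network sends.
    Returns the number of blocks sent.
    """
    buffers = queue.Queue(maxsize=num_buffers)
    done = object()  # sentinel marking the end of the transfer

    def disk_reader():
        for i in range(num_blocks):
            buffers.put(read_block(i))  # blocks while every buffer is full
        buffers.put(done)

    reader = threading.Thread(target=disk_reader, daemon=True)
    reader.start()

    sent = 0
    while True:
        data = buffers.get()            # blocks until a disk read completes
        if data is done:
            break
        send_block(data)                # the reader fetches later blocks meanwhile
        sent += 1
    reader.join()
    return sent


if __name__ == "__main__":
    received = []
    count = pipelined_read(lambda i: bytes([i]) * 4, received.append, num_blocks=5)
    print(count, received)
```

The FIFO queue keeps blocks in order, so the consumer sends them in the same sequence the reader produced them; a larger `num_buffers` lets reads get further ahead at the cost of more buffer memory.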
For write operations, full LFS segments are written to disk while newer segments are being filled with data. We are also experimenting with prefetching techniques so small sequential reads can also benefit from overlapping disk and network operations.

From the user's perspective, the RAID-II file system looks like a standard file system to clients using the slow Ethernet path (standard mode), but requires relinking of applications for clients to use the fast HIPPI path (high-bandwidth mode). To access data over the Ethernet, clients simply use NFS or the Sprite RPC mechanism. The fast data path across the Ultranet uses a special library of file system operations for RAID files: open, read, write, etc. The library converts file operations to operations on an Ultranet socket between the client and the RAID-II server. The advantage of this approach is that it doesn't require changes to the client operating system. Alternatively, these operations could be moved into the kernel on the client and would be transparent to the user-level application. Finally, some operating systems offer dynamically loadable device drivers that would eliminate the need for relinking applications; however, Sprite does not allow dynamic loading.

3.3 Typical High and Low Bandwidth Requests

This section explains how the RAID-II file system handles typical open and read requests from a client over the fast Ultra Network Technologies network and over the slow Ethernet network attached to the host workstation.

In the first example, the client is connected to RAID-II across the high-bandwidth Ultranet network. The client application is linked with a small library that converts RAID file operations into operations on an Ultranet socket connection. First, the application running on the client performs an open operation by calling the library routine raid_open. This call is transformed by the library into socket operations. The library opens a socket between the client and the RAID-II file server.
Then the client sends an open command to RAID-II. Finally, the RAID-II host opens the file and informs the client.

Now the application can perform read and write operations on the file. Suppose the application does a large read using the library routine raid_read. As before, the library converts this read operation into socket operations, sending RAID-II a read command that includes file position and request length. RAID-II handles a read request by pipelining disk reads and network sends. First, the file system code allocates a buffer in XBUS memory, determines the position of the data on disk, and calls the RAID driver code to read the first block of data into XBUS memory. When the read has completed, the file system calls the network code to send the data from XBUS memory to the client. Meanwhile, the file system allocates another XBUS buffer and reads the next block of data. Network sends and disk reads continue in parallel until the transfer is complete. LFS may have several pipeline processes issuing read requests, allowing disk reads to get ahead of network send operations for efficient network transfers.

On the client side, the network data blocks are received and read into the application memory using socket read operations. When all the data blocks have been received, the library returns from the raid_read call to the application.

In the second example, we illustrate the handling of a read request sent over the relatively slow Ethernet. A client sends standard file system open and read calls to the host workstation across its Ethernet connection. The host workstation sends the read command over the VME link to the XBUS board. The XBUS board responds by reading the data from disk into XBUS memory and transferring the data to the host workstation. Finally, the host workstation packages the data into Ethernet packets and sends the packets to the client.

3.4 Performance and Status

This section describes the performance of LFS running on RAID-II.
These measurements show that LFS performs efficiently on large random read and write operations, delivering about 65% of the raw bandwidth of the hardware. LFS also achieves good performance on small random write operations by grouping small writes into larger segments.

All the measurements presented in this section use a single XBUS board with 16 disks. The LFS log is interleaved or “striped” across the disks in units of 64 kilobytes. The log is written to the disk array in units or “segments” of 960 kilobytes. The file system for RAID-II is still under development; it currently handles creation, reads, and writes of files, but LFS cleaning (garbage collection of free disk space in the log) has not yet been implemented. Because we don’t have sufficiently fast network clients, many of the file server measurements in this section don’t actually send data over the network; rather, data are written to or read from network buffers in XBUS memory.

Figure 8 shows performance for random read and write operations for RAID-II running LFS. For each request type, a single process issued requests to the disk array. For both reads and writes, data are transferred to/from network buffers, but do not actually go across the network. The figure shows that for random read requests larger than 10 megabytes, which are effectively sequential in nature, the file system delivers up to 20 megabytes/second, approximately 68% of the raw sequential hardware bandwidth described in Table 1. Smaller random reads have lower bandwidth due to an average overhead of 23 milliseconds per operation: 4 milliseconds of file system overhead and 19 milliseconds of disk overhead. The file system overhead should be reduced substantially as LFS performance is tuned.

The graph also shows random write performance. For random write requests above approximately 512 kilobytes in size, the file system delivers close to its maximum value of 15 megabytes/second.
This value is approximately 65% of the raw sequential write performance reported in Table 1. Random writes perform as well as sequential writes at much smaller request sizes than was the case for read operations. This is because LFS groups together random write operations and writes them to the log sequentially.

The bandwidth for small random write operations is better than bandwidth for small random reads, suggesting that LFS has been successful in mitigating the penalty for small writes on a RAID Level 5 architecture. Small write performance is, however, limited because of approximately 3 milliseconds of network and file system overhead per request.

[Figure 8: Performance of RAID-II running the LFS file system. Plot of bandwidth (MB/s) against request size (KB, logarithmic scale) for reads (+) and writes (x).]

Figure 8 showed RAID-II performance without sending data over the network. Next we consider network performance for reads and writes to a single client workstation; we have not yet run these tests for multiple clients.

Performance on write operations to RAID-II is limited by the client; we currently don’t have a client that can deliver enough bandwidth to saturate the disk array. A SPARCstation 10/51 client on the HIPPI network writes data to RAID-II at 3.1 megabytes per second. Bandwidth is limited on the SPARCstation because its user-level network interface implementation performs many copy operations. RAID-II is capable of scaling to much higher bandwidth; utilization of the Sun 4/280 workstation due to network operations is close to zero with the single SPARCstation client writing to the disk array.

Performance on network reads from the disk array is limited by our initial network device driver implementation. To perform a data transfer, we must synchronize operation of the HIPPI source board and the host.
Although the source board is able to interrupt the host, this capability was not used in the initial device driver implementation. Rather, for reasons of simplicity, the host workstation waits while data are being transmitted from the source board to the network, checking a bit that indicates when the data transfer is complete. This rather inefficient implementation limits RAID-II read operations for a single SPARCstation client to 3.2 megabytes/second. In the implementation currently being developed, the source board will interrupt the CPU when a transfer is complete. With this modification, we expect to deliver the full file system bandwidth of RAID-II to clients on the network, although the network implementation on the SPARCstation will still limit the bandwidth to an individual workstation client.

4 Existing File Server Architectures

RAID-II is not the only system offering high I/O performance. In this section, we highlight some important characteristics of several existing high-performance I/O systems and explain how they relate to the design of RAID-II.

4.1 File Servers for Workstations

File servers for a workstation computing environment are not designed to provide high-bandwidth file access to individual files. Instead, they are designed to provide low-latency service to a large number of clients via NFS [15]. A typical NFS file server is a UNIX workstation that is transformed into a server by adding memory, network interfaces, disk controllers, and possibly non-volatile memory to speed up NFS writes. All workstation vendors offer NFS servers of this type.

The problem with using a workstation as a file server is that all data transfers between the network and the disks must pass across the workstation’s backplane, which can severely limit transfer bandwidth. It has been shown that the bandwidth improvements in workstation memory and I/O systems have not kept pace with improvements in processor speed [12].
One way of achieving higher performance than is possible from a UNIX workstation server is to use special-purpose software and hardware. An example is the FAServer NFS file server [9]. The FAServer does not run UNIX; instead it runs a special-purpose operating system and file system tuned to provide efficient NFS file service. This reduces the overhead of each NFS operation and allows the FAServer to service many more clients than a UNIX-based server running on comparable hardware. The underlying hardware for the FAServer is a standard workstation, however, so the FAServer suffers from the same backplane bottleneck as a UNIX workstation when handling large transfers.

Another special-purpose NFS file server is the Auspex NS6000 [11]. The Auspex is a functional multi-processor, meaning that the tasks required to process an NFS request are divided among several processors. Each processor runs its own special-purpose software, tailored to the task it must perform. The processors are connected together and to a cache memory by a high-speed bus running at 55 megabytes/second. The Auspex does contain a host workstation that runs UNIX, but it is not involved in serving NFS requests. File data are transferred between the network and the disks over the high-speed bus without being copied into the workstation memory. In this sense the Auspex architecture is similar to that of RAID-II, but it differs in the use of special-purpose processors to handle requests.

4.2 Supercomputer File Systems

Supercomputers have long used high performance I/O systems. These systems fall into two categories: high-speed disks attached directly to the supercomputer and mainframe-managed storage connected to the supercomputer via a network.

The first type of supercomputer file system consists of many fast parallel transfer disks directly attached to a supercomputer’s I/O channels.
Each high-speed disk might transfer at a rate of 10 megabytes/second; typically, data might be striped across 40 of these disks. Data are usually copied onto these high-speed disks before an application is run on the supercomputer [10]. These disks do not provide permanent storage for users. The file system is not designed to provide efficient small transfers.

The other type of supercomputer file system is the mainframe-managed mass storage system. A typical system, such as the NASA Ames mass storage system NAStore [17], provides both disk and tertiary storage to clients over a network. The bandwidth of such systems is typically lower than that of the parallel transfer disk systems. NAStore, for example, sustains 10 megabytes/second to individual files.

RAID-II was designed to perform efficiently on small file accesses as well as on large transfers. A system similar in design philosophy to RAID-II is Los Alamos National Laboratory’s High Performance Data System (HPDS) [3]. Like RAID-II, HPDS attempts to provide low latency on small transfers and control operations as well as high bandwidth on large transfers. HPDS connects an IBM RAID Level 3 disk array to a HIPPI network; like RAID-II, it has a high-bandwidth data path and a low-bandwidth control path. The main difference between HPDS and RAID-II is that HPDS uses a bit-interleaved, or RAID Level 3, disk array, whereas RAID-II uses a flexible, crossbar interconnect that can support many different RAID architectures. In particular, RAID-II supports RAID Level 5, which can execute several small, independent I/Os in parallel. RAID Level 3, on the other hand, supports only one small I/O at a time.

5 Future Directions

5.1 Planned Use of RAID-II

We plan to use RAID-II as a storage system for two new research projects at Berkeley. As part of the Gigabit Test Bed project being conducted by the U.C.
Berkeley Computer Science Division and Lawrence Berkeley Laboratory (LBL), RAID-II will act as a high-bandwidth video storage and playback server. Data collected from an electron microscope at LBL will be sent from a video digitizer across an extended HIPPI network for storage on RAID-II and processing by a MASPAR multiprocessor.

The InfoPad project at U.C. Berkeley will use the RAID-II disk array as an information server. In conjunction with the real time protocols developed by Prof. Domenico Ferrari’s Tenet research group, RAID-II will provide video storage and play-back from the disk array to a network of base stations. These base stations in turn drive the radio transceivers of a pico-cellular network to which hand-held display devices will connect.

5.2 Striping Across File Servers: The Zebra File System

Zebra [7] is a network file system designed to provide high-bandwidth file access by striping files across multiple file servers. Its use with RAID-II would provide a mechanism for striping high-bandwidth file accesses over multiple network connections, and therefore across multiple XBUS boards. Zebra incorporates ideas from both RAID and LFS: from RAID, the ideas of combining many relatively low-performance devices into a single high-performance logical device, and using parity to survive device failures; and from LFS the concept of treating the storage system as a log, so that small writes and parity updates are avoided.

There are two aspects of Zebra that make it particularly well-suited for use with RAID-II. First, the servers in Zebra perform very simple operations, merely storing blocks of the logical log of files without examining the content of the blocks. Little communication would be needed between the XBUS board and the host workstation, allowing data to flow between the network and the disk array efficiently. Second, Zebra, like LFS, treats the disk array as a log, eliminating inefficient small writes.
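The idea of striping a log across servers with parity, so that the contents of any one failed server can be rebuilt, can be illustrated with a minimal sketch. It is not Zebra's actual fragment format or protocol: the function names, the fragment size, the zero padding, and the choice of placing parity last are all assumptions made for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data, num_servers, fragment_size):
    """Split data into (num_servers - 1) data fragments plus one parity fragment.

    Hypothetical layout: data fragments first, parity fragment last, with the
    data zero-padded to fill the stripe.
    """
    num_data = num_servers - 1
    if num_data < 1:
        raise ValueError("need at least two servers")
    if len(data) > num_data * fragment_size:
        raise ValueError("data does not fit in one stripe")
    padded = data.ljust(num_data * fragment_size, b"\0")
    fragments = [
        padded[i * fragment_size:(i + 1) * fragment_size]
        for i in range(num_data)
    ]
    return fragments + [xor_blocks(fragments)]


def reconstruct(stripe, lost_index):
    """Rebuild the fragment held by a failed server from the surviving ones."""
    survivors = [frag for i, frag in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    stripe = make_stripe(b"log segment data", num_servers=4, fragment_size=8)
    # Pretend server 1 failed and rebuild its fragment from the other three.
    print(reconstruct(stripe, 1) == stripe[1])
```

Because parity is computed over a whole stripe of new log data at once, no existing data or parity has to be read back before writing, which is the property the text attributes to treating storage as a log.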
6 Conclusions

The RAID-II disk array prototype was designed to deliver much more of the available disk array bandwidth to file server clients than was possible in our first prototype. The most important feature of the RAID-II hardware is the high-bandwidth data path that allows data to be transferred between the disks and the high-bandwidth network without passing through the low-bandwidth memory system of the host workstation. We implement this high-bandwidth data path with a custom-built controller board called the XBUS board. The XBUS controller uses a 4×8 crossbar-based interconnect with a peak bandwidth of 160 megabytes/second. File system software is also essential to delivering high bandwidth. RAID-II runs LFS, the Log-Structured File System, which lays out files in large contiguous segments to provide high-bandwidth access to large files, groups small write requests into large write requests to avoid wasting disk bandwidth, and provides fast crash recovery. We are pleased with the reliability of the XBUS controller and disk subsystem. The system has been in continuous operation for six months with no failures.

Performance results for the RAID-II hardware and the Log-Structured File System are good. The RAID-II hardware with a single XBUS controller board achieves about 20 megabytes/second for random read and write operations and 31 megabytes/second for sequential read operations. This performance is an order of magnitude improvement compared to our first prototype, but somewhat disappointing compared to our performance goal of 40 megabytes/second. The performance is limited by the SCSI disk controller boards and by the bandwidth capabilities of our disk interface ports. RAID-II performs well on small, random I/O operations, achieving approximately 400 four-kilobyte random I/Os per second to fifteen disks.
On RAID-II configured with a single XBUS board, a preliminary implementation of LFS delivers 21 megabytes/second on large random read operations and 15 megabytes/second on moderately large random write operations, with data transferred to and from HIPPI network buffers on the XBUS board. Preliminary network performance shows that RAID-II can read and write 3 megabytes/second to a single SPARCstation client. This network interface is currently being tuned; we soon expect to deliver the full RAID-II file system bandwidth to clients.

If the RAID-II project were developed with the network hardware available today, ATM network interfaces would likely be a better choice than HIPPI. The use of HIPPI in RAID-II had several disadvantages. Connections to HIPPI remain very expensive, especially when the cost of switches needed to implement a network is included. While the bandwidth capabilities of HIPPI are a good match for supercomputer clients, the overhead of setting up transfers makes HIPPI a poor match for supporting a large number of client workstations.

Finally, unless workstation memory bandwidth improves considerably, we believe a separate high-bandwidth data path like the XBUS board that connects the disks and the network directly is necessary to deliver high bandwidth I/O from a workstation-based network file server. The use of Log-Structured File System software maximizes the performance of the disk array for bandwidth-intensive applications.

Acknowledgments

We would like to thank John Ousterhout for his suggestions for improving this paper and his participation in the development of RAID-II. We would also like to thank Mendel Rosenblum, Mary Baker, Rob Pfile, Rob Quiros, Richard Drewes and Mani Varadarajan for their contributions to the success of RAID-II. Special thanks to Theresa Lessard-Smith and Bob Miller.
This research was supported by Array Technologies, DARPA/NASA (NAG2-591), DEC, Hewlett-Packard, IBM, Intel Scientific Computers, California MICRO, NSF (MIP 8715235), Seagate, Storage Technology Corporation, Sun Microsystems and Thinking Machines Corporation.

References

[1] Peter M. Chen, Edward K. Lee, Ann L. Drapeau, Ken Lutz, Ethan L. Miller, Srinivasan Seshan, Ken Shirriff, David A. Patterson, and Randy H. Katz. Performance and Design Evaluation of the RAID-II Storage Server. International Parallel Processing Symposium Workshop on I/O in Parallel Computer Systems, April 1993. Also invited for submission to the Journal of Distributed and Parallel Databases, to appear.
[2] Ann L. Chervenak and Randy H. Katz. Performance of a Disk Array Prototype. In Proceedings SIGMETRICS, May 1991.
[3] Bill Collins. High-Performance Data Systems. In Digest of Papers. Eleventh IEEE Symposium on Mass Storage Systems, October 1991.
[4] Garth A. Gibson, R. Hugo Patterson, and M. Satyanarayanan. Disk Reads with DRAM Latency. Third Workshop on Workstation Operating Systems, April 1992.
[5] Garth Alan Gibson. Redundant Disk Arrays: Reliable, Parallel Secondary Storage. PhD thesis, U. C. Berkeley, April 1991. Technical Report No. UCB/CSD 91/613.
[6] Jim Gray, Bob Horst, and Mark Walker. Parity Striping of Disc Arrays: Low-Cost Reliable Storage with Acceptable Throughput. In Proceedings Very Large Data Bases, pages 148–161, 1990.
[7] John H. Hartman and John K. Ousterhout. The Zebra Striped Network File System. In Proceedings of the 14th ACM Symposium on Operating Systems Principles, December 1993.
[8] John L. Hennessy and Norman P. Jouppi. Computer Technology and Architecture: An Evolving Interaction. IEEE Computer, 24:18–29, September 1991.
[9] David Hitz. An NFS File Server Appliance. Technical Report TR01, Network Appliance Corporation, August 1993.
[10] Reagan W. Moore. File Servers, Networking, and Supercomputers.
Technical Report GA-A20574, San Diego Supercomputer Center, July 1991.
[11] Bruce Nelson. An Overview of Functional Multiprocessing for NFS Network Servers. Technical Report 1, Auspex Engineering, July 1990.
[12] John K. Ousterhout. Why Aren’t Operating Systems Getting Faster as Fast as Hardware? In Proceedings USENIX Technical Conference, June 1990.
[13] David A. Patterson, Garth Gibson, and Randy H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings ACM SIGMOD, pages 109–116, June 1988.
[14] Mendel Rosenblum and John Ousterhout. The Design and Implementation of a Log-Structured File System. In Proc. ACM Symposium on Operating Systems Principles, pages 1–15, October 1991.
[15] Russel Sandberg, David Goldberg, Steve Kleiman, Dan Walsh, and Bob Lyon. Design and Implementation of the Sun Network Filesystem. In Summer 1985 Usenix Conference, 1985.
[16] Martin Schulze, Garth Gibson, Randy H. Katz, and David A. Patterson. How Reliable is a RAID? In Proceedings IEEE COMPCON, pages 118–123, Spring 1989.
[17] David Tweten. Hiding Mass Storage Under UNIX: NASA’s MSS-II Architecture. In Digest of Papers. Tenth IEEE Symposium on Mass Storage Systems, May 1990.