as a single process that communicates with clients and the reference server via UDP and with any plug-ins via a UNIX domain socket over which shared memory addresses are passed.

Our Tee is an active intermediary. To access a file system exported by the reference server, a client sends its requests to the Tee. The Tee multiplexes all client requests into one stream of requests, with itself as the client so that it receives all responses directly. Since the Tee becomes the source of all RPC requests seen by the reference server, the relay must map client-assigned RPC transaction IDs (XIDs) onto a separate XID space. This makes each XID seen by the reference server unique, even if different clients send requests with the same XID, and it allows the Tee to determine which client should receive which reply. This XID mapping is the only way in which the relay modifies the RPC requests.

The NFS plug-in contains the bulk of our Tee’s functionality and is divided into four modules: synchronization, duplication, comparison, and the dispatcher. The first three modules each comprise a group of worker threads and a queue of lightweight request objects. The dispatcher (not pictured in Figure 2) is a single thread that interfaces with the relay, receiving shared memory buffers.

For each file system object, the plug-in maintains some state in a hash table keyed on the object’s reference server file handle. Each entry includes the object’s file handle on each server, its synchronization status, pointers to outstanding requests that reference it, and miscellaneous book-keeping information. Keeping track of each object consumes 236 bytes. Each outstanding request is stored in a hash table keyed on the request’s reference server XID. Each entry requires 124 bytes to hold the request, both responses, their arrival times, and various miscellaneous fields. The memory consumption is untuned and could be reduced.
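The relay's XID remapping described above can be illustrated with a minimal sketch. It is not the Tee's actual code: the class name `XidMapper`, its methods, and the choice to start numbering at 1 are assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

ClientAddr = Tuple[str, int]  # (IP address, UDP port) of a client


@dataclass
class XidMapper:
    """Hypothetical helper: maps client XIDs onto one relay-owned XID space."""

    next_xid: int = 1
    # server-side XID -> (client address, XID the client originally chose)
    pending: Dict[int, Tuple[ClientAddr, int]] = field(default_factory=dict)

    def outbound(self, client: ClientAddr, client_xid: int) -> int:
        """Allocate a fresh server-side XID for a request from `client`."""
        server_xid = self.next_xid
        # RPC XIDs are 32-bit values, so wrap around. A real relay would
        # also have to avoid reusing an XID that is still pending.
        self.next_xid = (self.next_xid + 1) & 0xFFFFFFFF
        self.pending[server_xid] = (client, client_xid)
        return server_xid

    def inbound(self, server_xid: int) -> Optional[Tuple[ClientAddr, int]]:
        """Return (client, original XID) for a reply, or None if unknown."""
        return self.pending.pop(server_xid, None)


mapper = XidMapper()
# Two different clients happen to pick the same XID (42).
xid_a = mapper.outbound(("10.0.0.1", 1023), 42)
xid_b = mapper.outbound(("10.0.0.2", 1023), 42)
```

In this sketch the two requests reach the server with distinct XIDs, and each reply can be routed back to the right client with its original XID restored.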
Each RPC received by the relay is stored directly into a shared memory buffer from the RPC header onward. The dispatcher is passed the addresses of these buffers in the order that the RPCs were received by the relay. It updates internal state (e.g., for synchronization ordering), then decides whether or not the request will yield a comparable response. If so, the request is passed to the duplication module, which constructs a new RPC based on the original by replacing file handles with their SUT equivalents. It then sends the request to the SUT.

Once responses have been received from both the reference server and the SUT, they are passed to the comparison module. If the comparison module finds any discrepancies, it logs the RPC and responses and optionally alerts the user. For performance and space reasons, the Tee discards information related to matching responses, though this can be disabled if full tracing is desired.

5 Evaluation

This section evaluates the Tee along three dimensions. First, it validates the Tee’s usefulness with several case studies. Second, it measures the performance impact of using the Tee. Third, it demonstrates the value of the synchronization ordering optimizations.

5.1 Systems used

All experiments are run with the Tee on an Intel P4 2.4GHz machine with 512MB of RAM running Linux 2.6.5. The client is either a machine identical to the Tee or a dual P3 Xeon 600MHz with 512MB of RAM running FreeBSD 4.7. The servers include Linux and FreeBSD machines with the same specifications as the clients, an Intel P4 2.2GHz with 512MB of RAM running Linux 2.4.18, and a Network Appliance FAS900 series filer. For the performance and convergence benchmarks, the client and server machines are all identical to the Tee mentioned above and are connected via a Gigabit Ethernet switch.

5.2 Case studies

An interesting use of the Tee is to compare popular deployed NFS server implementations.
To do so, we ran a simple test program on a FreeBSD client to compare the responses of the different server configurations. The short test consists of directory, file, link, and symbolic link creation and deletion as well as reads and writes of data and attributes. No other filesystem objects were involved except the root directory in which the operations were done. Commands were issued at 2-second intervals.

Comparing Linux to FreeBSD: We exercised a setup with a FreeBSD SUT and a Linux reference server to see how they differ. After post-processing READDIR and READDIRPLUS entries, and grouping like discrepancies, we are left with the nineteen unique discrepancies summarized in Table 1.

Field                   Count  Reason
EOF flag                1      FreeBSD server failed to return EOF at the end of a read reply
Attributes follow flag  10     Linux sometimes chooses not to return pre-op or post-op attributes
Time                    6      Parent directory pre-op ctime and mtime are set to the current time on FreeBSD
Time                    2      FreeBSD does not update a symbolic link’s atime on READLINK

Table 1: Discrepancies when comparing Linux and FreeBSD servers. The fields that differ are shown along with the number of distinct RPCs for which they occur and the reason for the discrepancy.

In addition to those nineteen, we observed many discrepancies caused by the Linux NFS server’s use of some undefined bits in the MODE field (i.e., the field with the access control bits for owner, group, and world) of every file object’s attributes. The Linux server encodes the object’s type (e.g., directory, symlink, or regular file) in these bits, which causes the MODE field to not match FreeBSD’s values in every response. To eliminate this recurring discrepancy, we modified the comparison rules to replace bitwise-comparison of the entire MODE field with a loose-compare function that examines only the specification-defined bits.
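The loose-compare rule for the MODE field can be sketched as follows. This is an illustration rather than the Tee's comparison code: the function name and the sample mode values are hypothetical, and the sketch assumes the extra Linux bits are the usual file-type bits above the low twelve. The mask covers the mode bits that the NFSv3 specification (RFC 1813) defines: setuid, setgid, sticky, and the nine owner/group/other permission bits.

```python
# Mode bits defined by the NFSv3 specification (RFC 1813):
# setuid, setgid, sticky, plus rwx for owner, group, and others.
SPEC_MODE_MASK = 0o7777


def mode_equal_loose(ref_mode: int, sut_mode: int) -> bool:
    """Compare two MODE fields, ignoring bits the specification leaves undefined."""
    return (ref_mode & SPEC_MODE_MASK) == (sut_mode & SPEC_MODE_MASK)


# Hypothetical values: a regular file with permissions 0644, where one
# server also sets file-type bits in the upper part of the field.
mode_with_type_bits = 0o100644
mode_plain = 0o644

strict_match = mode_with_type_bits == mode_plain                  # False
loose_match = mode_equal_loose(mode_with_type_bits, mode_plain)   # True
```

A genuine permission difference (for example 0600 versus 0644) still fails the loose comparison, so the rule only suppresses the recurring type-bit mismatch.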
Perhaps the most interesting discrepancy is the EOF flag, which is the flag that signifies that a read operation has reached the end of the file. Our Tee tells us that when a FreeBSD client is reading data from a FreeBSD server, the server returns FALSE at the end of the file while the Linux server correctly returns TRUE. The same discrepancy is observed, of course, when the FreeBSD and Linux servers switch roles as reference server and SUT. The FreeBSD client does not malfunction, which means that the FreeBSD client is not using the EOF value that the server returns. Interestingly, when running the same experiment with a Linux client, the discrepancy is not seen because the Linux client uses different request sequences. If a developer were trying to implement a FreeBSD NFS server clone, the NFS Tee would be a useful tool in identifying and properly mimicking this quirk.

The “attributes follow” flag, which indicates whether or not the attribute structure in the given response contains data (see footnote 6), also produced discrepancies. These discrepancies mostly come from pre-operation directory attributes in which Linux, unlike FreeBSD, chooses not to return any data. Of course, the presence of these attributes represents additional discrepancies between the two servers’ responses, but the root cause is the same decision about whether to include the optional information.

The last set of interesting discrepancies comes from timestamps. First, we observe that FreeBSD returns incorrect pre-operation directory modification times (mtime and ctime) for the parent directory for RPCs that create a file, a hard link, or a symbolic link. Rather than the proper values being returned, FreeBSD returns the current time. Second, FreeBSD and Linux use different policies for updating the last access timestamp (atime).
Linux updates the atime on the symlink file when the symlink is followed, whereas FreeBSD only updates the atime when the symlink file is accessed directly (e.g., by writing its value). This difference shows up as discrepancies in RPCs that read the symlink’s attributes.

Footnote 6: Many NFSv3 RPCs allow the affected object’s attributes to be included in the response, at the server’s discretion, for the client’s convenience.

We also ran the test with the servers swapped (FreeBSD as reference and Linux as SUT). Since the client interacts with the reference server’s implementation, we were interested to see if the FreeBSD client’s interaction with a FreeBSD NFS server would produce different results when compared to the Linux server, perhaps due to optimizations between the like client and server. However, the same set of discrepancies was found.

Comparing Linux 2.6 to Linux 2.4: Comparing Linux 2.4 to Linux 2.6 resulted in very few discrepancies. The Tee shows that the 2.6 kernel returns file metadata timestamps with nanosecond resolution as a result of its updated VFS layer, while the 2.4 kernel always returns timestamps with full second resolution. The only other difference we found was that the parent directory’s pre-operation attributes for SETATTR are not returned in the 2.4 kernel but are in the 2.6 kernel.

Comparing Network Appliance FAS900 to Linux and FreeBSD: Comparing the Network Appliance FAS900 to the Linux and FreeBSD servers yields a few interesting differences. The primary observation we are able to make is that the FAS900 replies are more similar to FreeBSD’s than Linux’s. The FAS900 handles its file MODE bits like FreeBSD without Linux’s extra file type bits. The FAS900, like the FreeBSD server, also returns all of the pre-operation directory attributes that Linux does not. It is also interesting to observe that the FAS900 clearly handles directories differently from both Linux and FreeBSD.
The cookie that the Linux or FreeBSD server returns in response to a READDIR or READDIRPLUS call is a byte offset into the directory file whereas the Network Appliance filer simply returns an entry number in the directory.

Aside: It is interesting to note that, as an unintended consequence of our initial relay implementation, we discovered an implementation difference between the FAS900 and the Linux or FreeBSD servers. The relay modifies each NFS call’s XID so that if two clients happen to use the same XID, they don’t get mixed up when the Tee relays them both. The relay uses a sequence of XID values that is identical each time the relay is run. We found that, after restarting the Tee, requests would often get lost on the FAS900 but not on the Linux or FreeBSD servers. It turns out that the FAS900 caches XIDs for much longer than the other servers, resulting in dropped RPCs (as seeming duplicates) when the XID numbering starts over too soon.

Debugging the Ursa Major NFS server: Although the NFS Tee is new, we have started to use it for debugging an NFS server being developed in our group. This server is being built as a front-end to Ursa Major, a storage system that will be deployed at Carnegie Mellon as part of the Self-* Storage project. Using Linux as a reference, we have found some non-problematic discrepancies (e.g., different choices made about which optional values to return) and one significant bug. The bug occurred in responses to the READ command, which never set the EOF flag even when the last byte of the file was returned. For the Linux clients used in testing, this is not a problem. For others, however, it is. Using the Tee exposed and isolated this latent problem, allowing it to be fixed proactively.

5.3 Performance impact of prototype

We use PostMark to measure the impact the Tee would have on a client in a live environment.
We compare two setups: one with the client talking directly to a Linux server and one with the client talking to a Tee that uses the same Linux server as the reference. We expect a significant increase in latency for each RPC, but less significant impact on throughput.

PostMark was designed to measure the performance of a file system used for electronic mail, netnews, and web-based services. It creates a large number of small randomly-sized files (between 512 B and 9.77 KB) and performs a specified number of transactions on them. Each transaction consists of two sub-transactions, with one being a create or delete and the other being a read or append.

The experiments were done with a single client and up to sixteen concurrent clients. Except for the case of a single client, two instances of PostMark were run on each physical client machine. Each instance of PostMark ran with 10,000 transactions on 500 files and the biases for transaction types were equal. Except for the increase in the number of transactions, these are default PostMark values.

Figure 5 shows that using the Tee reduces client throughput when compared to a direct NFS mount. The reduction is caused mainly by increased latency due to the added network hop and overheads introduced by the fact that the Tee is a user-level process.

[Figure 5 plot: PostMark transactions per second (y-axis) versus number of concurrent clients (x-axis), with one curve for the Direct Mount and one for the Through-Tee Mount.]

Figure 5: Performance with and without the Tee. The performance penalty caused by the Tee decreases as concurrency increases, because higher latency is the primary cost of inserting a Tee between client and reference server. Concurrency allows request propagation and processing to be overlapped, which continues to benefit the Through-Tee case after the Direct case saturates. The graph shows average and standard deviation of PostMark throughput, as a function of the number of concurrent instances.
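The argument in the Figure 5 caption, that a latency penalty shrinks once concurrency lets requests overlap and the direct mount saturates, can be illustrated with a toy model. Everything here is an assumption for illustration: the function names, the latencies (2 ms direct, 5 ms through the Tee), and the server capacity (2000 requests per second) are made-up values, not measurements from the experiments.

```python
def throughput(clients: int, latency_ms: float, capacity: float) -> float:
    """Requests per second when each client keeps one request outstanding,
    capped by the server's maximum rate (`capacity`)."""
    return min(clients * 1000.0 / latency_ms, capacity)


def tee_penalty(clients: int,
                direct_ms: float = 2.0,     # assumed direct-mount latency
                tee_ms: float = 5.0,        # assumed through-Tee latency
                capacity: float = 2000.0    # assumed server saturation point
                ) -> float:
    """Fractional throughput loss of the through-Tee mount versus direct."""
    direct = throughput(clients, direct_ms, capacity)
    via_tee = throughput(clients, tee_ms, capacity)
    return 1.0 - via_tee / direct


for n in (1, 4, 8, 16):
    print(n, round(tee_penalty(n), 2))
```

With these made-up numbers the penalty stays at 60% while both setups are latency-bound, then falls once the direct mount hits the server's capacity and the through-Tee mount keeps scaling. The model ignores any throughput limit of the Tee itself.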
The single-threaded nature of PostMark allows us to evaluate both the latency and the throughput costs of our Tee. With one client, PostMark induces one RPC request at a time, and the Tee decreases throughput by 61%. As multiple concurrent PostMark clients are added, the percentage difference between direct NFS and through-Tee NFS performance shrinks. This indicates that the latency increase is a more significant factor than the throughput limitation: with high concurrency and before the server is saturated, the decrease in throughput drops to 41%. When the server is heavily loaded in the case of a direct NFS mount, the Tee continues to scale and with 16 clients the reduction in throughput is only 12%.

Although client performance is reduced through the use of the Tee, the reduction does not prevent us from using it to test synchronization convergence rates, do offline case studies, or test in live environments where lower performance is acceptable.

5.4 Speed of synchronization convergence

One of our Tee design goals was to support dynamic addition of a SUT in a live environment. To make such an addition most effective, the Tee should start performing comparisons as quickly as possible. Recall that operations on a file object may be compared only if the object is synchronized. This section evaluates the effectiveness of the synchronization ordering enhancements described in Section 4.2. We expect them to significantly increase the speed with which useful comparisons can begin.