as a single process that communicates with clients and
the reference server via UDP and with any plug-ins via
a UNIX domain socket over which shared memory addresses
are passed.
Our Tee is an active intermediary. To access a file system
exported by the reference server, a client sends its requests
to the Tee. The Tee multiplexes all client requests
into one stream of requests, with itself as the client so
that it receives all responses directly. Since the Tee becomes
the source of all RPC requests seen by the reference
server, it is necessary for the relay to map client-assigned
RPC transaction IDs (XIDs) onto a separate
XID space. This makes each XID seen by the reference
server unique, even if different clients send requests with
the same XID, and it allows the Tee to determine which
client should receive which reply. This XID mapping is
the only way in which the relay modifies the RPC requests.
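As a concrete illustration, the XID remapping can be sketched as a small
translation table like the one below. This is only a sketch: the structure,
the fixed table size, and the function names are assumptions made for
exposition, not details of the Tee's actual implementation, and collision
and retransmission handling are omitted.

    #include <stdint.h>
    #include <netinet/in.h>

    /* Hypothetical entry recording where a remapped request came from. */
    struct xid_map_entry {
        uint32_t           tee_xid;      /* XID presented to the reference server */
        uint32_t           client_xid;   /* XID originally chosen by the client   */
        struct sockaddr_in client_addr;  /* client that should receive the reply  */
        int                in_use;
    };

    #define XID_TABLE_SIZE 4096
    static struct xid_map_entry xid_table[XID_TABLE_SIZE];
    static uint32_t next_xid;            /* Tee-assigned XIDs, unique across clients */

    /* Assign a fresh, Tee-unique XID to an incoming client request. */
    static uint32_t remap_xid(uint32_t client_xid, const struct sockaddr_in *from)
    {
        uint32_t xid = ++next_xid;
        struct xid_map_entry *e = &xid_table[xid % XID_TABLE_SIZE];

        e->tee_xid     = xid;
        e->client_xid  = client_xid;
        e->client_addr = *from;
        e->in_use      = 1;
        return xid;                      /* written into the outgoing RPC header */
    }

    /* On a reply from the reference server, recover the client's own XID
     * and the address the reply should be relayed to. */
    static int unmap_xid(uint32_t tee_xid, uint32_t *client_xid,
                         struct sockaddr_in *to)
    {
        struct xid_map_entry *e = &xid_table[tee_xid % XID_TABLE_SIZE];

        if (!e->in_use || e->tee_xid != tee_xid)
            return -1;                   /* unknown or already-answered XID */
        *client_xid = e->client_xid;
        *to         = e->client_addr;
        e->in_use   = 0;
        return 0;
    }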
The NFS plug-in contains the bulk of our Tee’s functionality
and is divided into four modules: synchronization,
duplication, comparison, and the dispatcher. The
first three modules each comprise a group of worker
threads and a queue of lightweight request objects. The
dispatcher (not pictured in Figure 2) is a single thread
that interfaces with the relay, receiving shared memory
buffers.
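The hand-off between modules can be pictured as a producer/consumer queue
serviced by a pool of worker threads, roughly as in the following sketch.
The queue size, the names, and the pthreads-based structure are illustrative
assumptions, and overflow handling is omitted.

    #include <pthread.h>
    #include <stddef.h>

    struct request;                      /* lightweight request object */

    /* Queue feeding one module's worker threads. */
    struct module_queue {
        pthread_mutex_t lock;
        pthread_cond_t  nonempty;
        struct request *items[1024];
        int             head, tail, count;
    };

    static struct module_queue duplication_queue = {
        .lock     = PTHREAD_MUTEX_INITIALIZER,
        .nonempty = PTHREAD_COND_INITIALIZER,
    };

    static void enqueue(struct module_queue *q, struct request *r)
    {
        pthread_mutex_lock(&q->lock);
        q->items[q->tail] = r;           /* overflow handling omitted */
        q->tail = (q->tail + 1) % 1024;
        q->count++;
        pthread_cond_signal(&q->nonempty);
        pthread_mutex_unlock(&q->lock);
    }

    static struct request *dequeue(struct module_queue *q)
    {
        pthread_mutex_lock(&q->lock);
        while (q->count == 0)
            pthread_cond_wait(&q->nonempty, &q->lock);
        struct request *r = q->items[q->head];
        q->head = (q->head + 1) % 1024;
        q->count--;
        pthread_mutex_unlock(&q->lock);
        return r;
    }

    /* Each of the synchronization, duplication, and comparison modules
     * runs several threads executing a loop of this form. */
    static void *worker(void *arg)
    {
        struct module_queue *q = arg;
        for (;;) {
            struct request *r = dequeue(q);
            /* ...process r, then enqueue it for the next module... */
            (void)r;
        }
        return NULL;
    }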
For each file system object, the plug-in maintains some
state in a hash table keyed on the object’s reference server
file handle. Each entry includes the object’s file handle
on each server, its synchronization status, pointers to
outstanding requests that reference it, and miscellaneous
bookkeeping information. Keeping track of each object
consumes 236 bytes. Each outstanding request is stored
in a hash table keyed on the request’s reference server
XID. Each entry requires 124 bytes to hold the request,
both responses, their arrival times, and various miscellaneous
fields. The memory consumption is untuned and
could be reduced.
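The two kinds of entries described above might look roughly like the
following C structures. The field names, the fixed-size NFSv3 handle type,
and the exact set of fields are assumptions made for illustration; the real
layouts are what account for the 236-byte and 124-byte figures.

    #include <stdint.h>
    #include <time.h>

    #define NFS3_FHSIZE 64               /* maximum NFSv3 file handle length */

    struct fhandle {
        uint32_t len;
        uint8_t  data[NFS3_FHSIZE];
    };

    struct request_state;                /* forward declaration */

    /* One entry per file system object, keyed on the reference server's
     * file handle. */
    struct object_state {
        struct fhandle        ref_fh;       /* handle on the reference server */
        struct fhandle        sut_fh;       /* handle on the SUT, once known  */
        int                   sync_status;  /* e.g., unsynchronized, in progress, synchronized */
        struct request_state *outstanding;  /* outstanding requests that reference this object */
        struct object_state  *hash_next;    /* bucket chaining */
        /* ...miscellaneous bookkeeping... */
    };

    /* One entry per outstanding request, keyed on the reference-server XID. */
    struct request_state {
        uint32_t              ref_xid;      /* XID sent to the reference server */
        void                 *request;      /* shared-memory buffer holding the RPC */
        void                 *ref_reply;    /* response from the reference server */
        void                 *sut_reply;    /* response from the SUT */
        struct timespec       ref_arrival;  /* arrival times of the two responses */
        struct timespec       sut_arrival;
        struct request_state *hash_next;    /* bucket chaining */
        /* ...miscellaneous fields... */
    };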
Each RPC received by the relay is stored directly into
a shared memory buffer from the RPC header onward.
The dispatcher is passed the addresses of these buffers
in the order that the RPCs were received by the relay.
It updates internal state (e.g., for synchronization ordering),
then decides whether or not the request will yield a
comparable response. If so, the request is passed to the
duplication module, which constructs a new RPC based
on the original by replacing file handles with their SUT
equivalents. It then sends the request to the SUT.
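The duplication step can be sketched as follows, assuming a decoded request
that exposes the file handles it carries and a lookup from reference-server
handles to SUT handles. The helper names and the simplified linear-scan
table stand in for the plug-in's real hash table and are not taken from its
source.

    #include <stdint.h>
    #include <string.h>

    #define NFS3_FHSIZE 64

    struct fhandle {
        uint32_t len;
        uint8_t  data[NFS3_FHSIZE];
    };

    /* Simplified view of a decoded request: the call's XID plus the file
     * handles embedded in its arguments. */
    struct decoded_request {
        uint32_t       xid;
        struct fhandle fh[4];
        int            nfh;
        /* ...remaining decoded arguments... */
    };

    /* Stand-in for the per-object hash table: a linear table mapping
     * reference-server handles to SUT handles. */
    static struct { struct fhandle ref, sut; } fh_map[1024];
    static int fh_map_len;

    static const struct fhandle *lookup_sut_handle(const struct fhandle *ref)
    {
        for (int i = 0; i < fh_map_len; i++)
            if (fh_map[i].ref.len == ref->len &&
                memcmp(fh_map[i].ref.data, ref->data, ref->len) == 0)
                return &fh_map[i].sut;
        return NULL;                     /* object not yet synchronized */
    }

    /* Build the SUT copy of a request by substituting every reference-server
     * file handle with its SUT equivalent. */
    static int duplicate_for_sut(const struct decoded_request *orig,
                                 struct decoded_request *copy)
    {
        *copy = *orig;
        for (int i = 0; i < copy->nfh; i++) {
            const struct fhandle *sut = lookup_sut_handle(&orig->fh[i]);
            if (sut == NULL)
                return -1;               /* cannot duplicate yet */
            copy->fh[i] = *sut;          /* rewrite the handle for the SUT */
        }
        return 0;                        /* copy is ready to send to the SUT */
    }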
Once responses have been received from both the reference
server and the SUT, they are passed to the comparison
module. If the comparison module finds any discrepancies,
it logs the RPC and responses and optionally
alerts the user. For performance and space reasons, the
Tee discards information related to matching responses,
though this can be disabled if full tracing is desired.
5 Evaluation
This section evaluates the Tee along three dimensions.
First, it validates the Tee’s usefulness with several case
studies. Second, it measures the performance impact of
using the Tee. Third, it demonstrates the value of the
synchronization ordering optimizations.
5.1 Systems used
All experiments are run with the Tee on an Intel P4
2.4GHz machine with 512MB of RAM running Linux
2.6.5. The client is either a machine identical to the
Tee or a dual P3 Xeon 600MHz with 512MB of RAM
running FreeBSD 4.7. The servers include Linux and
FreeBSD machines with the same specifications as the
clients, an Intel P4 2.2GHz with 512MB of RAM running
Linux 2.4.18, and a Network Appliance FAS900
series filer. For the performance and convergence benchmarks,
the client and server machines are all identical to
the Tee mentioned above and are connected via a Gigabit
Ethernet switch.
5.2 Case studies
An interesting use of the Tee is to compare popular deployed
NFS server implementations. To do so, we ran
a simple test program on a FreeBSD client to compare
the responses of the different server configurations. The
short test consists of directory, file, link, and symbolic
link creation and deletion as well as reads and writes of
data and attributes. No other file system objects were involved
except the root directory in which the operations
were done. Commands were issued at 2-second intervals.
Comparing Linux to FreeBSD: We exercised a setup
with a FreeBSD SUT and a Linux reference server to
see how they differ. After post-processing READDIR and
READDIRPLUS entries, and grouping like discrepancies,
we are left with the nineteen unique discrepancies summarized
in Table 1. In addition to those nineteen, we
observed many discrepancies caused by the Linux NFS
server’s use of some undefined bits in the MODE field
(i.e., the field with the access control bits for owner,
group, and world) of every file object’s attributes. The
Linux server encodes the object’s type (e.g., directory,
symlink, or regular file) in these bits, which causes the
MODE field to not match FreeBSD’s values in every response.
To eliminate this recurring discrepancy, we modified
the comparison rules to replace bitwise comparison of the
entire MODE field with a loose-compare function that
examines only the specification-defined bits.

Table 1: Discrepancies when comparing Linux and FreeBSD servers.
The fields that differ are shown along with the number of distinct
RPCs for which they occur and the reason for the discrepancy.

Field                    Count  Reason
EOF flag                 1      FreeBSD server failed to return EOF at the
                                end of a read reply
Attributes follow flag   10     Linux sometimes chooses not to return
                                pre-op or post-op attributes
Time                     6      Parent directory pre-op ctime and mtime are
                                set to the current time on FreeBSD
Time                     2      FreeBSD does not update a symbolic link's
                                atime on READLINK
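The loose-compare rule described above can be sketched as a mask-and-compare
on the MODE field, since NFSv3 defines only the low twelve mode bits; the
constant and function name below are illustrative rather than taken from the
Tee's comparison rules.

    #include <stdint.h>
    #include <stdbool.h>

    /* NFSv3 defines only the low twelve MODE bits: setuid, setgid, the
     * "save swapped text" (sticky) bit, and the rwx permission bits for
     * owner, group, and other.  Linux also encodes the object type in
     * higher, undefined bits, so a strict bitwise comparison would flag
     * every attribute-bearing response. */
    #define NFS3_MODE_DEFINED_BITS 07777u

    static bool mode_loose_equal(uint32_t ref_mode, uint32_t sut_mode)
    {
        return (ref_mode & NFS3_MODE_DEFINED_BITS) ==
               (sut_mode & NFS3_MODE_DEFINED_BITS);
    }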
Perhaps the most interesting discrepancy is the EOF flag,
which is the flag that signifies that a read operation has
reached the end of the file. Our Tee tells us that when a
FreeBSD client is reading data from a FreeBSD server,
the server returns FALSE at the end of the file while
the Linux server correctly returns TRUE. The same discrepancy
is observed, of course, when the FreeBSD and
Linux servers switch roles as reference server and SUT.
The FreeBSD client does not malfunction, which means that it is not using the EOF value that
the server returns. Interestingly, when running the same
experiment with a Linux client, the discrepancy is not
seen because the Linux client uses different request sequences.
If a developer were trying to implement a
FreeBSD NFS server clone, the NFS Tee would be a
useful tool in identifying and properly mimicking this
quirk.
The “attributes follow” flag, which indicates whether or
not the attribute structure in the given response contains
data (many NFSv3 RPCs allow the affected object’s attributes
to be included in the response, at the server’s discretion, for
the client’s convenience), also produced discrepancies. These discrepancies
mostly come from pre-operation directory attributes in
which Linux, unlike FreeBSD, chooses not to return any
data. Of course, the presence of these attributes represents
additional discrepancies between the two servers’
responses, but the root cause is the same decision about
whether to include the optional information.
The last set of interesting discrepancies comes from
timestamps. First, we observe that FreeBSD returns
incorrect pre-operation directory modification times
(mtime and ctime) for the parent directory for RPCs
that create a file, a hard link, or a symbolic link. Rather
than the proper values being returned, FreeBSD returns
the current time. Second, FreeBSD and Linux use different
policies for updating the last access timestamp
(atime). Linux updates the atime on the symlink file
when the symlink is followed, whereas FreeBSD only
updates the atime when the symlink file is accessed directly
(e.g., by writing its value). This difference manifests as
discrepancies in RPCs that read the symlink’s attributes.
We also ran the test with the servers swapped (FreeBSD
as reference and Linux as SUT). Since the client interacts
with the reference server’s implementation, we were
interested to see if the FreeBSD client’s interaction with
a FreeBSD NFS server would produce different results
when compared to the Linux server, perhaps due to optimizations
between the like client and server. But the
same set of discrepancies was found.
Comparing Linux 2.6 to Linux 2.4: Comparing Linux
2.4 to Linux 2.6 resulted in very few discrepancies. The
Tee shows that the 2.6 kernel returns file metadata timestamps
with nanosecond resolution as a result of its updated
VFS layer, while the 2.4 kernel always returns
timestamps with full second resolution. The only other
difference we found was that the parent directory’s pre-operation
attributes for SETATTR are not returned in the
2.4 kernel but are in the 2.6 kernel.
Comparing Network Appliance FAS900 to Linux and
FreeBSD: Comparing the Network Appliance FAS900
to the Linux and FreeBSD servers yields a few interesting
differences. The primary observation we are able
to make is that the FAS900’s replies are more similar to
FreeBSD’s than to Linux’s. The FAS900 handles its file
MODE bits like FreeBSD without Linux’s extra file type
bits. The FAS900, like the FreeBSD server, also returns
all of the pre-operation directory attributes that
Linux does not. It is also interesting to observe that
the FAS900 clearly handles directories differently from
both Linux and FreeBSD. The cookie that the Linux or
FreeBSD server returns in response to a READDIR or
READDIRPLUS call is a byte offset into the directory
file whereas the Network Appliance filer simply returns
an entry number in the directory.
Aside: It is interesting to note that, as an unintended consequence
of our initial relay implementation, we discovered
an implementation difference between the FAS900
and the Linux or FreeBSD servers. The relay modifies
the NFS call’s XIDs so that if two clients happen to use
the same XID, they don’t get mixed up when the Tee relays
them both. The relay uses a sequence of XID values
that is identical each time it is run. We
found that, after restarting the Tee, requests would often
get lost on the FAS900 but not on the Linux or FreeBSD
servers. It turns out that the FAS900 caches XIDs for
much longer than the other servers, resulting in dropped
RPCs (as seeming duplicates) when the XID numbering
starts over too soon.
Debugging the Ursa Major NFS server: Although the
NFS Tee is new, we have started to use it for debugging
an NFS server being developed in our group. This server
is being built as a front-end to Ursa Major, a storage system
that will be deployed at Carnegie Mellon as part of
the Self-* Storage project [4]. Using Linux as a reference,
we have found some non-problematic discrepancies
(e.g., different choices made about which optional
values to return) and one significant bug. The bug occurred
in responses to the READ command, which never
set the EOF flag even when the last byte of the file was
returned. For the Linux clients used in testing, this is not
a problem. For others, however, it is. Using the Tee exposed
and isolated this latent problem, allowing it to be
fixed proactively.
5.3 Performance impact of prototype
We use PostMark to measure the impact the Tee would
have on a client in a live environment. We compare two
setups: one with the client talking directly to a Linux
server and one with the client talking to a Tee that uses
the same Linux server as the reference. We expect a significant
increase in latency for each RPC but a less significant
impact on throughput.
PostMark was designed to measure the performance of
a file system used for electronic mail, netnews, and
web-based services [6]. It creates a large number of small
randomly-sized files (between 512 B and 9.77 KB) and
performs a specified number of transactions on them.
Each transaction consists of two sub-transactions, with
one being a create or delete and the other being a read or
append.
The experiments were run with between one and sixteen
concurrent clients. Except for the case of a
single client, two instances of PostMark were run on each
physical client machine. Each instance of PostMark ran
with 10,000 transactions on 500 files and the biases for
transaction types were equal. Except for the increase in
the number of transactions, these are default PostMark
values.
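For reference, this configuration corresponds roughly to the following
PostMark command script. The command names are assumed from PostMark's
interactive interface; only the transaction count differs from the defaults,
and the read/append and create/delete biases keep their default, equal
settings.

    set number 500
    set transactions 10000
    run
    quit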
Figure 5 shows that using the Tee reduces client throughput
when compared to a direct NFS mount. The reduction
is caused mainly by increased latency due to the
added network hop and overheads introduced by the fact
that the Tee is a user-level process.

Figure 5: Performance with and without the Tee. The performance
penalty caused by the Tee decreases as concurrency increases, because
higher latency is the primary cost of inserting a Tee between client
and reference server. Concurrency allows request propagation and
processing to be overlapped, which continues to benefit the Through-Tee
case after the Direct case saturates. The graph shows the average
and standard deviation of PostMark throughput as a function of the
number of concurrent instances.
The single-threaded nature of PostMark allows us to
evaluate both the latency and the throughput costs of our
Tee. With one client, PostMark induces one RPC request
at a time, and the Tee decreases throughput by 61%. As
multiple concurrent PostMark clients are added, the percentage
difference between direct NFS and through-Tee
NFS performance shrinks. This indicates that the latency
increase is a more significant factor than the throughput
limitation—with high concurrency and before the server
is saturated, the decrease in throughput drops to 41%.
When the server is heavily loaded in the case of a direct
NFS mount, the Tee continues to scale and with 16
clients the reduction in throughput is only 12%.
Although client performance is reduced through the use
of the Tee, the reduction does not prevent us from using it
to test synchronization convergence rates, do offline case
studies, or test in live environments where lower performance
is acceptable.
5.4 Speed of synchronization convergence
One of our Tee design goals was to support dynamic addition
of a SUT in a live environment. To make such
addition most effective, the Tee should start performing
comparisons as quickly as possible. Recall that operations
on a file object may be compared only if the object
is synchronized. This section evaluates the effectiveness
of the synchronization ordering enhancements described
in Section 4.2. We expect them to significantly increase
the speed with which useful comparisons can begin.