Because the information Atropos exposes to applications
is simple, the interface to Atropos can be
readily implemented with small extensions to the commands
already defined in the SCSI protocol. The parameters
p and w could be exposed in a new mode page
returned by the MODE SENSE SCSI command. To ensure
that Atropos executes all requests to non-contiguous
VLBNs for the other-major access together, an application
can link the appropriate requests. To do so, the
READ or WRITE commands for semi-sequential access
are issued with the Link bit set.
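As an illustration of linking, the sketch below builds a chain of READ(10) command descriptor blocks in which every command except the last has the Link bit set. It assumes the SCSI-2 CDB layout, where the Link bit is bit 0 of the final control byte. The helper names read10_cdb and linked_reads are hypothetical and are not part of Atropos.

```python
import struct

READ_10 = 0x28   # SCSI READ(10) operation code
LINK_BIT = 0x01  # bit 0 of the CDB control byte (SCSI-2 layout, assumed here)


def read10_cdb(lba, blocks, link):
    """Build a 10-byte READ(10) CDB, optionally with the Link bit set."""
    control = LINK_BIT if link else 0
    # Fields: opcode, flags, 32-bit LBA, group number,
    # 16-bit transfer length, control byte (all big-endian).
    return struct.pack(">BBIBHB", READ_10, 0, lba, 0, blocks, control)


def linked_reads(lbas, blocks=1):
    """Link each command to its successor; the last command ends the chain."""
    last = len(lbas) - 1
    return [read10_cdb(lba, blocks, link=(i < last))
            for i, lba in enumerate(lbas)]


# Hypothetical example: four non-contiguous blocks read as one linked chain.
chain = linked_reads([0, 16, 32, 48])
```

Only the CDB construction is shown; submitting the commands to a device (for example through /dev/sg) is left out.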
3.3.4 Implementation details
Our Atropos logical volume manager implementation
is a stand-alone process that accepts I/O requests via
a socket. It issues individual disk I/Os directly to the
attached SCSI disks using the Linux raw SCSI device
/dev/sg. With an SMP host, the process can run on a
separate CPU of the same host, to minimize the effect on
the execution of the main application.
An application using Atropos is linked with a stub library
providing API functions for reading and writing.
The library uses shared memory to avoid data copies and
communicates through the socket with the Atropos LVM
process. The Atropos LVM organization is specified by
a configuration file, which functions in lieu of a format
command. The file lists the number of disks, p, the desired
block size, b, and the list of disks to be used.
For convenience, the interface stub also includes three
functions. The function get_boundaries(LBN) returns
the stripe unit boundaries between which the given LBN
falls. Hence, these boundaries form a collection of w
contiguous LBNs for constructing efficient I/Os. The
get_rectangle(LBN) function returns the wp contiguous
LBNs in a single row across all disks. These functions
are just convenient wrappers that calculate the proper
LBNs from the w and p parameters. Finally, the stub
interface also includes a batch() function to explicitly
group READ and WRITE commands (e.g., for semi-sequential access).
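The two wrapper functions can be sketched as simple integer arithmetic on w and p. This is a minimal sketch, assuming that stripe units start at LBN 0 and are aligned to multiples of w. The stub functions described above take only an LBN; here w and p are passed explicitly for clarity, and the inclusive (first, last) return format is an assumption.

```python
def get_boundaries(lbn, w):
    """Return the (first, last) LBN of the stripe unit containing lbn."""
    first = (lbn // w) * w
    return first, first + w - 1


def get_rectangle(lbn, w, p):
    """Return the (first, last) LBN of the row of w*p LBNs containing lbn."""
    row = w * p  # one stripe unit of w LBNs on each of the p disks
    first = (lbn // row) * row
    return first, first + row - 1


# Hypothetical parameters: w = 4 LBNs per stripe unit, p = 4 disks.
unit = get_boundaries(5, w=4)
row = get_rectangle(21, w=4, p=4)
```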
With no outstanding requests in the queue (i.e., the
disk is idle), current SCSI disks will immediately schedule
the first received request of a batch, even though it may
not be the one with the smallest rotational latency. This
diminishes the effectiveness of semi-sequential access.
To overcome this problem, our Atropos implementation
“pre-schedules” the batch of requests by sending first the
request that will incur the smallest rotational latency. It
uses known techniques for SPTF scheduling outside of
disk firmware. With the help of a detailed and validated
model of the disk mechanics [2, 21], the disk head
position is deduced from the location and time of the
last-completed request. If disks waited for all requests
of a batch before making a scheduling decision, this pre-scheduling
would not be necessary.
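The idea behind pre-scheduling can be sketched with a deliberately simplified disk model. The sketch assumes a constant rotation period, expresses positions as fractions of a revolution, and ignores seek and head-switch time, all of which the validated disk model used by Atropos accounts for. The function names and the request format (a dict with an "angle" field) are hypothetical.

```python
def head_angle(last_end_angle, last_done_time, now, period):
    """Head position (fraction of a revolution) at time now, deduced from
    where and when the last-completed request finished."""
    return (last_end_angle + (now - last_done_time) / period) % 1.0


def rotational_latency(start_angle, head, period):
    """Time until the sector at start_angle next passes under the head."""
    return ((start_angle - head) % 1.0) * period


def preschedule(batch, last_end_angle, last_done_time, now, period):
    """Reorder the batch so the request with the smallest rotational
    latency is sent to the disk first."""
    head = head_angle(last_end_angle, last_done_time, now, period)
    first = min(batch,
                key=lambda r: rotational_latency(r["angle"], head, period))
    return [first] + [r for r in batch if r is not first]


# Hypothetical batch on a disk with a 6 ms rotation period; the last
# request completed at angle 0.0 at time 0, and it is now 1.5 ms later.
batch = [{"id": "a", "angle": 0.1},
         {"id": "b", "angle": 0.3},
         {"id": "c", "angle": 0.9}]
order = preschedule(batch, last_end_angle=0.0, last_done_time=0.0,
                    now=1.5, period=6.0)
```

Only the first request is chosen here; the remaining requests keep their original order, on the assumption that the disk's own scheduler orders them once the queue is non-empty.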
Our implementation of the Atropos logical volume
manager is about 2000 lines of C++ code and includes
implementations of RAID levels 0 and 1. Another 600
lines of C code implement methods for automatically extracting
track boundaries and head switch time [22, 26].
4 Efficient access in database systems
Efficient access to database tables in both dimensions
can significantly improve performance of a variety of
queries doing selective table scans. These queries can request
(i) a subset of columns (restricting access along the
primary dimension, if the order is column-major), which
is prevalent in decision support workloads (TPC-H), (ii)
a subset of rows (restricting access along the secondary
dimension), which is prevalent in online transaction processing
(TPC-C), or (iii) a combination of both.
A companion project to Atropos extends the
Shore database storage manager to support a page
layout that takes advantage of Atropos’s efficient accesses
in both dimensions. The page layout is based
on a cache-efficient page layout, called PAX, which
extends the NSM page layout to group values of a single
attribute into units called “minipages”. Minipages in
PAX exist to take advantage of CPU cache prefetchers
to minimize cache misses during single-attribute memory
accesses. We use minipages as well, but they are
aligned and sized to fit into one or more 512 byte LBNs,
depending on the relative sizes of the attributes within a page.
The mapping of 8 KB pages onto the quadrangles
of the Atropos logical volume is depicted in Figure 6.
A single page contains 16 equally-sized attributes, labeled
A1–A16, where each attribute is stored in a separate
minipage that maps to a single VLBN. Accessing a
single page is thus done by issuing 16 batched requests
to every 16th (or more generally, wp-th) VLBN. Internally,
the VLBNs comprising this page are mapped diagonally
to the blocks marked with the dashed arrow.
Hence, 4 semi-sequential accesses proceeding in parallel
can fetch the entire page (i.e., row-major order access).
Individual minipages are mapped across sequential
runs of VLBNs. For example, to fetch attribute A1 for
records 0–399, the database storage manager can issue
one efficient sequential I/O to fetch the appropriate minipages.
Atropos breaks this I/O into four efficient, track-based
disk accesses proceeding in parallel. The database
storage manager then reassembles these minipages into
appropriate 8 KB pages.
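The two access patterns described above can be illustrated with one possible minipage-to-VLBN mapping. This is a sketch of a layout consistent with the description, not necessarily the exact mapping used in the implementation: it assumes that a group of wp pages occupies one row of wp VLBNs per attribute, with w = 4, p = 4, and 16 attributes as in the example. The function names are hypothetical.

```python
def minipage_vlbn(page, attr, w, p, n_attrs, base=0):
    """VLBN of the minipage holding attribute attr (0-based) of a page."""
    row = w * p                      # VLBNs in one row across all disks
    group, slot = divmod(page, row)  # row pages share one group of rows
    return base + group * row * n_attrs + attr * row + slot


def page_vlbns(page, w, p, n_attrs):
    """VLBNs to batch together when fetching one whole page."""
    return [minipage_vlbn(page, a, w, p, n_attrs) for a in range(n_attrs)]


def attribute_vlbns(attr, pages, w, p, n_attrs):
    """VLBNs holding one attribute for the given pages."""
    return [minipage_vlbn(pg, attr, w, p, n_attrs) for pg in pages]


# Fetching page 0 touches every 16th VLBN: 0, 16, 32, ..., 240.
whole_page = page_vlbns(0, w=4, p=4, n_attrs=16)

# Fetching attribute A1 for the first 16 pages is one sequential run, 0..15.
a1_run = attribute_vlbns(0, range(16), w=4, p=4, n_attrs=16)
```

Under this mapping, a whole-page fetch issues n_attrs batched requests spaced wp VLBNs apart, while a single-attribute scan over a group of pages reads wp contiguous VLBNs, that is, one stripe unit of w VLBNs on each of the p disks.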