Space & Compute Requirements of PPM* for Migration within the Secondary
Storage
Kameshwari R Gopinath K
kam@csa.iisc.ernet.in gopi@csa.iisc.ernet.in
Department of Computer Science & Automation,
Indian Institute of Science, Bangalore, India.
Abstract--The PPM* algorithm is used in this work to dynamically
migrate and reorganize data in a three-tiered secondary
storage hierarchy implemented using software RAID. The
three levels implemented in the prototype, named Temperature
Sensitive Storage (TSS), are RAID 10, RAID 5 and compressed
RAID 5. Our earlier study confirmed the usefulness
of this algorithm in reducing performance degradation
with compressed RAID 5 levels being 20% or more of
the working-set size.
The usefulness of PPM* for such an application is gauged
not only by using performance metrics such as throughput,
residual compressed RAID 5 migrations, hit/miss ratios
at different migration boundaries, but also by its space and
compute requirements.
Results using two file-system traces, Sitar & Harp, show that
the worst case storage requirement of PPM* has been
well below 10 MB and CPU usage has been moderate. The
total migrations generated for 400000 accesses have been less
than 3% of the total accesses seen by the TSS device driver.
KEY WORDS: RAID, Storage migration, Prefetch, Prediction
by Partial Match Unbounded
1. INTRODUCTION & MOTIVATION
A three-tiered secondary storage hierarchy has been studied
in [KK02] to achieve a better cost/performance ratio for the
secondary storage system. The three levels used in the study are
RAID 10, RAID 5 and compressed RAID 5 configurations.
The compressed RAID 5 layer offsets the extra storage used
in the top two layers by storing data in compressed form.
Demand accesses to the compressed layer are minimized by
predicting future accesses of the working-set and migrating
the data to the upper layers. The effectiveness of rule-based
and predictive algorithms is also studied there. The predictive
algorithm used in the study is a context based algorithm
known as prediction by partial match unbounded (PPM*).
The following paragraphs first motivate the decision to use
context based algorithms; the paper then reports the results
obtained from the study conducted to assess the suitability of
the PPM* algorithm with respect to memory and CPU usage.
As most applications exhibit intermittent regular sequences in
their access patterns, predicting these patterns can be used to
enhance performance by reducing the penalty due to demand
accesses to the slower levels (the compressed RAID 5 layer in
this case).
The idea of using compression algorithms for making predictions
has been studied extensively and has been found to
be effective. These algorithms use the probability distribution of
elements in the data to achieve effective encoding [CKV93].
The analogous situation in prefetching is that those extents
that are most likely to be accessed next should be prefetched.
Therefore, the model used by the compressor to effectively
compress a data sequence could be used for making predictions,
as it describes the data accurately. Context modelling
has been shown to achieve superior performance in the compression
literature. Techniques in this family use the preceding
few characters while calculating the probability of the
next character. Predictions have been found to be more accurate
when produced with respect to some context. Prediction
by partial matching (PPM) is one such model described
in the literature ([CTW95], [PM99]). PPM is a finite-context
statistical modelling technique that can be viewed as blending
together several fixed-order context models to predict the
next access in the input sequence. Prediction probabilities
for each context in the model are calculated from frequency
counts which are updated adaptively. The maximum context
length is fixed in PPM. Here we have used a derivative of
PPM called PPM* [CTW95] which exploits contexts of unbounded
length.
Our study [KK02] confirmed the usefulness of using PPM*
for prediction of data accesses and their subsequent migration
to the upper levels in a three-tiered secondary storage
hierarchy. The usefulness of PPM* for such an application
is gauged not only by using performance metrics such as
throughput, residual compressed RAID 5 migrations, hit/miss
ratios at different migration boundaries, but also by its
space and compute requirements. This paper reports results
on the space and compute requirements of PPM* for our application.
In-depth details on the environment and model of
our study can be found in [KK02].
2. RELATED WORK
The first to examine the use of compression modelling techniques
to track reference patterns and prefetch data were Vitter,
Krishnan and Curewitz ([VK91], [VK96]). In [CKV93],
they proved that for a Markov source such techniques converge
to an optimal on-line algorithm.
We have used pattern modelling techniques from text compression
for modelling file access patterns. Several researchers
have extended PPM within the field of data compression.
Moffat [Mof90] addresses the
model size by periodically constructing a new model with the
last 2048 events and then using it to replace the current trie.
More recently, Cleary et al [CTW95] have presented methods
for extending PPM to support unbounded context lengths.
Our work uses this enhancement of PPM for prediction of
the access patterns.
Context modelling has also been used extensively in the web
context, most importantly in [PM99] and [FCLJ99]. An
important difference in handling the context buildup between
a web reference sequence and a file-system reference sequence
is that the context is built per user in the former, as against a
single context trie in the latter.
Kroeger and Long [KL96] have adapted PPM to the file-system context,
wherein they track the file system accesses that have a high
probability of occurring next. By prefetching data of such
events, an LRU cache has been transformed into a predictive
cache. Their 4 MB predictive cache has a higher cache hit
rate than a 90 MB LRU cache. They have implemented the
algorithm in the kernel context, where memory is scarce, and
have proposed the PPM variants Partitioned Context Modeling
(PCM) and Extended Partitioned Context Modeling (EPCM)
to track file access patterns up to a specified length.
3. PPM* IMPLEMENTATION
In our context, the elements of the model are data extents
(multiple sectors of data used as the prediction unit) used by
read/write accesses occurring in the file-system which we
are profiling. Sequences of these extent reads/writes constitute
the context. If A, B and C are three extents, the context
(A,B,C) denotes that these three extents have been accessed
in that specific order at least once by at least one process.
The trie data structure used helps to mingle contexts of different
lengths into a single structure. The dynamism in the
context length permits the prediction of more than one extent
per access.
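For illustration, a node of such a context trie might be declared as in
the following C sketch; the field names and the child/sibling representation
are our assumptions, not the prototype's actual layout:

#include <stdint.h>

/* An extent is the prediction unit: a run of sectors identified by its
 * starting sector and length. (Illustrative layout only.) */
typedef struct extent {
    uint64_t start_sector;
    uint32_t n_sectors;
} extent_t;

/* One node of the context trie. The path from the root to a node spells
 * out a context such as (A,B,C); a node's children record which extents
 * have followed that context, and how often. */
typedef struct trie_node {
    extent_t          ext;         /* extent labelling this node       */
    uint32_t          count;       /* accesses seen through this node  */
    uint64_t          last_access; /* global access number at last hit */
    struct trie_node *child;       /* first extent seen after this one */
    struct trie_node *sibling;     /* next alternative at this depth   */
} trie_node_t;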
Exponential growth in the memory requirements of higher
contexts has been restricted using techniques such as the
following ([CTW95]): use of trie structures to store PPM models in
conjunction with pointers back to the input string; pointing
leaf nodes to the input string whenever the context is unique,
with the input pointer moved forward by one position when the
context needs to be extended; and lastly, maintaining a list
of pointers to the currently active contexts to facilitate trie
update.
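A hedged sketch of the leaf-pointer technique follows: a unique context is
represented by an index into the logged access sequence, so extending it
costs a pointer move rather than a node allocation. All names here are
illustrative:

#include <stddef.h>
#include <stdint.h>

/* Illustrative only: a leaf representing a unique context holds an index
 * into the logged input sequence instead of materialising more nodes. */
typedef struct leaf_ctx {
    size_t input_pos;  /* where in the access log this context continues */
} leaf_ctx_t;

/* On the next access we either keep following the log (match: the context
 * is extended for free by advancing the pointer) or the context is no
 * longer unique and the caller must split the leaf into real trie nodes. */
static int follow_unique_context(const uint64_t *input, leaf_ctx_t *leaf,
                                 uint64_t next_extent_id)
{
    if (input[leaf->input_pos] == next_extent_id) {
        leaf->input_pos++;
        return 1;
    }
    return 0;
}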
There are two basic differences between using multi-order context
models to track file-system reference patterns and using them for
text [Kro00]: the number of symbols modelled
and the length of the stream to be handled. While the former
exacerbates the model space requirement, the latter reflects the
need to adapt to changes in reference patterns.
The alphabet (the set of symbols modelled) is larger and more
dynamic in a file-system than the alphabet of a text file.
The number of extents in a file, and the number of files in
a file-system, is orders of magnitude larger, and the alphabet
also changes due to file creation and deletion.
The dynamic and a priori unknown "alphabet" in the file-system
raises the model space requirements. Each node in
the trie can have a child for every member of the alphabet,
so the trie built using the plain PPM model does not scale with the
change in the alphabet size. In the case of PPM*, the heights
of the branches from different file extents are proportional to
the relative popularity of the extents. This greatly reduces the
model space requirement for equivalent PPM orders.
For a model to be able to adapt to changes in a file's predictive
nature, there needs to be some method of periodically
reducing node counts, so that recent access counts take
precedence over much older accesses. This reduction of node
access counts also helps in keeping the model space requirements
under a threshold, by periodically scanning the context
trie for those nodes which are at least a cleaning interval old
and have their counts below some threshold value (typically
equal to 10% of the maximum access count in the window under
consideration). Unlike in [Kro00], a new extent entry in the context
is never inhibited in our case, as our algorithm runs in
user space and memory usage need not be critically monitored.
We have retained an overall model space requirement
to which the cleaning routine conforms.
PPM* has been implemented using a combination of
Patricia and trie nodes. In order to limit the storage used
by the Patricia trie and the trie sub-branches, we have provided
a cleaning routine that incrementally removes trie and
Patricia nodes based on their access counts (programmable)
and an age-interval (programmable). The age-interval prevents the
removal of nodes (both Patricia and trie) which are less
than a "base access" away from the "current access". The default
"base access" is ageinterval/2. This lets near references
build up adequately over a longer number of accesses before
we decide to remove them. The aging scheme used in our study
not only helps in predicting the current working set accesses,
but also in removing cold trie nodes. Every time a node is
encountered during a context trie build-up, the age of the access
count is adjusted before the current count and age interval are
updated in the node, as shown in Figure 1.
Two programmable aspects of keeping the number of nodes
under control (cleaning) are the cleaning interval and the scope
of node removal. The choice of the cleaning interval is determined
by the amount of useful context build-up we
want to retain vis-a-vis the rate of growth of the trie structure.
The scope of cleaning determines the context orders in
which nodes are checked for low access counts so that they
may be removed. These can be the trie nodes representing
the higher contexts, or a combination of both the trie nodes and
the Patricia nodes (representing context 0). Removal of
context 0 nodes has the lowest priority, as their presence
permits context buildup with just one access. Extents are
not added to context 0 on their first occurrence: an extent
has to occur "MIN_CTXT_CNT" times before it is considered
hot enough to be included in the context trie (a minimal
sketch of this admission check appears below). In our
study, we have set "MIN_CTXT_CNT" equal to three. We
have adopted a threshold based, incremental approach to contain
the context buildup to within a desired value.
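As an illustration of the admission rule just described, the check might
look like the following minimal C sketch; the side-table counter and the
function name are hypothetical:

#include <stdint.h>

#define MIN_CTXT_CNT 3   /* set to three in our study */

/* Hypothetical admission check: bump the raw occurrence count kept for
 * an extent in a side table and report whether the extent has now been
 * seen often enough to be entered into the context trie, so that one-off
 * accesses never pollute context 0. */
static int admit_extent(uint32_t *occurrences)
{
    return ++(*occurrences) >= MIN_CTXT_CNT;
}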
Figure 1. Pseudo code for Aging a Trie-node
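The pseudo code of Figure 1 is not reproduced in this copy; the following
minimal C sketch shows one way such an aging step could work, assuming
(purely as an illustration) that the count is halved once for every full
age interval that has elapsed since the node was last touched:

#include <stdint.h>

#define AGE_INTERVAL 25000u  /* programmable; value chosen for illustration */

typedef struct aged_node {   /* only the fields the aging step touches */
    uint32_t count;
    uint64_t last_access;
} aged_node_t;

/* Age a node when it is met during context build-up: first decay the
 * stale count, then record the current access, as described in the text. */
static void age_and_touch(aged_node_t *n, uint64_t current_access)
{
    uint64_t elapsed = current_access - n->last_access;

    while (elapsed >= AGE_INTERVAL && n->count > 0) {
        n->count >>= 1;           /* halve: recent accesses dominate */
        elapsed  -= AGE_INTERVAL;
    }
    n->count++;                   /* account for the current access  */
    n->last_access = current_access;
}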
The hierarchy employed in selecting the scope of the node
cleanup is listed below:
Removal of Cold Higher Order Nodes - Any new re-build
can easily take over from the highest parent available under
each context branch.
Removal of Cold Order 0 Nodes - Rebuilds of higher contexts
starting with these nodes will now have to wait for
"MIN_CTXT_CNT" accesses until they get re-entered into the
context trie.
Removal of All Higher Order Nodes - All predictions will
stop, as we do not use order 0 nodes for predictions.
Removal of All Order 0 Nodes - Start afresh from level zero
(the root level)!
An LRU type of algorithm has not been used for node removal
because of the additional storage it would require in each
extent data structure in the form of doubly linked lists. Also,
we had to cater to the bursty nature of file-system accesses and
hence chose a threshold based removal of nodes, which allows
bulk removal. Here the cleaner routine incrementally
scans the entire context trie top-down and removes a
node and all its children when the node's access count falls
below some pre-specified threshold. This threshold itself is
adaptive, based on the urgency to control the entire context
trie size.
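A minimal sketch of such a threshold based, bulk removing cleaner is given
below. The child/sibling node layout follows the earlier illustrative
struct (repeated so the fragment stands alone), and the traversal shown is
our assumption, not the prototype's code:

#include <stdint.h>
#include <stdlib.h>

typedef struct trie_node {      /* child/sibling layout as sketched earlier */
    uint32_t          count;
    struct trie_node *child, *sibling;
} trie_node_t;

/* Free a node list together with all subtrees hanging off it. */
static void free_subtree(trie_node_t *n)
{
    while (n) {
        trie_node_t *next = n->sibling;
        free_subtree(n->child);
        free(n);
        n = next;
    }
}

/* Top-down incremental clean: unlink and bulk-free any branch whose
 * (aged) count fell below the threshold, and recurse into the survivors.
 * The caller adapts `threshold` (e.g. 10% of the maximum access count in
 * the window) to how urgently the trie size must be contained. */
static trie_node_t *clean(trie_node_t *head, uint32_t threshold)
{
    trie_node_t **link = &head;

    while (*link) {
        trie_node_t *cur = *link;
        if (cur->count < threshold) {
            *link = cur->sibling;      /* unlink the cold branch     */
            cur->sibling = NULL;
            free_subtree(cur);         /* removes the whole subtree  */
        } else {
            cur->child = clean(cur->child, threshold);
            link = &cur->sibling;
        }
    }
    return head;
}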
4. TRACES USED
We have used the publicly available Sitar & Harp traces to
determine the effectiveness of our space restricting methods.
These traces were generated as part of the MBFS project in the
Distributed Computing Systems Lab of the Department of Computer
Science, University of Kentucky, by Randy Appleton et
al. The two traces instrumented by them represent two distinct
types of work environment.
The Sitar trace records user activity on a publicly available
SPARCstation, and represents typical non-technical use in
an academic environment. This trace is representative of
many corporate, administrative, or government office environments.
The trace spans approximately ten days and can be
considered IO bound in nature.
The Harp trace was gathered on a SPARCstation reserved for
use by a research project. The trace is dominated by two collaborating
programmers working on a large software project.
The trace lasts about seven days, is representative of common
programmer activity, and may be considered a data
bound application.
The write-to-read ratios for the Sitar and Harp traces are
0.86 and 48 respectively.
5. EXPERIMENTAL RESULTS
We now discuss the results obtained using the Sitar
& Harp traces with respect to the size of the PPM* data structures,
the CPU load consumed and the migration load generated at
the device driver.
Cleaning Interval
As already mentioned in Section 3, a very small cleaning
interval affects the context buildup, while large cleaning intervals
increase the size of the trie structure and could impose
overhead in terms of trie traversal and also extra memory
usage. The traces were run on the prototype using three
cleaning intervals, viz. 10000, 25000 and 50000 accesses.
Figure 2 shows throughput as a function of the
cleaning interval for the Sitar trace (the plot for the Harp trace
is similar), while Figures 3 & 4 plot the memory used
by the PPM* implementation for the Sitar & Harp traces respectively.
In both traces, the throughput plots show negligible variation
across cleaning intervals. In the case of the Sitar
plots, there is a marginal increase in throughput at cleaning
interval 25000.
When the size plots are considered, there is a distinct difference
in the memory used at the various cleaning
intervals. The nearly double size requirement at
cleaning interval 50000 is clearly not justified: the extra
context build-up it provides does not translate
into useful context information.
Figure 2. Effect of Cleaning Interval on Throughput - Sitar
Trace - PPM* context depth 4, search depth 10, threshold
0.125
Figure 3. Effect of Cleaning Interval on Size - Sitar Trace -
PPM* context depth 4, search depth 10, threshold 0.125
Figure 4. Effect of Cleaning Interval on Size - Harp Trace
- PPM* context depth 4, search depth 10, threshold 0.125
Between cleaning intervals of 10000 and 25000, it was decided
to go with the 25000 interval since, due to its extra
context information, it is better geared to handle loaded system
conditions and applications exhibiting less locality,
at a moderate extra memory requirement. Even at a cleaning
interval of 25000 accesses, the memory used by the context
trie and its related data structures has been less than 6 MB for
a total run length of 400000 accesses.
PPM* CPU Usage
Figures 5 and 6 show the % CPU usage as a function of the profiling
interval count. Each plot has three graphs, one per lwp
thread: dm read, the thread that reads the access information
through the DMAPI interface ([Gro95]); context
build, which builds the context and also generates the predictions;
and migrate, which performs migration of the stripes in the
TSS device. In both the Sitar and Harp plots, dm read
is the most expensive in terms of CPU usage, followed by
context build, with migrate being the least. There is scope for
reducing the CPU usage in both dm read and context build.
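For concreteness, the following self-contained C sketch mimics this
three-thread split with a small bounded queue; the queue, the stub loop
bodies and all identifiers are our assumptions and merely stand in for
the DMAPI reads, trie updates and stripe migrations of the actual
prototype:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

#define QCAP 64

typedef struct {                     /* bounded single-lock queue        */
    uint64_t        buf[QCAP];
    int             head, len;
    pthread_mutex_t mu;
    pthread_cond_t  cv;
} queue_t;

static void q_put(queue_t *q, uint64_t v)
{
    pthread_mutex_lock(&q->mu);
    while (q->len == QCAP)
        pthread_cond_wait(&q->cv, &q->mu);
    q->buf[(q->head + q->len++) % QCAP] = v;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->mu);
}

static uint64_t q_get(queue_t *q)
{
    pthread_mutex_lock(&q->mu);
    while (q->len == 0)
        pthread_cond_wait(&q->cv, &q->mu);
    uint64_t v = q->buf[q->head];
    q->head = (q->head + 1) % QCAP;
    q->len--;
    pthread_cond_broadcast(&q->cv);
    pthread_mutex_unlock(&q->mu);
    return v;
}

static queue_t accesses   = { .mu = PTHREAD_MUTEX_INITIALIZER,
                              .cv = PTHREAD_COND_INITIALIZER };
static queue_t migrations = { .mu = PTHREAD_MUTEX_INITIALIZER,
                              .cv = PTHREAD_COND_INITIALIZER };

static void *dm_read_thread(void *arg)        /* reads DMAPI events      */
{
    for (uint64_t ext = 0; ext < 16; ext++)
        q_put(&accesses, ext);                /* stand-in for event read */
    return arg;
}

static void *context_build_thread(void *arg)  /* trie update, prediction */
{
    for (int i = 0; i < 16; i++)
        q_put(&migrations, q_get(&accesses)); /* prediction stubbed 1:1  */
    return arg;
}

static void *migrate_thread(void *arg)        /* moves stripes in TSS    */
{
    for (int i = 0; i < 16; i++)
        printf("migrate extent %llu\n",
               (unsigned long long)q_get(&migrations));
    return arg;
}

int main(void)
{
    pthread_t t[3];
    pthread_create(&t[0], NULL, dm_read_thread, NULL);
    pthread_create(&t[1], NULL, context_build_thread, NULL);
    pthread_create(&t[2], NULL, migrate_thread, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);
    return 0;
}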
Figure 5. CPU Usage - Sitar - context depth 4, search
depth 10, threshold 0.125
Figure 6. CPU Usage - Harp - context depth 4, search
depth 10, threshold 0.125
Migration Overhead
In an attempt to assess the intensity of the excess migrations that
get generated with a low probability threshold, we plotted
the ratio of total compressed RAID 5 migrations to the total number
of accesses seen by the TSS driver during the application run, for
all the combinations considered in this study, for both the Sitar
and Harp traces.
Figure 7 shows the % migration for the Sitar trace (the plot is
similar for the Harp trace). In both traces, and in all the
cases studied, the migrations due to compressed RAID 5 accesses
are a very small percentage of the total accesses seen
by the TSS driver during the run. This ratio decreases as the
run progresses and stays below 0.03 in both traces, i.e. compressed
RAID 5 migrations amount to less than 3% of the total
accesses seen by the TSS driver during the application run.
6. CONCLUSIONS
The memory space used with both the Sitar & Harp traces has
stayed at a very comfortable level of less than 10 MB, made possible
by the periodic cleaning employed, without much impact
on the throughput. The migration overhead has also been well
within acceptable limits at about 3% of the total accesses
seen by the TSS device. The % CPU usage peaks at
15%, 10% & 5% for the dm read, context build & migrate
threads respectively. This being a prototype implementation,
there is a lot of scope for reducing the CPU usage, even
though CPU time is often not considered a scarce resource in high-end
systems.
Future work could concentrate on making the DMAPI
event reads lighter and on further optimizing the context build and
prefetch routines.
REFERENCES
[CKV93] Kenneth M. Curewitz, P. Krishnan, and Jeffrey Scott Vitter.
Practical prefetching via data compression. In ACM SIGMOD
International Conference, pages 257-266, Washington DC, USA,
June 1993.
[CTW95] John G. Cleary, W. J. Teahan, and Ian H. Witten. Unbounded
length contexts for PPM. In Proceedings of the 1995 Data
Compression Conference, pages 52-61, 1995.
[FCLJ99] Li Fan, Pei Cao, Wei Lin, and Quinn Jacobson. Web
prefetching between low-bandwidth clients and proxies: Potential
and performance. In Proceedings of the Joint International
Conference on Measurements and Modeling of Computer Systems
(SIGMETRICS '99), Atlanta, GA, May 1999.
[Gro95] Data Management Interface Group. Interface specification
for the Data Management Application Programming Interface
(DMAPI), version 2.1, March 1995. DMIG Specification, 1995.
[KK02] Kameshwari R. and Gopinath K. Vertical migration within
the secondary storage level for increased IO performance. In
Proceedings of the HPC Asia 2002 Conference, Bangalore, India,
December 2002.
[KL96] Tom M. Kroeger and Darrell D. E. Long. Predicting future
file-system actions from prior events. In Usenix 1996 Annual
Technical Conference, San Diego, California, January 1996.
[Kro00] Tom M. Kroeger. Modeling file access patterns to improve
caching performance. PhD Thesis, University of California,
Santa Cruz, March 2000.
[Mof90] A. Moffat. Implementing the PPM data compression scheme.
IEEE Transactions on Communications, vol. 38, pages 1917-1921,
November 1990.
[PM99] Themistoklis Palpanas and Alberto Mendelzon. Web
prefetching using partial match prediction. In Proceedings of
the 4th International Web Caching Workshop, 1999.
[VK91] J. S. Vitter and P. Krishnan. Optimal prefetching via data
compression. In Proceedings of the 32nd Annual Symposium on
Foundations of Computer Science, pages 121-130, IEEE, October
1991.
[VK96] J. S. Vitter and P. Krishnan. Optimal prefetching via data
compression. Journal of the ACM, vol. 43, pages 773-793,
September 1996.