2 % Copyright (C) 2007 Alan D. Brunelle <Alan.Brunelle@hp.com>
4 % This program is free software; you can redistribute it and/or modify
5 % it under the terms of the GNU General Public License as published by
6 % the Free Software Foundation; either version 2 of the License, or
7 % (at your option) any later version.
9 % This program is distributed in the hope that it will be useful,
10 % but WITHOUT ANY WARRANTY; without even the implied warranty of
11 % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
12 % GNU General Public License for more details.
14 % You should have received a copy of the GNU General Public License
15 % along with this program; if not, write to the Free Software
16 % Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
18 % vi :set textwidth=75
20 \documentclass{article}
21 \usepackage{multirow,graphicx,placeins}
24 %---------------------
25 \title{\texttt{btrecord} and \texttt{btreplay} User Guide}
26 \author{Alan D. Brunelle (Alan.Brunelle@hp.com)}
32 \thispagestyle{empty}\newpage
33 %---------------------
34 \tableofcontents\thispagestyle{empty}\newpage
35 %---------------------
36 \section{Introduction}
40 This document presents a command line overview of
41 \texttt{btrecord} and \texttt{btreplay}, and shows some common
42 examples of their use in everyday work here at OSLO's Scalability and
45 \subsection*{Build Note}
47 To build these tools, one needs to
48 place the source directory next to a valid
49 \texttt{blktrace}\footnote{\texttt{git://git.kernel.dk/blktrace.git}}
50 directory, as the \texttt{Makefile} references \texttt{../blktrace}.
53 %---------------------
54 \newpage\section{\texttt{btrecord} and \texttt{btreplay} Operating Model}
56 The \texttt{blktrace} utility provides the ability to collect detailed
57 traces from the kernel for each IO processed by the block IO layer. The
58 traces provide a complete timeline for each IO processed, including
59 detailed information concerning when an IO was first received by the block
60 IO layer -- indicating the device, CPU number, time stamp, IO direction,
61 sector number and IO size (number of sectors). Using this information,
62 one is able to \emph{replay} the IO again on the same machine or another
65 \subsection{Basic Workflow}
66 The basic workflow to replay IOs is as follows:
69 \item Run \texttt{blktrace} to collect traces. Here you specify the
70 device or devices that you wish to trace and later replay IOs upon. Note:
71 the only traces you are interested in are \emph{QUEUE} requests --
72 thus, to save system resources (including storage for traces), one could
73 specify the \texttt{-a queue} command line option to \texttt{blktrace}.
75 \item While \texttt{blktrace} is running, you run the workload that you
78 \item When the workload has completed, you stop the \texttt{blktrace}
79 utility (thus saving all traces over the complete workload).
81 \item You extract the pertinent IO information from the traces saved by
82 \texttt{blktrace} using the \texttt{btrecord} utility. This will parse
83 each trace file created by \texttt{blktrace}, and craft IO descriptions
84 to be used in the next phase of the workload processing.
86 \item Once \texttt{btrecord} has successfully created a series of data
87 files to be processed, you can run the \texttt{btreplay} utility which
88 attempts to generate the same IOs seen during the sample workload phase.
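
The workflow above can be sketched as a small helper that assembles the
three command lines in order. This is only an illustration: the function
name \texttt{replay\_workflow} is hypothetical, the use of the
\texttt{blktrace} \texttt{-D} option to pick the trace directory is a
choice made for this example, and passing the bare device name (without
\texttt{/dev/}) to \texttt{btrecord} and \texttt{btreplay} is an assumption.
Nothing is executed; the commands are only printed.

```python
import shlex


def replay_workflow(device, trace_dir=".", enable_writes=False):
    """Return the command lines for one trace/record/replay cycle, in order.

    The workload itself is run by the user while the first command
    (blktrace) is active; it is not represented here.
    """
    name = device.rsplit("/", 1)[-1]  # "/dev/sda" -> "sda" (assumed form)

    # Only QUEUE traces are needed for replay, hence "-a queue".
    blktrace = ["blktrace", "-d", device, "-a", "queue", "-D", trace_dir]

    # btrecord reads the traces and writes record files to the same place.
    btrecord = ["btrecord", "-d", trace_dir, "-D", trace_dir, name]

    # btreplay ignores write requests unless -W is given.
    btreplay = ["btreplay", "-d", trace_dir]
    if enable_writes:
        btreplay.append("-W")
    btreplay.append(name)

    return [blktrace, btrecord, btreplay]


if __name__ == "__main__":
    for argv in replay_workflow("/dev/sda"):
        print(shlex.join(argv))
```

Keeping the commands as argument lists makes it straightforward to hand
them to a process launcher later without re-parsing a shell string.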
91 \subsection{IO Stream Replay Characteristics}
92 The major characteristics of the IO stream that are kept intact include:
95 \item[Device] The IOs are replayed on the same device as was seen
96 during the sample workload.
98 \item[IO direction] The same IO direction (read/write) is maintained.
100 \item[IO offset] The same device offset is maintained.
102 \item[IO size] The same number of sectors is transferred.
104 \item[Time differential] The time stamps stored during the
105 \texttt{blktrace} run are used to determine the amount of time between
106 IOs during the sample workload. \texttt{btreplay} \emph{attempts} to
107 maintain the same time differential between IOs, but no guarantees as
108 to complete accuracy are provided by the utility.
110 \item[Device IO Stream Ordering] All IOs on a device are submitted in
111 the precise order they were seen during the sample workload run.
114 As noted above, the time between IOs may not be accurately maintained
115 during replays. In addition, the actual ordering of IOs \emph{between}
116 devices is not necessarily maintained. (Each device with an IO stream
117 maintains its own concept of time, and thus there may be slippage of the
118 time kept between managing threads.)
121 We have prototyped a different approach, wherein a single managing
122 thread handles all IOs across all devices. This approach, while
123 guaranteeing correct ordering of IOs across all devices, resulted in
124 much worse timing on a per IO basis.
127 \subsection{\texttt{btrecord/btreplay} Method of Operation}
129 As noted above, \texttt{btrecord} extracts \texttt{QUEUE} operations from
130 \texttt{blktrace} output. These \texttt{QUEUE} operations indicate the
131 entrance of IOs into the block IO layer. In order to replay these IOs with
132 some accuracy with regard to ordering and timeliness, we decided to take
133 multiple sequential (in time) IOs and put them in a single \emph{bunch} of
134 IOs that will be processed as a single \emph{asynchronous IO} call to the
135 kernel\footnote{Attempts to do them individually resulted in too large a
136 turnaround time penalty (user-space to kernel and back). Note that in a
137 number of workloads, the IOs are coming in from the page cache handling
138 code, and thus are submitted to the block IO layer with \emph{very small}
139 time intervals between issues.}. To manage the size of the \emph{bunches},
140 the \texttt{btrecord} utility provides you with two controlling knobs:
143 \item[\texttt{--max-bunch-time}] This is the amount of time to encompass
144 in one bunch -- only IOs within the time specified are eligible
145 for \emph{bunching.} The default time is 10 milliseconds (10,000,000
146 nanoseconds). Refer to section~\ref{sec:c-o-m} on page~\pageref{sec:c-o-m}
147 for more information.
149 \item[\texttt{--max-pkts}] A \emph{bunch} size can be anywhere from
150 1 to 512 packets in size, and by default a bunch contains no
151 more than 8 individual IOs. With this option, one can increase or
152 decrease the maximum \emph{bunch} size. Refer to section~\ref{sec:c-o-M}
153 on page~\pageref{sec:c-o-M} for more information.
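
The effect of the two knobs can be illustrated with a minimal sketch of a
bunching pass. It is not the \texttt{btrecord} implementation: the function
name \texttt{make\_bunches}, the tuple layout of an IO, and the rule that a
bunch is closed once an IO arrives more than the time limit after the
\emph{first} IO of the bunch are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical IO record: (timestamp in nanoseconds, sector, sector count)
IO = Tuple[int, int, int]


def make_bunches(ios: List[IO],
                 max_bunch_ns: int = 10_000_000,
                 max_pkts: int = 8) -> List[List[IO]]:
    """Group time-ordered IOs into bunches bounded by time span and count."""
    if not 1 <= max_pkts <= 512:
        raise ValueError("max_pkts must be between 1 and 512")

    bunches: List[List[IO]] = []
    current: List[IO] = []
    for io in ios:
        if current:
            full = len(current) >= max_pkts
            too_late = io[0] - current[0][0] > max_bunch_ns
            if full or too_late:
                # Close the current bunch and start a new one with this IO.
                bunches.append(current)
                current = []
        current.append(io)
    if current:
        bunches.append(current)
    return bunches


if __name__ == "__main__":
    sample = [(0, 0, 8), (1_000_000, 8, 8), (12_000_000, 16, 8)]
    for bunch in make_bunches(sample):
        print(len(bunch), "IO(s):", bunch)
```

With the defaults, the first two sample IOs fall within 10 milliseconds of
each other and share a bunch, while the third starts a new one. Passing
\texttt{max\_pkts=1} places every IO in its own bunch.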
156 Each input data file (one per device per CPU) results in a new record
157 data file (again, one per device per CPU) which contains information
158 about \emph{bunches} of IOs to be replayed. \texttt{btreplay} operates on
159 these record data files by spawning a new pair of threads per file. One
160 thread manages the submission of AIOs per bunch in the record data file,
161 while the other thread manages reclaiming completed AIOs\footnote{We
162 have found that having the same thread do both results in a further
163 reduction in replay timing accuracy.}.
165 Each submitting thread simply reads the input file of \emph{bunches}
166 recorded by \texttt{btrecord}, and attempts to faithfully reproduce the
167 ordering and timing of IOs seen during the sample workload. The reclaiming
168 thread simply waits for AIO completions, freeing up resources for the
169 submitting thread to utilize to submit new AIOs.
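
The division of labor between the two threads can be sketched as follows.
This is a simulation only: no real asynchronous IO is issued, the name
\texttt{replay\_bunches} and the \texttt{max\_in\_flight} limit are
hypothetical, and a queue stands in for the kernel's completion
notifications. It shows one thread submitting while a second thread frees
the resources the first one needs to continue.

```python
import queue
import threading


def replay_bunches(bunches, max_in_flight=4):
    """Simulate a submitter/reclaimer thread pair for one record file.

    Returns the IOs in the order the reclaimer saw them complete.
    """
    slots = threading.Semaphore(max_in_flight)  # free submission resources
    in_flight = queue.Queue()                   # stands in for completions
    reclaimed = []

    def submitter():
        for bunch in bunches:
            for io in bunch:
                slots.acquire()      # wait until a resource is free
                in_flight.put(io)    # "submit" the IO
        in_flight.put(None)          # sentinel: nothing more to submit

    def reclaimer():
        while True:
            io = in_flight.get()     # wait for a "completion"
            if io is None:
                break
            reclaimed.append(io)
            slots.release()          # hand the resource back

    threads = [threading.Thread(target=submitter),
               threading.Thread(target=reclaimer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return reclaimed


if __name__ == "__main__":
    print(replay_bunches([["io0", "io1"], ["io2"]]))
```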
171 The number of CPUs being used on the replay system can be different from
172 the number on the recorded system. To help with mappings here, the
173 \texttt{--cpus} option allows one to state how many CPUs on the replay
174 system to utilize. If the number of CPUs on the replay system is less than
175 on the recording system, we wrap CPU IDs. This \emph{may} result in an
176 overload of CPU processing capabilities on the replay system. (Refer to
177 section~\ref{sec:p-o-c} on page~\pageref{sec:p-o-c} for more details about the
178 \texttt{--cpus} option.)
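
One plausible reading of ``wrapping'' CPU IDs is a simple modulo mapping,
sketched below. The function name \texttt{map\_cpu} is hypothetical, and
the modulo rule itself is an assumption; the utility may wrap differently.

```python
def map_cpu(recorded_cpu: int, replay_cpus: int) -> int:
    """Map a CPU ID from the recording system onto the replay system.

    Assumes wrapping means taking the recorded ID modulo the number of
    CPUs made available for replay.
    """
    if replay_cpus < 1:
        raise ValueError("at least one replay CPU is required")
    if recorded_cpu < 0:
        raise ValueError("CPU IDs are non-negative")
    return recorded_cpu % replay_cpus


if __name__ == "__main__":
    # Four recorded CPUs replayed on a two-CPU system.
    for cpu in range(4):
        print(f"recorded CPU {cpu} -> replay CPU {map_cpu(cpu, 2)}")
```

Under this rule, recording on four CPUs and replaying on two places two
recorded streams on each replay CPU, which is the overload the text warns
about.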
180 \newpage\subsection{Known Deficiencies and Proposed Possible Fixes}
182 The known deficiencies of the current set of utilities are
183 outlined here; in some cases ideas on additions and/or improvements are
187 \item Lack of IO ordering across devices.
190 \emph{We could institute the notion of global time across threads,
191 and thus ensure IO ordering across devices, with some reduction in
195 \item Lack of IO timing accuracy -- additional time between IO bunches.
198 \emph{This is the primary problem with any IO replay mechanism -- how
199 to guarantee per-IO timing accuracy with respect to other replayed IOs?
200 One idea to reduce errors in this area would be to push the IO replay
201 into the kernel, where you \emph{may} receive more responsive timings.}
204 \item Bunching of IOs results in reduced time between IOs within a bunch.
207 \emph{The user has \emph{some} control over this (via the
208 \texttt{--max-pkts} option). One \emph{could} simply specify
209 \texttt{--max-pkts=1} and then each IO would be treated individually. Of
210 course, this would probably then run into the problem of excessive
214 \item 1-to-1 mapping of devices -- for now the devices on the replay
215 machine must be the same as on the recording machine.
218 \emph{It should be relatively trivial to add in the notion of
219 mapping -- simply include a file that is read which maps devices
220 on one machine to devices (with offsets and sizes) on the replay
221 machine\footnote{The notion of an offset and device size to replay on
222 could be used to both allow for a single device to masquerade as more
223 than one device, and could be utilized in case the replay device is
224 smaller than the recorded device.}.}
226 \medskip\emph{One could also add in the notion of CPU mappings --
227 device $D_{rec}$ managed by CPU $C_{rec}$ on the recorded system
228 shall be replayed on device $D_{rep}$ and CPU $C_{rep}$ on the
233 With version 0.9.1 we now support the \texttt{-M} option to do this
234 -- see section~\ref{sec:p-o-M} on page~\pageref{sec:p-o-M} for more
235 information on device mapping.
241 %---------------------
242 \newpage\section{\label{sec:command-line}Command Line Options}
243 \subsection{\texttt{btrecord} Command Line Options}
246 Usage: btrecord -- version 0.9.3
248 [ -d <dir> : --input-directory=<dir> ] Default: .
249 [ -D <dir> : --output-directory=<dir>] Default: .
250 [ -F : --find-traces ] Default: Off
251 [ -h : --help ] Default: Off
252 [ -m <nsec> : --max-bunch-time=<nsec> ] Default: 10 msec
253 [ -M <pkts> : --max-pkts=<pkts> ] Default: 8
254 [ -o <base> : --output-base=<base> ] Default: replay
255 [ -v : --verbose ] Default: Off
256 [ -V : --version ] Default: Off
257 <dev>... Default: None
259 \caption{\label{fig:btrecord--help}\texttt{btrecord --help} Output}
263 \subsubsection{\label{sec:c-o-d}\texttt{-d} or
264 \texttt{--input-directory}\\Set Input Directory}
266 The \texttt{-d} option requires a single parameter providing the directory
267 name for where input files are to be found. The default directory is the
268 current directory (\texttt{.}).
270 \subsubsection{\label{sec:c-o-D}\texttt{-D} or
271 \texttt{--output-directory}\\Set Output Directory}
273 The \texttt{-D} option requires a single parameter providing the directory
274 name for where output files are to be placed. The default directory is the
275 current directory (\texttt{.}).
277 \subsubsection{\texttt{-F} or \texttt{--find-traces}\\Find Trace Files
280 The \texttt{-F} option instructs \texttt{btrecord} to go find all the
281 trace files in the directory specified (either via the \texttt{-d}
282 option, or in the default directory \texttt{.}).
284 \subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
285 \subsubsection{\texttt{-V} or \texttt{--version}\\Display
286 \texttt{btrecord} Version}
288 The \texttt{-h} option displays the command line options and
289 defaults, as presented in figure~\ref{fig:btrecord--help} on
290 page~\pageref{fig:btrecord--help}.
292 The \texttt{-V} option displays the \texttt{btrecord} version, as shown here:
296 btrecord -- version 0.9.0
299 Both commands exit immediately after processing the option.
301 \subsubsection{\label{sec:c-o-m}\texttt{-m} or
302 \texttt{--max-bunch-time}\\Set Maximum Time Per Bunch}
304 The \texttt{-m} option requires a single parameter which specifies an
305 amount of time (in nanoseconds) to include in any one bunch of IOs that
306 are to be processed. The smaller the value, the smaller the number of
307 IOs processed at one time -- perhaps yielding a more realistic replay.
308 However, after a certain point the amount of overhead per bunch may result
309 in additional real replay time, thus yielding less accurate replay times.
311 The default value is 10,000,000 nanoseconds (10 milliseconds).
313 \subsubsection{\label{sec:c-o-M}\texttt{-M} or
314 \texttt{--max-pkts}\\Set Maximum Packets Per Bunch}
316 The \texttt{-M} option requires a single parameter which specifies the
317 maximum number of IOs to store in a single bunch. As with the \texttt{-m}
318 option (section~\ref{sec:c-o-m}), smaller values \emph{may} or \emph{may not}
319 yield more accurate replay times.
321 The default value is 8, and the maximum supported value is 512.
323 \subsubsection{\label{sec:c-o-o}\texttt{-o} or
324 \texttt{--output-base}\\Set Base Name for Output Files}
326 Each output file has 3 fields:
329 \item Device identifier (taken directly from the device name of the
330 \texttt{blktrace} output file).
332 \item \texttt{btrecord} base name -- by default ``replay''.
334 \item And the CPU number (again, taken directly from the
335 \texttt{blktrace} output file name).
338 This option requires a single parameter that will override the default name
339 (replay), and replace it with the specified value.
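
The three fields can be combined and split again as sketched below. The
dot-separated layout follows the \texttt{sdab.replay.3} file name used as
an example elsewhere in this guide; the helper names are hypothetical, and
the parser assumes the device name itself contains no dots.

```python
def record_file_name(device: str, cpu: int, base: str = "replay") -> str:
    """Build a record file name from device, base name and CPU number."""
    return f"{device}.{base}.{cpu}"


def parse_record_file_name(name: str):
    """Split a record file name into (device, base, cpu).

    The device is taken up to the first dot and the CPU number after the
    last dot, so a base name containing dots is still handled.
    """
    device, rest = name.split(".", 1)
    base, cpu = rest.rsplit(".", 1)
    return device, base, int(cpu)


if __name__ == "__main__":
    name = record_file_name("sdab", 3)
    print(name)
    print(parse_record_file_name(name))
```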
341 \subsubsection{\label{sec:c-o-v}\texttt{-v} or
342 \texttt{--verbose}\\Select Verbose Output}
344 This option will output some simple statistics at the end of a successful
345 run. Figure~\ref{fig:verb-out} (page~\pageref{fig:verb-out}) shows
346 an example of some output, while figure~\ref{fig:verb-defs}
347 (page~\pageref{fig:verb-defs}) shows what the fields mean.
351 sdab:0: 580661 pkts (tot), 126030 pkts (replay), 89809 bunches, 1.4 pkts/bunch
352 sdab:1: 2559775 pkts (tot), 430172 pkts (replay), 293029 bunches, 1.5 pkts/bunch
353 sdab:2: 653559 pkts (tot), 136522 pkts (replay), 102288 bunches, 1.3 pkts/bunch
354 sdab:3: 474773 pkts (tot), 117849 pkts (replay), 69572 bunches, 1.7 pkts/bunch
356 \caption{\label{fig:verb-out}Verbose Output Example}
362 \item[Field 1] The first field contains the device name and CPU
363 identifier. Thus: \texttt{sdab:0:} means the device \texttt{sdab} and
366 \item[Field 2] The second field contains the total number of packets
367 processed for each device file.
369 \item[Field 3] The next field shows the number of packets eligible for
372 \item[Field 4] The fourth field contains the total number of IO bunches.
374 \item[Field 5] The last field shows the average number of IOs per bunch
377 \caption{\label{fig:verb-defs}Verbose Field Definitions}
381 %---------------------
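
The verbose output of \texttt{btrecord} is regular enough to be
post-processed by a script. The sketch below parses one line of the form
shown in figure~\ref{fig:verb-out}. The regular expression and the helper
name \texttt{parse\_verbose\_line} are assumptions based solely on that
example output. In the example data the last field matches the number of
replay packets divided by the number of bunches, and the check at the end
relies on that observation.

```python
import re

# Pattern derived from the example output only; treat it as an assumption.
VERBOSE_LINE = re.compile(
    r"(?P<dev>\w+):(?P<cpu>\d+):\s+"
    r"(?P<total>\d+) pkts \(tot\),\s+"
    r"(?P<replay>\d+) pkts \(replay\),\s+"
    r"(?P<bunches>\d+) bunches,\s+"
    r"(?P<avg>\d+(?:\.\d+)?) pkts/bunch"
)


def parse_verbose_line(line: str) -> dict:
    """Turn one verbose output line into a dictionary of typed fields."""
    m = VERBOSE_LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    return {
        "device": m.group("dev"),
        "cpu": int(m.group("cpu")),
        "total_pkts": int(m.group("total")),
        "replay_pkts": int(m.group("replay")),
        "bunches": int(m.group("bunches")),
        "pkts_per_bunch": float(m.group("avg")),
    }


if __name__ == "__main__":
    line = ("sdab:0: 580661 pkts (tot), 126030 pkts (replay), "
            "89809 bunches, 1.4 pkts/bunch")
    rec = parse_verbose_line(line)
    computed = round(rec["replay_pkts"] / rec["bunches"], 1)
    print(rec["device"], rec["cpu"], computed == rec["pkts_per_bunch"])
```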
382 \newpage\subsection{\texttt{btreplay} Command Line Options}
385 Usage: btreplay -- version 0.9.3
387 [ -c <cpus> : --cpus=<cpus> ] Default: 1
388 [ -d <dir> : --input-directory=<dir> ] Default: .
389 [ -F : --find-records ] Default: Off
390 [ -h : --help ] Default: Off
391 [ -i <base> : --input-base=<base> ] Default: replay
392 [ -I <iters>: --iterations=<iters> ] Default: 1
393 [ -M <file> : --map-devs=<file> ] Default: None
394 [ -N : --no-stalls ] Default: Off
395 [ -v : --verbose ] Default: Off
396 [ -V : --version ] Default: Off
397 [ -W : --write-enable ] Default: Off
398 <dev...> Default: None
400 \caption{\label{fig:btreplay--help}\texttt{btreplay --help} Output}
404 \subsubsection{\label{sec:p-o-c}\texttt{-c} or
405 \texttt{--cpus}\\Set Number of CPUs to Use}
407 \subsubsection{\label{sec:p-o-d}\texttt{-d} or
408 \texttt{--input-directory}\\Set Input Directory}
410 The \texttt{-d} option requires a single parameter providing the directory
411 name for where input files are to be found. The default directory is the
412 current directory (\texttt{.}).
414 \subsubsection{\texttt{-F} or \texttt{--find-records}\\Find Record Files
417 The \texttt{-F} option instructs \texttt{btreplay} to go find all the
418 record files in the directory specified (either via the \texttt{-d}
419 option, or in the default directory \texttt{.}).
421 \subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
422 \subsubsection{\texttt{-V} or \texttt{--version}\\Display
423 \texttt{btreplay} Version}
425 The \texttt{-h} option displays the command line options and
426 defaults, as presented in figure~\ref{fig:btreplay--help} on
427 page~\pageref{fig:btreplay--help}.
429 The \texttt{-V} option displays the \texttt{btreplay} version, as shown here:
433 btreplay -- version 0.9.0
436 Both commands exit immediately after processing the option.
438 \subsubsection{\label{sec:p-o-i}\texttt{-i} or
439 \texttt{--input-base}\\Set Base Name for Input Files}
441 Each input file has 3 fields:
444 \item Device identifier (taken directly from the device name of the
445 \texttt{blktrace} output file).
447 \item \texttt{btrecord} base name -- by default ``replay''.
449 \item And the CPU number (again, taken directly from the
450 \texttt{blktrace} output file name).
453 This option requires a single parameter that will override the default name
454 (replay), and replace it with the specified value.
456 \subsubsection{\label{sec:p-o-I}\texttt{-I} or
457 \texttt{--iterations}\\Set Number of Iterations to Run}
459 This option requires a single parameter which specifies the number of times
460 to run through the input files. The default value is 1.
462 \subsubsection{\label{sec:p-o-M}\texttt{-M} or \texttt{--map-devs}\\
463 Specify Device Mappings}
465 This option requires a single parameter which specifies the name of a
466 file containing device mappings. The file format is very simple, with
467 just two pieces of data per line:
470 \item The device name on the recorded system (with the \texttt{'/dev/'}
471 removed). Example: \texttt{/dev/sda} would just be \texttt{sda}.
473 \item The device name on the replay system to use (again, without the
474 \texttt{'/dev/'} path prepended).
477 An example file for when one would map devices \texttt{/dev/sda} and
478 \texttt{/dev/sdb} on the recorded system to \texttt{/dev/sdg} and
479 \texttt{/dev/sdh} on the replay system would be:
486 The only entries allowed in the file are these two-element lines
487 -- we do not (yet?) support the notion of blank lines, or comment lines, or
490 The utility \emph{does} allow for multiple \texttt{-M} options to be
491 supplied on the command line.
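
A map file in this format can be checked before it is handed to
\texttt{btreplay}. The sketch below is a stand-alone validator, not code
from the utility: the name \texttt{parse\_device\_map} is hypothetical, and
whitespace is assumed to be the separator between the two names. It
rejects blank lines and any line that does not hold exactly two fields, in
line with the restrictions described above.

```python
def parse_device_map(text: str) -> dict:
    """Parse device-mapping text into {recorded_device: replay_device}.

    Each line must hold exactly two whitespace-separated device names,
    given without the "/dev/" prefix. Blank lines and comments are errors.
    """
    mapping = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(
                f"line {lineno}: expected 2 fields, got {len(fields)}")
        recorded, replay = fields
        for name in (recorded, replay):
            if name.startswith("/dev/"):
                raise ValueError(
                    f"line {lineno}: drop the /dev/ prefix from {name!r}")
        mapping[recorded] = replay
    return mapping


if __name__ == "__main__":
    # Maps sda and sdb on the recorded system to sdg and sdh for replay.
    print(parse_device_map("sda sdg\nsdb sdh\n"))
```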
493 \subsubsection{\label{sec:o-N}\texttt{-N} or \texttt{--no-stalls}\\Disable
496 When specified on the command line, all pre-bunch stall indicators will be
497 ignored. IOs will be replayed without inter-bunch delays.
499 \subsubsection{\label{sec:p-o-v}\texttt{-v} or
500 \texttt{--verbose}\\Select Verbose Output}
502 When specified on the command line, this option instructs \texttt{btreplay}
503 to store information concerning each \emph{stall} and IO operation
504 performed by \texttt{btreplay}. The name of each file so created will be
505 the input file name used with an extension of \texttt{.rep} appended onto
506 it. Thus, an input file of the name \texttt{sdab.replay.3} would generate a
507 verbose output file with the name \texttt{sdab.replay.3.rep} in the
508 directory specified for input files.
510 In addition, \texttt{btreplay} will also output to \texttt{stderr} the
511 names of the input files being processed.
513 \subsubsection{\label{sec:p-o-W}\texttt{-W} or
514 \texttt{--write-enable}\\Enable Writing During Replay}
516 As a precautionary measure, by default \texttt{btreplay} will \emph{not}
517 process \emph{write} requests. In order to enable \texttt{btreplay} to
518 actually \emph{write} to devices one must explicitly specify the