%
% Copyright (C) 2007 Alan D. Brunelle <Alan.Brunelle@hp.com>
%
%  This program is free software; you can redistribute it and/or modify
%  it under the terms of the GNU General Public License as published by
%  the Free Software Foundation; either version 2 of the License, or
%  (at your option) any later version.
%
%  This program is distributed in the hope that it will be useful,
%  but WITHOUT ANY WARRANTY; without even the implied warranty of
%  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
%  GNU General Public License for more details.
%
%  You should have received a copy of the GNU General Public License
%  along with this program; if not, write to the Free Software
%  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
%
%  vi :set textwidth=75
%
\documentclass{article}
\usepackage{multirow,graphicx,placeins}

\begin{document}
%---------------------
\title{\texttt{btrecord} and \texttt{btreplay} User Guide}
\author{Alan D. Brunelle (Alan.Brunelle@hp.com)}
\date{\today}
\maketitle
\begin{abstract}
\input{abstract.tex}
\end{abstract}
\thispagestyle{empty}\newpage
%---------------------
\tableofcontents\thispagestyle{empty}\newpage
%---------------------
\section{Introduction}
\input{abstract.tex}

\bigskip
This document presents a command line overview of
\texttt{btrecord} and \texttt{btreplay}, and shows some common
uses of them in everyday work at OSLO's Scalability and
Performance Group.

\subsection*{Build Note}

To build these tools, one needs to
place the source directory next to a valid
\texttt{blktrace}\footnote{\texttt{git://git.kernel.dk/blktrace.git}}
directory, as the \texttt{Makefile} refers to \texttt{../blktrace}.


%---------------------
\newpage\section{\texttt{btrecord} and \texttt{btreplay} Operating Model}

The \texttt{blktrace} utility provides the ability to collect detailed
traces from the kernel for each IO processed by the block IO layer. The
traces provide a complete timeline for each IO processed, including
detailed information concerning when an IO was first received by the block
IO layer -- indicating the device, CPU number, time stamp, IO direction,
sector number and IO size (number of sectors). Using this information,
one is able to \emph{replay} the IO again on the same machine or on another
setup entirely.

\subsection{Basic Workflow}
The basic operating workflow to replay IOs would be something like:

\begin{enumerate}
  \item Run \texttt{blktrace} to collect traces. Here you specify the
  device or devices that you wish to trace and later replay IOs upon. Note:
  the only traces you are interested in are \emph{QUEUE} requests --
  thus, to save system resources (including storage for traces), one could
  specify the \texttt{-a queue} command line option to \texttt{blktrace}.

  \item While \texttt{blktrace} is running, you run the workload that you
  are interested in.

  \item When the workload has completed, you stop the \texttt{blktrace}
  utility (thus saving all traces over the complete workload).

  \item You extract the pertinent IO information from the traces saved by
  \texttt{blktrace} using the \texttt{btrecord} utility. This will parse
  each trace file created by \texttt{blktrace}, and craft IO descriptions
  to be used in the next phase of the workload processing.

  \item Once \texttt{btrecord} has successfully created a series of data
  files to be processed, you can run the \texttt{btreplay} utility which
  attempts to generate the same IOs seen during the sample workload phase.
\end{enumerate}
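
The steps above can be sketched as the following shell session. This is a
minimal sketch rather than a prescribed procedure: the device
\texttt{/dev/sda} and the directory names \texttt{traces} and
\texttt{records} are assumptions chosen for illustration, and the
\texttt{blktrace} options \texttt{-d} (device) and \texttt{-D} (output
directory) belong to \texttt{blktrace} itself and are not described in this
guide. Tracing typically requires root privileges.

\begin{verbatim}
# Hypothetical directories for trace files and record files.
$ mkdir -p traces records

# Steps 1-3: trace only QUEUE events on /dev/sda while the workload
# runs; stop blktrace (e.g. with Ctrl-C) once the workload is done.
$ blktrace -d /dev/sda -a queue -D traces

# Step 4: turn the per-CPU trace files into record files.
$ btrecord -d traces -D records sda

# Step 5: replay the recorded IOs (writes are skipped unless -W is given).
$ btreplay -d records sda
\end{verbatim}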

\subsection{IO Stream Replay Characteristics}
  The major characteristics of the IO stream that are kept intact include:

  \begin{description}
    \item[Device] The IOs are replayed on the same device as was seen
    during the sample workload.

    \item[IO direction] The same IO direction (read/write) is maintained.

    \item[IO offset] The same device offset is maintained.

    \item[IO size] The same number of sectors is transferred.

    \item[Time differential] The time stamps stored during the
    \texttt{blktrace} run are used to determine the amount of time between
    IOs during the sample workload. \texttt{btreplay} \emph{attempts} to
    maintain the same time differential between IOs, but no guarantees as
    to complete accuracy are provided by the utility.

    \item[Device IO Stream Ordering] All IOs on a device are submitted in
    the precise order they were seen during the sample workload run.
  \end{description}

  As noted above, the time between IOs may not be accurately maintained
  during replays. In addition, the actual ordering of IOs \emph{between}
  devices is not necessarily maintained. (Each device with an IO stream
  maintains its own concept of time, and thus there may be slippage of the
  time kept between managing threads.)

  \begin{quotation}
    We have prototyped a different approach, wherein a single managing
    thread handles all IOs across all devices. This approach, while
    guaranteeing correct ordering of IOs across all devices, resulted in
    much worse timing on a per IO basis.
  \end{quotation}

\subsection{\texttt{btrecord/btreplay} Method of Operation}

As noted above, \texttt{btrecord} extracts \texttt{QUEUE} operations from
\texttt{blktrace} output. These \texttt{QUEUE} operations indicate the
entrance of IOs into the block IO layer. In order to replay these IOs with
some accuracy with regard to ordering and timeliness, we decided to take
multiple sequential (in time) IOs and put them in a single \emph{bunch} of
IOs that will be processed as a single \emph{asynchronous IO} call to the
kernel\footnote{Attempts to do them individually resulted in too large a
turnaround time penalty (user-space to kernel and back). Note that in a
number of workloads, the IOs are coming in from the page cache handling
code, and thus are submitted to the block IO layer with \emph{very small}
time intervals between issues.}. To manage the size of the \emph{bunches},
the \texttt{btrecord} utility provides you with two controlling knobs:

\begin{description}
  \item[\texttt{--max-bunch-time}] This is the amount of time to encompass
  in one bunch -- only IOs within the time specified are eligible
  for \emph{bunching.} The default time is 10 milliseconds (10,000,000
  nanoseconds). Refer to section~\ref{sec:c-o-m} on page~\pageref{sec:c-o-m}
  for more information.

  \item[\texttt{--max-pkts}] A \emph{bunch} can be anywhere from
  1 to 512 packets in size, and by default a bunch contains no
  more than 8 individual IOs. With this option, one can increase or
  decrease the maximum \emph{bunch} size.  Refer to section~\ref{sec:c-o-M}
  on page~\pageref{sec:c-o-M} for more information.
\end{description}
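
As an illustration of the two knobs, the following invocation (a sketch;
the device name \texttt{sda} and the chosen values are assumptions) halves
the bunch time window to 5 milliseconds (5,000,000 nanoseconds) and doubles
the bunch size limit to 16 IOs:

\begin{verbatim}
$ btrecord --max-bunch-time=5000000 --max-pkts=16 sda
\end{verbatim}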

Each input data file (one per device per CPU) results in a new record
data file (again, one per device per CPU) which contains information
about \emph{bunches} of IOs to be replayed. \texttt{btreplay} operates on
these record data files by spawning a new pair of threads per file. One
thread manages the submitting of AIOs per bunch in the record data file,
while the other thread manages reclaiming AIOs completed\footnote{We
have found that having the same thread do both results in a further
reduction in replay timing accuracy.}.

Each submitting thread simply reads the input file of \emph{bunches}
recorded by \texttt{btrecord}, and attempts to faithfully reproduce the
ordering and timing of IOs seen during the sample workload. The reclaiming
thread simply waits for AIO completions, freeing up resources for the
submitting thread to utilize to submit new AIOs.

The number of CPUs being used on the replay system can be different from
the number on the recorded system. To help with mappings here, the
\texttt{--cpus} option allows one to state how many CPUs on the replay
system to utilize. If the number of CPUs on the replay system is less than
on the recording system, we wrap CPU IDs. This \emph{may} result in an
overload of CPU processing capabilities on the replay system. (Refer to
section~\ref{sec:p-o-c} on page~\pageref{sec:p-o-c} for more details about the
\texttt{--cpus} option.)
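
As a worked example, assume that ``wrapping'' means taking the recorded CPU
ID modulo the number of replay CPUs (an assumption; this guide does not
spell out the exact rule). A workload recorded on 8 CPUs and replayed with
\texttt{--cpus=4} would then map recorded CPU $C_{rec}$ to replay CPU
\[
  C_{rep} = C_{rec} \bmod 4 ,
\]
so that recorded CPUs 1 and 5 would both be handled on replay CPU 1. A
corresponding invocation (device name \texttt{sda} assumed) might look like:

\begin{verbatim}
$ btreplay --cpus=4 sda
\end{verbatim}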

\newpage\subsection{Known Deficiencies and Proposed Possible Fixes}

The overall known deficiencies with this current set of utilities are
outlined here; in some cases ideas on additions and/or improvements are
included as well.

\begin{enumerate}
  \item Lack of IO ordering across devices.

  \begin{quote}
    \emph{We could institute the notion of global time across threads,
    and thus ensure IO ordering across devices, with some reduction in
    timing accuracy.}
  \end{quote}

  \item Lack of IO timing accuracy -- additional time between IO bunches.

  \begin{quote}
    \emph{This is the primary problem with any IO replay mechanism -- how
    to guarantee per-IO timing accuracy with respect to other replayed IOs?
    One idea to reduce errors in this area would be to push the IO replay
    into the kernel, where you \emph{may} receive more responsive timings.}
  \end{quote}

  \item Bunching of IOs results in reduced time amongst IOs within a bunch.

  \begin{quote}
    \emph{The user has \emph{some} control over this (via the
    \texttt{--max-pkts} option). One \emph{could} simply specify
    \texttt{--max-pkts=1} and then each IO would be treated individually. Of
    course, this would probably then run into the problem of excessive
    inter-IO times.}
  \end{quote}

  \item 1-to-1 mapping of devices -- for now the devices on the replay
  machine must be the same as on the recording machine.

  \begin{quote}
    \emph{It should be relatively trivial to add in the notion of
    mapping -- simply include a file that is read which maps devices
    on one machine to devices (with offsets and sizes) on the replay
    machine\footnote{The notion of an offset and device size to replay on
    could be used to both allow for a single device to masquerade as more
    than one device, and could be utilized in case the replay device is
    smaller than the recorded device.}.}

    \medskip\emph{One could also add in the notion of CPU mappings --
    device $D_{rec}$ managed by CPU $C_{rec}$ on the recorded system
    shall be replayed on device $D_{rep}$ and CPU $C_{rep}$ on the
    replay machine.}

    \bigskip
    \begin{quote}
      With version 0.9.1 we now support the \texttt{-M} option to do this
      -- see section~\ref{sec:p-o-M} on page~\pageref{sec:p-o-M} for more
      information on device mapping.
    \end{quote}
  \end{quote}

\end{enumerate}

%---------------------
\newpage\section{\label{sec:command-line}Command Line Options}
\subsection{\texttt{btrecord} Command Line Options}
\begin{figure}[h!]
\begin{verbatim}
Usage: btrecord -- version 0.9.3

        [ -d <dir>  : --input-directory=<dir> ] Default: .
        [ -D <dir>  : --output-directory=<dir>] Default: .
        [ -F        : --find-traces           ] Default: Off
        [ -h        : --help                  ] Default: Off
        [ -m <nsec> : --max-bunch-time=<nsec> ] Default: 10 msec
        [ -M <pkts> : --max-pkts=<pkts>       ] Default: 8
        [ -o <base> : --output-base=<base>    ] Default: replay
        [ -v        : --verbose               ] Default: Off
        [ -V        : --version               ] Default: Off
        <dev>...                                Default: None
\end{verbatim}
\caption{\label{fig:btrecord--help}\texttt{btrecord --help} Output}
\end{figure}
\FloatBarrier

\subsubsection{\label{sec:c-o-d}\texttt{-d} or
\texttt{--input-directory}\\Set Input Directory}

The \texttt{-d} option requires a single parameter providing the directory
name for where input files are to be found. The default directory is the
current directory (\texttt{.}).

\subsubsection{\label{sec:c-o-D}\texttt{-D} or
\texttt{--output-directory}\\Set Output Directory}

The \texttt{-D} option requires a single parameter providing the directory
name for where output files are to be placed. The default directory is the
current directory (\texttt{.}).

\subsubsection{\texttt{-F} or \texttt{--find-traces}\\Find Trace Files
Automatically}

The \texttt{-F} option instructs \texttt{btrecord} to go find all the
trace files in the directory specified (either via the \texttt{-d}
option, or in the default directory \texttt{.}).
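
For instance, to convert every trace file found under a (hypothetical)
\texttt{traces} directory and place the results in a (hypothetical)
\texttt{records} directory, one might run:

\begin{verbatim}
$ btrecord -F -d traces -D records
\end{verbatim}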

\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
\subsubsection{\texttt{-V} or \texttt{--version}\\Display
\texttt{btrecord} Version}

The \texttt{-h} option displays the command line options and
defaults, as presented in figure~\ref{fig:btrecord--help} on
page~\pageref{fig:btrecord--help}.

The \texttt{-V} option displays the \texttt{btrecord} version, as shown here:

\begin{verbatim}
$ btrecord --version
btrecord -- version 0.9.0
\end{verbatim}

Both commands exit immediately after processing the option.

\subsubsection{\label{sec:c-o-m}\texttt{-m} or
\texttt{--max-bunch-time}\\Set Maximum Time Per Bunch}

The \texttt{-m} option requires a single parameter which specifies an
amount of time (in nanoseconds) to include in any one bunch of IOs that
are to be processed. The smaller the value, the smaller the number of
IOs processed at one time -- perhaps yielding a more realistic replay.
However, after a certain point the amount of overhead per bunch may result
in additional real replay time, thus yielding less accurate replay times.

The default value is 10,000,000 nanoseconds (10 milliseconds).

\subsubsection{\label{sec:c-o-M}\texttt{-M} or
\texttt{--max-pkts}\\Set Maximum Packets Per Bunch}

The \texttt{-M} option requires a single parameter which specifies the
maximum number of IOs to store in a single bunch. As with the \texttt{-m}
option (section~\ref{sec:c-o-m}), smaller values \emph{may} or \emph{may not}
yield more accurate replay times.

The default value is 8, and values up to a maximum of 512 are supported.

\subsubsection{\label{sec:c-o-o}\texttt{-o} or
\texttt{--output-base}\\Set Base Name for Output Files}

Each output file has 3 fields:

\begin{enumerate}
  \item Device identifier (taken directly from the device name of the
  \texttt{blktrace} output file).

  \item \texttt{btrecord} base name -- by default ``replay''.

  \item And the CPU number (again, taken directly from the
  \texttt{blktrace} output file name).
\end{enumerate}

This option requires a single parameter that will override the default name
(replay), and replace it with the specified value.
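
As a sketch of the resulting names, assume the three fields are joined with
periods in the order listed above (consistent with the record file name
\texttt{sdab.replay.3} used in the \texttt{btreplay} \texttt{--verbose}
description), and assume traces for a device \texttt{sdab} taken on a
four-CPU system. The base name \texttt{run1} is hypothetical.

\begin{verbatim}
$ btrecord -o run1 sdab
# record files one would expect under these assumptions:
#   sdab.run1.0  sdab.run1.1  sdab.run1.2  sdab.run1.3
\end{verbatim}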

\subsubsection{\label{sec:c-o-v}\texttt{-v} or
\texttt{--verbose}\\Select Verbose Output}

This option will output some simple statistics at the end of a successful
run. Figure~\ref{fig:verb-out} (page~\pageref{fig:verb-out}) shows
an example of some output, while figure~\ref{fig:verb-defs}
(page~\pageref{fig:verb-defs}) shows what the fields mean.

\begin{figure}[h!]
\begin{verbatim}
sdab:0: 580661 pkts (tot), 126030 pkts (replay), 89809 bunches, 1.4 pkts/bunch
sdab:1: 2559775 pkts (tot), 430172 pkts (replay), 293029 bunches, 1.5 pkts/bunch
sdab:2: 653559 pkts (tot), 136522 pkts (replay), 102288 bunches, 1.3 pkts/bunch
sdab:3: 474773 pkts (tot), 117849 pkts (replay), 69572 bunches, 1.7 pkts/bunch
\end{verbatim}
\caption{\label{fig:verb-out}Verbose Output Example}
\end{figure}
\FloatBarrier

\begin{figure}[h!]
\begin{description}
  \item[Field 1] The first field contains the device name and CPU
  identifier. Thus: \texttt{sdab:0:} means the device \texttt{sdab} and
  traces on CPU 0.

  \item[Field 2] The second field contains the total number of packets
  processed for each device file.

  \item[Field 3] The next field shows the number of packets eligible for
  replay.

  \item[Field 4] The fourth field contains the total number of IO bunches.

  \item[Field 5] The last field shows the average number of IOs per bunch
  recorded.
\end{description}
\caption{\label{fig:verb-defs}Verbose Field Definitions}
\end{figure}
\FloatBarrier
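
The sample output in figure~\ref{fig:verb-out} is consistent with the last
field being the quotient of the third and fourth fields, rounded to one
decimal place. For the first line (\texttt{sdab:0:}), for example:
\[
  \frac{126030\ \mbox{pkts (replay)}}{89809\ \mbox{bunches}}
  \approx 1.403 \approx 1.4\ \mbox{pkts/bunch}.
\]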

%---------------------
\newpage\subsection{\texttt{btreplay} Command Line Options}
\begin{figure}[h!]
\begin{verbatim}
Usage: btreplay -- version 0.9.3

        [ -c <cpus> : --cpus=<cpus>           ] Default: 1
        [ -d <dir>  : --input-directory=<dir> ] Default: .
        [ -F        : --find-records           ] Default: Off
        [ -h        : --help                  ] Default: Off
        [ -i <base> : --input-base=<base>     ] Default: replay
        [ -I <iters>: --iterations=<iters>    ] Default: 1
        [ -M <file> : --map-devs=<file>       ] Default: None
        [ -N        : --no-stalls             ] Default: Off
        [ -v        : --verbose               ] Default: Off
        [ -V        : --version               ] Default: Off
        [ -W        : --write-enable          ] Default: Off
        <dev...>                                Default: None
\end{verbatim}
\caption{\label{fig:btreplay--help}\texttt{btreplay --help} Output}
\end{figure}
\FloatBarrier

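\subsubsection{\label{sec:p-o-c}\texttt{-c} or
\texttt{--cpus}\\Set Number of CPUs to Use}

The \texttt{-c} option requires a single parameter which specifies the
number of CPUs on the replay system to utilize. If this is less than the
number of CPUs on the recording system, CPU IDs are wrapped. The usage
message in figure~\ref{fig:btreplay--help} lists a default of 1.
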
\subsubsection{\label{sec:p-o-d}\texttt{-d} or
\texttt{--input-directory}\\Set Input Directory}

The \texttt{-d} option requires a single parameter providing the directory
name for where input files are to be found. The default directory is the
current directory (\texttt{.}).

\subsubsection{\texttt{-F} or \texttt{--find-records}\\Find Record Files
Automatically}

The \texttt{-F} option instructs \texttt{btreplay} to go find all the
record files in the directory specified (either via the \texttt{-d}
option, or in the default directory \texttt{.}).

\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
\subsubsection{\texttt{-V} or \texttt{--version}\\Display
\texttt{btreplay} Version}

The \texttt{-h} option displays the command line options and
defaults, as presented in figure~\ref{fig:btreplay--help} on
page~\pageref{fig:btreplay--help}.

The \texttt{-V} option displays the \texttt{btreplay} version, as shown here:

\begin{verbatim}
$ btreplay --version
btreplay -- version 0.9.0
\end{verbatim}

Both commands exit immediately after processing the option.

\subsubsection{\label{sec:p-o-i}\texttt{-i} or
\texttt{--input-base}\\Set Base Name for Input Files}

Each input file has 3 fields:

\begin{enumerate}
  \item Device identifier (taken directly from the device name of the
  \texttt{blktrace} output file).

  \item \texttt{btrecord} base name -- by default ``replay''.

  \item And the CPU number (again, taken directly from the
  \texttt{blktrace} output file name).
\end{enumerate}

This option requires a single parameter that will override the default name
(replay), and replace it with the specified value.

\subsubsection{\label{sec:p-o-I}\texttt{-I} or
\texttt{--iterations}\\Set Number of Iterations to Run}

This option requires a single parameter which specifies the number of times
to run through the input files. The default value is 1.

\subsubsection{\label{sec:p-o-M}\texttt{-M} or \texttt{--map-devs}\\
Specify Device Mappings}

This option requires a single parameter which specifies the name of a
file containing device mappings. The file format is very simple, with
just two pieces of data per line:

\begin{enumerate}
  \item The device name on the recorded system (with the \texttt{'/dev/'}
  removed). Example: \texttt{/dev/sda} would just be \texttt{sda}.

  \item The device name on the replay system to use (again, without the
  \texttt{'/dev/'} path prepended).
\end{enumerate}

An example file for when one would map devices \texttt{/dev/sda} and
\texttt{/dev/sdb} on the recorded system to \texttt{/dev/sdg} and
\texttt{/dev/sdh} on the replay system would be:

\begin{verbatim}
sda sdg
sdb sdh
\end{verbatim}

The only entries in the file that are allowed are these two element lines
-- we do not (yet?) support the notion of blank lines, or comment lines, or
the like.

The utility \emph{does} allow for multiple \texttt{-M} options to be
supplied on the command line.
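
A possible session is sketched here. It assumes that the mapping shown
earlier in this section has been saved in a file named \texttt{devmap} (a
hypothetical name), that record files for \texttt{sda} and \texttt{sdb} are
in the current directory, and that the devices named on the command line
are those of the recorded system (an assumption, not something this guide
states).

\begin{verbatim}
$ cat devmap
sda sdg
sdb sdh
$ btreplay -M devmap sda sdb
\end{verbatim}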

\subsubsection{\label{sec:o-N}\texttt{-N} or \texttt{--no-stalls}\\Disable
Pre-bunch Stalls}

When specified on the command line, all pre-bunch stall indicators will be
ignored. IOs will be replayed without inter-bunch delays.

\subsubsection{\label{sec:p-o-v}\texttt{-v} or
\texttt{--verbose}\\Select Verbose Output}

When specified on the command line, this option instructs \texttt{btreplay}
to store information concerning each \emph{stall} and IO operation
performed by \texttt{btreplay}. The name of each file so created will be
the input file name used with an extension of \texttt{.rep} appended onto
it. Thus, an input file of the name \texttt{sdab.replay.3} would generate a
verbose output file with the name \texttt{sdab.replay.3.rep} in the
directory specified for input files.

In addition, \texttt{btreplay} will also output to \texttt{stderr} the
names of the input files being processed.
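
For example, with record files for device \texttt{sdab} kept in a
(hypothetical) \texttt{records} directory, a verbose replay might be run as
follows:

\begin{verbatim}
$ btreplay -v -d records sdab
# per the naming rule above, records/sdab.replay.3 would gain a
# companion file records/sdab.replay.3.rep
\end{verbatim}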

\subsubsection{\label{sec:p-o-W}\texttt{-W} or
\texttt{--write-enable}\\Enable Writing During Replay}

As a precautionary measure, by default \texttt{btreplay} will \emph{not}
process \emph{write} requests. In order to enable \texttt{btreplay} to
actually \emph{write} to devices one must explicitly specify the
\texttt{-W} option.
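
A sketch of a write-enabled replay is shown here; the \texttt{records}
directory and the device name \texttt{sda} are assumptions. Because
replayed writes go to the device itself, this should only be run against a
device whose contents may be overwritten.

\begin{verbatim}
$ btreplay -W -d records sda
\end{verbatim}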

\end{document}