author: Alan D. Brunelle <Alan.Brunelle@hp.com> 2007-10-02 12:35:07 -0400
committer: Jens Axboe <jens.axboe@oracle.com> 2007-10-02 19:51:18 +0200
commit: d47a3fec3f4bbcf6b0c6ef757a4eb449dd81d10a
tree: c23a7df0ca04c624a5d291d5ab4a96ff29a5a015 /btreplay/doc/btreplay.tex
parent: 4f93192893f41acfe1ded673c1111142b8f4cddd
Add btrecord/btreplay capability
These facilities allow one to attempt to replay a stream of IOs captured with blktrace. The general workflow is:

1. Initiate blktrace to capture traces
2. Do whatever to generate initial IO stream...
3. Stop blktrace
4. Run btrecord to convert traces into IO records
5. Run btreplay to replay IOs

The IO stream characteristics during replay will try to respect the following characteristics of the original IO stream:

1. The IOs will target the same device(s) as originally seen. [One can alter this behavior by specifying the -M option to btreplay, which allows one to remap IOs slated to one set of devices to a specified other set of devices.]
2. IO direction: the IOs will follow the same read/write (from-device/to-device) characteristics of the originating flow. [Note: By default replay will /not/ do writes; one must specify the -W option to do this. This is a meager attempt to stop someone from shooting themselves in the foot (with a very large-caliber weapon).]
3. IO offset & size are maintained.
4. CPU: IOs are submitted on the originating CPU whenever possible. [Note: Since we are using asynchronous IO, IOs may be routed to another CPU prior to being processed by the block IO layer.]

In order to try and replicate inter-IO timing as much as possible, btrecord will combine IOs "close in time" into one set, or bunch, of IOs. Then btreplay will replay all the IOs of a bunch in one go (via asynchronous direct IO - io_submit). The size of the bunches is configurable via the -m flag to btrecord (which specifies a time-based bunch size) and/or the -M flag (which specifies the maximum number of IOs to put into a bunch). At the low end, specifying '-M 1' instructs btrecord to act like fio - replay each IO as an individual unit.

Besides the potential to remap devices (utilizing the -M option to btreplay, as noted above), one can also limit the number of CPUs on the replay machine - so if you have fewer CPUs on the replay machine you specify the -c option to btreplay.
Lastly, one can specify the -N option to btreplay to instruct it to ignore inter-IO (inter-bunch of IOs) timings. Thus, this instructs btreplay to replay the bunches as fast as possible, ignoring the original delays between original IOs.

The utilities include a write-up in the docs directory.

Signed-off-by: Alan D. Brunelle <Alan.Brunelle@hp.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
Diffstat (limited to 'btreplay/doc/btreplay.tex')
-rw-r--r-- btreplay/doc/btreplay.tex 521
1 files changed, 521 insertions, 0 deletions
diff --git a/btreplay/doc/btreplay.tex b/btreplay/doc/btreplay.tex
new file mode 100644
index 0000000..beec720
--- /dev/null
+++ b/btreplay/doc/btreplay.tex
@@ -0,0 +1,521 @@
+%
+% Copyright (C) 2007 Alan D. Brunelle <Alan.Brunelle@hp.com>
+%
+% This program is free software; you can redistribute it and/or modify
+% it under the terms of the GNU General Public License as published by
+% the Free Software Foundation; either version 2 of the License, or
+% (at your option) any later version.
+%
+% This program is distributed in the hope that it will be useful,
+% but WITHOUT ANY WARRANTY; without even the implied warranty of
+% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+% GNU General Public License for more details.
+%
+% You should have received a copy of the GNU General Public License
+% along with this program; if not, write to the Free Software
+% Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+%
+% vi :set textwidth=75
+%
+\documentclass{article}
+\usepackage{multirow,graphicx,placeins}
+
+\begin{document}
+%---------------------
+\title{\texttt{btrecord} and \texttt{btreplay} User Guide}
+\author{Alan D. Brunelle (Alan.Brunelle@hp.com)}
+\date{\today}
+\maketitle
+\begin{abstract}
+\input{abstract.tex}
+\end{abstract}
+\thispagestyle{empty}\newpage
+%---------------------
+\tableofcontents\thispagestyle{empty}\newpage
+%---------------------
+\section{Introduction}
+\input{abstract.tex}
+
+\bigskip
+This document presents the command line overview for
+\texttt{btrecord} and \texttt{btreplay}, and shows some commonly used
+examples of their usage in everyday work here at OSLO's Scalability and
+Performance Group.
+
+\subsection*{Build Note}
+
+To build these tools, one needs to
+place the source directory next to a valid
+\texttt{blktrace}\footnote{\texttt{git://git.kernel.dk/blktrace.git}}
+directory, as it includes \texttt{../blktrace} in the \texttt{Makefile}.
+
+
+%---------------------
+\newpage\section{\texttt{btrecord} and \texttt{btreplay} Operating Model}
+
+The \texttt{blktrace} utility provides the ability to collect detailed
+traces from the kernel for each IO processed by the block IO layer. The
+traces provide a complete timeline for each IO processed, including
+detailed information concerning when an IO was first received by the block
+IO layer -- indicating the device, CPU number, time stamp, IO direction,
+sector number and IO size (number of sectors). Using this information,
+one is able to \emph{replay} the IO again on the same machine or on
+another setup entirely.
+
+\subsection{Basic Workflow}
+The basic operating work-flow to replay IOs would be something like:
+
+\begin{enumerate}
+ \item Run \texttt{blktrace} to collect traces. Here you specify the
+ device or devices that you wish to trace and later replay IOs upon. Note:
+ the only traces you are interested in are \emph{QUEUE} requests --
+ thus, to save system resources (including storage for traces), one could
+ specify the \texttt{-a queue} command line option to \texttt{blktrace}.
+
+ \item While \texttt{blktrace} is running, you run the workload that you
+ are interested in.
+
+ \item When the workload has completed, you stop the \texttt{blktrace}
+ utility (thus saving all traces over the complete workload).
+
+ \item You extract the pertinent IO information from the traces saved by
+ \texttt{blktrace} using the \texttt{btrecord} utility. This will parse
+ each trace file created by \texttt{blktrace}, and craft IO descriptions
+ to be used in the next phase of the workload processing.
+
+ \item Once \texttt{btrecord} has successfully created a series of data
+ files to be processed, you can run the \texttt{btreplay} utility which
+ attempts to generate the same IOs seen during the sample workload phase.
+\end{enumerate}
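+
+As a rough sketch of the steps above, a session might look like the
+following. The device \texttt{/dev/sdb} is an assumption made only for
+illustration, and the device argument given to \texttt{btrecord} and
+\texttt{btreplay} is assumed to be the device name without the
+\texttt{/dev/} prefix:
+
+\begin{verbatim}
+# blktrace -d /dev/sdb -a queue
+    (run the workload elsewhere, then interrupt blktrace)
+# btrecord sdb
+# btreplay sdb
+\end{verbatim}
+
+Because \texttt{btreplay} does not issue writes unless \texttt{-W} is
+given, a replay run like this one only reproduces the read IOs.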
+
+\subsection{IO Stream Replay Characteristics}
+ The major characteristics of the IO stream that are kept intact include:
+
+ \begin{description}
+ \item[Device] The IOs are replayed on the same device as was seen
+ during the sample workload.
+
+ \item[IO direction] The same IO direction (read/write) is maintained.
+
+ \item[IO offset] The same device offset is maintained.
+
+ \item[IO size] The same number of sectors is transferred.
+
+ \item[Time differential] The time stamps stored during the
+ \texttt{blktrace} run are used to determine the amount of time between
+ IOs during the sample workload. \texttt{btreplay} \emph{attempts} to
+ maintain the same time differential between IOs, but no guarantees as
+ to complete accuracy are provided by the utility.
+
+ \item[Device IO Stream Ordering] All IOs on a device are submitted in
+ the precise order they were seen during the sample workload run.
+ \end{description}
+
+ As noted above, the time between IOs may not be accurately maintained
+ during replays. In addition the actual ordering of IOs \emph{between}
+ devices is not necessarily maintained. (Each device with an IO stream
+ maintains its own concept of time, and thus there may be slippage of the
+ time kept between managing threads.)
+
+ \begin{quotation}
+ We have prototyped a different approach, wherein a single managing
+ thread handles all IOs across all devices. This approach, while
+ guaranteeing correct ordering of IOs across all devices, resulted in
+ much worse timing on a per IO basis.
+ \end{quotation}
+
+\subsection{\texttt{btrecord/btreplay} Method of Operation}
+
+As noted above, \texttt{btrecord} extracts \texttt{QUEUE} operations from
+\texttt{blktrace} output. These \texttt{QUEUE} operations indicate the
+entrance of IOs into the block IO layer. In order to replay these IOs with
+some accuracy with regard to ordering and timeliness, we decided to take
+multiple sequential (in time) IOs and put them in a single \emph{bunch} of
+IOs that will be processed as a single \emph{asynchronous IO} call to the
+kernel\footnote{Attempts to do them individually resulted in too large of a
+turnaround time penalty (user-space to kernel and back). Note that in a
+number of workloads, the IOs are coming in from the page cache handling
+code, and thus are submitted to the block IO layer with \emph{very small}
+time intervals between issues.}. To manage the size of the \emph{bunches},
+the \texttt{btrecord} utility provides you with two controlling knobs:
+
+\begin{description}
+ \item[\texttt{--max-bunch-time}] This is the amount of time to encompass
+ in one bunch -- only IOs within the time specified are eligible
+ for \emph{bunching.} The default time is 10 milliseconds (10,000,000
+ nanoseconds). Refer to section~\ref{sec:c-o-m} on page~\pageref{sec:c-o-m}
+ for more information.
+
+ \item[\texttt{--max-pkts}] A \emph{bunch} size can be anywhere from
+ 1 to 512 packets in size, and by default we cap a bunch at no
+ more than 8 individual IOs. With this option, one can increase or
+ decrease the maximum \emph{bunch} size. Refer to section~\ref{sec:c-o-M}
+ on page~\pageref{sec:c-o-M} for more information.
+\end{description}
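+
+To illustrate how the two knobs interact, consider six IOs queued at 0,
+2, 4, 11, 12 and 30 milliseconds. The grouping below is a sketch only: it
+assumes a bunch is closed as soon as the next IO falls outside the time
+window measured from the first IO of the bunch, or the bunch already
+holds the maximum number of packets. The exact rule used by
+\texttt{btrecord} may differ in detail.
+
+\begin{verbatim}
+--max-bunch-time=10000000 --max-pkts=8  ->  {0,2,4} {11,12} {30}
+--max-bunch-time=10000000 --max-pkts=2  ->  {0,2} {4,11} {12} {30}
+\end{verbatim}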
+
+Each input data file (one per device per CPU) results in a new record
+data file (again, one per device per CPU) which contains information
+about \emph{bunches} of IOs to be replayed. \texttt{btreplay} operates on
+these record data files by spawning a new pair of threads per file. One
+thread manages the submitting of AIOs per bunch in the record data file,
+while the other thread manages reclaiming AIOs completed\footnote{We
+have found that having the same thread do both results in a further
+reduction in replay timing accuracy.}.
+
+Each submitting thread simply reads the input file of \emph{bunches}
+recorded by \texttt{btrecord}, and attempts to faithfully reproduce the
+ordering and timing of IOs seen during the sample workload. The reclaiming
+thread simply waits for AIO completions, freeing up resources for the
+submitting thread to utilize to submit new AIOs.
+
+The number of CPUs being used on the replay system can be different from
+the number on the recorded system. To help with mappings here the
+\texttt{--cpus} option allows one to state how many CPUs on the replay
+system to utilize. If the number of CPUs on the replay system is less than
+on the recording system, we wrap CPU IDs. This \emph{may} result in an
+overload of CPU processing capabilities on the replay system. (Refer to
+section~\ref{sec:p-o-c} on page~\pageref{sec:p-o-c} for more details about the
+\texttt{--cpus} option.)
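+
+One way to picture the wrapping (an assumption here; the text above only
+states that CPU IDs wrap) is as a modulo operation, where $C_{rec}$ is
+the CPU ID on the recording system, $C_{rep}$ is the CPU ID used on the
+replay system and $N_{rep}$ is the value given to \texttt{--cpus}:
+
+\[
+ C_{rep} = C_{rec} \bmod N_{rep}
+\]
+
+Under that reading, traces recorded on CPUs 0 through 3 and replayed with
+\texttt{--cpus=2} would be handled on CPUs 0, 1, 0 and 1 respectively, so
+each replay CPU carries the load of two recorded CPUs.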
+
+\newpage\subsection{Known Deficiencies and Proposed Possible Fixes}
+
+The overall known deficiencies with this current set of utilities are
+outlined here; in some cases ideas on additions and/or improvements are
+included as well.
+
+\begin{enumerate}
+ \item Lack of IO ordering across devices.
+
+ \begin{quote}
+ \emph{We could institute the notion of global time across threads,
+ and thus ensure IO ordering across devices, with some reduction in
+ timing accuracy.}
+ \end{quote}
+
+ \item Lack of IO timing accuracy -- additional time between IO bunches.
+
+ \begin{quote}
+ \emph{This is the primary problem with any IO replay mechanism -- how
+ to guarantee per-IO timing accuracy with respect to other replayed IOs?
+ One idea to reduce errors in this area would be to push the IO replay
+ into the kernel, where you \emph{may} receive more responsive timings.}
+ \end{quote}
+
+ \item Bunching of IOs results in reduced time amongst IOs within a bunch.
+
+ \begin{quote}
+ \emph{The user has \emph{some} control over this (via the
+ \texttt{--max-pkts} option). One \emph{could} simply specify
+ \texttt{--max-pkts=1} and then each IO would be treated individually. Of
+ course, this would probably then run into the problem of excessive
+ inter-IO times.}
+ \end{quote}
+
+ \item 1-to-1 mapping of devices -- for now the devices on the replay
+ machine must be the same as on the recording machine.
+
+ \begin{quote}
+ \emph{It should be relatively trivial to add in the notion of
+ mapping -- simply include a file that is read which maps devices
+ on one machine to devices (with offsets and sizes) on the replay
+ machine\footnote{The notion of an offset and device size to replay on
+ could be used to both allow for a single device to masquerade as more
+ than one device, and could be utilized in case the replay device is
+ smaller than the recorded device.}.}
+
+ \medskip\emph{One could also add in the notion of CPU mappings as well --
+ device $D_{rec}$ managed by CPU $C_{rec}$ on the recorded system
+ shall be replayed on device $D_{rep}$ and CPU $C_{rep}$ on the
+ replay machine.}
+
+ \bigskip
+ \begin{quote}
+ With version 0.9.1 we now support the \texttt{-M} option to do this
+ -- see section~\ref{sec:p-o-M} on page~\pageref{sec:p-o-M} for more
+ information on device mapping.
+ \end{quote}
+ \end{quote}
+
+\end{enumerate}
+
+%---------------------
+\newpage\section{\label{sec:command-line}Command Line Options}
+\subsection{\texttt{btrecord} Command Line Options}
+\begin{figure}[h!]
+\begin{verbatim}
+Usage: btrecord -- version 0.9.3
+
+ [ -d <dir> : --input-directory=<dir> ] Default: .
+ [ -D <dir> : --output-directory=<dir>] Default: .
+ [ -F : --find-traces ] Default: Off
+ [ -h : --help ] Default: Off
+ [ -m <nsec> : --max-bunch-time=<nsec> ] Default: 10 msec
+ [ -M <pkts> : --max-pkts=<pkts> ] Default: 8
+ [ -o <base> : --output-base=<base> ] Default: replay
+ [ -v : --verbose ] Default: Off
+ [ -V : --version ] Default: Off
+ <dev>... Default: None
+\end{verbatim}
+\caption{\label{fig:btrecord--help}\texttt{btrecord --help} Output}
+\end{figure}
+\FloatBarrier
+
+\subsubsection{\label{sec:c-o-d}\texttt{-d} or
+\texttt{--input-directory}\\Set Input Directory}
+
+The \texttt{-d} option requires a single parameter providing the directory
+name for where input files are to be found. The default directory is the
+current directory (\texttt{.}).
+
+\subsubsection{\label{sec:c-o-D}\texttt{-D} or
+\texttt{--output-directory}\\Set Output Directory}
+
+The \texttt{-D} option requires a single parameter providing the directory
+name for where output files are to be placed. The default directory is the
+current directory (\texttt{.}).
+
+\subsubsection{\texttt{-F} or \texttt{--find-traces}\\Find Trace Files
+Automatically}
+
+The \texttt{-F} option instructs \texttt{btrecord} to go find all the
+trace files in the directory specified (either via the \texttt{-d}
+option, or in the default directory '.').
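+
+For example, assuming the \texttt{blktrace} output was saved in a
+directory named \texttt{traces} and that a directory named
+\texttt{records} already exists (both names are hypothetical), the
+following would convert every trace file found without naming devices
+explicitly:
+
+\begin{verbatim}
+$ btrecord -F -d traces -D records
+\end{verbatim}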
+
+\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
+\subsubsection{\texttt{-V} or \texttt{--version}\\Display
+\texttt{btrecord} Version}
+
+The \texttt{-h} option displays the command line options and
+defaults, as presented in figure~\ref{fig:btrecord--help} on
+page~\pageref{fig:btrecord--help}.
+
+The \texttt{-V} option displays the \texttt{btrecord} version, as shown here:
+
+\begin{verbatim}
+$ btrecord --version
+btrecord -- version 0.9.0
+\end{verbatim}
+
+Both commands exit immediately after processing the option.
+
+\subsubsection{\label{sec:c-o-m}\texttt{-m} or
+\texttt{--max-bunch-time}\\Set Maximum Time Per Bunch}
+
+The \texttt{-m} option requires a single parameter which specifies an
+amount of time (in nanoseconds) to include in any one bunch of IOs that
+are to be processed. The smaller the value, the smaller the number of
+IOs processed at one time -- perhaps yielding a more realistic replay.
+However, after a certain point the amount of overhead per bunch may result
+in additional real replay time, thus yielding less accurate replay times.
+
+The default value is 10,000,000 nanoseconds (10 milliseconds).
+
+\subsubsection{\label{sec:c-o-M}\texttt{-M} or
+\texttt{--max-pkts}\\Set Maximum Packets Per Bunch}
+
+The \texttt{-M} option requires a single parameter which specifies the
+maximum number of IOs to store in a single bunch. As with the \texttt{-m}
+option (section~\ref{sec:c-o-m}), smaller values \emph{may} or \emph{may not}
+yield more accurate replay times.
+
+The default value is 8, with a maximum value of up to 512 being supported.
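+
+As a sketch of how the two limits might be combined, the following
+shrinks the time window to 1 millisecond (1,000,000 nanoseconds) while
+allowing up to 16 IOs per bunch. The device name \texttt{sdb} is
+hypothetical:
+
+\begin{verbatim}
+$ btrecord -m 1000000 -M 16 sdb
+\end{verbatim}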
+
+\subsubsection{\label{sec:c-o-o}\texttt{-o} or
+\texttt{--output-base}\\Set Base Name for Output Files}
+
+Each output file has 3 fields:
+
+\begin{enumerate}
+ \item Device identifier (taken directly from the device name of the
+ \texttt{blktrace} output file).
+
+ \item \texttt{btrecord} base name -- by default ``replay''.
+
+ \item And the CPU number (again, taken directly from the
+ \texttt{blktrace} output file name).
+\end{enumerate}
+
+This option requires a single parameter that will override the default name
+(replay), and replace it with the specified value.
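+
+As an illustration, the discussion of verbose output in the
+\texttt{btreplay} section of this guide names a record file
+\texttt{sdab.replay.3}: device \texttt{sdab}, the default base name, and
+CPU 3. Assuming the same layout, overriding the base name with a
+hypothetical \texttt{run1} would give:
+
+\begin{verbatim}
+$ btrecord -o run1 sdab
+    (record file for CPU 3: sdab.run1.3)
+\end{verbatim}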
+
+\subsubsection{\label{sec:c-o-v}\texttt{-v} or
+\texttt{--verbose}\\Select Verbose Output}
+
+This option will output some simple statistics at the end of a successful
+run. Figure~\ref{fig:verb-out} (page~\pageref{fig:verb-out}) shows
+an example of some output, while figure~\ref{fig:verb-defs}
+(page~\pageref{fig:verb-defs}) shows what the fields mean.
+
+\begin{figure}[h!]
+\begin{verbatim}
+sdab:0: 580661 pkts (tot), 126030 pkts (replay), 89809 bunches, 1.4 pkts/bunch
+sdab:1: 2559775 pkts (tot), 430172 pkts (replay), 293029 bunches, 1.5 pkts/bunch
+sdab:2: 653559 pkts (tot), 136522 pkts (replay), 102288 bunches, 1.3 pkts/bunch
+sdab:3: 474773 pkts (tot), 117849 pkts (replay), 69572 bunches, 1.7 pkts/bunch
+\end{verbatim}
+\caption{\label{fig:verb-out}Verbose Output Example}
+\end{figure}
+\FloatBarrier
+
+\begin{figure}[h!]
+\begin{description}
+ \item[Field 1] The first field contains the device name and CPU
+ identifier. Thus: \texttt{sdab:0:} means the device \texttt{sdab} and
+ traces on CPU 0.
+
+ \item[Field 2] The second field contains the total number of packets
+ processed for each device file.
+
+ \item[Field 3] The next field shows the number of packets eligible for
+ replay.
+
+ \item[Field 4] The fourth field contains the total number of IO bunches.
+
+ \item[Field 5] The last field shows the average number of IOs per bunch
+ recorded.
+\end{description}
+\caption{\label{fig:verb-defs}Verbose Field Definitions}
+\end{figure}
+\FloatBarrier
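+
+As a check on how the fields relate, the last field of the first line in
+figure~\ref{fig:verb-out} is consistent with dividing the packets
+eligible for replay by the number of bunches:
+
+\[
+ \frac{126030}{89809} \approx 1.4
+\]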
+
+%---------------------
+\newpage\subsection{\texttt{btreplay} Command Line Options}
+\begin{figure}[h!]
+\begin{verbatim}
+Usage: btreplay -- version 0.9.3
+
+ [ -c <cpus> : --cpus=<cpus> ] Default: 1
+ [ -d <dir> : --input-directory=<dir> ] Default: .
+ [ -F : --find-records ] Default: Off
+ [ -h : --help ] Default: Off
+ [ -i <base> : --input-base=<base> ] Default: replay
+ [ -I <iters>: --iterations=<iters> ] Default: 1
+ [ -M <file> : --map-devs=<file> ] Default: None
+ [ -N : --no-stalls ] Default: Off
+ [ -v : --verbose ] Default: Off
+ [ -V : --version ] Default: Off
+ [ -W : --write-enable ] Default: Off
+ <dev...> Default: None
+\end{verbatim}
+\caption{\label{fig:btreplay--help}\texttt{btreplay --help} Output}
+\end{figure}
+\FloatBarrier
+
+\subsubsection{\label{sec:p-o-c}\texttt{-c} or
+\texttt{--cpus}\\Set Number of CPUs to Use}
+
+The \texttt{-c} option requires a single parameter specifying the number
+of CPUs on the replay system to utilize. If this is less than the number
+of CPUs on the recording system, CPU IDs are wrapped. The default value
+is 1.
+
+\subsubsection{\label{sec:p-o-d}\texttt{-d} or
+\texttt{--input-directory}\\Set Input Directory}
+
+The \texttt{-d} option requires a single parameter providing the directory
+name for where input files are to be found. The default directory is the
+current directory (\texttt{.}).
+
+\subsubsection{\texttt{-F} or \texttt{--find-records}\\Find Record Files
+Automatically}
+
+The \texttt{-F} option instructs \texttt{btreplay} to go find all the
+record files in the directory specified (either via the \texttt{-d}
+option, or in the default directory '.').
+
+\subsubsection{\texttt{-h} or \texttt{--help}\\Display Help Message}
+\subsubsection{\texttt{-V} or \texttt{--version}\\Display
+\texttt{btreplay} Version}
+
+The \texttt{-h} option displays the command line options and
+defaults, as presented in figure~\ref{fig:btreplay--help} on
+page~\pageref{fig:btreplay--help}.
+
+The \texttt{-V} option displays the \texttt{btreplay} version, as shown here:
+
+\begin{verbatim}
+$ btreplay --version
+btreplay -- version 0.9.0
+\end{verbatim}
+
+Both commands exit immediately after processing the option.
+
+\subsubsection{\label{sec:p-o-i}\texttt{-i} or
+\texttt{--input-base}\\Set Base Name for Input Files}
+
+Each input file has 3 fields:
+
+\begin{enumerate}
+ \item Device identifier (taken directly from the device name of the
+ \texttt{blktrace} output file).
+
+ \item \texttt{btrecord} base name -- by default ``replay''.
+
+ \item And the CPU number (again, taken directly from the
+ \texttt{blktrace} output file name).
+\end{enumerate}
+
+This option requires a single parameter that will override the default name
+(replay), and replace it with the specified value.
+
+\subsubsection{\label{sec:p-o-I}\texttt{-I} or
+\texttt{--iterations}\\Set Number of Iterations to Run}
+
+This option requires a single parameter which specifies the number of times
+to run through the input files. The default value is 1.
+
+\subsubsection{\label{sec:p-o-M}\texttt{-M} or \texttt{--map-devs}\\
+Specify Device Mappings}
+
+This option requires a single parameter which specifies the name of a
+file containing device mappings. The file format is very simple, with
+just two pieces of data per line:
+
+\begin{enumerate}
+ \item The device name on the recorded system (with the \texttt{'/dev/'}
+ removed). Example: \texttt{/dev/sda} would just be \texttt{sda}.
+
+ \item The device name on the replay system to use (again, without the
+ \texttt{'/dev/'} path prepended).
+\end{enumerate}
+
+An example file for when one would map devices \texttt{/dev/sda} and
+\texttt{/dev/sdb} on the recorded system to \texttt{/dev/sdg} and
+\texttt{/dev/sdh} on the replay system would be:
+
+\begin{verbatim}
+sda sdg
+sdb sdh
+\end{verbatim}
+
+The only entries in the file that are allowed are these two-element lines
+-- we do not (yet?) support the notion of blank lines, or comment lines, or
+the like.
+
+The utility \emph{does} allow for multiple \texttt{-M} options to be
+supplied on the command line.
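+
+A sketch of how such a file might be used follows. The file name
+\texttt{devmap} is hypothetical, and we assume the devices named on the
+command line are the recorded ones:
+
+\begin{verbatim}
+$ cat devmap
+sda sdg
+sdb sdh
+$ btreplay -M devmap sda sdb
+\end{verbatim}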
+
+\subsubsection{\label{sec:o-N}\texttt{-N} or \texttt{--no-stalls}\\Disable
+Pre-bunch Stalls}
+
+When specified on the command line, all pre-bunch stall indicators will be
+ignored. IOs will be replayed without inter-bunch delays.
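+
+For instance, to push the recorded IOs of a (hypothetical) device
+\texttt{sdb} through three times with no inter-bunch delays, one might
+combine this option with \texttt{-I}:
+
+\begin{verbatim}
+$ btreplay -N -I 3 sdb
+\end{verbatim}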
+
+\subsubsection{\label{sec:p-o-v}\texttt{-v} or
+\texttt{--verbose}\\Select Verbose Output}
+
+When specified on the command line, this option instructs \texttt{btreplay}
+to store information concerning each \emph{stall} and IO operation
+performed by \texttt{btreplay}. The name of each file so created will be
+the input file name used with an extension of \texttt{.rep} appended onto
+it. Thus, an input file of the name \texttt{sdab.replay.3} would generate a
+verbose output file with the name \texttt{sdab.replay.3.rep} in the
+directory specified for input files.
+
+In addition, \texttt{btreplay} will also output to \texttt{stderr} the
+names of the input files being processed.
+
+\subsubsection{\label{sec:p-o-W}\texttt{-W} or
+\texttt{--write-enable}\\Enable Writing During Replay}
+
+As a precautionary measure, by default \texttt{btreplay} will \emph{not}
+process \emph{write} requests. In order to enable \texttt{btreplay} to
+actually \emph{write} to devices one must explicitly specify the
+\texttt{-W} option.
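+
+For example, to replay both reads and writes recorded for a
+(hypothetical) device \texttt{sdb}:
+
+\begin{verbatim}
+# btreplay -W sdb
+\end{verbatim}
+
+Since the writes land at the recorded offsets, any data already on the
+target device at those locations will be overwritten.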
+
+\end{document}