| 1 | \documentclass{article} |
| 2 | |
| 3 | % |
| 4 | % Copyright (C) 2005, 2006 Alan D. Brunelle <Alan.Brunelle@hp.com> |
| 5 | % |
| 6 | % This program is free software; you can redistribute it and/or modify |
| 7 | % it under the terms of the GNU General Public License as published by |
| 8 | % the Free Software Foundation; either version 2 of the License, or |
| 9 | % (at your option) any later version. |
| 10 | % |
| 11 | % This program is distributed in the hope that it will be useful, |
| 12 | % but WITHOUT ANY WARRANTY; without even the implied warranty of |
| 13 | % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
| 14 | % GNU General Public License for more details. |
| 15 | % |
| 16 | % You should have received a copy of the GNU General Public License |
| 17 | % along with this program; if not, write to the Free Software |
| 18 | % Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA |
| 19 | % |
| 20 | |
| 21 | \title{blktrace User Guide} |
| 22 | \author{blktrace: Jens Axboe (jens.axboe@oracle.com)\\ |
| 23 | User Guide: Alan D. Brunelle (Alan.Brunelle@hp.com)} |
| 24 | \date{27 May 2008} |
| 25 | |
| 26 | \begin{document} |
| 27 | \maketitle |
| 28 | %--------------------- |
| 29 | \section{\label{sec:intro}Introduction} |
| 30 | |
| 31 | blktrace is a block layer IO tracing mechanism which provides detailed |
| 32 | information about request queue operations up to user space. There are |
| 33 | three major components that are provided: |
| 34 | |
| 35 | \begin{description} |
| 36 | \item[Kernel patch] A patch to the Linux kernel which includes the |
| 37 | kernel event logging interfaces, and patches to areas within the block |
| 38 | layer to emit event traces. If you run a 2.6.17-rc1 or newer kernel, |
| 39 | you don't need to patch blktrace support as it is already included. |
| 40 | |
| 41 | \item[blktrace] A utility which transfers event traces from the kernel |
| 42 | into either long-term on-disk storage, or provides direct formatted |
| 43 | output (via blkparse). |
| 44 | |
| 45 | \item[blkparse] A utility which formats events stored in files, or when |
| 46 | run in \emph{live} mode directly outputs data collected by blktrace. |
| 47 | \end{description} |
| 48 | |
| 49 | \subsection{blktrace Download Area} |
| 50 | |
| 51 | The blktrace and blkparse utilities and associated kernel patch are provided |
| 52 | as part of the following git repository: |
| 53 | |
| 54 | git://git.kernel.org/pub/scm/linux/kernel/git/axboe/blktrace.git bt |
| 55 | |
| 56 | %-------------------------- |
| 57 | \newpage\section{\label{sec:quick-start}Quick Start Guide} |
| 58 | |
| 59 | The following sections outline some quick steps towards utilizing |
| 60 | blktrace. Some of the specific instructions below may need to be tailored |
| 61 | to your environment. |
| 62 | |
| 63 | \subsection{\label{sec:get-blktrace}Retrieving blktrace} |
| 64 | |
| 65 | As noted above, the kernel patch along with the blktrace and blkparse utilities are stored in a git repository. One simple way to get going would be: |
| 66 | |
| 67 | \begin{verbatim} |
| 68 | % git clone git://git.kernel.org/pub/scm/linux/kernel/git/axboe/blktrace.git bt |
| 69 | % cd bt |
| 70 | % git checkout |
| 71 | \end{verbatim} |
| 72 | |
| 73 | \subsection{\label{sec:patching}Patching and configuring the Linux kernel} |
| 74 | |
| 75 | A patch for a \emph{specific Linux kernel} is provided in bt/kernel (where |
| 76 | \emph{bt} is the name of the directory from the above git sequence). The |
| 77 | detailed actual patching instructions for a Linux kernel is outside the |
| 78 | scope of this document, but the following may be used as a sample template. |
| 79 | Note that you may skip this step, if you kernel is at least 2.6.17-rc1. |
| 80 | |
| 81 | As an example, bt/kernel contains blk-trace-2.6.14-rc1-git-G2, download |
| 82 | linux-2.6.13.tar.bz2 and patch-2.6.14-rc1.bz2 |
| 83 | |
| 84 | \begin{verbatim} |
| 85 | % tar xjf linux-2.6.13.tar.bz2 |
| 86 | % mv linux-2.6.13 linux-2.6.14-rc1 |
| 87 | % cd linux-2.6.14-rc1 |
| 88 | % bunzip2 -c ../patch-2.6.14-rc1.bz2 | patch -p1 |
| 89 | \end{verbatim} |
| 90 | |
| 91 | At this point you may (optionally) remove linux-2.6.13.tar.bz2 and |
| 92 | patch-2.6.14-rc1.bz2. |
| 93 | |
| 94 | At this point you should configure the Linux kernel for your specific |
| 95 | system -- again, outside the scope of this document -- and then enable |
| 96 | \emph{Support for tracing block io actions.} To do this, run |
| 97 | |
| 98 | \begin{verbatim} |
| 99 | % make menuconfig or make xconfig, or edit .config, or ... |
| 100 | \end{verbatim} |
| 101 | |
| 102 | and navigate through \emph{Device Drivers} and \emph{Block devices} |
| 103 | and then down to \emph{Support for tracing block io actions} and hit Y. |
| 104 | |
| 105 | Install the new kernel (and modules\ldots) and reboot. |
| 106 | |
| 107 | \subsection{\label{sec:mount}Mounting the debugfs file system} |
| 108 | |
| 109 | blktrace utilizes files under the debug file system, and thus must have |
| 110 | the mount point set up -- mounted on the directory /sys/kernel/debug. |
| 111 | To do this one may do either of the following: |
| 112 | |
| 113 | \begin{enumerate} |
| 114 | \item Manually mount after each boot: |
| 115 | \begin{verbatim} |
| 116 | % mount -t debugfs debugfs /sys/kernel/debug |
| 117 | \end{verbatim} |
| 118 | |
| 119 | \item Add an entry into /etc/fstab, and have it done automatically at |
| 120 | each boot\footnote{Note: after adding the entry to /etc/fstab, you |
| 121 | could then mount the directory this time only by doing: \% mount debug}: |
| 122 | \begin{verbatim} |
| 123 | debug /sys/kernel/debug debugfs default 0 0 |
| 124 | \end{verbatim} |
| 125 | \end{enumerate} |
| 126 | |
| 127 | \subsection{\label{sec:build}Build the tools} |
| 128 | |
| 129 | To build and install the tools, execute the following sequence (as root): |
| 130 | |
| 131 | \begin{verbatim} |
| 132 | % cd bt |
| 133 | % make && make install |
| 134 | \end{verbatim} |
| 135 | |
| 136 | \subsection{\label{sec:live-blktrace}blktrace -- live} |
| 137 | |
| 138 | Now to simply watch what is going on for a specific disk (to stop the |
| 139 | trace, hit control-C): |
| 140 | |
| 141 | \begin{verbatim} |
| 142 | % blktrace -d /dev/sda -o - | blkparse -i - |
| 143 | 8,0 3 1 0.000000000 697 G W 223490 + 8 [kjournald] |
| 144 | 8,0 3 2 0.000001829 697 P R [kjournald] |
| 145 | 8,0 3 3 0.000002197 697 Q W 223490 + 8 [kjournald] |
| 146 | 8,0 3 4 0.000005533 697 M W 223498 + 8 [kjournald] |
| 147 | 8,0 3 5 0.000008607 697 M W 223506 + 8 [kjournald] |
| 148 | 8,0 3 6 0.000011569 697 M W 223514 + 8 [kjournald] |
| 149 | 8,0 3 7 0.000014407 697 M W 223522 + 8 [kjournald] |
| 150 | 8,0 3 8 0.000017367 697 M W 223530 + 8 [kjournald] |
| 151 | 8,0 3 9 0.000020161 697 M W 223538 + 8 [kjournald] |
| 152 | 8,0 3 10 0.000024062 697 D W 223490 + 56 [kjournald] |
| 153 | 8,0 1 11 0.009507758 0 C W 223490 + 56 [0] |
| 154 | 8,0 1 12 0.009538995 697 G W 223546 + 8 [kjournald] |
| 155 | 8,0 1 13 0.009540033 697 P R [kjournald] |
| 156 | 8,0 1 14 0.009540313 697 Q W 223546 + 8 [kjournald] |
| 157 | 8,0 1 15 0.009542980 697 D W 223546 + 8 [kjournald] |
| 158 | 8,0 1 16 0.013542170 0 C W 223546 + 8 [0] |
| 159 | ... |
| 160 | ^C |
| 161 | ... |
| 162 | CPU1 (8,0): |
| 163 | Reads Queued: 0, 0KiB Writes Queued: 7, 128KiB |
| 164 | Read Dispatches: 0, 0KiB Write Dispatches: 7, 128KiB |
| 165 | Reads Completed: 0, 0KiB Writes Completed: 11, 168KiB |
| 166 | Read Merges: 0 Write Merges: 25 |
| 167 | IO unplugs: 0 Timer unplugs: 0 |
| 168 | ... |
| 169 | CPU3 (8,0): |
| 170 | Reads Queued: 0, 0KiB Writes Queued: 1, 28KiB |
| 171 | Read Dispatches: 0, 0KiB Write Dispatches: 1, 28KiB |
| 172 | Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB |
| 173 | Read Merges: 0 Write Merges: 6 |
| 174 | IO unplugs: 0 Timer unplugs: 0 |
| 175 | |
| 176 | Total (8,0): |
| 177 | Reads Queued: 0, 0KiB Writes Queued: 11, 168KiB |
| 178 | Read Dispatches: 0, 0KiB Write Dispatches: 11, 168KiB |
| 179 | Reads Completed: 0, 0KiB Writes Completed: 11, 168KiB |
| 180 | Read Merges: 0 Write Merges: 31 |
| 181 | IO unplugs: 0 Timer unplugs: 3 |
| 182 | |
| 183 | Events (8,0): 89 entries, 0 skips |
| 184 | \end{verbatim} |
| 185 | |
| 186 | A \emph{btrace} script is included in the distribution to ease live |
| 187 | tracing of devices. The above could also be accomplished by issuing: |
| 188 | |
| 189 | \begin{verbatim} |
| 190 | % btrace /dev/sda |
| 191 | \end{verbatim} |
| 192 | |
| 193 | By default, \emph{btrace} runs the trace in quiet mode so it will not |
| 194 | include statistics when you break the run. Add the \emph{-S} option to |
| 195 | get that dumped as well. |
| 196 | |
| 197 | \subsection{\label{sec:pc-blktrace}blktrace -- SCSI commands} |
| 198 | |
| 199 | The previous section showed typical file system io actions, but blktrace |
| 200 | can also show SCSI commands going in and out of the queue as submitted |
| 201 | by applications using the SCSI Generic (\emph{sg}) interface. |
| 202 | |
| 203 | \begin{verbatim} |
| 204 | % btrace /dev/cdrom |
| 205 | [...] |
| 206 | 3,0 0 25 0.004884107 13528 G R 0 + 0 [inquiry] |
| 207 | 3,0 0 26 0.004890361 13528 I R 56 (12 00 00 00 38 ..) [inquiry] |
| 208 | 3,0 0 27 0.004891223 13528 P R [inquiry] |
| 209 | 3,0 0 28 0.004893250 13528 D R 56 (12 00 00 00 38 ..) [inquiry] |
| 210 | 3,0 0 29 0.005344910 0 C R (12 00 00 00 38 ..) [0] |
| 211 | \end{verbatim} |
| 212 | |
| 213 | Here we see a program issuing an INQUIRY command to the CDROM device. |
| 214 | The program requested a read of 56 bytes of data, the CDB is included |
| 215 | in parenthesis after the data length. The completion event shows shows |
| 216 | that the command completed successfully. Tracing SCSI commands can be |
| 217 | very useful for debugging problems with programs talking directly to the |
| 218 | device. An example of that would be \emph{cdrecord} burning. |
| 219 | |
| 220 | \subsection{\label{sec:blktrace-post}blktrace -- post-processing} |
| 221 | |
| 222 | Another way to run blktrace is to have blktrace save data away for later |
| 223 | formatting by blkparse. This would be useful if you want to get |
| 224 | measurements while running specific loads. |
| 225 | |
| 226 | To do this, one would specify the device (or devices) to be watched. Then |
| 227 | go run you test cases. Stop the trace, and at your leisure utilize |
| 228 | blkparse to see the results. |
| 229 | |
| 230 | In this example, devices /dev/sdaa, /dev/sdc and /dev/sdo are used in an |
| 231 | LVM volume called adb3/vol. |
| 232 | |
| 233 | \begin{verbatim} |
| 234 | % blktrace /dev/sdaa /dev/sdc /dev/sdo & |
| 235 | [1] 9713 |
| 236 | % |
| 237 | % mkfs -t ext3 /dev/adb3/vol |
| 238 | mke2fs 1.35 (28-Feb-2004) |
| 239 | Filesystem label= |
| 240 | OS type: Linux |
| 241 | Block size=4096 (log=2) |
| 242 | Fragment size=4096 (log=2) |
| 243 | 16793600 inodes, 33555456 blocks |
| 244 | 1677772 blocks (5.00%) reserved for the super user |
| 245 | First data block=0 |
| 246 | Maximum filesystem blocks=4294967296 |
| 247 | 1025 block groups |
| 248 | 32768 blocks per group, 32768 fragments per group |
| 249 | 16384 inodes per group |
| 250 | Superblock backups stored on blocks: |
| 251 | 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, |
| 252 | 4096000, 7962624, 11239424, 20480000, 23887872 |
| 253 | |
| 254 | Writing inode tables: done |
| 255 | Creating journal (8192 blocks): done |
| 256 | Writing superblocks and filesystem accounting information: done |
| 257 | |
| 258 | This filesystem will be automatically checked every 27 mounts or |
| 259 | 180 days, whichever comes first. Use tune2fs -c or -i to override. |
| 260 | % |
| 261 | % kill -15 9713 |
| 262 | \end{verbatim} |
| 263 | |
| 264 | Then you could process the events later: |
| 265 | |
| 266 | \begin{verbatim} |
| 267 | % |
| 268 | % blkparse sdaa sdc sdo > events |
| 269 | % less events |
| 270 | 8,32 1 1 0.000000000 9728 G R 384 + 32 [mkfs.ext3] |
| 271 | 8,32 1 2 0.000001959 9728 P R [mkfs.ext3] |
| 272 | 8,32 1 3 0.000002446 9728 Q R 384 + 32 [mkfs.ext3] |
| 273 | 8,32 1 4 0.000005110 9728 D R 384 + 32 [mkfs.ext3] |
| 274 | 8,32 3 5 0.000200570 0 C R 384 + 32 [0] |
| 275 | 8,224 3 1 0.021658989 9728 G R 384 + 32 [mkfs.ext3] |
| 276 | ... |
| 277 | 65,160 3 163392 41.117070504 0 C W 87469088 + 1376 [0] |
| 278 | 8,32 3 163374 41.122683668 0 C W 88168160 + 1376 [0] |
| 279 | 65,160 3 163393 41.129952433 0 C W 87905984 + 1376 [0] |
| 280 | 65,160 3 163394 41.130049431 0 D W 89129344 + 1376 [swapper] |
| 281 | 65,160 3 163395 41.130067135 0 D W 89216704 + 1376 [swapper] |
| 282 | 65,160 3 163396 41.130083785 0 D W 89304096 + 1376 [swapper] |
| 283 | 65,160 3 163397 41.130099455 0 D W 89391488 + 1376 [swapper] |
| 284 | 65,160 3 163398 41.130114732 0 D W 89478848 + 1376 [swapper] |
| 285 | 65,160 3 163399 41.130128885 0 D W 89481536 + 64 [swapper] |
| 286 | 8,32 3 163375 41.134758196 0 C W 86333152 + 1376 [0] |
| 287 | 65,160 3 163400 41.142229726 0 C W 89129344 + 1376 [0] |
| 288 | 65,160 3 163401 41.144952314 0 C W 89481536 + 64 [0] |
| 289 | 8,32 3 163376 41.147441930 0 C W 88342912 + 1376 [0] |
| 290 | 65,160 3 163402 41.155869604 0 C W 89478848 + 1376 [0] |
| 291 | 8,32 3 163377 41.159466082 0 C W 86245760 + 1376 [0] |
| 292 | 65,160 3 163403 41.166944976 0 C W 89216704 + 1376 [0] |
| 293 | 65,160 3 163404 41.178968252 0 C W 89304096 + 1376 [0] |
| 294 | 65,160 3 163405 41.191860173 0 C W 89391488 + 1376 [0] |
| 295 | ... |
| 296 | Events (sdo): 0 entries, 0 skips |
| 297 | |
| 298 | CPU0 (65,160): |
| 299 | Reads Queued: 0, 0KiB Writes Queued: 9, 5,520KiB |
| 300 | Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB |
| 301 | Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB |
| 302 | Read Merges: 0 Write Merges: 336 |
| 303 | IO unplugs: 0 Timer unplugs: 0 |
| 304 | CPU1 (65,160): |
| 305 | Reads Queued: 2,411, 38,576KiB Writes Queued: 769, 425,408KiB |
| 306 | Read Dispatches: 2,407, 38,512KiB Write Dispatches: 118, 61,680KiB |
| 307 | Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB |
| 308 | Read Merges: 0 Write Merges: 25,819 |
| 309 | IO unplugs: 0 Timer unplugs: 4 |
| 310 | CPU2 (65,160): |
| 311 | Reads Queued: 2, 32KiB Writes Queued: 18, 10,528KiB |
| 312 | Read Dispatches: 2, 32KiB Write Dispatches: 3, 1,344KiB |
| 313 | Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB |
| 314 | Read Merges: 0 Write Merges: 640 |
| 315 | IO unplugs: 0 Timer unplugs: 0 |
| 316 | CPU3 (65,160): |
| 317 | Reads Queued: 20,572, 329,152KiB Writes Queued: 594, 279,712KiB |
| 318 | Read Dispatches: 20,576, 329,216KiB Write Dispatches: 1,474, 740,720KiB |
| 319 | Reads Completed: 22,985, 367,760KiB Writes Completed: 1,390, 721,168KiB |
| 320 | Read Merges: 0 Write Merges: 16,888 |
| 321 | IO unplugs: 0 Timer unplugs: 0 |
| 322 | |
| 323 | Total (65,160): |
| 324 | Reads Queued: 22,985, 367,760KiB Writes Queued: 1,390, 721,168KiB |
| 325 | Read Dispatches: 22,985, 367,760KiB Write Dispatches: 1,595, 803,744KiB |
| 326 | Reads Completed: 22,985, 367,760KiB Writes Completed: 1,390, 721,168KiB |
| 327 | Read Merges: 0 Write Merges: 43,683 |
| 328 | IO unplugs: 0 Timer unplugs: 4 |
| 329 | ... |
| 330 | \end{verbatim} |
| 331 | |
| 332 | %---------------------------- |
| 333 | \newpage\section{\label{sec:blktrace-ug}blktrace User Guide} |
| 334 | |
| 335 | The \emph{blktrace} utility extracts event traces from the kernel (via |
| 336 | the relaying through the debug file system). Some background details |
| 337 | concerning the run-time behaviour of blktrace will help to understand some |
| 338 | of the more arcane command line options: |
| 339 | |
| 340 | \begin{itemize} |
| 341 | \item blktrace receives data from the kernel in buffers passed up |
| 342 | through the debug file system (relay). Each device being traced has |
| 343 | a file created in the mounted directory for the debugfs, which defaults |
| 344 | to \emph{/sys/kernel/debug} -- this can be overridden with the \emph{-r} |
| 345 | command line argument. |
| 346 | |
| 347 | \item blktrace defaults to collecting \emph{all} events that can be |
| 348 | traced. To limit the events being captured, you can specify one or |
| 349 | more filter masks via the \emph{-a} option. |
| 350 | |
| 351 | Alternatively, one may specify the entire mask utilizing a hexadecimal |
| 352 | value that is version-specific. (Requires understanding of the internal |
| 353 | representation of the filter mask.) |
| 354 | |
| 355 | \item As noted above, the events are passed up via a series of buffers |
| 356 | stored into debugfs files. The size and number of buffers can be |
| 357 | specified via the \emph{-b} and \emph{-n} arguments respectively. |
| 358 | |
| 359 | \item blktrace stores the extracted data into files stored in the |
| 360 | \emph{local} directory. The format of the file names is (by default) |
| 361 | \emph{device}.blktrace.\emph{cpu}, where \emph{device} is the base |
| 362 | device name (e.g, if we are tracing /dev/sda, the base device name would |
| 363 | be \emph{sda}); and \emph{cpu} identifies a CPU for the event stream. |
| 364 | |
| 365 | The \emph{device} portion of the event file name can be changed via |
| 366 | the \emph{-o} option. |
| 367 | |
| 368 | \item blktrace may also be run concurrently with blkparse to produce |
| 369 | \emph{live} output -- to do this specify \emph{-o -} for blktrace. |
| 370 | |
| 371 | \item The default behaviour for blktrace is to run forever until explicitly killed by the user (via a control-C, or \emph{kill} utility invocation). There are two ways to modify this: |
| 372 | |
| 373 | \begin{enumerate} |
| 374 | \item You may utilize the blktrace utility itself to \emph{kill} |
| 375 | a running trace -- via the \emph{-k} option. |
| 376 | |
| 377 | \item You can specify a run-time duration for blktrace via the |
| 378 | \emph{-w} option -- then blktrace will run for the specified number |
| 379 | of seconds, and then halt. |
| 380 | \end{enumerate} |
| 381 | \end{itemize} |
| 382 | |
| 383 | \subsection{\label{sec:blktrace-args}Command line arguments} |
| 384 | \begin{tabular}{|l|l|l|}\hline |
| 385 | Short & Long & Description \\ \hline\hline |
| 386 | -A \emph{hex-mask} & --set-mask=\emph{hex-mask} & Set filter mask to \emph{hex-mask} \\ \hline |
| 387 | -a \emph{mask} & --act-mask=\emph{mask} & Add \emph{mask} to current filter (see below for masks) \\ \hline |
| 388 | -b \emph{size} & --buffer-size=\emph{size} & Specifies buffer size for event extraction (scaled by $2^{10}$) \\ \hline |
| 389 | -d \emph{dev} & --dev=\emph{dev} & Adds \emph{dev} as a device to trace \\ \hline |
| 390 | -k & --kill & Kill on-going trace \\ \hline |
| 391 | -n \emph{num-sub} & --num-sub=\emph{num-sub} & Specifies number of buffers to use \\ \hline |
| 392 | -o \emph{file} & --output=\emph{file} & Prepend \emph{file} to output file name(s) \\ |
| 393 | & & \textbf{This only works when using a single device} \\ |
| 394 | & & \textbf{or when piping the output via \texttt{-o -}} \\ |
| 395 | & & \textbf{with multiple devices.} \\ \hline |
| 396 | -r \emph{rel-path} & --relay=\emph{rel-path} & Specifies debugfs mount point \\ \hline |
| 397 | -V & --version & Outputs version \\ \hline |
| 398 | -w \emph{seconds} & --stopwatch=\emph{seconds} & Sets run time to the number of seconds specified \\ \hline |
| 399 | -I \emph{devs file}& --input-devs=\emph{devs file}& Adds devices found in \emph{devs file} to list of devices to trace. \\ |
| 400 | & & (One device per line.) \\ \hline |
| 401 | \end{tabular} |
| 402 | |
| 403 | \subsubsection{\label{sec:filter-mask}Filter Masks} |
| 404 | The following masks may be passed with the \emph{-a} command line |
| 405 | option, multiple filters may be combined via multiple \emph{-a} command |
| 406 | line options.\smallskip |
| 407 | |
| 408 | \begin{tabular}{|l|l|}\hline |
| 409 | barrier & \emph{barrier} attribute \\ \hline |
| 410 | complete & \emph{completed} by driver \\ \hline |
| 411 | fs & \emph{FS} requests \\ \hline |
| 412 | issue & \emph{issued} to driver \\ \hline |
| 413 | pc & \emph{packet command} events \\ \hline |
| 414 | queue & \emph{queue} operations \\ \hline |
| 415 | read & \emph{read} traces \\ \hline |
| 416 | requeue & \emph{requeue} operations \\ \hline |
| 417 | sync & \emph{synchronous} attribute \\ \hline |
| 418 | write & \emph{write} traces \\ \hline |
| 419 | notify & \emph{notify} trace messages \\ \hline |
| 420 | \end{tabular} |
| 421 | |
| 422 | \subsubsection{\label{sec:request-types}Request types} |
| 423 | blktrace disguingishes between two types of block layer requests, |
| 424 | file system and scsi commands. The former are dubbed \emph{fs} |
| 425 | requests, the latter \emph{pc} requests. File system requests are |
| 426 | normal read/write operations, ie any type of read or write from a |
| 427 | specific disk location at a given size. These requests typically |
| 428 | originate from a user process, but they may also be initiated by |
| 429 | the vm flushing dirty data to disk or the file system syncing |
| 430 | a super or journal block to disk. \emph{pc} requests are SCSI |
| 431 | commands. blktrace sends the command data block as a payload |
| 432 | so that blkparse can decode it. |
| 433 | |
| 434 | %---------------------------- |
| 435 | \newpage\section{\label{sec:blkparse-ug}blkparse User Guide} |
| 436 | |
| 437 | The \emph{blkparse} utility will attempt to combine streams of events |
| 438 | for various devices on various CPUs, and produce a formatted output of |
| 439 | the event information. As with blktrace, some details concerning blkparse |
| 440 | will help in understanding the command line options presented below. |
| 441 | |
| 442 | \begin{itemize} |
| 443 | \item By default, blkparse expects to run in a post-processing mode |
| 444 | -- one where the trace events have been saved by a previous run |
| 445 | of blktrace, and blkparse is combining event streams and dumping |
| 446 | formatted data. |
| 447 | |
| 448 | blkparse \emph{may} be run in a \emph{live} manner concurrently with |
| 449 | blktrace by specifying \emph{-i -} to blkparse, and combining it with |
| 450 | the live option for blktrace. An example would be: |
| 451 | |
| 452 | \begin{verbatim} |
| 453 | % blktrace -d /dev/sda -o - | blkparse -i - |
| 454 | \end{verbatim} |
| 455 | |
| 456 | \item You can set how many blkparse batches event reads via the |
| 457 | \emph{-b} option, the default is to handle events in batches of 512. |
| 458 | |
| 459 | \item If you have saved event traces in blktrace with different output |
| 460 | names (via the \emph{-o} option to blktrace), you must specify the |
| 461 | same \emph{input} name via the \emph{-i} option. |
| 462 | |
| 463 | \item The format of the output data can be controlled via the \emph{-f} |
| 464 | or \emph{-F} options -- see section~\ref{sec:blkparse-format} for details. |
| 465 | |
| 466 | By default, blkparse sends formatted data to standard output. This may |
| 467 | be changed via the \emph{-o} option, or text output can be disabled |
| 468 | via the\emph{-O} option. A merged binary stream can be produced using |
| 469 | the \emph{-d} option. |
| 470 | |
| 471 | \end{itemize} |
| 472 | |
| 473 | \newpage\subsection{\label{sec:blkparse-args}Command line arguments} |
| 474 | \begin{tabular}{|l|l|l|}\hline |
| 475 | Short & Long & Description \\ \hline\hline |
| 476 | -b \emph{batch} & --batch={batch} & Standard input read batching \\ \hline |
| 477 | |
| 478 | -i \emph{file} & --input=\emph{file} & Specifies base name for input files -- default is \emph{device}.blktrace.\emph{cpu}. \\ |
| 479 | & & As noted above, specifying \emph{-i -} runs in \emph{live} mode with blktrace \\ |
| 480 | & & (reading data from standard in). \\ \hline |
| 481 | |
| 482 | -F \emph{typ,fmt} & --format=\emph{typ,fmt} & Sets output format \\ |
| 483 | -f \emph{fmt} & --format-spec=\emph{fmt} & (See section~\ref{sec:blkparse-format} for details.) \\ |
| 484 | & & \\ |
| 485 | & & The -f form specifies a format for all events \\ |
| 486 | & & \\ |
| 487 | & & The -F form allows one to specify a format for a specific \\ |
| 488 | & & event type. The single-character \emph{typ} field is one of the \\ |
| 489 | & & action specifiers in section~\ref{sec:act-table} \\ \hline |
| 490 | |
| 491 | |
| 492 | -m & --missing & Print missing entries\\ \hline |
| 493 | |
| 494 | -h & --hash-by-name & Hash processes by name, not by PID\\ \hline |
| 495 | |
| 496 | -o \emph{file} & --output=\emph{file} & Output file \\ \hline |
| 497 | -O & --no-text-output & Do \emph{not} produce text output, used for binary (-d) only \\ \hline |
| 498 | |
| 499 | -d \emph{file} & --dump-binary=\emph{file} & Binary output file \\ \hline |
| 500 | |
| 501 | -q & --quiet & Quite mode \\ \hline |
| 502 | |
| 503 | -s & --per-program-stats & Displays data sorted by program \\ \hline |
| 504 | |
| 505 | -t & --track-ios & Display time deltas per IO \\ \hline |
| 506 | |
| 507 | -w \emph{span} & --stopwatch=\emph{span} & Display traces for the \emph{span} specified -- where span can be: \\ |
| 508 | & & \emph{end-time} -- Display traces from time 0 through \emph{end-time} (in ns) \\ |
| 509 | & & or \\ |
| 510 | & & \emph{start:end-time} -- Display traces from time \emph{start} \\ |
| 511 | & & through {end-time} (in ns). \\ \hline |
| 512 | |
| 513 | -M & --no-msgs & Do not add messages to binary output file \\\hline |
| 514 | -v & --verbose & More verbose marginal on marginal errors \\ \hline |
| 515 | -V & --version & Display version \\ \hline |
| 516 | |
| 517 | \end{tabular} |
| 518 | |
| 519 | \newpage |
| 520 | \subsection{\label{sec:blkparse-actions}Trace actions} |
| 521 | |
| 522 | \begin{description} |
| 523 | \item[C -- complete] A previously issued request has been completed. |
| 524 | The output will detail the sector and size of that request, as well |
| 525 | as the success or failure of it. |
| 526 | |
| 527 | \item[D -- issued] A request that previously resided on the block layer |
| 528 | queue or in the io scheduler has been sent to the driver. |
| 529 | |
| 530 | \item[I -- inserted] A request is being sent to the io scheduler for |
| 531 | addition to the internal queue and later service by the driver. The |
| 532 | request is fully formed at this time. |
| 533 | |
| 534 | \item[Q -- queued] This notes intent to queue io at the given location. |
| 535 | No real requests exists yet. |
| 536 | |
| 537 | \item[B -- bounced] The data pages attached to this \emph{bio} are |
| 538 | not reachable by the hardware and must be bounced to a lower memory |
| 539 | location. This causes a big slowdown in io performance, since the data |
| 540 | must be copied to/from kernel buffers. Usually this can be fixed with |
| 541 | using better hardware - either a better io controller, or a platform |
| 542 | with an IOMMU. |
| 543 | |
| 544 | \item[m -- message] Text message generated via kernel call to |
| 545 | \texttt{blk\_add\_trace\_msg}. |
| 546 | |
| 547 | \item[M -- back merge] A previously inserted request exists that ends |
| 548 | on the boundary of where this io begins, so the io scheduler can merge |
| 549 | them together. |
| 550 | |
| 551 | \item[F -- front merge] Same as the back merge, except this io ends |
| 552 | where a previously inserted requests starts. |
| 553 | |
| 554 | \item[G -- get request] To send any type of request to a block device, |
| 555 | a \emph{struct request} container must be allocated first. |
| 556 | |
| 557 | \item[S -- sleep] No available request structures were available, so |
| 558 | the issuer has to wait for one to be freed. |
| 559 | |
| 560 | \item[P -- plug] When io is queued to a previously empty block device |
| 561 | queue, Linux will plug the queue in anticipation of future ios being |
| 562 | added before this data is needed. |
| 563 | |
| 564 | \item[U -- unplug] Some request data already queued in the device, |
| 565 | start sending requests to the driver. This may happen automatically |
| 566 | if a timeout period has passed (see next entry) or if a number of |
| 567 | requests have been added to the queue. |
| 568 | |
| 569 | \item[T -- unplug due to timer] If nobody requests the io that was queued |
| 570 | after plugging the queue, Linux will automatically unplug it after a |
| 571 | defined period has passed. |
| 572 | |
| 573 | \item[X -- split] On raid or device mapper setups, an incoming io may |
| 574 | straddle a device or internal zone and needs to be chopped up into |
| 575 | smaller pieces for service. This may indicate a performance problem due |
| 576 | to a bad setup of that raid/dm device, but may also just be part of |
| 577 | normal boundary conditions. dm is notably bad at this and will clone |
| 578 | lots of io. |
| 579 | |
| 580 | \item[A -- remap] For stacked devices, incoming io is remapped to device |
| 581 | below it in the io stack. The remap action details what exactly is |
| 582 | being remapped to what. |
| 583 | |
| 584 | \end{description} |
| 585 | |
| 586 | \subsection{\label{sec:blkparse-format}Output Description and Formatting} |
| 587 | |
| 588 | The output from blkparse can be tailored for specific use - in particular, |
| 589 | to ease parsing of output, and/or limit output fields to those the user |
| 590 | wants to see. The data for fields which can be output include: |
| 591 | |
| 592 | \smallskip |
| 593 | \begin{tabular}{|l|l|}\hline |
| 594 | Field & Description \\ |
| 595 | Specifier & \\ \hline\hline |
| 596 | \emph{a} & Action, a (small) string (1 or 2 characters) -- see table below for more details \\ \hline |
| 597 | \emph{c} & CPU id \\ \hline |
| 598 | \emph{C} & Command \\ \hline |
| 599 | \emph{d} & RWBS field, a (small) string (1-3 characters) -- see section below for more details \\ \hline |
| 600 | \emph{D} & 7-character string containing the major and minor numbers of |
| 601 | the event's device \\ |
| 602 | & (separated by a comma). \\ \hline |
| 603 | \emph{e} & Error value \\ \hline |
| 604 | \emph{m} & Minor number of event's device. \\ \hline |
| 605 | \emph{M} & Major number of event's device. \\ \hline |
| 606 | \emph{n} & Number of blocks \\ \hline |
| 607 | \emph{N} & Number of bytes \\ \hline |
| 608 | \emph{p} & Process ID \\ \hline |
| 609 | \emph{P} & Display packet data -- series of hexadecimal values\\ \hline |
| 610 | \emph{s} & Sequence numbers \\ \hline |
| 611 | \emph{S} & Sector number \\ \hline |
| 612 | \emph{t} & Time stamp (nanoseconds) \\ \hline |
| 613 | \emph{T} & Time stamp (seconds) \\ \hline |
| 614 | \emph{u} & Elapsed value in microseconds (\emph{-t} command line option) \\ \hline |
| 615 | \emph{U} & Payload unsigned integer \\ \hline |
| 616 | \emph{z} & Absolute time stamp \\ \hline |
| 617 | \end{tabular} |
| 618 | |
| 619 | Note that the user can optionally specify field display width, and |
| 620 | optionally a left-aligned specifier. These precede field specifiers, |
| 621 | with a '\%' character, followed by the optional left-alignment specifer |
| 622 | (-) followed by the width (a decimal number) and then the field. |
| 623 | |
| 624 | Thus, to specify the command in a 12-character field that is left aligned: |
| 625 | |
| 626 | \begin{verbatim} |
| 627 | -f "%-12C" |
| 628 | \end{verbatim} |
| 629 | |
| 630 | \newpage |
| 631 | \subsubsection{\label{sec:act-table}Action Table} |
| 632 | The following table shows the various actions which may be output. |
| 633 | |
| 634 | \begin{tabular}{|l|l|}\hline |
| 635 | Act & Description \\ \hline\hline |
| 636 | A & IO was remapped to a different device \\ \hline |
| 637 | B & IO bounced \\ \hline |
| 638 | C & IO completion \\ \hline |
| 639 | D & IO issued to driver \\ \hline |
| 640 | F & IO front merged with request on queue \\ \hline |
| 641 | G & Get request \\ \hline |
| 642 | I & IO inserted onto request queue \\ \hline |
| 643 | M & IO back merged with request on queue \\ \hline |
| 644 | P & Plug request \\ \hline |
| 645 | Q & IO handled by request queue code \\ \hline |
| 646 | S & Sleep request \\ \hline |
| 647 | T & Unplug due to timeout \\ \hline |
| 648 | U & Unplug request \\ \hline |
| 649 | X & Split \\ \hline |
| 650 | \end{tabular} |
| 651 | |
| 652 | \subsubsection{\label{sec:act-table}RWBS Description} |
| 653 | This is a small string containing at least one character ('R' for read, |
| 654 | 'W' for write, or 'D' for block discard operation), and optionally either |
| 655 | a 'B' (for barrier operations) or 'S' (for synchronous operations). |
| 656 | |
| 657 | \subsubsection{\label{sec:default-output}Default output} |
| 658 | |
| 659 | The standard \emph{header} (or initial fields displayed) include: |
| 660 | |
| 661 | \begin{verbatim} |
| 662 | "%D %2c %8s %5T.%9t %5p %2a %3d " |
| 663 | \end{verbatim} |
| 664 | |
| 665 | Breaking this down: |
| 666 | |
| 667 | \begin{description} |
| 668 | \item[\%D] Displays the event's device major/minor as: \%3d,\%-3d. |
| 669 | \item[\%2c] CPU ID (2-character field). |
| 670 | \item[\%8s] Sequence number |
| 671 | \item[\%5T.\%9t] 5-charcter field for the seconds portion of the |
| 672 | time stamp and a 9-character field for the nanoseconds in the time stamp. |
| 673 | \item[\%5p] 5-character field for the process ID. |
| 674 | \item[\%2a] 2-character field for one of the actions. |
| 675 | \item[\%3d] 3-character field for the RWBS data. |
| 676 | \end{description} |
| 677 | |
| 678 | Seeing this in action: |
| 679 | |
| 680 | \begin{verbatim} |
| 681 | 8,0 3 1 0.000000000 697 G W 223490 + 8 [kjournald] |
| 682 | \end{verbatim} |
| 683 | |
| 684 | The header is the data in this line up to the 223490 (starting block). |
| 685 | |
| 686 | The default output for all event types includes this header. |
| 687 | |
| 688 | \paragraph{Default output per action} |
| 689 | |
| 690 | \begin{description} |
| 691 | \item[C -- complete] If a payload is present, this is presented between |
| 692 | parenthesis following the header, followed by the error value. |
| 693 | |
| 694 | If no payload is present, the sector and number of blocks are presented |
| 695 | (with an intervening plus (+) character). If the \emph{-t} option |
| 696 | was specified, then the elapsed time is presented. In either case, |
| 697 | it is followed by the error value for the completion. |
| 698 | |
| 699 | \item[D -- issued] |
| 700 | \item[I -- inserted] |
| 701 | \item[Q -- queued] |
| 702 | \item[B -- bounced] If a payload is present, the number of payload bytes |
| 703 | is output, followed by the payload in hexadecimal between parenthesis. |
| 704 | |
| 705 | If no payload is present, the sector and number of blocks are presented |
| 706 | (with an intervening plus (+) character). If the \emph{-t} option was |
| 707 | specified, then the elapsed time is presented (in parenthesis). In |
| 708 | either case, it is followed by the command associated with the event |
| 709 | (surrounded by square brackets). |
| 710 | |
| 711 | \item[M -- back merge] |
| 712 | \item[F -- front merge] |
| 713 | \item[G -- get request] |
| 714 | \item[S -- sleep] The starting sector and number of blocks is output |
| 715 | (with an intervening plus (+) character), followed by the command |
| 716 | associated with the event (surrounded by square brackets). |
| 717 | |
| 718 | \item[P -- plug] The command associated with the event (surrounded by |
| 719 | square brackets) is output. |
| 720 | |
| 721 | \item[U -- unplug] |
| 722 | \item[T -- unplug due to timer] The command associated with the event |
| 723 | (surrounded by square brackets) is output, followed by the number of |
| 724 | requests outstanding. |
| 725 | |
| 726 | \item[X -- split] The original starting sector followed by the new |
| 727 | sector (separated by a slash (/) is output, followed by the command |
| 728 | associated with the event (surrounded by square brackets). |
| 729 | |
| 730 | \item[A -- remap] Sector and length is output, along with the original |
| 731 | device and sector offset. |
| 732 | |
| 733 | \item[m -- message] The supplied message is appended to the end of |
| 734 | the standard header. |
| 735 | |
| 736 | \end{description} |
| 737 | |
| 738 | %------------------------------ |
| 739 | \newpage |
| 740 | \newpage\section*{\label{sec:blktrace-kg}Appendix: blktrace Kernel Guide} |
| 741 | |
| 742 | The blktrace facility provides an efficient event transfer mechanism which |
| 743 | supplies block IO layer state transition data via the relay |
| 744 | filesystem. This section provides some details as to the interfaces |
| 745 | blktrace utilizes in the kernel to effect this. It is good background data |
| 746 | to help understand some of the outputs and command-line options above. |
| 747 | |
| 748 | \subsection{blktrace.h Definitions} |
| 749 | Files which include $<linux/blktrace.h>$ are supplied with the following |
| 750 | definitions: |
| 751 | |
| 752 | \subsubsection{Trace Action Specifiers} |
| 753 | \begin{tabular}{|l|l|}\hline |
| 754 | BLK\_TA\_QUEUE & (RQ) Command queued to request\_queue. \\ |
| 755 | & (BIO) Command queued by elevator. \\ \hline |
| 756 | BLK\_TA\_BACKMERGE & Back merging elevator operation \\ \hline |
| 757 | BLK\_TA\_FRONTMERGE & Front merging elevator operation \\ \hline |
| 758 | BLK\_TA\_GETRQ & Free request retrieved. \\ \hline |
| 759 | BLK\_TA\_SLEEPRQ & No requests available, device unplugged. \\ \hline |
| 760 | BLK\_TA\_REQUEUE & Request requeued. \\ \hline |
| 761 | BLK\_TA\_ISSUE & Command set to driver for request\_queue. \\ \hline |
| 762 | BLK\_TA\_COMPLETE & Command completed by driver. \\ \hline |
| 763 | BLK\_TA\_PLUG & Device is plugged \\ \hline |
| 764 | BLK\_TA\_UNPLUG\_IO & Unplug device as IO is made available. \\ \hline |
| 765 | BLK\_TA\_UNPLUG\_TIMER & Unplug device after timer expired. \\ \hline |
| 766 | BLK\_TA\_INSERT & Insert request into queue. \\ \hline |
| 767 | BLK\_TA\_SPLIT & BIO split into 2 or more requests. \\ \hline |
| 768 | BLK\_TA\_BOUNCE & BIO was bounced \\ \hline |
| 769 | BLK\_TA\_REMAP & BIO was remapped \\ \hline |
| 770 | \end{tabular} |
| 771 | |
| 772 | %.......................................... |
| 773 | \subsection{blktrace.h Routines} |
| 774 | Files which include $<linux/blktrace.h>$ are supplied with the following |
| 775 | kernel routine invocable interfaces: |
| 776 | |
| 777 | \begin{description} |
| 778 | \item[blk\_add\_trace\_rq(struct request\_queue *q, struct request\_queue |
| 779 | *rq, u32 what)] |
| 780 | Adds a trace event describing the state change of the passed in |
| 781 | request\_queue. The \emph{what} parameter describes the change in |
| 782 | the request\_queue state, and is one of the request queue action |
| 783 | specifiers -- BLK\_TA\_QUEUE, BLK\_TA\_REQUEUE, BLK\_TA\_ISSUE, |
| 784 | or BLK\_TA\_COMPLETE. |
| 785 | |
| 786 | \item[blk\_add\_trace\_bio(struct request\_queue *q, struct bio *bio, |
| 787 | u32 what)] |
| 788 | Adds a trace event for the BIO passed in. The \emph{what} parameter |
| 789 | describes the action being performed on the BIO, and is one of |
| 790 | BLK\_TA\_BACKMERGE, BLK\_TA\_FRONTMERGE, or BLK\_TA\_QUEUE. |
| 791 | |
| 792 | \item[blk\_add\_trace\_generic(struct request\_queue *q, struct bio *bio, |
| 793 | int rw, u32 what)] |
| 794 | Adds a \emph{generic} trace event -- not one of the request queue |
| 795 | or BIO traces. The \emph{what} parameter describes the action being |
| 796 | performed on the BIO (if bio is non-NULL), and is one of |
| 797 | BLK\_TA\_PLUG, BLK\_TA\_GETRQ or BLK\_TA\_SLEEPRQ. |
| 798 | |
| 799 | \item[blk\_add\_trace\_pdu\_int(struct request\_queue *q, u32 what, |
| 800 | u32 pdu)] |
| 801 | Adds a trace with some payload data -- in this case, an unsigned |
| 802 | 32-bit entity (the \emph{pdu} parameter). The \emph{what} parameter |
| 803 | describes the nature of the payload, and is one of |
| 804 | BLK\_TA\_UNPLUG\_IO or BLK\_TA\_UNPLUG\_TIMER. |
| 805 | |
| 806 | \item[blk\_add\_trace\_remap(struct request\_queue *q, struct bio *bio, |
| 807 | dev\_t dev, sector\_t sector)] |
| 808 | Adds a trace with a remap event. \emph{dev} and \emph{sector} denote |
| 809 | the original device this \emph{bio} was mapped from. |
| 810 | |
| 811 | \item[blk\_add\_trace\_msg(struct request\_queue *q, char *fmt, ...)] |
| 812 | Adds a formatted message to the output stream. The total message |
| 813 | size can not exceed BLK\_TN\_MSG\_MSG characters (currently |
| 814 | 1024). Standard format conversions are supported (as supplied |
| 815 | by \texttt{vscnprintf}. |
| 816 | |
| 817 | \end{description} |
| 818 | \end{document} |