Commit | Line | Data |
---|---|---|
ff8a07c9 JA |
1 | \documentclass{article} |
2 | ||
3 | % | |
a2594911 | 4 | % Copyright (C) 2005, 2006 Alan D. Brunelle <Alan.Brunelle@hp.com> |
ff8a07c9 JA |
5 | % |
6 | % This program is free software; you can redistribute it and/or modify | |
7 | % it under the terms of the GNU General Public License as published by | |
8 | % the Free Software Foundation; either version 2 of the License, or | |
9 | % (at your option) any later version. | |
10 | % | |
11 | % This program is distributed in the hope that it will be useful, | |
12 | % but WITHOUT ANY WARRANTY; without even the implied warranty of | |
13 | % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
14 | % GNU General Public License for more details. | |
15 | % | |
16 | % You should have received a copy of the GNU General Public License | |
17 | % along with this program; if not, write to the Free Software | |
18 | % Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA | |
19 | % | |
20 | ||
21 | \title{blktrace User Guide} | |
758019ab | 22 | \author{blktrace: Jens Axboe (jens.axboe@oracle.com)\\ |
ff8a07c9 | 23 | User Guide: Alan D. Brunelle (Alan.Brunelle@hp.com)} |
ac00378e | 24 | \date{27 May 2008} |
ff8a07c9 JA |
25 | |
26 | \begin{document} | |
27 | \maketitle | |
28 | %--------------------- | |
29 | \section{\label{sec:intro}Introduction} | |
30 | ||
31 | blktrace is a block layer IO tracing mechanism which provides detailed | |
32 | information about request queue operations up to user space. There are | |
33 | three major components that are provided: | |
34 | ||
35 | \begin{description} | |
36 | \item[Kernel patch] A patch to the Linux kernel which includes the | |
37 | kernel event logging interfaces, and patches to areas within the block | |
c8255c5a JA |
38 | layer to emit event traces. If you run a 2.6.17-rc1 or newer kernel, |
39 | you don't need to patch blktrace support as it is already included. | |
ff8a07c9 JA |
40 | |
41 | \item[blktrace] A utility which transfers event traces from the kernel | |
42 | into either long-term on-disk storage, or provides direct formatted | |
43 | output (via blkparse). | |
44 | ||
45 | \item[blkparse] A utility which formats events stored in files, or when | |
46 | run in \emph{live} mode directly outputs data collected by blktrace. | |
47 | \end{description} | |
48 | ||
ff8a07c9 JA |
49 | \subsection{blktrace Download Area} |
50 | ||
51 | The blktrace and blkparse utilities and associated kernel patch are provided | |
52 | as part of the following git repository: | |
53 | ||
38e1f0c6 | 54 | git://git.kernel.org/pub/scm/linux/kernel/git/axboe/blktrace.git bt |
ff8a07c9 JA |
55 | |
56 | %-------------------------- | |
57 | \newpage\section{\label{sec:quick-start}Quick Start Guide} | |
58 | ||
59 | The following sections outline some quick steps towards utilizing | |
60 | blktrace. Some of the specific instructions below may need to be tailored | |
61 | to your environment. | |
62 | ||
63 | \subsection{\label{sec:get-blktrace}Retrieving blktrace} | |
64 | ||
65 | As noted above, the kernel patch along with the blktrace and blkparse utilities are stored in a git repository. One simple way to get going would be: | |
66 | ||
67 | \begin{verbatim} | |
38e1f0c6 | 68 | % git clone git://git.kernel.org/pub/scm/linux/kernel/git/axboe/blktrace.git bt |
ff8a07c9 | 69 | % cd bt |
7b464f9f | 70 | % git checkout |
ff8a07c9 JA |
71 | \end{verbatim} |
72 | ||
73 | \subsection{\label{sec:patching}Patching and configuring the Linux kernel} | |
74 | ||
75 | A patch for a \emph{specific Linux kernel} is provided in bt/kernel (where | |
76 | \emph{bt} is the name of the directory from the above git sequence). The | |
77 | detailed actual patching instructions for a Linux kernel is outside the | |
78 | scope of this document, but the following may be used as a sample template. | |
c8255c5a | 79 | Note that you may skip this step, if you kernel is at least 2.6.17-rc1. |
ff8a07c9 JA |
80 | |
81 | As an example, bt/kernel contains blk-trace-2.6.14-rc1-git-G2, download | |
82 | linux-2.6.13.tar.bz2 and patch-2.6.14-rc1.bz2 | |
83 | ||
84 | \begin{verbatim} | |
85 | % tar xjf linux-2.6.13.tar.bz2 | |
86 | % mv linux-2.6.13 linux-2.6.14-rc1 | |
87 | % cd linux-2.6.14-rc1 | |
88 | % bunzip2 -c ../patch-2.6.14-rc1.bz2 | patch -p1 | |
89 | \end{verbatim} | |
90 | ||
91 | At this point you may (optionally) remove linux-2.6.13.tar.bz2 and | |
92 | patch-2.6.14-rc1.bz2. | |
93 | ||
94 | At this point you should configure the Linux kernel for your specific | |
95 | system -- again, outside the scope of this document -- and then enable | |
96 | \emph{Support for tracing block io actions.} To do this, run | |
97 | ||
98 | \begin{verbatim} | |
99 | % make menuconfig or make xconfig, or edit .config, or ... | |
100 | \end{verbatim} | |
101 | ||
102 | and navigate through \emph{Device Drivers} and \emph{Block devices} | |
103 | and then down to \emph{Support for tracing block io actions} and hit Y. | |
104 | ||
105 | Install the new kernel (and modules\ldots) and reboot. | |
106 | ||
e33da1d5 | 107 | \subsection{\label{sec:mount}Mounting the debugfs file system} |
ff8a07c9 | 108 | |
e33da1d5 | 109 | blktrace utilizes files under the debug file system, and thus must have |
90ec4b15 VT |
110 | the mount point set up -- mounted on the directory /sys/kernel/debug. |
111 | To do this one may do either of the following: | |
ff8a07c9 JA |
112 | |
113 | \begin{enumerate} | |
114 | \item Manually mount after each boot: | |
115 | \begin{verbatim} | |
90ec4b15 | 116 | % mount -t debugfs debugfs /sys/kernel/debug |
ff8a07c9 JA |
117 | \end{verbatim} |
118 | ||
119 | \item Add an entry into /etc/fstab, and have it done automatically at | |
120 | each boot\footnote{Note: after adding the entry to /etc/fstab, you | |
90ec4b15 | 121 | could then mount the directory this time only by doing: \% mount debug}: |
ff8a07c9 | 122 | \begin{verbatim} |
90ec4b15 | 123 | debug /sys/kernel/debug debugfs default 0 0 |
ff8a07c9 JA |
124 | \end{verbatim} |
125 | \end{enumerate} | |
126 | ||
127 | \subsection{\label{sec:build}Build the tools} | |
128 | ||
129 | To build and install the tools, execute the following sequence (as root): | |
130 | ||
131 | \begin{verbatim} | |
132 | % cd bt | |
133 | % make && make install | |
134 | \end{verbatim} | |
135 | ||
136 | \subsection{\label{sec:live-blktrace}blktrace -- live} | |
137 | ||
138 | Now to simply watch what is going on for a specific disk (to stop the | |
139 | trace, hit control-C): | |
140 | ||
141 | \begin{verbatim} | |
142 | % blktrace -d /dev/sda -o - | blkparse -i - | |
143 | 8,0 3 1 0.000000000 697 G W 223490 + 8 [kjournald] | |
144 | 8,0 3 2 0.000001829 697 P R [kjournald] | |
145 | 8,0 3 3 0.000002197 697 Q W 223490 + 8 [kjournald] | |
146 | 8,0 3 4 0.000005533 697 M W 223498 + 8 [kjournald] | |
147 | 8,0 3 5 0.000008607 697 M W 223506 + 8 [kjournald] | |
148 | 8,0 3 6 0.000011569 697 M W 223514 + 8 [kjournald] | |
149 | 8,0 3 7 0.000014407 697 M W 223522 + 8 [kjournald] | |
150 | 8,0 3 8 0.000017367 697 M W 223530 + 8 [kjournald] | |
151 | 8,0 3 9 0.000020161 697 M W 223538 + 8 [kjournald] | |
152 | 8,0 3 10 0.000024062 697 D W 223490 + 56 [kjournald] | |
153 | 8,0 1 11 0.009507758 0 C W 223490 + 56 [0] | |
154 | 8,0 1 12 0.009538995 697 G W 223546 + 8 [kjournald] | |
155 | 8,0 1 13 0.009540033 697 P R [kjournald] | |
156 | 8,0 1 14 0.009540313 697 Q W 223546 + 8 [kjournald] | |
157 | 8,0 1 15 0.009542980 697 D W 223546 + 8 [kjournald] | |
158 | 8,0 1 16 0.013542170 0 C W 223546 + 8 [0] | |
159 | ... | |
160 | ^C | |
161 | ... | |
162 | CPU1 (8,0): | |
163 | Reads Queued: 0, 0KiB Writes Queued: 7, 128KiB | |
164 | Read Dispatches: 0, 0KiB Write Dispatches: 7, 128KiB | |
165 | Reads Completed: 0, 0KiB Writes Completed: 11, 168KiB | |
166 | Read Merges: 0 Write Merges: 25 | |
167 | IO unplugs: 0 Timer unplugs: 0 | |
168 | ... | |
169 | CPU3 (8,0): | |
170 | Reads Queued: 0, 0KiB Writes Queued: 1, 28KiB | |
171 | Read Dispatches: 0, 0KiB Write Dispatches: 1, 28KiB | |
172 | Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB | |
173 | Read Merges: 0 Write Merges: 6 | |
174 | IO unplugs: 0 Timer unplugs: 0 | |
175 | ||
176 | Total (8,0): | |
177 | Reads Queued: 0, 0KiB Writes Queued: 11, 168KiB | |
178 | Read Dispatches: 0, 0KiB Write Dispatches: 11, 168KiB | |
179 | Reads Completed: 0, 0KiB Writes Completed: 11, 168KiB | |
180 | Read Merges: 0 Write Merges: 31 | |
181 | IO unplugs: 0 Timer unplugs: 3 | |
182 | ||
183 | Events (8,0): 89 entries, 0 skips | |
184 | \end{verbatim} | |
185 | ||
b6dc7d6e JA |
186 | A \emph{btrace} script is included in the distribution to ease live |
187 | tracing of devices. The above could also be accomplished by issuing: | |
188 | ||
189 | \begin{verbatim} | |
190 | % btrace /dev/sda | |
191 | \end{verbatim} | |
192 | ||
193 | By default, \emph{btrace} runs the trace in quiet mode so it will not | |
194 | include statistics when you break the run. Add the \emph{-S} option to | |
195 | get that dumped as well. | |
196 | ||
197 | \subsection{\label{sec:pc-blktrace}blktrace -- SCSI commands} | |
198 | ||
199 | The previous section showed typical file system io actions, but blktrace | |
200 | can also show SCSI commands going in and out of the queue as submitted | |
201 | by applications using the SCSI Generic (\emph{sg}) interface. | |
202 | ||
203 | \begin{verbatim} | |
204 | % btrace /dev/cdrom | |
205 | [...] | |
206 | 3,0 0 25 0.004884107 13528 G R 0 + 0 [inquiry] | |
207 | 3,0 0 26 0.004890361 13528 I R 56 (12 00 00 00 38 ..) [inquiry] | |
208 | 3,0 0 27 0.004891223 13528 P R [inquiry] | |
209 | 3,0 0 28 0.004893250 13528 D R 56 (12 00 00 00 38 ..) [inquiry] | |
210 | 3,0 0 29 0.005344910 0 C R (12 00 00 00 38 ..) [0] | |
211 | \end{verbatim} | |
212 | ||
213 | Here we see a program issuing an INQUIRY command to the CDROM device. | |
214 | The program requested a read of 56 bytes of data, the CDB is included | |
215 | in parenthesis after the data length. The completion event shows shows | |
216 | that the command completed successfully. Tracing SCSI commands can be | |
217 | very useful for debugging problems with programs talking directly to the | |
218 | device. An example of that would be \emph{cdrecord} burning. | |
219 | ||
ff8a07c9 JA |
220 | \subsection{\label{sec:blktrace-post}blktrace -- post-processing} |
221 | ||
222 | Another way to run blktrace is to have blktrace save data away for later | |
223 | formatting by blkparse. This would be useful if you want to get | |
224 | measurements while running specific loads. | |
225 | ||
226 | To do this, one would specify the device (or devices) to be watched. Then | |
227 | go run you test cases. Stop the trace, and at your leisure utilize | |
228 | blkparse to see the results. | |
229 | ||
230 | In this example, devices /dev/sdaa, /dev/sdc and /dev/sdo are used in an | |
231 | LVM volume called adb3/vol. | |
232 | ||
233 | \begin{verbatim} | |
234 | % blktrace /dev/sdaa /dev/sdc /dev/sdo & | |
235 | [1] 9713 | |
236 | % | |
237 | % mkfs -t ext3 /dev/adb3/vol | |
238 | mke2fs 1.35 (28-Feb-2004) | |
239 | Filesystem label= | |
240 | OS type: Linux | |
241 | Block size=4096 (log=2) | |
242 | Fragment size=4096 (log=2) | |
243 | 16793600 inodes, 33555456 blocks | |
244 | 1677772 blocks (5.00%) reserved for the super user | |
245 | First data block=0 | |
246 | Maximum filesystem blocks=4294967296 | |
247 | 1025 block groups | |
248 | 32768 blocks per group, 32768 fragments per group | |
249 | 16384 inodes per group | |
250 | Superblock backups stored on blocks: | |
251 | 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, | |
252 | 4096000, 7962624, 11239424, 20480000, 23887872 | |
253 | ||
254 | Writing inode tables: done | |
255 | Creating journal (8192 blocks): done | |
256 | Writing superblocks and filesystem accounting information: done | |
257 | ||
258 | This filesystem will be automatically checked every 27 mounts or | |
259 | 180 days, whichever comes first. Use tune2fs -c or -i to override. | |
260 | % | |
261 | % kill -15 9713 | |
262 | \end{verbatim} | |
263 | ||
264 | Then you could process the events later: | |
265 | ||
266 | \begin{verbatim} | |
267 | % | |
268 | % blkparse sdaa sdc sdo > events | |
269 | % less events | |
270 | 8,32 1 1 0.000000000 9728 G R 384 + 32 [mkfs.ext3] | |
271 | 8,32 1 2 0.000001959 9728 P R [mkfs.ext3] | |
272 | 8,32 1 3 0.000002446 9728 Q R 384 + 32 [mkfs.ext3] | |
273 | 8,32 1 4 0.000005110 9728 D R 384 + 32 [mkfs.ext3] | |
274 | 8,32 3 5 0.000200570 0 C R 384 + 32 [0] | |
275 | 8,224 3 1 0.021658989 9728 G R 384 + 32 [mkfs.ext3] | |
276 | ... | |
277 | 65,160 3 163392 41.117070504 0 C W 87469088 + 1376 [0] | |
278 | 8,32 3 163374 41.122683668 0 C W 88168160 + 1376 [0] | |
279 | 65,160 3 163393 41.129952433 0 C W 87905984 + 1376 [0] | |
280 | 65,160 3 163394 41.130049431 0 D W 89129344 + 1376 [swapper] | |
281 | 65,160 3 163395 41.130067135 0 D W 89216704 + 1376 [swapper] | |
282 | 65,160 3 163396 41.130083785 0 D W 89304096 + 1376 [swapper] | |
283 | 65,160 3 163397 41.130099455 0 D W 89391488 + 1376 [swapper] | |
284 | 65,160 3 163398 41.130114732 0 D W 89478848 + 1376 [swapper] | |
285 | 65,160 3 163399 41.130128885 0 D W 89481536 + 64 [swapper] | |
286 | 8,32 3 163375 41.134758196 0 C W 86333152 + 1376 [0] | |
287 | 65,160 3 163400 41.142229726 0 C W 89129344 + 1376 [0] | |
288 | 65,160 3 163401 41.144952314 0 C W 89481536 + 64 [0] | |
289 | 8,32 3 163376 41.147441930 0 C W 88342912 + 1376 [0] | |
290 | 65,160 3 163402 41.155869604 0 C W 89478848 + 1376 [0] | |
291 | 8,32 3 163377 41.159466082 0 C W 86245760 + 1376 [0] | |
292 | 65,160 3 163403 41.166944976 0 C W 89216704 + 1376 [0] | |
293 | 65,160 3 163404 41.178968252 0 C W 89304096 + 1376 [0] | |
294 | 65,160 3 163405 41.191860173 0 C W 89391488 + 1376 [0] | |
295 | ... | |
296 | Events (sdo): 0 entries, 0 skips | |
297 | ||
298 | CPU0 (65,160): | |
299 | Reads Queued: 0, 0KiB Writes Queued: 9, 5,520KiB | |
300 | Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB | |
301 | Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB | |
302 | Read Merges: 0 Write Merges: 336 | |
303 | IO unplugs: 0 Timer unplugs: 0 | |
304 | CPU1 (65,160): | |
305 | Reads Queued: 2,411, 38,576KiB Writes Queued: 769, 425,408KiB | |
306 | Read Dispatches: 2,407, 38,512KiB Write Dispatches: 118, 61,680KiB | |
307 | Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB | |
308 | Read Merges: 0 Write Merges: 25,819 | |
309 | IO unplugs: 0 Timer unplugs: 4 | |
310 | CPU2 (65,160): | |
311 | Reads Queued: 2, 32KiB Writes Queued: 18, 10,528KiB | |
312 | Read Dispatches: 2, 32KiB Write Dispatches: 3, 1,344KiB | |
313 | Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB | |
314 | Read Merges: 0 Write Merges: 640 | |
315 | IO unplugs: 0 Timer unplugs: 0 | |
316 | CPU3 (65,160): | |
317 | Reads Queued: 20,572, 329,152KiB Writes Queued: 594, 279,712KiB | |
318 | Read Dispatches: 20,576, 329,216KiB Write Dispatches: 1,474, 740,720KiB | |
319 | Reads Completed: 22,985, 367,760KiB Writes Completed: 1,390, 721,168KiB | |
320 | Read Merges: 0 Write Merges: 16,888 | |
321 | IO unplugs: 0 Timer unplugs: 0 | |
322 | ||
323 | Total (65,160): | |
324 | Reads Queued: 22,985, 367,760KiB Writes Queued: 1,390, 721,168KiB | |
325 | Read Dispatches: 22,985, 367,760KiB Write Dispatches: 1,595, 803,744KiB | |
326 | Reads Completed: 22,985, 367,760KiB Writes Completed: 1,390, 721,168KiB | |
327 | Read Merges: 0 Write Merges: 43,683 | |
328 | IO unplugs: 0 Timer unplugs: 4 | |
329 | ... | |
330 | \end{verbatim} | |
331 | ||
332 | %---------------------------- | |
333 | \newpage\section{\label{sec:blktrace-ug}blktrace User Guide} | |
334 | ||
335 | The \emph{blktrace} utility extracts event traces from the kernel (via | |
e33da1d5 JA |
336 | the relaying through the debug file system). Some background details |
337 | concerning the run-time behaviour of blktrace will help to understand some | |
338 | of the more arcane command line options: | |
ff8a07c9 JA |
339 | |
340 | \begin{itemize} | |
341 | \item blktrace receives data from the kernel in buffers passed up | |
e33da1d5 JA |
342 | through the debug file system (relay). Each device being traced has |
343 | a file created in the mounted directory for the debugfs, which defaults | |
90ec4b15 VT |
344 | to \emph{/sys/kernel/debug} -- this can be overridden with the \emph{-r} |
345 | command line argument. | |
ff8a07c9 JA |
346 | |
347 | \item blktrace defaults to collecting \emph{all} events that can be | |
348 | traced. To limit the events being captured, you can specify one or | |
349 | more filter masks via the \emph{-a} option. | |
350 | ||
351 | Alternatively, one may specify the entire mask utilizing a hexadecimal | |
352 | value that is version-specific. (Requires understanding of the internal | |
353 | representation of the filter mask.) | |
354 | ||
355 | \item As noted above, the events are passed up via a series of buffers | |
e33da1d5 | 356 | stored into debugfs files. The size and number of buffers can be |
ff8a07c9 JA |
357 | specified via the \emph{-b} and \emph{-n} arguments respectively. |
358 | ||
359 | \item blktrace stores the extracted data into files stored in the | |
360 | \emph{local} directory. The format of the file names is (by default) | |
361 | \emph{device}.blktrace.\emph{cpu}, where \emph{device} is the base | |
362 | device name (e.g, if we are tracing /dev/sda, the base device name would | |
363 | be \emph{sda}); and \emph{cpu} identifies a CPU for the event stream. | |
364 | ||
365 | The \emph{device} portion of the event file name can be changed via | |
366 | the \emph{-o} option. | |
367 | ||
368 | \item blktrace may also be run concurrently with blkparse to produce | |
369 | \emph{live} output -- to do this specify \emph{-o -} for blktrace. | |
370 | ||
371 | \item The default behaviour for blktrace is to run forever until explicitly killed by the user (via a control-C, or \emph{kill} utility invocation). There are two ways to modify this: | |
372 | ||
373 | \begin{enumerate} | |
374 | \item You may utilize the blktrace utility itself to \emph{kill} | |
375 | a running trace -- via the \emph{-k} option. | |
376 | ||
377 | \item You can specify a run-time duration for blktrace via the | |
378 | \emph{-w} option -- then blktrace will run for the specified number | |
379 | of seconds, and then halt. | |
380 | \end{enumerate} | |
381 | \end{itemize} | |
382 | ||
383 | \subsection{\label{sec:blktrace-args}Command line arguments} | |
384 | \begin{tabular}{|l|l|l|}\hline | |
385 | Short & Long & Description \\ \hline\hline | |
386 | -A \emph{hex-mask} & --set-mask=\emph{hex-mask} & Set filter mask to \emph{hex-mask} \\ \hline | |
387 | -a \emph{mask} & --act-mask=\emph{mask} & Add \emph{mask} to current filter (see below for masks) \\ \hline | |
388 | -b \emph{size} & --buffer-size=\emph{size} & Specifies buffer size for event extraction (scaled by $2^{10}$) \\ \hline | |
389 | -d \emph{dev} & --dev=\emph{dev} & Adds \emph{dev} as a device to trace \\ \hline | |
390 | -k & --kill & Kill on-going trace \\ \hline | |
391 | -n \emph{num-sub} & --num-sub=\emph{num-sub} & Specifies number of buffers to use \\ \hline | |
ce2151eb AB |
392 | -o \emph{file} & --output=\emph{file} & Prepend \emph{file} to output file name(s) \\ |
393 | & & \textbf{This only works when using a single device} \\ | |
394 | & & \textbf{or when piping the output via \texttt{-o -}} \\ | |
395 | & & \textbf{with multiple devices.} \\ \hline | |
e33da1d5 | 396 | -r \emph{rel-path} & --relay=\emph{rel-path} & Specifies debugfs mount point \\ \hline |
57ea8602 | 397 | -V & --version & Outputs version \\ \hline |
ff8a07c9 | 398 | -w \emph{seconds} & --stopwatch=\emph{seconds} & Sets run time to the number of seconds specified \\ \hline |
cf1edb17 AB |
399 | -I \emph{devs file}& --input-devs=\emph{devs file}& Adds devices found in \emph{devs file} to list of devices to trace. \\ |
400 | & & (One device per line.) \\ \hline | |
ff8a07c9 JA |
401 | \end{tabular} |
402 | ||
403 | \subsubsection{\label{sec:filter-mask}Filter Masks} | |
404 | The following masks may be passed with the \emph{-a} command line | |
405 | option, multiple filters may be combined via multiple \emph{-a} command | |
406 | line options.\smallskip | |
407 | ||
408 | \begin{tabular}{|l|l|}\hline | |
409 | barrier & \emph{barrier} attribute \\ \hline | |
410 | complete & \emph{completed} by driver \\ \hline | |
411 | fs & \emph{FS} requests \\ \hline | |
412 | issue & \emph{issued} to driver \\ \hline | |
413 | pc & \emph{packet command} events \\ \hline | |
414 | queue & \emph{queue} operations \\ \hline | |
415 | read & \emph{read} traces \\ \hline | |
416 | requeue & \emph{requeue} operations \\ \hline | |
417 | sync & \emph{synchronous} attribute \\ \hline | |
418 | write & \emph{write} traces \\ \hline | |
cfa089d4 | 419 | notify & \emph{notify} trace messages \\ \hline |
ff8a07c9 JA |
420 | \end{tabular} |
421 | ||
e8ae3d09 JA |
422 | \subsubsection{\label{sec:request-types}Request types} |
423 | blktrace disguingishes between two types of block layer requests, | |
424 | file system and scsi commands. The former are dubbed \emph{fs} | |
425 | requests, the latter \emph{pc} requests. File system requests are | |
426 | normal read/write operations, ie any type of read or write from a | |
427 | specific disk location at a given size. These requests typically | |
428 | originate from a user process, but they may also be initiated by | |
429 | the vm flushing dirty data to disk or the file system syncing | |
430 | a super or journal block to disk. \emph{pc} requests are SCSI | |
431 | commands. blktrace sends the command data block as a payload | |
432 | so that blkparse can decode it. | |
433 | ||
ff8a07c9 JA |
434 | %---------------------------- |
435 | \newpage\section{\label{sec:blkparse-ug}blkparse User Guide} | |
436 | ||
437 | The \emph{blkparse} utility will attempt to combine streams of events | |
438 | for various devices on various CPUs, and produce a formatted output of | |
439 | the event information. As with blktrace, some details concerning blkparse | |
440 | will help in understanding the command line options presented below. | |
441 | ||
442 | \begin{itemize} | |
443 | \item By default, blkparse expects to run in a post-processing mode | |
444 | -- one where the trace events have been saved by a previous run | |
445 | of blktrace, and blkparse is combining event streams and dumping | |
446 | formatted data. | |
447 | ||
448 | blkparse \emph{may} be run in a \emph{live} manner concurrently with | |
449 | blktrace by specifying \emph{-i -} to blkparse, and combining it with | |
450 | the live option for blktrace. An example would be: | |
451 | ||
452 | \begin{verbatim} | |
453 | % blktrace -d /dev/sda -o - | blkparse -i - | |
454 | \end{verbatim} | |
455 | ||
456 | \item You can set how many blkparse batches event reads via the | |
457 | \emph{-b} option, the default is to handle events in batches of 512. | |
458 | ||
459 | \item If you have saved event traces in blktrace with different output | |
460 | names (via the \emph{-o} option to blktrace), you must specify the | |
461 | same \emph{input} name via the \emph{-i} option. | |
462 | ||
463 | \item The format of the output data can be controlled via the \emph{-f} | |
464 | or \emph{-F} options -- see section~\ref{sec:blkparse-format} for details. | |
465 | ||
cfa8ef19 AB |
466 | By default, blkparse sends formatted data to standard output. This may |
467 | be changed via the \emph{-o} option, or text output can be disabled | |
468 | via the\emph{-O} option. A merged binary stream can be produced using | |
469 | the \emph{-d} option. | |
ff8a07c9 JA |
470 | |
471 | \end{itemize} | |
472 | ||
473 | \newpage\subsection{\label{sec:blkparse-args}Command line arguments} | |
474 | \begin{tabular}{|l|l|l|}\hline | |
475 | Short & Long & Description \\ \hline\hline | |
476 | -b \emph{batch} & --batch={batch} & Standard input read batching \\ \hline | |
477 | ||
478 | -i \emph{file} & --input=\emph{file} & Specifies base name for input files -- default is \emph{device}.blktrace.\emph{cpu}. \\ | |
479 | & & As noted above, specifying \emph{-i -} runs in \emph{live} mode with blktrace \\ | |
480 | & & (reading data from standard in). \\ \hline | |
481 | ||
482 | -F \emph{typ,fmt} & --format=\emph{typ,fmt} & Sets output format \\ | |
483 | -f \emph{fmt} & --format-spec=\emph{fmt} & (See section~\ref{sec:blkparse-format} for details.) \\ | |
484 | & & \\ | |
485 | & & The -f form specifies a format for all events \\ | |
486 | & & \\ | |
487 | & & The -F form allows one to specify a format for a specific \\ | |
488 | & & event type. The single-character \emph{typ} field is one of the \\ | |
489 | & & action specifiers in section~\ref{sec:act-table} \\ \hline | |
490 | ||
491 | ||
492 | -m & --missing & Print missing entries\\ \hline | |
493 | ||
d915dee6 | 494 | -h & --hash-by-name & Hash processes by name, not by PID\\ \hline |
ff8a07c9 JA |
495 | |
496 | -o \emph{file} & --output=\emph{file} & Output file \\ \hline | |
cfa8ef19 | 497 | -O & --no-text-output & Do \emph{not} produce text output, used for binary (-d) only \\ \hline |
ff8a07c9 | 498 | |
a2594911 AB |
499 | -d \emph{file} & --dump-binary=\emph{file} & Binary output file \\ \hline |
500 | ||
ff8a07c9 JA |
501 | -q & --quiet & Quite mode \\ \hline |
502 | ||
503 | -s & --per-program-stats & Displays data sorted by program \\ \hline | |
504 | ||
505 | -t & --track-ios & Display time deltas per IO \\ \hline | |
506 | ||
507 | -w \emph{span} & --stopwatch=\emph{span} & Display traces for the \emph{span} specified -- where span can be: \\ | |
508 | & & \emph{end-time} -- Display traces from time 0 through \emph{end-time} (in ns) \\ | |
509 | & & or \\ | |
510 | & & \emph{start:end-time} -- Display traces from time \emph{start} \\ | |
511 | & & through {end-time} (in ns). \\ \hline | |
512 | ||
19cfaf3f | 513 | -M & --no-msgs & Do not add messages to binary output file \\\hline |
57ea8602 JA |
514 | -v & --verbose & More verbose marginal on marginal errors \\ \hline |
515 | -V & --version & Display version \\ \hline | |
ff8a07c9 JA |
516 | |
517 | \end{tabular} | |
518 | ||
519 | \newpage | |
208660b1 JA |
520 | \subsection{\label{sec:blkparse-actions}Trace actions} |
521 | ||
522 | \begin{description} | |
523 | \item[C -- complete] A previously issued request has been completed. | |
524 | The output will detail the sector and size of that request, as well | |
525 | as the success or failure of it. | |
526 | ||
527 | \item[D -- issued] A request that previously resided on the block layer | |
528 | queue or in the io scheduler has been sent to the driver. | |
529 | ||
530 | \item[I -- inserted] A request is being sent to the io scheduler for | |
531 | addition to the internal queue and later service by the driver. The | |
532 | request is fully formed at this time. | |
533 | ||
534 | \item[Q -- queued] This notes intent to queue io at the given location. | |
535 | No real requests exists yet. | |
536 | ||
bd5e71cf | 537 | \item[B -- bounced] The data pages attached to this \emph{bio} are |
208660b1 JA |
538 | not reachable by the hardware and must be bounced to a lower memory |
539 | location. This causes a big slowdown in io performance, since the data | |
540 | must be copied to/from kernel buffers. Usually this can be fixed with | |
541 | using better hardware - either a better io controller, or a platform | |
542 | with an IOMMU. | |
543 | ||
ac00378e AB |
544 | \item[m -- message] Text message generated via kernel call to |
545 | \texttt{blk\_add\_trace\_msg}. | |
546 | ||
bd5e71cf | 547 | \item[M -- back merge] A previously inserted request exists that ends |
208660b1 JA |
548 | on the boundary of where this io begins, so the io scheduler can merge |
549 | them together. | |
550 | ||
551 | \item[F -- front merge] Same as the back merge, except this io ends | |
552 | where a previously inserted requests starts. | |
553 | ||
208660b1 JA |
554 | \item[G -- get request] To send any type of request to a block device, |
555 | a \emph{struct request} container must be allocated first. | |
556 | ||
557 | \item[S -- sleep] No available request structures were available, so | |
558 | the issuer has to wait for one to be freed. | |
559 | ||
560 | \item[P -- plug] When io is queued to a previously empty block device | |
561 | queue, Linux will plug the queue in anticipation of future ios being | |
562 | added before this data is needed. | |
563 | ||
564 | \item[U -- unplug] Some request data already queued in the device, | |
565 | start sending requests to the driver. This may happen automatically | |
566 | if a timeout period has passed (see next entry) or if a number of | |
567 | requests have been added to the queue. | |
568 | ||
569 | \item[T -- unplug due to timer] If nobody requests the io that was queued | |
570 | after plugging the queue, Linux will automatically unplug it after a | |
571 | defined period has passed. | |
572 | ||
573 | \item[X -- split] On raid or device mapper setups, an incoming io may | |
574 | straddle a device or internal zone and needs to be chopped up into | |
575 | smaller pieces for service. This may indicate a performance problem due | |
576 | to a bad setup of that raid/dm device, but may also just be part of | |
577 | normal boundary conditions. dm is notably bad at this and will clone | |
578 | lots of io. | |
579 | ||
a8f30e64 JA |
580 | \item[A -- remap] For stacked devices, incoming io is remapped to device |
581 | below it in the io stack. The remap action details what exactly is | |
582 | being remapped to what. | |
583 | ||
208660b1 JA |
584 | \end{description} |
585 | ||
ff8a07c9 JA |
586 | \subsection{\label{sec:blkparse-format}Output Description and Formatting} |
587 | ||
588 | The output from blkparse can be tailored for specific use - in particular, | |
589 | to ease parsing of output, and/or limit output fields to those the user | |
590 | wants to see. The data for fields which can be output include: | |
591 | ||
592 | \smallskip | |
593 | \begin{tabular}{|l|l|}\hline | |
594 | Field & Description \\ | |
595 | Specifier & \\ \hline\hline | |
596 | \emph{a} & Action, a (small) string (1 or 2 characters) -- see table below for more details \\ \hline | |
597 | \emph{c} & CPU id \\ \hline | |
598 | \emph{C} & Command \\ \hline | |
599 | \emph{d} & RWBS field, a (small) string (1-3 characters) -- see section below for more details \\ \hline | |
600 | \emph{D} & 7-character string containing the major and minor numbers of | |
601 | the event's device \\ | |
602 | & (separated by a comma). \\ \hline | |
603 | \emph{e} & Error value \\ \hline | |
604 | \emph{m} & Minor number of event's device. \\ \hline | |
605 | \emph{M} & Major number of event's device. \\ \hline | |
606 | \emph{n} & Number of blocks \\ \hline | |
607 | \emph{N} & Number of bytes \\ \hline | |
608 | \emph{p} & Process ID \\ \hline | |
609 | \emph{P} & Display packet data -- series of hexadecimal values\\ \hline | |
610 | \emph{s} & Sequence numbers \\ \hline | |
611 | \emph{S} & Sector number \\ \hline | |
612 | \emph{t} & Time stamp (nanoseconds) \\ \hline | |
613 | \emph{T} & Time stamp (seconds) \\ \hline | |
614 | \emph{u} & Elapsed value in microseconds (\emph{-t} command line option) \\ \hline | |
615 | \emph{U} & Payload unsigned integer \\ \hline | |
9c73da91 | 616 | \emph{z} & Absolute time stamp \\ \hline |
ff8a07c9 JA |
617 | \end{tabular} |
618 | ||
619 | Note that the user can optionally specify field display width, and | |
620 | optionally a left-aligned specifier. These precede field specifiers, | |
621 | with a '\%' character, followed by the optional left-alignment specifer | |
622 | (-) followed by the width (a decimal number) and then the field. | |
623 | ||
624 | Thus, to specify the command in a 12-character field that is left aligned: | |
625 | ||
626 | \begin{verbatim} | |
627 | -f "%-12C" | |
628 | \end{verbatim} | |
629 | ||
630 | \newpage | |
631 | \subsubsection{\label{sec:act-table}Action Table} | |
632 | The following table shows the various actions which may be output. | |
633 | ||
634 | \begin{tabular}{|l|l|}\hline | |
635 | Act & Description \\ \hline\hline | |
a8f30e64 | 636 | A & IO was remapped to a different device \\ \hline |
bd5e71cf | 637 | B & IO bounced \\ \hline |
ff8a07c9 JA |
638 | C & IO completion \\ \hline |
639 | D & IO issued to driver \\ \hline | |
640 | F & IO front merged with request on queue \\ \hline | |
641 | G & Get request \\ \hline | |
642 | I & IO inserted onto request queue \\ \hline | |
bd5e71cf | 643 | M & IO back merged with request on queue \\ \hline |
ff8a07c9 JA |
644 | P & Plug request \\ \hline |
645 | Q & IO handled by request queue code \\ \hline | |
646 | S & Sleep request \\ \hline | |
647 | T & Unplug due to timeout \\ \hline | |
648 | U & Unplug request \\ \hline | |
ff8a07c9 JA |
649 | X & Split \\ \hline |
650 | \end{tabular} | |
651 | ||
652 | \subsubsection{\label{sec:act-table}RWBS Description} | |
653 | This is a small string containing at least one character ('R' for read, | |
64c03161 DW |
654 | 'W' for write, or 'D' for block discard operation), and optionally either |
655 | a 'B' (for barrier operations) or 'S' (for synchronous operations). | |
ff8a07c9 JA |
656 | |
657 | \subsubsection{\label{sec:default-output}Default output} | |
658 | ||
659 | The standard \emph{header} (or initial fields displayed) include: | |
660 | ||
661 | \begin{verbatim} | |
662 | "%D %2c %8s %5T.%9t %5p %2a %3d " | |
663 | \end{verbatim} | |
664 | ||
665 | Breaking this down: | |
666 | ||
667 | \begin{description} | |
668 | \item[\%D] Displays the event's device major/minor as: \%3d,\%-3d. | |
669 | \item[\%2c] CPU ID (2-character field). | |
670 | \item[\%8s] Sequence number | |
671 | \item[\%5T.\%9t] 5-charcter field for the seconds portion of the | |
672 | time stamp and a 9-character field for the nanoseconds in the time stamp. | |
673 | \item[\%5p] 5-character field for the process ID. | |
674 | \item[\%2a] 2-character field for one of the actions. | |
675 | \item[\%3d] 3-character field for the RWBS data. | |
676 | \end{description} | |
677 | ||
678 | Seeing this in action: | |
679 | ||
680 | \begin{verbatim} | |
681 | 8,0 3 1 0.000000000 697 G W 223490 + 8 [kjournald] | |
682 | \end{verbatim} | |
683 | ||
684 | The header is the data in this line up to the 223490 (starting block). | |
685 | ||
686 | The default output for all event types includes this header. | |
687 | ||
688 | \paragraph{Default output per action} | |
689 | ||
690 | \begin{description} | |
691 | \item[C -- complete] If a payload is present, this is presented between | |
692 | parenthesis following the header, followed by the error value. | |
693 | ||
694 | If no payload is present, the sector and number of blocks are presented | |
695 | (with an intervening plus (+) character). If the \emph{-t} option | |
696 | was specified, then the elapsed time is presented. In either case, | |
697 | it is followed by the error value for the completion. | |
698 | ||
699 | \item[D -- issued] | |
700 | \item[I -- inserted] | |
701 | \item[Q -- queued] | |
bd5e71cf | 702 | \item[B -- bounced] If a payload is present, the number of payload bytes |
ff8a07c9 JA |
703 | is output, followed by the payload in hexadecimal between parenthesis. |
704 | ||
705 | If no payload is present, the sector and number of blocks are presented | |
706 | (with an intervening plus (+) character). If the \emph{-t} option was | |
707 | specified, then the elapsed time is presented (in parenthesis). In | |
708 | either case, it is followed by the command associated with the event | |
709 | (surrounded by square brackets). | |
710 | ||
bd5e71cf | 711 | \item[M -- back merge] |
ff8a07c9 JA |
712 | \item[F -- front merge] |
713 | \item[G -- get request] | |
ff8a07c9 JA |
714 | \item[S -- sleep] The starting sector and number of blocks is output |
715 | (with an intervening plus (+) character), followed by the command | |
716 | associated with the event (surrounded by square brackets). | |
717 | ||
718 | \item[P -- plug] The command associated with the event (surrounded by | |
719 | square brackets) is output. | |
720 | ||
721 | \item[U -- unplug] | |
722 | \item[T -- unplug due to timer] The command associated with the event | |
723 | (surrounded by square brackets) is output, followed by the number of | |
724 | requests outstanding. | |
725 | ||
726 | \item[X -- split] The original starting sector followed by the new | |
727 | sector (separated by a slash (/) is output, followed by the command | |
728 | associated with the event (surrounded by square brackets). | |
729 | ||
a8f30e64 JA |
730 | \item[A -- remap] Sector and length is output, along with the original |
731 | device and sector offset. | |
732 | ||
ac00378e AB |
733 | \item[m -- message] The supplied message is appended to the end of |
734 | the standard header. | |
735 | ||
ff8a07c9 JA |
736 | \end{description} |
737 | ||
738 | %------------------------------ | |
739 | \newpage | |
740 | \newpage\section*{\label{sec:blktrace-kg}Appendix: blktrace Kernel Guide} | |
741 | ||
742 | The blktrace facility provides an efficient event transfer mechanism which | |
743 | supplies block IO layer state transition data via the relay | |
744 | filesystem. This section provides some details as to the interfaces | |
745 | blktrace utilizes in the kernel to effect this. It is good background data | |
746 | to help understand some of the outputs and command-line options above. | |
747 | ||
748 | \subsection{blktrace.h Definitions} | |
749 | Files which include $<linux/blktrace.h>$ are supplied with the following | |
750 | definitions: | |
751 | ||
752 | \subsubsection{Trace Action Specifiers} | |
753 | \begin{tabular}{|l|l|}\hline | |
754 | BLK\_TA\_QUEUE & (RQ) Command queued to request\_queue. \\ | |
755 | & (BIO) Command queued by elevator. \\ \hline | |
756 | BLK\_TA\_BACKMERGE & Back merging elevator operation \\ \hline | |
757 | BLK\_TA\_FRONTMERGE & Front merging elevator operation \\ \hline | |
758 | BLK\_TA\_GETRQ & Free request retrieved. \\ \hline | |
759 | BLK\_TA\_SLEEPRQ & No requests available, device unplugged. \\ \hline | |
760 | BLK\_TA\_REQUEUE & Request requeued. \\ \hline | |
761 | BLK\_TA\_ISSUE & Command set to driver for request\_queue. \\ \hline | |
762 | BLK\_TA\_COMPLETE & Command completed by driver. \\ \hline | |
763 | BLK\_TA\_PLUG & Device is plugged \\ \hline | |
764 | BLK\_TA\_UNPLUG\_IO & Unplug device as IO is made available. \\ \hline | |
765 | BLK\_TA\_UNPLUG\_TIMER & Unplug device after timer expired. \\ \hline | |
766 | BLK\_TA\_INSERT & Insert request into queue. \\ \hline | |
767 | BLK\_TA\_SPLIT & BIO split into 2 or more requests. \\ \hline | |
768 | BLK\_TA\_BOUNCE & BIO was bounced \\ \hline | |
a8f30e64 | 769 | BLK\_TA\_REMAP & BIO was remapped \\ \hline |
ff8a07c9 JA |
770 | \end{tabular} |
771 | ||
772 | %.......................................... | |
773 | \subsection{blktrace.h Routines} | |
774 | Files which include $<linux/blktrace.h>$ are supplied with the following | |
775 | kernel routine invocable interfaces: | |
776 | ||
777 | \begin{description} | |
778 | \item[blk\_add\_trace\_rq(struct request\_queue *q, struct request\_queue | |
779 | *rq, u32 what)] | |
780 | Adds a trace event describing the state change of the passed in | |
781 | request\_queue. The \emph{what} parameter describes the change in | |
782 | the request\_queue state, and is one of the request queue action | |
783 | specifiers -- BLK\_TA\_QUEUE, BLK\_TA\_REQUEUE, BLK\_TA\_ISSUE, | |
784 | or BLK\_TA\_COMPLETE. | |
785 | ||
786 | \item[blk\_add\_trace\_bio(struct request\_queue *q, struct bio *bio, | |
787 | u32 what)] | |
788 | Adds a trace event for the BIO passed in. The \emph{what} parameter | |
789 | describes the action being performed on the BIO, and is one of | |
790 | BLK\_TA\_BACKMERGE, BLK\_TA\_FRONTMERGE, or BLK\_TA\_QUEUE. | |
791 | ||
792 | \item[blk\_add\_trace\_generic(struct request\_queue *q, struct bio *bio, | |
793 | int rw, u32 what)] | |
794 | Adds a \emph{generic} trace event -- not one of the request queue | |
795 | or BIO traces. The \emph{what} parameter describes the action being | |
796 | performed on the BIO (if bio is non-NULL), and is one of | |
797 | BLK\_TA\_PLUG, BLK\_TA\_GETRQ or BLK\_TA\_SLEEPRQ. | |
798 | ||
799 | \item[blk\_add\_trace\_pdu\_int(struct request\_queue *q, u32 what, | |
800 | u32 pdu)] | |
801 | Adds a trace with some payload data -- in this case, an unsigned | |
802 | 32-bit entity (the \emph{pdu} parameter). The \emph{what} parameter | |
803 | describes the nature of the payload, and is one of | |
804 | BLK\_TA\_UNPLUG\_IO or BLK\_TA\_UNPLUG\_TIMER. | |
805 | ||
a8f30e64 JA |
806 | \item[blk\_add\_trace\_remap(struct request\_queue *q, struct bio *bio, |
807 | dev\_t dev, sector\_t sector)] | |
808 | Adds a trace with a remap event. \emph{dev} and \emph{sector} denote | |
809 | the original device this \emph{bio} was mapped from. | |
810 | ||
ac00378e AB |
811 | \item[blk\_add\_trace\_msg(struct request\_queue *q, char *fmt, ...)] |
812 | Adds a formatted message to the output stream. The total message | |
813 | size can not exceed BLK\_TN\_MSG\_MSG characters (currently | |
814 | 1024). Standard format conversions are supported (as supplied | |
815 | by \texttt{vscnprintf}. | |
a8f30e64 | 816 | |
ff8a07c9 JA |
817 | \end{description} |
818 | \end{document} |