Commit | Line | Data |
---|---|---|
ff8a07c9 JA |
1 | \documentclass{article} |
2 | ||
3 | % | |
a2594911 | 4 | % Copyright (C) 2005, 2006 Alan D. Brunelle <Alan.Brunelle@hp.com> |
ff8a07c9 JA |
5 | % |
6 | % This program is free software; you can redistribute it and/or modify | |
7 | % it under the terms of the GNU General Public License as published by | |
8 | % the Free Software Foundation; either version 2 of the License, or | |
9 | % (at your option) any later version. | |
10 | % | |
11 | % This program is distributed in the hope that it will be useful, | |
12 | % but WITHOUT ANY WARRANTY; without even the implied warranty of | |
13 | % MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
14 | % GNU General Public License for more details. | |
15 | % | |
16 | % You should have received a copy of the GNU General Public License | |
17 | % along with this program; if not, write to the Free Software | |
18 | % Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA | |
19 | % | |
20 | ||
21 | \title{blktrace User Guide} | |
22 | \author{blktrace: Jens Axboe (axboe@suse.de)\\ | |
23 | User Guide: Alan D. Brunelle (Alan.Brunelle@hp.com)} | |
a2594911 | 24 | \date{23 February 2005} |
ff8a07c9 JA |
25 | |
26 | \begin{document} | |
27 | \maketitle | |
28 | %--------------------- | |
29 | \section{\label{sec:intro}Introduction} | |
30 | ||
31 | blktrace is a block layer IO tracing mechanism which provides detailed | |
32 | information about request queue operations up to user space. There are | |
33 | three major components that are provided: | |
34 | ||
35 | \begin{description} | |
36 | \item[Kernel patch] A patch to the Linux kernel which includes the | |
37 | kernel event logging interfaces, and patches to areas within the block | |
c8255c5a JA |
38 | layer to emit event traces. If you run a 2.6.17-rc1 or newer kernel, |
39 | you don't need to patch blktrace support as it is already included. | |
ff8a07c9 JA |
40 | |
41 | \item[blktrace] A utility which transfers event traces from the kernel | |
42 | into either long-term on-disk storage, or provides direct formatted | |
43 | output (via blkparse). | |
44 | ||
45 | \item[blkparse] A utility which formats events stored in files, or when | |
46 | run in \emph{live} mode directly outputs data collected by blktrace. | |
47 | \end{description} | |
48 | ||
ff8a07c9 JA |
49 | \subsection{blktrace Download Area} |
50 | ||
51 | The blktrace and blkparse utilities and associated kernel patch are provided | |
52 | as part of the following git repository: | |
53 | ||
38e1f0c6 | 54 | git://git.kernel.org/pub/scm/linux/kernel/git/axboe/blktrace.git bt |
ff8a07c9 JA |
55 | |
56 | %-------------------------- | |
57 | \newpage\section{\label{sec:quick-start}Quick Start Guide} | |
58 | ||
59 | The following sections outline some quick steps towards utilizing | |
60 | blktrace. Some of the specific instructions below may need to be tailored | |
61 | to your environment. | |
62 | ||
63 | \subsection{\label{sec:get-blktrace}Retrieving blktrace} | |
64 | ||
65 | As noted above, the kernel patch along with the blktrace and blkparse utilities are stored in a git repository. One simple way to get going would be: | |
66 | ||
67 | \begin{verbatim} | |
38e1f0c6 | 68 | % git clone git://git.kernel.org/pub/scm/linux/kernel/git/axboe/blktrace.git bt |
ff8a07c9 | 69 | % cd bt |
7b464f9f | 70 | % git checkout |
ff8a07c9 JA |
71 | \end{verbatim} |
72 | ||
73 | \subsection{\label{sec:patching}Patching and configuring the Linux kernel} | |
74 | ||
75 | A patch for a \emph{specific Linux kernel} is provided in bt/kernel (where | |
76 | \emph{bt} is the name of the directory from the above git sequence). The | |
77 | detailed actual patching instructions for a Linux kernel is outside the | |
78 | scope of this document, but the following may be used as a sample template. | |
c8255c5a | 79 | Note that you may skip this step, if you kernel is at least 2.6.17-rc1. |
ff8a07c9 JA |
80 | |
81 | As an example, bt/kernel contains blk-trace-2.6.14-rc1-git-G2, download | |
82 | linux-2.6.13.tar.bz2 and patch-2.6.14-rc1.bz2 | |
83 | ||
84 | \begin{verbatim} | |
85 | % tar xjf linux-2.6.13.tar.bz2 | |
86 | % mv linux-2.6.13 linux-2.6.14-rc1 | |
87 | % cd linux-2.6.14-rc1 | |
88 | % bunzip2 -c ../patch-2.6.14-rc1.bz2 | patch -p1 | |
89 | \end{verbatim} | |
90 | ||
91 | At this point you may (optionally) remove linux-2.6.13.tar.bz2 and | |
92 | patch-2.6.14-rc1.bz2. | |
93 | ||
94 | At this point you should configure the Linux kernel for your specific | |
95 | system -- again, outside the scope of this document -- and then enable | |
96 | \emph{Support for tracing block io actions.} To do this, run | |
97 | ||
98 | \begin{verbatim} | |
99 | % make menuconfig or make xconfig, or edit .config, or ... | |
100 | \end{verbatim} | |
101 | ||
102 | and navigate through \emph{Device Drivers} and \emph{Block devices} | |
103 | and then down to \emph{Support for tracing block io actions} and hit Y. | |
104 | ||
105 | Install the new kernel (and modules\ldots) and reboot. | |
106 | ||
e33da1d5 | 107 | \subsection{\label{sec:mount}Mounting the debugfs file system} |
ff8a07c9 | 108 | |
e33da1d5 JA |
109 | blktrace utilizes files under the debug file system, and thus must have |
110 | the mount point set up -- mounted on the directory /debug. To do this | |
ff8a07c9 JA |
111 | one may do either of the following: |
112 | ||
113 | \begin{enumerate} | |
114 | \item Manually mount after each boot: | |
115 | \begin{verbatim} | |
e33da1d5 | 116 | % mount -t debugfs debugfs /debug |
ff8a07c9 JA |
117 | \end{verbatim} |
118 | ||
119 | \item Add an entry into /etc/fstab, and have it done automatically at | |
120 | each boot\footnote{Note: after adding the entry to /etc/fstab, you | |
121 | could then mount the directory this time only by doing: \% mount /relay}: | |
122 | \begin{verbatim} | |
e33da1d5 | 123 | debug /debug debugfs default 0 0 |
ff8a07c9 JA |
124 | \end{verbatim} |
125 | \end{enumerate} | |
126 | ||
127 | \subsection{\label{sec:build}Build the tools} | |
128 | ||
129 | To build and install the tools, execute the following sequence (as root): | |
130 | ||
131 | \begin{verbatim} | |
132 | % cd bt | |
133 | % make && make install | |
134 | \end{verbatim} | |
135 | ||
136 | \subsection{\label{sec:live-blktrace}blktrace -- live} | |
137 | ||
138 | Now to simply watch what is going on for a specific disk (to stop the | |
139 | trace, hit control-C): | |
140 | ||
141 | \begin{verbatim} | |
142 | % blktrace -d /dev/sda -o - | blkparse -i - | |
143 | 8,0 3 1 0.000000000 697 G W 223490 + 8 [kjournald] | |
144 | 8,0 3 2 0.000001829 697 P R [kjournald] | |
145 | 8,0 3 3 0.000002197 697 Q W 223490 + 8 [kjournald] | |
146 | 8,0 3 4 0.000005533 697 M W 223498 + 8 [kjournald] | |
147 | 8,0 3 5 0.000008607 697 M W 223506 + 8 [kjournald] | |
148 | 8,0 3 6 0.000011569 697 M W 223514 + 8 [kjournald] | |
149 | 8,0 3 7 0.000014407 697 M W 223522 + 8 [kjournald] | |
150 | 8,0 3 8 0.000017367 697 M W 223530 + 8 [kjournald] | |
151 | 8,0 3 9 0.000020161 697 M W 223538 + 8 [kjournald] | |
152 | 8,0 3 10 0.000024062 697 D W 223490 + 56 [kjournald] | |
153 | 8,0 1 11 0.009507758 0 C W 223490 + 56 [0] | |
154 | 8,0 1 12 0.009538995 697 G W 223546 + 8 [kjournald] | |
155 | 8,0 1 13 0.009540033 697 P R [kjournald] | |
156 | 8,0 1 14 0.009540313 697 Q W 223546 + 8 [kjournald] | |
157 | 8,0 1 15 0.009542980 697 D W 223546 + 8 [kjournald] | |
158 | 8,0 1 16 0.013542170 0 C W 223546 + 8 [0] | |
159 | ... | |
160 | ^C | |
161 | ... | |
162 | CPU1 (8,0): | |
163 | Reads Queued: 0, 0KiB Writes Queued: 7, 128KiB | |
164 | Read Dispatches: 0, 0KiB Write Dispatches: 7, 128KiB | |
165 | Reads Completed: 0, 0KiB Writes Completed: 11, 168KiB | |
166 | Read Merges: 0 Write Merges: 25 | |
167 | IO unplugs: 0 Timer unplugs: 0 | |
168 | ... | |
169 | CPU3 (8,0): | |
170 | Reads Queued: 0, 0KiB Writes Queued: 1, 28KiB | |
171 | Read Dispatches: 0, 0KiB Write Dispatches: 1, 28KiB | |
172 | Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB | |
173 | Read Merges: 0 Write Merges: 6 | |
174 | IO unplugs: 0 Timer unplugs: 0 | |
175 | ||
176 | Total (8,0): | |
177 | Reads Queued: 0, 0KiB Writes Queued: 11, 168KiB | |
178 | Read Dispatches: 0, 0KiB Write Dispatches: 11, 168KiB | |
179 | Reads Completed: 0, 0KiB Writes Completed: 11, 168KiB | |
180 | Read Merges: 0 Write Merges: 31 | |
181 | IO unplugs: 0 Timer unplugs: 3 | |
182 | ||
183 | Events (8,0): 89 entries, 0 skips | |
184 | \end{verbatim} | |
185 | ||
b6dc7d6e JA |
186 | A \emph{btrace} script is included in the distribution to ease live |
187 | tracing of devices. The above could also be accomplished by issuing: | |
188 | ||
189 | \begin{verbatim} | |
190 | % btrace /dev/sda | |
191 | \end{verbatim} | |
192 | ||
193 | By default, \emph{btrace} runs the trace in quiet mode so it will not | |
194 | include statistics when you break the run. Add the \emph{-S} option to | |
195 | get that dumped as well. | |
196 | ||
197 | \subsection{\label{sec:pc-blktrace}blktrace -- SCSI commands} | |
198 | ||
199 | The previous section showed typical file system io actions, but blktrace | |
200 | can also show SCSI commands going in and out of the queue as submitted | |
201 | by applications using the SCSI Generic (\emph{sg}) interface. | |
202 | ||
203 | \begin{verbatim} | |
204 | % btrace /dev/cdrom | |
205 | [...] | |
206 | 3,0 0 25 0.004884107 13528 G R 0 + 0 [inquiry] | |
207 | 3,0 0 26 0.004890361 13528 I R 56 (12 00 00 00 38 ..) [inquiry] | |
208 | 3,0 0 27 0.004891223 13528 P R [inquiry] | |
209 | 3,0 0 28 0.004893250 13528 D R 56 (12 00 00 00 38 ..) [inquiry] | |
210 | 3,0 0 29 0.005344910 0 C R (12 00 00 00 38 ..) [0] | |
211 | \end{verbatim} | |
212 | ||
213 | Here we see a program issuing an INQUIRY command to the CDROM device. | |
214 | The program requested a read of 56 bytes of data, the CDB is included | |
215 | in parenthesis after the data length. The completion event shows shows | |
216 | that the command completed successfully. Tracing SCSI commands can be | |
217 | very useful for debugging problems with programs talking directly to the | |
218 | device. An example of that would be \emph{cdrecord} burning. | |
219 | ||
ff8a07c9 JA |
220 | \subsection{\label{sec:blktrace-post}blktrace -- post-processing} |
221 | ||
222 | Another way to run blktrace is to have blktrace save data away for later | |
223 | formatting by blkparse. This would be useful if you want to get | |
224 | measurements while running specific loads. | |
225 | ||
226 | To do this, one would specify the device (or devices) to be watched. Then | |
227 | go run you test cases. Stop the trace, and at your leisure utilize | |
228 | blkparse to see the results. | |
229 | ||
230 | In this example, devices /dev/sdaa, /dev/sdc and /dev/sdo are used in an | |
231 | LVM volume called adb3/vol. | |
232 | ||
233 | \begin{verbatim} | |
234 | % blktrace /dev/sdaa /dev/sdc /dev/sdo & | |
235 | [1] 9713 | |
236 | % | |
237 | % mkfs -t ext3 /dev/adb3/vol | |
238 | mke2fs 1.35 (28-Feb-2004) | |
239 | Filesystem label= | |
240 | OS type: Linux | |
241 | Block size=4096 (log=2) | |
242 | Fragment size=4096 (log=2) | |
243 | 16793600 inodes, 33555456 blocks | |
244 | 1677772 blocks (5.00%) reserved for the super user | |
245 | First data block=0 | |
246 | Maximum filesystem blocks=4294967296 | |
247 | 1025 block groups | |
248 | 32768 blocks per group, 32768 fragments per group | |
249 | 16384 inodes per group | |
250 | Superblock backups stored on blocks: | |
251 | 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, | |
252 | 4096000, 7962624, 11239424, 20480000, 23887872 | |
253 | ||
254 | Writing inode tables: done | |
255 | Creating journal (8192 blocks): done | |
256 | Writing superblocks and filesystem accounting information: done | |
257 | ||
258 | This filesystem will be automatically checked every 27 mounts or | |
259 | 180 days, whichever comes first. Use tune2fs -c or -i to override. | |
260 | % | |
261 | % kill -15 9713 | |
262 | \end{verbatim} | |
263 | ||
264 | Then you could process the events later: | |
265 | ||
266 | \begin{verbatim} | |
267 | % | |
268 | % blkparse sdaa sdc sdo > events | |
269 | % less events | |
270 | 8,32 1 1 0.000000000 9728 G R 384 + 32 [mkfs.ext3] | |
271 | 8,32 1 2 0.000001959 9728 P R [mkfs.ext3] | |
272 | 8,32 1 3 0.000002446 9728 Q R 384 + 32 [mkfs.ext3] | |
273 | 8,32 1 4 0.000005110 9728 D R 384 + 32 [mkfs.ext3] | |
274 | 8,32 3 5 0.000200570 0 C R 384 + 32 [0] | |
275 | 8,224 3 1 0.021658989 9728 G R 384 + 32 [mkfs.ext3] | |
276 | ... | |
277 | 65,160 3 163392 41.117070504 0 C W 87469088 + 1376 [0] | |
278 | 8,32 3 163374 41.122683668 0 C W 88168160 + 1376 [0] | |
279 | 65,160 3 163393 41.129952433 0 C W 87905984 + 1376 [0] | |
280 | 65,160 3 163394 41.130049431 0 D W 89129344 + 1376 [swapper] | |
281 | 65,160 3 163395 41.130067135 0 D W 89216704 + 1376 [swapper] | |
282 | 65,160 3 163396 41.130083785 0 D W 89304096 + 1376 [swapper] | |
283 | 65,160 3 163397 41.130099455 0 D W 89391488 + 1376 [swapper] | |
284 | 65,160 3 163398 41.130114732 0 D W 89478848 + 1376 [swapper] | |
285 | 65,160 3 163399 41.130128885 0 D W 89481536 + 64 [swapper] | |
286 | 8,32 3 163375 41.134758196 0 C W 86333152 + 1376 [0] | |
287 | 65,160 3 163400 41.142229726 0 C W 89129344 + 1376 [0] | |
288 | 65,160 3 163401 41.144952314 0 C W 89481536 + 64 [0] | |
289 | 8,32 3 163376 41.147441930 0 C W 88342912 + 1376 [0] | |
290 | 65,160 3 163402 41.155869604 0 C W 89478848 + 1376 [0] | |
291 | 8,32 3 163377 41.159466082 0 C W 86245760 + 1376 [0] | |
292 | 65,160 3 163403 41.166944976 0 C W 89216704 + 1376 [0] | |
293 | 65,160 3 163404 41.178968252 0 C W 89304096 + 1376 [0] | |
294 | 65,160 3 163405 41.191860173 0 C W 89391488 + 1376 [0] | |
295 | ... | |
296 | Events (sdo): 0 entries, 0 skips | |
297 | ||
298 | CPU0 (65,160): | |
299 | Reads Queued: 0, 0KiB Writes Queued: 9, 5,520KiB | |
300 | Read Dispatches: 0, 0KiB Write Dispatches: 0, 0KiB | |
301 | Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB | |
302 | Read Merges: 0 Write Merges: 336 | |
303 | IO unplugs: 0 Timer unplugs: 0 | |
304 | CPU1 (65,160): | |
305 | Reads Queued: 2,411, 38,576KiB Writes Queued: 769, 425,408KiB | |
306 | Read Dispatches: 2,407, 38,512KiB Write Dispatches: 118, 61,680KiB | |
307 | Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB | |
308 | Read Merges: 0 Write Merges: 25,819 | |
309 | IO unplugs: 0 Timer unplugs: 4 | |
310 | CPU2 (65,160): | |
311 | Reads Queued: 2, 32KiB Writes Queued: 18, 10,528KiB | |
312 | Read Dispatches: 2, 32KiB Write Dispatches: 3, 1,344KiB | |
313 | Reads Completed: 0, 0KiB Writes Completed: 0, 0KiB | |
314 | Read Merges: 0 Write Merges: 640 | |
315 | IO unplugs: 0 Timer unplugs: 0 | |
316 | CPU3 (65,160): | |
317 | Reads Queued: 20,572, 329,152KiB Writes Queued: 594, 279,712KiB | |
318 | Read Dispatches: 20,576, 329,216KiB Write Dispatches: 1,474, 740,720KiB | |
319 | Reads Completed: 22,985, 367,760KiB Writes Completed: 1,390, 721,168KiB | |
320 | Read Merges: 0 Write Merges: 16,888 | |
321 | IO unplugs: 0 Timer unplugs: 0 | |
322 | ||
323 | Total (65,160): | |
324 | Reads Queued: 22,985, 367,760KiB Writes Queued: 1,390, 721,168KiB | |
325 | Read Dispatches: 22,985, 367,760KiB Write Dispatches: 1,595, 803,744KiB | |
326 | Reads Completed: 22,985, 367,760KiB Writes Completed: 1,390, 721,168KiB | |
327 | Read Merges: 0 Write Merges: 43,683 | |
328 | IO unplugs: 0 Timer unplugs: 4 | |
329 | ... | |
330 | \end{verbatim} | |
331 | ||
332 | %---------------------------- | |
333 | \newpage\section{\label{sec:blktrace-ug}blktrace User Guide} | |
334 | ||
335 | The \emph{blktrace} utility extracts event traces from the kernel (via | |
e33da1d5 JA |
336 | the relaying through the debug file system). Some background details |
337 | concerning the run-time behaviour of blktrace will help to understand some | |
338 | of the more arcane command line options: | |
ff8a07c9 JA |
339 | |
340 | \begin{itemize} | |
341 | \item blktrace receives data from the kernel in buffers passed up | |
e33da1d5 JA |
342 | through the debug file system (relay). Each device being traced has |
343 | a file created in the mounted directory for the debugfs, which defaults | |
344 | to \emph{/debug} -- this can be overridden with the \emph{-r} command | |
ff8a07c9 JA |
345 | line argument. |
346 | ||
347 | \item blktrace defaults to collecting \emph{all} events that can be | |
348 | traced. To limit the events being captured, you can specify one or | |
349 | more filter masks via the \emph{-a} option. | |
350 | ||
351 | Alternatively, one may specify the entire mask utilizing a hexadecimal | |
352 | value that is version-specific. (Requires understanding of the internal | |
353 | representation of the filter mask.) | |
354 | ||
355 | \item As noted above, the events are passed up via a series of buffers | |
e33da1d5 | 356 | stored into debugfs files. The size and number of buffers can be |
ff8a07c9 JA |
357 | specified via the \emph{-b} and \emph{-n} arguments respectively. |
358 | ||
359 | \item blktrace stores the extracted data into files stored in the | |
360 | \emph{local} directory. The format of the file names is (by default) | |
361 | \emph{device}.blktrace.\emph{cpu}, where \emph{device} is the base | |
362 | device name (e.g, if we are tracing /dev/sda, the base device name would | |
363 | be \emph{sda}); and \emph{cpu} identifies a CPU for the event stream. | |
364 | ||
365 | The \emph{device} portion of the event file name can be changed via | |
366 | the \emph{-o} option. | |
367 | ||
368 | \item blktrace may also be run concurrently with blkparse to produce | |
369 | \emph{live} output -- to do this specify \emph{-o -} for blktrace. | |
370 | ||
371 | \item The default behaviour for blktrace is to run forever until explicitly killed by the user (via a control-C, or \emph{kill} utility invocation). There are two ways to modify this: | |
372 | ||
373 | \begin{enumerate} | |
374 | \item You may utilize the blktrace utility itself to \emph{kill} | |
375 | a running trace -- via the \emph{-k} option. | |
376 | ||
377 | \item You can specify a run-time duration for blktrace via the | |
378 | \emph{-w} option -- then blktrace will run for the specified number | |
379 | of seconds, and then halt. | |
380 | \end{enumerate} | |
381 | \end{itemize} | |
382 | ||
383 | \subsection{\label{sec:blktrace-args}Command line arguments} | |
384 | \begin{tabular}{|l|l|l|}\hline | |
385 | Short & Long & Description \\ \hline\hline | |
386 | -A \emph{hex-mask} & --set-mask=\emph{hex-mask} & Set filter mask to \emph{hex-mask} \\ \hline | |
387 | -a \emph{mask} & --act-mask=\emph{mask} & Add \emph{mask} to current filter (see below for masks) \\ \hline | |
388 | -b \emph{size} & --buffer-size=\emph{size} & Specifies buffer size for event extraction (scaled by $2^{10}$) \\ \hline | |
389 | -d \emph{dev} & --dev=\emph{dev} & Adds \emph{dev} as a device to trace \\ \hline | |
390 | -k & --kill & Kill on-going trace \\ \hline | |
391 | -n \emph{num-sub} & --num-sub=\emph{num-sub} & Specifies number of buffers to use \\ \hline | |
392 | -o \emph{file} & --output=\emph{file} & Prepend \emph{file} to output file name(s) \\ \hline | |
e33da1d5 | 393 | -r \emph{rel-path} & --relay=\emph{rel-path} & Specifies debugfs mount point \\ \hline |
57ea8602 | 394 | -V & --version & Outputs version \\ \hline |
ff8a07c9 JA |
395 | -w \emph{seconds} & --stopwatch=\emph{seconds} & Sets run time to the number of seconds specified \\ \hline |
396 | \end{tabular} | |
397 | ||
398 | \subsubsection{\label{sec:filter-mask}Filter Masks} | |
399 | The following masks may be passed with the \emph{-a} command line | |
400 | option, multiple filters may be combined via multiple \emph{-a} command | |
401 | line options.\smallskip | |
402 | ||
403 | \begin{tabular}{|l|l|}\hline | |
404 | barrier & \emph{barrier} attribute \\ \hline | |
405 | complete & \emph{completed} by driver \\ \hline | |
406 | fs & \emph{FS} requests \\ \hline | |
407 | issue & \emph{issued} to driver \\ \hline | |
408 | pc & \emph{packet command} events \\ \hline | |
409 | queue & \emph{queue} operations \\ \hline | |
410 | read & \emph{read} traces \\ \hline | |
411 | requeue & \emph{requeue} operations \\ \hline | |
412 | sync & \emph{synchronous} attribute \\ \hline | |
413 | write & \emph{write} traces \\ \hline | |
414 | \end{tabular} | |
415 | ||
e8ae3d09 JA |
416 | \subsubsection{\label{sec:request-types}Request types} |
417 | blktrace disguingishes between two types of block layer requests, | |
418 | file system and scsi commands. The former are dubbed \emph{fs} | |
419 | requests, the latter \emph{pc} requests. File system requests are | |
420 | normal read/write operations, ie any type of read or write from a | |
421 | specific disk location at a given size. These requests typically | |
422 | originate from a user process, but they may also be initiated by | |
423 | the vm flushing dirty data to disk or the file system syncing | |
424 | a super or journal block to disk. \emph{pc} requests are SCSI | |
425 | commands. blktrace sends the command data block as a payload | |
426 | so that blkparse can decode it. | |
427 | ||
ff8a07c9 JA |
428 | %---------------------------- |
429 | \newpage\section{\label{sec:blkparse-ug}blkparse User Guide} | |
430 | ||
431 | The \emph{blkparse} utility will attempt to combine streams of events | |
432 | for various devices on various CPUs, and produce a formatted output of | |
433 | the event information. As with blktrace, some details concerning blkparse | |
434 | will help in understanding the command line options presented below. | |
435 | ||
436 | \begin{itemize} | |
437 | \item By default, blkparse expects to run in a post-processing mode | |
438 | -- one where the trace events have been saved by a previous run | |
439 | of blktrace, and blkparse is combining event streams and dumping | |
440 | formatted data. | |
441 | ||
442 | blkparse \emph{may} be run in a \emph{live} manner concurrently with | |
443 | blktrace by specifying \emph{-i -} to blkparse, and combining it with | |
444 | the live option for blktrace. An example would be: | |
445 | ||
446 | \begin{verbatim} | |
447 | % blktrace -d /dev/sda -o - | blkparse -i - | |
448 | \end{verbatim} | |
449 | ||
450 | \item You can set how many blkparse batches event reads via the | |
451 | \emph{-b} option, the default is to handle events in batches of 512. | |
452 | ||
453 | \item If you have saved event traces in blktrace with different output | |
454 | names (via the \emph{-o} option to blktrace), you must specify the | |
455 | same \emph{input} name via the \emph{-i} option. | |
456 | ||
457 | \item The format of the output data can be controlled via the \emph{-f} | |
458 | or \emph{-F} options -- see section~\ref{sec:blkparse-format} for details. | |
459 | ||
cfa8ef19 AB |
460 | By default, blkparse sends formatted data to standard output. This may |
461 | be changed via the \emph{-o} option, or text output can be disabled | |
462 | via the\emph{-O} option. A merged binary stream can be produced using | |
463 | the \emph{-d} option. | |
ff8a07c9 JA |
464 | |
465 | \end{itemize} | |
466 | ||
467 | \newpage\subsection{\label{sec:blkparse-args}Command line arguments} | |
468 | \begin{tabular}{|l|l|l|}\hline | |
469 | Short & Long & Description \\ \hline\hline | |
470 | -b \emph{batch} & --batch={batch} & Standard input read batching \\ \hline | |
471 | ||
472 | -i \emph{file} & --input=\emph{file} & Specifies base name for input files -- default is \emph{device}.blktrace.\emph{cpu}. \\ | |
473 | & & As noted above, specifying \emph{-i -} runs in \emph{live} mode with blktrace \\ | |
474 | & & (reading data from standard in). \\ \hline | |
475 | ||
476 | -F \emph{typ,fmt} & --format=\emph{typ,fmt} & Sets output format \\ | |
477 | -f \emph{fmt} & --format-spec=\emph{fmt} & (See section~\ref{sec:blkparse-format} for details.) \\ | |
478 | & & \\ | |
479 | & & The -f form specifies a format for all events \\ | |
480 | & & \\ | |
481 | & & The -F form allows one to specify a format for a specific \\ | |
482 | & & event type. The single-character \emph{typ} field is one of the \\ | |
483 | & & action specifiers in section~\ref{sec:act-table} \\ \hline | |
484 | ||
485 | ||
486 | -m & --missing & Print missing entries\\ \hline | |
487 | ||
d915dee6 | 488 | -h & --hash-by-name & Hash processes by name, not by PID\\ \hline |
ff8a07c9 JA |
489 | |
490 | -o \emph{file} & --output=\emph{file} & Output file \\ \hline | |
cfa8ef19 | 491 | -O & --no-text-output & Do \emph{not} produce text output, used for binary (-d) only \\ \hline |
ff8a07c9 | 492 | |
a2594911 AB |
493 | -d \emph{file} & --dump-binary=\emph{file} & Binary output file \\ \hline |
494 | ||
ff8a07c9 JA |
495 | -q & --quiet & Quite mode \\ \hline |
496 | ||
497 | -s & --per-program-stats & Displays data sorted by program \\ \hline | |
498 | ||
499 | -t & --track-ios & Display time deltas per IO \\ \hline | |
500 | ||
501 | -w \emph{span} & --stopwatch=\emph{span} & Display traces for the \emph{span} specified -- where span can be: \\ | |
502 | & & \emph{end-time} -- Display traces from time 0 through \emph{end-time} (in ns) \\ | |
503 | & & or \\ | |
504 | & & \emph{start:end-time} -- Display traces from time \emph{start} \\ | |
505 | & & through {end-time} (in ns). \\ \hline | |
506 | ||
57ea8602 JA |
507 | -v & --verbose & More verbose marginal on marginal errors \\ \hline |
508 | -V & --version & Display version \\ \hline | |
ff8a07c9 JA |
509 | |
510 | \end{tabular} | |
511 | ||
512 | \newpage | |
208660b1 JA |
513 | \subsection{\label{sec:blkparse-actions}Trace actions} |
514 | ||
515 | \begin{description} | |
516 | \item[C -- complete] A previously issued request has been completed. | |
517 | The output will detail the sector and size of that request, as well | |
518 | as the success or failure of it. | |
519 | ||
520 | \item[D -- issued] A request that previously resided on the block layer | |
521 | queue or in the io scheduler has been sent to the driver. | |
522 | ||
523 | \item[I -- inserted] A request is being sent to the io scheduler for | |
524 | addition to the internal queue and later service by the driver. The | |
525 | request is fully formed at this time. | |
526 | ||
527 | \item[Q -- queued] This notes intent to queue io at the given location. | |
528 | No real requests exists yet. | |
529 | ||
530 | \item[W -- bounced] The data pages attached to this \emph{bio} are | |
531 | not reachable by the hardware and must be bounced to a lower memory | |
532 | location. This causes a big slowdown in io performance, since the data | |
533 | must be copied to/from kernel buffers. Usually this can be fixed with | |
534 | using better hardware - either a better io controller, or a platform | |
535 | with an IOMMU. | |
536 | ||
537 | \item[B -- back merge] A previously inserted request exists that ends | |
538 | on the boundary of where this io begins, so the io scheduler can merge | |
539 | them together. | |
540 | ||
541 | \item[F -- front merge] Same as the back merge, except this io ends | |
542 | where a previously inserted requests starts. | |
543 | ||
544 | \item[M -- front or back merge] One of the above. | |
545 | ||
546 | \item[G -- get request] To send any type of request to a block device, | |
547 | a \emph{struct request} container must be allocated first. | |
548 | ||
549 | \item[S -- sleep] No available request structures were available, so | |
550 | the issuer has to wait for one to be freed. | |
551 | ||
552 | \item[P -- plug] When io is queued to a previously empty block device | |
553 | queue, Linux will plug the queue in anticipation of future ios being | |
554 | added before this data is needed. | |
555 | ||
556 | \item[U -- unplug] Some request data already queued in the device, | |
557 | start sending requests to the driver. This may happen automatically | |
558 | if a timeout period has passed (see next entry) or if a number of | |
559 | requests have been added to the queue. | |
560 | ||
561 | \item[T -- unplug due to timer] If nobody requests the io that was queued | |
562 | after plugging the queue, Linux will automatically unplug it after a | |
563 | defined period has passed. | |
564 | ||
565 | \item[X -- split] On raid or device mapper setups, an incoming io may | |
566 | straddle a device or internal zone and needs to be chopped up into | |
567 | smaller pieces for service. This may indicate a performance problem due | |
568 | to a bad setup of that raid/dm device, but may also just be part of | |
569 | normal boundary conditions. dm is notably bad at this and will clone | |
570 | lots of io. | |
571 | ||
a8f30e64 JA |
572 | \item[A -- remap] For stacked devices, incoming io is remapped to device |
573 | below it in the io stack. The remap action details what exactly is | |
574 | being remapped to what. | |
575 | ||
208660b1 JA |
576 | \end{description} |
577 | ||
ff8a07c9 JA |
578 | \subsection{\label{sec:blkparse-format}Output Description and Formatting} |
579 | ||
580 | The output from blkparse can be tailored for specific use - in particular, | |
581 | to ease parsing of output, and/or limit output fields to those the user | |
582 | wants to see. The data for fields which can be output include: | |
583 | ||
584 | \smallskip | |
585 | \begin{tabular}{|l|l|}\hline | |
586 | Field & Description \\ | |
587 | Specifier & \\ \hline\hline | |
588 | \emph{a} & Action, a (small) string (1 or 2 characters) -- see table below for more details \\ \hline | |
589 | \emph{c} & CPU id \\ \hline | |
590 | \emph{C} & Command \\ \hline | |
591 | \emph{d} & RWBS field, a (small) string (1-3 characters) -- see section below for more details \\ \hline | |
592 | \emph{D} & 7-character string containing the major and minor numbers of | |
593 | the event's device \\ | |
594 | & (separated by a comma). \\ \hline | |
595 | \emph{e} & Error value \\ \hline | |
596 | \emph{m} & Minor number of event's device. \\ \hline | |
597 | \emph{M} & Major number of event's device. \\ \hline | |
598 | \emph{n} & Number of blocks \\ \hline | |
599 | \emph{N} & Number of bytes \\ \hline | |
600 | \emph{p} & Process ID \\ \hline | |
601 | \emph{P} & Display packet data -- series of hexadecimal values\\ \hline | |
602 | \emph{s} & Sequence numbers \\ \hline | |
603 | \emph{S} & Sector number \\ \hline | |
604 | \emph{t} & Time stamp (nanoseconds) \\ \hline | |
605 | \emph{T} & Time stamp (seconds) \\ \hline | |
606 | \emph{u} & Elapsed value in microseconds (\emph{-t} command line option) \\ \hline | |
607 | \emph{U} & Payload unsigned integer \\ \hline | |
608 | \end{tabular} | |
609 | ||
610 | Note that the user can optionally specify field display width, and | |
611 | optionally a left-aligned specifier. These precede field specifiers, | |
612 | with a '\%' character, followed by the optional left-alignment specifer | |
613 | (-) followed by the width (a decimal number) and then the field. | |
614 | ||
615 | Thus, to specify the command in a 12-character field that is left aligned: | |
616 | ||
617 | \begin{verbatim} | |
618 | -f "%-12C" | |
619 | \end{verbatim} | |
620 | ||
621 | \newpage | |
622 | \subsubsection{\label{sec:act-table}Action Table} | |
623 | The following table shows the various actions which may be output. | |
624 | ||
625 | \begin{tabular}{|l|l|}\hline | |
626 | Act & Description \\ \hline\hline | |
a8f30e64 | 627 | A & IO was remapped to a different device \\ \hline |
ff8a07c9 JA |
628 | B & IO back merged with request on queue \\ \hline |
629 | C & IO completion \\ \hline | |
630 | D & IO issued to driver \\ \hline | |
631 | F & IO front merged with request on queue \\ \hline | |
632 | G & Get request \\ \hline | |
633 | I & IO inserted onto request queue \\ \hline | |
634 | P & Plug request \\ \hline | |
635 | Q & IO handled by request queue code \\ \hline | |
636 | S & Sleep request \\ \hline | |
637 | T & Unplug due to timeout \\ \hline | |
638 | U & Unplug request \\ \hline | |
639 | W & IO bounced \\ \hline | |
640 | X & Split \\ \hline | |
641 | \end{tabular} | |
642 | ||
643 | \subsubsection{\label{sec:act-table}RWBS Description} | |
644 | This is a small string containing at least one character ('R' for read, | |
645 | 'W' for write operation), and optionally either a 'B' (for barrier | |
646 | operations) or 'S' (for synchronous operations). | |
647 | ||
648 | \subsubsection{\label{sec:default-output}Default output} | |
649 | ||
650 | The standard \emph{header} (or initial fields displayed) include: | |
651 | ||
652 | \begin{verbatim} | |
653 | "%D %2c %8s %5T.%9t %5p %2a %3d " | |
654 | \end{verbatim} | |
655 | ||
656 | Breaking this down: | |
657 | ||
658 | \begin{description} | |
659 | \item[\%D] Displays the event's device major/minor as: \%3d,\%-3d. | |
660 | \item[\%2c] CPU ID (2-character field). | |
661 | \item[\%8s] Sequence number | |
662 | \item[\%5T.\%9t] 5-charcter field for the seconds portion of the | |
663 | time stamp and a 9-character field for the nanoseconds in the time stamp. | |
664 | \item[\%5p] 5-character field for the process ID. | |
665 | \item[\%2a] 2-character field for one of the actions. | |
666 | \item[\%3d] 3-character field for the RWBS data. | |
667 | \end{description} | |
668 | ||
669 | Seeing this in action: | |
670 | ||
671 | \begin{verbatim} | |
672 | 8,0 3 1 0.000000000 697 G W 223490 + 8 [kjournald] | |
673 | \end{verbatim} | |
674 | ||
675 | The header is the data in this line up to the 223490 (starting block). | |
676 | ||
677 | The default output for all event types includes this header. | |
678 | ||
679 | \paragraph{Default output per action} | |
680 | ||
681 | \begin{description} | |
682 | \item[C -- complete] If a payload is present, this is presented between | |
683 | parenthesis following the header, followed by the error value. | |
684 | ||
685 | If no payload is present, the sector and number of blocks are presented | |
686 | (with an intervening plus (+) character). If the \emph{-t} option | |
687 | was specified, then the elapsed time is presented. In either case, | |
688 | it is followed by the error value for the completion. | |
689 | ||
690 | \item[D -- issued] | |
691 | \item[I -- inserted] | |
692 | \item[Q -- queued] | |
693 | \item[W -- bounced] If a payload is present, the number of payload bytes | |
694 | is output, followed by the payload in hexadecimal between parenthesis. | |
695 | ||
696 | If no payload is present, the sector and number of blocks are presented | |
697 | (with an intervening plus (+) character). If the \emph{-t} option was | |
698 | specified, then the elapsed time is presented (in parenthesis). In | |
699 | either case, it is followed by the command associated with the event | |
700 | (surrounded by square brackets). | |
701 | ||
702 | \item[B -- back merge] | |
703 | \item[F -- front merge] | |
704 | \item[G -- get request] | |
705 | \item[M -- front or back merge] | |
706 | \item[S -- sleep] The starting sector and number of blocks is output | |
707 | (with an intervening plus (+) character), followed by the command | |
708 | associated with the event (surrounded by square brackets). | |
709 | ||
710 | \item[P -- plug] The command associated with the event (surrounded by | |
711 | square brackets) is output. | |
712 | ||
713 | \item[U -- unplug] | |
714 | \item[T -- unplug due to timer] The command associated with the event | |
715 | (surrounded by square brackets) is output, followed by the number of | |
716 | requests outstanding. | |
717 | ||
718 | \item[X -- split] The original starting sector followed by the new | |
719 | sector (separated by a slash (/) is output, followed by the command | |
720 | associated with the event (surrounded by square brackets). | |
721 | ||
a8f30e64 JA |
722 | \item[A -- remap] Sector and length is output, along with the original |
723 | device and sector offset. | |
724 | ||
ff8a07c9 JA |
725 | \end{description} |
726 | ||
727 | %------------------------------ | |
728 | \newpage | |
729 | \newpage\section*{\label{sec:blktrace-kg}Appendix: blktrace Kernel Guide} | |
730 | ||
731 | The blktrace facility provides an efficient event transfer mechanism which | |
732 | supplies block IO layer state transition data via the relay | |
733 | filesystem. This section provides some details as to the interfaces | |
734 | blktrace utilizes in the kernel to effect this. It is good background data | |
735 | to help understand some of the outputs and command-line options above. | |
736 | ||
737 | \subsection{blktrace.h Definitions} | |
738 | Files which include $<linux/blktrace.h>$ are supplied with the following | |
739 | definitions: | |
740 | ||
741 | \subsubsection{Trace Action Specifiers} | |
742 | \begin{tabular}{|l|l|}\hline | |
743 | BLK\_TA\_QUEUE & (RQ) Command queued to request\_queue. \\ | |
744 | & (BIO) Command queued by elevator. \\ \hline | |
745 | BLK\_TA\_BACKMERGE & Back merging elevator operation \\ \hline | |
746 | BLK\_TA\_FRONTMERGE & Front merging elevator operation \\ \hline | |
747 | BLK\_TA\_GETRQ & Free request retrieved. \\ \hline | |
748 | BLK\_TA\_SLEEPRQ & No requests available, device unplugged. \\ \hline | |
749 | BLK\_TA\_REQUEUE & Request requeued. \\ \hline | |
750 | BLK\_TA\_ISSUE & Command set to driver for request\_queue. \\ \hline | |
751 | BLK\_TA\_COMPLETE & Command completed by driver. \\ \hline | |
752 | BLK\_TA\_PLUG & Device is plugged \\ \hline | |
753 | BLK\_TA\_UNPLUG\_IO & Unplug device as IO is made available. \\ \hline | |
754 | BLK\_TA\_UNPLUG\_TIMER & Unplug device after timer expired. \\ \hline | |
755 | BLK\_TA\_INSERT & Insert request into queue. \\ \hline | |
756 | BLK\_TA\_SPLIT & BIO split into 2 or more requests. \\ \hline | |
757 | BLK\_TA\_BOUNCE & BIO was bounced \\ \hline | |
a8f30e64 | 758 | BLK\_TA\_REMAP & BIO was remapped \\ \hline |
ff8a07c9 JA |
759 | \end{tabular} |
760 | ||
761 | %.......................................... | |
762 | \subsection{blktrace.h Routines} | |
763 | Files which include $<linux/blktrace.h>$ are supplied with the following | |
764 | kernel routine invocable interfaces: | |
765 | ||
766 | \begin{description} | |
767 | \item[blk\_add\_trace\_rq(struct request\_queue *q, struct request\_queue | |
768 | *rq, u32 what)] | |
769 | Adds a trace event describing the state change of the passed in | |
770 | request\_queue. The \emph{what} parameter describes the change in | |
771 | the request\_queue state, and is one of the request queue action | |
772 | specifiers -- BLK\_TA\_QUEUE, BLK\_TA\_REQUEUE, BLK\_TA\_ISSUE, | |
773 | or BLK\_TA\_COMPLETE. | |
774 | ||
775 | \item[blk\_add\_trace\_bio(struct request\_queue *q, struct bio *bio, | |
776 | u32 what)] | |
777 | Adds a trace event for the BIO passed in. The \emph{what} parameter | |
778 | describes the action being performed on the BIO, and is one of | |
779 | BLK\_TA\_BACKMERGE, BLK\_TA\_FRONTMERGE, or BLK\_TA\_QUEUE. | |
780 | ||
781 | \item[blk\_add\_trace\_generic(struct request\_queue *q, struct bio *bio, | |
782 | int rw, u32 what)] | |
783 | Adds a \emph{generic} trace event -- not one of the request queue | |
784 | or BIO traces. The \emph{what} parameter describes the action being | |
785 | performed on the BIO (if bio is non-NULL), and is one of | |
786 | BLK\_TA\_PLUG, BLK\_TA\_GETRQ or BLK\_TA\_SLEEPRQ. | |
787 | ||
788 | \item[blk\_add\_trace\_pdu\_int(struct request\_queue *q, u32 what, | |
789 | u32 pdu)] | |
790 | Adds a trace with some payload data -- in this case, an unsigned | |
791 | 32-bit entity (the \emph{pdu} parameter). The \emph{what} parameter | |
792 | describes the nature of the payload, and is one of | |
793 | BLK\_TA\_UNPLUG\_IO or BLK\_TA\_UNPLUG\_TIMER. | |
794 | ||
a8f30e64 JA |
795 | \item[blk\_add\_trace\_remap(struct request\_queue *q, struct bio *bio, |
796 | dev\_t dev, sector\_t sector)] | |
797 | Adds a trace with a remap event. \emph{dev} and \emph{sector} denote | |
798 | the original device this \emph{bio} was mapped from. | |
799 | ||
800 | ||
ff8a07c9 JA |
801 | \end{description} |
802 | \end{document} |