=====================
I/O statistics fields
=====================

Since 2.4.20 (and some versions before, with patches), and 2.5.45,
more extensive disk statistics have been introduced to help measure disk
activity. Tools such as ``sar`` and ``iostat`` typically interpret these and do
the work for you, but in case you are interested in creating your own
tools, the fields are explained here.

In 2.4, the information is found as additional fields in
``/proc/partitions``. In 2.6 and later, the same information is found in two
places: one is in the file ``/proc/diskstats``, and the other is within
the sysfs file system, which must be mounted in order to obtain
the information. Throughout this document we'll assume that sysfs
is mounted on ``/sys``, although of course it may be mounted anywhere.
Both ``/proc/diskstats`` and sysfs use the same source for the information
and so should not differ.

Here are examples of these different formats::

    2.4:
        3 0 39082680 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
        3 1 9221278 hda1 35486 0 35496 38030 0 0 0 0 0 38030 38030

    2.6+ sysfs:
        446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
        35486 38030 38030 38030

    2.6+ diskstats:
        3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160
        3 1 hda1 35486 38030 38030 38030

    4.18+ diskstats:
        3 0 hda 446216 784926 9550688 4382310 424847 312726 5922052 19310380 0 3376340 23705160 0 0 0 0

On 2.4 you might execute ``grep 'hda ' /proc/partitions``. On 2.6+, you have
a choice of ``cat /sys/block/hda/stat`` or ``grep 'hda ' /proc/diskstats``.

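For readers writing their own tools, the two 2.6+ layouts can be split along the following lines. This is only a minimal sketch: the names ``DiskStats``, ``parse_diskstats_line`` and ``parse_sysfs_stat`` are made up for illustration, and the input is the ``hda`` sample shown above rather than a live read of ``/proc/diskstats``.

```python
from typing import List, NamedTuple


class DiskStats(NamedTuple):
    """One /proc/diskstats line (hypothetical container, not a kernel type)."""
    major: int
    minor: int
    name: str
    fields: List[int]  # fields[0] is "Field 1" in the descriptions below


def parse_diskstats_line(line: str) -> DiskStats:
    """Split a /proc/diskstats line into device identity and counters."""
    parts = line.split()
    if len(parts) < 4:
        raise ValueError(f"not a diskstats line: {line!r}")
    return DiskStats(int(parts[0]), int(parts[1]), parts[2],
                     [int(p) for p in parts[3:]])


def parse_sysfs_stat(text: str) -> List[int]:
    """Parse the contents of /sys/block/<dev>/stat, which holds only counters."""
    return [int(p) for p in text.split()]


# Sample line copied from the "2.6+ diskstats" example above.
sample = ("3 0 hda 446216 784926 9550688 4382310 424847 312726 "
          "5922052 19310380 0 3376340 23705160")
hda = parse_diskstats_line(sample)
print(hda.name, hda.fields[0])  # prints "hda 446216"
```

Keeping the counters in a plain list means the same code accepts however many statistics fields the running kernel reports.
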
The sysfs file works well if you are watching a known, small set of
disks. ``/proc/diskstats`` may be a better choice if you are watching a
large number of disks because you'll avoid the overhead of 50, 100, or
500 or more opens/closes with each snapshot of your disk statistics.

In 2.4, the statistics fields are those after the device name. In
the above example, the first field of statistics would be 446216.
By contrast, in 2.6+ if you look at ``/sys/block/hda/stat``, you'll
find just the 15 fields, beginning with 446216. If you look at
``/proc/diskstats``, the 15 fields will be preceded by the major and
minor device numbers, and device name. Each of these formats provides
15 fields of statistics, each meaning exactly the same things. (The
flush counters described below as fields 16 and 17 follow them on
kernels that report them.)

All fields except field 9 are cumulative since boot. Field 9 should
go to zero as I/Os complete; all others only increase (unless they
overflow and wrap). Wrapping might eventually occur on a very busy
or long-lived system, so applications should be prepared to deal with
it. Regarding wrapping, the types of the fields are either unsigned
int (32 bit) or unsigned long (32-bit or 64-bit, depending on your
machine) as noted per-field below. Unless your observations are spread
very far apart in time, these fields should not wrap twice before you
notice it.

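One way an application might cope with wrapping is to take differences modulo the counter width. The sketch below assumes the counter wrapped at most once between the two samples; ``counter_delta`` is a hypothetical helper, and its ``bits`` argument is something the caller has to choose from the per-field types and the machine word size. It is not meant for field 9, which is not cumulative.

```python
def counter_delta(previous: int, current: int, bits: int = 32) -> int:
    """Difference between two samples of a counter that wraps at 2**bits.

    Only correct if the counter wrapped at most once between the samples.
    """
    return (current - previous) % (1 << bits)


# No wrap between samples: plain subtraction.
print(counter_delta(10, 25))          # prints 15
# A 32-bit counter that wrapped once between samples.
print(counter_delta(4294967290, 5))   # prints 11
```
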
Each set of stats only applies to the indicated device; if you want
system-wide stats you'll have to find all the devices and sum them all up.

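Summing over devices could look roughly like this. The sketch takes the set of whole-disk names from the caller, so that partition lines are not added on top of their parent disk. The function name ``system_wide_total`` and the ``hdb`` line are invented for the example; the ``hda`` and ``hda1`` lines come from the samples above.

```python
from typing import Iterable


def system_wide_total(diskstats_text: str, disks: Iterable[str], field: int) -> int:
    """Sum statistics field `field` (1-based) over the named devices."""
    wanted = set(disks)
    total = 0
    for line in diskstats_text.splitlines():
        parts = line.split()
        # Skip malformed lines and devices the caller did not ask for.
        if len(parts) < 4 or parts[2] not in wanted:
            continue
        stats = parts[3:]
        if field <= len(stats):
            total += int(stats[field - 1])
    return total


sample = "\n".join([
    "3 0 hda 446216 784926 9550688 4382310 424847 312726 "
    "5922052 19310380 0 3376340 23705160",
    "3 1 hda1 35486 38030 38030 38030",
    # Hypothetical second disk, made up for this example.
    "3 64 hdb 100 0 800 50 0 0 0 0 0 50 50",
])
print(system_wide_total(sample, {"hda", "hdb"}, 1))  # prints 446316
```
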
Field  1 -- # of reads completed (unsigned long)
    This is the total number of reads completed successfully.

Field  2 -- # of reads merged, field 6 -- # of writes merged (unsigned long)
    Reads and writes which are adjacent to each other may be merged for
    efficiency. Thus two 4K reads may become one 8K read before it is
    ultimately handed to the disk, and so it will be counted (and queued)
    as only one I/O. This field lets you know how often this was done.

Field  3 -- # of sectors read (unsigned long)
    This is the total number of sectors read successfully.

Field  4 -- # of milliseconds spent reading (unsigned int)
    This is the total number of milliseconds spent by all reads (as
    measured from blk_mq_alloc_request() to __blk_mq_end_request()).

Field  5 -- # of writes completed (unsigned long)
    This is the total number of writes completed successfully.

Field  6 -- # of writes merged (unsigned long)
    See the description of field 2.

Field  7 -- # of sectors written (unsigned long)
    This is the total number of sectors written successfully.

Field  8 -- # of milliseconds spent writing (unsigned int)
    This is the total number of milliseconds spent by all writes (as
    measured from blk_mq_alloc_request() to __blk_mq_end_request()).

Field  9 -- # of I/Os currently in progress (unsigned int)
    The only field that should go to zero. Incremented as requests are
    given to the appropriate struct request_queue and decremented as
    they finish.

Field 10 -- # of milliseconds spent doing I/Os (unsigned int)
    This field increases so long as field 9 is nonzero.

    Since 5.0 this field counts jiffies when at least one request was
    started or completed. If a request runs for more than 2 jiffies, some
    I/O time might not be accounted for when requests are concurrent.

Field 11 -- weighted # of milliseconds spent doing I/Os (unsigned int)
    This field is incremented at each I/O start, I/O completion, I/O
    merge, or read of these stats by the number of I/Os in progress
    (field 9) times the number of milliseconds spent doing I/O since the
    last update of this field. This can provide an easy measure of both
    I/O completion time and the backlog that may be accumulating.

Field 12 -- # of discards completed (unsigned long)
    This is the total number of discards completed successfully.

Field 13 -- # of discards merged (unsigned long)
    See the description of field 2.

Field 14 -- # of sectors discarded (unsigned long)
    This is the total number of sectors discarded successfully.

Field 15 -- # of milliseconds spent discarding (unsigned int)
    This is the total number of milliseconds spent by all discards (as
    measured from blk_mq_alloc_request() to __blk_mq_end_request()).

Field 16 -- # of flush requests completed
    This is the total number of flush requests completed successfully.

    The block layer combines flush requests and executes at most one at
    a time. This counts flush requests executed by the disk. It is not
    tracked for partitions.

Field 17 -- # of milliseconds spent flushing
    This is the total number of milliseconds spent by all flush requests.

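Because the fields are cumulative, a monitoring tool usually works on the difference between two snapshots. The sketch below shows one common way to turn fields 1, 4, 10 and 11 into rates. The function name ``derived_metrics``, the dictionary keys and the second snapshot are assumptions made for illustration, and wrapping is ignored to keep the example short.

```python
from typing import Dict, List


def derived_metrics(prev: List[int], curr: List[int],
                    interval_ms: float) -> Dict[str, float]:
    """Rates from two snapshots of one device (Field 1 is at index 0)."""
    def delta(field: int) -> int:
        return curr[field - 1] - prev[field - 1]

    reads = delta(1)
    return {
        # Fraction of the interval during which I/O was in progress (field 10).
        "utilization": delta(10) / interval_ms,
        # Average number of I/Os in progress over the interval (field 11).
        "avg_in_flight": delta(11) / interval_ms,
        # Average milliseconds per completed read (field 4 over field 1).
        "avg_read_ms": delta(4) / reads if reads else 0.0,
    }


# First snapshot: the hda sample from the examples above.
prev = [446216, 784926, 9550688, 4382310, 424847, 312726,
        5922052, 19310380, 0, 3376340, 23705160]
# Second snapshot: hypothetical values one second later.
curr = [446316, 784926, 9551488, 4382810, 424847, 312726,
        5922052, 19310380, 0, 3376590, 23705910]
print(derived_metrics(prev, curr, 1000.0))
# prints {'utilization': 0.25, 'avg_in_flight': 0.75, 'avg_read_ms': 5.0}
```
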
To avoid introducing performance bottlenecks, no locks are held while
modifying these counters. This implies that minor inaccuracies may be
introduced when changes collide, so (for instance) adding up all the
read I/Os issued per partition should equal those made to the disks ...
but due to the lack of locking it may only be very close.

In 2.6+, there are counters for each CPU, which make the lack of locking
almost a non-issue. When the statistics are read, the per-CPU counters
are summed (possibly overflowing the unsigned long variable they are
summed to) and the result given to the user. There is no convenient
user interface for accessing the per-CPU counters themselves.

Since 4.19, request times are measured with nanosecond precision and
truncated to milliseconds before being shown in this interface.

Disks vs Partitions
-------------------

There were significant changes between 2.4 and 2.6+ in the I/O subsystem.
As a result, some statistical information disappeared. The translation from
a disk address relative to a partition to the disk address relative to
the host disk happens much earlier. All merges and timings now happen
at the disk level rather than at both the disk and partition level as
in 2.4. Consequently, you'll see a different statistics output on 2.6+ for
partitions from that for disks. There are only *four* fields available
for partitions on 2.6+ machines. This is reflected in the examples above.

Field  1 -- # of reads issued
    This is the total number of reads issued to this partition.

Field  2 -- # of sectors read
    This is the total number of sectors requested to be read from this
    partition.

Field  3 -- # of writes issued
    This is the total number of writes issued to this partition.

Field  4 -- # of sectors written
    This is the total number of sectors requested to be written to
    this partition.

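A tool that has to handle this four-field partition layout might label the values as follows. This is a sketch only: ``parse_partition_stats`` and the dictionary keys are hypothetical names, and the input is the ``hda1`` line from the 2.6+ examples above.

```python
from typing import Dict, List

# Labels for the four partition fields described above (names are made up).
PARTITION_KEYS = ("reads_issued", "sectors_read",
                  "writes_issued", "sectors_written")


def parse_partition_stats(fields: List[int]) -> Dict[str, int]:
    """Label the four statistics fields of a 2.6+ partition line."""
    if len(fields) != len(PARTITION_KEYS):
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    return dict(zip(PARTITION_KEYS, fields))


# Counters from the "3 1 hda1 ..." sample line.
hda1 = parse_partition_stats([35486, 38030, 38030, 38030])
print(hda1["reads_issued"], hda1["sectors_read"])  # prints "35486 38030"
```
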
Note that since the address is translated to a disk-relative one, and no
record of the partition-relative address is kept, the subsequent success
or failure of the read cannot be attributed to the partition. In other
words, the number of reads for partitions is counted slightly before the
time of queuing for partitions, and at completion for whole disks. This is
a subtle distinction that is probably uninteresting for most cases.

More significant is the error induced by counting the numbers of
reads/writes before merges for partitions and after for disks. Since a
typical workload usually contains a lot of successive and adjacent requests,
the number of reads/writes issued can be several times higher than the
number of reads/writes completed.

In 2.6.25, the full statistic set is again available for partitions, and
disk and partition statistics are consistent again. Since we still don't
keep a record of the partition-relative address, an operation is attributed
to the partition which contains the first sector of the request after the
eventual merges. As requests can be merged across partitions, this could
lead to some (probably insignificant) inaccuracy.

Additional notes
----------------

In 2.6+, sysfs is not mounted by default. If your distribution of
Linux hasn't added it already, here's the line you'll want to add to
your ``/etc/fstab``::

    none /sys sysfs defaults 0 0

In 2.6+, all disk statistics were removed from ``/proc/stat``. In 2.4, they
appear in both ``/proc/partitions`` and ``/proc/stat``, although the ones in
``/proc/stat`` take a very different format from those in ``/proc/partitions``
(see proc(5), if your system has it).

-- ricklind@us.ibm.com