.TH fiologparser_hist.py 1 "August 18, 2016"
.SH NAME
fiologparser_hist.py \- Calculate statistics from fio histograms
.SH SYNOPSIS
.B fiologparser_hist.py
[\fIoptions\fR] [clat_hist_files]...
.SH DESCRIPTION
.B fiologparser_hist.py
is a utility for converting *_clat_hist* files
generated by fio into a CSV of latency statistics including minimum,
average, maximum latency, and selectable percentiles.
.SH EXAMPLES
.PP
.nf
$ fiologparser_hist.py *_clat_hist*
end-time, samples, min, avg, median, 90%, 95%, 99%, max
1000, 15, 192, 1678.107, 1788.859, 1856.076, 1880.040, 1899.208, 1888.000
2000, 43, 152, 1642.368, 1714.099, 1816.659, 1845.552, 1888.131, 1888.000
4000, 39, 1152, 1546.962, 1545.785, 1627.192, 1640.019, 1691.204, 1744
\[char46]..
.fi
.PP
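The output is plain CSV with a space after each comma. As a minimal sketch
(Python standard library only, not part of fio), it can be loaded with the
\fBcsv\fR module, using \fBskipinitialspace\fR to drop the leading blanks.
The rows below are copied from the example above:
.RS
.nf
```python
import csv

# Header and data rows copied from the example output above.
lines = [
    "end-time, samples, min, avg, median, 90%, 95%, 99%, max",
    "1000, 15, 192, 1678.107, 1788.859, 1856.076, 1880.040, 1899.208, 1888.000",
    "2000, 43, 152, 1642.368, 1714.099, 1816.659, 1845.552, 1888.131, 1888.000",
    "4000, 39, 1152, 1546.962, 1545.785, 1627.192, 1640.019, 1691.204, 1744",
]

# skipinitialspace drops the blank that follows each comma.
reader = csv.DictReader(lines, skipinitialspace=True)
rows = [{key: float(value) for key, value in row.items()} for row in reader]

for row in rows:
    print(int(row["end-time"]), int(row["samples"]), row["99%"])
```
.fi
.RE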

.SH OPTIONS
.TP
.BR \-\-help
Print these options.
.TP
.BR \-\-buff_size \fR=\fPint
Number of samples to buffer into numpy at a time. Defaults to 10,000.
Adjusting this value may improve performance.
.TP
.BR \-\-max_latency \fR=\fPint
Number of seconds of data to process at a time. Defaults to 20 seconds,
in order to handle the 17 second upper bound on latency in histograms
reported by fio. This should be increased if fio has been
run with a larger maximum latency. Lowering this when a lower maximum
latency is known can improve performance. See NOTES for more details.
.TP
.BR \-i ", " \-\-interval \fR=\fPint
Interval at which statistics are reported. Defaults to 1000 ms. This
should be set to at least the value of \fBlog_hist_msec\fR given
to fio.
.TP
.BR \-\-noweight
Do not perform weighting of samples between output intervals. Default is False.
.TP
.BR \-d ", " \-\-divisor \fR=\fPint
Divide statistics by this value. Defaults to 1. Useful if you want to
convert latencies from milliseconds to seconds (\fBdivisor\fR=\fP1000\fR).
.TP
.BR \-\-warn
Enables warning messages printed to stderr, useful for debugging.
.TP
.BR \-\-group_nr \fR=\fPint
Set this to the value of \fIFIO_IO_U_PLAT_GROUP_NR\fR as defined in
\fIstat.h\fR if fio has been recompiled with a different value. Defaults to 19,
the current value used in fio. See NOTES for more details.
.TP
.BR \-\-percentiles \fR=\fPstr
Comma- or colon-separated list of percentiles to print.
The default is "90.0:95.0:99.0"; the min, median (50%) and max are always printed.
.TP
.BR \-\-usbin
Indicate that the histogram bin latency values are in microseconds.
The default is nanoseconds; histogram logs from fio versions <= 2.99 are in
microseconds and need this option.
.TP
.BR \-\-directions \fR=\fPstr
By default, the histogram bins of all directions (e.g. read and write) are
combined, producing a single 'mixed' result.
To produce independent per-direction results, pass any combination of the
characters \(aqrwtm\(aq (read, write, trim, mixed), e.g. \fB\-\-directions\fR=\fIrw\fR.
A \(aqdir\(aq column is then added, indicating the direction of each row.

.SH NOTES
End times are calculated as uniform increments of the \fB\-\-interval\fR value,
regardless of when histogram samples are reported. Of note:

.RS
Intervals with no samples are omitted. In the example above this means
"no statistics from 2 to 3 seconds" and "39 samples influenced the statistics
of the interval from 3 to 4 seconds".
.LP
Intervals with a single sample will have the same value for all statistics.
.RE
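.PP
As an illustration only, the following Python sketch buckets sample
timestamps (in ms) by interval end time. It is a simplification: it assumes
each sample belongs solely to the interval containing its timestamp, whereas
this tool also weights samples across intervals (see \fB\-\-noweight\fR).
The function names are hypothetical. Intervals without samples never appear
as keys, matching the omission described above:
.RS
.nf
```python
from collections import Counter

def interval_end(ts_ms, interval_ms=1000):
    # Round a timestamp up to the next multiple of the interval; a timestamp
    # exactly on a boundary belongs to the interval ending there.
    return -(-ts_ms // interval_ms) * interval_ms

def samples_per_interval(timestamps_ms, interval_ms=1000):
    # Count samples per interval end time. Empty intervals are never
    # created, so they are simply absent from the result.
    counts = Counter(interval_end(ts, interval_ms) for ts in timestamps_ms)
    return dict(sorted(counts.items()))

print(samples_per_interval([500, 900, 1500, 3100, 3900]))
# {1000: 2, 2000: 1, 4000: 2}
```
.fi
.RE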

.PP
The number of samples is unweighted, corresponding to the total number of samples
which have any effect whatsoever on the interval.

Min statistics are computed using the value of the lower boundary of the first bin
(in increasing bin order) with non-zero samples in it. Similarly, for max
we take the upper boundary of the last bin with non-zero samples in it.
This is semantically identical to taking the 0th and 100th percentiles with a
50% bin-width buffer (because percentiles are computed using the mid-points of
the bins). This guarantees the following properties:

.RS
min <= 50th <= 90th <= 95th <= 99th <= max
.LP
min and max are strict lower and upper bounds on the actual
min / max seen by fio (and reported in *_clat.* with averaging turned off).
.RE
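.PP
A minimal sketch of the min / max rule, assuming the per-bin lower and upper
boundaries and the bin counts are already available as Python lists (the
function name and the four-bin histogram are made up for illustration):
.RS
.nf
```python
def min_max(lower_edges, upper_edges, counts):
    # Indices of the bins that actually contain samples.
    nonzero = [i for i, c in enumerate(counts) if c > 0]
    if not nonzero:
        return None  # empty histogram: no statistics for this interval
    # min: lower boundary of the first non-empty bin;
    # max: upper boundary of the last non-empty bin.
    return lower_edges[nonzero[0]], upper_edges[nonzero[-1]]

# Made-up four-bin histogram with boundaries 0, 10, 20, 30, 40.
lower = [0, 10, 20, 30]
upper = [10, 20, 30, 40]
counts = [0, 3, 5, 0]
print(min_max(lower, upper, counts))  # (10, 30)
```
.fi
.RE
.PP
Because percentiles are taken at the bin mid-points (15 and 25 here), every
percentile falls between the returned min and max.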

.PP
Average statistics use a standard weighted arithmetic mean.
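.PP
For example, a weighted mean over bin mid-points with each bin's height as its
weight. Using mid-points and raw heights is an assumption for illustration;
the tool additionally folds in the per-sample time weights described below:
.RS
.nf
```python
def weighted_mean(values, weights):
    # Standard weighted arithmetic mean: sum(w * v) / sum(w).
    total = sum(weights)
    if total == 0:
        raise ValueError("all weights are zero")
    return sum(w * v for v, w in zip(values, weights)) / total

# Made-up bin mid-points (values) and bin heights (weights).
midpoints = [15.0, 25.0]
heights = [3, 5]
print(weighted_mean(midpoints, heights))  # 21.25
```
.fi
.RE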

When the \fB\-\-noweight\fR option is not given (the default),
percentile statistics are computed using the weighted percentile method
described here: \fIhttps://en.wikipedia.org/wiki/Percentile#Weighted_percentile\fR.
See the weights() function in the script for details on how weights are computed
for individual samples. In process_interval() we further multiply by the height
of each bin to get weighted histograms.
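.PP
The sketch below is written from the Wikipedia method linked above rather than
taken from this tool's source, and uses only the Python standard library. It
assumes each histogram sample covers a time span [sample_start, sample_end] and
weights it by the fraction of that span inside the output interval;
overlap_weight() and weighted_percentile() are hypothetical names:
.RS
.nf
```python
import bisect

def overlap_weight(sample_start, sample_end, start, end):
    # Fraction of the sample's time span that lies inside [start, end].
    if sample_end <= sample_start:
        return 0.0  # zero-length sample: ignore it
    lo = max(sample_start, start)
    hi = min(sample_end, end)
    return max(0.0, hi - lo) / (sample_end - sample_start)

def weighted_percentile(percs, values, weights):
    # Sort by value, dropping zero-weight entries so ranks stay distinct.
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0)
    if not pairs:
        raise ValueError("no samples with positive weight")
    total = sum(w for _, w in pairs)
    # Percent rank of each value: p_n = 100 * (S_n - w_n / 2) / S_N.
    positions, running = [], 0.0
    for _, w in pairs:
        running += w
        positions.append(100.0 * (running - w / 2.0) / total)
    vs = [v for v, _ in pairs]
    result = []
    for p in percs:
        if p <= positions[0]:
            result.append(vs[0])
        elif p >= positions[-1]:
            result.append(vs[-1])
        else:
            # Linear interpolation between the two bracketing ranks.
            i = bisect.bisect_left(positions, p)
            x0, x1 = positions[i - 1], positions[i]
            result.append(vs[i - 1] + (vs[i] - vs[i - 1]) * (p - x0) / (x1 - x0))
    return result

# A sample spanning 0-1000 ms, weighted against the interval 500-1500 ms.
w = overlap_weight(0, 1000, 500, 1500)  # 0.5
# Bin mid-points and bin heights, each height scaled by the time weight.
midpoints, heights = [15.0, 25.0], [3, 5]
print(weighted_percentile([10, 50, 99], midpoints, [h * w for h in heights]))
# [15.0, 21.25, 25.0]
```
.fi
.RE
.PP
Here the sample contributes half its weight to the interval; scaling every
weight by the same factor leaves the percentiles unchanged.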

Files given on the command line are assumed to be fio histogram files.
An individual histogram file can contain the histograms for multiple
r/w directions (notably when fio is run with \fB\-\-rw\fR=\fPrandrw\fR).
Each r/w direction is tracked separately; by default, the reported statistics
merge \fIall\fR histograms regardless of r/w direction (see \fB\-\-directions\fR).
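.PP
A sketch of the per-direction bookkeeping, assuming each log row has already
been reduced to a (direction code, list of bin counts) pair and that fio encodes
directions as 0 (read), 1 (write) and 2 (trim). split_by_direction() is a
hypothetical helper in which "m" stands for the merged result:
.RS
.nf
```python
# Assumed fio direction codes: 0 = read, 1 = write, 2 = trim.
DIR_NAMES = {0: "r", 1: "w", 2: "t"}

def split_by_direction(rows, directions="m"):
    """rows: (ddir, bins) pairs for one interval, bins being lists of counts.

    Returns a dict mapping each requested direction character to the
    element-wise sum of the matching histograms; "m" sums all of them.
    """
    result = {}
    for d in directions:
        selected = [bins for ddir, bins in rows if d == "m" or DIR_NAMES.get(ddir) == d]
        if selected:
            result[d] = [sum(col) for col in zip(*selected)]
    return result

rows = [(0, [1, 2, 0]), (1, [0, 3, 4]), (0, [2, 0, 1])]
print(split_by_direction(rows))        # {'m': [3, 5, 5]}
print(split_by_direction(rows, "rw"))  # {'r': [3, 2, 1], 'w': [0, 3, 4]}
```
.fi
.RE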

The value of *_GROUP_NR in \fIstat.h\fR (and *_BITS) determines how many latency bins
fio outputs when histogramming is enabled. Namely, for the current default of
GROUP_NR=19, we get 1,216 bins with a maximum latency of approximately 17
seconds. For certain applications this may not be sufficient. With GROUP_NR=24
we have 1,536 bins, giving us a maximum latency of 541 seconds (~ 9 minutes). If
you expect your application to experience latencies greater than 17 seconds,
you will need to recompile fio with a larger GROUP_NR, e.g. with:

.RS
.PP
.nf
sed -i.bak 's/^#define FIO_IO_U_PLAT_GROUP_NR 19/#define FIO_IO_U_PLAT_GROUP_NR 24/' stat.h
make fio
.fi
.PP
.RE

.PP
Quick reference table for the max latency corresponding to a sampling of
values for GROUP_NR:

.RS
.PP
.nf
GROUP_NR | # bins | max latency bin value
19       | 1216   | 16.9 sec
20       | 1280   | 33.8 sec
21       | 1344   | 67.6 sec
22       | 1408   | 2  min, 15 sec
23       | 1472   | 4  min, 32 sec
24       | 1536   | 9  min, 4  sec
25       | 1600   | 18 min, 8  sec
26       | 1664   | 36 min, 16 sec
.fi
.PP
.RE
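.PP
The bin values themselves come from fio's plat_idx_to_val() in \fIstat.c\fR.
The Python sketch below mirrors that mapping, assuming the default
FIO_IO_U_PLAT_BITS of 6 (64 bins per group) and microsecond bin values as in
fio versions <= 2.99; the edge parameter is an addition for illustration. It
reports the mid-point of the last bin, about 16.7 seconds for GROUP_NR=19 (the
exact figure depends on whether the mid-point or upper boundary is taken):
.RS
.nf
```python
PLAT_BITS = 6                # FIO_IO_U_PLAT_BITS (assumed default)
PLAT_VAL = 1 << PLAT_BITS    # FIO_IO_U_PLAT_VAL: 64 bins per group

def num_bins(group_nr):
    # Every group contributes PLAT_VAL bins.
    return group_nr * PLAT_VAL

def plat_idx_to_val(idx, edge=0.5):
    # Latency of bin idx: edge=0 is the lower boundary, edge=1 the upper
    # boundary and edge=0.5 the mid-point (fio itself returns the mid-point).
    if idx < (PLAT_VAL << 1):
        return float(idx)  # first two groups: one bin per unit value
    error_bits = (idx >> PLAT_BITS) - 1
    base = 1 << (error_bits + PLAT_BITS)
    k = idx % PLAT_VAL
    return base + (k + edge) * (1 << error_bits)

group_nr = 19
last = num_bins(group_nr) - 1
seconds = plat_idx_to_val(last) / 1e6  # microseconds to seconds
print(group_nr, num_bins(group_nr), round(seconds, 1), "sec")  # 19 1216 16.7 sec
```
.fi
.RE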

.PP
At present this program automatically detects the number of histogram bins in
the log files and adjusts the bin latency values accordingly. In particular, if
you use the \fB\-\-log_hist_coarseness\fR parameter of fio, you get output files with
a number of bins according to the following table (note that the first
row matches the # bins column of the table above):

.RS
.PP
.nf
coarse \\ GROUP_NR
        19     20    21     22     23     24     25     26
   -------------------------------------------------------
  0  [[ 1216,  1280,  1344,  1408,  1472,  1536,  1600,  1664],
  1   [  608,   640,   672,   704,   736,   768,   800,   832],
  2   [  304,   320,   336,   352,   368,   384,   400,   416],
  3   [  152,   160,   168,   176,   184,   192,   200,   208],
  4   [   76,    80,    84,    88,    92,    96,   100,   104],
  5   [   38,    40,    42,    44,    46,    48,    50,    52],
  6   [   19,    20,    21,    22,    23,    24,    25,    26],
  7   [  N/A,    10,   N/A,    11,   N/A,    12,   N/A,    13],
  8   [  N/A,     5,   N/A,   N/A,   N/A,     6,   N/A,   N/A]]
.fi
.PP
.RE

.PP
For other values of GROUP_NR and coarseness, this table can be computed like this:

.RS
.PP
.nf
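import numpy as np
bins = [1216, 1280, 1344, 1408, 1472, 1536, 1600, 1664]  # GROUP_NR 19 to 26
max_coarse = 8
def fncn(z):
    # Bins left after each coarseness step; NaN where 2**x does not divide z
    return [z / 2**x if z % 2**x == 0 else np.nan for x in range(max_coarse + 1)]
print(np.transpose([fncn(z) for z in bins]))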
.fi
.PP
.RE
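.PP
The automatic detection can be sketched in Python as well. This sketch assumes
each coarseness step sums pairs of adjacent bins (consistent with the halving
in the table above) and that the full bin count is GROUP_NR * 64;
infer_coarseness() and coarsen() are hypothetical names, not functions of this tool:
.RS
.nf
```python
PLAT_VAL = 64  # bins per group (FIO_IO_U_PLAT_VAL, assumed default)

def infer_coarseness(n_bins, group_nr=19):
    # Each coarseness step halves the bin count, so the observed count must
    # be the full count divided by a power of two.
    full = group_nr * PLAT_VAL
    if n_bins <= 0 or full % n_bins != 0:
        raise ValueError(f"{n_bins} bins do not fit GROUP_NR={group_nr}")
    ratio = full // n_bins
    if ratio & (ratio - 1):
        raise ValueError(f"{n_bins} bins do not fit GROUP_NR={group_nr}")
    return ratio.bit_length() - 1  # log2(ratio)

def coarsen(bins, coarseness):
    # Sum each run of 2**coarseness adjacent bins into a single bin.
    stride = 1 << coarseness
    return [sum(bins[i:i + stride]) for i in range(0, len(bins), stride)]

print(infer_coarseness(608))                 # 1
print(infer_coarseness(5, group_nr=20))      # 8
print(coarsen([1, 2, 3, 4, 5, 6, 7, 8], 2))  # [10, 26]
```
.fi
.RE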

.PP
If you have not adjusted GROUP_NR for your (high latency) application, then you
will see the percentiles computed by this tool max out at the max latency bin
value as in the first table above, and in this plot (where GROUP_NR=19 and thus we see
a max latency of ~16.7 seconds in the red line):

.RS
\fIhttps://www.cronburg.com/fio/max_latency_bin_value_bug.png\fR
.RE

.PP
The motivation, design decisions and implementation process are
described in further detail here:

.RS
\fIhttps://www.cronburg.com/fio/cloud-latency-problem-measurement/\fR
.RE

.SH AUTHOR
.B fiologparser_hist.py
and this manual page were written by Karl Cronburg <karl.cronburg@gmail.com>.
.SH "REPORTING BUGS"
Report bugs to the \fBfio\fR mailing list <fio@vger.kernel.org>.