cmdprio: add support for a new cmdprio_bssplit entry format Add support for a new cmdprio_bssplit format, while keeping support for the old format, by migrating to the split_parse_prio_ddir() parsing function. In this new format, a priority class and a priority level are defined inside each entry itself. Unlike the old format, the new format does not restrict all entries to share the same priority class and priority level. Therefore, this new format is very useful if you need to submit I/Os with multiple IO priority class + IO priority level combinations, e.g. when testing or verifying an IO scheduler. cmdprio will allocate a clat_prio_stat array that holds all unique priorities (including the default priority). Finally, it will set the clat_prio pointer in the struct thread_stat (td->ts.clat_prio) to the newly allocated array. We also add a clat_prio_stat index to io_u.h that records which array element (which priority value) this specific I/O was submitted with. The clat_prio_stat index will be used by the stat.c code to avoid a costly search for the correct array element on each and every add_sample(). Note that while this patch will send down the correct I/O pattern to the drive (potentially using multiple different priorities), it will not display the cmdprio_{bssplit,percentage} stats correctly until a later commit in the series (which changes stat.c to report clat stats on a per priority granularity). This was done to ease reviewing. Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20220203192814.18552-9-Niklas.Cassel@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
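As a rough illustration of the per-entry format, the Python sketch below parses entries, collects the unique priorities (including the default) and maps each priority to an array index, loosely mirroring the clat_prio_stat array and index described above. It assumes entries are written as `blocksize/percentage/class/level` separated by `:` and that a priority value is packed as `(class << 13) | level`, as in the Linux ioprio encoding. All function names are hypothetical; this is not fio's split_parse_prio_ddir().

```python
IOPRIO_CLASS_SHIFT = 13  # assumption: Linux ioprio layout, class in the upper bits


def parse_size(text):
    """Convert a size such as '4k' or '1m' to bytes (k/m suffixes only)."""
    units = {"k": 1024, "m": 1024 * 1024}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def parse_cmdprio_bssplit(value):
    """Parse 'bs/perc/class/level:...' into a list of entry dicts."""
    entries = []
    for item in value.split(":"):
        bs, perc, prio_class, level = item.split("/")
        entries.append({
            "bs": parse_size(bs),
            "perc": int(perc),
            "prio": (int(prio_class) << IOPRIO_CLASS_SHIFT) | int(level),
        })
    return entries


def unique_priorities(entries, default_prio=0):
    """Return the sorted unique priority values, including the default one."""
    return sorted({default_prio} | {e["prio"] for e in entries})


entries = parse_cmdprio_bssplit("4k/30/1/0:4k/20/2/1:64k/50/1/0")
prios = unique_priorities(entries)
# Precomputed index per priority, so no search is needed for each sample.
prio_index = {prio: idx for idx, prio in enumerate(prios)}
```

Storing the index alongside each I/O is what lets a per-sample lookup become a direct array access instead of a search.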
sg: improve sg_write_mode option names There is a name collision for the sg_write_mode options for the WRITE AND VERIFY and VERIFY commands. Deprecate the 'verify' option and use 'write_and_verify' instead. Do the same thing for 'same' and 'write_same' to have a consistent naming scheme. The original option names are still supported for backward compatibility but are listed as deprecated. Here are the new sg_write_mode options:

    Option                  SCSI command
    write                   WRITE (default)
    write_and_verify        WRITE AND VERIFY
    verify (deprecated)     WRITE AND VERIFY
    write_same              WRITE SAME
    same (deprecated)       WRITE SAME
    write_same_ndob         WRITE SAME with NDOB flag set
    verify_bytchk_00        VERIFY with BYTCHK set to 00
    verify_bytchk_01        VERIFY with BYTCHK set to 01
    verify_bytchk_11        VERIFY with BYTCHK set to 11

Signed-off-by: Vincent Fu <vincent.fu@samsung.com> Link: https://lore.kernel.org/r/20211115200807.117138-4-vincent.fu@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
sg: add support for WRITE SAME(16) commands with NDOB flag set Add the sg_write_mode option write_same_ndob to issue WRITE SAME(16) commands with the no data output buffer flag set. This flag is not supported for WRITE SAME(10). So all commands with this option will be WRITE SAME(16). Also include an example job file. Signed-off-by: Vincent Fu <vincent.fu@samsung.com> Link: https://lore.kernel.org/r/20211115200807.117138-3-vincent.fu@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
sg: add support for VERIFY command using write modes fio does not have an explicit verify data direction and creating a new data direction just for SCSI VERIFY commands probably is not worthwhile. The format of SCSI VERIFY commands matches that of write operations since VERIFY commands can include data transfer to the device. So it seems reasonable to have VERIFY commands be accounted for as write operations by fio. Use the sg_write_mode option to support SCSI VERIFY commands with different BYTCHK values.

    BYTCHK  Description
    00      No data is transferred to the device; device data is checked
    01      Device data is compared with data transferred to device
    11      Same as 01 except that only one sector of data is transferred to the device and each sector specified in the verification extent is compared against this transferred data.

Also update documentation and add a couple of example job files. Signed-off-by: Vincent Fu <vincent.fu@samsung.com> Link: https://lore.kernel.org/r/20211115200807.117138-2-vincent.fu@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
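To make the BYTCHK descriptions concrete, the Python sketch below computes how many bytes would be transferred to the device for a VERIFY covering a given number of sectors. The function name and the 512-byte default sector size are assumptions for illustration only; the sketch follows the three descriptions above and does not reflect the sg engine's code.

```python
def verify_transfer_bytes(bytchk, num_sectors, sector_size=512):
    """Bytes sent to the device for a VERIFY of num_sectors sectors.

    bytchk is the two-digit BYTCHK value as a string: '00', '01' or '11'.
    """
    if bytchk == "00":
        # No data is transferred; the device only checks its own data.
        return 0
    if bytchk == "01":
        # The whole verification extent is transferred and compared.
        return num_sectors * sector_size
    if bytchk == "11":
        # A single sector is transferred and compared against every sector.
        return sector_size
    raise ValueError(f"unsupported BYTCHK value: {bytchk!r}")


extent_sectors = 8
sizes = {mode: verify_transfer_bytes(mode, extent_sectors)
         for mode in ("00", "01", "11")}
```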
Support for alternate epochs in fio log files Add options log_alternate_epoch and log_alternate_epoch_clock_id. This is similar to the log_unix_epoch option. This resolves the issue raised in Issue #1314. log_alternate_epoch, if true, causes log files to use the same epoch used by the clock_id parameter to the unix clock_gettime function, where clock_id is specified by the log_alternate_epoch_clock_id option. This is particularly useful as it allows us to specify a clock id like CLOCK_MONOTONIC_RAW, which is natural for synchronizing log files between processes. The current log_unix_epoch is problematic for that purpose because that clock is not monotonic or continuous. It turns out that log_unix_epoch is actually equivalent to log_alternate_epoch with log_alternate_epoch_clock_id set to CLOCK_REALTIME=0. Since this is the default value of the log_alternate_epoch_clock_id option anyway, we treat log_alternate_epoch and log_unix_epoch as equivalent in functionality, retaining the latter to avoid breaking existing clients. Signed-off-by: Nick Neumann <nick@pcpartpicker.com>
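The difference between the two clocks can be seen from any process with clock_gettime. The Python sketch below reads timestamps relative to a given clock's epoch, so that a second process reading the same clock id gets comparable values. The helper name and the choice of milliseconds are assumptions for illustration, and the fallback to CLOCK_MONOTONIC exists only because CLOCK_MONOTONIC_RAW is not available on every platform.

```python
import time


def epoch_now_ms(clock_id):
    """Milliseconds elapsed since the epoch of the given clock."""
    return time.clock_gettime_ns(clock_id) // 1_000_000


# CLOCK_MONOTONIC_RAW is platform-specific; fall back where it is missing.
MONO = getattr(time, "CLOCK_MONOTONIC_RAW", time.CLOCK_MONOTONIC)

first = epoch_now_ms(MONO)
second = epoch_now_ms(MONO)              # never smaller than `first`
wall = epoch_now_ms(time.CLOCK_REALTIME)  # Unix epoch; may jump if the clock is adjusted
```

A consumer that wants to line up fio log entries with its own events would read the same clock id on its side and compare the values directly.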
fio: Improve documentation of ignore_zone_limits option In the manual pages, change the description of the option ignore_zone_limits to its action when set, instead of the confusing text describing what happens when it is not set. Also add the description of this option in the HOWTO file as it is missing. Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20211214012413.464798-2-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
docs: document quirky implementation of per priority stats reporting Commit 56440e63ac17 ("fio: report percentiles for slat, clat, lat") changed many things. One of the changes, from the commit message: "- for the new cmdprio_percentage latencies, if lat_percentiles=1, *total* latency percentiles will be tracked. Otherwise, *completion* latency percentiles will be tracked." In other words, the commit changed the per prio stats from always tracking (and reporting) clat latency, to instead either track (and report) clat or lat latency. Consider that a given latency type reports two things: 1) min/max/avg latency for the specific latency type 2) latency percentiles for the specific latency type If disable_clat/disable_lat is used, neither 1) nor 2) will be reported. If clat_percentiles/lat_percentiles is false, 2) will not be reported. Therefore it is unintuitive that setting lat_percentiles=1, an option usually used to enable/disable percentile reporting, also affects which type of latency will be tracked (and reported) for per prio stats. The fact that the variables are named e.g. clat_prio_stat, regardless of the type of latency being tracked, does not help. Anyway, let's document the way that the current implementation works, so that a user can know how per priority stats are handled, without having to read the source, since the commit that introduced this behavior forgot to update the documentation. Fixes: 56440e63ac17 ("fio: report percentiles for slat, clat, lat") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20211125132020.109955-2-Niklas.Cassel@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
fio: Introduce the log_entries option When iops, latency, or bw logging options are used, fio will by default log information for any I/O that completes. The initial number of I/O log entries is 1024, as defined by DEF_LOG_ENTRIES. When all log entries are used, new log entries are dynamically allocated by get_new_log(). This dynamic log entry allocation can negatively impact time-related statistics such as the I/O tail latencies (e.g. 99.9 percentile completion latency) as growing the logs causes a temporary I/O stall (IO quiesce), which disturbs the workload steady state. The effect of this is especially noticeable with workloads using IO priorities: the tail latencies of high priority I/Os increase if the IO log needs to be grown. For example, running the following fio command on a SATA disk supporting NCQ priority:

    fio --name=prio-randread --filename=/dev/sdg \
        --random_generator=tausworthe64 --ioscheduler=none \
        --write_lat_log=randread.log --log_prio=1 --rw=randread --bs=128k \
        --ioengine=libaio --iodepth=32 --direct=1 --cmdprio_class=1 \
        --cmdprio_percentage=30 --runtime=900

(128KB random read workload at QD=32 and 30% of commands issued with a high priority), with an initial number of log entries equal to the default of 1024, depending on the machine memory state, the completion latency statistics may show imprecise percentiles such as shown below.

    high prio (30.75%) clat percentiles (msec):
     |  1.00th=[   14],  5.00th=[   17], 10.00th=[   20], 20.00th=[   23],
     | 30.00th=[   27], 40.00th=[   32], 50.00th=[   40], 60.00th=[   53],
     | 70.00th=[   71], 80.00th=[  104], 90.00th=[  169], 95.00th=[  243],
     | 99.00th=[  514], 99.50th=[  676], 99.90th=[ 1485], 99.95th=[ 1502],
     | 99.99th=[ 1552]
    low prio (69.25%) clat percentiles (msec):
     |  1.00th=[   16],  5.00th=[   24], 10.00th=[   37], 20.00th=[   68],
     | 30.00th=[  105], 40.00th=[  146], 50.00th=[  199], 60.00th=[  255],
     | 70.00th=[  330], 80.00th=[  439], 90.00th=[  592], 95.00th=[  718],
     | 99.00th=[  885], 99.50th=[  986], 99.90th=[ 1469], 99.95th=[ 1536],
     | 99.99th=[ 1586]

All completion latency percentiles above the 99.90th percentile are similar for the high and low priority commands, which is not consistent with the expected drive execution of prioritized read commands. To solve this issue and get more precise latency statistics, this patch introduces the new "log_entries" option to allow specifying a larger initial number of IO log entries to avoid run-time allocation. This option value defaults to DEF_LOG_ENTRIES and its maximum value is MAX_LOG_ENTRIES to be consistent with get_new_log() allocation. Also simplify get_new_log() by using calloc() instead of malloc(), thus removing the need for the local variable new_size.

Adding the "--log_entries=65536" option to the previous command line example, the completion latency results obtained are more stable:

    high prio (30.72%) clat percentiles (msec):
     |  1.00th=[   15],  5.00th=[   17], 10.00th=[   19], 20.00th=[   22],
     | 30.00th=[   24], 40.00th=[   27], 50.00th=[   32], 60.00th=[   36],
     | 70.00th=[   46], 80.00th=[   57], 90.00th=[   81], 95.00th=[  105],
     | 99.00th=[  161], 99.50th=[  188], 99.90th=[  271], 99.95th=[  275],
     | 99.99th=[  363]
    low prio (69.28%) clat percentiles (msec):
     |  1.00th=[   16],  5.00th=[   27], 10.00th=[   43], 20.00th=[   80],
     | 30.00th=[  123], 40.00th=[  176], 50.00th=[  236], 60.00th=[  313],
     | 70.00th=[  401], 80.00th=[  506], 90.00th=[  634], 95.00th=[  718],
     | 99.00th=[  844], 99.50th=[  885], 99.90th=[  953], 99.95th=[  995],
     | 99.99th=[ 1053]

All completion percentiles now clearly show shorter latencies for high priority commands, as expected. The 99.99th percentile for low priority commands is also improved compared to the previous case as the measurements are not impacted by the dynamic log allocation. Suggested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20211118052729.132423-1-damien.lemoal@opensource.wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
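To get a feel for why a larger initial log helps, the Python sketch below counts how many times a log would have to grow before it can hold a given number of samples. It assumes, purely for illustration, that the capacity doubles on each growth and it ignores the MAX_LOG_ENTRIES cap; it is a simplified model and not a description of get_new_log(). The sample count is also made up.

```python
DEF_LOG_ENTRIES = 1024  # default initial number of log entries, per the text above


def growth_events(total_samples, initial_entries=DEF_LOG_ENTRIES):
    """Count capacity doublings needed to hold total_samples entries.

    Each doubling stands in for one run-time allocation, i.e. one
    potential I/O stall in the middle of the measurement.
    """
    capacity = initial_entries
    events = 0
    while capacity < total_samples:
        capacity *= 2
        events += 1
    return events


samples = 100_000  # hypothetical number of logged I/Os
with_default = growth_events(samples)
with_larger = growth_events(samples, initial_entries=65536)
```

Under these assumptions the default starting size needs seven growths while a 65536-entry start needs one, which is the kind of reduction in run-time allocations the option is meant to provide.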
docs: update cmdprio_percentage documentation Commit 1437d6357429 ("libaio,io_uring: relax cmdprio_percentage constraints") relaxed the cmdprio_percentage constraints such that cmdprio_percentage and prioclass/prio could be used together. However, it forgot to remove the mention of this constraint from the docs. Update the docs to reflect the new behavior. Fixes: 1437d6357429 ("libaio,io_uring: relax cmdprio_percentage constraints") Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> Link: https://lore.kernel.org/r/20211112095428.158300-2-Niklas.Cassel@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
options: Add thinktime_iotime option The thinktime option allows stalling a job for a specified amount of time. Using the thinktime_blocks option, periodic stalls can be added every thinktime_blocks IOs. However, with this option, the periodic stall may not be repeated at equal time intervals as the time to execute thinktime_blocks IOs may vary. To control the thinktime interval by time, introduce the option thinktime_iotime. With this new option, the thinktime stall is repeated after IOs are executed for thinktime_iotime. If this option is used together with the thinktime_blocks option, the thinktime pause is repeated after thinktime_iotime or after thinktime_blocks IOs, whichever happens first. To support the new option, add a new member thinktime_iotime in the struct thread_options and the struct thread_options_pack. Avoid a size increase of the struct thread_options_pack by replacing the padding 'pad5' with the new member. To keep thinktime related members close together, move the members near the position where the padding was placed. Make the same changes to the struct thread_options for consistency. To track the time and IO block count at the last stall, add the last_thinktime and last_thinktime_blocks variables to struct thread_data. Also, introduce the helper function init_thinktime() to group thinktime related preparations. Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
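The "whichever happens first" rule can be sketched as a small predicate. The Python function below is a hypothetical model that borrows the last_thinktime and last_thinktime_blocks names from the text above; the microsecond unit and the convention that a value of 0 disables a trigger are assumptions made for this sketch, not statements about fio's implementation.

```python
def should_stall(now_us, last_thinktime_us, blocks_done, last_thinktime_blocks,
                 thinktime_iotime_us=0, thinktime_blocks=0):
    """Return True when a thinktime stall is due.

    A stall is due once thinktime_iotime_us has elapsed since the last
    stall, or once thinktime_blocks I/Os have completed since the last
    stall, whichever happens first. A value of 0 disables that trigger.
    """
    by_time = (thinktime_iotime_us > 0
               and now_us - last_thinktime_us >= thinktime_iotime_us)
    by_blocks = (thinktime_blocks > 0
                 and blocks_done - last_thinktime_blocks >= thinktime_blocks)
    return by_time or by_blocks


# 2 ms of I/O time elapsed but only 3 of 8 blocks done: the time trigger fires.
due_by_time = should_stall(2_000, 0, 3, 0,
                           thinktime_iotime_us=1_000, thinktime_blocks=8)
# 8 blocks done after only 0.5 ms: the block trigger fires.
due_by_blocks = should_stall(500, 0, 8, 0,
                             thinktime_iotime_us=1_000, thinktime_blocks=8)
# Neither threshold reached yet.
not_due = should_stall(500, 0, 3, 0,
                       thinktime_iotime_us=1_000, thinktime_blocks=8)
```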
libaio,io_uring: introduce cmdprio_bssplit The cmdprio_percentage, cmdprio_class and cmdprio options allow specifying different values for read and write operations. This enables various IO priority issuing patterns even under a mixed read-write workload but does not allow differentiation within read and write I/O operation types with different sizes when the bssplit option is used. Introduce the cmdprio_bssplit option to complement the use of the bssplit option. This new option has the same format as the bssplit option, but the percentage values indicate the percentage of I/O operations with a particular block size that must be issued with the priority class and value specified by cmdprio_class and cmdprio. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
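A minimal Python sketch of the per-block-size decision follows. It assumes the bssplit-style syntax `blocksize/percentage` with entries separated by `:`, handles a single data direction only, and uses hypothetical helper names; the example values are invented. It models the idea of "this percentage of I/Os of this size gets the priority" and is not the engine code.

```python
import random


def parse_size(text):
    """Convert a size such as '64k' or '1m' to bytes (k/m suffixes only)."""
    units = {"k": 1024, "m": 1024 * 1024}
    text = text.strip().lower()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def parse_cmdprio_bssplit(value):
    """Parse 'bs/perc:bs/perc' into a {block_size_bytes: percentage} dict."""
    table = {}
    for item in value.split(":"):
        bs, perc = item.split("/")
        table[parse_size(bs)] = int(perc)
    return table


def use_cmdprio(block_size, table, rng):
    """Decide whether one I/O of block_size is issued with the priority."""
    # Block sizes without an entry are never issued with the priority.
    return rng.randrange(100) < table.get(block_size, 0)


table = parse_cmdprio_bssplit("64k/40:1m/100")
rng = random.Random(0)
all_large = all(use_cmdprio(1024 * 1024, table, rng) for _ in range(1000))
any_unlisted = any(use_cmdprio(4096, table, rng) for _ in range(1000))
```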
libaio,io_uring: introduce cmdprio_class and cmdprio options When the cmdprio_percentage option is used, the specified percentage of IO will be issued with the highest priority class IOPRIO_CLASS_RT. This priority class maps to the ATA NCQ "high" priority level and allows exercising a SATA device to measure its command latency characteristics in the presence of low and high priority commands. Besides ATA NCQ commands, Linux block IO schedulers also support IO priorities and will behave differently in the presence of IOs with different IO priority classes and values. However, cmdprio_percentage does not allow specifying all possible priority classes and values. To solve this, introduce libaio and io_uring engine specific options cmdprio_class and cmdprio. These new options are the equivalent of the prioclass and prio options and allow specifying the priority class and priority value to use for asynchronous I/Os when the cmdprio_percentage option is used. If not specified, the I/O priority class defaults to IOPRIO_CLASS_RT and the I/O priority value to 0, as before. Similarly to the cmdprio_percentage option, these options can specify different values for read and write I/Os using a comma separated list. The manpage, HOWTO and fiograph configuration file are updated to document these new options. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
libaio,io_uring: improve cmdprio_percentage option The cmdprio_percentage option of the libaio and io_uring engines defines a single percentage that applies to all IO operations, regardless of their direction. This prevents defining different high priority IO percentages for read and write operations. This differentiation can however be useful in the case of a mixed read-write workload (rwmixread and rwmixwrite options). Change the option definition to allow specifying a comma separated list of percentages, 2 at most, one for reads and one for writes. If only a single percentage is defined, it applies to both reads and writes as before. The cmdprio_percentage option becomes an array of DDIR_RWDIR_CNT elements indexed with enum fio_ddir values. The last entry of the array (for DDIR_TRIM) is always 0. Also create a new cmdprio helper file, engines/cmdprio.h, such that we can avoid code duplication between io_uring and libaio io engines. This helper file will be extended in subsequent patches. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
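The parsing rule described above (at most two comma separated percentages, a single value applying to both directions, and the trim slot always 0) can be sketched in Python as follows. The constants mirror the enum names mentioned in the text, but their numeric values and the function name are assumptions for this sketch; it is not the option parser used by fio.

```python
# Assumed numeric values for illustration; the text only names the enum.
DDIR_READ, DDIR_WRITE, DDIR_TRIM = 0, 1, 2
DDIR_RWDIR_CNT = 3


def parse_cmdprio_percentage(value):
    """Parse 'P' or 'Pread,Pwrite' into a per-direction percentage list."""
    parts = [p.strip() for p in value.split(",")]
    if len(parts) > 2:
        raise ValueError("at most two percentages: one for reads, one for writes")
    percentages = [int(p) for p in parts]
    if any(p < 0 or p > 100 for p in percentages):
        raise ValueError("percentages must be between 0 and 100")

    result = [0] * DDIR_RWDIR_CNT
    result[DDIR_READ] = percentages[0]
    # A single value applies to both reads and writes.
    result[DDIR_WRITE] = percentages[-1]
    # The DDIR_TRIM entry is left at 0.
    return result


both = parse_cmdprio_percentage("30")
split = parse_cmdprio_percentage("30,10")
```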