path: root/engines
Age | Commit message | Author
10 days | engines/xnvme: add xnvme engine | Ankit Kumar
This patch introduces a new fio engine to work with xNVMe >= 0.2.0. xNVMe provides a user space library (libxnvme) to work with NVMe devices. The NVMe driver being used by libxnvme is re-targetable and can be any one of the GNU/Linux Kernel NVMe driver via libaio, IOCTLs, io_uring, the SPDK NVMe driver, or your own custom NVMe driver. For more info visit Co-Authored-By: Ankit Kumar <> Co-Authored-By: Simon A. F. Lund <> Co-Authored-By: Mads Ynddal <> Co-Authored-By: Michael Bang <> Co-Authored-By: Karl Bonde Torp <> Co-Authored-By: Gurmeet Singh <> Co-Authored-By: Pierre Labat <> Link: Signed-off-by: Jens Axboe <>
2022-03-30 | Rename 'fallthrough' attribute to 'fio_fallthrough' | Jens Axboe
fallthrough is reserved in C++, so this causes issues with C++ programs pulling in the fio.h -> compiler.h header. Rename it to something fio specific instead. Signed-off-by: Jens Axboe <>
2022-03-20 | engines/null: use correct -include | Jens Axboe
Fixes: cef0a8357b3f ("engines/null: update external engine compilation") Signed-off-by: Jens Axboe <>
2022-03-20 | engines/null: update external engine compilation | Jens Axboe
Everything needs to include config-host.h, and make sure that the C++ side uses the right type for the queue op. Fixes: Signed-off-by: Jens Axboe <>
2022-02-26 | windowsaio: open file for write if we have syncs | Jens Axboe
Windows wants the file opened for write if we do a file sync, so ensure we do that if we have syncs. Fixes: Signed-off-by: Jens Axboe <>
2022-02-24 | Fix three compiler warnings | Bart Van Assche
Fix three occurrences of the following clang compiler warning: warning: suggest braces around initialization of subobject [-Wmissing-braces] Signed-off-by: Bart Van Assche <>
2022-02-21 | io_uring: use syscall helpers for the hot path | Jens Axboe
The only real hot system call here is the io_uring_enter(2) call, as that'll happen during the IO submission/completion parts. The rest are just setup function calls, we don't really care about those. Signed-off-by: Jens Axboe <>
2022-02-20 | Spelling and grammar fixes | Ville Skyttä
Signed-off-by: Ville Skyttä <>
2022-02-18 | rpma: update RPMA engines with new librpma completions API | Oksana Salyk
The API of librpma has been changed between v0.10.0 and v0.12.0 and fio has to be updated. Signed-off-by: Oksana Salyk <>
2022-02-03 | stat: report clat stats on a per priority granularity | Niklas Cassel
Convert the stat code to report clat stats on a per priority granularity, rather than simply supporting high/low priority. This is made possible by using the new clat_prio_stat array (per ddir), together with the clat_prio_stat index which is saved in each io_u.

The per priority samples are only printed when there are samples for more than one priority in the clat_prio_stat array. If there are only samples for one priority, that means that all I/Os were submitted using the same priority, so there is no need to print them.

For example, running the following fio command:

  fio --name=test --filename=/dev/sdc --direct=1 --runtime=60 --rw=randread \
      --ioengine=io_uring --ioscheduler=mq-deadline --iodepth=32 --bs=32k \
      --prioclass=2 --prio=7 --cmdprio_bssplit=32k/20/3/0:32k/10/1/4

Now results in the following output:

  test: (groupid=0, jobs=1): err= 0: pid=465655: Tue Feb 1 02:24:47 2022
    read: IOPS=146, BW=4695KiB/s (4808kB/s)(276MiB/60239msec)
      slat (usec): min=18, max=335, avg=62.87, stdev=22.59
      clat (msec): min=2, max=2135, avg=217.97, stdev=287.26
       lat (msec): min=2, max=2135, avg=218.03, stdev=287.26
      clat prio 2/7 (msec): min=3, max=606, avg=106.57, stdev=86.64
      clat prio 3/0 (msec): min=10, max=2135, avg=664.94, stdev=339.42
      clat prio 1/4 (msec): min=2, max=300, avg=52.29, stdev=42.52
      clat percentiles (msec):
       | 1.00th=[ 8], 5.00th=[ 14], 10.00th=[ 19], 20.00th=[ 33],
       | 30.00th=[ 52], 40.00th=[ 77], 50.00th=[ 108], 60.00th=[ 144],
       | 70.00th=[ 192], 80.00th=[ 300], 90.00th=[ 684], 95.00th=[ 911],
       | 99.00th=[ 1234], 99.50th=[ 1318], 99.90th=[ 1687], 99.95th=[ 1770],
       | 99.99th=[ 2140]
      clat prio 2/7 (69.25% of IOs) percentiles (msec):
       | 1.00th=[ 7], 5.00th=[ 13], 10.00th=[ 17], 20.00th=[ 28],
       | 30.00th=[ 44], 40.00th=[ 64], 50.00th=[ 85], 60.00th=[ 111],
       | 70.00th=[ 140], 80.00th=[ 174], 90.00th=[ 226], 95.00th=[ 279],
       | 99.00th=[ 368], 99.50th=[ 418], 99.90th=[ 502], 99.95th=[ 567],
       | 99.99th=[ 609]
      clat prio 3/0 (20.91% of IOs) percentiles (msec):
       | 1.00th=[ 44], 5.00th=[ 138], 10.00th=[ 205], 20.00th=[ 347],
       | 30.00th=[ 464], 40.00th=[ 558], 50.00th=[ 659], 60.00th=[ 760],
       | 70.00th=[ 860], 80.00th=[ 961], 90.00th=[ 1099], 95.00th=[ 1217],
       | 99.00th=[ 1485], 99.50th=[ 1687], 99.90th=[ 1871], 99.95th=[ 2140],
       | 99.99th=[ 2140]
      clat prio 1/4 (9.84% of IOs) percentiles (msec):
       | 1.00th=[ 7], 5.00th=[ 10], 10.00th=[ 13], 20.00th=[ 18],
       | 30.00th=[ 24], 40.00th=[ 30], 50.00th=[ 39], 60.00th=[ 51],
       | 70.00th=[ 63], 80.00th=[ 84], 90.00th=[ 114], 95.00th=[ 136],
       | 99.00th=[ 188], 99.50th=[ 197], 99.90th=[ 300], 99.95th=[ 300],
       | 99.99th=[ 300]
     bw ( KiB/s): min= 3456, max= 5888, per=100.00%, avg=4697.60, stdev=472.38, samples=120
     iops : min= 108, max= 184, avg=146.80, stdev=14.76, samples=120
    lat (msec) : 4=0.11%, 10=2.57%, 20=8.67%, 50=18.21%, 100=18.34%
    lat (msec) : 250=28.87%, 500=9.41%, 750=5.22%, 1000=5.09%, 2000=3.50%
    lat (msec) : >=2000=0.01%
    cpu : usr=0.16%, sys=0.97%, ctx=17715, majf=0, minf=262
    IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.6%, >=64=0.0%
       submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
       issued rwts: total=8839,0,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency : target=0, window=0, percentile=100.00%, depth=32

Signed-off-by: Niklas Cassel <> Link: Signed-off-by: Jens Axboe <>
2022-02-03 | cmdprio: add support for a new cmdprio_bssplit entry format | Niklas Cassel
Add support for a new cmdprio_bssplit format, while keeping support for the old format, by migrating to the split_parse_prio_ddir() parsing function. In this new format, a priority class and priority level are defined inside each entry itself. Unlike the old format, the new format does not restrict all entries to share the same priority class and priority level. Therefore, this new format is very useful if you need to submit I/Os with multiple IO priority class + IO priority level combinations, e.g. when testing or verifying an IO scheduler. cmdprio will allocate a clat_prio_stat array that holds all unique priorities (including the default priority). Finally, it will set the clat_prio pointer in the struct thread_stat (td->ts.clat_prio) to the newly allocated array. We also add a clat_prio_stat index to io_u.h, which records which array element (which priority value) this specific I/O was submitted with. The clat_prio_stat index will be used by the stat.c code to avoid a costly search operation to find the correct array element to use, for each and every add_sample(). Note that while this patch will send down the correct I/O pattern to the drive (potentially using multiple different priorities), it will not display the cmdprio_{bssplit,percentage} stats correctly until a later commit in the series (which changes stat.c to report clat stats on a per priority granularity). This was done to ease reviewing. Signed-off-by: Niklas Cassel <> Reviewed-by: Damien Le Moal <> Link: Signed-off-by: Jens Axboe <>
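To illustrate the new entry format, here is a minimal Python sketch. It is not fio's split_parse_prio_ddir() implementation. The field order blocksize/percentage/class/level is inferred from the example --cmdprio_bssplit=32k/20/3/0:32k/10/1/4 used in this series; the function names and the dictionary layout are hypothetical, and only a single data direction is handled.

```python
def parse_size(text):
    """Convert a block size such as '32k' or '4096' to bytes (1k = 1024 assumed)."""
    multipliers = {"k": 1024, "m": 1024 ** 2}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return int(text[:-1]) * multipliers[text[-1]]
    return int(text)


def parse_cmdprio_bssplit(value):
    """Parse ':'-separated 'blocksize/percentage/class/level' entries."""
    entries = []
    for entry in value.split(":"):
        fields = entry.split("/")
        if len(fields) != 4:
            raise ValueError("expected blocksize/percentage/class/level: %r" % entry)
        bs, percentage, prio_class, prio_level = fields
        entries.append({
            "bs": parse_size(bs),
            "percentage": int(percentage),
            "class": int(prio_class),
            "level": int(prio_level),
        })
    return entries


def unique_priorities(entries, default_prio):
    """Collect every distinct (class, level) pair, default priority included."""
    seen = [default_prio]
    for e in entries:
        pair = (e["class"], e["level"])
        if pair not in seen:
            seen.append(pair)
    return seen


entries = parse_cmdprio_bssplit("32k/20/3/0:32k/10/1/4")
print(unique_priorities(entries, default_prio=(2, 7)))
```

The list returned by unique_priorities() plays the role of the per-priority array described above: one slot per distinct priority, so each I/O only needs to remember an index into it.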
2022-02-03 | Merge branch 'master' of Axboe
* 'master' of
  Added a new windows only IO engine option “no_completion_thread”.
  Add Windows support for --server.
  Avoid client calls to recv() without prior poll()
2022-02-03 | Added a new windows only IO engine option “no_completion_thread”. | james rizzo
Without this option, Windows FIO creates a completion polling thread for each worker thread. This also requires an event queue for the completion thread to forward completions to the worker thread. Polling directly improves performance and better matches the linuxaio engine model. Signed-off-by: james rizzo <>
2022-01-26 | rpma: add support for File System DAX | Wang, Long
File System DAX is handled in a different way than Device DAX:
1) In case of File System DAX, each thread uses a separate file from this file system and no offset is needed. In case of Device DAX, each thread uses a separate offset within the same Device DAX.
2) File System DAX requires rpma_mr_advise(3)(ibv_advise_mr(3)) to be called for the registered memory to avoid page faults and degraded performance.
Ref: Signed-off-by: Wang, Long <>
2022-01-18 | sg: allow fio to open and close streams for WRITE STREAM(16) commands | Vincent Fu
If --stream_id=0 then fio will open a stream for WRITE STREAM(16) commands and close the stream when the device file is closed.

Example:

  ./fio --name=test --filename=/dev/sdb --ioengine=sg --number_ios=1 --debug=file,io --sg_write_mode=write_stream --rw=randwrite
  fio: set debug option file
  fio: set debug option io
  test: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=sg, iodepth=1
  fio-3.27
  Starting 1 process
  file 1072297 setup files
  file 1072297 get file size for 0x7f0306fa5110/0//dev/sdb
  file 1072307 trying file /dev/sdb 290
  file 1072307 fd open /dev/sdb
  file 1072307 file not found in hash /dev/sdb
  file 1072307 sgio_stream_control: opened stream 1
  file 1072307 get file /dev/sdb, ref=0
  io 1072307 drop page cache /dev/sdb
  file 1072307 goodf=1, badf=2, ff=2b1
  file 1072307 get_next_file_rr: 0x7f0306fa5110
  file 1072307 get_next_file: 0x7f0306fa5110 [/dev/sdb]
  file 1072307 get file /dev/sdb, ref=1
  io 1072307 fill: io_u 0xb55700: off=0x35ef554000,len=0x1000,ddir=1,file=/dev/sdb
  io 1072307 prep: io_u 0xb55700: off=0x35ef554000,len=0x1000,ddir=1,file=/dev/sdb
  io 1072307 prep: io_u 0xb55700: ret=0
  io 1072307 queue: io_u 0xb55700: off=0x35ef554000,len=0x1000,ddir=1,file=/dev/sdb
  io 1072307 complete: io_u 0xb55700: off=0x35ef554000,len=0x1000,ddir=1,file=/dev/sdb
  file 1072307 put file /dev/sdb, ref=2
  file 1072307 close files
  file 1072307 put file /dev/sdb, ref=1
  file 1072307 sgio_stream_control: closed stream 1
  file 1072307 fd close /dev/sdb
  io 1072307 close ioengine sg
  io 1072307 free ioengine sg
  test: (groupid=0, jobs=1): err= 0: pid=1072307: Mon Aug 16 14:25:45 2021
    write: IOPS=200, BW=800KiB/s (819kB/s)(4096B/5msec); 0 zone resets
      clat (nsec): min=93339, max=93339, avg=93339.00, stdev= 0.00
       lat (nsec): min=96201, max=96201, avg=96201.00, stdev= 0.00
      clat percentiles (nsec):
       | 1.00th=[93696], 5.00th=[93696], 10.00th=[93696], 20.00th=[93696],
       | 30.00th=[93696], 40.00th=[93696], 50.00th=[93696], 60.00th=[93696],
       | 70.00th=[93696], 80.00th=[93696], 90.00th=[93696], 95.00th=[93696],
       | 99.00th=[93696], 99.50th=[93696], 99.90th=[93696], 99.95th=[93696],
       | 99.99th=[93696]
    lat (usec) : 100=100.00%
    cpu : usr=100.00%, sys=0.00%, ctx=2, majf=0, minf=20
    IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
       submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
       issued rwts: total=0,1,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency : target=0, window=0, percentile=100.00%, depth=1

  Run status group 0 (all jobs):
    WRITE: bw=800KiB/s (819kB/s), 800KiB/s-800KiB/s (819kB/s-819kB/s), io=4096B (4096B), run=5-5msec

Signed-off-by: Vincent Fu <> Link: Signed-off-by: Jens Axboe <>
2022-01-18 | sg: add support for WRITE STREAM(16) commands | Vincent Fu
Add the "write_stream" option to sg_write_mode to send WRITE STREAM(16) commands. Use the new stream_id option to set the stream identifier.

Example:

  sg_stream_ctl -o /dev/sdb
  Assigned stream id: 1
  ./fio --name=test --filename=/dev/sdb --ioengine=sg --sg_write_mode=write_stream --stream_id=1 --rw=randwrite --time_based --runtime=10s
  ...
  sg_stream_ctl -c --id=1 /dev/sdb

Signed-off-by: Vincent Fu <> Link: Signed-off-by: Jens Axboe <>
2022-01-18 | sg: improve sg_write_mode option names | Vincent Fu
There is a name collision for the sg_write_mode options for the WRITE AND VERIFY and VERIFY commands. Deprecate the 'verify' option and use 'write_and_verify' instead. Do the same thing for 'same' and 'write_same' to have a consistent naming scheme. The original option names are still supported for backward compatibility but are listed as deprecated.

Here are the new sg_write_mode options:

  Option                SCSI command
  write                 WRITE (default)
  write_and_verify      WRITE AND VERIFY
  verify (deprecated)   WRITE AND VERIFY
  write_same            WRITE SAME
  same (deprecated)     WRITE SAME
  write_same_ndob       WRITE SAME with NDOB flag set
  verify_bytchk_00      VERIFY with BYTCHK set to 00
  verify_bytchk_01      VERIFY with BYTCHK set to 01
  verify_bytchk_11      VERIFY with BYTCHK set to 11

Signed-off-by: Vincent Fu <> Link: Signed-off-by: Jens Axboe <>
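The naming scheme above can be modelled as a lookup table with deprecated aliases. The following Python sketch is only an illustration of that mapping, not the option-parsing code in engines/sg.c; the names SG_WRITE_MODES, DEPRECATED_ALIASES and resolve_write_mode are hypothetical.

```python
# Option name -> SCSI command, as listed in the commit message.
SG_WRITE_MODES = {
    "write": "WRITE",
    "write_and_verify": "WRITE AND VERIFY",
    "write_same": "WRITE SAME",
    "write_same_ndob": "WRITE SAME with NDOB flag set",
    "verify_bytchk_00": "VERIFY with BYTCHK set to 00",
    "verify_bytchk_01": "VERIFY with BYTCHK set to 01",
    "verify_bytchk_11": "VERIFY with BYTCHK set to 11",
}

# Old names kept for backward compatibility -> preferred replacement.
DEPRECATED_ALIASES = {
    "verify": "write_and_verify",
    "same": "write_same",
}


def resolve_write_mode(option="write"):
    """Return (scsi_command, replacement); replacement is None unless deprecated."""
    replacement = DEPRECATED_ALIASES.get(option)
    canonical = replacement or option
    if canonical not in SG_WRITE_MODES:
        raise ValueError("unknown sg_write_mode: %r" % option)
    return SG_WRITE_MODES[canonical], replacement


print(resolve_write_mode("verify"))
print(resolve_write_mode("verify_bytchk_01"))
```

Keeping the aliases in a separate table makes the collision explicit: 'verify' still resolves to WRITE AND VERIFY, while the VERIFY command itself is only reachable through the verify_bytchk_* names.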
2022-01-18 | sg: add support for WRITE SAME(16) commands with NDOB flag set | Vincent Fu
Add the sg_write_mode option write_same_ndob to issue WRITE SAME(16) commands with the no data output buffer flag set. This flag is not supported for WRITE SAME(10). So all commands with this option will be WRITE SAME(16). Also include an example job file. Signed-off-by: Vincent Fu <> Link: Signed-off-by: Jens Axboe <>
2022-01-18 | sg: add support for VERIFY command using write modes | Vincent Fu
fio does not have an explicit verify data direction and creating a new data direction just for SCSI VERIFY commands probably is not worthwhile. The format of SCSI VERIFY commands matches that of write operations since VERIFY commands can include data transfer to the device. So it seems reasonable to have VERIFY commands be accounted for as write operations by fio. Use the sg_write_mode option to support SCSI VERIFY commands with different BYTCHK values.

  BYTCHK  Description
  00      No data is transferred to the device; device data is checked
  01      Device data is compared with data transferred to device
  11      Same as 01 except that only one sector of data is transferred to the device and each sector specified in the verification extent is compared against this transferred data.

Also update the documentation and add a couple of example job files. Signed-off-by: Vincent Fu <> Link: Signed-off-by: Jens Axboe <>
2022-01-09 | engines/io_uring: don't set CQSIZE clamp unconditionally | Jens Axboe
For older kernels without IORING_SETUP_CQSIZE, we'll get EINVAL if we set it. Just retry the ring setup if that happens. Link: Signed-off-by: Jens Axboe <>
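The retry-on-EINVAL pattern can be sketched as follows. This is a Python simulation under stated assumptions, not the engine's C code: the two *_kernel_setup functions are hypothetical stand-ins for the ring setup call, and only the IORING_SETUP_CQSIZE value (1 << 3) is taken from the Linux io_uring header.

```python
import errno

IORING_SETUP_CQSIZE = 1 << 3  # value from the Linux io_uring UAPI header


def old_kernel_setup(depth, flags, cq_entries=None):
    """Hypothetical kernel that predates IORING_SETUP_CQSIZE."""
    if flags & IORING_SETUP_CQSIZE:
        raise OSError(errno.EINVAL, "unsupported setup flag")
    return {"sq_entries": depth, "cq_entries": 2 * depth}


def new_kernel_setup(depth, flags, cq_entries=None):
    """Hypothetical kernel that honours the requested CQ size."""
    if flags & IORING_SETUP_CQSIZE:
        return {"sq_entries": depth, "cq_entries": cq_entries}
    return {"sq_entries": depth, "cq_entries": 2 * depth}


def setup_ring(setup_fn, depth):
    """Try to clamp the CQ ring to the SQ size; retry unclamped on EINVAL."""
    try:
        return setup_fn(depth, IORING_SETUP_CQSIZE, cq_entries=depth)
    except OSError as err:
        if err.errno != errno.EINVAL:
            raise  # any other failure is a real error
    return setup_fn(depth, 0)


print(setup_ring(new_kernel_setup, 32))
print(setup_ring(old_kernel_setup, 32))
```

On the simulated old kernel the ring still comes up, just with the default CQ ring of twice the SQ size.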
2021-12-09 | ioengines: libzbc: disable libzbc block backend driver | Damien Le Moal
libzbc includes 3 different internal backend drivers:
1) The block backend: this backend relies on the kernel SMR support and uses regular system calls.
2) The SCSI backend: this is a SG passthrough driver for SAS drives and for SATA drives accessible through an SMR compliant SAT (SCSI-to-ATA translation layer).
3) The ATA backend: this is a SG passthrough driver for SATA drives not handled by the system SAT (either kernel or HBA SAT).

libzbc automatically selects the internal backend driver, using the first one that is detected as functional (tested in the same order shown above). When running on an SMR enabled system (SMR compliant HBA and kernel with zoned block device support enabled), any fio job using the libzbc IO engine will thus end up using the regular kernel IO path. This is silly: for this IO path, the libaio or psync IO engines are far better (less overhead and more functionalities). The libzbc IO engine should be restricted to be a passthrough engine only, similarly to the sg engine.

Fix the libzbc engine to not allow the use of libzbc block backend driver by removing the ZBC_O_DRV_BLOCK flag when opening the device. Also adjust the test script t/zbd/run-tests-against-nullb to remove the -l option to force the use of the libzbc IO engine as it will not work anymore (since the nullb device is neither a SCSI nor an ATA device).

Signed-off-by: Damien Le Moal <> Link: Signed-off-by: Jens Axboe <>
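The first-functional-backend selection described above, and the effect of excluding the block backend, can be sketched in Python. This models the behaviour only; it does not call libzbc, and the names BACKEND_ORDER and select_backend are hypothetical.

```python
# Probe order as described in the commit message.
BACKEND_ORDER = ["block", "scsi", "ata"]


def select_backend(functional, allowed):
    """Return the first backend that is both allowed and functional, else None."""
    for name in BACKEND_ORDER:
        if name in allowed and name in functional:
            return name
    return None


all_backends = {"block", "scsi", "ata"}
passthrough_only = all_backends - {"block"}  # models dropping the block driver flag

# SMR-enabled system: both the block and SCSI paths work for the drive.
print(select_backend({"block", "scsi"}, all_backends))
print(select_backend({"block", "scsi"}, passthrough_only))
# nullb device: neither SCSI nor ATA, so nothing is left to select.
print(select_backend({"block"}, passthrough_only))
```

The last call shows why the nullb test script can no longer force the libzbc engine: with the block backend excluded, no driver matches.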
2021-11-20 | io_uring: clamp CQ size to SQ size | Jens Axboe
By default, io_uring uses twice as big a CQ ring as the SQ ring. That's to help with cases where completions can come in unexpectedly. This is not the case for storage IO, so just clamp the CQ size to save a bit of memory on the CQEs and CQ ring. Signed-off-by: Jens Axboe <>
2021-11-12 | libaio,io_uring: make it possible to cleanup cmdprio malloced data | Niklas Cassel
The way that fio currently handles engine options: options_free() will call free() only for options that have the type FIO_OPT_STR_STORE. This means that any option that has a pointer in either td->o or td->eo, which is not of type FIO_OPT_STR_STORE will leak memory. This is true even for numjobs == 1. When running with numjobs > 1, fio_options_mem_dupe() will memcpy td->eo into the new td. Since off1 of the pointers in the first td has already been set, the pointers in the new td will point to the same data. (Regardless, options_free() will never try to free the memory, for neither td.) Neither can we manually free the memory in cleanup(), since the other td will still point to the same memory, so this would lead to a double free. These memory leaks are reported by e.g. valgrind. The most obvious way to solve this is to put dynamically allocated memory in {ioring,libaio}_data instead of {ioring,libaio}_options. This solves the problem since {ioring,libaio}_data is dynamically allocated by each td during the ioengine init callback, and is freed when the ioengine cleanup callback for that td is called. The downside of this is that the parsing has to be done in fio_cmdprio_init() instead of in the option .cb callback, since the .cb callback is called before {ioring,libaio}_data is available. This patch keeps the static cmdprio options in {ioring,libaio}_options, but moves the dynamically allocated memory needed by cmdprio to {ioring,libaio}_data. No cmdprio related memory leaks are reported after this patch. Signed-off-by: Niklas Cassel <> Link: Signed-off-by: Jens Axboe <>
2021-11-12 | cmdprio: add mode to make the logic easier to reason about | Niklas Cassel
Add a new field "mode", in order to know if we are determining IO priorities according to cmdprio_percentage or to cmdprio_bssplit. This makes the logic easier to reason about, and allows us to remove the "use_cmdprio" variable from the ioengines themselves. Signed-off-by: Niklas Cassel <> Reviewed-by: Damien Le Moal <> Link: Signed-off-by: Jens Axboe <>
2021-11-12 | libaio,io_uring: move common cmdprio_prep() code to cmdprio | Niklas Cassel
Move common cmdprio_prep() code to cmdprio.c to avoid code duplication. Signed-off-by: Niklas Cassel <> Reviewed-by: Damien Le Moal <> Link: Signed-off-by: Jens Axboe <>
2021-11-12 | libaio,io_uring: rename prio_prep() to include cmdprio in the name | Niklas Cassel
The default priority (which is either 0 or the value set by the "prio" and "prioclass" options) will now be used regardless of whether prio_prep() is called or not. This is true for both libaio and io_uring. The way to think about it is that prio_prep() is only called if cmdprio_percentage/cmdprio_bssplit is used. prio_prep() might then override the default priority, if the random value happens to say that this I/O should use the cmdprio_value, rather than the default priority. Rename the prio_prep() functions to highlight that these functions are now only called if cmdprio is used. (If only option "prio"/"prioclass" is used, that is handled elsewhere.) Signed-off-by: Niklas Cassel <> Reviewed-by: Damien Le Moal <> Link: Signed-off-by: Jens Axboe <>
2021-11-12 | io_uring: set async IO priority to td->ioprio in fio_ioring_prep() | Niklas Cassel
The default priority (which is either 0 or the value set by "prio" and "prioclass" options) is now saved in td->ioprio. The simplest thing is therefore to unconditionally set the async IO priority to td->ioprio in fio_ioring_prep(), and let fio_ioring_prio_prep() only handle the case where cmdprio_percentage/cmdprio_bssplit is enabled. Therefore, fio_ioring_prio_prep() doesn't need to care if prio/prioclass was enabled or not, we can simply think that fio_ioring_prio_prep() might "override" the default priority, whatever the default priority may be. Doing it this way also has the advantage that the prio_prep() function in io_uring will now look identical to the prio_prep() function in libaio. Signed-off-by: Niklas Cassel <> Reviewed-by: Damien Le Moal <> Link: Signed-off-by: Jens Axboe <>
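The "default first, cmdprio may override" logic can be sketched in a few lines of Python. This is an illustration of the idea, not the code in fio_ioring_prep(); the function name and the use of randrange(100) as the per-I/O random draw are assumptions.

```python
import random


def ioprio_for_io(default_ioprio, cmdprio_value, cmdprio_percentage, rng):
    """Pick the priority for one I/O."""
    # Step 1: always start from the job default (td->ioprio in the source).
    ioprio = default_ioprio
    # Step 2: only when cmdprio is in use, a random draw may override it.
    if cmdprio_percentage and rng.randrange(100) < cmdprio_percentage:
        ioprio = cmdprio_value
    return ioprio


rng = random.Random(1)
default_ioprio = (2 << 13) | 7   # class 2, level 7
cmdprio_value = (1 << 13) | 0    # class 1, level 0
picked = [ioprio_for_io(default_ioprio, cmdprio_value, 20, rng) for _ in range(1000)]
print(picked.count(cmdprio_value) / len(picked))
```

Because step 1 is unconditional, the override path never needs to know whether prio/prioclass was set; it only replaces whatever the default happens to be.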
2021-11-12 | cmdprio: do not allocate memory for unused data direction | Niklas Cassel
All cmdprio options only support data directions read and write. However, each cmdprio option allocates memory for ddir trim as well, even though nothing is ever written to this memory. Change this so that we don't allocate memory for something which is never used. Signed-off-by: Niklas Cassel <> Reviewed-by: Damien Le Moal <> Link: Signed-off-by: Jens Axboe <>
2021-11-12 | cmdprio: move cmdprio function definitions to a new cmdprio.c file | Niklas Cassel
Move cmdprio function definitions from the cmdprio.h header file to a new cmdprio.c file, such that we can add new static functions to cmdprio.c. A follow up patch will add new cmdprio functions which do not need to be directly accessible by ioengines. Signed-off-by: Niklas Cassel <> Reviewed-by: Damien Le Moal <> Link: Signed-off-by: Jens Axboe <>
2021-10-16 | engines/http.c: add fallthrough annotation to _curl_trace | Rebecca Cran
To avoid the warning from clang "warning: unannotated fall-through between switch labels [-Wimplicit-fallthrough]" swap the "fall through" comment with the "fallthrough;" annotation from compiler.h. Since the second "fall through" comment isn't really a new fall-through, remove it. Signed-off-by: Rebecca Cran <> Link: Signed-off-by: Jens Axboe <>
2021-09-08 | engines/sg: Removing useless variable assignment | Erwan Velu
ret is set to -1 but the break statement will not use this value. So let's remove this useless assignment which could be confusing. Signed-off-by: Erwan Velu <>
2021-09-08 | engines/sg: Return error if generic_close_file fails | Erwan Velu
The current code returned 1 if generic_close_file() failed. The ret value already holds the real error, so return it instead, as the generic_open_file() error handling does. Signed-off-by: Erwan Velu <>
2021-09-03 | fio: Introduce the log_prio option | Damien Le Moal
Introduce the log_prio option to expand priority logging from a single bit of information (priority high vs low) to the full priority value used to execute IOs. When this option is set, the priority value is printed as a 16-bit hexadecimal value combining the I/O priority class and priority level as defined by the ioprio_value() helper. Similarly to the log_offset option, this option does not result in actual I/O priority logging when log_avg_msec is set.

This patch also fixes a problem with the IO_U_F_PRIORITY flag, namely that this flag is used to indicate that the IO is being executed with a high priority on the device while at the same time indicating how to account for the IO completion latency (high_prio clat vs low_prio clat). With the introduction of the cmdprio_class and cmdprio options, these assumptions are not necessarily compatible anymore. These problems are addressed as follows:
* The priority_bit field of struct iosample is replaced with the 16-bit priority field representing the full io_u->ioprio value. When log_prio is set, the priority field value is logged as is. When log_prio is not set, 1 is logged as the entry's priority field if the sample priority class is IOPRIO_CLASS_RT, and 0 otherwise.
* IO_U_F_PRIORITY is renamed to IO_U_F_HIGH_PRIO to indicate that a job IO has the highest priority within the job context and so must be accounted as such using high_prio clat.

While fio final statistics only show accounting of high vs low IO completion latency statistics, the log_prio option allows a user to perform more detailed statistical analysis of a workload using multiple different IO priorities.

Signed-off-by: Damien Le Moal <> Signed-off-by: Niklas Cassel <> Signed-off-by: Jens Axboe <>
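As a rough illustration of the logging rule, the Python sketch below computes the priority field of a log entry. It is not fio's logging code: the function name is hypothetical and the exact "0x%04x" output format is an assumption. The 13-bit class shift follows the ioprio_value() definition for Linux, and IOPRIO_CLASS_RT = 1 is the Linux value.

```python
IOPRIO_CLASS_SHIFT = 13   # class is stored above the 13-bit priority level
IOPRIO_CLASS_RT = 1       # Linux real-time I/O priority class


def priority_log_field(ioprio, log_prio):
    """Return the priority column of one log sample as a string."""
    if log_prio:
        # Full 16-bit value: class and level combined (format is an assumption).
        return "0x%04x" % (ioprio & 0xFFFF)
    # Legacy behaviour: a single bit, set only for the RT class.
    prio_class = ioprio >> IOPRIO_CLASS_SHIFT
    return "1" if prio_class == IOPRIO_CLASS_RT else "0"


rt_prio = (1 << IOPRIO_CLASS_SHIFT) | 0   # class 1 (RT), level 0
be_prio = (2 << IOPRIO_CLASS_SHIFT) | 7   # class 2, level 7
for prio in (rt_prio, be_prio):
    print(priority_log_field(prio, log_prio=False), priority_log_field(prio, log_prio=True))
```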
2021-09-03 | libaio,io_uring: relax cmdprio_percentage constraints | Damien Le Moal
In fio, a job IO priority is controlled with the prioclass and prio options and these options cannot be used together with the cmdprio_percentage option. Allow a user to have async IO priorities default to the job defined IO priority by removing the mutual exclusion between the options cmdprio_percentage and prioclass/prio. With the introduction of the cmdprio_class option, an async IO priority may be lower than the job default priority, resulting in reversed clat statistics shown for high and low priority IOs when fio completes. Solve this by setting an io_u IO_U_F_PRIORITY flag depending on a comparison between the async IO priority and job default IO priority. When an async IO is issued without a priority set, the Linux kernel will execute it using the IO priority of the issuing context set with ioprio_set(). This works fine for libaio, where the context will be the same as the context that submitted the IO. However, io_uring can be used with a kernel thread that performs block device IO submissions (sqthread_poll). Therefore, for io_uring, an IO sqe ioprio field must be set to the job default priority unless the IO priority is set according to the job cmdprio_percentage value. Because of this, io_uring already set sqe->ioprio even when only prio/prioclass was used. See commit b7ed2a862dda ("io_uring: set sqe iopriority, if prio/prioclass is set"). In order to make the code easier to maintain, handle all I/O priority preparations in the same function. Signed-off-by: Damien Le Moal <> Signed-off-by: Niklas Cassel <> Signed-off-by: Jens Axboe <>
2021-09-03 | libaio,io_uring: introduce cmdprio_bssplit | Damien Le Moal
The cmdprio_percentage, cmdprio_class and cmdprio options allow specifying different values for read and write operations. This enables various IO priority issuing patterns even under a mixed read-write workload but does not allow differentiation within read and write I/O operation types with different sizes when the bssplit option is used. Introduce the cmdprio_bssplit option to complement the use of the bssplit option. This new option has the same format as the bssplit option, but the percentage values indicate the percentage of I/O operations with a particular block size that must be issued with the priority class and value specified by cmdprio_class and cmdprio. Signed-off-by: Damien Le Moal <> Signed-off-by: Niklas Cassel <> Signed-off-by: Jens Axboe <>
2021-09-03 | libaio,io_uring: introduce cmdprio_class and cmdprio options | Damien Le Moal
When the cmdprio_percentage option is used, the specified percentage of IO will be issued with the highest priority class IOPRIO_CLASS_RT. This priority class maps to the ATA NCQ "high" priority level and allows exercising a SATA device to measure its command latency characteristics in the presence of low and high priority commands. Besides ATA NCQ commands, Linux block IO schedulers also support IO priorities and will behave differently in the presence of IOs with different IO priority classes and values. However, cmdprio_percentage does not allow specifying all possible priority classes and values. To solve this, introduce libaio and io_uring engine specific options cmdprio_class and cmdprio. These new options are the equivalent of the prioclass and prio options and allow specifying the priority class and priority value to use for asynchronous I/Os when the cmdprio_percentage option is used. If not specified, the I/O priority class defaults to IOPRIO_CLASS_RT and the I/O priority value to 0, as before. Similarly to the cmdprio_percentage option, these options can specify different values for read and write I/Os using a comma separated list. The manpage, HOWTO and fiograph configuration file are updated to document these new options. Signed-off-by: Damien Le Moal <> Signed-off-by: Niklas Cassel <> Signed-off-by: Jens Axboe <>
2021-09-03 | libaio,io_uring: improve cmdprio_percentage option | Damien Le Moal
The cmdprio_percentage option of the libaio and io_uring engines defines a single percentage that applies to all IO operations, regardless of their direction. This prevents defining different high priority IO percentages for read and write operations. This differentiation can however be useful in the case of a mixed read-write workload (rwmixread and rwmixwrite options). Change the option definition to allow specifying a comma separated list of percentages, 2 at most, one for reads and one for writes. If only a single percentage is defined, it applies to both reads and writes as before. The cmdprio_percentage option becomes an array of DDIR_RWDIR_CNT elements indexed with enum fio_ddir values. The last entry of the array (for DDIR_TRIM) is always 0. Also create a new cmdprio helper file, engines/cmdprio.h, such that we can avoid code duplication between io_uring and libaio io engines. This helper file will be extended in subsequent patches. Signed-off-by: Damien Le Moal <> Signed-off-by: Niklas Cassel <> Signed-off-by: Jens Axboe <>
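A minimal Python sketch of the resulting option semantics is shown below. It is not fio's option parser; the function name is hypothetical, and the constants mirror fio's enum fio_ddir ordering (read, write, trim).

```python
DDIR_READ, DDIR_WRITE, DDIR_TRIM = 0, 1, 2
DDIR_RWDIR_CNT = 3


def parse_cmdprio_percentage(value):
    """Turn '20' or '20,30' into a per-direction array; the trim slot stays 0."""
    parts = [int(p) for p in value.split(",")]
    if len(parts) > 2:
        raise ValueError("at most two percentages (reads,writes) are allowed")
    if any(not 0 <= p <= 100 for p in parts):
        raise ValueError("percentages must be within 0..100")
    percentages = [0] * DDIR_RWDIR_CNT
    percentages[DDIR_READ] = parts[0]
    # A single value applies to both directions, as before the change.
    percentages[DDIR_WRITE] = parts[-1]
    return percentages


print(parse_cmdprio_percentage("20"))
print(parse_cmdprio_percentage("20,30"))
```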
2021-09-03 | os: introduce ioprio_value() helper | Damien Le Moal
Introduce the ioprio_value() helper function to calculate a priority value based on a priority class and priority level. For Linux and Android, this is defined as an integer equal to the priority class shifted left by 13 bits and or-ed with the priority level. For Dragonfly, ioprio_value() simply returns the priority level as there is no concept of priority class. Use this new helper in the io_uring and libaio engines to set IO priority when the cmdprio_percentage option is used. Signed-off-by: Damien Le Moal <> Signed-off-by: Niklas Cassel <> Signed-off-by: Jens Axboe <>
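The arithmetic described above is small enough to show directly. The following is a Python rendering of the described calculation, not the C helper itself; the Dragonfly variant is named hypothetically here to keep both behaviours side by side.

```python
IOPRIO_CLASS_SHIFT = 13


def ioprio_value(prio_class, prio_level):
    """Linux/Android: class shifted left by 13 bits, or-ed with the level."""
    return (prio_class << IOPRIO_CLASS_SHIFT) | prio_level


def ioprio_value_dragonfly(prio_class, prio_level):
    """Dragonfly: no priority class concept, so only the level is returned."""
    return prio_level


print(hex(ioprio_value(1, 0)))   # class 1, level 0
print(hex(ioprio_value(2, 7)))   # class 2, level 7
print(ioprio_value_dragonfly(2, 7))
```

For example, class 2 with level 7 gives (2 << 13) | 7 = 16384 + 7 = 16391, i.e. 0x4007.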
2021-08-26 | io_uring: don't clear recently set sqe->rw_flags | Niklas Cassel
Commit 7c70f506e438 ("engines/io_uring: move sqe clear out of hot path") removed the memset of sqe from fio_ioring_prep(). This commit did add a clear of the sqe->rw_flags, however, it did so after both RWF_UNCACHED and RWF_NOWAIT flags might have been set, effectively clearing these flags if they got set. This doesn't make any sense. Make sure that we clear sqe->rw_flags before, not after, setting the flags. Fixes: 7c70f506e438 ("engines/io_uring: move sqe clear out of hot path") Signed-off-by: Niklas Cassel <> Signed-off-by: Jens Axboe <>
2021-08-26 | io_uring: fix misbehaving cmdprio_percentage option | Niklas Cassel
Commit 7c70f506e438 ("engines/io_uring: move sqe clear out of hot path") removed the memset of sqe from fio_ioring_prep(). Before this commit, fio_ioring_prio_prep() behaved properly, because sqe->ioprio was always cleared by the memset in fio_ioring_prep(). cmdprio_percentage=20 is supposed to set the highest priority class for 20% of the total I/Os, however, because sqes got reused without clearing the ioprio field, this meant that the number of I/Os sent with the highest priority became 95% already after 10 seconds. Quite far off from the intended 20%. Fix this by explicitly clearing the priority in fio_ioring_prio_prep(). Note that prio/prioclass cannot be used together with cmdprio_percentage, so we do not need to do an additional clear in fio_ioring_prep(). engines/libaio.c doesn't explicitly clear the ioprio, nor does it memset the descriptor entry, this is because io_prep_pread()/io_prep_pwrite() in libaio itself performs a memset. Fixes: 7c70f506e438 ("engines/io_uring: move sqe clear out of hot path") Signed-off-by: Niklas Cassel <> Signed-off-by: Jens Axboe <>
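The drift described above can be reproduced with a small simulation. This Python sketch models a ring of reused submission entries, each with a sticky priority field; it is not the engine code, and the ring size, I/O count and random draw are arbitrary assumptions chosen for illustration.

```python
import random


def high_prio_fraction(total_ios, ring_size, percentage, clear_ioprio, seed=0):
    """Fraction of I/Os submitted with the high priority set."""
    rng = random.Random(seed)
    sqe_ioprio = [0] * ring_size        # one priority field per reused entry
    high = 0
    for i in range(total_ios):
        slot = i % ring_size            # entries are reused round-robin
        if clear_ioprio:
            sqe_ioprio[slot] = 0        # the fix: reset before deciding
        if rng.randrange(100) < percentage:
            sqe_ioprio[slot] = 1        # mark this I/O as high priority
        high += sqe_ioprio[slot]        # a stale value counts as well
    return high / total_ios


buggy = high_prio_fraction(10000, 32, 20, clear_ioprio=False)
fixed = high_prio_fraction(10000, 32, 20, clear_ioprio=True)
print("without clear: %.2f, with clear: %.2f" % (buggy, fixed))
```

Without the clear, an entry that was ever selected keeps its high priority on every later reuse, so the fraction climbs towards 100%; with the clear it stays close to the requested percentage.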
2021-08-26 | io_uring: always initialize sqe->flags | Niklas Cassel
Commit 7c70f506e438 ("engines/io_uring: move sqe clear out of hot path") removed the memset of sqe from fio_ioring_prep(). Later, force_async was added in commit 5a59a81d2923 ("engines/io_uring: allow setting of IOSQE_ASYNC"). The force_async commit sets sqe->flags every N requests, however, since we no longer do a memset, this commit should have made sure that flags is always initialized, such that we don't have sqe->flags set on reused sqes where we didn't intend to. Fixes: 5a59a81d2923 ("engines/io_uring: allow setting of IOSQE_ASYNC") Signed-off-by: Niklas Cassel <> Signed-off-by: Jens Axboe <>
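The same kind of stale-field problem can be sketched for sqe->flags. The Python model below is an illustration only: the counter logic and function name are assumptions, while IOSQE_ASYNC = 1 << 4 matches the Linux io_uring header.

```python
IOSQE_ASYNC = 1 << 4  # value from the Linux io_uring UAPI header


def submitted_flags(total_ios, ring_size, force_async, init_flags):
    """Return the flags value each I/O is submitted with."""
    sqe_flags = [0] * ring_size
    seen = []
    for i in range(total_ios):
        slot = i % ring_size                 # entries are reused round-robin
        if init_flags:
            sqe_flags[slot] = 0              # the fix: always start from 0
        if force_async and (i + 1) % force_async == 0:
            sqe_flags[slot] |= IOSQE_ASYNC   # every Nth request goes async
        seen.append(sqe_flags[slot])
    return seen


with_init = submitted_flags(6, 2, 3, init_flags=True)
without_init = submitted_flags(6, 2, 3, init_flags=False)
print(with_init)
print(without_init)
```

With force_async=3 and six requests, only the third and sixth should carry IOSQE_ASYNC. Without initialization, the fifth request reuses the entry of the third and inherits its flag.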
2021-08-10  engines/dfs: add support for 1.3 DAOS API  Johann Lombardi
A few changes were made to the pool connect and container open APIs in DAOS 1.3+. A UUID string or label is now passed via the API instead of uuid_t structures. Change the dfs engine accordingly. Signed-off-by: Johann Lombardi <>
2021-08-06  engines/libzbc: Enable trim for libzbc I/O engine  Shin'ichiro Kawasaki
A trim workload on zoned block devices is supported as zone reset, and this feature is available for I/O engines which support both zoned devices and trim workloads. The libzbc I/O engine supports zoned devices but lacks trim workload support. To enable trim support with the libzbc I/O engine, remove the check which inhibited trim requests to the libzbc I/O engine. Also set the file open flags for trim to the same as for write, and call zbd_do_io_u_trim() for trim I/O. Note that the libzbc I/O engine can currently support trim to sequential write required zones only. Trim I/Os to conventional zones are reported as an error. Signed-off-by: Shin'ichiro Kawasaki <> Reviewed-by: Dmitry Fomichev <> Signed-off-by: Jens Axboe <>
2021-07-26  engines/exec: Code cleanup to remove leaks  Erwan Velu
As per the Coverity reports, there were some issues in my code:
- Some structures were not properly freed before returning.
- Some file descriptors were not properly closed.
- Testing with 'if (!int)' isn't a good way to test if the value is negative.

Signed-off-by: Erwan Velu <>
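The last point can be shown with a short C sketch. The two helper names are hypothetical and only illustrate the difference between the two tests for a call that returns a negative value on failure (as open(2) does) and a non-negative value on success.

```c
#include <assert.h>
#include <stdbool.h>

/* Wrong test: '!ret' is true only for 0, which is a valid success value. */
static bool failed_wrong(int ret)
{
	return !ret;
}

/* Right test: a negative return value signals the failure. */
static bool failed_right(int ret)
{
	return ret < 0;
}
```

With a return value of -1, `failed_wrong()` reports success and the error goes unnoticed; with a return value of 0 (for example, a valid file descriptor 0), it reports a failure that never happened.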
2021-07-25  engines/exec: style cleanups  Jens Axboe
No functional changes in this patch. Signed-off-by: Jens Axboe <>
2021-06-29  engines: Adding exec engine  Erwan Velu
When performing benchmarks with fio, some users need to execute tasks in parallel with the job execution. A typical use case would be observing performance/power metrics.

Several implementations were possible:
- Adding an exec_run in addition to the existing exec_{pre|post}run
- Implementing performance/power metrics in fio
- Adding an exec engine

1°) Adding an exec_run
This was my first intention, but I quickly noticed that exec_{pre|post}run are executed for each 'numjob'. In the case of performance/power monitoring, it doesn't make sense to spawn an instance for each thread.

2°) Implementing performance/power metrics
This is possible but would require a lot of work to maintain this part of fio, while third-party tools already take care of that perfectly.

3°) Adding an engine
Adding an engine lets users define when and how many instances of the program they want. In the provided example, a single monitoring job is spawned at the same time as the worker, which could be composed of several worker threads. A stonewall barrier is used to define which jobs must run together (monitoring / benchmark).

The engine has the following parameters:
- program: name of the program to run
- arguments: arguments to pass to the program
- grace_time: duration between SIGTERM and SIGKILL
- std_redirect: redirect std{err|out} to dedicated files

Arguments can contain special variables that are expanded before the execution:
- %r will be replaced by the job duration in seconds
- %n will be replaced by the job name

During the program execution, std{out|err} are redirected to files if the std_redirect option is set (default):
- stdout: <job_name>.stdout
- stderr: <job_name>.stderr

If the executed program produces easily parsable output on stdout, the stdout file can be parsed after the fio execution by other tools such as CI jobs or graphing tools.

A sample job is provided here to show how this can be used. It runs the CPU engine twice with two different CPU modes (noop vs qsort). For each benchmark, the output of turbostat is saved for later analysis. After the fio run, it is possible to compare the impact of the two modes on the CPU frequency and power consumption. This can easily be extended to any other usage that needs to analyze the behavior of the host during some jobs.

About the implementation, the exec engine forks:
- the child does an execvp() of the program
- the parent, fio, monitors the time spent in the job

Once the time is over, the program is sent a SIGTERM followed by a SIGKILL to ensure it will not run _after_ the job is completed. This mechanism is required as:
- not all programs can be controlled properly
- it is a last-resort protection if the program misbehaves

The delay is controlled by the grace_time option; the default is 1 sec.

If the program can be limited in its duration, the %r variable can be used in the arguments to request that the program stop _before_ the job finishes, like:

  program=/usr/bin/
  arguments=--duration %r

Signed-off-by: Erwan Velu <>
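The fork, SIGTERM, then SIGKILL sequence described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the exec engine's code: the function names, the 100 ms polling interval, and the exit code 127 for a failed execvp() are choices made for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Poll for the child's exit for about 'seconds'; true if it was reaped. */
static bool wait_child(pid_t pid, int *status, unsigned seconds)
{
	struct timespec tick = { .tv_sec = 0, .tv_nsec = 100 * 1000 * 1000 };

	for (unsigned i = 0; i <= seconds * 10; i++) {
		if (waitpid(pid, status, WNOHANG) == pid)
			return true;
		nanosleep(&tick, NULL);
	}
	return false;
}

/*
 * Run argv[0] for at most 'runtime' seconds. If it is still alive, send
 * SIGTERM, wait up to 'grace' seconds, then send SIGKILL.
 * Returns the wait status, or -1 if fork() failed.
 */
static int run_with_deadline(char *const argv[], unsigned runtime,
			     unsigned grace)
{
	int status = 0;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		execvp(argv[0], argv);
		_exit(127);	/* only reached if execvp() failed */
	}

	if (wait_child(pid, &status, runtime))
		return status;	/* program ended on its own */

	kill(pid, SIGTERM);	/* polite request first */
	if (wait_child(pid, &status, grace))
		return status;

	kill(pid, SIGKILL);	/* last resort */
	waitpid(pid, &status, 0);
	return status;
}
```

For example, running `sleep 30` with a runtime of 0 and a grace period of 1 second returns a status for which WIFSIGNALED() is true and WTERMSIG() is SIGTERM, since sleep does not ignore SIGTERM and the SIGKILL step is never reached.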
2021-06-14  zbd: remove zbd_zoned_model ZBD_IGNORE  Niklas Cassel
For a job with zonemode=zbd, we do not want any file to be ignored. Each file's file type in that job should be supported by either zbd.c or the ioengine. If not, we should return an error. This way, ZBD_IGNORE becomes redundant and can be removed. By removing ZBD_IGNORE, we know that all files belonging to a job that has zonemode=zbd set, will either be a zoned block device, or emulate a zoned block device. This means that for jobs that have zonemode=zbd, f->zbd_info will always be non-NULL. This will make the zbd code slightly easier to reason about and to maintain. When removing zbd_zoned_model ZBD_IGNORE, define the new first enum value as 0x1, so that we avoid potential ABI problems with existing binaries. Signed-off-by: Niklas Cassel <> Reviewed-by: Damien Le Moal <> Signed-off-by: Jens Axboe <>
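The ABI remark in the last paragraph can be illustrated with a reduced C sketch. The enum and member names below are hypothetical; only the idea of dropping the first member and pinning the new first member to 0x1 comes from the message.

```c
#include <assert.h>

/* Hypothetical "before": the first member implicitly gets the value 0. */
enum demo_model_old {
	DEMO_OLD_IGNORE,	/* 0 */
	DEMO_OLD_NONE,		/* 1 */
	DEMO_OLD_HOST_AWARE,	/* 2 */
	DEMO_OLD_HOST_MANAGED,	/* 3 */
};

/*
 * Hypothetical "after": the ignore member is gone, and the new first
 * member is pinned to 0x1 so the remaining members keep their values.
 */
enum demo_model_new {
	DEMO_NEW_NONE = 0x1,
	DEMO_NEW_HOST_AWARE,	/* 2 */
	DEMO_NEW_HOST_MANAGED,	/* 3 */
};
```

Without the explicit `= 0x1`, every remaining member would shift down by one, and a binary built against the old values would misinterpret the model reported by code built against the new ones.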
2021-06-10  io_uring: drop redundant IO_MODE_OFFLOAD check  Stefan Hajnoczi
check_engine_ops() already returns an error if io_submit_mode is IO_MODE_OFFLOAD and the engine is marked FIO_NO_OFFLOAD. Signed-off-by: Stefan Hajnoczi <> Signed-off-by: Jens Axboe <>
2021-05-27Merge branch 'fix-libpmem' of Axboe
* 'fix-libpmem' of
  engines/libpmem: do not call drain on close
  engines/libpmem: cleanup a little code, comments and example
  engines/libpmem: set file open/create mode always to RW
2021-05-18Merge branch 'taras/nfs-upstream' of Axboe
* 'taras/nfs-upstream' of
  clean up nfs example
  skip skeleton comments
  single line bodies
  C-style comments
  NFS configure fixes
  NFS engine