trim: add support for multiple ranges The NVMe specification allows multiple ranges for dataset management commands. Currently the block ioctl only allows a single range for trim; however, multiple ranges can be specified using the NVMe character device. Add an option num_range to send multiple ranges per trim request, which only works if the data direction is solely trim, i.e. trim or randtrim. Add FIO_MULTI_RANGE_TRIM as the ioengine flag to restrict the usage of this new option. For multi-range trim requests this modifies the way IO buffers are used. The buffer length will depend on the number of trim ranges, and the actual buffer will contain the start and length of each range entry. This increases fio server version (FIO_SERVER_VER) to 103. Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com> Link: https://lore.kernel.org/r/20240215151812.138370-2-ankit.kumar@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
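As a rough illustration of the buffer layout described above, the sketch below packs a list of (start, length) pairs into one buffer whose size grows with the number of ranges. The two little-endian 64-bit fields per entry and the name `pack_trim_ranges` are assumptions made for this example; they are not taken from fio's source.

```python
import struct

# Assumed entry layout for illustration: start (u64) followed by length (u64).
RANGE_ENTRY = struct.Struct("<QQ")

def pack_trim_ranges(ranges):
    """Pack (start, length) pairs into one contiguous trim buffer.

    Hypothetical helper; fio's real in-memory layout may differ.
    """
    buf = bytearray(RANGE_ENTRY.size * len(ranges))
    for i, (start, length) in enumerate(ranges):
        RANGE_ENTRY.pack_into(buf, i * RANGE_ENTRY.size, start, length)
    return bytes(buf)

# Three ranges, as a job with num_range=3 might produce for one request.
payload = pack_trim_ranges([(0, 4096), (1048576, 8192), (4194304, 16384)])
print(len(payload))  # 48: three entries of 16 bytes each
```

The point of the sketch is only that the buffer length is a function of the range count rather than of the block size.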
fio: Introduce new constant thinkcycles option The thinkcycles parameter allows setting a number of cycles to spin between requests to model real-world applications more realistically. The thinktime parameter family can be used to model an application processing the data, so that real-world applications are modeled more closely. Unfortunately, thinktime is currently specified as a constant time and is therefore affected by CPU frequency settings or task migration to a CPU with a different capacity. The new thinkcycles parameter closes that gap and allows specifying a constant number of cycles instead, such that CPU capacity is taken into account. Signed-off-by: Christian Loehle <christian.loehle@arm.com>
Make log_unix_epoch an official alias of log_alternate_epoch log_alternate_epoch was introduced along with log_alternate_epoch_clock_id, and generalized the idea of log_unix_epoch. Both options had the same effect. So we make log_unix_epoch an official alias of log_alternate_epoch, instead of maintaining both redundant options. Signed-off-by: Nick Neumann <nick@pcpartpicker.com>
Record job start time to fix time pain points Add a new key in the json per-job output, job_start, that records the job start time obtained via a call to clock_gettime using the clock_id specified by the new job_start_clock_id option. This allows times of fio jobs and log entries to be compared/ordered against each other and against other system events recorded against the same clock_id. Add a note to the documentation for group_reporting about how there are several per-job values for which only the first job's value is recorded in the json output format when group_reporting is enabled. Fixes #1544 Signed-off-by: Nick Neumann <nick@pcpartpicker.com>
options: add priohint option Introduce the new option priohint to allow users to specify an I/O priority hint applying to all IOs issued by a job. This increases fio server version (FIO_SERVER_VER) to 101. Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20230721110510.44772-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
init: clean up random seed options - make allrandrepeat a synonym of randrepeat. allrandrepeat is superfluous because the seeds set by randrepeat already encompass random number generators beyond the one used for random offsets. - allow randseed to override [all]randrepeat: this is what the documentation implies but was not previously the case. This is a breaking change for users relying on the values of fio's default random seeds. Link: https://github.com/axboe/fio/pull/1546 Fixes: https://github.com/axboe/fio/issues/1502 Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
fio: steadystate: allow for custom check interval Allow for a steady state check interval other than 1s with a new --ss_interval parameter. Steady state is reached when the steady state condition (like slope) is true when comparing the last windows (set with --ss_dur). The actual values for this comparison are currently calculated for a 1s interval during the window. This is especially problematic for slow random devices, where the values do not converge at such a fine granularity. Letting the user set the interval solves this problem, although it requires them to figure out an appropriate value themselves. --ss=iops:5% --ss_dur=120s should reproduce this for many (slower) devices. Adding something like --ss_interval=20s may then let it converge. Signed-off-by: Christian Loehle <cloehle@posteo.de>
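To see why a coarser check interval can help, the sketch below applies a simple "every sample within N% of the window mean" test to the same data at two granularities. The criterion, the function names, and the sample values are assumptions chosen for illustration; fio's actual steady state evaluation may differ in detail.

```python
def steady_state_reached(samples, limit_pct):
    """True if every sample lies within limit_pct percent of the window mean."""
    mean = sum(samples) / len(samples)
    if mean == 0:
        return False
    return all(abs(s - mean) / mean * 100.0 <= limit_pct for s in samples)

def resample(per_second, interval):
    """Average consecutive per-second values into interval-sized samples."""
    return [
        sum(per_second[i:i + interval]) / interval
        for i in range(0, len(per_second) - interval + 1, interval)
    ]

# Hypothetical per-second IOPS from a slow device: noisy at 1s granularity.
per_second = [90, 110, 80, 120, 100, 100, 95, 105]
print(steady_state_reached(per_second, 5))               # False
print(steady_state_reached(resample(per_second, 4), 5))  # True
```

Averaging over a longer interval smooths out per-second noise, which is the effect the new parameter is meant to give the user control over.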
fio: add fdp support for io_uring_cmd nvme engine Add support for NVMe TP4146 Flexible Data Placement, allowing placement identifiers in write commands. The user can enable this with the new "fdp=1" parameter for fio's io_uring_cmd ioengine. By default, the fio jobs will cycle through all the namespace's available placement identifiers for write commands. The user can limit which placement identifiers can be used with the additional parameter "fdp_pli=<list,>", which can be used to separate write intensive jobs from less intensive ones. Setting up your namespace for FDP is outside the scope of 'fio', so this assumes the namespace is already properly configured for the mode. Link: https://lore.kernel.org/fio/CAKi7+wfX-eaUD5pky5cJ824uCzsQ4sPYMZdp3AuCUZOA1TQrYw@mail.gmail.com/T/#m056018eb07229bed00d4e589f9760b2a2aa009fc Based-on-a-patch-by: Ankit Kumar <ankit.kumar@samsung.com> Signed-off-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com> [Vincent: fold in sfree fix from Ankit] Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
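The default behaviour and the effect of restricting the list can be pictured with the small sketch below. It assumes a plain round-robin order and uses the hypothetical helper name `placement_ids`; the commit only states that jobs cycle through the available identifiers, so the exact selection order is an assumption.

```python
from itertools import cycle

def placement_ids(available, fdp_pli=None):
    """Return an endless iterator of placement identifiers for write commands.

    available: identifiers the namespace exposes.
    fdp_pli:   optional user-supplied subset; None means "use all of them".
    Hypothetical helper, assuming round-robin selection.
    """
    allowed = list(available) if fdp_pli is None else [p for p in fdp_pli if p in available]
    if not allowed:
        raise ValueError("no usable placement identifiers")
    return cycle(allowed)

ids = placement_ids(range(4), fdp_pli=[1, 3])
print([next(ids) for _ in range(5)])  # [1, 3, 1, 3, 1]
```

Giving a write-heavy job one subset and a lighter job another keeps their data on separate placement identifiers.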
options: Support arbitrarily long pattern buffers Dynamically allocate the pattern buffer to remove the 512B length restriction. To accomplish this, store a pointer instead of a fixed block of memory for the buffers in the thread_options structure. Then introduce and use the function parse_and_fill_pattern_alloc(), which will calculate the appropriate size of the buffer and allocate it before filling it. The buffers will be freed, along with a number of string buffers, in free_thread_options_to_cpu(). They will also be reallocated (if necessary) when receiving them over the wire with convert_thread_options_to_cpu(). This allows for specifying real-world compressible data (e.g. the Calgary Corpus) for the buffer_pattern option. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
cconv: Support pattern buffers of arbitrary size Change the thread_options_pack structure to support pattern buffers of arbitrary size by using a flexible array at the end of the structure to store both the verify_pattern and the buffer_pattern, in that order. In this way, only the actual bytes of each pattern will be sent over the wire, and patterns of an arbitrary size can be used with the packed structure. In order to determine the required size of the structure, the function thread_options_pack_size() is introduced, which returns the total number of bytes required for a given thread_options instance. The two callsites of convert_thread_options_to_net() are then converted to dynamically allocate a pdu of the appropriate size, and the two callsites of convert_thread_options_to_cpu() are modified to take the size of the received data to prevent buffer overruns. Also add specific testing of this feature in fio_test_cconv(). Since this changes the client/server protocol, FIO_SERVER_VER is bumped. Signed-off-by: Logan Gunthorpe <logang@deltatee.com> Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
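The idea of a fixed header followed by a variable-length tail, with a size check on receipt, can be sketched as follows. The two-field header and the function names `pack_size`, `pack`, and `unpack` are assumptions made for this example; the real thread_options_pack structure carries many more fields.

```python
import struct

# Assumed header for illustration: verify_pattern length, buffer_pattern length.
HEADER = struct.Struct("<II")

def pack_size(verify_pattern, buffer_pattern):
    """Total bytes needed on the wire for this pair of patterns."""
    return HEADER.size + len(verify_pattern) + len(buffer_pattern)

def pack(verify_pattern, buffer_pattern):
    """Header, then verify_pattern, then buffer_pattern, in that order."""
    header = HEADER.pack(len(verify_pattern), len(buffer_pattern))
    return header + verify_pattern + buffer_pattern

def unpack(data):
    """Decode a received message, rejecting lengths that overrun the data."""
    if len(data) < HEADER.size:
        raise ValueError("truncated header")
    vlen, blen = HEADER.unpack_from(data)
    if HEADER.size + vlen + blen > len(data):
        raise ValueError("pattern lengths exceed received size")
    start = HEADER.size
    return data[start:start + vlen], data[start + vlen:start + vlen + blen]

wire = pack(b"\xde\xad", b"compressible " * 100)
print(len(wire) == pack_size(b"\xde\xad", b"compressible " * 100))  # True
```

Passing the received size into the decoder is what lets it refuse a message whose declared pattern lengths would read past the end of the data.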
Introduce support for generating dedup buffers across jobs. The dedup buffers are spread evenly between the jobs that enable the dedupe_global option. Note that only dedupe_mode=working_set is supported. Note that compression is supported with global dedup enabled. Signed-off-by: Bar David <bardavvid@gmail.com>
options: add a parsing function for an additional cmdprio_bssplit format The cmdprio_bssplit ioengine option for io_uring/libaio is currently parsed using split_parse_ddir(). While this function works fine for parsing the existing cmdprio_bssplit entry format, it forces every cmdprio_bssplit entry to use the priority defined by cmdprio and cmdprio_class. This means that there will only ever be at most two different priority values used in the job. To enable us to use more than two different priority values, add a new parsing function, split_parse_prio_ddir(), that will support parsing the existing cmdprio_bssplit entry format (blocksize/percentage), and a new cmdprio_bssplit entry format (blocksize/percentage/prioclass/priolevel). Since IO engines can be compiled as plugins, having the parse function in options.c avoids potential problems with ioengines having different versions of the same parsing function. A follow up patch will change to the new parsing function. Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Link: https://lore.kernel.org/r/20220203192814.18552-8-Niklas.Cassel@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
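A minimal sketch of accepting both entry formats is shown below. It is not the C implementation of split_parse_prio_ddir(); the function names are hypothetical, the ':' separator between entries is an assumption, and the block size is kept as a string rather than converted.

```python
def parse_prio_entry(entry, default_class, default_level):
    """Parse one cmdprio_bssplit entry.

    Accepts 'blocksize/percentage' (priority falls back to the defaults,
    standing in for cmdprio_class and cmdprio) or
    'blocksize/percentage/prioclass/priolevel'.
    """
    fields = entry.split("/")
    if len(fields) == 2:
        bs, perc = fields
        prio_class, prio_level = default_class, default_level
    elif len(fields) == 4:
        bs, perc = fields[0], fields[1]
        prio_class, prio_level = int(fields[2]), int(fields[3])
    else:
        raise ValueError(f"malformed entry: {entry!r}")
    return bs, int(perc), prio_class, prio_level

def parse_prio_split(value, default_class, default_level):
    """Parse a ':'-separated list of entries (separator assumed)."""
    return [parse_prio_entry(e, default_class, default_level) for e in value.split(":")]

print(parse_prio_split("64k/40/1/0:1024k/60/2/4", default_class=1, default_level=0))
```

With the four-field form each entry carries its own class and level, so a job is no longer limited to two distinct priority values.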
Support for alternate epochs in fio log files Add options log_alternate_epoch and log_alternate_epoch_clock_id. This is similar to the log_unix_epoch option. This resolves the issue raised in Issue #1314. log_alternate_epoch, if true, causes log files to use the same epoch used by the clock_id parameter to the Unix clock_gettime function, where clock_id is specified by the log_alternate_epoch_clock_id option. This is particularly useful as it allows us to specify a clock id like CLOCK_MONOTONIC_RAW, which is natural for synchronizing log files between processes. The current log_unix_epoch is problematic for that purpose because that clock is not monotonic or continuous. It turns out that log_unix_epoch is actually equivalent to log_alternate_epoch with log_alternate_epoch_clock_id set to CLOCK_REALTIME=0. Since this is the default value of the log_alternate_epoch_clock_id option anyway, we treat log_alternate_epoch and log_unix_epoch as equivalent in functionality, retaining the latter to avoid breaking existing clients. Signed-off-by: Nick Neumann <nick@pcpartpicker.com>
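The effect on log timestamps can be pictured with the sketch below, which shifts job-relative millisecond timestamps onto the epoch of a chosen clock. The class name `LogClock` and the millisecond unit are assumptions for illustration; the sketch relies on Python's `time.clock_gettime`, which is available on Unix-like systems.

```python
import time

class LogClock:
    """Produce log timestamps, optionally relative to a clock's own epoch."""

    def __init__(self, alternate_epoch=False, clock_id=time.CLOCK_REALTIME):
        # Sample the selected clock once at job start; 0 keeps timestamps
        # relative to the start of the job.
        self.base_ms = int(time.clock_gettime(clock_id) * 1000) if alternate_epoch else 0

    def stamp(self, elapsed_ms):
        """Timestamp for an entry logged elapsed_ms after job start."""
        return self.base_ms + elapsed_ms

relative = LogClock()
monotonic = LogClock(alternate_epoch=True, clock_id=time.CLOCK_MONOTONIC)
print(relative.stamp(250))   # 250
print(monotonic.stamp(250))  # 250 ms past the monotonic reading taken at start
```

Two processes that both log against the same monotonic clock id produce timestamps that can be ordered against each other directly.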
fio: Introduce the log_entries option

When iops, latency, or bw logging options are used, fio will by default log information for any I/O that completes. The initial number of I/O log entries is 1024, as defined by DEF_LOG_ENTRIES. When all log entries are used, new log entries are dynamically allocated by get_new_log(). This dynamic log entry allocation can negatively impact time-related statistics such as the I/O tail latencies (e.g. 99.9 percentile completion latency), as growing the logs causes a temporary I/O stall (IO quiesce), which disturbs the workload steady state. The effect of this is especially noticeable with workloads using IO priorities: the tail latencies of high priority I/Os increase if the IO log needs to be grown.

For example, running the following fio command on a SATA disk supporting NCQ priority:

fio --name=prio-randread --filename=/dev/sdg \
    --random_generator=tausworthe64 --ioscheduler=none \
    --write_lat_log=randread.log --log_prio=1 --rw=randread --bs=128k \
    --ioengine=libaio --iodepth=32 --direct=1 --cmdprio_class=1 \
    --cmdprio_percentage=30 --runtime=900

(128KB random read workload at QD=32 and 30% of commands issued with a high priority), with an initial number of log entries equal to the default of 1024, depending on the machine memory state, the completion latency statistics may show imprecise percentiles such as shown below.

high prio (30.75%) clat percentiles (msec):
 | 1.00th=[ 14], 5.00th=[ 17], 10.00th=[ 20], 20.00th=[ 23],
 | 30.00th=[ 27], 40.00th=[ 32], 50.00th=[ 40], 60.00th=[ 53],
 | 70.00th=[ 71], 80.00th=[ 104], 90.00th=[ 169], 95.00th=[ 243],
 | 99.00th=[ 514], 99.50th=[ 676], 99.90th=[ 1485], 99.95th=[ 1502],
 | 99.99th=[ 1552]
low prio (69.25%) clat percentiles (msec):
 | 1.00th=[ 16], 5.00th=[ 24], 10.00th=[ 37], 20.00th=[ 68],
 | 30.00th=[ 105], 40.00th=[ 146], 50.00th=[ 199], 60.00th=[ 255],
 | 70.00th=[ 330], 80.00th=[ 439], 90.00th=[ 592], 95.00th=[ 718],
 | 99.00th=[ 885], 99.50th=[ 986], 99.90th=[ 1469], 99.95th=[ 1536],
 | 99.99th=[ 1586]

All completion latency percentiles above the 99.90th percentile are similar for the high and low priority commands, which is not consistent with the drive's expected execution of prioritized read commands.

To solve this issue and get more precise latency statistics, this patch introduces the new "log_entries" option to allow specifying a larger initial number of IO log entries to avoid run-time allocation. This option value defaults to DEF_LOG_ENTRIES and its maximum value is MAX_LOG_ENTRIES to be consistent with get_new_log() allocation.

Also simplify get_new_log() by using calloc() instead of malloc(), thus removing the need for the local variable new_size.

Adding the "--log_entries=65536" option to the previous command line example, the completion latency results obtained are more stable:

high prio (30.72%) clat percentiles (msec):
 | 1.00th=[ 15], 5.00th=[ 17], 10.00th=[ 19], 20.00th=[ 22],
 | 30.00th=[ 24], 40.00th=[ 27], 50.00th=[ 32], 60.00th=[ 36],
 | 70.00th=[ 46], 80.00th=[ 57], 90.00th=[ 81], 95.00th=[ 105],
 | 99.00th=[ 161], 99.50th=[ 188], 99.90th=[ 271], 99.95th=[ 275],
 | 99.99th=[ 363]
low prio (69.28%) clat percentiles (msec):
 | 1.00th=[ 16], 5.00th=[ 27], 10.00th=[ 43], 20.00th=[ 80],
 | 30.00th=[ 123], 40.00th=[ 176], 50.00th=[ 236], 60.00th=[ 313],
 | 70.00th=[ 401], 80.00th=[ 506], 90.00th=[ 634], 95.00th=[ 718],
 | 99.00th=[ 844], 99.50th=[ 885], 99.90th=[ 953], 99.95th=[ 995],
 | 99.99th=[ 1053]

All completion percentiles now clearly show shorter latencies for high priority commands, as expected. The 99.99th percentile for low priority commands is also improved compared to the previous case, as the measurements are not impacted by the dynamic log allocation.

Suggested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com>
Link: https://lore.kernel.org/r/20211118052729.132423-1-damien.lemoal@opensource.wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
options: Add thinktime_iotime option The thinktime option allows stalling a job for a specified amount of time. Using the thinktime_blocks option, periodic stalls can be added every thinktime_blocks IOs. However, with this option, the periodic stall may not be repeated at equal time intervals, as the time to execute thinktime_blocks IOs may vary. To control the thinktime interval by time, introduce the option thinktime_iotime. With this new option, the thinktime stall is repeated after IOs are executed for thinktime_iotime. If this option is used together with the thinktime_blocks option, the thinktime pause is repeated after thinktime_iotime or after thinktime_blocks IOs, whichever happens first. To support the new option, add a new member thinktime_iotime in the struct thread_options and the struct thread_options_pack. Avoid a size increase of the struct thread_options_pack by replacing a padding 'pad5' with the new member. To keep thinktime related members close, move the members near the position where the padding was placed. Make the same changes to the struct thread_options as well, for consistency. To track the time and IO block count at the last stall, add the last_thinktime and last_thinktime_blocks variables to struct thread_data. Also, introduce the helper function init_thinktime() to group thinktime related preparations. Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
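The "whichever happens first" rule can be sketched as a single predicate. The function name `stall_due` and the use of seconds and block counts as plain numbers are assumptions for this example; only the names last_thinktime and last_thinktime_blocks come from the commit message.

```python
def stall_due(now, last_thinktime, blocks_done, last_thinktime_blocks,
              thinktime_blocks=0, thinktime_iotime=0.0):
    """Decide whether the next thinktime stall should happen now.

    A stall is due once thinktime_blocks IOs have completed since the last
    stall, or once thinktime_iotime seconds of IO have elapsed since then,
    whichever comes first. A value of 0 disables that condition.
    """
    by_blocks = (thinktime_blocks > 0
                 and blocks_done - last_thinktime_blocks >= thinktime_blocks)
    by_time = (thinktime_iotime > 0
               and now - last_thinktime >= thinktime_iotime)
    return by_blocks or by_time

# Only 3 of 8 blocks done, but 1.2s of IO has passed with a 1s limit.
print(stall_due(now=11.2, last_thinktime=10.0, blocks_done=103,
                last_thinktime_blocks=100, thinktime_blocks=8,
                thinktime_iotime=1.0))  # True
```

After a stall, the caller would record the current time and block count as the new last_thinktime and last_thinktime_blocks.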
fio: Introduce the log_prio option Introduce the log_prio option to expand priority logging from just a single bit of information (priority high vs low) to the full value of the priority value used to execute IOs. When this option is set, the priority value is printed as a 16-bit hexadecimal value combining the I/O priority class and priority level as defined by the ioprio_value() helper. Similarly to the log_offset option, this option does not result in actual I/O priority logging when log_avg_msec is set. This patch also fixes a problem with the IO_U_F_PRIORITY flag, namely that this flag is used to indicate that the IO is being executed with a high priority on the device while at the same time indicating how to account for the IO completion latency (high_prio clat vs low_prio clat). With the introduction of the cmdprio_class and cmdprio options, these assumptions are not necessarily compatible anymore. These problems are addressed as follows: * The priority_bit field of struct iosample is replaced with the 16-bit priority field representing the full io_u->ioprio value. When log_prio is set, the priority field value is logged as is. When log_prio is not set, 1 is logged as the entry's priority field if the sample priority class is IOPRIO_CLASS_RT, and 0 otherwise. * IO_U_F_PRIORITY is renamed to IO_U_F_HIGH_PRIO to indicate that a job IO has the highest priority within the job context and so must be accounted as such using high_prio clat. While fio final statistics only show accounting of high vs low IO completion latency statistics, the log_prio option allows a user to perform more detailed statistical analysis of a workload using multiple different IO priorities. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
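The two logging behaviours described above can be sketched as follows. The sketch assumes the Linux I/O priority encoding, where the class occupies the bits above a shift of 13 and IOPRIO_CLASS_RT is 1; the function `logged_priority` and the exact textual form of the logged field are illustrative assumptions, not fio's code.

```python
IOPRIO_CLASS_SHIFT = 13  # Linux ioprio encoding: class in the top 3 of 16 bits
IOPRIO_CLASS_RT = 1

def ioprio_value(prio_class, prio_level):
    """Combine a priority class and level into one 16-bit value."""
    return (prio_class << IOPRIO_CLASS_SHIFT) | prio_level

def logged_priority(ioprio, log_prio):
    """Render the priority field of one log entry.

    With log_prio set, the full value is shown in hexadecimal; otherwise
    only a single bit is kept: 1 for the RT class, 0 for anything else.
    """
    if log_prio:
        return f"0x{ioprio:04x}"
    return "1" if (ioprio >> IOPRIO_CLASS_SHIFT) == IOPRIO_CLASS_RT else "0"

rt = ioprio_value(IOPRIO_CLASS_RT, 0)
be = ioprio_value(2, 4)
print(logged_priority(rt, True), logged_priority(be, True))    # 0x2000 0x4004
print(logged_priority(rt, False), logged_priority(be, False))  # 1 0
```

The full value lets post-processing tools separate samples by class and level, which the single bit cannot do.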
options: make parsing functions available to ioengines Move the declaration of split_parse_ddir(), str_split_parse() and the split_parse_fn typedef to thread_options.h so that IO engines can use these functions to parse options. The definition of struct split is also moved to thread_options.h from options.c. The type of the split_parse_fn callback function is changed to add a void * argument that can be used for an option parsing callback to pass a private data pointer to the split_parse_fn function. This can be used by an IO engine to pass a pointer to its engine specific option structure as td->eo is not yet set when options are being parsed. Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
dedupe: allow to generate dedupe buffers from working set This commit introduces a new dedupe generation mode, "working_set". Working set mode simulates a more realistic approach to deduped data, in which deduped buffers are generated from a pre-existing working set, sized as a percentage of the device or file. In other words, a deduped buffer is not usually expected to be close in time to its source buffer, and source buffers usually make up a small subset of the entire file or device. Signed-off-by: Bar David <bardavvid@gmail.com>
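One way to picture working set mode is the sketch below, which draws duplicate buffers from a fixed pool of seeds rather than from the most recently written buffer. Everything here is an assumption made for illustration: the seed-based representation, the function names, and the uniform random choice are not taken from fio's implementation.

```python
import random

def build_working_set(total_blocks, working_set_pct, rng):
    """Pre-generate seeds for a working set sized as a percentage of the file."""
    count = max(1, total_blocks * working_set_pct // 100)
    return [rng.getrandbits(64) for _ in range(count)]

def next_buffer_seed(working_set, dedupe_pct, rng):
    """Pick the seed used to fill the next write buffer.

    With probability dedupe_pct percent the buffer duplicates a block from
    the pre-existing working set; otherwise it gets fresh random content.
    """
    if rng.randrange(100) < dedupe_pct:
        return rng.choice(working_set)
    return rng.getrandbits(64)

rng = random.Random(1234)
working_set = build_working_set(total_blocks=1000, working_set_pct=5, rng=rng)
seeds = [next_buffer_seed(working_set, dedupe_pct=30, rng=rng) for _ in range(10)]
print(len(working_set), sum(s in working_set for s in seeds))
```

Because duplicates come from anywhere in the pool, a duplicate and its source need not be written close together in time.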