trim: add support for multiple ranges The NVMe specification allows multiple ranges for dataset management commands. Currently the block ioctl only allows a single range for trim; however, multiple ranges can be specified using the nvme character device. Add an option num_range to send multiple ranges per trim request, which only works if the data direction is solely trim, i.e. trim or randtrim. Add FIO_MULTI_RANGE_TRIM as the ioengine flag to restrict the usage of this new option. For multi-range trim requests this modifies the way IO buffers are used. The buffer length will depend on the number of trim ranges, and the actual buffer will contain the start and length of each range entry. This increases the fio server version (FIO_SERVER_VER) to 103. Signed-off-by: Ankit Kumar <ankit.kumar@samsung.com> Link: https://lore.kernel.org/r/20240215151812.138370-2-ankit.kumar@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
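As a rough illustration of the buffer layout described above (buffer length driven by the number of ranges, each entry holding a start and a length), here is a minimal sketch. The struct and function names (example_trim_range, example_trim_buflen, example_fill_trim_ranges) and the 64-bit field types are assumptions made for this example; they are not taken from the commit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical range entry: one start/length pair per trim range. */
struct example_trim_range {
	uint64_t start;	/* byte offset where the range begins */
	uint64_t len;	/* length of the range in bytes */
};

/* Buffer length needed to hold num_range entries. */
static size_t example_trim_buflen(unsigned int num_range)
{
	return (size_t)num_range * sizeof(struct example_trim_range);
}

/*
 * Fill buf with num_range back-to-back ranges of range_len bytes each,
 * starting at first_offset. Returns the number of bytes used in buf,
 * or 0 if buf is missing or too small.
 */
static size_t example_fill_trim_ranges(void *buf, size_t buflen,
				       unsigned int num_range,
				       uint64_t first_offset,
				       uint64_t range_len)
{
	struct example_trim_range *r = buf;
	size_t need = example_trim_buflen(num_range);
	unsigned int i;

	if (!buf || buflen < need)
		return 0;

	for (i = 0; i < num_range; i++) {
		r[i].start = first_offset + (uint64_t)i * range_len;
		r[i].len = range_len;
	}
	return need;
}
```

With this layout, a request carrying four ranges needs a buffer of 4 * sizeof(struct example_trim_range) bytes regardless of how large each trimmed range is.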
Record job start time to fix time pain points Add a new key in the json per-job output, job_start, that records the job start time obtained via a call to clock_gettime using the clock_id specified by the new job_start_clock_id option. This allows times of fio jobs and log entries to be compared/ordered against each other and against other system events recorded against the same clock_id. Add a note to the documentation for group_reporting about how there are several per-job values for which only the first job's value is recorded in the json output format when group_reporting is enabled. Fixes #1544 Signed-off-by: Nick Neumann nick@pcpartpicker.com
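A minimal sketch of sampling a start time from a caller-chosen clock, in the spirit of the job_start key described above. The helper name example_job_start_ms and the choice of milliseconds as the unit are assumptions for this example; the commit message itself does not state the unit.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Return the current time of clock_id in milliseconds (unit assumed
 * for this sketch), or -1 if clock_gettime() fails.
 */
static int64_t example_job_start_ms(clockid_t clock_id)
{
	struct timespec ts;

	if (clock_gettime(clock_id, &ts) != 0)
		return -1;
	return (int64_t)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}
```

Because every value comes from the same clock_id, two samples taken by different jobs, or by another tool that reads the same clock, can be ordered against each other directly.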
zbd: rename 'open zones' to 'write zones'

Current fio code for zonemode=zbd uses the term 'open zone' to mean the zones that fio jobs write to. Before fio starts writing to a zone, it calls zbd_open_zone(). When fio completes writing to a zone, it calls zbd_close_zone(). This wording is good for zoned block devices with a max_open_zones limit, such as ZBC and ZAC devices. These devices use the same word 'open' to express the zone condition in which the device assigns resources for data writes to zones. However, the word 'open' gets confusing when supporting zoned block devices that have a max_active_zones limit, such as ZNS devices. These devices have both 'open' and 'active' keywords to mean two different kinds of resources on the device. This 'active' status does not fit with the 'open zone' wording in the fio code. Also, the word 'open' zone in fio code does not always match the 'open' condition of zones on the device (e.g. when the --ignore_zone_limits option is specified).

To avoid the confusion, stop using the term 'open zone' in the fio code. Instead, use the term 'write zone' to mean that the zone is the write target. When fio starts a write to a zone, it adds the zone to the write_zones array. When fio completes writing to a zone, it removes the zone from the write_zones array. For this purpose, rename struct fields, functions and a macro:

  ZBD_MAX_OPEN_ZONES -> ZBD_MAX_WRITE_ZONES
  struct fio_zone_info
      open               -> write
  struct thread_data
      num_open_zones     -> num_write_zones
  struct zoned_block_device_info:
      max_open_zones     -> max_write_zones
      num_open_zones     -> num_write_zones
      open_zones[]       -> write_zones[]
  zbd_open_zone()            -> zbd_write_zone_get()
  zbd_close_zone()           -> zbd_write_zone_put()
  zbd_convert_to_open_zone() -> zbd_convert_to_write_zone()

To match these changes, rename local variables and goto labels. Also rephrase code comments.

Of note is that this rename is only for the fio code. The fio options max_open_zones and job_max_open_zones are not renamed, so as not to confuse users.

Suggested-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Niklas Cassel <niklas.cassel@wdc.com> Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
init: refactor random seed setting td->rand_seed was modified in three different places. Put all this code in setup_random_seeds() to make it easier to understand and more maintainable. Also put setup_random_seeds() next to the other random-seed-related functions in init.c. init_rand_seed() was called in three different places for fio's main random number generators. Also put these three sets of invocations in the same place. Always initialize all of fio's main set of random states instead of skipping some for sequential workloads. This makes debugging easier. No functional change. Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
fio: add support for POSIX_FADV_NOREUSE As of Linux kernel commit 17e810229cb3 ("mm: support POSIX_FADV_NOREUSE"), POSIX_FADV_NOREUSE hints at the LRU algorithm to ignore accesses to mapped files with this flag. Previously, it was a no-op. Add it in fio as an fadvise_hint option to test the new behavior. Signed-off-by: Yuanchu Xie <yuanchu@google.com> Link: https://lore.kernel.org/r/20230331183703.3145788-1-yuanchu@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
thinktime: Fix missing re-init thinktime when using ramptime Prevent I/O bursts after ramptime due to thinktime. Each thread generates a certain number of I/O requests, configured by thinktime_blocks. When ramptime ends, thinktime_blocks can no longer control I/O because the thinktime_blocks accounting is not reinitialized after ramptime. Fix this by reinitializing last_thinktime and last_thinktime_blocks when ramptime ends. Signed-off-by: Suho Son <suho.son@samsung.com>
Refactor for_each_td() to catch inappropriate td ptr reuse

I recently introduced a bug caused by reusing a struct thread_data *td after the end of a for_each_td() loop construct.

Link: https://github.com/axboe/fio/pull/1521#issuecomment-1448591102

To prevent others from making this same mistake, this commit refactors for_each_td() so that both the struct thread_data * and the loop index variable are placed inside their own scope for the loop. This will cause any reference to those variables outside the for_each_td() to produce an undeclared identifier error, provided the outer scope doesn't already reuse those same variable names for other code within the routine (which is fine because the scopes are separate).

Because C/C++ doesn't let you declare two different variable types within the scope of a for() loop initializer, creating a scope for both struct thread_data * and the loop index required explicitly declaring a scope with a curly brace. This means for_each_td() includes an opening curly brace to create the scope, which means all uses of for_each_td() must now end with an invocation of a new macro named end_for_each() to emit an ending curly brace to match the scope brace created by for_each_td():

    for_each_td(td) {
        while (td->runstate < TD_EXITED)
            sleep(1);
    } end_for_each();

The alternative is to end every for_each_td() construct with an inline curly brace, which is off-putting since the implementation of an extra opening curly brace is abstracted in for_each_td():

    for_each_td(td) {
        while (td->runstate < TD_EXITED)
            sleep(1);
    }}

Most fio logic only declares "struct thread_data *td" and "int i" for use in for_each_td(), which means those declarations will now cause -Wunused-variable warnings since they're not used outside the scope of the refactored for_each_td(). Those declarations have been removed.

Implementing this change caught a latent bug in eta.c::calc_thread_status() that accesses the ending value of struct thread_data *td after the end of for_each_td(), now manifesting as a compile error, so working as designed :) Signed-off-by: Adam Horshack (horshack@live.com)
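The scoping trick described in this commit can be sketched as follows. This is an illustrative reconstruction, not fio's actual macro: the names example_td, example_tds, example_for_each_td() and example_end_for_each() are made up for the sketch, and the thread array is a small static table.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct thread_data; only what the sketch needs. */
struct example_td {
	int runstate;
};

static struct example_td example_tds[3] = { { 1 }, { 2 }, { 3 } };
static const size_t example_nr_tds =
	sizeof(example_tds) / sizeof(example_tds[0]);

/*
 * The macro opens a brace so that both the index and the td pointer
 * live in a scope of their own; a for() initializer alone cannot
 * declare two variables of different types.
 */
#define example_for_each_td(td)					\
{								\
	size_t td_index_;					\
	struct example_td *(td);				\
	for (td_index_ = 0, (td) = &example_tds[0];		\
	     td_index_ < example_nr_tds; td_index_++, (td)++)

/* Closes the brace opened by example_for_each_td(). */
#define example_end_for_each()	}

static int example_sum_runstates(void)
{
	int sum = 0;

	example_for_each_td(td) {
		sum += td->runstate;
	} example_end_for_each();

	/* td is out of scope here; using it would fail to compile. */
	return sum;
}
```

Adding a line such as "return td->runstate;" after example_end_for_each() produces an undeclared identifier error, which is the kind of compile-time failure the commit relies on to catch pointer reuse after the loop.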
Bad header rand_seed with time_based or loops with randrepeat=0 verify Verify fails with "bad header rand_seed" when multiple iterations of do_io() execute (time_based=1 or loops>0), with verify enabled and randrepeat=0. The root cause is that do_verify() resets the verify seed back to the job-init value. This works for verification of the first iteration of do_io() but fails for subsequent iterations: the seed is left in its post-do_io() state after the first do_verify(), which yields different rand values for the second iteration of do_io(), yet the second iteration of do_verify() reverts again to the job-init seed value. The fix is to revert the verify seed for randrepeat=0 back to its state when do_io() last ran rather than to its job-init value. That will allow do_verify() to use the correct seed for each iteration while still retaining a per-iteration unique verify seed. Link: https://github.com/axboe/fio/issues/1517#issuecomment-1430282533 Signed-off-by: Adam Horshack (horshack@live.com)
verify: fix numberio accounting of experimental verify In non-experimental verify, numberio is compared between the value saved in metadata and the value in the written data header. In experimental verify, the metadata is not available, so td->io_issues[] is used as the numberio value for the comparison instead. However, td->io_issues[] counts not only verify reads but also normal I/Os, which results in a comparison against the wrong numberio value and a verification failure. Fix this issue by adding a new field td->verify_read_issues, which counts the number of verify reads. Assign td->verify_read_issues to io_u->numberio so that it is used for the comparison in the experimental verify path. Also move the assignment of td->io_issues[] to io_u->numberio out of populate_verify_io_u() to keep the same behavior in the non-experimental verify path. Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
verify: fix bytes_done accounting of experimental verify The commit 55312f9f5572 ("Add ->bytes_done[] to struct thread_data") moved bytes_done[] from the stack to struct thread_data. However, this unified the two bytes_done[] arrays on the do_io() and do_verify() stacks into a single td->bytes_done[]. This caused a wrong condition check in do_verify() in the experimental verify path, since td->bytes_done[] holds values for do_io(), not for do_verify(). As a result, the loop in do_verify() broke unexpectedly and verify reads were skipped when the experimental_verify=1 option was specified. To fix this, add bytes_verified to struct thread_data for do_verify() in the same manner as bytes_done[] for do_io(). Introduce a helper function io_u_update_bytes_done() to factor out the common code for bytes_done[] and bytes_verified[]. Fixes: 55312f9f5572 ("Add ->bytes_done[] to struct thread_data") Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com> Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
iolog: add iolog_write for version 3 Add timestamps to all actions for iolog version 3. Fio now generates iolog files using version 3 by default, and only supports writing using that version. Reading iolog v2 still works as expected. Signed-off-by: Mohamad Gebai <mogeb@fb.com> Link: https://lore.kernel.org/r/20220407174031.599117-3-mogeb@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
iolog: add version 3 to support timestamp-based replay Version 3 format looks as follows: timestamp filename action offset length All file and IO actions must have timestamps, including 'add'. The 'wait' action is not allowed with version 3 so that we can leave all timing functionality to timestamps. Signed-off-by: Mohamad Gebai <mogeb@fb.com> Link: https://lore.kernel.org/r/20220407174031.599117-2-mogeb@fb.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
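To make the field order concrete, here is a small parsing sketch for one version 3 line. It is not fio's parser: the struct and function names are hypothetical, and it assumes that file actions such as 'add' carry only the first three fields (timestamp, filename, action) while IO actions carry all five.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one iolog version 3 line. */
struct example_iolog_v3_entry {
	unsigned long long timestamp;
	char filename[256];
	char action[16];
	unsigned long long offset;
	unsigned long long length;
	int has_range;	/* 1 if offset and length were present */
};

/*
 * Parse "timestamp filename action [offset length]".
 * Returns 0 on success, -1 for a malformed line or a 'wait' action,
 * which version 3 does not allow.
 */
static int example_parse_iolog_v3(const char *line,
				  struct example_iolog_v3_entry *e)
{
	int n;

	memset(e, 0, sizeof(*e));
	n = sscanf(line, "%llu %255s %15s %llu %llu",
		   &e->timestamp, e->filename, e->action,
		   &e->offset, &e->length);
	if (n != 3 && n != 5)
		return -1;
	if (!strcmp(e->action, "wait"))
		return -1;
	e->has_range = (n == 5);
	return 0;
}
```

Since every entry starts with a timestamp, a replayer can derive all inter-IO delays from consecutive timestamps, which is why a separate 'wait' action is unnecessary in this format.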
Properly encode engine flags in thread_data::flags We have 16 engine flags and an 18-bit shift, so cast engine flags to unsigned long long before shifting to avoid dropping FIO_ASYNCIO_SYNC_TRIM and FIO_NO_OFFLOAD. Also make thread_data::flags unsigned long long to ensure it fits all flags even when longs are 32 bit, and fix TD_ENG_FLAG_MASK. Signed-off-by: Alberto Faria <afaria@redhat.com>
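The arithmetic behind this fix: 16 flag bits shifted left by 18 occupy bits 18 through 33, two bits more than a 32-bit value can hold. The sketch below shows the effect with hypothetical helper names and fixed-width types standing in for the real fields; it is an illustration of the truncation, not the fio code.

```c
#include <assert.h>
#include <stdint.h>

#define EXAMPLE_ENG_FLAG_SHIFT	18
#define EXAMPLE_ENG_FLAG_BITS	16

/* Shift performed in 32 bits: flag bits 14 and 15 fall off the top. */
static uint64_t example_encode_narrow(uint32_t engine_flags)
{
	return (uint32_t)(engine_flags << EXAMPLE_ENG_FLAG_SHIFT);
}

/* Widen first, then shift: all 16 flag bits survive. */
static uint64_t example_encode_wide(uint32_t engine_flags)
{
	return (uint64_t)engine_flags << EXAMPLE_ENG_FLAG_SHIFT;
}

/* Extract the 16 engine flag bits back out of the combined flags. */
static uint32_t example_decode(uint64_t td_flags)
{
	uint64_t mask = (1ULL << EXAMPLE_ENG_FLAG_BITS) - 1;

	return (uint32_t)((td_flags >> EXAMPLE_ENG_FLAG_SHIFT) & mask);
}
```

Round-tripping the flag at bit 15 through example_encode_wide() and example_decode() returns it unchanged, while the narrow variant silently returns 0 for the top two flags.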
Add TD_F_SYNCS thread flag It's not enough to just track writes, some operating systems require a file to be opened for write to issue a file sync. Which does kind of make sense... Add such a flag and set it for iolog/blktrace replay, if we see a sync in there. This does mean we need to bump the IO engine version, as the engine flags need to get shifted. Link: https://github.com/axboe/fio/issues/1352 Signed-off-by: Jens Axboe <axboe@kernel.dk>
Cleanup __check_min_rate

This is a cleanup of __check_min_rate. In looking at stuff for previous fixes, it seems like there are a lot of boolean checks of things that are always true or always false. I'll explain my reasoning for each change; it is possible I'm missing something somehow but I've run through it a few times.

Here's my logic:

1) td->rate_bytes and td->rate_blocks are 0 on the first call to __check_min_rate, and then are the previous iteration's values of td->this_io_bytes and td->this_io_blocks on subsequent calls
2) bytes and iops are the current iteration's values of td->this_io_bytes and td->this_io_blocks
3) The values of td->this_io_bytes and td->this_io_blocks are monotonic with respect to each call of __check_min_rate

Therefore, bytes and iops are always greater than or equal to td->rate_bytes and td->rate_blocks. This means the "if (bytes < td->rate_bytes[ddir]) {" on line 176 can never happen.

Now, I want to say the same thing about line 197, but that line is weird/wrong in another way. rate_iops is td->o.rate_iops, the specified desired iops rate from the job. So I believe that is a bug: the specified desired iops rate should not even be examined in this function, just as the desired bytes rate is not. I'm pretty sure what is meant is to compare iops to td->rate_blocks just like bytes is compared to td->rate_bytes in line 176, which would similarly always be false.

Now we can focus on the else clauses (lines 180-192 and lines 202-213). If spent is 0, we should just be returning false early like in 169-170, so let's move that case up with it. The "if (rate < ratemin || bytes < td->rate_bytes[ddir]) {" and "if (rate < rate_iops_min || iops < td->rate_blocks[ddir]) {" both have impossibilities as the second part of the or clause. All we really want is to compare the computed bytes rate to ratemin, and the computed iops rate to rate_iops_min. With all of that, this function becomes a lot simpler.

The rest of the cleanup is renaming of variables to make what they are clearer, and some other simple things (like initializing the variables directly instead of initializing to zero and then doing +=). The renames are as follows:

- td->lastrate to td->last_rate_check_time, the last time a min rate check was performed
- bytes to current_rate_check_bytes, the number of bytes transferred so far at the time this call to __check_min_rate was made
- iops to current_rate_check_blocks, the number of blocks transferred so far at the time this call to __check_min_rate was made
- rate to current_rate_bytes or current_rate_iops, depending on if it is used as the current cycle's byte rate or block rate
- ratemin to option_rate_bytes_min, the user supplied desired minimum bytes rate
- rate_iops eliminated - should not be used in this function
- rate_iops_min to option_rate_iops_min, the user supplied desired minimum block rate
- td->rate_bytes to td->last_rate_check_bytes - the number of bytes transferred the *last* time a minimum rate check was called *and* passed (not shortcircuited because not enough time had elapsed for the cycle or settling)
- td->rate_blocks to td->last_rate_check_blocks - the number of blocks transferred the *last* time a minimum rate check was called *and* passed (not shortcircuited because not enough time had elapsed for the cycle or settling)

Signed-off-by: Nick Neumann nick@pcpartpicker.com
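The simplified comparison argued for above can be sketched roughly like this. It is not the cleaned-up __check_min_rate itself: the function name, the millisecond time unit and the per-second rate unit are assumptions for this example, and only the byte-rate half is shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the byte rate over the last check interval fell
 * below the requested minimum (bytes per second, assumed unit).
 *
 * current_bytes: bytes transferred so far at this check
 * last_bytes:    bytes transferred at the previous passed check
 * spent_ms:      time elapsed since the previous check, in ms
 */
static bool example_below_min_rate(uint64_t current_bytes,
				   uint64_t last_bytes,
				   uint64_t spent_ms,
				   uint64_t min_bytes_per_sec)
{
	uint64_t current_rate_bytes;

	/* Counters are monotonic, so this can never go backwards. */
	assert(current_bytes >= last_bytes);

	/* No time elapsed: nothing to measure, return early. */
	if (spent_ms == 0)
		return false;

	current_rate_bytes = (current_bytes - last_bytes) * 1000 / spent_ms;
	return current_rate_bytes < min_bytes_per_sec;
}
```

Once monotonicity is taken as given, the only remaining decision is the single comparison of the computed rate against the configured minimum, which matches the shape the commit message describes.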
blktrace.c: Make thread-safe by removing local static variables Local static variables are not thread-safe. Make the functions in blktrace.c safe by replacing them. Signed-off-by: Lukas Straub <lukasstraub2@web.de> Link: https://lore.kernel.org/r/b805bb3f6acf6c5b4d8811872c62af939aac62a7.1642626314.git.lukasstraub2@web.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
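A generic illustration of the problem and one common way around it. The commit message does not say what the static variables were replaced with, so moving the state into a caller-owned struct, as done here, is an assumption; all names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Not thread-safe: last_time is a single hidden variable shared by
 * every caller, so two threads replaying different traces would
 * corrupt each other's delays.
 */
static uint64_t example_delay_static(uint64_t now)
{
	static uint64_t last_time;
	uint64_t delay = last_time ? now - last_time : 0;

	last_time = now;
	return delay;
}

/* Per-caller state replaces the local static variable. */
struct example_trace_state {
	uint64_t last_time;
};

/* Same logic, but each thread passes in its own state. */
static uint64_t example_delay(struct example_trace_state *st, uint64_t now)
{
	uint64_t delay = st->last_time ? now - st->last_time : 0;

	st->last_time = now;
	return delay;
}
```

With two independent example_trace_state instances, interleaved calls produce delays relative to each caller's own previous timestamp, whereas the static version mixes them together.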