zbd: add the recover_zbd_write_error option

When the continue_on_error option is specified, the workload is expected
to keep running when non-critical errors happen. However, write
workloads with the zonemode=zbd option cannot continue after errors if
the failed writes leave partially written data on the target device.
Such a partial write creates a gap between the write pointer on the
device and the write pointer position tracked by fio, so the next write
requests issued by fio fail with unaligned write command errors. This
restriction causes undesirable test stops during long runs on SMR drives
that can recover defective sectors.

To allow write workloads with zonemode=zbd to continue after write
failures with partial data writes, introduce the new option
recover_zbd_write_error. When this option is specified together with the
continue_on_error option, fio checks the write pointer positions of the
write target zones in the error handling step. It then fixes the write
pointer by moving it to the position that the failed writes would have
moved it to. Bump up FIO_SERVER_VER to note that the new option is
added.

For that purpose, add a new function zbd_recover_write_error(). Call it
from zbd_queue_io() for sync IO engines, and from io_completed() for
async IO engines. Modify zbd_queue_io() to pass the pointer to the
status so that zbd_recover_write_error() can modify the status to ignore
the errors. Add three fields to struct fio_zone_info. The two new fields
writes_in_flight and max_write_error_offset track the status of
in-flight writes at the time of the write error, so that the write
pointer position can be fixed after the in-flight writes have completed.
The field fixing_zone_wp records that a write pointer fix is ongoing,
which prohibits new writes from being issued to the zone.
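
The bookkeeping described above can be sketched as a small standalone
model. This is only an illustration, not the code added by this commit:
struct zone_model and the model_*() helpers are hypothetical, and only
the three field names are taken from the description. The field types
and the choice to track the end offset of each failed write are
assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the per-zone state. Only the three field
 * names below come from the commit message; the types are assumptions.
 */
struct zone_model {
	uint64_t wp;                     /* fio's view of the write pointer */
	uint32_t writes_in_flight;       /* writes issued but not yet completed */
	uint64_t max_write_error_offset; /* highest end offset of a failed write */
	bool fixing_zone_wp;             /* true while a write pointer fix is pending */
};

/* Returns false when a fix is pending and the write must not be issued. */
static bool model_issue_write(struct zone_model *z)
{
	if (z->fixing_zone_wp)
		return false;
	z->writes_in_flight++;
	return true;
}

/*
 * Record one write completion. Returns true when all in-flight writes
 * have drained and the caller should now fix the write pointer.
 */
static bool model_complete_write(struct zone_model *z, uint64_t end_offset,
				 bool failed)
{
	assert(z->writes_in_flight > 0);
	z->writes_in_flight--;
	if (failed) {
		z->fixing_zone_wp = true;
		if (end_offset > z->max_write_error_offset)
			z->max_write_error_offset = end_offset;
	}
	return z->fixing_zone_wp && z->writes_in_flight == 0;
}

/* Move the tracked write pointer to where the failed writes would have ended. */
static void model_fix_done(struct zone_model *z)
{
	z->wp = z->max_write_error_offset;
	z->max_write_error_offset = 0;
	z->fixing_zone_wp = false;
}
```

In this model, a failed completion blocks new writes to the zone until
the remaining in-flight writes have completed and model_fix_done() has
run.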

When the failed write is synchronous, the write pointer is fixed by
writing the remaining data of the failed write. This keeps the verify
patterns written to the device, so verify works together with the
recover_zbd_write_error option. When the failed write is asynchronous,
other in-flight writes fail together with it. In this case, fio waits
for all in-flight writes to complete and then fixes the write pointer.
The verify data of the failed writes is then lost and verify does not
work. Check that the recover_zbd_write_error option is not specified
together with a verify workload and an asynchronous IO engine.
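
For the synchronous case, the arithmetic behind "writing the remaining
data" can be illustrated as follows. This is a sketch under stated
assumptions, not the function added by this commit: struct
remaining_write and remaining_after_partial_write() are hypothetical
names, and the sketch assumes the device write pointer stopped somewhere
inside the range of the failed write.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of the data still to be written. */
struct remaining_write {
	uint64_t dev_offset; /* device offset at which to resume writing */
	uint64_t buf_skip;   /* bytes of the original buffer already on the device */
	uint64_t len;        /* bytes still to be written; 0 means nothing to do */
};

/*
 * Given the offset and length of the failed write and the write pointer
 * reported by the device, compute the tail of the write that is missing.
 */
static struct remaining_write
remaining_after_partial_write(uint64_t io_offset, uint64_t io_len,
			      uint64_t device_wp)
{
	struct remaining_write r = { device_wp, 0, 0 };
	uint64_t io_end = io_offset + io_len;

	/* Write pointer outside the failed write: this sketch does nothing. */
	if (device_wp < io_offset || device_wp >= io_end)
		return r;

	r.buf_skip = device_wp - io_offset;
	r.len = io_end - device_wp;
	return r;
}
```

Resuming from buf_skip in the original buffer is what would keep the
verify pattern on the device intact in this model.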

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250425052148.126788-6-shinichiro.kawasaki@wdc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>