engines/libblkio: Add option libblkio_force_enable_completion_eventfd

diff --git a/fio.1 b/fio.1

index 5aa54a4d0471772276737edae926bba7b7f7e63b..da5483037bad06c224c99dc8b095b7709540d2fd 100644
--- a/fio.1
+++ b/fio.1
@@ -67,8 +67,8 @@ List all commands defined by \fIioengine\fR, or print help for \fIcommand\fR
  defined by \fIioengine\fR. If no \fIioengine\fR is given, list all
  available ioengines.
  .TP
-.BI \-\-showcmd \fR=\fPjobfile
-Convert \fIjobfile\fR to a set of command\-line options.
+.BI \-\-showcmd
+Convert given \fIjobfile\fRs to a set of command\-line options.
  .TP
  .BI \-\-readonly
  Turn on safety read\-only checks, preventing writes and trims. The \fB\-\-readonly\fR
@@ -292,7 +292,7 @@ For Zone Block Device Mode:
  .RS
  .P
  .PD 0
-z means Zone 
+z means Zone
  .P
  .PD
  .RE
@@ -569,7 +569,7 @@ by this option will be \fBsize\fR divided by number of files unless an
  explicit size is specified by \fBfilesize\fR.
  .RS
  .P
-Each colon in the wanted path must be escaped with a '\\'
+Each colon in the wanted path must be escaped with a '\e'
  character. For instance, if the path is `/dev/dsk/foo@3,0:c' then you
  would use `filename=/dev/dsk/foo@3,0\\:c' and if the path is
  `F:\\filename' then you would use `filename=F\\:\\filename'.
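The escaping rule in this hunk can be sketched in Python. This is only an illustration: `escape_filename` is a hypothetical helper, not part of fio, and it does nothing beyond placing a backslash before each colon.

```python
def escape_filename(path: str) -> str:
    """Escape each colon with a backslash, as the filename option expects.

    Hypothetical helper for generating job files; not part of fio itself.
    """
    return path.replace(":", "\\:")


if __name__ == "__main__":
    # Both paths are the examples used in the man page text.
    print("filename=" + escape_filename("/dev/dsk/foo@3,0:c"))
    print("filename=" + escape_filename("F:\\filename"))
```

A helper like this is mostly useful when job files are generated by a script, where an unescaped colon would otherwise be read as a separator between paths.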
@@ -766,6 +766,8 @@ starts. The \fBzonecapacity\fR parameter is ignored.
  Zoned block device mode. I/O happens sequentially in each zone, even if random
  I/O has been selected. Random I/O happens across all zones instead of being
  restricted to a single zone.
+Trim is handled using a zone reset operation. Trim only considers non-empty
+sequential write required and sequential write preferred zones.
  .RE
  .RE
  .TP
@@ -828,7 +830,7 @@ so. Default: false.
  .BI max_open_zones \fR=\fPint
  When running a random write test across an entire drive many more zones will be
  open than in a typical application workload. Hence this command line option
-that allows to limit the number of open zones. The number of open zones is
+that allows one to limit the number of open zones. The number of open zones is
  defined as the number of zones to which write commands are issued by all
  threads/processes.
  .TP
@@ -836,9 +838,9 @@ threads/processes.
  Limit on the number of simultaneously opened zones per single thread/process.
  .TP
  .BI ignore_zone_limits \fR=\fPbool
-If this isn't set, fio will query the max open zones limit from the zoned block
-device, and exit if the specified \fBmax_open_zones\fR value is larger than the
-limit reported by the device. Default: false.
+If this option is used, fio will ignore the maximum number of open zones limit
+of the zoned block device in use, thus allowing the option \fBmax_open_zones\fR
+value to be larger than the device reported limit. Default: false.
  .TP
  .BI zone_reset_threshold \fR=\fPfloat
  A number between zero and one that indicates the ratio of logical blocks with
@@ -898,7 +900,15 @@ Random mixed reads and writes.
  .TP
  .B trimwrite
  Sequential trim+write sequences. Blocks will be trimmed first,
-then the same blocks will be written to.
+then the same blocks will be written to. So if `io_size=64K' is specified,
+Fio will trim a total of 64K bytes and also write 64K bytes on the same
+trimmed blocks. This behaviour will be consistent with `number_ios' or
+other Fio options limiting the total bytes or number of I/O's.
+.TP
+.B randtrimwrite
+Like
+.B trimwrite ,
+but uses random offsets rather than sequential writes.
  .RE
  .P
  Fio defaults to read if the option is not specified. For the mixed I/O
@@ -1081,7 +1091,7 @@ provided. Data before the given offset will not be touched. This
  effectively caps the file size at `real_size \- offset'. Can be combined with
  \fBsize\fR to constrain the start and end range of the I/O workload.
  A percentage can be specified by a number between 1 and 100 followed by '%',
-for example, `offset=20%' to specify 20%. In ZBD mode, value can be set as 
+for example, `offset=20%' to specify 20%. In ZBD mode, value can be set as
  number of zones using 'z'.
  .TP
  .BI offset_align \fR=\fPint
@@ -1097,7 +1107,7 @@ specified). This option is useful if there are several jobs which are
  intended to operate on a file in parallel disjoint segments, with even
  spacing between the starting points. Percentages can be used for this option.
  If a percentage is given, the generated offset will be aligned to the minimum
-\fBblocksize\fR or to the value of \fBoffset_align\fR if provided.In ZBD mode, value 
+\fBblocksize\fR or to the value of \fBoffset_align\fR if provided. In ZBD mode, value
  can be set as number of zones using 'z'.
  .TP
  .BI number_ios \fR=\fPint
@@ -1120,7 +1130,7 @@ see \fBend_fsync\fR and \fBfsync_on_close\fR.
  .TP
  .BI fdatasync \fR=\fPint
  Like \fBfsync\fR but uses \fBfdatasync\fR\|(2) to only sync data and
-not metadata blocks. In Windows, FreeBSD, DragonFlyBSD or OSX there is no
+not metadata blocks. In Windows, DragonFlyBSD or OSX there is no
  \fBfdatasync\fR\|(2) so this falls back to using \fBfsync\fR\|(2).
  Defaults to 0, which means fio does not periodically issue and wait for a
  data-only sync to complete.
@@ -1214,12 +1224,12 @@ map. For the \fBnormal\fR distribution, a normal (Gaussian) deviation is
  supplied as a value between 0 and 100.
  .P
  The second, optional float is allowed for \fBpareto\fR, \fBzipf\fR and \fBnormal\fR
-distributions. It allows to set base of distribution in non-default place, giving
+distributions. It allows one to set base of distribution in non-default place, giving
  more control over most probable outcome. This value is in range [0-1] which maps linearly to
  range of possible random values.
  Defaults are: random for \fBpareto\fR and \fBzipf\fR, and 0.5 for \fBnormal\fR.
  If you wanted to use \fBzipf\fR with a `theta` of 1.2 centered on 1/4 of allowed value range,
-you would use `random_distibution=zipf:1.2:0.25`.
+you would use `random_distribution=zipf:1.2:0.25`.
  .P
  For a \fBzoned\fR distribution, fio supports specifying percentages of I/O
  access that should fall within what range of the file or device. For
@@ -1509,6 +1519,57 @@ all \-\- this option only controls the distribution of unique buffers. Setting
  this option will also enable \fBrefill_buffers\fR to prevent every buffer
  being identical.
  .TP
+.BI dedupe_mode \fR=\fPstr
+If \fBdedupe_percentage\fR is given, then this option controls how fio
+generates the dedupe buffers.
+.RS
+.RS
+.TP
+.B repeat
+.P
+.RS
+Generate dedupe buffers by repeating previous writes
+.RE
+.TP
+.B working_set
+.P
+.RS
+Generate dedupe buffers from working set
+.RE
+.RE
+.P
+\fBrepeat\fR is the default option for fio. Dedupe buffers are generated
+by repeating the previous unique write.
+
+\fBworking_set\fR is a more realistic workload.
+With \fBworking_set\fR, \fBdedupe_working_set_percentage\fR should be provided.
+Given that, fio will use the initial unique write buffers as its working set.
+Upon deciding to dedupe, fio will randomly choose a buffer from the working set.
+Note that by using \fBworking_set\fR the dedupe percentage will converge
+to the desired value over time, while \fBrepeat\fR maintains the desired
+percentage throughout the job.
+.RE
+.RE
+.TP
+.BI dedupe_working_set_percentage \fR=\fPint
+If \fBdedupe_mode\fR is set to \fBworking_set\fR, then this controls
+the percentage of the file or device size that fio uses as the pool of
+buffers from which the dedupe buffers are generated.
+.P
+.RS
+Note that \fBsize\fR needs to be explicitly provided and only 1 file
+per job is supported.
+.RE
+.TP
+.BI dedupe_global \fR=\fPbool
+This controls whether the deduplication buffers will be shared amongst
+all jobs that have this option set. The buffers are spread evenly between
+participating jobs.
+.P
+.RS
+Note that \fBdedupe_mode\fR must be set to \fBworking_set\fR for this to work.
+Can be used in combination with compression.
+.TP
  .BI invalidate \fR=\fPbool
  Invalidate the buffer/page cache parts of the files to be used prior to
  starting I/O if the platform and file type support it. Defaults to true.
@@ -1578,11 +1639,11 @@ multiplied by the I/O depth given. Note that for \fBshmhuge\fR and
  \fBmmaphuge\fR to work, the system must have free huge pages allocated. This
  can normally be checked and set by reading/writing
  `/proc/sys/vm/nr_hugepages' on a Linux system. Fio assumes a huge page
-is 4MiB in size. So to calculate the number of huge pages you need for a
-given job file, add up the I/O depth of all jobs (normally one unless
-\fBiodepth\fR is used) and multiply by the maximum bs set. Then divide
-that number by the huge page size. You can see the size of the huge pages in
-`/proc/meminfo'. If no huge pages are allocated by having a non-zero
+is 2 or 4MiB in size depending on the platform. So to calculate the number of
+huge pages you need for a given job file, add up the I/O depth of all jobs
+(normally one unless \fBiodepth\fR is used) and multiply by the maximum bs set.
+Then divide that number by the huge page size. You can see the size of the huge
+pages in `/proc/meminfo'. If no huge pages are allocated by having a non-zero
  number in `nr_hugepages', using \fBmmaphuge\fR or \fBshmhuge\fR will fail. Also
  see \fBhugepage\-size\fR.
  .P
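The huge page calculation described in this hunk can be written out as a short Python sketch. The function name and the round-up step are assumptions made for illustration (the text only says to divide); the inputs are the values named in the man page.

```python
def huge_pages_needed(iodepths, max_bs, hugepage_size):
    """Estimate how many huge pages a job file needs.

    iodepths      -- I/O depth of each job (1 unless iodepth is set)
    max_bs        -- largest block size in use, in bytes
    hugepage_size -- huge page size in bytes (see /proc/meminfo)
    """
    total_bytes = sum(iodepths) * max_bs
    # Assumption: round up, since a partly used huge page must still exist.
    return (total_bytes + hugepage_size - 1) // hugepage_size


if __name__ == "__main__":
    MIB = 1024 * 1024
    # Two jobs with iodepth=16 each, 1MiB blocks, 2MiB huge pages.
    print(huge_pages_needed([16, 16], 1 * MIB, 2 * MIB))
```

The result is the value one would write to `/proc/sys/vm/nr_hugepages` before running a job that uses mmaphuge or shmhuge.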
@@ -1602,10 +1663,11 @@ of subsequent I/O memory buffers is the sum of the \fBiomem_align\fR and
  \fBbs\fR used.
  .TP
  .BI hugepage\-size \fR=\fPint
-Defines the size of a huge page. Must at least be equal to the system
-setting, see `/proc/meminfo'. Defaults to 4MiB. Should probably
-always be a multiple of megabytes, so using `hugepage\-size=Xm' is the
-preferred way to set this to avoid setting a non-pow-2 bad value.
+Defines the size of a huge page. Must at least be equal to the system setting,
+see `/proc/meminfo' and `/sys/kernel/mm/hugepages/'. Defaults to 2 or 4MiB
+depending on the platform. Should probably always be a multiple of megabytes,
+so using `hugepage\-size=Xm' is the preferred way to set this to avoid setting
+a non-pow-2 bad value.
  .TP
  .BI lockmem \fR=\fPint
  Pin the specified amount of memory with \fBmlock\fR\|(2). Can be used to
@@ -1614,8 +1676,11 @@ simulate a smaller amount of memory. The amount specified is per worker.
  .TP
  .BI size \fR=\fPint[%|z]
  The total size of file I/O for each thread of this job. Fio will run until
-this many bytes has been transferred, unless runtime is limited by other options
-(such as \fBruntime\fR, for instance, or increased/decreased by \fBio_size\fR).
+this many bytes has been transferred, unless runtime is altered by other means
+such as (1) \fBruntime\fR, (2) \fBio_size\fR, (3) \fBnumber_ios\fR, (4)
+gaps/holes while doing I/O's such as `rw=read:16K', or (5) sequential I/O
+reaching end of the file which is possible when \fBpercentage_random\fR is
+less than 100.
  Fio will divide this size between the available files determined by options
  such as \fBnrfiles\fR, \fBfilename\fR, unless \fBfilesize\fR is
  specified by the job. If the result of division happens to be 0, the size is
@@ -1624,7 +1689,7 @@ If this option is not specified, fio will use the full size of the given
  files or devices. If the files do not exist, size must be given. It is also
  possible to give size as a percentage between 1 and 100. If `size=20%' is
  given, fio will use 20% of the full size of the given files or devices. In ZBD mode,
-size can be given in units of number of zones using 'z'. Can be combined with \fBoffset\fR to 
+size can be given in units of number of zones using 'z'. Can be combined with \fBoffset\fR to
  constrain the start and end range that I/O will be done within.
  .TP
  .BI io_size \fR=\fPint[%|z] "\fR,\fB io_limit" \fR=\fPint[%|z]
@@ -1642,10 +1707,10 @@ also be set as number of zones using 'z'.
  .TP
  .BI filesize \fR=\fPirange(int)
  Individual file sizes. May be a range, in which case fio will select sizes
-for files at random within the given range and limited to \fBsize\fR in
-total (if that is given). If not given, each created file is the same size.
-This option overrides \fBsize\fR in terms of file size, which means
-this value is used as a fixed size or possible range of each file.
+for files at random within the given range. If not given, each created file
+is the same size. This option overrides \fBsize\fR in terms of file size,
+i.e. \fBsize\fR becomes merely the default for \fBio_size\fR (and
+has no effect at all if \fBio_size\fR is set explicitly).
  .TP
  .BI file_append \fR=\fPbool
  Perform I/O after the end of the file. Normally fio will operate within the
@@ -1658,9 +1723,7 @@ Sets size to something really large and waits for ENOSPC (no space left on
  device) or EDQUOT (disk quota exceeded)
  as the terminating condition. Only makes sense with sequential
  write. For a read workload, the mount point will be filled first then I/O
-started on the result. This option doesn't make sense if operating on a raw
-device node, since the size of that is already known by the file system.
-Additionally, writing beyond end-of-device will not return ENOSPC there.
+started on the result.
  .SS "I/O engine"
  .TP
  .BI ioengine \fR=\fPstr
@@ -1687,6 +1750,15 @@ Basic \fBpreadv\fR\|(2) or \fBpwritev\fR\|(2) I/O.
  .B pvsync2
  Basic \fBpreadv2\fR\|(2) or \fBpwritev2\fR\|(2) I/O.
  .TP
+.B io_uring
+Fast Linux native asynchronous I/O. Supports async IO
+for both direct and buffered IO.
+This engine defines engine specific options.
+.TP
+.B io_uring_cmd
+Fast Linux native asynchronous I/O for passthrough commands.
+This engine defines engine specific options.
+.TP
  .B libaio
  Linux native asynchronous I/O. Note that Linux may only support
  queued behavior with non-buffered I/O (set `direct=1' or
@@ -1721,10 +1793,9 @@ character devices. This engine supports trim operations. The
  sg engine includes engine specific options.
  .TP
  .B libzbc
-Synchronous I/O engine for SMR hard-disks using the \fBlibzbc\fR
-library. The target can be either an sg character device or
-a block device file. This engine supports the zonemode=zbd zone
-operations.
+Read, write, trim and ZBC/ZAC operations to a zoned block device using
+\fBlibzbc\fR library. The target can be either an SG character device or
+a block device file.
  .TP
  .B null
  Doesn't transfer any data, just pretends to. This is mainly used to
@@ -1912,49 +1983,134 @@ I/O engine supporting asynchronous read and write operations to
  NFS filesystems from userspace via libnfs. This is useful for
  achieving higher concurrency and thus throughput than is possible
  via kernel NFS.
+.TP
+.B exec
+Execute 3rd party tools. Can be used to perform monitoring during a job's runtime.
+.TP
+.B xnvme
+I/O engine using the xNVMe C API, for NVMe devices. The xnvme engine provides
+flexibility to access GNU/Linux Kernel NVMe driver via libaio, IOCTLs, io_uring,
+the SPDK NVMe driver, or your own custom NVMe driver. The xnvme engine includes
+engine specific options. (See \fIhttps://xnvme.io/\fR).
+.TP
+.B libblkio
+Use the libblkio library (\fIhttps://gitlab.com/libblkio/libblkio\fR). The
+specific driver to use must be set using \fBlibblkio_driver\fR. If
+\fBmem\fR/\fBiomem\fR is not specified, memory allocation is delegated to
+libblkio (and so is guaranteed to work with the selected driver).
  .SS "I/O engine specific parameters"
  In addition, there are some parameters which are only valid when a specific
  \fBioengine\fR is in use. These are used identically to normal parameters,
  with the caveat that when used on the command line, they must come after the
  \fBioengine\fR that defines them is selected.
  .TP
-.BI (io_uring, libaio)cmdprio_percentage \fR=\fPint
-Set the percentage of I/O that will be issued with higher priority by setting
-the priority bit. Non-read I/O is likely unaffected by ``cmdprio_percentage``.
-This option cannot be used with the `prio` or `prioclass` options. For this
-option to set the priority bit properly, NCQ priority must be supported and
-enabled and `direct=1' option must be used. fio must also be run as the root
-user.
+.BI (io_uring,libaio)cmdprio_percentage \fR=\fPint[,int]
+Set the percentage of I/O that will be issued with the highest priority.
+Default: 0. A single value applies to reads and writes. Comma-separated
+values may be specified for reads and writes. For this option to be effective,
+NCQ priority must be supported and enabled, and `direct=1' option must be
+used. fio must also be run as the root user. Unlike slat/clat/lat stats, which
+can be tracked and reported independently, per priority stats only track and
+report a single type of latency. By default, completion latency (clat) will be
+reported; if \fBlat_percentiles\fR is set, total latency (lat) will be reported.
  .TP
-.BI (io_uring)fixedbufs
+.BI (io_uring,libaio)cmdprio_class \fR=\fPint[,int]
+Set the I/O priority class to use for I/Os that must be issued with a
+priority when \fBcmdprio_percentage\fR or \fBcmdprio_bssplit\fR is set.
+If not specified when \fBcmdprio_percentage\fR or \fBcmdprio_bssplit\fR
+is set, this defaults to the highest priority class. A single value applies
+to reads and writes. Comma-separated values may be specified for reads and
+writes. See man \fBionice\fR\|(1). See also the \fBprioclass\fR option.
+.TP
+.BI (io_uring,libaio)cmdprio \fR=\fPint[,int]
+Set the I/O priority value to use for I/Os that must be issued with a
+priority when \fBcmdprio_percentage\fR or \fBcmdprio_bssplit\fR is set.
+If not specified when \fBcmdprio_percentage\fR or \fBcmdprio_bssplit\fR
+is set, this defaults to 0. Linux limits us to a positive value between
+0 and 7, with 0 being the highest. A single value applies to reads and writes.
+Comma-separated values may be specified for reads and writes. See man
+\fBionice\fR\|(1). Refer to an appropriate manpage for other operating systems
+since the meaning of priority may differ. See also the \fBprio\fR option.
+.TP
+.BI (io_uring,libaio)cmdprio_bssplit \fR=\fPstr[,str]
+To get a finer control over I/O priority, this option allows specifying
+the percentage of IOs that must have a priority set depending on the block
+size of the IO. This option is useful only when used together with the option
+\fBbssplit\fR, that is, multiple different block sizes are used for reads and
+writes.
+.RS
+.P
+The first accepted format for this option is the same as the format of the
+\fBbssplit\fR option:
+.RS
+.P
+cmdprio_bssplit=blocksize/percentage:blocksize/percentage
+.RE
+.P
+In this case, each entry will use the priority class and priority level defined
+by the options \fBcmdprio_class\fR and \fBcmdprio\fR respectively.
+.P
+The second accepted format for this option is:
+.RS
+.P
+cmdprio_bssplit=blocksize/percentage/class/level:blocksize/percentage/class/level
+.RE
+.P
+In this case, the priority class and priority level is defined inside each
+entry. In comparison with the first accepted format, the second accepted format
+does not restrict all entries to have the same priority class and priority
+level.
+.P
+For both formats, only the read and write data directions are supported, values
+for trim IOs are ignored. This option is mutually exclusive with the
+\fBcmdprio_percentage\fR option.
+.RE
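The two accepted cmdprio_bssplit formats can be illustrated with a small Python parser. This is a sketch rather than fio's own parsing code: `parse_cmdprio_bssplit` is a hypothetical name, it handles the string for a single data direction only (no comma-separated read,write pair), and it treats class and level as integers, as the cmdprio_class and cmdprio options do.

```python
def parse_cmdprio_bssplit(value):
    """Split one direction's cmdprio_bssplit string into entries.

    Returns a list of (blocksize, percentage, prio_class, level) tuples.
    prio_class and level are None for the first (two-field) format, where
    cmdprio_class and cmdprio supply them instead.
    """
    entries = []
    for entry in value.split(":"):
        fields = entry.split("/")
        if len(fields) == 2:
            # First format: blocksize/percentage
            prio_class = level = None
        elif len(fields) == 4:
            # Second format: blocksize/percentage/class/level
            prio_class, level = int(fields[2]), int(fields[3])
        else:
            raise ValueError(f"unsupported entry: {entry!r}")
        entries.append((fields[0], int(fields[1]), prio_class, level))
    return entries


if __name__ == "__main__":
    print(parse_cmdprio_bssplit("64k/40:1024k/60"))
    print(parse_cmdprio_bssplit("64k/40/1/2:1024k/60/2/7"))
```

The second call shows the point of the longer format: each block size carries its own class and level instead of sharing one pair.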
+.TP
+.BI (io_uring,io_uring_cmd)fixedbufs
  If fio is asked to do direct IO, then Linux will map pages for each IO call, and
  release them when IO is done. If this option is set, the pages are pre-mapped
  before IO is started. This eliminates the need to map and release for each IO.
  This is more efficient, and reduces the IO latency as well.
  .TP
-.BI (io_uring)hipri
+.BI (io_uring,io_uring_cmd)nonvectored \fR=\fPint
+With this option, fio will use non-vectored read/write commands, where address
+must contain the address directly. Default is -1.
+.TP
+.BI (io_uring,io_uring_cmd)force_async
+Normal operation for io_uring is to try and issue an sqe as non-blocking first,
+and if that fails, execute it in an async manner. With this option set to N,
+fio will ask for the sqe of every Nth request to be issued in an async manner.
+Default is 0.
+.TP
+.BI (io_uring,io_uring_cmd,xnvme)hipri
  If this option is set, fio will attempt to use polled IO completions. Normal IO
  completions generate interrupts to signal the completion of IO, polled
  completions do not. Hence they require active reaping by the application.
  The benefits are more efficient IO for high IOPS scenarios, and lower latencies
  for low queue depth IO.
  .TP
-.BI (io_uring)registerfiles
+.BI (io_uring,io_uring_cmd)registerfiles
  With this option, fio registers the set of files being used with the kernel.
  This avoids the overhead of managing file counts in the kernel, making the
  submission and completion part more lightweight. Required for the below
  sqthread_poll option.
  .TP
-.BI (io_uring)sqthread_poll
+.BI (io_uring,io_uring_cmd,xnvme)sqthread_poll
  Normally fio will submit IO by issuing a system call to notify the kernel of
  available items in the SQ ring. If this option is set, the act of submitting IO
  will be done by a polling thread in the kernel. This frees up cycles for fio, at
-the cost of using more CPU in the system.
+the cost of using more CPU in the system. As submission is just the time it
+takes to fill in the sqe entries and any syscall required to wake up the idle
+kernel thread, fio will not report submission latencies.
  .TP
-.BI (io_uring)sqthread_poll_cpu
+.BI (io_uring,io_uring_cmd)sqthread_poll_cpu \fR=\fPint
  When `sqthread_poll` is set, this option provides a way to define which CPU
  should be used for the polling thread.
  .TP
+.BI (io_uring_cmd)cmd_type \fR=\fPstr
+Specifies the type of uring passthrough command to be used. Supported
+value is nvme. Default is nvme.
+.TP
  .BI (libaio)userspace_reap
  Normally, with the libaio engine in use, fio will use the
  \fBio_getevents\fR\|(3) system call to reap newly returned events. With
@@ -1970,7 +2126,7 @@ than normal.
  When hipri is set this determines the probability of a pvsync2 I/O being high
  priority. The default is 100%.
  .TP
-.BI (pvsync2,libaio,io_uring)nowait
+.BI (pvsync2,libaio,io_uring,io_uring_cmd)nowait \fR=\fPbool
  By default if a request cannot be executed immediately (e.g. resource starvation,
  waiting on locks) it is queued and the initiating process will be blocked until
  the required resource becomes free.
@@ -1993,26 +2149,39 @@ option when using cpuio I/O engine.
  .BI (cpuio)cpuchunks \fR=\fPint
  Split the load into cycles of the given time. In microseconds.
  .TP
+.BI (cpuio)cpumode \fR=\fPstr
+Specify how to stress the CPU. It can take these two values:
+.RS
+.RS
+.TP
+.B noop
+This is the default and directs the CPU to execute noop instructions.
+.TP
+.B qsort
+Replace the default noop instructions with a qsort algorithm to consume more energy.
+.RE
+.RE
+.TP
  .BI (cpuio)exit_on_io_done \fR=\fPbool
  Detect when I/O threads are done, then exit.
  .TP
  .BI (libhdfs)namenode \fR=\fPstr
  The hostname or IP address of a HDFS cluster namenode to contact.
  .TP
-.BI (libhdfs)port
+.BI (libhdfs)port \fR=\fPint
  The listening port of the HDFS cluster namenode.
  .TP
-.BI (netsplice,net)port
+.BI (netsplice,net)port \fR=\fPint
  The TCP or UDP port to bind to or connect to. If this is used with
  \fBnumjobs\fR to spawn multiple instances of the same job type, then
  this will be the starting port number since fio will use a range of
  ports.
  .TP
-.BI (rdma, librpma_*)port
+.BI (rdma,librpma_*)port \fR=\fPint
  The port to use for RDMA-CM communication. This should be the same
  value on the client and the server side.
  .TP
-.BI (netsplice,net, rdma)hostname \fR=\fPstr
+.BI (netsplice,net,rdma)hostname \fR=\fPstr
  The hostname or IP address to use for TCP, UDP or RDMA-CM based I/O.
  If the job is a TCP listener or UDP reader, the hostname is not used
  and must be omitted unless it is a valid UDP multicast address.
@@ -2117,6 +2286,10 @@ Ceph cluster. If the \fBclustername\fR is specified, the \fBclientname\fR shall
  the full *type.id* string. If no type. prefix is given, fio will add 'client.'
  by default.
  .TP
+.BI (rados)conf \fR=\fPstr
+Specifies the configuration path of ceph cluster, so conf file does not
+have to be /etc/ceph/ceph.conf.
+.TP
  .BI (rbd,rados)busy_poll \fR=\fPbool
  Poll store instead of waiting for completion. Usually this provides better
  throughput at the cost of higher (up to 100%) CPU utilization.
@@ -2154,6 +2327,15 @@ The S3 secret key.
  .BI (http)http_s3_keyid \fR=\fPstr
  The S3 key/access id.
  .TP
+.BI (http)http_s3_sse_customer_key \fR=\fPstr
+The customer-provided encryption key for server-side encryption (SSE).
+.TP
+.BI (http)http_s3_sse_customer_algorithm \fR=\fPstr
+The customer-provided encryption algorithm for server-side encryption (SSE).
+Default is \fBAES256\fR.
+.TP
+.BI (http)http_s3_storage_class \fR=\fPstr
+Which storage class to access. User-customizable settings. Default is \fBSTANDARD\fR.
+.TP
  .BI (http)http_swift_auth_token \fR=\fPstr
  The Swift auth token. See the example configuration file on how to
  retrieve this.
@@ -2210,7 +2392,7 @@ With writefua option set to 1, write operations include the force
  unit access (fua) flag. Default: 0.
  .TP
  .BI (sg)sg_write_mode \fR=\fPstr
-Specify the type of write commands to issue. This option can take three
+Specify the type of write commands to issue. This option can take multiple
  values:
  .RS
  .RS
@@ -2218,12 +2400,15 @@ values:
  .B write (default)
  Write opcodes are issued as usual
  .TP
+.B write_and_verify
+Issue WRITE AND VERIFY commands. The BYTCHK bit is set to 00b. This directs the
+device to carry out a medium verification with no data comparison for the data
+that was written. The writefua option is ignored with this selection.
+.TP
  .B verify
-Issue WRITE AND VERIFY commands. The BYTCHK bit is set to 0. This
-directs the device to carry out a medium verification with no data
-comparison. The writefua option is ignored with this selection.
+This option is deprecated. Use write_and_verify instead.
  .TP
-.B same
+.B write_same
  Issue WRITE SAME commands. This transfers a single block to the device
  and writes this same block of data to a contiguous sequence of LBAs
  beginning at the specified offset. fio's block size parameter
@@ -2234,9 +2419,43 @@ blocksize=8k will write 16 sectors with each command. fio will still
  generate 8k of data for each command but only the first 512 bytes will
  be used and transferred to the device. The writefua option is ignored
  with this selection.
+.TP
+.B same
+This option is deprecated. Use write_same instead.
+.TP
+.B write_same_ndob
+Issue WRITE SAME(16) commands as above but with the No Data Output
+Buffer (NDOB) bit set. No data will be transferred to the device with
+this bit set. Data written will be a pre-determined pattern such as
+all zeroes.
+.TP
+.B write_stream
+Issue WRITE STREAM(16) commands. Use the stream_id option to specify
+the stream identifier.
+.TP
+.B verify_bytchk_00
+Issue VERIFY commands with BYTCHK set to 00. This directs the device to carry
+out a medium verification with no data comparison.
+.TP
+.B verify_bytchk_01
+Issue VERIFY commands with BYTCHK set to 01. This directs the device to
+compare the data on the device with the data transferred to the device.
+.TP
+.B verify_bytchk_11
+Issue VERIFY commands with BYTCHK set to 11. This transfers a single block to
+the device and compares the contents of this block with the data on the device
+beginning at the specified offset. fio's block size parameter specifies the
+total amount of data compared with this command. However, only one block
+(sector) worth of data is transferred to the device. This is similar to the
+WRITE SAME command except that data is compared instead of written.
  .RE
  .RE
  .TP
+.BI (sg)stream_id \fR=\fPint
+Set the stream identifier for WRITE STREAM commands. If this is set to 0 (which is not
+a valid stream identifier) fio will open a stream and then close it when done. Default
+is 0.
+.TP
  .BI (nbd)uri \fR=\fPstr
  Specify the NBD URI of the server to test.
  The string is a standard NBD URI (see
@@ -2282,22 +2501,167 @@ the use of cudaMemcpy.
  .RE
  .TP
  .BI (dfs)pool
-Specify the UUID of the DAOS pool to connect to.
+Specify the label or UUID of the DAOS pool to connect to.
  .TP
  .BI (dfs)cont
-Specify the UUID of the DAOS DAOS container to open.
+Specify the label or UUID of the DAOS container to open.
  .TP
  .BI (dfs)chunk_size
-Specificy a different chunk size (in bytes) for the dfs file.
+Specify a different chunk size (in bytes) for the dfs file.
  Use DAOS container's chunk size by default.
  .TP
  .BI (dfs)object_class
-Specificy a different object class for the dfs file.
+Specify a different object class for the dfs file.
  Use DAOS container's object class by default.
  .TP
  .BI (nfs)nfs_url
  URL in libnfs format, eg nfs://<server|ipv4|ipv6>/path[?arg=val[&arg=val]*]
  Refer to the libnfs README for more details.
+.TP
+.BI (exec)program\fR=\fPstr
+Specify the program to execute.
+Note the program will receive a SIGTERM when the job is reaching the time limit.
+A SIGKILL is sent once the job is over. The delay between the two signals is defined by \fBgrace_time\fR option.
+.TP
+.BI (exec)arguments\fR=\fPstr
+Specify arguments to pass to program.
+Some special variables can be expanded to pass fio's job details to the program:
+.RS
+.RS
+.TP
+.B %r
+replaced by the duration of the job in seconds
+.TP
+.BI %n
+replaced by the name of the job
+.RE
+.RE
+.TP
+.BI (exec)grace_time\fR=\fPint
+Defines the time between the SIGTERM and SIGKILL signals. Default is 1 second.
+.TP
+.BI (exec)std_redirect\fR=\fPbool
+If set, stdout and stderr streams are redirected to files named from the job name. Default is true.
+.TP
+.BI (xnvme)xnvme_async\fR=\fPstr
+Select the xnvme async command interface. This can take these values.
+.RS
+.RS
+.TP
+.B emu
+This is the default and is used to emulate asynchronous I/O by using a single thread to
+create a queue pair on top of a synchronous I/O interface using the NVMe driver
+IOCTL.
+.TP
+.BI thrpool
+Emulate an asynchronous I/O interface with a pool of userspace threads on top
+of a synchronous I/O interface using the NVMe driver IOCTL. By default four
+threads are used.
+.TP
+.BI io_uring
+Linux native asynchronous I/O interface which supports both direct and buffered
+I/O.
+.TP
+.BI libaio
+Use Linux aio for asynchronous I/O.
+.TP
+.BI posix
+Use the posix asynchronous I/O interface to perform one or more I/O operations
+asynchronously.
+.TP
+.BI nil
+Do not transfer any data; just pretend to. This is mainly used for
+introspective performance evaluation.
+.RE
+.RE
+.TP
+.BI (xnvme)xnvme_sync\fR=\fPstr
+Select the xnvme synchronous command interface. This can take these values.
+.RS
+.RS
+.TP
+.B nvme
+This is the default and uses Linux NVMe Driver ioctl() for synchronous I/O.
+.TP
+.BI psync
+This supports regular as well as vectored pread() and pwrite() commands.
+.TP
+.BI block
+This is the same as psync except that it also supports zone management
+commands using Linux block layer IOCTLs.
+.RE
+.RE
+.TP
+.BI (xnvme)xnvme_admin\fR=\fPstr
+Select the xnvme admin command interface. This can take these values.
+.RS
+.RS
+.TP
+.B nvme
+This is the default and uses Linux NVMe Driver ioctl() for admin commands.
+.TP
+.BI block
+Use Linux Block Layer ioctl() and sysfs for admin commands.
+.RE
+.RE
+.TP
+.BI (xnvme)xnvme_dev_nsid\fR=\fPint
+xnvme namespace identifier for userspace NVMe driver such as SPDK.
+.TP
+.BI (xnvme)xnvme_iovec
+If this option is set, xnvme will use vectored read/write commands.
+.TP
+.BI (libblkio)libblkio_driver \fR=\fPstr
+The libblkio driver to use. Different drivers access devices through different
+underlying interfaces. Available drivers depend on the libblkio version in use
+and are listed at \fIhttps://libblkio.gitlab.io/libblkio/blkio.html#drivers\fR
+.TP
+.BI (libblkio)libblkio_pre_connect_props \fR=\fPstr
+A colon-separated list of libblkio properties to be set after creating but
+before connecting the libblkio instance. Each property must have the format
+\fB<name>=<value>\fR. Colons can be escaped as \fB\\:\fR. These are set after
+the engine sets any other properties, so those can be overridden. Available
+properties depend on the libblkio version in use and are listed at
+\fIhttps://libblkio.gitlab.io/libblkio/blkio.html#properties\fR
+.TP
+.BI (libblkio)libblkio_pre_start_props \fR=\fPstr
+A colon-separated list of libblkio properties to be set after connecting but
+before starting the libblkio instance. Each property must have the format
+\fB<name>=<value>\fR. Colons can be escaped as \fB\\:\fR. These are set after
+the engine sets any other properties, so those can be overridden. Available
+properties depend on the libblkio version in use and are listed at
+\fIhttps://libblkio.gitlab.io/libblkio/blkio.html#properties\fR
+.TP
+.BI (libblkio)hipri
+Use poll queues. This is incompatible with \fBlibblkio_wait_mode=eventfd\fR and
+\fBlibblkio_force_enable_completion_eventfd\fR.
+.TP
+.BI (libblkio)libblkio_vectored
+Submit vectored read and write requests.
+.TP
+.BI (libblkio)libblkio_write_zeroes_on_trim
+Submit trims as "write zeroes" requests instead of discard requests.
+.TP
+.BI (libblkio)libblkio_wait_mode \fR=\fPstr
+How to wait for completions:
+.RS
+.RS
+.TP
+.B block \fR(default)
+Use a blocking call to \fBblkioq_do_io()\fR.
+.TP
+.B eventfd
+Use a blocking call to \fBread()\fR on the completion eventfd.
+.TP
+.B loop
+Use a busy loop with a non-blocking call to \fBblkioq_do_io()\fR.
+.RE
+.RE
+.TP
+.BI (libblkio)libblkio_force_enable_completion_eventfd
+Enable the queue's completion eventfd even when unused. This may impact
+performance. The default is to enable it only if
+\fBlibblkio_wait_mode=eventfd\fR.
  .SS "I/O depth"
  .TP
  .BI iodepth \fR=\fPint
@@ -2402,7 +2766,7 @@ problem). Note that this option cannot reliably be used with async IO engines.
  Stall the job for the specified period of time after an I/O has completed before issuing the
  next. May be used to simulate processing being done by an application.
  When the unit is omitted, the value is interpreted in microseconds. See
-\fBthinktime_blocks\fR and \fBthinktime_spin\fR.
+\fBthinktime_blocks\fR, \fBthinktime_iotime\fR and \fBthinktime_spin\fR.
  .TP
  .BI thinktime_spin \fR=\fPtime
  Only valid if \fBthinktime\fR is set - pretend to spend CPU time doing
@@ -2423,6 +2787,17 @@ Only valid if \fBthinktime\fR is set - control how \fBthinktime_blocks\fR trigge
  The default is `complete', which triggers \fBthinktime\fR when fio completes
  \fBthinktime_blocks\fR blocks. If this is set to `issue', then the trigger happens
  at the issue side.
+.TP
+.BI thinktime_iotime \fR=\fPtime
+Only valid if \fBthinktime\fR is set - control \fBthinktime\fR interval by time.
+The \fBthinktime\fR stall is repeated after IOs are executed for
+\fBthinktime_iotime\fR. For example, `\-\-thinktime_iotime=9s \-\-thinktime=1s'
+repeats a 10-second cycle with IOs for 9 seconds and a stall for 1 second. When the
+unit is omitted, \fBthinktime_iotime\fR is interpreted as a number of seconds.
+If this option is used together with \fBthinktime_blocks\fR, the \fBthinktime\fR
+stall is repeated after \fBthinktime_iotime\fR or after \fBthinktime_blocks\fR
+IOs, whichever happens first.
+
  .TP
  .BI rate \fR=\fPint[,int][,int]
  Cap the bandwidth used by this job. The number is in bytes/sec, the normal
@@ -2506,7 +2881,8 @@ of milliseconds. Defaults to 1000.
  .BI write_iolog \fR=\fPstr
  Write the issued I/O patterns to the specified file. See
  \fBread_iolog\fR. Specify a separate file for each job, otherwise the
-iologs will be interspersed and the file may be corrupt.
+iologs will be interspersed and the file may be corrupt. This file will be
+opened in append mode.
  .TP
  .BI read_iolog \fR=\fPstr
  Open an iolog with the specified filename and replay the I/O patterns it
@@ -2624,13 +3000,13 @@ Set the I/O priority value of this job. Linux limits us to a positive value
  between 0 and 7, with 0 being the highest. See man
  \fBionice\fR\|(1). Refer to an appropriate manpage for other operating
  systems since meaning of priority may differ. For per-command priority
-setting, see I/O engine specific `cmdprio_percentage` and `hipri_percentage`
-options.
+setting, see the I/O engine specific `cmdprio_percentage` and
+`cmdprio` options.
  .TP
  .BI prioclass \fR=\fPint
  Set the I/O priority class. See man \fBionice\fR\|(1). For per-command
-priority setting, see I/O engine specific `cmdprio_percentage` and `hipri_percent`
-options.
+priority setting, see the I/O engine specific `cmdprio_percentage` and
+`cmdprio_class` options.
  .TP
  .BI cpus_allowed \fR=\fPstr
  Controls the same options as \fBcpumask\fR, but accepts a textual
@@ -2898,7 +3274,7 @@ the verify will be of the newly written data.
  To avoid false verification errors, do not use the norandommap option when
  verifying data with async I/O engines and I/O depths > 1.  Or use the
  norandommap and the lfsr random generator together to avoid writing to the
-same offset with muliple outstanding I/Os.
+same offset with multiple outstanding I/Os.
  .RE
  .TP
  .BI verify_offset \fR=\fPint
@@ -3009,7 +3385,9 @@ Verify that trim/discarded blocks are returned as zeros.
  Trim this number of I/O blocks.
  .TP
  .BI experimental_verify \fR=\fPbool
-Enable experimental verification.
+Enable experimental verification. Standard verify records I/O metadata for
+later use during the verification phase. Experimental verify instead resets the
+file after the write phase and then replays I/Os for the verification phase.
  .SS "Steady state"
  .TP
  .BI steadystate \fR=\fPstr:float "\fR,\fP ss" \fR=\fPstr:float
@@ -3136,6 +3514,17 @@ logging (see \fBlog_avg_msec\fR) has been enabled. See
  \fBwrite_bw_log\fR for details about the filename format and \fBLOG
  FILE FORMATS\fR for how data is structured within the file.
  .TP
+.BI log_entries \fR=\fPint
+By default, fio will log an entry in the iops, latency, or bw log for
+every I/O that completes. The initial number of I/O log entries is 1024.
+When the log entries are all used, new log entries are dynamically
+allocated.  This dynamic log entry allocation may negatively impact
+time-related statistics such as I/O tail latencies (e.g. 99.9th percentile
+completion latency). This option allows specifying a larger initial
+number of log entries to avoid run-time allocation of new log entries,
+resulting in more precise time-related I/O statistics.
+Also see \fBlog_avg_msec\fR. Defaults to 1024.
+.TP
  .BI log_avg_msec \fR=\fPint
  By default, fio will log an entry in the iops, latency, or bw log for every
  I/O that completes. When writing to the disk log, that can quickly grow to a
@@ -3169,6 +3558,11 @@ If this is set, the iolog options will include the byte offset for the I/O
  entry as well as the other data values. Defaults to 0 meaning that
  offsets are not present in logs. Also see \fBLOG FILE FORMATS\fR section.
  .TP
+.BI log_prio \fR=\fPbool
+If this is set, the iolog options will include the I/O priority for the I/O
+entry as well as the other data values. Defaults to 0 meaning that
+I/O priorities are not present in logs. Also see \fBLOG FILE FORMATS\fR section.
+.TP
  .BI log_compression \fR=\fPint
  If this is set, fio will compress the I/O logs as it goes, to keep the
  memory footprint lower. When a log reaches the specified size, that chunk is
@@ -3197,6 +3591,17 @@ If set, fio will log Unix timestamps to the log files produced by enabling
  write_type_log for each log type, instead of the default zero-based
  timestamps.
  .TP
+.BI log_alternate_epoch \fR=\fPbool
+If set, fio will log timestamps based on the epoch used by the clock specified
+in the \fBlog_alternate_epoch_clock_id\fR option, to the log files produced by
+enabling write_type_log for each log type, instead of the default zero-based
+timestamps.
+.TP
+.BI log_alternate_epoch_clock_id \fR=\fPint
+Specifies the clock_id to be used by clock_gettime to obtain the alternate epoch
+if either \fBlog_unix_epoch\fR or \fBlog_alternate_epoch\fR is true. Otherwise has no
+effect. Default value is 0, or CLOCK_REALTIME.
+.TP
  .BI block_error_percentiles \fR=\fPbool
  If set, record errors in trim block-sized units from writes and trims and
  output a histogram of how many trims it took to get to errors, and what kind
@@ -3274,6 +3679,16 @@ EILSEQ) until the runtime is exceeded or the I/O size specified is
  completed. If this option is used, there are two more stats that are
  appended, the total error count and the first error. The error field given
  in the stats is the first error that was hit during the run.
+.RS
+.P
+Note: a write error from the device may go unnoticed by fio when using buffered
+IO, as the write() (or similar) system call merely dirties the kernel pages,
+unless `sync' or `direct' is used. Device IO errors occur when the dirty data is
+actually written out to disk. If fully sync writes aren't desirable, `fsync' or
+`fdatasync' can be used as well. This is specific to writes, as reads are always
+synchronous.
+.RS
+.P
  The allowed values are:
  .RS
  .RS
@@ -3861,7 +4276,7 @@ This format is not supported in fio versions >= 1.20\-rc3.
  .TP
  .B Trace file format v2
  The second version of the trace file format was added in fio version 1.17. It
-allows to access more then one file per trace and has a bigger set of possible
+allows one to access more than one file per trace and has a bigger set of possible
  file actions.
  .RS
  .P
@@ -3906,7 +4321,9 @@ given in bytes. The `action' can be one of these:
  .TP
  .B wait
  Wait for `offset' microseconds. Everything below 100 is discarded.
-The time is relative to the previous `wait' statement.
+The time is relative to the previous `wait' statement. Note that action `wait`
+is not allowed as of version 3, as the same behavior can be achieved using
+timestamps.
  .TP
  .B read
  Read `length' bytes beginning from `offset'.
@@ -3924,6 +4341,37 @@ Write `length' bytes beginning from `offset'.
  Trim the given file from the given `offset' for `length' bytes.
  .RE
  .RE
+.RE
+.TP
+.B Trace file format v3
+The third version of the trace file format was added in fio version 3.31. It
+forces each action to have a timestamp associated with it.
+.RS
+.P
+The first line of the trace file has to be:
+.RS
+.P
+"fio version 3 iolog"
+.RE
+.P
+Following this can be lines in two different formats, which are described below.
+.P
+.B
+The file management format:
+.RS
+timestamp filename action
+.P
+.RE
+.B
+The file I/O action format:
+.RS
+timestamp filename action offset length
+.P
+The `timestamp` is relative to the beginning of the run (i.e. starts at 0). The
+`filename`, `action`, `offset` and `length` are identical to version 2, except
+that version 3 does not allow the `wait` action.
+.RE
+.RE
  .SH I/O REPLAY \- MERGING TRACES
  Colocation is a common practice used to get the most out of a machine.
  Knowing which workloads play nicely with each other and which ones don't is
@@ -4102,8 +4550,14 @@ The entry's `block size' is always in bytes. The `offset' is the position in byt
  from the start of the file for that particular I/O. The logging of the offset can be
  toggled with \fBlog_offset\fR.
  .P
-`Command priority` is 0 for normal priority and 1 for high priority. This is controlled
-by the ioengine specific \fBcmdprio_percentage\fR.
+If \fBlog_prio\fR is not set, the entry's `Command priority` is 1 for an IO executed
+with the highest RT priority class (\fBprioclass\fR=1 or \fBcmdprio_class\fR=1) and 0
+otherwise. This is controlled by the \fBprioclass\fR option and the ioengine specific
+\fBcmdprio_percentage\fR and \fBcmdprio_class\fR options. If \fBlog_prio\fR is set, the
+entry's `Command priority` is the priority set for the IO, as a 16-bit hexadecimal
+number with the lowest 13 bits indicating the priority value (\fBprio\fR and
+\fBcmdprio\fR options) and the highest 3 bits indicating the IO priority class
+(\fBprioclass\fR and \fBcmdprio_class\fR options).
  .P
  Fio defaults to logging every individual I/O but when windowed logging is set
  through \fBlog_avg_msec\fR, either the average (by default) or the maximum