zbd: engines/libzbc: don't fail on assert for offline zones

diff --git a/fio.1 b/fio.1

index a2379f9816c2b0103191ff15338d9180527bf896..aa248a3b6da62172a2ee0a27e5b345034ebfe7fd 100644
--- a/fio.1
+++ b/fio.1
@@ -738,12 +738,13 @@ Accepted values are:
  .RS
  .TP
  .B none
-The \fBzonerange\fR, \fBzonesize\fR and \fBzoneskip\fR parameters are ignored.
+The \fBzonerange\fR, \fBzonesize\fR, \fBzonecapacity\fR and \fBzoneskip\fR
+parameters are ignored.
  .TP
  .B strided
  I/O happens in a single zone until \fBzonesize\fR bytes have been transferred.
  After that number of bytes has been transferred processing of the next zone
-starts.
+starts. The \fBzonecapacity\fR parameter is ignored.
  .TP
  .B zbd
  Zoned block device mode. I/O happens sequentially in each zone, even if random
@@ -771,6 +772,14 @@ zoned block device, the specified \fBzonesize\fR must be 0 or equal to the
  device zone size. For a regular block device or file, the specified
  \fBzonesize\fR must be at least 512B.
  .TP
+.BI zonecapacity \fR=\fPint
+For \fBzonemode\fR=zbd, this defines the capacity of a single zone, which is
+the accessible area starting from the zone start address. This parameter only
+applies when using \fBzonemode\fR=zbd in combination with regular block devices.
+If not specified it defaults to the zone size. If the target device is a zoned
+block device, the zone capacity is obtained from the device information and this
+option is ignored.
+.TP
  .BI zoneskip \fR=\fPint
  For \fBzonemode\fR=strided, the number of bytes to skip after \fBzonesize\fR
  bytes of data have been transferred.
@@ -804,7 +813,11 @@ so. Default: false.
  When running a random write test across an entire drive many more zones will be
  open than in a typical application workload. Hence this command line option
  that allows to limit the number of open zones. The number of open zones is
-defined as the number of zones to which write commands are issued.
+defined as the number of zones to which write commands are issued by all
+threads/processes.
+.TP
+.BI job_max_open_zones \fR=\fPint
+Limit on the number of simultaneously opened zones per single thread/process.
  .TP
  .BI zone_reset_threshold \fR=\fPfloat
  A number between zero and one that indicates the ratio of logical blocks with
@@ -1119,7 +1132,7 @@ first. This may interfere with a given rate setting, if fio is asked to
  limit reads or writes to a certain rate. If that is the case, then the
  distribution may be skewed. Default: 50.
  .TP
-.BI random_distribution \fR=\fPstr:float[,str:float][,str:float]
+.BI random_distribution \fR=\fPstr:float[:float][,str:float][,str:float]
  By default, fio will use a completely uniform random distribution when asked
  to perform random I/O. Sometimes it is useful to skew the distribution in
  specific ways, ensuring that some parts of the data is more hot than others.
@@ -1155,6 +1168,14 @@ option. If a non\-uniform model is used, fio will disable use of the random
  map. For the \fBnormal\fR distribution, a normal (Gaussian) deviation is
  supplied as a value between 0 and 100.
  .P
+The second, optional float is allowed for \fBpareto\fR, \fBzipf\fR and \fBnormal\fR
+distributions. It sets the center of the distribution at a non-default position, giving
+more control over the most probable outcome. The value is in the range [0-1], which maps linearly
+onto the range of possible random values.
+Defaults are: random for \fBpareto\fR and \fBzipf\fR, and 0.5 for \fBnormal\fR.
+If you wanted to use \fBzipf\fR with a `theta` of 1.2 centered on 1/4 of the allowed value range,
+you would use `random_distribution=zipf:1.2:0.25`.
+.P
  For a \fBzoned\fR distribution, fio supports specifying percentages of I/O
  access that should fall within what range of the file or device. For
  example, given a criteria of:
@@ -1449,9 +1470,31 @@ starting I/O if the platform and file type support it. Defaults to true.
  This will be ignored if \fBpre_read\fR is also specified for the
  same job.
  .TP
-.BI sync \fR=\fPbool
-Use synchronous I/O for buffered writes. For the majority of I/O engines,
-this means using O_SYNC. Default: false.
+.BI sync \fR=\fPstr
+Whether, and what type, of synchronous I/O to use for writes.  The allowed
+values are:
+.RS
+.RS
+.TP
+.B none
+Do not use synchronous IO. This is the default.
+.TP
+.B 0
+Same as \fBnone\fR.
+.TP
+.B sync
+Use synchronous file IO. For the majority of I/O engines,
+this means using O_SYNC.
+.TP
+.B 1
+Same as \fBsync\fR.
+.TP
+.B dsync
+Use synchronous data IO. For the majority of I/O engines,
+this means using O_DSYNC.
+.PD
+.RE
+.RE
  .TP
  .BI iomem \fR=\fPstr "\fR,\fP mem" \fR=\fPstr
  Fio can use various types of memory as the I/O unit buffer. The allowed
@@ -1548,7 +1591,8 @@ if \fBsize\fR is set to 20GiB and \fBio_size\fR is set to 5GiB, fio
  will perform I/O within the first 20GiB but exit when 5GiB have been
  done. The opposite is also possible \-\- if \fBsize\fR is set to 20GiB,
  and \fBio_size\fR is set to 40GiB, then fio will do 40GiB of I/O within
-the 0..20GiB region.
+the 0..20GiB region. The value can also be given as a percentage: \fBio_size\fR=N%.
+In this case \fBio_size\fR is that percentage of the \fBsize\fR value.
  .TP
  .BI filesize \fR=\fPirange(int)
  Individual file sizes. May be a range, in which case fio will select sizes
@@ -1654,17 +1698,21 @@ This engine defines engine specific options.
  .TP
  .B cpuio
  Doesn't transfer any data, but burns CPU cycles according to the
-\fBcpuload\fR and \fBcpuchunks\fR options. Setting
-\fBcpuload\fR\=85 will cause that job to do nothing but burn 85%
-of the CPU. In case of SMP machines, use `numjobs=<nr_of_cpu>'
-to get desired CPU usage, as the cpuload only loads a
-single CPU at the desired rate. A job never finishes unless there is
-at least one non-cpuio job.
-.TP
-.B guasi
-The GUASI I/O engine is the Generic Userspace Asynchronous Syscall
-Interface approach to async I/O. See \fIhttp://www.xmailserver.org/guasi-lib.html\fR
-for more info on GUASI.
+\fBcpuload\fR, \fBcpuchunks\fR and \fBcpumode\fR options.
+A job never finishes unless there is at least one non-cpuio job.
+.RS
+.P
+.PD 0
+\fBcpuload\fR\=85 will cause that job to do nothing but burn 85% of the CPU.
+In case of SMP machines, use \fBnumjobs=<nr_of_cpu>\fR\ to get desired CPU usage,
+as the cpuload only loads a single CPU at the desired rate.
+
+.P
+\fBcpumode\fR\=qsort replaces the default noop instructions loop
+with a qsort algorithm to consume more energy.
+
+.P
+.RE
  .TP
  .B rdma
  The RDMA I/O engine supports both RDMA memory semantics
@@ -1795,6 +1843,13 @@ Read and write iscsi lun with libiscsi.
  .TP
  .B nbd
  Synchronous read and write a Network Block Device (NBD).
+.TP
+.B libcufile
+I/O engine supporting libcufile synchronous access to nvidia-fs and a
+GPUDirect Storage-supported filesystem. This engine performs
+I/O without transferring buffers between user-space and the kernel,
+unless \fBverify\fR is set or \fBcuda_io\fR is \fBposix\fR. \fBiomem\fR must
+not be \fBcudamalloc\fR. This ioengine defines engine specific options.
  .SS "I/O engine specific parameters"
  In addition, there are some parameters which are only valid when a specific
  \fBioengine\fR is in use. These are used identically to normal parameters,
@@ -1806,7 +1861,8 @@ Set the percentage of I/O that will be issued with higher priority by setting
  the priority bit. Non-read I/O is likely unaffected by ``cmdprio_percentage``.
  This option cannot be used with the `prio` or `prioclass` options. For this
  option to set the priority bit properly, NCQ priority must be supported and
-enabled and `direct=1' option must be used.
+enabled and `direct=1' option must be used. fio must also be run as the root
+user.
  .TP
  .BI (io_uring)fixedbufs
  If fio is asked to do direct IO, then Linux will map pages for each IO call, and
@@ -1852,6 +1908,22 @@ than normal.
  When hipri is set this determines the probability of a pvsync2 I/O being high
  priority. The default is 100%.
  .TP
+.BI (pvsync2,libaio,io_uring)nowait
+By default if a request cannot be executed immediately (e.g. resource starvation,
+waiting on locks) it is queued and the initiating process will be blocked until
+the required resource becomes free.
+This option sets the RWF_NOWAIT flag (supported from the 4.14 Linux kernel) and
+the call will return instantly with EAGAIN or a partial result rather than waiting.
+
+It is useful to also use \fBignore_error\fR=EAGAIN when using this option.
+Note: glibc 2.27 and 2.28 have a bug in the preadv2 and pwritev2 syscall wrappers.
+They return EOPNOTSUPP instead of EAGAIN.
+
+For cached I/O, using this option usually means a request operates only with
+cached data. Currently the RWF_NOWAIT flag is not supported for cached writes.
+For direct I/O, requests will only succeed if cache invalidation isn't required,
+file blocks are fully allocated and the disk request could be issued immediately.
+.TP
  .BI (cpuio)cpuload \fR=\fPint
  Attempt to use the specified percentage of CPU cycles. This is a mandatory
  option when using cpuio I/O engine.
@@ -2042,6 +2114,16 @@ client and the server or in certain loopback configurations.
  Specify stat system call type to measure lookup/getattr performance.
  Default is \fBstat\fR for \fBstat\fR\|(2).
  .TP
+.BI (sg)hipri
+If this option is set, fio will attempt to use polled IO completions. This
+will have a similar effect as (io_uring)hipri. Only SCSI READ and WRITE
+commands will have the SGV4_FLAG_HIPRI set (not UNMAP (trim) nor VERIFY).
+Older versions of the Linux sg driver that do not support hipri will simply
+ignore this flag and do normal IO. The Linux SCSI Low Level Driver (LLD)
+that "owns" the device also needs to support hipri (also known as iopoll
+and mq_poll). The MegaRAID driver is an example of a SCSI LLD.
+Default: clear (0), which does normal (interrupt-based) IO.
+.TP
  .BI (sg)readfua \fR=\fPbool
  With readfua option set to 1, read operations include the force
  unit access (fua) flag. Default: 0.
@@ -2091,7 +2173,36 @@ Example URIs:
  \fInbd+unix:///?socket=/tmp/socket\fR
  .TP
  \fInbds://tlshost/exportname\fR
-
+.RE
+.RE
+.TP
+.BI (libcufile)gpu_dev_ids\fR=\fPstr
+Specify the GPU IDs to use with CUDA. This is a colon-separated list of int.
+GPUs are assigned to workers round-robin. Default is 0.
+.TP
+.BI (libcufile)cuda_io\fR=\fPstr
+Specify the type of I/O to use with CUDA. This option
+takes the following values:
+.RS
+.RS
+.TP
+.B cufile (default)
+Use libcufile and nvidia-fs. This option performs I/O directly
+between a GPUDirect Storage filesystem and GPU buffers,
+avoiding use of a bounce buffer. If \fBverify\fR is set,
+cudaMemcpy is used to copy verification data between RAM and GPU(s).
+Verification data is copied from RAM to GPU before a write
+and from GPU to RAM after a read.
+\fBdirect\fR must be 1.
+.TP
+.BI posix
+Use POSIX to perform I/O with a RAM buffer, and use
+cudaMemcpy to transfer data between RAM and the GPU(s).
+Data is copied from GPU to RAM before a write and copied
+from RAM to GPU after a read. \fBverify\fR does not affect
+the use of cudaMemcpy.
+.RE
+.RE
  .SS "I/O depth"
  .TP
  .BI iodepth \fR=\fPint
@@ -2189,7 +2300,7 @@ has a bit of extra overhead, especially for lower queue depth I/O where it
  can increase latencies. The benefit is that fio can manage submission rates
  independently of the device completion rates. This avoids skewed latency
  reporting if I/O gets backed up on the device side (the coordinated omission
-problem).
+problem). Note that this option cannot reliably be used with async IO engines.
  .SS "I/O rate"
  .TP
  .BI thinktime \fR=\fPtime
@@ -2212,6 +2323,12 @@ queue depth setting redundant, since no more than 1 I/O will be queued
  before we have to complete it and do our \fBthinktime\fR. In other words, this
  setting effectively caps the queue depth if the latter is larger.
  .TP
+.BI thinktime_blocks_type \fR=\fPstr
+Only valid if \fBthinktime\fR is set. Controls how \fBthinktime_blocks\fR triggers.
+The default is `complete', which triggers \fBthinktime\fR when fio completes
+\fBthinktime_blocks\fR blocks. If this is set to `issue', then the trigger happens
+at the issue side.
+.TP
  .BI rate \fR=\fPint[,int][,int]
  Cap the bandwidth used by this job. The number is in bytes/sec, the normal
  suffix rules apply. Comma-separated values may be specified for reads,
@@ -2275,6 +2392,11 @@ The percentage of I/Os that must fall within the criteria specified by
  defaults to 100.0, meaning that all I/Os must be equal or below to the value
  set by \fBlatency_target\fR.
  .TP
+.BI latency_run \fR=\fPbool
+Used with \fBlatency_target\fR. If false (default), fio will find the highest
+queue depth that meets \fBlatency_target\fR and exit. If true, fio will continue
+running and try to meet \fBlatency_target\fR by adjusting queue depth.
+.TP
  .BI max_latency \fR=\fPtime
  If set, fio will exit the job with an ETIMEDOUT error if it exceeds this
  maximum latency. When the unit is omitted, the value is interpreted in
@@ -2301,7 +2423,9 @@ replay, the file needs to be turned into a blkparse binary data file first
  You can specify a number of files by separating the names with a ':' character.
  See the \fBfilename\fR option for information on how to escape ':'
  characters within the file names. These files will be sequentially assigned to
-job clones created by \fBnumjobs\fR.
+job clones created by \fBnumjobs\fR. '-' is a reserved name, meaning read from
+stdin. Note that if \fBfilename\fR is also set to '-' (which likewise means stdin),
+then this option cannot be set to '-'.
  .TP
  .BI read_iolog_chunked \fR=\fPbool
  Determines how iolog is read. If false (default) entire \fBread_iolog\fR will
@@ -2512,27 +2636,25 @@ The ID of the flow. If not specified, it defaults to being a global
  flow. See \fBflow\fR.
  .TP
  .BI flow \fR=\fPint
-Weight in token-based flow control. If this value is used, then there is
-a 'flow counter' which is used to regulate the proportion of activity between
-two or more jobs. Fio attempts to keep this flow counter near zero. The
-\fBflow\fR parameter stands for how much should be added or subtracted to the
-flow counter on each iteration of the main I/O loop. That is, if one job has
-`flow=8' and another job has `flow=\-1', then there will be a roughly 1:8
-ratio in how much one runs vs the other.
-.TP
-.BI flow_watermark \fR=\fPint
-The maximum value that the absolute value of the flow counter is allowed to
-reach before the job must wait for a lower value of the counter.
+Weight in token-based flow control. If this value is used,
+then fio regulates the activity between two or more jobs
+sharing the same flow_id.
+Fio attempts to keep each job activity proportional to other jobs' activities
+in the same flow_id group, with respect to requested weight per job.
+That is, if one job has `flow=3', another job has `flow=2'
+and a third has `flow=1', then there will be a roughly 3:2:1 ratio
+in how much one runs vs the others.
  .TP
  .BI flow_sleep \fR=\fPint
-The period of time, in microseconds, to wait after the flow watermark has
-been exceeded before retrying operations.
+The period of time, in microseconds, to wait after the flow counter
+has exceeded its proportion before retrying operations.
  .TP
  .BI stonewall "\fR,\fB wait_for_previous"
  Wait for preceding jobs in the job file to exit, before starting this
  one. Can be used to insert serialization points in the job file. A stone
  wall also implies starting a new reporting group, see
-\fBgroup_reporting\fR.
+\fBgroup_reporting\fR. Optionally you can use `stonewall=0` to disable or
+`stonewall=1` to enable it.
  .TP
  .BI exitall
  By default, fio will continue running all other jobs when one job finishes.
@@ -2540,15 +2662,27 @@ Sometimes this is not the desired action. Setting \fBexitall\fR will instead
  make fio terminate all jobs in the same group, as soon as one job of that
  group finishes.
  .TP
-.BI exit_what
+.BI exit_what \fR=\fPstr
  By default, fio will continue running all other jobs when one job finishes.
-Sometimes this is not the desired action. Setting \fBexit_all\fR will instead
+Sometimes this is not the desired action. Setting \fBexitall\fR will instead
  make fio terminate all jobs in the same group. The option \fBexit_what\fR
-allows to control which jobs get terminated when \fBexitall\fR is enabled. The
-default is \fBgroup\fR and does not change the behaviour of \fBexitall\fR. The
-setting \fBall\fR terminates all jobs. The setting \fBstonewall\fR terminates
-all currently running jobs across all groups and continues execution with the
-next stonewalled group.
+allows you to control which jobs get terminated when \fBexitall\fR is enabled.
+The default value is \fBgroup\fR.
+The allowed values are:
+.RS
+.RS
+.TP
+.B all
+terminates all jobs.
+.TP
+.B group
+is the default and does not change the behaviour of \fBexitall\fR.
+.TP
+.B stonewall
+terminates all currently running jobs across all groups and continues
+execution with the next stonewalled group.
+.RE
+.RE
  .TP
  .BI exec_prerun \fR=\fPstr
  Before running this job, issue the command specified through
@@ -3577,7 +3711,7 @@ Below is a single line containing short names for each of the fields in the
  minimal output v3, separated by semicolons:
  .P
  .nf
-               terse_version_3;fio_version;jobname;groupid;error;read_kb;read_bandwidth;read_iops;read_runtime_ms;read_slat_min;read_slat_max;read_slat_mean;read_slat_dev;read_clat_min;read_clat_max;read_clat_mean;read_clat_dev;read_clat_pct01;read_clat_pct02;read_clat_pct03;read_clat_pct04;read_clat_pct05;read_clat_pct06;read_clat_pct07;read_clat_pct08;read_clat_pct09;read_clat_pct10;read_clat_pct11;read_clat_pct12;read_clat_pct13;read_clat_pct14;read_clat_pct15;read_clat_pct16;read_clat_pct17;read_clat_pct18;read_clat_pct19;read_clat_pct20;read_tlat_min;read_lat_max;read_lat_mean;read_lat_dev;read_bw_min;read_bw_max;read_bw_agg_pct;read_bw_mean;read_bw_dev;write_kb;write_bandwidth;write_iops;write_runtime_ms;write_slat_min;write_slat_max;write_slat_mean;write_slat_dev;write_clat_min;write_clat_max;write_clat_mean;write_clat_dev;write_clat_pct01;write_clat_pct02;write_clat_pct03;write_clat_pct04;write_clat_pct05;write_clat_pct06;write_clat_pct07;write_clat_pct08;write_clat_pct09;write_clat_pct10;write_clat_pct11;write_clat_pct12;write_clat_pct13;write_clat_pct14;write_clat_pct15;write_clat_pct16;write_clat_pct17;write_clat_pct18;write_clat_pct19;write_clat_pct20;write_tlat_min;write_lat_max;write_lat_mean;write_lat_dev;write_bw_min;write_bw_max;write_bw_agg_pct;write_bw_mean;write_bw_dev;cpu_user;cpu_sys;cpu_csw;cpu_mjf;cpu_minf;iodepth_1;iodepth_2;iodepth_4;iodepth_8;iodepth_16;iodepth_32;iodepth_64;lat_2us;lat_4us;lat_10us;lat_20us;lat_50us;lat_100us;lat_250us;lat_500us;lat_750us;lat_1000us;lat_2ms;lat_4ms;lat_10ms;lat_20ms;lat_50ms;lat_100ms;lat_250ms;lat_500ms;lat_750ms;lat_1000ms;lat_2000ms;lat_over_2000ms;disk_name;disk_read_iops;disk_write_iops;disk_read_merges;disk_write_merges;disk_read_ticks;write_ticks;disk_queue_time;disk_util
+               terse_version_3;fio_version;jobname;groupid;error;read_kb;read_bandwidth_kb;read_iops;read_runtime_ms;read_slat_min_us;read_slat_max_us;read_slat_mean_us;read_slat_dev_us;read_clat_min_us;read_clat_max_us;read_clat_mean_us;read_clat_dev_us;read_clat_pct01;read_clat_pct02;read_clat_pct03;read_clat_pct04;read_clat_pct05;read_clat_pct06;read_clat_pct07;read_clat_pct08;read_clat_pct09;read_clat_pct10;read_clat_pct11;read_clat_pct12;read_clat_pct13;read_clat_pct14;read_clat_pct15;read_clat_pct16;read_clat_pct17;read_clat_pct18;read_clat_pct19;read_clat_pct20;read_tlat_min_us;read_lat_max_us;read_lat_mean_us;read_lat_dev_us;read_bw_min_kb;read_bw_max_kb;read_bw_agg_pct;read_bw_mean_kb;read_bw_dev_kb;write_kb;write_bandwidth_kb;write_iops;write_runtime_ms;write_slat_min_us;write_slat_max_us;write_slat_mean_us;write_slat_dev_us;write_clat_min_us;write_clat_max_us;write_clat_mean_us;write_clat_dev_us;write_clat_pct01;write_clat_pct02;write_clat_pct03;write_clat_pct04;write_clat_pct05;write_clat_pct06;write_clat_pct07;write_clat_pct08;write_clat_pct09;write_clat_pct10;write_clat_pct11;write_clat_pct12;write_clat_pct13;write_clat_pct14;write_clat_pct15;write_clat_pct16;write_clat_pct17;write_clat_pct18;write_clat_pct19;write_clat_pct20;write_tlat_min_us;write_lat_max_us;write_lat_mean_us;write_lat_dev_us;write_bw_min_kb;write_bw_max_kb;write_bw_agg_pct;write_bw_mean_kb;write_bw_dev_kb;cpu_user;cpu_sys;cpu_csw;cpu_mjf;cpu_minf;iodepth_1;iodepth_2;iodepth_4;iodepth_8;iodepth_16;iodepth_32;iodepth_64;lat_2us;lat_4us;lat_10us;lat_20us;lat_50us;lat_100us;lat_250us;lat_500us;lat_750us;lat_1000us;lat_2ms;lat_4ms;lat_10ms;lat_20ms;lat_50ms;lat_100ms;lat_250ms;lat_500ms;lat_750ms;lat_1000ms;lat_2000ms;lat_over_2000ms;disk_name;disk_read_iops;disk_write_iops;disk_read_merges;disk_write_merges;disk_read_ticks;write_ticks;disk_queue_time;disk_util
  .fi
  .P
  In client/server mode terse output differs from what appears when jobs are run
@@ -3837,7 +3971,8 @@ Fio supports a variety of log file formats, for logging latencies, bandwidth,
  and IOPS. The logs share a common format, which looks like this:
  .RS
  .P
-time (msec), value, data direction, block size (bytes), offset (bytes)
+time (msec), value, data direction, block size (bytes), offset (bytes),
+command priority
  .RE
  .P
  `Time' for the log entry is always in milliseconds. The `value' logged depends
@@ -3871,6 +4006,9 @@ The entry's `block size' is always in bytes. The `offset' is the position in byt
  from the start of the file for that particular I/O. The logging of the offset can be
  toggled with \fBlog_offset\fR.
  .P
+`Command priority' is 0 for normal priority and 1 for high priority. This is controlled
+by the ioengine specific \fBcmdprio_percentage\fR.
+.P
  Fio defaults to logging every individual I/O but when windowed logging is set
  through \fBlog_avg_msec\fR, either the average (by default) or the maximum
  (\fBlog_max_value\fR is set) `value' seen over the specified period of time