doc: clarify what --alloc-size does

diff --git a/HOWTO b/HOWTO

index 1bec8064fb49b5eea00ce5482c02f6e085681691..4fef1504a410a205c95c7c85b489f0068bceb1a5 100644
--- a/HOWTO
+++ b/HOWTO
@@ -93,6 +93,12 @@ Command line options
                         Dump info related to I/O rate switching.
         *compress*
                         Dump info related to log compress/decompress.
+       *steadystate*
+                       Dump info related to steadystate detection.
+       *helperthread*
+                       Dump info related to the helper thread.
+       *zbd*
+                       Dump info related to support for zoned block devices.
         *?* or *help*
                         Show available debug options.
  
@@ -100,6 +106,10 @@ Command line options
  
         Parse options only, don't start any I/O.
  
+.. option:: --merge-blktrace-only
+
+       Merge blktraces only, don't start any I/O.
+
  .. option:: --output=filename
  
         Write output to file `filename`.
@@ -194,7 +204,10 @@ Command line options
         Force a full status dump of cumulative (from job start) values at `time`
         intervals. This option does *not* provide per-period measurements. So
         values such as bandwidth are running averages. When the time unit is omitted,
-       `time` is interpreted in seconds.
+       `time` is interpreted in seconds. Note that using this option with
+       ``--output-format=json`` will yield output that technically isn't valid
+       JSON, since the output will be collated sets of valid JSON. It will need
+       to be split into valid sets of JSON after the run.
  
  .. option:: --section=name
  
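A note on the ``--status-interval`` hunk above: splitting the collated output after the run can be sketched in a few lines of Python. This is a minimal sketch, assuming the captured output consists only of JSON documents separated by whitespace; the function name ``split_json_stream`` and the sample data are illustrative and not part of fio.

```python
import json


def split_json_stream(text):
    """Split a string holding several concatenated JSON documents.

    Assumption: the input contains nothing but JSON documents separated
    by optional whitespace (no log lines mixed in).
    """
    decoder = json.JSONDecoder()
    docs = []
    pos = 0
    while pos < len(text):
        # raw_decode() does not accept leading whitespace, so skip it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        # raw_decode() returns the parsed value and the index where it ended.
        doc, pos = decoder.raw_decode(text, pos)
        docs.append(doc)
    return docs


if __name__ == "__main__":
    # Made-up sample standing in for two status dumps.
    sample = '{"dump": 1}\n{"dump": 2}\n'
    for report in split_json_stream(sample):
        print(report["dump"])
```

If the output file contains anything other than JSON documents, ``raw_decode`` raises ``json.JSONDecodeError`` and the input would need to be filtered first.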
@@ -209,8 +222,8 @@ Command line options
  
  .. option:: --alloc-size=kb
  
-       Set the internal smalloc pool size to `kb` in KiB.  The
-       ``--alloc-size`` switch allows one to use a larger pool size for smalloc.
+       Allocate additional internal smalloc pools of size `kb` in KiB.  The
+       ``--alloc-size`` option increases shared memory set aside for use by fio.
         If running large jobs with randommap enabled, fio can run out of memory.
         Smalloc is an internal allocator for shared structures from a fixed size
         memory pool and can grow to 16 pools. The pool size defaults to 16MiB.
@@ -952,24 +965,92 @@ Target file/device
  
         Unlink job files after each iteration or loop.  Default: false.
  
+.. option:: zonemode=str
+
+       Accepted values are:
+
+               **none**
+                               The :option:`zonerange`, :option:`zonesize` and
+                               :option:`zoneskip` parameters are ignored.
+               **strided**
+                               I/O happens in a single zone until
+                               :option:`zonesize` bytes have been transferred.
+                               After that number of bytes has been
+                               transferred processing of the next zone
+                               starts.
+               **zbd**
+                               Zoned block device mode. I/O happens
+                               sequentially in each zone, even if random I/O
+                               has been selected. Random I/O happens across
+                               all zones instead of being restricted to a
+                               single zone. The :option:`zoneskip` parameter
+                               is ignored. :option:`zonerange` and
+                               :option:`zonesize` must be identical.
+
  .. option:: zonerange=int
  
-       Size of a single zone in which I/O occurs. See also :option:`zonesize`
-       and :option:`zoneskip`.
+       Size of a single zone. See also :option:`zonesize` and
+       :option:`zoneskip`.
  
  .. option:: zonesize=int
  
-       Number of bytes to transfer before skipping :option:`zoneskip`
-       bytes. If this parameter is smaller than :option:`zonerange` then only
-       a fraction of each zone with :option:`zonerange` bytes will be
-       accessed.  If this parameter is larger than :option:`zonerange` then
-       each zone will be accessed multiple times before skipping
+       For :option:`zonemode`\=strided, this is the number of bytes to
+       transfer before skipping :option:`zoneskip` bytes. If this parameter
+       is smaller than :option:`zonerange` then only a fraction of each zone
+       with :option:`zonerange` bytes will be accessed.  If this parameter is
+       larger than :option:`zonerange` then each zone will be accessed
+       multiple times before skipping to the next zone.
+
+       For :option:`zonemode`\=zbd, this is the size of a single zone. The
+       :option:`zonerange` parameter is ignored in this mode.
  
  .. option:: zoneskip=int
  
-       Skip the specified number of bytes when :option:`zonesize` data have
-       been transferred. The three zone options can be used to do strided I/O
-       on a file.
+       For :option:`zonemode`\=strided, the number of bytes to skip after
+       :option:`zonesize` bytes of data have been transferred. This parameter
+       must be zero for :option:`zonemode`\=zbd.
+
+.. option:: read_beyond_wp=bool
+
+       This parameter applies to :option:`zonemode`\=zbd only.
+
+       Zoned block devices are block devices that consist of multiple zones.
+       Each zone has a type, e.g. conventional or sequential. A conventional
+       zone can be written at any offset that is a multiple of the block
+       size. Sequential zones must be written sequentially. The position at
+       which a write must occur is called the write pointer. A zoned block
+       device can be either drive managed, host managed or host aware. For
+       host managed devices the host must ensure that writes happen
+       sequentially. Fio recognizes host managed devices and serializes
+       writes to sequential zones for these devices.
+
+       If a read occurs in a sequential zone beyond the write pointer then
+       the zoned block device will complete the read without reading any data
+       from the storage medium. Since such reads lead to unrealistically high
+       bandwidth and IOPS numbers, fio only reads beyond the write pointer if
+       explicitly told to do so. Default: false.
+
+.. option:: max_open_zones=int
+
+       When running a random write test across an entire drive many more
+       zones will be open than in a typical application workload. This
+       option therefore allows limiting the number of open zones. The
+       number of open zones is defined as the number of zones to which write
+       commands are issued.
+
+.. option:: zone_reset_threshold=float
+
+       A number between zero and one that indicates the ratio of logical
+       blocks with data to the total number of logical blocks in the test
+       above which zones should be reset periodically.
+
+.. option:: zone_reset_frequency=float
+
+       A number between zero and one that indicates how often a zone reset
+       should be issued if the zone reset threshold has been exceeded. A zone
+       reset is submitted after each (1 / zone_reset_frequency) write
+       requests. This and the previous parameter can be used to simulate
+       garbage collection activity.
  
  
  I/O type
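A note on the ``zone_reset_threshold`` and ``zone_reset_frequency`` descriptions in the hunk above: the arithmetic they describe can be illustrated with a small Python sketch. This models only the wording of the documentation, not fio's implementation; both function names are hypothetical, and the helper assumes a frequency strictly greater than zero.

```python
def writes_per_reset(zone_reset_frequency):
    """Write requests between zone resets, using the 1 / frequency rule."""
    # Assumption: a frequency of zero (no resets) is not handled here.
    if not 0 < zone_reset_frequency <= 1:
        raise ValueError("zone_reset_frequency must be in (0, 1]")
    return round(1 / zone_reset_frequency)


def above_threshold(blocks_with_data, total_blocks, zone_reset_threshold):
    """True when the ratio of blocks holding data exceeds the threshold."""
    return blocks_with_data / total_blocks > zone_reset_threshold


if __name__ == "__main__":
    # With a frequency of 0.25, one reset is issued per four write requests.
    print(writes_per_reset(0.25))
    # 600 of 1000 blocks hold data, which is above a threshold of 0.5.
    print(above_threshold(600, 1000, 0.5))
```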
@@ -1171,7 +1252,9 @@ I/O type
         is incremented for each sub-job (i.e. when :option:`numjobs` option is
         specified). This option is useful if there are several jobs which are
         intended to operate on a file in parallel disjoint segments, with even
-       spacing between the starting points.
+       spacing between the starting points. Percentages can be used for this option.
+       If a percentage is given, the generated offset will be aligned to the minimum
+       ``blocksize`` or to the value of ``offset_align`` if provided.
  
  .. option:: number_ios=int
  
@@ -1730,6 +1813,11 @@ I/O engine
                 **pvsync2**
                         Basic :manpage:`preadv2(2)` or :manpage:`pwritev2(2)` I/O.
  
+               **io_uring**
+                       Fast Linux native asynchronous I/O. Supports async IO
+                       for both direct and buffered IO.
+                       This engine defines engine specific options.
+
                 **libaio**
                         Linux native asynchronous I/O. Note that Linux may only support
                         queued behavior with non-buffered I/O (set ``direct=1`` or
@@ -1835,6 +1923,15 @@ I/O engine
                         (RBD) via librbd without the need to use the kernel rbd driver. This
                         ioengine defines engine specific options.
  
+               **http**
+                       I/O engine supporting GET/PUT requests over HTTP(S) with libcurl to
+                       a WebDAV or S3 endpoint.  This ioengine defines engine specific options.
+
+                       This engine only supports direct IO of iodepth=1; you need to scale this
+                       via numjobs. blocksize defines the size of the objects to be created.
+
+                       TRIM is translated to object deletion.
+
                 **gfapi**
                         Using GlusterFS libgfapi sync interface to direct access to
                         GlusterFS volumes without having to go through FUSE.  This ioengine
@@ -1892,6 +1989,26 @@ I/O engine
                         mounted with DAX on a persistent memory device through the PMDK
                         libpmem library.
  
+               **ime_psync**
+                       Synchronous read and write using DDN's Infinite Memory Engine (IME).
+                       This engine is very basic and issues calls to IME whenever an IO is
+                       queued.
+
+               **ime_psyncv**
+                       Synchronous read and write using DDN's Infinite Memory Engine (IME).
+                       This engine uses iovecs and will try to stack as many IOs as possible
+                       (if the IOs are "contiguous" and the IO depth is not exceeded)
+                       before issuing a call to IME.
+
+               **ime_aio**
+                       Asynchronous read and write using DDN's Infinite Memory Engine (IME).
+                       This engine will try to stack as many IOs as possible by creating
+                       requests for IME. FIO will then decide when to commit these requests.
+               **libiscsi**
+                       Read and write an iSCSI LUN with libiscsi.
+               **nbd**
+                       Read and write a Network Block Device (NBD).
+
  I/O engine specific parameters
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
@@ -1900,6 +2017,41 @@ In addition, there are some parameters which are only valid when a specific
  with the caveat that when used on the command line, they must come after the
  :option:`ioengine` that defines them is selected.
  
+.. option:: hipri : [io_uring]
+
+       If this option is set, fio will attempt to use polled IO completions.
+       Normal IO completions generate interrupts to signal the completion of
+       IO; polled completions do not. Hence they require active reaping
+       by the application. The benefits are more efficient IO for high IOPS
+       scenarios, and lower latencies for low queue depth IO.
+
+.. option:: fixedbufs : [io_uring]
+
+       If fio is asked to do direct IO, then Linux will map pages for each
+       IO call, and release them when IO is done. If this option is set, the
+       pages are pre-mapped before IO is started. This eliminates the need to
+       map and release for each IO. This is more efficient, and reduces the
+       IO latency as well.
+
+.. option:: registerfiles : [io_uring]
+       With this option, fio registers the set of files being used with the
+       kernel. This avoids the overhead of managing file counts in the kernel,
+       making the submission and completion part more lightweight. Required
+       for the below :option:`sqthread_poll` option.
+
+.. option:: sqthread_poll : [io_uring]
+
+       Normally fio will submit IO by issuing a system call to notify the
+       kernel of available items in the SQ ring. If this option is set, the
+       act of submitting IO will be done by a polling thread in the kernel.
+       This frees up cycles for fio, at the cost of using more CPU in the
+       system.
+
+.. option:: sqthread_poll_cpu : [io_uring]
+
+       When :option:`sqthread_poll` is set, this option provides a way to
+       define which CPU should be used for the polling thread.
+
  .. option:: userspace_reap : [libaio]
  
         Normally, with the libaio engine in use, fio will use the
@@ -2115,6 +2267,63 @@ with the caveat that when used on the command line, they must come after the
                 transferred to the device. The writefua option is ignored with this
                 selection.
  
+.. option:: http_host=str : [http]
+
+       Hostname to connect to. For S3, this could be the bucket hostname.
+       Default is **localhost**
+
+.. option:: http_user=str : [http]
+
+       Username for HTTP authentication.
+
+.. option:: http_pass=str : [http]
+
+       Password for HTTP authentication.
+
+.. option:: https=str : [http]
+
+       Enable HTTPS instead of HTTP. *on* enables HTTPS; *insecure*
+       will enable HTTPS, but disable SSL peer verification (use with
+       caution!). Default is **off**
+
+.. option:: http_mode=str : [http]
+
+       Which HTTP access mode to use: *webdav*, *swift*, or *s3*.
+       Default is **webdav**
+
+.. option:: http_s3_region=str : [http]
+
+       The S3 region/zone string.
+       Default is **us-east-1**
+
+.. option:: http_s3_key=str : [http]
+
+       The S3 secret key.
+
+.. option:: http_s3_keyid=str : [http]
+
+       The S3 key/access id.
+
+.. option:: http_swift_auth_token=str : [http]
+
+       The Swift auth token. See the example configuration file on how
+       to retrieve this.
+
+.. option:: http_verbose=int : [http]
+
+       Enable verbose requests from libcurl. Useful for debugging. 1
+       turns on verbose logging from libcurl, 2 additionally enables
+       HTTP IO tracing. Default is **0**
+
+.. option:: uri=str : [nbd]
+
+       Specify the NBD URI of the server to test.  The string
+       is a standard NBD URI
+       (see https://github.com/NetworkBlockDevice/nbd/tree/master/doc).
+       Example URIs: ``nbd://localhost:10809``,
+       ``nbd+unix:///?socket=/tmp/socket``,
+       ``nbds://tlshost/exportname``
+
  I/O depth
  ~~~~~~~~~
  
@@ -2191,8 +2400,13 @@ I/O depth
         ``serialize_overlap`` tells fio to avoid provoking this behavior by explicitly
         serializing in-flight I/Os that have a non-zero overlap. Note that setting
         this option can reduce both performance and the :option:`iodepth` achieved.
-       Additionally this option does not work when :option:`io_submit_mode` is set to
-       offload. Default: false.
+
+       This option only applies to I/Os issued for a single job except when it is
+       enabled along with :option:`io_submit_mode`\=offload. In offload mode, fio
+       will check for overlap among all I/Os submitted by offload jobs with :option:`serialize_overlap`
+       enabled.
+
+       Default: false.
  
  .. option:: io_submit_mode=str
  
@@ -2336,6 +2550,10 @@ I/O replay
         :manpage:`blktrace(8)` for how to capture such logging data. For blktrace
         replay, the file needs to be turned into a blkparse binary data file first
         (``blkparse <device> -o /dev/null -d file_for_fio.bin``).
+       You can specify a number of files by separating the names with a ':'
+       character. See the :option:`filename` option for information on how to
+       escape ':' and '\' characters within the file names. These files will
+       be sequentially assigned to job clones created by :option:`numjobs`.
  
  .. option:: read_iolog_chunked=bool
  
@@ -2343,6 +2561,33 @@ I/O replay
         will be read at once. If selected true, input from iolog will be read
         gradually. Useful when iolog is very large, or it is generated.
  
+.. option:: merge_blktrace_file=str
+
+       When specified, rather than replaying the logs passed to :option:`read_iolog`,
+       the logs go through a merge phase which aggregates them into a single
+       blktrace. The resulting file is then passed on as the :option:`read_iolog`
+       parameter. The intention here is to make the order of events consistent.
+       This limits the influence of the scheduler compared to replaying multiple
+       blktraces via concurrent jobs.
+
+.. option:: merge_blktrace_scalars=float_list
+
+       This is a percentage based option that is index paired with the list of
+       files passed to :option:`read_iolog`. When merging is performed, scale
+       the time of each event by the corresponding amount. For example,
+       ``--merge_blktrace_scalars="50:100"`` runs the first trace in half the
+       time and the second trace in real time. This knob is separately tunable from
+       :option:`replay_time_scale` which scales the trace during runtime and
+       does not change the output of the merge unlike this option.
+
+.. option:: merge_blktrace_iters=float_list
+
+       This is a whole number option that is index paired with the list of files
+       passed to :option:`read_iolog`. When merging is performed, run each trace
+       for the specified number of iterations. For example,
+       ``--merge_blktrace_iters="2:1"`` runs the first trace for two iterations
+       and the second trace for one iteration.
+
  .. option:: replay_no_stall=bool
  
         When replaying I/O with :option:`read_iolog` the default behavior is to
@@ -2380,12 +2625,13 @@ I/O replay
  
  .. option:: replay_align=int
  
-       Force alignment of I/O offsets and lengths in a trace to this power of 2
-       value.
+       Force alignment of the byte offsets in a trace to this value. The value
+       must be a power of 2.
  
  .. option:: replay_scale=int
  
-       Scale sector offsets down by this factor when replaying traces.
+       Scale byte offsets down by this factor when replaying traces. This should
+       most likely be used together with :option:`replay_align`.
  
  .. option:: replay_skip=str
  
@@ -2821,6 +3067,10 @@ Steady state
         data from the rolling collection window. Threshold limits can be expressed
         as a fixed value or as a percentage of the mean in the collection window.
  
+       When using this feature, most jobs should include the :option:`time_based`
+       and :option:`runtime` options or the :option:`loops` option so that fio does not
+       stop running after it has covered the full size of the specified file(s) or device(s).
+
                 **iops**
                         Collect IOPS data. Stop the job if all individual IOPS measurements
                         are within the specified limit of the mean IOPS (e.g., ``iops:2``
@@ -3502,7 +3752,8 @@ is one long line of values, such as::
      2;card0;0;0;7139336;121836;60004;1;10109;27.932460;116.933948;220;126861;3495.446807;1085.368601;226;126864;3523.635629;1089.012448;24063;99944;50.275485%;59818.274627;5540.657370;7155060;122104;60004;1;8338;29.086342;117.839068;388;128077;5032.488518;1234.785715;391;128085;5061.839412;1236.909129;23436;100928;50.287926%;59964.832030;5644.844189;14.595833%;19.394167%;123706;0;7313;0.1%;0.1%;0.1%;0.1%;0.1%;0.1%;100.0%;0.00%;0.00%;0.00%;0.00%;0.00%;0.00%;0.01%;0.02%;0.05%;0.16%;6.04%;40.40%;52.68%;0.64%;0.01%;0.00%;0.01%;0.00%;0.00%;0.00%;0.00%;0.00%
      A description of this job goes here.
  
-The job description (if provided) follows on a second line.
+The job description (if provided) follows on a second line for terse v2.
+It appears on the same line for other terse versions.
  
  To enable terse output, use the :option:`--minimal` or
  :option:`--output-format`\=terse command line options. The
@@ -3587,6 +3838,11 @@ minimal output v3, separated by semicolons::
  
          terse_version_3;fio_version;jobname;groupid;error;read_kb;read_bandwidth;read_iops;read_runtime_ms;read_slat_min;read_slat_max;read_slat_mean;read_slat_dev;read_clat_min;read_clat_max;read_clat_mean;read_clat_dev;read_clat_pct01;read_clat_pct02;read_clat_pct03;read_clat_pct04;read_clat_pct05;read_clat_pct06;read_clat_pct07;read_clat_pct08;read_clat_pct09;read_clat_pct10;read_clat_pct11;read_clat_pct12;read_clat_pct13;read_clat_pct14;read_clat_pct15;read_clat_pct16;read_clat_pct17;read_clat_pct18;read_clat_pct19;read_clat_pct20;read_tlat_min;read_lat_max;read_lat_mean;read_lat_dev;read_bw_min;read_bw_max;read_bw_agg_pct;read_bw_mean;read_bw_dev;write_kb;write_bandwidth;write_iops;write_runtime_ms;write_slat_min;write_slat_max;write_slat_mean;write_slat_dev;write_clat_min;write_clat_max;write_clat_mean;write_clat_dev;write_clat_pct01;write_clat_pct02;write_clat_pct03;write_clat_pct04;write_clat_pct05;write_clat_pct06;write_clat_pct07;write_clat_pct08;write_clat_pct09;write_clat_pct10;write_clat_pct11;write_clat_pct12;write_clat_pct13;write_clat_pct14;write_clat_pct15;write_clat_pct16;write_clat_pct17;write_clat_pct18;write_clat_pct19;write_clat_pct20;write_tlat_min;write_lat_max;write_lat_mean;write_lat_dev;write_bw_min;write_bw_max;write_bw_agg_pct;write_bw_mean;write_bw_dev;cpu_user;cpu_sys;cpu_csw;cpu_mjf;cpu_minf;iodepth_1;iodepth_2;iodepth_4;iodepth_8;iodepth_16;iodepth_32;iodepth_64;lat_2us;lat_4us;lat_10us;lat_20us;lat_50us;lat_100us;lat_250us;lat_500us;lat_750us;lat_1000us;lat_2ms;lat_4ms;lat_10ms;lat_20ms;lat_50ms;lat_100ms;lat_250ms;lat_500ms;lat_750ms;lat_1000ms;lat_2000ms;lat_over_2000ms;disk_name;disk_read_iops;disk_write_iops;disk_read_merges;disk_write_merges;disk_read_ticks;write_ticks;disk_queue_time;disk_util
  
+In client/server mode terse output differs from what appears when jobs are run
+locally. Disk utilization data is omitted from the standard terse output and
+for v3 and later appears on its own separate line at the end of each terse
+reporting cycle.
+
  
  JSON output
  ------------
@@ -3691,6 +3947,46 @@ given in bytes. The `action` can be one of these:
  **trim**
            Trim the given file from the given `offset` for `length` bytes.
  
+
+I/O Replay - Merging Traces
+---------------------------
+
+Colocation is a common practice used to get the most out of a machine.
+Knowing which workloads play nicely with each other and which ones don't is
+a much harder task. While fio can replay workloads concurrently via multiple
+jobs, it leaves some variability up to the scheduler making results harder to
+reproduce. Merging is a way to make the order of events consistent.
+
+Merging is integrated into I/O replay and done when a
+:option:`merge_blktrace_file` is specified. The list of files passed to
+:option:`read_iolog` goes through the merge process, which writes a single
+merged trace to the specified file. The output file is passed on as if it were
+the only file passed to :option:`read_iolog`. An example would look like::
+
+       $ fio --read_iolog="<file1>:<file2>" --merge_blktrace_file="<output_file>"
+
+Creating only the merged file can be done by passing the command line argument
+:option:`--merge-blktrace-only`.
+
+Scaling traces can be done to see the relative impact of any particular trace
+being slowed down or sped up. :option:`merge_blktrace_scalars` takes in a colon
+separated list of percentage scalars. It is index paired with the files passed
+to :option:`read_iolog`.
+
+With scaling, it may be desirable to match the running time of all traces.
+This can be done with :option:`merge_blktrace_iters`. It is index paired with
+:option:`read_iolog` just like :option:`merge_blktrace_scalars`.
+
+As an example, consider two traces, A and B, each 60s long. If we want to see
+the impact of trace A issuing IOs twice as fast and repeat trace A over the
+runtime of trace B, the following can be done::
+
+       $ fio --read_iolog="<trace_a>:<trace_b>" --merge_blktrace_file="<output_file>" --merge_blktrace_scalars="50:100" --merge_blktrace_iters="2:1"
+
+This runs trace A at 2x the speed twice for approximately the same runtime as
+a single run of trace B.
+
+
  CPU idleness profiling
  ----------------------
  
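A note on the trace-merging example in the hunk above: the claim that both traces end up with roughly the same runtime can be checked with simple arithmetic. The sketch below assumes the merged runtime of a trace is its original duration multiplied by its percentage scalar and by its iteration count, which is how the text describes the two options; the function name ``effective_runtimes`` is hypothetical and not part of fio.

```python
def effective_runtimes(durations_s, scalars, iters):
    """Estimate the per-trace runtime after merging.

    durations_s: original trace durations in seconds.
    scalars:     colon-separated percentages, as passed to
                 merge_blktrace_scalars (e.g. "50:100").
    iters:       colon-separated iteration counts, as passed to
                 merge_blktrace_iters (e.g. "2:1").
    """
    scalar_list = [float(s) for s in scalars.split(":")]
    iter_list = [int(i) for i in iters.split(":")]
    if not (len(durations_s) == len(scalar_list) == len(iter_list)):
        raise ValueError("all lists must be index paired with the traces")
    return [
        duration * scalar / 100 * count
        for duration, scalar, count in zip(durations_s, scalar_list, iter_list)
    ]


if __name__ == "__main__":
    # Traces A and B are each 60s long, as in the example above.
    print(effective_runtimes([60, 60], "50:100", "2:1"))
```

Under these assumptions trace A runs for 60s * 50% * 2 = 60s, matching the single 60s run of trace B.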
@@ -3825,6 +4121,7 @@ is recorded. Each *data direction* seen within the window period will aggregate
  its values in a separate row. Further, when using windowed logging the *block
  size* and *offset* entries will always contain 0.
  
+
  Client/Server
  -------------
  
@@ -3912,3 +4209,6 @@ containing two hostnames ``h1`` and ``h2`` with IP addresses 192.168.10.120 and
  
         /mnt/nfs/fio/192.168.10.120.fileio.tmp
         /mnt/nfs/fio/192.168.10.121.fileio.tmp
+
+Terse output in client/server mode will differ slightly from what is produced
+when fio is run in stand-alone mode. See the terse output section for details.