Merge branch 'atomic-writes'

[fio.git] / HOWTO.rst
diff --git a/HOWTO.rst b/HOWTO.rst

index fb067fe52844cb4fb3c1ad3bf192807b8007e5dc..4f071484bdc3fa7b02cc4dd51f87ec3ef97a601d 100644 (file)
--- a/HOWTO.rst
+++ b/HOWTO.rst
@@ -805,6 +805,11 @@ Target file/device
         Note: Windows and FreeBSD (refer to geom(4)) prevent write access to areas
         of the disk containing in-use data (e.g. filesystems).
  
+       For HTTP and S3 access, specify a valid URL path or S3 key, respectively.
+       A filename for path-style S3 includes a bucket name (:file:`/bucket/k/e.y`)
+       while a virtual-hosted-style S3 filename :file:`/k/e.y` does not because 
+       its bucket name is specified in :option:`http_host`.
+
         The filename "`-`" is a reserved name, meaning *stdin* or *stdout*.  Which
         of the two depends on the read/write direction set.
  
@@ -971,13 +976,13 @@ Target file/device
  
  .. option:: unlink=bool
  
-       Unlink the job files when done. Not the default, as repeated runs of that
+       Unlink (delete) the job files when done. Not the default, as repeated runs of that
         job would then waste time recreating the file set again and again. Default:
         false.
  
  .. option:: unlink_each_loop=bool
  
-       Unlink job files after each iteration or loop.  Default: false.
+       Unlink (delete) job files after each iteration or loop.  Default: false.
  
  .. option:: zonemode=str
  
@@ -985,14 +990,14 @@ Target file/device
  
                 **none**
                                 The :option:`zonerange`, :option:`zonesize`,
-                               :option `zonecapacity` and option:`zoneskip`
+                               :option:`zonecapacity` and :option:`zoneskip`
                                 parameters are ignored.
                 **strided**
                                 I/O happens in a single zone until
                                 :option:`zonesize` bytes have been transferred.
                                 After that number of bytes has been
                                 transferred processing of the next zone
-                               starts. :option `zonecapacity` is ignored.
+                               starts. :option:`zonecapacity` is ignored.
                 **zbd**
                                 Zoned block device mode. I/O happens
                                 sequentially in each zone, even if random I/O
@@ -1992,7 +1997,9 @@ I/O engine
  
  .. option:: ioengine=str
  
-       Defines how the job issues I/O to the file. The following types are defined:
+       fio supports two kinds of performance measurement: I/O and file/directory operations.
+
+       I/O engines define how the job issues I/O to the file. The following types are defined:
  
                 **sync**
                         Basic :manpage:`read(2)` or :manpage:`write(2)`
@@ -2177,36 +2184,6 @@ I/O engine
                         absolute or relative. See :file:`engines/skeleton_external.c` for
                         details of writing an external I/O engine.
  
-               **filecreate**
-                       Simply create the files and do no I/O to them.  You still need to
-                       set  `filesize` so that all the accounting still occurs, but no
-                       actual I/O will be done other than creating the file.
-
-               **filestat**
-                       Simply do stat() and do no I/O to the file. You need to set 'filesize'
-                       and 'nrfiles', so that files will be created.
-                       This engine is to measure file lookup and meta data access.
-
-               **filedelete**
-                       Simply delete the files by unlink() and do no I/O to them. You need to set 'filesize'
-                       and 'nrfiles', so that the files will be created.
-                       This engine is to measure file delete.
-
-               **dircreate**
-                       Simply create the directories and do no I/O to them.  You still need to
-                       set  `filesize` so that all the accounting still occurs, but no
-                       actual I/O will be done other than creating the directories.
-
-               **dirstat**
-                       Simply do stat() and do no I/O to the directories. You need to set 'filesize'
-                       and 'nrfiles', so that directories will be created.
-                       This engine is to measure directory lookup and meta data access.
-
-               **dirdelete**
-                       Simply delete the directories by rmdir() and do no I/O to them. You need to set 'filesize'
-                       and 'nrfiles', so that the directories will be created.
-                       This engine is to measure directory delete.
-
                 **libpmem**
                         Read and write using mmap I/O to a file on a filesystem
                         mounted with DAX on a persistent memory device through the PMDK
@@ -2276,6 +2253,50 @@ I/O engine
                         several instances to access the same device or file
                         simultaneously, but allow it for threads.
  
+       File/directory operation engines define how the job operates on files or directories. The
+       following types are defined:
+
+               **filecreate**
+                       Simply create the files and do no I/O to them.  You still need to
+                       set  `filesize` so that all the accounting still occurs, but no
+                       actual I/O will be done other than creating the file.
+                       Example job file: filecreate-ioengine.fio.
+
+               **filestat**
+                       Simply do stat() and do no I/O to the file. You need to set 'filesize'
+                       and 'nrfiles', so that files will be created.
+                       This engine is to measure file lookup and meta data access.
+                       Example job file: filestat-ioengine.fio.
+
+               **filedelete**
+                       Simply delete the files by unlink() and do no I/O to them. You need to set 'filesize'
+                       and 'nrfiles', so that the files will be created.
+                       This engine is to measure file delete.
+                       Example job file: filedelete-ioengine.fio.
+
+               **dircreate**
+                       Simply create the directories and do no I/O to them.  You still need to
+                       set  `filesize` so that all the accounting still occurs, but no
+                       actual I/O will be done other than creating the directories.
+                       Example job file: dircreate-ioengine.fio.
+
+               **dirstat**
+                       Simply do stat() and do no I/O to the directories. You need to set 'filesize'
+                       and 'nrfiles', so that directories will be created.
+                       This engine is to measure directory lookup and meta data access.
+                       Example job file: dirstat-ioengine.fio.
+
+               **dirdelete**
+                       Simply delete the directories by rmdir() and do no I/O to them. You need to set 'filesize'
+                       and 'nrfiles', so that the directories will be created.
+                       This engine is to measure directory delete.
+                       Example job file: dirdelete-ioengine.fio.
+
+               For file and directory operation engines there is no I/O throughput, so the
+               statistics in the report have different meanings. The meaningful output values are
+               'iops' and 'clat'; 'bw' is meaningless. Refer to the section "Interpreting the output"
+               for more details.
+
+
  I/O engine specific parameters
  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  
@@ -2480,11 +2501,42 @@ with the caveat that when used on the command line, they must come after the
         For direct I/O, requests will only succeed if cache invalidation isn't required,
         file blocks are fully allocated and the disk request could be issued immediately.
  
+.. option:: atomic=bool : [pvsync2] [libaio] [io_uring]
+
+       When this option is set, writes are issued with torn-write protection: after a
+       power failure or kernel crash, either all or none of the data from the write
+       is stored, never a mix of old and new data. Torn-write protection is
+       also known as atomic writes.
+
+       This option sets the RWF_ATOMIC flag (supported from the 6.11 Linux kernel) on
+       a per-IO basis.
+
+       Writes with RWF_ATOMIC set will be rejected by the kernel when the file does
+       not support torn-write protection. To learn a file's torn-write limits, issue
+       statx with STATX_WRITE_ATOMIC.
+
  .. option:: fdp=bool : [io_uring_cmd] [xnvme]
  
         Enable Flexible Data Placement mode for write commands.
  
-.. option:: fdp_pli_select=str : [io_uring_cmd] [xnvme]
+.. option:: dataplacement=str : [io_uring_cmd] [xnvme]
+
+        Specifies the data placement directive type to use for write commands.
+        The following types are supported:
+
+                **none**
+                        Do not use a data placement directive. This is the
+                        default.
+
+                **fdp**
+                        Use Flexible Data Placement directives for write
+                        commands. This is equivalent to specifying
+                        :option:`fdp` =1.
+
+               **streams**
+                        Use Streams directives for write commands.
+
+.. option:: plid_select=str, fdp_pli_select=str : [io_uring_cmd] [xnvme]
  
         Defines how fio decides which placement ID to use next. The following
         types are defined:
@@ -2496,16 +2548,46 @@ with the caveat that when used on the command line, they must come after the
                         Round robin over available placement IDs. This is the
                         default.
  
-       The available placement ID index/indices is defined by the option
-       :option:`fdp_pli`.
+               **scheme**
+                       Choose a placement ID (index) based on the scheme file defined by
+                       the option :option:`dp_scheme`.
+
+       The available placement IDs (or indices) are defined by the option :option:`fdp_pli`
+       or :option:`plids`, except when **scheme** is selected.
+
+.. option:: plids=str, fdp_pli=str : [io_uring_cmd] [xnvme]
+
+        Select which Placement ID Indices (FDP) or Placement IDs (streams) this
+        job is allowed to use for writes. This option accepts a comma-separated
+        list of values or ranges (e.g., 1,2-4,5,6-8).
+
+        For FDP by default, the job will cycle through all available Placement
+        IDs, so use this option to be selective. The values specified here are
+        array indices for the list of placement IDs returned by the nvme-cli
+        command ``nvme fdp status``. If you want fio to use FDP placement
+        identifiers only at indices 0, 2 and 5, set ``plids=0,2,5``.
+
+        For streams this should be a list of Stream IDs.
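The list syntax accepted by ``plids`` can be illustrated with a minimal Python sketch. This is not fio's parser: the function name ``parse_plids`` is hypothetical, and treating ranges as inclusive on both ends is an assumption based on the example ``1,2-4,5,6-8`` above.

```python
def parse_plids(spec: str) -> list[int]:
    """Expand a comma-separated list of values or ranges into integers.

    Assumption: a range "a-b" includes both endpoints.
    """
    ids: list[int] = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue  # tolerate stray commas
        if "-" in token:
            first, last = (int(part) for part in token.split("-", 1))
            if last < first:
                raise ValueError(f"descending range: {token!r}")
            ids.extend(range(first, last + 1))
        else:
            ids.append(int(token))
    return ids


if __name__ == "__main__":
    print(parse_plids("1,2-4,5,6-8"))
    print(parse_plids("0,2,5"))
```

Under these assumptions, ``plids=0,2,5`` expands to exactly the three indices named in the text.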
  
  
-.. option:: fdp_pli=str : [io_uring_cmd] [xnvme]
+.. option:: dp_scheme=str : [io_uring_cmd] [xnvme]
  
  
-       Select which Placement ID Index/Indicies this job is allowed to use for
-       writes. By default, the job will cycle through all available Placement
-        IDs, so use this to isolate these identifiers to specific jobs. If you
-        want fio to use placement identifier only at indices 0, 2 and 5 specify
-        ``fdp_pli=0,2,5``.
+       Defines which placement ID (index) is selected based on the offset (LBA) range.
+       The file should contain one or more scheme entries in the following format:
+
+               0, 10737418240, 0
+               10737418240, 21474836480, 1
+               21474836480, 32212254720, 2
+               ...
+
+       Each line is a scheme entry containing a start offset, an end offset, and a placement ID
+       (index), separated by commas. If the write offset falls within the range of a
+       scheme entry (start offset ≤ offset < end offset), the corresponding placement ID
+       (index) is selected. If the write offset matches multiple scheme entries,
+       the first matching entry is applied. If the offset is not within the range of any
+       scheme entry, the dspec field is set to 0, the default RUH. (Caution: if a job uses
+       multiple devices, the scheme affects all of them. If
+       this option is specified, the option :option:`plids` or :option:`fdp_pli` is
+       ignored.)
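The matching rule for scheme entries can be sketched in Python as follows. This is an illustration only, not fio's implementation: ``parse_scheme`` and ``select_plid`` are hypothetical names, and the sample entries are the ones shown in the format example.

```python
SCHEME_TEXT = """\
0, 10737418240, 0
10737418240, 21474836480, 1
21474836480, 32212254720, 2
"""


def parse_scheme(text):
    """Parse 'start, end, placement ID' lines into a list of tuples."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        start, end, plid = (int(field) for field in line.split(","))
        entries.append((start, end, plid))
    return entries


def select_plid(entries, offset):
    """Return the placement ID (index) of the first entry whose
    half-open range [start, end) contains offset, or None if no
    entry matches (the case where dspec is set to 0)."""
    for start, end, plid in entries:
        if start <= offset < end:
            return plid
    return None


if __name__ == "__main__":
    entries = parse_scheme(SCHEME_TEXT)
    for offset in (0, 10737418240, 32212254720):
        print(offset, select_plid(entries, offset))
```

Because the end offset is exclusive, an offset equal to one entry's end belongs to the next entry, if there is one.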
  
  .. option:: md_per_io_size=int : [io_uring_cmd] [xnvme]
  
@@ -2599,7 +2681,7 @@ with the caveat that when used on the command line, they must come after the
                 this will be the starting port number since fio will use a range of
                 ports.
  
-   [rdma], [librpma_*]
+   [rdma]
  
                 The port to use for RDMA-CM communication. This should be the same value
                 on the client and the server side.
@@ -2610,20 +2692,6 @@ with the caveat that when used on the command line, they must come after the
         is a TCP listener or UDP reader, the hostname is not used and must be omitted
         unless it is a valid UDP multicast address.
  
-.. option:: serverip=str : [librpma_*]
-
-       The IP address to be used for RDMA-CM based I/O.
-
-.. option:: direct_write_to_pmem=bool : [librpma_*]
-
-       Set to 1 only when Direct Write to PMem from the remote host is possible.
-       Otherwise, set to 0.
-
-.. option:: busy_wait_polling=bool : [librpma_*_server]
-
-       Set to 0 to wait for completion instead of busy-wait polling completion.
-       Default: 1.
-
  .. option:: interface=str : [netsplice] [net]
  
         The IP address of the network interface used to send or receive UDP
@@ -2789,16 +2857,32 @@ with the caveat that when used on the command line, they must come after the
         Specify stat system call type to measure lookup/getattr performance.
         Default is **stat** for :manpage:`stat(2)`.
  
-.. option:: readfua=bool : [sg]
+.. option:: readfua=bool : [sg] [io_uring_cmd]
  
         With readfua option set to 1, read operations include
         the force unit access (fua) flag. Default is 0.
  
-.. option:: writefua=bool : [sg]
+.. option:: writefua=bool : [sg] [io_uring_cmd]
  
         With writefua option set to 1, write operations include
         the force unit access (fua) flag. Default is 0.
  
+.. option:: write_mode=str : [io_uring_cmd]
+
+        Specifies the type of write operation.  Defaults to 'write'.
+
+                **write**
+                        Use Write commands for write operations
+
+                **uncor**
+                        Use Write Uncorrectable commands for write operations
+
+                **zeroes**
+                        Use Write Zeroes commands for write operations
+
+                **verify**
+                        Use Verify commands for write operations
+
  .. option:: sg_write_mode=str : [sg]
  
         Specify the type of write commands to issue. This option can take ten values:
@@ -2855,8 +2939,13 @@ with the caveat that when used on the command line, they must come after the
  
  .. option:: http_host=str : [http]
  
-       Hostname to connect to. For S3, this could be the bucket hostname.
-       Default is **localhost**
+       Hostname to connect to. HTTP port 80 is used automatically when the value of 
+       the https parameter is *off*, and HTTPS port 443 if it is *on*. A 
+       virtual-hosted-style S3 hostname starts with a bucket name, while a 
+       path-style S3 hostname does not. See 
+       https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html for 
+       detailed examples.
+       Default is **localhost** (path-style S3 hostname)
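The difference between the two S3 addressing styles can be shown with a small Python sketch that combines a hostname and a filename into a URL. It is an illustration under stated assumptions, not a description of how fio builds requests: ``build_url`` is a hypothetical helper, and the bucket name ``mybucket`` and the hostnames are made-up examples.

```python
def build_url(http_host: str, filename: str, https: bool = False) -> str:
    """Join scheme, host and filename; the default port follows the scheme."""
    scheme = "https" if https else "http"
    return f"{scheme}://{http_host}{filename}"


if __name__ == "__main__":
    # Path-style: the bucket name is the first component of the filename.
    print(build_url("s3.example.com", "/mybucket/k/e.y", https=True))
    # Virtual-hosted-style: the bucket name is part of the hostname.
    print(build_url("mybucket.s3.example.com", "/k/e.y", https=True))
```

Both calls address the same key ``k/e.y``; only the position of the bucket name differs.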
  
  .. option:: http_user=str : [http]
  
@@ -3913,6 +4002,17 @@ Verification
          instead resets the file after the write phase and then replays I/Os for
          the verification phase.
  
+.. option:: verify_write_sequence=bool
+
+        Verify the header write sequence number. In a scenario with multiple jobs,
+        verification of the write sequence number may fail. Disabling this option
+        will mean that write sequence number checking is skipped. Doing that can be
+        useful for testing atomic writes, as it means that checksum verification can
+        still be attempted. When :option:`atomic` is enabled, checksum
+        verification is expected to succeed (while write sequence checking can still
+        fail).
+        Defaults to true.
+
  .. option:: trim_percentage=int
  
         Number of verify blocks to discard/trim.
@@ -4148,6 +4248,21 @@ Measurements and reporting
         entry as well as the other data values. Defaults to 0 meaning that
         offsets are not present in logs. Also see `Log File Formats`_.
  
+.. option:: log_prio=bool
+
+       If this is set, the *Command priority* field in `Log File Formats`_
+       shows the priority value and the IO priority class of the command.
+       Otherwise, the field shows if the command has the highest RT
+       priority class or not. Also see `Log File Formats`_.
+
+.. option:: log_issue_time=bool
+
+       If this is set, the iolog options will include the command issue time
+       for the I/O entry as well as the other data values. Defaults to 0
+       meaning that command issue times are not present in logs. Also see
+       `Log File Formats`_. This option shall be set together with
+       :option:`write_lat_log` and :option:`log_offset`.
+
  .. option:: log_compression=int
  
         If this is set, fio will compress the I/O logs as it goes, to keep the
@@ -4583,6 +4698,21 @@ writes in the example above).  In the order listed, they denote:
                  commit if available) functions were completed to when the I/O's
                  completion was reaped by fio.
  
+               For file and directory operation engines, **clat** denotes the time
+               to complete one file or directory operation.
+
+                 **filecreate engine**: the time taken to create a new file
+
+                 **filestat engine**: the time taken to look up an existing file
+
+                 **filedelete engine**: the time taken to delete a file
+
+                 **dircreate engine**: the time taken to create a new directory
+
+                 **dirstat engine**: the time taken to look up an existing directory
+
+                 **dirdelete engine**: the time taken to delete a directory
+
  **lat**
                 Total latency. Same names as slat and clat, this denotes the time from
                 when fio created the I/O unit to completion of the I/O operation.
@@ -4601,12 +4731,30 @@ writes in the example above).  In the order listed, they denote:
                 are on the same disk, since they are then competing for disk
                 access.
  
+               For file and directory operation engines, **bw** is meaningless.
+
  **iops**
                 IOPS statistics based on measurements from discrete intervals.
                 For details see the description for bw above. See
                 :option:`iopsavgtime` to control the duration of the intervals.
                 Same values reported here as for bw except for percentage.
  
+               For file and directory operation engines, **iops** is the most
+               fundamental measure of performance.
+               It is the number of file or directory operations completed per second.
+
+                 **filecreate engine**: number of files that can be created per second
+
+                 **filestat engine**: number of files that can be looked up per second
+
+                 **filedelete engine**: number of files that can be deleted per second
+
+                 **dircreate engine**: number of directories that can be created per second
+
+                 **dirstat engine**: number of directories that can be looked up per second
+
+                 **dirdelete engine**: number of directories that can be deleted per second
+
  **lat (nsec/usec/msec)**
                 The distribution of I/O completion latencies. This is the time from when
                 I/O leaves fio and when it gets completed. Unlike the separate
@@ -5080,7 +5228,7 @@ Fio supports a variety of log file formats, for logging latencies, bandwidth,
  and IOPS. The logs share a common format, which looks like this:
  
      *time* (`msec`), *value*, *data direction*, *block size* (`bytes`),
-    *offset* (`bytes`), *command priority*
+    *offset* (`bytes`), *command priority*, *issue time* (`nsec`)
  
  *Time* for the log entry is always in milliseconds. The *value* logged depends
  on the type of log, it will be one of the following:
@@ -5105,8 +5253,21 @@ The entry's *block size* is always in bytes. The *offset* is the position in byt
  from the start of the file for that particular I/O. The logging of the offset can be
  toggled with :option:`log_offset`.
  
-*Command priority* is 0 for normal priority and 1 for high priority. This is controlled
-by the ioengine specific :option:`cmdprio_percentage`.
+If :option:`log_prio` is not set, the entry's *Command priority* is 1 for an IO
+executed with the highest RT priority class (:option:`prioclass` =1 or
+:option:`cmdprio_class` =1) and 0 otherwise. This is controlled by the
+:option:`prioclass` option and the ioengine specific
+:option:`cmdprio_percentage` and :option:`cmdprio_class` options. If
+:option:`log_prio` is set, the entry's *Command priority* is the priority set
+for the IO, as a 16-bit hexadecimal number with the lowest 13 bits indicating
+the priority value (:option:`prio` and :option:`cmdprio` options) and the
+highest 3 bits indicating the IO priority class (:option:`prioclass` and
+:option:`cmdprio_class` options).
+
+The entry's *issue time* is the command issue time in nanoseconds. The logging
+of the issue time can be toggled with :option:`log_issue_time`. This field has
+valid values only in the completion latency (clat) and submission latency (slat)
+log files. The field has the value 0 in other log files.
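The bit layout of the *Command priority* field when ``log_prio`` is set can be decoded as in the Python sketch below. ``decode_cmdprio`` is a hypothetical helper and the sample values are invented for illustration. The split (lowest 13 bits for the priority value, highest 3 bits for the priority class) is taken from the description above.

```python
PRIO_VALUE_BITS = 13                        # lowest 13 bits: priority value
PRIO_VALUE_MASK = (1 << PRIO_VALUE_BITS) - 1
PRIO_CLASS_MASK = 0x7                       # highest 3 bits: priority class


def decode_cmdprio(field: str) -> tuple[int, int]:
    """Split a 16-bit hexadecimal priority field into (class, value)."""
    raw = int(field, 16)                    # accepts "4000" as well as "0x4000"
    if not 0 <= raw <= 0xFFFF:
        raise ValueError(f"not a 16-bit value: {field!r}")
    prio_class = (raw >> PRIO_VALUE_BITS) & PRIO_CLASS_MASK
    prio_value = raw & PRIO_VALUE_MASK
    return prio_class, prio_value


if __name__ == "__main__":
    for sample in ("0x2004", "0x4000"):
        print(sample, decode_cmdprio(sample))
```

With this layout, a class of 1 corresponds to the RT priority class mentioned in the text (``prioclass`` =1).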
  
  Fio defaults to logging every individual I/O but when windowed logging is set
  through :option:`log_avg_msec`, either the average (by default), the maximum
@@ -5116,12 +5277,12 @@ is set to both) is recorded. The log file format when both the values are report
  takes this form:
  
      *time* (`msec`), *value*, *value1*, *data direction*, *block size* (`bytes`),
-    *offset* (`bytes`), *command priority*
+    *offset* (`bytes`), *command priority*, *issue time* (`nsec`)
  
  
  Each *data direction* seen within the window period will aggregate its values in a
-separate row. Further, when using windowed logging the *block size* and *offset*
-entries will always contain 0.
+separate row. Further, when using windowed logging the *block size*, *offset*
+and *issue time* entries will always contain 0.
  
  
  Client/Server