Documentation/block/bfq-iosched.txt

   1 BFQ (Budget Fair Queueing)
   2 ==========================
   3
   4 BFQ is a proportional-share I/O scheduler, with some extra
   5 low-latency capabilities. In addition to cgroups support (blkio or io
   6 controllers), BFQ's main features are:
   7 - BFQ guarantees a high system and application responsiveness, and a
   8   low latency for time-sensitive applications, such as audio or video
   9   players;
  10 - BFQ distributes bandwidth, and not just time, among processes or
  11   groups (switching back to time distribution when needed to keep
  12   throughput high).
  13
  14 On average CPUs, the current version of BFQ can handle devices
  15 performing at most ~30K IOPS; at most ~50 KIOPS on faster CPUs. As a
  16 reference, 30-50 KIOPS correspond to very high bandwidths with
  17 sequential I/O (e.g., 8-12 GB/s if I/O requests are 256 KB large), and
  18 to 120-200 MB/s with 4KB random I/O. BFQ has not yet been tested on
  19 multi-queue devices.
  20
  21 The table of contents follow. Impatients can just jump to Section 3.
  22
  23 CONTENTS
  24
  25 1. When may BFQ be useful?
  26  1-1 Personal systems
  27  1-2 Server systems
  28 2. How does BFQ work?
  29 3. What are BFQ's tunable?
  30 4. BFQ group scheduling
  31  4-1 Service guarantees provided
  32  4-2 Interface
  33
  34 1. When may BFQ be useful?
  35 ==========================
  36
  37 BFQ provides the following benefits on personal and server systems.
  38
  39 1-1 Personal systems
  40 --------------------
  41
  42 Low latency for interactive applications
  43
  44 Regardless of the actual background workload, BFQ guarantees that, for
  45 interactive tasks, the storage device is virtually as responsive as if
  46 it was idle. For example, even if one or more of the following
  47 background workloads are being executed:
  48 - one or more large files are being read, written or copied,
  49 - a tree of source files is being compiled,
  50 - one or more virtual machines are performing I/O,
  51 - a software update is in progress,
  52 - indexing daemons are scanning filesystems and updating their
  53   databases,
  54 starting an application or loading a file from within an application
  55 takes about the same time as if the storage device was idle. As a
  56 comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
  57 applications experience high latencies, or even become unresponsive
  58 until the background workload terminates (also on SSDs).
  59
  60 Low latency for soft real-time applications
  61
  62 Also soft real-time applications, such as audio and video
  63 players/streamers, enjoy a low latency and a low drop rate, regardless
  64 of the background I/O workload. As a consequence, these applications
  65 do not suffer from almost any glitch due to the background workload.
  66
  67 Higher speed for code-development tasks
  68
  69 If some additional workload happens to be executed in parallel, then
  70 BFQ executes the I/O-related components of typical code-development
  71 tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
  72 NOOP or DEADLINE.
  73
  74 High throughput
  75
  76 On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
  77 up to 150% higher throughput than DEADLINE and NOOP, with all the
  78 sequential workloads considered in our tests. With random workloads,
  79 and with all the workloads on flash-based devices, BFQ achieves,
  80 instead, about the same throughput as the other schedulers.
  81
  82 Strong fairness, bandwidth and delay guarantees
  83
  84 BFQ distributes the device throughput, and not just the device time,
  85 among I/O-bound applications in proportion their weights, with any
  86 workload and regardless of the device parameters. From these bandwidth
  87 guarantees, it is possible to compute tight per-I/O-request delay
  88 guarantees by a simple formula. If not configured for strict service
  89 guarantees, BFQ switches to time-based resource sharing (only) for
  90 applications that would otherwise cause a throughput loss.
  91
  92 1-2 Server systems
  93 ------------------
  94
  95 Most benefits for server systems follow from the same service
  96 properties as above. In particular, regardless of whether additional,
  97 possibly heavy workloads are being served, BFQ guarantees:
  98
  99 . audio and video-streaming with zero or very low jitter and drop
 100   rate;
 101
 102 . fast retrieval of WEB pages and embedded objects;
 103
 104 . real-time recording of data in live-dumping applications (e.g.,
 105   packet logging);
 106
 107 . responsiveness in local and remote access to a server.
 108
 109
 110 2. How does BFQ work?
 111 =====================
 112
 113 BFQ is a proportional-share I/O scheduler, whose general structure,
 114 plus a lot of code, are borrowed from CFQ.
 115
 116 - Each process doing I/O on a device is associated with a weight and a
 117   (bfq_)queue.
 118
 119 - BFQ grants exclusive access to the device, for a while, to one queue
 120   (process) at a time, and implements this service model by
 121   associating every queue with a budget, measured in number of
 122   sectors.
 123
 124   - After a queue is granted access to the device, the budget of the
 125     queue is decremented, on each request dispatch, by the size of the
 126     request.
 127
 128   - The in-service queue is expired, i.e., its service is suspended,
 129     only if one of the following events occurs: 1) the queue finishes
 130     its budget, 2) the queue empties, 3) a "budget timeout" fires.
 131
 132     - The budget timeout prevents processes doing random I/O from
 133       holding the device for too long and dramatically reducing
 134       throughput.
 135
 136     - Actually, as in CFQ, a queue associated with a process issuing
 137       sync requests may not be expired immediately when it empties. In
 138       contrast, BFQ may idle the device for a short time interval,
 139       giving the process the chance to go on being served if it issues
 140       a new request in time. Device idling typically boosts the
 141       throughput on rotational devices, if processes do synchronous
 142       and sequential I/O. In addition, under BFQ, device idling is
 143       also instrumental in guaranteeing the desired throughput
 144       fraction to processes issuing sync requests (see the description
 145       of the slice_idle tunable in this document, or [1, 2], for more
 146       details).
 147
 148       - With respect to idling for service guarantees, if several
 149         processes are competing for the device at the same time, but
 150         all processes (and groups, after the following commit) have
 151         the same weight, then BFQ guarantees the expected throughput
 152         distribution without ever idling the device. Throughput is
 153         thus as high as possible in this common scenario.
 154
 155   - If low-latency mode is enabled (default configuration), BFQ
 156     executes some special heuristics to detect interactive and soft
 157     real-time applications (e.g., video or audio players/streamers),
 158     and to reduce their latency. The most important action taken to
 159     achieve this goal is to give to the queues associated with these
 160     applications more than their fair share of the device
 161     throughput. For brevity, we call just "weight-raising" the whole
 162     sets of actions taken by BFQ to privilege these queues. In
 163     particular, BFQ provides a milder form of weight-raising for
 164     interactive applications, and a stronger form for soft real-time
 165     applications.
 166
 167   - BFQ automatically deactivates idling for queues born in a burst of
 168     queue creations. In fact, these queues are usually associated with
 169     the processes of applications and services that benefit mostly
 170     from a high throughput. Examples are systemd during boot, or git
 171     grep.
 172
 173   - As CFQ, BFQ merges queues performing interleaved I/O, i.e.,
 174     performing random I/O that becomes mostly sequential if
 175     merged. Differently from CFQ, BFQ achieves this goal with a more
 176     reactive mechanism, called Early Queue Merge (EQM). EQM is so
 177     responsive in detecting interleaved I/O (cooperating processes),
 178     that it enables BFQ to achieve a high throughput, by queue
 179     merging, even for queues for which CFQ needs a different
 180     mechanism, preemption, to get a high throughput. As such EQM is a
 181     unified mechanism to achieve a high throughput with interleaved
 182     I/O.
 183
 184   - Queues are scheduled according to a variant of WF2Q+, named
 185     B-WF2Q+, and implemented using an augmented rb-tree to preserve an
 186     O(log N) overall complexity.  See [2] for more details. B-WF2Q+ is
 187     also ready for hierarchical scheduling. However, for a cleaner
 188     logical breakdown, the code that enables and completes
 189     hierarchical support is provided in the next commit, which focuses
 190     exactly on this feature.
 191
 192   - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
 193     perfectly fair, and smooth service. In particular, B-WF2Q+
 194     guarantees that each queue receives a fraction of the device
 195     throughput proportional to its weight, even if the throughput
 196     fluctuates, and regardless of: the device parameters, the current
 197     workload and the budgets assigned to the queue.
 198
 199   - The last, budget-independence, property (although probably
 200     counterintuitive in the first place) is definitely beneficial, for
 201     the following reasons:
 202
 203     - First, with any proportional-share scheduler, the maximum
 204       deviation with respect to an ideal service is proportional to
 205       the maximum budget (slice) assigned to queues. As a consequence,
 206       BFQ can keep this deviation tight not only because of the
 207       accurate service of B-WF2Q+, but also because BFQ *does not*
 208       need to assign a larger budget to a queue to let the queue
 209       receive a higher fraction of the device throughput.
 210
 211     - Second, BFQ is free to choose, for every process (queue), the
 212       budget that best fits the needs of the process, or best
 213       leverages the I/O pattern of the process. In particular, BFQ
 214       updates queue budgets with a simple feedback-loop algorithm that
 215       allows a high throughput to be achieved, while still providing
 216       tight latency guarantees to time-sensitive applications. When
 217       the in-service queue expires, this algorithm computes the next
 218       budget of the queue so as to:
 219
 220       - Let large budgets be eventually assigned to the queues
 221         associated with I/O-bound applications performing sequential
 222         I/O: in fact, the longer these applications are served once
 223         got access to the device, the higher the throughput is.
 224
 225       - Let small budgets be eventually assigned to the queues
 226         associated with time-sensitive applications (which typically
 227         perform sporadic and short I/O), because, the smaller the
 228         budget assigned to a queue waiting for service is, the sooner
 229         B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
 230
 231 - If several processes are competing for the device at the same time,
 232   but all processes and groups have the same weight, then BFQ
 233   guarantees the expected throughput distribution without ever idling
 234   the device. It uses preemption instead. Throughput is then much
 235   higher in this common scenario.
 236
 237 - ioprio classes are served in strict priority order, i.e.,
 238   lower-priority queues are not served as long as there are
 239   higher-priority queues.  Among queues in the same class, the
 240   bandwidth is distributed in proportion to the weight of each
 241   queue. A very thin extra bandwidth is however guaranteed to
 242   the Idle class, to prevent it from starving.
 243
 244
 245 3. What are BFQ's tunable?
 246 ==========================
 247
 248 The tunables back_seek-max, back_seek_penalty, fifo_expire_async and
 249 fifo_expire_sync below are the same as in CFQ. Their description is
 250 just copied from that for CFQ. Some considerations in the description
 251 of slice_idle are copied from CFQ too.
 252
 253 per-process ioprio and weight
 254 -----------------------------
 255
 256 Unless the cgroups interface is used, weights can be assigned to
 257 processes only indirectly, through I/O priorities, and according to
 258 the relation: weight = (IOPRIO_BE_NR - ioprio) * 10.
 259
 260 slice_idle
 261 ----------
 262
 263 This parameter specifies how long BFQ should idle for next I/O
 264 request, when certain sync BFQ queues become empty. By default
 265 slice_idle is a non-zero value. Idling has a double purpose: boosting
 266 throughput and making sure that the desired throughput distribution is
 267 respected (see the description of how BFQ works, and, if needed, the
 268 papers referred there).
 269
 270 As for throughput, idling can be very helpful on highly seeky media
 271 like single spindle SATA/SAS disks where we can cut down on overall
 272 number of seeks and see improved throughput.
 273
 274 Setting slice_idle to 0 will remove all the idling on queues and one
 275 should see an overall improved throughput on faster storage devices
 276 like multiple SATA/SAS disks in hardware RAID configuration.
 277
 278 So depending on storage and workload, it might be useful to set
 279 slice_idle=0.  In general for SATA/SAS disks and software RAID of
 280 SATA/SAS disks keeping slice_idle enabled should be useful. For any
 281 configurations where there are multiple spindles behind single LUN
 282 (Host based hardware RAID controller or for storage arrays), setting
 283 slice_idle=0 might end up in better throughput and acceptable
 284 latencies.
 285
 286 Idling is however necessary to have service guarantees enforced in
 287 case of differentiated weights or differentiated I/O-request lengths.
 288 To see why, suppose that a given BFQ queue A must get several I/O
 289 requests served for each request served for another queue B. Idling
 290 ensures that, if A makes a new I/O request slightly after becoming
 291 empty, then no request of B is dispatched in the middle, and thus A
 292 does not lose the possibility to get more than one request dispatched
 293 before the next request of B is dispatched. Note that idling
 294 guarantees the desired differentiated treatment of queues only in
 295 terms of I/O-request dispatches. To guarantee that the actual service
 296 order then corresponds to the dispatch order, the strict_guarantees
 297 tunable must be set too.
 298
 299 There is an important flipside for idling: apart from the above cases
 300 where it is beneficial also for throughput, idling can severely impact
 301 throughput. One important case is random workload. Because of this
 302 issue, BFQ tends to avoid idling as much as possible, when it is not
 303 beneficial also for throughput. As a consequence of this behavior, and
 304 of further issues described for the strict_guarantees tunable,
 305 short-term service guarantees may be occasionally violated. And, in
 306 some cases, these guarantees may be more important than guaranteeing
 307 maximum throughput. For example, in video playing/streaming, a very
 308 low drop rate may be more important than maximum throughput. In these
 309 cases, consider setting the strict_guarantees parameter.
 310
 311 strict_guarantees
 312 -----------------
 313
 314 If this parameter is set (default: unset), then BFQ
 315
 316 - always performs idling when the in-service queue becomes empty;
 317
 318 - forces the device to serve one I/O request at a time, by dispatching a
 319   new request only if there is no outstanding request.
 320
 321 In the presence of differentiated weights or I/O-request sizes, both
 322 the above conditions are needed to guarantee that every BFQ queue
 323 receives its allotted share of the bandwidth. The first condition is
 324 needed for the reasons explained in the description of the slice_idle
 325 tunable.  The second condition is needed because all modern storage
 326 devices reorder internally-queued requests, which may trivially break
 327 the service guarantees enforced by the I/O scheduler.
 328
 329 Setting strict_guarantees may evidently affect throughput.
 330
 331 back_seek_max
 332 -------------
 333
 334 This specifies, given in Kbytes, the maximum "distance" for backward seeking.
 335 The distance is the amount of space from the current head location to the
 336 sectors that are backward in terms of distance.
 337
 338 This parameter allows the scheduler to anticipate requests in the "backward"
 339 direction and consider them as being the "next" if they are within this
 340 distance from the current head location.
 341
 342 back_seek_penalty
 343 -----------------
 344
 345 This parameter is used to compute the cost of backward seeking. If the
 346 backward distance of request is just 1/back_seek_penalty from a "front"
 347 request, then the seeking cost of two requests is considered equivalent.
 348
 349 So scheduler will not bias toward one or the other request (otherwise scheduler
 350 will bias toward front request). Default value of back_seek_penalty is 2.
 351
 352 fifo_expire_async
 353 -----------------
 354
 355 This parameter is used to set the timeout of asynchronous requests. Default
 356 value of this is 248ms.
 357
 358 fifo_expire_sync
 359 ----------------
 360
 361 This parameter is used to set the timeout of synchronous requests. Default
 362 value of this is 124ms. In case to favor synchronous requests over asynchronous
 363 one, this value should be decreased relative to fifo_expire_async.
 364
 365 low_latency
 366 -----------
 367
 368 This parameter is used to enable/disable BFQ's low latency mode. By
 369 default, low latency mode is enabled. If enabled, interactive and soft
 370 real-time applications are privileged and experience a lower latency,
 371 as explained in more detail in the description of how BFQ works.
 372
 373 timeout_sync
 374 ------------
 375
 376 Maximum amount of device time that can be given to a task (queue) once
 377 it has been selected for service. On devices with costly seeks,
 378 increasing this time usually increases maximum throughput. On the
 379 opposite end, increasing this time coarsens the granularity of the
 380 short-term bandwidth and latency guarantees, especially if the
 381 following parameter is set to zero.
 382
 383 max_budget
 384 ----------
 385
 386 Maximum amount of service, measured in sectors, that can be provided
 387 to a BFQ queue once it is set in service (of course within the limits
 388 of the above timeout). According to what said in the description of
 389 the algorithm, larger values increase the throughput in proportion to
 390 the percentage of sequential I/O requests issued. The price of larger
 391 values is that they coarsen the granularity of short-term bandwidth
 392 and latency guarantees.
 393
 394 The default value is 0, which enables auto-tuning: BFQ sets max_budget
 395 to the maximum number of sectors that can be served during
 396 timeout_sync, according to the estimated peak rate.
 397
 398 weights
 399 -------
 400
 401 Read-only parameter, used to show the weights of the currently active
 402 BFQ queues.
 403
 404
 405 wr_ tunables
 406 ------------
 407
 408 BFQ exports a few parameters to control/tune the behavior of
 409 low-latency heuristics.
 410
 411 wr_coeff
 412
 413 Factor by which the weight of a weight-raised queue is multiplied. If
 414 the queue is deemed soft real-time, then the weight is further
 415 multiplied by an additional, constant factor.
 416
 417 wr_max_time
 418
 419 Maximum duration of a weight-raising period for an interactive task
 420 (ms). If set to zero (default value), then this value is computed
 421 automatically, as a function of the peak rate of the device. In any
 422 case, when the value of this parameter is read, it always reports the
 423 current duration, regardless of whether it has been set manually or
 424 computed automatically.
 425
 426 wr_max_softrt_rate
 427
 428 Maximum service rate below which a queue is deemed to be associated
 429 with a soft real-time application, and is then weight-raised
 430 accordingly (sectors/sec).
 431
 432 wr_min_idle_time
 433
 434 Minimum idle period after which interactive weight-raising may be
 435 reactivated for a queue (in ms).
 436
 437 wr_rt_max_time
 438
 439 Maximum weight-raising duration for soft real-time queues (in ms). The
 440 start time from which this duration is considered is automatically
 441 moved forward if the queue is detected to be still soft real-time
 442 before the current soft real-time weight-raising period finishes.
 443
 444 wr_min_inter_arr_async
 445
 446 Minimum period between I/O request arrivals after which weight-raising
 447 may be reactivated for an already busy async queue (in ms).
 448
 449
 450 4. Group scheduling with BFQ
 451 ============================
 452
 453 BFQ supports both cgroup-v1 and cgroup-v2 io controllers, namely blkio
 454 and io. In particular, BFQ supports weight-based proportional
 455 share.
 456
 457 4-1 Service guarantees provided
 458 -------------------------------
 459
 460 With BFQ, proportional share means true proportional share of the
 461 device bandwidth, according to group weights. For example, a group
 462 with weight 200 gets twice the bandwidth, and not just twice the time,
 463 of a group with weight 100.
 464
 465 BFQ supports hierarchies (group trees) of any depth. Bandwidth is
 466 distributed among groups and processes in the expected way: for each
 467 group, the children of the group share the whole bandwidth of the
 468 group in proportion to their weights. In particular, this implies
 469 that, for each leaf group, every process of the group receives the
 470 same share of the whole group bandwidth, unless the ioprio of the
 471 process is modified.
 472
 473 The resource-sharing guarantee for a group may partially or totally
 474 switch from bandwidth to time, if providing bandwidth guarantees to
 475 the group lowers the throughput too much. This switch occurs on a
 476 per-process basis: if a process of a leaf group causes throughput loss
 477 if served in such a way to receive its share of the bandwidth, then
 478 BFQ switches back to just time-based proportional share for that
 479 process.
 480
 481 4-2 Interface
 482 -------------
 483
 484 To get proportional sharing of bandwidth with BFQ for a given device,
 485 BFQ must of course be the active scheduler for that device.
 486
 487 Within each group directory, the names of the files associated with
 488 BFQ-specific cgroup parameters and stats begin with the "bfq."
 489 prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
 490 BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
 491 parameter to set the weight of a group with BFQ is blkio.bfq.weight
 492 or io.bfq.weight.
 493
 494 Parameters to set
 495 -----------------
 496
 497 For each group, there is only the following parameter to set.
 498
 499 weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
 500 group inside its parent. Available values: 1..10000 (default 100). The
 501 linear mapping between ioprio and weights, described at the beginning
 502 of the tunable section, is still valid, but all weights higher than
 503 IOPRIO_BE_NR*10 are mapped to ioprio 0.
 504
 505
 506 [1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
 507     Scheduler", Proceedings of the First Workshop on Mobile System
 508     Technologies (MST-2015), May 2015.
 509     http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
 510
 511 [2] P. Valente and M. Andreolini, "Improving Application
 512     Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
 513     the 5th Annual International Systems and Storage Conference
 514     (SYSTOR '12), June 2012.
 515     Slightly extended version:
 516     http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
 517                                                         results.pdf