Merge tag 'drm-next-2018-08-24' of git://anongit.freedesktop.org/drm/drm
[linux-2.6-block.git] / Documentation / block / bfq-iosched.txt
CommitLineData
aee69d78
PV
1BFQ (Budget Fair Queueing)
2==========================
3
4BFQ is a proportional-share I/O scheduler, with some extra
5low-latency capabilities. In addition to cgroups support (blkio or io
6controllers), BFQ's main features are:
7- BFQ guarantees a high system and application responsiveness, and a
8 low latency for time-sensitive applications, such as audio or video
9 players;
10- BFQ distributes bandwidth, and not just time, among processes or
11 groups (switching back to time distribution when needed to keep
12 throughput high).
13
43c1b3d6
PV
14In its default configuration, BFQ privileges latency over
15throughput. So, when needed for achieving a lower latency, BFQ builds
16schedules that may lead to a lower throughput. If your main or only
17goal, for a given device, is to achieve the maximum-possible
18throughput at all times, then do switch off all low-latency heuristics
233f0bf4
PV
19for that device, by setting low_latency to 0. See Section 3 for
20details on how to configure BFQ for the desired tradeoff between
21latency and throughput, or on how to maximize throughput.
43c1b3d6 22
a33801e8
LM
23BFQ has a non-null overhead, which limits the maximum IOPS that a CPU
24can process for a device scheduled with BFQ. To give an idea of the
25limits on slow or average CPUs, here are, first, the limits of BFQ for
26three different CPUs, on, respectively, an average laptop, an old
27desktop, and a cheap embedded system, in case full hierarchical
28support is enabled (i.e., CONFIG_BFQ_GROUP_IOSCHED is set), but
29CONFIG_DEBUG_BLK_CGROUP is not set (Section 4-2):
30- Intel i7-4850HQ: 400 KIOPS
31- AMD A8-3850: 250 KIOPS
32- ARM CortexTM-A53 Octa-core: 80 KIOPS
33
34If CONFIG_DEBUG_BLK_CGROUP is set (and of course full hierarchical
35support is enabled), then the sustainable throughput with BFQ
36decreases, because all blkio.bfq* statistics are created and updated
37(Section 4-2). For BFQ, this leads to the following maximum
38sustainable throughputs, on the same systems as above:
24bfd19b
PV
39- Intel i7-4850HQ: 310 KIOPS
40- AMD A8-3850: 200 KIOPS
41- ARM CortexTM-A53 Octa-core: 56 KIOPS
68017e5d
PV
42
43BFQ works for multi-queue devices too.
aee69d78
PV
44
45The table of contents follow. Impatients can just jump to Section 3.
46
47CONTENTS
48
491. When may BFQ be useful?
50 1-1 Personal systems
51 1-2 Server systems
522. How does BFQ work?
2670cd16 533. What are BFQ's tunables and how to properly configure BFQ?
aee69d78
PV
544. BFQ group scheduling
55 4-1 Service guarantees provided
56 4-2 Interface
57
581. When may BFQ be useful?
59==========================
60
61BFQ provides the following benefits on personal and server systems.
62
631-1 Personal systems
64--------------------
65
66Low latency for interactive applications
67
68Regardless of the actual background workload, BFQ guarantees that, for
69interactive tasks, the storage device is virtually as responsive as if
70it was idle. For example, even if one or more of the following
71background workloads are being executed:
72- one or more large files are being read, written or copied,
73- a tree of source files is being compiled,
74- one or more virtual machines are performing I/O,
75- a software update is in progress,
76- indexing daemons are scanning filesystems and updating their
77 databases,
78starting an application or loading a file from within an application
79takes about the same time as if the storage device was idle. As a
80comparison, with CFQ, NOOP or DEADLINE, and in the same conditions,
81applications experience high latencies, or even become unresponsive
82until the background workload terminates (also on SSDs).
83
84Low latency for soft real-time applications
85
86Also soft real-time applications, such as audio and video
87players/streamers, enjoy a low latency and a low drop rate, regardless
88of the background I/O workload. As a consequence, these applications
89do not suffer from almost any glitch due to the background workload.
90
91Higher speed for code-development tasks
92
93If some additional workload happens to be executed in parallel, then
94BFQ executes the I/O-related components of typical code-development
95tasks (compilation, checkout, merge, ...) much more quickly than CFQ,
96NOOP or DEADLINE.
97
98High throughput
99
100On hard disks, BFQ achieves up to 30% higher throughput than CFQ, and
101up to 150% higher throughput than DEADLINE and NOOP, with all the
102sequential workloads considered in our tests. With random workloads,
103and with all the workloads on flash-based devices, BFQ achieves,
104instead, about the same throughput as the other schedulers.
105
106Strong fairness, bandwidth and delay guarantees
107
108BFQ distributes the device throughput, and not just the device time,
109among I/O-bound applications in proportion their weights, with any
110workload and regardless of the device parameters. From these bandwidth
111guarantees, it is possible to compute tight per-I/O-request delay
112guarantees by a simple formula. If not configured for strict service
113guarantees, BFQ switches to time-based resource sharing (only) for
114applications that would otherwise cause a throughput loss.
115
1161-2 Server systems
117------------------
118
119Most benefits for server systems follow from the same service
120properties as above. In particular, regardless of whether additional,
121possibly heavy workloads are being served, BFQ guarantees:
122
123. audio and video-streaming with zero or very low jitter and drop
124 rate;
125
126. fast retrieval of WEB pages and embedded objects;
127
128. real-time recording of data in live-dumping applications (e.g.,
129 packet logging);
130
131. responsiveness in local and remote access to a server.
132
133
1342. How does BFQ work?
135=====================
136
137BFQ is a proportional-share I/O scheduler, whose general structure,
138plus a lot of code, are borrowed from CFQ.
139
140- Each process doing I/O on a device is associated with a weight and a
141 (bfq_)queue.
142
143- BFQ grants exclusive access to the device, for a while, to one queue
144 (process) at a time, and implements this service model by
145 associating every queue with a budget, measured in number of
146 sectors.
147
148 - After a queue is granted access to the device, the budget of the
149 queue is decremented, on each request dispatch, by the size of the
150 request.
151
152 - The in-service queue is expired, i.e., its service is suspended,
153 only if one of the following events occurs: 1) the queue finishes
154 its budget, 2) the queue empties, 3) a "budget timeout" fires.
155
156 - The budget timeout prevents processes doing random I/O from
157 holding the device for too long and dramatically reducing
158 throughput.
159
160 - Actually, as in CFQ, a queue associated with a process issuing
161 sync requests may not be expired immediately when it empties. In
162 contrast, BFQ may idle the device for a short time interval,
163 giving the process the chance to go on being served if it issues
164 a new request in time. Device idling typically boosts the
2670cd16
PV
165 throughput on rotational devices and on non-queueing flash-based
166 devices, if processes do synchronous and sequential I/O. In
167 addition, under BFQ, device idling is also instrumental in
168 guaranteeing the desired throughput fraction to processes
169 issuing sync requests (see the description of the slice_idle
170 tunable in this document, or [1, 2], for more details).
aee69d78
PV
171
172 - With respect to idling for service guarantees, if several
173 processes are competing for the device at the same time, but
233f0bf4
PV
174 all processes and groups have the same weight, then BFQ
175 guarantees the expected throughput distribution without ever
176 idling the device. Throughput is thus as high as possible in
177 this common scenario.
aee69d78 178
2670cd16
PV
179 - On flash-based storage with internal queueing of commands
180 (typically NCQ), device idling happens to be always detrimental
181 for throughput. So, with these devices, BFQ performs idling
182 only when strictly needed for service guarantees, i.e., for
183 guaranteeing low latency or fairness. In these cases, overall
184 throughput may be sub-optimal. No solution currently exists to
185 provide both strong service guarantees and optimal throughput
186 on devices with internal queueing.
187
aee69d78
PV
188 - If low-latency mode is enabled (default configuration), BFQ
189 executes some special heuristics to detect interactive and soft
190 real-time applications (e.g., video or audio players/streamers),
191 and to reduce their latency. The most important action taken to
192 achieve this goal is to give to the queues associated with these
193 applications more than their fair share of the device
194 throughput. For brevity, we call just "weight-raising" the whole
195 sets of actions taken by BFQ to privilege these queues. In
196 particular, BFQ provides a milder form of weight-raising for
197 interactive applications, and a stronger form for soft real-time
198 applications.
199
200 - BFQ automatically deactivates idling for queues born in a burst of
201 queue creations. In fact, these queues are usually associated with
202 the processes of applications and services that benefit mostly
203 from a high throughput. Examples are systemd during boot, or git
204 grep.
205
206 - As CFQ, BFQ merges queues performing interleaved I/O, i.e.,
207 performing random I/O that becomes mostly sequential if
208 merged. Differently from CFQ, BFQ achieves this goal with a more
209 reactive mechanism, called Early Queue Merge (EQM). EQM is so
210 responsive in detecting interleaved I/O (cooperating processes),
211 that it enables BFQ to achieve a high throughput, by queue
212 merging, even for queues for which CFQ needs a different
213 mechanism, preemption, to get a high throughput. As such EQM is a
214 unified mechanism to achieve a high throughput with interleaved
215 I/O.
216
217 - Queues are scheduled according to a variant of WF2Q+, named
218 B-WF2Q+, and implemented using an augmented rb-tree to preserve an
219 O(log N) overall complexity. See [2] for more details. B-WF2Q+ is
233f0bf4 220 also ready for hierarchical scheduling, details in Section 4.
aee69d78
PV
221
222 - B-WF2Q+ guarantees a tight deviation with respect to an ideal,
223 perfectly fair, and smooth service. In particular, B-WF2Q+
224 guarantees that each queue receives a fraction of the device
225 throughput proportional to its weight, even if the throughput
226 fluctuates, and regardless of: the device parameters, the current
227 workload and the budgets assigned to the queue.
228
229 - The last, budget-independence, property (although probably
230 counterintuitive in the first place) is definitely beneficial, for
231 the following reasons:
232
233 - First, with any proportional-share scheduler, the maximum
234 deviation with respect to an ideal service is proportional to
235 the maximum budget (slice) assigned to queues. As a consequence,
236 BFQ can keep this deviation tight not only because of the
237 accurate service of B-WF2Q+, but also because BFQ *does not*
238 need to assign a larger budget to a queue to let the queue
239 receive a higher fraction of the device throughput.
240
241 - Second, BFQ is free to choose, for every process (queue), the
242 budget that best fits the needs of the process, or best
243 leverages the I/O pattern of the process. In particular, BFQ
244 updates queue budgets with a simple feedback-loop algorithm that
245 allows a high throughput to be achieved, while still providing
246 tight latency guarantees to time-sensitive applications. When
247 the in-service queue expires, this algorithm computes the next
248 budget of the queue so as to:
249
250 - Let large budgets be eventually assigned to the queues
251 associated with I/O-bound applications performing sequential
252 I/O: in fact, the longer these applications are served once
253 got access to the device, the higher the throughput is.
254
255 - Let small budgets be eventually assigned to the queues
256 associated with time-sensitive applications (which typically
257 perform sporadic and short I/O), because, the smaller the
258 budget assigned to a queue waiting for service is, the sooner
259 B-WF2Q+ will serve that queue (Subsec 3.3 in [2]).
260
261- If several processes are competing for the device at the same time,
262 but all processes and groups have the same weight, then BFQ
263 guarantees the expected throughput distribution without ever idling
264 the device. It uses preemption instead. Throughput is then much
265 higher in this common scenario.
266
267- ioprio classes are served in strict priority order, i.e.,
268 lower-priority queues are not served as long as there are
269 higher-priority queues. Among queues in the same class, the
270 bandwidth is distributed in proportion to the weight of each
271 queue. A very thin extra bandwidth is however guaranteed to
272 the Idle class, to prevent it from starving.
273
274
2670cd16
PV
2753. What are BFQ's tunables and how to properly configure BFQ?
276=============================================================
aee69d78 277
2670cd16
PV
278Most BFQ tunables affect service guarantees (basically latency and
279fairness) and throughput. For full details on how to choose the
280desired tradeoff between service guarantees and throughput, see the
281parameters slice_idle, strict_guarantees and low_latency. For details
282on how to maximise throughput, see slice_idle, timeout_sync and
283max_budget. The other performance-related parameters have been
284inherited from, and have been preserved mostly for compatibility with
285CFQ. So far, no performance improvement has been reported after
286changing the latter parameters in BFQ.
287
288In particular, the tunables back_seek-max, back_seek_penalty,
289fifo_expire_async and fifo_expire_sync below are the same as in
290CFQ. Their description is just copied from that for CFQ. Some
291considerations in the description of slice_idle are copied from CFQ
292too.
aee69d78
PV
293
294per-process ioprio and weight
295-----------------------------
296
e21b7a0b
AA
297Unless the cgroups interface is used (see "4. BFQ group scheduling"),
298weights can be assigned to processes only indirectly, through I/O
299priorities, and according to the relation:
300weight = (IOPRIO_BE_NR - ioprio) * 10.
301
302Beware that, if low-latency is set, then BFQ automatically raises the
303weight of the queues associated with interactive and soft real-time
304applications. Unset this tunable if you need/want to control weights.
aee69d78
PV
305
306slice_idle
307----------
308
309This parameter specifies how long BFQ should idle for next I/O
310request, when certain sync BFQ queues become empty. By default
311slice_idle is a non-zero value. Idling has a double purpose: boosting
312throughput and making sure that the desired throughput distribution is
313respected (see the description of how BFQ works, and, if needed, the
314papers referred there).
315
316As for throughput, idling can be very helpful on highly seeky media
317like single spindle SATA/SAS disks where we can cut down on overall
318number of seeks and see improved throughput.
319
320Setting slice_idle to 0 will remove all the idling on queues and one
321should see an overall improved throughput on faster storage devices
2670cd16
PV
322like multiple SATA/SAS disks in hardware RAID configuration, as well
323as flash-based storage with internal command queueing (and
324parallelism).
aee69d78
PV
325
326So depending on storage and workload, it might be useful to set
327slice_idle=0. In general for SATA/SAS disks and software RAID of
328SATA/SAS disks keeping slice_idle enabled should be useful. For any
329configurations where there are multiple spindles behind single LUN
2670cd16
PV
330(Host based hardware RAID controller or for storage arrays), or with
331flash-based fast storage, setting slice_idle=0 might end up in better
332throughput and acceptable latencies.
aee69d78
PV
333
334Idling is however necessary to have service guarantees enforced in
335case of differentiated weights or differentiated I/O-request lengths.
336To see why, suppose that a given BFQ queue A must get several I/O
337requests served for each request served for another queue B. Idling
338ensures that, if A makes a new I/O request slightly after becoming
339empty, then no request of B is dispatched in the middle, and thus A
340does not lose the possibility to get more than one request dispatched
341before the next request of B is dispatched. Note that idling
342guarantees the desired differentiated treatment of queues only in
343terms of I/O-request dispatches. To guarantee that the actual service
344order then corresponds to the dispatch order, the strict_guarantees
345tunable must be set too.
346
347There is an important flipside for idling: apart from the above cases
348where it is beneficial also for throughput, idling can severely impact
349throughput. One important case is random workload. Because of this
350issue, BFQ tends to avoid idling as much as possible, when it is not
2670cd16
PV
351beneficial also for throughput (as detailed in Section 2). As a
352consequence of this behavior, and of further issues described for the
353strict_guarantees tunable, short-term service guarantees may be
354occasionally violated. And, in some cases, these guarantees may be
355more important than guaranteeing maximum throughput. For example, in
356video playing/streaming, a very low drop rate may be more important
357than maximum throughput. In these cases, consider setting the
358strict_guarantees parameter.
aee69d78
PV
359
360strict_guarantees
361-----------------
362
363If this parameter is set (default: unset), then BFQ
364
365- always performs idling when the in-service queue becomes empty;
366
367- forces the device to serve one I/O request at a time, by dispatching a
368 new request only if there is no outstanding request.
369
370In the presence of differentiated weights or I/O-request sizes, both
371the above conditions are needed to guarantee that every BFQ queue
372receives its allotted share of the bandwidth. The first condition is
373needed for the reasons explained in the description of the slice_idle
374tunable. The second condition is needed because all modern storage
375devices reorder internally-queued requests, which may trivially break
376the service guarantees enforced by the I/O scheduler.
377
378Setting strict_guarantees may evidently affect throughput.
379
380back_seek_max
381-------------
382
383This specifies, given in Kbytes, the maximum "distance" for backward seeking.
384The distance is the amount of space from the current head location to the
385sectors that are backward in terms of distance.
386
387This parameter allows the scheduler to anticipate requests in the "backward"
388direction and consider them as being the "next" if they are within this
389distance from the current head location.
390
391back_seek_penalty
392-----------------
393
394This parameter is used to compute the cost of backward seeking. If the
395backward distance of request is just 1/back_seek_penalty from a "front"
396request, then the seeking cost of two requests is considered equivalent.
397
398So scheduler will not bias toward one or the other request (otherwise scheduler
399will bias toward front request). Default value of back_seek_penalty is 2.
400
401fifo_expire_async
402-----------------
403
404This parameter is used to set the timeout of asynchronous requests. Default
405value of this is 248ms.
406
407fifo_expire_sync
408----------------
409
410This parameter is used to set the timeout of synchronous requests. Default
411value of this is 124ms. In case to favor synchronous requests over asynchronous
412one, this value should be decreased relative to fifo_expire_async.
413
414low_latency
415-----------
416
417This parameter is used to enable/disable BFQ's low latency mode. By
418default, low latency mode is enabled. If enabled, interactive and soft
419real-time applications are privileged and experience a lower latency,
420as explained in more detail in the description of how BFQ works.
421
43c1b3d6 422DISABLE this mode if you need full control on bandwidth
44e44a1b
PV
423distribution. In fact, if it is enabled, then BFQ automatically
424increases the bandwidth share of privileged applications, as the main
425means to guarantee a lower latency to them.
426
43c1b3d6
PV
427In addition, as already highlighted at the beginning of this document,
428DISABLE this mode if your only goal is to achieve a high throughput.
429In fact, privileging the I/O of some application over the rest may
430entail a lower throughput. To achieve the highest-possible throughput
431on a non-rotational device, setting slice_idle to 0 may be needed too
432(at the cost of giving up any strong guarantee on fairness and low
433latency).
434
aee69d78
PV
435timeout_sync
436------------
437
438Maximum amount of device time that can be given to a task (queue) once
439it has been selected for service. On devices with costly seeks,
440increasing this time usually increases maximum throughput. On the
441opposite end, increasing this time coarsens the granularity of the
442short-term bandwidth and latency guarantees, especially if the
443following parameter is set to zero.
444
445max_budget
446----------
447
448Maximum amount of service, measured in sectors, that can be provided
449to a BFQ queue once it is set in service (of course within the limits
450of the above timeout). According to what said in the description of
451the algorithm, larger values increase the throughput in proportion to
452the percentage of sequential I/O requests issued. The price of larger
453values is that they coarsen the granularity of short-term bandwidth
454and latency guarantees.
455
456The default value is 0, which enables auto-tuning: BFQ sets max_budget
457to the maximum number of sectors that can be served during
458timeout_sync, according to the estimated peak rate.
459
2670cd16
PV
460For specific devices, some users have occasionally reported to have
461reached a higher throughput by setting max_budget explicitly, i.e., by
462setting max_budget to a higher value than 0. In particular, they have
463set max_budget to higher values than those to which BFQ would have set
464it with auto-tuning. An alternative way to achieve this goal is to
465just increase the value of timeout_sync, leaving max_budget equal to 0.
466
aee69d78
PV
467weights
468-------
469
470Read-only parameter, used to show the weights of the currently active
471BFQ queues.
472
473
aee69d78
PV
4744. Group scheduling with BFQ
475============================
476
e21b7a0b
AA
477BFQ supports both cgroups-v1 and cgroups-v2 io controllers, namely
478blkio and io. In particular, BFQ supports weight-based proportional
479share. To activate cgroups support, set BFQ_GROUP_IOSCHED.
aee69d78
PV
480
4814-1 Service guarantees provided
482-------------------------------
483
484With BFQ, proportional share means true proportional share of the
485device bandwidth, according to group weights. For example, a group
486with weight 200 gets twice the bandwidth, and not just twice the time,
487of a group with weight 100.
488
489BFQ supports hierarchies (group trees) of any depth. Bandwidth is
490distributed among groups and processes in the expected way: for each
491group, the children of the group share the whole bandwidth of the
492group in proportion to their weights. In particular, this implies
493that, for each leaf group, every process of the group receives the
494same share of the whole group bandwidth, unless the ioprio of the
495process is modified.
496
497The resource-sharing guarantee for a group may partially or totally
498switch from bandwidth to time, if providing bandwidth guarantees to
499the group lowers the throughput too much. This switch occurs on a
500per-process basis: if a process of a leaf group causes throughput loss
501if served in such a way to receive its share of the bandwidth, then
502BFQ switches back to just time-based proportional share for that
503process.
504
5054-2 Interface
506-------------
507
508To get proportional sharing of bandwidth with BFQ for a given device,
509BFQ must of course be the active scheduler for that device.
510
511Within each group directory, the names of the files associated with
512BFQ-specific cgroup parameters and stats begin with the "bfq."
513prefix. So, with cgroups-v1 or cgroups-v2, the full prefix for
514BFQ-specific files is "blkio.bfq." or "io.bfq." For example, the group
515parameter to set the weight of a group with BFQ is blkio.bfq.weight
516or io.bfq.weight.
517
a33801e8
LM
518As for cgroups-v1 (blkio controller), the exact set of stat files
519created, and kept up-to-date by bfq, depends on whether
520CONFIG_DEBUG_BLK_CGROUP is set. If it is set, then bfq creates all
521the stat files documented in
522Documentation/cgroup-v1/blkio-controller.txt. If, instead,
523CONFIG_DEBUG_BLK_CGROUP is not set, then bfq creates only the files
524blkio.bfq.io_service_bytes
525blkio.bfq.io_service_bytes_recursive
526blkio.bfq.io_serviced
527blkio.bfq.io_serviced_recursive
528
529The value of CONFIG_DEBUG_BLK_CGROUP greatly influences the maximum
530throughput sustainable with bfq, because updating the blkio.bfq.*
531stats is rather costly, especially for some of the stats enabled by
532CONFIG_DEBUG_BLK_CGROUP.
533
aee69d78
PV
534Parameters to set
535-----------------
536
537For each group, there is only the following parameter to set.
538
539weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
540group inside its parent. Available values: 1..10000 (default 100). The
541linear mapping between ioprio and weights, described at the beginning
542of the tunable section, is still valid, but all weights higher than
543IOPRIO_BE_NR*10 are mapped to ioprio 0.
544
44e44a1b
PV
545Recall that, if low-latency is set, then BFQ automatically raises the
546weight of the queues associated with interactive and soft real-time
547applications. Unset this tunable if you need/want to control weights.
548
aee69d78
PV
549
550[1] P. Valente, A. Avanzini, "Evolution of the BFQ Storage I/O
551 Scheduler", Proceedings of the First Workshop on Mobile System
552 Technologies (MST-2015), May 2015.
553 http://algogroup.unimore.it/people/paolo/disk_sched/mst-2015.pdf
554
555[2] P. Valente and M. Andreolini, "Improving Application
556 Responsiveness with the BFQ Disk I/O Scheduler", Proceedings of
557 the 5th Annual International Systems and Storage Conference
558 (SYSTOR '12), June 2012.
559 Slightly extended version:
560 http://algogroup.unimore.it/people/paolo/disk_sched/bfq-v1-suite-
561 results.pdf