Anticipatory IO scheduler
-------------------------
Nick Piggin <piggin@cyberone.com.au>    13 Sep 2003

Attention! Database servers, especially those using "TCQ" disks, should
investigate performance with the 'deadline' IO scheduler. Any system with
high disk performance requirements should do so, in fact.

If you see unusual performance characteristics of your disk systems, or you
see big performance regressions versus the deadline scheduler, please email
me. Database users, don't bother unless you're willing to test a lot of
patches from me ;) it's a known issue.

Also, users with hardware RAID controllers, doing striping, may find
highly variable performance results when using the as-iosched. The
as-iosched anticipatory implementation is based on the notion that a disk
device has only one physical seeking head. A striped RAID controller
actually has a head for each physical device in the logical RAID device.

However, setting antic_expire to zero (see tunable parameters below)
produces behavior very similar to the deadline IO scheduler.

Selecting IO schedulers
-----------------------
Refer to Documentation/block/switching-sched.txt for information on
selecting an io scheduler on a per-device basis.
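
For example, a device's active scheduler is exposed through the
'scheduler' file in its queue directory, and switching is done by writing
a scheduler name to that file. The short userspace sketch below is only
an illustration and assumes a device named "sda"; see switching-sched.txt
for the authoritative procedure.

  /* Minimal sketch: select the anticipatory scheduler for one device via
   * sysfs.  The device name "sda" is only an example. */
  #include <stdio.h>

  int main(void)
  {
      const char *path = "/sys/block/sda/queue/scheduler";
      char current[256];
      FILE *f;

      f = fopen(path, "r");          /* show the currently active scheduler */
      if (f && fgets(current, sizeof(current), f))
          printf("before: %s", current);
      if (f)
          fclose(f);

      f = fopen(path, "w");          /* write a scheduler name to switch */
      if (f) {
          fprintf(f, "anticipatory\n");
          fclose(f);
      }
      return 0;
  }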

Anticipatory IO scheduler Policies
----------------------------------
The as-iosched implementation applies several layers of policies
to determine when an IO request is dispatched to the disk controller.
Here are the policies, outlined in order of application.

1. One-way elevator algorithm.

The elevator algorithm is similar to that used in the deadline scheduler,
with the addition that it allows limited backward movement of the elevator
(i.e. seeks backwards). A seek backwards can occur when choosing between
two IO requests where one is behind the elevator's current position, and
the other is in front of the elevator's position. If the seek distance to
the request behind the elevator is less than half the seek distance to
the request in front of the elevator, then the request behind can be chosen.
Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors.
This favors forward movement of the elevator, while allowing opportunistic
"short" backward seeks.

2. FIFO expiration times for reads and for writes.

This is again very similar to the deadline IO scheduler. The expiration
times for requests on these lists are tunable using the parameters read_expire
and write_expire discussed below. When a read or a write expires in this way,
the IO scheduler will interrupt its current elevator sweep or read anticipation
to service the expired request.
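
A sketch of the expiration check (the structure and helper names here are
illustrative, not the scheduler's own, and times are plain integers):

  struct as_rq_stub {
      unsigned long fifo_time;   /* time at which the request was queued */
  };

  /* Only the oldest request on a FIFO needs to be examined; it has expired
   * once its queueing time plus the expire interval (read_expire or
   * write_expire) has passed. */
  static int fifo_expired(const struct as_rq_stub *oldest,
                          unsigned long now, unsigned long expire)
  {
      return now >= oldest->fifo_time + expire;
  }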

3. Read and write request batching

A batch is a collection of read requests or a collection of write
requests. The as scheduler alternates dispatching read and write batches
to the driver. In the case of a read batch, the scheduler submits read
requests to the driver as long as there are read requests to submit, and
the read batch time limit has not been exceeded (read_batch_expire).
The read batch time limit begins counting down only when there are
competing write requests pending.

In the case of a write batch, the scheduler submits write requests to
the driver as long as there are write requests available, and the
write batch time limit has not been exceeded (write_batch_expire).
However, the length of write batches will be gradually shortened
when read batches frequently exceed their time limit.

When changing between batch types, the scheduler waits for all requests
from the previous batch to complete before scheduling requests for the
next batch.

The read and write fifo expiration times described in policy 2 above
are checked only when scheduling IO of a batch of the corresponding
(read/write) type. So for example, the read FIFO timeout values are
tested only during read batches. Likewise, the write FIFO timeout
values are tested only during write batches. For this reason,
it is generally not recommended for the read batch time
to be longer than the write expiration time, nor for the write batch
time to exceed the read expiration time (see tunable parameters below).

When the IO scheduler changes from a read to a write batch,
it begins the elevator from the request that is on the head of the
write expiration FIFO. Likewise, when changing from a write batch to
a read batch, the scheduler begins the elevator from the first entry
on the read expiration FIFO.
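
The batching policy can be modelled roughly as below. This is an
illustrative sketch only; the structure fields, enum values and helper
names are inventions, not the scheduler's internal API.

  enum batch_dir { BATCH_READ, BATCH_WRITE };

  struct batch_state {
      enum batch_dir dir;           /* direction of the current batch       */
      unsigned long  batch_expires; /* time at which the current batch ends */
      int            in_flight;     /* dispatched but not yet completed     */
      int            reads_pending; /* queued requests in each direction    */
      int            writes_pending;
  };

  /* Keep dispatching in the current direction while work remains and the
   * batch time limit (read_batch_expire / write_batch_expire) has not
   * been reached. */
  static int batch_may_continue(const struct batch_state *s, unsigned long now)
  {
      int pending = (s->dir == BATCH_READ) ? s->reads_pending
                                           : s->writes_pending;
      return pending > 0 && now < s->batch_expires;
  }

  /* A batch in the other direction is started only after every request
   * from the previous batch has completed. */
  static int may_switch_batch(const struct batch_state *s)
  {
      return s->in_flight == 0;
  }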

4. Read anticipation.

Read anticipation occurs only when scheduling a read batch.
This implementation of read anticipation allows only one read request
to be dispatched to the disk controller at a time. In
contrast, many write requests may be dispatched to the disk controller
at a time during a write batch. It is this characteristic that can make
the anticipatory scheduler perform anomalously with controllers supporting
TCQ, or with hardware striped RAID devices. Setting the antic_expire
queue parameter (see below) to zero disables this behavior, and the
anticipatory scheduler behaves essentially like the deadline scheduler.

When read anticipation is enabled (antic_expire is not zero), reads
are dispatched to the disk controller one at a time.
At the end of each read request, the IO scheduler examines its next
candidate read request from its sorted read list. If that next request
is from the same process as the request that just completed,
or if the next request in the queue is "very close" to the
just completed request, it is dispatched immediately. Otherwise,
statistics (average think time, average seek distance) on the process
that submitted the just completed request are examined. If it seems
likely that that process will submit another request soon, and that
request is likely to be near the just completed request, then the IO
scheduler will stop dispatching more read requests for up to (antic_expire)
milliseconds, hoping that process will submit a new request near the one
that just completed. If such a request is made, then it is dispatched
immediately. If the antic_expire wait time expires, then the IO scheduler
will dispatch the next read request from the sorted read queue.
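
The decision made at the completion of each read can be sketched as
follows. The names, thresholds and statistics fields are illustrative
simplifications rather than the scheduler's real data structures.

  enum antic_action {
      DISPATCH_NOW,       /* send the next queued read immediately        */
      WAIT_FOR_REQUEST,   /* hold off for up to antic_expire milliseconds */
  };

  struct proc_stats {
      unsigned long mean_thinktime; /* mean gap between the process's reads */
      unsigned long mean_seekdist;  /* mean seek distance of its reads      */
  };

  static enum antic_action on_read_completion(int next_from_same_process,
                                              unsigned long next_seek_dist,
                                              unsigned long close_threshold,
                                              const struct proc_stats *last,
                                              unsigned long antic_expire,
                                              unsigned long near_threshold)
  {
      /* Requests from the same process, or "very close" to the one that
       * just completed, are dispatched immediately. */
      if (next_from_same_process || next_seek_dist <= close_threshold)
          return DISPATCH_NOW;

      /* Otherwise wait only if the submitting process is expected to issue
       * another, nearby read before the anticipation timeout runs out. */
      if (last->mean_thinktime < antic_expire &&
          last->mean_seekdist < near_threshold)
          return WAIT_FOR_REQUEST;

      return DISPATCH_NOW;
  }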

To decide whether an anticipatory wait is worthwhile, the scheduler
maintains statistics for each process that can be used to compute
a mean "think time" (the time between read requests), and a mean seek
distance for that process. Note that these statistics are associated
with each process, not with a specific IO device. So for example, if a
process is doing IO on several file systems on separate devices, the
statistics will be a combination of IO behavior from all those devices.
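
As a rough illustration only (the kernel's own bookkeeping is more
elaborate), the per-process statistics can be thought of as a pair of
decaying averages updated on each completed read:

  struct as_proc_stats {
      unsigned long mean_thinktime; /* mean time between reads (ms) */
      unsigned long mean_seekdist;  /* mean seek distance (sectors) */
  };

  /* Fold a new sample into a running mean, weighting history 7:1. */
  static unsigned long ewma(unsigned long mean, unsigned long sample)
  {
      return (mean * 7 + sample) / 8;
  }

  /* Per process, not per device: IO issued to several devices is folded
   * into the same pair of averages. */
  static void update_proc_stats(struct as_proc_stats *s,
                                unsigned long thinktime,
                                unsigned long seekdist)
  {
      s->mean_thinktime = ewma(s->mean_thinktime, thinktime);
      s->mean_seekdist  = ewma(s->mean_seekdist, seekdist);
  }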


Tuning the anticipatory IO scheduler
------------------------------------
When using 'as', the anticipatory IO scheduler, there are 5 parameters under
/sys/block/*/queue/iosched/. All are specified in milliseconds.

The parameters are:
* read_expire
Controls how long until a read request becomes "expired". It also controls the
interval between which expired requests are served, so if it is set to 50, a
request might take anywhere up to 100ms to be serviced _if_ it is the next on
the expired list. Obviously request expiration strategies won't make the disk
go faster. The result basically equates to the timeslice a single reader
gets in the presence of other IO. 100 / ((seek time / read_expire) + 1) is
very roughly the % streaming read efficiency your disk should get with
multiple readers.
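
As a rough worked example of that estimate (the seek time and read_expire
values below are only illustrative, not defaults):

  #include <stdio.h>

  int main(void)
  {
      double seek_ms = 8.0;          /* assumed average seek time   */
      double read_expire_ms = 125.0; /* example read_expire setting */

      /* 100 / ((seek / read_expire) + 1): roughly the percentage of
       * streaming throughput each of several competing readers sees. */
      double pct = 100.0 / (seek_ms / read_expire_ms + 1.0);
      printf("approx. streaming read efficiency: %.0f%%\n", pct); /* ~94% */
      return 0;
  }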

* read_batch_expire
Controls how much time a batch of reads is given before pending writes are
served. A higher value is more efficient. This might be set below read_expire
if writes are to be given higher priority than reads, but reads are to be
as efficient as possible when there are no writes. Generally though, it
should be some multiple of read_expire.

* write_expire, and
* write_batch_expire are equivalent to the above, for writes.

* antic_expire
Controls the maximum amount of time we can anticipate a good read (one
with a short seek distance from the most recently completed request) before
giving up. Many other factors may cause anticipation to be stopped early,
or some processes will not be "anticipated" at all. It should be a bit higher
for devices with large seek times, though the correspondence is not linear -
most processes have a think time of only a few ms.

In addition to the tunables above there is a read-only file named est_time
which, when read, will show:

- The probability of a task exiting without a cooperating task
  submitting an anticipated IO.

- The current mean think time.

- The seek distance used to determine if an incoming IO is better.
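
A minimal userspace sketch of adjusting one tunable and reading est_time
is shown below; the device name "sda" and the value written are only
examples.

  #include <stdio.h>

  int main(void)
  {
      const char *antic = "/sys/block/sda/queue/iosched/antic_expire";
      const char *est   = "/sys/block/sda/queue/iosched/est_time";
      char line[256];
      FILE *f;

      f = fopen(antic, "w");         /* tunables are written in milliseconds */
      if (f) {
          fprintf(f, "10\n");
          fclose(f);
      }

      f = fopen(est, "r");           /* est_time is read-only */
      if (f) {
          while (fgets(line, sizeof(line), f))
              fputs(line, stdout);
          fclose(f);
      }
      return 0;
  }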