Commit | Line | Data |
---|---|---|
1da177e4 LT |
1 | Anticipatory IO scheduler |
2 | ------------------------- | |
3 | Nick Piggin <piggin@cyberone.com.au> 13 Sep 2003 | |
4 | ||
5 | Attention! Database servers, especially those using "TCQ" disks should | |
6 | investigate performance with the 'deadline' IO scheduler. Any system with high | |
7 | disk performance requirements should do so, in fact. | |
8 | ||
9 | If you see unusual performance characteristics of your disk systems, or you | |
10 | see big performance regressions versus the deadline scheduler, please email | |
11 | me. Database users don't bother unless you're willing to test a lot of patches | |
12 | from me ;) its a known issue. | |
13 | ||
14 | Also, users with hardware RAID controllers, doing striping, may find | |
15 | highly variable performance results with using the as-iosched. The | |
16 | as-iosched anticipatory implementation is based on the notion that a disk | |
17 | device has only one physical seeking head. A striped RAID controller | |
18 | actually has a head for each physical device in the logical RAID device. | |
19 | ||
20 | However, setting the antic_expire (see tunable parameters below) produces | |
21 | very similar behavior to the deadline IO scheduler. | |
22 | ||
23 | ||
24 | Selecting IO schedulers | |
25 | ----------------------- | |
26 | To choose IO schedulers at boot time, use the argument 'elevator=deadline'. | |
49033c81 F |
27 | 'noop', 'as' and 'cfq' (the default) are also available. IO schedulers are |
28 | assigned globally at boot time only presently. It's also possible to change | |
29 | the IO scheduler for a determined device on the fly, as described in | |
30 | Documentation/block/switching-sched.txt. | |
1da177e4 LT |
31 | |
32 | ||
33 | Anticipatory IO scheduler Policies | |
34 | ---------------------------------- | |
35 | The as-iosched implementation implements several layers of policies | |
36 | to determine when an IO request is dispatched to the disk controller. | |
37 | Here are the policies outlined, in order of application. | |
38 | ||
39 | 1. one-way Elevator algorithm. | |
40 | ||
41 | The elevator algorithm is similar to that used in deadline scheduler, with | |
42 | the addition that it allows limited backward movement of the elevator | |
43 | (i.e. seeks backwards). A seek backwards can occur when choosing between | |
44 | two IO requests where one is behind the elevator's current position, and | |
45 | the other is in front of the elevator's position. If the seek distance to | |
46 | the request in back of the elevator is less than half the seek distance to | |
47 | the request in front of the elevator, then the request in back can be chosen. | |
48 | Backward seeks are also limited to a maximum of MAXBACK (1024*1024) sectors. | |
49 | This favors forward movement of the elevator, while allowing opportunistic | |
50 | "short" backward seeks. | |
51 | ||
52 | 2. FIFO expiration times for reads and for writes. | |
53 | ||
54 | This is again very similar to the deadline IO scheduler. The expiration | |
55 | times for requests on these lists is tunable using the parameters read_expire | |
56 | and write_expire discussed below. When a read or a write expires in this way, | |
57 | the IO scheduler will interrupt its current elevator sweep or read anticipation | |
58 | to service the expired request. | |
59 | ||
60 | 3. Read and write request batching | |
61 | ||
62 | A batch is a collection of read requests or a collection of write | |
63 | requests. The as scheduler alternates dispatching read and write batches | |
64 | to the driver. In the case a read batch, the scheduler submits read | |
65 | requests to the driver as long as there are read requests to submit, and | |
66 | the read batch time limit has not been exceeded (read_batch_expire). | |
67 | The read batch time limit begins counting down only when there are | |
68 | competing write requests pending. | |
69 | ||
70 | In the case of a write batch, the scheduler submits write requests to | |
71 | the driver as long as there are write requests available, and the | |
72 | write batch time limit has not been exceeded (write_batch_expire). | |
73 | However, the length of write batches will be gradually shortened | |
74 | when read batches frequently exceed their time limit. | |
75 | ||
76 | When changing between batch types, the scheduler waits for all requests | |
77 | from the previous batch to complete before scheduling requests for the | |
78 | next batch. | |
79 | ||
80 | The read and write fifo expiration times described in policy 2 above | |
81 | are checked only when in scheduling IO of a batch for the corresponding | |
82 | (read/write) type. So for example, the read FIFO timeout values are | |
83 | tested only during read batches. Likewise, the write FIFO timeout | |
84 | values are tested only during write batches. For this reason, | |
85 | it is generally not recommended for the read batch time | |
86 | to be longer than the write expiration time, nor for the write batch | |
87 | time to exceed the read expiration time (see tunable parameters below). | |
88 | ||
89 | When the IO scheduler changes from a read to a write batch, | |
90 | it begins the elevator from the request that is on the head of the | |
91 | write expiration FIFO. Likewise, when changing from a write batch to | |
92 | a read batch, scheduler begins the elevator from the first entry | |
93 | on the read expiration FIFO. | |
94 | ||
95 | 4. Read anticipation. | |
96 | ||
97 | Read anticipation occurs only when scheduling a read batch. | |
98 | This implementation of read anticipation allows only one read request | |
99 | to be dispatched to the disk controller at a time. In | |
100 | contrast, many write requests may be dispatched to the disk controller | |
101 | at a time during a write batch. It is this characteristic that can make | |
102 | the anticipatory scheduler perform anomalously with controllers supporting | |
103 | TCQ, or with hardware striped RAID devices. Setting the antic_expire | |
992caacf ML |
104 | queue parameter (see below) to zero disables this behavior, and the |
105 | anticipatory scheduler behaves essentially like the deadline scheduler. | |
1da177e4 LT |
106 | |
107 | When read anticipation is enabled (antic_expire is not zero), reads | |
108 | are dispatched to the disk controller one at a time. | |
109 | At the end of each read request, the IO scheduler examines its next | |
110 | candidate read request from its sorted read list. If that next request | |
111 | is from the same process as the request that just completed, | |
112 | or if the next request in the queue is "very close" to the | |
113 | just completed request, it is dispatched immediately. Otherwise, | |
114 | statistics (average think time, average seek distance) on the process | |
115 | that submitted the just completed request are examined. If it seems | |
116 | likely that that process will submit another request soon, and that | |
117 | request is likely to be near the just completed request, then the IO | |
118 | scheduler will stop dispatching more read requests for up time (antic_expire) | |
119 | milliseconds, hoping that process will submit a new request near the one | |
120 | that just completed. If such a request is made, then it is dispatched | |
121 | immediately. If the antic_expire wait time expires, then the IO scheduler | |
122 | will dispatch the next read request from the sorted read queue. | |
123 | ||
124 | To decide whether an anticipatory wait is worthwhile, the scheduler | |
125 | maintains statistics for each process that can be used to compute | |
126 | mean "think time" (the time between read requests), and mean seek | |
127 | distance for that process. One observation is that these statistics | |
128 | are associated with each process, but those statistics are not associated | |
129 | with a specific IO device. So for example, if a process is doing IO | |
130 | on several file systems on separate devices, the statistics will be | |
131 | a combination of IO behavior from all those devices. | |
132 | ||
133 | ||
134 | Tuning the anticipatory IO scheduler | |
135 | ------------------------------------ | |
136 | When using 'as', the anticipatory IO scheduler there are 5 parameters under | |
137 | /sys/block/*/queue/iosched/. All are units of milliseconds. | |
138 | ||
139 | The parameters are: | |
140 | * read_expire | |
141 | Controls how long until a read request becomes "expired". It also controls the | |
142 | interval between which expired requests are served, so set to 50, a request | |
143 | might take anywhere < 100ms to be serviced _if_ it is the next on the | |
144 | expired list. Obviously request expiration strategies won't make the disk | |
145 | go faster. The result basically equates to the timeslice a single reader | |
146 | gets in the presence of other IO. 100*((seek time / read_expire) + 1) is | |
147 | very roughly the % streaming read efficiency your disk should get with | |
148 | multiple readers. | |
149 | ||
150 | * read_batch_expire | |
151 | Controls how much time a batch of reads is given before pending writes are | |
152 | served. A higher value is more efficient. This might be set below read_expire | |
153 | if writes are to be given higher priority than reads, but reads are to be | |
154 | as efficient as possible when there are no writes. Generally though, it | |
155 | should be some multiple of read_expire. | |
156 | ||
157 | * write_expire, and | |
158 | * write_batch_expire are equivalent to the above, for writes. | |
159 | ||
160 | * antic_expire | |
161 | Controls the maximum amount of time we can anticipate a good read (one | |
162 | with a short seek distance from the most recently completed request) before | |
163 | giving up. Many other factors may cause anticipation to be stopped early, | |
164 | or some processes will not be "anticipated" at all. Should be a bit higher | |
165 | for big seek time devices though not a linear correspondence - most | |
166 | processes have only a few ms thinktime. | |
167 |