Commit | Line | Data |
---|---|---|
5a225791 TH |
1 | I/O Barriers |
2 | ============ | |
3 | Tejun Heo <htejun@gmail.com>, July 22 2005 | |
4 | ||
5 | I/O barrier requests are used to guarantee ordering around the barrier | |
6 | requests. Unless you're crazy enough to use disk drives for | |
7 | implementing synchronization constructs (wow, sounds interesting...), | |
8 | the ordering is meaningful only for write requests for things like | |
9 | journal checkpoints. All requests queued before a barrier request | |
10 | must be finished (made it to the physical medium) before the barrier | |
11 | request is started, and all requests queued after the barrier request | |
12 | must be started only after the barrier request is finished (again, | |
13 | made it to the physical medium). | |
14 | ||
15 | In other words, I/O barrier requests have the following two properties. | |
16 | ||
17 | 1. Request ordering | |
18 | ||
19 | Requests cannot pass the barrier request. Preceding requests are | |
20 | processed before the barrier and following requests after. | |
21 | ||
22 | Depending on what features a drive supports, this can be done in one | |
23 | of the following three ways. | |
24 | ||
25 | i. For devices which have queue depth greater than 1 (TCQ devices) and | |
26 | support ordered tags, block layer can just issue the barrier as an | |
27 | ordered request and the lower level driver, controller and drive | |
28 | itself are responsible for making sure that the ordering constraint is | |
29 | met. Most modern SCSI controllers/drives should support this. | |
30 | ||
31 | NOTE: SCSI ordered tag isn't currently used due to limitation in the | |
32 | SCSI midlayer, see the following random notes section. | |
33 | ||
34 | ii. For devices which have queue depth greater than 1 but don't | |
35 | support ordered tags, block layer ensures that the requests preceding | |
36 | a barrier request finish before issuing the barrier request. Also, | |
37 | it defers requests following the barrier until the barrier request is | |
38 | finished. Older SCSI controllers/drives and SATA drives fall in this | |
39 | category. | |
40 | ||
41 | iii. Devices which have queue depth of 1. This is a degenerate case | |
42 | of ii. Just keeping issue order suffices. Ancient SCSI | |
43 | controllers/drives and IDE drives are in this category. | |
44 | ||
45 | 2. Forced flushing to physical medium | |
46 | ||
47 | Again, if you're not gonna do synchronization with disk drives (dang, | |
48 | it sounds even more appealing now!), the reason you use I/O barriers | |
49 | is mainly to protect filesystem integrity when power failure or some | |
50 | other events abruptly stop the drive from operating and possibly make | |
51 | the drive lose data in its cache. So, I/O barriers need to guarantee | |
52 | that requests actually get written to non-volatile medium in order. | |
53 | ||
54 | There are four cases. | |
55 | ||
56 | i. No write-back cache. Keeping requests ordered is enough. | |
57 | ||
58 | ii. Write-back cache but no flush operation. There's no way to | |
59 | guarantee physical-medium commit order. Devices of this kind can't do | |
60 | I/O barriers. | |
61 | ||
62 | iii. Write-back cache and flush operation but no FUA (forced unit | |
63 | access). We need two cache flushes - before and after the barrier | |
64 | request. | |
65 | ||
66 | iv. Write-back cache, flush operation and FUA. We still need one | |
67 | flush to make sure requests preceding a barrier are written to medium, | |
68 | but post-barrier flush can be avoided by using FUA write on the | |
69 | barrier itself. | |
70 | ||
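The four cache cases above can be sketched as a small decision function. This is only an illustration, not block layer code: struct cache_caps, struct flush_plan, plan_flushes() and flush_count() are hypothetical names made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of a device's cache features. */
struct cache_caps {
	bool write_back;	/* device has a write-back cache */
	bool can_flush;		/* device supports a cache flush operation */
	bool can_fua;		/* device supports FUA writes */
};

/* Hypothetical result: what has to surround the barrier write. */
struct flush_plan {
	bool supported;		/* false: commit order can't be guaranteed */
	bool preflush;		/* flush cache before the barrier */
	bool fua_barrier;	/* issue the barrier itself as a FUA write */
	bool postflush;		/* flush cache after the barrier */
};

static struct flush_plan plan_flushes(struct cache_caps c)
{
	struct flush_plan p = { true, false, false, false };

	if (!c.write_back)
		return p;		/* case i: ordering alone is enough */

	if (!c.can_flush) {
		p.supported = false;	/* case ii: no barriers possible */
		return p;
	}

	p.preflush = true;		/* cases iii and iv both need this */
	if (c.can_fua)
		p.fua_barrier = true;	/* case iv: FUA replaces postflush */
	else
		p.postflush = true;	/* case iii: second flush needed */

	/* FUA on the barrier and a post-barrier flush never go together. */
	assert(!(p.fua_barrier && p.postflush));
	return p;
}

/* Number of cache flushes issued around one barrier. */
static int flush_count(struct flush_plan p)
{
	return (int)p.preflush + (int)p.postflush;
}
```

With these assumptions, a device with write-back cache and flush but no FUA ends up with two flushes per barrier, and adding FUA brings that down to one.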
71 | ||
72 | How to support barrier requests in drivers | |
73 | ------------------------------------------ | |
74 | ||
75 | All barrier handling is done inside block layer proper. All low level | |
76 | drivers have to do is implement their prepare_flush_fn and use one of | |
77 | the following two functions to indicate what barrier type they support | |
78 | and how to prepare flush requests. Note that the term 'ordered' is | |
79 | used to indicate the whole sequence of performing barrier requests | |
80 | including draining and flushing. | |
81 | ||
82 | typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq); | |
83 | ||
84 | int blk_queue_ordered(request_queue_t *q, unsigned ordered, | |
85 |                       prepare_flush_fn *prepare_flush_fn, | |
86 |                       unsigned gfp_mask); | |
87 | ||
88 | int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered, | |
89 |                              prepare_flush_fn *prepare_flush_fn, | |
90 |                              unsigned gfp_mask); | |
91 | ||
92 | The only difference between the two functions is whether or not the | |
93 | caller is holding q->queue_lock on entry. The latter expects the | |
94 | caller to be holding the lock. | |
95 | ||
96 | @q                : the queue in question | |
97 | @ordered          : the ordered mode the driver/device supports | |
98 | @prepare_flush_fn : this function should prepare @rq such that it | |
99 |                     flushes cache to physical medium when executed | |
100 | @gfp_mask         : gfp_mask used when allocating data structures | |
101 |                     for ordered processing | |
102 | ||
103 | For example, SCSI disk driver's prepare_flush_fn looks like the | |
104 | following. | |
105 | ||
106 | static void sd_prepare_flush(request_queue_t *q, struct request *rq) | |
107 | { | |
108 |         memset(rq->cmd, 0, sizeof(rq->cmd)); | |
109 |         rq->flags |= REQ_BLOCK_PC; | |
110 |         rq->timeout = SD_TIMEOUT; | |
111 |         rq->cmd[0] = SYNCHRONIZE_CACHE; | |
112 | } | |
113 | ||
114 | The following seven ordered modes are supported. The following table | |
115 | shows which mode should be used depending on what features a | |
116 | device/driver supports. In the leftmost column of table, | |
117 | QUEUE_ORDERED_ prefix is omitted from the mode names to save space. | |
118 | ||
119 | The table is followed by description of each mode. Note that in the | |
120 | descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is | |
121 | used for QUEUE_ORDERED_TAG* descriptions. '=>' indicates that the | |
122 | preceding step must be complete before proceeding to the next step. | |
123 | '->' indicates that the next step can start as soon as the previous | |
124 | step is issued. | |
125 | ||
126 |                write-back cache   ordered tag   flush   FUA | |
127 | ----------------------------------------------------------------------- | |
128 | NONE           yes/no             N/A           no      N/A | |
129 | DRAIN          no                 no            N/A     N/A | |
130 | DRAIN_FLUSH    yes                no            yes     no | |
131 | DRAIN_FUA      yes                no            yes     yes | |
132 | TAG            no                 yes           N/A     N/A | |
133 | TAG_FLUSH      yes                yes           yes     no | |
134 | TAG_FUA        yes                yes           yes     yes | |
135 | ||
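As a rough illustration, the table above can be read as a lookup from device features to a mode. The enum and struct below are local stand-ins invented for this sketch; they are not the kernel's QUEUE_ORDERED_* definitions and carry none of their values. The sketch also assumes that NONE is chosen only when a write-back cache exists but cannot be flushed; a driver may have other reasons to pick NONE.

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-ins for the QUEUE_ORDERED_* modes (hypothetical values). */
enum ordered_mode {
	MODE_NONE,
	MODE_DRAIN,
	MODE_DRAIN_FLUSH,
	MODE_DRAIN_FUA,
	MODE_TAG,
	MODE_TAG_FLUSH,
	MODE_TAG_FUA,
};

/* Hypothetical feature flags, one per column of the table. */
struct dev_features {
	bool write_back_cache;
	bool ordered_tag;
	bool flush;
	bool fua;
};

static enum ordered_mode pick_ordered_mode(struct dev_features f)
{
	/* Write-back cache that can't be flushed: barriers impossible. */
	if (f.write_back_cache && !f.flush)
		return MODE_NONE;

	/* No write-back cache: ordering alone is sufficient. */
	if (!f.write_back_cache)
		return f.ordered_tag ? MODE_TAG : MODE_DRAIN;

	/* Write-back cache with flush: FUA saves the post-barrier flush. */
	if (f.fua)
		return f.ordered_tag ? MODE_TAG_FUA : MODE_DRAIN_FUA;

	return f.ordered_tag ? MODE_TAG_FLUSH : MODE_DRAIN_FLUSH;
}

/* Mode name as it appears in the leftmost column of the table. */
static const char *ordered_mode_name(enum ordered_mode m)
{
	static const char *const names[] = {
		"NONE", "DRAIN", "DRAIN_FLUSH", "DRAIN_FUA",
		"TAG", "TAG_FLUSH", "TAG_FUA",
	};

	assert((int)m >= 0 && (int)m <= (int)MODE_TAG_FUA);
	return names[m];
}
```

For example, a drive with write-back cache, flush and FUA but no ordered tags maps to DRAIN_FUA in this sketch.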
136 | ||
137 | QUEUE_ORDERED_NONE | |
138 | I/O barriers are not needed and/or supported. | |
139 | ||
140 | Sequence: N/A | |
141 | ||
142 | QUEUE_ORDERED_DRAIN | |
143 | Requests are ordered by draining the request queue and cache | |
144 | flushing isn't needed. | |
145 | ||
146 | Sequence: drain => barrier | |
147 | ||
148 | QUEUE_ORDERED_DRAIN_FLUSH | |
149 | Requests are ordered by draining the request queue and both | |
150 | pre-barrier and post-barrier cache flushings are needed. | |
151 | ||
152 | Sequence: drain => preflush => barrier => postflush | |
153 | ||
154 | QUEUE_ORDERED_DRAIN_FUA | |
155 | Requests are ordered by draining the request queue and | |
156 | pre-barrier cache flushing is needed. By using FUA on barrier | |
157 | request, post-barrier flushing can be skipped. | |
158 | ||
159 | Sequence: drain => preflush => barrier | |
160 | ||
161 | QUEUE_ORDERED_TAG | |
162 | Requests are ordered by ordered tag and cache flushing isn't | |
163 | needed. | |
164 | ||
165 | Sequence: barrier | |
166 | ||
167 | QUEUE_ORDERED_TAG_FLUSH | |
168 | Requests are ordered by ordered tag and both pre-barrier and | |
169 | post-barrier cache flushings are needed. | |
170 | ||
171 | Sequence: preflush -> barrier -> postflush | |
172 | ||
173 | QUEUE_ORDERED_TAG_FUA | |
174 | Requests are ordered by ordered tag and pre-barrier cache | |
175 | flushing is needed. By using FUA on barrier request, | |
176 | post-barrier flushing can be skipped. | |
177 | ||
178 | Sequence: preflush -> barrier | |
179 | ||
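The sequences listed above follow a regular pattern, which the following sketch tries to capture. It takes a mode name without the QUEUE_ORDERED_ prefix and builds the sequence string. ordered_sequence() and its helpers are hypothetical and exist only for illustration; the name parsing is deliberately naive and assumes one of the seven names from the table.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

static bool has_suffix(const char *s, const char *suffix)
{
	size_t n = strlen(s), m = strlen(suffix);

	return n >= m && strcmp(s + n - m, suffix) == 0;
}

/* Append one step to buf, preceded by sep unless buf is still empty. */
static void append_step(char *buf, size_t len, const char *sep,
			const char *step)
{
	if (buf[0] != '\0')
		strncat(buf, sep, len - strlen(buf) - 1);
	strncat(buf, step, len - strlen(buf) - 1);
}

/*
 * Build the step sequence for a mode name such as "DRAIN_FLUSH".
 * Returns false and writes "N/A" if the mode performs no barrier.
 */
static bool ordered_sequence(const char *mode, char *buf, size_t len)
{
	bool drain = strncmp(mode, "DRAIN", 5) == 0;
	bool tag = strncmp(mode, "TAG", 3) == 0;
	bool flush = has_suffix(mode, "_FLUSH");
	bool fua = has_suffix(mode, "_FUA");
	/* '=>' waits for completion, '->' only for issue. */
	const char *sep = drain ? " => " : " -> ";

	assert(len > 0);
	buf[0] = '\0';

	if (!drain && !tag) {
		append_step(buf, len, sep, "N/A");
		return false;
	}

	if (drain)
		append_step(buf, len, sep, "drain");
	if (flush || fua)
		append_step(buf, len, sep, "preflush");
	append_step(buf, len, sep, "barrier");
	if (flush)
		append_step(buf, len, sep, "postflush");
	return true;
}
```

Calling ordered_sequence("TAG_FUA", buf, sizeof(buf)) with a 64-byte buffer leaves "preflush -> barrier" in buf.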
180 | ||
181 | Random notes/caveats | |
182 | -------------------- | |
183 | ||
184 | * SCSI layer currently can't use TAG ordering even if the drive, | |
185 | controller and driver support it. The problem is that SCSI midlayer | |
186 | request dispatch function is not atomic. It releases queue lock and | |
187 | switches to SCSI host lock during issue, and it is possible, even likely, | |
188 | that requests change their relative positions in the meantime. Once | |
189 | this problem is solved, TAG ordering can be enabled. | |
190 | ||
191 | * Currently, no matter which ordered mode is used, there can be only | |
192 | one barrier request in progress. All I/O barriers are held off by | |
193 | block layer until the previous I/O barrier is complete. This doesn't | |
194 | make any difference for DRAIN ordered devices, but, for TAG ordered | |
195 | devices with very high command latency, passing multiple I/O barriers | |
196 | to low level *might* be helpful if they are very frequent. Well, this | |
197 | certainly is a non-issue. I'm writing this just to make clear that no | |
198 | two I/O barriers are ever passed to the low-level driver at once. | |
199 | ||
200 | * Completion order. Requests in ordered sequence are issued in order | |
201 | but not required to finish in order. Barrier implementation can | |
202 | handle out-of-order completion of ordered sequence. IOW, the requests | |
203 | MUST be processed in order but the hardware/software completion paths | |
204 | are allowed to reorder completion notifications - eg. current SCSI | |
205 | midlayer doesn't preserve completion order during error handling. | |
206 | ||
207 | * Requeueing order. Low-level drivers are free to requeue any request | |
208 | after they removed it from the request queue with | |
209 | blkdev_dequeue_request(). As barrier sequence should be kept in order | |
210 | when requeued, generic elevator code takes care of putting requests in | |
211 | order around barrier. See blk_ordered_req_seq() and | |
212 | ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details. | |
213 | ||
214 | Note that block drivers must not requeue preceding requests while | |
215 | completing latter requests in an ordered sequence. Currently, no | |
216 | error checking is done against this. | |
217 | ||
218 | * Error handling. Currently, block layer will report error to upper | |
219 | layer if any of requests in an ordered sequence fails. Unfortunately, | |
220 | this doesn't seem to be enough. Look at the following request flow. | |
221 | QUEUE_ORDERED_TAG_FLUSH is in use. | |
222 | ||
223 | [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... > | |
224 |                                          still in elevator | |
225 | ||
226 | Let's say requests [2] and [3] are write requests to update file system | |
227 | metadata (journal or whatever) and [barrier] is used to mark that | |
228 | those updates are valid. Consider the following sequence. | |
229 | ||
230 | i. Requests [0] ~ [post] leave the request queue and enter | |
231 | low-level driver. | |
232 | ii. After a while, unfortunately, something goes wrong and the | |
233 | drive fails [2]. Note that any of [0], [1] and [3] could have | |
234 | completed by this time, but [pre] couldn't have been finished | |
235 | as the drive must process it in order and it failed before | |
236 | processing that command. | |
237 | iii. Error handling kicks in and determines that the error is | |
238 | unrecoverable and fails [2], and resumes operation. | |
239 | iv. [pre] [barrier] [post] get processed. | |
240 | v. *BOOM* power fails | |
241 | ||
242 | The problem here is that the barrier request is *supposed* to indicate | |
243 | that filesystem update requests [2] and [3] made it safely to the | |
244 | physical medium and, if the machine crashes after the barrier is | |
245 | written, filesystem recovery code can depend on that. Sadly, that | |
246 | isn't true in this case anymore. IOW, the success of an I/O barrier | |
247 | should also be dependent on success of some of the preceding requests, | |
248 | where only upper layer (filesystem) knows what 'some' is. | |
249 | ||
250 | This can be solved by implementing a way to tell the block layer which | |
251 | requests affect the success of the following barrier request and | |
252 | making lower level drivers resume operation on error only after | |
253 | block layer tells them to do so. | |
254 | ||
255 | As the probability of this happening is very low and it requires a | |
256 | faulty drive, implementing the fix is probably overkill. But, still, | |
257 | it's there. | |
258 | ||
259 | * In previous drafts of barrier implementation, there was a fallback | |
260 | mechanism such that, if FUA or ordered TAG failed, a less fancy ordered | |
261 | mode could be selected and the failed barrier request retried | |
262 | automatically. The rationale for this feature was that as FUA is | |
263 | pretty new in ATA world and ordered tag was never used widely, there | |
264 | could be devices which claim to support those features but choke when | |
265 | actually given such requests. | |
266 | ||
267 | This was removed for two reasons: 1. it's overkill, and 2. it's | |
268 | impossible to implement properly when TAG ordering is used as low | |
269 | level drivers resume after an error automatically. If it's ever | |
270 | needed, adding it back and modifying low level drivers accordingly | |
271 | shouldn't be difficult. | |
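To make the error handling caveat above a bit more concrete, here is a minimal sketch of the dependency check it asks for. Nothing like this exists in the block layer described here: struct seq_req, the barrier_depends flag and barrier_is_valid() are all hypothetical, and the sketch assumes the filesystem could somehow mark which preceding requests the barrier vouches for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-request bookkeeping for requests before a barrier. */
struct seq_req {
	bool failed;		/* request completed with an error */
	bool barrier_depends;	/* filesystem says the barrier covers it */
};

/*
 * Decide whether a barrier may be reported as successful.
 *
 * @preceding:          requests queued before the barrier
 * @n:                  number of entries in @preceding
 * @ordered_seq_failed: true if preflush, barrier or postflush failed
 */
static bool barrier_is_valid(const struct seq_req *preceding, size_t n,
			     bool ordered_seq_failed)
{
	size_t i;

	assert(preceding != NULL || n == 0);

	/* This much is what the block layer already reports today. */
	if (ordered_seq_failed)
		return false;

	/* The missing part: a failed dependency invalidates the barrier. */
	for (i = 0; i < n; i++)
		if (preceding[i].failed && preceding[i].barrier_depends)
			return false;

	return true;
}
```

In the request flow from the error handling note, [2] fails and the barrier depends on it, so this check would reject the barrier even though [pre], [barrier] and [post] all completed.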