I/O Barriers
============
Tejun Heo <htejun@gmail.com>, July 22 2005

I/O barrier requests are used to guarantee ordering around the barrier
requests.  Unless you're crazy enough to use disk drives for
implementing synchronization constructs (wow, sounds interesting...),
the ordering is meaningful only for write requests for things like
journal checkpoints.  All requests queued before a barrier request
must be finished (made it to the physical medium) before the barrier
request is started, and all requests queued after the barrier request
must be started only after the barrier request is finished (again,
made it to the physical medium).

In other words, I/O barrier requests have the following two properties.

1. Request ordering

Requests cannot pass the barrier request.  Preceding requests are
processed before the barrier and following requests after.

Depending on what features a drive supports, this can be done in one
of the following three ways.

i. For devices which have queue depth greater than 1 (TCQ devices) and
support ordered tags, block layer can just issue the barrier as an
ordered request and the lower level driver, controller and drive
itself are responsible for making sure that the ordering constraint is
met.  Most modern SCSI controllers/drives should support this.

NOTE: SCSI ordered tag isn't currently used due to limitation in the
      SCSI midlayer, see the following random notes section.

ii. For devices which have queue depth greater than 1 but don't
support ordered tags, block layer ensures that the requests preceding
a barrier request finish before issuing the barrier request.  Also,
it defers requests following the barrier until the barrier request is
finished.  Older SCSI controllers/drives and SATA drives fall in this
category.

iii. Devices which have queue depth of 1.  This is a degenerate case
of ii.  Just keeping issue order suffices.  Ancient SCSI
controllers/drives and IDE drives are in this category.
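
The three ways above can be sketched as a small decision function.  This
is only an illustration of the rule, not block layer code; the enum and
function names are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the three ordering methods described above. */
enum ordering_method {
	ORDER_BY_TAG,	/* i.   issue the barrier as an ordered-tag request */
	ORDER_BY_DRAIN,	/* ii.  drain before the barrier, defer what follows */
	ORDER_BY_ISSUE,	/* iii. queue depth 1: keeping issue order suffices */
};

static enum ordering_method pick_ordering_method(unsigned queue_depth,
						 bool ordered_tags)
{
	assert(queue_depth >= 1);	/* a device accepts at least one request */

	if (queue_depth == 1)
		return ORDER_BY_ISSUE;	/* ordered tags are irrelevant here */
	return ordered_tags ? ORDER_BY_TAG : ORDER_BY_DRAIN;
}
```

For instance, a device with queue depth 32 and no ordered tag support
ends up with ORDER_BY_DRAIN.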

2. Forced flushing to physical medium

Again, if you're not gonna do synchronization with disk drives (dang,
it sounds even more appealing now!), the reason you use I/O barriers
is mainly to protect filesystem integrity when power failure or some
other events abruptly stop the drive from operating and possibly make
the drive lose data in its cache.  So, I/O barriers need to guarantee
that requests actually get written to non-volatile medium in order.

There are four cases.

i. No write-back cache.  Keeping requests ordered is enough.

ii. Write-back cache but no flush operation.  There's no way to
guarantee physical-medium commit order.  This kind of device can't do
I/O barriers.

iii. Write-back cache and flush operation but no FUA (forced unit
access).  We need two cache flushes - before and after the barrier
request.

iv. Write-back cache, flush operation and FUA.  We still need one
flush to make sure requests preceding a barrier are written to medium,
but post-barrier flush can be avoided by using FUA write on the
barrier itself.


How to support barrier requests in drivers
------------------------------------------

All barrier handling is done inside block layer proper.  All low level
drivers have to do is implement their prepare_flush_fn and use one of
the following two functions to indicate what barrier type they support
and how to prepare flush requests.  Note that the term 'ordered' is
used to indicate the whole sequence of performing barrier requests
including draining and flushing.

typedef void (prepare_flush_fn)(request_queue_t *q, struct request *rq);

int blk_queue_ordered(request_queue_t *q, unsigned ordered,
		      prepare_flush_fn *prepare_flush_fn,
		      unsigned gfp_mask);

int blk_queue_ordered_locked(request_queue_t *q, unsigned ordered,
			     prepare_flush_fn *prepare_flush_fn,
			     unsigned gfp_mask);

The only difference between the two functions is whether or not the
caller is holding q->queue_lock on entry.  The latter expects the
caller to be holding the lock.

@q			: the queue in question
@ordered		: the ordered mode the driver/device supports
@prepare_flush_fn	: this function should prepare @rq such that it
			  flushes cache to physical medium when executed
@gfp_mask		: gfp_mask used when allocating data structures
			  for ordered processing

For example, SCSI disk driver's prepare_flush_fn looks like the
following.

static void sd_prepare_flush(request_queue_t *q, struct request *rq)
{
	memset(rq->cmd, 0, sizeof(rq->cmd));
	rq->flags |= REQ_BLOCK_PC;
	rq->timeout = SD_TIMEOUT;
	rq->cmd[0] = SYNCHRONIZE_CACHE;
}

The following seven ordered modes are supported.  The table below
shows which mode should be used depending on what features a
device/driver supports.  In the leftmost column of the table, the
QUEUE_ORDERED_ prefix is omitted from the mode names to save space.

The table is followed by description of each mode.  Note that in the
descriptions of QUEUE_ORDERED_DRAIN*, '=>' is used whereas '->' is
used for QUEUE_ORDERED_TAG* descriptions.  '=>' indicates that the
preceding step must be complete before proceeding to the next step.
'->' indicates that the next step can start as soon as the previous
step is issued.

	    write-back cache	ordered tag	flush		FUA
-----------------------------------------------------------------------
NONE		yes/no		N/A		no		N/A
DRAIN		no		no		N/A		N/A
DRAIN_FLUSH	yes		no		yes		no
DRAIN_FUA	yes		no		yes		yes
TAG		no		yes		N/A		N/A
TAG_FLUSH	yes		yes		yes		no
TAG_FUA		yes		yes		yes		yes
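
Read as code, the table above amounts to something like the following
sketch.  The enum values and struct dev_caps are stand-ins invented for
the example (they are not the kernel's QUEUE_ORDERED_* definitions), and
the sketch assumes NONE is picked only when a write-back cache can't be
flushed; a driver may of course choose NONE whenever barriers are not
needed.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the QUEUE_ORDERED_* modes. */
enum ordered_mode {
	ORDERED_NONE,
	ORDERED_DRAIN,
	ORDERED_DRAIN_FLUSH,
	ORDERED_DRAIN_FUA,
	ORDERED_TAG,
	ORDERED_TAG_FLUSH,
	ORDERED_TAG_FUA,
};

/* Hypothetical description of the four table columns. */
struct dev_caps {
	bool wb_cache;		/* write-back cache enabled */
	bool ordered_tag;	/* ordered tags usable */
	bool flush;		/* cache flush operation available */
	bool fua;		/* FUA writes available */
};

static enum ordered_mode select_ordered_mode(const struct dev_caps *c)
{
	assert(c);

	/* No write-back cache: ordering alone is enough. */
	if (!c->wb_cache)
		return c->ordered_tag ? ORDERED_TAG : ORDERED_DRAIN;

	/* Write-back cache that can't be flushed: no barriers. */
	if (!c->flush)
		return ORDERED_NONE;

	if (c->ordered_tag)
		return c->fua ? ORDERED_TAG_FUA : ORDERED_TAG_FLUSH;
	return c->fua ? ORDERED_DRAIN_FUA : ORDERED_DRAIN_FLUSH;
}
```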


QUEUE_ORDERED_NONE
	I/O barriers are not needed and/or supported.

	Sequence: N/A

QUEUE_ORDERED_DRAIN
	Requests are ordered by draining the request queue and cache
	flushing isn't needed.

	Sequence: drain => barrier

QUEUE_ORDERED_DRAIN_FLUSH
	Requests are ordered by draining the request queue and both
	pre-barrier and post-barrier cache flushings are needed.

	Sequence: drain => preflush => barrier => postflush

QUEUE_ORDERED_DRAIN_FUA
	Requests are ordered by draining the request queue and
	pre-barrier cache flushing is needed.  By using FUA on barrier
	request, post-barrier flushing can be skipped.

	Sequence: drain => preflush => barrier

QUEUE_ORDERED_TAG
	Requests are ordered by ordered tag and cache flushing isn't
	needed.

	Sequence: barrier

QUEUE_ORDERED_TAG_FLUSH
	Requests are ordered by ordered tag and both pre-barrier and
	post-barrier cache flushings are needed.

	Sequence: preflush -> barrier -> postflush

QUEUE_ORDERED_TAG_FUA
	Requests are ordered by ordered tag and pre-barrier cache
	flushing is needed.  By using FUA on barrier request,
	post-barrier flushing can be skipped.

	Sequence: preflush -> barrier
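
The sequences above can also be written down as data.  The following is
a rough sketch, not the block layer's representation: the step bits,
struct mode_seq and both helpers are hypothetical.  Each mode is a set
of steps plus a flag telling whether a step must complete before the
next one is issued ('=>') or not ('->').

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Step bits; names and values are illustrative, not the kernel's. */
enum ordered_step {
	STEP_DRAIN     = 1 << 0,
	STEP_PREFLUSH  = 1 << 1,
	STEP_BARRIER   = 1 << 2,
	STEP_POSTFLUSH = 1 << 3,
};

struct mode_seq {
	const char *name;	/* mode name without QUEUE_ORDERED_ prefix */
	unsigned steps;		/* OR of enum ordered_step */
	bool wait_each;		/* true: '=>' (DRAIN*), false: '->' (TAG*) */
};

static const struct mode_seq mode_seqs[] = {
	{ "NONE",	 0, false },
	{ "DRAIN",	 STEP_DRAIN | STEP_BARRIER, true },
	{ "DRAIN_FLUSH", STEP_DRAIN | STEP_PREFLUSH | STEP_BARRIER |
			 STEP_POSTFLUSH, true },
	{ "DRAIN_FUA",	 STEP_DRAIN | STEP_PREFLUSH | STEP_BARRIER, true },
	{ "TAG",	 STEP_BARRIER, false },
	{ "TAG_FLUSH",	 STEP_PREFLUSH | STEP_BARRIER | STEP_POSTFLUSH, false },
	{ "TAG_FUA",	 STEP_PREFLUSH | STEP_BARRIER, false },
};

/* Look a mode up by name; returns NULL if the name is unknown. */
static const struct mode_seq *find_mode(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(mode_seqs) / sizeof(mode_seqs[0]); i++)
		if (strcmp(mode_seqs[i].name, name) == 0)
			return &mode_seqs[i];
	return NULL;
}

/* Render the sequence in the notation used above; -1 if buf is too small. */
static int format_sequence(const struct mode_seq *m, char *buf, size_t len)
{
	static const char *const step_names[] = {
		"drain", "preflush", "barrier", "postflush",
	};
	size_t used = 0;
	unsigned i;

	assert(m && buf && len > 0);
	if (m->steps == 0)
		return snprintf(buf, len, "N/A") < (int)len ? 0 : -1;

	for (i = 0; i < 4; i++) {
		int n;

		if (!(m->steps & (1u << i)))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? (m->wait_each ? " => " : " -> ") : "",
			     step_names[i]);
		if (n < 0 || (size_t)n >= len - used)
			return -1;	/* buffer too small */
		used += (size_t)n;
	}
	return 0;
}
```

Formatting the "DRAIN_FUA" entry gives "drain => preflush => barrier",
matching the description of QUEUE_ORDERED_DRAIN_FUA.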


Random notes/caveats
--------------------

* SCSI layer currently can't use TAG ordering even if the drive,
controller and driver support it.  The problem is that SCSI midlayer
request dispatch function is not atomic.  It releases queue lock and
switches to SCSI host lock during issue, and it's possible, and over
time likely, that requests change their relative positions.  Once
this problem is solved, TAG ordering can be enabled.

* Currently, no matter which ordered mode is used, there can be only
one barrier request in progress.  All I/O barriers are held off by
block layer until the previous I/O barrier is complete.  This doesn't
make any difference for DRAIN ordered devices, but, for TAG ordered
devices with very high command latency, passing multiple I/O barriers
to low level *might* be helpful if they are very frequent.  Well, this
certainly is a non-issue.  I'm writing this just to make clear that no
two I/O barriers are ever passed to low-level driver at the same time.

* Completion order.  Requests in ordered sequence are issued in order
but not required to finish in order.  Barrier implementation can
handle out-of-order completion of ordered sequence.  IOW, the requests
MUST be processed in order but the hardware/software completion paths
are allowed to reorder completion notifications - e.g. current SCSI
midlayer doesn't preserve completion order during error handling.

* Requeueing order.  Low-level drivers are free to requeue any request
after they removed it from the request queue with
blkdev_dequeue_request().  As barrier sequence should be kept in order
when requeued, generic elevator code takes care of putting requests in
order around barrier.  See blk_ordered_req_seq() and
ELEVATOR_INSERT_REQUEUE handling in __elv_add_request() for details.

Note that block drivers must not requeue preceding requests while
completing later requests in an ordered sequence.  Currently, no
error checking is done against this.

* Error handling.  Currently, block layer will report error to upper
layer if any of the requests in an ordered sequence fails.  Unfortunately,
this doesn't seem to be enough.  Look at the following request flow.
QUEUE_ORDERED_TAG_FLUSH is in use.

 [0] [1] [2] [3] [pre] [barrier] [post] < [4] [5] [6] ... >
					  still in elevator

Let's say requests [2] and [3] are write requests to update file system
metadata (journal or whatever) and [barrier] is used to mark that
those updates are valid.  Consider the following sequence.

 i.	Requests [0] ~ [post] leave the request queue and enter
	low-level driver.
 ii.	After a while, unfortunately, something goes wrong and the
	drive fails [2].  Note that any of [0], [1] and [3] could have
	completed by this time, but [pre] couldn't have been finished
	as the drive must process it in order and it failed before
	processing that command.
 iii.	Error handling kicks in and determines that the error is
	unrecoverable and fails [2], and resumes operation.
 iv.	[pre] [barrier] [post] get processed.
 v.	*BOOM* power fails

The problem here is that the barrier request is *supposed* to indicate
that filesystem update requests [2] and [3] made it safely to the
physical medium and, if the machine crashes after the barrier is
written, filesystem recovery code can depend on that.  Sadly, that
isn't true in this case anymore.  IOW, the success of an I/O barrier
should also be dependent on success of some of the preceding requests,
where only upper layer (filesystem) knows what 'some' is.

This can be solved by implementing a way to tell the block layer which
requests affect the success of the following barrier request and
making lower level drivers resume operation on error only after
block layer tells them to do so.

As the probability of this happening is very low and it takes a faulty
drive to begin with, implementing the fix is probably overkill.  But, still,
it's there.
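
To make the first half of that idea concrete, here is a minimal sketch
of how a barrier's result could be made to depend on the requests the
filesystem cares about.  Nothing like this exists in the block layer;
barrier_status() and its arguments are hypothetical, and the sketch does
not cover making low level drivers wait before resuming after an error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical helper: status to report for a barrier, given its own
 * completion status and the statuses of the preceding requests the
 * upper layer declared it depends on.  A status is 0 on success or a
 * negative errno value.
 */
static int barrier_status(int barrier_err, const int *dep_err, size_t ndeps)
{
	size_t i;

	assert(dep_err || ndeps == 0);

	for (i = 0; i < ndeps; i++)
		if (dep_err[i])
			return dep_err[i];	/* a dependency failed */
	return barrier_err;
}
```

In the flow above, with [2] and [3] as dependencies and [2] failing with
-EIO, the barrier would be reported as failed with -EIO even though the
barrier write itself succeeded.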

* In previous drafts of barrier implementation, there was a fallback
mechanism such that, if FUA or ordered TAG fails, a less fancy ordered
mode can be selected and the failed barrier request is retried
automatically.  The rationale for this feature was that as FUA is
pretty new in ATA world and ordered tag was never used widely, there
could be devices which claim to support those features but choke when
actually given such requests.

 This was removed for two reasons: 1. it's overkill, and 2. it's
impossible to implement properly when TAG ordering is used as low
level drivers resume after an error automatically.  If it's ever
needed, adding it back and modifying low level drivers accordingly
shouldn't be difficult.