Elliott, Robert (Server Storage) [Fri, 6 Jun 2014 14:02:42 +0000 (16:02 +0200)]
mpt2sas: delay scsi_add_host call to work with scsi-mq
In _scsih_probe, delay the call to scsi_add_host until the host has been
fully set up.
Otherwise, the default .can_queue value of 1 causes scsi-mq to set the block
layer request queue size to its minimum size, resulting in awful performance.
In _scsih_probe error handling, call mpt3sas_base_detach rather than
scsi_remove_host to properly clean up in reverse order.
In _scsih_remove, call scsi_remove_host earlier to clean up in reverse order.
Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Wed, 25 Jun 2014 16:52:01 +0000 (18:52 +0200)]
fnic: reject device resets without assigned tags for the blk-mq case
Currently the midlayer fakes up a struct request for the explicit reset
ioctls, and those don't have a tag allocated to them. The fnic driver pokes
into midlayer structures to paper over this design issue, but that won't
work for the blk-mq case.
Either someone who can actually test the hardware will have to come up with
a similar hack for the blk-mq case, or we'll have to bite the bullet and fix
the way the EH ioctls work for real, but until that happens we fail these
explicit requests here.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Hiral Patel <hiralpat@cisco.com>
Cc: Suma Ramars <sramars@cisco.com>
Cc: Brian Uchino <buchino@cisco.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Fri, 3 Apr 2015 01:34:29 +0000 (19:34 -0600)]
scsi: add support for a blk-mq based I/O path.
This patch adds support for an alternate I/O path in the scsi midlayer
which uses the blk-mq infrastructure instead of the legacy request code.
Use of blk-mq is fully transparent to drivers, although for now a host
template field is provided to opt out of blk-mq usage in case any unforeseen
incompatibilities arise.
In general replacing the legacy request code with blk-mq is a simple and
mostly mechanical transformation. The biggest exception is the new code
that deals with the fact that I/O submissions in blk-mq must happen from
process context, which slightly complicates the I/O completion handler.
The second biggest difference is that blk-mq is built around the concept
of preallocated requests that also include driver specific data, which
in SCSI context means the scsi_cmnd structure. This completely avoids
dynamic memory allocations for the fast path through I/O submission.
Due to the preallocated requests, the MQ code path exclusively uses the
host-wide shared tag allocator instead of a per-LUN one. This only
affects drivers actually using the block layer provided tag allocator
instead of their own. Unlike the old path, blk-mq always provides a tag,
although drivers don't have to use it.
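As a rough illustration of what the preallocation means for a driver (a
hedged sketch, not the actual patch; all names and sizes here are made up),
a blk-mq tag set simply declares the size of its per-command payload and
picks a host-wide queue depth:

    #include <linux/blk-mq.h>
    #include <linux/string.h>
    #include <linux/numa.h>
    #include <scsi/scsi_cmnd.h>

    static int example_setup_tag_set(struct blk_mq_tag_set *set,
                                     const struct blk_mq_ops *ops)
    {
            memset(set, 0, sizeof(*set));
            set->ops = ops;
            set->nr_hw_queues = 1;
            set->queue_depth = 64;  /* host-wide, shared by all LUNs */
            set->cmd_size = sizeof(struct scsi_cmnd); /* preallocated per request */
            set->numa_node = NUMA_NO_NODE;
            return blk_mq_alloc_tag_set(set);
    }

Every request in the pool then carries its command payload from allocation
time, so the submission fast path never allocates memory.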
For now the blk-mq path is disabled by default and must be enabled using
the "use_blk_mq" module parameter. Once the remaining work in the block
layer to make blk-mq more suitable for slow devices is complete I hope
to make it the default and eventually even remove the old code path.
Based on the earlier scsi-mq prototype by Nicholas Bellinger.
Thanks to Bart Van Assche and Robert Elliott for testing, benchmarking and
various suggestions and code contributions.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Wed, 25 Jun 2014 16:51:59 +0000 (18:51 +0200)]
scatterlist: allow chaining to preallocated chunks
Blk-mq drivers usually preallocate their S/G list as part of the request,
but if we want to support the very large S/G lists currently supported by
the SCSI code that would tie up a lot of memory in the preallocated request
pool. Add support to the scatterlist code so that it can initialize a
S/G list that uses a preallocated first chunk and dynamically allocated
additional chunks. That way the scsi-mq code can preallocate a first
page worth of S/G entries as part of the request, and dynamically extend
the S/G list when needed.
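A minimal sketch of the pattern (hypothetical names and sizes; the real
code lives in the scatterlist and scsi-mq patches themselves): a small S/G
array embedded in the preallocated request is used as-is for short
commands, and a dynamically allocated chunk is chained on for longer ones:

    #include <linux/scatterlist.h>
    #include <linux/slab.h>

    #define EXAMPLE_INLINE_NENTS 8

    static struct scatterlist *example_init_sgl(struct scatterlist *inline_sg,
                                                unsigned int nents, gfp_t gfp)
    {
            struct scatterlist *extra;

            if (nents <= EXAMPLE_INLINE_NENTS) {
                    sg_init_table(inline_sg, nents);
                    return inline_sg;
            }

            /* The chain entry consumes the last inline slot, so only
             * EXAMPLE_INLINE_NENTS - 1 inline entries stay usable. */
            extra = kmalloc_array(nents - (EXAMPLE_INLINE_NENTS - 1),
                                  sizeof(*extra), gfp);
            if (!extra)
                    return NULL;
            sg_init_table(extra, nents - (EXAMPLE_INLINE_NENTS - 1));
            sg_init_table(inline_sg, EXAMPLE_INLINE_NENTS);
            sg_chain(inline_sg, EXAMPLE_INLINE_NENTS, extra);
            return inline_sg;
    }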
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Thu, 26 Jun 2014 02:15:54 +0000 (20:15 -0600)]
scsi: unwind blk_end_request_all and blk_end_request_err calls
Replace the calls to the various blk_end_request variants with open-coded
equivalents. Blk-mq uses a model that gives the driver control
between the bio updates and the actual completion, and making the old
code follow that same model allows us to keep the code more similar for
both paths.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Wed, 25 Jun 2014 16:51:57 +0000 (18:51 +0200)]
scsi: only maintain target_blocked if the driver has a target queue limit
This saves us an atomic operation for each I/O submission and completion
for the usual case where the driver doesn't set a per-target can_queue
value. Only a few iscsi hardware offload drivers set the per-target
can_queue value at the moment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Thu, 26 Jun 2014 02:14:42 +0000 (20:14 -0600)]
scsi: fix the {host,target,device}_blocked counter mess
Seems like these counters are missing any sort of synchronization for
updates, as a more than 10-year-old comment from me noted. Fix this by
using atomic counters, and while we're at it also make sure they are
in the same cacheline as the _busy counters and not needlessly stored
to in every I/O completion.
With the new model the _busy counters can temporarily go negative,
so all the readers are updated to check for > 0 values. Longer
term every successful I/O completion will reset the counters to zero,
so the temporarily negative values will not cause any harm.
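A sketch of the reader side this describes (names hypothetical): because
the counters are now updated lazily and atomically, a reader must treat
only strictly positive values as meaningful:

    #include <linux/atomic.h>

    static bool example_is_blocked(atomic_t *blocked)
    {
            /* may be transiently negative; only > 0 means blocked */
            return atomic_read(blocked) > 0;
    }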
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Thu, 26 Jun 2014 02:13:31 +0000 (20:13 -0600)]
scsi: convert device_busy to atomic_t
Avoid taking the queue_lock to check the per-device queue limit. Instead
we do an atomic_inc_return early on to grab our slot in the queue,
and if necessary decrement it after finishing all checks.
Unlike the host and target busy counters this doesn't allow us to avoid the
queue_lock in the request_fn due to the way the interface works, but it'll
allow us to prepare for using the blk-mq code, which doesn't use the
queue_lock at all, and it at least avoids a queue_lock roundtrip in
scsi_device_unbusy, which is still important given how busy the queue_lock
is.
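The optimistic-increment pattern referred to above looks roughly like this
(a hedged sketch with hypothetical names, not the patch itself): grab a
slot first, and back out only if the limit was exceeded:

    #include <linux/atomic.h>

    static bool example_device_queue_ready(atomic_t *device_busy,
                                           unsigned int queue_depth)
    {
            unsigned int busy = atomic_inc_return(device_busy);

            if (queue_depth && busy > queue_depth) {
                    atomic_dec(device_busy);  /* over the limit: undo */
                    return false;
            }
            return true;
    }

The host_busy and target_busy conversions below follow the same scheme.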
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Thu, 26 Jun 2014 02:11:25 +0000 (20:11 -0600)]
scsi: convert host_busy to atomic_t
Avoid taking the host-wide host_lock to check the per-host queue limit.
Instead we do an atomic_inc_return early on to grab our slot in the queue,
and if necessary decrement it after finishing all checks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Thu, 26 Jun 2014 02:08:50 +0000 (20:08 -0600)]
scsi: convert target_busy to an atomic_t
Avoid taking the host-wide host_lock to check the per-target queue limit.
Instead we do an atomic_inc_return early on to grab our slot in the queue,
and if necessary decrement it after finishing all checks.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Thu, 26 Jun 2014 02:05:56 +0000 (20:05 -0600)]
scsi: push host_lock down into scsi_{host,target}_queue_ready
Prepare for not taking a host-wide lock in the dispatch path by pushing
the lock down into the places that actually need it. Note that this
patch is just a preparation step, as it will actually increase lock
roundtrips and thus decrease performance on its own.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Thu, 26 Jun 2014 02:02:09 +0000 (20:02 -0600)]
scsi: set ->scsi_done before calling scsi_dispatch_cmd
The blk-mq code path will set this to a different function, so make the
code simpler by setting it up in a legacy-request specific place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Wed, 25 Jun 2014 23:22:32 +0000 (17:22 -0600)]
scsi: centralize command re-queueing in scsi_dispatch_fn
Make sure we only have the logic for requeueing commands in one place.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Wed, 25 Jun 2014 23:20:08 +0000 (17:20 -0600)]
scsi: split __scsi_queue_insert
Factor out a helper to set the _blocked values, which we'll reuse for the
blk-mq code path.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Wed, 25 Jun 2014 16:51:48 +0000 (18:51 +0200)]
sd: don't use rq->cmd_len before setting it up
Unlike the old request code, blk-mq doesn't initialize cmd_len with a
default value, so don't rely on it being set in sd_setup_write_same_cmnd.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Wed, 25 Jun 2014 22:54:25 +0000 (16:54 -0600)]
scsi: reintroduce scsi_driver.init_command
Instead of letting the ULD play games with the prep_fn, move back to
the model of a central prep_fn with a callback to the ULD. This
already cleans up and shortens the code by itself, and will be required
to properly support blk-mq in the SCSI midlayer.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nicholas Bellinger <nab@linux-iscsi.org>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Christoph Hellwig [Thu, 1 May 2014 14:51:03 +0000 (16:51 +0200)]
scsi: remove scsi_end_request
By folding scsi_end_request into its only caller we can significantly clean
up the completion logic. We can use simple goto labels now to only have
a single place to finish or requeue commands instead of the previous
convoluted logic.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nicholas Bellinger <nab@linux-iscsi.org>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Christoph Hellwig [Wed, 25 Jun 2014 22:53:13 +0000 (16:53 -0600)]
scsi: explicitly release bidi buffers
Instead of trying to guess when we have a BIDI buffer in scsi_release_buffers,
add a function to explicitly free the BIDI resources in the one place that
handles them. This avoids needing a special __scsi_release_buffers for the
case where we already have freed the request as well.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Alan Stern [Wed, 25 Jun 2014 22:50:33 +0000 (16:50 -0600)]
Fix command result state propagation
We're seeing a case where the contents of scmd->result aren't being reset after
a SCSI command encounters an error, is resubmitted, times out and then gets
handled. The error handler acts on the stale result of the previous error
instead of the timeout. Fix this by properly zeroing the scmd->status before
the command is resubmitted.
Signed-off-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Christoph Hellwig [Tue, 15 Apr 2014 10:24:56 +0000 (12:24 +0200)]
don't reference freed command in scsi_prep_return
Commit 0479633686d370303e3430256ace4bd5f7f138dc ("[SCSI] do not manipulate
device reference counts in scsi_get/put_command", by Christoph Hellwig,
2014-02-20) introduced a use-after-free: in the kill case of
scsi_prep_return we have to release our device reference, but we do this
trying to reference the just-freed command. Use the local sdev pointer
instead.
Fixes: 0479633686d370303e3430256ace4bd5f7f138dc ("[SCSI] do not manipulate device reference counts in scsi_get/put_command")
Reported-by: Joe Lawrence <joe.lawrence@stratus.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Christoph Hellwig [Tue, 15 Apr 2014 10:24:55 +0000 (12:24 +0200)]
don't reference freed command in scsi_init_sgtable
Commit 0479633686d370303e3430256ace4bd5f7f138dc ("[SCSI] do not manipulate
device reference counts in scsi_get/put_command", by Christoph Hellwig,
2014-02-20) introduced a use-after-free: when scsi_init_io fails we have to
release our device reference, but we do this trying to reference the
just-freed command. Add a local scsi_device pointer to fix this.
Fixes: 0479633686d370303e3430256ace4bd5f7f138dc ("[SCSI] do not manipulate device reference counts in scsi_get/put_command")
Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Christoph Hellwig [Wed, 25 Jun 2014 23:11:32 +0000 (17:11 -0600)]
add support for per-host cmd pools
This allows drivers to specify the size of their per-command private
data in the host template and then get extra memory allocated for
each command instead of needing another allocation in ->queuecommand.
With the current SCSI code that already does multiple allocations for
each command this probably doesn't make a big performance impact, but
it allows us to clean up the drivers, and prepares them for using the
blk-mq infrastructure where the common allocation will make a difference.
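A sketch of what the host-template side of this looks like (driver name,
private struct and values are hypothetical):

    #include <linux/module.h>
    #include <scsi/scsi_host.h>

    struct example_cmd_priv {
            u32 handle;     /* whatever the LLDD needs per command */
    };

    static int example_queuecommand(struct Scsi_Host *host,
                                    struct scsi_cmnd *cmd);

    static struct scsi_host_template example_template = {
            .module       = THIS_MODULE,
            .name         = "example",
            .queuecommand = example_queuecommand,
            .can_queue    = 64,
            .cmd_size     = sizeof(struct example_cmd_priv),
    };

The midlayer then hands each command to ->queuecommand with cmd_size bytes
of per-command memory already allocated alongside it.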
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Christoph Hellwig [Thu, 20 Feb 2014 22:20:59 +0000 (14:20 -0800)]
megaraid: simplify internal command handling
We don't use the passed-in scsi command for anything, so just add an
adapter-wide internal status to go along with the internal scb that is used
under int_mtx to pass back the return value, and get rid of all the complexities
and abuse of the scsi_cmnd structure.
This gets rid of the only user of scsi_allocate_command/scsi_free_command,
which can now be removed.
[jejb: checkpatch fixes]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Adam Radford <aradford@gmail.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Christoph Hellwig [Thu, 20 Feb 2014 22:20:58 +0000 (14:20 -0800)]
remove a useless get/put_device pair in scsi_requeue_command
Avoid a spurious device get/put pair by cleaning up scsi_requeue_command
and folding scsi_unprep_request into it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Bart Van Assche [Thu, 20 Feb 2014 22:20:57 +0000 (14:20 -0800)]
remove a useless get/put_device pair in scsi_next_command
Eliminate a get_device() / put_device() pair from scsi_next_command().
Both are atomic operations hence removing these slightly improves
performance.
[hch: slight changes due to different context]
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Bart Van Assche [Thu, 20 Feb 2014 22:20:56 +0000 (14:20 -0800)]
remove a useless get/put_device pair in scsi_request_fn
SCSI devices may only be removed by calling scsi_remove_device().
That function must invoke blk_cleanup_queue() before the final put
of sdev->sdev_gendev. Since blk_cleanup_queue() waits for the
block queue to drain and then tears it down, scsi_request_fn cannot
be active anymore after blk_cleanup_queue() has returned and hence
the get_device()/put_device() pair in scsi_request_fn is unnecessary.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Christoph Hellwig [Wed, 25 Jun 2014 23:10:31 +0000 (17:10 -0600)]
do not manipulate device reference counts in scsi_get/put_command
Many callers won't need this and we can optimize them away. In addition
the handling in the __-prefixed variants was inconsistent to start with.
Based on an earlier patch from Bart Van Assche.
[jejb: fix kerneldoc problem picked up by Fengguang Wu]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Hannes Reinecke [Mon, 11 Nov 2013 12:44:54 +0000 (13:44 +0100)]
improved eh timeout handler
When a command runs into a timeout we need to send an 'ABORT TASK'
TMF. This is typically done by the 'eh_abort_handler' LLDD callback.
Conceptually, however, this function is a normal SCSI command, so
there is no need to enter the error handler.
This patch implements a new scsi_abort_command() function which
invokes an asynchronous function scsi_eh_abort_handler() to
abort the commands via the usual 'eh_abort_handler'.
If abort succeeds the command is either retried or terminated,
depending on the number of allowed retries. However, 'eh_eflags'
records the abort, so if the retry fails again the
command is pushed onto the error handler without trying to
abort it (again); it will then be cleaned up by SCSI EH.
[hare: smatch detected stray switch fixed]
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
James Bottomley [Thu, 26 Jun 2014 01:58:28 +0000 (19:58 -0600)]
Fix spurious request sense in error handling
We unconditionally execute scsi_eh_get_sense() to make sure all failed
commands that should have sense attached, do. However, the routine forgets
that some commands, because of the way they fail, will not have any sense code
... we should not bother them with a REQUEST_SENSE command. Fix this by
testing to see if we actually got a CHECK_CONDITION return and skip asking for
sense if we don't.
Tested-by: Alan Stern <stern@rowland.harvard.edu>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Hannes Reinecke [Thu, 26 Jun 2014 01:57:35 +0000 (19:57 -0600)]
Add 'eh_deadline' to limit SCSI EH runtime
This patch adds an 'eh_deadline' sysfs attribute to the scsi
host which limits the overall runtime of the SCSI EH.
The 'eh_deadline' value is stored in the now obsolete field
'resetting'.
When a command is failed the start time of the EH is stored
in 'last_reset'. If the overall runtime of the SCSI EH is longer
than last_reset + eh_deadline, the EH is short-circuited and
falls through to issue a host reset only.
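The deadline test this describes boils down to something like the
following sketch (simplified, with the values passed in rather than read
from the Scsi_Host fields named above):

    #include <linux/jiffies.h>

    /* eh_deadline == 0 means "no limit" */
    static bool example_past_eh_deadline(unsigned long last_reset,
                                         unsigned long eh_deadline)
    {
            return eh_deadline &&
                   time_after(jiffies, last_reset + eh_deadline);
    }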
[jejb: add comments in Scsi_Host about new fields]
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Hannes Reinecke [Wed, 23 Oct 2013 08:51:20 +0000 (10:51 +0200)]
remove check for 'resetting'
Field is now unused, so this is dead code.
[jejb: remove resetting and last_reset from Scsi_Host]
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Hannes Reinecke [Wed, 23 Oct 2013 08:51:19 +0000 (10:51 +0200)]
dc395: Move 'last_reset' into internal host structure
'last_reset' is only used internally, so move it into the
internal host structure.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Hannes Reinecke [Wed, 23 Oct 2013 08:51:18 +0000 (10:51 +0200)]
tmscsim: Move 'last_reset' into host structure
The 'last_reset' value is only used internally, so move it into
the internal host structure.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Hannes Reinecke [Wed, 23 Oct 2013 08:51:17 +0000 (10:51 +0200)]
advansys: Remove 'last_reset' references
Serves no purpose whatsoever.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Christoph Hellwig [Thu, 20 Feb 2014 22:20:54 +0000 (14:20 -0800)]
avoid taking host_lock in scsi_run_queue unless necessary
If we don't have starved devices we don't need to take the host lock
to iterate over them. Also split the function up to be more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Christoph Hellwig [Wed, 25 Jun 2014 22:45:56 +0000 (16:45 -0600)]
avoid useless free_list lock roundtrips
Avoid hitting the host-wide free_list lock unless we need to put a command
back onto the freelist.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Bart Van Assche [Tue, 2 Jul 2013 13:06:33 +0000 (15:06 +0200)]
enable destruction of blocked devices which fail LUN scanning
If something goes wrong during LUN scanning, e.g. a transport layer
failure occurs, then __scsi_remove_device() can get invoked by the
LUN scanning code for a SCSI device in state SDEV_CREATED_BLOCK and
before the SCSI device has been added to sysfs (is_visible == 0).
Make sure that even in this case the transition into state SDEV_DEL
occurs. This prevents __scsi_remove_device() from being invoked a
second time by scsi_forget_host() if that function is invoked
from a different thread than the one performing LUN scanning.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Eiichi Tsukata [Tue, 11 Feb 2014 05:29:52 +0000 (14:29 +0900)]
Add timeout to avoid infinite command retry
Currently, scsi error handling in scsi_io_completion() tries to
unconditionally requeue the scsi command when the device remains in an error
state. For example, UNIT_ATTENTION causes infinite retry with
action == ACTION_RETRY.
This is because retryable errors are assumed to be temporary, in which case
such a retry policy is appropriate since the device will soon recover.
But there is no guarantee that the device is able to recover from the error
state immediately; some hardware errors can prevent the device from recovering.
This patch adds a timeout in scsi_io_completion() to avoid infinite command
retries. Once a scsi command has been retried for longer than this timeout,
the command is treated as a failure.
Signed-off-by: Eiichi Tsukata <eiichi.tsukata.xh@hitachi.com>
Signed-off-by: James Bottomley <JBottomley@Parallels.com>
Christoph Hellwig [Thu, 26 Jun 2014 01:51:41 +0000 (19:51 -0600)]
scsi: handle command allocation failure in scsi_reset_provider
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Nicholas Bellinger <nab@linux-iscsi.org>
Reviewed-by: Mike Christie <michaelc@cs.wisc.edu>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Sam Bradshaw [Wed, 18 Mar 2015 23:06:18 +0000 (17:06 -0600)]
blkmq: Fix NULL pointer deref when all reserved tags are in use
When allocating from the reserved tags pool, bt_get() is called with
a NULL hctx. If all tags are in use, the hw queue is kicked to push
out any pending IO, potentially freeing tags, and tag allocation is
retried. The problem is that blk_mq_run_hw_queue() doesn't check for
a NULL hctx. So we avoid it with a simple NULL hctx test.
Tested by hammering mtip32xx with concurrent smartctl/hdparm.
Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Selvan Mani <smani@micron.com>
Fixes: b32232073e80 ("blk-mq: fix hang in bt_get()")
Cc: stable@kernel.org
Added appropriate comment.
Signed-off-by: Jens Axboe <axboe@fb.com>
Wenbo Wang [Fri, 20 Mar 2015 05:04:54 +0000 (01:04 -0400)]
Fix bug in blk_rq_merge_ok
Use the right array index to reference the last element of rq->biotail->bi_io_vec[].
Signed-off-by: Wenbo Wang <wenbo.wang@memblaze.com>
Reviewed-by: Chong Yuan <chong.yuan@memblaze.com>
Fixes: 66cb45aa41315 ("block: add support for limiting gaps in SG lists")
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Tony Battersby [Wed, 11 Feb 2015 16:32:30 +0000 (11:32 -0500)]
blk-mq: fix double-free in error path
If the allocation of bt->bs fails, then bt->map can be freed twice, once
in blk_mq_init_bitmap_tags() -> bt_alloc(), and once in
blk_mq_init_bitmap_tags() -> bt_free(). Fix by setting the pointer to
NULL after the first free.
Cc: <stable@vger.kernel.org>
Signed-off-by: Tony Battersby <tonyb@cybernetics.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 14 Jan 2015 15:49:55 +0000 (08:49 -0700)]
blk-mq: fix false negative out-of-tags condition
The blk-mq tagging tries to maintain some locality between CPUs and
the tags issued. The tags are split into groups of words, and the
words may not be fully populated. When searching for a new free tag,
blk-mq may look at partial words, hence it passes in an offset/size
to find_next_zero_bit(). However, it does that wrong: the size must
always be the full length of the number of tags in that word,
otherwise we'll potentially miss some near the end.
Another issue is when __bt_get() goes from one word set to the next.
It bumps the index, but not the last_tag associated with the
previous index. Bump that to be in the range of the new word.
Finally, clean up __bt_get() and __bt_get_word() a bit and get
rid of the goto in there, and the unnecessary 'wrap' variable.
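To illustrate the size-argument pitfall (toy code, not the patch):
find_next_zero_bit() takes the total bitmap length in bits, so a scan
starting at an offset must still pass the full per-word depth:

    #include <linux/bitops.h>

    static unsigned int example_scan_word(unsigned long *word,
                                          unsigned int tags_per_word,
                                          unsigned int last_tag)
    {
            /* correct: searches [last_tag, tags_per_word) */
            return find_next_zero_bit(word, tags_per_word, last_tag);
    }

Passing anything smaller than tags_per_word as the second argument silently
truncates the search and misses free tags near the end of the word.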
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Sat, 6 Sep 2014 23:08:05 +0000 (16:08 -0700)]
block: remove artificial max_hw_sectors cap
Set max_sectors to the value the driver provides as its hardware limit by
default. Linux has had proper I/O throttling for a long time and doesn't
rely on an artificially small maximum I/O size anymore. By not limiting
the I/O size by default we remove an annoying tuning step required for
most Linux installations.
Note that both the user, and if absolutely required the driver can still
impose a limit for FS requests below max_hw_sectors_kb.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 21 Jan 2015 18:57:04 +0000 (11:57 -0700)]
blk-mq: add blk_mq_queue_freeze_start()
This is a partial backport from mainline to expose the same
API as newer kernels. On forward port, this patch can be dropped.
Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Thu, 8 Jan 2015 15:59:53 +0000 (08:59 -0700)]
blk-mq: End unstarted requests on a dying queue
Requests that haven't been started prior to a queue dying can be ended
in error without waiting for them to start and time out.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Added code comment to explain why this is done.
Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Thu, 8 Jan 2015 01:55:46 +0000 (18:55 -0700)]
blk-mq: Allow requests to never expire
Some types of requests may be started that are not guaranteed to ever
complete. This adds a request flag that a driver can use to mark the
request as such.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 21 Jan 2015 18:43:07 +0000 (11:43 -0700)]
blk-mq: Add helper to abort requeued requests
Adds a helper function a driver can use to abort requeued requests in
case any are pending when h/w queues are being removed.
Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Thu, 8 Jan 2015 01:55:44 +0000 (18:55 -0700)]
blk-mq: Let drivers cancel requeue_work
Kicking requeued requests will start h/w queues in a work_queue, which
may alter the driver's requested state to temporarily stop them. This
patch exports a method to cancel the q->requeue_work so a driver can be
assured stopped h/w queues won't be started up before it is ready.
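The intended driver-side sequence, per the description above (a sketch;
the surrounding driver context is hypothetical):

    #include <linux/blk-mq.h>

    static void example_quiesce(struct request_queue *q)
    {
            blk_mq_stop_hw_queues(q);
            /* a pending requeue kick can no longer restart the queues */
            blk_mq_cancel_requeue_work(q);
    }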
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Thu, 8 Jan 2015 01:55:43 +0000 (18:55 -0700)]
blk-mq: Export if requests were started
Drivers can iterate over all allocated request tags, but their callback
needs a way to know if the driver started the request in the first place.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Thu, 8 Jan 2015 15:53:56 +0000 (08:53 -0700)]
blk-mq: Wake tasks entering queue on dying
When the queue is set to dying, wake up tasks that are waiting on frozen
queue so they realize it is dying and abandon their request.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Modified by me to add a code comment on the need for the wakeup.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Mon, 22 Dec 2014 21:04:42 +0000 (14:04 -0700)]
block: wake up waiters when a queue is marked dying
If the queue is dying, we can't expect new requests to complete and come
in and wake up other tasks waiting for requests. So after we
have marked it as dying, wake up everybody currently waiting
for a request. Once they wake, they will retry their allocation
and fail appropriately due to the state of the queue.
Tested-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Keith Busch [Sun, 21 Dec 2014 00:31:43 +0000 (16:31 -0800)]
blk-mq: Exit queue on alloc failure
Fixes the usage counter when a request could not be allocated.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Mon, 15 Dec 2014 15:35:47 +0000 (07:35 -0800)]
Revert "blk-mq: Micro-optimize bt_get()"
This reverts commit 4a19947cb6bcc2bafa43398f4b276a8a2200c1ae.
The optimization is only really safe for a single queue, otherwise
'bs' and 'bt' can indeed change, and if we don't do a finish_wait()
for each loop, we'll potentially change the wait structure and
corrupt the task wait list.
Reported-by: Jan Kara <jack@suse.cz>
Takashi Iwai [Wed, 10 Dec 2014 16:05:25 +0000 (08:05 -0800)]
blk-mq: Fix uninitialized kobject at CPU hotplugging
When a CPU is hotplugged, the current blk-mq spews a warning like:
kobject '(null)' (ffffe8ffffc8b5d8): tried to add an uninitialized object, something is seriously wrong.
CPU: 1 PID: 1386 Comm: systemd-udevd Not tainted 3.18.0-rc7-2.g088d59b-default #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_171129-lamiak 04/01/2014
0000000000000000 0000000000000002 ffffffff81605f07 ffffe8ffffc8b5d8
ffffffff8132c7a0 ffff88023341d370 0000000000000020 ffff8800bb05bd58
ffff8800bb05bd08 000000000000a0a0 000000003f441940 0000000000000007
Call Trace:
[<ffffffff81005306>] dump_trace+0x86/0x330
[<ffffffff81005644>] show_stack_log_lvl+0x94/0x170
[<ffffffff81006d21>] show_stack+0x21/0x50
[<ffffffff81605f07>] dump_stack+0x41/0x51
[<ffffffff8132c7a0>] kobject_add+0xa0/0xb0
[<ffffffff8130aee1>] blk_mq_register_hctx+0x91/0xb0
[<ffffffff8130b82e>] blk_mq_sysfs_register+0x3e/0x60
[<ffffffff81309298>] blk_mq_queue_reinit_notify+0xf8/0x190
[<ffffffff8107cfdc>] notifier_call_chain+0x4c/0x70
[<ffffffff8105fd23>] cpu_notify+0x23/0x50
[<ffffffff81060037>] _cpu_up+0x157/0x170
[<ffffffff810600d9>] cpu_up+0x89/0xb0
[<ffffffff815fa5b5>] cpu_subsys_online+0x35/0x80
[<ffffffff814323cd>] device_online+0x5d/0xa0
[<ffffffff81432485>] online_store+0x75/0x80
[<ffffffff81236a5a>] kernfs_fop_write+0xda/0x150
[<ffffffff811c5532>] vfs_write+0xb2/0x1f0
[<ffffffff811c5f42>] SyS_write+0x42/0xb0
[<ffffffff8160c4ed>] system_call_fastpath+0x16/0x1b
[<00007f0132fb24e0>] 0x7f0132fb24e0
This is indeed because of an uninitialized kobject for blk_mq_ctx.
The blk_mq_ctx kobjects are initialized in blk_mq_sysfs_init(), but it
loops over hctx_for_each_ctx(), i.e. it initializes them only for
online CPUs. Thus, when a CPU is hotplugged, the ctx for the newly
onlined CPU is registered without initialization.
This patch fixes the issue by initializing all the ctx kobjects
belonging to each queue.
Bugzilla: https://bugzilla.novell.com/show_bug.cgi?id=908794
Cc: <stable@vger.kernel.org>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Tue, 9 Dec 2014 15:59:21 +0000 (16:59 +0100)]
blk-mq: Use all available hardware queues
Suppose that a system has two CPU sockets, three cores per socket,
that it does not support hyperthreading and that four hardware
queues are provided by a block driver. With the current algorithm
this will lead to the following assignment of CPU cores to hardware
queues:
HWQ 0: 0 1
HWQ 1: 2 3
HWQ 2: 4 5
HWQ 3: (none)
This patch changes the queue assignment into:
HWQ 0: 0 1
HWQ 1: 2
HWQ 2: 3 4
HWQ 3: 5
In other words, this patch has the following three effects:
- All four hardware queues are used instead of only three.
- CPU cores are spread more evenly over hardware queues. For the
above example the range of the number of CPU cores associated
with a single HWQ is reduced from [0..2] to [1..2].
- If the number of HWQ's is a multiple of the number of CPU sockets
it is now guaranteed that all CPU cores associated with a single
HWQ reside on the same CPU socket.
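A toy model of the even spread (this is not the kernel's actual mapping
code, just the arithmetic it is aiming for): plain floor division
distributes the cores so that queue sizes differ by at most one,

    static unsigned int example_cpu_to_queue(unsigned int cpu,
                                             unsigned int nr_cpus,
                                             unsigned int nr_queues)
    {
            return cpu * nr_queues / nr_cpus;
    }

which for the six-core, four-queue example above yields exactly the
HWQ 0: {0 1}, HWQ 1: {2}, HWQ 2: {3 4}, HWQ 3: {5} assignment.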
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Sagi Grimberg <sagig@mellanox.com>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Tue, 9 Dec 2014 15:59:48 +0000 (16:59 +0100)]
blk-mq: Micro-optimize bt_get()
Remove a superfluous finish_wait() call. Convert the two bt_wait_ptr()
calls into a single call.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Tue, 9 Dec 2014 15:58:35 +0000 (16:58 +0100)]
blk-mq: Fix a race between bt_clear_tag() and bt_get()
What we need are the following two guarantees:
* Any thread that observes the effect of the test_and_set_bit() by
__bt_get_word() also observes the preceding addition of 'current'
to the appropriate wait list. This is guaranteed by the semantics
of the spin_unlock() operation performed by prepare_to_wait().
Hence the conversion of test_and_set_bit_lock() into
test_and_set_bit().
* The wait lists are examined by bt_clear_tag() after the tag bit has
been cleared. clear_bit_unlock() guarantees that any thread that
observes that the bit has been cleared also observes the store
operations preceding clear_bit_unlock(). However,
clear_bit_unlock() does not prevent the wait lists from being examined
before the tag bit is cleared. Hence the addition of a memory
barrier between clear_bit() and the wait list examination.
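A sketch of the ordering on the clear side (simplified, wake-up details
elided; names hypothetical):

    #include <linux/bitops.h>

    static void example_clear_tag(unsigned long *map, unsigned int tag)
    {
            clear_bit(tag, map);
            /* ensure the wait-list reads below happen after the clear */
            smp_mb();
            /* ... pick a wait queue and wake one waiter ... */
    }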
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Tue, 9 Dec 2014 15:58:11 +0000 (16:58 +0100)]
blk-mq: Avoid that __bt_get_word() wraps multiple times
If __bt_get_word() is called with last_tag != 0, if the first
find_next_zero_bit() fails, if after wrap-around the
test_and_set_bit() call fails and find_next_zero_bit() succeeds,
if the next test_and_set_bit() call fails and subsequently
find_next_zero_bit() does not find a zero bit, then another
wrap-around will occur. Avoid this by introducing an additional
local variable.
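A sketch of the resulting single-wrap scan (hypothetical helper, not the
patch itself): a local flag remembers whether the search has already
wrapped, so the bitmap is covered at most once more:

    #include <linux/bitops.h>

    static int example_get_free_bit(unsigned long *map, unsigned int depth,
                                    unsigned int last_tag)
    {
            unsigned int bit = last_tag;
            bool wrapped = false;

            for (;;) {
                    bit = find_next_zero_bit(map, depth, bit);
                    if (bit >= depth) {
                            if (wrapped)
                                    return -1;  /* scanned everything once */
                            wrapped = true;
                            bit = 0;
                            continue;
                    }
                    if (!test_and_set_bit(bit, map))
                            return bit;         /* claimed this tag */
                    bit++;
            }
    }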
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Tue, 9 Dec 2014 16:20:27 +0000 (08:20 -0800)]
blk-mq: Fix a use-after-free
blk-mq users are allowed to free the memory request_queue.tag_set
points at after blk_cleanup_queue() has finished but before
blk_release_queue() has started. This can happen e.g. in the SCSI
core. The SCSI core embeds the tag_set structure in a SCSI
host structure. The SCSI host structure is freed by
scsi_host_dev_release(). This function is called after
blk_cleanup_queue() finished but can be called before
blk_release_queue().
This means that it is not safe to access request_queue.tag_set from
inside blk_release_queue(). Hence remove the blk_sync_queue() call
from blk_release_queue(). This call is not necessary - outstanding
requests must have finished before blk_release_queue() is
called. Additionally, move the blk_mq_free_queue() call from
blk_release_queue() to blk_cleanup_queue() to avoid accessing struct
request_queue.tag_set after it has been freed.
This patch avoids the following kernel oops, which can be triggered
when deleting a SCSI host for which scsi-mq was enabled:
Call Trace:
[<ffffffff8109a7c4>] lock_acquire+0xc4/0x270
[<ffffffff814ce111>] mutex_lock_nested+0x61/0x380
[<ffffffff812575f0>] blk_mq_free_queue+0x30/0x180
[<ffffffff8124d654>] blk_release_queue+0x84/0xd0
[<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
[<ffffffff8126c140>] kobject_put+0x30/0x70
[<ffffffff81245895>] blk_put_queue+0x15/0x20
[<ffffffff8125c409>] disk_release+0x99/0xd0
[<ffffffff8133d056>] device_release+0x36/0xb0
[<ffffffff8126c29b>] kobject_cleanup+0x7b/0x1a0
[<ffffffff8126c140>] kobject_put+0x30/0x70
[<ffffffff8125a78a>] put_disk+0x1a/0x20
[<ffffffff811d4cb5>] __blkdev_put+0x135/0x1b0
[<ffffffff811d56a0>] blkdev_put+0x50/0x160
[<ffffffff81199eb4>] kill_block_super+0x44/0x70
[<ffffffff8119a2a4>] deactivate_locked_super+0x44/0x60
[<ffffffff8119a87e>] deactivate_super+0x4e/0x70
[<ffffffff811b9833>] cleanup_mnt+0x43/0x90
[<ffffffff811b98d2>] __cleanup_mnt+0x12/0x20
[<ffffffff8107252c>] task_work_run+0xac/0xe0
[<ffffffff81002c01>] do_notify_resume+0x61/0xa0
[<ffffffff814d2c58>] int_signal+0x12/0x17
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Robert Elliott <elliott@hp.com>
Cc: Ming Lei <ming.lei@canonical.com>
Cc: Alexander Gordeev <agordeev@redhat.com>
Cc: <stable@vger.kernel.org> # v3.13+
Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Wed, 3 Dec 2014 11:38:04 +0000 (19:38 +0800)]
blk-mq: prevent unmapped hw queue from being scheduled
When a hardware queue has no mapped software queues, it
shouldn't be scheduled. Otherwise a WARNING or OOPS can
be triggered.
The blk_mq_hw_queue_mapped() helper is introduced to fix
the problem.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Mon, 24 Nov 2014 08:27:23 +0000 (09:27 +0100)]
blk-mq: handle the single queue case in blk_mq_hctx_next_cpu
Don't duplicate the code that handles the case where no CPU bouncing is
needed in the caller; do it inside blk_mq_hctx_next_cpu instead.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Paolo Bonzini [Fri, 7 Nov 2014 22:04:00 +0000 (23:04 +0100)]
blk-mq: use get_cpu/put_cpu instead of preempt_disable/preempt_enable
blk-mq is using preempt_disable/enable in order to ensure that the
queue runners are placed on the right CPU. This does not work with
the RT patches, because __blk_mq_run_hw_queue takes a non-raw
spinlock within the preemption-disabled region. If there is contention
on the lock, this violates the rules for preemption-disabled regions.
While this should be easily fixable within the RT patches just by doing
migrate_disable/enable, we can do better and document _why_ this
particular region runs with disabled preemption. After the previous
patch, it is trivial to switch it to get/put_cpu; the RT patches then
can change it to get_cpu_light, which lets virtio-blk run under RT
kernels.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Clark Williams <williams@redhat.com>
Tested-by: Clark Williams <williams@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Paolo Bonzini [Fri, 7 Nov 2014 22:03:59 +0000 (23:03 +0100)]
blk_mq: call preempt_disable/enable in blk_mq_run_hw_queue, and only if needed
preempt_disable/enable surrounds every call to blk_mq_run_hw_queue,
except the one in blk-flush.c. In fact that one is always asynchronous,
and it does not need smp_processor_id().
We can do the same for all other calls, avoiding preempt_disable when
async is true. This avoids peppering blk-mq.c with preemption-disabled
regions.
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Clark Williams <williams@redhat.com>
Tested-by: Clark Williams <williams@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Sun, 19 Oct 2014 15:13:57 +0000 (17:13 +0200)]
Revert "block: all blk-mq requests are tagged"
This reverts commit fb3ccb5da71273e7f0d50b50bc879e50cedd60e7.
SCSI-2/SPI actually needs the tagged/untagged flag in the request to
work properly. Revert this patch and add a follow-on to set it in
the right place.
Reported-by: Meelis Roos <mroos@linux.ee>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org
Shaohua Li [Mon, 1 Dec 2014 00:00:58 +0000 (16:00 -0800)]
blk-mq: move the kdump check to blk_mq_alloc_tag_set
We call blk_mq_alloc_tag_set() first, then blk_mq_init_queue(). The requests
are allocated in the former function, so the kdump check should be moved there
to really save memory.
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Mon, 8 Dec 2014 15:49:06 +0000 (08:49 -0700)]
blk-mq: re-check for available tags after running the hardware queue
If we run out of tags and have to sleep, we run the hardware queue
to kick pending IO into gear. During that run, we may have completed
requests, so re-check if we have free tags before going to sleep.
Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Mon, 8 Dec 2014 15:46:34 +0000 (08:46 -0700)]
blk-mq: fix hang in bt_get()
If there are fewer hardware queues than CPU threads, bt_get() can hang.
Avoid that. The symptoms of the hang were as follows:
* All tags allocated for a particular hardware queue.
* (nr_tags) pending commands for that hardware queue.
* No pending commands for the software queues associated with that
hardware queue.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Tue, 18 Nov 2014 22:20:15 +0000 (15:20 -0700)]
blk-mq: GPL export blk_mq_{freeze,unfreeze}_queue()
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Mon, 17 Nov 2014 17:41:57 +0000 (10:41 -0700)]
blk-mq: add blk_mq_free_hctx_request()
It's silly to use blk_mq_free_request(), which in turn maps the
request to the hardware queue, for places where we already know
what the hardware queue is. This saves us an extra mapping of a
hardware queue on request completion, if the caller knows this
information already.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Tue, 18 Nov 2014 20:29:52 +0000 (13:29 -0700)]
blk-mq: export blk_mq_free_request()
Drivers that know they are blk-mq should just use this function
instead of calling through blk_put_request().
Signed-off-by: Jens Axboe <axboe@fb.com>
Bart Van Assche [Tue, 7 Oct 2014 14:45:21 +0000 (08:45 -0600)]
blk-mq: Make bt_clear_tag() easier to read
Eliminate a backwards goto statement from bt_clear_tag().
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Tue, 7 Oct 2014 14:39:20 +0000 (08:39 -0600)]
blk-mq: fix potential hang if rolling wakeup depth is too high
We currently divide the queue depth by 4 as our batch wakeup
count, but we split the wakeups over BT_WAIT_QUEUES number of
wait queues. This defaults to 8. If the product of the resulting
batch wake count and BT_WAIT_QUEUES is higher than the device
queue depth, we can get into a situation where a task goes to
sleep waiting for a request, but never gets woken up.
Reported-by: Bart Van Assche <bvanassche@acm.org>
Fixes: 4bb659b156996
Cc: stable@kernel.org
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Mon, 22 Sep 2014 16:21:48 +0000 (10:21 -0600)]
block: fix blk_abort_request on blk-mq
Signed-off-by: Christoph Hellwig <hch@lst.de>
Moved blk_mq_rq_timed_out() definition to the private blk-mq.h header.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Fri, 19 Sep 2014 19:07:57 +0000 (12:07 -0700)]
blk-mq: start hardware queues when running requeue work
Requests are often requeued because of resource shortages (on
either the sw or hw side), and this often leads to the hw queue
in question being stopped. When we run the requeue work, ensure
that we start queues, otherwise the queue run will be a no-op.
Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Fri, 19 Sep 2014 13:53:46 +0000 (21:53 +0800)]
blk-timeout: fix blk_add_timer
Commit 8cb34819cdd5d ("blk-mq: unshared timeout handler") introduced
blk-mq's own timeout handler, and removed the following line:
blk_queue_rq_timed_out(q, blk_mq_rq_timed_out);
which then causes blk_add_timer() to bypass adding the timer,
since blk-mq no longer has q->rq_timed_out_fn defined.
This patch fixes the problem by bypassing the check for blk-mq,
so that both request deadlines are still set and the rolling
timer updated.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Fri, 19 Sep 2014 14:04:53 +0000 (08:04 -0600)]
blk-mq: fix potential oops on out-of-memory in __blk_mq_alloc_rq_maps()
__blk_mq_alloc_rq_maps() can be invoked multiple times, if we scale
back the queue depth if we are low on memory. So don't clear
set->tags when we fail, this is handled directly in
the parent function, blk_mq_alloc_tag_set().
Reported-by: Robert Elliott <Elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Tue, 16 Sep 2014 21:44:07 +0000 (14:44 -0700)]
blk-mq: avoid infinite recursion with the FUA flag
We should not insert requests into the flush state machine from
blk_mq_insert_request. All incoming flush requests come through
blk_{m,s}q_make_request and are handled there, while blk_execute_rq_nowait
should only be called for BLOCK_PC requests. All other callers
deal with requests that already went through the flush state machine
and shouldn't be reinserted into it.
Reported-by: Robert Elliott <Elliott@hp.com>
Debugged-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
David Hildenbrand [Thu, 18 Sep 2014 09:04:31 +0000 (11:04 +0200)]
blk-mq: Avoid race condition with uninitialized requests
This patch should fix the bug reported in https://lkml.org/lkml/2014/9/11/249.
We have to initialize at least the atomic_flags and the cmd_flags when
allocating storage for the requests.
Otherwise blk_mq_timeout_check() might dereference uninitialized
pointers when racing with the creation of a request.
Also move the reset of cmd_flags from the initialization code to the point
where a request is freed. So we will never end up with pending flush
request indicators that might trigger dereferences of invalid pointers
in blk_mq_timeout_check().
Cc: stable@vger.kernel.org
Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 17 Sep 2014 14:27:03 +0000 (08:27 -0600)]
blk-mq: limit memory consumption if a crash dump is active
It's not uncommon for crash dump kernels to be limited to 128MB or
something low in that area. This is normally not a problem for
devices as we don't use that much memory, but for some shared SCSI
setups with huge queue depths, it can potentially fill most of
memory with tons of request allocations. blk-mq does scale back
when it fails to allocate memory, but it scales back just enough
so that blk-mq succeeds. This could still leave the system with
not enough memory to make any real progress.
Check if we are in a kdump environment and limit the hardware
queues and tag depth.
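The clamp this describes amounts to something like the following sketch
(the constants are illustrative, not the kernel's):

    #include <linux/crash_dump.h>
    #include <linux/kernel.h>

    static void example_clamp_for_kdump(unsigned int *nr_hw_queues,
                                        unsigned int *queue_depth)
    {
            if (is_kdump_kernel()) {
                    *nr_hw_queues = 1;
                    *queue_depth = min(*queue_depth, 64U);
            }
    }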
Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Wed, 17 Sep 2014 09:47:58 +0000 (17:47 +0800)]
blk-mq: remove unnecessary blk_clear_rq_complete()
This patch removes two unnecessary blk_clear_rq_complete() calls;
the REQ_ATOM_COMPLETE flag is cleared inside blk_mq_start_request(),
so:
- The blk_clear_rq_complete() in blk_flush_restore_request() isn't
needed because the request will be freed later, and clearing
it here may open a small race window with the timeout handler.
- The blk_clear_rq_complete() in blk_mq_requeue_request() isn't
necessary either: even though REQ_ATOM_STARTED is cleared in
__blk_mq_requeue_request(), in theory it still may cause a small
race window with the timeout handler since the two clear_bit() calls
may be reordered.
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Tue, 16 Sep 2014 16:37:37 +0000 (10:37 -0600)]
blk-mq: request deadline must be visible before marking rq as started
When we start the request, we set the deadline and flip the bits
marking the request as started and non-complete. However, it's
important that the deadline store is ordered before flipping the
bits, otherwise we could have a small window where the request is
marked started but with an invalid deadline. This can confuse the
timeout handling.
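The required ordering, as a sketch with simplified field and flag names
(the real request fields differ):

    #include <linux/jiffies.h>
    #include <linux/bitops.h>

    struct example_rq {
            unsigned long deadline;
            unsigned long atomic_flags;
    };
    #define EXAMPLE_ATOM_STARTED 0

    static void example_start_request(struct example_rq *rq,
                                      unsigned long timeout)
    {
            rq->deadline = jiffies + timeout;
            /* the deadline store must be visible before the STARTED bit */
            smp_wmb();
            set_bit(EXAMPLE_ATOM_STARTED, &rq->atomic_flags);
    }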
Suggested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Sat, 13 Sep 2014 23:40:13 +0000 (16:40 -0700)]
blk-mq: pass a reserved argument to the timeout handler
Allow blk-mq to pass an argument to the timeout handler to indicate
if we're timing out a reserved or regular command. For many drivers
those need to be handled differently.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Sat, 13 Sep 2014 23:40:12 +0000 (16:40 -0700)]
blk-mq: unshared timeout handler
Duplicate the (small) timeout handler in blk-mq so that we can pass
arguments more easily to the driver timeout handler. This enables
the next patch.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Sat, 13 Sep 2014 23:40:11 +0000 (16:40 -0700)]
blk-mq: fix and simplify tag iteration for the timeout handler
Don't do a kmalloc from the timer to handle timeouts; chances are we could be
under heavy load or similar and thus just miss out on the timeouts.
Fortunately it is very easy to just iterate over all in-use tags, and doing
this properly actually cleans up the blk_mq_busy_iter API as well, and
prepares us for the next patch by passing a reserved argument to the
iterator.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Sat, 13 Sep 2014 23:40:10 +0000 (16:40 -0700)]
blk-mq: rename blk_mq_end_io to blk_mq_end_request
Now that we've changed the driver API on the submission side use the
opportunity to fix up the name on the completion side to fit into the
general scheme.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Sat, 13 Sep 2014 23:40:09 +0000 (16:40 -0700)]
blk-mq: call blk_mq_start_request from ->queue_rq
When we call blk_mq_start_request from the core blk-mq code before calling into
->queue_rq there is a racy window where the timeout handler can hit before we've
fully set up the driver specific part of the command.
Move the call to blk_mq_start_request into the driver so the driver can start
the request only once it is fully set up.
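A sketch of the driver-side convention this establishes (driver specifics
hypothetical; the bool last parameter comes from the REQ_END removal patch
in the next entry):

    #include <linux/blk-mq.h>

    static int example_queue_rq(struct blk_mq_hw_ctx *hctx,
                                struct request *rq, bool last)
    {
            /* fully set up the driver-private command first ... */

            blk_mq_start_request(rq); /* only now may the timeout handler fire */

            /* ... then issue it to the hardware */
            return BLK_MQ_RQ_QUEUE_OK;
    }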
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Christoph Hellwig [Tue, 16 Sep 2014 16:43:56 +0000 (09:43 -0700)]
blk-mq: remove REQ_END
Pass an explicit parameter for the last request in a batch to ->queue_rq
instead of using a request flag. Besides being a cleaner and non-stateful
interface, this is also required for the next patch, which fixes the blk-mq
I/O submission code to not start the timeout timer too early.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 10 Sep 2014 15:02:03 +0000 (09:02 -0600)]
blk-mq: scale depth and rq map appropriately if low on memory
If we are running in a kdump environment, resources are scarce.
For some SCSI setups with a huge set of shared tags, we run out
of memory allocating what the driver is asking for. So implement
a scale back logic to reduce the tag depth for those cases, allowing
the driver to successfully load.
We should extend this to detect low memory situations, and implement
a sane fallback for those (1 queue, 64 tags, or something like that).
Tested-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Fri, 15 Aug 2014 18:44:08 +0000 (12:44 -0600)]
blk-mq: don't allow merges if turned off for the queue
blk-mq uses BLK_MQ_F_SHOULD_MERGE, as set by the driver at init time,
to determine whether it should merge IO or not. However, this could
also be disabled by the admin, if merging is switched off through
sysfs. So check the general queue state as well before attempting
to merge IO.
Reported-by: Rob Elliott <Elliott@hp.com>
Tested-by: Rob Elliott <Elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Robert Elliott [Tue, 2 Sep 2014 16:38:44 +0000 (11:38 -0500)]
blk-mq: cleanup after blk_mq_init_rq_map failures
In blk_mq_alloc_tag_set() in blk-mq.c, if the set->tags = kmalloc_node()
allocation succeeds but one of the blk_mq_init_rq_map() calls fails, the
goto out_unwind path needs to free set->tags so the caller is not obligated
to do so. None of the current callers (null_blk, virtio_blk, or the
forthcoming scsi-mq) do so.
set->tags also needs to be set to NULL afterwards, so other tag cleanup
logic doesn't try to free a stale pointer later. Also set it to NULL
in blk_mq_free_tag_set.
Tested with error injection on the forthcoming
scsi-mq + hpsa combination.
Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Robert Elliott [Tue, 2 Sep 2014 16:38:49 +0000 (11:38 -0500)]
blk-mq: pass along blk_mq_alloc_tag_set return values
Two of the blk-mq based drivers do not pass back the return value
from blk_mq_alloc_tag_set, instead just returning -ENOMEM.
blk_mq_alloc_tag_set returns -EINVAL if the number of queues or
queue depth is bad. -ENOMEM implies that retrying after freeing some
memory might be more successful, but that won't ever change
in the -EINVAL cases.
Change the null_blk and mtip32xx drivers to pass along
the return value.
Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Ming Lei [Tue, 2 Sep 2014 15:02:59 +0000 (23:02 +0800)]
blk-merge: fix blk_recount_segments
QUEUE_FLAG_NO_SG_MERGE is set by default for blk-mq devices,
so the computed bio->bi_phys_segments may be bigger than
queue_max_segments(q) for blk-mq devices, and drivers will then
fail to handle the case; for example, the BUG_ON() in
virtio_queue_rq() can be triggered for virtio-blk:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1359146
This patch fixes the issue by ignoring the QUEUE_FLAG_NO_SG_MERGE
flag if the computed bio->bi_phys_segments is bigger than
queue_max_segments(q). The regression was caused by commit
05f1dd53152173 ("block: add queue flag for disabling SG merging").
Reported-by: Kick In <pierre-andre.morey@canonical.com>
Tested-by: Chris J Arges <chris.j.arges@canonical.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 25 Jun 2014 14:22:34 +0000 (08:22 -0600)]
blk-mq: blk_mq_start_hw_queue() should use blk_mq_run_hw_queue()
Currently it calls __blk_mq_run_hw_queue(), which depends on the
CPU placement being correct. This means it's not possible to call
blk_mq_start_hw_queues(q) from a context that is correct for all
queues, leading to triggering the
WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
in __blk_mq_run_hw_queue().
Reported-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 25 Jun 2014 20:59:36 +0000 (14:59 -0600)]
block: add support for limiting gaps in SG lists
Another restriction inherited from NVMe - those devices don't support
SG lists that have "gaps" in them. A gap refers to cases where the
previous SG entry doesn't end on a page boundary. For NVMe, all SG
entries must start at offset 0 (except the first) and end on a page
boundary (except the last).
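The rule translates to a check along these lines (helper name
hypothetical; this is the condition as described, not the exact kernel
code):

    #include <linux/blk_types.h>
    #include <linux/mm.h>

    /* true if merging next after prev would create a gap */
    static bool example_bvec_gap(const struct bio_vec *prev,
                                 const struct bio_vec *next)
    {
            return ((prev->bv_offset + prev->bv_len) & (PAGE_SIZE - 1)) ||
                   next->bv_offset;
    }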
Signed-off-by: Jens Axboe <axboe@fb.com>
Alexander Gordeev [Tue, 17 Jun 2014 20:37:23 +0000 (22:37 +0200)]
blk-mq: bitmap tag: fix races in bt_get() function
This update fixes a few issues in the bt_get() function:
- the list_empty(&wait.task_list) check is not protected;
- the was_empty check is always true, which results in *every* thread
entering the loop resetting the bt_wait_state::wait_cnt counter rather
than every bt->wake_cnt'th thread;
- the 'bt_wait_state::wait_cnt' counter update is redundant, since
it also gets reset in the bt_clear_tag() function.
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Alexander Gordeev [Thu, 12 Jun 2014 15:05:37 +0000 (17:05 +0200)]
blk-mq: bitmap tag: fix race on blk_mq_bitmap_tags::wake_cnt
This piece of code in the bt_clear_tag() function is racy:
bs = bt_wake_ptr(bt);
if (bs && atomic_dec_and_test(&bs->wait_cnt)) {
atomic_set(&bs->wait_cnt, bt->wake_cnt);
wake_up(&bs->wait);
}
Since nothing prevents bt_wake_ptr() from returning the very
same 'bs' address on multiple CPUs, the following scenario is
possible:
CPU1                                          CPU2
----                                          ----
0. bs = bt_wake_ptr(bt);                      bs = bt_wake_ptr(bt);
1. atomic_dec_and_test(&bs->wait_cnt)
2.                                            atomic_dec_and_test(&bs->wait_cnt)
3. atomic_set(&bs->wait_cnt, bt->wake_cnt);
If the decrement in [1] yields zero then for some amount of time
the decrement in [2] results in a negative/overflow value, which
is not expected. The follow-up assignment in [3] overwrites the
invalid value with the batch value (and likely prevents the issue
from being severe), but it is still incorrect.
Cc: Ming Lei <tom.leiming@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Alexander Gordeev [Wed, 18 Jun 2014 05:12:35 +0000 (22:12 -0700)]
blk-mq: bitmap tag: fix races on shared ::wake_index fields
Fix racy updates of shared blk_mq_bitmap_tags::wake_index
and blk_mq_hw_ctx::wake_index fields.
Cc: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Wed, 18 Jun 2014 05:09:29 +0000 (22:09 -0700)]
block: blk_max_size_offset() should check ->max_sectors
Commit 762380ad9322 inadvertently changed a check for max_sectors
to max_hw_sectors. Revert that part, so we still compare against
to max_hw_sectors. Revert that part, so we still compare against
max_sectors.
Signed-off-by: Jens Axboe <axboe@fb.com>
Jens Axboe [Mon, 16 Jun 2014 17:40:25 +0000 (11:40 -0600)]
null_blk: fix softirq completions for queue_mode == 1
Only blk-mq completions have payload attached to the request; for
request_fn mode we have stored it in req->special. This fixes an
oops with queue_mode=1 and softirq completions.
Signed-off-by: Jens Axboe <axboe@fb.com>