git.kernel.dk Git - linux-2.6-block.git/log

Jens Axboe [Thu, 29 May 2014 17:00:11 +0000 (11:00 -0600)]

blk-mq: request initialization optimizations

We currently clear a lot more than we need to, so make that a bit
more clever. Make some of the init dependent on features, like
only setting start_time if we are going to use it.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 25 Jun 2014 20:09:06 +0000 (14:09 -0600)]

block: add queue flag for disabling SG merging

If devices are not SG starved, we waste a lot of time potentially
collapsing SG segments. Enough that 1.5% of the CPU time goes
to this, at only 400K IOPS. Add a queue flag, QUEUE_FLAG_NO_SG_MERGE,
which just returns the number of vectors in a bio instead of looping
over all segments and checking for collapsible ones.

Add a BLK_MQ_F_SG_MERGE flag so that drivers can opt-in on the sg
merging, if they so desire.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 28 May 2014 16:11:06 +0000 (18:11 +0200)]

blk-mq: remove alloc_hctx and free_hctx methods

There is no need for drivers to control hardware context allocation
now that we do the context to node mapping in common code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 28 May 2014 16:15:41 +0000 (10:15 -0600)]

blk-mq: add file comments and update copyright notices

None of the blk-mq files have an explanatory comment at the top
for what that particular file does. Add that and add appropriate
copyright notices as well.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Tue, 27 May 2014 18:59:50 +0000 (20:59 +0200)]

blk-mq: remove blk_mq_alloc_request_pinned

We now only have one caller left and can open code it there in a cleaner
way.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Tue, 27 May 2014 18:59:49 +0000 (20:59 +0200)]

blk-mq: do not use blk_mq_alloc_request_pinned in blk_mq_map_request

We already do a non-blocking allocation in blk_mq_map_request, no need
to repeat it. Just call __blk_mq_alloc_request to wait directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Tue, 27 May 2014 18:59:48 +0000 (20:59 +0200)]

blk-mq: remove blk_mq_wait_for_tags

The current logic for blocking tag allocation is rather confusing, as we
first allocated and then free again a tag in blk_mq_wait_for_tags, just
to attempt a non-blocking allocation and then repeat if someone else
managed to grab the tag before us.

Instead change blk_mq_alloc_request_pinned to simply do a blocking tag
allocation itself and use the request we get back from it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Tue, 27 May 2014 18:59:47 +0000 (20:59 +0200)]

blk-mq: initialize request in __blk_mq_alloc_request

Both callers if __blk_mq_alloc_request want to initialize the request, so
lift it into the common path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Tue, 27 May 2014 18:59:46 +0000 (20:59 +0200)]

blk-mq: merge blk_mq_alloc_reserved_request into blk_mq_alloc_request

Instead of having two almost identical copies of the same code just let
the callers pass in the reserved flag directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Thu, 26 Jun 2014 04:46:04 +0000 (22:46 -0600)]

blk-mq: add helper to insert requests from irq context

Both the cache flush state machine and the SCSI midlayer want to submit
requests from irq context, and the current per-request requeue_work
unfortunately causes corruption due to sharing with the csd field for
flushes. Replace them with a per-request_queue list of requests to
be requeued.

Based on an earlier test by Ming Lei.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reported-by: Ming Lei <tom.leiming@gmail.com>
Tested-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 28 May 2014 14:06:34 +0000 (08:06 -0600)]

blk-mq: remove stale comment for blk_mq_complete_request()

It works for both IPI and local completions as of commit
95f096849932.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Tue, 27 May 2014 23:46:48 +0000 (17:46 -0600)]

blk-mq: allow non-softirq completions

Right now we export two ways of completing a request:

1) blk_mq_complete_request(). This uses an IPI (if needed) and
   completes through q->softirq_done_fn(). It also works with
   timeouts.

2) blk_mq_end_io(). This completes inline, and ignores any timeout
   state of the request.

Let blk_mq_complete_request() handle non-softirq_done_fn completions
as well, by just completing inline. If a driver has enough completion
ports to place completions correctly, it need not define a
mq_ops->complete() and we can avoid an indirect function call by
doing the completion inline.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Tue, 27 May 2014 18:06:53 +0000 (12:06 -0600)]

blk-mq: pass in suggested NUMA node to ->alloc_hctx()

Drivers currently have to figure this out on their own, and they
are missing information to do it properly. The ones that did
attempt to do it, do it wrong.

So just pass in the suggested node directly to the alloc
function.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Ming Lei [Tue, 27 May 2014 15:35:14 +0000 (23:35 +0800)]

block: only allocate/free mq_usage_counter in blk-mq

The percpu counter is only used for blk-mq, so move
its allocation and free inside blk-mq, and don't
allocate it for legacy queue device.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Ming Lei [Tue, 27 May 2014 15:35:13 +0000 (23:35 +0800)]

blk-mq: avoid code duplication

blk_mq_exit_hw_queues() and blk_mq_free_hw_queues()
are introduced to avoid code duplication.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Ming Lei [Tue, 27 May 2014 14:34:45 +0000 (08:34 -0600)]

blk-mq: fix leak of hctx->ctx_map

hctx->ctx_map should have been freed inside blk_mq_free_queue().

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Mon, 26 May 2014 09:45:02 +0000 (11:45 +0200)]

blk-mq: idle all hardware contexts before freeing a queue

Without this we can leak the active_queues reference if a command is
freed while it is considered active.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Fri, 23 May 2014 20:14:57 +0000 (14:14 -0600)]

blk-mq: allow setting of per-request timeouts

Currently blk-mq uses the queue timeout for all requests. But
for some commands, drivers may want to set a specific timeout
for special requests. Allow this to be passed in through
request->timeout, and use it if set.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Sam Bradshaw [Fri, 23 May 2014 19:30:16 +0000 (13:30 -0600)]

blk-mq: export blk_mq_tag_busy_iter

Export the blk-mq in-flight tag iterator for driver consumption.
This is particularly useful in exception paths or SRSI where
in-flight IOs need to be cancelled and/or reissued. The NVMe driver
conversion will use this.

Signed-off-by: Sam Bradshaw <sbradshaw@micron.com>
Signed-off-by: Matias Bjørling <m@bjorling.me>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Thu, 22 May 2014 16:40:51 +0000 (10:40 -0600)]

blk-mq: split make request handler for multi and single queue

We want slightly different behavior from them:

- On single queue devices, we currently use the per-process plug
for deferred IO and for merging.

- On multi queue devices, we don't use the per-process plug, but
we want to go straight to hardware for SYNC IO.

Split blk_mq_make_request() into a blk_sq_make_request() for single
queue devices, and retain blk_mq_make_request() for multi queue
devices. Then we don't need multiple checks for q->nr_hw_queues
in the request mapping.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 21 May 2014 20:01:15 +0000 (14:01 -0600)]

blk-mq: save memory by freeing requests on unused hardware queues

Depending on the topology of the machine and the number of queues
exposed by a device, we can end up in a situation where some of
the hardware queues are unused (as in, they don't map to any
software queues). For this case, free up the memory used by the
request map, as we will not use it. This can be a substantial
amount of memory, depending on the number of queues vs CPUs and
the queue depth of the device.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 21 May 2014 19:59:08 +0000 (13:59 -0600)]

blk-mq: allow the hctx cpu hotplug notifier to return errors

Prepare this for the next patch which adds more smarts in the
plugging logic, so that we can save some memory.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Robert Elliott [Tue, 20 May 2014 21:46:26 +0000 (16:46 -0500)]

blk-mq: Micro-optimize blk_queue_nomerges() check

In blk_mq_make_request(), do the blk_queue_nomerges() check
outside the call to blk_attempt_plug_merge() to eliminate
function call overhead when nomerges=2 (disabled)

Signed-off-by: Robert Elliott <elliott@hp.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Alireza Haghdoost [Wed, 25 Jun 2014 19:52:26 +0000 (13:52 -0600)]

block: Enable sysfs nomerge control for I/O requests in the plug list

This patch enables the sysfs to control I/O request merge
functionality in the plug list. While this control has been
implemented for the request queue, it was dismissed in the plug list.
Therefore, block layer merges requests together (or attempt to merge)
even if the merge capability was disable using sysfs nomerge parameter
value 2.

This limitation is directly affects functionality of io_submit()
system call. The system call enables user to submit a bunch of IO
requests from user space using struct iocb **ios input argument.
However, the unconditioned merging functionality in the plug list
potentially merges these requests together down the road. Therefore,
there is no way to distinguish between an application sending bunch of
sequential IOs and an application sending one big IO. Ultimately, all
requests generated by the former app merge within the plug list
together and looks similar to the second app.

While the merging functionality is a desirable feature to improve the
performance of IO subsystem for some applications, it is not useful
for other application like ours at all.

Signed-off-by: Alireza Haghdoost <alireza@cs.umn.edu>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Coding style modified.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Tue, 20 May 2014 21:17:27 +0000 (15:17 -0600)]

blk-mq: initialize q->nr_requests after calling blk_queue_make_request()

blk_queue_make_requests() overwrites our set value for q->nr_requests,
turning it into the default of 128. Set this appropriately after
initializing queue values in blk_queue_make_request().

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Tue, 20 May 2014 17:49:02 +0000 (11:49 -0600)]

blk-mq: allow changing of queue depth through sysfs

For request_fn based devices, the block layer exports a 'nr_requests'
file through sysfs to allow adjusting of queue depth on the fly.
Currently this returns -EINVAL for blk-mq, since it's not wired up.
Wire this up for blk-mq, so that it now also always dynamic
adjustments of the allowed queue depth for any given block device
managed by blk-mq.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 25 Jun 2014 19:51:18 +0000 (13:51 -0600)]

blk-mq: switch ctx pending map to the sparser blk_align_bitmap

Each hardware queue has a bitmap of software queues with pending
requests. When new IO is queued on a software queue, the bit is
set, and when IO is pruned on a hardware queue run, the bit is
cleared. This causes a lot of traffic. Switch this from the regular
BITS_PER_LONG bitmap to a sparser layout, similarly to what was
done for blk-mq tagging.

20% performance increase was observed for single threaded IO, and
about 15% performanc increase on multiple threads driving the
same device.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 25 Jun 2014 19:46:14 +0000 (13:46 -0600)]

blk-mq: move the cache friendly bitmap type of out blk-mq-tag

We will use it for the pending list in blk-mq core as well.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Tue, 13 May 2014 21:10:52 +0000 (15:10 -0600)]

blk-mq: improve support for shared tags maps

This adds support for active queue tracking, meaning that the
blk-mq tagging maintains a count of active users of a tag set.
This allows us to maintain a notion of fairness between users,
so that we can distribute the tag depth evenly without starving
some users while allowing others to try unfair deep queues.

If sharing of a tag set is detected, each hardware queue will
track the depth of its own queue. And if this exceeds the total
depth divided by the number of active queues, the user is actively
throttled down.

The active queue count is done lazily to avoid bouncing that data
between submitter and completer. Each hardware queue gets marked
active when it allocates its first tag, and gets marked inactive
when 1) the last tag is cleared, and 2) the queue timeout grace
period has passed.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Ming Lei [Sat, 10 May 2014 17:01:51 +0000 (01:01 +0800)]

blk-mq: bitmap tag: cleanup blk_mq_init_tags

Both nr_cache and nr_tags arn't needed for bitmap tag anymore.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Ming Lei [Sat, 10 May 2014 21:43:14 +0000 (15:43 -0600)]

blk-mq: bitmap tag: select random tag betweet 0 and (depth - 1)

The selected tag should be selected at random between 0 and
(depth - 1) with probability 1/depth, instead between 0 and
(depth - 2) with probability 1/(depth - 1).

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Ming Lei [Sat, 10 May 2014 17:01:49 +0000 (01:01 +0800)]

blk-mq: bitmap tag: remove barrier in bt_clear_tag()

The barrier isn't necessary because both atomic_dec_and_test()
and wake_up() implicate one barrier.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Ming Lei [Sat, 10 May 2014 17:01:48 +0000 (01:01 +0800)]

blk-mq: bitmap tag: use clear_bit_unlock in bt_clear_tag()

The unlock memory barrier need to order access to req in free
path and clearing tag bit, otherwise either request free path
may see a allocated request, or initialized request in allocate
path might be modified by the ongoing free path.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Fri, 9 May 2014 20:54:08 +0000 (14:54 -0600)]

blk-mq: fix race in IO start accounting

Commit c6d600c6 opened up a small race where we could attempt to
account IO completion on a request, racing with IO start accounting.
Fix this up by ensuring that we've accounted for IO start before
inserting the request.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Fri, 9 May 2014 19:41:15 +0000 (13:41 -0600)]

blk-mq: use sparser tag layout for lower queue depth

For best performance, spreading tags over multiple cachelines
makes the tagging more efficient on multicore systems. But since
we have 8 * sizeof(unsigned long) tags per cacheline, we don't
always get a nice spread.

Attempt to spread the tags over at least 4 cachelines, using fewer
number of bits per unsigned long if we have to. This improves
tagging performance in setups with 32-128 tags. For higher depths,
the spread is the same as before (BITS_PER_LONG tags per cacheline).

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Fri, 9 May 2014 15:36:49 +0000 (09:36 -0600)]

blk-mq: implement new and more efficient tagging scheme

blk-mq currently uses percpu_ida for tag allocation. But that only
works well if the ratio between tag space and number of CPUs is
sufficiently high. For most devices and systems, that is not the
case. The end result if that we either only utilize the tag space
partially, or we end up attempting to fully exhaust it and run
into lots of lock contention with stealing between CPUs. This is
not optimal.

This new tagging scheme is a hybrid bitmap allocator. It uses
two tricks to both be SMP friendly and allow full exhaustion
of the space:

1) We cache the last allocated (or freed) tag on a per blk-mq
   software context basis. This allows us to limit the space
   we have to search. The key element here is not caching it
   in the shared tag structure, otherwise we end up dirtying
   more shared cache lines on each allocate/free operation.

2) The tag space is split into cache line sized groups, and
   each context will start off randomly in that space. Even up
   to full utilization of the space, this divides the tag users
   efficiently into cache line groups, avoiding dirtying the same
   one both between allocators and between allocator and freeer.

This scheme shows drastically better behaviour, both on small
tag spaces but on large ones as well. It has been tested extensively
to show better performance for all the cases blk-mq cares about.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Thu, 26 Jun 2014 04:44:59 +0000 (22:44 -0600)]

blk-mq: initialize struct request fields individually

This allows us to avoid a non-atomic memset over ->atomic_flags as well
as killing lots of duplicate initializations.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Thu, 8 May 2014 20:50:19 +0000 (14:50 -0600)]

blk-mq: update a hotplug comment for grammar

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 7 May 2014 16:26:44 +0000 (10:26 -0600)]

blk-mq: add basic round-robin of what CPU to queue workqueue work on

Right now we just pick the first CPU in the mask, but that can
easily overload that one. Add some basic batching and round-robin
all the entries in the mask instead.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Fri, 2 May 2014 17:24:48 +0000 (11:24 -0600)]

blk-mq: remove extra requeue trace

We already issue a blktrace requeue event in
__blk_mq_requeue_request(), don't do it from the original caller
as well.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Ming Lei [Thu, 1 May 2014 07:12:36 +0000 (15:12 +0800)]

block: null_blk: fix use after free

entry(cmd->ll_list) may belong to new request once end_cmd()
returns, so fix the bug with the patch.

Without the change, it is easy to observe oops when
doing null_blk(timer) test.

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Shlomo Pongratz [Wed, 25 Jun 2014 19:42:29 +0000 (13:42 -0600)]

block/null_blk: Fix completion processing from LIFO to FIFO

The completion queue is implemented using lockless list.

The llist_add is adds the events to the list head which is a push operation.
The processing of the completion elements is done by disconnecting all the
pushed elements and iterating over the disconnected list. The problem is
that the processing is done in reverse order w.r.t order of the insertion
i.e. LIFO processing. By reversing the disconnected list which is done in
linear time the desired FIFO processing is achieved.

Signed-off-by: Shlomo Pongratz <shlomop@mellanox.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 30 Apr 2014 19:43:56 +0000 (13:43 -0600)]

blk-mq: refactor request insertion/merging

Refactor the logic around adding a new bio to a software queue,
so we nest the ctx->lock where we really need it (merge and
insertion) and don't hold it when we don't (init and IO start
accounting).

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 30 Apr 2014 19:43:08 +0000 (13:43 -0600)]

blk-mq remove debug BUG_ON() when draining software queues

It's never been of any use, lets get rid of it.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 30 Apr 2014 02:49:48 +0000 (20:49 -0600)]

blk-mq: fix waiting for reserved tags

blk_mq_wait_for_tags() is only able to wait for "normal" tags,
not reserved tags. Pass in which one we should attempt to get
a tag for, so that waiting for reserved tags will work.

Reserved tags are used for internal commands, which are usually
serialized. Hence no waiting generally takes place, but we should
ensure that it actually works if users need that functionality.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Fri, 30 May 2014 21:41:39 +0000 (15:41 -0600)]

block: ensure that the timer is always added

Commit f793aa537866 relaxed the timer addition a little too much.
If the timer isn't pending, we always need to add it.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Fri, 25 Apr 2014 12:14:48 +0000 (14:14 +0200)]

block: fold __blk_add_timer into blk_add_timer

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Fri, 25 Apr 2014 09:32:53 +0000 (02:32 -0700)]

blk-mq: respect rq_affinity

The blk-mq code is using it's own version of the I/O completion affinity
tunables, which causes a few issues:

- the rq_affinity sysfs file doesn't work for blk-mq devices, even if it
   still is present, thus breaking existing tuning setups.
- the rq_affinity = 1 mode, which is the defauly for legacy request based
   drivers isn't implemented at all.
- blk-mq drivers don't implement any completion affinity with the default
   flag settings.

This patches removes the blk-mq ipi_redirect flag and sysfs file, as well
as the internal BLK_MQ_F_SHOULD_IPI flag and replaces it with code that
respects the queue-wide rq_affinity flags and also implements the
rq_affinity = 1 mode.

This means I/O completion affinity can now only be tuned block-queue wide
instead of per context, which seems more sensible to me anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Thu, 24 Apr 2014 14:51:47 +0000 (08:51 -0600)]

blk-mq: fix race with timeouts and requeue events

If a requeue event races with a timeout, we can get into the
situation where we attempt to complete a request from the
timeout handler when it's not start anymore. This causes a crash.
So have the timeout handler check that REQ_ATOM_STARTED is still
set on the request - if not, we ignore the event. If this happens,
the request has now been marked as complete. As a consequence, we
need to ensure to clear REQ_ATOM_COMPLETE in blk_mq_start_request(),
as to maintain proper request state.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Thu, 24 Apr 2014 14:50:38 +0000 (08:50 -0600)]

Revert "blk-mq: initialize req->q in allocation"

This reverts commit 6a3c8a3ac0e68dcfc2a01f4aa1ca0edd1a1701eb.

We need selective clearing of the request to make the init-at-free
time completely safe. Otherwise we end up stomping on
rq->atomic_flags, which we don't want to do.

commit | commitdiff | tree

Ming Lei [Wed, 23 Apr 2014 16:07:34 +0000 (00:07 +0800)]

blk-mq: fix leak of set->tags

set->tags should be freed in blk_mq_free_tag_set().

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Ming Lei [Sat, 19 Apr 2014 10:00:19 +0000 (18:00 +0800)]

blk-mq: initialize req->q in allocation

The patch basically reverts the patch of(blk-mq:
initialize request on allocation) in Jens's tree(already
in -next), and only initialize req->q in allocation
for two reasons:

- presumed cache hotness on completion
- blk_rq_tagged(rq) depends on reset of req->mq_ctx

Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Ming Lei [Sat, 19 Apr 2014 10:00:18 +0000 (18:00 +0800)]

blk-mq: user (1 << order) to implement order_to_size()

Cc: Jörg-Volker Peetz <jvpeetz@web.de>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Ming Lei [Sat, 19 Apr 2014 10:00:17 +0000 (18:00 +0800)]

blk-mq: fix allocation of set->tags

type of set->tags is struct blk_mq_tags **.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Ming Lei [Sat, 19 Apr 2014 10:00:16 +0000 (18:00 +0800)]

blk-mq: free hctx->ctx_map when init failed

Avoid memory leak in the failure path.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <tom.leiming@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 16 Apr 2014 07:44:59 +0000 (09:44 +0200)]

block: export blk_finish_request

This allows to mirror the blk-mq code flow for more a more readable I/O
completion handler in SCSI.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Thu, 26 Jun 2014 04:43:35 +0000 (22:43 -0600)]

blk-mq: rename mq_flush_work struct request member

We will use this work_struct to requeue scsi commands from the
completion handler as well, so give it a more generic name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 16 Apr 2014 07:44:57 +0000 (09:44 +0200)]

blk-mq: add blk_mq_requeue_request

This allows to requeue a request that has been accepted by ->queue_rq
earlier. This is needed by the SCSI layer in various error conditions.

The existing internal blk_mq_requeue_request is renamed to
__blk_mq_requeue_request as it is a lower level building block for this
funtionality.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 16 Apr 2014 07:44:56 +0000 (09:44 +0200)]

blk-mq: add blk_mq_start_hw_queues

Add a helper to unconditionally kick contexts of a queue. This will
be needed by the SCSI layer to provide fair queueing between multiple
devices on a single host.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 16 Apr 2014 16:48:08 +0000 (10:48 -0600)]

blk-mq: add blk_mq_delay_queue

Add a blk-mq equivalent to blk_delay_queue so that the scsi layer can ask
to be kicked again after a delay.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Modified by me to kill the unnecessary preempt disable/enable
in the delayed workqueue handler.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 16 Apr 2014 07:44:54 +0000 (09:44 +0200)]

blk-mq: add async parameter to blk_mq_start_stopped_hw_queues

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 16 Apr 2014 07:44:53 +0000 (09:44 +0200)]

blk-mq: bidi support

Add two unlinkely branches to make sure the resid is initialized correctly
for bidi request pairs, and the second request gets properly freed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 25 Jun 2014 19:29:02 +0000 (13:29 -0600)]

blk-mq: allow drivers to hook into I/O completion

Split out the bottom half of blk_mq_end_io so that drivers can perform
work when they know a request has been completed, but before it has been
freed. This also obsoletes blk_mq_end_io_partial as drivers can now
pass any value to blk_update_request directly.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 16 Apr 2014 16:38:35 +0000 (10:38 -0600)]

blk-mq: kill preempt disable/enable in blk_mq_work_fn()

blk_mq_work_fn() is always invoked off the bounded workqueues,
so it can happily preempt among the queues in that set without
causing any issues for blk-mq.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 16 Apr 2014 15:23:48 +0000 (09:23 -0600)]

blk-mq: don't use preempt_count() to check for right CPU

UP or CONFIG_PREEMPT_NONE will return 0, and what we really
want to check is whether or not we are on the right CPU.
So don't make PREEMPT part of this, just test the CPU in
the mask directly.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 25 Jun 2014 19:25:31 +0000 (13:25 -0600)]

blk-mq: split out tag initialization, support shared tags

Add a new blk_mq_tag_set structure that gets set up before we initialize
the queue. A single blk_mq_tag_set structure can be shared by multiple
queues.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Modular export of blk_mq_{alloc,free}_tagset added by me.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Rusty Russell [Wed, 25 Jun 2014 19:06:59 +0000 (13:06 -0600)]

virtio-blk: base queue-depth on virtqueue ringsize or module param

Venkatash spake thus:

  virtio-blk set the default queue depth to 64 requests, which was
  insufficient for high-IOPS devices. Instead set the blk-queue depth to
  the device's virtqueue depth divided by two (each I/O requires at least
  two VQ entries).

But behold, Ted added a module parameter:

  Also allow the queue depth to be something which can be set at module
  load time or via a kernel boot-time parameter, for
  testing/benchmarking purposes.

And I rewrote it substantially, mainly to take
VIRTIO_RING_F_INDIRECT_DESC into account.

As QEMU sets the vq size for PCI to 128, Venkatash's patch wouldn't
have made a change.  This version does (since QEMU also offers
VIRTIO_RING_F_INDIRECT_DESC.

Inspired-by: "Theodore Ts'o" <tytso@mit.edu>
Based-on-the-true-story-of: Venkatesh Srinivas <venkateshs@google.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: virtio-dev@lists.oasis-open.org
Cc: virtualization@lists.linux-foundation.org
Cc: Frank Swiderski <fes@google.com>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

commit | commitdiff | tree

Christoph Hellwig [Mon, 14 Apr 2014 08:30:10 +0000 (10:30 +0200)]

blk-mq: initialize request on allocation

If we want to share tag and request allocation between queues we cannot
initialize the request at init/free time, but need to initialize it
at allocation time as it might get used for different queues over its
lifetime.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 25 Jun 2014 19:02:57 +0000 (13:02 -0600)]

blk-mq: add ->init_request and ->exit_request methods

The current blk_mq_init_commands/blk_mq_free_commands interface has a
two problems:

1) Because only the constructor is passed to blk_mq_init_commands there
    is no easy way to clean up when a comman initialization failed.  The
    current code simply leaks the allocations done in the constructor.

2) There is no good place to call blk_mq_free_commands: before
    blk_cleanup_queue there is no guarantee that all outstanding
    commands have completed, so we can't free them yet.  After
    blk_cleanup_queue the queue has usually been freed.  This can be
    worked around by grabbing an unconditional reference before calling
    blk_cleanup_queue and dropping it after blk_mq_free_commands is
    done, although that's not exatly pretty and driver writers are
    guaranteed to get it wrong sooner or later.

Both issues are easily fixed by making the request constructor and
destructor normal blk_mq_ops methods.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Mon, 14 Apr 2014 08:30:08 +0000 (10:30 +0200)]

blk-mq: make ->flush_rq fully transparent to drivers

Drivers shouldn't have to care about the block layer setting aside a
request to implement the flush state machine. We already override the
mq context and tag to make it more transparent, but so far haven't deal
with the driver private data in the request. Make sure to override this
as well, and while we're at it add a proper helper sitting in blk-mq.c
that implements the full impersonation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Mon, 14 Apr 2014 08:30:07 +0000 (10:30 +0200)]

blk-mq: do not initialize req->special

Drivers can reach their private data easily using the blk_mq_rq_to_pdu
helper and don't need req->special. By not initializing it code can
be simplified nicely, and we also shave off a few more instructions from
the I/O path.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Thu, 26 Jun 2014 16:03:06 +0000 (10:03 -0600)]

null_blk: use blk_complete_request and blk_mq_complete_request

Use the block layer helpers for CPU-local completions instead of
reimplementing them locally.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 25 Jun 2014 18:48:49 +0000 (12:48 -0600)]

virtio_blk: use blk_mq_complete_request

Make sure to complete requests on the submitting CPU. Previously this
was done in blk_mq_end_io, but the responsibility shifted to the drivers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Mon, 14 Apr 2014 08:30:06 +0000 (10:30 +0200)]

blk-mq: initialize resid_len

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 9 Apr 2014 16:53:21 +0000 (10:53 -0600)]

blk-mq: simplify blk_mq_hw_sysfs_cpus_show()

Now that we have a cpu mask of CPUs that are mapped to
a specific hardware queue, we can just iterate that to
display the sysfs num-hw-queue/cpu_list file.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 9 Apr 2014 16:18:23 +0000 (10:18 -0600)]

blk-mq: ensure that hardware queues are always run on the mapped CPUs

Instead of providing soft mappings with no guarantees on hardware
queues always being run on the right CPU, switch to a hard mapping
guarantee that ensure that we always run the hardware queue on
(one of, if more) the mapped CPU.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Tue, 8 Apr 2014 15:17:40 +0000 (09:17 -0600)]

block: add kblockd_schedule_delayed_work_on()

Same function as kblockd_schedule_delayed_work(), but allow the
caller to pass in a CPU that the work should be executed on. This
just directly extends and maps into the workqueue API, and will
be used to make the blk-mq mappings more strict.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 25 Jun 2014 18:39:11 +0000 (12:39 -0600)]

block: remove 'q' parameter from kblockd_schedule_*_work()

The queue parameter is never used, just get rid of it.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Sat, 5 Apr 2014 03:34:48 +0000 (21:34 -0600)]

blk-mq: fix potential stall during CPU unplug with IO pending

When a CPU is unplugged, we move the blk_mq_ctx request entries
to the current queue. The current code forgets to remap the
blk_mq_hw_ctx before marking the software context pending,
which breaks if old-cpu and new-cpu don't map to the same
hardware queue.

Additionally, if we mark entries as pending in the new
hardware queue, then make sure we schedule it for running.
Otherwise request could be sitting there until someone else
queues IO for that hardware queue.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Dave Jones [Thu, 29 May 2014 19:11:30 +0000 (15:11 -0400)]

block: remove dead code in scsi_ioctl:blk_verify_command

filter gets assigned the address of blk_default_cmd_filter on
entry to this function, so the !filter condition can never be true.

Signed-off-by: Dave Jones <davej@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Fri, 9 May 2014 21:48:23 +0000 (15:48 -0600)]

block: only calculate part_in_flight() once

We first check if we have inflight IO, then retrieve that
same number again. Usually this isn't that costly since the
chance of having the data dirtied in between is small, but
there's no reason for calling part_in_flight() twice.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 16 Apr 2014 17:36:54 +0000 (11:36 -0600)]

block: relax when to modify the timeout timer

Since we are now, by default, applying timer slack to expiry times,
the logic for when to modify a timer in the block code is suboptimal.
The block layer keeps a forward rolling timer per queue for all
requests, and modifies this timer if a request has a shorter timeout
than what the current expiry time is. However, this breaks down
when our rounded timer values get applied slack. Then each new
request ends up modifying the timer, since we're still a little
in front of the timer + slack.

Fix this by allowing a tolerance of HZ / 2, the timeout handling
doesn't need to be very precise. This drastically cuts down
the number of timer modifications we have to make.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 25 Jun 2014 18:35:49 +0000 (12:35 -0600)]

random: export add_disk_randomness

This will be needed for pending changes to the scsi midlayer that now
calls lower level block APIs, as well as any blk-mq driver that wants to
contribute to the random pool.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Mike Snitzer [Sun, 9 Mar 2014 03:19:20 +0000 (20:19 -0700)]

block: change flush sequence list addition back to front add

Commit 18741986 inadvertently changed the rq flush insertion
from a head to a tail insertion. Fix that back up.

Signed-off-by: Mike Snitzer <msnitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Shaohua Li [Wed, 19 Feb 2014 12:20:21 +0000 (20:20 +0800)]

blk-mq: add REQ_SYNC early

Add REQ_SYNC early, so rq_dispatched[] in blk_mq_rq_ctx_init
is set correctly.

Signed-off-by: Shaohua Li<shli@fusionio.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Wed, 25 Jun 2014 19:21:32 +0000 (13:21 -0600)]

blk-mq: support partial I/O completions

Add a new blk_mq_end_io_partial function to partially complete requests
as needed by the SCSI layer. We do this by reusing blk_update_request
to advance the bio instead of having a simplified version of it in
the blk-mq code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Fri, 21 Mar 2014 14:57:37 +0000 (08:57 -0600)]

blk-mq: merge blk_mq_insert_request and blk_mq_run_request

It's almost identical to blk_mq_insert_request, so fold the two into one
slightly more generic function by making the flush special case a bit
smarted.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Hellwig [Thu, 20 Feb 2014 23:32:36 +0000 (15:32 -0800)]

blk-mq: remove blk_mq_alloc_rq

There's only one caller, which is a straight wrapper and fits the naming
scheme of the related functions a lot better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Dave Jones [Thu, 20 Mar 2014 21:03:58 +0000 (15:03 -0600)]

block: free q->flush_rq in blk_init_allocated_queue error paths

Commit 7982e90c3a57 ("block: fix q->flush_rq NULL pointer crash on
dm-mpath flush") moved an allocation to blk_init_allocated_queue(), but
neglected to free that allocation on the error paths that follow.

Signed-off-by: Dave Jones <davej@fedoraproject.org>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

commit | commitdiff | tree

Mike Galbraith [Mon, 3 Mar 2014 04:57:26 +0000 (05:57 +0100)]

rt,blk,mq: Make blk_mq_cpu_notify_lock a raw spinlock

[  365.164040] BUG: sleeping function called from invalid context at kernel/rtmutex.c:674
[  365.164041] in_atomic(): 1, irqs_disabled(): 1, pid: 26, name: migration/1
[  365.164043] no locks held by migration/1/26.
[  365.164044] irq event stamp: 6648
[  365.164056] hardirqs last  enabled at (6647): [<ffffffff8153d377>] restore_args+0x0/0x30
[  365.164062] hardirqs last disabled at (6648): [<ffffffff810ed98d>] multi_cpu_stop+0x9d/0x120
[  365.164070] softirqs last  enabled at (0): [<ffffffff810543bc>] copy_process.part.28+0x6fc/0x1920
[  365.164072] softirqs last disabled at (0): [<          (null)>]           (null)
[  365.164076] CPU: 1 PID: 26 Comm: migration/1 Tainted: GF           N  3.12.12-rt19-0.gcb6c4a2-rt #3
[  365.164078] Hardware name: QCI QSSC-S4R/QSSC-S4R, BIOS QSSC-S4R.QCI.01.00.S013.032920111005 03/29/2011
[  365.164091]  0000000000000001 ffff880a42ea7c30 ffffffff815367e6 ffffffff81a086c0
[  365.164099]  ffff880a42ea7c40 ffffffff8108919c ffff880a42ea7c60 ffffffff8153c24f
[  365.164107]  ffff880a42ea91f0 00000000ffffffe1 ffff880a42ea7c88 ffffffff81297ec0
[  365.164108] Call Trace:
[  365.164119]  [<ffffffff810060b1>] try_stack_unwind+0x191/0x1a0
[  365.164127]  [<ffffffff81004872>] dump_trace+0x92/0x360
[  365.164133]  [<ffffffff81006108>] show_trace_log_lvl+0x48/0x60
[  365.164138]  [<ffffffff81004c18>] show_stack_log_lvl+0xd8/0x1d0
[  365.164143]  [<ffffffff81006160>] show_stack+0x20/0x50
[  365.164153]  [<ffffffff815367e6>] dump_stack+0x54/0x9a
[  365.164163]  [<ffffffff8108919c>] __might_sleep+0xfc/0x140
[  365.164173]  [<ffffffff8153c24f>] rt_spin_lock+0x1f/0x70
[  365.164182]  [<ffffffff81297ec0>] blk_mq_main_cpu_notify+0x20/0x70
[  365.164191]  [<ffffffff81540a1c>] notifier_call_chain+0x4c/0x70
[  365.164201]  [<ffffffff81083499>] __raw_notifier_call_chain+0x9/0x10
[  365.164207]  [<ffffffff810567be>] cpu_notify+0x1e/0x40
[  365.164217]  [<ffffffff81525da2>] take_cpu_down+0x22/0x40
[  365.164223]  [<ffffffff810ed9c6>] multi_cpu_stop+0xd6/0x120
[  365.164229]  [<ffffffff810edd97>] cpu_stopper_thread+0xd7/0x1e0
[  365.164235]  [<ffffffff810863a3>] smpboot_thread_fn+0x203/0x380
[  365.164241]  [<ffffffff8107cbf8>] kthread+0xc8/0xd0
[  365.164250]  [<ffffffff8154440c>] ret_from_fork+0x7c/0xb0
[  365.164429] smpboot: CPU 1 is now offline

Signed-off-by: Mike Galbraith <bitbucket@online.de>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Thu, 20 Mar 2014 19:29:18 +0000 (13:29 -0600)]

blk-mq: don't dump CPU -> hw queue map on driver load

Now that we are out of initial debug/bringup mode, remove
the verbose dump of the mapping table.

Provide the mapping table in sysfs, under the hardware queue
directory, in the cpu_list file.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Wed, 19 Mar 2014 21:25:02 +0000 (15:25 -0600)]

blk-mq: fix wrong usage of hctx->state vs hctx->flags

BLK_MQ_F_* flags are for hctx->flags, and are non-atomic and
set at registration time. BLK_MQ_S_* flags are dynamic and
atomic, and are accessed through hctx->state.

Some of the BLK_MQ_S_STOPPED uses were wrong. Additionally,
the header file should not use a bit shift for the _S_ flags,
as they are done through the set/test_bit functions.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Fri, 14 Mar 2014 16:43:15 +0000 (10:43 -0600)]

blk-mq: allow blk_mq_init_commands() to return failure

If drivers do dynamic allocation in the hardware command init
path, then we need to be able to handle and return failures.

And if they do allocations or mappings in the init command path,
then we need a cleanup function to free up that space at exit
time. So add blk_mq_free_commands() as the cleanup function.

This is required for the mtip32xx driver conversion to blk-mq.

Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jens Axboe [Thu, 10 Apr 2014 02:27:01 +0000 (20:27 -0600)]

block: fix regression with block enabled tagging

Martin reported that his test system would not boot with
current git, it oopsed with this:

BUG: unable to handle kernel paging request at ffff88046c6c9e80
IP: [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
PGD 1ddf067 PUD 1de2067 PMD 47fc7d067 PTE 800000046c6c9060
Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
Modules linked in: sd_mod lpfc(+) scsi_transport_fc scsi_tgt oracleasm
rpcsec_gss_krb5 ipv6 igb dca i2c_algo_bit i2c_core hwmon
CPU: 3 PID: 87 Comm: kworker/u17:1 Not tainted 3.14.0+ #246
Hardware name: Supermicro X9DRX+-F/X9DRX+-F, BIOS 3.00 07/09/2013
Workqueue: events_unbound async_run_entry_fn
task: ffff8802743c2150 ti: ffff880273d02000 task.ti: ffff880273d02000
RIP: 0010:[<ffffffff812971e0>]  [<ffffffff812971e0>]
blk_queue_start_tag+0x90/0x150
RSP: 0018:ffff880273d03a58  EFLAGS: 00010092
RAX: ffff88046c6c9e78 RBX: ffff880077208e78 RCX: 00000000fffc8da6
RDX: 00000000fffc186d RSI: 0000000000000009 RDI: 00000000fffc8d9d
RBP: ffff880273d03a88 R08: 0000000000000001 R09: ffff8800021c2410
R10: 0000000000000005 R11: 0000000000015b30 R12: ffff88046c5bb8a0
R13: ffff88046c5c0890 R14: 000000000000001e R15: 000000000000001e
FS:  0000000000000000(0000) GS:ffff880277b00000(0000)
knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff88046c6c9e80 CR3: 00000000018f6000 CR4: 00000000000407e0
Stack:
ffff880273d03a98 ffff880474b18800 0000000000000000 ffff880474157000
ffff88046c5c0890 ffff880077208e78 ffff880273d03ae8 ffffffff813b9e62
ffff880200000010 ffff880474b18968 ffff880474b18848 ffff88046c5c0cd8
Call Trace:
[<ffffffff813b9e62>] scsi_request_fn+0xf2/0x510
[<ffffffff81293167>] __blk_run_queue+0x37/0x50
[<ffffffff8129ac43>] blk_execute_rq_nowait+0xb3/0x130
[<ffffffff8129ad24>] blk_execute_rq+0x64/0xf0
[<ffffffff8108d2b0>] ? bit_waitqueue+0xd0/0xd0
[<ffffffff813bba35>] scsi_execute+0xe5/0x180
[<ffffffff813bbe4a>] scsi_execute_req_flags+0x9a/0x110
[<ffffffffa01b1304>] sd_spinup_disk+0x94/0x460 [sd_mod]
[<ffffffff81160000>] ? __unmap_hugepage_range+0x200/0x2f0
[<ffffffffa01b2b9a>] sd_revalidate_disk+0xaa/0x3f0 [sd_mod]
[<ffffffffa01b2fb8>] sd_probe_async+0xd8/0x200 [sd_mod]
[<ffffffff8107703f>] async_run_entry_fn+0x3f/0x140
[<ffffffff8106a1c5>] process_one_work+0x175/0x410
[<ffffffff8106b373>] worker_thread+0x123/0x400
[<ffffffff8106b250>] ? manage_workers+0x160/0x160
[<ffffffff8107104e>] kthread+0xce/0xf0
[<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
[<ffffffff815f0bac>] ret_from_fork+0x7c/0xb0
[<ffffffff81070f80>] ? kthread_freezable_should_stop+0x70/0x70
Code: 48 0f ab 11 72 db 48 81 4b 40 00 00 10 00 89 83 08 01 00 00 48 89
df 49 8b 04 24 48 89 1c d0 e8 f7 a8 ff ff 49 8b 85 28 05 00 00 <48> 89
58 08 48 89 03 49 8d 85 28 05 00 00 48 89 43 08 49 89 9d
RIP  [<ffffffff812971e0>] blk_queue_start_tag+0x90/0x150
RSP <ffff880273d03a58>
CR2: ffff88046c6c9e80

Martin bisected and found this to be the problem patch;

commit 6d113398dcf4dfcd9787a4ead738b186f7b7ff0f
Author: Jan Kara <jack@suse.cz>
Date:   Mon Feb 24 16:39:54 2014 +0100

    block: Stop abusing rq->csd.list in blk-softirq

and the problem was immediately apparent. The patch states that
it is safe to reuse queuelist at completion time, since it is
no longer used. However, that is not true if a device is using
block enabled tagging. If that is the case, then the queuelist
is reused to keep track of busy tags. If a device also ended
up using softirq completions, we'd reuse ->queuelist for the
IPI handling while block tagging was still using it. Boom.

Fix this by adding a new ipi_list list head, and share the
memory used with the request hash table. The hash table is
never used after the request is moved to the dispatch list,
which happens long before any potential completion of the
request. Add a new request bit for this, so we don't have
cases that check rq->hash while it could potentially have
been reused for the IPI completion.

Reported-by: Martin K. Petersen <martin.petersen@oracle.com>
Tested-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jan Kara [Mon, 24 Feb 2014 15:39:54 +0000 (16:39 +0100)]

block: Stop abusing rq->csd.list in blk-softirq

Abusing rq->csd.list for a list of requests to complete is rather ugly.
We use rq->queuelist instead which is much cleaner. It is safe because
queuelist is used by the block layer only for requests waiting to be
submitted to a device. Thus it is unused when irq reports the request IO
is finished.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Martin K. Petersen [Thu, 10 Apr 2014 02:20:48 +0000 (22:20 -0400)]

scsi: Make sure cmd_flags are 64-bit

cmd_flags in struct request is now 64 bits wide but the scsi_execute
functions truncated arguments passed to int leading to errors. Make sure
the flags parameters are u64.

Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Cc: Jens Axboe <axboe@fb.com>
CC: Jan Kara <jack@suse.cz>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Christoph Lameter [Tue, 15 Oct 2013 18:22:29 +0000 (12:22 -0600)]

block: Replace __get_cpu_var uses

__get_cpu_var() is used for multiple purposes in the kernel source. One of
them is address calculation via the form &__get_cpu_var(x).  This calculates
the address for the instance of the percpu variable of the current processor
based on an offset.

Other use cases are for storing and retrieving data from the current
processors percpu area.  __get_cpu_var() can be used as an lvalue when
writing data or on the right side of an assignment.

__get_cpu_var() is defined as :

#define __get_cpu_var(var) (*this_cpu_ptr(&(var)))

__get_cpu_var() always only does an address determination. However, store
and retrieve operations could use a segment prefix (or global register on
other platforms) to avoid the address calculation.

this_cpu_write() and this_cpu_read() can directly take an offset into a
percpu area and use optimized assembly code to read and write per cpu
variables.

This patch converts __get_cpu_var into either an explicit address
calculation using this_cpu_ptr() or into a use of this_cpu operations that
use the offset.  Thereby address calculations are avoided and less registers
are used when code is generated.

At the end of the patch set all uses of __get_cpu_var have been removed so
the macro is removed too.

The patch set includes passes over all arches as well. Once these operations
are used throughout then specialized macros can be defined in non -x86
arches as well in order to optimize per cpu access by f.e.  using a global
register that may be set to the per cpu base.

Transformations done to __get_cpu_var()

1. Determine the address of the percpu instance of the current processor.

DEFINE_PER_CPU(int, y);
int *x = &__get_cpu_var(y);

    Converts to

int *x = this_cpu_ptr(&y);

2. Same as #1 but this time an array structure is involved.

DEFINE_PER_CPU(int, y[20]);
int *x = __get_cpu_var(y);

    Converts to

int *x = this_cpu_ptr(y);

3. Retrieve the content of the current processors instance of a per cpu
variable.

DEFINE_PER_CPU(int, y);
int x = __get_cpu_var(y)

   Converts to

int x = __this_cpu_read(y);

4. Retrieve the content of a percpu struct

DEFINE_PER_CPU(struct mystruct, y);
struct mystruct x = __get_cpu_var(y);

   Converts to

memcpy(&x, this_cpu_ptr(&y), sizeof(x));

5. Assignment to a per cpu variable

DEFINE_PER_CPU(int, y)
__get_cpu_var(y) = x;

   Converts to

this_cpu_write(y, x);

6. Increment/Decrement etc of a per cpu variable

DEFINE_PER_CPU(int, y);
__get_cpu_var(y)++

   Converts to

this_cpu_inc(y)

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

commit | commitdiff | tree

Frederic Weisbecker [Mon, 24 Feb 2014 15:39:53 +0000 (16:39 +0100)]

block: Remove useless IPI struct initialization

rq_fifo_clear() reset the csd.list through INIT_LIST_HEAD for no clear
purpose. The csd.list doesn't need to be initialized as a list head
because it's only ever used as a list node.

Lets remove this useless initialization.

Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Jan Kara [Mon, 24 Feb 2014 15:39:52 +0000 (16:39 +0100)]

block: Stop abusing csd.list for fifo_time

Block layer currently abuses rq->csd.list.next for storing fifo_time.
That is a terrible hack and completely unnecessary as well. Union
achieves the same space saving in a cleaner way.

Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

commit | commitdiff | tree

Mikulas Patocka [Tue, 18 Feb 2014 18:09:34 +0000 (13:09 -0500)]

bio: don't write "bio: create slab" messages to syslog

When using device mapper, there are many "bio: create slab" messages in
the log. Device mapper targets have different front_pad, so each time when
we load a target that wasn't loaded before, we allocate a slab with the
appropriate front_pad and there is associated "bio: create slab" message.

This patch removes these messages, there is no need for them.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@fb.com>

Linux 6.x block layer and io_uring tree(s)

RSS Atom