linux-block.git
3 years agoblock: get rid of unnecessary local variable
Yufen Yu [Fri, 9 Oct 2020 03:26:32 +0000 (23:26 -0400)]
block: get rid of unnecessary local variable

Since the whole elevator registration is protected by sysfs_lock, we
don't need the extra 'has_elevator' variable. Just use q->elevator directly.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: fix comment and add lockdep assert
Yufen Yu [Fri, 9 Oct 2020 03:26:31 +0000 (23:26 -0400)]
block: fix comment and add lockdep assert

After commit b89f625e28d4 ("block: don't release queue's sysfs
lock during switching elevator"), the whole elevator register and
unregister paths are covered by sysfs_lock. So, remove the stale
comment and add a lockdep assert.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: use helper function to test hw stopped
Yufen Yu [Fri, 9 Oct 2020 03:26:30 +0000 (23:26 -0400)]
blk-mq: use helper function to test hw stopped

We have already introduced the helper function blk_mq_hctx_stopped() to
test BLK_MQ_S_STOPPED; use it instead of open-coding the bit test.
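
For reference, the helper in block/blk-mq.h boils down to roughly:

        static inline bool blk_mq_hctx_stopped(struct blk_mq_hw_ctx *hctx)
        {
                return test_bit(BLK_MQ_S_STOPPED, &hctx->state);
        }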

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: use helper function to test queue register
Yufen Yu [Fri, 9 Oct 2020 03:26:29 +0000 (23:26 -0400)]
block: use helper function to test queue register

We have defined the common interface blk_queue_registered() to
test QUEUE_FLAG_REGISTERED. Just use it.
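
For reference, the interface is roughly:

        #define blk_queue_registered(q) test_bit(QUEUE_FLAG_REGISTERED, &(q)->queue_flags)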

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove redundant mq check
Yufen Yu [Fri, 9 Oct 2020 03:26:28 +0000 (23:26 -0400)]
block: remove redundant mq check

elv_support_iosched() already checks queue_is_mq() for us. So, remove
the redundant check to clean up the code.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: invoke blk_mq_exit_sched no matter whether have .exit_sched
Yufen Yu [Fri, 9 Oct 2020 03:26:27 +0000 (23:26 -0400)]
block: invoke blk_mq_exit_sched no matter whether have .exit_sched

We register debugfs for the scheduler regardless of whether it defines
an .exit_sched callback. So, blk_mq_exit_sched() is always needed to
unregister debugfs. Also, q->elevator should be set to NULL after
exiting the scheduler.

For now, since all registered schedulers define .exit_sched, this does
not cause any actual problem, but the change is still the more
reasonable behavior.

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopercpu_ref: don't refer to ref->data if it isn't allocated
Ming Lei [Fri, 9 Oct 2020 04:03:56 +0000 (12:03 +0800)]
percpu_ref: don't refer to ref->data if it isn't allocated

We can't check ref->data->confirm_switch directly in __percpu_ref_exit(), since
ref->data may not be allocated for a refcount that was never initialized.

Fixes: 2b0d3d3e4fcf ("percpu_ref: reduce memory footprint of percpu_ref in fast path")
Reported-by: syzbot+fd15ff734dace9e16437@syzkaller.appspotmail.com
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: ratelimit handle_bad_sector() message
Tetsuo Handa [Thu, 8 Oct 2020 13:37:23 +0000 (22:37 +0900)]
block: ratelimit handle_bad_sector() message

syzbot is reporting an unkillable task [1], caused by a caller that fails to
handle a corrupted filesystem image which attempts to access beyond
the end of the device. While the caller needs fixing, flooding the
console with the handle_bad_sector() message is unlikely to be useful.

[1] https://syzkaller.appspot.com/bug?id=f1f49fb971d7a3e01bd8ab8cff2ff4572ccf3092

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Re-use the throtl_set_slice_end()
Baolin Wang [Thu, 8 Oct 2020 03:52:29 +0000 (11:52 +0800)]
blk-throttle: Re-use the throtl_set_slice_end()

Re-use throtl_set_slice_end() to remove duplicate code.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Open code __throtl_de/enqueue_tg()
Baolin Wang [Thu, 8 Oct 2020 03:52:28 +0000 (11:52 +0800)]
blk-throttle: Open code __throtl_de/enqueue_tg()

The __throtl_de/enqueue_tg() functions are only called by
throtl_de/enqueue_tg(), thus we can just open code them to
make the code more readable.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Move service tree validation out of the throtl_rb_first()
Baolin Wang [Thu, 8 Oct 2020 03:52:27 +0000 (11:52 +0800)]
blk-throttle: Move service tree validation out of the throtl_rb_first()

The throtl_schedule_next_dispatch() will validate if the service queue
is empty before calling update_min_dispatch_time(), and the
update_min_dispatch_time() will call throtl_rb_first(), which will
validate service queue again.

Thus we can move the service queue validation out of the
throtl_rb_first() to remove the redundant validation in the fast path.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Move the list operation after list validation
Baolin Wang [Thu, 8 Oct 2020 03:52:26 +0000 (11:52 +0800)]
blk-throttle: Move the list operation after list validation

We should move the list operation to after the list validation.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Fix IO hang for a corner case
Baolin Wang [Thu, 8 Oct 2020 03:52:25 +0000 (11:52 +0800)]
blk-throttle: Fix IO hang for a corner case

throtl_adjusted_limit() cannot scale the limit up if bps or iops is set to
1, which will cause an IO hang when the low limit is enabled. Thus we
should treat 1 as an illegal value to avoid this issue.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Avoid tracking latency if low limit is invalid
Baolin Wang [Thu, 8 Oct 2020 03:52:24 +0000 (11:52 +0800)]
blk-throttle: Avoid tracking latency if low limit is invalid

The IO latency tracking is only used for the LOW limit, so we should add a
check to avoid redundant latency tracking if the LOW limit
is not valid.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Avoid getting the current time if tg->last_finish_time is 0
Baolin Wang [Thu, 8 Oct 2020 03:52:23 +0000 (11:52 +0800)]
blk-throttle: Avoid getting the current time if tg->last_finish_time is 0

We only update tg->last_finish_time when the low limitation is
enabled, so we can move the tg->last_finish_time check a little
forward to avoid fetching an unnecessary current time stamp if
the low limitation is not enabled.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Remove a meaningless parameter for throtl_downgrade_state()
Baolin Wang [Thu, 8 Oct 2020 03:52:22 +0000 (11:52 +0800)]
blk-throttle: Remove a meaningless parameter for throtl_downgrade_state()

throtl_downgrade_state() is always used to change to the LIMIT_LOW
limitation, thus remove the now-meaningless second parameter that
indicates the limitation index.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Remove redundant 'return' statement
Baolin Wang [Mon, 28 Sep 2020 00:42:26 +0000 (08:42 +0800)]
block: Remove redundant 'return' statement

Remove redundant 'return' statement for 'void' functions.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: soft limit zone-append sectors as well
Johannes Thumshirn [Wed, 7 Oct 2020 12:35:08 +0000 (21:35 +0900)]
block: soft limit zone-append sectors as well

Martin rightfully noted that for normal filesystem IO we have soft limits
in place to prevent IOs from getting too big and leading to
unpredictable latencies. For zone append we only have the hardware limit
in place.

Cap the max sectors we submit via zone append at the regular soft limit
whenever that limit is lower than the hardware one.

Reported-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/linux-btrfs/yq1k0w8g3rw.fsf@ca-mkp.ca.oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Consider only dispatched requests for inflight statistic
Gabriel Krisman Bertazi [Tue, 6 Oct 2020 19:41:25 +0000 (15:41 -0400)]
block: Consider only dispatched requests for inflight statistic

According to Documentation/block/stat.rst, inflight should not include
I/O requests that are in the queue but not yet dispatched to the device,
but blk-mq identifies as inflight any request that has a tag allocated,
which, for queues without elevator, happens at request allocation time
and before it is queued in the ctx (default case in blk_mq_submit_bio).

In addition, current behavior is different for queues with elevator from
queues without it, since for the former the driver tag is allocated at
dispatch time.  A more precise approach would be to only consider
requests with state MQ_RQ_IN_FLIGHT.

This effectively reverts commit 6131837b1de6 ("blk-mq: count allocated
but not started requests in iostats inflight") to consolidate blk-mq
behavior with itself (elevator case) and with original documentation,
but it differs from the behavior used by the legacy path.

This version differs from v1 by using blk_mq_rq_state to access the
state attribute.  Avoid using blk_mq_request_started, which was
suggested, since we don't want to include MQ_RQ_COMPLETE.
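
A minimal sketch of the resulting check (helper name hypothetical;
blk_mq_rq_state() is the private accessor in block/blk-mq.h):

        /* only a request actually dispatched to the device counts as inflight */
        static bool rq_is_dispatched(struct request *rq)
        {
                return blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT;
        }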

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move blk_mq_sched_try_merge to blk-merge.c
Christoph Hellwig [Tue, 6 Oct 2020 07:07:19 +0000 (09:07 +0200)]
block: move blk_mq_sched_try_merge to blk-merge.c

Move blk_mq_sched_try_merge to blk-merge.c, which allows marking
a lot of the merge infrastructure static there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove the unused blk_integrity_merge_bio export
Christoph Hellwig [Tue, 6 Oct 2020 07:07:18 +0000 (09:07 +0200)]
block: remove the unused blk_integrity_merge_bio export

Also move the definition from the public blkdev.h to the private
block/blk.h header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove the unused blk_integrity_merge_rq export
Christoph Hellwig [Tue, 6 Oct 2020 07:07:17 +0000 (09:07 +0200)]
block: remove the unused blk_integrity_merge_rq export

Also move the definition from the public blkdev.h to the private
block/blk.h header.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move 'q_usage_counter' into front of 'request_queue'
Ming Lei [Thu, 1 Oct 2020 15:48:42 +0000 (23:48 +0800)]
block: move 'q_usage_counter' into front of 'request_queue'

The 'q_usage_counter' field is always fetched in the fast path of every
block driver, so move it to the front of 'request_queue' so that it is
fetched into the first cacheline of the 'request_queue' instance.

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Veronika Kabatova <vkabatov@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopercpu_ref: reduce memory footprint of percpu_ref in fast path
Ming Lei [Thu, 1 Oct 2020 15:48:41 +0000 (23:48 +0800)]
percpu_ref: reduce memory footprint of percpu_ref in fast path

'struct percpu_ref' is often embedded in a user structure, and the
instance is usually referenced in the fast path; however, only
'percpu_count_ptr' is actually needed there.

So move the other fields into a new 'percpu_ref_data' structure and
allocate it dynamically via kzalloc(). The memory footprint of
'percpu_ref' in the fast path is then reduced a lot and becomes suitable
to put into a hot cacheline of the user structure.
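
A simplified sketch of the split described above (fields abridged from the
patch description; not the verbatim definition):

        /* hot part, embedded in the user structure */
        struct percpu_ref {
                unsigned long           percpu_count_ptr;
                struct percpu_ref_data  *data;          /* cold part, kzalloc'ed */
        };

        /* cold part, only touched in slow paths */
        struct percpu_ref_data {
                atomic_long_t           count;
                percpu_ref_func_t       *release;
                percpu_ref_func_t       *confirm_switch;
                bool                    force_atomic:1;
                bool                    allow_reinit:1;
                struct rcu_head         rcu;
                struct percpu_ref       *ref;
        };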

Signed-off-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Veronika Kabatova <vkabatov@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: warn if !__GFP_DIRECT_RECLAIM in bio_crypt_set_ctx()
Eric Biggers [Wed, 16 Sep 2020 03:53:15 +0000 (20:53 -0700)]
block: warn if !__GFP_DIRECT_RECLAIM in bio_crypt_set_ctx()

bio_crypt_set_ctx() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

For now this assumption is still fine, since no callers violate it.
Making bio_crypt_set_ctx() able to fail would add unneeded complexity.

However, if a caller didn't use __GFP_DIRECT_RECLAIM, it would be very
hard to notice the bug.  Make it easier by adding a WARN_ON_ONCE().
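
A minimal sketch of the added check at the top of bio_crypt_set_ctx():

        /*
         * The caller must supply a gfp_mask that allows direct reclaim so the
         * mempool_alloc() below cannot fail; warn loudly if it does not.
         */
        WARN_ON_ONCE(!(gfp_mask & __GFP_DIRECT_RECLAIM));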

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Satya Tangirala <satyat@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: make blk_crypto_rq_bio_prep() able to fail
Eric Biggers [Wed, 16 Sep 2020 03:53:14 +0000 (20:53 -0700)]
block: make blk_crypto_rq_bio_prep() able to fail

blk_crypto_rq_bio_prep() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

However, blk_crypto_rq_bio_prep() might be called with GFP_ATOMIC via
setup_clone() in drivers/md/dm-rq.c.

This case isn't currently reachable with a bio that actually has an
encryption context.  However, it's fragile to rely on this.  Just make
blk_crypto_rq_bio_prep() able to fail.

Suggested-by: Satya Tangirala <satyat@google.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: make bio_crypt_clone() able to fail
Eric Biggers [Wed, 16 Sep 2020 03:53:13 +0000 (20:53 -0700)]
block: make bio_crypt_clone() able to fail

bio_crypt_clone() assumes its gfp_mask argument always includes
__GFP_DIRECT_RECLAIM, so that the mempool_alloc() will always succeed.

However, bio_crypt_clone() might be called with GFP_ATOMIC via
setup_clone() in drivers/md/dm-rq.c, or with GFP_NOWAIT via
kcryptd_io_read() in drivers/md/dm-crypt.c.

Neither case is currently reachable with a bio that actually has an
encryption context.  However, it's fragile to rely on this.  Just make
bio_crypt_clone() able to fail, analogous to bio_integrity_clone().

Reported-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Satya Tangirala <satyat@google.com>
Cc: Satya Tangirala <satyat@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add a bdget_part helper
Christoph Hellwig [Fri, 25 Sep 2020 16:06:18 +0000 (18:06 +0200)]
block: add a bdget_part helper

All remaining callers of bdget() outside of fs/block_dev.c want to get a
reference to the struct block_device for a given struct hd_struct.  Add
a helper just for that and then mark bdget static.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodrbd: remove ->this_bdev
Christoph Hellwig [Fri, 25 Sep 2020 16:06:17 +0000 (18:06 +0200)]
drbd: remove ->this_bdev

DRBD keeps a block device open just to get and set the capacity from
it.  Switch to primarily using the disk capacity as intended by the
block layer, and sync it to the bdev using revalidate_disk_size.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock-mq: fix comments in blk_mq_queue_tag_busy_iter
yangerkun [Sat, 19 Sep 2020 03:54:25 +0000 (11:54 +0800)]
block-mq: fix comments in blk_mq_queue_tag_busy_iter

Commit f5bbbbe4d635 ("blk-mq: sync the update nr_hw_queues with
blk_mq_queue_tag_busy_iter") introduced a bug where we may sleep inside
an RCU read-side critical section. Commit 530ca2c9bd69 ("blk-mq: Allow
blocking queue tag iter callbacks") then fixed it by taking a reference
on the request_queue, and commit a9a808084d6a ("block: Remove the
synchronize_rcu() call from __blk_mq_update_nr_hw_queues()") removed the
synchronize_rcu() in __blk_mq_update_nr_hw_queues(). Update the
now-confusing comments in blk_mq_queue_tag_busy_iter() to match.

Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()
Xianting Tian [Sat, 26 Sep 2020 02:39:47 +0000 (10:39 +0800)]
blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps()

We found blk_mq_alloc_rq_maps() takes more time in kernel space when
testing nvme device hot-plugging. The test and analysis are as below.

Debug code,
1, blk_mq_alloc_rq_maps():
        u64 start, end;
        depth = set->queue_depth;
        start = ktime_get_ns();
        pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
                        current->pid, current->comm, current->nvcsw, current->nivcsw,
                        set->queue_depth, set->nr_hw_queues);
        do {
                err = __blk_mq_alloc_rq_maps(set);
                if (!err)
                        break;

                set->queue_depth >>= 1;
                if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
                        err = -ENOMEM;
                        break;
                }
        } while (set->queue_depth);
        end = ktime_get_ns();
        pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
                        current->pid, current->comm,
                        current->nvcsw, current->nivcsw, end - start);

2, __blk_mq_alloc_rq_maps():
        u64 start, end;
        for (i = 0; i < set->nr_hw_queues; i++) {
                start = ktime_get_ns();
                if (!__blk_mq_alloc_rq_map(set, i))
                        goto out_unwind;
                end = ktime_get_ns();
                pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
        }

Testing nvme hot-plugging with the above debug code, we found it costs more
than 3ms in kernel space, without being scheduled out, when allocating rqs
for all 16 hw queues with depth 1023; each hw queue costs about 140-250us.
The time grows as the number of hw queues and the queue depth increase. And
in an extreme case, if __blk_mq_alloc_rq_maps() returns -ENOMEM, it retries
with "queue_depth >>= 1", consuming even more time.
[  428.428771] nvme nvme0: pci function 10000:01:00.0
[  428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
[  428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
[  428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
[  432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
[  432.593404] hw queue 0 init cost time 22883 ns
[  432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
[  432.595953] nvme nvme0: 16/0/0 default/read/poll queues
[  432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
[  432.596203] hw queue 0 init cost time 242630 ns
[  432.596441] hw queue 1 init cost time 235913 ns
[  432.596659] hw queue 2 init cost time 216461 ns
[  432.596877] hw queue 3 init cost time 215851 ns
[  432.597107] hw queue 4 init cost time 228406 ns
[  432.597336] hw queue 5 init cost time 227298 ns
[  432.597564] hw queue 6 init cost time 224633 ns
[  432.597785] hw queue 7 init cost time 219954 ns
[  432.597937] hw queue 8 init cost time 150930 ns
[  432.598082] hw queue 9 init cost time 143496 ns
[  432.598231] hw queue 10 init cost time 147261 ns
[  432.598397] hw queue 11 init cost time 164522 ns
[  432.598542] hw queue 12 init cost time 143401 ns
[  432.598692] hw queue 13 init cost time 148934 ns
[  432.598841] hw queue 14 init cost time 147194 ns
[  432.598991] hw queue 15 init cost time 148942 ns
[  432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
[  432.602611]  nvme0n1: p1

So use this patch to add a cond_resched() between each hw queue init, to
allow rescheduling and avoid starving other threads. __blk_mq_alloc_rq_maps()
is not executed in atomic context, so it is safe to call cond_resched().
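
A sketch of the change in the per-hw-queue allocation loop of
__blk_mq_alloc_rq_maps():

        for (i = 0; i < set->nr_hw_queues; i++) {
                if (!__blk_mq_alloc_rq_map(set, i))
                        goto out_unwind;
                cond_resched();         /* yield between hw queue allocations */
        }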

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoiocost: consider iocgs with active delays for debt forgiveness
Tejun Heo [Fri, 18 Sep 2020 18:41:27 +0000 (14:41 -0400)]
iocost: consider iocgs with active delays for debt forgiveness

An iocg may have 0 debt but non-zero delay. The current debt forgiveness
logic doesn't act on such iocgs. This can lead to unexpected behaviors - an
iocg with a little bit of debt will have its delay canceled through debt
forgiveness, but one with no debt but an active delay has to wait
until its delay decays away.

This patch updates the debt handling logic so that it treats delays the same
as debts. If either debt or delay is active, debt forgiveness logic kicks in
and acts on both the same way.

Also, avoid turning the debt and delay directly to zero as that can confuse
state transitions.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoiocost: add iocg_forgive_debt tracepoint
Tejun Heo [Fri, 18 Sep 2020 00:44:56 +0000 (20:44 -0400)]
iocost: add iocg_forgive_debt tracepoint

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoiocost: reimplement debt forgiveness using average usage
Tejun Heo [Fri, 18 Sep 2020 00:44:55 +0000 (20:44 -0400)]
iocost: reimplement debt forgiveness using average usage

Debt forgiveness logic was counting the number of consecutive !busy periods
as the trigger condition. While this usually works, it can easily be thrown
off by temporary fluctuations especially on configurations w/ short periods.

This patch reimplements debt forgiveness so that:

* Use the average usage over the forgiveness period instead of counting
  consecutive periods.

* Debt is reduced at around the target rate (1/2 every 100ms) regardless of
  ioc period duration.

* Usage threshold is raised to 50%. Combined with the preceding changes and
  the switch to average usage, this makes debt forgiveness a lot more
  effective at reducing the amount of unnecessary idleness.

* Constants are renamed with DFGV_ prefix.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoiocost: recalculate delay after debt reduction
Tejun Heo [Fri, 18 Sep 2020 00:44:54 +0000 (20:44 -0400)]
iocost: recalculate delay after debt reduction

Debt sets the initial delay duration which is decayed over time. The current
debt reduction halved the debt but didn't change the delay. It prevented
future debts from increasing delay but didn't do anything to lower the
existing delay, limiting the mechanism's ability to reduce unnecessary
idling.

Reset iocg->delay to 0 after debt reduction so that iocg_kick_waitq()
recalculates new delay value based on the reduced debt amount.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoiocost: replace nr_shortages cond in ioc_forgive_debts() with busy_level one
Tejun Heo [Fri, 18 Sep 2020 00:44:53 +0000 (20:44 -0400)]
iocost: replace nr_shortages cond in ioc_forgive_debts() with busy_level one

Debt reduction was blocked if any iocg was short on budget in the past
period to avoid reducing debts while some iocgs are saturated. However, this
ends up unnecessarily blocking debt reduction due to temporary local
imbalances when the device is generally being underutilized, while also
failing to block when the underlying device is overwhelmed and the usage
becomes low from high latency.

Given that debt accumulation mostly happens with swapout bursts which can
significantly deteriorate the underlying device's latency response, the
current logic is not great.

Let's replace it with ioc->busy_level based condition so that we block debt
reduction when the underlying device is being saturated. ioc_forgive_debts()
call is moved after busy_level determination.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoiocost: factor out ioc_forgive_debts()
Tejun Heo [Fri, 18 Sep 2020 00:44:52 +0000 (20:44 -0400)]
iocost: factor out ioc_forgive_debts()

Debt reduction logic is going to be improved and expanded. Factor it out
into ioc_forgive_debts() and generalize the comment a bit. No functional
change.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodm: add support for REQ_NOWAIT and enable it for linear target
Konstantin Khlebnikov [Wed, 23 Sep 2020 20:06:52 +0000 (16:06 -0400)]
dm: add support for REQ_NOWAIT and enable it for linear target

Add DM target feature flag DM_TARGET_NOWAIT which advertises that
target works with REQ_NOWAIT bios.

Add dm_table_supports_nowait() and update dm_table_set_restrictions()
to set/clear QUEUE_FLAG_NOWAIT accordingly.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add QUEUE_FLAG_NOWAIT
Mike Snitzer [Wed, 23 Sep 2020 20:06:51 +0000 (16:06 -0400)]
block: add QUEUE_FLAG_NOWAIT

Add QUEUE_FLAG_NOWAIT to allow a block device to advertise support for
REQ_NOWAIT. Bio-based devices may set QUEUE_FLAG_NOWAIT where
applicable.

Update QUEUE_FLAG_MQ_DEFAULT to include QUEUE_FLAG_NOWAIT.  Also
update submit_bio_checks() to verify it is set for REQ_NOWAIT bios.
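
A hedged sketch of the submit_bio_checks() verification (error path
abbreviated):

        /* REQ_NOWAIT can only be honored by devices that advertise nowait support */
        if ((bio->bi_opf & REQ_NOWAIT) && !blk_queue_nowait(q))
                goto not_supported;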

Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agovsprintf: use bd_partno in bdev_name
Christoph Hellwig [Thu, 3 Sep 2020 05:41:04 +0000 (07:41 +0200)]
vsprintf: use bd_partno in bdev_name

No need to go through the hd_struct to find the partition number.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: use bd_partno in bdevname
Christoph Hellwig [Thu, 3 Sep 2020 05:41:03 +0000 (07:41 +0200)]
block: use bd_partno in bdevname

No need to go through the hd_struct to find the partition number.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agotarget/iblock: fix holder printing in iblock_show_configfs_dev_params
Christoph Hellwig [Thu, 3 Sep 2020 05:41:02 +0000 (07:41 +0200)]
target/iblock: fix holder printing in iblock_show_configfs_dev_params

bd_contains is never NULL for an open block device.  In addition ibd_bd
is always set to a block device that was exclusively opened by the
target code, so the holder is guaranteed to be ib_dev as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodrbd: don't set ->bd_contains
Christoph Hellwig [Thu, 3 Sep 2020 05:41:01 +0000 (07:41 +0200)]
drbd: don't set ->bd_contains

The ->bd_contains field is set by __blkdev_get and drivers have no
business manipulating it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodrbd: don't detour through bd_contains for the gendisk
Christoph Hellwig [Thu, 3 Sep 2020 05:41:00 +0000 (07:41 +0200)]
drbd: don't detour through bd_contains for the gendisk

bd_disk is set on all block devices, including those for partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: don't detour through bd_contains for the gendisk
Christoph Hellwig [Thu, 3 Sep 2020 05:40:59 +0000 (07:40 +0200)]
md: don't detour through bd_contains for the gendisk

bd_disk is set on all block devices, including those for partitions.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: compare bd_disk instead of bd_contains
Christoph Hellwig [Thu, 3 Sep 2020 05:40:58 +0000 (07:40 +0200)]
md: compare bd_disk instead of bd_contains

To check for partitions of the same disk bd_contains works as well, but
bd_disk is way more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add a bdev_is_partition helper
Christoph Hellwig [Thu, 3 Sep 2020 05:40:57 +0000 (07:40 +0200)]
block: add a bdev_is_partition helper

Add a little helper to make the somewhat arcane bd_contains checks a
little more obvious.
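
One plausible form of the helper (hedged; the exact check may differ):

        static inline bool bdev_is_partition(struct block_device *bdev)
        {
                return bdev->bd_partno;
        }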

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoDocumentation/hdio: fix up obscure bd_contains references
Christoph Hellwig [Thu, 3 Sep 2020 05:40:56 +0000 (07:40 +0200)]
Documentation/hdio: fix up obscure bd_contains references

bd_contains is an implementation detail and should not be mentioned in
a userspace API documentation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag
Christoph Hellwig [Thu, 24 Sep 2020 06:51:40 +0000 (08:51 +0200)]
bdi: replace BDI_CAP_NO_{WRITEBACK,ACCT_DIRTY} with a single flag

Replace the two negative flags that are always used together with a
single positive flag that indicates the writeback capability instead
of two related non-capabilities.  Also remove the pointless wrappers
to just check the flag.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobdi: invert BDI_CAP_NO_ACCT_WB
Christoph Hellwig [Thu, 24 Sep 2020 06:51:39 +0000 (08:51 +0200)]
bdi: invert BDI_CAP_NO_ACCT_WB

Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to
make the checks more obvious.  Also remove the pointless
bdi_cap_account_writeback wrapper that just obfuscates the check.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag
Christoph Hellwig [Thu, 24 Sep 2020 06:51:38 +0000 (08:51 +0200)]
bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag

The BDI_CAP_STABLE_WRITES is one of the few bits of information in the
backing_dev_info shared between the block drivers and the writeback code.
To help untangling the dependency replace it with a queue flag and a
superblock flag derived from it.  This also helps with the case of e.g.
a file system requiring stable writes due to its own checksumming, but
not forcing it on other users of the block device like the swap code.

One downside is that we can't support the stable_pages_required bdi
attribute in sysfs anymore.  It is replaced with a queue attribute which
is also writable for easier testing.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomm: use SWP_SYNCHRONOUS_IO more intelligently
Christoph Hellwig [Thu, 24 Sep 2020 06:51:37 +0000 (08:51 +0200)]
mm: use SWP_SYNCHRONOUS_IO more intelligently

There is no point in trying to call bdev_read_page if SWP_SYNCHRONOUS_IO
is not set, as the device won't support it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobdi: remove BDI_CAP_SYNCHRONOUS_IO
Christoph Hellwig [Thu, 24 Sep 2020 06:51:36 +0000 (08:51 +0200)]
bdi: remove BDI_CAP_SYNCHRONOUS_IO

BDI_CAP_SYNCHRONOUS_IO is only checked in the swap code, and used to
decide if ->rw_page can be used on a block device.  Just check for
the method instead.  The only complication is that zram needs a second
set of block_device_operations as it can switch between modes that
actually support ->rw_page and those that don't.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobdi: remove BDI_CAP_CGROUP_WRITEBACK
Christoph Hellwig [Thu, 24 Sep 2020 06:51:35 +0000 (08:51 +0200)]
bdi: remove BDI_CAP_CGROUP_WRITEBACK

Just checking SB_I_CGROUPWB for cgroup writeback support is enough.
Either the file system allocates its own bdi (e.g. btrfs), in which case
it is known to support cgroup writeback, or the bdi comes from the block
layer, which always supports cgroup writeback.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: lift setting the readahead size into the block layer
Christoph Hellwig [Thu, 24 Sep 2020 06:51:34 +0000 (08:51 +0200)]
block: lift setting the readahead size into the block layer

Drivers shouldn't really mess with the readahead size, as that is a VM
concept.  Instead set it based on the optimal I/O size by lifting the
algorithm from the md driver when registering the disk.  Also set
bdi->io_pages there as well by applying the same scheme based on
max_sectors.  To ensure the limits work well for stacking drivers a
new helper is added to update the readahead limits from the block
limits, which is also called from disk_stack_limits.
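
A rough sketch of the new helper (constants and rounding approximate):

        void blk_queue_update_readahead(struct request_queue *q)
        {
                /* read ahead at least twice the optimal I/O size */
                q->backing_dev_info->ra_pages =
                        max(queue_io_opt(q) * 2 / PAGE_SIZE, VM_READAHEAD_PAGES);
                q->backing_dev_info->io_pages =
                        queue_max_sectors(q) >> (PAGE_SHIFT - 9);
        }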

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: update the optimal I/O size on reshape
Christoph Hellwig [Thu, 24 Sep 2020 06:51:33 +0000 (08:51 +0200)]
md: update the optimal I/O size on reshape

The raid5 and raid10 drivers currently update the read-ahead size,
but not the optimal I/O size on reshape.  To prepare for deriving the
read-ahead size from the optimal I/O size make sure it is updated
as well.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobdi: initialize ->ra_pages and ->io_pages in bdi_init
Christoph Hellwig [Thu, 24 Sep 2020 06:51:32 +0000 (08:51 +0200)]
bdi: initialize ->ra_pages and ->io_pages in bdi_init

Set up a readahead size by default, as very few users have a good
reason to change it.  This means coda, ecryptfs, and orangefs now
set up the values they were previously missing, while ubifs,
mtd and vboxsf manually set it to 0 to avoid readahead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: David Sterba <dsterba@suse.com> [btrfs]
Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoaoe: set an optimal I/O size
Christoph Hellwig [Thu, 24 Sep 2020 06:51:31 +0000 (08:51 +0200)]
aoe: set an optimal I/O size

aoe forces a larger readahead size, but any reason to do larger I/O
is not limited to readahead.  Also set the optimal I/O size, and
remove the local constants in favor of just using SZ_2G.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobcache: inherit the optimal I/O size
Christoph Hellwig [Thu, 24 Sep 2020 06:51:30 +0000 (08:51 +0200)]
bcache: inherit the optimal I/O size

Inherit the optimal I/O size setting just like the readahead window,
as any reason to do larger I/O does not apply to just readahead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Acked-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodrbd: remove dead code in device_to_statistics
Christoph Hellwig [Thu, 24 Sep 2020 06:51:29 +0000 (08:51 +0200)]
drbd: remove dead code in device_to_statistics

Ever since the switch to blk-mq, a lower device not used for VM
writeback will not be marked congested, so the check will never
trigger.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agofs: remove the unused SB_I_MULTIROOT flag
Christoph Hellwig [Thu, 24 Sep 2020 06:51:28 +0000 (08:51 +0200)]
fs: remove the unused SB_I_MULTIROOT flag

The last user of SB_I_MULTIROOT disappeared with commit f2aedb713c28
("NFS: Add fs_context support.").

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: mark blkdev_get static
Christoph Hellwig [Mon, 21 Sep 2020 07:19:58 +0000 (09:19 +0200)]
block: mark blkdev_get static

There are no users outside the core block code left now.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoPM: mm: cleanup swsusp_swap_check
Christoph Hellwig [Mon, 21 Sep 2020 07:19:57 +0000 (09:19 +0200)]
PM: mm: cleanup swsusp_swap_check

Use blkdev_get_by_dev instead of bdget + blkdev_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomm: split swap_type_of
Christoph Hellwig [Mon, 21 Sep 2020 07:19:56 +0000 (09:19 +0200)]
mm: split swap_type_of

swap_type_of is used for two entirely different purposes:

 (1) check what swap type a given device/offset corresponds to
 (2) find the first available swap device that can be written to

Mixing both in a single function creates an unreadable mess.  Create two
separate functions instead, and switch both to pass a dev_t instead of
a struct block_device to further simplify the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoPM: rewrite is_hibernate_resume_dev to not require an inode
Christoph Hellwig [Mon, 21 Sep 2020 07:19:55 +0000 (09:19 +0200)]
PM: rewrite is_hibernate_resume_dev to not require an inode

Just check the dev_t to help simplifying the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomm: cleanup claim_swapfile
Christoph Hellwig [Mon, 21 Sep 2020 07:19:54 +0000 (09:19 +0200)]
mm: cleanup claim_swapfile

Use blkdev_get_by_dev instead of bdgrab + blkdev_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoocfs2: cleanup o2hb_region_dev_store
Christoph Hellwig [Mon, 21 Sep 2020 07:19:53 +0000 (09:19 +0200)]
ocfs2: cleanup o2hb_region_dev_store

Use blkdev_get_by_dev instead of igrab (aka open coded bdgrab) +
blkdev_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodasd: cleanup dasd_scan_partitions
Christoph Hellwig [Mon, 21 Sep 2020 07:19:52 +0000 (09:19 +0200)]
dasd: cleanup dasd_scan_partitions

Use blkdev_get_by_dev instead of bdget_disk + blkdev_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoraw: don't keep unopened block device around
Christoph Hellwig [Mon, 21 Sep 2020 07:19:51 +0000 (09:19 +0200)]
raw: don't keep unopened block device around

Turn binding into a normal dev_t as the struct block device doesn't
buy us anything and use blkdev_open_by_dev to actually open it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agozram: cleanup backing_dev_store
Christoph Hellwig [Mon, 21 Sep 2020 07:19:50 +0000 (09:19 +0200)]
zram: cleanup backing_dev_store

Use blkdev_get_by_dev instead of bdgrab + blkdev_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopktcdvd: use blkdev_get_by_dev instead of open coding it
Christoph Hellwig [Mon, 21 Sep 2020 07:19:49 +0000 (09:19 +0200)]
pktcdvd: use blkdev_get_by_dev instead of open coding it

Replace bdget + blkdev_get by blkdev_get_by_dev.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agopktcdvd: remove the if 0'ed pkt_start_recovery function
Christoph Hellwig [Mon, 21 Sep 2020 07:19:48 +0000 (09:19 +0200)]
pktcdvd: remove the if 0'ed pkt_start_recovery function

Remove code which has been dead since the initial commit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: cleanup blkdev_bszset
Christoph Hellwig [Mon, 21 Sep 2020 07:19:47 +0000 (09:19 +0200)]
block: cleanup blkdev_bszset

Use blkdev_get_by_dev instead of bdgrab + blkdev_get.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: cleanup partition scanning in register_disk
Christoph Hellwig [Mon, 21 Sep 2020 07:19:46 +0000 (09:19 +0200)]
block: cleanup partition scanning in register_disk

Use blkdev_get_by_dev instead of open coding it using bdget_disk +
blkdev_get, and split the code to read the partition table into a
separate helper to make it a little more obvious.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move the NEED_PART_SCAN flag to struct gendisk
Christoph Hellwig [Mon, 21 Sep 2020 07:19:45 +0000 (09:19 +0200)]
block: move the NEED_PART_SCAN flag to struct gendisk

We can only scan for partitions on the whole disk, so move the flag
from struct block_device to struct gendisk.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: allow 'chunk_sectors' to be non-power-of-2
Mike Snitzer [Tue, 22 Sep 2020 02:32:49 +0000 (22:32 -0400)]
block: allow 'chunk_sectors' to be non-power-of-2

It is possible, albeit more unlikely, for a block device to have a non
power-of-2 for chunk_sectors (e.g. 10+2 RAID6 with 128K chunk_sectors,
which results in a full-stripe size of 1280K. This causes the RAID6's
io_opt to be advertised as 1280K, and a stacked device _could_ then be
made to use a blocksize, aka chunk_sectors, that matches non power-of-2
io_opt of underlying RAID6 -- resulting in stacked device's
chunk_sectors being a non power-of-2).

Update blk_queue_chunk_sectors() and blk_max_size_offset() to
accommodate drivers that need a non power-of-2 chunk_sectors.
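
A hedged sketch of the blk_max_size_offset() adjustment (surrounding code
elided):

        /* fall back to division when chunk_sectors is not a power of 2 */
        if (likely(is_power_of_2(chunk_sectors)))
                chunk_sectors -= offset & (chunk_sectors - 1);
        else
                chunk_sectors -= sector_div(offset, chunk_sectors);

        return min(q->limits.max_sectors, chunk_sectors);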

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: use lcm_not_zero() when stacking chunk_sectors
Mike Snitzer [Tue, 22 Sep 2020 02:32:48 +0000 (22:32 -0400)]
block: use lcm_not_zero() when stacking chunk_sectors

Like 'io_opt', blk_stack_limits() should stack 'chunk_sectors' using
lcm_not_zero() rather than min_not_zero() -- otherwise the final
'chunk_sectors' could result in sub-optimal alignment of IO to
component devices in the IO stack.

Also, if 'chunk_sectors' isn't a multiple of 'physical_block_size'
then it is a bug in the driver and the device should be flagged as
'misaligned'.
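
A sketch of the corresponding blk_stack_limits() logic (details approximate):

        t->chunk_sectors = lcm_not_zero(t->chunk_sectors, b->chunk_sectors);

        /* a chunk size not aligned to the physical block size is a driver bug */
        if ((t->chunk_sectors << 9) & (t->physical_block_size - 1)) {
                t->chunk_sectors = 0;
                t->misaligned = 1;
                ret = -1;
        }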

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: fix bmd->is_null_mapped initialization
Christoph Hellwig [Wed, 23 Sep 2020 15:07:13 +0000 (17:07 +0200)]
block: fix bmd->is_null_mapped initialization

bmd is allocated using kmalloc in bio_alloc_map_data, so make sure
is_null_mapped is properly initialized to false for the !null_mapped
case.

Fixes: f3256075ba49 ("block: remove the BIO_NULL_MAPPED flag")
Reported-by: Marc Hartmayer <mhartmay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: drop double zeroing
Julia Lawall [Sun, 20 Sep 2020 11:26:18 +0000 (13:26 +0200)]
block: drop double zeroing

sg_init_table zeroes its first argument, so the allocation of that argument
doesn't have to.

the semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression x;
@@

x =
- kzalloc
+ kmalloc
 (...)
...
sg_init_table(x,...)
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Avoid checking bps/iops limitation if bps or iops is unlimited
Baolin Wang [Mon, 7 Sep 2020 08:10:17 +0000 (16:10 +0800)]
blk-throttle: Avoid checking bps/iops limitation if bps or iops is unlimited

There is no need to check the bps or iops limitation if bps or iops is unlimited.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Avoid calculating bps/iops limitation repeatedly
Baolin Wang [Mon, 7 Sep 2020 08:10:16 +0000 (16:10 +0800)]
blk-throttle: Avoid calculating bps/iops limitation repeatedly

The tg_may_dispatch() will call tg_with_in_bps_limit() and
tg_with_in_iops_limit() to check if we can dispatch a bio or
not, which will calculate bps/iops limitation multiple times.
But tg_may_dispatch() is always called under queue lock, which
means the bps/iops limitation will not change in tg_may_dispatch().

So we can calculate the bps/iops limitation only once, and pass
them to tg_with_in_bps_limit() and tg_with_in_iops_limit() to
avoid calculating bps/iops limitation repeatedly.
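
A sketch of the idea in tg_may_dispatch() (parameter types and order assumed
for illustration):

        /* queue_lock is held, so the limits cannot change during this call */
        u64 bps_limit = tg_bps_limit(tg, rw);
        u32 iops_limit = tg_iops_limit(tg, rw);

        if (tg_with_in_bps_limit(tg, bio, bps_limit, &bps_wait) &&
            tg_with_in_iops_limit(tg, bio, iops_limit, &iops_wait)) {
                if (wait)
                        *wait = 0;
                return true;
        }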

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Define readable macros instead of static variables
Baolin Wang [Mon, 7 Sep 2020 08:10:15 +0000 (16:10 +0800)]
blk-throttle: Define readable macros instead of static variables

The 'throtl_grp_quantum' and 'throtl_quantum' are both read-only
variables, so it is better to use readable macros instead of static
variables, which also saves some space in the .bss area.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Use readable READ/WRITE macros
Baolin Wang [Mon, 7 Sep 2020 08:10:14 +0000 (16:10 +0800)]
blk-throttle: Use readable READ/WRITE macros

Use readable READ/WRITE macros instead of magic numbers.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-throttle: Fix some comments' typos
Baolin Wang [Mon, 7 Sep 2020 08:10:13 +0000 (16:10 +0800)]
blk-throttle: Fix some comments' typos

Fix some comments' typos.

Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoiocost: fix infinite loop bug in adjust_inuse_and_calc_cost()
Tejun Heo [Mon, 14 Sep 2020 15:05:13 +0000 (11:05 -0400)]
iocost: fix infinite loop bug in adjust_inuse_and_calc_cost()

adjust_inuse_and_calc_cost() is responsible for reducing the amount of
donated weights dynamically in period as the budget runs low. Because we
don't want to do full donation calculation in period, we keep latching up
inuse by INUSE_ADJ_STEP_PCT of the active weight of the cgroup until the
resulting hweight_inuse is satisfactory.

Unfortunately, the adj_step calculation was reading the active weight before
acquiring ioc->lock. Because the current thread could have lost race to
activate the iocg to another thread before entering this function, it may
read the active weight as zero before acquiring ioc->lock. When this
happens, the adj_step is calculated as zero and the incremental adjustment
loop becomes an infinite one.

Fix it by fetching the active weight after acquiring ioc->lock.
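
A hedged sketch of the fix in adjust_inuse_and_calc_cost() (surrounding retry
loop elided):

        spin_lock_irq(&ioc->lock);

        /*
         * Read the active weight only after taking ioc->lock, so a racing
         * activation can no longer leave it at zero here.
         */
        adj_step = DIV_ROUND_UP(iocg->active * INUSE_ADJ_STEP_PCT, 100);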

Fixes: b0853ab4a238 ("blk-iocost: revamp in-period donation snapbacks")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: fix divide-by-zero in transfer_surpluses()
Tejun Heo [Fri, 11 Sep 2020 22:40:49 +0000 (18:40 -0400)]
blk-iocost: fix divide-by-zero in transfer_surpluses()

Conceptually, root_iocg->hweight_donating must be less than WEIGHT_ONE but
all hweight calculations round up and thus it may end up >= WEIGHT_ONE
triggering divide-by-zero and other issues. Bound the value to avoid
surprises.
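
A hedged sketch of the kind of bound described (variable name hypothetical):

        /* keep the donating fraction strictly below 100% to avoid a zero divisor */
        hwd = clamp_t(u32, hwd, 1, WEIGHT_ONE - 1);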

Fixes: e08d02aa5fc9 ("blk-iocost: implement Andy's method for donation weight updates")
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agobcache: use part_[begin|end]_io_acct instead of disk_[begin|end]_io_acct
Song Liu [Mon, 31 Aug 2020 22:27:25 +0000 (15:27 -0700)]
bcache: use part_[begin|end]_io_acct instead of disk_[begin|end]_io_acct

This enables proper statistics in /proc/diskstats for bcache partitions.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Coly Li <colyli@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: use part_[begin|end]_io_acct instead of disk_[begin|end]_io_acct
Song Liu [Mon, 31 Aug 2020 22:27:24 +0000 (15:27 -0700)]
md: use part_[begin|end]_io_acct instead of disk_[begin|end]_io_acct

This enables proper statistics in /proc/diskstats for md partitions.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: introduce part_[begin|end]_io_acct
Song Liu [Mon, 31 Aug 2020 22:27:23 +0000 (15:27 -0700)]
block: introduce part_[begin|end]_io_acct

These functions can be used to enable iostat for partitions on devices
like md, bcache.

Signed-off-by: Song Liu <songliubraving@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: always allow reserved allocation in hctx_may_queue
Ming Lei [Fri, 11 Sep 2020 10:41:14 +0000 (18:41 +0800)]
blk-mq: always allow reserved allocation in hctx_may_queue

NVMe shares a tagset between the fabric queue and the admin queue, or between
connect_q and the NS queues, so hctx_may_queue() can be called to allocate
requests for these queues.

Tags can be reserved in these tagsets. Before error recovery, there are
often lots of in-flight requests which can't be completed, and a new
reserved request may be needed in the error recovery path. However,
hctx_may_queue() can always return false because there are too many
in-flight requests which can't be completed during error handling.
Finally, nothing can proceed.

Fix this issue by always allowing reserved tag allocation in
hctx_may_queue(). This is reasonable because reserved tags are supposed
to always be available.
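
A simplified sketch of the intent (the actual patch may structure this
differently):

        /* reserved allocations must never be throttled by the shared-tag limit */
        if (data->flags & BLK_MQ_REQ_RESERVED)
                return true;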

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Cc: David Milburn <dmilburn@redhat.com>
Cc: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove duplicate include statement in scsi_ioctl.c
Tian Tao [Fri, 11 Sep 2020 10:58:52 +0000 (18:58 +0800)]
block: remove duplicate include statement in scsi_ioctl.c

scsi/sg.h is included more than once; remove the duplicate include.

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblkcg: add plugging support for punt bio
Xianting Tian [Thu, 10 Sep 2020 06:15:06 +0000 (14:15 +0800)]
blkcg: add plugging support for punt bio

The test and the explanation of the patch are as below.

Before test we added more debug code in blkg_async_bio_workfn():
        int count = 0;

        if (bios.head && bios.head->bi_next) {
                need_plug = true;
                blk_start_plug(&plug);
        }
        while ((bio = bio_list_pop(&bios))) {
                /* io_punt is a sysctl user interface to control the print */
                if (io_punt) {
                        printk("[%s:%d] bio start,size:%llu,%d count=%d plug?%d\n",
                               current->comm, current->pid, bio->bi_iter.bi_sector,
                               (bio->bi_iter.bi_size)>>9, count++, need_plug);
                }
                submit_bio(bio);
        }
        if (need_plug)
                blk_finish_plug(&plug);

Steps that need to be set to trigger *PUNT* io before testing:
mount -t btrfs -o compress=lzo /dev/sda6 /btrfs
mount -t cgroup2 nodev /cgroup2
mkdir /cgroup2/cg3
echo "+io" > /cgroup2/cgroup.subtree_control
echo "8:0 wbps=1048576000" > /cgroup2/cg3/io.max #1000M/s
echo $$ > /cgroup2/cg3/cgroup.procs

Then use dd command to test btrfs PUNT io in current shell:
dd if=/dev/zero of=/btrfs/file bs=64K count=100000

Test hardware environment as below:
[root@localhost btrfs]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel

With the above debug code, test command and test environment, I ran the
tests under 3 different system loads, triggered by stress:
1, Run 64 threads by command "stress -c 64 &"
[53615.975974] [kworker/u66:18:1490] bio start,size:45583056,8 count=0 plug?1
[53615.975980] [kworker/u66:18:1490] bio start,size:45583064,8 count=1 plug?1
[53615.975984] [kworker/u66:18:1490] bio start,size:45583072,8 count=2 plug?1
[53615.975987] [kworker/u66:18:1490] bio start,size:45583080,8 count=3 plug?1
[53615.975990] [kworker/u66:18:1490] bio start,size:45583088,8 count=4 plug?1
[53615.975993] [kworker/u66:18:1490] bio start,size:45583096,8 count=5 plug?1
... ...
[53615.977041] [kworker/u66:18:1490] bio start,size:45585480,8 count=303 plug?1
[53615.977044] [kworker/u66:18:1490] bio start,size:45585488,8 count=304 plug?1
[53615.977047] [kworker/u66:18:1490] bio start,size:45585496,8 count=305 plug?1
[53615.977050] [kworker/u66:18:1490] bio start,size:45585504,8 count=306 plug?1
[53615.977053] [kworker/u66:18:1490] bio start,size:45585512,8 count=307 plug?1
[53615.977056] [kworker/u66:18:1490] bio start,size:45585520,8 count=308 plug?1
[53615.977058] [kworker/u66:18:1490] bio start,size:45585528,8 count=309 plug?1

2, Run 32 threads by command "stress -c 32 &"
[50586.290521] [kworker/u66:6:32351] bio start,size:45806496,8 count=0 plug?1
[50586.290526] [kworker/u66:6:32351] bio start,size:45806504,8 count=1 plug?1
[50586.290529] [kworker/u66:6:32351] bio start,size:45806512,8 count=2 plug?1
[50586.290531] [kworker/u66:6:32351] bio start,size:45806520,8 count=3 plug?1
[50586.290533] [kworker/u66:6:32351] bio start,size:45806528,8 count=4 plug?1
[50586.290535] [kworker/u66:6:32351] bio start,size:45806536,8 count=5 plug?1
... ...
[50586.299640] [kworker/u66:5:32350] bio start,size:45808576,8 count=252 plug?1
[50586.299643] [kworker/u66:5:32350] bio start,size:45808584,8 count=253 plug?1
[50586.299646] [kworker/u66:5:32350] bio start,size:45808592,8 count=254 plug?1
[50586.299649] [kworker/u66:5:32350] bio start,size:45808600,8 count=255 plug?1
[50586.299652] [kworker/u66:5:32350] bio start,size:45808608,8 count=256 plug?1
[50586.299663] [kworker/u66:5:32350] bio start,size:45808616,8 count=257 plug?1
[50586.299665] [kworker/u66:5:32350] bio start,size:45808624,8 count=258 plug?1
[50586.299668] [kworker/u66:5:32350] bio start,size:45808632,8 count=259 plug?1

3, Don't run thread by stress
[50861.355246] [kworker/u66:19:32376] bio start,size:13544504,8 count=0 plug?0
[50861.355288] [kworker/u66:19:32376] bio start,size:13544512,8 count=0 plug?0
[50861.355322] [kworker/u66:19:32376] bio start,size:13544520,8 count=0 plug?0
[50861.355353] [kworker/u66:19:32376] bio start,size:13544528,8 count=0 plug?0
[50861.355392] [kworker/u66:19:32376] bio start,size:13544536,8 count=0 plug?0
[50861.355431] [kworker/u66:19:32376] bio start,size:13544544,8 count=0 plug?0
[50861.355468] [kworker/u66:19:32376] bio start,size:13544552,8 count=0 plug?0
[50861.355499] [kworker/u66:19:32376] bio start,size:13544560,8 count=0 plug?0
[50861.355532] [kworker/u66:19:32376] bio start,size:13544568,8 count=0 plug?0
[50861.355575] [kworker/u66:19:32376] bio start,size:13544576,8 count=0 plug?0
[50861.355618] [kworker/u66:19:32376] bio start,size:13544584,8 count=0 plug?0
[50861.355659] [kworker/u66:19:32376] bio start,size:13544592,8 count=0 plug?0
[50861.355740] [kworker/u66:0:32346] bio start,size:13544600,8 count=0 plug?1
[50861.355748] [kworker/u66:0:32346] bio start,size:13544608,8 count=1 plug?1
[50861.355962] [kworker/u66:2:32347] bio start,size:13544616,8 count=0 plug?0
[50861.356272] [kworker/u66:7:31962] bio start,size:13544624,8 count=0 plug?0
[50861.356446] [kworker/u66:7:31962] bio start,size:13544632,8 count=0 plug?0
[50861.356567] [kworker/u66:7:31962] bio start,size:13544640,8 count=0 plug?0
[50861.356707] [kworker/u66:19:32376] bio start,size:13544648,8 count=0 plug?0
[50861.356748] [kworker/u66:15:32355] bio start,size:13544656,8 count=0 plug?0
[50861.356825] [kworker/u66:17:31970] bio start,size:13544664,8 count=0 plug?0

Analysis of the above 3 test results under different system loads:
From the above tests, we can see that more and more continuous bios can
be plugged as the system load increases. When running "stress -c 64 &",
310 continuous bios are plugged; when running "stress -c 32 &", 260
continuous bios are plugged; when stress is not running, at most 2
continuous bios are plugged, and in most cases bio_list contains only a
single bio.

How to explain the above phenomenon:
We know that in submit_bio(), if the bio is a REQ_CGROUP_PUNT io, it is
queued as work on the blkcg_punt_bio_wq workqueue. When that work gets
scheduled depends on the system load. When the system load is low, the
workqueue is scheduled quickly and the bios in bio_list are processed
promptly in blkg_async_bio_workfn(), so there is little chance for the
same io submit thread to add multiple continuous bios to bio_list
before the workqueue runs. This analysis matches test "3" above.
When the system load is high, there is some delay before the workqueue
can be scheduled to run; the higher the system load, the greater the
delay. So there is a better chance for the same io submit thread to add
multiple continuous bios to bio_list, and by the time the workqueue
runs there are more continuous bios in bio_list to be processed in
blkg_async_bio_workfn(). This analysis matches tests "1" and "2" above.

According to the tests, io performance is improved with the patch,
especially when the system load is higher. A further optimization is to
use the plug only when bio_list contains at least 2 bios.
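
For reference, a minimal sketch of the plugged submission described
above (simplified, not the literal diff; the locking around the
per-blkg list is omitted and the async_bios field name is taken from
the punt path as an assumption):

	struct bio_list bios = BIO_EMPTY_LIST;
	struct bio *bio;
	struct blk_plug plug;
	bool need_plug = false;

	/* drain the punted bios onto a local list (locking omitted) */
	bio_list_merge(&bios, &blkg->async_bios);
	bio_list_init(&blkg->async_bios);

	/* start a plug only when bio_list contains at least 2 bios */
	if (bios.head && bios.head->bi_next) {
		need_plug = true;
		blk_start_plug(&plug);
	}
	while ((bio = bio_list_pop(&bios)))
		submit_bio(bio);
	if (need_plug)
		blk_finish_plug(&plug);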

Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: remove check_disk_change
Christoph Hellwig [Tue, 8 Sep 2020 14:53:47 +0000 (16:53 +0200)]
block: remove check_disk_change

Remove the now unused check_disk_change helper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agosr: simplify sr_block_revalidate_disk
Christoph Hellwig [Tue, 8 Sep 2020 14:53:46 +0000 (16:53 +0200)]
sr: simplify sr_block_revalidate_disk

Both callers have a valid CD structure available, so rely on that instead
of getting another reference.  Also move the function to avoid a forward
declaration.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agosr: use bdev_check_media_change
Christoph Hellwig [Tue, 8 Sep 2020 14:53:45 +0000 (16:53 +0200)]
sr: use bdev_check_media_change

Switch to use bdev_check_media_change instead of check_disk_change and
call sr_block_revalidate_disk manually.  Also add an explicit call to
sr_block_revalidate_disk just before disk_add() to ensure we always
check for a ready unit and read the TOC, and then stop wiring up
->revalidate_disk.
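
As a rough illustration of the conversion pattern used by this and the
following driver patches (a sketch of the ->open() path, not the exact
diff; the argument passed to the revalidate helper is an assumption):

	/* old: block layer triggered ->revalidate_disk internally */
	-	check_disk_change(bdev);
	/* new: test for a media change and revalidate explicitly */
	+	if (bdev_check_media_change(bdev))
	+		sr_block_revalidate_disk(cd);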

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agosd: use bdev_check_media_change
Christoph Hellwig [Tue, 8 Sep 2020 14:53:44 +0000 (16:53 +0200)]
sd: use bdev_check_media_change

Switch to use bdev_check_media_change instead of check_disk_change and
call sd_revalidate_disk manually.  As sd also calls sd_revalidate_disk
manually during probe and open, the extra call into ->revalidate_disk
from bdev_disk_changed is not required either, so stop wiring up the
method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agomd: use bdev_check_media_change
Christoph Hellwig [Tue, 8 Sep 2020 14:53:43 +0000 (16:53 +0200)]
md: use bdev_check_media_change

The md driver does not have a ->revalidate_disk method, so it can just
use bdev_check_media_change without any additional changes.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Acked-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoide-gd: stop using the disk events mechanism
Christoph Hellwig [Tue, 8 Sep 2020 14:53:42 +0000 (16:53 +0200)]
ide-gd: stop using the disk events mechanism

ide-gd is only using the disk events mechanism to be able to force an
invalidation and partition scan on opening removable media.  Just open
code the logic without involving the block layer.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoide-cd: remove idecd_revalidate_disk
Christoph Hellwig [Tue, 8 Sep 2020 14:53:41 +0000 (16:53 +0200)]
ide-cd: remove idecd_revalidate_disk

Just merge the trivial function into its only caller.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoide-cd: use bdev_check_media_change
Christoph Hellwig [Tue, 8 Sep 2020 14:53:40 +0000 (16:53 +0200)]
ide-cd: use bdev_check_media_change

Switch to use bdev_check_media_change instead of check_disk_change and
call idecd_revalidate_disk manually.  Given that idecd_revalidate_disk
only re-reads the TOC, and we already do the same at probe time, the
extra call into ->revalidate_disk from bdev_disk_changed is not required
either, so stop wiring up the method.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>