Age | Commit message | Author
2021-06-29 | block: fix discard request merge [for-5.14/block-2021-06-29, for-5.14/block] | Ming Lei
ll_new_hw_segment() is reached only for a single-range discard merge, and there is actually no max discard segment size limit, so it is wrong to run the following check:

    if (req->nr_phys_segments + nr_phys_segs > blk_rq_get_max_segments(req))

It may always be false, since req->nr_phys_segments is initialized to one, the bio's segment count is still 1, and blk_rq_get_max_segments(req) is 1 too.

Fix the issue by not doing the check and bypassing the calculation of the discard request's nr_phys_segments.

Based on analysis from Wang Shanker.

Cc: Christoph Hellwig <hch@lst.de>
Reported-by: Wang Shanker <shankerwangmiao@gmail.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20210628023312.1903255-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-27 | block/mq-deadline: Remove a WARN_ON_ONCE() call | Bart Van Assche
The purpose of the WARN_ON_ONCE() statement in dd_insert_request() is to verify that dd_prepare_request() cleared rq->elv.priv[0]. Since dd_prepare_request() is called during request initialization but not if a request is requeued, a warning is triggered if a request is requeued. Fix this by removing the WARN_ON_ONCE() statement. This patch suppresses the following kernel warning:

    WARNING: CPU: 28 PID: 432 at block/mq-deadline-main.c:740 dd_insert_request+0x4d4/0x5b0
    Workqueue: kblockd blk_mq_requeue_work
    Call Trace:
     dd_insert_requests+0xfa/0x130
     blk_mq_sched_insert_request+0x22c/0x240
     blk_mq_requeue_work+0x21c/0x2d0
     process_one_work+0x4c2/0xa70
     worker_thread+0x2e5/0x6d0
     kthread+0x21c/0x250
     ret_from_fork+0x1f/0x30

Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
Fixes: 08a9ad8bf607 ("block/mq-deadline: Add cgroup support")
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210627211112.12720-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-25 | blk-mq: update hctx->dispatch_busy in case of real scheduler | Ming Lei
Commit 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io") starts to support io batching submission by using hctx->dispatch_busy. However, blk_mq_update_dispatch_busy() isn't changed to update hctx->dispatch_busy in that commit, so fix the issue by updating hctx->dispatch_busy in case of real scheduler. Reported-by: Jan Kara <jack@suse.cz> Reviewed-by: Jan Kara <jack@suse.cz> Fixes: 6e6fcbc27e77 ("blk-mq: support batching dispatch in case of io") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20210625020248.1630497-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24 | blk: Fix lock inversion between ioc lock and bfqd lock | Jan Kara
Lockdep complains about lock inversion between ioc->lock and bfqd->lock:

    bfqd -> ioc:
     put_io_context+0x33/0x90 -> ioc->lock grabbed
     blk_mq_free_request+0x51/0x140
     blk_put_request+0xe/0x10
     blk_attempt_req_merge+0x1d/0x30
     elv_attempt_insert_merge+0x56/0xa0
     blk_mq_sched_try_insert_merge+0x4b/0x60
     bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
     blk_mq_sched_insert_requests+0xd6/0x2b0
     blk_mq_flush_plug_list+0x154/0x280
     blk_finish_plug+0x40/0x60
     ext4_writepages+0x696/0x1320
     do_writepages+0x1c/0x80
     __filemap_fdatawrite_range+0xd7/0x120
     sync_file_range+0xac/0xf0

    ioc -> bfqd:
     bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
     put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
     exit_io_context+0x48/0x50
     do_exit+0x7e9/0xdd0
     do_group_exit+0x54/0xc0

To avoid this inversion, we change blk_mq_sched_try_insert_merge() to not free the merged request but rather leave that up to the caller, similarly to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure to free all the merged requests after dropping bfqd->lock.

Fixes: aee69d78dec0 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Acked-by: Paolo Valente <paolo.valente@linaro.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24 | bfq: Remove merged request already in bfq_requests_merged() | Jan Kara
Currently, bfq does very little in bfq_requests_merged() and handles all the request cleanup in bfq_finish_requeue_request() called from blk_mq_free_request(). That is currently safe only because blk_mq_free_request() is called shortly after bfq_requests_merged() while bfqd->lock is still held. However to fix a lock inversion between bfqd->lock and ioc->lock, we need to call blk_mq_free_request() after dropping bfqd->lock. That would mean that already merged request could be seen by other processes inside bfq queues and possibly dispatched to the device which is wrong. So move cleanup of the request from bfq_finish_requeue_request() to bfq_requests_merged(). Acked-by: Paolo Valente <paolo.valente@linaro.org> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20210623093634.27879-2-jack@suse.cz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24 | block: pass a gendisk to bdev_disk_changed | Christoph Hellwig
bdev_disk_changed can only operate on whole devices. Make that clear by passing a gendisk instead of the struct block_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210624123240.441814-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24 | block: move bdev_disk_changed | Christoph Hellwig
Move bdev_disk_changed to block/partitions/core.c, together with the rest of the partition scanning code. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210624123240.441814-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24 | block: add the events* attributes to disk_attrs | Christoph Hellwig
Add the events attributes to the disk_attrs array, which ensures they are added by the driver core when the device is created rather than adding them after the device has been added, which is racy versus uevents and requires more boilerplate code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210624073843.251178-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24 | block: move the disk events code to a separate file | Christoph Hellwig
Move the code for handling disk events from genhd.c into a new file as it isn't very related to the rest of the file while at the same time requiring lots of forward declarations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.de> Link: https://lore.kernel.org/r/20210624073843.251178-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-24 | block: fix trace completion for chained bio | Edward Hsieh
For a chained bio, trace_block_bio_complete in bio_endio is currently called only once, by the parent bio, after all chained bios have completed. However, the sector and size of the parent bio are modified in bio_split. Therefore, the size and sector of the complete events might not match the queue events in blktrace. The original fix of the bio completion trace, fbbaf700e7b1 ("block: trace completion of all bios."), wants multiple complete events to correspond to one queue event but missed this. The issue can be reproduced by an md/raid5 read with a bio that crosses chunks. To fix, move trace completion into the loop so that it is called for every chained bio. Fixes: fbbaf700e7b1 ("block: trace completion of all bios.") Reviewed-by: Wade Liang <wadel@synology.com> Reviewed-by: BingJing Chang <bingjingc@synology.com> Signed-off-by: Edward Hsieh <edwardh@synology.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210624123030.27014-1-edwardh@synology.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/partitions/msdos: Fix typo inidicator -> indicator | Thomas Bracht Laumann Jespersen
Just a fix for a small typo in msdos_partition(). Signed-off-by: Thomas Bracht Laumann Jespersen <t@laumann.xyz> Link: https://lore.kernel.org/r/20210619195130.19348-1-t@laumann.xyz Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block, bfq: reset waker pointer with shared queues | Paolo Valente
Commit 85686d0dc194 ("block, bfq: keep shared queues out of the waker mechanism") leaves shared bfq_queues out of the waker-detection mechanism. It attains this goal by not updating the pointer last_completed_rq_bfqq, if the last request completed belongs to a shared bfq_queue (so that the pointer will not point to the shared bfq_queue). Yet this has a side effect: the pointer last_completed_rq_bfqq keeps pointing, deceptively, to a bfq_queue that actually is not the last one to have had a request completed. As a consequence, such a bfq_queue may deceptively be considered as a waker of some bfq_queue, even of some shared bfq_queue. To address this issue, reset last_completed_rq_bfqq if the last request completed belongs to a shared queue. Fixes: 85686d0dc194 ("block, bfq: keep shared queues out of the waker mechanism") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-8-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block, bfq: check waker only for queues with no in-flight I/O | Paolo Valente
Consider two bfq_queues, say Q1 and Q2, with Q2 empty. If a request of Q1 gets completed shortly before a new request arrives for Q2, then BFQ flags Q1 as a candidate waker for Q2. Yet, the arrival of this new request may have a different cause, in the following case. If also Q2 has requests in flight while waiting for the arrival of a new request, then the completion of its own requests may be the actual cause of the awakening of the process that sends I/O to Q2. So Q1 may be flagged wrongly as a candidate waker. This commit avoids this deceptive flagging, by disabling candidate-waker flagging for Q2, if Q2 has in-flight I/O. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-7-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block, bfq: avoid delayed merge of async queues | Paolo Valente
Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created queues"), BFQ may schedule a merge between a newly created sync bfq_queue, say Q2, and the last sync bfq_queue created, say Q1. To this goal, BFQ stores the address of Q1 in the field bic->stable_merge_bfqq of the bic associated with Q2. So, when the time for the possible merge arrives, BFQ knows which bfq_queue to merge Q2 with. In particular, BFQ checks for possible merges on request arrivals. Yet the same bic may also be associated with an async bfq_queue, say Q3. So, if a request for Q3 arrives, then the above check may happen to be executed while the bfq_queue at hand is Q3, instead of Q2. In this case, Q1 happens to be merged with an async bfq_queue. This is not only a conceptual mistake, because async queues are to be kept out of queue merging, but also a bug that leads to inconsistent states. This commit simply filters async queues out of delayed merges. Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues") Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-6-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block, bfq: boost throughput by extending queue-merging times | Pietro Pedroni
One of the methods with which bfq boosts throughput is by merging queues. One of the merging variants in bfq is the stable merge. This mechanism is activated between two queues only if they are created within a certain maximum time T1 from each other. Merging can happen soon or be delayed. In the second case, before merging, bfq needs to evaluate a throughput-boost parameter that indicates whether the queue generates a high throughput if served alone. Merging occurs when this throughput-boost is not high enough. In particular, this parameter is evaluated and late merging may occur only after at least a time T2 from the creation of the queue. Currently T1 and T2 are set to 180ms and 200ms, respectively. With these values merging rarely occurs, because there is not enough time. This results in a noticeable lowering of the overall throughput with some workloads (see the example below). This commit introduces two constants, bfq_activation_stable_merging and bfq_late_stable_merging, in order to increase the duration of T1 and T2. Both the stable merging activation time and the late merging time are set to 600ms. This value has been experimentally evaluated using the sqlite benchmark in the Phoronix Test Suite on an HDD. The duration of the benchmark before this fix was 111.02s, while now it has reached 97.02s, a better result than that of all the other schedulers. Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com> Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-5-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block, bfq: consider also creation time in delayed stable merge | Paolo Valente
Since commit 430a67f9d616 ("block, bfq: merge bursts of newly-created queues"), BFQ may schedule a merge between a newly created sync bfq_queue and the last sync bfq_queue created. Such a merge is not performed immediately, because BFQ first needs to find out whether the newly created queue actually reaches a higher throughput if not merged at all (and in that case BFQ will not perform any stable merging). To check that, a little time must be waited after the creation of the new queue, so that some I/O can flow in the queue, and statistics on such I/O can be computed. Yet, to evaluate the above waiting time, the last split time is considered as start time, instead of the creation time of the queue. This is a mistake, because considering the split time is correct only in the following scenario. The queue undergoes a non-stable merge on the arrival of its very first I/O request, due to close I/O with some other queue. While the queue is merged for close I/O, stable merging is not considered. Yet the queue may then happen to be split, if the close I/O finishes (or happens to be a false positive). From this time on, the queue can again be considered for stable merging. But, again, a little time must elapse, to let some new I/O flow in the queue and to get updated statistics. To wait for this time, the split time is to be taken into account. Yet, if the queue does not undergo a non-stable merge on the arrival of its very first request, then BFQ immediately checks whether the stable merge is to be performed. This happens because the split time for a queue is initialized to minus infinity when the queue is created. This commit fixes this mistake by adding the missing condition. Now the check for delayed stable merge is performed only after a little time has elapsed not only from the last queue split time, but also from the creation time of the queue.
Fixes: 430a67f9d616 ("block, bfq: merge bursts of newly-created queues") Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-4-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block, bfq: fix delayed stable merge check | Luca Mariotti
When attempting to schedule a merge of a given bfq_queue with the currently in-service bfq_queue or with a cooperating bfq_queue among the scheduled bfq_queues, delayed stable merge is checked for rotational or non-queueing devs. For this stable merge to be performed, some conditions must be met. If the current bfq_queue underwent some split from some merged bfq_queue, one of these conditions is that two hundred milliseconds must elapse from split, otherwise this condition is always met. Unfortunately, by mistake, time_is_after_jiffies() was written instead of time_is_before_jiffies() for this check, verifying that less than two hundred milliseconds have elapsed instead of verifying that at least two hundred milliseconds have elapsed. Fix this issue by replacing time_is_after_jiffies() with time_is_before_jiffies(). Signed-off-by: Luca Mariotti <mariottiluca1@hotmail.it> Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com> Link: https://lore.kernel.org/r/20210619140948.98712-3-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
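The difference between the two helpers named in this commit can be illustrated in userspace. The following is only a sketch: the macros are stand-ins modelled on the kernel's wrap-safe jiffies comparisons, `jiffies` is a simulated counter, and both `wait_elapsed*()` functions are hypothetical names that do not appear in bfq.

```c
#include <stdbool.h>

/* Simulated tick counter; in the kernel this is the global jiffies. */
static unsigned long jiffies;

/* Userspace stand-ins for the kernel's wrap-safe time comparisons. */
#define time_after(a, b)          ((long)((b) - (a)) < 0)
#define time_before(a, b)         time_after(b, a)
#define time_is_before_jiffies(a) time_after(jiffies, a)
#define time_is_after_jiffies(a)  time_before(jiffies, a)

/* Hypothetical: true once more than `wait` ticks have passed since the split. */
static bool wait_elapsed(unsigned long split_time, unsigned long wait)
{
    return time_is_before_jiffies(split_time + wait);
}

/* Hypothetical: the inverted check, true only while the wait is still running. */
static bool wait_elapsed_buggy(unsigned long split_time, unsigned long wait)
{
    return time_is_after_jiffies(split_time + wait);
}
```

With a split at tick 900 and a wait of 200 ticks, the correct variant stays false at tick 1000 and becomes true at tick 1200, while the inverted variant does the opposite.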
2021-06-21 | block, bfq: let also stably merged queues enjoy weight raising | Paolo Valente
Merged bfq_queues are kept out of weight-raising (low-latency) mechanisms. The reason is that these queues are usually created for non-interactive and non-soft-real-time tasks. Yet this is not the case for stably-merged queues. These queues are merged just because they are created shortly after each other. So they may easily serve the I/O of an interactive or soft-real-time application, if the application happens to spawn multiple processes. To address this issue, this commit lets stably-merged queues enjoy weight raising as well. Signed-off-by: Paolo Valente <paolo.valente@linaro.org> Link: https://lore.kernel.org/r/20210619140948.98712-2-paolo.valente@linaro.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | blk-wbt: make sure throttle is enabled properly | Zhang Yi
After commit a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt"), if throttle was disabled by wbt_disable_default(), it could not be enabled again. Fix this by setting enable_state back to WBT_STATE_ON_DEFAULT. Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20210619093700.920393-3-yi.zhang@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | blk-wbt: introduce a new disable state to prevent false positive by rwb_enabled() | Zhang Yi
Currently we disable wbt by simply zeroing rwb->wb_normal in wbt_disable_default() when switching the elevator to bfq. This is not safe, because rwb_enabled() turns into a false positive if the queue depth is changed. If that happens between wbt_wait() and wbt_track() while submitting a write request, rqw->inflight drops to -1 in wbt_done(), which ends up triggering an I/O hang. Fix this issue by introducing a new state which means that wbt was disabled. Fixes: a79050434b45 ("blk-rq-qos: refactor out common elements of blk-wbt") Signed-off-by: Zhang Yi <yi.zhang@huawei.com> Link: https://lore.kernel.org/r/20210619093700.920393-2-yi.zhang@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
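The false positive described in this commit can be shown with a small model. This is a sketch under assumptions, not the blk-wbt code: every type and function name below is hypothetical, the state names only loosely mirror WBT_STATE_ON_DEFAULT, and `queue_depth_changed()` merely stands in for whatever recomputes the limits when the queue depth changes.

```c
#include <stdbool.h>

/* Hypothetical states; the last one plays the role of the new "disabled" state. */
enum wbt_state_sketch { STATE_ON_DEFAULT = 1, STATE_ON_MANUAL, STATE_OFF_DEFAULT };

struct rq_wb_sketch {
    enum wbt_state_sketch enable_state;
    unsigned int wb_normal;   /* zero used to double as "disabled" */
};

/* Old-style check: relies on wb_normal alone. */
static bool enabled_old(const struct rq_wb_sketch *rwb)
{
    return rwb->wb_normal != 0;
}

/* New-style check: an explicit state takes precedence over wb_normal. */
static bool enabled_new(const struct rq_wb_sketch *rwb)
{
    return rwb->enable_state != STATE_OFF_DEFAULT && rwb->wb_normal != 0;
}

/* Disable throttling: zero the limit and record the disabled state. */
static void disable_default(struct rq_wb_sketch *rwb)
{
    rwb->wb_normal = 0;
    rwb->enable_state = STATE_OFF_DEFAULT;
}

/* Stand-in for a queue depth change that recomputes the limit. */
static void queue_depth_changed(struct rq_wb_sketch *rwb, unsigned int depth)
{
    rwb->wb_normal = depth / 2;
}
```

After `disable_default()` followed by `queue_depth_changed()`, the old check reports the throttle as enabled again, whereas the state-based check keeps reporting it as disabled.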
2021-06-21 | block/mq-deadline: Prioritize high-priority requests | Bart Van Assche
While one or more requests with a certain I/O priority are pending, do not dispatch lower priority requests. Dispatch lower priority requests anyway after the "aging" time has expired.

This patch has been tested as follows:

    modprobe scsi_debug ndelay=1000000 max_queue=16 &&
    sd='' &&
    while [ -z "$sd" ]; do
      sd=/dev/$(basename /sys/bus/pseudo/drivers/scsi_debug/adapter*/host*/target*/*/block/*)
    done &&
    echo $((100*1000)) > /sys/block/$sd/queue/iosched/aging_expire &&
    cd /sys/fs/cgroup/blkio/ &&
    echo $$ >cgroup.procs &&
    echo restrict-to-be >blkio.prio.class &&
    mkdir -p hipri &&
    cd hipri &&
    echo none-to-rt >blkio.prio.class &&
    { max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/low-pri.txt & } &&
    echo $$ >cgroup.procs &&
    max-iops -a1 -d32 -j1 -e mq-deadline $sd >& ~/hi-pri.txt

Result:
* 11000 IOPS for the high-priority job
* 40 IOPS for the low-priority job

If the aging expiry time is changed from 100s into 0, the IOPS results change into 6712 and 6796 IOPS.

The max-iops script is a script that runs fio with the following arguments:

    --bs=4K --gtod_reduce=1 --ioengine=libaio --ioscheduler=${arg_e}
    --runtime=60 --norandommap --rw=read --thread --buffered=0
    --numjobs=${arg_j} --iodepth=${arg_d}
    --iodepth_batch_submit=${arg_a}
    --iodepth_batch_complete=$((arg_d / 2))
    --name=${positional_argument_1}
    --filename=${positional_argument_1}

Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-17-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
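The dispatch rule stated at the top of this commit message (strict priority, with aging as an escape hatch) could be modelled roughly as follows. This is a simplified sketch and not the mq-deadline implementation: `prio_queue`, `pick_class()` and the tie-breaking by oldest request are all assumptions made for illustration, and counter wraparound is ignored.

```c
#include <stdbool.h>

enum prio_class { PRIO_RT, PRIO_BE, PRIO_IDLE, PRIO_COUNT };

/* Hypothetical per-class summary of what is queued. */
struct prio_queue {
    bool pending;          /* at least one request is queued */
    unsigned long oldest;  /* enqueue time of the oldest queued request */
};

/*
 * Pick the class to dispatch from, or -1 if nothing is pending.
 * The highest-priority pending class wins unless some class holds a
 * request that has waited at least aging_expire ticks; among aged
 * classes, the one with the oldest request is chosen.
 */
static int pick_class(const struct prio_queue q[PRIO_COUNT],
                      unsigned long now, unsigned long aging_expire)
{
    int highest = -1;  /* first pending class in priority order */
    int aged = -1;     /* aged class holding the oldest request */

    for (int p = 0; p < PRIO_COUNT; p++) {
        if (!q[p].pending)
            continue;
        if (highest < 0)
            highest = p;
        if (now - q[p].oldest >= aging_expire &&
            (aged < 0 || q[p].oldest < q[aged].oldest))
            aged = p;
    }
    return aged >= 0 ? aged : highest;
}
```

In this model an aging limit of 0 makes every pending request count as aged, so classes are served in arrival order regardless of priority.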
2021-06-21 | block/mq-deadline: Add cgroup support | Bart Van Assche
Maintain statistics per cgroup and export these to user space. These statistics are essential for verifying whether the proper I/O priorities have been assigned to requests. An example of the statistics data with this patch applied:

    $ cat /sys/fs/cgroup/io.stat
    11:2 rbytes=0 wbytes=0 rios=3 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0
    8:32 rbytes=2142720 wbytes=0 rios=105 wios=0 dbytes=0 dios=0 [NONE] dispatched=0 inserted=0 merged=171 [RT] dispatched=0 inserted=0 merged=0 [BE] dispatched=0 inserted=0 merged=0 [IDLE] dispatched=0 inserted=0 merged=0

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-16-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/mq-deadline: Track I/O statistics | Bart Van Assche
Track I/O statistics per I/O priority and export these statistics to debugfs. These statistics help developers of the deadline scheduler. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-15-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/mq-deadline: Add I/O priority support | Bart Van Assche
Maintain one dispatch list and one FIFO list per I/O priority class: RT, BE and IDLE. Maintain statistics for each priority level. Split the debugfs attributes per priority level as follows:

    $ ls /sys/kernel/debug/block/.../sched/
    async_depth  dispatch2        read_next_rq      write2_fifo_list
    batching     read0_fifo_list  starved           write_next_rq
    dispatch0    read1_fifo_list  write0_fifo_list
    dispatch1    read2_fifo_list  write1_fifo_list

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-14-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/mq-deadline: Micro-optimize the batching algorithm | Bart Van Assche
When dispatching the first request of a batch, the deadline_move_request() call clears .next_rq[] for the opposite data direction. .next_rq[] is not restored when changing data direction. Fix this by not clearing .next_rq[] and by keeping track of the data direction of a batch in a variable instead.

This patch is a micro-optimization because:
- The number of deadline_next_request() calls for the read direction is halved.
- The number of times that deadline_next_request() returns NULL is reduced.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-13-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/mq-deadline: Reserve 25% of scheduler tags for synchronous requests | Bart Van Assche
For interactive workloads it is important that synchronous requests are not delayed. Hence reserve 25% of scheduler tags for synchronous requests. This patch still allows asynchronous requests to fill the hardware queues, since blk_mq_init_sched() makes sure that the number of scheduler requests is double the hardware queue depth. From blk_mq_init_sched():

    q->nr_requests = 2 * min_t(unsigned int, q->tag_set->queue_depth,
                               BLKDEV_MAX_RQ);

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-12-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
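The arithmetic behind this argument can be checked with a short sketch. The first function mirrors the expression quoted from blk_mq_init_sched(); the value 128 for BLKDEV_MAX_RQ and the helper `async_tag_limit()` are assumptions made for this example, not code taken from the patch.

```c
#include <assert.h>

#define BLKDEV_MAX_RQ 128u  /* assumed default; check the kernel headers */

static unsigned int min_uint(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Mirrors the nr_requests expression quoted from blk_mq_init_sched(). */
static unsigned int sched_nr_requests(unsigned int hw_queue_depth)
{
    return 2 * min_uint(hw_queue_depth, BLKDEV_MAX_RQ);
}

/* Hypothetical: tags left for async requests when 25% are reserved for sync. */
static unsigned int async_tag_limit(unsigned int nr_requests)
{
    unsigned int limit;

    assert(nr_requests > 0);
    limit = 3 * nr_requests / 4;
    return limit ? limit : 1;  /* never drop to zero usable tags */
}
```

For a hardware queue depth of 32 this yields 64 scheduler tags, of which 48 remain usable by asynchronous requests, which is still more than the 32 hardware tags.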
2021-06-21 | block/mq-deadline: Improve the sysfs show and store macros | Bart Van Assche
Define separate macros for integers and jiffies to improve readability. Use sysfs_emit() and kstrtoint() instead of sprintf() and simple_strtol(). The former functions are the recommended functions. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-11-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/mq-deadline: Improve compile-time argument checking | Bart Van Assche
Modern compilers complain if an out-of-range value is passed to a function argument that has an enumeration type. Let the compiler detect out-of-range data direction arguments instead of verifying the data_dir argument at runtime. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-10-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/mq-deadline: Rename dd_init_queue() and dd_exit_queue() | Bart Van Assche
Change "queue" into "sched" to make the function names reflect better the purpose of these functions. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-9-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/mq-deadline: Remove two local variables | Bart Van Assche
Make __dd_dispatch_request() easier to read by removing two local variables. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-8-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/mq-deadline: Add two lockdep_assert_held() statements | Bart Van Assche
Document the locking strategy by adding two lockdep_assert_held() statements. Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-7-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/mq-deadline: Add several comments | Bart Van Assche
Make the code easier to read by adding more comments. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Ming Lei <ming.lei@redhat.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20210618004456.7280-6-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block: Introduce the ioprio rq-qos policy | Bart Van Assche
Introduce an rq-qos policy that assigns an I/O priority to requests based on blk-cgroup configuration settings. This policy has the following advantages over the ioprio_set() system call:
- This policy is cgroup based so it has all the advantages of cgroups.
- While ioprio_set() does not affect page cache writeback I/O, this rq-qos controller affects page cache writeback I/O for filesystems that support associating a cgroup with writeback I/O. See also Documentation/admin-guide/cgroup-v2.rst.

Cc: Damien Le Moal <damien.lemoal@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Cc: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20210618004456.7280-5-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/blk-rq-qos: Move a function from a header file into a C file | Bart Van Assche
rq_qos_id_to_name() is only used in blk-mq-debugfs.c, so move that function into blk-mq-debugfs.c. Cc: Damien Le Moal <damien.lemoal@wdc.com> Cc: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20210618004456.7280-4-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/blk-cgroup: Swap the blk_throtl_init() and blk_iolatency_init() calls | Bart Van Assche
Before adding more calls in this function, simplify the error path. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20210618004456.7280-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-21 | block/Kconfig: Make the BLK_WBT and BLK_WBT_MQ entries consecutive | Bart Van Assche
These entries were consecutive at the time of their introduction but are no longer consecutive. Make them consecutive again. Additionally, modify the help text since it refers to blk-mq and since the legacy block layer has been removed. Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Cc: Christoph Hellwig <hch@lst.de> Cc: Ming Lei <ming.lei@redhat.com> Cc: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Link: https://lore.kernel.org/r/20210618004456.7280-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 blk-wbt: remove outdated comment (lijiazi)
Now wbt_wait() returns void, so remove now outdated comment. Signed-off-by: lijiazi <lijiazi@xiaomi.com> Link: https://lore.kernel.org/r/1623986240-13878-1-git-send-email-lijiazi@xiaomi.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-18 blk-mq: fix an IS_ERR() vs NULL bug (Dan Carpenter)
The __blk_mq_alloc_disk() function doesn't return NULL; it returns error pointers. Fixes: b461dfc49eb6 ("blk-mq: add the blk_mq_alloc_disk APIs") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/YMyjci35WBqrtqG+@mwanda Signed-off-by: Jens Axboe <axboe@kernel.dk>
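The difference between a NULL check and an error-pointer check can be illustrated with a small user-space sketch. Everything prefixed with demo_ is hypothetical and only imitates the kernel's ERR_PTR()/PTR_ERR()/IS_ERR() helpers; it is not the kernel implementation and not the code touched by this commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* User-space stand-ins for the kernel's error-pointer helpers (assumption). */
#define DEMO_MAX_ERRNO 4095

static inline void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

static inline long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

static inline int demo_is_err(const void *ptr)
{
	/* Error pointers occupy the last DEMO_MAX_ERRNO addresses. */
	return (uintptr_t)ptr >= (uintptr_t)(-DEMO_MAX_ERRNO);
}

struct demo_disk {
	int id;
};

/* Hypothetical allocator that reports failure via an error pointer, never NULL. */
static struct demo_disk *demo_alloc_disk(int fail)
{
	struct demo_disk *disk;

	if (fail)
		return demo_err_ptr(-ENOMEM);
	disk = malloc(sizeof(*disk));
	if (!disk)
		return demo_err_ptr(-ENOMEM);
	disk->id = 1;
	return disk;
}

/* Buggy caller: a NULL check never fires, so the failure goes unnoticed. */
static int demo_probe_buggy(int fail)
{
	struct demo_disk *disk = demo_alloc_disk(fail);

	if (!disk)
		return -ENOMEM;
	/* A real caller would dereference the error pointer at this point. */
	if (!demo_is_err(disk))
		free(disk);
	return 0;
}

/* Fixed caller: test for an error pointer and propagate the errno. */
static int demo_probe_fixed(int fail)
{
	struct demo_disk *disk = demo_alloc_disk(fail);

	if (demo_is_err(disk))
		return (int)demo_ptr_err(disk);
	free(disk);
	return 0;
}
```

With a failing allocation, demo_probe_buggy() reports success while demo_probe_fixed() returns -ENOMEM, which is the class of bug described above.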
2021-06-18 block: Remove unnecessary elevator operation checks (Damien Le Moal)
The insert_requests and dispatch_request elevator operations are mandatory for the correct execution of an elevator, and all implemented elevators (bfq, kyber and mq-deadline) implement them. As a result, there is no need to check for these operations before calling them when a queue has an elevator set. This simplifies the code in __blk_mq_sched_dispatch_requests() and blk_mq_sched_insert_request(). To prevent out-of-tree elevators from crashing the kernel in case of a bad implementation, add a check in elv_register() to verify that these operations are implemented. A small, probably not significant, IOPS improvement of 0.1% is observed with this patch applied (4.117 MIOPS to 4.123 MIOPS, average of 20 fio runs doing 4K random direct reads with psync and 32 jobs). Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20210618015922.713999-1-damien.lemoal@wdc.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
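The idea of validating mandatory callbacks once at registration time, instead of on every call, can be sketched in user space as follows. The demo_ types and functions are hypothetical simplifications and do not match the kernel's struct elevator_type or the actual elv_register() code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct demo_request {
	int tag;
};

/* Hypothetical, reduced set of elevator callbacks. */
struct demo_elevator_ops {
	void (*insert_requests)(struct demo_request *rq);
	struct demo_request *(*dispatch_request)(void);
};

struct demo_elevator_type {
	const char *name;
	struct demo_elevator_ops ops;
};

/*
 * Reject elevators that lack a mandatory operation. Once registration
 * succeeded, the hot path may call both callbacks unconditionally.
 */
static int demo_elv_register(const struct demo_elevator_type *e)
{
	if (!e->ops.insert_requests || !e->ops.dispatch_request)
		return -EINVAL;
	return 0;
}

/* Trivial callbacks used only to build a complete elevator for the example. */
static void demo_insert(struct demo_request *rq)
{
	(void)rq;
}

static struct demo_request *demo_dispatch(void)
{
	return NULL;
}
```

An elevator that fills in both callbacks registers successfully; one that omits dispatch_request is rejected with -EINVAL before it can ever be attached to a queue.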
2021-06-18 blk-mq: fix use-after-free in blk_mq_exit_sched (Ming Lei)
tagset can't be used after blk_cleanup_queue() has returned because freeing tagset usually follows blk_cleanup_queue(). Commit d97e594c5166 ("blk-mq: Use request queue-wide tags for tagset-wide sbitmap") adds a check on q->tag_set->flags in blk_mq_exit_sched(), and causes a use-after-free. Fix it by using hctx->flags. Reported-by: syzbot+77ba3d171a25c56756ea@syzkaller.appspotmail.com Fixes: d97e594c5166 ("blk-mq: Use request queue-wide tags for tagset-wide sbitmap") Cc: John Garry <john.garry@huawei.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: John Garry <john.garry@huawei.com> Reviewed-by: John Garry <john.garry@huawei.com> Link: https://lore.kernel.org/r/20210609063046.122843-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16 docs/cgroup-v1/blkio: update for 5.x kernels (Kir Kolyshkin)
Commit bf382fb0bcef4 ("block: remove legacy IO schedulers", Oct 12 2018) removed the CFQ scheduler, together with blkio.weight and blkio.weight_device described in cgroup v1 documentation. Users are supposed to use the BFQ scheduler, whose cgroup file for setting weight is blkio.bfq.weight, but there was no way to set per-device weight. Later, commit 795fe54c2a8 added per-device weights for BFQ, meaning that blkio.bfq.weight and blkio.bfq.weight_device can be used in a way similar to the old CFQ cgroup interface. Yet, the cgroup v1 docs were never updated. Fix this: - use the new file names; - fix the range for weight (used to be 10..1000, now 1..1000); - link to BFQ scheduler docs. Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16 docs/cgroup-v1/blkio: stop abusing itemized list (Kir Kolyshkin)
Fix many formatting issues by no longer (ab)using itemized lists for everything (mostly replaced by definition lists). Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16 docs: block/bfq: describe per-device weight (Kir Kolyshkin)
The functionality of setting per-device weight for BFQ was added in v5.4 (commit 795fe54c2a828099), but the documentation was never updated. While at it, improve formatting a bit. Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com> Link: https://lore.kernel.org/r/20210614214109.207430-1-kolyshkin@gmail.com Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16 block: mark queue init done at the end of blk_register_queue (Ming Lei)
Mark queue init done only once everything in blk_register_queue() has completed successfully, so that wbt_enable_default() can run quickly without any RCU grace period involved, since adding rq qos requires freezing the queue. Delaying the marking of queue init done has no side effects. Reported-by: Yi Zhang <yi.zhang@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Yi Zhang <yi.zhang@redhat.com> Link: https://lore.kernel.org/r/20210609015822.103433-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16 block: fix race between adding/removing rq qos and normal IO (Ming Lei)
Yi reported several kernel panics: [16687.001777] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 ... [16687.163549] pc : __rq_qos_track+0x38/0x60 or [ 997.690455] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000020 ... [ 997.850347] pc : __rq_qos_done+0x2c/0x50 It turns out that they are caused by a race between adding rq qos (wbt) and normal IO, because rq_qos_add() can run while IO is being submitted. Fix this issue by freezing the queue before adding rq qos to or deleting it from the queue. rq_qos_exit() needn't freeze the queue because it is called after the queue has been frozen. iolatency calls rq_qos_add() while allocating the queue, so freezing won't add delay because the queue usage refcount works in atomic mode at that time. iocost calls rq_qos_add() when writing the cgroup attribute file; it is fine to freeze the queue at that time since we usually freeze the queue when storing to a queue sysfs attribute, and iocost only exists on the root cgroup. wbt_init calls it in blk_register_queue() and in the queue sysfs attribute store (queue_wb_lat_store(), when it is written for the first time in case of !BLK_WBT_MQ); the following patch will speed up the queue freezing in wbt_init. Reported-by: Yi Zhang <yi.zhang@redhat.com> Cc: Bart Van Assche <bvanassche@acm.org> Signed-off-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Tested-by: Yi Zhang <yi.zhang@redhat.com> Link: https://lore.kernel.org/r/20210609015822.103433-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
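The ordering that this fix relies on (freeze the queue, modify the rq qos list, unfreeze) can be modelled with a single-threaded user-space sketch. All demo_ names are hypothetical, and the model deliberately simplifies: where the kernel sleeps until in-flight IO has drained, this sketch returns -EBUSY instead.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_rq_qos {
	int id;
	struct demo_rq_qos *next;
};

struct demo_queue {
	int freeze_depth;            /* > 0 while the queue is frozen */
	int in_flight;               /* IO currently inside the queue */
	struct demo_rq_qos *rq_qos;  /* list walked by the IO path */
};

/* IO submission path: may only enter the queue while it is not frozen. */
static bool demo_queue_enter(struct demo_queue *q)
{
	if (q->freeze_depth)
		return false;
	q->in_flight++;
	return true;
}

static void demo_queue_exit(struct demo_queue *q)
{
	q->in_flight--;
}

/* Simplification: fail instead of waiting for in-flight IO to drain. */
static int demo_freeze_queue(struct demo_queue *q)
{
	if (q->in_flight)
		return -EBUSY;
	q->freeze_depth++;
	return 0;
}

static void demo_unfreeze_queue(struct demo_queue *q)
{
	q->freeze_depth--;
}

/* Link a policy into the list only while no IO can walk that list. */
static int demo_rq_qos_add(struct demo_queue *q, struct demo_rq_qos *rqos)
{
	int ret = demo_freeze_queue(q);

	if (ret)
		return ret;
	rqos->next = q->rq_qos;
	q->rq_qos = rqos;
	demo_unfreeze_queue(q);
	return 0;
}
```

In this model the list is never modified while a request is in flight, which is the property the freeze provides; the real code additionally has to deal with concurrency, which the sketch does not attempt.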
2021-06-16 loop: fix order of cleaning up the queue and freeing the tagset (Christoph Hellwig)
We must release the queue before freeing the tagset. Fixes: 1c99502fae35 ("loop: use blk_mq_alloc_disk and blk_cleanup_disk") Reported-by: Bruno Goncalves <bgoncalv@redhat.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-16 mtd_blkdevs: initialze new->rq in add_mtd_blktrans_dev (Christoph Hellwig)
Various places expect the request_queue in ->rq. Initialize it to avoid NULL pointer dereferences. Fixes: 6966bb921def ("mtd_blkdevs: use blk_mq_alloc_disk") Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Marek Szyprowski <m.szyprowski@samsung.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-11 z2ram: use blk_mq_alloc_disk and blk_cleanup_disk (Christoph Hellwig)
Use blk_mq_alloc_disk and blk_cleanup_disk to simplify the gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210602065345.355274-31-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-11 ataflop: use blk_mq_alloc_disk and blk_cleanup_disk (Christoph Hellwig)
Use blk_mq_alloc_disk and blk_cleanup_disk to simplify the gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210602065345.355274-30-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2021-06-11 amiflop: use blk_mq_alloc_disk and blk_cleanup_disk (Christoph Hellwig)
Use blk_mq_alloc_disk and blk_cleanup_disk to simplify the gendisk and request_queue allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Link: https://lore.kernel.org/r/20210602065345.355274-29-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>