Coly Li [Tue, 19 Jul 2022 04:27:24 +0000 (12:27 +0800)]
bcache: remove EXPERIMENTAL for Kconfig option 'Asynchronous device registration'
The "Asynchronous device registration (EXPERIMENTAL)" Kconfig option is
for 2+ years, it is used when registration takes too much time for
massive amount of cached data, to avoid udev task timeout during boot
time.
Many users and products enable this Kconfig option for quite long time
(e.g. SUSE Linux) and it works as expected and no issue reported.
It is time to remove the "EXPERIMENTAL" tag from this Kconfig item.
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220719042724.8498-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yu Kuai [Sat, 23 Jul 2022 08:24:27 +0000 (16:24 +0800)]
nbd: add missing definition of pr_fmt
commit
1243172d5894 ("nbd: use pr_err to output error message") tries
to define pr_fmt and use short pr_err() to output error message,
however, the definition is missed.
This patch also remove existing "nbd:" inside pr_err().
Fixes:
1243172d5894 ("nbd: use pr_err to output error message")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Link: https://lore.kernel.org/r/20220723082427.3890655-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Dan Carpenter [Fri, 15 Jul 2022 08:12:14 +0000 (11:12 +0300)]
null_blk: fix ida error handling in null_add_dev()
There needs to be some error checking if ida_simple_get() fails.
Also call ida_free() if there are errors later.
Fixes:
94bc02e30fb8 ("nullb: use ida to manage index")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/YtEhXsr6vJeoiYhd@kili
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 15 Jul 2022 14:28:47 +0000 (22:28 +0800)]
null_blk: cleanup null_init_tag_set
The passed 'nullb' can be NULL, so cause null ptr reference.
Fix the issue, meantime cleanup null_init_tag_set for avoiding to add
similar issue in future.
Meantime set BLK_MQ_F_NO_SCHED if g_no_sched is true in case of NULL
device, same with BLK_MQ_F_TAG_HCTX_SHARED.
Cc: Vincent Fu <vincent.fu@samsung.com>
Fixes:
37ae152c7a0d ("null_blk: add configfs variables for 2 options")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220715142847.188275-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 14 Jul 2022 16:29:02 +0000 (10:29 -0600)]
Merge tag 'nvme-5.20-2022-07-14' of git://git.infradead.org/nvme into for-5.20/drivers
Pull NVMe updates from Christoph:
"nvme updates for Linux 5.20
- add support for In-Band authentication (Hannes Reinecke)
- handle the persistent internal error AER (Michael Kelley)
- use in-capsule data for TCP I/O queue connect (Caleb Sander)
- remove timeout for getting RDMA-CM established event (Israel Rukshin)
- misc cleanups (Joel Granados, Sagi Grimberg, Chaitanya Kulkarni,
Guixin Liu, Xiang wangx)"
* tag 'nvme-5.20-2022-07-14' of git://git.infradead.org/nvme: (21 commits)
nvme-multipath: refactor nvme_mpath_add_disk
nvme-apple: use nvme core helper to cancel requests in tagset
nvme-pci: use nvme core helper to cancel requests in tagset
nvme-tcp: use in-capsule data for I/O connect
nvme-rdma: remove timeout for getting RDMA-CM established event
nvmet-auth: expire authentication sessions
nvmet-auth: Diffie-Hellman key exchange support
nvmet: implement basic In-Band Authentication
nvmet: parse fabrics commands on io queues
nvme-auth: Diffie-Hellman key exchange support
nvme: implement In-Band authentication
nvme-fabrics: decode 'authentication required' connect error
nvme: add definitions for NVMe In-Band authentication
lib/base64: RFC4648-compliant base64 encoding
crypto: add crypto_has_kpp()
crypto: add crypto_has_shash()
nvme-loop: use nvme core helpers to cancel all requests in a tagset
nvme: fix qid param blk_mq_alloc_request_hctx
nvme: remove unused timeout parameter
nvme: handle the persistent internal error AER
...
Joel Granados [Tue, 28 Jun 2022 19:10:15 +0000 (21:10 +0200)]
nvme-multipath: refactor nvme_mpath_add_disk
Pass anagrpid as second argument. This is prep patch that allows reusing
this function for supporting unknown command sets.
Signed-off-by: Joel Granados <j.granados@samsung.com>
Signed-off-by: Kanchan Joshi <joshi.k@samsung.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Guixin Liu [Fri, 8 Jul 2022 03:06:05 +0000 (11:06 +0800)]
nvme-apple: use nvme core helper to cancel requests in tagset
Use nvme core helper nvme_cancel_tagset and nvme_cancel_admin_tagset
instead of same logic code.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ruozhu Li <liruozhu@huawei.com>
Reviewed-by: Sven Peter <sven@svenpeter.dev>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Guixin Liu [Fri, 8 Jul 2022 03:04:37 +0000 (11:04 +0800)]
nvme-pci: use nvme core helper to cancel requests in tagset
Use nvme core helper nvme_cancel_tagset and nvme_cancel_admin_tagset
instead of same logic code.
Signed-off-by: Guixin Liu <kanie@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Ruozhu Li <liruozhu@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Caleb Sander [Thu, 7 Jul 2022 21:12:45 +0000 (15:12 -0600)]
nvme-tcp: use in-capsule data for I/O connect
Currently, command data is only sent in-capsule on the for admin or I/O
commands on queues that indicate support for it. Send fabrics command
data in-capsule for I/O queues as well to avoid needing a separate
H2CData PDU for the connect command.
This is optimization. Without this change, we send the connect command
capsule and data in separate PDUs (CapsuleCmd and H2CData), and must wait
for the controller to respond with an R2T PDU before sending the H2CData.
With the change, we send a single CapsuleCmd PDU that includes the data.
This reduces the number of bytes (and likely packets) sent across the network,
and simplifies the send state machine handling in the driver.
Signed-off-by: Caleb Sander <csander@purestorage.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Israel Rukshin [Sun, 15 May 2022 15:04:40 +0000 (18:04 +0300)]
nvme-rdma: remove timeout for getting RDMA-CM established event
In case many controllers start error recovery at the same time (i.e.,
when port is down and up), they may never succeed to reconnect again.
This is because the target can't handle all the connect requests at
three seconds (the arbitrary value set today). Even if some of the
connections are established, when a single queue fails to connect,
all the controller's queues are destroyed as well. So, on the
following reconnection attempts the number of connect requests may
remain the same. To fix this, remove the timeout and wait for RDMA-CM
event to abort/complete the connect request. RDMA-CM sends unreachable
event when a timeout of ~90 seconds is expired. This approach is used
at other RDMA-CM users like SRP and iSER at blocking mode. The commit
also renames NVME_RDMA_CONNECT_TIMEOUT_MS to NVME_RDMA_CM_TIMEOUT_MS.
Signed-off-by: Israel Rukshin <israelr@nvidia.com>
Reviewed-by: Max Gurtovoy <mgurtovoy@nvidia.com>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Vincent Fu [Fri, 8 Jul 2022 17:49:49 +0000 (17:49 +0000)]
null_blk: add configfs variables for 2 options
Allow setting via configfs these two options:
no_sched
shared_tag_bitmap
Previously these could only be activated as module parameters.
Still missing are:
shared_tags
timeout
requeue
init_hctx
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220708174943.87787-3-vincent.fu@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Vincent Fu [Fri, 8 Jul 2022 17:49:49 +0000 (17:49 +0000)]
null_blk: add module parameters for 4 options
Add as module parameters these options:
memory_backed
discard
mbps
cache_size
Previously these could only be set via configfs.
Still missing is bad_blocks.
The kernel test robot found a documentation formatting issue in v1 of
this patch.
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Vincent Fu <vincent.fu@samsung.com>
Link: https://lore.kernel.org/r/20220708174943.87787-2-vincent.fu@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Md Haris Iqbal [Thu, 7 Jul 2022 14:31:22 +0000 (16:31 +0200)]
block/rnbd-srv: Replace sess_dev_list with index_idr
The structure rnbd_srv_session maintains a list and an xarray of
rnbd_srv_dev. There is no need to keep both as one of them can serve the
purpose.
Since one of the places where the lookup of rnbd_srv_dev using
rnbd_srv_session is IO path, an xarray would serve us better than a list
traversal. Hence remove sess_dev_list from rnbd_srv_session, and replace
its uses from xarray.
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Reviewed-by: Aleksei Marov <aleksei.marov@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20220707143122.460362-3-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Md Haris Iqbal [Thu, 7 Jul 2022 14:31:21 +0000 (16:31 +0200)]
block/rnbd-srv: Set keep_id to true after mutex_trylock
After setting keep_id if the mutex trylock fails, the keep_id stays set
for the rest of the sess_dev lifetime.
Therefore, set keep_id to true after mutex_trylock succeeds, so that a
failure of trylock does'nt touch keep_id.
Fixes:
b168e1d85cf3 ("block/rnbd-srv: Prevent a deadlock generated by accessing sysfs in parallel")
Cc: gi-oh.kim@ionos.com
Signed-off-by: Md Haris Iqbal <haris.iqbal@ionos.com>
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Link: https://lore.kernel.org/r/20220707143122.460362-2-haris.iqbal@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:07 +0000 (11:52 +0200)]
nvmet-auth: expire authentication sessions
Each authentication step is required to be completed within the
KATO interval (or two minutes if not set). So add a workqueue function
to reset the transaction ID and the expected next protocol step;
this will automatically the next authentication command referring
to the terminated authentication.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:06 +0000 (11:52 +0200)]
nvmet-auth: Diffie-Hellman key exchange support
Implement Diffie-Hellman key exchange using FFDHE groups for NVMe
In-Band Authentication.
This patch adds a new host configfs attribute 'dhchap_dhgroup' to
select the FFDHE group to use.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:05 +0000 (11:52 +0200)]
nvmet: implement basic In-Band Authentication
Implement NVMe-oF In-Band authentication according to NVMe TPAR 8006.
This patch adds three additional configfs entries 'dhchap_key',
'dhchap_ctrl_key', and 'dhchap_hash' to the 'host' configfs directory.
The 'dhchap_key' and 'dhchap_ctrl_key' entries need to be in the ASCII
format as specified in NVMe Base Specification v2.0 section 8.13.5.8
'Secret representation'.
'dhchap_hash' defaults to 'hmac(sha256)', and can be written to to
switch to a different HMAC algorithm.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:04 +0000 (11:52 +0200)]
nvmet: parse fabrics commands on io queues
Some fabrics commands can be sent via io queues, so add a new
function nvmet_parse_fabrics_io_cmd() and rename the existing
nvmet_parse_fabrics_cmd() to nvmet_parse_fabrics_admin_cmd().
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:03 +0000 (11:52 +0200)]
nvme-auth: Diffie-Hellman key exchange support
Implement Diffie-Hellman key exchange using FFDHE groups
for NVMe In-Band Authentication.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:02 +0000 (11:52 +0200)]
nvme: implement In-Band authentication
Implement NVMe-oF In-Band authentication according to NVMe TPAR 8006.
This patch adds two new fabric options 'dhchap_secret' to specify the
pre-shared key (in ASCII respresentation according to NVMe 2.0 section
8.13.5.8 'Secret representation') and 'dhchap_ctrl_secret' to specify
the pre-shared controller key for bi-directional authentication of both
the host and the controller.
Re-authentication can be triggered by writing the PSK into the new
controller sysfs attribute 'dhchap_secret' or 'dhchap_ctrl_secret'.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:01 +0000 (11:52 +0200)]
nvme-fabrics: decode 'authentication required' connect error
The 'connect' command might fail with NVME_SC_AUTH_REQUIRED, so we
should be decoding this error, too.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 27 Jun 2022 09:52:00 +0000 (11:52 +0200)]
nvme: add definitions for NVMe In-Band authentication
Add new definitions for NVMe In-band authentication as defined in
the NVMe Base Specification v2.0.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 27 Jun 2022 09:51:59 +0000 (11:51 +0200)]
lib/base64: RFC4648-compliant base64 encoding
Add RFC4648-compliant base64 encoding and decoding routines, based on
the base64url encoding in fs/crypto/fname.c.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 27 Jun 2022 09:51:58 +0000 (11:51 +0200)]
crypto: add crypto_has_kpp()
Add helper function to determine if a given key-agreement protocol
primitive is supported.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 27 Jun 2022 09:51:57 +0000 (11:51 +0200)]
crypto: add crypto_has_shash()
Add helper function to determine if a given synchronous hash is supported.
Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Sun, 26 Jun 2022 14:06:00 +0000 (17:06 +0300)]
nvme-loop: use nvme core helpers to cancel all requests in a tagset
A helper now exist, no need to open-code the same functionality.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Tue, 7 Jun 2022 01:16:43 +0000 (18:16 -0700)]
nvme: fix qid param blk_mq_alloc_request_hctx
Only caller of the __nvme_submit_sync_cmd() with qid value not equal to
NVME_QID_ANY is nvmf_connect_io_queues(), where qid value is alway set
to > 0.
[1] __nvme_submit_sync_cmd() callers with qid parameter from :-
Caller | qid parameter
------------------------------------------------------
* nvme_fc_connect_io_queues() |
nvmf_connect_io_queue() | qid > 0
* nvme_rdma_start_io_queues() |
nvme_rdma_start_queue() |
nvmf_connect_io_queues() | qid > 0
* nvme_tcp_start_io_queues() |
nvme_tcp_start_queue() |
nvmf_connect_io_queues() | qid > 0
* nvme_loop_connect_io_queues() |
nvmf_connect_io_queues() | qid > 0
When qid value of the function parameter __nvme_submit_sync_cmd() is > 0
from above callers, we use blk_mq_alloc_request_hctx(), where we pass
last parameter as 0 if qid functional parameter value is set to 0 with
conditional operators, see 1002 :-
991 int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
992 union nvme_result *result, void *buffer, unsigned bufflen,
993 int qid, int at_head, blk_mq_req_flags_t flags)
994 {
995 struct request *req;
996 int ret;
997
998 if (qid == NVME_QID_ANY)
999 req = blk_mq_alloc_request(q, nvme_req_op(cmd), flags);
1000 else
1001 req = blk_mq_alloc_request_hctx(q, nvme_req_op(cmd), flags,
1002 qid ? qid - 1 : 0);
1003
But qid function parameter value of the __nvme_submit_sync_cmd() will
never be 0 from above caller list see [1], and all the other callers of
__nvme_submit_sync_cmd() use NVME_QID_ANY as qid value :-
1. nvme_submit_sync_cmd()
2. nvme_features()
3. nvme_sec_submit()
4. nvmf_reg_read32()
5. nvmf_reg_read64()
6. nvmf_ref_write32()
7. nvmf_connect_admin_queue()
Remove the conditional operator to pass the qid as 0 in the call to
blk_mq_alloc_requst_hctx().
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Tue, 7 Jun 2022 01:16:42 +0000 (18:16 -0700)]
nvme: remove unused timeout parameter
The function __nvme_submit_sync_cmd() has following list of callers
that sets the timeout value to 0 :-
Callers | Timeout value
------------------------------------------------
nvme_submit_sync_cmd() | 0
nvme_features() | 0
nvme_sec_submit() | 0
nvmf_reg_read32() | 0
nvmf_reg_read64() | 0
nvmf_reg_write32() | 0
nvmf_connect_admin_queue() | 0
nvmf_connect_io_queue() | 0
Remove the timeout function parameter from __nvme_submit_sync_cmd() and
adjust the rest of code accordingly.
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Michael Kelley [Wed, 8 Jun 2022 18:52:21 +0000 (11:52 -0700)]
nvme: handle the persistent internal error AER
In the NVM Express Revision 1.4 spec, Figure 145 describes possible
values for an AER with event type "Error" (value 000b). For a
Persistent Internal Error (value 03h), the host should perform a
controller reset.
Add support for this error using code that already exists for
doing a controller reset. As part of this support, introduce
two utility functions for parsing the AER type and subtype.
This new support was tested in a lab environment where we can
generate the persistent internal error on demand, and observe
both the Linux side and NVMe controller side to see that the
controller reset has been done.
Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Xiang wangx [Sat, 4 Jun 2022 14:32:54 +0000 (22:32 +0800)]
nvme: remove a double word in a comment
Delete the redundant word 'be'.
Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Guoqing Jiang [Wed, 6 Jul 2022 13:31:52 +0000 (21:31 +0800)]
rnbd-clt: make rnbd_clt_change_capacity return void
No need to checking the return value, make it return void.
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-9-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoqing Jiang [Wed, 6 Jul 2022 13:31:51 +0000 (21:31 +0800)]
rnbd-clt: pass sector_t type for resize capacity
Let's change the parameter type to 'sector_t' then we don't need to cast
it from rnbd_clt_resize_dev_store, and update rnbd_clt_resize_disk too.
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-8-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoqing Jiang [Wed, 6 Jul 2022 13:31:50 +0000 (21:31 +0800)]
rnbd-clt: check capacity inside rnbd_clt_change_capacity
Currently, process_msg_open_rsp checks if capacity changed or not before
call rnbd_clt_change_capacity while the checking also make sense for
rnbd_clt_resize_dev_store, let's move the checking into the function.
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-7-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoqing Jiang [Wed, 6 Jul 2022 13:31:49 +0000 (21:31 +0800)]
rnbd-clt: adjust the layout of struct rnbd_clt_dev
While at it, let re-arrange the struct to remove holes.
Before, pahole reports
/* size: 232, cachelines: 4, members: 17 */
/* sum members: 224, holes: 2, sum holes: 8 */
/* last cacheline: 40 bytes */
After the change, the report changes to
/* size: 224, cachelines: 4, members: 17 */
/* last cacheline: 32 bytes */
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-6-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoqing Jiang [Wed, 6 Jul 2022 13:31:48 +0000 (21:31 +0800)]
rnbd-clt: reduce the size of struct rnbd_clt_dev
Previously, both map and remap trigger rnbd_clt_set_dev_attr to set
some members in rnbd_clt_dev such as wc, fua and logical_block_size
etc, but those members are only useful for map scenario given the
setup_request_queue is only called from the path:
rnbd_clt_map_device -> rnbd_client_setup_device
Since rnbd_clt_map_device frees rsp after rnbd_client_setup_device,
we can pass rsp to rnbd_client_setup_device and it's callees, which
means queue's attributes can be set directly from relevant members
of rsp instead from rnbd_clt_dev.
After that, we can kill 11 members from rnbd_clt_dev, and we don't
need rnbd_clt_set_dev_attr either.
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-5-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoqing Jiang [Wed, 6 Jul 2022 13:31:47 +0000 (21:31 +0800)]
rnbd-clt: kill read_only from struct rnbd_clt_dev
The member is not needed since we can call get_disk_ro to achieve the
same goal.
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-4-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoqing Jiang [Wed, 6 Jul 2022 13:31:46 +0000 (21:31 +0800)]
rnbd-clt: don't free rsp in msg_open_conf for map scenario
For map scenario, rsp is freed in two places:
1. msg_open_conf frees rsp if rtrs_clt_request returns 0.
2. Otherwise, rsp is freed by the call sites of rtrs_clt_request.
Now, We'd like to control full lifecycle of rsp in rnbd_clt_map_device,
with that, it is feasible to pass rsp to rnbd_client_setup_device in
next commit.
For 1, it is possible to free rsp from the caller of send_usr_msg
because of the synchronization of iu->comp.wait. And we put iu later
in rnbd_clt_map_device to ensure order of release rsp and iu.
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-3-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Guoqing Jiang [Wed, 6 Jul 2022 13:31:45 +0000 (21:31 +0800)]
rnbd-clt: open code send_msg_open in rnbd_clt_map_device
Let's open code it in rnbd_clt_map_device, then we can use information
from rsp to setup gendisk and request_queue in next commits. After that,
we can remove some members (wc, fua and max_hw_sectors etc) from struct
rnbd_clt_dev.
Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-2-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christophe JAILLET [Sun, 3 Jul 2022 16:05:43 +0000 (18:05 +0200)]
block: null_blk: Use the bitmap API to allocate bitmaps
Use bitmap_zalloc()/bitmap_free() instead of hand-writing them.
It is less verbose and it improves the semantic.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/7c4d3116ba843fc4a8ae557dd6176352a6cd0985.1656864320.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sun, 3 Jul 2022 16:18:10 +0000 (10:18 -0600)]
Merge branch 'md-next' of https://git./linux/kernel/git/song/md into for-5.20/drivers
Pull MD updates from Song:
"1. Improve raid5 lock contention, by Logan Gunthorpe.
2. Misc fixes to raid5, by Logan Gunthorpe.
3. Fix race condition with md_reap_sync_thread(), by Guoqing Jiang."
* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (29 commits)
md: Fix spelling mistake in comments
md/raid5: Increase restriction on max segments per request
md/raid5: Improve debug prints
md/raid5: Pivot raid5_make_request()
md/raid5: Check all disks in a stripe_head for reshape progress
md/raid5: Refactor add_stripe_bio()
md/raid5: Keep a reference to last stripe_head for batch
md/raid5: Refactor for loop in raid5_make_request() into while loop
md/raid5: Move read_seqcount_begin() into make_stripe_request()
md/raid5: Drop the do_prepare flag in raid5_make_request()
md/raid5: Factor out helper from raid5_make_request() loop
md/raid5: Move common stripe get code into new find_get_stripe() helper
md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
md/raid5: Refactor raid5_make_request loop
md/raid5: Factor out ahead_of_reshape() function
md/raid5: Make logic blocking check consistent with logic that blocks
md: unlock mddev before reap sync_thread in action_store
md: Explicitly create command-line configured devices
md: Notify sysfs sync_completed in md_reap_sync_thread()
md: Ensure resync is reported after it starts
...
Zhang Jiaming [Sat, 2 Jul 2022 01:54:11 +0000 (09:54 +0800)]
md: Fix spelling mistake in comments
There are 2 spelling mistakes in comments. Fix it.
Signed-off-by: Zhang Jiaming <jiaming@nfschina.com>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:45 +0000 (13:19 -0600)]
md/raid5: Increase restriction on max segments per request
The block layer defaults the maximum segments to 128, which means
requests tend to get split around the 512KB depending on how many
pages can be merged. There's no such restriction in the raid5 code
so increase the limit to USHRT_MAX so that larger requests can be
sent as one.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:44 +0000 (13:19 -0600)]
md/raid5: Improve debug prints
Add a debug print for raid5_make_request() so that each request is
printed and add the logical sector number to the debug print in
__add_stripe_bio().
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:43 +0000 (13:19 -0600)]
md/raid5: Pivot raid5_make_request()
raid5_make_request() loops through every page in the request,
finds the appropriate stripe and adds the bio for that page in the
disk.
This causes a great deal of contention on the hash_lock and extra
work seeing each stripe must be found once for every data disk.
The number of times a stripe must be found can be reduced by pivoting
raid5_make_request() so that it loops through every stripe and then
loops through every disk in that stripe to see if the bio must be
added. This reduces the number of times the hash lock must be taken
by a factor equal to the number of data disks.
To accomplish this, the logical sectors that have already been added
must be tracked. Tracking them is done with a bitmap: the bits
for all pages are set at the start of the request and each bit
is cleared once the bio is added to a stripe.
Finding the next sector to be done is then just a call to
find_first_bit() so that sectors that have been done can simply be
skipped.
One minor downside is that the maximum sectors for a request must be
limited so that the bitmap can be appropriately sized on the stack.
This limit is arbitrarily chosen to be 256 stripe pages which works out
to 1MB if PAGE_SIZE == DEFAULT_STRIPE_SIZE. This doesn't actually
restrict the maximum request further seeing the default block queue
settings are used which restricts the number of segments to 128 (which
results in request sizes that are approximately 512KB).
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:42 +0000 (13:19 -0600)]
md/raid5: Check all disks in a stripe_head for reshape progress
When testing if a previous stripe has had reshape expand past it, use
the earliest or latest logical sector in all the disks for that stripe
head. This will allow adding multiple disks at a time in a subesquent
patch.
To do this cleaner, refactor the check into a helper function called
stripe_ahead_of_reshape().
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:41 +0000 (13:19 -0600)]
md/raid5: Refactor add_stripe_bio()
Factor out two helper functions from add_stripe_bio(): one to check for
overlap (stripe_bio_overlaps()), and one to actually add the bio to the
stripe (__add_stripe_bio()). The latter function will always succeed.
This will be useful in the next patch so that overlap can be checked for
multiple disks before adding any
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:40 +0000 (13:19 -0600)]
md/raid5: Keep a reference to last stripe_head for batch
When batching, every stripe head has to find the previous stripe head to
add to the batch list. This involves taking the hash lock which is
highly contended during IO.
Instead of finding the previous stripe_head each time, store a
reference to the previous stripe_head in a pointer so that it doesn't
require taking the contended lock another time.
The reference to the previous stripe must be released before scheduling
and waiting for work to get done. Otherwise, it can hold up
raid5_activate_delayed() and deadlock.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:39 +0000 (13:19 -0600)]
md/raid5: Refactor for loop in raid5_make_request() into while loop
The for loop with retry label can be more cleanly expressed as a while
loop by moving the logical_sector increment into the success path.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:38 +0000 (13:19 -0600)]
md/raid5: Move read_seqcount_begin() into make_stripe_request()
Now that prepare_to_wait() isn't in the way, move read_sequcount_begin()
into make_stripe_request().
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:37 +0000 (13:19 -0600)]
md/raid5: Drop the do_prepare flag in raid5_make_request()
prepare_to_wait() can be reasonably called after schedule instead of
setting a flag and preparing in the next loop iteration.
This means that prepare_to_wait() will be called before
read_seqcount_begin(), but there shouldn't be any reason that the order
matters here. On the first iteration of the loop prepare_to_wait() is
already called first.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:36 +0000 (13:19 -0600)]
md/raid5: Factor out helper from raid5_make_request() loop
Factor out the inner loop of raid5_make_request() into it's own helper
called make_stripe_request().
The helper returns a number of statuses: SUCCESS, RETRY,
SCHEDULE_AND_RETRY and FAIL. This makes the code a bit easier to
understand and allows the SCHEDULE_AND_RETRY path to be made common.
A context structure is added to contain do_flush. It will be used
more in subsequent patches for state that needs to be kept
outside the loop.
No functional changes intended. This will be cleaned up further in
subsequent patches to untangle the gen_lock and do_prepare logic
further.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:35 +0000 (13:19 -0600)]
md/raid5: Move common stripe get code into new find_get_stripe() helper
Both uses of find_stripe() require a fairly complicated dance to
increment the reference count. Move this into a common find_get_stripe()
helper.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:34 +0000 (13:19 -0600)]
md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
stripe_add_to_batch_list() is better done in the loop in make_request
instead of inside add_stripe_bio(). This is clearer and allows for
storing the batch_head state outside the loop in a subsequent patch.
The call to add_stripe_bio() in retry_aligned_read() is for read
and batching only applies to write. So it's impossible for batching
to happen at that call site.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:33 +0000 (13:19 -0600)]
md/raid5: Refactor raid5_make_request loop
Break immediately if raid5_get_active_stripe() returns NULL and deindent
the rest of the loop. Annotate this check with an unlikely().
This makes the code easier to read and reduces the indentation level.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:32 +0000 (13:19 -0600)]
md/raid5: Factor out ahead_of_reshape() function
There are a few uses of an ugly ternary operator in raid5_make_request()
to check if a sector is a head of a reshape sector.
Factor this out into a simple helper called ahead_of_reshape().
No functional changes intended.
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:31 +0000 (13:19 -0600)]
md/raid5: Make logic blocking check consistent with logic that blocks
The check in raid5_make_request differs very slightly from the logic
that causes it to block lower down. This likely does not cause a bug
as the check is fuzzy anyway (as reshape may move on between the first
check and the subsequent check). However, make it consistent so it can
be cleaned up in a subsequent patch.
The condition which causes the schedule is:
!(mddev->reshape_backwards ? logical_sector < conf->reshape_progress :
logical_sector >= conf->reshape_progress) &&
(mddev->reshape_backwards ? logical_sector < conf->reshape_safe :
logical_sector >= conf->reshape_safe)
The condition that causes the early bailout is made to match this.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
Guoqing Jiang [Tue, 21 Jun 2022 03:11:29 +0000 (11:11 +0800)]
md: unlock mddev before reap sync_thread in action_store
Since the bug which commit
8b48ec23cc51a ("md: don't unregister sync_thread
with reconfig_mutex held") fixed is related with action_store path, other
callers which reap sync_thread didn't need to be changed.
Let's pull md_unregister_thread from md_reap_sync_thread, then fix previous
bug with belows.
1. unlock mddev before md_reap_sync_thread in action_store.
2. save reshape_position before unlock, then restore it to ensure position
not changed accidentally by others.
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
Chris Webb [Wed, 1 Jun 2022 11:03:07 +0000 (12:03 +0100)]
md: Explicitly create command-line configured devices
Boot-time assembly of arrays with md= command-line arguments breaks when
CONFIG_BLOCK_LEGACY_AUTOLOAD is unset. md_setup_drive() in md-autodetect.c
calls blkdev_get_by_dev(), assuming this implicitly creates the block
device.
Fix this by attempting to md_alloc() the array first. As in the probe path,
ignore any error as failure is caught by blkdev_get_by_dev() anyway.
Signed-off-by: Chris Webb <chris@arachsys.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:56 +0000 (10:27 -0600)]
md: Notify sysfs sync_completed in md_reap_sync_thread()
The mdadm test 07layouts randomly produces a kernel hung task deadlock.
The deadlock is caused by the suspend_lo/suspend_hi files being set by
the mdadm background process during reshape and not being cleared
because the process hangs. (Leaving aside the issue of the fragility of
freezing kernel tasks by buggy userspace processes...)
When the background mdadm process hangs it, is waiting (without a
timeout) on a change to the sync_completed file signalling that the
reshape has completed. The process is woken up a couple times when
the reshape finishes but it is woken up before MD_RECOVERY_RUNNING
is cleared so sync_completed_show() reports 0 instead of "none".
To fix this, notify the sysfs file in md_reap_sync_thread() after
MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes
it to continue and write to suspend_lo/suspend_hi to allow IO to
continue.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:55 +0000 (10:27 -0600)]
md: Ensure resync is reported after it starts
The 07layouts test in mdadm fails on some systems. The failure
presents itself as the backup file not being removed before the next
layout is grown into:
mdadm: /dev/md0: cannot create backup file /tmp/md-test-backup:
File exists
This is because the background mdadm process, which is responsible for
cleaning up this backup file gets into an infinite loop waiting for
the reshape to start. mdadm checks the mdstat file if a reshape is
going and, if it is not, it waits for an event on the file or times
out in 5 seconds. On faster machines, the reshape may complete before
the 5 seconds times out, and thus the background mdadm process loops
waiting for a reshape to start that has already occurred.
mdadm reads the mdstat file to start, but mdstat does not report that the
reshape has begun, even though it has indeed begun. So the mdstat_wait()
call (in mdadm) which polls on the mdstat file won't ever return until
timing out.
The reason mdstat reports the reshape has started is due to an issue
in status_resync(). recovery_active is subtracted from curr_resync which
will result in a value of zero for the first chunk of reshaped data, and
the resulting read will report no reshape in progress.
To fix this, if "resync - recovery_active" is an overloaded value, force
the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:54 +0000 (10:27 -0600)]
md: Use enum for overloaded magic numbers used by mddev->curr_resync
Comments in the code document special values used for
mddev->curr_resync. Make this clearer by using an enum to label these
values.
The only functional change is a couple places use the wrong comparison
operator that implied 3 is another special value. They are all
fixed to imply that 3 or greater is an active resync.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:53 +0000 (10:27 -0600)]
md/raid5-cache: Annotate pslot with __rcu notation
radix_tree_lookup_slot() and radix_tree_replace_slot() API expect the
slot returned and looked up to be marked with __rcu. Otherwise
sparse warnings are generated:
drivers/md/raid5-cache.c:2939:23: warning: incorrect type in
assignment (different address spaces)
drivers/md/raid5-cache.c:2939:23: expected void **pslot
drivers/md/raid5-cache.c:2939:23: got void [noderef] __rcu **
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:52 +0000 (10:27 -0600)]
md/raid5-cache: Clear conf->log after finishing work
A NULL pointer dereferlence on conf->log is seen randomly with
the mdadm test 21raid5cache. Kasan reporst:
BUG: KASAN: null-ptr-deref in r5l_reclaimable_space+0xf5/0x140
Read of size 8 at addr
0000000000000860 by task md0_reclaim/3086
Call Trace:
dump_stack_lvl+0x5a/0x74
kasan_report.cold+0x5f/0x1a9
__asan_load8+0x69/0x90
r5l_reclaimable_space+0xf5/0x140
r5l_do_reclaim+0xf4/0x5e0
r5l_reclaim_thread+0x69/0x3b0
md_thread+0x1a2/0x2c0
kthread+0x177/0x1b0
ret_from_fork+0x22/0x30
This is caused by conf->log being cleared in r5l_exit_log() before
stopping the reclaim thread.
To fix this, clear conf->log after the reclaim_thread is unregistered
and after flushing disable_writeback_work.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:51 +0000 (10:27 -0600)]
md/raid5-cache: Drop RCU usage of conf->log
The only place that uses RCU to access conf->log is in
r5l_log_disk_error(). This function is mostly used in the IO path
and once with mddev_lock() held in raid5_change_consistency_policy().
It is known that the IO will be suspended before the log is freed and
r5l_log_exit() is called with the mddev_lock() held.
This should mean that conf->log can not be freed while the function is
being called, so the RCU protection is not necessary. Drop the
rcu_read_lock() as well as the synchronize_rcu() and
rcu_assign_pointer() usage.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:50 +0000 (10:27 -0600)]
md/raid5-cache: Take mddev_lock in r5c_journal_mode_show()
The mddev->lock spinlock doesn't protect against the removal of
conf->log in r5l_exit_log() so conf->log may be freed before it
is used.
To fix this, take the mddev_lock() insteaad of the mddev->lock spinlock.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:49 +0000 (10:27 -0600)]
md/raid5: suspend the array for calls to log_exit()
The raid5-cache code relies on there being no IO in flight when
log_exit() is called. There are two places where this is not
guaranteed so add mddev_suspend() and mddev_resume() calls to these
sites.
The site in raid5_change_consistency_policy() is in the error path,
and another similar call site already has suspend/resume calls just
below it; so it should be equally safe to make that change here.
There is one remaining site in raid5_remove_disk() that we call log_exit()
without suspending the array. Unfortunately, as the comment stated, we
cannot call mddev_suspend from raid5d.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:48 +0000 (10:27 -0600)]
md/raid5-ppl: Drop unused argument from ppl_handle_flush_request()
ppl_handle_flush_request() takes an struct r5log argument but doesn't
use it. It has no buisiness taking this argument as it is only used
by raid5-cache and has no way to derference it anyway. Remove
the argument.
No functional changes intended.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:47 +0000 (10:27 -0600)]
md/raid5-log: Drop extern decorators for function prototypes
extern is not necessary and recommended against when defining prototype
functions in headers. checkpatch.pl complains about these. So remove
them.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
Song Liu [Wed, 1 Jun 2022 23:12:37 +0000 (16:12 -0700)]
MAINTAINERS: add patchwork link to linux-raid project
Add link to patchwork:
https://patchwork.kernel.org/project/linux-raid/list/
Signed-off-by: Song Liu <song@kernel.org>
Lars Ellenberg [Wed, 22 Jun 2022 20:49:32 +0000 (22:49 +0200)]
drbd: bm_page_async_io: fix spurious bitmap "IO error" on large volumes
We usually do all our bitmap IO in units of PAGE_SIZE.
With very small or oddly sized external meta data, or with
PAGE_SIZE != 4k, it can happen that our last on-disk bitmap page
is not fully PAGE_SIZE aligned, so we may need to adjust the size
of the IO.
We used to do that with
min_t(unsigned int, PAGE_SIZE,
last_allowed_sector - current_offset);
And for just the right diff, (unsigned int)(diff) will result in 0.
A bio of length 0 will correctly be rejected with an IO error
(and some scary WARN_ON_ONCE()) by the scsi layer.
Do the calculation properly.
Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20220622204932.196830-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Liu Song [Sat, 25 Jun 2022 15:15:21 +0000 (23:15 +0800)]
blk-mq: blk_mq_tag_busy is no need to return a value
Currently "blk_mq_tag_busy" return value has no effect, so adjust it.
Some code implementations have also been adjusted to enhance
readability.
Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1656170121-1619-1-git-send-email-liusong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 23 Jun 2022 07:48:34 +0000 (09:48 +0200)]
block: Always initialize bio IO priority on submit
Currently, IO priority set in task's IO context is not reflected in the
bio->bi_ioprio for most IO (only io_uring and direct IO set it). This
results in odd results where process is submitting some bios with one
priority and other bios with a different (unset) priority and due to
differing priorities bios cannot be merged. Make sure bio->bi_ioprio is
always set on bio submission.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-9-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 23 Jun 2022 07:48:33 +0000 (09:48 +0200)]
block: Initialize bio priority earlier
Bio's IO priority needs to be initialized before we try to merge the bio
with other bios. Otherwise we could merge bios which would otherwise
receive different IO priorities leading to possible QoS issues.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 23 Jun 2022 07:48:32 +0000 (09:48 +0200)]
blk-ioprio: Convert from rqos policy to direct call
Convert blk-ioprio handling from a rqos policy to a direct call from
blk_mq_submit_bio(). Firstly, blk-ioprio is not much of a rqos policy
anyway, it just needs a hook in bio submission path to set the bio's IO
priority. Secondly, the rqos .track hook gets actually called too late
for blk-ioprio purposes and introducing a special rqos hook just for
blk-ioprio looks even weirder.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 23 Jun 2022 07:48:31 +0000 (09:48 +0200)]
blk-ioprio: Remove unneeded field
blkcg->ioprio_set field is not really useful except for avoiding
possibly more expensive checks inside blkcg_ioprio_track(). The check
for blkcg->prio_policy being equal to POLICY_NO_CHANGE does the same
service so just remove the ioprio_set field and replace the check.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-6-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 23 Jun 2022 07:48:30 +0000 (09:48 +0200)]
block: Fix handling of tasks without ioprio in ioprio_get(2)
ioprio_get(2) can be asked to return the best IO priority from several
tasks (IOPRIO_WHO_PGRP, IOPRIO_WHO_USER). Currently the call treats
tasks without set IO priority as having priority
IOPRIO_CLASS_BE/IOPRIO_BE_NORM however this does not really reflect the
IO priority the task will get (which depends on task's nice value).
Fix the code to use the real IO priority task's IO will use. We have to
modify code for ioprio_get(IOPRIO_WHO_PROCESS) to keep returning
IOPRIO_CLASS_NONE priority for tasks without set IO priority as a
special case to maintain userspace visible API.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220623074840.5960-5-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 23 Jun 2022 07:48:29 +0000 (09:48 +0200)]
block: Make ioprio_best() static
Nobody outside of block/ioprio.c uses it.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 23 Jun 2022 07:48:28 +0000 (09:48 +0200)]
block: Generalize get_current_ioprio() for any task
get_current_ioprio() operates only on current task. We will need the
same functionality for other tasks as well. Generalize
get_current_ioprio() for that and also move the bulk out of the header
file because it is large enough.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 23 Jun 2022 07:48:27 +0000 (09:48 +0200)]
block: Return effective IO priority from get_current_ioprio()
get_current_ioprio() is used to initialize IO priority of various
requests. As such it should be returning the effective IO priority of
the task (i.e., reflecting the fact that unset IO priority should get
set based on task's CPU priority) so that the conversion is concentrated
in one place.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jan Kara [Thu, 23 Jun 2022 07:48:26 +0000 (09:48 +0200)]
block: fix default IO priority handling again
Commit
e70344c05995 ("block: fix default IO priority handling")
introduced an inconsistency in get_current_ioprio() that tasks without
IO context return IOPRIO_DEFAULT priority while tasks with freshly
allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority.
Tasks without IO context used to be rare before
5a9d041ba2f6 ("block:
move io_context creation into where it's needed") but after this commit
they became common because now only BFQ IO scheduler setups task's IO
context. Similar inconsistency is there for get_task_ioprio() so this
inconsistency is now exposed to userspace and userspace will see
different IO priority for tasks operating on devices with BFQ compared
to devices without BFQ. Furthemore the changes done by commit
e70344c05995 change the behavior when no IO priority is set for BFQ IO
scheduler which is also documented in ioprio_set(2) manpage:
"If no I/O scheduler has been set for a thread, then by default the I/O
priority will follow the CPU nice value (setpriority(2)). In Linux
kernels before version 2.6.24, once an I/O priority had been set using
ioprio_set(), there was no way to reset the I/O scheduling behavior to
the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to
reset to the default I/O scheduling behavior."
So make sure we default to IOPRIO_CLASS_NONE as used to be the case
before commit
e70344c05995. Also cleanup alloc_io_context() to
explicitely set this IO priority for the allocated IO context to avoid
future surprises. Note that we tweak ioprio_best() to maintain
ioprio_get(2) behavior and make this commit easily backportable.
CC: stable@vger.kernel.org
Fixes:
e70344c05995 ("block: fix default IO priority handling")
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sebastian Andrzej Siewior [Wed, 22 Jun 2022 08:25:54 +0000 (10:25 +0200)]
blk-mq: Don't disable preemption around __blk_mq_run_hw_queue().
__blk_mq_delay_run_hw_queue() disables preemption to get a stable
current CPU number and then invokes __blk_mq_run_hw_queue() if the CPU
number is part the mask.
__blk_mq_run_hw_queue() acquires a spin_lock_t which is a sleeping lock
on PREEMPT_RT and can't be acquired with disabled preemption.
It is not required for correctness to invoke __blk_mq_run_hw_queue() on
a CPU matching hctx->cpumask. Both (async and direct requests) can run
on a CPU not matching hctx->cpumask.
The CPU mask without disabling preemption and invoking
__blk_mq_run_hw_queue().
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/YrLSEiNvagKJaDs5@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Fri, 17 Jun 2022 21:08:59 +0000 (14:08 -0700)]
block: bfq: Fix kernel-doc headers
Fix the following warnings:
block/bfq-cgroup.c:721: warning: Function parameter or member 'bfqg' not described in '__bfq_bic_change_cgroup'
block/bfq-cgroup.c:721: warning: Excess function parameter 'blkcg' description in '__bfq_bic_change_cgroup'
block/bfq-cgroup.c:870: warning: Function parameter or member 'ioprio_class' not described in 'bfq_reparent_leaf_entity'
block/bfq-cgroup.c:900: warning: Function parameter or member 'ioprio_class' not described in 'bfq_reparent_active_queues'
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220617210859.106623-1-bvanassche@acm.org
Bart Van Assche [Fri, 17 Jun 2022 20:44:33 +0000 (13:44 -0700)]
block: bfq: Remove an unused function definition
This patch is the result of the analysis of a sparse report.
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220617204433.102022-1-bvanassche@acm.org
GuoYong Zheng [Fri, 17 Jun 2022 10:28:04 +0000 (18:28 +0800)]
bfq: Remove useless code in bfq_lookup_next_entity
It is no need to judge entity is null or not here,
directly return entity is ok, so remove it.
Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
Link: https://lore.kernel.org/r/1655461684-19075-1-git-send-email-zhenggy@chinatelecom.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Jun 2022 09:09:34 +0000 (11:09 +0200)]
block: move blk_queue_get_max_sectors to blk.h
blk_queue_get_max_sectors is private to the block layer, so move it out
of blkdev.h.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220614090934.570632-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Jun 2022 09:09:33 +0000 (11:09 +0200)]
block: fold blk_max_size_offset into get_max_io_size
Now that blk_max_size_offset has a single caller left, fold it into that
and clean up the naming convention for the local variables there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220614090934.570632-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Jun 2022 09:09:32 +0000 (11:09 +0200)]
block: cleanup variable naming in get_max_io_size
get_max_io_size has a very odd choice of variables names and
initialization patterns. Switch to more descriptive names and more
clear initialization of them.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220614090934.570632-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Jun 2022 09:09:31 +0000 (11:09 +0200)]
block: open code blk_max_size_offset in blk_rq_get_max_sectors
blk_rq_get_max_sectors always uses q->limits.chunk_sectors as the
chunk_sectors argument, and already checks for max_sectors through the
call to blk_queue_get_max_sectors. That means much of
blk_max_size_offset is not needed and open coding it simplifies the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220614090934.570632-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Jun 2022 09:09:30 +0000 (11:09 +0200)]
dm: open code blk_max_size_offset in max_io_len
max_io_len always passes an explicitly non-zero chunk_sectors into
blk_max_size_offset. That means much of blk_max_size_offset is not
needed and can be open coded to simplify the code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20220614090934.570632-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Tue, 14 Jun 2022 09:09:29 +0000 (11:09 +0200)]
block: factor out a chunk_size_left helper
Factor out a helper from blk_max_size_offset so that it can be reused
independently.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220614090934.570632-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 15 Jun 2022 22:55:49 +0000 (15:55 -0700)]
block: Make blk_mq_get_sq_hctx() select the proper hardware queue type
Since the introduction of blk_mq_get_hctx_type() the operation type in
the second argument of blk_mq_get_hctx_type() matters. The introduction
of blk_mq_get_hctx_type() caused blk_mq_get_sq_hctx() to select a
hardware queue of type HCTX_TYPE_READ instead of HCTX_TYPE_DEFAULT.
Switch to hardware queue type HCTX_TYPE_DEFAULT since HCTX_TYPE_READ
should only be used for read requests.
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220615225549.1054905-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 15 Jun 2022 22:55:48 +0000 (15:55 -0700)]
block: Rename a blk_mq_map_queue() argument
Before the introduction of blk_mq_get_hctx_type(), blk_mq_map_queue()
only used the flags from its second argument. Since the introduction of
blk_mq_get_hctx_type(), blk_mq_map_queue() uses both the operation and
the flags encoded in that argument. Rename the second argument of
blk_mq_map_queue() to make this clear.
Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220615225549.1054905-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Wed, 15 Jun 2022 22:55:47 +0000 (15:55 -0700)]
blk-iocost: Simplify ioc_rqos_done()
Leave out the superfluous "& REQ_OP_MASK" code. The definition of req_op()
shows that that code is superfluous:
#define req_op(req) ((req)->cmd_flags & REQ_OP_MASK)
Compile-tested only.
Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220615225549.1054905-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bo Liu [Wed, 15 Jun 2022 08:18:16 +0000 (04:18 -0400)]
block: Directly use ida_alloc()/free()
Use ida_alloc()/ida_free() instead of
ida_simple_get()/ida_simple_remove().
The latter is deprecated and more verbose.
Signed-off-by: Bo Liu <liubo03@inspur.com>
Reviewed-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20220615081816.4342-1-liubo03@inspur.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Fri, 10 Jun 2022 19:58:30 +0000 (12:58 -0700)]
iomap: add support for dma aligned direct-io
Use the address alignment requirements from the block_device for direct
io instead of requiring addresses be aligned to the block size.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-12-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Fri, 10 Jun 2022 19:58:29 +0000 (12:58 -0700)]
block: relax direct io memory alignment
Use the address alignment requirements from the block_device for direct
io instead of requiring addresses be aligned to the block size. User
space can discover the alignment requirements from the dma_alignment
queue attribute.
User space can specify any hardware compatible DMA offset for each
segment, but every segment length is still required to be a multiple of
the block size.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-11-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Fri, 10 Jun 2022 19:58:28 +0000 (12:58 -0700)]
block: introduce bdev_iter_is_aligned helper
Provide a convenient function for this repeatable coding pattern.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-10-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Fri, 10 Jun 2022 19:58:27 +0000 (12:58 -0700)]
iov: introduce iov_iter_aligned
The existing iov_iter_alignment() function returns the logical OR of
address and length. For cases where address and length need to be
considered separately, introduce a helper function that a caller can
specificy length and address masks that indicate if the iov is
unaligned.
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-9-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Fri, 10 Jun 2022 19:58:26 +0000 (12:58 -0700)]
block/bounce: count bytes instead of sectors
Individual bv_len's may not be a sector size.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-8-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Fri, 10 Jun 2022 19:58:25 +0000 (12:58 -0700)]
block/merge: count bytes instead of sectors
Individual bv_len's may not be a sector size.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220610195830.3574005-7-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>