linux-block.git
3 years agonvmet-auth: Diffie-Hellman key exchange support
Hannes Reinecke [Mon, 27 Jun 2022 09:52:06 +0000 (11:52 +0200)]
nvmet-auth: Diffie-Hellman key exchange support

Implement Diffie-Hellman key exchange using FFDHE groups for NVMe
In-Band Authentication.
This patch adds a new host configfs attribute 'dhchap_dhgroup' to
select the FFDHE group to use.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: implement basic In-Band Authentication
Hannes Reinecke [Mon, 27 Jun 2022 09:52:05 +0000 (11:52 +0200)]
nvmet: implement basic In-Band Authentication

Implement NVMe-oF In-Band authentication according to NVMe TPAR 8006.
This patch adds three additional configfs entries 'dhchap_key',
'dhchap_ctrl_key', and 'dhchap_hash' to the 'host' configfs directory.
The 'dhchap_key' and 'dhchap_ctrl_key' entries need to be in the ASCII
format as specified in NVMe Base Specification v2.0 section 8.13.5.8
'Secret representation'.
'dhchap_hash' defaults to 'hmac(sha256)', and can be written to to
switch to a different HMAC algorithm.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvmet: parse fabrics commands on io queues
Hannes Reinecke [Mon, 27 Jun 2022 09:52:04 +0000 (11:52 +0200)]
nvmet: parse fabrics commands on io queues

Some fabrics commands can be sent via io queues, so add a new
function nvmet_parse_fabrics_io_cmd() and rename the existing
nvmet_parse_fabrics_cmd() to nvmet_parse_fabrics_admin_cmd().

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-auth: Diffie-Hellman key exchange support
Hannes Reinecke [Mon, 27 Jun 2022 09:52:03 +0000 (11:52 +0200)]
nvme-auth: Diffie-Hellman key exchange support

Implement Diffie-Hellman key exchange using FFDHE groups
for NVMe In-Band Authentication.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: implement In-Band authentication
Hannes Reinecke [Mon, 27 Jun 2022 09:52:02 +0000 (11:52 +0200)]
nvme: implement In-Band authentication

Implement NVMe-oF In-Band authentication according to NVMe TPAR 8006.
This patch adds two new fabric options 'dhchap_secret' to specify the
pre-shared key (in ASCII respresentation according to NVMe 2.0 section
8.13.5.8 'Secret representation') and 'dhchap_ctrl_secret' to specify
the pre-shared controller key for bi-directional authentication of both
the host and the controller.
Re-authentication can be triggered by writing the PSK into the new
controller sysfs attribute 'dhchap_secret' or 'dhchap_ctrl_secret'.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-fabrics: decode 'authentication required' connect error
Hannes Reinecke [Mon, 27 Jun 2022 09:52:01 +0000 (11:52 +0200)]
nvme-fabrics: decode 'authentication required' connect error

The 'connect' command might fail with NVME_SC_AUTH_REQUIRED, so we
should be decoding this error, too.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: add definitions for NVMe In-Band authentication
Hannes Reinecke [Mon, 27 Jun 2022 09:52:00 +0000 (11:52 +0200)]
nvme: add definitions for NVMe In-Band authentication

Add new definitions for NVMe In-band authentication as defined in
the NVMe Base Specification v2.0.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agolib/base64: RFC4648-compliant base64 encoding
Hannes Reinecke [Mon, 27 Jun 2022 09:51:59 +0000 (11:51 +0200)]
lib/base64: RFC4648-compliant base64 encoding

Add RFC4648-compliant base64 encoding and decoding routines, based on
the base64url encoding in fs/crypto/fname.c.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agocrypto: add crypto_has_kpp()
Hannes Reinecke [Mon, 27 Jun 2022 09:51:58 +0000 (11:51 +0200)]
crypto: add crypto_has_kpp()

Add helper function to determine if a given key-agreement protocol
primitive is supported.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agocrypto: add crypto_has_shash()
Hannes Reinecke [Mon, 27 Jun 2022 09:51:57 +0000 (11:51 +0200)]
crypto: add crypto_has_shash()

Add helper function to determine if a given synchronous hash is supported.

Signed-off-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme-loop: use nvme core helpers to cancel all requests in a tagset
Sagi Grimberg [Sun, 26 Jun 2022 14:06:00 +0000 (17:06 +0300)]
nvme-loop: use nvme core helpers to cancel all requests in a tagset

A helper now exist, no need to open-code the same functionality.

Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: fix qid param blk_mq_alloc_request_hctx
Chaitanya Kulkarni [Tue, 7 Jun 2022 01:16:43 +0000 (18:16 -0700)]
nvme: fix qid param blk_mq_alloc_request_hctx

Only caller of the __nvme_submit_sync_cmd() with qid value not equal to
NVME_QID_ANY is nvmf_connect_io_queues(), where qid value is alway set
to > 0.

[1] __nvme_submit_sync_cmd() callers with  qid parameter from :-

        Caller                  |   qid parameter
------------------------------------------------------
* nvme_fc_connect_io_queues()   |
   nvmf_connect_io_queue()      |      qid > 0
* nvme_rdma_start_io_queues()   |
   nvme_rdma_start_queue()      |
    nvmf_connect_io_queues()    |      qid > 0
* nvme_tcp_start_io_queues()    |
   nvme_tcp_start_queue()       |
    nvmf_connect_io_queues()    |      qid > 0
* nvme_loop_connect_io_queues() |
   nvmf_connect_io_queues()     |      qid > 0

When qid value of the function parameter __nvme_submit_sync_cmd() is > 0
from above callers, we use blk_mq_alloc_request_hctx(), where we pass
last parameter as 0 if qid functional parameter value is set to 0 with
conditional operators, see 1002 :-

991 int __nvme_submit_sync_cmd(struct request_queue *q, struct nvme_command *cmd,
 992                 union nvme_result *result, void *buffer, unsigned bufflen,
 993                 int qid, int at_head, blk_mq_req_flags_t flags)
 994 {
 995         struct request *req;
 996         int ret;
 997
 998         if (qid == NVME_QID_ANY)
 999                 req = blk_mq_alloc_request(q, nvme_req_op(cmd), flags);
1000         else
1001                 req = blk_mq_alloc_request_hctx(q, nvme_req_op(cmd), flags,
1002                                                 qid ? qid - 1 : 0);
1003

But qid function parameter value of the __nvme_submit_sync_cmd() will
never be 0 from above caller list see [1], and all the other callers of
__nvme_submit_sync_cmd() use NVME_QID_ANY as qid value :-
1. nvme_submit_sync_cmd()
2. nvme_features()
3. nvme_sec_submit()
4. nvmf_reg_read32()
5. nvmf_reg_read64()
6. nvmf_ref_write32()
7. nvmf_connect_admin_queue()

Remove the conditional operator to pass the qid as 0 in the call to
blk_mq_alloc_requst_hctx().

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: remove unused timeout parameter
Chaitanya Kulkarni [Tue, 7 Jun 2022 01:16:42 +0000 (18:16 -0700)]
nvme: remove unused timeout parameter

The function __nvme_submit_sync_cmd() has following list of callers
that sets the timeout value to 0 :-

        Callers               |   Timeout value
------------------------------------------------
nvme_submit_sync_cmd()        |        0
nvme_features()               |        0
nvme_sec_submit()             |        0
nvmf_reg_read32()             |        0
nvmf_reg_read64()             |        0
nvmf_reg_write32()            |        0
nvmf_connect_admin_queue()    |        0
nvmf_connect_io_queue()       |        0

Remove the timeout function parameter from __nvme_submit_sync_cmd() and
adjust the rest of code accordingly.

Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: handle the persistent internal error AER
Michael Kelley [Wed, 8 Jun 2022 18:52:21 +0000 (11:52 -0700)]
nvme: handle the persistent internal error AER

In the NVM Express Revision 1.4 spec, Figure 145 describes possible
values for an AER with event type "Error" (value 000b). For a
Persistent Internal Error (value 03h), the host should perform a
controller reset.

Add support for this error using code that already exists for
doing a controller reset. As part of this support, introduce
two utility functions for parsing the AER type and subtype.

This new support was tested in a lab environment where we can
generate the persistent internal error on demand, and observe
both the Linux side and NVMe controller side to see that the
controller reset has been done.

Signed-off-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agonvme: remove a double word in a comment
Xiang wangx [Sat, 4 Jun 2022 14:32:54 +0000 (22:32 +0800)]
nvme: remove a double word in a comment

Delete the redundant word 'be'.

Signed-off-by: Xiang wangx <wangxiang@cdjrlc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
3 years agornbd-clt: make rnbd_clt_change_capacity return void
Guoqing Jiang [Wed, 6 Jul 2022 13:31:52 +0000 (21:31 +0800)]
rnbd-clt: make rnbd_clt_change_capacity return void

No need to checking the return value, make it return void.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-9-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agornbd-clt: pass sector_t type for resize capacity
Guoqing Jiang [Wed, 6 Jul 2022 13:31:51 +0000 (21:31 +0800)]
rnbd-clt: pass sector_t type for resize capacity

Let's change the parameter type to 'sector_t' then we don't need to cast
it from rnbd_clt_resize_dev_store, and update rnbd_clt_resize_disk too.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-8-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agornbd-clt: check capacity inside rnbd_clt_change_capacity
Guoqing Jiang [Wed, 6 Jul 2022 13:31:50 +0000 (21:31 +0800)]
rnbd-clt: check capacity inside rnbd_clt_change_capacity

Currently, process_msg_open_rsp checks if capacity changed or not before
call rnbd_clt_change_capacity while the checking also make sense for
rnbd_clt_resize_dev_store, let's move the checking into the function.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-7-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agornbd-clt: adjust the layout of struct rnbd_clt_dev
Guoqing Jiang [Wed, 6 Jul 2022 13:31:49 +0000 (21:31 +0800)]
rnbd-clt: adjust the layout of struct rnbd_clt_dev

While at it, let re-arrange the struct to remove holes.

Before, pahole reports

/* size: 232, cachelines: 4, members: 17 */
/* sum members: 224, holes: 2, sum holes: 8 */
/* last cacheline: 40 bytes */

After the change, the report changes to

/* size: 224, cachelines: 4, members: 17 */
/* last cacheline: 32 bytes */

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-6-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agornbd-clt: reduce the size of struct rnbd_clt_dev
Guoqing Jiang [Wed, 6 Jul 2022 13:31:48 +0000 (21:31 +0800)]
rnbd-clt: reduce the size of struct rnbd_clt_dev

Previously, both map and remap trigger rnbd_clt_set_dev_attr to set
some members in rnbd_clt_dev such as wc, fua and logical_block_size
etc, but those members are only useful for map scenario given the
setup_request_queue is only called from the path:

rnbd_clt_map_device -> rnbd_client_setup_device

Since rnbd_clt_map_device frees rsp after rnbd_client_setup_device,
we can pass rsp to rnbd_client_setup_device and it's callees, which
means queue's attributes can be set directly from relevant members
of rsp instead from rnbd_clt_dev.

After that, we can kill 11 members from rnbd_clt_dev, and we don't
need rnbd_clt_set_dev_attr either.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-5-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agornbd-clt: kill read_only from struct rnbd_clt_dev
Guoqing Jiang [Wed, 6 Jul 2022 13:31:47 +0000 (21:31 +0800)]
rnbd-clt: kill read_only from struct rnbd_clt_dev

The member is not needed since we can call get_disk_ro to achieve the
same goal.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-4-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agornbd-clt: don't free rsp in msg_open_conf for map scenario
Guoqing Jiang [Wed, 6 Jul 2022 13:31:46 +0000 (21:31 +0800)]
rnbd-clt: don't free rsp in msg_open_conf for map scenario

For map scenario, rsp is freed in two places:

1. msg_open_conf frees rsp if rtrs_clt_request returns 0.

2. Otherwise, rsp is freed by the call sites of rtrs_clt_request.

Now, We'd like to control full lifecycle of rsp in rnbd_clt_map_device,
with that, it is feasible to pass rsp to rnbd_client_setup_device in
next commit.

For 1, it is possible to free rsp from the caller of send_usr_msg
because of the synchronization of iu->comp.wait. And we put iu later
in rnbd_clt_map_device to ensure order of release rsp and iu.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-3-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agornbd-clt: open code send_msg_open in rnbd_clt_map_device
Guoqing Jiang [Wed, 6 Jul 2022 13:31:45 +0000 (21:31 +0800)]
rnbd-clt: open code send_msg_open in rnbd_clt_map_device

Let's open code it in rnbd_clt_map_device, then we can use information
from rsp to setup gendisk and request_queue in next commits. After that,
we can remove some members (wc, fua and max_hw_sectors etc) from struct
rnbd_clt_dev.

Acked-by: Jack Wang <jinpu.wang@ionos.com>
Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Link: https://lore.kernel.org/r/20220706133152.12058-2-guoqing.jiang@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: null_blk: Use the bitmap API to allocate bitmaps
Christophe JAILLET [Sun, 3 Jul 2022 16:05:43 +0000 (18:05 +0200)]
block: null_blk: Use the bitmap API to allocate bitmaps

Use bitmap_zalloc()/bitmap_free() instead of hand-writing them.

It is less verbose and it improves the semantic.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/7c4d3116ba843fc4a8ae557dd6176352a6cd0985.1656864320.git.christophe.jaillet@wanadoo.fr
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoMerge branch 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md...
Jens Axboe [Sun, 3 Jul 2022 16:18:10 +0000 (10:18 -0600)]
Merge branch 'md-next' of https://git./linux/kernel/git/song/md into for-5.20/drivers

Pull MD updates from Song:

"1. Improve raid5 lock contention, by Logan Gunthorpe.
 2. Misc fixes to raid5, by Logan Gunthorpe.
 3. Fix race condition with md_reap_sync_thread(), by Guoqing Jiang."

* 'md-next' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md: (29 commits)
  md: Fix spelling mistake in comments
  md/raid5: Increase restriction on max segments per request
  md/raid5: Improve debug prints
  md/raid5: Pivot raid5_make_request()
  md/raid5: Check all disks in a stripe_head for reshape progress
  md/raid5: Refactor add_stripe_bio()
  md/raid5: Keep a reference to last stripe_head for batch
  md/raid5: Refactor for loop in raid5_make_request() into while loop
  md/raid5: Move read_seqcount_begin() into make_stripe_request()
  md/raid5: Drop the do_prepare flag in raid5_make_request()
  md/raid5: Factor out helper from raid5_make_request() loop
  md/raid5: Move common stripe get code into new find_get_stripe() helper
  md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
  md/raid5: Refactor raid5_make_request loop
  md/raid5: Factor out ahead_of_reshape() function
  md/raid5: Make logic blocking check consistent with logic that blocks
  md: unlock mddev before reap sync_thread in action_store
  md: Explicitly create command-line configured devices
  md: Notify sysfs sync_completed in md_reap_sync_thread()
  md: Ensure resync is reported after it starts
  ...

3 years agomd: Fix spelling mistake in comments
Zhang Jiaming [Sat, 2 Jul 2022 01:54:11 +0000 (09:54 +0800)]
md: Fix spelling mistake in comments

There are 2 spelling mistakes in comments. Fix it.

Signed-off-by: Zhang Jiaming <jiaming@nfschina.com>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Increase restriction on max segments per request
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:45 +0000 (13:19 -0600)]
md/raid5: Increase restriction on max segments per request

The block layer defaults the maximum segments to 128, which means
requests tend to get split around the 512KB depending on how many
pages can be merged. There's no such restriction in the raid5 code
so increase the limit to USHRT_MAX so that larger requests can be
sent as one.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Improve debug prints
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:44 +0000 (13:19 -0600)]
md/raid5: Improve debug prints

Add a debug print for raid5_make_request() so that each request is
printed and add the logical sector number to the debug print in
__add_stripe_bio().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Pivot raid5_make_request()
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:43 +0000 (13:19 -0600)]
md/raid5: Pivot raid5_make_request()

raid5_make_request() loops through every page in the request,
finds the appropriate stripe and adds the bio for that page in the
disk.

This causes a great deal of contention on the hash_lock and extra
work seeing each stripe must be found once for every data disk.

The number of times a stripe must be found can be reduced by pivoting
raid5_make_request() so that it loops through every stripe and then
loops through every disk in that stripe to see if the bio must be
added. This reduces the number of times the hash lock must be taken
by a factor equal to the number of data disks.

To accomplish this, the logical sectors that have already been added
must be tracked. Tracking them is done with a bitmap: the bits
for all pages are set at the start of the request and each bit
is cleared once the bio is added to a stripe.

Finding the next sector to be done is then just a call to
find_first_bit() so that sectors that have been done can simply be
skipped.

One minor downside is that the maximum sectors for a request must be
limited so that the bitmap can be appropriately sized on the stack.
This limit is arbitrarily chosen to be 256 stripe pages which works out
to 1MB if PAGE_SIZE == DEFAULT_STRIPE_SIZE. This doesn't actually
restrict the maximum request further seeing the default block queue
settings are used which restricts the number of segments to 128 (which
results in request sizes that are approximately 512KB).

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Check all disks in a stripe_head for reshape progress
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:42 +0000 (13:19 -0600)]
md/raid5: Check all disks in a stripe_head for reshape progress

When testing if a previous stripe has had reshape expand past it, use
the earliest or latest logical sector in all the disks for that stripe
head. This will allow adding multiple disks at a time in a subesquent
patch.

To do this cleaner, refactor the check into a helper function called
stripe_ahead_of_reshape().

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Refactor add_stripe_bio()
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:41 +0000 (13:19 -0600)]
md/raid5: Refactor add_stripe_bio()

Factor out two helper functions from add_stripe_bio(): one to check for
overlap (stripe_bio_overlaps()), and one to actually add the bio to the
stripe (__add_stripe_bio()). The latter function will always succeed.

This will be useful in the next patch so that overlap can be checked for
multiple disks before adding any

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Keep a reference to last stripe_head for batch
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:40 +0000 (13:19 -0600)]
md/raid5: Keep a reference to last stripe_head for batch

When batching, every stripe head has to find the previous stripe head to
add to the batch list. This involves taking the hash lock which is
highly contended during IO.

Instead of finding the previous stripe_head each time, store a
reference to the previous stripe_head in a pointer so that it doesn't
require taking the contended lock another time.

The reference to the previous stripe must be released before scheduling
and waiting for work to get done. Otherwise, it can hold up
raid5_activate_delayed() and deadlock.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Refactor for loop in raid5_make_request() into while loop
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:39 +0000 (13:19 -0600)]
md/raid5: Refactor for loop in raid5_make_request() into while loop

The for loop with retry label can be more cleanly expressed as a while
loop by moving the logical_sector increment into the success path.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Move read_seqcount_begin() into make_stripe_request()
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:38 +0000 (13:19 -0600)]
md/raid5: Move read_seqcount_begin() into make_stripe_request()

Now that prepare_to_wait() isn't in the way, move read_sequcount_begin()
into make_stripe_request().

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Drop the do_prepare flag in raid5_make_request()
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:37 +0000 (13:19 -0600)]
md/raid5: Drop the do_prepare flag in raid5_make_request()

prepare_to_wait() can be reasonably called after schedule instead of
setting a flag and preparing in the next loop iteration.

This means that prepare_to_wait() will be called before
read_seqcount_begin(), but there shouldn't be any reason that the order
matters here. On the first iteration of the loop prepare_to_wait() is
already called first.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Factor out helper from raid5_make_request() loop
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:36 +0000 (13:19 -0600)]
md/raid5: Factor out helper from raid5_make_request() loop

Factor out the inner loop of raid5_make_request() into it's own helper
called make_stripe_request().

The helper returns a number of statuses: SUCCESS, RETRY,
SCHEDULE_AND_RETRY and FAIL. This makes the code a bit easier to
understand and allows the SCHEDULE_AND_RETRY path to be made common.

A context structure is added to contain do_flush. It will be used
more in subsequent patches for state that needs to be kept
outside the loop.

No functional changes intended. This will be cleaned up further in
subsequent patches to untangle the gen_lock and do_prepare logic
further.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Move common stripe get code into new find_get_stripe() helper
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:35 +0000 (13:19 -0600)]
md/raid5: Move common stripe get code into new find_get_stripe() helper

Both uses of find_stripe() require a fairly complicated dance to
increment the reference count. Move this into a common find_get_stripe()
helper.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Acked-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:34 +0000 (13:19 -0600)]
md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()

stripe_add_to_batch_list() is better done in the loop in make_request
instead of inside add_stripe_bio(). This is clearer and allows for
storing the batch_head state outside the loop in a subsequent patch.

The call to add_stripe_bio() in retry_aligned_read() is for read
and batching only applies to write. So it's impossible for batching
to happen at that call site.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Refactor raid5_make_request loop
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:33 +0000 (13:19 -0600)]
md/raid5: Refactor raid5_make_request loop

Break immediately if raid5_get_active_stripe() returns NULL and deindent
the rest of the loop. Annotate this check with an unlikely().

This makes the code easier to read and reduces the indentation level.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Factor out ahead_of_reshape() function
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:32 +0000 (13:19 -0600)]
md/raid5: Factor out ahead_of_reshape() function

There are a few uses of an ugly ternary operator in raid5_make_request()
to check if a sector is a head of a reshape sector.

Factor this out into a simple helper called ahead_of_reshape().

No functional changes intended.

Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: Make logic blocking check consistent with logic that blocks
Logan Gunthorpe [Thu, 16 Jun 2022 19:19:31 +0000 (13:19 -0600)]
md/raid5: Make logic blocking check consistent with logic that blocks

The check in raid5_make_request differs very slightly from the logic
that causes it to block lower down. This likely does not cause a bug
as the check is fuzzy anyway (as reshape may move on between the first
check and the subsequent check). However, make it consistent so it can
be cleaned up in a subsequent patch.

The condition which causes the schedule is:

 !(mddev->reshape_backwards ? logical_sector < conf->reshape_progress :
   logical_sector >= conf->reshape_progress) &&
  (mddev->reshape_backwards ? logical_sector < conf->reshape_safe :
   logical_sector >= conf->reshape_safe)

The condition that causes the early bailout is made to match this.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd: unlock mddev before reap sync_thread in action_store
Guoqing Jiang [Tue, 21 Jun 2022 03:11:29 +0000 (11:11 +0800)]
md: unlock mddev before reap sync_thread in action_store

Since the bug which commit 8b48ec23cc51a ("md: don't unregister sync_thread
with reconfig_mutex held") fixed is related with action_store path, other
callers which reap sync_thread didn't need to be changed.

Let's pull md_unregister_thread from md_reap_sync_thread, then fix previous
bug with belows.

1. unlock mddev before md_reap_sync_thread in action_store.
2. save reshape_position before unlock, then restore it to ensure position
   not changed accidentally by others.

Signed-off-by: Guoqing Jiang <guoqing.jiang@linux.dev>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd: Explicitly create command-line configured devices
Chris Webb [Wed, 1 Jun 2022 11:03:07 +0000 (12:03 +0100)]
md: Explicitly create command-line configured devices

Boot-time assembly of arrays with md= command-line arguments breaks when
CONFIG_BLOCK_LEGACY_AUTOLOAD is unset. md_setup_drive() in md-autodetect.c
calls blkdev_get_by_dev(), assuming this implicitly creates the block
device.

Fix this by attempting to md_alloc() the array first. As in the probe path,
ignore any error as failure is caught by blkdev_get_by_dev() anyway.

Signed-off-by: Chris Webb <chris@arachsys.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd: Notify sysfs sync_completed in md_reap_sync_thread()
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:56 +0000 (10:27 -0600)]
md: Notify sysfs sync_completed in md_reap_sync_thread()

The mdadm test 07layouts randomly produces a kernel hung task deadlock.
The deadlock is caused by the suspend_lo/suspend_hi files being set by
the mdadm background process during reshape and not being cleared
because the process hangs. (Leaving aside the issue of the fragility of
freezing kernel tasks by buggy userspace processes...)

When the background mdadm process hangs it, is waiting (without a
timeout) on a change to the sync_completed file signalling that the
reshape has completed. The process is woken up a couple times when
the reshape finishes but it is woken up before MD_RECOVERY_RUNNING
is cleared so sync_completed_show() reports 0 instead of "none".

To fix this, notify the sysfs file in md_reap_sync_thread() after
MD_RECOVERY_RUNNING has been cleared. This wakes up mdadm and causes
it to continue and write to suspend_lo/suspend_hi to allow IO to
continue.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd: Ensure resync is reported after it starts
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:55 +0000 (10:27 -0600)]
md: Ensure resync is reported after it starts

The 07layouts test in mdadm fails on some systems. The failure
presents itself as the backup file not being removed before the next
layout is grown into:

  mdadm: /dev/md0: cannot create backup file /tmp/md-test-backup:
      File exists

This is because the background mdadm process, which is responsible for
cleaning up this backup file gets into an infinite loop waiting for
the reshape to start. mdadm checks the mdstat file if a reshape is
going and, if it is not, it waits for an event on the file or times
out in 5 seconds. On faster machines, the reshape may complete before
the 5 seconds times out, and thus the background mdadm process loops
waiting for a reshape to start that has already occurred.

mdadm reads the mdstat file to start, but mdstat does not report that the
reshape has begun, even though it has indeed begun. So the mdstat_wait()
call (in mdadm) which polls on the mdstat file won't ever return until
timing out.

The reason mdstat reports the reshape has started is due to an issue
in status_resync(). recovery_active is subtracted from curr_resync which
will result in a value of zero for the first chunk of reshaped data, and
the resulting read will report no reshape in progress.

To fix this, if "resync - recovery_active" is an overloaded value, force
the value to be MD_RESYNC_ACTIVE so the code reports a resync in progress.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd: Use enum for overloaded magic numbers used by mddev->curr_resync
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:54 +0000 (10:27 -0600)]
md: Use enum for overloaded magic numbers used by mddev->curr_resync

Comments in the code document special values used for
mddev->curr_resync. Make this clearer by using an enum to label these
values.

The only functional change is a couple places use the wrong comparison
operator that implied 3 is another special value. They are all
fixed to imply that 3 or greater is an active resync.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5-cache: Annotate pslot with __rcu notation
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:53 +0000 (10:27 -0600)]
md/raid5-cache: Annotate pslot with __rcu notation

radix_tree_lookup_slot() and radix_tree_replace_slot() API expect the
slot returned and looked up to be marked with __rcu. Otherwise
sparse warnings are generated:

  drivers/md/raid5-cache.c:2939:23: warning: incorrect type in
assignment (different address spaces)
  drivers/md/raid5-cache.c:2939:23:    expected void **pslot
  drivers/md/raid5-cache.c:2939:23:    got void [noderef] __rcu **

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5-cache: Clear conf->log after finishing work
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:52 +0000 (10:27 -0600)]
md/raid5-cache: Clear conf->log after finishing work

A NULL pointer dereferlence on conf->log is seen randomly with
the mdadm test 21raid5cache. Kasan reporst:

BUG: KASAN: null-ptr-deref in r5l_reclaimable_space+0xf5/0x140
Read of size 8 at addr 0000000000000860 by task md0_reclaim/3086

Call Trace:
  dump_stack_lvl+0x5a/0x74
  kasan_report.cold+0x5f/0x1a9
  __asan_load8+0x69/0x90
  r5l_reclaimable_space+0xf5/0x140
  r5l_do_reclaim+0xf4/0x5e0
  r5l_reclaim_thread+0x69/0x3b0
  md_thread+0x1a2/0x2c0
  kthread+0x177/0x1b0
  ret_from_fork+0x22/0x30

This is caused by conf->log being cleared in r5l_exit_log() before
stopping the reclaim thread.

To fix this, clear conf->log after the reclaim_thread is unregistered
and after flushing disable_writeback_work.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5-cache: Drop RCU usage of conf->log
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:51 +0000 (10:27 -0600)]
md/raid5-cache: Drop RCU usage of conf->log

The only place that uses RCU to access conf->log is in
r5l_log_disk_error(). This function is mostly used in the IO path
and once with mddev_lock() held in raid5_change_consistency_policy().

It is known that the IO will be suspended before the log is freed and
r5l_log_exit() is called with the mddev_lock() held.

This should mean that conf->log can not be freed while the function is
being called, so the RCU protection is not necessary. Drop the
rcu_read_lock() as well as the synchronize_rcu() and
rcu_assign_pointer() usage.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5-cache: Take mddev_lock in r5c_journal_mode_show()
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:50 +0000 (10:27 -0600)]
md/raid5-cache: Take mddev_lock in r5c_journal_mode_show()

The mddev->lock spinlock doesn't protect against the removal of
conf->log in r5l_exit_log() so conf->log may be freed before it
is used.

To fix this, take the mddev_lock() insteaad of the mddev->lock spinlock.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5: suspend the array for calls to log_exit()
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:49 +0000 (10:27 -0600)]
md/raid5: suspend the array for calls to log_exit()

The raid5-cache code relies on there being no IO in flight when
log_exit() is called. There are two places where this is not
guaranteed so add mddev_suspend() and mddev_resume() calls to these
sites.

The site in raid5_change_consistency_policy() is in the error path,
and another similar call site already has suspend/resume calls just
below it; so it should be equally safe to make that change here.

There is one remaining site in raid5_remove_disk() that we call log_exit()
without suspending the array. Unfortunately, as the comment stated, we
cannot call mddev_suspend from raid5d.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5-ppl: Drop unused argument from ppl_handle_flush_request()
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:48 +0000 (10:27 -0600)]
md/raid5-ppl: Drop unused argument from ppl_handle_flush_request()

ppl_handle_flush_request() takes an struct r5log argument but doesn't
use it. It has no buisiness taking this argument as it is only used
by raid5-cache and has no way to derference it anyway. Remove
the argument.

No functional changes intended.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agomd/raid5-log: Drop extern decorators for function prototypes
Logan Gunthorpe [Wed, 8 Jun 2022 16:27:47 +0000 (10:27 -0600)]
md/raid5-log: Drop extern decorators for function prototypes

extern is not necessary and recommended against when defining prototype
functions in headers. checkpatch.pl complains about these. So remove
them.

Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Song Liu <song@kernel.org>
3 years agoMAINTAINERS: add patchwork link to linux-raid project
Song Liu [Wed, 1 Jun 2022 23:12:37 +0000 (16:12 -0700)]
MAINTAINERS: add patchwork link to linux-raid project

Add link to patchwork:

   https://patchwork.kernel.org/project/linux-raid/list/

Signed-off-by: Song Liu <song@kernel.org>
3 years agodrbd: bm_page_async_io: fix spurious bitmap "IO error" on large volumes
Lars Ellenberg [Wed, 22 Jun 2022 20:49:32 +0000 (22:49 +0200)]
drbd: bm_page_async_io: fix spurious bitmap "IO error" on large volumes

We usually do all our bitmap IO in units of PAGE_SIZE.

With very small or oddly sized external meta data, or with
PAGE_SIZE != 4k, it can happen that our last on-disk bitmap page
is not fully PAGE_SIZE aligned, so we may need to adjust the size
of the IO.

We used to do that with
  min_t(unsigned int, PAGE_SIZE,
last_allowed_sector - current_offset);
And for just the right diff, (unsigned int)(diff) will result in 0.

A bio of length 0 will correctly be rejected with an IO error
(and some scary WARN_ON_ONCE()) by the scsi layer.

Do the calculation properly.

Signed-off-by: Lars Ellenberg <lars.ellenberg@linbit.com>
Signed-off-by: Christoph Böhmwalder <christoph.boehmwalder@linbit.com>
Link: https://lore.kernel.org/r/20220622204932.196830-1-christoph.boehmwalder@linbit.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: blk_mq_tag_busy is no need to return a value
Liu Song [Sat, 25 Jun 2022 15:15:21 +0000 (23:15 +0800)]
blk-mq: blk_mq_tag_busy is no need to return a value

Currently "blk_mq_tag_busy" return value has no effect, so adjust it.
Some code implementations have also been adjusted to enhance
readability.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1656170121-1619-1-git-send-email-liusong@linux.alibaba.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Always initialize bio IO priority on submit
Jan Kara [Thu, 23 Jun 2022 07:48:34 +0000 (09:48 +0200)]
block: Always initialize bio IO priority on submit

Currently, IO priority set in task's IO context is not reflected in the
bio->bi_ioprio for most IO (only io_uring and direct IO set it). This
results in odd results where process is submitting some bios with one
priority and other bios with a different (unset) priority and due to
differing priorities bios cannot be merged. Make sure bio->bi_ioprio is
always set on bio submission.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-9-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Initialize bio priority earlier
Jan Kara [Thu, 23 Jun 2022 07:48:33 +0000 (09:48 +0200)]
block: Initialize bio priority earlier

Bio's IO priority needs to be initialized before we try to merge the bio
with other bios. Otherwise we could merge bios which would otherwise
receive different IO priorities leading to possible QoS issues.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-8-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-ioprio: Convert from rqos policy to direct call
Jan Kara [Thu, 23 Jun 2022 07:48:32 +0000 (09:48 +0200)]
blk-ioprio: Convert from rqos policy to direct call

Convert blk-ioprio handling from a rqos policy to a direct call from
blk_mq_submit_bio(). Firstly, blk-ioprio is not much of a rqos policy
anyway, it just needs a hook in bio submission path to set the bio's IO
priority. Secondly, the rqos .track hook gets actually called too late
for blk-ioprio purposes and introducing a special rqos hook just for
blk-ioprio looks even weirder.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-ioprio: Remove unneeded field
Jan Kara [Thu, 23 Jun 2022 07:48:31 +0000 (09:48 +0200)]
blk-ioprio: Remove unneeded field

blkcg->ioprio_set field is not really useful except for avoiding
possibly more expensive checks inside blkcg_ioprio_track(). The check
for blkcg->prio_policy being equal to POLICY_NO_CHANGE does the same
service so just remove the ioprio_set field and replace the check.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-6-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Fix handling of tasks without ioprio in ioprio_get(2)
Jan Kara [Thu, 23 Jun 2022 07:48:30 +0000 (09:48 +0200)]
block: Fix handling of tasks without ioprio in ioprio_get(2)

ioprio_get(2) can be asked to return the best IO priority from several
tasks (IOPRIO_WHO_PGRP, IOPRIO_WHO_USER). Currently the call treats
tasks without set IO priority as having priority
IOPRIO_CLASS_BE/IOPRIO_BE_NORM however this does not really reflect the
IO priority the task will get (which depends on task's nice value).

Fix the code to use the real IO priority task's IO will use. We have to
modify code for ioprio_get(IOPRIO_WHO_PROCESS) to keep returning
IOPRIO_CLASS_NONE priority for tasks without set IO priority as a
special case to maintain userspace visible API.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220623074840.5960-5-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Make ioprio_best() static
Jan Kara [Thu, 23 Jun 2022 07:48:29 +0000 (09:48 +0200)]
block: Make ioprio_best() static

Nobody outside of block/ioprio.c uses it.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-4-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Generalize get_current_ioprio() for any task
Jan Kara [Thu, 23 Jun 2022 07:48:28 +0000 (09:48 +0200)]
block: Generalize get_current_ioprio() for any task

get_current_ioprio() operates only on current task. We will need the
same functionality for other tasks as well. Generalize
get_current_ioprio() for that and also move the bulk out of the header
file because it is large enough.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-3-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Return effective IO priority from get_current_ioprio()
Jan Kara [Thu, 23 Jun 2022 07:48:27 +0000 (09:48 +0200)]
block: Return effective IO priority from get_current_ioprio()

get_current_ioprio() is used to initialize IO priority of various
requests. As such it should be returning the effective IO priority of
the task (i.e., reflecting the fact that unset IO priority should get
set based on task's CPU priority) so that the conversion is concentrated
in one place.

Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-2-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: fix default IO priority handling again
Jan Kara [Thu, 23 Jun 2022 07:48:26 +0000 (09:48 +0200)]
block: fix default IO priority handling again

Commit e70344c05995 ("block: fix default IO priority handling")
introduced an inconsistency in get_current_ioprio() that tasks without
IO context return IOPRIO_DEFAULT priority while tasks with freshly
allocated IO context will return 0 (IOPRIO_CLASS_NONE/0) IO priority.
Tasks without IO context used to be rare before 5a9d041ba2f6 ("block:
move io_context creation into where it's needed") but after this commit
they became common because now only BFQ IO scheduler setups task's IO
context. Similar inconsistency is there for get_task_ioprio() so this
inconsistency is now exposed to userspace and userspace will see
different IO priority for tasks operating on devices with BFQ compared
to devices without BFQ. Furthemore the changes done by commit
e70344c05995 change the behavior when no IO priority is set for BFQ IO
scheduler which is also documented in ioprio_set(2) manpage:

"If no I/O scheduler has been set for a thread, then by default the I/O
priority will follow the CPU nice value (setpriority(2)).  In Linux
kernels before version 2.6.24, once an I/O priority had been set using
ioprio_set(), there was no way to reset the I/O scheduling behavior to
the default. Since Linux 2.6.24, specifying ioprio as 0 can be used to
reset to the default I/O scheduling behavior."

So make sure we default to IOPRIO_CLASS_NONE as used to be the case
before commit e70344c05995. Also cleanup alloc_io_context() to
explicitely set this IO priority for the allocated IO context to avoid
future surprises. Note that we tweak ioprio_best() to maintain
ioprio_get(2) behavior and make this commit easily backportable.

CC: stable@vger.kernel.org
Fixes: e70344c05995 ("block: fix default IO priority handling")
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-mq: Don't disable preemption around __blk_mq_run_hw_queue().
Sebastian Andrzej Siewior [Wed, 22 Jun 2022 08:25:54 +0000 (10:25 +0200)]
blk-mq: Don't disable preemption around __blk_mq_run_hw_queue().

__blk_mq_delay_run_hw_queue() disables preemption to get a stable
current CPU number and then invokes __blk_mq_run_hw_queue() if the CPU
number is part the mask.

__blk_mq_run_hw_queue() acquires a spin_lock_t which is a sleeping lock
on PREEMPT_RT and can't be acquired with disabled preemption.

It is not required for correctness to invoke __blk_mq_run_hw_queue() on
a CPU matching hctx->cpumask. Both (async and direct requests) can run
on a CPU not matching hctx->cpumask.

The CPU mask without disabling preemption and invoking
__blk_mq_run_hw_queue().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/YrLSEiNvagKJaDs5@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: bfq: Fix kernel-doc headers
Bart Van Assche [Fri, 17 Jun 2022 21:08:59 +0000 (14:08 -0700)]
block: bfq: Fix kernel-doc headers

Fix the following warnings:

block/bfq-cgroup.c:721: warning: Function parameter or member 'bfqg' not described in '__bfq_bic_change_cgroup'
block/bfq-cgroup.c:721: warning: Excess function parameter 'blkcg' description in '__bfq_bic_change_cgroup'
block/bfq-cgroup.c:870: warning: Function parameter or member 'ioprio_class' not described in 'bfq_reparent_leaf_entity'
block/bfq-cgroup.c:900: warning: Function parameter or member 'ioprio_class' not described in 'bfq_reparent_active_queues'

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220617210859.106623-1-bvanassche@acm.org
3 years agoblock: bfq: Remove an unused function definition
Bart Van Assche [Fri, 17 Jun 2022 20:44:33 +0000 (13:44 -0700)]
block: bfq: Remove an unused function definition

This patch is the result of the analysis of a sparse report.

Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220617204433.102022-1-bvanassche@acm.org
3 years agobfq: Remove useless code in bfq_lookup_next_entity
GuoYong Zheng [Fri, 17 Jun 2022 10:28:04 +0000 (18:28 +0800)]
bfq: Remove useless code in bfq_lookup_next_entity

It is no need to judge entity is null or not here,
directly return entity is ok, so remove it.

Signed-off-by: GuoYong Zheng <zhenggy@chinatelecom.cn>
Link: https://lore.kernel.org/r/1655461684-19075-1-git-send-email-zhenggy@chinatelecom.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: move blk_queue_get_max_sectors to blk.h
Christoph Hellwig [Tue, 14 Jun 2022 09:09:34 +0000 (11:09 +0200)]
block: move blk_queue_get_max_sectors to blk.h

blk_queue_get_max_sectors is private to the block layer, so move it out
of blkdev.h.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220614090934.570632-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: fold blk_max_size_offset into get_max_io_size
Christoph Hellwig [Tue, 14 Jun 2022 09:09:33 +0000 (11:09 +0200)]
block: fold blk_max_size_offset into get_max_io_size

Now that blk_max_size_offset has a single caller left, fold it into that
and clean up the naming convention for the local variables there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220614090934.570632-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: cleanup variable naming in get_max_io_size
Christoph Hellwig [Tue, 14 Jun 2022 09:09:32 +0000 (11:09 +0200)]
block: cleanup variable naming in get_max_io_size

get_max_io_size has a very odd choice of variables names and
initialization patterns.  Switch to more descriptive names and more
clear initialization of them.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220614090934.570632-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: open code blk_max_size_offset in blk_rq_get_max_sectors
Christoph Hellwig [Tue, 14 Jun 2022 09:09:31 +0000 (11:09 +0200)]
block: open code blk_max_size_offset in blk_rq_get_max_sectors

blk_rq_get_max_sectors always uses q->limits.chunk_sectors as the
chunk_sectors argument, and already checks for max_sectors through the
call to blk_queue_get_max_sectors.  That means much of
blk_max_size_offset is not needed and open coding it simplifies the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220614090934.570632-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agodm: open code blk_max_size_offset in max_io_len
Christoph Hellwig [Tue, 14 Jun 2022 09:09:30 +0000 (11:09 +0200)]
dm: open code blk_max_size_offset in max_io_len

max_io_len always passes an explicitly non-zero chunk_sectors into
blk_max_size_offset.  That means much of blk_max_size_offset is not
needed and can be open coded to simplify the code.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Link: https://lore.kernel.org/r/20220614090934.570632-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: factor out a chunk_size_left helper
Christoph Hellwig [Tue, 14 Jun 2022 09:09:29 +0000 (11:09 +0200)]
block: factor out a chunk_size_left helper

Factor out a helper from blk_max_size_offset so that it can be reused
independently.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Link: https://lore.kernel.org/r/20220614090934.570632-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Make blk_mq_get_sq_hctx() select the proper hardware queue type
Bart Van Assche [Wed, 15 Jun 2022 22:55:49 +0000 (15:55 -0700)]
block: Make blk_mq_get_sq_hctx() select the proper hardware queue type

Since the introduction of blk_mq_get_hctx_type() the operation type in
the second argument of blk_mq_get_hctx_type() matters. The introduction
of blk_mq_get_hctx_type() caused blk_mq_get_sq_hctx() to select a
hardware queue of type HCTX_TYPE_READ instead of HCTX_TYPE_DEFAULT.
Switch to hardware queue type HCTX_TYPE_DEFAULT since HCTX_TYPE_READ
should only be used for read requests.

Cc: Ming Lei <ming.lei@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220615225549.1054905-4-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Rename a blk_mq_map_queue() argument
Bart Van Assche [Wed, 15 Jun 2022 22:55:48 +0000 (15:55 -0700)]
block: Rename a blk_mq_map_queue() argument

Before the introduction of blk_mq_get_hctx_type(), blk_mq_map_queue()
only used the flags from its second argument. Since the introduction of
blk_mq_get_hctx_type(), blk_mq_map_queue() uses both the operation and
the flags encoded in that argument. Rename the second argument of
blk_mq_map_queue() to make this clear.

Cc: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220615225549.1054905-3-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblk-iocost: Simplify ioc_rqos_done()
Bart Van Assche [Wed, 15 Jun 2022 22:55:47 +0000 (15:55 -0700)]
blk-iocost: Simplify ioc_rqos_done()

Leave out the superfluous "& REQ_OP_MASK" code. The definition of req_op()
shows that that code is superfluous:

 #define req_op(req) ((req)->cmd_flags & REQ_OP_MASK)

Compile-tested only.

Cc: Tejun Heo <tj@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20220615225549.1054905-2-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: Directly use ida_alloc()/free()
Bo Liu [Wed, 15 Jun 2022 08:18:16 +0000 (04:18 -0400)]
block: Directly use ida_alloc()/free()

Use ida_alloc()/ida_free() instead of
ida_simple_get()/ida_simple_remove().
The latter is deprecated and more verbose.

Signed-off-by: Bo Liu <liubo03@inspur.com>
Reviewed-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Link: https://lore.kernel.org/r/20220615081816.4342-1-liubo03@inspur.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoiomap: add support for dma aligned direct-io
Keith Busch [Fri, 10 Jun 2022 19:58:30 +0000 (12:58 -0700)]
iomap: add support for dma aligned direct-io

Use the address alignment requirements from the block_device for direct
io instead of requiring addresses be aligned to the block size.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-12-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: relax direct io memory alignment block-5.20-al
Keith Busch [Fri, 10 Jun 2022 19:58:29 +0000 (12:58 -0700)]
block: relax direct io memory alignment

Use the address alignment requirements from the block_device for direct
io instead of requiring addresses be aligned to the block size. User
space can discover the alignment requirements from the dma_alignment
queue attribute.

User space can specify any hardware compatible DMA offset for each
segment, but every segment length is still required to be a multiple of
the block size.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-11-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: introduce bdev_iter_is_aligned helper
Keith Busch [Fri, 10 Jun 2022 19:58:28 +0000 (12:58 -0700)]
block: introduce bdev_iter_is_aligned helper

Provide a convenient function for this repeatable coding pattern.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-10-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoiov: introduce iov_iter_aligned
Keith Busch [Fri, 10 Jun 2022 19:58:27 +0000 (12:58 -0700)]
iov: introduce iov_iter_aligned

The existing iov_iter_alignment() function returns the logical OR of
address and length. For cases where address and length need to be
considered separately, introduce a helper function that a caller can
specificy length and address masks that indicate if the iov is
unaligned.

Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-9-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/bounce: count bytes instead of sectors
Keith Busch [Fri, 10 Jun 2022 19:58:26 +0000 (12:58 -0700)]
block/bounce: count bytes instead of sectors

Individual bv_len's may not be a sector size.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Pankaj Raghav <p.raghav@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-8-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/merge: count bytes instead of sectors
Keith Busch [Fri, 10 Jun 2022 19:58:25 +0000 (12:58 -0700)]
block/merge: count bytes instead of sectors

Individual bv_len's may not be a sector size.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220610195830.3574005-7-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: add a helper function for dio alignment
Keith Busch [Fri, 10 Jun 2022 19:58:24 +0000 (12:58 -0700)]
block: add a helper function for dio alignment

This will make it easier to add more complex acceptable alignment
criteria in the future.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-6-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: introduce bdev_dma_alignment helper
Keith Busch [Fri, 10 Jun 2022 19:58:23 +0000 (12:58 -0700)]
block: introduce bdev_dma_alignment helper

Preparing for upcoming dma_alignment users that have a block_device, but
don't need the request_queue.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-5-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: export dma_alignment attribute
Keith Busch [Fri, 10 Jun 2022 19:58:22 +0000 (12:58 -0700)]
block: export dma_alignment attribute

User space may want to know how to align their buffers to avoid
bouncing. Export the queue attribute.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-4-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock/bio: remove duplicate append pages code
Keith Busch [Fri, 10 Jun 2022 19:58:21 +0000 (12:58 -0700)]
block/bio: remove duplicate append pages code

The getting pages setup for zone append and normal IO are identical. Use
common code for each.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220610195830.3574005-3-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoblock: fix infinite loop for invalid zone append
Keith Busch [Fri, 10 Jun 2022 19:58:20 +0000 (12:58 -0700)]
block: fix infinite loop for invalid zone append

Returning 0 early from __bio_iov_append_get_pages() for the
max_append_sectors warning just creates an infinite loop since 0 means
success, and the bio will never fill from the unadvancing iov_iter. We
could turn the return into an error value, but it will already be turned
into an error value later on, so just remove the warning. Clearly no one
ever hit it anyway.

Fixes: 0512a75b98f84 ("block: Introduce REQ_OP_ZONE_APPEND")
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20220610195830.3574005-2-kbusch@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 years agoLinux 5.19-rc4 v5.19-rc4
Linus Torvalds [Sun, 26 Jun 2022 21:22:10 +0000 (14:22 -0700)]
Linux 5.19-rc4

3 years agoMerge tag 'soc-fixes-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Linus Torvalds [Sun, 26 Jun 2022 21:12:56 +0000 (14:12 -0700)]
Merge tag 'soc-fixes-5.19' of git://git./linux/kernel/git/soc/soc

Pull ARM SoC fixes from Arnd Bergmann:
 "A number of fixes have accumulated, but they are largely for harmless
  issues:

   - Several OF node leak fixes

   - A fix to the Exynos7885 UART clock description

   - DTS fixes to prevent boot failures on TI AM64 and J721s2

   - Bus probe error handling fixes for Baikal-T1

   - A fixup to the way STM32 SoCs use separate dts files for different
     firmware stacks

   - Multiple code fixes for Arm SCMI firmware, all dealing with
     robustness of the implementation

   - Multiple NXP i.MX devicetree fixes, addressing incorrect data in DT
     nodes

   - Three updates to the MAINTAINERS file, including Florian Fainelli
     taking over BCM283x/BCM2711 (Raspberry Pi) from Nicolas Saenz
     Julienne"

* tag 'soc-fixes-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc: (29 commits)
  ARM: dts: aspeed: nuvia: rename vendor nuvia to qcom
  arm: mach-spear: Add missing of_node_put() in time.c
  ARM: cns3xxx: Fix refcount leak in cns3xxx_init
  MAINTAINERS: Update email address
  arm64: dts: ti: k3-am64-main: Remove support for HS400 speed mode
  arm64: dts: ti: k3-j721s2: Fix overlapping GICD memory region
  ARM: dts: bcm2711-rpi-400: Fix GPIO line names
  bus: bt1-axi: Don't print error on -EPROBE_DEFER
  bus: bt1-apb: Don't print error on -EPROBE_DEFER
  ARM: Fix refcount leak in axxia_boot_secondary
  ARM: dts: stm32: move SCMI related nodes in a dedicated file for stm32mp15
  soc: imx: imx8m-blk-ctrl: fix display clock for LCDIF2 power domain
  ARM: dts: imx6qdl-colibri: Fix capacitive touch reset polarity
  ARM: dts: imx6qdl: correct PU regulator ramp delay
  firmware: arm_scmi: Fix incorrect error propagation in scmi_voltage_descriptors_get
  firmware: arm_scmi: Avoid using extended string-buffers sizes if not necessary
  firmware: arm_scmi: Fix SENSOR_AXIS_NAME_GET behaviour when unsupported
  ARM: dts: imx7: Move hsic_phy power domain to HSIC PHY node
  soc: bcm: brcmstb: pm: pm-arm: Fix refcount leak in brcmstb_pm_probe
  MAINTAINERS: Update BCM2711/BCM2835 maintainer
  ...

3 years agoMerge tag 'mm-hotfixes-stable-2022-06-26' of git://git.kernel.org/pub/scm/linux/kerne...
Linus Torvalds [Sun, 26 Jun 2022 21:00:55 +0000 (14:00 -0700)]
Merge tag 'mm-hotfixes-stable-2022-06-26' of git://git./linux/kernel/git/akpm/mm

Pull hotfixes from Andrew Morton:
 "Minor things, mainly - mailmap updates, MAINTAINERS updates, etc.

  Fixes for this merge window:

   - fix for a damon boot hang, from SeongJae

   - fix for a kfence warning splat, from Jason Donenfeld

   - fix for zero-pfn pinning, from Alex Williamson

   - fix for fallocate hole punch clearing, from Mike Kravetz

  Fixes for previous releases:

   - fix for a performance regression, from Marcelo

   - fix for a hwpoisining BUG from zhenwei pi"

* tag 'mm-hotfixes-stable-2022-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  mailmap: add entry for Christian Marangi
  mm/memory-failure: disable unpoison once hw error happens
  hugetlbfs: zero partial pages during fallocate hole punch
  mm: memcontrol: reference to tools/cgroup/memcg_slabinfo.py
  mm: re-allow pinning of zero pfns
  mm/kfence: select random number before taking raw lock
  MAINTAINERS: add maillist information for LoongArch
  MAINTAINERS: update MM tree references
  MAINTAINERS: update Abel Vesa's email
  MAINTAINERS: add MEMORY HOT(UN)PLUG section and add David as reviewer
  MAINTAINERS: add Miaohe Lin as a memory-failure reviewer
  mailmap: add alias for jarkko@profian.com
  mm/damon/reclaim: schedule 'damon_reclaim_timer' only after 'system_wq' is initialized
  kthread: make it clear that kthread_create_on_node() might be terminated by any fatal signal
  mm: lru_cache_disable: use synchronize_rcu_expedited
  mm/page_isolation.c: fix one kernel-doc comment

3 years agoMerge tag 'perf-tools-fixes-for-v5.19-2022-06-26' of git://git.kernel.org/pub/scm...
Linus Torvalds [Sun, 26 Jun 2022 19:12:25 +0000 (12:12 -0700)]
Merge tag 'perf-tools-fixes-for-v5.19-2022-06-26' of git://git./linux/kernel/git/acme/linux

Pull perf tools fixes from Arnaldo Carvalho de Melo:

 - Enable ignore_missing_thread in 'perf stat', enabling counting with
   '--pid' when threads disappear during counting session setup

 - Adjust output data offset for backward compatibility in 'perf inject'

 - Fix missing free in copy_kcore_dir() in 'perf inject'

 - Fix caching files with a wrong build ID

 - Sync drm, cpufeatures, vhost and svn headers with the kernel

* tag 'perf-tools-fixes-for-v5.19-2022-06-26' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
  tools headers UAPI: Synch KVM's svm.h header with the kernel
  tools include UAPI: Sync linux/vhost.h with the kernel sources
  perf stat: Enable ignore_missing_thread
  perf inject: Adjust output data offset for backward compatibility
  perf trace beauty: Fix generation of errno id->str table on ALT Linux
  perf build-id: Fix caching files with a wrong build ID
  tools headers cpufeatures: Sync with the kernel sources
  tools headers UAPI: Sync drm/i915_drm.h with the kernel sources
  perf inject: Fix missing free in copy_kcore_dir()

3 years agoMerge tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave...
Linus Torvalds [Sun, 26 Jun 2022 17:11:36 +0000 (10:11 -0700)]
Merge tag 'for-5.19-rc3-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - zoned relocation fixes:
      - fix critical section end for extent writeback, this could lead
        to out of order write
      - prevent writing to previous data relocation block group if space
        gets low

 - reflink fixes:
      - fix race between reflinking and ordered extent completion
      - proper error handling when block reserve migration fails
      - add missing inode iversion/mtime/ctime updates on each iteration
        when replacing extents

 - fix deadlock when running fsync/fiemap/commit at the same time

 - fix false-positive KCSAN report regarding pid tracking for read locks
   and data race

 - minor documentation update and link to new site

* tag 'for-5.19-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  Documentation: update btrfs list of features and link to readthedocs.io
  btrfs: fix deadlock with fsync+fiemap+transaction commit
  btrfs: don't set lock_owner when locking extent buffer for reading
  btrfs: zoned: fix critical section of relocation inode writeback
  btrfs: zoned: prevent allocation from previous data relocation BG
  btrfs: do not BUG_ON() on failure to migrate space when replacing extents
  btrfs: add missing inode updates on each iteration when replacing extents
  btrfs: fix race between reflinking and ordered extent completion

3 years agoMerge tag 'dma-mapping-5.19-2022-06-26' of git://git.infradead.org/users/hch/dma...
Linus Torvalds [Sun, 26 Jun 2022 17:01:40 +0000 (10:01 -0700)]
Merge tag 'dma-mapping-5.19-2022-06-26' of git://git.infradead.org/users/hch/dma-mapping

Pull dma-mapping fix from Christoph Hellwig:

 - pass the correct size to dma_set_encrypted() when freeing memory
   (Dexuan Cui)

* tag 'dma-mapping-5.19-2022-06-26' of git://git.infradead.org/users/hch/dma-mapping:
  dma-direct: use the correct size for dma_set_encrypted()

3 years agoMerge tag 'for-5.19/fbdev-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller...
Linus Torvalds [Sun, 26 Jun 2022 16:13:51 +0000 (09:13 -0700)]
Merge tag 'for-5.19/fbdev-2' of git://git./linux/kernel/git/deller/linux-fbdev

Pull fbdev fixes from Helge Deller:
 "Two bug fixes for the pxa3xx and intelfb drivers:

   - pxa3xx-gcu: Fix integer overflow in pxa3xx_gcu_write

   - intelfb: Initialize value of stolen size

  The other changes are small cleanups, simplifications and
  documentation updates to the cirrusfb, skeletonfb, omapfb,
  intelfb, au1100fb and simplefb drivers"

* tag 'for-5.19/fbdev-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/linux-fbdev:
  video: fbdev: omap: Remove duplicate 'the' in comment
  video: fbdev: omapfb: Align '*' in comment
  video: fbdev: simplefb: Check before clk_put() not needed
  video: fbdev: au1100fb: Drop unnecessary NULL ptr check
  video: fbdev: pxa3xx-gcu: Fix integer overflow in pxa3xx_gcu_write
  video: fbdev: skeletonfb: Convert to generic power management
  video: fbdev: cirrusfb: Remove useless reference to PCI power management
  video: fbdev: intelfb: Initialize value of stolen size
  video: fbdev: intelfb: Use aperture size from pci_resource_len
  video: fbdev: skeletonfb: Fix syntax errors in comments

3 years agoMerge tag 'for-5.19/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller...
Linus Torvalds [Sun, 26 Jun 2022 16:08:30 +0000 (09:08 -0700)]
Merge tag 'for-5.19/parisc-3' of git://git./linux/kernel/git/deller/parisc-linux

Pull parisc architecture fixes from Helge Deller:

 - enable ARCH_HAS_STRICT_MODULE_RWX to prevent a boot crash on c8000
   machines

 - flush all mappings of a shared anonymous page on PA8800/8900 machines
   via flushing the whole data cache. This may slow down such machines
   but makes sure that the cache is consistent

 - Fix duplicate definition build error regarding fb_is_primary_device()

* tag 'for-5.19/parisc-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
  parisc: Enable ARCH_HAS_STRICT_MODULE_RWX
  parisc: Fix flush_anon_page on PA8800/PA8900
  parisc: align '*' in comment in math-emu code
  parisc/stifb: Fix fb_is_primary_device() only available with CONFIG_FB_STI

3 years agoMerge tag 'xtensa-20220626' of https://github.com/jcmvbkbc/linux-xtensa
Linus Torvalds [Sun, 26 Jun 2022 15:59:21 +0000 (08:59 -0700)]
Merge tag 'xtensa-20220626' of https://github.com/jcmvbkbc/linux-xtensa

Pull xtensa fixes from Max Filippov:

 - fix OF reference leaks in xtensa arch code

 - replace '.bss' with '.section .bss' to fix entry.S build with old
   assembler

* tag 'xtensa-20220626' of https://github.com/jcmvbkbc/linux-xtensa:
  xtensa: change '.bss' to '.section .bss'
  xtensa: xtfpga: Fix refcount leak bug in setup
  xtensa: Fix refcount leak bug in time.c

3 years agoMerge tag 'powerpc-5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc...
Linus Torvalds [Sun, 26 Jun 2022 15:53:20 +0000 (08:53 -0700)]
Merge tag 'powerpc-5.19-3' of git://git./linux/kernel/git/powerpc/linux

Pull powerpc fixes from Michael Ellerman:

 - A fix for a CMA change that broke booting guests with > 2G RAM on
   Power8 hosts.

 - Fix the RTAS call filter to allow a special case that applications
   rely on.

 - A change to our execve path, to make the execve syscall exit
   tracepoint work.

 - Three fixes to wire up our various RNGs earlier in boot so they're
   available for use in the initial seeding in random_init().

 - A build fix for when KASAN is enabled along with
   STRUCTLEAK_BYREF_ALL.

Thanks to Andrew Donnellan, Aneesh Kumar K.V, Christophe Leroy, Jason
Donenfeld, Nathan Lynch, Naveen N. Rao, Sathvika Vasireddy, Sumit
Dubey2, Tyrel Datwyler, and Zi Yan.

* tag 'powerpc-5.19-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/powernv: wire up rng during setup_arch
  powerpc/prom_init: Fix build failure with GCC_PLUGIN_STRUCTLEAK_BYREF_ALL and KASAN
  powerpc/rtas: Allow ibm,platform-dump RTAS call with null buffer address
  powerpc: Enable execve syscall exit tracepoint
  powerpc/pseries: wire up rng during setup_arch()
  powerpc/microwatt: wire up rng during setup_arch()
  powerpc/mm: Move CMA reservations after initmem_init()