linux-2.6-block.git
4 days ago md/raid10: convert read/write to use bio_submit_split_bioset()
Yu Kuai [Wed, 10 Sep 2025 06:30:50 +0000 (14:30 +0800)]
md/raid10: convert read/write to use bio_submit_split_bioset()

Unify the bio split code and prepare to fix the ordering of split IO. The
error path is modified a bit, however no functional changes are intended:

- bio_submit_split_bioset() can fail the original bio directly
  on a split error; set R10BIO_Returned in this case to notify
  raid_end_bio_io() that the original bio is returned already.
- setting R10BIO_Uptodate and setting the error value to -EIO is useless
  now: for an r10_bio without R10BIO_Uptodate, -EIO will be returned for
  the original bio.

Discard is not handled, because discard is only split for an unaligned
head and tail; this can be considered a slow path, so the reordering
there does not matter much.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 days ago md/raid10: add a new r10bio flag R10BIO_Returned
Yu Kuai [Wed, 10 Sep 2025 06:30:49 +0000 (14:30 +0800)]
md/raid10: add a new r10bio flag R10BIO_Returned

The new helper bio_submit_split_bioset() can fail the original bio on
split errors; prepare to handle this case in raid_end_bio_io().

The flag name follows the r1bio flag name.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 days ago md/raid1: convert to use bio_submit_split_bioset()
Yu Kuai [Wed, 10 Sep 2025 06:30:48 +0000 (14:30 +0800)]
md/raid1: convert to use bio_submit_split_bioset()

Unify the bio split code, and prepare to fix the ordering of split IO.

Note that bio_submit_split_bioset() can fail the original bio directly
on a split error; set R1BIO_Returned in this case to notify
raid_end_bio_io() that the original bio is returned already.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
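
The interaction described above, a split helper that may complete the
original bio itself and a flag that stops the end-io path from completing
it a second time, can be sketched in plain userspace C. This is a toy model
under stated assumptions, not the kernel code: struct toy_bio,
toy_submit_split(), toy_raid_end_bio_io() and TOY_RETURNED are all
hypothetical stand-ins for the real bio, bio_submit_split_bioset(),
raid_end_bio_io() and R1BIO_Returned.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for kernel objects; every name here is hypothetical. */
#define TOY_RETURNED (1UL << 0)
#define TOY_EIO 5

struct toy_bio {
    unsigned int sectors;   /* length of the IO */
    int status;             /* 0 on success, negative errno-style value */
    int completions;        /* how many times the bio was completed */
};

struct toy_r1bio {
    struct toy_bio *master_bio;
    unsigned long state;    /* bit flags such as TOY_RETURNED */
};

void toy_bio_endio(struct toy_bio *bio, int status)
{
    bio->status = status;
    bio->completions++;
}

/*
 * Model of a split helper that completes the original bio on its own when
 * the split fails. Returns true when the caller may go on with the IO.
 */
bool toy_submit_split(struct toy_bio *bio, unsigned int max_sectors,
                      bool inject_error)
{
    assert(max_sectors > 0);
    if (bio->sectors <= max_sectors)
        return true;            /* nothing to split */
    if (inject_error) {
        toy_bio_endio(bio, -TOY_EIO);
        return false;           /* original bio is already returned */
    }
    bio->sectors = max_sectors; /* the remainder would be resubmitted */
    return true;
}

/* Completes the original bio unless the split helper already did so. */
void toy_raid_end_bio_io(struct toy_r1bio *r1)
{
    if (r1->state & TOY_RETURNED)
        return;
    toy_bio_endio(r1->master_bio, 0);
}

void toy_make_request(struct toy_r1bio *r1, unsigned int max_sectors,
                      bool inject_error)
{
    if (!toy_submit_split(r1->master_bio, max_sectors, inject_error))
        r1->state |= TOY_RETURNED;
    toy_raid_end_bio_io(r1);
}
```

Without the flag, the failing path in this model would complete the same
bio twice, once in the split helper and once in the end-io function.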
4 days ago md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset()
Yu Kuai [Wed, 10 Sep 2025 06:30:47 +0000 (14:30 +0800)]
md/raid0: convert raid0_handle_discard() to use bio_submit_split_bioset()

Unify the bio split code, and prepare to fix the ordering of split IO.

Note that commit 319ff40a5427 ("md/raid0: Fix performance regression for
large sequential writes") already fixes the ordering of split IO by
remapping the bio to the underlying disks before resubmitting it, given
that md_submit_bio() already splits it by sectors and raid0_make_request()
will split at most once for unaligned IO. This is a bit hacky and we'll
convert this to a general solution later.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 days ago block: factor out a helper bio_submit_split_bioset()
Yu Kuai [Wed, 10 Sep 2025 06:30:46 +0000 (14:30 +0800)]
block: factor out a helper bio_submit_split_bioset()

No functional changes are intended. Some drivers like mdraid split bios
during internal processing; prepare to unify the bio split code.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 days ago blk-crypto: fix missing blktrace bio split events
Yu Kuai [Wed, 10 Sep 2025 06:30:45 +0000 (14:30 +0800)]
blk-crypto: fix missing blktrace bio split events

trace_block_split() is missing, so blktrace is unable to catch BIO split
events, which makes it harder to analyze the BIO sequence.

Cc: stable@vger.kernel.org
Fixes: 488f6682c832 ("block: blk-crypto-fallback for Inline Encryption")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 days ago md: fix missing blktrace bio split events
Yu Kuai [Wed, 10 Sep 2025 06:30:44 +0000 (14:30 +0800)]
md: fix missing blktrace bio split events

If a bio is split by internal handling like chunksize or badblocks, the
corresponding trace_block_split() is missing, so blktrace is unable to
catch BIO split events, which makes it harder to analyze the BIO
sequence.

Cc: stable@vger.kernel.org
Fixes: 4b1faf931650 ("block: Kill bio_pair_split()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 days ago blk-mq: add QUEUE_FLAG_BIO_ISSUE_TIME
Yu Kuai [Wed, 10 Sep 2025 06:30:43 +0000 (14:30 +0800)]
blk-mq: add QUEUE_FLAG_BIO_ISSUE_TIME

bio->issue_time_ns is initialized for every bio, however, it's only used
by blk-iolatency. Add a new queue_flag and only set this flag when
blk-iolatency is enabled, so that the extra blk_time_get_ns() call can be
saved for disks on which blk-iolatency is not enabled.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
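
The saving described above can be illustrated with a small userspace
model. It is only a sketch: toy_queue, toy_bio, toy_time_get_ns() and
TOY_QUEUE_FLAG_BIO_ISSUE_TIME are hypothetical names, and the fake clock
merely counts how often it is read so that the effect of the flag is
visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bit; the real flag value is not taken from the kernel. */
#define TOY_QUEUE_FLAG_BIO_ISSUE_TIME (1UL << 0)

struct toy_queue {
    unsigned long flags;
};

struct toy_bio {
    uint64_t issue_time_ns;
};

/* Counts clock reads so the cost of timestamping can be observed. */
unsigned int toy_clock_reads;

/* Fake clock: every read advances time by 1000 ns. */
uint64_t toy_time_get_ns(void)
{
    toy_clock_reads++;
    return (uint64_t)toy_clock_reads * 1000u;
}

/* Only pay for the clock read when the queue asked for issue times. */
void toy_submit_bio(struct toy_queue *q, struct toy_bio *bio)
{
    assert(q != NULL && bio != NULL);
    if (q->flags & TOY_QUEUE_FLAG_BIO_ISSUE_TIME)
        bio->issue_time_ns = toy_time_get_ns();
}
```

In this model a queue without the flag never touches the clock, which is
the effect the commit message describes for disks without blk-iolatency.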
4 days ago block: initialize bio issue time in blk_mq_submit_bio()
Yu Kuai [Wed, 10 Sep 2025 06:30:42 +0000 (14:30 +0800)]
block: initialize bio issue time in blk_mq_submit_bio()

bio->issue_time_ns is only used by blk-iolatency, which can only be
enabled for rq-based disks, hence it's not necessary to initialize
the time for bio-based disks.

Meanwhile, if a bio is split by blk_crypto_fallback_split_bio_if_needed(),
the issue time is not initialized for the new split bio; this is fixed
as well.

Note that the next patch will optimize this further, so that the bio issue
time is only set when blk-iolatency is really enabled for the disk.

Fixes: 488f6682c832 ("block: blk-crypto-fallback for Inline Encryption")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 days ago block: cleanup bio_issue
Yu Kuai [Wed, 10 Sep 2025 06:30:41 +0000 (14:30 +0800)]
block: cleanup bio_issue

Now that bio->bi_issue is only used by blk-iolatency to get bio issue
time, replace bio_issue with u64 time directly and remove bio_issue to
make code cleaner.

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago Merge tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid...
Jens Axboe [Tue, 9 Sep 2025 17:22:20 +0000 (11:22 -0600)]
Merge tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux into for-6.18/block

Pull MD changes from Yu Kuai:

"Redundant data is used to enhance data fault tolerance, and the storage
 method for redundant data varies depending on the RAID level. It's
 important to maintain the consistency of redundant data.

 Bitmap is used to record which data blocks have been synchronized and
 which ones need to be resynchronized or recovered. Each bit in the
 bitmap represents a segment of data in the array. When a bit is set,
 it indicates that the multiple redundant copies of that data segment
 may not be consistent. Data synchronization can be performed based on
 the bitmap after power failure or re-adding a disk. If there is no
 bitmap, a full disk synchronization is required.

 Due to known performance issues with md-bitmap and the unreasonable
 implementations:

 - self-managed IO submitting like filemap_write_page();
 - global spin_lock

 I have decided not to continue optimizing based on the current bitmap
 implementation. This new bitmap is designed without locking in the IO
 fast path and can be used with fast disks.

 Key features for the new bitmap:
  - IO fastpath is lockless, if user issues lots of write IO to the same
    bitmap bit in a short time, only the first write has additional
    overhead to update bitmap bit, no additional overhead for the
    following writes;
  - support resyncing or recovering only written data, which means that
    when creating a new array or replacing a disk with a new one, there
    is no need to do a full disk resync/recovery;"

* tag 'md-6.18-20250909' of gitolite.kernel.org:pub/scm/linux/kernel/git/mdraid/linux: (24 commits)
  md/md-llbitmap: introduce new lockless bitmap
  md/md-bitmap: make method bitmap_ops->daemon_work optional
  md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
  md/md-bitmap: add a new method blocks_synced() in bitmap_operations
  md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
  md/md-bitmap: delay registration of bitmap_ops until creating bitmap
  md/md-bitmap: add a new sysfs api bitmap_type
  md: add a new mddev field 'bitmap_id'
  md/md-bitmap: support discard for bitmap ops
  md: factor out a helper raid_is_456()
  md: add a new parameter 'offset' to md_super_write()
  md/md-bitmap: introduce CONFIG_MD_BITMAP
  md: check before referencing mddev->bitmap_ops
  md/dm-raid: check before referencing mddev->bitmap_ops
  md/raid5: check before referencing mddev->bitmap_ops
  md/raid10: check before referencing mddev->bitmap_ops
  md/raid1: check before referencing mddev->bitmap_ops
  md/raid1: check bitmap before behind write
  md/md-bitmap: handle the case bitmap is not enabled before end_sync()
  md/md-bitmap: handle the case bitmap is not enabled before start_sync()
  ...

5 days ago blk-map: provide the bdev to bio if one exists
Keith Busch [Wed, 3 Sep 2025 20:27:46 +0000 (13:27 -0700)]
blk-map: provide the bdev to bio if one exists

We can now safely provide a block device when extracting user pages for
driver and user passthrough commands. Set the bdev so the caller doesn't
have to do that later. This has an additional benefit of being able to
extract P2P pages in the passthrough path.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago blk-mq-dma: bring back p2p request flags
Keith Busch [Wed, 3 Sep 2025 19:33:17 +0000 (12:33 -0700)]
blk-mq-dma: bring back p2p request flags

We only need to consider data and metadata dma mapping types separately.
The request and bio integrity payload have enough flag bits to
internally track the mapping type for each. Use these so the caller
doesn't need to track them, and provide separate request and integrity
helpers to the common code. This will make it easier to scale new
mappings, like the proposed MMIO attribute, without burdening the caller
to track such things.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago blk-integrity: enable p2p source and destination
Keith Busch [Wed, 3 Sep 2025 19:33:16 +0000 (12:33 -0700)]
blk-integrity: enable p2p source and destination

Set the extraction flags to allow p2p pages for the metadata buffer if
the block device allows it. Similar to data payloads, ensure the bio
does not use merging if we see a p2p page.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago iov_iter: remove iov_iter_is_aligned
Keith Busch [Wed, 27 Aug 2025 14:12:58 +0000 (07:12 -0700)]
iov_iter: remove iov_iter_is_aligned

No more callers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Mike Snitzer <snitzer@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago blk-integrity: use simpler alignment check
Keith Busch [Wed, 27 Aug 2025 14:12:57 +0000 (07:12 -0700)]
blk-integrity: use simpler alignment check

We're checking length and addresses against the same alignment value, so
use the simpler iterator check.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago block: remove bdev_iter_is_aligned
Keith Busch [Wed, 27 Aug 2025 14:12:56 +0000 (07:12 -0700)]
block: remove bdev_iter_is_aligned

No more callers.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago iomap: simplify direct io validity check
Keith Busch [Wed, 27 Aug 2025 14:12:55 +0000 (07:12 -0700)]
iomap: simplify direct io validity check

The block layer checks all the segments for validity later, so no need
for an early check. Just reduce it to a simple position and total length
check, and defer the more invasive segment checks to the block layer.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago block: simplify direct io validity check
Keith Busch [Wed, 27 Aug 2025 14:12:54 +0000 (07:12 -0700)]
block: simplify direct io validity check

The block layer checks all the segments for validity later, so no need
for an early check. Just reduce it to a simple position and total length
check, and defer the more invasive segment checks to the block layer.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago block: align the bio after building it
Keith Busch [Wed, 27 Aug 2025 14:12:53 +0000 (07:12 -0700)]
block: align the bio after building it

Instead of ensuring each vector is block size aligned while constructing
the bio, just ensure the entire size is aligned after it's built. This
makes getting bio pages more flexible, accepting device-valid io vectors
that would otherwise get rejected by alignment checks.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
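
The difference between checking every vector and checking only the total
size can be shown with a short userspace sketch. It is not the block layer
code: struct toy_vec and both helper functions are hypothetical, and the
sketch only compares the two acceptance rules without building a real bio.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one io vector; only the length matters here. */
struct toy_vec {
    size_t len;
};

bool toy_is_pow2(size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Per-vector rule: every vector must be a multiple of the block size. */
bool toy_each_vec_aligned(const struct toy_vec *vecs, size_t nr,
                          size_t block_size)
{
    assert(toy_is_pow2(block_size));
    for (size_t i = 0; i < nr; i++)
        if (vecs[i].len & (block_size - 1))
            return false;
    return true;
}

/* Whole-bio rule: only the summed length must be a multiple of it. */
bool toy_total_aligned(const struct toy_vec *vecs, size_t nr,
                       size_t block_size)
{
    size_t total = 0;

    assert(toy_is_pow2(block_size));
    for (size_t i = 0; i < nr; i++)
        total += vecs[i].len;
    return (total & (block_size - 1)) == 0;
}
```

For example, two vectors of 100 and 412 bytes fail the per-vector rule for
a 512-byte block size but pass the whole-bio rule, because together they
cover exactly one block.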
5 days ago block: add size alignment to bio_iov_iter_get_pages
Keith Busch [Wed, 27 Aug 2025 14:12:52 +0000 (07:12 -0700)]
block: add size alignment to bio_iov_iter_get_pages

The block layer tries to align bio vectors to the block device's logical
block size. Some cases don't have a block device, or we may need to
align to something larger, which we can't derive from the queue
limits. Have the caller specify what they want, or allow any length
alignment if nothing was specified. Since the most common use case
relies on the block device's limits, a helper function is provided.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago block: check for valid bio while splitting
Keith Busch [Wed, 27 Aug 2025 14:12:51 +0000 (07:12 -0700)]
block: check for valid bio while splitting

We're already iterating every segment, so check these for valid IO
lengths at the same time. Individual segment lengths will not be checked
on passthrough commands. The read/write command segments must be sized
to the dma alignment.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago drivers/block: WQ_PERCPU added to alloc_workqueue users
Marco Crivellari [Fri, 5 Sep 2025 08:51:41 +0000 (10:51 +0200)]
drivers/block: WQ_PERCPU added to alloc_workqueue users

Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again
makes use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

alloc_workqueue() treats all queues as per-CPU by default, while unbound
workqueues must opt-in via WQ_UNBOUND.

This default is suboptimal: most workloads benefit from unbound queues,
allowing the scheduler to place worker threads where they’re needed and
reducing noise when CPUs are isolated.

This patch adds a new WQ_PERCPU flag to explicitly request the use of
the per-CPU behavior. Both flags coexist for one release cycle to allow
callers to transition their calls.

Once migration is complete, WQ_UNBOUND can be removed and unbound will
become the implicit default.

With the introduction of the WQ_PERCPU flag (equivalent to !WQ_UNBOUND),
any alloc_workqueue() caller that doesn’t explicitly specify WQ_UNBOUND
must now use WQ_PERCPU.

All existing users have been updated accordingly.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
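
The rule stated above, that a caller requests either unbound or per-CPU
behavior explicitly, can be written down as a small flag check. This is a
sketch only: the TOY_WQ_* bit values are made up for illustration and
toy_wq_flags_valid() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values; they do not mirror the kernel's definitions. */
#define TOY_WQ_UNBOUND (1u << 1)
#define TOY_WQ_PERCPU  (1u << 8)

/*
 * During the transition a workqueue is expected to carry exactly one of
 * the two flags: TOY_WQ_PERCPU is the explicit form of "not unbound".
 */
bool toy_wq_flags_valid(unsigned int flags)
{
    bool unbound = (flags & TOY_WQ_UNBOUND) != 0;
    bool percpu = (flags & TOY_WQ_PERCPU) != 0;

    return unbound != percpu;
}
```

Under this rule, passing neither flag and passing both flags are rejected
alike, which matches the description of WQ_PERCPU as equivalent to
!WQ_UNBOUND.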
5 days ago drivers/block: replace use of system_unbound_wq with system_dfl_wq
Marco Crivellari [Fri, 5 Sep 2025 08:51:40 +0000 (10:51 +0200)]
drivers/block: replace use of system_unbound_wq with system_dfl_wq

Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again
makes use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Adding system_dfl_wq to encourage its use when unbound work should be used.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: if the user still uses the old wq, a warning will be
printed along with a wq redirect to the new one.

The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago drivers/block: replace use of system_wq with system_percpu_wq
Marco Crivellari [Fri, 5 Sep 2025 08:51:39 +0000 (10:51 +0200)]
drivers/block: replace use of system_wq with system_percpu_wq

Currently if a user enqueues a work item using schedule_delayed_work() the
used wq is "system_wq" (per-cpu wq) while queue_delayed_work() uses
WORK_CPU_UNBOUND (used when a cpu is not specified). The same applies to
schedule_work(), which uses system_wq, and queue_work(), which again
makes use of WORK_CPU_UNBOUND.

This lack of consistency cannot be addressed without refactoring the API.

system_unbound_wq should be the default workqueue so as not to enforce
locality constraints for random work whenever it's not required.

Adding system_dfl_wq to encourage its use when unbound work should be used.

queue_work() / queue_delayed_work() / mod_delayed_work() will now use the
new unbound wq: if the user still uses the old wq, a warning will be
printed along with a wq redirect to the new one.

The old system_unbound_wq will be kept for a few release cycles.

Suggested-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Marco Crivellari <marco.crivellari@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago block: floppy: Replace kmalloc() + copy_from_user() with memdup_user()
Thorsten Blum [Mon, 8 Sep 2025 20:10:20 +0000 (22:10 +0200)]
block: floppy: Replace kmalloc() + copy_from_user() with memdup_user()

Replace kmalloc() followed by copy_from_user() with memdup_user() to
improve and simplify raw_cmd_copyin().

No functional changes intended.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
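
The pattern being replaced, allocate a buffer and then copy into it, can be
folded into one helper. The following is a userspace analogue only:
toy_memdup() is a hypothetical function built on malloc() and memcpy(), it
does not cross a user/kernel boundary, and it reports failure with NULL
rather than the error-pointer convention the kernel helper uses.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Userspace analogue of an "allocate and copy" helper. The caller owns
 * the returned buffer and must free() it. Returns NULL on allocation
 * failure.
 */
void *toy_memdup(const void *src, size_t len)
{
    void *copy;

    assert(src != NULL);
    copy = malloc(len ? len : 1);   /* avoid a zero-sized allocation */
    if (copy == NULL)
        return NULL;
    memcpy(copy, src, len);
    return copy;
}
```

A caller then has a single failure point to check instead of one for the
allocation and one for the copy.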
5 days ago blk-mq: Document tags_srcu member in blk_mq_tag_set structure
Ming Lei [Tue, 9 Sep 2025 12:33:10 +0000 (20:33 +0800)]
blk-mq: Document tags_srcu member in blk_mq_tag_set structure

Add missing documentation for the tags_srcu member that was introduced
to defer freeing of tags page_list to prevent use-after-free when
iterating tags.

Fixes htmldocs warning:
WARNING: include/linux/blk-mq.h:536 struct member 'tags_srcu' not described in 'blk_mq_tag_set'

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago block: remove the bi_inline_vecs variable sized array from struct bio
Christoph Hellwig [Mon, 8 Sep 2025 10:56:39 +0000 (12:56 +0200)]
block: remove the bi_inline_vecs variable sized array from struct bio

Bios are embedded into other structures, and at least sparse is unhappy
about embedding structures with variable sized arrays.  There's no
real need for the array anyway, we can replace it with a helper pointing
to the memory just behind the bio, and with the previous cleanups there
are very few sites doing anything special with it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
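
The idea of replacing a trailing variable sized array with a helper that
points just behind the structure can be sketched in userspace C. The names
struct toy_bio, struct toy_bio_vec, toy_bio_inline_vecs() and
toy_bio_alloc_inline() are hypothetical and the layout is simplified; the
sketch assumes the vectors are allocated in the same block of memory,
directly after the bio.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_bio_vec {
    void *page;
    unsigned int len;
    unsigned int offset;
};

/* No trailing flexible array, so the struct can be embedded elsewhere. */
struct toy_bio {
    unsigned short max_vecs;
    struct toy_bio_vec *io_vec;
};

/* The vectors start right behind the bio, so its alignment must suffice. */
_Static_assert(_Alignof(struct toy_bio) >= _Alignof(struct toy_bio_vec),
               "inline vecs would be misaligned");

/* Helper that replaces the array member: the memory just behind the bio. */
struct toy_bio_vec *toy_bio_inline_vecs(struct toy_bio *bio)
{
    return (struct toy_bio_vec *)(bio + 1);
}

/* Allocate a bio together with room for nr_vecs inline vectors. */
struct toy_bio *toy_bio_alloc_inline(unsigned short nr_vecs)
{
    struct toy_bio *bio;

    bio = calloc(1, sizeof(*bio) +
                    (size_t)nr_vecs * sizeof(struct toy_bio_vec));
    if (bio == NULL)
        return NULL;
    bio->max_vecs = nr_vecs;
    bio->io_vec = toy_bio_inline_vecs(bio);
    return bio;
}
```

The helper is only valid for bios that were allocated with the extra space
behind them; the structure itself no longer records that.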
5 days ago block: add a bio_init_inline helper
Christoph Hellwig [Mon, 8 Sep 2025 10:56:38 +0000 (12:56 +0200)]
block: add a bio_init_inline helper

Just a simpler wrapper around bio_init for callers that want to
initialize a bio with inline bvecs.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
5 days ago nbd: restrict sockets to TCP and UDP
Eric Dumazet [Tue, 9 Sep 2025 13:22:43 +0000 (13:22 +0000)]
nbd: restrict sockets to TCP and UDP

Recently, syzbot started to abuse NBD with all kinds of sockets.

Commit cf1b2326b734 ("nbd: verify socket is supported during setup")
made sure the socket supported a shutdown() method.

Explicitly accept TCP and UNIX stream sockets.

Fixes: cf1b2326b734 ("nbd: verify socket is supported during setup")
Reported-by: syzbot+e1cd6bd8493060bd701d@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/CANn89iJ+76eE3A_8S_zTpSyW5hvPRn6V57458hCZGY5hbH_bFA@mail.gmail.com/T/#m081036e8747cd7e2626c1da5d78c8b9d1e55b154
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mike Christie <mchristi@redhat.com>
Cc: Richard W.M. Jones <rjones@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Yu Kuai <yukuai1@huaweicloud.com>
Cc: linux-block@vger.kernel.org
Cc: nbd@other.debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days ago blk-throttle: fix access race during throttle policy activation
Han Guangjiang [Fri, 5 Sep 2025 10:24:11 +0000 (18:24 +0800)]
blk-throttle: fix access race during throttle policy activation

On repeated cold boots we occasionally hit a NULL pointer crash in
blk_should_throtl() when throttling is consulted before the throttle
policy is fully enabled for the queue. Checking only q->td != NULL is
insufficient during early initialization, so blkg_to_pd() for the
throttle policy can still return NULL and blkg_to_tg() becomes NULL,
which later gets dereferenced.

 Unable to handle kernel NULL pointer dereference
 at virtual address 0000000000000156
 ...
 pc : submit_bio_noacct+0x14c/0x4c8
 lr : submit_bio_noacct+0x48/0x4c8
 sp : ffff800087f0b690
 x29: ffff800087f0b690 x28: 0000000000005f90 x27: ffff00068af393c0
 x26: 0000000000080000 x25: 000000000002fbc0 x24: ffff000684ddcc70
 x23: 0000000000000000 x22: 0000000000000000 x21: 0000000000000000
 x20: 0000000000080000 x19: ffff000684ddcd08 x18: ffffffffffffffff
 x17: 0000000000000000 x16: ffff80008132a550 x15: 0000ffff98020fff
 x14: 0000000000000000 x13: 1fffe000d11d7021 x12: ffff000688eb810c
 x11: ffff00077ec4bb80 x10: ffff000688dcb720 x9 : ffff80008068ef60
 x8 : 00000a6fb8a86e85 x7 : 000000000000111e x6 : 0000000000000002
 x5 : 0000000000000246 x4 : 0000000000015cff x3 : 0000000000394500
 x2 : ffff000682e35e40 x1 : 0000000000364940 x0 : 000000000000001a
 Call trace:
  submit_bio_noacct+0x14c/0x4c8
  verity_map+0x178/0x2c8
  __map_bio+0x228/0x250
  dm_submit_bio+0x1c4/0x678
  __submit_bio+0x170/0x230
  submit_bio_noacct_nocheck+0x16c/0x388
  submit_bio_noacct+0x16c/0x4c8
  submit_bio+0xb4/0x210
  f2fs_submit_read_bio+0x4c/0xf0
  f2fs_mpage_readpages+0x3b0/0x5f0
  f2fs_readahead+0x90/0xe8

Tighten blk_throtl_activated() to also require that the throttle policy
bit is set on the queue:

  return q->td != NULL &&
         test_bit(blkcg_policy_throtl.plid, q->blkcg_pols);

This prevents blk_should_throtl() from accessing throttle group state
until policy data has been attached to blkgs.

Fixes: a3166c51702b ("blk-throttle: delay initialization until configuration")
Co-developed-by: Liang Jie <liangjie@lixiang.com>
Signed-off-by: Liang Jie <liangjie@lixiang.com>
Signed-off-by: Han Guangjiang <hanguangjiang@lixiang.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
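
The window that the tightened check closes can be reproduced in a small
userspace model. This is a sketch under stated assumptions: struct
toy_queue, TOY_THROTL_PLID and both functions are hypothetical, and the
policy bitmap is reduced to a single unsigned long instead of the kernel's
bitmap type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Arbitrary policy id chosen for the sketch. */
#define TOY_THROTL_PLID 2

struct toy_throtl_data {
    int unused;
};

struct toy_queue {
    struct toy_throtl_data *td;  /* set early during initialization */
    unsigned long blkcg_pols;    /* one bit per fully activated policy */
};

/* Old check: true as soon as td is allocated. */
bool toy_throtl_activated_old(const struct toy_queue *q)
{
    return q->td != NULL;
}

/* Tightened check: additionally require the policy bit on the queue. */
bool toy_throtl_activated(const struct toy_queue *q)
{
    return q->td != NULL &&
           (q->blkcg_pols & (1UL << TOY_THROTL_PLID)) != 0;
}
```

Between the assignment of td and the setting of the policy bit, the old
check already reports the policy as active while the tightened one does
not, so per-group throttle state is not consulted too early.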
6 days ago null_blk: Fix the description of the cache_size module argument
Genjian Zhang [Fri, 15 Aug 2025 09:07:32 +0000 (17:07 +0800)]
null_blk: Fix the description of the cache_size module argument

When executing modinfo null_blk, there is an error in the description
of module parameter mbps, and the output information of cache_size is
incomplete. The output of modinfo before and after applying this patch
is as follows:

Before:
[...]
parm:           cache_size:ulong
[...]
parm:           mbps:Cache size in MiB for memory-backed device.
Default: 0 (none) (uint)
[...]

After:
[...]
parm:           cache_size:Cache size in MiB for memory-backed device.
Default: 0 (none) (ulong)
[...]
parm:           mbps:Limit maximum bandwidth (in MiB/s).
Default: 0 (no limit) (uint)
[...]

Fixes: 058efe000b31 ("null_blk: add module parameters for 4 options")
Signed-off-by: Genjian Zhang <zhanggenjian@kylinos.cn>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days ago blk-mq: Replace tags->lock with SRCU for tag iterators
Ming Lei [Sat, 30 Aug 2025 02:18:23 +0000 (10:18 +0800)]
blk-mq: Replace tags->lock with SRCU for tag iterators

Replace the spinlock in blk_mq_find_and_get_req() with an SRCU read lock
around the tag iterators.

This is done by:

- Holding the SRCU read lock in blk_mq_queue_tag_busy_iter(),
blk_mq_tagset_busy_iter(), and blk_mq_hctx_has_requests().

- Removing the now-redundant tags->lock from blk_mq_find_and_get_req().

This change fixes a lockup issue in scsi_host_busy() in case of
shost->host_blocked.

It also avoids taking the big tags->lock when reading the disk sysfs
attribute `inflight`.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days ago blk-mq: Defer freeing flush queue to SRCU callback
Ming Lei [Sat, 30 Aug 2025 02:18:22 +0000 (10:18 +0800)]
blk-mq: Defer freeing flush queue to SRCU callback

The freeing of the flush queue/request in blk_mq_exit_hctx() can race with
tag iterators that may still be accessing it. To prevent a potential
use-after-free, the deallocation should be deferred until after a grace
period. This way, we can replace the big tags->lock in the tags iterator
code path with SRCU to solve the issue.

This patch introduces an SRCU-based deferred freeing mechanism for the
flush queue.

The changes include:
- Adding a `rcu_head` to `struct blk_flush_queue`.
- Creating a new callback function, `blk_free_flush_queue_callback`,
  to handle the actual freeing.
- Replacing the direct call to `blk_free_flush_queue()` in
  `blk_mq_exit_hctx()` with `call_srcu()`, using the `tags_srcu`
  instance to ensure synchronization with tag iterators.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days ago blk-mq: Defer freeing of tags page_list to SRCU callback
Ming Lei [Sat, 30 Aug 2025 02:18:21 +0000 (10:18 +0800)]
blk-mq: Defer freeing of tags page_list to SRCU callback

Tag iterators can race with the freeing of the request pages
(tags->page_list), potentially leading to use-after-free issues.

Defer the freeing of the page list and the tags structure itself until
after an SRCU grace period has passed. This ensures that any concurrent
tag iterators have completed before the memory is released. This way, we
can replace the big tags->lock in the tags iterator code path with SRCU
to solve the issue.

This is achieved by:
- Adding a new `srcu_struct tags_srcu` to `blk_mq_tag_set` to protect
  tag map iteration.
- Adding an `rcu_head` to `struct blk_mq_tags` to be used with
  `call_srcu`.
- Moving the page list freeing logic and the `kfree(tags)` call into a
  new callback function, `blk_mq_free_tags_callback`.
- In `blk_mq_free_tags`, invoking `call_srcu` to schedule the new
  callback for deferred execution.

The read-side protection for the tag iterators will be added in a
subsequent patch.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
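
The deferred-free pattern listed above, a callback head embedded in the
object and a callback that recovers the object and frees it, can be
sketched in userspace C. This is a single-threaded toy and makes strong
simplifying assumptions: struct toy_srcu only counts readers and runs
pending callbacks when the last reader leaves, which is far weaker than a
real SRCU grace period. All names (toy_srcu, toy_rcu_head, toy_call_srcu,
toy_tags, toy_free_tags_callback) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_rcu_head {
    struct toy_rcu_head *next;
    void (*func)(struct toy_rcu_head *head);
};

struct toy_srcu {
    int readers;                   /* active read-side sections */
    struct toy_rcu_head *pending;  /* callbacks waiting for readers */
};

#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

void toy_run_pending(struct toy_srcu *s)
{
    while (s->pending) {
        struct toy_rcu_head *head = s->pending;

        s->pending = head->next;   /* read next before head is freed */
        head->func(head);
    }
}

void toy_srcu_read_lock(struct toy_srcu *s)
{
    s->readers++;
}

void toy_srcu_read_unlock(struct toy_srcu *s)
{
    assert(s->readers > 0);
    if (--s->readers == 0)
        toy_run_pending(s);
}

/* Queue a callback; it runs once no reader is active any more. */
void toy_call_srcu(struct toy_srcu *s, struct toy_rcu_head *head,
                   void (*func)(struct toy_rcu_head *head))
{
    head->func = func;
    head->next = s->pending;
    s->pending = head;
    if (s->readers == 0)
        toy_run_pending(s);
}

struct toy_tags {
    int *page_list;
    struct toy_rcu_head rcu_head;  /* embedded head for deferred free */
};

int toy_tags_freed;

/* Recover the enclosing object from its embedded head, then free it. */
void toy_free_tags_callback(struct toy_rcu_head *head)
{
    struct toy_tags *tags = toy_container_of(head, struct toy_tags, rcu_head);

    free(tags->page_list);
    free(tags);
    toy_tags_freed++;
}

void toy_free_tags(struct toy_srcu *s, struct toy_tags *tags)
{
    toy_call_srcu(s, &tags->rcu_head, toy_free_tags_callback);
}
```

In this model an iterator that holds the read lock can keep using
tags->page_list after toy_free_tags() was called; the memory is released
only when it drops the lock.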
6 days ago blk-mq: Pass tag_set to blk_mq_free_rq_map/tags
Ming Lei [Sat, 30 Aug 2025 02:18:20 +0000 (10:18 +0800)]
blk-mq: Pass tag_set to blk_mq_free_rq_map/tags

To prepare for converting the tag->rqs freeing to be SRCU-based, the
tag_set is needed in the freeing helper functions.

This patch adds 'struct blk_mq_tag_set *' as the first parameter to
blk_mq_free_rq_map() and blk_mq_free_tags(), and updates all their call
sites.

This allows access to the tag_set's SRCU structure in the next step,
which will be used to free the tag maps after a grace period.

No functional change is intended in this patch.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
6 days ago blk-mq: Move flush queue allocation into blk_mq_init_hctx()
Ming Lei [Sat, 30 Aug 2025 02:18:19 +0000 (10:18 +0800)]
blk-mq: Move flush queue allocation into blk_mq_init_hctx()

Move flush queue allocation into blk_mq_init_hctx() and its release into
blk_mq_exit_hctx(), and prepare for replacing tags->lock with SRCU for
draining inflight request walking. blk_mq_exit_hctx() is the last chance
for us to get a valid `tag_set` reference, and we need to add one SRCU to
`tag_set` for freeing the flush request via call_srcu().

It is safe to move flush queue & request release into blk_mq_exit_hctx(),
because blk_mq_clear_flush_rq_mapping() clears the flush request
reference in the driver tags inflight request table, and meanwhile
inflight request walking is drained.

Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
9 days ago md/md-llbitmap: introduce new lockless bitmap
Yu Kuai [Fri, 29 Aug 2025 08:04:26 +0000 (16:04 +0800)]
md/md-llbitmap: introduce new lockless bitmap

Redundant data is used to enhance data fault tolerance, and the storage
method for redundant data varies depending on the RAID level. It's
important to maintain the consistency of redundant data.

Bitmap is used to record which data blocks have been synchronized and which
ones need to be resynchronized or recovered. Each bit in the bitmap
represents a segment of data in the array. When a bit is set, it indicates
that the multiple redundant copies of that data segment may not be
consistent. Data synchronization can be performed based on the bitmap after
a power failure or after re-adding a disk. If there is no bitmap, a full disk
synchronization is required.
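
As a rough illustration of the bitmap idea described above (not the md-bitmap
or llbitmap implementation), the sketch below models an array split into
fixed-size chunks with one bit per chunk. The class name `DirtyBitmap`, the
chunk size and the method names are all hypothetical.

```python
class DirtyBitmap:
    """Toy model: one bit per chunk; a set bit means copies may differ."""

    def __init__(self, array_sectors, chunk_sectors):
        self.chunk_sectors = chunk_sectors
        # Round up so a partial trailing chunk still gets a bit.
        self.nr_chunks = -(-array_sectors // chunk_sectors)
        self.bits = 0

    def _chunks(self, sector, nr_sectors):
        first = sector // self.chunk_sectors
        last = (sector + nr_sectors - 1) // self.chunk_sectors
        return range(first, last + 1)

    def start_write(self, sector, nr_sectors):
        # Mark every chunk touched by the write as possibly inconsistent.
        for chunk in self._chunks(sector, nr_sectors):
            self.bits |= 1 << chunk

    def clear_synced(self, chunk):
        # Called once all copies of the chunk are known to match.
        self.bits &= ~(1 << chunk)

    def chunks_to_resync(self):
        # After a crash or re-add, only these chunks need resynchronizing.
        return [c for c in range(self.nr_chunks) if (self.bits >> c) & 1]


bitmap = DirtyBitmap(array_sectors=1024, chunk_sectors=128)
bitmap.start_write(sector=120, nr_sectors=16)  # spans chunks 0 and 1
bitmap.start_write(sector=900, nr_sectors=8)   # falls in chunk 7
bitmap.clear_synced(1)
print(bitmap.chunks_to_resync())
```

Without such a bitmap, the model would have to treat every chunk as
needing resync, which corresponds to the full disk synchronization
mentioned above.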

Due to known performance issues with md-bitmap and its unreasonable
implementation choices:

 - self-managed IO submitting like filemap_write_page();
 - global spin_lock

I have decided not to continue optimizing the current bitmap
implementation. This new bitmap is designed without locking in the IO fast
path and can be used with fast disks.

For designs and details, see the comments in drivers/md/md-llbitmap.c.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-12-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Li Nan <linan122@huawei.com>
9 days agomd/md-bitmap: make method bitmap_ops->daemon_work optional
Yu Kuai [Fri, 29 Aug 2025 08:04:25 +0000 (16:04 +0800)]
md/md-bitmap: make method bitmap_ops->daemon_work optional

daemon_work() is called by the daemon thread. On the one hand, the daemon
thread doesn't have a strict wake-up time; on the other hand, too much
work is put on the daemon thread, such as handling sync IO, handling failed
or special normal IO, handling recovery, and so on. Hence the daemon thread
may be too busy to clear dirty bits in time.

Make bitmap_ops->daemon_work() optional; following patches will use a
separate async work to clear dirty bits for the new bitmap.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-11-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
9 days agomd: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER
Yu Kuai [Fri, 29 Aug 2025 08:04:24 +0000 (16:04 +0800)]
md: add a new recovery_flag MD_RECOVERY_LAZY_RECOVER

This flag is used by llbitmap in later patches to skip the raid456 initial
recovery and delay building the initial xor data until the first write.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-10-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
9 days agomd/md-bitmap: add a new method blocks_synced() in bitmap_operations
Yu Kuai [Fri, 29 Aug 2025 08:04:23 +0000 (16:04 +0800)]
md/md-bitmap: add a new method blocks_synced() in bitmap_operations

Currently, raid456 must perform a whole-array initial recovery to build
the initial xor data, so that IO to the array won't have to read all the
blocks in the underlying disks.

This behavior affects IO performance a lot, and nowadays disks are
huge, so the initial recovery can take a long time. Hence llbitmap
will support lazy initial recovery in following patches. This method is
used to check whether data blocks are synced; if not, IO will still
have to read all blocks for raid456.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
9 days agomd/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations
Yu Kuai [Fri, 29 Aug 2025 08:04:22 +0000 (16:04 +0800)]
md/md-bitmap: add a new method skip_sync_blocks() in bitmap_operations

This method is used to check if blocks can be skipped before calling
into pers->sync_request(). llbitmap will use this method to skip
resync for unwritten/clean data blocks, and recovery/check/repair for
unwritten data blocks.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
9 days agomd/md-bitmap: delay registration of bitmap_ops until creating bitmap
Yu Kuai [Fri, 29 Aug 2025 08:04:21 +0000 (16:04 +0800)]
md/md-bitmap: delay registration of bitmap_ops until creating bitmap

Currently bitmap_ops is registered while allocating mddev, which is fine
when there is only one bitmap_ops.

Delay setting bitmap_ops until creating the bitmap, so that the user can
choose which bitmap to use before running the array.

Link: https://lore.kernel.org/linux-raid/20250721171557.34587-7-yukuai@kernel.org
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/md-bitmap: add a new sysfs api bitmap_type
Yu Kuai [Fri, 29 Aug 2025 08:04:20 +0000 (16:04 +0800)]
md/md-bitmap: add a new sysfs api bitmap_type

The API will be used by mdadm to set bitmap_type while creating a new array
or assembling an array, in preparation for adding a new bitmap.

Currently available options are:

cat /sys/block/md0/md/bitmap_type
none [bitmap]

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Li Nan <linan122@huawei.com>
9 days agomd: add a new mddev field 'bitmap_id'
Yu Kuai [Fri, 29 Aug 2025 08:04:19 +0000 (16:04 +0800)]
md: add a new mddev field 'bitmap_id'

Prepare to store the bitmap id selected by user, also refactor
mddev_set_bitmap_ops a bit in case the value is invalid.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-5-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/md-bitmap: support discard for bitmap ops
Yu Kuai [Fri, 29 Aug 2025 08:04:18 +0000 (16:04 +0800)]
md/md-bitmap: support discard for bitmap ops

Use two new methods {start, end}_discard in bitmap_ops and a new field 'rw'
in struct md_io_clone to handle discard IO, prepare to support new md
bitmap.

Since all bitmap functions to handle write IO are the same, also add a
typedef to make the code cleaner.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-4-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
9 days agomd: factor out a helper raid_is_456()
Yu Kuai [Fri, 29 Aug 2025 08:04:17 +0000 (16:04 +0800)]
md: factor out a helper raid_is_456()

There are no functional changes, the helper will be used by llbitmap in
following patches.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-3-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
9 days agomd: add a new parameter 'offset' to md_super_write()
Yu Kuai [Fri, 29 Aug 2025 08:04:16 +0000 (16:04 +0800)]
md: add a new parameter 'offset' to md_super_write()

The parameter is always set to 0 for now, following patches will use
this helper to write llbitmap to underlying disks, allow writing
dirty sectors instead of the whole page.

Also rename md_super_write to md_write_metadata since there is nothing
super-block specific.

Link: https://lore.kernel.org/linux-raid/20250829080426.1441678-2-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
9 days agomd/md-bitmap: introduce CONFIG_MD_BITMAP
Yu Kuai [Mon, 7 Jul 2025 01:27:11 +0000 (09:27 +0800)]
md/md-bitmap: introduce CONFIG_MD_BITMAP

Now that all implementations are internal, it's sensible to add a config
option for md-bitmap, and it's a good way for isolation.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-16-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd: check before referencing mddev->bitmap_ops
Yu Kuai [Mon, 7 Jul 2025 01:27:10 +0000 (09:27 +0800)]
md: check before referencing mddev->bitmap_ops

Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-15-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/dm-raid: check before referencing mddev->bitmap_ops
Yu Kuai [Mon, 7 Jul 2025 01:27:09 +0000 (09:27 +0800)]
md/dm-raid: check before referencing mddev->bitmap_ops

Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-14-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/raid5: check before referencing mddev->bitmap_ops
Yu Kuai [Mon, 7 Jul 2025 01:27:08 +0000 (09:27 +0800)]
md/raid5: check before referencing mddev->bitmap_ops

Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-13-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/raid10: check before referencing mddev->bitmap_ops
Yu Kuai [Mon, 7 Jul 2025 01:27:07 +0000 (09:27 +0800)]
md/raid10: check before referencing mddev->bitmap_ops

Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-12-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/raid1: check before referencing mddev->bitmap_ops
Yu Kuai [Mon, 7 Jul 2025 01:27:06 +0000 (09:27 +0800)]
md/raid1: check before referencing mddev->bitmap_ops

Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-11-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/raid1: check bitmap before behind write
Yu Kuai [Mon, 7 Jul 2025 01:27:05 +0000 (09:27 +0800)]
md/raid1: check bitmap before behind write

Behind writes rely on the bitmap, because the number of IOs is recorded in
bitmap->behind_writes, and callers rely on bitmap_wait_behind_writes()
to wait for IO to be done.

However, currently callers don't check if the bitmap is enabled before
calling into behind methods. Hence if a behind write starts without a bitmap,
readers will not wait for slow write IO to be done and old data can be
read in some corner cases.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-10-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/md-bitmap: handle the case bitmap is not enabled before end_sync()
Yu Kuai [Mon, 7 Jul 2025 01:27:04 +0000 (09:27 +0800)]
md/md-bitmap: handle the case bitmap is not enabled before end_sync()

This case can be handled without knowing internal implementation.

Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-9-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/md-bitmap: handle the case bitmap is not enabled before start_sync()
Yu Kuai [Mon, 7 Jul 2025 01:27:03 +0000 (09:27 +0800)]
md/md-bitmap: handle the case bitmap is not enabled before start_sync()

This case can be handled without knowing internal implementation.

Prepare to introduce CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-8-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/md-bitmap: add md_bitmap_registered/enabled() helper
Yu Kuai [Mon, 7 Jul 2025 01:27:02 +0000 (09:27 +0800)]
md/md-bitmap: add md_bitmap_registered/enabled() helper

There are no functional changes, prepare to handle the case that
mddev->bitmap_ops can be NULL, which is possible after introducing
CONFIG_MD_BITMAP.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-7-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/md-bitmap: add a new parameter 'flush' to bitmap_ops->enabled
Yu Kuai [Mon, 7 Jul 2025 01:27:01 +0000 (09:27 +0800)]
md/md-bitmap: add a new parameter 'flush' to bitmap_ops->enabled

The method is only used from the raid1/raid10 IO path, to check if a write
bio should be plugged. The parameter is always set to true for now;
a following patch will use this helper in other contexts, like updating
the superblock.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-6-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/md-bitmap: merge md_bitmap_group into bitmap_operations
Yu Kuai [Mon, 7 Jul 2025 01:27:00 +0000 (09:27 +0800)]
md/md-bitmap: merge md_bitmap_group into bitmap_operations

Now that all bitmap implementations are internal, it doesn't make sense
to export md_bitmap_group anymore.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-5-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agomd/md-bitmap: remove the parameter 'init' for bitmap_ops->resize()
Yu Kuai [Mon, 7 Jul 2025 01:26:59 +0000 (09:26 +0800)]
md/md-bitmap: remove the parameter 'init' for bitmap_ops->resize()

It's set to 'false' for all callers, hence it's useless and can be
removed.

Link: https://lore.kernel.org/linux-raid/20250707012711.376844-3-yukuai1@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Xiao Ni <xni@redhat.com>
9 days agoblk-mq: fix blk_mq_tags double free while nr_requests grown
Yu Kuai [Thu, 21 Aug 2025 06:06:12 +0000 (14:06 +0800)]
blk-mq: fix blk_mq_tags double free while nr_requests grown

When the user triggers tags growth through the queue sysfs attribute
nr_requests, hctx->sched_tags will be freed directly and replaced with newly
allocated tags, see blk_mq_tag_update_depth().

The problem is that hctx->sched_tags is from elevator->et->tags, while
et->tags still points to the freed tags, hence a later elevator exit will
try to free the tags again, causing a kernel panic.

Fix this problem by replacing et->tags with the newly allocated tags as well.
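
The stale-reference pattern described above can be modelled in a few lines.
This is only a sketch: `Tags`, `grow_tags()` and `elevator_exit()` are
hypothetical stand-ins for the kernel structures and functions, and the
`update_et` switch exists purely to compare the behaviour before and after
the fix.

```python
class Tags:
    """Stand-in for a tag map; records how many times it was freed."""

    def __init__(self, depth):
        self.depth = depth
        self.free_count = 0

    def free(self):
        self.free_count += 1
        if self.free_count > 1:
            raise RuntimeError("double free of tags")


def grow_tags(hctx, et, new_depth, update_et):
    """Replace hctx['sched_tags'] with a larger map, as a depth update would."""
    old = hctx["sched_tags"]
    new = Tags(new_depth)
    old.free()
    hctx["sched_tags"] = new
    if update_et:
        # The fix: keep the elevator's own table pointing at the new tags.
        et["tags"][hctx["idx"]] = new


def elevator_exit(et):
    # On exit the elevator frees whatever its table points at.
    for tags in et["tags"]:
        tags.free()


def run(update_et):
    tags = Tags(64)
    hctx = {"idx": 0, "sched_tags": tags}
    et = {"tags": [tags]}
    grow_tags(hctx, et, 128, update_et)
    try:
        elevator_exit(et)
    except RuntimeError as err:
        return str(err)
    return "ok"


print(run(update_et=False))  # table still holds the freed tags
print(run(update_et=True))   # table was updated together with the hctx
```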

Note that there are still some long-term problems that will require some
refactoring to be fixed thoroughly[1].

[1] https://lore.kernel.org/all/20250815080216.410665-1-yukuai1@huaweicloud.com/
Fixes: f5a6604f7a44 ("block: fix lockdep warning caused by lock dependency in elv_iosched_store")

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Link: https://lore.kernel.org/r/20250821060612.1729939-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
9 days agoblk-mq: fix elevator depth_updated method
Yu Kuai [Thu, 21 Aug 2025 06:06:11 +0000 (14:06 +0800)]
blk-mq: fix elevator depth_updated method

Current depth_updated has some problems:

1) depth_updated() will be called for each hctx, while all elevators
will update async_depth for the disk level, this is not related to hctx;
2) In blk_mq_update_nr_requests(), if a previous hctx update succeeds and
this hctx update fails, q->nr_requests will not be updated, while
async_depth is already updated with the new nr_requests in the previous
depth_updated();
3) All elevators are using q->nr_requests to calculate async_depth now,
however, q->nr_requests is still the old value when depth_updated() is
called from blk_mq_update_nr_requests();

These problems were introduced first in the error path, then in mq-deadline,
and recently in bfq and kyber. Fix them by:

- pass in request_queue instead of hctx;
- move depth_updated() after q->nr_requests is updated in
  blk_mq_update_nr_requests();
- add depth_updated() call inside init_sched() method to initialize
  async_depth;
- remove init_hctx() method for mq-deadline and bfq that is useless now;
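
The ordering issue in problem 3) and the second fix can be illustrated with
a small model. This is a sketch under assumptions: the `Queue` class is
hypothetical, and the 3/4 rule in `async_depth_for()` is made up for the
example rather than taken from any elevator.

```python
def async_depth_for(nr_requests):
    # Hypothetical rule: let async requests use 3/4 of the queue depth.
    return max(1, nr_requests * 3 // 4)


class Queue:
    def __init__(self, nr_requests):
        self.nr_requests = nr_requests
        self.async_depth = async_depth_for(nr_requests)

    def depth_updated(self):
        # Queue-level callback: derives async_depth from self.nr_requests.
        self.async_depth = async_depth_for(self.nr_requests)

    def update_nr_requests_old(self, nr):
        # Old ordering: the callback runs while nr_requests is still stale.
        self.depth_updated()
        self.nr_requests = nr

    def update_nr_requests_new(self, nr):
        # Fixed ordering: update nr_requests first, then notify once.
        self.nr_requests = nr
        self.depth_updated()


old_q = Queue(64)
old_q.update_nr_requests_old(128)
new_q = Queue(64)
new_q.update_nr_requests_new(128)
print(old_q.async_depth, new_q.async_depth)
```

With the old ordering the model keeps the async depth computed from 64
requests; with the new ordering it reflects the new depth of 128.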

Fixes: 77f1e0a52d26 ("bfq: update internal depth state when queue depth changes")
Fixes: 39823b47bbd4 ("block/mq-deadline: Fix the tag reservation code")
Fixes: 42e6c6ce03fd ("lib/sbitmap: convert shallow_depth from one word to the whole sbitmap")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Li Nan <linan122@huawei.com>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Link: https://lore.kernel.org/r/20250821060612.1729939-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 days agoublk: inline __ublk_ch_uring_cmd()
Caleb Sander Mateos [Fri, 8 Aug 2025 15:32:50 +0000 (09:32 -0600)]
ublk: inline __ublk_ch_uring_cmd()

ublk_ch_uring_cmd_local() is a thin wrapper around __ublk_ch_uring_cmd()
that copies the ublksrv_io_cmd from user-mapped memory to the stack
using READ_ONCE(). This ublksrv_io_cmd is passed by pointer to
__ublk_ch_uring_cmd() and __ublk_ch_uring_cmd() is a large function
unlikely to be inlined, so __ublk_ch_uring_cmd() will have to load the
ublksrv_io_cmd fields back from the stack. Inline __ublk_ch_uring_cmd()
into ublk_ch_uring_cmd_local() and load the ublksrv_io_cmd fields into
local variables with READ_ONCE(). This allows the compiler to delay
loading the fields until they are needed and choose whether to store
them in registers or on the stack.

Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250808153251.282107-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
11 days agoMerge tag 'pull-getgeo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs...
Jens Axboe [Wed, 3 Sep 2025 21:15:43 +0000 (15:15 -0600)]
Merge tag 'pull-getgeo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs into for-6.18/block

Pull struct block_device getgeo changes from Al.

"switching ->getgeo() from struct block_device to struct gendisk

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>"
* tag 'pull-getgeo' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  block: switch ->getgeo() to struct gendisk
  scsi: switch ->bios_param() to passing gendisk
  scsi: switch scsi_bios_ptable() and scsi_partsize() to gendisk

12 days agoblock: use int to store blk_stack_limits() return value
Qianfeng Rong [Tue, 2 Sep 2025 13:09:30 +0000 (21:09 +0800)]
block: use int to store blk_stack_limits() return value

Change the 'ret' variable in blk_stack_limits() from unsigned int to int,
as it needs to store the negative value -1.

Storing the negative error codes in unsigned type, or performing equality
comparisons (e.g., ret == -1), doesn't cause an issue at runtime [1] but
can be confusing.  Additionally, assigning negative error codes to unsigned
type may trigger a GCC warning when the -Wsign-conversion flag is enabled.
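
The claim that the comparison still works at runtime can be checked outside
the kernel. The sketch below uses Python's `ctypes.c_uint` to mimic what a C
`unsigned int` holds after being assigned -1; the helper name
`as_unsigned_int` is made up for the illustration.

```python
import ctypes


def as_unsigned_int(value):
    """Return the value a C `unsigned int` holds after assigning `value`."""
    return ctypes.c_uint(value).value


# Storing -1 in an unsigned int wraps to the largest representable value.
ret = as_unsigned_int(-1)
uint_max = 2 ** (8 * ctypes.sizeof(ctypes.c_uint)) - 1

# In C, `ret == -1` converts -1 to unsigned int before comparing, so both
# sides wrap to the same value and the test still succeeds.
matches = ret == as_unsigned_int(-1)

print(ret == uint_max, matches)
```

The comparison works, but the stored value is no longer -1, which is
the source of confusion that the type change removes.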

No effect on runtime.

Link: https://lore.kernel.org/all/x3wogjf6vgpkisdhg3abzrx7v7zktmdnfmqeih5kosszmagqfs@oh3qxrgzkikf/
Signed-off-by: Qianfeng Rong <rongqianfeng@vivo.com>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Fixes: fe0b393f2c0a ("block: Correct handling of bottom device misaligment")
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20250902130930.68317-1-rongqianfeng@vivo.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agornull: add soft-irq completion support
Andreas Hindborg [Tue, 2 Sep 2025 09:55:11 +0000 (11:55 +0200)]
rnull: add soft-irq completion support

rnull currently only supports direct completion. Add option for completing
requests across CPU nodes via soft IRQ or IPI.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-17-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: block: add remote completion to `Request`
Andreas Hindborg [Tue, 2 Sep 2025 09:55:10 +0000 (11:55 +0200)]
rust: block: add remote completion to `Request`

Allow users of rust block device driver API to schedule completion of
requests via `blk_mq_complete_request_remote`.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-16-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: block: mq: fix spelling in a safety comment
Andreas Hindborg [Tue, 2 Sep 2025 09:55:09 +0000 (11:55 +0200)]
rust: block: mq: fix spelling in a safety comment

Add code block quotes to a safety comment.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-15-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: block: add `GenDisk` private data support
Andreas Hindborg [Tue, 2 Sep 2025 09:55:08 +0000 (11:55 +0200)]
rust: block: add `GenDisk` private data support

Allow users of the rust block device driver API to install private data in
the `GenDisk` structure.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-14-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agornull: enable configuration via `configfs`
Andreas Hindborg [Tue, 2 Sep 2025 09:55:07 +0000 (11:55 +0200)]
rnull: enable configuration via `configfs`

Allow rust null block devices to be configured and instantiated via
`configfs`.

Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-13-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agornull: move driver to separate directory
Andreas Hindborg [Tue, 2 Sep 2025 09:55:06 +0000 (11:55 +0200)]
rnull: move driver to separate directory

The rust null block driver is about to gain some additional modules. Rather
than pollute the current directory, move the driver to a subdirectory.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-12-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: block: add block related constants
Andreas Hindborg [Tue, 2 Sep 2025 09:55:05 +0000 (11:55 +0200)]
rust: block: add block related constants

Add a few block subsystem constants to the rust `kernel::block` namespace.
This makes it easier to access the constants from rust code.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-11-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: block: remove trait bound from `mq::Request` definition
Andreas Hindborg [Tue, 2 Sep 2025 09:55:04 +0000 (11:55 +0200)]
rust: block: remove trait bound from `mq::Request` definition

Remove the trait bound `T: Operations` from `mq::Request`. The bound is not
required, so remove it to reduce complexity.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-10-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: block: remove `RawWriter`
Andreas Hindborg [Tue, 2 Sep 2025 09:55:03 +0000 (11:55 +0200)]
rust: block: remove `RawWriter`

`RawWriter` is now dead code, so remove it.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-9-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: block: use `NullTerminatedFormatter`
Andreas Hindborg [Tue, 2 Sep 2025 09:55:02 +0000 (11:55 +0200)]
rust: block: use `NullTerminatedFormatter`

Use the new `NullTerminatedFormatter` to write the name of a `GenDisk` to
the name buffer. This new formatter automatically adds a trailing null
marker after the written characters, so we don't need to append that at the
call site any longer.
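
The idea of a formatter that manages the trailing null marker itself can be
sketched as follows. This is a Python model, not the Rust API: the function
name `format_null_terminated` is hypothetical, and rejecting text that does
not leave room for the terminator is an assumed policy.

```python
def format_null_terminated(buf, text):
    """Write `text` plus a trailing NUL into `buf`; return the text length."""
    data = text.encode("ascii")
    # Reserve one byte for the terminator (assumed policy: reject overflow).
    if len(data) + 1 > len(buf):
        raise ValueError("buffer too small for text and terminator")
    buf[:len(data)] = data
    buf[len(data)] = 0
    return len(data)


name = bytearray(b"\xff" * 8)  # pre-filled so the written NUL is visible
written = format_null_terminated(name, "rnull0")
print(written, bytes(name))
```

Because the terminator is written inside the helper, the caller never
appends it separately, which mirrors the call-site simplification described
above.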

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-8-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: block: normalize imports for `gen_disk.rs`
Andreas Hindborg [Tue, 2 Sep 2025 09:55:01 +0000 (11:55 +0200)]
rust: block: normalize imports for `gen_disk.rs`

Clean up the import statements in `gen_disk.rs` to make the code easier to
maintain.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-7-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: configfs: re-export `configfs_attrs` from `configfs` module
Andreas Hindborg [Tue, 2 Sep 2025 09:55:00 +0000 (11:55 +0200)]
rust: configfs: re-export `configfs_attrs` from `configfs` module

Re-export `configfs_attrs` from `configfs` module, so that users can import
the macro from the `configfs` module rather than the root of the `kernel`
crate.

Also update users to import from the new path.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-6-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: str: introduce `kstrtobool` function
Andreas Hindborg [Tue, 2 Sep 2025 09:54:59 +0000 (11:54 +0200)]
rust: str: introduce `kstrtobool` function

Add a Rust wrapper for the kernel's `kstrtobool` function that converts
common user inputs into boolean values.

Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-5-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: str: introduce `NullTerminatedFormatter`
Andreas Hindborg [Tue, 2 Sep 2025 09:54:58 +0000 (11:54 +0200)]
rust: str: introduce `NullTerminatedFormatter`

Add `NullTerminatedFormatter`, a formatter that writes a null terminated
string to an array or slice buffer. Because this type needs to manage the
trailing null marker, the existing formatters cannot be used to implement
this type.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-4-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: str: expose `str::{Formatter, RawFormatter}` publicly.
Andreas Hindborg [Tue, 2 Sep 2025 09:54:57 +0000 (11:54 +0200)]
rust: str: expose `str::{Formatter, RawFormatter}` publicly.

rnull is going to make use of `str::Formatter` and `str::RawFormatter`, so
expose them with public visibility.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-3-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: str: allow `str::Formatter` to format into `&mut [u8]`.
Andreas Hindborg [Tue, 2 Sep 2025 09:54:56 +0000 (11:54 +0200)]
rust: str: allow `str::Formatter` to format into `&mut [u8]`.

Improve `Formatter` so that it can write to an array or slice buffer.

Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-2-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
12 days agorust: str: normalize imports in `str.rs`
Andreas Hindborg [Tue, 2 Sep 2025 09:54:55 +0000 (11:54 +0200)]
rust: str: normalize imports in `str.rs`

Clean up imports in `str.rs`. This makes future code manipulation more
manageable.

Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Reviewed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Andreas Hindborg <a.hindborg@kernel.org>
Link: https://lore.kernel.org/r/20250902-rnull-up-v6-16-v7-1-b5212cc89b98@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
13 days agobrd: use page reference to protect page lifetime
Yu Kuai [Mon, 11 Aug 2025 06:56:28 +0000 (14:56 +0800)]
brd: use page reference to protect page lifetime

As discussed [1], holding RCU while copying data from/to a page is too heavy.
It's better to protect the page with RCU only around the page lookup and then
grab a reference to prevent the page from being freed by discard.

[1] https://lore.kernel.org/all/eb41cab3-5946-4fe3-a1be-843dd6fca159@kernel.dk/

Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250811065628.1829339-1-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agoblk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx
Li Nan [Tue, 26 Aug 2025 08:48:54 +0000 (16:48 +0800)]
blk-mq: check kobject state_in_sysfs before deleting in blk_mq_unregister_hctx

In __blk_mq_update_nr_hw_queues() the return value of
blk_mq_sysfs_register_hctxs() is not checked. If sysfs creation for hctx
fails, later changing the number of hw_queues or removing disk will
trigger the following warning:

  kernfs: can not remove 'nr_tags', no directory
  WARNING: CPU: 2 PID: 637 at fs/kernfs/dir.c:1707 kernfs_remove_by_name_ns+0x13f/0x160
  Call Trace:
   remove_files.isra.1+0x38/0xb0
   sysfs_remove_group+0x4d/0x100
   sysfs_remove_groups+0x31/0x60
   __kobject_del+0x23/0xf0
   kobject_del+0x17/0x40
   blk_mq_unregister_hctx+0x5d/0x80
   blk_mq_sysfs_unregister_hctxs+0x94/0xd0
   blk_mq_update_nr_hw_queues+0x124/0x760
   nullb_update_nr_hw_queues+0x71/0xf0 [null_blk]
   nullb_device_submit_queues_store+0x92/0x120 [null_blk]

kobject_del() was called unconditionally even if sysfs creation failed.
Fix it by checking the kobject creation status before deleting it.
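
The guard can be pictured with a small model. This is a sketch only: the
`Kobject` class below is a hypothetical stand-in that tracks a single
`state_in_sysfs` flag, and the exception plays the role of the kernfs
warning shown in the trace.

```python
class Kobject:
    def __init__(self):
        self.state_in_sysfs = False

    def add(self, fail=False):
        if fail:
            return -12  # models an -ENOMEM style failure
        self.state_in_sysfs = True
        return 0

    def delete(self):
        if not self.state_in_sysfs:
            raise RuntimeError("kernfs: can not remove, no directory")
        self.state_in_sysfs = False


def unregister_hctx(kobj):
    # Guarded teardown: skip objects that never made it into sysfs.
    if not kobj.state_in_sysfs:
        return False
    kobj.delete()
    return True


registered, failed = Kobject(), Kobject()
registered.add()
failed.add(fail=True)  # return value ignored, as in the reported bug
print(unregister_hctx(registered), unregister_hctx(failed))
```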

Fixes: 477e19dedc9d ("blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues")
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250826084854.1030545-1-linan666@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agofloppy: Sort headers alphabetically
Andy Shevchenko [Mon, 25 Aug 2025 16:32:57 +0000 (18:32 +0200)]
floppy: Sort headers alphabetically

Sorting headers alphabetically helps locate duplicates, and makes it
easier to figure out where to insert new headers.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20250825163545.39303-4-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agofloppy: Replace custom SZ_64K constant
Andy Shevchenko [Mon, 25 Aug 2025 16:32:56 +0000 (18:32 +0200)]
floppy: Replace custom SZ_64K constant

There are only two headers using the custom K_64 constant. Moreover,
its usage tangles the code because the constant is defined in the C
file, while its users are in the headers. Replace it with the well-defined
SZ_64K from sizes.h.

Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20250825163545.39303-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agofloppy: Remove unused CROSS_64KB() macro from arch/ code
Andy Shevchenko [Mon, 25 Aug 2025 16:32:55 +0000 (18:32 +0200)]
floppy: Remove unused CROSS_64KB() macro from arch/ code

Since commit 3d86739c6343 ("floppy: always use the track buffer"),
CROSS_64KB() is not used by the driver; remove the leftovers.

Acked-by: Helge Deller <deller@gmx.de> # parisc
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> # m68k
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Link: https://lore.kernel.org/r/20250825163545.39303-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agoblock: Move a misplaced comment in queue_wb_lat_store()
Bart Van Assche [Mon, 25 Aug 2025 15:14:24 +0000 (08:14 -0700)]
block: Move a misplaced comment in queue_wb_lat_store()

blk_mq_quiesce_queue() does not wait for pending I/O to finish. Freezing
a queue waits for pending I/O to finish. Hence move the comment that
refers to waiting for pending I/O above the call that freezes the
request queue. This patch moves this comment back to the position where
it was when this comment was introduced. See also commit c125311d96b1
("blk-wbt: don't maintain inflight counts if disabled").

Cc: Christoph Hellwig <hch@lst.de>
Cc: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250825151424.1653910-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agonvme-pci: convert metadata mapping to dma iter
Keith Busch [Wed, 13 Aug 2025 15:31:53 +0000 (08:31 -0700)]
nvme-pci: convert metadata mapping to dma iter

Aligns data and metadata to a similar DMA mapping scheme and removes
one more user of the scatter-gather DMA mapping.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-10-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agonvme-pci: create common sgl unmapping helper
Keith Busch [Wed, 13 Aug 2025 15:31:52 +0000 (08:31 -0700)]
nvme-pci: create common sgl unmapping helper

This can be reused by metadata sgls once that starts using the blk-mq
dma api.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-9-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agoblk-integrity: use iterator for mapping sg
Keith Busch [Wed, 13 Aug 2025 15:31:51 +0000 (08:31 -0700)]
blk-integrity: use iterator for mapping sg

Modify blk_rq_map_integrity_sg to use the blk-mq mapping iterator. This
produces more efficient code and converges the integrity mapping
implementations to reduce future maintenance burdens.

The function implementation moves from blk-integrity.c to blk-mq-dma.c
in order to use the types and functions private to that file.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250813153153.3260897-8-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agoblk-mq-dma: add scatter-less integrity data DMA mapping
Keith Busch [Wed, 13 Aug 2025 15:31:50 +0000 (08:31 -0700)]
blk-mq-dma: add scatter-less integrity data DMA mapping

Similar to regular data, introduce more efficient integrity mapping
helpers that do away with the scatterlist structure. This uses the
block mapping iterator to add IOVA segments if IOMMU is enabled, or maps
directly if not. This also supports P2P segments if integrity data ever
wants to allocate that type of memory.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-7-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agoblk-mq-dma: move common dma start code to a helper
Keith Busch [Wed, 13 Aug 2025 15:31:49 +0000 (08:31 -0700)]
blk-mq-dma: move common dma start code to a helper

In preparing for dma mapping integrity metadata, move the common dma
setup to a helper.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-6-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agoblk-mq: remove REQ_P2PDMA flag
Keith Busch [Wed, 13 Aug 2025 15:31:48 +0000 (08:31 -0700)]
blk-mq: remove REQ_P2PDMA flag

It's not serving any particular purpose. pci_p2pdma_state() already has
all the appropriate checks, so the config and flag checks are not
guarding anything.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-5-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agoblk-mq-dma: require unmap caller provide p2p map type
Keith Busch [Wed, 13 Aug 2025 15:31:47 +0000 (08:31 -0700)]
blk-mq-dma: require unmap caller provide p2p map type

In preparing for integrity dma mappings, we can't rely on the request
flag because data and metadata may have different mapping types.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agoblk-mq-dma: provide the bio_vec array being iterated
Keith Busch [Wed, 13 Aug 2025 15:31:46 +0000 (08:31 -0700)]
blk-mq-dma: provide the bio_vec array being iterated

This will make it easier to add different sources of the bvec array,
like for upcoming integrity support, rather than assuming the bio's
bi_io_vec is used. It also makes iterating "special" payloads more in
common with iterating normal payloads.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2 weeks agoblk-mq-dma: create blk_map_iter type
Keith Busch [Wed, 13 Aug 2025 15:31:45 +0000 (08:31 -0700)]
blk-mq-dma: create blk_map_iter type

The req_iterator happens to have similar fields to what the dma
iterator needs, but we're not necessarily iterating a request's
bi_io_vec. Create a new type that can be amended for additional future
use.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Kanchan Joshi <joshi.k@samsung.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20250813153153.3260897-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
3 weeks agoLinux 6.17-rc3
Linus Torvalds [Sun, 24 Aug 2025 16:04:12 +0000 (12:04 -0400)]
Linux 6.17-rc3

3 weeks agoMerge tag 'i2c-for-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa...
Linus Torvalds [Sun, 24 Aug 2025 14:32:04 +0000 (10:32 -0400)]
Merge tag 'i2c-for-6.17-rc3' of git://git./linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:

 - hisi: update maintainership

 - fix several issues in rtl9300 xfer:
     - check message length boundaries
     - correct multi-byte value composition on write
     - increase polling timeout
     - fix block transfer protocol

* tag 'i2c-for-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: rtl9300: Add missing count byte for SMBus Block Ops
  i2c: rtl9300: Increase timeout for transfer polling
  i2c: rtl9300: Fix multi-byte I2C write
  i2c: rtl9300: Fix out-of-bounds bug in rtl9300_i2c_smbus_xfer
  MAINTAINERS: i2c: Update i2c_hisi entry