Ming Lei [Tue, 29 Apr 2025 02:29:36 +0000 (10:29 +0800)]
selftests: ublk: fix UBLK_F_NEED_GET_DATA
Commit
57e13a2e8cd2 ("selftests: ublk: support user recovery") starts to
support UBLK_F_NEED_GET_DATA for covering recovery feature, however the
ublk utility implementation isn't done correctly.
Fix it by supporting UBLK_F_NEED_GET_DATA correctly.
Also add test generic_07 for covering UBLK_F_NEED_GET_DATA.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes:
57e13a2e8cd2 ("selftests: ublk: support user recovery")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 25 Apr 2025 01:37:40 +0000 (09:37 +0800)]
ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd
ublk_cancel_cmd() calls io_uring_cmd_done() to complete uring_cmd, but
we may have scheduled task work via io_uring_cmd_complete_in_task() for
dispatching request, then kernel crash can be triggered.
Fix it by not trying to canceling the command if ublk block request is
started.
Fixes:
216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd")
Reported-by: Jared Holzman <jholzman@nvidia.com>
Tested-by: Jared Holzman <jholzman@nvidia.com>
Closes: https://lore.kernel.org/linux-block/
d2179120-171b-47ba-b664-
23242981ef19@nvidia.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250425013742.1079549-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 25 Apr 2025 01:37:39 +0000 (09:37 +0800)]
ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA
We call io_uring_cmd_complete_in_task() to schedule task_work for handling
UBLK_U_IO_NEED_GET_DATA.
This way is really not necessary because the current context is exactly
the ublk queue context, so call ublk_dispatch_req() directly for handling
UBLK_U_IO_NEED_GET_DATA.
Fixes:
216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd")
Tested-by: Jared Holzman <jholzman@nvidia.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250425013742.1079549-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 23 Apr 2025 05:37:42 +0000 (07:37 +0200)]
block: don't autoload drivers on blk-cgroup configuration
Loading a driver just to configure blk-cgroup doesn't make sense, as that
assumes and already existing device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 23 Apr 2025 05:37:41 +0000 (07:37 +0200)]
block: don't autoload drivers on stat
blkdev_get_no_open can trigger the legacy autoload of block drivers. A
simple stat of a block device has not historically done that, so disable
this behavior again.
Fixes:
9abcfbd235f5 ("block: Add atomic write support for statx")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 23 Apr 2025 05:37:40 +0000 (07:37 +0200)]
block: remove the backing_inode variable in bdev_statx
backing_inode is only used once, so remove it and update the comment
describing the bdev lookup to be a bit more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 23 Apr 2025 05:37:39 +0000 (07:37 +0200)]
block: move blkdev_{get,put} _no_open prototypes out of blkdev.h
These are only to be used by block internal code. Remove the comment
as we grew more users due to reworking block device node opening.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 24 Apr 2025 08:25:21 +0000 (10:25 +0200)]
block: never reduce ra_pages in blk_apply_bdi_limits
When the user increased the read-ahead size through sysfs this value
currently get lost if the device is reprobe, including on a resume
from suspend.
As there is no hardware limitation for the read-ahead size there is
no real need to reset it or track a separate hardware limitation
like for max_sectors.
This restores the pre-atomic queue limit behavior in the sd driver as
sd did not use blk_queue_io_opt and thus never updated the read ahead
size to the value based of the optimal I/O, but changes behavior for
all other drivers. As the new behavior seems useful and sd is the
driver for which the readahead size tweaks are most useful that seems
like a worthwhile trade off.
Fixes:
804e498e0496 ("sd: convert to the atomic queue limits API")
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Uday Shankar [Wed, 23 Apr 2025 21:29:03 +0000 (15:29 -0600)]
selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils
Some distributions, such as centos stream 9, still have a version of
coreutils which does not yet support the %Hr and %Lr formats for stat(1)
[1, 2]. Running ublk selftests on these distributions results in the
following error in tests that use the _get_disk_dev_t helper:
line 23: ?r: syntax error: operand expected (error token is "?r")
To better accommodate older distributions, rewrite _get_disk_dev_t to
use the much older %t and %T formats for stat instead.
[1] https://github.com/coreutils/coreutils/blob/v9.0/NEWS#L114
[2] https://pkgs.org/download/coreutils
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250423-ublk_selftests-v1-2-7d060e260e76@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 24 Apr 2025 12:27:54 +0000 (06:27 -0600)]
Merge tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme into block-6.15
Pull NVMe fix from Christoph:
"nvme fixes for Linux 6.15
- fix an out-of-bounds access in nvmet_enable_port (Richard Weinberger)"
* tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme:
nvmet: fix out-of-bounds access in nvmet_enable_port
Ming Lei [Mon, 21 Apr 2025 23:59:42 +0000 (07:59 +0800)]
selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx'
'delay_us' shouldn't be added to 'struct dev_ctx' since now it is
handled by per-target command line & 'struct fault_inject_ctx'.
So remove it.
Fixes:
81586652bb1f ("selftests: ublk: add generic_06 for covering fault inject")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250421235947.715272-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 21 Apr 2025 23:59:41 +0000 (07:59 +0800)]
selftests: ublk: fix recover test
When adding recovery test:
- 'break' is missed for handling '-g' argument
- test name of test_generic_05.sh is wrong
So fix the two.
Fixes:
57e13a2e8cd2 ("selftests: ublk: support user recovery")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250421235947.715272-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Darrick J. Wong [Wed, 23 Apr 2025 19:53:57 +0000 (12:53 -0700)]
block: hoist block size validation code to a separate function
Hoist the block size validation code to bdev_validate_blocksize so that
we can call it from filesystems that don't care about the bdev pagecache
manipulations of set_blocksize.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/174543795720.4139148.840349813093799165.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Darrick J. Wong [Wed, 23 Apr 2025 19:53:42 +0000 (12:53 -0700)]
block: fix race between set_blocksize and read paths
With the new large sector size support, it's now the case that
set_blocksize can change i_blksize and the folio order in a manner that
conflicts with a concurrent reader and causes a kernel crash.
Specifically, let's say that udev-worker calls libblkid to detect the
labels on a block device. The read call can create an order-0 folio to
read the first 4096 bytes from the disk. But then udev is preempted.
Next, someone tries to mount an 8k-sectorsize filesystem from the same
block device. The filesystem calls set_blksize, which sets i_blksize to
8192 and the minimum folio order to 1.
Now udev resumes, still holding the order-0 folio it allocated. It then
tries to schedule a read bio and do_mpage_readahead tries to create
bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because
the page size is 4096 but the blocksize is 8192 so no bufferheads are
attached and the bh walk never sets bdev. We then submit the bio with a
NULL block device and crash.
Therefore, truncate the page cache after flushing but before updating
i_blksize. However, that's not enough -- we also need to lock out file
IO and page faults during the update. Take both the i_rwsem and the
invalidate_lock in exclusive mode for invalidations, and in shared mode
for read/write operations.
I don't know if this is the correct fix, but xfs/259 found it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/174543795699.4139148.2086129139322431423.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Richard Weinberger [Fri, 18 Apr 2025 08:02:50 +0000 (10:02 +0200)]
nvmet: fix out-of-bounds access in nvmet_enable_port
When trying to enable a port that has no transport configured yet,
nvmet_enable_port() uses NVMF_TRTYPE_MAX (255) to query the transports
array, causing an out-of-bounds access:
[ 106.058694] BUG: KASAN: global-out-of-bounds in nvmet_enable_port+0x42/0x1da
[ 106.058719] Read of size 8 at addr
ffffffff89dafa58 by task ln/632
[...]
[ 106.076026] nvmet: transport type 255 not supported
Since commit
200adac75888, NVMF_TRTYPE_MAX is the default state as configured by
nvmet_ports_make().
Avoid this by checking for NVMF_TRTYPE_MAX before proceeding.
Fixes:
200adac75888 ("nvme: Add PCI transport type")
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Jens Axboe [Thu, 17 Apr 2025 12:18:49 +0000 (06:18 -0600)]
Merge tag 'nvme-6.15-2025-04-17' of git://git.infradead.org/nvme into block-6.15
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 6.15
- fix scan failure for non-ANA multipath controllers (Hannes Reinecke)
- fix multipath sysfs links creation for some cases (Hannes Reinecke)
- PCIe endpoint fixes (Damien Le Moal)
- use NULL instead of 0 in the auth code (Damien Le Moal)"
* tag 'nvme-6.15-2025-04-17' of git://git.infradead.org/nvme:
nvmet: pci-epf: cleanup link state management
nvmet: pci-epf: clear CC and CSTS when disabling the controller
nvmet: pci-epf: always fully initialize completion entries
nvmet: auth: use NULL to clear a pointer in nvmet_auth_sq_free()
nvme-multipath: sysfs links may not be created for devices
nvme: fixup scan failure for non-ANA multipath controllers
Jens Axboe [Thu, 17 Apr 2025 03:08:28 +0000 (21:08 -0600)]
Merge tag 'md-6.15-
20250416' of https://git./linux/kernel/git/mdraid/linux into block-6.15
Pull MD fixes from Yu:
"- fix raid10 missing discard IO accounting (Yu Kuai)
- fix bitmap stats for bitmap file (Zheng Qixing)
- fix oops while reading all member disks failed during check/repair
(Meir Elisha)"
* tag 'md-6.15-
20250416' of https://git.kernel.org/pub/scm/linux/kernel/git/mdraid/linux:
md/raid1: Add check for missing source disk in process_checks()
md/md-bitmap: fix stats collection for external bitmaps
md/raid10: fix missing discard IO accounting
Uday Shankar [Wed, 16 Apr 2025 03:54:42 +0000 (11:54 +0800)]
selftests: ublk: add generic_06 for covering fault inject
Add one simple fault inject target, and verify if an application using ublk
device sees an I/O error quickly after the ublk server dies.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Apr 2025 03:54:41 +0000 (11:54 +0800)]
ublk: simplify aborting ublk request
Now ublk_abort_queue() is moved to ublk char device release handler,
meantime our request queue is "quiesced" because either ->canceling was
set from uring_cmd cancel function or all IOs are inflight and can't be
completed by ublk server, things becomes easy much:
- all uring_cmd are done, so we needn't to mark io as UBLK_IO_FLAG_ABORTED
for handling completion from uring_cmd
- ublk char device is closed, no one can hold IO request reference any more,
so we can simply complete this request or requeue it for ublk_nosrv_should_reissue_outstanding.
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Apr 2025 03:54:40 +0000 (11:54 +0800)]
ublk: remove __ublk_quiesce_dev()
Remove __ublk_quiesce_dev() and open code for updating device state as
QUIESCED.
We needn't to drain inflight requests in __ublk_quiesce_dev() any more,
because all inflight requests are aborted in ublk char device release
handler.
Also we needn't to set ->canceling in __ublk_quiesce_dev() any more
because it is done unconditionally now in ublk_ch_release().
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Uday Shankar [Wed, 16 Apr 2025 03:54:39 +0000 (11:54 +0800)]
ublk: improve detection and handling of ublk server exit
There are currently two ways in which ublk server exit is detected by
ublk_drv:
1. uring_cmd cancellation. If there are any outstanding uring_cmds which
have not been completed to the ublk server when it exits, io_uring
calls the uring_cmd callback with a special cancellation flag as the
issuing task is exiting.
2. I/O timeout. This is needed in addition to the above to handle the
"saturated queue" case, when all I/Os for a given queue are in the
ublk server, and therefore there are no outstanding uring_cmds to
cancel when the ublk server exits.
There are a couple of issues with this approach:
- It is complex and inelegant to have two methods to detect the same
condition
- The second method detects ublk server exit only after a long delay
(~30s, the default timeout assigned by the block layer). This delays
the nosrv behavior from kicking in and potential subsequent recovery
of the device.
The second issue is brought to light with the new test_generic_06 which
will be added in following patch. It fails before this fix:
selftests: ublk: test_generic_06.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 30.0611 s, 0.0 kB/s
DEAD
dd took 31 seconds to exit (>= 5s tolerance)!
generic_06 : [FAIL]
Fix this by instead detecting and handling ublk server exit in the
character file release callback. This has several advantages:
- This one place can handle both saturated and unsaturated queues. Thus,
it replaces both preexisting methods of detecting ublk server exit.
- It runs quickly on ublk server exit - there is no 30s delay.
- It starts the process of removing task references in ublk_drv. This is
needed if we want to relax restrictions in the driver like letting
only one thread serve each queue
There is also the disadvantage that the character file release callback
can also be triggered by intentional close of the file, which is a
significant behavior change. Preexisting ublk servers (libublksrv) are
dependent on the ability to open/close the file multiple times. To
address this, only transition to a nosrv state if the file is released
while the ublk device is live. This allows for programs to open/close
the file multiple times during setup. It is still a behavior change if a
ublk server decides to close/reopen the file while the device is LIVE
(i.e. while it is responsible for serving I/O), but that would be highly
unusual. This behavior is in line with what is done by FUSE, which is
very similar to ublk in that a userspace daemon is providing services
traditionally provided by the kernel.
With this change in, the new test (and all other selftests, and all
ublksrv tests) pass:
selftests: ublk: test_generic_06.sh
dev id is 0
dd: error writing '/dev/ublkb0': Input/output error
1+0 records in
0+0 records out
0 bytes copied, 0.
0376731 s, 0.0 kB/s
DEAD
generic_04 : [PASS]
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Apr 2025 03:54:38 +0000 (11:54 +0800)]
ublk: move device reset into ublk_ch_release()
ublk_ch_release() is called after ublk char device is closed, when all
uring_cmd are done, so it is perfect fine to move ublk device reset to
ublk_ch_release() from ublk_ctrl_start_recovery().
This way can avoid to grab the exiting daemon task_struct too long.
However, reset of the following ublk IO flags has to be moved until ublk
io_uring queues are ready:
- ubq->canceling
For requeuing IO in case of ublk_nosrv_dev_should_queue_io() before device
is recovered
- ubq->fail_io
For failing IO in case of UBLK_F_USER_RECOVERY_FAIL_IO before device is
recovered
- ublk_io->flags
For preventing using io->cmd
With this way, recovery is simplified a lot.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Apr 2025 03:54:37 +0000 (11:54 +0800)]
ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
Now ublk deals with ublk_nosrv_dev_should_queue_io() by keeping request
queue as quiesced. This way is fragile because queue quiesce crosses syscalls
or process contexts.
Switch to rely on ubq->canceling for dealing with
ublk_nosrv_dev_should_queue_io(), because it has been used for this purpose
during io_uring context exiting, and it can be reused before recovering too.
In ublk_queue_rq(), the request will be added to requeue list without
kicking off requeue in case of ubq->canceling, and finally requests added in
requeue list will be dispatched from either ublk_stop_dev() or
ublk_ctrl_end_recovery().
Meantime we have to move reset of ubq->canceling from ublk_ctrl_start_recovery()
to ublk_ctrl_end_recovery(), when IO handling can be recovered completely.
Then blk_mq_quiesce_queue() and blk_mq_unquiesce_queue() are always used
in same context.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250416035444.99569-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 16 Apr 2025 03:54:36 +0000 (11:54 +0800)]
ublk: add ublk_force_abort_dev()
Add ublk_force_abort_dev() for handling ublk_nosrv_dev_should_queue_io()
in ublk_stop_dev(). Then queue quiesce and unquiesce can be paired in
single function.
Meantime not change device state to QUIESCED any more, since the disk is
going to be removed soon.
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Uday Shankar [Wed, 16 Apr 2025 03:54:35 +0000 (11:54 +0800)]
ublk: properly serialize all FETCH_REQs
Most uring_cmds issued against ublk character devices are serialized
because each command affects only one queue, and there is an early check
which only allows a single task (the queue's ubq_daemon) to issue
uring_cmds against that queue. However, this mechanism does not work for
FETCH_REQs, since they are expected before ubq_daemon is set. Since
FETCH_REQs are only used at initialization and not in the fast path,
serialize them using the per-ublk-device mutex. This fixes a number of
data races that were previously possible if a badly behaved ublk server
decided to issue multiple FETCH_REQs against the same qid/tag
concurrently.
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250416035444.99569-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:29 +0000 (10:30 +0800)]
selftests: ublk: move creating UBLK_TMP into _prep_test()
test may exit early because of missing program or not having required
feature before calling _prep_test(), then $UBLK_TMP isn't cleaned.
Fix it by moving creating $UBLK_TMP into _prep_test(), any resources
created since _prep_test() will be cleaned by _cleanup_test().
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-14-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:28 +0000 (10:30 +0800)]
selftests: ublk: add test_stress_05.sh
Add test_stress_05.sh for covering removing device with recovery
enabled.
io-hang has been observed with the following patch:
https://lore.kernel.org/linux-block/
20250403-ublk_timeout-v3-1-
aa09f76c7451@purestorage.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-13-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:27 +0000 (10:30 +0800)]
selftests: ublk: support user recovery
Add user recovery feature.
Meantime add user recovery test: generic_04 and generic_05(zero copy)
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-12-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:26 +0000 (10:30 +0800)]
selftests: ublk: support target specific command line
Support target specific command line for making related command line code
handling more readable & clean.
Also helps for adding new features.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-11-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:25 +0000 (10:30 +0800)]
selftests: ublk: increase max nr_queues and queue depth
Increase max nr_queues to 32, and queue depth to 1024.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-10-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:24 +0000 (10:30 +0800)]
selftests: ublk: set queue pthread's cpu affinity
In NUMA machine, ublk IO performance is very sensitive with queue
pthread's affinity setting.
Retrieve queue's affinity and select the 1st cpu as queue thread's sched
affinity, and it is observed that single cpu task affinity can get
stable & good performance if client application is put on proper cpu.
Dump this info when adding one ublk device. Use shmem to communicate
queue's tid between parent and daemon.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-9-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:23 +0000 (10:30 +0800)]
selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN
It is observed that this way is more efficient for fast nvme backing
file.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-8-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:22 +0000 (10:30 +0800)]
selftests: ublk: add two stress tests for zero copy feature
Add stress_03 & stress_04 for covering zero copy feature.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-7-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:21 +0000 (10:30 +0800)]
selftests: ublk: run stress tests in parallel
Run stress tests in parallel, meantime add shell local function to
simplify the two stress tests.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-6-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:20 +0000 (10:30 +0800)]
selftests: ublk: make sure _add_ublk_dev can return in sub-shell
Detach ublk daemon from the starting process completely by double-fork and
clearing its process group, so that `_add_ublk_dev` can return from sub-shell.
Then it is more friendly for writing shell test script for adding/recovering
ublk device.
Prepare for running ublk test in parallel.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:19 +0000 (10:30 +0800)]
selftests: ublk: cleanup backfile automatically
Use global array of $UBLK_BACKFILES for storing all backfile name, then
clean them automatically.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:18 +0000 (10:30 +0800)]
selftests: ublk: add io_uring uapi header
Add io_uring UAPI header so that ublk can work with latest uapi
definition.
Fix the following build failure:
stripe.c: In function ‘stripe_to_uring_op’:
stripe.c:120:29: error: ‘IORING_OP_READV_FIXED’ undeclared (first use in this function); did you mean ‘IORING_OP_READ_FIXED’?
120 | return zc ? IORING_OP_READV_FIXED : IORING_OP_READV;
| ^~~~~~~~~~~~~~~~~~~~~
| IORING_OP_READ_FIXED
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Fixes:
57ed58c13256 ("selftests: ublk: enable zero copy for stripe target")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Sat, 12 Apr 2025 02:30:17 +0000 (10:30 +0800)]
selftests: ublk: fix ublk_find_tgt()
Bounds check for iterator variable `i` is missed, so add it and fix
ublk_find_tgt().
Cc: Johannes Thumshirn <Johannes.Thumshirn@wdc.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250412023035.2649275-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Martin K. Petersen [Wed, 16 Apr 2025 20:04:10 +0000 (16:04 -0400)]
block: integrity: Do not call set_page_dirty_lock()
Placing multiple protection information buffers inside the same page
can lead to oopses because set_page_dirty_lock() can't be called from
interrupt context.
Since a protection information buffer is not backed by a file there is
no point in setting its page dirty, there is nothing to synchronize.
Drop the call to set_page_dirty_lock() and remove the last argument to
bio_integrity_unpin_bvec().
Cc: stable@vger.kernel.org
Fixes:
492c5d455969 ("block: bio-integrity: directly map user buffers")
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/yq1v7r3ev9g.fsf@ca-mkp.ca.oracle.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Meir Elisha [Tue, 8 Apr 2025 14:38:08 +0000 (17:38 +0300)]
md/raid1: Add check for missing source disk in process_checks()
During recovery/check operations, the process_checks function loops
through available disks to find a 'primary' source with successfully
read data.
If no suitable source disk is found after checking all possibilities,
the 'primary' index will reach conf->raid_disks * 2. Add an explicit
check for this condition after the loop. If no source disk was found,
print an error message and return early to prevent further processing
without a valid primary source.
Link: https://lore.kernel.org/linux-raid/20250408143808.1026534-1-meir.elisha@volumez.com
Signed-off-by: Meir Elisha <meir.elisha@volumez.com>
Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com>
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Damien Le Moal [Fri, 11 Apr 2025 01:42:11 +0000 (10:42 +0900)]
nvmet: pci-epf: cleanup link state management
Since the link_up boolean field of struct nvmet_pci_epf_ctrl is always
set to true when nvmet_pci_epf_start_ctrl() is called, assign true to
this field in nvmet_pci_epf_start_ctrl(). Conversely, since this field
is set to false when nvmet_pci_epf_stop_ctrl() is called, set this field
to false directly inside that function.
While at it, also add information messages to notify the user of the PCI
link state changes to help troubleshoot any link stability issues
without needing to enable debug messages.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Damien Le Moal [Fri, 11 Apr 2025 01:42:10 +0000 (10:42 +0900)]
nvmet: pci-epf: clear CC and CSTS when disabling the controller
When a host shuts down the controller when shutting down but does so
without first disabling the controller, the enable bit remains set in
the controller configuration register. When the host restarts and
attempts to enable the controller again, the
nvmet_pci_epf_poll_cc_work() function is unable to detect the change
from 0 to 1 of the enable bit, and thus the controller is not enabled
again, which result in a device scan timeout on the host. This problem
also occurs if the host shuts down uncleanly or if the PCIe link goes
down: as the CC.EN value is not reset, the controller is not enabled
again when the host restarts.
Fix this by introducing the function nvmet_pci_epf_clear_ctrl_config()
to clear the CC and CSTS registers of the controller when the PCIe link
is lost (nvmet_pci_epf_stop_ctrl() function), or when starting the
controller fails (nvmet_pci_epf_enable_ctrl() fails). Also use this
function in nvmet_pci_epf_init_bar() to simplify the initialization of
the CC and CSTS registers.
Furthermore, modify the function nvmet_pci_epf_disable_ctrl() to clear
the CC.EN bit and write this updated value to the BAR register when the
controller is shutdown by the host, to ensure that upon restart, we can
detect the host setting CC.EN.
Fixes:
0faa0fe6f90e ("nvmet: New NVMe PCI endpoint function target driver")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Damien Le Moal [Fri, 11 Apr 2025 01:42:09 +0000 (10:42 +0900)]
nvmet: pci-epf: always fully initialize completion entries
For a command that is normally processed through the command request
execute() function, the completion entry for the command is initialized
by __nvmet_req_complete() and nvmet_pci_epf_cq_work() only needs to set
the status field and the phase of the completion entry before posting
the entry to the completion queue.
However, for commands that are failed due to an internal error (e.g. the
command data buffer allocation fails), the command request execute()
function is not called and __nvmet_req_complete() is never executed for
the command, leaving the command completion entry uninitialized. For
such command failed before calling req->execute(), the host ends up
seeing completion entries with an invalid submission queue ID and
command ID.
Avoid such issue by always fully initilizing a command completion entry
in nvmet_pci_epf_cq_work(), setting the entry submission queue head, ID
and command ID.
Fixes:
0faa0fe6f90e ("nvmet: New NVMe PCI endpoint function target driver")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Niklas Cassel <cassel@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Damien Le Moal [Fri, 11 Apr 2025 01:00:15 +0000 (10:00 +0900)]
nvmet: auth: use NULL to clear a pointer in nvmet_auth_sq_free()
When compiling with C=1, the following sparse warning is generated:
auth.c:243:23: warning: Using plain integer as NULL pointer
Avoid this warning by using NULL to instead of 0 to set the sq tls_key
pointer.
Fixes:
fa2e0f8bbc68 ("nvmet-tcp: support secure channel concatenation")
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Tue, 15 Apr 2025 06:47:37 +0000 (08:47 +0200)]
nvme-multipath: sysfs links may not be created for devices
When rapidly rescanning for new namespaces nvme_mpath_add_sysfs_link() may be
called for a block device not added to sysfs. But NVME_NS_SYSFS_ATTR_LINK
had already been set, so when checking this device a second time we will fail
to create the link.
Fix this by exchanging the order of the block device check and the
NVME_NS_SYSFS_ATTR_LINK bit check.
Fixes:
4dbd2b2ebe4c ("nvme-multipath: Add visibility for round-robin io-policy")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>**
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Mon, 14 Apr 2025 12:05:09 +0000 (14:05 +0200)]
nvme: fixup scan failure for non-ANA multipath controllers
Commit
62baf70c3274 caused the ANA log page to be re-read, even on
controllers that do not support ANA. While this should generally
harmless, some controllers hang on the unsupported log page and
never finish probing.
Fixes:
62baf70c3274 ("nvme: re-read ANA log page after ns scan completes")
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Tested-by: Srikanth Aithal <sraithal@amd.com>
[hch: more detailed commit message]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Caleb Sander Mateos [Wed, 16 Apr 2025 00:41:10 +0000 (18:41 -0600)]
ublk: don't suggest CONFIG_BLK_DEV_UBLK=Y
The CONFIG_BLK_DEV_UBLK help text suggests setting the config option to
Y so task_work_add() can be used to dispatch I/O, improving performance.
However, this mechanism was removed in commit
29dc5d06613f2 ("ublk: kill
queuing request by task_work_add"). So remove this paragraph from the
config help text.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250416004111.3242817-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 9 Apr 2025 13:09:40 +0000 (15:09 +0200)]
loop: stop using vfs_iter_{read,write} for buffered I/O
vfs_iter_{read,write} always perform direct I/O when the file has the
O_DIRECT flag set, which breaks disabling direct I/O using the
LOOP_SET_STATUS / LOOP_SET_STATUS64 ioctls.
This was recenly reported as a regression, but as far as I can tell
was only uncovered by better checking for block sizes and has been
around since the direct I/O support was added.
Fix this by using the existing aio code that calls the raw read/write
iter methods instead. Note that despite the comments there is no need
for block drivers to ever call flush_dcache_page themselves, and the
call is a left-over from prehistoric times.
Fixes:
ab1cb278bc70 ("block: loop: introduce ioctl command of LOOP_SET_DIRECT_IO")
Reported-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Tested-by: Darrick J. Wong <djwong@kernel.org>
Link: https://lore.kernel.org/r/20250409130940.3685677-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Thomas Weißschuh [Tue, 15 Apr 2025 14:55:06 +0000 (16:55 +0200)]
loop: LOOP_SET_FD: send uevents for partitions
Remove the suppression of the uevents before scanning for partitions.
The partitions inherit their suppression settings from their parent device,
which lead to the uevents being dropped.
This is similar to the same changes for LOOP_CONFIGURE done in
commit
bb430b694226 ("loop: LOOP_CONFIGURE: send uevents for partitions").
Fixes:
498ef5c777d9 ("loop: suppress uevents while reconfiguring the device")
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20250415-loop-uevent-changed-v3-1-60ff69ac6088@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Thomas Weißschuh [Tue, 15 Apr 2025 08:51:47 +0000 (10:51 +0200)]
loop: properly send KOBJ_CHANGED uevent for disk device
The original commit message and the wording "uncork" in the code comment
indicate that it is expected that the suppressed event instances are
automatically sent after unsuppressing.
This is not the case, instead they are discarded.
In effect this means that no "changed" events are emitted on the device
itself by default.
While each discovered partition does trigger a changed event on the
device, devices without partitions don't have any event emitted.
This makes udev miss the device creation and prompted workarounds in
userspace. See the linked util-linux/losetup bug.
Explicitly emit the events and drop the confusingly worded comments.
Link: https://github.com/util-linux/util-linux/issues/2434
Fixes:
498ef5c777d9 ("loop: suppress uevents while reconfiguring the device")
Cc: stable@vger.kernel.org
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://lore.kernel.org/r/20250415-loop-uevent-changed-v2-1-0c4e6a923b2a@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Yunlong Xing [Mon, 14 Apr 2025 03:01:59 +0000 (11:01 +0800)]
loop: aio inherit the ioprio of original request
Set cmd->iocb.ki_ioprio to the ioprio of loop device's request.
The purpose is to inherit the original request ioprio in the aio
flow.
Signed-off-by: Yunlong Xing <yunlong.xing@unisoc.com>
Signed-off-by: Zhiguo Niu <zhiguo.niu@unisoc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250414030159.501180-1-yunlong.xing@unisoc.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Zheng Qixing [Sat, 12 Apr 2025 09:25:54 +0000 (17:25 +0800)]
block: fix resource leak in blk_register_queue() error path
When registering a queue fails after blk_mq_sysfs_register() is
successful but the function later encounters an error, we need
to clean up the blk_mq_sysfs resources.
Add the missing blk_mq_sysfs_unregister() call in the error path
to properly clean up these resources and prevent a memory leak.
Fixes:
320ae51feed5 ("blk-mq: new multi-queue block IO queueing mechanism")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250412092554.475218-1-zhengqixing@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bird, Tim [Fri, 11 Apr 2025 19:20:18 +0000 (19:20 +0000)]
block: add SPDX header line to blk-throttle.h
Add an SPDX license identifier line to blk-throttle.h
Use 'GPL-2.0' as the identifier, since blk-throttle.c uses
that, and blk.h (from which some material was copied when
blk-throttle.h was created) also uses that identifier.
Signed-off-by: Tim Bird <tim.bird@sony.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/MW5PR13MB5632EE4645BCA24ED111EC0EFDB62@MW5PR13MB5632.namprd13.prod.outlook.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Thorsten Blum [Thu, 10 Apr 2025 15:47:23 +0000 (17:47 +0200)]
null_blk: Use strscpy() instead of strscpy_pad() in null_add_dev()
blk_mq_alloc_disk() already zero-initializes the destination buffer,
making strscpy() sufficient for safely copying the disk's name. The
additional NUL-padding performed by strscpy_pad() is unnecessary.
If the destination buffer has a fixed length, strscpy() automatically
determines its size using sizeof() when the argument is omitted. This
makes the explicit size argument unnecessary.
The source string is also NUL-terminated and meets the __must_be_cstr()
requirement of strscpy().
No functional changes intended.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Zhu Yanjun <yanjun.zhu@linux.dev>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Link: https://lore.kernel.org/r/20250410154727.883207-1-thorsten.blum@linux.dev
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 10 Apr 2025 15:28:58 +0000 (09:28 -0600)]
Merge tag 'nvme-6.15-2025-04-10' of git://git.infradead.org/nvme into block-6.15
Pull NVMe updates from Christoph:
"nvme updates for Linux 6.15
- nvmet fc/fcloop refcounting fixes (Daniel Wagner)
- fix missed namespace/ANA scans (Hannes Reinecke)
- fix a use after free in the new TCP netns support (Kuniyuki Iwashima)
- fix a NULL instead of false review in multipath (Uday Shankar)"
* tag 'nvme-6.15-2025-04-10' of git://git.infradead.org/nvme:
nvmet-fc: put ref when assoc->del_work is already scheduled
nvmet-fc: take tgtport reference only once
nvmet-fc: update tgtport ref per assoc
nvmet-fc: inline nvmet_fc_free_hostport
nvmet-fc: inline nvmet_fc_delete_assoc
nvmet-fcloop: add ref counting to lport
nvmet-fcloop: replace kref with refcount
nvmet-fcloop: swap list_add_tail arguments
nvme-tcp: fix use-after-free of netns by kernel TCP socket.
nvme: multipath: fix return value of nvme_available_path
nvme: re-read ANA log page after ns scan completes
nvme: requeue namespace scan on missed AENs
Caleb Sander Mateos [Wed, 9 Apr 2025 01:29:26 +0000 (19:29 -0600)]
ublk: pass ublksrv_ctrl_cmd * instead of io_uring_cmd *
The ublk_ctrl_*() handlers all take struct io_uring_cmd *cmd but only
use it to get struct ublksrv_ctrl_cmd *header from the io_uring SQE.
Since the caller ublk_ctrl_uring_cmd() has already computed header, pass
it instead of cmd.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250409012928.3527198-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 9 Apr 2025 01:14:42 +0000 (09:14 +0800)]
ublk: don't fail request for recovery & reissue in case of ubq->canceling
ubq->canceling is set with request queue quiesced when io_uring context is
exiting. USER_RECOVERY or !RECOVERY_FAIL_IO requires request to be re-queued
and re-dispatch after device is recovered.
However commit
d796cea7b9f3 ("ublk: implement ->queue_rqs()") still may fail
any request in case of ubq->canceling, this way breaks USER_RECOVERY or
!RECOVERY_FAIL_IO.
Fix it by calling __ublk_abort_rq() in case of ubq->canceling.
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Reported-by: Uday Shankar <ushankar@purestorage.com>
Closes: https://lore.kernel.org/linux-block/Z%2FQkkTRHfRxtN%2FmB@dev-ushankar.dev.purestorage.com/
Fixes:
d796cea7b9f3 ("ublk: implement ->queue_rqs()")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250409011444.2142010-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Wed, 9 Apr 2025 01:14:41 +0000 (09:14 +0800)]
ublk: fix handling recovery & reissue in ublk_abort_queue()
Commit
8284066946e6 ("ublk: grab request reference when the request is handled
by userspace") doesn't grab request reference in case of recovery reissue.
Then the request can be requeued & re-dispatch & failed when canceling
uring command.
If it is one zc request, the request can be freed before io_uring
returns the zc buffer back, then cause kernel panic:
[ 126.773061] BUG: kernel NULL pointer dereference, address:
00000000000000c8
[ 126.773657] #PF: supervisor read access in kernel mode
[ 126.774052] #PF: error_code(0x0000) - not-present page
[ 126.774455] PGD 0 P4D 0
[ 126.774698] Oops: Oops: 0000 [#1] SMP NOPTI
[ 126.775034] CPU: 13 UID: 0 PID: 1612 Comm: kworker/u64:55 Not tainted 6.14.0_blk+ #182 PREEMPT(full)
[ 126.775676] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-1.fc39 04/01/2014
[ 126.776275] Workqueue: iou_exit io_ring_exit_work
[ 126.776651] RIP: 0010:ublk_io_release+0x14/0x130 [ublk_drv]
Fixes it by always grabbing request reference for aborting the request.
Reported-by: Caleb Sander Mateos <csander@purestorage.com>
Closes: https://lore.kernel.org/linux-block/CADUfDZodKfOGUeWrnAxcZiLT+puaZX8jDHoj_sfHZCOZwhzz6A@mail.gmail.com/
Fixes:
8284066946e6 ("ublk: grab request reference when the request is handled by userspace")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250409011444.2142010-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Daniel Wagner [Tue, 8 Apr 2025 15:29:10 +0000 (17:29 +0200)]
nvmet-fc: put ref when assoc->del_work is already scheduled
Do not leak the tgtport reference when the work is already scheduled.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Daniel Wagner [Tue, 8 Apr 2025 15:29:09 +0000 (17:29 +0200)]
nvmet-fc: take tgtport reference only once
The reference counting code can be simplified. Instead taking a tgtport
refrerence at the beginning of nvmet_fc_alloc_hostport and put it back
if not a new hostport object is allocated, only take it when a new
hostport object is allocated.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Daniel Wagner [Tue, 8 Apr 2025 15:29:08 +0000 (17:29 +0200)]
nvmet-fc: update tgtport ref per assoc
We need to take for each unique association a reference.
nvmet_fc_alloc_hostport for each newly created association.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Daniel Wagner [Tue, 8 Apr 2025 15:29:07 +0000 (17:29 +0200)]
nvmet-fc: inline nvmet_fc_free_hostport
No need for this tiny helper with only one user, let's inline it.
And since the hostport ref counter needs to stay in sync, it's not
optional anymore to give back the reference.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Daniel Wagner [Tue, 8 Apr 2025 15:29:06 +0000 (17:29 +0200)]
nvmet-fc: inline nvmet_fc_delete_assoc
No need for this tiny helper with only one user, just inline it.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Daniel Wagner [Tue, 8 Apr 2025 15:29:05 +0000 (17:29 +0200)]
nvmet-fcloop: add ref counting to lport
The fcloop_lport objects live time is controlled by the user interface
add_local_port and del_local_port. nport, rport and tport objects are
pointing to the lport objects but here is no clear tracking. Let's
introduce an explicit ref counter for the lport objects and prepare the
stage for restructuring how lports are used.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Daniel Wagner [Tue, 8 Apr 2025 15:29:04 +0000 (17:29 +0200)]
nvmet-fcloop: replace kref with refcount
The kref wrapper is not really adding any value ontop of refcount. Thus
replace the kref API with the refcount API.
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Daniel Wagner [Tue, 8 Apr 2025 15:29:03 +0000 (17:29 +0200)]
nvmet-fcloop: swap list_add_tail arguments
The newly element to be added to the list is the first argument of
list_add_tail. This fix is missing
dcfad4ab4d67 ("nvmet-fcloop: swap
the list_add_tail arguments").
Fixes:
437c0b824dbd ("nvme-fcloop: add target to host LS request support")
Signed-off-by: Daniel Wagner <wagi@kernel.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Kuniyuki Iwashima [Tue, 8 Apr 2025 22:40:54 +0000 (15:40 -0700)]
nvme-tcp: fix use-after-free of netns by kernel TCP socket.
Commit
1be52169c348 ("nvme-tcp: fix selinux denied when calling
sock_sendmsg") converted sock_create() in nvme_tcp_alloc_queue()
to sock_create_kern().
sock_create_kern() creates a kernel socket, which does not hold
a reference to netns. If the code does not manage the netns
lifetime properly, use-after-free could happen.
Also, TCP kernel socket with sk_net_refcnt 0 has a socket leak
problem: it remains FIN_WAIT_1 if it misses FIN after close()
because tcp_close() stops all timers.
To fix such problems, let's hold netns ref by sk_net_refcnt_upgrade().
We had the same issue in CIFS, SMC, etc, and applied the same
solution, see commit
ef7134c7fc48 ("smb: client: Fix use-after-free
of network namespace.") and commit
9744d2bf1976 ("smc: Fix
use-after-free in tcp_write_timer_handler().").
Fixes:
1be52169c348 ("nvme-tcp: fix selinux denied when calling sock_sendmsg")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Uday Shankar [Fri, 4 Apr 2025 20:06:43 +0000 (14:06 -0600)]
nvme: multipath: fix return value of nvme_available_path
The function returns bool so we should return false, not NULL. No
functional changes are expected.
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Thu, 3 Apr 2025 07:19:30 +0000 (09:19 +0200)]
nvme: re-read ANA log page after ns scan completes
When scanning for new namespaces we might have missed an ANA AEN.
The NVMe base spec (NVMe Base Specification v2.1, Figure 151 'Asynchonous
Event Information - Notice': Asymmetric Namespace Access Change) states:
A controller shall not send this even if an Attached Namespace
Attribute Changed asynchronous event [...] is sent for the same event.
so we need to re-read the ANA log page after we rescanned the namespace
list to update the ANA states of the new namespaces.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Thu, 3 Apr 2025 07:19:29 +0000 (09:19 +0200)]
nvme: requeue namespace scan on missed AENs
Scanning for namespaces can take some time, so if the target is
reconfigured while the scan is running we may miss a Attached Namespace
Attribute Changed AEN.
Check if the NVME_AER_NOTICE_NS_CHANGED bit is set once the scan has
finished, and requeue scanning to pick up any missed change.
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Zheng Qixing [Thu, 3 Apr 2025 01:53:22 +0000 (09:53 +0800)]
md/md-bitmap: fix stats collection for external bitmaps
The bitmap_get_stats() function incorrectly returns -ENOENT for external
bitmaps.
Remove the external bitmap check as the statistics should be available
regardless of bitmap storage location.
Return -EINVAL only for invalid bitmap with no storage (neither in
superblock nor in external file).
Note: "bitmap_info.external" here refers to a bitmap stored in a separate
file (bitmap_file), not to external metadata.
Fixes:
8d28d0ddb986 ("md/md-bitmap: Synchronize bitmap_get_stats() with bitmap lifetime")
Signed-off-by: Zheng Qixing <zhengqixing@huawei.com>
Link: https://lore.kernel.org/linux-raid/20250403015322.2873369-1-zhengqixing@huaweicloud.com
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Yu Kuai [Tue, 25 Mar 2025 01:57:46 +0000 (09:57 +0800)]
md/raid10: fix missing discard IO accounting
md_account_bio() is not called from raid10_handle_discard(), now that we
handle bitmap inside md_account_bio(), also fix missing
bitmap_startwrite for discard.
Test whole disk discard for 20G raid10:
Before:
Device d/s dMB/s drqm/s %drqm d_await dareq-sz
md0 48.00 16.00 0.00 0.00 5.42 341.33
After:
Device d/s dMB/s drqm/s %drqm d_await dareq-sz
md0 68.00 20462.00 0.00 0.00 2.65 308133.65
Link: https://lore.kernel.org/linux-raid/20250325015746.3195035-1-yukuai1@huaweicloud.com
Fixes:
528bc2cf2fcc ("md/raid10: enable io accounting")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Coly Li <colyli@kernel.org>
Ming Lei [Fri, 4 Apr 2025 00:18:49 +0000 (08:18 +0800)]
selftests: ublk: fix test_stripe_04
Commit
57ed58c13256 ("selftests: ublk: enable zero copy for stripe target")
added test entry of test_stripe_04, but forgot to add the test script.
So fix the test by adding the script file.
Reported-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250404001849.1443064-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Thu, 3 Apr 2025 23:18:06 +0000 (16:18 -0700)]
Merge tag 'v6.15rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
"Four ksmbd SMB3 server fixes, all also for stable"
* tag 'v6.15rc-part2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix null pointer dereference in alloc_preauth_hash()
ksmbd: validate zero num_subauth before sub_auth is accessed
ksmbd: fix overflow in dacloffset bounds check
ksmbd: fix session use-after-free in multichannel connection
Linus Torvalds [Thu, 3 Apr 2025 23:09:29 +0000 (16:09 -0700)]
Merge tag 'trace-ringbuffer-v6.15-3' of git://git./linux/kernel/git/trace/linux-trace
Pull ring-buffer updates from Steven Rostedt:
"Persistent buffer cleanups and simplifications.
It was mistaken that the physical memory returned from "reserve_mem"
had to be vmap()'d to get to it from a virtual address. But
reserve_mem already maps the memory to the virtual address of the
kernel so a simple phys_to_virt() can be used to get to the virtual
address from the physical memory returned by "reserve_mem". With this
new found knowledge, the code can be cleaned up and simplified.
- Enforce that the persistent memory is page aligned
As the buffers using the persistent memory are all going to be
mapped via pages, make sure that the memory given to the tracing
infrastructure is page aligned. If it is not, it will print a
warning and fail to map the buffer.
- Use phys_to_virt() to get the virtual address from reserve_mem
Instead of calling vmap() on the physical memory returned from
"reserve_mem", use phys_to_virt() instead.
As the memory returned by "memmap" or any other means where a
physical address is given to the tracing infrastructure, it still
needs to be vmap(). Since this memory can never be returned back to
the buddy allocator nor should it ever be memmory mapped to user
space, flag this buffer and up the ref count. The ref count will
keep it from ever being freed, and the flag will prevent it from
ever being memory mapped to user space.
- Use vmap_page_range() for memmap virtual address mapping
For the memmap buffer, instead of allocating an array of struct
pages, assigning them to the contiguous phsycial memory and then
passing that to vmap(), use vmap_page_range() instead
- Replace flush_dcache_folio() with flush_kernel_vmap_range()
Instead of calling virt_to_folio() and passing that to
flush_dcache_folio(), just call flush_kernel_vmap_range() directly.
This also fixes a bug where if a subbuffer was bigger than
PAGE_SIZE only the PAGE_SIZE portion would be flushed"
* tag 'trace-ringbuffer-v6.15-3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ring-buffer: Use flush_kernel_vmap_range() over flush_dcache_folio()
tracing: Use vmap_page_range() to map memmap ring buffer
tracing: Have reserve_mem use phys_to_virt() and separate from memmap buffer
tracing: Enforce the persistent ring buffer to be page aligned
Linus Torvalds [Thu, 3 Apr 2025 23:04:38 +0000 (16:04 -0700)]
Merge tag 'block-6.15-
20250403' of git://git.kernel.dk/linux
Pull more block updates from Jens Axboe:
- NVMe pull request via Keith:
- PCI endpoint target cleanup (Damien)
- Early import for uring_cmd fixed buffer (Caleb)
- Multipath documentation and notification improvements (John)
- Invalid pci sq doorbell write fix (Maurizio)
- Queue init locking fix
- Remove dead nsegs parameter from blk_mq_get_new_requests()
* tag 'block-6.15-
20250403' of git://git.kernel.dk/linux:
block: don't grab elevator lock during queue initialization
nvme-pci: skip nvme_write_sq_db on empty rqlist
nvme-multipath: change the NVME_MULTIPATH config option
nvme: update the multipath warning in nvme_init_ns_head
nvme/ioctl: move fixed buffer lookup to nvme_uring_cmd_io()
nvme/ioctl: move blk_mq_free_request() out of nvme_map_user_request()
nvme/ioctl: don't warn on vectorized uring_cmd with fixed buffer
nvmet: pci-epf: Keep completion queues mapped
block: remove unused nseg parameter
Linus Torvalds [Thu, 3 Apr 2025 22:48:58 +0000 (15:48 -0700)]
Merge tag 'io_uring-6.15-
20250403' of git://git.kernel.dk/linux
Pull more io_uring updates from Jens Axboe:
"Set of fixes/updates for io_uring that should go into this release.
The ublk bits could've gone via either tree - usually I put them in
block, but they got a bit mixed this series with the zero-copy
supported that ended up dipping into both trees.
This contains:
- Fix for sendmsg zc, include in pinned pages accounting like we do
for the other zc types
- Series for ublk fixing request aborting, doing various little
cleanups, fixing some zc issues, and adding queue_rqs support
- Another ublk series doing some code cleanups
- Series cleaning up the io_uring send path, mostly in preparation
for registered buffers
- Series doing little MSG_RING cleanups
- Fix for the newly added zc rx, fixing len being 0 for the last
invocation of the callback
- Add vectored registered buffer support for ublk. With that, then
ublk also supports this feature in the kernel revision where it
could generically introduced for rw/net
- A bunch of selftest additions for ublk. This is the majority of the
diffstat
- Silence a KCSAN data race warning for io-wq
- Various little cleanups and fixes"
* tag 'io_uring-6.15-
20250403' of git://git.kernel.dk/linux: (44 commits)
io_uring: always do atomic put from iowq
selftests: ublk: enable zero copy for stripe target
io_uring: support vectored kernel fixed buffer
block: add for_each_mp_bvec()
io_uring: add validate_fixed_range() for validate fixed buffer
selftests: ublk: kublk: fix an error log line
selftests: ublk: kublk: use ioctl-encoded opcodes
io_uring/zcrx: return early from io_zcrx_recv_skb if readlen is 0
io_uring/net: avoid import_ubuf for regvec send
io_uring/rsrc: check size when importing reg buffer
io_uring: cleanup {g,s]etsockopt sqe reading
io_uring: hide caches sqes from drivers
io_uring: make zcrx depend on CONFIG_IO_URING
io_uring: add req flag invariant build assertion
Documentation: ublk: remove dead footnote
selftests: ublk: specify io_cmd_buf pointer type
ublk: specify io_cmd_buf pointer type
io_uring: don't pass ctx to tw add remote helper
io_uring/msg: initialise msg request opcode
io_uring/msg: rename io_double_lock_ctx()
...
Christian Brauner [Thu, 3 Apr 2025 14:43:50 +0000 (16:43 +0200)]
fs: actually hold the namespace semaphore
Don't use a scoped guard that only protects the next statement.
Use a regular guard to make sure that the namespace semaphore is held
across the whole function.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Reported-by: Leon Romanovsky <leon@kernel.org>
Link: https://lore.kernel.org/all/20250401170715.GA112019@unreal/
Fixes:
db04662e2f4f ("fs: allow detached mounts in clone_private_mount()")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 3 Apr 2025 22:39:47 +0000 (15:39 -0700)]
Merge tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs
Pull more bcachefs updates from Kent Overstreet:
"More notable fixes:
- Fix for striping behaviour on tiering filesystems where replicas
exceeds durability on destination target
- Fix a race in device removal where deleting alloc info races with
the discard worker
- Some small stack usage improvements: this is just enough for KMSAN
builds to not blow the stack, more is queued up for 6.16"
* tag 'bcachefs-2025-04-03' of git://evilpiepirate.org/bcachefs:
bcachefs: Fix "journal stuck" during recovery
bcachefs: backpointer_get_key: check for null from peek_slot()
bcachefs: Fix null ptr deref in invalidate_one_bucket()
bcachefs: Fix check_snapshot_exists() restart handling
bcachefs: use nonblocking variant of print_string_as_lines in error path
bcachefs: Fix scheduling while atomic from logging changes
bcachefs: Add error handling for zlib_deflateInit2()
bcachefs: add missing selection of XARRAY_MULTI
bcachefs: bch_dev_usage_full
bcachefs: Kill btree_iter.trans
bcachefs: do_trace_key_cache_fill()
bcachefs: Split up bch_dev.io_ref
bcachefs: fix ref leak in btree_node_read_all_replicas
bcachefs: Fix null ptr deref in bch2_write_endio()
bcachefs: Fix field spanning write warning
bcachefs: Fix striping behaviour
Linus Torvalds [Thu, 3 Apr 2025 22:35:46 +0000 (15:35 -0700)]
Merge tag '9p-for-6.15-rc1' of https://github.com/martinetd/linux
Pull 9p updates from Dominique Martinet:
- fix handling of bogus (negative/too long) replies
- fix crash on mkdir with ACLs (... looks like nobody is using ACLs
with semi-recent kernels...)
- ipv6 support for trans=tcp
- minor concurrency fix to make syzbot happy
- minor cleanup
* tag '9p-for-6.15-rc1' of https://github.com/martinetd/linux:
docs: fs/9p: Add missing "not" in cache documentation
9p: Use hashtable.h for hash_errmap
Documentation/fs/9p: fix broken link
9p/trans_fd: mark concurrent read and writes to p9_conn->err
9p/net: return error on bogus (longer than requested) replies
9p/net: fix improper handling of bogus negative read/write replies
fs/9p: fix NULL pointer dereference on mkdir
net/9p/fd: support ipv6 for trans=tcp
Linus Torvalds [Thu, 3 Apr 2025 22:31:14 +0000 (15:31 -0700)]
Merge tag 'rtc-6.15' of git://git./linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"We see a net reduction of the number of lines of code thanks to the
removal of a now unused driver and a testing tool that is not used
anymore. Apart from this, the max31335 driver gets support for a new
part number and pm8xxx gets UEFI support.
Core:
- setdate is removed as it has better replacements
- skip alarms with a second resolution when we know the RTC doesn't
support those.
Subsystem:
- remove unnecessary private struct members
- use devm_pm_set_wake_irq were relevant
Drivers:
- ds1307: stop disabling alarms on probe for DS1337, DS1339, DS1341
and DS3231
- max31335: add max31331 support
- pcf50633 is removed as support for the related SoC has been removed
- pcf85063: properly handle POR failures"
* tag 'rtc-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (50 commits)
rtc: remove 'setdate' test program
selftest: rtc: skip some tests if the alarm only supports minutes
rtc: mt6397: drop unused defines
rtc: pcf85063: replace dev_err+return with return dev_err_probe
rtc: pcf85063: do a SW reset if POR failed
rtc: max31335: Add driver support for max31331
dt-bindings: rtc: max31335: Add max31331 support
rtc: cros-ec: Avoid a couple of -Wflex-array-member-not-at-end warnings
dt-bindings: rtc: pcf2127: Reference spi-peripheral-props.yaml
rtc: rzn1: implement one-second accuracy for alarms
rtc: pcf50633: Remove
rtc: pm8xxx: implement qcom,no-alarm flag for non-HLOS owned alarm
rtc: pm8xxx: mitigate flash wear
rtc: pm8xxx: add support for uefi offset
dt-bindings: rtc: qcom-pm8xxx: document qcom,no-alarm flag
rtc: rv3032: drop WADA
rtc: rv3032: fix EERD location
rtc: pm8xxx: switch to devm_device_init_wakeup
rtc: pm8xxx: fix possible race condition
rtc: mpfs: switch to devm_device_init_wakeup
...
Linus Torvalds [Thu, 3 Apr 2025 19:21:44 +0000 (12:21 -0700)]
Merge tag 'for-linus' of git://git./linux/kernel/git/rmk/linux
Pull ARM and clkdev updates from Russell King:
- Simplify ARM_MMU_KEEP usage
- Add Rust support for ARM architecture version 7
- Align IPIs reported in /proc/interrupts
- require linker to support KEEP within OVERLAY
- add KEEP() for ARM vectors
- add __printf() attribute for clkdev functions
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rmk/linux:
ARM: 9445/1: clkdev: Mark some functions with __printf() attribute
ARM: 9444/1: add KEEP() keyword to ARM_VECTORS
ARM: 9443/1: Require linker to support KEEP within OVERLAY for DCE
ARM: 9442/1: smp: Fix IPI alignment in /proc/interrupts
ARM: 9441/1: rust: Enable Rust support for ARMv7
ARM: 9439/1: arm32: simplify ARM_MMU_KEEP usage
Linus Torvalds [Thu, 3 Apr 2025 19:07:01 +0000 (12:07 -0700)]
Merge tag 'arm64-fixes' of git://git./linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Fix max_pfn calculation when hotplugging memory so that it never
decreases
- Fix dereference of unused source register in the MOPS SET operation
fault handling
- Fix NULL calling in do_compat_alignment_fixup() when the 32-bit user
space does an unaligned LDREX/STREX
- Add the HiSilicon HIP09 processor to the Spectre-BHB affected CPUs
- Drop unused code pud accessors (special/mkspecial)
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: Don't call NULL in do_compat_alignment_fixup()
arm64: Add support for HIP09 Spectre-BHB mitigation
arm64: mm: Drop dead code for pud special bit handling
arm64: mops: Do not dereference src reg for a set operation
arm64: mm: Correct the update of max_pfn
Linus Torvalds [Thu, 3 Apr 2025 18:55:41 +0000 (11:55 -0700)]
Merge tag 'bpf-fixes' of git://git./linux/kernel/git/bpf/bpf
Pull bpf fixes from Alexei Starovoitov:
- Fix BPF selftests expectations of assembler output and struct layout
(Song Liu and Yonghong Song)
- Fix XSK error code when queue is full (Wang Liang)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
selftests/bpf: Fix verifier_private_stack test failure
selftests/bpf: Fix verifier_bpf_fastcall test
selftests/bpf: Fix tests after fields reorder in struct file
xsk: Fix __xsk_generic_xmit() error code when cq is full
Linus Torvalds [Thu, 3 Apr 2025 18:16:57 +0000 (11:16 -0700)]
Merge tag 'mm-nonmm-stable-2025-04-02-22-12' of git://git./linux/kernel/git/akpm/mm
Pull more non-MM updates from Andrew Morton:
"One bugfix and a couple of small late-arriving updates"
* tag 'mm-nonmm-stable-2025-04-02-22-12' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
lib: scatterlist: fix sg_split_phys to preserve original scatterlist offsets
lib/sort.c: add _nonatomic() variants with cond_resched()
mailmap: add an entry for Nicolas Schier
Linus Torvalds [Thu, 3 Apr 2025 18:10:00 +0000 (11:10 -0700)]
Merge tag 'mm-stable-2025-04-02-22-07' of git://git./linux/kernel/git/akpm/mm
Pull more MM updates from Andrew Morton:
- The series "mm: fixes for fallouts from mem_init() cleanup" from Mike
Rapoport fixes a couple of issues with the just-merged "arch, mm:
reduce code duplication in mem_init()" series
- The series "MAINTAINERS: add my isub-entries to MM part." from Mike
Rapoport does some maintenance on MAINTAINERS
- The series "remove tlb_remove_page_ptdesc()" from Qi Zheng does some
cleanup work to the page mapping code
- The series "mseal system mappings" from Jeff Xu permits sealing of
"system mappings", such as vdso, vvar, vvar_vclock, vectors (arm
compat-mode), sigpage (arm compat-mode)
- Plus the usual shower of singleton patches
* tag 'mm-stable-2025-04-02-22-07' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (31 commits)
mseal sysmap: add arch-support txt
mseal sysmap: enable s390
selftest: test system mappings are sealed
mseal sysmap: update mseal.rst
mseal sysmap: uprobe mapping
mseal sysmap: enable arm64
mseal sysmap: enable x86-64
mseal sysmap: generic vdso vvar mapping
selftests: x86: test_mremap_vdso: skip if vdso is msealed
mseal sysmap: kernel config and header change
mm: pgtable: remove tlb_remove_page_ptdesc()
x86: pgtable: convert to use tlb_remove_ptdesc()
riscv: pgtable: unconditionally use tlb_remove_ptdesc()
mm: pgtable: convert some architectures to use tlb_remove_ptdesc()
mm: pgtable: change pt parameter of tlb_remove_ptdesc() to struct ptdesc*
mm: pgtable: make generic tlb_remove_table() use struct ptdesc
microblaze/mm: put mm_cmdline_setup() in .init.text section
mm/memory_hotplug: fix call folio_test_large with tail page in do_migrate_range
MAINTAINERS: mm: add entry for secretmem
MAINTAINERS: mm: add entry for numa memblocks and numa emulation
...
Linus Torvalds [Thu, 3 Apr 2025 17:47:47 +0000 (10:47 -0700)]
Merge tag 'mm-hotfixes-stable-2025-04-02-21-57' of git://git./linux/kernel/git/akpm/mm
Pull MM hotfixes from Andrew Morton:
"Five hotfixes. Three are cc:stable and the remainder address post-6.14
issues or aren't considered necessary for -stable kernels.
All patches are for MM"
* tag 'mm-hotfixes-stable-2025-04-02-21-57' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm: zswap: fix crypto_free_acomp() deadlock in zswap_cpu_comp_dead()
mm/hugetlb: move hugetlb_sysctl_init() to the __init section
mm: page_isolation: avoid calling folio_hstate() without hugetlb_lock
mm/hugetlb_vmemmap: fix memory loads ordering
mm/userfaultfd: fix release hang over concurrent GUP
Linus Torvalds [Thu, 3 Apr 2025 17:03:38 +0000 (10:03 -0700)]
Merge tag 'sched_ext-for-6.15-rc0-fixes' of git://git./linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Calling scx_bpf_create_dsq() with the same ID would succeed creating
duplicate DSQs. Fix it to return -EEXIST.
- scx_select_cpu_dfl() fixes and cleanups.
- Synchronize tool/sched_ext with external scheduler repo. While this
isn't a fix. There's no risk to the kernel and it's better if they
stay synced closer.
* tag 'sched_ext-for-6.15-rc0-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
tools/sched_ext: Sync with scx repo
sched_ext: initialize built-in idle state before ops.init()
sched_ext: create_dsq: Return -EEXIST on duplicate request
sched_ext: Remove a meaningless conditional goto in scx_select_cpu_dfl()
sched_ext: idle: Fix return code of scx_select_cpu_dfl()
Linus Torvalds [Thu, 3 Apr 2025 16:52:44 +0000 (09:52 -0700)]
Merge tag 'trace-v6.15-2' of git://git./linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix build error when CONFIG_PROBE_EVENTS_BTF_ARGS is not enabled
The tracing of arguments in the function tracer depends on some
functions that are only defined when PROBE_EVENTS_BTF_ARGS is
enabled. In fact, PROBE_EVENTS_BTF_ARGS also depends on all the same
configs as the function argument tracing requires. Just have the
function argument tracing depend on PROBE_EVENTS_BTF_ARGS.
- Free module_delta for persistent ring buffer instance
When an instance holds the persistent ring buffer, it allocates a
helper array to hold the deltas between where modules are loaded on
the last boot and the current boot. This array needs to be freed when
the instance is freed.
- Add cond_resched() to loop in ftrace_graph_set_hash()
The hash functions in ftrace loop over every function that can be
enabled by ftrace. This can be 50,000 functions or more. This loop is
known to trigger soft lockup warnings and requires a cond_resched().
The loop in ftrace_graph_set_hash() was missing it.
- Fix the event format verifier to include "%*p.." arguments
To prevent events from dereferencing stale pointers that can happen
if a trace event uses a dereferece pointer to something that was not
copied into the ring buffer and can be freed by the time the trace is
read, a verifier is called. At boot or module load, the verifier
scans the print format string for pointers that can be dereferenced
and it checks the arguments to make sure they do not contain
something that can be freed. The "%*p" was not handled, which would
add another argument and cause the verifier to not only not verify
this pointer, but it will look at the wrong argument for every
pointer after that.
- Fix mcount sorttable building for different endian type target
When modifying the ELF file to sort the mcount_loc table in the
sorttable.c code, the endianess of the file and the host is used to
determine if the bytes need to be swapped when calculations are done.
A change was made to the sorting of the mcount_loc that read the
values from the ELF file into an array and the swap happened on the
filling of the array. But one of the calculations of the array still
did the swap when it did not need to. This caused building on a
little endian machine for a big endian target to not find the mcount
function in the 'nm' table and it zeroed it out, causing there to be
no functions available to trace.
- Add goto out_unlock jump to rv_register_monitor() on error path
One of the error paths in rv_register_monitor() just returned the
error when it should have jumped to the out_unlock label to release
the mutex.
* tag 'trace-v6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
rv: Fix missing unlock on double nested monitors return path
scripts/sorttable: Fix endianness handling in build-time mcount sort
tracing: Verify event formats that have "%*p.."
ftrace: Add cond_resched() to ftrace_graph_set_hash()
tracing: Free module_delta on freeing of persistent ring buffer
ftrace: Have tracing function args depend on PROBE_EVENTS_BTF_ARGS
Kent Overstreet [Wed, 2 Apr 2025 16:23:25 +0000 (12:23 -0400)]
bcachefs: Fix "journal stuck" during recovery
If we crash when the journal pin fifo is completely full - i.e. we're
at the maximum number of dirty journal entries - that may put us in a
sticky situation in recovery, as journal replay will need to be able to
open new journal entries in order to get going.
bch2_fs_journal_start() already had provisions for resizing the journal
pin fifo if needed, but it needs a fudge factor to ensure there's room
for journal replay.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 2 Apr 2025 16:54:25 +0000 (12:54 -0400)]
bcachefs: backpointer_get_key: check for null from peek_slot()
peek_slot() doesn't normally return bkey_s_c_null - except when we ask
for a key at a btree level that doesn't exist, which can happen here.
We might want to revisit this, but we'll have to look over all the
places where we use peek_slot() on interior nodes.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 2 Apr 2025 16:48:23 +0000 (12:48 -0400)]
bcachefs: Fix null ptr deref in invalidate_one_bucket()
bch2_backpointer_get_key() returns bkey_s_c_null when the target isn't
found.
backpointer_get_key() flags the error, so there's nothing else to do
here - just skip it and move on.
Link: https://github.com/koverstreet/bcachefs/issues/847
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Wed, 2 Apr 2025 18:31:12 +0000 (14:31 -0400)]
bcachefs: Fix check_snapshot_exists() restart handling
Codepaths that create entries in the snapshots btree currently call
bch2_mark_snapshot(), which updates the in-memory snapshot table, before
transaction commit.
This is because bch2_mark_snapshot() is an atomic trigger, run with
btree write locks held, and isn't allowed to fail - but it might need to
reallocate the table, hence we call it early when we're still allowed to
fail.
This is generally harmless - if we fail, we'll have left an entry in the
snapshots table around, but nothing will reference it and it'll get
overwritten if reused by another transaction.
But check_snapshot_exists(), which reconstructs snapshots when the
snapshots btree has been corrupted or lost, was erronously rechecking if
the snapshot exists inside the transaction commit loop - so on
transaction restart (in this case mem_realloced), the second iteration
would return without repairing.
This code needs some cleanup: splitting out a "maybe realloc snapshots
table" helper would have avoided this, that will be in the next patch.
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Bharadwaj Raju [Wed, 2 Apr 2025 18:15:53 +0000 (23:45 +0530)]
bcachefs: use nonblocking variant of print_string_as_lines in error path
The inconsistency error path calls print_string_as_lines, which calls
console_lock, which is a potentially-sleeping function and so can't be
called in an atomic context.
Replace calls to it with the nonblocking variant which is safe to call.
Signed-off-by: Bharadwaj Raju <bharadwaj.raju777@gmail.com>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Kent Overstreet [Tue, 1 Apr 2025 17:01:29 +0000 (13:01 -0400)]
bcachefs: Fix scheduling while atomic from logging changes
Two fixes from the recent logging changes:
bch2_inconsistent(), bch2_fs_inconsistent() be called from interrupt
context, or with rcu_read_lock() held.
The one syzbot found is in
bch2_bkey_pick_read_device
bch2_dev_rcu
bch2_fs_inconsistent
We're starting to switch to lift the printbufs up to higher levels so we
can emit better log messages and print them all in one go (avoid
garbling), so that conversion will help with spotting these in the
future; when we declare a printbuf it must be flagged if we're in an
atomic context.
Secondly, in btree_node_write_endio:
00085 BUG: sleeping function called from invalid context at include/linux/sched/mm.h:321
00085 in_atomic(): 1, irqs_disabled(): 1, non_block: 0, pid: 618, name: bch-reclaim/fa6
00085 preempt_count: 10001, expected: 0
00085 RCU nest depth: 0, expected: 0
00085 4 locks held by bch-reclaim/fa6/618:
00085 #0:
ffffff80d7ccad68 (&j->reclaim_lock){+.+.}-{4:4}, at: bch2_journal_reclaim_thread+0x84/0x198
00085 #1:
ffffff80d7c84218 (&c->btree_trans_barrier){.+.+}-{0:0}, at: __bch2_trans_get+0x1c0/0x440
00085 #2:
ffffff80cd3f8140 (bcachefs_btree){+.+.}-{0:0}, at: __bch2_trans_get+0x22c/0x440
00085 #3:
ffffff80c3823c20 (&vblk->vqs[i].lock){-.-.}-{3:3}, at: virtblk_done+0x58/0x130
00085 irq event stamp: 328
00085 hardirqs last enabled at (327): [<
ffffffc080073a14>] finish_task_switch.isra.0+0xbc/0x2a0
00085 hardirqs last disabled at (328): [<
ffffffc080971a10>] el1_interrupt+0x20/0x60
00085 softirqs last enabled at (0): [<
ffffffc08002f920>] copy_process+0x7c8/0x2118
00085 softirqs last disabled at (0): [<
0000000000000000>] 0x0
00085 Preemption disabled at:
00085 [<
ffffffc08003ada0>] irq_enter_rcu+0x18/0x90
00085 CPU: 8 UID: 0 PID: 618 Comm: bch-reclaim/fa6 Not tainted
6.14.0-rc6-ktest-g04630bde23e8 #18798
00085 Hardware name: linux,dummy-virt (DT)
00085 Call trace:
00085 show_stack+0x1c/0x30 (C)
00085 dump_stack_lvl+0x84/0xc0
00085 dump_stack+0x14/0x20
00085 __might_resched+0x180/0x288
00085 __might_sleep+0x4c/0x88
00085 __kmalloc_node_track_caller_noprof+0x34c/0x3e0
00085 krealloc_noprof+0x1a0/0x2d8
00085 bch2_printbuf_make_room+0x9c/0x120
00085 bch2_prt_printf+0x60/0x1b8
00085 btree_node_write_endio+0x1b0/0x2d8
00085 bio_endio+0x138/0x1f0
00085 btree_node_write_endio+0xe8/0x2d8
00085 bio_endio+0x138/0x1f0
00085 blk_update_request+0x220/0x4c0
00085 blk_mq_end_request+0x28/0x148
00085 virtblk_request_done+0x64/0xe8
00085 blk_mq_complete_request+0x34/0x40
00085 virtblk_done+0x78/0x130
00085 vring_interrupt+0x6c/0xb0
00085 __handle_irq_event_percpu+0x8c/0x2e0
00085 handle_irq_event+0x50/0xb0
00085 handle_fasteoi_irq+0xc4/0x250
00085 handle_irq_desc+0x44/0x60
00085 generic_handle_domain_irq+0x20/0x30
00085 gic_handle_irq+0x54/0xc8
00085 call_on_irq_stack+0x24/0x40
Reported-by: syzbot+c82cd2906e2f192410bb@syzkaller.appspotmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Wentao Liang [Wed, 2 Apr 2025 13:45:44 +0000 (21:45 +0800)]
bcachefs: Add error handling for zlib_deflateInit2()
In attempt_compress(), the return value of zlib_deflateInit2() needs to be
checked. A proper implementation can be found in pstore_compress().
Add an error check and return 0 immediately if the initialzation fails.
Fixes:
986e9842fb68 ("bcachefs: Compression levels")
Signed-off-by: Wentao Liang <vulab@iscas.ac.cn>
Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
Ming Lei [Thu, 3 Apr 2025 10:54:02 +0000 (18:54 +0800)]
block: don't grab elevator lock during queue initialization
->elevator_lock depends on queue freeze lock, see block/blk-sysfs.c.
queue freeze lock depends on fs_reclaim.
So don't grab elevator lock during queue initialization which needs to
call kmalloc(GFP_KERNEL), and we can cut the dependency between
->elevator_lock and fs_reclaim, then the lockdep warning can be killed.
This way is safe because elevator setting isn't ready to run during
queue initialization.
There isn't such issue in __blk_mq_update_nr_hw_queues() because
memalloc_noio_save() is called before acquiring elevator lock.
Fixes the following lockdep warning:
https://lore.kernel.org/linux-block/
67e6b425.
050a0220.2f068f.007b.GAE@google.com/
Reported-by: syzbot+4c7e0f9b94ad65811efb@syzkaller.appspotmail.com
Cc: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250403105402.1334206-1-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 3 Apr 2025 11:29:30 +0000 (12:29 +0100)]
io_uring: always do atomic put from iowq
io_uring always switches requests to atomic refcounting for iowq
execution before there is any parallilism by setting REQ_F_REFCOUNT,
and the flag is not cleared until the request completes. That should be
fine as long as the compiler doesn't make up a non existing value for
the flags, however KCSAN still complains when the request owner changes
oter flag bits:
BUG: KCSAN: data-race in io_req_task_cancel / io_wq_free_work
...
read to 0xffff888117207448 of 8 bytes by task 3871 on cpu 0:
req_ref_put_and_test io_uring/refs.h:22 [inline]
Skip REQ_F_REFCOUNT checks for iowq, we know it's set.
Reported-by: syzbot+903a2ad71fb3f1e47cf5@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d880bc27fb8c3209b54641be4ff6ac02b0e5789a.1743679736.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Thu, 3 Apr 2025 05:41:04 +0000 (22:41 -0700)]
Merge tag 'firewire-updates-6.15' of git://git./linux/kernel/git/ieee1394/linux1394
Pull firewire update from Takashi Sakamoto:
"A single commit to use the common helper function for on-stack
trailing array to enqueue any isochronous packet by the requests
from userspace applications"
* tag 'firewire-updates-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ieee1394/linux1394:
firewire: core: avoid -Wflex-array-member-not-at-end warning
Yonghong Song [Mon, 31 Mar 2025 03:38:28 +0000 (20:38 -0700)]
selftests/bpf: Fix verifier_private_stack test failure
Several verifier_private_stack tests failed with latest bpf-next.
For example, for 'Private stack, single prog' subtest, the
jitted code:
func #0:
0: f3 0f 1e fa endbr64
4: 0f 1f 44 00 00 nopl (%rax,%rax)
9: 0f 1f 00 nopl (%rax)
c: 55 pushq %rbp
d: 48 89 e5 movq %rsp, %rbp
10: f3 0f 1e fa endbr64
14: 49 b9 58 74 8a 8f 7d 60 00 00 movabsq $0x607d8f8a7458, %r9
1e: 65 4c 03 0c 25 28 c0 48 87 addq %gs:-0x78b73fd8, %r9
27: bf 2a 00 00 00 movl $0x2a, %edi
2c: 49 89 b9 00 ff ff ff movq %rdi, -0x100(%r9)
33: 31 c0 xorl %eax, %eax
35: c9 leave
36: e9 20 5d 0f e1 jmp 0xffffffffe10f5d5b
The insn 'addq %gs:-0x78b73fd8, %r9' does not match the expected
regex 'addq %gs:0x{{.*}}, %r9' and this caused test failure.
Fix it by changing '%gs:0x{{.*}}' to '%gs:{{.*}}' to accommodate the
possible negative offset. A few other subtests are fixed in a similar way.
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20250331033828.365077-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>