Christoph Hellwig [Mon, 5 May 2025 08:11:26 +0000 (10:11 +0200)]
mm: remove NR_BOUNCE zone stat
The stat is always 0 now, so remove it and hardwire the user visible
output to 0.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-8-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 5 May 2025 08:11:25 +0000 (10:11 +0200)]
block: remove bounce buffering support
The block layer bounce buffering support is unused now, remove it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 5 May 2025 08:11:24 +0000 (10:11 +0200)]
scsi: remove the no_highmem flag in the host
All users are gone now.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 5 May 2025 08:11:23 +0000 (10:11 +0200)]
usb-storage: reject probe of device one non-DMA HCDs when using highmem
usb-storage is the last user of the block layer bounce buffering now,
and only uses it for HCDs that do not support DMA on highmem configs.
Remove this support and fail the probe so that the block layer bounce
buffering can go away.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Alan Stern <stern@rowland.harvard.edu>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 5 May 2025 08:11:22 +0000 (10:11 +0200)]
scsi: make ppa depend on !HIGHMEM
This is one of the last drivers depending on the block layer bounce
buffering code. Restrict it to run on non-highmem configs so that the
bounce buffering code can be removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 5 May 2025 08:11:21 +0000 (10:11 +0200)]
scsi: make imm depend on !HIGHMEM
This is one of the last drivers depending on the block layer bounce
buffering code. Restrict it to run on non-highmem configs so that the
bounce buffering code can be removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 5 May 2025 08:11:20 +0000 (10:11 +0200)]
scsi: make aha152x depend on !HIGHMEM
This is one of the last drivers depending on the block layer bounce
buffering code. Restrict it to run on non-highmem configs so that the
bounce buffering code can be removed.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250505081138.3435992-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 5 May 2025 13:13:44 +0000 (07:13 -0600)]
Merge branch 'block-6.15' into for-6.16/block
Merge 6.15 block fixes in, once again, to resolve conflicts with the
fixes for ublk that went into mainline and the 6.16 ublk updates.
* block-6.15:
nvmet-auth: always free derived key data
nvmet-tcp: don't restore null sk_state_change
nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
nvme-tcp: fix premature queue removal and I/O failover
nvme-pci: add quirks for WDC Blue SN550 15b7:5009
nvme-pci: add quirks for device 126f:1001
nvme-pci: fix queue unquiesce check on slot_reset
ublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_req
ublk: enhance check for register/unregister io buffer command
ublk: decouple zero copy from user copy
selftests: ublk: fix UBLK_F_NEED_GET_DATA
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 24 Apr 2025 08:27:52 +0000 (10:27 +0200)]
block: use writeback_iter
Use writeback_iter instead of the deprecated write_cache_pages wrapper
in blkdev_writepages. This removes an indirect call per folio.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Link: https://lore.kernel.org/r/20250424082752.1967679-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Wed, 30 Apr 2025 22:52:34 +0000 (16:52 -0600)]
ublk: store request pointer in ublk_io
A ublk_io is converted to a request in several places in the I/O path by
using blk_mq_tag_to_rq() to look up the (qid, tag) on the ublk device's
tagset. This involves a bunch of dereferences and a tag bounds check.
To make this conversion cheaper, store the request pointer in ublk_io.
Overlap this storage with the io_uring_cmd pointer. This is safe because
the io_uring_cmd pointer is only valid if UBLK_IO_FLAG_ACTIVE is set on
the ublk_io, the request pointer is valid if UBLK_IO_FLAG_OWNED_BY_SRV,
and these flags are mutually exclusive.
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250430225234.2676781-10-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Wed, 30 Apr 2025 22:52:33 +0000 (16:52 -0600)]
ublk: check UBLK_IO_FLAG_OWNED_BY_SRV in ublk_abort_queue()
ublk_abort_queue() currently checks whether the UBLK_IO_FLAG_ACTIVE flag
is cleared to tell whether to abort each ublk_io in the queue. But it's
possible for a ublk_io to not be ACTIVE but also not have a request in
flight, such as when no fetch request has yet been submitted for a tag
or when a fetch request is cancelled. So ublk_abort_queue() must
additionally check for an inflight request.
Simplify this code by checking for UBLK_IO_FLAG_OWNED_BY_SRV instead,
which indicates precisely whether a request is currently inflight.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250430225234.2676781-9-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Wed, 30 Apr 2025 22:52:32 +0000 (16:52 -0600)]
ublk: don't call ublk_dispatch_req() for NEED_GET_DATA
ublk_dispatch_req() currently handles 3 different cases: incoming ublk
requests that don't need to wait for a data buffer, incoming requests
that do need to wait for a buffer, and resuming those requests once the
buffer is provided. But the call site that provides a data buffer
(UBLK_IO_NEED_GET_DATA) is separate from those for incoming requests.
So simplify the function by splitting the UBLK_IO_NEED_GET_DATA case
into its own function ublk_get_data(). This avoids several redundant
checks in the UBLK_IO_NEED_GET_DATA case, and streamlines the incoming
request cases.
Don't call ublk_fill_io_cmd() for UBLK_IO_NEED_GET_DATA, as it's no
longer necessary to set io->cmd or the UBLK_IO_FLAG_ACTIVE flag for
ublk_dispatch_req().
Since UBLK_IO_NEED_GET_DATA no longer relies on ublk_dispatch_req()
calling io_uring_cmd_done(), return the UBLK_IO_RES_OK status directly
from the ->uring_cmd() handler. If ublk_start_io() fails, don't complete
the UBLK_IO_NEED_GET_DATA command, matching the existing behavior.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250430225234.2676781-8-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Wed, 30 Apr 2025 22:52:31 +0000 (16:52 -0600)]
ublk: factor out ublk_start_io() helper
In preparation for calling it from outside ublk_dispatch_req(), factor
out the code responsible for setting up an incoming ublk I/O request.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250430225234.2676781-7-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Wed, 30 Apr 2025 22:52:30 +0000 (16:52 -0600)]
ublk: don't log uring_cmd cmd_op in ublk_dispatch_req()
cmd_op is either UBLK_U_IO_FETCH_REQ, UBLK_U_IO_COMMIT_AND_FETCH_REQ,
or UBLK_U_IO_NEED_GET_DATA. Which one isn't particularly interesting
and is already recorded by the log line in __ublk_ch_uring_cmd().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250430225234.2676781-6-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Wed, 30 Apr 2025 22:52:29 +0000 (16:52 -0600)]
ublk: take const ubq pointer in ublk_get_iod()
ublk_get_iod() doesn't modify the struct ublk_queue it is passed.
Clarify that by making the argument a const pointer.
Move the function definition earlier in the file so it doesn't need a
forward declaration.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250430225234.2676781-5-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Wed, 30 Apr 2025 22:52:28 +0000 (16:52 -0600)]
ublk: remove misleading "ubq" in "ubq_complete_io_cmd()"
ubq_complete_io_cmd() doesn't interact with a ublk queue, so "ubq" in
the name is confusing. Most likely "ubq" was meant to be "ublk".
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250430225234.2676781-4-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Wed, 30 Apr 2025 22:52:27 +0000 (16:52 -0600)]
ublk: fix "immepdately" typo in comment
Looks like "immediately" was intended.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250430225234.2676781-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Uday Shankar [Wed, 30 Apr 2025 22:52:26 +0000 (16:52 -0600)]
ublk: factor out ublk_commit_and_fetch
Move the logic for the UBLK_IO_COMMIT_AND_FETCH_REQ opcode into its own
function. This also allows us to mark ublk_queue pointers as const for
that operation, which can help prevent data races since we may allow
concurrent operation on one ublk_queue in the future. Also open code
ublk_commit_completion in ublk_commit_and_fetch to reduce the number of
parameters/avoid a redundant lookup.
[Restore __ublk_ch_uring_cmd() req variable used in commit
d6aa0c178bf8
("ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA")]
Suggested-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250430225234.2676781-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Sat, 26 Apr 2025 01:17:28 +0000 (19:17 -0600)]
block: avoid hctx spinlock for plug with multiple queues
blk_mq_flush_plug_list() has a fast path if all requests in the plug
are destined for the same request_queue. It calls ->queue_rqs() with the
whole batch of requests, falling back on ->queue_rq() for any requests
not handled by ->queue_rqs(). However, if the requests are destined for
multiple queues, blk_mq_flush_plug_list() has a slow path that calls
blk_mq_dispatch_list() repeatedly to filter the requests by ctx/hctx.
Each queue's requests are inserted into the hctx's dispatch list under a
spinlock, then __blk_mq_sched_dispatch_requests() takes them out of the
dispatch list (taking the spinlock again), and finally
blk_mq_dispatch_rq_list() calls ->queue_rq() on each request.
Acquiring the hctx spinlock twice and calling ->queue_rq() instead of
->queue_rqs() makes the slow path significantly more expensive. Thus,
batching more requests into a single plug (e.g. io_uring_enter syscall)
can counterintuitively hurt performance by causing the plug to span
multiple queues. We have observed 2-3% of CPU time spent acquiring the
hctx spinlock alone on workloads issuing requests to multiple NVMe
devices in the same io_uring SQE batches.
Add a medium path in blk_mq_flush_plug_list() for plugs that don't have
elevators or come from a schedule, but do span multiple queues. Filter
the requests by queue and call ->queue_rqs()/->queue_rq() on the list of
requests destined to each request_queue.
With this change, we no longer see any CPU time spent in _raw_spin_lock
from blk_mq_flush_plug_list and throughput increases accordingly.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250426011728.4189119-4-csander@purestorage.com
[axboe: fix whitespace damage]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Sat, 26 Apr 2025 01:17:27 +0000 (19:17 -0600)]
block: factor out blk_mq_dispatch_queue_requests() helper
Factor out the logic from blk_mq_flush_plug_list() that calls
->queue_rqs() with a fallback to ->queue_rq() into a helper function
blk_mq_dispatch_queue_requests(). This is in preparation for using this
code with other lists of requests.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250426011728.4189119-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Sat, 26 Apr 2025 01:17:26 +0000 (19:17 -0600)]
block: take rq_list instead of plug in dispatch functions
blk_mq_plug_issue_direct(), __blk_mq_flush_plug_list(), and
blk_mq_dispatch_plug_list() take a struct blk_plug * but only use its
mq_list. Pass the struct rq_list * instead in preparation for calling
them with other lists of requests.
Drop "plug" from the function names as they are no longer plug-specific.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20250426011728.4189119-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Mon, 7 Apr 2025 07:52:22 +0000 (16:52 +0900)]
Documentation: Document the new zoned loop block device driver
Introduce the zoned_loop.rst documentation file under
admin-guide/blockdev to document the zoned loop block device driver.
An overview of the driver is provided and its usage to create and delete
zoned devices described.
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250407075222.170336-3-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Mon, 7 Apr 2025 07:52:21 +0000 (16:52 +0900)]
block: new zoned loop block device driver
The zoned loop block device driver allows a user to create emulated
zoned block devices using one regular file per zone as backing storage.
Compared to null_blk or scsi_debug, it has the advantage of allowing
emulating large zoned devices without requiring the same amount of
memory as the capacity of the emulated device. Furthermore, zoned
devices emulated with this driver can be re-started after a host reboot
without any loss of the state of the device zones, which is something
that null_blk and scsi_debug do not support.
This initial implementation is simple and does not support zone resource
limits. That is, a zoned loop block device limits for the maximum number
of open zones and maximum number of active zones is always 0.
This driver can be either compiled in-kernel or as a module, named
"zloop". Compilation of this driver depends on the block layer support
for zoned block device (CONFIG_BLK_DEV_ZONED must be set).
Using the zloop driver to create and delete zoned block devices is
done by writing commands to the zoned loop control character device file
(/dev/zloop-control). Creating a device is done with:
$ echo "add [options]" > /dev/zloop-control
The options available for the "add" operation cat be listed by reading
the zloop-control device file:
$ cat /dev/zloop-control
add id=%d,capacity_mb=%u,zone_size_mb=%u,zone_capacity_mb=%u,conv_zones=%u,base_dir=%s,nr_queues=%u,queue_depth=%u
remove id=%d
The options available allow controlling the zoned device total
capacity, zone size, zone capactity of sequential zones, total number
of conventional zones, base directory for the zones backing file, number
of I/O queues and the maximum queue depth of I/O queues.
Deleting a device is done using the "remove" command:
$ echo "remove id=0" > /dev/zloop-control
This implementation passes various tests using zonefs and fio (t/zbd
tests) and provides a state machine for zone conditions that is
compliant with the T10 ZBC and NVMe ZNS specifications.
Co-developed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250407075222.170336-2-dlemoal@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 1 May 2025 13:56:02 +0000 (07:56 -0600)]
Merge tag 'nvme-6.15-2025-05-01' of git://git.infradead.org/nvme into block-6.15
Pull NVMe fixes from Christoph:
"nvme fixes for Linux 6.15
- fix queue unquiesce check on PCI slot_reset (Keith Busch)
- fix premature queue removal and I/O failover in nvme-tcp
(Michael Liang)
- don't restore null sk_state_change (Alistair Francis)
- select CONFIG_TLS where needed (Alistair Francis)
- always free derived key data (Hannes Reinecke)
- more quirks (Wentao Guan)"
* tag 'nvme-6.15-2025-05-01' of git://git.infradead.org/nvme:
nvmet-auth: always free derived key data
nvmet-tcp: don't restore null sk_state_change
nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
nvme-tcp: fix premature queue removal and I/O failover
nvme-pci: add quirks for WDC Blue SN550 15b7:5009
nvme-pci: add quirks for device 126f:1001
nvme-pci: fix queue unquiesce check on slot_reset
Hannes Reinecke [Fri, 25 Apr 2025 09:34:34 +0000 (11:34 +0200)]
nvmet-auth: always free derived key data
After calling nvme_auth_derive_tls_psk() we need to free the resulting
psk data, as either TLS is disable (and we don't need the data anyway)
or the psk data is copied into the resulting key (and can be free, too).
Fixes:
fa2e0f8bbc68 ("nvmet-tcp: support secure channel concatenation")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Suggested-by: Maurizio Lombardi <mlombard@bsdbackstore.eu>
Signed-off-by: Hannes Reinecke <hare@kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Alistair Francis [Wed, 23 Apr 2025 06:06:21 +0000 (16:06 +1000)]
nvmet-tcp: don't restore null sk_state_change
queue->state_change is set as part of nvmet_tcp_set_queue_sock(), but if
the TCP connection isn't established when nvmet_tcp_set_queue_sock() is
called then queue->state_change isn't set and sock->sk->sk_state_change
isn't replaced.
As such we don't need to restore sock->sk->sk_state_change if
queue->state_change is NULL.
This avoids NULL pointer dereferences such as this:
[ 286.462026][ C0] BUG: kernel NULL pointer dereference, address:
0000000000000000
[ 286.462814][ C0] #PF: supervisor instruction fetch in kernel mode
[ 286.463796][ C0] #PF: error_code(0x0010) - not-present page
[ 286.464392][ C0] PGD
8000000140620067 P4D
8000000140620067 PUD
114201067 PMD 0
[ 286.465086][ C0] Oops: Oops: 0010 [#1] SMP KASAN PTI
[ 286.465559][ C0] CPU: 0 UID: 0 PID: 1628 Comm: nvme Not tainted 6.15.0-rc2+ #11 PREEMPT(voluntary)
[ 286.466393][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-3.fc41 04/01/2014
[ 286.467147][ C0] RIP: 0010:0x0
[ 286.467420][ C0] Code: Unable to access opcode bytes at 0xffffffffffffffd6.
[ 286.467977][ C0] RSP: 0018:
ffff8883ae008580 EFLAGS:
00010246
[ 286.468425][ C0] RAX:
0000000000000000 RBX:
ffff88813fd34100 RCX:
ffffffffa386cc43
[ 286.469019][ C0] RDX:
1ffff11027fa68b6 RSI:
0000000000000008 RDI:
ffff88813fd34100
[ 286.469545][ C0] RBP:
ffff88813fd34160 R08:
0000000000000000 R09:
ffffed1027fa682c
[ 286.470072][ C0] R10:
ffff88813fd34167 R11:
0000000000000000 R12:
ffff88813fd344c3
[ 286.470585][ C0] R13:
ffff88813fd34112 R14:
ffff88813fd34aec R15:
ffff888132cdd268
[ 286.471070][ C0] FS:
00007fe3c04c7d80(0000) GS:
ffff88840743f000(0000) knlGS:
0000000000000000
[ 286.471644][ C0] CS: 0010 DS: 0000 ES: 0000 CR0:
0000000080050033
[ 286.472543][ C0] CR2:
ffffffffffffffd6 CR3:
000000012daca000 CR4:
00000000000006f0
[ 286.473500][ C0] DR0:
0000000000000000 DR1:
0000000000000000 DR2:
0000000000000000
[ 286.474467][ C0] DR3:
0000000000000000 DR6:
00000000ffff07f0 DR7:
0000000000000400
[ 286.475453][ C0] Call Trace:
[ 286.476102][ C0] <IRQ>
[ 286.476719][ C0] tcp_fin+0x2bb/0x440
[ 286.477429][ C0] tcp_data_queue+0x190f/0x4e60
[ 286.478174][ C0] ? __build_skb_around+0x234/0x330
[ 286.478940][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.479659][ C0] ? __pfx_tcp_data_queue+0x10/0x10
[ 286.480431][ C0] ? tcp_try_undo_loss+0x640/0x6c0
[ 286.481196][ C0] ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
[ 286.482046][ C0] ? kvm_clock_get_cycles+0x14/0x30
[ 286.482769][ C0] ? ktime_get+0x66/0x150
[ 286.483433][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.484146][ C0] tcp_rcv_established+0x6e4/0x2050
[ 286.484857][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.485523][ C0] ? ipv4_dst_check+0x160/0x2b0
[ 286.486203][ C0] ? __pfx_tcp_rcv_established+0x10/0x10
[ 286.486917][ C0] ? lock_release+0x217/0x2c0
[ 286.487595][ C0] tcp_v4_do_rcv+0x4d6/0x9b0
[ 286.488279][ C0] tcp_v4_rcv+0x2af8/0x3e30
[ 286.488904][ C0] ? raw_local_deliver+0x51b/0xad0
[ 286.489551][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.490198][ C0] ? __pfx_tcp_v4_rcv+0x10/0x10
[ 286.490813][ C0] ? __pfx_raw_local_deliver+0x10/0x10
[ 286.491487][ C0] ? __pfx_nf_confirm+0x10/0x10 [nf_conntrack]
[ 286.492275][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.492900][ C0] ip_protocol_deliver_rcu+0x8f/0x370
[ 286.493579][ C0] ip_local_deliver_finish+0x297/0x420
[ 286.494268][ C0] ip_local_deliver+0x168/0x430
[ 286.494867][ C0] ? __pfx_ip_local_deliver+0x10/0x10
[ 286.495498][ C0] ? __pfx_ip_local_deliver_finish+0x10/0x10
[ 286.496204][ C0] ? ip_rcv_finish_core+0x19a/0x1f20
[ 286.496806][ C0] ? lock_release+0x217/0x2c0
[ 286.497414][ C0] ip_rcv+0x455/0x6e0
[ 286.497945][ C0] ? __pfx_ip_rcv+0x10/0x10
[ 286.498550][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.499137][ C0] ? __pfx_ip_rcv_finish+0x10/0x10
[ 286.499763][ C0] ? lock_release+0x217/0x2c0
[ 286.500327][ C0] ? dl_scaled_delta_exec+0xd1/0x2c0
[ 286.500922][ C0] ? __pfx_ip_rcv+0x10/0x10
[ 286.501480][ C0] __netif_receive_skb_one_core+0x166/0x1b0
[ 286.502173][ C0] ? __pfx___netif_receive_skb_one_core+0x10/0x10
[ 286.502903][ C0] ? lock_acquire+0x2b2/0x310
[ 286.503487][ C0] ? process_backlog+0x372/0x1350
[ 286.504087][ C0] ? lock_release+0x217/0x2c0
[ 286.504642][ C0] process_backlog+0x3b9/0x1350
[ 286.505214][ C0] ? process_backlog+0x372/0x1350
[ 286.505779][ C0] __napi_poll.constprop.0+0xa6/0x490
[ 286.506363][ C0] net_rx_action+0x92e/0xe10
[ 286.506889][ C0] ? __pfx_net_rx_action+0x10/0x10
[ 286.507437][ C0] ? timerqueue_add+0x1f0/0x320
[ 286.507977][ C0] ? sched_clock_cpu+0x68/0x540
[ 286.508492][ C0] ? lock_acquire+0x2b2/0x310
[ 286.509043][ C0] ? kvm_sched_clock_read+0xd/0x20
[ 286.509607][ C0] ? handle_softirqs+0x1aa/0x7d0
[ 286.510187][ C0] handle_softirqs+0x1f2/0x7d0
[ 286.510754][ C0] ? __pfx_handle_softirqs+0x10/0x10
[ 286.511348][ C0] ? irqtime_account_irq+0x181/0x290
[ 286.511937][ C0] ? __dev_queue_xmit+0x85d/0x3450
[ 286.512510][ C0] do_softirq.part.0+0x89/0xc0
[ 286.513100][ C0] </IRQ>
[ 286.513548][ C0] <TASK>
[ 286.513953][ C0] __local_bh_enable_ip+0x112/0x140
[ 286.514522][ C0] ? __dev_queue_xmit+0x85d/0x3450
[ 286.515072][ C0] __dev_queue_xmit+0x872/0x3450
[ 286.515619][ C0] ? nft_do_chain+0xe16/0x15b0 [nf_tables]
[ 286.516252][ C0] ? __pfx___dev_queue_xmit+0x10/0x10
[ 286.516817][ C0] ? selinux_ip_postroute+0x43c/0xc50
[ 286.517433][ C0] ? __pfx_selinux_ip_postroute+0x10/0x10
[ 286.518061][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.518606][ C0] ? ip_output+0x164/0x4a0
[ 286.519149][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.519671][ C0] ? ip_finish_output2+0x17d5/0x1fb0
[ 286.520258][ C0] ip_finish_output2+0xb4b/0x1fb0
[ 286.520787][ C0] ? __pfx_ip_finish_output2+0x10/0x10
[ 286.521355][ C0] ? __ip_finish_output+0x15d/0x750
[ 286.521890][ C0] ip_output+0x164/0x4a0
[ 286.522372][ C0] ? __pfx_ip_output+0x10/0x10
[ 286.522872][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.523402][ C0] ? _raw_spin_unlock_irqrestore+0x4c/0x60
[ 286.524031][ C0] ? __pfx_ip_finish_output+0x10/0x10
[ 286.524605][ C0] ? __ip_queue_xmit+0x999/0x2260
[ 286.525200][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.525744][ C0] ? ipv4_dst_check+0x16a/0x2b0
[ 286.526279][ C0] ? lock_release+0x217/0x2c0
[ 286.526793][ C0] __ip_queue_xmit+0x1883/0x2260
[ 286.527324][ C0] ? __skb_clone+0x54c/0x730
[ 286.527827][ C0] __tcp_transmit_skb+0x209b/0x37a0
[ 286.528374][ C0] ? __pfx___tcp_transmit_skb+0x10/0x10
[ 286.528952][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.529472][ C0] ? seqcount_lockdep_reader_access.constprop.0+0x82/0x90
[ 286.530152][ C0] ? trace_hardirqs_on+0x12/0x120
[ 286.530691][ C0] tcp_write_xmit+0xb81/0x88b0
[ 286.531224][ C0] ? mod_memcg_state+0x4d/0x60
[ 286.531736][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.532253][ C0] __tcp_push_pending_frames+0x90/0x320
[ 286.532826][ C0] tcp_send_fin+0x141/0xb50
[ 286.533352][ C0] ? __pfx_tcp_send_fin+0x10/0x10
[ 286.533908][ C0] ? __local_bh_enable_ip+0xab/0x140
[ 286.534495][ C0] inet_shutdown+0x243/0x320
[ 286.535077][ C0] nvme_tcp_alloc_queue+0xb3b/0x2590 [nvme_tcp]
[ 286.535709][ C0] ? do_raw_spin_lock+0x129/0x260
[ 286.536314][ C0] ? __pfx_nvme_tcp_alloc_queue+0x10/0x10 [nvme_tcp]
[ 286.536996][ C0] ? do_raw_spin_unlock+0x54/0x1e0
[ 286.537550][ C0] ? _raw_spin_unlock+0x29/0x50
[ 286.538127][ C0] ? do_raw_spin_lock+0x129/0x260
[ 286.538664][ C0] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 286.539249][ C0] ? nvme_tcp_alloc_admin_queue+0xd5/0x340 [nvme_tcp]
[ 286.539892][ C0] ? __wake_up+0x40/0x60
[ 286.540392][ C0] nvme_tcp_alloc_admin_queue+0xd5/0x340 [nvme_tcp]
[ 286.541047][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.541589][ C0] nvme_tcp_setup_ctrl+0x8b/0x7a0 [nvme_tcp]
[ 286.542254][ C0] ? _raw_spin_unlock_irqrestore+0x4c/0x60
[ 286.542887][ C0] ? __pfx_nvme_tcp_setup_ctrl+0x10/0x10 [nvme_tcp]
[ 286.543568][ C0] ? trace_hardirqs_on+0x12/0x120
[ 286.544166][ C0] ? _raw_spin_unlock_irqrestore+0x35/0x60
[ 286.544792][ C0] ? nvme_change_ctrl_state+0x196/0x2e0 [nvme_core]
[ 286.545477][ C0] nvme_tcp_create_ctrl+0x839/0xb90 [nvme_tcp]
[ 286.546126][ C0] nvmf_dev_write+0x3db/0x7e0 [nvme_fabrics]
[ 286.546775][ C0] ? rw_verify_area+0x69/0x520
[ 286.547334][ C0] vfs_write+0x218/0xe90
[ 286.547854][ C0] ? do_syscall_64+0x9f/0x190
[ 286.548408][ C0] ? trace_hardirqs_on_prepare+0xdb/0x120
[ 286.549037][ C0] ? syscall_exit_to_user_mode+0x93/0x280
[ 286.549659][ C0] ? __pfx_vfs_write+0x10/0x10
[ 286.550259][ C0] ? do_syscall_64+0x9f/0x190
[ 286.550840][ C0] ? syscall_exit_to_user_mode+0x8e/0x280
[ 286.551516][ C0] ? trace_hardirqs_on_prepare+0xdb/0x120
[ 286.552180][ C0] ? syscall_exit_to_user_mode+0x93/0x280
[ 286.552834][ C0] ? ksys_read+0xf5/0x1c0
[ 286.553386][ C0] ? __pfx_ksys_read+0x10/0x10
[ 286.553964][ C0] ksys_write+0xf5/0x1c0
[ 286.554499][ C0] ? __pfx_ksys_write+0x10/0x10
[ 286.555072][ C0] ? trace_hardirqs_on_prepare+0xdb/0x120
[ 286.555698][ C0] ? syscall_exit_to_user_mode+0x93/0x280
[ 286.556319][ C0] ? do_syscall_64+0x54/0x190
[ 286.556866][ C0] do_syscall_64+0x93/0x190
[ 286.557420][ C0] ? rcu_read_unlock+0x17/0x60
[ 286.557986][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.558526][ C0] ? lock_release+0x217/0x2c0
[ 286.559087][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.559659][ C0] ? count_memcg_events.constprop.0+0x4a/0x60
[ 286.560476][ C0] ? exc_page_fault+0x7a/0x110
[ 286.561064][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.561647][ C0] ? lock_release+0x217/0x2c0
[ 286.562257][ C0] ? do_user_addr_fault+0x171/0xa00
[ 286.562839][ C0] ? do_user_addr_fault+0x4a2/0xa00
[ 286.563453][ C0] ? irqentry_exit_to_user_mode+0x84/0x270
[ 286.564112][ C0] ? rcu_is_watching+0x11/0xb0
[ 286.564677][ C0] ? irqentry_exit_to_user_mode+0x84/0x270
[ 286.565317][ C0] ? trace_hardirqs_on_prepare+0xdb/0x120
[ 286.565922][ C0] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 286.566542][ C0] RIP: 0033:0x7fe3c05e6504
[ 286.567102][ C0] Code: c7 00 16 00 00 00 b8 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 80 3d c5 8b 10 00 00 74 13 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 54 c3 0f 1f 00 55 48 89 e5 48 83 ec 20 48 89
[ 286.568931][ C0] RSP: 002b:
00007fff76444f58 EFLAGS:
00000202 ORIG_RAX:
0000000000000001
[ 286.569807][ C0] RAX:
ffffffffffffffda RBX:
000000003b40d930 RCX:
00007fe3c05e6504
[ 286.570621][ C0] RDX:
00000000000000cf RSI:
000000003b40d930 RDI:
0000000000000003
[ 286.571443][ C0] RBP:
0000000000000003 R08:
00000000000000cf R09:
000000003b40d930
[ 286.572246][ C0] R10:
0000000000000000 R11:
0000000000000202 R12:
000000003b40cd60
[ 286.573069][ C0] R13:
00000000000000cf R14:
00007fe3c07417f8 R15:
00007fe3c073502e
[ 286.573886][ C0] </TASK>
Closes: https://lore.kernel.org/linux-nvme/5hdonndzoqa265oq3bj6iarwtfk5dewxxjtbjvn5uqnwclpwt6@a2n6w3taxxex/
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Alistair Francis [Tue, 29 Apr 2025 22:23:47 +0000 (08:23 +1000)]
nvmet-tcp: select CONFIG_TLS from CONFIG_NVME_TARGET_TCP_TLS
Ensure that TLS support is enabled in the kernel when
CONFIG_NVME_TARGET_TCP_TLS is enabled. Without this the code compiles,
but does not actually work unless something else enables CONFIG_TLS.
Fixes:
675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Alistair Francis [Tue, 29 Apr 2025 22:40:25 +0000 (08:40 +1000)]
nvme-tcp: select CONFIG_TLS from CONFIG_NVME_TCP_TLS
Ensure that TLS support is enabled in the kernel when
CONFIG_NVME_TCP_TLS is enabled. Without this the code compiles, but does
not actually work unless something else enables CONFIG_TLS.
Fixes:
be8e82caa68 ("nvme-tcp: enable TLS handshake upcall")
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Michael Liang [Tue, 29 Apr 2025 16:42:01 +0000 (10:42 -0600)]
nvme-tcp: fix premature queue removal and I/O failover
This patch addresses a data corruption issue observed in nvme-tcp during
testing.
In an NVMe native multipath setup, when an I/O timeout occurs, all
inflight I/Os are canceled almost immediately after the kernel socket is
shut down. These canceled I/Os are reported as host path errors,
triggering a failover that succeeds on a different path.
However, at this point, the original I/O may still be outstanding in the
host's network transmission path (e.g., the NIC’s TX queue). From the
user-space app's perspective, the buffer associated with the I/O is
considered completed since they're acked on the different path and may
be reused for new I/O requests.
Because nvme-tcp enables zero-copy by default in the transmission path,
this can lead to corrupted data being sent to the original target,
ultimately causing data corruption.
We can reproduce this data corruption by injecting delay on one path and
triggering i/o timeout.
To prevent this issue, this change ensures that all inflight
transmissions are fully completed from host's perspective before
returning from queue stop. To handle concurrent I/O timeout from multiple
namespaces under the same controller, always wait in queue stop
regardless of queue's state.
This aligns with the behavior of queue stopping in other NVMe fabric
transports.
Fixes:
3f2304f8c6d6 ("nvme-tcp: add NVMe over TCP host driver")
Signed-off-by: Michael Liang <mliang@purestorage.com>
Reviewed-by: Mohamed Khalfella <mkhalfella@purestorage.com>
Reviewed-by: Randy Jennings <randyj@purestorage.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Wentao Guan [Thu, 24 Apr 2025 02:40:10 +0000 (10:40 +0800)]
nvme-pci: add quirks for WDC Blue SN550 15b7:5009
Add two quirks for the WDC Blue SN550 (PCI ID 15b7:5009) based on user
reports and hardware analysis:
- NVME_QUIRK_NO_DEEPEST_PS:
liaozw talked to me the problem and solved with
nvme_core.default_ps_max_latency_us=0, so add the quirk.
I also found some reports in the following link.
- NVME_QUIRK_BROKEN_MSI:
after get the lspci from Jack Rio.
I think that the disk also have NVME_QUIRK_BROKEN_MSI.
described in commit
d5887dc6b6c0 ("nvme-pci: Add quirk for broken MSIs")
as sean said in link which match the MSI 1/32 and MSI-X 17.
Log:
lspci -nn | grep -i memory
03:00.0 Non-Volatile memory controller [0108]: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) [15b7:5009] (rev 01)
lspci -v -d 15b7:5009
03:00.0 Non-Volatile memory controller: Sandisk Corp SanDisk Ultra 3D / WD PC SN530, IX SN530, Blue SN550 NVMe SSD (DRAM-less) (rev 01) (prog-if 02 [NVM Express])
Subsystem: Sandisk Corp WD Blue SN550 NVMe SSD
Flags: bus master, fast devsel, latency 0, IRQ 35, IOMMU group 10
Memory at
fe800000 (64-bit, non-prefetchable) [size=16K]
Memory at
fe804000 (64-bit, non-prefetchable) [size=256]
Capabilities: [80] Power Management version 3
Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
Capabilities: [c0] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [150] Device Serial Number 00-00-00-00-00-00-00-00
Capabilities: [1b8] Latency Tolerance Reporting
Capabilities: [300] Secondary PCI Express
Capabilities: [900] L1 PM Substates
Kernel driver in use: nvme
dmesg | grep nvme
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE=
[ 0.059301] Kernel command line: BOOT_IMAGE=/vmlinuz-6.12.20-amd64-desktop-rolling root=UUID= ro splash quiet nvme_core.default_ps_max_latency_us=0 DEEPIN_GFXMODE=
[ 0.542430] nvme nvme0: pci function 0000:03:00.0
[ 0.560426] nvme nvme0: allocated 32 MiB host memory buffer.
[ 0.562491] nvme nvme0: 16/0/0 default/read/poll queues
[ 0.567764] nvme0n1: p1 p2 p3 p4 p5 p6 p7 p8 p9
[ 6.388726] EXT4-fs (nvme0n1p7): mounted filesystem ro with ordered data mode. Quota mode: none.
[ 6.893421] EXT4-fs (nvme0n1p7): re-mounted r/w. Quota mode: none.
[ 7.125419] Adding 16777212k swap on /dev/nvme0n1p8. Priority:-2 extents:1 across:16777212k SS
[ 7.157588] EXT4-fs (nvme0n1p6): mounted filesystem r/w with ordered data mode. Quota mode: none.
[ 7.165021] EXT4-fs (nvme0n1p9): mounted filesystem r/w with ordered data mode. Quota mode: none.
[ 8.036932] nvme nvme0: using unchecked data buffer
[ 8.096023] block nvme0n1: No UUID available providing old NGUID
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d5887dc6b6c054d0da3cd053afc15b7be1f45ff6
Link: https://lore.kernel.org/all/20240422162822.3539156-1-sean.anderson@linux.dev/
Reported-by: liaozw <hedgehog-002@163.com>
Closes: https://bbs.deepin.org.cn/post/286300
Reported-by: rugk <rugk+github@posteo.de>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=208123
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Wentao Guan [Tue, 22 Apr 2025 12:17:25 +0000 (20:17 +0800)]
nvme-pci: add quirks for device 126f:1001
This commit adds NVME_QUIRK_NO_DEEPEST_PS and NVME_QUIRK_BOGUS_NID for
device [126f:1001].
It is similar to commit
e89086c43f05 ("drivers/nvme: Add quirks for
device 126f:2262")
Diff is according the dmesg, use NVME_QUIRK_IGNORE_DEV_SUBNQN.
dmesg | grep -i nvme0:
nvme nvme0: pci function 0000:01:00.0
nvme nvme0: missing or invalid SUBNQN field.
nvme nvme0: 12/0/0 default/read/poll queues
Link:https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=
e89086c43f0500bc7c4ce225495b73b8ce234c1f
Signed-off-by: Wentao Guan <guanwentao@uniontech.com>
Signed-off-by: WangYuli <wangyuli@uniontech.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Keith Busch [Thu, 24 Apr 2025 17:18:01 +0000 (10:18 -0700)]
nvme-pci: fix queue unquiesce check on slot_reset
A zero return means the reset was successfully scheduled. We don't want
to unquiesce the queues while the reset_work is pending, as that will
just flush out requeued requests to a failed completion.
Fixes:
71a5bb153be104 ("nvme: ensure disabling pairs with unquiesce")
Reported-by: Dhankaran Singh Ajravat <dhankaran@meta.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Ming Lei [Tue, 29 Apr 2025 02:29:39 +0000 (10:29 +0800)]
ublk: remove the check of ublk_need_req_ref() from __ublk_check_and_get_req
__ublk_check_and_get_req() is only called from ublk_check_and_get_req()
and ublk_register_io_buf(), the same check has been covered in the two
calling sites.
So remove the check from __ublk_check_and_get_req().
Suggested-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-5-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 29 Apr 2025 02:29:38 +0000 (10:29 +0800)]
ublk: enhance check for register/unregister io buffer command
The simple check of UBLK_IO_FLAG_OWNED_BY_SRV can avoid incorrect
register/unregister io buffer easily, so check it before calling
starting to register/un-register io buffer.
Also only allow io buffer register/unregister uring_cmd in case of
UBLK_F_SUPPORT_ZERO_COPY.
Also mark argument 'ublk_queue *' of ublk_register_io_buf as const.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes:
1f6540e2aabb ("ublk: zc register/unregister bvec")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 29 Apr 2025 02:29:37 +0000 (10:29 +0800)]
ublk: decouple zero copy from user copy
UBLK_F_USER_COPY and UBLK_F_SUPPORT_ZERO_COPY are two different
features, and shouldn't be coupled together.
Commit
1f6540e2aabb ("ublk: zc register/unregister bvec") enables
user copy automatically in case of UBLK_F_SUPPORT_ZERO_COPY, this way
isn't correct.
So decouple zero copy from user copy, and use independent helper to
check each one.
Fixes:
1f6540e2aabb ("ublk: zc register/unregister bvec")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Tue, 29 Apr 2025 02:29:36 +0000 (10:29 +0800)]
selftests: ublk: fix UBLK_F_NEED_GET_DATA
Commit
57e13a2e8cd2 ("selftests: ublk: support user recovery") starts to
support UBLK_F_NEED_GET_DATA for covering recovery feature, however the
ublk utility implementation isn't done correctly.
Fix it by supporting UBLK_F_NEED_GET_DATA correctly.
Also add test generic_07 for covering UBLK_F_NEED_GET_DATA.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Fixes:
57e13a2e8cd2 ("selftests: ublk: support user recovery")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250429022941.1718671-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 28 Apr 2025 14:09:51 +0000 (07:09 -0700)]
brd: use memcpy_{to,from]_page in brd_rw_bvec
Use the proper helpers to copy to/from potential highmem pages, which
do a local instead of atomic kmap underneath, and perform
flush_dcache_page where needed. This also simplifies the code so much
that the separate read write helpers are not required any more.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250428141014.2360063-6-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 28 Apr 2025 14:09:50 +0000 (07:09 -0700)]
brd: split I/O at page boundaries
A lot of complexity in brd stems from the fact that it tries to handle
I/O spanning two backing pages. Instead limit the size of a single
bvec iteration so that it never crosses a page boundary and remove all
the now unneeded code.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250428141014.2360063-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 28 Apr 2025 14:09:49 +0000 (07:09 -0700)]
brd: use bvec_kmap_local in brd_do_bvec
Use the proper helper to kmap a bvec in brd_do_bvec instead of directly
accessing the bvec fields and use the deprecated kmap_atomic API.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250428141014.2360063-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 28 Apr 2025 14:09:48 +0000 (07:09 -0700)]
brd: remove the sector variable in brd_submit_bio
The bvec iter iterates over the sector already, no need to duplicate the
work.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250428141014.2360063-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Mon, 28 Apr 2025 14:09:47 +0000 (07:09 -0700)]
brd: pass a bvec pointer to brd_do_bvec
Pass the bvec to brd_do_bvec instead of marshalling the information into
individual arguments.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20250428141014.2360063-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 25 Apr 2025 02:41:11 +0000 (20:41 -0600)]
Merge branch 'block-6.15' into for-6.16/block
Merge 6.15 block fixes - both to get the fixes causing issues with
XFS testing, but also to make it easier for 6.16 ublk patches to avoid
conflicts.
* block-6.15:
ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd
ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA
block: don't autoload drivers on blk-cgroup configuration
block: don't autoload drivers on stat
block: remove the backing_inode variable in bdev_statx
block: move blkdev_{get,put} _no_open prototypes out of blkdev.h
block: never reduce ra_pages in blk_apply_bdi_limits
selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils
selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx'
selftests: ublk: fix recover test
block: hoist block size validation code to a separate function
block: fix race between set_blocksize and read paths
nvmet: fix out-of-bounds access in nvmet_enable_port
Ming Lei [Fri, 25 Apr 2025 01:37:40 +0000 (09:37 +0800)]
ublk: fix race between io_uring_cmd_complete_in_task and ublk_cancel_cmd
ublk_cancel_cmd() calls io_uring_cmd_done() to complete uring_cmd, but
we may have scheduled task work via io_uring_cmd_complete_in_task() for
dispatching request, then kernel crash can be triggered.
Fix it by not trying to canceling the command if ublk block request is
started.
Fixes:
216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd")
Reported-by: Jared Holzman <jholzman@nvidia.com>
Tested-by: Jared Holzman <jholzman@nvidia.com>
Closes: https://lore.kernel.org/linux-block/
d2179120-171b-47ba-b664-
23242981ef19@nvidia.com/
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250425013742.1079549-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 25 Apr 2025 01:37:39 +0000 (09:37 +0800)]
ublk: call ublk_dispatch_req() for handling UBLK_U_IO_NEED_GET_DATA
We call io_uring_cmd_complete_in_task() to schedule task_work for handling
UBLK_U_IO_NEED_GET_DATA.
This way is really not necessary because the current context is exactly
the ublk queue context, so call ublk_dispatch_req() directly for handling
UBLK_U_IO_NEED_GET_DATA.
Fixes:
216c8f5ef0f2 ("ublk: replace monitor with cancelable uring_cmd")
Tested-by: Jared Holzman <jholzman@nvidia.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250425013742.1079549-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 23 Apr 2025 05:37:42 +0000 (07:37 +0200)]
block: don't autoload drivers on blk-cgroup configuration
Loading a driver just to configure blk-cgroup doesn't make sense, as that
assumes and already existing device.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 23 Apr 2025 05:37:41 +0000 (07:37 +0200)]
block: don't autoload drivers on stat
blkdev_get_no_open can trigger the legacy autoload of block drivers. A
simple stat of a block device has not historically done that, so disable
this behavior again.
Fixes:
9abcfbd235f5 ("block: Add atomic write support for statx")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 23 Apr 2025 05:37:40 +0000 (07:37 +0200)]
block: remove the backing_inode variable in bdev_statx
backing_inode is only used once, so remove it and update the comment
describing the bdev lookup to be a bit more clear.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Wed, 23 Apr 2025 05:37:39 +0000 (07:37 +0200)]
block: move blkdev_{get,put} _no_open prototypes out of blkdev.h
These are only to be used by block internal code. Remove the comment
as we grew more users due to reworking block device node opening.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20250423053810.1683309-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Christoph Hellwig [Thu, 24 Apr 2025 08:25:21 +0000 (10:25 +0200)]
block: never reduce ra_pages in blk_apply_bdi_limits
When the user increased the read-ahead size through sysfs this value
currently get lost if the device is reprobe, including on a resume
from suspend.
As there is no hardware limitation for the read-ahead size there is
no real need to reset it or track a separate hardware limitation
like for max_sectors.
This restores the pre-atomic queue limit behavior in the sd driver as
sd did not use blk_queue_io_opt and thus never updated the read ahead
size to the value based of the optimal I/O, but changes behavior for
all other drivers. As the new behavior seems useful and sd is the
driver for which the readahead size tweaks are most useful that seems
like a worthwhile trade off.
Fixes:
804e498e0496 ("sd: convert to the atomic queue limits API")
Reported-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Link: https://lore.kernel.org/r/20250424082521.1967286-1-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Uday Shankar [Wed, 23 Apr 2025 21:29:03 +0000 (15:29 -0600)]
selftests: ublk: common: fix _get_disk_dev_t for pre-9.0 coreutils
Some distributions, such as centos stream 9, still have a version of
coreutils which does not yet support the %Hr and %Lr formats for stat(1)
[1, 2]. Running ublk selftests on these distributions results in the
following error in tests that use the _get_disk_dev_t helper:
line 23: ?r: syntax error: operand expected (error token is "?r")
To better accommodate older distributions, rewrite _get_disk_dev_t to
use the much older %t and %T formats for stat instead.
[1] https://github.com/coreutils/coreutils/blob/v9.0/NEWS#L114
[2] https://pkgs.org/download/coreutils
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250423-ublk_selftests-v1-2-7d060e260e76@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 24 Apr 2025 12:27:54 +0000 (06:27 -0600)]
Merge tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme into block-6.15
Pull NVMe fix from Christoph:
"nvme fixes for Linux 6.15
- fix an out-of-bounds access in nvmet_enable_port (Richard Weinberger)"
* tag 'nvme-6.15-2025-04-24' of git://git.infradead.org/nvme:
nvmet: fix out-of-bounds access in nvmet_enable_port
Ming Lei [Mon, 21 Apr 2025 23:59:42 +0000 (07:59 +0800)]
selftests: ublk: remove useless 'delay_us' from 'struct dev_ctx'
'delay_us' shouldn't be added to 'struct dev_ctx' since now it is
handled by per-target command line & 'struct fault_inject_ctx'.
So remove it.
Fixes:
81586652bb1f ("selftests: ublk: add generic_06 for covering fault inject")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250421235947.715272-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Mon, 21 Apr 2025 23:59:41 +0000 (07:59 +0800)]
selftests: ublk: fix recover test
When adding recovery test:
- 'break' is missed for handling '-g' argument
- test name of test_generic_05.sh is wrong
So fix the two.
Fixes:
57e13a2e8cd2 ("selftests: ublk: support user recovery")
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250421235947.715272-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Darrick J. Wong [Wed, 23 Apr 2025 19:53:57 +0000 (12:53 -0700)]
block: hoist block size validation code to a separate function
Hoist the block size validation code to bdev_validate_blocksize so that
we can call it from filesystems that don't care about the bdev pagecache
manipulations of set_blocksize.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/174543795720.4139148.840349813093799165.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Darrick J. Wong [Wed, 23 Apr 2025 19:53:42 +0000 (12:53 -0700)]
block: fix race between set_blocksize and read paths
With the new large sector size support, it's now the case that
set_blocksize can change i_blksize and the folio order in a manner that
conflicts with a concurrent reader and causes a kernel crash.
Specifically, let's say that udev-worker calls libblkid to detect the
labels on a block device. The read call can create an order-0 folio to
read the first 4096 bytes from the disk. But then udev is preempted.
Next, someone tries to mount an 8k-sectorsize filesystem from the same
block device. The filesystem calls set_blksize, which sets i_blksize to
8192 and the minimum folio order to 1.
Now udev resumes, still holding the order-0 folio it allocated. It then
tries to schedule a read bio and do_mpage_readahead tries to create
bufferheads for the folio. Unfortunately, blocks_per_folio == 0 because
the page size is 4096 but the blocksize is 8192 so no bufferheads are
attached and the bh walk never sets bdev. We then submit the bio with a
NULL block device and crash.
Therefore, truncate the page cache after flushing but before updating
i_blksize. However, that's not enough -- we also need to lock out file
IO and page faults during the update. Take both the i_rwsem and the
invalidate_lock in exclusive mode for invalidations, and in shared mode
for read/write operations.
I don't know if this is the correct fix, but xfs/259 found it.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Tested-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Link: https://lore.kernel.org/r/174543795699.4139148.2086129139322431423.stgit@frogsfrogsfrogs
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Wed, 16 Apr 2025 17:01:53 +0000 (11:01 -0600)]
ublk: remove unnecessary ubq checks
ublk_init_queues() ensures that all nr_hw_queues queues are initialized,
with each ublk_queue's q_id set to its index. And ublk_init_queues() is
called before ublk_add_chdev(), which creates the cdev. Is is therefore
impossible for the !ubq || ub_cmd->q_id != ubq->q_id condition to hit in
__ublk_ch_uring_cmd(). Remove it to avoids some branches in the I/O path.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Uday Shankar <ushankar@purestorage.com>
Link: https://lore.kernel.org/r/20250416170154.3621609-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Richard Weinberger [Fri, 18 Apr 2025 08:02:50 +0000 (10:02 +0200)]
nvmet: fix out-of-bounds access in nvmet_enable_port
When trying to enable a port that has no transport configured yet,
nvmet_enable_port() uses NVMF_TRTYPE_MAX (255) to query the transports
array, causing an out-of-bounds access:
[ 106.058694] BUG: KASAN: global-out-of-bounds in nvmet_enable_port+0x42/0x1da
[ 106.058719] Read of size 8 at addr
ffffffff89dafa58 by task ln/632
[...]
[ 106.076026] nvmet: transport type 255 not supported
Since commit
200adac75888, NVMF_TRTYPE_MAX is the default state as configured by
nvmet_ports_make().
Avoid this by checking for NVMF_TRTYPE_MAX before proceeding.
Fixes:
200adac75888 ("nvme: Add PCI transport type")
Signed-off-by: Richard Weinberger <richard@nod.at>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
Omri Mann [Mon, 21 Apr 2025 10:59:50 +0000 (13:59 +0300)]
ublk: Add UBLK_U_CMD_UPDATE_SIZE
Currently ublk only allows the size of the ublkb block device to be
set via UBLK_CMD_SET_PARAMS before UBLK_CMD_START_DEV is triggered.
This does not provide support for extendable user-space block devices
without having to stop and restart the underlying ublkb block device
causing IO interruption.
This patch adds a new ublk command UBLK_U_CMD_UPDATE_SIZE to allow the
ublk block device to be resized on-the-fly.
Feature flag UBLK_F_UPDATE_SIZE is also added to indicate support.
Signed-off-by: Omri Mann <omri@nvidia.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/2a370ab1-d85b-409d-b762-f9f3f6bdf705@nvidia.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 15 Apr 2025 14:49:35 +0000 (08:49 -0600)]
block: blk-rq-qos: guard rq-qos helpers by static key
Even if blk-rq-qos isn't used or configured, dipping into the queue to
fetch ->rq_qos is a noticeable slowdown and visible in profiles. Add an
unlikely static key around blk-rq-qos, to avoid fetching this cacheline
if blk-iolatency or blk-wbt isn't configured or used.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 15 Apr 2025 14:48:06 +0000 (08:48 -0600)]
block: ensure that struct blk_mq_alloc_data is fully initialized
On x86, rep stos will be emitted to clear the the blk_mq_alloc_data
struct, as not all members are being explicitly initialied. Depending on
the type of CPU, this is a noticeable slowdown compared to just ensuring
that the struct is fully initialized when setup.
For the 4 spots that setup a struct blk_mq_alloc_data on the stack,
ensure all members are being initialized.
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Bart Van Assche [Tue, 15 Apr 2025 20:51:34 +0000 (13:51 -0700)]
block: Simplify blk_mq_dispatch_rq_list() and its callers
The 'nr_budgets' argument of blk_mq_dispatch_rq_list() is either the
number of elements in the 'list' argument or zero. Instead of passing
the number of list elements to blk_mq_dispatch_rq_list(), pass a boolean
argument that indicates whether or not blk_mq_dispatch_rq_list() should
request the block driver for a budget for each request in 'list'.
Remove the code for counting list elements from blk_mq_dispatch_rq_list()
callers where possible. Remove the code that decrements nr_budgets from
blk_mq_dispatch_rq_list() because it is superfluous. Each request that
is processed by blk_mq_dispatch_rq_list() is in one of these two states
if 'get_budget' is false:
* Either the request is on 'list' and the budget for the request has to
be released from the error path.
* Or the request is not on 'list' and q->mq_ops->queue_rq() has already
released the budget (ret != BLK_STS_OK) or q->mq_ops->queue_rq() will
release the budget asynchronously (ret == BLK_STS_OK).
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: John Garry <john.g.garry@oracle.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Link: https://lore.kernel.org/r/20250415205134.3650042-1-bvanassche@acm.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 20 Apr 2025 22:30:53 +0000 (15:30 -0700)]
gcc-15: disable '-Wunterminated-string-initialization' entirely for now
I had left the warning around but as a non-fatal error to get my gcc-15
builds going, but fixed up some of the most annoying warning cases so
that it wouldn't be *too* verbose.
Because I like the _concept_ of the warning, even if I detested the
implementation to shut it up.
It turns out the implementation to shut it up is even more broken than I
thought, and my "shut up most of the warnings" patch just caused fatal
errors on gcc-14 instead.
I had tested with clang, but when I upgrade my development environment,
I try to do it on all machines because I hate having different systems
to maintain, and hadn't realized that gcc-14 now had issues.
The ACPI case is literally why I wanted to have a *type* that doesn't
trigger the warning (see commit
d5d45a7f2619: "gcc-15: make
'unterminated string initialization' just a warning"), instead of
marking individual places as "__nonstring".
But gcc-14 doesn't like that __nonstring location that shut gcc-15 up,
because it's on an array of char arrays, not on one single array:
drivers/acpi/tables.c:399:1: error: 'nonstring' attribute ignored on objects of type 'const char[][4]' [-Werror=attributes]
399 | static const char table_sigs[][ACPI_NAMESEG_SIZE] __initconst __nonstring = {
| ^~~~~~
and my attempts to nest it properly with a type had failed, because of
how gcc doesn't like marking the types as having attributes, only
symbols.
There may be some trick to it, but I was already annoyed by the bad
attribute design, now I'm just entirely fed up with it.
I wish gcc had a proper way to say "this type is a *byte* array, not a
string".
The obvious thing would be to distinguish between "char []" and an
explicitly signed "unsigned char []" (as opposed to an implicitly
unsigned char, which is typically an architecture-specific default, but
for the kernel is universal thanks to '-funsigned-char').
But any "we can typedef a 8-bit type to not become a string just because
it's an array" model would be fine.
But "__attribute__((nonstring))" is sadly not that sane model.
Reported-by: Chris Clayton <chris2553@googlemail.com>
Fixes:
4b4bd8c50f48 ("gcc-15: acpi: sprinkle random '__nonstring' crumbles around")
Fixes:
d5d45a7f2619 ("gcc-15: make 'unterminated string initialization' just a warning")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Apr 2025 20:43:47 +0000 (13:43 -0700)]
Linux 6.15-rc3
Linus Torvalds [Sun, 20 Apr 2025 18:30:11 +0000 (11:30 -0700)]
gcc-15: work around sequence-point warning
The C sequence points are complicated things, and gcc-15 has apparently
added a warning for the case where an object is both used and modified
multiple times within the same sequence point.
That's a great warning.
Or rather, it would be a great warning, except gcc-15 seems to not
really be very exact about it, and doesn't notice that the modification
are to two entirely different members of the same object: the array
counter and the array entries.
So that seems kind of silly.
That said, the code that gcc complains about is unnecessarily
complicated, so moving the array counter update into a separate
statement seems like the most straightforward fix for these warnings:
drivers/net/wireless/intel/iwlwifi/mld/d3.c: In function ‘iwl_mld_set_netdetect_info’:
drivers/net/wireless/intel/iwlwifi/mld/d3.c:1102:66: error: operation on ‘netdetect_info->n_matches’ may be undefined [-Werror=sequence-point]
1102 | netdetect_info->matches[netdetect_info->n_matches++] = match;
| ~~~~~~~~~~~~~~~~~~~~~~~~~^~
drivers/net/wireless/intel/iwlwifi/mld/d3.c:1120:58: error: operation on ‘match->n_channels’ may be undefined [-Werror=sequence-point]
1120 | match->channels[match->n_channels++] =
| ~~~~~~~~~~~~~~~~~^~
side note: the code at that second warning is actively buggy, and only
works on little-endian machines that don't do strict alignment checks.
The code casts an array of integers into an array of unsigned long in
order to use our bitmap iterators. That happens to work fine on any
sane architecture, but it's still wrong.
This does *not* fix that more serious problem. This only splits the two
assignments into two statements and fixes the compiler warning. I need
to get rid of the new warnings in order to be able to actually do any
build testing.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Apr 2025 18:18:55 +0000 (11:18 -0700)]
gcc-15: add '__nonstring' markers to byte arrays
All of these cases are perfectly valid and good traditional C, but hit
by the "you're not NUL-terminating your byte array" warning.
And none of the cases want any terminating NUL character.
Mark them __nonstring to shut up gcc-15 (and in the case of the ak8974
magnetometer driver, I just removed the explicit array size and let gcc
expand the 3-byte and 6-byte arrays by one extra byte, because it was
the simpler change).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Apr 2025 18:04:00 +0000 (11:04 -0700)]
gcc-15: get rid of misc extra NUL character padding
This removes two cases of explicit NUL padding that now causes warnings
because of '-Wunterminated-string-initialization' being part of -Wextra
in gcc-15.
Gcc is being silly in this case when it says that it truncates a NUL
terminator, because in these cases there were _multiple_ NUL characters.
But we can get rid of the warning by just simplifying the two
initializers that trigger the warning for me, so this does exactly that.
I'm not sure why the power supply code did that odd
.attr_name = #_name "\0",
pattern: it was introduced in commit
2cabeaf15129 ("power: supply: core:
Cleanup power supply sysfs attribute list"), but that 'attr_name[]'
field is an explicitly sized character array in a statically initialized
variable, and a string initializer always has a terminating NUL _and_
statically initialized character arrays are zero-padded anyway, so it
really seems to be rather extraneous belt-and-suspenders.
The zero_uuid[16] initialization in drivers/md/bcache/super.c makes
perfect sense, but it isn't necessary for the same reasons, and not
worth the new gcc warning noise.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Apr 2025 18:02:18 +0000 (11:02 -0700)]
gcc-15: acpi: sprinkle random '__nonstring' crumbles around
This is not great: I'd much rather introduce a typedef that is a "ACPI
name byte buffer", and use that to mark these special 4-byte ACPI names
that do not use NUL termination.
But as noted in the previous commit ("gcc-15: make 'unterminated string
initialization' just a warning") gcc doesn't actually seem to support
that notion, so instead you have to just mark every single array
declaration individually.
So this is not pretty, but this gets rid of the bulk of the annoying
warnings during an allmodconfig build for me.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Apr 2025 17:33:23 +0000 (10:33 -0700)]
gcc-15: make 'unterminated string initialization' just a warning
gcc-15 enabling -Wunterminated-string-initialization in -Wextra by
default was done with the best intentions, but the warning is still
quite broken.
What annoys me about the warning is that this is a very traditional AND
CORRECT way to initialize fixed byte arrays in C:
unsigned char hex[16] = "
0123456789abcdef";
and we use this all over the kernel. And the warning is fine, but gcc
developers apparently never made a reasonable way to disable it. As is
(sadly) tradition with these things.
Yes, there's "__attribute__((nonstring))", and we have a macro to make
that absolutely disgusting syntax more palatable (ie the kernel syntax
for that monstrosity is just "__nonstring").
But that attribute is misdesigned. What you'd typically want to do is
tell the compiler that you are using a type that isn't a string but a
byte array, but that doesn't work at all:
warning: ‘nonstring’ attribute does not apply to types [-Wattributes]
and because of this fundamental mis-design, you then have to mark each
instance of that pattern.
This is particularly noticeable in our ACPI code, because ACPI has this
notion of a 4-byte "type name" that gets used all over, and is exactly
this kind of byte array.
This is a sad oversight, because the warning is useful, but really would
be so much better if gcc had also given a sane way to indicate that we
really just want a byte array type at a type level, not the broken "each
and every array definition" level.
So now instead of creating a nice "ACPI name" type using something like
typedef char acpi_name_t[4] __nonstring;
we have to do things like
char name[ACPI_NAMESEG_SIZE] __nonstring;
in every place that uses this concept and then happens to have the
typical initializers.
This is annoying me mainly because I think the warning _is_ a good
warning, which is why I'm not just turning it off in disgust. But it is
hampered by this bad implementation detail.
[ And obviously I'm doing this now because system upgrades for me are
something that happen in the middle of the release cycle: don't do it
before or during travel, or just before or during the busy merge
window period. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Apr 2025 04:46:58 +0000 (21:46 -0700)]
Merge tag 'mm-hotfixes-stable-2025-04-19-21-24' of git://git./linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"16 hotfixes. 2 are cc:stable and the remainder address post-6.14
issues or aren't considered necessary for -stable kernels.
All patches are basically for MM although five are alterations to
MAINTAINERS"
[ Basic counting skills are clearly not a strictly necessary requirement
for kernel maintainers. - Linus ]
* tag 'mm-hotfixes-stable-2025-04-19-21-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add section for locking of mm's and VMAs
mm: vmscan: fix kswapd exit condition in defrag_mode
mm: vmscan: restore high-cpu watermark safety in kswapd
MAINTAINERS: add Pedro as reviewer to the MEMORY MAPPING section
mm/memory: move sanity checks in do_wp_page() after mapcount vs. refcount stabilization
mm, hugetlb: increment the number of pages to be reset on HVO
writeback: fix false warning in inode_to_wb()
docs: ABI: replace mcroce@microsoft.com with new Meta address
mm/gup: fix wrongly calculated returned value in fault_in_safe_writeable()
MAINTAINERS: add memory advice section
MAINTAINERS: add mmap trace events to MEMORY MAPPING
mm: memcontrol: fix swap counter leak from offline cgroup
MAINTAINERS: add MM subsection for the page allocator
MAINTAINERS: update SLAB ALLOCATOR maintainers
fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages
mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()
Linus Torvalds [Sat, 19 Apr 2025 21:31:08 +0000 (14:31 -0700)]
Merge tag 'vfs-6.15-rc3.fixes.2' of git://git./linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Revert the hfs{plus} deprecation warning that's also included in this
pull request. The commit introducing the deprecation warning resides
rather early in this branch. So simply dropping it would've rebased
all other commits which I decided to avoid. Hence the revert in the
same branch
[ Background - the deprecation warning discussion resulted in people
stepping up, and so hfs{plus} will have a maintainer taking care of
it after all.. - Linus ]
- Switch CONFIG_SYSFS_SYCALL default to n and decouple from
CONFIG_EXPERT
- Fix an audit bug caused by changes to our kernel path lookup helpers
this cycle. Audit needs the parent path even if the dentry it tried
to look up is negative
- Ensure that the kernel path lookup helpers leave the passed in path
argument clean when they return an error. This is consistent with all
our other helpers
- Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant
information is available to kernel consumers as well
- Don't set a timer and call schedule() if the timer will expire
immediately in epoll
- Make netfs lookup tables with __nonstring
* tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Revert "hfs{plus}: add deprecation warning"
fs: move the bdex_statx call to vfs_getattr_nosec
netfs: Mark __nonstring lookup tables
eventpoll: Set epoll timeout if it's in the future
fs: ensure that *path_locked*() helpers leave passed path pristine
fs: add kern_path_locked_negative()
hfs{plus}: add deprecation warning
Kconfig: switch CONFIG_SYSFS_SYCALL default to n
Linus Torvalds [Sat, 19 Apr 2025 20:59:04 +0000 (13:59 -0700)]
Merge tag 'i2c-for-6.15-rc3' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
- Address translator: fix wrong include
- ChromeOS EC tunnel: fix potential NULL pointer dereference
* tag 'i2c-for-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: atr: Fix wrong include
i2c: cros-ec-tunnel: defer probe if parent EC is not present
Christian Brauner [Sat, 19 Apr 2025 20:48:59 +0000 (22:48 +0200)]
Revert "hfs{plus}: add deprecation warning"
This reverts commit
ddee68c499f76ae47c011549df5be53db0057402.
There's ongoing discussion about better maintenance of at least hfsplus.
Rever the deprecation warning for now.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Linus Torvalds [Sat, 19 Apr 2025 18:57:36 +0000 (11:57 -0700)]
Merge tag 'trace-v6.15-rc2' of git://git./linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Initialize hash variables in ftrace subops logic
The fix that simplified the ftrace subops logic opened a path where
some variables could be used without being initialized, and done
subtly where the compiler did not catch it. Initialize those
variables to the EMPTY_HASH, which is the default hash.
- Reinitialize the hash pointers after they are freed
Some of the hash pointers in the subop logic were freed but may still
be referenced later. To prevent use-after-free bugs, initialize them
back to the EMPTY_HASH.
- Free the ftrace hashes when they are replaced
The fix that simplified the subops logic updated some hash pointers,
but left the original hash that they were pointing to where they are
no longer used. This caused a memory leak. Free the hashes that are
pointed to by the pointers when they are replaced.
- Fix size initialization of ftrace direct function hash
The ftrace direct function hash used by BPF initialized the hash size
incorrectly. It checked the size of items to a hard coded 32, which
made the hash bit size of 5. The hash size is supposed to be limited
by the bit size of the hash, as the bitmask is allowed to be greater
than 5. Rework the size check to first pass the number of elements to
fls() and then compare that to FTRACE_HASH_MAX_BITS before allocating
the hash.
- Fix format output of ftrace_graph_ent_entry event
The field depth of the ftrace_graph_ent_entry event is of size 4 but
the output showed it as unsigned long and use "%lu". Change it to
unsigned int and use "%u" in the print format that is displayed to
user space.
- Fix the trace event filter on strings
Events can be filtered on numbers or string values. The return value
checked from strncpy_from_kernel_nofault() and
strncpy_from_user_nofault() was used to determine if reading the
strings would fault or not. It would return fault if the value was
non zero, which is basically meant that it was always considering the
read as a fault.
- Add selftest to test trace event string filtering
In order to catch the breakage of the string filtering, add a self
test to make sure that it continues to work.
* tag 'trace-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: selftests: Add testing a user string to filters
tracing: Fix filter string testing
ftrace: Fix type of ftrace_graph_ent_entry.depth
ftrace: fix incorrect hash size in register_ftrace_direct()
ftrace: Free ftrace hashes after they are replaced in the subops code
ftrace: Reinitialize hash to EMPTY_HASH after freeing
ftrace: Initialize variables for ftrace_startup/shutdown_subops()
Linus Torvalds [Sat, 19 Apr 2025 17:38:03 +0000 (10:38 -0700)]
Merge tag 'nfsd-6.15-1' of git://git./linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- v6.15 libcrc clean-up makes invalid configurations possible
- Fix a potential deadlock introduced during the v6.15 merge window
* tag 'nfsd-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: decrease sc_count directly if fail to queue dl_recall
nfs: add missing selections of CONFIG_CRC32
Linus Torvalds [Sat, 19 Apr 2025 17:02:43 +0000 (10:02 -0700)]
Merge tag 'rust-fixes-6.15' of git://git./linux/kernel/git/ojeda/linux
Pull rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:
- Fix missing KASAN LLVM flags on first build (and fix spurious
rebuilds) by skipping '--target'
- Fix Make < 4.3 build error by using '$(pound)'
- Fix UML build error by removing 'volatile' qualifier from io
helpers
- Fix UML build error by adding 'dma_{alloc,free}_attrs()' helpers
- Clean gendwarfksyms warnings by avoiding to export '__pfx' symbols
- Clean objtool warning by adding a new 'noreturn' function for
1.86.0
- Disable 'needless_continue' Clippy lint due to new 1.86.0 warnings
- Add missing 'ffi' crate to 'generate_rust_analyzer.py'
'pin-init' crate:
- Import a couple fixes from upstream"
* tag 'rust-fixes-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
rust: helpers: Add dma_alloc_attrs() and dma_free_attrs()
rust: helpers: Remove volatile qualifier from io helpers
rust: kbuild: use `pound` to support GNU Make < 4.3
objtool/rust: add one more `noreturn` Rust function for Rust 1.86.0
rust: kasan/kbuild: fix missing flags on first build
rust: disable `clippy::needless_continue`
rust: kbuild: Don't export __pfx symbols
rust: pin-init: use Markdown autolinks in Rust comments
rust: pin-init: alloc: restrict `impl ZeroableOption` for `Box` to `T: Sized`
scripts: generate_rust_analyzer: Add ffi crate
Linus Torvalds [Sat, 19 Apr 2025 16:31:21 +0000 (09:31 -0700)]
Merge tag 'drm-fixes-2025-04-19' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Easter rc3 pull request, fixes in all the usuals, amdgpu, xe, msm,
with some i915/ivpu/mgag200/v3d fixes, then a couple of bits in
dma-buf/gem.
Hopefully has no easter eggs in it.
dma-buf:
- Correctly decrement refcounter on errors
gem:
- Fix test for imported buffers
amdgpu:
- Cleaner shader sysfs fix
- Suspend fix
- Fix doorbell free ordering
- Video caps fix
- DML2 memory allocation optimization
- HDP fix
i915:
- Fix DP DSC configurations that require 3 DSC engines per pipe
xe:
- Fix LRC address being written too late for GuC
- Fix notifier vs folio deadlock
- Fix race betwen dma_buf unmap and vram eviction
- Fix debugfs handling PXP terminations unconditionally
msm:
- Display:
- Fix to call dpu_plane_atomic_check_pipe() for both SSPPs in
case of multi-rect
- Fix to validate plane_state pointer before using it in
dpu_plane_virtual_atomic_check()
- Fix to make sure dereferencing dpu_encoder_phys happens after
making sure it is valid in _dpu_encoder_trigger_start()
- Remove the remaining intr_tear_rd_ptr which we initialized to
-1 because NO_IRQ indices start from 0 now
- GPU:
- Fix IB_SIZE overflow
ivpu:
- Fix debugging
- Fixes to frequency
- Support firmware API 3.28.3
- Flush jobs upon reset
mgag200:
- Set vblank start to correct values
v3d:
- Fix Indirect Dispatch"
* tag 'drm-fixes-2025-04-19' of https://gitlab.freedesktop.org/drm/kernel: (26 commits)
drm/msm/a6xx+: Don't let IB_SIZE overflow
drm/xe/pxp: do not queue unneeded terminations from debugfs
drm/xe/dma_buf: stop relying on placement in unmap
drm/xe/userptr: fix notifier vs folio deadlock
drm/xe: Set LRC addresses before guc load
drm/mgag200: Fix value in <VBLKSTR> register
drm/gem: Internally test import_attach for imported objects
drm/amdgpu: Use the right function for hdp flush
drm/amd/display/dml2: use vzalloc rather than kzalloc
drm/amdgpu: Add back JPEG to video caps for carrizo and newer
drm/amdgpu: fix warning of drm_mm_clean
drm/amd: Forbid suspending into non-default suspend states
drm/amdgpu: use a dummy owner for sysfs triggered cleaner shaders v4
drm/i915/dp: Check for HAS_DSC_3ENGINES while configuring DSC slices
drm/i915/display: Add macro for checking 3 DSC engines
dma-buf/sw_sync: Decrement refcount on error in sw_sync_ioctl_get_deadline()
accel/ivpu: Add cmdq_id to job related logs
accel/ivpu: Show NPU frequency in sysfs
accel/ivpu: Fix the NPU's DPU frequency calculation
accel/ivpu: Update FW Boot API to version 3.28.3
...
Dave Airlie [Sat, 19 Apr 2025 05:09:29 +0000 (15:09 +1000)]
Merge tag 'drm-msm-fixes-2025-04-18' of https://gitlab.freedesktop.org/drm/msm into drm-fixes
Fixes for v6.15-rc3
Display:
- Fix to call dpu_plane_atomic_check_pipe() for both SSPPs in
case of multi-rect
- Fix to validate plane_state pointer before using it in
dpu_plane_virtual_atomic_check()
- Fix to make sure dereferencing dpu_encoder_phys happens after
making sure it is valid in _dpu_encoder_trigger_start()
- Remove the remaining intr_tear_rd_ptr which we initialized
to -1 because NO_IRQ indices start from 0 now
GPU:
- Fix IB_SIZE overflow
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <robdclark@gmail.com>
Link: https://lore.kernel.org/r/CAF6AEGtVKXEVdzUzFWmQE8JmK3nx_hp+ynOd-5j3vnfcU-sgOA@mail.gmail.com
Dave Airlie [Sat, 19 Apr 2025 04:59:47 +0000 (14:59 +1000)]
Merge tag 'drm-xe-fixes-2025-04-18' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
Driver Changes:
- Fix LRC address being written too late for GuC
- Fix notifier vs folio deadlock
- Fix race betwen dma_buf unmap and vram eviction
- Fix debugfs handling PXP terminations unconditionally
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/ndinq644zenywaaycxyfqqivsb2xer4z7err3dlpalbz33jfkm@ttabzsg6wnet
Linus Torvalds [Sat, 19 Apr 2025 03:10:42 +0000 (20:10 -0700)]
Merge tag '6.15-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Fix hard link lease key problem when close is deferred
- Revert the socket lockdep/refcount workarounds done in cifs.ko now
that it is fixed at the socket layer
* tag '6.15-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
Revert "smb: client: fix TCP timers deadlock after rmmod"
Revert "smb: client: Fix netns refcount imbalance causing leaks and use-after-free"
smb3 client: fix open hardlink on deferred close file error
Rob Clark [Mon, 17 Mar 2025 15:00:06 +0000 (08:00 -0700)]
drm/msm/a6xx+: Don't let IB_SIZE overflow
IB_SIZE is only b0..b19. Starting with a6xx gen3, additional fields
were added above the IB_SIZE. Accidentially setting them can cause
badness. Fix this by properly defining the CP_INDIRECT_BUFFER packet
and using the generated builder macro to ensure unintended bits are not
set.
v2: add missing type attribute for IB_BASE
v3: fix offset attribute in xml
Reported-by: Connor Abbott <cwabbott0@gmail.com>
Fixes:
a83366ef19ea ("drm/msm/a6xx: add A640/A650 to gpulist")
Signed-off-by: Rob Clark <robdclark@chromium.org>
Patchwork: https://patchwork.freedesktop.org/patch/643396/
Wolfram Sang [Fri, 18 Apr 2025 21:42:56 +0000 (23:42 +0200)]
Merge tag 'i2c-host-fixes-6.15-rc3' of git://git./linux/kernel/git/andi.shyti/linux into i2c/for-current
i2c-host-fixes for v6.15-rc3
- ChromeOS EC tunnel: fix potential NULL pointer dereference
Linus Torvalds [Fri, 18 Apr 2025 21:04:57 +0000 (14:04 -0700)]
Merge tag 'x86-urgent-2025-04-18' of git://git./linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix hypercall detection on Xen guests
- Extend the AMD microcode loader SHA check to Zen5, to block loading
of any unreleased standalone Zen5 microcode patches
- Add new Intel CPU model number for Bartlett Lake
- Fix the workaround for AMD erratum 1054
- Fix buggy early memory acceptance between SEV-SNP guests and the EFI
stub
* tag 'x86-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/sev: Avoid shared GHCB page for early memory acceptance
x86/cpu/amd: Fix workaround for erratum 1054
x86/cpu: Add CPU model number for Bartlett Lake CPUs with Raptor Cove cores
x86/microcode/AMD: Extend the SHA check to Zen5, block loading of any unreleased standalone Zen5 microcode patches
x86/xen: Fix __xen_hypercall_setfunc()
Linus Torvalds [Fri, 18 Apr 2025 21:02:45 +0000 (14:02 -0700)]
Merge tag 'timers-urgent-2025-04-18' of git://git./linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Fix a lockdep false positive in the i8253 driver"
* tag 'timers-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/i8253: Call clockevent_i8253_disable() with interrupts disabled
Linus Torvalds [Fri, 18 Apr 2025 20:35:13 +0000 (13:35 -0700)]
Merge tag 'perf-urgent-2025-04-18' of git://git./linux/kernel/git/tip/tip
Pull x86 perf event fixes from Ingo Molnar:
"Miscellaneous fixes and a hardware-enabling change:
- Fix Intel uncore PMU IIO free running counters on SPR, ICX and SNR
systems
- Fix Intel PEBS buffer overflow handling
- Fix skid in Intel PEBS sampling of user-space general purpose
registers
- Enable Panther Lake PMU support - similar to Lunar Lake"
* tag 'perf-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Add Panther Lake support
perf/x86/intel: Allow to update user space GPRs from PEBS records
perf/x86/intel: Don't clear perf metrics overflow bit unconditionally
perf/x86/intel/uncore: Fix the scale of IIO free running counters on SPR
perf/x86/intel/uncore: Fix the scale of IIO free running counters on ICX
perf/x86/intel/uncore: Fix the scale of IIO free running counters on SNR
Linus Torvalds [Fri, 18 Apr 2025 20:28:41 +0000 (13:28 -0700)]
Merge tag 'irq-urgent-2025-04-18' of git://git./linux/kernel/git/tip/tip
Pull misc irq fixes from Ingo Molnar:
- Fix BCM2712 irqchip driver Kconfig dependencies required on the
Raspberry PI5
- Fix spurious interrupts on RZ/G3E SMARC EVK systems
- Fix crash regression on Sun/NIU hardware
- Apply MSI driver quirk for Sun Neptune chips
* tag 'irq-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/irq-bcm2712-mip: Enable driver when ARCH_BCM2835 is enabled
irqchip/renesas-rzv2h: Prevent TINT spurious interrupt
net/niu: Niu requires MSIX ENTRY_DATA fields touch before entry reads
PCI/MSI: Add an option to write MSIX ENTRY_DATA before any reads
Linus Torvalds [Fri, 18 Apr 2025 20:25:33 +0000 (13:25 -0700)]
Merge tag 'core-urgent-2025-04-18' of git://git./linux/kernel/git/tip/tip
Pull misc core fixes from Ingo Molnar:
"Fix a genksyms related bug, triggered by recent changes to the percpu
code, and update the .clang-format file to not include obsolete
function names"
* tag 'core-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genksyms: Handle typeof_unqual keyword and __seg_{fs,gs} qualifiers
clang-format: Update the ForEachMacros list for v6.15-rc1
Linus Torvalds [Fri, 18 Apr 2025 20:20:20 +0000 (13:20 -0700)]
Merge tag 'hardening-v6.15-rc3' of git://git./linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- lib/prime_numbers: KUnit test should not select PRIME_NUMBERS (Geert
Uytterhoeven)
- ubsan: Fix panic from test_ubsan_out_of_bounds (Mostafa Saleh)
- ubsan: Remove 'default UBSAN' from UBSAN_INTEGER_WRAP (Nathan
Chancellor)
- string: Add load_unaligned_zeropad() code path to sized_strscpy()
(Peter Collingbourne)
- kasan: Add strscpy() test to trigger tag fault on arm64 (Vincenzo
Frascino)
- Disable GCC randstruct for COMPILE_TEST
* tag 'hardening-v6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
lib/prime_numbers: KUnit test should not select PRIME_NUMBERS
ubsan: Fix panic from test_ubsan_out_of_bounds
lib/Kconfig.ubsan: Remove 'default UBSAN' from UBSAN_INTEGER_WRAP
hardening: Disable GCC randstruct for COMPILE_TEST
kasan: Add strscpy() test to trigger tag fault on arm64
string: Add load_unaligned_zeropad() code path to sized_strscpy()
Linus Torvalds [Fri, 18 Apr 2025 20:18:01 +0000 (13:18 -0700)]
Merge tag 'gpio-fixes-for-v6.15-rc3' of git://git./linux/kernel/git/brgl/linux
Pull gpio fix from Bartosz Golaszewski:
- check for both the new AND old (deprecated) setter callback when
changing GPIO direction to output
* tag 'gpio-fixes-for-v6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: Allow to use setters with return value for output-only gpios
Linus Torvalds [Fri, 18 Apr 2025 20:09:20 +0000 (13:09 -0700)]
Merge tag 'thermal-6.15-rc3' of git://git./linux/kernel/git/rafael/linux-pm
Pull thermal control fixes from Rafael Wysocki:
"Add missing DVFS support flags for the Lunar Lake and Panther Lake
platforms to the int340x Intel thermal driver and fix DLVR support
for Panther Lake in it (Srinivas Pandruvada)"
* tag 'thermal-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: intel: int340x: Fix Panther Lake DLVR support
thermal: intel: int340x: Add missing DVFS support flags
Linus Torvalds [Fri, 18 Apr 2025 20:06:12 +0000 (13:06 -0700)]
Merge tag 'pm-6.15-rc3' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These are mostly cpufreq fixes, some of which address recent
regressions and some address older issues that have come to light
during the last two weeks, and a runtime PM documentation correction:
- Fix the performance-to-frequency scaling factor computation on
systems using HWP in the intel_pstate driver after a recent
incorrect update of it (Rafael Wysocki)
- Fix the usage of the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag
in the schedutil cpufreq governor after a recent update of it that
has caused frequency limits changes to be missed sometimes (Rafael
Wysocki)
- Address some recently discovered synchronization issues related to
frequency limits changes in the schedutil cpufreq governor and in
the cpufreq core (Rafael Wysocki)
- Fix ITMT support in the amd-pstate cpufreq driver so that it is
enabled after asym priorities have been correctly initialized for
all CPUs (K Prateek Nayak)
- Fix changing min/max limits in the amd-pstate cpufreq driver while
on the performance governor (Dhananjay Ugwekar)
- Fix a function name in the runtime PM documentation that was
previously incorrectly updated by mistake (Sakari Ailus)"
* tag 'pm-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Avoid using inconsistent policy->min and policy->max
cpufreq/sched: Set need_freq_update in ignore_dl_rate_limit()
cpufreq/sched: Explicitly synchronize limits_changed flag handling
cpufreq/sched: Fix the usage of CPUFREQ_NEED_UPDATE_LIMITS
Documentation: PM: runtime: Fix a reference to pm_runtime_autosuspend()
cpufreq: intel_pstate: Fix hwp_get_cpu_scaling()
cpufreq/amd-pstate: Enable ITMT support after initializing core rankings
cpufreq/amd-pstate: Fix min_limit perf and freq updation for performance governor
Rafael J. Wysocki [Fri, 18 Apr 2025 18:55:48 +0000 (20:55 +0200)]
Merge branch 'pm-docs'
Merge a runtime PM documentation correction for 6.15-rc3.
* pm-docs:
Documentation: PM: runtime: Fix a reference to pm_runtime_autosuspend()
Linus Torvalds [Fri, 18 Apr 2025 18:46:44 +0000 (11:46 -0700)]
Merge tag 'riscv-for-linus-6.15-rc3' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix for an issue where C instructions ended up in non-C builds, due
to some broken inline assembly in the KGDB breakpoint insertion code
- A fix to avoid spurious printk messages about misaligned access
performance probing
- A fix for a handful of issues with /proc/iomem's reserved region
handling
- A pair of fixes for module relocation processing
- A few build-time fixes
* tag 'riscv-for-linus-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: KGDB: Remove ".option norvc/.option rvc" for kgdb_compiled_break
riscv: KGDB: Do not inline arch_kgdb_breakpoint()
riscv: Avoid fortify warning in syscall_get_arguments()
riscv: Provide all alternative macros all the time
riscv: module: Allocate PLT entries for R_RISCV_PLT32
riscv: module: Fix out-of-bounds relocation access
riscv: Properly export reserved regions in /proc/iomem
riscv: Fix unaligned access info messages
riscv: Avoid fortify warning in syscall_get_arguments()
Documentation: riscv: Fix typo MIMPLID -> MIMPID
riscv: Use kvmalloc_array on relocation_hashtable
Linus Torvalds [Fri, 18 Apr 2025 18:35:11 +0000 (11:35 -0700)]
Merge tag 'linux_kselftest-kunit-fixes-6.15-rc3' of git://git./linux/kernel/git/shuah/linux-kselftest
Pull kunit fix from Shuah Khan:
"Fixes arch sh kunit qemu_configs script sh.py to honor kunit cmdline"
* tag 'linux_kselftest-kunit-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: qemu_configs: SH: Respect kunit cmdline
Linus Torvalds [Fri, 18 Apr 2025 18:32:31 +0000 (11:32 -0700)]
Merge tag 'linux_kselftest-fixes-6.15-rc3' of git://git./linux/kernel/git/shuah/linux-kselftest
Pull kselftest fix from Shuah Khan:
"Fixes dynevent_limitations.tc test failure on dash by detecting and
handling bash and dash differences in evaluating \\"
* tag 'linux_kselftest-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
selftests/ftrace: Differentiate bash and dash in dynevent_limitations.tc
Linus Torvalds [Fri, 18 Apr 2025 16:37:44 +0000 (09:37 -0700)]
Merge tag 'v6.15-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- Fix integer overflow in server disconnect deadtime calculation
- Three fixes for potential use after frees: one for oplocks, and one
for leases and one for kerberos authentication
- Fix to prevent attempted write to directory
- Fix locking warning for durable scavenger thread
* tag 'v6.15-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: Prevent integer overflow in calculation of deadtime
ksmbd: fix the warning from __kernel_write_iter
ksmbd: fix use-after-free in smb_break_all_levII_oplock()
ksmbd: fix use-after-free in __smb2_lease_break_noti()
ksmbd: fix WARNING "do not call blocking ops when !TASK_RUNNING"
ksmbd: Fix dangling pointer in krb_authenticate
Linus Torvalds [Fri, 18 Apr 2025 16:21:14 +0000 (09:21 -0700)]
Merge tag 'block-6.15-
20250417' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- MD pull via Yu:
- fix raid10 missing discard IO accounting (Yu Kuai)
- fix bitmap stats for bitmap file (Zheng Qixing)
- fix oops while reading all member disks failed during
check/repair (Meir Elisha)
- NVMe pull via Christoph:
- fix scan failure for non-ANA multipath controllers (Hannes
Reinecke)
- fix multipath sysfs links creation for some cases (Hannes
Reinecke)
- PCIe endpoint fixes (Damien Le Moal)
- use NULL instead of 0 in the auth code (Damien Le Moal)
- Various ublk fixes:
- Slew of selftest additions
- Improvements and fixes for IO cancelation
- Tweak to Kconfig verbiage
- Fix for page dirtying for blk integrity mapped pages
- loop fixes:
- buffered IO fix
- uevent fixes
- request priority inheritance fix
- Various little fixes
* tag 'block-6.15-
20250417' of git://git.kernel.dk/linux: (38 commits)
selftests: ublk: add generic_06 for covering fault inject
ublk: simplify aborting ublk request
ublk: remove __ublk_quiesce_dev()
ublk: improve detection and handling of ublk server exit
ublk: move device reset into ublk_ch_release()
ublk: rely on ->canceling for dealing with ublk_nosrv_dev_should_queue_io
ublk: add ublk_force_abort_dev()
ublk: properly serialize all FETCH_REQs
selftests: ublk: move creating UBLK_TMP into _prep_test()
selftests: ublk: add test_stress_05.sh
selftests: ublk: support user recovery
selftests: ublk: support target specific command line
selftests: ublk: increase max nr_queues and queue depth
selftests: ublk: set queue pthread's cpu affinity
selftests: ublk: setup ring with IORING_SETUP_SINGLE_ISSUER/IORING_SETUP_DEFER_TASKRUN
selftests: ublk: add two stress tests for zero copy feature
selftests: ublk: run stress tests in parallel
selftests: ublk: make sure _add_ublk_dev can return in sub-shell
selftests: ublk: cleanup backfile automatically
selftests: ublk: add io_uring uapi header
...
Linus Torvalds [Fri, 18 Apr 2025 16:13:52 +0000 (09:13 -0700)]
Merge tag 'io_uring-6.15-
20250418' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Correctly cap iov_iter->nr_segs for imports of registered buffers,
both kbuf and normal ones.
Three cleanups to make it saner first, then two fixes for each of the
buffer types.
This fixes a performance regression where partial buffer usage
doesn't trim the tail number of segments, leading the block layer to
iterate the IOs to check if it needs splitting.
- Two patches tweaking the newly introduced zero-copy rx API, mostly to
keep the API consistent once we add multiple interface queues per
ring support in the 6.16 release.
- zc rx unmapping fix for a dead device
* tag 'io_uring-6.15-
20250418' of git://git.kernel.dk/linux:
io_uring/zcrx: fix late dma unmap for a dead dev
io_uring/rsrc: ensure segments counts are correct on kbuf buffers
io_uring/rsrc: send exact nr_segs for fixed buffer
io_uring/rsrc: refactor io_import_fixed
io_uring/rsrc: separate kbuf offset adjustments
io_uring/rsrc: don't skip offset calculation
io_uring/zcrx: add pp to ifq conversion helper
io_uring/zcrx: return ifq id to the user
Steven Rostedt [Fri, 18 Apr 2025 14:12:08 +0000 (10:12 -0400)]
tracing: selftests: Add testing a user string to filters
Running the following commands was broken:
# cd /sys/kernel/tracing
# echo "filename.ustring ~ \"/proc*\"" > events/syscalls/sys_enter_openat/filter
# echo 1 > events/syscalls/sys_enter_openat/enable
# ls /proc/$$/maps
# cat trace
And would produce nothing when it should have produced something like:
ls-1192 [007] ..... 8169.828333: sys_openat(dfd:
ffffffffffffff9c, filename:
7efc18359904, flags: 80000, mode: 0)
Add a test to check this case so that it will be caught if it breaks
again.
Link: https://lore.kernel.org/linux-trace-kernel/20250417183003.505835fb@gandalf.local.home/
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Link: https://lore.kernel.org/20250418101208.38dc81f5@gandalf.local.home
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Ard Biesheuvel [Thu, 17 Apr 2025 20:21:21 +0000 (22:21 +0200)]
x86/boot/sev: Avoid shared GHCB page for early memory acceptance
Communicating with the hypervisor using the shared GHCB page requires
clearing the C bit in the mapping of that page. When executing in the
context of the EFI boot services, the page tables are owned by the
firmware, and this manipulation is not possible.
So switch to a different API for accepting memory in SEV-SNP guests, one
which is actually supported at the point during boot where the EFI stub
may need to accept memory, but the SEV-SNP init code has not executed
yet.
For simplicity, also switch the memory acceptance carried out by the
decompressor when not booting via EFI - this only involves the
allocation for the decompressed kernel, and is generally only called
after kexec, as normal boot will jump straight into the kernel from the
EFI stub.
Fixes:
6c3211796326 ("x86/sev: Add SNP-specific unaccepted memory support")
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Co-developed-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: <stable@vger.kernel.org>
Cc: Dionna Amalie Glaze <dionnaglaze@google.com>
Cc: Kevin Loughlin <kevinloughlin@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-efi@vger.kernel.org
Link: https://lore.kernel.org/r/20250404082921.2767593-8-ardb+git@google.com
Link: https://lore.kernel.org/r/20250410132850.3708703-2-ardb+git@google.com
Link: https://lore.kernel.org/r/20250417202120.1002102-2-ardb+git@google.com
Sandipan Das [Fri, 18 Apr 2025 06:19:40 +0000 (11:49 +0530)]
x86/cpu/amd: Fix workaround for erratum 1054
Erratum 1054 affects AMD Zen processors that are a part of Family 17h
Models 00-2Fh and the workaround is to not set HWCR[IRPerfEn]. However,
when X86_FEATURE_ZEN1 was introduced, the condition to detect unaffected
processors was incorrectly changed in a way that the IRPerfEn bit gets
set only for unaffected Zen 1 processors.
Ensure that HWCR[IRPerfEn] is set for all unaffected processors. This
includes a subset of Zen 1 (Family 17h Models 30h and above) and all
later processors. Also clear X86_FEATURE_IRPERF on affected processors
so that the IRPerfCount register is not used by other entities like the
MSR PMU driver.
Fixes:
232afb557835 ("x86/CPU/AMD: Add X86_FEATURE_ZEN1")
Signed-off-by: Sandipan Das <sandipan.das@amd.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Borislav Petkov <bp@alien8.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/caa057a9d6f8ad579e2f1abaa71efbd5bd4eaf6d.1744956467.git.sandipan.das@amd.com