Jens Axboe [Fri, 23 May 2025 12:08:49 +0000 (06:08 -0600)]
io_uring/io-wq: only create a new worker if it can make progress
Hashed work is serialized by io-wq, intended to be used for cases like
serializing buffered writes to a regular file, where the file system
will serialize the workers anyway with a mutex or similar. Since they
would be forcibly serialized and blocked, it's more efficient for io-wq
to handle these individually rather than issue them in parallel.
If a worker is currently handling a hashed work item and gets blocked,
don't create a new worker if the next work item is also hashed and
mapped to the same bucket. That new worker would not be able to make any
progress anyway.
Reported-by: Fengnan Chang <changfengnan@bytedance.com>
Reported-by: Diangang Li <lidiangang@bytedance.com>
Link: https://lore.kernel.org/io-uring/20250522090909.73212-1-changfengnan@bytedance.com/
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 23 May 2025 12:05:37 +0000 (06:05 -0600)]
io_uring/io-wq: ignore non-busy worker going to sleep
When an io-wq worker goes to sleep, it checks if there's work to do.
If there is, it'll create a new worker. But if this worker is currently
idle, it'll either get woken right back up immediately, or someone
else has already created the necessary worker to handle this work.
Only go through the worker creation logic if the current worker is
currently handling a work item. That means it's being scheduled out as
part of handling that work, not just going to sleep on its own.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 23 May 2025 12:04:29 +0000 (06:04 -0600)]
io_uring/io-wq: move hash helpers to the top
Just in preparation for using them higher up in the function, move
these generic helpers.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Thu, 22 May 2025 15:04:50 +0000 (09:04 -0600)]
trace/io_uring: fix io_uring_local_work_run ctx documentation
The comment for the tracepoint io_uring_local_work_run refers to a field
"tctx" and a type "io_uring_ctx", neither of which exist. "tctx" looks
to mean "ctx" and "io_uring_ctx" should be "io_ring_ctx".
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250522150451.2385652-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 8 May 2025 20:48:33 +0000 (14:48 -0600)]
io_uring: finish IOU_OK -> IOU_COMPLETE transition
IOU_COMPLETE is more descriptive, in that it explicitly says that the
return value means "please post a completion for this request". This
patch completes the transition from IOU_OK to IOU_COMPLETE, replacing
existing IOU_OK users.
This is a purely mechanical change.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 16 May 2025 19:39:32 +0000 (13:39 -0600)]
io_uring: add new helpers for posting overflows
Add two helpers, one for posting overflows for lockless_cq rings, and
one for non-lockless_cq rings. The former can allocate sanely with
GFP_KERNEL, but needs to grab the completion lock for posting, while the
latter must do non-sleeping allocs as it already holds the completion
lock.
While at it, mark the overflow handling functions as __cold as well, as
they should not generally be called during normal operations of the
ring.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 16 May 2025 19:16:44 +0000 (13:16 -0600)]
io_uring: pass in struct io_big_cqe to io_alloc_ocqe()
Rather than pass extra1/extra2 separately, just pass in the (now) named
io_big_cqe struct instead. The callers that don't use/support CQE32 will
now just pass a single NULL, rather than two seperate mystery zero
values.
Move the clearing of the big_cqe elements into io_alloc_ocqe() as well,
so it can get moved out of the generic code.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 16 May 2025 15:45:24 +0000 (09:45 -0600)]
io_uring: make io_alloc_ocqe() take a struct io_cqe pointer
The number of arguments to io_alloc_ocqe() is a bit unwieldy. Make it
take a struct io_cqe pointer rather than three separate CQE args. One
path already has that readily available, add an io_init_cqe() helper for
the remaining two.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 16 May 2025 15:39:14 +0000 (09:39 -0600)]
io_uring: split alloc and add of overflow
Add a new helper, io_alloc_ocqe(), that simply allocates and fills an
overflow entry. Then it can get done outside of the locking section,
and hence use more appropriate gfp_t allocation flags rather than always
default to GFP_ATOMIC.
Inspired by a previous series from Pavel:
https://lore.kernel.org/io-uring/cover.
1747209332.git.asml.silence@gmail.com/
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 14 May 2025 08:07:20 +0000 (09:07 +0100)]
io_uring: open code io_req_cqe_overflow()
A preparation patch, just open code io_req_cqe_overflow().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 16 May 2025 18:33:28 +0000 (12:33 -0600)]
io_uring/fdinfo: get rid of dumping credentials
It's a faily obscure feature, and registered credentials would for that
mostly be a static thing. Don't bother including code to dump the
personalities indices.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 16 May 2025 18:10:12 +0000 (12:10 -0600)]
io_uring/fdinfo: only compile if CONFIG_PROC_FS is set
Rather than wrap fdinfo.c in one big if, handle it on the Makefile
side instead. io_uring.c already conditionally sets fops->fdinfo()
anyway.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 16 May 2025 18:31:19 +0000 (12:31 -0600)]
Merge branch 'io_uring-6.15' into for-6.16/io_uring
Merge in 6.15 io_uring fixes, mostly so that the fdinfo changes can
get easily extended without causing merge conflicts.
* io_uring-6.15:
io_uring/fdinfo: grab ctx->uring_lock around io_uring_show_fdinfo()
io_uring/memmap: don't use page_address() on a highmem page
io_uring/uring_cmd: fix hybrid polling initialization issue
io_uring/sqpoll: Increase task_work submission batch size
io_uring: ensure deferred completions are flushed for multishot
io_uring: always arm linked timeouts prior to issue
io_uring/fdinfo: annotate racy sq/cq head/tail reads
io_uring: fix 'sync' handling of io_fallback_tw()
io_uring: don't duplicate flushing in io_req_post_cqe
Jens Axboe [Tue, 13 May 2025 21:02:23 +0000 (15:02 -0600)]
io_uring/fdinfo: grab ctx->uring_lock around io_uring_show_fdinfo()
Not everything requires locking in there, which is why the 'has_lock'
variable exists. But enough does that it's a bit unwieldy to manage.
Wrap the whole thing in a ->uring_lock trylock, and just return
with no output if we fail to grab it. The existing trylock() will
already have greatly diminished utility/output for the failure case.
This fixes an issue with reading the SQE fields, if the ring is being
actively resized at the same time.
Reported-by: Jann Horn <jannh@google.com>
Fixes:
79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 13 May 2025 17:26:51 +0000 (18:26 +0100)]
io_uring/kbuf: unify legacy buf provision and removal
Combine IORING_OP_PROVIDE_BUFFERS and IORING_OP_REMOVE_BUFFERS
->issue(), so that we can deduplicate ring locking and list lookups.
This way we further reduce code for legacy provided buffers. Locking is
also separated from buffer related handling, which makes it a bit
simpler with label jumps.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f61af131622ad4337c2fb9f7c453d5b0102c7b90.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 13 May 2025 17:26:50 +0000 (18:26 +0100)]
io_uring/kbuf: refactor __io_remove_buffers
__io_remove_buffers used for two purposes, the first is removing
buffers for non ring based lists, which implies that it can be called
multiple times for the same list. And the second is for destroying
lists, which is not perfectly reentrable for ring based lists.
It's confusing, so just have a helper for the legacy pbuf buffer
removal, make sure it's not called for ring pbuf, and open code all ring
pbuf destruction into io_put_bl().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0ae416b099d311ad23f285cea02f2c94c8ae9a6c.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 13 May 2025 17:26:49 +0000 (18:26 +0100)]
io_uring/kbuf: don't compute size twice on prep
The size in prep is calculated by io_provide_buffers_prep(), so remove
the recomputation a few lines after.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7c97206561b74fce245cb22449c6082d2e066844.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 13 May 2025 17:26:48 +0000 (18:26 +0100)]
io_uring/kbuf: drop extra vars in io_register_pbuf_ring
bl and free_bl variables in io_register_pbuf_ring() always point to the
same list since we started to reallocate the pre-existent list. Drop
free_bl.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d45c3342d74c9030f99376c777a4b3d59089074d.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 13 May 2025 17:26:47 +0000 (18:26 +0100)]
io_uring/kbuf: use mem_is_zero()
Make use of mem_is_zero() for reserved fields checking.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/11fe27b7a831329bcdb4ea087317ef123ba7c171.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 13 May 2025 17:26:46 +0000 (18:26 +0100)]
io_uring/kbuf: account ring io_buffer_list memory
Follow the non-ringed pbuf struct io_buffer_list allocations and account
it against the memcg. There is low chance of that being an actual
problem as ring provided buffer should either pin user memory or
allocate it, which is already accounted.
Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3985218b50d341273cafff7234e1a7e6d0db9808.1747150490.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 12 May 2025 15:06:06 +0000 (09:06 -0600)]
io_uring/memmap: don't use page_address() on a highmem page
For older/32-bit systems with highmem, don't assume that the pages in
a mapped region are always going to be mapped. If io_region_init_ptr()
finds that the pages are coalescable, also check if the first page is
a HighMem page or not. If it is, fall through to the usual vmap()
mapping rather than attempt to get the unmapped page address.
Cc: stable@vger.kernel.org
Fixes:
c4d0ac1c1567 ("io_uring/memmap: optimise single folio regions")
Link: https://lore.kernel.org/all/681fe2fb.050a0220.f2294.001a.GAE@google.com/
Reported-by: syzbot+5b8c4abafcb1d791ccfc@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/681fed0a.050a0220.f2294.001c.GAE@google.com/
Reported-by: syzbot+6456a99dfdc2e78c4feb@syzkaller.appspotmail.com
Tested-by: syzbot+6456a99dfdc2e78c4feb@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 9 May 2025 11:12:54 +0000 (12:12 +0100)]
io_uring: drain based on allocates reqs
Don't rely on CQ sequence numbers for draining, as it has become messy
and needs cq_extra adjustments. Instead, base it on the number of
allocated requests and only allow flushing when all requests are in the
drain list.
As a result, cq_extra is gone, no overhead for its accounting in aux cqe
posting, less bloating as it was inlined before, and it's in general
simpler than trying to track where we should bump it and where it should
be put back like in cases of overflow. Also, it'll likely help with
cleaning and unifying some of the CQ posting helpers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/46ece1e34320b046c06fee2498d6b4cd12a700f2.1746788718.git.asml.silence@gmail.com
Link: https://lore.kernel.org/r/24497b04b004bceada496033d3c9d09ff8e81ae9.1746944903.git.asml.silence@gmail.com
[axboe: fold in fix from link2]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
hexue [Mon, 12 May 2025 05:20:25 +0000 (13:20 +0800)]
io_uring/uring_cmd: fix hybrid polling initialization issue
Modify the check for whether the timer is initialized during IO transfer
when passthrough is used with hybrid polling, to ensure that it's always
setup correctly.
Cc: stable@vger.kernel.org
Fixes:
01ee194d1aba ("io_uring: add support for hybrid IOPOLL")
Signed-off-by: hexue <xue01.he@samsung.com>
Link: https://lore.kernel.org/r/20250512052025.293031-1-xue01.he@samsung.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 9 May 2025 11:12:53 +0000 (12:12 +0100)]
io_uring: count allocated requests
Keep track of the number requests a ring currently has allocated (and
not freed), it'll be needed in the next patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c8f8308294dc2a1cb8925d984d937d4fc14ab5d4.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 9 May 2025 11:12:52 +0000 (12:12 +0100)]
io_uring: open code io_account_cq_overflow()
io_account_cq_overflow() doesn't help explaining what's going on in
there, and it'll become even smaller with following patches, so open
code it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e4333fa0d371f519e52a71148ebdffed4b8d3aa9.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 9 May 2025 11:12:51 +0000 (12:12 +0100)]
io_uring: consolidate drain seq checking
We check sequences when queuing drained requests as well when flushing
them. Instead, always queue and immediately try to flush, so that all
seq handling can be kept contained in the flushing code.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d4651f742e671af5b3216581e539ea5d31bc7125.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 9 May 2025 11:12:50 +0000 (12:12 +0100)]
io_uring: remove drain prealloc checks
Currently io_drain_req() has two steps. The first is fast path checking
sequence numbers. The second is allocations, rechecking and actual
queuing. Further simplify it by removing the first step.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4d06e89ed07611993d7bf89182de2300858379bd.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 9 May 2025 11:12:49 +0000 (12:12 +0100)]
io_uring: simplify drain ret passing
"ret" in io_drain_req() is only used in one place, remove it and pass
-ENOMEM directly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ece724b77e66e6caabcc215e0032ee7ff140f289.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 9 May 2025 11:12:48 +0000 (12:12 +0100)]
io_uring: fix spurious drain flushing
io_queue_deferred() is not tolerant to spurious calls not completing
some requests. You can have an inflight drain-marked request and another
request that came after and got queued into the drain list. Now, if
io_queue_deferred() is called before the first request completes, it'll
check the 2nd req with req_need_defer(), find that there is no drain
flag set, and queue it for execution.
To make io_queue_deferred() work, it should at least check sequences for
the first request, and then we need also need to check if there is
another drain request creating another bubble.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/972bde11b7d4ef25b3f5e3fd34f80e4d2aa345b8.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 9 May 2025 11:12:47 +0000 (12:12 +0100)]
io_uring: account drain memory to cgroup
Account drain allocations against memcg. It's not a big problem as each
such allocation is paired with a request, which is accounted, but it's
nicer to follow the limits more closely.
Cc: stable@vger.kernel.org # 6.1
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f8dfdbd755c41fd9c75d12b858af07dfba5bbb68.1746788718.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 9 May 2025 11:03:43 +0000 (12:03 +0100)]
io_uring: add lockdep asserts to io_add_aux_cqe
io_add_aux_cqe() can only be called for rings with uring_lock protected
completion queues, add a couple of assertions in regards to that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c010eab7b94a187c00a9d46d8b67bf7fcad18af4.1746788592.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 9 May 2025 11:03:28 +0000 (12:03 +0100)]
io_uring/net: move CONFIG_NET guards to Makefile
Instruct Makefile to never try to compile net.c without CONFIG_NET and
kill ifdefs in the file.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/f466400e20c3f536191bfd559b1f3cd2a2ab5a1e.1746788579.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Long Li [Fri, 9 May 2025 06:30:15 +0000 (14:30 +0800)]
io_uring: update parameter name in io_pin_pages function declaration
Rename first parameter in io_pin_pages from ubuf to uaddr for consistency
between declaration and implementation.
Signed-off-by: Long Li <leo.lilong@huawei.com>
Link: https://lore.kernel.org/r/20250509063015.3799255-1-leo.lilong@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Gabriel Krisman Bertazi [Thu, 8 May 2025 18:12:03 +0000 (14:12 -0400)]
io_uring/sqpoll: Increase task_work submission batch size
Our QA team reported a 10%-23%, throughput reduction on an io_uring
sqpoll testcase doing IO to a null_blk, that I traced back to a
reduction of the device submission queue depth utilization. It turns out
that, after commit
af5d68f8892f ("io_uring/sqpoll: manage task_work
privately"), we capped the number of task_work entries that can be
completed from a single spin of sqpoll to only 8 entries, before the
sqpoll goes around to (potentially) sleep. While this cap doesn't drive
the submission side directly, it impacts the completion behavior, which
affects the number of IO queued by fio per sqpoll cycle on the
submission side, and io_uring ends up seeing less ios per sqpoll cycle.
As a result, block layer plugging is less effective, and we see more
time spent inside the block layer in profilings charts, and increased
submission latency measured by fio.
There are other places that have increased overhead once sqpoll sleeps
more often, such as the sqpoll utilization calculation. But, in this
microbenchmark, those were not representative enough in perf charts, and
their removal didn't yield measurable changes in throughput. The major
overhead comes from the fact we plug less, and less often, when submitting
to the block layer.
My benchmark is:
fio --ioengine=io_uring --direct=1 --iodepth=128 --runtime=300 --bs=4k \
--invalidate=1 --time_based --ramp_time=10 --group_reporting=1 \
--filename=/dev/nullb0 --name=RandomReads-direct-nullb-sqpoll-4k-1 \
--rw=randread --numjobs=1 --sqthread_poll
In one machine, tested on top of Linux 6.15-rc1, we have the following
baseline:
READ: bw=4994MiB/s (5236MB/s), 4994MiB/s-4994MiB/s (5236MB/s-5236MB/s), io=439GiB (471GB), run=90001-90001msec
With this patch:
READ: bw=5762MiB/s (6042MB/s), 5762MiB/s-5762MiB/s (6042MB/s-6042MB/s), io=506GiB (544GB), run=90001-90001msec
which is a 15% improvement in measured bandwidth. The average
submission latency is noticeably lowered too. As measured by
fio:
Baseline:
lat (usec): min=20, max=241, avg=99.81, stdev=3.38
Patched:
lat (usec): min=26, max=226, avg=86.48, stdev=4.82
If we look at blktrace, we can also see the plugging behavior is
improved. In the baseline, we end up limited to plugging 8 requests in
the block layer regardless of the device queue depth size, while after
patching we can drive more io, and we manage to utilize the full device
queue.
In the baseline, after a stabilization phase, an ordinary submission
looks like:
254,0 1 49942 0.
016028795 5977 U N [iou-sqp-5976] 7
After patching, I see consistently more requests per unplug.
254,0 1 4996 0.
001432872 3145 U N [iou-sqp-3144] 32
Ideally, the cap size would at least be the deep enough to fill the
device queue, but we can't predict that behavior, or assume all IO goes
to a single device, and thus can't guess the ideal batch size. We also
don't want to let the tw run unbounded, though I'm not sure it would
really be a problem. Instead, let's just give it a more sensible value
that will allow for more efficient batching. I've tested with different
cap values, and initially proposed to increase the cap to 1024. Jens
argued it is too big of a bump and I observed that, with 32, I'm no
longer able to observe this bottleneck in any of my machines.
Fixes:
af5d68f8892f ("io_uring/sqpoll: manage task_work privately")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
Link: https://lore.kernel.org/r/20250508181203.3785544-1-krisman@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 7 May 2025 13:34:24 +0000 (07:34 -0600)]
io_uring: ensure deferred completions are flushed for multishot
Multishot normally uses io_req_post_cqe() to post completions, but when
stopping it, it may finish up with a deferred completion. This is fine,
except if another multishot event triggers before the deferred completions
get flushed. If this occurs, then CQEs may get reordered in the CQ ring,
as new multishot completions get posted before the deferred ones are
flushed. This can cause confusion on the application side, if strict
ordering is required for the use case.
When multishot posting via io_req_post_cqe(), flush any pending deferred
completions first, if any.
Cc: stable@vger.kernel.org # 6.1+
Reported-by: Norman Maurer <norman_maurer@apple.com>
Reported-by: Christian Mazakas <christian.mazakas@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 6 May 2025 12:31:16 +0000 (13:31 +0100)]
io_uring: move io_req_put_rsrc_nodes()
It'd be nice to hide details of how rsrc nodes are used by a request
from rsrc.c, specifically which request fields store them, and what bits
are signifying if there is a node in a request. It rather belong to
generic request handling, so move the helper to io_uring.c. While doing
so remove clearing of ->buf_node as it's controlled by REQ_F_BUF_NODE
and doesn't require zeroing.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bb73fb42baf825edb39344365aff48cdfdd4c692.1746533789.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 6 May 2025 12:31:07 +0000 (13:31 +0100)]
io_uring: remove io_preinit_req()
Apart from setting ->ctx, io_preinit_req() zeroes a bunch of fields of a
request, from which only ->file_node is mandatory. Remove the function
and zero the entire request on first allocation. With that, we also need
to initialise ->ctx every time, which might be a good thing for
performance as now we're likely overwriting the entire cache line, and
so it can write combined and avoid RMW.
Suggested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ba5485dc913f1e275862ce88f5169d4ac4a33836.1746533807.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 6 May 2025 12:30:47 +0000 (13:30 +0100)]
io_uring/timeout: don't export link t-out disarm helper
[__]io_disarm_linked_timeout() are only used inside timeout.c. so
confine them inside the file.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1eb200911255e643bf252a8e65fb2c787340cf18.1746533800.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 1 May 2025 12:17:18 +0000 (13:17 +0100)]
io_uring/zcrx: dmabuf backed zerocopy receive
Add support for dmabuf backed zcrx areas. To use it, the user should
pass IORING_ZCRX_AREA_DMABUF in the struct io_uring_zcrx_area_reg flags
field and pass a dmabuf fd in the dmabuf_fd field.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/20bb1890e60a82ec945ab36370d1fd54be414ab6.1746097431.git.asml.silence@gmail.com
Link: https://lore.kernel.org/io-uring/6e37db97303212bbd8955f9501cf99b579f8aece.1746547722.git.asml.silence@gmail.com
[axboe: fold in fixup]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sun, 4 May 2025 14:06:28 +0000 (08:06 -0600)]
io_uring: always arm linked timeouts prior to issue
There are a few spots where linked timeouts are armed, and not all of
them adhere to the pre-arm, attempt issue, post-arm pattern. This can
be problematic if the linked request returns that it will trigger a
callback later, and does so before the linked timeout is fully armed.
Consolidate all the linked timeout handling into __io_issue_sqe(),
rather than have it spread throughout the various issue entry points.
Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/issues/1390
Reported-by: Chase Hiltz <chase@path.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 1 May 2025 12:17:17 +0000 (13:17 +0100)]
io_uring/zcrx: split common area map/unmap parts
Extract area type depedent parts of io_zcrx_[un]map_area from the
generic path. It'll be helpful once there are more area memory types
and not only user pages.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/50f6e893e2d20f937e628196cbf528d15f81c289.1746097431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 1 May 2025 12:17:16 +0000 (13:17 +0100)]
io_uring/zcrx: split out memory holders from area
In the data path users of struct io_zcrx_area don't need to know what
kind of memory it's backed by. Only keep there generic bits in there and
and split out memory type dependent fields into a new structure. It also
logically separates the step that actually imports the memory, e.g.
pinning user pages, from the generic area initialisation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b60fc09c76921bf69e77eb17e07eb4decedb3bf4.1746097431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 1 May 2025 12:17:15 +0000 (13:17 +0100)]
io_uring/zcrx: resolve netdev before area creation
Some area types will require a valid struct device to be created, so
resolve netdev and struct device before creating an area.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ac8c1482be22acfe9ca788d2c3ce31b7451ce488.1746097431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 1 May 2025 12:17:14 +0000 (13:17 +0100)]
io_uring/zcrx: improve area validation
dmabuf backed area will be taking an offset instead of addresses, and
io_buffer_validate() is not flexible enough to facilitate it. It also
takes an iovec, which may truncate the u64 length zcrx takes. Add a new
helper function for validation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/0b3b735391a0a8f8971bf0121c19765131fddd3b.1746097431.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 30 Apr 2025 13:17:17 +0000 (07:17 -0600)]
io_uring/fdinfo: annotate racy sq/cq head/tail reads
syzbot complains about the cached sq head read, and it's totally right.
But we don't need to care, it's just reading fdinfo, and reading the
CQ or SQ tail/head entries are known racy in that they are just a view
into that very instant and may of course be outdated by the time they
are reported.
Annotate both the SQ head and CQ tail read with data_race() to avoid
this syzbot complaint.
Link: https://lore.kernel.org/io-uring/6811f6dc.050a0220.39e3a1.0d0e.GAE@google.com/
Reported-by: syzbot+3e77fd302e99f5af9394@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 28 Apr 2025 12:52:33 +0000 (13:52 +0100)]
io_uring/cmd: move net cmd into a separate file
We keep socket io_uring command implementation in io_uring/uring_cmd.c.
Separate it from generic command code into a separate file.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/747d0519a2255bd055ae76b691d38d2b4c311001.1745843119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 28 Apr 2025 12:52:32 +0000 (13:52 +0100)]
io_uring: delete misleading comment in io_fill_cqe_aux()
io_fill_cqe_aux() doesn't overflow completions, however it might fail
them and lets the caller handle it. Remove the comment, which doesn't
make any sense.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/021aa8c1d8f20ef2b66da6aeabb6b511938fd2c5.1745843119.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 24 Apr 2025 16:28:14 +0000 (10:28 -0600)]
io_uring: fix 'sync' handling of io_fallback_tw()
A previous commit added a 'sync' parameter to io_fallback_tw(), which if
true, means the caller wants to wait on the fallback thread handling it.
But the logic is somewhat messed up, ensure that ctxs are swapped and
flushed appropriately.
Cc: stable@vger.kernel.org
Fixes:
dfbe5561ae93 ("io_uring: flush offloaded and delayed task_work on exit")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 24 Apr 2025 11:31:18 +0000 (12:31 +0100)]
io_uring/eventfd: open code io_eventfd_grab()
io_eventfd_grab() doesn't help wit understanding the path, it'll be
simpler to keep the helper open coded.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5cb53ce3876c2819db9e8055cf41dca4398521db.1745493845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 24 Apr 2025 11:31:17 +0000 (12:31 +0100)]
io_uring/eventfd: clean up rcu locking
Conditional locking is never welcome if there are better options. Move
rcu locking into io_eventfd_signal(), make it unconditional and use
guards. It also helps with sparse warnings.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91a925e708ca8a5aa7fee61f96d29b24ea9adeaf.1745493845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 24 Apr 2025 11:31:16 +0000 (12:31 +0100)]
io_uring/eventfd: dedup signalling helpers
Consolidate io_eventfd_flush_signal() and io_eventfd_signal(). Not much
of a difference for now, but it prepares it for following changes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5beecd4da65d8d2d83df499196f84b329387f6a2.1745493845.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 24 Apr 2025 11:28:39 +0000 (12:28 +0100)]
io_uring: don't duplicate flushing in io_req_post_cqe
io_req_post_cqe() sets submit_state.cq_flush so that
*flush_completions() can take care of batch commiting CQEs. Don't commit
it twice by using __io_cq_unlock_post().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/41c416660c509cee676b6cad96081274bcb459f3.1745493861.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 20 Apr 2025 09:31:20 +0000 (10:31 +0100)]
io_uring/zcrx: add support for multiple ifqs
Allow the user to register multiple ifqs / zcrx contexts. With that we
can use multiple interfaces / interface queues in a single io_uring
instance.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/668b03bee03b5216564482edcfefbc2ee337dd30.1745141261.git.asml.silence@gmail.com
[axboe: fold in fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 20 Apr 2025 09:31:19 +0000 (10:31 +0100)]
io_uring/zcrx: move zcrx region to struct io_zcrx_ifq
Refill queue region is a part of zcrx and should stay in struct
io_zcrx_ifq. We can't have multiple queues without it, so move it there.
As a result there is no context global zcrx region anymore, and the
region is looked up together with its ifq. To protect a concurrent mmap
from seeing an inconsistent region we were protecting changes to
->zcrx_region with mmap_lock, but now it protect the publishing of the
ifq.
Reviewed-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/24f1a728fc03d0166f16d099575457e10d9d90f2.1745141261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 20 Apr 2025 09:31:18 +0000 (10:31 +0100)]
io_uring/zcrx: let zcrx choose region for mmaping
In preparation for adding multiple ifqs, add a helper returning a region
for mmaping zcrx refill queue. For now it's trivial and returns the same
ctx global ->zcrx_region.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d9006e2ef8cd5e5b337c2ba2228ede8409eb60f2.1745141261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 20 Apr 2025 09:31:17 +0000 (10:31 +0100)]
io_uring/zcrx: remove sqe->file_index check
sqe->file_index and sqe->zcrx_ifq_idx are aliased, so even though
io_recvzc_prep() has all necessary handling and checking for
->zcrx_ifq_idx, it's already rejected a couple of lines above.
It's not a real problem, however, as we can only have one ifq at the
moment, which index is always zero, and it returns correct error
codes to the user.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b51acedd326d1a87bf53e33303e6fc87661a3e3f.1745141261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 20 Apr 2025 09:31:16 +0000 (10:31 +0100)]
io_uring/zcrx: move io_zcrx_iov_page
We'll need io_zcrx_iov_page at the top to keep offset calculations
closer together, move it there.
Reviewed-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/575617033a8b84a5985c7eb760b7121efdbe7e56.1745141261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 20 Apr 2025 09:31:15 +0000 (10:31 +0100)]
io_uring/zcrx: remove duplicated freelist init
Several lines below we already initialise the freelist, don't do it
twice.
Reviewed-by: David Wei <dw@davidwei.uk>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/71b07c4e00d1db65899d1ed8d603d961fe7d1c28.1745141261.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sat, 19 Apr 2025 17:47:06 +0000 (18:47 +0100)]
io_uring/rsrc: remove null check on import
WARN_ON_ONCE() checking imu for NULL in io_import_fixed() is in the hot
path and too protective, it's time to get rid of it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3782c01e311dbd474bca45aefaf79a2f2822fafb.1745083025.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sat, 19 Apr 2025 17:47:05 +0000 (18:47 +0100)]
io_uring/rsrc: clean up io_coalesce_buffer()
We don't need special handling for the first page in
io_coalesce_buffer(), move it inside the loop.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/ad698cddc1eadb3d92a7515e95bb13f79420323d.1745083025.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sat, 19 Apr 2025 17:47:04 +0000 (18:47 +0100)]
io_uring/rsrc: use unpin_user_folio
We want to have a full folio to be left pinned but with only one
reference, for that we "unpin" all but the first page with
unpin_user_pages(), which can be confusing. There is a new helper to
achieve that called unpin_user_folio(), so use that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/e0b2be8f9ea68f6b351ec3bb046f04f437f68491.1745083025.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 16 Apr 2025 19:25:03 +0000 (13:25 -0600)]
io_uring/rsrc: remove node assignment helpers
There are two helpers here, one assigns and increments the node ref
count, and the other is simply a wrapper around that for the buffer node
handling.
The buffer node assignment benefits from checking and setting
REQ_F_BUF_NODE together, otherwise stalls have been observed on setting
that flag later in the process. Hence re-do it so that it's set when
checked, and cleared in case of (unlikely) failure. With that, the
buffer node helper can go, and then drop the generic
io_req_assign_rsrc_node() helper as well as there's only a single user
of it left at that point.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 4 Apr 2025 20:50:59 +0000 (14:50 -0600)]
io_uring: add support for IORING_OP_PIPE
This works just like pipe2(2), except it also supports fixed file
descriptors. Used in a similar fashion as for other fd instantiating
opcodes (like accept, socket, open, etc), where sqe->file_slot is set
appropriately if two direct descriptors are desired rather than a set
of normal file descriptors.
sqe->addr must be set to a pointer to an array of 2 integers, which
is where the fixed/normal file descriptors are copied to.
sqe->pipe_flags contains flags, same as what is allowed for pipe2(2).
Future expansion of per-op private flags can go in sqe->ioprio,
like we do for other opcodes that take both a "syscall" flag set and
an io_uring opcode specific flag set.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 31 Mar 2025 16:18:02 +0000 (17:18 +0100)]
io_uring: don't store bgid in req->buf_index
Pass buffer group id into the rest of helpers via struct buf_sel_arg
and remove all reassignments of req->buf_index back to bgid. Now, it
only stores buffer indexes, and the group is provided by callers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3ea9fa08113ecb4d9224b943e7806e80a324bdf9.1743437358.git.asml.silence@gmail.com
Link: https://lore.kernel.org/io-uring/0c01d76ff12986c2f48614db8610caff8f78c869.1743500909.git.asml.silence@gmail.com/
[axboe: fold in patch from second link]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 31 Mar 2025 16:18:01 +0000 (17:18 +0100)]
io_uring/kbuf: pass bgid to io_buffer_select()
The current situation with buffer group id juggling is not ideal.
req->buf_index first stores the bgid, then it's overwritten by a buffer
id, and then it can get restored back no recycling / etc. It's not so
easy to control, and it's not handled consistently across request types
with receive requests saving and restoring the bgid it by hand.
It's a prep patch that adds a buffer group id argument to
io_buffer_select(). The caller will be responsible for stashing a copy
somewhere and passing it into the function.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a210d6427cc3f4f42271a6853274cd5a50e56820.1743437358.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 31 Mar 2025 16:18:00 +0000 (17:18 +0100)]
io_uring: set IMPORT_BUFFER in generic send setup
Move REQ_F_IMPORT_BUFFER to the common send setup. Currently, the only
user is send zc, but we'll want for normal sends to support that in the
future.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/18b74edfc61d8a1b6c9fed3b78a9276fe80f8ced.1743437358.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 31 Mar 2025 16:17:59 +0000 (17:17 +0100)]
io_uring/net: don't use io_do_buffer_select at prep
Prep code is interested whether it's a selected buffer request, not
whether a buffer has already been selected like what
io_do_buffer_select() returns. Check for REQ_F_BUFFER_SELECT directly.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4488a029ac698554bebf732263fe19d7734affa6.1743437358.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Sat, 29 Mar 2025 16:15:24 +0000 (10:15 -0600)]
io_uring/wq: avoid indirect do_work/free_work calls
struct io_wq stores do_work and free_work function pointers which are
called on each work item. But these function pointers are always set to
io_wq_submit_work and io_wq_free_work, respectively. So remove these
function pointers and just call the functions directly.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250329161527.3281314-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 20 Apr 2025 22:30:53 +0000 (15:30 -0700)]
gcc-15: disable '-Wunterminated-string-initialization' entirely for now
I had left the warning around but as a non-fatal error to get my gcc-15
builds going, but fixed up some of the most annoying warning cases so
that it wouldn't be *too* verbose.
Because I like the _concept_ of the warning, even if I detested the
implementation to shut it up.
It turns out the implementation to shut it up is even more broken than I
thought, and my "shut up most of the warnings" patch just caused fatal
errors on gcc-14 instead.
I had tested with clang, but when I upgrade my development environment,
I try to do it on all machines because I hate having different systems
to maintain, and hadn't realized that gcc-14 now had issues.
The ACPI case is literally why I wanted to have a *type* that doesn't
trigger the warning (see commit
d5d45a7f2619: "gcc-15: make
'unterminated string initialization' just a warning"), instead of
marking individual places as "__nonstring".
But gcc-14 doesn't like that __nonstring location that shut gcc-15 up,
because it's on an array of char arrays, not on one single array:
drivers/acpi/tables.c:399:1: error: 'nonstring' attribute ignored on objects of type 'const char[][4]' [-Werror=attributes]
399 | static const char table_sigs[][ACPI_NAMESEG_SIZE] __initconst __nonstring = {
| ^~~~~~
and my attempts to nest it properly with a type had failed, because of
how gcc doesn't like marking the types as having attributes, only
symbols.
There may be some trick to it, but I was already annoyed by the bad
attribute design, now I'm just entirely fed up with it.
I wish gcc had a proper way to say "this type is a *byte* array, not a
string".
The obvious thing would be to distinguish between "char []" and an
explicitly signed "unsigned char []" (as opposed to an implicitly
unsigned char, which is typically an architecture-specific default, but
for the kernel is universal thanks to '-funsigned-char').
But any "we can typedef a 8-bit type to not become a string just because
it's an array" model would be fine.
But "__attribute__((nonstring))" is sadly not that sane model.
Reported-by: Chris Clayton <chris2553@googlemail.com>
Fixes:
4b4bd8c50f48 ("gcc-15: acpi: sprinkle random '__nonstring' crumbles around")
Fixes:
d5d45a7f2619 ("gcc-15: make 'unterminated string initialization' just a warning")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Apr 2025 20:43:47 +0000 (13:43 -0700)]
Linux 6.15-rc3
Linus Torvalds [Sun, 20 Apr 2025 18:30:11 +0000 (11:30 -0700)]
gcc-15: work around sequence-point warning
The C sequence points are complicated things, and gcc-15 has apparently
added a warning for the case where an object is both used and modified
multiple times within the same sequence point.
That's a great warning.
Or rather, it would be a great warning, except gcc-15 seems to not
really be very exact about it, and doesn't notice that the modification
are to two entirely different members of the same object: the array
counter and the array entries.
So that seems kind of silly.
That said, the code that gcc complains about is unnecessarily
complicated, so moving the array counter update into a separate
statement seems like the most straightforward fix for these warnings:
drivers/net/wireless/intel/iwlwifi/mld/d3.c: In function ‘iwl_mld_set_netdetect_info’:
drivers/net/wireless/intel/iwlwifi/mld/d3.c:1102:66: error: operation on ‘netdetect_info->n_matches’ may be undefined [-Werror=sequence-point]
1102 | netdetect_info->matches[netdetect_info->n_matches++] = match;
| ~~~~~~~~~~~~~~~~~~~~~~~~~^~
drivers/net/wireless/intel/iwlwifi/mld/d3.c:1120:58: error: operation on ‘match->n_channels’ may be undefined [-Werror=sequence-point]
1120 | match->channels[match->n_channels++] =
| ~~~~~~~~~~~~~~~~~^~
side note: the code at that second warning is actively buggy, and only
works on little-endian machines that don't do strict alignment checks.
The code casts an array of integers into an array of unsigned long in
order to use our bitmap iterators. That happens to work fine on any
sane architecture, but it's still wrong.
This does *not* fix that more serious problem. This only splits the two
assignments into two statements and fixes the compiler warning. I need
to get rid of the new warnings in order to be able to actually do any
build testing.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Apr 2025 18:18:55 +0000 (11:18 -0700)]
gcc-15: add '__nonstring' markers to byte arrays
All of these cases are perfectly valid and good traditional C, but hit
by the "you're not NUL-terminating your byte array" warning.
And none of the cases want any terminating NUL character.
Mark them __nonstring to shut up gcc-15 (and in the case of the ak8974
magnetometer driver, I just removed the explicit array size and let gcc
expand the 3-byte and 6-byte arrays by one extra byte, because it was
the simpler change).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Apr 2025 18:04:00 +0000 (11:04 -0700)]
gcc-15: get rid of misc extra NUL character padding
This removes two cases of explicit NUL padding that now causes warnings
because of '-Wunterminated-string-initialization' being part of -Wextra
in gcc-15.
Gcc is being silly in this case when it says that it truncates a NUL
terminator, because in these cases there were _multiple_ NUL characters.
But we can get rid of the warning by just simplifying the two
initializers that trigger the warning for me, so this does exactly that.
I'm not sure why the power supply code did that odd
.attr_name = #_name "\0",
pattern: it was introduced in commit
2cabeaf15129 ("power: supply: core:
Cleanup power supply sysfs attribute list"), but that 'attr_name[]'
field is an explicitly sized character array in a statically initialized
variable, and a string initializer always has a terminating NUL _and_
statically initialized character arrays are zero-padded anyway, so it
really seems to be rather extraneous belt-and-suspenders.
The zero_uuid[16] initialization in drivers/md/bcache/super.c makes
perfect sense, but it isn't necessary for the same reasons, and not
worth the new gcc warning noise.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Apr 2025 18:02:18 +0000 (11:02 -0700)]
gcc-15: acpi: sprinkle random '__nonstring' crumbles around
This is not great: I'd much rather introduce a typedef that is a "ACPI
name byte buffer", and use that to mark these special 4-byte ACPI names
that do not use NUL termination.
But as noted in the previous commit ("gcc-15: make 'unterminated string
initialization' just a warning") gcc doesn't actually seem to support
that notion, so instead you have to just mark every single array
declaration individually.
So this is not pretty, but this gets rid of the bulk of the annoying
warnings during an allmodconfig build for me.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Apr 2025 17:33:23 +0000 (10:33 -0700)]
gcc-15: make 'unterminated string initialization' just a warning
gcc-15 enabling -Wunterminated-string-initialization in -Wextra by
default was done with the best intentions, but the warning is still
quite broken.
What annoys me about the warning is that this is a very traditional AND
CORRECT way to initialize fixed byte arrays in C:
unsigned char hex[16] = "
0123456789abcdef";
and we use this all over the kernel. And the warning is fine, but gcc
developers apparently never made a reasonable way to disable it. As is
(sadly) tradition with these things.
Yes, there's "__attribute__((nonstring))", and we have a macro to make
that absolutely disgusting syntax more palatable (ie the kernel syntax
for that monstrosity is just "__nonstring").
But that attribute is misdesigned. What you'd typically want to do is
tell the compiler that you are using a type that isn't a string but a
byte array, but that doesn't work at all:
warning: ‘nonstring’ attribute does not apply to types [-Wattributes]
and because of this fundamental mis-design, you then have to mark each
instance of that pattern.
This is particularly noticeable in our ACPI code, because ACPI has this
notion of a 4-byte "type name" that gets used all over, and is exactly
this kind of byte array.
This is a sad oversight, because the warning is useful, but really would
be so much better if gcc had also given a sane way to indicate that we
really just want a byte array type at a type level, not the broken "each
and every array definition" level.
So now instead of creating a nice "ACPI name" type using something like
typedef char acpi_name_t[4] __nonstring;
we have to do things like
char name[ACPI_NAMESEG_SIZE] __nonstring;
in every place that uses this concept and then happens to have the
typical initializers.
This is annoying me mainly because I think the warning _is_ a good
warning, which is why I'm not just turning it off in disgust. But it is
hampered by this bad implementation detail.
[ And obviously I'm doing this now because system upgrades for me are
something that happen in the middle of the release cycle: don't do it
before or during travel, or just before or during the busy merge
window period. ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 20 Apr 2025 04:46:58 +0000 (21:46 -0700)]
Merge tag 'mm-hotfixes-stable-2025-04-19-21-24' of git://git./linux/kernel/git/akpm/mm
Pull misc hotfixes from Andrew Morton:
"16 hotfixes. 2 are cc:stable and the remainder address post-6.14
issues or aren't considered necessary for -stable kernels.
All patches are basically for MM although five are alterations to
MAINTAINERS"
[ Basic counting skills are clearly not a strictly necessary requirement
for kernel maintainers. - Linus ]
* tag 'mm-hotfixes-stable-2025-04-19-21-24' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
MAINTAINERS: add section for locking of mm's and VMAs
mm: vmscan: fix kswapd exit condition in defrag_mode
mm: vmscan: restore high-cpu watermark safety in kswapd
MAINTAINERS: add Pedro as reviewer to the MEMORY MAPPING section
mm/memory: move sanity checks in do_wp_page() after mapcount vs. refcount stabilization
mm, hugetlb: increment the number of pages to be reset on HVO
writeback: fix false warning in inode_to_wb()
docs: ABI: replace mcroce@microsoft.com with new Meta address
mm/gup: fix wrongly calculated returned value in fault_in_safe_writeable()
MAINTAINERS: add memory advice section
MAINTAINERS: add mmap trace events to MEMORY MAPPING
mm: memcontrol: fix swap counter leak from offline cgroup
MAINTAINERS: add MM subsection for the page allocator
MAINTAINERS: update SLAB ALLOCATOR maintainers
fs/dax: fix folio splitting issue by resetting old folio order + _nr_pages
mm/page_alloc: fix deadlock on cpu_hotplug_lock in __accept_page()
Linus Torvalds [Sat, 19 Apr 2025 21:31:08 +0000 (14:31 -0700)]
Merge tag 'vfs-6.15-rc3.fixes.2' of git://git./linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Revert the hfs{plus} deprecation warning that's also included in this
pull request. The commit introducing the deprecation warning resides
rather early in this branch. So simply dropping it would've rebased
all other commits which I decided to avoid. Hence the revert in the
same branch
[ Background - the deprecation warning discussion resulted in people
stepping up, and so hfs{plus} will have a maintainer taking care of
it after all.. - Linus ]
- Switch CONFIG_SYSFS_SYCALL default to n and decouple from
CONFIG_EXPERT
- Fix an audit bug caused by changes to our kernel path lookup helpers
this cycle. Audit needs the parent path even if the dentry it tried
to look up is negative
- Ensure that the kernel path lookup helpers leave the passed in path
argument clean when they return an error. This is consistent with all
our other helpers
- Ensure that vfs_getattr_nosec() calls bdev_statx() so the relevant
information is available to kernel consumers as well
- Don't set a timer and call schedule() if the timer will expire
immediately in epoll
- Make netfs lookup tables with __nonstring
* tag 'vfs-6.15-rc3.fixes.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
Revert "hfs{plus}: add deprecation warning"
fs: move the bdex_statx call to vfs_getattr_nosec
netfs: Mark __nonstring lookup tables
eventpoll: Set epoll timeout if it's in the future
fs: ensure that *path_locked*() helpers leave passed path pristine
fs: add kern_path_locked_negative()
hfs{plus}: add deprecation warning
Kconfig: switch CONFIG_SYSFS_SYCALL default to n
Linus Torvalds [Sat, 19 Apr 2025 20:59:04 +0000 (13:59 -0700)]
Merge tag 'i2c-for-6.15-rc3' of git://git./linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
- Address translator: fix wrong include
- ChromeOS EC tunnel: fix potential NULL pointer dereference
* tag 'i2c-for-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: atr: Fix wrong include
i2c: cros-ec-tunnel: defer probe if parent EC is not present
Christian Brauner [Sat, 19 Apr 2025 20:48:59 +0000 (22:48 +0200)]
Revert "hfs{plus}: add deprecation warning"
This reverts commit
ddee68c499f76ae47c011549df5be53db0057402.
There's ongoing discussion about better maintenance of at least hfsplus.
Rever the deprecation warning for now.
Signed-off-by: Christian Brauner <brauner@kernel.org>
Linus Torvalds [Sat, 19 Apr 2025 18:57:36 +0000 (11:57 -0700)]
Merge tag 'trace-v6.15-rc2' of git://git./linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Initialize hash variables in ftrace subops logic
The fix that simplified the ftrace subops logic opened a path where
some variables could be used without being initialized, and done
subtly where the compiler did not catch it. Initialize those
variables to the EMPTY_HASH, which is the default hash.
- Reinitialize the hash pointers after they are freed
Some of the hash pointers in the subop logic were freed but may still
be referenced later. To prevent use-after-free bugs, initialize them
back to the EMPTY_HASH.
- Free the ftrace hashes when they are replaced
The fix that simplified the subops logic updated some hash pointers,
but left the original hash that they were pointing to where they are
no longer used. This caused a memory leak. Free the hashes that are
pointed to by the pointers when they are replaced.
- Fix size initialization of ftrace direct function hash
The ftrace direct function hash used by BPF initialized the hash size
incorrectly. It checked the size of items to a hard coded 32, which
made the hash bit size of 5. The hash size is supposed to be limited
by the bit size of the hash, as the bitmask is allowed to be greater
than 5. Rework the size check to first pass the number of elements to
fls() and then compare that to FTRACE_HASH_MAX_BITS before allocating
the hash.
- Fix format output of ftrace_graph_ent_entry event
The field depth of the ftrace_graph_ent_entry event is of size 4 but
the output showed it as unsigned long and use "%lu". Change it to
unsigned int and use "%u" in the print format that is displayed to
user space.
- Fix the trace event filter on strings
Events can be filtered on numbers or string values. The return value
checked from strncpy_from_kernel_nofault() and
strncpy_from_user_nofault() was used to determine if reading the
strings would fault or not. It would return fault if the value was
non zero, which is basically meant that it was always considering the
read as a fault.
- Add selftest to test trace event string filtering
In order to catch the breakage of the string filtering, add a self
test to make sure that it continues to work.
* tag 'trace-v6.15-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
tracing: selftests: Add testing a user string to filters
tracing: Fix filter string testing
ftrace: Fix type of ftrace_graph_ent_entry.depth
ftrace: fix incorrect hash size in register_ftrace_direct()
ftrace: Free ftrace hashes after they are replaced in the subops code
ftrace: Reinitialize hash to EMPTY_HASH after freeing
ftrace: Initialize variables for ftrace_startup/shutdown_subops()
Linus Torvalds [Sat, 19 Apr 2025 17:38:03 +0000 (10:38 -0700)]
Merge tag 'nfsd-6.15-1' of git://git./linux/kernel/git/cel/linux
Pull nfsd fixes from Chuck Lever:
- v6.15 libcrc clean-up makes invalid configurations possible
- Fix a potential deadlock introduced during the v6.15 merge window
* tag 'nfsd-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
nfsd: decrease sc_count directly if fail to queue dl_recall
nfs: add missing selections of CONFIG_CRC32
Linus Torvalds [Sat, 19 Apr 2025 17:02:43 +0000 (10:02 -0700)]
Merge tag 'rust-fixes-6.15' of git://git./linux/kernel/git/ojeda/linux
Pull rust fixes from Miguel Ojeda:
"Toolchain and infrastructure:
- Fix missing KASAN LLVM flags on first build (and fix spurious
rebuilds) by skipping '--target'
- Fix Make < 4.3 build error by using '$(pound)'
- Fix UML build error by removing 'volatile' qualifier from io
helpers
- Fix UML build error by adding 'dma_{alloc,free}_attrs()' helpers
- Clean gendwarfksyms warnings by avoiding to export '__pfx' symbols
- Clean objtool warning by adding a new 'noreturn' function for
1.86.0
- Disable 'needless_continue' Clippy lint due to new 1.86.0 warnings
- Add missing 'ffi' crate to 'generate_rust_analyzer.py'
'pin-init' crate:
- Import a couple fixes from upstream"
* tag 'rust-fixes-6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ojeda/linux:
rust: helpers: Add dma_alloc_attrs() and dma_free_attrs()
rust: helpers: Remove volatile qualifier from io helpers
rust: kbuild: use `pound` to support GNU Make < 4.3
objtool/rust: add one more `noreturn` Rust function for Rust 1.86.0
rust: kasan/kbuild: fix missing flags on first build
rust: disable `clippy::needless_continue`
rust: kbuild: Don't export __pfx symbols
rust: pin-init: use Markdown autolinks in Rust comments
rust: pin-init: alloc: restrict `impl ZeroableOption` for `Box` to `T: Sized`
scripts: generate_rust_analyzer: Add ffi crate
Linus Torvalds [Sat, 19 Apr 2025 16:31:21 +0000 (09:31 -0700)]
Merge tag 'drm-fixes-2025-04-19' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Easter rc3 pull request, fixes in all the usuals, amdgpu, xe, msm,
with some i915/ivpu/mgag200/v3d fixes, then a couple of bits in
dma-buf/gem.
Hopefully has no easter eggs in it.
dma-buf:
- Correctly decrement refcounter on errors
gem:
- Fix test for imported buffers
amdgpu:
- Cleaner shader sysfs fix
- Suspend fix
- Fix doorbell free ordering
- Video caps fix
- DML2 memory allocation optimization
- HDP fix
i915:
- Fix DP DSC configurations that require 3 DSC engines per pipe
xe:
- Fix LRC address being written too late for GuC
- Fix notifier vs folio deadlock
- Fix race betwen dma_buf unmap and vram eviction
- Fix debugfs handling PXP terminations unconditionally
msm:
- Display:
- Fix to call dpu_plane_atomic_check_pipe() for both SSPPs in
case of multi-rect
- Fix to validate plane_state pointer before using it in
dpu_plane_virtual_atomic_check()
- Fix to make sure dereferencing dpu_encoder_phys happens after
making sure it is valid in _dpu_encoder_trigger_start()
- Remove the remaining intr_tear_rd_ptr which we initialized to
-1 because NO_IRQ indices start from 0 now
- GPU:
- Fix IB_SIZE overflow
ivpu:
- Fix debugging
- Fixes to frequency
- Support firmware API 3.28.3
- Flush jobs upon reset
mgag200:
- Set vblank start to correct values
v3d:
- Fix Indirect Dispatch"
* tag 'drm-fixes-2025-04-19' of https://gitlab.freedesktop.org/drm/kernel: (26 commits)
drm/msm/a6xx+: Don't let IB_SIZE overflow
drm/xe/pxp: do not queue unneeded terminations from debugfs
drm/xe/dma_buf: stop relying on placement in unmap
drm/xe/userptr: fix notifier vs folio deadlock
drm/xe: Set LRC addresses before guc load
drm/mgag200: Fix value in <VBLKSTR> register
drm/gem: Internally test import_attach for imported objects
drm/amdgpu: Use the right function for hdp flush
drm/amd/display/dml2: use vzalloc rather than kzalloc
drm/amdgpu: Add back JPEG to video caps for carrizo and newer
drm/amdgpu: fix warning of drm_mm_clean
drm/amd: Forbid suspending into non-default suspend states
drm/amdgpu: use a dummy owner for sysfs triggered cleaner shaders v4
drm/i915/dp: Check for HAS_DSC_3ENGINES while configuring DSC slices
drm/i915/display: Add macro for checking 3 DSC engines
dma-buf/sw_sync: Decrement refcount on error in sw_sync_ioctl_get_deadline()
accel/ivpu: Add cmdq_id to job related logs
accel/ivpu: Show NPU frequency in sysfs
accel/ivpu: Fix the NPU's DPU frequency calculation
accel/ivpu: Update FW Boot API to version 3.28.3
...
Dave Airlie [Sat, 19 Apr 2025 05:09:29 +0000 (15:09 +1000)]
Merge tag 'drm-msm-fixes-2025-04-18' of https://gitlab.freedesktop.org/drm/msm into drm-fixes
Fixes for v6.15-rc3
Display:
- Fix to call dpu_plane_atomic_check_pipe() for both SSPPs in
case of multi-rect
- Fix to validate plane_state pointer before using it in
dpu_plane_virtual_atomic_check()
- Fix to make sure dereferencing dpu_encoder_phys happens after
making sure it is valid in _dpu_encoder_trigger_start()
- Remove the remaining intr_tear_rd_ptr which we initialized
to -1 because NO_IRQ indices start from 0 now
GPU:
- Fix IB_SIZE overflow
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rob Clark <robdclark@gmail.com>
Link: https://lore.kernel.org/r/CAF6AEGtVKXEVdzUzFWmQE8JmK3nx_hp+ynOd-5j3vnfcU-sgOA@mail.gmail.com
Dave Airlie [Sat, 19 Apr 2025 04:59:47 +0000 (14:59 +1000)]
Merge tag 'drm-xe-fixes-2025-04-18' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
Driver Changes:
- Fix LRC address being written too late for GuC
- Fix notifier vs folio deadlock
- Fix race betwen dma_buf unmap and vram eviction
- Fix debugfs handling PXP terminations unconditionally
Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/ndinq644zenywaaycxyfqqivsb2xer4z7err3dlpalbz33jfkm@ttabzsg6wnet
Linus Torvalds [Sat, 19 Apr 2025 03:10:42 +0000 (20:10 -0700)]
Merge tag '6.15-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Fix hard link lease key problem when close is deferred
- Revert the socket lockdep/refcount workarounds done in cifs.ko now
that it is fixed at the socket layer
* tag '6.15-rc2-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
Revert "smb: client: fix TCP timers deadlock after rmmod"
Revert "smb: client: Fix netns refcount imbalance causing leaks and use-after-free"
smb3 client: fix open hardlink on deferred close file error
Rob Clark [Mon, 17 Mar 2025 15:00:06 +0000 (08:00 -0700)]
drm/msm/a6xx+: Don't let IB_SIZE overflow
IB_SIZE is only b0..b19. Starting with a6xx gen3, additional fields
were added above the IB_SIZE. Accidentially setting them can cause
badness. Fix this by properly defining the CP_INDIRECT_BUFFER packet
and using the generated builder macro to ensure unintended bits are not
set.
v2: add missing type attribute for IB_BASE
v3: fix offset attribute in xml
Reported-by: Connor Abbott <cwabbott0@gmail.com>
Fixes:
a83366ef19ea ("drm/msm/a6xx: add A640/A650 to gpulist")
Signed-off-by: Rob Clark <robdclark@chromium.org>
Patchwork: https://patchwork.freedesktop.org/patch/643396/
Wolfram Sang [Fri, 18 Apr 2025 21:42:56 +0000 (23:42 +0200)]
Merge tag 'i2c-host-fixes-6.15-rc3' of git://git./linux/kernel/git/andi.shyti/linux into i2c/for-current
i2c-host-fixes for v6.15-rc3
- ChromeOS EC tunnel: fix potential NULL pointer dereference
Linus Torvalds [Fri, 18 Apr 2025 21:04:57 +0000 (14:04 -0700)]
Merge tag 'x86-urgent-2025-04-18' of git://git./linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix hypercall detection on Xen guests
- Extend the AMD microcode loader SHA check to Zen5, to block loading
of any unreleased standalone Zen5 microcode patches
- Add new Intel CPU model number for Bartlett Lake
- Fix the workaround for AMD erratum 1054
- Fix buggy early memory acceptance between SEV-SNP guests and the EFI
stub
* tag 'x86-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/boot/sev: Avoid shared GHCB page for early memory acceptance
x86/cpu/amd: Fix workaround for erratum 1054
x86/cpu: Add CPU model number for Bartlett Lake CPUs with Raptor Cove cores
x86/microcode/AMD: Extend the SHA check to Zen5, block loading of any unreleased standalone Zen5 microcode patches
x86/xen: Fix __xen_hypercall_setfunc()
Linus Torvalds [Fri, 18 Apr 2025 21:02:45 +0000 (14:02 -0700)]
Merge tag 'timers-urgent-2025-04-18' of git://git./linux/kernel/git/tip/tip
Pull timer fix from Ingo Molnar:
"Fix a lockdep false positive in the i8253 driver"
* tag 'timers-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/i8253: Call clockevent_i8253_disable() with interrupts disabled
Linus Torvalds [Fri, 18 Apr 2025 20:35:13 +0000 (13:35 -0700)]
Merge tag 'perf-urgent-2025-04-18' of git://git./linux/kernel/git/tip/tip
Pull x86 perf event fixes from Ingo Molnar:
"Miscellaneous fixes and a hardware-enabling change:
- Fix Intel uncore PMU IIO free running counters on SPR, ICX and SNR
systems
- Fix Intel PEBS buffer overflow handling
- Fix skid in Intel PEBS sampling of user-space general purpose
registers
- Enable Panther Lake PMU support - similar to Lunar Lake"
* tag 'perf-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Add Panther Lake support
perf/x86/intel: Allow to update user space GPRs from PEBS records
perf/x86/intel: Don't clear perf metrics overflow bit unconditionally
perf/x86/intel/uncore: Fix the scale of IIO free running counters on SPR
perf/x86/intel/uncore: Fix the scale of IIO free running counters on ICX
perf/x86/intel/uncore: Fix the scale of IIO free running counters on SNR
Linus Torvalds [Fri, 18 Apr 2025 20:28:41 +0000 (13:28 -0700)]
Merge tag 'irq-urgent-2025-04-18' of git://git./linux/kernel/git/tip/tip
Pull misc irq fixes from Ingo Molnar:
- Fix BCM2712 irqchip driver Kconfig dependencies required on the
Raspberry PI5
- Fix spurious interrupts on RZ/G3E SMARC EVK systems
- Fix crash regression on Sun/NIU hardware
- Apply MSI driver quirk for Sun Neptune chips
* tag 'irq-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/irq-bcm2712-mip: Enable driver when ARCH_BCM2835 is enabled
irqchip/renesas-rzv2h: Prevent TINT spurious interrupt
net/niu: Niu requires MSIX ENTRY_DATA fields touch before entry reads
PCI/MSI: Add an option to write MSIX ENTRY_DATA before any reads
Linus Torvalds [Fri, 18 Apr 2025 20:25:33 +0000 (13:25 -0700)]
Merge tag 'core-urgent-2025-04-18' of git://git./linux/kernel/git/tip/tip
Pull misc core fixes from Ingo Molnar:
"Fix a genksyms related bug, triggered by recent changes to the percpu
code, and update the .clang-format file to not include obsolete
function names"
* tag 'core-urgent-2025-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genksyms: Handle typeof_unqual keyword and __seg_{fs,gs} qualifiers
clang-format: Update the ForEachMacros list for v6.15-rc1
Linus Torvalds [Fri, 18 Apr 2025 20:20:20 +0000 (13:20 -0700)]
Merge tag 'hardening-v6.15-rc3' of git://git./linux/kernel/git/kees/linux
Pull hardening fixes from Kees Cook:
- lib/prime_numbers: KUnit test should not select PRIME_NUMBERS (Geert
Uytterhoeven)
- ubsan: Fix panic from test_ubsan_out_of_bounds (Mostafa Saleh)
- ubsan: Remove 'default UBSAN' from UBSAN_INTEGER_WRAP (Nathan
Chancellor)
- string: Add load_unaligned_zeropad() code path to sized_strscpy()
(Peter Collingbourne)
- kasan: Add strscpy() test to trigger tag fault on arm64 (Vincenzo
Frascino)
- Disable GCC randstruct for COMPILE_TEST
* tag 'hardening-v6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
lib/prime_numbers: KUnit test should not select PRIME_NUMBERS
ubsan: Fix panic from test_ubsan_out_of_bounds
lib/Kconfig.ubsan: Remove 'default UBSAN' from UBSAN_INTEGER_WRAP
hardening: Disable GCC randstruct for COMPILE_TEST
kasan: Add strscpy() test to trigger tag fault on arm64
string: Add load_unaligned_zeropad() code path to sized_strscpy()
Linus Torvalds [Fri, 18 Apr 2025 20:18:01 +0000 (13:18 -0700)]
Merge tag 'gpio-fixes-for-v6.15-rc3' of git://git./linux/kernel/git/brgl/linux
Pull gpio fix from Bartosz Golaszewski:
- check for both the new AND old (deprecated) setter callback when
changing GPIO direction to output
* tag 'gpio-fixes-for-v6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpiolib: Allow to use setters with return value for output-only gpios
Linus Torvalds [Fri, 18 Apr 2025 20:09:20 +0000 (13:09 -0700)]
Merge tag 'thermal-6.15-rc3' of git://git./linux/kernel/git/rafael/linux-pm
Pull thermal control fixes from Rafael Wysocki:
"Add missing DVFS support flags for the Lunar Lake and Panther Lake
platforms to the int340x Intel thermal driver and fix DLVR support
for Panther Lake in it (Srinivas Pandruvada)"
* tag 'thermal-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
thermal: intel: int340x: Fix Panther Lake DLVR support
thermal: intel: int340x: Add missing DVFS support flags
Linus Torvalds [Fri, 18 Apr 2025 20:06:12 +0000 (13:06 -0700)]
Merge tag 'pm-6.15-rc3' of git://git./linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These are mostly cpufreq fixes, some of which address recent
regressions and some address older issues that have come to light
during the last two weeks, and a runtime PM documentation correction:
- Fix the performance-to-frequency scaling factor computation on
systems using HWP in the intel_pstate driver after a recent
incorrect update of it (Rafael Wysocki)
- Fix the usage of the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag
in the schedutil cpufreq governor after a recent update of it that
has caused frequency limits changes to be missed sometimes (Rafael
Wysocki)
- Address some recently discovered synchronization issues related to
frequency limits changes in the schedutil cpufreq governor and in
the cpufreq core (Rafael Wysocki)
- Fix ITMT support in the amd-pstate cpufreq driver so that it is
enabled after asym priorities have been correctly initialized for
all CPUs (K Prateek Nayak)
- Fix changing min/max limits in the amd-pstate cpufreq driver while
on the performance governor (Dhananjay Ugwekar)
- Fix a function name in the runtime PM documentation that was
previously incorrectly updated by mistake (Sakari Ailus)"
* tag 'pm-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpufreq: Avoid using inconsistent policy->min and policy->max
cpufreq/sched: Set need_freq_update in ignore_dl_rate_limit()
cpufreq/sched: Explicitly synchronize limits_changed flag handling
cpufreq/sched: Fix the usage of CPUFREQ_NEED_UPDATE_LIMITS
Documentation: PM: runtime: Fix a reference to pm_runtime_autosuspend()
cpufreq: intel_pstate: Fix hwp_get_cpu_scaling()
cpufreq/amd-pstate: Enable ITMT support after initializing core rankings
cpufreq/amd-pstate: Fix min_limit perf and freq updation for performance governor
Rafael J. Wysocki [Fri, 18 Apr 2025 18:55:48 +0000 (20:55 +0200)]
Merge branch 'pm-docs'
Merge a runtime PM documentation correction for 6.15-rc3.
* pm-docs:
Documentation: PM: runtime: Fix a reference to pm_runtime_autosuspend()
Linus Torvalds [Fri, 18 Apr 2025 18:46:44 +0000 (11:46 -0700)]
Merge tag 'riscv-for-linus-6.15-rc3' of git://git./linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- A fix for an issue where C instructions ended up in non-C builds, due
to some broken inline assembly in the KGDB breakpoint insertion code
- A fix to avoid spurious printk messages about misaligned access
performance probing
- A fix for a handful of issues with /proc/iomem's reserved region
handling
- A pair of fixes for module relocation processing
- A few build-time fixes
* tag 'riscv-for-linus-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: KGDB: Remove ".option norvc/.option rvc" for kgdb_compiled_break
riscv: KGDB: Do not inline arch_kgdb_breakpoint()
riscv: Avoid fortify warning in syscall_get_arguments()
riscv: Provide all alternative macros all the time
riscv: module: Allocate PLT entries for R_RISCV_PLT32
riscv: module: Fix out-of-bounds relocation access
riscv: Properly export reserved regions in /proc/iomem
riscv: Fix unaligned access info messages
riscv: Avoid fortify warning in syscall_get_arguments()
Documentation: riscv: Fix typo MIMPLID -> MIMPID
riscv: Use kvmalloc_array on relocation_hashtable
Linus Torvalds [Fri, 18 Apr 2025 18:35:11 +0000 (11:35 -0700)]
Merge tag 'linux_kselftest-kunit-fixes-6.15-rc3' of git://git./linux/kernel/git/shuah/linux-kselftest
Pull kunit fix from Shuah Khan:
"Fixes arch sh kunit qemu_configs script sh.py to honor kunit cmdline"
* tag 'linux_kselftest-kunit-fixes-6.15-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: qemu_configs: SH: Respect kunit cmdline