Caleb Sander Mateos [Tue, 4 Mar 2025 19:48:12 +0000 (12:48 -0700)]
io_uring: introduce io_cache_free() helper
Add a helper function io_cache_free() that returns an allocation to an
io_alloc_cache, falling back on kfree() if the io_alloc_cache is full.
This is the inverse of io_cache_alloc(), which takes an allocation from
an io_alloc_cache and falls back on kmalloc() if the cache is empty.
Convert 4 callers to use the helper.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Suggested-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250304194814.2346705-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
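For readers unfamiliar with the pattern, the alloc/free pairing described in the io_cache_free() commit above can be modelled in userspace roughly as follows. This is only a sketch: struct toy_cache, TOY_CACHE_MAX and the toy_cache_*() functions are hypothetical stand-ins, not the kernel's io_alloc_cache API, and malloc()/free() take the place of kmalloc()/kfree().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical userspace model; not the kernel's struct io_alloc_cache. */
#define TOY_CACHE_MAX 2

struct toy_cache {
	void *entries[TOY_CACHE_MAX];	/* stack of recycled objects */
	unsigned int nr_cached;		/* number of valid entries */
	size_t elem_size;		/* size used for fallback allocations */
};

static void toy_cache_init(struct toy_cache *c, size_t elem_size)
{
	assert(c != NULL);
	c->nr_cached = 0;
	c->elem_size = elem_size;
}

/* Take an object from the cache, falling back on malloc() when it is empty. */
static void *toy_cache_alloc(struct toy_cache *c)
{
	if (c->nr_cached)
		return c->entries[--c->nr_cached];
	return malloc(c->elem_size);
}

/* Try to stash an object in the cache; returns false when the cache is full. */
static bool toy_cache_put(struct toy_cache *c, void *obj)
{
	if (c->nr_cached >= TOY_CACHE_MAX)
		return false;
	c->entries[c->nr_cached++] = obj;
	return true;
}

/* Inverse of toy_cache_alloc(): recycle the object, or free() it when full. */
static void toy_cache_free(struct toy_cache *c, void *obj)
{
	if (!toy_cache_put(c, obj))
		free(obj);
}
```

With a helper of this shape, each caller shrinks from an open-coded "put, else free" pair to a single call, which is presumably what the conversion of the 4 callers amounts to.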
Caleb Sander Mateos [Fri, 28 Feb 2025 23:59:14 +0000 (16:59 -0700)]
io_uring/rsrc: skip NULL file/buffer checks in io_free_rsrc_node()
io_rsrc_nodes of type IORING_RSRC_FILE always have a file attached
immediately after they are allocated. IORING_RSRC_BUFFER nodes won't be
returned from io_sqe_buffer_register()/io_buffer_register_bvec() until
they have an io_mapped_ubuf attached.
So remove the checks for a NULL file/buffer in io_free_rsrc_node().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-5-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Fri, 28 Feb 2025 23:59:13 +0000 (16:59 -0700)]
io_uring/rsrc: avoid NULL node check on io_sqe_buffer_register() failure
The done: label is only reachable if node is non-NULL. So don't bother
checking, just call io_free_node().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-4-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Fri, 28 Feb 2025 23:59:12 +0000 (16:59 -0700)]
io_uring/rsrc: call io_free_node() on io_sqe_buffer_register() failure
io_sqe_buffer_register() currently calls io_put_rsrc_node() if it fails
to fully set up the io_rsrc_node. io_put_rsrc_node() is more involved
than necessary, since we already know the reference count will reach 0
and no io_mapped_ubuf has been attached to the node yet.
So just call io_free_node() to release the node's memory. This also
avoids the need to temporarily set the node's buf pointer to NULL.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-3-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Fri, 28 Feb 2025 23:59:11 +0000 (16:59 -0700)]
io_uring/rsrc: free io_rsrc_node using kfree()
io_rsrc_node_alloc() calls io_cache_alloc(), which uses kmalloc() to
allocate the node. So it can be freed with kfree() instead of kvfree().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Fri, 28 Feb 2025 23:59:10 +0000 (16:59 -0700)]
io_uring/rsrc: split out io_free_node() helper
Split the freeing of the io_rsrc_node from io_free_rsrc_node(), for use
with nodes that haven't been fully initialized.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228235916.670437-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Sat, 1 Mar 2025 18:36:11 +0000 (11:36 -0700)]
io_uring/rsrc: include io_uring_types.h in rsrc.h
io_uring/rsrc.h uses several types from include/linux/io_uring_types.h.
Include io_uring_types.h explicitly in rsrc.h to avoid depending on
users of rsrc.h including io_uring_types.h first.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250301183612.937529-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Sat, 1 Mar 2025 19:03:16 +0000 (12:03 -0700)]
ublk: don't cast registered buffer index to int
io_buffer_register_bvec() takes index as an unsigned int argument, but
ublk_register_io_buf() casts ub_cmd->addr (a u64) to int. Remove the
misleading cast and instead pass index as an unsigned value to
ublk_register_io_buf() and ublk_unregister_io_buf().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250301190317.950208-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Sat, 1 Mar 2025 00:16:08 +0000 (17:16 -0700)]
io_uring/nop: use io_find_buf_node()
Call io_find_buf_node() to avoid duplicating it in io_nop().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250301001610.678223-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Sat, 1 Mar 2025 00:16:07 +0000 (17:16 -0700)]
io_uring/rsrc: declare io_find_buf_node() in header file
Declare io_find_buf_node() in io_uring/rsrc.h so it can be called from
other files.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250301001610.678223-1-csander@purestorage.com
[axboe: keep the inline for local hot path usage]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Fri, 28 Feb 2025 23:14:31 +0000 (16:14 -0700)]
io_uring/ublk: report error when unregister operation fails
Indicate to userspace applications if a UBLK_IO_UNREGISTER_IO_BUF
command specifies an invalid buffer index by returning an error code.
Return -EINVAL if no buffer is registered with the given index, and
-EBUSY if the registered buffer is not a kernel bvec.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228231432.642417-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
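The error reporting described in the unregister commit above can be pictured with a small userspace model. This is a sketch under stated assumptions: struct toy_buf, struct toy_buf_table and toy_unregister() are hypothetical names, and treating an out-of-range index as -EINVAL is an assumption not spelled out in the commit message.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_SLOTS 4

/* Hypothetical stand-in for a registered buffer. */
struct toy_buf {
	bool is_kbuf;	/* true for a kernel-registered bvec */
};

/* Hypothetical stand-in for a ring's buffer table. */
struct toy_buf_table {
	struct toy_buf *slots[TOY_NR_SLOTS];	/* NULL: nothing registered */
};

/* Returns 0 on success, or a negative errno describing why it failed. */
static int toy_unregister(struct toy_buf_table *t, unsigned int index)
{
	struct toy_buf *buf;

	assert(t != NULL);
	if (index >= TOY_NR_SLOTS)
		return -EINVAL;	/* assumption: out of range is also invalid */

	buf = t->slots[index];
	if (!buf)
		return -EINVAL;	/* no buffer registered at this index */
	if (!buf->is_kbuf)
		return -EBUSY;	/* registered, but not a kernel bvec */

	t->slots[index] = NULL;
	return 0;
}
```

Distinguishing the two failures lets an application tell a stale or wrong index apart from an index that is occupied by an ordinary user-registered buffer.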
Caleb Sander Mateos [Fri, 28 Feb 2025 23:03:04 +0000 (16:03 -0700)]
io_uring: convert cmd_to_io_kiocb() macro to function
The cmd_to_io_kiocb() macro applies a pointer cast to its input without
parenthesizing it. Currently all inputs are variable names, so this has
the intended effect. But since casts have relatively high precedence,
the macro would apply the cast to the wrong value if the input was a
pointer addition, for example.
Turn the macro into a static inline function to ensure the pointer cast
is applied to the full input value.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228230305.630885-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
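The precedence problem described in the cmd_to_io_kiocb() commit above can be reproduced in isolation. The sketch below uses hypothetical toy types and names (struct toy_cmd, struct toy_req, TOY_CMD_TO_REQ_MACRO, toy_cmd_to_req) rather than the real io_uring structures; only the shape of the macro and of the replacement function is meant to match the description.

```c
#include <assert.h>
#include <stddef.h>

/* Two hypothetical types of different sizes (16 and 64 bytes). */
struct toy_cmd { char data[16]; };
struct toy_req { char data[64]; };

/* Cast applied to an unparenthesized argument, as in the old macro. */
#define TOY_CMD_TO_REQ_MACRO(ptr) ((struct toy_req *) ptr)

/* Function form: the argument is fully evaluated before conversion. */
static inline struct toy_req *toy_cmd_to_req(void *ptr)
{
	return (struct toy_req *)ptr;
}

/*
 * Expands to ((struct toy_req *) cmds + 1): the cast binds to cmds first,
 * so the "+ 1" advances by sizeof(struct toy_req).
 */
static ptrdiff_t toy_macro_offset(struct toy_cmd *cmds)
{
	return (char *)TOY_CMD_TO_REQ_MACRO(cmds + 1) - (char *)cmds;
}

/* Here cmds + 1 advances by sizeof(struct toy_cmd), as the caller meant. */
static ptrdiff_t toy_func_offset(struct toy_cmd *cmds)
{
	return (char *)toy_cmd_to_req(cmds + 1) - (char *)cmds;
}
```

Called on an array of at least four struct toy_cmd, the macro version lands 64 bytes past the start while the function version lands 16 bytes past it. Parenthesizing the macro argument would also fix this; the function form additionally gives the compiler a real parameter to check.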
Caleb Sander Mateos [Fri, 28 Feb 2025 22:15:13 +0000 (15:15 -0700)]
io_uring/uring_cmd: specify io_uring_cmd_import_fixed() pointer type
io_uring_cmd_import_fixed() takes a struct io_uring_cmd *, but the type
of the ioucmd parameter is void *. Make the pointer type explicit so the
compiler can type check it.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228221514.604350-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Fri, 28 Feb 2025 22:30:56 +0000 (15:30 -0700)]
io_uring/rsrc: use rq_data_dir() to compute bvec dir
The macro rq_data_dir() already computes a request's data direction.
Use it in place of the if-else to set imu->dir.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250228223057.615284-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
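As an illustration of the rq_data_dir() change above, the following sketch contrasts an if-else with a direct computation from the data direction. Everything here is hypothetical: the toy_* names, the READ = 0 / WRITE = 1 encoding, and the idea that the direction field is a bitmask of the form 1 << direction are assumptions made for the example, not details taken from the commit message.

```c
#include <assert.h>

/* Assumed encoding of the data direction. */
enum toy_dir { TOY_READ = 0, TOY_WRITE = 1 };

/* Assumed bitmask values for the buffer direction field. */
#define TOY_IMU_DEST	(1 << TOY_READ)
#define TOY_IMU_SOURCE	(1 << TOY_WRITE)

struct toy_request {
	int is_write;
};

/* Stand-in for a helper that already computes the request's direction. */
static enum toy_dir toy_data_dir(const struct toy_request *rq)
{
	assert(rq != 0);
	return rq->is_write ? TOY_WRITE : TOY_READ;
}

/* Open-coded variant: branch on the request type. */
static int toy_dir_branchy(const struct toy_request *rq)
{
	if (rq->is_write)
		return TOY_IMU_SOURCE;
	else
		return TOY_IMU_DEST;
}

/* Variant that reuses the existing direction helper. */
static int toy_dir_direct(const struct toy_request *rq)
{
	return 1 << toy_data_dir(rq);
}
```

Under these assumptions both variants yield the same value for reads and writes, so the branch only duplicates what the direction helper already knows.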
Ming Lei [Fri, 28 Feb 2025 16:19:16 +0000 (00:19 +0800)]
selftests: ublk: add ublk zero copy test
Enable zero copy on the file-backed target, and add one fio test
covering write verify and another test for mkfs/mount/umount.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250228161919.2869102-4-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 28 Feb 2025 16:19:15 +0000 (00:19 +0800)]
selftests: ublk: add file backed ublk
Add file-backed ublk target code, and add one fio test covering
write verify and another test for mkfs/mount/umount.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250228161919.2869102-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Ming Lei [Fri, 28 Feb 2025 16:19:14 +0000 (00:19 +0800)]
selftests: ublk: add kernel selftests for ublk
Both the ublk driver and its userspace heavily depend on the io_uring
subsystem, and tools/testing/selftests/ should be the best place for
holding these cross-subsystem tests.
Add a basic read/write IO test over a ublk null disk, and make sure ublk
is working.
More tests will be added.
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250228161919.2869102-2-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Thu, 27 Feb 2025 22:39:16 +0000 (14:39 -0800)]
io_uring: cache nodes and mapped buffers
Frequent alloc/free cycles on these are pretty costly. Use an io cache to
more efficiently reuse these buffers.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-7-kbusch@meta.com
[axboe: fix imu leak]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Thu, 27 Feb 2025 22:39:15 +0000 (14:39 -0800)]
ublk: zc register/unregister bvec
Provide new operations for the user to request mapping an active request
to an io_uring instance's buf_table. The user has to provide the index
at which it wants to install the buffer.
A reference count is taken on the request to ensure it can't be
completed while it is active in a ring's buf_table.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-6-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Thu, 27 Feb 2025 22:39:14 +0000 (14:39 -0800)]
io_uring: add support for kernel registered bvecs
Provide an interface for the kernel to leverage the existing
pre-registered buffers that io_uring provides. User space can reference
these later to achieve zero-copy IO.
User space must register an empty fixed buffer table with io_uring in
order for the kernel to make use of it.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-5-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Xinyu Zhang [Thu, 27 Feb 2025 22:39:13 +0000 (14:39 -0800)]
nvme: map uring_cmd data even if address is 0
When using kernel registered bvec fixed buffers, the "address" is
actually the offset into the bvec rather than userspace address.
Therefore it can be 0.
We can skip checking whether the address is NULL before mapping
uring_cmd data. Bad userspace address will be handled properly later when
the user buffer is imported.
With this patch, we will be able to use the kernel registered bvec fixed
buffers in io_uring NVMe passthru with ublk zero-copy support.
Reviewed-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Xinyu Zhang <xizhang@purestorage.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Thu, 27 Feb 2025 22:39:12 +0000 (14:39 -0800)]
io_uring/rw: move fixed buffer import to issue path
Registered buffers may depend on a linked command, which makes the prep
path too early to import. Move to the issue path when the node is
actually needed like all the other users of fixed buffers.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-3-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Thu, 27 Feb 2025 22:39:11 +0000 (14:39 -0800)]
io_uring/rw: move buffer_select outside generic prep
Cleans up the generic rw prep to not require the do_import flag. Use a
different prep function for callers that might need buffer select.
Based-on-a-patch-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20250227223916.143006-2-kbusch@meta.com
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Arnd Bergmann [Thu, 27 Feb 2025 13:20:09 +0000 (14:20 +0100)]
io_uring/net: fix build warning for !CONFIG_COMPAT
A code rework resulted in an uninitialized return code when COMPAT
mode is disabled:
io_uring/net.c:722:6: error: variable 'ret' is used uninitialized whenever 'if' condition is true [-Werror,-Wsometimes-uninitialized]
722 | if (io_is_compat(req->ctx)) {
| ^~~~~~~~~~~~~~~~~~~~~~
io_uring/net.c:736:15: note: uninitialized use occurs here
736 | if (unlikely(ret))
| ^~~
Since io_is_compat() turns into a compile-time 'false', the #ifdef
here is completely unnecessary, and removing it avoids the warning.
Fixes: 51e158d40589 ("io_uring/net: unify *mshot_prep calls with compat")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20250227132018.1111094-1-arnd@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 26 Feb 2025 20:46:34 +0000 (20:46 +0000)]
io_uring: rearrange opdef flags by use pattern
Keep all flags that we use in the generic req init path close together.
That saves a load for x86 because apparently some compilers prefer
reading single bytes.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ef03b6ce4a0c2a5234cd4037fa07e9e4902dcc9e.1740602793.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 26 Feb 2025 11:41:21 +0000 (11:41 +0000)]
io_uring/net: extract iovec import into a helper
Deduplicate iovec imports between compat and !compat by introducing a
helper function.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6a5f8c526f6732c4249a7fa0213b49e1a3ecccf0.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 26 Feb 2025 11:41:20 +0000 (11:41 +0000)]
io_uring/net: unify *mshot_prep calls with compat
Instead of duplicating an io_recvmsg_mshot_prep() call in the compat
path, let the common code handle it. For that, copy necessary compat
fields into struct user_msghdr. Note, it zeroes user_msghdr to be on the
safe side as compat is not that interesting and overhead shouldn't be
high.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/94e62386dec570f83b4a4270a46ac60bc415fb71.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 26 Feb 2025 11:41:19 +0000 (11:41 +0000)]
io_uring/net: derive iovec storage later
Don't read free_iov until right before we need it to import the iovec.
The only place that uses it before that is provided buffer selection,
but it only serves as temporary storage and iovec content is not reused
afterwards, so use a local variable for that.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8bfa7d74c33e37860a724f4e0e96660c25cd4c02.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 26 Feb 2025 11:41:18 +0000 (11:41 +0000)]
io_uring/net: verify msghdr before copying iovec
Normally, net/ would verify the msghdr before importing the iovec; for
example, see copy_msghdr_from_user(), which first relies on
__copy_msghdr() to validate msg->msg_iovlen.
io_uring does it in reverse order, which is fine, but it'll be more
convenient to flip it so that the iovec business is done at the end and
can eventually be nicely pulled out of the msghdr parsing section and
thought of as a separate step. That also makes structure accesses more
localised, which should be better for caches.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cd35dc1b48d4e6e31f59ae7304c037fbe8a3fd3d.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 26 Feb 2025 11:41:17 +0000 (11:41 +0000)]
io_uring/net: isolate msghdr copying code
The user access section in io_msg_copy_hdr() is overextended by covering
selected buffers. It's hard to work with and prone to errors. Limit the
section to msghdr import only, selected buffers will do a separate
copy_from_user() call, and then move it into its own function. This
should be fine, selected buffer single shots are not important, for
multishots the overhead should be non-existent, and it's not that
expensive overall.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/d3eb1f81c8cfbea9f1aa57dab90c472d2aa6e371.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 26 Feb 2025 11:41:16 +0000 (11:41 +0000)]
io_uring/net: simplify compat selbuf iov parsing
Use copy_from_user() instead of open coded access_ok() + get_user(),
that's simpler and we don't care about compat that much.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e51f9c323a3cd4ad7c8da656559bdf6237f052fb.1740569495.git.asml.silence@gmail.com
[axboe: fold in bogus < 0 check for tmp_iov.iov_len]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 26 Feb 2025 11:41:15 +0000 (11:41 +0000)]
io_uring/net: remove unnecessary REQ_F_NEED_CLEANUP
The REQ_F_NEED_CLEANUP assignments in io_recvmsg_prep_setup() and
io_sendmsg_setup() are relics of the past and don't do anything useful;
the flag should be, and is, set earlier on iovec and async_data allocation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/6aedc3141c1fc027128a4503656cfd686a6980ef.1740569495.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 27 Feb 2025 14:18:01 +0000 (07:18 -0700)]
Merge branch 'io_uring-6.14' into for-6.15/io_uring
Merge mainline fixes into 6.15 branch, as upcoming patches depend on
fixes that went into the 6.14 mainline branch.
* io_uring-6.14:
io_uring/net: save msg_control for compat
io_uring/rw: clean up mshot forced sync mode
io_uring/rw: move ki_complete init into prep
io_uring/rw: don't directly use ki_complete
io_uring/rw: forbid multishot async reads
io_uring/rsrc: remove unused constants
io_uring: fix spelling error in uapi io_uring.h
io_uring: prevent opcode speculation
io-wq: backoff when retrying worker creation
Pavel Begunkov [Mon, 24 Feb 2025 21:31:10 +0000 (13:31 -0800)]
io_uring: combine buffer lookup and import
Registered buffers are currently imported in two steps: first we look up
a rsrc node and then use it to set up the iterator. The first part is
usually done at the prep stage, and import happens whenever it's needed.
As we want to defer binding to a node so that it works with linked
requests, combine both steps into a single helper.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-6-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 21:31:09 +0000 (13:31 -0800)]
io_uring/nvme: pass issue_flags to io_uring_cmd_import_fixed()
io_uring_cmd_import_fixed() will need to know the io_uring execution
state in following commits; for now just pass issue_flags into it
without actually using it.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-5-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 21:31:08 +0000 (13:31 -0800)]
io_uring/net: reuse req->buf_index for sendzc
There is already a field in io_kiocb that can store a registered buffer
index, use that instead of stashing the value into struct io_sr_msg.
Reviewed-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-4-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Mon, 24 Feb 2025 21:31:07 +0000 (13:31 -0800)]
io_uring/nop: reuse req->buf_index
There is already a field in io_kiocb that can store a registered buffer
index, use that instead of stashing the value into struct io_nop.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-3-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Keith Busch [Mon, 24 Feb 2025 21:31:06 +0000 (13:31 -0800)]
io_uring/rsrc: remove redundant check for valid imu
The only caller to io_buffer_unmap already checks if the node's buf is
not null, so no need to check again.
Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250224213116.3509093-2-kbusch@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 19:45:06 +0000 (19:45 +0000)]
io_uring/rw: open code io_prep_rw_setup()
Open code io_prep_rw_setup() into its only caller, it doesn't provide
any meaningful abstraction anymore.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/61ba72e2d46119db71f27ab908018e6a6cd6c064.1740425922.git.asml.silence@gmail.com
[axboe: fold in 'ret' being unused fix]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Tue, 25 Feb 2025 15:59:02 +0000 (15:59 +0000)]
io_uring/net: save msg_control for compat
Match the compat part of io_sendmsg_copy_hdr() with its counterpart and
save msg_control.
Fixes: c55978024d123 ("io_uring/net: move receive multishot out of the generic msghdr path")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/2a8418821fe83d3b64350ad2b3c0303e9b732bbd.1740498502.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 19:45:05 +0000 (19:45 +0000)]
io_uring/rw: extract helper for iovec import
Split a helper out of __io_import_rw_buffer() that handles vectored
buffers. I'll need it for registered vectored buffers, but it also looks
cleaner, especially with parameters being properly named.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/075470cfb24be38709d946815f35ec846d966f41.1740425922.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 19:45:04 +0000 (19:45 +0000)]
io_uring/rw: rename io_import_iovec()
io_import_iovec() is not limited to iovecs but also imports buffers for
normal reads and selected buffers, rename it for clarity.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/91cea59340b61a8f52dc7b8e720274577a25188c.1740425922.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 19:45:03 +0000 (19:45 +0000)]
io_uring/rw: allocate async data in io_prep_rw()
rw always allocates async_data, so instead of doing that deeper in prep
calls inside of io_prep_rw_setup(), be a bit more explicit and do that
early on in io_prep_rw().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/5ead621051bc3374d1e8d96f816454906a6afd71.1740425922.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 23 Feb 2025 17:22:31 +0000 (17:22 +0000)]
io_uring: make io_poll_issue() sturdier
io_poll_issue() forwards the call to io_issue_sqe() and thus inherits
some of the handling. That's not particularly failure resistant, as for
example returning an innocent-looking IOU_OK from a multishot issue
will lead to severe bugs.
Reimplement io_poll_issue() without io_issue_sqe()'s request completion
logic. Remove extra checks as we know that req->file is already set,
linked timeouts are armed, and iopoll is not supported. Also cover it
with warnings for now.
The patch should be useful by itself, but it's also preparing the
codebase for other future clean ups.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3096d7b1026d9a52426a598bdfc8d9d324555545.1740331076.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 23 Feb 2025 17:22:30 +0000 (17:22 +0000)]
io_uring/net: canonise accept mshot handling
Use a more recognisable pattern for mshot accept: first try to post an
mshot cqe if needed, and afterwards do the terminating handling.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/daf5c0df7e2966deb0a115021c065fc6161a52d7.1740331076.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Sun, 23 Feb 2025 17:22:29 +0000 (17:22 +0000)]
io_uring/net: fix accept multishot handling
REQ_F_APOLL_MULTISHOT doesn't guarantee it's executed from the multishot
context, so a multishot accept may get executed inline, fail
io_req_post_cqe(), and ask the core code to kill the request with
-ECANCELED by returning IOU_STOP_MULTISHOT even when a socket has been
accepted and installed.
Cc: stable@vger.kernel.org
Fixes: 390ed29b5e425 ("io_uring: add IORING_ACCEPT_MULTISHOT for accept")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/51c6deb01feaa78b08565ca8f24843c017f5bc80.1740331076.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 12:42:24 +0000 (12:42 +0000)]
io_uring/net: use io_is_compat()
Use io_is_compat() for consistency.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/fff93d9d08243284c5db5d546be766a82e85c130.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 12:42:23 +0000 (12:42 +0000)]
io_uring/waitid: use io_is_compat()
Use io_is_compat() for consistency.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/28c5b5f1f1bf7f4d18869dafe6e4147ce1bbf0f5.1740400452.git.asml.silence@gmail.com
Link: https://lore.kernel.org/r/20250224172337.2009871-1-csander@purestorage.com
[axboe: fold in improvement from Caleb, see link]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 12:42:22 +0000 (12:42 +0000)]
io_uring/rw: shrink io_iov_compat_buffer_select_prep
Compat performance is not important and simplicity is more appreciated.
Let's not be smart about it and use a simpler copy_from_user() instead of
the access_ok() + __get_user() pair.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b334a3a5040efa424ded58e4d8a6ef2554324266.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 12:42:21 +0000 (12:42 +0000)]
io_uring/rw: compile out compat param passing
Even when COMPAT is compiled out, we still have to pass
ctx->compat to __import_iovec(). Replace the read through an indirection
with a constant when the kernel doesn't support compat.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/2819df9c8533c36b46d7baccbb317a0ec89da6cd.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 12:42:20 +0000 (12:42 +0000)]
io_uring/cmd: optimise !CONFIG_COMPAT flags setting
Use io_is_compat() to avoid extra overhead in io_uring_cmd() for flag
setting when compat is compiled out.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/f4d74c62d7cbddc386c0a9138ecd2b2ed6d3f146.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Mon, 24 Feb 2025 12:42:19 +0000 (12:42 +0000)]
io_uring: introduce io_is_compat()
A preparation patch adding a simple helper for gauging the compat state.
It'll help us to optimise and compile out more code in the following
commits.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Anuj Gupta <anuj20.g@samsung.com>
Link: https://lore.kernel.org/r/1a87a640265196a67bc38300128e0bfd7839ab1f.1740400452.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
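To show why a helper like the one introduced above lets the compiler drop compat-only code, here is a minimal userspace sketch. TOY_CONFIG_COMPAT, struct toy_ctx, toy_is_compat() and toy_word_bits() are hypothetical names chosen for the example; the real helper's definition is not quoted in this log.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical build-time switch standing in for the compat config option. */
#ifndef TOY_CONFIG_COMPAT
#define TOY_CONFIG_COMPAT 0
#endif

struct toy_ctx {
	bool compat;	/* runtime state, only meaningful with compat support */
};

/*
 * When TOY_CONFIG_COMPAT is 0 this folds to a constant false, so callers'
 * compat branches become dead code that the compiler can remove.
 */
static inline bool toy_is_compat(const struct toy_ctx *ctx)
{
	assert(ctx != 0);
	return TOY_CONFIG_COMPAT && ctx->compat;
}

/* Example caller: pick a word size depending on the compat state. */
static int toy_word_bits(const struct toy_ctx *ctx)
{
	if (toy_is_compat(ctx))
		return 32;
	return 64;
}
```

Because both branches stay visible to the compiler regardless of the switch, callers need no #ifdef of their own, which matches the reasoning in the !CONFIG_COMPAT build fix earlier in this log.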
Pavel Begunkov [Wed, 19 Feb 2025 01:33:40 +0000 (01:33 +0000)]
io_uring/rw: clean up mshot forced sync mode
Move code forcing synchronous execution of multishot read requests out
of the more generic __io_read().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4ad7b928c776d1ad59addb9fff64ef2d1fc474d5.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 19 Feb 2025 01:33:39 +0000 (01:33 +0000)]
io_uring/rw: move ki_complete init into prep
Initialise ki_complete during request prep stage, we'll depend on it not
being reset during issue in the following patch.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/817624086bd5f0448b08c80623399919fda82f34.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 19 Feb 2025 01:33:38 +0000 (01:33 +0000)]
io_uring/rw: don't directly use ki_complete
We want to avoid checking ->ki_complete directly in the io_uring
completion path. Fortunately we have only two callbacks, the selection
of which depends on the constant ring flags, i.e. IOPOLL, so use that
to infer the function.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4eb4bdab8cbcf5bc87083f7047edc81e920ab83c.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 19 Feb 2025 01:33:37 +0000 (01:33 +0000)]
io_uring/rw: forbid multishot async reads
At the moment we can't sanely handle queuing an async request from a
multishot context, so disable them. It shouldn't matter as pollable
files / sockets don't normally do async.
Patching it in __io_read() is not the cleanest way, but it's simpler
than other options, so let's fix it there and clean up on top.
Cc: stable@vger.kernel.org
Reported-by: chase xd <sl1589472800@gmail.com>
Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7d51732c125159d17db4fe16f51ec41b936973f8.1739919038.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Wed, 19 Feb 2025 03:34:43 +0000 (20:34 -0700)]
io_uring/rsrc: remove unused constants
IO_NODE_ALLOC_CACHE_MAX has been unused since commit fbbb8e991d86
("io_uring/rsrc: get rid of io_rsrc_node allocation cache") removed the
rsrc_node_cache.
IO_RSRC_TAG_TABLE_SHIFT and IO_RSRC_TAG_TABLE_MASK have been unused
since commit 7029acd8a950 ("io_uring/rsrc: get rid of per-ring
io_rsrc_node list") removed the separate tag table for registered nodes.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250219033444.2020136-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Tue, 18 Feb 2025 23:47:40 +0000 (16:47 -0700)]
io_uring: fix spelling error in uapi io_uring.h
This is obviously not that important, but when changes are synced back
from the kernel to liburing, the codespell CI ends up erroring because
of this misspelling. Let's just correct it and avoid this biting us
again on an import.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Wed, 12 Feb 2025 00:51:18 +0000 (17:51 -0700)]
io_uring: use lockless_cq flag in io_req_complete_post()
io_uring_create() computes ctx->lockless_cq as:
ctx->task_complete || (ctx->flags & IORING_SETUP_IOPOLL)
So use it to simplify that expression in io_req_complete_post().
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Reviewed-by: Li Zetao <lizetao1@huawei.com>
Link: https://lore.kernel.org/r/20250212005119.3433005-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
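A minimal sketch of the simplification described in the lockless_cq
commit above. The names demo_ctx and DEMO_SETUP_IOPOLL are hypothetical
stand-ins for struct io_ring_ctx and IORING_SETUP_IOPOLL, not kernel
code. The sketch only shows that a flag computed once at setup can
replace the repeated expression at the point of use.

```c
#include <stdbool.h>

/* Hypothetical stand-in for IORING_SETUP_IOPOLL. */
#define DEMO_SETUP_IOPOLL	(1U << 0)

/* Hypothetical, heavily reduced stand-in for struct io_ring_ctx. */
struct demo_ctx {
	unsigned int flags;
	bool task_complete;
	bool lockless_cq;
};

/* Setup computes lockless_cq once, as quoted in the commit message. */
static void demo_ctx_init(struct demo_ctx *ctx, unsigned int flags,
			  bool task_complete)
{
	ctx->flags = flags;
	ctx->task_complete = task_complete;
	ctx->lockless_cq = ctx->task_complete ||
			   (ctx->flags & DEMO_SETUP_IOPOLL);
}

/* Before: the completion path re-derives the condition. */
static bool demo_lockless_old(const struct demo_ctx *ctx)
{
	return ctx->task_complete || (ctx->flags & DEMO_SETUP_IOPOLL);
}

/* After: the completion path reads the precomputed flag. */
static bool demo_lockless_new(const struct demo_ctx *ctx)
{
	return ctx->lockless_cq;
}
```

Both helpers agree for every combination of inputs as long as flags and
task_complete do not change after demo_ctx_init().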
Caleb Sander Mateos [Mon, 17 Feb 2025 02:25:05 +0000 (19:25 -0700)]
io_uring: pass struct io_tw_state by value
8e5b3b89ecaf ("io_uring: remove struct io_tw_state::locked") removed the
only field of io_tw_state but kept it as a task work callback argument
to "forc[e] users not to invoke them carelessly out of a wrong context".
Passing the struct io_tw_state * argument adds a few instructions to all
callers that can't inline the functions and see the argument is unused.
So pass struct io_tw_state by value instead. Since it's a 0-sized value,
it can be passed without any instructions needed to initialize it.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250217022511.1150145-2-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
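A rough sketch of the by-value idea from the io_tw_state commit above,
assuming GCC or Clang compiling as GNU C, where an empty struct is
accepted as an extension and has size 0. demo_tw_state, demo_tw_token_t
and the two functions are hypothetical names for illustration, not the
kernel's definitions.

```c
/*
 * Hypothetical stand-in for struct io_tw_state. The empty struct is a
 * GNU C extension; it carries no data and occupies no storage.
 */
struct demo_tw_state {
};

/* Alias in the spirit of io_tw_token_t, here naming the by-value form. */
typedef struct demo_tw_state demo_tw_token_t;

/*
 * A task-work style callback. The token is only a marker that the
 * caller runs in the right context, so it is never read.
 */
static void demo_task_work(int *completed, demo_tw_token_t tw)
{
	(void)tw;
	(*completed)++;
}

/* Only the task-work runner would construct and hand out the token. */
static int demo_run_task_work(int nr)
{
	demo_tw_token_t tw = {};
	int completed = 0;

	for (int i = 0; i < nr; i++)
		demo_task_work(&completed, tw);
	return completed;
}
```

Because the token has no bytes, passing it by value gives the caller
nothing to initialize, whereas a pointer argument must still be
materialized in a register even when the callee ignores it.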
Caleb Sander Mateos [Mon, 17 Feb 2025 02:25:04 +0000 (19:25 -0700)]
io_uring: introduce type alias for io_tw_state
In preparation for changing how io_tw_state is passed, introduce a type
alias io_tw_token_t for struct io_tw_state *. This allows for changing
the representation in one place, without having to update the many
functions that just forward their struct io_tw_state * argument.
Also add a comment to struct io_tw_state to explain its purpose.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250217022511.1150145-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Sun, 16 Feb 2025 22:58:59 +0000 (15:58 -0700)]
io_uring/rsrc: avoid NULL check in io_put_rsrc_node()
Most callers of io_put_rsrc_node() already check that node is non-NULL:
- io_rsrc_data_free()
- io_sqe_buffer_register()
- io_reset_rsrc_node()
- io_req_put_rsrc_nodes() (REQ_F_BUF_NODE indicates non-NULL buf_node)
Only io_splice_cleanup() can call io_put_rsrc_node() with a NULL node.
So move the NULL check there.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250216225900.1075446-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
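The io_put_rsrc_node() change above can be pictured with the sketch
below. All names (demo_node, demo_put_node, demo_splice_cleanup) are
hypothetical, and the "freed" field merely stands in for the real free
path. The point is only where the NULL check lives.

```c
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted resource node. */
struct demo_node {
	int refs;
	int freed;
};

/* After the change: the helper assumes node is non-NULL. */
static void demo_put_node(struct demo_node *node)
{
	if (!--node->refs)
		node->freed = 1;	/* stands in for freeing the node */
}

/* The one caller that may hold no node performs the check itself. */
static void demo_splice_cleanup(struct demo_node *maybe_node)
{
	if (maybe_node)
		demo_put_node(maybe_node);
}
```

Callers that already know the node is valid skip a branch, and the
single caller that cannot know pays for the check locally.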
Caleb Sander Mateos [Wed, 12 Feb 2025 16:48:05 +0000 (09:48 -0700)]
io_uring: pass ctx instead of req to io_init_req_drain()
io_init_req_drain() takes a struct io_kiocb *req argument but only uses
it to get struct io_ring_ctx *ctx. The caller already knows the ctx, so
pass it instead.
Drop "req" from the function name since it operates on the ctx rather
than a specific req.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250212164807.3681036-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Caleb Sander Mateos [Tue, 11 Feb 2025 20:19:56 +0000 (13:19 -0700)]
io_uring: use IO_REQ_LINK_FLAGS more
Replace the 2 instances of REQ_F_LINK | REQ_F_HARDLINK with
the more commonly used IO_REQ_LINK_FLAGS.
Signed-off-by: Caleb Sander Mateos <csander@purestorage.com>
Link: https://lore.kernel.org/r/20250211202002.3316324-1-csander@purestorage.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 8 Feb 2025 17:50:34 +0000 (10:50 -0700)]
io_uring/net: improve recv bundles
Current recv bundles are only supported for multishot receives, and
additionally they also always post at least 2 CQEs if more data is
available than what a buffer will hold. This happens because the initial
bundle recv will do a single buffer, and then do the rest of what is in
the socket as a followup receive. As shown in a test program, if 1k
buffers are available and 32k is available to receive in the socket,
you'd get the following completions:
bundle=1, mshot=0
cqe res 1024
cqe res 1024
[...]
cqe res 1024
bundle=1, mshot=1
cqe res 1024
cqe res 31744
where bundle=1 && mshot=0 will post 32 1k completions, and bundle=1 &&
mshot=1 will post a 1k completion and then a 31k completion.
To support bundle recv without multishot, it's possible to simply retry
the recv immediately and post a single completion, rather than split it
into two completions. With the below patch, the same test looks as
follows:
bundle=1, mshot=0
cqe res 32768
bundle=1, mshot=1
cqe res 32768
where mshot=0 works fine for bundles, and both of them post just a
single 32k completion rather than split it into separate completions.
Posting fewer completions is always a nice win, and not needing
multishot for proper bundle efficiency is nice for cases that can't
necessarily use multishot.
Reported-by: Norman Maurer <norman_maurer@apple.com>
Link: https://lore.kernel.org/r/184f9f92-a682-4205-a15d-89e18f664502@kernel.dk
Fixes: 2f9c9515bdfd ("io_uring/net: support bundles for recv")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
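The completion sizes quoted in the recv bundle commit above follow from
simple arithmetic, modelled below. This is only an illustrative model
of the numbers in the commit message, not the kernel's receive path. It
assumes enough provided buffers to hold everything queued in the
socket, and all function names are hypothetical.

```c
#include <stddef.h>

static size_t demo_min(size_t a, size_t b)
{
	return a < b ? a : b;
}

/*
 * Old multishot bundle behaviour as described: the first recv fills a
 * single buffer and the remainder is posted as a follow-up completion.
 * Stores the CQE results in res[] and returns how many were posted.
 */
static size_t demo_old_mshot_bundle(size_t queued, size_t buf_len,
				    size_t res[2])
{
	size_t first = demo_min(queued, buf_len);
	size_t n = 0;

	res[n++] = first;
	if (queued > first)
		res[n++] = queued - first;
	return n;
}

/*
 * New behaviour as described: retry the recv immediately and post one
 * completion covering both attempts. Returns that single CQE result.
 */
static size_t demo_new_bundle(size_t queued, size_t buf_len)
{
	size_t done = demo_min(queued, buf_len);	/* first attempt */

	if (done < queued)
		done += queued - done;	/* immediate retry gets the rest */
	return done;
}
```

With 1k buffers and 32k queued, the old model yields 1024 followed by
31744, and the new model yields a single 32768, matching the test
output shown in the commit message.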
Jens Axboe [Wed, 5 Feb 2025 20:16:29 +0000 (13:16 -0700)]
io_uring/waitid: use generic io_cancel_remove() helper
Don't implement our own loop rolling and checking, just use the generic
helper to find and cancel requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 5 Feb 2025 20:15:57 +0000 (13:15 -0700)]
io_uring/futex: use generic io_cancel_remove() helper
Don't implement our own loop rolling and checking, just use the generic
helper to find and cancel requests.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 5 Feb 2025 20:13:58 +0000 (13:13 -0700)]
io_uring/cancel: add generic cancel helper
Any opcode that is cancelable ends up defining its own cancel helper
for finding and canceling a specific request. Add a generic helper that
can be used for this purpose.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 5 Feb 2025 19:52:46 +0000 (12:52 -0700)]
io_uring/waitid: convert to io_cancel_remove_all()
Use the generic helper for cancelations.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 5 Feb 2025 19:51:26 +0000 (12:51 -0700)]
io_uring/futex: convert to io_cancel_remove_all()
Use the generic helper for cancelations.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Wed, 5 Feb 2025 19:48:56 +0000 (12:48 -0700)]
io_uring/cancel: add generic remove_all helper
Any opcode that is cancelable ends up defining its own remove all
helper, which iterates the pending list and cancels matches. Add a
generic helper for it, which can be used by them.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
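As a sketch of what a generic remove-all helper of this kind might look
like: one list walk shared by every opcode, with the opcode-specific
step supplied as a callback. The types, the singly linked list and the
function names are hypothetical simplifications, not the signature of
io_cancel_remove_all().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pending request. */
struct demo_req {
	int owner;		/* stand-in for the owning task */
	bool canceled;
	struct demo_req *next;
};

/* Opcode-specific step; returns true if the request was canceled. */
typedef bool (*demo_cancel_fn)(struct demo_req *req);

/*
 * Walk the pending list, then unlink and cancel every matching
 * request. Returns true if at least one request was canceled.
 */
static bool demo_cancel_remove_all(struct demo_req **list, int owner,
				   bool cancel_all, demo_cancel_fn cancel)
{
	bool found = false;

	while (*list) {
		struct demo_req *req = *list;

		if ((cancel_all || req->owner == owner) && cancel(req)) {
			*list = req->next;	/* unlink from the list */
			req->next = NULL;
			found = true;
		} else {
			list = &req->next;	/* keep, move on */
		}
	}
	return found;
}

/* Example callback an opcode such as futex or waitid might supply. */
static bool demo_opcode_cancel(struct demo_req *req)
{
	req->canceled = true;
	return true;
}
```

Each cancelable opcode then only provides its callback and pending
list, instead of repeating the iteration and matching logic.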
Pavel Begunkov [Wed, 5 Feb 2025 11:36:49 +0000 (11:36 +0000)]
io_uring/kbuf: uninline __io_put_kbufs
__io_put_kbufs() and other helper functions are too large to be inlined,
compilers would normally refuse to do so. Uninline it and move together
with io_kbuf_commit into kbuf.c.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3dade7f55ad590e811aff83b1ec55c9c04e17b2b.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 5 Feb 2025 11:36:48 +0000 (11:36 +0000)]
io_uring/kbuf: introduce io_kbuf_drop_legacy()
io_kbuf_drop() is only used for legacy provided buffers, and so
__io_put_kbuf_list() is never called for REQ_F_BUFFER_RING. Remove the
dead branch out of __io_put_kbuf_list(), rename it into
io_kbuf_drop_legacy() and use it directly instead of io_kbuf_drop().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c8cc73e2272f09a86ecbdad9ebdd8304f8e583c0.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 5 Feb 2025 11:36:47 +0000 (11:36 +0000)]
io_uring/kbuf: open code __io_put_kbuf()
__io_put_kbuf() is a trivial wrapper, open code it into
__io_put_kbufs().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/9dc17380272b48d56c95992c6f9eaacd5546e1d3.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 5 Feb 2025 11:36:46 +0000 (11:36 +0000)]
io_uring/kbuf: remove legacy kbuf caching
Remove all struct io_buffer caches. It makes it a fair bit simpler.
Apart from killing a bunch of lines and juggling between lists,
__io_put_kbuf_list() doesn't need ->completion_lock locking now.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/18287217466ee2576ea0b1e72daccf7b22c7e856.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 5 Feb 2025 11:36:45 +0000 (11:36 +0000)]
io_uring/kbuf: simplify __io_put_kbuf
As a preparation step remove an optimisation from __io_put_kbuf() trying
to use the locked cache. With that __io_put_kbuf_list() is only used
with ->io_buffers_comp, and we remove the explicit list argument.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1b7f1394ec4afc7f96b35a61f5992e27c49fd067.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 5 Feb 2025 11:36:44 +0000 (11:36 +0000)]
io_uring/kbuf: move locking into io_kbuf_drop()
Move the burden of locking out of the caller into io_kbuf_drop(), that
will help with further refactoring.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/530f0cf1f06963029399f819a9a58b1a34bebef3.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 5 Feb 2025 11:36:43 +0000 (11:36 +0000)]
io_uring/kbuf: remove legacy kbuf kmem cache
Remove the kmem cache used by legacy provided buffers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8195c207d8524d94e972c0c82de99282289f7f5c.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 5 Feb 2025 11:36:42 +0000 (11:36 +0000)]
io_uring/kbuf: remove legacy kbuf bulk allocation
Legacy provided buffers are slow and discouraged in favour of the ring
variant. Remove the bulk allocation to keep it simpler as we don't care
about performance.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/a064d70370e590efed8076e9501ae4cfc20fe0ca.1738724373.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 31 Jan 2025 17:31:03 +0000 (17:31 +0000)]
io_uring: sanitise ring params earlier
Do all struct io_uring_params validation early on before allocating the
context. That makes initialisation easier, especially by having fewer
places where we need to care about partial de-initialisation.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/363ba90b83ff78eefdc88b60e1b2c4a39d182247.1738344646.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 31 Jan 2025 17:28:21 +0000 (17:28 +0000)]
io_uring: check for iowq alloc_workqueue failure
alloc_workqueue() can fail even during init in io_uring_init(), check
the result and panic if anything went wrong.
Fixes: 73eaa2b583493 ("io_uring: use private workqueue for exit work")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/3a046063902f888f66151f89fa42f84063b9727b.1738343083.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 31 Jan 2025 17:27:02 +0000 (17:27 +0000)]
io_uring: deduplicate caches deallocation
Add a function that frees all ring caches since we already have two
spots repeating the same thing and it's easy to miss it and change only
one of them.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b6b0125677c58bdff99eda91ab320137406e8562.1738342562.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Max Kellermann [Tue, 28 Jan 2025 13:39:25 +0000 (14:39 +0100)]
io_uring/io-wq: pass io_wq to io_get_next_work()
The only caller has already determined this pointer, so let's skip
the redundant dereference.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-7-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Max Kellermann [Tue, 28 Jan 2025 13:39:24 +0000 (14:39 +0100)]
io_uring/io-wq: do not use bogus hash value
Previously, the `hash` variable was initialized with `-1` and only
updated by io_get_next_work() if the current work was hashed. Commit
60cf46ae6054 ("io-wq: hash dependent work") changed this to always
call io_get_work_hash() even if the work was not hashed. This caused
the `hash != -1U` check to always be true, adding some overhead for
the `hash->wait` code.
This patch fixes the regression by checking the `IO_WQ_WORK_HASHED`
flag.
Perf diff for a flood of `IORING_OP_NOP` with `IOSQE_ASYNC`:
38.55% -1.57% [kernel.kallsyms] [k] queued_spin_lock_slowpath
6.86% -0.72% [kernel.kallsyms] [k] io_worker_handle_work
0.10% +0.67% [kernel.kallsyms] [k] put_prev_entity
1.96% +0.59% [kernel.kallsyms] [k] io_nop_prep
3.31% -0.51% [kernel.kallsyms] [k] try_to_wake_up
7.18% -0.47% [kernel.kallsyms] [k] io_wq_free_work
Fixes: 60cf46ae6054 ("io-wq: hash dependent work")
Cc: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-6-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
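The regression described in the hash commit above can be reduced to
the sketch below. DEMO_WORK_HASHED and DEMO_HASH_SHIFT are stand-in
values chosen for illustration, and the helper names are hypothetical.
Only the shape of the check is taken from the commit message.

```c
#include <stdbool.h>

/* Stand-in values; the real definitions live in io-wq. */
#define DEMO_WORK_HASHED	(1U << 1)
#define DEMO_HASH_SHIFT		24

/* Always yields some value, even for work that was never hashed. */
static unsigned int demo_get_work_hash(unsigned int work_flags)
{
	return work_flags >> DEMO_HASH_SHIFT;
}

/*
 * Regressed check: the shifted value can never equal -1U, so the
 * hash-wait path is taken for every work item.
 */
static bool demo_takes_hash_path_old(unsigned int work_flags)
{
	unsigned int hash = demo_get_work_hash(work_flags);

	return hash != -1U;
}

/* Fixed check: only work that carries the hashed flag takes the path. */
static bool demo_takes_hash_path_new(unsigned int work_flags)
{
	return work_flags & DEMO_WORK_HASHED;
}
```

For unhashed work, the old check still returns true while the new one
returns false, which is the overhead the commit removes.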
Max Kellermann [Tue, 28 Jan 2025 13:39:23 +0000 (14:39 +0100)]
io_uring/io-wq: cache work->flags in variable
This eliminates several redundant atomic reads and therefore reduces
the duration the surrounding spinlocks are held.
In several io_uring benchmarks, this reduced the CPU time spent in
queued_spin_lock_slowpath() considerably:
io_uring benchmark with a flood of `IORING_OP_NOP` and `IOSQE_ASYNC`:
38.86% -1.49% [kernel.kallsyms] [k] queued_spin_lock_slowpath
6.75% +0.36% [kernel.kallsyms] [k] io_worker_handle_work
2.60% +0.19% [kernel.kallsyms] [k] io_nop
3.92% +0.18% [kernel.kallsyms] [k] io_req_task_complete
6.34% -0.18% [kernel.kallsyms] [k] io_wq_submit_work
HTTP server, static file:
42.79% -2.77% [kernel.kallsyms] [k] queued_spin_lock_slowpath
2.08% +0.23% [kernel.kallsyms] [k] io_wq_submit_work
1.19% +0.20% [kernel.kallsyms] [k] amd_iommu_iotlb_sync_map
1.46% +0.15% [kernel.kallsyms] [k] ep_poll_callback
1.80% +0.15% [kernel.kallsyms] [k] io_worker_handle_work
HTTP server, PHP:
35.03% -1.80% [kernel.kallsyms] [k] queued_spin_lock_slowpath
0.84% +0.21% [kernel.kallsyms] [k] amd_iommu_iotlb_sync_map
1.39% +0.12% [kernel.kallsyms] [k] _copy_to_iter
0.21% +0.10% [kernel.kallsyms] [k] update_sd_lb_stats
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-5-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Max Kellermann [Tue, 28 Jan 2025 13:39:22 +0000 (14:39 +0100)]
io_uring/io-wq: move worker lists to struct io_wq_acct
Have separate linked lists for bounded and unbounded workers. This
way, io_acct_activate_free_worker() sees only workers relevant to it
and doesn't need to skip irrelevant ones. This speeds up the
linked list traversal (under acct->lock).
The `io_wq.lock` field is moved to `io_wq_acct.workers_lock`. It did
not actually protect "access to elements below", that is, not all of
them; it only protected access to the worker lists. By having two
locks instead of one, contention on this lock is reduced.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-4-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Max Kellermann [Tue, 28 Jan 2025 13:39:21 +0000 (14:39 +0100)]
io_uring/io-wq: add io_worker.acct pointer
This replaces the `IO_WORKER_F_BOUND` flag. All code that checks this
flag is not interested in knowing whether this is a "bound" worker;
all it does with this flag is determine the `io_wq_acct` pointer. At
the cost of an extra pointer field, we can eliminate some fragile
pointer arithmetic. In turn, the `create_index` and `index` fields
are not needed anymore.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-3-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
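A small sketch of the trade described in the io_worker.acct commit
above: one extra pointer field instead of deriving the accounting
struct from a flag. The struct layouts, the flag value and the helper
names are hypothetical simplifications of the io-wq structures.

```c
/* Hypothetical accounting classes. */
enum {
	DEMO_ACCT_BOUND,
	DEMO_ACCT_UNBOUND,
	DEMO_ACCT_NR,
};

/* Stand-in for the IO_WORKER_F_BOUND flag being replaced. */
#define DEMO_WORKER_F_BOUND	(1U << 3)

struct demo_acct {
	int nr_workers;
};

struct demo_wq {
	struct demo_acct acct[DEMO_ACCT_NR];
};

struct demo_worker {
	unsigned int flags;
	struct demo_wq *wq;
	struct demo_acct *acct;	/* the new pointer field */
};

/* Before: every lookup maps the flag back to an array slot. */
static struct demo_acct *demo_worker_acct_old(struct demo_worker *w)
{
	int idx = (w->flags & DEMO_WORKER_F_BOUND) ? DEMO_ACCT_BOUND
						   : DEMO_ACCT_UNBOUND;

	return &w->wq->acct[idx];
}

/* After: the worker remembers its accounting struct directly. */
static struct demo_acct *demo_worker_acct_new(struct demo_worker *w)
{
	return w->acct;
}
```

Once the pointer is stored at worker creation, neither the flag nor any
index bookkeeping is needed to find the right accounting struct.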
Max Kellermann [Tue, 28 Jan 2025 13:39:20 +0000 (14:39 +0100)]
io_uring/io-wq: eliminate redundant io_work_get_acct() calls
Instead of calling io_work_get_acct() again, pass acct to
io_wq_insert_work() and io_wq_remove_pending().
This atomic access in io_work_get_acct() was done under the
`acct->lock`, and optimizing it away reduces lock contention a bit.
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
Link: https://lore.kernel.org/r/20250128133927.3989681-2-max.kellermann@ionos.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sun, 16 Feb 2025 22:02:44 +0000 (14:02 -0800)]
Linux 6.14-rc3
Linus Torvalds [Sun, 16 Feb 2025 20:58:51 +0000 (12:58 -0800)]
Merge tag 'kbuild-fixes-v6.14-2' of git://git./linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild fixes from Masahiro Yamada:
- Fix annoying logs when building tools in parallel
- Fix the Debian linux-headers package build again
- Fix the target triple detection for userspace programs on Clang
* tag 'kbuild-fixes-v6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
modpost: Fix a few typos in a comment
kbuild: userprogs: fix bitsize and target detection on clang
kbuild: fix linux-headers package build when $(CC) cannot link userspace
tools: fix annoying "mkdir -p ..." logs when building tools in parallel
Linus Torvalds [Sun, 16 Feb 2025 20:54:42 +0000 (12:54 -0800)]
Merge tag 'driver-core-6.14-rc3' of git://git./linux/kernel/git/gregkh/driver-core
Pull driver core api addition from Greg KH:
"Here is a driver core new api for 6.14-rc3 that is being added to
allow platform devices from stop being abused.
It adds a new 'faux_device' structure and bus and api to allow almost
a straight or simpler conversion from platform devices that were not
really a platform device. It also comes with a binding for rust, with
an example driver in rust showing how it's used.
I'm adding this now so that the patches that convert the different
drivers and subsystems can all start flowing into linux-next now
through their different development trees, in time for 6.15-rc1.
We have a number that are already reviewed and tested, but adding
those conversions now doesn't seem right. For now, no one is using
this, and it passes all build tests from 0-day and linux-next, so all
should be good"
* tag 'driver-core-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
rust/kernel: Add faux device bindings
driver core: add a faux bus for use when a simple device/bus is needed
Linus Torvalds [Sun, 16 Feb 2025 20:50:44 +0000 (12:50 -0800)]
Merge tag 'tty-6.14-rc3' of git://git./linux/kernel/git/gregkh/tty
Pull serial driver fixes from Greg KH:
"Here are some small serial driver fixes for some reported problems.
Nothing major, just:
- sc16is7xx irq check fix
- 8250 fifo underflow fix
- serial_port and 8250 iotype fixes
Most of these have been in linux-next already, and all have passed
0-day testing"
* tag 'tty-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
serial: 8250: Fix fifo underflow on flush
serial: 8250_pnp: Remove unneeded ->iotype assignment
serial: 8250_platform: Remove unneeded ->iotype assignment
serial: 8250_of: Remove unneeded ->iotype assignment
serial: port: Make ->iotype validation global in __uart_read_properties()
serial: port: Always update ->iotype in __uart_read_properties()
serial: port: Assign ->iotype correctly when ->iobase is set
serial: sc16is7xx: Fix IRQ number check behavior
Linus Torvalds [Sun, 16 Feb 2025 19:15:50 +0000 (11:15 -0800)]
Merge tag 'usb-6.14-rc3' of git://git./linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes, and new device ids, for
6.14-rc3. Lots of tiny stuff for reported problems, including:
- new device ids and quirks
- usb hub crash fix found by syzbot
- dwc2 driver fix
- dwc3 driver fixes
- uvc gadget driver fix
- cdc-acm driver fixes for a variety of different issues
- other tiny bugfixes
Almost all of these have been in linux-next this week, and all have
passed 0-day testing"
* tag 'usb-6.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (25 commits)
usb: typec: tcpm: PSSourceOffTimer timeout in PR_Swap enters ERROR_RECOVERY
usb: roles: set switch registered flag early on
usb: gadget: uvc: Fix unstarted kthread worker
USB: quirks: add USB_QUIRK_NO_LPM quirk for Teclast dist
usb: gadget: core: flush gadget workqueue after device removal
USB: gadget: f_midi: f_midi_complete to call queue_work
usb: core: fix pipe creation for get_bMaxPacketSize0
usb: dwc3: Fix timeout issue during controller enter/exit from halt state
USB: Add USB_QUIRK_NO_LPM quirk for sony xperia xz1 smartphone
USB: cdc-acm: Fill in Renesas R-Car D3 USB Download mode quirk
usb: cdc-acm: Fix handling of oversized fragments
usb: cdc-acm: Check control transfer buffer size before access
usb: xhci: Restore xhci_pci support for Renesas HCs
USB: pci-quirks: Fix HCCPARAMS register error for LS7A EHCI
USB: serial: option: drop MeiG Smart defines
USB: serial: option: fix Telit Cinterion FN990A name
USB: serial: option: add Telit Cinterion FN990B compositions
USB: serial: option: add MeiG Smart SLM828
usb: gadget: f_midi: fix MIDI Streaming descriptor lengths
usb: dwc2: gadget: remove of_node reference upon udc_stop
...
Linus Torvalds [Sun, 16 Feb 2025 18:55:17 +0000 (10:55 -0800)]
Merge tag 'irq_urgent_for_v6.14_rc3' of git://git./linux/kernel/git/tip/tip
Pull irq Kconfig cleanup from Borislav Petkov:
- Remove an unused config item GENERIC_PENDING_IRQ_CHIPFLAGS
* tag 'irq_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Remove unused CONFIG_GENERIC_PENDING_IRQ_CHIPFLAGS
Linus Torvalds [Sun, 16 Feb 2025 18:41:50 +0000 (10:41 -0800)]
Merge tag 'perf_urgent_for_v6.14_rc3' of git://git./linux/kernel/git/tip/tip
Pull x86 perf fixes from Borislav Petkov:
- Explicitly clear DEBUGCTL.LBR to prevent LBRs continuing being
enabled after handoff to the OS
- Check CPUID(0x23) leaf and subleafs presence properly
- Remove the PEBS-via-PT feature from being supported on hybrid systems
- Fix perf record/top default commands on systems without a raw PMU
registered
* tag 'perf_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel: Ensure LBRs are disabled when a CPU is starting
perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
perf/x86/intel: Clean up PEBS-via-PT on hybrid
perf/x86/rapl: Fix the error checking order
Linus Torvalds [Sun, 16 Feb 2025 18:38:24 +0000 (10:38 -0800)]
Merge tag 'sched_urgent_for_v6.14_rc3' of git://git./linux/kernel/git/tip/tip
Pull scheduler fix from Borislav Petkov:
- Clarify what happens when a task is woken up from the wake queue and
make clear its removal from that queue is atomic
* tag 'sched_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Clarify wake_up_q()'s write to task->wake_q.next
Linus Torvalds [Sun, 16 Feb 2025 18:30:58 +0000 (10:30 -0800)]
Merge tag 'objtool_urgent_for_v6.14_rc3' of git://git./linux/kernel/git/tip/tip
Pull objtool fixes from Borislav Petkov:
- Move a warning about a lld.ld breakage into the verbose setting as
said breakage has been fixed in the meantime
- Teach objtool to ignore dangling jump table entries added by Clang
* tag 'objtool_urgent_for_v6.14_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Move dodgy linker warn to verbose
objtool: Ignore dangling jump table entries
Linus Torvalds [Sun, 16 Feb 2025 18:25:12 +0000 (10:25 -0800)]
Merge tag 'for-linus' of git://git./virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM:
- Large set of fixes for vector handling, especially in the
interactions between host and guest state.
This fixes a number of bugs affecting actual deployments, and
greatly simplifies the FP/SIMD/SVE handling. Thanks to Mark Rutland
for dealing with this thankless task.
- Fix an ugly race between vcpu and vgic creation/init, resulting in
unexpected behaviours
- Fix use of kernel VAs at EL2 when emulating timers with nVHE
- Small set of pKVM improvements and cleanups
x86:
- Fix broken SNP support with KVM module built-in, ensuring the PSP
module is initialized before KVM even when the module
infrastructure cannot be used to order initcalls
- Reject Hyper-V SEND_IPI hypercalls if the local APIC isn't being
emulated by KVM to fix a NULL pointer dereference
- Enter guest mode (L2) from KVM's perspective before initializing
the vCPU's nested NPT MMU so that the MMU is properly tagged for
L2, not L1
- Load the guest's DR6 outside of the innermost .vcpu_run() loop, as
the guest's value may be stale if a VM-Exit is handled in the
fastpath"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
x86/sev: Fix broken SNP support with KVM module built-in
KVM: SVM: Ensure PSP module is initialized if KVM module is built-in
crypto: ccp: Add external API interface for PSP module initialization
KVM: arm64: vgic: Hoist SGI/PPI alloc from vgic_init() to kvm_create_vgic()
KVM: arm64: timer: Drop warning on failed interrupt signalling
KVM: arm64: Fix alignment of kvm_hyp_memcache allocations
KVM: arm64: Convert timer offset VA when accessed in HYP code
KVM: arm64: Simplify warning in kvm_arch_vcpu_load_fp()
KVM: arm64: Eagerly switch ZCR_EL{1,2}
KVM: arm64: Mark some header functions as inline
KVM: arm64: Refactor exit handlers
KVM: arm64: Refactor CPTR trap deactivation
KVM: arm64: Remove VHE host restore of CPACR_EL1.SMEN
KVM: arm64: Remove VHE host restore of CPACR_EL1.ZEN
KVM: arm64: Remove host FPSIMD saving for non-protected KVM
KVM: arm64: Unconditionally save+flush host FPSIMD/SVE/SME state
KVM: x86: Load DR6 with guest value only before entering .vcpu_run() loop
KVM: nSVM: Enter guest mode before initializing nested NPT MMU
KVM: selftests: Add CPUID tests for Hyper-V features that need in-kernel APIC
KVM: selftests: Manage CPUID array in Hyper-V CPUID test's core helper
...
Linus Torvalds [Sun, 16 Feb 2025 18:19:41 +0000 (10:19 -0800)]
Merge tag 'mips-fixes_6.14_1' of git://git./linux/kernel/git/mips/linux
Pull MIPS fixes from Thomas Bogendoerfer:
"Fix for o32 ptrace/get_syscall_info"
* tag 'mips-fixes_6.14_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
MIPS: fix mips_get_syscall_arg() for o32
MIPS: Export syscall stack arguments properly for remote use
Linus Torvalds [Sun, 16 Feb 2025 01:20:39 +0000 (17:20 -0800)]
Merge tag 'devicetree-fixes-for-6.14-1' of git://git./linux/kernel/git/robh/linux
Pull devicetree fixes from Rob Herring:
- Add bindings for QCom QCS8300 clocks, QCom SAR2130P qfprom, and
powertip,{st7272|hx8238a} displays
- Fix compatible for TI am62a7 dss
- Add a kunit test for __of_address_resource_bounds()
* tag 'devicetree-fixes-for-6.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
dt-bindings: display: Add powertip,{st7272|hx8238a} as DT Schema description
dt-bindings: nvmem: qcom,qfprom: Add SAR2130P compatible
dt-bindings: display: ti: Fix compatible for am62a7 dss
of: address: Add kunit test for __of_address_resource_bounds()
dt-bindings: clock: qcom: Add QCS8300 video clock controller
dt-bindings: clock: qcom: Add CAMCC clocks for QCS8300
dt-bindings: clock: qcom: Add GPU clocks for QCS8300