path: root/fs/io_uring.c
AgeCommit message (Collapse)Author
2020-01-29io_uring: add support for epoll_ctl(2)for-5.6/io_uring-vfs-2020-01-29for-5.6/io_uring-vfsJens Axboe
This adds IORING_OP_EPOLL_CTL, which can perform the same work as the epoll_ctl(2) system call. Signed-off-by: Jens Axboe <>
2020-01-29io_uring: fix linked command file table usageJens Axboe
We're not consistent in how the file table is grabbed and assigned if we have a command linked that requires the use of it. Add ->file_table to the io_op_defs[] array, and use that to determine when to grab the table instead of having the handlers set it if they need to defer. This also means we can kill the IO_WQ_WORK_NEEDS_FILES flag. We always initialize work->files, so io-wq can just check for that. Signed-off-by: Jens Axboe <>
2020-01-28io_uring: support using a registered personality for commandsJens Axboe
For personalities previously registered via IORING_REGISTER_PERSONALITY, allow any command to select them. This is done through setting sqe->personality to the id returned from registration, and then flagging sqe->flags with IOSQE_PERSONALITY. Reviewed-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-28io_uring: allow registering credentialsJens Axboe
If an application wants to use a ring with different kinds of credentials, it can register them upfront. We don't lookup credentials, the credentials of the task calling IORING_REGISTER_PERSONALITY is used. An 'id' is returned for the application to use in subsequent personality support. Reviewed-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-28io_uring: add io-wq workqueue sharingPavel Begunkov
If IORING_SETUP_ATTACH_WQ is set, it expects wq_fd in io_uring_params to be a valid io_uring fd io-wq of which will be shared with the newly created io_uring instance. If the flag is set but it can't share io-wq, it fails. This allows creation of "sibling" io_urings, where we prefer to keep the SQ/CQ private, but want to share the async backend to minimize the amount of overhead associated with having multiple rings that belong to the same backend. Reported-by: Jens Axboe <> Reported-by: Daurnimator <> Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-28io_uring/io-wq: don't use static creds/mm assignmentsJens Axboe
We currently setup the io_wq with a static set of mm and creds. Even for a single-use io-wq per io_uring, this is suboptimal as we have may have multiple enters of the ring. For sharing the io-wq backend, it doesn't work at all. Switch to passing in the creds and mm when the work item is setup. This means that async work is no longer deferred to the io_uring mm and creds, it is done with the current mm and creds. Flag this behavior with IORING_FEAT_CUR_PERSONALITY, so applications know they can rely on the current personality (mm and creds) being the same for direct issue and async issue. Reviewed-by: Stefan Metzmacher <> Signed-off-by: Jens Axboe <>
2020-01-27io_uring: fix refcounting with batched allocations at OOMPavel Begunkov
In case of out of memory the second argument of percpu_ref_put_many() in io_submit_sqes() may evaluate into "nr - (-EAGAIN)", that is clearly wrong. Fixes: 2b85edfc0c90 ("io_uring: batch getting pcpu references") Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-27io_uring: add comment for drain_nextPavel Begunkov
Draining the middle of a link is tricky, so leave a comment there Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-27io_uring: don't attempt to copy iovec for READ/WRITEJens Axboe
For the non-vectored variant of READV/WRITEV, we don't need to setup an async io context, and we flag that appropriately in the io_op_defs array. However, in fixing this for the 5.5 kernel in commit 74566df3a71c we didn't have these opcodes, so the check there was added just for the READ_FIXED and WRITE_FIXED opcodes. Replace that check with just a single check for needing async context, that covers all four of these read/write variants that don't use an iovec. Signed-off-by: Jens Axboe <>
2020-01-22io_uring: honor IOSQE_ASYNC for linked reqsPavel Begunkov
REQ_F_FORCE_ASYNC is checked only for the head of a link. Fix it. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-22io_uring: prep req when do IOSQE_ASYNCPavel Begunkov
Whenever IOSQE_ASYNC is set, requests will be punted to async without getting into io_issue_req() and without proper preparation done (e.g. io_req_defer_prep()). Hence they will be left uninitialised. Prepare them before punting. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: use labeled array init in io_op_defsPavel Begunkov
Don't rely on implicit ordering of IORING_OP_ and explicitly place them at a right place in io_op_defs. Now former comments are now a part of the code and won't ever outdate. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: optimise sqe-to-req flags translationPavel Begunkov
For each IOSQE_* flag there is a corresponding REQ_F_* flag. And there is a repetitive pattern of their translation: e.g. if (sqe->flags & SQE_FLAG*) req->flags |= REQ_F_FLAG* Use same numeric values/bits for them and copy instead of manual handling. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: remove REQ_F_IO_DRAINEDPavel Begunkov
A request can get into the defer list only once, there is no need for marking it as drained, so remove it. This probably was left after extracting __need_defer() for use in timeouts. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: file switch work needs to get flushed on exitJens Axboe
We currently flush early, but if we have something in progress and a new switch is scheduled, we need to ensure to flush after our teardown as well. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: hide uring_fd in ctxPavel Begunkov
req->ring_fd and req->ring_file are used only during the prep stage during submission, which is is protected by mutex. There is no need to store them per-request, place them in ctx. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: remove extra check in __io_commit_cqringPavel Begunkov
__io_commit_cqring() is almost always called when there is a change in the rings, so the check is rather pessimising. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: optimise use of ctx->drain_nextPavel Begunkov
Move setting ctx->drain_next to the only place it could be set, when it got linked non-head requests. The same for checking it, it's interesting only for a head of a link or a non-linked request. No functional changes here. This removes some code from the common path and also removes REQ_F_DRAIN_LINK flag, as it doesn't need it anymore. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: add support for probing opcodesJens Axboe
The application currently has no way of knowing if a given opcode is supported or not without having to try and issue one and see if we get -EINVAL or not. And even this approach is fraught with peril, as maybe we're getting -EINVAL due to some fields being missing, or maybe it's just not that easy to issue that particular command without doing some other leg work in terms of setup first. This adds IORING_REGISTER_PROBE, which fills in a structure with info on what it supported or not. This will work even with sparse opcode fields, which may happen in the future or even today if someone backports specific features to older kernels. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: account fixed file references correctly in batchJens Axboe
We can't assume that the whole batch has fixed files in it. If it's a mix, or none at all, then we can end up doing a ref put that either messes up accounting, or causes an oops if we have no fixed files at all. Also ensure we free requests properly between inflight accounted and normal requests. Fixes: 82c721577011 ("io_uring: extend batch freeing to cover more cases") Reported-by: Dmitrii Dolgov <> Reported-by: Pavel Begunkov <> Tested-by: Dmitrii Dolgov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: add opcode to issue trace eventJens Axboe
For some test apps at least, user_data is just zeroes. So it's not a good way to tell what the command actually is. Add the opcode to the issue trace point. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: add support for IORING_OP_OPENAT2Jens Axboe
Add support for the new openat2(2) system call. It's trivial to do, as we can have openat(2) just be wrapped around it. Suggested-by: Stefan Metzmacher <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: remove 'fname' from io_open structureJens Axboe
We only use it internally in the prep functions for both statx and openat, so we don't need it to be persistent across the request. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: add 'struct open_how' to the openat request contextJens Axboe
We'll need this for openat2(2) support, remove flags and mode from the existing io_open struct. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: enable option to only trigger eventfd for async completionsJens Axboe
If an application is using eventfd notifications with poll to know when new SQEs can be issued, it's expecting the following read/writes to complete inline. And with that, it knows that there are events available, and don't want spurious wakeups on the eventfd for those requests. This adds IORING_REGISTER_EVENTFD_ASYNC, which works just like IORING_REGISTER_EVENTFD, except it only triggers notifications for events that happen from async completions (IRQ, or io-wq worker completions). Any completions inline from the submission itself will not trigger notifications. Suggested-by: Mark Papadakis <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: change io_ring_ctx bool fields into bit fieldsJens Axboe
In preparation for adding another one, which would make us spill into another long (and hence bump the size of the ctx), change them to bit fields. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: file set registration should use interruptible waitsJens Axboe
If an application attempts to register a set with unbounded requests pending, we can be stuck here forever if they don't complete. We can make this wait interruptible, and just abort if we get signaled. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: Remove unnecessary null checkYueHaibing
Null check kfree is redundant, so remove it. This is detected by coccinelle. Signed-off-by: YueHaibing <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: add support for send(2) and recv(2)Jens Axboe
This adds IORING_OP_SEND for send(2) support, and IORING_OP_RECV for recv(2) support. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: remove extra io_wq_current_is_worker()Pavel Begunkov
io_wq workers use io_issue_sqe() to forward sqes and never io_queue_sqe(). Remove extra check for io_wq_current_is_worker() Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: optimise commit_sqring() for common casePavel Begunkov
It should be pretty rare to not submitting anything when there is something in the ring. No need to keep heuristics for this case. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: optimise head checks in io_get_sqring()Pavel Begunkov
A user may ask to submit more than there is in the ring, and then io_uring will submit as much as it can. However, in the last iteration it will allocate an io_kiocb and immediately free it. It could do better and adjust @to_submit to what is in the ring. And since the ring's head is already checked here, there is no need to do it in the loop, spamming with smp_load_acquire()'s barriers Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: clamp to_submit in io_submit_sqes()Pavel Begunkov
Make io_submit_sqes() to clamp @to_submit itself. It removes duplicated code and prepares for following changes. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: add support for IORING_SETUP_CLAMPJens Axboe
Some applications like to start small in terms of ring size, and then ramp up as needed. This is a bit tricky to do currently, since we don't advertise the max ring size. This adds IORING_SETUP_CLAMP. If set, and the values for SQ or CQ ring size exceed what we support, then clamp them at the max values instead of returning -EINVAL. Since we return the chosen ring sizes after setup, no further changes are needed on the application side. io_uring already changes the ring sizes if the application doesn't ask for power-of-two sizes, for example. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: extend batch freeing to cover more casesJens Axboe
Currently we only batch free if fixed files are used, no links, no aux data, etc. This extends the batch freeing to only exclude the linked case and fallback case, and make io_free_req_many() handle the other cases just fine. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: wrap multi-req freeing in struct req_batchJens Axboe
This cleans up the code a bit, and it allows us to build on top of the multi-req freeing. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: batch getting pcpu referencesPavel Begunkov
percpu_ref_tryget() has its own overhead. Instead getting a reference for each request, grab a bunch once per io_submit_sqes(). ~5% throughput boost for a "submit and wait 128 nops" benchmark. Signed-off-by: Pavel Begunkov <> __io_req_free_empty() -> __io_req_do_free() Signed-off-by: Jens Axboe <>
2020-01-20io_uring: add IORING_OP_MADVISEJens Axboe
This adds support for doing madvise(2) through io_uring. We assume that any operation can block, and hence punt everything async. This could be improved, but hard to make bullet proof. The async punt ensures it's safe. Reviewed-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: add IORING_OP_FADVISEJens Axboe
This adds support for doing fadvise through io_uring. We assume that WILLNEED doesn't block, but that DONTNEED may block. Reviewed-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: allow use of offset == -1 to mean file positionJens Axboe
This behaves like preadv2/pwritev2 with offset == -1, it'll use (and update) the current file position. This obviously comes with the caveat that if the application has multiple read/writes in flight, then the end result will not be as expected. This is similar to threads sharing a file descriptor and doing IO using the current file position. Since this feature isn't easily detectable by doing a read or write, add a feature flags, IORING_FEAT_RW_CUR_POS, to allow applications to detect presence of this feature. Reported-by: 李通洲 <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: add non-vectored read/write commandsJens Axboe
For uses cases that don't already naturally have an iovec, it's easier (or more convenient) to just use a buffer address + length. This is particular true if the use case is from languages that want to create a memory safe abstraction on top of io_uring, and where introducing the need for the iovec may impose an ownership issue. For those cases, they currently need an indirection buffer, which means allocating data just for this purpose. Add basic read/write that don't require the iovec. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: improve poll completion performanceJens Axboe
For busy IORING_OP_POLL_ADD workloads, we can have enough contention on the completion lock that we fail the inline completion path quite often as we fail the trylock on that lock. Add a list for deferred completions that we can use in that case. This helps reduce the number of async offloads we have to do, as if we get multiple completions in a row, we'll piggy back on to the poll_llist instead of having to queue our own offload. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: split overflow state into SQ and CQ sideJens Axboe
We currently check ->cq_overflow_list from both SQ and CQ context, which causes some bouncing of that cache line. Add separate bits of state for this instead, so that the SQ side can check using its own state, and likewise for the CQ side. This adds ->sq_check_overflow with the SQ state, and ->cq_check_overflow with the CQ state. If we hit an overflow condition, both of these bits are set. Likewise for overflow flush clear, we clear both bits. For the fast path of just checking if there's an overflow condition on either the SQ or CQ side, we can use our own private bit for this. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: add lookup table for various opcode needsJens Axboe
We currently have various switch statements that check if an opcode needs a file, mm, etc. These are hard to keep in sync as opcodes are added. Add a struct io_op_def that holds all of this information, so we have just one spot to update when opcodes are added. This also enables us to NOT allocate req->io if a deferred command doesn't need it, and corrects some mistakes we had in terms of what commands need mm context. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: remove two unnecessary function declarationsJens Axboe
__io_free_req() and io_double_put_req() aren't used before they are defined, so we can kill these two forwards. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: move *queue_link_head() from common pathPavel Begunkov
Move io_queue_link_head() to links handling code in io_submit_sqe(), so it wouldn't need extra checks and would have better data locality. Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: rename prev to headPavel Begunkov
Calling "prev" a head of a link is a bit misleading. Rename it Signed-off-by: Pavel Begunkov <> Signed-off-by: Jens Axboe <>
2020-01-20io_uring: add IOSQE_ASYNCJens Axboe
io_uring defaults to always doing inline submissions, if at all possible. But for larger copies, even if the data is fully cached, that can take a long time. Add an IOSQE_ASYNC flag that the application can set on the SQE - if set, it'll ensure that we always go async for those kinds of requests. Use the io-wq IO_WQ_WORK_CONCURRENT flag to ensure we get the concurrency we desire for this case. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: add support for IORING_OP_STATXJens Axboe
This provides support for async statx(2) through io_uring. Signed-off-by: Jens Axboe <>
2020-01-20io_uring: avoid ring quiesce for fixed file set unregister and updateJens Axboe
We currently fully quiesce the ring before an unregister or update of the fixed fileset. This is very expensive, and we can be a bit smarter about this. Add a percpu refcount for the file tables as a whole. Grab a percpu ref when we use a registered file, and put it on completion. This is cheap to do. Upon removal of a file from a set, switch the ref count to atomic mode. When we hit zero ref on the completion side, then we know we can drop the previously registered files. When the old files have been dropped, switch the ref back to percpu mode for normal operation. Since there's a period between doing the update and the kernel being done with it, add a IORING_OP_FILES_UPDATE opcode that can perform the same action. The application knows the update has completed when it gets the CQE for it. Between doing the update and receiving this completion, the application must continue to use the unregistered fd if submitting IO on this particular file. This takes the runtime of test/file-register from liburing from 14s to about 0.7s. Signed-off-by: Jens Axboe <>