linux-2.6-block.git
Jens Axboe [Mon, 15 Feb 2021 20:42:42 +0000 (13:42 -0700)]
Revert "proc: don't allow async path resolution of /proc/self components"

This reverts commit 8d4c3e76e3be11a64df95ddee52e99092d42fc19.

No longer needed, as the io-wq worker threads have the right identity.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 15 Feb 2021 20:42:18 +0000 (13:42 -0700)]
Revert "proc: don't allow async path resolution of /proc/thread-self components"

This reverts commit 0d4370cfe36b7f1719123b621a4ec4d9c7a25f89.

No longer needed, as the io-wq worker threads have the right identity.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 15 Feb 2021 20:40:22 +0000 (13:40 -0700)]
io_uring: remove io_identity

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 15 Feb 2021 20:32:18 +0000 (13:32 -0700)]
io_uring: remove any grabbing of context

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 15 Feb 2021 20:26:34 +0000 (13:26 -0700)]
io-wq: worker idling always returns false

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 15 Feb 2021 20:25:53 +0000 (13:25 -0700)]
io-wq: fork worker threads from original task

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 15 Feb 2021 18:59:33 +0000 (11:59 -0700)]
io-wq: get rid of manager task

Just create the workers inline. This may be a bit more costly than
offloading thread creation to a manager, but the upside is that worker
creation happens from the original task itself. We'll need that for
inheriting context.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Mon, 15 Feb 2021 18:04:02 +0000 (11:04 -0700)]
Merge branch 'for-5.12/io_uring' into for-next

* for-5.12/io_uring:
  proc: don't allow async path resolution of /proc/thread-self components
  io_uring: kill cached requests from exiting task closing the ring
  io_uring: add helper to free all request caches
  io_uring: allow task match to be passed to io_req_cache_free()
  io-wq: clear out worker ->fs and ->files

Jens Axboe [Mon, 15 Feb 2021 18:03:58 +0000 (11:03 -0700)]
Merge branch 'for-5.12/drivers' into for-next

* for-5.12/drivers:
  lightnvm: pblk: Replace guid_copy() with export_guid()/import_guid()
  lightnvm: fix unnecessary NULL check warnings

Jens Axboe [Sun, 14 Feb 2021 20:21:43 +0000 (13:21 -0700)]
proc: don't allow async path resolution of /proc/thread-self components

If this is attempted by an io-wq kthread, then return -EOPNOTSUPP as we
don't currently support that. Once we can get task_pid_ptr() doing the
right thing, then this can go away again.

Use PF_IO_WORKER for this to specifically target the io_uring workers.
Modify the /proc/self/ check to use PF_IO_WORKER as well.
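
A rough sketch of the shape of the check (the exact function and
surrounding context in fs/proc are assumptions here, not the verbatim
patch):

  /* in the /proc/self (and /proc/thread-self) link resolution path */
  if (current->flags & PF_IO_WORKER)
          return ERR_PTR(-EOPNOTSUPP);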

Cc: stable@vger.kernel.org
Fixes: 8d4c3e76e3be ("proc: don't allow async path resolution of /proc/self components")
Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Andy Shevchenko [Sun, 14 Feb 2021 10:31:03 +0000 (10:31 +0000)]
lightnvm: pblk: Replace guid_copy() with export_guid()/import_guid()

There is a dedicated API for treating raw data as a GUID, namely
export_guid() and import_guid(). Use them instead of guid_copy() with
explicit casting.
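
A minimal sketch of the pattern, assuming a raw 16-byte on-media field
(the struct and variable names below are illustrative, not the pblk
code):

  struct on_media_meta {
          u8 guid[16];                    /* raw GUID bytes */
  };

  /* Before: explicit cast */
  guid_copy((guid_t *)meta->guid, &instance_guid);

  /* After: dedicated raw-buffer helpers from <linux/uuid.h> */
  export_guid(meta->guid, &instance_guid); /* guid_t -> raw bytes */
  import_guid(&instance_guid, meta->guid); /* raw bytes -> guid_t */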

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Tian Tao [Sun, 14 Feb 2021 10:31:02 +0000 (10:31 +0000)]
lightnvm: fix unnecessary NULL check warnings

Remove NULL checks before vfree() to fix these warnings:
./drivers/lightnvm/pblk-gc.c:27:2-7: WARNING: NULL check before some
freeing functions is not needed.
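
The shape of the change (pointer name illustrative); vfree(), like
kfree(), is already a no-op for NULL:

  /* Before: redundant check */
  if (ptr)
          vfree(ptr);

  /* After */
  vfree(ptr);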

Signed-off-by: Tian Tao <tiantao6@hisilicon.com>
Signed-off-by: Matias Bjørling <matias.bjorling@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 13 Feb 2021 16:11:04 +0000 (09:11 -0700)]
io_uring: kill cached requests from exiting task closing the ring

Be nice and prune these upfront, in case the ring is being shared and
one of the tasks is going away. This is a bit more important now that
we account the allocations.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 13 Feb 2021 16:09:44 +0000 (09:09 -0700)]
io_uring: add helper to free all request caches

We have three different caches; put the freeing in a helper for easy
calling. This is in preparation for doing it outside of ring freeing as
well. With that in mind, also ensure that we do the proper locking for
safe calling from a context where the ring is still live.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Sat, 13 Feb 2021 16:00:02 +0000 (09:00 -0700)]
io_uring: allow task match to be passed to io_req_cache_free()

No functional changes in this patch; it just allows a caller to pass in
a target task that we must match when freeing requests from the cache.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 12 Feb 2021 21:02:54 +0000 (14:02 -0700)]
io-wq: clear out worker ->fs and ->files

By default, kernel threads have init_fs and init_files assigned. In the
past, this has triggered security problems, as commands that don't ask
for (and hence don't get assigned) fs/files from the originating task
can then attempt path resolution and the like with access to parts of
the system they should not be able to touch.

Rather than add checks in the fs code for misuse, just set these to
NULL. If we do attempt to use them, then the resulting code will oops
rather than provide access to something that it should not permit.
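
A sketch of the idea in the worker setup path (the placement is an
assumption; the NULL assignments are the point):

  /* io-wq worker init: drop the inherited init_fs/init_files so any
   * unexpected use oopses instead of resolving against init's view */
  current->fs = NULL;
  current->files = NULL;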

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 12 Feb 2021 18:51:06 +0000 (11:51 -0700)]
Merge branch 'for-5.12/io_uring' into for-next

* for-5.12/io_uring:
  io_uring: optimise io_init_req() flags setting
  io_uring: clean io_req_find_next() fast check
  io_uring: don't check PF_EXITING from syscall

Pavel Begunkov [Fri, 12 Feb 2021 18:41:17 +0000 (18:41 +0000)]
io_uring: optimise io_init_req() flags setting

Invalid req->flags are tolerated well by free/put, so avoid needlessly
presetting the flags to zero and then, rather than setting them,
modifying them with "|=".
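
Roughly, the change looks like this (surrounding code elided; a sketch,
not the exact hunk):

  /* Before: preset, then OR in the SQE flags */
  req->flags = 0;
  /* ... */
  req->flags |= sqe_flags;

  /* After: a single direct assignment */
  req->flags = sqe_flags;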

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 12 Feb 2021 18:41:16 +0000 (18:41 +0000)]
io_uring: clean io_req_find_next() fast check

io_req_find_next() is indirectly called for every request, so optimise
the check by testing flags, as was done long before --
__io_req_find_next() tolerates false positives well (i.e. link == NULL),
and those should be really rare.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 12 Feb 2021 18:41:15 +0000 (18:41 +0000)]
io_uring: don't check PF_EXITING from syscall

io_sq_thread_acquire_mm_files() can find a PF_EXITING task only when
it's called from task_work context. Don't check for it in all other
cases, i.e. when we're called via io_uring_enter().

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 12 Feb 2021 15:28:40 +0000 (08:28 -0700)]
Merge branch 'for-5.12/block-ipi' into for-next

* for-5.12/block-ipi:
  blk-mq: Use llist_head for blk_cpu_done
  blk-mq: Always complete remote completions requests in softirq
  smp: Process pending softirqs in flush_smp_call_function_from_idle()

Jens Axboe [Fri, 12 Feb 2021 15:28:35 +0000 (08:28 -0700)]
Merge branch 'for-5.12/io_uring' into for-next

* for-5.12/io_uring: (100 commits)
  io_uring: don't split out consume out of SQE get
  io_uring: save ctx put/get for task_work submit
  io_uring: don't duplicate io_req_task_queue()
  io_uring: optimise SQPOLL mm/files grabbing
  io_uring: optimise out unlikely link queue
  io_uring: take compl state from submit state
  io_uring: inline io_complete_rw_common()
  io_uring: move res check out of io_rw_reissue()
  io_uring: simplify iopoll reissuing
  io_uring: clean up io_req_free_batch_finish()
  io_uring: move submit side state closer in the ring
  io_uring: assign file_slot prior to calling io_sqe_file_register()
  io_uring: remove redundant initialization of variable ret
  io_uring: unpark SQPOLL thread for cancelation
  io_uring: place ring SQ/CQ arrays under memcg memory limits
  io_uring: enable kmemcg account for io_uring requests
  io_uring: enable req cache for IRQ driven IO
  io_uring: fix possible deadlock in io_uring_poll
  io_uring: defer flushing cached reqs
  io_uring: take comp_state from ctx
  ...

Jens Axboe [Fri, 12 Feb 2021 15:28:30 +0000 (08:28 -0700)]
Merge branch 'for-5.12/libata' into for-next

* for-5.12/libata:
  ata: Avoid comma separated statements
  ata: ahci_brcm: Add back regulators management

Jens Axboe [Fri, 12 Feb 2021 15:28:25 +0000 (08:28 -0700)]
Merge branch 'for-5.12/drivers' into for-next

* for-5.12/drivers: (62 commits)
  nvme-tcp: fix crash triggered with a dataless request submission
  block: Replace lkml.org links with lore
  nbd: Convert to DEFINE_SHOW_ATTRIBUTE
  nvme: add 48-bit DMA address quirk for Amazon NVMe controllers
  nvme-hwmon: rework to avoid devm allocation
  nvmet: remove else at the end of the function
  nvmet: add nvmet_req_subsys() helper
  nvmet: use min of device_path and disk len
  nvmet: use invalid cmd opcode helper
  nvmet: use invalid cmd opcode helper
  nvmet: add helper to report invalid opcode
  nvmet: remove extra variable in id-ns handler
  nvmet: make nvmet_find_namespace() req based
  nvmet: return uniform error for invalid ns
  nvmet: set status to 0 in case for invalid nsid
  nvmet-fc: add a missing __rcu annotation to nvmet_fc_tgt_assoc.queues
  nvme-multipath: set nr_zones for zoned namespaces
  nvmet-tcp: fix potential race of tcp socket closing accept_work
  nvmet-tcp: fix receive data digest calculation for multiple h2cdata PDUs
  nvme-rdma: handle nvme_rdma_post_send failures better
  ...

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 12 Feb 2021 15:28:13 +0000 (08:28 -0700)]
Merge branch 'for-5.12/block' into for-next

* for-5.12/block: (104 commits)
  mm: simplify swapdev_block
  sd_zbc: clear zone resources for non-zoned case
  block: introduce blk_queue_clear_zone_settings()
  zonefs: use zone write granularity as block size
  block: introduce zone_write_granularity limit
  block: use blk_queue_set_zoned in add_partition()
  nullb: use blk_queue_set_zoned() to setup zoned devices
  nvme: cleanup zone information initialization
  block: document zone_append_max_bytes attribute
  block: use bi_max_vecs to find the bvec pool
  md/raid10: remove dead code in reshape_request
  block: mark the bio as cloned in bio_iov_bvec_set
  block: set BIO_NO_PAGE_REF in bio_iov_bvec_set
  block: remove a layer of indentation in bio_iov_iter_get_pages
  block: turn the nr_iovecs argument to bio_alloc* into an unsigned short
  block: remove the 1 and 4 vec bvec_slabs entries
  block: streamline bvec_alloc
  block: factor out a bvec_alloc_gfp helper
  block: move struct biovec_slab to bio.c
  block: reuse BIO_INLINE_VECS for integrity bvecs
  ...

Sebastian Andrzej Siewior [Sat, 23 Jan 2021 20:10:27 +0000 (21:10 +0100)]
blk-mq: Use llist_head for blk_cpu_done

With llist_head it is possible to avoid the locking (the irq-off region)
when items are added. This makes it possible to add items on a remote
CPU without additional locking.
llist_add() returns true if the list was previously empty. This can be
used to invoke the SMP function call / raise softirq only if the first
item was added (otherwise it is already pending).
This simplifies the code a little and reduces the IRQ-off regions.
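
A sketch of the pattern (the list and field names follow the
description above and are assumptions, not the verbatim hunk):

  /* Lock-free add; llist_add() returns true only when the list was
   * previously empty, i.e. for the first pending item. */
  if (llist_add(&rq->ipi_list, this_cpu_ptr(&blk_cpu_done)))
          raise_softirq(BLOCK_SOFTIRQ);   /* otherwise already raised */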

blk_mq_raise_softirq() needs a preempt-disable section to ensure the
request is enqueued on the same CPU as the softirq is raised.
Some callers (USB-storage) invoke this path in preemptible context.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sebastian Andrzej Siewior [Sat, 23 Jan 2021 20:10:26 +0000 (21:10 +0100)]
blk-mq: Always complete remote completions requests in softirq

Controllers with multiple queues have their IRQ handlers pinned to a
CPU. The core shouldn't need to complete the request on a remote CPU.

Remove this case and always raise the softirq to complete the request.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Fri, 12 Feb 2021 15:27:51 +0000 (08:27 -0700)]
Merge branch 'sched/smp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into for-5.12/block-ipi

* 'sched/smp' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  smp: Process pending softirqs in flush_smp_call_function_from_idle()

Pavel Begunkov [Fri, 12 Feb 2021 11:55:17 +0000 (11:55 +0000)]
io_uring: don't split out consume out of SQE get

Remove io_consume_sqe() and inline it back into io_get_sqe(). It
requires req dealloc on error, but in exchange we get cleaner
io_submit_sqes() and better locality for cached_sq_head.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 12 Feb 2021 03:23:54 +0000 (03:23 +0000)]
io_uring: save ctx put/get for task_work submit

Do a little trick in io_ring_ctx_free(): briefly take uring_lock, which
will wait for everyone currently holding it. That lets us skip pinning
ctx with ctx->refs for __io_req_task_submit(), which executes and drops
its refs/reqs while holding the lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 12 Feb 2021 03:23:53 +0000 (03:23 +0000)]
io_uring: don't duplicate io_req_task_queue()

Don't hand code io_req_task_queue() inside of io_async_buf_func(), just
call it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 12 Feb 2021 03:23:52 +0000 (03:23 +0000)]
io_uring: optimise SQPOLL mm/files grabbing

There are two reasons for this. The first is to optimise
io_sq_thread_acquire_mm_files() for the non-SQPOLL case, which
currently does too many checks and function calls in the hot path,
e.g. in io_init_req().

The second is to not grab mm/files when they are not needed. As
__io_queue_sqe() issues only one request now, we can reuse
io_sq_thread_acquire_mm_files() instead of unconditionally acquiring
mm/files.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 12 Feb 2021 03:23:51 +0000 (03:23 +0000)]
io_uring: optimise out unlikely link queue

__io_queue_sqe() tries to issue as many requests of a link as it can,
and uses io_put_req_find_next() to extract the next one, targeting
inline completed requests. As __io_queue_sqe() is now always used
together with struct io_comp_state, next propagation is left only a
small window, and only for async reqs, which doesn't justify its
existence.

Remove it and make __io_queue_sqe() issue only the head request. It
simplifies the code and will allow other optimisations.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Fri, 12 Feb 2021 03:23:50 +0000 (03:23 +0000)]
io_uring: take compl state from submit state

Completion and submission states are now coupled together, so it's
weird to get one from an argument and the other from ctx; do it
consistently for io_req_free_batch(). It's also faster, as we already
have @state cached in registers.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 11 Feb 2021 18:28:23 +0000 (18:28 +0000)]
io_uring: inline io_complete_rw_common()

__io_complete_rw() casts the request to a kiocb only for it to be
immediately container_of()'ed back by io_complete_rw_common(). And the
latter function's name doesn't do a great job of illuminating its
purpose, so just inline it into its only user.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 11 Feb 2021 18:28:22 +0000 (18:28 +0000)]
io_uring: move res check out of io_rw_reissue()

We pass the return code into io_rw_reissue() only to be able to check
if it's -EAGAIN. That's not the cleanest approach and may prevent
inlining of the non-EAGAIN fast path, so do it at the call sites.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 11 Feb 2021 18:28:21 +0000 (18:28 +0000)]
io_uring: simplify iopoll reissuing

Don't stash -EAGAIN'ed iopoll requests into a list to reissue them
later, do it eagerly. This removes the overhead of keeping and checking
that list, and in case of failure allows these requests to be completed
through the normal iopoll completion path.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Thu, 11 Feb 2021 18:28:20 +0000 (18:28 +0000)]
io_uring: clean up io_req_free_batch_finish()

io_req_free_batch_finish() is final and does not permit struct req_batch
to be reused without re-init. To be more consistent, don't clear ->task
there.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 11 Feb 2021 18:29:44 +0000 (11:29 -0700)]
Merge tag 'nvme-5.12-2021-02-11' of git://git.infradead.org/nvme into for-5.12/drivers

Pull NVMe updates from Christoph:

"nvme updates for 5.12:

 - fix multipath handling of ->queue_rq errors (Chao Leng)
 - nvmet cleanups (Chaitanya Kulkarni)
 - add a quirk for buggy Amazon controller (Filippo Sironi)
 - avoid devm allocations in nvme-hwmon that don't interact well with
   fabrics (Hannes Reinecke)
 - sysfs cleanups (Jiapeng Chong)
 - fix nr_zones for multipath (Keith Busch)
 - nvme-tcp crash fix for no-data commands (Sagi Grimberg)
 - nvmet-tcp fixes (Sagi Grimberg)
 - add a missing __rcu annotation (me)"

* tag 'nvme-5.12-2021-02-11' of git://git.infradead.org/nvme: (22 commits)
  nvme-tcp: fix crash triggered with a dataless request submission
  nvme: add 48-bit DMA address quirk for Amazon NVMe controllers
  nvme-hwmon: rework to avoid devm allocation
  nvmet: remove else at the end of the function
  nvmet: add nvmet_req_subsys() helper
  nvmet: use min of device_path and disk len
  nvmet: use invalid cmd opcode helper
  nvmet: use invalid cmd opcode helper
  nvmet: add helper to report invalid opcode
  nvmet: remove extra variable in id-ns handler
  nvmet: make nvmet_find_namespace() req based
  nvmet: return uniform error for invalid ns
  nvmet: set status to 0 in case for invalid nsid
  nvmet-fc: add a missing __rcu annotation to nvmet_fc_tgt_assoc.queues
  nvme-multipath: set nr_zones for zoned namespaces
  nvmet-tcp: fix potential race of tcp socket closing accept_work
  nvmet-tcp: fix receive data digest calculation for multiple h2cdata PDUs
  nvme-rdma: handle nvme_rdma_post_send failures better
  nvme-fabrics: avoid double completions in nvmf_fail_nonready_command
  nvme: introduce a nvme_host_path_error helper
  ...

Jens Axboe [Thu, 11 Feb 2021 17:48:03 +0000 (10:48 -0700)]
io_uring: move submit side state closer in the ring

We recently added the submit side req cache, but it was placed at the
end of the struct. Move it near the other submission state for better
memory placement, and reshuffle a few other members at the same time.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Jens Axboe [Thu, 11 Feb 2021 14:45:08 +0000 (07:45 -0700)]
io_uring: assign file_slot prior to calling io_sqe_file_register()

We use the assigned slot in io_sqe_file_register(), and a previous
patch moved the assignment to after we have called it. This isn't
super pretty, and will get cleaned up in the future. For now, fix
the regression by restoring the previous assignment/clear of the
file_slot.

Fixes: ea64ec02b31d ("io_uring: deduplicate file table slot calculation")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Sagi Grimberg [Wed, 10 Feb 2021 22:04:00 +0000 (14:04 -0800)]
nvme-tcp: fix crash triggered with a dataless request submission

write-zeroes has a bio, but does not have any data buffers associated
with it. Hence we should not initialize the request iter for it, as
that attempts to reference bi_io_vec and crashes.
--
 run blktests nvme/012 at 2021-02-05 21:53:34
 BUG: kernel NULL pointer dereference, address: 0000000000000008
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP NOPTI
 CPU: 15 PID: 12069 Comm: kworker/15:2H Tainted: G S        I       5.11.0-rc6+ #1
 Hardware name: Dell Inc. PowerEdge R640/06NR82, BIOS 2.10.0 11/12/2020
 Workqueue: kblockd blk_mq_run_work_fn
 RIP: 0010:nvme_tcp_init_iter+0x7d/0xd0 [nvme_tcp]
 RSP: 0018:ffffbd084447bd18 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffffa0bba9f3ce80 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000002000000
 RBP: ffffa0ba8ac6fec0 R08: 0000000002000000 R09: 0000000000000000
 R10: 0000000002800809 R11: 0000000000000000 R12: 0000000000000000
 R13: ffffa0bba9f3cf90 R14: 0000000000000000 R15: 0000000000000000
 FS:  0000000000000000(0000) GS:ffffa0c9ff9c0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000008 CR3: 00000001c9c6c005 CR4: 00000000007706e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 PKRU: 55555554
 Call Trace:
  nvme_tcp_queue_rq+0xef/0x330 [nvme_tcp]
  blk_mq_dispatch_rq_list+0x11c/0x7c0
  ? blk_mq_flush_busy_ctxs+0xf6/0x110
  __blk_mq_sched_dispatch_requests+0x12b/0x170
  blk_mq_sched_dispatch_requests+0x30/0x60
  __blk_mq_run_hw_queue+0x2b/0x60
  process_one_work+0x1cb/0x360
  ? process_one_work+0x360/0x360
  worker_thread+0x30/0x370
  ? process_one_work+0x360/0x360
  kthread+0x116/0x130
  ? kthread_park+0x80/0x80
  ret_from_fork+0x1f/0x30
--

Fixes: cb9b870fba3e ("nvme-tcp: fix wrong setting of request iov_iter")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Kees Cook [Wed, 10 Feb 2021 23:51:59 +0000 (15:51 -0800)]
block: Replace lkml.org links with lore

As started by commit 05a5f51ca566 ("Documentation: Replace lkml.org
links with lore"), replace lkml.org links with lore to better use a
single source that's more likely to stay available long-term.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Colin Ian King [Wed, 10 Feb 2021 20:00:07 +0000 (20:00 +0000)]
io_uring: remove redundant initialization of variable ret

The variable ret is being initialized with a value that is never read
and it is being updated later with a new value.  The initialization is
redundant and can be removed.
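
The class of cleanup, sketched generically (not the exact io_uring
code):

  /* Before: the initial value is never read before reassignment */
  int ret = 0;

  /* After */
  int ret;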

Addresses-Coverity: ("Unused value")
Fixes: b63534c41e20 ("io_uring: re-issue block requests that failed because of resources")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pavel Begunkov [Wed, 10 Feb 2021 11:45:42 +0000 (11:45 +0000)]
io_uring: unpark SQPOLL thread for cancelation

We park SQPOLL task before going into io_uring_cancel_files(), so the
task won't run task_works including those that might be important for
the cancellation passes. In this case it's io_poll_remove_one(), which
frees requests via io_put_req_deferred().

Unpark it while waiting; that's OK, as we disable submissions
beforehand, so no new requests will be generated.

INFO: task syz-executor893:8493 blocked for more than 143 seconds.
Call Trace:
 context_switch kernel/sched/core.c:4327 [inline]
 __schedule+0x90c/0x21a0 kernel/sched/core.c:5078
 schedule+0xcf/0x270 kernel/sched/core.c:5157
 io_uring_cancel_files fs/io_uring.c:8912 [inline]
 io_uring_cancel_task_requests+0xe70/0x11a0 fs/io_uring.c:8979
 __io_uring_files_cancel+0x110/0x1b0 fs/io_uring.c:9067
 io_uring_files_cancel include/linux/io_uring.h:51 [inline]
 do_exit+0x2fe/0x2ae0 kernel/exit.c:780
 do_group_exit+0x125/0x310 kernel/exit.c:922
 __do_sys_exit_group kernel/exit.c:933 [inline]
 __se_sys_exit_group kernel/exit.c:931 [inline]
 __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:931
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Cc: stable@vger.kernel.org # 5.5+
Reported-by: syzbot+695b03d82fa8e4901b06@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Liao Pingfang [Sat, 6 Feb 2021 07:10:55 +0000 (15:10 +0800)]
nbd: Convert to DEFINE_SHOW_ATTRIBUTE

Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.
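
The macro generates the single_open() wrapper and the file_operations
from a matching _show routine. A generic sketch (the names here are
illustrative, not the exact nbd code):

  static int nbd_tasks_show(struct seq_file *s, void *unused)
  {
          /* print the debugfs content into s */
          return 0;
  }
  DEFINE_SHOW_ATTRIBUTE(nbd_tasks);
  /* expands to nbd_tasks_open() and nbd_tasks_fops, replacing the
   * hand-written open function and struct file_operations */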

Signed-off-by: Liao Pingfang <winndows@163.com>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Filippo Sironi [Wed, 10 Feb 2021 00:39:42 +0000 (01:39 +0100)]
nvme: add 48-bit DMA address quirk for Amazon NVMe controllers

Some Amazon NVMe controllers do not follow the NVMe specification
and are limited to 48-bit DMA addresses.  Add a quirk to force
bounce buffering if needed and limit the IOVA allocation for these
devices.

This affects all current Amazon NVMe controllers that expose EBS
volumes (0x0061, 0x0065, 0x8061) and local instance storage
(0xcd00, 0xcd01, 0xcd02).
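
A hedged sketch of the two pieces such a quirk typically needs: a PCI
ID table entry carrying the quirk, and the DMA mask honoring it (0x1d0f
is Amazon's PCI vendor ID; the quirk name and placement are assumptions
based on the commit title):

  { PCI_DEVICE(0x1d0f, 0x0061),   /* Amazon EBS volume */
          .driver_data = NVME_QUIRK_DMA_ADDRESS_BITS_48, },

  /* ...and when setting up DMA for a quirked device: */
  dma_set_mask_and_coherent(dev, DMA_BIT_MASK(48));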

Signed-off-by: Filippo Sironi <sironi@amazon.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Hannes Reinecke [Tue, 19 Jan 2021 06:43:18 +0000 (07:43 +0100)]
nvme-hwmon: rework to avoid devm allocation

The original design of using device-managed resource allocation doesn't
really work, as the NVMe controller has a vastly different lifetime
than the hwmon sysfs attributes, causing warnings about duplicate sysfs
entries upon reconnection.
This patch reworks the hwmon allocation to avoid device-managed
resource allocation, and uses the NVMe controller as the parent for
the sysfs attributes.

Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Hannes Reinecke <hare@suse.de>
Tested-by: Enzo Matsumiya <ematsumiya@suse.de>
Tested-by: Daniel Wagner <dwagner@suse.de>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 10 Feb 2021 05:48:02 +0000 (21:48 -0800)]
nvmet: remove else at the end of the function

The function nvmet_parse_io_cmd() returns value from
nvmet_file_parse_io_cmd() or nvmet_bdev_parse_io_cmd() based on which
backend is set for the request. Remove the else and just return the
value from nvmet_bdev_parse_io_cmd().

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 10 Feb 2021 05:48:01 +0000 (21:48 -0800)]
nvmet: add nvmet_req_subsys() helper

Just like what we have to get the passthru ctrl from the req, add a
helper to get the subsystem associated with a nvmet_req instead of
open coding the chain of structure dereferences.
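
A sketch of such a helper; the dereference chain follows the
description above:

  static inline struct nvmet_subsys *nvmet_req_subsys(struct nvmet_req *req)
  {
          return req->sq->ctrl->subsys;
  }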

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 10 Feb 2021 05:47:59 +0000 (21:47 -0800)]
nvmet: use min of device_path and disk len

In __assign_req_name(), instead of using DISK_NAME_LEN directly as the
strncpy() bound, use the min of DISK_NAME_LEN and
strlen(req->ns->device_path).

This is needed to turn off the following warnings:

In file included from drivers/nvme/target/core.c:14:
In function ‘__assign_req_name’,
    inlined from ‘trace_event_raw_event_nvmet_req_init’ at drivers/nvme/target/./trace.h:58:1:
drivers/nvme/target/trace.h:52:3: warning: ‘strncpy’ specified bound 32 equals destination size [-Wstringop-truncation]
   strncpy(name, req->ns->device_path, DISK_NAME_LEN);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘__assign_req_name’,
    inlined from ‘perf_trace_nvmet_req_complete’ at drivers/nvme/target/./trace.h:100:1:
drivers/nvme/target/trace.h:52:3: warning: ‘strncpy’ specified bound 32 equals destination size [-Wstringop-truncation]
   strncpy(name, req->ns->device_path, DISK_NAME_LEN);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘__assign_req_name’,
    inlined from ‘perf_trace_nvmet_req_init’ at drivers/nvme/target/./trace.h:58:1:
drivers/nvme/target/trace.h:52:3: warning: ‘strncpy’ specified bound 32 equals destination size [-Wstringop-truncation]
   strncpy(name, req->ns->device_path, DISK_NAME_LEN);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In function ‘__assign_req_name’,
    inlined from ‘trace_event_raw_event_nvmet_req_complete’ at drivers/nvme/target/./trace.h:100:1:
drivers/nvme/target/trace.h:52:3: warning: ‘strncpy’ specified bound 32 equals destination size [-Wstringop-truncation]
   strncpy(name, req->ns->device_path, DISK_NAME_LEN);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
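
The fix then looks roughly like (a sketch, not the verbatim hunk):

  strncpy(name, req->ns->device_path,
          min_t(size_t, DISK_NAME_LEN, strlen(req->ns->device_path)));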

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 10 Feb 2021 05:47:58 +0000 (21:47 -0800)]
nvmet: use invalid cmd opcode helper

In the NVMeOF block device backend, file backend, and passthru backend
we reject and report the commands if the opcode is not handled.

Use the previously introduced helper in the passthru backend to make the
error message uniform.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 10 Feb 2021 05:47:57 +0000 (21:47 -0800)]
nvmet: use invalid cmd opcode helper

In the NVMeOF block device backend, file backend, and passthru backend
we reject and report the commands if the opcode is not handled.

Use the previously introduced helper in the file backend to reduce the
duplicate code and make the error message uniform.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 10 Feb 2021 05:47:56 +0000 (21:47 -0800)]
nvmet: add helper to report invalid opcode

In the NVMeOF block device backend, file backend, and passthru backend
we reject and report the commands if the opcode is not handled.

Add a helper and use it in the block device backend to keep the code
and error message uniform.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 10 Feb 2021 05:47:55 +0000 (21:47 -0800)]
nvmet: remove extra variable in id-ns handler

In nvmet_execute_identify_ns() the local variable ctrl is accessed only
in one place; remove it and use req->sq->ctrl directly.

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 10 Feb 2021 05:47:54 +0000 (21:47 -0800)]
nvmet: make nvmet_find_namespace() req based

The six callers of nvmet_find_namespace() duplicate the error log page
update and status setting code for each call on failure.

All callers are nvmet request based functions, so we can pass req to
nvmet_find_namespace() and derive ctrl from req, which allows us to
update the error log page in nvmet_find_namespace(). Now that we pass
the request, we can also get rid of the local variable in
nvmet_find_namespace(), use req->ns, and return the error code.

Replace the ctrl parameter with nvmet_req for nvmet_find_namespace(),
centralize the error log page update for non-allocated namespaces, and
return a uniform error for non-allocated namespaces.

nvmet_find_namespace() takes an nsid parameter which comes from NVMe
command structures such as get_log_page, identify, rw and common. All
these commands have the same offset for the nsid field.

Derive the nsid from req->cmd->common.nsid and remove the extra
parameter from nvmet_find_namespace().

Lastly, now that we associate the ns with the req parameter passed to
nvmet_find_namespace(), rename nvmet_find_namespace() to
nvmet_req_find_ns().

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 10 Feb 2021 05:47:53 +0000 (21:47 -0800)]
nvmet: return uniform error for invalid ns

For the nvmet_find_namespace() error case we have inconsistent error
code mapping in the functions nvmet_get_smart_log_nsid() and
nvmet_set_feat_write_protect().

There is no point in the host retrying for an invalid namespace. Set
the error code to NVME_SC_INVALID_NS | NVME_SC_DNR, which matches what
we have in nvmet_execute_identify_desclist().

Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chaitanya Kulkarni [Wed, 10 Feb 2021 05:47:52 +0000 (21:47 -0800)]
nvmet: set status to 0 in case for invalid nsid

For unallocated namespace in nvmet_execute_identify_ns() don't set the
status to NVME_SC_INVALID_NS, set it to zero.

Fixes: bffcd507780e ("nvmet: set right status on error in id-ns handler")
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Sun, 7 Feb 2021 16:17:34 +0000 (17:17 +0100)]
nvmet-fc: add a missing __rcu annotation to nvmet_fc_tgt_assoc.queues

Make sparse happy after the recent conversion to RCU lookups.

Fixes: 4e2f02bf77da ("nvmet-fc: use RCU proctection for assoc_list")
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Keith Busch [Fri, 5 Feb 2021 19:50:02 +0000 (11:50 -0800)]
nvme-multipath: set nr_zones for zoned namespaces

The bio based drivers only require the request_queue's nr_zones is set,
so set this field in the head if the namespace path is zoned.

Fixes: 240e6ee272c07 ("nvme: support for zoned namespaces")
Reported-by: Minwoo Im <minwoo.im.dev@gmail.com>
Cc: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Fri, 5 Feb 2021 19:47:25 +0000 (11:47 -0800)]
nvmet-tcp: fix potential race of tcp socket closing accept_work

When we accept a TCP connection and allocate an nvmet-tcp queue we should
make sure not to fully establish it or reference it as the connection may
be already closing, which triggers queue release work, which does not
fence against queue establishment.

In order to address such a race, we make sure to check the sk_state and
contain the queue reference to be done underneath the sk_callback_lock
such that the queue release work correctly fences against it.

Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver")
Reported-by: Elad Grupi <elad.grupi@dell.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Sagi Grimberg [Wed, 3 Feb 2021 23:00:01 +0000 (15:00 -0800)]
nvmet-tcp: fix receive data digest calculation for multiple h2cdata PDUs

When a host sends multiple h2cdata PDUs for a single command, we
should verify the data digest calculation per PDU and not
per command.

Fixes: 872d26a391da ("nvmet-tcp: add NVMe over TCP target driver")
Reported-by: Narayan Ayalasomayajula <Narayan.Ayalasomayajula@wdc.com>
Tested-by: Narayan Ayalasomayajula <Narayan.Ayalasomayajula@wdc.com>
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chao Leng [Mon, 1 Feb 2021 03:49:40 +0000 (11:49 +0800)]
nvme-rdma: handle nvme_rdma_post_send failures better

nvme_rdma_post_send failing is a path related error and should bounce
to another path when using nvme-multipath.  Call nvme_host_path_error
when nvme_rdma_post_send returns -EIO to ensure nvme_complete_rq gets
invoked to fail over to another path if there is one.

Signed-off-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chao Leng [Mon, 1 Feb 2021 03:49:39 +0000 (11:49 +0800)]
nvme-fabrics: avoid double completions in nvmf_fail_nonready_command

When reconnecting, the request may be completed with
NVME_SC_HOST_PATH_ERROR in nvmf_fail_nonready_command, which currently
sets the state of the request to MQ_RQ_IN_FLIGHT before calling
nvme_complete_rq.  When this happens for a request that is freed by
the caller, such as nvme_submit_user_cmd, in the worst case the request
could be completed again in the tear down process.

Instead of calling blk_mq_start_request from nvmf_fail_nonready_command,
just use the new nvme_host_path_error helper to complete the command
without starting it.

Signed-off-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chao Leng [Thu, 4 Feb 2021 07:55:11 +0000 (08:55 +0100)]
nvme: introduce a nvme_host_path_error helper

When using nvme native multipathing, if a path related error occurs
during ->queue_rq, the request needs to be completed with
NVME_SC_HOST_PATH_ERROR so that the request can be failed over.

Introduce a helper to complete the command from ->queue_rq in a way
that invokes nvme_complete_rq.

Signed-off-by: Chao Leng <lengchao@huawei.com>
[hch: renamed, added a return value to clean up the callers a bit]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Chao Leng [Mon, 1 Feb 2021 03:49:38 +0000 (11:49 +0800)]
blk-mq: introduce blk_mq_set_request_complete

NVMe drivers need to set the state of a request to MQ_RQ_COMPLETE when
completing the request directly in queue_rq.
So add blk_mq_set_request_complete().
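
A sketch of the helper, assuming it mirrors how blk-mq tracks request
state internally:

  static inline void blk_mq_set_request_complete(struct request *rq)
  {
          WRITE_ONCE(rq->state, MQ_RQ_COMPLETE);
  }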

Signed-off-by: Chao Leng <lengchao@huawei.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Jiapeng Chong [Tue, 2 Feb 2021 07:06:17 +0000 (15:06 +0800)]
nvme: convert sysfs sprintf/snprintf family to sysfs_emit

Fix the following coccicheck warning:

./drivers/nvme/host/core.c:3580:8-16: WARNING: use scnprintf or sprintf.
./drivers/nvme/host/core.c:3570:8-16: WARNING: use scnprintf or sprintf.
./drivers/nvme/host/core.c:3560:8-16: WARNING: use scnprintf or sprintf.
./drivers/nvme/host/core.c:3526:8-16: WARNING: use scnprintf or sprintf.
./drivers/nvme/host/core.c:2833:8-16: WARNING: use scnprintf or sprintf.
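
The shape of the conversion (attribute and field names illustrative):

  /* Before */
  return snprintf(buf, PAGE_SIZE, "%d\n", ctrl->instance);

  /* After: sysfs_emit() knows sysfs buffers are PAGE_SIZE */
  return sysfs_emit(buf, "%d\n", ctrl->instance);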

Reported-by: Abaci Robot<abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Christoph Hellwig [Tue, 9 Feb 2021 17:14:19 +0000 (18:14 +0100)]
mm: simplify swapdev_block

Open code the parts of map_swap_entry that were actually used by
swapdev_block, and remove the now unused map_swap_entry function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Joe Perches [Wed, 10 Feb 2021 05:07:28 +0000 (13:07 +0800)]
bcache: Avoid comma separated statements

Use semicolons and braces.
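
The kind of change, sketched (variable names illustrative):

  /* Before: comma operator chains two statements under one if */
  if (cond)
          stats_a = 0,
          stats_b = 0;

  /* After: braces and semicolons */
  if (cond) {
          stats_a = 0;
          stats_b = 0;
  }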

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kai Krakow [Wed, 10 Feb 2021 05:07:27 +0000 (13:07 +0800)]
bcache: Move journal work to new flush wq

This is potentially long running and not latency sensitive, let's get
it out of the way of other latency sensitive events.

As observed in the previous commit, the `system_wq` easily becomes
congested by bcache, and this fixes a few more stalls I was observing
every once in a while.

Let's not make this `WQ_MEM_RECLAIM`, as that was shown to reduce the
performance of boot and file system operations in my tests. Also,
without `WQ_MEM_RECLAIM`, I no longer see desktop stalls. This matches
the previous behavior, as `system_wq` also does no memory reclaim:

> // workqueue.c:
> system_wq = alloc_workqueue("events", 0, 0);

Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kai Krakow [Wed, 10 Feb 2021 05:07:26 +0000 (13:07 +0800)]
bcache: Give btree_io_wq correct semantics again

Before killing `btree_io_wq`, the queue was allocated using
`create_singlethread_workqueue()`, which has `WQ_MEM_RECLAIM`. After
killing it, the work moved to `system_wq`, which no longer has this
property but is also not single threaded.

Let's combine both worlds and make it multi threaded but able to
reclaim memory.

Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kai Krakow [Wed, 10 Feb 2021 05:07:25 +0000 (13:07 +0800)]
Revert "bcache: Kill btree_io_wq"

This reverts commit 56b30770b27d54d68ad51eccc6d888282b568cee.

With the btree using the `system_wq`, I seem to see a lot more desktop
latency than I should.

After some more investigation, it looks like the original assumption
of 56b3077 is no longer true, and bcache has a very high potential for
congesting the `system_wq`. In turn, this introduces laggy desktop
performance, IO stalls (at least with btrfs), and input events may be
delayed.

So let's revert this. It's important to note that the semantics of
using `system_wq` previously mean that `btree_io_wq` should be created
before and destroyed after other bcache wqs to keep the same
assumptions.

Cc: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org # 5.4+
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Kai Krakow [Wed, 10 Feb 2021 05:07:24 +0000 (13:07 +0800)]
bcache: Fix register_device_aync typo

Should be `register_device_async`.

Cc: Coly Li <colyli@suse.de>
Signed-off-by: Kai Krakow <kai@kaishome.de>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
dongdong tao [Wed, 10 Feb 2021 05:07:23 +0000 (13:07 +0800)]
bcache: consider the fragmentation when update the writeback rate

The current way to calculate the writeback rate only considers the
dirty sectors. This usually works fine when fragmentation is not high,
but it gives us an unreasonably small rate when very few dirty sectors
consume a lot of dirty buckets. In some cases the dirty buckets can
reach CUTOFF_WRITEBACK_SYNC while the dirty data (sectors) has not even
reached the writeback_percent; the writeback rate will still be the
minimum value (4k), causing all the writes to be stuck in a
non-writeback mode because of the slow writeback.

We accelerate the rate in 3 stages with different aggressiveness,
the first stage starts when dirty buckets percent reach above
BCH_WRITEBACK_FRAGMENT_THRESHOLD_LOW (50), the second is
BCH_WRITEBACK_FRAGMENT_THRESHOLD_MID (57), the third is
BCH_WRITEBACK_FRAGMENT_THRESHOLD_HIGH (64). By default
the first stage tries to writeback the amount of dirty data
in one bucket (on average) in (1 / (dirty_buckets_percent - 50)) second,
the second stage tries to writeback the amount of dirty data in one bucket
in (1 / (dirty_buckets_percent - 57)) * 100 millisecond, the third
stage tries to writeback the amount of dirty data in one bucket in
(1 / (dirty_buckets_percent - 64)) millisecond.
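
A heavily simplified sketch of the staged term selection; the function
name and exact boundaries are assumptions, and the three parameters are
described in the next paragraph:

  static int64_t fragment_term(struct cached_dev *dc,
                               int dirty_buckets_percent)
  {
          if (dirty_buckets_percent <= 57)        /* first stage: 50..57 */
                  return dc->writeback_rate_fp_term_low *
                         (dirty_buckets_percent - 50);
          if (dirty_buckets_percent <= 64)        /* second stage: 57..64 */
                  return dc->writeback_rate_fp_term_mid *
                         (dirty_buckets_percent - 57);
          return dc->writeback_rate_fp_term_high * /* third stage: > 64 */
                 (dirty_buckets_percent - 64);
  }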

The initial rate at each stage can be controlled by 3 configurable
parameters, writeback_rate_fp_term_{low|mid|high}; they default to
1, 10 and 1000. The IO throughput these values are trying to achieve
is described by the above paragraph. The reason I chose those values
as defaults is based on testing and production data; below are some
details:

A. When it comes to the low stage, we are still a fair way from the 70%
   threshold, so we only want to give it a little push by setting the
   term to 1. That means the initial rate will be 170 if the fragment
   is 6 (calculated as bucket_size/fragment); this rate is very small,
   but still much more reasonable than the minimum 8.
   For a production bcache with an unheavy workload, if the cache
   device is bigger than 1 TB, it may take hours to consume 1% of the
   buckets, so it is very possible to reclaim enough dirty buckets in
   this stage, thus avoiding entering the next stage.

B. If the dirty buckets ratio didn't turn around during the first
   stage, it comes to the mid stage; then it is necessary for the mid
   stage to be more aggressive than the low stage, so I chose the
   initial rate to be 10 times that of the low stage, which means 1700
   as the initial rate if the fragment is 6. This is a normal rate
   we usually see for a normal workload when writeback happens
   because of writeback_percent.

C. If the dirty buckets ratio didn't turn around during the low and mid
   stages, it comes to the third stage, and it is the last chance that
   we can turn around to avoid the horrible cutoff writeback sync
   issue; then we choose to be 100 times more aggressive than the mid
   stage, which means 170000 as the initial rate if the fragment is 6.
   This is also inferred from a production bcache: I've got one week's
   writeback rate data from a production bcache with quite heavy
   workloads; again, the writeback is triggered by the writeback
   percent, and the highest rate area is around 100000 to 240000, so I
   believe this kind of aggressiveness at this stage is reasonable for
   production.
   And it should mostly be enough, because the hint is trying to
   reclaim 1000 buckets per second, and in that heavy production env
   it was consuming 50 buckets per second on average over one week's
   data.

The option writeback_consider_fragment controls whether this feature
is on or off; it is on by default.

Lastly, below is the performance data for all the testing result,
including the data from production env:
https://docs.google.com/document/d/1AmbIEa_2MhB9bqhC3rfga9tp7n9YX9PLn0jSUxscVW0/edit?usp=sharing

Signed-off-by: dongdong tao <dongdong.tao@canonical.com>
Signed-off-by: Coly Li <colyli@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 28 Jan 2021 04:47:33 +0000 (13:47 +0900)]
sd_zbc: clear zone resources for non-zoned case

For host-aware ZBC disk, setting the device zoned model to BLK_ZONED_HA
using blk_queue_set_zoned() in sd_read_block_characteristics() may
result in the block device's effective zoned model being "none"
(BLK_ZONED_NONE) if partitions are present on the device. In this case,
sd_zbc_read_zones() should not setup the zone related queue limits for
the disk so that the device limits and configuration is consistent with
a regular disk and resources not uselessly allocated (e.g. the zone
write pointer tracking array for zone append emulation).

Furthermore, if the disk zoned model changes at run time due to the
creation of a partition by the user, the zone related resources can be
released.

Fix both problems by introducing the function sd_zbc_clear_zone_info()
to reset the scsi disk zone information and free resources and by
returning early in sd_zbc_read_zones() for a block device that has a
zoned model equal to BLK_ZONED_NONE.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 28 Jan 2021 04:47:32 +0000 (13:47 +0900)]
block: introduce blk_queue_clear_zone_settings()

Introduce the internal function blk_queue_clear_zone_settings() to
cleanup all limits and resources related to zoned block devices. This
new function is called from blk_queue_set_zoned() when a disk zoned
model is set to BLK_ZONED_NONE. This particular case can happen when a
partition is created on a host-aware scsi disk.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 28 Jan 2021 04:47:31 +0000 (13:47 +0900)]
zonefs: use zone write granularity as block size

Zoned block devices have different granularity constraints for write
operations into sequential zones. E.g. ZBC and ZAC devices require that
writes be aligned to the device physical block size while NVMe ZNS
devices allow logical block size aligned write operations. To correctly
handle such difference, use the device zone write granularity limit to
set the block size of a zonefs volume, thus allowing the smallest
possible write unit for all zoned device types.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 28 Jan 2021 04:47:30 +0000 (13:47 +0900)]
block: introduce zone_write_granularity limit

Per ZBC and ZAC specifications, host-managed SMR hard-disks mandate that
all writes into sequential write required zones be aligned to the device
physical block size. However, NVMe ZNS does not have this constraint and
allows write operations into sequential zones to be aligned to the
device logical block size. This inconsistency does not help with
software portability across device types.

To solve this, introduce the zone_write_granularity queue limit to
indicate the alignment constraint, in bytes, of write operations into
zones of a zoned block device. This new limit is exported as a
read-only sysfs queue attribute, and the helper
blk_queue_zone_write_granularity() is introduced for drivers to set
this limit.

The function blk_queue_set_zoned() is modified to set this new limit to
the device logical block size by default. NVMe ZNS devices as well as
zoned nullb devices use this default value as is. The scsi disk driver
is modified to execute the blk_queue_zone_write_granularity() helper to
set the zone write granularity of host-managed SMR disks to the disk
physical block size.

The accessor functions queue_zone_write_granularity() and
bdev_zone_write_granularity() are also introduced.
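
For instance, the sd change described above boils down to something
like this sketch (the sdkp field name is an assumption):

  /* host-managed SMR: writes must align to the physical block size */
  blk_queue_zone_write_granularity(q, sdkp->physical_block_size);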

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 28 Jan 2021 04:47:29 +0000 (13:47 +0900)]
block: use blk_queue_set_zoned in add_partition()

When changing the zoned model of host-aware zoned block devices, use
blk_queue_set_zoned() instead of directly assigning the gendisk queue
zoned limit.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 28 Jan 2021 04:47:28 +0000 (13:47 +0900)]
nullb: use blk_queue_set_zoned() to setup zoned devices

Use blk_queue_set_zoned() to set a nullb device zone model instead of
directly assigning the device queue zoned limit. This initialization of
the device zoned model, as well as the setup of the queue flag
QUEUE_FLAG_ZONE_RESETALL and of the device queue elevator feature, is
moved from null_init_zoned_dev() to null_register_zoned_dev() so that
the initialization of the queue limits is done when the gendisk of the
nullb device is available.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Damien Le Moal [Thu, 28 Jan 2021 04:47:27 +0000 (13:47 +0900)]
nvme: cleanup zone information initialization

For a zoned namespace, in nvme_update_ns_info(), call
nvme_update_zone_info() after executing nvme_update_disk_info() so that
the namespace queue logical and physical block size limits are set.
This allows setting the namespace queue max_zone_append_sectors limit
in nvme_update_zone_info() instead of nvme_revalidate_zones(),
simplifying the latter. Also use blk_queue_set_zoned() to set the
namespace zoned model.
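
The resulting call ordering, for illustration (a sketch in comment
form; argument lists elided):

    /*
     * nvme_update_ns_info():
     *   nvme_update_disk_info()  -> sets the logical/physical block
     *                               size queue limits
     *   nvme_update_zone_info()  -> can now set max_zone_append_sectors
     *                               directly, instead of deferring it
     *                               to nvme_revalidate_zones()
     */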

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoblock: document zone_append_max_bytes attribute
Damien Le Moal [Thu, 28 Jan 2021 04:47:26 +0000 (13:47 +0900)]
block: document zone_append_max_bytes attribute

The description of the zone_append_max_bytes sysfs queue attribute is
missing from Documentation/block/queue-sysfs.rst. Add it.
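
The added entry reads along these lines (a paraphrase, not the verbatim
patch text):

    zone_append_max_bytes (RO)
    --------------------------
    This is the maximum number of bytes that can be written to a
    sequential zone of a zoned block device using a zone append write
    operation (REQ_OP_ZONE_APPEND). This value is always 0 for regular
    block devices.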

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: place ring SQ/CQ arrays under memcg memory limits
Jens Axboe [Wed, 10 Feb 2021 03:14:12 +0000 (20:14 -0700)]
io_uring: place ring SQ/CQ arrays under memcg memory limits

Instead of imposing rlimit memlock limits for the rings themselves,
ensure that we account them properly under memcg with __GFP_ACCOUNT.
We retain rlimit memlock for registered buffers; this change is just
for the ring arrays themselves.
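
The gist, as a sketch (the real allocator in io_uring.c may carry a
few more gfp flags):

    /* ring SQ/CQ arrays get __GFP_ACCOUNT, so the pages are charged
     * to the allocating task's memory cgroup */
    static void *io_mem_alloc(size_t size)
    {
            gfp_t gfp = GFP_KERNEL | __GFP_ZERO | __GFP_NOWARN |
                        __GFP_COMP | __GFP_ACCOUNT;

            return (void *) __get_free_pages(gfp, get_order(size));
    }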

Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: enable kmemcg account for io_uring requests
Jens Axboe [Tue, 9 Feb 2021 20:48:50 +0000 (13:48 -0700)]
io_uring: enable kmemcg account for io_uring requests

This puts io_uring request allocations under memory cgroup accounting
and limits.
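
In practice this amounts to creating the request slab with
SLAB_ACCOUNT (a sketch; struct io_kiocb and req_cachep are io_uring
internals):

    req_cachep = KMEM_CACHE(io_kiocb, SLAB_HWCACHE_ALIGN | SLAB_PANIC |
                                      SLAB_ACCOUNT);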

Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: enable req cache for IRQ driven IO
Jens Axboe [Wed, 10 Feb 2021 02:53:37 +0000 (19:53 -0700)]
io_uring: enable req cache for IRQ driven IO

This is the last class of requests that cannot utilize the req alloc
cache. Add a per-ctx req cache that is protected by the completion_lock,
and refill our submit side cache when it gets over our batch count.
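
A sketch of the completion-side half (type and field names follow the
io_uring internals, simplified):

    static void example_stash_req(struct io_ring_ctx *ctx,
                                  struct io_kiocb *req)
    {
            struct io_comp_state *cs = &ctx->submit_state.comp;

            /* IRQ completion context can't touch the lockless submit
             * cache, so park the req under the completion_lock */
            spin_lock(&ctx->completion_lock);
            list_add(&req->compl.list, &cs->locked_free_list);
            cs->locked_free_nr++;
            spin_unlock(&ctx->completion_lock);
    }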

Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: fix possible deadlock in io_uring_poll
Hao Xu [Fri, 5 Feb 2021 08:34:21 +0000 (16:34 +0800)]
io_uring: fix possible deadlock in io_uring_poll

Abaci reported the following issue:

[   30.615891] ======================================================
[   30.616648] WARNING: possible circular locking dependency detected
[   30.617423] 5.11.0-rc3-next-20210115 #1 Not tainted
[   30.618035] ------------------------------------------------------
[   30.618914] a.out/1128 is trying to acquire lock:
[   30.619520] ffff88810b063868 (&ep->mtx){+.+.}-{3:3}, at: __ep_eventpoll_poll+0x9f/0x220
[   30.620505]
[   30.620505] but task is already holding lock:
[   30.621218] ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[   30.622349]
[   30.622349] which lock already depends on the new lock.
[   30.622349]
[   30.623289]
[   30.623289] the existing dependency chain (in reverse order) is:
[   30.624243]
[   30.624243] -> #1 (&ctx->uring_lock){+.+.}-{3:3}:
[   30.625263]        lock_acquire+0x2c7/0x390
[   30.625868]        __mutex_lock+0xae/0x9f0
[   30.626451]        io_cqring_overflow_flush.part.95+0x6d/0x70
[   30.627278]        io_uring_poll+0xcb/0xd0
[   30.627890]        ep_item_poll.isra.14+0x4e/0x90
[   30.628531]        do_epoll_ctl+0xb7e/0x1120
[   30.629122]        __x64_sys_epoll_ctl+0x70/0xb0
[   30.629770]        do_syscall_64+0x2d/0x40
[   30.630332]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.631187]
[   30.631187] -> #0 (&ep->mtx){+.+.}-{3:3}:
[   30.631985]        check_prevs_add+0x226/0xb00
[   30.632584]        __lock_acquire+0x1237/0x13a0
[   30.633207]        lock_acquire+0x2c7/0x390
[   30.633740]        __mutex_lock+0xae/0x9f0
[   30.634258]        __ep_eventpoll_poll+0x9f/0x220
[   30.634879]        __io_arm_poll_handler+0xbf/0x220
[   30.635462]        io_issue_sqe+0xa6b/0x13e0
[   30.635982]        __io_queue_sqe+0x10b/0x550
[   30.636648]        io_queue_sqe+0x235/0x470
[   30.637281]        io_submit_sqes+0xcce/0xf10
[   30.637839]        __x64_sys_io_uring_enter+0x3fb/0x5b0
[   30.638465]        do_syscall_64+0x2d/0x40
[   30.638999]        entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.639643]
[   30.639643] other info that might help us debug this:
[   30.639643]
[   30.640618]  Possible unsafe locking scenario:
[   30.640618]
[   30.641402]        CPU0                    CPU1
[   30.641938]        ----                    ----
[   30.642664]   lock(&ctx->uring_lock);
[   30.643425]                                lock(&ep->mtx);
[   30.644498]                                lock(&ctx->uring_lock);
[   30.645668]   lock(&ep->mtx);
[   30.646321]
[   30.646321]  *** DEADLOCK ***
[   30.646321]
[   30.647642] 1 lock held by a.out/1128:
[   30.648424]  #0: ffff88810e952be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0
[   30.649954]
[   30.649954] stack backtrace:
[   30.650592] CPU: 1 PID: 1128 Comm: a.out Not tainted 5.11.0-rc3-next-20210115 #1
[   30.651554] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[   30.652290] Call Trace:
[   30.652688]  dump_stack+0xac/0xe3
[   30.653164]  check_noncircular+0x11e/0x130
[   30.653747]  ? check_prevs_add+0x226/0xb00
[   30.654303]  check_prevs_add+0x226/0xb00
[   30.654845]  ? add_lock_to_list.constprop.49+0xac/0x1d0
[   30.655564]  __lock_acquire+0x1237/0x13a0
[   30.656262]  lock_acquire+0x2c7/0x390
[   30.656788]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.657379]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.658014]  __mutex_lock+0xae/0x9f0
[   30.658524]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.659112]  ? mark_held_locks+0x5a/0x80
[   30.659648]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.660229]  ? _raw_spin_unlock_irqrestore+0x2d/0x40
[   30.660885]  ? trace_hardirqs_on+0x46/0x110
[   30.661471]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.662102]  ? __ep_eventpoll_poll+0x9f/0x220
[   30.662696]  __ep_eventpoll_poll+0x9f/0x220
[   30.663273]  ? __ep_eventpoll_poll+0x220/0x220
[   30.663875]  __io_arm_poll_handler+0xbf/0x220
[   30.664463]  io_issue_sqe+0xa6b/0x13e0
[   30.664984]  ? __lock_acquire+0x782/0x13a0
[   30.665544]  ? __io_queue_proc.isra.88+0x180/0x180
[   30.666170]  ? __io_queue_sqe+0x10b/0x550
[   30.666725]  __io_queue_sqe+0x10b/0x550
[   30.667252]  ? __fget_files+0x131/0x260
[   30.667791]  ? io_req_prep+0xd8/0x1090
[   30.668316]  ? io_queue_sqe+0x235/0x470
[   30.668868]  io_queue_sqe+0x235/0x470
[   30.669398]  io_submit_sqes+0xcce/0xf10
[   30.669931]  ? xa_load+0xe4/0x1c0
[   30.670425]  __x64_sys_io_uring_enter+0x3fb/0x5b0
[   30.671051]  ? lockdep_hardirqs_on_prepare+0xde/0x180
[   30.671719]  ? syscall_enter_from_user_mode+0x2b/0x80
[   30.672380]  do_syscall_64+0x2d/0x40
[   30.672901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   30.673503] RIP: 0033:0x7fd89c813239
[   30.673962] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05  3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
[   30.675920] RSP: 002b:00007ffc65a7c628 EFLAGS: 00000217 ORIG_RAX: 00000000000001aa
[   30.676791] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fd89c813239
[   30.677594] RDX: 0000000000000000 RSI: 0000000000000014 RDI: 0000000000000003
[   30.678678] RBP: 00007ffc65a7c720 R08: 0000000000000000 R09: 0000000003000000
[   30.679492] R10: 0000000000000000 R11: 0000000000000217 R12: 0000000000400ff0
[   30.680282] R13: 00007ffc65a7c840 R14: 0000000000000000 R15: 0000000000000000

This might happen if we do epoll_wait on an io_uring fd while
reading/writing the epoll fd via an sqe in that same io_uring instance.
So let's not flush the cqring overflow list from io_uring_poll(); just
do a simple check instead.
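
The check amounts to something like this in io_uring_poll() (a sketch
along the lines of the actual change; io_cqring_events() is an
io_uring internal helper):

    /* don't take uring_lock to flush overflowed CQEs here; just
     * report readability if CQEs or a pending overflow exist */
    if (io_cqring_events(ctx) || test_bit(0, &ctx->cq_check_overflow))
            mask |= EPOLLIN | EPOLLRDNORM;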

Reported-by: Abaci <abaci@linux.alibaba.com>
Fixes: 6c503150ae33 ("io_uring: patch up IOPOLL overflow_flush sync")
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: defer flushing cached reqs
Pavel Begunkov [Wed, 10 Feb 2021 00:03:23 +0000 (00:03 +0000)]
io_uring: defer flushing cached reqs

While there are requests in the allocation cache, use them; only when
those are exhausted go for the memory stashed in comp.free_list. As
list manipulations are generally heavy and not cache friendly, flush
them all, or as many as we can, in one go.
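
The shape of the flush helper, for illustration (simplified; the
[axboe] note below refers to its boolean return):

    static bool io_flush_cached_reqs(struct io_ring_ctx *ctx)
    {
            struct io_submit_state *state = &ctx->submit_state;
            struct io_comp_state *cs = &state->comp;
            struct io_kiocb *req = NULL;

            /* grab the locked stash in one go... */
            spin_lock_irq(&ctx->completion_lock);
            list_splice_init(&cs->locked_free_list, &cs->free_list);
            cs->locked_free_nr = 0;
            spin_unlock_irq(&ctx->completion_lock);

            /* ...then refill the submit-side array from it */
            while (!list_empty(&cs->free_list)) {
                    req = list_first_entry(&cs->free_list,
                                           struct io_kiocb, compl.list);
                    list_del(&req->compl.list);
                    state->reqs[state->free_reqs++] = req;
                    if (state->free_reqs == ARRAY_SIZE(state->reqs))
                            break;
            }
            return req != NULL;
    }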

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: return success/failure from io_flush_cached_reqs()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: take comp_state from ctx
Pavel Begunkov [Wed, 10 Feb 2021 00:03:22 +0000 (00:03 +0000)]
io_uring: take comp_state from ctx

__io_queue_sqe() is always called with a non-NULL comp_state, which is
taken directly from the context. Don't pass it around; infer it from
ctx instead.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: enable req cache for task_work items
Jens Axboe [Wed, 10 Feb 2021 00:03:21 +0000 (00:03 +0000)]
io_uring: enable req cache for task_work items

task_work is run without utilizing the req alloc cache, so any deferred
items don't get to take advantage of either the alloc or free side of it.
With task_work now being wrapped by io_uring, we can use the ctx
completion state to take advantage of both the req cache and the
completion flush batching.

With this, the only request type that cannot take advantage of the req
cache is IRQ driven IO for regular files / block devices. Anything
else, including IOPOLL polled IO to those same types, will take
advantage of it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: provide FIFO ordering for task_work
Jens Axboe [Wed, 10 Feb 2021 00:03:20 +0000 (00:03 +0000)]
io_uring: provide FIFO ordering for task_work

task_work is a LIFO list, due to how it's implemented as a lockless
list. For long chains of task_work, this can be problematic, as the
first entry added is the last one processed. We'd also waste a lot of
CPU cycles reversing the list if we wanted to fix the ordering that
way.

Wrap the task_work so we have a single task_work entry per task per
ctx, and use that to run it in the right order.
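
A sketch of the scheme (names loosely follow the io_uring internals;
simplified, no error paths):

    static int example_task_work_add(struct io_uring_task *tctx,
                                     struct io_wq_work_node *node,
                                     struct task_struct *tsk)
    {
            unsigned long flags;

            /* FIFO: append to the per-task list... */
            spin_lock_irqsave(&tctx->task_lock, flags);
            wq_list_add_tail(node, &tctx->task_list);
            spin_unlock_irqrestore(&tctx->task_lock, flags);

            /* ...and queue the single draining callback only if one
             * isn't already pending for this task */
            if (!test_and_set_bit(0, &tctx->task_state))
                    return task_work_add(tsk, &tctx->task_work,
                                         TWA_SIGNAL);
            return 0;
    }

The drain side then walks the per-task list front to back, preserving
submission order.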

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: use persistent request cache
Jens Axboe [Wed, 10 Feb 2021 00:03:19 +0000 (00:03 +0000)]
io_uring: use persistent request cache

Now that we have the submit_state in the ring itself, we can have io_kiocb
allocations that are persistent across invocations. This reduces the time
spent doing slab allocations and frees.

[sil: rebased]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: feed reqs back into alloc cache
Pavel Begunkov [Wed, 10 Feb 2021 00:03:18 +0000 (00:03 +0000)]
io_uring: feed reqs back into alloc cache

Make io_req_free_batch(), which is used for inline-executed requests
and IOPOLL, return requests back into the allocation cache, avoiding
most kmalloc()/kfree() calls for those cases.
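
Roughly (a sketch; example_req_recycle() is illustrative, req_cachep
is the io_uring request slab):

    static void example_req_recycle(struct io_submit_state *state,
                                    struct io_kiocb *req)
    {
            /* prefer recycling into the alloc cache over freeing */
            if (state->free_reqs < ARRAY_SIZE(state->reqs))
                    state->reqs[state->free_reqs++] = req;
            else
                    kmem_cache_free(req_cachep, req);
    }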

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: persistent req cache
Pavel Begunkov [Wed, 10 Feb 2021 00:03:17 +0000 (00:03 +0000)]
io_uring: persistent req cache

Don't free batch-allocated requests across syscalls.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: count ctx refs separately from reqs
Pavel Begunkov [Wed, 10 Feb 2021 00:03:16 +0000 (00:03 +0000)]
io_uring: count ctx refs separately from reqs

Currently, batch free handles request memory freeing and ctx ref
putting together. Separate them and use different counters; that will
be needed for reusing request memory.
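
The shape of it (a sketch; struct and function names illustrative):

    struct req_batch_sketch {
            int to_free;    /* reqs to return to allocator or cache */
            int ctx_refs;   /* ctx refs, counted separately */
    };

    static void example_batch_finish(struct io_ring_ctx *ctx,
                                     struct req_batch_sketch *rb)
    {
            /* drop all accumulated ctx references in one call, no
             * matter what happened to the req memory itself */
            if (rb->ctx_refs)
                    percpu_ref_put_many(&ctx->refs, rb->ctx_refs);
    }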

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: remove fallback_req
Pavel Begunkov [Wed, 10 Feb 2021 00:03:15 +0000 (00:03 +0000)]
io_uring: remove fallback_req

Remove fallback_req for now, it gets in the way of other changes.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: submit-completion free batching
Pavel Begunkov [Wed, 10 Feb 2021 00:03:14 +0000 (00:03 +0000)]
io_uring: submit-completion free batching

io_submit_flush_completions() does completion batching, but may also use
free batching as iopoll does. The main beneficiaries should be buffered
reads/writes and send/recv.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: replace list with array for compl batch
Pavel Begunkov [Wed, 10 Feb 2021 00:03:13 +0000 (00:03 +0000)]
io_uring: replace list with array for compl batch

Reincarnation of an old patch that replaces a list in struct
io_compl_batch with an array. It's needed to avoid hooking requests via
their compl.list, because it won't always be available in the future.

It also lets us split io_submit_flush_completions() to avoid freeing
under locks, and removes an unlock/lock pair along with the long
comment describing when that dance was safe.
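
A sketch of the array-based batch (sizes and names illustrative):

    #define IO_COMPL_BATCH  32

    struct io_comp_batch_sketch {
            /* no list linkage through the reqs themselves needed */
            struct io_kiocb *reqs[IO_COMPL_BATCH];
            unsigned int    nr;
    };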

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: don't reinit submit state every time
Pavel Begunkov [Wed, 10 Feb 2021 00:03:12 +0000 (00:03 +0000)]
io_uring: don't reinit submit state every time

As submit_state is now retained across syscalls, we can save ourselves
from initialising it from the ground up for each io_submit_sqes() call.
Set some fields during ctx allocation, and just keep them always
consistent.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: remove unnecessary zeroing of ctx members]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: remove ctx from comp_state
Pavel Begunkov [Wed, 10 Feb 2021 00:03:11 +0000 (00:03 +0000)]
io_uring: remove ctx from comp_state

Completion state is closely bound to the ctx; we don't need to store
ctx inside it, as we always have it around to pass to flush.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
4 years agoio_uring: don't keep submit_state on stack
Pavel Begunkov [Wed, 10 Feb 2021 00:03:10 +0000 (00:03 +0000)]
io_uring: don't keep submit_state on stack

struct io_submit_state is quite big (168 bytes) and going to grow.
It's better not to keep it on the stack as it is now. Move it into the
context; it's always protected by uring_lock, so it's fine to have only
one instance of it.
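
A sketch of the end state (illustrative; only the relevant members
shown):

    struct io_ring_ctx_sketch {
            /* ... other ring state ... */

            /* protected by uring_lock, like the rest of the
             * submission path; one instance per ring */
            struct io_submit_state  submit_state;
            struct mutex            uring_lock;
    };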

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>