io_uring: switch to per-cpu task_work

We see contention on the task_work locking and list management for
networked workloads, where it's not uncommon to have task_work arriving
from multiple CPUs in the system.

The task_work is ultimately run by the original task, but to save on
the overhead of repeatedly adding it to that task (which is an
expensive cmpxchg), it's batched on a per-tctx task_list which belongs
to the original submitter. With many networked requests inflight, there
can be a lot of concurrent additions to that one structure.

Move from a single per-tctx target list to a per-cpu one instead. This
allows multiple completers to add task_work without having to
synchronize on the same lock and list.
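
As a rough userspace illustration of the idea (not the kernel code in
this patch), the sketch below gives each CPU slot its own lock and list,
so adders on different slots never touch the same lock. All names here
(task_ctx, work_slot, ctx_add_work, ctx_run_work, NR_SLOTS) are
hypothetical, and a pthread mutex stands in for whatever locking the
kernel side actually uses.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_SLOTS 4	/* assumption: stand-in for the number of CPUs */

struct work_item {
	struct work_item *next;
};

/* One lock and one list per slot, instead of a single shared pair. */
struct work_slot {
	pthread_mutex_t lock;
	struct work_item *head;
};

struct task_ctx {
	struct work_slot slots[NR_SLOTS];
};

static void ctx_init(struct task_ctx *ctx)
{
	for (int i = 0; i < NR_SLOTS; i++) {
		pthread_mutex_init(&ctx->slots[i].lock, NULL);
		ctx->slots[i].head = NULL;
	}
}

/* Completion side: queue work on the slot of the "CPU" we run on. */
static void ctx_add_work(struct task_ctx *ctx, unsigned int cpu,
			 struct work_item *item)
{
	struct work_slot *slot = &ctx->slots[cpu % NR_SLOTS];

	assert(item != NULL);
	pthread_mutex_lock(&slot->lock);
	item->next = slot->head;
	slot->head = item;
	pthread_mutex_unlock(&slot->lock);
}

/* Submitter side: detach each slot's list in turn and walk it. */
static int ctx_run_work(struct task_ctx *ctx)
{
	int handled = 0;

	for (int i = 0; i < NR_SLOTS; i++) {
		struct work_slot *slot = &ctx->slots[i];
		struct work_item *list;

		/* Hold the lock only long enough to take the whole list. */
		pthread_mutex_lock(&slot->lock);
		list = slot->head;
		slot->head = NULL;
		pthread_mutex_unlock(&slot->lock);

		while (list) {
			struct work_item *next = list->next;

			handled++;	/* a real handler would run here */
			list = next;
		}
	}
	return handled;
}
```

In this sketch the cost moves to the runner, which now has to visit
every slot, while each adder only ever contends with other adders on
the same slot.
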

Signed-off-by: Jens Axboe <axboe@kernel.dk>