io_uring/msg_ring: avoid double indirection task_work for data messages

If IORING_SETUP_DEFER_TASKRUN is set, then we can't post CQEs remotely
to the target ring. Instead, task_work is queued for the target ring,
which is used to post the CQE. To make matters worse, once the target
CQE has been posted, task_work is then queued back to the originator to
fill its completion.

This obviously adds a bunch of overhead and latency. Instead of relying
on generic kernel task_work for this, fill an overflow entry on the
target ring and flag it so that the target ring will flush it. This
avoids the task_work for posting the CQE, and it means that the
originator CQE can be filled inline as well.

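As a rough illustration of the flow described above, here is a
single-threaded userspace toy model. It is not the kernel code: every
name in it (toy_ring, toy_msg_send, toy_flush, needs_flush) is
hypothetical, and a real implementation would need locking around the
overflow list, which this sketch omits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_CQ_ENTRIES 8

/* Hypothetical stand-ins; none of these are real kernel structures. */
struct toy_cqe {
	uint64_t user_data;
	int32_t res;
};

struct toy_overflow {
	struct toy_overflow *next;
	struct toy_cqe cqe;
};

struct toy_ring {
	struct toy_cqe cq[TOY_CQ_ENTRIES];
	unsigned int cq_tail;
	struct toy_overflow *ovf_head;
	struct toy_overflow *ovf_tail;
	bool needs_flush;	/* set by a sender, cleared by the ring owner */
};

/* Post a CQE directly to a ring; fails if the toy CQ is full. */
static bool toy_post_cqe(struct toy_ring *ring, struct toy_cqe cqe)
{
	if (ring->cq_tail >= TOY_CQ_ENTRIES)
		return false;
	ring->cq[ring->cq_tail++] = cqe;
	return true;
}

/*
 * Sender side: append an overflow entry to the target, flag the target
 * for flushing, then fill the originator's own CQE inline.
 */
static int toy_msg_send(struct toy_ring *src, struct toy_ring *dst,
			uint64_t src_user_data, struct toy_cqe msg)
{
	struct toy_cqe done = { .user_data = src_user_data, .res = 0 };
	struct toy_overflow *e;

	if (src->cq_tail >= TOY_CQ_ENTRIES)
		return -1;
	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->next = NULL;
	e->cqe = msg;
	if (dst->ovf_tail)
		dst->ovf_tail->next = e;
	else
		dst->ovf_head = e;
	dst->ovf_tail = e;
	dst->needs_flush = true;

	toy_post_cqe(src, done);
	return 0;
}

/* Target side: move queued overflow entries into the target's own CQ. */
static unsigned int toy_flush(struct toy_ring *ring)
{
	unsigned int flushed = 0;

	while (ring->ovf_head) {
		struct toy_overflow *e = ring->ovf_head;

		if (!toy_post_cqe(ring, e->cqe))
			break;
		ring->ovf_head = e->next;
		free(e);
		flushed++;
	}
	if (!ring->ovf_head) {
		ring->ovf_tail = NULL;
		ring->needs_flush = false;
	}
	return flushed;
}
```

In this model the sender never waits on the target: its own completion
is visible as soon as toy_msg_send() returns, and the target picks up
the message the next time it calls toy_flush().
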
In local testing, this reduces the latency on the sender side by 5-6x.

Signed-off-by: Jens Axboe <axboe@kernel.dk>