fs: add read support for RWF_UNCACHED
If RWF_UNCACHED is set for io_uring (or preadv2(2)), we'll use private
pages for the buffered reads. These pages will never be inserted into
the page cache, and they are simply dropped once the copy at the end of
IO is done.
If pages in the read range are already in the page cache, then copy the
data from those instead of starting IO on private pages.
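For illustration only, a minimal userspace sketch of how a caller might request an uncached read through preadv2(2). The numeric value of RWF_UNCACHED below is a placeholder assumption and must come from the uapi headers of a kernel carrying this patch; the helper name read_uncached() and its fallback to pread(2) when the flag is rejected are likewise hypothetical and not part of this patch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * ASSUMPTION: placeholder value, not taken from this patch. Use the
 * definition from the patched kernel's uapi headers when available.
 */
#ifndef RWF_UNCACHED
#define RWF_UNCACHED 0x00000040
#endif

/*
 * Hypothetical helper: read up to len bytes at offset off, asking the
 * kernel for an uncached buffered read. If the running kernel or
 * filesystem rejects the flag, fall back to a regular cached pread().
 */
static ssize_t read_uncached(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret = preadv2(fd, &iov, 1, off, RWF_UNCACHED);

	if (ret < 0 && (errno == EOPNOTSUPP || errno == EINVAL ||
			errno == ENOSYS))
		ret = pread(fd, buf, len, off);
	return ret;
}
```

The caller sees ordinary buffered-read semantics either way; the flag only changes whether the pages backing the read stay in the page cache afterwards.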
A previous solution used the page cache even for non-cached ranges, but
the cost of doing so was too high. Removing nodes at the end is
expensive, even with LRU bypass. On top of that, repeatedly
instantiating new xarray nodes is very costly, as it needs to memset 576
bytes of data, and freeing said nodes involves an RCU call per node as
well. All that adds up, making uncached somewhat slower than O_DIRECT.
With the current solution, we're basically at O_DIRECT levels of
performance for RWF_UNCACHED IO.
Protect against truncate the same way O_DIRECT does, by calling
inode_dio_begin() to elevate the inode->i_dio_count.
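As a rough illustration of the counting scheme, here is a userspace toy model and not the kernel code. All toy_* names are hypothetical. In the kernel, inode_dio_begin() increments inode->i_dio_count, inode_dio_end() decrements it, and the truncate path waits in inode_dio_wait() until the count reaches zero; the sketch below returns a boolean instead of sleeping.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy stand-in for the part of struct inode that matters here. */
struct toy_inode {
	atomic_int dio_count;	/* models inode->i_dio_count */
};

static void toy_inode_init(struct toy_inode *inode)
{
	atomic_init(&inode->dio_count, 0);
}

/* Models inode_dio_begin(): mark one read as in flight. */
static void toy_dio_begin(struct toy_inode *inode)
{
	atomic_fetch_add(&inode->dio_count, 1);
}

/* Models inode_dio_end(): the read has finished its copy. */
static void toy_dio_end(struct toy_inode *inode)
{
	atomic_fetch_sub(&inode->dio_count, 1);
}

/*
 * Models the condition inode_dio_wait() sleeps on: truncate may only
 * proceed once no reads are in flight.
 */
static bool toy_truncate_may_proceed(struct toy_inode *inode)
{
	return atomic_load(&inode->dio_count) == 0;
}
```

An uncached read would bracket its IO with the begin and end calls, so a concurrent truncate cannot free blocks that the private pages are still being filled from.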
Signed-off-by: Jens Axboe <axboe@kernel.dk>