blk-mq: private O_DIRECT implementation

Only enabled for single-vec blk-mq devices for now, and only for
synchronous O_DIRECT (not libaio). Much quicker than the slob
that is fs/direct-io.c.

With polling, I can get 5-6 usec latencies with 4KB IO.

This patch also enables a hybrid polling mode. Instead of polling
immediately after IO submission, we can induce an artificial delay
and only start polling once it has elapsed. For example, if the IO
is expected to complete 8 usecs from now, we can sleep for 4 usecs,
wake up, and then do our polling. This still puts a sleep/wakeup
cycle in the IO path, but instead of the wakeup happening after the
IO has completed, it'll happen before. With this hybrid scheme, we
can achieve big latency reductions while using the same amount of
CPU, or less.

Testing done on a Radian Memory Systems RMS-200/4G DRAM-backed
NVMe device.

Signed-off-by: Jens Axboe <axboe@fb.com>