| 1 | .. SPDX-License-Identifier: GPL-2.0 |
| 2 | |
| 3 | =========================================== |
| 4 | Userspace block device driver (ublk driver) |
| 5 | =========================================== |
| 6 | |
| 7 | Overview |
| 8 | ======== |
| 9 | |
| 10 | ublk is a generic framework for implementing block device logic from userspace. |
| 11 | The motivation is that moving virtual block drivers such as loop, nbd and |
| 12 | similar into userspace can be very helpful. It also makes it easier to implement |
| 13 | new virtual block devices such as ublk-qcow2 (there have been several attempts |
| 14 | at implementing a qcow2 driver in the kernel). |
| 15 | |
| 16 | Userspace block devices are attractive because: |
| 17 | |
| 18 | - They can be written in many programming languages. |
| 19 | - They can use libraries that are not available in the kernel. |
| 20 | - They can be debugged with tools familiar to application developers. |
| 21 | - Their crashes do not cause a kernel panic. |
| 22 | - Bugs are likely to have a lower security impact than bugs in kernel |
| 23 | code. |
| 24 | - They can be installed and updated independently of the kernel. |
| 25 | - They can be used to simulate a block device easily with user-specified |
| 26 | parameters/settings for test/debug purposes. |
| 27 | |
| 28 | The ublk block device (``/dev/ublkb*``) is added by the ublk driver. Any IO request |
| 29 | on the device is forwarded to the ublk userspace program. For convenience, |
| 30 | in this document ``ublk server`` refers to a generic ublk userspace |
| 31 | program. ``ublksrv`` [#userspace]_ is one such implementation. It |
| 32 | provides the ``libublksrv`` [#userspace_lib]_ library for conveniently developing |
| 33 | specific user block devices, and it also includes generic block device types |
| 34 | such as loop and null. Richard W.M. Jones wrote the userspace nbd device |
| 35 | ``nbdublk`` [#userspace_nbdublk]_ based on ``libublksrv`` [#userspace_lib]_. |
| 36 | |
| 37 | After the IO is handled by userspace, the result is committed back to the |
| 38 | driver, thus completing the request cycle. This way, all specific IO handling |
| 39 | logic is done entirely in userspace, such as loop's IO handling, NBD's IO |
| 40 | communication, or qcow2's IO mapping. |
| 41 | |
| 42 | ``/dev/ublkb*`` is driven by a blk-mq request-based driver. Each request is |
| 43 | assigned a queue-wide unique tag. The ublk server assigns a unique tag to each |
| 44 | IO too, which maps 1:1 to an IO of ``/dev/ublkb*``. |
| 45 | |
| 46 | Both IO request forwarding and committing of the IO handling result are done via |
| 47 | ``io_uring`` passthrough commands; that is why ublk is also an io_uring based |
| 48 | block driver. It has been observed that using io_uring passthrough commands can |
| 49 | give better IOPS than block IO, which is why ublk is a high performance |
| 50 | implementation of a userspace block device: not only is IO request communication |
| 51 | done by io_uring, but the preferred IO handling approach in the ublk server is |
| 52 | io_uring based too. |
| 53 | |
| 54 | ublk provides a control interface to set/get ublk block device parameters. |
| 55 | The interface is extendable and kABI compatible: basically any ublk request |
| 56 | queue parameter or ublk generic feature parameter can be set/get via the |
| 57 | interface. Thus, ublk is a generic userspace block device framework. |
| 58 | For example, it is easy to set up a ublk device with specified block |
| 59 | parameters from userspace. |
| 60 | |
| 61 | Using ublk |
| 62 | ========== |
| 63 | |
| 64 | ublk requires a userspace ublk server to handle the real block device logic. |
| 65 | |
| 66 | Below is an example of using ``ublksrv`` to provide a ublk-based loop device. |
| 67 | |
| 68 | - add a device:: |
| 69 | |
| 70 | ublk add -t loop -f ublk-loop.img |
| 71 | |
| 72 | - format with xfs, then use it:: |
| 73 | |
| 74 | mkfs.xfs /dev/ublkb0 |
| 75 | mount /dev/ublkb0 /mnt |
| 76 | # do anything. all IOs are handled by io_uring |
| 77 | ... |
| 78 | umount /mnt |
| 79 | |
| 80 | - list the devices with their info:: |
| 81 | |
| 82 | ublk list |
| 83 | |
| 84 | - delete the device:: |
| 85 | |
| 86 | ublk del -a |
| 87 | ublk del -n $ublk_dev_id |
| 88 | |
| 89 | See usage details in README of ``ublksrv`` [#userspace_readme]_. |
| 90 | |
| 91 | Design |
| 92 | ====== |
| 93 | |
| 94 | Control plane |
| 95 | ------------- |
| 96 | |
| 97 | The ublk driver provides a global misc device node (``/dev/ublk-control``) for |
| 98 | managing and controlling ublk devices with the help of several control commands: |
| 99 | |
| 100 | - ``UBLK_CMD_ADD_DEV`` |
| 101 | |
| 102 | Add a ublk char device (``/dev/ublkc*``), which the ublk server uses for |
| 103 | IO command communication. Basic device info is sent together with this |
| 104 | command. It sets the UAPI structure ``ublksrv_ctrl_dev_info``, |
| 105 | such as ``nr_hw_queues``, ``queue_depth``, and max IO request buffer size; |
| 106 | this info is negotiated with the driver and sent back to the server. |
| 107 | When this command is completed, the basic device info is immutable. |
| 108 | |
| 109 | - ``UBLK_CMD_SET_PARAMS`` / ``UBLK_CMD_GET_PARAMS`` |
| 110 | |
| 111 | Set or get parameters of the device, which can be either generic feature |
| 112 | related, or request queue limit related, but can't be IO logic specific, |
| 113 | because the driver does not handle any IO logic. This command has to be |
| 114 | sent before sending ``UBLK_CMD_START_DEV``. |
| 115 | |
| 116 | - ``UBLK_CMD_START_DEV`` |
| 117 | |
| 118 | After the server prepares userspace resources (such as creating per-queue |
| 119 | pthread & io_uring for handling ublk IO), this command is sent to the |
| 120 | driver for allocating & exposing ``/dev/ublkb*``. Parameters set via |
| 121 | ``UBLK_CMD_SET_PARAMS`` are applied for creating the device. |
| 122 | |
| 123 | - ``UBLK_CMD_STOP_DEV`` |
| 124 | |
| 125 | Halt IO on ``/dev/ublkb*`` and remove the device. When this command returns, |
| 126 | ublk server will release resources (such as destroying per-queue pthread & |
| 127 | io_uring). |
| 128 | |
| 129 | - ``UBLK_CMD_DEL_DEV`` |
| 130 | |
| 131 | Remove ``/dev/ublkc*``. When this command returns, the allocated ublk device |
| 132 | number can be reused. |
| 133 | |
| 134 | - ``UBLK_CMD_GET_QUEUE_AFFINITY`` |
| 135 | |
| 136 | When ``/dev/ublkc*`` is added, the driver creates the block layer tagset, so |
| 137 | that each queue's affinity info is available. The server sends |
| 138 | ``UBLK_CMD_GET_QUEUE_AFFINITY`` to retrieve queue affinity info. It can then |
| 139 | set up the per-queue context efficiently, such as binding affine CPUs to the IO |
| 140 | pthread and trying to allocate buffers in IO thread context. |
| 141 | |
| 142 | - ``UBLK_CMD_GET_DEV_INFO`` |
| 143 | |
| 144 | Retrieve device info via ``ublksrv_ctrl_dev_info``. It is the server's |
| 145 | responsibility to save IO target specific info in userspace. |
| 146 | |
| 147 | - ``UBLK_CMD_GET_DEV_INFO2`` |
| 148 | Same purpose as ``UBLK_CMD_GET_DEV_INFO``, but the ublk server has to |
| 149 | provide the path of the char device ``/dev/ublkc*`` so that the kernel can run |
| 150 | a permission check. This command was added to support unprivileged |
| 151 | ublk devices, and was introduced together with ``UBLK_F_UNPRIVILEGED_DEV``. |
| 152 | Only the user owning the requested device can retrieve the device info. |
| 153 | |
| 154 | How to deal with userspace/kernel compatibility: |
| 155 | |
| 156 | 1) if the kernel is capable of handling ``UBLK_F_UNPRIVILEGED_DEV`` |
| 157 | If the ublk server supports ``UBLK_F_UNPRIVILEGED_DEV``: |
| 158 | the ublk server should send ``UBLK_CMD_GET_DEV_INFO2`` whenever an |
| 159 | unprivileged application needs to query devices the current user owns. |
| 160 | The application cannot know whether ``UBLK_F_UNPRIVILEGED_DEV`` is set, |
| 161 | because the capability info is stateless, so it should always |
| 162 | retrieve the device info via ``UBLK_CMD_GET_DEV_INFO2``. |
| 163 | |
| 164 | If the ublk server doesn't support ``UBLK_F_UNPRIVILEGED_DEV``: |
| 165 | ``UBLK_CMD_GET_DEV_INFO`` is always sent to the kernel, and the |
| 166 | ``UBLK_F_UNPRIVILEGED_DEV`` feature isn't available to the user. |
| 167 | |
| 168 | 2) if the kernel isn't capable of handling ``UBLK_F_UNPRIVILEGED_DEV`` |
| 169 | If the ublk server supports ``UBLK_F_UNPRIVILEGED_DEV``: |
| 170 | ``UBLK_CMD_GET_DEV_INFO2`` is tried first and will fail; then |
| 171 | ``UBLK_CMD_GET_DEV_INFO`` needs to be retried, given that |
| 172 | ``UBLK_F_UNPRIVILEGED_DEV`` can't be set. |
| 173 | |
| 174 | If the ublk server doesn't support ``UBLK_F_UNPRIVILEGED_DEV``: |
| 175 | ``UBLK_CMD_GET_DEV_INFO`` is always sent to the kernel, and the |
| 176 | ``UBLK_F_UNPRIVILEGED_DEV`` feature isn't available to the user. |
| 177 | |
| 178 | - ``UBLK_CMD_START_USER_RECOVERY`` |
| 179 | |
| 180 | This command is valid if the ``UBLK_F_USER_RECOVERY`` feature is enabled. It |
| 181 | is accepted after the old process has exited, the ublk device is quiesced, |
| 182 | and ``/dev/ublkc*`` is released. The user should send this command before starting |
| 183 | a new process which re-opens ``/dev/ublkc*``. When this command returns, the |
| 184 | ublk device is ready for the new process. |
| 185 | |
| 186 | - ``UBLK_CMD_END_USER_RECOVERY`` |
| 187 | |
| 188 | This command is valid if the ``UBLK_F_USER_RECOVERY`` feature is enabled. It |
| 189 | is accepted after the ublk device is quiesced and a new process has |
| 190 | opened ``/dev/ublkc*`` and made all ublk queues ready. When this command |
| 191 | returns, the ublk device is unquiesced and new I/O requests are passed to the |
| 192 | new process. |
| 193 | |
| 194 | - user recovery feature description |
| 195 | |
| 196 | Two new features are added for user recovery: ``UBLK_F_USER_RECOVERY`` and |
| 197 | ``UBLK_F_USER_RECOVERY_REISSUE``. |
| 198 | |
| 199 | With ``UBLK_F_USER_RECOVERY`` set, after a ubq_daemon (the ublk server's IO |
| 200 | handler) dies, ublk does not delete ``/dev/ublkb*`` during the whole |
| 201 | recovery stage, and the ublk device ID is kept. It is the ublk server's |
| 202 | responsibility to recover the device context from its own knowledge. |
| 203 | Requests which have not been issued to userspace are requeued. Requests |
| 204 | which have been issued to userspace are aborted. |
| 205 | |
| 206 | With ``UBLK_F_USER_RECOVERY_REISSUE`` set, after a ubq_daemon (the ublk |
| 207 | server's IO handler) dies, contrary to ``UBLK_F_USER_RECOVERY``, |
| 208 | requests which have been issued to userspace are requeued and will be |
| 209 | re-issued to the new process after handling ``UBLK_CMD_END_USER_RECOVERY``. |
| 210 | ``UBLK_F_USER_RECOVERY_REISSUE`` is designed for backends that tolerate |
| 211 | double writes, since the driver may issue the same I/O request twice. It |
| 212 | might be useful for a read-only FS or a VM backend. |
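
The difference between the two recovery flags can be summarized as a small decision function. The following is only an illustrative sketch of the rules described above, not the driver's code: the ``EX_`` flag values, the enum and ``ex_on_daemon_death()`` are hypothetical names invented for this example (the real flag definitions live in the ublk UAPI header). The behaviour without ``UBLK_F_USER_RECOVERY`` is not described here, so the sketch simply assumes such requests are aborted, and it assumes the reissue flag only matters when the recovery flag is also set.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the real feature flags; values are placeholders. */
#define EX_F_USER_RECOVERY         (1u << 0)
#define EX_F_USER_RECOVERY_REISSUE (1u << 1)

/* What happens to one request when the IO handler dies. */
enum ex_action { EX_ABORT, EX_REQUEUE };

static enum ex_action ex_on_daemon_death(unsigned flags, bool issued_to_userspace)
{
    /* Assumption: without the recovery feature, every request is aborted. */
    if (!(flags & EX_F_USER_RECOVERY))
        return EX_ABORT;

    /* Requests that never reached userspace are always requeued. */
    if (!issued_to_userspace)
        return EX_REQUEUE;

    /* Issued requests are requeued only if reissue is tolerated. */
    return (flags & EX_F_USER_RECOVERY_REISSUE) ? EX_REQUEUE : EX_ABORT;
}
```

For instance, ``ex_on_daemon_death(EX_F_USER_RECOVERY, true)`` yields ``EX_ABORT``, while adding ``EX_F_USER_RECOVERY_REISSUE`` turns the same case into ``EX_REQUEUE``, which is why a backend must tolerate seeing the same write twice.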
| 213 | |
| 214 | Unprivileged ublk devices are supported by passing ``UBLK_F_UNPRIVILEGED_DEV``. |
| 215 | Once the flag is set, all control commands can be sent by an unprivileged |
| 216 | user. Except for ``UBLK_CMD_ADD_DEV``, the ublk driver runs a permission check on |
| 217 | the specified char device (``/dev/ublkc*``) for all other control |
| 218 | commands; for that purpose, the path of the char device has to |
| 219 | be provided in these commands' payload by the ublk server. This way, a |
| 220 | ublk device becomes container-aware, and a device created in one container |
| 221 | can be controlled/accessed only inside this container. |
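
The compatibility rules listed under ``UBLK_CMD_GET_DEV_INFO2`` amount to a try-then-fall-back probe for a server that supports ``UBLK_F_UNPRIVILEGED_DEV``. A minimal sketch follows, assuming a hypothetical transport callback ``ex_send_fn`` that returns 0 on success and a negative value on failure. The ``ex_`` names and the two fake kernels are invented for illustration and do not correspond to any real API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical command identifiers, used only inside this sketch. */
enum ex_cmd { EX_GET_DEV_INFO, EX_GET_DEV_INFO2 };

/* Hypothetical transport: returns 0 on success, negative on failure. */
typedef int (*ex_send_fn)(enum ex_cmd cmd, void *info);

/*
 * Try the INFO2 variant first and fall back to the plain variant when it
 * fails. On success, *used records which command worked.
 */
static int ex_query_dev_info(ex_send_fn send, void *info, enum ex_cmd *used)
{
    int ret = send(EX_GET_DEV_INFO2, info);

    if (ret == 0) {
        *used = EX_GET_DEV_INFO2;
        return 0;
    }

    ret = send(EX_GET_DEV_INFO, info);
    if (ret == 0)
        *used = EX_GET_DEV_INFO;
    return ret;
}

/* Fake kernel that understands both commands. */
static int ex_new_kernel(enum ex_cmd cmd, void *info)
{
    (void)cmd;
    (void)info;
    return 0;
}

/* Fake kernel that rejects the INFO2 variant. */
static int ex_old_kernel(enum ex_cmd cmd, void *info)
{
    (void)info;
    return cmd == EX_GET_DEV_INFO2 ? -1 : 0;
}
```

The sketch retries on any failure because the text does not say which error an older kernel returns. A real server would probably want to inspect the error code first, so that a permission failure on a capable kernel is not masked by the retry.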
| 222 | |
| 223 | Data plane |
| 224 | ---------- |
| 225 | |
| 226 | The ublk server needs to create a per-queue IO pthread & io_uring for handling IO |
| 227 | commands via io_uring passthrough. The per-queue IO pthread |
| 228 | focuses on IO handling and shouldn't handle any control & management |
| 229 | tasks. |
| 230 | |
| 231 | The server's IO is assigned a unique tag, which maps 1:1 to an IO |
| 232 | request of ``/dev/ublkb*``. |
| 233 | |
| 234 | The UAPI structure ``ublksrv_io_desc`` is defined for describing each IO from |
| 235 | the driver. A fixed mmapped area (array) on ``/dev/ublkc*`` is provided for |
| 236 | exporting IO info to the server, such as IO offset, length, OP/flags and |
| 237 | buffer address. Each ``ublksrv_io_desc`` instance can be indexed via queue id |
| 238 | and IO tag directly. |
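
To make "indexed via queue id and IO tag directly" concrete, here is a rough sketch of such a lookup. It assumes each queue owns a contiguous run of ``EX_SLOTS_PER_QUEUE`` descriptor slots in the mapped area. That constant, the ``struct ex_io_desc`` layout and the helper names are hypothetical and chosen only to mirror the fields listed above. The authoritative definitions are in the ublk UAPI header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative descriptor carrying the fields the text lists. */
struct ex_io_desc {
    uint32_t op_flags;     /* operation and flags */
    uint32_t nr_sectors;   /* IO length */
    uint64_t start_sector; /* IO offset */
    uint64_t addr;         /* buffer address */
};

/* Assumed number of descriptor slots reserved per queue. */
#define EX_SLOTS_PER_QUEUE 128u

/* Byte offset of the descriptor for (q_id, tag); tag < EX_SLOTS_PER_QUEUE. */
static size_t ex_desc_offset(unsigned q_id, unsigned tag)
{
    return ((size_t)q_id * EX_SLOTS_PER_QUEUE + tag) * sizeof(struct ex_io_desc);
}

/* Direct lookup: no search is needed, the pair (q_id, tag) is the index. */
static const struct ex_io_desc *ex_get_desc(const void *area,
                                            unsigned q_id, unsigned tag)
{
    const char *base = area;

    return (const struct ex_io_desc *)(base + ex_desc_offset(q_id, tag));
}
```

Because the lookup is plain arithmetic, the IO pthread can find the descriptor for a completed command from the tag alone, without any shared lookup table or locking.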
| 239 | |
| 240 | The following IO commands are communicated via io_uring passthrough commands, |
| 241 | and each command only forwards the IO and commits the result |
| 242 | for the IO tag specified in the command data: |
| 243 | |
| 244 | - ``UBLK_IO_FETCH_REQ`` |
| 245 | |
| 246 | Sent from the server IO pthread for fetching future incoming IO requests |
| 247 | destined to ``/dev/ublkb*``. This command is sent only once from the server |
| 248 | IO pthread so that the ublk driver can set up the IO forwarding environment. |
| 249 | |
| 250 | - ``UBLK_IO_COMMIT_AND_FETCH_REQ`` |
| 251 | |
| 252 | When an IO request is destined to ``/dev/ublkb*``, the driver stores |
| 253 | the IO's ``ublksrv_io_desc`` to the specified mapped area; then the |
| 254 | previously received IO command of this IO tag (either ``UBLK_IO_FETCH_REQ`` |
| 255 | or ``UBLK_IO_COMMIT_AND_FETCH_REQ``) is completed, so the server gets |
| 256 | the IO notification via io_uring. |
| 257 | |
| 258 | After the server handles the IO, its result is committed back to the |
| 259 | driver by sending ``UBLK_IO_COMMIT_AND_FETCH_REQ`` back. Once ublkdrv |
| 260 | receives this command, it parses the result and completes the request to |
| 261 | ``/dev/ublkb*``. At the same time, it sets up the environment for fetching future |
| 262 | requests with the same IO tag. That is, ``UBLK_IO_COMMIT_AND_FETCH_REQ`` |
| 263 | is reused for both fetching a request and committing back the IO result. |
| 264 | |
| 265 | - ``UBLK_IO_NEED_GET_DATA`` |
| 266 | |
| 267 | With ``UBLK_F_NEED_GET_DATA`` enabled, a WRITE request is first |
| 268 | issued to the ublk server without data copy. Then, the IO backend of the ublk server |
| 269 | receives the request and can allocate a data buffer and embed its address |
| 270 | inside this new IO command. After the kernel driver gets the command, |
| 271 | data is copied from the request pages to this backend's buffer. Finally, |
| 272 | the backend receives the request again with the data to be written and can |
| 273 | truly handle the request. |
| 274 | |
| 275 | ``UBLK_IO_NEED_GET_DATA`` adds one additional round-trip and one |
| 276 | io_uring_enter() syscall. Any user who thinks that it may lower performance |
| 277 | should not enable ``UBLK_F_NEED_GET_DATA``. The ublk server pre-allocates an IO |
| 278 | buffer for each IO by default. Any new project should try to use this |
| 279 | buffer to communicate with the ublk driver. However, an existing project may |
| 280 | break or be unable to consume the new buffer interface; that's why this |
| 281 | command is added for backwards compatibility so that existing projects |
| 282 | can still consume existing buffers. |
| 283 | |
| 284 | - data copy between ublk server IO buffer and ublk block IO request |
| 285 | |
| 286 | For WRITE, the driver needs to copy the block IO request pages into the server buffer |
| 287 | (pages) before notifying the server of the incoming IO, so |
| 288 | that the server can handle the WRITE request. |
| 289 | |
| 290 | When the server handles a READ request and sends |
| 291 | ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the driver, ublkdrv needs to copy |
| 292 | the data read from the server buffer (pages) to the IO request pages. |
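
The fetch/commit cycle described for each IO tag can be pictured as a small state machine. The sketch below is a simplified model of that cycle only. It leaves out ``UBLK_IO_NEED_GET_DATA`` and the data copy, and the state, event and function names are hypothetical labels chosen for this example rather than anything defined by the driver.

```c
#include <assert.h>

/* Hypothetical per-tag states for this sketch. */
enum ex_tag_state {
    EX_IDLE,     /* no command issued yet for this tag */
    EX_WAITING,  /* a fetch command is pending in the driver */
    EX_IN_SERVER /* a request was delivered; the server is handling it */
};

enum ex_event {
    EX_SEND_FETCH_REQ,           /* server to driver, once per tag */
    EX_REQUEST_ARRIVES,          /* driver completes the pending command */
    EX_SEND_COMMIT_AND_FETCH_REQ /* server to driver: result plus re-arm */
};

/* Returns 0 and updates *st on a valid transition, -1 otherwise. */
static int ex_tag_step(enum ex_tag_state *st, enum ex_event ev)
{
    switch (ev) {
    case EX_SEND_FETCH_REQ:
        if (*st != EX_IDLE)
            return -1;
        *st = EX_WAITING;
        return 0;
    case EX_REQUEST_ARRIVES:
        if (*st != EX_WAITING)
            return -1;
        *st = EX_IN_SERVER;
        return 0;
    case EX_SEND_COMMIT_AND_FETCH_REQ:
        if (*st != EX_IN_SERVER)
            return -1;
        /* Committing the result also re-arms the tag for the next request. */
        *st = EX_WAITING;
        return 0;
    }
    return -1;
}
```

After the initial ``UBLK_IO_FETCH_REQ``, a tag in this model only alternates between waiting and being handled by the server, which reflects why a single command can serve both to commit a result and to fetch the next request.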
| 293 | |
| 294 | Future development |
| 295 | ================== |
| 296 | |
| 297 | Zero copy |
| 298 | --------- |
| 299 | |
| 300 | Zero copy is a generic requirement for nbd, fuse or similar drivers. A |
| 301 | problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace |
| 302 | can't be remapped any more in the kernel with existing mm interfaces. This can |
| 303 | occur when direct IO is destined to ``/dev/ublkb*``. Also, he reported that |
| 304 | big requests (IO size >= 256 KB) may benefit a lot from zero copy. |
| 305 | |
| 306 | |
| 307 | References |
| 308 | ========== |
| 309 | |
| 310 | .. [#userspace] https://github.com/ming1/ubdsrv |
| 311 | |
| 312 | .. [#userspace_lib] https://github.com/ming1/ubdsrv/tree/master/lib |
| 313 | |
| 314 | .. [#userspace_nbdublk] https://gitlab.com/rwmjones/libnbd/-/tree/nbdublk |
| 315 | |
| 316 | .. [#userspace_readme] https://github.com/ming1/ubdsrv/blob/master/README |
| 317 | |
| 318 | .. [#stefan] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/ |
| 319 | |
| 320 | .. [#xiaoguang] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/ |