.. SPDX-License-Identifier: GPL-2.0

===========================================
Userspace block device driver (ublk driver)
===========================================

Overview
========

ublk is a generic framework for implementing block device logic in userspace.
The motivation is that moving virtual block drivers such as loop and nbd into
userspace can be very helpful. It also makes it easier to implement new
virtual block devices such as ublk-qcow2 (there have been several attempts to
implement a qcow2 driver in the kernel).

Userspace block devices are attractive because:

- They can be written in many programming languages.
- They can use libraries that are not available in the kernel.
- They can be debugged with tools familiar to application developers.
- Crashes do not cause a kernel panic.
- Bugs are likely to have a lower security impact than bugs in kernel
  code.
- They can be installed and updated independently of the kernel.
- They make it easy to simulate a block device with user-specified
  parameters and settings for testing and debugging.
The ublk block device (``/dev/ublkb*``) is added by the ublk driver. Any IO
request on the device is forwarded to the ublk userspace program. For
convenience, in this document ``ublk server`` refers to a generic ublk
userspace program. ``ublksrv`` [#userspace]_ is one such implementation. It
provides the ``libublksrv`` [#userspace_lib]_ library for conveniently
developing specific user block devices, and it also includes generic block
device types such as loop and null. Richard W.M. Jones wrote the userspace nbd
device ``nbdublk`` [#userspace_nbdublk]_ based on ``libublksrv``
[#userspace_lib]_.

After the IO is handled by userspace, the result is committed back to the
driver, thus completing the request cycle. This way, all device-specific IO
handling logic is done in userspace, such as loop's IO handling, NBD's IO
communication, or qcow2's IO mapping.

``/dev/ublkb*`` is driven by a blk-mq request-based driver. Each request is
assigned a queue-wide unique tag. The ublk server also assigns a unique tag to
each IO, which maps 1:1 to the IO of ``/dev/ublkb*``.

Both forwarding the IO request and committing the IO handling result are done
via ``io_uring`` passthrough commands, so ublk is also an io_uring-based block
driver. It has been observed that io_uring passthrough commands can give
better IOPS than block IO, which is why ublk is a high-performance
implementation of a userspace block device: not only is IO request
communication done via io_uring, but the preferred IO handling approach in the
ublk server is io_uring based as well.

ublk provides a control interface to set and get ublk block device parameters.
The interface is extendable and kabi compatible: basically any ublk request
queue parameter or ublk generic feature parameter can be set or retrieved via
the interface. Thus, ublk is a generic userspace block device framework.
For example, it is easy to set up a ublk device with specified block
parameters from userspace.

Using ublk
==========

ublk requires a userspace ublk server to handle the real block device logic.

Below is an example of using ``ublksrv`` to provide a ublk-based loop device.

- add a device::

     ublk add -t loop -f ublk-loop.img

- format with xfs, then use it::

     mkfs.xfs /dev/ublkb0
     mount /dev/ublkb0 /mnt
     # do anything. all IOs are handled by io_uring
     ...
     umount /mnt

- list the devices with their info::

     ublk list

- delete the device::

     ublk del -a
     ublk del -n $ublk_dev_id

See usage details in the README of ``ublksrv`` [#userspace_readme]_.

Design
======

Control plane
-------------

The ublk driver provides a global misc device node (``/dev/ublk-control``) for
managing and controlling ublk devices with the help of several control
commands:

- ``UBLK_CMD_ADD_DEV``

  Add a ublk char device (``/dev/ublkc*``), which the ublk server uses for IO
  command communication. Basic device info is sent together with this
  command. It sets the UAPI structure ``ublksrv_ctrl_dev_info``, including
  ``nr_hw_queues``, ``queue_depth``, and the max IO request buffer size. This
  info is negotiated with the driver and sent back to the server.
  Once this command completes, the basic device info is immutable.

- ``UBLK_CMD_SET_PARAMS`` / ``UBLK_CMD_GET_PARAMS``

  Set or get parameters of the device. These can be either generic feature
  related or request queue limit related, but cannot be IO logic specific,
  because the driver does not handle any IO logic. This command has to be
  sent before sending ``UBLK_CMD_START_DEV``.

- ``UBLK_CMD_START_DEV``

  After the server prepares userspace resources (such as creating the
  per-queue pthread & io_uring for handling ublk IO), this command is sent to
  the driver to allocate and expose ``/dev/ublkb*``. Parameters set via
  ``UBLK_CMD_SET_PARAMS`` are applied when creating the device.

- ``UBLK_CMD_STOP_DEV``

  Halt IO on ``/dev/ublkb*`` and remove the device. When this command returns,
  the ublk server releases its resources (such as destroying the per-queue
  pthread & io_uring).

- ``UBLK_CMD_DEL_DEV``

  Remove ``/dev/ublkc*``. When this command returns, the allocated ublk device
  number can be reused.

- ``UBLK_CMD_GET_QUEUE_AFFINITY``

  When ``/dev/ublkc*`` is added, the driver creates the block layer tagset, so
  that each queue's affinity info is available. The server sends
  ``UBLK_CMD_GET_QUEUE_AFFINITY`` to retrieve the queue affinity info. With
  it, the server can set up the per-queue context efficiently, for example by
  binding the IO pthread to the affine CPUs and allocating buffers in IO
  thread context.

- ``UBLK_CMD_GET_DEV_INFO``

  Retrieve device info via ``ublksrv_ctrl_dev_info``. It is the server's
  responsibility to save IO target specific info in userspace.

- ``UBLK_CMD_GET_DEV_INFO2``

  Same purpose as ``UBLK_CMD_GET_DEV_INFO``, but the ublk server has to
  provide the path of the char device ``/dev/ublkc*`` so that the kernel can
  run a permission check. This command was added to support unprivileged
  ublk devices and was introduced together with ``UBLK_F_UNPRIVILEGED_DEV``.
  Only the user owning the requested device can retrieve the device info.

  How to deal with userspace/kernel compatibility:

  1) if the kernel is capable of handling ``UBLK_F_UNPRIVILEGED_DEV``

     If the ublk server supports ``UBLK_F_UNPRIVILEGED_DEV``:

     The ublk server should send ``UBLK_CMD_GET_DEV_INFO2`` whenever an
     unprivileged application needs to query devices owned by the current
     user. Because the capability info is stateless, the application cannot
     know in advance whether ``UBLK_F_UNPRIVILEGED_DEV`` is set, so it should
     always retrieve the info via ``UBLK_CMD_GET_DEV_INFO2``.

     If the ublk server doesn't support ``UBLK_F_UNPRIVILEGED_DEV``:

     ``UBLK_CMD_GET_DEV_INFO`` is always sent to the kernel, and the
     ``UBLK_F_UNPRIVILEGED_DEV`` feature isn't available to the user.

  2) if the kernel isn't capable of handling ``UBLK_F_UNPRIVILEGED_DEV``

     If the ublk server supports ``UBLK_F_UNPRIVILEGED_DEV``:

     ``UBLK_CMD_GET_DEV_INFO2`` is tried first and fails; then
     ``UBLK_CMD_GET_DEV_INFO`` needs to be retried, given that
     ``UBLK_F_UNPRIVILEGED_DEV`` can't be set.

     If the ublk server doesn't support ``UBLK_F_UNPRIVILEGED_DEV``:

     ``UBLK_CMD_GET_DEV_INFO`` is always sent to the kernel, and the
     ``UBLK_F_UNPRIVILEGED_DEV`` feature isn't available to the user.

- ``UBLK_CMD_START_USER_RECOVERY``

  This command is valid if the ``UBLK_F_USER_RECOVERY`` feature is enabled. It
  is accepted after the old process has exited, the ublk device is quiesced
  and ``/dev/ublkc*`` is released. The user should send this command before
  starting a new process which re-opens ``/dev/ublkc*``. When this command
  returns, the ublk device is ready for the new process.

- ``UBLK_CMD_END_USER_RECOVERY``

  This command is valid if the ``UBLK_F_USER_RECOVERY`` feature is enabled. It
  is accepted after the ublk device is quiesced, a new process has opened
  ``/dev/ublkc*`` and all ublk queues are ready. When this command returns,
  the ublk device is unquiesced and new I/O requests are passed to the new
  process.

- user recovery feature description

  Two new features are added for user recovery: ``UBLK_F_USER_RECOVERY`` and
  ``UBLK_F_USER_RECOVERY_REISSUE``.

  With ``UBLK_F_USER_RECOVERY`` set, when a ubq_daemon (the ublk server's IO
  handler) is dying, ublk does not delete ``/dev/ublkb*`` during the whole
  recovery stage and the ublk device ID is kept. It is the ublk server's
  responsibility to recover the device context from its own knowledge.
  Requests which have not been issued to userspace are requeued. Requests
  which have been issued to userspace are aborted.

  With ``UBLK_F_USER_RECOVERY_REISSUE`` set, when a ubq_daemon (the ublk
  server's IO handler) is dying, contrary to ``UBLK_F_USER_RECOVERY``,
  requests which have been issued to userspace are requeued and will be
  re-issued to the new process after handling ``UBLK_CMD_END_USER_RECOVERY``.
  ``UBLK_F_USER_RECOVERY_REISSUE`` is designed for backends that tolerate
  double writes, since the driver may issue the same I/O request twice. It
  might be useful to a read-only FS or a VM backend.
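
The requeue/abort rules of the two recovery features can be summarized as a
small decision function. This is only an illustrative model of the behaviour
described above, not driver code; the names ``Request``, ``Outcome`` and
``on_daemon_dying`` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    REQUEUE = "requeue"  # kept and handed to the new server process later
    ABORT = "abort"      # failed back to the submitter


@dataclass(frozen=True)
class Request:
    tag: int
    issued_to_userspace: bool


def on_daemon_dying(req: Request, reissue: bool) -> Outcome:
    """Fate of one request when the ublk server's IO handler dies.

    `reissue` models UBLK_F_USER_RECOVERY_REISSUE being set in addition
    to UBLK_F_USER_RECOVERY.
    """
    if not req.issued_to_userspace:
        # The server never saw it, so it is requeued in both modes.
        return Outcome.REQUEUE
    # Already issued: re-issue only if the backend tolerates double writes.
    return Outcome.REQUEUE if reissue else Outcome.ABORT
```

Only requests that had already reached the old server differ between the two
modes, which is why ``UBLK_F_USER_RECOVERY_REISSUE`` needs a backend that
tolerates seeing the same request twice.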

Unprivileged ublk devices are supported by passing ``UBLK_F_UNPRIVILEGED_DEV``.
Once the flag is set, all control commands can be sent by an unprivileged
user. Except for ``UBLK_CMD_ADD_DEV``, the ublk driver runs a permission check
on the specified char device (``/dev/ublkc*``) for all other control
commands. For that purpose, the ublk server has to provide the path of the
char device in the payload of these commands. This way, a ublk device becomes
container-aware: a device created in one container can be controlled and
accessed only inside that container.
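
The ``UBLK_CMD_GET_DEV_INFO2`` compatibility rules listed earlier amount to a
try-then-fall-back pattern on the server side. The sketch below assumes a
hypothetical ``send_cmd`` callable that issues a control command by name and
raises ``OSError`` when the kernel rejects it; the exact error a kernel
without ``UBLK_F_UNPRIVILEGED_DEV`` support returns is not specified here.

```python
def get_dev_info(send_cmd, server_supports_unprivileged: bool):
    """Query device info, preferring UBLK_CMD_GET_DEV_INFO2 when possible.

    `send_cmd(name)` is a hypothetical helper, not part of any real API.
    """
    if not server_supports_unprivileged:
        # Server without the feature: always use the original command.
        return send_cmd("UBLK_CMD_GET_DEV_INFO")
    try:
        return send_cmd("UBLK_CMD_GET_DEV_INFO2")
    except OSError:
        # Assumed: the kernel does not know the new command, so retry
        # with the original one.
        return send_cmd("UBLK_CMD_GET_DEV_INFO")
```

A real server would likely inspect the error code before retrying, so that a
permission failure on a capable kernel is not mistaken for a missing feature.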

Data plane
----------

The ublk server needs to create a per-queue IO pthread & io_uring for handling
IO commands via io_uring passthrough. The per-queue IO pthread
focuses on IO handling and shouldn't handle any control & management
tasks.

Each IO in the ublk server is assigned a unique tag, which maps 1:1 to the IO
request of ``/dev/ublkb*``.

The UAPI structure ``ublksrv_io_desc`` is defined for describing each IO from
the driver. A fixed mmapped area (array) on ``/dev/ublkc*`` is provided for
exporting IO info to the server, such as IO offset, length, OP/flags and
buffer address. Each ``ublksrv_io_desc`` instance can be indexed directly via
queue id and IO tag.
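
As a rough illustration of indexing descriptors by queue id and tag, the
sketch below treats the exported area as a flat ``[queue][tag]`` array. The
field layout (``op_flags``, ``nr_sectors``, ``start_sector``, ``addr``), the
op/flags bit split and the flat stride are assumptions made for this example;
the authoritative layout and mmap offsets are defined by the UAPI header and
the driver and may differ.

```python
import struct

# Assumed descriptor layout: u32 op_flags, u32 nr_sectors,
# u64 start_sector, u64 addr (little endian, 24 bytes).
IO_DESC = struct.Struct("<IIQQ")


def desc_offset(q_id: int, tag: int, queue_depth: int) -> int:
    """Byte offset of one descriptor in a hypothetical flat array."""
    if not 0 <= tag < queue_depth:
        raise ValueError("tag out of range")
    return (q_id * queue_depth + tag) * IO_DESC.size


def read_desc(area: bytes, q_id: int, tag: int, queue_depth: int) -> dict:
    """Decode the descriptor for (q_id, tag) from the exported area."""
    op_flags, nr_sectors, start_sector, addr = IO_DESC.unpack_from(
        area, desc_offset(q_id, tag, queue_depth)
    )
    return {
        "op": op_flags & 0xFF,   # assumed: low 8 bits hold the operation
        "flags": op_flags >> 8,  # assumed: remaining bits hold the flags
        "nr_sectors": nr_sectors,
        "start_sector": start_sector,
        "addr": addr,
    }
```

Because the lookup is plain arithmetic on the queue id and tag, the server
needs no search or locking to find the descriptor for a completed command.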

The following IO commands are communicated via io_uring passthrough commands.
Each command only forwards the IO and commits the result, using the IO tag
specified in the command data:

- ``UBLK_IO_FETCH_REQ``

  Sent from the server IO pthread for fetching future incoming IO requests
  destined to ``/dev/ublkb*``. This command is sent only once from the server
  IO pthread so that the ublk driver can set up the IO forwarding environment.

- ``UBLK_IO_COMMIT_AND_FETCH_REQ``

  When an IO request is destined to ``/dev/ublkb*``, the driver stores
  the IO's ``ublksrv_io_desc`` in the specified mapped area; then the
  previously received IO command for this IO tag (either ``UBLK_IO_FETCH_REQ``
  or ``UBLK_IO_COMMIT_AND_FETCH_REQ``) is completed, so the server gets
  the IO notification via io_uring.

  After the server handles the IO, its result is committed back to the
  driver by sending ``UBLK_IO_COMMIT_AND_FETCH_REQ``. Once ublkdrv
  receives this command, it parses the result and completes the request to
  ``/dev/ublkb*``. At the same time, it sets up the environment for fetching
  future requests with the same IO tag. That is, ``UBLK_IO_COMMIT_AND_FETCH_REQ``
  is reused for both fetching a request and committing back the IO result.

- ``UBLK_IO_NEED_GET_DATA``

  With ``UBLK_F_NEED_GET_DATA`` enabled, a WRITE request is first issued to
  the ublk server without a data copy. The IO backend of the ublk server then
  receives the request, allocates a data buffer and embeds its address
  inside this new IO command. After the kernel driver gets the command,
  data is copied from the request pages to the backend's buffer. Finally,
  the backend receives the request again, now with the data to be written,
  and can actually handle the request.

  ``UBLK_IO_NEED_GET_DATA`` adds one additional round-trip and one
  io_uring_enter() syscall. Users who are concerned that this may lower
  performance should not enable ``UBLK_F_NEED_GET_DATA``. The ublk server
  pre-allocates an IO buffer for each IO by default. Any new project should
  try to use this buffer to communicate with the ublk driver. However,
  existing projects may break or may not be able to consume the new buffer
  interface; this command is added for backwards compatibility so that
  existing projects can still consume their existing buffers.

- data copy between ublk server IO buffer and ublk block IO request

  For WRITE, the driver needs to copy the block IO request pages into the
  server buffer (pages) before notifying the server of the incoming IO, so
  that the server can handle the WRITE request.

  When the server handles a READ request and sends
  ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the driver, ublkdrv needs to copy
  the data read into the server buffer (pages) to the IO request pages.
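
The per-tag fetch/commit cycle described by the IO commands above can be
modelled as a tiny state machine. This is a simplified, hypothetical model
(the class ``TagSlot`` and its methods are not driver or library names); it
leaves out io_uring, data copies and ``UBLK_IO_NEED_GET_DATA`` entirely.

```python
class TagSlot:
    """One (queue, tag) slot as seen by the driver, heavily simplified."""

    def __init__(self):
        self.armed = False    # a fetch command is pending in the driver
        self.inflight = None  # block request currently owned by the server
        self.completed = []   # (request, result) pairs finished so far

    def fetch_req(self):
        # Models UBLK_IO_FETCH_REQ: sent once per tag to arm the slot.
        if self.armed or self.inflight is not None:
            raise RuntimeError("FETCH_REQ is only sent once per tag")
        self.armed = True

    def incoming(self, request):
        # Driver side: a block request arrives and completes the pending
        # command, which the server observes as an io_uring completion.
        if not self.armed:
            raise RuntimeError("no pending fetch command for this tag")
        self.armed = False
        self.inflight = request
        return request

    def commit_and_fetch_req(self, result):
        # Models UBLK_IO_COMMIT_AND_FETCH_REQ: report the result of the
        # current request and re-arm the slot for the next one.
        if self.inflight is None:
            raise RuntimeError("nothing to commit")
        self.completed.append((self.inflight, result))
        self.inflight = None
        self.armed = True
```

In this model a slot is armed exactly once by the fetch command; every later
re-arm happens as a side effect of committing the previous result, which
matches the single command per request described above.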

Future development
==================

Zero copy
---------

Zero copy is a generic requirement for nbd, fuse or similar drivers. A
problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
can't be remapped any more in the kernel with existing mm interfaces. This can
occur when direct IO is destined to ``/dev/ublkb*``. He also reported that
big requests (IO size >= 256 KB) may benefit a lot from zero copy.


References
==========

.. [#userspace] https://github.com/ming1/ubdsrv

.. [#userspace_lib] https://github.com/ming1/ubdsrv/tree/master/lib

.. [#userspace_nbdublk] https://gitlab.com/rwmjones/libnbd/-/tree/nbdublk

.. [#userspace_readme] https://github.com/ming1/ubdsrv/blob/master/README

.. [#stefan] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/

.. [#xiaoguang] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/