Commit | Line | Data |
---|---|---|
8b4a503d | 1 | ================================== |
25627ba3 DJS |
2 | vfio-ccw: the basic infrastructure |
3 | ================================== | |
4 | ||
5 | Introduction | |
6 | ------------ | |
7 | ||
8 | Here we describe the vfio support for I/O subchannel devices for | |
9 | Linux/s390. The motivation for vfio-ccw is to pass subchannels through | |
10 | to a virtual machine, while vfio is the means. | |
11 | ||
12 | Unlike other hardware architectures, s390 has defined a unified | |
13 | I/O access method, the so-called Channel I/O. It has its own access | |
14 | patterns: | |
8b4a503d | 15 | |
25627ba3 DJS |
16 | - Channel programs run asynchronously on a separate (co)processor. |
17 | - The channel subsystem will access any memory designated by the caller | |
18 | in the channel program directly, i.e. there is no iommu involved. | |
8b4a503d | 19 | |
25627ba3 DJS |
20 | Thus when we introduce vfio support for these devices, we realize it |
21 | with a mediated device (mdev) implementation. The vfio mdev will be | |
22 | added to an iommu group so that it can be managed by the | |
23 | vfio framework. We also add read/write callbacks for special vfio I/O | |
24 | regions to pass the channel programs from the mdev to its parent device | |
25 | (the real I/O subchannel device) to do further address translation and | |
26 | to perform I/O instructions. | |
27 | ||
28 | This document does not intend to explain the s390 I/O architecture in | |
29 | every detail. More information and references can be found here: | |
8b4a503d | 30 | |
25627ba3 DJS |
31 | - A good start to know Channel I/O in general: |
32 | https://en.wikipedia.org/wiki/Channel_I/O | |
33 | - s390 architecture: | |
34 | s390 Principles of Operation manual (IBM Form. No. SA22-7832) | |
69cfd92e | 35 | - The existing QEMU code which implements a simple emulated channel |
25627ba3 DJS |
36 | subsystem could also be a good reference. It makes it easier to follow |
37 | the flow. | |
38 | qemu/hw/s390x/css.c | |
39 | ||
40 | For vfio mediated device framework: | |
baa293e9 | 41 | - Documentation/driver-api/vfio-mediated-device.rst |
25627ba3 DJS |
42 | |
43 | Motivation of vfio-ccw | |
44 | ---------------------- | |
45 | ||
69cfd92e | 46 | Typically, a guest virtualized via QEMU/KVM on s390 only sees |
25627ba3 DJS |
47 | paravirtualized virtio devices via the "Virtio Over Channel I/O |
48 | (virtio-ccw)" transport. This makes virtio devices discoverable via | |
49 | standard operating system algorithms for handling channel devices. | |
50 | ||
51 | However this is not enough. On s390 for the majority of devices, which | |
52 | use the standard Channel I/O based mechanism, we also need to provide | |
69cfd92e | 53 | the functionality of passing through them to a QEMU virtual machine. |
25627ba3 DJS |
54 | This includes devices that don't have a virtio counterpart (e.g. tape |
55 | drives) or that have specific characteristics which guests want to | |
56 | exploit. | |
57 | ||
58 | For passing a device to a guest, we want to use the same interface as | |
69cfd92e CH |
59 | everybody else, namely vfio. We implement this vfio support for channel |
60 | devices via the vfio mediated device framework and the subchannel device | |
61 | driver "vfio_ccw". | |
25627ba3 DJS |
62 | |
63 | Access patterns of CCW devices | |
64 | ------------------------------ | |
65 | ||
66 | The s390 architecture has implemented a so-called channel subsystem that | |
67 | provides a unified view of the devices physically attached to the | |
68 | systems. Though the s390 hardware platform knows about a huge variety of | |
69 | different peripheral attachments like disk devices (aka DASDs), tapes, | |
70 | communication controllers, etc., they can all be accessed by a well | |
71 | defined access method and they present I/O completion in a unified | |
72 | way: I/O interruptions. | |
73 | ||
74 | All I/O requires the use of channel command words (CCWs). A CCW is an | |
75 | instruction to a specialized I/O channel processor. A channel program is | |
76 | a sequence of CCWs which are executed by the I/O channel subsystem. To | |
77 | issue a channel program to the channel subsystem, it is required to | |
78 | build an operation request block (ORB), which can be used to point out | |
79 | the format of the CCW and other control information to the system. The | |
80 | operating system signals the I/O channel subsystem to begin executing | |
81 | the channel program with a SSCH (start sub-channel) instruction. The | |
82 | central processor is then free to proceed with non-I/O instructions | |
83 | until interrupted. The I/O completion result is received by the | |
84 | interrupt handler in the form of an interrupt response block (IRB). | |
85 | ||
86 | Back to vfio-ccw, in short: | |
8b4a503d | 87 | |
25627ba3 DJS |
88 | - ORBs and channel programs are built in guest kernel (with guest |
89 | physical addresses). | |
90 | - ORBs and channel programs are passed to the host kernel. | |
91 | - Host kernel translates the guest physical addresses to real addresses | |
92 | and starts the I/O by issuing a privileged Channel I/O instruction | |
93 | (e.g. SSCH). | |
94 | - Channel programs run asynchronously on a separate processor. | |
95 | - I/O completion will be signaled to the host with I/O interruptions. | |
96 | The result will be copied as an IRB to user space, which passes it back to the | |
97 | guest. | |
98 | ||
99 | Physical vfio ccw device and its child mdev | |
100 | ------------------------------------------- | |
101 | ||
102 | As mentioned above, we realize vfio-ccw with a mdev implementation. | |
103 | ||
104 | Channel I/O does not have IOMMU hardware support, so the physical | |
105 | vfio-ccw device does not have an IOMMU level translation or isolation. | |
106 | ||
69cfd92e | 107 | Subchannel I/O instructions are all privileged instructions. When |
25627ba3 DJS |
108 | handling the I/O instruction interception, vfio-ccw uses software |
109 | policing and translation of how the channel program is programmed before | |
110 | it gets sent to hardware. | |
111 | ||
112 | Within this implementation, we have two drivers for two types of | |
113 | devices: | |
8b4a503d | 114 | |
25627ba3 DJS |
115 | - The vfio_ccw driver for the physical subchannel device. |
116 | This is an I/O subchannel driver for the real subchannel device. It | |
117 | realizes a group of callbacks and registers to the mdev framework as a | |
118 | parent (physical) device. As a consequence, mdev provides vfio_ccw a | |
119 | generic interface (sysfs) to create mdev devices. A vfio mdev can then be | |
120 | created by vfio_ccw and added to the mediated bus. It is the vfio | |
121 | device that is added to an IOMMU group and a vfio group. | |
122 | vfio_ccw also provides an I/O region to accept channel program | |
123 | requests from user space and store I/O interrupt results for user | |
124 | space to retrieve. To notify user space of an I/O completion, it offers | |
125 | an interface to set up an eventfd for asynchronous signaling. | |
126 | ||
127 | - The vfio_mdev driver for the mediated vfio ccw device. | |
128 | This is provided by the mdev framework. It is a vfio device driver for | |
129 | the mdev that is created by vfio_ccw. | |
69cfd92e | 130 | It realizes a group of vfio device driver callbacks, adds itself to a |
25627ba3 DJS |
131 | vfio group, and registers itself to the mdev framework as a mdev |
132 | driver. | |
133 | It uses a vfio iommu backend that uses the existing map and unmap | |
134 | ioctls, but rather than programming them into an IOMMU for a device, | |
135 | it simply stores the translations for use by later requests. This | |
136 | means that a device programmed in a VM with guest physical addresses | |
137 | can have the vfio kernel convert that address to process virtual | |
138 | address, pin the page and program the hardware with the host physical | |
139 | address in one step. | |
140 | For an mdev, the vfio iommu backend will not pin the pages during the | |
141 | VFIO_IOMMU_MAP_DMA ioctl. The mdev framework will only maintain a database | |
142 | of the iova<->vaddr mappings in this operation. The | |
143 | vfio_pin_pages and vfio_unpin_pages interfaces are exported from the vfio iommu | |
144 | backend for the physical devices to pin and unpin pages on demand. | |
145 | ||
8b4a503d | 146 | Below is a high-level block diagram:: |
25627ba3 DJS |
147 | |
148 | +-------------+ | |
149 | | | | |
150 | | +---------+ | mdev_register_driver() +--------------+ | |
151 | | | Mdev | +<-----------------------+ | | |
152 | | | bus | | | vfio_mdev.ko | | |
153 | | | driver | +----------------------->+ |<-> VFIO user | |
154 | | +---------+ | probe()/remove() +--------------+ APIs | |
155 | | | | |
156 | | MDEV CORE | | |
157 | | MODULE | | |
158 | | mdev.ko | | |
89345d51 | 159 | | +---------+ | mdev_register_parent() +--------------+ |
25627ba3 DJS |
160 | | |Physical | +<-----------------------+ | |
161 | | | device | | | vfio_ccw.ko |<-> subchannel | |
162 | | |interface| +----------------------->+ | device | |
163 | | +---------+ | callback +--------------+ | |
164 | +-------------+ | |
165 | ||
166 | The process of how these work together: | |
8b4a503d | 167 | |
25627ba3 DJS |
168 | 1. vfio_ccw.ko drives the physical I/O subchannel, and registers the |
169 | physical device (with callbacks) to mdev framework. | |
170 | When vfio_ccw probes the subchannel device, it registers the device | |
171 | pointer and callbacks to the mdev framework. Mdev related file nodes | |
172 | under the device node in sysfs are then created for the subchannel | |
173 | device, namely 'mdev_create', 'mdev_destroy' and | |
174 | 'mdev_supported_types'. | |
175 | 2. Create a mediated vfio ccw device. | |
176 | Using the 'mdev_create' sysfs file, we need to manually create one (and | |
177 | only one for our case) mediated device. | |
178 | 3. vfio_mdev.ko drives the mediated ccw device. | |
179 | vfio_mdev is also the vfio device driver. It will probe the mdev and | |
180 | add it to an iommu_group and a vfio_group. Then we can pass the mdev | |
181 | through to a guest. | |
182 | ||
127e6217 FA |
183 | |
184 | VFIO-CCW Regions | |
185 | ---------------- | |
186 | ||
187 | The vfio-ccw driver exposes MMIO regions to accept requests from and return | |
188 | results to userspace. | |
189 | ||
25627ba3 DJS |
190 | vfio-ccw I/O region |
191 | ------------------- | |
192 | ||
193 | An I/O region is used to accept channel program requests from user | |
194 | space and store I/O interrupt results for user space to retrieve. The | |
8b4a503d MCC |
195 | definition of the region is:: |
196 | ||
197 | struct ccw_io_region { | |
198 | #define ORB_AREA_SIZE 12 | |
199 | __u8 orb_area[ORB_AREA_SIZE]; | |
200 | #define SCSW_AREA_SIZE 12 | |
201 | __u8 scsw_area[SCSW_AREA_SIZE]; | |
202 | #define IRB_AREA_SIZE 96 | |
203 | __u8 irb_area[IRB_AREA_SIZE]; | |
204 | __u32 ret_code; | |
205 | } __packed; | |
25627ba3 | 206 | |
430220b0 CH |
207 | This region is always available. |
208 | ||
25627ba3 DJS |
209 | When starting an I/O request, orb_area should be filled with the |
210 | guest ORB, and scsw_area should be filled with the SCSW of the Virtual | |
211 | Subchannel. | |
212 | ||
213 | irb_area stores the I/O result. | |
214 | ||
430220b0 CH |
215 | ret_code stores a return code for each access of the region. The following |
216 | values may occur: | |
217 | ||
218 | ``0`` | |
219 | The operation was successful. | |
220 | ||
221 | ``-EOPNOTSUPP`` | |
222 | The orb specified transport mode or an unidentified IDAW format, or the | |
223 | scsw specified a function other than the start function. | |
224 | ||
225 | ``-EIO`` | |
226 | A request was issued while the device was not in a state ready to accept | |
227 | requests, or an internal error occurred. | |
228 | ||
229 | ``-EBUSY`` | |
230 | The subchannel was status pending or busy, or a request is already active. | |
231 | ||
232 | ``-EAGAIN`` | |
233 | A request was being processed, and the caller should retry. | |
234 | ||
235 | ``-EACCES`` | |
236 | The channel path(s) used for the I/O were found to be not operational. | |
237 | ||
238 | ``-ENODEV`` | |
239 | The device was found to be not operational. | |
240 | ||
241 | ``-EINVAL`` | |
242 | The orb specified a chain longer than 255 ccws, or an internal error | |
243 | occurred. | |
25627ba3 | 244 | |
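As a rough illustration of the I/O region described above, the following user space sketch mirrors the region layout, stages a start request in a local copy, and classifies the returned ret_code. The helper names `io_region_prepare` and `io_region_classify`, the `io_action` enum, and the retry policy are hypothetical and not part of the vfio-ccw interface; `__packed` is spelled as the GCC/Clang attribute, which is an assumption about the compiler. Writing the staged copy to the actual region file descriptor is not shown.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define ORB_AREA_SIZE 12
#define SCSW_AREA_SIZE 12
#define IRB_AREA_SIZE 96

/* User space mirror of the region layout shown above. */
struct ccw_io_region {
	uint8_t orb_area[ORB_AREA_SIZE];
	uint8_t scsw_area[SCSW_AREA_SIZE];
	uint8_t irb_area[IRB_AREA_SIZE];
	uint32_t ret_code;
} __attribute__((packed));

/* 12 + 12 + 96 + 4 bytes, with no padding because the struct is packed. */
static_assert(sizeof(struct ccw_io_region) == 124, "unexpected region size");

/* Hypothetical helper: stage a start request in a local copy of the region. */
static void io_region_prepare(struct ccw_io_region *r,
			      const uint8_t orb[ORB_AREA_SIZE],
			      const uint8_t scsw[SCSW_AREA_SIZE])
{
	memset(r, 0, sizeof(*r));
	memcpy(r->orb_area, orb, ORB_AREA_SIZE);
	memcpy(r->scsw_area, scsw, SCSW_AREA_SIZE);
}

enum io_action { IO_DONE, IO_RETRY, IO_FAIL };

/*
 * Hypothetical policy: of the listed return codes, only -EAGAIN is
 * documented as "the caller should retry"; everything else non-zero
 * is treated as a failure here.
 */
static enum io_action io_region_classify(int32_t ret_code)
{
	if (ret_code == 0)
		return IO_DONE;
	if (ret_code == -EAGAIN)
		return IO_RETRY;
	return IO_FAIL;
}
```

A caller would typically read ret_code back after each access and cast it to a signed value before passing it to a check such as `io_region_classify`, since the field is declared unsigned but carries negative errno values.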
127e6217 FA |
245 | |
246 | vfio-ccw cmd region | |
247 | ------------------- | |
248 | ||
249 | The vfio-ccw cmd region is used to accept asynchronous instructions | |
4c4cbbaa CH |
250 | from userspace:: |
251 | ||
252 | #define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0) | |
253 | #define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1) | |
254 | struct ccw_cmd_region { | |
255 | __u32 command; | |
256 | __u32 ret_code; | |
257 | } __packed; | |
127e6217 FA |
258 | |
259 | This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD. | |
260 | ||
261 | Currently, CLEAR SUBCHANNEL and HALT SUBCHANNEL use this region. | |
262 | ||
430220b0 CH |
263 | command specifies the command to be issued; ret_code stores a return code |
264 | for each access of the region. The following values may occur: | |
265 | ||
266 | ``0`` | |
267 | The operation was successful. | |
268 | ||
269 | ``-ENODEV`` | |
270 | The device was found to be not operational. | |
271 | ||
272 | ``-EINVAL`` | |
273 | A command other than halt or clear was specified. | |
274 | ||
275 | ``-EIO`` | |
276 | A request was issued while the device was not in a state ready to accept | |
277 | requests. | |
278 | ||
279 | ``-EAGAIN`` | |
280 | A request was being processed, and the caller should retry. | |
281 | ||
282 | ``-EBUSY`` | |
283 | The subchannel was status pending or busy while processing a halt request. | |
284 | ||
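A minimal sketch of how a user space program might stage a request for the cmd region follows. The helper `cmd_region_prepare` is hypothetical; it assumes that exactly one of the two command bits is set per request, which matches the -EINVAL description above but is not spelled out in this document. The packed attribute syntax assumes GCC or Clang.

```c
#include <errno.h>
#include <stdint.h>

#define VFIO_CCW_ASYNC_CMD_HSCH (1 << 0)
#define VFIO_CCW_ASYNC_CMD_CSCH (1 << 1)

/* User space mirror of the region layout shown above. */
struct ccw_cmd_region {
	uint32_t command;
	uint32_t ret_code;
} __attribute__((packed));

/*
 * Hypothetical pre-check: accept only halt or clear, so that an
 * obviously invalid command is rejected before it is written to
 * the region. Returns 0 on success or -EINVAL, leaving the region
 * copy untouched on failure.
 */
static int cmd_region_prepare(struct ccw_cmd_region *r, uint32_t command)
{
	if (command != VFIO_CCW_ASYNC_CMD_HSCH &&
	    command != VFIO_CCW_ASYNC_CMD_CSCH)
		return -EINVAL;

	r->command = command;
	r->ret_code = 0;
	return 0;
}
```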
24c98674 FA |
285 | vfio-ccw schib region |
286 | --------------------- | |
287 | ||
288 | The vfio-ccw schib region is used to return Subchannel-Information | |
289 | Block (SCHIB) data to userspace:: | |
290 | ||
291 | struct ccw_schib_region { | |
292 | #define SCHIB_AREA_SIZE 52 | |
293 | __u8 schib_area[SCHIB_AREA_SIZE]; | |
294 | } __packed; | |
295 | ||
296 | This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_SCHIB. | |
297 | ||
298 | Reading this region triggers a STORE SUBCHANNEL to be issued to the | |
299 | associated hardware. | |
430220b0 | 300 | |
d8cac29b FA |
301 | vfio-ccw crw region |
302 | ------------------- | |
303 | ||
304 | The vfio-ccw crw region is used to return Channel Report Word (CRW) | |
305 | data to userspace:: | |
306 | ||
307 | struct ccw_crw_region { | |
308 | __u32 crw; | |
309 | __u32 pad; | |
310 | } __packed; | |
311 | ||
312 | This region is exposed via region type VFIO_REGION_SUBTYPE_CCW_CRW. | |
313 | ||
314 | Reading this region returns a CRW if one that is relevant for this | |
315 | subchannel (e.g. one reporting changes in channel path state) is | |
316 | pending, or all zeroes if not. If multiple CRWs are pending (including | |
317 | possibly chained CRWs), reading this region again will return the next | |
318 | one, until no more CRWs are pending and zeroes are returned. This is | |
319 | similar to how STORE CHANNEL REPORT WORD works. | |
320 | ||
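The read-until-zero behaviour described above can be sketched as a drain loop. Everything except the `ccw_crw_region` layout is hypothetical: `crw_read_fn` stands in for whatever performs the actual read of the crw region, and `crw_queue` with `crw_queue_read` is an in-memory stand-in used only to exercise the loop.

```c
#include <stddef.h>
#include <stdint.h>

/* User space mirror of the region layout shown above. */
struct ccw_crw_region {
	uint32_t crw;
	uint32_t pad;
} __attribute__((packed));

/* Hypothetical callback standing in for one read of the crw region. */
typedef int (*crw_read_fn)(void *ctx, struct ccw_crw_region *out);

/*
 * Read CRWs until an all-zero CRW is returned or the output array is
 * full. Returns the number of CRWs stored, or a negative value passed
 * through from the reader.
 */
static int crw_drain(crw_read_fn read_region, void *ctx,
		     uint32_t *crws, size_t max)
{
	size_t n = 0;

	while (n < max) {
		struct ccw_crw_region r;
		int rc = read_region(ctx, &r);

		if (rc < 0)
			return rc;
		if (r.crw == 0)		/* nothing (more) pending */
			break;
		crws[n++] = r.crw;
	}
	return (int)n;
}

/* In-memory stand-in for the region, for illustration only. */
struct crw_queue {
	const uint32_t *pending;
	size_t len;
	size_t pos;
};

static int crw_queue_read(void *ctx, struct ccw_crw_region *out)
{
	struct crw_queue *q = ctx;

	out->crw = (q->pos < q->len) ? q->pending[q->pos++] : 0;
	out->pad = 0;
	return 0;
}
```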
69cfd92e CH |
321 | vfio-ccw operation details |
322 | -------------------------- | |
25627ba3 | 323 | |
69cfd92e CH |
324 | vfio-ccw follows what vfio-pci did on the s390 platform and uses |
325 | vfio-iommu-type1 as the vfio iommu backend. | |
25627ba3 DJS |
326 | |
327 | * CCW translation APIs | |
8b4a503d | 328 | A group of APIs (starting with `cp_`) to do CCW translation. The CCWs |
69cfd92e CH |
329 | passed in by a user space program are organized with their guest |
330 | physical memory addresses. These APIs will copy the CCWs into kernel | |
331 | space, and assemble a runnable kernel channel program by updating the | |
332 | guest physical addresses with their corresponding host physical addresses. | |
333 | Note that we have to use IDALs even for direct-access CCWs, as the | |
334 | referenced memory can be located anywhere, including above 2G. | |
25627ba3 DJS |
335 | |
336 | * vfio_ccw device driver | |
69cfd92e | 337 | This driver utilizes the CCW translation APIs and introduces |
25627ba3 DJS |
338 | vfio_ccw, which is the driver for the I/O subchannel devices you want |
339 | to pass through. | |
8b4a503d MCC |
340 | vfio_ccw implements the following vfio ioctls:: |
341 | ||
25627ba3 DJS |
342 | VFIO_DEVICE_GET_INFO |
343 | VFIO_DEVICE_GET_IRQ_INFO | |
344 | VFIO_DEVICE_GET_REGION_INFO | |
345 | VFIO_DEVICE_RESET | |
346 | VFIO_DEVICE_SET_IRQS | |
8b4a503d | 347 | |
25627ba3 DJS |
348 | This provides an I/O region, so that the user space program can pass a |
349 | channel program to the kernel, to do further CCW translation before | |
350 | issuing them to a real device. | |
351 | This also provides the SET_IRQS ioctl to set up an event notifier to | |
352 | notify the user space program of the I/O completion in an asynchronous | |
353 | way. | |
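The note above about using IDALs even for direct-access CCWs can be illustrated by splitting a data area into block-sized pieces, one per indirect data address word (IDAW). This is only a sketch: the 4K block size is an assumption, the helper names `idaw_count` and `idal_fill` are hypothetical, and the guest-to-host translation and page pinning that would be applied to each piece are omitted.

```c
#include <stdint.h>

/* Assumption: each IDAW covers at most one 4K block. */
#define IDA_BLOCK_SHIFT 12
#define IDA_BLOCK_SIZE  (1ULL << IDA_BLOCK_SHIFT)

/* Number of IDAWs needed to cover len bytes starting at addr. */
static uint64_t idaw_count(uint64_t addr, uint64_t len)
{
	if (len == 0)
		return 0;
	return ((addr & (IDA_BLOCK_SIZE - 1)) + len + IDA_BLOCK_SIZE - 1)
		>> IDA_BLOCK_SHIFT;
}

/*
 * Fill idaws[] with the start address of each piece: the first entry
 * keeps the original offset into its block, every later entry is
 * block aligned. The caller must provide room for idaw_count() words.
 * Each entry would still have to be translated to a host address.
 */
static uint64_t idal_fill(uint64_t *idaws, uint64_t addr, uint64_t len)
{
	uint64_t n = idaw_count(addr, len);
	uint64_t i;

	for (i = 0; i < n; i++) {
		idaws[i] = addr;
		addr = (addr & ~(IDA_BLOCK_SIZE - 1)) + IDA_BLOCK_SIZE;
	}
	return n;
}
```

Because each piece is addressed separately, the pieces do not need to be contiguous after translation, which is what allows the referenced memory to be located anywhere.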
69cfd92e CH |
354 | |
355 | The use of vfio-ccw is not limited to QEMU, although QEMU is definitely a | |
25627ba3 | 356 | good example to help understand how these patches work. Here is a little |
69cfd92e | 357 | bit more detail on how an I/O request triggered by the QEMU guest will be |
25627ba3 DJS |
358 | handled (without error handling). |
359 | ||
360 | Explanation: | |
25627ba3 | 361 | |
8b4a503d MCC |
362 | - Q1-Q7: QEMU side process. |
363 | - K1-K6: Kernel side process. | |
364 | ||
365 | Q1. | |
366 | Get I/O region info during initialization. | |
367 | ||
368 | Q2. | |
369 | Set up event notifier and handler to handle I/O completion. | |
25627ba3 DJS |
370 | |
371 | ... ... | |
372 | ||
8b4a503d MCC |
373 | Q3. |
374 | Intercept a ssch instruction. | |
375 | Q4. | |
376 | Write the guest channel program and ORB to the I/O region. | |
377 | ||
378 | K1. | |
379 | Copy from guest to kernel. | |
380 | K2. | |
381 | Translate the guest channel program to a host kernel space | |
382 | channel program, which becomes runnable for a real device. | |
383 | K3. | |
384 | With the necessary information contained in the orb passed in | |
385 | by QEMU, issue the ccwchain to the device. | |
386 | K4. | |
387 | Return the ssch CC code. | |
388 | Q5. | |
389 | Return the CC code to the guest. | |
25627ba3 DJS |
390 | |
391 | ... ... | |
392 | ||
8b4a503d MCC |
393 | K5. |
394 | Interrupt handler gets the I/O result and writes the result to | |
395 | the I/O region. | |
396 | K6. | |
397 | Signal QEMU to retrieve the result. | |
398 | ||
399 | Q6. | |
400 | Get the signal; the event handler reads out the result from the I/O | |
25627ba3 | 401 | region. |
8b4a503d MCC |
402 | Q7. |
403 | Update the irb for the guest. | |
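The notification part of this flow (Q2, K6 and Q6) relies on an eventfd. The Linux-only sketch below shows just the eventfd counter semantics; the helper names are hypothetical, and `notifier_signal` uses a plain write() to stand in for the signal that would really come from the kernel. Registering the descriptor with the device through the vfio ioctls is not shown.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Q2: create a non-blocking eventfd to be used as the event notifier. */
static int notifier_create(void)
{
	return eventfd(0, EFD_NONBLOCK);
}

/*
 * K6 stand-in: in the real flow the kernel signals the eventfd; here
 * a write() from user space simulates that signal. Returns 0 or -1.
 */
static int notifier_signal(int fd)
{
	uint64_t one = 1;

	return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Q6: consume pending signals. Returns the number of signals that
 * accumulated since the last read, or -1 if none are pending (the
 * non-blocking read fails in that case).
 */
static int64_t notifier_consume(int fd)
{
	uint64_t count;

	if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (int64_t)count;
}
```

Since several signals can collapse into one readable counter value, a handler built this way would read the I/O region once per wakeup rather than once per signal.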
25627ba3 DJS |
404 | |
405 | Limitations | |
406 | ----------- | |
407 | ||
408 | The current vfio-ccw implementation focuses on supporting basic commands | |
409 | needed to implement block device functionality (read/write) of DASD/ECKD | |
410 | devices only. Some commands may need special handling in the future, for | |
411 | example, anything related to path grouping. | |
412 | ||
413 | DASD is a kind of storage device, while ECKD is a data recording format. | |
414 | More information on DASD and ECKD can be found here: | |
415 | https://en.wikipedia.org/wiki/Direct-access_storage_device | |
416 | https://en.wikipedia.org/wiki/Count_key_data | |
417 | ||
69cfd92e | 418 | Together with the corresponding work in QEMU, we can bring the passed |
25627ba3 DJS |
419 | through DASD/ECKD device online in a guest now and use it as a block |
420 | device. | |
421 | ||
127e6217 | 422 | The current code allows the guest to start channel programs via |
24c98674 FA |
423 | START SUBCHANNEL, and to issue HALT SUBCHANNEL, CLEAR SUBCHANNEL, |
424 | and STORE SUBCHANNEL. | |
69cfd92e | 425 | |
725b94d7 JR |
426 | Currently all channel programs are prefetched, regardless of the |
427 | p-bit setting in the ORB. As a result, self modifying channel | |
428 | programs are not supported. For this reason, IPL has to be handled as | |
429 | a special case by a userspace/guest program; this has been implemented | |
430 | in QEMU's s390-ccw bios as of QEMU 4.1. | |
431 | ||
69cfd92e CH |
432 | vfio-ccw supports classic (command mode) channel I/O only. Transport |
433 | mode (HPF) is not supported. | |
434 | ||
435 | QDIO subchannels are currently not supported. Classic devices other than | |
436 | DASD/ECKD might work, but have not been tested. | |
437 | ||
25627ba3 DJS |
438 | Reference |
439 | --------- | |
440 | 1. ESA/s390 Principles of Operation manual (IBM Form. No. SA22-7832) | |
441 | 2. ESA/390 Common I/O Device Commands manual (IBM Form. No. SA22-7204) | |
442 | 3. https://en.wikipedia.org/wiki/Channel_I/O | |
8b4a503d | 443 | 4. Documentation/s390/cds.rst |
baa293e9 MCC |
444 | 5. Documentation/driver-api/vfio.rst |
445 | 6. Documentation/driver-api/vfio-mediated-device.rst |