Commit | Line | Data |
---|---|---|
4e7b1156 AR |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | =========================================== | |
4 | Shared Virtual Addressing (SVA) with ENQCMD | |
5 | =========================================== | |
6 | ||
7 | Background | |
8 | ========== | |
9 | ||
10 | Shared Virtual Addressing (SVA) allows the processor and device to use the | |
11 | same virtual addresses avoiding the need for software to translate virtual | |
12 | addresses to physical addresses. SVA is what PCIe calls Shared Virtual | |
13 | Memory (SVM). | |
14 | ||
15 | In addition to the convenience of the device using application virtual | |
16 | addresses, SVA also doesn't require pinning pages for DMA. | |
17 | PCIe Address Translation Services (ATS) along with Page Request Interface | |
18 | (PRI) allow devices to function much the same way as the CPU handling | |
19 | application page-faults. For more information please refer to the PCIe | |
20 | specification Chapter 10: ATS Specification. | |
21 | ||
22 | Use of SVA requires IOMMU support in the platform. IOMMU is also | |
23 | required to support the PCIe features ATS and PRI. ATS allows devices | |
24 | to cache translations for virtual addresses. The IOMMU driver uses the | |
25 | mmu_notifier() support to keep the device TLB cache and the CPU cache in | |
26 | sync. When an ATS lookup fails for a virtual address, the device should | |
27 | use the PRI in order to request the virtual address to be paged into the | |
28 | CPU page tables. The device must use ATS again in order to fetch the | |
29 | translation before use. | |
30 | ||
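The ATS/PRI sequence described above can be sketched as a small simulation. This is a toy model, not kernel or device code: `toy_mm`, `ats_translate()`, `pri_page_request()` and `device_translate()` are hypothetical names invented for illustration, and the frame numbers are arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

#define NPAGES 8   /* size of the toy address space, in pages (assumption) */

/* Toy stand-in for the CPU page tables of one process. */
struct toy_mm {
    bool present[NPAGES];         /* is virtual page i currently mapped? */
    uint64_t pfn[NPAGES];         /* frame number backing virtual page i */
    unsigned int faults_serviced; /* number of PRI requests handled */
};

/* ATS translation request: succeeds only if the page is present. */
static bool ats_translate(const struct toy_mm *mm, unsigned int vpn,
                          uint64_t *pfn)
{
    if (vpn >= NPAGES || !mm->present[vpn])
        return false;
    *pfn = mm->pfn[vpn];
    return true;
}

/* PRI page request: the OS pages the address in and responds. */
static bool pri_page_request(struct toy_mm *mm, unsigned int vpn)
{
    if (vpn >= NPAGES)
        return false;             /* invalid address: the request fails */
    mm->present[vpn] = true;
    mm->pfn[vpn] = 0x1000 + vpn;  /* arbitrary frame number for the demo */
    mm->faults_serviced++;
    return true;
}

/* Device-side flow: ATS, then PRI on a miss, then ATS again before use. */
static bool device_translate(struct toy_mm *mm, unsigned int vpn,
                             uint64_t *pfn)
{
    if (ats_translate(mm, vpn, pfn))
        return true;
    if (!pri_page_request(mm, vpn))
        return false;
    return ats_translate(mm, vpn, pfn);
}
```

The first access to a page costs one page request; later accesses to the same page are satisfied by the translation alone.
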
31 | Shared Hardware Workqueues | |
32 | ========================== | |
33 | ||
34 | Unlike Single Root I/O Virtualization (SR-IOV), Scalable IOV (SIOV) permits | |
35 | the use of Shared Work Queues (SWQ) by both applications and Virtual | |
36 | Machines (VMs). This allows better hardware utilization than hard | |
37 | partitioning resources, which could result in underutilization. In order to | |
38 | allow the hardware to distinguish the context for which work is being | |
39 | executed in the hardware by SWQ interface, SIOV uses Process Address Space | |
40 | ID (PASID), which is a 20-bit number defined by the PCIe SIG. | |
41 | ||
42 | The PASID value is encoded in all transactions from the device. This allows the | |
43 | IOMMU to track I/O on a per-PASID granularity in addition to using the PCIe | |
44 | Requester ID (RID), which is the Bus/Device/Function. | |
45 | ||
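As a rough illustration of the (RID, PASID) pair that identifies a context, the sketch below packs a Bus/Device/Function into a 16-bit RID (8-bit bus, 5-bit device, 3-bit function) and checks that a PASID fits in 20 bits. The `dma_tag` structure and the helper names are hypothetical and exist only for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define PASID_BITS 20
#define PASID_MAX  ((1u << PASID_BITS) - 1)   /* largest 20-bit value */

/* Pack bus (8 bits), device (5 bits), function (3 bits) into a RID. */
static uint16_t make_rid(uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (uint16_t)((bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}

/* Hypothetical tag carried by a transaction: who sent it, for which context. */
struct dma_tag {
    uint16_t rid;
    uint32_t pasid;
};

/* Build a tag; reject PASIDs that do not fit in 20 bits. */
static bool make_dma_tag(uint16_t rid, uint32_t pasid, struct dma_tag *tag)
{
    if (pasid > PASID_MAX)
        return false;
    tag->rid = rid;
    tag->pasid = pasid;
    return true;
}

/* Two transactions belong to the same context only if both fields match. */
static bool same_context(const struct dma_tag *a, const struct dma_tag *b)
{
    return a->rid == b->rid && a->pasid == b->pasid;
}
```

Two requests from the same device function with different PASIDs compare as different contexts, which is the per-PASID granularity the text refers to.
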
46 | ||
47 | ENQCMD | |
48 | ====== | |
49 | ||
50 | ENQCMD is a new instruction on Intel platforms that atomically submits a | |
51 | work descriptor to a device. The descriptor includes the operation to be | |
52 | performed, virtual addresses of all parameters, virtual address of a completion | |
53 | record, and the PASID (process address space ID) of the current process. | |
54 | ||
55 | ENQCMD works with non-posted semantics and carries a status back if the | |
56 | command was accepted by hardware. This allows the submitter to know whether the | |
57 | submission needs to be retried, or whether other device-specific mechanisms to | |
58 | implement fairness or ensure forward progress should be used. | |
59 | ||
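A submitter might act on the returned status with a bounded retry loop. The sketch below is an assumption-laden model: `submit_fn`, `submit_with_retry()` and the `mock_swq` queue are hypothetical, and the mock callback stands in for whatever wrapper issues the real ENQCMD instruction on hardware.

```c
#include <stdbool.h>

/*
 * Hypothetical submit callback: returns true if the device accepted the
 * descriptor, false if the device asked for a retry.
 */
typedef bool (*submit_fn)(void *portal, const void *desc);

/*
 * Try up to max_tries times. Returns the number of attempts used on
 * success, or -1 if the descriptor was never accepted.
 */
static int submit_with_retry(submit_fn submit, void *portal,
                             const void *desc, int max_tries)
{
    for (int attempt = 1; attempt <= max_tries; attempt++) {
        if (submit(portal, desc))
            return attempt;
        /* A real caller might pause or back off here before retrying. */
    }
    return -1;
}

/* Mock shared work queue that rejects a fixed number of submissions. */
struct mock_swq {
    int busy_rounds;
};

static bool mock_submit(void *portal, const void *desc)
{
    struct mock_swq *q = portal;

    (void)desc;                   /* the mock ignores the descriptor */
    if (q->busy_rounds > 0) {
        q->busy_rounds--;
        return false;             /* queue full: caller should retry */
    }
    return true;
}
```

Bounding the number of attempts is one simple way to avoid spinning forever on a persistently full queue; what to do after giving up is device and application specific.
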
60 | ENQCMD is the glue that ensures applications can directly submit commands | |
61 | to the hardware and also permits hardware to be aware of application context | |
62 | to perform I/O operations via use of PASID. | |
63 | ||
64 | Process Address Space Tagging | |
65 | ============================= | |
66 | ||
67 | A new thread-scoped MSR (IA32_PASID) provides the connection between | |
68 | user processes and the rest of the hardware. When an application first | |
69 | accesses an SVA-capable device, this MSR is initialized with a newly | |
70 | allocated PASID. The driver for the device calls an IOMMU-specific API | |
71 | that sets up the routing for DMA and page-requests. | |
72 | ||
73 | For example, the Intel Data Streaming Accelerator (DSA) uses | |
74 | iommu_sva_bind_device(), which will do the following: | |
75 | ||
76 | - Allocate the PASID, and program the process page-table (%cr3 register) in the | |
77 | PASID context entries. | |
78 | - Register for mmu_notifier() to track any page-table invalidations to keep | |
79 | the device TLB in sync. For example, when a page-table entry is invalidated, | |
80 | the IOMMU propagates the invalidation to the device TLB. This will force any | |
81 | future access by the device to this virtual address to participate in | |
82 | ATS. If the IOMMU responds with a proper response that a page is not | |
83 | present, the device would request the page to be paged in via the PCIe PRI | |
84 | protocol before performing I/O. | |
85 | ||
86 | This MSR is managed with the XSAVE feature set as "supervisor state" to | |
87 | ensure the MSR is updated during context switch. | |
88 | ||
89 | PASID Management | |
90 | ================ | |
91 | ||
92 | The kernel must allocate a PASID on behalf of each process which will use | |
93 | ENQCMD and program it into the new MSR to communicate the process identity to | |
94 | platform hardware. ENQCMD uses the PASID stored in this MSR to tag requests | |
95 | from this process. When a user submits a work descriptor to a device using the | |
96 | ENQCMD instruction, the PASID field in the descriptor is auto-filled with the | |
97 | value from MSR_IA32_PASID. Requests for DMA from the device are also tagged | |
98 | with the same PASID. The platform IOMMU uses the PASID in the transaction to | |
99 | perform address translation. The IOMMU APIs set up the corresponding PASID | |
100 | entry in the IOMMU with the process page-table address used by the CPU (e.g. | |
101 | the %cr3 register on x86). | |
102 | ||
103 | The MSR must be configured on each logical CPU before any application | |
104 | thread can interact with a device. Threads that belong to the same | |
105 | process share the same page tables, thus the same MSR value. | |
106 | ||
83aa52ff FY |
107 | PASID Life Cycle Management |
108 | =========================== | |
109 | ||
110 | PASID is initialized as INVALID_IOASID (-1) when a process is created. | |
111 | ||
112 | Only processes that access SVA-capable devices need to have a PASID | |
113 | allocated. This allocation happens when a process opens/binds an SVA-capable | |
114 | device but finds no PASID for this process. Subsequent binds of the same, or | |
115 | other devices will share the same PASID. | |
116 | ||
117 | Although the PASID is allocated to the process by opening a device, | |
118 | it is not active in any of the threads of that process. It's loaded to the | |
119 | IA32_PASID MSR lazily when a thread tries to submit a work descriptor | |
120 | to a device using the ENQCMD. | |
121 | ||
122 | That first access will trigger a #GP fault because the IA32_PASID MSR | |
123 | has not been initialized with the PASID value assigned to the process | |
124 | when the device was opened. The Linux #GP handler notes that a PASID has | |
125 | been allocated for the process, and so initializes the IA32_PASID MSR | |
126 | and returns so that the ENQCMD instruction is re-executed. | |
127 | ||
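The lazy-load behavior described above can be modeled in a few lines. This is a toy simulation, not the kernel's #GP handler: `toy_mm`, `toy_thread`, `fixup_pasid_fault()` and `toy_enqcmd()` are hypothetical names, `NO_PASID` stands in for INVALID_IOASID, and the MSR layout used here (PASID in bits 19:0, valid flag in bit 31) is stated as an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define PASID_MSR_VALID (1u << 31)   /* assumed layout: bit 31 is "valid" */
#define PASID_MSR_MASK  0xFFFFFu     /* assumed layout: bits 19:0 hold PASID */
#define NO_PASID        UINT32_MAX   /* stands in for INVALID_IOASID (-1) */

struct toy_mm {
    uint32_t pasid;                  /* process-wide PASID, or NO_PASID */
};

struct toy_thread {
    struct toy_mm *mm;
    uint32_t pasid_msr;              /* per-thread copy of the MSR */
    unsigned int gp_faults;          /* how many faults this thread took */
};

/* Toy #GP fixup: load the MSR if the mm has a PASID and the MSR is unset. */
static bool fixup_pasid_fault(struct toy_thread *t)
{
    if (t->mm->pasid == NO_PASID)
        return false;                /* no PASID allocated: genuine fault */
    if (t->pasid_msr & PASID_MSR_VALID)
        return false;                /* MSR already valid: other cause */
    t->pasid_msr = (t->mm->pasid & PASID_MSR_MASK) | PASID_MSR_VALID;
    return true;
}

/*
 * Toy ENQCMD: faults while the MSR is invalid and is re-executed after a
 * successful fixup. On success, stores the PASID that tags the request.
 */
static bool toy_enqcmd(struct toy_thread *t, uint32_t *tagged_pasid)
{
    while (!(t->pasid_msr & PASID_MSR_VALID)) {
        t->gp_faults++;
        if (!fixup_pasid_fault(t))
            return false;            /* fault is not fixable here */
    }
    *tagged_pasid = t->pasid_msr & PASID_MSR_MASK;
    return true;
}
```

Each thread pays for at most one fault in this model; once its MSR copy is valid, later submissions proceed without entering the fixup path.
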
128 | On fork(2) or exec(2) the PASID is removed from the process as it no | |
129 | longer has the same address space that it had when the device was opened. | |
130 | ||
131 | On clone(2) the new task shares the same address space, so will be | |
132 | able to use the PASID allocated to the process. The IA32_PASID is not | |
133 | preemptively initialized as the PASID value might not be allocated yet or | |
134 | the kernel does not know whether this thread is going to access the device | |
135 | and the cleared IA32_PASID MSR reduces context switch overhead by xstate | |
136 | init optimization. Since #GP faults have to be handled on any threads that | |
137 | were created before the PASID was assigned to the mm of the process, newly | |
138 | created threads might as well be treated in a consistent way. | |
139 | ||
140 | Due to the complexity of freeing the PASID and clearing all IA32_PASID MSRs in | |
141 | all threads in unbind, the PASID is freed lazily, only on mm exit. | |
142 | ||
143 | If a process does a close(2) of the device file descriptor and munmap(2) | |
144 | of the device MMIO portal, then the driver will unbind the device. The | |
145 | PASID is still marked VALID in the IA32_PASID MSR for any threads in the | |
146 | process that accessed the device. But this is harmless as without the | |
147 | MMIO portal they cannot submit new work to the device. | |
4e7b1156 AR |
148 | |
149 | Relationships | |
150 | ============= | |
151 | ||
152 | * Each process has many threads, but only one PASID. | |
153 | * Devices have a limited number (~10s to 1000s) of hardware workqueues. | |
154 | The device driver manages allocating hardware workqueues. | |
155 | * A single mmap() maps a single hardware workqueue as a "portal" and | |
156 | each portal maps down to a single workqueue. | |
157 | * For each device with which a process interacts, there must be | |
158 | one or more mmap()'d portals. | |
159 | * Many threads within a process can share a single portal to access | |
160 | a single device. | |
161 | * Multiple processes can separately mmap() the same portal, in | |
162 | which case they still share one device hardware workqueue. | |
163 | * The single process-wide PASID is used by all threads to interact | |
164 | with all devices. There is not, for instance, a PASID for each | |
165 | thread or each thread<->device pair. | |
166 | ||
167 | FAQ | |
168 | === | |
169 | ||
170 | * What is SVA/SVM? | |
171 | ||
172 | Shared Virtual Addressing (SVA) permits I/O hardware and the processor to | |
173 | work in the same address space, i.e., to share it. Some call it Shared | |
174 | Virtual Memory (SVM), but the Linux community wanted to avoid confusing it with | |
175 | POSIX Shared Memory and Secure Virtual Machines which were terms already in | |
176 | circulation. | |
177 | ||
178 | * What is a PASID? | |
179 | ||
180 | A Process Address Space ID (PASID) is a PCIe-defined Transaction Layer Packet | |
181 | (TLP) prefix. A PASID is a 20-bit number allocated and managed by the OS. | |
182 | PASID is included in all transactions between the platform and the device. | |
183 | ||
184 | * How are shared workqueues different? | |
185 | ||
186 | Traditionally, in order for userspace applications to interact with hardware, | |
187 | there is a separate hardware instance required per process. For example, | |
188 | consider doorbells as a mechanism of informing hardware about work to process. | |
189 | Each doorbell is required to be spaced 4k (or page-size) apart for process | |
190 | isolation. This requires hardware to provision that space and reserve it in | |
191 | MMIO. This doesn't scale as the number of threads becomes quite large. The | |
192 | hardware also manages the queue depth for Shared Work Queues (SWQ), and | |
193 | consumers don't need to track queue depth. If there is no space to accept | |
194 | a command, the device will return an error indicating retry. | |
195 | ||
196 | A user should check the Deferrable Memory Write (DMWr) capability on the device | |
197 | and only submit ENQCMD when the device supports it. In the new DMWr PCIe | |
198 | terminology, devices need to support DMWr completer capability. In addition, | |
199 | all switch ports must support DMWr routing, and DMWr must be enabled by | |
200 | the PCIe subsystem, much like how PCIe atomic operations are managed, for | |
201 | instance. | |
202 | ||
203 | SWQ allows hardware to provision just a single address in the device. When | |
204 | used with ENQCMD to submit work, the device can distinguish the process | |
205 | submitting the work since it will include the PASID assigned to that | |
206 | process. This helps the device scale to a large number of processes. | |
207 | ||
208 | * Is this the same as a user space device driver? | |
209 | ||
210 | Communicating with the device via the shared workqueue is much simpler | |
211 | than a full-blown user space driver. The kernel driver does all the | |
212 | initialization of the hardware. User space only needs to worry about | |
213 | submitting work and processing completions. | |
214 | ||
215 | * Is this the same as SR-IOV? | |
216 | ||
217 | Single Root I/O Virtualization (SR-IOV) focuses on providing independent | |
218 | hardware interfaces for virtualizing hardware. Hence, each is required to be an | |
219 | almost fully functional interface to software, supporting the traditional | |
220 | BARs, space for interrupts via MSI-X, and its own register layout. | |
221 | Virtual Functions (VFs) are assisted by the Physical Function (PF) | |
222 | driver. | |
223 | ||
224 | Scalable I/O Virtualization builds on the PASID concept to create device | |
225 | instances for virtualization. SIOV requires host software to assist in | |
226 | creating virtual devices; each virtual device is represented by a PASID | |
227 | along with the bus/device/function of the device. This allows device | |
228 | hardware to optimize device resource creation, and resources can grow | |
229 | dynamically on demand. SR-IOV creation and management is very static in nature. Consult | |
230 | references below for more details. | |
231 | ||
232 | * Why not just create a virtual function for each app? | |
233 | ||
234 | Creating PCIe SR-IOV type Virtual Functions (VF) is expensive. VFs require | |
235 | duplicated hardware for PCI config space and interrupts such as MSI-X. | |
236 | Resources such as interrupts have to be hard partitioned between VFs at | |
237 | creation time, and cannot scale dynamically on demand. The VFs are not | |
238 | completely independent from the Physical Function (PF). Most VFs require | |
239 | some communication and assistance from the PF driver. SIOV, in contrast, | |
240 | creates a software-defined device where all the configuration and control | |
241 | aspects are mediated via the slow path. The work submission and completion | |
242 | happen without any mediation. | |
243 | ||
244 | * Does this support virtualization? | |
245 | ||
246 | ENQCMD can be used from within a guest VM. In these cases, the VMM helps | |
247 | with setting up a translation table to translate from Guest PASID to Host | |
248 | PASID. Please consult the ENQCMD instruction set reference for more | |
249 | details. | |
250 | ||
251 | * Does memory need to be pinned? | |
252 | ||
253 | When devices support SVA along with platform hardware such as IOMMU | |
254 | supporting such devices, there is no need to pin memory for DMA purposes. | |
255 | Devices that support SVA also support other PCIe features that remove the | |
256 | pinning requirement for memory. | |
257 | ||
258 | Device TLB support - The device requests the IOMMU to look up an address before | |
259 | use via Address Translation Service (ATS) requests. If the mapping exists | |
260 | but there is no page allocated by the OS, IOMMU hardware returns that no | |
261 | mapping exists. | |
262 | ||
263 | The device requests the virtual address to be mapped via the Page Request | |
264 | Interface (PRI). Once the OS has successfully completed the mapping, it | |
265 | returns the response back to the device. The device then requests a | |
266 | translation again and continues. | |
267 | ||
268 | IOMMU works with the OS in managing consistency of page-tables with the | |
269 | device. When removing pages, it interacts with the device to remove any | |
270 | device TLB entry that might have been cached before removing the mappings from | |
271 | the OS. | |
272 | ||
273 | References | |
274 | ========== | |
275 | ||
276 | VT-D: | |
277 | https://01.org/blogs/ashokraj/2018/recent-enhancements-intel-virtualization-technology-directed-i/o-intel-vt-d | |
278 | ||
279 | SIOV: | |
280 | https://01.org/blogs/2019/assignable-interfaces-intel-scalable-i/o-virtualization-linux | |
281 | ||
282 | ENQCMD in ISE: | |
283 | https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf | |
284 | ||
285 | DSA spec: | |
286 | https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf |