==================================================
PCI Express I/O Virtualization Resource on PowerNV
==================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles
them. The first two sections describe the concept of Partitionable
Endpoints and their implementation on P8 (IODA2). The last two sections
discuss the considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs, etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA; they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1's. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze, etc.,
but that's not critical.

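To make the pairing concrete, here is a minimal software model of that
state table (a sketch with hypothetical names; the real table lives in the
PHB hardware)::

    #include <stdbool.h>

    /* One freeze-state pair per PE, as described above. */
    struct pe_state {
        bool mmio_frozen;   /* can be cleared independently of dma_frozen */
        bool dma_frozen;    /* set together with mmio_frozen on an error */
    };

    static struct pe_state pe_state_table[256]; /* 256 PEs per PHB on IODA2 */

    static void pe_freeze(unsigned int pe)
    {
        /* HW sets both bits of the pair together when it freezes a PE. */
        pe_state_table[pe].mmio_frozen = true;
        pe_state_table[pe].dma_frozen = true;
    }
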
The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each PHB
is a completely separate HW entity that replicates the entire logic, so it
has its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.

* Inbound

  For DMA, MSIs and inbound PCIe error messages, we have a table (in
  memory but accessed in HW by the chip) that provides a direct
  correspondence between a PCIe RID (bus/dev/fn) and a PE number. We call
  this the RTT; a sketch of the lookup follows this list.

  - For DMA, we then provide an entire address space for each PE that can
    contain two "windows", depending on the value of PCI address bit 59.
    Each window can be configured to be remapped via a "TCE table" (IOMMU
    translation table), which has various configurable characteristics
    not described here.

  - For MSIs, we have two windows in the address space (one at the top of
    the 32-bit space and one much higher) which, via a combination of the
    address and MSI value, will result in one of the 2048 interrupts per
    bridge being triggered. There's a PE# in the interrupt controller
    descriptor table as well which is compared with the PE# obtained from
    the RTT to "authorize" the device to emit that specific interrupt.

  - Error messages just use the RTT.

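  As an illustration, the RID-to-PE# correspondence behaves like a flat
  table indexed by the 16-bit RID (a hypothetical C sketch; the real RTT
  sits in memory and is walked directly by the chip)::

      #include <stdint.h>

      /* A 16-bit RID is bus (8 bits) / device (5 bits) / function (3 bits). */
      #define RID(bus, dev, fn)   (((bus) << 8) | ((dev) << 3) | (fn))

      static uint8_t rtt[1 << 16];    /* RID -> PE#, one entry per RID */

      static uint8_t pe_for_rid(unsigned int bus, unsigned int dev,
                                unsigned int fn)
      {
          return rtt[RID(bus, dev, fn)];
      }
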
* Outbound. That's where the tricky part is.

  Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
  from the CPU address space to the PCI address space. There is one M32
  window and sixteen M64 windows. They have different characteristics.
  First, what they have in common: they forward a configurable portion of
  the CPU address space to the PCIe bus, and they must be a naturally
  aligned power of two in size. The rest is different:

  - The M32 window:

    * Is limited to 4GB in size.

    * Drops the top bits of the address (above the size) and replaces
      them with a configurable value. This is typically used to generate
      32-bit PCIe accesses. We configure that window at boot from FW and
      don't touch it from Linux; it's usually set to forward a 2GB
      portion of address space from the CPU to PCIe
      0x8000_0000..0xffff_ffff. (Note: the top 64KB are actually
      reserved for MSIs, but this is not a problem at this point; we just
      need to ensure Linux doesn't assign anything there. The M32 logic
      ignores that, however, and will forward in that space if we try.)

    * Is divided into 256 segments of equal size. A table in the chip
      maps each segment to a PE#. That allows portions of the MMIO space
      to be assigned to PEs on a segment granularity. For a 2GB window,
      the segment granularity is 2GB/256 = 8MB (see the sketch below).

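  A sketch of that segment lookup (hypothetical names): the segment index
  is derived from the address, and the per-segment table in the chip
  supplies the PE#::

      #include <stdint.h>

      #define M32_SEGMENTS    256

      static uint8_t m32_seg_to_pe[M32_SEGMENTS]; /* programmable table */

      /* For a 2GB window, seg_size = 2GB / 256 = 8MB. */
      static uint8_t m32_pe_for_addr(uint64_t addr, uint64_t base,
                                     uint64_t size)
      {
          uint64_t seg_size = size / M32_SEGMENTS;

          return m32_seg_to_pe[(addr - base) / seg_size];
      }
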
  Now, this is the "main" window we use in Linux today (excluding
  SR-IOV). We basically use the trick of forcing the bridge MMIO windows
  onto a segment alignment/granularity so that the space behind a bridge
  can be assigned to a PE.

  Ideally we would like to be able to have individual functions in PEs
  but that would mean using a completely different address allocation
  scheme where individual function BARs can be "grouped" to fit in one or
  more segments.

  - The M64 windows:

    * Must be at least 256MB in size.

    * Do not translate addresses (the address on PCIe is the same as the
      address on the PowerBus). There is a way to also set the top 14
      bits, which are not conveyed by PowerBus, but we don't use this.

    * Can be configured to be segmented. When not segmented, we can
      specify the PE# for the entire window. When segmented, a window
      has 256 segments; however, there is no table for mapping a segment
      to a PE#. The segment number *is* the PE# (see the sketch below).

    * Support overlaps. If an address is covered by multiple windows,
      there's a defined ordering for which window applies.

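  The key difference from M32, shown as a sketch: in a segmented M64
  window there is no lookup table, so the segment number computed from
  the address *is* the PE# (hypothetical names)::

      #include <stdint.h>

      /* Segmented M64: no remapping table; the segment index is the PE#. */
      static uint8_t m64_pe_for_addr(uint64_t addr, uint64_t base,
                                     uint64_t size)
      {
          uint64_t seg_size = size / 256; /* 256 segments per window */

          return (addr - base) / seg_size;
      }
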
  We have code (fairly new compared to the M32 stuff) that exploits that
  for large BARs in 64-bit space:

  We configure an M64 window to cover the entire region of address space
  that has been assigned by FW for the PHB (about 64GB, ignoring the
  space for the M32, which comes out of a different "reserve"). We
  configure it as segmented.

  Then we do the same thing as with M32, using the bridge alignment
  trick, to match BARs to those giant segments.

  Since we cannot remap, we have two additional constraints:

  - We do the PE# allocation *after* the 64-bit space has been assigned,
    because the addresses we use directly determine the PE#. We then
    update the M32 PE# for the devices that use both 32-bit and 64-bit
    spaces, or assign the remaining PE# to 32-bit-only devices.

  - We cannot "group" segments in HW, so if a device ends up using more
    than one segment, we end up with more than one PE#. There is a HW
    mechanism to make the freeze state cascade to "companion" PEs, but
    that only works for PCIe error messages (typically used so that if
    you freeze a switch, it freezes all its children). So we do it in
    SW. We lose a bit of effectiveness of EEH in that case, but that's
    the best we found. So when any of the PEs freezes, we freeze the
    other ones for that "domain". We thus introduce the concept of a
    "master PE", which is the one used for DMA, MSIs, etc., and
    "secondary PEs" that are used for the remaining M64 segments (see
    the sketch below).

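  A minimal sketch of that software cascade (hypothetical structures;
  pe_freeze() is the per-PE freeze helper from the sketch in section 1)::

      /* A master PE grouped with the secondary PEs of its extra segments. */
      struct pe_domain {
          unsigned int master;        /* PE# used for DMA, MSIs, etc. */
          unsigned int nr_secondary;
          unsigned int secondary[8];  /* PE#s of the remaining M64 segments */
      };

      static void pe_domain_freeze(struct pe_domain *d)
      {
          unsigned int i;

          /* HW only cascades freezes for PCIe error messages; do it in SW. */
          pe_freeze(d->master);
          for (i = 0; i < d->nr_secondary; i++)
              pe_freeze(d->secondary[i]);
      }
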
  We would like to investigate using additional M64 windows in "single
  PE" mode to overlay over specific BARs to work around some of that, for
  example for devices with very large BARs, e.g., GPUs. It would make
  sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM
========================================

* SR-IOV Background

  The PCIe SR-IOV feature allows a single Physical Function (PF) to
  support several Virtual Functions (VFs). Registers in the PF's SR-IOV
  Capability control the number of VFs and whether they are enabled.

  When VFs are enabled, they appear in Configuration Space like normal
  PCI devices, but the BARs in VF config space headers are unusual. For
  a non-VF device, software uses BARs in the config space header to
  discover the BAR sizes and assign addresses for them. For VF devices,
  software uses VF BAR registers in the *PF* SR-IOV Capability to
  discover sizes and assign addresses. The BARs in the VF's config space
  header are read-only zeros.

  When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
  base address for all the corresponding VF(n) BARs. For example, if the
  PF SR-IOV Capability is programmed to enable eight VFs, and it has a
  1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
  This region is divided into eight contiguous 1MB regions, each of which
  is a BAR0 for one of the VFs. Note that even though the VF BAR
  describes an 8MB region, the alignment requirement is for a single VF,
  i.e., 1MB in this example.

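  Concretely, VF(n)'s BAR falls at a fixed offset from the base
  programmed in the PF's SR-IOV Capability (a sketch using a 0-based VF
  index; names are hypothetical)::

      #include <stdint.h>

      /*
       * A VF BAR in the PF's SR-IOV Capability sets the base of num_vfs
       * contiguous, equally-sized per-VF BARs.
       */
      static uint64_t vf_bar_addr(uint64_t vf_bar_base, uint64_t vf_bar_size,
                                  unsigned int vf)
      {
          return vf_bar_base + (uint64_t)vf * vf_bar_size;
      }

      /* Eight VFs with a 1MB VF BAR0: an 8MB region; VF3's BAR0 = base + 3MB. */
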
  There are several strategies for isolating VFs in PEs:

  - M32 window: There's one M32 window, and it is split into 256
    equally-sized segments. The finest granularity possible is a 256MB
    window with 1MB segments. VF BARs that are 1MB or larger could be
    mapped to separate PEs in this window. Each segment can be
    individually mapped to a PE via the lookup table, so this is quite
    flexible, but it works best when all the VF BARs are the same size.
    If they are different sizes, the entire window has to be small enough
    that the segment size matches the smallest VF BAR, which means larger
    VF BARs span several segments.

  - Non-segmented M64 window: A non-segmented M64 window is mapped
    entirely to a single PE, so it could only isolate one VF.

  - Single segmented M64 window: A segmented M64 window could be used
    just like the M32 window, but the segments can't be individually
    mapped to PEs (the segment number is the PE#), so there isn't as much
    flexibility. A VF with multiple BARs would have to be in a "domain"
    of multiple PEs, which is not as well isolated as a single PE.

  - Multiple segmented M64 windows: As usual, each window is split into
    256 equally-sized segments, and the segment number is the PE#. But if
    we use several M64 windows, they can be set to different base
    addresses and different segment sizes. If we have VFs that each have
    a 1MB BAR and a 32MB BAR, we could use one M64 window to assign 1MB
    segments and another M64 window to assign 32MB segments (see the
    sketch below).

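  A sketch of that last strategy (the window bases are placeholder
  values): each VF BAR size gets its own segmented M64 window whose
  segment size matches that BAR::

      #include <stdint.h>

      struct m64_window {
          uint64_t base;      /* CPU/PCI address of the window */
          uint64_t seg_size;  /* size of each of the 256 segments */
      };

      /*
       * With a 1MB BAR0 and a 32MB BAR1 per VF, one window uses 1MB
       * segments and another uses 32MB segments; if each VF BAR region
       * starts at the same segment index in its window, VF n's BAR0 and
       * BAR1 resolve to the same PE#.
       */
      static const struct m64_window vf_windows[] = {
          { .base = 0x3d0000000000ull, .seg_size =  1 << 20 },  /* 1MB  */
          { .base = 0x3e0000000000ull, .seg_size = 32 << 20 },  /* 32MB */
      };
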
  Finally, we plan to use M64 windows for SR-IOV, which will be described
  in more detail in the next two sections. For a given VF BAR, we need to
  effectively reserve the entire 256 segments (256 * VF BAR size) and
  position the VF BAR to start at the beginning of a free range of
  segments/PEs inside that M64 window.

  The goal is of course to be able to give a separate PE for each VF.

  The IODA2 platform has 16 M64 windows, which are used to map MMIO
  ranges to PE#s. Each M64 window defines one MMIO range and this range
  is divided into 256 segments, with each segment corresponding to one
  PE.

  We decided to leverage these M64 windows to map VFs to individual PEs,
  since SR-IOV VF BARs are all the same size.

  But doing so introduces another problem: total_VFs is usually smaller
  than the number of M64 window segments, so if we map one VF BAR
  directly to one M64 window, some part of the M64 window will map to
  another device's MMIO range.

  IODA supports 256 PEs, so segmented windows contain 256 segments, so if
  total_VFs is less than 256, we have the situation in Figure 1.0, where
  segments [total_VFs, 255] of the M64 window may map to some MMIO range
  on other devices::

     0      1                  total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                  VF(n) BAR space

     0      1                  total_VFs - 1                  255
     +------+------+-     -+------+------+-     -+------+------+
     |      |      |  ...  |      |      |  ...  |      |      |
     +------+------+-     -+------+------+-     -+------+------+

                        M64 window

        Figure 1.0 Direct map VF(n) BAR space

  Our current solution is to allocate 256 segments even if the VF(n) BAR
  space doesn't need that much, as shown in Figure 1.1::

     0      1                  total_VFs - 1                  255
     +------+------+-     -+------+------+-     -+------+------+
     |      |      |  ...  |      |      |  ...  |      |      |
     +------+------+-     -+------+------+-     -+------+------+

                  VF(n) BAR space + extra

     0      1                  total_VFs - 1                  255
     +------+------+-     -+------+------+-     -+------+------+
     |      |      |  ...  |      |      |  ...  |      |      |
     +------+------+-     -+------+------+-     -+------+------+

                        M64 window

        Figure 1.1 Map VF(n) BAR space + extra

  Allocating the extra space ensures that the entire M64 window will be
  assigned to this one SR-IOV device and none of the space will be
  available for other devices. Note that this only expands the space
  reserved in software; there are still only total_VFs VFs, and they only
  respond to segments [0, total_VFs - 1]. There's nothing in hardware
  that responds to segments [total_VFs, 255].
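
  In terms of space, the Figure 1.1 reservation is simply 256 segments'
  worth per VF BAR (a sketch; for example, 16 VFs with a 1MB VF BAR0
  still reserve 256MB rather than 16MB)::

      #include <stdint.h>

      #define M64_SEGMENTS    256 /* one segment per possible PE */

      /*
       * Reserve a full window's worth of space for each VF BAR, even
       * though only segments [0, total_VFs - 1] will ever respond.
       */
      static uint64_t vf_bar_reservation(uint64_t vf_bar_size)
      {
          return (uint64_t)M64_SEGMENTS * vf_bar_size;
      }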

4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#. If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s. Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window. But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR. If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE. The VF BARs (and therefore the PE#s)
are contiguous. If VF0 is in PE(x), then VF(n) is in PE(x+n). If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of
VF0.
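
Concretely (a sketch with hypothetical names): sliding the base of the
VF(n) BAR space by whole segments selects the starting PE#, so VF i lands
in PE (offset + i)::

    #include <stdint.h>

    /*
     * Place the VF(n) BAR space at segment 'offset' of the 256 reserved
     * segments; VF i then lands in PE (offset + i). Any offset that keeps
     * all numVFs VFs within the 256 segments is a valid choice.
     */
    static uint64_t vf_bar_base_for_offset(uint64_t reserve_base,
                                           uint64_t seg_size,
                                           unsigned int offset)
    {
        return reserve_base + (uint64_t)offset * seg_size;
    }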

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs. This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices: instead of consuming only numVFs segments, the VF(n) BAR space
will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR. That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.