==================================================
PCI Express I/O Virtualization Resource on PowerNV
==================================================

Wei Yang <weiyang@linux.vnet.ibm.com>

Benjamin Herrenschmidt <benh@au1.ibm.com>

Bjorn Helgaas <bhelgaas@google.com>

26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles
them. The first two sections describe the concept of Partitionable
Endpoints and their implementation on P8 (IODA2). The last two sections
discuss the considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints
==========================================

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs, etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA; they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1's. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze, etc.,
but that's not critical.

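To make the pairing concrete, here is a minimal software model of that
state table (a sketch with hypothetical names; the real table lives in the
PHB hardware)::

    #include <stdbool.h>

    /* One freeze-state pair per PE, as described above. */
    struct pe_state {
        bool mmio_frozen;   /* can be cleared independently of dma_frozen */
        bool dma_frozen;    /* set together with mmio_frozen on an error */
    };

    static struct pe_state pe_state_table[256]; /* 256 PEs per PHB on IODA2 */

    static void pe_freeze(unsigned int pe)
    {
        /* HW sets both bits of the pair together when it freezes a PE. */
        pe_state_table[pe].mmio_frozen = true;
        pe_state_table[pe].dma_frozen = true;
    }
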
The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each PHB
is a completely separate HW entity that replicates the entire logic, so it
has its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)
==========================================================

P8 supports up to 256 Partitionable Endpoints per PHB.

* Inbound

  For DMA, MSIs and inbound PCIe error messages, we have a table (in
  memory but accessed in HW by the chip) that provides a direct
  correspondence between a PCIe RID (bus/dev/fn) and a PE number. We call
  this the RTT; a sketch of the lookup follows this list.

  - For DMA, we then provide an entire address space for each PE that can
    contain two "windows", depending on the value of PCI address bit 59.
    Each window can be configured to be remapped via a "TCE table" (IOMMU
    translation table), which has various configurable characteristics
    not described here.

  - For MSIs, we have two windows in the address space (one at the top of
    the 32-bit space and one much higher) which, via a combination of the
    address and MSI value, will result in one of the 2048 interrupts per
    bridge being triggered. There's a PE# in the interrupt controller
    descriptor table as well which is compared with the PE# obtained from
    the RTT to "authorize" the device to emit that specific interrupt.

  - Error messages just use the RTT.

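  As an illustration, the RID-to-PE# correspondence behaves like a flat
  table indexed by the 16-bit RID (a hypothetical C sketch; the real RTT
  sits in memory and is walked directly by the chip)::

      #include <stdint.h>

      /* A 16-bit RID is bus (8 bits) / device (5 bits) / function (3 bits). */
      #define RID(bus, dev, fn)   (((bus) << 8) | ((dev) << 3) | (fn))

      static uint8_t rtt[1 << 16];    /* RID -> PE#, one entry per RID */

      static uint8_t pe_for_rid(unsigned int bus, unsigned int dev,
                                unsigned int fn)
      {
          return rtt[RID(bus, dev, fn)];
      }
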
* Outbound. That's where the tricky part is.

  Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
  from the CPU address space to the PCI address space. There is one M32
  window and sixteen M64 windows. They have different characteristics.
  First, what they have in common: they forward a configurable portion of
  the CPU address space to the PCIe bus, and they must be a naturally
  aligned power of two in size. The rest is different:

  - The M32 window:

    * Is limited to 4GB in size.

    * Drops the top bits of the address (above the size) and replaces
      them with a configurable value. This is typically used to generate
      32-bit PCIe accesses. We configure that window at boot from FW and
      don't touch it from Linux; it's usually set to forward a 2GB
      portion of address space from the CPU to PCIe
      0x8000_0000..0xffff_ffff. (Note: the top 64KB are actually
      reserved for MSIs, but this is not a problem at this point; we just
      need to ensure Linux doesn't assign anything there. The M32 logic
      ignores that, however, and will forward in that space if we try.)

    * Is divided into 256 segments of equal size. A table in the chip
      maps each segment to a PE#. That allows portions of the MMIO space
      to be assigned to PEs on a segment granularity. For a 2GB window,
      the segment granularity is 2GB/256 = 8MB (see the sketch below).

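  A sketch of that segment lookup (hypothetical names): the segment index
  is derived from the address, and the per-segment table in the chip
  supplies the PE#::

      #include <stdint.h>

      #define M32_SEGMENTS    256

      static uint8_t m32_seg_to_pe[M32_SEGMENTS]; /* programmable table */

      /* For a 2GB window, seg_size = 2GB / 256 = 8MB. */
      static uint8_t m32_pe_for_addr(uint64_t addr, uint64_t base,
                                     uint64_t size)
      {
          uint64_t seg_size = size / M32_SEGMENTS;

          return m32_seg_to_pe[(addr - base) / seg_size];
      }
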
  Now, this is the "main" window we use in Linux today (excluding
  SR-IOV). We basically use the trick of forcing the bridge MMIO windows
  onto a segment alignment/granularity so that the space behind a bridge
  can be assigned to a PE.

  Ideally we would like to be able to have individual functions in PEs
  but that would mean using a completely different address allocation
  scheme where individual function BARs can be "grouped" to fit in one or
  more segments.

  - The M64 windows:

    * Must be at least 256MB in size.

    * Do not translate addresses (the address on PCIe is the same as the
      address on the PowerBus). There is a way to also set the top 14
      bits, which are not conveyed by PowerBus, but we don't use this.

    * Can be configured to be segmented. When not segmented, we can
      specify the PE# for the entire window. When segmented, a window
      has 256 segments; however, there is no table for mapping a segment
      to a PE#. The segment number *is* the PE# (see the sketch below).

    * Support overlaps. If an address is covered by multiple windows,
      there's a defined ordering for which window applies.

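  The key difference from M32, shown as a sketch: in a segmented M64
  window there is no lookup table, so the segment number computed from
  the address *is* the PE# (hypothetical names)::

      #include <stdint.h>

      /* Segmented M64: no remapping table; the segment index is the PE#. */
      static uint8_t m64_pe_for_addr(uint64_t addr, uint64_t base,
                                     uint64_t size)
      {
          uint64_t seg_size = size / 256; /* 256 segments per window */

          return (addr - base) / seg_size;
      }
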
  We have code (fairly new compared to the M32 stuff) that exploits that
  for large BARs in 64-bit space:

  We configure an M64 window to cover the entire region of address space
  that has been assigned by FW for the PHB (about 64GB, ignoring the
  space for the M32, which comes out of a different "reserve"). We
  configure it as segmented.

  Then we do the same thing as with M32, using the bridge alignment
  trick, to match BARs to those giant segments.

  Since we cannot remap, we have two additional constraints:

  - We do the PE# allocation *after* the 64-bit space has been assigned,
    because the addresses we use directly determine the PE#. We then
    update the M32 PE# for the devices that use both 32-bit and 64-bit
    spaces, or assign the remaining PE# to 32-bit-only devices.

  - We cannot "group" segments in HW, so if a device ends up using more
    than one segment, we end up with more than one PE#. There is a HW
    mechanism to make the freeze state cascade to "companion" PEs, but
    that only works for PCIe error messages (typically used so that if
    you freeze a switch, it freezes all its children). So we do it in
    SW. We lose a bit of effectiveness of EEH in that case, but that's
    the best we found. So when any of the PEs freezes, we freeze the
    other ones for that "domain". We thus introduce the concept of a
    "master PE", which is the one used for DMA, MSIs, etc., and
    "secondary PEs" that are used for the remaining M64 segments (see
    the sketch below).

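  A minimal sketch of that software cascade (hypothetical structures;
  pe_freeze() is the per-PE freeze helper from the sketch in section 1)::

      /* A master PE grouped with the secondary PEs of its extra segments. */
      struct pe_domain {
          unsigned int master;        /* PE# used for DMA, MSIs, etc. */
          unsigned int nr_secondary;
          unsigned int secondary[8];  /* PE#s of the remaining M64 segments */
      };

      static void pe_domain_freeze(struct pe_domain *d)
      {
          unsigned int i;

          /* HW only cascades freezes for PCIe error messages; do it in SW. */
          pe_freeze(d->master);
          for (i = 0; i < d->nr_secondary; i++)
              pe_freeze(d->secondary[i]);
      }
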
  We would like to investigate using additional M64 windows in "single
  PE" mode to overlay over specific BARs to work around some of that, for
  example for devices with very large BARs, e.g., GPUs. It would make
  sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM
========================================

* SR-IOV Background

  The PCIe SR-IOV feature allows a single Physical Function (PF) to
  support several Virtual Functions (VFs). Registers in the PF's SR-IOV
  Capability control the number of VFs and whether they are enabled.

  When VFs are enabled, they appear in Configuration Space like normal
  PCI devices, but the BARs in VF config space headers are unusual. For
  a non-VF device, software uses BARs in the config space header to
  discover the BAR sizes and assign addresses for them. For VF devices,
  software uses VF BAR registers in the *PF* SR-IOV Capability to
  discover sizes and assign addresses. The BARs in the VF's config space
  header are read-only zeros.

  When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
  base address for all the corresponding VF(n) BARs. For example, if the
  PF SR-IOV Capability is programmed to enable eight VFs, and it has a
  1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
  This region is divided into eight contiguous 1MB regions, each of which
  is a BAR0 for one of the VFs. Note that even though the VF BAR
  describes an 8MB region, the alignment requirement is for a single VF,
  i.e., 1MB in this example.

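  Concretely, VF(n)'s BAR falls at a fixed offset from the base
  programmed in the PF's SR-IOV Capability (a sketch using a 0-based VF
  index; names are hypothetical)::

      #include <stdint.h>

      /*
       * A VF BAR in the PF's SR-IOV Capability sets the base of num_vfs
       * contiguous, equally-sized per-VF BARs.
       */
      static uint64_t vf_bar_addr(uint64_t vf_bar_base, uint64_t vf_bar_size,
                                  unsigned int vf)
      {
          return vf_bar_base + (uint64_t)vf * vf_bar_size;
      }

      /* Eight VFs with a 1MB VF BAR0: an 8MB region; VF3's BAR0 = base + 3MB. */
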
  There are several strategies for isolating VFs in PEs:

  - M32 window: There's one M32 window, and it is split into 256
    equally-sized segments. The finest granularity possible is a 256MB
    window with 1MB segments. VF BARs that are 1MB or larger could be
    mapped to separate PEs in this window. Each segment can be
    individually mapped to a PE via the lookup table, so this is quite
    flexible, but it works best when all the VF BARs are the same size.
    If they are different sizes, the entire window has to be small enough
    that the segment size matches the smallest VF BAR, which means larger
    VF BARs span several segments.

  - Non-segmented M64 window: A non-segmented M64 window is mapped
    entirely to a single PE, so it could only isolate one VF.

  - Single segmented M64 window: A segmented M64 window could be used
    just like the M32 window, but the segments can't be individually
    mapped to PEs (the segment number is the PE#), so there isn't as much
    flexibility. A VF with multiple BARs would have to be in a "domain"
    of multiple PEs, which is not as well isolated as a single PE.

  - Multiple segmented M64 windows: As usual, each window is split into
    256 equally-sized segments, and the segment number is the PE#. But if
    we use several M64 windows, they can be set to different base
    addresses and different segment sizes. If we have VFs that each have
    a 1MB BAR and a 32MB BAR, we could use one M64 window to assign 1MB
    segments and another M64 window to assign 32MB segments (see the
    sketch below).

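  A sketch of that last strategy (the window bases are placeholder
  values): each VF BAR size gets its own segmented M64 window whose
  segment size matches that BAR::

      #include <stdint.h>

      struct m64_window {
          uint64_t base;      /* CPU/PCI address of the window */
          uint64_t seg_size;  /* size of each of the 256 segments */
      };

      /*
       * With a 1MB BAR0 and a 32MB BAR1 per VF, one window uses 1MB
       * segments and another uses 32MB segments; if each VF BAR region
       * starts at the same segment index in its window, VF n's BAR0 and
       * BAR1 resolve to the same PE#.
       */
      static const struct m64_window vf_windows[] = {
          { .base = 0x3d0000000000ull, .seg_size =  1 << 20 },  /* 1MB  */
          { .base = 0x3e0000000000ull, .seg_size = 32 << 20 },  /* 32MB */
      };
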
  Finally, we plan to use M64 windows for SR-IOV, which will be described
  in more detail in the next two sections. For a given VF BAR, we need to
  effectively reserve the entire 256 segments (256 * VF BAR size) and
  position the VF BAR to start at the beginning of a free range of
  segments/PEs inside that M64 window.

  The goal is of course to be able to give a separate PE for each VF.

  The IODA2 platform has 16 M64 windows, which are used to map MMIO
  ranges to PE#s. Each M64 window defines one MMIO range and this range
  is divided into 256 segments, with each segment corresponding to one
  PE.

  We decided to leverage these M64 windows to map VFs to individual PEs,
  since SR-IOV VF BARs are all the same size.

  But doing so introduces another problem: total_VFs is usually smaller
  than the number of M64 window segments, so if we map one VF BAR
  directly to one M64 window, some part of the M64 window will map to
  another device's MMIO range.

  IODA supports 256 PEs, so segmented windows contain 256 segments, so if
  total_VFs is less than 256, we have the situation in Figure 1.0, where
  segments [total_VFs, 255] of the M64 window may map to some MMIO range
  on other devices::

     0      1                  total_VFs - 1
     +------+------+-     -+------+------+
     |      |      |  ...  |      |      |
     +------+------+-     -+------+------+

                  VF(n) BAR space

     0      1                  total_VFs - 1                  255
     +------+------+-     -+------+------+-     -+------+------+
     |      |      |  ...  |      |      |  ...  |      |      |
     +------+------+-     -+------+------+-     -+------+------+

                        M64 window

        Figure 1.0 Direct map VF(n) BAR space

  Our current solution is to allocate 256 segments even if the VF(n) BAR
  space doesn't need that much, as shown in Figure 1.1::

     0      1                  total_VFs - 1                  255
     +------+------+-     -+------+------+-     -+------+------+
     |      |      |  ...  |      |      |  ...  |      |      |
     +------+------+-     -+------+------+-     -+------+------+

                  VF(n) BAR space + extra

     0      1                  total_VFs - 1                  255
     +------+------+-     -+------+------+-     -+------+------+
     |      |      |  ...  |      |      |  ...  |      |      |
     +------+------+-     -+------+------+-     -+------+------+

                        M64 window

        Figure 1.1 Map VF(n) BAR space + extra

  Allocating the extra space ensures that the entire M64 window will be
  assigned to this one SR-IOV device and none of the space will be
  available for other devices. Note that this only expands the space
  reserved in software; there are still only total_VFs VFs, and they only
  respond to segments [0, total_VFs - 1]. There's nothing in hardware
  that responds to segments [total_VFs, 255].
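
  In terms of space, the Figure 1.1 reservation is simply 256 segments'
  worth per VF BAR (a sketch; for example, 16 VFs with a 1MB VF BAR0
  still reserve 256MB rather than 16MB)::

      #include <stdint.h>

      #define M64_SEGMENTS    256 /* one segment per possible PE */

      /*
       * Reserve a full window's worth of space for each VF BAR, even
       * though only segments [0, total_VFs - 1] will ever respond.
       */
      static uint64_t vf_bar_reservation(uint64_t vf_bar_size)
      {
          return (uint64_t)M64_SEGMENTS * vf_bar_size;
      }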

4. Implications for the Generic PCI Code
========================================

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#. If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s. Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window. But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR. If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE. The VF BARs (and therefore the PE#s)
are contiguous. If VF0 is in PE(x), then VF(n) is in PE(x+n). If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of
VF0.
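
Concretely (a sketch with hypothetical names): sliding the base of the
VF(n) BAR space by whole segments selects the starting PE#, so VF i lands
in PE (offset + i)::

    #include <stdint.h>

    /*
     * Place the VF(n) BAR space at segment 'offset' of the 256 reserved
     * segments; VF i then lands in PE (offset + i). Any offset that keeps
     * all numVFs VFs within the 256 segments is a valid choice.
     */
    static uint64_t vf_bar_base_for_offset(uint64_t reserve_base,
                                           uint64_t seg_size,
                                           unsigned int offset)
    {
        return reserve_base + (uint64_t)offset * seg_size;
    }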

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs. This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices: instead of consuming only numVFs segments, the VF(n) BAR space
will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR. That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.