.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It is a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour along with port attributes
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

A devlink port can have one of the following types, depending on its link
layer.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when the link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when the link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when the driver should detect the
       port type automatically.

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
The eswitch on such a PCI device supports ports of multiple controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller
number = 1) doesn't have the eswitch. The local controller (identified by
controller number = 0) has the eswitch. The devlink instance on the local
controller has eswitch devlink ports for both controllers.
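
The relationship between controllers and eswitch ports in the example can be
sketched as a small data model. The Python below is only an illustration:
``EswitchPort``, ``label()`` and ``ports_by_controller()`` are hypothetical
names chosen to mirror the ``ctrl-N`` / ``pfXvfY`` labels of the diagram, and
are not part of any kernel or devlink API.

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional


    @dataclass(frozen=True)
    class EswitchPort:
        """Hypothetical model of one eswitch devlink port from the diagram."""
        controller: int            # 0 = local controller, 1 = external one
        pf: int                    # PCI physical function number
        vf: Optional[int] = None   # set only for a VF port
        sf: Optional[int] = None   # set only for an SF port

        def label(self) -> str:
            # Mirrors the "ctrl-N" / "pfXvfY" labels used in the diagram.
            name = f"pf{self.pf}"
            if self.vf is not None:
                name += f"vf{self.vf}"
            elif self.sf is not None:
                name += f"sf{self.sf}"
            return f"ctrl-{self.controller} {name}"


    def ports_by_controller(ports):
        """Group eswitch port labels by the controller they belong to."""
        groups = {}
        for port in ports:
            groups.setdefault(port.controller, []).append(port.label())
        return groups


    ports = [
        EswitchPort(controller=0, pf=0),
        EswitchPort(controller=0, pf=0, vf=1),
        EswitchPort(controller=1, pf=0),
        EswitchPort(controller=1, pf=1, sf=88),
    ]
    print(ports_by_controller(ports))

All four ports live on the single eswitch of the local controller; the
controller number is what keeps ``pf0`` of the server apart from ``pf0`` of
the SmartNIC.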

Function configuration
======================

Users can configure one or more function attributes before enumerating the PCI
function. Usually this means the user should configure function attributes
before a bus-specific device for the function is created. However, when
SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, function attributes should be configured before binding the virtual
function device to the driver. For subfunctions, this means the user should
configure port function attributes before activating the port function.

A user may set the hardware address of the function using the
`devlink port function set hw_addr` command. For an Ethernet port function
this means a MAC address.

Users may also set the RoCE capability of the function using the
`devlink port function set roce` command.

Users may also set the function as migratable using the
`devlink port function set migratable` command.

Function attributes
===================

MAC address setup
-----------------
The configured MAC address of the PCI VF/SF will be used by the netdevice and
rdma device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the SF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:88:88
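
When scripting around these commands it can help to turn the text output into
a structure. The Python sketch below parses exactly the output layout shown in
the examples above (a port line of ``key value`` pairs followed by a
``function:`` block); ``parse_port_show()`` is a hypothetical helper, and it
assumes that every attribute is printed as a two-token pair, which may not
hold for every devlink version or port flavour.

.. code-block:: python

    def parse_port_show(text):
        """Parse one port entry in the text layout shown in this document."""
        lines = [line.strip() for line in text.strip().splitlines()]
        # First line: "<bus>/<device>/<index>: key value key value ..."
        handle, _, rest = lines[0].partition(": ")
        bus, device, index = handle.split("/")
        tokens = rest.split()
        port = dict(zip(tokens[0::2], tokens[1::2]))
        port.update(bus=bus, device=device, index=int(index))
        # Remaining lines: the "function:" marker and its attribute pairs.
        function = {}
        for line in lines[1:]:
            if line == "function:":
                continue
            tokens = line.split()
            function.update(zip(tokens[0::2], tokens[1::2]))
        port["function"] = function
        return port


    sample = (
        "pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf"
        " pfnum 0 vfnum 1\n"
        "  function:\n"
        "    hw_addr 00:11:22:33:44:55\n"
    )
    port = parse_port_show(sample)
    print(port["index"], port["flavour"], port["function"]["hw_addr"])

Numeric attributes such as ``pfnum`` are kept as strings here; only the port
index is converted, since it is the part of the handle used to address the
port.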

RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.

Disabling RoCE capability saves system memory per PCI VF/SF.

When the user disables RoCE capability for a VF/SF, user applications cannot
send or receive any RoCE packets through this VF/SF and the RoCE GID table for
this PCI VF/SF will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 roce disable

Migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When the user enables migratable capability for a VF, and the hypervisor (HV)
binds the VF to a VFIO driver with migration support, the user can migrate the
VM with this VF from one HV to a different one.

However, when migratable capability is enabled, the device will disable
features which cannot be migrated. The migratable capability can thus impose
limitations on a VF, so the decision is left to the user.

Example of live migration with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

- Attach VF to the VM.
- Start the VM.
- Perform live migration.

Subfunction
===========

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. Subfunctions are created and deployed individually, in
units of one. Unlike SRIOV VFs, a subfunction doesn't require its own PCI
virtual function. A subfunction communicates with the hardware through the
parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs setup on the subfunction management device.

(1) Create
----------
A subfunction is created using the devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and
the representor netdevice.

(2) Configure
-------------
At this point the subfunction devlink port is created but it is not active
yet. That means the entities are created on the devlink side and the e-switch
port representor is created, but the subfunction device itself is not created.
A user might use the e-switch port representor to do settings, such as putting
it into a bridge or adding TC rules. A user might also configure the hardware
address (such as the MAC address) of the subfunction while the subfunction is
inactive.

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.
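
The ordering constraint in the three steps can be made explicit with a small
state model. This Python sketch is an illustration only: the ``Subfunction``
class and its methods are hypothetical, and raising an error on late
configuration is a deliberately strict reading of "configure before
activating", not a statement about what any particular driver does.

.. code-block:: python

    class Subfunction:
        """Hypothetical model of the create/configure/deploy sequence."""

        def __init__(self, pfnum, sfnum):
            # (1) Create: the devlink port exists, the device does not yet.
            self.pfnum = pfnum
            self.sfnum = sfnum
            self.hw_addr = "00:00:00:00:00:00"
            self.active = False

        def set_hw_addr(self, hw_addr):
            # (2) Configure: only allowed while the port function is inactive.
            if self.active:
                raise RuntimeError("configure the function before activating it")
            self.hw_addr = hw_addr

        def activate(self):
            # (3) Deploy: from here on the subfunction device would exist.
            self.active = True


    sf = Subfunction(pfnum=0, sfnum=88)
    sf.set_hw_addr("00:00:00:00:88:88")
    sf.activate()
    print(sf.hw_addr, sf.active)

Calling ``set_hw_addr()`` again after ``activate()`` raises ``RuntimeError``
in this model, which is the mistake the setup sequence is meant to prevent.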

Rate object management
======================

Devlink provides an API to manage the TX rates of a single devlink port or of
a group. This is done through rate objects, which can be one of two types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a 1-to-1 mapping to its devlink port, in userspace it is referred
  to as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted by
  request from userspace; initially empty (no rate objects added). In
  userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier except a decimal number, to avoid
  collisions with leafs.
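
The naming rule means the type of a rate object can be read off its handle.
The Python sketch below shows that rule; ``classify_rate_object()`` is a
hypothetical helper written for this illustration and is not part of devlink.

.. code-block:: python

    def classify_rate_object(handle):
        """Return ("leaf", port_index) or ("node", node_name) for a handle."""
        bus, device, name = handle.split("/", 2)
        if not name:
            raise ValueError("rate object name is empty")
        # A purely decimal last component is a port index, hence a leaf.
        if name.isdecimal():
            return ("leaf", int(name))
        return ("node", name)


    print(classify_rate_object("pci/0000:03:00.0/1"))
    print(classify_rate_object("pci/0000:03:00.0/group1"))

A node called ``1`` would be indistinguishable from the leaf of port index 1,
which is why decimal node names are excluded.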

The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects, or among the rate
  objects that are part of the parent group, if the object is part of a group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority, the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of the Weighted Fair Queuing (WFQ) arbitration scheme among
  siblings. This arbitration scheme can be used simultaneously with
  strict priority. A node configured with a higher weight gets more
  bandwidth (BW) relative to its siblings. Values are relative, like
  percentage points: they tell how much BW a node should take relative to
  its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.
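
The effect of a parent's ``tx_max`` on its children can be sketched as a walk
up the parent chain. The Python below is an illustrative model, not driver
code: ``RateObject`` and ``effective_tx_max()`` are hypothetical names, the
rate unit is left abstract, and treating the tightest limit on the chain as
the effective one is an assumption drawn from "``tx_max`` is an upper limit
for children".

.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional


    @dataclass
    class RateObject:
        name: str
        tx_max: Optional[int] = None            # None: no limit configured
        parent: Optional["RateObject"] = None   # None: not part of a group


    def effective_tx_max(obj):
        """Tightest tx_max found on the object and its chain of parents."""
        limits = []
        while obj is not None:
            if obj.tx_max is not None:
                limits.append(obj.tx_max)
            obj = obj.parent
        return min(limits) if limits else None


    group = RateObject("group1", tx_max=100)
    fast = RateObject("1", tx_max=500, parent=group)
    slow = RateObject("2", tx_max=10, parent=group)
    print(effective_tx_max(fast), effective_tx_max(slow))

In this model a leaf can be limited further than its group, but it can never
exceed the group's limit.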

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow at a high level:

#. Choose a node, or group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority sub-group are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.
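
The flow above can be sketched in a few lines of Python. This is a toy model
under stated assumptions, not how any device implements its scheduler:
``Node`` and ``pick_leaf()`` are hypothetical names, "within the BW limit and
not blocked" is reduced to a single ``eligible`` flag, weights are assumed to
be positive, and WFQ is approximated by picking the node with the smallest
``served / tx_weight`` ratio.

.. code-block:: python

    from dataclasses import dataclass, field
    from typing import List


    @dataclass
    class Node:
        name: str
        tx_priority: int = 0
        tx_weight: int = 1
        eligible: bool = True      # within its BW limit and not blocked
        served: int = 0            # selections so far, used by the WFQ stand-in
        children: List["Node"] = field(default_factory=list)


    def pick_leaf(siblings):
        """Arbitrate among siblings, then descend until a leaf is reached."""
        candidates = [n for n in siblings if n.eligible]
        if not candidates:
            return None
        # Strict priority: keep only the highest-priority eligible nodes.
        top = max(n.tx_priority for n in candidates)
        subgroup = [n for n in candidates if n.tx_priority == top]
        # WFQ stand-in: the node furthest behind its weighted share wins.
        winner = min(subgroup, key=lambda n: n.served / n.tx_weight)
        leaf = pick_leaf(winner.children) if winner.children else winner
        if leaf is not None:
            winner.served += 1
        return leaf


    a = Node("a", tx_priority=1)
    b = Node("b", tx_priority=0, tx_weight=3)
    c = Node("c", tx_priority=0, tx_weight=1)
    first = pick_leaf([a, b, c]).name    # highest priority is served first
    a.eligible = False                   # pretend "a" has used up its BW
    rest = [pick_leaf([a, b, c]).name for _ in range(4)]
    print(first, rest)

Here ``a`` wins while it is eligible. Once it is not, the lower priority
subgroup takes over and ``b`` is selected three times for each selection of
``c``, matching the 3:1 weights.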

Driver implementations are allowed to support both or either rate object types
and setting methods of their parameters. Additionally, a driver implementation
may export nodes/leafs and their child-parent relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device having one or more PCI buses. It consists of one
       or more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.