.. SPDX-License-Identifier: GPL-2.0

.. _devlink_port:

============
Devlink Port
============

``devlink-port`` is a port that exists on the device. It is a logically
separate ingress/egress point of the device. A devlink port can be any one
of many flavours. A devlink port flavour, along with the port attributes,
describes what a port represents.

A device driver that intends to publish a devlink port sets the
devlink port attributes and registers the devlink port.

Devlink port flavours are described below.

.. list-table:: List of devlink port flavours
   :widths: 33 90

   * - Flavour
     - Description
   * - ``DEVLINK_PORT_FLAVOUR_PHYSICAL``
     - Any kind of physical port. This can be an eswitch physical port or any
       other physical port on the device.
   * - ``DEVLINK_PORT_FLAVOUR_DSA``
     - This indicates a DSA interconnect port.
   * - ``DEVLINK_PORT_FLAVOUR_CPU``
     - This indicates a CPU port applicable only to DSA.
   * - ``DEVLINK_PORT_FLAVOUR_PCI_PF``
     - This indicates an eswitch port representing a port of PCI
       physical function (PF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_VF``
     - This indicates an eswitch port representing a port of PCI
       virtual function (VF).
   * - ``DEVLINK_PORT_FLAVOUR_PCI_SF``
     - This indicates an eswitch port representing a port of PCI
       subfunction (SF).
   * - ``DEVLINK_PORT_FLAVOUR_VIRTUAL``
     - This indicates a virtual port for the PCI virtual function.

A devlink port can have one of the types described below, based on its link
layer.

.. list-table:: List of devlink port types
   :widths: 23 90

   * - Type
     - Description
   * - ``DEVLINK_PORT_TYPE_ETH``
     - Driver should set this port type when the link layer of the port is
       Ethernet.
   * - ``DEVLINK_PORT_TYPE_IB``
     - Driver should set this port type when the link layer of the port is
       InfiniBand.
   * - ``DEVLINK_PORT_TYPE_AUTO``
     - This type is indicated by the user when the driver should detect the
       port type automatically.

PCI controllers
---------------
In most cases a PCI device has only one controller. A controller consists of
potentially multiple physical functions, virtual functions and subfunctions.
A function consists of one or more ports. Each such port is represented by a
devlink eswitch port.

A PCI device connected to multiple CPUs or multiple PCI root complexes or a
SmartNIC, however, may have multiple controllers. For a device with multiple
controllers, each controller is distinguished by a unique controller number.
The eswitch on such a PCI device supports ports of multiple controllers.

An example view of a system with two controllers::

                 ---------------------------------------------------------
                 |                                                       |
                 |           --------- ---------         ------- ------- |
    -----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
    | connect |  | -------                       -------                 |
    -----------  |     | controller_num=1 (no eswitch)                   |
                 ------|--------------------------------------------------
                 (internal wire)
                       |
                 ---------------------------------------------------------
                 | devlink eswitch ports and reps                        |
                 | ----------------------------------------------------- |
                 | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
                 | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
                 | ----------------------------------------------------- |
                 |                                                       |
                 |                                                       |
    -----------  |           --------- ---------         ------- ------- |
    | smartNIC|  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
    | pci rc  |==| -------   ----/---- ---/----- ------- ---/--- ---/--- |
    | connect |  | | pf0 |______/________/       | pf1 |___/_______/     |
    -----------  | -------                       -------                 |
                 |                                                       |
                 |  local controller_num=0 (eswitch)                     |
                 ---------------------------------------------------------

In the above example, the external controller (identified by controller
number = 1) doesn't have the eswitch. The local controller (identified by
controller number = 0) has the eswitch. The devlink instance on the local
controller has eswitch devlink ports for both controllers.
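
The port table held by the local eswitch can be modelled in a few lines. The
following is a minimal Python sketch, not kernel code: the ``EswitchPort``
class and the ``eswitch_ports()`` helper are hypothetical names introduced
for illustration, and the labels only mirror the ones used in the diagram.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EswitchPort:
    """One eswitch port, keyed by controller, PF and optional VF/SF index."""
    controller: int
    pfnum: int
    kind: str = "pf"              # "pf", "vf" or "sf" (illustrative values)
    index: Optional[int] = None   # VF or SF number; unused for a PF port

    def label(self) -> str:
        # Build a label in the style of the diagram, e.g. "ctrl-1 pf0vf3".
        suffix = "" if self.kind == "pf" else f"{self.kind}{self.index}"
        return f"ctrl-{self.controller} pf{self.pfnum}{suffix}"


def eswitch_ports(controllers, pfs, vfs_per_pf, sfs_per_pf) -> List[EswitchPort]:
    """Enumerate the ports a single eswitch would hold for all controllers."""
    ports = []
    for ctrl in controllers:
        for pf in pfs:
            ports.append(EswitchPort(ctrl, pf))
            ports.extend(EswitchPort(ctrl, pf, "vf", i) for i in range(vfs_per_pf))
            ports.extend(EswitchPort(ctrl, pf, "sf", i) for i in range(sfs_per_pf))
    return ports


if __name__ == "__main__":
    # Two controllers, two PFs each, one VF and one SF per PF.
    for port in eswitch_ports([0, 1], [0, 1], vfs_per_pf=1, sfs_per_pf=1):
        print(port.label())
```

The point of the sketch is that the controller number is part of the port
identity: the same PF/VF/SF numbers can appear once per controller on the
one eswitch.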

Function configuration
======================

Users can configure one or more function attributes before enumerating the PCI
function. Usually this means that the user should configure a function
attribute before a bus-specific device for the function is created. However,
when SRIOV is enabled, virtual function devices are created on the PCI bus.
Hence, a function attribute should be configured before binding the virtual
function device to the driver. For subfunctions, this means the user should
configure the port function attribute before activating the port function.

A user may set the hardware address of the function using the
``devlink port function set hw_addr`` command. For an Ethernet port function
this means a MAC address.

Users may also set the RoCE capability of the function using the
``devlink port function set roce`` command.

Users may also set the function as migratable using the
``devlink port function set migratable`` command.

Users may also set the IPsec crypto capability of the function using the
``devlink port function set ipsec_crypto`` command.

Users may also set the IPsec packet capability of the function using the
``devlink port function set ipsec_packet`` command.

Users may also set the maximum IO event queues of the function
using the ``devlink port function set max_io_eqs`` command.
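
When these settings are scripted, it can help to build the command line from
a checked set of attribute names. The sketch below is a hypothetical helper
(``function_set_argv`` is not part of any devlink tooling); it only assembles
an argument list from the attribute names listed above and does not run
anything.

```python
# Attribute names taken from the commands listed in this document.
KNOWN_ATTRS = {
    "hw_addr", "roce", "migratable",
    "ipsec_crypto", "ipsec_packet", "max_io_eqs",
}


def function_set_argv(port, **attrs):
    """Return the argv for `devlink port function set <port> <attr> <value>...`.

    `port` is a devlink port handle such as "pci/0000:06:00.0/2".
    Unknown attribute names are rejected instead of being passed through.
    """
    unknown = set(attrs) - KNOWN_ATTRS
    if unknown:
        raise ValueError(f"unknown function attribute(s): {sorted(unknown)}")
    argv = ["devlink", "port", "function", "set", port]
    for name, value in attrs.items():   # keyword order is preserved
        argv += [name, str(value)]
    return argv


if __name__ == "__main__":
    print(" ".join(function_set_argv("pci/0000:06:00.0/2",
                                     hw_addr="00:11:22:33:44:55",
                                     roce="disable")))
```

The resulting list can be handed to ``subprocess.run()`` on a system where
the device and the ``devlink`` tool are present.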

Function attributes
===================

MAC address setup
-----------------
The configured MAC address of the PCI VF/SF will be used by the netdevice and
RDMA device created for the PCI VF/SF.

- Get the MAC address of the VF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the VF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:11:22:33:44:55

- Get the MAC address of the SF identified by its unique devlink port index::

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:00:00

- Set the MAC address of the SF identified by its unique devlink port index::

    $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88

    $ devlink port show pci/0000:06:00.0/32768
    pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
      function:
        hw_addr 00:00:00:00:88:88
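
To check a setting from a script, the text output shown above can be turned
into a dictionary. This is a rough sketch under a stated assumption: it
expects the output to consist only of ``key value`` pairs, exactly as in the
examples in this document, and ``parse_port_show`` is a hypothetical helper
name. Real output may contain fields that do not follow this pattern.

```python
def parse_port_show(text):
    """Parse the `devlink port show` output of one port into a dict.

    Assumes the layout used in this document: a header line
    "<handle>: key value ...", optionally followed by a "function:" line
    and further "key value" pairs describing the port function.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # The handle itself contains ':' (PCI address), so split on ": ".
    handle, _, rest = lines[0].partition(": ")
    tokens = rest.split()
    port = {"handle": handle}
    port.update(zip(tokens[0::2], tokens[1::2]))
    port["function"] = {}
    if "function:" in lines:
        start = lines.index("function:") + 1
        ftokens = " ".join(lines[start:]).split()
        port["function"] = dict(zip(ftokens[0::2], ftokens[1::2]))
    return port


if __name__ == "__main__":
    sample = (
        "pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 "
        "flavour pcivf pfnum 0 vfnum 1\n"
        "  function:\n"
        "    hw_addr 00:11:22:33:44:55\n"
    )
    print(parse_port_show(sample)["function"]["hw_addr"])
```

All values are kept as strings; converting ``pfnum`` or ``vfnum`` to
integers is left to the caller.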

RoCE capability setup
---------------------
Not all PCI VFs/SFs require RoCE capability.

Disabling RoCE capability saves system memory per PCI VF/SF.

When a user disables RoCE capability for a VF/SF, user applications cannot
send or receive any RoCE packets through this VF/SF and the RoCE GID table for
this PCI function will be empty.

When RoCE capability is disabled in the device using the port function
attribute, the VF/SF driver cannot override it.

- Get RoCE capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 roce enable

- Set RoCE capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 roce disable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 roce disable

Migratable capability setup
---------------------------
Live migration is the process of transferring a live virtual machine
from one physical host to another without disrupting its normal
operation.

Users who want PCI VFs to be able to perform live migration need to
explicitly enable the VF migratable capability.

When a user enables the migratable capability for a VF, and the hypervisor
(HV) binds the VF to a VFIO driver with migration support, the user can
migrate the VM with this VF from one HV to a different one.

However, when the migratable capability is enabled, the device disables
features which cannot be migrated. The migratable capability can therefore
impose limitations on a VF, so the decision is left to the user.

Example of live migration (LM) with migratable function configuration:

- Get migratable capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 migratable disable

- Set migratable capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 migratable enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 migratable enable

- Bind VF to VFIO driver with migration support::

    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
    $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
    $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind

- Attach VF to the VM.
- Start the VM.
- Perform live migration.

IPsec crypto capability setup
-----------------------------
When a user enables the IPsec crypto capability for a VF, user applications
can offload XFRM state crypto operations (encrypt/decrypt) to this VF.

When the IPsec crypto capability is disabled (default) for a VF, the XFRM
state is processed in software by the kernel.

- Get IPsec crypto capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_crypto disabled

- Set IPsec crypto capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_crypto enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_crypto enabled

IPsec packet capability setup
-----------------------------
When a user enables the IPsec packet capability for a VF, user applications
can offload XFRM state and policy crypto operations (encrypt/decrypt) to this
VF, as well as IPsec encapsulation.

When the IPsec packet capability is disabled (default) for a VF, the XFRM
state and policy are processed in software by the kernel.

- Get IPsec packet capability of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet disabled

- Set IPsec packet capability of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 ipsec_packet enable

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet enabled

Maximum IO event queues setup
-----------------------------
When a user sets the maximum number of IO event queues for an SF or a VF,
the function driver is limited to consuming at most the enforced number of
IO event queues.

IO event queues deliver events related to IO queues, including network
device transmit and receive queues (txq and rxq) and RDMA Queue Pairs (QPs).
For example, the number of netdevice channels and RDMA device completion
vectors are derived from the function's IO event queues. Usually, the number
of interrupt vectors consumed by the driver is limited by the number of IO
event queues per device, as each of the IO event queues is connected to an
interrupt vector.

- Get maximum IO event queues of the VF device::

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 10

- Set maximum IO event queues of the VF device::

    $ devlink port function set pci/0000:06:00.0/2 max_io_eqs 32

    $ devlink port show pci/0000:06:00.0/2
    pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
      function:
        hw_addr 00:00:00:00:00:00 ipsec_packet disabled max_io_eqs 32

Subfunction
===========

A subfunction is a lightweight function that has a parent PCI function on
which it is deployed. Subfunctions are created and deployed in units of 1.
Unlike SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.

To use a subfunction, a three-step setup sequence is followed:

1) create - create a subfunction;
2) configure - configure subfunction attributes;
3) deploy - deploy the subfunction;

Subfunction management is done using the devlink port user interface.
The user performs setup on the subfunction management device.

(1) Create
----------
A subfunction is created using the devlink port interface. A user adds the
subfunction by adding a devlink port of subfunction flavour. The devlink
kernel code calls down to the subfunction management driver (devlink ops) and
asks it to create a subfunction devlink port. The driver then instantiates the
subfunction port and any associated objects such as health reporters and the
representor netdevice.

(2) Configure
-------------
At this point the subfunction devlink port is created but it is not active
yet. That means the entities are created on the devlink side and the e-switch
port representor is created, but the subfunction device itself is not created.
A user might use the e-switch port representor to do settings, putting it into
a bridge, adding TC rules, etc. A user might as well configure the hardware
address (such as the MAC address) of the subfunction while the subfunction is
inactive.

(3) Deploy
----------
Once a subfunction is configured, the user must activate it to use it. Upon
activation, the subfunction management driver asks the subfunction management
device to instantiate the subfunction device on a particular PCI function.
A subfunction device is created on the :ref:`Documentation/driver-api/auxiliary_bus.rst <auxiliary_bus>`.
At this point a matching subfunction driver binds to the subfunction's auxiliary device.
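
The ordering constraint of the three steps can be summarised as a small state
machine. The following Python sketch is purely illustrative: the
``Subfunction`` class, its methods and the state strings are hypothetical
names chosen for this example and do not correspond to kernel or devlink
code.

```python
class Subfunction:
    """Toy model of the create -> configure -> deploy sequence."""

    def __init__(self, pfnum, sfnum):
        # (1) Create: the devlink port exists, the SF device does not yet.
        self.pfnum = pfnum
        self.sfnum = sfnum
        self.hw_addr = "00:00:00:00:00:00"
        self.state = "inactive"

    def configure(self, hw_addr):
        # (2) Configure: attributes are set while the port is inactive.
        if self.state != "inactive":
            raise RuntimeError("configure the function before activating it")
        self.hw_addr = hw_addr

    def activate(self):
        # (3) Deploy: activation is the point where the SF device would be
        # instantiated and a subfunction driver could bind to it.
        self.state = "active"


if __name__ == "__main__":
    sf = Subfunction(pfnum=0, sfnum=88)
    sf.configure("00:00:00:00:88:88")
    sf.activate()
    print(sf.state, sf.hw_addr)
```

The model rejects configuration after activation, which reflects the rule
stated earlier that port function attributes should be configured before the
port function is activated.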

Rate object management
======================

Devlink provides an API to manage the TX rates of a single devlink port or a
group of ports. This is done through rate objects, which can be one of two
types:

``leaf``
  Represents a single devlink port; created/destroyed by the driver. Since a
  leaf has a 1:1 mapping to its devlink port, in userspace it is referred to
  as ``pci/<bus_addr>/<port_index>``;

``node``
  Represents a group of rate objects (leafs and/or nodes); created/deleted by
  request from userspace; initially empty (no rate objects added). In
  userspace it is referred to as ``pci/<bus_addr>/<node_name>``, where
  ``node_name`` can be any identifier except a decimal number, to avoid
  collisions with leafs.
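
The naming rule means the last component of a handle is enough to tell the
two rate object types apart. A minimal sketch, assuming handles of the form
shown above (``rate_object_kind`` is a hypothetical helper, not a devlink
API):

```python
def rate_object_kind(handle):
    """Classify a rate object handle as "leaf" or "node".

    A leaf is addressed by its decimal port index, a node by a name
    that is not a decimal number.
    """
    # Split into bus name, bus address and the final component.
    _bus, _addr, name = handle.split("/", 2)
    is_decimal = name.isascii() and name.isdigit()
    return "leaf" if is_decimal else "node"


if __name__ == "__main__":
    print(rate_object_kind("pci/0000:03:00.0/1"))
    print(rate_object_kind("pci/0000:03:00.0/group1"))
```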

The API allows configuring the following rate object parameters:

``tx_share``
  Minimum TX rate value shared among all other rate objects, or among the
  rate objects that are part of the parent group, if it is a part of the same
  group.

``tx_max``
  Maximum TX rate value.

``tx_priority``
  Allows for usage of a strict priority arbiter among siblings. This
  arbitration scheme attempts to schedule nodes based on their priority
  as long as the nodes remain within their bandwidth limit. The higher the
  priority the higher the probability that the node will get selected for
  scheduling.

``tx_weight``
  Allows for usage of the Weighted Fair Queuing (WFQ) arbitration scheme
  among siblings. This arbitration scheme can be used simultaneously with
  strict priority. As a node is configured with a higher rate it gets more
  BW relative to its siblings. Values are relative, like percentage points:
  they tell how much BW a node should take relative to its siblings.

``parent``
  Parent node name. Parent node rate limits are considered as additional limits
  to all node children limits. ``tx_max`` is an upper limit for children.
  ``tx_share`` is a total bandwidth distributed among children.
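
One way to read the ``parent`` rule for ``tx_max`` is that a rate object is
bounded by its own limit and by the limit of every ancestor. The sketch below
illustrates that reading only; ``RateObject`` and ``effective_tx_max`` are
hypothetical names, rates are plain integers in an arbitrary unit, and
``None`` is used here to mean that no limit was configured.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RateObject:
    name: str
    tx_max: Optional[int] = None            # None: no limit configured
    parent: Optional["RateObject"] = None   # enclosing node, if any


def effective_tx_max(obj):
    """Return the tightest tx_max along the chain from obj up to the root."""
    limit = None
    while obj is not None:
        if obj.tx_max is not None:
            limit = obj.tx_max if limit is None else min(limit, obj.tx_max)
        obj = obj.parent
    return limit


if __name__ == "__main__":
    group = RateObject("group1", tx_max=100)
    leaf = RateObject("1", tx_max=500, parent=group)
    print(effective_tx_max(leaf))
```

How a given device enforces these limits is up to the driver and hardware;
the sketch only shows the arithmetic of nested upper bounds.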

``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
nodes with the same priority form a WFQ subgroup in the sibling group
and arbitration among them is based on assigned weights.

Arbitration flow from the high level:

#. Choose a node, or a group of nodes, with the highest priority that stays
   within the BW limit and is not blocked. Use ``tx_priority`` as a
   parameter for this arbitration.

#. If a group of nodes has the same priority, perform WFQ arbitration on
   that subgroup. Use ``tx_weight`` as a parameter for this arbitration.

#. Select the winner node, and continue the arbitration flow among its
   children, until a leaf node is reached and the winner is established.

#. If all the nodes from the highest priority sub-group are satisfied, or
   have overused their assigned BW, move to the lower priority nodes.
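
The flow above can be sketched in software as follows. This is a simplified
illustration, not how any particular device implements it. Assumptions made
for the example: a larger ``tx_priority`` number means higher priority, the
``eligible`` flag stands in for "within the BW limit and not blocked", and
WFQ is approximated by picking the node that has been served least relative
to its ``tx_weight``. All class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    tx_priority: int = 0
    tx_weight: int = 1
    eligible: bool = True     # within its BW limit and not blocked
    children: List["Node"] = field(default_factory=list)
    served: int = 0           # scheduling rounds won so far


def pick(siblings):
    """Strict priority first, then a WFQ-like choice inside the subgroup."""
    candidates = [n for n in siblings if n.eligible]
    if not candidates:
        return None
    top = max(n.tx_priority for n in candidates)
    subgroup = [n for n in candidates if n.tx_priority == top]
    # Least service relative to weight wins; ties keep list order.
    return min(subgroup, key=lambda n: n.served / n.tx_weight)


def schedule(roots):
    """Walk from the top level down to a leaf and return the winning leaf."""
    path, level = [], roots
    while level:
        node = pick(level)
        if node is None:
            return None       # simplification: no fallback to other branches
        path.append(node)
        level = node.children
    for node in path:
        node.served += 1
    return path[-1] if path else None


if __name__ == "__main__":
    a = Node("a", tx_priority=1, tx_weight=3)
    b = Node("b", tx_priority=1, tx_weight=1)
    c = Node("c", tx_priority=0)
    print([schedule([a, b, c]).name for _ in range(4)])
```

In this model ``c`` is only scheduled once ``a`` and ``b`` stop being
eligible, which corresponds to the last step of the flow.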

Driver implementations are allowed to support both or either rate object types
and setting methods of their parameters. Additionally, a driver implementation
may export nodes/leafs and their child-parent relationships.

Terms and Definitions
=====================

.. list-table:: Terms and Definitions
   :widths: 22 90

   * - Term
     - Definitions
   * - ``PCI device``
     - A physical PCI device having one or more PCI buses; it consists of one
       or more PCI controllers.
   * - ``PCI controller``
     - A controller consists of potentially multiple physical functions,
       virtual functions and subfunctions.
   * - ``Port function``
     - An object to manage the function of a port.
   * - ``Subfunction``
     - A lightweight function that has a parent PCI function on which it is
       deployed.
   * - ``Subfunction device``
     - A bus device of the subfunction, usually on an auxiliary bus.
   * - ``Subfunction driver``
     - A device driver for the subfunction auxiliary device.
   * - ``Subfunction management device``
     - A PCI physical function that supports subfunction management.
   * - ``Subfunction management driver``
     - A device driver for a PCI physical function that supports
       subfunction management using the devlink port interface.
   * - ``Subfunction host driver``
     - A device driver for a PCI physical function that hosts subfunction
       devices. In most cases it is the same as the subfunction management
       driver. When a subfunction is used on an external controller, the
       subfunction management and host drivers are different.