.. SPDX-License-Identifier: GPL-2.0

=============================
Network Function Representors
=============================

This document describes the semantics and usage of representor netdevices, as
used to control internal switching on SmartNICs. For the closely-related port
representors on physical (multi-port) switches, see
:ref:`Documentation/networking/switchdev.rst <switchdev>`.

Motivation
----------

Since the mid-2010s, network cards have started offering more complex
virtualisation capabilities than the legacy SR-IOV approach (with its simple
MAC/VLAN-based switching model) can support. This led to a desire to offload
software-defined networks (such as OpenVSwitch) to these NICs to specify the
network connectivity of each function. The resulting designs are variously
called SmartNICs or DPUs.

Network function representors bring the standard Linux networking stack to
virtual switches and IOV devices. Just as each physical port of a Linux-
controlled switch has a separate netdev, so does each virtual port of a virtual
switch.
When the system boots, and before any offload is configured, all packets from
the virtual functions appear in the networking stack of the PF via the
representors. The PF can thus always communicate freely with the virtual
functions.
The PF can configure standard Linux forwarding between representors, the uplink
or any other netdev (routing, bridging, TC classifiers).

Thus, a representor is both a control plane object (representing the function in
administrative commands) and a data plane object (one end of a virtual pipe).
As a virtual link endpoint, the representor can be configured like any other
netdevice; in some cases (e.g. link state) the representee will follow the
representor's configuration, while in others there are separate APIs to
configure the representee.

Definitions
-----------

This document uses the term "switchdev function" to refer to the PCIe function
which has administrative control over the virtual switch on the device.
Typically, this will be a PF, but conceivably a NIC could be configured to grant
these administrative privileges instead to a VF or SF (subfunction).
Depending on NIC design, a multi-port NIC might have a single switchdev function
for the whole device or might have a separate virtual switch, and hence
switchdev function, for each physical network port.
If the NIC supports nested switching, there might be separate switchdev
functions for each nested switch, in which case each switchdev function should
only create representors for the ports on the (sub-)switch it directly
administers.

A "representee" is the object that a representor represents. So for example in
the case of a VF representor, the representee is the corresponding VF.

What does a representor do?
---------------------------

A representor has three main roles.

1. It is used to configure the network connection the representee sees, e.g.
   link up/down, MTU, etc. For instance, bringing the representor
   administratively UP should cause the representee to see a link up / carrier
   on event.
2. It provides the slow path for traffic which does not hit any offloaded
   fast-path rules in the virtual switch. Packets transmitted on the
   representor netdevice should be delivered to the representee; packets
   transmitted by the representee which fail to match any switching rule should
   be received on the representor netdevice. (That is, there is a virtual pipe
   connecting the representor to the representee, similar in concept to a veth
   pair.)
   This allows software switch implementations (such as OpenVSwitch or a Linux
   bridge) to forward packets between representees and the rest of the network.
3. It acts as a handle by which switching rules (such as TC filters) can refer
   to the representee, allowing these rules to be offloaded.

The combination of 2) and 3) means that the behaviour (apart from performance)
should be the same whether a TC filter is offloaded or not. E.g. a TC rule
on a VF representor applies in software to packets received on that representor
netdevice, while in hardware offload it would apply to packets transmitted by
the representee VF. Conversely, a mirred egress redirect to a VF representor
corresponds in hardware to delivery directly to the representee VF.

What functions should have a representor?
-----------------------------------------

Essentially, for each virtual port on the device's internal switch, there
should be a representor.
Some vendors have chosen to omit representors for the uplink and the physical
network port, which can simplify usage (the uplink netdev becomes in effect the
physical port's representor) but does not generalise to devices with multiple
ports or uplinks.

Thus, the following should all have representors:

 - VFs belonging to the switchdev function.
 - Other PFs on the local PCIe controller, and any VFs belonging to them.
 - PFs and VFs on external PCIe controllers on the device (e.g. for any embedded
   System-on-Chip within the SmartNIC).
 - PFs and VFs with other personalities, including network block devices (such
   as a vDPA virtio-blk PF backed by remote/distributed storage), if (and only
   if) their network access is implemented through a virtual switch port. [#]_
   Note that such functions can require a representor despite the representee
   not having a netdev.
 - Subfunctions (SFs) belonging to any of the above PFs or VFs, if they have
   their own port on the switch (as opposed to using their parent PF's port).
 - Any accelerators or plugins on the device whose interface to the network is
   through a virtual switch port, even if they do not have a corresponding PCIe
   PF or VF.

This allows the entire switching behaviour of the NIC to be controlled through
representor TC rules.

It is a common misunderstanding to conflate virtual ports with PCIe virtual
functions or their netdevs. While in simple cases there will be a 1:1
correspondence between VF netdevices and VF representors, more advanced device
configurations may not follow this.
A PCIe function which does not have network access through the internal switch
(not even indirectly through the hardware implementation of whatever services
the function provides) should *not* have a representor (even if it has a
netdev).
Such a function has no switch virtual port for the representor to configure or
to be the other end of the virtual pipe.
The representor represents the virtual port, not the PCIe function nor the 'end
user' netdevice.

.. [#] The concept here is that a hardware IP stack in the device performs the
   translation between block DMA requests and network packets, so that only
   network packets pass through the virtual port onto the switch. The network
   access that the IP stack "sees" would then be configurable through tc rules;
   e.g. its traffic might all be wrapped in a specific VLAN or VxLAN. However,
   any needed configuration of the block device *qua* block device, not being a
   networking entity, would not be appropriate for the representor and would
   thus use some other channel such as devlink.
   Contrast this with the case of a virtio-blk implementation which forwards the
   DMA requests unchanged to another PF whose driver then initiates and
   terminates IP traffic in software; in that case the DMA traffic would *not*
   run over the virtual switch and the virtio-blk PF should thus *not* have a
   representor.

How are representors created?
-----------------------------

The driver instance attached to the switchdev function should, for each virtual
port on the switch, create a pure-software netdevice which has some form of
in-kernel reference to the switchdev function's own netdevice or driver private
data (``netdev_priv()``).
This may be by enumerating ports at probe time, reacting dynamically to the
creation and destruction of ports at run time, or a combination of the two.

The operations of the representor netdevice will generally involve acting
through the switchdev function. For example, ``ndo_start_xmit()`` might send
the packet through a hardware TX queue attached to the switchdev function, with
either packet metadata or queue configuration marking it for delivery to the
representee.

How are representors identified?
--------------------------------

The representor netdevice should *not* directly refer to a PCIe device (e.g.
through ``net_dev->dev.parent`` / ``SET_NETDEV_DEV()``), either of the
representee or of the switchdev function.
Instead, the driver should use the ``SET_NETDEV_DEVLINK_PORT`` macro to
assign a devlink port instance to the netdevice before registering the
netdevice; the kernel uses the devlink port to provide the ``phys_switch_id``
and ``phys_port_name`` sysfs nodes.
(Some legacy drivers implement ``ndo_get_port_parent_id()`` and
``ndo_get_phys_port_name()`` directly, but this is deprecated.) See
:ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>` for the
details of this API.
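
As an illustration of how userland might consume these sysfs nodes, the
following shell sketch lists every netdevice that reports a ``phys_port_name``,
together with its ``phys_switch_id``. The function name ``list_switch_ports``
and its optional directory argument are hypothetical, chosen only for this
example; netdevices whose attributes are absent or cannot be read are skipped.

```shell
# Hypothetical helper (not an existing tool): print "<netdev> <switch-id>
# <port-name>" for each netdevice exposing a phys_port_name attribute.
# The optional argument overrides the sysfs directory, which eases testing.
list_switch_ports() {
    sysfs_net="${1:-/sys/class/net}"
    for d in "$sysfs_net"/*; do
        # Skip entries that have no readable phys_port_name at all.
        [ -r "$d/phys_port_name" ] || continue
        # Reading can still fail if the driver does not support the attribute.
        name=$(cat "$d/phys_port_name" 2>/dev/null) || continue
        [ -n "$name" ] || continue
        id=$(cat "$d/phys_switch_id" 2>/dev/null) || id=
        printf '%s %s %s\n' "${d##*/}" "$id" "$name"
    done
}
```

Netdevices that print the same switch ID belong to the same switch.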

It is expected that userland will use this information (e.g. through udev rules)
to construct an appropriately informative name or alias for the netdevice. For
instance if the switchdev function is ``eth4`` then a representor with a
``phys_port_name`` of ``p0pf1vf2`` might be renamed ``eth4pf1vf2rep``.
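
The renaming in this example could be computed by a small helper such as the
following sketch. It assumes, purely for illustration, that the
``phys_port_name`` begins with a ``p<N>`` physical-port component which is
dropped, and that the suffix ``rep`` is appended; ``rep_name`` is a
hypothetical function, not part of udev or any existing tool.

```shell
# Hypothetical helper: derive a representor name from the switchdev
# function's netdev name ($1) and the representor's phys_port_name ($2).
rep_name() {
    parent="$1"
    port_name="$2"
    # Drop the leading physical-port component ("p0" in "p0pf1vf2").
    suffix=$(printf '%s' "$port_name" | sed 's/^p[0-9][0-9]*//')
    printf '%s%srep\n' "$parent" "$suffix"
}
```

Here ``rep_name eth4 p0pf1vf2`` prints ``eth4pf1vf2rep``. A real naming policy
would also have to keep the result within the kernel's 15-character limit on
interface names.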

There are as yet no established conventions for naming representors which do not
correspond to PCIe functions (e.g. accelerators and plugins).

How do representors interact with TC rules?
-------------------------------------------

Any TC rule on a representor applies (in software TC) to packets received by
that representor netdevice. Thus, if the delivery part of the rule corresponds
to another port on the virtual switch, the driver may choose to offload it to
hardware, applying it to packets transmitted by the representee.

Similarly, since a TC mirred egress action targeting the representor would (in
software) send the packet through the representor (and thus indirectly deliver
it to the representee), hardware offload should interpret this as delivery to
the representee.

As a simple example, if ``PORT_DEV`` is the physical port representor and
``REP_DEV`` is a VF representor, the following rules::

    tc filter add dev $REP_DEV parent ffff: protocol ipv4 flower \
        action mirred egress redirect dev $PORT_DEV
    tc filter add dev $PORT_DEV parent ffff: protocol ipv4 flower skip_sw \
        action mirred egress mirror dev $REP_DEV

would mean that all IPv4 packets from the VF are sent out the physical port, and
all IPv4 packets received on the physical port are delivered to the VF in
addition to ``PORT_DEV``. (Note that without ``skip_sw`` on the second rule,
the VF would get two copies, as the packet reception on ``PORT_DEV`` would
trigger the TC rule again and mirror the packet to ``REP_DEV``.)

On devices without separate port and uplink representors, ``PORT_DEV`` would
instead be the switchdev function's own uplink netdevice.

Of course the rules can (if supported by the NIC) include packet-modifying
actions (e.g. VLAN push/pop), which should be performed by the virtual switch.
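
A rule pair using such actions might look like the following sketch, which
tags traffic from the VF on its way to the physical port and strips the tag in
the other direction. The function name ``add_vlan_rules`` and its arguments
are hypothetical; whether a given NIC can offload these actions is
device-specific, and an ingress qdisc is assumed to exist already on both
netdevices.

```shell
# Hypothetical wrapper: $1 = VF representor, $2 = port representor (or
# uplink netdev), $3 = VLAN ID chosen by the administrator.
add_vlan_rules() {
    rep_dev="$1"
    port_dev="$2"
    vid="$3"
    # VF -> wire: push the VLAN tag, then send out the physical port.
    tc filter add dev "$rep_dev" parent ffff: protocol ipv4 flower \
        action vlan push id "$vid" \
        action mirred egress redirect dev "$port_dev"
    # Wire -> VF: match the tag, pop it, then deliver to the VF.
    tc filter add dev "$port_dev" parent ffff: protocol 802.1q flower \
        vlan_id "$vid" \
        action vlan pop \
        action mirred egress redirect dev "$rep_dev"
}
```

For example, ``add_vlan_rules "$REP_DEV" "$PORT_DEV" 100`` would place the VF
in VLAN 100 without the VF being aware of the tag.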

Tunnel encapsulation and decapsulation are rather more complicated, as they
involve a third netdevice (a tunnel netdev operating in metadata mode, such as
a VxLAN device created with ``ip link add vxlan0 type vxlan external``) and
require an IP address to be bound to the underlay device (e.g. switchdev
function uplink netdev or port representor). TC rules such as::

    tc filter add dev $REP_DEV parent ffff: flower \
        action tunnel_key set id $VNI src_ip $LOCAL_IP dst_ip $REMOTE_IP \
            dst_port 4789 \
        action mirred egress redirect dev vxlan0
    tc filter add dev vxlan0 parent ffff: flower enc_src_ip $REMOTE_IP \
        enc_dst_ip $LOCAL_IP enc_key_id $VNI enc_dst_port 4789 \
        action tunnel_key unset action mirred egress redirect dev $REP_DEV

where ``LOCAL_IP`` is an IP address bound to ``PORT_DEV``, and ``REMOTE_IP`` is
another IP address on the same subnet, mean that packets sent by the VF should
be VxLAN encapsulated and sent out the physical port (the driver has to deduce
this by a route lookup of ``LOCAL_IP`` leading to ``PORT_DEV``, and also
perform an ARP/neighbour table lookup to find the MAC addresses to use in the
outer Ethernet frame), while UDP packets received on the physical port with UDP
port 4789 should be parsed as VxLAN and, if their VNI matches ``$VNI``,
decapsulated and forwarded to the VF.

If this all seems complicated, just remember the 'golden rule' of TC offload:
the hardware should ensure the same final results as if the packets were
processed through the slow path, traversed software TC (except ignoring any
``skip_hw`` rules and applying any ``skip_sw`` rules) and were transmitted or
received through the representor netdevices.

Configuring the representee's MAC
---------------------------------

The representee's link state is controlled through the representor. Setting the
representor administratively UP or DOWN should cause carrier ON or OFF at the
representee.

Setting an MTU on the representor should cause that same MTU to be reported to
the representee.
(On hardware that allows configuring separate and distinct MTU and MRU values,
the representor MTU should correspond to the representee's MRU and vice-versa.)
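
From the switchdev function's side, the link-state and MTU behaviour described
here is driven with the ordinary ``ip link`` commands on the representor, as
in this sketch. The function name ``configure_rep_link`` is hypothetical, and
the MTU value is whatever the administrator chooses.

```shell
# Hypothetical wrapper: $1 = representor netdev, $2 = MTU in bytes.
configure_rep_link() {
    rep_dev="$1"
    mtu="$2"
    # The representee should have this MTU reported to it.
    ip link set dev "$rep_dev" mtu "$mtu"
    # The representee should now see carrier ON.
    ip link set dev "$rep_dev" up
}
```

For example, ``configure_rep_link "$REP_DEV" 9000`` requests a 9000-byte MTU
and brings the link up; ``ip link set dev "$REP_DEV" down`` should in turn
cause carrier OFF at the representee.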

Currently there is no way to use the representor to set the station permanent
MAC address of the representee; other methods available to do this include:

 - legacy SR-IOV (``ip link set DEVICE vf NUM mac LLADDR``)
 - devlink port function (see **devlink-port(8)** and
   :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`)
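
Both methods are sketched below as shell wrappers. The function names are
hypothetical, and the devlink port handle (of the form
``pci/<PCI address>/<port index>``) has to be looked up for the actual device,
for instance with ``devlink port show``; not every driver supports both
methods.

```shell
# Legacy SR-IOV: $1 = PF netdev, $2 = VF index, $3 = MAC address.
set_vf_mac_legacy() {
    ip link set dev "$1" vf "$2" mac "$3"
}

# devlink port function: $1 = devlink port handle, $2 = MAC address.
set_port_function_mac() {
    devlink port function set "$1" hw_addr "$2"
}
```

Either call is issued on the host that owns the switchdev function, not from
within the representee.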