Commit | Line | Data |
---|---|---|
32c0f0be MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | .. include:: <isonum.txt> | |
6fb4825e | 3 | .. _switchdev: |
32c0f0be MCC |
4 | |
5 | =============================================== | |
4ceec22d SF |
6 | Ethernet switch device driver model (switchdev) |
7 | =============================================== | |
32c0f0be MCC |
8 | |
9 | Copyright |copy| 2014 Jiri Pirko <jiri@resnulli.us> | |
10 | ||
11 | Copyright |copy| 2014-2015 Scott Feldman <sfeldma@gmail.com> | |
4ceec22d SF |
12 | |
13 | ||
14 | The Ethernet switch device driver model (switchdev) is an in-kernel driver | |
15 | model for switch devices which offload the forwarding (data) plane from the | |
16 | kernel. | |
17 | ||
18 | Figure 1 is a block diagram showing the components of the switchdev model for | |
19 | an example setup using a data-center-class switch ASIC chip. Other setups | |
20 | with SR-IOV or soft switches, such as OVS, are possible. | |
21 | ||
32c0f0be | 22 | :: |
4ceec22d | 23 | |
32c0f0be MCC |
24 | |
25 | User-space tools | |
51513748 RD |
26 | |
27 | user space | | |
28 | +-------------------------------------------------------------------+ | |
29 | kernel | Netlink | |
32c0f0be MCC |
30 | | |
31 | +--------------+-------------------------------+ | |
32 | | Network stack | | |
33 | | (Linux) | | |
34 | | | | |
35 | +----------------------------------------------+ | |
36 | ||
37 | sw1p2 sw1p4 sw1p6 | |
38 | sw1p1 + sw1p3 + sw1p5 + eth1 | |
39 | + | + | + | + | |
40 | | | | | | | | | |
41 | +--+----+----+----+----+----+---+ +-----+-----+ | |
42 | | Switch driver | | mgmt | | |
43 | | (this document) | | driver | | |
44 | | | | | | |
45 | +--------------+----------------+ +-----------+ | |
46 | | | |
51513748 RD |
47 | kernel | HW bus (eg PCI) |
48 | +-------------------------------------------------------------------+ | |
49 | hardware | | |
32c0f0be MCC |
50 | +--------------+----------------+ |
51 | | Switch device (sw1) | | |
52 | | +----+ +--------+ | |
53 | | | v offloaded data path | mgmt port | |
54 | | | | | | |
55 | +--|----|----+----+----+----+---+ | |
56 | | | | | | | | |
57 | + + + + + + | |
58 | p1 p2 p3 p4 p5 p6 | |
51513748 | 59 | |
32c0f0be | 60 | front-panel ports |
d5066c46 | 61 | |
4ceec22d | 62 | |
32c0f0be | 63 | Fig 1. |
4ceec22d SF |
64 | |
65 | ||
66 | Include Files | |
67 | ------------- | |
68 | ||
32c0f0be MCC |
69 | :: |
70 | ||
71 | #include <linux/netdevice.h> | |
72 | #include <net/switchdev.h> | |
4ceec22d SF |
73 | |
74 | ||
75 | Configuration | |
76 | ------------- | |
77 | ||
78 | Use "depends on NET_SWITCHDEV" in the driver's Kconfig to ensure switchdev model | |
79 | support is built for the driver. | |
80 | ||
81 | ||
82 | Switch Ports | |
83 | ------------ | |
84 | ||
85 | On switchdev driver initialization, the driver will allocate and register a | |
86 | struct net_device (using register_netdev()) for each enumerated physical switch | |
87 | port, called the port netdev. A port netdev is the software representation of | |
88 | the physical port and provides a conduit for control traffic to/from the | |
89 | controller (the kernel) and the network, as well as an anchor point for higher | |
90 | level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using | |
91 | standard netdev tools (iproute2, ethtool, etc), the port netdev can also | |
92 | provide to the user access to the physical properties of the switch port such | |
93 | as PHY link state and I/O statistics. | |
94 | ||
95 | There is (currently) no higher-level kernel object for the switch beyond the | |
96 | port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops. | |
97 | ||
98 | A switch management port is outside the scope of the switchdev driver model. | |
99 | Typically, the management port does not participate in the offloaded data plane | |
100 | and is driven by a different driver, such as a NIC driver, on the management port | |
101 | device. | |
102 | ||
75f3a101 IS |
103 | Switch ID |
104 | ^^^^^^^^^ | |
105 | ||
80d79ad2 FF |
106 | The switchdev driver must implement the net_device operation |
107 | ndo_get_port_parent_id for each port netdev, returning the same physical ID for | |
108 | each port of a switch. The ID must be unique between switches on the same | |
109 | system. The ID does not need to be unique between switches on different | |
110 | systems. | |
75f3a101 IS |
111 | |
112 | The switch ID is used to locate ports on a switch and to know if aggregated | |
113 | ports belong to the same switch. | |
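The "same switch" test above boils down to a byte-wise comparison of the IDs reported by two ports. A minimal userspace sketch, assuming a hypothetical ``struct phys_id`` stand-in for the kernel's ID type (the struct, the length limit and the function name are illustrative, not kernel API):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PHYS_ID_MAX_LEN 32	/* assumed upper bound for this sketch */

/* Hypothetical stand-in for the ID a port reports for its parent switch. */
struct phys_id {
	unsigned char id[PHYS_ID_MAX_LEN];
	unsigned char id_len;
};

/* Two ports sit on the same switch when their IDs are byte-identical. */
static bool same_switch(const struct phys_id *a, const struct phys_id *b)
{
	assert(a->id_len <= PHYS_ID_MAX_LEN && b->id_len <= PHYS_ID_MAX_LEN);
	return a->id_len == b->id_len && memcmp(a->id, b->id, a->id_len) == 0;
}
```

A LAG spanning ports whose IDs differ would, under this model, span two switches.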
114 | ||
4ceec22d SF |
115 | Port Netdev Naming |
116 | ^^^^^^^^^^^^^^^^^^ | |
117 | ||
118 | Udev rules should be used for port netdev naming, using some unique attribute | |
119 | of the port as a key, for example the port MAC address or the port PHYS name. | |
120 | Hard-coding of kernel netdev names within the driver is discouraged; let the | |
121 | kernel pick the default netdev name, and let udev set the final name based on a | |
122 | port attribute. | |
123 | ||
124 | Using port PHYS name (ndo_get_phys_port_name) for the key is particularly | |
1f5dc44c | 125 | useful for dynamically-named ports where the device names its ports based on |
4ceec22d SF |
126 | external configuration. For example, if a physical 40G port is split logically |
127 | into 4 10G ports, resulting in 4 port netdevs, the device can give a unique | |
32c0f0be | 128 | name for each port using port PHYS name. The udev rule would be:: |
4ceec22d | 129 | |
32c0f0be MCC |
130 | SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \ |
131 | ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}" | |
4ceec22d SF |
132 | |
133 | Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y | |
134 | is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0 | |
135 | would be sub-port 0 on port 1 on switch 1. | |
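With the udev rule above, the driver only supplies the "pYsZ" part through the port PHYS name, and udev prepends the switch part. A minimal userspace sketch of that formatting, assuming a hypothetical helper ``format_phys_port_name()`` (not a kernel function):

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: write the "pY" or "pYsZ" part of the suggested name.
 * A negative subport means the port is not split. Returns snprintf()'s
 * result, i.e. the length of the formatted name.
 */
static int format_phys_port_name(char *buf, size_t len, int port, int subport)
{
	assert(buf != NULL && len > 0 && port >= 0);
	if (subport < 0)
		return snprintf(buf, len, "p%d", port);
	return snprintf(buf, len, "p%ds%d", port, subport);
}
```

For port 1, sub-port 0 this yields "p1s0", which the rule turns into "swXp1s0".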
136 | ||
4ceec22d SF |
137 | Port Features |
138 | ^^^^^^^^^^^^^ | |
139 | ||
140 | NETIF_F_NETNS_LOCAL | |
141 | ||
142 | If the switchdev driver (and device) only supports offloading of the default | |
143 | network namespace (netns), the driver should set this feature flag to prevent | |
144 | the port netdev from being moved out of the default netns. A netns-aware | |
1f5dc44c | 145 | driver/device would not set this flag and be responsible for partitioning |
4ceec22d SF |
146 | hardware to preserve netns containment. This means hardware cannot forward |
147 | traffic from a port in one namespace to another port in another namespace. | |
148 | ||
149 | Port Topology | |
150 | ^^^^^^^^^^^^^ | |
151 | ||
152 | The port netdevs representing the physical switch ports can be organized into | |
153 | higher-level switching constructs. The default construct is a standalone | |
154 | router port, used to offload L3 forwarding. Two or more ports can be bonded | |
155 | together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge | |
d290f1fc | 156 | L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3 |
4ceec22d SF |
157 | tunnels can be built on ports. These constructs are built using standard Linux |
158 | tools such as the bridge driver, the bonding/team drivers, and netlink-based | |
159 | tools such as iproute2. | |
160 | ||
161 | The switchdev driver can know a particular port's position in the topology by | |
162 | monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a | |
404a5ad7 | 163 | bond will see its upper master change. If that bond is moved into a bridge, |
4ceec22d SF |
164 | the bond's upper master will change. And so on. The driver will track such |
165 | movements to know a port's position in the overall topology by | |
166 | registering for netdevice events and acting on NETDEV_CHANGEUPPER. | |
167 | ||
168 | L2 Forwarding Offload | |
169 | --------------------- | |
170 | ||
171 | The idea is to offload the L2 data forwarding (switching) path from the kernel | |
172 | to the switchdev device by mirroring bridge FDB entries down to the device. An | |
173 | FDB entry is the {port, MAC, VLAN} tuple forwarding destination. | |
174 | ||
175 | To offload L2 bridging, the switchdev driver/device should support: | |
176 | ||
177 | - Static FDB entries installed on a bridge port | |
178 | - Notification of learned/forgotten src mac/vlans from device | |
179 | - STP state changes on the port | |
180 | - VLAN flooding of multicast/broadcast and unknown unicast packets | |
181 | ||
182 | Static FDB Entries | |
183 | ^^^^^^^^^^^^^^^^^^ | |
184 | ||
787a4109 VO |
185 | A driver which implements the ``ndo_fdb_add``, ``ndo_fdb_del`` and |
186 | ``ndo_fdb_dump`` operations is able to support the command below, which adds a | |
187 | static bridge FDB entry:: | |
188 | ||
189 | bridge fdb add dev DEV ADDRESS [vlan VID] [self] static | |
190 | ||
191 | (the "static" keyword is non-optional: if not specified, the entry defaults to | |
192 | being "local", which means that it should not be forwarded) | |
193 | ||
194 | The "self" keyword (optional because it is implicit) has the role of | |
195 | instructing the kernel to fulfill the operation through the ``ndo_fdb_add`` | |
196 | implementation of the ``DEV`` device itself. If ``DEV`` is a bridge port, this | |
197 | will bypass the bridge and therefore leave the software database out of sync | |
198 | with the hardware one. | |
199 | ||
200 | To avoid this, the "master" keyword can be used:: | |
201 | ||
202 | bridge fdb add dev DEV ADDRESS [vlan VID] master static | |
203 | ||
204 | The above command instructs the kernel to search for a master interface of | |
205 | ``DEV`` and fulfill the operation through the ``ndo_fdb_add`` method of that. | |
206 | This time, the bridge generates a ``SWITCHDEV_FDB_ADD_TO_DEVICE`` notification | |
207 | which the port driver can handle and use to program its hardware table. This | |
208 | way, the software and the hardware database will both contain this static FDB | |
209 | entry. | |
210 | ||
211 | Note: for new switchdev drivers that offload the Linux bridge, implementing the | |
212 | ``ndo_fdb_add`` and ``ndo_fdb_del`` bridge bypass methods is strongly | |
213 | discouraged: all static FDB entries should be added on a bridge port using the | |
214 | "master" flag. The ``ndo_fdb_dump`` is an exception and can be implemented to | |
215 | visualize the hardware tables, if the device does not have an interrupt for | |
216 | notifying the operating system of newly learned/forgotten dynamic FDB | |
217 | addresses. In that case, the hardware FDB might end up having entries that the | |
218 | software FDB does not, and implementing ``ndo_fdb_dump`` is the only way to see | |
219 | them. | |
1f5dc44c | 220 | |
4ceec22d | 221 | Note: by default, the bridge does not filter on VLAN and only bridges untagged |
32c0f0be | 222 | traffic. To enable VLAN support, turn on VLAN filtering:: |
4ceec22d SF |
223 | |
224 | echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering | |
225 | ||
226 | Notification of Learned/Forgotten Source MAC/VLANs | |
227 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
228 | ||
229 | The switch device will learn/forget source MAC address/VLAN on ingress packets | |
230 | and notify the switch driver of the mac/vlan/port tuples. The switch driver, | |
32c0f0be | 231 | in turn, will notify the bridge driver using the switchdev notifier call:: |
4ceec22d | 232 | |
6685987c | 233 | err = call_switchdev_notifiers(val, dev, info, extack); |
4ceec22d | 234 | |
f5ed2feb SF |
235 | Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when |
236 | forgetting, and info points to a struct switchdev_notifier_fdb_info. On | |
237 | SWITCHDEV_FDB_ADD, the bridge driver will install the FDB entry into the | |
238 | bridge's FDB and mark the entry as NTF_EXT_LEARNED. The iproute2 bridge | |
32c0f0be | 239 | command will label these entries "offload":: |
4ceec22d SF |
240 | |
241 | $ bridge fdb | |
242 | 52:54:00:12:35:01 dev sw1p1 master br0 permanent | |
243 | 00:02:00:00:02:00 dev sw1p1 master br0 offload | |
244 | 00:02:00:00:02:00 dev sw1p1 self | |
245 | 52:54:00:12:35:02 dev sw1p2 master br0 permanent | |
246 | 00:02:00:00:03:00 dev sw1p2 master br0 offload | |
247 | 00:02:00:00:03:00 dev sw1p2 self | |
248 | 33:33:00:00:00:01 dev eth0 self permanent | |
249 | 01:00:5e:00:00:01 dev eth0 self permanent | |
250 | 33:33:ff:00:00:00 dev eth0 self permanent | |
251 | 01:80:c2:00:00:0e dev eth0 self permanent | |
252 | 33:33:00:00:00:01 dev br0 self permanent | |
253 | 01:00:5e:00:00:01 dev br0 self permanent | |
254 | 33:33:ff:12:35:01 dev br0 self permanent | |
255 | ||
32c0f0be | 256 | Learning on the port should be disabled on the bridge using the bridge command:: |
4ceec22d SF |
257 | |
258 | bridge link set dev DEV learning off | |
259 | ||
32c0f0be | 260 | Learning on the device port should be enabled, as well as learning_sync:: |
4ceec22d SF |
261 | |
262 | bridge link set dev DEV learning on self | |
263 | bridge link set dev DEV learning_sync on self | |
264 | ||
5a784498 | 265 | Learning_sync attribute enables syncing of the learned/forgotten FDB entry to |
4ceec22d SF |
266 | the bridge's FDB. It's possible, but not optimal, to enable learning on the |
267 | device port and on the bridge port, and disable learning_sync. | |
268 | ||
cc0c207a | 269 | To support learning, the driver implements switchdev op |
010c8f01 | 270 | switchdev_port_attr_set for SWITCHDEV_ATTR_ID_PORT_{PRE}_BRIDGE_FLAGS. |
4ceec22d SF |
271 | |
272 | FDB Ageing | |
273 | ^^^^^^^^^^ | |
274 | ||
45ffda75 SF |
275 | The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is |
276 | the responsibility of the port driver/device to age out these entries. If the | |
277 | port device supports ageing, when the FDB entry expires, it will notify the | |
278 | driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL. If the | |
279 | device does not support ageing, the driver can simulate ageing using a | |
5a784498 | 280 | garbage collection timer to monitor FDB entries. Expired entries will be |
45ffda75 SF |
281 | notified to the bridge using SWITCHDEV_FDB_DEL. See rocker driver for |
282 | example of driver running ageing timer. | |
283 | ||
284 | To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB | |
285 | entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The | |
4ceec22d SF |
286 | notification will reset the FDB entry's last-used time to now. The driver |
287 | should rate limit refresh notifications, for example, no more than once a | |
45ffda75 | 288 | second. (The last-used time is visible using the bridge -s fdb option). |
4ceec22d SF |
289 | |
290 | STP State Change on Port | |
291 | ^^^^^^^^^^^^^^^^^^^^^^^^ | |
292 | ||
293 | Internally or with a third-party STP protocol implementation (e.g. mstpd), the | |
294 | bridge driver maintains the STP state for ports, and will notify the switch | |
f5ed2feb | 295 | driver of STP state change on a port using the switchdev op |
1f868398 | 296 | switchdev_port_attr_set for SWITCHDEV_ATTR_ID_PORT_STP_STATE. |
4ceec22d SF |
297 | |
298 | State is one of BR_STATE_*. The switch driver can use STP state updates to | |
299 | update ingress packet filter list for the port. For example, if the port is | |
300 | DISABLED, no packets should pass, but if the port moves to BLOCKING, then STP BPDUs | |
301 | and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass. | |
302 | ||
303 | Note that STP BPDUs are untagged and STP state applies to all VLANs on the port | |
304 | so packet filters should be applied consistently across untagged and tagged | |
305 | VLANs on the port. | |
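The ingress filter described above can be sketched as a pure function of the STP state and the destination MAC. This is a userspace model with a hypothetical enum (it does not reuse the kernel's BR_STATE_* values). Treating the listening and learning states like the blocking state is an assumption of this sketch, not something the text above specifies:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical state enum for this model only. */
enum stp_state {
	STP_DISABLED,
	STP_LISTENING,
	STP_LEARNING,
	STP_FORWARDING,
	STP_BLOCKING,
};

/* IEEE link-local multicast: 01:80:c2:xx:xx:xx */
static bool is_link_local(const uint8_t dmac[6])
{
	static const uint8_t prefix[3] = { 0x01, 0x80, 0xc2 };

	return memcmp(dmac, prefix, sizeof(prefix)) == 0;
}

/* Decide whether a frame received on a port in the given state may pass. */
static bool ingress_allowed(enum stp_state state, const uint8_t dmac[6])
{
	assert(dmac != NULL);
	switch (state) {
	case STP_DISABLED:
		return false;			/* nothing passes */
	case STP_FORWARDING:
		return true;			/* everything passes */
	default:
		return is_link_local(dmac);	/* BPDUs and similar only */
	}
}
```

Because the decision does not look at a VLAN tag, the same filter applies to tagged and untagged frames, as the note above requires.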
306 | ||
307 | Flooding L2 domain | |
308 | ^^^^^^^^^^^^^^^^^^ | |
309 | ||
310 | For a given L2 VLAN domain, the switch device should flood multicast/broadcast | |
311 | and unknown unicast packets to all ports in domain, if allowed by port's | |
312 | current STP state. The switch driver, knowing which ports are within which | |
371e59ad IS |
313 | vlan L2 domain, can program the switch device for flooding. The packet may |
314 | be sent to the port netdev for processing by the bridge driver. The | |
a48037e7 SF |
315 | bridge should not reflood the packet to the same ports the device flooded, |
316 | otherwise there will be duplicate packets on the wire. | |
317 | ||
6bc506b4 IS |
318 | To avoid duplicate packets, the switch driver should mark a packet as already |
319 | forwarded by setting the skb->offload_fwd_mark bit. The bridge driver will mark | |
320 | the skb using the ingress bridge port's mark and prevent it from being forwarded | |
321 | through any bridge port with the same mark. | |
4ceec22d SF |
322 | |
323 | It is possible for the switch device to not handle flooding and push the | |
324 | packets up to the bridge driver for flooding. This is not ideal as the number | |
325 | of ports in the L2 domain scales, because the device is much more efficient at | |
326 | flooding packets than software. | |
327 | ||
741af005 IS |
328 | If supported by the device, flood control can be offloaded to it, preventing |
329 | certain netdevs from flooding unicast traffic for which there is no FDB entry. | |
330 | ||
4ceec22d SF |
331 | IGMP Snooping |
332 | ^^^^^^^^^^^^^ | |
333 | ||
4f5590f8 ER |
334 | In order to support IGMP snooping, the port netdevs should trap to the bridge |
335 | driver all IGMP join and leave messages. | |
336 | The bridge multicast module will notify port netdevs of every multicast group | |
337 | change, whether the group is statically configured or dynamically joined/left. | |
338 | The hardware implementation should forward traffic for each registered multicast | |
339 | group only to the configured ports. | |
4ceec22d | 340 | |
7616dcbb SF |
341 | L3 Routing Offload |
342 | ------------------ | |
4ceec22d SF |
343 | |
344 | Offloading L3 routing requires that device be programmed with FIB entries from | |
345 | the kernel, with the device doing the FIB lookup and forwarding. The device | |
346 | does a longest prefix match (LPM) on FIB entries matching route prefix and | |
7616dcbb SF |
347 | forwards the packet to the matching FIB entry's nexthop(s) egress ports. |
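The lookup the device performs can be illustrated with a linear-scan longest prefix match. This is a userspace sketch with a hypothetical ``struct fib_entry`` (addresses in host byte order, one egress port per entry). Real hardware uses dedicated LPM tables, not a scan:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified FIB entry for this model. */
struct fib_entry {
	uint32_t dst;		/* prefix, host byte order */
	int dst_len;		/* prefix length, 0..32 */
	int egress_port;	/* stand-in for the nexthop's egress port */
};

/* Network mask for a prefix length; a length of 0 matches everything. */
static uint32_t prefix_mask(int len)
{
	assert(len >= 0 && len <= 32);
	return len == 0 ? 0 : (uint32_t)(~(uint32_t)0 << (32 - len));
}

/* Return the matching entry with the longest prefix, or NULL if none. */
static const struct fib_entry *lpm_lookup(const struct fib_entry *tbl,
					  size_t n, uint32_t addr)
{
	const struct fib_entry *best = NULL;

	for (size_t i = 0; i < n; i++) {
		uint32_t mask = prefix_mask(tbl[i].dst_len);

		if ((addr & mask) != (tbl[i].dst & mask))
			continue;
		if (best == NULL || tbl[i].dst_len > best->dst_len)
			best = &tbl[i];
	}
	return best;
}
```

With entries for 0.0.0.0/0, 11.0.0.0/8 and 11.0.0.0/30, an address of 11.0.0.1 selects the /30 entry, while 11.0.0.9 falls back to the /8.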
348 | ||
fd41b0ea JP |
349 | To program the device, the driver has to register a FIB notifier handler |
350 | using register_fib_notifier. The following events are available: | |
7616dcbb | 351 | |
32c0f0be MCC |
352 | =================== =================================================== |
353 | FIB_EVENT_ENTRY_ADD used for both adding a new FIB entry to the device, | |
354 | or modifying an existing entry on the device. | |
355 | FIB_EVENT_ENTRY_DEL used for removing a FIB entry | |
356 | FIB_EVENT_RULE_ADD, | |
357 | FIB_EVENT_RULE_DEL used to propagate FIB rule changes | |
358 | =================== =================================================== | |
359 | ||
360 | FIB_EVENT_ENTRY_ADD and FIB_EVENT_ENTRY_DEL events pass:: | |
7616dcbb | 361 | |
fd41b0ea JP |
362 | struct fib_entry_notifier_info { |
363 | struct fib_notifier_info info; /* must be first */ | |
7616dcbb SF |
364 | u32 dst; |
365 | int dst_len; | |
366 | struct fib_info *fi; | |
367 | u8 tos; | |
368 | u8 type; | |
7616dcbb | 369 | u32 tb_id; |
fd41b0ea JP |
370 | u32 nlflags; |
371 | }; | |
7616dcbb | 372 | |
32c0f0be MCC |
373 | to add/modify/delete IPv4 dst/dst_len prefix on table tb_id. The ``*fi`` |
374 | structure holds details on the route and route's nexthops. ``*dev`` is one | |
375 | of the port netdevs mentioned in the route's next hop list. | |
4ceec22d SF |
376 | |
377 | Routes offloaded to the device are labeled with "offload" in the ip route | |
32c0f0be | 378 | listing:: |
4ceec22d SF |
379 | |
380 | $ ip route show | |
381 | default via 192.168.0.2 dev eth0 | |
382 | 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload | |
383 | 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload | |
384 | 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload | |
385 | 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload | |
386 | 12.0.0.2 proto zebra metric 30 offload | |
387 | nexthop via 11.0.0.1 dev sw1p1 weight 1 | |
388 | nexthop via 11.0.0.9 dev sw1p2 weight 1 | |
389 | 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload | |
390 | 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload | |
391 | 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15 | |
392 | ||
fd41b0ea JP |
393 | The "offload" flag is set in case at least one device offloads the FIB entry. |
394 | ||
7616dcbb | 395 | XXX: add/mod/del IPv6 FIB API |
4ceec22d SF |
396 | |
397 | Nexthop Resolution | |
398 | ^^^^^^^^^^^^^^^^^^ | |
399 | ||
400 | The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for | |
401 | the switch device to forward the packet with the correct dst mac address, the | |
402 | nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac | |
403 | address discovery comes via the ARP (or ND) process and is available via the | |
404 | arp_tbl neighbor table. To resolve the route's nexthop gateways, the driver | |
405 | should trigger the kernel's neighbor resolution process. See the rocker | |
406 | driver's rocker_port_ipv4_resolve() for an example. | |
407 | ||
408 | The driver can monitor for updates to arp_tbl using the netevent notifier | |
409 | NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops | |
dd19f83d SF |
410 | for the routes as arp_tbl updates. The driver implements ndo_neigh_destroy |
411 | to know when arp_tbl neighbor entries are purged from the port. | |
0f22ad45 FF |
412 | |
413 | Device driver expected behavior | |
414 | ------------------------------- | |
415 | ||
416 | Below is a set of defined behavior that switchdev enabled network devices must | |
417 | adhere to. | |
418 | ||
419 | Configuration-less state | |
420 | ^^^^^^^^^^^^^^^^^^^^^^^^ | |
421 | ||
422 | Upon driver bring up, the network devices must be fully operational, and the | |
423 | backing driver must configure the network device such that it is possible to | |
424 | send and receive traffic to this network device and it is properly separated | |
425 | from other network devices/ports (e.g.: as is frequent with a switch ASIC). How | |
426 | this is achieved is heavily hardware dependent, but a simple solution can be to | |
427 | use per-port VLAN identifiers unless a better mechanism is available | |
428 | (proprietary metadata for each network port for instance). | |
429 | ||
430 | The network device must be capable of running a full IP protocol stack | |
431 | including multicast, DHCP, IPv4/6, etc. If necessary, it should program the | |
432 | appropriate filters for VLAN, multicast, unicast etc. The underlying device | |
433 | driver must effectively be configured in a similar fashion to what it would do | |
434 | when IGMP snooping is enabled for IP multicast over these switchdev network | |
435 | devices and unsolicited multicast must be filtered as early as possible in | |
436 | the hardware. | |
437 | ||
438 | When configuring VLANs on top of the network device, all VLANs must be working, | |
439 | irrespective of the state of other network devices (e.g.: other ports being part | |
440 | of a VLAN-aware bridge doing ingress VID checking). See below for details. | |
441 | ||
442 | If the device implements, e.g., VLAN filtering, putting the interface in | |
443 | promiscuous mode should allow the reception of all VLAN tags (including those | |
444 | not present in the filter(s)). | |
445 | ||
446 | Bridged switch ports | |
447 | ^^^^^^^^^^^^^^^^^^^^ | |
448 | ||
449 | When a switchdev enabled network device is added as a bridge member, it should | |
450 | not disrupt any functionality of non-bridged network devices and they | |
451 | should continue to behave as normal network devices. Depending on the bridge | |
452 | configuration knobs below, the expected behavior is documented. | |
453 | ||
454 | Bridge VLAN filtering | |
455 | ^^^^^^^^^^^^^^^^^^^^^ | |
456 | ||
457 | The Linux bridge allows the configuration of a VLAN filtering mode (statically, | |
458 | at device creation time, and dynamically, during run time) which must be | |
459 | observed by the underlying switchdev network device/hardware: | |
460 | ||
461 | - with VLAN filtering turned off: the bridge is strictly VLAN unaware and its | |
462 | data path will process all Ethernet frames as if they are VLAN-untagged. | |
463 | The bridge VLAN database can still be modified, but the modifications should | |
464 | have no effect while VLAN filtering is turned off. Frames ingressing the | |
465 | device with a VID that is not programmed into the bridge/switch's VLAN table | |
466 | must be forwarded and may be processed using a VLAN device (see below). | |
467 | ||
468 | - with VLAN filtering turned on: the bridge is VLAN-aware and frames ingressing | |
469 | the device with a VID that is not programmed into the bridge/switch's VLAN | |
470 | table must be dropped (strict VID checking). | |
471 | ||
472 | When there is a VLAN device (e.g: sw0p1.100) configured on top of a switchdev | |
473 | network device which is a bridge port member, the behavior of the software | |
474 | network stack must be preserved, or the configuration must be refused if that | |
475 | is not possible. | |
476 | ||
477 | - with VLAN filtering turned off, the bridge will process all ingress traffic | |
478 | for the port, except for the traffic tagged with a VLAN ID destined for a | |
479 | VLAN upper. The VLAN upper interface (which consumes the VLAN tag) can even | |
480 | be added to a second bridge, which includes other switch ports or software | |
481 | interfaces. Some approaches to ensure that the forwarding domain for traffic | |
482 | belonging to the VLAN upper interfaces are managed properly: | |
cfeb961a | 483 | |
0f22ad45 FF |
484 | * If forwarding destinations can be managed per VLAN, the hardware could be |
485 | configured to map all traffic, except the packets tagged with a VID | |
486 | belonging to a VLAN upper interface, to an internal VID corresponding to | |
487 | untagged packets. This internal VID spans all ports of the VLAN-unaware | |
488 | bridge. The VID corresponding to the VLAN upper interface spans the | |
489 | physical port of that VLAN interface, as well as the other ports that | |
490 | might be bridged with it. | |
491 | * Treat bridge ports with VLAN upper interfaces as standalone, and let | |
492 | forwarding be handled in the software data path. | |
493 | ||
494 | - with VLAN filtering turned on, these VLAN devices can be created as long as | |
495 | the bridge does not have an existing VLAN entry with the same VID on any | |
496 | bridge port. These VLAN devices cannot be enslaved into the bridge since they | |
497 | duplicate functionality/use case with the bridge's VLAN data path processing. | |
498 | ||
499 | Non-bridged network ports of the same switch fabric must not be disturbed in any | |
500 | way by the enabling of VLAN filtering on the bridge device(s). If the VLAN | |
501 | filtering setting is global to the entire chip, then the standalone ports | |
502 | should indicate to the network stack that VLAN filtering is required by setting | |
503 | 'rx-vlan-filter: on [fixed]' in the ethtool features. | |
504 | ||
505 | Because VLAN filtering can be turned on/off at runtime, the switchdev driver | |
506 | must be able to reconfigure the underlying hardware on the fly to honor the | |
507 | toggling of that option and behave appropriately. If that is not possible, the | |
508 | switchdev driver can also refuse to support dynamic toggling of the VLAN | |
509 | filtering knob at runtime and require a destruction of the bridge device(s) and | |
510 | creation of new bridge device(s) with a different VLAN filtering value to | |
511 | ensure VLAN awareness is pushed down to the hardware. | |
512 | ||
513 | Even when VLAN filtering in the bridge is turned off, the underlying switch | |
514 | hardware and driver may still configure itself in a VLAN-aware mode provided | |
515 | that the behavior described above is observed. | |
516 | ||
517 | The VLAN protocol of the bridge plays a role in deciding whether a packet is | |
518 | treated as tagged or not: a bridge using the 802.1ad protocol must treat both | |
519 | VLAN-untagged packets, as well as packets tagged with 802.1Q headers, as | |
520 | untagged. | |
521 | ||
522 | The 802.1p (VID 0) tagged packets must be treated in the same way by the device | |
523 | as untagged packets, since the bridge device does not allow the manipulation of | |
524 | VID 0 in its database. | |
525 | ||
526 | When the bridge has VLAN filtering enabled and a PVID is not configured on the | |
6b38c571 | 527 | ingress port, untagged and 802.1p tagged packets must be dropped. When the bridge |
0f22ad45 FF |
528 | has VLAN filtering enabled and a PVID exists on the ingress port, untagged and |
529 | priority-tagged packets must be accepted and forwarded according to the | |
530 | bridge's port membership of the PVID VLAN. When the bridge has VLAN filtering | |
531 | disabled, the presence/lack of a PVID should not influence the packet | |
532 | forwarding decision. | |
533 | ||
534 | Bridge IGMP snooping | |
535 | ^^^^^^^^^^^^^^^^^^^^ | |
536 | ||
537 | The Linux bridge allows the configuration of IGMP snooping (statically, at | |
538 | interface creation time, or dynamically, during runtime) which must be observed | |
539 | by the underlying switchdev network device/hardware in the following way: | |
540 | ||
541 | - when IGMP snooping is turned off, multicast traffic must be flooded to all | |
542 | ports within the same bridge that have mcast_flood=true. The CPU/management | |
543 | port should ideally not be flooded (unless the ingress interface has | |
544 | IFF_ALLMULTI or IFF_PROMISC) and continue to learn multicast traffic through | |
545 | the network stack notifications. If the hardware is not capable of doing that | |
546 | then the CPU/management port must also be flooded and multicast filtering | |
547 | happens in software. | |
548 | ||
549 | - when IGMP snooping is turned on, multicast traffic must selectively flow | |
550 | to the appropriate network ports (including CPU/management port). Flooding of | |
551 | unknown multicast should be only towards the ports connected to a multicast | |
552 | router (the local device may also act as a multicast router). | |
553 | ||
554 | The switch must adhere to RFC 4541 and flood multicast traffic accordingly | |
555 | since that is what the Linux bridge implementation does. | |
556 | ||
557 | Because IGMP snooping can be turned on/off at runtime, the switchdev driver | |
558 | must be able to reconfigure the underlying hardware on the fly to honor the | |
559 | toggling of that option and behave appropriately. | |
560 | ||
561 | A switchdev driver can also refuse to support dynamic toggling of the multicast | |
562 | snooping knob at runtime and require the destruction of the bridge device(s) | |
563 | and creation of new bridge device(s) with a different multicast snooping | |
564 | value. |