Commit | Line | Data |
---|---|---|
1 | Ethernet switch device driver model (switchdev) |
2 | =============================================== | |
3 | Copyright (c) 2014 Jiri Pirko <jiri@resnulli.us> | |
4 | Copyright (c) 2014-2015 Scott Feldman <sfeldma@gmail.com> | |
5 | ||
6 | ||
7 | The Ethernet switch device driver model (switchdev) is an in-kernel driver | |
8 | model for switch devices which offload the forwarding (data) plane from the | |
9 | kernel. | |
10 | ||
11 | Figure 1 is a block diagram showing the components of the switchdev model for | |
12 | an example setup using a data-center-class switch ASIC chip. Other setups | |
13 | with SR-IOV or soft switches, such as OVS, are possible. | |
14 | ||
15 | ||
16 |                              User-space tools | |
17 | ||
18 |        user space                   | | |
19 |       +-------------------------------------------------------------------+ | |
20 |        kernel                       | Netlink | |
21 |                                     | | |
22 |                      +--------------+-------------------------------+ | |
23 |                      |         Network stack                        | | |
24 |                      |           (Linux)                            | | |
25 |                      |                                              | | |
26 |                      +----------------------------------------------+ | |
27 | ||
28 |                            sw1p2     sw1p4     sw1p6 | |
29 |                       sw1p1  +  sw1p3  +  sw1p5  +          eth1 | |
30 |                         +    |    +    |    +    |            + | |
31 |                         |    |    |    |    |    |            | | |
32 |                      +--+----+----+----+-+--+----+---+  +-----+-----+ | |
33 |                      |         Switch driver         |  |    mgmt   | | |
34 |                      |        (this document)        |  |   driver  | | |
35 |                      |                               |  |           | | |
36 |                      +--------------+----------------+  +-----------+ | |
37 |                                     | | |
38 |        kernel                       | HW bus (eg PCI) | |
39 |       +-------------------------------------------------------------------+ | |
40 |        hardware                     | | |
41 |                      +--------------+---+------------+ | |
42 |                      |         Switch device (sw1)   | | |
43 |                      |  +----+                       +--------+ | |
44 |                      |  |    v offloaded data path   | mgmt port | |
45 |                      |  |    |                       | | |
46 |                      +--|----|----+----+----+----+---+ | |
47 |                         |    |    |    |    |    | | |
48 |                         +    +    +    +    +    + | |
49 |                        p1   p2   p3   p4   p5   p6 | |
50 | ||
51 |                              front-panel ports | |
52 | ||
53 | ||
54 |                                     Fig 1. | |
55 | ||
56 | ||
57 | Include Files | |
58 | ------------- | |
59 | ||
60 | #include <linux/netdevice.h> | |
61 | #include <net/switchdev.h> | |
62 | ||
63 | ||
64 | Configuration | |
65 | ------------- | |
66 | ||
67 | Use "depends on NET_SWITCHDEV" in the driver's Kconfig to ensure switchdev model | |
68 | support is built for the driver. | |
69 | ||
70 | ||
71 | Switch Ports | |
72 | ------------ | |
73 | ||
74 | On switchdev driver initialization, the driver will allocate and register a | |
75 | struct net_device (using register_netdev()) for each enumerated physical switch | |
76 | port, called the port netdev. A port netdev is the software representation of | |
77 | the physical port and provides a conduit for control traffic to/from the | |
78 | controller (the kernel) and the network, as well as an anchor point for higher | |
79 | level constructs such as bridges, bonds, VLANs, tunnels, and L3 routers. Using | |
80 | standard netdev tools (iproute2, ethtool, etc), the port netdev can also | |
81 | provide to the user access to the physical properties of the switch port such | |
82 | as PHY link state and I/O statistics. | |
83 | ||
84 | There is (currently) no higher-level kernel object for the switch beyond the | |
85 | port netdevs. All of the switchdev driver ops are netdev ops or switchdev ops. | |
86 | ||
87 | A switch management port is outside the scope of the switchdev driver model. | |
88 | Typically, the management port does not participate in the offloaded data plane | |
89 | and is driven by a different driver, such as a NIC driver, loaded on the | |
90 | management port device. | |
91 | ||
92 | Port Netdev Naming | |
93 | ^^^^^^^^^^^^^^^^^^ | |
94 | ||
95 | Udev rules should be used for port netdev naming, using some unique attribute | |
96 | of the port as a key, for example the port MAC address or the port PHYS name. | |
97 | Hard-coding of kernel netdev names within the driver is discouraged; let the | |
98 | kernel pick the default netdev name, and let udev set the final name based on a | |
99 | port attribute. | |
100 | ||
101 | Using port PHYS name (ndo_get_phys_port_name) for the key is particularly | |
102 | useful for dynamically-named ports where the device names its ports based on | |
103 | external configuration. For example, if a physical 40G port is split logically | |
104 | into 4 10G ports, resulting in 4 port netdevs, the device can give a unique | |
105 | name for each port using port PHYS name. The udev rule would be: | |
106 | ||
107 | SUBSYSTEM=="net", ACTION=="add", DRIVERS=="<driver>", ATTR{phys_port_name}!="", \ | |
108 | NAME="$attr{phys_port_name}" | |
109 | ||
110 | Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y | |
111 | is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0 | |
112 | would be sub-port 0 on port 1 on switch 1. | |
113 | ||
114 | Switch ID | |
115 | ^^^^^^^^^ | |
116 | ||
117 | The switchdev driver must implement the switchdev op switchdev_port_attr_get for | |
118 | SWITCHDEV_ATTR_PORT_PARENT_ID for each port netdev, returning the same physical ID | |
119 | for each port of a switch. The ID must be unique between switches on the same | |
120 | system. The ID does not need to be unique between switches on different | |
121 | systems. | |
122 | ||
123 | The switch ID is used to locate ports on a switch and to know if aggregated | |
124 | ports belong to the same switch. | |
125 | ||
126 | Port Features | |
127 | ^^^^^^^^^^^^^ | |
128 | ||
129 | NETIF_F_NETNS_LOCAL | |
130 | ||
131 | If the switchdev driver (and device) only supports offloading of the default | |
132 | network namespace (netns), the driver should set this feature flag to prevent | |
133 | the port netdev from being moved out of the default netns. A netns-aware | |
134 | driver/device would not set this flag and would be responsible for partitioning | |
135 | hardware to preserve netns containment. This means hardware cannot forward | |
136 | traffic from a port in one namespace to another port in another namespace. | |
137 | ||
138 | Port Topology | |
139 | ^^^^^^^^^^^^^ | |
140 | ||
141 | The port netdevs representing the physical switch ports can be organized into | |
142 | higher-level switching constructs. The default construct is a standalone | |
143 | router port, used to offload L3 forwarding. Two or more ports can be bonded | |
144 | together to form a LAG. Two or more ports (or LAGs) can be bridged to bridge | |
145 | L2 networks. VLANs can be applied to sub-divide L2 networks. L2-over-L3 | |
146 | tunnels can be built on ports. These constructs are built using standard Linux | |
147 | tools such as the bridge driver, the bonding/team drivers, and netlink-based | |
148 | tools such as iproute2. | |
149 | ||
150 | The switchdev driver can know a particular port's position in the topology by | |
151 | monitoring NETDEV_CHANGEUPPER notifications. For example, a port moved into a | |
152 | bond will see its upper master change. If that bond is moved into a bridge, | |
153 | the bond's upper master will change, and so on. The driver tracks such | |
154 | movements to know a port's position in the overall topology by registering | |
155 | for netdevice events and acting on NETDEV_CHANGEUPPER. | |
156 | ||
157 | L2 Forwarding Offload | |
158 | --------------------- | |
159 | ||
160 | The idea is to offload the L2 data forwarding (switching) path from the kernel | |
161 | to the switchdev device by mirroring bridge FDB entries down to the device. An | |
162 | FDB entry is the {port, MAC, VLAN} tuple forwarding destination. | |
163 | ||
164 | To offload L2 bridging, the switchdev driver/device should support: | |
165 | ||
166 | - Static FDB entries installed on a bridge port | |
167 | - Notification of learned/forgotten source MAC/VLANs from the device | |
168 | - STP state changes on the port | |
169 | - VLAN flooding of multicast/broadcast and unknown unicast packets | |
170 | ||
171 | Static FDB Entries | |
172 | ^^^^^^^^^^^^^^^^^^ | |
173 | ||
174 | The switchdev driver should implement ndo_fdb_add, ndo_fdb_del and ndo_fdb_dump | |
175 | to support static FDB entries installed to the device. Static bridge FDB | |
176 | entries are installed, for example, using the iproute2 bridge command: | |
177 | ||
178 | bridge fdb add ADDR dev DEV [vlan VID] [self] | |
179 | ||
180 | Note: by default, the bridge does not filter on VLAN and only bridges untagged | |
181 | traffic. To enable VLAN support, turn on VLAN filtering: | |
182 | ||
183 | echo 1 >/sys/class/net/<bridge>/bridge/vlan_filtering | |
184 | ||
185 | Notification of Learned/Forgotten Source MAC/VLANs | |
186 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
187 | ||
188 | The switch device will learn/forget source MAC address/VLAN on ingress packets | |
189 | and notify the switch driver of the mac/vlan/port tuples. The switch driver, | |
190 | in turn, will notify the bridge driver using the switchdev notifier call: | |
191 | ||
192 | err = call_switchdev_notifiers(val, dev, info); | |
193 | ||
194 | Where val is SWITCHDEV_FDB_ADD when learning and SWITCHDEV_FDB_DEL when forgetting, and | |
195 | info points to a struct switchdev_notifier_fdb_info. On SWITCHDEV_FDB_ADD, the bridge | |
196 | driver will install the FDB entry into the bridge's FDB and mark the entry as | |
197 | NTF_EXT_LEARNED. The iproute2 bridge command will label these entries | |
198 | "offload": | |
199 | ||
200 | $ bridge fdb | |
201 | 52:54:00:12:35:01 dev sw1p1 master br0 permanent | |
202 | 00:02:00:00:02:00 dev sw1p1 master br0 offload | |
203 | 00:02:00:00:02:00 dev sw1p1 self | |
204 | 52:54:00:12:35:02 dev sw1p2 master br0 permanent | |
205 | 00:02:00:00:03:00 dev sw1p2 master br0 offload | |
206 | 00:02:00:00:03:00 dev sw1p2 self | |
207 | 33:33:00:00:00:01 dev eth0 self permanent | |
208 | 01:00:5e:00:00:01 dev eth0 self permanent | |
209 | 33:33:ff:00:00:00 dev eth0 self permanent | |
210 | 01:80:c2:00:00:0e dev eth0 self permanent | |
211 | 33:33:00:00:00:01 dev br0 self permanent | |
212 | 01:00:5e:00:00:01 dev br0 self permanent | |
213 | 33:33:ff:12:35:01 dev br0 self permanent | |
214 | ||
215 | Learning on the port should be disabled on the bridge using the bridge command: | |
216 | ||
217 | bridge link set dev DEV learning off | |
218 | ||
219 | Learning on the device port should be enabled, as well as learning_sync: | |
220 | ||
221 | bridge link set dev DEV learning on self | |
222 | bridge link set dev DEV learning_sync on self | |
223 | ||
224 | The learning_sync attribute enables syncing of the learned/forgotten FDB entry to | |
225 | the bridge's FDB. It's possible, but not optimal, to enable learning on the | |
226 | device port and on the bridge port, and disable learning_sync. | |
227 | ||
228 | To support learning and learning_sync port attributes, the driver implements | |
229 | switchdev op switchdev_port_attr_get/set for SWITCHDEV_ATTR_PORT_BRIDGE_FLAGS. The driver | |
230 | should initialize the attributes to the hardware defaults. | |
231 | ||
232 | FDB Ageing | |
233 | ^^^^^^^^^^ | |
234 | ||
235 | There are two FDB ageing models supported: 1) ageing by the device, and 2) | |
236 | ageing by the kernel. Ageing by the device is preferred if many FDB entries | |
237 | are supported. The driver calls call_switchdev_notifiers(SWITCHDEV_FDB_DEL, ...) to | |
238 | age out the FDB entry. In this model, ageing by the kernel should be turned | |
239 | off. XXX: how to turn off ageing in kernel on a per-port basis or otherwise | |
240 | prevent the kernel from ageing out the FDB entry? | |
241 | ||
242 | In the kernel ageing model, the standard bridge ageing mechanism is used to age | |
243 | out stale FDB entries. To keep an FDB entry "alive", the driver should refresh | |
244 | the FDB entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...). The | |
245 | notification will reset the FDB entry's last-used time to now. The driver | |
246 | should rate limit refresh notifications, for example, no more than once a | |
247 | second. If the FDB entry expires, ndo_fdb_del is called to remove the entry from | |
248 | the device. XXX: this last part isn't currently correct: ndo_fdb_del isn't | |
249 | called, so the stale entry remains in the device; this needs to be fixed. | |
250 | ||
251 | FDB Flush | |
252 | ^^^^^^^^^ | |
253 | ||
254 | XXX: Unimplemented. Need to support FDB flush by bridge driver for port and | |
255 | remove both static and learned FDB entries. | |
256 | ||
257 | STP State Change on Port | |
258 | ^^^^^^^^^^^^^^^^^^^^^^^^ | |
259 | ||
260 | Internally or with a third-party STP protocol implementation (e.g. mstpd), the | |
261 | bridge driver maintains the STP state for ports, and will notify the switch | |
262 | driver of STP state change on a port using the switchdev op switchdev_port_attr_set for | |
263 | SWITCHDEV_ATTR_PORT_STP_UPDATE. | |
264 | ||
265 | State is one of BR_STATE_*. The switch driver can use STP state updates to | |
266 | update the ingress packet filter list for the port. For example, if the port is | |
267 | DISABLED, no packets should pass, but if the port moves to BLOCKED, then STP BPDUs | |
268 | and other IEEE 01:80:c2:xx:xx:xx link-local multicast packets can pass. | |
269 | ||
270 | Note that STP BPDUs are untagged and STP state applies to all VLANs on the port | |
271 | so packet filters should be applied consistently across untagged and tagged | |
272 | VLANs on the port. | |
273 | ||
274 | Flooding L2 domain | |
275 | ^^^^^^^^^^^^^^^^^^ | |
276 | ||
277 | For a given L2 VLAN domain, the switch device should flood multicast/broadcast | |
278 | and unknown unicast packets to all ports in the domain, if allowed by each port's | |
279 | current STP state. The switch driver, knowing which ports are within which | |
280 | VLAN L2 domain, can program the switch device for flooding. The packet should | |
281 | also be sent to the port netdev for processing by the bridge driver. The | |
282 | bridge should not reflood the packet to the same ports the device flooded. | |
283 | XXX: the mechanism to avoid duplicate flood packets is being discussed. | |
284 | ||
285 | It is possible for the switch device to not handle flooding and push the | |
286 | packets up to the bridge driver for flooding. This is not ideal as the number | |
287 | of ports in the L2 domain scales, since the device is much more efficient at | |
288 | flooding packets than software. | |
289 | ||
290 | IGMP Snooping | |
291 | ^^^^^^^^^^^^^ | |
292 | ||
293 | XXX: complete this section | |
294 | ||
295 | ||
296 | L3 routing | |
297 | ---------- | |
298 | ||
299 | Offloading L3 routing requires that the device be programmed with FIB entries from | |
300 | the kernel, with the device doing the FIB lookup and forwarding. The device | |
301 | does a longest prefix match (LPM) on FIB entries matching route prefix and | |
302 | forwards the packet to the matching FIB entry's nexthop(s) egress ports. To | |
303 | program the device, the switchdev driver is called with add/delete ops for IPv4 | |
304 | and IPv6 FIB entries. For IPv4, the driver implements switchdev ops: | |
305 | ||
306 | int (*switchdev_fib_ipv4_add)(struct net_device *dev, | |
307 | __be32 dst, int dst_len, | |
308 | struct fib_info *fi, | |
309 | u8 tos, u8 type, | |
310 | u32 nlflags, u32 tb_id); | |
311 | ||
312 | int (*switchdev_fib_ipv4_del)(struct net_device *dev, | |
313 | __be32 dst, int dst_len, | |
314 | struct fib_info *fi, | |
315 | u8 tos, u8 type, | |
316 | u32 tb_id); | |
317 | ||
318 | to add/delete IPv4 dst/dst_len prefix on table tb_id. The *fi structure holds | |
319 | details on the route and the route's nexthops. *dev is one of the port netdevs | |
320 | mentioned in the route's nexthop list. If the output port netdevs referenced | |
321 | in the route's nexthop list don't all have the same switch ID, the driver is | |
322 | not called to add/delete the FIB entry. | |
323 | ||
324 | Routes offloaded to the device are labeled with "offload" in the ip route | |
325 | listing: | |
326 | ||
327 | $ ip route show | |
328 | default via 192.168.0.2 dev eth0 | |
329 | 11.0.0.0/30 dev sw1p1 proto kernel scope link src 11.0.0.2 offload | |
330 | 11.0.0.4/30 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload | |
331 | 11.0.0.8/30 dev sw1p2 proto kernel scope link src 11.0.0.10 offload | |
332 | 11.0.0.12/30 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload | |
333 | 12.0.0.2 proto zebra metric 30 offload | |
334 | nexthop via 11.0.0.1 dev sw1p1 weight 1 | |
335 | nexthop via 11.0.0.9 dev sw1p2 weight 1 | |
336 | 12.0.0.3 via 11.0.0.1 dev sw1p1 proto zebra metric 20 offload | |
337 | 12.0.0.4 via 11.0.0.9 dev sw1p2 proto zebra metric 20 offload | |
338 | 192.168.0.0/24 dev eth0 proto kernel scope link src 192.168.0.15 | |
339 | ||
340 | XXX: add/del IPv6 FIB API | |
341 | ||
342 | Nexthop Resolution | |
343 | ^^^^^^^^^^^^^^^^^^ | |
344 | ||
345 | The FIB entry's nexthop list contains the nexthop tuple (gateway, dev), but for | |
346 | the switch device to forward the packet with the correct dst mac address, the | |
347 | nexthop gateways must be resolved to the neighbor's mac address. Neighbor mac | |
348 | address discovery comes via the ARP (or ND) process and is available via the | |
349 | arp_tbl neighbor table. To resolve the route's nexthop gateways, the driver | |
350 | should trigger the kernel's neighbor resolution process. See the rocker | |
351 | driver's rocker_port_ipv4_resolve() for an example. | |
352 | ||
353 | The driver can monitor for updates to arp_tbl using the netevent notifier | |
354 | NETEVENT_NEIGH_UPDATE. The device can be programmed with resolved nexthops | |
355 | for the routes as arp_tbl updates. |
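
The bookkeeping implied by the last paragraph, programming resolved nexthops
as neighbor updates arrive, can be modeled in a few lines. The following is a
minimal userspace sketch in C. It is not kernel code and not the rocker
driver's implementation: struct nh_entry, struct nh_table, nh_add() and
nh_neigh_update() are hypothetical names invented for illustration, and
nh_neigh_update() only stands in for whatever a driver's
NETEVENT_NEIGH_UPDATE handler would do. It assumes IPv4 gateways and a small
fixed-size table.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NH_MAX     8	/* hypothetical table size */
#define NH_MAC_LEN 6	/* length of an Ethernet MAC address */

/* Hypothetical driver-side record for one route nexthop (gateway, dev). */
struct nh_entry {
	uint32_t gw;			/* nexthop gateway IPv4 address */
	int port;			/* index of the egress port netdev */
	bool resolved;			/* true once the neighbor MAC is known */
	uint8_t mac[NH_MAC_LEN];	/* neighbor MAC, valid only if resolved */
};

struct nh_table {
	struct nh_entry e[NH_MAX];
	int count;
};

/* Record a not-yet-resolved nexthop; returns its index, or -1 if full. */
int nh_add(struct nh_table *t, uint32_t gw, int port)
{
	struct nh_entry *e;

	if (t->count >= NH_MAX)
		return -1;
	e = &t->e[t->count];
	memset(e, 0, sizeof(*e));
	e->gw = gw;
	e->port = port;
	return t->count++;
}

/*
 * Stand-in for a neighbor-update handler: a non-NULL mac resolves every
 * nexthop matching (gw, port); a NULL mac marks them unresolved again.
 * Returns the number of matching entries, i.e. how many nexthops a real
 * driver would have to (re)program in the device.
 */
int nh_neigh_update(struct nh_table *t, uint32_t gw, int port,
		    const uint8_t *mac)
{
	int i, matched = 0;

	for (i = 0; i < t->count; i++) {
		struct nh_entry *e = &t->e[i];

		if (e->gw != gw || e->port != port)
			continue;
		e->resolved = (mac != NULL);
		if (mac)
			memcpy(e->mac, mac, NH_MAC_LEN);
		matched++;
	}
	return matched;
}
```

In this model a driver would call nh_add() for each (gateway, dev) pair when a
FIB entry is added, and nh_neigh_update() from its neighbor-update handler.
Only entries with resolved set would be written to the device; the rest would
wait for neighbor resolution to complete.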