Commit | Line | Data |
---|---|---|
ea5bacaa MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ===================================================== | |
e5b1de1f MM |
4 | Netdev features mess and how to get out from it alive |
5 | ===================================================== | |
6 | ||
7 | Author: | |
8 | Michał Mirosław <mirq-linux@rere.qmqm.pl> | |
9 | ||
10 | ||
11 | ||
ea5bacaa MCC |
12 | Part I: Feature sets |
13 | ==================== | |
e5b1de1f MM |
14 | |
15 | Long gone are the days when a network card would just take and give packets | |
16 | verbatim. Today's devices add multiple features and bugs (read: offloads) | |
17 | that relieve an OS of various tasks like generating and checking checksums, | |
18 | splitting packets, and classifying them. Those capabilities and their state | |
19 | are commonly referred to as netdev features in the Linux kernel world. | |
20 | ||
21 | There are currently three sets of features relevant to the driver, and | |
22 | one used internally by network core: | |
23 | ||
24 | 1. netdev->hw_features set contains features whose state may possibly | |
25 | be changed (enabled or disabled) for a particular device by user's | |
26 | request. This set should be initialized in ndo_init callback and not | |
27 | changed later. | |
28 | ||
29 | 2. netdev->features set contains features which are currently enabled | |
30 | for a device. This should be changed only by network core or in | |
31 | error paths of ndo_set_features callback. | |
32 | ||
33 | 3. netdev->vlan_features set contains features whose state is inherited | |
34 | by child VLAN devices (limits netdev->features set). This is currently | |
35 | used for all VLAN devices whether tags are stripped or inserted in | |
36 | hardware or software. | |
37 | ||
38 | 4. netdev->wanted_features set contains feature set requested by user. | |
39 | This set is filtered by ndo_fix_features callback whenever it or | |
40 | some device-specific conditions change. This set is internal to | |
41 | networking core and should not be referenced in drivers. | |
42 | ||
43 | ||
44 | ||
ea5bacaa MCC |
45 | Part II: Controlling enabled features |
46 | ===================================== | |
e5b1de1f MM |
47 | |
48 | When the current feature set (netdev->features) is to be changed, a new set | |
49 | is calculated and filtered by calling the ndo_fix_features callback | |
50 | and netdev_fix_features(). If the resulting set differs from the current | |
51 | set, it is passed to the ndo_set_features callback and (if the callback | |
52 | returns success) replaces the value stored in netdev->features. | |
53 | A NETDEV_FEAT_CHANGE notification is issued after that whenever the current | |
54 | set might have changed. | |
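
The recalculation flow described above can be sketched as a small userspace model. This is not kernel code: `struct model_dev`, `model_update_features()` and the `fix`/`set` function pointers are hypothetical stand-ins for the net_device fields and ndo_* callbacks named in the text. The way the candidate set is built from wanted_features, and the omission of the core's own netdev_fix_features() filtering, are simplifying assumptions.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t features_t;

/* Hypothetical stand-in for struct net_device: only the sets named in the text. */
struct model_dev {
	features_t hw_features;     /* features the user may toggle */
	features_t features;        /* features currently enabled */
	features_t wanted_features; /* features requested by the user */
	/* Models ndo_fix_features; NULL means "no filtering". */
	features_t (*fix)(struct model_dev *dev, features_t f);
	/* Models ndo_set_features; NULL is treated as always succeeding. */
	int (*set)(struct model_dev *dev, features_t f);
	int notifications;          /* counts modelled NETDEV_FEAT_CHANGE events */
};

/* Recalculate the enabled set; returns the modelled ndo_set_features result. */
static int model_update_features(struct model_dev *dev)
{
	/* Assumption: non-toggleable bits stay as they are, toggleable bits
	 * follow the user's request. */
	features_t f = (dev->features & ~dev->hw_features) |
		       (dev->wanted_features & dev->hw_features);
	int err = 0;

	if (dev->fix)
		f = dev->fix(dev, f);
	/* The core's own filtering (netdev_fix_features()) is omitted here. */

	if (f == dev->features)
		return 0;               /* nothing to change, nothing to notify */

	if (dev->set)
		err = dev->set(dev, f);
	if (err == 0)
		dev->features = f;      /* replaced only when the callback succeeds */

	dev->notifications++;           /* the current set might have changed */
	return err;
}
```

In this model a second call with unchanged inputs is a no-op, which mirrors the statement that the callback is invoked only when the resulting set differs from the current one.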
55 | ||
56 | The following events trigger recalculation: | |
57 | 1. device's registration, after ndo_init returned success | |
58 | 2. user requested changes in features state | |
59 | 3. netdev_update_features() is called | |
60 | ||
61 | ndo_*_features callbacks are called with rtnl_lock held. Missing callbacks | |
62 | are treated as always returning success. | |
63 | ||
64 | A driver that wants to trigger recalculation must do so by calling | |
65 | netdev_update_features() while holding rtnl_lock. This should not be done | |
66 | from ndo_*_features callbacks. netdev->features should not be modified by | |
67 | driver except by means of ndo_fix_features callback. | |
68 | ||
69 | ||
70 | ||
ea5bacaa MCC |
71 | Part III: Implementation hints |
72 | ============================== | |
e5b1de1f MM |
73 | |
74 | * ndo_fix_features: | |
75 | ||
76 | All dependencies between features should be resolved here. The resulting | |
77 | set can be reduced further by networking core imposed limitations (as coded | |
78 | in netdev_fix_features()). For this reason it is safer to disable a feature | |
79 | when its dependencies are not met instead of forcing the dependency on. | |
80 | ||
81 | This callback should not modify hardware nor driver state (should be | |
82 | stateless). It can be called multiple times between successive | |
83 | ndo_set_features calls. | |
84 | ||
85 | Callback must not alter features contained in NETIF_F_SOFT_FEATURES or | |
86 | NETIF_F_NEVER_CHANGE sets. The exception is NETIF_F_VLAN_CHALLENGED but | |
87 | care must be taken as the change won't affect already configured VLANs. | |
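
As an illustration of resolving a dependency by disabling the dependent feature, here is a minimal userspace sketch. It uses the rx-gro-hw dependency on RXCSUM described in Part IV. The `MODEL_F_*` bit values and `model_fix_features()` are hypothetical; the kernel defines the real NETIF_F_* constants in include/linux/netdev_features.h, and a real callback also receives the device as an argument.

```c
#include <stdint.h>

typedef uint64_t features_t;

/* Hypothetical bit positions, chosen only for this sketch. */
#define MODEL_F_RXCSUM (1ULL << 0)
#define MODEL_F_GRO_HW (1ULL << 1)

/*
 * Stateless filter: maps a requested set to an achievable one without
 * touching hardware or driver state, so it is safe to call repeatedly.
 */
static features_t model_fix_features(features_t requested)
{
	/* Dependency not met: drop hardware GRO instead of forcing RXCSUM on. */
	if (!(requested & MODEL_F_RXCSUM))
		requested &= ~MODEL_F_GRO_HW;

	return requested;
}
```

Dropping the dependent bit keeps the result valid even if a later filtering stage removes the dependency again, which is the reason the text gives for preferring this over forcing the dependency on.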
88 | ||
89 | * ndo_set_features: | |
90 | ||
91 | Hardware should be reconfigured to match passed feature set. The set | |
92 | should not be altered unless some error condition happens that can't | |
93 | be reliably detected in ndo_fix_features. In this case, the callback | |
94 | should update netdev->features to match resulting hardware state. | |
95 | Errors returned are not (and cannot be) propagated anywhere except dmesg. | |
96 | (Note: successful return is zero, >0 means silent error.) | |
97 | ||
98 | ||
99 | ||
ea5bacaa MCC |
100 | Part IV: Features |
101 | ================= | |
e5b1de1f MM |
102 | |
103 | For current list of features, see include/linux/netdev_features.h. | |
104 | This section describes semantics of some of them. | |
105 | ||
106 | * Transmit checksumming | |
107 | ||
108 | For complete description, see comments near the top of include/linux/skbuff.h. | |
109 | ||
110 | Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM. | |
111 | It means that the device can fill a TCP/UDP-like checksum anywhere in the | |
112 | packet, whatever headers there might be. | |
113 | ||
114 | * Transmit TCP segmentation offload | |
115 | ||
116 | NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit | |
117 | set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6). | |
118 | ||
83aa025f WB |
119 | * Transmit UDP segmentation offload |
120 | ||
09e58b2d | 121 | NETIF_F_GSO_UDP_L4 accepts a single UDP header with a payload that exceeds |
83aa025f WB |
122 | gso_size. On segmentation, it segments the payload on gso_size boundaries and |
123 | replicates the network and UDP headers (fixing up the last segment's headers if | |
124 | its payload is shorter than gso_size). | |
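
The segment arithmetic implied by this description can be sketched as follows. This is a userspace model and not kernel code: both helper names are hypothetical, and only payload lengths are modelled (header replication and fix-up are left out).

```c
#include <stddef.h>

/* Number of segments produced when payload_len bytes are cut at gso_size. */
static size_t model_udp_gso_segs(size_t payload_len, size_t gso_size)
{
	if (gso_size == 0)
		return 0;       /* no valid segmentation in this model */

	return (payload_len + gso_size - 1) / gso_size;
}

/* Payload length of segment idx (0-based); 0 if idx is out of range. */
static size_t model_udp_gso_seg_len(size_t payload_len, size_t gso_size,
				    size_t idx)
{
	size_t segs = model_udp_gso_segs(payload_len, gso_size);

	if (idx >= segs)
		return 0;
	if (idx + 1 < segs)
		return gso_size;        /* every segment but the last is full */

	return payload_len - idx * gso_size;    /* last one may be shorter */
}
```

For example, a 3000-byte payload with a gso_size of 1400 yields three segments carrying 1400, 1400 and 200 bytes; only the last one needs its length fields adjusted.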
125 | ||
e5b1de1f MM |
126 | * Transmit DMA from high memory |
127 | ||
128 | On platforms where this is relevant, NETIF_F_HIGHDMA signals that | |
129 | ndo_start_xmit can handle skbs with frags in high memory. | |
130 | ||
131 | * Transmit scatter-gather | |
132 | ||
133 | Those features say that ndo_start_xmit can handle fragmented skbs: | |
134 | NETIF_F_SG --- paged skbs (skb_shinfo()->frags), NETIF_F_FRAGLIST --- | |
135 | chained skbs (skb->next/prev list). | |
136 | ||
137 | * Software features | |
138 | ||
139 | Features contained in NETIF_F_SOFT_FEATURES are features of networking | |
140 | stack. Driver should not change behaviour based on them. | |
141 | ||
142 | * LLTX driver (deprecated for hardware drivers) | |
143 | ||
f0cdf76c FW |
144 | NETIF_F_LLTX is meant to be used by drivers that don't need locking at all, |
145 | e.g. software tunnels. | |
e5b1de1f | 146 | |
f0cdf76c FW |
147 | This is also used in a few legacy drivers that implement their |
148 | own locking; don't use it for new (hardware) drivers. | |
e5b1de1f MM |
149 | |
150 | * netns-local device | |
151 | ||
152 | NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between | |
153 | network namespaces (e.g. loopback). | |
154 | ||
155 | Don't use it in drivers. | |
156 | ||
157 | * VLAN challenged | |
158 | ||
159 | NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN | |
160 | headers. Some drivers set this because the cards can't handle the bigger MTU. | |
161 | [FIXME: Those cases could be fixed in VLAN code by allowing only reduced-MTU | |
162 | VLANs. This may not be useful, though.] | |
36eabda3 BG |
163 | |
164 | * rx-fcs | |
165 | ||
166 | This requests that the NIC append the Ethernet Frame Checksum (FCS) | |
167 | to the end of the skb data. This allows sniffers and other tools to | |
168 | read the CRC recorded by the NIC on receipt of the packet. | |
5e0c03c8 BG |
169 | |
170 | * rx-all | |
171 | ||
172 | This requests that the NIC receive all possible frames, including errored | |
173 | frames (such as bad FCS, etc). This can be helpful when sniffing a link with | |
174 | bad packets on it. Some NICs may receive more packets if also put into normal | |
73e212fc | 175 | PROMISC mode. |
fb1f5f79 MC |
176 | |
177 | * rx-gro-hw | |
178 | ||
179 | This requests that the NIC enable Hardware GRO (generic receive offload). | |
180 | Hardware GRO is basically the exact reverse of TSO, and is generally | |
181 | stricter than Hardware LRO. A packet stream merged by Hardware GRO must | |
182 | be re-segmentable by GSO or TSO back to the exact original packet stream. | |
183 | Hardware GRO is dependent on RXCSUM since every packet successfully merged | |
184 | by hardware must also have the checksum verified by hardware. | |
dcf0cd1c GM |
185 | |
186 | * hsr-tag-ins-offload | |
187 | ||
188 | This should be set for devices which insert an HSR (High-availability Seamless | |
189 | Redundancy) or PRP (Parallel Redundancy Protocol) tag automatically. | |
190 | ||
191 | * hsr-tag-rm-offload | |
192 | ||
193 | This should be set for devices which remove HSR (High-availability Seamless | |
194 | Redundancy) or PRP (Parallel Redundancy Protocol) tags automatically. | |
195 | ||
196 | * hsr-fwd-offload | |
197 | ||
198 | This should be set for devices which forward HSR (High-availability Seamless | |
199 | Redundancy) frames from one port to another in hardware. | |
200 | ||
201 | * hsr-dup-offload | |
202 | ||
203 | This should be set for devices which duplicate outgoing HSR (High-availability | |
204 | Seamless Redundancy) or PRP (Parallel Redundancy Protocol) frames automatically | |
205 | in hardware.