Checksum Offloads in the Linux Networking Stack


Introduction
============

This document describes a set of techniques in the Linux networking stack
to take advantage of checksum offload capabilities of various NICs.

The following technologies are described:
 * TX Checksum Offload
 * LCO: Local Checksum Offload
 * RCO: Remote Checksum Offload

Things that should be documented here but aren't yet:
 * RX Checksum Offload
 * CHECKSUM_UNNECESSARY conversion


TX Checksum Offload
===================

The interface for offloading a transmit checksum to a device is explained
in detail in comments near the top of include/linux/skbuff.h.
In brief, it allows the stack to request that the device fill in a single
ones-complement checksum defined by the sk_buff fields skb->csum_start and
skb->csum_offset.  The device should compute the 16-bit ones-complement
checksum (i.e. the 'IP-style' checksum) from csum_start to the end of the
packet, and fill in the result at (csum_start + csum_offset).
Because csum_offset cannot be negative, this ensures that the previous
value of the checksum field is included in the checksum computation, thus
it can be used to supply any needed corrections to the checksum (such as
the sum of the pseudo-header for UDP or TCP).
This interface only allows a single checksum to be offloaded.  Where
encapsulation is used, the packet may have multiple checksum fields in
different header layers, and the rest will have to be handled by another
mechanism such as LCO or RCO.
CRC32c can also be offloaded using this interface, by means of filling
skb->csum_start and skb->csum_offset as described above, and setting
skb->csum_not_inet: see skbuff.h comment (section 'D') for more details.
No offloading of the IP header checksum is performed; it is always done in
software.  This is OK because when we build the IP header, we obviously
have it in cache, so summing it isn't expensive.  It's also rather short.
The requirements for GSO are more complicated, because when segmenting an
encapsulated packet both the inner and outer checksums may need to be
edited or recomputed for each resulting segment.  See the skbuff.h comment
(section 'E') for more details.
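The csum_start / csum_offset contract described above can be illustrated
with a small userspace model.  This is only a sketch and not kernel code:
ones_sum() and model_tx_csum_offload() are hypothetical names, the packet
is a plain byte array, and csum_start is taken as an offset from the start
of that array.  It assumes csum_offset is even, so that the checksum field
lines up with a 16-bit word of the sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones-complement sum of buf[0..len) taken as big-endian 16-bit words.
 * An odd trailing byte is padded with a zero byte.
 */
static uint16_t ones_sum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)                   /* end-around carry */
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Hypothetical model of what the device is asked to do: sum from
 * csum_start to the end of the packet (including whatever seed the
 * stack left in the checksum field), complement the result, and store
 * it at csum_start + csum_offset.
 */
static void model_tx_csum_offload(uint8_t *pkt, size_t len,
                                  size_t csum_start, size_t csum_offset)
{
    size_t field = csum_start + csum_offset;
    uint16_t csum;

    assert(field + 2 <= len && csum_offset % 2 == 0);
    csum = (uint16_t)~ones_sum(pkt + csum_start, len - csum_start);
    pkt[field] = (uint8_t)(csum >> 8);
    pkt[field + 1] = (uint8_t)(csum & 0xff);
}
```

If the stack seeds the checksum field with the pseudo-header sum P before
calling the model, the ones-complement sum from csum_start to the end of
the packet afterwards equals the complement of P.  Bytes before csum_start
are never read or written.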

A driver declares its offload capabilities in netdev->hw_features; see
Documentation/networking/netdev-features.txt for more.  Note that a device
which only advertises NETIF_F_IP[V6]_CSUM must still obey the csum_start
and csum_offset given in the SKB; if it tries to deduce these itself in
hardware (as some NICs do) the driver should check that the values in the
SKB match those which the hardware will deduce, and if not, fall back to
checksumming in software instead (with skb_csum_hwoffload_help() or one of
the skb_checksum_help() / skb_crc32c_csum_help() functions, as mentioned in
include/linux/skbuff.h).
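The check a driver makes for a NIC that deduces the offsets itself might
look roughly like the following.  This is a userspace sketch under stated
assumptions, not driver code: both structs and the function name are
hypothetical, and a real driver would read the requested values from the
SKB and derive the hardware's view from its own parsing rules.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of what the stack requested. */
struct tx_csum_request {
    uint16_t csum_start;    /* where summing must begin */
    uint16_t csum_offset;   /* checksum field, relative to csum_start */
};

/* Hypothetical description of what a header-parsing NIC would deduce. */
struct hw_deduced_offsets {
    uint16_t l4_start;      /* where the hardware thinks L4 begins */
    uint16_t csum_field;    /* checksum field offset within that header */
};

/* Returns true if the hardware would checksum exactly what the stack
 * asked for.  On false, the caller is expected to checksum in software.
 */
static bool hw_csum_matches_request(const struct tx_csum_request *req,
                                    const struct hw_deduced_offsets *hw)
{
    assert(req != NULL && hw != NULL);
    return req->csum_start == hw->l4_start &&
           req->csum_offset == hw->csum_field;
}
```

For example, hardware that always assumes a UDP header at offset 34 would
match a plain UDP-over-IPv4 request, but not a request whose csum_start
points at an inner TCP header further into the packet.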

The stack should, for the most part, assume that checksum offload is
supported by the underlying device.  The only place that should check is
validate_xmit_skb(), and the functions it calls directly or indirectly.
That function compares the offload features requested by the SKB (which
may include other offloads besides TX Checksum Offload) and, if they are
not supported or enabled on the device (determined by netdev->features),
performs the corresponding offload in software.  In the case of TX
Checksum Offload, that means calling skb_csum_hwoffload_help(skb, features).
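The decision described above can be modelled in userspace as follows.
This is a sketch only: struct model_pkt, MODEL_F_HW_CSUM and
model_validate_xmit() are hypothetical stand-ins, not the kernel's data
structures or feature flags, and only the TX Checksum Offload case is
covered.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_F_HW_CSUM 0x1u    /* hypothetical feature bit */

/* Hypothetical stand-in for the few packet fields this sketch needs. */
struct model_pkt {
    uint8_t *data;
    size_t len;
    size_t csum_start;
    size_t csum_offset;
    bool csum_pending;          /* a checksum still has to be filled in */
};

/* Ones-complement sum over big-endian 16-bit words, zero-padded. */
static uint16_t ones_sum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* If the device cannot do the requested checksum, do it in software
 * before the packet reaches the driver; otherwise leave it pending.
 */
static void model_validate_xmit(struct model_pkt *p, unsigned int dev_features)
{
    size_t field = p->csum_start + p->csum_offset;
    uint16_t csum;

    if (!p->csum_pending || (dev_features & MODEL_F_HW_CSUM))
        return;

    assert(field + 2 <= p->len && p->csum_offset % 2 == 0);
    csum = (uint16_t)~ones_sum(p->data + p->csum_start,
                               p->len - p->csum_start);
    p->data[field] = (uint8_t)(csum >> 8);
    p->data[field + 1] = (uint8_t)(csum & 0xff);
    p->csum_pending = false;
}
```

The point of the design is that code above this layer never needs to ask
whether the device supports the offload: it always requests it, and the
single check on the transmit path resolves the request one way or the
other.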


LCO: Local Checksum Offload
===========================

LCO is a technique for efficiently computing the outer checksum of an
encapsulated datagram when the inner checksum is due to be offloaded.
The ones-complement sum of a correctly checksummed TCP or UDP packet is
equal to the complement of the sum of the pseudo header, because everything
else gets 'cancelled out' by the checksum field.  This is because the sum was
complemented before being written to the checksum field.
More generally, this holds in any case where the 'IP-style' ones-complement
checksum is used, and thus any checksum that TX Checksum Offload supports.
That is, if we have set up TX Checksum Offload with a start/offset pair, we
know that after the device has filled in that checksum, the ones-complement
sum from csum_start to the end of the packet will be equal to
the complement of whatever value we put in the checksum field beforehand.
This allows us to compute the outer checksum without looking at the payload:
we simply stop summing when we get to csum_start, then add the complement of
the 16-bit word at (csum_start + csum_offset).
Then, when the true inner checksum is filled in (either by hardware or by
skb_checksum_help()), the outer checksum will become correct by virtue of
the arithmetic.
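The arithmetic above can be checked with a small userspace model.  This is
a sketch, not the kernel's implementation: ones_sum() and lco_sum() are
hypothetical names, offsets are relative to the start of a plain byte
array, and the outer pseudo-header is left out for brevity.  It assumes
that the distance from hdr_start to csum_start and csum_offset are both
even.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones-complement sum over big-endian 16-bit words, zero-padded. */
static uint16_t ones_sum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Sum the headers from hdr_start up to (not including) csum_start, then
 * add the complement of the seed currently sitting in the inner checksum
 * field.  The payload is never touched.
 */
static uint16_t lco_sum(const uint8_t *pkt, size_t hdr_start,
                        size_t csum_start, size_t csum_offset)
{
    size_t field = csum_start + csum_offset;
    uint16_t seed = (uint16_t)((pkt[field] << 8) | pkt[field + 1]);
    uint32_t sum;

    assert(hdr_start <= csum_start);
    assert((csum_start - hdr_start) % 2 == 0 && csum_offset % 2 == 0);
    sum = (uint32_t)(uint16_t)~seed;
    sum += ones_sum(pkt + hdr_start, csum_start - hdr_start);
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}
```

Calling lco_sum() before the inner checksum has been filled in gives the
same value as summing the entire packet after it has been filled in, which
is why the outer checksum can be written early.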

LCO is performed by the stack when constructing an outer UDP header for an
encapsulation such as VXLAN or GENEVE, in udp_set_csum().  Similarly for
the IPv6 equivalents, in udp6_set_csum().
It is also performed when constructing an IPv4 GRE header, in
net/ipv4/ip_gre.c:build_header().  It is *not* currently performed when
constructing an IPv6 GRE header; the GRE checksum is computed over the
whole packet in net/ipv6/ip6_gre.c:ip6gre_xmit2(), but it should be
possible to use LCO here as IPv6 GRE still uses an IP-style checksum.
All of the LCO implementations use a helper function lco_csum(), in
include/linux/skbuff.h.

LCO can safely be used for nested encapsulations; in this case, the outer
encapsulation layer will sum over both its own header and the 'middle'
header.  This does mean that the 'middle' header will get summed multiple
times, but there doesn't seem to be a way to avoid that without incurring
bigger costs (e.g. in SKB bloat).
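The nested case can be modelled the same way.  In this userspace sketch
(hypothetical names, pseudo-headers omitted, all header lengths and
offsets assumed even), each encapsulation layer applies LCO from the start
of its own header up to csum_start, so the outer layer's sum also covers
the 'middle' header, including the checksum that layer has already
written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ones-complement sum over big-endian 16-bit words, zero-padded. */
static uint16_t ones_sum(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
    if (i < len)
        sum += (uint32_t)buf[i] << 8;
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Apply LCO for the header starting at hdr_start: zero its checksum field
 * (at hdr_start + field_off), sum the headers up to csum_start, add the
 * complement of the innermost seed, and store the complemented result.
 */
static void lco_fill(uint8_t *pkt, size_t hdr_start, size_t field_off,
                     size_t csum_start, size_t csum_offset)
{
    size_t seed_at = csum_start + csum_offset;
    size_t field = hdr_start + field_off;
    uint16_t seed, csum;
    uint32_t sum;

    assert(field + 2 <= csum_start);
    assert((csum_start - hdr_start) % 2 == 0 && field_off % 2 == 0);
    pkt[field] = 0;
    pkt[field + 1] = 0;
    seed = (uint16_t)((pkt[seed_at] << 8) | pkt[seed_at + 1]);
    sum = (uint32_t)(uint16_t)~seed;
    sum += ones_sum(pkt + hdr_start, csum_start - hdr_start);
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    csum = (uint16_t)~sum;
    pkt[field] = (uint8_t)(csum >> 8);
    pkt[field + 1] = (uint8_t)(csum & 0xff);
}
```

The layers must be filled from the inside out: the middle header's
checksum has to be in place before the outer layer sums over it.  Once the
innermost checksum is finally filled in, every layer verifies.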


RCO: Remote Checksum Offload
============================

RCO is a technique for eliding the inner checksum of an encapsulated
datagram, allowing the outer checksum to be offloaded.  It does, however,
involve a change to the encapsulation protocols, which the receiver must
also support.  For this reason, it is disabled by default.
RCO is detailed in the following Internet-Drafts:
https://tools.ietf.org/html/draft-herbert-remotecsumoffload-00
https://tools.ietf.org/html/draft-herbert-vxlan-rco-00
In Linux, RCO is implemented individually in each encapsulation protocol,
and most tunnel types have flags controlling its use.  For instance, VXLAN
has the flag VXLAN_F_REMCSUM_TX (per struct vxlan_rdst) to indicate that
RCO should be used when transmitting to a given remote destination.