Commit | Line | Data |
---|---|---|
aa376427 MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ==================================== | |
19b351f1 PNA |
4 | Netfilter's flowtable infrastructure |
5 | ==================================== | |
6 | ||
143490cd PNA |
7 | This documentation describes the Netfilter flowtable infrastructure which allows |
8 | you to define a fastpath through the flowtable datapath. This infrastructure | |
9 | also provides hardware offload support. The flowtable supports the layer 3 | |
10 | IPv4 and IPv6 and the layer 4 TCP and UDP protocols. | |
19b351f1 PNA |
11 | |
12 | Overview | |
13 | -------- | |
14 | ||
143490cd PNA |
15 | Once the first packet of the flow successfully goes through the IP forwarding |
16 | path, from the second packet on, you might decide to offload the flow to the | |
17 | flowtable through your ruleset. The flowtable infrastructure provides a rule | |
18 | action that allows you to specify when to add a flow to the flowtable. | |
19b351f1 | 19 | |
143490cd PNA |
20 | A packet that finds a matching entry in the flowtable (i.e. a flowtable hit) is |
21 | transmitted to the output netdevice via neigh_xmit(); hence, packets bypass the | |
22 | classic IP forwarding path (the visible effect is that you do not see these | |
23 | packets from any of the Netfilter hooks coming after ingress). If there is no | |
24 | matching entry in the flowtable (i.e. a flowtable miss), the packet follows the | |
25 | classic IP forwarding path. | |
19b351f1 | 26 | |
143490cd PNA |
27 | The flowtable uses a resizable hashtable. Lookups are based on the following |
28 | n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3 | |
29 | source and destination, layer 4 source and destination ports and the input | |
30 | interface (useful in case there are several conntrack zones in place). | |
19b351f1 | 31 | |
143490cd PNA |
32 | The 'flow add' action allows you to populate the flowtable: the user selectively |
33 | specifies which flows are placed into the flowtable. Hence, packets follow the | |
34 | classic IP forwarding path unless the user explicitly instructs flows to use this | |
35 | new alternative forwarding path via policy. | |
19b351f1 | 36 | |
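As a rough sketch of such a selective policy, the ruleset below places only TCP
flows destined to port 443 into the flowtable and leaves all other traffic on
the classic IP forwarding path. The table, chain and flowtable names, the
devices and the port number are illustrative assumptions.

```nft
table inet x {
    flowtable f {
        hook ingress priority 0; devices = { eth0, eth1 };
    }
    chain y {
        type filter hook forward priority 0; policy accept;
        # Assumption: only HTTPS flows are considered worth offloading here.
        # Flows that do not match this rule never enter the flowtable.
        tcp dport 443 flow add @f
    }
}
```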
143490cd PNA |
37 | The flowtable datapath is represented in Fig.1, which describes the classic IP |
38 | forwarding path including the Netfilter hooks and the flowtable fastpath bypass. | |
19b351f1 | 39 | |
aa376427 MCC |
40 | :: |
41 | ||
42 | userspace process | |
43 | ^ | | |
44 | | | | |
45 | _____|____ ____\/___ | |
46 | / \ / \ | |
47 | | input | | output | | |
48 | \__________/ \_________/ | |
49 | ^ | | |
50 | | | | |
19b351f1 PNA |
51 | _________ __________ --------- _____\/_____ |
52 | / \ / \ |Routing | / \ | |
53 | --> ingress ---> prerouting ---> |decision| | postrouting |--> neigh_xmit | |
54 | \_________/ \__________/ ---------- \____________/ ^ | |
7c9abe12 PNA |
55 | | ^ | ^ | |
56 | flowtable | ____\/___ | | | |
57 | | | / \ | | | |
58 | __\/___ | | forward |------------ | | |
19b351f1 PNA |
59 | |-----| | \_________/ | |
60 | |-----| | 'flow offload' rule | | |
61 | |-----| | adds entry to | | |
62 | |_____| | flowtable | | |
63 | | | | | |
64 | / \ | | | |
65 | /hit\_no_| | | |
66 | \ ? / | | |
67 | \ / | | |
68 | |__yes_________________fastpath bypass ____________________________| | |
69 | ||
aa376427 | 70 | Fig.1 Netfilter hooks and flowtable interactions |
19b351f1 PNA |
71 | |
72 | The flowtable entry also stores the NAT configuration, so all packets are | |
143490cd PNA |
73 | mangled according to the NAT policy that is specified from the classic IP |
74 | forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented | |
75 | traffic is passed up to follow the classic IP forwarding path given that the | |
76 | transport header is missing; in this case, flowtable lookups are not possible. | |
77 | TCP RST and FIN packets are also passed up to the classic IP forwarding path to | |
78 | release the flow gracefully. Packets that exceed the MTU are also passed up to | |
79 | the classic forwarding path to report packet-too-big ICMP errors to the sender. | |
19b351f1 PNA |
80 | |
81 | Example configuration | |
82 | --------------------- | |
83 | ||
84 | Enabling the flowtable bypass is relatively easy: you only need to create a | |
aa376427 | 85 | flowtable and add one rule to your forward chain:: |
19b351f1 | 86 | |
aa376427 | 87 | table inet x { |
19b351f1 | 88 | flowtable f { |
78e06cf4 | 89 | hook ingress priority 0; devices = { eth0, eth1 }; |
19b351f1 | 90 | } |
aa376427 MCC |
91 | chain y { |
92 | type filter hook forward priority 0; policy accept; | |
143490cd | 93 | ip protocol tcp flow add @f |
aa376427 MCC |
94 | counter packets 0 bytes 0 |
95 | } | |
96 | } | |
19b351f1 PNA |
97 | |
98 | This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1 | |
99 | netdevices. You can create as many flowtables as you want in case you need to | |
100 | perform resource partitioning. The flowtable priority defines the order in which | |
101 | hooks are run in the pipeline; this is convenient in case you already have an | |
102 | nftables ingress chain (make sure the flowtable priority is smaller than that of | |
103 | the nftables ingress chain so that the flowtable runs before it in the pipeline). | |
104 | ||
105 | The 'flow add' action from the forward chain 'y' adds an entry to the | |
106 | flowtable for the TCP SYN-ACK packet coming in the reply direction. Once the | |
107 | flow is offloaded, you will observe that the counter rule in the example above | |
108 | does not get updated for the packets that are being forwarded through the | |
109 | forwarding bypass. | |
110 | ||
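The priority ordering described above can be sketched as follows, assuming an
existing nftables ingress chain on eth0 at priority 0. The netdev table and
chain names and the value -100 are hypothetical; any flowtable priority smaller
than that of the ingress chain should have the same effect.

```nft
# Assumed pre-existing ingress chain on eth0, running at priority 0.
table netdev filter {
    chain ingress {
        type filter hook ingress device eth0 priority 0; policy accept;
    }
}

table inet x {
    flowtable f {
        # Smaller priority than the ingress chain above, so the
        # flowtable lookup runs first in the pipeline.
        hook ingress priority -100; devices = { eth0, eth1 };
    }
    chain y {
        type filter hook forward priority 0; policy accept;
        ip protocol tcp flow add @f
    }
}
```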
143490cd PNA |
111 | You can identify offloaded flows through the [OFFLOAD] tag when listing your |
112 | connection tracking table. | |
113 | ||
114 | :: | |
794d9b25 | 115 | |
143490cd PNA |
116 | # conntrack -L |
117 | tcp 6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2 | |
118 | ||
119 | ||
120 | Layer 2 encapsulation | |
121 | --------------------- | |
122 | ||
123 | Since Linux kernel 5.13, the flowtable infrastructure discovers the real | |
124 | netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath | |
125 | parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the | |
126 | VLAN ID / PPPoE session ID which are used for the flowtable lookups. The | |
127 | flowtable datapath also deals with layer 2 decapsulation. | |
128 | ||
129 | You do not need to add the PPPoE and the VLAN devices to your flowtable; | |
130 | the real device is sufficient for the flowtable to track your flows. | |
131 | ||
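For illustration, assume a VLAN device named eth0.100 stacked on top of eth0
(both names are hypothetical). A minimal sketch of the flowtable definition then
lists only the real devices:

```nft
table inet x {
    flowtable f {
        # Only the real devices are listed. The assumed VLAN device
        # eth0.100 on top of eth0 does not need to appear here.
        hook ingress priority 0; devices = { eth0, eth1 };
    }
    chain y {
        type filter hook forward priority 0; policy accept;
        ip protocol tcp flow add @f
    }
}
```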
132 | Bridge and IP forwarding | |
133 | ------------------------ | |
134 | ||
135 | Since Linux kernel 5.13, you can add bridge ports to the flowtable. The | |
136 | flowtable infrastructure discovers the topology behind the bridge device. This | |
137 | allows the flowtable to define a fastpath bypass between the bridge ports | |
138 | (represented as eth1 and eth2 in the example figure below) and the gateway | |
139 | device (represented as eth0) in your switch/router. | |
140 | ||
141 | :: | |
794d9b25 | 142 | |
143490cd PNA |
143 | fastpath bypass |
144 | .-------------------------. | |
145 | / \ | |
146 | | IP forwarding | | |
147 | | / \ \/ | |
148 | | br0 eth0 ..... eth0 | |
149 | . / \ *host B* | |
150 | -> eth1 eth2 | |
151 | . *switch/router* | |
152 | . | |
153 | . | |
154 | eth0 | |
155 | *host A* | |
156 | ||
157 | The flowtable infrastructure also supports bridge VLAN filtering actions | |
158 | such as PVID and untagged. You can also stack a classic VLAN device on top of | |
159 | your bridge port. | |
160 | ||
161 | If you would like your flowtable to define a fastpath between your bridge | |
162 | ports and your IP forwarding path, you have to add your bridge ports (as | |
163 | represented by the real netdevice) to your flowtable definition. | |
164 | ||
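Using the device names from the figure above, a flowtable that covers the bridge
ports (eth1 and eth2) and the gateway device (eth0) might look like the sketch
below. The table, chain and flowtable names are assumptions carried over from
the earlier example.

```nft
table inet x {
    flowtable f {
        # eth1 and eth2 are the bridge ports of br0, eth0 is the gateway
        # device. The bridge device br0 itself is not listed.
        hook ingress priority 0; devices = { eth0, eth1, eth2 };
    }
    chain y {
        type filter hook forward priority 0; policy accept;
        meta l4proto { tcp, udp } flow add @f
    }
}
```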
165 | Counters | |
166 | -------- | |
167 | ||
168 | The flowtable can synchronize packet and byte counters with the existing | |
169 | connection tracking entry by specifying the counter statement in your flowtable | |
170 | definition, e.g. | |
171 | ||
172 | :: | |
794d9b25 | 173 | |
143490cd PNA |
174 | table inet x { |
175 | flowtable f { | |
176 | hook ingress priority 0; devices = { eth0, eth1 }; | |
177 | counter | |
178 | } | |
143490cd PNA |
179 | } |
180 | ||
181 | Counter support is available since Linux kernel 5.7. | |
182 | ||
183 | Hardware offload | |
184 | ---------------- | |
185 | ||
186 | If your network device provides hardware offload support, you can turn it on by | |
187 | means of the 'offload' flag in your flowtable definition, e.g. | |
188 | ||
189 | :: | |
794d9b25 | 190 | |
143490cd PNA |
191 | table inet x { |
192 | flowtable f { | |
193 | hook ingress priority 0; devices = { eth0, eth1 }; | |
194 | flags offload; | |
195 | } | |
143490cd PNA |
196 | } |
197 | ||
198 | There is a workqueue that adds the flows to the hardware. Note that a few | |
199 | packets might still run over the flowtable software path until the workqueue has | |
200 | a chance to offload the flow to the network device. | |
201 | ||
202 | You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when | |
203 | listing your connection tracking table. Note that the two tags are | |
204 | distinct: [OFFLOAD] refers to the software flowtable fastpath, whereas | |
205 | [HW_OFFLOAD] indicates that the flow is using the hardware offload | |
206 | datapath. | |
207 | ||
208 | The flowtable hardware offload infrastructure also supports DSA | |
209 | (Distributed Switch Architecture). | |
210 | ||
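The 'offload' flag and the counter statement can presumably be combined in a
single flowtable definition. The following is a sketch under that assumption,
for a network device with hardware offload support and a kernel and nft version
that provide both features:

```nft
table inet x {
    flowtable f {
        hook ingress priority 0; devices = { eth0, eth1 };
        # Request hardware offload for flows added to this flowtable.
        flags offload;
        # Synchronize packet and byte counters with conntrack entries.
        counter
    }
}
```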
211 | Limitations | |
212 | ----------- | |
213 | ||
214 | The flowtable behaves like a cache. The flowtable entries might get stale if | |
215 | either the destination MAC address or the egress netdevice that is used for | |
216 | transmission changes. | |
217 | ||
218 | This might be a problem if: | |
219 | ||
220 | - You run the flowtable in software mode and you combine bridge and IP | |
221 | forwarding in your setup. | |
222 | - Hardware offload is enabled. | |
223 | ||
19b351f1 PNA |
224 | More reading |
225 | ------------ | |
226 | ||
aa376427 MCC |
227 | This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki |
228 | also made a very complete and comprehensive summary called "A state of network | |
19b351f1 | 229 | acceleration" that describes how things were before this infrastructure was |
64747d5e | 230 | mainlined [3]_ and it also makes a rough summary of this work [4]_. |
19b351f1 | 231 | |
aa376427 MCC |
232 | .. [1] https://lwn.net/Articles/738214/ |
233 | .. [2] https://lwn.net/Articles/742164/ | |
234 | .. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html | |
235 | .. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html |