.. SPDX-License-Identifier: GPL-2.0

====================================
Netfilter's flowtable infrastructure
====================================

This documentation describes the Netfilter flowtable infrastructure, which
allows you to define a fastpath through the flowtable datapath. This
infrastructure also provides hardware offload support. The flowtable supports
the layer 3 protocols IPv4 and IPv6 and the layer 4 protocols TCP and UDP.

Overview
--------

Once the first packet of the flow successfully goes through the IP forwarding
path, from the second packet on, you might decide to offload the flow to the
flowtable through your ruleset. The flowtable infrastructure provides a rule
action that allows you to specify when to add a flow to the flowtable.

A packet that finds a matching entry in the flowtable (i.e. flowtable hit) is
transmitted to the output netdevice via neigh_xmit(), hence packets bypass the
classic IP forwarding path (the visible effect is that you do not see these
packets from any of the Netfilter hooks coming after ingress). If there is no
matching entry in the flowtable (i.e. flowtable miss), the packet follows the
classic IP forwarding path.

The flowtable uses a resizable hashtable. Lookups are based on the following
n-tuple selectors: layer 2 protocol encapsulation (VLAN and PPPoE), layer 3
source and destination addresses, layer 4 source and destination ports and the
input interface (useful in case there are several conntrack zones in place).

The 'flow add' action allows you to populate the flowtable: the user
selectively specifies which flows are placed into the flowtable. Hence, packets
follow the classic IP forwarding path unless the user explicitly instructs
flows to use this new alternative forwarding path via policy.

The flowtable datapath is represented in Fig.1, which describes the classic IP
forwarding path including the Netfilter hooks and the flowtable fastpath bypass.

::

                                         userspace process
                                          ^              |
                                          |              |
                                     _____|____     ____\/___
                                    /          \   /         \
                                    |   input   |  |  output  |
                                    \__________/   \_________/
                                         ^               |
                                         |               |
      _________      __________      ---------     _____\/_____
     /         \    /          \     |Routing |   /            \
  -->  ingress  ---> prerouting ---> |decision|   | postrouting |--> neigh_xmit
     \_________/    \__________/     ----------   \____________/          ^
       |      ^                          |               ^                |
   flowtable  |                     ____\/___            |                |
       |      |                    /         \           |                |
    __\/___   |                    | forward |------------                |
    |-----|   |                    \_________/                            |
    |-----|   |                 'flow offload' rule                       |
    |-----|   |                   adds entry to                           |
    |_____|   |                     flowtable                             |
       |      |                                                           |
      / \     |                                                           |
     /hit\_no_|                                                           |
     \ ? /                                                                |
      \ /                                                                 |
       |__yes_________________fastpath bypass ____________________________|

               Fig.1 Netfilter hooks and flowtable interactions

The flowtable entry also stores the NAT configuration, so all packets are
mangled according to the NAT policy that is specified from the classic IP
forwarding path. The TTL is decremented before calling neigh_xmit(). Fragmented
traffic is passed up to follow the classic IP forwarding path: given that the
transport header is missing, flowtable lookups are not possible in this case.
TCP RST and FIN packets are also passed up to the classic IP forwarding path to
release the flow gracefully. Packets that exceed the MTU are also passed up to
the classic forwarding path to report packet-too-big ICMP errors to the sender.

Example configuration
---------------------

Enabling the flowtable bypass is relatively easy: you only need to create a
flowtable and add one rule to your forward chain::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
        }
        chain y {
            type filter hook forward priority 0; policy accept;
            ip protocol tcp flow add @f
            counter packets 0 bytes 0
        }
    }

This example adds the flowtable 'f' to the ingress hook of the eth0 and eth1
netdevices. You can create as many flowtables as you want in case you need to
perform resource partitioning. The flowtable priority defines the order in which
hooks are run in the pipeline. This is convenient in case you already have an
nftables ingress chain: make sure the flowtable priority is lower than that of
the nftables ingress chain, so that the flowtable runs first in the pipeline.
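
As a minimal sketch of this priority ordering, the following ruleset places the
flowtable at priority 0 and an ingress chain at priority 10, so the flowtable
lookup runs first. The table name 'n', the chain name 'c' and the priority value
10 are assumptions made for illustration only::

    table inet x {
        flowtable f {
            # Lower priority value than the ingress chain below: runs first.
            hook ingress priority 0; devices = { eth0, eth1 };
        }
    }
    table netdev n {
        chain c {
            # Hypothetical ingress chain on eth0, evaluated after the flowtable.
            type filter hook ingress device eth0 priority 10; policy accept;
        }
    }

With this ordering, packets that hit the flowtable are transmitted before chain
'c' is evaluated, so the chain only sees packets that follow the classic path.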

The 'flow add' action from the forward chain 'y' adds an entry to the
flowtable for the TCP syn-ack packet coming in the reply direction. Once the
flow is offloaded, you will observe that the counter rule in the example above
does not get updated for the packets that are being forwarded through the
forwarding bypass.
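
Because offload is driven by policy, ordinary rules placed before the 'flow add'
rule can keep selected flows on the classic forwarding path. The sketch below
reuses the flowtable 'f' from the example above and, purely as an illustrative
assumption, keeps SSH (TCP port 22) out of the flowtable while offloading all
other TCP and UDP flows::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
        }
        chain y {
            type filter hook forward priority 0; policy accept;
            # Accept SSH in both directions before the flow add rule is
            # reached, so these flows are not placed into the flowtable.
            tcp dport 22 accept
            tcp sport 22 accept
            # Offload the remaining TCP and UDP flows (IPv4 and IPv6).
            meta l4proto { tcp, udp } flow add @f
        }
    }

Both port matches are needed because the 'flow add' rule is evaluated for
packets in either direction of a flow.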

You can identify offloaded flows through the [OFFLOAD] tag when listing your
connection tracking table.

::

    # conntrack -L
    tcp 6 src=10.141.10.2 dst=192.168.10.2 sport=52728 dport=5201 src=192.168.10.2 dst=192.168.10.1 sport=5201 dport=52728 [OFFLOAD] mark=0 use=2


Layer 2 encapsulation
---------------------

Since Linux kernel 5.13, the flowtable infrastructure discovers the real
netdevice behind VLAN and PPPoE netdevices. The flowtable software datapath
parses the VLAN and PPPoE layer 2 headers to extract the ethertype and the
VLAN ID / PPPoE session ID, which are used for the flowtable lookups. The
flowtable datapath also deals with layer 2 decapsulation.

You do not need to add the PPPoE and the VLAN devices to your flowtable;
the real device is sufficient for the flowtable to track your flows.
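
As an illustration, assume a hypothetical VLAN device eth0.10 stacked on top of
eth0 (the device name and the VLAN ID are assumptions made for this sketch). The
flowtable definition then lists only the real devices::

    table inet x {
        flowtable f {
            # eth0.10 is not listed: the real device eth0 is sufficient.
            hook ingress priority 0; devices = { eth0, eth1 };
        }
        chain y {
            type filter hook forward priority 0; policy accept;
            ip protocol tcp flow add @f
        }
    }

On kernels with the layer 2 encapsulation support described above, TCP flows
forwarded between eth0.10 and eth1 are then candidates for the fastpath.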

Bridge and IP forwarding
------------------------

Since Linux kernel 5.13, you can add bridge ports to the flowtable. The
flowtable infrastructure discovers the topology behind the bridge device. This
allows the flowtable to define a fastpath bypass between the bridge ports
(represented as eth1 and eth2 in the example figure below) and the gateway
device (represented as eth0) in your switch/router.

::

                      fastpath bypass
               .-------------------------.
              /                           \
              |           IP forwarding   |
              |          /             \ \/
              |       br0               eth0 ..... eth0
              .       / \                         *host B*
               -> eth1  eth2
              .           *switch/router*
              .
              .
            eth0
          *host A*

The flowtable infrastructure also supports bridge VLAN filtering actions
such as PVID and untagged. You can also stack a classic VLAN device on top of
your bridge port.

If you want your flowtable to define a fastpath between your bridge ports and
your IP forwarding path, you have to add your bridge ports (as represented by
the real netdevice) to your flowtable definition.
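
A minimal sketch for the topology in the figure above, assuming that eth1 and
eth2 are ports of the bridge br0 and that eth0 is the gateway device. The table,
flowtable and chain names are reused from the earlier example for illustration
only::

    table inet x {
        flowtable f {
            # The bridge ports eth1 and eth2 are listed, not br0 itself.
            hook ingress priority 0; devices = { eth0, eth1, eth2 };
        }
        chain y {
            type filter hook forward priority 0; policy accept;
            ip protocol tcp flow add @f
        }
    }

Following the text above, the bridge device br0 does not appear in the devices
list, since the flowtable works on the real netdevices behind it.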

Counters
--------

The flowtable can synchronize packet and byte counters with the existing
connection tracking entry by specifying the counter statement in your flowtable
definition, e.g.

::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
            counter
        }
    }

Counter support is available since Linux kernel 5.7.

Hardware offload
----------------

If your network device provides hardware offload support, you can turn it on by
means of the 'offload' flag in your flowtable definition, e.g.

::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
            flags offload;
        }
    }

There is a workqueue that adds the flows to the hardware. Note that a few
packets might still run over the flowtable software path until the workqueue has
a chance to offload the flow to the network device.

You can identify hardware offloaded flows through the [HW_OFFLOAD] tag when
listing your connection tracking table. Note that the [OFFLOAD] tag refers to
the software flowtable fastpath, whereas [HW_OFFLOAD] refers to the hardware
offload datapath being used by the flow.
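
The sketch below combines the 'offload' flag with the counter statement in one
flowtable definition. It assumes that your kernel, your nft version and your
network driver support both options; it is an illustration, not a tested
configuration::

    table inet x {
        flowtable f {
            hook ingress priority 0; devices = { eth0, eth1 };
            # Ask the driver to offload flows to the hardware.
            flags offload;
            # Keep conntrack packet and byte counters in sync.
            counter
        }
    }

Flows added to this flowtable are expected to show the [HW_OFFLOAD] tag once the
workqueue has pushed them to the network device.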

The flowtable hardware offload infrastructure also supports DSA
(Distributed Switch Architecture).

Limitations
-----------

The flowtable behaves like a cache. The flowtable entries might get stale if
either the destination MAC address or the egress netdevice that is used for
transmission changes.

This might be a problem if:

- You run the flowtable in software mode and you combine bridge and IP
  forwarding in your setup.
- Hardware offload is enabled.

More reading
------------

This documentation is based on the LWN.net articles [1]_\ [2]_. Rafal Milecki
also wrote a comprehensive summary called "A state of network acceleration"
that describes how things were before this infrastructure was mainlined [3]_
and gives a rough summary of this work [4]_.

.. [1] https://lwn.net/Articles/738214/
.. [2] https://lwn.net/Articles/742164/
.. [3] http://lists.infradead.org/pipermail/lede-dev/2018-January/010830.html
.. [4] http://lists.infradead.org/pipermail/lede-dev/2018-January/010829.html