Commit | Line | Data |
---|---|---|
63893472 MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ============================================= | |
ccb1352e JG |
4 | Open vSwitch datapath developer documentation |
5 | ============================================= | |
6 | ||
7 | The Open vSwitch kernel module allows flexible userspace control over | |
8 | flow-level packet processing on selected network devices. It can be | |
9 | used to implement a plain Ethernet switch, network device bonding, | |
10 | VLAN processing, network access control, flow-based network control, | |
11 | and so on. | |
12 | ||
13 | The kernel module implements multiple "datapaths" (analogous to | |
14 | bridges), each of which can have multiple "vports" (analogous to ports | |
15 | within a bridge). Each datapath also has associated with it a "flow | |
16 | table" that userspace populates with "flows" that map from keys based | |
17 | on packet headers and metadata to sets of actions. The most common | |
18 | action forwards the packet to another vport; other actions are also | |
19 | implemented. | |
20 | ||
21 | When a packet arrives on a vport, the kernel module processes it by | |
22 | extracting its flow key and looking it up in the flow table. If there | |
23 | is a matching flow, it executes the associated actions. If there is | |
24 | no match, it queues the packet to userspace for processing (as part of | |
25 | its processing, userspace will likely set up a flow to handle further | |
26 | packets of the same type entirely in-kernel). | |
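The receive path described above can be sketched as a toy model. This is a minimal illustration, not the kernel module's code: the `Datapath` class, the dict-based packet, and the field names `eth_src`, `eth_dst`, and `eth_type` are all hypothetical stand-ins chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a flow key is a hashable tuple of (attribute, value)
# pairs, and an action is a callable that receives the packet.
FlowKey = Tuple[Tuple[str, object], ...]
Action = Callable[[dict], None]


class Datapath:
    """Toy model of one datapath: a flow table plus a queue to userspace."""

    def __init__(self) -> None:
        self.flow_table: Dict[FlowKey, List[Action]] = {}
        self.upcalls: List[Tuple[FlowKey, dict]] = []

    def extract_key(self, packet: dict, in_port: int) -> FlowKey:
        # Metadata (in_port) plus whichever header fields the packet carries.
        fields = [("in_port", in_port)]
        for name in ("eth_src", "eth_dst", "eth_type"):
            if name in packet:
                fields.append((name, packet[name]))
        return tuple(fields)

    def receive(self, packet: dict, in_port: int) -> str:
        key = self.extract_key(packet, in_port)
        actions = self.flow_table.get(key)
        if actions is None:
            # Miss: queue the packet and its parsed key for userspace.
            self.upcalls.append((key, packet))
            return "upcall"
        # Hit: run every action associated with the matching flow.
        for action in actions:
            action(packet)
        return "hit"
```

In this model, userspace would drain `upcalls` and populate `flow_table` with the kernel-provided key, so that later packets of the same type are handled by `receive` without another upcall.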
27 | ||
28 | ||
29 | Flow key compatibility | |
30 | ---------------------- | |
31 | ||
32 | Network protocols evolve over time. New protocols become important | |
33 | and existing protocols lose their prominence. For the Open vSwitch | |
34 | kernel module to remain relevant, it must be possible for newer | |
35 | versions to parse additional protocols as part of the flow key. It | |
36 | might even be desirable, someday, to drop support for parsing | |
37 | protocols that have become obsolete. Therefore, the Netlink interface | |
38 | to Open vSwitch is designed to allow carefully written userspace | |
39 | applications to work with any version of the flow key, past or future. | |
40 | ||
41 | To support this forward and backward compatibility, whenever the | |
42 | kernel module passes a packet to userspace, it also passes along the | |
43 | flow key that it parsed from the packet. Userspace then extracts its | |
44 | own notion of a flow key from the packet and compares it against the | |
45 | kernel-provided version: | |
46 | ||
47 | - If userspace's notion of the flow key for the packet matches the | |
48 | kernel's, then nothing special is necessary. | |
49 | ||
50 | - If the kernel's flow key includes more fields than the userspace | |
51 | version of the flow key, for example if the kernel decoded IPv6 | |
52 | headers but userspace stopped at the Ethernet type (because it | |
53 | does not understand IPv6), then again nothing special is | |
54 | necessary. Userspace can still set up a flow in the usual way, | |
55 | as long as it uses the kernel-provided flow key to do it. | |
56 | ||
57 | - If the userspace flow key includes more fields than the | |
58 | kernel's, for example if userspace decoded an IPv6 header but | |
59 | the kernel stopped at the Ethernet type, then userspace can | |
60 | forward the packet manually, without setting up a flow in the | |
61 | kernel. This case is bad for performance because every packet | |
62 | that the kernel considers part of the flow must go to userspace, | |
63 | but the forwarding behavior is correct. (If userspace can | |
64 | determine that the values of the extra fields would not affect | |
65 | forwarding behavior, then it could set up a flow anyway.) | |
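The three cases above can be sketched as a small decision function. This is an illustration under simplifying assumptions: flow keys are modeled as dicts keyed by attribute name, and the function name, return strings, and the `extras_affect_forwarding` parameter are hypothetical.

```python
from typing import Dict


def choose_handling(kernel_key: Dict[str, object],
                    user_key: Dict[str, object],
                    extras_affect_forwarding: bool = True) -> str:
    """Decide how userspace handles a packet the kernel passed up."""
    extra_in_user = set(user_key) - set(kernel_key)
    if not extra_in_user:
        # Keys match, or the kernel parsed deeper than userspace did.
        # Either way a flow can be set up with the kernel-provided key.
        return "install_flow_with_kernel_key"
    if not extras_affect_forwarding:
        # Userspace parsed deeper, but the extra fields do not matter.
        return "install_flow_with_kernel_key"
    # Userspace parsed deeper and the extra fields matter: correct but slow.
    return "forward_manually"
```

For example, a kernel key holding only `eth` and `eth_type` against a userspace key that also holds `ipv6` yields `"forward_manually"` unless the caller knows the IPv6 fields do not influence forwarding.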
66 | ||
67 | How flow keys evolve over time is important to making this work, so | |
68 | the following sections go into detail. | |
69 | ||
70 | ||
71 | Flow key format | |
72 | --------------- | |
73 | ||
74 | A flow key is passed over a Netlink socket as a sequence of Netlink | |
75 | attributes. Some attributes represent packet metadata, defined as any | |
76 | information about a packet that cannot be extracted from the packet | |
77 | itself, e.g. the vport on which the packet was received. Most | |
78 | attributes, however, are extracted from headers within the packet, | |
79 | e.g. source and destination addresses from Ethernet, IP, or TCP | |
80 | headers. | |
81 | ||
82 | The <linux/openvswitch.h> header file defines the exact format of the | |
83 | flow key attributes. For informal explanatory purposes here, we write | |
84 | them as comma-separated strings, with parentheses indicating arguments | |
85 | and nesting. For example, the following could represent a flow key | |
63893472 | 86 | corresponding to a TCP packet that arrived on vport 1:: |
ccb1352e JG |
87 | |
88 | in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4), | |
89 | eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=6, tos=0, | |
90 | frag=no), tcp(src=49163, dst=80) | |
91 | ||
63893472 | 92 | Often we ellipsize arguments not important to the discussion, e.g.:: |
ccb1352e JG |
93 | |
94 | in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...) | |
95 | ||
96 | ||
03f0d916 AZ |
97 | Wildcarded flow key format |
98 | -------------------------- | |
99 | ||
100 | A wildcarded flow is described with two sequences of Netlink attributes | |
101 | passed over the Netlink socket: a flow key, exactly as described above, and an | |
102 | optional corresponding flow mask. | |
103 | ||
104 | A wildcarded flow can represent a group of exact match flows. Each '1' bit | |
105 | in the mask specifies an exact match with the corresponding bit in the flow key. | |
106 | A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit | |
107 | of an incoming packet. Using wildcarded flows can improve the flow setup rate | |
108 | by reducing the number of new flows that the user space program must process. | |
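The mask semantics above amount to comparing only the bits where the mask is '1'. The following is a minimal sketch for a single 32-bit field; the helper names `ip_bits` and `masked_match` are hypothetical, and a real flow key spans many fields rather than one integer.

```python
import ipaddress


def ip_bits(dotted: str) -> int:
    """Convert a dotted-quad IPv4 address to its 32-bit integer value."""
    return int(ipaddress.IPv4Address(dotted))


def masked_match(packet_bits: int, key_bits: int, mask_bits: int) -> bool:
    """True if the packet agrees with the key on every '1' bit of the mask."""
    return (packet_bits & mask_bits) == (key_bits & mask_bits)
```

With a mask of `0xFFFF0000`, a key of 172.16.0.20 matches any packet address in 172.16.0.0/16, so one wildcarded flow stands in for up to 65536 exact-match flows on that field.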
109 | ||
110 | Support for the mask Netlink attribute is optional for both the kernel and user | |
111 | space program. The kernel can ignore the mask attribute, installing an exact | |
112 | match flow, or reduce the number of don't care bits in the kernel to fewer than | |
113 | what was specified by the user space program. In this case, variations in bits | |
114 | that the kernel does not implement will simply result in additional flow setups. | |
115 | The kernel module will also work with user space programs that neither support | |
116 | nor supply flow mask attributes. | |
117 | ||
118 | Since the kernel may ignore or modify wildcard bits, it can be difficult for | |
119 | the userspace program to know exactly what matches are installed. There are | |
120 | two possible approaches: reactively install flows as they miss the kernel | |
121 | flow table (and therefore not attempt to determine wildcard changes at all) | |
122 | or use the kernel's response messages to determine the installed wildcards. | |
123 | ||
124 | When interacting with userspace, the kernel should maintain the match portion | |
125 | of the key exactly as originally installed. This provides a handle to | |
126 | identify the flow for all future operations. However, when reporting the | |
127 | mask of an installed flow, the mask should include any restrictions imposed | |
128 | by the kernel. | |
129 | ||
130 | The behavior when using overlapping wildcarded flows is undefined. It is the | |
131 | responsibility of the user space program to ensure that any incoming packet | |
132 | can match at most one flow, wildcarded or not. The current implementation | |
133 | performs best-effort detection of overlapping wildcarded flows and may reject | |
134 | some but not all of them. However, this behavior may change in future versions. | |
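One way a user space program might check its own flows for overlap before installing them is sketched below. It assumes each flow is reduced to a single integer key and mask, and it is not a description of the kernel's best-effort detection. Two flows overlap exactly when they agree on every bit that both masks care about, because a packet can then be constructed that matches both.

```python
def flows_overlap(key_a: int, mask_a: int, key_b: int, mask_b: int) -> bool:
    """True if some packet could match both wildcarded flows."""
    common = mask_a & mask_b  # bits that both flows require to match
    return (key_a & common) == (key_b & common)
```

A flow with an all-zero mask overlaps every other flow under this test, which is consistent with it matching every packet.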
135 | ||
136 | ||
74ed7ab9 JS |
137 | Unique flow identifiers |
138 | ----------------------- | |
139 | ||
140 | An alternative to using the original match portion of a key as the handle for | |
141 | flow identification is a unique flow identifier, or "UFID". UFIDs are optional | |
142 | for both the kernel and user space program. | |
143 | ||
144 | User space programs that support UFID are expected to provide it during flow | |
145 | setup in addition to the flow, then refer to the flow using the UFID for all | |
146 | future operations. The kernel is not required to index flows by the original | |
147 | flow key if a UFID is specified. | |
148 | ||
149 | ||
ccb1352e JG |
150 | Basic rule for evolving flow keys |
151 | --------------------------------- | |
152 | ||
153 | Some care is needed to really maintain forward and backward | |
154 | compatibility for applications that follow the rules listed under | |
155 | "Flow key compatibility" above. | |
156 | ||
63893472 | 157 | The basic rule is obvious:: |
ccb1352e | 158 | |
63893472 | 159 | ================================================================== |
ccb1352e JG |
160 | New network protocol support must only supplement existing flow |
161 | key attributes. It must not change the meaning of already defined | |
162 | flow key attributes. | |
63893472 | 163 | ================================================================== |
ccb1352e JG |
164 | |
165 | This rule does have less-obvious consequences so it is worth working | |
166 | through a few examples. Suppose, for example, that the kernel module | |
167 | did not already implement VLAN parsing. Instead, it just interpreted | |
168 | the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the | |
169 | packet. The flow key for any packet with an 802.1Q header would look | |
63893472 | 170 | essentially like this, ignoring metadata:: |
ccb1352e JG |
171 | |
172 | eth(...), eth_type(0x8100) | |
173 | ||
174 | Naively, to add VLAN support, it makes sense to add a new "vlan" flow | |
175 | key attribute to contain the VLAN tag, then continue to decode the | |
176 | encapsulated headers beyond the VLAN tag using the existing field | |
efaac3bf | 177 | definitions. With this change, a TCP packet in VLAN 10 would have a |
63893472 | 178 | flow key much like this:: |
ccb1352e JG |
179 | |
180 | eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...) | |
181 | ||
182 | But this change would negatively affect a userspace application that | |
183 | has not been updated to understand the new "vlan" flow key attribute. | |
184 | The application could, following the flow compatibility rules above, | |
185 | ignore the "vlan" attribute that it does not understand and therefore | |
186 | assume that the flow contained IP packets. This is a bad assumption | |
187 | (the flow only contains IP packets if one parses and skips over the | |
188 | 802.1Q header) and it could cause the application's behavior to change | |
189 | across kernel versions even though it follows the compatibility rules. | |
190 | ||
191 | The solution is to use a set of nested attributes. This is, for | |
192 | example, why 802.1Q support uses nested attributes. A TCP packet in | |
63893472 | 193 | VLAN 10 is actually expressed as:: |
ccb1352e JG |
194 | |
195 | eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800), | |
196 | ip(proto=6, ...), tcp(...)) | |
197 | ||
198 | Notice how the "eth_type", "ip", and "tcp" flow key attributes are | |
199 | nested inside the "encap" attribute. Thus, an application that does | |
200 | not understand the "vlan" key will not see any of those attributes | |
201 | and therefore will not misinterpret them. (Also, the outer eth_type | |
202 | is still 0x8100, not changed to 0x0800.) | |
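The difference between the flat and nested layouts can be illustrated with a toy application that only reads the top-level attributes it knows. Everything here is hypothetical: flow keys are modeled as dicts, elided arguments are shown as the string "...", and `OLD_APP_KNOWN` represents an application written before "vlan" and "encap" existed.

```python
from typing import Dict, Set


def visible_attrs(flow_key: Dict[str, object], known: Set[str]) -> Dict[str, object]:
    """Return only the top-level attributes that an application understands."""
    return {name: value for name, value in flow_key.items() if name in known}


# Attributes known to an application that predates VLAN support.
OLD_APP_KNOWN = {"eth", "eth_type", "ip", "tcp"}

# The naive layout: "vlan" added alongside the existing attributes.
FLAT_KEY = {
    "eth": "...",
    "vlan": {"vid": 10, "pcp": 0},
    "eth_type": 0x0800,
    "ip": {"proto": 6},
    "tcp": "...",
}

# The nested layout: inner headers hidden inside "encap".
NESTED_KEY = {
    "eth": "...",
    "eth_type": 0x8100,
    "vlan": {"vid": 10, "pcp": 0},
    "encap": {"eth_type": 0x0800, "ip": {"proto": 6}, "tcp": "..."},
}
```

Applied to `FLAT_KEY`, the old application sees `eth_type` 0x0800 together with `ip` and `tcp` and wrongly concludes the flow is plain IP. Applied to `NESTED_KEY`, it sees only `eth` and `eth_type` 0x8100, which is what it would have seen before VLAN parsing was added.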
203 | ||
204 | Handling malformed packets | |
205 | -------------------------- | |
206 | ||
207 | Don't drop packets in the kernel for malformed protocol headers, bad | |
208 | checksums, etc. This would prevent userspace from implementing a | |
209 | simple Ethernet switch that forwards every packet. | |
210 | ||
211 | Instead, in such a case, include an attribute with "empty" content. | |
212 | It doesn't matter if the empty content could be valid protocol values, | |
213 | as long as those values are rarely seen in practice, because userspace | |
214 | can always have all packets with those values sent to userspace and | |
215 | handle them individually. | |
216 | ||
217 | For example, consider a packet that contains an IP header that | |
218 | indicates protocol 6 for TCP, but which is truncated just after the IP | |
219 | header, so that the TCP header is missing. The flow key for this | |
220 | packet would include a tcp attribute with all-zero src and dst, like | |
63893472 | 221 | this:: |
ccb1352e JG |
222 | |
223 | eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0) | |
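The truncated-TCP case can be sketched as follows. This is a simplified illustration, not the kernel's parser: the function name `tcp_attribute` is hypothetical, and the sketch treats the header as missing only when the payload is too short to hold the two 16-bit port fields, whereas a real implementation may apply a stricter length check.

```python
import struct
from typing import Dict


def tcp_attribute(ip_payload: bytes) -> Dict[str, int]:
    """Build a tcp(src, dst) attribute from the bytes after the IP header."""
    if len(ip_payload) < 4:
        # Truncated: do not drop the packet, report "empty" ports instead.
        return {"src": 0, "dst": 0}
    # Source and destination ports are the first two big-endian 16-bit fields.
    src, dst = struct.unpack("!HH", ip_payload[:4])
    return {"src": src, "dst": dst}
```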
224 | ||
225 | As another example, consider a packet with an Ethernet type of 0x8100, | |
226 | indicating that a VLAN TCI should follow, but which is truncated just | |
227 | after the Ethernet type. The flow key for this packet would include | |
63893472 | 228 | an all-zero-bits vlan and an empty encap attribute, like this:: |
ccb1352e JG |
229 | |
230 | eth(...), eth_type(0x8100), vlan(0), encap() | |
231 | ||
232 | Unlike a TCP packet with source and destination ports 0, an | |
233 | all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka | |
234 | VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan | |
235 | attribute expressly to allow this situation to be distinguished. | |
236 | Thus, the flow key in this second example unambiguously indicates a | |
237 | missing or malformed VLAN TCI. | |
238 | ||
239 | Other rules | |
240 | ----------- | |
241 | ||
242 | The other rules for flow keys are much less subtle: | |
243 | ||
244 | - Duplicate attributes are not allowed at a given nesting level. | |
245 | ||
246 | - Ordering of attributes is not significant. | |
247 | ||
248 | - When the kernel sends a given flow key to userspace, it always | |
249 | composes it the same way. This allows userspace to hash and | |
250 | compare entire flow keys that it may not be able to fully | |
251 | interpret. |
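Because the kernel always composes a given flow key the same way, userspace can treat the raw attribute bytes as an opaque handle. The sketch below assumes the key is available as a `bytes` object; the `FlowCache` class, its methods, and the choice of SHA-256 are hypothetical and only illustrate hashing and comparing keys without interpreting them.

```python
import hashlib
from typing import Dict, Optional


def key_digest(raw_flow_key: bytes) -> str:
    """Hash the flow key bytes exactly as received, without parsing them."""
    return hashlib.sha256(raw_flow_key).hexdigest()


class FlowCache:
    """Toy userspace cache indexed by the digest of an opaque flow key."""

    def __init__(self) -> None:
        self._by_digest: Dict[str, str] = {}

    def remember(self, raw_flow_key: bytes, note: str) -> None:
        self._by_digest[key_digest(raw_flow_key)] = note

    def lookup(self, raw_flow_key: bytes) -> Optional[str]:
        return self._by_digest.get(key_digest(raw_flow_key))
```

This approach would not work if the kernel could emit the same flow key with attributes in varying order, which is why the consistent-composition rule matters even though attribute ordering is otherwise not significant.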