Commit | Line | Data |
---|---|---|
b4b8faa1 MK |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ====== | |
4 | AF_XDP | |
5 | ====== | |
6 | ||
7 | Overview | |
8 | ======== | |
9 | ||
10 | AF_XDP is an address family that is optimized for high performance | |
11 | packet processing. | |
12 | ||
13 | This document assumes that the reader is familiar with BPF and XDP. If | |
14 | not, the Cilium project has an excellent reference guide at | |
bbff2f32 | 15 | http://cilium.readthedocs.io/en/latest/bpf/. |
b4b8faa1 MK |
16 | |
17 | Using the XDP_REDIRECT action from an XDP program, the program can | |
18 | redirect ingress frames to other XDP-enabled netdevs, using the | |
19 | bpf_redirect_map() function. AF_XDP sockets make it possible for | |
20 | XDP programs to redirect frames to a memory buffer in a user-space | |
21 | application. | |
22 | ||
23 | An AF_XDP socket (XSK) is created with the normal socket() | |
24 | syscall. Associated with each XSK are two rings: the RX ring and the | |
25 | TX ring. A socket can receive packets on the RX ring and it can send | |
26 | packets on the TX ring. These rings are registered and sized with the | |
27 | setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory | |
28 | to have at least one of these rings for each socket. An RX or TX | |
29 | descriptor ring points to a data buffer in a memory area called a | |
30 | UMEM. RX and TX can share the same UMEM so that a packet does not have | |
31 | to be copied between RX and TX. Moreover, if a packet needs to be kept | |
32 | for a while due to a possible retransmit, the descriptor that points | |
33 | to that packet can be changed to point to another and reused right | |
34 | away. This again avoids copying data. | |
35 | ||
bbff2f32 BT |
36 | The UMEM consists of a number of equally sized chunks. A descriptor in |
37 | one of the rings references a frame by referencing its addr. The addr | |
38 | is simply an offset within the entire UMEM region. The user space | |
39 | allocates memory for this UMEM using whatever means it feels is most | |
40 | appropriate (malloc, mmap, huge pages, etc). This memory area is then | |
41 | registered with the kernel using the new setsockopt XDP_UMEM_REG. The | |
42 | UMEM also has two rings: the FILL ring and the COMPLETION ring. The | |
43 | fill ring is used by the application to send down addrs for the kernel | |
44 | to fill in with RX packet data. References to these frames will then | |
45 | appear in the RX ring once each packet has been received. The | |
46 | completion ring, on the other hand, contains frame addrs that the | |
47 | kernel has transmitted completely and can now be used again by user | |
48 | space, for either TX or RX. Thus, the frame addrs appearing in the | |
49 | completion ring are addrs that were previously transmitted using the | |
50 | TX ring. In summary, the RX and FILL rings are used for the RX path | |
51 | and the TX and COMPLETION rings are used for the TX path. | |
b4b8faa1 MK |
52 | |
53 | The socket is then finally bound with a bind() call to a device and a | |
54 | specific queue id on that device, and it is not until bind is | |
55 | completed that traffic starts to flow. | |
56 | ||
57 | The UMEM can be shared between processes, if desired. If a process | |
58 | wants to do this, it simply skips the registration of the UMEM and its | |
59 | corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind | |
60 | call and submits the XSK of the process it would like to share UMEM | |
61 | with as well as its own newly created XSK socket. The new process will | |
bbff2f32 BT |
62 | then receive frame addr references in its own RX ring that point to |
63 | this shared UMEM. Note that since the ring structures are | |
64 | single-consumer / single-producer (for performance reasons), the new | |
65 | process has to create its own socket with associated RX and TX rings, | |
66 | since it cannot share this with the other process. This is also the | |
67 | reason that there is only one set of FILL and COMPLETION rings per | |
68 | UMEM. It is the responsibility of a single process to handle the UMEM. | |
b4b8faa1 MK |
69 | |
70 | How are packets then distributed from an XDP program to the XSKs? There | |
71 | is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The | |
72 | user-space application can place an XSK at an arbitrary place in this | |
73 | map. The XDP program can then redirect a packet to a specific index in | |
74 | this map and at this point XDP validates that the XSK in that map was | |
75 | indeed bound to that device and ring number. If not, the packet is | |
76 | dropped. If the map is empty at that index, the packet is also | |
77 | dropped. This also means that it is currently mandatory to have an XDP | |
78 | program loaded (and one XSK in the XSKMAP) to be able to get any | |
79 | traffic to user space through the XSK. | |
80 | ||
81 | AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the | |
82 | driver does not have support for XDP, or XDP_SKB is explicitly chosen | |
83 | when loading the XDP program, XDP_SKB mode is employed that uses SKBs | |
84 | together with the generic XDP support and copies out the data to user | |
85 | space. This is a fallback mode that works for any network device. On the other | |
86 | hand, if the driver has support for XDP, it will be used by the AF_XDP | |
87 | code to provide better performance, but there is still a copy of the | |
88 | data into user space. | |
89 | ||
90 | Concepts | |
91 | ======== | |
92 | ||
93 | In order to use an AF_XDP socket, a number of associated objects need | |
94 | to be set up. | |
95 | ||
96 | Jonathan Corbet has also written an excellent article on LWN, | |
97 | "Accelerating networking with AF_XDP". It can be found at | |
98 | https://lwn.net/Articles/750845/. | |
99 | ||
100 | UMEM | |
101 | ---- | |
102 | ||
103 | UMEM is a region of virtually contiguous memory, divided into | |
104 | equal-sized frames. A UMEM is associated with a netdev and a specific | |
bbff2f32 BT |
105 | queue id of that netdev. It is created and configured (chunk size, |
106 | headroom, start address and size) by using the XDP_UMEM_REG setsockopt | |
107 | system call. A UMEM is bound to a netdev and queue id, via the bind() | |
108 | system call. | |
b4b8faa1 MK |
109 | |
110 | An AF_XDP socket is linked to a single UMEM, but one UMEM can have | |
111 | multiple AF_XDP sockets. To share a UMEM created via one socket A, | |
112 | the next socket B can do this by setting the XDP_SHARED_UMEM flag in | |
113 | struct sockaddr_xdp member sxdp_flags, and passing the file descriptor | |
114 | of A to struct sockaddr_xdp member sxdp_shared_umem_fd. | |
115 | ||
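As a minimal sketch of the sharing step described above, the bind address for socket B could be prepared as follows. The helper name xsk_fill_shared_addr and its parameters are hypothetical and only illustrate which struct sockaddr_xdp members are involved; the fallback value for AF_XDP is an assumption for older libc headers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44	/* assumption: kernel value, for libc headers lacking it */
#endif

/*
 * Hypothetical helper: fill in the address that socket B passes to bind()
 * so that it shares the UMEM already registered via socket A.
 */
void xsk_fill_shared_addr(struct sockaddr_xdp *sxdp, unsigned int ifindex,
			  unsigned int queue_id, int fd_of_a)
{
	assert(sxdp != NULL);

	memset(sxdp, 0, sizeof(*sxdp));
	sxdp->sxdp_family = AF_XDP;
	sxdp->sxdp_ifindex = ifindex;
	sxdp->sxdp_queue_id = queue_id;
	/* Ask the kernel to reuse A's UMEM instead of registering a new one. */
	sxdp->sxdp_flags = XDP_SHARED_UMEM;
	sxdp->sxdp_shared_umem_fd = fd_of_a;
}
```

Socket B would then pass this address to bind(), cast to struct sockaddr *, together with sizeof(struct sockaddr_xdp).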
116 | The UMEM has two single-producer/single-consumer rings that are used | |
117 | to transfer ownership of UMEM frames between the kernel and the | |
118 | user-space application. | |
119 | ||
120 | Rings | |
121 | ----- | |
122 | ||
123 | There are four different kinds of rings: Fill, Completion, RX and | |
124 | TX. All rings are single-producer/single-consumer, so the user-space | |
125 | application needs explicit synchronization if multiple | |
126 | processes/threads are reading/writing to them. | |
127 | ||
128 | The UMEM uses two rings: Fill and Completion. Each socket associated | |
129 | with the UMEM must have an RX queue, TX queue or both. Say that there | |
130 | is a setup with four sockets (all doing TX and RX). Then there will be | |
131 | one Fill ring, one Completion ring, four TX rings and four RX rings. | |
132 | ||
133 | The rings are head(producer)/tail(consumer) based rings. A producer | |
134 | writes to the data ring at the index pointed to by the struct xdp_ring | |
135 | producer member, and then increases the producer index. A consumer reads | |
136 | the data ring at the index pointed to by the struct xdp_ring consumer | |
137 | member, and then increases the consumer index. | |
138 | ||
139 | The rings are configured and created via the _RING setsockopt system | |
140 | calls and mmapped to user-space using the appropriate offset to mmap() | |
141 | (XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and | |
142 | XDP_UMEM_PGOFF_COMPLETION_RING). | |
143 | ||
144 | The size of each ring needs to be a power of two. | |
145 | ||
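The power-of-two requirement makes the index arithmetic cheap. A minimal sketch, assuming free-running 32-bit producer and consumer counters as in the naive code under Usage; the helper names are hypothetical and not part of any kernel or library API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: non-zero if size is a valid (power of two) ring size. */
int ring_size_ok(uint32_t size)
{
	return size != 0 && (size & (size - 1)) == 0;
}

/*
 * Number of entries currently in the ring. Unsigned subtraction gives the
 * right answer even after the 32-bit counters have wrapped around.
 */
uint32_t ring_entries(uint32_t producer, uint32_t consumer)
{
	return producer - consumer;
}

/* Map a free-running counter to a slot in the ring with a single AND. */
uint32_t ring_slot(uint32_t idx, uint32_t size)
{
	assert(ring_size_ok(size));
	return idx & (size - 1);
}
```

With a size that is not a power of two, the mask in ring_slot() would skip some slots, which is why such sizes are rejected here.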
146 | UMEM Fill Ring | |
147 | ~~~~~~~~~~~~~~ | |
148 | ||
149 | The Fill ring is used to transfer ownership of UMEM frames from | |
bbff2f32 BT |
150 | user-space to kernel-space. The UMEM addrs are passed in the ring. As |
151 | an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has | |
152 | 16 chunks and can pass addrs between 0 and 64k. | |
b4b8faa1 MK |
153 | |
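The arithmetic in the example above can be sketched as follows. The helper names are hypothetical; they only show how chunk counts and addrs relate for a given UMEM size and chunk size.

```c
#include <assert.h>
#include <stdint.h>

/* Number of whole chunks in a UMEM of umem_size bytes. */
uint64_t umem_num_chunks(uint64_t umem_size, uint64_t chunk_size)
{
	assert(chunk_size != 0);
	return umem_size / chunk_size;
}

/* The addr (byte offset into the UMEM) at which chunk number idx starts. */
uint64_t umem_chunk_addr(uint64_t idx, uint64_t chunk_size)
{
	return idx * chunk_size;
}
```

For a 64k UMEM with 4k chunks this gives 16 chunks, with the last chunk starting at addr 61440.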
154 | Frames passed to the kernel are used for the ingress path (RX rings). | |
155 | ||
bbff2f32 BT |
156 | The user application produces UMEM addrs to this ring. Note that the |
157 | kernel will mask the incoming addr. E.g. for a chunk size of 2k, the | |
158 | log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050 | |
159 | and 3000 refer to the same chunk. | |
160 | ||
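The masking described above can be reproduced in user space. This is a sketch of the arithmetic only, assuming a power-of-two chunk size; umem_chunk_base is a hypothetical name and not the kernel's own function.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Clear the low log2(chunk_size) bits of addr, yielding the addr at which
 * the containing chunk starts.
 */
uint64_t umem_chunk_base(uint64_t addr, uint64_t chunk_size)
{
	/* The mask only works for a non-zero power-of-two chunk size. */
	assert(chunk_size != 0 && (chunk_size & (chunk_size - 1)) == 0);
	return addr & ~(chunk_size - 1);
}
```

With a chunk size of 2048, the addrs 2048, 2050 and 3000 all map to 2048, while 4096 starts the next chunk.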
b4b8faa1 | 161 | |
7ccc4f18 KD |
162 | UMEM Completion Ring |
163 | ~~~~~~~~~~~~~~~~~~~~ | |
b4b8faa1 MK |
164 | |
165 | The Completion Ring is used to transfer ownership of UMEM frames from | |
166 | kernel-space to user-space. Just like the Fill ring, UMEM addrs are | |
167 | used. | |
168 | ||
169 | Frames passed from the kernel to user-space are frames that have been | |
170 | sent (TX ring) and can be used by user-space again. | |
171 | ||
bbff2f32 | 172 | The user application consumes UMEM addrs from this ring. |
b4b8faa1 MK |
173 | |
174 | ||
175 | RX Ring | |
176 | ~~~~~~~ | |
177 | ||
178 | The RX ring is the receiving side of a socket. Each entry in the ring | |
bbff2f32 BT |
179 | is a struct xdp_desc descriptor. The descriptor contains UMEM offset |
180 | (addr) and the length of the data (len). | |
b4b8faa1 MK |
181 | |
182 | If no frames have been passed to kernel via the Fill ring, no | |
183 | descriptors will (or can) appear on the RX ring. | |
184 | ||
185 | The user application consumes struct xdp_desc descriptors from this | |
186 | ring. | |
187 | ||
188 | TX Ring | |
189 | ~~~~~~~ | |
190 | ||
191 | The TX ring is used to send frames. The struct xdp_desc descriptor is | |
192 | filled (addr and len) and passed into the ring. | |
193 | ||
194 | To start the transfer, a sendmsg() system call is required. This might | |
195 | be relaxed in the future. | |
196 | ||
197 | The user application produces struct xdp_desc descriptors to this | |
198 | ring. | |
199 | ||
200 | XSKMAP / BPF_MAP_TYPE_XSKMAP | |
201 | ---------------------------- | |
202 | ||
203 | On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that | |
204 | is used in conjunction with bpf_redirect_map() to pass the ingress | |
205 | frame to a socket. | |
206 | ||
207 | The user application inserts the socket into the map, via the bpf() | |
208 | system call. | |
209 | ||
210 | Note that if an XDP program tries to redirect to a socket that does | |
211 | not match the queue configuration and netdev, the frame will be | |
212 | dropped. E.g. an AF_XDP socket is bound to netdev eth0 and | |
213 | queue 17. Only the XDP program executing for eth0 and queue 17 will | |
214 | successfully pass data to the socket. Please refer to the sample | |
215 | application (samples/bpf/) for an example. | |
216 | ||
217 | Usage | |
218 | ===== | |
219 | ||
220 | In order to use AF_XDP sockets, two parts are needed: the | |
221 | user-space application and the XDP program. For a complete setup and | |
222 | usage example, please refer to the sample application. The user-space | |
223 | side is xdpsock_user.c and the XDP side xdpsock_kern.c. | |
224 | ||
225 | Naive ring dequeue and enqueue could look like this:: | |
226 | ||
bbff2f32 BT |
227 | // struct xdp_rxtx_ring { |
228 | // __u32 *producer; | |
229 | // __u32 *consumer; | |
230 | // struct xdp_desc *desc; | |
231 | // }; | |
232 | ||
233 | // struct xdp_umem_ring { | |
234 | // __u32 *producer; | |
235 | // __u32 *consumer; | |
236 | // __u64 *desc; | |
237 | // }; | |
238 | ||
b4b8faa1 MK |
239 | // typedef struct xdp_rxtx_ring RING; |
240 | // typedef struct xdp_umem_ring RING; | |
241 | ||
242 | // typedef struct xdp_desc RING_TYPE; | |
bbff2f32 | 243 | // typedef __u64 RING_TYPE; |
b4b8faa1 MK |
244 | |
245 | int dequeue_one(RING *ring, RING_TYPE *item) | |
246 | { | |
bbff2f32 | 247 | __u32 entries = *ring->producer - *ring->consumer; |
b4b8faa1 MK |
248 | |
249 | if (entries == 0) | |
250 | return -1; | |
251 | ||
252 | // read-barrier! | |
253 | ||
bbff2f32 BT |
254 | *item = ring->desc[*ring->consumer & (RING_SIZE - 1)]; |
255 | (*ring->consumer)++; | |
b4b8faa1 MK |
256 | return 0; |
257 | } | |
258 | ||
259 | int enqueue_one(RING *ring, const RING_TYPE *item) | |
260 | { | |
bbff2f32 | 261 | __u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer); |
b4b8faa1 MK |
262 | |
263 | if (free_entries == 0) | |
264 | return -1; | |
265 | ||
bbff2f32 | 266 | ring->desc[*ring->producer & (RING_SIZE - 1)] = *item; |
b4b8faa1 MK |
267 | |
268 | // write-barrier! | |
269 | ||
bbff2f32 | 270 | (*ring->producer)++; |
b4b8faa1 MK |
271 | return 0; |
272 | } | |
273 | ||
274 | ||
275 | For a more optimized version, please refer to the sample application. | |
276 | ||
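The read-barrier and write-barrier placeholders in the naive code above can be made concrete in several ways. The following is a minimal sketch that uses C11 acquire/release atomics on an in-process ring of UMEM addrs. It is an assumption for illustration, not the sample application's implementation: struct demo_ring, DEMO_RING_SIZE and both functions are hypothetical, and a real ring lives in memory mmapped from the socket rather than in a plain struct.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8u	/* hypothetical; must be a power of two */
static_assert((DEMO_RING_SIZE & (DEMO_RING_SIZE - 1)) == 0,
	      "ring size must be a power of two");

/* Hypothetical in-process stand-in for a Fill or Completion ring. */
struct demo_ring {
	_Atomic uint32_t producer;
	_Atomic uint32_t consumer;
	uint64_t desc[DEMO_RING_SIZE];
};

int demo_enqueue(struct demo_ring *r, uint64_t item)
{
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
	/* Acquire: the slot is only reused after the consumer has left it. */
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

	if (prod - cons == DEMO_RING_SIZE)
		return -1;	/* ring is full */

	r->desc[prod & (DEMO_RING_SIZE - 1)] = item;

	/* Release acts as the write barrier: publish the entry, then the index. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return 0;
}

int demo_dequeue(struct demo_ring *r, uint64_t *item)
{
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
	/* Acquire acts as the read barrier: see the entry the producer wrote. */
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

	if (prod == cons)
		return -1;	/* ring is empty */

	*item = r->desc[cons & (DEMO_RING_SIZE - 1)];

	/* Release: hand the slot back only after the entry has been read. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return 0;
}
```

The sketch is only safe with exactly one producer and one consumer, matching the single-producer/single-consumer rule for all four ring types.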
277 | Sample application | |
278 | ================== | |
279 | ||
280 | There is an xdpsock benchmarking/test application included that | |
281 | demonstrates how to use AF_XDP sockets with both private and shared | |
282 | UMEMs. Say that you would like your UDP traffic from port 4242 to end | |
283 | up in queue 16, on which we will enable AF_XDP. Here, we use ethtool | |
284 | for this:: | |
285 | ||
286 | ethtool -N p3p2 rx-flow-hash udp4 fn | |
287 | ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ | |
288 | action 16 | |
289 | ||
290 | Running the rxdrop benchmark in XDP_DRV mode can then be done | |
291 | using:: | |
292 | ||
293 | samples/bpf/xdpsock -i p3p2 -q 16 -r -N | |
294 | ||
295 | For XDP_SKB mode, use the switch "-S" instead of "-N" and all options | |
296 | can be displayed with "-h", as usual. | |
297 | ||
298 | Credits | |
299 | ======= | |
300 | ||
301 | - Björn Töpel (AF_XDP core) | |
302 | - Magnus Karlsson (AF_XDP core) | |
303 | - Alexander Duyck | |
304 | - Alexei Starovoitov | |
305 | - Daniel Borkmann | |
306 | - Jesper Dangaard Brouer | |
307 | - John Fastabend | |
308 | - Jonathan Corbet (LWN coverage) | |
309 | - Michael S. Tsirkin | |
310 | - Qi Z Zhang | |
311 | - Willem de Bruijn | |
312 |