Commit | Line | Data |
---|---|---|
b4b8faa1 MK |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ====== | |
4 | AF_XDP | |
5 | ====== | |
6 | ||
7 | Overview | |
8 | ======== | |
9 | ||
10 | AF_XDP is an address family that is optimized for high performance | |
11 | packet processing. | |
12 | ||
13 | This document assumes that the reader is familiar with BPF and XDP. If | |
14 | not, the Cilium project has an excellent reference guide at | |
bbff2f32 | 15 | http://cilium.readthedocs.io/en/latest/bpf/. |
b4b8faa1 MK |
16 | |
17 | Using the XDP_REDIRECT action from an XDP program, the program can | |
18 | redirect ingress frames to other XDP enabled netdevs, using the | |
19 | bpf_redirect_map() function. AF_XDP sockets make it possible for | |
20 | XDP programs to redirect frames to a memory buffer in a user-space | |
21 | application. | |
22 | ||
23 | An AF_XDP socket (XSK) is created with the normal socket() | |
24 | syscall. Associated with each XSK are two rings: the RX ring and the | |
25 | TX ring. A socket can receive packets on the RX ring and it can send | |
26 | packets on the TX ring. These rings are registered and sized with the | |
27 | setsockopts XDP_RX_RING and XDP_TX_RING, respectively. It is mandatory | |
28 | to have at least one of these rings for each socket. An RX or TX | |
29 | descriptor ring points to a data buffer in a memory area called a | |
30 | UMEM. RX and TX can share the same UMEM so that a packet does not have | |
31 | to be copied between RX and TX. Moreover, if a packet needs to be kept | |
32 | for a while due to a possible retransmit, the descriptor that points | |
33 | to that packet can be changed to point to another and reused right | |
34 | away. This again avoids copying data. | |
35 | ||
bbff2f32 BT |
36 | The UMEM consists of a number of equally sized chunks. A descriptor in |
37 | one of the rings references a frame by referencing its addr. The addr | |
38 | is simply an offset within the entire UMEM region. The user space | |
39 | allocates memory for this UMEM using whatever means it feels is most | |
40 | appropriate (malloc, mmap, huge pages, etc). This memory area is then | |
41 | registered with the kernel using the new setsockopt XDP_UMEM_REG. The | |
42 | UMEM also has two rings: the FILL ring and the COMPLETION ring. The | |
e0e4f8e9 | 43 | FILL ring is used by the application to send down addr for the kernel |
bbff2f32 BT |
44 | to fill in with RX packet data. References to these frames will then |
45 | appear in the RX ring once each packet has been received. The | |
e0e4f8e9 | 46 | COMPLETION ring, on the other hand, contains frame addr that the |
bbff2f32 BT |
47 | kernel has transmitted completely and can now be used again by user |
48 | space, for either TX or RX. Thus, the frame addrs appearing in the | |
e0e4f8e9 | 49 | COMPLETION ring are addrs that were previously transmitted using the |
bbff2f32 BT |
50 | TX ring. In summary, the RX and FILL rings are used for the RX path |
51 | and the TX and COMPLETION rings are used for the TX path. | |
b4b8faa1 MK |
52 | |
53 | The socket is then finally bound with a bind() call to a device and a | |
54 | specific queue id on that device, and it is not until bind is | |
55 | completed that traffic starts to flow. | |
56 | ||
57 | The UMEM can be shared between processes, if desired. If a process | |
58 | wants to do this, it simply skips the registration of the UMEM and its | |
59 | corresponding two rings, sets the XDP_SHARED_UMEM flag in the bind | |
60 | call and submits the XSK of the process it would like to share UMEM | |
61 | with as well as its own newly created XSK socket. The new process will | |
bbff2f32 BT |
62 | then receive frame addr references in its own RX ring that point to |
63 | this shared UMEM. Note that since the ring structures are | |
64 | single-consumer / single-producer (for performance reasons), the new | |
65 | process has to create its own socket with associated RX and TX rings, | |
66 | since it cannot share this with the other process. This is also the | |
67 | reason that there is only one set of FILL and COMPLETION rings per | |
68 | UMEM. It is the responsibility of a single process to handle the UMEM. | |
b4b8faa1 MK |
69 | |
70 | How are packets then distributed from an XDP program to the XSKs? There | |
71 | is a BPF map called XSKMAP (or BPF_MAP_TYPE_XSKMAP in full). The | |
72 | user-space application can place an XSK at an arbitrary place in this | |
73 | map. The XDP program can then redirect a packet to a specific index in | |
74 | this map and at this point XDP validates that the XSK in that map was | |
75 | indeed bound to that device and ring number. If not, the packet is | |
76 | dropped. If the map is empty at that index, the packet is also | |
77 | dropped. This also means that it is currently mandatory to have an XDP | |
78 | program loaded (and one XSK in the XSKMAP) to be able to get any | |
79 | traffic to user space through the XSK. | |
80 | ||
81 | AF_XDP can operate in two different modes: XDP_SKB and XDP_DRV. If the | |
82 | driver does not have support for XDP, or XDP_SKB is explicitly chosen | |
83 | when loading the XDP program, XDP_SKB mode is employed that uses SKBs | |
84 | together with the generic XDP support and copies out the data to user | |
85 | space. This is a fallback mode that works for any network device. On the other | |
86 | hand, if the driver has support for XDP, it will be used by the AF_XDP | |
87 | code to provide better performance, but there is still a copy of the | |
88 | data into user space. | |
89 | ||
90 | Concepts | |
91 | ======== | |
92 | ||
93 | In order to use an AF_XDP socket, a number of associated objects need | |
e0e4f8e9 MK |
94 | to be set up. These objects and their options are explained in the |
95 | following sections. | |
b4b8faa1 | 96 | |
e0e4f8e9 MK |
97 | For an overview on how AF_XDP works, you can also take a look at the |
98 | Linux Plumbers paper from 2018 on the subject: | |
99 | http://vger.kernel.org/lpc_net2018_talks/lpc18_paper_af_xdp_perf-v2.pdf. Do | |
100 | NOT consult the paper from 2017 on "AF_PACKET v4", the first attempt | |
101 | at AF_XDP. Nearly everything has changed since then. Jonathan Corbet has | |
102 | also written an excellent article on LWN, "Accelerating networking | |
103 | with AF_XDP". It can be found at https://lwn.net/Articles/750845/. | |
b4b8faa1 MK |
104 | |
105 | UMEM | |
106 | ---- | |
107 | ||
108 | UMEM is a region of virtually contiguous memory, divided into | |
109 | equal-sized frames. A UMEM is associated with a netdev and a specific | |
bbff2f32 BT |
110 | queue id of that netdev. It is created and configured (chunk size, |
111 | headroom, start address and size) by using the XDP_UMEM_REG setsockopt | |
112 | system call. A UMEM is bound to a netdev and queue id, via the bind() | |
113 | system call. | |
b4b8faa1 MK |
114 | |
115 | An AF_XDP socket is linked to a single UMEM, but one UMEM can have | |
116 | multiple AF_XDP sockets. To share a UMEM created via one socket A, | |
117 | the next socket B can do this by setting the XDP_SHARED_UMEM flag in | |
118 | struct sockaddr_xdp member sxdp_flags, and passing the file descriptor | |
119 | of A to struct sockaddr_xdp member sxdp_shared_umem_fd. | |
120 | ||
e0e4f8e9 | 121 | The UMEM has two single-producer/single-consumer rings that are used |
b4b8faa1 MK |
122 | to transfer ownership of UMEM frames between the kernel and the |
123 | user-space application. | |
124 | ||
125 | Rings | |
126 | ----- | |
127 | ||
e0e4f8e9 | 128 | There are four different kinds of rings: FILL, COMPLETION, RX and |
b4b8faa1 MK |
129 | TX. All rings are single-producer/single-consumer, so the user-space |
130 | application needs explicit synchronization if multiple | |
131 | processes/threads are reading/writing to them. | |
132 | ||
e0e4f8e9 | 133 | The UMEM uses two rings: FILL and COMPLETION. Each socket associated |
b4b8faa1 MK |
134 | with the UMEM must have an RX queue, TX queue or both. Say, that there |
135 | is a setup with four sockets (all doing TX and RX). Then there will be | |
e0e4f8e9 | 136 | one FILL ring, one COMPLETION ring, four TX rings and four RX rings. |
b4b8faa1 MK |
137 | |
138 | The rings are head(producer)/tail(consumer) based rings. A producer | |
139 | writes to the data ring at the index given by the struct xdp_ring | |
140 | producer member, and then increases the producer index. A consumer reads | |
141 | the data ring at the index given by the struct xdp_ring consumer | |
142 | member, and then increases the consumer index. | |
143 | ||
144 | The rings are configured and created via the _RING setsockopt system | |
145 | calls and mmapped to user-space using the appropriate offset to mmap() | |
146 | (XDP_PGOFF_RX_RING, XDP_PGOFF_TX_RING, XDP_UMEM_PGOFF_FILL_RING and | |
147 | XDP_UMEM_PGOFF_COMPLETION_RING). | |
148 | ||
149 | The size of each ring must be a power of two. | |
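The power-of-two requirement is what lets free-running 32-bit producer and consumer indices be mapped to ring slots with a simple mask. The following is a minimal user-space sketch of that arithmetic; the helper names (ring_size_is_valid, ring_slot, ring_entries) are hypothetical and are not part of the kernel uapi or libbpf.

```c
#include <stdbool.h>
#include <stdint.h>

/* A ring size is usable only if it is a non-zero power of two. */
static bool ring_size_is_valid(uint32_t size)
{
	return size != 0 && (size & (size - 1)) == 0;
}

/* Map a free-running producer or consumer index to a slot in the ring.
 * Only correct when size is a power of two. */
static uint32_t ring_slot(uint32_t idx, uint32_t size)
{
	return idx & (size - 1);
}

/* Number of filled entries. Unsigned subtraction stays correct even
 * after the 32-bit indices have wrapped around. */
static uint32_t ring_entries(uint32_t producer, uint32_t consumer)
{
	return producer - consumer;
}
```

With a ring of 2048 entries, for example, index 2049 maps to slot 1, and a producer index that has wrapped past zero still yields the right entry count.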
150 | ||
151 | UMEM Fill Ring | |
152 | ~~~~~~~~~~~~~~ | |
153 | ||
e0e4f8e9 | 154 | The FILL ring is used to transfer ownership of UMEM frames from |
bbff2f32 BT |
155 | user-space to kernel-space. The UMEM addrs are passed in the ring. As |
156 | an example, if the UMEM is 64k and each chunk is 4k, then the UMEM has | |
157 | 16 chunks and can pass addrs between 0 and 64k. | |
b4b8faa1 MK |
158 | |
159 | Frames passed to the kernel are used for the ingress path (RX rings). | |
160 | ||
d57f172c KL |
161 | The user application produces UMEM addrs to this ring. Note that, if |
162 | running the application with aligned chunk mode, the kernel will mask | |
163 | the incoming addr. E.g. for a chunk size of 2k, the log2(2048) LSB of | |
164 | the addr will be masked off, meaning that 2048, 2050 and 3000 refer | |
165 | to the same chunk. If the user application is run in the unaligned | |
166 | chunks mode, then the incoming addr will be left untouched. | |
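The masking in aligned chunk mode can be illustrated with a small sketch. The helper umem_chunk_base() is a hypothetical name used only for this example; it mimics the described behavior in user space and is not the kernel's implementation.

```c
#include <stdint.h>

/* Hypothetical helper: round a UMEM addr down to the start of its chunk
 * by clearing the log2(chunk_size) least significant bits.
 * chunk_size must be a power of two (for example 2048 or 4096). */
static uint64_t umem_chunk_base(uint64_t addr, uint64_t chunk_size)
{
	return addr & ~(chunk_size - 1);
}
```

For a chunk size of 2048, the addrs 2048, 2050 and 3000 all map to the chunk starting at 2048, while 4096 starts the next chunk.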
bbff2f32 | 167 | |
b4b8faa1 | 168 | |
7ccc4f18 KD |
169 | UMEM Completion Ring |
170 | ~~~~~~~~~~~~~~~~~~~~ | |
b4b8faa1 | 171 | |
e0e4f8e9 MK |
172 | The COMPLETION ring is used to transfer ownership of UMEM frames from |
173 | kernel-space to user-space. Just like the FILL ring, UMEM addrs are | |
b4b8faa1 MK |
174 | used. |
175 | ||
176 | Frames passed from the kernel to user-space are frames that have been | |
177 | sent (TX ring) and can be used by user-space again. | |
178 | ||
bbff2f32 | 179 | The user application consumes UMEM addrs from this ring. |
b4b8faa1 MK |
180 | |
181 | ||
182 | RX Ring | |
183 | ~~~~~~~ | |
184 | ||
185 | The RX ring is the receiving side of a socket. Each entry in the ring | |
bbff2f32 BT |
186 | is a struct xdp_desc descriptor. The descriptor contains UMEM offset |
187 | (addr) and the length of the data (len). | |
b4b8faa1 | 188 | |
e0e4f8e9 | 189 | If no frames have been passed to kernel via the FILL ring, no |
b4b8faa1 MK |
190 | descriptors will (or can) appear on the RX ring. |
191 | ||
192 | The user application consumes struct xdp_desc descriptors from this | |
193 | ring. | |
194 | ||
195 | TX Ring | |
196 | ~~~~~~~ | |
197 | ||
198 | The TX ring is used to send frames. The struct xdp_desc descriptor is | |
199 | filled (addr and len) and passed into the ring. | |
200 | ||
201 | To start the transfer a sendmsg() system call is required. This might | |
202 | be relaxed in the future. | |
203 | ||
204 | The user application produces struct xdp_desc descriptors to this | |
205 | ring. | |
206 | ||
e0e4f8e9 MK |
207 | Libbpf |
208 | ====== | |
209 | ||
210 | Libbpf is a helper library for eBPF and XDP that makes using these | |
211 | technologies a lot simpler. It also contains specific helper functions | |
212 | in tools/lib/bpf/xsk.h for facilitating the use of AF_XDP. It | |
213 | contains two types of functions: those that can be used to make the | |
214 | setup of AF_XDP socket easier and ones that can be used in the data | |
215 | plane to access the rings safely and quickly. To see an example on how | |
216 | to use this API, please take a look at the sample application in | |
217 | samples/bpf/xdpsock_user.c which uses libbpf for both setup and data | |
218 | plane operations. | |
219 | ||
220 | We recommend that you use this library unless you have become a power | |
221 | user. It will make your program a lot simpler. | |
222 | ||
b4b8faa1 | 223 | XSKMAP / BPF_MAP_TYPE_XSKMAP |
e0e4f8e9 | 224 | ============================ |
b4b8faa1 MK |
225 | |
226 | On the XDP side there is a BPF map type BPF_MAP_TYPE_XSKMAP (XSKMAP) that | |
227 | is used in conjunction with bpf_redirect_map() to pass the ingress | |
228 | frame to a socket. | |
229 | ||
230 | The user application inserts the socket into the map, via the bpf() | |
231 | system call. | |
232 | ||
233 | Note that if an XDP program tries to redirect to a socket that does | |
234 | not match the queue configuration and netdev, the frame will be | |
235 | dropped. E.g. an AF_XDP socket is bound to netdev eth0 and | |
236 | queue 17. Only the XDP program executing for eth0 and queue 17 will | |
237 | successfully pass data to the socket. Please refer to the sample | |
238 | application (in samples/bpf/) for an example. | |
239 | ||
e0e4f8e9 MK |
240 | Configuration Flags and Socket Options |
241 | ====================================== | |
242 | ||
243 | These are the various configuration flags that can be used to control | |
244 | and monitor the behavior of AF_XDP sockets. | |
245 | ||
246 | XDP_COPY and XDP_ZEROCOPY bind flags | |
247 | ------------------------------------- | |
248 | ||
249 | When you bind a socket, the kernel will first try to use zero-copy | |
250 | mode. If zero-copy is not supported, it will fall back on using copy | |
251 | mode, i.e. copying all packets out to user space. But if you would | |
252 | like to force a certain mode, you can use the following flags. If you | |
253 | pass the XDP_COPY flag to the bind call, the kernel will force the | |
254 | socket into copy mode. If it cannot use copy mode, the bind call will | |
255 | fail with an error. Conversely, the XDP_ZEROCOPY flag will force the | |
256 | socket into zero-copy mode or fail. | |
257 | ||
258 | XDP_SHARED_UMEM bind flag | |
259 | ------------------------- | |
260 | ||
261 | This flag enables you to bind multiple sockets to the same UMEM, but | |
262 | only if they share the same queue id. In this mode, each socket has | |
263 | its own RX and TX rings, but the UMEM (tied to the first socket | |
264 | created) only has a single FILL ring and a single COMPLETION | |
265 | ring. To use this mode, create the first socket and bind it in the normal | |
266 | way. Create a second socket and create an RX and a TX ring, or at | |
267 | least one of them, but no FILL or COMPLETION rings as the ones from | |
268 | the first socket will be used. In the bind call, set the | |
269 | XDP_SHARED_UMEM option and provide the initial socket's fd in the | |
270 | sxdp_shared_umem_fd field. You can attach an arbitrary number of extra | |
271 | sockets this way. | |
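The bind address for a second socket that shares the first socket's UMEM could be prepared roughly as follows. This is a sketch under assumptions: the helper name shared_umem_addr() is hypothetical, the fallback AF_XDP definition is only there for older libc headers, and error handling and the actual socket(), ring setup and bind() calls are left out.

```c
#include <string.h>
#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef AF_XDP
#define AF_XDP 44 /* assumption: fallback for libc headers that lack AF_XDP */
#endif

/* Hypothetical helper: build the address used to bind a socket that
 * shares the UMEM of an already bound socket (first_xsk_fd). The device
 * and queue id must match those of the first socket. */
static struct sockaddr_xdp shared_umem_addr(unsigned int ifindex,
					    unsigned int queue_id,
					    int first_xsk_fd)
{
	struct sockaddr_xdp sxdp;

	memset(&sxdp, 0, sizeof(sxdp));
	sxdp.sxdp_family = AF_XDP;
	sxdp.sxdp_ifindex = ifindex;
	sxdp.sxdp_queue_id = queue_id;
	sxdp.sxdp_flags = XDP_SHARED_UMEM;        /* reuse the first socket's UMEM */
	sxdp.sxdp_shared_umem_fd = first_xsk_fd;  /* fd of the first socket */
	return sxdp;
}
```

The returned structure would then be passed to bind() on the second socket, after its RX and/or TX rings have been created.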
272 | ||
273 | Which socket will a packet then arrive on? This is decided by the XDP | |
274 | program. Put all the sockets in the XSKMAP and just indicate which | |
275 | index in the array you would like to send each packet to. A simple | |
276 | round-robin example of distributing packets is shown below: | |
277 | ||
278 | .. code-block:: c | |
279 | ||
280 | #include <linux/bpf.h> | |
281 | #include "bpf_helpers.h" | |
282 | ||
283 | #define MAX_SOCKS 16 | |
284 | ||
285 | struct { | |
286 | __uint(type, BPF_MAP_TYPE_XSKMAP); | |
287 | __uint(max_entries, MAX_SOCKS); | |
288 | __uint(key_size, sizeof(int)); | |
289 | __uint(value_size, sizeof(int)); | |
290 | } xsks_map SEC(".maps"); | |
291 | ||
292 | static unsigned int rr; | |
293 | ||
294 | SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx) | |
295 | { | |
296 | rr = (rr + 1) & (MAX_SOCKS - 1); | |
297 | ||
57afa8b0 | 298 | return bpf_redirect_map(&xsks_map, rr, XDP_DROP); |
e0e4f8e9 MK |
299 | } |
300 | ||
301 | Note that since there is only a single set of FILL and COMPLETION | |
302 | rings, and they are single producer, single consumer rings, you need | |
303 | to make sure that multiple processes or threads do not use these rings | |
304 | concurrently. There are no synchronization primitives in the | |
305 | libbpf code that protect multiple users at this point in time. | |
306 | ||
57afa8b0 MK |
307 | Libbpf uses this mode if you create more than one socket tied to the |
308 | same umem. However, note that you need to supply the | |
309 | XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD libbpf_flag with the | |
310 | xsk_socket__create calls and load your own XDP program as there is no | |
311 | built-in one in libbpf that will route the traffic for you. | |
312 | ||
e0e4f8e9 MK |
313 | XDP_USE_NEED_WAKEUP bind flag |
314 | ----------------------------- | |
315 | ||
316 | This option adds support for a new flag called need_wakeup that is | |
317 | present in the FILL ring and the TX ring, the rings for which user | |
318 | space is a producer. When this option is set in the bind call, the | |
319 | need_wakeup flag will be set if the kernel needs to be explicitly | |
320 | woken up by a syscall to continue processing packets. If the flag is | |
321 | zero, no syscall is needed. | |
322 | ||
323 | If the flag is set on the FILL ring, the application needs to call | |
324 | poll() to be able to continue to receive packets on the RX ring. This | |
325 | can happen, for example, when the kernel has detected that there are no | |
326 | more buffers on the FILL ring and no buffers left on the RX HW ring of | |
327 | the NIC. In this case, interrupts are turned off as the NIC cannot | |
328 | receive any packets (as there are no buffers to put them in), and the | |
329 | need_wakeup flag is set so that user space can put buffers on the | |
330 | FILL ring and then call poll() so that the kernel driver can put these | |
331 | buffers on the HW ring and start to receive packets. | |
332 | ||
333 | If the flag is set for the TX ring, it means that the application | |
334 | needs to explicitly notify the kernel to send any packets put on the | |
335 | TX ring. This can be accomplished either by a poll() call, as in the | |
336 | RX path, or by calling sendto(). | |
337 | ||
338 | An example of how to use this flag can be found in | |
339 | samples/bpf/xdpsock_user.c. An example with the use of libbpf helpers | |
340 | would look like this for the TX path: | |
341 | ||
342 | .. code-block:: c | |
343 | ||
344 | if (xsk_ring_prod__needs_wakeup(&my_tx_ring)) | |
345 | sendto(xsk_socket__fd(xsk_handle), NULL, 0, MSG_DONTWAIT, NULL, 0); | |
346 | ||
347 | I.e., only use the syscall if the flag is set. | |
348 | ||
349 | We recommend that you always enable this mode as it usually leads to | |
350 | better performance especially if you run the application and the | |
351 | driver on the same core, but also if you use different cores for the | |
352 | application and the kernel driver, as it reduces the number of | |
353 | syscalls needed for the TX path. | |
354 | ||
355 | XDP_{RX|TX|UMEM_FILL|UMEM_COMPLETION}_RING setsockopts | |
356 | ------------------------------------------------------ | |
357 | ||
358 | These setsockopts set the number of descriptors that the RX, TX, | |
359 | FILL, and COMPLETION rings respectively should have. It is mandatory | |
360 | to set the size of at least one of the RX and TX rings. If you set | |
361 | both, you will be able to both receive and send traffic from your | |
362 | application, but if you only want to do one of them, you can save | |
363 | resources by only setting up one of them. Both the FILL ring and the | |
57afa8b0 MK |
364 | COMPLETION ring are mandatory as you need to have a UMEM tied to your |
365 | socket. But if the XDP_SHARED_UMEM flag is used, any socket after the | |
366 | first one does not have a UMEM and should in that case not have any | |
367 | FILL or COMPLETION rings created as the ones from the shared umem will | |
368 | be used. Note that the rings are single-producer single-consumer, so | |
369 | do not try to access them from multiple processes at the same | |
370 | time. See the XDP_SHARED_UMEM section. | |
371 | ||
372 | In libbpf, you can create Rx-only and Tx-only sockets by supplying | |
373 | NULL to the rx and tx arguments, respectively, to the | |
374 | xsk_socket__create function. | |
375 | ||
376 | If you create a Tx-only socket, we recommend that you do not put any | |
377 | packets on the fill ring. If you do this, drivers might think you are | |
378 | going to receive something when you in fact will not, and this can | |
379 | negatively impact performance. | |
e0e4f8e9 MK |
380 | |
381 | XDP_UMEM_REG setsockopt | |
382 | ----------------------- | |
383 | ||
384 | This setsockopt registers a UMEM to a socket. This is the area that | |
385 | contains all the buffers that packets can reside in. The call takes a | |
386 | pointer to the beginning of this area and the size of it. Moreover, it | |
387 | also has a parameter called chunk_size that is the size that the UMEM is | |
388 | divided into. It can only be 2K or 4K at the moment. If you have a | |
389 | UMEM area that is 128K and a chunk size of 2K, this means that you | |
390 | will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM | |
391 | area and that your largest packet size can be 2K. | |
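The chunk arithmetic above can be written out as a short sketch. Both helper names are hypothetical; they only illustrate how the UMEM size and chunk_size determine the number of buffers and the addr (offset into the UMEM) at which each chunk starts.

```c
#include <stdint.h>

/* Number of chunks a UMEM of umem_size bytes is divided into. */
static uint64_t umem_num_chunks(uint64_t umem_size, uint32_t chunk_size)
{
	return umem_size / chunk_size;
}

/* Start addr of chunk number index, as an offset from the start of the
 * UMEM area. These are the values an application would typically
 * produce to the FILL ring. */
static uint64_t umem_chunk_addr(uint64_t index, uint32_t chunk_size)
{
	return index * chunk_size;
}
```

A 128K UMEM with 2K chunks gives 64 chunks, with chunk 3 starting at addr 6144.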
392 | ||
393 | There is also an option to set the headroom of each single buffer in | |
394 | the UMEM. If you set this to N bytes, it means that the packet will | |
395 | start N bytes into the buffer leaving the first N bytes for the | |
396 | application to use. The final option is the flags field, but it will | |
397 | be dealt with in separate sections for each UMEM flag. | |
398 | ||
399 | XDP_STATISTICS getsockopt | |
400 | ------------------------- | |
401 | ||
402 | Gets drop statistics of a socket that can be useful for debug | |
403 | purposes. The supported statistics are shown below: | |
404 | ||
405 | .. code-block:: c | |
406 | ||
407 | struct xdp_statistics { | |
408 | __u64 rx_dropped; /* Dropped for reasons other than invalid desc */ | |
409 | __u64 rx_invalid_descs; /* Dropped due to invalid descriptor */ | |
410 | __u64 tx_invalid_descs; /* Dropped due to invalid descriptor */ | |
411 | }; | |
412 | ||
413 | XDP_OPTIONS getsockopt | |
414 | ---------------------- | |
415 | ||
416 | Gets options from an XDP socket. The only one supported so far is | |
417 | XDP_OPTIONS_ZEROCOPY which tells you if zero-copy is on or not. | |
418 | ||
b4b8faa1 MK |
419 | Usage |
420 | ===== | |
421 | ||
e0e4f8e9 | 422 | In order to use AF_XDP sockets two parts are needed. The |
b4b8faa1 MK |
423 | user-space application and the XDP program. For a complete setup and |
424 | usage example, please refer to the sample application. The user-space | |
0bed6137 EL |
425 | side is xdpsock_user.c and the XDP side is part of libbpf. |
426 | ||
e0e4f8e9 MK |
427 | The XDP code sample included in tools/lib/bpf/xsk.c is the following: |
428 | ||
429 | .. code-block:: c | |
0bed6137 EL |
430 | |
431 | SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx) | |
432 | { | |
433 | int index = ctx->rx_queue_index; | |
434 | ||
e0e4f8e9 | 435 | // A set entry here means that the corresponding queue_id |
0bed6137 EL |
436 | // has an active AF_XDP socket bound to it. |
437 | if (bpf_map_lookup_elem(&xsks_map, &index)) | |
438 | return bpf_redirect_map(&xsks_map, index, 0); | |
439 | ||
440 | return XDP_PASS; | |
441 | } | |
b4b8faa1 | 442 | |
e0e4f8e9 MK |
443 | A simple but not very performant ring dequeue and enqueue could look |
444 | like this: | |
445 | ||
446 | .. code-block:: c | |
b4b8faa1 | 447 | |
bbff2f32 BT |
448 | // struct xdp_rxtx_ring { |
449 | // __u32 *producer; | |
450 | // __u32 *consumer; | |
451 | // struct xdp_desc *desc; | |
452 | // }; | |
453 | ||
454 | // struct xdp_umem_ring { | |
455 | // __u32 *producer; | |
456 | // __u32 *consumer; | |
457 | // __u64 *desc; | |
458 | // }; | |
459 | ||
b4b8faa1 MK |
460 | // typedef struct xdp_rxtx_ring RING; |
461 | // typedef struct xdp_umem_ring RING; | |
462 | ||
463 | // typedef struct xdp_desc RING_TYPE; | |
bbff2f32 | 464 | // typedef __u64 RING_TYPE; |
b4b8faa1 MK |
465 | |
466 | int dequeue_one(RING *ring, RING_TYPE *item) | |
467 | { | |
bbff2f32 | 468 | __u32 entries = *ring->producer - *ring->consumer; |
b4b8faa1 MK |
469 | |
470 | if (entries == 0) | |
471 | return -1; | |
472 | ||
473 | // read-barrier! | |
474 | ||
bbff2f32 BT |
475 | *item = ring->desc[*ring->consumer & (RING_SIZE - 1)]; |
476 | (*ring->consumer)++; | |
b4b8faa1 MK |
477 | return 0; |
478 | } | |
479 | ||
480 | int enqueue_one(RING *ring, const RING_TYPE *item) | |
481 | { | |
bbff2f32 | 482 | __u32 free_entries = RING_SIZE - (*ring->producer - *ring->consumer); |
b4b8faa1 MK |
483 | |
484 | if (free_entries == 0) | |
485 | return -1; | |
486 | ||
bbff2f32 | 487 | ring->desc[*ring->producer & (RING_SIZE - 1)] = *item; |
b4b8faa1 MK |
488 | |
489 | // write-barrier! | |
490 | ||
bbff2f32 | 491 | (*ring->producer)++; |
b4b8faa1 MK |
492 | return 0; |
493 | } | |
494 | ||
e0e4f8e9 MK |
495 | But please use the libbpf functions as they are optimized and ready to |
496 | use. They will make your life easier. | |
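To make the barrier placeholders in the example above concrete, here is a self-contained sketch of a single-producer/single-consumer ring that runs entirely in user space and uses C11 atomics with acquire/release ordering in place of the read and write barriers. It is an illustration only: struct spsc_ring, the fixed RING_SIZE and ring_roundtrip_ok() are assumptions made for this example, the real rings are mmapped from the kernel, and this is not how libbpf implements its ring accessors.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u /* must be a power of two */

struct spsc_ring {
	_Atomic uint32_t producer;
	_Atomic uint32_t consumer;
	uint64_t desc[RING_SIZE];
};

static int enqueue_one(struct spsc_ring *r, uint64_t item)
{
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_relaxed);
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_acquire);

	if (prod - cons == RING_SIZE)
		return -1; /* ring is full */

	r->desc[prod & (RING_SIZE - 1)] = item;
	/* Release: the entry must be visible before the new producer index. */
	atomic_store_explicit(&r->producer, prod + 1, memory_order_release);
	return 0;
}

static int dequeue_one(struct spsc_ring *r, uint64_t *item)
{
	uint32_t cons = atomic_load_explicit(&r->consumer, memory_order_relaxed);
	/* Acquire: pairs with the release store in enqueue_one(). */
	uint32_t prod = atomic_load_explicit(&r->producer, memory_order_acquire);

	if (prod == cons)
		return -1; /* ring is empty */

	*item = r->desc[cons & (RING_SIZE - 1)];
	/* Release: the slot is read before it is handed back to the producer. */
	atomic_store_explicit(&r->consumer, cons + 1, memory_order_release);
	return 0;
}

/* Single-threaded check: fill the ring, then drain it in order.
 * Returns 1 on success and 0 on failure. */
static int ring_roundtrip_ok(void)
{
	struct spsc_ring r = { 0 };
	uint64_t v;

	for (uint64_t i = 0; i < RING_SIZE; i++)
		if (enqueue_one(&r, i * 2048))
			return 0;
	if (enqueue_one(&r, 0) != -1) /* the ring is full now */
		return 0;
	for (uint64_t i = 0; i < RING_SIZE; i++)
		if (dequeue_one(&r, &v) || v != i * 2048)
			return 0;
	return dequeue_one(&r, &v) == -1; /* the ring is empty again */
}
```

The release store on the producer index plays the role of the write barrier, and the acquire load of that index on the consumer side plays the role of the read barrier.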
b4b8faa1 MK |
497 | |
498 | Sample application | |
499 | ================== | |
500 | ||
501 | There is an xdpsock benchmarking/test application included that | |
e0e4f8e9 MK |
502 | demonstrates how to use AF_XDP sockets with private UMEMs. Say that |
503 | you would like your UDP traffic from port 4242 to end up in queue 16, | |
504 | on which we will enable AF_XDP. Here, we use ethtool for this:: | |
b4b8faa1 MK |
505 | |
506 | ethtool -N p3p2 rx-flow-hash udp4 fn | |
507 | ethtool -N p3p2 flow-type udp4 src-port 4242 dst-port 4242 \ | |
508 | action 16 | |
509 | ||
510 | Running the rxdrop benchmark in XDP_DRV mode can then be done | |
511 | using:: | |
512 | ||
513 | samples/bpf/xdpsock -i p3p2 -q 16 -r -N | |
514 | ||
515 | For XDP_SKB mode, use the switch "-S" instead of "-N" and all options | |
516 | can be displayed with "-h", as usual. | |
517 | ||
e0e4f8e9 MK |
518 | This sample application uses libbpf to make the setup and usage of |
519 | AF_XDP simpler. If you want to know how the raw uapi of AF_XDP is | |
520 | really used to make something more advanced, take a look at the libbpf | |
521 | code in tools/lib/bpf/xsk.[ch]. | |
522 | ||
0f4a9b7d MK |
523 | FAQ |
524 | ======= | |
525 | ||
526 | Q: I am not seeing any traffic on the socket. What am I doing wrong? | |
527 | ||
528 | A: When a netdev of a physical NIC is initialized, Linux usually | |
e0e4f8e9 | 529 | allocates one RX and TX queue pair per core. So on an 8-core system, |
0f4a9b7d MK |
530 | queue ids 0 to 7 will be allocated, one per core. In the AF_XDP |
531 | bind call or the xsk_socket__create libbpf function call, you | |
532 | specify a specific queue id to bind to and it is only the traffic | |
533 | towards that queue you are going to get on your socket. So in the | |
534 | example above, if you bind to queue 0, you are NOT going to get any | |
535 | traffic that is distributed to queues 1 through 7. If you are | |
536 | lucky, you will see the traffic, but usually it will end up on one | |
537 | of the queues you have not bound to. | |
538 | ||
539 | There are a number of ways to solve the problem of getting the | |
540 | traffic you want to the queue id you bound to. If you want to see | |
541 | all the traffic, you can force the netdev to only have 1 queue, queue | |
542 | id 0, and then bind to queue 0. You can use ethtool to do this:: | |
543 | ||
221fb726 | 544 | sudo ethtool -L <interface> combined 1 |
0f4a9b7d MK |
545 | |
546 | If you want to only see part of the traffic, you can program the | |
547 | NIC through ethtool to filter out your traffic to a single queue id | |
548 | that you can bind your XDP socket to. Here is one example in which | |
549 | UDP traffic to and from port 4242 is sent to queue 2:: | |
550 | ||
221fb726 RD |
551 | sudo ethtool -N <interface> rx-flow-hash udp4 fn |
552 | sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \ | |
553 | 4242 action 2 | |
0f4a9b7d | 554 | |
e0e4f8e9 | 555 | A number of other ways are possible, all depending on the capabilities of |
0f4a9b7d MK |
556 | the NIC you have. |
557 | ||
e0e4f8e9 MK |
558 | Q: Can I use the XSKMAP to implement a switch between different umems |
559 | in copy mode? | |
560 | ||
561 | A: The short answer is no, that is not supported at the moment. The | |
562 | XSKMAP can only be used to switch traffic coming in on queue id X | |
563 | to sockets bound to the same queue id X. The XSKMAP can contain | |
564 | sockets bound to different queue ids, for example X and Y, but only | |
565 | traffic coming in from queue id Y can be directed to sockets bound | |
566 | to the same queue id Y. In zero-copy mode, you should use the | |
567 | switch, or other distribution mechanism, in your NIC to direct | |
568 | traffic to the correct queue id and socket. | |
569 | ||
b4b8faa1 MK |
570 | Credits |
571 | ======= | |
572 | ||
573 | - Björn Töpel (AF_XDP core) | |
574 | - Magnus Karlsson (AF_XDP core) | |
575 | - Alexander Duyck | |
576 | - Alexei Starovoitov | |
577 | - Daniel Borkmann | |
578 | - Jesper Dangaard Brouer | |
579 | - John Fastabend | |
580 | - Jonathan Corbet (LWN coverage) | |
581 | - Michael S. Tsirkin | |
582 | - Qi Z Zhang | |
583 | - Willem de Bruijn |