Commit | Line | Data |
---|---|---|
1da177e4 LT |
1 | -------------------------------------------------------------------------------- |
2 | + ABSTRACT | |
3 | -------------------------------------------------------------------------------- | |
4 | ||
889b8f96 | 5 | This file documents the mmap() facility available with the PACKET |
d1ee40f9 DB |
6 | socket interface on 2.4/2.6/3.x kernels. This type of socket is used to |
7 | i) capture network traffic with utilities like tcpdump, ii) transmit network | |
8 | traffic, or iii) do anything else that needs raw access to a network interface. | |
1da177e4 | 9 | |
69e3c75f | 10 | A howto can be found at: |
2b221d20 | 11 | https://sites.google.com/site/packetmmap/ |
1da177e4 | 12 | |
69e3c75f | 13 | Please send your comments to |
be2a608b | 14 | Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es> |
2b221d20 | 15 | Johann Baudy |
1da177e4 LT |
16 | |
17 | ------------------------------------------------------------------------------- | |
18 | + Why use PACKET_MMAP | |
19 | -------------------------------------------------------------------------------- | |
20 | ||
d1ee40f9 DB |
21 | In Linux 2.4/2.6/3.x, if PACKET_MMAP is not enabled, the capture process is very |
22 | inefficient. It uses very limited buffers and requires one system call to | |
23 | capture each packet, or two if you want the packet's timestamp | |
24 | (as libpcap always does). | |
1da177e4 LT |
25 | |
26 | On the other hand, PACKET_MMAP is very efficient. PACKET_MMAP provides a size | |
69e3c75f JB |
27 | configurable circular buffer mapped in user space that can be used to either |
28 | send or receive packets. This way, reading packets just requires waiting for | |
29 | them; most of the time there is no need to issue a single system call. For | |
30 | transmission, multiple packets can be sent through one system call to get the | |
d1ee40f9 DB |
31 | highest bandwidth. Using a buffer shared between the kernel and the user |
32 | also has the benefit of minimizing packet copies. | |
69e3c75f JB |
33 | |
34 | It's fine to use PACKET_MMAP to improve the performance of the capture and | |
35 | transmission process, but it isn't everything. At least, if you are capturing | |
36 | at high speeds (relative to the CPU speed), you should check whether the | |
37 | device driver of your network interface card supports some sort of interrupt | |
38 | load mitigation or (even better) NAPI, and make sure it is | |
39 | enabled. For transmission, check the MTU (Maximum Transmission Unit) used and | |
d1ee40f9 DB |
40 | supported by devices of your network. CPU IRQ pinning of your network interface |
41 | card can also be an advantage. | |
1da177e4 LT |
42 | |
43 | -------------------------------------------------------------------------------- | |
889b8f96 | 44 | + How to use mmap() to improve capture process |
1da177e4 LT |
45 | -------------------------------------------------------------------------------- |
46 | ||
c30fe7f7 | 47 | From the user standpoint, you should use the higher level libpcap library, which |
1da177e4 LT |
48 | is a de facto standard, portable across nearly all operating systems |
49 | including Win32. | |
50 | ||
2b221d20 SH |
51 | Packet MMAP support was integrated into libpcap around the time of version 1.3.0; |
52 | TPACKET_V3 support was added in version 1.5.0. | |
1da177e4 LT |
53 | |
54 | -------------------------------------------------------------------------------- | |
889b8f96 | 55 | + How to use mmap() directly to improve capture process |
1da177e4 LT |
56 | -------------------------------------------------------------------------------- |
57 | ||
58 | From the system call standpoint, the use of PACKET_MMAP involves | |
59 | the following process: | |
60 | ||
61 | ||
62 | [setup] socket() -------> creation of the capture socket | |
63 | setsockopt() ---> allocation of the circular buffer (ring) | |
69e3c75f | 64 | option: PACKET_RX_RING |
6c28f2c0 | 65 | mmap() ---------> mapping of the allocated buffer to the |
1da177e4 LT |
66 | user process |
67 | ||
68 | [capture] poll() ---------> to wait for incoming packets | |
69 | ||
70 | [shutdown] close() --------> destruction of the capture socket and | |
71 | deallocation of all associated | |
72 | resources. | |
73 | ||
74 | ||
75 | Socket creation and destruction are straightforward, and are done | |
76 | the same way with or without PACKET_MMAP: | |
77 | ||
d1ee40f9 | 78 | int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); |
1da177e4 LT |
79 | |
80 | where mode is SOCK_RAW for the raw interface, where link level | |
81 | information can be captured or SOCK_DGRAM for the cooked | |
82 | interface where link level information capture is not | |
83 | supported and a link level pseudo-header is provided | |
84 | by the kernel. | |
85 | ||
86 | The destruction of the socket and all associated resources | |
87 | is done by a simple call to close(fd). | |
88 | ||
7e11daa7 NB |
89 | As without PACKET_MMAP, it is possible to use one socket |
90 | for capture and transmission. This can be done by mapping the | |
91 | allocated RX and TX buffer ring with a single mmap() call. | |
92 | See "Mapping and use of the circular buffer (ring)". | |
93 | ||
a33f3224 | 94 | Next I will describe the PACKET_MMAP settings and their constraints, |
6c28f2c0 | 95 | as well as the mapping of the circular buffer in the user process and |
1da177e4 LT |
96 | the use of this buffer. |
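
The [setup] steps listed above can be sketched as a small helper. This is
only an illustration under stated assumptions: map_rx_ring is a hypothetical
name (not a kernel or libc API), the socket is assumed to have been created
already with socket(PF_PACKET, ...), and req is assumed to satisfy the
constraints described in the following sections.

```c
#include <stddef.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Hypothetical helper: request an RX ring on an existing packet socket
 * and map it into the process. Returns the mapped address, or NULL if
 * either setsockopt() or mmap() fails.
 */
void *map_rx_ring(int fd, const struct tpacket_req *req)
{
    size_t size = (size_t)req->tp_block_size * req->tp_block_nr;
    void *ring;

    /* Ask the kernel to allocate the circular buffer (ring). */
    if (setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0)
        return NULL;

    /* One mmap() call covers every block of the ring. */
    ring = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    return ring == MAP_FAILED ? NULL : ring;
}
```

On success the caller would typically wait with poll() and, at shutdown,
simply close(fd); creating the socket requires the appropriate privileges.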
97 | ||
69e3c75f | 98 | -------------------------------------------------------------------------------- |
889b8f96 | 99 | + How to use mmap() directly to improve transmission process |
69e3c75f JB |
100 | -------------------------------------------------------------------------------- |
101 | The transmission process is similar to capture, as shown below. | |
102 | ||
103 | [setup] socket() -------> creation of the transmission socket | |
104 | setsockopt() ---> allocation of the circular buffer (ring) | |
105 | option: PACKET_TX_RING | |
106 | bind() ---------> bind transmission socket with a network interface | |
107 | mmap() ---------> mapping of the allocated buffer to the | |
108 | user process | |
109 | ||
110 | [transmission] poll() ---------> wait for free packets (optional) | |
111 | send() ---------> send all packets that are set as ready in | |
112 | the ring | |
113 | The flag MSG_DONTWAIT can be used to return | |
114 | before end of transfer. | |
115 | ||
116 | [shutdown] close() --------> destruction of the transmission socket and | |
117 | deallocation of all associated resources. | |
118 | ||
66e56cd4 DB |
119 | Socket creation and destruction are also straightforward, and are done |
120 | the same way as for capturing, described in the previous section: | |
121 | ||
122 | int fd = socket(PF_PACKET, mode, 0); | |
123 | ||
124 | The protocol can optionally be 0 in case we only want to transmit | |
125 | via this socket, which avoids an expensive call to packet_rcv(). | |
126 | In this case, you also need to bind(2) the TX_RING with sll_protocol = 0 | |
127 | set. Otherwise, use htons(ETH_P_ALL) or any other protocol. | |
128 | ||
69e3c75f JB |
129 | Binding the socket to your network interface is mandatory (with zero copy) to |
130 | know the header size of frames used in the circular buffer. | |
131 | ||
132 | As with capture, each frame contains two parts: | |
133 | ||
134 | -------------------- | |
135 | | struct tpacket_hdr | Header. It contains the status of | |
136 | | | this frame | |
137 | |--------------------| | |
138 | | data buffer | | |
139 | . . Data that will be sent over the network interface. | |
140 | . . | |
141 | -------------------- | |
142 | ||
143 | bind() associates the socket to your network interface thanks to | |
144 | sll_ifindex parameter of struct sockaddr_ll. | |
145 | ||
146 | Initialization example: | |
147 | ||
148 | struct sockaddr_ll my_addr; | |
149 | struct ifreq s_ifr; | |
150 | ... | |
151 | ||
152 | strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name)); | |
153 | ||
154 | /* get interface index of eth0 */ | |
155 | ioctl(this->socket, SIOCGIFINDEX, &s_ifr); | |
156 | ||
157 | /* fill sockaddr_ll struct to prepare binding */ | |
158 | my_addr.sll_family = AF_PACKET; | |
30e7dfe7 | 159 | my_addr.sll_protocol = htons(ETH_P_ALL); |
69e3c75f JB |
160 | my_addr.sll_ifindex = s_ifr.ifr_ifindex; |
161 | ||
162 | /* bind socket to eth0 */ | |
163 | bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll)); | |
164 | ||
2b221d20 | 165 | A complete tutorial is available at: https://sites.google.com/site/packetmmap/ |
69e3c75f | 166 | |
5920cd3a PC |
167 | By default, the user should put data at : |
168 | frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll) | |
169 | ||
170 | So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW), | |
171 | the beginning of the user data will be at : | |
172 | frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) | |
173 | ||
174 | If you wish to put user data at a custom offset from the beginning of | |
175 | the frame (for payload alignment with SOCK_RAW mode for instance) you | |
176 | can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order | |
177 | to make this work it must be enabled previously with setsockopt() | |
178 | and the PACKET_TX_HAS_OFF option. | |
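
The default data location described above can be written as a pair of small
helpers. This is a sketch for TPACKET_V1 only; the function names are
hypothetical and are not part of any header.

```c
#include <stddef.h>
#include <linux/if_packet.h>

/* Default offset of user data from the start of a TPACKET_V1 TX frame. */
size_t tx_default_data_offset(void)
{
    return TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
}

/* Pointer to the default payload area of the TX frame at frame_base. */
void *tx_default_data(void *frame_base)
{
    return (unsigned char *)frame_base + tx_default_data_offset();
}
```

Because TPACKET_HDRLEN is defined as the aligned header size plus
sizeof(struct sockaddr_ll), both expressions in the text give the same offset.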
179 | ||
1da177e4 LT |
180 | -------------------------------------------------------------------------------- |
181 | + PACKET_MMAP settings | |
182 | -------------------------------------------------------------------------------- | |
183 | ||
1da177e4 LT |
184 | Setting up PACKET_MMAP from user-level code is done with a call like |
185 | ||
69e3c75f | 186 | - Capture process |
1da177e4 | 187 | setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req)) |
69e3c75f JB |
188 | - Transmission process |
189 | setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req)) | |
1da177e4 LT |
190 | |
191 | The most significant argument in the previous call is the req parameter, | |
192 | which must have the following structure: | |
193 | ||
194 | struct tpacket_req | |
195 | { | |
196 | unsigned int tp_block_size; /* Minimal size of contiguous block */ | |
197 | unsigned int tp_block_nr; /* Number of blocks */ | |
198 | unsigned int tp_frame_size; /* Size of frame */ | |
199 | unsigned int tp_frame_nr; /* Total number of frames */ | |
200 | }; | |
201 | ||
202 | This structure is defined in /usr/include/linux/if_packet.h and establishes a | |
69e3c75f | 203 | circular buffer (ring) of unswappable memory. |
1da177e4 LT |
204 | Mapping it into the capture process allows reading the captured frames and |
205 | related meta-information like timestamps without requiring a system call. | |
206 | ||
69e3c75f | 207 | Frames are grouped in blocks. Each block is a physically contiguous |
1da177e4 LT |
208 | region of memory and holds tp_block_size/tp_frame_size frames. The total number |
209 | of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because | |
210 | ||
211 | frames_per_block = tp_block_size/tp_frame_size | |
212 | ||
213 | indeed, packet_set_ring checks that the following condition is true | |
214 | ||
215 | frames_per_block * tp_block_nr == tp_frame_nr | |
216 | ||
1da177e4 LT |
217 | Let's see an example, with the following values: |
218 | ||
219 | tp_block_size= 4096 | |
220 | tp_frame_size= 2048 | |
221 | tp_block_nr = 4 | |
222 | tp_frame_nr = 8 | |
223 | ||
224 | we will get the following buffer structure: | |
225 | ||
226 | block #1 block #2 | |
227 | +---------+---------+ +---------+---------+ | |
228 | | frame 1 | frame 2 | | frame 3 | frame 4 | | |
229 | +---------+---------+ +---------+---------+ | |
230 | ||
231 | block #3 block #4 | |
232 | +---------+---------+ +---------+---------+ | |
233 | | frame 5 | frame 6 | | frame 7 | frame 8 | | |
234 | +---------+---------+ +---------+---------+ | |
235 | ||
236 | A frame can be of any size, provided that it fits in a block. A block | |
237 | can only hold an integer number of frames, or in other words, a frame cannot | |
25985edc | 238 | span two blocks, so there are some details you have to take into |
6c28f2c0 | 239 | account when choosing the frame_size. See "Mapping and use of the circular |
1da177e4 LT |
240 | buffer (ring)". |
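
Since tp_frame_nr is redundant, it can be derived rather than typed by hand.
The following is a minimal sketch; fill_ring_req is a hypothetical helper
name, and the only check it performs is that at least one frame fits in a
block.

```c
#include <string.h>
#include <linux/if_packet.h>

/*
 * Fill req from a block size, a frame size and a block count, deriving
 * tp_frame_nr. Returns 0 on success, -1 if no frame fits in a block.
 */
int fill_ring_req(struct tpacket_req *req, unsigned int block_size,
                  unsigned int frame_size, unsigned int block_nr)
{
    unsigned int frames_per_block;

    if (frame_size == 0 || frame_size > block_size)
        return -1;

    frames_per_block = block_size / frame_size;

    memset(req, 0, sizeof(*req));
    req->tp_block_size = block_size;
    req->tp_block_nr = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr = frames_per_block * block_nr;
    return 0;
}
```

With the example values above (block size 4096, frame size 2048, 4 blocks)
this yields tp_frame_nr = 8.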
241 | ||
1da177e4 LT |
242 | -------------------------------------------------------------------------------- |
243 | + PACKET_MMAP setting constraints | |
244 | -------------------------------------------------------------------------------- | |
245 | ||
246 | In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch), | |
247 | the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or | |
248 | 16384 in a 64 bit architecture. For information on these kernel versions | |
249 | see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt | |
250 | ||
251 | Block size limit | |
252 | ------------------ | |
253 | ||
254 | As stated earlier, each block is a contiguous physical region of memory. These | |
255 | memory regions are allocated with calls to the __get_free_pages() function. As | |
256 | the name indicates, this function allocates pages of memory, and the second | |
257 | argument is "order" or a power of two number of pages, that is | |
258 | (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, | |
259 | order=2 ==> 16384 bytes, etc. The maximum size of a | |
260 | region allocated by __get_free_pages is determined by the MAX_ORDER macro. More | |
261 | precisely the limit can be calculated as: | |
262 | ||
263 | PAGE_SIZE << MAX_ORDER | |
264 | ||
265 | In an i386 architecture PAGE_SIZE is 4096 bytes | |
266 | In a 2.4/i386 kernel MAX_ORDER is 10 | |
267 | In a 2.6/i386 kernel MAX_ORDER is 11 | |
268 | ||
269 | So __get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel | |
270 | respectively, with an i386 architecture. | |
271 | ||
272 | User space programs can include /usr/include/sys/user.h and | |
273 | /usr/include/linux/mmzone.h to get PAGE_SIZE and MAX_ORDER declarations. | |
274 | ||
275 | The pagesize can also be determined dynamically with the getpagesize (2) | |
276 | system call. | |
277 | ||
1da177e4 LT |
278 | Block number limit |
279 | -------------------- | |
280 | ||
281 | To understand the constraints of PACKET_MMAP, we have to see the structure | |
282 | used to hold the pointers to each block. | |
283 | ||
284 | Currently, this structure is a vector called pg_vec, dynamically allocated | |
285 | with kmalloc; its size limits the number of blocks that can be allocated. | |
286 | ||
287 | +---+---+---+---+ | |
288 | | x | x | x | x | | |
289 | +---+---+---+---+ | |
290 | | | | | | |
291 | | | | v | |
292 | | | v block #4 | |
293 | | v block #3 | |
294 | v block #2 | |
295 | block #1 | |
296 | ||
2fe0ae78 ML |
297 | kmalloc allocates any number of bytes of physically contiguous memory from |
298 | a pool of pre-determined sizes. This pool of memory is maintained by the slab | |
c30fe7f7 UZ |
299 | allocator, which is ultimately responsible for doing the allocation and |
300 | hence imposes the maximum memory that kmalloc can allocate. | |
1da177e4 LT |
301 | |
302 | In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The | |
303 | predetermined sizes that kmalloc uses can be checked in the "size-<bytes>" | |
304 | entries of /proc/slabinfo | |
305 | ||
306 | In a 32 bit architecture, pointers are 4 bytes long, so the total number of | |
307 | pointers to blocks is | |
308 | ||
309 | 131072/4 = 32768 blocks | |
310 | ||
1da177e4 LT |
311 | PACKET_MMAP buffer size calculator |
312 | ------------------------------------ | |
313 | ||
314 | Definitions: | |
315 | ||
316 | <size-max> : is the maximum size allocable with kmalloc (see /proc/slabinfo) | |
317 | <pointer size>: depends on the architecture -- sizeof(void *) | |
318 | <page size> : depends on the architecture -- PAGE_SIZE or getpagesize (2) | |
319 | <max-order> : is the value defined with MAX_ORDER | |
320 | <frame size> : it's an upper bound of frame's capture size (more on this later) | |
321 | ||
322 | from these definitions we will derive | |
323 | ||
324 | <block number> = <size-max>/<pointer size> | |
325 | <block size> = <pagesize> << <max-order> | |
326 | ||
327 | so, the max buffer size is | |
328 | ||
329 | <block number> * <block size> | |
330 | ||
331 | and the number of frames is | |
332 | ||
333 | <block number> * <block size> / <frame size> | |
334 | ||
2e150f6e | 335 | Suppose the following parameters, which apply for 2.6 kernel and an |
1da177e4 LT |
336 | i386 architecture: |
337 | ||
338 | <size-max> = 131072 bytes | |
339 | <pointer size> = 4 bytes | |
340 | <pagesize> = 4096 bytes | |
341 | <max-order> = 11 | |
342 | ||
6c28f2c0 | 343 | and a value for <frame size> of 2048 bytes. These parameters will yield |
1da177e4 LT |
344 | |
345 | <block number> = 131072/4 = 32768 blocks | |
346 | <block size> = 4096 << 11 = 8 MiB. | |
347 | ||
348 | and hence the buffer will have a 262144 MiB size. So it can hold | |
349 | 262144 MiB / 2048 bytes = 134217728 frames | |
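
The arithmetic above can be reproduced with a few lines of C. This is a
sketch of the calculator only; struct ring_limits and ring_limits_calc are
hypothetical names, and the inputs are whatever values apply to the kernel
and architecture in question.

```c
#include <stdint.h>

struct ring_limits {
    uint64_t block_number;  /* <size-max> / <pointer size> */
    uint64_t block_size;    /* <pagesize> << <max-order>, in bytes */
    uint64_t buffer_size;   /* <block number> * <block size>, in bytes */
    uint64_t frames;        /* buffer_size / <frame size> */
};

/* Apply the formulas from the definitions above; divisors must be non-zero. */
struct ring_limits ring_limits_calc(uint64_t size_max, uint64_t pointer_size,
                                    uint64_t page_size, unsigned int max_order,
                                    uint64_t frame_size)
{
    struct ring_limits l;

    l.block_number = size_max / pointer_size;
    l.block_size = page_size << max_order;
    l.buffer_size = l.block_number * l.block_size;
    l.frames = l.buffer_size / frame_size;
    return l;
}
```

Calling ring_limits_calc(131072, 4, 4096, 11, 2048) gives 32768 blocks of
8 MiB each, a 262144 MiB buffer, and 134217728 frames, matching the figures
above.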
350 | ||
1da177e4 LT |
351 | Actually, this buffer size is not possible with an i386 architecture. |
352 | Remember that the memory is allocated in kernel space; on i386, | |
353 | the kernel's memory size is limited to 1GiB. | |
354 | ||
355 | No memory allocation is freed until the socket is closed. The memory | |
356 | allocations are done with GFP_KERNEL priority, which basically means that | |
357 | the allocation can wait and swap other processes' memory in order to allocate | |
992caacf | 358 | the necessary memory, so normally limits can be reached. |
1da177e4 LT |
359 | |
360 | Other constraints | |
361 | ------------------- | |
362 | ||
363 | If you check the source code you will see that what I draw here as a frame | |
5d3f083d | 364 | is not only the link level frame. At the beginning of each frame there is a |
1da177e4 LT |
365 | header called struct tpacket_hdr, used in PACKET_MMAP to hold the link level frame's |
366 | meta information, such as the timestamp. So what is drawn here as a frame is really | |
367 | the following (from include/linux/if_packet.h): | |
368 | ||
369 | /* | |
370 | Frame structure: | |
371 | ||
372 | - Start. Frame must be aligned to TPACKET_ALIGNMENT=16 | |
373 | - struct tpacket_hdr | |
374 | - pad to TPACKET_ALIGNMENT=16 | |
375 | - struct sockaddr_ll | |
3f6dee9b | 376 | - Gap, chosen so that packet data (Start+tp_net) aligns to |
1da177e4 LT |
377 | TPACKET_ALIGNMENT=16 |
378 | - Start+tp_mac: [ Optional MAC header ] | |
379 | - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. | |
380 | - Pad to align to TPACKET_ALIGNMENT=16 | |
381 | */ | |
1da177e4 LT |
382 | |
383 | The following are conditions that are checked in packet_set_ring | |
384 | ||
385 | tp_block_size must be a multiple of PAGE_SIZE (1) | |
386 | tp_frame_size must be greater than TPACKET_HDRLEN (obvious) | |
387 | tp_frame_size must be a multiple of TPACKET_ALIGNMENT | |
388 | tp_frame_nr must be exactly frames_per_block*tp_block_nr | |
389 | ||
6c28f2c0 | 390 | Note that tp_block_size should be chosen to be a power of two or there will |
1da177e4 LT |
391 | be a waste of memory. |
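
The conditions listed above can be checked in user space before calling
setsockopt(). The sketch below mirrors the list as written in this document
for TPACKET_V1; it is not a copy of the kernel's packet_set_ring, and
ring_req_is_valid is a hypothetical name. The page size is passed in by the
caller (for example from getpagesize()).

```c
#include <stdbool.h>
#include <linux/if_packet.h>

/* Return true if req satisfies the conditions listed in the text. */
bool ring_req_is_valid(const struct tpacket_req *req, unsigned int page_size)
{
    unsigned int frames_per_block;

    /* tp_block_size must be a non-zero multiple of the page size. */
    if (page_size == 0 || req->tp_block_size == 0 ||
        req->tp_block_size % page_size != 0)
        return false;

    /* tp_frame_size must be greater than TPACKET_HDRLEN. */
    if (req->tp_frame_size <= TPACKET_HDRLEN)
        return false;

    /* tp_frame_size must be a multiple of TPACKET_ALIGNMENT. */
    if (req->tp_frame_size % TPACKET_ALIGNMENT != 0)
        return false;

    /* At least one frame must fit in a block. */
    frames_per_block = req->tp_block_size / req->tp_frame_size;
    if (frames_per_block == 0)
        return false;

    /* tp_frame_nr must be exactly frames_per_block * tp_block_nr. */
    return req->tp_frame_nr == frames_per_block * req->tp_block_nr;
}
```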
392 | ||
393 | -------------------------------------------------------------------------------- | |
6c28f2c0 | 394 | + Mapping and use of the circular buffer (ring) |
1da177e4 LT |
395 | -------------------------------------------------------------------------------- |
396 | ||
6c28f2c0 | 397 | The mapping of the buffer in the user process is done with the conventional |
1da177e4 LT |
398 | mmap function. Even though the circular buffer is composed of several physically |
399 | discontiguous blocks of memory, they are contiguous in user space, hence | |
400 | just one call to mmap is needed: | |
401 | ||
402 | mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); | |
403 | ||
404 | If tp_frame_size is a divisor of tp_block_size, frames will be | |
d9195881 | 405 | contiguously spaced by tp_frame_size bytes. If not, after every |
1da177e4 LT |
406 | tp_block_size/tp_frame_size frames there will be a gap between |
407 | the frames. This is because a frame cannot span two | |
408 | blocks. | |
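
The spacing rule above translates into a simple offset computation that also
handles the case where tp_frame_size does not divide tp_block_size. This is
a sketch; frame_offset is a hypothetical helper, and frame indices are
counted from 0 here (the diagrams earlier count from 1).

```c
#include <stddef.h>

/*
 * Byte offset, from the start of the mapped ring, of the frame with the
 * given 0-based index. frame_size must be non-zero and no larger than
 * block_size.
 */
size_t frame_offset(size_t block_size, size_t frame_size, size_t index)
{
    size_t frames_per_block = block_size / frame_size;
    size_t block = index / frames_per_block;   /* which block holds it */
    size_t slot = index % frames_per_block;    /* position inside the block */

    return block * block_size + slot * frame_size;
}
```

For example, with a block size of 4096 and a frame size of 1536, each block
holds two frames, so frame 2 starts at offset 4096 and the last 1024 bytes of
every block are unused.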
409 | ||
7e11daa7 NB |
410 | To use one socket for capture and transmission, the mapping of both the |
411 | RX and TX buffer ring has to be done with one call to mmap: | |
412 | ||
413 | ... | |
414 | setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo)); | |
415 | setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar)); | |
416 | ... | |
417 | rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); | |
418 | tx_ring = rx_ring + size; | |
419 | ||
420 | RX must be the first as the kernel maps the TX ring memory right | |
421 | after the RX one. | |
422 | ||
1da177e4 LT |
423 | At the beginning of each frame there is a status field (see |
424 | struct tpacket_hdr). If this field is 0, the frame is ready | |
425 | to be used by the kernel. If not, there is a frame the user can read | |
426 | and the following flags apply: | |
427 | ||
69e3c75f | 428 | +++ Capture process: |
1da177e4 LT |
429 | from include/linux/if_packet.h |
430 | ||
682f048b AD |
431 | #define TP_STATUS_COPY (1 << 1) |
432 | #define TP_STATUS_LOSING (1 << 2) | |
433 | #define TP_STATUS_CSUMNOTREADY (1 << 3) | |
434 | #define TP_STATUS_CSUM_VALID (1 << 7) | |
1da177e4 | 435 | |
1da177e4 LT |
436 | TP_STATUS_COPY : This flag indicates that the frame (and associated |
437 | meta information) has been truncated because it's | |
438 | larger than tp_frame_size. This packet can be | |
439 | read entirely with recvfrom(). | |
440 | ||
441 | In order to make this work it must be | |
442 | enabled previously with setsockopt() and | |
443 | the PACKET_COPY_THRESH option. | |
444 | ||
a93c1256 | 445 | The number of frames that can be buffered to |
1da177e4 LT |
446 | be read with recvfrom is limited like a normal socket. |
447 | See the SO_RCVBUF option in the socket (7) man page. | |
448 | ||
449 | TP_STATUS_LOSING : indicates there were packet drops since the last time | |
450 | statistics were checked with getsockopt() and | |
451 | the PACKET_STATISTICS option. | |
452 | ||
c30fe7f7 | 453 | TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets |
a33f3224 | 454 | whose checksum will be done in hardware. So while |
1da177e4 LT |
455 | reading the packet we should not try to check the |
456 | checksum. | |
457 | ||
682f048b AD |
458 | TP_STATUS_CSUM_VALID : This flag indicates that at least the transport |
459 | header checksum of the packet has been already | |
460 | validated on the kernel side. If the flag is not set | |
461 | then we are free to check the checksum by ourselves | |
462 | provided that TP_STATUS_CSUMNOTREADY is also not set. | |
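
The interaction between the two checksum flags can be expressed as a single
predicate. This is a sketch of the rule as stated above;
should_verify_checksum is a hypothetical name, and the fallback define uses
the value quoted in this document for headers that lack the flag.

```c
#include <stdbool.h>
#include <linux/if_packet.h>

#ifndef TP_STATUS_CSUM_VALID
# define TP_STATUS_CSUM_VALID (1 << 7)  /* value as quoted in this document */
#endif

/* Return true if user space may usefully verify the checksum itself. */
bool should_verify_checksum(unsigned long tp_status)
{
    if (tp_status & TP_STATUS_CSUM_VALID)
        return false;   /* already validated on the kernel side */
    if (tp_status & TP_STATUS_CSUMNOTREADY)
        return false;   /* checksum will be done in hardware */
    return true;
}
```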
463 | ||
1da177e4 LT |
464 | For convenience there are also the following defines: |
465 | ||
466 | #define TP_STATUS_KERNEL 0 | |
467 | #define TP_STATUS_USER 1 | |
468 | ||
469 | The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel | |
470 | receives a packet, it puts it in the buffer and updates the status with | |
471 | at least the TP_STATUS_USER flag. Then the user can read the packet. | |
472 | Once the packet is read, the user must zero the status field so the kernel | |
473 | can use that frame buffer again. | |
474 | ||
475 | The user can use poll (any other variant should apply too) to check if new | |
476 | packets are in the ring: | |
477 | ||
478 | struct pollfd pfd; | |
479 | ||
480 | pfd.fd = fd; | |
481 | pfd.revents = 0; | |
482 | pfd.events = POLLIN|POLLRDNORM|POLLERR; | |
483 | ||
484 | if (status == TP_STATUS_KERNEL) | |
485 | retval = poll(&pfd, 1, timeout); | |
486 | ||
487 | Checking the status value first and then polling for frames does not | |
488 | incur a race condition. | |
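
The ownership handshake described above can be sketched as a loop over the
ring. The sketch assumes TPACKET_V1 and that tp_frame_size divides
tp_block_size, so frames are evenly spaced; drain_rx_ring is a hypothetical
name. Packet processing and the memory barriers a production reader would
use are deliberately left out.

```c
#include <stddef.h>
#include <linux/if_packet.h>

/*
 * Consume every frame currently owned by user space, starting at *next,
 * and hand each one back to the kernel. Returns the number of frames
 * consumed and leaves *next at the first frame still owned by the kernel.
 */
unsigned int drain_rx_ring(unsigned char *ring, size_t frame_size,
                           unsigned int frame_nr, unsigned int *next)
{
    unsigned int consumed = 0;

    while (consumed < frame_nr) {
        volatile struct tpacket_hdr *hdr =
            (volatile struct tpacket_hdr *)(ring + (size_t)*next * frame_size);

        if (!(hdr->tp_status & TP_STATUS_USER))
            break;  /* frame still belongs to the kernel */

        /* A real reader would process the packet here, before release. */

        hdr->tp_status = TP_STATUS_KERNEL;  /* give the frame back */
        *next = (*next + 1) % frame_nr;
        consumed++;
    }
    return consumed;
}
```

When the function returns 0, the caller would typically block in poll() as
shown above and then call it again.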
489 | ||
69e3c75f JB |
490 | ++ Transmission process |
491 | Those defines are also used for transmission: | |
492 | ||
493 | #define TP_STATUS_AVAILABLE 0 // Frame is available | |
494 | #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send() | |
495 | #define TP_STATUS_SENDING 2 // Frame is currently in transmission | |
496 | #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct | |
497 | ||
498 | First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a | |
499 | packet, the user fills a data buffer of an available frame, sets tp_len to | |
500 | current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST. | |
501 | This can be done on multiple frames. Once the user is ready to transmit, it | |
502 | calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are | |
503 | forwarded to the network device. The kernel updates each status of sent | |
504 | frames with TP_STATUS_SENDING until the end of transfer. | |
505 | At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE. | |
506 | ||
507 | header->tp_len = in_i_size; | |
508 | header->tp_status = TP_STATUS_SEND_REQUEST; | |
509 | retval = send(this->socket, NULL, 0, 0); | |
510 | ||
511 | The user can also use poll() to check if a buffer is available: | |
512 | (status == TP_STATUS_SENDING) | |
513 | ||
514 | struct pollfd pfd; | |
515 | pfd.fd = fd; | |
516 | pfd.revents = 0; | |
517 | pfd.events = POLLOUT; | |
518 | retval = poll(&pfd, 1, timeout); | |
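
Filling one TX frame, as described above, could look like the following
sketch. It assumes TPACKET_V1 and the default data offset (no
PACKET_TX_HAS_OFF); queue_tx_frame is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>
#include <linux/if_packet.h>

/*
 * Copy len bytes into one TX frame and mark it for transmission.
 * Returns 0 on success, -1 if the frame is not available or the data
 * does not fit in the frame.
 */
int queue_tx_frame(void *frame, size_t frame_size, const void *data, size_t len)
{
    struct tpacket_hdr *hdr = frame;
    size_t offset = TPACKET_HDRLEN - sizeof(struct sockaddr_ll);

    if (hdr->tp_status != TP_STATUS_AVAILABLE)
        return -1;
    if (frame_size < offset || len > frame_size - offset)
        return -1;

    memcpy((unsigned char *)frame + offset, data, len);
    hdr->tp_len = (unsigned int)len;
    hdr->tp_status = TP_STATUS_SEND_REQUEST;  /* picked up by the next send() */
    return 0;
}
```

After queueing one or more frames this way, a single send(fd, NULL, 0, 0)
asks the kernel to transmit all of them.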
519 | ||
d1ee40f9 DB |
520 | ------------------------------------------------------------------------------- |
521 | + What TPACKET versions are available and when to use them? | |
522 | ------------------------------------------------------------------------------- | |
523 | ||
524 | int val = tpacket_version; socklen_t len = sizeof(val); | |
525 | setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); | |
526 | getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &len); | |
527 | ||
528 | where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3. | |
529 | ||
530 | TPACKET_V1: | |
531 | - Default if not otherwise specified by setsockopt(2) | |
532 | - RX_RING, TX_RING available | |
d1ee40f9 DB |
533 | |
534 | TPACKET_V1 --> TPACKET_V2: | |
535 | - Made 64 bit clean due to unsigned long usage in TPACKET_V1 | |
536 | structures, thus this also works on 64 bit kernel with 32 bit | |
537 | userspace and the like | |
538 | - Timestamp resolution in nanoseconds instead of microseconds | |
539 | - RX_RING, TX_RING available | |
ac7686b9 AW |
540 | - VLAN metadata information available for packets |
541 | (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID), | |
542 | in the tpacket2_hdr structure: | |
543 | - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates | |
544 | that the tp_vlan_tci field has valid VLAN TCI value | |
545 | - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field | |
546 | indicates that the tp_vlan_tpid field has valid VLAN TPID value | |
d1ee40f9 DB |
547 | - How to switch to TPACKET_V2: |
548 | 1. Replace struct tpacket_hdr by struct tpacket2_hdr | |
549 | 2. Query header len and save | |
550 | 3. Set protocol version to 2, set up ring as usual | |
551 | 4. For getting the sockaddr_ll, | |
552 | use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of | |
553 | (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) | |
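
Steps 2 and 3 above can be sketched as follows. The helper name
use_tpacket_v2 is hypothetical. The sketch assumes it is called before the
ring is set up, and that PACKET_HDRLEN takes the requested version as input
and returns the matching header length in the same variable.

```c
#include <sys/socket.h>
#include <linux/if_packet.h>

/*
 * Query the TPACKET_V2 header length and switch the socket to TPACKET_V2.
 * Returns the header length on success, -1 on failure.
 */
int use_tpacket_v2(int fd)
{
    int val = TPACKET_V2;
    socklen_t len = sizeof(val);
    int hdrlen;

    /* Step 2: query the header length for this version and save it. */
    if (getsockopt(fd, SOL_PACKET, PACKET_HDRLEN, &val, &len) < 0)
        return -1;
    hdrlen = val;

    /* Step 3: select protocol version 2; the ring is set up afterwards. */
    val = TPACKET_V2;
    if (setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)) < 0)
        return -1;

    return hdrlen;
}
```

The returned value is the hdrlen used in step 4 to locate the sockaddr_ll.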
554 | ||
555 | TPACKET_V2 --> TPACKET_V3: | |
7f953ab2 | 556 | - Flexible buffer implementation for RX_RING: |
d1ee40f9 DB |
557 | 1. Blocks can be configured with non-static frame-size |
558 | 2. Read/poll is at a block-level (as opposed to packet-level) | |
559 | 3. Added poll timeout to avoid indefinite user-space wait | |
560 | on idle links | |
561 | 4. Added user-configurable knobs: | |
562 | 4.1 block::timeout | |
563 | 4.2 tpkt_hdr::sk_rxhash | |
564 | - RX Hash data available in user space | |
7f953ab2 SV |
565 | - TX_RING semantics are conceptually similar to TPACKET_V2; |
566 | use tpacket3_hdr instead of tpacket2_hdr, and TPACKET3_HDRLEN | |
567 | instead of TPACKET2_HDRLEN. In the current implementation, | |
568 | the tp_next_offset field in the tpacket3_hdr MUST be set to | |
569 | zero, indicating that the ring does not hold variable sized frames. | |
570 | Packets with non-zero values of tp_next_offset will be dropped. | |
d1ee40f9 DB |
571 | |
572 | ------------------------------------------------------------------------------- | |
573 | + AF_PACKET fanout mode | |
574 | ------------------------------------------------------------------------------- | |
575 | ||
576 | In the AF_PACKET fanout mode, packet reception can be load balanced among | |
577 | processes. This also works in combination with mmap(2) on packet sockets. | |
578 | ||
7ec06da8 DB |
579 | Currently implemented fanout policies are: |
580 | ||
b0db5cdf | 581 | - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash |
7ec06da8 DB |
582 | - PACKET_FANOUT_LB: schedule to socket by round-robin |
583 | - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on | |
584 | - PACKET_FANOUT_RND: schedule to socket by random selection | |
585 | - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another | |
bb9fbe2d | 586 | - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping |
7ec06da8 | 587 | |
d1ee40f9 DB |
588 | Minimal example code by David S. Miller (try things like "./test eth0 hash", |
589 | "./test eth0 lb", etc.): | |
590 | ||
591 | #include <stddef.h> | |
592 | #include <stdlib.h> | |
593 | #include <stdio.h> | |
594 | #include <string.h> | |
595 | ||
596 | #include <sys/types.h> | |
597 | #include <sys/wait.h> | |
598 | #include <sys/socket.h> | |
599 | #include <sys/ioctl.h> | |
600 | ||
601 | #include <unistd.h> | |
602 | ||
603 | #include <linux/if_ether.h> | |
604 | #include <linux/if_packet.h> | |
605 | ||
606 | #include <net/if.h> | |
607 | ||
608 | static const char *device_name; | |
609 | static int fanout_type; | |
610 | static int fanout_id; | |
611 | ||
612 | #ifndef PACKET_FANOUT | |
613 | # define PACKET_FANOUT 18 | |
614 | # define PACKET_FANOUT_HASH 0 | |
615 | # define PACKET_FANOUT_LB 1 | |
616 | #endif | |
617 | ||
618 | static int setup_socket(void) | |
619 | { | |
620 | int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP)); | |
621 | struct sockaddr_ll ll; | |
622 | struct ifreq ifr; | |
623 | int fanout_arg; | |
624 | ||
625 | if (fd < 0) { | |
626 | perror("socket"); | |
627 | return -1; | |
628 | } | |
629 | ||
630 | memset(&ifr, 0, sizeof(ifr)); | |
631 | strcpy(ifr.ifr_name, device_name); | |
632 | err = ioctl(fd, SIOCGIFINDEX, &ifr); | |
633 | if (err < 0) { | |
634 | perror("SIOCGIFINDEX"); | |
635 | return -1; | |
636 | } | |
637 | ||
638 | memset(&ll, 0, sizeof(ll)); | |
639 | ll.sll_family = AF_PACKET; | |
640 | ll.sll_ifindex = ifr.ifr_ifindex; | |
641 | err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); | |
642 | if (err < 0) { | |
643 | perror("bind"); | |
644 | return -1; | |
645 | } | |
646 | ||
647 | fanout_arg = (fanout_id | (fanout_type << 16)); | |
648 | err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT, | |
649 | &fanout_arg, sizeof(fanout_arg)); | |
650 | if (err) { | |
651 | perror("setsockopt"); | |
652 | return -1; | |
653 | } | |
654 | ||
655 | return fd; | |
656 | } | |
657 | ||
658 | static void fanout_thread(void) | |
659 | { | |
660 | int fd = setup_socket(); | |
661 | int limit = 10000; | |
662 | ||
663 | if (fd < 0) | |
664 | exit(EXIT_FAILURE); | |
665 | ||
666 | while (limit-- > 0) { | |
667 | char buf[1600]; | |
668 | int err; | |
669 | ||
670 | err = read(fd, buf, sizeof(buf)); | |
671 | if (err < 0) { | |
672 | perror("read"); | |
673 | exit(EXIT_FAILURE); | |
674 | } | |
675 | if ((limit % 10) == 0) | |
676 | fprintf(stdout, "(%d) \n", getpid()); | |
677 | } | |
678 | ||
679 | fprintf(stdout, "%d: Received 10000 packets\n", getpid()); | |
680 | ||
681 | close(fd); | |
682 | exit(0); | |
683 | } | |
684 | ||
685 | int main(int argc, char **argp) | |
686 | { | |
687 | int fd, err; | |
688 | int i; | |
689 | ||
690 | if (argc != 3) { | |
691 | fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]); | |
692 | return EXIT_FAILURE; | |
693 | } | |
694 | ||
695 | if (!strcmp(argp[2], "hash")) | |
696 | fanout_type = PACKET_FANOUT_HASH; | |
697 | else if (!strcmp(argp[2], "lb")) | |
698 | fanout_type = PACKET_FANOUT_LB; | |
699 | else { | |
700 | fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]); | |
701 | exit(EXIT_FAILURE); | |
702 | } | |
703 | ||
704 | device_name = argp[1]; | |
705 | fanout_id = getpid() & 0xffff; | |
706 | ||
707 | for (i = 0; i < 4; i++) { | |
708 | pid_t pid = fork(); | |
709 | ||
710 | switch (pid) { | |
711 | case 0: | |
712 | fanout_thread(); | |
713 | ||
714 | case -1: | |
715 | perror("fork"); | |
716 | exit(EXIT_FAILURE); | |
717 | } | |
718 | } | |
719 | ||
720 | for (i = 0; i < 4; i++) { | |
721 | int status; | |
722 | ||
723 | wait(&status); | |
724 | } | |
725 | ||
726 | return 0; | |
727 | } | |
728 | ||
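The fanout_arg value built in setup_socket() above carries the fanout group id
in its low 16 bits and the fanout mode in its high 16 bits. A minimal sketch of
that packing follows; the helper names fanout_pack(), fanout_group_id() and
fanout_mode() are hypothetical and not part of any kernel header, and the
fallback defines simply mirror the ones in the example above:

```c
#include <assert.h>
#include <stdint.h>

/* Fallback values, identical to the ones in the example above. */
#ifndef PACKET_FANOUT_HASH
# define PACKET_FANOUT_HASH	0
#endif
#ifndef PACKET_FANOUT_LB
# define PACKET_FANOUT_LB	1
#endif

/* Hypothetical helper: combine group id (low 16 bits) and mode (high 16 bits). */
static inline int fanout_pack(int group_id, int mode)
{
	assert(group_id >= 0 && group_id <= 0xffff);
	assert(mode >= 0 && mode <= 0xffff);

	return (int) ((uint32_t) group_id | ((uint32_t) mode << 16));
}

/* Hypothetical helper: recover the group id from a packed argument. */
static inline int fanout_group_id(int arg)
{
	return (int) ((uint32_t) arg & 0xffff);
}

/* Hypothetical helper: recover the fanout mode from a packed argument. */
static inline int fanout_mode(int arg)
{
	return (int) (((uint32_t) arg >> 16) & 0xffff);
}
```

All sockets that should share one fanout group must pass the same group id and
the same mode, which is why the example derives the id once in main() before
forking.
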
4eb06148 DB |
729 | ------------------------------------------------------------------------------- |
730 | + AF_PACKET TPACKET_V3 example | |
731 | ------------------------------------------------------------------------------- | |
732 | ||
733 | AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame | |
734 | sizes by doing its own memory management. It is based on blocks where polling | |
735 | works on a per-block basis instead of per-ring as in TPACKET_V2 and its predecessor. | |
736 | ||
737 | It is said that TPACKET_V3 brings the following benefits: | |
738 | *) ~15 - 20% reduction in CPU-usage | |
739 | *) ~20% increase in packet capture rate | |
740 | *) ~2x increase in packet density | |
741 | *) Port aggregation analysis | |
742 | *) Non static frame size to capture entire packet payload | |
743 | ||
744 | So it seems to be a good candidate to be used with packet fanout. | |
745 | ||
746 | Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile | |
747 | it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.): | |
748 | ||
d70a3f88 DB |
749 | /* Written from scratch, but kernel-to-user space API usage |
750 | * dissected from lolpcap: | |
751 | * Copyright 2011, Chetan Loke <loke.chetan@gmail.com> | |
752 | * License: GPL, version 2.0 | |
753 | */ | |
754 | ||
4eb06148 DB |
755 | #include <stdio.h> |
756 | #include <stdlib.h> | |
757 | #include <stdint.h> | |
758 | #include <string.h> | |
759 | #include <assert.h> | |
760 | #include <net/if.h> | |
761 | #include <arpa/inet.h> | |
762 | #include <netdb.h> | |
763 | #include <poll.h> | |
764 | #include <unistd.h> | |
765 | #include <signal.h> | |
766 | #include <inttypes.h> | |
767 | #include <sys/socket.h> | |
768 | #include <sys/mman.h> | |
769 | #include <linux/if_packet.h> | |
770 | #include <linux/if_ether.h> | |
771 | #include <linux/ip.h> | |
772 | ||
4eb06148 DB |
773 | #ifndef likely |
774 | # define likely(x) __builtin_expect(!!(x), 1) | |
775 | #endif | |
776 | #ifndef unlikely | |
777 | # define unlikely(x) __builtin_expect(!!(x), 0) | |
778 | #endif | |
779 | ||
780 | struct block_desc { | |
781 | uint32_t version; | |
782 | uint32_t offset_to_priv; | |
783 | struct tpacket_hdr_v1 h1; | |
784 | }; | |
785 | ||
786 | struct ring { | |
787 | struct iovec *rd; | |
788 | uint8_t *map; | |
789 | struct tpacket_req3 req; | |
790 | }; | |
791 | ||
792 | static unsigned long packets_total = 0, bytes_total = 0; | |
793 | static volatile sig_atomic_t sigint = 0; | |
794 | ||
d70a3f88 | 795 | static void sighandler(int num) |
4eb06148 DB |
796 | { |
797 | sigint = 1; | |
798 | } | |
799 | ||
800 | static int setup_socket(struct ring *ring, char *netdev) | |
801 | { | |
802 | int err, i, fd, v = TPACKET_V3; | |
803 | struct sockaddr_ll ll; | |
d70a3f88 DB |
804 | unsigned int blocksiz = 1 << 22, framesiz = 1 << 11; |
805 | unsigned int blocknum = 64; | |
4eb06148 DB |
806 | |
807 | fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); | |
808 | if (fd < 0) { | |
809 | perror("socket"); | |
810 | exit(1); | |
811 | } | |
812 | ||
813 | err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v)); | |
814 | if (err < 0) { | |
815 | perror("setsockopt"); | |
816 | exit(1); | |
817 | } | |
818 | ||
819 | memset(&ring->req, 0, sizeof(ring->req)); | |
d70a3f88 DB |
820 | ring->req.tp_block_size = blocksiz; |
821 | ring->req.tp_frame_size = framesiz; | |
822 | ring->req.tp_block_nr = blocknum; | |
823 | ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz; | |
824 | ring->req.tp_retire_blk_tov = 60; | |
825 | ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH; | |
4eb06148 DB |
826 | |
827 | err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req, | |
828 | sizeof(ring->req)); | |
829 | if (err < 0) { | |
830 | perror("setsockopt"); | |
831 | exit(1); | |
832 | } | |
833 | ||
834 | ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr, | |
d70a3f88 | 835 | PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0); |
4eb06148 DB |
836 | if (ring->map == MAP_FAILED) { |
837 | perror("mmap"); | |
838 | exit(1); | |
839 | } | |
840 | ||
841 | ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd)); | |
842 | assert(ring->rd); | |
843 | for (i = 0; i < ring->req.tp_block_nr; ++i) { | |
844 | ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size); | |
845 | ring->rd[i].iov_len = ring->req.tp_block_size; | |
846 | } | |
847 | ||
848 | memset(&ll, 0, sizeof(ll)); | |
849 | ll.sll_family = PF_PACKET; | |
850 | ll.sll_protocol = htons(ETH_P_ALL); | |
851 | ll.sll_ifindex = if_nametoindex(netdev); | |
852 | ll.sll_hatype = 0; | |
853 | ll.sll_pkttype = 0; | |
854 | ll.sll_halen = 0; | |
855 | ||
856 | err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); | |
857 | if (err < 0) { | |
858 | perror("bind"); | |
859 | exit(1); | |
860 | } | |
861 | ||
862 | return fd; | |
863 | } | |
864 | ||
4eb06148 DB |
865 | static void display(struct tpacket3_hdr *ppd) |
866 | { | |
867 | struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac); | |
868 | struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN); | |
869 | ||
870 | if (eth->h_proto == htons(ETH_P_IP)) { | |
871 | struct sockaddr_in ss, sd; | |
872 | char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST]; | |
873 | ||
874 | memset(&ss, 0, sizeof(ss)); | |
875 | ss.sin_family = PF_INET; | |
876 | ss.sin_addr.s_addr = ip->saddr; | |
877 | getnameinfo((struct sockaddr *) &ss, sizeof(ss), | |
878 | sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST); | |
879 | ||
880 | memset(&sd, 0, sizeof(sd)); | |
881 | sd.sin_family = PF_INET; | |
882 | sd.sin_addr.s_addr = ip->daddr; | |
883 | getnameinfo((struct sockaddr *) &sd, sizeof(sd), | |
884 | dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST); | |
885 | ||
886 | printf("%s -> %s, ", sbuff, dbuff); | |
887 | } | |
888 | ||
889 | printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash); | |
890 | } | |
891 | ||
892 | static void walk_block(struct block_desc *pbd, const int block_num) | |
893 | { | |
d70a3f88 | 894 | int num_pkts = pbd->h1.num_pkts, i; |
4eb06148 | 895 | unsigned long bytes = 0; |
4eb06148 DB |
896 | struct tpacket3_hdr *ppd; |
897 | ||
d70a3f88 DB |
898 | ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + |
899 | pbd->h1.offset_to_first_pkt); | |
4eb06148 DB |
900 | for (i = 0; i < num_pkts; ++i) { |
901 | bytes += ppd->tp_snaplen; | |
4eb06148 DB |
902 | display(ppd); |
903 | ||
d70a3f88 DB |
904 | ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + |
905 | ppd->tp_next_offset); | |
4eb06148 DB |
906 | } |
907 | ||
4eb06148 DB |
908 | packets_total += num_pkts; |
909 | bytes_total += bytes; | |
910 | } | |
911 | ||
d70a3f88 | 912 | static void flush_block(struct block_desc *pbd) |
4eb06148 | 913 | { |
d70a3f88 | 914 | pbd->h1.block_status = TP_STATUS_KERNEL; |
4eb06148 DB |
915 | } |
916 | ||
917 | static void teardown_socket(struct ring *ring, int fd) | |
918 | { | |
919 | munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr); | |
920 | free(ring->rd); | |
921 | close(fd); | |
922 | } | |
923 | ||
924 | int main(int argc, char **argp) | |
925 | { | |
926 | int fd, err; | |
927 | socklen_t len; | |
928 | struct ring ring; | |
929 | struct pollfd pfd; | |
d70a3f88 | 930 | unsigned int block_num = 0, blocks = 64; |
4eb06148 DB |
931 | struct block_desc *pbd; |
932 | struct tpacket_stats_v3 stats; | |
933 | ||
934 | if (argc != 2) { | |
935 | fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]); | |
936 | return EXIT_FAILURE; | |
937 | } | |
938 | ||
939 | signal(SIGINT, sighandler); | |
940 | ||
941 | memset(&ring, 0, sizeof(ring)); | |
942 | fd = setup_socket(&ring, argp[argc - 1]); | |
943 | assert(fd > 0); | |
944 | ||
945 | memset(&pfd, 0, sizeof(pfd)); | |
946 | pfd.fd = fd; | |
947 | pfd.events = POLLIN | POLLERR; | |
948 | pfd.revents = 0; | |
949 | ||
950 | while (likely(!sigint)) { | |
951 | pbd = (struct block_desc *) ring.rd[block_num].iov_base; | |
d70a3f88 DB |
952 | |
953 | if ((pbd->h1.block_status & TP_STATUS_USER) == 0) { | |
4eb06148 | 954 | poll(&pfd, 1, -1); |
d70a3f88 | 955 | continue; |
4eb06148 DB |
956 | } |
957 | ||
958 | walk_block(pbd, block_num); | |
959 | flush_block(pbd); | |
d70a3f88 | 960 | block_num = (block_num + 1) % blocks; |
4eb06148 DB |
961 | } |
962 | ||
963 | len = sizeof(stats); | |
964 | err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len); | |
965 | if (err < 0) { | |
966 | perror("getsockopt"); | |
967 | exit(1); | |
968 | } | |
969 | ||
970 | fflush(stdout); | |
971 | printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n", | |
972 | stats.tp_packets, bytes_total, stats.tp_drops, | |
973 | stats.tp_freeze_q_cnt); | |
974 | ||
975 | teardown_socket(&ring, fd); | |
976 | return 0; | |
977 | } | |
978 | ||
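The ring geometry requested in setup_socket() above can be checked with plain
arithmetic. The sketch below recomputes tp_frame_nr and the mmap() length from
block size, block count and frame size; struct ring_geometry and
compute_ring_geometry() are hypothetical names used only for illustration, and
the 64-bit intermediate is an addition to keep larger rings from overflowing:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the derived ring parameters. */
struct ring_geometry {
	uint32_t frame_nr;	/* value to store in tp_frame_nr */
	uint64_t map_len;	/* length to pass to mmap() */
};

/* Same formula as in setup_socket(), evaluated in 64 bits. */
static inline struct ring_geometry
compute_ring_geometry(uint32_t block_size, uint32_t block_nr,
		      uint32_t frame_size)
{
	struct ring_geometry g;

	assert(frame_size != 0);

	g.map_len = (uint64_t) block_size * block_nr;
	g.frame_nr = (uint32_t) (g.map_len / frame_size);
	return g;
}
```

With the values from the example (blocksiz = 1 << 22, blocknum = 64,
framesiz = 1 << 11) this yields a 256 MiB mapping and 131072 frames.
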
d346a3fa DB |
979 | ------------------------------------------------------------------------------- |
980 | + PACKET_QDISC_BYPASS | |
981 | ------------------------------------------------------------------------------- | |
982 | ||
983 | If there is a requirement to load the network with many packets in a similar | |
984 | fashion as pktgen does, you might set the following option after socket | |
985 | creation: | |
986 | ||
987 | int one = 1; | |
988 | setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one)); | |
989 | ||
990 | This has the side effect that packets sent through PF_PACKET bypass the | |
991 | kernel's qdisc layer and are pushed directly to the driver. As a result, | |
992 | packets are not buffered, tc disciplines are ignored, increased loss can occur, | |
993 | and such packets are no longer visible to other PF_PACKET sockets. You have | |
994 | been warned; generally, this is useful for stress testing various | |
995 | components of a system. | |
996 | ||
997 | By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled | |
998 | on PF_PACKET sockets. | |
999 | ||
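Since the option is off by default and may be missing on older kernels, it can
help to check the result of the setsockopt() call. A minimal sketch, assuming a
hypothetical wrapper named enable_qdisc_bypass(); the fallback defines are
assumptions meant to match the Linux headers, and real code should take them
from <linux/if_packet.h>:

```c
#include <errno.h>
#include <sys/socket.h>

/* Fallback values (assumed to match the Linux headers). */
#ifndef SOL_PACKET
# define SOL_PACKET		263
#endif
#ifndef PACKET_QDISC_BYPASS
# define PACKET_QDISC_BYPASS	20
#endif

/*
 * Hypothetical wrapper: enable qdisc bypass on a packet socket.
 * Returns 0 on success or a negative errno value on failure.
 */
static inline int enable_qdisc_bypass(int fd)
{
	int one = 1;

	if (setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS,
		       &one, sizeof(one)) < 0)
		return -errno;

	return 0;
}
```

A caller might treat a failure as non-fatal and simply continue sending
through the regular qdisc path.
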
614f60fa SM |
1000 | ------------------------------------------------------------------------------- |
1001 | + PACKET_TIMESTAMP | |
1002 | ------------------------------------------------------------------------------- | |
1003 | ||
1004 | The PACKET_TIMESTAMP setting determines the source of the timestamp in | |
2940b26b DB |
1005 | the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your |
1006 | NIC is capable of timestamping packets in hardware, you can request those | |
1007 | hardware timestamps to be used. Note: you may need to enable the generation | |
1008 | of hardware timestamps with SIOCSHWTSTAMP (see related information from | |
1009 | Documentation/networking/timestamping.txt). | |
614f60fa | 1010 | |
68a360e8 WB |
1011 | PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING: |
1012 | ||
1013 | int req = SOF_TIMESTAMPING_RAW_HARDWARE; | |
614f60fa SM |
1014 | setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req)); | |
1015 | ||
2940b26b DB |
1016 | For the mmap(2)ed ring buffers, such timestamps are stored in the |
1017 | tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine | |
1018 | what kind of timestamp has been reported, the tp_status field is binary OR'ed | |
1019 | with the following possible bits ... | |
1020 | ||
2940b26b DB |
1021 | TP_STATUS_TS_RAW_HARDWARE |
1022 | TP_STATUS_TS_SOFTWARE | |
1023 | ||
1024 | ... that are equivalent to their SOF_TIMESTAMPING_* counterparts. For the | |
68a360e8 WB |
1025 | RX_RING, if neither is set (i.e. PACKET_TIMESTAMP is not set), then a |
1026 | software fallback was invoked *within* PF_PACKET's processing code (less | |
1027 | precise). | |
2940b26b DB |
1028 | |
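The status bits described above can be tested as in the following sketch for
the RX_RING. The enum and the function rx_ts_source() are hypothetical names;
the fallback defines are assumptions meant to match <linux/if_packet.h>, which
real code should include instead:

```c
#include <stdint.h>

/* Fallback values (assumed to match linux/if_packet.h). */
#ifndef TP_STATUS_TS_SOFTWARE
# define TP_STATUS_TS_SOFTWARE		(1 << 29)
#endif
#ifndef TP_STATUS_TS_RAW_HARDWARE
# define TP_STATUS_TS_RAW_HARDWARE	(1U << 31)
#endif

/* Hypothetical classification of where an RX timestamp came from. */
enum ts_source {
	TS_FALLBACK,		/* neither bit set: PF_PACKET software fallback */
	TS_SOFTWARE,		/* TP_STATUS_TS_SOFTWARE is set */
	TS_RAW_HARDWARE,	/* TP_STATUS_TS_RAW_HARDWARE is set */
};

/* Inspect tp_status of a received frame; hardware takes precedence. */
static inline enum ts_source rx_ts_source(uint32_t tp_status)
{
	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TS_RAW_HARDWARE;
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return TS_SOFTWARE;

	return TS_FALLBACK;
}
```
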
1029 | Getting timestamps for the TX_RING works as follows: i) fill the ring frames, | |
1030 | ii) call sendto() e.g. in blocking mode, iii) wait for the status of the relevant | |
1031 | frames to be updated, i.e. for the frames to be handed back to the application, iv) walk | |
1032 | through the frames to pick up the individual hw/sw timestamps. | |
1033 | ||
1034 | Only (!) if transmit timestamping is enabled are these bits combined via | |
1035 | binary OR with TP_STATUS_AVAILABLE, so you must check for that in your | |
1036 | application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) | |
1037 | in a first step to see if the frame belongs to the application, and then | |
1038 | extract the type of timestamp in a second step from tp_status)! | |
1039 | ||
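The two-step check described above for TX_RING frames might look like the
following sketch. The function names tx_frame_owned_by_user() and tx_ts_bits()
are hypothetical; the fallback defines are assumptions meant to match
<linux/if_packet.h>, which real code should include instead:

```c
#include <stdint.h>

/* Fallback values (assumed to match linux/if_packet.h). */
#ifndef TP_STATUS_SEND_REQUEST
# define TP_STATUS_SEND_REQUEST		(1 << 0)
#endif
#ifndef TP_STATUS_SENDING
# define TP_STATUS_SENDING		(1 << 1)
#endif
#ifndef TP_STATUS_TS_SOFTWARE
# define TP_STATUS_TS_SOFTWARE		(1 << 29)
#endif
#ifndef TP_STATUS_TS_RAW_HARDWARE
# define TP_STATUS_TS_RAW_HARDWARE	(1U << 31)
#endif

/* Step 1: the frame is back with the application once the kernel no
 * longer marks it as requested for sending or as being sent. */
static inline int tx_frame_owned_by_user(uint32_t tp_status)
{
	return !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
}

/* Step 2: isolate the timestamp type bits; 0 means no valid timestamp. */
static inline uint32_t tx_ts_bits(uint32_t tp_status)
{
	return tp_status & (TP_STATUS_TS_SOFTWARE | TP_STATUS_TS_RAW_HARDWARE);
}
```

Comparing tp_status directly against TP_STATUS_AVAILABLE would miss frames
whose status also carries a timestamp bit, which is why the ownership test
masks only the two send-related bits.
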
1040 | If you don't care about timestamps and have left them disabled, checking for | |
1041 | TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If in the | |
1042 | TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec | |
1043 | members do not contain a valid value. For TX_RINGs, by default no timestamp | |
1044 | is generated! | |
614f60fa | 1045 | |
f2b41874 | 1046 | See include/linux/net_tstamp.h and Documentation/networking/timestamping.txt |
614f60fa SM |
1047 | for more information on hardware timestamps. |
1048 | ||
d1ee40f9 DB |
1049 | ------------------------------------------------------------------------------- |
1050 | + Miscellaneous bits | |
1051 | ------------------------------------------------------------------------------- | |
1052 | ||
1053 | - Packet sockets work well together with Linux socket filters, thus you also | |
1054 | might want to have a look at Documentation/networking/filter.txt | |
1055 | ||
1da177e4 LT |
1056 | -------------------------------------------------------------------------------- |
1057 | + THANKS | |
1058 | -------------------------------------------------------------------------------- | |
1059 | ||
1060 | Jesse Brandeburg, for fixing my grammatical/spelling errors | |
1061 |