--------------------------------------------------------------------------------
+ ABSTRACT
--------------------------------------------------------------------------------

This file documents the mmap() facility available with the PACKET
socket interface on 2.4/2.6/3.x kernels. This type of socket is used to
i) capture network traffic with utilities like tcpdump, ii) transmit network
traffic, or for any other purpose that needs raw access to a network interface.

You can find the latest version of this document at:
    http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap

A howto can be found at:
    http://wiki.gnu-log.net (packet_mmap)

Please send your comments to
    Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
    Johann Baudy <johann.baudy@gnu-log.net>

--------------------------------------------------------------------------------
+ Why use PACKET_MMAP
--------------------------------------------------------------------------------

In Linux 2.4/2.6/3.x, if PACKET_MMAP is not enabled, the capture process is very
inefficient. It uses very limited buffers and requires one system call to
capture each packet; it requires two if you want to get the packet's timestamp
(like libpcap always does).

On the other hand, PACKET_MMAP is very efficient. PACKET_MMAP provides a
size-configurable circular buffer mapped in user space that can be used to
either send or receive packets. This way, reading packets just needs to wait
for them; most of the time there is no need to issue a single system call.
Concerning transmission, multiple packets can be sent through one system call
to get the highest bandwidth. Using a shared buffer between the kernel and the
user also has the benefit of minimizing packet copies.

It's fine to use PACKET_MMAP to improve the performance of the capture and
transmission process, but it isn't everything. At least, if you are capturing
at high speeds (this is relative to the CPU speed), you should check if the
device driver of your network interface card supports some sort of interrupt
load mitigation or (even better) if it supports NAPI; also make sure it is
enabled. For transmission, check the MTU (Maximum Transmission Unit) used and
supported by the devices of your network. Pinning the IRQs of your network
interface card to a CPU can also be an advantage.

--------------------------------------------------------------------------------
+ How to use mmap() to improve capture process
--------------------------------------------------------------------------------

From the user standpoint, you should use the higher level libpcap library, which
is a de facto standard, portable across nearly all operating systems
including Win32.

That said, at the time of this writing, official libpcap 0.8.1 is out and does
not include support for PACKET_MMAP, and probably neither does the libpcap
included in your distribution.

I'm aware of two implementations of PACKET_MMAP in libpcap:

    http://wiki.ipxwarzone.com/    (by Simon Patarin, based on libpcap 0.6.2)
    http://public.lanl.gov/cpw/    (by Phil Wood, based on latest libpcap)

The rest of this document is intended for people who want to understand
the low level details or want to improve libpcap by including PACKET_MMAP
support.

--------------------------------------------------------------------------------
+ How to use mmap() directly to improve capture process
--------------------------------------------------------------------------------

From the system calls standpoint, the use of PACKET_MMAP involves
the following process:


[setup]     socket() -------> creation of the capture socket
            setsockopt() ---> allocation of the circular buffer (ring)
                              option: PACKET_RX_RING
            mmap() ---------> mapping of the allocated buffer to the
                              user process

[capture]   poll() ---------> to wait for incoming packets

[shutdown]  close() --------> destruction of the capture socket and
                              deallocation of all associated
                              resources.


Socket creation and destruction is straightforward, and is done
the same way with or without PACKET_MMAP:

 int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));

where mode is SOCK_RAW for the raw interface, where link level
information can be captured, or SOCK_DGRAM for the cooked
interface, where link level information capture is not
supported and a link level pseudo-header is provided
by the kernel.

The destruction of the socket and all associated resources
is done by a simple call to close(fd).

Next I will describe the PACKET_MMAP settings and their constraints,
as well as the mapping of the circular buffer in the user process and
the use of this buffer.

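The setup and shutdown steps listed above can be sketched as follows. This is
only an illustration under stated assumptions: struct rx_ring, rx_ring_open()
and rx_ring_close() are hypothetical names that do not come from the kernel or
from libpcap, the caller is assumed to have the privileges needed to open a
packet socket, and the ring geometry in req must satisfy the constraints
described in the sections below.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>
#include <arpa/inet.h>        /* htons() */
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <linux/if_packet.h>  /* PACKET_RX_RING, struct tpacket_req */

/* Hypothetical helper type: the capture socket plus its mapped ring. */
struct rx_ring {
    int    fd;    /* packet socket, or -1 */
    void  *map;   /* start of the mapped ring, or MAP_FAILED */
    size_t len;   /* tp_block_size * tp_block_nr */
};

/* [setup] socket() -> setsockopt(PACKET_RX_RING) -> mmap().
 * Returns 0 on success, -1 on failure with errno set by the failing call. */
static int rx_ring_open(struct rx_ring *r, const struct tpacket_req *req)
{
    int saved;

    assert(r != NULL && req != NULL);

    r->len = (size_t)req->tp_block_size * req->tp_block_nr;
    r->map = MAP_FAILED;

    r->fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (r->fd < 0)
        return -1;

    if (setsockopt(r->fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req)) < 0)
        goto fail;

    r->map = mmap(NULL, r->len, PROT_READ | PROT_WRITE, MAP_SHARED, r->fd, 0);
    if (r->map == MAP_FAILED)
        goto fail;

    return 0;

fail:
    saved = errno;            /* close() may overwrite errno */
    close(r->fd);
    r->fd = -1;
    errno = saved;
    return -1;
}

/* [shutdown] unmapping is optional, close() releases the ring anyway. */
static void rx_ring_close(struct rx_ring *r)
{
    assert(r != NULL);
    if (r->map != MAP_FAILED)
        munmap(r->map, r->len);
    if (r->fd >= 0)
        close(r->fd);
}
```

Without sufficient privileges socket() fails and rx_ring_open() returns -1,
so callers should always check the return value before touching the ring.
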
--------------------------------------------------------------------------------
+ How to use mmap() directly to improve transmission process
--------------------------------------------------------------------------------
The transmission process is similar to capture, as shown below.

[setup]          socket() -------> creation of the transmission socket
                 setsockopt() ---> allocation of the circular buffer (ring)
                                   option: PACKET_TX_RING
                 bind() ---------> bind transmission socket with a network interface
                 mmap() ---------> mapping of the allocated buffer to the
                                   user process

[transmission]   poll() ---------> wait for free packets (optional)
                 send() ---------> send all packets that are set as ready in
                                   the ring
                                   The flag MSG_DONTWAIT can be used to return
                                   before end of transfer.

[shutdown]       close() --------> destruction of the transmission socket and
                                   deallocation of all associated resources.

Binding the socket to your network interface is mandatory (with zero copy) to
know the header size of frames used in the circular buffer.

As with capture, each frame contains two parts:

 --------------------
| struct tpacket_hdr | Header. It contains the status
|                    | of this frame
|--------------------|
| data buffer        |
.                    .  Data that will be sent over the network interface.
.                    .
 --------------------

 bind() associates the socket with your network interface through the
 sll_ifindex field of struct sockaddr_ll.

 Initialization example:

 struct sockaddr_ll my_addr;
 struct ifreq s_ifr;
 ...

 /* start from zeroed structures so unused fields are well defined */
 memset(&s_ifr, 0, sizeof(s_ifr));
 memset(&my_addr, 0, sizeof(my_addr));

 strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));

 /* get interface index of eth0 */
 ioctl(this->socket, SIOCGIFINDEX, &s_ifr);

 /* fill sockaddr_ll struct to prepare binding */
 my_addr.sll_family = AF_PACKET;
 my_addr.sll_protocol = htons(ETH_P_ALL);
 my_addr.sll_ifindex =  s_ifr.ifr_ifindex;

 /* bind socket to eth0 */
 bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));

 A complete tutorial is available at: http://wiki.gnu-log.net/

By default, the user should put data at:
 frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll)

So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW),
the beginning of the user data will be at:
 frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr))

If you wish to put user data at a custom offset from the beginning of
the frame (for payload alignment with SOCK_RAW mode for instance) you
can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order
to make this work it must be enabled previously with setsockopt()
and the PACKET_TX_HAS_OFF option.

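The two expressions for the default data offset given above are the same
value, because TPACKET_HDRLEN is defined in linux/if_packet.h as
TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + sizeof(struct sockaddr_ll).
A minimal sketch, assuming TPACKET_V1 and no PACKET_TX_HAS_OFF; the helper
name tx_frame_data() is hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_packet.h>  /* TPACKET_HDRLEN, struct sockaddr_ll */

/* Default start of the user data inside a TPACKET_V1 TX frame. */
static void *tx_frame_data(void *frame_base)
{
    assert(frame_base != NULL);
    return (char *)frame_base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll);
}
```
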
--------------------------------------------------------------------------------
+ PACKET_MMAP settings
--------------------------------------------------------------------------------

Setting up PACKET_MMAP from user level code is done with a call like

 - Capture process
     setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
 - Transmission process
     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))

The most significant argument in the previous call is the req parameter,
which must have the following structure:

    struct tpacket_req
    {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
    };

This structure is defined in /usr/include/linux/if_packet.h and establishes a
circular buffer (ring) of unswappable memory.
Being mapped in the capture process allows reading the captured frames and
related meta-information like timestamps without requiring a system call.

Frames are grouped in blocks. Each block is a physically contiguous
region of memory and holds tp_block_size/tp_frame_size frames. The total number
of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because

    frames_per_block = tp_block_size/tp_frame_size

indeed, packet_set_ring checks that the following condition is true

    frames_per_block * tp_block_nr == tp_frame_nr

Let's see an example, with the following values:

     tp_block_size= 4096
     tp_frame_size= 2048
     tp_block_nr  = 4
     tp_frame_nr  = 8

we will get the following buffer structure:

        block #1                 block #2
+---------+---------+    +---------+---------+
| frame 1 | frame 2 |    | frame 3 | frame 4 |
+---------+---------+    +---------+---------+

        block #3                 block #4
+---------+---------+    +---------+---------+
| frame 5 | frame 6 |    | frame 7 | frame 8 |
+---------+---------+    +---------+---------+

A frame can be of any size with the only condition that it fits in a block. A
block can only hold an integer number of frames, or in other words, a frame
cannot span two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".

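Since tp_frame_nr is redundant, it can be derived from the other three fields
rather than typed by hand. The following sketch does that; ring_req_init() is
a hypothetical helper name, not part of any kernel or library API.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_packet.h>  /* struct tpacket_req */

/* Fill req so that frames_per_block * tp_block_nr == tp_frame_nr holds.
 * Returns 0 on success, -1 if not even one frame fits in a block. */
static int ring_req_init(struct tpacket_req *req, unsigned int block_size,
                         unsigned int block_nr, unsigned int frame_size)
{
    assert(req != NULL);

    if (frame_size == 0 || block_size < frame_size)
        return -1;

    req->tp_block_size = block_size;
    req->tp_block_nr   = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr   = (block_size / frame_size) * block_nr;
    return 0;
}
```

With the values of the example above (block size 4096, 4 blocks, frame size
2048) the helper fills in tp_frame_nr = 8.
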
--------------------------------------------------------------------------------
+ PACKET_MMAP setting constraints
--------------------------------------------------------------------------------

In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
16384 in a 64 bit architecture. For information on these kernel versions
see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt

 Block size limit
------------------

As stated earlier, each block is a contiguous physical region of memory. These
memory regions are allocated with calls to the __get_free_pages() function. As
the name indicates, this function allocates pages of memory, and the second
argument is "order", that is, a power-of-two number of pages:
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc. The maximum size of a
region allocated by __get_free_pages() is determined by the MAX_ORDER macro.
More precisely the limit can be calculated as:

   PAGE_SIZE << MAX_ORDER

   In an i386 architecture PAGE_SIZE is 4096 bytes
   In a 2.4/i386 kernel MAX_ORDER is 10
   In a 2.6/i386 kernel MAX_ORDER is 11

So __get_free_pages() can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.

User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.

The page size can also be determined dynamically with the getpagesize (2)
system call.

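As a small illustration of the PAGE_SIZE << MAX_ORDER formula, the sketch
below queries the page size at run time with sysconf(_SC_PAGESIZE), the POSIX
equivalent of getpagesize(). The max_order argument is an assumption supplied
by the caller (10 or 11 in the i386 figures above), and max_block_bytes() is
a hypothetical name.

```c
#include <assert.h>
#include <unistd.h>  /* sysconf(), _SC_PAGESIZE */

/* Upper bound for tp_block_size: PAGE_SIZE << max_order.
 * max_order is a caller-supplied assumption about the running kernel. */
static long max_block_bytes(unsigned int max_order)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(page > 0 && max_order < 16);
    return page << max_order;
}
```

With a 4096 byte page, max_block_bytes(10) is 4 MiB and max_block_bytes(11)
is 8 MiB, matching the 2.4 and 2.6 figures given above.
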
 Block number limit
--------------------

To understand the constraints of PACKET_MMAP, we have to see the structure
used to hold the pointers to each block.

Currently, this structure is a vector called pg_vec, dynamically allocated
with kmalloc; its size limits the number of blocks that can be allocated.

    +---+---+---+---+
    | x | x | x | x |
    +---+---+---+---+
      |   |   |   |
      |   |   |   v
      |   |   v  block #4
      |   v  block #3
      v  block #2
     block #1

kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator, which is ultimately responsible for doing the allocation and
hence imposes the maximum memory that kmalloc can allocate.

In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
entries of /proc/slabinfo

In a 32 bit architecture, pointers are 4 bytes long, so the total number of
pointers to blocks is

     131072/4 = 32768 blocks

 PACKET_MMAP buffer size calculator
------------------------------------

Definitions:

<size-max>    : is the maximum size that can be allocated with kmalloc
                (see /proc/slabinfo)
<pointer size>: depends on the architecture -- sizeof(void *)
<page size>   : depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order>   : is the value defined with MAX_ORDER
<frame size>  : it's an upper bound of frame's capture size (more on this later)

from these definitions we will derive

    <block number> = <size-max>/<pointer size>
    <block size> = <page size> << <max-order>

so, the max buffer size is

    <block number> * <block size>

and the number of frames is

    <block number> * <block size> / <frame size>

Suppose the following parameters, which apply for 2.6 kernel and an
i386 architecture:

    <size-max> = 131072 bytes
    <pointer size> = 4 bytes
    <page size> = 4096 bytes
    <max-order> = 11

and a value for <frame size> of 2048 bytes. These parameters will yield

    <block number> = 131072/4 = 32768 blocks
    <block size> = 4096 << 11 = 8 MiB.

and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames

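The calculator above is plain integer arithmetic and can be written down as a
short sketch. struct ring_limits and ring_limits_calc() are hypothetical
names; 64 bit integers are used because the theoretical buffer size does not
fit in 32 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Results of the buffer size calculator, all sizes in bytes. */
struct ring_limits {
    uint64_t block_nr;     /* <size-max> / <pointer size>      */
    uint64_t block_size;   /* <page size> << <max-order>       */
    uint64_t buffer_size;  /* <block number> * <block size>    */
    uint64_t frame_nr;     /* buffer size / <frame size>       */
};

static struct ring_limits ring_limits_calc(uint64_t size_max,
                                           uint64_t pointer_size,
                                           uint64_t page_size,
                                           unsigned int max_order,
                                           uint64_t frame_size)
{
    struct ring_limits l;

    assert(pointer_size != 0 && frame_size != 0 && max_order < 32);

    l.block_nr    = size_max / pointer_size;
    l.block_size  = page_size << max_order;
    l.buffer_size = l.block_nr * l.block_size;
    l.frame_nr    = l.buffer_size / frame_size;
    return l;
}
```

Calling ring_limits_calc(131072, 4, 4096, 11, 2048) reproduces the figures of
the worked example: 32768 blocks of 8 MiB each and 134217728 frames.
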
Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space; in the case of
an i386 kernel, the kernel's memory size is limited to 1GiB.

None of the memory allocations are freed until the socket is closed. The
memory allocations are done with GFP_KERNEL priority, which basically means
that the allocation can wait and swap other processes' memory in order to
allocate the necessary memory, so normally limits can be reached.

 Other constraints
-------------------

If you check the source code you will see that what I draw here as a frame
is not only the link level frame. At the beginning of each frame there is a
header called struct tpacket_hdr used in PACKET_MMAP to hold the link level
frame's meta information like the timestamp. So what we draw here as a frame
is really the following (from include/linux/if_packet.h):

/*
   Frame structure:

   - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
   - struct tpacket_hdr
   - pad to TPACKET_ALIGNMENT=16
   - struct sockaddr_ll
   - Gap, chosen so that packet data (Start+tp_net) aligns to
     TPACKET_ALIGNMENT=16
   - Start+tp_mac: [ Optional MAC header ]
   - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
   - Pad to align to TPACKET_ALIGNMENT=16
 */

The following are conditions that are checked in packet_set_ring

   tp_block_size must be a multiple of PAGE_SIZE (1)
   tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
   tp_frame_size must be a multiple of TPACKET_ALIGNMENT
   tp_frame_nr   must be exactly frames_per_block*tp_block_nr

Note that tp_block_size should be chosen to be a power of two or there will
be a waste of memory.

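The four conditions listed above can be checked in user space before calling
setsockopt(), which gives a clearer diagnostic than a bare EINVAL. This is a
sketch of the conditions as this document states them, not a copy of the
kernel code; ring_req_valid() is a hypothetical name and the page size is
taken from sysconf(_SC_PAGESIZE).

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>           /* sysconf(), _SC_PAGESIZE */
#include <linux/if_packet.h>  /* struct tpacket_req, TPACKET_HDRLEN */

/* Returns 1 if req satisfies the documented conditions, 0 otherwise. */
static int ring_req_valid(const struct tpacket_req *req)
{
    long page = sysconf(_SC_PAGESIZE);

    assert(req != NULL && page > 0);

    /* tp_block_size must be a non-zero multiple of PAGE_SIZE */
    if (req->tp_block_size == 0 || req->tp_block_size % (unsigned int)page)
        return 0;
    /* tp_frame_size must be greater than TPACKET_HDRLEN */
    if (req->tp_frame_size <= TPACKET_HDRLEN)
        return 0;
    /* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */
    if (req->tp_frame_size % TPACKET_ALIGNMENT)
        return 0;
    /* a frame must fit in a block */
    if (req->tp_frame_size > req->tp_block_size)
        return 0;
    /* tp_frame_nr must be exactly frames_per_block * tp_block_nr */
    return req->tp_frame_nr ==
           (req->tp_block_size / req->tp_frame_size) * req->tp_block_nr;
}
```
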
--------------------------------------------------------------------------------
+ Mapping and use of the circular buffer (ring)
--------------------------------------------------------------------------------

The mapping of the buffer in the user process is done with the conventional
mmap function. Even though the circular buffer is composed of several
physically discontiguous blocks of memory, they are contiguous in user space,
hence just one call to mmap is needed:

    mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

If tp_frame_size is a divisor of tp_block_size, frames will be
contiguously spaced by tp_frame_size bytes. If not, after every
tp_block_size/tp_frame_size frames there will be a gap between
the frames. This is because a frame cannot span two
blocks.

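Because of that possible gap, the position of frame i should be computed per
block rather than as i * tp_frame_size. A minimal sketch, with the
hypothetical helper name frame_offset() and frames numbered from 0:

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_packet.h>  /* struct tpacket_req */

/* Byte offset of frame i from the start of the mapped ring. */
static size_t frame_offset(const struct tpacket_req *req, unsigned int i)
{
    unsigned int per_block;

    assert(req != NULL && req->tp_frame_size != 0);
    per_block = req->tp_block_size / req->tp_frame_size;
    assert(per_block != 0 && i < req->tp_frame_nr);

    return (size_t)(i / per_block) * req->tp_block_size +
           (size_t)(i % per_block) * req->tp_frame_size;
}
```

For example, with tp_block_size = 4096 and tp_frame_size = 1536 only two
frames fit in a block, so frame 2 starts at offset 4096 and the last 1024
bytes of every block are unused.
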
At the beginning of each frame there is a status field (see
struct tpacket_hdr). If this field is 0, the frame is ready
to be used by the kernel. If not, there is a frame the user can read
and the following flags apply:

+++ Capture process:
     from include/linux/if_packet.h

     #define TP_STATUS_COPY          2
     #define TP_STATUS_LOSING        4
     #define TP_STATUS_CSUMNOTREADY  8

TP_STATUS_COPY        : This flag indicates that the frame (and associated
                        meta information) has been truncated because it's
                        larger than tp_frame_size. This packet can be
                        read entirely with recvfrom().

                        In order to make this work it must be
                        enabled previously with setsockopt() and
                        the PACKET_COPY_THRESH option.

                        The number of frames that can be buffered to
                        be read with recvfrom is limited like a normal socket.
                        See the SO_RCVBUF option in the socket (7) man page.

TP_STATUS_LOSING      : indicates there were packet drops since the last time
                        statistics were checked with getsockopt() and
                        the PACKET_STATISTICS option.

TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets whose
                        checksum will be done in hardware. So while
                        reading the packet we should not try to check the
                        checksum.

For convenience there are also the following defines:

     #define TP_STATUS_KERNEL        0
     #define TP_STATUS_USER          1

The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
receives a packet, it puts it in the buffer and updates the status with
at least the TP_STATUS_USER flag. Then the user can read the packet;
once the packet is read the user must zero the status field, so the kernel
can use that frame buffer again.

The user can use poll (any other variant should apply too) to check if new
packets are in the ring:

    struct pollfd pfd;

    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLIN|POLLRDNORM|POLLERR;

    if (status == TP_STATUS_KERNEL)
        retval = poll(&pfd, 1, timeout);

Checking the status value first and then polling for frames does not incur
a race condition.

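The ownership handshake described above can be sketched with two small
helpers for TPACKET_V1 frames. The names frame_has_packet() and
frame_release() are hypothetical, and the memory barrier uses the GCC/Clang
builtin __sync_synchronize(), which is an assumption about the compiler, not
something this document prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/if_packet.h>  /* struct tpacket_hdr, TP_STATUS_* */

/* Non-zero when the kernel has handed the frame to user space.
 * Other flags (TP_STATUS_COPY, TP_STATUS_LOSING, ...) may be set as well,
 * so the TP_STATUS_USER bit is tested instead of comparing for equality. */
static int frame_has_packet(const volatile struct tpacket_hdr *hdr)
{
    assert(hdr != NULL);
    return (hdr->tp_status & TP_STATUS_USER) != 0;
}

/* Hand the frame back to the kernel once the packet has been consumed. */
static void frame_release(volatile struct tpacket_hdr *hdr)
{
    assert(hdr != NULL);
    __sync_synchronize();  /* finish reading the frame before releasing it */
    hdr->tp_status = TP_STATUS_KERNEL;
}
```

A capture loop would call poll() only while frame_has_packet() returns 0 for
the current frame, process the packet, call frame_release() and then advance
to the next frame of the ring.
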
++ Transmission process
These defines are also used for transmission:

     #define TP_STATUS_AVAILABLE        0 // Frame is available
     #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
     #define TP_STATUS_SENDING          2 // Frame is currently in transmission
     #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct

First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
packet, the user fills a data buffer of an available frame, sets tp_len to
current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST.
This can be done on multiple frames. Once the user is ready to transmit, it
calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are
forwarded to the network device. The kernel updates each status of sent
frames with TP_STATUS_SENDING until the end of transfer.
At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE.

    header->tp_len = in_i_size;
    header->tp_status = TP_STATUS_SEND_REQUEST;
    retval = send(this->socket, NULL, 0, 0);

The user can also use poll() to check if a buffer is available:
(status == TP_STATUS_SENDING)

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLOUT;
    retval = poll(&pfd, 1, timeout);

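Filling one TX frame, as described above, can be sketched like this. It
assumes TPACKET_V1, the default data offset (no PACKET_TX_HAS_OFF) and a
frame pointer that is suitably aligned, as frames in the mapped ring are; the
helper name tx_frame_queue() is hypothetical. The caller still has to invoke
send() on the socket afterwards to start the transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <linux/if_packet.h>  /* struct tpacket_hdr, TP_STATUS_* */

/* Copy len bytes into an available frame and mark it for the next send().
 * Returns 0 if queued, -1 if the frame is busy or the data does not fit. */
static int tx_frame_queue(void *frame, size_t frame_size,
                          const void *data, size_t len)
{
    struct tpacket_hdr *hdr = frame;
    size_t off = TPACKET_HDRLEN - sizeof(struct sockaddr_ll);

    assert(frame != NULL && data != NULL);

    if (hdr->tp_status != TP_STATUS_AVAILABLE)
        return -1;                      /* still owned by the kernel */
    if (frame_size < off || len > frame_size - off)
        return -1;                      /* payload larger than the frame */

    memcpy((char *)frame + off, data, len);
    hdr->tp_len = (unsigned int)len;
    hdr->tp_status = TP_STATUS_SEND_REQUEST;
    return 0;
}
```
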
--------------------------------------------------------------------------------
+ What TPACKET versions are available and when to use them?
--------------------------------------------------------------------------------

 int val = tpacket_version;
 socklen_t len = sizeof(val);   /* getsockopt() takes a pointer to the length */
 setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val));
 getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, &len);

where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.

TPACKET_V1:
    - Default if not otherwise specified by setsockopt(2)
    - RX_RING, TX_RING available
    - VLAN metadata information available for packets
      (TP_STATUS_VLAN_VALID)

TPACKET_V1 --> TPACKET_V2:
    - Made 64 bit clean due to unsigned long usage in TPACKET_V1
      structures, thus this also works on 64 bit kernel with 32 bit
      userspace and the like
    - Timestamp resolution in nanoseconds instead of microseconds
    - RX_RING, TX_RING available
    - How to switch to TPACKET_V2:
        1. Replace struct tpacket_hdr by struct tpacket2_hdr
        2. Query header len and save
        3. Set protocol version to 2, set up ring as usual
        4. For getting the sockaddr_ll,
           use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of
           (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr))

TPACKET_V2 --> TPACKET_V3:
    - Flexible buffer implementation:
        1. Blocks can be configured with non-static frame-size
        2. Read/poll is at a block-level (as opposed to packet-level)
        3. Added poll timeout to avoid indefinite user-space wait
           on idle links
        4. Added user-configurable knobs:
            4.1 block::timeout
            4.2 tpkt_hdr::sk_rxhash
    - RX Hash data available in user space
    - Currently only RX_RING available

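Two of the TPACKET_V2 points above, the nanosecond timestamp and step 4 of
the switch list, can be sketched as small accessors. The helper names are
hypothetical; hdrlen is assumed to be the value saved in step 2, for example
as obtained with getsockopt() and the PACKET_HDRLEN option.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/if_packet.h>  /* struct tpacket2_hdr, TPACKET_ALIGN */

/* Timestamp of a TPACKET_V2 frame in nanoseconds. */
static uint64_t tpacket2_ts_ns(const struct tpacket2_hdr *hdr)
{
    assert(hdr != NULL);
    return (uint64_t)hdr->tp_sec * 1000000000ULL + hdr->tp_nsec;
}

/* Step 4: locate the sockaddr_ll that follows the V2 frame header. */
static struct sockaddr_ll *tpacket2_sll(struct tpacket2_hdr *hdr,
                                        unsigned int hdrlen)
{
    assert(hdr != NULL);
    return (struct sockaddr_ll *)((char *)hdr + TPACKET_ALIGN(hdrlen));
}
```
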
--------------------------------------------------------------------------------
+ AF_PACKET fanout mode
--------------------------------------------------------------------------------

In the AF_PACKET fanout mode, packet reception can be load balanced among
processes. This also works in combination with mmap(2) on packet sockets.

Minimal example code by David S. Miller (try things like "./test eth0 hash",
"./test eth0 lb", etc.):

#include <stddef.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#include <sys/types.h>
#include <sys/wait.h>
#include <sys/socket.h>
#include <sys/ioctl.h>

#include <unistd.h>

#include <arpa/inet.h>          /* htons() */

#include <linux/if_ether.h>
#include <linux/if_packet.h>

#include <net/if.h>

static const char *device_name;
static int fanout_type;
static int fanout_id;

#ifndef PACKET_FANOUT
# define PACKET_FANOUT                  18
# define PACKET_FANOUT_HASH             0
# define PACKET_FANOUT_LB               1
#endif

/* Returns the socket on success, -1 on failure. */
static int setup_socket(void)
{
    int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP));
    struct sockaddr_ll ll;
    struct ifreq ifr;
    int fanout_arg;

    if (fd < 0) {
        perror("socket");
        return -1;
    }

    memset(&ifr, 0, sizeof(ifr));
    /* ifr was zeroed, so the name stays NUL terminated */
    strncpy(ifr.ifr_name, device_name, sizeof(ifr.ifr_name) - 1);
    err = ioctl(fd, SIOCGIFINDEX, &ifr);
    if (err < 0) {
        perror("SIOCGIFINDEX");
        close(fd);
        return -1;
    }

    memset(&ll, 0, sizeof(ll));
    ll.sll_family = AF_PACKET;
    ll.sll_ifindex = ifr.ifr_ifindex;
    err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
    if (err < 0) {
        perror("bind");
        close(fd);
        return -1;
    }

    /* low 16 bits: fanout group id, high 16 bits: fanout type */
    fanout_arg = (fanout_id | (fanout_type << 16));
    err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT,
                     &fanout_arg, sizeof(fanout_arg));
    if (err) {
        perror("setsockopt");
        close(fd);
        return -1;
    }

    return fd;
}

/* Runs in each child process; never returns. */
static void fanout_thread(void)
{
    int fd = setup_socket();
    int limit = 10000;

    if (fd < 0)
        exit(EXIT_FAILURE);

    while (limit-- > 0) {
        char buf[1600];
        int err;

        err = read(fd, buf, sizeof(buf));
        if (err < 0) {
            perror("read");
            exit(EXIT_FAILURE);
        }
        if ((limit % 10) == 0)
            fprintf(stdout, "(%d) \n", getpid());
    }

    fprintf(stdout, "%d: Received 10000 packets\n", getpid());

    close(fd);
    exit(0);
}

int main(int argc, char **argp)
{
    int i;

    if (argc != 3) {
        fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]);
        return EXIT_FAILURE;
    }

    if (!strcmp(argp[2], "hash"))
        fanout_type = PACKET_FANOUT_HASH;
    else if (!strcmp(argp[2], "lb"))
        fanout_type = PACKET_FANOUT_LB;
    else {
        fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]);
        exit(EXIT_FAILURE);
    }

    device_name = argp[1];
    fanout_id = getpid() & 0xffff;

    for (i = 0; i < 4; i++) {
        pid_t pid = fork();

        switch (pid) {
        case 0:
            fanout_thread();    /* child: does not return */
            break;

        case -1:
            perror("fork");
            exit(EXIT_FAILURE);
        }
    }

    for (i = 0; i < 4; i++) {
        int status;

        wait(&status);
    }

    return 0;
}

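The PACKET_FANOUT argument used in the example packs the group id into the
low 16 bits and the fanout type into the high 16 bits. A tiny sketch of that
packing, with the hypothetical helper name fanout_arg_pack():

```c
#include <assert.h>
#include <linux/if_packet.h>  /* PACKET_FANOUT_HASH, PACKET_FANOUT_LB */

/* Build the integer passed to setsockopt(..., PACKET_FANOUT, ...). */
static int fanout_arg_pack(unsigned int group_id, unsigned int type)
{
    assert(group_id <= 0xffff && type <= 0x7fff);
    return (int)(group_id | (type << 16));
}
```

All sockets that should share the load must pass the same group id and the
same type, which is why the example derives the id once in the parent, before
fork().
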
--------------------------------------------------------------------------------
+ PACKET_TIMESTAMP
--------------------------------------------------------------------------------

The PACKET_TIMESTAMP setting determines the source of the timestamp in
the packet meta information. If your NIC is capable of timestamping
packets in hardware, you can request those hardware timestamps to be used.
Note: you may need to enable the generation of hardware timestamps with
SIOCSHWTSTAMP.

PACKET_TIMESTAMP accepts the same integer bit field as
SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE
and SOF_TIMESTAMPING_RAW_HARDWARE values are recognized by
PACKET_TIMESTAMP. SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over
SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set.

    int req = 0;
    req |= SOF_TIMESTAMPING_SYS_HARDWARE;
    setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));

If PACKET_TIMESTAMP is not set, a software timestamp generated inside
the networking stack is used (the behavior before this setting was added).

See include/linux/net_tstamp.h and Documentation/networking/timestamping
for more information on hardware timestamps.

--------------------------------------------------------------------------------
+ Miscellaneous bits
--------------------------------------------------------------------------------

- Packet sockets work well together with Linux socket filters, thus you also
  might want to have a look at Documentation/networking/filter.txt

--------------------------------------------------------------------------------
+ THANKS
--------------------------------------------------------------------------------

   Jesse Brandeburg, for fixing my grammatical/spelling errors