Commit | Line | Data |
---|---|---|
cb3f0d56 MCC |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
ffba964e TY |
3 | .. _networking-filter: |
4 | ||
cb3f0d56 | 5 | ======================================================= |
7924cd5e DB |
6 | Linux Socket Filtering aka Berkeley Packet Filter (BPF) |
7 | ======================================================= | |
1da177e4 | 8 | |
88691e9e CH |
9 | Notice |
10 | ------ | |
11 | ||
12 | This file used to document the eBPF format and mechanisms even when not | |
13 | related to socket filtering. See ../bpf/index.rst for more details | |
14 | on eBPF. | |
15 | ||
1da177e4 | 16 | Introduction |
7924cd5e DB |
17 | ------------ |
18 | ||
19 | Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter. | |
20 | Although there are some distinct differences between BSD and Linux | |
21 | kernel filtering, when we speak of BPF or LSF in a Linux context, we | |
22 | mean the very same mechanism of filtering in the Linux kernel. | |
23 | ||
24 | BPF allows a user-space program to attach a filter onto any socket and | |
25 | allow or disallow certain types of data to come through the socket. LSF | |
26 | follows exactly the same filter code structure as BSD's BPF, so referring | |
27 | to the BSD bpf.4 manpage is very helpful in creating filters. | |
28 | ||
29 | On Linux, BPF is much simpler than on BSD. One does not have to worry | |
30 | about devices or anything like that. You simply create your filter code, | |
31 | send it to the kernel via the SO_ATTACH_FILTER option and if your filter | |
32 | code passes the kernel check on it, you then immediately begin filtering | |
33 | data on that socket. | |
34 | ||
35 | You can also detach filters from your socket via the SO_DETACH_FILTER | |
36 | option. This will probably not be used much since when you close a socket | |
37 | that has a filter on it the filter is automagically removed. Another, | |
38 | less common case is attaching a different filter to a socket that | |
39 | already has a filter running: the kernel takes care of | |
40 | removing the old one and putting your new one in its place, provided your | |
41 | filter passes the checks; if it fails them, the old filter | |
42 | remains on that socket. | |
43 | ||
44 | The SO_LOCK_FILTER option allows locking the filter attached to a socket. Once | |
45 | set, a filter cannot be removed or changed. This allows one process to | |
46 | set up a socket, attach a filter, lock it, then drop privileges and be | |
47 | assured that the filter will be kept until the socket is closed. | |
48 | ||
49 | The biggest user of this construct might be libpcap. Issuing a high-level | |
50 | filter command like `tcpdump -i em1 port 22` passes through the libpcap | |
51 | internal compiler that generates a structure that can eventually be loaded | |
52 | via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd` | |
53 | displays what is being placed into this structure. | |
54 | ||
55 | Although we were only speaking about sockets here, BPF in Linux is used | |
56 | in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel | |
cb3f0d56 | 57 | qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places |
7924cd5e DB |
58 | such as the team driver, PTP code, etc., where BPF is being used. |
59 | ||
cb3f0d56 | 60 | .. [1] Documentation/userspace-api/seccomp_filter.rst |
7924cd5e DB |
61 | |
62 | Original BPF paper: | |
63 | ||
64 | Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new | |
65 | architecture for user-level packet capture. In Proceedings of the | |
66 | USENIX Winter 1993 Conference (USENIX'93). | |
67 | USENIX Association, Berkeley, | |
68 | CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf] | |
69 | ||
70 | Structure | |
71 | --------- | |
72 | ||
73 | User space applications include <linux/filter.h> which contains the | |
cb3f0d56 | 74 | following relevant structures:: |
7924cd5e | 75 | |
cb3f0d56 MCC |
76 | struct sock_filter { /* Filter block */ |
77 | __u16 code; /* Actual filter code */ | |
78 | __u8 jt; /* Jump true */ | |
79 | __u8 jf; /* Jump false */ | |
80 | __u32 k; /* Generic multiuse field */ | |
81 | }; | |
7924cd5e DB |
82 | |
83 | Such a structure is assembled as an array of 4-tuples, each of which contains | |
84 | a code, jt, jf and k value. jt and jf are jump offsets and k is a generic | |
cb3f0d56 | 85 | value to be used for a provided code:: |
7924cd5e | 86 | |
cb3f0d56 MCC |
87 | struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ |
88 | unsigned short len; /* Number of filter blocks */ | |
89 | struct sock_filter __user *filter; | |
90 | }; | |
7924cd5e DB |
91 | |
92 | For socket filtering, a pointer to this structure (as shown in the | |
93 | following example) is passed to the kernel through setsockopt(2). | |
94 | ||
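The 4-tuples do not have to be written as raw numbers: <linux/filter.h> also
provides the BPF_STMT() and BPF_JUMP() initializer macros. The following is a
minimal sketch, assuming a Linux user-space build with the kernel UAPI headers
installed. The names arp_filter, arp_prog and check_arp_encoding are
illustrative only and do not come from the kernel sources. It encodes the ARP
filter that appears further down in this document:

```c
#include <assert.h>
#include <linux/filter.h>	/* struct sock_filter, struct sock_fprog, BPF_* */

/* Illustrative name: pass ARP frames (EtherType 0x0806), drop the rest. */
static struct sock_filter arp_filter[] = {
	BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),			/* ldh [12] */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x0806, 0, 1),	/* jne #0x806, drop */
	BPF_STMT(BPF_RET | BPF_K, 0xffffffff),			/* ret #-1 */
	BPF_STMT(BPF_RET | BPF_K, 0),				/* drop: ret #0 */
};

/* The program descriptor that would be handed to SO_ATTACH_FILTER. */
static struct sock_fprog arp_prog = {
	.len = sizeof(arp_filter) / sizeof(arp_filter[0]),
	.filter = arp_filter,
};

/* The macros only fill in the code/jt/jf/k tuple; check a few fields. */
void check_arp_encoding(void)
{
	assert(arp_prog.len == 4);
	assert(arp_filter[0].code == 0x28 && arp_filter[0].k == 12);
	assert(arp_filter[1].code == 0x15 && arp_filter[1].jt == 0 &&
	       arp_filter[1].jf == 1 && arp_filter[1].k == 0x806);
	assert(arp_filter[2].code == 0x06 && arp_filter[2].k == 0xffffffffu);
}
```

Calling check_arp_encoding() from a main() of your own confirms that the
macro form and the numeric form describe the same program.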
95 | Example | |
96 | ------- | |
97 | ||
cb3f0d56 MCC |
98 | :: |
99 | ||
100 | #include <sys/socket.h> | |
101 | #include <sys/types.h> | |
102 | #include <arpa/inet.h> | |
103 | #include <linux/if_ether.h> | |
104 | /* ... */ | |
105 | ||
106 | /* From the example above: tcpdump -i em1 port 22 -dd */ | |
107 | struct sock_filter code[] = { | |
108 | { 0x28, 0, 0, 0x0000000c }, | |
109 | { 0x15, 0, 8, 0x000086dd }, | |
110 | { 0x30, 0, 0, 0x00000014 }, | |
111 | { 0x15, 2, 0, 0x00000084 }, | |
112 | { 0x15, 1, 0, 0x00000006 }, | |
113 | { 0x15, 0, 17, 0x00000011 }, | |
114 | { 0x28, 0, 0, 0x00000036 }, | |
115 | { 0x15, 14, 0, 0x00000016 }, | |
116 | { 0x28, 0, 0, 0x00000038 }, | |
117 | { 0x15, 12, 13, 0x00000016 }, | |
118 | { 0x15, 0, 12, 0x00000800 }, | |
119 | { 0x30, 0, 0, 0x00000017 }, | |
120 | { 0x15, 2, 0, 0x00000084 }, | |
121 | { 0x15, 1, 0, 0x00000006 }, | |
122 | { 0x15, 0, 8, 0x00000011 }, | |
123 | { 0x28, 0, 0, 0x00000014 }, | |
124 | { 0x45, 6, 0, 0x00001fff }, | |
125 | { 0xb1, 0, 0, 0x0000000e }, | |
126 | { 0x48, 0, 0, 0x0000000e }, | |
127 | { 0x15, 2, 0, 0x00000016 }, | |
128 | { 0x48, 0, 0, 0x00000010 }, | |
129 | { 0x15, 0, 1, 0x00000016 }, | |
130 | { 0x06, 0, 0, 0x0000ffff }, | |
131 | { 0x06, 0, 0, 0x00000000 }, | |
132 | }; | |
133 | ||
134 | struct sock_fprog bpf = { | |
135 | .len = sizeof(code) / sizeof(code[0]), | |
136 | .filter = code, | |
137 | }; | |
138 | ||
139 | sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); | |
140 | if (sock < 0) | |
141 | /* ... bail out ... */ | |
142 | ||
143 | ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); | |
144 | if (ret < 0) | |
145 | /* ... bail out ... */ | |
146 | ||
147 | /* ... */ | |
148 | close(sock); | |
7924cd5e DB |
149 | |
150 | The above example code attaches a socket filter for a PF_PACKET socket | |
151 | in order to let all IPv4/IPv6 packets with port 22 pass. The rest will | |
152 | be dropped for this socket. | |
153 | ||
154 | The setsockopt(2) call for SO_DETACH_FILTER doesn't need any arguments, | |
155 | and SO_LOCK_FILTER, which prevents the filter from being detached, takes an | |
156 | integer value of 0 or 1. | |
157 | ||
158 | Note that socket filters are not restricted to PF_PACKET sockets only, | |
159 | but can also be used on other socket families. | |
160 | ||
161 | Summary of system calls: | |
162 | ||
163 | * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val)); | |
164 | * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val)); | |
165 | * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val)); | |
166 | ||
167 | Normally, most use cases for socket filtering on packet sockets will be | |
168 | covered by libpcap in high-level syntax, so as an application developer | |
169 | you should stick to that. libpcap wraps its own layer around all that. | |
170 | ||
171 | However, if i) using/linking to libpcap is not an option, ii) the required BPF | |
172 | filters use Linux extensions that are not supported by libpcap's compiler, | |
173 | iii) a filter might be more complex and not cleanly implementable with | |
174 | libpcap's compiler, or iv) particular filter codes should be optimized | |
175 | differently than libpcap's internal compiler does, then | |
176 | writing such a filter "by hand" can be an alternative. For example, | |
177 | xt_bpf and cls_bpf users might have requirements that could result in | |
178 | more complex filter code, or one that cannot be expressed with libpcap | |
179 | (e.g. different return codes for various code paths). Moreover, BPF JIT | |
180 | implementors may wish to manually write test cases and thus need low-level | |
181 | access to BPF code as well. | |
182 | ||
183 | BPF engine and instruction set | |
184 | ------------------------------ | |
185 | ||
c246fd33 | 186 | Under tools/bpf/ there's a small helper tool called bpf_asm which can |
7924cd5e DB |
187 | be used to write low-level filters for example scenarios mentioned in the |
188 | previous section. Asm-like syntax mentioned here has been implemented in | |
189 | bpf_asm and will be used for further explanations (instead of dealing with | |
190 | less readable opcodes directly, principles are the same). The syntax is | |
191 | closely modelled after Steven McCanne's and Van Jacobson's BPF paper. | |
192 | ||
193 | The BPF architecture consists of the following basic elements: | |
194 | ||
cb3f0d56 | 195 | ======= ==================================================== |
7924cd5e | 196 | Element Description |
cb3f0d56 | 197 | ======= ==================================================== |
7924cd5e DB |
198 | A 32 bit wide accumulator |
199 | X 32 bit wide X register | |
200 | M[] 16 x 32 bit wide misc registers aka "scratch memory | |
cb3f0d56 MCC |
201 | store", addressable from 0 to 15 |
202 | ======= ==================================================== | |
7924cd5e DB |
203 | |
204 | A program that is translated by bpf_asm into "opcodes" is an array that |
cb3f0d56 | 205 | consists of the following elements (as already mentioned):: |
7924cd5e DB |
206 | |
207 | op:16, jt:8, jf:8, k:32 | |
208 | ||
209 | The element op is a 16 bit wide opcode that has a particular instruction | |
210 | encoded. jt and jf are two 8 bit wide jump targets, one for condition | |
211 | "jump if true", the other one "jump if false". Finally, element k | |
212 | contains a miscellaneous argument that can be interpreted in different | |
213 | ways depending on the given instruction in op. | |
214 | ||
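The 16 bit op field is itself a small bit field. The BPF_CLASS(), BPF_SIZE(),
BPF_MODE(), BPF_OP() and BPF_SRC() macros that <linux/filter.h> makes
available split it into an instruction class plus class-specific sub-fields.
The following is a minimal sketch, assuming the Linux UAPI headers and a C11
compiler. describe_op is a hypothetical helper, not a kernel function:

```c
#include <assert.h>
#include <stdio.h>
#include <linux/filter.h>	/* BPF_CLASS(), BPF_SIZE(), BPF_MODE(), ... */

/* Opcodes as they appear in the tcpdump -dd output earlier in this document. */
static_assert((BPF_LD | BPF_H | BPF_ABS) == 0x28, "ldh [k]");
static_assert((BPF_JMP | BPF_JEQ | BPF_K) == 0x15, "jeq #k");
static_assert((BPF_LDX | BPF_B | BPF_MSH) == 0xb1, "ldxb 4*([k]&0xf)");
static_assert((BPF_RET | BPF_K) == 0x06, "ret #k");

/* Hypothetical helper: print how one instruction's op field decomposes. */
void describe_op(const struct sock_filter *insn)
{
	unsigned int op = insn->code;

	switch (BPF_CLASS(op)) {
	case BPF_LD:
	case BPF_LDX:
		printf("load: size=0x%02x mode=0x%02x k=%u\n",
		       BPF_SIZE(op), BPF_MODE(op), insn->k);
		break;
	case BPF_JMP:
		printf("jump: op=0x%02x src=0x%02x jt=%u jf=%u k=%u\n",
		       BPF_OP(op), BPF_SRC(op),
		       (unsigned int)insn->jt, (unsigned int)insn->jf, insn->k);
		break;
	case BPF_RET:
		printf("ret: k=%u\n", insn->k);
		break;
	default:
		printf("other class 0x%02x\n", BPF_CLASS(op));
		break;
	}
}
```

For jump instructions jt and jf are only meaningful for the conditional
forms; the other classes leave both fields at zero.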
215 | The instruction set consists of load, store, branch, alu, miscellaneous | |
216 | and return instructions that are also represented in bpf_asm syntax. The | |
217 | following table lists all available bpf_asm instructions and what their underlying | |
218 | opcodes, as defined in linux/filter.h, stand for: | |
219 | ||
cb3f0d56 | 220 | =========== =================== ===================== |
7924cd5e | 221 | Instruction Addressing mode Description |
cb3f0d56 | 222 | =========== =================== ===================== |
31ce8c4a | 223 | ld 1, 2, 3, 4, 12 Load word into A |
7924cd5e DB |
224 | ldi 4 Load word into A |
225 | ldh 1, 2 Load half-word into A | |
226 | ldb 1, 2 Load byte into A | |
31ce8c4a | 227 | ldx 3, 4, 5, 12 Load word into X |
7924cd5e DB |
228 | ldxi 4 Load word into X |
229 | ldxb 5 Load byte into X | |
230 | ||
231 | st 3 Store A into M[] | |
232 | stx 3 Store X into M[] | |
233 | ||
234 | jmp 6 Jump to label | |
235 | ja 6 Jump to label | |
31ce8c4a AF |
236 | jeq 7, 8, 9, 10 Jump on A == <x> |
237 | jneq 9, 10 Jump on A != <x> | |
238 | jne 9, 10 Jump on A != <x> | |
239 | jlt 9, 10 Jump on A < <x> | |
240 | jle 9, 10 Jump on A <= <x> | |
241 | jgt 7, 8, 9, 10 Jump on A > <x> | |
242 | jge 7, 8, 9, 10 Jump on A >= <x> | |
243 | jset 7, 8, 9, 10 Jump on A & <x> | |
7924cd5e DB |
244 | |
245 | add 0, 4 A + <x> | |
246 | sub 0, 4 A - <x> | |
247 | mul 0, 4 A * <x> | |
248 | div 0, 4 A / <x> | |
249 | mod 0, 4 A % <x> | |
83d26b63 | 250 | neg -A |
7924cd5e DB |
251 | and 0, 4 A & <x> |
252 | or 0, 4 A | <x> | |
253 | xor 0, 4 A ^ <x> | |
254 | lsh 0, 4 A << <x> | |
255 | rsh 0, 4 A >> <x> | |
256 | ||
257 | tax Copy A into X | |
258 | txa Copy X into A | |
259 | ||
31ce8c4a | 260 | ret 4, 11 Return |
cb3f0d56 | 261 | =========== =================== ===================== |
7924cd5e DB |
262 | |
263 | The next table shows addressing formats from the 2nd column: | |
264 | ||
cb3f0d56 | 265 | =============== =================== =============================================== |
7924cd5e | 266 | Addressing mode Syntax Description |
cb3f0d56 | 267 | =============== =================== =============================================== |
7924cd5e DB |
268 | 0 x/%x Register X |
269 | 1 [k] BHW at byte offset k in the packet | |
270 | 2 [x + k] BHW at the offset X + k in the packet | |
271 | 3 M[k] Word at offset k in M[] | |
272 | 4 #k Literal value stored in k | |
273 | 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet | |
274 | 6 L Jump label L | |
275 | 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf | |
31ce8c4a AF |
276 | 8 x/%x,Lt,Lf Jump to Lt if true, otherwise jump to Lf |
277 | 9 #k,Lt Jump to Lt if predicate is true | |
278 | 10 x/%x,Lt Jump to Lt if predicate is true | |
279 | 11 a/%a Accumulator A | |
280 | 12 extension BPF extension | |
cb3f0d56 | 281 | =============== =================== =============================================== |
7924cd5e DB |
282 | |
283 | The Linux kernel also has a couple of BPF extensions that are used along | |
284 | with the class of load instructions by "overloading" the k argument with | |
285 | a negative offset + a particular extension offset. The results of such BPF | |
286 | extensions are loaded into A. | |
287 | ||
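In C, the "negative offset" is the SKF_AD_OFF constant from <linux/filter.h>,
and each extension has its own SKF_AD_* offset that is added to it. The
following sketch assumes that bpf_asm's ifidx and vlan_tci keywords correspond
to SKF_AD_IFINDEX and SKF_AD_VLAN_TAG. The variable names are illustrative
only:

```c
#include <assert.h>
#include <linux/filter.h>	/* SKF_AD_OFF, SKF_AD_IFINDEX, SKF_AD_VLAN_TAG */

/* "ld ifidx": an absolute word load whose k selects the ifindex extension. */
static const struct sock_filter ld_ifidx =
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (__u32)(SKF_AD_OFF + SKF_AD_IFINDEX));

/* "ld vlan_tci" is encoded the same way with a different extension offset. */
static const struct sock_filter ld_vlan_tci =
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (__u32)(SKF_AD_OFF + SKF_AD_VLAN_TAG));

/* SKF_AD_OFF is negative, so k wraps around to the top of the 32-bit range. */
static_assert((unsigned int)(SKF_AD_OFF + SKF_AD_IFINDEX) == 0xfffff008u,
	      "k value for ld ifidx");
```

Because no real packet offset can be that large, the kernel can tell these
loads apart from ordinary packet loads.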
288 | Possible BPF extensions are shown in the following table: | |
289 | ||
cb3f0d56 | 290 | =================================== ================================================= |
7924cd5e | 291 | Extension Description |
cb3f0d56 | 292 | =================================== ================================================= |
7924cd5e DB |
293 | len skb->len |
294 | proto skb->protocol | |
295 | type skb->pkt_type | |
296 | poff Payload start offset | |
297 | ifidx skb->dev->ifindex | |
298 | nla Netlink attribute of type X with offset A | |
299 | nlan Nested Netlink attribute of type X with offset A | |
300 | mark skb->mark | |
301 | queue skb->queue_mapping | |
302 | hatype skb->dev->type | |
b0db5cdf | 303 | rxhash skb->hash |
7924cd5e | 304 | cpu raw_smp_processor_id() |
df8a39de | 305 | vlan_tci skb_vlan_tag_get(skb) |
27cd5452 MS |
306 | vlan_avail skb_vlan_tag_present(skb) |
307 | vlan_tpid skb->vlan_proto | |
a251c17a | 308 | rand get_random_u32() |
cb3f0d56 | 309 | =================================== ================================================= |
7924cd5e DB |
310 | |
311 | These extensions can also be prefixed with '#'. | |
312 | Examples for low-level BPF: | |
313 | ||
cb3f0d56 | 314 | **ARP packets**:: |
7924cd5e DB |
315 | |
316 | ldh [12] | |
317 | jne #0x806, drop | |
318 | ret #-1 | |
319 | drop: ret #0 | |
320 | ||
cb3f0d56 | 321 | **IPv4 TCP packets**:: |
7924cd5e DB |
322 | |
323 | ldh [12] | |
324 | jne #0x800, drop | |
325 | ldb [23] | |
326 | jneq #6, drop | |
327 | ret #-1 | |
328 | drop: ret #0 | |
329 | ||
2551c2d1 | 330 | **ICMP random packet sampling, 1 in 4**:: |
cb3f0d56 | 331 | |
4cd3675e CG |
332 | ldh [12] |
333 | jne #0x800, drop | |
334 | ldb [23] | |
335 | jneq #1, drop | |
336 | # get a random uint32 number | |
337 | ld rand | |
338 | mod #4 | |
339 | jneq #1, drop | |
340 | ret #-1 | |
341 | drop: ret #0 | |
342 | ||
cb3f0d56 | 343 | **SECCOMP filter example**:: |
7924cd5e DB |
344 | |
345 | ld [4] /* offsetof(struct seccomp_data, arch) */ | |
346 | jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */ | |
347 | ld [0] /* offsetof(struct seccomp_data, nr) */ | |
348 | jeq #15, good /* __NR_rt_sigreturn */ | |
349 | jeq #231, good /* __NR_exit_group */ | |
350 | jeq #60, good /* __NR_exit */ | |
351 | jeq #0, good /* __NR_read */ | |
352 | jeq #1, good /* __NR_write */ | |
353 | jeq #5, good /* __NR_fstat */ | |
354 | jeq #9, good /* __NR_mmap */ | |
355 | jeq #14, good /* __NR_rt_sigprocmask */ | |
356 | jeq #13, good /* __NR_rt_sigaction */ | |
357 | jeq #35, good /* __NR_nanosleep */ | |
fd76875c | 358 | bad: ret #0 /* SECCOMP_RET_KILL_THREAD */ |
7924cd5e DB |
359 | good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */ |
360 | ||
88865347 RU |
361 | Examples for low-level BPF extension: |
362 | ||
363 | **Packet for interface index 13**:: | |
364 | ||
365 | ld ifidx | |
366 | jneq #13, drop | |
367 | ret #-1 | |
368 | drop: ret #0 | |
369 | ||
370 | **(Accelerated) VLAN w/ id 10**:: | |
371 | ||
372 | ld vlan_tci | |
373 | jneq #10, drop | |
374 | ret #-1 | |
375 | drop: ret #0 | |
376 | ||
7924cd5e DB |
377 | The above example code can be placed into a file (here called "foo"), and |
378 | then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf | |
379 | and cls_bpf understand and that can be loaded directly. Example with the above | |
cb3f0d56 | 380 | ARP code:: |
7924cd5e | 381 | |
cb3f0d56 MCC |
382 | $ ./bpf_asm foo |
383 | 4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, | |
7924cd5e | 384 | |
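Judging from the output above, the default format is a decimal,
comma-separated list: the instruction count first, then one "code jt jf k"
group per instruction. A minimal sketch of a parser for that format follows.
parse_bpf_decimal is a hypothetical helper written for this document, not part
of bpf_asm or the kernel:

```c
#include <assert.h>
#include <stdio.h>
#include <linux/filter.h>	/* struct sock_filter, __u8, __u16 */

/*
 * Hypothetical helper: parse "len,code jt jf k,..." into out[].
 * Returns the number of instructions, or -1 on malformed input.
 */
int parse_bpf_decimal(const char *s, struct sock_filter *out, int max)
{
	unsigned int len, code, jt, jf, k;
	int n = 0, i;

	assert(s && out && max >= 0);

	/* %n stays 0 if the comma after the count is missing. */
	if (sscanf(s, "%u,%n", &len, &n) != 1 || n == 0 || len > (unsigned int)max)
		return -1;
	s += n;

	for (i = 0; i < (int)len; i++) {
		n = 0;
		if (sscanf(s, "%u %u %u %u%n", &code, &jt, &jf, &k, &n) != 4 || n == 0)
			return -1;
		/* Reject values that do not fit the 16/8/8 bit fields. */
		if (code > 0xffff || jt > 0xff || jf > 0xff)
			return -1;
		out[i] = (struct sock_filter){ (__u16)code, (__u8)jt, (__u8)jf, k };
		s += n;
		if (*s == ',')
			s++;	/* separator, also present after the last group */
	}
	return (int)len;
}
```

Feeding it the string printed by ``./bpf_asm foo`` above should yield the
same four instructions that ``./bpf_asm -c foo`` prints in C form.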
cb3f0d56 | 385 | In copy and paste C-like output:: |
7924cd5e | 386 | |
cb3f0d56 MCC |
387 | $ ./bpf_asm -c foo |
388 | { 0x28, 0, 0, 0x0000000c }, | |
389 | { 0x15, 0, 1, 0x00000806 }, | |
390 | { 0x06, 0, 0, 0xffffffff }, | |
391 | { 0x06, 0, 0, 0000000000 }, | |
7924cd5e DB |
392 | |
393 | In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF | |
394 | filters that might not be obvious at first, it's good to test filters before | |
395 | attaching to a live system. For that purpose, there's a small tool called | |
c246fd33 | 396 | bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows |
7924cd5e DB |
397 | for testing BPF filters against given pcap files, single stepping through the |
398 | BPF code on the pcap's packets, and dumping BPF machine registers. | |
399 | ||
cb3f0d56 | 400 | Starting bpf_dbg is trivial and just requires issuing:: |
7924cd5e | 401 | |
cb3f0d56 | 402 | # ./bpf_dbg |
7924cd5e DB |
403 | |
404 | In case input and output do not equal stdin/stdout, bpf_dbg takes an | |
405 | alternative stdin source as a first argument, and an alternative stdout | |
406 | sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`. | |
407 | ||
408 | Other than that, a particular libreadline configuration can be set via | |
409 | file "~/.bpf_dbg_init" and the command history is stored in the file | |
410 | "~/.bpf_dbg_history". | |
411 | ||
412 | Interaction in bpf_dbg happens through a shell that also has auto-completion | |
413 | support (follow-up example commands starting with '>' denote bpf_dbg shell). | |
414 | The usual workflow would be to ... | |
415 | ||
cb3f0d56 | 416 | * load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 |
7924cd5e | 417 | Loads a BPF filter from standard output of bpf_asm, or transformed via |
cb3f0d56 | 418 | e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT |
7924cd5e DB |
419 | debugging (next section), this command creates a temporary socket and |
420 | loads the BPF code into the kernel. Thus, this will also be useful for | |
421 | JIT developers. | |
422 | ||
cb3f0d56 MCC |
423 | * load pcap foo.pcap |
424 | ||
7924cd5e DB |
425 | Loads standard tcpdump pcap file. |
426 | ||
cb3f0d56 MCC |
427 | * run [<n>] |
428 | ||
7924cd5e DB |
429 | bpf passes:1 fails:9 |
430 | Runs through all packets from a pcap to count how many passes and fails | |
431 | the filter will generate. A limit of packets to traverse can be given. | |
432 | ||
cb3f0d56 MCC |
433 | * disassemble:: |
434 | ||
435 | l0: ldh [12] | |
436 | l1: jeq #0x800, l2, l5 | |
437 | l2: ldb [23] | |
438 | l3: jeq #0x1, l4, l5 | |
439 | l4: ret #0xffff | |
440 | l5: ret #0 | |
441 | ||
7924cd5e DB |
442 | Prints out BPF code disassembly. |
443 | ||
cb3f0d56 MCC |
444 | * dump:: |
445 | ||
446 | /* { op, jt, jf, k }, */ | |
447 | { 0x28, 0, 0, 0x0000000c }, | |
448 | { 0x15, 0, 3, 0x00000800 }, | |
449 | { 0x30, 0, 0, 0x00000017 }, | |
450 | { 0x15, 0, 1, 0x00000001 }, | |
451 | { 0x06, 0, 0, 0x0000ffff }, | |
452 | { 0x06, 0, 0, 0000000000 }, | |
453 | ||
7924cd5e DB |
454 | Prints out C-style BPF code dump. |
455 | ||
cb3f0d56 MCC |
456 | * breakpoint 0:: |
457 | ||
458 | breakpoint at: l0: ldh [12] | |
459 | ||
460 | * breakpoint 1:: | |
461 | ||
462 | breakpoint at: l1: jeq #0x800, l2, l5 | |
463 | ||
7924cd5e | 464 | ... |
cb3f0d56 | 465 | |
7924cd5e DB |
466 | Sets breakpoints at particular BPF instructions. Issuing a `run` command |
467 | will walk through the pcap file continuing from the current packet and | |
468 | break when a breakpoint is being hit (another `run` will continue from | |
469 | the currently active breakpoint executing next instructions): | |
470 | ||
cb3f0d56 MCC |
471 | * run:: |
472 | ||
473 | -- register dump -- | |
474 | pc: [0] <-- program counter | |
475 | code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction | |
476 | curr: l0: ldh [12] <-- disassembly of current instruction | |
477 | A: [00000000][0] <-- content of A (hex, decimal) | |
478 | X: [00000000][0] <-- content of X (hex, decimal) | |
479 | M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) | |
480 | -- packet dump -- <-- Current packet from pcap (hex) | |
481 | len: 42 | |
482 | 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 | |
483 | 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 | |
484 | 32: 00 00 00 00 00 00 0a 3b 01 01 | |
485 | (breakpoint) | |
486 | > | |
487 | ||
488 | * breakpoint:: | |
489 | ||
490 | breakpoints: 0 1 | |
491 | ||
492 | Prints currently set breakpoints. | |
493 | ||
494 | * step [-<n>, +<n>] | |
495 | ||
7924cd5e DB |
496 | Performs single stepping through the BPF program from the current pc |
497 | offset. Thus, on each step invocation, the above register dump is issued. | |
498 | This can go forwards and backwards in time, a plain `step` will break | |
499 | on the next BPF instruction, thus +1. (No `run` needs to be issued here.) | |
500 | ||
cb3f0d56 MCC |
501 | * select <n> |
502 | ||
7924cd5e DB |
503 | Selects a given packet from the pcap file to continue from. Thus, on |
504 | the next `run` or `step`, the BPF program is being evaluated against | |
505 | the user pre-selected packet. Numbering starts just as in Wireshark | |
506 | with index 1. | |
507 | ||
cb3f0d56 MCC |
508 | * quit |
509 | ||
7924cd5e DB |
510 | Exits bpf_dbg. |
511 | ||
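To make the pass/fail result and the meaning of jt and jf more concrete, the
following is a simplified user-space sketch of an interpreter. It supports
only the four instruction forms used in the bpf_dbg example above (ldh [k],
ldb [k], jeq #k and ret #k), and treats anything else as a drop. The names
run_cbpf_subset, icmp_filter and arp_pkt are illustrative. The program is the
one shown by the dump command and the packet is the one from the register
dump:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>

/* The filter from the bpf_dbg "dump" output: pass IPv4 ICMP only. */
static const struct sock_filter icmp_filter[] = {
	{ 0x28, 0, 0, 0x0000000c },	/* l0: ldh [12] */
	{ 0x15, 0, 3, 0x00000800 },	/* l1: jeq #0x800, l2, l5 */
	{ 0x30, 0, 0, 0x00000017 },	/* l2: ldb [23] */
	{ 0x15, 0, 1, 0x00000001 },	/* l3: jeq #0x1, l4, l5 */
	{ 0x06, 0, 0, 0x0000ffff },	/* l4: ret #0xffff */
	{ 0x06, 0, 0, 0x00000000 },	/* l5: ret #0 */
};

/* The 42 byte ARP packet from the bpf_dbg packet dump. */
static const uint8_t arp_pkt[] = {
	0x00, 0x19, 0xcb, 0x55, 0x55, 0xa4, 0x00, 0x14, 0xa4, 0x43, 0x78, 0x69,
	0x08, 0x06, 0x00, 0x01, 0x08, 0x00, 0x06, 0x04, 0x00, 0x01, 0x00, 0x14,
	0xa4, 0x43, 0x78, 0x69, 0x0a, 0x3b, 0x01, 0x26, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x0a, 0x3b, 0x01, 0x01,
};

/* Returns the filter's return value; 0 means "drop". */
uint32_t run_cbpf_subset(const struct sock_filter *insn, size_t len,
			 const uint8_t *pkt, size_t pkt_len)
{
	uint32_t A = 0;
	size_t pc = 0;

	assert(insn && pkt);
	while (pc < len) {
		const struct sock_filter *i = &insn[pc++];	/* pc: next insn */

		switch (i->code) {
		case BPF_LD | BPF_H | BPF_ABS:	/* big-endian half-word */
			if (i->k > pkt_len || pkt_len - i->k < 2)
				return 0;	/* load beyond packet end */
			A = (uint32_t)pkt[i->k] << 8 | pkt[i->k + 1];
			break;
		case BPF_LD | BPF_B | BPF_ABS:
			if (i->k >= pkt_len)
				return 0;
			A = pkt[i->k];
			break;
		case BPF_JMP | BPF_JEQ | BPF_K:
			/* jt/jf are offsets relative to the next instruction. */
			pc += (A == i->k) ? i->jt : i->jf;
			break;
		case BPF_RET | BPF_K:
			return i->k;
		default:
			return 0;	/* unsupported in this sketch */
		}
	}
	return 0;
}
```

Running icmp_filter over arp_pkt takes the jf branch at l1 (the EtherType is
0x0806, not 0x0800) and ends at l5, so the packet counts as a fail.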
512 | JIT compiler | |
513 | ------------ | |
514 | ||
e8cb0167 BT |
515 | The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, |
516 | PowerPC, ARM, ARM64, MIPS, RISC-V and s390, which can be enabled through | |
517 | CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each | |
518 | attached filter from user space or for internal kernel users if it has | |
cb3f0d56 | 519 | been previously enabled by root:: |
7924cd5e DB |
520 | |
521 | echo 1 > /proc/sys/net/core/bpf_jit_enable | |
522 | ||
523 | For JIT developers, doing audits etc, each compile run can output the generated | |
cb3f0d56 | 524 | opcode image into the kernel log via:: |
7924cd5e DB |
525 | |
526 | echo 2 > /proc/sys/net/core/bpf_jit_enable | |
527 | ||
cb3f0d56 | 528 | Example output from dmesg:: |
7924cd5e | 529 | |
cb3f0d56 MCC |
530 | [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f |
531 | [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 | |
532 | [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 | |
533 | [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 | |
534 | [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 | |
535 | [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 | |
7924cd5e | 536 | |
2c25fc9a LY |
537 | When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and |
538 | setting any other value than that will result in failure. This is even the case for | |
539 | setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log | |
540 | is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the | |
541 | generally recommended approach instead. | |
542 | ||
c246fd33 | 543 | In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for |
cb3f0d56 MCC |
544 | generating disassembly out of the kernel log's hexdump:: |
545 | ||
546 | # ./bpf_jit_disasm | |
547 | 70 bytes emitted from JIT compiler (pass:3, flen:6) | |
548 | ffffffffa0069c8f + <x>: | |
549 | 0: push %rbp | |
550 | 1: mov %rsp,%rbp | |
551 | 4: sub $0x60,%rsp | |
552 | 8: mov %rbx,-0x8(%rbp) | |
553 | c: mov 0x68(%rdi),%r9d | |
554 | 10: sub 0x6c(%rdi),%r9d | |
555 | 14: mov 0xd8(%rdi),%r8 | |
556 | 1b: mov $0xc,%esi | |
557 | 20: callq 0xffffffffe0ff9442 | |
558 | 25: cmp $0x800,%eax | |
559 | 2a: jne 0x0000000000000042 | |
560 | 2c: mov $0x17,%esi | |
561 | 31: callq 0xffffffffe0ff945e | |
562 | 36: cmp $0x1,%eax | |
563 | 39: jne 0x0000000000000042 | |
564 | 3b: mov $0xffff,%eax | |
565 | 40: jmp 0x0000000000000044 | |
566 | 42: xor %eax,%eax | |
567 | 44: leaveq | |
568 | 45: retq | |
569 | ||
570 | Issuing option `-o` will "annotate" opcodes to resulting assembler | |
571 | instructions, which can be very useful for JIT developers: | |
572 | ||
573 | # ./bpf_jit_disasm -o | |
574 | 70 bytes emitted from JIT compiler (pass:3, flen:6) | |
575 | ffffffffa0069c8f + <x>: | |
576 | 0: push %rbp | |
577 | 55 | |
578 | 1: mov %rsp,%rbp | |
579 | 48 89 e5 | |
580 | 4: sub $0x60,%rsp | |
581 | 48 83 ec 60 | |
582 | 8: mov %rbx,-0x8(%rbp) | |
583 | 48 89 5d f8 | |
584 | c: mov 0x68(%rdi),%r9d | |
585 | 44 8b 4f 68 | |
586 | 10: sub 0x6c(%rdi),%r9d | |
587 | 44 2b 4f 6c | |
588 | 14: mov 0xd8(%rdi),%r8 | |
589 | 4c 8b 87 d8 00 00 00 | |
590 | 1b: mov $0xc,%esi | |
591 | be 0c 00 00 00 | |
592 | 20: callq 0xffffffffe0ff9442 | |
593 | e8 1d 94 ff e0 | |
594 | 25: cmp $0x800,%eax | |
595 | 3d 00 08 00 00 | |
596 | 2a: jne 0x0000000000000042 | |
597 | 75 16 | |
598 | 2c: mov $0x17,%esi | |
599 | be 17 00 00 00 | |
600 | 31: callq 0xffffffffe0ff945e | |
601 | e8 28 94 ff e0 | |
602 | 36: cmp $0x1,%eax | |
603 | 83 f8 01 | |
604 | 39: jne 0x0000000000000042 | |
605 | 75 07 | |
606 | 3b: mov $0xffff,%eax | |
607 | b8 ff ff 00 00 | |
608 | 40: jmp 0x0000000000000044 | |
609 | eb 02 | |
610 | 42: xor %eax,%eax | |
611 | 31 c0 | |
612 | 44: leaveq | |
613 | c9 | |
614 | 45: retq | |
615 | c3 | |
7924cd5e DB |
616 | |
617 | For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful | |
618 | toolchain for developing and testing the kernel's JIT compiler. | |
619 | ||
9a985cdc AS |
620 | BPF kernel internals |
621 | -------------------- | |
e4ad4032 | 622 | Internally, for the kernel interpreter, a different instruction set |
9a985cdc AS |
623 | format with similar underlying principles from BPF described in previous |
624 | paragraphs is being used. However, the instruction set format is modelled | |
625 | closer to the underlying architecture to mimic native instruction sets, so | |
e4ad4032 | 626 | that a better performance can be achieved (more details later). This new |
88691e9e | 627 | ISA is called eBPF. See ../bpf/index.rst for details. (Note: eBPF which |
e4ad4032 AS |
628 | originates from [e]xtended BPF is not the same as BPF extensions! While |
629 | eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' | |
630 | of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) | |
9a985cdc | 631 | |
9a985cdc | 632 | The new instruction set was originally designed with the possible goal in |
e4ad4032 | 633 | mind to write programs in "restricted C" and compile into eBPF with an optional |
9a985cdc | 634 | GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with |
e4ad4032 | 635 | minimal performance overhead over two steps, that is, C -> eBPF -> native code. |
9a985cdc AS |
636 | |
637 | Currently, the new format is being used for running user BPF programs, which | |
638 | includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, | |
639 | team driver's classifier for its load-balancing mode, netfilter's xt_bpf | |
640 | extension, PTP dissector/classifier, and much more. They are all internally | |
641 | converted by the kernel into the new instruction set representation and run | |
e4ad4032 | 642 | in the eBPF interpreter. For in-kernel handlers, this all works transparently |
7ae457c1 | 643 | by using bpf_prog_create() for setting up the filter and |
fb7dd8bc AN |
644 | bpf_prog_destroy() for destroying it. The function |
645 | bpf_prog_run(filter, ctx) transparently invokes the eBPF interpreter or JITed | |
7ae457c1 AS |
646 | code to run the filter. 'filter' is a pointer to struct bpf_prog that we |
647 | got from bpf_prog_create(), and 'ctx' the given context (e.g. | |
4df95ff4 | 648 | skb pointer). All constraints and restrictions from bpf_check_classic() apply |
e4ad4032 AS |
649 | before a conversion to the new layout is being done behind the scenes! |
650 | ||
e8cb0167 BT |
651 | Currently, the classic BPF format is being used for JITing on most |
652 | 32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64, | |
06b74152 | 653 | sparc64, arm32, riscv64, riscv32 perform JIT compilation from the eBPF |
e8cb0167 | 654 | instruction set. |
9a985cdc | 655 | |
04caa489 DB |
656 | Testing |
657 | ------- | |
658 | ||
659 | Next to the BPF toolchain, the kernel also ships a test module that contains | |
06edc59c | 660 | various test cases for classic and eBPF that can be executed against |
04caa489 | 661 | the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and |
cb3f0d56 | 662 | enabled via Kconfig:: |
04caa489 DB |
663 | |
664 | CONFIG_TEST_BPF=m | |
665 | ||
666 | After the module has been built and installed, the test suite can be executed | |
667 | via insmod or modprobe against the 'test_bpf' module. Results of the test cases | |
668 | including timings in nsec can be found in the kernel log (dmesg). | |
669 | ||
7924cd5e DB |
670 | Misc |
671 | ---- | |
672 | ||
673 | Also trinity, the Linux syscall fuzzer, has built-in support for BPF and | |
674 | SECCOMP-BPF kernel fuzzing. | |
675 | ||
676 | Written by | |
677 | ---------- | |
678 | ||
679 | The document was written in the hope that it is found useful and in order | |
680 | to give potential BPF hackers or security auditors a better overview of | |
681 | the underlying architecture. | |
682 | ||
cb3f0d56 MCC |
683 | - Jay Schulist <jschlst@samba.org> |
684 | - Daniel Borkmann <daniel@iogearbox.net> | |
685 | - Alexei Starovoitov <ast@kernel.org> |