Commit | Line | Data |
---|---|---|
1 | Linux Socket Filtering aka Berkeley Packet Filter (BPF) |
2 | ======================================================= | |
3 | |
4 | Introduction | |
5 | ------------ |
6 | ||
7 | Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter. | |
8 | Although there are some distinct differences between BSD and Linux | |
9 | kernel filtering, when we speak of BPF or LSF in the Linux context, we | |
10 | mean the very same filtering mechanism in the Linux kernel. | |
11 | ||
12 | BPF allows a user-space program to attach a filter onto any socket and | |
13 | allow or disallow certain types of data to come through the socket. LSF | |
14 | follows exactly the same filter code structure as BSD's BPF, so referring | |
15 | to the BSD bpf.4 manpage is very helpful in creating filters. | |
16 | ||
17 | On Linux, BPF is much simpler than on BSD. One does not have to worry | |
18 | about devices or anything like that. You simply create your filter code, | |
19 | send it to the kernel via the SO_ATTACH_FILTER option and if your filter | |
20 | code passes the kernel check on it, you then immediately begin filtering | |
21 | data on that socket. | |
22 | ||
23 | You can also detach filters from your socket via the SO_DETACH_FILTER | |
24 | option. This will probably not be used much, since when you close a socket | |
25 | that has a filter on it, the filter is automagically removed. The other, | |
26 | less common case may be attaching a different filter to a socket that | |
27 | still has another filter running: the kernel takes care of | |
28 | removing the old one and putting your new one in its place, assuming your | |
29 | filter has passed the checks; if it fails them, the old filter will | |
30 | remain on that socket. | |
31 | ||
32 | The SO_LOCK_FILTER option allows locking the filter attached to a socket. Once | |
33 | set, a filter cannot be removed or changed. This allows one process to | |
34 | set up a socket, attach a filter, lock it, then drop privileges and be | |
35 | assured that the filter will be kept until the socket is closed. | |
36 | ||
37 | The biggest user of this construct might be libpcap. Issuing a high-level | |
38 | filter command like `tcpdump -i em1 port 22` passes through the libpcap | |
39 | internal compiler that generates a structure that can eventually be loaded | |
40 | via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd` | |
41 | displays what is being placed into this structure. | |
42 | ||
43 | Although we were only speaking about sockets here, BPF in Linux is used | |
44 | in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel | |
45 | qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places | |
46 | such as the team driver, PTP code, etc., where BPF is being used. | |
47 | ||
48 | [1] Documentation/prctl/seccomp_filter.txt | |
49 | ||
50 | Original BPF paper: | |
51 | ||
52 | Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new | |
53 | architecture for user-level packet capture. In Proceedings of the | |
54 | USENIX Winter 1993 Conference (USENIX'93). USENIX Association, | |
55 | Berkeley, CA, USA, 2-2. | |
56 | [http://www.tcpdump.org/papers/bpf-usenix93.pdf] | |
57 | ||
58 | Structure | |
59 | --------- | |
60 | ||
61 | User space applications include <linux/filter.h> which contains the | |
62 | following relevant structures: | |
63 | ||
64 | struct sock_filter {    /* Filter block */ | |
65 |     __u16 code;         /* Actual filter code */ | |
66 |     __u8  jt;           /* Jump true */ | |
67 |     __u8  jf;           /* Jump false */ | |
68 |     __u32 k;            /* Generic multiuse field */ | |
69 | }; | |
70 | ||
71 | Such a structure is assembled as an array of 4-tuples, each of which | |
72 | contains a code, jt, jf and k value. jt and jf are jump offsets and k is a | |
73 | generic value to be used for a provided code. | |
74 | ||
75 | struct sock_fprog {                     /* Required for SO_ATTACH_FILTER. */ | |
76 |     unsigned short len;                 /* Number of filter blocks */ | |
77 |     struct sock_filter __user *filter; | |
78 | }; | |
79 | ||
80 | For socket filtering, a pointer to this structure (as shown in the | |
81 | follow-up example) is passed to the kernel through setsockopt(2). | |
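
As a rough illustration of how the two structures fit together, the sketch
below fills a sock_filter array with the BPF_STMT() and BPF_JUMP() helper
macros that <linux/filter.h> also provides, and wraps it in a sock_fprog.
The filter itself (accept ARP frames, drop everything else) and the names
arp_insns and arp_prog are assumptions chosen for illustration only.

```c
#include <linux/filter.h>

/* Hypothetical example filter: accept ARP frames, drop the rest. */
static struct sock_filter arp_insns[] = {
    /* A = half-word at byte offset 12 (the EtherType field) */
    BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
    /* if (A == 0x0806) go on with the next block, else skip one block */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x0806, 0, 1),
    /* accept: keep up to 0xffffffff bytes of the packet */
    BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
    /* drop: keep 0 bytes */
    BPF_STMT(BPF_RET | BPF_K, 0),
};

/* len counts filter blocks (array entries), not bytes. */
static struct sock_fprog arp_prog = {
    .len = sizeof(arp_insns) / sizeof(arp_insns[0]),
    .filter = arp_insns,
};
```

A pointer to arp_prog, together with sizeof(arp_prog), is what would be
handed to setsockopt(2) for SO_ATTACH_FILTER.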
82 | ||
83 | Example | |
84 | ------- | |
85 | ||
86 | #include <sys/socket.h> | |
87 | #include <sys/types.h> | |
88 | #include <arpa/inet.h> | |
89 | #include <linux/if_ether.h> | |
90 | /* ... */ | |
91 | ||
92 | /* From the example above: tcpdump -i em1 port 22 -dd */ | |
93 | struct sock_filter code[] = { | |
94 |     { 0x28,  0,  0, 0x0000000c }, | |
95 |     { 0x15,  0,  8, 0x000086dd }, | |
96 |     { 0x30,  0,  0, 0x00000014 }, | |
97 |     { 0x15,  2,  0, 0x00000084 }, | |
98 |     { 0x15,  1,  0, 0x00000006 }, | |
99 |     { 0x15,  0, 17, 0x00000011 }, | |
100 |     { 0x28,  0,  0, 0x00000036 }, | |
101 |     { 0x15, 14,  0, 0x00000016 }, | |
102 |     { 0x28,  0,  0, 0x00000038 }, | |
103 |     { 0x15, 12, 13, 0x00000016 }, | |
104 |     { 0x15,  0, 12, 0x00000800 }, | |
105 |     { 0x30,  0,  0, 0x00000017 }, | |
106 |     { 0x15,  2,  0, 0x00000084 }, | |
107 |     { 0x15,  1,  0, 0x00000006 }, | |
108 |     { 0x15,  0,  8, 0x00000011 }, | |
109 |     { 0x28,  0,  0, 0x00000014 }, | |
110 |     { 0x45,  6,  0, 0x00001fff }, | |
111 |     { 0xb1,  0,  0, 0x0000000e }, | |
112 |     { 0x48,  0,  0, 0x0000000e }, | |
113 |     { 0x15,  2,  0, 0x00000016 }, | |
114 |     { 0x48,  0,  0, 0x00000010 }, | |
115 |     { 0x15,  0,  1, 0x00000016 }, | |
116 |     { 0x06,  0,  0, 0x0000ffff }, | |
117 |     { 0x06,  0,  0, 0x00000000 }, | |
118 | }; | |
119 | ||
120 | struct sock_fprog bpf = { | |
121 |     .len = sizeof(code) / sizeof(code[0]), /* number of filter blocks */ | |
122 |     .filter = code, | |
123 | }; | |
124 | ||
125 | sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); | |
126 | if (sock < 0) | |
127 |     /* ... bail out ... */ | |
128 | ||
129 | ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); | |
130 | if (ret < 0) | |
131 |     /* ... bail out ... */ | |
132 | ||
133 | /* ... */ | |
134 | close(sock); | |
135 | ||
136 | The above example code attaches a socket filter for a PF_PACKET socket | |
137 | in order to let all IPv4/IPv6 packets with port 22 pass. The rest will | |
138 | be dropped for this socket. | |
139 | ||
140 | The setsockopt(2) call for SO_DETACH_FILTER doesn't need any arguments, | |
141 | while SO_LOCK_FILTER, which prevents the filter from being detached, takes | |
142 | an integer value of 0 or 1. | |
143 | ||
144 | Note that socket filters are not restricted to PF_PACKET sockets only, | |
145 | but can also be used on other socket families. | |
146 | ||
147 | Summary of system calls: | |
148 | ||
149 | * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val)); | |
150 | * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val)); | |
151 | * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val)); | |
152 | ||
153 | Normally, most use cases for socket filtering on packet sockets will be | |
154 | covered by libpcap in high-level syntax, so as an application developer | |
155 | you should stick to that. libpcap wraps its own layer around all that. | |
156 | ||
157 | Writing such a filter "by hand" can be an alternative when i) using/linking | |
158 | to libpcap is not an option, ii) the required BPF filters use Linux | |
159 | extensions that are not supported by libpcap's compiler, iii) a filter is | |
160 | more complex and not cleanly implementable with libpcap's compiler, or | |
161 | iv) particular filter codes should be optimized differently than libpcap's | |
162 | internal compiler does. For example, | |
163 | xt_bpf and cls_bpf users might have requirements that could result in | |
164 | more complex filter code, or one that cannot be expressed with libpcap | |
165 | (e.g. different return codes for various code paths). Moreover, BPF JIT | |
166 | implementors may wish to manually write test cases and thus need low-level | |
167 | access to BPF code as well. | |
168 | ||
169 | BPF engine and instruction set | |
170 | ------------------------------ | |
171 | ||
172 | Under tools/net/ there's a small helper tool called bpf_asm which can | |
173 | be used to write low-level filters for the example scenarios mentioned in the | |
174 | previous section. The asm-like syntax mentioned here has been implemented in | |
175 | bpf_asm and will be used for further explanations (instead of dealing with | |
176 | less readable opcodes directly; the principles are the same). The syntax is | |
177 | closely modelled after Steven McCanne's and Van Jacobson's BPF paper. | |
178 | ||
179 | The BPF architecture consists of the following basic elements: | |
180 | ||
181 |   Element          Description | |
182 | ||
183 |   A                32 bit wide accumulator | |
184 |   X                32 bit wide X register | |
185 |   M[]              16 x 32 bit wide misc registers aka "scratch memory | |
186 |                    store", addressable from 0 to 15 | |
187 | ||
188 | A program that is translated by bpf_asm into "opcodes" is an array whose | |
189 | entries consist of the following elements (as already mentioned): | |
190 | ||
191 | op:16, jt:8, jf:8, k:32 | |
192 | ||
193 | The element op is a 16 bit wide opcode that has a particular instruction | |
194 | encoded. jt and jf are two 8 bit wide jump targets, one for condition | |
195 | "jump if true", the other one "jump if false". Finally, element k | |
196 | contains a miscellaneous argument that can be interpreted in different | |
197 | ways depending on the given instruction in op. | |
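
To make the roles of op, jt, jf and k more concrete, here is a deliberately
tiny interpreter sketch. It is not the kernel's implementation: the name
toy_bpf_run() is hypothetical, and only four opcodes (absolute half-word
and byte loads, jeq against k, and ret with a constant) are handled. It
assumes that jt and jf are counted relative to the instruction following
the jump, which is consistent with the opcode listings shown elsewhere in
this document.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>

/* Run 'prog' (len entries) over 'pkt' and return the filter's verdict:
 * the number of bytes to keep, where 0 means "drop". Unknown opcodes
 * and out-of-bounds loads drop the packet. */
static uint32_t toy_bpf_run(const struct sock_filter *prog, size_t len,
                            const uint8_t *pkt, size_t pkt_len)
{
    uint32_t A = 0;     /* accumulator */
    size_t pc = 0;      /* index of the next instruction */

    while (pc < len) {
        const struct sock_filter *insn = &prog[pc++];
        uint32_t k = insn->k;

        switch (insn->code) {
        case BPF_LD | BPF_H | BPF_ABS:      /* ldh [k] */
            if (k > pkt_len || pkt_len - k < 2)
                return 0;
            A = (uint32_t)pkt[k] << 8 | pkt[k + 1];
            break;
        case BPF_LD | BPF_B | BPF_ABS:      /* ldb [k] */
            if (k >= pkt_len)
                return 0;
            A = pkt[k];
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:     /* jeq #k, jt, jf */
            /* pc already points past the jump, so add the offset */
            pc += (A == k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:               /* ret #k */
            return k;
        default:
            return 0;
        }
    }
    return 0;   /* ran off the end without a ret */
}
```

Fed with a program that loads the half-word at offset 12 and compares it
against 0x806, this returns a non-zero verdict for an ARP frame and 0 for
any other frame type.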
198 | ||
199 | The instruction set consists of load, store, branch, alu, miscellaneous | |
200 | and return instructions that are also represented in bpf_asm syntax. This | |
201 | table lists all available bpf_asm instructions and what their underlying | |
202 | opcodes, as defined in linux/filter.h, stand for: | |
203 | ||
204 |   Instruction      Addressing mode      Description | |
205 | ||
206 |   ld               1, 2, 3, 4, 10       Load word into A | |
207 |   ldi              4                    Load word into A | |
208 |   ldh              1, 2                 Load half-word into A | |
209 |   ldb              1, 2                 Load byte into A | |
210 |   ldx              3, 4, 5, 10          Load word into X | |
211 |   ldxi             4                    Load word into X | |
212 |   ldxb             5                    Load byte into X | |
213 | ||
214 |   st               3                    Store A into M[] | |
215 |   stx              3                    Store X into M[] | |
216 | ||
217 |   jmp              6                    Jump to label | |
218 |   ja               6                    Jump to label | |
219 |   jeq              7, 8                 Jump on A == k | |
220 |   jneq             8                    Jump on A != k | |
221 |   jne              8                    Jump on A != k | |
222 |   jlt              8                    Jump on A <  k | |
223 |   jle              8                    Jump on A <= k | |
224 |   jgt              7, 8                 Jump on A >  k | |
225 |   jge              7, 8                 Jump on A >= k | |
226 |   jset             7, 8                 Jump on A &  k | |
227 | ||
228 |   add              0, 4                 A + <x> | |
229 |   sub              0, 4                 A - <x> | |
230 |   mul              0, 4                 A * <x> | |
231 |   div              0, 4                 A / <x> | |
232 |   mod              0, 4                 A % <x> | |
233 |   neg                                   -A | |
234 |   and              0, 4                 A & <x> | |
235 |   or               0, 4                 A | <x> | |
236 |   xor              0, 4                 A ^ <x> | |
237 |   lsh              0, 4                 A << <x> | |
238 |   rsh              0, 4                 A >> <x> | |
239 | ||
240 |   tax                                   Copy A into X | |
241 |   txa                                   Copy X into A | |
242 | ||
243 |   ret              4, 9                 Return | |
244 | ||
245 | The next table shows addressing formats from the 2nd column: | |
246 | ||
247 |   Addressing mode  Syntax               Description | |
248 | ||
249 |    0               x/%x                 Register X | |
250 |    1               [k]                  BHW at byte offset k in the packet | |
251 |    2               [x + k]              BHW at the offset X + k in the packet | |
252 |    3               M[k]                 Word at offset k in M[] | |
253 |    4               #k                   Literal value stored in k | |
254 |    5               4*([k]&0xf)          Lower nibble * 4 at byte offset k in the packet | |
255 |    6               L                    Jump label L | |
256 |    7               #k,Lt,Lf             Jump to Lt if true, otherwise jump to Lf | |
257 |    8               #k,Lt                Jump to Lt if predicate is true | |
258 |    9               a/%a                 Accumulator A | |
259 |   10               extension            BPF extension | |
260 | ||
261 | The Linux kernel also has a couple of BPF extensions that are used along | |
262 | with the class of load instructions by "overloading" the k argument with | |
263 | a negative offset + a particular extension offset. The results of such BPF | |
264 | extensions are loaded into A. | |
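
In C, this "overloading" can be sketched with the SKF_AD_OFF base and the
SKF_AD_* offsets from <linux/filter.h>. The example assumes that
SKF_AD_VLAN_TAG is the constant behind the vlan_tci extension listed in the
table that follows; the array name vlan10_insns is made up, and the filter
only roughly mirrors "load the VLAN tag, accept if it equals 10".

```c
#include <linux/filter.h>

/* Sketch: k holds a negative base (SKF_AD_OFF) plus an extension
 * offset, converted to the unsigned 32-bit type of the k field. */
static const struct sock_filter vlan10_insns[] = {
    /* A = result of the VLAN tag extension (assumed: vlan_tci) */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             (__u32)(SKF_AD_OFF + SKF_AD_VLAN_TAG)),
    /* if (A == 10) fall into "accept", else skip to "drop" */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 10, 0, 1),
    /* accept */
    BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
    /* drop */
    BPF_STMT(BPF_RET | BPF_K, 0),
};
```

The opcode of the first block is an ordinary absolute word load; only the
value of k tells the kernel that an extension is meant rather than a packet
offset.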
265 | ||
266 | Possible BPF extensions are shown in the following table: | |
267 | ||
268 |   Extension                             Description | |
269 | ||
270 |   len                                   skb->len | |
271 |   proto                                 skb->protocol | |
272 |   type                                  skb->pkt_type | |
273 |   poff                                  Payload start offset | |
274 |   ifidx                                 skb->dev->ifindex | |
275 |   nla                                   Netlink attribute of type X with offset A | |
276 |   nlan                                  Nested Netlink attribute of type X with offset A | |
277 |   mark                                  skb->mark | |
278 |   queue                                 skb->queue_mapping | |
279 |   hatype                                skb->dev->type | |
280 |   rxhash                                skb->hash | |
281 |   cpu                                   raw_smp_processor_id() | |
282 |   vlan_tci                              vlan_tx_tag_get(skb) | |
283 |   vlan_pr                               vlan_tx_tag_present(skb) | |
284 |   rand                                  prandom_u32() | |
285 | ||
286 | These extensions can also be prefixed with '#'. | |
287 | Examples for low-level BPF: | |
288 | ||
289 | ** ARP packets: | |
290 | ||
291 | ldh [12] | |
292 | jne #0x806, drop | |
293 | ret #-1 | |
294 | drop: ret #0 | |
295 | ||
296 | ** IPv4 TCP packets: | |
297 | ||
298 | ldh [12] | |
299 | jne #0x800, drop | |
300 | ldb [23] | |
301 | jneq #6, drop | |
302 | ret #-1 | |
303 | drop: ret #0 | |
304 | ||
305 | ** (Accelerated) VLAN w/ id 10: | |
306 | ||
307 | ld vlan_tci | |
308 | jneq #10, drop | |
309 | ret #-1 | |
310 | drop: ret #0 | |
311 | ||
312 | ** ICMP random packet sampling, 1 in 4: | |
313 | ldh [12] | |
314 | jne #0x800, drop | |
315 | ldb [23] | |
316 | jneq #1, drop | |
317 | # get a random uint32 number | |
318 | ld rand | |
319 | mod #4 | |
320 | jneq #1, drop | |
321 | ret #-1 | |
322 | drop: ret #0 | |
323 | ||
324 | ** SECCOMP filter example: |
325 | ||
326 | ld [4] /* offsetof(struct seccomp_data, arch) */ | |
327 | jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */ | |
328 | ld [0] /* offsetof(struct seccomp_data, nr) */ | |
329 | jeq #15, good /* __NR_rt_sigreturn */ | |
330 | jeq #231, good /* __NR_exit_group */ | |
331 | jeq #60, good /* __NR_exit */ | |
332 | jeq #0, good /* __NR_read */ | |
333 | jeq #1, good /* __NR_write */ | |
334 | jeq #5, good /* __NR_fstat */ | |
335 | jeq #9, good /* __NR_mmap */ | |
336 | jeq #14, good /* __NR_rt_sigprocmask */ | |
337 | jeq #13, good /* __NR_rt_sigaction */ | |
338 | jeq #35, good /* __NR_nanosleep */ | |
339 | bad: ret #0 /* SECCOMP_RET_KILL */ | |
340 | good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */ | |
341 | ||
342 | The above example code can be placed into a file (here called "foo"), and | |
343 | then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf | |
344 | and cls_bpf understand and can be loaded with directly. Example with the above | |
345 | ARP code: | |
346 | ||
347 | $ ./bpf_asm foo | |
348 | 4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, | |
349 | ||
350 | In copy and paste C-like output: | |
351 | ||
352 | $ ./bpf_asm -c foo | |
353 | { 0x28, 0, 0, 0x0000000c }, | |
354 | { 0x15, 0, 1, 0x00000806 }, | |
355 | { 0x06, 0, 0, 0xffffffff }, | |
356 | { 0x06, 0, 0, 0000000000 }, | |
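
The two output formats carry the same information. As a sketch of how the
comma-separated decimal form relates to the C-like form, the helper below
parses "<count>,<op> <jt> <jf> <k>,..." into an array of struct
sock_filter. The function name parse_bpf_decimal() is hypothetical and is
not part of bpf_asm or any kernel tool; the accepted grammar is inferred
from the sample output above.

```c
#include <stdio.h>
#include <linux/filter.h>

/* Parse bpf_asm's decimal output into out[0..max-1].
 * Returns the number of filter blocks, or -1 on malformed input. */
static int parse_bpf_decimal(const char *text, struct sock_filter *out,
                             int max)
{
    int count = 0, used = 0, i;

    /* Leading "<count>," ; %n stays 0 if the comma is missing. */
    if (sscanf(text, "%d,%n", &count, &used) != 1 || used == 0)
        return -1;
    if (count < 0 || count > max)
        return -1;
    text += used;

    for (i = 0; i < count; i++) {
        unsigned int code, jt, jf, k;

        used = 0;
        if (sscanf(text, "%u %u %u %u%n",
                   &code, &jt, &jf, &k, &used) != 4 || used == 0)
            return -1;
        if (code > 0xffff || jt > 0xff || jf > 0xff)
            return -1;      /* would not fit op:16, jt:8, jf:8 */

        out[i].code = (unsigned short)code;
        out[i].jt = (unsigned char)jt;
        out[i].jf = (unsigned char)jf;
        out[i].k = k;

        text += used;
        if (*text == ',')   /* separator, also allowed after the last */
            text++;
    }
    return count;
}
```

Parsing the ARP line printed by `./bpf_asm foo` yields four blocks whose
fields match the `-c` listing, e.g. 40 becomes 0x28 and 2054 becomes 0x806.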
357 | ||
358 | In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF | |
359 | filters that might not be obvious at first, it's good to test filters before | |
360 | attaching to a live system. For that purpose, there's a small tool called | |
361 | bpf_dbg under tools/net/ in the kernel source directory. This debugger allows | |
362 | for testing BPF filters against given pcap files, single stepping through the | |
363 | BPF code on the pcap's packets, and dumping the BPF machine registers. | |
364 | ||
365 | Starting bpf_dbg is trivial and just requires issuing: | |
366 | ||
367 | # ./bpf_dbg | |
368 | ||
369 | In case input and output do not equal stdin/stdout, bpf_dbg takes an | |
370 | alternative stdin source as a first argument, and an alternative stdout | |
371 | sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`. | |
372 | ||
373 | Other than that, a particular libreadline configuration can be set via | |
374 | file "~/.bpf_dbg_init" and the command history is stored in the file | |
375 | "~/.bpf_dbg_history". | |
376 | ||
377 | Interaction in bpf_dbg happens through a shell that also has auto-completion | |
378 | support (follow-up example commands starting with '>' denote bpf_dbg shell). | |
379 | The usual workflow would be to ... | |
380 | ||
381 | > load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 | |
382 | Loads a BPF filter from standard output of bpf_asm, or transformed via | |
383 | e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT | |
384 | debugging (next section), this command creates a temporary socket and | |
385 | loads the BPF code into the kernel. Thus, this will also be useful for | |
386 | JIT developers. | |
387 | ||
388 | > load pcap foo.pcap | |
389 | Loads standard tcpdump pcap file. | |
390 | ||
391 | > run [<n>] | |
392 | bpf passes:1 fails:9 | |
393 | Runs through all packets from a pcap to count how many passes and fails | |
394 | the filter will generate. A limit of packets to traverse can be given. | |
395 | ||
396 | > disassemble | |
397 | l0: ldh [12] | |
398 | l1: jeq #0x800, l2, l5 | |
399 | l2: ldb [23] | |
400 | l3: jeq #0x1, l4, l5 | |
401 | l4: ret #0xffff | |
402 | l5: ret #0 | |
403 | Prints out BPF code disassembly. | |
404 | ||
405 | > dump | |
406 | /* { op, jt, jf, k }, */ | |
407 | { 0x28, 0, 0, 0x0000000c }, | |
408 | { 0x15, 0, 3, 0x00000800 }, | |
409 | { 0x30, 0, 0, 0x00000017 }, | |
410 | { 0x15, 0, 1, 0x00000001 }, | |
411 | { 0x06, 0, 0, 0x0000ffff }, | |
412 | { 0x06, 0, 0, 0000000000 }, | |
413 | Prints out C-style BPF code dump. | |
414 | ||
415 | > breakpoint 0 | |
416 | breakpoint at: l0: ldh [12] | |
417 | > breakpoint 1 | |
418 | breakpoint at: l1: jeq #0x800, l2, l5 | |
419 | ... | |
420 | Sets breakpoints at particular BPF instructions. Issuing a `run` command | |
421 | will walk through the pcap file continuing from the current packet and | |
422 | break when a breakpoint is being hit (another `run` will continue from | |
423 | the currently active breakpoint executing next instructions): | |
424 | ||
425 | > run | |
426 | -- register dump -- | |
427 | pc: [0] <-- program counter | |
428 | code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction | |
429 | curr: l0: ldh [12] <-- disassembly of current instruction | |
430 | A: [00000000][0] <-- content of A (hex, decimal) | |
431 | X: [00000000][0] <-- content of X (hex, decimal) | |
432 | M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) | |
433 | -- packet dump -- <-- Current packet from pcap (hex) | |
434 | len: 42 | |
435 | 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 | |
436 | 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 | |
437 | 32: 00 00 00 00 00 00 0a 3b 01 01 | |
438 | (breakpoint) | |
439 | > | |
440 | ||
441 | > breakpoint | |
442 | breakpoints: 0 1 | |
443 | Prints currently set breakpoints. | |
444 | ||
445 | > step [-<n>, +<n>] | |
446 | Performs single stepping through the BPF program from the current pc | |
447 | offset. Thus, on each step invocation, above register dump is issued. | |
448 | This can go forwards and backwards in time; a plain `step` will break | |
449 | on the next BPF instruction, thus +1. (No `run` needs to be issued here.) | |
450 | ||
451 | > select <n> | |
452 | Selects a given packet from the pcap file to continue from. Thus, on | |
453 | the next `run` or `step`, the BPF program is being evaluated against | |
454 | the user pre-selected packet. Numbering starts just as in Wireshark | |
455 | with index 1. | |
456 | ||
457 | > quit | |
458 | # | |
459 | Exits bpf_dbg. | |
460 | ||
461 | JIT compiler | |
462 | ------------ | |
463 | ||
464 | The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC, | |
465 | ARM and s390, which can be enabled through CONFIG_BPF_JIT. The JIT compiler is | |
466 | transparently invoked for each attached filter from user space or for internal | |
467 | kernel users if it has been previously enabled by root: | |
468 | ||
469 | echo 1 > /proc/sys/net/core/bpf_jit_enable | |
470 | ||
471 | For JIT developers doing audits etc., each compile run can output the generated | |
472 | opcode image into the kernel log via: | |
473 | ||
474 | echo 2 > /proc/sys/net/core/bpf_jit_enable | |
475 | ||
476 | Example output from dmesg: | |
477 | ||
478 | [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f | |
479 | [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 | |
480 | [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 | |
481 | [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 | |
482 | [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 | |
483 | [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 | |
484 | ||
485 | In the kernel source tree under tools/net/, there's bpf_jit_disasm for | |
486 | generating disassembly out of the kernel log's hexdump: | |
487 | ||
488 | # ./bpf_jit_disasm | |
489 | 70 bytes emitted from JIT compiler (pass:3, flen:6) | |
490 | ffffffffa0069c8f + <x>: | |
491 | 0: push %rbp | |
492 | 1: mov %rsp,%rbp | |
493 | 4: sub $0x60,%rsp | |
494 | 8: mov %rbx,-0x8(%rbp) | |
495 | c: mov 0x68(%rdi),%r9d | |
496 | 10: sub 0x6c(%rdi),%r9d | |
497 | 14: mov 0xd8(%rdi),%r8 | |
498 | 1b: mov $0xc,%esi | |
499 | 20: callq 0xffffffffe0ff9442 | |
500 | 25: cmp $0x800,%eax | |
501 | 2a: jne 0x0000000000000042 | |
502 | 2c: mov $0x17,%esi | |
503 | 31: callq 0xffffffffe0ff945e | |
504 | 36: cmp $0x1,%eax | |
505 | 39: jne 0x0000000000000042 | |
506 | 3b: mov $0xffff,%eax | |
507 | 40: jmp 0x0000000000000044 | |
508 | 42: xor %eax,%eax | |
509 | 44: leaveq | |
510 | 45: retq | |
511 | ||
512 | Issuing option `-o` will "annotate" opcodes to resulting assembler | |
513 | instructions, which can be very useful for JIT developers: | |
514 | ||
515 | # ./bpf_jit_disasm -o | |
516 | 70 bytes emitted from JIT compiler (pass:3, flen:6) | |
517 | ffffffffa0069c8f + <x>: | |
518 | 0: push %rbp | |
519 | 55 | |
520 | 1: mov %rsp,%rbp | |
521 | 48 89 e5 | |
522 | 4: sub $0x60,%rsp | |
523 | 48 83 ec 60 | |
524 | 8: mov %rbx,-0x8(%rbp) | |
525 | 48 89 5d f8 | |
526 | c: mov 0x68(%rdi),%r9d | |
527 | 44 8b 4f 68 | |
528 | 10: sub 0x6c(%rdi),%r9d | |
529 | 44 2b 4f 6c | |
530 | 14: mov 0xd8(%rdi),%r8 | |
531 | 4c 8b 87 d8 00 00 00 | |
532 | 1b: mov $0xc,%esi | |
533 | be 0c 00 00 00 | |
534 | 20: callq 0xffffffffe0ff9442 | |
535 | e8 1d 94 ff e0 | |
536 | 25: cmp $0x800,%eax | |
537 | 3d 00 08 00 00 | |
538 | 2a: jne 0x0000000000000042 | |
539 | 75 16 | |
540 | 2c: mov $0x17,%esi | |
541 | be 17 00 00 00 | |
542 | 31: callq 0xffffffffe0ff945e | |
543 | e8 28 94 ff e0 | |
544 | 36: cmp $0x1,%eax | |
545 | 83 f8 01 | |
546 | 39: jne 0x0000000000000042 | |
547 | 75 07 | |
548 | 3b: mov $0xffff,%eax | |
549 | b8 ff ff 00 00 | |
550 | 40: jmp 0x0000000000000044 | |
551 | eb 02 | |
552 | 42: xor %eax,%eax | |
553 | 31 c0 | |
554 | 44: leaveq | |
555 | c9 | |
556 | 45: retq | |
557 | c3 | |
558 | ||
559 | For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful | |
560 | toolchain for developing and testing the kernel's JIT compiler. | |
561 | ||
562 | BPF kernel internals | |
563 | -------------------- | |
564 | Internally, for the kernel interpreter, a different instruction set | |
565 | format with underlying principles similar to the BPF described in previous | |
566 | paragraphs is being used. However, the instruction set format is modelled | |
567 | closer to the underlying architecture to mimic native instruction sets, so | |
568 | that better performance can be achieved (more details later). This new | |
569 | ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF, which | |
570 | originates from [e]xtended BPF, is not the same as BPF extensions! While | |
571 | eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' | |
572 | of the BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) | |
573 | ||
574 | It is designed to be JITed with a one-to-one mapping, which can also open up | |
575 | the possibility for GCC/LLVM compilers to generate optimized eBPF code through | |
576 | an eBPF backend that performs almost as fast as natively compiled code. | |
577 | ||
578 | The new instruction set was originally designed with the possible goal in | |
579 | mind of writing programs in "restricted C" and compiling them into eBPF with an | |
580 | optional GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs | |
581 | with minimal performance overhead over two steps, that is, C -> eBPF -> native code. | |
582 | ||
583 | Currently, the new format is being used for running user BPF programs, which | |
584 | includes seccomp BPF, classic socket filters, the cls_bpf traffic classifier, | |
585 | the team driver's classifier for its load-balancing mode, netfilter's xt_bpf | |
586 | extension, the PTP dissector/classifier, and much more. They are all internally | |
587 | converted by the kernel into the new instruction set representation and run | |
588 | in the eBPF interpreter. For in-kernel handlers, this all works transparently | |
589 | by using bpf_prog_create() for setting up the filter and | |
590 | bpf_prog_destroy() for destroying it. The macro | |
591 | BPF_PROG_RUN(filter, ctx) transparently invokes the eBPF interpreter or JITed | |
592 | code to run the filter. 'filter' is a pointer to the struct bpf_prog that we | |
593 | got from bpf_prog_create(), and 'ctx' is the given context (e.g. an | |
594 | skb pointer). All constraints and restrictions from bpf_check_classic() apply | |
595 | before the conversion to the new layout is done behind the scenes! | |
596 | ||
597 | Currently, the classic BPF format is being used for JITing on most of the | |
598 | architectures. Only x86-64 performs JIT compilation from the eBPF instruction | |
599 | set; however, future work will migrate other JIT compilers as well, so that they | |
600 | will profit from the very same benefits. | |
601 | ||
602 | Some core changes of the new internal format: | |
603 | ||
604 | - Number of registers increases from 2 to 10: | |
605 | ||
606 | The old format had two registers A and X, and a hidden frame pointer. The | |
607 | new layout extends this to be 10 internal registers and a read-only frame | |
608 | pointer. Since 64-bit CPUs are passing arguments to functions via registers, | |
609 | the number of args from an eBPF program to an in-kernel function is restricted | |
610 | to 5 and one register is used to accept the return value from an in-kernel | |
611 | function. Natively, x86_64 passes the first 6 arguments in registers, aarch64/ | |
612 | sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved | |
613 | registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. | |
614 | ||
615 | Therefore, the eBPF calling convention is defined as: | |
616 | ||
617 | * R0 - return value from in-kernel function, and exit value for eBPF program | |
618 | * R1 - R5 - arguments from eBPF program to in-kernel function | |
619 | * R6 - R9 - callee saved registers that in-kernel function will preserve | |
620 | * R10 - read-only frame pointer to access stack | |
621 | ||
622 | Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, | |
623 | etc., and the eBPF calling convention maps directly to ABIs used by the kernel on | |
624 | 64-bit architectures. | |
625 | ||
626 | On 32-bit architectures, the JIT may map programs that use only 32-bit | |
627 | arithmetic and may leave more complex programs to be interpreted. | |
628 | ||
629 | R0 - R5 are scratch registers and an eBPF program needs to spill/fill them if | |
630 | necessary across calls. Note that there is only one eBPF program (== one | |
631 | eBPF main routine) and it cannot call other eBPF functions; it can only | |
632 | call predefined in-kernel functions. | |
633 | ||
634 | - Register width increases from 32-bit to 64-bit: | |
635 | ||
636 | Still, the semantics of the original 32-bit ALU operations are preserved | |
637 | via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower | |
638 | subregisters that zero-extend into 64-bit if they are being written to. |
639 | That behavior maps directly to x86_64 and arm64 subregister definition, but | |
640 | makes other JITs more difficult. | |
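
The zero-extension rule can be modelled in plain C. This is only an
illustration of the semantics described above, not kernel code; the
function names alu64_add() and alu32_add() are invented for the example.

```c
#include <stdint.h>

/* 64-bit add: the full registers take part. */
static uint64_t alu64_add(uint64_t dst, uint64_t src)
{
    return dst + src;
}

/* 32-bit add: only the lower subregisters take part, and writing the
 * 32-bit result clears the upper 32 bits of the destination. */
static uint64_t alu32_add(uint64_t dst, uint64_t src)
{
    uint32_t lo = (uint32_t)dst + (uint32_t)src;   /* wraps at 2^32 */

    return (uint64_t)lo;                           /* zero-extend */
}
```

With dst = 0x0000000100000001 and src = 0xffffffff, the 64-bit add gives
0x0000000200000000, while the 32-bit add wraps in the lower half and
yields 0, because the upper half of dst is discarded when the subregister
is written.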
641 | ||
642 | 32-bit architectures run 64-bit internal BPF programs via interpreter. | |
643 | Their JITs may convert BPF programs that only use 32-bit subregisters into | |
644 | native instruction set and let the rest be interpreted. | |
645 | ||
646 | Operation is 64-bit, because on 64-bit architectures, pointers are also | |
647 | 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, | |
648 | so 32-bit eBPF registers would otherwise require defining a register-pair | |
649 | ABI; thus, a direct eBPF register to HW register mapping could not be used, | |
650 | and the JIT would need to do combine/split/move operations for every | |
651 | register in and out of the function, which is complex, bug prone and slow. | |
652 | Another reason is the use of atomic 64-bit counters. | |
653 | ||
654 | - Conditional jt/jf targets replaced with jt/fall-through: | |
655 | ||
656 | While the original design has constructs such as "if (cond) jump_true; | |
657 | else jump_false;", they are replaced with alternative constructs like | |
658 | "if (cond) jump_true; /* else fall-through */". | |
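
One way such a rewrite can be pictured is sketched here for a classic
"jeq #k, jt, jf". This is a simplified model, not the kernel's conversion
routine: the types toy_op and toy_jump and the function split_jeq() are
hypothetical, and the offsets assume that no other instruction between the
jump and its targets changes size during conversion.

```c
/* Hypothetical fall-through style jumps; 'off' is relative to the
 * instruction that follows the jump it belongs to. */
enum toy_op { TOY_JEQ, TOY_JNE, TOY_JA };

struct toy_jump {
    enum toy_op op;
    int off;
};

/* Rewrite a classic jeq with targets jt/jf into one or two jumps
 * that only know "taken" and "fall through". Returns how many
 * entries of out[] were written. */
static int split_jeq(unsigned char jt, unsigned char jf,
                     struct toy_jump out[2])
{
    if (jf == 0) {              /* false already falls through */
        out[0].op = TOY_JEQ;
        out[0].off = jt;
        return 1;
    }
    if (jt == 0) {              /* true falls through: invert the test */
        out[0].op = TOY_JNE;
        out[0].off = jf;
        return 1;
    }
    /* Both targets are real jumps: the conditional jump must also
     * hop over the unconditional jump placed right after it. */
    out[0].op = TOY_JEQ;
    out[0].off = jt + 1;
    out[1].op = TOY_JA;
    out[1].off = jf;
    return 2;
}
```

For instance, "jeq #0x806, 0, 1" collapses into a single inverted test
with offset 1, whereas a jump with jt = 2 and jf = 5 needs a conditional
jump followed by an unconditional one.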
659 | ||
660 | - Introduces bpf_call insn and register passing convention for zero overhead | |
661 | calls from/to other kernel functions: | |
662 | ||
663 | Before an in-kernel function call, the internal BPF program needs to |
664 | place function arguments into R1 to R5 registers to satisfy calling | |
665 | convention, then the interpreter will take them from registers and pass | |
666 | to in-kernel function. If R1 - R5 registers are mapped to CPU registers | |
667 | that are used for argument passing on given architecture, the JIT compiler | |
668 | doesn't need to emit extra moves. Function arguments will be in the correct | |
669 | registers and BPF_CALL instruction will be JITed as single 'call' HW | |
670 | instruction. This calling convention was picked to cover common call | |
671 | situations without performance penalty. | |
672 | ||
673 | After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has | |
674 | a return value of the function. Since R6 - R9 are callee saved, their state | |
675 | is preserved across the call. | |
676 | ||
677 | For example, consider three C functions: | |
678 | ||
679 | u64 f1() { return (*_f2)(1); } | |
680 | u64 f2(u64 a) { return f3(a + 1, a); } | |
681 | u64 f3(u64 a, u64 b) { return a - b; } | |
682 | ||
683 | GCC can compile f1, f3 into x86_64: | |
684 | ||
685 | f1: | |
686 | movl $1, %edi | |
687 | movq _f2(%rip), %rax | |
688 | jmp *%rax | |
689 | f3: | |
690 | movq %rdi, %rax | |
691 | subq %rsi, %rax | |
692 | ret | |
693 | ||
694 | Function f2 in eBPF may look like: | |
695 | |
696 | f2: | |
697 | bpf_mov R2, R1 | |
698 | bpf_add R1, 1 | |
699 | bpf_call f3 | |
700 | bpf_exit | |
701 | ||
702 | If f2 is JITed and the pointer is stored to '_f2', the calls f1 -> f2 -> f3 and | |
703 | returns will be seamless. Without JIT, the __sk_run_filter() interpreter needs to | |
704 | be used to call into f2. | |
705 | ||
706 | For practical reasons all eBPF programs have only one argument 'ctx' which is | |
707 | already placed into R1 (e.g. on __sk_run_filter() startup) and the programs |
708 | can call kernel functions with up to 5 arguments. Calls with 6 or more arguments | |
709 | are currently not supported, but these restrictions can be lifted if necessary | |
710 | in the future. | |
711 | ||
712 | On 64-bit architectures all registers map to HW registers one to one. For | |
713 | example, x86_64 JIT compiler can map them as ... | |
714 | ||
715 | R0 - rax | |
716 | R1 - rdi | |
717 | R2 - rsi | |
718 | R3 - rdx | |
719 | R4 - rcx | |
720 | R5 - r8 | |
721 | R6 - rbx | |
722 | R7 - r13 | |
723 | R8 - r14 | |
724 | R9 - r15 | |
725 | R10 - rbp | |
726 | ||
727 | ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing | |
728 | and rbx, r12 - r15 are callee saved. | |
729 | ||
730 | Then the following internal BPF pseudo-program: | |
731 | ||
732 | bpf_mov R6, R1 /* save ctx */ | |
733 | bpf_mov R2, 2 | |
734 | bpf_mov R3, 3 | |
735 | bpf_mov R4, 4 | |
736 | bpf_mov R5, 5 | |
737 | bpf_call foo | |
738 | bpf_mov R7, R0 /* save foo() return value */ | |
739 | bpf_mov R1, R6 /* restore ctx for next call */ | |
740 | bpf_mov R2, 6 | |
741 | bpf_mov R3, 7 | |
742 | bpf_mov R4, 8 | |
743 | bpf_mov R5, 9 | |
744 | bpf_call bar | |
745 | bpf_add R0, R7 | |
746 | bpf_exit | |
747 | ||
748 | After JIT to x86_64 may look like: | |
749 | ||
750 | push %rbp | |
751 | mov %rsp,%rbp | |
752 | sub $0x228,%rsp | |
753 | mov %rbx,-0x228(%rbp) | |
754 | mov %r13,-0x220(%rbp) | |
755 | mov %rdi,%rbx | |
756 | mov $0x2,%esi | |
757 | mov $0x3,%edx | |
758 | mov $0x4,%ecx | |
759 | mov $0x5,%r8d | |
760 | callq foo | |
761 | mov %rax,%r13 | |
762 | mov %rbx,%rdi | |
763 | mov $0x6,%esi | |
764 | mov $0x7,%edx | |
765 | mov $0x8,%ecx | |
766 | mov $0x9,%r8d | |
767 | callq bar | |
768 | add %r13,%rax | |
769 | mov -0x228(%rbp),%rbx | |
770 | mov -0x220(%rbp),%r13 | |
771 | leaveq | |
772 | retq | |
773 | ||
774 | Which is in this example equivalent in C to: | |
775 | ||
776 | u64 bpf_filter(u64 ctx) | |
777 | { | |
778 | return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); | |
779 | } | |
780 | ||
781 | In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 | |
782 | arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper | |
e4ad4032 | 783 | registers and place their return value into '%rax' which is R0 in eBPF. |
dfee07cc | 784 | Prologue and epilogue are emitted by JIT and are implicit in the |
e4ad4032 | 785 | interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve |
dfee07cc AS |
786 | them across the calls as defined by calling convention. |
787 | ||
788 | For example the following program is invalid: | |
789 | ||
790 | bpf_mov R1, 1 | |
791 | bpf_call foo | |
792 | bpf_mov R0, R1 | |
793 | bpf_exit | |
794 | ||
795 | After the call the registers R1-R5 contain junk values and cannot be read. | |
e4ad4032 | 796 | In the future an eBPF verifier can be used to validate internal BPF programs. |
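
The rule that R1-R5 become unreadable after a call can be illustrated with a
small straight-line checker. This is only a sketch under stated assumptions: the
names sk_op, sk_insn and sk_regs_ok are hypothetical, only four instruction
kinds are modelled, there are no branches, and treating R0 as required to be
readable at exit is an assumption of this sketch, not a description of any
in-kernel verifier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced instruction kinds for illustration only. */
enum sk_op { SK_MOV_IMM, SK_MOV_REG, SK_CALL, SK_EXIT };

struct sk_insn {
	enum sk_op op;
	int dst;	/* destination register index, 0..10 */
	int src;	/* source register index, used by SK_MOV_REG */
};

/* Returns true if no instruction reads a register that is unreadable. */
static bool sk_regs_ok(const struct sk_insn *prog, size_t len)
{
	bool readable[11] = { false };
	size_t i;
	int r;

	readable[1] = true;	/* R1 holds 'ctx' on entry */
	readable[10] = true;	/* R10 is mapped to the frame pointer */

	for (i = 0; i < len; i++) {
		const struct sk_insn *in = &prog[i];

		assert(in->dst >= 0 && in->dst <= 10);
		assert(in->src >= 0 && in->src <= 10);

		switch (in->op) {
		case SK_MOV_IMM:
			readable[in->dst] = true;
			break;
		case SK_MOV_REG:
			if (!readable[in->src])
				return false;
			readable[in->dst] = true;
			break;
		case SK_CALL:
			/* R1-R5 are scratched, R0 carries the return value. */
			for (r = 1; r <= 5; r++)
				readable[r] = false;
			readable[0] = true;
			break;
		case SK_EXIT:
			/* Assumption: R0 must hold a value at exit. */
			return readable[0];
		}
	}
	return false;	/* ran off the end without an exit */
}
```

Fed the invalid program above (mov R1, 1; call foo; mov R0, R1; exit), the
checker rejects it at the third instruction, while a variant that saves ctx in
R6 before the call and copies R6 into R0 afterwards is accepted.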
9a985cdc | 797 | |
e4ad4032 | 798 | Also in the new design, eBPF is limited to 4096 insns, which means that any |
9a985cdc AS |
799 | program will terminate quickly and will only call a fixed number of kernel |
800 | functions. Original BPF and the new format are two operand instructions, | |
e4ad4032 | 801 | which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. |
9a985cdc AS |
802 | |
803 | The input context pointer for invoking the interpreter function is generic; | |
804 | its content is defined by a specific use case. For seccomp, register R1 points | |
805 | to seccomp_data; for converted BPF filters, R1 points to a skb. | |
806 | ||
807 | A program that is translated internally consists of the following elements: | |
808 | ||
e430f34e | 809 | op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 |
9a985cdc | 810 | |
dfee07cc AS |
811 | So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field |
812 | has room for new instructions. Some of them may use 16/24/32 byte encoding. New | |
813 | instructions must be a multiple of 8 bytes to preserve backward compatibility. | |
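
The new-format layout op:8, dst_reg:4, src_reg:4, off:16, imm:32 adds up to 64
bits, which can be checked with a small C structure. This is a sketch only: the
names insn_sketch, insn_make, insn_dst and insn_src are hypothetical, and
placing dst_reg in the low nibble of the register byte is an assumption made
here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of: op:8, dst_reg:4, src_reg:4, off:16, imm:32 */
struct insn_sketch {
	uint8_t code;	/* op:8 */
	uint8_t regs;	/* dst_reg:4 and src_reg:4 packed into one byte */
	int16_t off;	/* off:16 */
	int32_t imm;	/* imm:32 */
};

/* One instruction occupies 8 bytes. */
_Static_assert(sizeof(struct insn_sketch) == 8, "insn must be 8 bytes");

/* Assumption of this sketch: dst_reg is the low nibble, src_reg the high one. */
static uint8_t insn_dst(const struct insn_sketch *i)
{
	return i->regs & 0x0f;
}

static uint8_t insn_src(const struct insn_sketch *i)
{
	return i->regs >> 4;
}

static struct insn_sketch insn_make(uint8_t code, uint8_t dst, uint8_t src,
				    int16_t off, int32_t imm)
{
	struct insn_sketch i;

	assert(dst < 16 && src < 16);	/* each register field is 4 bits */
	i.code = code;
	i.regs = (uint8_t)((src << 4) | dst);
	i.off = off;
	i.imm = imm;
	return i;
}
```

Explicit shifts are used instead of C bit-fields so that the packing does not
depend on compiler-specific bit-field ordering.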
814 | ||
815 | Internal BPF is a general purpose RISC instruction set. Not every register and | |
816 | every instruction is used during translation from original BPF to new format. | |
817 | For example, socket filters are not using the 'exclusive add' instruction, but | |
818 | tracing filters may use it to maintain counters of events. Register R9 | |
819 | is not used by socket filters either, but more complex filters may run | |
820 | out of registers and would have to resort to spill/fill to stack. | |
821 | ||
822 | Internal BPF can be used as a generic assembler for last step performance | |
823 | optimizations; socket filters and seccomp are using it as an assembler. Tracing | |
824 | filters may use it as an assembler to generate code from the kernel. In-kernel usage | |
825 | may not be bounded by security considerations, since generated internal BPF code | |
826 | may be optimizing an internal code path and is not exposed to user space. | |
827 | Safety of internal BPF can come from a verifier (TBD). In such use cases as | |
828 | described, it may be used as a safe instruction set. | |
829 | ||
9a985cdc AS |
830 | Just like the original BPF, the new format runs within a controlled environment, |
831 | is deterministic and the kernel can easily prove that. The safety of the program | |
832 | can be determined in two steps: first step does depth-first-search to disallow | |
833 | loops and other CFG validation; second step starts from the first insn and | |
834 | descends all possible paths. It simulates execution of every insn and observes | |
835 | the state change of registers and stack. | |
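
The first step, a depth-first search that rejects loops, can be sketched as
follows. The names cfg_kind, cfg_insn, cfg_visit and cfg_is_loop_free are
hypothetical, the recursion and the fixed CFG_MAX bound are simplifications,
and the sketch assumes that a jump offset is relative to the instruction
following the jump. It is an illustration of the idea, not the kernel's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical control-flow kinds: fall through, jump always,
 * conditional jump, exit. */
enum cfg_kind { CFG_NEXT, CFG_JA, CFG_JCOND, CFG_EXIT };

struct cfg_insn {
	enum cfg_kind kind;
	int off;	/* jump offset, relative to the next instruction */
};

#define CFG_MAX 64	/* arbitrary bound for this sketch */

enum { WHITE, GREY, BLACK };	/* unvisited, on the DFS path, finished */

static bool cfg_visit(const struct cfg_insn *p, int len, int i,
		      unsigned char *color)
{
	int succ[2], n = 0, k;

	if (i < 0 || i >= len)
		return false;	/* jump or fall-through out of range */
	if (color[i] == GREY)
		return false;	/* back-edge: the program has a loop */
	if (color[i] == BLACK)
		return true;	/* already fully explored */

	color[i] = GREY;
	switch (p[i].kind) {
	case CFG_NEXT:
		succ[n++] = i + 1;
		break;
	case CFG_JA:
		succ[n++] = i + 1 + p[i].off;
		break;
	case CFG_JCOND:
		succ[n++] = i + 1;
		succ[n++] = i + 1 + p[i].off;
		break;
	case CFG_EXIT:
		break;
	}
	for (k = 0; k < n; k++)
		if (!cfg_visit(p, len, succ[k], color))
			return false;
	color[i] = BLACK;
	return true;
}

/* Returns true if every path from insn 0 stays in range and has no loop. */
static bool cfg_is_loop_free(const struct cfg_insn *p, int len)
{
	unsigned char color[CFG_MAX] = { WHITE };

	assert(p != NULL);
	if (len <= 0 || len > CFG_MAX)
		return false;
	return cfg_visit(p, len, 0, color);
}
```

A grey node reached again means the search followed an edge back to an
instruction that is still on the current path, which is exactly a loop.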
836 | ||
783e327b AS |
837 | eBPF opcode encoding |
838 | -------------------- | |
839 | ||
840 | eBPF is reusing most of the opcode encoding from classic to simplify conversion | |
841 | of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' | |
842 | field is divided into three parts: | |
843 | ||
844 | +----------------+--------+--------------------+ | |
845 | | 4 bits | 1 bit | 3 bits | | |
846 | | operation code | source | instruction class | | |
847 | +----------------+--------+--------------------+ | |
848 | (MSB) (LSB) | |
849 | ||
850 | Three LSB bits store instruction class which is one of: | |
851 | ||
852 | Classic BPF classes: eBPF classes: | |
853 | ||
854 | BPF_LD 0x00 BPF_LD 0x00 | |
855 | BPF_LDX 0x01 BPF_LDX 0x01 | |
856 | BPF_ST 0x02 BPF_ST 0x02 | |
857 | BPF_STX 0x03 BPF_STX 0x03 | |
858 | BPF_ALU 0x04 BPF_ALU 0x04 | |
859 | BPF_JMP 0x05 BPF_JMP 0x05 | |
860 | BPF_RET 0x06 [ class 6 unused, for future if needed ] | |
861 | BPF_MISC 0x07 BPF_ALU64 0x07 | |
862 | ||
863 | When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... | |
864 | ||
865 | BPF_K 0x00 | |
866 | BPF_X 0x08 | |
867 | ||
868 | * in classic BPF, this means: | |
869 | ||
870 | BPF_SRC(code) == BPF_X - use register X as source operand | |
871 | BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand | |
872 | ||
873 | * in eBPF, this means: | |
874 | ||
875 | BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand | |
876 | BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand | |
877 | ||
878 | ... and four MSB bits store operation code. | |
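
The three-way split of the 'code' byte for arithmetic and jump instructions can
be expressed with three masks. In this sketch the mask values follow the bit
layout shown above; the helper split_code and the struct code_parts are
hypothetical names introduced only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BPF_CLASS(code)	((code) & 0x07)	/* three LSB bits */
#define BPF_SRC(code)	((code) & 0x08)	/* 4th bit */
#define BPF_OP(code)	((code) & 0xf0)	/* four MSB bits */

/* Hypothetical container for the three parts of an ALU/JMP opcode. */
struct code_parts {
	uint8_t op;
	uint8_t src;
	uint8_t cls;
};

static struct code_parts split_code(uint8_t code)
{
	struct code_parts p;

	p.op = (uint8_t)BPF_OP(code);
	p.src = (uint8_t)BPF_SRC(code);
	p.cls = (uint8_t)BPF_CLASS(code);
	/* The three fields cover all 8 bits without overlap. */
	assert((p.op | p.src | p.cls) == code);
	return p;
}
```

For instance, the byte 0x0c splits into operation 0x00, source 0x08 (BPF_X)
and class 0x04 (BPF_ALU).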
879 | ||
880 | If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of: | |
881 | ||
882 | BPF_ADD 0x00 | |
883 | BPF_SUB 0x10 | |
884 | BPF_MUL 0x20 | |
885 | BPF_DIV 0x30 | |
886 | BPF_OR 0x40 | |
887 | BPF_AND 0x50 | |
888 | BPF_LSH 0x60 | |
889 | BPF_RSH 0x70 | |
890 | BPF_NEG 0x80 | |
891 | BPF_MOD 0x90 | |
892 | BPF_XOR 0xa0 | |
893 | BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ | |
894 | BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ | |
895 | BPF_END 0xd0 /* eBPF only: endianness conversion */ | |
896 | ||
897 | If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of: | |
898 | ||
899 | BPF_JA 0x00 | |
900 | BPF_JEQ 0x10 | |
901 | BPF_JGT 0x20 | |
902 | BPF_JGE 0x30 | |
903 | BPF_JSET 0x40 | |
904 | BPF_JNE 0x50 /* eBPF only: jump != */ | |
905 | BPF_JSGT 0x60 /* eBPF only: signed '>' */ | |
906 | BPF_JSGE 0x70 /* eBPF only: signed '>=' */ | |
907 | BPF_CALL 0x80 /* eBPF only: function call */ | |
908 | BPF_EXIT 0x90 /* eBPF only: function return */ | |
909 | ||
910 | So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF | |
911 | and eBPF. There are only two registers in classic BPF, so it means A += X. | |
912 | In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, | |
913 | BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous | |
914 | dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF. | |
915 | ||
916 | Classic BPF is using BPF_MISC class to represent A = X and X = A moves. | |
917 | eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no | |
918 | BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean | |
919 | exactly the same operations as BPF_ALU, but with 64-bit wide operands | |
920 | instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: | |
921 | dst_reg = dst_reg + src_reg | |
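
The difference between the BPF_ALU and BPF_ALU64 classes can be sketched with a
helper that evaluates a single instruction. The function name alu_exec is
hypothetical, only BPF_ADD and BPF_XOR are covered, and sign-extending the
32-bit immediate for 64-bit operations is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define BPF_CLASS(code)	((code) & 0x07)
#define BPF_SRC(code)	((code) & 0x08)
#define BPF_OP(code)	((code) & 0xf0)

#define BPF_ALU		0x04
#define BPF_ALU64	0x07
#define BPF_K		0x00
#define BPF_X		0x08
#define BPF_ADD		0x00
#define BPF_XOR		0xa0

/* Hypothetical helper: returns the new value of dst_reg. */
static uint64_t alu_exec(uint8_t code, uint64_t dst, uint64_t src, int32_t imm)
{
	/* BPF_X selects src_reg, BPF_K selects the 32-bit immediate
	 * (sign-extended here, which is an assumption of this sketch). */
	uint64_t operand = (BPF_SRC(code) == BPF_X) ? src
						    : (uint64_t)(int64_t)imm;
	uint64_t res = 0;

	switch (BPF_OP(code)) {
	case BPF_ADD:
		res = dst + operand;
		break;
	case BPF_XOR:
		res = dst ^ operand;
		break;
	default:
		assert(!"operation not covered by this sketch");
	}

	/* BPF_ALU works on 32-bit operands: keep only the low 32 bits. */
	if (BPF_CLASS(code) == BPF_ALU)
		res = (uint32_t)res;
	else
		assert(BPF_CLASS(code) == BPF_ALU64);
	return res;
}
```

With dst = 0xffffffff and src = 1, BPF_ADD | BPF_X | BPF_ALU wraps around to 0,
while BPF_ADD | BPF_X | BPF_ALU64 yields 0x100000000.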
922 | ||
923 | Classic BPF wastes the whole BPF_RET class to represent a single 'ret' | |
924 | operation. Classic BPF_RET | BPF_K means copy imm32 into return register | |
925 | and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT | |
926 | in eBPF means function exit only. The eBPF program needs to store return | |
927 | value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is currently | |
928 | unused and reserved for future use. | |
929 | ||
930 | For load and store instructions the 8-bit 'code' field is divided as: | |
931 | ||
932 | +--------+--------+-------------------+ | |
933 | | 3 bits | 2 bits | 3 bits | | |
934 | | mode | size | instruction class | | |
935 | +--------+--------+-------------------+ | |
936 | (MSB) (LSB) | |
937 | ||
938 | Size modifier is one of ... | |
939 | ||
940 | BPF_W 0x00 /* word */ | |
941 | BPF_H 0x08 /* half word */ | |
942 | BPF_B 0x10 /* byte */ | |
943 | BPF_DW 0x18 /* eBPF only, double word */ | |
944 | ||
945 | ... which encodes size of load/store operation: | |
946 | ||
947 | B - 1 byte | |
948 | H - 2 byte | |
949 | W - 4 byte | |
950 | DW - 8 byte (eBPF only) | |
951 | ||
952 | Mode modifier is one of: | |
953 | ||
954 | BPF_IMM 0x00 /* classic BPF only, reserved in eBPF */ | |
955 | BPF_ABS 0x20 | |
956 | BPF_IND 0x40 | |
957 | BPF_MEM 0x60 | |
958 | BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ | |
959 | BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ | |
960 | BPF_XADD 0xc0 /* eBPF only, exclusive add */ | |
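
The mode and size fields of a load/store opcode can be extracted in the same
way as the ALU fields. In the following sketch the constants are taken from the
tables above, while the helper name ldst_size_bytes is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define BPF_SIZE(code)	((code) & 0x18)	/* 2 bits */
#define BPF_MODE(code)	((code) & 0xe0)	/* 3 MSB bits */

#define BPF_LD		0x00
#define BPF_LDX		0x01
#define BPF_STX		0x03

#define BPF_W		0x00
#define BPF_H		0x08
#define BPF_B		0x10
#define BPF_DW		0x18

#define BPF_ABS		0x20
#define BPF_MEM		0x60

/* Hypothetical helper: width in bytes of a load/store instruction. */
static int ldst_size_bytes(uint8_t code)
{
	/* Classes 0x00..0x03 are the load/store classes. */
	assert((code & 0x07) <= 0x03);

	switch (BPF_SIZE(code)) {
	case BPF_B:
		return 1;
	case BPF_H:
		return 2;
	case BPF_W:
		return 4;
	case BPF_DW:
		return 8;
	}
	return -1;	/* not reached: all four size values are handled */
}
```

For example, BPF_MEM | BPF_DW | BPF_LDX encodes as 0x79 and describes an 8 byte
load in BPF_MEM mode.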
961 | ||
962 | eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and | |
963 | (BPF_IND | <size> | BPF_LD) which are used to access packet data. | |
964 | ||
965 | They had to be carried over from classic to have strong performance of | |
966 | socket filters running in eBPF interpreter. These instructions can only | |
967 | be used when interpreter context is a pointer to 'struct sk_buff' and | |
968 | have seven implicit operands. Register R6 is an implicit input that must | |
969 | contain pointer to sk_buff. Register R0 is an implicit output which contains | |
970 | the data fetched from the packet. Registers R1-R5 are scratch registers | |
971 | and must not be used to store the data across BPF_ABS | BPF_LD or | |
972 | BPF_IND | BPF_LD instructions. | |
973 | ||
974 | These instructions have implicit program exit condition as well. When | |
975 | eBPF program is trying to access the data beyond the packet boundary, | |
976 | the interpreter will abort the execution of the program. JIT compilers | |
977 | therefore must preserve this property. src_reg and imm32 fields are | |
978 | explicit inputs to these instructions. | |
979 | ||
980 | For example: | |
981 | ||
982 | BPF_IND | BPF_W | BPF_LD means: | |
983 | ||
984 | R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) | |
985 | and R1 - R5 were scratched. | |
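
The load and its implicit exit condition can be modelled on a plain byte buffer.
This is a simplified sketch: the function name ld_ind_w is hypothetical, a flat
buffer with a length stands in for the sk_buff, src_reg is taken as a 32-bit
value, negative offsets are simply rejected, and the scratching of R1 - R5 is
not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of BPF_IND | BPF_W | BPF_LD on a flat packet buffer.
 * Returns 0 and stores the word in *r0, or -1 if the access would leave
 * the packet, in which case the program has to be aborted.
 */
static int ld_ind_w(const uint8_t *data, size_t len, uint32_t src_reg,
		    int32_t imm, uint64_t *r0)
{
	int64_t pos = (int64_t)src_reg + imm;
	const uint8_t *p;

	assert(data != NULL && r0 != NULL);

	/* All 4 bytes must lie inside the packet. */
	if (pos < 0 || len < 4 || (uint64_t)pos > len - 4)
		return -1;

	p = data + pos;
	/* Packet data is in network byte order; assemble it as ntohl() would. */
	*r0 = ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	      ((uint32_t)p[2] << 8) | (uint32_t)p[3];
	return 0;
}
```

A caller that gets -1 back would stop running the program, which mirrors the
abort described above.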
986 | ||
987 | Unlike classic BPF instruction set, eBPF has generic load/store operations: | |
988 | ||
989 | BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg | |
990 | BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32 | |
991 | BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off) | |
992 | BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg | |
993 | BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg | |
994 | ||
995 | Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and | |
996 | 2 byte atomic increments are not supported. | |
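
In C terms, the word-sized variants of these operations behave roughly as
follows. The helper names stx_w, ldx_w and xadd_w are hypothetical, registers
are modelled as plain C pointers and integers, and C11 atomics stand in for the
'lock xadd' behaviour; this is a sketch rather than the interpreter's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

/* BPF_MEM | BPF_W | BPF_STX: *(u32 *) (dst_reg + off) = src_reg */
static void stx_w(uint8_t *dst_reg, int16_t off, uint64_t src_reg)
{
	uint32_t v = (uint32_t)src_reg;	/* only the low 4 bytes are stored */

	assert(dst_reg != NULL);
	memcpy(dst_reg + off, &v, sizeof(v));
}

/* BPF_MEM | BPF_W | BPF_LDX: dst_reg = *(u32 *) (src_reg + off) */
static uint64_t ldx_w(const uint8_t *src_reg, int16_t off)
{
	uint32_t v;

	assert(src_reg != NULL);
	memcpy(&v, src_reg + off, sizeof(v));
	return v;	/* zero-extended into the 64-bit register */
}

/* BPF_XADD | BPF_W | BPF_STX: atomic *(u32 *) addr += src_reg,
 * where addr stands for dst_reg + off16. */
static void xadd_w(_Atomic uint32_t *addr, uint64_t src_reg)
{
	assert(addr != NULL);
	atomic_fetch_add(addr, (uint32_t)src_reg);
}
```

Storing 0x1122334455667788 with stx_w and reading it back with ldx_w returns
0x55667788, since a BPF_W access covers only four bytes.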
997 | ||
04caa489 DB |
998 | Testing |
999 | ------- | |
1000 | ||
1001 | Next to the BPF toolchain, the kernel also ships a test module that contains | |
1002 | various test cases for classic and internal BPF that can be executed against | |
1003 | the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and | |
1004 | enabled via Kconfig: | |
1005 | ||
1006 | CONFIG_TEST_BPF=m | |
1007 | ||
1008 | After the module has been built and installed, the test suite can be executed | |
1009 | via insmod or modprobe against 'test_bpf' module. Results of the test cases | |
1010 | including timings in nsec can be found in the kernel log (dmesg). | |
1011 | ||
7924cd5e DB |
1012 | Misc |
1013 | ---- | |
1014 | ||
1015 | Also trinity, the Linux syscall fuzzer, has built-in support for BPF and | |
1016 | SECCOMP-BPF kernel fuzzing. | |
1017 | ||
1018 | Written by | |
1019 | ---------- | |
1020 | ||
1021 | The document was written in the hope that it is found useful and in order | |
1022 | to give potential BPF hackers or security auditors a better overview of | |
1023 | the underlying architecture. | |
1024 | ||
1025 | Jay Schulist <jschlst@samba.org> | |
1026 | Daniel Borkmann <dborkman@redhat.com> | |
9a985cdc | 1027 | Alexei Starovoitov <ast@plumgrid.com> |