Commit | Line | Data |
---|---|---|
41db511a CH |
1 | |
2 | =================== | |
3 | Classic BPF vs eBPF | |
4 | =================== | |
5 | ||
6 | eBPF is designed to be JITed with one to one mapping, which can also open up | |
7 | the possibility for GCC/LLVM compilers to generate optimized eBPF code through | |
8 | an eBPF backend that performs almost as fast as natively compiled code. | |
9 | ||
10 | Some core changes of the eBPF format from classic BPF: | |
11 | ||
12 | - Number of registers increase from 2 to 10: | |
13 | ||
14 | The old format had two registers A and X, and a hidden frame pointer. The | |
15 | new layout extends this to be 10 internal registers and a read-only frame | |
16 | pointer. Since 64-bit CPUs are passing arguments to functions via registers | |
17 | the number of args from eBPF program to in-kernel function is restricted | |
18 | to 5 and one register is used to accept return value from an in-kernel | |
19 | function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ | |
20 | sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved | |
21 | registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. | |
22 | ||
23 | Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, | |
24 | etc, and eBPF calling convention maps directly to ABIs used by the kernel on | |
25 | 64-bit architectures. | |
26 | ||
27 | On 32-bit architectures JIT may map programs that use only 32-bit arithmetic | |
28 | and may let more complex programs to be interpreted. | |
29 | ||
30 | R0 - R5 are scratch registers and eBPF program needs spill/fill them if | |
31 | necessary across calls. Note that there is only one eBPF program (== one | |
32 | eBPF main routine) and it cannot call other eBPF functions, it can only | |
33 | call predefined in-kernel functions, though. | |
34 | ||
35 | - Register width increases from 32-bit to 64-bit: | |
36 | ||
37 | Still, the semantics of the original 32-bit ALU operations are preserved | |
38 | via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower | |
39 | subregisters that zero-extend into 64-bit if they are being written to. | |
40 | That behavior maps directly to x86_64 and arm64 subregister definition, but | |
41 | makes other JITs more difficult. | |
42 | ||
43 | 32-bit architectures run 64-bit eBPF programs via interpreter. | |
44 | Their JITs may convert BPF programs that only use 32-bit subregisters into | |
45 | native instruction set and let the rest being interpreted. | |
46 | ||
47 | Operation is 64-bit, because on 64-bit architectures, pointers are also | |
48 | 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, | |
49 | so 32-bit eBPF registers would otherwise require to define register-pair | |
50 | ABI, thus, there won't be able to use a direct eBPF register to HW register | |
51 | mapping and JIT would need to do combine/split/move operations for every | |
52 | register in and out of the function, which is complex, bug prone and slow. | |
53 | Another reason is the use of atomic 64-bit counters. | |
54 | ||
55 | - Conditional jt/jf targets replaced with jt/fall-through: | |
56 | ||
57 | While the original design has constructs such as ``if (cond) jump_true; | |
58 | else jump_false;``, they are being replaced into alternative constructs like | |
59 | ``if (cond) jump_true; /* else fall-through */``. | |
60 | ||
61 | - Introduces bpf_call insn and register passing convention for zero overhead | |
62 | calls from/to other kernel functions: | |
63 | ||
64 | Before an in-kernel function call, the eBPF program needs to | |
65 | place function arguments into R1 to R5 registers to satisfy calling | |
66 | convention, then the interpreter will take them from registers and pass | |
67 | to in-kernel function. If R1 - R5 registers are mapped to CPU registers | |
68 | that are used for argument passing on given architecture, the JIT compiler | |
69 | doesn't need to emit extra moves. Function arguments will be in the correct | |
70 | registers and BPF_CALL instruction will be JITed as single 'call' HW | |
71 | instruction. This calling convention was picked to cover common call | |
72 | situations without performance penalty. | |
73 | ||
74 | After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has | |
75 | a return value of the function. Since R6 - R9 are callee saved, their state | |
76 | is preserved across the call. | |
77 | ||
78 | For example, consider three C functions:: | |
79 | ||
80 | u64 f1() { return (*_f2)(1); } | |
81 | u64 f2(u64 a) { return f3(a + 1, a); } | |
82 | u64 f3(u64 a, u64 b) { return a - b; } | |
83 | ||
84 | GCC can compile f1, f3 into x86_64:: | |
85 | ||
86 | f1: | |
87 | movl $1, %edi | |
88 | movq _f2(%rip), %rax | |
89 | jmp *%rax | |
90 | f3: | |
91 | movq %rdi, %rax | |
92 | subq %rsi, %rax | |
93 | ret | |
94 | ||
95 | Function f2 in eBPF may look like:: | |
96 | ||
97 | f2: | |
98 | bpf_mov R2, R1 | |
99 | bpf_add R1, 1 | |
100 | bpf_call f3 | |
101 | bpf_exit | |
102 | ||
103 | If f2 is JITed and the pointer stored to ``_f2``. The calls f1 -> f2 -> f3 and | |
104 | returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs to | |
105 | be used to call into f2. | |
106 | ||
107 | For practical reasons all eBPF programs have only one argument 'ctx' which is | |
108 | already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs | |
109 | can call kernel functions with up to 5 arguments. Calls with 6 or more arguments | |
110 | are currently not supported, but these restrictions can be lifted if necessary | |
111 | in the future. | |
112 | ||
113 | On 64-bit architectures all register map to HW registers one to one. For | |
114 | example, x86_64 JIT compiler can map them as ... | |
115 | ||
116 | :: | |
117 | ||
118 | R0 - rax | |
119 | R1 - rdi | |
120 | R2 - rsi | |
121 | R3 - rdx | |
122 | R4 - rcx | |
123 | R5 - r8 | |
124 | R6 - rbx | |
125 | R7 - r13 | |
126 | R8 - r14 | |
127 | R9 - r15 | |
128 | R10 - rbp | |
129 | ||
130 | ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing | |
131 | and rbx, r12 - r15 are callee saved. | |
132 | ||
133 | Then the following eBPF pseudo-program:: | |
134 | ||
135 | bpf_mov R6, R1 /* save ctx */ | |
136 | bpf_mov R2, 2 | |
137 | bpf_mov R3, 3 | |
138 | bpf_mov R4, 4 | |
139 | bpf_mov R5, 5 | |
140 | bpf_call foo | |
141 | bpf_mov R7, R0 /* save foo() return value */ | |
142 | bpf_mov R1, R6 /* restore ctx for next call */ | |
143 | bpf_mov R2, 6 | |
144 | bpf_mov R3, 7 | |
145 | bpf_mov R4, 8 | |
146 | bpf_mov R5, 9 | |
147 | bpf_call bar | |
148 | bpf_add R0, R7 | |
149 | bpf_exit | |
150 | ||
151 | After JIT to x86_64 may look like:: | |
152 | ||
153 | push %rbp | |
154 | mov %rsp,%rbp | |
155 | sub $0x228,%rsp | |
156 | mov %rbx,-0x228(%rbp) | |
157 | mov %r13,-0x220(%rbp) | |
158 | mov %rdi,%rbx | |
159 | mov $0x2,%esi | |
160 | mov $0x3,%edx | |
161 | mov $0x4,%ecx | |
162 | mov $0x5,%r8d | |
163 | callq foo | |
164 | mov %rax,%r13 | |
165 | mov %rbx,%rdi | |
166 | mov $0x6,%esi | |
167 | mov $0x7,%edx | |
168 | mov $0x8,%ecx | |
169 | mov $0x9,%r8d | |
170 | callq bar | |
171 | add %r13,%rax | |
172 | mov -0x228(%rbp),%rbx | |
173 | mov -0x220(%rbp),%r13 | |
174 | leaveq | |
175 | retq | |
176 | ||
177 | Which is in this example equivalent in C to:: | |
178 | ||
179 | u64 bpf_filter(u64 ctx) | |
180 | { | |
181 | return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); | |
182 | } | |
183 | ||
184 | In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 | |
185 | arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper | |
186 | registers and place their return value into ``%rax`` which is R0 in eBPF. | |
187 | Prologue and epilogue are emitted by JIT and are implicit in the | |
188 | interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve | |
189 | them across the calls as defined by calling convention. | |
190 | ||
191 | For example the following program is invalid:: | |
192 | ||
193 | bpf_mov R1, 1 | |
194 | bpf_call foo | |
195 | bpf_mov R0, R1 | |
196 | bpf_exit | |
197 | ||
198 | After the call the registers R1-R5 contain junk values and cannot be read. | |
199 | An in-kernel verifier.rst is used to validate eBPF programs. | |
200 | ||
201 | Also in the new design, eBPF is limited to 4096 insns, which means that any | |
202 | program will terminate quickly and will only call a fixed number of kernel | |
203 | functions. Original BPF and eBPF are two operand instructions, | |
204 | which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. | |
205 | ||
206 | The input context pointer for invoking the interpreter function is generic, | |
207 | its content is defined by a specific use case. For seccomp register R1 points | |
208 | to seccomp_data, for converted BPF filters R1 points to a skb. | |
209 | ||
210 | A program, that is translated internally consists of the following elements:: | |
211 | ||
212 | op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 | |
213 | ||
214 | So far 87 eBPF instructions were implemented. 8-bit 'op' opcode field | |
215 | has room for new instructions. Some of them may use 16/24/32 byte encoding. New | |
216 | instructions must be multiple of 8 bytes to preserve backward compatibility. | |
217 | ||
218 | eBPF is a general purpose RISC instruction set. Not every register and | |
219 | every instruction are used during translation from original BPF to eBPF. | |
220 | For example, socket filters are not using ``exclusive add`` instruction, but | |
221 | tracing filters may do to maintain counters of events, for example. Register R9 | |
222 | is not used by socket filters either, but more complex filters may be running | |
223 | out of registers and would have to resort to spill/fill to stack. | |
224 | ||
225 | eBPF can be used as a generic assembler for last step performance | |
226 | optimizations, socket filters and seccomp are using it as assembler. Tracing | |
227 | filters may use it as assembler to generate code from kernel. In kernel usage | |
228 | may not be bounded by security considerations, since generated eBPF code | |
229 | may be optimizing internal code path and not being exposed to the user space. | |
230 | Safety of eBPF can come from the verifier.rst. In such use cases as | |
231 | described, it may be used as safe instruction set. | |
232 | ||
233 | Just like the original BPF, eBPF runs within a controlled environment, | |
234 | is deterministic and the kernel can easily prove that. The safety of the program | |
235 | can be determined in two steps: first step does depth-first-search to disallow | |
236 | loops and other CFG validation; second step starts from the first insn and | |
237 | descends all possible paths. It simulates execution of every insn and observes | |
238 | the state change of registers and stack. | |
239 | ||
240 | opcode encoding | |
241 | =============== | |
242 | ||
243 | eBPF is reusing most of the opcode encoding from classic to simplify conversion | |
244 | of classic BPF to eBPF. | |
245 | ||
246 | For arithmetic and jump instructions the 8-bit 'code' field is divided into three | |
247 | parts:: | |
248 | ||
249 | +----------------+--------+--------------------+ | |
250 | | 4 bits | 1 bit | 3 bits | | |
251 | | operation code | source | instruction class | | |
252 | +----------------+--------+--------------------+ | |
253 | (MSB) (LSB) | |
254 | ||
255 | Three LSB bits store instruction class which is one of: | |
256 | ||
257 | =================== =============== | |
258 | Classic BPF classes eBPF classes | |
259 | =================== =============== | |
260 | BPF_LD 0x00 BPF_LD 0x00 | |
261 | BPF_LDX 0x01 BPF_LDX 0x01 | |
262 | BPF_ST 0x02 BPF_ST 0x02 | |
263 | BPF_STX 0x03 BPF_STX 0x03 | |
264 | BPF_ALU 0x04 BPF_ALU 0x04 | |
265 | BPF_JMP 0x05 BPF_JMP 0x05 | |
266 | BPF_RET 0x06 BPF_JMP32 0x06 | |
267 | BPF_MISC 0x07 BPF_ALU64 0x07 | |
268 | =================== =============== | |
269 | ||
270 | The 4th bit encodes the source operand ... | |
271 | ||
272 | :: | |
273 | ||
274 | BPF_K 0x00 | |
275 | BPF_X 0x08 | |
276 | ||
277 | * in classic BPF, this means:: | |
278 | ||
279 | BPF_SRC(code) == BPF_X - use register X as source operand | |
280 | BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand | |
281 | ||
282 | * in eBPF, this means:: | |
283 | ||
284 | BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand | |
285 | BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand | |
286 | ||
287 | ... and four MSB bits store operation code. | |
288 | ||
289 | If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:: | |
290 | ||
291 | BPF_ADD 0x00 | |
292 | BPF_SUB 0x10 | |
293 | BPF_MUL 0x20 | |
294 | BPF_DIV 0x30 | |
295 | BPF_OR 0x40 | |
296 | BPF_AND 0x50 | |
297 | BPF_LSH 0x60 | |
298 | BPF_RSH 0x70 | |
299 | BPF_NEG 0x80 | |
300 | BPF_MOD 0x90 | |
301 | BPF_XOR 0xa0 | |
302 | BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ | |
303 | BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ | |
304 | BPF_END 0xd0 /* eBPF only: endianness conversion */ | |
305 | ||
306 | If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of:: | |
307 | ||
308 | BPF_JA 0x00 /* BPF_JMP only */ | |
309 | BPF_JEQ 0x10 | |
310 | BPF_JGT 0x20 | |
311 | BPF_JGE 0x30 | |
312 | BPF_JSET 0x40 | |
313 | BPF_JNE 0x50 /* eBPF only: jump != */ | |
314 | BPF_JSGT 0x60 /* eBPF only: signed '>' */ | |
315 | BPF_JSGE 0x70 /* eBPF only: signed '>=' */ | |
316 | BPF_CALL 0x80 /* eBPF BPF_JMP only: function call */ | |
317 | BPF_EXIT 0x90 /* eBPF BPF_JMP only: function return */ | |
318 | BPF_JLT 0xa0 /* eBPF only: unsigned '<' */ | |
319 | BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */ | |
320 | BPF_JSLT 0xc0 /* eBPF only: signed '<' */ | |
321 | BPF_JSLE 0xd0 /* eBPF only: signed '<=' */ | |
322 | ||
323 | So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF | |
324 | and eBPF. There are only two registers in classic BPF, so it means A += X. | |
325 | In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, | |
326 | BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous | |
327 | src_reg = (u32) src_reg ^ (u32) imm32 in eBPF. | |
328 | ||
329 | Classic BPF is using BPF_MISC class to represent A = X and X = A moves. | |
330 | eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no | |
331 | BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean | |
332 | exactly the same operations as BPF_ALU, but with 64-bit wide operands | |
333 | instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: | |
334 | dst_reg = dst_reg + src_reg | |
335 | ||
336 | Classic BPF wastes the whole BPF_RET class to represent a single ``ret`` | |
337 | operation. Classic BPF_RET | BPF_K means copy imm32 into return register | |
338 | and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT | |
339 | in eBPF means function exit only. The eBPF program needs to store return | |
340 | value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as | |
341 | BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide | |
342 | operands for the comparisons instead. | |
343 | ||
344 | For load and store instructions the 8-bit 'code' field is divided as:: | |
345 | ||
346 | +--------+--------+-------------------+ | |
347 | | 3 bits | 2 bits | 3 bits | | |
348 | | mode | size | instruction class | | |
349 | +--------+--------+-------------------+ | |
350 | (MSB) (LSB) | |
351 | ||
352 | Size modifier is one of ... | |
353 | ||
354 | :: | |
355 | ||
356 | BPF_W 0x00 /* word */ | |
357 | BPF_H 0x08 /* half word */ | |
358 | BPF_B 0x10 /* byte */ | |
359 | BPF_DW 0x18 /* eBPF only, double word */ | |
360 | ||
361 | ... which encodes size of load/store operation:: | |
362 | ||
363 | B - 1 byte | |
364 | H - 2 byte | |
365 | W - 4 byte | |
366 | DW - 8 byte (eBPF only) | |
367 | ||
368 | Mode modifier is one of:: | |
369 | ||
370 | BPF_IMM 0x00 /* used for 32-bit mov in classic BPF and 64-bit in eBPF */ | |
371 | BPF_ABS 0x20 | |
372 | BPF_IND 0x40 | |
373 | BPF_MEM 0x60 | |
374 | BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ | |
375 | BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ | |
376 | BPF_ATOMIC 0xc0 /* eBPF only, atomic operations */ |