==============
BPF Design Q&A
==============

BPF's extensibility and its applicability to networking, tracing and
security in the Linux kernel, together with several user space
implementations of the BPF virtual machine, have led to a number of
misunderstandings about what BPF actually is. This short Q&A is an
attempt to address that and to outline where BPF is heading in the
long term.

.. contents::
    :local:
    :depth: 3

Questions and Answers
=====================

Q: Is BPF a generic instruction set similar to x64 and arm64?
-------------------------------------------------------------
A: NO.

Q: Is BPF a generic virtual machine?
------------------------------------
A: NO.

BPF is a generic instruction set *with* the C calling convention.
-----------------------------------------------------------------

Q: Why was the C calling convention chosen?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

A: Because BPF programs are designed to run in the Linux kernel,
which is written in C. Hence BPF defines an instruction set compatible
with the two most used architectures, x64 and arm64 (and takes into
consideration important quirks of other architectures), and
defines a calling convention that is compatible with the C calling
convention of the Linux kernel on those architectures.

Q: Can multiple return values be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF allows only register R0 to be used as the return value.

Q: Can more than 5 function arguments be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. The BPF calling convention only allows registers R1-R5 to be used
as arguments. BPF is not a standalone instruction set
(unlike the x64 ISA, which allows msft, cdecl and other conventions).

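One common way to live within the R1-R5 limit is to bundle extra values
into a structure and pass a single pointer. The sketch below is plain
user-space C for illustration only; ``struct flow_args`` and
``sum_fields()`` are hypothetical names, not kernel or BPF APIs.

```c
#include <stdint.h>

/* Hypothetical bundle of six values: too many for registers R1-R5. */
struct flow_args {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint32_t src_port;
    uint32_t dst_port;
    uint32_t proto;
    uint32_t flags;
};

/*
 * Takes one pointer argument (which would travel in R1) and returns
 * one scalar (which would travel in R0).
 */
uint64_t sum_fields(const struct flow_args *a)
{
    return (uint64_t)a->src_ip + a->dst_ip + a->src_port +
           a->dst_port + a->proto + a->flags;
}
```

The callee sees all six values while the call itself uses a single
argument register.
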
Q: Can BPF programs access the instruction pointer or return address?
---------------------------------------------------------------------
A: NO.

Q: Can BPF programs access the stack pointer?
---------------------------------------------
A: NO.

Only the frame pointer (register R10) is accessible.
From the compiler's point of view, it is necessary to have a stack pointer.
For example, LLVM defines register R11 as the stack pointer in its
BPF backend, but it makes sure that generated code never uses it.

Q: Does the C calling convention diminish possible use cases?
-------------------------------------------------------------
A: YES.

The BPF design forces the addition of major functionality in the form
of kernel helper functions and kernel objects like BPF maps, with
seamless interoperability between them. It lets the kernel call into
BPF programs and programs call kernel helpers with zero overhead,
as if all of them were native C code. That is particularly the case
for JITed BPF programs, which are indistinguishable from
native kernel C code.

Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
------------------------------------------------------------------------
A: Soft yes.

At least for now, until the BPF core has support for
bpf-to-bpf calls, indirect calls, loops, global variables,
jump tables, read-only sections, and all other normal constructs
that C code can produce.

Q: Can loops be supported in a safe way?
----------------------------------------
A: It's not clear yet.

BPF developers are trying to find a way to
support bounded loops.

Q: What are the verifier limits?
--------------------------------
A: The only limit known to user space is BPF_MAXINSNS (4096).
It is the maximum number of instructions that an unprivileged bpf
program can have. The verifier has various internal limits,
such as the maximum number of instructions that can be explored during
program analysis. Currently, that limit is set to 1 million,
which essentially means that the largest program can consist
of 1 million NOP instructions. There is a limit on the maximum number
of subsequent branches, a limit on the number of nested bpf-to-bpf
calls, a limit on the number of verifier states per instruction,
and a limit on the number of maps used by the program.
All these limits can be hit with a sufficiently complex program.
There are also non-numerical limits that can cause the program
to be rejected. The verifier used to recognize only pointer + constant
expressions. Now it can recognize pointer + bounded_register.
bpf_map_lookup_elem(key) used to require that 'key' be
a pointer to the stack. Now, 'key' can be a pointer to a map value.
The verifier is steadily getting 'smarter', and the limits are
being removed. The only way to know that a program is going to
be accepted by the verifier is to try to load it.
The bpf development process guarantees that future kernel
versions will accept all bpf programs that were accepted by
earlier versions.


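To illustrate what a pointer + bounded_register access looks like at the
C level, here is a minimal user-space sketch. It assumes the bound is
established by an explicit comparison before the access; ``table`` and
``lookup_checked()`` are hypothetical names, and whether a given kernel's
verifier accepts the equivalent BPF code can only be confirmed by loading it.

```c
#include <stdint.h>

#define TABLE_SIZE 16

/* Hypothetical lookup table: entry i holds i * 10. */
static const uint8_t table[TABLE_SIZE] = {
    0, 10, 20, 30, 40, 50, 60, 70,
    80, 90, 100, 110, 120, 130, 140, 150
};

/* Returns table[idx], or -1 when idx is out of range. */
int lookup_checked(uint64_t idx)
{
    /* After this test, idx is known to lie in [0, TABLE_SIZE - 1]. */
    if (idx >= TABLE_SIZE)
        return -1;

    /* Base pointer plus a register with a known bound. */
    return table[idx];
}
```

Without the comparison, the index would be unbounded and the access
could not be proven safe.
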
Instruction level questions
---------------------------

Q: LD_ABS and LD_IND instructions vs C code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Q: How come LD_ABS and LD_IND instructions are present in BPF whereas
C code cannot express them and has to use builtin intrinsics?

A: This is an artifact of compatibility with classic BPF. Modern
networking code in BPF performs better without them.
See 'direct packet access'.

Q: BPF instructions mapping not one-to-one to native CPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: It seems not all BPF instructions map one-to-one to native CPU
instructions. For example, why are BPF_JNE and other compare-and-jump
instructions not CPU-like?

A: This was necessary to avoid introducing flags into the ISA, which are
impossible to make generic and efficient across CPU architectures.

Q: Why doesn't the BPF_DIV instruction map to x64 div?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because picking a one-to-one relationship to x64 would have made
it more complicated to support on arm64 and other archs. It also
needs a div-by-zero runtime check.

Q: Why is there no BPF_SDIV for signed divide operation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because it would be rarely used. LLVM errors out in such a case and
prints a suggestion to use unsigned divide instead.

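When a signed quotient is really needed, one possible workaround is to
divide the magnitudes with an unsigned divide and restore the sign
afterwards. The sketch below is plain C under stated assumptions: the
divisor is non-zero, the target uses two's complement, and
``sdiv_via_udiv()`` is a hypothetical helper name.

```c
#include <stdint.h>

/*
 * Signed division built from an unsigned divide.
 * Precondition: b != 0. Truncates toward zero, like C's '/' operator.
 */
int64_t sdiv_via_udiv(int64_t a, int64_t b)
{
    /* Magnitudes, computed in unsigned arithmetic to avoid overflow. */
    uint64_t ua = a < 0 ? 0 - (uint64_t)a : (uint64_t)a;
    uint64_t ub = b < 0 ? 0 - (uint64_t)b : (uint64_t)b;
    uint64_t q = ua / ub;

    /* The result is negative only when the operand signs differ. */
    if ((a < 0) != (b < 0))
        return (int64_t)(0 - q);
    return (int64_t)q;
}
```
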
Q: Why does BPF have an implicit prologue and epilogue?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because architectures like sparc have register windows, and in general
there are enough subtle differences between architectures that naively
storing the return address on the stack won't work. Another reason is
that BPF has to be safe from division by zero (and the legacy exception
path of the LD_ABS insn). Those instructions need to invoke the epilogue
and return implicitly.

Q: Why were BPF_JLT and BPF_JLE instructions not introduced in the beginning?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because classic BPF didn't have them and the BPF authors felt that a
compiler workaround would be acceptable. It turned out that programs lose
performance due to the lack of these compare instructions, so they were
added. These two instructions are a perfect example of what kind of new BPF
instructions are acceptable and can be added in the future.
They already had equivalent instructions in native CPUs.
New instructions that don't have a one-to-one mapping to HW instructions
will not be accepted.

Q: BPF 32-bit subregister requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: BPF 32-bit subregisters have a requirement to zero the upper 32 bits of
BPF registers, which makes BPF an inefficient virtual machine for 32-bit
CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
be added to BPF in the future?

A: NO.

But some optimizations on zero-ing the upper 32 bits for BPF registers are
available, and can be leveraged to improve the performance of JITed BPF
programs for 32-bit architectures.

Starting with version 7, LLVM is able to generate instructions that operate
on 32-bit subregisters, provided the option -mattr=+alu32 is passed for
compiling a program. Furthermore, the verifier can now mark the
instructions for which zero-ing the upper bits of the destination register
is required, and insert an explicit zero-extension (zext) instruction
(a mov32 variant). This means that for architectures without zext hardware
support, the JIT back-ends do not need to clear the upper bits for
subregisters written by alu32 instructions or narrow loads. Instead, the
back-ends simply need to support code generation for that mov32 variant,
and to override bpf_jit_needs_zext() to make it return "true" (in order to
enable zext insertion in the verifier).

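The zero-extension requirement itself can be modeled in a few lines of
user-space C. This is a sketch of the register semantics only, not of
any JIT; ``alu64_add()`` and ``alu32_add()`` are hypothetical names.

```c
#include <stdint.h>

/* 64-bit add: reads and writes the full register. */
uint64_t alu64_add(uint64_t dst, uint64_t src)
{
    return dst + src;
}

/*
 * 32-bit add on subregisters: only the low 32 bits take part, and the
 * upper 32 bits of the destination end up zero.
 */
uint64_t alu32_add(uint64_t dst, uint64_t src)
{
    uint32_t low = (uint32_t)dst + (uint32_t)src;

    return (uint64_t)low; /* zero-extends into the 64-bit register */
}
```

For example, adding 1 to 0xFFFFFFFF yields 0x100000000 with the 64-bit
form and 0 with the 32-bit form, and any stale upper bits in the
destination are discarded by the 32-bit form.
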
Note that it is possible for a JIT back-end to have partial hardware
support for zext. In that case, if verifier zext insertion is enabled,
it could lead to the insertion of unnecessary zext instructions. Such
instructions could be removed by creating a simple peephole inside the JIT
back-end: if one instruction has hardware support for zext and if the next
instruction is an explicit zext, then the latter can be skipped when doing
the code generation.

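The peephole described above can be sketched as follows. This is a toy
model under simplifying assumptions: ``struct toy_insn`` and
``count_emitted()`` are hypothetical and do not correspond to kernel
structures, and the model relies on the explicit zext directly following
the instruction whose destination it extends.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified instruction record (not a kernel structure). */
struct toy_insn {
    bool hw_zext;       /* native code for this insn already zero-extends */
    bool explicit_zext; /* verifier-inserted zext (mov32 variant) */
};

/* Returns how many instructions the toy JIT would actually emit. */
size_t count_emitted(const struct toy_insn *prog, size_t len)
{
    size_t emitted = 0;

    for (size_t i = 0; i < len; i++) {
        /* Skip an explicit zext when the previous insn did the work. */
        if (prog[i].explicit_zext && i > 0 && prog[i - 1].hw_zext)
            continue;
        emitted++;
    }
    return emitted;
}
```

An explicit zext that follows an instruction without hardware zext
support is still emitted, so correctness does not depend on the peephole.
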
Q: Does BPF have a stable ABI?
------------------------------
A: YES. BPF instructions, arguments to BPF programs, the set of helper
functions and their arguments, and recognized return codes are all part
of the ABI. However, there is one specific exception for tracing programs
that use helpers like bpf_probe_read() to walk kernel internal
data structures and are compiled with kernel internal headers. Both of
these kernel internals are subject to change and can break with newer
kernels, such that the program needs to be adapted accordingly.

Q: Are tracepoints part of the stable ABI?
------------------------------------------
A: NO. Tracepoints are tied to internal implementation details, hence they
are subject to change and can break with newer kernels. BPF programs need to
change accordingly when this happens.

Q: How much stack space does a BPF program use?
-----------------------------------------------
A: Currently all program types are limited to 512 bytes of stack
space, but the verifier computes the actual amount of stack used,
and both the interpreter and most JITed code consume only the necessary
amount.

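The 512-byte limit can be pictured as a range check on offsets relative
to the frame pointer. The sketch below is a user-space model that assumes
stack slots sit at negative offsets from R10; ``TOY_STACK_LIMIT`` and
``stack_access_ok()`` are hypothetical names, not the verifier's code.

```c
#include <stdbool.h>

#define TOY_STACK_LIMIT 512 /* stack limit in bytes, as stated above */

/*
 * Returns true when an access of 'size' bytes at frame-pointer offset
 * 'off' stays inside the window [-TOY_STACK_LIMIT, 0).
 */
bool stack_access_ok(int off, int size)
{
    return size > 0 &&
           off >= -TOY_STACK_LIMIT &&
           off < 0 &&
           off + size <= 0;
}
```

In this model, an 8-byte access at offset -512 is the deepest slot that
still fits, while anything at offset -520 falls outside the window.
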
Q: Can BPF be offloaded to HW?
------------------------------
A: YES. BPF HW offload is supported by the NFP driver.

Q: Does the classic BPF interpreter still exist?
------------------------------------------------
A: NO. Classic BPF programs are converted into extended BPF instructions.

Q: Can BPF call arbitrary kernel functions?
-------------------------------------------
A: NO. BPF programs can only call a set of helper functions which
is defined for every program type.

Q: Can BPF overwrite arbitrary kernel memory?
---------------------------------------------
A: NO.

Tracing bpf programs can *read* arbitrary memory with the bpf_probe_read()
and bpf_probe_read_str() helpers. Networking programs cannot read
arbitrary memory, since they don't have access to these helpers.
Programs can never read or write arbitrary memory directly.

Q: Can BPF overwrite arbitrary user memory?
-------------------------------------------
A: Sort-of.

Tracing BPF programs can overwrite the user memory
of the current task with bpf_probe_write_user(). Every time such a
program is loaded, the kernel will print a warning message, so
this helper is only useful for experiments and prototypes.
Tracing BPF programs are root only.

Q: New functionality via kernel modules?
----------------------------------------
Q: Can BPF functionality such as new program or map types, new
helpers, etc. be added from kernel module code?

A: NO.

Q: Is directly calling a kernel function an ABI?
------------------------------------------------
Q: Some kernel functions (e.g. tcp_slow_start) can be called
by BPF programs. Do these kernel functions become an ABI?

A: NO.

The kernel function prototypes will change and the bpf programs will be
rejected by the verifier. Also, for example, some of the bpf-callable
kernel functions are already used by other kernel tcp
cc (congestion-control) implementations. If any of these kernel
functions changes, both the in-tree and out-of-tree kernel tcp cc
implementations have to be changed. The same goes for the bpf
programs: they have to be adjusted accordingly.