1 | |
2 | ============= | |
3 | eBPF verifier | |
4 | ============= | |
5 | ||
6 | The safety of the eBPF program is determined in two steps. | |
7 | ||
8 | The first step does a DAG check to disallow loops and performs other CFG validation. | |
9 | In particular, it detects programs that have unreachable instructions | |
10 | (although the classic BPF checker allows them). | |
11 | ||
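The first step can be pictured with a small model. The following is only a sketch under stated assumptions: ``struct insn``, ``enum insn_kind`` and ``check_cfg_sketch()`` are hypothetical names invented for illustration, not the kernel's data structures, and the walk is recursive for brevity. It rejects back-edges (loops), jumps out of the program and instructions that are never reached.

```c
#include <assert.h>

/* Hypothetical, simplified instruction model (not struct bpf_insn). */
enum insn_kind { INSN_ALU, INSN_JA, INSN_JCOND, INSN_EXIT };

struct insn {
	enum insn_kind kind;
	int off;	/* jump target is pc + 1 + off */
};

enum cfg_result { CFG_OK, CFG_LOOP, CFG_UNREACHABLE, CFG_BAD_JUMP };

enum { UNSEEN = 0, ON_PATH, DONE };

#define MAX_INSNS 64

/* Depth-first walk; an edge to an insn still on the current path is a loop. */
static enum cfg_result visit(const struct insn *prog, int len, int pc, char *color)
{
	int succ[2], n = 0, i;

	if (pc < 0 || pc >= len)
		return CFG_BAD_JUMP;	/* jump or fall-through out of the program */
	if (color[pc] == ON_PATH)
		return CFG_LOOP;
	if (color[pc] == DONE)
		return CFG_OK;

	color[pc] = ON_PATH;
	switch (prog[pc].kind) {
	case INSN_ALU:
		succ[n++] = pc + 1;
		break;
	case INSN_JA:
		succ[n++] = pc + 1 + prog[pc].off;
		break;
	case INSN_JCOND:
		succ[n++] = pc + 1;
		succ[n++] = pc + 1 + prog[pc].off;
		break;
	case INSN_EXIT:
		break;
	}
	for (i = 0; i < n; i++) {
		enum cfg_result r = visit(prog, len, succ[i], color);

		if (r != CFG_OK)
			return r;
	}
	color[pc] = DONE;
	return CFG_OK;
}

enum cfg_result check_cfg_sketch(const struct insn *prog, int len)
{
	char color[MAX_INSNS] = { 0 };
	enum cfg_result r;
	int i;

	assert(len > 0 && len <= MAX_INSNS);
	r = visit(prog, len, 0, color);
	if (r != CFG_OK)
		return r;
	/* Anything the walk never touched is unreachable. */
	for (i = 0; i < len; i++)
		if (color[i] == UNSEEN)
			return CFG_UNREACHABLE;
	return CFG_OK;
}
```

With this model, a program made of two consecutive exit instructions is reported as having an unreachable instruction, and a backward jump is reported as a loop.
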
12 | The second step starts from the first insn and descends all possible paths. | |
13 | It simulates execution of every insn and observes the state change of | |
14 | registers and stack. | |
15 | ||
16 | At the start of the program the register R1 contains a pointer to context | |
17 | and has type PTR_TO_CTX. | |
18 | If the verifier sees an insn that does R2=R1, then R2 now has type | |
19 | PTR_TO_CTX as well and can be used on the right-hand side of an expression. | |
20 | If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE, | |
21 | since adding two valid pointers produces an invalid pointer. | |
22 | (In 'secure' mode the verifier rejects any type of pointer arithmetic to make | |
23 | sure that kernel addresses don't leak to unprivileged users.) | |
24 | ||
25 | If a register was never written to, it's not readable:: | |
26 | ||
27 | bpf_mov R0 = R2 | |
28 | bpf_exit | |
29 | ||
30 | will be rejected, since R2 is unreadable at the start of the program. | |
31 | ||
32 | After a kernel function call, R1-R5 are reset to unreadable and | |
33 | R0 has the return type of the function. | |
34 | ||
35 | Since R6-R9 are callee saved, their state is preserved across the call. | |
36 | ||
37 | :: | |
38 | ||
39 | bpf_mov R6 = 1 | |
40 | bpf_call foo | |
41 | bpf_mov R0 = R6 | |
42 | bpf_exit | |
43 | ||
44 | is a correct program. If there was R1 instead of R6, it would have | |
45 | been rejected. | |
46 | ||
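The readability rules above can be modelled in a few lines. This is a sketch only: ``struct reg_file`` and the ``rf_*()`` helpers are hypothetical names made up for this illustration, and the model tracks nothing but a coarse type per register.

```c
#include <assert.h>
#include <stdbool.h>

enum rtype { R_NOT_INIT = 0, R_SCALAR, R_PTR_TO_CTX, R_PTR_TO_STACK };

#define RF_NUM_REGS 11	/* R0-R10 */

struct reg_file {
	enum rtype type[RF_NUM_REGS];
};

/* Program entry: only R1 (context) and R10 (frame pointer) are readable. */
void rf_init(struct reg_file *rf)
{
	int i;

	for (i = 0; i < RF_NUM_REGS; i++)
		rf->type[i] = R_NOT_INIT;
	rf->type[1] = R_PTR_TO_CTX;
	rf->type[10] = R_PTR_TO_STACK;
}

/* dst = src; fails if src was never written. R10 is read-only. */
bool rf_mov_reg(struct reg_file *rf, int dst, int src)
{
	assert(dst >= 0 && dst < 10 && src >= 0 && src < RF_NUM_REGS);
	if (rf->type[src] == R_NOT_INIT)
		return false;
	rf->type[dst] = rf->type[src];
	return true;
}

/* dst = immediate */
void rf_mov_imm(struct reg_file *rf, int dst)
{
	assert(dst >= 0 && dst < 10);
	rf->type[dst] = R_SCALAR;
}

/* Call: R1-R5 become unreadable, R0 takes the return type, R6-R9 survive. */
void rf_call(struct reg_file *rf, enum rtype ret_type)
{
	int i;

	for (i = 1; i <= 5; i++)
		rf->type[i] = R_NOT_INIT;
	rf->type[0] = ret_type;
}

/* Exit reads R0, so R0 must have been written. */
bool rf_exit_ok(const struct reg_file *rf)
{
	return rf->type[0] != R_NOT_INIT;
}
```

Replaying the two programs above against this model gives the same verdicts: ``R0 = R2`` fails at program start, ``R0 = R6`` succeeds after a call, and ``R0 = R1`` fails after a call.
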
47 | Load/store instructions are allowed only with registers of valid types, which | |
48 | are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked. | |
49 | For example:: | |
50 | ||
51 | bpf_mov R1 = 1 | |
52 | bpf_mov R2 = 2 | |
53 | bpf_xadd *(u32 *)(R1 + 3) += R2 | |
54 | bpf_exit | |
55 | ||
56 | will be rejected, since R1 doesn't have a valid pointer type at the time of | |
57 | execution of instruction bpf_xadd. | |
58 | ||
59 | At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``). | |
60 | A callback is used to customize the verifier to restrict eBPF program access to only | |
61 | certain fields within the ctx structure with specified size and alignment. | |
62 | ||
63 | For example, the following insn:: | |
64 | ||
65 | bpf_ld R0 = *(u32 *)(R6 + 8) | |
66 | ||
67 | intends to load a word from address R6 + 8 and store it into R0. | |
68 | If R6=PTR_TO_CTX, via the is_valid_access() callback the verifier will know | |
69 | that offset 8 of size 4 bytes can be accessed for reading, otherwise | |
70 | the verifier will reject the program. | |
71 | If R6=PTR_TO_STACK, then access should be aligned and be within | |
72 | stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8, | |
73 | so it will fail verification, since it's out of bounds. | |
74 | ||
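The PTR_TO_STACK case can be sketched as a single predicate. This is a minimal illustration, not the kernel's code: ``stack_access_in_bounds()`` is a hypothetical helper, and ``MAX_BPF_STACK`` is assumed here to be 512 bytes.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BPF_STACK 512	/* assumed value for this sketch */

/*
 * off is the signed offset from the frame pointer, size is the access
 * width in bytes. Valid accesses lie within [-MAX_BPF_STACK, 0) and
 * are naturally aligned.
 */
bool stack_access_in_bounds(int off, int size)
{
	assert(size == 1 || size == 2 || size == 4 || size == 8);
	if (off < -MAX_BPF_STACK || off + size > 0)
		return false;
	return off % size == 0;
}
```

For the example above, ``stack_access_in_bounds(8, 4)`` is false because offset 8 lies above the frame pointer, while an access such as ``(-4, 4)`` passes.
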
75 | The verifier allows an eBPF program to read data from the stack only after | |
76 | the program has written to it. | |
77 | ||
78 | The classic BPF verifier does a similar check with M[0-15] memory slots. | |
79 | For example:: | |
80 | ||
81 | bpf_ld R0 = *(u32 *)(R10 - 4) | |
82 | bpf_exit | |
83 | ||
84 | is an invalid program. | |
85 | Although R10 is a valid read-only register and has type PTR_TO_STACK | |
86 | and R10 - 4 is within stack bounds, there were no stores into that location. | |
87 | ||
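A byte-granular model is enough to show the read-after-write rule. The names below (``struct stack_init_map``, ``stack_note_store()``, ``stack_load_ok()``) are hypothetical, the stack size of 512 bytes is an assumption, and the verifier's real bookkeeping is per slot and considerably richer.

```c
#include <assert.h>
#include <stdbool.h>

#define STACK_MODEL_SIZE 512	/* assumed stack size for this sketch */

struct stack_init_map {
	unsigned char written[STACK_MODEL_SIZE];	/* 1 = byte was stored to */
};

/* Map a frame-pointer-relative offset in [-512, 0) to an array index. */
static int stack_index(int off)
{
	assert(off >= -STACK_MODEL_SIZE && off < 0);
	return off + STACK_MODEL_SIZE;
}

/* Record a store of size bytes at fp + off. */
void stack_note_store(struct stack_init_map *s, int off, int size)
{
	int i;

	for (i = 0; i < size; i++)
		s->written[stack_index(off + i)] = 1;
}

/* A load is allowed only if every byte it covers was stored before. */
bool stack_load_ok(const struct stack_init_map *s, int off, int size)
{
	int i;

	for (i = 0; i < size; i++)
		if (!s->written[stack_index(off + i)])
			return false;
	return true;
}
```

On a fresh map, ``stack_load_ok(&s, -4, 4)`` is false, which corresponds to the rejected program above. After ``stack_note_store(&s, -8, 8)`` the same load is accepted.
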
88 | Pointer register spill/fill is tracked as well, since four (R6-R9) | |
89 | callee saved registers may not be enough for some programs. | |
90 | ||
91 | Allowed function calls are customized with bpf_verifier_ops->get_func_proto(). | |
92 | The eBPF verifier will check that registers match argument constraints. | |
93 | After the call, register R0 will be set to the return type of the function. | |
94 | ||
95 | Function calls are the main mechanism to extend the functionality of eBPF programs. | |
96 | Socket filters may let programs call one set of functions, whereas tracing | |
97 | filters may allow a completely different set. | |
98 | ||
99 | If a function is made accessible to eBPF programs, it needs to be thought through | |
100 | from a safety point of view. The verifier will guarantee that the function is | |
101 | called with valid arguments. | |
102 | ||
103 | Seccomp and socket filters have different security restrictions for classic BPF. | |
104 | Seccomp solves this with a two-stage verifier: the classic BPF verifier is followed | |
105 | by the seccomp verifier. In the case of eBPF, one configurable verifier is shared for | |
106 | all use cases. | |
107 | ||
108 | See details of eBPF verifier in kernel/bpf/verifier.c | |
109 | ||
110 | Register value tracking | |
111 | ======================= | |
112 | ||
113 | In order to determine the safety of an eBPF program, the verifier must track | |
114 | the range of possible values in each register and also in each stack slot. | |
115 | This is done with ``struct bpf_reg_state``, defined in | |
116 | include/linux/bpf_verifier.h, which unifies tracking of scalar and pointer values. Each | |
117 | register state has a type, which is either NOT_INIT (the register has not been | |
118 | written to), SCALAR_VALUE (some value which is not usable as a pointer), or a | |
119 | pointer type. The types of pointers describe their base, as follows: | |
120 | ||
121 | ||
122 | PTR_TO_CTX | |
123 | Pointer to bpf_context. | |
124 | CONST_PTR_TO_MAP | |
125 | Pointer to struct bpf_map. "Const" because arithmetic | |
126 | on these pointers is forbidden. | |
127 | PTR_TO_MAP_VALUE | |
128 | Pointer to the value stored in a map element. | |
129 | PTR_TO_MAP_VALUE_OR_NULL | |
130 | Either a pointer to a map value, or NULL; map accesses | |
131 | (see maps.rst) return this type, which becomes a | |
132 | PTR_TO_MAP_VALUE when checked != NULL. Arithmetic on | |
133 | these pointers is forbidden. | |
134 | PTR_TO_STACK | |
135 | Frame pointer. | |
136 | PTR_TO_PACKET | |
137 | skb->data. | |
138 | PTR_TO_PACKET_END | |
139 | skb->data + headlen; arithmetic forbidden. | |
140 | PTR_TO_SOCKET | |
141 | Pointer to struct bpf_sock_ops, implicitly refcounted. | |
142 | PTR_TO_SOCKET_OR_NULL | |
143 | Either a pointer to a socket, or NULL; socket lookup | |
144 | returns this type, which becomes a PTR_TO_SOCKET when | |
145 | checked != NULL. PTR_TO_SOCKET is reference-counted, | |
146 | so programs must release the reference through the | |
147 | socket release function before the end of the program. | |
148 | Arithmetic on these pointers is forbidden. | |
149 | ||
150 | However, a pointer may be offset from this base (as a result of pointer | |
151 | arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable | |
152 | offset'. The former is used when an exactly-known value (e.g. an immediate | |
153 | operand) is added to a pointer, while the latter is used for values which are | |
154 | not exactly known. The variable offset is also used in SCALAR_VALUEs, to track | |
155 | the range of possible values in the register. | |
156 | ||
157 | The verifier's knowledge about the variable offset consists of: | |
158 | ||
159 | * minimum and maximum values as unsigned | |
160 | * minimum and maximum values as signed | |
161 | ||
162 | * knowledge of the values of individual bits, in the form of a 'tnum': a u64 | |
163 | 'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; | |
164 | 1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both | |
165 | mask and value; no bit should ever be 1 in both. For example, if a byte is read | |
166 | into a register from memory, the register's top 56 bits are known zero, while | |
167 | the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we | |
168 | then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0; | |
169 | 0x1ff), because of potential carries. | |
170 | ||
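The tnum arithmetic in the example above can be reproduced with a short sketch. The OR and ADD rules below are modelled on the approach used in kernel/bpf/tnum.c, but this is an illustrative user-space rewrite: ``tnum_make()`` and its assertion are additions for this example, and the exact kernel code may differ.

```c
#include <assert.h>
#include <stdint.h>

struct tnum {
	uint64_t value;	/* bits known to be 1 */
	uint64_t mask;	/* bits whose value is unknown */
};

/* Build a tnum, enforcing that no bit is 1 in both value and mask. */
static struct tnum tnum_make(uint64_t value, uint64_t mask)
{
	struct tnum t = { value, mask };

	assert((value & mask) == 0);
	return t;
}

struct tnum tnum_const(uint64_t v)
{
	return tnum_make(v, 0);
}

/* A bit is known 1 if it is known 1 in either operand. */
struct tnum tnum_or(struct tnum a, struct tnum b)
{
	uint64_t v = a.value | b.value;
	uint64_t mu = a.mask | b.mask;

	return tnum_make(v, mu & ~v);
}

/* Unknown bits, and any bit a carry might reach, become unknown. */
struct tnum tnum_add(struct tnum a, struct tnum b)
{
	uint64_t sm = a.mask + b.mask;
	uint64_t sv = a.value + b.value;
	uint64_t sigma = sm + sv;
	uint64_t chi = sigma ^ sv;
	uint64_t mu = chi | a.mask | b.mask;

	return tnum_make(sv & ~mu, mu);
}
```

Starting from ``tnum_make(0x0, 0xff)``, OR with the constant 0x40 yields (0x40; 0xbf), and adding the constant 1 to that yields (0x0; 0x1ff), matching the values given in the text.
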
171 | Besides arithmetic, the register state can also be updated by conditional | |
172 | branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch | |
173 | it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' | |
174 | branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or | |
175 | BPF_JSGE) would instead update the signed minimum/maximum values. Information | |
176 | from the signed and unsigned bounds can be combined; for instance if a value is | |
177 | first tested < 8 and then tested s> 4, the verifier will conclude that the value | |
178 | is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. | |
179 | ||
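The bound updates described above can be sketched as follows. All names here (``struct sbounds``, the ``branch_*()`` helpers, ``deduce_bounds_sketch()``) are hypothetical and cover only comparisons against a constant; the verifier's own logic handles many more cases.

```c
#include <assert.h>
#include <stdint.h>

struct sbounds {
	uint64_t umin, umax;	/* unsigned bounds */
	int64_t smin, smax;	/* signed bounds */
};

struct sbounds bounds_unknown(void)
{
	struct sbounds b = { 0, UINT64_MAX, INT64_MIN, INT64_MAX };

	return b;
}

/* "if r > k goto": taken branch */
void branch_ugt_taken(struct sbounds *b, uint64_t k)
{
	assert(k < UINT64_MAX);
	if (b->umin < k + 1)
		b->umin = k + 1;
}

/* "if r > k goto": fall-through branch */
void branch_ugt_not_taken(struct sbounds *b, uint64_t k)
{
	if (b->umax > k)
		b->umax = k;
}

/* "if r < k goto": taken branch */
void branch_ult_taken(struct sbounds *b, uint64_t k)
{
	assert(k > 0);
	if (b->umax > k - 1)
		b->umax = k - 1;
}

/* "if r s> k goto": taken branch */
void branch_sgt_taken(struct sbounds *b, int64_t k)
{
	assert(k < INT64_MAX);
	if (b->smin < k + 1)
		b->smin = k + 1;
}

/* Combine signed and unsigned knowledge when the sign bit is known clear. */
void deduce_bounds_sketch(struct sbounds *b)
{
	if (b->umax <= (uint64_t)INT64_MAX) {
		if ((int64_t)b->umin > b->smin)
			b->smin = (int64_t)b->umin;
		if ((int64_t)b->umax < b->smax)
			b->smax = (int64_t)b->umax;
	}
	if (b->smin >= 0) {
		if ((uint64_t)b->smin > b->umin)
			b->umin = (uint64_t)b->smin;
		if ((uint64_t)b->smax < b->umax)
			b->umax = (uint64_t)b->smax;
	}
}
```

Applying ``branch_ult_taken(&b, 8)``, then ``branch_sgt_taken(&b, 4)``, then ``deduce_bounds_sketch(&b)`` to an unknown value leaves all four bounds in the range 5..7, which is the "also > 4 and s< 8" conclusion from the paragraph above.
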
180 | PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all | |
181 | pointers sharing that same variable offset. This is important for packet range | |
182 | checks: after adding a variable to a packet pointer register A, if you then copy | |
183 | it to another register B and then add a constant 4 to A, both registers will | |
184 | share the same 'id' but A will have a fixed offset of +4. Then if A is | |
185 | bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is | |
186 | now known to have a safe range of at least 4 bytes. See 'Direct packet access', | |
187 | below, for more on PTR_TO_PACKET ranges. | |
188 | ||
189 | The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of | |
190 | the pointer returned from a map lookup. This means that when one copy is | |
191 | checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. | |
192 | As well as range-checking, the tracked information is also used for enforcing | |
193 | alignment of pointer accesses. For instance, on most systems the packet pointer | |
194 | is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump | |
195 | over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting | |
196 | pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 | |
197 | bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through | |
198 | that pointer are safe. | |
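
The alignment argument can be checked with known-bits arithmetic. This is a sketch under assumptions: ``struct kbits`` and the helpers are hypothetical names, the addition rule follows the same idea as tnum addition, and (IHL * 4) is represented directly as value 0 with unknown bits 0x3c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct kbits {
	uint64_t value;	/* bits known to be 1 */
	uint64_t mask;	/* bits whose value is unknown */
};

/* Known-bits addition: any bit a carry may reach becomes unknown. */
static struct kbits kbits_add(struct kbits a, struct kbits b)
{
	uint64_t sm = a.mask + b.mask;
	uint64_t sv = a.value + b.value;
	uint64_t chi = (sm + sv) ^ sv;
	uint64_t mu = chi | a.mask | b.mask;
	struct kbits r = { sv & ~mu, mu };

	return r;
}

/* Aligned if no known-1 or unknown bit falls below the access size. */
static bool kbits_is_aligned(struct kbits a, uint64_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0);
	return ((a.value | a.mask) & (size - 1)) == 0;
}

/* var_off: variable part; fixed_off: constant part; ip_align: NET_IP_ALIGN. */
bool pkt_access_aligned(struct kbits var_off, uint64_t fixed_off,
			uint64_t ip_align, uint64_t size)
{
	struct kbits c = { fixed_off + ip_align, 0 };

	return kbits_is_aligned(kbits_add(var_off, c), size);
}
```

With a variable part of (0x0; 0x3c) and a fixed offset of 14, a 4-byte access is aligned when ``ip_align`` is 2 and misaligned when it is 0.
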
199 | The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common | |
200 | to all copies of the pointer returned from a socket lookup. This has similar | |
201 | behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but | |
202 | it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly | |
203 | represents a reference to the corresponding ``struct sock``. To ensure that the | |
204 | reference is not leaked, it is imperative to NULL-check the reference and, in | |
205 | the non-NULL case, pass the valid reference to the socket release function. | |
206 | ||
207 | Direct packet access | |
208 | ==================== | |
209 | ||
210 | In cls_bpf and act_bpf programs the verifier allows direct access to the packet | |
211 | data via skb->data and skb->data_end pointers. | |
212 | Ex:: | |
213 | ||
214 | 1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ | |
215 | 2: r3 = *(u32 *)(r1 +76) /* load skb->data */ | |
216 | 3: r5 = r3 | |
217 | 4: r5 += 14 | |
218 | 5: if r5 > r4 goto pc+16 | |
219 | R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp | |
220 | 6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ | |
221 | ||
222 | This 2-byte load from the packet is safe to do, since the program author | |
223 | did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which | |
224 | means that in the fall-through case the register R3 (which points to skb->data) | |
225 | has at least 14 directly accessible bytes. The verifier marks it | |
226 | as R3=pkt(id=0,off=0,r=14). | |
227 | id=0 means that no additional variables were added to the register. | |
228 | off=0 means that no additional constants were added. | |
229 | r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. | |
230 | Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points | |
231 | to the packet data, but constant 14 was added to the register, so | |
232 | it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14) | |
233 | which is zero bytes. | |
234 | ||
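The ``off`` and ``r`` bookkeeping can be sketched in C. The structure and helper names below are hypothetical and only mirror the fields printed in the verifier log; the model covers the fall-through of a ``if (p > pkt_end) goto err`` check and nothing else.

```c
#include <assert.h>
#include <stdbool.h>

struct pkt_ptr {
	int id;		/* shared by pointers with the same variable offset */
	int off;	/* constant added to the pointer */
	int range;	/* bytes known to be safe, counted from off == 0 */
};

/*
 * Fall-through of "if (regs[checked] > pkt_end) goto err": every
 * pointer with the same id learns that regs[checked].off bytes are safe.
 */
void pkt_mark_checked(struct pkt_ptr *regs, int n, int checked)
{
	int id, new_range, i;

	assert(checked >= 0 && checked < n);
	id = regs[checked].id;
	new_range = regs[checked].off;
	for (i = 0; i < n; i++)
		if (regs[i].id == id && regs[i].range < new_range)
			regs[i].range = new_range;
}

/* Access of size bytes at p + insn_off must fit inside the safe range. */
bool pkt_access_ok(const struct pkt_ptr *p, int insn_off, int size)
{
	int start = p->off + insn_off;

	return size > 0 && start >= 0 && start + size <= p->range;
}
```

Modelling R3 as ``{0, 0, 0}`` and R5 as ``{0, 14, 0}`` and marking R5 as checked gives both a range of 14. The 2-byte load at R3 + 12 is then accepted, while any load through R5 is rejected because its accessible range is zero bytes.
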
235 | More complex packet access may look like:: | |
236 | ||
237 | ||
238 | R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp | |
239 | 6: r0 = *(u8 *)(r3 +7) /* load the byte at offset 7 of the packet */ | |
240 | 7: r4 = *(u8 *)(r3 +12) | |
241 | 8: r4 *= 14 | |
242 | 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ | |
243 | 10: r3 += r4 | |
244 | 11: r2 = r1 | |
245 | 12: r2 <<= 48 | |
246 | 13: r2 >>= 48 | |
247 | 14: r3 += r2 | |
248 | 15: r2 = r3 | |
249 | 16: r2 += 8 | |
250 | 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ | |
251 | 18: if r2 > r1 goto pc+2 | |
252 | R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp | |
253 | 19: r1 = *(u8 *)(r3 +4) | |
254 | ||
255 | The state of the register R3 is R3=pkt(id=2,off=0,r=8). | |
256 | id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some | |
257 | offset within a packet and since the program author did | |
258 | ``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8). | |
259 | The verifier only allows 'add'/'sub' operations on packet registers. Any other | |
260 | operation will set the register state to 'SCALAR_VALUE' and it won't be | |
261 | available for direct packet access. | |
262 | ||
263 | Operation ``r3 += rX`` may overflow and become less than original skb->data, | |
264 | therefore the verifier has to prevent that. So when it sees ``r3 += rX`` | |
265 | instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 | |
266 | against skb->data_end will not give us 'range' information, so attempts to read | |
267 | through the pointer will give "invalid access to packet" error. | |
268 | ||
269 | Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is | |
270 | R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits | |
271 | of the register are guaranteed to be zero, and nothing is known about the lower | |
272 | 8 bits. After insn ``r4 *= 14`` the state becomes | |
273 | R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit | |
274 | value by constant 14 will keep upper 52 bits as zero, also the least significant | |
275 | bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make | |
276 | R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign | |
277 | extending. This logic is implemented in adjust_reg_min_max_vals() function, | |
278 | which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice | |
279 | versa) and adjust_scalar_min_max_vals() for operations on two scalars. | |
280 | ||
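The ``umax_value`` figures quoted above follow from simple arithmetic, which the sketch below reproduces. The helper names are hypothetical, and the functions model only the unsigned maximum, not the full state kept by the verifier.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound after multiplying by a constant; unbounded on overflow. */
uint64_t umax_mul_const(uint64_t umax, uint64_t k)
{
	if (k != 0 && umax > UINT64_MAX / k)
		return UINT64_MAX;
	return umax * k;
}

/* Upper bound after a logical (non sign-extending) right shift. */
uint64_t umax_rsh(uint64_t umax, unsigned int shift)
{
	assert(shift < 64);
	return umax >> shift;
}
```

A byte load bounded by 255 and multiplied by 14 gives ``umax_mul_const(255, 14) == 3570``, and shifting a fully unknown 64-bit value right by 48 gives ``umax_rsh(UINT64_MAX, 48) == 65535``.
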
281 | The end result is that bpf program author can access packet directly | |
282 | using normal C code as:: | |
283 | ||
284 | void *data = (void *)(long)skb->data; | |
285 | void *data_end = (void *)(long)skb->data_end; | |
286 | struct eth_hdr *eth = data; | |
287 | struct iphdr *iph = data + sizeof(*eth); | |
288 | struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); | |
289 | ||
290 | if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) | |
291 | return 0; | |
292 | if (eth->h_proto != htons(ETH_P_IP)) | |
293 | return 0; | |
294 | if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) | |
295 | return 0; | |
296 | if (udp->dest == 53 || udp->source == 9) | |
297 | ...; | |
298 | ||
299 | which makes such programs easier to write compared to LD_ABS insn | |
300 | and significantly faster. | |
301 | ||
302 | Pruning | |
303 | ======= | |
304 | ||
305 | The verifier does not actually walk all possible paths through the program. For | |
306 | each new branch to analyse, the verifier looks at all the states it's previously | |
307 | been in when at this instruction. If any of them contain the current state as a | |
308 | subset, the branch is 'pruned' - that is, the fact that the previous state was | |
309 | accepted implies the current state would be as well. For instance, if in the | |
310 | previous state, r1 held a packet-pointer, and in the current state, r1 holds a | |
311 | packet-pointer with a range as long or longer and at least as strict an | |
312 | alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't | |
313 | have been used by any path from that point, so any value in r2 (including | |
314 | another NOT_INIT) is safe. The implementation is in the function regsafe(). | |
315 | Pruning considers not only the registers but also the stack (and any spilled | |
316 | registers it may hold). They must all be safe for the branch to be pruned. | |
317 | This is implemented in states_equal(). | |
318 | ||
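The subset test behind pruning can be illustrated for scalar registers. This is a simplified sketch: ``struct prune_reg``, ``reg_safe_sketch()`` and ``state_safe_sketch()`` are hypothetical names, only unsigned ranges are compared, and pointers, stack slots and tnums are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum prune_type { PR_NOT_INIT = 0, PR_SCALAR };

struct prune_reg {
	enum prune_type type;
	uint64_t umin, umax;
};

/* Is the current register covered by the previously accepted one? */
bool reg_safe_sketch(const struct prune_reg *old, const struct prune_reg *cur)
{
	/* Old state never relied on this register: anything is safe. */
	if (old->type == PR_NOT_INIT)
		return true;
	if (cur->type == PR_NOT_INIT)
		return false;
	/* The current range must lie within the old range. */
	return old->umin <= cur->umin && old->umax >= cur->umax;
}

/* All registers must be safe for the branch to be pruned. */
bool state_safe_sketch(const struct prune_reg *old,
		       const struct prune_reg *cur, int n)
{
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		if (!reg_safe_sketch(&old[i], &cur[i]))
			return false;
	return true;
}
```

A current scalar in [2, 5] is covered by an old scalar in [0, 10], so the branch may be pruned; the reverse is not true, because the old state never proved that values up to 10 are safe to start from a narrower assumption.
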
319 | Some technical details about the state pruning implementation can be found below. | |
320 | ||
321 | Register liveness tracking | |
322 | -------------------------- | |
323 | ||
324 | In order to make state pruning effective, liveness state is tracked for each | |
325 | register and stack slot. The basic idea is to track which registers and stack | |
326 | slots are actually used during subsequent execution of the program, until | |
327 | program exit is reached. Registers and stack slots that were never used could be | |
328 | removed from the cached state thus making more states equivalent to a cached | |
329 | state. This could be illustrated by the following program:: | |
330 | ||
331 | 0: call bpf_get_prandom_u32() | |
332 | 1: r1 = 0 | |
333 | 2: if r0 == 0 goto +1 | |
334 | 3: r0 = 1 | |
335 | --- checkpoint --- | |
336 | 4: r0 = r1 | |
337 | 5: exit | |
338 | ||
339 | Suppose that a state cache entry is created at instruction #4 (such entries are | |
340 | also called "checkpoints" in the text below). The verifier could reach the | |
341 | instruction with one of two possible register states: | |
342 | ||
343 | * r0 = 1, r1 = 0 | |
344 | * r0 = 0, r1 = 0 | |
345 | ||
346 | However, only the value of register ``r1`` is important to successfully finish | |
347 | verification. The goal of the liveness tracking algorithm is to spot this fact | |
348 | and figure out that both states are actually equivalent. | |
349 | ||
350 | Data structures | |
351 | ~~~~~~~~~~~~~~~ | |
352 | ||
353 | Liveness is tracked using the following data structures:: | |
354 | ||
355 | enum bpf_reg_liveness { | |
356 | REG_LIVE_NONE = 0, | |
357 | REG_LIVE_READ32 = 0x1, | |
358 | REG_LIVE_READ64 = 0x2, | |
359 | REG_LIVE_READ = REG_LIVE_READ32 | REG_LIVE_READ64, | |
360 | REG_LIVE_WRITTEN = 0x4, | |
361 | REG_LIVE_DONE = 0x8, | |
362 | }; | |
363 | ||
364 | struct bpf_reg_state { | |
365 | ... | |
366 | struct bpf_reg_state *parent; | |
367 | ... | |
368 | enum bpf_reg_liveness live; | |
369 | ... | |
370 | }; | |
371 | ||
372 | struct bpf_stack_state { | |
373 | struct bpf_reg_state spilled_ptr; | |
374 | ... | |
375 | }; | |
376 | ||
377 | struct bpf_func_state { | |
378 | struct bpf_reg_state regs[MAX_BPF_REG]; | |
379 | ... | |
380 | struct bpf_stack_state *stack; | |
381 | }; | |
382 | ||
383 | struct bpf_verifier_state { | |
384 | struct bpf_func_state *frame[MAX_CALL_FRAMES]; | |
385 | struct bpf_verifier_state *parent; | |
386 | ... | |
387 | }; | |
388 | ||
389 | * ``REG_LIVE_NONE`` is an initial value assigned to ``->live`` fields upon new | |
390 | verifier state creation; | |
391 | ||
392 | * ``REG_LIVE_WRITTEN`` means that the value of the register (or stack slot) is | |
393 | defined by some instruction verified between this verifier state's parent and | |
394 | verifier state itself; | |
395 | ||
396 | * ``REG_LIVE_READ{32,64}`` means that the value of the register (or stack slot) | |
397 | is read by some child state of this verifier state; | |
398 | ||
399 | * ``REG_LIVE_DONE`` is a marker used by ``clean_verifier_state()`` to avoid | |
400 | processing same verifier state multiple times and for some sanity checks; | |
401 | ||
402 | * ``->live`` field values are formed by combining ``enum bpf_reg_liveness`` | |
403 | values using bitwise or. | |
404 | ||
405 | Register parentage chains | |
406 | ~~~~~~~~~~~~~~~~~~~~~~~~~ | |
407 | ||
408 | In order to propagate information between parent and child states, a *register | |
409 | parentage chain* is established. Each register or stack slot is linked to a | |
410 | corresponding register or stack slot in its parent state via a ``->parent`` | |
411 | pointer. This link is established upon state creation in ``is_state_visited()`` | |
412 | and might be modified by ``set_callee_state()`` called from | |
413 | ``__check_func_call()``. | |
414 | ||
415 | The rules for correspondence between registers / stack slots are as follows: | |
416 | ||
417 | * For the current stack frame, registers and stack slots of the new state are | |
418 | linked to the registers and stack slots of the parent state with the same | |
419 | indices. | |
420 | ||
421 | * For the outer stack frames, only callee saved registers (r6-r9) and stack | |
422 | slots are linked to the registers and stack slots of the parent state with the | |
423 | same indices. | |
424 | ||
425 | * When a function call is processed, a new ``struct bpf_func_state`` instance is | |
426 | allocated; it encapsulates a new set of registers and stack slots. For this | |
427 | new frame, parent links for r6-r9 and stack slots are set to nil, parent links | |
428 | for r1-r5 are set to match caller r1-r5 parent links. | |
429 | ||
430 | This could be illustrated by the following diagram (arrows stand for | |
431 | ``->parent`` pointers):: | |
432 | ||
433 | ... ; Frame #0, some instructions | |
434 | --- checkpoint #0 --- | |
435 | 1 : r6 = 42 ; Frame #0 | |
436 | --- checkpoint #1 --- | |
437 | 2 : call foo() ; Frame #0 | |
438 | ... ; Frame #1, instructions from foo() | |
439 | --- checkpoint #2 --- | |
440 | ... ; Frame #1, instructions from foo() | |
441 | --- checkpoint #3 --- | |
442 | exit ; Frame #1, return from foo() | |
443 | 3 : r1 = r6 ; Frame #0 <- current state | |
444 | ||
445 | +-------------------------------+-------------------------------+ | |
446 | | Frame #0 | Frame #1 | | |
447 | Checkpoint +-------------------------------+-------------------------------+ | |
448 | #0 | r0 | r1-r5 | r6-r9 | fp-8 ... | | |
449 | +-------------------------------+ | |
450 | ^ ^ ^ ^ | |
451 | | | | | | |
452 | Checkpoint +-------------------------------+ | |
453 | #1 | r0 | r1-r5 | r6-r9 | fp-8 ... | | |
454 | +-------------------------------+ | |
455 | ^ ^ ^ | |
456 | |_______|_______|_______________ | |
457 | | | | | |
458 | nil nil | | | nil nil | |
459 | | | | | | | | | |
460 | Checkpoint +-------------------------------+-------------------------------+ | |
461 | #2 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... | | |
462 | +-------------------------------+-------------------------------+ | |
463 | ^ ^ ^ ^ ^ | |
464 | nil nil | | | | | | |
465 | | | | | | | | | |
466 | Checkpoint +-------------------------------+-------------------------------+ | |
467 | #3 | r0 | r1-r5 | r6-r9 | fp-8 ... | r0 | r1-r5 | r6-r9 | fp-8 ... | | |
468 | +-------------------------------+-------------------------------+ | |
469 | ^ ^ | |
470 | nil nil | | | |
471 | | | | | | |
472 | Current +-------------------------------+ | |
473 | state | r0 | r1-r5 | r6-r9 | fp-8 ... | | |
474 | +-------------------------------+ | |
475 | \ | |
476 | r6 read mark is propagated via these links | |
477 | all the way up to checkpoint #1. | |
478 | The checkpoint #1 contains a write mark for r6 | |
479 | because of instruction (1), thus read propagation | |
480 | does not reach checkpoint #0 (see section below). | |
481 | ||
482 | Liveness marks tracking | |
483 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
484 | ||
485 | For each processed instruction, the verifier tracks read and written registers | |
486 | and stack slots. The main idea of the algorithm is that read marks propagate | |
487 | back along the state parentage chain until they hit a write mark, which 'screens | |
488 | off' earlier states from the read. The information about reads is propagated by | |
489 | function ``mark_reg_read()`` which could be summarized as follows:: | |
490 | ||
491 | mark_reg_read(struct bpf_reg_state *state, ...): | |
492 | parent = state->parent | |
493 | while parent: | |
494 | if state->live & REG_LIVE_WRITTEN: | |
495 | break | |
496 | if parent->live & REG_LIVE_READ64: | |
497 | break | |
498 | parent->live |= REG_LIVE_READ64 | |
499 | state = parent | |
500 | parent = state->parent | |
501 | ||
502 | Notes: | |
503 | ||
504 | * The read marks are applied to the **parent** state while write marks are | |
505 | applied to the **current** state. The write mark on a register or stack slot | |
506 | means that it is updated by some instruction in the straight-line code leading | |
507 | from the parent state to the current state. | |
508 | ||
509 | * Details about REG_LIVE_READ32 are omitted. | |
510 | ||
511 | * Function ``propagate_liveness()`` (see section :ref:`read_marks_for_cache_hits`) | |
512 | might override the first parent link. Please refer to the comments in the | |
513 | ``propagate_liveness()`` and ``mark_reg_read()`` source code for further | |
514 | details. | |
515 | ||
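The pseudocode for ``mark_reg_read()`` above translates almost directly into C. The version below is a sketch with hypothetical names (``struct reg_node``, ``mark_reg_read_sketch()``), tracks a single register, and ignores ``REG_LIVE_READ32`` just as the pseudocode does.

```c
#include <assert.h>
#include <stddef.h>

enum live_sketch {
	LIVE_NONE = 0,
	LIVE_READ64 = 0x2,
	LIVE_WRITTEN = 0x4,
};

/* One register as seen in one verifier state. */
struct reg_node {
	struct reg_node *parent;	/* same register in the parent state */
	unsigned int live;		/* bitwise OR of enum live_sketch */
};

/* Propagate a read up the parentage chain until a write screens it off. */
void mark_reg_read_sketch(struct reg_node *state)
{
	struct reg_node *parent;

	assert(state != NULL);
	parent = state->parent;
	while (parent) {
		if (state->live & LIVE_WRITTEN)
			break;		/* defined here, earlier states don't matter */
		if (parent->live & LIVE_READ64)
			break;		/* already propagated */
		parent->live |= LIVE_READ64;
		state = parent;
		parent = state->parent;
	}
}
```

Building the r6 chain from the diagram in the previous section (checkpoints #0 to #3 plus the current state, with a write mark on checkpoint #1) and calling the function on the current state marks checkpoints #3, #2 and #1 as read and leaves checkpoint #0 untouched.
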
516 | Because stack writes could have different sizes ``REG_LIVE_WRITTEN`` marks are | |
517 | applied conservatively: stack slots are marked as written only if write size | |
518 | corresponds to the size of the register, e.g. see function ``save_register_state()``. | |
519 | ||
520 | Consider the following example:: | |
521 | ||
522 | 0: (*u64)(r10 - 8) = 0 ; define 8 bytes of fp-8 | |
523 | --- checkpoint #0 --- | |
524 | 1: (*u32)(r10 - 8) = 1 ; redefine lower 4 bytes | |
525 | 2: r1 = (*u32)(r10 - 8) ; read lower 4 bytes defined at (1) | |
526 | 3: r2 = (*u32)(r10 - 4) ; read upper 4 bytes defined at (0) | |
527 | ||
528 | As stated above, the write at (1) does not count as ``REG_LIVE_WRITTEN``. Should | |
529 | it be otherwise, the algorithm above wouldn't be able to propagate the read mark | |
530 | from (3) to checkpoint #0. | |
531 | ||
532 | Once the ``BPF_EXIT`` instruction is reached ``update_branch_counts()`` is | |
533 | called to update the ``->branches`` counter for each verifier state in a chain | |
534 | of parent verifier states. When the ``->branches`` counter reaches zero the | |
535 | verifier state becomes a valid entry in a set of cached verifier states. | |
536 | ||
537 | Each entry of the verifier states cache is post-processed by a function | |
538 | ``clean_live_states()``. This function marks all registers and stack slots | |
539 | without ``REG_LIVE_READ{32,64}`` marks as ``NOT_INIT`` or ``STACK_INVALID``. | |
540 | Registers/stack slots marked in this way are ignored in function ``stacksafe()`` | |
541 | called from ``states_equal()`` when a state cache entry is considered for | |
542 | equivalence with a current state. | |
543 | ||
544 | Now it is possible to explain how the example from the beginning of the section | |
545 | works:: | |
546 | ||
547 | 0: call bpf_get_prandom_u32() | |
548 | 1: r1 = 0 | |
549 | 2: if r0 == 0 goto +1 | |
550 | 3: r0 = 1 | |
551 | --- checkpoint[0] --- | |
552 | 4: r0 = r1 | |
553 | 5: exit | |
554 | ||
555 | * At instruction #2 a branching point is reached and state ``{ r0 == 0, r1 == 0, pc == 4 }`` | |
556 | is pushed to states processing queue (pc stands for program counter). | |
557 | ||
558 | * At instruction #4: | |
559 | ||
560 | * ``checkpoint[0]`` states cache entry is created: ``{ r0 == 1, r1 == 0, pc == 4 }``; | |
561 | * ``checkpoint[0].r0`` is marked as written; | |
562 | * ``checkpoint[0].r1`` is marked as read; | |
563 | ||
564 | * At instruction #5 exit is reached and ``checkpoint[0]`` can now be processed | |
565 | by ``clean_live_states()``. After this processing ``checkpoint[0].r1`` has a | |
566 | read mark and all other registers and stack slots are marked as ``NOT_INIT`` | |
567 | or ``STACK_INVALID`` | |
568 | ||
569 | * The state ``{ r0 == 0, r1 == 0, pc == 4 }`` is popped from the states queue | |
570 | and is compared against a cached state ``{ r1 == 0, pc == 4 }``, the states | |
571 | are considered equivalent. | |
572 | ||
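The last two steps of the walkthrough can be modelled with constants only. The following is a sketch with hypothetical names (``struct cache_reg``, ``clean_cached_sketch()``, ``cached_covers_sketch()``); it compares exact values, whereas the verifier compares full register states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct cache_reg {
	bool read_mark;	/* register was read by some child state */
	bool init;	/* false models NOT_INIT */
	int64_t value;	/* known constant, for this sketch only */
};

/* Drop every register of a cached state that was never read. */
void clean_cached_sketch(struct cache_reg *regs, int n)
{
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++)
		if (!regs[i].read_mark)
			regs[i].init = false;
}

/* The current state matches if it agrees on all registers still tracked. */
bool cached_covers_sketch(const struct cache_reg *cached,
			  const struct cache_reg *cur, int n)
{
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		if (!cached[i].init)
			continue;	/* ignored, as NOT_INIT is */
		if (!cur[i].init || cur[i].value != cached[i].value)
			return false;
	}
	return true;
}
```

With a cached state of r0 == 1 (no read mark) and r1 == 0 (read mark), the state r0 == 0, r1 == 0 does not match before cleaning and does match afterwards, since only ``r1`` is still tracked.
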
573 | .. _read_marks_for_cache_hits: | |
574 | ||
575 | Read marks propagation for cache hits | |
576 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
577 | ||
578 | Another point is the handling of read marks when a previously verified state is | |
579 | found in the states cache. Upon a cache hit the verifier must behave in the same way | |
580 | as if the current state was verified to the program exit. This means that all | |
581 | read marks, present on registers and stack slots of the cached state, must be | |
582 | propagated over the parentage chain of the current state. Example below shows | |
583 | why this is important. Function ``propagate_liveness()`` handles this case. | |
584 | ||
585 | Consider the following state parentage chain (S is a starting state, A-E are | |
586 | derived states, -> arrows show which state is derived from which):: | |
587 | ||
588 | r1 read | |
589 | <------------- A[r1] == 0 | |
590 | C[r1] == 0 | |
591 | S ---> A ---> B ---> exit E[r1] == 1 | |
592 | | | |
593 | ` ---> C ---> D | |
594 | | | |
595 | ` ---> E ^ | |
596 | |___ suppose all these | |
597 | ^ states are at insn #Y | |
598 | | | |
599 | suppose all these | |
600 | states are at insn #X | |
601 | ||
602 | * Chain of states ``S -> A -> B -> exit`` is verified first. | |
603 | ||
604 | * While ``B -> exit`` is verified, register ``r1`` is read and this read mark is | |
605 | propagated up to state ``A``. | |
606 | ||
607 | * When chain of states ``C -> D`` is verified the state ``D`` turns out to be | |
608 | equivalent to state ``B``. | |
609 | ||
610 | * The read mark for ``r1`` has to be propagated to state ``C``, otherwise state | |
611 | ``C`` might get mistakenly marked as equivalent to state ``E`` even though | |
612 | values for register ``r1`` differ between ``C`` and ``E``. | |
613 | ||
614 | Understanding eBPF verifier messages | |
615 | ==================================== | |
616 | ||
617 | The following are a few examples of invalid eBPF programs and verifier error | |
618 | messages as seen in the log: | |
619 | ||
620 | Program with unreachable instructions:: | |
621 | ||
622 | static struct bpf_insn prog[] = { | |
623 | BPF_EXIT_INSN(), | |
624 | BPF_EXIT_INSN(), | |
625 | }; | |
626 | ||
627 | Error:: | |
628 | ||
629 | unreachable insn 1 | |
630 | ||
631 | Program that reads uninitialized register:: | |
632 | ||
633 | BPF_MOV64_REG(BPF_REG_0, BPF_REG_2), | |
634 | BPF_EXIT_INSN(), | |
635 | ||
636 | Error:: | |
637 | ||
638 | 0: (bf) r0 = r2 | |
639 | R2 !read_ok | |
640 | ||
641 | Program that doesn't initialize R0 before exiting:: | |
642 | ||
643 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_1), | |
644 | BPF_EXIT_INSN(), | |
645 | ||
646 | Error:: | |
647 | ||
648 | 0: (bf) r2 = r1 | |
649 | 1: (95) exit | |
650 | R0 !read_ok | |
651 | ||
652 | Program that accesses stack out of bounds:: | |
653 | ||
654 | BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0), | |
655 | BPF_EXIT_INSN(), | |
656 | ||
657 | Error:: | |
658 | ||
659 | 0: (7a) *(u64 *)(r10 +8) = 0 | |
660 | invalid stack off=8 size=8 | |
661 | ||
662 | Program that doesn't initialize stack before passing its address into function:: | |
663 | ||
664 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), | |
665 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), | |
666 | BPF_LD_MAP_FD(BPF_REG_1, 0), | |
667 | BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), | |
668 | BPF_EXIT_INSN(), | |
669 | ||
670 | Error:: | |
671 | ||
672 | 0: (bf) r2 = r10 | |
673 | 1: (07) r2 += -8 | |
674 | 2: (b7) r1 = 0x0 | |
675 | 3: (85) call 1 | |
676 | invalid indirect read from stack off -8+0 size 8 | |
677 | ||
678 | Program that uses invalid map_fd=0 while calling the map_lookup_elem() function:: | |
679 | ||
680 | BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), | |
681 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), | |
682 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), | |
683 | BPF_LD_MAP_FD(BPF_REG_1, 0), | |
684 | BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), | |
685 | BPF_EXIT_INSN(), | |
686 | ||
687 | Error:: | |
688 | ||
689 | 0: (7a) *(u64 *)(r10 -8) = 0 | |
690 | 1: (bf) r2 = r10 | |
691 | 2: (07) r2 += -8 | |
692 | 3: (b7) r1 = 0x0 | |
693 | 4: (85) call 1 | |
694 | fd 0 is not pointing to valid bpf_map | |
695 | ||
696 | Program that doesn't check return value of map_lookup_elem() before accessing | |
697 | map element:: | |
698 | ||
699 | BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), | |
700 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), | |
701 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), | |
702 | BPF_LD_MAP_FD(BPF_REG_1, 0), | |
703 | BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), | |
704 | BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), | |
705 | BPF_EXIT_INSN(), | |
706 | ||
707 | Error:: | |
708 | ||
709 | 0: (7a) *(u64 *)(r10 -8) = 0 | |
710 | 1: (bf) r2 = r10 | |
711 | 2: (07) r2 += -8 | |
712 | 3: (b7) r1 = 0x0 | |
713 | 4: (85) call 1 | |
714 | 5: (7a) *(u64 *)(r0 +0) = 0 | |
715 | R0 invalid mem access 'map_value_or_null' | |
716 | ||
717 | Program that correctly checks map_lookup_elem() returned value for NULL, but | |
718 | accesses the memory with incorrect alignment:: | |
719 | ||
720 | BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), | |
721 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), | |
722 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), | |
723 | BPF_LD_MAP_FD(BPF_REG_1, 0), | |
724 | BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), | |
725 | BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1), | |
726 | BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0), | |
727 | BPF_EXIT_INSN(), | |
728 | ||
729 | Error:: | |
730 | ||
731 | 0: (7a) *(u64 *)(r10 -8) = 0 | |
732 | 1: (bf) r2 = r10 | |
733 | 2: (07) r2 += -8 | |
734 | 3: (b7) r1 = 1 | |
735 | 4: (85) call 1 | |
736 | 5: (15) if r0 == 0x0 goto pc+1 | |
737 | R0=map_ptr R10=fp | |
738 | 6: (7a) *(u64 *)(r0 +4) = 0 | |
739 | misaligned access off 4 size 8 | |
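
The message reports an 8-byte store at offset 4. A natural-alignment test along these lines would flag it. This is a simplified sketch with a hypothetical helper name, and it ignores any base-pointer alignment the real verifier may also take into account.

```c
#include <assert.h>
#include <stdbool.h>

/* Return true if an access of `size` bytes at offset `off` is naturally aligned. */
static bool access_is_aligned(int off, int size)
{
    assert(size == 1 || size == 2 || size == 4 || size == 8);
    return off % size == 0;
}
```

Here ``access_is_aligned(4, 8)`` is false, while the same offset with a 4-byte access, ``access_is_aligned(4, 4)``, is true.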
740 | ||
741 | Program that correctly checks map_lookup_elem() returned value for NULL and | |
742 | accesses memory with correct alignment on one side of the 'if' branch, but fails | |
743 | to do so on the other side of the 'if' branch:: | |
744 | ||
745 | BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0), | |
746 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), | |
747 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), | |
748 | BPF_LD_MAP_FD(BPF_REG_1, 0), | |
749 | BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem), | |
750 | BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), | |
751 | BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0), | |
752 | BPF_EXIT_INSN(), | |
753 | BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1), | |
754 | BPF_EXIT_INSN(), | |
755 | ||
756 | Error:: | |
757 | ||
758 | 0: (7a) *(u64 *)(r10 -8) = 0 | |
759 | 1: (bf) r2 = r10 | |
760 | 2: (07) r2 += -8 | |
761 | 3: (b7) r1 = 1 | |
762 | 4: (85) call 1 | |
763 | 5: (15) if r0 == 0x0 goto pc+2 | |
764 | R0=map_ptr R10=fp | |
765 | 6: (7a) *(u64 *)(r0 +0) = 0 | |
766 | 7: (95) exit | |
767 | ||
768 | from 5 to 8: R0=imm0 R10=fp | |
769 | 8: (7a) *(u64 *)(r0 +0) = 1 | |
770 | R0 invalid mem access 'imm' | |
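
The log shows the type of ``R0`` changing per branch: ``map_ptr`` on the fall-through path and ``imm0`` on the jump target. The sketch below models that refinement for a ``reg == 0`` test. The enum and function names are hypothetical, introduced only for this illustration, and cover just the three states seen in the logs above.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified register types, mirroring the labels seen in the logs. */
enum reg_type {
    REG_IMM,               /* known scalar, e.g. the constant 0 */
    REG_MAP_VALUE,         /* checked, non-NULL pointer to a map element */
    REG_MAP_VALUE_OR_NULL  /* unchecked result of map_lookup_elem() */
};

/*
 * Type of the register on each side of `if reg == 0 goto ...`.
 * branch_taken is true on the path where the comparison succeeded.
 */
static enum reg_type after_null_check(enum reg_type t, bool branch_taken)
{
    assert(t == REG_IMM || t == REG_MAP_VALUE || t == REG_MAP_VALUE_OR_NULL);
    if (t != REG_MAP_VALUE_OR_NULL)
        return t;
    return branch_taken ? REG_IMM : REG_MAP_VALUE;
}

/* Only a checked map value pointer may be dereferenced. */
static bool can_deref(enum reg_type t)
{
    return t == REG_MAP_VALUE;
}
```

In this model the store at instruction 6 is allowed because the fall-through path yields ``REG_MAP_VALUE``, while the store at instruction 8 is rejected because the taken path yields ``REG_IMM``.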
771 | ||
772 | Program that performs a socket lookup then sets the pointer to NULL without | |
773 | checking it, so the acquired reference is never released:: | |
774 | ||
775 | BPF_MOV64_IMM(BPF_REG_2, 0), | |
776 | BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), | |
777 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), | |
778 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), | |
779 | BPF_MOV64_IMM(BPF_REG_3, 4), | |
780 | BPF_MOV64_IMM(BPF_REG_4, 0), | |
781 | BPF_MOV64_IMM(BPF_REG_5, 0), | |
782 | BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), | |
783 | BPF_MOV64_IMM(BPF_REG_0, 0), | |
784 | BPF_EXIT_INSN(), | |
785 | ||
786 | Error:: | |
787 | ||
788 | 0: (b7) r2 = 0 | |
789 | 1: (63) *(u32 *)(r10 -8) = r2 | |
790 | 2: (bf) r2 = r10 | |
791 | 3: (07) r2 += -8 | |
792 | 4: (b7) r3 = 4 | |
793 | 5: (b7) r4 = 0 | |
794 | 6: (b7) r5 = 0 | |
795 | 7: (85) call bpf_sk_lookup_tcp#65 | |
796 | 8: (b7) r0 = 0 | |
797 | 9: (95) exit | |
798 | Unreleased reference id=1, alloc_insn=7 | |
799 | ||
800 | Program that performs a socket lookup but does not NULL-check the returned | |
801 | value:: | |
802 | ||
803 | BPF_MOV64_IMM(BPF_REG_2, 0), | |
804 | BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8), | |
805 | BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), | |
806 | BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8), | |
807 | BPF_MOV64_IMM(BPF_REG_3, 4), | |
808 | BPF_MOV64_IMM(BPF_REG_4, 0), | |
809 | BPF_MOV64_IMM(BPF_REG_5, 0), | |
810 | BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp), | |
811 | BPF_EXIT_INSN(), | |
812 | ||
813 | Error:: | |
814 | ||
815 | 0: (b7) r2 = 0 | |
816 | 1: (63) *(u32 *)(r10 -8) = r2 | |
817 | 2: (bf) r2 = r10 | |
818 | 3: (07) r2 += -8 | |
819 | 4: (b7) r3 = 4 | |
820 | 5: (b7) r4 = 0 | |
821 | 6: (b7) r5 = 0 | |
822 | 7: (85) call bpf_sk_lookup_tcp#65 | |
823 | 8: (95) exit | |
824 | Unreleased reference id=1, alloc_insn=7 |
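
Both socket lookup programs fail for the same reason: a reference acquired at instruction 7 is still outstanding when ``exit`` is reached. The following minimal sketch shows that kind of accounting, under the assumption that each acquiring call is given a fresh id that must be released before exit. ``struct ref_state`` and its functions are hypothetical names for illustration, not the verifier's implementation.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_REFS 8 /* arbitrary limit for this sketch */

struct ref_state {
    int ids[MAX_REFS]; /* ids of references acquired but not yet released */
    int count;         /* number of outstanding references */
    int next_id;       /* last id handed out */
};

/* Record a new reference; returns its id, or -1 if the table is full. */
static int acquire_ref(struct ref_state *s)
{
    if (s->count == MAX_REFS)
        return -1;
    s->ids[s->count++] = ++s->next_id;
    return s->next_id;
}

/* Drop a reference; returns false if the id is not outstanding. */
static bool release_ref(struct ref_state *s, int id)
{
    assert(id > 0);
    for (int i = 0; i < s->count; i++) {
        if (s->ids[i] == id) {
            s->ids[i] = s->ids[--s->count]; /* fill the gap with the last entry */
            return true;
        }
    }
    return false;
}

/* Exit is only acceptable when nothing is left outstanding. */
static bool exit_allowed(const struct ref_state *s)
{
    return s->count == 0;
}
```

In this model a program that acquires a reference and reaches ``exit`` without a matching release leaves ``exit_allowed()`` false, which corresponds to the ``Unreleased reference`` message.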