.. SPDX-License-Identifier: GPL-2.0

.. _kfuncs-header-label:

=============================
BPF Kernel Functions (kfuncs)
=============================

1. Introduction
===============

BPF Kernel Functions, more commonly known as kfuncs, are functions in the Linux
kernel which are exposed for use by BPF programs. Unlike normal BPF helpers,
kfuncs do not have a stable interface and can change from one kernel release to
another. Hence, BPF programs need to be updated in response to changes in the
kernel.

2. Defining a kfunc
===================

There are two ways to expose a kernel function to BPF programs: either make an
existing function in the kernel visible, or add a new wrapper for BPF. In both
cases, care must be taken that BPF programs can only call such functions in a
valid context. To enforce this, the visibility of a kfunc can be per program
type.

If you are not creating a BPF wrapper for an existing kernel function, skip
ahead to :ref:`BPF_kfunc_nodef`.

2.1 Creating a wrapper kfunc
----------------------------

When defining a wrapper kfunc, the wrapper function should have extern linkage.
This prevents the compiler from optimizing away dead code, as this wrapper kfunc
is not invoked anywhere in the kernel itself. It is not necessary to provide a
prototype in a header for the wrapper kfunc.

An example is given below::

        /* Disables missing prototype warnings */
        __diag_push();
        __diag_ignore_all("-Wmissing-prototypes",
                          "Global kfuncs as their definitions will be in BTF");

        __bpf_kfunc struct task_struct *bpf_find_get_task_by_vpid(pid_t nr)
        {
                return find_get_task_by_vpid(nr);
        }

        __diag_pop();

A wrapper kfunc is often needed when we need to annotate parameters of the
kfunc. Otherwise one may directly make the kfunc visible to the BPF program by
registering it with the BPF subsystem. See :ref:`BPF_kfunc_nodef`.

2.2 Annotating kfunc parameters
-------------------------------

Similar to BPF helpers, there is sometimes a need for additional context
required by the verifier to make the usage of kernel functions safer and more
useful. Hence, we can annotate a parameter by suffixing the name of the
argument of the kfunc with a __tag, where tag may be one of the supported
annotations.

2.2.1 __sz Annotation
---------------------

This annotation is used to indicate a memory and size pair in the argument list.
An example is given below::

        __bpf_kfunc void bpf_memzero(void *mem, int mem__sz)
        {
        ...
        }

Here, the verifier will treat the first argument as a PTR_TO_MEM, and the
second argument as its size. By default, without the __sz annotation, the size
of the type of the pointer is used. Without the __sz annotation, a kfunc cannot
accept a void pointer.

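From the BPF program side, the verifier then requires that the size argument
bound the memory passed in the paired pointer argument. A minimal sketch of a
caller, assuming the illustrative bpf_memzero() kfunc above has been registered
for tracing programs:

.. code-block:: c

        SEC("tp_btf/task_newtask")
        int BPF_PROG(memzero_example, struct task_struct *task, u64 clone_flags)
        {
                char buf[16];

                /* The verifier checks that buf is readable/writable for at
                 * least the number of bytes passed as mem__sz.
                 */
                bpf_memzero(buf, sizeof(buf));
                return 0;
        }
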
2.2.2 __k Annotation
--------------------

This annotation is only understood for scalar arguments. It indicates that the
verifier must check that the scalar argument is a known constant which is not a
size parameter, and whose value is relevant to the safety of the program.

An example is given below::

        __bpf_kfunc void *bpf_obj_new(u32 local_type_id__k, ...)
        {
        ...
        }

Here, bpf_obj_new uses the local_type_id argument to look up the size of that
type ID in the program's BTF and returns a sized pointer to it. Each type ID
will have a distinct size, hence it is crucial to treat each such call as
distinct when values don't match during verifier state pruning checks.

Hence, whenever a constant scalar argument is accepted by a kfunc which is not
a size parameter, and the value of the constant matters for program safety, the
__k suffix should be used.

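From the calling side, this means the argument must be something the verifier
can prove to be a constant at verification time. A hedged sketch against the
bpf_obj_new() definition above (trailing arguments elided;
``bpf_core_type_id_local()`` is the libbpf helper that yields a constant local
BTF type ID):

.. code-block:: c

        struct foo {
                int data;
        };

        /* Accepted: the type ID is a known constant during verification. */
        struct foo *f = bpf_obj_new(bpf_core_type_id_local(struct foo));

        /* By contrast, a type ID read from a map value or computed at
         * runtime would be rejected, since the verifier could not
         * determine the size of the object being allocated.
         */
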
.. _BPF_kfunc_nodef:

2.3 Using an existing kernel function
-------------------------------------

When an existing function in the kernel is fit for consumption by BPF programs,
it can be directly registered with the BPF subsystem. However, care must still
be taken to review the context in which it will be invoked by the BPF program
and whether it is safe to do so.

2.4 Annotating kfuncs
---------------------

In addition to kfuncs' arguments, the verifier may need more information about
the type of kfunc(s) being registered with the BPF subsystem. To do so, we
define flags on a set of kfuncs as follows::

        BTF_SET8_START(bpf_task_set)
        BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
        BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
        BTF_SET8_END(bpf_task_set)

This set encodes the BTF ID of each kfunc listed above, and encodes the flags
along with it. Of course, it is also allowed to specify no flags.

kfunc definitions should also always be annotated with the ``__bpf_kfunc``
macro. This prevents issues such as the compiler inlining the kfunc if it's a
static kernel function, or the function being elided in an LTO build as it's
not used in the rest of the kernel. Developers should not manually add
annotations to their kfunc to prevent these issues. If an annotation is
required to prevent such an issue with your kfunc, it is a bug and should be
added to the definition of the macro so that other kfuncs are similarly
protected. An example is given below::

        __bpf_kfunc struct task_struct *bpf_get_task_pid(s32 pid)
        {
        ...
        }

2.4.1 KF_ACQUIRE flag
---------------------

The KF_ACQUIRE flag is used to indicate that the kfunc returns a pointer to a
refcounted object. The verifier will then ensure that the pointer to the object
is eventually released using a release kfunc, or transferred to a map using a
referenced kptr (by invoking bpf_kptr_xchg). If not, the verifier fails the
loading of the BPF program until no lingering references remain in all possible
explored states of the program.

2.4.2 KF_RET_NULL flag
----------------------

The KF_RET_NULL flag is used to indicate that the pointer returned by the kfunc
may be NULL. Hence, it forces the user to do a NULL check on the pointer
returned from the kfunc before making use of it (dereferencing or passing it to
another helper). This flag is often paired with the KF_ACQUIRE flag, but the
two are orthogonal to each other.

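For example, a program calling an acquire kfunc registered with KF_RET_NULL,
such as bpf_task_from_pid() (covered later in this document), must check the
return value before using it; a sketch:

.. code-block:: c

        struct task_struct *task;

        task = bpf_task_from_pid(pid);
        if (!task)
                /* Dereferencing task without this NULL check would be
                 * rejected by the verifier.
                 */
                return -ENOENT;

        bpf_task_release(task);
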
2.4.3 KF_RELEASE flag
---------------------

The KF_RELEASE flag is used to indicate that the kfunc releases the pointer
passed in to it. Only one referenced pointer can be passed in. All copies of
the pointer being released are invalidated as a result of invoking a kfunc
with this flag.

2.4.4 KF_KPTR_GET flag
----------------------

The KF_KPTR_GET flag is used to indicate that the kfunc takes its first
argument as a pointer to a kptr, safely increments the refcount of the object
it points to, and returns a reference to the user. The rest of the arguments
may be normal arguments of a kfunc. The KF_KPTR_GET flag should be used in
conjunction with the KF_ACQUIRE and KF_RET_NULL flags.

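For example, a kptr-get kfunc such as bpf_cgroup_kptr_get() (described later in
this document) would typically be registered with all three flags combined:

.. code-block:: c

        BTF_ID_FLAGS(func, bpf_cgroup_kptr_get, KF_ACQUIRE | KF_KPTR_GET | KF_RET_NULL)

KF_RET_NULL is needed because the kptr may have been removed from the map by
another CPU between the lookup and the get, in which case the kfunc returns
NULL.
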
2.4.5 KF_TRUSTED_ARGS flag
--------------------------

The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
indicates that all pointer arguments are valid, and that all pointers to
BTF objects have been passed in their unmodified form (that is, at a zero
offset, and without having been obtained from walking another pointer, with one
exception described below).

There are two types of pointers to kernel objects which are considered "valid":

1. Pointers which are passed as tracepoint or struct_ops callback arguments.
2. Pointers which were returned from a KF_ACQUIRE or KF_KPTR_GET kfunc.

Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to
KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.

The definition of "valid" pointers is subject to change at any time, and has
absolutely no ABI stability guarantees.

As mentioned above, a nested pointer obtained from walking a trusted pointer is
no longer trusted, with one exception. If a struct type has a field that is
guaranteed to be valid as long as its parent pointer is trusted, the
``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as
follows:

.. code-block:: c

        BTF_TYPE_SAFE_NESTED(struct task_struct) {
                const cpumask_t *cpus_ptr;
        };

In other words, you must:

1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro.

2. Specify the type and name of the trusted nested field. This field must match
   the field in the original type definition exactly.

2.4.6 KF_SLEEPABLE flag
-----------------------

The KF_SLEEPABLE flag is used for kfuncs that may sleep. Such kfuncs can only
be called by sleepable BPF programs (BPF_F_SLEEPABLE).

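For example, a kfunc that can block while copying from user memory would be
registered with this flag, and could then only be called from a sleepable
program (note the ``.s`` suffix in the section name; ``bpf_fetch_user_data``
is a hypothetical kfunc used purely for illustration):

.. code-block:: c

        /* Kernel side: mark the kfunc as sleepable when registering it. */
        BTF_ID_FLAGS(func, bpf_fetch_user_data, KF_SLEEPABLE)

        /* BPF side: only a sleepable program may call it. */
        SEC("lsm.s/file_open")
        int BPF_PROG(sleepable_example, struct file *file)
        {
        ...
        }
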
2.4.7 KF_DESTRUCTIVE flag
-------------------------

The KF_DESTRUCTIVE flag is used to indicate kfuncs whose invocation is
destructive to the system. For example, such a call can result in the system
rebooting or panicking. Due to this, additional restrictions apply to these
calls. At the moment they only require the CAP_SYS_BOOT capability, but more
can be added later.

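For instance, a kfunc wrapping crash_kexec() reboots into a crash kernel when
invoked, so it would be registered with this flag (a sketch):

.. code-block:: c

        BTF_ID_FLAGS(func, crash_kexec, KF_DESTRUCTIVE)
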
2.4.8 KF_RCU flag
-----------------

The KF_RCU flag is used for kfuncs that take an RCU pointer as an argument.
When used together with KF_ACQUIRE, it indicates that the kfunc should have a
single argument which must be a trusted argument or a MEM_RCU pointer. The
argument may have a reference count of 0, and the kfunc must take this into
consideration.

2.5 Registering the kfuncs
--------------------------

Once the kfunc is prepared for use, the final step to making it visible is
registering it with the BPF subsystem. Registration is done per BPF program
type. An example is shown below::

        BTF_SET8_START(bpf_task_set)
        BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL)
        BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE)
        BTF_SET8_END(bpf_task_set)

        static const struct btf_kfunc_id_set bpf_task_kfunc_set = {
                .owner = THIS_MODULE,
                .set   = &bpf_task_set,
        };

        static int init_subsystem(void)
        {
                return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
        }
        late_initcall(init_subsystem);

2.6 Specifying no-cast aliases with ___init
-------------------------------------------

The verifier will always enforce that the BTF type of a pointer passed to a
kfunc by a BPF program matches the type of pointer specified in the kfunc
definition. The verifier does, however, allow types that are equivalent
according to the C standard to be passed to the same kfunc arg, even if their
BTF_IDs differ.

For example, for the following type definition:

.. code-block:: c

        struct bpf_cpumask {
                cpumask_t cpumask;
                refcount_t usage;
        };

The verifier would allow a ``struct bpf_cpumask *`` to be passed to a kfunc
taking a ``cpumask_t *`` (which is a typedef of ``struct cpumask``). For
instance, both ``struct cpumask *`` and ``struct bpf_cpumask *`` can be passed
to bpf_cpumask_test_cpu().

In some cases, this type-aliasing behavior is not desired. ``struct
nf_conn___init`` is one such example:

.. code-block:: c

        struct nf_conn___init {
                struct nf_conn ct;
        };

The C standard would consider these types to be equivalent, but it would not
always be safe to pass either type to a trusted kfunc. ``struct
nf_conn___init`` represents an allocated ``struct nf_conn`` object that has
*not yet been initialized*, so it would therefore be unsafe to pass a ``struct
nf_conn___init *`` to a kfunc that's expecting a fully initialized ``struct
nf_conn *`` (e.g. ``bpf_ct_change_timeout()``).

In order to accommodate such requirements, the verifier will enforce strict
PTR_TO_BTF_ID type matching if two types have the exact same name, with one
being suffixed with ``___init``.

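The conntrack kfuncs illustrate this convention: an allocation kfunc returns
the ``___init`` type, and only the insert kfunc, which performs initialization,
converts it into the plain type (signatures abridged here for illustration):

.. code-block:: c

        /* Returns an allocated but not yet initialized entry. */
        struct nf_conn___init *bpf_skb_ct_alloc(...);

        /* Consumes the uninitialized entry and returns an initialized one. */
        struct nf_conn *bpf_ct_insert_entry(struct nf_conn___init *nfc);
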
3. Core kfuncs
==============

The BPF subsystem provides a number of "core" kfuncs that are potentially
applicable to a wide variety of different possible use cases and programs.
Those kfuncs are documented here.

3.1 struct task_struct * kfuncs
-------------------------------

There are a number of kfuncs that allow ``struct task_struct *`` objects to be
used as kptrs:

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_task_acquire bpf_task_release

These kfuncs are useful when you want to acquire or release a reference to a
``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a
struct_ops callback arg. For example:

.. code-block:: c

        /**
         * A trivial example tracepoint program that shows how to
         * acquire and release a struct task_struct * pointer.
         */
        SEC("tp_btf/task_newtask")
        int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags)
        {
                struct task_struct *acquired;

                acquired = bpf_task_acquire(task);

                /*
                 * In a typical program you'd do something like store
                 * the task in a map, and the map will automatically
                 * release it later. Here, we release it manually.
                 */
                bpf_task_release(acquired);
                return 0;
        }

----

A BPF program can also look up a task from a pid. This can be useful if the
caller doesn't have a trusted pointer to a ``struct task_struct *`` object that
it can acquire a reference on with bpf_task_acquire().

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_task_from_pid

Here is an example of it being used:

.. code-block:: c

        SEC("tp_btf/task_newtask")
        int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags)
        {
                struct task_struct *lookup;

                lookup = bpf_task_from_pid(task->pid);
                if (!lookup)
                        /* A task should always be found, as %task is a tracepoint arg. */
                        return -ENOENT;

                if (lookup->pid != task->pid) {
                        /* bpf_task_from_pid() looks up the task via its
                         * globally-unique pid from the init_pid_ns. Thus,
                         * the pid of the lookup task should always be the
                         * same as the input task.
                         */
                        bpf_task_release(lookup);
                        return -EINVAL;
                }

                /* bpf_task_from_pid() returns an acquired reference,
                 * so it must be dropped before returning from the
                 * tracepoint handler.
                 */
                bpf_task_release(lookup);
                return 0;
        }

3.2 struct cgroup * kfuncs
--------------------------

``struct cgroup *`` objects also have acquire and release functions:

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_acquire bpf_cgroup_release

These kfuncs are used in exactly the same manner as bpf_task_acquire() and
bpf_task_release() respectively, so we won't provide examples for them.

----

You may also acquire a reference to a ``struct cgroup`` kptr that's already
stored in a map using bpf_cgroup_kptr_get():

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_kptr_get

Here's an example of how it can be used:

.. code-block:: c

        /* struct containing the struct cgroup kptr which is actually stored in the map. */
        struct __cgroups_kfunc_map_value {
                struct cgroup __kptr_ref *cgroup;
        };

        /* The map containing struct __cgroups_kfunc_map_value entries. */
        struct {
                __uint(type, BPF_MAP_TYPE_HASH);
                __type(key, int);
                __type(value, struct __cgroups_kfunc_map_value);
                __uint(max_entries, 1);
        } __cgroups_kfunc_map SEC(".maps");

        /* ... */

        /**
         * A simple example tracepoint program showing how a
         * struct cgroup kptr that is stored in a map can
         * be acquired using the bpf_cgroup_kptr_get() kfunc.
         */
        SEC("tp_btf/cgroup_mkdir")
        int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
        {
                struct cgroup *kptr;
                struct __cgroups_kfunc_map_value *v;
                s32 id = cgrp->self.id;

                /* Assume a cgroup kptr was previously stored in the map. */
                v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
                if (!v)
                        return -ENOENT;

                /* Acquire a reference to the cgroup kptr that's already stored in the map. */
                kptr = bpf_cgroup_kptr_get(&v->cgroup);
                if (!kptr)
                        /* If no cgroup was present in the map, it's because
                         * we're racing with another CPU that removed it with
                         * bpf_kptr_xchg() between the bpf_map_lookup_elem()
                         * above, and our call to bpf_cgroup_kptr_get().
                         * bpf_cgroup_kptr_get() internally safely handles this
                         * race, and will return NULL if the cgroup is no
                         * longer present in the map by the time we invoke the
                         * kfunc.
                         */
                        return -EBUSY;

                /* Free the reference we just took above. Note that the
                 * original struct cgroup kptr is still in the map. It will
                 * be freed either at a later time if another context deletes
                 * it from the map, or automatically by the BPF subsystem if
                 * it's still present when the map is destroyed.
                 */
                bpf_cgroup_release(kptr);

                return 0;
        }

----

Another kfunc available for interacting with ``struct cgroup *`` objects is
bpf_cgroup_ancestor(). This allows callers to access the ancestor of a cgroup,
returning it as a cgroup kptr.

.. kernel-doc:: kernel/bpf/helpers.c
   :identifiers: bpf_cgroup_ancestor

Eventually, BPF should be updated to allow this to happen with a normal memory
load in the program itself. This is currently not possible without more work in
the verifier. bpf_cgroup_ancestor() can be used as follows:

.. code-block:: c

        /**
         * Simple tracepoint example that illustrates how a cgroup's
         * ancestor can be accessed using bpf_cgroup_ancestor().
         */
        SEC("tp_btf/cgroup_mkdir")
        int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
        {
                struct cgroup *parent;

                /* The parent cgroup resides at the level before the current cgroup's level. */
                parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1);
                if (!parent)
                        return -ENOENT;

                bpf_printk("Parent id is %d", parent->self.id);

                /* Return the parent cgroup that was acquired above. */
                bpf_cgroup_release(parent);
                return 0;
        }

3.3 struct cpumask * kfuncs
---------------------------

BPF provides a set of kfuncs that can be used to query, allocate, mutate, and
destroy ``struct cpumask *`` objects. Please refer to
:ref:`cpumasks-header-label` for more details.