Commit | Line | Data |
---|---|---|
bdbda395 DV |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | .. _kfuncs-header-label: | |
4 | ||
63e564eb KKD |
5 | ============================= |
6 | BPF Kernel Functions (kfuncs) | |
7 | ============================= | |
8 | ||
9 | 1. Introduction | |
10 | =============== | |
11 | ||
12 | BPF Kernel Functions, more commonly known as kfuncs, are functions in the Linux | |
13 | kernel which are exposed for use by BPF programs. Unlike normal BPF helpers, | |
14 | kfuncs do not have a stable interface and can change from one kernel release to | |
15 | another. Hence, BPF programs need to be updated in response to changes in the | |
16 | kernel. | |
17 | ||
18 | 2. Defining a kfunc | |
19 | =================== | |
20 | ||
21 | There are two ways to expose a kernel function to BPF programs: either make an | |
22 | existing function in the kernel visible, or add a new wrapper for BPF. In both | |
23 | cases, care must be taken that BPF programs can only call such a function in a | |
24 | valid context. To enforce this, visibility of a kfunc can be per program type. | |
25 | ||
26 | If you are not creating a BPF wrapper for an existing kernel function, skip ahead | |
27 | to :ref:`BPF_kfunc_nodef`. | |
28 | ||
29 | 2.1 Creating a wrapper kfunc | |
30 | ---------------------------- | |
31 | ||
32 | When defining a wrapper kfunc, the wrapper function should have extern linkage. | |
33 | This prevents the compiler from optimizing away dead code, as this wrapper kfunc | |
34 | is not invoked anywhere in the kernel itself. It is not necessary to provide a | |
35 | prototype in a header for the wrapper kfunc. | |
36 | ||
37 | An example is given below:: | |
38 | ||
39 | /* Disables missing prototype warnings */ | |
40 | __diag_push(); | |
41 | __diag_ignore_all("-Wmissing-prototypes", | |
42 | "Global kfuncs as their definitions will be in BTF"); | |
43 | ||
44 | struct task_struct *bpf_find_get_task_by_vpid(pid_t nr) | |
45 | { | |
46 | return find_get_task_by_vpid(nr); | |
47 | } | |
48 | ||
49 | __diag_pop(); | |
50 | ||
51 | A wrapper kfunc is often needed when we need to annotate parameters of the | |
52 | kfunc. Otherwise one may directly make the kfunc visible to the BPF program by | |
53 | registering it with the BPF subsystem. See :ref:`BPF_kfunc_nodef`. | |
54 | ||
55 | 2.2 Annotating kfunc parameters | |
56 | ------------------------------- | |
57 | ||
58 | Similar to BPF helpers, the verifier sometimes needs additional context to make | |
59 | the usage of kernel functions safer and more useful. | |
60 | Hence, we can annotate a parameter by suffixing the name of the argument of the | |
61 | kfunc with a __tag, where tag may be one of the supported annotations. | |
62 | ||
63 | 2.2.1 __sz Annotation | |
64 | --------------------- | |
65 | ||
66 | This annotation is used to indicate a memory and size pair in the argument list. | |
67 | An example is given below:: | |
68 | ||
69 | void bpf_memzero(void *mem, int mem__sz) | |
70 | { | |
71 | ... | |
72 | } | |
73 | ||
74 | Here, the verifier will treat the first argument as a PTR_TO_MEM, and the second | |
75 | argument as its size. By default, without the __sz annotation, the size of the | |
76 | type of the pointer is used. Without the __sz annotation, a kfunc cannot accept a | |
77 | void pointer. | |
78 | ||
a50388db KKD |
79 | 2.2.2 __k Annotation |
80 | -------------------- | |
81 | ||
82 | This annotation is only understood for scalar arguments, where it indicates that | |
83 | the verifier must check the scalar argument to be a known constant, which does | |
84 | not indicate a size parameter, and the value of the constant is relevant to the | |
85 | safety of the program. | |
86 | ||
87 | An example is given below:: | |
88 | ||
89 | void *bpf_obj_new(u32 local_type_id__k, ...) | |
90 | { | |
91 | ... | |
92 | } | |
93 | ||
94 | Here, bpf_obj_new uses the local_type_id argument to find out the size of that | |
95 | type ID in the program's BTF and return a sized pointer to it. Each type ID will | |
96 | have a distinct size, hence it is crucial to treat each such call as distinct | |
97 | when values don't match during verifier state pruning checks. | |
98 | ||
99 | Hence, whenever a constant scalar argument is accepted by a kfunc which is not a | |
100 | size parameter, and the value of the constant matters for program safety, __k | |
101 | suffix should be used. | |
102 | ||
63e564eb KKD |
103 | .. _BPF_kfunc_nodef: |
104 | ||
105 | 2.3 Using an existing kernel function | |
106 | ------------------------------------- | |
107 | ||
108 | When an existing function in the kernel is fit for consumption by BPF programs, | |
109 | it can be directly registered with the BPF subsystem. However, care must still | |
110 | be taken to review the context in which it will be invoked by the BPF program | |
111 | and whether it is safe to do so. | |
112 | ||
113 | 2.4 Annotating kfuncs | |
114 | --------------------- | |
115 | ||
116 | In addition to kfuncs' arguments, the verifier may need more information about | |
117 | the type of kfunc(s) being registered with the BPF subsystem. To do so, we define | |
118 | flags on a set of kfuncs as follows:: | |
119 | ||
120 | BTF_SET8_START(bpf_task_set) | |
121 | BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL) | |
122 | BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE) | |
123 | BTF_SET8_END(bpf_task_set) | |
124 | ||
125 | This set encodes the BTF ID of each kfunc listed above, and encodes the flags | |
126 | along with it. Of course, it is also allowed to specify no flags. | |
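
The pairing of a BTF ID with its flags can be pictured with a small userspace model. This is only a sketch and not the kernel's data structure: the ``model_`` names, the flag bit values, and the IDs 101 and 102 are all assumptions made up for illustration, since real BTF IDs are resolved at build time.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this model only; not taken from the kernel. */
#define MODEL_KF_ACQUIRE  (1u << 0)
#define MODEL_KF_RELEASE  (1u << 1)
#define MODEL_KF_RET_NULL (1u << 2)

/* One entry of the modelled set: a BTF ID paired with its flags. */
struct model_kfunc_entry {
	uint32_t btf_id;
	uint32_t flags;
};

/*
 * Stand-in for the set declared with BTF_SET8_START / BTF_ID_FLAGS /
 * BTF_SET8_END above. The IDs 101 and 102 are invented placeholders.
 */
static const struct model_kfunc_entry model_task_set[] = {
	{ 101, MODEL_KF_ACQUIRE | MODEL_KF_RET_NULL },	/* bpf_get_task_pid */
	{ 102, MODEL_KF_RELEASE },			/* bpf_put_pid */
};

/* Return a pointer to the flags for btf_id, or NULL if it is not in the set. */
static const uint32_t *model_kfunc_flags(uint32_t btf_id)
{
	size_t i;

	for (i = 0; i < sizeof(model_task_set) / sizeof(model_task_set[0]); i++) {
		if (model_task_set[i].btf_id == btf_id)
			return &model_task_set[i].flags;
	}
	return NULL;
}
```

In this model, a lookup for an ID that was never listed returns NULL, which mirrors the idea that only kfuncs present in a registered set are visible to BPF programs. An entry with ``flags`` equal to 0 would correspond to a kfunc listed with no flags.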
127 | ||
128 | 2.4.1 KF_ACQUIRE flag | |
129 | --------------------- | |
130 | ||
131 | The KF_ACQUIRE flag is used to indicate that the kfunc returns a pointer to a | |
132 | refcounted object. The verifier will then ensure that the pointer to the object | |
133 | is eventually released using a release kfunc, or transferred to a map using a | |
134 | referenced kptr (by invoking bpf_kptr_xchg). If not, the verifier rejects the | |
135 | BPF program: loading succeeds only when no lingering references remain in any | |
136 | explored state of the program. | |
137 | ||
138 | 2.4.2 KF_RET_NULL flag | |
139 | ---------------------- | |
140 | ||
141 | The KF_RET_NULL flag is used to indicate that the pointer returned by the kfunc | |
142 | may be NULL. Hence, it forces the user to do a NULL check on the pointer | |
143 | returned from the kfunc before making use of it (dereferencing or passing to | |
144 | another helper). This flag is often paired with the KF_ACQUIRE flag, but the two | |
145 | are orthogonal to each other. | |
146 | ||
147 | 2.4.3 KF_RELEASE flag | |
148 | --------------------- | |
149 | ||
150 | The KF_RELEASE flag is used to indicate that the kfunc releases the pointer | |
151 | passed in to it. Only one referenced pointer can be passed in. All copies of the | |
152 | pointer being released are invalidated as a result of invoking a kfunc with this | |
153 | flag. | |
154 | ||
155 | 2.4.4 KF_KPTR_GET flag | |
156 | ---------------------- | |
157 | ||
158 | The KF_KPTR_GET flag is used to indicate that the kfunc takes the first argument | |
159 | as a pointer to kptr, safely increments the refcount of the object it points to, | |
160 | and returns a reference to the user. The rest of the arguments may be normal | |
161 | arguments of a kfunc. The KF_KPTR_GET flag should be used in conjunction with | |
162 | KF_ACQUIRE and KF_RET_NULL flags. | |
163 | ||
164 | 2.4.5 KF_TRUSTED_ARGS flag | |
165 | -------------------------- | |
166 | ||
167 | The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It | |
3f00c523 DV |
168 | indicates that all pointer arguments are valid, and that all pointers to |
169 | BTF objects have been passed in their unmodified form (that is, at a zero | |
d94cbde2 DV |
170 | offset, and without having been obtained from walking another pointer, with one |
171 | exception described below). | |
3f00c523 DV |
172 | |
173 | There are two types of pointers to kernel objects which are considered "valid": | |
174 | ||
175 | 1. Pointers which are passed as tracepoint or struct_ops callback arguments. | |
176 | 2. Pointers which were returned from a KF_ACQUIRE or KF_KPTR_GET kfunc. | |
177 | ||
178 | Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to | |
179 | KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset. | |
180 | ||
181 | The definition of "valid" pointers is subject to change at any time, and has | |
182 | absolutely no ABI stability guarantees. | |
63e564eb | 183 | |
d94cbde2 DV |
184 | As mentioned above, a nested pointer obtained from walking a trusted pointer is |
185 | no longer trusted, with one exception. If a struct type has a field that is | |
186 | guaranteed to be valid as long as its parent pointer is trusted, the | |
187 | ``BTF_TYPE_SAFE_NESTED`` macro can be used to express that to the verifier as | |
188 | follows: | |
189 | ||
190 | .. code-block:: c | |
191 | ||
192 | BTF_TYPE_SAFE_NESTED(struct task_struct) { | |
193 | const cpumask_t *cpus_ptr; | |
194 | }; | |
195 | ||
196 | In other words, you must: | |
197 | ||
198 | 1. Wrap the trusted pointer type in the ``BTF_TYPE_SAFE_NESTED`` macro. | |
199 | ||
200 | 2. Specify the type and name of the trusted nested field. This field must match | |
201 | the field in the original type definition exactly. | |
202 | ||
fa96b242 BT |
203 | 2.4.6 KF_SLEEPABLE flag |
204 | ----------------------- | |
205 | ||
206 | The KF_SLEEPABLE flag is used for kfuncs that may sleep. Such kfuncs can only | |
207 | be called by sleepable BPF programs (BPF_F_SLEEPABLE). | |
208 | ||
4dd48c6f AS |
209 | 2.4.7 KF_DESTRUCTIVE flag |
210 | -------------------------- | |
211 | ||
212 | The KF_DESTRUCTIVE flag is used to indicate functions whose invocation is | |
213 | destructive to the system. For example, such a call can result in the system | |
214 | rebooting or panicking. Because of this, additional restrictions apply to these | |
215 | calls. At the moment they only require the CAP_SYS_BOOT capability, but more can | |
216 | be added later. | |
217 | ||
f5362564 YS |
218 | 2.4.8 KF_RCU flag |
219 | ----------------- | |
220 | ||
221 | The KF_RCU flag is used for kfuncs which take an RCU pointer as an argument. | |
222 | When used together with KF_ACQUIRE, it indicates that the kfunc should have a | |
223 | single argument which must be a trusted argument or a MEM_RCU pointer. | |
224 | The argument may have a reference count of 0 and the kfunc must take this | |
225 | into consideration. | |
226 | ||
63e564eb KKD |
227 | 2.5 Registering the kfuncs |
228 | -------------------------- | |
229 | ||
230 | Once the kfunc is prepared for use, the final step to making it visible is | |
231 | registering it with the BPF subsystem. Registration is done per BPF program | |
232 | type. An example is shown below:: | |
233 | ||
234 | BTF_SET8_START(bpf_task_set) | |
235 | BTF_ID_FLAGS(func, bpf_get_task_pid, KF_ACQUIRE | KF_RET_NULL) | |
236 | BTF_ID_FLAGS(func, bpf_put_pid, KF_RELEASE) | |
237 | BTF_SET8_END(bpf_task_set) | |
238 | ||
239 | static const struct btf_kfunc_id_set bpf_task_kfunc_set = { | |
240 | .owner = THIS_MODULE, | |
241 | .set = &bpf_task_set, | |
242 | }; | |
243 | ||
244 | static int init_subsystem(void) | |
245 | { | |
246 | return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set); | |
247 | } | |
248 | late_initcall(init_subsystem); | |
25c5e92d | 249 | |
027bdec8 DV |
250 | 2.6 Specifying no-cast aliases with ___init |
251 | -------------------------------------------- | |
252 | ||
253 | The verifier will always enforce that the BTF type of a pointer passed to a | |
254 | kfunc by a BPF program matches the type of pointer specified in the kfunc | |
255 | definition. The verifier does, however, allow types that are equivalent | |
256 | according to the C standard to be passed to the same kfunc arg, even if their | |
257 | BTF_IDs differ. | |
258 | ||
259 | For example, for the following type definition: | |
260 | ||
261 | .. code-block:: c | |
262 | ||
263 | struct bpf_cpumask { | |
264 | cpumask_t cpumask; | |
265 | refcount_t usage; | |
266 | }; | |
267 | ||
268 | The verifier would allow a ``struct bpf_cpumask *`` to be passed to a kfunc | |
269 | taking a ``cpumask_t *`` (which is a typedef of ``struct cpumask *``). For | |
270 | instance, both ``struct cpumask *`` and ``struct bpf_cpumask *`` can be passed | |
271 | to bpf_cpumask_test_cpu(). | |
272 | ||
273 | In some cases, this type-aliasing behavior is not desired. ``struct | |
274 | nf_conn___init`` is one such example: | |
275 | ||
276 | .. code-block:: c | |
277 | ||
278 | struct nf_conn___init { | |
279 | struct nf_conn ct; | |
280 | }; | |
281 | ||
282 | The C standard would consider these types to be equivalent, but it would not | |
283 | always be safe to pass either type to a trusted kfunc. ``struct | |
284 | nf_conn___init`` represents an allocated ``struct nf_conn`` object that has | |
285 | *not yet been initialized*, so it would therefore be unsafe to pass a ``struct | |
286 | nf_conn___init *`` to a kfunc that's expecting a fully initialized ``struct | |
287 | nf_conn *`` (e.g. ``bpf_ct_change_timeout()``). | |
288 | ||
289 | In order to accommodate such requirements, the verifier will enforce strict | |
290 | PTR_TO_BTF_ID type matching if two types have the exact same name, with one | |
291 | being suffixed with ``___init``. | |
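
The naming rule above can be sketched as a plain string comparison. This is a userspace illustration of the rule as stated, not the verifier's actual implementation; the ``model_`` function names are hypothetical.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_INIT_SUFFIX "___init"

/* True if `name` is exactly `base` followed by the "___init" suffix. */
static bool model_is_init_variant(const char *name, const char *base)
{
	size_t base_len = strlen(base);

	/* strncmp() fails if `name` is shorter than `base`, so the offset is safe. */
	return strncmp(name, base, base_len) == 0 &&
	       strcmp(name + base_len, MODEL_INIT_SUFFIX) == 0;
}

/*
 * True if the two type names are the same name with one of them carrying
 * the "___init" suffix, i.e. the case where strict matching would apply.
 */
static bool model_needs_strict_match(const char *a, const char *b)
{
	return model_is_init_variant(a, b) || model_is_init_variant(b, a);
}
```

With this model, the pair ``nf_conn`` and ``nf_conn___init`` is flagged for strict matching in either argument order, while ``bpf_cpumask`` and ``cpumask`` are not, so the aliasing described earlier would still be permitted for them.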
292 | ||
25c5e92d DV |
293 | 3. Core kfuncs |
294 | ============== | |
295 | ||
296 | The BPF subsystem provides a number of "core" kfuncs that are potentially | |
297 | applicable to a wide variety of different possible use cases and programs. | |
298 | Those kfuncs are documented here. | |
299 | ||
300 | 3.1 struct task_struct * kfuncs | |
301 | ------------------------------- | |
302 | ||
303 | There are a number of kfuncs that allow ``struct task_struct *`` objects to be | |
304 | used as kptrs: | |
305 | ||
306 | .. kernel-doc:: kernel/bpf/helpers.c | |
307 | :identifiers: bpf_task_acquire bpf_task_release | |
308 | ||
309 | These kfuncs are useful when you want to acquire or release a reference to a | |
310 | ``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a | |
311 | struct_ops callback arg. For example: | |
312 | ||
313 | .. code-block:: c | |
314 | ||
315 | /** | |
316 | * A trivial example tracepoint program that shows how to | |
317 | * acquire and release a struct task_struct * pointer. | |
318 | */ | |
319 | SEC("tp_btf/task_newtask") | |
320 | int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags) | |
321 | { | |
322 | struct task_struct *acquired; | |
323 | ||
324 | acquired = bpf_task_acquire(task); | |
325 | ||
326 | /* | |
327 | * In a typical program you'd do something like store | |
328 | * the task in a map, and the map will automatically | |
329 | * release it later. Here, we release it manually. | |
330 | */ | |
331 | bpf_task_release(acquired); | |
332 | return 0; | |
333 | } | |
334 | ||
335 | ---- | |
336 | ||
337 | A BPF program can also look up a task from a pid. This can be useful if the | |
338 | caller doesn't have a trusted pointer to a ``struct task_struct *`` object that | |
339 | it can acquire a reference on with bpf_task_acquire(). | |
340 | ||
341 | .. kernel-doc:: kernel/bpf/helpers.c | |
342 | :identifiers: bpf_task_from_pid | |
343 | ||
344 | Here is an example of it being used: | |
345 | ||
346 | .. code-block:: c | |
347 | ||
348 | SEC("tp_btf/task_newtask") | |
349 | int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags) | |
350 | { | |
351 | struct task_struct *lookup; | |
352 | ||
353 | lookup = bpf_task_from_pid(task->pid); | |
354 | if (!lookup) | |
355 | /* A task should always be found, as %task is a tracepoint arg. */ | |
356 | return -ENOENT; | |
357 | ||
358 | if (lookup->pid != task->pid) { | |
359 | /* bpf_task_from_pid() looks up the task via its | |
360 | * globally-unique pid from the init_pid_ns. Thus, | |
361 | * the pid of the lookup task should always be the | |
362 | * same as the input task. | |
363 | */ | |
364 | bpf_task_release(lookup); | |
365 | return -EINVAL; | |
366 | } | |
367 | ||
368 | /* bpf_task_from_pid() returns an acquired reference, | |
369 | * so it must be dropped before returning from the | |
370 | * tracepoint handler. | |
371 | */ | |
372 | bpf_task_release(lookup); | |
373 | return 0; | |
374 | } | |
36aa10ff DV |
375 | |
376 | 3.2 struct cgroup * kfuncs | |
377 | -------------------------- | |
378 | ||
379 | ``struct cgroup *`` objects also have acquire and release functions: | |
380 | ||
381 | .. kernel-doc:: kernel/bpf/helpers.c | |
382 | :identifiers: bpf_cgroup_acquire bpf_cgroup_release | |
383 | ||
384 | These kfuncs are used in exactly the same manner as bpf_task_acquire() and | |
385 | bpf_task_release() respectively, so we won't provide examples for them. | |
386 | ||
387 | ---- | |
388 | ||
389 | You may also acquire a reference to a ``struct cgroup`` kptr that's already | |
390 | stored in a map using bpf_cgroup_kptr_get(): | |
391 | ||
392 | .. kernel-doc:: kernel/bpf/helpers.c | |
393 | :identifiers: bpf_cgroup_kptr_get | |
394 | ||
395 | Here's an example of how it can be used: | |
396 | ||
397 | .. code-block:: c | |
398 | ||
399 | /* struct containing the struct cgroup kptr which is actually stored in the map. */ | |
400 | struct __cgroups_kfunc_map_value { | |
401 | struct cgroup __kptr_ref * cgroup; | |
402 | }; | |
403 | ||
404 | /* The map containing struct __cgroups_kfunc_map_value entries. */ | |
405 | struct { | |
406 | __uint(type, BPF_MAP_TYPE_HASH); | |
407 | __type(key, int); | |
408 | __type(value, struct __cgroups_kfunc_map_value); | |
409 | __uint(max_entries, 1); | |
410 | } __cgroups_kfunc_map SEC(".maps"); | |
411 | ||
412 | /* ... */ | |
413 | ||
414 | /** | |
415 | * A simple example tracepoint program showing how a | |
416 | * struct cgroup kptr that is stored in a map can | |
417 | * be acquired using the bpf_cgroup_kptr_get() kfunc. | |
418 | */ | |
419 | SEC("tp_btf/cgroup_mkdir") | |
420 | int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path) | |
421 | { | |
422 | struct cgroup *kptr; | |
423 | struct __cgroups_kfunc_map_value *v; | |
424 | s32 id = cgrp->self.id; | |
425 | ||
426 | /* Assume a cgroup kptr was previously stored in the map. */ | |
427 | v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id); | |
428 | if (!v) | |
429 | return -ENOENT; | |
430 | ||
431 | /* Acquire a reference to the cgroup kptr that's already stored in the map. */ | |
432 | kptr = bpf_cgroup_kptr_get(&v->cgroup); | |
433 | if (!kptr) | |
434 | /* If no cgroup was present in the map, it's because | |
435 | * we're racing with another CPU that removed it with | |
436 | * bpf_kptr_xchg() between the bpf_map_lookup_elem() | |
437 | * above, and our call to bpf_cgroup_kptr_get(). | |
438 | * bpf_cgroup_kptr_get() internally safely handles this | |
439 | * race, and will return NULL if the cgroup is no longer | |
440 | * present in the map by the time we invoke the kfunc. | |
441 | */ | |
442 | return -EBUSY; | |
443 | ||
444 | /* Free the reference we just took above. Note that the | |
445 | * original struct cgroup kptr is still in the map. It will | |
446 | * be freed either at a later time if another context deletes | |
447 | * it from the map, or automatically by the BPF subsystem if | |
448 | * it's still present when the map is destroyed. | |
449 | */ | |
450 | bpf_cgroup_release(kptr); | |
451 | ||
452 | return 0; | |
453 | } | |
454 | ||
455 | ---- | |
456 | ||
457 | Another kfunc available for interacting with ``struct cgroup *`` objects is | |
458 | bpf_cgroup_ancestor(). This allows callers to access the ancestor of a cgroup, | |
459 | and return it as a cgroup kptr. | |
460 | ||
461 | .. kernel-doc:: kernel/bpf/helpers.c | |
462 | :identifiers: bpf_cgroup_ancestor | |
463 | ||
464 | Eventually, BPF should be updated to allow this to happen with a normal memory | |
465 | load in the program itself. This is currently not possible without more work in | |
466 | the verifier. bpf_cgroup_ancestor() can be used as follows: | |
467 | ||
468 | .. code-block:: c | |
469 | ||
470 | /** | |
471 | * Simple tracepoint example that illustrates how a cgroup's | |
472 | * ancestor can be accessed using bpf_cgroup_ancestor(). | |
473 | */ | |
474 | SEC("tp_btf/cgroup_mkdir") | |
475 | int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path) | |
476 | { | |
477 | struct cgroup *parent; | |
478 | ||
479 | /* The parent cgroup resides at the level before the current cgroup's level. */ | |
480 | parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1); | |
481 | if (!parent) | |
482 | return -ENOENT; | |
483 | ||
484 | bpf_printk("Parent id is %d", parent->self.id); | |
485 | ||
486 | /* Release the parent cgroup that was acquired above. */ | |
487 | bpf_cgroup_release(parent); | |
488 | return 0; | |
489 | } | |
bdbda395 DV |
490 | |
491 | 3.3 struct cpumask * kfuncs | |
492 | --------------------------- | |
493 | ||
494 | BPF provides a set of kfuncs that can be used to query, allocate, mutate, and | |
495 | destroy ``struct cpumask *`` objects. Please refer to :ref:`cpumasks-header-label` | |
496 | for more details. |