Commit | Line | Data |
---|---|---|
8972e18a SS |
1 | ============= |
2 | BPF Iterators | |
3 | ============= | |
4 | ||
5 | ||
6 | ---------- | |
7 | Motivation | |
8 | ---------- | |
9 | ||
10 | There are a few existing ways to dump kernel data into user space. The most | |
11 | popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps | |
12 | all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink | |
13 | sockets in the system. However, their output format tends to be fixed, and if | |
14 | users want more information about these sockets, they have to patch the kernel, | |
15 | which often takes time to publish upstream and release. The same is true for popular | |
16 | tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any | |
17 | additional information needs a kernel patch. | |
18 | ||
19 | To solve this problem, the `drgn | |
20 | <https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to | |
21 | dig out the kernel data with no kernel change. However, the main drawback for | |
22 | drgn is performance, as it cannot do pointer tracing inside the kernel. In | |
23 | addition, drgn cannot validate a pointer value and may read invalid data if the | |
24 | pointer becomes invalid inside the kernel. | |
25 | ||
26 | The BPF iterator solves the above problem by providing flexibility on what data | |
27 | (e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel | |
28 | data object. | |
29 | ||
30 | ---------------------- | |
31 | How BPF Iterators Work | |
32 | ---------------------- | |
33 | ||
34 | A BPF iterator is a type of BPF program that allows users to iterate over | |
35 | specific types of kernel objects. Unlike traditional BPF tracing programs that | |
36 | allow users to define callbacks that are invoked at particular points of | |
37 | execution in the kernel, BPF iterators allow users to define callbacks that | |
38 | should be executed for every entry in a variety of kernel data structures. | |
39 | ||
40 | For example, users can define a BPF iterator that iterates over every task on | |
41 | the system and dumps the total amount of CPU runtime currently used by each of | |
42 | them. Another BPF task iterator may instead dump the cgroup information for each | |
43 | task. Such flexibility is the core value of BPF iterators. | |
44 | ||
45 | A BPF program is always loaded into the kernel at the behest of a user space | |
46 | process. A user space process loads a BPF program by opening and initializing | |
47 | the program skeleton as required and then invoking a syscall to have the BPF | |
48 | program verified and loaded by the kernel. | |
49 | ||
50 | In traditional tracing programs, a program is activated by having user space | |
51 | obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once | |
52 | activated, the program callback will be invoked whenever the tracepoint is | |
53 | triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the | |
54 | program is obtained using ``bpf_link_create()``, and the program callback is | |
55 | invoked by issuing system calls from user space. | |
56 | ||
57 | Next, let us see how you can use the iterators to iterate on kernel objects and | |
58 | read data. | |
59 | ||
60 | ------------------------ | |
61 | How to Use BPF iterators | |
62 | ------------------------ | |
63 | ||
64 | BPF selftests are a great resource to illustrate how to use the iterators. In | |
65 | this section, we’ll walk through a BPF selftest which shows how to load and use | |
66 | a BPF iterator program. To begin, we’ll look at `bpf_iter.c | |
67 | <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_, | |
68 | which illustrates how to load and trigger BPF iterators on the user space side. | |
69 | Later, we’ll look at a BPF program that runs in kernel space. | |
70 | ||
71 | Loading a BPF iterator in the kernel from user space typically involves the | |
72 | following steps: | |
73 | ||
74 | * The BPF program is loaded into the kernel through ``libbpf``. Once the kernel | |
75 | has verified and loaded the program, it returns a file descriptor (fd) to user | |
76 | space. | |
77 | * Obtain a ``link_fd`` to the BPF program by calling the ``bpf_link_create()`` | |
78 | specified with the BPF program file descriptor received from the kernel. | |
79 | * Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling the | |
80 | ``bpf_iter_create()`` specified with the ``bpf_link`` received from Step 2. | |
81 | * Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is | |
82 | available. | |
83 | * Close the iterator fd using ``close(bpf_iter_fd)``. | |
84 | * If needed to reread the data, get a new ``bpf_iter_fd`` and do the read again. | |
85 | ||
86 | The following are a few examples of selftest BPF iterator programs: | |
87 | ||
88 | * `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_ | |
89 | * `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_ | |
90 | * `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_ | |
91 | ||
92 | Let us look at ``bpf_iter_task_file.c``, which runs in kernel space: | |
93 | ||
94 | Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h | |
95 | <https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_. | |
96 | Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>`` | |
97 | represents a BPF iterator. The suffix ``<iter_name>`` represents the type of | |
98 | iterator. | |
99 | ||
100 | :: | |
101 | ||
102 | struct bpf_iter__task_file { | |
103 | union { | |
104 | struct bpf_iter_meta *meta; | |
105 | }; | |
106 | union { | |
107 | struct task_struct *task; | |
108 | }; | |
109 | u32 fd; | |
110 | union { | |
111 | struct file *file; | |
112 | }; | |
113 | }; | |
114 | ||
115 | In the above code, the field 'meta' contains the metadata, which is the same for | |
116 | all BPF iterator programs. The rest of the fields are specific to different | |
117 | iterators. For example, for task_file iterators, the kernel layer provides the | |
118 | 'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference | |
119 | counted | |
120 | <https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_, | |
121 | so they won't go away when the BPF program runs. | |
122 | ||
123 | Here is a snippet from the ``bpf_iter_task_file.c`` file: | |
124 | ||
125 | :: | |
126 | ||
127 | SEC("iter/task_file") | |
128 | int dump_task_file(struct bpf_iter__task_file *ctx) | |
129 | { | |
130 | struct seq_file *seq = ctx->meta->seq; | |
131 | struct task_struct *task = ctx->task; | |
132 | struct file *file = ctx->file; | |
133 | __u32 fd = ctx->fd; | |
134 | ||
135 | if (task == NULL || file == NULL) | |
136 | return 0; | |
137 | ||
138 | if (ctx->meta->seq_num == 0) { | |
139 | count = 0; | |
140 | BPF_SEQ_PRINTF(seq, " tgid gid fd file\n"); | |
141 | } | |
142 | ||
143 | if (tgid == task->tgid && task->tgid != task->pid) | |
144 | count++; | |
145 | ||
146 | if (last_tgid != task->tgid) { | |
147 | last_tgid = task->tgid; | |
148 | unique_tgid_count++; | |
149 | } | |
150 | ||
151 | BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd, | |
152 | (long)file->f_op); | |
153 | return 0; | |
154 | } | |
155 | ||
156 | In the above example, the section name ``SEC(iter/task_file)``, indicates that | |
157 | the program is a BPF iterator program to iterate all files from all tasks. The | |
158 | context of the program is ``bpf_iter__task_file`` struct. | |
159 | ||
160 | The user space program invokes the BPF iterator program running in the kernel | |
161 | by issuing a ``read()`` syscall. Once invoked, the BPF | |
162 | program can export data to user space using a variety of BPF helper functions. | |
163 | You can use either ``bpf_seq_printf()`` (and BPF_SEQ_PRINTF helper macro) or | |
164 | ``bpf_seq_write()`` function based on whether you need formatted output or just | |
165 | binary data, respectively. For binary-encoded data, the user space applications | |
166 | can process the data from ``bpf_seq_write()`` as needed. For the formatted data, | |
167 | you can use ``cat <path>`` to print the results similar to ``cat | |
168 | /proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later, | |
169 | use ``rm -f <path>`` to remove the pinned iterator. | |
170 | ||
171 | For example, you can use the following command to create a BPF iterator from the | |
172 | ``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route`` | |
173 | path: | |
174 | ||
175 | :: | |
176 | ||
177 | $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route | |
178 | ||
179 | And then print out the results using the following command: | |
180 | ||
181 | :: | |
182 | ||
183 | $ cat /sys/fs/bpf/my_route | |
184 | ||
185 | ||
186 | ------------------------------------------------------- | |
187 | Implement Kernel Support for BPF Iterator Program Types | |
188 | ------------------------------------------------------- | |
189 | ||
190 | To implement a BPF iterator in the kernel, the developer must make a one-time | |
191 | change to the following key data structure defined in the `bpf.h | |
192 | <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_ | |
193 | file. | |
194 | ||
195 | :: | |
196 | ||
197 | struct bpf_iter_reg { | |
198 | const char *target; | |
199 | bpf_iter_attach_target_t attach_target; | |
200 | bpf_iter_detach_target_t detach_target; | |
201 | bpf_iter_show_fdinfo_t show_fdinfo; | |
202 | bpf_iter_fill_link_info_t fill_link_info; | |
203 | bpf_iter_get_func_proto_t get_func_proto; | |
204 | u32 ctx_arg_info_size; | |
205 | u32 feature; | |
206 | struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX]; | |
207 | const struct bpf_iter_seq_info *seq_info; | |
208 | }; | |
209 | ||
210 | After filling the data structure fields, call ``bpf_iter_reg_target()`` to | |
211 | register the iterator to the main BPF iterator subsystem. | |
212 | ||
213 | The following is the breakdown for each field in struct ``bpf_iter_reg``. | |
214 | ||
215 | .. list-table:: | |
216 | :widths: 25 50 | |
217 | :header-rows: 1 | |
218 | ||
219 | * - Fields | |
220 | - Description | |
221 | * - target | |
222 | - Specifies the name of the BPF iterator. For example: ``bpf_map``, | |
223 | ``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel. | |
224 | * - attach_target and detach_target | |
225 | - Allows for target specific ``link_create`` action since some targets | |
226 | may need special processing. Called during the user space link_create stage. | |
227 | * - show_fdinfo and fill_link_info | |
228 | - Called to fill target specific information when user tries to get link | |
229 | info associated with the iterator. | |
230 | * - get_func_proto | |
231 | - Permits a BPF iterator to access BPF helpers specific to the iterator. | |
232 | * - ctx_arg_info_size and ctx_arg_info | |
233 | - Specifies the verifier states for BPF program arguments associated with | |
234 | the bpf iterator. | |
235 | * - feature | |
236 | - Specifies certain action requests in the kernel BPF iterator | |
237 | infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means | |
238 | that the kernel function cond_resched() is called to avoid other kernel | |
239 | subsystem (e.g., rcu) misbehaving. | |
240 | * - seq_info | |
2404dd01 AP |
241 | - Specifies the set of seq operations for the BPF iterator and helpers to |
242 | initialize/free the private data for the corresponding ``seq_file``. | |
8972e18a SS |
243 | |
244 | `Click here | |
245 | <https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_ | |
246 | to see an implementation of the ``task_vma`` BPF iterator in the kernel. | |
247 | ||
248 | --------------------------------- | |
249 | Parameterizing BPF Task Iterators | |
250 | --------------------------------- | |
251 | ||
252 | By default, BPF iterators walk through all the objects of the specified types | |
253 | (processes, cgroups, maps, etc.) across the entire system to read relevant | |
254 | kernel data. But often, there are cases where we only care about a much smaller | |
255 | subset of iterable kernel objects, such as only iterating tasks within a | |
256 | specific process. Therefore, BPF iterator programs support filtering out objects | |
257 | from iteration by allowing user space to configure the iterator program when it | |
258 | is attached. | |
259 | ||
260 | -------------------------- | |
261 | BPF Task Iterator Program | |
262 | -------------------------- | |
263 | ||
264 | The following code is a BPF iterator program to print files and task information | |
265 | through the ``seq_file`` of the iterator. It is a standard BPF iterator program | |
266 | that visits every file of an iterator. We will use this BPF program in our | |
267 | example later. | |
268 | ||
269 | :: | |
270 | ||
271 | #include <vmlinux.h> | |
272 | #include <bpf/bpf_helpers.h> | |
273 | ||
274 | char _license[] SEC("license") = "GPL"; | |
275 | ||
276 | SEC("iter/task_file") | |
277 | int dump_task_file(struct bpf_iter__task_file *ctx) | |
278 | { | |
279 | struct seq_file *seq = ctx->meta->seq; | |
280 | struct task_struct *task = ctx->task; | |
281 | struct file *file = ctx->file; | |
282 | __u32 fd = ctx->fd; | |
283 | if (task == NULL || file == NULL) | |
284 | return 0; | |
285 | if (ctx->meta->seq_num == 0) { | |
286 | BPF_SEQ_PRINTF(seq, " tgid pid fd file\n"); | |
287 | } | |
288 | BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd, | |
289 | (long)file->f_op); | |
290 | return 0; | |
291 | } | |
292 | ||
293 | ---------------------------------------- | |
294 | Creating a File Iterator with Parameters | |
295 | ---------------------------------------- | |
296 | ||
297 | Now, let us look at how to create an iterator that includes only files of a | |
298 | process. | |
299 | ||
300 | First, fill the ``bpf_iter_attach_opts`` struct as shown below: | |
301 | ||
302 | :: | |
303 | ||
304 | LIBBPF_OPTS(bpf_iter_attach_opts, opts); | |
305 | union bpf_iter_link_info linfo; | |
306 | memset(&linfo, 0, sizeof(linfo)); | |
307 | linfo.task.pid = getpid(); | |
308 | opts.link_info = &linfo; | |
309 | opts.link_info_len = sizeof(linfo); | |
310 | ||
311 | ``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator | |
312 | that only includes opened files for the process with the specified ``pid``. In | |
313 | this example, we will only be iterating files for our process. If | |
314 | ``linfo.task.pid`` is zero, the iterator will visit every opened file of every | |
315 | process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator | |
316 | that visits opened files of a specific thread, not a process. In this example, | |
317 | ``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a | |
318 | separate file descriptor table. In most circumstances, all process threads share | |
319 | a single file descriptor table. | |
320 | ||
321 | Now, in the userspace program, pass the pointer of struct to the | |
322 | ``bpf_program__attach_iter()``. | |
323 | ||
324 | :: | |
325 | ||
326 | link = bpf_program__attach_iter(prog, &opts); iter_fd = | |
327 | bpf_iter_create(bpf_link__fd(link)); | |
328 | ||
329 | If both *tid* and *pid* are zero, an iterator created from this struct | |
330 | ``bpf_iter_attach_opts`` will include every opened file of every task in the | |
331 | system (in the namespace, actually.) It is the same as passing a NULL as the | |
332 | second argument to ``bpf_program__attach_iter()``. | |
333 | ||
334 | The whole program looks like the following code: | |
335 | ||
336 | :: | |
337 | ||
338 | #include <stdio.h> | |
339 | #include <unistd.h> | |
340 | #include <bpf/bpf.h> | |
341 | #include <bpf/libbpf.h> | |
342 | #include "bpf_iter_task_ex.skel.h" | |
343 | ||
344 | static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts) | |
345 | { | |
346 | struct bpf_link *link; | |
347 | char buf[16] = {}; | |
348 | int iter_fd = -1, len; | |
349 | int ret = 0; | |
350 | ||
351 | link = bpf_program__attach_iter(prog, opts); | |
352 | if (!link) { | |
353 | fprintf(stderr, "bpf_program__attach_iter() fails\n"); | |
354 | return -1; | |
355 | } | |
356 | iter_fd = bpf_iter_create(bpf_link__fd(link)); | |
357 | if (iter_fd < 0) { | |
358 | fprintf(stderr, "bpf_iter_create() fails\n"); | |
359 | ret = -1; | |
360 | goto free_link; | |
361 | } | |
362 | /* not check contents, but ensure read() ends without error */ | |
363 | while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) { | |
364 | buf[len] = 0; | |
365 | printf("%s", buf); | |
366 | } | |
367 | printf("\n"); | |
368 | free_link: | |
369 | if (iter_fd >= 0) | |
370 | close(iter_fd); | |
371 | bpf_link__destroy(link); | |
372 | return 0; | |
373 | } | |
374 | ||
375 | static void test_task_file(void) | |
376 | { | |
377 | LIBBPF_OPTS(bpf_iter_attach_opts, opts); | |
378 | struct bpf_iter_task_ex *skel; | |
379 | union bpf_iter_link_info linfo; | |
380 | skel = bpf_iter_task_ex__open_and_load(); | |
381 | if (skel == NULL) | |
382 | return; | |
383 | memset(&linfo, 0, sizeof(linfo)); | |
384 | linfo.task.pid = getpid(); | |
385 | opts.link_info = &linfo; | |
386 | opts.link_info_len = sizeof(linfo); | |
387 | printf("PID %d\n", getpid()); | |
388 | do_read_opts(skel->progs.dump_task_file, &opts); | |
389 | bpf_iter_task_ex__destroy(skel); | |
390 | } | |
391 | ||
392 | int main(int argc, const char * const * argv) | |
393 | { | |
394 | test_task_file(); | |
395 | return 0; | |
396 | } | |
397 | ||
398 | The following lines are the output of the program. | |
399 | :: | |
400 | ||
401 | PID 1859 | |
402 | ||
403 | tgid pid fd file | |
404 | 1859 1859 0 ffffffff82270aa0 | |
405 | 1859 1859 1 ffffffff82270aa0 | |
406 | 1859 1859 2 ffffffff82270aa0 | |
407 | 1859 1859 3 ffffffff82272980 | |
408 | 1859 1859 4 ffffffff8225e120 | |
409 | 1859 1859 5 ffffffff82255120 | |
410 | 1859 1859 6 ffffffff82254f00 | |
411 | 1859 1859 7 ffffffff82254d80 | |
412 | 1859 1859 8 ffffffff8225abe0 | |
413 | ||
414 | ------------------ | |
415 | Without Parameters | |
416 | ------------------ | |
417 | ||
418 | Let us look at how a BPF iterator without parameters skips files of other | |
419 | processes in the system. In this case, the BPF program has to check the pid or | |
420 | the tid of tasks, or it will receive every opened file in the system (in the | |
421 | current *pid* namespace, actually). So, we usually add a global variable in the | |
422 | BPF program to pass a *pid* to the BPF program. | |
423 | ||
424 | The BPF program would look like the following block. | |
425 | ||
426 | :: | |
427 | ||
428 | ...... | |
429 | int target_pid = 0; | |
430 | ||
431 | SEC("iter/task_file") | |
432 | int dump_task_file(struct bpf_iter__task_file *ctx) | |
433 | { | |
434 | ...... | |
435 | if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */ | |
436 | return 0; | |
437 | BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd, | |
438 | (long)file->f_op); | |
439 | return 0; | |
440 | } | |
441 | ||
442 | The user space program would look like the following block: | |
443 | ||
444 | :: | |
445 | ||
446 | ...... | |
447 | static void test_task_file(void) | |
448 | { | |
449 | ...... | |
450 | skel = bpf_iter_task_ex__open_and_load(); | |
451 | if (skel == NULL) | |
452 | return; | |
453 | skel->bss->target_pid = getpid(); /* process ID. For thread id, use gettid() */ | |
454 | memset(&linfo, 0, sizeof(linfo)); | |
455 | linfo.task.pid = getpid(); | |
456 | opts.link_info = &linfo; | |
457 | opts.link_info_len = sizeof(linfo); | |
458 | ...... | |
459 | } | |
460 | ||
461 | ``target_pid`` is a global variable in the BPF program. The user space program | |
462 | should initialize the variable with a process ID to skip opened files of other | |
463 | processes in the BPF program. When you parametrize a BPF iterator, the iterator | |
464 | calls the BPF program fewer times which can save significant resources. | |
465 | ||
466 | --------------------------- | |
467 | Parametrizing VMA Iterators | |
468 | --------------------------- | |
469 | ||
470 | By default, a BPF VMA iterator includes every VMA in every process. However, | |
471 | you can still specify a process or a thread to include only its VMAs. Unlike | |
472 | files, a thread can not have a separate address space (since Linux 2.6.0-test6). | |
473 | Here, using *tid* makes no difference from using *pid*. | |
474 | ||
475 | ---------------------------- | |
476 | Parametrizing Task Iterators | |
477 | ---------------------------- | |
478 | ||
479 | A BPF task iterator with *pid* includes all tasks (threads) of a process. The | |
480 | BPF program receives these tasks one after another. You can specify a BPF task | |
481 | iterator with *tid* parameter to include only the tasks that match the given | |
482 | *tid*. |