Documentation/bpf/bpf_iterators.rst

   1 =============
   2 BPF Iterators
   3 =============
   4
   5
   6 ----------
   7 Motivation
   8 ----------
   9
  10 There are a few existing ways to dump kernel data into user space. The most
  11 popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
  12 all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink
  13 sockets in the system. However, their output format tends to be fixed, and if
  14 users want more information about these sockets, they have to patch the kernel,
  15 which often takes time to publish upstream and release. The same is true for popular
  16 tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any
  17 additional information needs a kernel patch.
  18
  19 To solve this problem, the `drgn
  20 <https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
  21 dig out the kernel data with no kernel change. However, the main drawback for
  22 drgn is performance, as it cannot do pointer tracing inside the kernel. In
  23 addition, drgn cannot validate a pointer value and may read invalid data if the
  24 pointer becomes invalid inside the kernel.
  25
  26 The BPF iterator solves the above problem by providing flexibility on what data
  27 (e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel
  28 data object.
  29
  30 ----------------------
  31 How BPF Iterators Work
  32 ----------------------
  33
  34 A BPF iterator is a type of BPF program that allows users to iterate over
  35 specific types of kernel objects. Unlike traditional BPF tracing programs that
  36 allow users to define callbacks that are invoked at particular points of
  37 execution in the kernel, BPF iterators allow users to define callbacks that
  38 should be executed for every entry in a variety of kernel data structures.
  39
  40 For example, users can define a BPF iterator that iterates over every task on
  41 the system and dumps the total amount of CPU runtime currently used by each of
  42 them. Another BPF task iterator may instead dump the cgroup information for each
  43 task. Such flexibility is the core value of BPF iterators.
  44
  45 A BPF program is always loaded into the kernel at the behest of a user space
  46 process. A user space process loads a BPF program by opening and initializing
  47 the program skeleton as required and then invoking a syscall to have the BPF
  48 program verified and loaded by the kernel.
  49
  50 In traditional tracing programs, a program is activated by having user space
  51 obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
  52 activated, the program callback will be invoked whenever the tracepoint is
  53 triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
  54 program is obtained using ``bpf_link_create()``, and the program callback is
  55 invoked by issuing system calls from user space.
  56
  57 Next, let us see how you can use the iterators to iterate on kernel objects and
  58 read data.
  59
  60 ------------------------
  61 How to Use BPF iterators
  62 ------------------------
  63
BPF selftests are a good resource for learning how to use the iterators. In
this section, we’ll walk through a BPF selftest that shows how to load and use
a BPF iterator program. To begin, we’ll look at `bpf_iter.c
<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
which illustrates how to load and trigger BPF iterators on the user space side.
Later, we’ll look at a BPF program that runs in kernel space.
  70
  71 Loading a BPF iterator in the kernel from user space typically involves the
  72 following steps:
  73
1. Load the BPF program into the kernel through ``libbpf``. Once the kernel
   has verified and loaded the program, it returns a file descriptor (fd) to
   user space.
2. Obtain a ``link_fd`` to the BPF program by calling ``bpf_link_create()``
   with the BPF program file descriptor received from the kernel.
3. Obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling
   ``bpf_iter_create()`` with the ``link_fd`` received from Step 2.
4. Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
   available.
5. Close the iterator fd using ``close(bpf_iter_fd)``.
6. To read the data again, obtain a new ``bpf_iter_fd`` and repeat the reads.
  85
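The read step above can be sketched as a small helper. This is a minimal
sketch, not the selftest's code: the name ``drain_iter_fd`` is hypothetical,
and the descriptor is assumed to come from ``bpf_iter_create()``. The loop
only relies on ``read()`` returning 0 once the iteration is finished.

```c
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Copy everything that can be read from iter_fd to out.
 * iter_fd is assumed to be an iterator fd from bpf_iter_create(); the loop
 * itself works for any readable descriptor.
 * Returns the total number of bytes read, or -1 on a read or write error.
 */
long drain_iter_fd(int iter_fd, FILE *out)
{
	char buf[4096];
	long total = 0;
	ssize_t len;

	for (;;) {
		len = read(iter_fd, buf, sizeof(buf));
		if (len == 0)		/* no more data: iteration is done */
			break;
		if (len < 0) {
			if (errno == EINTR)
				continue;	/* interrupted by a signal, retry */
			return -1;
		}
		if (fwrite(buf, 1, (size_t)len, out) != (size_t)len)
			return -1;
		total += len;
	}
	return total;
}
```

A caller would typically pass ``stdout`` as ``out`` and close the iterator fd
after the helper returns.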
  86 The following are a few examples of selftest BPF iterator programs:
  87
  88 * `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
  89 * `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
  90 * `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_
  91
  92 Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:
  93
  94 Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
  95 <https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
  96 Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
  97 represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
  98 iterator.
  99
 100 ::
 101
 102     struct bpf_iter__task_file {
 103             union {
 104                 struct bpf_iter_meta *meta;
 105             };
 106             union {
 107                 struct task_struct *task;
 108             };
 109             u32 fd;
 110             union {
 111                 struct file *file;
 112             };
 113     };
 114
In the above code, the field ``meta`` contains the metadata, which is the same
for all BPF iterator programs. The rest of the fields are specific to different
iterators. For example, for task_file iterators, the kernel layer provides the
``task``, ``fd`` and ``file`` field values. The ``task`` and ``file`` are
`reference counted
<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
so they won't go away when the BPF program runs.
 122
Here is a snippet from the ``bpf_iter_task_file.c`` file:
 124
 125 ::
 126
  /* count, tgid, last_tgid and unique_tgid_count are global variables
   * defined elsewhere in the selftest file.
   */
  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
    struct seq_file *seq = ctx->meta->seq;
    struct task_struct *task = ctx->task;
    struct file *file = ctx->file;
    __u32 fd = ctx->fd;

    if (task == NULL || file == NULL)
      return 0;

    /* Print the column header before the first object. */
    if (ctx->meta->seq_num == 0) {
      count = 0;
      BPF_SEQ_PRINTF(seq, "    tgid      pid       fd      file\n");
    }

    if (tgid == task->tgid && task->tgid != task->pid)
      count++;

    if (last_tgid != task->tgid) {
      last_tgid = task->tgid;
      unique_tgid_count++;
    }

    BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
            (long)file->f_op);
    return 0;
  }
 155
In the above example, the section name ``SEC("iter/task_file")`` indicates that
the program is a BPF iterator program that iterates over all files from all
tasks. The context of the program is the ``bpf_iter__task_file`` struct.
 159
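The ``NULL`` check and the ``seq_num == 0`` test in the snippet make more sense
with the calling pattern in mind: the program runs once per object with an
increasing sequence number, and once more with a ``NULL`` object to signal the
end of the iteration. The following is a user space model of that pattern, not
kernel code. Every ``mock_*`` name is hypothetical and only stands in for the
corresponding kernel structure.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-ins for bpf_iter_meta and bpf_iter__task_file. */
struct mock_meta {
	FILE *seq;		/* plays the role of meta->seq */
	uint64_t seq_num;	/* index of the current object, starting at 0 */
};

struct mock_file_obj {
	int tgid, pid, fd;
};

struct mock_ctx {
	struct mock_meta *meta;
	const struct mock_file_obj *obj;	/* NULL on the final call */
};

typedef int (*mock_prog_t)(struct mock_ctx *ctx);

/* Plays the role of the BPF program: header first, then one row per object. */
int mock_dump(struct mock_ctx *ctx)
{
	if (ctx->obj == NULL)		/* end-of-iteration call, nothing to print */
		return 0;
	if (ctx->meta->seq_num == 0)
		fprintf(ctx->meta->seq, "    tgid      pid       fd\n");
	fprintf(ctx->meta->seq, "%8d %8d %8d\n",
		ctx->obj->tgid, ctx->obj->pid, ctx->obj->fd);
	return 0;
}

/* Plays the role of the kernel: one call per object, then one with NULL. */
int mock_iterate(const struct mock_file_obj *objs, size_t n,
		 mock_prog_t prog, FILE *out)
{
	struct mock_meta meta = { .seq = out, .seq_num = 0 };
	struct mock_ctx ctx = { .meta = &meta, .obj = NULL };
	int calls = 0;

	for (size_t i = 0; i < n; i++) {
		ctx.obj = &objs[i];
		prog(&ctx);
		calls++;
		meta.seq_num++;
	}
	ctx.obj = NULL;			/* final call signals the end */
	prog(&ctx);
	return calls + 1;
}
```

In this model, iterating over two objects invokes the program three times, and
the header is printed only when at least one object exists.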
The user space program invokes the BPF iterator program running in the kernel
by issuing a ``read()`` syscall. Once invoked, the BPF program can export data
to user space using a variety of BPF helper functions. Use ``bpf_seq_printf()``
(or the ``BPF_SEQ_PRINTF`` helper macro) for formatted output and
``bpf_seq_write()`` for binary data. User space applications can process the
binary data from ``bpf_seq_write()`` as needed. For formatted data, you can pin
the BPF iterator to the bpffs mount and then use ``cat <path>`` to print the
results, similar to ``cat /proc/net/netlink``. Later, use ``rm -f <path>`` to
remove the pinned iterator.
 170
 171 For example, you can use the following command to create a BPF iterator from the
 172 ``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
 173 path:
 174
 175 ::
 176
 177   $ bpftool iter pin ./bpf_iter_ipv6_route.o  /sys/fs/bpf/my_route
 178
 179 And then print out the results using the following command:
 180
 181 ::
 182
 183   $ cat /sys/fs/bpf/my_route
 184
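For the binary path mentioned earlier, user space has to decode whatever the
BPF program emitted with ``bpf_seq_write()``. The sketch below assumes a
hypothetical fixed-size record, ``struct task_file_rec``, that the BPF program
would write once per object; neither the struct nor ``decode_recs()`` exists in
the selftests. It also assumes that a ``read()`` may end in the middle of a
record, so it reports how many bytes were consumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical record layout. The BPF program would have to emit exactly
 * this struct with bpf_seq_write(seq, &rec, sizeof(rec)).
 */
struct task_file_rec {
	uint32_t tgid;
	uint32_t pid;
	uint32_t fd;
	uint32_t pad;	/* explicit padding keeps the layout unambiguous */
	uint64_t f_op;
};

/*
 * Decode as many whole records as fit in buf[0..len) into out[0..max).
 * Returns the number of records decoded. *consumed is set to the number of
 * bytes used, so the caller can keep a trailing partial record and prepend
 * it to the data from the next read().
 */
size_t decode_recs(const unsigned char *buf, size_t len,
		   struct task_file_rec *out, size_t max, size_t *consumed)
{
	size_t n = 0;

	while (n < max && len - n * sizeof(*out) >= sizeof(*out)) {
		/* memcpy avoids unaligned access into the byte buffer */
		memcpy(&out[n], buf + n * sizeof(*out), sizeof(*out));
		n++;
	}
	*consumed = n * sizeof(*out);
	return n;
}
```

Both sides must agree on the record layout, which is why the padding is
spelled out instead of being left to the compiler.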
 185
 186 -------------------------------------------------------
 187 Implement Kernel Support for BPF Iterator Program Types
 188 -------------------------------------------------------
 189
 190 To implement a BPF iterator in the kernel, the developer must make a one-time
 191 change to the following key data structure defined in the `bpf.h
 192 <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
 193 file.
 194
 195 ::
 196
 197   struct bpf_iter_reg {
 198             const char *target;
 199             bpf_iter_attach_target_t attach_target;
 200             bpf_iter_detach_target_t detach_target;
 201             bpf_iter_show_fdinfo_t show_fdinfo;
 202             bpf_iter_fill_link_info_t fill_link_info;
 203             bpf_iter_get_func_proto_t get_func_proto;
 204             u32 ctx_arg_info_size;
 205             u32 feature;
 206             struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
 207             const struct bpf_iter_seq_info *seq_info;
 208   };
 209
 210 After filling the data structure fields, call ``bpf_iter_reg_target()`` to
 211 register the iterator to the main BPF iterator subsystem.
 212
 213 The following is the breakdown for each field in struct ``bpf_iter_reg``.
 214
 215 .. list-table::
 216    :widths: 25 50
 217    :header-rows: 1
 218
 219    * - Fields
 220      - Description
 221    * - target
 222      - Specifies the name of the BPF iterator. For example: ``bpf_map``,
 223        ``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel.
 224    * - attach_target and detach_target
 225      - Allows for target specific ``link_create`` action since some targets
 226        may need special processing. Called during the user space link_create stage.
 227    * - show_fdinfo and fill_link_info
 228      - Called to fill target specific information when user tries to get link
 229        info associated with the iterator.
 230    * - get_func_proto
 231      - Permits a BPF iterator to access BPF helpers specific to the iterator.
 232    * - ctx_arg_info_size and ctx_arg_info
 233      - Specifies the verifier states for BPF program arguments associated with
 234        the bpf iterator.
   * - feature
     - Specifies certain action requests in the kernel BPF iterator
       infrastructure. Currently, only ``BPF_ITER_RESCHED`` is supported. This
       means that the kernel function ``cond_resched()`` is called to avoid
       other kernel subsystems (e.g., RCU) misbehaving.
   * - seq_info
     - Specifies the set of seq operations for the BPF iterator and helpers to
       initialize/free the private data for the corresponding ``seq_file``.
 245
 246
 247 `Click here
 248 <https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
 249 to see an implementation of the ``task_vma`` BPF iterator in the kernel.
 250
 251 ---------------------------------
 252 Parameterizing BPF Task Iterators
 253 ---------------------------------
 254
 255 By default, BPF iterators walk through all the objects of the specified types
 256 (processes, cgroups, maps, etc.) across the entire system to read relevant
 257 kernel data. But often, there are cases where we only care about a much smaller
 258 subset of iterable kernel objects, such as only iterating tasks within a
 259 specific process. Therefore, BPF iterator programs support filtering out objects
 260 from iteration by allowing user space to configure the iterator program when it
 261 is attached.
 262
 263 --------------------------
 264 BPF Task Iterator Program
 265 --------------------------
 266
The following code is a BPF iterator program that prints file and task
information through the ``seq_file`` of the iterator. It is a standard BPF
iterator program that visits every file the iterator yields. We will use this
BPF program in our example later.
 271
 272 ::
 273
  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h> /* provides BPF_SEQ_PRINTF */

  char _license[] SEC("license") = "GPL";

  SEC("iter/task_file")
  int dump_task_file(struct bpf_iter__task_file *ctx)
  {
        struct seq_file *seq = ctx->meta->seq;
        struct task_struct *task = ctx->task;
        struct file *file = ctx->file;
        __u32 fd = ctx->fd;

        if (task == NULL || file == NULL)
                return 0;

        /* Print the column header before the first object. */
        if (ctx->meta->seq_num == 0)
                BPF_SEQ_PRINTF(seq, "    tgid      pid       fd      file\n");

        BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
                        (long)file->f_op);
        return 0;
  }
 295
 296 ----------------------------------------
 297 Creating a File Iterator with Parameters
 298 ----------------------------------------
 299
 300 Now, let us look at how to create an iterator that includes only files of a
 301 process.
 302
First, fill the ``bpf_iter_attach_opts`` struct as shown below:
 304
 305 ::
 306
 307   LIBBPF_OPTS(bpf_iter_attach_opts, opts);
 308   union bpf_iter_link_info linfo;
 309   memset(&linfo, 0, sizeof(linfo));
 310   linfo.task.pid = getpid();
 311   opts.link_info = &linfo;
 312   opts.link_info_len = sizeof(linfo);
 313
If ``linfo.task.pid`` is non-zero, it directs the kernel to create an iterator
that only includes opened files for the process with the specified ``pid``. In
this example, we will only be iterating files for our process. If
``linfo.task.pid`` is zero, the iterator will visit every opened file of every
process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
that visits opened files of a specific thread, not a process. In this example,
using ``linfo.task.tid`` gives a different result from using ``linfo.task.pid``
only if the thread has a separate file descriptor table. In most circumstances,
all process threads share a single file descriptor table.
 323
Now, in the user space program, pass a pointer to the struct to
``bpf_program__attach_iter()``.
 326
 327 ::
 328
  link = bpf_program__attach_iter(prog, &opts);
  iter_fd = bpf_iter_create(bpf_link__fd(link));
 331
If both *tid* and *pid* are zero, an iterator created from this struct
``bpf_iter_attach_opts`` will include every opened file of every task in the
system (more precisely, in the current *pid* namespace). It is the same as
passing NULL as the second argument to ``bpf_program__attach_iter()``.
 336
 337 The whole program looks like the following code:
 338
 339 ::
 340
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>
  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>
  #include "bpf_iter_task_ex.skel.h"

  static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
  {
        struct bpf_link *link;
        char buf[16] = {};
        int iter_fd = -1, len;
        int ret = 0;

        link = bpf_program__attach_iter(prog, opts);
        if (!link) {
                fprintf(stderr, "bpf_program__attach_iter() fails\n");
                return -1;
        }
        iter_fd = bpf_iter_create(bpf_link__fd(link));
        if (iter_fd < 0) {
                fprintf(stderr, "bpf_iter_create() fails\n");
                ret = -1;
                goto free_link;
        }
        /* Print the iterator output; leave room in buf for the terminating NUL. */
        while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
                buf[len] = 0;
                printf("%s", buf);
        }
        if (len < 0) {
                fprintf(stderr, "read() fails\n");
                ret = -1;
        }
        printf("\n");
  free_link:
        if (iter_fd >= 0)
                close(iter_fd);
        bpf_link__destroy(link);
        return ret;
  }

  static void test_task_file(void)
  {
        LIBBPF_OPTS(bpf_iter_attach_opts, opts);
        struct bpf_iter_task_ex *skel;
        union bpf_iter_link_info linfo;

        skel = bpf_iter_task_ex__open_and_load();
        if (skel == NULL)
                return;
        /* Restrict the iterator to the files of the current process. */
        memset(&linfo, 0, sizeof(linfo));
        linfo.task.pid = getpid();
        opts.link_info = &linfo;
        opts.link_info_len = sizeof(linfo);
        printf("PID %d\n", getpid());
        do_read_opts(skel->progs.dump_task_file, &opts);
        bpf_iter_task_ex__destroy(skel);
  }

  int main(int argc, const char * const * argv)
  {
        test_task_file();
        return 0;
  }
 400
The following lines are the output of the program:

::
 403
 404   PID 1859
 405
 406      tgid      pid       fd      file
 407      1859     1859        0 ffffffff82270aa0
 408      1859     1859        1 ffffffff82270aa0
 409      1859     1859        2 ffffffff82270aa0
 410      1859     1859        3 ffffffff82272980
 411      1859     1859        4 ffffffff8225e120
 412      1859     1859        5 ffffffff82255120
 413      1859     1859        6 ffffffff82254f00
 414      1859     1859        7 ffffffff82254d80
 415      1859     1859        8 ffffffff8225abe0
 416
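A consumer that wants the values rather than the text can parse each row of
this table. The following is a minimal sketch that assumes the four-column
format printed by the example program above; ``struct task_file_row`` and
``parse_task_file_row()`` are hypothetical names introduced for illustration.

```c
#include <stdio.h>

/* One row of the "tgid pid fd file" table printed by the example program. */
struct task_file_row {
	int tgid;
	int pid;
	int fd;
	unsigned long long f_op;	/* printed in hex by the BPF program */
};

/*
 * Parse one output line. Returns 1 on success and 0 for lines that are not
 * data rows, such as the column header, the "PID ..." banner or an empty
 * line.
 */
int parse_task_file_row(const char *line, struct task_file_row *row)
{
	return sscanf(line, "%d %d %d %llx",
		      &row->tgid, &row->pid, &row->fd, &row->f_op) == 4;
}
```

Because header and banner lines simply fail to parse, the caller can feed
every line of the output to the function and keep only the rows that succeed.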
 417 ------------------
 418 Without Parameters
 419 ------------------
 420
Let us look at how a BPF iterator without parameters skips files of other
processes in the system. In this case, the BPF program has to check the pid or
the tid of tasks, or it will receive every opened file in the system (more
precisely, in the current *pid* namespace). So, we usually add a global variable
in the BPF program to pass a *pid* to the BPF program.
 426
 427 The BPF program would look like the following block.
 428
 429   ::
 430
 431     ......
 432     int target_pid = 0;
 433
 434     SEC("iter/task_file")
 435     int dump_task_file(struct bpf_iter__task_file *ctx)
 436     {
 437           ......
 438           if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
 439                   return 0;
 440           BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
 441                           (long)file->f_op);
 442           return 0;
 443     }
 444
 445 The user space program would look like the following block:
 446
 447   ::
 448
 449     ......
 450     static void test_task_file(void)
 451     {
 452           ......
 453           skel = bpf_iter_task_ex__open_and_load();
 454           if (skel == NULL)
 455                   return;
 456           skel->bss->target_pid = getpid(); /* process ID.  For thread id, use gettid() */
 457           memset(&linfo, 0, sizeof(linfo));
 458           linfo.task.pid = getpid();
 459           opts.link_info = &linfo;
 460           opts.link_info_len = sizeof(linfo);
 461           ......
 462     }
 463
``target_pid`` is a global variable in the BPF program. The user space program
should initialize the variable with a process ID to skip opened files of other
processes in the BPF program. By contrast, when you parametrize a BPF iterator,
the kernel calls the BPF program fewer times, which can save significant
resources.
 468
 469 ---------------------------
 470 Parametrizing VMA Iterators
 471 ---------------------------
 472
By default, a BPF VMA iterator includes every VMA in every process. However,
you can still specify a process or a thread to include only its VMAs. Unlike
files, a thread cannot have a separate address space (since Linux 2.6.0-test6).
Here, using *tid* makes no difference from using *pid*.
 477
 478 ----------------------------
 479 Parametrizing Task Iterators
 480 ----------------------------
 481
 482 A BPF task iterator with *pid* includes all tasks (threads) of a process. The
 483 BPF program receives these tasks one after another. You can specify a BPF task
 484 iterator with *tid* parameter to include only the tasks that match the given
 485 *tid*.
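
The difference between the two selectors can be illustrated with a small user
space model. This is only a sketch of the selection rule described above, not
kernel code: ``struct mock_task`` and ``mock_count_visited()`` are hypothetical,
and treating a request that sets both selectors as invalid is an assumption of
the model.

```c
#include <stddef.h>

/* Hypothetical model of a task: tgid is the process ID, pid the thread ID. */
struct mock_task {
	int tgid;
	int pid;
};

/*
 * Count the tasks a task iterator would visit for the given parameters.
 * A non-zero pid selects every thread of that process, a non-zero tid
 * selects only the matching thread, and zero for both selects every task.
 * Returns -1 if both selectors are set (assumed invalid in this model).
 */
long mock_count_visited(const struct mock_task *tasks, size_t n,
			int pid, int tid)
{
	long visited = 0;

	if (pid != 0 && tid != 0)
		return -1;

	for (size_t i = 0; i < n; i++) {
		if (tid != 0 && tasks[i].pid != tid)
			continue;	/* not the requested thread */
		if (pid != 0 && tasks[i].tgid != pid)
			continue;	/* not in the requested process */
		visited++;
	}
	return visited;
}
```

For a process with three threads on a system with one other single-threaded
process, the model visits four tasks without parameters, three with *pid*, and
one with *tid*.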