git.kernel.dk Git - linux-2.6-block.git/log

projects / linux-2.6-block.git / log

summary | shortlog | log | commit | commitdiff | tree
first ⋅ prev ⋅ next

commit | commitdiff | tree

Alexei Starovoitov [Tue, 22 Aug 2023 19:52:48 +0000 (12:52 -0700)]

Merge branch 'fix-for-check_func_arg_reg_off'

Kumar Kartikeya Dwivedi says:

====================
Fix for check_func_arg_reg_off

Remove a leftover hunk in check_func_arg_reg_off that incorrectly
bypasses reg->off == 0 requirement for release kfuncs and helpers.
====================

Link: https://lore.kernel.org/r/20230822175140.1317749-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Kumar Kartikeya Dwivedi [Tue, 22 Aug 2023 17:51:40 +0000 (23:21 +0530)]

selftests/bpf: Add test for bpf_obj_drop with bad reg->off

Add a selftest for the fix provided in the previous commit. Without the
fix, the selftest passes the verifier while it should fail. The special
logic for detecting graph root or node for reg->off and bypassing
reg->off == 0 guarantee for release helpers/kfuncs has been dropped.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230822175140.1317749-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Kumar Kartikeya Dwivedi [Tue, 22 Aug 2023 17:51:39 +0000 (23:21 +0530)]

bpf: Fix check_func_arg_reg_off bug for graph root/node

The commit being fixed introduced a hunk into check_func_arg_reg_off
that bypasses reg->off == 0 enforcement when offset points to a graph
node or root. This might possibly be done for treating bpf_rbtree_remove
and others as KF_RELEASE and then later check correct reg->off in helper
argument checks.

But this is not the case, those helpers are already not KF_RELEASE and
permit non-zero reg->off and verify it later to match the subobject in
BTF type.

However, this logic leads to bpf_obj_drop permitting free of register
arguments with non-zero offset when they point to a graph root or node
within them, which is not ok.

For instance:

struct foo {
int i;
int j;
struct bpf_rb_node node;
};

struct foo *f = bpf_obj_new(typeof(*f));
if (!f) ...
bpf_obj_drop(f); // OK
bpf_obj_drop(&f->i); // still ok from verifier PoV
bpf_obj_drop(&f->node); // Not OK, but permitted right now

Fix this by dropping the whole part of code altogether.

Fixes: 6a3cd3318ff6 ("bpf: Migrate release_on_unlock logic to non-owning ref semantics")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230822175140.1317749-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Yonghong Song [Tue, 22 Aug 2023 05:00:58 +0000 (22:00 -0700)]

selftests/bpf: Add a failure test for bpf_kptr_xchg() with local kptr

For a bpf_kptr_xchg() with local kptr, if the map value kptr type and
allocated local obj type does not match, with the previous patch,
the below verifier error message will be logged:
R2 is of type <allocated local obj type> but <map value kptr type> is expected

Without the previous patch, the test will have unexpected success.

Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230822050058.2887354-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Yonghong Song [Tue, 22 Aug 2023 05:00:53 +0000 (22:00 -0700)]

bpf: Fix a bpf_kptr_xchg() issue with local kptr

When reviewing local percpu kptr support, Alexei discovered a bug
wherea bpf_kptr_xchg() may succeed even if the map value kptr type and
locally allocated obj type do not match ([1]). Missed struct btf_id
comparison is the reason for the bug. This patch added such struct btf_id
comparison and will flag verification failure if types do not match.

[1] https://lore.kernel.org/bpf/20230819002907.io3iphmnuk43xblu@macbook-pro-8.dhcp.thefacebook.com/#t

Reported-by: Alexei Starovoitov <ast@kernel.org>
Fixes: 738c96d5e2e3 ("bpf: Allow local kptrs to be exchanged via bpf_kptr_xchg")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20230822050053.2886960-1-yonghong.song@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Alexei Starovoitov [Mon, 21 Aug 2023 22:51:28 +0000 (15:51 -0700)]

Merge branch 'bpf-add-multi-uprobe-link'

Jiri Olsa says:

====================
bpf: Add multi uprobe link

hi,
this patchset is adding support to attach multiple uprobes and usdt probes
through new uprobe_multi link.

The current uprobe is attached through the perf event and attaching many
uprobes takes a lot of time because of that.

The main reason is that we need to install perf event for each probed function
and profile shows perf event installation (perf_install_in_context) as culprit.

The new uprobe_multi link just creates raw uprobes and attaches the bpf
program to them without perf event being involved.

In addition to being faster we also save file descriptors. For the current
uprobe attach we use extra perf event fd for each probed function. The new
link just need one fd that covers all the functions we are attaching to.

v7 changes:
  - fixed task release on error path and re-org the error
    path to be more straightforward [Yonghong]
  - re-organized uprobe_prog_run locking to follow general pattern
    and removed might_fault check as it's not needed in uprobe/task
    context [Yonghong]

There's support for bpftrace [2] and tetragon [1].

Also available at:
  https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
  uprobe_multi

thanks,
jirka

[1] https://github.com/cilium/tetragon/pull/936
[2] https://github.com/iovisor/bpftrace/compare/master...olsajiri:bpftrace:uprobe_multi
[3] https://lore.kernel.org/bpf/20230628115329.248450-1-laoar.shao@gmail.com/
---
====================

Link: https://lore.kernel.org/r/20230809083440.3209381-1-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:40 +0000 (10:34 +0200)]

selftests/bpf: Add extra link to uprobe_multi tests

Attaching extra program to same functions system wide for api
and link tests.

This way we can test the pid filter works properly when there's
extra system wide consumer on the same uprobe that will trigger
the original uprobe handler.

We expect to have the same counts as before.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-29-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:39 +0000 (10:34 +0200)]

selftests/bpf: Add uprobe_multi pid filter tests

Running api and link tests also with pid filter and checking
the probe gets executed only for specific pid.

Spawning extra process to trigger attached uprobes and checking
we get correct counts from executed programs.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-28-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:38 +0000 (10:34 +0200)]

selftests/bpf: Add uprobe_multi cookie test

Adding test for cookies setup/retrieval in uprobe_link uprobes
and making sure bpf_get_attach_cookie works properly.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-27-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:37 +0000 (10:34 +0200)]

selftests/bpf: Add uprobe_multi usdt bench test

Adding test that attaches 50k usdt probes in usdt_multi binary.

After the attach is done we run the binary and make sure we get
proper amount of hits.

With current uprobes:

  # perf stat --null ./test_progs -n 254/6
  #254/6   uprobe_multi_test/bench_usdt:OK
  #254     uprobe_multi_test:OK
  Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED

   Performance counter stats for './test_progs -n 254/6':

      1353.659680562 seconds time elapsed

With uprobe_multi link:

  # perf stat --null ./test_progs -n 254/6
  #254/6   uprobe_multi_test/bench_usdt:OK
  #254     uprobe_multi_test:OK
  Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED

   Performance counter stats for './test_progs -n 254/6':

         0.322046364 seconds time elapsed

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-26-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:36 +0000 (10:34 +0200)]

selftests/bpf: Add uprobe_multi usdt test code

Adding code in uprobe_multi test binary that defines 50k usdts
and will serve as attach point for uprobe_multi usdt bench test
in following patch.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-25-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:35 +0000 (10:34 +0200)]

selftests/bpf: Add uprobe_multi bench test

Adding test that attaches 50k uprobes in uprobe_multi binary.

After the attach is done we run the binary and make sure we
get proper amount of hits.

The resulting attach/detach times on my setup:

  test_bench_attach_uprobe:PASS:uprobe_multi__open 0 nsec
  test_bench_attach_uprobe:PASS:uprobe_multi__attach 0 nsec
  test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
  test_bench_attach_uprobe: attached in   0.346s
  test_bench_attach_uprobe: detached in   0.419s
  #262/5   uprobe_multi_test/bench_uprobe:OK

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-24-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:34 +0000 (10:34 +0200)]

selftests/bpf: Add uprobe_multi test program

Adding uprobe_multi test program that defines 50k uprobe_multi_func_*
functions and will serve as attach point for uprobe_multi bench test
in following patch.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-23-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:33 +0000 (10:34 +0200)]

selftests/bpf: Add uprobe_multi link test

Adding uprobe_multi test for bpf_link_create attach function.

Testing attachment using the struct bpf_link_create_opts.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-22-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:32 +0000 (10:34 +0200)]

selftests/bpf: Add uprobe_multi api test

Adding uprobe_multi test for bpf_program__attach_uprobe_multi
attach function.

Testing attachment using glob patterns and via bpf_uprobe_multi_opts
paths/syms fields.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-21-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:31 +0000 (10:34 +0200)]

selftests/bpf: Add uprobe_multi skel test

Adding uprobe_multi test for skeleton load/attach functions,
to test skeleton auto attach for uprobe_multi link.

Test that bpf_get_func_ip works properly for uprobe_multi
attachment.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-20-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:30 +0000 (10:34 +0200)]

selftests/bpf: Move get_time_ns to testing_helpers.h

We'd like to have single copy of get_time_ns used b bench and test_progs,
but we can't just include bench.h, because of conflicting 'struct env'
objects.

Moving get_time_ns to testing_helpers.h which is being included by both
bench and test_progs objects.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-19-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:29 +0000 (10:34 +0200)]

libbpf: Add uprobe multi link support to bpf_program__attach_usdt

Adding support for usdt_manager_attach_usdt to use uprobe_multi
link to attach to usdt probes.

The uprobe_multi support is detected before the usdt program is
loaded and its expected_attach_type is set accordingly.

If uprobe_multi support is detected the usdt_manager_attach_usdt
gathers uprobes info and calls bpf_program__attach_uprobe to
create all needed uprobes.

If uprobe_multi support is not detected the old behaviour stays.

Also adding usdt.s program section for sleepable usdt probes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-18-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:28 +0000 (10:34 +0200)]

libbpf: Add uprobe multi link detection

Adding uprobe-multi link detection. It will be used later in
bpf_program__attach_usdt function to check and use uprobe_multi
link over standard uprobe links.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-17-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:27 +0000 (10:34 +0200)]

libbpf: Add support for u[ret]probe.multi[.s] program sections

Adding support for several uprobe_multi program sections
to allow auto attach of multi_uprobe programs.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-16-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:26 +0000 (10:34 +0200)]

libbpf: Add bpf_program__attach_uprobe_multi function

Adding bpf_program__attach_uprobe_multi function that
allows to attach multiple uprobes with uprobe_multi link.

The user can specify uprobes with direct arguments:

  binary_path/func_pattern/pid

or with struct bpf_uprobe_multi_opts opts argument fields:

  const char **syms;
  const unsigned long *offsets;
  const unsigned long *ref_ctr_offsets;
  const __u64 *cookies;

User can specify 2 mutually exclusive set of inputs:

1) use only path/func_pattern/pid arguments

2) use path/pid with allowed combinations of:
    syms/offsets/ref_ctr_offsets/cookies/cnt

    - syms and offsets are mutually exclusive
    - ref_ctr_offsets and cookies are optional

Any other usage results in error.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-15-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:25 +0000 (10:34 +0200)]

libbpf: Add bpf_link_create support for multi uprobes

Adding new uprobe_multi struct to bpf_link_create_opts object
to pass multiple uprobe data to link_create attr uapi.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-14-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:24 +0000 (10:34 +0200)]

libbpf: Add elf_resolve_pattern_offsets function

Adding elf_resolve_pattern_offsets function that looks up
offsets for symbols specified by pattern argument.

The 'pattern' argument allows wildcards (*?' supported).

Offsets are returned in allocated array together with its
size and needs to be released by the caller.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-13-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:23 +0000 (10:34 +0200)]

libbpf: Add elf_resolve_syms_offsets function

Adding elf_resolve_syms_offsets function that looks up
offsets for symbols specified in syms array argument.

Offsets are returned in allocated array with the 'cnt' size,
that needs to be released by the caller.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-12-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:22 +0000 (10:34 +0200)]

libbpf: Add elf symbol iterator

Adding elf symbol iterator object (and some functions) that follow
open-coded iterator pattern and some functions to ease up iterating
elf object symbols.

The idea is to iterate single symbol section with:

  struct elf_sym_iter iter;
  struct elf_sym *sym;

  if (elf_sym_iter_new(&iter, elf, binary_path, SHT_DYNSYM))
        goto error;

  while ((sym = elf_sym_iter_next(&iter))) {
        ...
  }

I considered opening the elf inside the iterator and iterate all symbol
sections, but then it gets more complicated wrt user checks for when
the next section is processed.

Plus side is the we don't need 'exit' function, because caller/user is
in charge of that.

The returned iterated symbol object from elf_sym_iter_next function
is placed inside the struct elf_sym_iter, so no extra allocation or
argument is needed.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-11-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:21 +0000 (10:34 +0200)]

libbpf: Add elf_open/elf_close functions

Adding elf_open/elf_close functions and using it in
elf_find_func_offset_from_file function. It will be
used in following changes to save some common code.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-10-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:20 +0000 (10:34 +0200)]

libbpf: Move elf_find_func_offset* functions to elf object

Adding new elf object that will contain elf related functions.
There's no functional change.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-9-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:19 +0000 (10:34 +0200)]

libbpf: Add uprobe_multi attach type and link names

Adding new uprobe_multi attach type and link names,
so the functions can resolve the new values.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20230809083440.3209381-8-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:18 +0000 (10:34 +0200)]

bpf: Add bpf_get_func_ip helper support for uprobe link

Adding support for bpf_get_func_ip helper being called from
ebpf program attached by uprobe_multi link.

It returns the ip of the uprobe.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-7-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:17 +0000 (10:34 +0200)]

bpf: Add pid filter support for uprobe_multi link

Adding support to specify pid for uprobe_multi link and the uprobes
are created only for task with given pid value.

Using the consumer.filter filter callback for that, so the task gets
filtered during the uprobe installation.

We still need to check the task during runtime in the uprobe handler,
because the handler could get executed if there's another system
wide consumer on the same uprobe (thanks Oleg for the insight).

Cc: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-6-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:16 +0000 (10:34 +0200)]

bpf: Add cookies support for uprobe_multi link

Adding support to specify cookies array for uprobe_multi link.

The cookies array share indexes and length with other uprobe_multi
arrays (offsets/ref_ctr_offsets).

The cookies[i] value defines cookie for i-the uprobe and will be
returned by bpf_get_attach_cookie helper when called from ebpf
program hooked to that specific uprobe.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-5-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:15 +0000 (10:34 +0200)]

bpf: Add multi uprobe link

Adding new multi uprobe link that allows to attach bpf program
to multiple uprobes.

Uprobes to attach are specified via new link_create uprobe_multi
union:

  struct {
    __aligned_u64   path;
    __aligned_u64   offsets;
    __aligned_u64   ref_ctr_offsets;
    __u32           cnt;
    __u32           flags;
  } uprobe_multi;

Uprobes are defined for single binary specified in path and multiple
calling sites specified in offsets array with optional reference
counters specified in ref_ctr_offsets array. All specified arrays
have length of 'cnt'.

The 'flags' supports single bit for now that marks the uprobe as
return probe.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-4-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:14 +0000 (10:34 +0200)]

bpf: Add attach_type checks under bpf_prog_attach_check_attach_type

Add extra attach_type checks from link_create under
bpf_prog_attach_check_attach_type.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Jiri Olsa [Wed, 9 Aug 2023 08:34:13 +0000 (10:34 +0200)]

bpf: Switch BPF_F_KPROBE_MULTI_RETURN macro to enum

Switching BPF_F_KPROBE_MULTI_RETURN macro to anonymous enum,
so it'd show up in vmlinux.h. There's not functional change
compared to having this as macro.

Acked-by: Yafang Shao <laoar.shao@gmail.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230809083440.3209381-2-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Alexei Starovoitov [Mon, 21 Aug 2023 22:39:10 +0000 (15:39 -0700)]

Merge branch 'samples-bpf-make-bpf-programs-more-libbpf-aware'

Daniel T. Lee says:

====================
samples/bpf: make BPF programs more libbpf aware

The existing tracing programs have been developed for a considerable
period of time and, as a result, do not properly incorporate the
features of the current libbpf, such as CO-RE. This is evident in
frequent usage of functions like PT_REGS* and the persistence of "hack"
methods using underscore-style bpf_probe_read_kernel from the past.
These programs are far behind the current level of libbpf and can
potentially confuse users.

The kernel has undergone significant changes, and some of these changes
have broken these programs, but on the other hand, more robust APIs have
been developed for increased stableness.

To list some of the kernel changes that this patch set is focusing on,
- symbol mismatch occurs due to compiler optimization [1]
- inline of blk_account_io* breaks BPF kprobe program [2]
- new tracepoints for the block_io_start/done are introduced [3]
- map lookup probes can't be triggered (bpf_disable_instrumentation)[4]
- BPF_KSYSCALL has been introduced to simplify argument fetching [5]
- convert to vmlinux.h and use tp argument structure within it
- make tracing programs to be more CO-RE centric

In this regard, this patch set aims not only to integrate the latest
features of libbpf into BPF programs but also to reduce confusion and
clarify the BPF programs. This will help with the potential confusion
among users and make the programs more intutitive.

[1]: https://github.com/iovisor/bcc/issues/1754
[2]: https://github.com/iovisor/bcc/issues/4261
[3]: commit 5a80bd075f3b ("block: introduce block_io_start/block_io_done tracepoints")
[4]: commit 7c4cd051add3 ("bpf: Fix syscall's stackmap lookup potential deadlock")
[5]: commit 6f5d467d55f0 ("libbpf: improve BPF_KPROBE_SYSCALL macro and rename it to BPF_KSYSCALL")
====================

Link: https://lore.kernel.org/r/20230818090119.477441-1-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Daniel T. Lee [Fri, 18 Aug 2023 09:01:19 +0000 (18:01 +0900)]

samples/bpf: simplify spintest with kprobe.multi

With the introduction of kprobe.multi, it is now possible to attach
multiple kprobes to a single BPF program without the need for multiple
definitions. Additionally, this method supports wildcard-based
matching, allowing for further simplification of BPF programs. In here,
an asterisk (*) wildcard is used to map to all symbols relevant to
spin_{lock|unlock}.

Furthermore, since kprobe.multi handles symbol matching, this commit
eliminates the need for the previous logic of reading the ksym table to
verify the existence of symbols.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-10-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Daniel T. Lee [Fri, 18 Aug 2023 09:01:18 +0000 (18:01 +0900)]

samples/bpf: refactor syscall tracing programs using BPF_KSYSCALL macro

This commit refactors the syscall tracing programs by adopting the
BPF_KSYSCALL macro. This change aims to enhance the clarity and
simplicity of the BPF programs by reducing the complexity of argument
parsing from pt_regs.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-9-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Daniel T. Lee [Fri, 18 Aug 2023 09:01:17 +0000 (18:01 +0900)]

samples/bpf: fix broken map lookup probe

In the commit 7c4cd051add3 ("bpf: Fix syscall's stackmap lookup
potential deadlock"), a potential deadlock issue was addressed, which
resulted in *_map_lookup_elem not triggering BPF programs.
(prior to lookup, bpf_disable_instrumentation() is used)

To resolve the broken map lookup probe using "htab_map_lookup_elem",
this commit introduces an alternative approach. Instead, it utilize
"bpf_map_copy_value" and apply a filter specifically for the hash table
with map_type.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Fixes: 7c4cd051add3 ("bpf: Fix syscall's stackmap lookup potential deadlock")
Link: https://lore.kernel.org/r/20230818090119.477441-8-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Daniel T. Lee [Fri, 18 Aug 2023 09:01:16 +0000 (18:01 +0900)]

samples/bpf: fix bio latency check with tracepoint

Recently, a new tracepoint for the block layer, specifically the
block_io_start/done tracepoints, was introduced in commit 5a80bd075f3b
("block: introduce block_io_start/block_io_done tracepoints").

Previously, the kprobe entry used for this purpose was quite unstable
and inherently broke relevant probes [1]. Now that a stable tracepoint
is available, this commit replaces the bio latency check with it.

One of the changes made during this replacement is the key used for the
hash table. Since 'struct request' cannot be used as a hash key, the
approach taken follows that which was implemented in bcc/biolatency [2].
(uses dev:sector for the key)

[1]: https://github.com/iovisor/bcc/issues/4261
[2]: https://github.com/iovisor/bcc/pull/4691

Fixes: 450b7879e345 ("block: move blk_account_io_{start,done} to blk-mq.c")
Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-7-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Daniel T. Lee [Fri, 18 Aug 2023 09:01:15 +0000 (18:01 +0900)]

samples/bpf: make tracing programs to be more CO-RE centric

The existing tracing programs have been developed for a considerable
period of time and, as a result, do not properly incorporate the
features of the current libbpf, such as CO-RE. This is evident in
frequent usage of functions like PT_REGS* and the persistence of "hack"
methods using underscore-style bpf_probe_read_kernel from the past.

These programs are far behind the current level of libbpf and can
potentially confuse users. Therefore, this commit aims to convert the
outdated BPF programs to be more CO-RE centric.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-6-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Daniel T. Lee [Fri, 18 Aug 2023 09:01:14 +0000 (18:01 +0900)]

samples/bpf: fix symbol mismatch by compiler optimization

Currently, multiple kprobe programs are suffering from symbol mismatch
due to compiler optimization. These optimizations might induce
additional suffix to the symbol name such as '.isra' or '.constprop'.

    # egrep ' finish_task_switch| __netif_receive_skb_core' /proc/kallsyms
    ffffffff81135e50 t finish_task_switch.isra.0
    ffffffff81dd36d0 t __netif_receive_skb_core.constprop.0
    ffffffff8205cc0e t finish_task_switch.isra.0.cold
    ffffffff820b1aba t __netif_receive_skb_core.constprop.0.cold

To avoid this, this commit replaces the original kprobe section to
kprobe.multi in order to match symbol with wildcard characters. Here,
asterisk is used for avoiding symbol mismatch.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-5-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Daniel T. Lee [Fri, 18 Aug 2023 09:01:13 +0000 (18:01 +0900)]

samples/bpf: unify bpf program suffix to .bpf with tracing programs

Currently, BPF programs typically have a suffix of .bpf.c. However,
some programs still utilize a mixture of _kern.c suffix alongside the
naming convention. In order to achieve consistency in the naming of
these programs, this commit unifies the inconsistency in the naming
convention of BPF kernel programs.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-4-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Daniel T. Lee [Fri, 18 Aug 2023 09:01:12 +0000 (18:01 +0900)]

samples/bpf: convert to vmlinux.h with tracing programs

This commit replaces separate headers with a single vmlinux.h to
tracing programs. Thanks to that, we no longer need to define the
argument structure for tracing programs directly. For example, argument
for the sched_switch tracpepoint (sched_switch_args) can be replaced
with the vmlinux.h provided trace_event_raw_sched_switch.

Additional defines have been added to the BPF program either directly
or through the inclusion of net_shared.h. Defined values are
PERF_MAX_STACK_DEPTH, IFNAMSIZ constants and __stringify() macro. This
change enables the BPF program to access internal structures with BTF
generated "vmlinux.h" header.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-3-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Daniel T. Lee [Fri, 18 Aug 2023 09:01:11 +0000 (18:01 +0900)]

samples/bpf: fix warning with ignored-attributes

Currently, compiling the bpf programs will result the warning with the
ignored attribute as follows. This commit fixes the warning by adding
cf-protection option.

    In file included from ./arch/x86/include/asm/linkage.h:6:
    ./arch/x86/include/asm/ibt.h:77:8: warning: 'nocf_check' attribute ignored; use -fcf-protection to enable the attribute [-Wignored-attributes]
    extern __noendbr u64 ibt_save(bool disable);
           ^
    ./arch/x86/include/asm/ibt.h:32:34: note: expanded from macro '__noendbr'
                                       ^

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Link: https://lore.kernel.org/r/20230818090119.477441-2-danieltimlee@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Alexei Starovoitov [Mon, 21 Aug 2023 22:21:16 +0000 (15:21 -0700)]

Merge branch 'remove-unnecessary-synchronizations-in-cpumap'

Hou Tao says:

====================
Remove unnecessary synchronizations in cpumap

From: Hou Tao <houtao1@huawei.com>

Hi,

This is the formal patchset to remove unnecessary synchronizations in
cpu-map after address comments and collect Rvb tags from Toke
Høiland-Jørgensen (Big thanks to Toke). Patch #1 removes the unnecessary
rcu_barrier() when freeing bpf_cpu_map_entry and replaces it by
queue_rcu_work(). Patch #2 removes the unnecessary call_rcu() and
queue_work() when destroying cpu-map and does the freeing directly.

Test the patchset by using xdp_redirect_cpu and virtio-net. Both
xdp-mode and skb-mode have been exercised and no issues were reported.
As ususal, comments and suggestions are always welcome.

Change Log:
v1:
  * address comments from Toke Høiland-Jørgensen
  * add Rvb tags from Toke Høiland-Jørgensen
  * update outdated comment in cpu_map_delete_elem()

RFC: https://lore.kernel.org/bpf/20230728023030.1906124-1-houtao@huaweicloud.com
====================

Link: https://lore.kernel.org/r/20230816045959.358059-1-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Hou Tao [Wed, 16 Aug 2023 04:59:58 +0000 (12:59 +0800)]

bpf, cpumask: Clean up bpf_cpu_map_entry directly in cpu_map_free

After synchronous_rcu(), both the dettached XDP program and
xdp_do_flush() are completed, and the only user of bpf_cpu_map_entry
will be cpu_map_kthread_run(), so instead of calling
__cpu_map_entry_replace() to stop kthread and cleanup entry after a RCU
grace period, do these things directly.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230816045959.358059-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Hou Tao [Wed, 16 Aug 2023 04:59:57 +0000 (12:59 +0800)]

bpf, cpumap: Use queue_rcu_work() to remove unnecessary rcu_barrier()

As for now __cpu_map_entry_replace() uses call_rcu() to wait for the
inflight xdp program to exit the RCU read critical section, and then
launch kworker cpu_map_kthread_stop() to call kthread_stop() to flush
all pending xdp frames or skbs.

But it is unnecessary to use rcu_barrier() in cpu_map_kthread_stop() to
wait for the completion of __cpu_map_entry_free(), because rcu_barrier()
will wait for all pending RCU callbacks and cpu_map_kthread_stop() only
needs to wait for the completion of a specific __cpu_map_entry_free().

So use queue_rcu_work() to replace call_rcu(), schedule_work() and
rcu_barrier(). queue_rcu_work() will queue a __cpu_map_entry_free()
kworker after a RCU grace period. Because __cpu_map_entry_free() is
running in a kworker context, so it is OK to do all of these freeing
procedures include kthread_stop() in it.

After the update, there is no need to do reference-counting for
bpf_cpu_map_entry, because bpf_cpu_map_entry is freed directly in
__cpu_map_entry_free(), so just remove it.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20230816045959.358059-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

commit | commitdiff | tree

Yonghong Song [Fri, 18 Aug 2023 17:43:12 +0000 (10:43 -0700)]

selftests/bpf: Fix a selftest compilation error

When building the kernel and selftest with clang compiler (llvm17 or llvm18),
I hit the following compilation failure:
  In file included from progs/test_lwt_redirect.c:3:
  In file included from /usr/include/linux/ip.h:21:
  In file included from /usr/include/asm/byteorder.h:5:
  In file included from /usr/include/linux/byteorder/little_endian.h:13:
  /usr/include/linux/swab.h:136:8: error: unknown type name '__always_inline'
    136 | static __always_inline unsigned long __swab(const unsigned long y)
        |        ^
  /usr/include/linux/swab.h:171:8: error: unknown type name '__always_inline'
    171 | static __always_inline __u16 __swab16p(const __u16 *p)
  ...

bpf_helpers.h file provided a definition for __always_inline.
Putting 'ip.h' after 'bpf_helpers.h' fixed the issue.

Fixes: 43a7c3ef8a15 ("selftests/bpf: Add lwt_xmit tests for BPF_REDIRECT")
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/r/20230818174312.1883381-1-yonghong.song@linux.dev
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

commit | commitdiff | tree

Dave Marchevsky [Thu, 17 Aug 2023 22:53:53 +0000 (15:53 -0700)]

selftests/bpf: Add CO-RE relocs kfunc flavors tests

This patch adds selftests that exercise kfunc flavor relocation
functionality added in the previous patch. The actual kfunc defined
in kernel/bpf/helpers.c is:

  struct task_struct *bpf_task_acquire(struct task_struct *p)

The following relocation behaviors are checked:

  struct task_struct *bpf_task_acquire___one(struct task_struct *name)
    * Should succeed despite differing param name

  struct task_struct *bpf_task_acquire___two(struct task_struct *p, void *ctx)
    * Should fail because there is no two-param bpf_task_acquire

  struct task_struct *bpf_task_acquire___three(void *ctx)
    * Should fail because, despite vmlinux's bpf_task_acquire having one param,
      the types don't match

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Vernet <void@manifault.com>
Link: https://lore.kernel.org/bpf/20230817225353.2570845-2-davemarchevsky@fb.com

commit | commitdiff | tree

Dave Marchevsky [Thu, 17 Aug 2023 22:53:52 +0000 (15:53 -0700)]

libbpf: Support triple-underscore flavors for kfunc relocation

The function signature of kfuncs can change at any time due to their
intentional lack of stability guarantees. As kfuncs become more widely
used, BPF program writers will need facilities to support calling
different versions of a kfunc from a single BPF object. Consider this
simplified example based on a real scenario we ran into at Meta:

  /* initial kfunc signature */
  int some_kfunc(void *ptr)

  /* Oops, we need to add some flag to modify behavior. No problem,
    change the kfunc. flags = 0 retains original behavior */
  int some_kfunc(void *ptr, long flags)

If the initial version of the kfunc is deployed on some portion of the
fleet and the new version on the rest, a fleetwide service that uses
some_kfunc will currently need to load different BPF programs depending
on which some_kfunc is available.

Luckily CO-RE provides a facility to solve a very similar problem,
struct definition changes, by allowing program writers to declare
my_struct___old and my_struct___new, with ___suffix being considered a
'flavor' of the non-suffixed name and being ignored by
bpf_core_type_exists and similar calls.

This patch extends the 'flavor' facility to the kfunc extern
relocation process. BPF program writers can now declare

  extern int some_kfunc___old(void *ptr)
  extern int some_kfunc___new(void *ptr, int flags)

then test which version of the kfunc exists with bpf_ksym_exists.
Relocation and verifier's dead code elimination will work in concert as
expected, allowing this pattern:

  if (bpf_ksym_exists(some_kfunc___old))
    some_kfunc___old(ptr);
  else
    some_kfunc___new(ptr, 0);

Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: David Vernet <void@manifault.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230817225353.2570845-1-davemarchevsky@fb.com

commit | commitdiff | tree

Helge Deller [Thu, 17 Aug 2023 22:02:40 +0000 (00:02 +0200)]

bpf/tests: Enhance output on error and fix typos

If a testcase returns a wrong (unexpected) value, print the expected and
returned value in hex notation in addition to the decimal notation.

This is very useful in tests which bit-shift hex values left or right and
helped me a lot while developing the JIT compiler for the hppa architecture.

Additionally fix two typos: dowrd -> dword, tall calls -> tail calls.

Signed-off-by: Helge Deller <deller@gmx.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/ZN6ZAAVoWZpsD1Jf@p100

commit | commitdiff | tree

Yan Zhai [Fri, 18 Aug 2023 02:58:18 +0000 (19:58 -0700)]

selftests/bpf: Add lwt_xmit tests for BPF_REROUTE

There is no lwt test case for BPF_REROUTE yet. Add test cases for both
normal and abnormal situations. The abnormal situation is set up with an
fq qdisc on the reroute target device. Without proper fixes, overflow
this qdisc queue limit (to trigger a drop) would panic the kernel.

Signed-off-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/62c8ddc1e924269dcf80d2e8af1a1e632cee0b3a.1692326837.git.yan@cloudflare.com

commit | commitdiff | tree

Yan Zhai [Fri, 18 Aug 2023 02:58:16 +0000 (19:58 -0700)]

selftests/bpf: Add lwt_xmit tests for BPF_REDIRECT

There is no lwt_xmit test case for BPF_REDIRECT yet. Add test cases for
both normal and abnormal situations. For abnormal test cases, devices
are set down or have its carrier set down. Without proper fixes,
BPF_REDIRECT to either ingress or egress of such device would panic the
kernel.

Signed-off-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/96bf435243641939d9c9da329fab29cb45f7df22.1692326837.git.yan@cloudflare.com

commit | commitdiff | tree

Yan Zhai [Fri, 18 Aug 2023 02:58:14 +0000 (19:58 -0700)]

lwt: Check LWTUNNEL_XMIT_CONTINUE strictly

LWTUNNEL_XMIT_CONTINUE is implicitly assumed in ip(6)_finish_output2,
such that any positive return value from a xmit hook could cause
unexpected continue behavior, despite that related skb may have been
freed. This could be error-prone for future xmit hook ops. One of the
possible errors is to return statuses of dst_output directly.

To make the code safer, redefine LWTUNNEL_XMIT_CONTINUE value to
distinguish from dst_output statuses and check the continue
condition explicitly.

Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure")
Suggested-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/96b939b85eda00e8df4f7c080f770970a4c5f698.1692326837.git.yan@cloudflare.com

commit | commitdiff | tree

Yan Zhai [Fri, 18 Aug 2023 02:58:11 +0000 (19:58 -0700)]

lwt: Fix return values of BPF xmit ops

BPF encap ops can return different types of positive values, such like
NET_RX_DROP, NET_XMIT_CN, NETDEV_TX_BUSY, and so on, from function
skb_do_redirect and bpf_lwt_xmit_reroute. At the xmit hook, such return
values would be treated implicitly as LWTUNNEL_XMIT_CONTINUE in
ip(6)_finish_output2. When this happens, skbs that have been freed would
continue to the neighbor subsystem, causing use-after-free bug and
kernel crashes.

To fix the incorrect behavior, skb_do_redirect return values can be
simply discarded, the same as tc-egress behavior. On the other hand,
bpf_lwt_xmit_reroute returns useful errors to local senders, e.g. PMTU
information. Thus convert its return values to avoid the conflict with
LWTUNNEL_XMIT_CONTINUE.

Fixes: 3a0af8fd61f9 ("bpf: BPF for lightweight tunnel infrastructure")
Reported-by: Jordan Griege <jgriege@cloudflare.com>
Suggested-by: Martin KaFai Lau <martin.lau@linux.dev>
Suggested-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Yan Zhai <yan@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/0d2b878186cfe215fec6b45769c1cd0591d3628d.1692326837.git.yan@cloudflare.com

commit | commitdiff | tree

Xu Kuohai [Tue, 15 Aug 2023 15:41:58 +0000 (11:41 -0400)]

selftests/bpf: Enable cpu v4 tests for arm64

Enable CPU v4 instruction tests for arm64. Below are the test results from
BPF test_progs selftests:

  # ./test_progs -t ldsx_insn,verifier_sdiv,verifier_movsx,verifier_ldsx,verifier_gotol,verifier_bswap
  #115/1   ldsx_insn/map_val and probed_memory:OK
  #115/2   ldsx_insn/ctx_member_sign_ext:OK
  #115/3   ldsx_insn/ctx_member_narrow_sign_ext:OK
  #115     ldsx_insn:OK
  #302/1   verifier_bswap/BSWAP, 16:OK
  #302/2   verifier_bswap/BSWAP, 16 @unpriv:OK
  #302/3   verifier_bswap/BSWAP, 32:OK
  #302/4   verifier_bswap/BSWAP, 32 @unpriv:OK
  #302/5   verifier_bswap/BSWAP, 64:OK
  #302/6   verifier_bswap/BSWAP, 64 @unpriv:OK
  #302     verifier_bswap:OK
  #316/1   verifier_gotol/gotol, small_imm:OK
  #316/2   verifier_gotol/gotol, small_imm @unpriv:OK
  #316     verifier_gotol:OK
  #324/1   verifier_ldsx/LDSX, S8:OK
  #324/2   verifier_ldsx/LDSX, S8 @unpriv:OK
  #324/3   verifier_ldsx/LDSX, S16:OK
  #324/4   verifier_ldsx/LDSX, S16 @unpriv:OK
  #324/5   verifier_ldsx/LDSX, S32:OK
  #324/6   verifier_ldsx/LDSX, S32 @unpriv:OK
  #324/7   verifier_ldsx/LDSX, S8 range checking, privileged:OK
  #324/8   verifier_ldsx/LDSX, S16 range checking:OK
  #324/9   verifier_ldsx/LDSX, S16 range checking @unpriv:OK
  #324/10  verifier_ldsx/LDSX, S32 range checking:OK
  #324/11  verifier_ldsx/LDSX, S32 range checking @unpriv:OK
  #324     verifier_ldsx:OK
  #335/1   verifier_movsx/MOV32SX, S8:OK
  #335/2   verifier_movsx/MOV32SX, S8 @unpriv:OK
  #335/3   verifier_movsx/MOV32SX, S16:OK
  #335/4   verifier_movsx/MOV32SX, S16 @unpriv:OK
  #335/5   verifier_movsx/MOV64SX, S8:OK
  #335/6   verifier_movsx/MOV64SX, S8 @unpriv:OK
  #335/7   verifier_movsx/MOV64SX, S16:OK
  #335/8   verifier_movsx/MOV64SX, S16 @unpriv:OK
  #335/9   verifier_movsx/MOV64SX, S32:OK
  #335/10  verifier_movsx/MOV64SX, S32 @unpriv:OK
  #335/11  verifier_movsx/MOV32SX, S8, range_check:OK
  #335/12  verifier_movsx/MOV32SX, S8, range_check @unpriv:OK
  #335/13  verifier_movsx/MOV32SX, S16, range_check:OK
  #335/14  verifier_movsx/MOV32SX, S16, range_check @unpriv:OK
  #335/15  verifier_movsx/MOV32SX, S16, range_check 2:OK
  #335/16  verifier_movsx/MOV32SX, S16, range_check 2 @unpriv:OK
  #335/17  verifier_movsx/MOV64SX, S8, range_check:OK
  #335/18  verifier_movsx/MOV64SX, S8, range_check @unpriv:OK
  #335/19  verifier_movsx/MOV64SX, S16, range_check:OK
  #335/20  verifier_movsx/MOV64SX, S16, range_check @unpriv:OK
  #335/21  verifier_movsx/MOV64SX, S32, range_check:OK
  #335/22  verifier_movsx/MOV64SX, S32, range_check @unpriv:OK
  #335/23  verifier_movsx/MOV64SX, S16, R10 Sign Extension:OK
  #335/24  verifier_movsx/MOV64SX, S16, R10 Sign Extension @unpriv:OK
  #335     verifier_movsx:OK
  #347/1   verifier_sdiv/SDIV32, non-zero imm divisor, check 1:OK
  #347/2   verifier_sdiv/SDIV32, non-zero imm divisor, check 1 @unpriv:OK
  #347/3   verifier_sdiv/SDIV32, non-zero imm divisor, check 2:OK
  #347/4   verifier_sdiv/SDIV32, non-zero imm divisor, check 2 @unpriv:OK
  #347/5   verifier_sdiv/SDIV32, non-zero imm divisor, check 3:OK
  #347/6   verifier_sdiv/SDIV32, non-zero imm divisor, check 3 @unpriv:OK
  #347/7   verifier_sdiv/SDIV32, non-zero imm divisor, check 4:OK
  #347/8   verifier_sdiv/SDIV32, non-zero imm divisor, check 4 @unpriv:OK
  #347/9   verifier_sdiv/SDIV32, non-zero imm divisor, check 5:OK
  #347/10  verifier_sdiv/SDIV32, non-zero imm divisor, check 5 @unpriv:OK
  #347/11  verifier_sdiv/SDIV32, non-zero imm divisor, check 6:OK
  #347/12  verifier_sdiv/SDIV32, non-zero imm divisor, check 6 @unpriv:OK
  #347/13  verifier_sdiv/SDIV32, non-zero imm divisor, check 7:OK
  #347/14  verifier_sdiv/SDIV32, non-zero imm divisor, check 7 @unpriv:OK
  #347/15  verifier_sdiv/SDIV32, non-zero imm divisor, check 8:OK
  #347/16  verifier_sdiv/SDIV32, non-zero imm divisor, check 8 @unpriv:OK
  #347/17  verifier_sdiv/SDIV32, non-zero reg divisor, check 1:OK
  #347/18  verifier_sdiv/SDIV32, non-zero reg divisor, check 1 @unpriv:OK
  #347/19  verifier_sdiv/SDIV32, non-zero reg divisor, check 2:OK
  #347/20  verifier_sdiv/SDIV32, non-zero reg divisor, check 2 @unpriv:OK
  #347/21  verifier_sdiv/SDIV32, non-zero reg divisor, check 3:OK
  #347/22  verifier_sdiv/SDIV32, non-zero reg divisor, check 3 @unpriv:OK
  #347/23  verifier_sdiv/SDIV32, non-zero reg divisor, check 4:OK
  #347/24  verifier_sdiv/SDIV32, non-zero reg divisor, check 4 @unpriv:OK
  #347/25  verifier_sdiv/SDIV32, non-zero reg divisor, check 5:OK
  #347/26  verifier_sdiv/SDIV32, non-zero reg divisor, check 5 @unpriv:OK
  #347/27  verifier_sdiv/SDIV32, non-zero reg divisor, check 6:OK
  #347/28  verifier_sdiv/SDIV32, non-zero reg divisor, check 6 @unpriv:OK
  #347/29  verifier_sdiv/SDIV32, non-zero reg divisor, check 7:OK
  #347/30  verifier_sdiv/SDIV32, non-zero reg divisor, check 7 @unpriv:OK
  #347/31  verifier_sdiv/SDIV32, non-zero reg divisor, check 8:OK
  #347/32  verifier_sdiv/SDIV32, non-zero reg divisor, check 8 @unpriv:OK
  #347/33  verifier_sdiv/SDIV64, non-zero imm divisor, check 1:OK
  #347/34  verifier_sdiv/SDIV64, non-zero imm divisor, check 1 @unpriv:OK
  #347/35  verifier_sdiv/SDIV64, non-zero imm divisor, check 2:OK
  #347/36  verifier_sdiv/SDIV64, non-zero imm divisor, check 2 @unpriv:OK
  #347/37  verifier_sdiv/SDIV64, non-zero imm divisor, check 3:OK
  #347/38  verifier_sdiv/SDIV64, non-zero imm divisor, check 3 @unpriv:OK
  #347/39  verifier_sdiv/SDIV64, non-zero imm divisor, check 4:OK
  #347/40  verifier_sdiv/SDIV64, non-zero imm divisor, check 4 @unpriv:OK
  #347/41  verifier_sdiv/SDIV64, non-zero imm divisor, check 5:OK
  #347/42  verifier_sdiv/SDIV64, non-zero imm divisor, check 5 @unpriv:OK
  #347/43  verifier_sdiv/SDIV64, non-zero imm divisor, check 6:OK
  #347/44  verifier_sdiv/SDIV64, non-zero imm divisor, check 6 @unpriv:OK
  #347/45  verifier_sdiv/SDIV64, non-zero reg divisor, check 1:OK
  #347/46  verifier_sdiv/SDIV64, non-zero reg divisor, check 1 @unpriv:OK
  #347/47  verifier_sdiv/SDIV64, non-zero reg divisor, check 2:OK
  #347/48  verifier_sdiv/SDIV64, non-zero reg divisor, check 2 @unpriv:OK
  #347/49  verifier_sdiv/SDIV64, non-zero reg divisor, check 3:OK
  #347/50  verifier_sdiv/SDIV64, non-zero reg divisor, check 3 @unpriv:OK
  #347/51  verifier_sdiv/SDIV64, non-zero reg divisor, check 4:OK
  #347/52  verifier_sdiv/SDIV64, non-zero reg divisor, check 4 @unpriv:OK
  #347/53  verifier_sdiv/SDIV64, non-zero reg divisor, check 5:OK
  #347/54  verifier_sdiv/SDIV64, non-zero reg divisor, check 5 @unpriv:OK
  #347/55  verifier_sdiv/SDIV64, non-zero reg divisor, check 6:OK
  #347/56  verifier_sdiv/SDIV64, non-zero reg divisor, check 6 @unpriv:OK
  #347/57  verifier_sdiv/SMOD32, non-zero imm divisor, check 1:OK
  #347/58  verifier_sdiv/SMOD32, non-zero imm divisor, check 1 @unpriv:OK
  #347/59  verifier_sdiv/SMOD32, non-zero imm divisor, check 2:OK
  #347/60  verifier_sdiv/SMOD32, non-zero imm divisor, check 2 @unpriv:OK
  #347/61  verifier_sdiv/SMOD32, non-zero imm divisor, check 3:OK
  #347/62  verifier_sdiv/SMOD32, non-zero imm divisor, check 3 @unpriv:OK
  #347/63  verifier_sdiv/SMOD32, non-zero imm divisor, check 4:OK
  #347/64  verifier_sdiv/SMOD32, non-zero imm divisor, check 4 @unpriv:OK
  #347/65  verifier_sdiv/SMOD32, non-zero imm divisor, check 5:OK
  #347/66  verifier_sdiv/SMOD32, non-zero imm divisor, check 5 @unpriv:OK
  #347/67  verifier_sdiv/SMOD32, non-zero imm divisor, check 6:OK
  #347/68  verifier_sdiv/SMOD32, non-zero imm divisor, check 6 @unpriv:OK
  #347/69  verifier_sdiv/SMOD32, non-zero reg divisor, check 1:OK
  #347/70  verifier_sdiv/SMOD32, non-zero reg divisor, check 1 @unpriv:OK
  #347/71  verifier_sdiv/SMOD32, non-zero reg divisor, check 2:OK
  #347/72  verifier_sdiv/SMOD32, non-zero reg divisor, check 2 @unpriv:OK
  #347/73  verifier_sdiv/SMOD32, non-zero reg divisor, check 3:OK
  #347/74  verifier_sdiv/SMOD32, non-zero reg divisor, check 3 @unpriv:OK
  #347/75  verifier_sdiv/SMOD32, non-zero reg divisor, check 4:OK
  #347/76  verifier_sdiv/SMOD32, non-zero reg divisor, check 4 @unpriv:OK
  #347/77  verifier_sdiv/SMOD32, non-zero reg divisor, check 5:OK
  #347/78  verifier_sdiv/SMOD32, non-zero reg divisor, check 5 @unpriv:OK
  #347/79  verifier_sdiv/SMOD32, non-zero reg divisor, check 6:OK
  #347/80  verifier_sdiv/SMOD32, non-zero reg divisor, check 6 @unpriv:OK
  #347/81  verifier_sdiv/SMOD64, non-zero imm divisor, check 1:OK
  #347/82  verifier_sdiv/SMOD64, non-zero imm divisor, check 1 @unpriv:OK
  #347/83  verifier_sdiv/SMOD64, non-zero imm divisor, check 2:OK
  #347/84  verifier_sdiv/SMOD64, non-zero imm divisor, check 2 @unpriv:OK
  #347/85  verifier_sdiv/SMOD64, non-zero imm divisor, check 3:OK
  #347/86  verifier_sdiv/SMOD64, non-zero imm divisor, check 3 @unpriv:OK
  #347/87  verifier_sdiv/SMOD64, non-zero imm divisor, check 4:OK
  #347/88  verifier_sdiv/SMOD64, non-zero imm divisor, check 4 @unpriv:OK
  #347/89  verifier_sdiv/SMOD64, non-zero imm divisor, check 5:OK
  #347/90  verifier_sdiv/SMOD64, non-zero imm divisor, check 5 @unpriv:OK
  #347/91  verifier_sdiv/SMOD64, non-zero imm divisor, check 6:OK
  #347/92  verifier_sdiv/SMOD64, non-zero imm divisor, check 6 @unpriv:OK
  #347/93  verifier_sdiv/SMOD64, non-zero imm divisor, check 7:OK
  #347/94  verifier_sdiv/SMOD64, non-zero imm divisor, check 7 @unpriv:OK
  #347/95  verifier_sdiv/SMOD64, non-zero imm divisor, check 8:OK
  #347/96  verifier_sdiv/SMOD64, non-zero imm divisor, check 8 @unpriv:OK
  #347/97  verifier_sdiv/SMOD64, non-zero reg divisor, check 1:OK
  #347/98  verifier_sdiv/SMOD64, non-zero reg divisor, check 1 @unpriv:OK
  #347/99  verifier_sdiv/SMOD64, non-zero reg divisor, check 2:OK
  #347/100 verifier_sdiv/SMOD64, non-zero reg divisor, check 2 @unpriv:OK
  #347/101 verifier_sdiv/SMOD64, non-zero reg divisor, check 3:OK
  #347/102 verifier_sdiv/SMOD64, non-zero reg divisor, check 3 @unpriv:OK
  #347/103 verifier_sdiv/SMOD64, non-zero reg divisor, check 4:OK
  #347/104 verifier_sdiv/SMOD64, non-zero reg divisor, check 4 @unpriv:OK
  #347/105 verifier_sdiv/SMOD64, non-zero reg divisor, check 5:OK
  #347/106 verifier_sdiv/SMOD64, non-zero reg divisor, check 5 @unpriv:OK
  #347/107 verifier_sdiv/SMOD64, non-zero reg divisor, check 6:OK
  #347/108 verifier_sdiv/SMOD64, non-zero reg divisor, check 6 @unpriv:OK
  #347/109 verifier_sdiv/SMOD64, non-zero reg divisor, check 7:OK
  #347/110 verifier_sdiv/SMOD64, non-zero reg divisor, check 7 @unpriv:OK
  #347/111 verifier_sdiv/SMOD64, non-zero reg divisor, check 8:OK
  #347/112 verifier_sdiv/SMOD64, non-zero reg divisor, check 8 @unpriv:OK
  #347/113 verifier_sdiv/SDIV32, zero divisor:OK
  #347/114 verifier_sdiv/SDIV32, zero divisor @unpriv:OK
  #347/115 verifier_sdiv/SDIV64, zero divisor:OK
  #347/116 verifier_sdiv/SDIV64, zero divisor @unpriv:OK
  #347/117 verifier_sdiv/SMOD32, zero divisor:OK
  #347/118 verifier_sdiv/SMOD32, zero divisor @unpriv:OK
  #347/119 verifier_sdiv/SMOD64, zero divisor:OK
  #347/120 verifier_sdiv/SMOD64, zero divisor @unpriv:OK
  #347     verifier_sdiv:OK
  Summary: 6/166 PASSED, 0 SKIPPED, 0 FAILED

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Florent Revest <revest@chromium.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Florent Revest <revest@chromium.org>
Link: https://lore.kernel.org/bpf/20230815154158.717901-8-xukuohai@huaweicloud.com

commit | commitdiff | tree

Xu Kuohai [Tue, 15 Aug 2023 15:41:57 +0000 (11:41 -0400)]

bpf, arm64: Support signed div/mod instructions

Add JIT for signed div/mod instructions.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Florent Revest <revest@chromium.org>
Acked-by: Florent Revest <revest@chromium.org>
Link: https://lore.kernel.org/bpf/20230815154158.717901-7-xukuohai@huaweicloud.com

commit | commitdiff | tree

Xu Kuohai [Tue, 15 Aug 2023 15:41:56 +0000 (11:41 -0400)]

bpf, arm64: Support 32-bit offset jmp instruction

Add support for 32-bit offset jmp instructions. Given the arm64 direct jump
range is +-128MB, which is large enough for BPF prog, jumps beyond this range
are not supported.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Florent Revest <revest@chromium.org>
Acked-by: Florent Revest <revest@chromium.org>
Link: https://lore.kernel.org/bpf/20230815154158.717901-6-xukuohai@huaweicloud.com

commit | commitdiff | tree

Xu Kuohai [Tue, 15 Aug 2023 15:41:55 +0000 (11:41 -0400)]

bpf, arm64: Support unconditional bswap

Add JIT support for unconditional bswap instructions.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Florent Revest <revest@chromium.org>
Acked-by: Florent Revest <revest@chromium.org>
Link: https://lore.kernel.org/bpf/20230815154158.717901-5-xukuohai@huaweicloud.com

commit | commitdiff | tree

Xu Kuohai [Tue, 15 Aug 2023 15:41:54 +0000 (11:41 -0400)]

bpf, arm64: Support sign-extension mov instructions

Add JIT support for BPF sign-extension mov instructions with arm64
SXTB/SXTH/SXTW instructions.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Florent Revest <revest@chromium.org>
Acked-by: Florent Revest <revest@chromium.org>
Link: https://lore.kernel.org/bpf/20230815154158.717901-4-xukuohai@huaweicloud.com

commit | commitdiff | tree

Xu Kuohai [Tue, 15 Aug 2023 15:41:53 +0000 (11:41 -0400)]

bpf, arm64: Support sign-extension load instructions

Add JIT support for sign-extension load instructions.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Florent Revest <revest@chromium.org>
Acked-by: Florent Revest <revest@chromium.org>
Link: https://lore.kernel.org/bpf/20230815154158.717901-3-xukuohai@huaweicloud.com

commit | commitdiff | tree

Xu Kuohai [Tue, 15 Aug 2023 15:41:52 +0000 (11:41 -0400)]

arm64: insn: Add encoders for LDRSB/LDRSH/LDRSW

To support BPF sign-extend load instructions, add encoders for
LDRSB/LDRSH/LDRSW.

LDRSB/LDRSH/LDRSW (immediate) is encoded as follows:

     3     2 2   2   2                       1         0         0
     0     7 6   4   2                       0         5         0
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  | sz|1 1 1|0|0 1|opc|        imm12          |    Rn   |    Rt   |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

LDRSB/LDRSH/LDRSW (register) is encoded as follows:

     3     2 2   2   2 2         1     1 1   1         0         0
     0     7 6   4   2 1         6     3 2   0         5         0
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  | sz|1 1 1|0|0 0|opc|1|    Rm   | opt |S|1 0|    Rn   |    Rt   |
  +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

where:

   - sz
     indicates whether 8-bit, 16-bit or 32-bit data is to be loaded

   - opc
     opc[1] (bit 23) is always 1 and opc[0] == 1 indicates regsize
     is 32-bit. Since BPF signed load instructions always exend the
     sign bit to bit 63 regardless of whether it loads an 8-bit,
     16-bit or 32-bit data. So only 64-bit register size is required.
     That is, it's sufficient to set field opc fixed to 0x2.

   - opt
     Indicates whether to sign extend the offset register Rm and the
     effective bits of Rm. We set opt to 0x7 (SXTX) since we'll use
     Rm as a sgined 64-bit value in BPF.

   - S
     Optional only when opt field is 0x3 (LSL)

In short, the above fields are encoded to the values listed below.

                   sz   opc  opt   S
LDRSB (immediate)  0x0  0x2  na    na
LDRSH (immediate)  0x1  0x2  na    na
LDRSW (immediate)  0x2  0x2  na    na
LDRSB (register)   0x0  0x2  0x7   0
LDRSH (register)   0x1  0x2  0x7   0
LDRSW (register)   0x2  0x2  0x7   0

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Florent Revest <revest@chromium.org>
Acked-by: Florent Revest <revest@chromium.org>
Link: https://lore.kernel.org/bpf/20230815154158.717901-2-xukuohai@huaweicloud.com

commit | commitdiff | tree

Jakub Kicinski [Fri, 18 Aug 2023 02:25:44 +0000 (19:25 -0700)]

Merge branch 'netconsole-enable-compile-time-configuration'

Breno Leitao says:

====================
netconsole: Enable compile time configuration

Enable netconsole features to be set at compilation time. Create two
Kconfig options that allow users to set extended logs and release
prepending features at compilation time.

The first patch de-duplicates the initialization code, and the second
patch adds the support in the de-duplicated code, avoiding touching two
different functions with the same change.
====================

Link: https://lore.kernel.org/r/20230811093158.1678322-1-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Breno Leitao [Fri, 11 Aug 2023 09:31:58 +0000 (02:31 -0700)]

netconsole: Enable compile time configuration

Enable netconsole features to be set at compilation time. Create two
Kconfig options that allow users to set extended logs and release
prepending features at compilation time.

Right now, the user needs to pass command line parameters to netconsole,
such as "+"/"r" to enable extended logs and version prepending features.

With these two options, the user could set the default values for the
features at compile time, and don't need to pass it in the command line
to get them enabled, simplifying the command line.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230811093158.1678322-3-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Breno Leitao [Fri, 11 Aug 2023 09:31:57 +0000 (02:31 -0700)]

netconsole: Create a allocation helper

De-duplicate the initialization and allocation code for struct
netconsole_target.

The same allocation and initialization code is duplicated in two
different places in the netconsole subsystem, when the netconsole target
is initialized by command line parameters (alloc_param_target()), and
dynamically by sysfs (make_netconsole_target()).

Create a helper function, and call it from the two different functions.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20230811093158.1678322-2-leitao@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Justin Stitt [Tue, 15 Aug 2023 20:35:59 +0000 (20:35 +0000)]

net: mdio: fix -Wvoid-pointer-to-enum-cast warning

When building with clang 18 I see the following warning:
|       drivers/net/mdio/mdio-xgene.c:338:13: warning: cast to smaller integer
|               type 'enum xgene_mdio_id' from 'const void *' [-Wvoid-pointer-to-enum-cast]
|         338 |                 mdio_id = (enum xgene_mdio_id)of_id->data;

This is due to the fact that `of_id->data` is a void* while `enum
xgene_mdio_id` has the size of an int. This leads to truncation and
possible data loss.

Link: https://github.com/ClangBuiltLinux/linux/issues/1910
Reported-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Justin Stitt <justinstitt@google.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20230815-void-drivers-net-mdio-mdio-xgene-v1-1-5304342e0659@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Fri, 18 Aug 2023 02:15:07 +0000 (19:15 -0700)]

Merge branch 'netem-use-a-seeded-prng-for-loss-and-corruption-events'

François Michel says:

====================
netem: use a seeded PRNG for loss and corruption events

In order to reproduce bugs or performance evaluation of
network protocols and applications, it is useful to have
reproducible test suites and tools. This patch adds
a way to specify a PRNG seed through the
TCA_NETEM_PRNG_SEED attribute for generating netem
loss and corruption events. Initializing the qdisc
with the same seed leads to the exact same loss
and corruption patterns. If no seed is explicitly
specified, the qdisc generates a random seed using
get_random_u64().

This patch can be and has been tested using tc from
the following iproute2-next fork:
https://github.com/francoismichel/iproute2-next

For instance, setting the seed 42424242 on the loopback
with a loss rate of 10% will systematically drop the 5th,
12th and 24th packet when sending 25 packets.
====================

Link: https://lore.kernel.org/r/20230815092348.1449179-1-francois.michel@uclouvain.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

François Michel [Tue, 15 Aug 2023 09:23:40 +0000 (11:23 +0200)]

netem: use seeded PRNG for correlated loss events

Use prandom_u32_state() instead of get_random_u32() to generate
the correlated loss events of netem.

Signed-off-by: François Michel <francois.michel@uclouvain.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/r/20230815092348.1449179-4-francois.michel@uclouvain.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

François Michel [Tue, 15 Aug 2023 09:23:39 +0000 (11:23 +0200)]

netem: use a seeded PRNG for generating random losses

Use prandom_u32_state() instead of get_random_u32() to generate
the random loss events of netem. The state of the prng is part
of the prng attribute of struct netem_sched_data.

Signed-off-by: François Michel <francois.michel@uclouvain.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/r/20230815092348.1449179-3-francois.michel@uclouvain.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

François Michel [Tue, 15 Aug 2023 09:23:38 +0000 (11:23 +0200)]

netem: add prng attribute to netem_sched_data

Add prng attribute to struct netem_sched_data and
allows setting the seed of the PRNG through netlink
using the new TCA_NETEM_PRNG_SEED attribute.
The PRNG attribute is not actually used yet.

Signed-off-by: François Michel <francois.michel@uclouvain.be>
Reviewed-by: Simon Horman <horms@kernel.org>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Link: https://lore.kernel.org/r/20230815092348.1449179-2-francois.michel@uclouvain.be
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jialin Zhang [Tue, 15 Aug 2023 02:42:48 +0000 (10:42 +0800)]

net: ena: Use pci_dev_id() to simplify the code

PCI core API pci_dev_id() can be used to get the BDF number for a pci
device. We don't need to compose it manually. Use pci_dev_id() to
simplify the code a little bit.

Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Shay Agroskin <shayagr@amazon.com>
Link: https://lore.kernel.org/r/20230815024248.3519068-1-zhangjialin11@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Ziyang Xuan [Mon, 14 Aug 2023 08:30:00 +0000 (16:30 +0800)]

tun: add __exit annotations to module exit func tun_cleanup()

Add missing __exit annotations to module exit func tun_cleanup().

Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Link: https://lore.kernel.org/r/20230814083000.3893589-1-william.xuanziyang@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 17 Aug 2023 03:09:43 +0000 (20:09 -0700)]

Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2023-08-16

We've added 17 non-merge commits during the last 6 day(s) which contain
a total of 20 files changed, 1179 insertions(+), 37 deletions(-).

The main changes are:

1) Add a BPF hook in sys_socket() to change the protocol ID
   from IPPROTO_TCP to IPPROTO_MPTCP to cover migration for legacy
   applications, from Geliang Tang.

2) Follow-up/fallout fix from the SO_REUSEPORT + bpf_sk_assign work
   to fix a splat on non-fullsock sks in inet[6]_steal_sock,
   from Lorenz Bauer.

3) Improvements to struct_ops links to avoid forcing presence of
   update/validate callbacks. Also add bpf_struct_ops fields documentation,
   from David Vernet.

4) Ensure libbpf sets close-on-exec flag on gzopen, from Marco Vedovati.

5) Several new tcx selftest additions and bpftool link show support for
   tcx and xdp links, from Daniel Borkmann.

6) Fix a smatch warning on uninitialized symbol in
   bpf_perf_link_fill_kprobe, from Yafang Shao.

7) BPF selftest fixes e.g. misplaced break in kfunc_call test,
   from Yipeng Zou.

8) Small cleanup to remove unused declaration bpf_link_new_file,
   from Yue Haibing.

9) Small typo fix to bpftool's perf help message, from Daniel T. Lee.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next:
  selftests/bpf: Add mptcpify test
  selftests/bpf: Fix error checks of mptcp open_and_load
  selftests/bpf: Add two mptcp netns helpers
  bpf: Add update_socket_protocol hook
  bpftool: Implement link show support for xdp
  bpftool: Implement link show support for tcx
  selftests/bpf: Add selftest for fill_link_info
  bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe()
  net: Fix slab-out-of-bounds in inet[6]_steal_sock
  bpf: Document struct bpf_struct_ops fields
  bpf: Support default .validate() and .update() behavior for struct_ops links
  selftests/bpf: Add various more tcx test cases
  selftests/bpf: Clean up fmod_ret in bench_rename test script
  selftests/bpf: Fix repeat option when kfunc_call verification fails
  libbpf: Set close-on-exec flag on gzopen
  bpftool: fix perf help message
  bpf: Remove unused declaration bpf_link_new_file()
====================

Link: https://lore.kernel.org/r/20230816212840.1539-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Jakub Kicinski [Thu, 17 Aug 2023 02:33:44 +0000 (19:33 -0700)]

Revert "net: ethernet: ti: am65-cpsw: add mqprio qdisc offload in channel mode"

This reverts commit 90bc21aaef4adaefceda2d385756138fc247c0c2.

Patch was merged too hastily, Vladimir requested changes in:
https://lore.kernel.org/all/20230816121305.5dio5tk3chge2ndh@skbuf/

Signed-off-by: Jakub Kicinski <kuba@kernel.org>

commit | commitdiff | tree

Martin KaFai Lau [Wed, 16 Aug 2023 17:22:17 +0000 (10:22 -0700)]

Merge branch 'bpf: Force to MPTCP'

Geliang Tang says:

====================
As is described in the "How to use MPTCP?" section in MPTCP wiki [1]:

"Your app should create sockets with IPPROTO_MPTCP as the proto:
( socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP); ). Legacy apps can be
forced to create and use MPTCP sockets instead of TCP ones via the
mptcpize command bundled with the mptcpd daemon."

But the mptcpize (LD_PRELOAD technique) command has some limitations
[2]:

- it doesn't work if the application is not using libc (e.g. GoLang
apps)
- in some envs, it might not be easy to set env vars / change the way
apps are launched, e.g. on Android
- mptcpize needs to be launched with all apps that want MPTCP: we could
have more control from BPF to enable MPTCP only for some apps or all the
ones of a netns or a cgroup, etc.
- it is not in BPF, we cannot talk about it at netdev conf.

So this patchset attempts to use BPF to implement functions similer to
mptcpize.

The main idea is to add a hook in sys_socket() to change the protocol id
from IPPROTO_TCP (or 0) to IPPROTO_MPTCP.

[1]
https://github.com/multipath-tcp/mptcp_net-next/wiki
[2]
https://github.com/multipath-tcp/mptcp_net-next/issues/79

v14:
- Use getsockopt(MPTCP_INFO) to verify mptcp protocol intead of using
nstat command.

v13:
- drop "Use random netns name for mptcp" patch.

v12:
- update diag_* log of update_socket_protocol.
- add 'ip netns show' after 'ip netns del' to check if there is
a test did not clean up its netns.
- return libbpf_get_error() instead of -EIO for the error from
open_and_load().
- Use getsockopt(SOL_PROTOCOL) to verify mptcp protocol intead of
using 'ss -tOni'.

v11:
- add comments about outputs of 'ss' and 'nstat'.
- use "err = verify_mptcpify()" instead of using =+.

v10:
- drop "#ifdef CONFIG_BPF_JIT".
- include vmlinux.h and bpf_tracing_net.h to avoid defining some
macros.
- drop unneeded checks for mptcp.

v9:
- update comment for 'update_socket_protocol'.

v8:
- drop the additional checks on the 'protocol' value after the
'update_socket_protocol()' call.

v7:
- add __weak and __diag_* for update_socket_protocol.

v6:
- add update_socket_protocol.

v5:
- add bpf_mptcpify helper.

v4:
- use lsm_cgroup/socket_create

v3:
- patch 8: char cmd[128]; -> char cmd[256];

v2:
- Fix build selftests errors reported by CI

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/79
====================

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

commit | commitdiff | tree

Geliang Tang [Wed, 16 Aug 2023 01:11:59 +0000 (09:11 +0800)]

selftests/bpf: Add mptcpify test

Implement a new test program mptcpify: if the family is AF_INET or
AF_INET6, the type is SOCK_STREAM, and the protocol ID is 0 or
IPPROTO_TCP, set it to IPPROTO_MPTCP. It will be hooked in
update_socket_protocol().

Extend the MPTCP test base, add a selftest test_mptcpify() for the
mptcpify case. Open and load the mptcpify test prog to mptcpify the
TCP sockets dynamically, then use start_server() and connect_to_fd()
to create a TCP socket, but actually what's created is an MPTCP
socket, which can be verified through 'getsockopt(SOL_PROTOCOL)'
and 'getsockopt(MPTCP_INFO)'.

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Link: https://lore.kernel.org/r/364e72f307e7bb38382ec7442c182d76298a9c41.1692147782.git.geliang.tang@suse.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

commit | commitdiff | tree

Geliang Tang [Wed, 16 Aug 2023 01:11:58 +0000 (09:11 +0800)]

selftests/bpf: Fix error checks of mptcp open_and_load

Return libbpf_get_error(), instead of -EIO, for the error from
mptcp_sock__open_and_load().

Load success means prog_fd and map_fd are always valid. So drop these
unneeded ASSERT_GE checks for them in mptcp run_test().

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Link: https://lore.kernel.org/r/db5fcb93293df9ab173edcbaf8252465b80da6f2.1692147782.git.geliang.tang@suse.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

commit | commitdiff | tree

Geliang Tang [Wed, 16 Aug 2023 01:11:57 +0000 (09:11 +0800)]

selftests/bpf: Add two mptcp netns helpers

Add two netns helpers for mptcp tests: create_netns() and
cleanup_netns(). Use them in test_base().

These new helpers will be re-used in the following commits
introducing new tests.

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Link: https://lore.kernel.org/r/7506371fb6c417b401cc9d7365fe455754f4ba3f.1692147782.git.geliang.tang@suse.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

commit | commitdiff | tree

Geliang Tang [Wed, 16 Aug 2023 01:11:56 +0000 (09:11 +0800)]

bpf: Add update_socket_protocol hook

Add a hook named update_socket_protocol in __sys_socket(), for bpf
progs to attach to and update socket protocol. One user case is to
force legacy TCP apps to create and use MPTCP sockets instead of
TCP ones.

Define a fmod_ret set named bpf_mptcp_fmodret_ids, add the hook
update_socket_protocol into this set, and register it in
bpf_mptcp_kfunc_init().

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/79
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Link: https://lore.kernel.org/r/ac84be00f97072a46f8a72b4e2be46cbb7fa5053.1692147782.git.geliang.tang@suse.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

commit | commitdiff | tree

Daniel Borkmann [Wed, 16 Aug 2023 09:56:51 +0000 (11:56 +0200)]

bpftool: Implement link show support for xdp

Add support to dump XDP link information to bpftool. This reuses the
recently added show_link_ifindex_{plain,json}(). The XDP link info only
exposes the ifindex.

Below shows an example link dump output, and a cgroup link is included
for comparison, too:

  # bpftool link
  [...]
  10: cgroup  prog 2466
        cgroup_id 1  attach_type cgroup_inet6_post_bind
  [...]
  16: xdp  prog 2477
        ifindex enp5s0(3)
  [...]

Equivalent json output:

  # bpftool link --json
  [...]
  {
    "id": 10,
    "type": "cgroup",
    "prog_id": 2466,
    "cgroup_id": 1,
    "attach_type": "cgroup_inet6_post_bind"
  },
  [...]
  {
    "id": 16,
    "type": "xdp",
    "prog_id": 2477,
    "devname": "enp5s0",
    "ifindex": 3
  }
  [...]

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Link: https://lore.kernel.org/r/20230816095651.10014-2-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

commit | commitdiff | tree

Daniel Borkmann [Wed, 16 Aug 2023 09:56:50 +0000 (11:56 +0200)]

bpftool: Implement link show support for tcx

Add support to dump tcx link information to bpftool. This adds a
common helper show_link_ifindex_{plain,json}() which can be reused
also for other link types. The plain text and json device output is
the same format as in bpftool net dump.

Below shows an example link dump output along with a cgroup link
for comparison:

  # bpftool link
  [...]
  10: cgroup  prog 1977
        cgroup_id 1  attach_type cgroup_inet6_post_bind
  [...]
  13: tcx  prog 2053
        ifindex enp5s0(3)  attach_type tcx_ingress
  14: tcx  prog 2080
        ifindex enp5s0(3)  attach_type tcx_egress
  [...]

Equivalent json output:

  # bpftool link --json
  [...]
  {
    "id": 10,
    "type": "cgroup",
    "prog_id": 1977,
    "cgroup_id": 1,
    "attach_type": "cgroup_inet6_post_bind"
  },
  [...]
  {
    "id": 13,
    "type": "tcx",
    "prog_id": 2053,
    "devname": "enp5s0",
    "ifindex": 3,
    "attach_type": "tcx_ingress"
  },
  {
    "id": 14,
    "type": "tcx",
    "prog_id": 2080,
    "devname": "enp5s0",
    "ifindex": 3,
    "attach_type": "tcx_egress"
  }
  [...]

Suggested-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/r/20230816095651.10014-1-daniel@iogearbox.net
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

commit | commitdiff | tree

Yafang Shao [Sun, 13 Aug 2023 14:19:00 +0000 (14:19 +0000)]

selftests/bpf: Add selftest for fill_link_info

Add selftest for the fill_link_info of uprobe, kprobe and tracepoint.
The result:

  $ tools/testing/selftests/bpf/test_progs --name=fill_link_info
  #79/1    fill_link_info/kprobe_link_info:OK
  #79/2    fill_link_info/kretprobe_link_info:OK
  #79/3    fill_link_info/kprobe_invalid_ubuff:OK
  #79/4    fill_link_info/tracepoint_link_info:OK
  #79/5    fill_link_info/uprobe_link_info:OK
  #79/6    fill_link_info/uretprobe_link_info:OK
  #79/7    fill_link_info/kprobe_multi_link_info:OK
  #79/8    fill_link_info/kretprobe_multi_link_info:OK
  #79/9    fill_link_info/kprobe_multi_invalid_ubuff:OK
  #79      fill_link_info:OK
  Summary: 1/9 PASSED, 0 SKIPPED, 0 FAILED

The test case for kprobe_multi won't be run on aarch64, as it is not
supported.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20230813141900.1268-3-laoar.shao@gmail.com

commit | commitdiff | tree

Yafang Shao [Sun, 13 Aug 2023 14:18:59 +0000 (14:18 +0000)]

bpf: Fix uninitialized symbol in bpf_perf_link_fill_kprobe()

The commit 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event") leads
to the following Smatch static checker warning:

kernel/bpf/syscall.c:3416 bpf_perf_link_fill_kprobe()
error: uninitialized symbol 'type'.

That can happens when uname is NULL. So fix it by verifying the uname when we
really need to fill it.

Fixes: 1b715e1b0ec5 ("bpf: Support ->fill_link_info for perf_event")
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Closes: https://lore.kernel.org/bpf/85697a7e-f897-4f74-8b43-82721bebc462@kili.mountain
Link: https://lore.kernel.org/bpf/20230813141900.1268-2-laoar.shao@gmail.com

commit | commitdiff | tree

David S. Miller [Wed, 16 Aug 2023 11:26:44 +0000 (12:26 +0100)]

Merge branch 'ipv6-expired-routes'

Kui-Feng Lee says:

====================
Remove expired routes with a separated list of routes.

FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree
can be expensive if the number of routes in a table is big, even if most of
them are permanent. Checking routes in a separated list of routes having
expiration will avoid this potential issue.

Background
==========

The size of a Linux IPv6 routing table can become a big problem if not
managed appropriately.  Now, Linux has a garbage collector to remove
expired routes periodically.  However, this may lead to a situation in
which the routing path is blocked for a long period due to an
excessive number of routes.

For example, years ago, there is a commit c7bb4b89033b ("ipv6: tcp:
drop silly ICMPv6 packet too big messages").  The root cause is that
malicious ICMPv6 packets were sent back for every small packet sent to
them. These packets add routes with an expiration time that prompts
the GC to periodically check all routes in the tables, including
permanent ones.

Why Route Expires
=================

Users can add IPv6 routes with an expiration time manually. However,
the Neighbor Discovery protocol may also generate routes that can
expire.  For example, Router Advertisement (RA) messages may create a
default route with an expiration time. [RFC 4861] For IPv4, it is not
possible to set an expiration time for a route, and there is no RA, so
there is no need to worry about such issues.

Create Routes with Expires
==========================

You can create routes with expires with the  command.

For example,

    ip -6 route add 2001:b000:591::3 via fe80::5054:ff:fe12:3457 \
        dev enp0s3 expires 30

The route that has been generated will be deleted automatically in 30
seconds.

GC of FIB6
==========

The function called fib6_run_gc() is responsible for performing
garbage collection (GC) for the Linux IPv6 stack. It checks for the
expiration of every route by traversing the trees of routing
tables. The time taken to traverse a routing table increases with its
size. Holding the routing table lock during traversal is particularly
undesirable. Therefore, it is preferable to keep the lock for the
shortest possible duration.

Solution
========

The cause of the issue is keeping the routing table locked during the
traversal of large trees. To solve this problem, we can create a separate
list of routes that have expiration. This will prevent GC from checking
permanent routes.

Result
======

We conducted a test to measure the execution times of fib6_gc_timer_cb()
and observed that it enhances the GC of FIB6. During the test, we added
permanent routes with the following numbers: 1000, 3000, 6000, and
9000. Additionally, we added a route with an expiration time.

Here are the average execution times for the kernel without the patch.
- 120020 ns with 1000 permanent routes
- 308920 ns with 3000 ...
- 581470 ns with 6000 ...
- 855310 ns with 9000 ...

The kernel with the patch consistently takes around 14000 ns to execute,
regardless of the number of permanent routes that are installed.

Major changes from v7:

- Fix warings raised by the patchwork.

Major changes from v6:

- Remove unnecessary check of tb6 in fib6_clean_expires_locked().

- Use ib6_clean_expires_locked() instead in fib6_purge_rt().

Major changes from v5:

- Change the order of adding new routes to the GC list and starting
   GC timer.

- Remove time measurements from the test case.

- Stop forcing GC flush.

Major changes from v4:

- Detect existence of 'strace' in the test case.

Major changes from v3:

- Fix the type of arg according to feedback.

- Add 1k temporary routes and 5K permanent routes in the test case.
   Measure time spending on GC with strace.

Major changes from v2:

- Remove unnecessary and incorrect sysctl restoring in the test case.

Major changes from v1:

- Moved gc_link to avoid creating a hole in fib6_info.

- Moved fib6_set_expires*() and fib6_clean_expires*() to the header
   file and inlined. And removed duplicated lines.

- Added a test case.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Kui-Feng Lee [Tue, 15 Aug 2023 18:07:06 +0000 (11:07 -0700)]

selftests: fib_tests: Add a test case for IPv6 garbage collection

Add 1000 IPv6 routes with expiration time (w/ and w/o additional 5000
permanet routes in the background.) Wait for a few seconds to make sure
they are removed correctly.

The expected output of the test looks like the following example.

> Fib6 garbage collection test
> TEST: ipv6 route garbage collection [ OK ]

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Kui-Feng Lee [Tue, 15 Aug 2023 18:07:05 +0000 (11:07 -0700)]

net/ipv6: Remove expired routes with a separated list of routes.

FIB6 GC walks trees of fib6_tables to remove expired routes. Walking a tree
can be expensive if the number of routes in a table is big, even if most of
them are permanent. Checking routes in a separated list of routes having
expiration will avoid this potential issue.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Kai-Heng Feng [Tue, 15 Aug 2023 17:01:11 +0000 (10:01 -0700)]

e1000e: Use PME poll to circumvent unreliable ACPI wake

On some I219 devices, ethernet cable plugging detection only works once
from PCI D3 state. Subsequent cable plugging does set PME bit correctly,
but device still doesn't get woken up.

Since I219 connects to the root complex directly, it relies on platform
firmware (ACPI) to wake it up. In this case, the GPE from _PRW only
works for first cable plugging but fails to notify the driver for
subsequent plugging events.

The issue was originally found on CNP, but the same issue can be found
on ADL too. So workaround the issue by continuing use PME poll after
first ACPI wake. As PME poll is always used, the runtime suspend
restriction for CNP can also be removed.

Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Tested-by: Naama Meir <naamax.meir@linux.intel.com>
Acked-by: Sasha Neftin <sasha.neftin@intel.com>
Reviewed-by: Simon Horman <simon.horman@corigine.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Abel Wu [Mon, 14 Aug 2023 07:09:11 +0000 (15:09 +0800)]

net-memcg: Fix scope of sockmem pressure indicators

Now there are two indicators of socket memory pressure sit inside
struct mem_cgroup, socket_pressure and tcpmem_pressure, indicating
memory reclaim pressure in memcg->memory and ->tcpmem respectively.

When in legacy mode (cgroupv1), the socket memory is charged into
->tcpmem which is independent of ->memory, so socket_pressure has
nothing to do with socket's pressure at all. Things could be worse
by taking socket_pressure into consideration in legacy mode, as a
pressure in ->memory can lead to premature reclamation/throttling
in socket.

While for the default mode (cgroupv2), the socket memory is charged
into ->memory, and ->tcpmem/->tcpmem_pressure are simply not used.

So {socket,tcpmem}_pressure are only used in default/legacy mode
respectively for indicating socket memory pressure. This patch fixes
the pieces of code that make mixed use of both.

Fixes: 8e8ae645249b ("mm: memcontrol: hook up vmpressure to socket pressure")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Acked-by: Shakeel Butt <shakeelb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Louis Peens [Tue, 15 Aug 2023 12:43:25 +0000 (14:43 +0200)]

nfp: update maintainer

Take over maintainership of the nfp driver from Simon as he
is moving away from Corigine.

Signed-off-by: Louis Peens <louis.peens@corigine.com>
Acked-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Grygorii Strashko [Tue, 15 Aug 2023 08:21:05 +0000 (11:21 +0300)]

net: ethernet: ti: am65-cpsw: add mqprio qdisc offload in channel mode

This patch adds MQPRIO Qdisc offload in full 'channel' mode which allows
not only setting up pri:tc mapping, but also configuring TX shapers on
external port FIFOs. The K3 CPSW MQPRIO Qdisc offload is expected to work
with VLAN/priority tagged packets. Non-tagged packets have to be mapped
only to TC0.

- TX traffic classes must be rated starting from TC that has highest
priority and with no gaps
- Traffic classes are used starting from 0, that has highest priority
- min_rate defines Committed Information Rate (guaranteed)
- max_rate defines Excess Information Rate (non guaranteed) and offloaded
as (max_rate[i] - tcX_min_rate[i])
- VLAN/priority tagged packets mapped to TC0 will exit switch with VLAN tag
priority 0

The configuration example:
ethtool -L eth1 tx 5
ethtool --set-priv-flags eth1 p0-rx-ptype-rrobin off

tc qdisc add dev eth1 parent root handle 100: mqprio num_tc 3 \
map 0 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 \
queues 1@0 1@1 1@2 hw 1 mode channel \
shaper bw_rlimit min_rate 0 100mbit 200mbit max_rate 0 101mbit 202mbit

tc qdisc replace dev eth2 handle 100: parent root mqprio num_tc 1 \
map 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 queues 1@0 hw 1

ip link add link eth1 name eth1.100 type vlan id 100
ip link set eth1.100 type vlan egress 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7

In the above example two ports share the same TX CPPI queue 0 for low
priority traffic. 3 traffic classes are defined for eth1 and mapped to:
TC0 - low priority, TX CPPI queue 0 -> ext Port 1 fifo0, no rate limit
TC1 - prio 2, TX CPPI queue 1 -> ext Port 1 fifo1, CIR=100Mbit/s, EIR=1Mbit/s
TC2 - prio 3, TX CPPI queue 2 -> ext Port 1 fifo2, CIR=200Mbit/s, EIR=2Mbit/s

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Roger Quadros <rogerq@kernel.org>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

David S. Miller [Wed, 16 Aug 2023 10:09:18 +0000 (11:09 +0100)]

Merge branch 'inet-data-races'

Eric Dumazet says:

====================
inet: socket lock and data-races avoidance

In this series, I converted 20 bits in "struct inet_sock" and made
them truly atomic.

This allows to implement many IP_ socket options in a lockless
fashion (no need to acquire socket lock), and fixes data-races
that were showing up in various KCSAN reports.

I also took care of IP_TTL/IP_MINTTL, but left few other options
for another series.

v4: Rebased after recent mptcp changes.
Added Reviewed-by: tags from Simon (thanks !)

v3: fixed patch 7, feedback from build bot about ipvs set_mcast_loop()

v2: addressed a feedback from a build bot in patch 9 by removing
unused issk variable in mptcp_setsockopt_sol_ip_set_transparent()
Added Acked-by: tags from Soheil (thanks !)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Wed, 16 Aug 2023 08:15:47 +0000 (08:15 +0000)]

inet: implement lockless IP_MINTTL

inet->min_ttl is already read with READ_ONCE().

Implementing IP_MINTTL socket option set/read
without holding the socket lock is easy.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Wed, 16 Aug 2023 08:15:46 +0000 (08:15 +0000)]

inet: implement lockless IP_TTL

ip_select_ttl() is racy, because it reads inet->uc_ttl
without proper locking.

Add READ_ONCE()/WRITE_ONCE() annotations while
allowing IP_TTL socket option to be set/read without
holding the socket lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Wed, 16 Aug 2023 08:15:45 +0000 (08:15 +0000)]

inet: move inet->defer_connect to inet->inet_flags

Make room in struct inet_sock by removing this bit field,
using one available bit in inet_flags instead.

Also move local_port_range to fill the resulting hole,
saving 8 bytes on 64bit arches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Wed, 16 Aug 2023 08:15:44 +0000 (08:15 +0000)]

inet: move inet->bind_address_no_port to inet->inet_flags

IP_BIND_ADDRESS_NO_PORT socket option can now be set/read
without locking the socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Wed, 16 Aug 2023 08:15:43 +0000 (08:15 +0000)]

inet: move inet->nodefrag to inet->inet_flags

IP_NODEFRAG socket option can now be set/read
without locking the socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Wed, 16 Aug 2023 08:15:42 +0000 (08:15 +0000)]

inet: move inet->is_icsk to inet->inet_flags

We move single bit fields to inet->inet_flags to avoid races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Wed, 16 Aug 2023 08:15:41 +0000 (08:15 +0000)]

inet: move inet->transparent to inet->inet_flags

IP_TRANSPARENT socket option can now be set/read
without locking the socket.

v2: removed unused issk variable in mptcp_setsockopt_sol_ip_set_transparent()
v4: rebased after commit 3f326a821b99 ("mptcp: change the mpc check helper to return a sk")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Wed, 16 Aug 2023 08:15:40 +0000 (08:15 +0000)]

inet: move inet->mc_all to inet->inet_frags

IP_MULTICAST_ALL socket option can now be set/read
without locking the socket.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

commit | commitdiff | tree

Eric Dumazet [Wed, 16 Aug 2023 08:15:39 +0000 (08:15 +0000)]

inet: move inet->mc_loop to inet->inet_frags

IP_MULTICAST_LOOP socket option can now be set/read
without locking the socket.

v3: fix build bot error reported in ipvs set_mcast_loop()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Linux 6.x block layer and io_uring tree(s)