linux-2.6-block.git
8 weeks agoMerge branch 'selftests-bpf-test-sockmap-sockhash-redirection'
Martin KaFai Lau [Thu, 22 May 2025 21:26:59 +0000 (14:26 -0700)]
Merge branch 'selftests-bpf-test-sockmap-sockhash-redirection'

Michal Luczaj says:

====================
selftests/bpf: Test sockmap/sockhash redirection

The idea behind this series is to comprehensively test the BPF redirection:

BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_SOCKHASH
x
sk_msg-to-egress,
sk_msg-to-ingress,
sk_skb-to-egress,
sk_skb-to-ingress
x
AF_INET, SOCK_STREAM,
AF_INET6, SOCK_STREAM,
AF_INET, SOCK_DGRAM,
AF_INET6, SOCK_DGRAM,
AF_UNIX, SOCK_STREAM,
AF_UNIX, SOCK_DGRAM,
AF_VSOCK, SOCK_STREAM,
AF_VSOCK, SOCK_SEQPACKET

New module is introduced, sockmap_redir: all supported and unsupported
redirect combinations are tested for success and failure respectively. Code
is pretty much stolen/adapted from Jakub Sitnicki's sockmap_redir_matrix.c
[1].

Usage:
$ cd tools/testing/selftests/bpf
$ make
$ sudo ./test_progs -t sockmap_redir
...
Summary: 1/576 PASSED, 0 SKIPPED, 0 FAILED

[1]: https://github.com/jsitnicki/sockmap-redir-matrix/blob/main/sockmap_redir_matrix.c

Changes in v3:
- Drop unrelated changes; sockmap_listen, test_sockmap_listen, doc
- Collect tags [Jakub, John]
- Introduce BPF verdict programs especially for sockmap_redir [Jiayuan]
- Link to v2: https://lore.kernel.org/r/20250411-selftests-sockmap-redir-v2-0-5f9b018d6704@rbox.co

Changes in v2:
- Verify that the unsupported redirect combos do fail [Jakub]
- Dedup tests in sockmap_listen
- Cosmetic changes and code reordering
- Link to v1: https://lore.kernel.org/bpf/42939687-20f9-4a45-b7c2-342a0e11a014@rbox.co/
====================

Link: https://patch.msgid.link/20250515-selftests-sockmap-redir-v3-0-a1ea723f7e7e@rbox.co
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
8 weeks agoselftests/bpf: sockmap_listen cleanup: Drop af_inet SOCK_DGRAM redir tests
Michal Luczaj [Wed, 14 May 2025 22:15:31 +0000 (00:15 +0200)]
selftests/bpf: sockmap_listen cleanup: Drop af_inet SOCK_DGRAM redir tests

Remove tests covered by sockmap_redir.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-8-a1ea723f7e7e@rbox.co
8 weeks agoselftests/bpf: sockmap_listen cleanup: Drop af_unix redir tests
Michal Luczaj [Wed, 14 May 2025 22:15:30 +0000 (00:15 +0200)]
selftests/bpf: sockmap_listen cleanup: Drop af_unix redir tests

Remove tests covered by sockmap_redir.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-7-a1ea723f7e7e@rbox.co
8 weeks agoselftests/bpf: sockmap_listen cleanup: Drop af_vsock redir tests
Michal Luczaj [Wed, 14 May 2025 22:15:29 +0000 (00:15 +0200)]
selftests/bpf: sockmap_listen cleanup: Drop af_vsock redir tests

Remove tests covered by sockmap_redir.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-6-a1ea723f7e7e@rbox.co
8 weeks agoselftests/bpf: Add selftest for sockmap/hashmap redirection
Michal Luczaj [Wed, 14 May 2025 22:15:28 +0000 (00:15 +0200)]
selftests/bpf: Add selftest for sockmap/hashmap redirection

Test redirection logic. All supported and unsupported redirect combinations
are tested for success and failure respectively.

BPF_MAP_TYPE_SOCKMAP
BPF_MAP_TYPE_SOCKHASH
x
sk_msg-to-egress
sk_msg-to-ingress
sk_skb-to-egress
sk_skb-to-ingress
x
AF_INET, SOCK_STREAM
AF_INET6, SOCK_STREAM
AF_INET, SOCK_DGRAM
AF_INET6, SOCK_DGRAM
AF_UNIX, SOCK_STREAM
AF_UNIX, SOCK_DGRAM
AF_VSOCK, SOCK_STREAM
AF_VSOCK, SOCK_SEQPACKET

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-5-a1ea723f7e7e@rbox.co
8 weeks agoselftests/bpf: Introduce verdict programs for sockmap_redir
Michal Luczaj [Wed, 14 May 2025 22:15:27 +0000 (00:15 +0200)]
selftests/bpf: Introduce verdict programs for sockmap_redir

Instead of piggybacking on test_sockmap_listen, introduce
test_sockmap_redir especially for sockmap redirection tests.

Suggested-by: Jiayuan Chen <mrpre@163.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-4-a1ea723f7e7e@rbox.co
8 weeks agoselftests/bpf: Add u32()/u64() to sockmap_helpers
Michal Luczaj [Wed, 14 May 2025 22:15:26 +0000 (00:15 +0200)]
selftests/bpf: Add u32()/u64() to sockmap_helpers

Add integer wrappers for convenient sockmap usage.

While there, fix misaligned trailing slashes.

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-3-a1ea723f7e7e@rbox.co
8 weeks agoselftests/bpf: Add socket_kind_to_str() to socket_helpers
Michal Luczaj [Wed, 14 May 2025 22:15:25 +0000 (00:15 +0200)]
selftests/bpf: Add socket_kind_to_str() to socket_helpers

Add function that returns string representation of socket's domain/type.

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-2-a1ea723f7e7e@rbox.co
8 weeks agoselftests/bpf: Support af_unix SOCK_DGRAM socket pair creation
Michal Luczaj [Wed, 14 May 2025 22:15:24 +0000 (00:15 +0200)]
selftests/bpf: Support af_unix SOCK_DGRAM socket pair creation

Handle af_unix in init_addr_loopback(). For pair creation, bind() the peer
socket to make SOCK_DGRAM connect() happy.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/r/20250515-selftests-sockmap-redir-v3-1-a1ea723f7e7e@rbox.co
8 weeks agoselftests/bpf: Add SKIP_LLVM makefile variable
Mykyta Yatsenko [Thu, 22 May 2025 01:38:13 +0000 (02:38 +0100)]
selftests/bpf: Add SKIP_LLVM makefile variable

Introduce SKIP_LLVM makefile variable that allows to avoid using llvm
dependencies when building BPF selftests. This is different from
existing feature-llvm, as the latter is a result of automatic detection
and should not be set by user explicitly.
Avoiding llvm dependencies could be useful for environments that do not
have them, given that as of now llvm dependencies are required only by
jit_disasm_helpers.c.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250522013813.125428-1-mykyta.yatsenko5@gmail.com
8 weeks agoMerge branch 's390-bpf-use-kernel-s-expoline-thunks'
Alexei Starovoitov [Thu, 22 May 2025 15:40:57 +0000 (08:40 -0700)]
Merge branch 's390-bpf-use-kernel-s-expoline-thunks'

Ilya Leoshkevich says:

====================
This series simplifies the s390 JIT by replacing the generation of
expolines (Spectre mitigation) with using the ones from the kernel
text. This is possible thanks to the V!=R s390 kernel rework.
Patch 1 is a small prerequisite for arch/s390 that I would like to
get in via the BPF tree. It has Heiko's Acked-by.
Patches 2 and 3 are the implementation.
====================

Link: https://patch.msgid.link/20250519223646.66382-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 weeks agos390/bpf: Use kernel's expoline thunks
Ilya Leoshkevich [Mon, 19 May 2025 22:30:06 +0000 (23:30 +0100)]
s390/bpf: Use kernel's expoline thunks

Simplify the JIT code by replacing the custom expolines with the ones
defined in the kernel text.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250519223646.66382-4-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 weeks agos390/bpf: Add macros for calling external functions
Ilya Leoshkevich [Mon, 19 May 2025 22:30:05 +0000 (23:30 +0100)]
s390/bpf: Add macros for calling external functions

After the V!=R rework (commit c98d2ecae08f ("s390/mm: Uncouple physical
vs virtual address spaces")), kernel and BPF programs are allocated
within a 4G region, making it possible to use relative addressing to
directly use kernel functions from BPF code.

Add two new macros for calling kernel functions from BPF code:
EMIT6_PCREL_RILB_PTR() and EMIT6_PCREL_RILC_PTR(). Factor out parts
of the existing macros that are helpful for implementing the new ones.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250519223646.66382-3-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 weeks agos390: always declare expoline thunks
Ilya Leoshkevich [Mon, 19 May 2025 22:30:04 +0000 (23:30 +0100)]
s390: always declare expoline thunks

It would be convenient to use the following pattern in the BPF JIT:

  if (nospec_uses_trampoline())
    emit_call(__s390_indirect_jump_r1);

Unfortunately with CONFIG_EXPOLINE=n the compiler complains about the
missing prototype of __s390_indirect_jump_r1(). One could wrap the
whole "if" statement in an #ifdef, but this clutters the code.

Instead, declare expoline thunk prototypes even when compiling without
expolines. When using the above code structure and compiling without
expolines, references to them are optimized away, and there are no
linker errors.

Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250519223646.66382-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 weeks agobpf: Revert "bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach...
Di Shen [Tue, 20 May 2025 05:49:43 +0000 (13:49 +0800)]
bpf: Revert "bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic"

This reverts commit 4a8f635a6054.

Althought get_pid_task() internally already calls rcu_read_lock() and
rcu_read_unlock(), the find_vpid() was not.

The documentation for find_vpid() clearly states:
"Must be called with the tasklist_lock or rcu_read_lock() held."

Add proper rcu_read_lock/unlock() to protect the find_vpid().

Fixes: 4a8f635a6054 ("bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic")
Reported-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Di Shen <di.shen@unisoc.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250520054943.5002-1-xuewen.yan@unisoc.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 weeks agoMerge branch 'libbpf-support-multi-split-btf'
Andrii Nakryiko [Tue, 20 May 2025 23:22:30 +0000 (16:22 -0700)]
Merge branch 'libbpf-support-multi-split-btf'

Alan Maguire says:

====================
libbpf: support multi-split BTF

In discussing handling of inlines in BTF [1], one area which we may need
support for in the future is multiple split BTF, where split BTF sits
atop another split BTF which sits atop base BTF.  This two-patch series
fixes one issue discovered when testing multi-split BTF and extends the
split BTF test to cover multi-split BTF also.

[1] https://lore.kernel.org/dwarves/20250416-btf_inline-v1-0-e4bd2f8adae5@meta.com/
====================

Link: https://patch.msgid.link/20250519165935.261614-1-alan.maguire@oracle.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
8 weeks agoselftests/bpf: Test multi-split BTF
Alan Maguire [Mon, 19 May 2025 16:59:35 +0000 (17:59 +0100)]
selftests/bpf: Test multi-split BTF

Extend split BTF test to cover case where we create split BTF on top of
existing split BTF and add info to it; ensure that such BTF can be
created and handled by searching within it, dumping/comparing to expected.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250519165935.261614-3-alan.maguire@oracle.com
8 weeks agolibbpf/btf: Fix string handling to support multi-split BTF
Alan Maguire [Mon, 19 May 2025 16:59:34 +0000 (17:59 +0100)]
libbpf/btf: Fix string handling to support multi-split BTF

libbpf handling of split BTF has been written largely with the
assumption that multiple splits are possible, i.e. split BTF on top of
split BTF on top of base BTF.  One area where this does not quite work
is string handling in split BTF; the start string offset should be the
base BTF string section length + the base BTF string offset.  This
worked in the past because for a single split BTF with base the start
string offset was always 0.

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250519165935.261614-2-alan.maguire@oracle.com
2 months agoselftests/bpf: Remove unnecessary link dependencies
Mykyta Yatsenko [Fri, 16 May 2025 19:55:22 +0000 (20:55 +0100)]
selftests/bpf: Remove unnecessary link dependencies

Remove llvm dependencies from binaries that do not use llvm libraries.
Filter out libxml2 from llvm dependencies, as it seems that
it is not actually used. This patch reduced link dependencies
for BPF selftests.
The next line was adding llvm dependencies to every target in the
makefile, while the only targets that require those are test
runnners (test_progs, test_progs-no_alu32,...):
```
$(OUTPUT)/$(TRUNNER_BINARY): LDLIBS += $$(LLVM_LDLIBS)
```

Before this change:
ldd linux/tools/testing/selftests/bpf/veristat
    linux-vdso.so.1 (0x00007ffd2c3fd000)
    libelf.so.1 => /lib64/libelf.so.1 (0x00007fe1dcf89000)
    libz.so.1 => /lib64/libz.so.1 (0x00007fe1dcf6f000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fe1dce94000)
    libzstd.so.1 => /lib64/libzstd.so.1 (0x00007fe1dcddd000)
    libxml2.so.2 => /lib64/libxml2.so.2 (0x00007fe1dcc54000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fe1dca00000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fe1dc600000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fe1dcfb1000)
    liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fe1dc9d4000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fe1dcc38000)

After:
ldd linux/tools/testing/selftests/bpf/veristat
    linux-vdso.so.1 (0x00007ffc83370000)
    libelf.so.1 => /lib64/libelf.so.1 (0x00007f4b87515000)
    libz.so.1 => /lib64/libz.so.1 (0x00007f4b874fb000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f4b87200000)
    libzstd.so.1 => /lib64/libzstd.so.1 (0x00007f4b87444000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f4b8753d000)

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250516195522.311769-1-mykyta.yatsenko5@gmail.com
2 months agobpf: WARN_ONCE on verifier bugs
Paul Chaignon [Mon, 19 May 2025 13:43:57 +0000 (15:43 +0200)]
bpf: WARN_ONCE on verifier bugs

Throughout the verifier's logic, there are multiple checks for
inconsistent states that should never happen and would indicate a
verifier bug. These bugs are typically logged in the verifier logs and
sometimes preceded by a WARN_ONCE.

This patch reworks these checks to consistently emit a verifier log AND
a warning when CONFIG_DEBUG_KERNEL is enabled. The consistent use of
WARN_ONCE should help fuzzers (ex. syzkaller) expose any situation
where they are actually able to reach one of those buggy verifier
states.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Paul Chaignon <paul.chaignon@gmail.com>
Link: https://lore.kernel.org/r/aCs1nYvNNMq8dAWP@mail.gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoMerge branch 's390-bpf-remove-the-orig_call-null-check'
Alexei Starovoitov [Thu, 15 May 2025 00:48:57 +0000 (17:48 -0700)]
Merge branch 's390-bpf-remove-the-orig_call-null-check'

Ilya Leoshkevich says:

====================
I've been looking at fixing the tailcall_bpf2bpf_hierarchy failures on
s390. One of the challenges is that when a BPF trampoline calls a BPF
prog A, the prologue of A sets the tail call count to 0. Therefore it
would be useful to know whether the trampoline is attached to some
other BPF prog B, in which case A should be called using an offset
equal to tail_call_start, bypassing the tail call count initialization.

The trampoline attachment point is passed to trampoline functions via
the orig_call variable. Unfortunately in the case of calculating the
size of a struct_ops trampoline it's NULL, and I could not think of a
good reason to have it this way. This series makes it always non-NULL.
====================

Link: https://patch.msgid.link/20250512221911.61314-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agos390/bpf: Remove the orig_call NULL check
Ilya Leoshkevich [Mon, 12 May 2025 20:57:31 +0000 (22:57 +0200)]
s390/bpf: Remove the orig_call NULL check

Now that orig_call can never be NULL, remove the respective check.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250512221911.61314-3-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf: Pass the same orig_call value to trampoline functions
Ilya Leoshkevich [Mon, 12 May 2025 20:57:30 +0000 (22:57 +0200)]
bpf: Pass the same orig_call value to trampoline functions

There is currently some confusion in the s390x JIT regarding whether
orig_call can be NULL and what that means. Originally the NULL value
was used to distinguish the struct_ops case, but this was superseded by
BPF_TRAMP_F_INDIRECT (see commit 0c970ed2f87c ("s390/bpf: Fix indirect
trampoline generation").

The remaining reason to have this check is that NULL can actually be
passed to the arch_bpf_trampoline_size() call - but not to the
respective arch_prepare_bpf_trampoline()! call - by
bpf_struct_ops_prepare_trampoline().

Remove this asymmetry by passing stub_func to both functions, so that
JITs may rely on orig_call never being NULL.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250512221911.61314-2-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agos390/bpf: Store backchain even for leaf progs
Ilya Leoshkevich [Mon, 12 May 2025 12:26:15 +0000 (14:26 +0200)]
s390/bpf: Store backchain even for leaf progs

Currently a crash in a leaf prog (caused by a bug) produces the
following call trace:

     [<000003ff600ebf00>] bpf_prog_6df0139e1fbf2789_fentry+0x20/0x78
     [<0000000000000000>] 0x0

This is because leaf progs do not store backchain. Fix by making all
progs do it. This is what GCC and Clang-generated code does as well.
Now the call trace looks like this:

     [<000003ff600eb0f2>] bpf_prog_6df0139e1fbf2789_fentry+0x2a/0x80
     [<000003ff600ed096>] bpf_trampoline_201863462940+0x96/0xf4
     [<000003ff600e3a40>] bpf_prog_05f379658fdd72f2_classifier_0+0x58/0xc0
     [<000003ffe0aef070>] bpf_test_run+0x210/0x390
     [<000003ffe0af0dc2>] bpf_prog_test_run_skb+0x25a/0x668
     [<000003ffe038a90e>] __sys_bpf+0xa46/0xdb0
     [<000003ffe038ad0c>] __s390x_sys_bpf+0x44/0x50
     [<000003ffe0defea8>] __do_syscall+0x150/0x280
     [<000003ffe0e01d5c>] system_call+0x74/0x98

Fixes: 054623105728 ("s390/bpf: Add s390x eBPF JIT compiler backend")
Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
Link: https://lore.kernel.org/r/20250512122717.54878-1-iii@linux.ibm.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoselftests/bpf: Relax TCPOPT_WINDOW validation in test_tcp_custom_syncookie.c.
Kuniyuki Iwashima [Wed, 14 May 2025 21:40:20 +0000 (14:40 -0700)]
selftests/bpf: Relax TCPOPT_WINDOW validation in test_tcp_custom_syncookie.c.

The custom syncookie test expects TCPOPT_WINDOW to be 7 based on the
kernel’s behaviour at the time, but the upcoming series [0] will bump
it to 10.

Let's relax the test to allow any valid TCPOPT_WINDOW value in the
range 1–14.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/netdev/20250513193919.1089692-1-edumazet@google.com/
Link: https://patch.msgid.link/20250514214021.85187-1-kuniyu@amazon.com
2 months agolibbpf: Check bpf_map_skeleton link for NULL
Mykyta Yatsenko [Wed, 14 May 2025 11:32:20 +0000 (12:32 +0100)]
libbpf: Check bpf_map_skeleton link for NULL

Avoid dereferencing bpf_map_skeleton's link field if it's NULL.
If BPF map skeleton is created with the size, that indicates containing
link field, but the field was not actually initialized with valid
bpf_link pointer, libbpf crashes. This may happen when using libbpf-rs
skeleton.
Skeleton loading may still progress, but user needs to attach struct_ops
map separately.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250514113220.219095-1-mykyta.yatsenko5@gmail.com
2 months agobpf: Add support for __prog argument suffix to pass in prog->aux
Kumar Kartikeya Dwivedi [Tue, 13 May 2025 14:28:12 +0000 (07:28 -0700)]
bpf: Add support for __prog argument suffix to pass in prog->aux

Instead of hardcoding the list of kfuncs that need prog->aux passed to
them with a combination of fixup_kfunc_call adjustment + __ign suffix,
combine both in __prog suffix, which ignores the argument passed in, and
fixes it up to the prog->aux. This allows kfuncs to have the prog->aux
passed into them without having to touch the verifier.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20250513142812.1021591-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf: Fix WARN() in get_bpf_raw_tp_regs
Tao Chen [Tue, 13 May 2025 04:27:47 +0000 (12:27 +0800)]
bpf: Fix WARN() in get_bpf_raw_tp_regs

syzkaller reported an issue:

WARNING: CPU: 3 PID: 5971 at kernel/trace/bpf_trace.c:1861 get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861
Modules linked in:
CPU: 3 UID: 0 PID: 5971 Comm: syz-executor205 Not tainted 6.15.0-rc5-syzkaller-00038-g707df3375124 #0 PREEMPT(full)
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:get_bpf_raw_tp_regs+0xa4/0x100 kernel/trace/bpf_trace.c:1861
RSP: 0018:ffffc90003636fa8 EFLAGS: 00010293
RAX: 0000000000000000 RBX: 0000000000000003 RCX: ffffffff81c6bc4c
RDX: ffff888032efc880 RSI: ffffffff81c6bc83 RDI: 0000000000000005
RBP: ffff88806a730860 R08: 0000000000000005 R09: 0000000000000003
R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000004
R13: 0000000000000001 R14: ffffc90003637008 R15: 0000000000000900
FS:  0000000000000000(0000) GS:ffff8880d6cdf000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7baee09130 CR3: 0000000029f5a000 CR4: 0000000000352ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 ____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1934 [inline]
 bpf_get_stack_raw_tp+0x24/0x160 kernel/trace/bpf_trace.c:1931
 bpf_prog_ec3b2eefa702d8d3+0x43/0x47
 bpf_dispatcher_nop_func include/linux/bpf.h:1316 [inline]
 __bpf_prog_run include/linux/filter.h:718 [inline]
 bpf_prog_run include/linux/filter.h:725 [inline]
 __bpf_trace_run kernel/trace/bpf_trace.c:2363 [inline]
 bpf_trace_run3+0x23f/0x5a0 kernel/trace/bpf_trace.c:2405
 __bpf_trace_mmap_lock_acquire_returned+0xfc/0x140 include/trace/events/mmap_lock.h:47
 __traceiter_mmap_lock_acquire_returned+0x79/0xc0 include/trace/events/mmap_lock.h:47
 __do_trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline]
 trace_mmap_lock_acquire_returned include/trace/events/mmap_lock.h:47 [inline]
 __mmap_lock_do_trace_acquire_returned+0x138/0x1f0 mm/mmap_lock.c:35
 __mmap_lock_trace_acquire_returned include/linux/mmap_lock.h:36 [inline]
 mmap_read_trylock include/linux/mmap_lock.h:204 [inline]
 stack_map_get_build_id_offset+0x535/0x6f0 kernel/bpf/stackmap.c:157
 __bpf_get_stack+0x307/0xa10 kernel/bpf/stackmap.c:483
 ____bpf_get_stack kernel/bpf/stackmap.c:499 [inline]
 bpf_get_stack+0x32/0x40 kernel/bpf/stackmap.c:496
 ____bpf_get_stack_raw_tp kernel/trace/bpf_trace.c:1941 [inline]
 bpf_get_stack_raw_tp+0x124/0x160 kernel/trace/bpf_trace.c:1931
 bpf_prog_ec3b2eefa702d8d3+0x43/0x47

Tracepoint like trace_mmap_lock_acquire_returned may cause nested call
as the corner case show above, which will be resolved with more general
method in the future. As a result, WARN_ON_ONCE will be triggered. As
Alexei suggested, remove the WARN_ON_ONCE first.

Fixes: 9594dc3c7e71 ("bpf: fix nested bpf tracepoints with per-cpu data")
Reported-by: syzbot+45b0c89a0fc7ae8dbadc@syzkaller.appspotmail.com
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Tao Chen <chen.dylane@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250513042747.757042-1-chen.dylane@linux.dev
Closes: https://lore.kernel.org/bpf/8bc2554d-1052-4922-8832-e0078a033e1d@gmail.com

2 months agodocs: bpf: Fix bullet point formatting warning
Khaled Elnaggar [Tue, 13 May 2025 01:58:59 +0000 (04:58 +0300)]
docs: bpf: Fix bullet point formatting warning

Fix indentation for a bullet list item in bpf_iterators.rst.
According to reStructuredText rules, bullet list item bodies must be
consistently indented relative to the bullet. The indentation of the
first line after the bullet determines the alignment for the rest of
the item body.

Reported by smatch:
  /linux/Documentation/bpf/bpf_iterators.rst:55: WARNING: Bullet list ends without a blank line; unexpected unindent. [docutils]

Fixes: 7220eabff8cb ("bpf, docs: document open-coded BPF iterators")
Signed-off-by: Khaled Elnaggar <khaledelnaggarlinux@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250513015901.475207-1-khaledelnaggarlinux@gmail.com
2 months agoMerge branch 'introduce-kfuncs-for-memory-reads-into-dynptrs'
Alexei Starovoitov [Tue, 13 May 2025 01:29:04 +0000 (18:29 -0700)]
Merge branch 'introduce-kfuncs-for-memory-reads-into-dynptrs'

Mykyta Yatsenko says:

====================
Introduce kfuncs for memory reads into dynptrs

From: Mykyta Yatsenko <yatsenko@meta.com>

This patch adds new kfuncs that enable reading variable-length
user or kernel data directly into dynptrs.
These kfuncs provide a way to perform dynamically-sized reads
while maintaining memory safety. Unlike existing
`bpf_probe_read_{user|kernel}` APIs, which are limited to constant-sized
reads, these new kfuncs allow for more flexible data access.

v4 -> v5
 * Fix pointers annotations, use __user where necessary, cast where needed

v3 -> v4
 * Added pid filtering in selftests

v2 -> v3
 * Add KF_TRUSTED_ARGS for kfuncs that take pointer to task_struct
 as an argument
 * Remove checks for non-NULL task, where it was not necessary
 * Added comments on constants used in selftests, etc.

v1 -> v2
 * Renaming helper functions to use "user_str" instead of "user_data_str"
 suffix

====================

Link: https://patch.msgid.link/20250512205348.191079-1-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoselftests/bpf: introduce tests for dynptr copy kfuncs
Mykyta Yatsenko [Mon, 12 May 2025 20:53:48 +0000 (21:53 +0100)]
selftests/bpf: introduce tests for dynptr copy kfuncs

Introduce selftests verifying newly-added dynptr copy kfuncs.
Covering contiguous and non-contiguous memory backed dynptrs.

Disable test_probe_read_user_str_dynptr that triggers bug in
strncpy_from_user_nofault. Patch to fix the issue [1].

[1] https://patchwork.kernel.org/project/linux-mm/patch/20250422131449.57177-1-mykyta.yatsenko5@gmail.com/

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20250512205348.191079-4-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf: Implement dynptr copy kfuncs
Mykyta Yatsenko [Mon, 12 May 2025 20:53:47 +0000 (21:53 +0100)]
bpf: Implement dynptr copy kfuncs

This patch introduces a new set of kfuncs for working with dynptrs in
BPF programs, enabling reading variable-length user or kernel data
into dynptr directly. To enable memory-safety, verifier allows only
constant-sized reads via existing bpf_probe_read_{user|kernel} etc.
kfuncs, dynptr-based kfuncs allow dynamically-sized reads without memory
safety shortcomings.

The following kfuncs are introduced:
* `bpf_probe_read_kernel_dynptr()`: probes kernel-space data into a dynptr
* `bpf_probe_read_user_dynptr()`: probes user-space data into a dynptr
* `bpf_probe_read_kernel_str_dynptr()`: probes kernel-space string into
a dynptr
* `bpf_probe_read_user_str_dynptr()`: probes user-space string into a
dynptr
* `bpf_copy_from_user_dynptr()`: sleepable, copies user-space data into
a dynptr for the current task
* `bpf_copy_from_user_str_dynptr()`: sleepable, copies user-space string
into a dynptr for the current task
* `bpf_copy_from_user_task_dynptr()`: sleepable, copies user-space data
of the task into a dynptr
* `bpf_copy_from_user_task_str_dynptr()`: sleepable, copies user-space
string of the task into a dynptr

The implementation is built on two generic functions:
 * __bpf_dynptr_copy
 * __bpf_dynptr_copy_str
These functions take function pointers as arguments, enabling the
copying of data from various sources, including both kernel and user
space.
Use __always_inline for generic functions and callbacks to make sure the
compiler doesn't generate indirect calls into callbacks, which is more
expensive, especially on some kernel configurations. Inlining allows
compiler to put direct calls into all the specific callback implementations
(copy_user_data_sleepable, copy_user_data_nofault, and so on).

Reviewed-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20250512205348.191079-3-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agohelpers: make few bpf helpers public
Mykyta Yatsenko [Mon, 12 May 2025 20:53:46 +0000 (21:53 +0100)]
helpers: make few bpf helpers public

Make bpf_dynptr_slice_rdwr, bpf_dynptr_check_off_len and
__bpf_dynptr_write available outside of the helpers.c by
adding their prototypes into linux/include/bpf.h.
bpf_dynptr_check_off_len() implementation is moved to header and made
inline explicitly, as small function should typically be inlined.

These functions are going to be used from bpf_trace.c in the next
patch of this series.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Link: https://lore.kernel.org/r/20250512205348.191079-2-mykyta.yatsenko5@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agolibbpf: Use proper errno value in nlattr
Anton Protopopov [Sat, 10 May 2025 18:20:11 +0000 (18:20 +0000)]
libbpf: Use proper errno value in nlattr

Return value of the validate_nla() function can be propagated all the
way up to users of libbpf API. In case of error this libbpf version
of validate_nla returns -1 which will be seen as -EPERM from user's
point of view. Instead, return a more reasonable -EINVAL.

Fixes: bbf48c18ee0c ("libbpf: add error reporting in XDP")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250510182011.2246631-1-a.s.protopopov@gmail.com
2 months agoselftests/bpf: Allow skipping docs compilation
Mykyta Yatsenko [Sat, 10 May 2025 00:24:50 +0000 (01:24 +0100)]
selftests/bpf: Allow skipping docs compilation

Currently rst2man is required to build bpf selftests, as the tool is
used by Makefile.docs. rst2man may be missing in some build
environments and is not essential for selftests. It makes sense to
allow user to skip building docs.

This patch adds SKIP_DOCS variable into bpf selftests Makefile that when
set to 1 allows skipping building docs, for example:
make -C tools/testing/selftests TARGETS=bpf SKIP_DOCS=1

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250510002450.365613-1-mykyta.yatsenko5@gmail.com
2 months agoMerge branch 'fix-verifier-test-failures-in-verbose-mode'
Alexei Starovoitov [Mon, 12 May 2025 17:43:43 +0000 (10:43 -0700)]
Merge branch 'fix-verifier-test-failures-in-verbose-mode'

Gregory Bell says:

====================
Fix verifier test failures in verbose mode

This patch series fixes two issues that cause false failures in the
BPF verifier test suite when run with verbose output (`-v`).

The following tests fail only when running the test_verifier in
verbose.

This leads to inconsistent results across verbose and
non-verbose runs.

Patch 1 addresses an issue where the verbose flag (`-v`) unintentionally
overrides the `opts.log_level`, leading to incorrect contents when checking
bpf_vlog in tests with `expected_ret == VERBOSE_ACCEPT`. This occurs when
running verbose with `-v` but not `-vv`

Patch 2 increases the size of the `bpf_vlog[]` buffer to prevent truncation
of large verifier logs, which was causing failures in several scale and
64-bit immediate tests.

Before patches:
./test_verifier | grep FAIL
Summary: 790 PASSED, 0 SKIPPED, 0 FAILED

./test_verifier -v | grep FAIL
Summary: 782 PASSED, 0 SKIPPED, 8 FAILED

./test_verifier -vv | grep FAIL
Summary: 787 PASSED, 0 SKIPPED, 3 FAILED

After patches:
./test_verifier -v | grep FAIL
Summary: 790 PASSED, 0 SKIPPED, 0 FAILED
./test_verifier -vv | grep FAIL
Summary: 790 PASSED, 0 SKIPPED, 0 FAILED

These fixes improve test reliability and ensure consistent behavior across
verbose and non-verbose runs.
====================

Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://patch.msgid.link/cover.1747058195.git.grbell@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoselftests/bpf: test_verifier verbose log overflows
Gregory Bell [Mon, 12 May 2025 14:04:13 +0000 (10:04 -0400)]
selftests/bpf: test_verifier verbose log overflows

Tests:
 - 458/p ld_dw: xor semi-random 64-bit imms, test 5
 - 501/p scale: scale test 1
 - 502/p scale: scale test 2

fail in verbose mode due to bpf_vlog[] overflowing. These tests
generate large verifier logs that exceed the current buffer size,
causing them to fail to load.

Increase the size of the bpf_vlog[] buffer to accommodate larger
logs and prevent false failures during test runs with verbose output.

Signed-off-by: Gregory Bell <grbell@redhat.com>
Link: https://lore.kernel.org/r/e49267100f07f099a5877a3a5fc797b702bbaf0c.1747058195.git.grbell@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoselftests/bpf: test_verifier verbose causes erroneous failures
Gregory Bell [Mon, 12 May 2025 14:04:12 +0000 (10:04 -0400)]
selftests/bpf: test_verifier verbose causes erroneous failures

When running test_verifier with the -v flag and a test with
`expected_ret==VERBOSE_ACCEPT`, the opts.log_level is unintentionally
overwritten because the verbose flag takes precedence. This leads to
a mismatch in the expected and actual contents of bpf_vlog, causing
tests to fail incorrectly.

Reorder the conditional logic that sets opts.log_level to preserve
the expected log level and prevent it from being overridden by -v.

Signed-off-by: Gregory Bell <grbell@redhat.com>
Link: https://lore.kernel.org/r/182bf00474f817c99f968a9edb119882f62be0f8.1747058195.git.grbell@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf, docs: document open-coded BPF iterators
Andrii Nakryiko [Fri, 9 May 2025 18:03:50 +0000 (11:03 -0700)]
bpf, docs: document open-coded BPF iterators

Extract BPF open-coded iterators documentation spread out across a few
original commit messages ([0], [1]) into a dedicated doc section under
Documentation/bpf/bpf_iterators.rst. Also make explicit expectation that
BPF iterator program type should be accompanied by a corresponding
open-coded BPF iterator implementation, going forward.

  [0] https://lore.kernel.org/all/20230308184121.1165081-3-andrii@kernel.org/
  [1] https://lore.kernel.org/all/20230308184121.1165081-4-andrii@kernel.org/

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20250509180350.2604946-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoMerge branch 'ktls-sockmap-fix-missing-uncharge-operation-and-add-selfttest'
Martin KaFai Lau [Sat, 10 May 2025 00:51:16 +0000 (17:51 -0700)]
Merge branch 'ktls-sockmap-fix-missing-uncharge-operation-and-add-selfttest'

Jiayuan Chen says:

====================
ktls, sockmap: Fix missing uncharge operation and add selfttest

Cong reported a warning when running ./test_sockmp:
https://lore.kernel.org/bpf/aAmIi0vlycHtbXeb@pop-os.localdomain/T/#t

------------[ cut here ]------------
WARNING: CPU: 1 PID: 40 at net/ipv4/af_inet.c inet_sock_destruct+0x173/0x1d5
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
Workqueue: events sk_psock_destroy
RIP: 0010:inet_sock_destruct+0x173/0x1d5
RSP: 0018:ffff8880085cfc18 EFLAGS: 00010202
RAX: 1ffff11003dbfc00 RBX: ffff88801edfe3e8 RCX: ffffffff822f5af4
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88801edfe16c
RBP: ffff88801edfe184 R08: ffffed1003dbfc31 R09: 0000000000000000
R10: ffffffff822f5ab7 R11: ffff88801edfe187 R12: ffff88801edfdec0
R13: ffff888020376ac0 R14: ffff888020376ac0 R15: ffff888020376a60
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556365155830 CR3: 000000001d6aa000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 __sk_destruct+0x46/0x222
 sk_psock_destroy+0x22f/0x242
 process_one_work+0x504/0x8a8
 ? process_one_work+0x39d/0x8a8
 ? __pfx_process_one_work+0x10/0x10
 ? worker_thread+0x44/0x2ae
 ? __list_add_valid_or_report+0x83/0xea
 ? srso_return_thunk+0x5/0x5f
 ? __list_add+0x45/0x52
 process_scheduled_works+0x73/0x82
 worker_thread+0x1ce/0x2ae

When we specify apply_bytes, we divide the msg into multiple segments,
each with a length of 'send', and every time we send this part of the data
using tcp_bpf_sendmsg_redir(), we use sk_msg_return_zero() to uncharge the
memory of the specified 'send' size.

However, if the first segment of data fails to send, for example, the
peer's buffer is full, we need to release all of the msg. When releasing
the msg, we haven't uncharged the memory of the subsequent segments.

This modification does not make significant logical changes, but only
fills in the missing uncharge places.

This issue has existed all along, until it was exposed after we added the
apply test in test_sockmap:

commit 3448ad23b34e ("selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap")
====================

Link: https://patch.msgid.link/20250425060015.6968-1-jiayuan.chen@linux.dev
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agoselftests/bpf: Add test to cover sockmap with ktls
Jiayuan Chen [Fri, 25 Apr 2025 05:59:58 +0000 (13:59 +0800)]
selftests/bpf: Add test to cover sockmap with ktls

The selftest can reproduce an issue where we miss the uncharge operation
when freeing msg, which will cause the following warning. We fixed the
issue and added this reproducer to selftest to ensure it will not happen
again.

------------[ cut here ]------------
WARNING: CPU: 1 PID: 40 at net/ipv4/af_inet.c inet_sock_destruct+0x173/0x1d5
Tainted: [W]=WARN
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
Workqueue: events sk_psock_destroy
RIP: 0010:inet_sock_destruct+0x173/0x1d5
RSP: 0018:ffff8880085cfc18 EFLAGS: 00010202
RAX: 1ffff11003dbfc00 RBX: ffff88801edfe3e8 RCX: ffffffff822f5af4
RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff88801edfe16c
RBP: ffff88801edfe184 R08: ffffed1003dbfc31 R09: 0000000000000000
R10: ffffffff822f5ab7 R11: ffff88801edfe187 R12: ffff88801edfdec0
R13: ffff888020376ac0 R14: ffff888020376ac0 R15: ffff888020376a60
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000556365155830 CR3: 000000001d6aa000 CR4: 0000000000350ef0
Call Trace:
 <TASK>
 __sk_destruct+0x46/0x222
 sk_psock_destroy+0x22f/0x242
 process_one_work+0x504/0x8a8
 ? process_one_work+0x39d/0x8a8
 ? __pfx_process_one_work+0x10/0x10
 ? worker_thread+0x44/0x2ae
 ? __list_add_valid_or_report+0x83/0xea
 ? srso_return_thunk+0x5/0x5f
 ? __list_add+0x45/0x52
 process_scheduled_works+0x73/0x82
 worker_thread+0x1ce/0x2ae

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250425060015.6968-3-jiayuan.chen@linux.dev
2 months agoktls, sockmap: Fix missing uncharge operation
Jiayuan Chen [Fri, 25 Apr 2025 05:59:57 +0000 (13:59 +0800)]
ktls, sockmap: Fix missing uncharge operation

When we specify apply_bytes, we divide the msg into multiple segments,
each with a length of 'send', and every time we send this part of the data
using tcp_bpf_sendmsg_redir(), we use sk_msg_return_zero() to uncharge the
memory of the specified 'send' size.

However, if the first segment of data fails to send, for example, the
peer's buffer is full, we need to release all of the msg. When releasing
the msg, we haven't uncharged the memory of the subsequent segments.

This modification does not make significant logical changes, but only
fills in the missing uncharge places.

This issue has existed all along, until it was exposed after we added the
apply test in test_sockmap:
commit 3448ad23b34e ("selftests/bpf: Add apply_bytes test to test_txmsg_redir_wait_sndmem in test_sockmap")

Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Closes: https://lore.kernel.org/bpf/aAmIi0vlycHtbXeb@pop-os.localdomain/T/#t
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Link: https://lore.kernel.org/r/20250425060015.6968-2-jiayuan.chen@linux.dev
2 months agoMerge branch 'bpf-retrieve-ref_ctr_offset-from-uprobe-perf-link'
Andrii Nakryiko [Fri, 9 May 2025 20:01:08 +0000 (13:01 -0700)]
Merge branch 'bpf-retrieve-ref_ctr_offset-from-uprobe-perf-link'

Jiri Olsa says:

====================
bpf: Retrieve ref_ctr_offset from uprobe perf link

hi,
adding ref_ctr_offset retrieval for uprobe perf link info.

v2 changes:
  - display ref_ctr_offset as hex number [Andrii]
  - added acks

thanks,
jirka
---
====================

Link: https://patch.msgid.link/20250509153539.779599-1-jolsa@kernel.org
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2 months agobpftool: Display ref_ctr_offset for uprobe link info
Jiri Olsa [Fri, 9 May 2025 15:35:39 +0000 (17:35 +0200)]
bpftool: Display ref_ctr_offset for uprobe link info

Adding support to display ref_ctr_offset in link output, like:

  # bpftool link
  ...
  42: perf_event  prog 174
          uprobe /proc/self/exe+0x102f13  cookie 3735928559  ref_ctr_offset 0x303a3fa
          bpf_cookie 3735928559
          pids test_progs(1820)

  # bpftool link -j | jq
  [
    ...
    {
      "id": 42,
       ...
      "ref_ctr_offset": 50500538,
    }
  ]

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250509153539.779599-4-jolsa@kernel.org
2 months agoselftests/bpf: Add link info test for ref_ctr_offset retrieval
Jiri Olsa [Fri, 9 May 2025 15:35:38 +0000 (17:35 +0200)]
selftests/bpf: Add link info test for ref_ctr_offset retrieval

Adding link info test for ref_ctr_offset retrieval for both
uprobe and uretprobe probes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/bpf/20250509153539.779599-3-jolsa@kernel.org
2 months agobpf: Add support to retrieve ref_ctr_offset for uprobe perf link
Jiri Olsa [Fri, 9 May 2025 15:35:37 +0000 (17:35 +0200)]
bpf: Add support to retrieve ref_ctr_offset for uprobe perf link

Adding support to retrieve ref_ctr_offset for uprobe perf link,
which got somehow omitted from the initial uprobe link info changes.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yafang Shao <laoar.shao@gmail.com>
Link: https://lore.kernel.org/bpf/20250509153539.779599-2-jolsa@kernel.org
2 months agoscripts/bpf_doc.py: implement json output format
Ihor Solodrai [Thu, 8 May 2025 20:37:08 +0000 (13:37 -0700)]
scripts/bpf_doc.py: implement json output format

bpf_doc.py parses bpf.h header to collect information about various
API elements (such as BPF helpers) and then dump them in one of the
supported formats: rst docs and a C header.

It's useful for external tools to be able to consume this information
in an easy-to-parse format such as JSON. Implement JSON printers and
add --json command line argument.

v3->v4: refactor attrs to only be a helper's field
v2->v3: nit cleanup
v1->v2: add json printer for syscall target

v3: https://lore.kernel.org/bpf/20250507203034.270428-1-isolodrai@meta.com/
v2: https://lore.kernel.org/bpf/20250507182802.3833349-1-isolodrai@meta.com/
v1: https://lore.kernel.org/bpf/20250506000605.497296-1-isolodrai@meta.com/

Signed-off-by: Ihor Solodrai <isolodrai@meta.com>
Tested-by: Quentin Monnet <qmo@kernel.org>
Reviewed-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/r/20250508203708.2520847-1-isolodrai@meta.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoselftests/bpf: Fix caps for __xlated/jited_unpriv
Luis Gerhorst [Thu, 1 May 2025 07:35:52 +0000 (09:35 +0200)]
selftests/bpf: Fix caps for __xlated/jited_unpriv

Currently, __xlated_unpriv and __jited_unpriv do not work because the
BPF syscall will overwrite info.jited_prog_len and info.xlated_prog_len
with 0 if the process is not bpf_capable(). This bug was not noticed
before, because there is no test that actually uses
__xlated_unpriv/__jited_unpriv.

To resolve this, simply restore the capabilities earlier (but still
after loading the program). Adding this here unconditionally is fine
because the function first checks that the capabilities were initialized
before attempting to restore them.

This will be important later when we add tests that check whether a
speculation barrier was inserted in the correct location.

Signed-off-by: Luis Gerhorst <luis.gerhorst@fau.de>
Fixes: 9c9f73391310 ("selftests/bpf: allow checking xlated programs in verifier_* tests")
Fixes: 7d743e4c759c ("selftests/bpf: __jited test tag to check disassembly after jit")
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Tested-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20250501073603.1402960-2-luis.gerhorst@fau.de
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoMerge branch 'bpf-allow-some-trace-helpers-for-all-prog-types'
Andrii Nakryiko [Fri, 9 May 2025 17:37:11 +0000 (10:37 -0700)]
Merge branch 'bpf-allow-some-trace-helpers-for-all-prog-types'

Feng Yang says:

====================
bpf: Allow some trace helpers for all prog types

From: Feng Yang <yangfeng@kylinos.cn>

This series allow some trace helpers for all prog types.

if it works under NMI and doesn't use any context-dependent things,
should be fine for any program type. The detailed discussion is in [1].

[1] https://lore.kernel.org/all/CAEf4Bza6gK3dsrTosk6k3oZgtHesNDSrDd8sdeQ-GiS6oJixQg@mail.gmail.com/
---
Changes in v3:
- cgroup_current_func_proto clean.
- bpf_scx_get_func_proto clean. Thanks, Andrii Nakryiko.
- Link to v2: https://lore.kernel.org/all/20250427063821.207263-1-yangfeng59949@163.com/

Changes in v2:
- not expose compat probe read APIs to more program types.
- Remove the prog->sleepable check added for copy_from_user,
- or the summarization_freplace/might_sleep_with_might_sleep test will fail with the error "program of this type cannot use helper bpf_copy_from_user"
- Link to v1: https://lore.kernel.org/all/20250425080032.327477-1-yangfeng59949@163.com/
====================

Link: https://patch.msgid.link/20250506061434.94277-1-yangfeng59949@163.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
2 months agosched_ext: Remove bpf_scx_get_func_proto
Feng Yang [Tue, 6 May 2025 06:14:34 +0000 (14:14 +0800)]
sched_ext: Remove bpf_scx_get_func_proto

task_storage_{get,delete} has been moved to bpf_base_func_proto.

Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20250506061434.94277-3-yangfeng59949@163.com
2 months agobpf: Allow some trace helpers for all prog types
Feng Yang [Tue, 6 May 2025 06:14:33 +0000 (14:14 +0800)]
bpf: Allow some trace helpers for all prog types

if it works under NMI and doesn't use any context-dependent things,
should be fine for any program type. The detailed discussion is in [1].

[1] https://lore.kernel.org/all/CAEf4Bza6gK3dsrTosk6k3oZgtHesNDSrDd8sdeQ-GiS6oJixQg@mail.gmail.com/

Suggested-by: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Signed-off-by: Feng Yang <yangfeng@kylinos.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/bpf/20250506061434.94277-2-yangfeng59949@163.com
2 months agoMerge branch 'bpf-riscv64-support-load-acquire-and-store-release-instructions'
Alexei Starovoitov [Fri, 9 May 2025 17:05:28 +0000 (10:05 -0700)]
Merge branch 'bpf-riscv64-support-load-acquire-and-store-release-instructions'

Peilin Ye says:

====================
bpf, riscv64: Support load-acquire and store-release instructions

Hi all!

Patchset [1] introduced BPF load-acquire (BPF_LOAD_ACQ) and
store-release (BPF_STORE_REL) instructions, and added x86-64 and arm64
JIT compiler support.  As a follow-up, this v2 patchset supports
load-acquire and store-release instructions for the riscv64 JIT
compiler, and introduces some related selftests/ changes.

Specifically:

 * PATCH 1 makes insn_def_regno() handle load-acquires properly for
   bpf_jit_needs_zext() (true for riscv64) architectures
 * PATCH 2, 3 from Andrea Parri add the actual support to the riscv64
   JIT compiler
 * PATCH 4 optimizes code emission by skipping redundant zext
   instructions inserted by the verifier
 * PATCH 5, 6 and 7 are minor selftest/ improvements
 * PATCH 8 enables (non-arena) load-acquire/store-release selftests for
   riscv64

v1: https://lore.kernel.org/bpf/cover.1745970908.git.yepeilin@google.com/
Changes since v1:

 * add Acked-by:, Reviewed-by: and Tested-by: tags from Lehui and Björn
 * simplify code logic in PATCH 1 (Lehui)
 * in PATCH 3, avoid changing 'return 0;' to 'return ret;' at the end of
   bpf_jit_emit_insn() (Lehui)

Please refer to individual patches for details.  Thanks!

[1] https://lore.kernel.org/all/cover.1741049567.git.yepeilin@google.com/
====================

Link: https://patch.msgid.link/cover.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoselftests/bpf: Enable non-arena load-acquire/store-release selftests for riscv64
Peilin Ye [Wed, 7 May 2025 03:43:31 +0000 (03:43 +0000)]
selftests/bpf: Enable non-arena load-acquire/store-release selftests for riscv64

For riscv64, enable all BPF_{LOAD_ACQ,STORE_REL} selftests except the
arena_atomics/* ones (not guarded behind CAN_USE_LOAD_ACQ_STORE_REL),
since arena access is not yet supported.

Acked-by: Björn Töpel <bjorn@kernel.org>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/9d878fa99a72626208a8eed3c04c4140caf77fda.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoselftests/bpf: Verify zero-extension behavior in load-acquire tests
Peilin Ye [Wed, 7 May 2025 03:43:25 +0000 (03:43 +0000)]
selftests/bpf: Verify zero-extension behavior in load-acquire tests

Verify that 8-, 16- and 32-bit load-acquires are zero-extending by using
immediate values with their highest bit set.  Do the same for the 64-bit
variant to keep the style consistent.

Acked-by: Björn Töpel <bjorn@kernel.org>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/11097fd515f10308b3941469ee4c86cb8872db3f.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoselftests/bpf: Avoid passing out-of-range values to __retval()
Peilin Ye [Wed, 7 May 2025 03:43:19 +0000 (03:43 +0000)]
selftests/bpf: Avoid passing out-of-range values to __retval()

Currently, we pass 0x1234567890abcdef to __retval() for the following
two tests:

  verifier_load_acquire/load_acquire_64
  verifier_store_release/store_release_64

However, the upper 32 bits of that value are being ignored, since
__retval() expects an int.  Actually, the tests would still pass even if
I change '__retval(0x1234567890abcdef)' to e.g. '__retval(0x90abcdef)'.

Restructure the tests a bit to test the entire 64-bit values properly.
Do the same to their 8-, 16- and 32-bit variants as well to keep the
style consistent.

Fixes: ff3afe5da998 ("selftests/bpf: Add selftests for load-acquire and store-release instructions")
Acked-by: Björn Töpel <bjorn@kernel.org>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/d67f4c6f6ee0d0388cbce1f4892ec4176ee2d604.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoselftests/bpf: Use CAN_USE_LOAD_ACQ_STORE_REL when appropriate
Peilin Ye [Wed, 7 May 2025 03:43:13 +0000 (03:43 +0000)]
selftests/bpf: Use CAN_USE_LOAD_ACQ_STORE_REL when appropriate

Instead of open-coding the conditions, use
'#ifdef CAN_USE_LOAD_ACQ_STORE_REL' to guard the following tests:

  verifier_precision/bpf_load_acquire
  verifier_precision/bpf_store_release
  verifier_store_release/*

Note that, for the first two tests in verifier_precision.c, switching to
'#ifdef CAN_USE_LOAD_ACQ_STORE_REL' means also checking if
'__clang_major__ >= 18', which has already been guaranteed by the outer
'#if' check.

Acked-by: Björn Töpel <bjorn@kernel.org>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/45d7e025f6e390a8ff36f08fc51e31705ac896bd.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf, riscv64: Skip redundant zext instruction after load-acquire
Peilin Ye [Wed, 7 May 2025 03:43:07 +0000 (03:43 +0000)]
bpf, riscv64: Skip redundant zext instruction after load-acquire

Currently, the verifier inserts a zext instruction right after every 8-,
16- or 32-bit load-acquire, which is already zero-extending.  Skip such
redundant zext instructions.

While we are here, update that already-obsolete comment about "skip the
next instruction" in build_body().  Also change emit_atomic_rmw()'s
parameters to keep it consistent with emit_atomic_ld_st().

Note that checking 'insn[1]' relies on 'insn' not being the last
instruction, which should have been guaranteed by the verifier; we
already use 'insn[1]' elsewhere in the file for similar purposes.
Additionally, we don't check if 'insn[1]' is actually a zext for our
load-acquire's dst_reg, or some other registers - in other words, here
we are relying on the verifier to always insert a redundant zext right
after a 8/16/32-bit load-acquire, for its dst_reg.

Acked-by: Björn Töpel <bjorn@kernel.org>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/10e90e0eab042f924d35ad0d1c1f7ca29f673152.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf, riscv64: Support load-acquire and store-release instructions
Andrea Parri [Wed, 7 May 2025 03:43:01 +0000 (03:43 +0000)]
bpf, riscv64: Support load-acquire and store-release instructions

Support BPF load-acquire (BPF_LOAD_ACQ) and store-release
(BPF_STORE_REL) instructions in the riscv64 JIT compiler.  For example,
consider the following 64-bit load-acquire (assuming little-endian):

  db 10 00 00 00 01 00 00  r1 = load_acquire((u64 *)(r1 + 0x0))
  95 00 00 00 00 00 00 00  exit

  opcode (0xdb): BPF_ATOMIC | BPF_DW | BPF_STX
  imm (0x00000100): BPF_LOAD_ACQ

The JIT compiler will emit an LD instruction followed by a FENCE R,RW
instruction for the above, e.g.:

  ld x7,0(x6)
  fence r,rw

Similarly, consider the following 16-bit store-release:

  cb 21 00 00 10 01 00 00  store_release((u16 *)(r1 + 0x0), w2)
  95 00 00 00 00 00 00 00  exit

  opcode (0xcb): BPF_ATOMIC | BPF_H | BPF_STX
  imm (0x00000110): BPF_STORE_REL

A FENCE RW,W instruction followed by an SH instruction will be emitted,
e.g.:

  fence rw,w
  sh x2,0(x4)

8-bit and 16-bit load-acquires are zero-extending (cf., LBU, LHU).  The
verifier always rejects misaligned load-acquires/store-releases (even if
BPF_F_ANY_ALIGNMENT is set), so the emitted load and store instructions
are guaranteed to be single-copy atomic.

Introduce primitives to emit the relevant (and the most common/used in
the kernel) fences, i.e. fences with R -> RW, RW -> W and RW -> RW.

Rename emit_atomic() to emit_atomic_rmw() to make it clear that it only
handles RMW atomics, and replace its is64 parameter to allow to perform
the required checks on the opsize (BPF_SIZE(code)).

Acked-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
Co-developed-by: Peilin Ye <yepeilin@google.com>
Signed-off-by: Peilin Ye <yepeilin@google.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/r/3059c560e537ad43ed19055d2ebbd970c698095a.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf, riscv64: Introduce emit_load_*() and emit_store_*()
Andrea Parri [Wed, 7 May 2025 03:42:55 +0000 (03:42 +0000)]
bpf, riscv64: Introduce emit_load_*() and emit_store_*()

We're planning to add support for the load-acquire and store-release
BPF instructions.  Define emit_load_<size>() and emit_store_<size>()
to enable/facilitate the (re)use of their code.

Acked-by: Björn Töpel <bjorn@kernel.org>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23
Tested-by: Peilin Ye <yepeilin@google.com>
Signed-off-by: Andrea Parri <parri.andrea@gmail.com>
[yepeilin@google.com: cosmetic change to commit title]
Signed-off-by: Peilin Ye <yepeilin@google.com>
Link: https://lore.kernel.org/r/fce89473a5748e1631d18a5917d953460d1ae0d0.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf/verifier: Handle BPF_LOAD_ACQ instructions in insn_def_regno()
Peilin Ye [Wed, 7 May 2025 03:42:45 +0000 (03:42 +0000)]
bpf/verifier: Handle BPF_LOAD_ACQ instructions in insn_def_regno()

In preparation for supporting BPF load-acquire and store-release
instructions for architectures where bpf_jit_needs_zext() returns true
(e.g. riscv64), make insn_def_regno() handle load-acquires properly.

Acked-by: Björn Töpel <bjorn@kernel.org>
Tested-by: Björn Töpel <bjorn@rivosinc.com> # QEMU/RVA23
Signed-off-by: Peilin Ye <yepeilin@google.com>
Reviewed-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/r/09cb2aec979aaed9d16db41f0f5b364de39377c0.1746588351.git.yepeilin@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpftool: Fix cgroup command to only show cgroup bpf programs
Martin KaFai Lau [Wed, 7 May 2025 20:32:32 +0000 (13:32 -0700)]
bpftool: Fix cgroup command to only show cgroup bpf programs

The netkit program is not a cgroup bpf program and should not be shown
in the output of the "bpftool cgroup show" command.

However, if the netkit device happens to have ifindex 3,
the "bpftool cgroup show" command will output the netkit
bpf program as well:

> ip -d link show dev nk1
3: nk1@if2: ...
    link/ether ...
    netkit mode ...

> bpftool net show
tc:
nk1(3) netkit/peer tw_ns_nk2phy prog_id 469447

> bpftool cgroup show /sys/fs/cgroup/...
ID       AttachType      AttachFlags     Name
...      ...                             ...
469447   netkit_peer                     tw_ns_nk2phy

The reason is that the target_fd (which is the cgroup_fd here) and
the target_ifindex are in a union in the uapi/linux/bpf.h. The bpftool
iterates all values in "enum bpf_attach_type" which includes
non cgroup attach types like netkit. The cgroup_fd is usually 3 here,
so the bug is triggered when the netkit ifindex just happens
to be 3 as well.

The bpftool's cgroup.c already has a list of cgroup-only attach type
defined in "cgroup_attach_types[]". This patch fixes it by iterating
over "cgroup_attach_types[]" instead of "__MAX_BPF_ATTACH_TYPE".

Cc: Quentin Monnet <qmo@kernel.org>
Reported-by: Takshak Chahande <ctakshak@meta.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/r/20250507203232.1420762-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpftool: Fix regression of "bpftool cgroup tree" EINVAL on older kernels
YiFei Zhu [Mon, 28 Apr 2025 21:15:36 +0000 (21:15 +0000)]
bpftool: Fix regression of "bpftool cgroup tree" EINVAL on older kernels

If cgroup_has_attached_progs queries an attach type not supported
by the running kernel, due to the kernel being older than the bpftool
build, it would encounter an -EINVAL from BPF_PROG_QUERY syscall.

Prior to commit 98b303c9bf05 ("bpftool: Query only cgroup-related
attach types"), this EINVAL would be ignored by the function, allowing
the function to only consider supported attach types. The commit
changed so that, instead of querying all attach types, only attach
types from the array `cgroup_attach_types` is queried. The assumption
is that because these are only cgroup attach types, they should all
be supported. Unfortunately this assumption may be false when the
kernel is older than the bpftool build, where the attach types queried
by bpftool is not yet implemented in the kernel. This would result in
errors such as:

  $ bpftool cgroup tree
  CgroupPath
  ID       AttachType      AttachFlags     Name
  Error: can't query bpf programs attached to /sys/fs/cgroup: Invalid argument

This patch restores the logic of ignoring EINVAL from prior to that patch.

Fixes: 98b303c9bf05 ("bpftool: Query only cgroup-related attach types")
Reported-by: Sagarika Sharma <sharmasagarika@google.com>
Reported-by: Minh-Anh Nguyen <minhanhdn@google.com>
Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <qmo@kernel.org>
Link: https://lore.kernel.org/bpf/20250428211536.1651456-1-zhuyifei@google.com
2 months agoMerge branch 'bpf-support-bpf-rbtree-traversal-and-list-peeking'
Alexei Starovoitov [Tue, 6 May 2025 17:21:06 +0000 (10:21 -0700)]
Merge branch 'bpf-support-bpf-rbtree-traversal-and-list-peeking'

Martin KaFai Lau says:

====================
bpf: Support bpf rbtree traversal and list peeking

From: Martin KaFai Lau <martin.lau@kernel.org>

The RFC v1 [1] showed a fq qdisc implementation in bpf
that is much closer to the kernel sch_fq.c.

The fq example and bpf qdisc changes are separated out from this set.
This set is to focus on the kfunc and verifier changes that
enable the bpf rbtree traversal and list peeking.

v2:
- Added tests to check that the return value of
  the bpf_rbtree_{root,left,right} and bpf_list_{front,back} is
  marked as a non_own_ref node pointer. (Kumar)
- Added tests to ensure that the bpf_rbtree_{root,left,right} and
  bpf_list_{front,back} must be called after holding the spinlock.
- Squashed the selftests adjustment to the corresponding verifier
  changes to avoid bisect failure. (Kumar)
- Separated the bpf qdisc specific changes and fq selftest example
  from this set.

[1]: https://lore.kernel.org/bpf/20250418224652.105998-1-martin.lau@linux.dev/
====================

Link: https://patch.msgid.link/20250506015857.817950-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoselftests/bpf: Add test for bpf_list_{front,back}
Martin KaFai Lau [Tue, 6 May 2025 01:58:55 +0000 (18:58 -0700)]
selftests/bpf: Add test for bpf_list_{front,back}

This patch adds the "list_peek" test to use the new
bpf_list_{front,back} kfunc.

The test_{front,back}* tests ensure that the return value
is a non_own_ref node pointer and requires the spinlock to be held.

Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> # check non_own_ref marking
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-9-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf: Add bpf_list_{front,back} kfunc
Martin KaFai Lau [Tue, 6 May 2025 01:58:54 +0000 (18:58 -0700)]
bpf: Add bpf_list_{front,back} kfunc

In the kernel fq qdisc implementation, it only needs to look at
the fields of the first node in a list but does not always
need to remove it from the list. It is more convenient to have
a peek kfunc for the list. It works similar to the bpf_rbtree_first().

This patch adds bpf_list_{front,back} kfunc. The verifier is changed
such that the kfunc returning "struct bpf_list_node *" will be
marked as non-owning. The exception is the KF_ACQUIRE kfunc. The
net effect is only the new bpf_list_{front,back} kfuncs will
have its return pointer marked as non-owning.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-8-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf: Simplify reg0 marking for the list kfuncs that return a bpf_list_node pointer
Martin KaFai Lau [Tue, 6 May 2025 01:58:53 +0000 (18:58 -0700)]
bpf: Simplify reg0 marking for the list kfuncs that return a bpf_list_node pointer

The next patch will add bpf_list_{front,back} kfuncs to peek the head
and tail of a list. Both of them will return a 'struct bpf_list_node *'.

Follow the earlier change for rbtree, this patch checks the
return btf type is a 'struct bpf_list_node' pointer instead
of checking each kfuncs individually to decide if
mark_reg_graph_node should be called. This will make
the bpf_list_{front,back} kfunc addition easier in
the later patch.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-7-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoselftests/bpf: Add tests for bpf_rbtree_{root,left,right}
Martin KaFai Lau [Tue, 6 May 2025 01:58:52 +0000 (18:58 -0700)]
selftests/bpf: Add tests for bpf_rbtree_{root,left,right}

This patch has a much simplified rbtree usage from the
kernel sch_fq qdisc. It has a "struct node_data" which can be
added to two different rbtrees which are ordered by different keys.

The test first populates both rbtrees. Then search for a lookup_key
from the "groot0" rbtree. Once the lookup_key is found, that node
refcount is taken. The node is then removed from another "groot1"
rbtree.

While searching the lookup_key, the test will also try to remove
all rbnodes in the path leading to the lookup_key.

The test_{root,left,right}_spinlock_true tests ensure that the
return value of the bpf_rbtree functions is a non_own_ref node pointer.
This is done by forcing an verifier error by calling a helper
bpf_jiffies64() while holding the spinlock. The tests then
check for the verifier message
"call bpf_rbtree...R0=rcu_ptr_or_null_node..."

The other test_{root,left,right}_spinlock_false tests ensure that
they must be called with spinlock held.

Suggested-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> # Check non_own_ref marking
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-6-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf: Allow refcounted bpf_rb_node used in bpf_rbtree_{remove,left,right}
Martin KaFai Lau [Tue, 6 May 2025 01:58:51 +0000 (18:58 -0700)]
bpf: Allow refcounted bpf_rb_node used in bpf_rbtree_{remove,left,right}

The bpf_rbtree_{remove,left,right} requires the root's lock to be held.
They also check the node_internal->owner is still owned by that root
before proceeding, so it is safe to allow refcounted bpf_rb_node
pointer to be used in these kfuncs.

In a bpf fq implementation which is much closer to the kernel fq,
https://lore.kernel.org/bpf/20250418224652.105998-13-martin.lau@linux.dev/,
a networking flow (allocated by bpf_obj_new) can be added to two different
rbtrees. There are cases that the flow is searched from one rbtree,
held the refcount of the flow, and then removed from another rbtree:

struct fq_flow {
struct bpf_rb_node fq_node;
struct bpf_rb_node rate_node;
struct bpf_refcount refcount;
unsigned long sk_long;
};

int bpf_fq_enqueue(...)
{
/* ... */

bpf_spin_lock(&root->lock);
while (can_loop) {
/* ... */
if (!p)
break;
gc_f = bpf_rb_entry(p, struct fq_flow, fq_node);
if (gc_f->sk_long == sk_long) {
f = bpf_refcount_acquire(gc_f);
break;
}
/* ... */
}
bpf_spin_unlock(&root->lock);

if (f) {
bpf_spin_lock(&q->lock);
bpf_rbtree_remove(&q->delayed, &f->rate_node);
bpf_spin_unlock(&q->lock);
}
}

bpf_rbtree_{left,right} do not need this change but are relaxed together
with bpf_rbtree_remove instead of adding extra verifier logic
to exclude these kfuncs.

To avoid bi-sect failure, this patch also changes the selftests together.

The "rbtree_api_remove_unadded_node" is not expecting verifier's error.
The test now expects bpf_rbtree_remove(&groot, &m->node) to return NULL.
The test uses __retval(0) to ensure this NULL return value.

Some of the "only take non-owning..." failure messages are changed also.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-5-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf: Add bpf_rbtree_{root,left,right} kfunc
Martin KaFai Lau [Tue, 6 May 2025 01:58:50 +0000 (18:58 -0700)]
bpf: Add bpf_rbtree_{root,left,right} kfunc

In a bpf fq implementation that is much closer to the kernel fq,
it will need to traverse the rbtree:
https://lore.kernel.org/bpf/20250418224652.105998-13-martin.lau@linux.dev/

The much simplified logic that uses the bpf_rbtree_{root,left,right}
to traverse the rbtree is like:

struct fq_flow {
struct bpf_rb_node fq_node;
struct bpf_rb_node rate_node;
struct bpf_refcount refcount;
unsigned long sk_long;
};

struct fq_flow_root {
struct bpf_spin_lock lock;
struct bpf_rb_root root __contains(fq_flow, fq_node);
};

struct fq_flow *fq_classify(...)
{
struct bpf_rb_node *tofree[FQ_GC_MAX];
struct fq_flow_root *root;
struct fq_flow *gc_f, *f;
struct bpf_rb_node *p;
int i, fcnt = 0;

/* ... */

f = NULL;
bpf_spin_lock(&root->lock);
p = bpf_rbtree_root(&root->root);
while (can_loop) {
if (!p)
break;

gc_f = bpf_rb_entry(p, struct fq_flow, fq_node);
if (gc_f->sk_long == sk_long) {
f = bpf_refcount_acquire(gc_f);
break;
}

/* To be removed from the rbtree */
if (fcnt < FQ_GC_MAX && fq_gc_candidate(gc_f, jiffies_now))
tofree[fcnt++] = p;

if (gc_f->sk_long > sk_long)
p = bpf_rbtree_left(&root->root, p);
else
p = bpf_rbtree_right(&root->root, p);
}

/* remove from the rbtree */
for (i = 0; i < fcnt; i++) {
p = tofree[i];
tofree[i] = bpf_rbtree_remove(&root->root, p);
}

bpf_spin_unlock(&root->lock);

/* bpf_obj_drop the fq_flow(s) that have just been removed
 * from the rbtree.
 */
for (i = 0; i < fcnt; i++) {
p = tofree[i];
if (p) {
gc_f = bpf_rb_entry(p, struct fq_flow, fq_node);
bpf_obj_drop(gc_f);
}
}

return f;

}

The above simplified code needs to traverse the rbtree for two purposes,
1) find the flow with the desired sk_long value
2) while searching for the sk_long, collect flows that are
   the fq_gc_candidate. They will be removed from the rbtree.

This patch adds the bpf_rbtree_{root,left,right} kfunc to enable
the rbtree traversal. The returned bpf_rb_node pointer will be a
non-owning reference which is the same as the returned pointer
of the exisiting bpf_rbtree_first kfunc.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-4-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf: Simplify reg0 marking for the rbtree kfuncs that return a bpf_rb_node pointer
Martin KaFai Lau [Tue, 6 May 2025 01:58:49 +0000 (18:58 -0700)]
bpf: Simplify reg0 marking for the rbtree kfuncs that return a bpf_rb_node pointer

The current rbtree kfunc, bpf_rbtree_{first, remove}, returns the
bpf_rb_node pointer. The check_kfunc_call currently checks the
kfunc btf_id instead of its return pointer type to decide
if it needs to do mark_reg_graph_node(reg0) and ref_set_non_owning(reg0).

The later patch will add bpf_rbtree_{root,left,right} that will also
return a bpf_rb_node pointer. Instead of adding more kfunc btf_id
checks to the "if" case, this patch changes the test to check the
kfunc's return type. is_rbtree_node_type() function is added to
test if a pointer type is a bpf_rb_node. The callers have already
skipped the modifiers of the pointer type.

A note on the ref_set_non_owning(), although bpf_rbtree_remove()
also returns a bpf_rb_node pointer, the bpf_rbtree_remove()
has the KF_ACQUIRE flag. Thus, its reg0 will not become non-owning.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf: Check KF_bpf_rbtree_add_impl for the "case KF_ARG_PTR_TO_RB_NODE"
Martin KaFai Lau [Tue, 6 May 2025 01:58:48 +0000 (18:58 -0700)]
bpf: Check KF_bpf_rbtree_add_impl for the "case KF_ARG_PTR_TO_RB_NODE"

In a later patch, two new kfuncs will take the bpf_rb_node pointer arg.

struct bpf_rb_node *bpf_rbtree_left(struct bpf_rb_root *root,
    struct bpf_rb_node *node);
struct bpf_rb_node *bpf_rbtree_right(struct bpf_rb_root *root,
     struct bpf_rb_node *node);

In the check_kfunc_call, there is a "case KF_ARG_PTR_TO_RB_NODE"
to check if the reg->type should be an allocated pointer or should be
a non_owning_ref.

The later patch will need to ensure that the bpf_rb_node pointer passing
to the new bpf_rbtree_{left,right} must be a non_owning_ref. This
should be the same requirement as the existing bpf_rbtree_remove.

This patch swaps the current "if else" statement. Instead of checking
the bpf_rbtree_remove, it checks the bpf_rbtree_add. Then the new
bpf_rbtree_{left,right} will fall into the "else" case to make
the later patch simpler. bpf_rbtree_add should be the only
one that needs an allocated pointer.

This should be a no-op change considering there are only two kfunc(s)
taking bpf_rb_node pointer arg, rbtree_add and rbtree_remove.

Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20250506015857.817950-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agolibbpf: Improve BTF dedup handling of "identical" BTF types
Andrii Nakryiko [Thu, 1 May 2025 23:52:31 +0000 (16:52 -0700)]
libbpf: Improve BTF dedup handling of "identical" BTF types

BTF dedup has a strong assumption that compiler with deduplicate identical
types within any given compilation unit (i.e., .c file). This property
is used when establishing equilvalence of two subgraphs of types.

Unfortunately, this property doesn't always holds in practice. We've
seen cases of having truly identical structs, unions, array definitions,
and, most recently, even pointers to the same type being duplicated
within CU.

Previously, we mitigated this on a case-by-case basis, adding a few
simple heuristics for validating that two BTF types (having two
different type IDs) are structurally the same. But this approach scales
poorly, and we can have more weird cases come up in the future.

So let's take a half-step back, and implement a bit more generic
structural equivalence check, recursively. We still limit it to
reasonable depth to avoid long reference loops. Depth-wise limiting of
potentially cyclical graph isn't great, but as I mentioned below doesn't
seem to be detrimental performance-wise. We can always improve this in
the future with per-type visited markers, if necessary.

Performance-wise this doesn't seem too affect vmlinux BTF dedup, which
makes sense because this logic kicks in not so frequently and only if we
already established a canonical candidate type match, but suddenly find
a different (but probably identical) type.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/r/20250501235231.1339822-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agobpf: Replace offsetof() with struct_size()
Thorsten Blum [Sat, 3 May 2025 15:15:13 +0000 (17:15 +0200)]
bpf: Replace offsetof() with struct_size()

Compared to offsetof(), struct_size() provides additional compile-time
checks for structs with flexible arrays (e.g., __must_be_array()).

No functional changes intended.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250503151513.343931-2-thorsten.blum@linux.dev
2 months agobpf: Fix uninitialized values in BPF_{CORE,PROBE}_READ
Anton Protopopov [Fri, 2 May 2025 19:30:31 +0000 (19:30 +0000)]
bpf: Fix uninitialized values in BPF_{CORE,PROBE}_READ

With the latest LLVM bpf selftests build will fail with
the following error message:

    progs/profiler.inc.h:710:31: error: default initialization of an object of type 'typeof ((parent_task)->real_cred->uid.val)' (aka 'const unsigned int') leaves the object uninitialized and is incompatible with C++ [-Werror,-Wdefault-const-init-unsafe]
      710 |         proc_exec_data->parent_uid = BPF_CORE_READ(parent_task, real_cred, uid.val);
          |                                      ^
    tools/testing/selftests/bpf/tools/include/bpf/bpf_core_read.h:520:35: note: expanded from macro 'BPF_CORE_READ'
      520 |         ___type((src), a, ##__VA_ARGS__) __r;                               \
          |                                          ^

This happens because BPF_CORE_READ (and other macro) declare the
variable __r using the ___type macro which can inherit const modifier
from intermediate types.

Fix this by using __typeof_unqual__, when supported. (And when it
is not supported, the problem shouldn't appear, as older compilers
haven't complained.)

Fixes: 792001f4f7aa ("libbpf: Add user-space variants of BPF_CORE_READ() family of macros")
Fixes: a4b09a9ef945 ("libbpf: Add non-CO-RE variants of BPF_CORE_READ() macro family")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250502193031.3522715-1-a.s.protopopov@gmail.com
2 months agoselftests/bpf: Remove sockmap_ktls disconnect_after_delete test
Ihor Solodrai [Fri, 2 May 2025 18:52:21 +0000 (11:52 -0700)]
selftests/bpf: Remove sockmap_ktls disconnect_after_delete test

"sockmap_ktls disconnect_after_delete" is effectively moot after
disconnect has been disabled for TLS [1][2]. Remove the test
completely.

[1] https://lore.kernel.org/bpf/20250416170246.2438524-1-ihor.solodrai@linux.dev/
[2] https://lore.kernel.org/netdev/20250404180334.3224206-1-kuba@kernel.org/

Signed-off-by: Ihor Solodrai <isolodrai@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250502185221.1556192-1-isolodrai@meta.com
2 months agoselftests/bpf: Add btf dedup test covering module BTF dedup
Alan Maguire [Wed, 30 Apr 2025 13:42:49 +0000 (14:42 +0100)]
selftests/bpf: Add btf dedup test covering module BTF dedup

Recently issues were observed with module BTF deduplication failures
[1].  Add a dedup selftest that ensures that core kernel types are
referenced from split BTF as base BTF types.  To do this use bpf_testmod
functions which utilize core kernel types, specifically

ssize_t
bpf_testmod_test_write(struct file *file, struct kobject *kobj,
                       struct bin_attribute *bin_attr,
                       char *buf, loff_t off, size_t len);

__bpf_kfunc struct sock *bpf_kfunc_call_test3(struct sock *sk);

__bpf_kfunc void bpf_kfunc_call_test_pass_ctx(struct __sk_buff *skb);

For each of these ensure that the types they reference -
struct file, struct kobject, struct bin_attr etc - are in base BTF.
Note that because bpf_testmod.ko is built with distilled base BTF
the associated reference types - i.e. the PTR that points at a
"struct file" - will be in split BTF.  As a result the test resolves
typedef and pointer references and verifies the pointed-at or
typedef'ed type is in base BTF.  Because we use BTF from
/sys/kernel/btf/bpf_testmod relocation has occurred for the
referenced types and they will be base - not distilled base - types.

For large-scale dedup issues, we see such types appear in split BTF and
as a result this test fails.  Hence it is proposed as a test which will
fail when large-scale dedup issues have occurred.

[1] https://lore.kernel.org/dwarves/CAADnVQL+-LiJGXwxD3jEUrOonO-fX0SZC8496dVzUXvfkB7gYQ@mail.gmail.com/

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20250430134249.2451066-1-alan.maguire@oracle.com
2 months agoMerge branch 'bpf-allow-xdp_redirect-for-xdp-dev-bound-programs'
Martin KaFai Lau [Thu, 1 May 2025 19:54:07 +0000 (12:54 -0700)]
Merge branch 'bpf-allow-xdp_redirect-for-xdp-dev-bound-programs'

Lorenzo Bianconi says:

====================
bpf: Allow XDP_REDIRECT for XDP dev-bound programs

In the current implementation if the program is dev-bound to a specific
device, it will not be possible to perform XDP_REDIRECT into a DEVMAP or
CPUMAP even if the program is running in the driver NAPI context.
Fix the issue introducing __bpf_prog_map_compatible utility routine in
order to avoid bpf_prog_is_dev_bound() during the XDP program load.
Continue forbidding to attach a dev-bound program to XDP maps.
====================

Link: https://patch.msgid.link/20250428-xdp-prog-bound-fix-v3-0-c9e9ba3300c7@kernel.org
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2 months agoselftests/bpf: xdp_metadata: Check XDP_REDIRCT support for dev-bound progs
Lorenzo Bianconi [Mon, 28 Apr 2025 15:44:03 +0000 (17:44 +0200)]
selftests/bpf: xdp_metadata: Check XDP_REDIRCT support for dev-bound progs

Improve xdp_metadata bpf selftest in order to check it is possible for a
XDP dev-bound program to perform XDP_REDIRECT into a DEVMAP but it is still
not allowed to attach a XDP dev-bound program to a DEVMAP entry.

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
2 months agobpf: Allow XDP dev-bound programs to perform XDP_REDIRECT into maps
Lorenzo Bianconi [Mon, 28 Apr 2025 15:44:02 +0000 (17:44 +0200)]
bpf: Allow XDP dev-bound programs to perform XDP_REDIRECT into maps

In the current implementation if the program is dev-bound to a specific
device, it will not be possible to perform XDP_REDIRECT into a DEVMAP
or CPUMAP even if the program is running in the driver NAPI context and
it is not attached to any map entry. This seems in contrast with the
explanation available in bpf_prog_map_compatible routine.
Fix the issue introducing __bpf_prog_map_compatible utility routine in
order to avoid bpf_prog_is_dev_bound() check running bpf_check_tail_call()
at program load time (bpf_prog_select_runtime()).
Continue forbidding to attach a dev-bound program to XDP maps
(BPF_MAP_TYPE_PROG_ARRAY, BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_CPUMAP).

Fixes: 3d76a4d3d4e59 ("bpf: XDP metadata RX kfuncs")
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
2 months agobpf: Replace offsetof() with struct_size()
Thorsten Blum [Mon, 28 Apr 2025 21:06:39 +0000 (23:06 +0200)]
bpf: Replace offsetof() with struct_size()

Compared to offsetof(), struct_size() provides additional compile-time
checks for structs with flexible arrays (e.g., __must_be_array()).

No functional changes intended.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250428210638.30219-2-thorsten.blum@linux.dev
2 months agolibbpf: Use proper errno value in linker
Anton Protopopov [Wed, 30 Apr 2025 12:08:20 +0000 (12:08 +0000)]
libbpf: Use proper errno value in linker

Return values of the linker_append_sec_data() and the
linker_append_elf_relos() functions are propagated all the
way up to users of libbpf API. In some error cases these
functions return -1 which will be seen as -EPERM from user's
point of view. Instead, return a more reasonable -EINVAL.

Fixes: faf6ed321cf6 ("libbpf: Add BPF static linker APIs")
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250430120820.2262053-1-a.s.protopopov@gmail.com
2 months agoselftests/bpf: Fix kmem_cache iterator draining
T.J. Mercier [Mon, 28 Apr 2025 18:02:54 +0000 (18:02 +0000)]
selftests/bpf: Fix kmem_cache iterator draining

The closing parentheses around the read syscall is misplaced, causing
single byte reads from the iterator instead of buf sized reads. While
the end result is the same, many more read calls than necessary are
performed.

$ tools/testing/selftests/bpf/vmtest.sh  "./test_progs -t kmem_cache_iter"
145/1   kmem_cache_iter/check_task_struct:OK
145/2   kmem_cache_iter/check_slabinfo:OK
145/3   kmem_cache_iter/open_coded_iter:OK
145     kmem_cache_iter:OK
Summary: 1/3 PASSED, 0 SKIPPED, 0 FAILED

Fixes: a496d0cdc84d ("selftests/bpf: Add a test for kmem_cache_iter")
Signed-off-by: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Link: https://patch.msgid.link/20250428180256.1482899-1-tjmercier@google.com
2 months agolibbpf: Add identical pointer detection to btf_dedup_is_equiv()
Alan Maguire [Tue, 29 Apr 2025 16:10:42 +0000 (17:10 +0100)]
libbpf: Add identical pointer detection to btf_dedup_is_equiv()

Recently as a side-effect of

commit ac053946f5c4 ("compiler.h: introduce TYPEOF_UNQUAL() macro")

issues were observed in deduplication between modules and kernel BTF
such that a large number of kernel types were not deduplicated so
were found in module BTF (task_struct, bpf_prog etc).  The root cause
appeared to be a failure to dedup struct types, specifically those
with members that were pointers with __percpu annotations.

The issue in dedup is at the point that we are deduplicating structures,
we have not yet deduplicated reference types like pointers.  If multiple
copies of a pointer point at the same (deduplicated) integer as in this
case, we do not see them as identical.  Special handling already exists
to deal with structures and arrays, so add pointer handling here too.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20250429161042.2069678-1-alan.maguire@oracle.com
2 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after rc4
Alexei Starovoitov [Mon, 28 Apr 2025 15:40:45 +0000 (08:40 -0700)]
Merge git://git./linux/kernel/git/bpf/bpf after rc4

Cross-merge bpf and other fixes after downstream PRs.

No conflicts.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2 months agoLinux 6.15-rc4 v6.15-rc4
Linus Torvalds [Sun, 27 Apr 2025 22:19:23 +0000 (15:19 -0700)]
Linux 6.15-rc4

2 months agoMerge tag 'pci-v6.15-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Linus Torvalds [Sat, 26 Apr 2025 20:02:36 +0000 (13:02 -0700)]
Merge tag 'pci-v6.15-fixes-3' of git://git./linux/kernel/git/pci/pci

Pull PCI fixes from Bjorn Helgaas:

 - When releasing a start-aligned resource, e.g., a bridge window, save
   start/end/flags for the next assignment attempt; fixes a v6.15-rc1
   regression (Ilpo Järvinen)

 - Move set_pcie_speed.sh from TEST_PROGS to TEST_FILE; fixes a bwctrl
   selftest v6.15-rc1 regression (Ilpo Järvinen)

 - Add Manivannan Sadhasivam as maintainer of native host bridge and
   endpoint drivers (Manivannan Sadhasivam)

 - In endpoint test driver, defer IRQ allocation from .probe() until
   ioctl() to fix a regression on platforms where the Vendor/Device ID
   match doesn't include driver_data (Niklas Cassel)

* tag 'pci-v6.15-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
  misc: pci_endpoint_test: Defer IRQ allocation until ioctl(PCITEST_SET_IRQTYPE)
  MAINTAINERS: Move Manivannan Sadhasivam as PCI Native host bridge and endpoint maintainer
  selftests/pcie_bwctrl: Fix test progs list
  PCI: Restore assigned resources fully after release

2 months agoMerge tag 'nfsd-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Linus Torvalds [Sat, 26 Apr 2025 17:43:03 +0000 (10:43 -0700)]
Merge tag 'nfsd-6.15-2' of git://git./linux/kernel/git/cel/linux

Pull nfsd fix from Chuck Lever:

 - Revert a v6.15 patch due to a report of SELinux test failures

* tag 'nfsd-6.15-2' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  Revert "sunrpc: clean cache_detail immediately when flush is written frequently"

2 months agoMerge tag 'x86-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 26 Apr 2025 16:45:54 +0000 (09:45 -0700)]
Merge tag 'x86-urgent-2025-04-26' of git://git./linux/kernel/git/tip/tip

Pull misc x86 fixes from Ingo Molnar:

 - Fix 32-bit kernel boot crash if passed physical memory with more than
   32 address bits

 - Fix Xen PV crash

 - Work around build bug in certain limited build environments

 - Fix CTEST instruction decoding in insn_decoder_test

* tag 'x86-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/insn: Fix CTEST instruction decoding
  x86/boot: Work around broken busybox 'truncate' tool
  x86/mm: Fix _pgd_alloc() for Xen PV mode
  x86/e820: Discard high memory that can't be addressed by 32-bit systems

2 months agoMerge tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 26 Apr 2025 16:23:20 +0000 (09:23 -0700)]
Merge tag 'sched-urgent-2025-04-26' of git://git./linux/kernel/git/tip/tip

Pull scheduler fix from Ingo Molnar:
 "Fix sporadic crashes in dequeue_entities() due to ... bad math.

  [ Arguably if pick_eevdf()/pick_next_entity() was less trusting of
    complex math being correct it could have de-escalated a crash into
    a warning, but that's for a different patch ]"

* tag 'sched-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash

2 months agoMerge tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 26 Apr 2025 16:13:09 +0000 (09:13 -0700)]
Merge tag 'perf-urgent-2025-04-26' of git://git./linux/kernel/git/tip/tip

Pull misc perf events fixes from Ingo Molnar:

 - Use POLLERR for events in error state, instead of the ambiguous
   POLLHUP error value

 - Fix non-sampling (counting) events on certain x86 platforms

* tag 'perf-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf/x86: Fix non-sampling (counting) events on certain x86 platforms
  perf/core: Change to POLLERR for pinned events with error

2 months agoMerge tag 'irq-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 26 Apr 2025 16:08:45 +0000 (09:08 -0700)]
Merge tag 'irq-urgent-2025-04-26' of git://git./linux/kernel/git/tip/tip

Pull irq fix from Ingo Molnar:
 "Fix crashes in the gic-v2m irqchip driver, caused by an incorrect
  __init annotation"

* tag 'irq-urgent-2025-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/gic-v2m: Prevent use after free of gicv2m_get_fwnode()

2 months agoMerge tag 'loongarch-fixes-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git...
Linus Torvalds [Sat, 26 Apr 2025 16:02:41 +0000 (09:02 -0700)]
Merge tag 'loongarch-fixes-6.15-1' of git://git./linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch fixes from Huacai Chen:
 "Add a missing Kconfig option, fix some bugs in exception handlers,
  memory management and KVM"

* tag 'loongarch-fixes-6.15-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: KVM: Fix PMU pass-through issue if VM exits to host finally
  LoongArch: KVM: Fully clear some CSRs when VM reboot
  LoongArch: KVM: Fix multiple typos of KVM code
  LoongArch: Return NULL from huge_pte_offset() for invalid PMD
  LoongArch: Remove a bogus reference to ZONE_DMA
  LoongArch: Handle fp, lsx, lasx and lbt assembly symbols
  LoongArch: Make do_xyz() exception handlers more robust
  LoongArch: Make regs_irqs_disabled() more clear
  LoongArch: Select ARCH_USE_MEMTEST

2 months agoMerge tag 'for-linus' of https://github.com/openrisc/linux
Linus Torvalds [Sat, 26 Apr 2025 16:01:13 +0000 (09:01 -0700)]
Merge tag 'for-linus' of https://github.com/openrisc/linux

Pull OpenRISC updates from Stafford Horne:

 - Support for cacheinfo API to expose OpenRISC cache info via sysfs,
   this also translated to some cleanups to OpenRISC cache flush and
   invalidate API's

 - Documentation updates for new mailing list and toolchain binaries

* tag 'for-linus' of https://github.com/openrisc/linux:
  Documentation: openrisc: Update toolchain binaries URL
  Documentation: openrisc: Update mailing list
  openrisc: Add cacheinfo support
  openrisc: Introduce new utility functions to flush and invalidate caches
  openrisc: Refactor struct cpuinfo_or1k to reduce duplication

2 months agoRevert "sunrpc: clean cache_detail immediately when flush is written frequently"
Chuck Lever [Thu, 24 Apr 2025 13:27:35 +0000 (09:27 -0400)]
Revert "sunrpc: clean cache_detail immediately when flush is written frequently"

Ondrej reports that certain SELinux tests are failing after commit
fc2a169c56de ("sunrpc: clean cache_detail immediately when flush is
written frequently"), merged during the v6.15 merge window.

Reported-by: Ondrej Mosnacek <omosnace@redhat.com>
Fixes: fc2a169c56de ("sunrpc: clean cache_detail immediately when flush is written frequently")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2 months agoMerge tag 'move-lib-kunit-v6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel...
Linus Torvalds [Sat, 26 Apr 2025 15:55:24 +0000 (08:55 -0700)]
Merge tag 'move-lib-kunit-v6.15-rc4' of git://git./linux/kernel/git/kees/linux

Pull kunit fix from Kees Cook:
 "A single fix for the kunit lib/tests/ relocation:

   - Ensure prime numbers tests are included in KUnit test runs (Mark Brown)"

* tag 'move-lib-kunit-v6.15-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  lib: Ensure prime numbers tests are included in KUnit test runs

2 months agoMerge tag 'drm-fixes-2025-04-26' of https://gitlab.freedesktop.org/drm/kernel
Linus Torvalds [Sat, 26 Apr 2025 15:32:29 +0000 (08:32 -0700)]
Merge tag 'drm-fixes-2025-04-26' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
 "Weekly drm fixes, mostly amdgpu, with some exynos cleanups and a
  couple of minor fixes, seems a bit quiet, but probably some lag from
  Easter holidays.

  amdgpu:
   - P2P DMA fixes
   - Display reset fixes
   - DCN 3.5 fixes
   - ACPI EDID fix
   - LTTPR fix
   - mode_valid() fix

  exynos:
   - fix spelling error
   - remove redundant error handling in exynos_drm_vidi.c module
   - marks struct decon_data as const in the exynos7_drm_decon driver
     since it is only read
   - Remove unnecessary checking in exynos_drm_drv.c module

  meson:
   - Fix VCLK calculation

  panel:
   - jd9365a: Fix reset polarity"

* tag 'drm-fixes-2025-04-26' of https://gitlab.freedesktop.org/drm/kernel:
  drm/exynos: Fix spelling mistake "enqueu" -> "enqueue"
  drm/exynos: exynos7_drm_decon: Consstify struct decon_data
  drm/exynos: fixed a spelling error
  drm/exynos/vidi: Remove redundant error handling in vidi_get_modes()
  drm/exynos: Remove unnecessary checking
  drm/amd/display: do not copy invalid CRTC timing info
  drm/amd/display: Default IPS to RCG_IN_ACTIVE_IPS2_IN_OFF
  drm/amd/display: Use 16ms AUX read interval for LTTPR with old sinks
  drm/amd/display: Fix ACPI edid parsing on some Lenovo systems
  drm/amdgpu: Allow P2P access through XGMI
  drm/amd/display: Enable urgent latency adjustment on DCN35
  drm/amd/display: Force full update in gpu reset
  drm/amd/display: Fix gpu reset in multidisplay config
  drm/amdgpu: Don't pin VRAM without DMABUF_MOVE_NOTIFY
  drm/amdgpu: Use allowed_domains for pinning dmabufs
  drm: panel: jd9365da: fix reset signal polarity in unprepare
  drm/meson: use unsigned long long / Hz for frequency types
  Revert "drm/meson: vclk: fix calculation of 59.94 fractional rates"

2 months agosched/eevdf: Fix se->slice being set to U64_MAX and resulting crash
Omar Sandoval [Fri, 25 Apr 2025 08:51:24 +0000 (01:51 -0700)]
sched/eevdf: Fix se->slice being set to U64_MAX and resulting crash

There is a code path in dequeue_entities() that can set the slice of a
sched_entity to U64_MAX, which sometimes results in a crash.

The offending case is when dequeue_entities() is called to dequeue a
delayed group entity, and then the entity's parent's dequeue is delayed.
In that case:

1. In the if (entity_is_task(se)) else block at the beginning of
   dequeue_entities(), slice is set to
   cfs_rq_min_slice(group_cfs_rq(se)). If the entity was delayed, then
   it has no queued tasks, so cfs_rq_min_slice() returns U64_MAX.
2. The first for_each_sched_entity() loop dequeues the entity.
3. If the entity was its parent's only child, then the next iteration
   tries to dequeue the parent.
4. If the parent's dequeue needs to be delayed, then it breaks from the
   first for_each_sched_entity() loop _without updating slice_.
5. The second for_each_sched_entity() loop sets the parent's ->slice to
   the saved slice, which is still U64_MAX.

This throws off subsequent calculations with potentially catastrophic
results. A manifestation we saw in production was:

6. In update_entity_lag(), se->slice is used to calculate limit, which
   ends up as a huge negative number.
7. limit is used in se->vlag = clamp(vlag, -limit, limit). Because limit
   is negative, vlag > limit, so se->vlag is set to the same huge
   negative number.
8. In place_entity(), se->vlag is scaled, which overflows and results in
   another huge (positive or negative) number.
9. The adjusted lag is subtracted from se->vruntime, which increases or
   decreases se->vruntime by a huge number.
10. pick_eevdf() calls entity_eligible()/vruntime_eligible(), which
    incorrectly returns false because the vruntime is so far from the
    other vruntimes on the queue, causing the
    (vruntime - cfs_rq->min_vruntime) * load calulation to overflow.
11. Nothing appears to be eligible, so pick_eevdf() returns NULL.
12. pick_next_entity() tries to dereference the return value of
    pick_eevdf() and crashes.

Dumping the cfs_rq states from the core dumps with drgn showed tell-tale
huge vruntime ranges and bogus vlag values, and I also traced se->slice
being set to U64_MAX on live systems (which was usually "benign" since
the rest of the runqueue needed to be in a particular state to crash).

Fix it in dequeue_entities() by always setting slice from the first
non-empty cfs_rq.

Fixes: aef6987d8954 ("sched/eevdf: Propagate min_slice up the cgroup hierarchy")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Link: https://lkml.kernel.org/r/f0c2d1072be229e1bdddc73c0703919a8b00c652.1745570998.git.osandov@fb.com
2 months agoirqchip/gic-v2m: Prevent use after free of gicv2m_get_fwnode()
Suzuki K Poulose [Tue, 22 Apr 2025 16:16:16 +0000 (17:16 +0100)]
irqchip/gic-v2m: Prevent use after free of gicv2m_get_fwnode()

With ACPI in place, gicv2m_get_fwnode() is registered with the pci
subsystem as pci_msi_get_fwnode_cb(), which may get invoked at runtime
during a PCI host bridge probe. But, the call back is wrongly marked as
__init, causing it to be freed, while being registered with the PCI
subsystem and could trigger:

 Unable to handle kernel paging request at virtual address ffff8000816c0400
  gicv2m_get_fwnode+0x0/0x58 (P)
  pci_set_bus_msi_domain+0x74/0x88
  pci_register_host_bridge+0x194/0x548

This is easily reproducible on a Juno board with ACPI boot.

Retain the function for later use.

Fixes: 0644b3daca28 ("irqchip/gic-v2m: acpi: Introducing GICv2m ACPI support")
Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
2 months agoLoongArch: KVM: Fix PMU pass-through issue if VM exits to host finally
Bibo Mao [Thu, 24 Apr 2025 12:15:52 +0000 (20:15 +0800)]
LoongArch: KVM: Fix PMU pass-through issue if VM exits to host finally

In function kvm_pre_enter_guest(), it prepares to enter guest and check
whether there are pending signals or events. And it will not enter guest
if there are, PMU pass-through preparation for guest should be cancelled
and host should own PMU hardware.

Cc: stable@vger.kernel.org
Fixes: f4e40ea9f78f ("LoongArch: KVM: Add PMU support for guest")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
2 months agoLoongArch: KVM: Fully clear some CSRs when VM reboot
Bibo Mao [Thu, 24 Apr 2025 12:15:52 +0000 (20:15 +0800)]
LoongArch: KVM: Fully clear some CSRs when VM reboot

Some registers such as LOONGARCH_CSR_ESTAT and LOONGARCH_CSR_GINTC are
partly cleared with function _kvm_setcsr(). This comes from the hardware
specification, some bits are read only in VM mode, and however they can
be written in host mode. So they are partly cleared in VM mode, and can
be fully cleared in host mode.

These read only bits show pending interrupt or exception status. When VM
reset, the read-only bits should be cleared, otherwise vCPU will receive
unknown interrupts in boot stage.

Here registers LOONGARCH_CSR_ESTAT/LOONGARCH_CSR_GINTC are fully cleared
in ioctl KVM_REG_LOONGARCH_VCPU_RESET vCPU reset path.

Cc: stable@vger.kernel.org
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>