git.kernel.dk Git - linux-2.6-block.git/log

libbpf: Introduce errstr() for stringifying errno

Add function errstr(int err) that allows converting numeric error codes
into string representations.

Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241111212919.368971-2-mykyta.yatsenko5@gmail.com

bpf: Replace the document for PTR_TO_BTF_ID_OR_NULL

Commit c25b2ae13603 ("bpf: Replace PTR_TO_XXX_OR_NULL with PTR_TO_XXX |
PTR_MAYBE_NULL") moved the fields around and misplaced the
documentation for "PTR_TO_BTF_ID_OR_NULL". So, let's replace it in the
proper place.

Signed-off-by: Menglong Dong <dongml2@chinatelecom.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241111124911.1436911-1-dongml2@chinatelecom.cn

tools/bpf: Fix the wrong format specifier in bpf_jit_disasm

There is a static checker warning that the %d in format string is
mismatched with the corresponding argument type, which could result in
incorrect printed data. This patch fixes it.

Signed-off-by: Luo Yifan <luoyifan@cmss.chinamobile.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241111021004.272293-1-luoyifan@cmss.chinamobile.com

kbuild,bpf: Pass make jobs' value to pahole

Pass the value of make's -j/--jobs argument to pahole, to avoid out of
memory errors and make pahole respect the "jobs" value of make.

On systems with little memory but many cores, invoking pahole using -j
without argument potentially creates too many pahole instances,
causing an out-of-memory situation. Instead, we should pass make's
"jobs" value as an argument to pahole's -j, which is likely configured
to be (much) lower than the actual core count on such systems.

If make was invoked without -j, either via cmdline or MAKEFLAGS, then
JOBS will be simply empty, resulting in the existing behavior, as
expected.

Signed-off-by: Florian Schmaus <flo@geekplace.eu>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Link: https://lore.kernel.org/bpf/20241102100452.793970-1-flo@geekplace.eu

Merge branch 'refactor-lock-management'

Kumar Kartikeya Dwivedi says:

====================
Refactor lock management

This set refactors lock management in the verifier in preparation for
spin locks that can be acquired multiple times. In addition to this,
unnecessary code special case reference leak logic for callbacks is also
dropped, that is no longer necessary. See patches for details.

Changelog:
----------
v5 -> v6
v5: https://lore.kernel.org/bpf/20241109225243.2306756-1-memxor@gmail.com

* Move active_locks mutation to {acquire,release}_lock_state (Alexei)

v4 -> v5
v4: https://lore.kernel.org/bpf/20241109074347.1434011-1-memxor@gmail.com

* Make active_locks part of bpf_func_state (Alexei)
* Remove unneeded in_callback_fn logic for references

v3 -> v4
v3: https://lore.kernel.org/bpf/20241104151716.2079893-1-memxor@gmail.com

* Address comments from Alexei
  * Drop struct bpf_active_lock definition
  * Name enum type, expand definition to multiple lines
  * s/REF_TYPE_BPF_LOCK/REF_TYPE_LOCK/g
  * Change active_lock type to int
  * Fix type of 'type' in acquire_lock_state
  * Filter by taking type explicitly in find_lock_state
  * WARN for default case in refsafe switch statement

v2 -> v3
v2: https://lore.kernel.org/bpf/20241103212252.547071-1-memxor@gmail.com

  * Rebase on bpf-next to resolve merge conflict

v1 -> v2
v1: https://lore.kernel.org/bpf/20241103205856.345580-1-memxor@gmail.com

  * Fix refsafe state comparison to check callback_ref and ptr separately.
====================

Link: https://lore.kernel.org/r/20241109231430.2475236-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

bpf: Drop special callback reference handling

Logic to prevent callbacks from acquiring new references for the program
(i.e. leaving acquired references), and releasing caller references
(i.e. those acquired in parent frames) was introduced in commit
9d9d00ac29d0 ("bpf: Fix reference state management for synchronous callbacks").

This was necessary because back then, the verifier simulated each
callback once (that could potentially be executed N times, where N can
be zero). This meant that callbacks that left lingering resources or
cleared caller resources could do it more than once, operating on
undefined state or leaking memory.

With the fixes to callback verification in commit
ab5cfac139ab ("bpf: verify callbacks as if they are called unknown number of times"),
all of this extra logic is no longer necessary. Hence, drop it as part
of this commit.

Cc: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241109231430.2475236-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

bpf: Refactor active lock management

When bpf_spin_lock was introduced originally, there was deliberation on
whether to use an array of lock IDs, but since bpf_spin_lock is limited
to holding a single lock at any given time, we've been using a single ID
to identify the held lock.

In preparation for introducing spin locks that can be taken multiple
times, introduce support for acquiring multiple lock IDs. For this
purpose, reuse the acquired_refs array and store both lock and pointer
references. We tag the entry with REF_TYPE_PTR or REF_TYPE_LOCK to
disambiguate and find the relevant entry. The ptr field is used to track
the map_ptr or btf (for bpf_obj_new allocations) to ensure locks can be
matched with protected fields within the same "allocation", i.e.
bpf_obj_new object or map value.

The struct active_lock is changed to an int as the state is part of the
acquired_refs array, and we only need active_lock as a cheap way of
detecting lock presence.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241109231430.2475236-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: skip the timer_lockup test for single-CPU nodes

The timer_lockup test needs 2 CPUs to work, on single-CPU nodes it fails
to set thread affinity to CPU 1 since it doesn't exist:

    # ./test_progs -t timer_lockup
    test_timer_lockup:PASS:timer_lockup__open_and_load 0 nsec
    test_timer_lockup:PASS:pthread_create thread1 0 nsec
    test_timer_lockup:PASS:pthread_create thread2 0 nsec
    timer_lockup_thread:PASS:cpu affinity 0 nsec
    timer_lockup_thread:FAIL:cpu affinity unexpected error: 22 (errno 0)
    test_timer_lockup:PASS: 0 nsec
    #406     timer_lockup:FAIL

Skip the test if only 1 CPU is available.

Signed-off-by: Viktor Malik <vmalik@redhat.com>
Fixes: 50bd5a0c658d1 ("selftests/bpf: Add timer lockup selftest")
Tested-by: Philo Lu <lulie@linux.alibaba.com>
Acked-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241107115231.75200-1-vmalik@redhat.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Merge branch 'fix-lockdep-warning-for-htab-of-map'

Hou Tao says:

====================
The patch set fixes a lockdep warning for htab of map. The
warning is found when running test_maps. The warning occurs when
htab_put_fd_value() attempts to acquire map_idr_lock to free the map id
of the inner map while already holding the bucket lock (raw_spinlock_t).

The fix moves the invocation of free_htab_elem() after
htab_unlock_bucket() and adds a test case to verify the solution.
====================

Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: Test the update operations for htab of maps

Add test cases to verify the following four update operations on htab of
maps don't trigger lockdep warning:

(1) add then delete
(2) add, overwrite, then delete
(3) add, then lookup_and_delete
(4) add two elements, then lookup_and_delete_batch

Test cases are added for pre-allocated and non-preallocated htab of maps
respectively.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241106063542.357743-4-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: Move ENOTSUPP from bpf_util.h

Moving the definition of ENOTSUPP into bpf_util.h to remove the
duplicated definitions in multiple files.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241106063542.357743-3-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

bpf: Call free_htab_elem() after htab_unlock_bucket()

For htab of maps, when the map is removed from the htab, it may hold the
last reference of the map. bpf_map_fd_put_ptr() will invoke
bpf_map_free_id() to free the id of the removed map element. However,
bpf_map_fd_put_ptr() is invoked while holding a bucket lock
(raw_spin_lock_t), and bpf_map_free_id() attempts to acquire map_idr_lock
(spinlock_t), triggering the following lockdep warning:

  =============================
  [ BUG: Invalid wait context ]
  6.11.0-rc4+ #49 Not tainted
  -----------------------------
  test_maps/4881 is trying to lock:
  ffffffff84884578 (map_idr_lock){+...}-{3:3}, at: bpf_map_free_id.part.0+0x21/0x70
  other info that might help us debug this:
  context-{5:5}
  2 locks held by test_maps/4881:
   #0: ffffffff846caf60 (rcu_read_lock){....}-{1:3}, at: bpf_fd_htab_map_update_elem+0xf9/0x270
   #1: ffff888149ced148 (&htab->lockdep_key#2){....}-{2:2}, at: htab_map_update_elem+0x178/0xa80
  stack backtrace:
  CPU: 0 UID: 0 PID: 4881 Comm: test_maps Not tainted 6.11.0-rc4+ #49
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), ...
  Call Trace:
   <TASK>
   dump_stack_lvl+0x6e/0xb0
   dump_stack+0x10/0x20
   __lock_acquire+0x73e/0x36c0
   lock_acquire+0x182/0x450
   _raw_spin_lock_irqsave+0x43/0x70
   bpf_map_free_id.part.0+0x21/0x70
   bpf_map_put+0xcf/0x110
   bpf_map_fd_put_ptr+0x9a/0xb0
   free_htab_elem+0x69/0xe0
   htab_map_update_elem+0x50f/0xa80
   bpf_fd_htab_map_update_elem+0x131/0x270
   htab_map_update_elem+0x50f/0xa80
   bpf_fd_htab_map_update_elem+0x131/0x270
   bpf_map_update_value+0x266/0x380
   __sys_bpf+0x21bb/0x36b0
   __x64_sys_bpf+0x45/0x60
   x64_sys_call+0x1b2a/0x20d0
   do_syscall_64+0x5d/0x100
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

One way to fix the lockdep warning is using raw_spinlock_t for
map_idr_lock as well. However, bpf_map_alloc_id() invokes
idr_alloc_cyclic() after acquiring map_idr_lock, it will trigger a
similar lockdep warning because the slab's lock (s->cpu_slab->lock) is
still a spinlock.

Instead of changing map_idr_lock's type, fix the issue by invoking
htab_put_fd_value() after htab_unlock_bucket(). However, only deferring
the invocation of htab_put_fd_value() is not enough, because the old map
pointers in htab of maps can not be saved during batched deletion.
Therefore, also defer the invocation of free_htab_elem(), so these
to-be-freed elements could be linked together similar to lru map.

There are four callers for ->map_fd_put_ptr:

(1) alloc_htab_elem() (through htab_put_fd_value())
It invokes ->map_fd_put_ptr() under a raw_spinlock_t. The invocation of
htab_put_fd_value() can not simply move after htab_unlock_bucket(),
because the old element has already been stashed in htab->extra_elems.
It may be reused immediately after htab_unlock_bucket() and the
invocation of htab_put_fd_value() after htab_unlock_bucket() may release
the newly-added element incorrectly. Therefore, saving the map pointer
of the old element for htab of maps before unlocking the bucket and
releasing the map_ptr after unlock. Beside the map pointer in the old
element, should do the same thing for the special fields in the old
element as well.

(2) free_htab_elem() (through htab_put_fd_value())
Its caller includes __htab_map_lookup_and_delete_elem(),
htab_map_delete_elem() and __htab_map_lookup_and_delete_batch().

For htab_map_delete_elem(), simply invoke free_htab_elem() after
htab_unlock_bucket(). For __htab_map_lookup_and_delete_batch(), just
like lru map, linking the to-be-freed element into node_to_free list
and invoking free_htab_elem() for these element after unlock. It is safe
to reuse batch_flink as the link for node_to_free, because these
elements have been removed from the hash llist.

Because htab of maps doesn't support lookup_and_delete operation,
__htab_map_lookup_and_delete_elem() doesn't have the problem, so kept
it as is.

(3) fd_htab_map_free()
It invokes ->map_fd_put_ptr without raw_spinlock_t.

(4) bpf_fd_htab_map_update_elem()
It invokes ->map_fd_put_ptr without raw_spinlock_t.

After moving free_htab_elem() outside htab bucket lock scope, using
pcpu_freelist_push() instead of __pcpu_freelist_push() to disable
the irq before freeing elements, and protecting the invocations of
bpf_mem_cache_free() with migrate_{disable|enable} pair.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Link: https://lore.kernel.org/r/20241106063542.357743-2-houtao@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Merge branch 'bpf-add-uprobe-session-support'

Jiri Olsa says:

====================
bpf: Add uprobe session support

hi,
this patchset is adding support for session uprobe attachment and
using it through bpf link for bpf programs.

The session means that the uprobe consumer is executed on entry
and return of probed function with additional control:
  - entry callback can control execution of the return callback
  - entry and return callbacks can share data/cookie

Uprobe changes (on top of perf/core [1] are posted in here [2].
This patchset is based on bpf-next/master and will be merged once
we pull [2] in bpf-next/master.

v9 changes:
  - rebased on bpf-next/master with perf/core tag merged (thanks Peter!)

thanks,
jirka

[1] git://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git perf/core
[2] https://lore.kernel.org/bpf/20241018202252.693462-1-jolsa@kernel.org/T/#ma43c549c4bf684ca1b17fa638aa5e7cbb46893e9
---
====================

Link: https://lore.kernel.org/r/20241108134544.480660-1-jolsa@kernel.org
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: Add threads to consumer test

With recent uprobe fix [1] the sync time after unregistering uprobe is
much longer and prolongs the consumer test which creates and destroys
hundreds of uprobes.

This change adds 16 threads (which fits the test logic) and speeds up
the test.

Before the change:

  # perf stat --null ./test_progs -t uprobe_multi_test/consumers
  #421/9   uprobe_multi_test/consumers:OK
  #421     uprobe_multi_test:OK
  Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED

   Performance counter stats for './test_progs -t uprobe_multi_test/consumers':

        28.818778973 seconds time elapsed

         0.745518000 seconds user
         0.919186000 seconds sys

After the change:

  # perf stat --null ./test_progs -t uprobe_multi_test/consumers 2>&1
  #421/9   uprobe_multi_test/consumers:OK
  #421     uprobe_multi_test:OK
  Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED

   Performance counter stats for './test_progs -t uprobe_multi_test/consumers':

         3.504790814 seconds time elapsed

         0.012141000 seconds user
         0.751760000 seconds sys

[1] commit 87195a1ee332 ("uprobes: switch to RCU Tasks Trace flavor for better performance")

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-14-jolsa@kernel.org

selftests/bpf: Add uprobe sessions to consumer test

Adding uprobe session consumers to the consumer test,
so we get the session into the test mix.

In addition scaling down the test to have just 1 uprobe
and 1 uretprobe, otherwise the test time grows and is
unsuitable for CI even with threads.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-13-jolsa@kernel.org

selftests/bpf: Add uprobe session single consumer test

Testing that the session ret_handler bypass works on single
uprobe with multiple consumers, each with different session
ignore return value.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-12-jolsa@kernel.org

selftests/bpf: Add kprobe session verifier test for return value

Making sure kprobe.session program can return only [0,1] values.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-11-jolsa@kernel.org

selftests/bpf: Add uprobe session verifier test for return value

Making sure uprobe.session program can return only [0,1] values.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-10-jolsa@kernel.org

selftests/bpf: Add uprobe session recursive test

Adding uprobe session test that verifies the cookie value is stored
properly when single uprobe-ed function is executed recursively.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-9-jolsa@kernel.org

selftests/bpf: Add uprobe session cookie test

Adding uprobe session test that verifies the cookie value
get properly propagated from entry to return program.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-8-jolsa@kernel.org

selftests/bpf: Add uprobe session test

Adding uprobe session test and testing that the entry program
return value controls execution of the return probe program.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-7-jolsa@kernel.org

libbpf: Add support for uprobe multi session attach

Adding support to attach program in uprobe session mode
with bpf_program__attach_uprobe_multi function.

Adding session bool to bpf_uprobe_multi_opts struct that allows
to load and attach the bpf program via uprobe session.
the attachment to create uprobe multi session.

Also adding new program loader section that allows:
SEC("uprobe.session/bpf_fentry_test*")

and loads/attaches uprobe program as uprobe session.

Adding sleepable hook (uprobe.session.s) as well.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-6-jolsa@kernel.org

bpf: Add support for uprobe multi session context

Placing bpf_session_run_ctx layer in between bpf_run_ctx and
bpf_uprobe_multi_run_ctx, so the session data can be retrieved
from uprobe_multi link.

Plus granting session kfuncs access to uprobe session programs.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-5-jolsa@kernel.org

bpf: Add support for uprobe multi session attach

Adding support to attach BPF program for entry and return probe
of the same function. This is common use case which at the moment
requires to create two uprobe multi links.

Adding new BPF_TRACE_UPROBE_SESSION attach type that instructs
kernel to attach single link program to both entry and exit probe.

It's possible to control execution of the BPF program on return
probe simply by returning zero or non zero from the entry BPF
program execution to execute or not the BPF program on return
probe respectively.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-4-jolsa@kernel.org

bpf: Force uprobe bpf program to always return 0

As suggested by Andrii make uprobe multi bpf programs to always return 0,
so they can't force uprobe removal.

Keeping the int return type for uprobe_prog_run, because it will be used
in following session changes.

Fixes: 89ae89f53d20 ("bpf: Add multi uprobe link")
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-3-jolsa@kernel.org

bpf: Allow return values 0 and 1 for kprobe session

The kprobe session program can return only 0 or 1,
instruct verifier to check for that.

Fixes: 535a3692ba72 ("bpf: Add support for kprobe session attach")
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241108134544.480660-2-jolsa@kernel.org

selftests/bpf: Fix uprobe consumer test (again)

The new uprobe changes bring some new behaviour that we need to reflect
in the consumer test. Now pending uprobe instance in the kernel can
survive longer and thus might call uretprobe consumer callbacks in
some situations in which, previously, such callback would be omitted.
We now need to take that into account in uprobe-multi consumer tests.

The idea being that uretprobe under test either stayed from before to
after (uret_stays + test_bit) or uretprobe instance survived and we
have uretprobe active in after (uret_survives + test_bit).

uret_survives just states that uretprobe survives if there are *any*
uretprobes both before and after (overlapping or not, doesn't matter)
and uprobe was attached before.

Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241107094337.3848210-1-jolsa@kernel.org

bpf: Remove trailing whitespace in verifier.rst

Remove trailing whitespace in Documentation/bpf/verifier.rst.

Signed-off-by: Abhinav Saxena <xandfury@gmail.com>
Link: https://lore.kernel.org/r/20241107063708.106340-2-xandfury@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

selftests/bpf: Allow building with extra flags

In order to specify extra compilation or linking flags to BPF selftests,
it is possible to set EXTRA_CFLAGS and EXTRA_LDFLAGS from the command
line. The problem is that they are not propagated to sub-make calls
(runqslower, bpftool, libbpf) and in the better case are not applied, in
the worse case cause the entire build fail.

Propagate EXTRA_CFLAGS and EXTRA_LDFLAGS to the sub-makes.

This, for instance, allows to build selftests as PIE with

$ make EXTRA_CFLAGS='-fPIE' EXTRA_LDFLAGS='-pie'

Without this change, the command would fail because libbpf.a would not
be built with -fPIE and other PIE binaries would not link against it.

The only problem is that we have to explicitly provide empty
EXTRA_CFLAGS='' and EXTRA_LDFLAGS='' to the builds of kernel modules as
we don't want to build modules with flags used for userspace (the above
example would fail as kernel doesn't support PIE).

Signed-off-by: Viktor Malik <vmalik@redhat.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Merge tag 'perf-core-for-bpf-next' from tip tree

Stable tag for bpf-next's uprobe work.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Merge branch 'handle-possible-null-trusted-raw_tp-arguments'

Kumar Kartikeya Dwivedi says:

====================
Handle possible NULL trusted raw_tp arguments

More context is available in [0], but the TLDR; is that the verifier
incorrectly assumes that any raw tracepoint argument will always be
non-NULL. This means that even when users correctly check possible NULL
arguments, the verifier can remove the NULL check due to incorrect
knowledge of the NULL-ness of the pointer. Secondly, kernel helpers or
kfuncs taking these trusted tracepoint arguments incorrectly assume that
all arguments will always be valid non-NULL.

In this set, we mark raw_tp arguments as PTR_MAYBE_NULL on top of
PTR_TRUSTED, but special case their behavior when dereferencing them or
pointer arithmetic over them is involved. When passing trusted args to
helpers or kfuncs, raw_tp programs are permitted to pass possibly NULL
pointers in such cases.

Any loads into such maybe NULL trusted PTR_TO_BTF_ID is promoted to a
PROBE_MEM load to handle emanating page faults. The verifier will ensure
NULL checks on such pointers are preserved and do not lead to dead code
elimination.

This new behavior is not applied when ref_obj_id is non-zero, as those
pointers do not belong to raw_tp arguments, but instead acquired
objects.

Since helpers and kfuncs already require attention for PTR_TO_BTF_ID
(non-trusted) pointers, we do not implement any protection for such
cases in this patch set, and leave it as future work for an upcoming
series.

A selftest is included with this patch set to verify the new behavior,
and it crashes the kernel without the first patch.

[0]: https://lore.kernel.org/bpf/CAADnVQLMPPavJQR6JFsi3dtaaLHB816JN4HCV_TFWohJ61D+wQ@mail.gmail.com

Changelog:
----------
v2 -> v3
v2: https://lore.kernel.org/bpf/20241103184144.3765700-1-memxor@gmail.com

* Fix lenient check around check_ptr_to_btf_access allowing any
PTR_TO_BTF_ID with PTR_MAYBE_NULL to be deref'd.
* Add Juri and Jiri's Tested-by, Reviewed-by resp.

v1 -> v2
v1: https://lore.kernel.org/bpf/20241101000017.3424165-1-memxor@gmail.com

* Add patch to clean up users of gettid (Andrii)
* Avoid nested blocks in sefltest (Andrii)
* Prevent code motion optimization in selftest using barrier()
====================

Link: https://lore.kernel.org/r/20241104171959.2938862-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add tests for raw_tp null handling

Ensure that trusted PTR_TO_BTF_ID accesses perform PROBE_MEM handling in
raw_tp program. Without the previous fix, this selftest crashes the
kernel due to a NULL-pointer dereference. Also ensure that dead code
elimination does not kick in for checks on the pointer.

Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241104171959.2938862-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Clean up open-coded gettid syscall invocations

Availability of the gettid definition across glibc versions supported by
BPF selftests is not certain. Currently, all users in the tree open-code
syscall to gettid. Convert them to a common macro definition.

Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241104171959.2938862-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Mark raw_tp arguments with PTR_MAYBE_NULL

Arguments to a raw tracepoint are tagged as trusted, which carries the
semantics that the pointer will be non-NULL. However, in certain cases,
a raw tracepoint argument may end up being NULL. More context about this
issue is available in [0].

Thus, there is a discrepancy between the reality, that raw_tp arguments
can actually be NULL, and the verifier's knowledge, that they are never
NULL, causing explicit NULL checks to be deleted, and accesses to such
pointers potentially crashing the kernel.

To fix this, mark raw_tp arguments as PTR_MAYBE_NULL, and then special
case the dereference and pointer arithmetic to permit it, and allow
passing them into helpers/kfuncs; these exceptions are made for raw_tp
programs only. Ensure that we don't do this when ref_obj_id > 0, as in
that case this is an acquired object and doesn't need such adjustment.

The reason we do mask_raw_tp_trusted_reg logic is because other will
recheck in places whether the register is a trusted_reg, and then
consider our register as untrusted when detecting the presence of the
PTR_MAYBE_NULL flag.

To allow safe dereference, we enable PROBE_MEM marking when we see loads
into trusted pointers with PTR_MAYBE_NULL.

While trusted raw_tp arguments can also be passed into helpers or kfuncs
where such broken assumption may cause issues, a future patch set will
tackle their case separately, as PTR_TO_BTF_ID (without PTR_TRUSTED) can
already be passed into helpers and causes similar problems. Thus, they
are left alone for now.

It is possible that these checks also permit passing non-raw_tp args
that are trusted PTR_TO_BTF_ID with null marking. In such a case,
allowing dereference when pointer is NULL expands allowed behavior, so
won't regress existing programs, and the case of passing these into
helpers is the same as above and will be dealt with later.

Also update the failure case in tp_btf_nullable selftest to capture the
new behavior, as the verifier will no longer cause an error when
directly dereference a raw tracepoint argument marked as __nullable.

[0]: https://lore.kernel.org/bpf/ZrCZS6nisraEqehw@jlelli-thinkpadt14gen4.remote.csb

Reviewed-by: Jiri Olsa <jolsa@kernel.org>
Reported-by: Juri Lelli <juri.lelli@redhat.com>
Tested-by: Juri Lelli <juri.lelli@redhat.com>
Fixes: 3f00c5239344 ("bpf: Allow trusted pointers to be passed to KF_TRUSTED_ARGS kfuncs")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241104171959.2938862-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Move btf_type_is_struct_ptr() under CONFIG_BPF_SYSCALL

The static inline btf_type_is_struct_ptr() function calls
btf_type_skip_modifiers() which is guarded by CONFIG_BPF_SYSCALL.
btf_type_is_struct_ptr() is also only called by CONFIG_BPF_SYSCALL
ifdef code, so let's only expose btf_type_is_struct_ptr() if
CONFIG_BPF_SYSCALL is defined.

Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
Link: https://lore.kernel.org/r/20241104060300.421403-1-alistair.francis@wdc.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'fix-resource-leak-checks-for-tail-calls'

Kumar Kartikeya Dwivedi says:

====================
Fix resource leak checks for tail calls

This set contains a fix for detecting unreleased RCU read locks or
unfinished preempt_disable sections when performing a tail call. Spin
locks are prevented by accident since they don't allow any function
calls, including tail calls (modelled as call instruction to a helper),
so we ensure they are checked as well, in preparation for relaxing
function call restricton for critical sections in the future.

Then, in the second patch, all the checks for reference leaks and locks
are unified into a single function that can be called from different
places. This unification patch is kept separate and placed after the fix
to allow independent backport of the fix to older kernels without a
depdendency on the clean up.

Naturally, this creates a divergence in the disparate error messages,
therefore selftests that rely on the exact error strings need to be
updated to match the new verifier log message.

A selftest is included to ensure no regressions occur wrt this behavior.
====================

Link: https://lore.kernel.org/r/20241103225940.1408302-1-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add tests for tail calls with locks and refs

Add failure tests to ensure bugs don't slip through for tail calls and
lingering locks, RCU sections, preemption disabled sections, and
references prevent tail calls.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241103225940.1408302-4-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Unify resource leak checks

There are similar checks for covering locks, references, RCU read
sections and preempt_disable sections in 3 places in the verifer, i.e.
for tail calls, bpf_ld_[abs, ind], and exit path (for BPF_EXIT and
bpf_throw). Unify all of these into a common check_resource_leak
function to avoid code duplication.

Also update the error strings in selftests to the new ones in the same
change to ensure clean bisection.

Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241103225940.1408302-3-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Tighten tail call checks for lingering locks, RCU, preempt_disable

There are three situations when a program logically exits and transfers
control to the kernel or another program: bpf_throw, BPF_EXIT, and tail
calls. The former two check for any lingering locks and references, but
tail calls currently do not. Expand the checks to check for spin locks,
RCU read sections and preempt disabled sections.

Spin locks are indirectly preventing tail calls as function calls are
disallowed, but the checks for preemption and RCU are more relaxed,
hence ensure tail calls are prevented in their presence.

Fixes: 9bb00b2895cb ("bpf: Add kfunc bpf_rcu_read_lock/unlock()")
Fixes: fc7566ad0a82 ("bpf: Introduce bpf_preempt_[disable,enable] kfuncs")
Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Link: https://lore.kernel.org/r/20241103225940.1408302-2-memxor@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Disable warnings on unused flags for Clang builds

There exist compiler flags supported by GCC but not supported by Clang
(e.g. -specs=...). Currently, these cannot be passed to BPF selftests
builds, even when building with GCC, as some binaries (urandom_read and
liburandom_read.so) are always built with Clang and the unsupported
flags make the compilation fail (as -Werror is turned on).

Add -Wno-unused-command-line-argument to these rules to suppress such
errors.

This allows to do things like:

    $ CFLAGS="-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1" \
      make -C tools/testing/selftests/bpf

Without this patch, the compilation would fail with:

    [...]
    clang: error: argument unused during compilation: '-specs=/usr/lib/rpm/redhat/redhat-hardened-cc1' [-Werror,-Wunused-command-line-argument]
    make: *** [Makefile:273: /bpf-next/tools/testing/selftests/bpf/liburandom_read.so] Error 1
    [...]

Signed-off-by: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/2d349e9d5eb0a79dd9ff94b496769d64e6ff7654.1730449390.git.vmalik@redhat.com

bpftool: Prevent setting duplicate _GNU_SOURCE in Makefile

When building selftests with CFLAGS set via env variable, the value of
CFLAGS is propagated into bpftool Makefile (called from selftests
Makefile). This makes the compilation fail as _GNU_SOURCE is defined two
times - once from selftests Makefile (by including lib.mk) and once from
bpftool Makefile (by calling `llvm-config --cflags`):

    $ CFLAGS="" make -C tools/testing/selftests/bpf
    [...]
    CC      /bpf-next/tools/testing/selftests/bpf/tools/build/bpftool/btf.o
    <command-line>: error: "_GNU_SOURCE" redefined [-Werror]
    <command-line>: note: this is the location of the previous definition
    cc1: all warnings being treated as errors
    [...]

Filter out -D_GNU_SOURCE from the result of `llvm-config --cflags` in
bpftool Makefile to prevent this error.

Signed-off-by: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Quentin Monnet <qmo@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/acec3108b62d4df1436cda777e58e93e033ac7a7.1730449390.git.vmalik@redhat.com

bpf, bpftool: Fix incorrect disasm pc

This patch addresses the bpftool issue "Wrong callq address displayed"[0].

The issue stemmed from an incorrect program counter (PC) value used during
disassembly with LLVM or libbfd.

For LLVM: The PC argument must represent the actual address in the kernel
to compute the correct relative address.

For libbfd: The relative address can be adjusted by adding func_ksym within
the custom info->print_address_func to yield the correct address.

Links:
[0] https://github.com/libbpf/bpftool/issues/109

Changes:
v2 -> v3:
  * Address comment from Quentin:
    * Remove the typedef.

v1 -> v2:
  * Fix the broken libbfd disassembler.

Fixes: e1947c750ffe ("bpftool: Refactor disassembler for JIT-ed programs")
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Quentin Monnet <qmo@kernel.org>
Reviewed-by: Quentin Monnet <qmo@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Link: https://lore.kernel.org/bpf/20241031152844.68817-1-leon.hwang@linux.dev

selftests/bpf: Add a test for open coded kmem_cache iter

The new subtest runs with bpf_prog_test_run_opts() as a syscall prog.
It iterates the kmem_cache using bpf_for_each loop and count the number
of entries.  Finally it checks it with the number of entries from the
regular iterator.

  $ ./vmtest.sh -- ./test_progs -t kmem_cache_iter
  ...
  #130/1   kmem_cache_iter/check_task_struct:OK
  #130/2   kmem_cache_iter/check_slabinfo:OK
  #130/3   kmem_cache_iter/open_coded_iter:OK
  #130     kmem_cache_iter:OK
  Summary: 1/3 PASSED, 0 SKIPPED, 0 FAILED

Also simplify the code by using attach routine of the skeleton.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20241030222819.1800667-2-namhyung@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add open coded version of kmem_cache iterator

Add a new open coded iterator for kmem_cache which can be called from a
BPF program like below.  It doesn't take any argument and traverses all
kmem_cache entries.

  struct kmem_cache *pos;

  bpf_for_each(kmem_cache, pos) {
      ...
  }

As it needs to grab slab_mutex, it should be called from sleepable BPF
programs only.

Also update the existing iterator code to use the open coded version
internally as suggested by Andrii.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20241030222819.1800667-1-namhyung@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

uprobes: SRCU-protect uretprobe lifetime (with timeout)

Avoid taking refcount on uprobe in prepare_uretprobe(), instead take
uretprobe-specific SRCU lock and keep it active as kernel transfers
control back to user space.

Given we can't rely on user space returning from traced function within
reasonable time period, we need to make sure not to keep SRCU lock
active for too long, though. To that effect, we employ a timer callback
which is meant to terminate SRCU lock region after predefined timeout
(currently set to 100ms), and instead transfer underlying struct
uprobe's lifetime protection to refcounting.

This fallback to less scalable refcounting after 100ms is a fine
tradeoff from uretprobe's scalability and performance perspective,
because uretprobing *long running* user functions inherently doesn't run
into scalability issues (there is just not enough frequency to cause
noticeable issues with either performance or scalability).

The overall trick is in ensuring synchronization between current thread
and timer's callback fired on some other thread. To cope with that with
minimal logic complications, we add hprobe wrapper which is used to
contain all the synchronization related issues behind a small number of
basic helpers: hprobe_expire() for "downgrading" uprobe from SRCU-protected
state to refcounted state, and a hprobe_consume() and hprobe_finalize()
pair of single-use consuming helpers. Other than that, whatever current
thread's logic is there stays the same, as timer thread cannot modify
return_instance state (or add new/remove old return_instances). It only
takes care of SRCU unlock and uprobe refcounting, which is hidden from
the higher-level uretprobe handling logic.

We use atomic xchg() in hprobe_consume(), which is called from
performance critical handle_uretprobe_chain() function run in the
current context. When uncontended, this xchg() doesn't seem to hurt
performance as there are no other competing CPUs fighting for the same
cache line. We also mark struct return_instance as ____cacheline_aligned
to ensure no false sharing can happen.

Another technical moment. We need to make sure that the list of return
instances can be safely traversed under RCU from timer callback, so we
delay return_instance freeing with kfree_rcu() and make sure that list
modifications use RCU-aware operations.

Also, given SRCU lock survives transition from kernel to user space and
back we need to use lower-level __srcu_read_lock() and
__srcu_read_unlock() to avoid lockdep complaining.

Just to give an impression of a kind of performance improvements this
change brings, below are benchmarking results with and without these
SRCU changes, assuming other uprobe optimizations (mainly RCU Tasks
Trace for entry uprobes, lockless RB-tree lookup, and lockless VMA to
uprobe lookup) are left intact:

WITHOUT SRCU for uretprobes
===========================
uretprobe-nop         ( 1 cpus):    2.197 ± 0.002M/s  (  2.197M/s/cpu)
uretprobe-nop         ( 2 cpus):    3.325 ± 0.001M/s  (  1.662M/s/cpu)
uretprobe-nop         ( 3 cpus):    4.129 ± 0.002M/s  (  1.376M/s/cpu)
uretprobe-nop         ( 4 cpus):    6.180 ± 0.003M/s  (  1.545M/s/cpu)
uretprobe-nop         ( 8 cpus):    7.323 ± 0.005M/s  (  0.915M/s/cpu)
uretprobe-nop         (16 cpus):    6.943 ± 0.005M/s  (  0.434M/s/cpu)
uretprobe-nop         (32 cpus):    5.931 ± 0.014M/s  (  0.185M/s/cpu)
uretprobe-nop         (64 cpus):    5.145 ± 0.003M/s  (  0.080M/s/cpu)
uretprobe-nop         (80 cpus):    4.925 ± 0.005M/s  (  0.062M/s/cpu)

WITH SRCU for uretprobes
========================
uretprobe-nop         ( 1 cpus):    1.968 ± 0.001M/s  (  1.968M/s/cpu)
uretprobe-nop         ( 2 cpus):    3.739 ± 0.003M/s  (  1.869M/s/cpu)
uretprobe-nop         ( 3 cpus):    5.616 ± 0.003M/s  (  1.872M/s/cpu)
uretprobe-nop         ( 4 cpus):    7.286 ± 0.002M/s  (  1.822M/s/cpu)
uretprobe-nop         ( 8 cpus):   13.657 ± 0.007M/s  (  1.707M/s/cpu)
uretprobe-nop         (32 cpus):   45.305 ± 0.066M/s  (  1.416M/s/cpu)
uretprobe-nop         (64 cpus):   42.390 ± 0.922M/s  (  0.662M/s/cpu)
uretprobe-nop         (80 cpus):   47.554 ± 2.411M/s  (  0.594M/s/cpu)

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241024044159.3156646-3-andrii@kernel.org

uprobes: allow put_uprobe() from non-sleepable softirq context

Currently put_uprobe() might trigger mutex_lock()/mutex_unlock(), which
makes it unsuitable to be called from more restricted context like softirq.

Let's make put_uprobe() agnostic to the context in which it is called,
and use work queue to defer the mutex-protected clean up steps.

RB tree removal step is also moved into work-deferred callback to avoid
potential deadlock between softirq-based timer callback, added in the
next patch, and the rest of uprobe code.

We can rework locking altogher as a follow up, but that's significantly
more tricky, so warrants its own patch set. For now, we need to make
sure that changes in the next patch that add timer thread work correctly
with existing approach, while concentrating on SRCU + timeout logic.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20241024044159.3156646-2-andrii@kernel.org

perf/x86/rapl: Clean up cpumask and hotplug

The rapl pmu is die scope, which is supported by the generic perf_event
subsystem now.

Set the scope for the rapl PMU and remove all the cpumask and hotplug
codes.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Tested-by: Dhananjay Ugwekar <dhananjay.ugwekar@amd.com>
Link: https://lore.kernel.org/r/20241010142604.770192-2-kan.liang@linux.intel.com

perf/x86/rapl: Move the pmu allocation out of CPU hotplug

There are extra codes in the CPU hotplug function to allocate rapl pmus.
The generic PMU hotplug support is hard to be applied.

As long as the rapl pmus can be allocated upfront for each die/socket,
the code doesn't need to be implemented in the CPU hotplug function.
Move the code to the init_rapl_pmus(), and allocate a PMU for each
possible die/socket.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Oliver Sang <oliver.sang@intel.com>
Link: https://lore.kernel.org/r/20241010142604.770192-1-kan.liang@linux.intel.com

selftests/bpf: drop unnecessary bpf_iter.h type duplication

Drop bpf_iter.h header which uses vmlinux.h but re-defines a bunch of
iterator structures and some of BPF constants for use in BPF iterator
selftests.

None of that is necessary when fresh vmlinux.h header is generated for
vmlinux image that matches latest selftests. So drop ugly hacks and have
a nice plain vmlinux.h usage everywhere.

We could do the same with all the kfunc __ksym redefinitions, but that
has dependency on very fresh pahole, so I'm not addressing that here.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241029203919.1948941-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: start v1.6 development cycle

With libbpf v1.5.0 release out, start v1.6 dev cycle.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241029184045.581537-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

docs/bpf: Add description of .BTF.base section

Now that .BTF.base sections are generated for out-of-tree kernel
modules (provided pahole supports the "distilled_base" BTF feature),
document .BTF.base and its role in supporting resilient split BTF
and BTF relocation.

Changes since v1:

- updated formatting, corrected typo, used BTF ID[s] consistently
(Andrii)

Signed-off-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241028091543.2175967-1-alan.maguire@oracle.com

bpf: handle implicit declaration of function gettid in bpf_iter.c

As we can see from the title, when I compiled the selftests/bpf, I
saw the error:
implicit declaration of function ‘gettid’ ; did you mean ‘getgid’? [-Werror=implicit-function-declaration]
  skel->bss->tid = gettid();
                   ^~~~~~
                   getgid

Directly call the syscall solves this issue.

Signed-off-by: Jason Xing <kernelxing@tencent.com>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/r/20241029074627.80289-1-kerneljasonxing@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>

bpf, arm64: Remove garbage frame for struct_ops trampoline

The callsite layout for arm64 fentry is:

mov x9, lr
nop

When a bpf prog is attached, the nop instruction is patched to a call
to bpf trampoline:

mov x9, lr
bl <bpf trampoline>

So two return addresses are passed to bpf trampoline: the return address
for the traced function/prog, stored in x9, and the return address for
the bpf trampoline itself, stored in lr. To obtain a full and accurate
call stack, the bpf trampoline constructs two fake function frames using
x9 and lr.

However, struct_ops progs are invoked directly as function callbacks,
meaning that x9 is not set as it is in the fentry callsite. In this case,
the frame constructed using x9 is garbage. The following stack trace for
struct_ops, captured by perf sampling, illustrates this issue, where
tcp_ack+0x404 is a garbage frame:

ffffffc0801a04b4 bpf_prog_50992e55a0f655a9_bpf_cubic_cong_avoid+0x98 (bpf_prog_50992e55a0f655a9_bpf_cubic_cong_avoid)
ffffffc0801a228c [unknown] ([kernel.kallsyms]) // bpf trampoline
ffffffd08d362590 tcp_ack+0x798 ([kernel.kallsyms]) // caller for bpf trampoline
ffffffd08d3621fc tcp_ack+0x404 ([kernel.kallsyms]) // garbage frame
ffffffd08d36452c tcp_rcv_established+0x4ac ([kernel.kallsyms])
ffffffd08d375c58 tcp_v4_do_rcv+0x1f0 ([kernel.kallsyms])
ffffffd08d378630 tcp_v4_rcv+0xeb8 ([kernel.kallsyms])

To fix it, construct only one frame using lr for struct_ops.

The above stack trace also indicates that there is no kernel symbol for
struct_ops bpf trampoline. This will be addressed in a follow-up patch.

Fixes: efc9909fdce0 ("bpf, arm64: Add bpf trampoline for arm64")
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Acked-by: Puranjay Mohan <puranjay@kernel.org>
Tested-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20241025085220.533949-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge git://git./pub/scm/linux/kernel/git/bpf/bpf

Cross-merge bpf fixes after downstream PR.

No conflicts.

Adjacent changes in:

include/linux/bpf.h
include/uapi/linux/bpf.h
kernel/bpf/btf.c
kernel/bpf/helpers.c
kernel/bpf/syscall.c
kernel/bpf/verifier.c
kernel/trace/bpf_trace.c
mm/slab_common.c
tools/include/uapi/linux/bpf.h
tools/testing/selftests/bpf/Makefile

Link: https://lore.kernel.org/all/20241024215724.60017-1-daniel@iogearbox.net/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge tag 'bpf-fixes' of git://git./linux/kernel/git/bpf/bpf

Pull bpf fixes from Daniel Borkmann:

- Fix an out-of-bounds read in bpf_link_show_fdinfo for BPF sockmap
   link file descriptors (Hou Tao)

- Fix BPF arm64 JIT's address emission with tag-based KASAN enabled
   reserving not enough size (Peter Collingbourne)

- Fix BPF verifier do_misc_fixups patching for inlining of the
   bpf_get_branch_snapshot BPF helper (Andrii Nakryiko)

- Fix a BPF verifier bug and reject BPF program write attempts into
   read-only marked BPF maps (Daniel Borkmann)

- Fix perf_event_detach_bpf_prog error handling by removing an invalid
   check which would skip BPF program release (Jiri Olsa)

- Fix memory leak when parsing mount options for the BPF filesystem
   (Hou Tao)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Check validity of link->type in bpf_link_show_fdinfo()
  bpf: Add the missing BPF_LINK_TYPE invocation for sockmap
  bpf: fix do_misc_fixups() for bpf_get_branch_snapshot()
  bpf,perf: Fix perf_event_detach_bpf_prog error handling
  selftests/bpf: Add test for passing in uninit mtu_len
  selftests/bpf: Add test for writes to .rodata
  bpf: Remove MEM_UNINIT from skb/xdp MTU helpers
  bpf: Fix overloading of MEM_UNINIT's meaning
  bpf: Add MEM_WRITE attribute
  bpf: Preserve param->string when parsing mount options
  bpf, arm64: Fix address emission with tag-based KASAN enabled

Merge tag 'net-6.12-rc5' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from netfiler, xfrm and bluetooth.

  Oddly this includes a fix for a posix clock regression; in our
  previous PR we included a change there as a pre-requisite for
  networking one. That fix proved to be buggy and requires the follow-up
  included here. Thomas suggested we should send it, given we sent the
  buggy patch.

  Current release - regressions:

   - posix-clock: Fix unbalanced locking in pc_clock_settime()

   - netfilter: fix typo causing some targets not to load on IPv6

  Current release - new code bugs:

   - xfrm: policy: remove last remnants of pernet inexact list

  Previous releases - regressions:

   - core: fix races in netdev_tx_sent_queue()/dev_watchdog()

   - bluetooth: fix UAF on sco_sock_timeout

   - eth: hv_netvsc: fix VF namespace also in synthetic NIC
     NETDEV_REGISTER event

   - eth: usbnet: fix name regression

   - eth: be2net: fix potential memory leak in be_xmit()

   - eth: plip: fix transmit path breakage

  Previous releases - always broken:

   - sched: deny mismatched skip_sw/skip_hw flags for actions created by
     classifiers

   - netfilter: bpf: must hold reference on net namespace

   - eth: virtio_net: fix integer overflow in stats

   - eth: bnxt_en: replace ptp_lock with irqsave variant

   - eth: octeon_ep: add SKB allocation failures handling in
     __octep_oq_process_rx()

  Misc:

   - MAINTAINERS: add Simon as an official reviewer"

* tag 'net-6.12-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (40 commits)
  net: dsa: mv88e6xxx: support 4000ps cycle counter period
  net: dsa: mv88e6xxx: read cycle counter period from hardware
  net: dsa: mv88e6xxx: group cycle counter coefficients
  net: usb: qmi_wwan: add Fibocom FG132 0x0112 composition
  hv_netvsc: Fix VF namespace also in synthetic NIC NETDEV_REGISTER event
  net: dsa: microchip: disable EEE for KSZ879x/KSZ877x/KSZ876x
  Bluetooth: ISO: Fix UAF on iso_sock_timeout
  Bluetooth: SCO: Fix UAF on sco_sock_timeout
  Bluetooth: hci_core: Disable works on hci_unregister_dev
  posix-clock: posix-clock: Fix unbalanced locking in pc_clock_settime()
  r8169: avoid unsolicited interrupts
  net: sched: use RCU read-side critical section in taprio_dump()
  net: sched: fix use-after-free in taprio_change()
  net/sched: act_api: deny mismatched skip_sw/skip_hw flags for actions created by classifiers
  net: usb: usbnet: fix name regression
  mlxsw: spectrum_router: fix xa_store() error checking
  virtio_net: fix integer overflow in stats
  net: fix races in netdev_tx_sent_queue()/dev_watchdog()
  net: wwan: fix global oob in wwan_rtnl_policy
  netfilter: xtables: fix typo causing some targets not to load on IPv6
  ...

Merge tag 'hid-for-linus-20241024' of git://git./linux/kernel/git/hid/hid

Pull HID fixes from Jiri Kosina:
"Device-specific functionality quirks for Thinkpad X1 Gen3, Logitech
  Bolt and some Goodix touchpads (Bartłomiej Maryńczak, Hans de Goede
  and Kenneth Albanowski)"

* tag 'hid-for-linus-20241024' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: lenovo: Add support for Thinkpad X1 Tablet Gen 3 keyboard
  HID: multitouch: Add quirk for Logitech Bolt receiver w/ Casa touchpad
  HID: i2c-hid: Delayed i2c resume wakeup for 0x0d42 Goodix touchpad

Merge tag 'loongarch-fixes-6.12-1' of git://git./linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch fixes from Huacai Chen:
"Get correct cores_per_package for SMT systems, enable IRQ if do_ale()
  triggered in irq-enabled context, and fix some bugs about vDSO, memory
  managenent, hrtimer in KVM, etc"

* tag 'loongarch-fixes-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: KVM: Mark hrtimer to expire in hard interrupt context
  LoongArch: Make KASAN usable for variable cpu_vabits
  LoongArch: Set initial pte entry with PAGE_GLOBAL for kernel space
  LoongArch: Don't crash in stack_top() for tasks without vDSO
  LoongArch: Set correct size for vDSO code mapping
  LoongArch: Enable IRQ if do_ale() triggered in irq-enabled context
  LoongArch: Get correct cores_per_package for SMT systems
  LoongArch: Use "Exception return address" to comment ERA

Merge tag 'probes-fixes-v6.12-rc4.2' of git://git./linux/kernel/git/trace/linux-trace

Pull probes fixes from Masami Hiramatsu:

- objpool: Fix choosing allocation for percpu slots

   Fixes to allocate objpool's percpu slots correctly according to the
   GFP flag. It checks whether "any bit" in GFP_ATOMIC is set to choose
   the vmalloc source, but it should check "all bits" in GFP_ATOMIC flag
   is set, because GFP_ATOMIC is a combined flag.

- tracing/probes: Fix MAX_TRACE_ARGS limit handling

   If more than MAX_TRACE_ARGS are passed for creating a probe event,
   the entries over MAX_TRACE_ARG in trace_arg array are not
   initialized. Thus if the kernel accesses those entries, it crashes.
   This rejects creating event if the number of arguments is over
   MAX_TRACE_ARGS.

- tracing: Consider the NUL character when validating the event length

   A strlen() is used when parsing the event name, and the original code
   does not consider the terminal null byte. Thus it can pass the name
   one byte longer than the buffer. This fixes to check it correctly.

* tag 'probes-fixes-v6.12-rc4.2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Consider the NULL character when validating the event length
  tracing/probes: Fix MAX_TRACE_ARGS limit handling
  objpool: fix choosing allocation for percpu slots

Merge tag 'for-6.12-rc4-tag' of git://git./linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

- mount option fixes:
     - fix handling of compression mount options on remount
     - reject rw remount in case there are options that don't work
       in read-write mode (like rescue options)

- fix zone accounting of unusable space

- fix in-memory corruption when merging extent maps

- fix delalloc range locking for sector < page

- use more convenient default value of drop subtree threshold, clean
   more subvolumes without the fallback to marking quotas inconsistent

- fix smatch warning about incorrect value passed to ERR_PTR

* tag 'for-6.12-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fix passing 0 to ERR_PTR in btrfs_search_dir_index_item()
  btrfs: reject ro->rw reconfiguration if there are hard ro requirements
  btrfs: fix read corruption due to race with extent map merging
  btrfs: fix the delalloc range locking if sector size < page size
  btrfs: qgroup: set a more sane default value for subtree drop threshold
  btrfs: clear force-compress on remount when compress mount option is given
  btrfs: zoned: fix zone unusable accounting for freed reserved extent

Merge tag 'jfs-6.12-rc5' of github.com:kleikamp/linux-shaggy

Pull jfs fix from David Kleikamp:
"Fix a regression introduced in 6.12-rc1"

* tag 'jfs-6.12-rc5' of github.com:kleikamp/linux-shaggy:
jfs: Fix sanity check in dbMount

Merge tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs

Pull bcachefs fixes from Kent Overstreet:
"Lots of hotfixes:

   - transaction restart injection has been shaking out a few things

   - fix a data corruption in the buffered write path on -ENOSPC, found
     by xfstests generic/299

   - Some small show_options fixes

   - Repair mismatches in inode hash type, seed: different snapshot
     versions of an inode must have the same hash/type seed, used for
     directory entries and xattrs. We were checking the hash seed, but
     not the type, and a user contributed a filesystem where the hash
     type on one inode had somehow been flipped; these fixes allow his
     filesystem to repair.

     Additionally, the hash type flip made some directory entries
     invisible, which were then recreated by userspace; so the hash
     check code now checks for duplicate non dangling dirents, and
     renames one of them if necessary.

   - Don't use wait_event_interruptible() in recovery: this fixes some
     filesystems failing to mount with -ERESTARTSYS

   - Workaround for kvmalloc not supporting > INT_MAX allocations,
     causing an -ENOMEM when allocating the sorted array of journal
     keys: this allows a 75 TB filesystem to mount

   - Make sure bch_inode_unpacked.bi_snapshot is set in the old inode
     compat path: this alllows Marcin's filesystem (in use since before
     6.7) to repair and mount"

* tag 'bcachefs-2024-10-22' of https://github.com/koverstreet/bcachefs: (26 commits)
  bcachefs: Set bch_inode_unpacked.bi_snapshot in old inode path
  bcachefs: Mark more errors as AUTOFIX
  bcachefs: Workaround for kvmalloc() not supporting > INT_MAX allocations
  bcachefs: Don't use wait_event_interruptible() in recovery
  bcachefs: Fix __bch2_fsck_err() warning
  bcachefs: fsck: Improve hash_check_key()
  bcachefs: bch2_hash_set_or_get_in_snapshot()
  bcachefs: Repair mismatches in inode hash seed, type
  bcachefs: Add hash seed, type to inode_to_text()
  bcachefs: INODE_STR_HASH() for bch_inode_unpacked
  bcachefs: Run in-kernel offline fsck without ratelimit errors
  bcachefs: skip mount option handle for empty string.
  bcachefs: fix incorrect show_options results
  bcachefs: Fix data corruption on -ENOSPC in buffered write path
  bcachefs: bch2_folio_reservation_get_partial() is now better behaved
  bcachefs: fix disk reservation accounting in bch2_folio_reservation_get()
  bcachefS: ec: fix data type on stripe deletion
  bcachefs: Don't use commit_do() unnecessarily
  bcachefs: handle restarts in bch2_bucket_io_time_reset()
  bcachefs: fix restart handling in __bch2_resume_logged_op_finsert()
  ...

Revert "9p: Enable multipage folios"

This reverts commit 1325e4a91a405f88f1b18626904d37860a4f9069.

using multipage folios apparently break some madvise operations like
MADV_PAGEOUT which do not reliably unload the specified page anymore,

Revert the patch until that is figured out.

Reported-by: Andrii Nakryiko <andrii@kernel.org>
Fixes: 1325e4a91a40 ("9p: Enable multipage folios")
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge branch 'share-user-memory-to-bpf-program-through-task-storage-map'

Martin KaFai Lau says:

====================
Share user memory to BPF program through task storage map

From: Martin KaFai Lau <martin.lau@kernel.org>

It is the v6 of this series. Starting from v5, it is a continuation work
of the RFC v4.

Changes in v6:
1. In patch 1, reject t->size == 0 in btf_check_and_fixup_fields.
   Reject a uptr pointing to an empty struct.

   A test is added to patch 12 to test this case.

2. In patch 6, when checking if the uptr struct spans across
   pages, there was an off by one error in calculating the "end" such
   that the uptr will be rejected by error if the object is located
   exactly at the end of a page.

   This is fixed by adding t->size "- 1" to "start".

   A test is added to patch 9 to test this case.

3. In patch 6, check for PageHighMem(page) and return -EOPNOTSUPP.
   The 32 bit arch jit is missing other crucial bpf features (e.g. kfunc).
   Patch 6 commit message has been updated to include this change.

4. The selftests are cleaned up such that  "struct user_data *dummy_data"
   global ptr is used instead of the whole "struct user_data  dummy_data"
   object. Still a hack to avoid generating fwd btf type for the
   uptr struct but somewhat lighter than a full blown global object.

Changes in v5:
1. The original patch 1 and patch 2 are combined.
2. Patch 3, 4, and 5 are new. They get the bpf_local_storage
   ready to handle the __uptr in the map_value.
3. Patch 6 is mostly new, so I reset the sob.
4. There are some changes in the carry over patch 1 and 2 also. They
   are mentioned at the individual patch.
5. More tests are added.

The following is the original cover letter and the earlier change log.
The bpf prog example has been removed. Please find a similar
example in the selftests task_ls_uptr.c.

~~~~~~~~

Some of BPF schedulers (sched_ext) need hints from user programs to do
a better job. For example, a scheduler can handle a task in a
different way if it knows a task is doing GC. So, we need an efficient
way to share the information between user programs and BPF
programs. Sharing memory between user programs and BPF programs is
what this patchset does.

== REQUIREMENT ==

This patchset enables every task in every process to share a small
chunk of memory of it's own with a BPF scheduler. So, they can update
the hints without expensive overhead of syscalls. It also wants every
task sees only the data/memory belong to the task/or the task's
process.

== DESIGN ==

This patchset enables BPF prorams to embed __uptr; uptr in the values
of task storage maps. A uptr field can only be set by user programs by
updating map element value through a syscall. A uptr points to a block
of memory allocated by the user program updating the element
value. The memory will be pinned to ensure it staying in the core
memory and to avoid a page fault when the BPF program accesses it.

Please see the selftests task_ls_uptr.c for an example.

== MEMORY ==

In order to use memory efficiently, we don't want to pin a large
number of pages. To archieve that, user programs should collect the
memory blocks pointed by uptrs together to share memory pages if
possible. It avoid the situation that pin one page for each thread in
a process.  Instead, we can have several threads pointing their uptrs
to the same page but with different offsets.

Although it is not necessary, avoiding the memory pointed by an uptr
crossing the boundary of a page can prevent an additional mapping in
the kernel address space.

== RESTRICT ==

The memory pointed by a uptr should reside in one memory
page. Crossing multi-pages is not supported at the moment.

Only task storage map have been supported at the moment.

The values of uptrs can only be updated by user programs through
syscalls.

bpf_map_lookup_elem() from userspace returns zeroed values for uptrs
to prevent leaking information of the kernel.
---

Changes from v3:

- Merge part 4 and 5 as the new part 4 in order to cease the warning
    of unused functions from CI.

Changes from v1:

- Rename BPF_KPTR_USER to BPF_UPTR.

- Restrict uptr to one page.

- Mark uptr with PTR_TO_MEM | PTR_MAY_BE_NULL and with the size of
    the target type.

- Move uptr away from bpf_obj_memcpy() by introducing
    bpf_obj_uptrcpy() and copy_map_uptr_locked().

- Remove the BPF_FROM_USER flag.

- Align the meory pointed by an uptr in the test case. Remove the
    uptr of mmapped memory.

Kui-Feng Lee (4):
  bpf: Support __uptr type tag in BTF
  bpf: Handle BPF_UPTR in verifier
  libbpf: define __uptr.
  selftests/bpf: Some basic __uptr tests
====================

Link: https://lore.kernel.org/r/20241023234759.860539-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Create task_local_storage map with invalid uptr's struct

This patch tests the map creation failure when the map_value
has unsupported uptr. The three cases are the struct is larger
than one page, the struct is empty, and the struct is a kernel struct.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-13-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add uptr failure verifier tests

Add verifier tests to ensure invalid uptr usages are rejected.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-12-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add update_elem failure test for task storage uptr

This patch test the following failures in syscall update_elem
1. The first update_elem(BPF_F_LOCK) should be EOPNOTSUPP. syscall.c takes
   care of unpinning the uptr.
2. The second update_elem(BPF_EXIST) fails. syscall.c takes care of
   unpinning the uptr.
3. The forth update_elem(BPF_NOEXIST) fails. bpf_local_storage_update
   takes care of unpinning the uptr.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-11-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Test a uptr struct spanning across pages.

This patch tests the case when uptr has a struct spanning across two
pages. It is not supported now and EOPNOTSUPP is expected from the
syscall update_elem.

It also tests the whole uptr struct located exactly at the
end of a page and ensures that this case is accepted by update_elem.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-10-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Some basic __uptr tests

Make sure the memory of uptrs have been mapped to the kernel properly.
Also ensure the values of uptrs in the kernel haven't been copied
to userspace.

It also has the syscall update_elem/delete_elem test to test the
pin/unpin code paths.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-9-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: define __uptr.

Make __uptr available to BPF programs to enable them to define uptrs.

Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-8-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add uptr support in the map_value of the task local storage.

This patch adds uptr support in the map_value of the task local storage.

struct map_value {
struct user_data __uptr *uptr;
};

struct {
__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
__uint(map_flags, BPF_F_NO_PREALLOC);
__type(key, int);
__type(value, struct value_type);
} datamap SEC(".maps");

A new bpf_obj_pin_uptrs() is added to pin the user page and
also stores the kernel address back to the uptr for the
bpf prog to use later. It currently does not support
the uptr pointing to a user struct across two pages.
It also excludes PageHighMem support to keep it simple.
As of now, the 32bit bpf jit is missing other more crucial bpf
features. For example, many important bpf features depend on
bpf kfunc now but so far only one arch (x86-32) supports it
which was added by me as an example when kfunc was first
introduced to bpf.

The uptr can only be stored to the task local storage by the
syscall update_elem. Meaning the uptr will not be considered
if it is provided by the bpf prog through
bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE).
This is enforced by only calling
bpf_local_storage_update(swap_uptrs==true) in
bpf_pid_task_storage_update_elem. Everywhere else will
have swap_uptrs==false.

This will pump down to bpf_selem_alloc(swap_uptrs==true). It is
the only case that bpf_selem_alloc() will take the uptr value when
updating the newly allocated selem. bpf_obj_swap_uptrs() is added
to swap the uptr between the SDATA(selem)->data and the user provided
map_value in "void *value". bpf_obj_swap_uptrs() makes the
SDATA(selem)->data takes the ownership of the uptr and the user space
provided map_value will have NULL in the uptr.

The bpf_obj_unpin_uptrs() is called after map->ops->map_update_elem()
returning error. If the map->ops->map_update_elem has reached
a state that the local storage has taken the uptr ownership,
the bpf_obj_unpin_uptrs() will be a no op because the uptr
is NULL. A "__"bpf_obj_unpin_uptrs is added to make this
error path unpin easier such that it does not have to check
the map->record is NULL or not.

BPF_F_LOCK is not supported when the map_value has uptr.
This can be revisited later if there is a use case. A similar
swap_uptrs idea can be considered.

The final bit is to do unpin_user_page in the bpf_obj_free_fields().
The earlier patch has ensured that the bpf_obj_free_fields() has
gone through the rcu gp when needed.

Cc: linux-mm@kvack.org
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Link: https://lore.kernel.org/r/20241023234759.860539-7-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Postpone bpf_obj_free_fields to the rcu callback

A later patch will enable the uptr usage in the task_local_storage map.
This will require the unpin_user_page() to be done after the rcu
task trace gp for the cases that the uptr may still be used by
a bpf prog. The bpf_obj_free_fields() will be the one doing
unpin_user_page(), so this patch is to postpone calling
bpf_obj_free_fields() to the rcu callback.

The bpf_obj_free_fields() is only required to be done in
the rcu callback when bpf->bpf_ma==true and reuse_now==false.

bpf->bpf_ma==true case is because uptr will only be enabled
in task storage which has already been moved to bpf_mem_alloc.
The bpf->bpf_ma==false case can be supported in the future
also if there is a need.

reuse_now==false when the selem (aka storage) is deleted
by bpf prog (bpf_task_storage_delete) or by syscall delete_elem().
In both cases, bpf_obj_free_fields() needs to wait for
rcu gp.

A few words on reuse_now==true. reuse_now==true when the
storage's owner (i.e. the task_struct) is destructing or the map
itself is doing map_free(). In both cases, no bpf prog should
have a hold on the selem and its uptrs, so there is no need to
postpone bpf_obj_free_fields(). reuse_now==true should be the
common case for local storage usage where the storage exists
throughout the lifetime of its owner (task_struct).

The bpf_obj_free_fields() needs to use the map->record. Doing
bpf_obj_free_fields() in a rcu callback will require the
bpf_local_storage_map_free() to wait for rcu_barrier. An optimization
could be only waiting for rcu_barrier when the map has uptr in
its map_value. This will require either yet another rcu callback
function or adding a bool in the selem to flag if the SDATA(selem)->smap
is still valid. This patch chooses to keep it simple and wait for
rcu_barrier for maps that use bpf_mem_alloc.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-6-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Postpone bpf_selem_free() in bpf_selem_unlink_storage_nolock()

In a later patch, bpf_selem_free() will call unpin_user_page()
through bpf_obj_free_fields(). unpin_user_page() may take spin_lock.
However, some bpf_selem_free() call paths have held a raw_spin_lock.
Like this:

raw_spin_lock_irqsave()
  bpf_selem_unlink_storage_nolock()
    bpf_selem_free()
      unpin_user_page()
        spin_lock()

To avoid spinlock nested in raw_spinlock, bpf_selem_free() should be
done after releasing the raw_spinlock. The "bool reuse_now" arg is
replaced with "struct hlist_head *free_selem_list" in
bpf_selem_unlink_storage_nolock(). The bpf_selem_unlink_storage_nolock()
will append the to-be-free selem at the free_selem_list. The caller of
bpf_selem_unlink_storage_nolock() will need to call the new
bpf_selem_free_list(free_selem_list, reuse_now) to free the selem
after releasing the raw_spinlock.

Note that the selem->snode cannot be reused for linking to
the free_selem_list because the selem->snode is protected by the
raw_spinlock that we want to avoid holding. A new
"struct hlist_node free_node;" is union-ized with
the rcu_head. Only the first one successfully
hlist_del_init_rcu(&selem->snode) will be able
to use the free_node. After succeeding hlist_del_init_rcu(&selem->snode),
the free_node and rcu_head usage is serialized such that they
can share the 16 bytes in a union.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-5-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Add "bool swap_uptrs" arg to bpf_local_storage_update() and bpf_selem_alloc()

In a later patch, the task local storage will only accept uptr
from the syscall update_elem and will not accept uptr from
the bpf prog. The reason is the bpf prog does not have a way
to provide a valid user space address.

bpf_local_storage_update() and bpf_selem_alloc() are used by
both bpf prog bpf_task_storage_get(BPF_LOCAL_STORAGE_GET_F_CREATE)
and bpf syscall update_elem. "bool swap_uptrs" arg is added
to bpf_local_storage_update() and bpf_selem_alloc() to tell if
it is called by the bpf prog or by the bpf syscall. When
swap_uptrs==true, it is called by the syscall.

The arg is named (swap_)uptrs because the later patch will swap
the uptrs between the newly allocated selem and the user space
provided map_value. It will make error handling easier in case
map->ops->map_update_elem() fails and the caller can decide
if it needs to unpin the uptr in the user space provided
map_value or the bpf_local_storage_update() has already
taken the uptr ownership and will take care of unpinning it also.

Only swap_uptrs==false is passed now. The logic to handle
the true case will be added in a later patch.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-4-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Handle BPF_UPTR in verifier

This patch adds BPF_UPTR support to the verifier. Not that only the
map_value will support the "__uptr" type tag.

This patch enforces only BPF_LDX is allowed to the value of an uptr.
After BPF_LDX, it will mark the dst_reg as PTR_TO_MEM | PTR_MAYBE_NULL
with size deduced from the field.kptr.btf_id. This will make the
dst_reg pointed memory to be readable and writable as scalar.

There is a redundant "val_reg = reg_state(env, value_regno);" statement
in the check_map_kptr_access(). This patch takes this chance to remove
it also.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Support __uptr type tag in BTF

This patch introduces the "__uptr" type tag to BTF. It is to define
a pointer pointing to the user space memory. This patch adds BTF
logic to pass the "__uptr" type tag.

btf_find_kptr() is reused for the "__uptr" tag. The "__uptr" will only
be supported in the map_value of the task storage map. However,
btf_parse_struct_meta() also uses btf_find_kptr() but it is not
interested in "__uptr". This patch adds a "field_mask" argument
to btf_find_kptr() which will return BTF_FIELD_IGNORE if the
caller is not interested in a “__uptr” field.

btf_parse_kptr() is also reused to parse the uptr.
The btf_check_and_fixup_fields() is changed to do extra
checks on the uptr to ensure that its struct size is not larger
than PAGE_SIZE. It is not clear how a uptr pointing to a CO-RE
supported kernel struct will be used, so it is also not allowed now.

Signed-off-by: Kui-Feng Lee <thinker.li@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241023234759.860539-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'add-the-missing-bpf_link_type-invocation-for-sockmap'

Hou Tao says:

====================
Add the missing BPF_LINK_TYPE invocation for sockmap

From: Hou Tao <houtao1@huawei.com>

Hi,

The tiny patch set fixes the out-of-bound read problem when reading the
fdinfo of sock map link fd. And in order to spot such omission early for
the newly-added link type in the future, it also checks the validity of
the link->type and adds a WARN_ONCE() for missed invocation.

Please see individual patches for more details. And comments are always
welcome.

v3:
* patch #2: check and warn the validity of link->type instead of
adding a static assertion for bpf_link_type_strs array.

v2: http://lore.kernel.org/bpf/d49fa2f4-f743-c763-7579-c3cab4dd88cb@huaweicloud.com
====================

Link: https://lore.kernel.org/r/20241024013558.1135167-1-houtao@huaweicloud.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

bpf: Check validity of link->type in bpf_link_show_fdinfo()

If a newly-added link type doesn't invoke BPF_LINK_TYPE(), accessing
bpf_link_type_strs[link->type] may result in an out-of-bounds access.

To spot such missed invocations early in the future, checking the
validity of link->type in bpf_link_show_fdinfo() and emitting a warning
when such invocations are missed.

Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241024013558.1135167-3-houtao@huaweicloud.com

bpf: Add the missing BPF_LINK_TYPE invocation for sockmap

There is an out-of-bounds read in bpf_link_show_fdinfo() for the sockmap
link fd. Fix it by adding the missing BPF_LINK_TYPE invocation for
sockmap link

Also add comments for bpf_link_type to prevent missing updates in the
future.

Fixes: 699c23f02c65 ("bpf: Add bpf_link support for sk_msg and sk_skb progs")
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241024013558.1135167-2-houtao@huaweicloud.com

Merge branch 'net-dsa-mv88e6xxx-fix-mv88e6393x-phc-frequency-on-internal-clock'

Shenghao Yang says:

====================
net: dsa: mv88e6xxx: fix MV88E6393X PHC frequency on internal clock

The MV88E6393X family of switches can additionally run their cycle
counters using a 250MHz internal clock instead of the usual 125MHz
external clock [1].

The driver currently assumes all designs utilize that external clock,
but MikroTik's RB5009 uses the internal source - causing the PHC to be
seen running at 2x real time in userspace, making synchronization
with ptp4l impossible.

This series adds support for reading off the cycle counter frequency
known to the hardware in the TAI_CLOCK_PERIOD register and picking an
appropriate set of scaling coefficients instead of using a fixed set
for each switch family.

Patch 1 groups those cycle counter coefficients into a new structure to
make it easier to pass them around.

Patch 2 modifies PTP initialization to probe TAI_CLOCK_PERIOD and
use an appropriate set of coefficients.

Patch 3 adds support for 4000ps cycle counter periods.

Changes since v2 [2]:

- Patch 1: "net: dsa: mv88e6xxx: group cycle counter coefficients"
  - Moved declaration of mv88e6xxx_cc_coeffs to avoid moving that in
    Patch 2.

- Patch 2: "net: dsa: mv88e6xxx: read cycle counter period from hardware"
  - Removed move of mv88e6xxx_cc_coeffs declaration.

- Patch 3: "net: dsa: mv88e6xxx: support 4000ps cycle counter periods"
  - No change.

[1] https://lore.kernel.org/netdev/d6622575-bf1b-445a-b08f-2739e3642aae@lunn.ch/
[2] https://lore.kernel.org/netdev/20241006145951.719162-1-me@shenghaoyang.info/
====================

Link: https://patch.msgid.link/20241020063833.5425-1-me@shenghaoyang.info
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: dsa: mv88e6xxx: support 4000ps cycle counter period

The MV88E6393X family of devices can run its cycle counter off
an internal 250MHz clock instead of an external 125MHz one.

Add support for this cycle counter period by adding another set
of coefficients and lowering the periodic cycle counter read interval
to compensate for faster overflows at the increased frequency.

Otherwise, the PHC runs at 2x real time in userspace and cannot be
synchronized.

Fixes: de776d0d316f ("net: dsa: mv88e6xxx: add support for mv88e6393x family")
Signed-off-by: Shenghao Yang <me@shenghaoyang.info>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: dsa: mv88e6xxx: read cycle counter period from hardware

Instead of relying on a fixed mapping of hardware family to cycle
counter frequency, pull this information from the
MV88E6XXX_TAI_CLOCK_PERIOD register.

This lets us support switches whose cycle counter frequencies depend on
board design.

Fixes: de776d0d316f ("net: dsa: mv88e6xxx: add support for mv88e6393x family")
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Shenghao Yang <me@shenghaoyang.info>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: dsa: mv88e6xxx: group cycle counter coefficients

Instead of having them as individual fields in ptp_ops, wrap the
coefficients in a separate struct so they can be referenced together.

Fixes: de776d0d316f ("net: dsa: mv88e6xxx: add support for mv88e6393x family")
Signed-off-by: Shenghao Yang <me@shenghaoyang.info>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: usb: qmi_wwan: add Fibocom FG132 0x0112 composition

Add Fibocom FG132 0x0112 composition:

T:  Bus=03 Lev=02 Prnt=06 Port=01 Cnt=02 Dev#= 10 Spd=12   MxCh= 0
D:  Ver= 2.01 Cls=00(>ifc ) Sub=00 Prot=00 MxPS=64 #Cfgs=  1
P:  Vendor=2cb7 ProdID=0112 Rev= 5.15
S:  Manufacturer=Fibocom Wireless Inc.
S:  Product=Fibocom Module
S:  SerialNumber=xxxxxxxx
C:* #Ifs= 4 Cfg#= 1 Atr=a0 MxPwr=500mA
I:* If#= 0 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=50 Driver=qmi_wwan
E:  Ad=82(I) Atr=03(Int.) MxPS=   8 Ivl=32ms
E:  Ad=81(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=01(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:* If#= 1 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=ff Prot=30 Driver=option
E:  Ad=02(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=83(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:* If#= 2 Alt= 0 #EPs= 3 Cls=ff(vend.) Sub=ff Prot=40 Driver=option
E:  Ad=85(I) Atr=03(Int.) MxPS=  10 Ivl=32ms
E:  Ad=84(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=03(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:* If#= 3 Alt= 0 #EPs= 2 Cls=ff(vend.) Sub=00 Prot=00 Driver=option
E:  Ad=86(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=04(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms

Signed-off-by: Reinhard Speyerer <rspmn@arcor.de>
Link: https://patch.msgid.link/ZxLKp5YZDy-OM0-e@arcor.de
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

hv_netvsc: Fix VF namespace also in synthetic NIC NETDEV_REGISTER event

The existing code moves VF to the same namespace as the synthetic NIC
during netvsc_register_vf(). But, if the synthetic device is moved to a
new namespace after the VF registration, the VF won't be moved together.

To make the behavior more consistent, add a namespace check for synthetic
NIC's NETDEV_REGISTER event (generated during its move), and move the VF
if it is not in the same namespace.

Cc: stable@vger.kernel.org
Fixes: c0a41b887ce6 ("hv_netvsc: move VF to same namespace as netvsc device")
Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/1729275922-17595-1-git-send-email-haiyangz@microsoft.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: dsa: microchip: disable EEE for KSZ879x/KSZ877x/KSZ876x

The well-known errata regarding EEE not being functional on various KSZ
switches has been refactored a few times. Recently the refactoring has
excluded several switches that the errata should also apply to.

Disable EEE for additional switches with this errata and provide
additional comments referring to the public errata document.

The original workaround for the errata was applied with a register
write to manually disable the EEE feature in MMD 7:60 which was being
applied for KSZ9477/KSZ9897/KSZ9567 switch ID's.

Then came commit 26dd2974c5b5 ("net: phy: micrel: Move KSZ9477 errata
fixes to PHY driver") and commit 6068e6d7ba50 ("net: dsa: microchip:
remove KSZ9477 PHY errata handling") which moved the errata from the
switch driver to the PHY driver but only for PHY_ID_KSZ9477 (PHY ID)
however that PHY code was dead code because an entry was never added
for PHY_ID_KSZ9477 via MODULE_DEVICE_TABLE.

This was apparently realized much later and commit 54a4e5c16382 ("net:
phy: micrel: add Microchip KSZ 9477 to the device table") added the
PHY_ID_KSZ9477 to the PHY driver but as the errata was only being
applied to PHY_ID_KSZ9477 it's not completely clear what switches
that relates to.

Later commit 6149db4997f5 ("net: phy: micrel: fix KSZ9477 PHY issues
after suspend/resume") breaks this again for all but KSZ9897 by only
applying the errata for that PHY ID.

Following that this was affected with commit 08c6d8bae48c("net: phy:
Provide Module 4 KSZ9477 errata (DS80000754C)") which removes
the blatant register write to MMD 7:60 and replaces it by
setting phydev->eee_broken_modes = -1 so that the generic phy-c45 code
disables EEE but this is only done for the KSZ9477_CHIP_ID (Switch ID).

Lastly commit 0411f73c13af ("net: dsa: microchip: disable EEE for
KSZ8567/KSZ9567/KSZ9896/KSZ9897.") adds some additional switches
that were missing to the errata due to the previous changes.

This commit adds an additional set of switches.

Fixes: 0411f73c13af ("net: dsa: microchip: disable EEE for KSZ8567/KSZ9567/KSZ9896/KSZ9897.")
Signed-off-by: Tim Harvey <tharvey@gateworks.com>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20241018160658.781564-1-tharvey@gateworks.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'for-net-2024-10-23' of git://git./linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

- hci_core: Disable works on hci_unregister_dev
- SCO: Fix UAF on sco_sock_timeout
- ISO: Fix UAF on iso_sock_timeout

* tag 'for-net-2024-10-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: ISO: Fix UAF on iso_sock_timeout
  Bluetooth: SCO: Fix UAF on sco_sock_timeout
  Bluetooth: hci_core: Disable works on hci_unregister_dev
====================

Link: https://patch.msgid.link/20241023143005.2297694-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

Merge tag 'ipsec-2024-10-22' of git://git./linux/kernel/git/klassert/ipsec

Steffen Klassert says:

====================
pull request (net): ipsec 2024-10-22

1) Fix routing behavior that relies on L4 information
   for xfrm encapsulated packets.
   From Eyal Birger.

2) Remove leftovers of pernet policy_inexact lists.
   From Florian Westphal.

3) Validate new SA's prefixlen when the selector family is
   not set from userspace.
   From Sabrina Dubroca.

4) Fix a kernel-infoleak when dumping an auth algorithm.
   From Petr Vaganov.

Please pull or let me know if there are problems.

ipsec-2024-10-22

* tag 'ipsec-2024-10-22' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
  xfrm: fix one more kernel-infoleak in algo dumping
  xfrm: validate new SA's prefixlen using SA family when sel.family is unset
  xfrm: policy: remove last remnants of pernet inexact list
  xfrm: respect ip protocols rules criteria when performing dst lookups
  xfrm: extract dst lookup parameters into a struct
====================

Link: https://patch.msgid.link/20241022092226.654370-1-steffen.klassert@secunet.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

bpf: fix do_misc_fixups() for bpf_get_branch_snapshot()

We need `goto next_insn;` at the end of patching instead of `continue;`.
It currently works by accident by making verifier re-process patched
instructions.

Reported-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Fixes: 314a53623cd4 ("bpf: inline bpf_get_branch_snapshot() helper")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Yonghong Song <yonghong.song@linux.dev>
Acked-by: Shung-Hsi Yu <shung-hsi.yu@suse.com>
Link: https://lore.kernel.org/r/20241023161916.2896274-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'fix-libbpf-s-bpf_object-and-bpf-subskel-interoperability'

Andrii Nakryiko says:

====================
Fix libbpf's bpf_object and BPF subskel interoperability

Fix libbpf's global data map mmap()'ing logic to make BPF objects loaded
through generic bpf_object__load() API interoperable with BPF subskeleton
instantiated from such BPF object. The issue is in re-mmap()'ing of global
data maps after BPF object is loaded into kernel, which is currently done in
BPF skeleton-specific code, and should instead be done in generic and common
bpf_object_load() logic.

See patch #2 for the fix, patch #3 for the selftests. Patch #1 is preliminary
fix for existing spin_lock selftests which currently works by accident.
====================

Link: https://lore.kernel.org/r/20241023043908.3834423-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: validate generic bpf_object and subskel APIs work together

Add a new subtest validating that bpf_object loaded and initialized
through generic APIs is still interoperable with BPF subskeleton,
including initialization and reading of global variables.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241023043908.3834423-4-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

libbpf: move global data mmap()'ing into bpf_object__load()

Since BPF skeleton inception libbpf has been doing mmap()'ing of global
data ARRAY maps in bpf_object__load_skeleton() API, which is used by
code generated .skel.h files (i.e., by BPF skeletons only).

This is wrong because if BPF object is loaded through generic
bpf_object__load() API, global data maps won't be re-mmap()'ed after
load step, and memory pointers returned from bpf_map__initial_value()
would be wrong and won't reflect the actual memory shared between BPF
program and user space.

bpf_map__initial_value() return result is rarely used after load, so
this went unnoticed for a really long time, until bpftrace project
attempted to load BPF object through generic bpf_object__load() API and
then used BPF subskeleton instantiated from such bpf_object. It turned
out that .data/.rodata/.bss data updates through such subskeleton was
"blackholed", all because libbpf wouldn't re-mmap() those maps during
bpf_object__load() phase.

Long story short, this step should be done by libbpf regardless of BPF
skeleton usage, right after BPF map is created in the kernel. This patch
moves this functionality into bpf_object__populate_internal_map() to
achieve this. And bpf_object__load_skeleton() is now simple and almost
trivial, only propagating these mmap()'ed pointers into user-supplied
skeleton structs.

We also do trivial adjustments to error reporting inside
bpf_object__populate_internal_map() for consistency with the rest of
libbpf's map-handling code.

Reported-by: Alastair Robertson <ajor@meta.com>
Reported-by: Jonathan Wiepert <jwiepert@meta.com>
Fixes: d66562fba1ce ("libbpf: Add BPF object skeleton support")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241023043908.3834423-3-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: fix test_spin_lock_fail.c's global vars usage

Global variables of special types (like `struct bpf_spin_lock`) make
underlying ARRAY maps non-mmapable. To make this work with libbpf's
mmaping logic, application is expected to declare such special variables
as static, so libbpf doesn't even attempt to mmap() such ARRAYs.

test_spin_lock_fail.c didn't follow this rule, but given it relied on
this test to trigger failures, this went unnoticed, as we never got to
the step of mmap()'ing these ARRAY maps.

It is fragile and relies on specific sequence of libbpf steps, which are
an internal implementation details.

Fix the test by marking lockA and lockB as static.

Fixes: c48748aea4f8 ("selftests/bpf: Add failure test cases for spin lock pairing")
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241023043908.3834423-2-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

Merge branch 'fix-wmaybe-uninitialized-warnings-errors'

Eder Zulian says:

====================
Fix -Wmaybe-uninitialized warnings/errors

Hello!

This v2 series initializes the variables 'set' and 'set8' in sets_patch to
NULL, along with the variables 'new_off' and 'pad_bits' and 'pad_type' in
btf_dump_emit_bit_padding to zero or NULL according to their types and the
variable 'o' in options__order to NULL to prevent compiler warnings/errors
which are observed when compiling with non-default compilation options, but
are not emitted by the compiler with the current default compilation
options.

- tools/bpf/resolve_btfids/main.c: Initialize the variables 'set' and
  'set8' in sets_patch to NULL.

- tools/lib/bpf/btf_dump.c: Initialize the variables 'new_off' and
  'pad_bits' and 'pad_type' in btf_dump_emit_bit_padding to zero/NULL

- tools/lib/subcmd/parse-options.c: Initialize the variable 'o' in
  options__order to NULL.
  Sam James mentioned that Michael Weiß had previously sent an alternative
  patch as
  https://lore.kernel.org/all/20240731085217.94928-1-michael.weiss@aisec.fraunhofer.de/

Tested on x86_64 with clang version 17.0.6 and gcc (GCC) 13.3.1.

  $ for c in gcc clang; do for o in fast g s z $(seq 0 3); do make -C \
  tools/bpf/resolve_btfids/ HOST_CC=${c} "HOSTCFLAGS=-O${o} -Wall" \
  clean all 2>&1 | tee ${c}-O${o}.out; done; done && \
  grep 'warning:\|error:' *.out

  [...]
  clang-O1.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized]
  clang-O1.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized]
  clang-O2.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized]
  clang-O2.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized]
  clang-O3.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized]
  clang-O3.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized]
  clang-Ofast.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized]
  clang-Ofast.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized]
  clang-Og.out:btf_dump.c:903:42: error: ‘new_off’ may be used uninitialized [-Werror=maybe-uninitialized]
  clang-Og.out:btf_dump.c:917:25: error: ‘pad_type’ may be used uninitialized [-Werror=maybe-uninitialized]
  clang-Og.out:btf_dump.c:930:20: error: ‘pad_bits’ may be used uninitialized [-Werror=maybe-uninitialized]
  clang-Os.out:parse-options.c:832:9: error: ‘o’ may be used uninitialized [-Werror=maybe-uninitialized]
  clang-Oz.out:parse-options.c:832:9: error: ‘o’ may be used uninitialized [-Werror=maybe-uninitialized]
  gcc-O1.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized]
  gcc-O1.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized]
  gcc-O2.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized]
  gcc-O2.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized]
  gcc-O3.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized]
  gcc-O3.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized]
  gcc-Ofast.out:main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized]
  gcc-Ofast.out:main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized]
  gcc-Og.out:btf_dump.c:903:42: error: ‘new_off’ may be used uninitialized [-Werror=maybe-uninitialized]
  gcc-Og.out:btf_dump.c:917:25: error: ‘pad_type’ may be used uninitialized [-Werror=maybe-uninitialized]
  gcc-Og.out:btf_dump.c:930:20: error: ‘pad_bits’ may be used uninitialized [-Werror=maybe-uninitialized]
  gcc-Os.out:parse-options.c:832:9: error: ‘o’ may be used uninitialized [-Werror=maybe-uninitialized]
  gcc-Oz.out:parse-options.c:832:9: error: ‘o’ may be used uninitialized [-Werror=maybe-uninitialized]

The above warnings and/or errors are fixed. However, they are observed with
current default compilation options.

Updates since v1:

- Incorporate feedback from reviewers. Add a comment about an alternative
  patch for parse-options.c sent before (based on comments from Sam James.)
  Split in multiple patches creating this series and a typo was fixed
  "Initiazlide" -> "Initialize" (suggested by Viktor Malik). State more
  clearly that the -Wmaybe-uninitialized issues only happen when compiling
  with non-default compilation options (based on comments from Yonghong
  Song.)

Thanks,
====================

Link: https://lore.kernel.org/r/20241022172329.3871958-1-ezulian@redhat.com
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

libsubcmd: Silence compiler warning

Initialize the pointer 'o' in options__order to NULL to prevent a
compiler warning/error which is observed when compiling with the '-Og'
option, but is not emitted by the compiler with the current default
compilation options.

For example, when compiling libsubcmd with

$ make "EXTRA_CFLAGS=-Og" -C tools/lib/subcmd/ clean all

Clang version 17.0.6 and GCC 13.3.1 fail to compile parse-options.c due
to following error:

  parse-options.c: In function ‘options__order’:
  parse-options.c:832:9: error: ‘o’ may be used uninitialized [-Werror=maybe-uninitialized]
    832 |         memcpy(&ordered[nr_opts], o, sizeof(*o));
        |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  parse-options.c:810:30: note: ‘o’ was declared here
    810 |         const struct option *o, *p = opts;
        |                              ^
  cc1: all warnings being treated as errors

Signed-off-by: Eder Zulian <ezulian@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20241022172329.3871958-4-ezulian@redhat.com

libbpf: Prevent compiler warnings/errors

Initialize 'new_off' and 'pad_bits' to 0 and 'pad_type' to  NULL in
btf_dump_emit_bit_padding to prevent compiler warnings/errors which are
observed when compiling with 'EXTRA_CFLAGS=-g -Og' options, but do not
happen when compiling with current default options.

For example, when compiling libbpf with

  $ make "EXTRA_CFLAGS=-g -Og" -C tools/lib/bpf/ clean all

Clang version 17.0.6 and GCC 13.3.1 fail to compile btf_dump.c due to
following errors:

  btf_dump.c: In function ‘btf_dump_emit_bit_padding’:
  btf_dump.c:903:42: error: ‘new_off’ may be used uninitialized [-Werror=maybe-uninitialized]
    903 |         if (new_off > cur_off && new_off <= next_off) {
        |                                  ~~~~~~~~^~~~~~~~~~~
  btf_dump.c:870:13: note: ‘new_off’ was declared here
    870 |         int new_off, pad_bits, bits, i;
        |             ^~~~~~~
  btf_dump.c:917:25: error: ‘pad_type’ may be used uninitialized [-Werror=maybe-uninitialized]
    917 |                         btf_dump_printf(d, "\n%s%s: %d;", pfx(lvl), pad_type,
        |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    918 |                                         in_bitfield ? new_off - cur_off : 0);
        |                                         ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  btf_dump.c:871:21: note: ‘pad_type’ was declared here
    871 |         const char *pad_type;
        |                     ^~~~~~~~
  btf_dump.c:930:20: error: ‘pad_bits’ may be used uninitialized [-Werror=maybe-uninitialized]
    930 |                 if (bits == pad_bits) {
        |                    ^
  btf_dump.c:870:22: note: ‘pad_bits’ was declared here
    870 |         int new_off, pad_bits, bits, i;
        |                      ^~~~~~~~
  cc1: all warnings being treated as errors

Signed-off-by: Eder Zulian <ezulian@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20241022172329.3871958-3-ezulian@redhat.com

resolve_btfids: Fix compiler warnings

Initialize 'set' and 'set8' pointers to NULL in sets_patch to prevent
possible compiler warnings which are issued for various optimization
levels, but do not happen when compiling with current default
compilation options.

For example, when compiling resolve_btfids with

  $ make "HOSTCFLAGS=-O2 -Wall" -C tools/bpf/resolve_btfids/ clean all

Clang version 17.0.6 and GCC 13.3.1 issue following
-Wmaybe-uninitialized warnings for variables 'set8' and 'set':

  In function ‘sets_patch’,
      inlined from ‘symbols_patch’ at main.c:748:6,
      inlined from ‘main’ at main.c:823:6:
  main.c:163:9: warning: ‘set8’ may be used uninitialized [-Wmaybe-uninitialized]
    163 |         eprintf(1, verbose, pr_fmt(fmt), ##__VA_ARGS__)
        |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  main.c:729:17: note: in expansion of macro ‘pr_debug’
    729 |                 pr_debug("sorting  addr %5lu: cnt %6d [%s]\n",
        |                 ^~~~~~~~
  main.c: In function ‘main’:
  main.c:682:37: note: ‘set8’ was declared here
    682 |                 struct btf_id_set8 *set8;
        |                                     ^~~~
  In function ‘sets_patch’,
      inlined from ‘symbols_patch’ at main.c:748:6,
      inlined from ‘main’ at main.c:823:6:
  main.c:163:9: warning: ‘set’ may be used uninitialized [-Wmaybe-uninitialized]
    163 |         eprintf(1, verbose, pr_fmt(fmt), ##__VA_ARGS__)
        |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  main.c:729:17: note: in expansion of macro ‘pr_debug’
    729 |                 pr_debug("sorting  addr %5lu: cnt %6d [%s]\n",
        |                 ^~~~~~~~
  main.c: In function ‘main’:
  main.c:683:36: note: ‘set’ was declared here
    683 |                 struct btf_id_set *set;
        |                                    ^~~

Signed-off-by: Eder Zulian <ezulian@redhat.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20241022172329.3871958-2-ezulian@redhat.com

bpf,perf: Fix perf_event_detach_bpf_prog error handling

Peter reported that perf_event_detach_bpf_prog might skip to release
the bpf program for -ENOENT error from bpf_prog_array_copy.

This can't happen because bpf program is stored in perf event and is
detached and released only when perf event is freed.

Let's drop the -ENOENT check and make sure the bpf program is released
in any case.

Fixes: 170a7e3ea070 ("bpf: bpf_prog_array_copy() should return -ENOENT if exclude_prog not found")
Reported-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20241023200352.3488610-1-jolsa@kernel.org
Closes: https://lore.kernel.org/lkml/20241022111638.GC16066@noisy.programming.kicks-ass.net/

uprobe: Add support for session consumer

This change allows the uprobe consumer to behave as session which
means that 'handler' and 'ret_handler' callbacks are connected in
a way that allows to:

  - control execution of 'ret_handler' from 'handler' callback
  - share data between 'handler' and 'ret_handler' callbacks

The session concept fits to our common use case where we do filtering
on entry uprobe and based on the result we decide to run the return
uprobe (or not).

It's also convenient to share the data between session callbacks.

To achive this we are adding new return value the uprobe consumer
can return from 'handler' callback:

  UPROBE_HANDLER_IGNORE
  - Ignore 'ret_handler' callback for this consumer.

And store cookie and pass it to 'ret_handler' when consumer has both
'handler' and 'ret_handler' callbacks defined.

We store shared data in the return_consumer object array as part of
the return_instance object. This way the handle_uretprobe_chain can
find related return_consumer and its shared data.

We also store entry handler return value, for cases when there are
multiple consumers on single uprobe and some of them are ignored and
some of them not, in which case the return probe gets installed and
we need to have a way to find out which consumer needs to be ignored.

The tricky part is when consumer is registered 'after' the uprobe
entry handler is hit. In such case this consumer's 'ret_handler' gets
executed as well, but it won't have the proper data pointer set,
so we can filter it out.

Suggested-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241018202252.693462-3-jolsa@kernel.org

uprobe: Add data pointer to consumer handlers

Adding data pointer to both entry and exit consumer handlers and all
its users. The functionality itself is coming in following change.

Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20241018202252.693462-2-jolsa@kernel.org