linux-2.6-block.git
10 months agosched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()
Tejun Heo [Mon, 9 Sep 2024 23:42:47 +0000 (13:42 -1000)]
sched_ext: Implement scx_bpf_dispatch[_vtime]_from_dsq()

Once a task is put into a DSQ, the allowed operations are fairly limited.
Tasks in the built-in local and global DSQs are executed automatically and,
ignoring dequeue, there is only one way a task in a user DSQ can be
manipulated - scx_bpf_consume() moves the first task to the dispatching
local DSQ. This inflexibility sometimes gets in the way and is an area where
multiple feature requests have been made.

Implement scx_bpf_dispatch[_vtime]_from_dsq(), which can be called during
DSQ iteration and can move the task to any DSQ - local DSQs, global DSQ and
user DSQs. The kfuncs can be called from ops.dispatch() and any BPF context
which dosen't hold a rq lock including BPF timers and SYSCALL programs.

This is an expansion of an earlier patch which only allowed moving into the
dispatching local DSQ:

  http://lkml.kernel.org/r/Zn4Cw4FDTmvXnhaf@slm.duckdns.org

v2: Remove @slice and @vtime from scx_bpf_dispatch_from_dsq[_vtime]() as
    they push scx_bpf_dispatch_from_dsq_vtime() over the kfunc argument
    count limit and often won't be needed anyway. Instead provide
    scx_bpf_dispatch_from_dsq_set_{slice|vtime}() kfuncs which can be called
    only when needed and override the specified parameter for the subsequent
    dispatch.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: David Vernet <void@manifault.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Andrea Righi <andrea.righi@linux.dev>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
10 months agosched_ext: Compact struct bpf_iter_scx_dsq_kern
Tejun Heo [Mon, 9 Sep 2024 23:42:47 +0000 (13:42 -1000)]
sched_ext: Compact struct bpf_iter_scx_dsq_kern

struct scx_iter_scx_dsq is defined as 6 u64's and scx_dsq_iter_kern was
using 5 of them. We want to add two more u64 fields but it's better if we do
so while staying within scx_iter_scx_dsq to maintain binary compatibility.

The way scx_iter_scx_dsq_kern is laid out is rather inefficient - the node
field takes up three u64's but only one bit of the last u64 is used. Turn
the bool into u32 flags and only use the lower 16 bits freeing up 48 bits -
16 bits for flags, 32 bits for a u32 - for use by struct
bpf_iter_scx_dsq_kern.

This allows moving the dsq_seq and flags fields of bpf_iter_scx_dsq_kern
into the cursor field reducing the struct size by a full u64.

No behavior changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
10 months agosched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()
Tejun Heo [Mon, 9 Sep 2024 23:42:47 +0000 (13:42 -1000)]
sched_ext: Replace consume_local_task() with move_local_task_to_local_dsq()

- Rename move_task_to_local_dsq() to move_remote_task_to_local_dsq().

- Rename consume_local_task() to move_local_task_to_local_dsq() and remove
  task_unlink_from_dsq() and source DSQ unlocking from it.

This is to make the migration code easier to reuse.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
10 months agosched_ext: Move consume_local_task() upward
Tejun Heo [Mon, 9 Sep 2024 23:42:47 +0000 (13:42 -1000)]
sched_ext: Move consume_local_task() upward

So that the local case comes first and two CONFIG_SMP blocks can be merged.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
10 months agosched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()
Tejun Heo [Mon, 9 Sep 2024 23:42:47 +0000 (13:42 -1000)]
sched_ext: Move sanity check and dsq_mod_nr() into task_unlink_from_dsq()

All task_unlink_from_dsq() users are doing dsq_mod_nr(dsq, -1). Move it into
task_unlink_from_dsq(). Also move sanity check into it.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
10 months agosched_ext: Reorder args for consume_local/remote_task()
Tejun Heo [Mon, 9 Sep 2024 23:42:47 +0000 (13:42 -1000)]
sched_ext: Reorder args for consume_local/remote_task()

Reorder args for consistency in the order of:

  current_rq, p, src_[rq|dsq], dst_[rq|dsq].

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
10 months agosched_ext: Restructure dispatch_to_local_dsq()
Tejun Heo [Mon, 9 Sep 2024 23:42:47 +0000 (13:42 -1000)]
sched_ext: Restructure dispatch_to_local_dsq()

Now that there's nothing left after the big if block, flip the if condition
and unindent the body.

No functional changes intended.

v2: Add BUG() to clarify control can't reach the end of
    dispatch_to_local_dsq() in UP kernels per David.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
10 months agosched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling
Tejun Heo [Mon, 9 Sep 2024 23:42:47 +0000 (13:42 -1000)]
sched_ext: Fix processs_ddsp_deferred_locals() by unifying DTL_INVALID handling

With the preceding update, the only return value which makes meaningful
difference is DTL_INVALID, for which one caller, finish_dispatch(), falls
back to the global DSQ and the other, process_ddsp_deferred_locals(),
doesn't do anything.

It should always fallback to the global DSQ. Move the global DSQ fallback
into dispatch_to_local_dsq() and remove the return value.

v2: Patch title and description updated to reflect the behavior fix for
    process_ddsp_deferred_locals().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
10 months agosched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON
Tejun Heo [Mon, 9 Sep 2024 23:42:47 +0000 (13:42 -1000)]
sched_ext: Make find_dsq_for_dispatch() handle SCX_DSQ_LOCAL_ON

find_dsq_for_dispatch() handles all DSQ IDs except SCX_DSQ_LOCAL_ON.
Instead, each caller is hanlding SCX_DSQ_LOCAL_ON before calling it. Move
SCX_DSQ_LOCAL_ON lookup into find_dsq_for_dispatch() to remove duplicate
code in direct_dispatch() and dispatch_to_local_dsq().

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
10 months agosched_ext: Refactor consume_remote_task()
Tejun Heo [Mon, 9 Sep 2024 23:42:47 +0000 (13:42 -1000)]
sched_ext: Refactor consume_remote_task()

The tricky p->scx.holding_cpu handling was split across
consume_remote_task() body and move_task_to_local_dsq(). Refactor such that:

- All the tricky part is now in the new unlink_dsq_and_lock_src_rq() with
  consolidated documentation.

- move_task_to_local_dsq() now implements straightforward task migration
  making it easier to use in other places.

- dispatch_to_local_dsq() is another user move_task_to_local_dsq(). The
  usage is updated accordingly. This makes the local and remote cases more
  symmetric.

No functional changes intended.

v2: s/task_rq/src_rq/ for consistency.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
10 months agosched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate
Tejun Heo [Mon, 9 Sep 2024 23:42:47 +0000 (13:42 -1000)]
sched_ext: Rename scx_kfunc_set_sleepable to unlocked and relocate

Sleepables don't need to be in its own kfunc set as each is tagged with
KF_SLEEPABLE. Rename to scx_kfunc_set_unlocked indicating that rq lock is
not held and relocate right above the any set. This will be used to add
kfuncs that are allowed to be called from SYSCALL but not TRACING.

No functional changes intended.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: David Vernet <void@manifault.com>
10 months agosched_ext: Add missing static to scx_dump_data
Tejun Heo [Sat, 7 Sep 2024 18:11:25 +0000 (08:11 -1000)]
sched_ext: Add missing static to scx_dump_data

scx_dump_data is only used inside ext.c but doesn't have static. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409070218.RB5WsQ07-lkp@intel.com/

10 months agosched_ext: Add missing static to scx_has_op[]
Tejun Heo [Fri, 6 Sep 2024 18:18:55 +0000 (08:18 -1000)]
sched_ext: Add missing static to scx_has_op[]

scx_has_op[] is only used inside ext.c but doesn't have static. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202409062337.m7qqI88I-lkp@intel.com/

10 months agosched_ext: Temporarily work around pick_task_scx() being called without balance_scx()
Tejun Heo [Fri, 6 Sep 2024 18:17:09 +0000 (08:17 -1000)]
sched_ext: Temporarily work around pick_task_scx() being called without balance_scx()

pick_task_scx() must be preceded by balance_scx() but there currently is a
bug where fair could say yes on balance() but no on pick_task(), which then
ends up calling pick_task_scx() without preceding balance_scx(). Work around
by dropping WARN_ON_ONCE() and ignoring cases which don't make sense.

This isn't great and can theoretically lead to stalls. However, for
switch_all cases, this happens only while a BPF scheduler is being loaded or
unloaded, and, for partial cases, fair will likely keep triggering this CPU.

This will be reverted once the fair behavior is fixed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
10 months agoMerge branch 'bpf/master' into for-6.12
Tejun Heo [Wed, 4 Sep 2024 21:41:32 +0000 (11:41 -1000)]
Merge branch 'bpf/master' into for-6.12

Pull bpf/master to receive baebe9aaba1e ("bpf: allow passing struct
bpf_iter_<type> as kfunc arguments") and related changes in preparation for
the DSQ iterator patchset.

Signed-off-by: Tejun Heo <tj@kernel.org>
10 months agosched_ext: Add a cgroup scheduler which uses flattened hierarchy
Tejun Heo [Wed, 4 Sep 2024 20:24:59 +0000 (10:24 -1000)]
sched_ext: Add a cgroup scheduler which uses flattened hierarchy

This patch adds scx_flatcg example scheduler which implements hierarchical
weight-based cgroup CPU control by flattening the cgroup hierarchy into a
single layer by compounding the active weight share at each level.

This flattening of hierarchy can bring a substantial performance gain when
the cgroup hierarchy is nested multiple levels. in a simple benchmark using
wrk[8] on apache serving a CGI script calculating sha1sum of a small file,
it outperforms CFS by ~3% with CPU controller disabled and by ~10% with two
apache instances competing with 2:1 weight ratio nested four level deep.

However, the gain comes at the cost of not being able to properly handle
thundering herd of cgroups. For example, if many cgroups which are nested
behind a low priority parent cgroup wake up around the same time, they may
be able to consume more CPU cycles than they are entitled to. In many use
cases, this isn't a real concern especially given the performance gain.
Also, there are ways to mitigate the problem further by e.g. introducing an
extra scheduling layer on cgroup delegation boundaries.

v5: - Updated to specify SCX_OPS_HAS_CGROUP_WEIGHT instead of
      SCX_OPS_KNOB_CGROUP_WEIGHT.

v4: - Revert reference counted kptr for cgv_node as the change caused easily
      reproducible stalls.

v3: - Updated to reflect the core API changes including ops.init/exit_task()
      and direct dispatch from ops.select_cpu(). Fixes and improvements
      including additional statistics.

    - Use reference counted kptr for cgv_node instead of xchg'ing against
      stash location.

    - Dropped '-p' option.

v2: - Use SCX_BUG[_ON]() to simplify error handling.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
10 months agosched_ext: Add cgroup support
Tejun Heo [Wed, 4 Sep 2024 20:24:59 +0000 (10:24 -1000)]
sched_ext: Add cgroup support

Add sched_ext_ops operations to init/exit cgroups, and track task migrations
and config changes. A BPF scheduler may not implement or implement only
subset of cgroup features. The implemented features can be indicated using
%SCX_OPS_HAS_CGOUP_* flags. If cgroup configuration makes use of features
that are not implemented, a warning is triggered.

While a BPF scheduler is being enabled and disabled, relevant cgroup
operations are locked out using scx_cgroup_rwsem. This avoids situations
like task prep taking place while the task is being moved across cgroups,
making things easier for BPF schedulers.

v7: - cgroup interface file visibility toggling is dropped in favor just
      warning messages. Dynamically changing interface visiblity caused more
      confusion than helping.

v6: - Updated to reflect the removal of SCX_KF_SLEEPABLE.

    - Updated to use CONFIG_GROUP_SCHED_WEIGHT and fixes for
      !CONFIG_FAIR_GROUP_SCHED && CONFIG_EXT_GROUP_SCHED.

v5: - Flipped the locking order between scx_cgroup_rwsem and
      cpus_read_lock() to avoid locking order conflict w/ cpuset. Better
      documentation around locking.

    - sched_move_task() takes an early exit if the source and destination
      are identical. This triggered the warning in scx_cgroup_can_attach()
      as it left p->scx.cgrp_moving_from uncleared. Updated the cgroup
      migration path so that ops.cgroup_prep_move() is skipped for identity
      migrations so that its invocations always match ops.cgroup_move()
      one-to-one.

v4: - Example schedulers moved into their own patches.

    - Fix build failure when !CONFIG_CGROUP_SCHED, reported by Andrea Righi.

v3: - Make scx_example_pair switch all tasks by default.

    - Convert to BPF inline iterators.

    - scx_bpf_task_cgroup() is added to determine the current cgroup from
      CPU controller's POV. This allows BPF schedulers to accurately track
      CPU cgroup membership.

    - scx_example_flatcg added. This demonstrates flattened hierarchy
      implementation of CPU cgroup control and shows significant performance
      improvement when cgroups which are nested multiple levels are under
      competition.

v2: - Build fixes for different CONFIG combinations.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
Reported-by: kernel test robot <lkp@intel.com>
Cc: Andrea Righi <andrea.righi@canonical.com>
10 months agosched: Introduce CONFIG_GROUP_SCHED_WEIGHT
Tejun Heo [Wed, 4 Sep 2024 20:24:59 +0000 (10:24 -1000)]
sched: Introduce CONFIG_GROUP_SCHED_WEIGHT

sched_ext will soon add cgroup cpu.weigh support. The cgroup interface code
is currently gated behind CONFIG_FAIR_GROUP_SCHED. As the fair class and/or
SCX may implement the feature, put the interface code behind the new
CONFIG_CGROUP_SCHED_WEIGHT which is selected by CONFIG_FAIR_GROUP_SCHED.
This allows either sched class to enable the itnerface code without ading
more complex CONFIG tests.

When !CONFIG_FAIR_GROUP_SCHED, a dummy version of sched_group_set_shares()
is added to support later CONFIG_CGROUP_SCHED_WEIGHT &&
!CONFIG_FAIR_GROUP_SCHED builds.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
10 months agosched: Make cpu_shares_read_u64() use tg_weight()
Tejun Heo [Wed, 4 Sep 2024 20:24:59 +0000 (10:24 -1000)]
sched: Make cpu_shares_read_u64() use tg_weight()

Move tg_weight() upward and make cpu_shares_read_u64() use it too. This
makes the weight retrieval shared between cgroup v1 and v2 paths and will be
used to implement cgroup support for sched_ext.

No functional changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
10 months agosched: Expose css_tg()
Tejun Heo [Wed, 4 Sep 2024 20:24:59 +0000 (10:24 -1000)]
sched: Expose css_tg()

A new BPF extensible sched_class will use css_tg() in the init and exit
paths to visit all task_groups by walking cgroups.

v4: __setscheduler_prio() is already exposed. Dropped from this patch.

v3: Dropped SCHED_CHANGE_BLOCK() as upstream is adding more generic cleanup
    mechanism.

v2: Expose SCHED_CHANGE_BLOCK() too and update the description.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: David Vernet <dvernet@meta.com>
Acked-by: Josh Don <joshdon@google.com>
Acked-by: Hao Luo <haoluo@google.com>
Acked-by: Barret Rhoden <brho@google.com>
10 months agosched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable
Tejun Heo [Sat, 31 Aug 2024 08:02:34 +0000 (22:02 -1000)]
sched_ext: TASK_DEAD tasks must be switched into SCX on ops_enable

During scx_ops_enable(), SCX needs to invoke the sleepable ops.init_task()
on every task. To do this, it does get_task_struct() on each iterated task,
drop the lock and then call ops.init_task().

However, a TASK_DEAD task may already have lost all its usage count and be
waiting for RCU grace period to be freed. If get_task_struct() is called on
such task, use-after-free can happen. To avoid such situations,
scx_ops_enable() skips initialization of TASK_DEAD tasks, which seems safe
as they are never going to be scheduled again.

Unfortunately, a racing sched_setscheduler(2) can grab the task before the
task is unhashed and then continue to e.g. move the task from RT to SCX
after TASK_DEAD is set and ops_enable skipped the task. As the task hasn't
gone through scx_ops_init_task(), scx_ops_enable_task() called from
switching_to_scx() triggers the following warning:

  sched_ext: Invalid task state transition 0 -> 3 for stress-ng-race-[2872]
  WARNING: CPU: 6 PID: 2367 at kernel/sched/ext.c:3327 scx_ops_enable_task+0x18f/0x1f0
  ...
  RIP: 0010:scx_ops_enable_task+0x18f/0x1f0
  ...
   switching_to_scx+0x13/0xa0
   __sched_setscheduler+0x84e/0xa50
   do_sched_setscheduler+0x104/0x1c0
   __x64_sys_sched_setscheduler+0x18/0x30
   do_syscall_64+0x7b/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e

As in the ops_disable path, it just doesn't seem like a good idea to leave
any task in an inconsistent state, even when the task is dead. The root
cause is ops_enable not being able to tell reliably whether a task is truly
dead (no one else is looking at it and it's about to be freed) and was
testing TASK_DEAD instead. Fix it by testing the task's usage count
directly.

- ops_init no longer ignores TASK_DEAD tasks. As now all users iterate all
  tasks, @include_dead is removed from scx_task_iter_next_locked() along
  with dead task filtering.

- tryget_task_struct() is added. Tasks are skipped iff tryget_task_struct()
  fails.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
10 months agosched_ext: TASK_DEAD tasks must be switched out of SCX on ops_disable
Tejun Heo [Fri, 30 Aug 2024 23:44:40 +0000 (13:44 -1000)]
sched_ext: TASK_DEAD tasks must be switched out of SCX on ops_disable

scx_ops_disable_workfn() only switches !TASK_DEAD tasks out of SCX while
calling scx_ops_exit_task() on all tasks including dead ones. This can leave
a dead task on SCX but with SCX_TASK_NONE state, which is inconsistent.

If another task was in the process of changing the TASK_DEAD task's
scheduling class and grabs the rq lock after scx_ops_disable_workfn() is
done with the task, the task ends up calling scx_ops_disable_task() on the
dead task which is in an inconsistent state triggering a warning:

  WARNING: CPU: 6 PID: 3316 at kernel/sched/ext.c:3411 scx_ops_disable_task+0x12c/0x160
  ...
  RIP: 0010:scx_ops_disable_task+0x12c/0x160
  ...
  Call Trace:
   <TASK>
   check_class_changed+0x2c/0x70
   __sched_setscheduler+0x8a0/0xa50
   do_sched_setscheduler+0x104/0x1c0
   __x64_sys_sched_setscheduler+0x18/0x30
   do_syscall_64+0x7b/0x140
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f140d70ea5b

There is no reason to leave dead tasks on SCX when unloading the BPF
scheduler. Fix by making scx_ops_disable_workfn() eject all tasks including
the dead ones from SCX.

Signed-off-by: Tejun Heo <tj@kernel.org>
10 months agosched_ext: Remove sched_class->switch_class()
Tejun Heo [Wed, 4 Sep 2024 07:54:29 +0000 (21:54 -1000)]
sched_ext: Remove sched_class->switch_class()

With sched_ext converted to use put_prev_task() for class switch detection,
there's no user of switch_class() left. Drop it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
10 months agosched_ext: Remove switch_class_scx()
Tejun Heo [Wed, 4 Sep 2024 07:54:29 +0000 (21:54 -1000)]
sched_ext: Remove switch_class_scx()

Now that put_prev_task_scx() is called with @next on task switches, there's
no reason to use sched_class.switch_class(). Rename switch_class_scx() to
switch_class() and call it from put_prev_task_scx().

Signed-off-by: Tejun Heo <tj@kernel.org>
10 months agosched_ext: Relocate functions in kernel/sched/ext.c
Tejun Heo [Wed, 4 Sep 2024 07:54:29 +0000 (21:54 -1000)]
sched_ext: Relocate functions in kernel/sched/ext.c

Relocate functions to ease the removal of switch_class_scx(). No functional
changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
10 months agosched_ext: Unify regular and core-sched pick task paths
Tejun Heo [Wed, 4 Sep 2024 07:54:29 +0000 (21:54 -1000)]
sched_ext: Unify regular and core-sched pick task paths

Because the BPF scheduler's dispatch path is invoked from balance(),
sched_ext needs to invoke balance_one() on all sibling rq's before picking
the next task for core-sched.

Before the recent pick_next_task() updates, sched_ext couldn't share pick
task between regular and core-sched paths because pick_next_task() depended
on put_prev_task() being called on the current task. Tasks currently running
on sibling rq's can't be put when one rq is trying to pick the next task, so
pick_task_scx() had to have a separate mechanism to pick between a sibling
rq's current task and the first task in its local DSQ.

However, with the preceding updates, pick_next_task_scx() no longer depends
on the current task being put and can compare the current task and the next
in line statelessly, and the pick task logic should be shareable between
regular and core-sched paths.

Unify regular and core-sched pick task paths:

- There's no reason to distinguish local and sibling picks anymore. @local
  is removed from balance_one().

- pick_next_task_scx() is turned into pick_task_scx() by dropping the
  put_prev_set_next_task() call.

- The old pick_task_scx() is dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
10 months agosched_ext: Replace SCX_TASK_BAL_KEEP with SCX_RQ_BAL_KEEP
Tejun Heo [Wed, 4 Sep 2024 07:54:28 +0000 (21:54 -1000)]
sched_ext: Replace SCX_TASK_BAL_KEEP with SCX_RQ_BAL_KEEP

SCX_TASK_BAL_KEEP is used by balance_one() to tell pick_next_task_scx() to
keep running the current task. It's not really a task property. Replace it
with SCX_RQ_BAL_KEEP which resides in rq->scx.flags and is a better fit for
the usage. Also, the existing clearing rule is unnecessarily strict and
makes it difficult to use with core-sched. Just clear it on entry to
balance_one().

Signed-off-by: Tejun Heo <tj@kernel.org>
10 months agosched_ext: Don't call put_prev_task_scx() before picking the next task
Tejun Heo [Wed, 4 Sep 2024 07:54:28 +0000 (21:54 -1000)]
sched_ext: Don't call put_prev_task_scx() before picking the next task

fd03c5b85855 ("sched: Rework pick_next_task()") changed the definition of
pick_next_task() from:

  pick_next_task() := pick_task() + set_next_task(.first = true)

to:

  pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)

making invoking put_prev_task() pick_next_task()'s responsibility. This
reordering allows pick_task() to be shared between regular and core-sched
paths and put_prev_task() to know the next task.

sched_ext depended on put_prev_task_scx() enqueueing the current task before
pick_next_task_scx() is called. While pulling sched/core changes,
70cc76aa0d80 ("Merge branch 'tip/sched/core' into for-6.12") added an
explicit put_prev_task_scx() call for SCX tasks in pick_next_task_scx()
before picking the first task as a workaround.

Clean it up and adopt the conventions that other sched classes are
following.

The operation of keeping running the current task was spread and required
the task to be put on the local DSQ before picking:

  - balance_one() used SCX_TASK_BAL_KEEP to indicate that the task is still
    runnable, hasn't exhausted its slice, and thus should keep running.

  - put_prev_task_scx() enqueued the task to local DSQ if SCX_TASK_BAL_KEEP
    is set. It also called do_enqueue_task() with SCX_ENQ_LAST if it is the
    only runnable task. do_enqueue_task() in turn decided whether to use the
    local DSQ depending on SCX_OPS_ENQ_LAST.

Consolidate the logic in balance_one() as it always knows whether it is
going to keep the current task. balance_one() now considers all conditions
where the current task should be kept and uses SCX_TASK_BAL_KEEP to tell
pick_next_task_scx() to keep the current task instead of picking one from
the local DSQ. Accordingly, SCX_ENQ_LAST handling is removed from
put_prev_task_scx() and do_enqueue_task() and pick_next_task_scx() is
updated to pick the current task if SCX_TASK_BAL_KEEP is set.

The workaround put_prev_task[_scx]() calls are replaced with
put_prev_set_next_task().

This causes two behavior changes observable from the BPF scheduler:

- When a task keep running, it no longer goes through enqueue/dequeue cycle
  and thus ops.stopping/running() transitions. The new behavior is better
  and all the existing schedulers should be able to handle the new behavior.

- The BPF scheduler cannot keep executing the current task by enqueueing
  SCX_ENQ_LAST task to the local DSQ. If SCX_OPS_ENQ_LAST is specified, the
  BPF scheduler is responsible for resuming execution after each
  SCX_ENQ_LAST. SCX_OPS_ENQ_LAST is mostly useful for cases where scheduling
  decisions are not made on the local CPU - e.g. central or userspace-driven
  schedulin - and the new behavior is more logical and shouldn't pose any
  problems. SCX_OPS_ENQ_LAST demonstration from scx_qmap is dropped as it
  doesn't fit that well anymore and the last task handling is moved to the
  end of qmap_dispatch().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Vernet <void@manifault.com>
Cc: Andrea Righi <righi.andrea@gmail.com>
Cc: Changwoo Min <multics69@gmail.com>
Cc: Daniel Hodges <hodges.daniel.scott@gmail.com>
Cc: Dan Schatzberg <schatzberg.dan@gmail.com>
10 months agoMerge branch 'tip/sched/core' into for-6.12
Tejun Heo [Tue, 3 Sep 2024 19:15:42 +0000 (09:15 -1000)]
Merge branch 'tip/sched/core' into for-6.12

- Resolve trivial context conflicts from dl_server clearing being moved
  around.

- Add @next to put_prev_task_scx() and @prev to pick_next_task_scx() to
  match sched/core.

- Merge sched_class->switch_class() addition from sched_ext with
  tip/sched/core changes in __pick_next_task().

- Make pick_next_task_scx() call put_prev_task_scx() to emulate the previous
  behavior where sched_class->put_prev_task() was called before
  sched_class->pick_next_task().

While this makes sched_ext build and function, the behavior is not in line
with other sched classes. The follow-up patches will address the
discrepancies and remove sched_class->switch_class().

Signed-off-by: Tejun Heo <tj@kernel.org>
10 months agosched: Add put_prev_task(.next)
Peter Zijlstra [Tue, 13 Aug 2024 22:25:56 +0000 (00:25 +0200)]
sched: Add put_prev_task(.next)

In order to tell the previous sched_class what the next task is, add
put_prev_task(.next).

Notable SCX will use this to:

 1) determine the next task will leave the SCX sched class and push
    the current task to another CPU if possible.
 2) statistics on how often and which other classes preempt it

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.367421076@infradead.org
10 months agosched: Rework dl_server
Peter Zijlstra [Tue, 13 Aug 2024 22:25:55 +0000 (00:25 +0200)]
sched: Rework dl_server

When a task is selected through a dl_server, it will have p->dl_server
set, such that it can account runtime to the dl_server, see
update_curr_task().

Currently p->dl_server is set in pick*task() whenever it goes through
the dl_server, clearing it is a bit of a mess though. The trivial
solution is clearing it on the final put (now that we have this
location).

However, this gives a problem when:

p = pick_task(rq);
if (p)
put_prev_set_next_task(rq, prev, next);

picks the same task but through a different path, notably when it goes
from picking through the dl_server to a direct pick or vice-versa. In
that case we cannot readily determine wether we should clear or
preserve p->dl_server.

An additional complication is pick_*task() setting p->dl_server for a
remote pick, it might still need to update runtime before it schedules
the core_pick.

Close all these holes and remove all the random clearing of
p->dl_server by:

 - having pick_*task() manage rq->dl_server

 - having the final put_prev_task() clear p->dl_server

 - having the first set_next_task() set p->dl_server = rq->dl_server

 - complicate the core_sched code to save/restore rq->dl_server where
   appropriate.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.259853414@infradead.org
10 months agosched: Combine the last put_prev_task() and the first set_next_task()
Peter Zijlstra [Tue, 13 Aug 2024 22:25:54 +0000 (00:25 +0200)]
sched: Combine the last put_prev_task() and the first set_next_task()

Ensure the last put_prev_task() and the first set_next_task() always
go together.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.158454756@infradead.org
10 months agosched: Rework pick_next_task()
Peter Zijlstra [Tue, 13 Aug 2024 22:25:53 +0000 (00:25 +0200)]
sched: Rework pick_next_task()

The current rule is that:

  pick_next_task() := pick_task() + set_next_task(.first = true)

And many classes implement it directly as such. Change things around
to make pick_next_task() optional while also changing the definition to:

  pick_next_task(prev) := pick_task() + put_prev_task() + set_next_task(.first = true)

The reason is that sched_ext would like to have a 'final' call that
knows the next task. By placing put_prev_task() right next to
set_next_task() (as it already is for sched_core) this becomes
trivial.

As a bonus, this is a nice cleanup on its own.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224016.051225657@infradead.org
10 months agosched: Split up put_prev_task_balance()
Peter Zijlstra [Tue, 13 Aug 2024 22:25:52 +0000 (00:25 +0200)]
sched: Split up put_prev_task_balance()

With the goal of pushing put_prev_task() after pick_task() / into
pick_next_task().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224015.943143811@infradead.org
10 months agosched: Clean up DL server vs core sched
Peter Zijlstra [Tue, 13 Aug 2024 22:25:51 +0000 (00:25 +0200)]
sched: Clean up DL server vs core sched

Abide by the simple rule:

  pick_next_task() := pick_task() + set_next_task(.first = true)

This allows us to trivially get rid of server_pick_next() and things
collapse nicely.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224015.837303391@infradead.org
10 months agosched: Fixup set_next_task() implementations
Peter Zijlstra [Tue, 13 Aug 2024 22:25:50 +0000 (00:25 +0200)]
sched: Fixup set_next_task() implementations

The rule is that:

  pick_next_task() := pick_task() + set_next_task(.first = true)

Turns out, there's still a few things in pick_next_task() that are
missing from that combination.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224015.724111109@infradead.org
10 months agosched: Use set_next_task(.first) where required
Peter Zijlstra [Tue, 13 Aug 2024 22:25:49 +0000 (00:25 +0200)]
sched: Use set_next_task(.first) where required

Turns out the core_sched bits forgot to use the
set_next_task(.first=true) variant. Notably:

  pick_next_task() := pick_task() + set_next_task(.first = true)

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240813224015.614146342@infradead.org
10 months agosched/fair: Properly deactivate sched_delayed task upon class change
Valentin Schneider [Thu, 29 Aug 2024 13:53:53 +0000 (15:53 +0200)]
sched/fair: Properly deactivate sched_delayed task upon class change

__sched_setscheduler() goes through an enqueue/dequeue cycle like so:

  flags := DEQUEUE_SAVE | DEQUEUE_MOVE | DEQUEUE_NOCLOCK;
  prev_class->dequeue_task(rq, p, flags);
  new_class->enqueue_task(rq, p, flags);

when prev_class := fair_sched_class, this is followed by:

  dequeue_task(rq, p, DEQUEUE_NOCLOCK | DEQUEUE_SLEEP);

the idea being that since the task has switched classes, we need to drop
the sched_delayed logic and have that task be deactivated per its previous
dequeue_task(..., DEQUEUE_SLEEP).

Unfortunately, this leaves the task on_rq. This is missing the tail end of
dequeue_entities() that issues __block_task(), which __sched_setscheduler()
won't have done due to not using DEQUEUE_DELAYED - not that it should, as
it is pretty much a fair_sched_class specific thing.

Make switched_from_fair() properly deactivate sched_delayed tasks upon
class changes via __block_task(), as if a
  dequeue_task(..., DEQUEUE_DELAYED)
had been issued.

Fixes: 2e0199df252a ("sched/fair: Prepare exit/cleanup paths for delayed_dequeue")
Reported-by: "Paul E. McKenney" <paulmck@kernel.org>
Reported-by: Chen Yu <yu.c.chen@intel.com>
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20240829135353.1524260-1-vschneid@redhat.com
10 months agosched/deadline: Fix schedstats vs deadline servers
Huang Shijie [Thu, 29 Aug 2024 03:11:11 +0000 (11:11 +0800)]
sched/deadline: Fix schedstats vs deadline servers

In dl_server_start(), when schedstats is enabled, the following
happens:

  dl_server_start()
    dl_se->dl_server = 1;
    enqueue_dl_entity()
      update_stats_enqueue_dl()
        __schedstats_from_dl_se()
          dl_task_of()
            BUG_ON(dl_server(dl_se));

Since only tasks have schedstats and internal entries do not, avoid
trying to update stats in this case.

Fixes: 63ba8422f876 ("sched/deadline: Introduce deadline servers")
Signed-off-by: Huang Shijie <shijie@os.amperecomputing.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Juri Lelli <juri.lelli@redhat.com>
Link: https://lkml.kernel.org/r/20240829031111.12142-1-shijie@os.amperecomputing.com
10 months agosched_ext: Use sched_clock_cpu() instead of rq_clock_task() in touch_core_sched()
Tejun Heo [Fri, 30 Aug 2024 17:54:41 +0000 (07:54 -1000)]
sched_ext: Use sched_clock_cpu() instead of rq_clock_task() in touch_core_sched()

Since 3cf78c5d01d6 ("sched_ext: Unpin and repin rq lock from
balance_scx()"), sched_ext's balance path terminates rq_pin in the outermost
function. This is simpler and in line with what other balance functions are
doing but it loses control over rq->clock_update_flags which makes
assert_clock_udpated() trigger if other CPUs pins the rq lock.

The only place this matters is touch_core_sched() which uses the timestamp
to order tasks from sibling rq's. Switch to sched_clock_cpu(). Later, it may
be better to use per-core dispatch sequence number.

v2: Use sched_clock_cpu() instead of ktime_get_ns() per David.

Signed-off-by: Tejun Heo <tj@kernel.org>
Fixes: 3cf78c5d01d6 ("sched_ext: Unpin and repin rq lock from balance_scx()")
Acked-by: David Vernet <void@manifault.com>
Cc: Peter Zijlstra <peterz@infradead.org>
10 months agosched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()
Tejun Heo [Fri, 30 Aug 2024 10:51:40 +0000 (00:51 -1000)]
sched_ext: Use task_can_run_on_remote_rq() test in dispatch_to_local_dsq()

When deciding whether a task can be migrated to a CPU,
dispatch_to_local_dsq() was open-coding p->cpus_allowed and scx_rq_online()
tests instead of using task_can_run_on_remote_rq(). This had two problems.

- It was missing is_migration_disabled() check and thus could try to migrate
  a task which shouldn't leading to assertion and scheduling failures.

- It was testing p->cpus_ptr directly instead of using task_allowed_on_cpu()
  and thus failed to consider ISA compatibility.

Update dispatch_to_local_dsq() to use task_can_run_on_remote_rq():

- Move scx_ops_error() triggering into task_can_run_on_remote_rq().

- When migration isn't allowed, fall back to the global DSQ instead of the
  source DSQ by returning DTL_INVALID. This is both simpler and an overall
  better behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Acked-by: David Vernet <void@manifault.com>
10 months agoselftests/bpf: Do not update vmlinux.h unnecessarily
Ihor Solodrai [Wed, 28 Aug 2024 17:46:23 +0000 (17:46 +0000)]
selftests/bpf: Do not update vmlinux.h unnecessarily

%.bpf.o objects depend on vmlinux.h, which makes them transitively
dependent on unnecessary libbpf headers. However vmlinux.h doesn't
actually change as often.

When generating vmlinux.h, compare it to a previous version and update
it only if there are changes.

Example of build time improvement (after first clean build):
  $ touch ../../../lib/bpf/bpf.h
  $ time make -j8
Before: real  1m37.592s
After:  real  0m27.310s

Notice that %.bpf.o gen step is skipped if vmlinux.h hasn't changed.

Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/CAEf4BzY1z5cC7BKye8=A8aTVxpsCzD=p1jdTfKC7i0XVuYoHUQ@mail.gmail.com
Link: https://lore.kernel.org/bpf/20240828174608.377204-2-ihor.solodrai@pm.me
10 months agoselftests/bpf: Specify libbpf headers required for %.bpf.o progs
Ihor Solodrai [Wed, 28 Aug 2024 17:46:14 +0000 (17:46 +0000)]
selftests/bpf: Specify libbpf headers required for %.bpf.o progs

Test %.bpf.o objects actually depend only on some libbpf headers.
Define a list of required headers and use it as TRUNNER_BPF_OBJS
dependency.

bpf_*.h list was determined by:

    $ grep -rh 'include <bpf/bpf_' progs | sort -u

Signed-off-by: Ihor Solodrai <ihor.solodrai@pm.me>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link:
Link: https://lore.kernel.org/bpf/20240828174608.377204-1-ihor.solodrai@pm.me
https://lore.kernel.org/bpf/CAEf4BzYQ-j2i_xjs94Nn=8+FVfkWt51mLZyiYKiz9oA4Z=pCeA@mail.gmail.com/

10 months agoselftests/bpf: Check if distilled base inherits source endianness
Eduard Zingerman [Fri, 30 Aug 2024 17:34:06 +0000 (10:34 -0700)]
selftests/bpf: Check if distilled base inherits source endianness

Create a BTF with endianness different from host, make a distilled
base/split BTF pair from it, dump as raw bytes, import again and
verify that endianness is preserved.

Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240830173406.1581007-1-eddyz87@gmail.com
10 months agolibbpf: Ensure new BTF objects inherit input endianness
Tony Ambardar [Fri, 30 Aug 2024 09:51:50 +0000 (02:51 -0700)]
libbpf: Ensure new BTF objects inherit input endianness

New split BTF needs to preserve base's endianness. Similarly, when
creating a distilled BTF, we need to preserve original endianness.

Fix by updating libbpf's btf__distill_base() and btf_new_empty() to retain
the byte order of any source BTF objects when creating new ones.

Fixes: ba451366bf44 ("libbpf: Implement basic split BTF support")
Fixes: 58e185a0dc35 ("libbpf: Add btf__distill_base() creating split BTF with distilled base BTF")
Reported-by: Song Liu <song@kernel.org>
Reported-by: Eduard Zingerman <eddyz87@gmail.com>
Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Tony Ambardar <tony.ambardar@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/6358db36c5f68b07873a0a5be2d062b1af5ea5f8.camel@gmail.com/
Link: https://lore.kernel.org/bpf/20240830095150.278881-1-tony.ambardar@gmail.com
10 months agobpf: Use sockfd_put() helper
Jinjie Ruan [Fri, 30 Aug 2024 02:07:56 +0000 (10:07 +0800)]
bpf: Use sockfd_put() helper

Replace fput() with sockfd_put() in bpf_fd_reuseport_array_update_elem().

Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://lore.kernel.org/r/20240830020756.607877-1-ruanjinjie@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Remove custom build rule
Alexey Gladkov [Fri, 30 Aug 2024 07:43:50 +0000 (09:43 +0200)]
bpf: Remove custom build rule

According to the documentation, when building a kernel with the C=2
parameter, all source files should be checked. But this does not happen
for the kernel/bpf/ directory.

$ touch kernel/bpf/core.o
$ make C=2 CHECK=true kernel/bpf/core.o

Outputs:

  CHECK   scripts/mod/empty.c
  CALL    scripts/checksyscalls.sh
  DESCEND objtool
  INSTALL libsubcmd_headers
  CC      kernel/bpf/core.o

As can be seen the compilation is done, but CHECK is not executed. This
happens because kernel/bpf/Makefile has defined its own rule for
compilation and forgotten the macro that does the check.

There is no need to duplicate the build code, and this rule can be
removed to use generic rules.

Acked-by: Masahiro Yamada <masahiroy@kernel.org>
Tested-by: Oleg Nesterov <oleg@redhat.com>
Tested-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Alexey Gladkov <legion@kernel.org>
Link: https://lore.kernel.org/r/20240830074350.211308-1-legion@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: Add tests for iter next method returning valid pointer
Juntong Deng [Thu, 29 Aug 2024 20:13:15 +0000 (21:13 +0100)]
selftests/bpf: Add tests for iter next method returning valid pointer

This patch adds test cases for iter next method returning valid
pointer, which can also used as usage examples.

Currently iter next method should return valid pointer.

iter_next_trusted is the correct usage and test if iter next method
return valid pointer. bpf_iter_task_vma_next has KF_RET_NULL flag,
so the returned pointer may be NULL. We need to check if the pointer
is NULL before using it.

iter_next_trusted_or_null is the incorrect usage. There is no checking
before using the pointer, so it will be rejected by the verifier.

iter_next_rcu and iter_next_rcu_or_null are similar test cases for
KF_RCU_PROTECTED iterators.

iter_next_rcu_not_trusted is used to test that the pointer returned by
iter next method of KF_RCU_PROTECTED iterator cannot be passed in
KF_TRUSTED_ARGS kfuncs.

iter_next_ptr_mem_not_trusted is used to test that base type
PTR_TO_MEM should not be combined with type flag PTR_TRUSTED.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Link: https://lore.kernel.org/r/AM6PR03MB5848709758F6922F02AF9F1F99962@AM6PR03MB5848.eurprd03.prod.outlook.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Make the pointer returned by iter next method valid
Juntong Deng [Thu, 29 Aug 2024 20:11:17 +0000 (21:11 +0100)]
bpf: Make the pointer returned by iter next method valid

Currently we cannot pass the pointer returned by iter next method as
argument to KF_TRUSTED_ARGS or KF_RCU kfuncs, because the pointer
returned by iter next method is not "valid".

This patch sets the pointer returned by iter next method to be valid.

This is based on the fact that if the iterator is implemented correctly,
then the pointer returned from the iter next method should be valid.

This does not make NULL pointer valid. If the iter next method has
KF_RET_NULL flag, then the verifier will ask the ebpf program to
check NULL pointer.

KF_RCU_PROTECTED iterator is a special case, the pointer returned by
iter next method should only be valid within RCU critical section,
so it should be with MEM_RCU, not PTR_TRUSTED.

Another special case is bpf_iter_num_next, which returns a pointer with
base type PTR_TO_MEM. PTR_TO_MEM should not be combined with type flag
PTR_TRUSTED (PTR_TO_MEM already means the pointer is valid).

The pointer returned by iter next method of other types of iterators
is with PTR_TRUSTED.

In addition, this patch adds get_iter_from_state to help us get the
current iterator from the current state.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Link: https://lore.kernel.org/r/AM6PR03MB584869F8B448EA1C87B7CDA399962@AM6PR03MB5848.eurprd03.prod.outlook.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoMerge branch 'bpf-add-gen_epilogue-to-bpf_verifier_ops'
Alexei Starovoitov [Fri, 30 Aug 2024 01:15:46 +0000 (18:15 -0700)]
Merge branch 'bpf-add-gen_epilogue-to-bpf_verifier_ops'

Martin KaFai Lau says:

====================
bpf: Add gen_epilogue to bpf_verifier_ops

From: Martin KaFai Lau <martin.lau@kernel.org>

This set allows the subsystem to patch codes before BPF_EXIT.
The verifier ops, .gen_epilogue, is added for this purpose.
One of the use case will be in the bpf qdisc, the bpf qdisc
subsystem can ensure the skb->dev is in the correct value.
The bpf qdisc subsystem can either inline fixing it in the
epilogue or call another kernel function to handle it (e.g. drop)
in the epilogue. Another use case could be in bpf_tcp_ca.c to
enforce snd_cwnd has valid value (e.g. positive value).

v5:
 * Removed the skip_cnt argument from adjust_jmp_off() in patch 2.
   Instead, reuse the delta argument and skip
   the [tgt_idx, tgt_idx + delta) instructions.
 * Added a BPF_JMP32_A macro in patch 3.
 * Removed pro_epilogue_subprog.c in patch 6.
   The pro_epilogue_kfunc.c has covered the subprog case.
   Renamed the file pro_epilogue_kfunc.c to pro_epilogue.c.
   Some of the SEC names and function names are changed
   accordingly (mainly shorten them by removing the _kfunc suffix).
 * Added comments to explain the tail_call result in patch 7.
 * Fixed the following bpf CI breakages. I ran it in CI
   manually to confirm:
   https://github.com/kernel-patches/bpf/actions/runs/10590714532
 * s390 zext added "w3 = w3". Adjusted the test to
   use all ALU64 and BPF_DW to avoid zext.
   Also changed the "int a" in the "struct st_ops_args" to "u64 a".
 * llvm17 does not take:
       *(u64 *)(r1 +0) = 0;
   so it is changed to:
       r3 = 0;
       *(u64 *)(r1 +0) = r3;

v4:
 * Fixed a bug in the memcpy in patch 3
   The size in the memcpy should be
   epilogue_cnt * sizeof(*epilogue_buf)

v3:
 * Moved epilogue_buf[16] to env.
   Patch 1 is added to move the existing insn_buf[16] to env.
 * Fixed a case that the bpf prog has a BPF_JMP that goes back
   to the first instruction of the main prog.
   The jump back to 1st insn case also applies to the prologue.
   Patch 2 is added to handle it.
 * If the bpf main prog has multiple BPF_EXIT, use a BPF_JA
   to goto the earlier patched epilogue.
   Note that there are (BPF_JMP32 | BPF_JA) vs (BPF_JMP | BPF_JA)
   details in the patch 3 commit message.
 * There are subtle changes in patch 3, so I reset the Reviewed-by.
 * Added patch 8 and patch 9 to cover the changes in patch 2 and patch 3.
 * Dropped the kfunc call from pro/epilogue and its selftests.

v2:
 * Remove the RFC tag. Keep the ordering at where .gen_epilogue is
   called in the verifier relative to the check_max_stack_depth().
   This will be consistent with the other extra stack_depth
   usage like optimize_bpf_loop().
 * Use __xlated check provided by the test_loader to
   check the patched instructions after gen_pro/epilogue (Eduard).
 * Added Patch 3 by Eduard (Thanks!).
====================

Link: https://lore.kernel.org/r/20240829210833.388152-1-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: Test epilogue patching when the main prog has multiple BPF_EXIT
Martin KaFai Lau [Thu, 29 Aug 2024 21:08:31 +0000 (14:08 -0700)]
selftests/bpf: Test epilogue patching when the main prog has multiple BPF_EXIT

This patch tests the epilogue patching when the main prog has
multiple BPF_EXIT. The verifier should have patched the 2nd (and
later) BPF_EXIT with a BPF_JA that goes back to the earlier
patched epilogue instructions.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-10-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: A pro/epilogue test when the main prog jumps back to the 1st insn
Martin KaFai Lau [Thu, 29 Aug 2024 21:08:30 +0000 (14:08 -0700)]
selftests/bpf: A pro/epilogue test when the main prog jumps back to the 1st insn

This patch adds a pro/epilogue test when the main prog has a goto insn
that goes back to the very first instruction of the prog. It is
to test the correctness of the adjust_jmp_off(prog, 0, delta)
after the verifier has applied the prologue and/or epilogue patch.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-9-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: Add tailcall epilogue test
Martin KaFai Lau [Thu, 29 Aug 2024 21:08:29 +0000 (14:08 -0700)]
selftests/bpf: Add tailcall epilogue test

This patch adds a gen_epilogue test to test a main prog
using a bpf_tail_call.

A non test_loader test is used. The tailcall target program,
"test_epilogue_subprog", needs to be used in a struct_ops map
before it can be loaded. Another struct_ops map is also needed
to host the actual "test_epilogue_tailcall" struct_ops program
that does the bpf_tail_call. The earlier test_loader patch
will attach all struct_ops maps but the bpf_testmod.c does
not support >1 attached struct_ops.

The earlier patch used the test_loader which has already covered
checking for the patched pro/epilogue instructions. This is done
by the __xlated tag.

This patch goes for the regular skel load and syscall test to do
the tailcall test that can also allow to directly pass the
the "struct st_ops_args *args" as ctx_in to the
SEC("syscall") program.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-8-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: Test gen_prologue and gen_epilogue
Martin KaFai Lau [Thu, 29 Aug 2024 21:08:28 +0000 (14:08 -0700)]
selftests/bpf: Test gen_prologue and gen_epilogue

This test adds a new struct_ops "bpf_testmod_st_ops" in bpf_testmod.
The ops of the bpf_testmod_st_ops is triggered by new kfunc calls
"bpf_kfunc_st_ops_test_*logue". These new kfunc calls are
primarily used by the SEC("syscall") program. The test triggering
sequence is like:
    SEC("syscall")
    syscall_prologue(struct st_ops_args *args)
        bpf_kfunc_st_op_test_prologue(args)
    st_ops->test_prologue(args)

.gen_prologue adds 1000 to args->a
.gen_epilogue adds 10000 to args->a
.gen_epilogue will also set the r0 to 2 * args->a.

The .gen_prologue and .gen_epilogue of the bpf_testmod_st_ops
will test the prog->aux->attach_func_name to decide if
it needs to generate codes.

The main programs of the pro_epilogue.c will call a
new kfunc bpf_kfunc_st_ops_inc10 which does "args->a += 10".
It will also call a subprog() which does "args->a += 1".

This patch uses the test_loader infra to check the __xlated
instructions patched after gen_prologue and/or gen_epilogue.
The __xlated check is based on Eduard's example (Thanks!) in v1.

args->a is returned by the struct_ops prog (either the main prog
or the epilogue). Thus, the __retval of the SEC("syscall") prog
is checked. For example, when triggering the ops in the
'SEC("struct_ops/test_epilogue") int test_epilogue'
The expected args->a is +1 (subprog call) + 10 (kfunc call)
                 + 10000 (.gen_epilogue) = 10011.
The expected return value is 2 * 10011 (.gen_epilogue).

Suggested-by: Eduard Zingerman <eddyz87@gmail.com>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-7-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: attach struct_ops maps before test prog runs
Eduard Zingerman [Thu, 29 Aug 2024 21:08:27 +0000 (14:08 -0700)]
selftests/bpf: attach struct_ops maps before test prog runs

In test_loader based tests to bpf_map__attach_struct_ops()
before call to bpf_prog_test_run_opts() in order to trigger
bpf_struct_ops->reg() callbacks on kernel side.
This allows to use __retval macro for struct_ops tests.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-6-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Export bpf_base_func_proto
Martin KaFai Lau [Thu, 29 Aug 2024 21:08:26 +0000 (14:08 -0700)]
bpf: Export bpf_base_func_proto

The bpf_testmod needs to use the bpf_tail_call helper in
a later selftest patch. This patch is to EXPORT_GPL_SYMBOL
the bpf_base_func_proto.

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-5-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Add gen_epilogue to bpf_verifier_ops
Martin KaFai Lau [Thu, 29 Aug 2024 21:08:25 +0000 (14:08 -0700)]
bpf: Add gen_epilogue to bpf_verifier_ops

This patch adds a .gen_epilogue to the bpf_verifier_ops. It is similar
to the existing .gen_prologue. Instead of allowing a subsystem
to run code at the beginning of a bpf prog, it allows the subsystem
to run code just before the bpf prog exit.

One of the use case is to allow the upcoming bpf qdisc to ensure that
the skb->dev is the same as the qdisc->dev_queue->dev. The bpf qdisc
struct_ops implementation could either fix it up or drop the skb.
Another use case could be in bpf_tcp_ca.c to enforce snd_cwnd
has sane value (e.g. non zero).

The epilogue can do the useful thing (like checking skb->dev) if it
can access the bpf prog's ctx. Unlike prologue, r1 may not hold the
ctx pointer. This patch saves the r1 in the stack if the .gen_epilogue
has returned some instructions in the "epilogue_buf".

The existing .gen_prologue is done in convert_ctx_accesses().
The new .gen_epilogue is done in the convert_ctx_accesses() also.
When it sees the (BPF_JMP | BPF_EXIT) instruction, it will be patched
with the earlier generated "epilogue_buf". The epilogue patching is
only done for the main prog.

Only one epilogue will be patched to the main program. When the
bpf prog has multiple BPF_EXIT instructions, a BPF_JA is used
to goto the earlier patched epilogue. Majority of the archs
support (BPF_JMP32 | BPF_JA): x86, arm, s390, risv64, loongarch,
powerpc and arc. This patch keeps it simple and always
use (BPF_JMP32 | BPF_JA). A new macro BPF_JMP32_A is added to
generate the (BPF_JMP32 | BPF_JA) insn.

Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-4-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Adjust BPF_JMP that jumps to the 1st insn of the prologue
Martin KaFai Lau [Thu, 29 Aug 2024 21:08:24 +0000 (14:08 -0700)]
bpf: Adjust BPF_JMP that jumps to the 1st insn of the prologue

The next patch will add a ctx ptr saving instruction
"(r1 = *(u64 *)(r10 -8)" at the beginning for the main prog
when there is an epilogue patch (by the .gen_epilogue() verifier
ops added in the next patch).

There is one corner case if the bpf prog has a BPF_JMP that jumps
to the 1st instruction. It needs an adjustment such that
those BPF_JMP instructions won't jump to the newly added
ctx saving instruction.
The commit 5337ac4c9b80 ("bpf: Fix the corner case with may_goto and jump to the 1st insn.")
has the details on this case.

Note that the jump back to 1st instruction is not limited to the
ctx ptr saving instruction. The same also applies to the prologue.
A later test, pro_epilogue_goto_start.c, has a test for the prologue
only case.

Thus, this patch does one adjustment after gen_prologue and
the future ctx ptr saving. It is done by
adjust_jmp_off(env->prog, 0, delta) where delta has the total
number of instructions in the prologue and
the future ctx ptr saving instruction.

The adjust_jmp_off(env->prog, 0, delta) assumes that the
prologue does not have a goto 1st instruction itself.
To accommodate the prologue might have a goto 1st insn itself,
this patch changes the adjust_jmp_off() to skip considering
the instructions between [tgt_idx, tgt_idx + delta).

Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-3-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Move insn_buf[16] to bpf_verifier_env
Martin KaFai Lau [Thu, 29 Aug 2024 21:08:23 +0000 (14:08 -0700)]
bpf: Move insn_buf[16] to bpf_verifier_env

This patch moves the 'struct bpf_insn insn_buf[16]' stack usage
to the bpf_verifier_env. A '#define INSN_BUF_SIZE 16' is also added
to replace the ARRAY_SIZE(insn_buf) usages.

Both convert_ctx_accesses() and do_misc_fixup() are changed
to use the env->insn_buf.

It is a refactoring work for adding the epilogue_buf[16] in a later patch.

With this patch, the stack size usage decreased.

Before:
./kernel/bpf/verifier.c:22133:5: warning: stack frame size (2584)

After:
./kernel/bpf/verifier.c:22184:5: warning: stack frame size (2264)

Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240829210833.388152-2-martin.lau@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Use kvmemdup to simplify the code
Hongbo Li [Wed, 28 Aug 2024 06:21:28 +0000 (14:21 +0800)]
bpf: Use kvmemdup to simplify the code

Use kvmemdup instead of kvmalloc() + memcpy() to simplify the
code.

No functional change intended.

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Hongbo Li <lihongbo22@huawei.com>
Link: https://lore.kernel.org/r/20240828062128.1223417-1-lihongbo22@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agodocs/bpf: Fix a typo in verifier.rst
Yiming Xiang [Thu, 29 Aug 2024 03:17:12 +0000 (23:17 -0400)]
docs/bpf: Fix a typo in verifier.rst

In verifier.rst, there is a typo in section 'Register parentage chains'.
Caller saved registers are r0-r5, callee saved registers are r6-r9.

Here by context it means callee saved registers rather than caller saved
registers. This may confuse users.

Signed-off-by: Yiming Xiang <kxiang@umich.edu>
Link: https://lore.kernel.org/r/20240829031712.198489-1-kxiang@umich.edu
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: Make sure stashed kptr in local kptr is freed recursively
Amery Hung [Tue, 27 Aug 2024 01:13:01 +0000 (01:13 +0000)]
selftests/bpf: Make sure stashed kptr in local kptr is freed recursively

When dropping a local kptr, any kptr stashed into it is supposed to be
freed through bpf_obj_free_fields->__bpf_obj_drop_impl recursively. Add a
test to make sure it happens.

The test first stashes a referenced kptr to "struct task" into a local
kptr and gets the reference count of the task. Then, it drops the local
kptr and reads the reference count of the task again. Since
bpf_obj_free_fields and __bpf_obj_drop_impl will go through the local kptr
recursively during bpf_obj_drop, the dtor of the stashed task kptr should
eventually be called. The second reference count should be one less than
the first one.

Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20240827011301.608620-1-amery.hung@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agolibbpf: Fix bpf_object__open_skeleton()'s mishandling of options
Andrii Nakryiko [Tue, 27 Aug 2024 20:37:21 +0000 (13:37 -0700)]
libbpf: Fix bpf_object__open_skeleton()'s mishandling of options

We do an ugly copying of options in bpf_object__open_skeleton() just to
be able to set object name from skeleton's recorded name (while still
allowing user to override it through opts->object_name).

This is not just ugly, but it also is broken due to memcpy() that
doesn't take into account potential skel_opts' and user-provided opts'
sizes differences due to backward and forward compatibility. This leads
to copying over extra bytes and then failing to validate options
properly. It could, technically, lead also to SIGSEGV, if we are unlucky.

So just get rid of that memory copy completely and instead pass
default object name into bpf_object_open() directly, simplifying all
this significantly. The rule now is that obj_name should be non-NULL for
bpf_object_open() when called with in-memory buffer, so validate that
explicitly as well.

We adopt bpf_object__open_mem() to this as well and generate default
name (based on buffer memory address and size) outside of bpf_object_open().

Fixes: d66562fba1ce ("libbpf: Add BPF object skeleton support")
Reported-by: Daniel Müller <deso@posteo.net>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Daniel Müller <deso@posteo.net>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/bpf/20240827203721.1145494-1-andrii@kernel.org
10 months agoselftests/bpf: Add test for zero offset or non-zero offset pointers as KF_ACQUIRE...
Juntong Deng [Wed, 28 Aug 2024 19:51:32 +0000 (20:51 +0100)]
selftests/bpf: Add test for zero offset or non-zero offset pointers as KF_ACQUIRE kfuncs argument

This patch adds test cases for zero offset (implicit cast) or non-zero
offset pointer as KF_ACQUIRE kfuncs argument. Currently KF_ACQUIRE
kfuncs should support passing in pointers like &sk->sk_write_queue
(non-zero offset) or &sk->__sk_common (zero offset) and not be rejected
by the verifier.

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Link: https://lore.kernel.org/r/AM6PR03MB5848CB6F0D4D9068669A905B99952@AM6PR03MB5848.eurprd03.prod.outlook.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Relax KF_ACQUIRE kfuncs strict type matching constraint
Juntong Deng [Wed, 28 Aug 2024 19:48:11 +0000 (20:48 +0100)]
bpf: Relax KF_ACQUIRE kfuncs strict type matching constraint

Currently we cannot pass zero offset (implicit cast) or non-zero offset
pointers to KF_ACQUIRE kfuncs. This is because KF_ACQUIRE kfuncs
requires strict type matching, but zero offset or non-zero offset does
not change the type of pointer, which causes the ebpf program to be
rejected by the verifier.

This can cause some problems, one example is that bpf_skb_peek_tail
kfunc [0] cannot be implemented by just passing in non-zero offset
pointers. We cannot pass pointers like &sk->sk_write_queue (non-zero
offset) or &sk->__sk_common (zero offset) to KF_ACQUIRE kfuncs.

This patch makes KF_ACQUIRE kfuncs not require strict type matching.

[0]: https://lore.kernel.org/bpf/AM6PR03MB5848CA39CB4B7A4397D380B099B12@AM6PR03MB5848.eurprd03.prod.outlook.com/

Signed-off-by: Juntong Deng <juntong.deng@outlook.com>
Link: https://lore.kernel.org/r/AM6PR03MB5848FD2BD89BF0B6B5AA3B4C99952@AM6PR03MB5848.eurprd03.prod.outlook.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: Fix incorrect parameters in NULL pointer checking
Hao Ge [Tue, 20 Aug 2024 02:36:22 +0000 (10:36 +0800)]
selftests/bpf: Fix incorrect parameters in NULL pointer checking

Smatch reported the following warning:
    ./tools/testing/selftests/bpf/testing_helpers.c:455 get_xlated_program()
    warn: variable dereferenced before check 'buf' (see line 454)

It seems correct,so let's modify it based on it's suggestion.

Actually,commit b23ed4d74c4d ("selftests/bpf: Fix invalid pointer
check in get_xlated_program()") fixed an issue in the test_verifier.c
once,but it was reverted this time.

Let's solve this issue with the minimal changes possible.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/1eb3732f-605a-479d-ba64-cd14250cbf91@stanley.mountain/
Fixes: b4b7a4099b8c ("selftests/bpf: Factor out get_xlated_program() helper")
Signed-off-by: Hao Ge <gehao@kylinos.cn>
Link: https://lore.kernel.org/r/20240820023622.29190-1-hao.ge@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoMerge branch 'bpf-arm64-simplify-jited-prologue-epilogue'
Alexei Starovoitov [Wed, 28 Aug 2024 15:41:34 +0000 (08:41 -0700)]
Merge branch 'bpf-arm64-simplify-jited-prologue-epilogue'

Xu Kuohai says:

====================
bpf, arm64: Simplify jited prologue/epilogue

From: Xu Kuohai <xukuohai@huawei.com>

The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result looks a bit too compliated. For example, for an empty
prog, the jited result is:

   0:   bti jc
   4:   mov     x9, lr
   8:   nop
   c:   paciasp
  10:   stp     fp, lr, [sp, #-16]!
  14:   mov     fp, sp
  18:   stp     x19, x20, [sp, #-16]!
  1c:   stp     x21, x22, [sp, #-16]!
  20:   stp     x26, x25, [sp, #-16]!
  24:   mov     x26, #0
  28:   stp     x26, x25, [sp, #-16]!
  2c:   mov     x26, sp
  30:   stp     x27, x28, [sp, #-16]!
  34:   mov     x25, sp
  38:   bti j  // tailcall target
  3c:   sub     sp, sp, #0
  40:   mov     x7, #0
  44:   add     sp, sp, #0
  48:   ldp     x27, x28, [sp], #16
  4c:   ldp     x26, x25, [sp], #16
  50:   ldp     x26, x25, [sp], #16
  54:   ldp     x21, x22, [sp], #16
  58:   ldp     x19, x20, [sp], #16
  5c:   ldp     fp, lr, [sp], #16
  60:   mov     x0, x7
  64:   autiasp
  68:   ret

Clearly, there is no need to save/restore unused callee-saved registers.
This patch does this change, making the jited image to only save/restore
the callee-saved registers it uses.

Now the jited result of empty prog is:

   0:   bti jc
   4:   mov     x9, lr
   8:   nop
   c:   paciasp
  10:   stp     fp, lr, [sp, #-16]!
  14:   mov     fp, sp
  18:   stp     xzr, x26, [sp, #-16]!
  1c:   mov     x26, sp
  20:   bti j // tailcall target
  24:   mov     x7, #0
  28:   ldp     xzr, x26, [sp], #16
  2c:   ldp     fp, lr, [sp], #16
  30:   mov     x0, x7
  34:   autiasp
  38:   ret
====================

Acked-by: Puranjay Mohan <puranjay@kernel.org>
Link: https://lore.kernel.org/r/20240826071624.350108-1-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf, arm64: Avoid blindly saving/restoring all callee-saved registers
Xu Kuohai [Mon, 26 Aug 2024 07:16:24 +0000 (15:16 +0800)]
bpf, arm64: Avoid blindly saving/restoring all callee-saved registers

The arm64 jit blindly saves/restores all callee-saved registers, making
the jited result looks a bit too compliated. For example, for an empty
prog, the jited result is:

   0:   bti jc
   4:   mov     x9, lr
   8:   nop
   c:   paciasp
  10:   stp     fp, lr, [sp, #-16]!
  14:   mov     fp, sp
  18:   stp     x19, x20, [sp, #-16]!
  1c:   stp     x21, x22, [sp, #-16]!
  20:   stp     x26, x25, [sp, #-16]!
  24:   mov     x26, #0
  28:   stp     x26, x25, [sp, #-16]!
  2c:   mov     x26, sp
  30:   stp     x27, x28, [sp, #-16]!
  34:   mov     x25, sp
  38:   bti j  // tailcall target
  3c:   sub     sp, sp, #0
  40:   mov     x7, #0
  44:   add     sp, sp, #0
  48:   ldp     x27, x28, [sp], #16
  4c:   ldp     x26, x25, [sp], #16
  50:   ldp     x26, x25, [sp], #16
  54:   ldp     x21, x22, [sp], #16
  58:   ldp     x19, x20, [sp], #16
  5c:   ldp     fp, lr, [sp], #16
  60:   mov     x0, x7
  64:   autiasp
  68:   ret

Clearly, there is no need to save/restore unused callee-saved registers.
This patch does this change, making the jited image to only save/restore
the callee-saved registers it uses.

Now the jited result of empty prog is:

   0:   bti jc
   4:   mov     x9, lr
   8:   nop
   c:   paciasp
  10:   stp     fp, lr, [sp, #-16]!
  14:   mov     fp, sp
  18:   stp     xzr, x26, [sp, #-16]!
  1c:   mov     x26, sp
  20:   bti j // tailcall target
  24:   mov     x7, #0
  28:   ldp     xzr, x26, [sp], #16
  2c:   ldp     fp, lr, [sp], #16
  30:   mov     x0, x7
  34:   autiasp
  38:   ret

Since bpf prog saves/restores its own callee-saved registers as needed,
to make tailcall work correctly, the caller needs to restore its saved
registers before tailcall, and the callee needs to save its callee-saved
registers after tailcall. This extra restoring/saving instructions
increases preformance overhead.

[1] provides 2 benchmarks for tailcall scenarios. Below is the perf
number measured in an arm64 KVM guest. The result indicates that the
performance difference before and after the patch in typical tailcall
scenarios is negligible.

- Before:

 Performance counter stats for './test_progs -t tailcalls' (5 runs):

           4313.43 msec task-clock                       #    0.874 CPUs utilized               ( +-  0.16% )
               574      context-switches                 #  133.073 /sec                        ( +-  1.14% )
                 0      cpu-migrations                   #    0.000 /sec
               538      page-faults                      #  124.727 /sec                        ( +-  0.57% )
       10697772784      cycles                           #    2.480 GHz                         ( +-  0.22% )  (61.19%)
       25511241955      instructions                     #    2.38  insn per cycle              ( +-  0.08% )  (66.70%)
        5108910557      branches                         #    1.184 G/sec                       ( +-  0.08% )  (72.38%)
           2800459      branch-misses                    #    0.05% of all branches             ( +-  0.51% )  (72.36%)
                        TopDownL1                 #     0.60 retiring                    ( +-  0.09% )  (66.84%)
                                                  #     0.21 frontend_bound              ( +-  0.15% )  (61.31%)
                                                  #     0.12 bad_speculation             ( +-  0.08% )  (50.11%)
                                                  #     0.07 backend_bound               ( +-  0.16% )  (33.30%)
        8274201819      L1-dcache-loads                  #    1.918 G/sec                       ( +-  0.18% )  (33.15%)
            468268      L1-dcache-load-misses            #    0.01% of all L1-dcache accesses   ( +-  4.69% )  (33.16%)
            385383      LLC-loads                        #   89.345 K/sec                       ( +-  5.22% )  (33.16%)
             38296      LLC-load-misses                  #    9.94% of all LL-cache accesses    ( +- 42.52% )  (38.69%)
        6886576501      L1-icache-loads                  #    1.597 G/sec                       ( +-  0.35% )  (38.69%)
           1848585      L1-icache-load-misses            #    0.03% of all L1-icache accesses   ( +-  4.52% )  (44.23%)
        9043645883      dTLB-loads                       #    2.097 G/sec                       ( +-  0.10% )  (44.33%)
            416672      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +-  5.15% )  (49.89%)
        6925626111      iTLB-loads                       #    1.606 G/sec                       ( +-  0.35% )  (55.46%)
             66220      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  1.88% )  (55.50%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

            4.9372 +- 0.0526 seconds time elapsed  ( +-  1.07% )

 Performance counter stats for './test_progs -t flow_dissector' (5 runs):

          10924.50 msec task-clock                       #    0.945 CPUs utilized               ( +-  0.08% )
               603      context-switches                 #   55.197 /sec                        ( +-  1.13% )
                 0      cpu-migrations                   #    0.000 /sec
               566      page-faults                      #   51.810 /sec                        ( +-  0.42% )
       27381270695      cycles                           #    2.506 GHz                         ( +-  0.18% )  (60.46%)
       56996583922      instructions                     #    2.08  insn per cycle              ( +-  0.21% )  (66.11%)
       10321647567      branches                         #  944.816 M/sec                       ( +-  0.17% )  (71.79%)
           3347735      branch-misses                    #    0.03% of all branches             ( +-  3.72% )  (72.15%)
                        TopDownL1                 #     0.52 retiring                    ( +-  0.13% )  (66.74%)
                                                  #     0.27 frontend_bound              ( +-  0.14% )  (61.27%)
                                                  #     0.14 bad_speculation             ( +-  0.19% )  (50.36%)
                                                  #     0.07 backend_bound               ( +-  0.42% )  (33.89%)
       18740797617      L1-dcache-loads                  #    1.715 G/sec                       ( +-  0.43% )  (33.71%)
          13715669      L1-dcache-load-misses            #    0.07% of all L1-dcache accesses   ( +- 32.85% )  (33.34%)
           4087551      LLC-loads                        #  374.164 K/sec                       ( +- 29.53% )  (33.26%)
            267906      LLC-load-misses                  #    6.55% of all LL-cache accesses    ( +- 23.90% )  (38.76%)
       15811864229      L1-icache-loads                  #    1.447 G/sec                       ( +-  0.12% )  (38.73%)
           2976833      L1-icache-load-misses            #    0.02% of all L1-icache accesses   ( +-  9.73% )  (44.22%)
       20138907471      dTLB-loads                       #    1.843 G/sec                       ( +-  0.18% )  (44.15%)
            732850      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +- 11.18% )  (49.64%)
       15895726702      iTLB-loads                       #    1.455 G/sec                       ( +-  0.15% )  (55.13%)
            152075      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  4.71% )  (54.98%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

           11.5613 +- 0.0317 seconds time elapsed  ( +-  0.27% )

- After:

 Performance counter stats for './test_progs -t tailcalls' (5 runs):

           4278.78 msec task-clock                       #    0.871 CPUs utilized               ( +-  0.15% )
               569      context-switches                 #  132.982 /sec                        ( +-  0.58% )
                 0      cpu-migrations                   #    0.000 /sec
               539      page-faults                      #  125.970 /sec                        ( +-  0.43% )
       10588986432      cycles                           #    2.475 GHz                         ( +-  0.20% )  (60.91%)
       25303825043      instructions                     #    2.39  insn per cycle              ( +-  0.08% )  (66.48%)
        5110756256      branches                         #    1.194 G/sec                       ( +-  0.07% )  (72.03%)
           2719569      branch-misses                    #    0.05% of all branches             ( +-  2.42% )  (72.03%)
                        TopDownL1                 #     0.60 retiring                    ( +-  0.22% )  (66.31%)
                                                  #     0.22 frontend_bound              ( +-  0.21% )  (60.83%)
                                                  #     0.12 bad_speculation             ( +-  0.26% )  (50.25%)
                                                  #     0.06 backend_bound               ( +-  0.17% )  (33.52%)
        8163648527      L1-dcache-loads                  #    1.908 G/sec                       ( +-  0.33% )  (33.52%)
            694979      L1-dcache-load-misses            #    0.01% of all L1-dcache accesses   ( +- 30.53% )  (33.52%)
           1902347      LLC-loads                        #  444.600 K/sec                       ( +- 48.84% )  (33.69%)
             96677      LLC-load-misses                  #    5.08% of all LL-cache accesses    ( +- 43.48% )  (39.30%)
        6863517589      L1-icache-loads                  #    1.604 G/sec                       ( +-  0.37% )  (39.17%)
           1871519      L1-icache-load-misses            #    0.03% of all L1-icache accesses   ( +-  6.78% )  (44.56%)
        8927782813      dTLB-loads                       #    2.087 G/sec                       ( +-  0.14% )  (44.37%)
            438237      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +-  6.00% )  (49.75%)
        6886906831      iTLB-loads                       #    1.610 G/sec                       ( +-  0.36% )  (55.08%)
             67568      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  3.27% )  (54.86%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

            4.9114 +- 0.0309 seconds time elapsed  ( +-  0.63% )

 Performance counter stats for './test_progs -t flow_dissector' (5 runs):

          10948.40 msec task-clock                       #    0.942 CPUs utilized               ( +-  0.05% )
               615      context-switches                 #   56.173 /sec                        ( +-  1.65% )
                 1      cpu-migrations                   #    0.091 /sec                        ( +- 31.62% )
               567      page-faults                      #   51.788 /sec                        ( +-  0.44% )
       27334194328      cycles                           #    2.497 GHz                         ( +-  0.08% )  (61.05%)
       56656528828      instructions                     #    2.07  insn per cycle              ( +-  0.08% )  (66.67%)
       10270389422      branches                         #  938.072 M/sec                       ( +-  0.10% )  (72.21%)
           3453837      branch-misses                    #    0.03% of all branches             ( +-  3.75% )  (72.27%)
                        TopDownL1                 #     0.52 retiring                    ( +-  0.16% )  (66.55%)
                                                  #     0.27 frontend_bound              ( +-  0.09% )  (60.91%)
                                                  #     0.14 bad_speculation             ( +-  0.08% )  (49.85%)
                                                  #     0.07 backend_bound               ( +-  0.16% )  (33.33%)
       18982866028      L1-dcache-loads                  #    1.734 G/sec                       ( +-  0.24% )  (33.34%)
           8802454      L1-dcache-load-misses            #    0.05% of all L1-dcache accesses   ( +- 52.30% )  (33.31%)
           2612962      LLC-loads                        #  238.661 K/sec                       ( +- 29.78% )  (33.45%)
            264107      LLC-load-misses                  #   10.11% of all LL-cache accesses    ( +- 18.34% )  (39.07%)
       15793205997      L1-icache-loads                  #    1.443 G/sec                       ( +-  0.15% )  (39.09%)
           3930802      L1-icache-load-misses            #    0.02% of all L1-icache accesses   ( +-  3.72% )  (44.66%)
       20097828496      dTLB-loads                       #    1.836 G/sec                       ( +-  0.09% )  (44.68%)
            961757      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +-  3.32% )  (50.15%)
       15838728506      iTLB-loads                       #    1.447 G/sec                       ( +-  0.09% )  (55.62%)
            167652      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +-  1.28% )  (55.52%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

           11.6173 +- 0.0268 seconds time elapsed  ( +-  0.23% )

[1] https://lore.kernel.org/bpf/20200724123644.5096-1-maciej.fijalkowski@intel.com/

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-3-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf, arm64: Get rid of fpb
Xu Kuohai [Mon, 26 Aug 2024 07:16:23 +0000 (15:16 +0800)]
bpf, arm64: Get rid of fpb

bpf prog accesses stack using BPF_FP as the base address and a negative
immediate number as offset. But arm64 ldr/str instructions only support
non-negative immediate number as offset. To simplify the jited result,
commit 5b3d19b9bd40 ("bpf, arm64: Adjust the offset of str/ldr(immediate)
to positive number") introduced FPB to represent the lowest stack address
that the bpf prog being jited may access, and with this address as the
baseline, it converts BPF_FP plus negative immediate offset number to FPB
plus non-negative immediate offset.

Considering that for a given bpf prog, the jited stack space is fixed
with A64_SP as the lowest address and BPF_FP as the highest address.
Thus we can get rid of FPB and converts BPF_FP plus negative immediate
offset to A64_SP plus non-negative immediate offset.

Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Link: https://lore.kernel.org/r/20240826071624.350108-2-xukuohai@huaweicloud.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agosched_ext: Add missing cfi stub for ops.tick
Tejun Heo [Wed, 28 Aug 2024 00:17:20 +0000 (14:17 -1000)]
sched_ext: Add missing cfi stub for ops.tick

The cfi stub for ops.tick was missing which will fail scheduler loading
after pending BPF changes. Add it.

Signed-off-by: Tejun Heo <tj@kernel.org>
10 months agosamples/bpf: tracex4: Fix failed to create kretprobe 'kmem_cache_alloc_node+0x0'
Rong Tao [Tue, 27 Aug 2024 04:30:30 +0000 (12:30 +0800)]
samples/bpf: tracex4: Fix failed to create kretprobe 'kmem_cache_alloc_node+0x0'

commit 7bd230a26648 ("mm/slab: enable slab allocation tagging for kmalloc
and friends") [1] swap kmem_cache_alloc_node() to
kmem_cache_alloc_node_noprof().

    linux/samples/bpf$ sudo ./tracex4
    libbpf: prog 'bpf_prog2': failed to create kretprobe
    'kmem_cache_alloc_node+0x0' perf event: No such file or directory
    ERROR: bpf_program__attach failed

Signed-off-by: Rong Tao <rongtao@cestc.cn>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://github.com/torvalds/linux/commit/7bd230a26648ac68ab3731ebbc449090f0ac6a37
Link: https://lore.kernel.org/bpf/tencent_34E5BCCAC5ABF3E81222AD81B1D05F16DE06@qq.com
10 months agoscx_central: Fix smatch checker warning
Tejun Heo [Tue, 27 Aug 2024 20:05:58 +0000 (10:05 -1000)]
scx_central: Fix smatch checker warning

ARRAY_ELEM_PTR() is an access macro used to help the BPF verifier not
confused by offseted memory acceeses by yiedling a valid pointer or NULL in
a way that's clear to the verifier. As such, the canonical usage involves
checking NULL return from the macro. Note that in many cases, the NULL
condition can never happen - they're there just to hint the verifier.

In a bpf_loop in scx_central.bpf.c::central_dispatch(), the NULL check was
incorrect in that there was another dereference of the pointer in addition
to the NULL checked access. This worked as the pointer can never be NULL and
the verifier could tell it would never be NULL in this case.

However, this still looks wrong and trips smatch:

  ./tools/sched_ext/scx_central.bpf.c:205 ____central_dispatch()
  error: we previously assumed 'gimme' could be null (see line 201)

  ./tools/sched_ext/scx_central.bpf.c
      195
      196                         if (!scx_bpf_dispatch_nr_slots())
      197                                 break;
      198
      199                         /* central's gimme is never set */
      200                         gimme = ARRAY_ELEM_PTR(cpu_gimme_task, cpu, nr_cpu_ids);
      201                         if (gimme && !*gimme)
      ^^^^^
  If gimme is NULL

      202                                 continue;
      203
      204                         if (dispatch_to_cpu(cpu))
  --> 205                                 *gimme = false;

Fix the NULL check so that there are no derefs if NULL. This doesn't change
actual behavior.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: http://lkml.kernel.org/r/<955e1c3c-ace2-4a1d-b246-15b8196038a3@stanley.mountain>
10 months agoselftests/bpf: Add tests for bpf_copy_from_user_str kfunc.
Jordan Rome [Fri, 23 Aug 2024 19:51:01 +0000 (12:51 -0700)]
selftests/bpf: Add tests for bpf_copy_from_user_str kfunc.

This adds tests for both the happy path and
the error path.

Signed-off-by: Jordan Rome <linux@jordanrome.com>
Link: https://lore.kernel.org/r/20240823195101.3621028-2-linux@jordanrome.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Add bpf_copy_from_user_str kfunc
Jordan Rome [Fri, 23 Aug 2024 19:51:00 +0000 (12:51 -0700)]
bpf: Add bpf_copy_from_user_str kfunc

This adds a kfunc wrapper around strncpy_from_user,
which can be called from sleepable BPF programs.

This matches the non-sleepable 'bpf_probe_read_user_str'
helper except it includes an additional 'flags'
param, which allows consumers to clear the entire
destination buffer on success or failure.

Signed-off-by: Jordan Rome <linux@jordanrome.com>
Link: https://lore.kernel.org/r/20240823195101.3621028-1-linux@jordanrome.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: use simply-expanded variables for libpcap flags
Eduard Zingerman [Fri, 23 Aug 2024 19:44:09 +0000 (12:44 -0700)]
selftests/bpf: use simply-expanded variables for libpcap flags

Save pkg-config output for libpcap as simply-expanded variables.
For an obscure reason 'shell' call in LDLIBS/CFLAGS recursively
expanded variables makes *.test.o files compilation non-parallel
when make is executed with -j option.

While at it, reuse 'pkg-config --cflags' call to define
-DTRAFFIC_MONITOR=1 option, it's exit status is the same as for
'pkg-config --exists'.

Fixes: f52403b6bfea ("selftests/bpf: Add traffic monitor functions.")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240823194409.774815-1-eddyz87@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
10 months agoMerge branch 'support-bpf_kptr_xchg-into-local-kptr'
Alexei Starovoitov [Fri, 23 Aug 2024 18:39:33 +0000 (11:39 -0700)]
Merge branch 'support-bpf_kptr_xchg-into-local-kptr'

Amery Hung says:

====================
Support bpf_kptr_xchg into local kptr

This revision adds substaintial changes to patch 2 to support structures
with kptr as the only special btf type. The test is split into
local_kptr_stash and task_kfunc_success to remove dependencies on
bpf_testmod that would break veristat results.

This series allows stashing kptr into local kptr. Currently, kptrs are
only allowed to be stashed into map value with bpf_kptr_xchg(). A
motivating use case of this series is to enable adding referenced kptr to
bpf_rbtree or bpf_list by using allocated object as graph node and the
storage of referenced kptr. For example, a bpf qdisc [0] enqueuing a
referenced kptr to a struct sk_buff* to a bpf_list serving as a fifo:

    struct skb_node {
            struct sk_buff __kptr *skb;
            struct bpf_list_node node;
    };

    private(A) struct bpf_spin_lock fifo_lock;
    private(A) struct bpf_list_head fifo __contains(skb_node, node);

    /* In Qdisc_ops.enqueue */
    struct skb_node *skbn;

    skbn = bpf_obj_new(typeof(*skbn));
    if (!skbn)
        goto drop;

    /* skb is a referenced kptr to struct sk_buff acquired earilier
     * but not shown in this code snippet.
     */
    skb = bpf_kptr_xchg(&skbn->skb, skb);
    if (skb)
        /* should not happen; do something below releasing skb to
         * satisfy the verifier */
     ...

    bpf_spin_lock(&fifo_lock);
    bpf_list_push_back(&fifo, &skbn->node);
    bpf_spin_unlock(&fifo_lock);

The implementation first searches for BPF_KPTR when generating program
BTF. Then, we teach the verifier that the detination argument of
bpf_kptr_xchg() can be local kptr, and use the btf_record in program BTF
to check against the source argument.

This series is mostly developed by Dave, who kindly helped and sent me
the patchset. The selftests in bpf qdisc (WIP) relies on this series to
work.

[0] https://lore.kernel.org/netdev/20240714175130.4051012-10-amery.hung@bytedance.com/
---
v3 -> v4
  - Allow struct in prog btf w/ kptr as the only special field type
  - Split tests of stashing referenced kptr and local kptr
  - v3: https://lore.kernel.org/bpf/20240809005131.3916464-1-amery.hung@bytedance.com/

v2 -> v3
  - Fix prog btf memory leak
  - Test stashing kptr in prog btf
  - Test unstashing kptrs after stashing into local kptrs
  - v2: https://lore.kernel.org/bpf/20240803001145.635887-1-amery.hung@bytedance.com/

v1 -> v2
  - Fix the document for bpf_kptr_xchg()
  - Add a comment explaining changes in the verifier
  - v1: https://lore.kernel.org/bpf/20240728030115.3970543-1-amery.hung@bytedance.com/
====================

Link: https://lore.kernel.org/r/20240813212424.2871455-1-amery.hung@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: Test bpf_kptr_xchg stashing into local kptr
Dave Marchevsky [Tue, 13 Aug 2024 21:24:24 +0000 (21:24 +0000)]
selftests/bpf: Test bpf_kptr_xchg stashing into local kptr

Test stashing both referenced kptr and local kptr into local kptrs. Then,
test unstashing them.

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Link: https://lore.kernel.org/r/20240813212424.2871455-6-amery.hung@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Support bpf_kptr_xchg into local kptr
Dave Marchevsky [Tue, 13 Aug 2024 21:24:23 +0000 (21:24 +0000)]
bpf: Support bpf_kptr_xchg into local kptr

Currently, users can only stash kptr into map values with bpf_kptr_xchg().
This patch further supports stashing kptr into local kptr by adding local
kptr as a valid destination type.

When stashing into local kptr, btf_record in program BTF is used instead
of btf_record in map to search for the btf_field of the local kptr.

The local kptr specific checks in check_reg_type() only apply when the
source argument of bpf_kptr_xchg() is local kptr. Therefore, we make the
scope of the check explicit as the destination now can also be local kptr.

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Link: https://lore.kernel.org/r/20240813212424.2871455-5-amery.hung@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Rename ARG_PTR_TO_KPTR -> ARG_KPTR_XCHG_DEST
Dave Marchevsky [Tue, 13 Aug 2024 21:24:22 +0000 (21:24 +0000)]
bpf: Rename ARG_PTR_TO_KPTR -> ARG_KPTR_XCHG_DEST

ARG_PTR_TO_KPTR is currently only used by the bpf_kptr_xchg helper.
Although it limits reg types for that helper's first arg to
PTR_TO_MAP_VALUE, any arbitrary mapval won't do: further custom
verification logic ensures that the mapval reg being xchgd-into is
pointing to a kptr field. If this is not the case, it's not safe to xchg
into that reg's pointee.

Let's rename the bpf_arg_type to more accurately describe the fairly
specific expectations that this arg type encodes.

This is a nonfunctional change.

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Link: https://lore.kernel.org/r/20240813212424.2871455-4-amery.hung@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Search for kptrs in prog BTF structs
Dave Marchevsky [Tue, 13 Aug 2024 21:24:21 +0000 (21:24 +0000)]
bpf: Search for kptrs in prog BTF structs

Currently btf_parse_fields is used in two places to create struct
btf_record's for structs: when looking at mapval type, and when looking
at any struct in program BTF. The former looks for kptr fields while the
latter does not. This patch modifies the btf_parse_fields call made when
looking at prog BTF struct types to search for kptrs as well.

Before this series there was no reason to search for kptrs in non-mapval
types: a referenced kptr needs some owner to guarantee resource cleanup,
and map values were the only owner that supported this. If a struct with
a kptr field were to have some non-kptr-aware owner, the kptr field
might not be properly cleaned up and result in resources leaking. Only
searching for kptr fields in mapval was a simple way to avoid this
problem.

In practice, though, searching for BPF_KPTR when populating
struct_meta_tab does not expose us to this risk, as struct_meta_tab is
only accessed through btf_find_struct_meta helper, and that helper is
only called in contexts where recognizing the kptr field is safe:

  * PTR_TO_BTF_ID reg w/ MEM_ALLOC flag
    * Such a reg is a local kptr and must be free'd via bpf_obj_drop,
      which will correctly handle kptr field

  * When handling specific kfuncs which either expect MEM_ALLOC input or
    return MEM_ALLOC output (obj_{new,drop}, percpu_obj_{new,drop},
    list+rbtree funcs, refcount_acquire)
     * Will correctly handle kptr field for same reasons as above

  * When looking at kptr pointee type
     * Called by functions which implement "correct kptr resource
       handling"

  * In btf_check_and_fixup_fields
     * Helper that ensures no ownership loops for lists and rbtrees,
       doesn't care about kptr field existence

So we should be able to find BPF_KPTR fields in all prog BTF structs
without leaking resources.

Further patches in the series will build on this change to support
kptr_xchg into non-mapval local kptr. Without this change there would be
no kptr field found in such a type.

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Dave Marchevsky <davemarchevsky@fb.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Link: https://lore.kernel.org/r/20240813212424.2871455-3-amery.hung@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Let callers of btf_parse_kptr() track life cycle of prog btf
Amery Hung [Tue, 13 Aug 2024 21:24:20 +0000 (21:24 +0000)]
bpf: Let callers of btf_parse_kptr() track life cycle of prog btf

btf_parse_kptr() and btf_record_free() do btf_get() and btf_put()
respectively when working on btf_record in program and map if there are
kptr fields. If the kptr is from program BTF, since both callers has
already tracked the life cycle of program BTF, it is safe to remove the
btf_get() and btf_put().

This change prevents memory leak of program BTF later when we start
searching for kptr fields when building btf_record for program. It can
happen when the btf fd is closed. The btf_put() corresponding to the
btf_get() in btf_parse_kptr() was supposed to be called by
btf_record_free() in btf_free_struct_meta_tab() in btf_free(). However,
it will never happen since the invocation of btf_free() depends on the
refcount of the btf to become 0 in the first place.

Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Amery Hung <amery.hung@bytedance.com>
Link: https://lore.kernel.org/r/20240813212424.2871455-2-amery.hung@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: add multi-uprobe benchmarks
Andrii Nakryiko [Tue, 6 Aug 2024 04:29:35 +0000 (21:29 -0700)]
selftests/bpf: add multi-uprobe benchmarks

Add multi-uprobe and multi-uretprobe benchmarks to bench tool.
Multi- and classic uprobes/uretprobes have different low-level
triggering code paths, so it's sometimes important to be able to
benchmark both flavors of uprobes/uretprobes.

Sample examples from my dev machine below. Single-threaded peformance
almost doesn't differ, but with more parallel CPUs triggering the same
uprobe/uretprobe the difference grows. This might be due to [0], but
given the code is slightly different, there could be other sources of
slowdown.

Note, all these numbers will change due to ongoing work to improve
uprobe/uretprobe scalability (e.g., [1]), but having benchmark like this
is useful for measurements and debugging nevertheless.

\#!/bin/bash
set -eufo pipefail
for p in 1 8 16 32; do
    for i in uprobe-nop uretprobe-nop uprobe-multi-nop uretprobe-multi-nop; do
        summary=$(sudo ./bench -w1 -d3 -p$p -a trig-$i | tail -n1)
        total=$(echo "$summary" | cut -d'(' -f1 | cut -d' ' -f3-)
        percpu=$(echo "$summary" | cut -d'(' -f2 | cut -d')' -f1 | cut -d'/' -f1)
        printf "%-21s (%2d cpus): %s (%s/s/cpu)\n" $i $p "$total" "$percpu"
    done
    echo
done

uprobe-nop            ( 1 cpus):    1.020 ± 0.005M/s  (  1.020M/s/cpu)
uretprobe-nop         ( 1 cpus):    0.515 ± 0.009M/s  (  0.515M/s/cpu)
uprobe-multi-nop      ( 1 cpus):    1.036 ± 0.004M/s  (  1.036M/s/cpu)
uretprobe-multi-nop   ( 1 cpus):    0.512 ± 0.005M/s  (  0.512M/s/cpu)

uprobe-nop            ( 8 cpus):    3.481 ± 0.030M/s  (  0.435M/s/cpu)
uretprobe-nop         ( 8 cpus):    2.222 ± 0.008M/s  (  0.278M/s/cpu)
uprobe-multi-nop      ( 8 cpus):    3.769 ± 0.094M/s  (  0.471M/s/cpu)
uretprobe-multi-nop   ( 8 cpus):    2.482 ± 0.007M/s  (  0.310M/s/cpu)

uprobe-nop            (16 cpus):    2.968 ± 0.011M/s  (  0.185M/s/cpu)
uretprobe-nop         (16 cpus):    1.870 ± 0.002M/s  (  0.117M/s/cpu)
uprobe-multi-nop      (16 cpus):    3.541 ± 0.037M/s  (  0.221M/s/cpu)
uretprobe-multi-nop   (16 cpus):    2.123 ± 0.026M/s  (  0.133M/s/cpu)

uprobe-nop            (32 cpus):    2.524 ± 0.026M/s  (  0.079M/s/cpu)
uretprobe-nop         (32 cpus):    1.572 ± 0.003M/s  (  0.049M/s/cpu)
uprobe-multi-nop      (32 cpus):    2.717 ± 0.003M/s  (  0.085M/s/cpu)
uretprobe-multi-nop   (32 cpus):    1.687 ± 0.007M/s  (  0.053M/s/cpu)

  [0] https://lore.kernel.org/linux-trace-kernel/20240805202803.1813090-1-andrii@kernel.org/
  [1] https://lore.kernel.org/linux-trace-kernel/20240731214256.3588718-1-andrii@kernel.org/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240806042935.3867862-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: make use of PROCMAP_QUERY ioctl if available
Andrii Nakryiko [Tue, 6 Aug 2024 23:03:19 +0000 (16:03 -0700)]
selftests/bpf: make use of PROCMAP_QUERY ioctl if available

Instead of parsing text-based /proc/<pid>/maps file, try to use
PROCMAP_QUERY ioctl() to simplify and speed up data fetching.
This logic is used to do uprobe file offset calculation, so any bugs in
this logic would manifest as failing uprobe BPF selftests.

This also serves as a simple demonstration of one of the intended uses.

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/r/20240806230319.869734-1-andrii@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoMerge branch 'follow-up-for-__jited-test-tag'
Alexei Starovoitov [Fri, 23 Aug 2024 14:29:03 +0000 (07:29 -0700)]
Merge branch 'follow-up-for-__jited-test-tag'

Eduard Zingerman says:

====================
follow up for __jited test tag

This patch-set is a collection of follow-ups for
"__jited test tag to check disassembly after jit" series (see [1]).

First patch is most important:
as it turns out, I broke all test_loader based tests for s390 CI.
E.g. see log [2] for s390 execution of test_progs,
note all 'verivier_*' tests being skipped.
This happens because of incorrect handling of corner case when
get_current_arch() does not know which architecture to return.

Second patch makes matching of function return sequence in
verifier_tailcall_jit more flexible:

    -__jited(" retq")
    +__jited(" {{(retq|jmp 0x)}}")

The difference could be seen with and w/o mitigations=off boot
parameter for test VM (CI runs with mitigations=off, hence it
generates retq).

Third patch addresses Alexei's request to add #define and a comment in
jit_disasm_helpers.c.

[1] https://lore.kernel.org/bpf/20240820102357.3372779-1-eddyz87@gmail.com/
[2] https://github.com/kernel-patches/bpf/actions/runs/10518445973/job/29144511595
====================

Link: https://lore.kernel.org/r/20240823080644.263943-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: #define LOCAL_LABEL_LEN for jit_disasm_helpers.c
Eduard Zingerman [Fri, 23 Aug 2024 08:06:44 +0000 (01:06 -0700)]
selftests/bpf: #define LOCAL_LABEL_LEN for jit_disasm_helpers.c

Extract local label length as a #define directive and
elaborate why 'i % MAX_LOCAL_LABELS' expression is needed
for local labels array initialization.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240823080644.263943-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: match both retq/rethunk in verifier_tailcall_jit
Eduard Zingerman [Fri, 23 Aug 2024 08:06:43 +0000 (01:06 -0700)]
selftests/bpf: match both retq/rethunk in verifier_tailcall_jit

Depending on kernel parameters, x86 jit generates either retq or jump
to rethunk for 'exit' instruction. The difference could be seen when
kernel is booted with and without mitigations=off parameter.
Relax the verifier_tailcall_jit test case to match both variants.

Fixes: e5bdd6a8be78 ("selftests/bpf: validate jit behaviour for tail calls")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240823080644.263943-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: test_loader.c:get_current_arch() should not return 0
Eduard Zingerman [Fri, 23 Aug 2024 08:06:42 +0000 (01:06 -0700)]
selftests/bpf: test_loader.c:get_current_arch() should not return 0

At the moment, when test_loader.c:get_current_arch() can't determine
the arch, it returns 0. The arch check in run_subtest() looks as
follows:

if ((get_current_arch() & spec->arch_mask) == 0) {
test__skip();
return;
}

Which means that all test_loader based tests would be skipped if arch
could not be determined. get_current_arch() recognizes x86_64, arm64
and riscv64. Which means that CI skips test_loader tests for s390.

Fix this by making sure that get_current_arch() always returns
non-zero value. In combination with default spec->arch_mask == -1 this
should cover all possibilities.

Fixes: f406026fefa7 ("selftests/bpf: by default use arch mask allowing all archs")
Fixes: 7d743e4c759c ("selftests/bpf: __jited test tag to check disassembly after jit")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240823080644.263943-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: Add testcase for updating attached freplace prog to prog_array map
Leon Hwang [Sun, 28 Jul 2024 11:46:12 +0000 (19:46 +0800)]
selftests/bpf: Add testcase for updating attached freplace prog to prog_array map

Add a selftest to confirm the issue, which gets -EINVAL when update
attached freplace prog to prog_array map, has been fixed.

cd tools/testing/selftests/bpf; ./test_progs -t tailcalls
328/25  tailcalls/tailcall_freplace:OK
328     tailcalls:OK
Summary: 1/25 PASSED, 0 SKIPPED, 0 FAILED

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20240728114612.48486-3-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov [Thu, 22 Aug 2024 16:27:59 +0000 (09:27 -0700)]
Merge git://git./pub/scm/linux/kernel/git/bpf/bpf

Cross-merge bpf fixes after downstream PR including
important fixes (from bpf-next point of view):
commit 41c24102af7b ("selftests/bpf: Filter out _GNU_SOURCE when compiling test_cpp")
commit fdad456cbcca ("bpf: Fix updating attached freplace prog in prog_array map")

No conflicts.

Adjacent changes in:
include/linux/bpf_verifier.h
kernel/bpf/verifier.c
tools/testing/selftests/bpf/Makefile

Link: https://lore.kernel.org/bpf/20240813234307.82773-1-alexei.starovoitov@gmail.com/
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoMerge branch 'support-bpf_fastcall-patterns-for-calls-to-kfuncs'
Alexei Starovoitov [Thu, 22 Aug 2024 15:35:21 +0000 (08:35 -0700)]
Merge branch 'support-bpf_fastcall-patterns-for-calls-to-kfuncs'

Eduard Zingerman says:

====================
support bpf_fastcall patterns for calls to kfuncs

As an extension of [1], allow bpf_fastcall patterns for kfuncs:
- pattern rules are the same as for helpers;
- spill/fill removal is allowed only for kfuncs listed in the
  is_fastcall_kfunc_call (under assumption that such kfuncs would
  always be members of special_kfunc_list).

Allow bpf_fastcall rewrite for bpf_cast_to_kern_ctx() and
bpf_rdonly_cast() in order to conjure selftests for this feature.

After this patch-set verifier would rewrite the program below:

  r2 = 1
  *(u64 *)(r10 - 32) = r2
  call %[bpf_cast_to_kern_ctx]
  r2 = *(u64 *)(r10 - 32)
  r0 = r2;"

As follows:

  r2 = 1   /* spill/fill at r10[-32] is removed */
  r0 = r1  /* replacement for bpf_cast_to_kern_ctx() */
  r0 = r2
  exit

Also, attribute used by LLVM implementation of the feature had been
changed from no_caller_saved_registers to bpf_fastcall (see [2]).
This patch-set replaces references to nocsr by references to
bpf_fastcall to keep LLVM and Kernel parts in sync.

[1] no_caller_saved_registers attribute for helper calls
    https://lore.kernel.org/bpf/20240722233844.1406874-1-eddyz87@gmail.com/
[2] [BPF] introduce __attribute__((bpf_fastcall))
    https://github.com/llvm/llvm-project/pull/105417

Changes v2->v3:
- added a patch fixing arch_mask handling in test_loader,
  otherwise newly added tests for the feature were skipped
  (a fix for regression introduced by a recent commit);
- fixed warning regarding unused 'params' variable;
- applied stylistical fixes suggested by Yonghong;
- added acks from Yonghong;

Changes v1->v2:
- added two patches replacing all mentions of nocsr by bpf_fastcall
  (suggested by Andrii);
- removed KF_NOCSR flag (suggested by Yonghong).

v1: https://lore.kernel.org/bpf/20240812234356.2089263-1-eddyz87@gmail.com/
v2: https://lore.kernel.org/bpf/20240817015140.1039351-1-eddyz87@gmail.com/
====================

Link: https://lore.kernel.org/r/20240822084112.3257995-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: check if bpf_fastcall is recognized for kfuncs
Eduard Zingerman [Thu, 22 Aug 2024 08:41:12 +0000 (01:41 -0700)]
selftests/bpf: check if bpf_fastcall is recognized for kfuncs

Use kfunc_bpf_cast_to_kern_ctx() and kfunc_bpf_rdonly_cast() to verify
that bpf_fastcall pattern is recognized for kfunc calls.

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822084112.3257995-7-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: by default use arch mask allowing all archs
Eduard Zingerman [Thu, 22 Aug 2024 08:41:11 +0000 (01:41 -0700)]
selftests/bpf: by default use arch mask allowing all archs

If test case does not specify architecture via __arch_* macro consider
that it should be run for all architectures.

Fixes: 7d743e4c759c ("selftests/bpf: __jited test tag to check disassembly after jit")
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822084112.3257995-6-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: allow bpf_fastcall for bpf_cast_to_kern_ctx and bpf_rdonly_cast
Eduard Zingerman [Thu, 22 Aug 2024 08:41:10 +0000 (01:41 -0700)]
bpf: allow bpf_fastcall for bpf_cast_to_kern_ctx and bpf_rdonly_cast

do_misc_fixups() relaces bpf_cast_to_kern_ctx() and bpf_rdonly_cast()
by a single instruction "r0 = r1". This follows bpf_fastcall contract.
This commit allows bpf_fastcall pattern rewrite for these two
functions in order to use them in bpf_fastcall selftests.

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822084112.3257995-5-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: support bpf_fastcall patterns for kfuncs
Eduard Zingerman [Thu, 22 Aug 2024 08:41:09 +0000 (01:41 -0700)]
bpf: support bpf_fastcall patterns for kfuncs

Recognize bpf_fastcall patterns around kfunc calls.
For example, suppose bpf_cast_to_kern_ctx() follows bpf_fastcall
contract (which it does), in such a case allow verifier to rewrite BPF
program below:

  r2 = 1;
  *(u64 *)(r10 - 32) = r2;
  call %[bpf_cast_to_kern_ctx];
  r2 = *(u64 *)(r10 - 32);
  r0 = r2;

By removing the spill/fill pair:

  r2 = 1;
  call %[bpf_cast_to_kern_ctx];
  r0 = r2;

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822084112.3257995-4-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: rename nocsr -> bpf_fastcall in selftests
Eduard Zingerman [Thu, 22 Aug 2024 08:41:08 +0000 (01:41 -0700)]
selftests/bpf: rename nocsr -> bpf_fastcall in selftests

Attribute used by LLVM implementation of the feature had been changed
from no_caller_saved_registers to bpf_fastcall (see [1]).
This commit replaces references to nocsr by references to bpf_fastcall
to keep LLVM and selftests parts in sync.

[1] https://github.com/llvm/llvm-project/pull/105417

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822084112.3257995-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: rename nocsr -> bpf_fastcall in verifier
Eduard Zingerman [Thu, 22 Aug 2024 08:41:07 +0000 (01:41 -0700)]
bpf: rename nocsr -> bpf_fastcall in verifier

Attribute used by LLVM implementation of the feature had been changed
from no_caller_saved_registers to bpf_fastcall (see [1]).
This commit replaces references to nocsr by references to bpf_fastcall
to keep LLVM and Kernel parts in sync.

[1] https://github.com/llvm/llvm-project/pull/105417

Acked-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822084112.3257995-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: Fix percpu address space issues
Uros Bizjak [Sun, 11 Aug 2024 16:13:33 +0000 (18:13 +0200)]
bpf: Fix percpu address space issues

In arraymap.c:

In bpf_array_map_seq_start() and bpf_array_map_seq_next()
cast return values from the __percpu address space to
the generic address space via uintptr_t [1].

Correct the declaration of pptr pointer in __bpf_array_map_seq_show()
to void __percpu * and cast the value from the generic address
space to the __percpu address space via uintptr_t [1].

In hashtab.c:

Assign the return value from bpf_mem_cache_alloc() to void pointer
and cast the value to void __percpu ** (void pointer to percpu void
pointer) before dereferencing.

In memalloc.c:

Explicitly declare __percpu variables.

Cast obj to void __percpu **.

In helpers.c:

Cast ptr in BPF_CALL_1 and BPF_CALL_2 from generic address space
to __percpu address space via const uintptr_t [1].

Found by GCC's named address space checks.

There were no changes in the resulting object files.

[1] https://sparse.docs.kernel.org/en/latest/annotations.html#address-space-name

Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Song Liu <song@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: KP Singh <kpsingh@kernel.org>
Cc: Stanislav Fomichev <sdf@fomichev.me>
Cc: Hao Luo <haoluo@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Acked-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240811161414.56744-1-ubizjak@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoMerge branch 'bpf-fix-null-pointer-access-for-malformed-bpf_core_type_id_local-relos'
Alexei Starovoitov [Thu, 22 Aug 2024 15:00:26 +0000 (08:00 -0700)]
Merge branch 'bpf-fix-null-pointer-access-for-malformed-bpf_core_type_id_local-relos'

Eduard Zingerman says:

====================
bpf: fix null pointer access for malformed BPF_CORE_TYPE_ID_LOCAL relos

Liu RuiTong reported an in-kernel null pointer derefence when
processing BPF_CORE_TYPE_ID_LOCAL relocations referencing non-existing
BTF types. Fix this by adding proper id checks.

Changes v2->v3:
- selftest update suggested by Andrii:
  avoid memset(0) for log buffer and do memset(0) for bpf_attr.

Changes v1->v2:
- moved check from bpf_core_calc_relo_insn() to bpf_core_apply()
  now both in kernel and in libbpf relocation type id is guaranteed
  to exist when bpf_core_calc_relo_insn() is called;
- added a test case.

v1: https://lore.kernel.org/bpf/20240821164620.1056362-1-eddyz87@gmail.com/
v2: https://lore.kernel.org/bpf/20240822001837.2715909-1-eddyz87@gmail.com/
====================

Link: https://lore.kernel.org/r/20240822080124.2995724-1-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agoselftests/bpf: test for malformed BPF_CORE_TYPE_ID_LOCAL relocation
Eduard Zingerman [Thu, 22 Aug 2024 08:01:24 +0000 (01:01 -0700)]
selftests/bpf: test for malformed BPF_CORE_TYPE_ID_LOCAL relocation

Check that verifier rejects BPF program containing relocation
pointing to non-existent BTF type.

To force relocation resolution on kernel side test case uses
bpf_attr->core_relos field. This field is not exposed by libbpf,
so directly do BPF system call in the test.

Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822080124.2995724-3-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
10 months agobpf: correctly handle malformed BPF_CORE_TYPE_ID_LOCAL relos
Eduard Zingerman [Thu, 22 Aug 2024 08:01:23 +0000 (01:01 -0700)]
bpf: correctly handle malformed BPF_CORE_TYPE_ID_LOCAL relos

In case of malformed relocation record of kind BPF_CORE_TYPE_ID_LOCAL
referencing a non-existing BTF type, function bpf_core_calc_relo_insn
would cause a null pointer deference.

Fix this by adding a proper check upper in call stack, as malformed
relocation records could be passed from user space.

Simplest reproducer is a program:

    r0 = 0
    exit

With a single relocation record:

    .insn_off = 0,          /* patch first instruction */
    .type_id = 100500,      /* this type id does not exist */
    .access_str_off = 6,    /* offset of string "0" */
    .kind = BPF_CORE_TYPE_ID_LOCAL,

See the link for original reproducer or next commit for a test case.

Fixes: 74753e1462e7 ("libbpf: Replace btf__type_by_id() with btf_type_by_id().")
Reported-by: Liu RuiTong <cnitlrt@gmail.com>
Closes: https://lore.kernel.org/bpf/CAK55_s6do7C+DVwbwY_7nKfUz0YLDoiA1v6X3Y9+p0sWzipFSA@mail.gmail.com/
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20240822080124.2995724-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>