bpf: Use fake pt_regs when doing bpf syscall tracepoint tracing
author Yonghong Song <yonghong.song@linux.dev>
Tue, 10 Sep 2024 21:40:37 +0000 (14:40 -0700)
committer Andrii Nakryiko <andrii@kernel.org>
Wed, 11 Sep 2024 20:27:27 +0000 (13:27 -0700)
commit 376bd59e2a0404b09767cc991cf5aed394cf0cf2
tree f0cffae955976d4032b016a653cdcd3d7227217d
parent 2bea33f907a0185b3341075d764ab5f45334e0cc

Salvatore Benedetto reported that the kernel stack is empty when tracing
syscall tracepoints. For example, with the following command lines
  bpftrace -e 'tracepoint:syscalls:sys_enter_read { print("Kernel Stack\n"); print(kstack()); }'
  bpftrace -e 'tracepoint:syscalls:sys_exit_read { print("Kernel Stack\n"); print(kstack()); }'
the output for both commands is
===
  Kernel Stack
===

Further analysis shows that the pt_regs used for bpf syscall tracepoint
tracing is the one constructed during the user->kernel transition.
The call stack looks like
  perf_syscall_enter+0x88/0x7c0
  trace_sys_enter+0x41/0x80
  syscall_trace_enter+0x100/0x160
  do_syscall_64+0x38/0xf0
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

The ip stored in that pt_regs is a user-space address, hence no kernel
stack is printed.
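
The effect can be illustrated with a small user-space model. This is only a
sketch under stated assumptions: struct model_regs, model_user_mode() and
model_kernel_stack() are hypothetical names, not kernel APIs, and the model
assumes the kernel stack walker skips register sets that describe user mode
(on x86-64, a non-zero privilege level in the low two bits of cs).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the kernel's struct pt_regs. */
struct model_regs {
    uint64_t ip;  /* instruction pointer */
    uint64_t cs;  /* code segment; low two bits hold the privilege level */
};

/* Assumption: a non-zero privilege level means the registers were
 * captured while executing in user space. */
static bool model_user_mode(const struct model_regs *regs)
{
    return (regs->cs & 3) != 0;
}

/* Hypothetical kernel-stack walker. It records a frame only when the
 * register set describes kernel context, and returns the frame count. */
static size_t model_kernel_stack(const struct model_regs *regs,
                                 uint64_t *frames, size_t max_frames)
{
    if (model_user_mode(regs) || max_frames == 0)
        return 0;          /* user-mode regs: no kernel frames to report */
    frames[0] = regs->ip;  /* a real unwinder would keep walking from here */
    return 1;
}
```

With a register set captured at syscall entry (user ip, user cs), the model
returns zero frames, which matches the empty "Kernel Stack" output above. A
register set that carries a kernel ip and a kernel cs yields at least one frame.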

To fix the issue, a pt_regs holding a kernel address is required.
The kernel already has a few cases like this. For example, in
kernel/trace/bpf_trace.c, several perf_fetch_caller_regs(fake_regs_ptr)
calls are used to supply the ip address or to construct the call stack
from it.

Instead of allocating fake_regs on the stack, which may consume
a lot of bytes, the function perf_trace_buf_alloc() in
perf_syscall_{enter,exit}() is leveraged to create fake_regs,
which is then passed to perf_call_bpf_{enter,exit}().
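
As a rough user-space analogue of this approach (not the kernel code itself),
the sketch below fills a preallocated register snapshot with the caller's
context instead of placing the structure on the stack. struct snap_regs,
scratch_regs and snap_fetch_caller_regs() are hypothetical names; the static
buffer merely stands in for the storage obtained via perf_trace_buf_alloc(),
and the GCC/Clang builtins stand in for what perf_fetch_caller_regs() does
in the kernel.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct pt_regs: only the fields needed
 * to start a stack walk. */
struct snap_regs {
    uintptr_t ip;  /* address the walk starts from */
    uintptr_t fp;  /* frame pointer at capture time */
};

/* Preallocated scratch storage, so the snapshot does not consume
 * stack space in the tracing path. */
static struct snap_regs scratch_regs;

/* Capture the caller's context into the scratch buffer. noinline keeps
 * the return address pointing into the real caller. */
__attribute__((noinline))
static struct snap_regs *snap_fetch_caller_regs(void)
{
    scratch_regs.ip = (uintptr_t)__builtin_return_address(0);
    scratch_regs.fp = (uintptr_t)__builtin_frame_address(0);
    return &scratch_regs;
}
```

A caller obtains the snapshot with snap_fetch_caller_regs() and hands the
returned pointer to whatever consumes the registers, much as fake_regs is
handed to perf_call_bpf_{enter,exit}() in the patch.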

For the above bpftrace script, I got the following output with this patch:
for tracepoint:syscalls:sys_enter_read
===
  Kernel Stack

        syscall_trace_enter+407
        syscall_trace_enter+407
        do_syscall_64+74
        entry_SYSCALL_64_after_hwframe+75
===
and for tracepoint:syscalls:sys_exit_read
===
  Kernel Stack

        syscall_exit_work+185
        syscall_exit_work+185
        syscall_exit_to_user_mode+305
        do_syscall_64+118
        entry_SYSCALL_64_after_hwframe+75
===

Reported-by: Salvatore Benedetto <salvabenedetto@meta.com>
Suggested-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240910214037.3663272-1-yonghong.song@linux.dev
kernel/trace/trace_syscalls.c