riscv: Implement HAVE_DYNAMIC_FTRACE_WITH_CALL_OPS
This patch enables support for DYNAMIC_FTRACE_WITH_CALL_OPS on RISC-V.
This allows each ftrace callsite to provide an ftrace_ops to the common
ftrace trampoline, allowing each callsite to invoke distinct tracer
functions without the need to fall back to list processing or to
allocate custom trampolines for each callsite. This significantly speeds
up cases where multiple distinct trace functions are used and callsites
are mostly traced by a single tracer.
The idea and most of the implementation is taken from the ARM64's
implementation of the same feature. The idea is to place a pointer to
the ftrace_ops as a literal at a fixed offset from the function entry
point, which can be recovered by the common ftrace trampoline.
We use -fpatchable-function-entry to reserve 8 bytes above the function
entry by emitting 2 4 byte or 4 2 byte nops depending on the presence of
CONFIG_RISCV_ISA_C. These 8 bytes are patched at runtime with a pointer
to the associated ftrace_ops for that callsite. Functions are aligned to
8 bytes to make sure that the accesses to this literal are atomic.
This approach allows for directly invoking ftrace_ops::func even for
ftrace_ops which are dynamically-allocated (or part of a module),
without going via ftrace_ops_list_func.
We've benchamrked this with the ftrace_ops sample module on Spacemit K1
Jupiter:
Without this patch:
baseline (Linux rivos
6.14.0-09584-g7d06015d936c #3 SMP Sat Mar 29
+-----------------------+-----------------+----------------------------+
| Number of tracers | Total time (ns) | Per-call average time |
|-----------------------+-----------------+----------------------------|
| Relevant | Irrelevant | 100000 calls | Total (ns) | Overhead (ns) |
|----------+------------+-----------------+------------+---------------|
| 0 | 0 |
1357958 | 13 | - |
| 0 | 1 |
1302375 | 13 | - |
| 0 | 2 |
1302375 | 13 | - |
| 0 | 10 |
1379084 | 13 | - |
| 0 | 100 |
1302458 | 13 | - |
| 0 | 200 |
1302333 | 13 | - |
|----------+------------+-----------------+------------+---------------|
| 1 | 0 |
13677833 | 136 | 123 |
| 1 | 1 |
18500916 | 185 | 172 |
| 1 | 2 |
22856459 | 228 | 215 |
| 1 | 10 |
58824709 | 588 | 575 |
| 1 | 100 |
505141584 | 5051 | 5038 |
| 1 | 200 |
1580473126 | 15804 | 15791 |
|----------+------------+-----------------+------------+---------------|
| 1 | 0 |
13561000 | 135 | 122 |
| 2 | 0 |
19707292 | 197 | 184 |
| 10 | 0 |
67774750 | 677 | 664 |
| 100 | 0 |
714123125 | 7141 | 7128 |
| 200 | 0 |
1918065668 | 19180 | 19167 |
+----------+------------+-----------------+------------+---------------+
Note: per-call overhead is estimated relative to the baseline case with
0 relevant tracers and 0 irrelevant tracers.
With this patch:
v4-rc4 (Linux rivos
6.14.0-09598-gd75747611c93 #4 SMP Sat Mar 29
+-----------------------+-----------------+----------------------------+
| Number of tracers | Total time (ns) | Per-call average time |
|-----------------------+-----------------+----------------------------|
| Relevant | Irrelevant | 100000 calls | Total (ns) | Overhead (ns) |
|----------+------------+-----------------+------------+---------------|
| 0 | 0 |
1459917 | 14 | - |
| 0 | 1 |
1408000 | 14 | - |
| 0 | 2 |
1383792 | 13 | - |
| 0 | 10 |
1430709 | 14 | - |
| 0 | 100 |
1383791 | 13 | - |
| 0 | 200 |
1383750 | 13 | - |
|----------+------------+-----------------+------------+---------------|
| 1 | 0 |
5238041 | 52 | 38 |
| 1 | 1 |
5228542 | 52 | 38 |
| 1 | 2 |
5325917 | 53 | 40 |
| 1 | 10 |
5299667 | 52 | 38 |
| 1 | 100 |
5245250 | 52 | 39 |
| 1 | 200 |
5238459 | 52 | 39 |
|----------+------------+-----------------+------------+---------------|
| 1 | 0 |
5239083 | 52 | 38 |
| 2 | 0 |
19449417 | 194 | 181 |
| 10 | 0 |
67718584 | 677 | 663 |
| 100 | 0 |
709840708 | 7098 | 7085 |
| 200 | 0 |
2203580626 | 22035 | 22022 |
+----------+------------+-----------------+------------+---------------+
Note: per-call overhead is estimated relative to the baseline case with
0 relevant tracers and 0 irrelevant tracers.
As can be seen from the above:
a) Whenever there is a single relevant tracer function associated with a
tracee, the overhead of invoking the tracer is constant, and does not
scale with the number of tracers which are *not* associated with that
tracee.
b) The overhead for a single relevant tracer has dropped to ~1/3 of the
overhead prior to this series (from 122ns to 38ns). This is largely
due to permitting calls to dynamically-allocated ftrace_ops without
going through ftrace_ops_list_func.
Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
[update kconfig, asm, refactor]
Signed-off-by: Andy Chiu <andybnac@gmail.com>
Tested-by: Björn Töpel <bjorn@rivosinc.com>
Link: https://lore.kernel.org/r/20250407180838.42877-10-andybnac@gmail.com
Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>