linux-block.git
7 months agoperf testsuite: Install kprobe tests and common files
Michael Petlan [Thu, 15 Feb 2024 11:02:31 +0000 (12:02 +0100)]
perf testsuite: Install kprobe tests and common files

Signed-off-by: Michael Petlan <mpetlan@redhat.com>
Cc: kjain@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240215110231.15385-8-mpetlan@redhat.com
7 months agoperf testsuite: Add test for kprobe handling
Veronika Molnarova [Thu, 15 Feb 2024 11:02:30 +0000 (12:02 +0100)]
perf testsuite: Add test for kprobe handling

Test perf interface to kprobes: listing, adding and removing probes. It
is run as a part of perftool-testsuite_probe test case.

Signed-off-by: Veronika Molnarova <vmolnaro@redhat.com>
Signed-off-by: Michael Petlan <mpetlan@redhat.com>
Cc: kjain@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240215110231.15385-7-mpetlan@redhat.com
7 months agoperf testsuite: Add common output checking helpers
Veronika Molnarova [Thu, 15 Feb 2024 11:02:29 +0000 (12:02 +0100)]
perf testsuite: Add common output checking helpers

As a form of validation, it is a common practice to check the outputs
of commands whether they contain expected patterns or match a certain
regex.

Add helpers for verifying that all regexes are found in the output, that
all lines match any pattern from a set and that a certain expression is
not present in the output.

In verbose mode these helpers log mismatches for easier failure
investigation.

Signed-off-by: Veronika Molnarova <vmolnaro@redhat.com>
Signed-off-by: Michael Petlan <mpetlan@redhat.com>
Cc: kjain@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240215110231.15385-6-mpetlan@redhat.com
7 months agoperf testsuite: Add test case for perf probe
Veronika Molnarova [Thu, 15 Feb 2024 11:02:28 +0000 (12:02 +0100)]
perf testsuite: Add test case for perf probe

Add new perf probe test case that acts as an entry element in perf test
list. Runs multiple subtests from directory "base_probe", which will be
added in incomming patches and can be expanded without further editing.

Signed-off-by: Veronika Molnarova <vmolnaro@redhat.com>
Signed-off-by: Michael Petlan <mpetlan@redhat.com>
Cc: kjain@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240215110231.15385-5-mpetlan@redhat.com
7 months agoperf testsuite: Add initialization script for shell tests
Veronika Molnarova [Thu, 15 Feb 2024 11:02:27 +0000 (12:02 +0100)]
perf testsuite: Add initialization script for shell tests

Initialize reporting and logging functions that unifies formatting
of the test output used for shell tests.

Signed-off-by: Veronika Molnarova <vmolnaro@redhat.com>
Signed-off-by: Michael Petlan <mpetlan@redhat.com>
Cc: kjain@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240215110231.15385-4-mpetlan@redhat.com
7 months agoperf testsuite: Add common setting for shell tests
Veronika Molnarova [Thu, 15 Feb 2024 11:02:26 +0000 (12:02 +0100)]
perf testsuite: Add common setting for shell tests

Add settings defining sample commands later shared by shell tests. This
adds the possibility to globally adjust the default values for the whole
testsuite.

Signed-off-by: Veronika Molnarova <vmolnaro@redhat.com>
Signed-off-by: Michael Petlan <mpetlan@redhat.com>
Cc: kjain@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240215110231.15385-3-mpetlan@redhat.com
7 months agoperf testsuite: Add common regex patters
Veronika Molnarova [Thu, 15 Feb 2024 11:02:25 +0000 (12:02 +0100)]
perf testsuite: Add common regex patters

Unify perf regexes for checking testing output into a single file
to reduce duplicates and prevent errors when editing.

This will be used in upcomming patches in shell tests.

Signed-off-by: Veronika Molnarova <vmolnaro@redhat.com>
Signed-off-by: Michael Petlan <mpetlan@redhat.com>
Cc: kjain@linux.ibm.com
Cc: atrajeev@linux.vnet.ibm.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240215110231.15385-2-mpetlan@redhat.com
7 months agoperf test: Enable Symbols test to work with a current module dso
Adrian Hunter [Wed, 31 Jan 2024 19:24:16 +0000 (21:24 +0200)]
perf test: Enable Symbols test to work with a current module dso

The test needs a struct machine and creates one for the current host,
but a side-effect is that struct machine has set up kernel maps
including module maps.

If the 'Symbols' test --dso option specifies a current kernel module,
it will already be present as a kernel dso, and a map with kmaps needs
to be used otherwise there will be a segfault - see below.

For that case, find the existing map and use that. In that case also,
the dso is split by section into multiple dsos, so test those dsos
also. That in turn, shows up that those dsos have not had overlapping
symbols removed, so the test fails.

Example:

  Before:

    $ perf test -F -v Symbols --dso /lib/modules/$(uname -r)/kernel/arch/x86/kvm/kvm-intel.ko
     70: Symbols                                                         :
    --- start ---
    Testing /lib/modules/6.7.2-local/kernel/arch/x86/kvm/kvm-intel.ko
    Segmentation fault (core dumped)

  After:

    $ perf test -F -v Symbols --dso /lib/modules/$(uname -r)/kernel/arch/x86/kvm/kvm-intel.ko
     70: Symbols                                                         :
    --- start ---
    Testing /lib/modules/6.7.2-local/kernel/arch/x86/kvm/kvm-intel.ko
    Overlapping symbols:
     41d30-41fbb l vmx_init
     41d30-41fbb g init_module
    ---- end ----
    Symbols: FAILED!

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240131192416.16387-1-adrian.hunter@intel.com
7 months agoperf build: Cleanup perf register configuration
Leo Yan [Wed, 14 Feb 2024 11:39:47 +0000 (19:39 +0800)]
perf build: Cleanup perf register configuration

The target is to allow the tool to always enable the perf register
feature for native parsing and cross parsing, and current code doesn't
depend on the macro 'HAVE_PERF_REGS_SUPPORT'.

This patch remove the variable 'NO_PERF_REGS' and the defined macro
'HAVE_PERF_REGS_SUPPORT' from the Makefile.

Signed-off-by: Leo Yan <leo.yan@linux.dev>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@arm.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Ming Wang <wangming01@loongson.cn>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: linux-csky@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240214113947.240957-5-leo.yan@linux.dev
7 months agoperf parse-regs: Introduce a weak function arch__sample_reg_masks()
Leo Yan [Wed, 14 Feb 2024 11:39:46 +0000 (19:39 +0800)]
perf parse-regs: Introduce a weak function arch__sample_reg_masks()

Every architecture can provide a register list for sampling. If an
architecture doesn't support register sampling, it won't define the data
structure 'sample_reg_masks'. Consequently, any code using this
structure must be protected by the macro 'HAVE_PERF_REGS_SUPPORT'.

This patch defines a weak function, arch__sample_reg_masks(), which will
be replaced by an architecture-defined function for returning the
architecture's register list. With this refactoring, the function always
exists, the condition checking for 'HAVE_PERF_REGS_SUPPORT' is not
needed anymore, so remove it.

Signed-off-by: Leo Yan <leo.yan@linux.dev>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@arm.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Ming Wang <wangming01@loongson.cn>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: linux-csky@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240214113947.240957-4-leo.yan@linux.dev
7 months agoperf parse-regs: Always build perf register functions
Leo Yan [Wed, 14 Feb 2024 11:39:45 +0000 (19:39 +0800)]
perf parse-regs: Always build perf register functions

Currently, the macro HAVE_PERF_REGS_SUPPORT is used as a switch to turn
on or turn off the code of perf registers. If any architecture cannot
support perf register, it disables the perf register parsing, for both
the native parsing and cross parsing for other architectures.

To support both the native parsing and cross parsing, the tool should
always build the perf regs functions. Thus, this patch removes
HAVE_PERF_REGS_SUPPORT from the perf regs files.

Signed-off-by: Leo Yan <leo.yan@linux.dev>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@arm.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Ming Wang <wangming01@loongson.cn>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: linux-csky@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240214113947.240957-3-leo.yan@linux.dev
7 months agoperf build: Remove unused CONFIG_PERF_REGS
Leo Yan [Wed, 14 Feb 2024 11:39:44 +0000 (19:39 +0800)]
perf build: Remove unused CONFIG_PERF_REGS

CONFIG_PERF_REGS is not used, remove it.

Signed-off-by: Leo Yan <leo.yan@linux.dev>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: James Clark <james.clark@arm.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Guo Ren <guoren@kernel.org>
Cc: Will Deacon <will@kernel.org>
Cc: Mike Leach <mike.leach@linaro.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Ming Wang <wangming01@loongson.cn>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: linux-csky@vger.kernel.org
Cc: linux-riscv@lists.infradead.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240214113947.240957-2-leo.yan@linux.dev
7 months agoperf metric: Don't remove scale from counts
Ian Rogers [Fri, 9 Feb 2024 20:49:47 +0000 (12:49 -0800)]
perf metric: Don't remove scale from counts

Counts were switched from the scaled saved value form to the
aggregated count to avoid double accounting. When this happened the
removing of scaling for a count should have been removed, however, it
wasn't and this wasn't observed as it normally doesn't matter because
a counter's scale is 1. A problem was observed with RAPL events that
are scaled.

Fixes: 37cc8ad77cf8 ("perf metric: Directly use counts rather than saved_value")
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: James Clark <james.clark@arm.com>
Cc: Kaige Ye <ye@kaige.org>
Cc: John Garry <john.g.garry@oracle.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240209204947.3873294-5-irogers@google.com
7 months agoperf stat: Avoid metric-only segv
Ian Rogers [Fri, 9 Feb 2024 20:49:46 +0000 (12:49 -0800)]
perf stat: Avoid metric-only segv

Cycles is recognized as part of a hard coded metric in stat-shadow.c,
it may call print_metric_only with a NULL fmt string leading to a
segfault. Handle the NULL fmt explicitly.

Fixes: 088519f318be ("perf stat: Move the display functions to stat-display.c")
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: James Clark <james.clark@arm.com>
Cc: Kaige Ye <ye@kaige.org>
Cc: John Garry <john.g.garry@oracle.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240209204947.3873294-4-irogers@google.com
7 months agoperf expr: Fix "has_event" function for metric style events
Ian Rogers [Fri, 9 Feb 2024 20:49:45 +0000 (12:49 -0800)]
perf expr: Fix "has_event" function for metric style events

Events in metrics cannot use '/' as a separator, it would be
recognized as a divide, so they use '@'. The '@' is recognized in the
metricgroups code and changed to '/', do the same in the has_event
function so that the parsing is only tried without the @s.

Fixes: 4a4a9bf9075f ("perf expr: Add has_event function")
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: James Clark <james.clark@arm.com>
Cc: Kaige Ye <ye@kaige.org>
Cc: John Garry <john.g.garry@oracle.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240209204947.3873294-3-irogers@google.com
7 months agoperf expr: Allow NaN to be a valid number
Ian Rogers [Fri, 9 Feb 2024 20:49:44 +0000 (12:49 -0800)]
perf expr: Allow NaN to be a valid number

Currently only floating point numbers can be parsed, add a special
case for NaN.

Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: James Clark <james.clark@arm.com>
Cc: Kaige Ye <ye@kaige.org>
Cc: John Garry <john.g.garry@oracle.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240209204947.3873294-2-irogers@google.com
7 months agoperf maps: Locking tidy up of nr_maps
Ian Rogers [Sat, 10 Feb 2024 03:17:46 +0000 (19:17 -0800)]
perf maps: Locking tidy up of nr_maps

After this change maps__nr_maps is only used by tests, existing users
are migrated to maps__empty. Compute maps__empty under the read lock.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: James Clark <james.clark@arm.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Artem Savkov <asavkov@redhat.com>
Cc: bpf@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240210031746.4057262-7-irogers@google.com
7 months agoperf maps: Hide maps internals
Ian Rogers [Sat, 10 Feb 2024 03:17:45 +0000 (19:17 -0800)]
perf maps: Hide maps internals

Move the struct into the C file. Add maps__equal to work around
exposing the struct for reference count checking. Add accessors for
the unwind_libunwind_ops. Move maps_list_node to its only use in
symbol.c.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: James Clark <james.clark@arm.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Artem Savkov <asavkov@redhat.com>
Cc: bpf@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240210031746.4057262-6-irogers@google.com
7 months agoperf maps: Get map before returning in maps__find_next_entry
Ian Rogers [Sat, 10 Feb 2024 03:17:44 +0000 (19:17 -0800)]
perf maps: Get map before returning in maps__find_next_entry

Finding a map is done under a lock, returning the map without a
reference count means it can be removed without notice and causing
uses after free. Grab a reference count to the map within the lock
region and return this. Fix up locations that need a map__put
following this.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: James Clark <james.clark@arm.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Artem Savkov <asavkov@redhat.com>
Cc: bpf@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240210031746.4057262-5-irogers@google.com
7 months agoperf maps: Get map before returning in maps__find_by_name
Ian Rogers [Sat, 10 Feb 2024 03:17:43 +0000 (19:17 -0800)]
perf maps: Get map before returning in maps__find_by_name

Finding a map is done under a lock, returning the map without a
reference count means it can be removed without notice and causing
uses after free. Grab a reference count to the map within the lock
region and return this. Fix up locations that need a map__put
following this. Also fix some reference counted pointer comparisons.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: James Clark <james.clark@arm.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Artem Savkov <asavkov@redhat.com>
Cc: bpf@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240210031746.4057262-4-irogers@google.com
7 months agoperf maps: Get map before returning in maps__find
Ian Rogers [Sat, 10 Feb 2024 03:17:42 +0000 (19:17 -0800)]
perf maps: Get map before returning in maps__find

Finding a map is done under a lock, returning the map without a
reference count means it can be removed without notice and causing
uses after free. Grab a reference count to the map within the lock
region and return this. Fix up locations that need a map__put
following this.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: James Clark <james.clark@arm.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Artem Savkov <asavkov@redhat.com>
Cc: bpf@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240210031746.4057262-3-irogers@google.com
7 months agoperf maps: Switch from rbtree to lazily sorted array for addresses
Ian Rogers [Sat, 10 Feb 2024 03:17:41 +0000 (19:17 -0800)]
perf maps: Switch from rbtree to lazily sorted array for addresses

Maps is a collection of maps primarily sorted by the starting address
of the map. Prior to this change the maps were held in an rbtree
requiring 4 pointers per node. Prior to reference count checking, the
rbnode was embedded in the map so 3 pointers per node were
necessary. This change switches the rbtree to an array lazily sorted
by address, much as the array sorting nodes by name. 1 pointer is
needed per node, but to avoid excessive resizing the backing array may
be twice the number of used elements. Meaning the memory overhead is
roughly half that of the rbtree. For a perf record with
"--no-bpf-event -g -a" of true, the memory overhead of perf inject is
reduce fom 3.3MB to 3MB, so 10% or 300KB is saved.

Map inserts always happen at the end of the array. The code tracks
whether the insertion violates the sorting property. O(log n) rb-tree
complexity is switched to O(1).

Remove slides the array, so O(log n) rb-tree complexity is degraded to
O(n).

A find may need to sort the array using qsort which is O(n*log n), but
in general the maps should be sorted and so average performance should
be O(log n) as with the rbtree.

An rbtree node consumes a cache line, but with the array 4 nodes fit
on a cache line. Iteration is simplified to scanning an array rather
than pointer chasing.

Overall it is expected the performance after the change should be
comparable to before, but with half of the memory consumed.

To avoid a list and repeated logic around splitting maps,
maps__merge_in is rewritten in terms of
maps__fixup_overlap_and_insert. maps_merge_in splits the given mapping
inserting remaining gaps. maps__fixup_overlap_and_insert splits the
existing mappings, then adds the incoming mapping. By adding the new
mapping first, then re-inserting the existing mappings the splitting
behavior matches.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: James Clark <james.clark@arm.com>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Colin Ian King <colin.i.king@gmail.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Song Liu <song@kernel.org>
Cc: Leo Yan <leo.yan@linux.dev>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Artem Savkov <asavkov@redhat.com>
Cc: bpf@vger.kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240210031746.4057262-2-irogers@google.com
7 months agoMerge branch 'perf-tools' into perf-tools-next
Namhyung Kim [Mon, 12 Feb 2024 20:19:21 +0000 (12:19 -0800)]
Merge branch 'perf-tools' into perf-tools-next

To get some fixes in the perf test and JSON metrics into the development
branch.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
7 months agoperf srcline: Add missed addr2line closes
Ian Rogers [Thu, 1 Feb 2024 00:15:02 +0000 (16:15 -0800)]
perf srcline: Add missed addr2line closes

The child_process for addr2line sets in and out to -1 so that pipes
get created. It is the caller's responsibility to close the pipes,
finish_command doesn't do it. Add the missed closes.

Fixes: b3801e791231 ("perf srcline: Simplify addr2line subprocess")
Signed-off-by: Ian Rogers <irogers@google.com>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: James Clark <james.clark@arm.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Tom Rix <trix@redhat.com>
Cc: llvm@lists.linux.dev
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240201001504.1348511-8-irogers@google.com
7 months agoperf stat: Support per-cluster aggregation
Yicong Yang [Thu, 8 Feb 2024 02:40:26 +0000 (10:40 +0800)]
perf stat: Support per-cluster aggregation

Some platforms have 'cluster' topology and CPUs in the cluster will
share resources like L3 Cache Tag (for HiSilicon Kunpeng SoC) or L2
cache (for Intel Jacobsville). Currently parsing and building cluster
topology have been supported since [1].

perf stat has already supported aggregation for other topologies like
die or socket, etc. It'll be useful to aggregate per-cluster to find
problems like L3T bandwidth contention.

This patch add support for "--per-cluster" option for per-cluster
aggregation. Also update the docs and related test. The output will
be like:

[root@localhost tmp]# perf stat -a -e LLC-load --per-cluster -- sleep 5

 Performance counter stats for 'system wide':

S56-D0-CLS158    4      1,321,521,570      LLC-load
S56-D0-CLS594    4        794,211,453      LLC-load
S56-D0-CLS1030    4             41,623      LLC-load
S56-D0-CLS1466    4             41,646      LLC-load
S56-D0-CLS1902    4             16,863      LLC-load
S56-D0-CLS2338    4             15,721      LLC-load
S56-D0-CLS2774    4             22,671      LLC-load
[...]

On a legacy system without cluster or cluster support, the output will
be look like:
[root@localhost perf]# perf stat -a -e cycles --per-cluster -- sleep 1

 Performance counter stats for 'system wide':

S56-D0-CLS0   64         18,011,485      cycles
S7182-D0-CLS0   64         16,548,835      cycles

Note that this patch doesn't mix the cluster information in the outputs
of --per-core to avoid breaking any tools/scripts using it.

Note that perf recently supports "--per-cache" aggregation, but it's not
the same with the cluster although cluster CPUs may share some cache
resources. For example on my machine all clusters within a die share the
same L3 cache:
$ cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list
0-31
$ cat /sys/devices/system/cpu/cpu0/topology/cluster_cpus_list
0-3

[1] commit c5e22feffdd7 ("topology: Represent clusters of CPUs within a die")

Tested-by: Jie Zhan <zhanjie9@hisilicon.com>
Reviewed-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Cc: james.clark@arm.com
Cc: 21cnbao@gmail.com
Cc: prime.zeng@hisilicon.com
Cc: Jonathan.Cameron@huawei.com
Cc: fanghao11@huawei.com
Cc: linuxarm@huawei.com
Cc: tim.c.chen@intel.com
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240208024026.2691-1-yangyicong@huawei.com
7 months agoperf tools: Remove misleading comments on map functions
Namhyung Kim [Thu, 8 Feb 2024 18:10:25 +0000 (10:10 -0800)]
perf tools: Remove misleading comments on map functions

When it converts sample IP to or from objdump-capable one, there's a
comment saying that kernel modules have DSO_SPACE__USER.  But commit
02213cec64bb ("perf maps: Mark module DSOs with kernel type") changed
it and makes the comment confusing.  Let's get rid of it.

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20240208181025.1329645-1-namhyung@kernel.org
7 months agoperf thread_map: Free strlist on normal path in thread_map__new_by_tid_str()
Yang Jihong [Tue, 6 Feb 2024 08:32:28 +0000 (08:32 +0000)]
perf thread_map: Free strlist on normal path in thread_map__new_by_tid_str()

slist needs to be freed in both error path and normal path in
thread_map__new_by_tid_str().

Fixes: b52956c961be3a04 ("perf tools: Allow multiple threads or processes in record, stat, top")
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240206083228.172607-6-yangjihong1@huawei.com
7 months agoperf sched: Move curr_pid and cpu_last_switched initialization to perf_sched__{lat...
Yang Jihong [Tue, 6 Feb 2024 08:32:27 +0000 (08:32 +0000)]
perf sched: Move curr_pid and cpu_last_switched initialization to perf_sched__{lat|map|replay}()

The curr_pid and cpu_last_switched are used only for the
'perf sched replay/latency/map'. Put their initialization in
perf_sched__{lat|map|replay () to reduce unnecessary actions in other
commands.

Simple functional testing:

  # perf sched record perf bench sched messaging
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

       Total time: 0.209 [sec]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 16.456 MB perf.data (147907 samples) ]

  # perf sched lat

   -------------------------------------------------------------------------------------------------------------------------------------------
    Task                  |   Runtime ms  | Switches | Avg delay ms    | Max delay ms    | Max delay start           | Max delay end          |
   -------------------------------------------------------------------------------------------------------------------------------------------
    sched-messaging:(401) |   2990.699 ms |    38705 | avg:   0.661 ms | max:  67.046 ms | max start: 456532.624830 s | max end: 456532.691876 s
    qemu-system-x86:(7)   |    179.764 ms |     2191 | avg:   0.152 ms | max:  21.857 ms | max start: 456532.576434 s | max end: 456532.598291 s
    sshd:48125            |      0.522 ms |        2 | avg:   0.037 ms | max:   0.046 ms | max start: 456532.514610 s | max end: 456532.514656 s
  <SNIP>
    ksoftirqd/11:82       |      0.063 ms |        1 | avg:   0.005 ms | max:   0.005 ms | max start: 456532.769366 s | max end: 456532.769371 s
    kworker/9:0-mm_:34624 |      0.233 ms |       20 | avg:   0.004 ms | max:   0.007 ms | max start: 456532.690804 s | max end: 456532.690812 s
    migration/13:93       |      0.000 ms |        1 | avg:   0.004 ms | max:   0.004 ms | max start: 456532.512669 s | max end: 456532.512674 s
   -----------------------------------------------------------------------------------------------------------------
    TOTAL:                |   3180.750 ms |    41368 |
   ---------------------------------------------------

  # echo $?
  0

  # perf sched map
    *A0                                                               456532.510141 secs A0 => migration/0:15
    *.                                                                456532.510171 secs .  => swapper:0
     .  *B0                                                           456532.510261 secs B0 => migration/1:21
     .  *.                                                            456532.510279 secs
  <SNIP>
     L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7 *L7  .   .   .   .    456532.785979 secs
     L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7 *L7  .   .   .    456532.786054 secs
     L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7 *L7  .   .    456532.786127 secs
     L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7 *L7  .    456532.786197 secs
     L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7  L7 *L7   456532.786270 secs
  # echo $?
  0

  # perf sched replay
  run measurement overhead: 108 nsecs
  sleep measurement overhead: 66473 nsecs
  the run test took 1000002 nsecs
  the sleep test took 1082686 nsecs
  nr_run_events:        49334
  nr_sleep_events:      50054
  nr_wakeup_events:     34701
  target-less wakeups:  165
  multi-target wakeups: 766
  task      0 (             swapper:         0), nr_events: 15419
  task      1 (             swapper:         1), nr_events: 1
  task      2 (             swapper:         2), nr_events: 1
  <SNIP>
  task    715 (     sched-messaging:    110248), nr_events: 1438
  task    716 (     sched-messaging:    110249), nr_events: 512
  task    717 (     sched-messaging:    110250), nr_events: 500
  task    718 (     sched-messaging:    110251), nr_events: 537
  task    719 (     sched-messaging:    110252), nr_events: 823
  ------------------------------------------------------------
  #1  : 1325.288, ravg: 1325.29, cpu: 7823.35 / 7823.35
  #2  : 1363.606, ravg: 1329.12, cpu: 7655.53 / 7806.56
  #3  : 1349.494, ravg: 1331.16, cpu: 7544.80 / 7780.39
  #4  : 1311.488, ravg: 1329.19, cpu: 7495.13 / 7751.86
  #5  : 1309.902, ravg: 1327.26, cpu: 7266.65 / 7703.34
  #6  : 1309.535, ravg: 1325.49, cpu: 7843.86 / 7717.39
  #7  : 1316.482, ravg: 1324.59, cpu: 7854.41 / 7731.09
  #8  : 1366.604, ravg: 1328.79, cpu: 7955.81 / 7753.57
  #9  : 1326.286, ravg: 1328.54, cpu: 7466.86 / 7724.90
  #10 : 1356.653, ravg: 1331.35, cpu: 7566.60 / 7709.07
  # echo $?
  0

Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240206083228.172607-5-yangjihong1@huawei.com
7 months agoperf sched: Move curr_thread initialization to perf_sched__map()
Yang Jihong [Tue, 6 Feb 2024 08:32:26 +0000 (08:32 +0000)]
perf sched: Move curr_thread initialization to perf_sched__map()

The curr_thread is used only for the 'perf sched map'. Put initialization
in perf_sched__map() to reduce unnecessary actions in other commands.

Simple functional testing:

  # perf sched record perf bench sched messaging
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

       Total time: 0.197 [sec]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 15.526 MB perf.data (140095 samples) ]

  # perf sched map
    *A0                                                               451264.532445 secs A0 => migration/0:15
    *.                                                                451264.532468 secs .  => swapper:0
     .  *B0                                                           451264.532537 secs B0 => migration/1:21
     .  *.                                                            451264.532560 secs
     .   .  *C0                                                       451264.532644 secs C0 => migration/2:27
     .   .  *.                                                        451264.532668 secs
     .   .   .  *D0                                                   451264.532753 secs D0 => migration/3:33
     .   .   .  *.                                                    451264.532778 secs
     .   .   .   .  *E0                                               451264.532861 secs E0 => migration/4:39
     .   .   .   .  *.                                                451264.532886 secs
     .   .   .   .   .  *F0                                           451264.532973 secs F0 => migration/5:45
  <SNIP>
     A7  A7  A7  A7  A7 *A7  .   .   .   .   .   .   .   .   .   .    451264.790785 secs
     A7  A7  A7  A7  A7  A7 *A7  .   .   .   .   .   .   .   .   .    451264.790858 secs
     A7  A7  A7  A7  A7  A7  A7 *A7  .   .   .   .   .   .   .   .    451264.790934 secs
     A7  A7  A7  A7  A7  A7  A7  A7 *A7  .   .   .   .   .   .   .    451264.791004 secs
     A7  A7  A7  A7  A7  A7  A7  A7  A7 *A7  .   .   .   .   .   .    451264.791075 secs
     A7  A7  A7  A7  A7  A7  A7  A7  A7  A7 *A7  .   .   .   .   .    451264.791143 secs
     A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7 *A7  .   .   .   .    451264.791232 secs
     A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7 *A7  .   .   .    451264.791336 secs
     A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7 *A7  .   .    451264.791407 secs
     A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7 *A7  .    451264.791484 secs
     A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7  A7 *A7   451264.791553 secs
  # echo $?
  0

Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240206083228.172607-4-yangjihong1@huawei.com
7 months agoperf sched: Fix memory leak in perf_sched__map()
Yang Jihong [Tue, 6 Feb 2024 08:32:25 +0000 (08:32 +0000)]
perf sched: Fix memory leak in perf_sched__map()

perf_sched__map() needs to free memory of map_cpus, color_pids and
color_cpus in normal path and rollback allocated memory in error path.

Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240206083228.172607-3-yangjihong1@huawei.com
7 months agoperf sched: Move start_work_mutex and work_done_wait_mutex initialization to perf_sch...
Yang Jihong [Tue, 6 Feb 2024 08:32:24 +0000 (08:32 +0000)]
perf sched: Move start_work_mutex and work_done_wait_mutex initialization to perf_sched__replay()

The start_work_mutex and work_done_wait_mutex are used only for the
'perf sched replay'. Put their initialization in perf_sched__replay () to
reduce unnecessary actions in other commands.

Simple functional testing:

  # perf sched record perf bench sched messaging
  # Running 'sched/messaging' benchmark:
  # 20 sender and receiver processes per group
  # 10 groups == 400 processes run

       Total time: 0.197 [sec]
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 14.952 MB perf.data (134165 samples) ]

  # perf sched replay
  run measurement overhead: 108 nsecs
  sleep measurement overhead: 65658 nsecs
  the run test took 999991 nsecs
  the sleep test took 1079324 nsecs
  nr_run_events:        42378
  nr_sleep_events:      43102
  nr_wakeup_events:     31852
  target-less wakeups:  17
  multi-target wakeups: 712
  task      0 (             swapper:         0), nr_events: 10451
  task      1 (             swapper:         1), nr_events: 3
  task      2 (             swapper:         2), nr_events: 1
  <SNIP>
  task    717 (     sched-messaging:     74483), nr_events: 152
  task    718 (     sched-messaging:     74484), nr_events: 1944
  task    719 (     sched-messaging:     74485), nr_events: 73
  task    720 (     sched-messaging:     74486), nr_events: 163
  task    721 (     sched-messaging:     74487), nr_events: 942
  task    722 (     sched-messaging:     74488), nr_events: 78
  task    723 (     sched-messaging:     74489), nr_events: 1090
  ------------------------------------------------------------
  #1  : 1366.507, ravg: 1366.51, cpu: 7682.70 / 7682.70
  #2  : 1410.072, ravg: 1370.86, cpu: 7723.88 / 7686.82
  #3  : 1396.296, ravg: 1373.41, cpu: 7568.20 / 7674.96
  #4  : 1381.019, ravg: 1374.17, cpu: 7531.81 / 7660.64
  #5  : 1393.826, ravg: 1376.13, cpu: 7725.25 / 7667.11
  #6  : 1401.581, ravg: 1378.68, cpu: 7594.82 / 7659.88
  #7  : 1381.337, ravg: 1378.94, cpu: 7371.22 / 7631.01
  #8  : 1373.842, ravg: 1378.43, cpu: 7894.92 / 7657.40
  #9  : 1364.697, ravg: 1377.06, cpu: 7324.91 / 7624.15
  #10 : 1363.613, ravg: 1375.72, cpu: 7209.55 / 7582.69
  # echo $?
  0

Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240206083228.172607-2-yangjihong1@huawei.com
7 months agoperf test: Skip metric w/o event name on arm64 in stat STD output linter
Yicong Yang [Wed, 7 Feb 2024 09:12:22 +0000 (17:12 +0800)]
perf test: Skip metric w/o event name on arm64 in stat STD output linter

stat+std_output.sh test fails on my arm64 machine:
[root@localhost shell]# ./stat+std_output.sh
Checking STD output: no args Unknown event name in TopDownL1                 #     0.18 retiring
[root@localhost shell]# ./stat+std_output.sh
Checking STD output: no args [Success]
Checking STD output: system wide [Success]
Checking STD output: interval [Success]
Checking STD output: per thread Unknown event name in tmux: server-1114960                                                   #     0.41 frontend_bound

When no args specified `perf stat` will add TopdownL1 metric group
and the output will be like:
[root@localhost shell]# perf stat -- stress-ng --vm 1 --timeout 1
stress-ng: info:  [3351733] setting to a 1 second run per stressor
stress-ng: info:  [3351733] dispatching hogs: 1 vm
stress-ng: info:  [3351733] successful run completed in 1.02s

 Performance counter stats for 'stress-ng --vm 1 --timeout 1':

          1,037.71 msec task-clock                       #    1.000 CPUs utilized
                13      context-switches                 #   12.528 /sec
                 1      cpu-migrations                   #    0.964 /sec
            67,544      page-faults                      #   65.090 K/sec
     2,691,932,561      cycles                           #    2.594 GHz                         (74.56%)
     6,571,333,653      instructions                     #    2.44  insn per cycle              (74.92%)
       521,863,142      branches                         #  502.901 M/sec                       (75.21%)
           425,879      branch-misses                    #    0.08% of all branches             (87.57%)
                        TopDownL1                 #     0.61 retiring                    (87.67%)
                                                  #     0.03 frontend_bound              (87.67%)
                                                  #     0.02 bad_speculation             (87.67%)
                                                  #     0.34 backend_bound               (74.61%)

       1.038138390 seconds time elapsed

       0.844849000 seconds user
       0.189053000 seconds sys

Metrics in group TopDownL1 don't have event name on arm64 but are not
listed in the $skip_metric list which they should be listed. Add them
to the skip list as what does for x86 platforms in [1].

[1] commit 4d60e83dfcee ("perf test: Skip metrics w/o event name in stat STD output linter")

Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: linuxarm@huawei.com
Cc: kan.liang@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240207091222.54096-1-yangyicong@huawei.com
7 months agoperf symbols: Slightly improve module file executable section mappings
Adrian Hunter [Thu, 8 Feb 2024 08:53:26 +0000 (10:53 +0200)]
perf symbols: Slightly improve module file executable section mappings

Currently perf does not record module section addresses except for
the .text section. In general that means perf cannot get module section
mappings correct (except for .text) when loading symbols from a kernel
module file. (Note using --kcore does not have this issue)

Improve that situation slightly by identifying executable sections that
use the same mapping as the .text section. That happens when an
executable section comes directly after the .text section, both in memory
and on file, something that can be determined by following the same layout
rules used by the kernel, refer kernel layout_sections(). Note whether
that happens is somewhat arbitrary, so this is not a final solution.

Example from tracing a virtual machine process:

 Before:

  $ perf script | grep unknown
         CPU 0/KVM    1718   203.511270:     318341 cpu-cycles:P:  ffffffffc13e8a70 [unknown] (/lib/modules/6.7.2-local/kernel/arch/x86/kvm/kvm-intel.ko)
  $ perf script -vvv 2>&1 >/dev/null | grep kvm.intel | grep 'noinstr.text\|ffff'
  Map: 0-7e0 41430 [kvm_intel].noinstr.text
  Map: ffffffffc13a7000-ffffffffc1421000 a0 /lib/modules/6.7.2-local/kernel/arch/x86/kvm/kvm-intel.ko

 After:

  $ perf script | grep 203.511270
         CPU 0/KVM    1718   203.511270:     318341 cpu-cycles:P:  ffffffffc13e8a70 vmx_vmexit+0x0 (/lib/modules/6.7.2-local/kernel/arch/x86/kvm/kvm-intel.ko)
  $ perf script -vvv 2>&1 >/dev/null | grep kvm.intel | grep 'noinstr.text\|ffff'
  Map: ffffffffc13a7000-ffffffffc1421000 a0 /lib/modules/6.7.2-local/kernel/arch/x86/kvm/kvm-intel.ko

Reported-by: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240208085326.13432-3-adrian.hunter@intel.com
7 months agoperf tools: Make it possible to see perf's kernel and module memory mappings
Adrian Hunter [Thu, 8 Feb 2024 08:53:25 +0000 (10:53 +0200)]
perf tools: Make it possible to see perf's kernel and module memory mappings

Dump kmaps if using 'perf --debug kmaps' or verbose > 2 (e.g. -vvv) for
tools 'perf script' and 'perf report' if there is no browser.

Example:

  $ perf --debug kmaps script 2>&1 >/dev/null | grep kvm.intel
  build id event received for /lib/modules/6.7.2-local/kernel/arch/x86/kvm/kvm-intel.ko: 0691d75e10e72ebbbd45a44c59f6d00a5604badf [20]
  Map: 0-3a3 4f5d8 [kvm_intel].modinfo
  Map: 0-5240 5f280 [kvm_intel]__versions
  Map: 0-30 64 [kvm_intel].note.Linux
  Map: 0-14 644c0 [kvm_intel].orc_header
  Map: 0-5297 43680 [kvm_intel].rodata
  Map: 0-5bee 3b837 [kvm_intel].text.unlikely
  Map: 0-7e0 41430 [kvm_intel].noinstr.text
  Map: 0-2080 713c0 [kvm_intel].bss
  Map: 0-26 705c8 [kvm_intel].data..read_mostly
  Map: 0-5888 6a4c0 [kvm_intel].data
  Map: 0-22 70220 [kvm_intel].data.once
  Map: 0-40 705f0 [kvm_intel].data..percpu
  Map: 0-1685 41d20 [kvm_intel].init.text
  Map: 0-4b8 6fd60 [kvm_intel].init.data
  Map: 0-380 70248 [kvm_intel]__dyndbg
  Map: 0-8 70218 [kvm_intel].exit.data
  Map: 0-438 4f980 [kvm_intel]__param
  Map: 0-5f5 4ca0f [kvm_intel].rodata.str1.1
  Map: 0-3657 493b8 [kvm_intel].rodata.str1.8
  Map: 0-e0 70640 [kvm_intel].data..ro_after_init
  Map: 0-500 70ec0 [kvm_intel].gnu.linkonce.this_module
  Map: ffffffffc13a7000-ffffffffc1421000 a0 /lib/modules/6.7.2-local/kernel/arch/x86/kvm/kvm-intel.ko

The example above shows how the module section mappings are all wrong
except for the main .text mapping at 0xffffffffc13a7000.

Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Like Xu <like.xu.linux@gmail.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240208085326.13432-2-adrian.hunter@intel.com
7 months agoperf record: Display data size on pipe mode
Namhyung Kim [Fri, 12 Jan 2024 23:13:40 +0000 (15:13 -0800)]
perf record: Display data size on pipe mode

Currently pipe mode doesn't set the file size and it results in a
misleading message of 0 data size at the end.  Although it might miss
some accounting for pipe header or more, just displaying the data size
would reduce the possible confusion.

Before:
  $ perf record -o- perf test -w noploop | perf report -i- -q --percent-limit=1
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.000 MB - ]    <======  (here)
      99.58%  perf     perf                  [.] noploop

After:
  $ perf record -o- perf test -w noploop | perf report -i- -q --percent-limit=1
  [ perf record: Woken up 1 times to write data ]
  [ perf record: Captured and wrote 0.229 MB - ]
      99.46%  perf     perf                  [.] noploop

Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Reviewed-by: Ian Rogers <irogers@google.com>
Link: https://lore.kernel.org/r/20240112231340.779469-1-namhyung@kernel.org
7 months agoperf script: Print source line for each jump in brstackinsn
Kan Liang [Mon, 5 Feb 2024 14:58:19 +0000 (06:58 -0800)]
perf script: Print source line for each jump in brstackinsn

With the srcline option, the perf script only prints a source line at
the beginning of a sample with call/ret from functions, but not for
each jump in brstackinsn. It's useful to print a source line for each
jump in brstackinsn when the end user analyze the full assembler
sequences of branch sequences for the sample.

The srccode option can also be used to locate the source code line.
However, it's printed almost for every line and makes the output less
readable.

 $perf script -F +brstackinsn,+srcline --xed

Before the patch,

 tchain_edit_deb 1463275 15228549.107820:     282495 instructions:u:            401133 f3+0xd (/home/kan/os.li>
  tchain_edit.c:22
        f3+40:  tchain_edit.c:20
        000000000040114e                        jle 0x401133                    # PRED 6 cycles [6]
        0000000000401133                        movl  -0x4(%rbp), %eax
        0000000000401136                        and $0x1, %eax
        0000000000401139                        test %eax, %eax
        000000000040113b                        jz 0x401143
        000000000040113d                        addl  $0x1, -0x4(%rbp)
        0000000000401141                        jmp 0x401147                    # PRED 3 cycles [9] 2.00 IPC
        0000000000401147                        cmpl  $0x3e7, -0x4(%rbp)
        000000000040114e                        jle 0x401133                    # PRED 6 cycles [15] 0.33 IPC

After the patch,

 tchain_edit_deb 1463275 15228549.107820:     282495 instructions:u:            401133 f3+0xd (/home/kan/os.li>
  tchain_edit.c:22
        f3+40:  tchain_edit.c:20
        000000000040114e                        jle 0x401133                     srcline: tchain_edit.c:20      # PRED 6 cycles [6]
        0000000000401133                        movl  -0x4(%rbp), %eax
        0000000000401136                        and $0x1, %eax
        0000000000401139                        test %eax, %eax
        000000000040113b                        jz 0x401143
        000000000040113d                        addl  $0x1, -0x4(%rbp)
        0000000000401141                        jmp 0x401147                     srcline: tchain_edit.c:23      # PRED 3 cycles [9] 2.00 IPC
        0000000000401147                        cmpl  $0x3e7, -0x4(%rbp)
        000000000040114e                        jle 0x401133                     srcline: tchain_edit.c:20      # PRED 6 cycles [15] 0.33 IPC

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: ahmad.yasin@intel.com
Cc: amiri.khalil@intel.com
Cc: ak@linux.intel.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240205145819.1943114-1-kan.liang@linux.intel.com
7 months agoperf kvm powerpc: Fix build
Ian Rogers [Tue, 6 Feb 2024 23:59:02 +0000 (15:59 -0800)]
perf kvm powerpc: Fix build

Updates to struct parse_events_error needed to be carried through to
PowerPC specific event parsing.

Fixes: fd7b8e8fb20f ("perf parse-events: Print all errors")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung <namhyung@kernel.org>
Cc: James Clark <james.clark@arm.com>
Cc: Leo Yan <leo.yan@linaro.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240206235902.2917395-1-irogers@google.com
7 months agoperf/pmu-events/powerpc: Update json mapfile with Power11 PVR
Madhavan Srinivasan [Mon, 29 Jan 2024 12:08:55 +0000 (17:38 +0530)]
perf/pmu-events/powerpc: Update json mapfile with Power11 PVR

Update the Power11 PVR to json mapfile to enable
json events. Power11 is PowerISA v3.1 compliant
and support Power10 events.

Signed-off-by: Madhavan Srinivasan <maddy@linux.ibm.com>
Reviewed-by: Kajol Jain <kjain@linux.ibm.com>
Cc: atrajeev@linux.vnet.ibm.com
Cc: disgoel@linux.vnet.ibm.com
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240129120855.551529-1-maddy@linux.ibm.com
7 months agotools: perf: Expose sample ID / stream ID to python scripts
Ben Gainey [Tue, 23 Jan 2024 10:31:37 +0000 (10:31 +0000)]
tools: perf: Expose sample ID / stream ID to python scripts

perf script exposes the evsel_name to python scripts as part of the data
passed to the sample or tracepoint handler function, and it passes the id and
stream_id to the throttled/unthrottled handler functions. This makes matching
throttle events and samples difficult.

To make this possible, this change exposes the sample id and stream_id values
to the script.

Signed-off-by: Ben Gainey <ben.gainey@arm.com>
Reviewed-by: Adrian Hunter <adrian.hunter@intel.com>
Cc: will@kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240123103137.1890779-2-ben.gainey@arm.com
7 months agoperf bpf: Clean up the generated/copied vmlinux.h
Arnaldo Carvalho de Melo [Fri, 2 Feb 2024 14:32:20 +0000 (11:32 -0300)]
perf bpf: Clean up the generated/copied vmlinux.h

When building perf with BPF skels we either copy the minimalistic
tools/perf/util/bpf_skel/vmlinux/vmlinux.h or use bpftool to generate a
vmlinux from BTF, storing the result in $(SKEL_OUT)/vmlinux.h.

We need to remove that when doing a 'make -C tools/perf clean', fix it.

Fixes: b7a2d774c9c5a9a3 ("perf build: Add ability to build with a generated vmlinux.h")
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: James Clark <james.clark@arm.com>
Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
Cc: Yang Jihong <yangjihong1@huawei.com>
Cc: bpf@vger.kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/Zbz89KK5wHfZ82jv@x1
7 months agoperf jevents: Drop or simplify small integer values
Ian Rogers [Wed, 31 Jan 2024 20:14:29 +0000 (12:14 -0800)]
perf jevents: Drop or simplify small integer values

Prior to this patch '0' would be dropped as the config values default
to 0. Some json values are hex and the string '0' wouldn't match '0x0'
as zero. Add a more robust is_zero test to drop these event terms.

When encoding numbers as hex, if the number is between 0 and 9
inclusive then don't add a 0x prefix.

Update test expectations for these changes.

On x86 this reduces the event/metric C string by 58,411 bytes.

Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Edward Baker <edward.baker@intel.com>
Cc: Perry Taylor <perry.taylor@intel.com>
Cc: Weilin Wang <weilin.wang@intel.com>
Cc: John Garry <john.g.garry@oracle.com>
Cc: Jing Zhang <renyu.zj@linux.alibaba.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Michael Petlan <mpetlan@redhat.com>
Cc: Veronika Molnarova <vmolnaro@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240131201429.792138-1-irogers@google.com
7 months agoperf parse-events: Print all errors
Ian Rogers [Wed, 31 Jan 2024 13:49:40 +0000 (05:49 -0800)]
perf parse-events: Print all errors

Prior to this patch the first and the last error encountered during
parsing are printed. To see other errors verbose needs
enabling. Unfortunately this can drop useful errors, in particular on
terms. This patch changes the errors so that instead of the first and
last all errors are recorded and printed, the underlying data
structure is changed to a list.

Before:
```
$ perf stat -e 'slots/edge=2/' true
event syntax error: 'slots/edge=2/'
                                \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'slots'

Initial error:
event syntax error: 'slots/edge=2/'
                     \___ Cannot find PMU `slots'. Missing kernel support?
Run 'perf list' for a list of valid events

 Usage: perf stat [<options>] [<command>]

    -e, --event <event>   event selector. use 'perf list' to list available events
```

After:
```
$ perf stat -e 'slots/edge=2/' true
event syntax error: 'slots/edge=2/'
                     \___ Bad event or PMU

Unable to find PMU or event on a PMU of 'slots'

event syntax error: 'slots/edge=2/'
                                \___ value too big for format (edge), maximum is 1

event syntax error: 'slots/edge=2/'
                     \___ Cannot find PMU `slots'. Missing kernel support?
Run 'perf list' for a list of valid events

 Usage: perf stat [<options>] [<command>]

    -e, --event <event>   event selector. use 'perf list' to list available events
```

Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: James Clark <james.clark@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: tchen168@asu.edu
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Michael Petlan <mpetlan@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240131134940.593788-3-irogers@google.com
7 months agoperf parse-events: Improve error location of terms cloned from an event
Ian Rogers [Wed, 31 Jan 2024 13:49:39 +0000 (05:49 -0800)]
perf parse-events: Improve error location of terms cloned from an event

A PMU event/alias will have a set of format terms that replace it when
an event is parsed. The location of the terms is their position when
parsed for the event/alias either from sysfs or json. This location is
of little use when an event fails to parse as the error will be given
in terms of the location in the string of events parsed not the json
or sysfs string. Fix this by making the cloned terms location that of
the event/alias.

If a cloned term from an event/alias is invalid the bad format is hard
to determine from the error string. Add the name of the bad format
into the error string.

Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: James Clark <james.clark@arm.com>
Cc: tchen168@asu.edu
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Michael Petlan <mpetlan@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240131134940.593788-2-irogers@google.com
7 months agoperf tsc: Add missing newlines to debug statements
Ian Rogers [Wed, 31 Jan 2024 13:49:38 +0000 (05:49 -0800)]
perf tsc: Add missing newlines to debug statements

It is assumed that debug statements always print a newline, fix two
missing ones.

Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: James Clark <james.clark@arm.com>
Cc: tchen168@asu.edu
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Michael Petlan <mpetlan@redhat.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240131134940.593788-1-irogers@google.com
7 months agoperf Documentation: Add some more hints to tips.txt
Andi Kleen [Wed, 31 Jan 2024 02:13:52 +0000 (18:13 -0800)]
perf Documentation: Add some more hints to tips.txt

Add some (hopefully useful) hints to tips.txt
Also some minor corrections.

Would probably good to make it a reviewer rule that if generally useful
options are added the patch must add an example to tips.txt

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240131021352.151440-1-ak@linux.intel.com
7 months agoperf test: Simplify metric value validation test final report
Weilin Wang [Tue, 30 Jan 2024 18:09:07 +0000 (10:09 -0800)]
perf test: Simplify metric value validation test final report

The original test report was too complicated to read with information
that not really useful. This new update simplify the report which should
largely improve the readibility.

Signed-off-by: Weilin Wang <weilin.wang@intel.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Cc: Caleb Biggers <caleb.biggers@intel.com>
Cc: Perry Taylor <perry.taylor@intel.com>
Cc: Samantha Alt <samantha.alt@intel.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240130180907.639729-1-weilin.wang@intel.com
7 months agoperf report: Prevent segfault with --no-parent
Andi Kleen [Tue, 30 Jan 2024 18:55:52 +0000 (10:55 -0800)]
perf report: Prevent segfault with --no-parent

Prevent a perf report segfault with the (non sensical) --no-parent
option

Signed-off-By: Andi Kleen <ak@linux.intel.com>
Acked-by: Namhyung <namhyung@kernel.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240130185552.150578-1-ak@linux.intel.com
7 months agoperf evsel: Fix duplicate initialization of data->id in evsel__parse_sample()
Yang Jihong [Sat, 27 Jan 2024 02:57:56 +0000 (02:57 +0000)]
perf evsel: Fix duplicate initialization of data->id in evsel__parse_sample()

data->id has been initialized at line 2362, remove duplicate initialization.

Fixes: 3ad31d8a0df2 ("perf evsel: Centralize perf_sample initialization")
Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
Reviewed-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240127025756.4041808-1-yangjihong1@huawei.com
7 months agoperf evsel: Rename get_states() to parse_task_states() and make it public
Ze Gao [Tue, 23 Jan 2024 07:02:11 +0000 (02:02 -0500)]
perf evsel: Rename get_states() to parse_task_states() and make it public

Since get_states() assumes the existence of libtraceevent, so move
to where it should belong, i.e, util/trace-event-parse.c, and also
rename it to parse_task_states().

Leave evsel_getstate() untouched as it fits well in the evsel
category.

Also make some necessary tweaks for python support, and get it
verified with: perf test python.

Signed-off-by: Ze Gao <zegao@tencent.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/r/20240123070210.1669843-2-zegao@tencent.com
8 months agoperf tools headers: update the asm-generic/unaligned.h copy with the kernel sources
Arnaldo Carvalho de Melo [Wed, 31 Jan 2024 14:24:40 +0000 (11:24 -0300)]
perf tools headers: update the asm-generic/unaligned.h copy with the kernel sources

To pick up the changes in:

  1ab33c03145d0f6c ("asm-generic: make sparse happy with odd-sized put_unaligned_*()")

Addressing this perf tools build warning:

  Warning: Kernel ABI header differences:
    diff -u tools/include/asm-generic/unaligned.h include/asm-generic/unaligned.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/Zbp9I7rmFj1Owhug@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agotools include UAPI: Sync linux/mount.h copy with the kernel sources
Arnaldo Carvalho de Melo [Tue, 30 Jan 2024 14:44:00 +0000 (11:44 -0300)]
tools include UAPI: Sync linux/mount.h copy with the kernel sources

To pick the changes from:

  35e27a5744131996 ("fs: keep struct mnt_id_req extensible")
  b4c2bea8ceaa50cd ("add listmount(2) syscall")
  46eae99ef73302f9 ("add statmount(2) syscall")

That doesn't change anything in tools this time as nothing that is
harvested by the beauty scripts got changed:

  $ ls -1 tools/perf/trace/beauty/*mount*sh
  tools/perf/trace/beauty/fsmount.sh
  tools/perf/trace/beauty/mount_flags.sh
  tools/perf/trace/beauty/move_mount_flags.sh
  $

This addresses this perf build warning.

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/linux/mount.h include/uapi/linux/mount.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZbkMiB7ZcOsLP2V5@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agoperf evlist: Fix evlist__new_default() for > 1 core PMU
James Clark [Wed, 24 Jan 2024 09:43:57 +0000 (09:43 +0000)]
perf evlist: Fix evlist__new_default() for > 1 core PMU

The 'Session topology' test currently fails with this message when
evlist__new_default() opens more than one event:

  32: Session topology                                                :
  --- start ---
  templ file: /tmp/perf-test-vv5YzZ
  Using CPUID 0x00000000410fd070
  Opening: unknown-hardware:HG
  ------------------------------------------------------------
  perf_event_attr:
    type                             0 (PERF_TYPE_HARDWARE)
    config                           0xb00000000
    disabled                         1
  ------------------------------------------------------------
  sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 4
  Opening: unknown-hardware:HG
  ------------------------------------------------------------
  perf_event_attr:
    type                             0 (PERF_TYPE_HARDWARE)
    config                           0xa00000000
    disabled                         1
  ------------------------------------------------------------
  sys_perf_event_open: pid 0  cpu -1  group_fd -1  flags 0x8 = 5
  non matching sample_type
  FAILED tests/topology.c:73 can't get session
  ---- end ----
  Session topology: FAILED!

This is because when re-opening the file and parsing the header, Perf
expects that any file that has more than one event has the sample ID
flag set. Perf record already sets the flag in a similar way when there
is more than one event, so add the same logic to evlist__new_default().

evlist__new_default() is only currently used in tests, so I don't
expect this change to have any other side effects. The other tests that
use it don't save and re-open the file so don't hit this issue.

The session topology test has been failing on Arm big.LITTLE platforms
since commit 251aa040244a3b17 ("perf parse-events: Wildcard most
"numeric" events") when evlist__new_default() started opening multiple
events for 'cycles'.

Fixes: 251aa040244a3b17 ("perf parse-events: Wildcard most "numeric" events")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: James Clark <james.clark@arm.com>
[ This was failing as well on a Rocket Lake Refresh/14700k Intel hybrid system - Arnaldo ]
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Tested-by: Ian Rogers <irogers@google.com>
Tested-by: Kan Liang <kan.liang@linux.intel.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Changbin Du <changbin.du@huawei.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Yang Jihong <yangjihong1@huawei.com>
Closes: https://lore.kernel.org/lkml/CAP-5=fWVQ-7ijjK3-w1q+k2WYVNHbAcejb-xY0ptbjRw476VKA@mail.gmail.com/
Link: https://lore.kernel.org/r/20240124094358.489372-1-james.clark@arm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agotools headers: Update the copy of x86's mem{cpy,set}_64.S used in 'perf bench'
Arnaldo Carvalho de Melo [Tue, 30 Jan 2024 13:45:24 +0000 (10:45 -0300)]
tools headers: Update the copy of x86's mem{cpy,set}_64.S used in 'perf bench'

This is to get the changes from:

  94ea9c05219518ef ("x86/headers: Replace #include <asm/export.h> with #include <linux/export.h>")
  10f4c9b9a33b7df0 ("x86/asm: Fix build of UML with KASAN")

That addresses these perf tools build warning:

  Warning: Kernel ABI header differences:
    diff -u tools/arch/x86/lib/memcpy_64.S arch/x86/lib/memcpy_64.S
    diff -u tools/arch/x86/lib/memset_64.S arch/x86/lib/memset_64.S

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Vincent Whitchurch <vincent.whitchurch@axis.com>
Link: https://lore.kernel.org/lkml/ZbkIKpKdNqOFdMwJ@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agotools headers x86 cpufeatures: Sync with the kernel sources to pick TDX, Zen, APIC...
Arnaldo Carvalho de Melo [Tue, 30 Jan 2024 13:09:18 +0000 (10:09 -0300)]
tools headers x86 cpufeatures: Sync with the kernel sources to pick TDX, Zen, APIC MSR fence changes

To pick the changes from:

  1e536e10689700e0 ("x86/cpu: Detect TDX partial write machine check erratum")
  765a0542fdc7aad7 ("x86/virt/tdx: Detect TDX during kernel boot")
  30fa92832f405d5a ("x86/CPU/AMD: Add ZenX generations flags")
  04c3024560d3a14a ("x86/barrier: Do not serialize MSR accesses on AMD")

This causes these perf files to be rebuilt and brings some X86_FEATURE
that will be used when updating the copies of
tools/arch/x86/lib/mem{cpy,set}_64.S with the kernel sources:

      CC       /tmp/build/perf/bench/mem-memcpy-x86-64-asm.o
      CC       /tmp/build/perf/bench/mem-memset-x86-64-asm.o

And addresses this perf build warning:

  Warning: Kernel ABI header differences:
    diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kai Huang <kai.huang@intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agotools headers UAPI: Sync unistd.h to pick {list,stat}mount, lsm_{[gs]et_self_attr...
Arnaldo Carvalho de Melo [Mon, 29 Jan 2024 16:00:07 +0000 (13:00 -0300)]
tools headers UAPI: Sync unistd.h to pick {list,stat}mount, lsm_{[gs]et_self_attr,list_modules} syscall numbers

To pick the changes in these csets:

  d8b0f5465012538c ("wire up syscalls for statmount/listmount")
  5f42375904b08890 ("LSM: wireup Linux Security Module syscalls")

Used in some architectures to create syscall tables.

This addresses this perf build warning:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/asm-generic/unistd.h include/uapi/asm-generic/unistd.h

Cc: Casey Schaufler <casey@schaufler-ca.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZbfMuAlUMRO9Hqa6@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agoperf vendor events intel: Alderlake/sapphirerapids metric fixes
Ian Rogers [Thu, 4 Jan 2024 23:19:03 +0000 (15:19 -0800)]
perf vendor events intel: Alderlake/sapphirerapids metric fixes

As events are deduplicated by name, ensure PMU prefixes are always
used in metrics. Previously they may be missed on the first event in a
formula.

Update metric constraints for architectures with topdown l2 events.

Conversion script updated in:
https://github.com/intel/perfmon/pull/128

Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Ian Rogers <irogers@google.com>
Tested-by: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Edward Baker <edward.baker@intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Closes: https://lore.kernel.org/lkml/ZZam-EG-UepcXtWw@kernel.org/
Link: https://lore.kernel.org/r/20240104231903.775717-1-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agotools headers UAPI: Sync kvm headers with the kernel sources
Arnaldo Carvalho de Melo [Sat, 27 Jan 2024 17:52:23 +0000 (14:52 -0300)]
tools headers UAPI: Sync kvm headers with the kernel sources

To pick the changes in:

  a5d3df8ae13fada7 ("KVM: remove deprecated UAPIs")
  6d72283526090850 ("KVM x86/xen: add an override for PVCLOCK_TSC_STABLE_BIT")
  89ea60c2c7b5838b ("KVM: x86: Add support for "protected VMs" that can utilize private memory")
  8dd2eee9d526c30f ("KVM: x86/mmu: Handle page fault for private memory")
  a7800aa80ea4d535 ("KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory")
  5a475554db1e476a ("KVM: Introduce per-page memory attributes")
  16f95f3b95caded2 ("KVM: Add KVM_EXIT_MEMORY_FAULT exit to report faults to userspace")
  bb58b90b1a8f753b ("KVM: Introduce KVM_SET_USER_MEMORY_REGION2")
  3f9cd0ca848413fd ("KVM: arm64: Allow userspace to get the writable masks for feature ID registers")

That automatically adds support for some new ioctls and remove a bunch
of deprecated ones.

This ends up making the new binary to forget about the deprecated one,
so when used in an older system it will not be able to resolve those
codes to strings.

  $ tools/perf/trace/beauty/kvm_ioctl.sh > before
  $ cp include/uapi/linux/kvm.h tools/include/uapi/linux/kvm.h
  $ tools/perf/trace/beauty/kvm_ioctl.sh > after
  $ diff -u before after
  --- before 2024-01-27 14:48:16.523014020 -0300
  +++ after 2024-01-27 14:48:24.183932866 -0300
  @@ -14,6 +14,7 @@
    [0x46] = "SET_USER_MEMORY_REGION",
    [0x47] = "SET_TSS_ADDR",
    [0x48] = "SET_IDENTITY_MAP_ADDR",
  + [0x49] = "SET_USER_MEMORY_REGION2",
    [0x60] = "CREATE_IRQCHIP",
    [0x61] = "IRQ_LINE",
    [0x62] = "GET_IRQCHIP",
  @@ -22,14 +23,8 @@
    [0x65] = "GET_PIT",
    [0x66] = "SET_PIT",
    [0x67] = "IRQ_LINE_STATUS",
  - [0x69] = "ASSIGN_PCI_DEVICE",
    [0x6a] = "SET_GSI_ROUTING",
  - [0x70] = "ASSIGN_DEV_IRQ",
    [0x71] = "REINJECT_CONTROL",
  - [0x72] = "DEASSIGN_PCI_DEVICE",
  - [0x73] = "ASSIGN_SET_MSIX_NR",
  - [0x74] = "ASSIGN_SET_MSIX_ENTRY",
  - [0x75] = "DEASSIGN_DEV_IRQ",
    [0x76] = "IRQFD",
    [0x77] = "CREATE_PIT2",
    [0x78] = "SET_BOOT_CPU_ID",
  @@ -66,7 +61,6 @@
    [0x9f] = "GET_VCPU_EVENTS",
    [0xa0] = "SET_VCPU_EVENTS",
    [0xa3] = "ENABLE_CAP",
  - [0xa4] = "ASSIGN_SET_INTX_MASK",
    [0xa5] = "SIGNAL_MSI",
    [0xa6] = "GET_XCRS",
    [0xa7] = "SET_XCRS",
  @@ -97,6 +91,8 @@
    [0xcd] = "SET_SREGS2",
    [0xce] = "GET_STATS_FD",
    [0xd0] = "XEN_HVM_EVTCHN_SEND",
  + [0xd2] = "SET_MEMORY_ATTRIBUTES",
  + [0xd4] = "CREATE_GUEST_MEMFD",
    [0xe0] = "CREATE_DEVICE",
    [0xe1] = "SET_DEVICE_ATTR",
    [0xe2] = "GET_DEVICE_ATTR",
  $

This silences these perf build warnings:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
    diff -u tools/arch/x86/include/uapi/asm/kvm.h arch/x86/include/uapi/asm/kvm.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jing Zhang <jingzhangos@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul Durrant <pdurrant@amazon.com>
Cc: Sean Christopherson <seanjc@google.com>
Link: https://lore.kernel.org/lkml/ZbVLbkngp4oq13qN@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agoperf tools: Fix calloc() arguments to address error introduced in gcc-14
Sun Haiyong [Sat, 6 Jan 2024 09:41:29 +0000 (17:41 +0800)]
perf tools: Fix calloc() arguments to address error introduced in gcc-14

the definition of calloc is as follows:

    void *calloc(size_t nmemb, size_t size);

number of members is in the first parameter and the size is in the
second parameter.

Fix error messages on gcc 14 20240102:

  error: 'calloc' sizes specified with 'sizeof' in the earlier argument and
  not in the later argument [-Werror=calloc-transposed-args]

Committer notes:

I noticed this on fedora 40 and rawhide.

Signed-off-by: Sun Haiyong <sunhaiyong@loongson.cn>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Tested-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20240106094129.3337057-1-siyanteng@loongson.cn
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agoperf top: Remove needless malloc(0) call that triggers -Walloc-size
Sun Haiyong [Mon, 4 Dec 2023 08:20:55 +0000 (16:20 +0800)]
perf top: Remove needless malloc(0) call that triggers -Walloc-size

GCC 14 introduces a new -Walloc-size included in -Wextra which errors out
like:

  builtin-top.c: In function â€˜prompt_integer’:
  builtin-top.c:360:21: error: allocation of insufficient size â€˜0’ for
  type â€˜char’ with size â€˜1’ [-Werror=alloc-size]
    360 |         char *buf = malloc(0), *p;
        |                     ^~~~~~

Just set it to NULL, getline() will do the allocation.

Signed-off-by: Sun Haiyong <sunhaiyong@loongson.cn>
Signed-off-by: Yanteng Si <siyanteng@loongson.cn>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lore.kernel.org/r/20231204082055.91877-1-siyanteng@loongson.cn
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agoperf build: Make minimal shellcheck version to v0.6.0
Yicong Yang [Mon, 22 Jan 2024 08:04:06 +0000 (16:04 +0800)]
perf build: Make minimal shellcheck version to v0.6.0

The perf build failed due to the shellcheck on my machine (v0.4.6 on Ubuntu
18.04.1 LTS) doesn't support -a/--check-sourced and -S/--severity option.

These two options are introduced in shellcheck v0.4.7 and v0.6.0
respectively. So restrict the minimal version of shellcheck to v0.6.0.

Fixes: b809fc656e763296 ("perf build: Shellcheck support for OUTPUT directory")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Junhao He <hejunhao3@huawei.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: linuxarm@huawei.com
Link: https://lore.kernel.org/r/20240122080406.28678-1-yangyicong@huawei.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agotools headers UAPI: Update tools's copy of drm.h headers to pick DRM_IOCTL_MODE_CLOSEFB
Arnaldo Carvalho de Melo [Fri, 26 Jan 2024 14:00:01 +0000 (11:00 -0300)]
tools headers UAPI: Update tools's copy of drm.h headers to pick DRM_IOCTL_MODE_CLOSEFB

Picking the changes from:

  8570c27932e132d2 ("drm/syncobj: Add deadline support for syncobj waits")
  9724ed6c1b1212d1 ("drm: Introduce DRM_CLIENT_CAP_CURSOR_PLANE_HOTSPOT")
  e4d983acffff270c ("drm: introduce DRM_CAP_ATOMIC_ASYNC_PAGE_FLIP")
  d208d875667e2a29 ("drm: introduce CLOSEFB IOCTL")
  afa5cf3175a22b71 ("drm/i915/uapi: fix typos/spellos and punctuation")

Addressing these perf build warnings:

  Warning: Kernel ABI header differences:

Now 'perf trace' and other code that might use the
tools/perf/trace/beauty autogenerated tables will be able to translate
this new ioctl command into a string:

  $ tools/perf/trace/beauty/drm_ioctl.sh > before
  $ cp include/uapi/drm/drm.h tools/include/uapi/drm/drm.h
  $ tools/perf/trace/beauty/drm_ioctl.sh > after
  $ diff -u before after
  --- before 2024-01-26 10:54:23.486381862 -0300
  +++ after 2024-01-26 10:54:35.767902442 -0300
  @@ -109,6 +109,7 @@
    [0xCD] = "SYNCOBJ_TIMELINE_SIGNAL",
    [0xCE] = "MODE_GETFB2",
    [0xCF] = "SYNCOBJ_EVENTFD",
  + [0xD0] = "MODE_CLOSEFB",
    [DRM_COMMAND_BASE + 0x00] = "I915_INIT",
    [DRM_COMMAND_BASE + 0x01] = "I915_FLUSH",
    [DRM_COMMAND_BASE + 0x02] = "I915_FLIP",
  $

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Javier Martinez Canillas <javierm@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Rob Clark <robdclark@chromium.org>
Cc: Simon Ser <contact@emersion.fr>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Zack Rusin <zack.rusin@broadcom.com>
Link: https://lore.kernel.org/lkml/ZbPIN9Dcc5AM0uxo@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agoperf test shell daemon: Make signal test less racy
Ian Rogers [Wed, 24 Jan 2024 04:30:15 +0000 (20:30 -0800)]
perf test shell daemon: Make signal test less racy

The daemon signal test sends signals and then expects files to be
written. It was observed on an Intel Alderlake that the signals were
sent too quickly leading to the 3 expected files not appearing.

To avoid this send the next signal only after the expected previous file
has appeared. To avoid an infinite loop the number of retries is
limited.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Ross Zwisler <zwisler@chromium.org>
Cc: Shirisha G <shirisha@linux.ibm.com>
Link: https://lore.kernel.org/r/20240124043015.1388867-6-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agoperf test shell script: Fix test for python being disabled
Ian Rogers [Wed, 24 Jan 2024 04:30:14 +0000 (20:30 -0800)]
perf test shell script: Fix test for python being disabled

"grep -cv" can exit with an error code that causes the "set -e" to abort
the script. Switch to using the grep exit code in the if condition to
avoid this.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Ross Zwisler <zwisler@chromium.org>
Cc: Shirisha G <shirisha@linux.ibm.com>
Link: https://lore.kernel.org/r/20240124043015.1388867-5-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agoperf test: Workaround debug output in list test
Ian Rogers [Wed, 24 Jan 2024 04:30:13 +0000 (20:30 -0800)]
perf test: Workaround debug output in list test

Write the JSON output to a specific file to avoid debug output
breaking it.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Ross Zwisler <zwisler@chromium.org>
Cc: Shirisha G <shirisha@linux.ibm.com>
Link: https://lore.kernel.org/r/20240124043015.1388867-4-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agoperf list: Add output file option
Ian Rogers [Wed, 24 Jan 2024 04:30:12 +0000 (20:30 -0800)]
perf list: Add output file option

Add an option to write the 'perf list' output to a specific file. This
can avoid issues with debug output being written into the output stream.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Ross Zwisler <zwisler@chromium.org>
Cc: Shirisha G <shirisha@linux.ibm.com>
Link: https://lore.kernel.org/r/20240124043015.1388867-3-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agoperf list: Switch error message to pr_err() to respect debug settings (-v)
Ian Rogers [Wed, 24 Jan 2024 04:30:11 +0000 (20:30 -0800)]
perf list: Switch error message to pr_err() to respect debug settings (-v)

Using printf() can interrupt 'perf list output', use pr_err() which can
respect debug settings and the debug file.

Signed-off-by: Ian Rogers <irogers@google.com>
Acked-by: Namhyung Kim <namhyung@kernel.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Athira Rajeev <atrajeev@linux.vnet.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kajol Jain <kjain@linux.ibm.com>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Bangoria <ravi.bangoria@amd.com>
Cc: Ross Zwisler <zwisler@chromium.org>
Cc: Shirisha G <shirisha@linux.ibm.com>
Link: https://lore.kernel.org/r/20240124043015.1388867-2-irogers@google.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agoperf test: Fix 'perf script' tests on s390
Thomas Richter [Thu, 25 Jan 2024 10:03:51 +0000 (11:03 +0100)]
perf test: Fix 'perf script' tests on s390

In linux next repo, test case 'perf script tests' fails on s390.

The root case is a command line invocation of 'perf record' with
call-graph information. On s390 only DWARF formatted call-graphs are
supported and only on software events.

Change the command line parameters for s390.

Output before:

  # perf test 89
  89: perf script tests              : FAILED!
  #

Output after:

  # perf test 89
  89: perf script tests              : Ok
  #

Fixes: 0dd5041c9a0eaf8c ("perf addr_location: Add init/exit/copy functions")
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Thomas Richter <tmricht@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Sumanth Korikkar <sumanthk@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Link: https://lore.kernel.org/r/20240125100351.936262-1-tmricht@linux.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agotools headers UAPI: Sync linux/fcntl.h with the kernel sources
Arnaldo Carvalho de Melo [Thu, 25 Jan 2024 14:23:56 +0000 (11:23 -0300)]
tools headers UAPI: Sync linux/fcntl.h with the kernel sources

To get the changes in:

  8a924db2d7b5eb69 ("fs: Pass AT_GETATTR_NOSEC flag to getattr interface function")

That don't add anything that is handled by existing hard coded tables or
table generation scripts.

This silences this perf build warning:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/linux/fcntl.h include/uapi/linux/fcntl.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Stefan Berger <stefanb@linux.ibm.com>
Link: https://lore.kernel.org/lkml/ZbJv9fGF_k2xXEdr@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agotools arch x86: Sync the msr-index.h copy with the kernel sources to pick IA32_MKTME_...
Arnaldo Carvalho de Melo [Thu, 25 Jan 2024 14:08:44 +0000 (11:08 -0300)]
tools arch x86: Sync the msr-index.h copy with the kernel sources to pick IA32_MKTME_KEYID_PARTITIONING

To pick up the changes in:

  765a0542fdc7aad7 ("x86/virt/tdx: Detect TDX during kernel boot")

Addressing this tools/perf build warning:

  Warning: Kernel ABI header differences:
    diff -u tools/arch/x86/include/asm/msr-index.h arch/x86/include/asm/msr-index.h

That makes the beautification scripts to pick some new entries:

  $ tools/perf/trace/beauty/tracepoints/x86_msr.sh > before
  $ cp arch/x86/include/asm/msr-index.h tools/arch/x86/include/asm/msr-index.h
  $ tools/perf/trace/beauty/tracepoints/x86_msr.sh > after
  $ diff -u before after
  --- before 2024-01-25 11:08:12.363223880 -0300
  +++ after 2024-01-25 11:08:24.839307699 -0300
  @@ -21,6 +21,7 @@
    [0x0000004f] = "PPIN",
    [0x00000060] = "LBR_CORE_TO",
    [0x00000079] = "IA32_UCODE_WRITE",
  + [0x00000087] = "IA32_MKTME_KEYID_PARTITIONING",
    [0x0000008b] = "AMD64_PATCH_LEVEL",
    [0x0000008C] = "IA32_SGXLEPUBKEYHASH0",
    [0x0000008D] = "IA32_SGXLEPUBKEYHASH1",
  $

Now one can trace systemwide asking to see backtraces to where that MSR
is being read/written, see this example with a previous update:

  # perf trace -e msr:*_msr/max-stack=32/ --filter="msr==IA32_MKTME_KEYID_PARTITIONING"
  ^C#

If we use -v (verbose mode) we can see what it does behind the scenes:

  # perf trace -v -e msr:*_msr/max-stack=32/ --filter="msr==IA32_MKTME_KEYID_PARTITIONING"
  Using CPUID GenuineIntel-6-8E-A
  0x87
  New filter for msr:read_msr: (msr==0x87) && (common_pid != 58627 && common_pid != 3792)
  0x87
  New filter for msr:write_msr: (msr==0x87) && (common_pid != 58627 && common_pid != 3792)
  mmap size 528384B
  ^C#

Example with a frequent msr:

  # perf trace -v -e msr:*_msr/max-stack=32/ --filter="msr==IA32_SPEC_CTRL" --max-events 2
  Using CPUID AuthenticAMD-25-21-0
  0x48
  New filter for msr:read_msr: (msr==0x48) && (common_pid != 2612129 && common_pid != 3841)
  0x48
  New filter for msr:write_msr: (msr==0x48) && (common_pid != 2612129 && common_pid != 3841)
  mmap size 528384B
  Looking at the vmlinux_path (8 entries long)
  symsrc__init: build id mismatch for vmlinux.
  Using /proc/kcore for kernel data
  Using /proc/kallsyms for symbols
   0.000 Timer/2525383 msr:write_msr(msr: IA32_SPEC_CTRL, val: 6)
      do_trace_write_msr ([kernel.kallsyms])
      do_trace_write_msr ([kernel.kallsyms])
      __switch_to_xtra ([kernel.kallsyms])
      __switch_to ([kernel.kallsyms])
      __schedule ([kernel.kallsyms])
      schedule ([kernel.kallsyms])
      futex_wait_queue_me ([kernel.kallsyms])
      futex_wait ([kernel.kallsyms])
      do_futex ([kernel.kallsyms])
      __x64_sys_futex ([kernel.kallsyms])
      do_syscall_64 ([kernel.kallsyms])
      entry_SYSCALL_64_after_hwframe ([kernel.kallsyms])
      __futex_abstimed_wait_common64 (/usr/lib64/libpthread-2.33.so)
   0.030 :0/0 msr:write_msr(msr: IA32_SPEC_CTRL, val: 2)
      do_trace_write_msr ([kernel.kallsyms])
      do_trace_write_msr ([kernel.kallsyms])
      __switch_to_xtra ([kernel.kallsyms])
      __switch_to ([kernel.kallsyms])
      __schedule ([kernel.kallsyms])
      schedule_idle ([kernel.kallsyms])
      do_idle ([kernel.kallsyms])
      cpu_startup_entry ([kernel.kallsyms])
      secondary_startup_64_no_verify ([kernel.kallsyms])
  #

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kai Huang <kai.huang@intel.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZbJt27rjkQVU1YoP@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agotools headers uapi: Sync linux/stat.h with the kernel sources to pick STATX_MNT_ID_UNIQUE
Arnaldo Carvalho de Melo [Thu, 25 Jan 2024 14:00:14 +0000 (11:00 -0300)]
tools headers uapi: Sync linux/stat.h with the kernel sources to pick STATX_MNT_ID_UNIQUE

To pick the changes from:

  98d2b43081972abe ("add unique mount ID")

That add STATX_MNT_ID_UNIQUE that was manually added to
tools/perf/trace/beauty/statx.c, at some point this should move to the
shell based automated way.

This silences this perf build warning:

  Warning: Kernel ABI header differences:
    diff -u tools/include/uapi/linux/stat.h include/uapi/linux/stat.h

Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Miklos Szeredi <mszeredi@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Link: https://lore.kernel.org/lkml/ZbJq08s19890WDo-@kernel.org
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
8 months agoperf tools: Add -H short option for --hierarchy
Namhyung Kim [Thu, 25 Jan 2024 05:51:24 +0000 (21:51 -0800)]
perf tools: Add -H short option for --hierarchy

I found the hierarchy mode useful, but it's easy to make a typo when
using it.  Let's add a short option for that.

Also update the documentation. :)

Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Link: https://lore.kernel.org/r/20240125055124.1579617-1-namhyung@kernel.org
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
8 months agoperf pmu: Treat the msr pmu as software
Ian Rogers [Wed, 24 Jan 2024 23:42:00 +0000 (15:42 -0800)]
perf pmu: Treat the msr pmu as software

The msr PMU is a software one, meaning msr events may be grouped
with events in a hardware context. As the msr PMU isn't marked as a
software PMU by perf_pmu__is_software, groups with the msr PMU in
are broken and the msr events placed in a different group. This
may lead to multiplexing errors where a hardware event isn't
counted while the msr event, such as tsc, is. Fix all of this by
marking the msr PMU as software, which agrees with the driver.

Before:
```
$ perf stat -e '{slots,tsc}' -a true
WARNING: events were regrouped to match PMUs

 Performance counter stats for 'system wide':

         1,750,335      slots
         4,243,557      tsc

       0.001456717 seconds time elapsed
```

After:
```
$ perf stat -e '{slots,tsc}' -a true
 Performance counter stats for 'system wide':

        12,526,380      slots
         3,415,163      tsc

       0.001488360 seconds time elapsed
```

Fixes: 251aa040244a ("perf parse-events: Wildcard most "numeric" events")
Signed-off-by: Ian Rogers <irogers@google.com>
Reviewed-by: Kan Liang <kan.liang@linux.intel.com>
Cc: James Clark <james.clark@arm.com>
Cc: Caleb Biggers <caleb.biggers@intel.com>
Cc: Edward Baker <edward.baker@intel.com>
Cc: Perry Taylor <perry.taylor@intel.com>
Cc: Samantha Alt <samantha.alt@intel.com>
Cc: Weilin Wang <weilin.wang@intel.com>
Link: https://lore.kernel.org/r/20240124234200.1510417-1-irogers@google.com
Signed-off-by: Namhyung Kim <namhyung@kernel.org>
8 months agoMerge tag 'net-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Linus Torvalds [Thu, 25 Jan 2024 18:58:35 +0000 (10:58 -0800)]
Merge tag 'net-6.8-rc2' of git://git./linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
 "Including fixes from bpf, netfilter and WiFi.

  Jakub is doing a lot of work to include the self-tests in our CI, as a
  result a significant amount of self-tests related fixes is flowing in
  (and will likely continue in the next few weeks).

  Current release - regressions:

   - bpf: fix a kernel crash for the riscv 64 JIT

   - bnxt_en: fix memory leak in bnxt_hwrm_get_rings()

   - revert "net: macsec: use skb_ensure_writable_head_tail to expand
     the skb"

  Previous releases - regressions:

   - core: fix removing a namespace with conflicting altnames

   - tc/flower: fix chain template offload memory leak

   - tcp:
      - make sure init the accept_queue's spinlocks once
      - fix autocork on CPUs with weak memory model

   - udp: fix busy polling

   - mlx5e:
      - fix out-of-bound read in port timestamping
      - fix peer flow lists corruption

   - iwlwifi: fix a memory corruption

  Previous releases - always broken:

   - netfilter:
      - nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress
        basechain
      - nft_limit: reject configurations that cause integer overflow

   - bpf: fix bpf_xdp_adjust_tail() with XSK zero-copy mbuf, avoiding a
     NULL pointer dereference upon shrinking

   - llc: make llc_ui_sendmsg() more robust against bonding changes

   - smc: fix illegal rmb_desc access in SMC-D connection dump

   - dpll: fix pin dump crash for rebound module

   - bnxt_en: fix possible crash after creating sw mqprio TCs

   - hv_netvsc: calculate correct ring size when PAGE_SIZE is not 4kB

  Misc:

   - several self-tests fixes for better integration with the netdev CI

   - added several missing modules descriptions"

* tag 'net-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
  tsnep: Fix XDP_RING_NEED_WAKEUP for empty fill ring
  tsnep: Remove FCS for XDP data path
  net: fec: fix the unhandled context fault from smmu
  selftests: bonding: do not test arp/ns target with mode balance-alb/tlb
  fjes: fix memleaks in fjes_hw_setup
  i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue
  i40e: set xdp_rxq_info::frag_size
  xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
  ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
  intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
  ice: remove redundant xdp_rxq_info registration
  i40e: handle multi-buffer packets that are shrunk by xdp prog
  ice: work on pre-XDP prog frag count
  xsk: fix usage of multi-buffer BPF helpers for ZC XDP
  xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
  xsk: recycle buffer in case Rx queue was full
  net: fill in MODULE_DESCRIPTION()s for rvu_mbox
  net: fill in MODULE_DESCRIPTION()s for litex
  net: fill in MODULE_DESCRIPTION()s for fsl_pq_mdio
  net: fill in MODULE_DESCRIPTION()s for fec
  ...

8 months agoMerge tag 'ovl-fixes-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/overla...
Linus Torvalds [Thu, 25 Jan 2024 18:52:30 +0000 (10:52 -0800)]
Merge tag 'ovl-fixes-6.8-rc2' of git://git./linux/kernel/git/overlayfs/vfs

Pull overlayfs fix from Amir Goldstein:
 "Change the on-disk format for the new "xwhiteouts" feature introduced
  in v6.7

  The change reduces unneeded overhead of an extra getxattr per readdir.
  The only user of the "xwhiteout" feature is the external composefs
  tool, which has been updated to support the new on-disk format.

  This change is also designated for 6.7.y"

* tag 'ovl-fixes-6.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/overlayfs/vfs:
  ovl: mark xwhiteouts directory with overlay.opaque='x'

8 months agoMerge tag 'vfs-6.8-rc2.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Linus Torvalds [Thu, 25 Jan 2024 18:41:29 +0000 (10:41 -0800)]
Merge tag 'vfs-6.8-rc2.netfs' of git://git./linux/kernel/git/vfs/vfs

Pull netfs fixes from Christian Brauner:
 "This contains various fixes for the netfs work merged earlier this
  cycle:

  afs:
   - Fix locking imbalance in afs_proc_addr_prefs_show()
   - Remove afs_dynroot_d_revalidate() which is redundant
   - Fix error handling during lookup
   - Hide sillyrenames from userspace. This fixes a race between
     silly-rename files being created/removed and userspace iterating
     over directory entries
   - Don't use unnecessary folio_*() functions

  cifs:
   - Don't use unnecessary folio_*() functions

  cachefiles:
   - erofs: Fix Null dereference when cachefiles are not doing
     ondemand-mode
   - Update mailing list

  netfs library:
   - Add Jeff Layton as reviewer
   - Update mailing list
   - Fix a error checking in netfs_perform_write()
   - fscache: Check error before dereferencing
   - Don't use unnecessary folio_*() functions"

* tag 'vfs-6.8-rc2.netfs' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
  afs: Fix missing/incorrect unlocking of RCU read lock
  afs: Remove afs_dynroot_d_revalidate() as it is redundant
  afs: Fix error handling with lookup via FS.InlineBulkStatus
  afs: Hide silly-rename files from userspace
  cachefiles, erofs: Fix NULL deref in when cachefiles is not doing ondemand-mode
  netfs: Fix a NULL vs IS_ERR() check in netfs_perform_write()
  netfs, fscache: Prevent Oops in fscache_put_cache()
  cifs: Don't use certain unnecessary folio_*() functions
  afs: Don't use certain unnecessary folio_*() functions
  netfs: Don't use certain unnecessary folio_*() functions
  netfs: Add Jeff Layton as reviewer
  netfs, cachefiles: Change mailing list

8 months agoMerge tag 'nfsd-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Linus Torvalds [Thu, 25 Jan 2024 18:26:52 +0000 (10:26 -0800)]
Merge tag 'nfsd-6.8-1' of git://git./linux/kernel/git/cel/linux

Pull nfsd fixes from Chuck Lever:

 - Fix in-kernel RPC UDP transport

 - Fix NFSv4.0 RELEASE_LOCKOWNER

* tag 'nfsd-6.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux:
  nfsd: fix RELEASE_LOCKOWNER
  SUNRPC: use request size to initialize bio_vec in svc_udp_sendto()

8 months agoMerge tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux
Linus Torvalds [Thu, 25 Jan 2024 18:21:21 +0000 (10:21 -0800)]
Merge tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux

Pull RCU fix from Neeraj Upadhyay:
 "This fixes RCU grace period stalls, which are observed when an
  outgoing CPU's quiescent state reporting results in wakeup of one of
  the grace period kthreads, to complete the grace period.

  If those kthreads have SCHED_FIFO policy, the wake up can indirectly
  arm the RT bandwith timer to the local offline CPU.

  Earlier migration of the hrtimers from the CPU introduced in commit
  5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU
  earlier") results in this timer getting ignored.

  If the RCU grace period kthreads are waiting for RT bandwidth to be
  available, they may never be actually scheduled, resulting in RCU
  stall warnings"

* tag 'urgent-rcu.2024.01.24a' of https://github.com/neeraju/linux:
  rcu: Defer RCU kthreads wakeup when CPU is dying

8 months agoMerge branch 'tsnep-xdp-fixes'
Paolo Abeni [Thu, 25 Jan 2024 10:59:44 +0000 (11:59 +0100)]
Merge branch 'tsnep-xdp-fixes'

Gerhard Engleder says:

====================
tsnep: XDP fixes

Found two driver specific problems during XDP and XSK testing.
====================

Link: https://lore.kernel.org/r/20240123200918.61219-1-gerhard@engleder-embedded.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 months agotsnep: Fix XDP_RING_NEED_WAKEUP for empty fill ring
Gerhard Engleder [Tue, 23 Jan 2024 20:09:18 +0000 (21:09 +0100)]
tsnep: Fix XDP_RING_NEED_WAKEUP for empty fill ring

The fill ring of the XDP socket may contain not enough buffers to
completey fill the RX queue during socket creation. In this case the
flag XDP_RING_NEED_WAKEUP is not set as this flag is only set if the RX
queue is not completely filled during polling.

Set XDP_RING_NEED_WAKEUP flag also if RX queue is not completely filled
during XDP socket creation.

Fixes: 3fc2333933fd ("tsnep: Add XDP socket zero-copy RX support")
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 months agotsnep: Remove FCS for XDP data path
Gerhard Engleder [Tue, 23 Jan 2024 20:09:17 +0000 (21:09 +0100)]
tsnep: Remove FCS for XDP data path

The RX data buffer includes the FCS. The FCS is already stripped for the
normal data path. But for the XDP data path the FCS is included and
acts like additional/useless data.

Remove the FCS from the RX data buffer also for XDP.

Fixes: 65b28c810035 ("tsnep: Add XDP RX support")
Fixes: 3fc2333933fd ("tsnep: Add XDP socket zero-copy RX support")
Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 months agoMerge tag 'mlx5-fixes-2024-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git...
Paolo Abeni [Thu, 25 Jan 2024 10:42:27 +0000 (11:42 +0100)]
Merge tag 'mlx5-fixes-2024-01-24' of git://git./linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5 fixes 2024-01-24

This series provides bug fixes to mlx5 driver.
Please pull and let me know if there is any problem.

* tag 'mlx5-fixes-2024-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
  net/mlx5e: fix a potential double-free in fs_any_create_groups
  net/mlx5e: fix a double-free in arfs_create_groups
  net/mlx5e: Ignore IPsec replay window values on sender side
  net/mlx5e: Allow software parsing when IPsec crypto is enabled
  net/mlx5: Use mlx5 device constant for selecting CQ period mode for ASO
  net/mlx5: DR, Can't go to uplink vport on RX rule
  net/mlx5: DR, Use the right GVMI number for drop action
  net/mlx5: Bridge, fix multicast packets sent to uplink
  net/mlx5: Fix a WARN upon a callback command failure
  net/mlx5e: Fix peer flow lists handling
  net/mlx5e: Fix inconsistent hairpin RQT sizes
  net/mlx5e: Fix operation precedence bug in port timestamping napi_poll context
  net/mlx5: Fix query of sd_group field
  net/mlx5e: Use the correct lag ports number when creating TISes
====================

Link: https://lore.kernel.org/r/20240124081855.115410-1-saeed@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 months agoMerge tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Paolo Abeni [Thu, 25 Jan 2024 10:30:31 +0000 (11:30 +0100)]
Merge tag 'for-netdev' of https://git./linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2024-01-25

The following pull-request contains BPF updates for your *net* tree.

We've added 12 non-merge commits during the last 2 day(s) which contain
a total of 13 files changed, 190 insertions(+), 91 deletions(-).

The main changes are:

1) Fix bpf_xdp_adjust_tail() in context of XSK zero-copy drivers which
   support XDP multi-buffer. The former triggered a NULL pointer
   dereference upon shrinking, from Maciej Fijalkowski & Tirthendu Sarkar.

2) Fix a bug in riscv64 BPF JIT which emitted a wrong prologue and
   epilogue for struct_ops programs, from Pu Lehui.

* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue
  i40e: set xdp_rxq_info::frag_size
  xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
  ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
  intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
  ice: remove redundant xdp_rxq_info registration
  i40e: handle multi-buffer packets that are shrunk by xdp prog
  ice: work on pre-XDP prog frag count
  xsk: fix usage of multi-buffer BPF helpers for ZC XDP
  xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
  xsk: recycle buffer in case Rx queue was full
  riscv, bpf: Fix unpredictable kernel crash about RV64 struct_ops
====================

Link: https://lore.kernel.org/r/20240125084416.10876-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 months agonet: fec: fix the unhandled context fault from smmu
Shenwei Wang [Tue, 23 Jan 2024 16:51:41 +0000 (10:51 -0600)]
net: fec: fix the unhandled context fault from smmu

When repeatedly changing the interface link speed using the command below:

ethtool -s eth0 speed 100 duplex full
ethtool -s eth0 speed 1000 duplex full

The following errors may sometimes be reported by the ARM SMMU driver:

[ 5395.035364] fec 5b040000.ethernet eth0: Link is Down
[ 5395.039255] arm-smmu 51400000.iommu: Unhandled context fault:
fsr=0x402, iova=0x00000000, fsynr=0x100001, cbfrsynra=0x852, cb=2
[ 5398.108460] fec 5b040000.ethernet eth0: Link is Up - 100Mbps/Full -
flow control off

It is identified that the FEC driver does not properly stop the TX queue
during the link speed transitions, and this results in the invalid virtual
I/O address translations from the SMMU and causes the context faults.

Fixes: dbc64a8ea231 ("net: fec: move calls to quiesce/resume packet processing out of fec_restart()")
Signed-off-by: Shenwei Wang <shenwei.wang@nxp.com>
Link: https://lore.kernel.org/r/20240123165141.2008104-1-shenwei.wang@nxp.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 months agoselftests: bonding: do not test arp/ns target with mode balance-alb/tlb
Hangbin Liu [Tue, 23 Jan 2024 07:59:17 +0000 (15:59 +0800)]
selftests: bonding: do not test arp/ns target with mode balance-alb/tlb

The prio_arp/ns tests hard code the mode to active-backup. At the same
time, The balance-alb/tlb modes do not support arp/ns target. So remove
the prio_arp/ns tests from the loop and only test active-backup mode.

Fixes: 481b56e0391e ("selftests: bonding: re-format bond option tests")
Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Closes: https://lore.kernel.org/netdev/17415.1705965957@famine/
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Link: https://lore.kernel.org/r/20240123075917.1576360-1-liuhangbin@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
8 months agoMerge tag 'nf-24-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Jakub Kicinski [Thu, 25 Jan 2024 05:03:16 +0000 (21:03 -0800)]
Merge tag 'nf-24-01-24' of git://git./linux/kernel/git/netfilter/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Update nf_tables kdoc to keep it in sync with the code, from George Guo.

2) Handle NETDEV_UNREGISTER event for inet/ingress basechain.

3) Reject configuration that cause nft_limit to overflow,
   from Florian Westphal.

4) Restrict anonymous set/map names to 16 bytes, from Florian Westphal.

5) Disallow to encode queue number and error in verdicts. This reverts
   a patch which seems to have introduced an early attempt to support for
   nfqueue maps, which is these days supported via nft_queue expression.

6) Sanitize family via .validate for expressions that explicitly refer
   to NF_INET_* hooks.

* tag 'nf-24-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: validate NFPROTO_* family
  netfilter: nf_tables: reject QUEUE/DROP verdict parameters
  netfilter: nf_tables: restrict anonymous set and map names to 16 bytes
  netfilter: nft_limit: reject configurations that cause integer overflow
  netfilter: nft_chain_filter: handle NETDEV_UNREGISTER for inet/ingress basechain
  netfilter: nf_tables: cleanup documentation
====================

Link: https://lore.kernel.org/r/20240124191248.75463-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 months agofjes: fix memleaks in fjes_hw_setup
Zhipeng Lu [Mon, 22 Jan 2024 17:24:42 +0000 (01:24 +0800)]
fjes: fix memleaks in fjes_hw_setup

In fjes_hw_setup, it allocates several memory and delay the deallocation
to the fjes_hw_exit in fjes_probe through the following call chain:

fjes_probe
  |-> fjes_hw_init
        |-> fjes_hw_setup
  |-> fjes_hw_exit

However, when fjes_hw_setup fails, fjes_hw_exit won't be called and thus
all the resources allocated in fjes_hw_setup will be leaked. In this
patch, we free those resources in fjes_hw_setup and prevents such leaks.

Fixes: 2fcbca687702 ("fjes: platform_driver's .probe and .remove routine")
Signed-off-by: Zhipeng Lu <alexious@zju.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240122172445.3841883-1-alexious@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
8 months agoMerge tag 'ceph-for-6.8-rc2' of https://github.com/ceph/ceph-client
Linus Torvalds [Thu, 25 Jan 2024 00:59:52 +0000 (16:59 -0800)]
Merge tag 'ceph-for-6.8-rc2' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A fix to avoid triggering an assert in some cases where RBD exclusive
  mappings are involved and a deprecated API cleanup"

* tag 'ceph-for-6.8-rc2' of https://github.com/ceph/ceph-client:
  rbd: don't move requests to the running list on errors
  rbd: remove usage of the deprecated ida_simple_*() API

8 months agoMerge tag 'integrity-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar...
Linus Torvalds [Thu, 25 Jan 2024 00:51:59 +0000 (16:51 -0800)]
Merge tag 'integrity-v6.8-rc1' of git://git./linux/kernel/git/zohar/linux-integrity

Pull integrity fix from Mimi Zohar:
 "Revert patch that required user-provided key data, since keys can be
  created from kernel-generated random numbers"

* tag 'integrity-v6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
  Revert "KEYS: encrypted: Add check for strsep"

8 months agoMerge branch 'net-bpf_xdp_adjust_tail-and-intel-mbuf-fixes'
Alexei Starovoitov [Thu, 25 Jan 2024 00:24:07 +0000 (16:24 -0800)]
Merge branch 'net-bpf_xdp_adjust_tail-and-intel-mbuf-fixes'

Maciej Fijalkowski says:

====================
net: bpf_xdp_adjust_tail() and Intel mbuf fixes

Hey,

after a break followed by dealing with sickness, here is a v6 that makes
bpf_xdp_adjust_tail() actually usable for ZC drivers that support XDP
multi-buffer. Since v4 I tried also using bpf_xdp_adjust_tail() with
positive offset which exposed yet another issues, which can be observed
by increased commit count when compared to v3.

John, in the end I think we should remove handling
MEM_TYPE_XSK_BUFF_POOL from __xdp_return(), but it is out of the scope
for fixes set, IMHO.

Thanks,
Maciej

v6:
- add acks [Magnus]
- fix spelling mistakes [Magnus]
- avoid touching xdp_buff in xp_alloc_{reused,new_from_fq}() [Magnus]
- s/shrink_data/bpf_xdp_shrink_data [Jakub]
- remove __shrink_data() [Jakub]
- check retvals from __xdp_rxq_info_reg() [Magnus]

v5:
- pick correct version of patch 5 [Simon]
- elaborate a bit more on what patch 2 fixes

v4:
- do not clear frags flag when deleting tail; xsk_buff_pool now does
  that
- skip some NULL tests for xsk_buff_get_tail [Martin, John]
- address problems around registering xdp_rxq_info
- fix bpf_xdp_frags_increase_tail() for ZC mbuf

v3:
- add acks
- s/xsk_buff_tail_del/xsk_buff_del_tail
- address i40e as well (thanks Tirthendu)

v2:
- fix !CONFIG_XDP_SOCKETS builds
- add reviewed-by tag to patch 3
====================

Link: https://lore.kernel.org/r/20240124191602.566724-1-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 months agoi40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue
Maciej Fijalkowski [Wed, 24 Jan 2024 19:16:02 +0000 (20:16 +0100)]
i40e: update xdp_rxq_info::frag_size for ZC enabled Rx queue

Now that i40e driver correctly sets up frag_size in xdp_rxq_info, let us
make it work for ZC multi-buffer as well. i40e_ring::rx_buf_len for ZC
is being set via xsk_pool_get_rx_frame_size() and this needs to be
propagated up to xdp_rxq_info.

Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-12-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 months agoi40e: set xdp_rxq_info::frag_size
Maciej Fijalkowski [Wed, 24 Jan 2024 19:16:01 +0000 (20:16 +0100)]
i40e: set xdp_rxq_info::frag_size

i40e support XDP multi-buffer so it is supposed to use
__xdp_rxq_info_reg() instead of xdp_rxq_info_reg() and set the
frag_size. It can not be simply converted at existing callsite because
rx_buf_len could be un-initialized, so let us register xdp_rxq_info
within i40e_configure_rx_ring(), which happen to be called with already
initialized rx_buf_len value.

Commit 5180ff1364bc ("i40e: use int for i40e_status") converted 'err' to
int, so two variables to deal with return codes are not needed within
i40e_configure_rx_ring(). Remove 'ret' and use 'err' to handle status
from xdp_rxq_info registration.

Fixes: e213ced19bef ("i40e: add support for XDP multi-buffer Rx")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-11-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 months agoxdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL
Maciej Fijalkowski [Wed, 24 Jan 2024 19:16:00 +0000 (20:16 +0100)]
xdp: reflect tail increase for MEM_TYPE_XSK_BUFF_POOL

XSK ZC Rx path calculates the size of data that will be posted to XSK Rx
queue via subtracting xdp_buff::data_end from xdp_buff::data.

In bpf_xdp_frags_increase_tail(), when underlying memory type of
xdp_rxq_info is MEM_TYPE_XSK_BUFF_POOL, add offset to data_end in tail
fragment, so that later on user space will be able to take into account
the amount of bytes added by XDP program.

Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-10-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 months agoice: update xdp_rxq_info::frag_size for ZC enabled Rx queue
Maciej Fijalkowski [Wed, 24 Jan 2024 19:15:59 +0000 (20:15 +0100)]
ice: update xdp_rxq_info::frag_size for ZC enabled Rx queue

Now that ice driver correctly sets up frag_size in xdp_rxq_info, let us
make it work for ZC multi-buffer as well. ice_rx_ring::rx_buf_len for ZC
is being set via xsk_pool_get_rx_frame_size() and this needs to be
propagated up to xdp_rxq_info.

Use a bigger hammer and instead of unregistering only xdp_rxq_info's
memory model, unregister it altogether and register it again and have
xdp_rxq_info with correct frag_size value.

Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-9-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 months agointel: xsk: initialize skb_frag_t::bv_offset in ZC drivers
Maciej Fijalkowski [Wed, 24 Jan 2024 19:15:58 +0000 (20:15 +0100)]
intel: xsk: initialize skb_frag_t::bv_offset in ZC drivers

Ice and i40e ZC drivers currently set offset of a frag within
skb_shared_info to 0, which is incorrect. xdp_buffs that come from
xsk_buff_pool always have 256 bytes of a headroom, so they need to be
taken into account to retrieve xdp_buff::data via skb_frag_address().
Otherwise, bpf_xdp_frags_increase_tail() would be starting its job from
xdp_buff::data_hard_start which would result in overwriting existing
payload.

Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-8-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 months agoice: remove redundant xdp_rxq_info registration
Maciej Fijalkowski [Wed, 24 Jan 2024 19:15:57 +0000 (20:15 +0100)]
ice: remove redundant xdp_rxq_info registration

xdp_rxq_info struct can be registered by drivers via two functions -
xdp_rxq_info_reg() and __xdp_rxq_info_reg(). The latter one allows
drivers that support XDP multi-buffer to set up xdp_rxq_info::frag_size
which in turn will make it possible to grow the packet via
bpf_xdp_adjust_tail() BPF helper.

Currently, ice registers xdp_rxq_info in two spots:
1) ice_setup_rx_ring() // via xdp_rxq_info_reg(), BUG
2) ice_vsi_cfg_rxq()   // via __xdp_rxq_info_reg(), OK

Cited commit under fixes tag took care of setting up frag_size and
updated registration scheme in 2) but it did not help as
1) is called before 2) and as shown above it uses old registration
function. This means that 2) sees that xdp_rxq_info is already
registered and never calls __xdp_rxq_info_reg() which leaves us with
xdp_rxq_info::frag_size being set to 0.

To fix this misbehavior, simply remove xdp_rxq_info_reg() call from
ice_setup_rx_ring().

Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-7-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 months agoi40e: handle multi-buffer packets that are shrunk by xdp prog
Tirthendu Sarkar [Wed, 24 Jan 2024 19:15:56 +0000 (20:15 +0100)]
i40e: handle multi-buffer packets that are shrunk by xdp prog

XDP programs can shrink packets by calling the bpf_xdp_adjust_tail()
helper function. For multi-buffer packets this may lead to reduction of
frag count stored in skb_shared_info area of the xdp_buff struct. This
results in issues with the current handling of XDP_PASS and XDP_DROP
cases.

For XDP_PASS, currently skb is being built using frag count of
xdp_buffer before it was processed by XDP prog and thus will result in
an inconsistent skb when frag count gets reduced by XDP prog. To fix
this, get correct frag count while building the skb instead of using
pre-obtained frag count.

For XDP_DROP, current page recycling logic will not reuse the page but
instead will adjust the pagecnt_bias so that the page can be freed. This
again results in inconsistent behavior as the page refcnt has already
been changed by the helper while freeing the frag(s) as part of
shrinking the packet. To fix this, only adjust pagecnt_bias for buffers
that are stillpart of the packet post-xdp prog run.

Fixes: e213ced19bef ("i40e: add support for XDP multi-buffer Rx")
Reported-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-6-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 months agoice: work on pre-XDP prog frag count
Maciej Fijalkowski [Wed, 24 Jan 2024 19:15:55 +0000 (20:15 +0100)]
ice: work on pre-XDP prog frag count

Fix an OOM panic in XDP_DRV mode when a XDP program shrinks a
multi-buffer packet by 4k bytes and then redirects it to an AF_XDP
socket.

Since support for handling multi-buffer frames was added to XDP, usage
of bpf_xdp_adjust_tail() helper within XDP program can free the page
that given fragment occupies and in turn decrease the fragment count
within skb_shared_info that is embedded in xdp_buff struct. In current
ice driver codebase, it can become problematic when page recycling logic
decides not to reuse the page. In such case, __page_frag_cache_drain()
is used with ice_rx_buf::pagecnt_bias that was not adjusted after
refcount of page was changed by XDP prog which in turn does not drain
the refcount to 0 and page is never freed.

To address this, let us store the count of frags before the XDP program
was executed on Rx ring struct. This will be used to compare with
current frag count from skb_shared_info embedded in xdp_buff. A smaller
value in the latter indicates that XDP prog freed frag(s). Then, for
given delta decrement pagecnt_bias for XDP_DROP verdict.

While at it, let us also handle the EOP frag within
ice_set_rx_bufs_act() to make our life easier, so all of the adjustments
needed to be applied against freed frags are performed in the single
place.

Fixes: 2fba7dc5157b ("ice: Add support for XDP multi-buffer on Rx side")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-5-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 months agoxsk: fix usage of multi-buffer BPF helpers for ZC XDP
Maciej Fijalkowski [Wed, 24 Jan 2024 19:15:54 +0000 (20:15 +0100)]
xsk: fix usage of multi-buffer BPF helpers for ZC XDP

Currently when packet is shrunk via bpf_xdp_adjust_tail() and memory
type is set to MEM_TYPE_XSK_BUFF_POOL, null ptr dereference happens:

[1136314.192256] BUG: kernel NULL pointer dereference, address:
0000000000000034
[1136314.203943] #PF: supervisor read access in kernel mode
[1136314.213768] #PF: error_code(0x0000) - not-present page
[1136314.223550] PGD 0 P4D 0
[1136314.230684] Oops: 0000 [#1] PREEMPT SMP NOPTI
[1136314.239621] CPU: 8 PID: 54203 Comm: xdpsock Not tainted 6.6.0+ #257
[1136314.250469] Hardware name: Intel Corporation S2600WFT/S2600WFT,
BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
[1136314.265615] RIP: 0010:__xdp_return+0x6c/0x210
[1136314.274653] Code: ad 00 48 8b 47 08 49 89 f8 a8 01 0f 85 9b 01 00 00 0f 1f 44 00 00 f0 41 ff 48 34 75 32 4c 89 c7 e9 79 cd 80 ff 83 fe 03 75 17 <f6> 41 34 01 0f 85 02 01 00 00 48 89 cf e9 22 cc 1e 00 e9 3d d2 86
[1136314.302907] RSP: 0018:ffffc900089f8db0 EFLAGS: 00010246
[1136314.312967] RAX: ffffc9003168aed0 RBX: ffff8881c3300000 RCX:
0000000000000000
[1136314.324953] RDX: 0000000000000000 RSI: 0000000000000003 RDI:
ffffc9003168c000
[1136314.336929] RBP: 0000000000000ae0 R08: 0000000000000002 R09:
0000000000010000
[1136314.348844] R10: ffffc9000e495000 R11: 0000000000000040 R12:
0000000000000001
[1136314.360706] R13: 0000000000000524 R14: ffffc9003168aec0 R15:
0000000000000001
[1136314.373298] FS:  00007f8df8bbcb80(0000) GS:ffff8897e0e00000(0000)
knlGS:0000000000000000
[1136314.386105] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1136314.396532] CR2: 0000000000000034 CR3: 00000001aa912002 CR4:
00000000007706f0
[1136314.408377] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[1136314.420173] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[1136314.431890] PKRU: 55555554
[1136314.439143] Call Trace:
[1136314.446058]  <IRQ>
[1136314.452465]  ? __die+0x20/0x70
[1136314.459881]  ? page_fault_oops+0x15b/0x440
[1136314.468305]  ? exc_page_fault+0x6a/0x150
[1136314.476491]  ? asm_exc_page_fault+0x22/0x30
[1136314.484927]  ? __xdp_return+0x6c/0x210
[1136314.492863]  bpf_xdp_adjust_tail+0x155/0x1d0
[1136314.501269]  bpf_prog_ccc47ae29d3b6570_xdp_sock_prog+0x15/0x60
[1136314.511263]  ice_clean_rx_irq_zc+0x206/0xc60 [ice]
[1136314.520222]  ? ice_xmit_zc+0x6e/0x150 [ice]
[1136314.528506]  ice_napi_poll+0x467/0x670 [ice]
[1136314.536858]  ? ttwu_do_activate.constprop.0+0x8f/0x1a0
[1136314.546010]  __napi_poll+0x29/0x1b0
[1136314.553462]  net_rx_action+0x133/0x270
[1136314.561619]  __do_softirq+0xbe/0x28e
[1136314.569303]  do_softirq+0x3f/0x60

This comes from __xdp_return() call with xdp_buff argument passed as
NULL which is supposed to be consumed by xsk_buff_free() call.

To address this properly, in ZC case, a node that represents the frag
being removed has to be pulled out of xskb_list. Introduce
appropriate xsk helpers to do such node operation and use them
accordingly within bpf_xdp_adjust_tail().

Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com> # For the xsk header part
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-4-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 months agoxsk: make xsk_buff_pool responsible for clearing xdp_buff::flags
Maciej Fijalkowski [Wed, 24 Jan 2024 19:15:53 +0000 (20:15 +0100)]
xsk: make xsk_buff_pool responsible for clearing xdp_buff::flags

XDP multi-buffer support introduced XDP_FLAGS_HAS_FRAGS flag that is
used by drivers to notify data path whether xdp_buff contains fragments
or not. Data path looks up mentioned flag on first buffer that occupies
the linear part of xdp_buff, so drivers only modify it there. This is
sufficient for SKB and XDP_DRV modes as usually xdp_buff is allocated on
stack or it resides within struct representing driver's queue and
fragments are carried via skb_frag_t structs. IOW, we are dealing with
only one xdp_buff.

ZC mode though relies on list of xdp_buff structs that is carried via
xsk_buff_pool::xskb_list, so ZC data path has to make sure that
fragments do *not* have XDP_FLAGS_HAS_FRAGS set. Otherwise,
xsk_buff_free() could misbehave if it would be executed against xdp_buff
that carries a frag with XDP_FLAGS_HAS_FRAGS flag set. Such scenario can
take place when within supplied XDP program bpf_xdp_adjust_tail() is
used with negative offset that would in turn release the tail fragment
from multi-buffer frame.

Calling xsk_buff_free() on tail fragment with XDP_FLAGS_HAS_FRAGS would
result in releasing all the nodes from xskb_list that were produced by
driver before XDP program execution, which is not what is intended -
only tail fragment should be deleted from xskb_list and then it should
be put onto xsk_buff_pool::free_list. Such multi-buffer frame will never
make it up to user space, so from AF_XDP application POV there would be
no traffic running, however due to free_list getting constantly new
nodes, driver will be able to feed HW Rx queue with recycled buffers.
Bottom line is that instead of traffic being redirected to user space,
it would be continuously dropped.

To fix this, let us clear the mentioned flag on xsk_buff_pool side
during xdp_buff initialization, which is what should have been done
right from the start of XSK multi-buffer support.

Fixes: 1bbc04de607b ("ice: xsk: add RX multi-buffer support")
Fixes: 1c9ba9c14658 ("i40e: xsk: add RX multi-buffer support")
Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-3-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
8 months agoxsk: recycle buffer in case Rx queue was full
Maciej Fijalkowski [Wed, 24 Jan 2024 19:15:52 +0000 (20:15 +0100)]
xsk: recycle buffer in case Rx queue was full

Add missing xsk_buff_free() call when __xsk_rcv_zc() failed to produce
descriptor to XSK Rx queue.

Fixes: 24ea50127ecf ("xsk: support mbuf on ZC RX")
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Link: https://lore.kernel.org/r/20240124191602.566724-2-maciej.fijalkowski@intel.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>