tools/perf/Documentation/perf-arm-spe.txt

   1 perf-arm-spe(1)
   2 ================
   3
   4 NAME
   5 ----
   6 perf-arm-spe - Support for Arm Statistical Profiling Extension within Perf tools
   7
   8 SYNOPSIS
   9 --------
  10 [verse]
  11 'perf record' -e arm_spe//
  12
  13 DESCRIPTION
  14 -----------
  15
  16 The SPE (Statistical Profiling Extension) feature provides accurate attribution of latencies and
   17 events down to individual instructions. Rather than being interrupt-driven, it picks an
  18 instruction to sample and then captures data for it during execution. Data includes execution time
  19 in cycles. For loads and stores it also includes data address, cache miss events, and data origin.
  20
  21 The sampling has 5 stages:
  22
  23   1. Choose an operation
  24   2. Collect data about the operation
  25   3. Optionally discard the record based on a filter
  26   4. Write the record to memory
  27   5. Interrupt when the buffer is full
  28
  29 Choose an operation
  30 ~~~~~~~~~~~~~~~~~~~
  31
   32 The operation is chosen from a sample population. For SPE this is an IMPLEMENTATION DEFINED choice of all
  33 architectural instructions or all micro-ops. Sampling happens at a programmable interval. The
  34 architecture provides a mechanism for the SPE driver to infer the minimum interval at which it should
  35 sample. This minimum interval is used by the driver if no interval is specified. A pseudo-random
  36 perturbation is also added to the sampling interval by default.
  37
  38 Collect data about the operation
  39 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  40
  41 Program counter, PMU events, timings and data addresses related to the operation are recorded.
   42 Sampling ensures that only one sampled operation is in flight.
  43
  44 Optionally discard the record based on a filter
  45 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  46
  47 Based on programmable criteria, choose whether to keep the record or discard it. If the record is
  48 discarded then the flow stops here for this sample.
  49
  50 Write the record to memory
  51 ~~~~~~~~~~~~~~~~~~~~~~~~~~
  52
   53 The record is appended to a memory buffer.
  54
  55 Interrupt when the buffer is full
  56 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  57
  58 When the buffer fills, an interrupt is sent and the driver signals Perf to collect the records.
  59 Perf saves the raw data in the perf.data file.
  60
  61 Opening the file
  62 ----------------
  63
  64 Up until this point no decoding of the SPE data was done by either the kernel or Perf. Only when the
  65 recorded file is opened with 'perf report' or 'perf script' does the decoding happen. When decoding
  66 the data, Perf generates "synthetic samples" as if these were generated at the time of the
  67 recording. These samples are the same as if normal sampling was done by Perf without using SPE,
  68 although they may have more attributes associated with them. For example a normal sample may have
  69 just the instruction pointer, but an SPE sample can have data addresses and latency attributes.
  70
  71 Why Sampling?
  72 -------------
  73
  74  - Sampling, rather than tracing, cuts down the profiling problem to something more manageable for
  75  hardware. Only one sampled operation is in flight at a time.
  76
   77  - Allows precise attribution data, including the full PC of the instruction and the data virtual
   78  and physical addresses.
  79
   80  - Allows correlation between an instruction and events, such as TLB and cache misses. (Data source
  81  indicates which particular cache was hit, but the meaning is implementation defined because
  82  different implementations can have different cache configurations.)
  83
  84 However, SPE does not provide any call-graph information, and relies on statistical methods.
  85
  86 Collisions
  87 ----------
  88
  89 When an operation is sampled while a previous sampled operation has not finished, a collision
  90 occurs. The new sample is dropped. Collisions affect the integrity of the data, so the sample rate
  91 should be set to avoid collisions.
  92
   93 The 'sample_collision' PMU event can be used to determine the number of lost samples. However, this
   94 count is based on collisions _before_ filtering occurs, so it is not an exact number of the dropped
   95 samples that would have made it through the filter. It can only serve as a rough
   96 guide.
  97
  98 The effect of microarchitectural sampling
  99 -----------------------------------------
 100
 101 If an implementation samples micro-operations instead of instructions, the results of sampling must
 102 be weighted accordingly.
 103
 104 For example, if a given instruction A is always converted into two micro-operations, A0 and A1, it
 105 becomes twice as likely to appear in the sample population.
 106
 107 The coarse effect of conversions, and, if applicable, sampling of speculative operations, can be
 108 estimated from the 'sample_pop' and 'inst_retired' PMU events.
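
As a rough illustration, assuming both counts were gathered over the same run, the ratio of
'sample_pop' to 'inst_retired' approximates the average number of sample-population operations per
retired instruction. The helper name 'spe_pop_ratio' and the counts below are hypothetical and only
sketch the arithmetic:

```shell
# Hypothetical helper: average operations in the sample population per
# retired instruction. A value near 1.00 suggests little inflation from
# micro-op conversion or speculation. Both arguments are raw event counts.
spe_pop_ratio() {
    # $1 = sample_pop count, $2 = inst_retired count
    awk -v pop="$1" -v ret="$2" 'BEGIN {
        if (ret <= 0) { print "n/a"; exit }
        printf "%.2f\n", pop / ret
    }'
}

# Made-up counts, for illustration only.
spe_pop_ratio 1500000 1000000
```

A ratio well above 1 would hint that per-instruction sample counts need to be scaled down before
they are compared between instructions.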
 109
 110 Kernel Requirements
 111 -------------------
 112
 113 The ARM_SPE_PMU config must be set to build as either a module or statically.
 114
  115 Depending on CPU model, the kernel may need to be booted with page table isolation disabled
  116 (kpti=off). If KPTI needs to be disabled but is not, SPE setup fails with the console message
  117 "profiling buffer inaccessible. Try passing 'kpti=off' on the kernel command line".
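
A quick way to see whether these requirements are met is to look for the PMU in sysfs. The sketch
below assumes the driver registers the PMU as arm_spe_<n> under /sys/bus/event_source/devices. The
helper name 'spe_pmu_present' is hypothetical, and the directory is a parameter so the check can be
pointed elsewhere:

```shell
# Sketch: report whether an SPE PMU is registered with perf.
# Assumption: the PMU appears as arm_spe_<n> in the event_source directory.
spe_pmu_present() {
    dir=${1:-/sys/bus/event_source/devices}
    for pmu in "$dir"/arm_spe_*; do
        # The glob stays unexpanded when nothing matches, so test existence.
        if [ -e "$pmu" ]; then
            echo "found ${pmu##*/}"
            return 0
        fi
    done
    echo "no arm_spe PMU found"
    return 1
}

spe_pmu_present || true
```

If no PMU is found, check the kernel config, whether the module is loaded, and the console log for
the KPTI message quoted above.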
 118
 119 Capturing SPE with perf command-line tools
 120 ------------------------------------------
 121
 122 You can record a session with SPE samples:
 123
 124   perf record -e arm_spe// -- ./mybench
 125
 126 The sample period is set from the -c option, and because the minimum interval is used by default
 127 it's recommended to set this to a higher value. The value is written to PMSIRR.INTERVAL.
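
To pick a larger period it helps to know the minimum first. The sketch below assumes the driver
exposes the minimum interval as the sysfs attribute caps/min_interval of the PMU, and that the PMU
is named arm_spe_0; both are assumptions about the running kernel, and the helper name
'spe_min_interval' is hypothetical:

```shell
# Sketch: print the minimum sampling interval advertised by the driver.
# Assumption: the value is readable from <pmu dir>/caps/min_interval.
spe_min_interval() {
    pmu_dir=${1:-/sys/bus/event_source/devices/arm_spe_0}
    if [ -r "$pmu_dir/caps/min_interval" ]; then
        cat "$pmu_dir/caps/min_interval"
    else
        echo "unknown"
    fi
}

echo "minimum interval: $(spe_min_interval)"

# A record command with an explicit, larger period would then look like this
# (10000 is an arbitrary illustrative value, not a recommendation):
#   perf record -e arm_spe// -c 10000 -- ./mybench
```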
 128
 129 Config parameters
 130 ~~~~~~~~~~~~~~~~~
 131
 132 These are placed between the // in the event and comma separated. For example '-e
 133 arm_spe/load_filter=1,min_latency=10/'
 134
 135   branch_filter=1     - collect branches only (PMSFCR.B)
 136   event_filter=<mask> - filter on specific events (PMSEVFR) - see bitfield description below
 137   jitter=1            - use jitter to avoid resonance when sampling (PMSIRR.RND)
 138   load_filter=1       - collect loads only (PMSFCR.LD)
 139   min_latency=<n>     - collect only samples with this latency or higher* (PMSLATFR)
 140   pa_enable=1         - collect physical address (as well as VA) of loads/stores (PMSCR.PA) - requires privilege
 141   pct_enable=1        - collect physical timestamp instead of virtual timestamp (PMSCR.PCT) - requires privilege
 142   store_filter=1      - collect stores only (PMSFCR.ST)
 143   ts_enable=1         - enable timestamping with value of generic timer (PMSCR.TS)
 144
 145 +++*+++ Latency is the total latency from the point at which sampling started on that instruction, rather
 146 than only the execution latency.
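
When several parameters are combined in scripts, the event string can be assembled rather than
typed by hand. The helper 'spe_event' below is hypothetical and does nothing more than join its
arguments with commas between the slashes; it does not validate parameter names:

```shell
# Hypothetical helper: join key=value settings into an arm_spe event string.
# The subshell keeps the IFS change from leaking into the caller.
spe_event() {
    (IFS=,; printf 'arm_spe/%s/\n' "$*")
}

spe_event load_filter=1 min_latency=10

# Possible use on a machine with SPE:
#   perf record -e "$(spe_event load_filter=1 min_latency=10)" -- ./mybench
```

With no arguments the helper prints the plain 'arm_spe//' event.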
 147
 148 Only some events can be filtered on; these include:
 149
 150   bit 1     - instruction retired (i.e. omit speculative instructions)
 151   bit 3     - L1D refill
 152   bit 5     - TLB refill
 153   bit 7     - mispredict
 154   bit 11    - misaligned access
 155
 156 So to sample just retired instructions:
 157
 158   perf record -e arm_spe/event_filter=2/ -- ./mybench
 159
 160 or just mispredicted branches:
 161
 162   perf record -e arm_spe/event_filter=0x80/ -- ./mybench
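
The mask values above are simply 1 shifted left by the bit number. As a small sketch, a
hypothetical helper 'spe_event_mask' can build the value for 'event_filter' from a list of the bit
numbers given above:

```shell
# Hypothetical helper: build an event_filter mask from bit numbers.
spe_event_mask() {
    mask=0
    for bit in "$@"; do
        mask=$(( mask | (1 << bit) ))
    done
    printf '0x%x\n' "$mask"
}

spe_event_mask 1      # instruction retired
spe_event_mask 7      # mispredict
spe_event_mask 1 3    # both the retired and L1D refill bits set

# Possible use on a machine with SPE:
#   perf record -e arm_spe/event_filter=$(spe_event_mask 7)/ -- ./mybench
```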
 163
 164 Viewing the data
  165 ~~~~~~~~~~~~~~~~
 166
 167 By default perf report and perf script will assign samples to separate groups depending on the
 168 attributes/events of the SPE record. Because instructions can have multiple events associated with
 169 them, the samples in these groups are not necessarily unique. For example perf report shows these
 170 groups:
 171
 172   Available samples
 173   0 arm_spe//
 174   0 dummy:u
 175   21 l1d-miss
 176   897 l1d-access
 177   5 llc-miss
 178   7 llc-access
 179   2 tlb-miss
 180   1K tlb-access
 181   36 branch-miss
 182   0 remote-access
 183   900 memory
 184
 185 The arm_spe// and dummy:u events are implementation details and are expected to be empty.
 186
 187 To get a full list of unique samples that are not sorted into groups, set the itrace option to
 188 generate 'instruction' samples. The period option is also taken into account, so set it to 1
 189 instruction unless you want to further downsample the already sampled SPE data:
 190
 191   perf report --itrace=i1i
 192
 193 Memory access details are also stored on the samples and this can be viewed with:
 194
 195   perf report --mem-mode
 196
 197 Common errors
 198 ~~~~~~~~~~~~~
 199
 200  - "Cannot find PMU `arm_spe'. Missing kernel support?"
 201
 202    Module not built or loaded, KPTI not disabled (see above), or running on a VM
 203
 204  - "Arm SPE CONTEXT packets not found in the traces."
 205
  206    Root privilege is required to collect context packets, but these only increase the accuracy of
  207    assigning PIDs to kernel samples. For userspace sampling this can be ignored.
 208
 209  - Excessively large perf.data file size
 210
 211    Increase sampling interval (see above)
 212
 213
 214 SEE ALSO
 215 --------
 216
 217 linkperf:perf-record[1], linkperf:perf-script[1], linkperf:perf-report[1],
 218 linkperf:perf-inject[1]