Commit | Line | Data |
---|---|---|
e7bc62b6 IM |
1 | |
2 | Performance Counters for Linux | |
3 | ------------------------------ | |
4 | ||
5 | Performance counters are special hardware registers available on most modern | |
6 | CPUs. These registers count the number of certain types of hw events, such | |
7 | as instructions executed, cache misses suffered, or branches mispredicted, | |
8 | without slowing down the kernel or applications. These registers can also | |
9 | trigger interrupts when a threshold number of events has passed, and can | |
10 | thus be used to profile the code that runs on that CPU. | |
11 | ||
12 | The Linux Performance Counter subsystem provides an abstraction of these | |
447557ac | 13 | hardware capabilities. It provides per task and per CPU counters and counter |
f66c6b20 PM |
14 | groups, with event capabilities on top of those. It also |
15 | provides "virtual" 64-bit counters, regardless of the width of the | |
16 | underlying hardware counters. | |
e7bc62b6 IM |
17 | |
18 | Performance counters are accessed via special file descriptors. | |
19 | There's one file descriptor per virtual counter used. | |
20 | ||
b68eebd1 | 21 | The special file descriptor is opened via the sys_perf_event_open() |
e7bc62b6 IM |
22 | system call: |
23 | ||
0b413e44 | 24 | int sys_perf_event_open(struct perf_event_attr *hw_event_uptr, |
f66c6b20 PM |
25 | pid_t pid, int cpu, int group_fd, |
26 | unsigned long flags); | |
e7bc62b6 IM |
27 | |
28 | The syscall returns the new fd. The fd can be used via the normal | |
29 | VFS system calls: read() can be used to read the counter, fcntl() | |
30 | can be used to set the blocking mode, etc. | |
31 | ||
32 | Multiple counters can be kept open at a time, and the counters | |
33 | can be poll()ed. | |
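For orientation, the open/read sequence described above might look like the following sketch in C. It is written against a current <linux/perf_event.h>, which is an assumption: that header carries the event class and the event number in separate 'type' and 'config' fields rather than in the packed 'config' word this document goes on to describe. glibc ships no wrapper for the system call, so the sketch goes through syscall(2). The helper names perf_open(), count_instructions() and busy_loop() are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper (hypothetical name): glibc has no perf_event_open() wrapper. */
static int perf_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                     int group_fd, unsigned long flags)
{
    return (int)syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Example workload so that the counter has something to count. */
static void busy_loop(void)
{
    volatile unsigned long x = 0;
    for (unsigned long i = 0; i < 100000; i++)
        x += i;
}

/*
 * Count user-space instructions of the calling task while fn() runs.
 * Returns 0 and stores the count, or -1 if the counter could not be
 * opened or read (e.g. no PMU access in this environment).
 */
static int count_instructions(void (*fn)(void), uint64_t *count)
{
    struct perf_event_attr attr;
    ssize_t n;
    int fd;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;          /* assumption: modern split fields */
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid == 0: current task, cpu == -1: any CPU, group_fd == -1: own group */
    fd = perf_open(&attr, 0, -1, -1, 0);
    if (fd < 0)
        return -1;

    fn();

    n = read(fd, count, sizeof(*count));     /* one u64: the counter value */
    close(fd);
    return n == (ssize_t)sizeof(*count) ? 0 : -1;
}
```

A caller would use count_instructions(busy_loop, &n) and treat -1 as "no counter available", which can happen when access to the counters is restricted.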
34 | ||
0b413e44 | 35 | When creating a new counter fd, 'perf_event_attr' is: |
447557ac | 36 | |
0b413e44 | 37 | struct perf_event_attr { |
e5791a80 PZ |
38 | /* |
39 | * The MSB of the config word signifies if the rest contains cpu | |
40 | * specific (raw) counter configuration data; if unset, the next | |
41 | * 7 bits are an event type and the rest of the bits are the event | |
42 | * identifier. | |
43 | */ | |
44 | __u64 config; | |
45 | ||
46 | __u64 irq_period; | |
47 | __u32 record_type; | |
48 | __u32 read_format; | |
49 | ||
50 | __u64 disabled : 1, /* off by default */ | |
e5791a80 PZ |
51 | inherit : 1, /* children inherit it */ |
52 | pinned : 1, /* must always be on PMU */ | |
53 | exclusive : 1, /* only group on PMU */ | |
54 | exclude_user : 1, /* don't count user */ | |
55 | exclude_kernel : 1, /* ditto kernel */ | |
56 | exclude_hv : 1, /* ditto hypervisor */ | |
57 | exclude_idle : 1, /* don't count when idle */ | |
58 | mmap : 1, /* include mmap data */ | |
59 | munmap : 1, /* include munmap data */ | |
60 | comm : 1, /* include comm data */ | |
61 | ||
62 | __reserved_1 : 52; | |
63 | ||
64 | __u32 extra_config_len; | |
65 | __u32 wakeup_events; /* wakeup every n events */ | |
66 | ||
67 | __u64 __reserved_2; | |
68 | __u64 __reserved_3; | |
447557ac IM |
69 | }; |
70 | ||
e5791a80 | 71 | The 'config' field specifies what the counter should count. It |
f66c6b20 PM |
72 | is divided into 3 bit-fields: |
73 | ||
e5791a80 PZ |
74 | raw_type: 1 bit (most significant bit) 0x8000_0000_0000_0000 |
75 | type: 7 bits (next most significant) 0x7f00_0000_0000_0000 | |
76 | event_id: 56 bits (least significant) 0x00ff_ffff_ffff_ffff | |
f66c6b20 PM |
77 | |
78 | If 'raw_type' is 1, then the counter will count a hardware event | |
79 | specified by the remaining 63 bits of 'config'. The encoding is | |
80 | machine-specific. | |
81 | ||
82 | If 'raw_type' is 0, then the 'type' field says what kind of counter | |
83 | this is, with the following encoding: | |
84 | ||
b68eebd1 | 85 | enum perf_type_id { |
f66c6b20 PM |
86 | PERF_TYPE_HARDWARE = 0, |
87 | PERF_TYPE_SOFTWARE = 1, | |
88 | PERF_TYPE_TRACEPOINT = 2, | |
89 | }; | |
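The three bit-fields of 'config' can be packed and unpacked with a few shifts and masks. The sketch below copies the masks from the layout given above; the macro and function names (CFG_*, cfg_pack() and so on) are hypothetical and do not come from any kernel header.

```c
#include <stdint.h>

/* Masks taken from the bit-field layout in the text; names are hypothetical. */
#define CFG_RAW_BIT     0x8000000000000000ULL  /* raw_type: 1 bit   */
#define CFG_TYPE_SHIFT  56
#define CFG_TYPE_MASK   0x7fULL                /* type: 7 bits      */
#define CFG_EVENT_MASK  0x00ffffffffffffffULL  /* event_id: 56 bits */

/* Build a config word for a generic (non-raw) event. */
static uint64_t cfg_pack(unsigned int type, uint64_t event_id)
{
    return ((type & CFG_TYPE_MASK) << CFG_TYPE_SHIFT) |
           (event_id & CFG_EVENT_MASK);
}

/* Build a config word for a raw event: MSB set, low 63 bits CPU-specific. */
static uint64_t cfg_pack_raw(uint64_t raw)
{
    return CFG_RAW_BIT | (raw & ~CFG_RAW_BIT);
}

static int cfg_is_raw(uint64_t config)
{
    return (config & CFG_RAW_BIT) != 0;
}

static unsigned int cfg_type(uint64_t config)
{
    return (unsigned int)((config >> CFG_TYPE_SHIFT) & CFG_TYPE_MASK);
}

static uint64_t cfg_event_id(uint64_t config)
{
    return config & CFG_EVENT_MASK;
}
```

For instance, cfg_pack(1, 3) yields 0x0100000000000003, i.e. type PERF_TYPE_SOFTWARE with event_id 3.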
90 | ||
91 | A counter of PERF_TYPE_HARDWARE will count the hardware event | |
92 | specified by 'event_id': | |
93 | ||
447557ac | 94 | /* |
f66c6b20 | 95 | * Generalized performance counter event types, used by the hw_event.event_id |
cdd6c482 | 96 | * parameter of the sys_perf_event_open() syscall: |
447557ac | 97 | */ |
b68eebd1 | 98 | enum perf_hw_id { |
447557ac IM |
99 | /* |
100 | * Common hardware events, generalized by the kernel: | |
101 | */ | |
f4dbfa8f PZ |
102 | PERF_COUNT_HW_CPU_CYCLES = 0, |
103 | PERF_COUNT_HW_INSTRUCTIONS = 1, | |
0895cf0a | 104 | PERF_COUNT_HW_CACHE_REFERENCES = 2, |
f4dbfa8f PZ |
105 | PERF_COUNT_HW_CACHE_MISSES = 3, |
106 | PERF_COUNT_HW_BRANCH_INSTRUCTIONS = 4, | |
0895cf0a | 107 | PERF_COUNT_HW_BRANCH_MISSES = 5, |
f4dbfa8f | 108 | PERF_COUNT_HW_BUS_CYCLES = 6, |
438f1a9f LX |
109 | PERF_COUNT_HW_STALLED_CYCLES_FRONTEND = 7, |
110 | PERF_COUNT_HW_STALLED_CYCLES_BACKEND = 8, | |
111 | PERF_COUNT_HW_REF_CPU_CYCLES = 9, | |
447557ac | 112 | }; |
e7bc62b6 | 113 | |
f66c6b20 PM |
114 | These are standardized types of events that work relatively uniformly |
115 | on all CPUs that implement Performance Counters support under Linux, | |
116 | although there may be variations (e.g., different CPUs might count | |
117 | cache references and misses at different levels of the cache hierarchy). | |
118 | If a CPU is not able to count the selected event, then the system call | |
119 | will return -EINVAL. | |
e7bc62b6 | 120 | |
f66c6b20 PM |
121 | More hw_event_types are supported as well, but they are CPU-specific |
122 | and accessed as raw events. For example, to count "External bus | |
123 | cycles while bus lock signal asserted" events on Intel Core CPUs, pass | |
124 | in a 0x4064 event_id value and set hw_event.raw_type to 1. | |
e7bc62b6 | 125 | |
f66c6b20 PM |
126 | A counter of type PERF_TYPE_SOFTWARE will count one of the available |
127 | software events, selected by 'event_id': | |
e7bc62b6 | 128 | |
447557ac | 129 | /* |
f66c6b20 PM |
130 | * Special "software" counters provided by the kernel, even if the hardware |
131 | * does not support performance counters. These counters measure various | |
132 | * physical and sw events of the kernel (and allow the profiling of them as | |
133 | * well): | |
447557ac | 134 | */ |
b68eebd1 | 135 | enum perf_sw_ids { |
f4dbfa8f | 136 | PERF_COUNT_SW_CPU_CLOCK = 0, |
0895cf0a KS |
137 | PERF_COUNT_SW_TASK_CLOCK = 1, |
138 | PERF_COUNT_SW_PAGE_FAULTS = 2, | |
f4dbfa8f PZ |
139 | PERF_COUNT_SW_CONTEXT_SWITCHES = 3, |
140 | PERF_COUNT_SW_CPU_MIGRATIONS = 4, | |
141 | PERF_COUNT_SW_PAGE_FAULTS_MIN = 5, | |
142 | PERF_COUNT_SW_PAGE_FAULTS_MAJ = 6, | |
f7d79860 AB |
143 | PERF_COUNT_SW_ALIGNMENT_FAULTS = 7, |
144 | PERF_COUNT_SW_EMULATION_FAULTS = 8, | |
447557ac | 145 | }; |
e7bc62b6 | 146 | |
e5791a80 PZ |
147 | Counters of the type PERF_TYPE_TRACEPOINT are available when the ftrace event |
148 | tracer is available, and event_id values can be obtained from | |
149 | /debug/tracing/events/*/*/id | |
150 | ||
151 | ||
f66c6b20 PM |
152 | Counters come in two flavours: counting counters and sampling |
153 | counters. A "counting" counter is one that is used for counting the | |
154 | number of events that occur, and is characterised by having | |
e5791a80 PZ |
155 | irq_period = 0. |
156 | ||
157 | ||
158 | A read() on a counter returns the current value of the counter and possibly | |
159 | additional values as specified by 'read_format'; each value is a u64 (8 bytes) | |
160 | in size. | |
161 | ||
162 | /* | |
163 | * Bits that can be set in hw_event.read_format to request that | |
164 | * reads on the counter should return the indicated quantities, | |
165 | * in increasing order of bit value, after the counter value. | |
166 | */ | |
cdd6c482 | 167 | enum perf_event_read_format { |
e5791a80 PZ |
168 | PERF_FORMAT_TOTAL_TIME_ENABLED = 1, |
169 | PERF_FORMAT_TOTAL_TIME_RUNNING = 2, | |
170 | }; | |
171 | ||
172 | Using these additional values, one can establish the overcommit ratio for a | |
173 | particular counter (time running relative to time enabled), which allows one | |
174 | to take the round-robin scheduling effect into account. | |
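One possible way to apply this is sketched below, assuming both read_format bits are set so that a read() yields three u64 values in the order value, time enabled, time running. The struct and function names are hypothetical, and the scaling through double is a simplification that loses precision for very large counts.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; layout follows "in increasing order of bit value". */
struct read_values {
    uint64_t value;         /* raw counter value               */
    uint64_t time_enabled;  /* PERF_FORMAT_TOTAL_TIME_ENABLED  */
    uint64_t time_running;  /* PERF_FORMAT_TOTAL_TIME_RUNNING  */
};

/* Number of bytes a read() returns for a given read_format bitmask. */
static size_t read_size(uint64_t read_format)
{
    size_t n = 1;                       /* the counter value itself */

    if (read_format & 1)                /* TOTAL_TIME_ENABLED */
        n++;
    if (read_format & 2)                /* TOTAL_TIME_RUNNING */
        n++;
    return n * sizeof(uint64_t);
}

/*
 * Estimate what the count would have been had the counter been on the
 * PMU for the whole time it was enabled. Returns 0 if it never ran.
 */
static uint64_t scaled_count(const struct read_values *rv)
{
    if (rv->time_running == 0)
        return 0;
    return (uint64_t)((double)rv->value * (double)rv->time_enabled /
                      (double)rv->time_running);
}
```

A counter that was enabled for 200 time units but ran for only 100 of them, and counted 1000 events, would be scaled to an estimate of 2000.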
175 | ||
e7bc62b6 | 176 | |
f66c6b20 PM |
177 | A "sampling" counter is one that is set up to generate an interrupt |
178 | every N events, where N is given by 'irq_period'. A sampling counter | |
e5791a80 PZ |
179 | has irq_period > 0. The record_type controls what data is recorded on each |
180 | interrupt: | |
e7bc62b6 | 181 | |
f66c6b20 | 182 | /* |
e5791a80 PZ |
183 | * Bits that can be set in hw_event.record_type to request information |
184 | * in the overflow packets. | |
f66c6b20 | 185 | */ |
cdd6c482 | 186 | enum perf_event_record_format { |
e5791a80 PZ |
187 | PERF_RECORD_IP = 1U << 0, |
188 | PERF_RECORD_TID = 1U << 1, | |
189 | PERF_RECORD_TIME = 1U << 2, | |
190 | PERF_RECORD_ADDR = 1U << 3, | |
191 | PERF_RECORD_GROUP = 1U << 4, | |
192 | PERF_RECORD_CALLCHAIN = 1U << 5, | |
f66c6b20 | 193 | }; |
447557ac | 194 | |
e5791a80 PZ |
195 | Such (and other) events will be recorded in a ring-buffer, which is |
196 | available to user-space using mmap() (see below). | |
f66c6b20 PM |
197 | |
198 | The 'disabled' bit specifies whether the counter starts out disabled | |
199 | or enabled. If it is initially disabled, it can be enabled by ioctl | |
200 | or prctl (see below). | |
201 | ||
f66c6b20 PM |
202 | The 'inherit' bit, if set, specifies that this counter should count |
203 | events on descendant tasks as well as the task specified. This only | |
204 | applies to new descendants, not to any existing descendants at the | |
205 | time the counter is created (nor to any new descendants of existing | |
206 | descendants). | |
207 | ||
208 | The 'pinned' bit, if set, specifies that the counter should always be | |
209 | on the CPU if at all possible. It only applies to hardware counters | |
210 | and only to group leaders. If a pinned counter cannot be put onto the | |
211 | CPU (e.g. because there are not enough hardware counters or because of | |
212 | a conflict with some other event), then the counter goes into an | |
213 | 'error' state, where reads return end-of-file (i.e. read() returns 0) | |
214 | until the counter is subsequently enabled or disabled. | |
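A reader of a pinned counter therefore has to tell end-of-file apart from a normal value. A minimal sketch of such a read helper follows; the enum and function names are hypothetical, and the helper assumes a counter opened without any 'read_format' bits, so that a successful read() is exactly one u64.

```c
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical result codes for read_counter(). */
enum read_status {
    READ_OK,            /* *value holds the counter value          */
    READ_ERROR_STATE,   /* read() returned 0: counter in 'error'   */
    READ_FAILED         /* read() failed or returned a short count */
};

static enum read_status read_counter(int fd, uint64_t *value)
{
    ssize_t n = read(fd, value, sizeof(*value));

    if (n == 0)
        return READ_ERROR_STATE;
    if (n != (ssize_t)sizeof(*value))
        return READ_FAILED;
    return READ_OK;
}
```

On READ_ERROR_STATE, a caller could re-enable the counter and try again later, in line with the description above.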
215 | ||
216 | The 'exclusive' bit, if set, specifies that when this counter's group | |
217 | is on the CPU, it should be the only group using the CPU's counters. | |
218 | In future, this will allow sophisticated monitoring programs to supply | |
219 | extra configuration information via 'extra_config_len' to exploit | |
220 | advanced features of the CPU's Performance Monitor Unit (PMU) that are | |
221 | not otherwise accessible and that might disrupt other hardware | |
222 | counters. | |
223 | ||
224 | The 'exclude_user', 'exclude_kernel' and 'exclude_hv' bits provide a | |
225 | way to request that events not be counted while the CPU is in user, | |
226 | kernel or hypervisor mode, respectively. | |
227 | ||
23e232bd AM |
228 | Furthermore, the 'exclude_host' and 'exclude_guest' bits provide a way |
229 | to restrict counting of events to guest and host contexts, respectively, when | |
230 | using Linux as the hypervisor. | |
231 | ||
e5791a80 PZ |
232 | The 'mmap' and 'munmap' bits allow recording of PROT_EXEC mmap/munmap |
233 | operations. The resulting events can be used to relate userspace IP addresses | |
234 | to actual code, even after the mapping (or even the whole process) is gone; | |
235 | they are recorded in the ring-buffer (see below). | |
236 | ||
237 | The 'comm' bit allows tracking of process comm data on process creation. | |
238 | This too is recorded in the ring-buffer (see below). | |
f66c6b20 | 239 | |
b68eebd1 | 240 | The 'pid' parameter to the sys_perf_event_open() system call allows the |
f66c6b20 | 241 | counter to be specific to a task: |
e7bc62b6 IM |
242 | |
243 | pid == 0: if the pid parameter is zero, the counter is attached to the | |
244 | current task. | |
245 | ||
246 | pid > 0: the counter is attached to a specific task (if the current task | |
247 | has sufficient privilege to do so) | |
248 | ||
249 | pid < 0: all tasks are counted (per cpu counters) | |
250 | ||
f66c6b20 | 251 | The 'cpu' parameter allows a counter to be made specific to a CPU: |
e7bc62b6 IM |
252 | |
253 | cpu >= 0: the counter is restricted to a specific CPU | |
254 | cpu == -1: the counter counts on all CPUs | |
255 | ||
447557ac | 256 | (Note: the combination of 'pid == -1' and 'cpu == -1' is not valid.) |
e7bc62b6 IM |
257 | |
258 | A 'pid > 0' and 'cpu == -1' counter is a per task counter that counts | |
259 | events of that task and 'follows' that task to whatever CPU the task | |
260 | gets scheduled to. Per task counters can be created by any user, for | |
261 | their own tasks. | |
262 | ||
263 | A 'pid == -1' and 'cpu == x' counter is a per CPU counter that counts | |
6b3e0e2e AB |
264 | all events on CPU-x. Per CPU counters need CAP_PERFMON or CAP_SYS_ADMIN |
265 | privilege. | |
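The 'pid' and 'cpu' rules above can be summarised as a small classification function. This is only a sketch of the combinations as described in the text; the enum and function names are hypothetical, and the kernel's own validation may check more than this.

```c
/* Hypothetical names for the pid/cpu combinations described in the text. */
enum counter_scope {
    SCOPE_INVALID,        /* pid == -1 and cpu == -1                 */
    SCOPE_TASK_ANY_CPU,   /* per task counter, follows the task      */
    SCOPE_TASK_ONE_CPU,   /* one task, counted only on one CPU       */
    SCOPE_CPU_ALL_TASKS   /* per CPU counter, needs extra privilege  */
};

static enum counter_scope classify_counter(int pid, int cpu)
{
    if (pid < 0 && cpu < 0)
        return SCOPE_INVALID;
    if (pid < 0)
        return SCOPE_CPU_ALL_TASKS;
    if (cpu < 0)
        return SCOPE_TASK_ANY_CPU;   /* pid == 0 means the current task */
    return SCOPE_TASK_ONE_CPU;
}
```

A tool could call classify_counter() before the system call to decide, for example, whether to warn the user that CAP_PERFMON or CAP_SYS_ADMIN will be required.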
e7bc62b6 | 266 | |
f66c6b20 PM |
267 | The 'flags' parameter is currently unused and must be zero. |
268 | ||
269 | The 'group_fd' parameter allows counter "groups" to be set up. A | |
270 | counter group has one counter which is the group "leader". The leader | |
b68eebd1 | 271 | is created first, with group_fd = -1 in the sys_perf_event_open call |
f66c6b20 PM |
272 | that creates it. The rest of the group members are created |
273 | subsequently, with group_fd giving the fd of the group leader. | |
274 | (A single counter on its own is created with group_fd = -1 and is | |
275 | considered to be a group with only 1 member.) | |
276 | ||
277 | A counter group is scheduled onto the CPU as a unit, that is, it will | |
278 | only be put onto the CPU if all of the counters in the group can be | |
279 | put onto the CPU. This means that the values of the member counters | |
280 | can be meaningfully compared, added, divided (to get ratios), etc., | |
281 | with each other, since they have counted events for the same set of | |
282 | executed instructions. | |
283 | ||
e5791a80 PZ |
284 | |
285 | As stated above, asynchronous events, such as counter overflow or PROT_EXEC mmap | |
286 | tracking, are logged into a ring-buffer. This ring-buffer is created and | |
287 | accessed through mmap(). | |
288 | ||
289 | The mmap size should be 1+2^n pages, where the first page is a meta-data page | |
cdd6c482 | 290 | (struct perf_event_mmap_page) that contains various bits of information such |
e5791a80 PZ |
291 | as where the ring-buffer head is. |
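The size rule can be expressed in a couple of lines. The sketch below computes the mmap() length for a given n and checks whether a page count has the 1+2^n shape; the function names are hypothetical, and the page size is passed in rather than queried so that the arithmetic stays visible.

```c
#include <stddef.h>

/* Length in bytes of a mapping with one meta-data page plus 2^n data pages. */
static size_t ring_mmap_len(unsigned int n, size_t page_size)
{
    return (1 + ((size_t)1 << n)) * page_size;
}

/* Returns 1 if total_pages has the form 1 + 2^n (n >= 0), else 0. */
static int ring_pages_valid(size_t total_pages)
{
    size_t data_pages;

    if (total_pages < 2)
        return 0;
    data_pages = total_pages - 1;
    return (data_pages & (data_pages - 1)) == 0;  /* power-of-two test */
}
```

With 4096-byte pages and n = 3, the mapping is 9 pages, or 36864 bytes; in a real program the page size would typically come from sysconf(_SC_PAGESIZE).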
292 | ||
293 | /* | |
294 | * Structure of the page that can be mapped via mmap | |
295 | */ | |
cdd6c482 | 296 | struct perf_event_mmap_page { |
e5791a80 PZ |
297 | __u32 version; /* version number of this structure */ |
298 | __u32 compat_version; /* lowest version this is compat with */ | |
299 | ||
300 | /* | |
301 | * Bits needed to read the hw counters in user-space. | |
302 | * | |
303 | * u32 seq; | |
304 | * s64 count; | |
305 | * | |
306 | * do { | |
307 | * seq = pc->lock; | |
308 | * | |
309 | * barrier(); | |
310 | * if (pc->index) { | |
311 | * count = pmc_read(pc->index - 1); | |
312 | * count += pc->offset; | |
313 | * } else | |
314 | * goto regular_read; | |
315 | * | |
316 | * barrier(); | |
317 | * } while (pc->lock != seq); | |
318 | * | |
319 | * NOTE: for obvious reasons this only works on self-monitoring | |
320 | * processes. | |
321 | */ | |
322 | __u32 lock; /* seqlock for synchronization */ | |
323 | __u32 index; /* hardware counter identifier */ | |
324 | __s64 offset; /* add to hardware counter value */ | |
325 | ||
326 | /* | |
327 | * Control data for the mmap() data buffer. | |
328 | * | |
329 | * User-space should issue an rmb(), on SMP capable | |
cdd6c482 | 330 | * platforms, after reading this value -- see perf_event_wakeup(). |
e5791a80 PZ |
331 | */ |
332 | __u32 data_head; /* head in the data section */ | |
333 | }; | |
334 | ||
335 | NOTE: the hw-counter userspace bits are arch specific and are currently only | |
336 | implemented on powerpc. | |
337 | ||
338 | The following 2^n pages are the ring-buffer which contains events of the form: | |
339 | ||
cdd6c482 IM |
340 | #define PERF_RECORD_MISC_KERNEL (1 << 0) |
341 | #define PERF_RECORD_MISC_USER (1 << 1) | |
342 | #define PERF_RECORD_MISC_OVERFLOW (1 << 2) | |
e5791a80 PZ |
343 | |
344 | struct perf_event_header { | |
345 | __u32 type; | |
346 | __u16 misc; | |
347 | __u16 size; | |
348 | }; | |
349 | ||
350 | enum perf_event_type { | |
351 | ||
352 | /* | |
353 | * The MMAP events record the PROT_EXEC mappings so that we can | |
354 | * correlate userspace IPs to code. They have the following structure: | |
355 | * | |
356 | * struct { | |
357 | * struct perf_event_header header; | |
358 | * | |
359 | * u32 pid, tid; | |
360 | * u64 addr; | |
361 | * u64 len; | |
362 | * u64 pgoff; | |
363 | * char filename[]; | |
364 | * }; | |
365 | */ | |
cdd6c482 IM |
366 | PERF_RECORD_MMAP = 1, |
367 | PERF_RECORD_MUNMAP = 2, | |
e5791a80 PZ |
368 | |
369 | /* | |
370 | * struct { | |
371 | * struct perf_event_header header; | |
372 | * | |
373 | * u32 pid, tid; | |
374 | * char comm[]; | |
375 | * }; | |
376 | */ | |
cdd6c482 | 377 | PERF_RECORD_COMM = 3, |
e5791a80 PZ |
378 | |
379 | /* | |
cdd6c482 | 380 | * When header.misc & PERF_RECORD_MISC_OVERFLOW the event_type field |
e5791a80 PZ |
381 | * will be PERF_RECORD_* |
382 | * | |
383 | * struct { | |
384 | * struct perf_event_header header; | |
385 | * | |
386 | * { u64 ip; } && PERF_RECORD_IP | |
387 | * { u32 pid, tid; } && PERF_RECORD_TID | |
388 | * { u64 time; } && PERF_RECORD_TIME | |
389 | * { u64 addr; } && PERF_RECORD_ADDR | |
390 | * | |
391 | * { u64 nr; | |
392 | * { u64 event, val; } cnt[nr]; } && PERF_RECORD_GROUP | |
393 | * | |
394 | * { u16 nr, | |
395 | * hv, | |
396 | * kernel, | |
397 | * user; | |
398 | * u64 ips[nr]; } && PERF_RECORD_CALLCHAIN | |
399 | * }; | |
400 | */ | |
401 | }; | |
402 | ||
403 | NOTE: PERF_RECORD_CALLCHAIN is arch specific and currently only implemented | |
404 | on x86. | |
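To make the overflow record layout more concrete, here is a sketch of a decoder for the fixed-size fields (IP, TID, TIME, ADDR). It assumes, following the comment above, that for overflow records header.type carries the record_type bits that say which fields are present. The REC_* constants mirror the PERF_RECORD_* values listed earlier but are given hypothetical names so as not to depend on any particular header; the GROUP and CALLCHAIN parts are left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical local copies of the bit values listed in the text. */
enum {
    REC_IP   = 1u << 0,
    REC_TID  = 1u << 1,
    REC_TIME = 1u << 2,
    REC_ADDR = 1u << 3
};
#define REC_MISC_OVERFLOW (1u << 2)

struct rec_header { uint32_t type; uint16_t misc; uint16_t size; };

struct sample { uint64_t ip; uint32_t pid, tid; uint64_t time; uint64_t addr; };

/* Copy n bytes from *p if that many remain before end; advance *p. */
static int take(const unsigned char **p, const unsigned char *end,
                void *dst, size_t n)
{
    if ((size_t)(end - *p) < n)
        return -1;
    memcpy(dst, *p, n);
    *p += n;
    return 0;
}

/* Returns 0 on success, -1 if buf is not a complete overflow record. */
static int parse_overflow(const unsigned char *buf, size_t len,
                          struct sample *out)
{
    struct rec_header h;
    const unsigned char *p, *end;

    if (len < sizeof(h))
        return -1;
    memcpy(&h, buf, sizeof(h));
    if (!(h.misc & REC_MISC_OVERFLOW) || h.size < sizeof(h) || h.size > len)
        return -1;

    p = buf + sizeof(h);
    end = buf + h.size;
    memset(out, 0, sizeof(*out));

    /* Fields appear in the order of their bits, as in the layout above. */
    if ((h.type & REC_IP) && take(&p, end, &out->ip, 8))
        return -1;
    if ((h.type & REC_TID) &&
        (take(&p, end, &out->pid, 4) || take(&p, end, &out->tid, 4)))
        return -1;
    if ((h.type & REC_TIME) && take(&p, end, &out->time, 8))
        return -1;
    if ((h.type & REC_ADDR) && take(&p, end, &out->addr, 8))
        return -1;
    return 0;
}
```

A consumer would walk the ring-buffer record by record, advancing by header.size each time, and hand each overflow record to a function of this kind.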
405 | ||
406 | Notification of new events is possible through poll()/select()/epoll() and | |
407 | fcntl() managing signals. | |
408 | ||
409 | Normally a notification is generated for every page filled; however, one can | |
0b413e44 | 410 | additionally set perf_event_attr.wakeup_events to generate one every |
e5791a80 PZ |
411 | so many counter overflow events. |
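Waiting for such a notification could be done with an ordinary poll() on the counter fd, roughly as sketched below. The function name is hypothetical; the sketch only waits for readiness and leaves the actual consumption of the ring-buffer to the caller.

```c
#include <poll.h>

/*
 * Wait up to timeout_ms milliseconds for a notification on fd.
 * Returns 1 if the fd became readable, 0 on timeout, -1 on error.
 */
static int wait_for_events(int fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc = poll(&pfd, 1, timeout_ms);

    if (rc < 0)
        return -1;
    return (rc > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}
```

A sampling tool would typically call this in a loop and, on a return value of 1, process whatever new records have appeared behind data_head.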
412 | ||
413 | Future work will include a splice() interface to the ring-buffer. | |
414 | ||
415 | ||
f66c6b20 PM |
416 | Counters can be enabled and disabled in two ways: via ioctl and via |
417 | prctl. When a counter is disabled, it doesn't count or generate | |
418 | events but does continue to exist and maintain its count value. | |
419 | ||
a59e64a1 | 420 | An individual counter can be enabled with |
f66c6b20 | 421 | |
a59e64a1 | 422 | ioctl(fd, PERF_EVENT_IOC_ENABLE, 0); |
f66c6b20 PM |
423 | |
424 | or disabled with | |
425 | ||
a59e64a1 | 426 | ioctl(fd, PERF_EVENT_IOC_DISABLE, 0); |
f66c6b20 | 427 | |
a59e64a1 | 428 | For a counter group, pass PERF_IOC_FLAG_GROUP as the third argument. |
f66c6b20 PM |
429 | Enabling or disabling the leader of a group enables or disables the |
430 | whole group; that is, while the group leader is disabled, none of the | |
431 | counters in the group will count. Enabling or disabling a member of a | |
432 | group other than the leader only affects that counter - disabling a | |
433 | non-leader stops that counter from counting but doesn't affect any | |
434 | other counter. | |
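The ioctl calls above lend themselves to a small "measure this region" helper. The following is a sketch that assumes a counter fd created with the 'disabled' bit set and a current <linux/perf_event.h> providing the PERF_EVENT_IOC_* and PERF_IOC_FLAG_GROUP definitions; the function name and its callback signature are hypothetical.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <sys/ioctl.h>

/*
 * Enable the counter (or the whole group, if flags is
 * PERF_IOC_FLAG_GROUP), run fn(arg), then disable it again.
 * Returns 0 on success, -1 if either ioctl fails; fn is not
 * called when the counter cannot be enabled.
 */
static int with_counter_enabled(int fd, unsigned long flags,
                                void (*fn)(void *), void *arg)
{
    if (ioctl(fd, PERF_EVENT_IOC_ENABLE, flags) < 0)
        return -1;

    fn(arg);

    if (ioctl(fd, PERF_EVENT_IOC_DISABLE, flags) < 0)
        return -1;
    return 0;
}
```

After the helper returns, a read() on the fd gives the number of events that occurred while fn() was running, since a disabled counter keeps its count value.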
435 | ||
e5791a80 PZ |
436 | Additionally, non-inherited overflow counters can use |
437 | ||
cdd6c482 | 438 | ioctl(fd, PERF_EVENT_IOC_REFRESH, nr); |
e5791a80 PZ |
439 | |
440 | to enable a counter for 'nr' events, after which it gets disabled again. | |
441 | ||
f66c6b20 PM |
442 | A process can enable or disable all the counter groups that are |
443 | attached to it, using prctl: | |
444 | ||
cdd6c482 | 445 | prctl(PR_TASK_PERF_EVENTS_ENABLE); |
f66c6b20 | 446 | |
cdd6c482 | 447 | prctl(PR_TASK_PERF_EVENTS_DISABLE); |
f66c6b20 PM |
448 | |
449 | This applies to all counters on the current process, whether created | |
450 | by this process or by another, and doesn't affect any counters that | |
451 | this process has created on other processes. It only enables or | |
452 | disables the group leaders, not any other members in the groups. | |
447557ac | 453 | |
018df72d MF |
454 | |
455 | Arch requirements | |
456 | ----------------- | |
457 | ||
458 | If your architecture does not have hardware performance metrics, you can | |
459 | still use the generic software counters based on hrtimers for sampling. | |
460 | ||
cdd6c482 | 461 | So to start with, in order to add HAVE_PERF_EVENTS to your Kconfig, you |
018df72d | 462 | will need at least this: |
cdd6c482 | 463 | - asm/perf_event.h - a basic stub will suffice at first |
018df72d | 464 | - support for atomic64 types (and associated helper functions) |
018df72d MF |
465 | |
466 | If your architecture does have hardware capabilities, you can override the | |
cdd6c482 | 467 | weak stub hw_perf_event_init() to register hardware counters. |
906010b2 PZ |
468 | |
469 | Architectures that have d-cache aliasing issues, such as Sparc and ARM, | |
470 | should select PERF_USE_VMALLOC in order to avoid these for perf mmap(). |