ALSA: hda/realtek - Enable headset mic on Asus MJ401TA
[linux-2.6-block.git] / Documentation / trace / ftrace.rst
CommitLineData
1f198e22
CD
1========================
2ftrace - Function Tracer
3========================
4
5Copyright 2008 Red Hat Inc.
6
7:Author: Steven Rostedt <srostedt@redhat.com>
8:License: The GNU Free Documentation License, Version 1.2
9 (dual licensed under the GPL v2)
10:Original Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton,
11 John Kacur, and David Teigland.
12
13- Written for: 2.6.28-rc2
14- Updated for: 3.10
15- Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
16- Converted to rst format - Changbin Du <changbin.du@intel.com>
17
18Introduction
19------------
20
21Ftrace is an internal tracer designed to help out developers and
22designers of systems to find what is going on inside the kernel.
23It can be used for debugging or analyzing latencies and
24performance issues that take place outside of user-space.
25
26Although ftrace is typically considered the function tracer, it
2a1e03ca 27is really a framework of several assorted tracing utilities.
1f198e22
CD
28There's latency tracing to examine what occurs between interrupts
29disabled and enabled, as well as for preemption and from a time
30a task is woken to the task is actually scheduled in.
31
32One of the most common uses of ftrace is the event tracing.
2a1e03ca 33Throughout the kernel is hundreds of static event points that
1f198e22
CD
34can be enabled via the tracefs file system to see what is
35going on in certain parts of the kernel.
36
37See events.txt for more information.
38
39
40Implementation Details
41----------------------
42
43See :doc:`ftrace-design` for details for arch porters and such.
44
45
46The File System
47---------------
48
49Ftrace uses the tracefs file system to hold the control files as
50well as the files to display output.
51
52When tracefs is configured into the kernel (which selecting any ftrace
53option will do) the directory /sys/kernel/tracing will be created. To mount
54this directory, you can add to your /etc/fstab file::
55
56 tracefs /sys/kernel/tracing tracefs defaults 0 0
57
58Or you can mount it at run time with::
59
60 mount -t tracefs nodev /sys/kernel/tracing
61
62For quicker access to that directory you may want to make a soft link to
63it::
64
65 ln -s /sys/kernel/tracing /tracing
66
67.. attention::
68
69 Before 4.1, all ftrace tracing control files were within the debugfs
70 file system, which is typically located at /sys/kernel/debug/tracing.
71 For backward compatibility, when mounting the debugfs file system,
72 the tracefs file system will be automatically mounted at:
73
74 /sys/kernel/debug/tracing
75
76 All files located in the tracefs file system will be located in that
77 debugfs file system directory as well.
78
79.. attention::
80
81 Any selected ftrace option will also create the tracefs file system.
82 The rest of the document will assume that you are in the ftrace directory
83 (cd /sys/kernel/tracing) and will only concentrate on the files within that
84 directory and not distract from the content with the extended
85 "/sys/kernel/tracing" path name.
86
87That's it! (assuming that you have ftrace configured into your kernel)
88
89After mounting tracefs you will have access to the control and output files
90of ftrace. Here is a list of some of the key files:
91
92
93 Note: all time values are in microseconds.
94
95 current_tracer:
96
97 This is used to set or display the current tracer
98 that is configured.
99
100 available_tracers:
101
102 This holds the different types of tracers that
103 have been compiled into the kernel. The
104 tracers listed here can be configured by
105 echoing their name into current_tracer.
106
107 tracing_on:
108
109 This sets or displays whether writing to the trace
110 ring buffer is enabled. Echo 0 into this file to disable
111 the tracer or 1 to enable it. Note, this only disables
112 writing to the ring buffer, the tracing overhead may
113 still be occurring.
114
115 The kernel function tracing_off() can be used within the
116 kernel to disable writing to the ring buffer, which will
117 set this file to "0". User space can re-enable tracing by
118 echoing "1" into the file.
119
120 Note, the function and event trigger "traceoff" will also
121 set this file to zero and stop tracing. Which can also
122 be re-enabled by user space using this file.
123
124 trace:
125
126 This file holds the output of the trace in a human
127 readable format (described below). Note, tracing is temporarily
f12fcca6
PW
128 disabled when the file is open for reading. Once all readers
129 are closed, tracing is re-enabled.
1f198e22
CD
130
131 trace_pipe:
132
133 The output is the same as the "trace" file but this
134 file is meant to be streamed with live tracing.
135 Reads from this file will block until new data is
136 retrieved. Unlike the "trace" file, this file is a
137 consumer. This means reading from this file causes
138 sequential reads to display more current data. Once
139 data is read from this file, it is consumed, and
140 will not be read again with a sequential read. The
141 "trace" file is static, and if the tracer is not
142 adding more data, it will display the same
f12fcca6
PW
143 information every time it is read. Unlike the
144 "trace" file, opening this file for reading will not
145 temporarily disable tracing.
1f198e22
CD
146
147 trace_options:
148
149 This file lets the user control the amount of data
150 that is displayed in one of the above output
151 files. Options also exist to modify how a tracer
152 or events work (stack traces, timestamps, etc).
153
154 options:
155
156 This is a directory that has a file for every available
157 trace option (also in trace_options). Options may also be set
158 or cleared by writing a "1" or "0" respectively into the
159 corresponding file with the option name.
160
161 tracing_max_latency:
162
163 Some of the tracers record the max latency.
164 For example, the maximum time that interrupts are disabled.
165 The maximum time is saved in this file. The max trace will also be
166 stored, and displayed by "trace". A new max trace will only be
167 recorded if the latency is greater than the value in this file
168 (in microseconds).
169
170 By echoing in a time into this file, no latency will be recorded
171 unless it is greater than the time in this file.
172
173 tracing_thresh:
174
175 Some latency tracers will record a trace whenever the
176 latency is greater than the number in this file.
177 Only active when the file contains a number greater than 0.
178 (in microseconds)
179
180 buffer_size_kb:
181
182 This sets or displays the number of kilobytes each CPU
183 buffer holds. By default, the trace buffers are the same size
184 for each CPU. The displayed number is the size of the
185 CPU buffer and not total size of all buffers. The
186 trace buffers are allocated in pages (blocks of memory
187 that the kernel uses for allocation, usually 4 KB in size).
188 If the last page allocated has room for more bytes
189 than requested, the rest of the page will be used,
190 making the actual allocation bigger than requested or shown.
191 ( Note, the size may not be a multiple of the page size
192 due to buffer management meta-data. )
193
194 Buffer sizes for individual CPUs may vary
195 (see "per_cpu/cpu0/buffer_size_kb" below), and if they do
196 this file will show "X".
197
198 buffer_total_size_kb:
199
200 This displays the total combined size of all the trace buffers.
201
202 free_buffer:
203
204 If a process is performing tracing, and the ring buffer should be
205 shrunk "freed" when the process is finished, even if it were to be
206 killed by a signal, this file can be used for that purpose. On close
207 of this file, the ring buffer will be resized to its minimum size.
208 Having a process that is tracing also open this file, when the process
209 exits its file descriptor for this file will be closed, and in doing so,
210 the ring buffer will be "freed".
211
212 It may also stop tracing if disable_on_free option is set.
213
214 tracing_cpumask:
215
216 This is a mask that lets the user only trace on specified CPUs.
217 The format is a hex string representing the CPUs.
218
219 set_ftrace_filter:
220
221 When dynamic ftrace is configured in (see the
222 section below "dynamic ftrace"), the code is dynamically
223 modified (code text rewrite) to disable calling of the
224 function profiler (mcount). This lets tracing be configured
225 in with practically no overhead in performance. This also
226 has a side effect of enabling or disabling specific functions
227 to be traced. Echoing names of functions into this file
228 will limit the trace to only those functions.
32fb7ef6
SM
229 This influences the tracers "function" and "function_graph"
230 and thus also function profiling (see "function_profile_enabled").
1f198e22
CD
231
232 The functions listed in "available_filter_functions" are what
233 can be written into this file.
234
235 This interface also allows for commands to be used. See the
236 "Filter commands" section for more details.
237
f79b3f33
SRV
238 As a speed up, since processing strings can't be quite expensive
239 and requires a check of all functions registered to tracing, instead
240 an index can be written into this file. A number (starting with "1")
241 written will instead select the same corresponding at the line position
242 of the "available_filter_functions" file.
243
1f198e22
CD
244 set_ftrace_notrace:
245
246 This has an effect opposite to that of
247 set_ftrace_filter. Any function that is added here will not
248 be traced. If a function exists in both set_ftrace_filter
249 and set_ftrace_notrace, the function will _not_ be traced.
250
251 set_ftrace_pid:
252
253 Have the function tracer only trace the threads whose PID are
254 listed in this file.
255
256 If the "function-fork" option is set, then when a task whose
257 PID is listed in this file forks, the child's PID will
258 automatically be added to this file, and the child will be
259 traced by the function tracer as well. This option will also
260 cause PIDs of tasks that exit to be removed from the file.
261
262 set_event_pid:
263
264 Have the events only trace a task with a PID listed in this file.
265 Note, sched_switch and sched_wake_up will also trace events
266 listed in this file.
267
268 To have the PIDs of children of tasks with their PID in this file
269 added on fork, enable the "event-fork" option. That option will also
270 cause the PIDs of tasks to be removed from this file when the task
271 exits.
272
273 set_graph_function:
274
275 Functions listed in this file will cause the function graph
276 tracer to only trace these functions and the functions that
277 they call. (See the section "dynamic ftrace" for more details).
32fb7ef6
SM
278 Note, set_ftrace_filter and set_ftrace_notrace still affects
279 what functions are being traced.
1f198e22
CD
280
281 set_graph_notrace:
282
283 Similar to set_graph_function, but will disable function graph
284 tracing when the function is hit until it exits the function.
285 This makes it possible to ignore tracing functions that are called
286 by a specific function.
287
288 available_filter_functions:
289
290 This lists the functions that ftrace has processed and can trace.
291 These are the function names that you can pass to
32fb7ef6
SM
292 "set_ftrace_filter", "set_ftrace_notrace",
293 "set_graph_function", or "set_graph_notrace".
1f198e22
CD
294 (See the section "dynamic ftrace" below for more details.)
295
296 dyn_ftrace_total_info:
297
298 This file is for debugging purposes. The number of functions that
299 have been converted to nops and are available to be traced.
300
301 enabled_functions:
302
303 This file is more for debugging ftrace, but can also be useful
304 in seeing if any function has a callback attached to it.
305 Not only does the trace infrastructure use ftrace function
306 trace utility, but other subsystems might too. This file
307 displays all functions that have a callback attached to them
308 as well as the number of callbacks that have been attached.
309 Note, a callback may also call multiple functions which will
310 not be listed in this count.
311
312 If the callback registered to be traced by a function with
313 the "save regs" attribute (thus even more overhead), a 'R'
314 will be displayed on the same line as the function that
315 is returning registers.
316
317 If the callback registered to be traced by a function with
318 the "ip modify" attribute (thus the regs->ip can be changed),
319 an 'I' will be displayed on the same line as the function that
320 can be overridden.
321
322 If the architecture supports it, it will also show what callback
323 is being directly called by the function. If the count is greater
324 than 1 it most likely will be ftrace_ops_list_func().
325
326 If the callback of the function jumps to a trampoline that is
327 specific to a the callback and not the standard trampoline,
328 its address will be printed as well as the function that the
329 trampoline calls.
330
331 function_profile_enabled:
332
333 When set it will enable all functions with either the function
334 tracer, or if configured, the function graph tracer. It will
335 keep a histogram of the number of functions that were called
336 and if the function graph tracer was configured, it will also keep
337 track of the time spent in those functions. The histogram
338 content can be displayed in the files:
339
1fee4f77 340 trace_stat/function<cpu> ( function0, function1, etc).
1f198e22 341
1fee4f77 342 trace_stat:
1f198e22
CD
343
344 A directory that holds different tracing stats.
345
346 kprobe_events:
347
348 Enable dynamic trace points. See kprobetrace.txt.
349
350 kprobe_profile:
351
352 Dynamic trace points stats. See kprobetrace.txt.
353
354 max_graph_depth:
355
356 Used with the function graph tracer. This is the max depth
357 it will trace into a function. Setting this to a value of
358 one will show only the first kernel function that is called
359 from user space.
360
361 printk_formats:
362
363 This is for tools that read the raw format files. If an event in
364 the ring buffer references a string, only a pointer to the string
365 is recorded into the buffer and not the string itself. This prevents
366 tools from knowing what that string was. This file displays the string
367 and address for the string allowing tools to map the pointers to what
368 the strings were.
369
370 saved_cmdlines:
371
372 Only the pid of the task is recorded in a trace event unless
373 the event specifically saves the task comm as well. Ftrace
374 makes a cache of pid mappings to comms to try to display
375 comms for events. If a pid for a comm is not listed, then
376 "<...>" is displayed in the output.
377
378 If the option "record-cmd" is set to "0", then comms of tasks
379 will not be saved during recording. By default, it is enabled.
380
381 saved_cmdlines_size:
382
383 By default, 128 comms are saved (see "saved_cmdlines" above). To
384 increase or decrease the amount of comms that are cached, echo
385 in a the number of comms to cache, into this file.
386
387 saved_tgids:
388
389 If the option "record-tgid" is set, on each scheduling context switch
390 the Task Group ID of a task is saved in a table mapping the PID of
391 the thread to its TGID. By default, the "record-tgid" option is
392 disabled.
393
394 snapshot:
395
396 This displays the "snapshot" buffer and also lets the user
397 take a snapshot of the current running trace.
398 See the "Snapshot" section below for more details.
399
400 stack_max_size:
401
402 When the stack tracer is activated, this will display the
403 maximum stack size it has encountered.
404 See the "Stack Trace" section below.
405
406 stack_trace:
407
408 This displays the stack back trace of the largest stack
409 that was encountered when the stack tracer is activated.
410 See the "Stack Trace" section below.
411
412 stack_trace_filter:
413
414 This is similar to "set_ftrace_filter" but it limits what
415 functions the stack tracer will check.
416
417 trace_clock:
418
419 Whenever an event is recorded into the ring buffer, a
420 "timestamp" is added. This stamp comes from a specified
421 clock. By default, ftrace uses the "local" clock. This
422 clock is very fast and strictly per cpu, but on some
423 systems it may not be monotonic with respect to other
424 CPUs. In other words, the local clocks may not be in sync
425 with local clocks on other CPUs.
426
427 Usual clocks for tracing::
428
429 # cat trace_clock
430 [local] global counter x86-tsc
431
432 The clock with the square brackets around it is the one in effect.
433
434 local:
435 Default clock, but may not be in sync across CPUs
436
437 global:
438 This clock is in sync with all CPUs but may
439 be a bit slower than the local clock.
440
441 counter:
442 This is not a clock at all, but literally an atomic
443 counter. It counts up one by one, but is in sync
444 with all CPUs. This is useful when you need to
445 know exactly the order events occurred with respect to
446 each other on different CPUs.
447
448 uptime:
449 This uses the jiffies counter and the time stamp
450 is relative to the time since boot up.
451
452 perf:
453 This makes ftrace use the same clock that perf uses.
454 Eventually perf will be able to read ftrace buffers
455 and this will help out in interleaving the data.
456
457 x86-tsc:
458 Architectures may define their own clocks. For
459 example, x86 uses its own TSC cycle clock here.
460
461 ppc-tb:
462 This uses the powerpc timebase register value.
463 This is in sync across CPUs and can also be used
464 to correlate events across hypervisor/guest if
465 tb_offset is known.
466
467 mono:
468 This uses the fast monotonic clock (CLOCK_MONOTONIC)
469 which is monotonic and is subject to NTP rate adjustments.
470
471 mono_raw:
472 This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
2a1e03ca 473 which is monotonic but is not subject to any rate adjustments
1f198e22
CD
474 and ticks at the same rate as the hardware clocksource.
475
476 boot:
a3ed0e43
TG
477 This is the boot clock (CLOCK_BOOTTIME) and is based on the
478 fast monotonic clock, but also accounts for time spent in
479 suspend. Since the clock access is designed for use in
480 tracing in the suspend path, some side effects are possible
481 if clock is accessed after the suspend time is accounted before
482 the fast mono clock is updated. In this case, the clock update
483 appears to happen slightly sooner than it normally would have.
484 Also on 32-bit systems, it's possible that the 64-bit boot offset
485 sees a partial update. These effects are rare and post
486 processing should be able to handle them. See comments in the
487 ktime_get_boot_fast_ns() function for more information.
680014d6
LT
488
489 To set a clock, simply echo the clock name into this file::
490
491 # echo global > trace_clock
1f198e22
CD
492
493 trace_marker:
494
495 This is a very useful file for synchronizing user space
496 with events happening in the kernel. Writing strings into
497 this file will be written into the ftrace buffer.
498
499 It is useful in applications to open this file at the start
500 of the application and just reference the file descriptor
501 for the file::
502
503 void trace_write(const char *fmt, ...)
504 {
505 va_list ap;
506 char buf[256];
507 int n;
508
509 if (trace_fd < 0)
510 return;
511
512 va_start(ap, fmt);
513 n = vsnprintf(buf, 256, fmt, ap);
514 va_end(ap);
515
516 write(trace_fd, buf, n);
517 }
518
519 start::
520
521 trace_fd = open("trace_marker", WR_ONLY);
522
d3439f9d
SRV
523 Note: Writing into the trace_marker file can also initiate triggers
524 that are written into /sys/kernel/tracing/events/ftrace/print/trigger
525 See "Event triggers" in Documentation/trace/events.rst and an
526 example in Documentation/trace/histogram.rst (Section 3.)
527
1f198e22
CD
528 trace_marker_raw:
529
530 This is similar to trace_marker above, but is meant for for binary data
531 to be written to it, where a tool can be used to parse the data
532 from trace_pipe_raw.
533
534 uprobe_events:
535
536 Add dynamic tracepoints in programs.
537 See uprobetracer.txt
538
539 uprobe_profile:
540
541 Uprobe statistics. See uprobetrace.txt
542
543 instances:
544
545 This is a way to make multiple trace buffers where different
546 events can be recorded in different buffers.
547 See "Instances" section below.
548
549 events:
550
551 This is the trace event directory. It holds event tracepoints
552 (also known as static tracepoints) that have been compiled
553 into the kernel. It shows what event tracepoints exist
554 and how they are grouped by system. There are "enable"
555 files at various levels that can enable the tracepoints
556 when a "1" is written to them.
557
558 See events.txt for more information.
559
560 set_event:
561
562 By echoing in the event into this file, will enable that event.
563
564 See events.txt for more information.
565
566 available_events:
567
568 A list of events that can be enabled in tracing.
569
570 See events.txt for more information.
571
2a56bb59
LT
572 timestamp_mode:
573
574 Certain tracers may change the timestamp mode used when
575 logging trace events into the event buffer. Events with
576 different modes can coexist within a buffer but the mode in
577 effect when an event is logged determines which timestamp mode
578 is used for that event. The default timestamp mode is
579 'delta'.
580
581 Usual timestamp modes for tracing:
582
583 # cat timestamp_mode
584 [delta] absolute
585
586 The timestamp mode with the square brackets around it is the
587 one in effect.
588
589 delta: Default timestamp mode - timestamp is a delta against
590 a per-buffer timestamp.
591
592 absolute: The timestamp is a full timestamp, not a delta
593 against some other value. As such it takes up more
594 space and is less efficient.
595
1f198e22
CD
596 hwlat_detector:
597
598 Directory for the Hardware Latency Detector.
599 See "Hardware Latency Detector" section below.
600
601 per_cpu:
602
603 This is a directory that contains the trace per_cpu information.
604
605 per_cpu/cpu0/buffer_size_kb:
606
607 The ftrace buffer is defined per_cpu. That is, there's a separate
608 buffer for each CPU to allow writes to be done atomically,
609 and free from cache bouncing. These buffers may have different
610 size buffers. This file is similar to the buffer_size_kb
611 file, but it only displays or sets the buffer size for the
612 specific CPU. (here cpu0).
613
614 per_cpu/cpu0/trace:
615
616 This is similar to the "trace" file, but it will only display
617 the data specific for the CPU. If written to, it only clears
618 the specific CPU buffer.
619
620 per_cpu/cpu0/trace_pipe
621
622 This is similar to the "trace_pipe" file, and is a consuming
623 read, but it will only display (and consume) the data specific
624 for the CPU.
625
626 per_cpu/cpu0/trace_pipe_raw
627
628 For tools that can parse the ftrace ring buffer binary format,
629 the trace_pipe_raw file can be used to extract the data
630 from the ring buffer directly. With the use of the splice()
631 system call, the buffer data can be quickly transferred to
632 a file or to the network where a server is collecting the
633 data.
634
635 Like trace_pipe, this is a consuming reader, where multiple
636 reads will always produce different data.
637
638 per_cpu/cpu0/snapshot:
639
640 This is similar to the main "snapshot" file, but will only
641 snapshot the current CPU (if supported). It only displays
642 the content of the snapshot for a given CPU, and if
643 written to, only clears this CPU buffer.
644
645 per_cpu/cpu0/snapshot_raw:
646
647 Similar to the trace_pipe_raw, but will read the binary format
648 from the snapshot buffer for the given CPU.
649
650 per_cpu/cpu0/stats:
651
652 This displays certain stats about the ring buffer:
653
654 entries:
655 The number of events that are still in the buffer.
656
657 overrun:
658 The number of lost events due to overwriting when
659 the buffer was full.
660
661 commit overrun:
662 Should always be zero.
663 This gets set if so many events happened within a nested
664 event (ring buffer is re-entrant), that it fills the
665 buffer and starts dropping events.
666
667 bytes:
668 Bytes actually read (not overwritten).
669
670 oldest event ts:
671 The oldest timestamp in the buffer
672
673 now ts:
674 The current timestamp
675
676 dropped events:
677 Events lost due to overwrite option being off.
678
679 read events:
680 The number of events read.
681
682The Tracers
683-----------
684
685Here is the list of current tracers that may be configured.
686
687 "function"
688
689 Function call tracer to trace all kernel functions.
690
691 "function_graph"
692
693 Similar to the function tracer except that the
694 function tracer probes the functions on their entry
695 whereas the function graph tracer traces on both entry
696 and exit of the functions. It then provides the ability
697 to draw a graph of function calls similar to C code
698 source.
699
700 "blk"
701
702 The block tracer. The tracer used by the blktrace user
703 application.
704
705 "hwlat"
706
707 The Hardware Latency tracer is used to detect if the hardware
708 produces any latency. See "Hardware Latency Detector" section
709 below.
710
711 "irqsoff"
712
713 Traces the areas that disable interrupts and saves
714 the trace with the longest max latency.
715 See tracing_max_latency. When a new max is recorded,
716 it replaces the old trace. It is best to view this
717 trace with the latency-format option enabled, which
718 happens automatically when the tracer is selected.
719
720 "preemptoff"
721
722 Similar to irqsoff but traces and records the amount of
723 time for which preemption is disabled.
724
725 "preemptirqsoff"
726
727 Similar to irqsoff and preemptoff, but traces and
728 records the largest time for which irqs and/or preemption
729 is disabled.
730
731 "wakeup"
732
733 Traces and records the max latency that it takes for
734 the highest priority task to get scheduled after
735 it has been woken up.
736 Traces all tasks as an average developer would expect.
737
738 "wakeup_rt"
739
740 Traces and records the max latency that it takes for just
741 RT tasks (as the current "wakeup" does). This is useful
742 for those interested in wake up timings of RT tasks.
743
744 "wakeup_dl"
745
746 Traces and records the max latency that it takes for
747 a SCHED_DEADLINE task to be woken (as the "wakeup" and
748 "wakeup_rt" does).
749
750 "mmiotrace"
751
752 A special tracer that is used to trace binary module.
753 It will trace all the calls that a module makes to the
754 hardware. Everything it writes and reads from the I/O
755 as well.
756
757 "branch"
758
759 This tracer can be configured when tracing likely/unlikely
760 calls within the kernel. It will trace when a likely and
761 unlikely branch is hit and if it was correct in its prediction
762 of being correct.
763
764 "nop"
765
766 This is the "trace nothing" tracer. To remove all
767 tracers from tracing simply echo "nop" into
768 current_tracer.
769
26a94491
TZ
770Error conditions
771----------------
772
773 For most ftrace commands, failure modes are obvious and communicated
774 using standard return codes.
775
776 For other more involved commands, extended error information may be
777 available via the tracing/error_log file. For the commands that
778 support it, reading the tracing/error_log file after an error will
779 display more detailed information about what went wrong, if
780 information is available. The tracing/error_log file is a circular
781 error log displaying a small number (currently, 8) of ftrace errors
782 for the last (8) failed commands.
783
784 The extended error information and usage takes the form shown in
785 this example::
786
787 # echo xxx > /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
788 echo: write error: Invalid argument
789
790 # cat /sys/kernel/debug/tracing/error_log
791 [ 5348.887237] location: error: Couldn't yyy: zzz
792 Command: xxx
793 ^
794 [ 7517.023364] location: error: Bad rrr: sss
795 Command: ppp qqq
796 ^
797
798 To clear the error log, echo the empty string into it::
799
800 # echo > /sys/kernel/debug/tracing/error_log
1f198e22
CD
801
802Examples of using the tracer
803----------------------------
804
805Here are typical examples of using the tracers when controlling
806them only with the tracefs interface (without using any
807user-land utilities).
808
809Output format:
810--------------
811
812Here is an example of the output format of the file "trace"::
813
814 # tracer: function
815 #
816 # entries-in-buffer/entries-written: 140080/250280 #P:4
817 #
818 # _-----=> irqs-off
819 # / _----=> need-resched
820 # | / _---=> hardirq/softirq
821 # || / _--=> preempt-depth
822 # ||| / delay
823 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
824 # | | | |||| | |
825 bash-1977 [000] .... 17284.993652: sys_close <-system_call_fastpath
826 bash-1977 [000] .... 17284.993653: __close_fd <-sys_close
827 bash-1977 [000] .... 17284.993653: _raw_spin_lock <-__close_fd
828 sshd-1974 [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
829 bash-1977 [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
830 bash-1977 [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
831 bash-1977 [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
832 bash-1977 [000] .... 17284.993657: filp_close <-__close_fd
833 bash-1977 [000] .... 17284.993657: dnotify_flush <-filp_close
834 sshd-1974 [003] .... 17284.993658: sys_select <-system_call_fastpath
835 ....
836
837A header is printed with the tracer name that is represented by
838the trace. In this case the tracer is "function". Then it shows the
839number of events in the buffer as well as the total number of entries
840that were written. The difference is the number of entries that were
841lost due to the buffer filling up (250280 - 140080 = 110200 events
842lost).
843
844The header explains the content of the events. Task name "bash", the task
845PID "1977", the CPU that it was running on "000", the latency format
846(explained below), the timestamp in <secs>.<usecs> format, the
847function name that was traced "sys_close" and the parent function that
848called this function "system_call_fastpath". The timestamp is the time
849at which the function was entered.
850
851Latency trace format
852--------------------
853
854When the latency-format option is enabled or when one of the latency
855tracers is set, the trace file gives somewhat more information to see
856why a latency happened. Here is a typical trace::
857
858 # tracer: irqsoff
859 #
860 # irqsoff latency trace v1.1.5 on 3.8.0-test+
861 # --------------------------------------------------------------------
862 # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
863 # -----------------
864 # | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
865 # -----------------
866 # => started at: __lock_task_sighand
867 # => ended at: _raw_spin_unlock_irqrestore
868 #
869 #
870 # _------=> CPU#
871 # / _-----=> irqs-off
872 # | / _----=> need-resched
873 # || / _---=> hardirq/softirq
874 # ||| / _--=> preempt-depth
875 # |||| / delay
876 # cmd pid ||||| time | caller
877 # \ / ||||| \ | /
878 ps-6143 2d... 0us!: trace_hardirqs_off <-__lock_task_sighand
879 ps-6143 2d..1 259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
880 ps-6143 2d..1 263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
881 ps-6143 2d..1 306us : <stack trace>
882 => trace_hardirqs_on_caller
883 => trace_hardirqs_on
884 => _raw_spin_unlock_irqrestore
885 => do_task_stat
886 => proc_tgid_stat
887 => proc_single_show
888 => seq_read
889 => vfs_read
890 => sys_read
891 => system_call_fastpath
892
893
894This shows that the current tracer is "irqsoff" tracing the time
895for which interrupts were disabled. It gives the trace version (which
896never changes) and the version of the kernel upon which this was executed on
897(3.8). Then it displays the max latency in microseconds (259 us). The number
898of trace entries displayed and the total number (both are four: #4/4).
899VP, KP, SP, and HP are always zero and are reserved for later use.
900#P is the number of online CPUs (#P:4).
901
902The task is the process that was running when the latency
903occurred. (ps pid: 6143).
904
905The start and stop (the functions in which the interrupts were
906disabled and enabled respectively) that caused the latencies:
907
908 - __lock_task_sighand is where the interrupts were disabled.
909 - _raw_spin_unlock_irqrestore is where they were enabled again.
910
911The next lines after the header are the trace itself. The header
912explains which is which.
913
914 cmd: The name of the process in the trace.
915
916 pid: The PID of that process.
917
918 CPU#: The CPU which the process was running on.
919
920 irqs-off: 'd' interrupts are disabled. '.' otherwise.
921 .. caution:: If the architecture does not support a way to
922 read the irq flags variable, an 'X' will always
923 be printed here.
924
925 need-resched:
926 - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
927 - 'n' only TIF_NEED_RESCHED is set,
928 - 'p' only PREEMPT_NEED_RESCHED is set,
929 - '.' otherwise.
930
931 hardirq/softirq:
932 - 'Z' - NMI occurred inside a hardirq
933 - 'z' - NMI is running
934 - 'H' - hard irq occurred inside a softirq.
935 - 'h' - hard irq is running
936 - 's' - soft irq is running
937 - '.' - normal context.
938
939 preempt-depth: The level of preempt_disabled
940
941The above is mostly meaningful for kernel developers.
942
943 time:
944 When the latency-format option is enabled, the trace file
945 output includes a timestamp relative to the start of the
946 trace. This differs from the output when latency-format
947 is disabled, which includes an absolute timestamp.
948
949 delay:
950 This is just to help catch your eye a bit better. And
951 needs to be fixed to be only relative to the same CPU.
952 The marks are determined by the difference between this
953 current trace and the next trace.
954
955 - '$' - greater than 1 second
2a1e03ca
AL
956 - '@' - greater than 100 millisecond
957 - '*' - greater than 10 millisecond
1f198e22
CD
958 - '#' - greater than 1000 microsecond
959 - '!' - greater than 100 microsecond
960 - '+' - greater than 10 microsecond
961 - ' ' - less than or equal to 10 microsecond.
962
963 The rest is the same as the 'trace' file.
964
965 Note, the latency tracers will usually end with a back trace
966 to easily find where the latency occurred.
967
968trace_options
969-------------
970
971The trace_options file (or the options directory) is used to control
972what gets printed in the trace output, or manipulate the tracers.
973To see what is available, simply cat the file::
974
975 cat trace_options
976 print-parent
977 nosym-offset
978 nosym-addr
979 noverbose
980 noraw
981 nohex
982 nobin
983 noblock
984 trace_printk
985 annotate
986 nouserstacktrace
987 nosym-userobj
988 noprintk-msg-only
989 context-info
990 nolatency-format
991 record-cmd
992 norecord-tgid
993 overwrite
994 nodisable_on_free
995 irq-info
996 markers
997 noevent-fork
998 function-trace
999 nofunction-fork
1000 nodisplay-graph
1001 nostacktrace
1002 nobranch
1003
1004To disable one of the options, echo in the option prepended with
1005"no"::
1006
1007 echo noprint-parent > trace_options
1008
1009To enable an option, leave off the "no"::
1010
1011 echo sym-offset > trace_options
1012
1013Here are the available options:
1014
1015 print-parent
1016 On function traces, display the calling (parent)
1017 function as well as the function being traced.
1018 ::
1019
1020 print-parent:
1021 bash-4000 [01] 1477.606694: simple_strtoul <-kstrtoul
1022
1023 noprint-parent:
1024 bash-4000 [01] 1477.606694: simple_strtoul
1025
1026
1027 sym-offset
1028 Display not only the function name, but also the
1029 offset in the function. For example, instead of
1030 seeing just "ktime_get", you will see
1031 "ktime_get+0xb/0x20".
1032 ::
1033
1034 sym-offset:
1035 bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0
1036
1037 sym-addr
1038 This will also display the function address as well
1039 as the function name.
1040 ::
1041
1042 sym-addr:
1043 bash-4000 [01] 1477.606694: simple_strtoul <c0339346>
1044
1045 verbose
1046 This deals with the trace file when the
1047 latency-format option is enabled.
1048 ::
1049
1050 bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
1051 (+0.000ms): simple_strtoul (kstrtoul)
1052
1053 raw
1054 This will display raw numbers. This option is best for
1055 use with user applications that can translate the raw
1056 numbers better than having it done in the kernel.
1057
1058 hex
1059 Similar to raw, but the numbers will be in a hexadecimal format.
1060
1061 bin
1062 This will print out the formats in raw binary.
1063
1064 block
1065 When set, reading trace_pipe will not block when polled.
1066
1067 trace_printk
1068 Can disable trace_printk() from writing into the buffer.
1069
1070 annotate
1071 It is sometimes confusing when the CPU buffers are full
1072 and one CPU buffer had a lot of events recently, thus
1073 a shorter time frame, were another CPU may have only had
1074 a few events, which lets it have older events. When
1075 the trace is reported, it shows the oldest events first,
1076 and it may look like only one CPU ran (the one with the
1077 oldest events). When the annotate option is set, it will
1078 display when a new CPU buffer started::
1079
1080 <idle>-0 [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
1081 <idle>-0 [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
1082 <idle>-0 [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
1083 ##### CPU 2 buffer started ####
1084 <idle>-0 [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
1085 <idle>-0 [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
1086 <idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
1087
1088 userstacktrace
1089 This option changes the trace. It records a
1090 stacktrace of the current user space thread after
1091 each trace event.
1092
1093 sym-userobj
1094 when user stacktrace are enabled, look up which
1095 object the address belongs to, and print a
1096 relative address. This is especially useful when
1097 ASLR is on, otherwise you don't get a chance to
1098 resolve the address to object/file/line after
1099 the app is no longer running
1100
1101 The lookup is performed when you read
1102 trace,trace_pipe. Example::
1103
1104 a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
1105 x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
1106
1107
1108 printk-msg-only
1109 When set, trace_printk()s will only show the format
1110 and not their parameters (if trace_bprintk() or
1111 trace_bputs() was used to save the trace_printk()).
1112
1113 context-info
1114 Show only the event data. Hides the comm, PID,
1115 timestamp, CPU, and other useful data.
1116
1117 latency-format
1118 This option changes the trace output. When it is enabled,
1119 the trace displays additional information about the
1120 latency, as described in "Latency trace format".
1121
1122 record-cmd
1123 When any event or tracer is enabled, a hook is enabled
1124 in the sched_switch trace point to fill comm cache
1125 with mapped pids and comms. But this may cause some
1126 overhead, and if you only care about pids, and not the
1127 name of the task, disabling this option can lower the
1128 impact of tracing. See "saved_cmdlines".
1129
1130 record-tgid
1131 When any event or tracer is enabled, a hook is enabled
1132 in the sched_switch trace point to fill the cache of
1133 mapped Thread Group IDs (TGID) mapping to pids. See
1134 "saved_tgids".
1135
1136 overwrite
1137 This controls what happens when the trace buffer is
1138 full. If "1" (default), the oldest events are
1139 discarded and overwritten. If "0", then the newest
1140 events are discarded.
1141 (see per_cpu/cpu0/stats for overrun and dropped)
1142
1143 disable_on_free
1144 When the free_buffer is closed, tracing will
1145 stop (tracing_on set to 0).
1146
1147 irq-info
1148 Shows the interrupt, preempt count, need resched data.
1149 When disabled, the trace looks like::
1150
1151 # tracer: function
1152 #
1153 # entries-in-buffer/entries-written: 144405/9452052 #P:4
1154 #
1155 # TASK-PID CPU# TIMESTAMP FUNCTION
1156 # | | | | |
1157 <idle>-0 [002] 23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
1158 <idle>-0 [002] 23636.756054: activate_task <-ttwu_do_activate.constprop.89
1159 <idle>-0 [002] 23636.756055: enqueue_task <-activate_task
1160
1161
1162 markers
1163 When set, the trace_marker is writable (only by root).
1164 When disabled, the trace_marker will error with EINVAL
1165 on write.
1166
1167 event-fork
1168 When set, tasks with PIDs listed in set_event_pid will have
1169 the PIDs of their children added to set_event_pid when those
1170 tasks fork. Also, when tasks with PIDs in set_event_pid exit,
1171 their PIDs will be removed from the file.
1172
1173 function-trace
1174 The latency tracers will enable function tracing
1175 if this option is enabled (default it is). When
1176 it is disabled, the latency tracers do not trace
1177 functions. This keeps the overhead of the tracer down
1178 when performing latency tests.
1179
1180 function-fork
1181 When set, tasks with PIDs listed in set_ftrace_pid will
1182 have the PIDs of their children added to set_ftrace_pid
1183 when those tasks fork. Also, when tasks with PIDs in
1184 set_ftrace_pid exit, their PIDs will be removed from the
1185 file.
1186
1187 display-graph
1188 When set, the latency tracers (irqsoff, wakeup, etc) will
1189 use function graph tracing instead of function tracing.
1190
1191 stacktrace
1192 When set, a stack trace is recorded after any trace event
1193 is recorded.
1194
1195 branch
1196 Enable branch tracing with the tracer. This enables branch
1197 tracer along with the currently set tracer. Enabling this
1198 with the "nop" tracer is the same as just enabling the
1199 "branch" tracer.
1200
1201.. tip:: Some tracers have their own options. They only appear in this
1202 file when the tracer is active. They always appear in the
1203 options directory.
1204
1205
1206Here are the per tracer options:
1207
1208Options for function tracer:
1209
1210 func_stack_trace
1211 When set, a stack trace is recorded after every
1212 function that is recorded. NOTE! Limit the functions
1213 that are recorded before enabling this, with
1214 "set_ftrace_filter" otherwise the system performance
1215 will be critically degraded. Remember to disable
1216 this option before clearing the function filter.
1217
1218Options for function_graph tracer:
1219
1220 Since the function_graph tracer has a slightly different output
1221 it has its own options to control what is displayed.
1222
1223 funcgraph-overrun
1224 When set, the "overrun" of the graph stack is
1225 displayed after each function traced. The
1226 overrun, is when the stack depth of the calls
1227 is greater than what is reserved for each task.
1228 Each task has a fixed array of functions to
1229 trace in the call graph. If the depth of the
1230 calls exceeds that, the function is not traced.
1231 The overrun is the number of functions missed
1232 due to exceeding this array.
1233
1234 funcgraph-cpu
1235 When set, the CPU number of the CPU where the trace
1236 occurred is displayed.
1237
1238 funcgraph-overhead
1239 When set, if the function takes longer than
1240 A certain amount, then a delay marker is
1241 displayed. See "delay" above, under the
1242 header description.
1243
1244 funcgraph-proc
1245 Unlike other tracers, the process' command line
1246 is not displayed by default, but instead only
1247 when a task is traced in and out during a context
1248 switch. Enabling this options has the command
1249 of each process displayed at every line.
1250
1251 funcgraph-duration
1252 At the end of each function (the return)
1253 the duration of the amount of time in the
1254 function is displayed in microseconds.
1255
1256 funcgraph-abstime
1257 When set, the timestamp is displayed at each line.
1258
1259 funcgraph-irqs
1260 When disabled, functions that happen inside an
1261 interrupt will not be traced.
1262
1263 funcgraph-tail
1264 When set, the return event will include the function
1265 that it represents. By default this is off, and
1266 only a closing curly bracket "}" is displayed for
1267 the return of a function.
1268
1269 sleep-time
1270 When running function graph tracer, to include
1271 the time a task schedules out in its function.
1272 When enabled, it will account time the task has been
1273 scheduled out as part of the function call.
1274
1275 graph-time
1276 When running function profiler with function graph tracer,
1277 to include the time to call nested functions. When this is
1278 not set, the time reported for the function will only
1279 include the time the function itself executed for, not the
1280 time for functions that it called.
1281
1282Options for blk tracer:
1283
1284 blk_classic
1285 Shows a more minimalistic output.
1286
1287
1288irqsoff
1289-------
1290
1291When interrupts are disabled, the CPU can not react to any other
1292external event (besides NMIs and SMIs). This prevents the timer
1293interrupt from triggering or the mouse interrupt from letting
1294the kernel know of a new mouse event. The result is a latency
1295with the reaction time.
1296
1297The irqsoff tracer tracks the time for which interrupts are
1298disabled. When a new maximum latency is hit, the tracer saves
1299the trace leading up to that latency point so that every time a
1300new maximum is reached, the old saved trace is discarded and the
1301new trace is saved.
1302
1303To reset the maximum, echo 0 into tracing_max_latency. Here is
1304an example::
1305
1306 # echo 0 > options/function-trace
1307 # echo irqsoff > current_tracer
1308 # echo 1 > tracing_on
1309 # echo 0 > tracing_max_latency
1310 # ls -ltr
1311 [...]
1312 # echo 0 > tracing_on
1313 # cat trace
1314 # tracer: irqsoff
1315 #
1316 # irqsoff latency trace v1.1.5 on 3.8.0-test+
1317 # --------------------------------------------------------------------
1318 # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1319 # -----------------
1320 # | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
1321 # -----------------
1322 # => started at: run_timer_softirq
1323 # => ended at: run_timer_softirq
1324 #
1325 #
1326 # _------=> CPU#
1327 # / _-----=> irqs-off
1328 # | / _----=> need-resched
1329 # || / _---=> hardirq/softirq
1330 # ||| / _--=> preempt-depth
1331 # |||| / delay
1332 # cmd pid ||||| time | caller
1333 # \ / ||||| \ | /
1334 <idle>-0 0d.s2 0us+: _raw_spin_lock_irq <-run_timer_softirq
1335 <idle>-0 0dNs3 17us : _raw_spin_unlock_irq <-run_timer_softirq
1336 <idle>-0 0dNs3 17us+: trace_hardirqs_on <-run_timer_softirq
1337 <idle>-0 0dNs3 25us : <stack trace>
1338 => _raw_spin_unlock_irq
1339 => run_timer_softirq
1340 => __do_softirq
1341 => call_softirq
1342 => do_softirq
1343 => irq_exit
1344 => smp_apic_timer_interrupt
1345 => apic_timer_interrupt
1346 => rcu_idle_exit
1347 => cpu_idle
1348 => rest_init
1349 => start_kernel
1350 => x86_64_start_reservations
1351 => x86_64_start_kernel
1352
1353Here we see that that we had a latency of 16 microseconds (which is
1354very good). The _raw_spin_lock_irq in run_timer_softirq disabled
1355interrupts. The difference between the 16 and the displayed
1356timestamp 25us occurred because the clock was incremented
1357between the time of recording the max latency and the time of
1358recording the function that had that latency.
1359
1360Note the above example had function-trace not set. If we set
1361function-trace, we get a much larger output::
1362
1363 with echo 1 > options/function-trace
1364
1365 # tracer: irqsoff
1366 #
1367 # irqsoff latency trace v1.1.5 on 3.8.0-test+
1368 # --------------------------------------------------------------------
1369 # latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1370 # -----------------
1371 # | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
1372 # -----------------
1373 # => started at: ata_scsi_queuecmd
1374 # => ended at: ata_scsi_queuecmd
1375 #
1376 #
1377 # _------=> CPU#
1378 # / _-----=> irqs-off
1379 # | / _----=> need-resched
1380 # || / _---=> hardirq/softirq
1381 # ||| / _--=> preempt-depth
1382 # |||| / delay
1383 # cmd pid ||||| time | caller
1384 # \ / ||||| \ | /
1385 bash-2042 3d... 0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1386 bash-2042 3d... 0us : add_preempt_count <-_raw_spin_lock_irqsave
1387 bash-2042 3d..1 1us : ata_scsi_find_dev <-ata_scsi_queuecmd
1388 bash-2042 3d..1 1us : __ata_scsi_find_dev <-ata_scsi_find_dev
1389 bash-2042 3d..1 2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
1390 bash-2042 3d..1 2us : ata_qc_new_init <-__ata_scsi_queuecmd
1391 bash-2042 3d..1 3us : ata_sg_init <-__ata_scsi_queuecmd
1392 bash-2042 3d..1 4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
1393 bash-2042 3d..1 4us : ata_build_rw_tf <-ata_scsi_rw_xlat
1394 [...]
1395 bash-2042 3d..1 67us : delay_tsc <-__delay
1396 bash-2042 3d..1 67us : add_preempt_count <-delay_tsc
1397 bash-2042 3d..2 67us : sub_preempt_count <-delay_tsc
1398 bash-2042 3d..1 67us : add_preempt_count <-delay_tsc
1399 bash-2042 3d..2 68us : sub_preempt_count <-delay_tsc
1400 bash-2042 3d..1 68us+: ata_bmdma_start <-ata_bmdma_qc_issue
1401 bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1402 bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1403 bash-2042 3d..1 72us+: trace_hardirqs_on <-ata_scsi_queuecmd
1404 bash-2042 3d..1 120us : <stack trace>
1405 => _raw_spin_unlock_irqrestore
1406 => ata_scsi_queuecmd
1407 => scsi_dispatch_cmd
1408 => scsi_request_fn
1409 => __blk_run_queue_uncond
1410 => __blk_run_queue
1411 => blk_queue_bio
1412 => generic_make_request
1413 => submit_bio
1414 => submit_bh
1415 => __ext3_get_inode_loc
1416 => ext3_iget
1417 => ext3_lookup
1418 => lookup_real
1419 => __lookup_hash
1420 => walk_component
1421 => lookup_last
1422 => path_lookupat
1423 => filename_lookup
1424 => user_path_at_empty
1425 => user_path_at
1426 => vfs_fstatat
1427 => vfs_stat
1428 => sys_newstat
1429 => system_call_fastpath
1430
1431
1432Here we traced a 71 microsecond latency. But we also see all the
1433functions that were called during that time. Note that by
1434enabling function tracing, we incur an added overhead. This
1435overhead may extend the latency times. But nevertheless, this
1436trace has provided some very helpful debugging information.
1437
88d380eb
CD
1438If we prefer function graph output instead of function, we can set
1439display-graph option::
3df5ffd2 1440
88d380eb
CD
1441 with echo 1 > options/display-graph
1442
1443 # tracer: irqsoff
1444 #
1445 # irqsoff latency trace v1.1.5 on 4.20.0-rc6+
1446 # --------------------------------------------------------------------
1447 # latency: 3751 us, #274/274, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
1448 # -----------------
1449 # | task: bash-1507 (uid:0 nice:0 policy:0 rt_prio:0)
1450 # -----------------
1451 # => started at: free_debug_processing
1452 # => ended at: return_to_handler
1453 #
1454 #
1455 # _-----=> irqs-off
1456 # / _----=> need-resched
1457 # | / _---=> hardirq/softirq
1458 # || / _--=> preempt-depth
1459 # ||| /
1460 # REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS
1461 # | | | | |||| | | | | | |
1462 0 us | 0) bash-1507 | d... | 0.000 us | _raw_spin_lock_irqsave();
1463 0 us | 0) bash-1507 | d..1 | 0.378 us | do_raw_spin_trylock();
1464 1 us | 0) bash-1507 | d..2 | | set_track() {
1465 2 us | 0) bash-1507 | d..2 | | save_stack_trace() {
1466 2 us | 0) bash-1507 | d..2 | | __save_stack_trace() {
1467 3 us | 0) bash-1507 | d..2 | | __unwind_start() {
1468 3 us | 0) bash-1507 | d..2 | | get_stack_info() {
1469 3 us | 0) bash-1507 | d..2 | 0.351 us | in_task_stack();
1470 4 us | 0) bash-1507 | d..2 | 1.107 us | }
1471 [...]
1472 3750 us | 0) bash-1507 | d..1 | 0.516 us | do_raw_spin_unlock();
1473 3750 us | 0) bash-1507 | d..1 | 0.000 us | _raw_spin_unlock_irqrestore();
1474 3764 us | 0) bash-1507 | d..1 | 0.000 us | tracer_hardirqs_on();
1475 bash-1507 0d..1 3792us : <stack trace>
1476 => free_debug_processing
1477 => __slab_free
1478 => kmem_cache_free
1479 => vm_area_free
1480 => remove_vma
1481 => exit_mmap
1482 => mmput
1483 => flush_old_exec
1484 => load_elf_binary
1485 => search_binary_handler
1486 => __do_execve_file.isra.32
1487 => __x64_sys_execve
1488 => do_syscall_64
1489 => entry_SYSCALL_64_after_hwframe
1f198e22
CD
1490
1491preemptoff
1492----------
1493
1494When preemption is disabled, we may be able to receive
1495interrupts but the task cannot be preempted and a higher
1496priority task must wait for preemption to be enabled again
1497before it can preempt a lower priority task.
1498
1499The preemptoff tracer traces the places that disable preemption.
1500Like the irqsoff tracer, it records the maximum latency for
1501which preemption was disabled. The control of preemptoff tracer
1502is much like the irqsoff tracer.
1503::
1504
1505 # echo 0 > options/function-trace
1506 # echo preemptoff > current_tracer
1507 # echo 1 > tracing_on
1508 # echo 0 > tracing_max_latency
1509 # ls -ltr
1510 [...]
1511 # echo 0 > tracing_on
1512 # cat trace
1513 # tracer: preemptoff
1514 #
1515 # preemptoff latency trace v1.1.5 on 3.8.0-test+
1516 # --------------------------------------------------------------------
1517 # latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1518 # -----------------
1519 # | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
1520 # -----------------
1521 # => started at: do_IRQ
1522 # => ended at: do_IRQ
1523 #
1524 #
1525 # _------=> CPU#
1526 # / _-----=> irqs-off
1527 # | / _----=> need-resched
1528 # || / _---=> hardirq/softirq
1529 # ||| / _--=> preempt-depth
1530 # |||| / delay
1531 # cmd pid ||||| time | caller
1532 # \ / ||||| \ | /
1533 sshd-1991 1d.h. 0us+: irq_enter <-do_IRQ
1534 sshd-1991 1d..1 46us : irq_exit <-do_IRQ
1535 sshd-1991 1d..1 47us+: trace_preempt_on <-do_IRQ
1536 sshd-1991 1d..1 52us : <stack trace>
1537 => sub_preempt_count
1538 => irq_exit
1539 => do_IRQ
1540 => ret_from_intr
1541
1542
1543This has some more changes. Preemption was disabled when an
1544interrupt came in (notice the 'h'), and was enabled on exit.
1545But we also see that interrupts have been disabled when entering
1546the preempt off section and leaving it (the 'd'). We do not know if
1547interrupts were enabled in the mean time or shortly after this
1548was over.
1549::
1550
1551 # tracer: preemptoff
1552 #
1553 # preemptoff latency trace v1.1.5 on 3.8.0-test+
1554 # --------------------------------------------------------------------
1555 # latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1556 # -----------------
1557 # | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
1558 # -----------------
1559 # => started at: wake_up_new_task
1560 # => ended at: task_rq_unlock
1561 #
1562 #
1563 # _------=> CPU#
1564 # / _-----=> irqs-off
1565 # | / _----=> need-resched
1566 # || / _---=> hardirq/softirq
1567 # ||| / _--=> preempt-depth
1568 # |||| / delay
1569 # cmd pid ||||| time | caller
1570 # \ / ||||| \ | /
1571 bash-1994 1d..1 0us : _raw_spin_lock_irqsave <-wake_up_new_task
1572 bash-1994 1d..1 0us : select_task_rq_fair <-select_task_rq
1573 bash-1994 1d..1 1us : __rcu_read_lock <-select_task_rq_fair
1574 bash-1994 1d..1 1us : source_load <-select_task_rq_fair
1575 bash-1994 1d..1 1us : source_load <-select_task_rq_fair
1576 [...]
1577 bash-1994 1d..1 12us : irq_enter <-smp_apic_timer_interrupt
1578 bash-1994 1d..1 12us : rcu_irq_enter <-irq_enter
1579 bash-1994 1d..1 13us : add_preempt_count <-irq_enter
1580 bash-1994 1d.h1 13us : exit_idle <-smp_apic_timer_interrupt
1581 bash-1994 1d.h1 13us : hrtimer_interrupt <-smp_apic_timer_interrupt
1582 bash-1994 1d.h1 13us : _raw_spin_lock <-hrtimer_interrupt
1583 bash-1994 1d.h1 14us : add_preempt_count <-_raw_spin_lock
1584 bash-1994 1d.h2 14us : ktime_get_update_offsets <-hrtimer_interrupt
1585 [...]
1586 bash-1994 1d.h1 35us : lapic_next_event <-clockevents_program_event
1587 bash-1994 1d.h1 35us : irq_exit <-smp_apic_timer_interrupt
1588 bash-1994 1d.h1 36us : sub_preempt_count <-irq_exit
1589 bash-1994 1d..2 36us : do_softirq <-irq_exit
1590 bash-1994 1d..2 36us : __do_softirq <-call_softirq
1591 bash-1994 1d..2 36us : __local_bh_disable <-__do_softirq
1592 bash-1994 1d.s2 37us : add_preempt_count <-_raw_spin_lock_irq
1593 bash-1994 1d.s3 38us : _raw_spin_unlock <-run_timer_softirq
1594 bash-1994 1d.s3 39us : sub_preempt_count <-_raw_spin_unlock
1595 bash-1994 1d.s2 39us : call_timer_fn <-run_timer_softirq
1596 [...]
1597 bash-1994 1dNs2 81us : cpu_needs_another_gp <-rcu_process_callbacks
1598 bash-1994 1dNs2 82us : __local_bh_enable <-__do_softirq
1599 bash-1994 1dNs2 82us : sub_preempt_count <-__local_bh_enable
1600 bash-1994 1dN.2 82us : idle_cpu <-irq_exit
1601 bash-1994 1dN.2 83us : rcu_irq_exit <-irq_exit
1602 bash-1994 1dN.2 83us : sub_preempt_count <-irq_exit
1603 bash-1994 1.N.1 84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
1604 bash-1994 1.N.1 84us+: trace_preempt_on <-task_rq_unlock
1605 bash-1994 1.N.1 104us : <stack trace>
1606 => sub_preempt_count
1607 => _raw_spin_unlock_irqrestore
1608 => task_rq_unlock
1609 => wake_up_new_task
1610 => do_fork
1611 => sys_clone
1612 => stub_clone
1613
1614
1615The above is an example of the preemptoff trace with
1616function-trace set. Here we see that interrupts were not disabled
1617the entire time. The irq_enter code lets us know that we entered
1618an interrupt 'h'. Before that, the functions being traced still
1619show that it is not in an interrupt, but we can see from the
1620functions themselves that this is not the case.
1621
1622preemptirqsoff
1623--------------
1624
1625Knowing the locations that have interrupts disabled or
1626preemption disabled for the longest times is helpful. But
1627sometimes we would like to know when either preemption and/or
1628interrupts are disabled.
1629
1630Consider the following code::
1631
1632 local_irq_disable();
1633 call_function_with_irqs_off();
1634 preempt_disable();
1635 call_function_with_irqs_and_preemption_off();
1636 local_irq_enable();
1637 call_function_with_preemption_off();
1638 preempt_enable();
1639
1640The irqsoff tracer will record the total length of
1641call_function_with_irqs_off() and
1642call_function_with_irqs_and_preemption_off().
1643
1644The preemptoff tracer will record the total length of
1645call_function_with_irqs_and_preemption_off() and
1646call_function_with_preemption_off().
1647
1648But neither will trace the time that interrupts and/or
1649preemption is disabled. This total time is the time that we can
1650not schedule. To record this time, use the preemptirqsoff
1651tracer.
1652
1653Again, using this trace is much like the irqsoff and preemptoff
1654tracers.
1655::
1656
1657 # echo 0 > options/function-trace
1658 # echo preemptirqsoff > current_tracer
1659 # echo 1 > tracing_on
1660 # echo 0 > tracing_max_latency
1661 # ls -ltr
1662 [...]
1663 # echo 0 > tracing_on
1664 # cat trace
1665 # tracer: preemptirqsoff
1666 #
1667 # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1668 # --------------------------------------------------------------------
1669 # latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1670 # -----------------
1671 # | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
1672 # -----------------
1673 # => started at: ata_scsi_queuecmd
1674 # => ended at: ata_scsi_queuecmd
1675 #
1676 #
1677 # _------=> CPU#
1678 # / _-----=> irqs-off
1679 # | / _----=> need-resched
1680 # || / _---=> hardirq/softirq
1681 # ||| / _--=> preempt-depth
1682 # |||| / delay
1683 # cmd pid ||||| time | caller
1684 # \ / ||||| \ | /
1685 ls-2230 3d... 0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1686 ls-2230 3...1 100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1687 ls-2230 3...1 101us+: trace_preempt_on <-ata_scsi_queuecmd
1688 ls-2230 3...1 111us : <stack trace>
1689 => sub_preempt_count
1690 => _raw_spin_unlock_irqrestore
1691 => ata_scsi_queuecmd
1692 => scsi_dispatch_cmd
1693 => scsi_request_fn
1694 => __blk_run_queue_uncond
1695 => __blk_run_queue
1696 => blk_queue_bio
1697 => generic_make_request
1698 => submit_bio
1699 => submit_bh
1700 => ext3_bread
1701 => ext3_dir_bread
1702 => htree_dirblock_to_tree
1703 => ext3_htree_fill_tree
1704 => ext3_readdir
1705 => vfs_readdir
1706 => sys_getdents
1707 => system_call_fastpath
1708
1709
1710The trace_hardirqs_off_thunk is called from assembly on x86 when
1711interrupts are disabled in the assembly code. Without the
1712function tracing, we do not know if interrupts were enabled
1713within the preemption points. We do see that it started with
1714preemption enabled.
1715
1716Here is a trace with function-trace set::
1717
1718 # tracer: preemptirqsoff
1719 #
1720 # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1721 # --------------------------------------------------------------------
1722 # latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1723 # -----------------
1724 # | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
1725 # -----------------
1726 # => started at: schedule
1727 # => ended at: mutex_unlock
1728 #
1729 #
1730 # _------=> CPU#
1731 # / _-----=> irqs-off
1732 # | / _----=> need-resched
1733 # || / _---=> hardirq/softirq
1734 # ||| / _--=> preempt-depth
1735 # |||| / delay
1736 # cmd pid ||||| time | caller
1737 # \ / ||||| \ | /
1738 kworker/-59 3...1 0us : __schedule <-schedule
1739 kworker/-59 3d..1 0us : rcu_preempt_qs <-rcu_note_context_switch
1740 kworker/-59 3d..1 1us : add_preempt_count <-_raw_spin_lock_irq
1741 kworker/-59 3d..2 1us : deactivate_task <-__schedule
1742 kworker/-59 3d..2 1us : dequeue_task <-deactivate_task
1743 kworker/-59 3d..2 2us : update_rq_clock <-dequeue_task
1744 kworker/-59 3d..2 2us : dequeue_task_fair <-dequeue_task
1745 kworker/-59 3d..2 2us : update_curr <-dequeue_task_fair
1746 kworker/-59 3d..2 2us : update_min_vruntime <-update_curr
1747 kworker/-59 3d..2 3us : cpuacct_charge <-update_curr
1748 kworker/-59 3d..2 3us : __rcu_read_lock <-cpuacct_charge
1749 kworker/-59 3d..2 3us : __rcu_read_unlock <-cpuacct_charge
1750 kworker/-59 3d..2 3us : update_cfs_rq_blocked_load <-dequeue_task_fair
1751 kworker/-59 3d..2 4us : clear_buddies <-dequeue_task_fair
1752 kworker/-59 3d..2 4us : account_entity_dequeue <-dequeue_task_fair
1753 kworker/-59 3d..2 4us : update_min_vruntime <-dequeue_task_fair
1754 kworker/-59 3d..2 4us : update_cfs_shares <-dequeue_task_fair
1755 kworker/-59 3d..2 5us : hrtick_update <-dequeue_task_fair
1756 kworker/-59 3d..2 5us : wq_worker_sleeping <-__schedule
1757 kworker/-59 3d..2 5us : kthread_data <-wq_worker_sleeping
1758 kworker/-59 3d..2 5us : put_prev_task_fair <-__schedule
1759 kworker/-59 3d..2 6us : pick_next_task_fair <-pick_next_task
1760 kworker/-59 3d..2 6us : clear_buddies <-pick_next_task_fair
1761 kworker/-59 3d..2 6us : set_next_entity <-pick_next_task_fair
1762 kworker/-59 3d..2 6us : update_stats_wait_end <-set_next_entity
1763 ls-2269 3d..2 7us : finish_task_switch <-__schedule
1764 ls-2269 3d..2 7us : _raw_spin_unlock_irq <-finish_task_switch
1765 ls-2269 3d..2 8us : do_IRQ <-ret_from_intr
1766 ls-2269 3d..2 8us : irq_enter <-do_IRQ
1767 ls-2269 3d..2 8us : rcu_irq_enter <-irq_enter
1768 ls-2269 3d..2 9us : add_preempt_count <-irq_enter
1769 ls-2269 3d.h2 9us : exit_idle <-do_IRQ
1770 [...]
1771 ls-2269 3d.h3 20us : sub_preempt_count <-_raw_spin_unlock
1772 ls-2269 3d.h2 20us : irq_exit <-do_IRQ
1773 ls-2269 3d.h2 21us : sub_preempt_count <-irq_exit
1774 ls-2269 3d..3 21us : do_softirq <-irq_exit
1775 ls-2269 3d..3 21us : __do_softirq <-call_softirq
1776 ls-2269 3d..3 21us+: __local_bh_disable <-__do_softirq
1777 ls-2269 3d.s4 29us : sub_preempt_count <-_local_bh_enable_ip
1778 ls-2269 3d.s5 29us : sub_preempt_count <-_local_bh_enable_ip
1779 ls-2269 3d.s5 31us : do_IRQ <-ret_from_intr
1780 ls-2269 3d.s5 31us : irq_enter <-do_IRQ
1781 ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter
1782 [...]
1783 ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter
1784 ls-2269 3d.s5 32us : add_preempt_count <-irq_enter
1785 ls-2269 3d.H5 32us : exit_idle <-do_IRQ
1786 ls-2269 3d.H5 32us : handle_irq <-do_IRQ
1787 ls-2269 3d.H5 32us : irq_to_desc <-handle_irq
1788 ls-2269 3d.H5 33us : handle_fasteoi_irq <-handle_irq
1789 [...]
1790 ls-2269 3d.s5 158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
1791 ls-2269 3d.s3 158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
1792 ls-2269 3d.s3 159us : __local_bh_enable <-__do_softirq
1793 ls-2269 3d.s3 159us : sub_preempt_count <-__local_bh_enable
1794 ls-2269 3d..3 159us : idle_cpu <-irq_exit
1795 ls-2269 3d..3 159us : rcu_irq_exit <-irq_exit
1796 ls-2269 3d..3 160us : sub_preempt_count <-irq_exit
1797 ls-2269 3d... 161us : __mutex_unlock_slowpath <-mutex_unlock
1798 ls-2269 3d... 162us+: trace_hardirqs_on <-mutex_unlock
1799 ls-2269 3d... 186us : <stack trace>
1800 => __mutex_unlock_slowpath
1801 => mutex_unlock
1802 => process_output
1803 => n_tty_write
1804 => tty_write
1805 => vfs_write
1806 => sys_write
1807 => system_call_fastpath
1808
1809This is an interesting trace. It started with kworker running and
1810scheduling out and ls taking over. But as soon as ls released the
1811rq lock and enabled interrupts (but not preemption) an interrupt
1812triggered. When the interrupt finished, it started running softirqs.
1813But while the softirq was running, another interrupt triggered.
1814When an interrupt is running inside a softirq, the annotation is 'H'.
1815
1816
1817wakeup
1818------
1819
1820One common case that people are interested in tracing is the
1821time it takes for a task that is woken to actually wake up.
1822Now for non Real-Time tasks, this can be arbitrary. But tracing
1823it none the less can be interesting.
1824
1825Without function tracing::
1826
1827 # echo 0 > options/function-trace
1828 # echo wakeup > current_tracer
1829 # echo 1 > tracing_on
1830 # echo 0 > tracing_max_latency
1831 # chrt -f 5 sleep 1
1832 # echo 0 > tracing_on
1833 # cat trace
1834 # tracer: wakeup
1835 #
1836 # wakeup latency trace v1.1.5 on 3.8.0-test+
1837 # --------------------------------------------------------------------
1838 # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1839 # -----------------
1840 # | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
1841 # -----------------
1842 #
1843 # _------=> CPU#
1844 # / _-----=> irqs-off
1845 # | / _----=> need-resched
1846 # || / _---=> hardirq/softirq
1847 # ||| / _--=> preempt-depth
1848 # |||| / delay
1849 # cmd pid ||||| time | caller
1850 # \ / ||||| \ | /
1851 <idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H
1852 <idle>-0 3dNs7 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1853 <idle>-0 3d..3 15us : __schedule <-schedule
1854 <idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H
1855
1856The tracer only traces the highest priority task in the system
1857to avoid tracing the normal circumstances. Here we see that
1858the kworker with a nice priority of -20 (not very nice), took
1859just 15 microseconds from the time it woke up, to the time it
1860ran.
1861
1862Non Real-Time tasks are not that interesting. A more interesting
1863trace is to concentrate only on Real-Time tasks.
1864
1865wakeup_rt
1866---------
1867
1868In a Real-Time environment it is very important to know the
1869wakeup time it takes for the highest priority task that is woken
1870up to the time that it executes. This is also known as "schedule
1871latency". I stress the point that this is about RT tasks. It is
1872also important to know the scheduling latency of non-RT tasks,
1873but the average schedule latency is better for non-RT tasks.
1874Tools like LatencyTop are more appropriate for such
1875measurements.
1876
1877Real-Time environments are interested in the worst case latency.
1878That is the longest latency it takes for something to happen,
1879and not the average. We can have a very fast scheduler that may
1880only have a large latency once in a while, but that would not
1881work well with Real-Time tasks. The wakeup_rt tracer was designed
1882to record the worst case wakeups of RT tasks. Non-RT tasks are
1883not recorded because the tracer only records one worst case and
1884tracing non-RT tasks that are unpredictable will overwrite the
1885worst case latency of RT tasks (just run the normal wakeup
1886tracer for a while to see that effect).
1887
1888Since this tracer only deals with RT tasks, we will run this
1889slightly differently than we did with the previous tracers.
1890Instead of performing an 'ls', we will run 'sleep 1' under
1891'chrt' which changes the priority of the task.
1892::
1893
1894 # echo 0 > options/function-trace
1895 # echo wakeup_rt > current_tracer
1896 # echo 1 > tracing_on
1897 # echo 0 > tracing_max_latency
1898 # chrt -f 5 sleep 1
1899 # echo 0 > tracing_on
1900 # cat trace
1901 # tracer: wakeup
1902 #
1903 # tracer: wakeup_rt
1904 #
1905 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1906 # --------------------------------------------------------------------
1907 # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1908 # -----------------
1909 # | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
1910 # -----------------
1911 #
1912 # _------=> CPU#
1913 # / _-----=> irqs-off
1914 # | / _----=> need-resched
1915 # || / _---=> hardirq/softirq
1916 # ||| / _--=> preempt-depth
1917 # |||| / delay
1918 # cmd pid ||||| time | caller
1919 # \ / ||||| \ | /
1920 <idle>-0 3d.h4 0us : 0:120:R + [003] 2389: 94:R sleep
1921 <idle>-0 3d.h4 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1922 <idle>-0 3d..3 5us : __schedule <-schedule
1923 <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep
1924
1925
1926Running this on an idle system, we see that it only took 5 microseconds
1927to perform the task switch. Note, since the trace point in the schedule
1928is before the actual "switch", we stop the tracing when the recorded task
1929is about to schedule in. This may change if we add a new marker at the
1930end of the scheduler.
1931
1932Notice that the recorded task is 'sleep' with the PID of 2389
1933and it has an rt_prio of 5. This priority is user-space priority
1934and not the internal kernel priority. The policy is 1 for
1935SCHED_FIFO and 2 for SCHED_RR.
1936
1937Note, that the trace data shows the internal priority (99 - rtprio).
1938::
1939
1940 <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep
1941
1942The 0:120:R means idle was running with a nice priority of 0 (120 - 120)
1943and in the running state 'R'. The sleep task was scheduled in with
19442389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94)
1945and it too is in the running state.
1946
1947Doing the same with chrt -r 5 and function-trace set.
1948::
1949
1950 echo 1 > options/function-trace
1951
1952 # tracer: wakeup_rt
1953 #
1954 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1955 # --------------------------------------------------------------------
1956 # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1957 # -----------------
1958 # | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
1959 # -----------------
1960 #
1961 # _------=> CPU#
1962 # / _-----=> irqs-off
1963 # | / _----=> need-resched
1964 # || / _---=> hardirq/softirq
1965 # ||| / _--=> preempt-depth
1966 # |||| / delay
1967 # cmd pid ||||| time | caller
1968 # \ / ||||| \ | /
1969 <idle>-0 3d.h4 1us+: 0:120:R + [003] 2448: 94:R sleep
1970 <idle>-0 3d.h4 2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
1971 <idle>-0 3d.h3 3us : check_preempt_curr <-ttwu_do_wakeup
1972 <idle>-0 3d.h3 3us : resched_curr <-check_preempt_curr
1973 <idle>-0 3dNh3 4us : task_woken_rt <-ttwu_do_wakeup
1974 <idle>-0 3dNh3 4us : _raw_spin_unlock <-try_to_wake_up
1975 <idle>-0 3dNh3 4us : sub_preempt_count <-_raw_spin_unlock
1976 <idle>-0 3dNh2 5us : ttwu_stat <-try_to_wake_up
1977 <idle>-0 3dNh2 5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
1978 <idle>-0 3dNh2 6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1979 <idle>-0 3dNh1 6us : _raw_spin_lock <-__run_hrtimer
1980 <idle>-0 3dNh1 6us : add_preempt_count <-_raw_spin_lock
1981 <idle>-0 3dNh2 7us : _raw_spin_unlock <-hrtimer_interrupt
1982 <idle>-0 3dNh2 7us : sub_preempt_count <-_raw_spin_unlock
1983 <idle>-0 3dNh1 7us : tick_program_event <-hrtimer_interrupt
1984 <idle>-0 3dNh1 7us : clockevents_program_event <-tick_program_event
1985 <idle>-0 3dNh1 8us : ktime_get <-clockevents_program_event
1986 <idle>-0 3dNh1 8us : lapic_next_event <-clockevents_program_event
1987 <idle>-0 3dNh1 8us : irq_exit <-smp_apic_timer_interrupt
1988 <idle>-0 3dNh1 9us : sub_preempt_count <-irq_exit
1989 <idle>-0 3dN.2 9us : idle_cpu <-irq_exit
1990 <idle>-0 3dN.2 9us : rcu_irq_exit <-irq_exit
1991 <idle>-0 3dN.2 10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
1992 <idle>-0 3dN.2 10us : sub_preempt_count <-irq_exit
1993 <idle>-0 3.N.1 11us : rcu_idle_exit <-cpu_idle
1994 <idle>-0 3dN.1 11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
1995 <idle>-0 3.N.1 11us : tick_nohz_idle_exit <-cpu_idle
1996 <idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
1997 <idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit
1998 <idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
1999 <idle>-0 3dN.1 13us : cpu_load_update_nohz <-tick_nohz_idle_exit
2000 <idle>-0 3dN.1 13us : _raw_spin_lock <-cpu_load_update_nohz
2001 <idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock
2002 <idle>-0 3dN.2 13us : __cpu_load_update <-cpu_load_update_nohz
2003 <idle>-0 3dN.2 14us : sched_avg_update <-__cpu_load_update
2004 <idle>-0 3dN.2 14us : _raw_spin_unlock <-cpu_load_update_nohz
2005 <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock
2006 <idle>-0 3dN.1 15us : calc_load_nohz_stop <-tick_nohz_idle_exit
2007 <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
2008 <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit
2009 <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel
2010 <idle>-0 3dN.1 16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
2011 <idle>-0 3dN.1 16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
2012 <idle>-0 3dN.1 16us : add_preempt_count <-_raw_spin_lock_irqsave
2013 <idle>-0 3dN.2 17us : __remove_hrtimer <-remove_hrtimer.part.16
2014 <idle>-0 3dN.2 17us : hrtimer_force_reprogram <-__remove_hrtimer
2015 <idle>-0 3dN.2 17us : tick_program_event <-hrtimer_force_reprogram
2016 <idle>-0 3dN.2 18us : clockevents_program_event <-tick_program_event
2017 <idle>-0 3dN.2 18us : ktime_get <-clockevents_program_event
2018 <idle>-0 3dN.2 18us : lapic_next_event <-clockevents_program_event
2019 <idle>-0 3dN.2 19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
2020 <idle>-0 3dN.2 19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
2021 <idle>-0 3dN.1 19us : hrtimer_forward <-tick_nohz_idle_exit
2022 <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward
2023 <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward
2024 <idle>-0 3dN.1 20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2025 <idle>-0 3dN.1 20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
2026 <idle>-0 3dN.1 21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
2027 <idle>-0 3dN.1 21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
2028 <idle>-0 3dN.1 21us : add_preempt_count <-_raw_spin_lock_irqsave
2029 <idle>-0 3dN.2 22us : ktime_add_safe <-__hrtimer_start_range_ns
2030 <idle>-0 3dN.2 22us : enqueue_hrtimer <-__hrtimer_start_range_ns
2031 <idle>-0 3dN.2 22us : tick_program_event <-__hrtimer_start_range_ns
2032 <idle>-0 3dN.2 23us : clockevents_program_event <-tick_program_event
2033 <idle>-0 3dN.2 23us : ktime_get <-clockevents_program_event
2034 <idle>-0 3dN.2 23us : lapic_next_event <-clockevents_program_event
2035 <idle>-0 3dN.2 24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
2036 <idle>-0 3dN.2 24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
2037 <idle>-0 3dN.1 24us : account_idle_ticks <-tick_nohz_idle_exit
2038 <idle>-0 3dN.1 24us : account_idle_time <-account_idle_ticks
2039 <idle>-0 3.N.1 25us : sub_preempt_count <-cpu_idle
2040 <idle>-0 3.N.. 25us : schedule <-cpu_idle
2041 <idle>-0 3.N.. 25us : __schedule <-preempt_schedule
2042 <idle>-0 3.N.. 26us : add_preempt_count <-__schedule
2043 <idle>-0 3.N.1 26us : rcu_note_context_switch <-__schedule
2044 <idle>-0 3.N.1 26us : rcu_sched_qs <-rcu_note_context_switch
2045 <idle>-0 3dN.1 27us : rcu_preempt_qs <-rcu_note_context_switch
2046 <idle>-0 3.N.1 27us : _raw_spin_lock_irq <-__schedule
2047 <idle>-0 3dN.1 27us : add_preempt_count <-_raw_spin_lock_irq
2048 <idle>-0 3dN.2 28us : put_prev_task_idle <-__schedule
2049 <idle>-0 3dN.2 28us : pick_next_task_stop <-pick_next_task
2050 <idle>-0 3dN.2 28us : pick_next_task_rt <-pick_next_task
2051 <idle>-0 3dN.2 29us : dequeue_pushable_task <-pick_next_task_rt
2052 <idle>-0 3d..3 29us : __schedule <-preempt_schedule
2053 <idle>-0 3d..3 30us : 0:120:R ==> [003] 2448: 94:R sleep
2054
2055This isn't that big of a trace, even with function tracing enabled,
2056so I included the entire trace.
2057
2058The interrupt went off while when the system was idle. Somewhere
2059before task_woken_rt() was called, the NEED_RESCHED flag was set,
2060this is indicated by the first occurrence of the 'N' flag.
2061
2062Latency tracing and events
2063--------------------------
2064As function tracing can induce a much larger latency, but without
2065seeing what happens within the latency it is hard to know what
2066caused it. There is a middle ground, and that is with enabling
2067events.
2068::
2069
2070 # echo 0 > options/function-trace
2071 # echo wakeup_rt > current_tracer
2072 # echo 1 > events/enable
2073 # echo 1 > tracing_on
2074 # echo 0 > tracing_max_latency
2075 # chrt -f 5 sleep 1
2076 # echo 0 > tracing_on
2077 # cat trace
2078 # tracer: wakeup_rt
2079 #
2080 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
2081 # --------------------------------------------------------------------
2082 # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
2083 # -----------------
2084 # | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
2085 # -----------------
2086 #
2087 # _------=> CPU#
2088 # / _-----=> irqs-off
2089 # | / _----=> need-resched
2090 # || / _---=> hardirq/softirq
2091 # ||| / _--=> preempt-depth
2092 # |||| / delay
2093 # cmd pid ||||| time | caller
2094 # \ / ||||| \ | /
2095 <idle>-0 2d.h4 0us : 0:120:R + [002] 5882: 94:R sleep
2096 <idle>-0 2d.h4 0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
2097 <idle>-0 2d.h4 1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
2098 <idle>-0 2dNh2 1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
2099 <idle>-0 2.N.2 2us : power_end: cpu_id=2
2100 <idle>-0 2.N.2 3us : cpu_idle: state=4294967295 cpu_id=2
2101 <idle>-0 2dN.3 4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
2102 <idle>-0 2dN.3 4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
2103 <idle>-0 2.N.2 5us : rcu_utilization: Start context switch
2104 <idle>-0 2.N.2 5us : rcu_utilization: End context switch
2105 <idle>-0 2d..3 6us : __schedule <-schedule
2106 <idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep
2107
2108
2109Hardware Latency Detector
2110-------------------------
2111
2112The hardware latency detector is executed by enabling the "hwlat" tracer.
2113
2114NOTE, this tracer will affect the performance of the system as it will
2115periodically make a CPU constantly busy with interrupts disabled.
2116::
2117
2118 # echo hwlat > current_tracer
2119 # sleep 100
2120 # cat trace
2121 # tracer: hwlat
2122 #
2123 # _-----=> irqs-off
2124 # / _----=> need-resched
2125 # | / _---=> hardirq/softirq
2126 # || / _--=> preempt-depth
2127 # ||| / delay
2128 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
2129 # | | | |||| | |
2130 <...>-3638 [001] d... 19452.055471: #1 inner/outer(us): 12/14 ts:1499801089.066141940
2131 <...>-3638 [003] d... 19454.071354: #2 inner/outer(us): 11/9 ts:1499801091.082164365
2132 <...>-3638 [002] dn.. 19461.126852: #3 inner/outer(us): 12/9 ts:1499801098.138150062
2133 <...>-3638 [001] d... 19488.340960: #4 inner/outer(us): 8/12 ts:1499801125.354139633
2134 <...>-3638 [003] d... 19494.388553: #5 inner/outer(us): 8/12 ts:1499801131.402150961
2135 <...>-3638 [003] d... 19501.283419: #6 inner/outer(us): 0/12 ts:1499801138.297435289 nmi-total:4 nmi-count:1
2136
2137
2138The above output is somewhat the same in the header. All events will have
2139interrupts disabled 'd'. Under the FUNCTION title there is:
2140
2141 #1
2142 This is the count of events recorded that were greater than the
2143 tracing_threshold (See below).
2144
2145 inner/outer(us): 12/14
2146
2147 This shows two numbers as "inner latency" and "outer latency". The test
2148 runs in a loop checking a timestamp twice. The latency detected within
2149 the two timestamps is the "inner latency" and the latency detected
2150 after the previous timestamp and the next timestamp in the loop is
2151 the "outer latency".
2152
2153 ts:1499801089.066141940
2154
2155 The absolute timestamp that the event happened.
2156
2157 nmi-total:4 nmi-count:1
2158
2159 On architectures that support it, if an NMI comes in during the
2160 test, the time spent in NMI is reported in "nmi-total" (in
2161 microseconds).
2162
2163 All architectures that have NMIs will show the "nmi-count" if an
2164 NMI comes in during the test.
2165
2166hwlat files:
2167
2168 tracing_threshold
2169 This gets automatically set to "10" to represent 10
2170 microseconds. This is the threshold of latency that
2171 needs to be detected before the trace will be recorded.
2172
2173 Note, when hwlat tracer is finished (another tracer is
2174 written into "current_tracer"), the original value for
2175 tracing_threshold is placed back into this file.
2176
2177 hwlat_detector/width
2178 The length of time the test runs with interrupts disabled.
2179
2180 hwlat_detector/window
2181 The length of time of the window which the test
2182 runs. That is, the test will run for "width"
2183 microseconds per "window" microseconds
2184
2185 tracing_cpumask
2186 When the test is started. A kernel thread is created that
2187 runs the test. This thread will alternate between CPUs
2188 listed in the tracing_cpumask between each period
2189 (one "window"). To limit the test to specific CPUs
2190 set the mask in this file to only the CPUs that the test
2191 should run on.
2192
2193function
2194--------
2195
2196This tracer is the function tracer. Enabling the function tracer
2197can be done from the debug file system. Make sure the
2198ftrace_enabled is set; otherwise this tracer is a nop.
2199See the "ftrace_enabled" section below.
2200::
2201
2202 # sysctl kernel.ftrace_enabled=1
2203 # echo function > current_tracer
2204 # echo 1 > tracing_on
2205 # usleep 1
2206 # echo 0 > tracing_on
2207 # cat trace
2208 # tracer: function
2209 #
2210 # entries-in-buffer/entries-written: 24799/24799 #P:4
2211 #
2212 # _-----=> irqs-off
2213 # / _----=> need-resched
2214 # | / _---=> hardirq/softirq
2215 # || / _--=> preempt-depth
2216 # ||| / delay
2217 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
2218 # | | | |||| | |
2219 bash-1994 [002] .... 3082.063030: mutex_unlock <-rb_simple_write
2220 bash-1994 [002] .... 3082.063031: __mutex_unlock_slowpath <-mutex_unlock
2221 bash-1994 [002] .... 3082.063031: __fsnotify_parent <-fsnotify_modify
2222 bash-1994 [002] .... 3082.063032: fsnotify <-fsnotify_modify
2223 bash-1994 [002] .... 3082.063032: __srcu_read_lock <-fsnotify
2224 bash-1994 [002] .... 3082.063032: add_preempt_count <-__srcu_read_lock
2225 bash-1994 [002] ...1 3082.063032: sub_preempt_count <-__srcu_read_lock
2226 bash-1994 [002] .... 3082.063033: __srcu_read_unlock <-fsnotify
2227 [...]
2228
2229
2230Note: function tracer uses ring buffers to store the above
2231entries. The newest data may overwrite the oldest data.
2232Sometimes using echo to stop the trace is not sufficient because
2233the tracing could have overwritten the data that you wanted to
2234record. For this reason, it is sometimes better to disable
2235tracing directly from a program. This allows you to stop the
2236tracing at the point that you hit the part that you are
2237interested in. To disable the tracing directly from a C program,
2238something like following code snippet can be used::
2239
2240 int trace_fd;
2241 [...]
2242 int main(int argc, char *argv[]) {
2243 [...]
2244 trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
2245 [...]
2246 if (condition_hit()) {
2247 write(trace_fd, "0", 1);
2248 }
2249 [...]
2250 }
2251
2252
2253Single thread tracing
2254---------------------
2255
2256By writing into set_ftrace_pid you can trace a
2257single thread. For example::
2258
2259 # cat set_ftrace_pid
2260 no pid
2261 # echo 3111 > set_ftrace_pid
2262 # cat set_ftrace_pid
2263 3111
2264 # echo function > current_tracer
2265 # cat trace | head
2266 # tracer: function
2267 #
2268 # TASK-PID CPU# TIMESTAMP FUNCTION
2269 # | | | | |
2270 yum-updatesd-3111 [003] 1637.254676: finish_task_switch <-thread_return
2271 yum-updatesd-3111 [003] 1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
2272 yum-updatesd-3111 [003] 1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
2273 yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
2274 yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll
2275 yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll
2276 # echo > set_ftrace_pid
2277 # cat trace |head
2278 # tracer: function
2279 #
2280 # TASK-PID CPU# TIMESTAMP FUNCTION
2281 # | | | | |
2282 ##### CPU 3 buffer started ####
2283 yum-updatesd-3111 [003] 1701.957688: free_poll_entry <-poll_freewait
2284 yum-updatesd-3111 [003] 1701.957689: remove_wait_queue <-free_poll_entry
2285 yum-updatesd-3111 [003] 1701.957691: fput <-free_poll_entry
2286 yum-updatesd-3111 [003] 1701.957692: audit_syscall_exit <-sysret_audit
2287 yum-updatesd-3111 [003] 1701.957693: path_put <-audit_syscall_exit
2288
2289If you want to trace a function when executing, you could use
2290something like this simple program.
2291::
2292
2293 #include <stdio.h>
2294 #include <stdlib.h>
2295 #include <sys/types.h>
2296 #include <sys/stat.h>
2297 #include <fcntl.h>
2298 #include <unistd.h>
2299 #include <string.h>
2300
2301 #define _STR(x) #x
2302 #define STR(x) _STR(x)
2303 #define MAX_PATH 256
2304
2305 const char *find_tracefs(void)
2306 {
2307 static char tracefs[MAX_PATH+1];
2308 static int tracefs_found;
2309 char type[100];
2310 FILE *fp;
2311
2312 if (tracefs_found)
2313 return tracefs;
2314
2315 if ((fp = fopen("/proc/mounts","r")) == NULL) {
2316 perror("/proc/mounts");
2317 return NULL;
2318 }
2319
2320 while (fscanf(fp, "%*s %"
2321 STR(MAX_PATH)
2322 "s %99s %*s %*d %*d\n",
2323 tracefs, type) == 2) {
2324 if (strcmp(type, "tracefs") == 0)
2325 break;
2326 }
2327 fclose(fp);
2328
2329 if (strcmp(type, "tracefs") != 0) {
2330 fprintf(stderr, "tracefs not mounted");
2331 return NULL;
2332 }
2333
2334 strcat(tracefs, "/tracing/");
2335 tracefs_found = 1;
2336
2337 return tracefs;
2338 }
2339
2340 const char *tracing_file(const char *file_name)
2341 {
2342 static char trace_file[MAX_PATH+1];
2343 snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name);
2344 return trace_file;
2345 }
2346
2347 int main (int argc, char **argv)
2348 {
2349 if (argc < 1)
2350 exit(-1);
2351
2352 if (fork() > 0) {
2353 int fd, ffd;
2354 char line[64];
2355 int s;
2356
2357 ffd = open(tracing_file("current_tracer"), O_WRONLY);
2358 if (ffd < 0)
2359 exit(-1);
2360 write(ffd, "nop", 3);
2361
2362 fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
2363 s = sprintf(line, "%d\n", getpid());
2364 write(fd, line, s);
2365
2366 write(ffd, "function", 8);
2367
2368 close(fd);
2369 close(ffd);
2370
2371 execvp(argv[1], argv+1);
2372 }
2373
2374 return 0;
2375 }
2376
2377Or this simple script!
2378::
2379
2380 #!/bin/bash
2381
2382 tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts`
2383 echo nop > $tracefs/tracing/current_tracer
2384 echo 0 > $tracefs/tracing/tracing_on
2385 echo $$ > $tracefs/tracing/set_ftrace_pid
2386 echo function > $tracefs/tracing/current_tracer
2387 echo 1 > $tracefs/tracing/tracing_on
2388 exec "$@"
2389
2390
2391function graph tracer
2392---------------------------
2393
2394This tracer is similar to the function tracer except that it
2395probes a function on its entry and its exit. This is done by
2396using a dynamically allocated stack of return addresses in each
2397task_struct. On function entry the tracer overwrites the return
2398address of each function traced to set a custom probe. Thus the
2399original return address is stored on the stack of return address
2400in the task_struct.
2401
2402Probing on both ends of a function leads to special features
2403such as:
2404
2405- measure of a function's time execution
2406- having a reliable call stack to draw function calls graph
2407
2408This tracer is useful in several situations:
2409
2410- you want to find the reason of a strange kernel behavior and
2411 need to see what happens in detail on any areas (or specific
2412 ones).
2413
2414- you are experiencing weird latencies but it's difficult to
2415 find its origin.
2416
2417- you want to find quickly which path is taken by a specific
2418 function
2419
2420- you just want to peek inside a working kernel and want to see
2421 what happens there.
2422
2423::
2424
2425 # tracer: function_graph
2426 #
2427 # CPU DURATION FUNCTION CALLS
2428 # | | | | | | |
2429
2430 0) | sys_open() {
2431 0) | do_sys_open() {
2432 0) | getname() {
2433 0) | kmem_cache_alloc() {
2434 0) 1.382 us | __might_sleep();
2435 0) 2.478 us | }
2436 0) | strncpy_from_user() {
2437 0) | might_fault() {
2438 0) 1.389 us | __might_sleep();
2439 0) 2.553 us | }
2440 0) 3.807 us | }
2441 0) 7.876 us | }
2442 0) | alloc_fd() {
2443 0) 0.668 us | _spin_lock();
2444 0) 0.570 us | expand_files();
2445 0) 0.586 us | _spin_unlock();
2446
2447
2448There are several columns that can be dynamically
2449enabled/disabled. You can use every combination of options you
2450want, depending on your needs.
2451
2452- The cpu number on which the function executed is default
2453 enabled. It is sometimes better to only trace one cpu (see
2454 tracing_cpu_mask file) or you might sometimes see unordered
2455 function calls while cpu tracing switch.
2456
2457 - hide: echo nofuncgraph-cpu > trace_options
2458 - show: echo funcgraph-cpu > trace_options
2459
2460- The duration (function's time of execution) is displayed on
2461 the closing bracket line of a function or on the same line
2462 than the current function in case of a leaf one. It is default
2463 enabled.
2464
2465 - hide: echo nofuncgraph-duration > trace_options
2466 - show: echo funcgraph-duration > trace_options
2467
2468- The overhead field precedes the duration field in case of
2469 reached duration thresholds.
2470
2471 - hide: echo nofuncgraph-overhead > trace_options
2472 - show: echo funcgraph-overhead > trace_options
2473 - depends on: funcgraph-duration
2474
2475 ie::
2476
2477 3) # 1837.709 us | } /* __switch_to */
2478 3) | finish_task_switch() {
2479 3) 0.313 us | _raw_spin_unlock_irq();
2480 3) 3.177 us | }
2481 3) # 1889.063 us | } /* __schedule */
2482 3) ! 140.417 us | } /* __schedule */
2483 3) # 2034.948 us | } /* schedule */
2484 3) * 33998.59 us | } /* schedule_preempt_disabled */
2485
2486 [...]
2487
2488 1) 0.260 us | msecs_to_jiffies();
2489 1) 0.313 us | __rcu_read_unlock();
2490 1) + 61.770 us | }
2491 1) + 64.479 us | }
2492 1) 0.313 us | rcu_bh_qs();
2493 1) 0.313 us | __local_bh_enable();
2494 1) ! 217.240 us | }
2495 1) 0.365 us | idle_cpu();
2496 1) | rcu_irq_exit() {
2497 1) 0.417 us | rcu_eqs_enter_common.isra.47();
2498 1) 3.125 us | }
2499 1) ! 227.812 us | }
2500 1) ! 457.395 us | }
2501 1) @ 119760.2 us | }
2502
2503 [...]
2504
2505 2) | handle_IPI() {
2506 1) 6.979 us | }
2507 2) 0.417 us | scheduler_ipi();
2508 1) 9.791 us | }
2509 1) + 12.917 us | }
2510 2) 3.490 us | }
2511 1) + 15.729 us | }
2512 1) + 18.542 us | }
2513 2) $ 3594274 us | }
2514
2515Flags::
2516
2517 + means that the function exceeded 10 usecs.
2518 ! means that the function exceeded 100 usecs.
2519 # means that the function exceeded 1000 usecs.
2520 * means that the function exceeded 10 msecs.
2521 @ means that the function exceeded 100 msecs.
2522 $ means that the function exceeded 1 sec.
2523
2524
2525- The task/pid field displays the thread cmdline and pid which
2526 executed the function. It is default disabled.
2527
2528 - hide: echo nofuncgraph-proc > trace_options
2529 - show: echo funcgraph-proc > trace_options
2530
2531 ie::
2532
2533 # tracer: function_graph
2534 #
2535 # CPU TASK/PID DURATION FUNCTION CALLS
2536 # | | | | | | | | |
2537 0) sh-4802 | | d_free() {
2538 0) sh-4802 | | call_rcu() {
2539 0) sh-4802 | | __call_rcu() {
2540 0) sh-4802 | 0.616 us | rcu_process_gp_end();
2541 0) sh-4802 | 0.586 us | check_for_new_grace_period();
2542 0) sh-4802 | 2.899 us | }
2543 0) sh-4802 | 4.040 us | }
2544 0) sh-4802 | 5.151 us | }
2545 0) sh-4802 | + 49.370 us | }
2546
2547
2548- The absolute time field is an absolute timestamp given by the
2549 system clock since it started. A snapshot of this time is
2550 given on each entry/exit of functions
2551
2552 - hide: echo nofuncgraph-abstime > trace_options
2553 - show: echo funcgraph-abstime > trace_options
2554
2555 ie::
2556
2557 #
2558 # TIME CPU DURATION FUNCTION CALLS
2559 # | | | | | | | |
2560 360.774522 | 1) 0.541 us | }
2561 360.774522 | 1) 4.663 us | }
2562 360.774523 | 1) 0.541 us | __wake_up_bit();
2563 360.774524 | 1) 6.796 us | }
2564 360.774524 | 1) 7.952 us | }
2565 360.774525 | 1) 9.063 us | }
2566 360.774525 | 1) 0.615 us | journal_mark_dirty();
2567 360.774527 | 1) 0.578 us | __brelse();
2568 360.774528 | 1) | reiserfs_prepare_for_journal() {
2569 360.774528 | 1) | unlock_buffer() {
2570 360.774529 | 1) | wake_up_bit() {
2571 360.774529 | 1) | bit_waitqueue() {
2572 360.774530 | 1) 0.594 us | __phys_addr();
2573
2574
2575The function name is always displayed after the closing bracket
2576for a function if the start of that function is not in the
2577trace buffer.
2578
2579Display of the function name after the closing bracket may be
2580enabled for functions whose start is in the trace buffer,
2581allowing easier searching with grep for function durations.
2582It is default disabled.
2583
2584 - hide: echo nofuncgraph-tail > trace_options
2585 - show: echo funcgraph-tail > trace_options
2586
2587 Example with nofuncgraph-tail (default)::
2588
2589 0) | putname() {
2590 0) | kmem_cache_free() {
2591 0) 0.518 us | __phys_addr();
2592 0) 1.757 us | }
2593 0) 2.861 us | }
2594
2595 Example with funcgraph-tail::
2596
2597 0) | putname() {
2598 0) | kmem_cache_free() {
2599 0) 0.518 us | __phys_addr();
2600 0) 1.757 us | } /* kmem_cache_free() */
2601 0) 2.861 us | } /* putname() */
2602
2603You can put some comments on specific functions by using
2604trace_printk() For example, if you want to put a comment inside
2605the __might_sleep() function, you just have to include
2606<linux/ftrace.h> and call trace_printk() inside __might_sleep()::
2607
2608 trace_printk("I'm a comment!\n")
2609
2610will produce::
2611
2612 1) | __might_sleep() {
2613 1) | /* I'm a comment! */
2614 1) 1.449 us | }
2615
2616
2617You might find other useful features for this tracer in the
2618following "dynamic ftrace" section such as tracing only specific
2619functions or tasks.
2620
2621dynamic ftrace
2622--------------
2623
2624If CONFIG_DYNAMIC_FTRACE is set, the system will run with
2625virtually no overhead when function tracing is disabled. The way
2626this works is the mcount function call (placed at the start of
2627every kernel function, produced by the -pg switch in gcc),
2628starts of pointing to a simple return. (Enabling FTRACE will
2629include the -pg switch in the compiling of the kernel.)
2630
2631At compile time every C file object is run through the
2632recordmcount program (located in the scripts directory). This
2633program will parse the ELF headers in the C object to find all
2634the locations in the .text section that call mcount. Starting
2a1e03ca 2635with gcc version 4.6, the -mfentry has been added for x86, which
1f198e22
CD
2636calls "__fentry__" instead of "mcount". Which is called before
2637the creation of the stack frame.
2638
2639Note, not all sections are traced. They may be prevented by either
2640a notrace, or blocked another way and all inline functions are not
2641traced. Check the "available_filter_functions" file to see what functions
2642can be traced.
2643
2644A section called "__mcount_loc" is created that holds
2645references to all the mcount/fentry call sites in the .text section.
2646The recordmcount program re-links this section back into the
2647original object. The final linking stage of the kernel will add all these
2648references into a single table.
2649
2650On boot up, before SMP is initialized, the dynamic ftrace code
2651scans this table and updates all the locations into nops. It
2652also records the locations, which are added to the
2653available_filter_functions list. Modules are processed as they
2654are loaded and before they are executed. When a module is
2655unloaded, it also removes its functions from the ftrace function
2656list. This is automatic in the module unload code, and the
2657module author does not need to worry about it.
2658
2659When tracing is enabled, the process of modifying the function
2660tracepoints is dependent on architecture. The old method is to use
2661kstop_machine to prevent races with the CPUs executing code being
2662modified (which can cause the CPU to do undesirable things, especially
2663if the modified code crosses cache (or page) boundaries), and the nops are
2664patched back to calls. But this time, they do not call mcount
2665(which is just a function stub). They now call into the ftrace
2666infrastructure.
2667
2668The new method of modifying the function tracepoints is to place
2669a breakpoint at the location to be modified, sync all CPUs, modify
2670the rest of the instruction not covered by the breakpoint. Sync
2671all CPUs again, and then remove the breakpoint with the finished
2672version to the ftrace call site.
2673
2674Some archs do not even need to monkey around with the synchronization,
2675and can just slap the new code on top of the old without any
2676problems with other CPUs executing it at the same time.
2677
2678One special side-effect to the recording of the functions being
2679traced is that we can now selectively choose which functions we
2680wish to trace and which ones we want the mcount calls to remain
2681as nops.
2682
2683Two files are used, one for enabling and one for disabling the
2684tracing of specified functions. They are:
2685
2686 set_ftrace_filter
2687
2688and
2689
2690 set_ftrace_notrace
2691
2692A list of available functions that you can add to these files is
2693listed in:
2694
2695 available_filter_functions
2696
2697::
2698
2699 # cat available_filter_functions
2700 put_prev_task_idle
2701 kmem_cache_create
2702 pick_next_task_rt
2703 get_online_cpus
2704 pick_next_task_fair
2705 mutex_lock
2706 [...]
2707
2708If I am only interested in sys_nanosleep and hrtimer_interrupt::
2709
2710 # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
2711 # echo function > current_tracer
2712 # echo 1 > tracing_on
2713 # usleep 1
2714 # echo 0 > tracing_on
2715 # cat trace
2716 # tracer: function
2717 #
2718 # entries-in-buffer/entries-written: 5/5 #P:4
2719 #
2720 # _-----=> irqs-off
2721 # / _----=> need-resched
2722 # | / _---=> hardirq/softirq
2723 # || / _--=> preempt-depth
2724 # ||| / delay
2725 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
2726 # | | | |||| | |
2727 usleep-2665 [001] .... 4186.475355: sys_nanosleep <-system_call_fastpath
2728 <idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
2729 usleep-2665 [001] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2730 <idle>-0 [003] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2731 <idle>-0 [002] d.h1 4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
2732
2733To see which functions are being traced, you can cat the file:
2734::
2735
2736 # cat set_ftrace_filter
2737 hrtimer_interrupt
2738 sys_nanosleep
2739
2740
2741Perhaps this is not enough. The filters also allow glob(7) matching.
2742
6234c7bd 2743 ``<match>*``
1f198e22 2744 will match functions that begin with <match>
6234c7bd 2745 ``*<match>``
1f198e22 2746 will match functions that end with <match>
6234c7bd 2747 ``*<match>*``
1f198e22 2748 will match functions that have <match> in it
6234c7bd 2749 ``<match1>*<match2>``
1f198e22
CD
2750 will match functions that begin with <match1> and end with <match2>
2751
2752.. note::
2753 It is better to use quotes to enclose the wild cards,
2754 otherwise the shell may expand the parameters into names
2755 of files in the local directory.
2756
2757::
2758
2759 # echo 'hrtimer_*' > set_ftrace_filter
2760
2761Produces::
2762
2763 # tracer: function
2764 #
2765 # entries-in-buffer/entries-written: 897/897 #P:4
2766 #
2767 # _-----=> irqs-off
2768 # / _----=> need-resched
2769 # | / _---=> hardirq/softirq
2770 # || / _--=> preempt-depth
2771 # ||| / delay
2772 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
2773 # | | | |||| | |
2774 <idle>-0 [003] dN.1 4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
2775 <idle>-0 [003] dN.1 4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
2776 <idle>-0 [003] dN.2 4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
2777 <idle>-0 [003] dN.1 4228.547805: hrtimer_forward <-tick_nohz_idle_exit
2778 <idle>-0 [003] dN.1 4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2779 <idle>-0 [003] d..1 4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
2780 <idle>-0 [003] d..1 4228.547859: hrtimer_start <-__tick_nohz_idle_enter
2781 <idle>-0 [003] d..2 4228.547860: hrtimer_force_reprogram <-__rem
2782
2783Notice that we lost the sys_nanosleep.
2784::
2785
2786 # cat set_ftrace_filter
2787 hrtimer_run_queues
2788 hrtimer_run_pending
2789 hrtimer_init
2790 hrtimer_cancel
2791 hrtimer_try_to_cancel
2792 hrtimer_forward
2793 hrtimer_start
2794 hrtimer_reprogram
2795 hrtimer_force_reprogram
2796 hrtimer_get_next_event
2797 hrtimer_interrupt
2798 hrtimer_nanosleep
2799 hrtimer_wakeup
2800 hrtimer_get_remaining
2801 hrtimer_get_res
2802 hrtimer_init_sleeper
2803
2804
2805This is because the '>' and '>>' act just like they do in bash.
2806To rewrite the filters, use '>'
2807To append to the filters, use '>>'
2808
2809To clear out a filter so that all functions will be recorded
2810again::
2811
2812 # echo > set_ftrace_filter
2813 # cat set_ftrace_filter
2814 #
2815
2816Again, now we want to append.
2817
2818::
2819
2820 # echo sys_nanosleep > set_ftrace_filter
2821 # cat set_ftrace_filter
2822 sys_nanosleep
2823 # echo 'hrtimer_*' >> set_ftrace_filter
2824 # cat set_ftrace_filter
2825 hrtimer_run_queues
2826 hrtimer_run_pending
2827 hrtimer_init
2828 hrtimer_cancel
2829 hrtimer_try_to_cancel
2830 hrtimer_forward
2831 hrtimer_start
2832 hrtimer_reprogram
2833 hrtimer_force_reprogram
2834 hrtimer_get_next_event
2835 hrtimer_interrupt
2836 sys_nanosleep
2837 hrtimer_nanosleep
2838 hrtimer_wakeup
2839 hrtimer_get_remaining
2840 hrtimer_get_res
2841 hrtimer_init_sleeper
2842
2843
2844The set_ftrace_notrace prevents those functions from being
2845traced.
2846::
2847
2848 # echo '*preempt*' '*lock*' > set_ftrace_notrace
2849
2850Produces::
2851
2852 # tracer: function
2853 #
2854 # entries-in-buffer/entries-written: 39608/39608 #P:4
2855 #
2856 # _-----=> irqs-off
2857 # / _----=> need-resched
2858 # | / _---=> hardirq/softirq
2859 # || / _--=> preempt-depth
2860 # ||| / delay
2861 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
2862 # | | | |||| | |
2863 bash-1994 [000] .... 4342.324896: file_ra_state_init <-do_dentry_open
2864 bash-1994 [000] .... 4342.324897: open_check_o_direct <-do_last
2865 bash-1994 [000] .... 4342.324897: ima_file_check <-do_last
2866 bash-1994 [000] .... 4342.324898: process_measurement <-ima_file_check
2867 bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement
2868 bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action
2869 bash-1994 [000] .... 4342.324899: do_truncate <-do_last
2870 bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate
2871 bash-1994 [000] .... 4342.324899: notify_change <-do_truncate
2872 bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change
2873 bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time
2874 bash-1994 [000] .... 4342.324900: timespec_trunc <-current_fs_time
2875
2876We can see that there's no more lock or preempt tracing.
2877
f79b3f33
SRV
2878Selecting function filters via index
2879------------------------------------
2880
2881Because processing of strings is expensive (the address of the function
2882needs to be looked up before comparing to the string being passed in),
2883an index can be used as well to enable functions. This is useful in the
2884case of setting thousands of specific functions at a time. By passing
2885in a list of numbers, no string processing will occur. Instead, the function
2886at the specific location in the internal array (which corresponds to the
2887functions in the "available_filter_functions" file), is selected.
2888
2889::
2890
2891 # echo 1 > set_ftrace_filter
2892
2893Will select the first function listed in "available_filter_functions"
2894
2895::
2896
2897 # head -1 available_filter_functions
2898 trace_initcall_finish_cb
2899
2900 # cat set_ftrace_filter
2901 trace_initcall_finish_cb
2902
2903 # head -50 available_filter_functions | tail -1
2904 x86_pmu_commit_txn
2905
2906 # echo 1 50 > set_ftrace_filter
2907 # cat set_ftrace_filter
2908 trace_initcall_finish_cb
2909 x86_pmu_commit_txn
1f198e22
CD
2910
2911Dynamic ftrace with the function graph tracer
2912---------------------------------------------
2913
2914Although what has been explained above concerns both the
2915function tracer and the function-graph-tracer, there are some
2916special features only available in the function-graph tracer.
2917
2918If you want to trace only one function and all of its children,
2919you just have to echo its name into set_graph_function::
2920
2921 echo __do_fault > set_graph_function
2922
2923will produce the following "expanded" trace of the __do_fault()
2924function::
2925
2926 0) | __do_fault() {
2927 0) | filemap_fault() {
2928 0) | find_lock_page() {
2929 0) 0.804 us | find_get_page();
2930 0) | __might_sleep() {
2931 0) 1.329 us | }
2932 0) 3.904 us | }
2933 0) 4.979 us | }
2934 0) 0.653 us | _spin_lock();
2935 0) 0.578 us | page_add_file_rmap();
2936 0) 0.525 us | native_set_pte_at();
2937 0) 0.585 us | _spin_unlock();
2938 0) | unlock_page() {
2939 0) 0.541 us | page_waitqueue();
2940 0) 0.639 us | __wake_up_bit();
2941 0) 2.786 us | }
2942 0) + 14.237 us | }
2943 0) | __do_fault() {
2944 0) | filemap_fault() {
2945 0) | find_lock_page() {
2946 0) 0.698 us | find_get_page();
2947 0) | __might_sleep() {
2948 0) 1.412 us | }
2949 0) 3.950 us | }
2950 0) 5.098 us | }
2951 0) 0.631 us | _spin_lock();
2952 0) 0.571 us | page_add_file_rmap();
2953 0) 0.526 us | native_set_pte_at();
2954 0) 0.586 us | _spin_unlock();
2955 0) | unlock_page() {
2956 0) 0.533 us | page_waitqueue();
2957 0) 0.638 us | __wake_up_bit();
2958 0) 2.793 us | }
2959 0) + 14.012 us | }
2960
2961You can also expand several functions at once::
2962
2963 echo sys_open > set_graph_function
2964 echo sys_close >> set_graph_function
2965
2966Now if you want to go back to trace all functions you can clear
2967this special filter via::
2968
2969 echo > set_graph_function
2970
2971
2972ftrace_enabled
2973--------------
2974
2975Note, the proc sysctl ftrace_enable is a big on/off switch for the
2976function tracer. By default it is enabled (when function tracing is
2977enabled in the kernel). If it is disabled, all function tracing is
2978disabled. This includes not only the function tracers for ftrace, but
2979also for any other uses (perf, kprobes, stack tracing, profiling, etc).
2980
2981Please disable this with care.
2982
2983This can be disable (and enabled) with::
2984
2985 sysctl kernel.ftrace_enabled=0
2986 sysctl kernel.ftrace_enabled=1
2987
2988 or
2989
2990 echo 0 > /proc/sys/kernel/ftrace_enabled
2991 echo 1 > /proc/sys/kernel/ftrace_enabled
2992
2993
2994Filter commands
2995---------------
2996
2997A few commands are supported by the set_ftrace_filter interface.
2998Trace commands have the following format::
2999
3000 <function>:<command>:<parameter>
3001
3002The following commands are supported:
3003
3004- mod:
3005 This command enables function filtering per module. The
3006 parameter defines the module. For example, if only the write*
3007 functions in the ext3 module are desired, run:
3008
3009 echo 'write*:mod:ext3' > set_ftrace_filter
3010
3011 This command interacts with the filter in the same way as
3012 filtering based on function names. Thus, adding more functions
3013 in a different module is accomplished by appending (>>) to the
3014 filter file. Remove specific module functions by prepending
3015 '!'::
3016
3017 echo '!writeback*:mod:ext3' >> set_ftrace_filter
3018
3019 Mod command supports module globbing. Disable tracing for all
3020 functions except a specific module::
3021
3022 echo '!*:mod:!ext3' >> set_ftrace_filter
3023
3024 Disable tracing for all modules, but still trace kernel::
3025
3026 echo '!*:mod:*' >> set_ftrace_filter
3027
3028 Enable filter only for kernel::
3029
3030 echo '*write*:mod:!*' >> set_ftrace_filter
3031
3032 Enable filter for module globbing::
3033
3034 echo '*write*:mod:*snd*' >> set_ftrace_filter
3035
3036- traceon/traceoff:
3037 These commands turn tracing on and off when the specified
3038 functions are hit. The parameter determines how many times the
3039 tracing system is turned on and off. If unspecified, there is
3040 no limit. For example, to disable tracing when a schedule bug
3041 is hit the first 5 times, run::
3042
3043 echo '__schedule_bug:traceoff:5' > set_ftrace_filter
3044
3045 To always disable tracing when __schedule_bug is hit::
3046
3047 echo '__schedule_bug:traceoff' > set_ftrace_filter
3048
3049 These commands are cumulative whether or not they are appended
3050 to set_ftrace_filter. To remove a command, prepend it by '!'
3051 and drop the parameter::
3052
3053 echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
3054
3055 The above removes the traceoff command for __schedule_bug
3056 that have a counter. To remove commands without counters::
3057
3058 echo '!__schedule_bug:traceoff' > set_ftrace_filter
3059
3060- snapshot:
3061 Will cause a snapshot to be triggered when the function is hit.
3062 ::
3063
3064 echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
3065
3066 To only snapshot once:
3067 ::
3068
3069 echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
3070
3071 To remove the above commands::
3072
3073 echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
3074 echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
3075
3076- enable_event/disable_event:
3077 These commands can enable or disable a trace event. Note, because
3078 function tracing callbacks are very sensitive, when these commands
3079 are registered, the trace point is activated, but disabled in
3080 a "soft" mode. That is, the tracepoint will be called, but
3081 just will not be traced. The event tracepoint stays in this mode
3082 as long as there's a command that triggers it.
3083 ::
3084
3085 echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
3086 set_ftrace_filter
3087
3088 The format is::
3089
3090 <function>:enable_event:<system>:<event>[:count]
3091 <function>:disable_event:<system>:<event>[:count]
3092
3093 To remove the events commands::
3094
3095 echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
3096 set_ftrace_filter
3097 echo '!schedule:disable_event:sched:sched_switch' > \
3098 set_ftrace_filter
3099
3100- dump:
3101 When the function is hit, it will dump the contents of the ftrace
3102 ring buffer to the console. This is useful if you need to debug
3103 something, and want to dump the trace when a certain function
2a1e03ca 3104 is hit. Perhaps it's a function that is called before a triple
1f198e22
CD
3105 fault happens and does not allow you to get a regular dump.
3106
3107- cpudump:
3108 When the function is hit, it will dump the contents of the ftrace
3109 ring buffer for the current CPU to the console. Unlike the "dump"
3110 command, it only prints out the contents of the ring buffer for the
3111 CPU that executed the function that triggered the dump.
3112
8a2933cf
MH
3113- stacktrace:
3114 When the function is hit, a stack trace is recorded.
3115
1f198e22
CD
3116trace_pipe
3117----------
3118
3119The trace_pipe outputs the same content as the trace file, but
3120the effect on the tracing is different. Every read from
3121trace_pipe is consumed. This means that subsequent reads will be
3122different. The trace is live.
3123::
3124
3125 # echo function > current_tracer
3126 # cat trace_pipe > /tmp/trace.out &
3127 [1] 4153
3128 # echo 1 > tracing_on
3129 # usleep 1
3130 # echo 0 > tracing_on
3131 # cat trace
3132 # tracer: function
3133 #
3134 # entries-in-buffer/entries-written: 0/0 #P:4
3135 #
3136 # _-----=> irqs-off
3137 # / _----=> need-resched
3138 # | / _---=> hardirq/softirq
3139 # || / _--=> preempt-depth
3140 # ||| / delay
3141 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
3142 # | | | |||| | |
3143
3144 #
3145 # cat /tmp/trace.out
3146 bash-1994 [000] .... 5281.568961: mutex_unlock <-rb_simple_write
3147 bash-1994 [000] .... 5281.568963: __mutex_unlock_slowpath <-mutex_unlock
3148 bash-1994 [000] .... 5281.568963: __fsnotify_parent <-fsnotify_modify
3149 bash-1994 [000] .... 5281.568964: fsnotify <-fsnotify_modify
3150 bash-1994 [000] .... 5281.568964: __srcu_read_lock <-fsnotify
3151 bash-1994 [000] .... 5281.568964: add_preempt_count <-__srcu_read_lock
3152 bash-1994 [000] ...1 5281.568965: sub_preempt_count <-__srcu_read_lock
3153 bash-1994 [000] .... 5281.568965: __srcu_read_unlock <-fsnotify
3154 bash-1994 [000] .... 5281.568967: sys_dup2 <-system_call_fastpath
3155
3156
3157Note, reading the trace_pipe file will block until more input is
f12fcca6
PW
3158added. This is contrary to the trace file. If any process opened
3159the trace file for reading, it will actually disable tracing and
3160prevent new entries from being added. The trace_pipe file does
3161not have this limitation.
1f198e22
CD
3162
3163trace entries
3164-------------
3165
3166Having too much or not enough data can be troublesome in
3167diagnosing an issue in the kernel. The file buffer_size_kb is
3168used to modify the size of the internal trace buffers. The
3169number listed is the number of entries that can be recorded per
3170CPU. To know the full size, multiply the number of possible CPUs
3171with the number of entries.
3172::
3173
3174 # cat buffer_size_kb
3175 1408 (units kilobytes)
3176
3177Or simply read buffer_total_size_kb
3178::
3179
3180 # cat buffer_total_size_kb
3181 5632
3182
3183To modify the buffer, simple echo in a number (in 1024 byte segments).
3184::
3185
3186 # echo 10000 > buffer_size_kb
3187 # cat buffer_size_kb
3188 10000 (units kilobytes)
3189
3190It will try to allocate as much as possible. If you allocate too
3191much, it can cause Out-Of-Memory to trigger.
3192::
3193
3194 # echo 1000000000000 > buffer_size_kb
3195 -bash: echo: write error: Cannot allocate memory
3196 # cat buffer_size_kb
3197 85
3198
3199The per_cpu buffers can be changed individually as well:
3200::
3201
3202 # echo 10000 > per_cpu/cpu0/buffer_size_kb
3203 # echo 100 > per_cpu/cpu1/buffer_size_kb
3204
3205When the per_cpu buffers are not the same, the buffer_size_kb
3206at the top level will just show an X
3207::
3208
3209 # cat buffer_size_kb
3210 X
3211
3212This is where the buffer_total_size_kb is useful:
3213::
3214
3215 # cat buffer_total_size_kb
3216 12916
3217
3218Writing to the top level buffer_size_kb will reset all the buffers
3219to be the same again.
3220
3221Snapshot
3222--------
3223CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
3224available to all non latency tracers. (Latency tracers which
3225record max latency, such as "irqsoff" or "wakeup", can't use
3226this feature, since those are already using the snapshot
3227mechanism internally.)
3228
3229Snapshot preserves a current trace buffer at a particular point
3230in time without stopping tracing. Ftrace swaps the current
3231buffer with a spare buffer, and tracing continues in the new
3232current (=previous spare) buffer.
3233
3234The following tracefs files in "tracing" are related to this
3235feature:
3236
3237 snapshot:
3238
3239 This is used to take a snapshot and to read the output
3240 of the snapshot. Echo 1 into this file to allocate a
3241 spare buffer and to take a snapshot (swap), then read
3242 the snapshot from this file in the same format as
3243 "trace" (described above in the section "The File
3244 System"). Both reads snapshot and tracing are executable
3245 in parallel. When the spare buffer is allocated, echoing
3246 0 frees it, and echoing else (positive) values clear the
3247 snapshot contents.
3248 More details are shown in the table below.
3249
3250 +--------------+------------+------------+------------+
3251 |status\\input | 0 | 1 | else |
3252 +==============+============+============+============+
3253 |not allocated |(do nothing)| alloc+swap |(do nothing)|
3254 +--------------+------------+------------+------------+
3255 |allocated | free | swap | clear |
3256 +--------------+------------+------------+------------+
3257
3258Here is an example of using the snapshot feature.
3259::
3260
3261 # echo 1 > events/sched/enable
3262 # echo 1 > snapshot
3263 # cat snapshot
3264 # tracer: nop
3265 #
3266 # entries-in-buffer/entries-written: 71/71 #P:8
3267 #
3268 # _-----=> irqs-off
3269 # / _----=> need-resched
3270 # | / _---=> hardirq/softirq
3271 # || / _--=> preempt-depth
3272 # ||| / delay
3273 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
3274 # | | | |||| | |
3275 <idle>-0 [005] d... 2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
3276 sleep-2242 [005] d... 2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120 prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
3277 [...]
3278 <idle>-0 [002] d... 2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120
3279
3280 # cat trace
3281 # tracer: nop
3282 #
3283 # entries-in-buffer/entries-written: 77/77 #P:8
3284 #
3285 # _-----=> irqs-off
3286 # / _----=> need-resched
3287 # | / _---=> hardirq/softirq
3288 # || / _--=> preempt-depth
3289 # ||| / delay
3290 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
3291 # | | | |||| | |
3292 <idle>-0 [007] d... 2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
3293 snapshot-test-2-2229 [002] d... 2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
3294 [...]
3295
3296
3297If you try to use this snapshot feature when current tracer is
3298one of the latency tracers, you will get the following results.
3299::
3300
3301 # echo wakeup > current_tracer
3302 # echo 1 > snapshot
3303 bash: echo: write error: Device or resource busy
3304 # cat snapshot
3305 cat: snapshot: Device or resource busy
3306
3307
3308Instances
3309---------
3310In the tracefs tracing directory is a directory called "instances".
3311This directory can have new directories created inside of it using
3312mkdir, and removing directories with rmdir. The directory created
3313with mkdir in this directory will already contain files and other
3314directories after it is created.
3315::
3316
3317 # mkdir instances/foo
3318 # ls instances/foo
3319 buffer_size_kb buffer_total_size_kb events free_buffer per_cpu
3320 set_event snapshot trace trace_clock trace_marker trace_options
3321 trace_pipe tracing_on
3322
3323As you can see, the new directory looks similar to the tracing directory
3324itself. In fact, it is very similar, except that the buffer and
3325events are agnostic from the main director, or from any other
3326instances that are created.
3327
3328The files in the new directory work just like the files with the
3329same name in the tracing directory except the buffer that is used
3330is a separate and new buffer. The files affect that buffer but do not
3331affect the main buffer with the exception of trace_options. Currently,
3332the trace_options affect all instances and the top level buffer
3333the same, but this may change in future releases. That is, options
3334may become specific to the instance they reside in.
3335
3336Notice that none of the function tracer files are there, nor is
3337current_tracer and available_tracers. This is because the buffers
3338can currently only have events enabled for them.
3339::
3340
3341 # mkdir instances/foo
3342 # mkdir instances/bar
3343 # mkdir instances/zoot
3344 # echo 100000 > buffer_size_kb
3345 # echo 1000 > instances/foo/buffer_size_kb
3346 # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
3347 # echo function > current_trace
3348 # echo 1 > instances/foo/events/sched/sched_wakeup/enable
3349 # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
3350 # echo 1 > instances/foo/events/sched/sched_switch/enable
3351 # echo 1 > instances/bar/events/irq/enable
3352 # echo 1 > instances/zoot/events/syscalls/enable
3353 # cat trace_pipe
3354 CPU:2 [LOST 11745 EVENTS]
3355 bash-2044 [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
3356 bash-2044 [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
3357 bash-2044 [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
3358 bash-2044 [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
3359 bash-2044 [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
3360 bash-2044 [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
3361 bash-2044 [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
3362 bash-2044 [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
3363 bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3364 bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3365 bash-2044 [002] .... 10594.481035: arch_dup_task_struct <-copy_process
3366 [...]
3367
3368 # cat instances/foo/trace_pipe
3369 bash-1998 [000] d..4 136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3370 bash-1998 [000] dN.4 136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3371 <idle>-0 [003] d.h3 136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
3372 <idle>-0 [003] d..3 136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
3373 rcu_preempt-9 [003] d..3 136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
3374 bash-1998 [000] d..4 136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3375 bash-1998 [000] dN.4 136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3376 bash-1998 [000] d..3 136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
3377 kworker/0:1-59 [000] d..4 136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
3378 kworker/0:1-59 [000] d..3 136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
3379 [...]
3380
3381 # cat instances/bar/trace_pipe
3382 migration/1-14 [001] d.h3 138.732674: softirq_raise: vec=3 [action=NET_RX]
3383 <idle>-0 [001] dNh3 138.732725: softirq_raise: vec=3 [action=NET_RX]
3384 bash-1998 [000] d.h1 138.733101: softirq_raise: vec=1 [action=TIMER]
3385 bash-1998 [000] d.h1 138.733102: softirq_raise: vec=9 [action=RCU]
3386 bash-1998 [000] ..s2 138.733105: softirq_entry: vec=1 [action=TIMER]
3387 bash-1998 [000] ..s2 138.733106: softirq_exit: vec=1 [action=TIMER]
3388 bash-1998 [000] ..s2 138.733106: softirq_entry: vec=9 [action=RCU]
3389 bash-1998 [000] ..s2 138.733109: softirq_exit: vec=9 [action=RCU]
3390 sshd-1995 [001] d.h1 138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
3391 sshd-1995 [001] d.h1 138.733280: irq_handler_exit: irq=21 ret=unhandled
3392 sshd-1995 [001] d.h1 138.733281: irq_handler_entry: irq=21 name=eth0
3393 sshd-1995 [001] d.h1 138.733283: irq_handler_exit: irq=21 ret=handled
3394 [...]
3395
3396 # cat instances/zoot/trace
3397 # tracer: nop
3398 #
3399 # entries-in-buffer/entries-written: 18996/18996 #P:4
3400 #
3401 # _-----=> irqs-off
3402 # / _----=> need-resched
3403 # | / _---=> hardirq/softirq
3404 # || / _--=> preempt-depth
3405 # ||| / delay
3406 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
3407 # | | | |||| | |
3408 bash-1998 [000] d... 140.733501: sys_write -> 0x2
3409 bash-1998 [000] d... 140.733504: sys_dup2(oldfd: a, newfd: 1)
3410 bash-1998 [000] d... 140.733506: sys_dup2 -> 0x1
3411 bash-1998 [000] d... 140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
3412 bash-1998 [000] d... 140.733509: sys_fcntl -> 0x1
3413 bash-1998 [000] d... 140.733510: sys_close(fd: a)
3414 bash-1998 [000] d... 140.733510: sys_close -> 0x0
3415 bash-1998 [000] d... 140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
3416 bash-1998 [000] d... 140.733515: sys_rt_sigprocmask -> 0x0
3417 bash-1998 [000] d... 140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
3418 bash-1998 [000] d... 140.733516: sys_rt_sigaction -> 0x0
3419
3420You can see that the trace of the top most trace buffer shows only
3421the function tracing. The foo instance displays wakeups and task
3422switches.
3423
3424To remove the instances, simply delete their directories:
3425::
3426
3427 # rmdir instances/foo
3428 # rmdir instances/bar
3429 # rmdir instances/zoot
3430
3431Note, if a process has a trace file open in one of the instance
3432directories, the rmdir will fail with EBUSY.
3433
3434
3435Stack trace
3436-----------
3437Since the kernel has a fixed sized stack, it is important not to
3438waste it in functions. A kernel developer must be conscience of
3439what they allocate on the stack. If they add too much, the system
3440can be in danger of a stack overflow, and corruption will occur,
3441usually leading to a system panic.
3442
3443There are some tools that check this, usually with interrupts
3444periodically checking usage. But if you can perform a check
3445at every function call that will become very useful. As ftrace provides
3446a function tracer, it makes it convenient to check the stack size
3447at every function call. This is enabled via the stack tracer.
3448
3449CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
3450To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
3451::
3452
3453 # echo 1 > /proc/sys/kernel/stack_tracer_enabled
3454
3455You can also enable it from the kernel command line to trace
3456the stack size of the kernel during boot up, by adding "stacktrace"
3457to the kernel command line parameter.
3458
3459After running it for a few minutes, the output looks like:
3460::
3461
3462 # cat stack_max_size
3463 2928
3464
3465 # cat stack_trace
3466 Depth Size Location (18 entries)
3467 ----- ---- --------
3468 0) 2928 224 update_sd_lb_stats+0xbc/0x4ac
3469 1) 2704 160 find_busiest_group+0x31/0x1f1
3470 2) 2544 256 load_balance+0xd9/0x662
3471 3) 2288 80 idle_balance+0xbb/0x130
3472 4) 2208 128 __schedule+0x26e/0x5b9
3473 5) 2080 16 schedule+0x64/0x66
3474 6) 2064 128 schedule_timeout+0x34/0xe0
3475 7) 1936 112 wait_for_common+0x97/0xf1
3476 8) 1824 16 wait_for_completion+0x1d/0x1f
3477 9) 1808 128 flush_work+0xfe/0x119
3478 10) 1680 16 tty_flush_to_ldisc+0x1e/0x20
3479 11) 1664 48 input_available_p+0x1d/0x5c
3480 12) 1616 48 n_tty_poll+0x6d/0x134
3481 13) 1568 64 tty_poll+0x64/0x7f
3482 14) 1504 880 do_select+0x31e/0x511
3483 15) 624 400 core_sys_select+0x177/0x216
3484 16) 224 96 sys_select+0x91/0xb9
3485 17) 128 128 system_call_fastpath+0x16/0x1b
3486
3487Note, if -mfentry is being used by gcc, functions get traced before
3488they set up the stack frame. This means that leaf level functions
3489are not tested by the stack tracer when -mfentry is used.
3490
3491Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
3492
3493More
3494----
3495More details can be found in the source code, in the `kernel/trace/*.c` files.