selftests/ftrace: Add test to test new set_event_notrace_pid file
[linux-block.git] / Documentation / trace / ftrace.rst
CommitLineData
1f198e22
CD
1========================
2ftrace - Function Tracer
3========================
4
5Copyright 2008 Red Hat Inc.
6
7:Author: Steven Rostedt <srostedt@redhat.com>
8:License: The GNU Free Documentation License, Version 1.2
9 (dual licensed under the GPL v2)
10:Original Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton,
11 John Kacur, and David Teigland.
12
13- Written for: 2.6.28-rc2
14- Updated for: 3.10
15- Updated for: 4.13 - Copyright 2017 VMware Inc. Steven Rostedt
16- Converted to rst format - Changbin Du <changbin.du@intel.com>
17
18Introduction
19------------
20
21Ftrace is an internal tracer designed to help out developers and
22designers of systems to find what is going on inside the kernel.
23It can be used for debugging or analyzing latencies and
24performance issues that take place outside of user-space.
25
26Although ftrace is typically considered the function tracer, it
2a1e03ca 27is really a framework of several assorted tracing utilities.
1f198e22
CD
28There's latency tracing to examine what occurs between interrupts
29disabled and enabled, as well as for preemption and from a time
30a task is woken to the task is actually scheduled in.
31
32One of the most common uses of ftrace is the event tracing.
2a1e03ca 33Throughout the kernel is hundreds of static event points that
1f198e22
CD
34can be enabled via the tracefs file system to see what is
35going on in certain parts of the kernel.
36
37See events.txt for more information.
38
39
40Implementation Details
41----------------------
42
43See :doc:`ftrace-design` for details for arch porters and such.
44
45
46The File System
47---------------
48
49Ftrace uses the tracefs file system to hold the control files as
50well as the files to display output.
51
52When tracefs is configured into the kernel (which selecting any ftrace
53option will do) the directory /sys/kernel/tracing will be created. To mount
54this directory, you can add to your /etc/fstab file::
55
56 tracefs /sys/kernel/tracing tracefs defaults 0 0
57
58Or you can mount it at run time with::
59
60 mount -t tracefs nodev /sys/kernel/tracing
61
62For quicker access to that directory you may want to make a soft link to
63it::
64
65 ln -s /sys/kernel/tracing /tracing
66
67.. attention::
68
69 Before 4.1, all ftrace tracing control files were within the debugfs
70 file system, which is typically located at /sys/kernel/debug/tracing.
71 For backward compatibility, when mounting the debugfs file system,
72 the tracefs file system will be automatically mounted at:
73
74 /sys/kernel/debug/tracing
75
76 All files located in the tracefs file system will be located in that
77 debugfs file system directory as well.
78
79.. attention::
80
81 Any selected ftrace option will also create the tracefs file system.
82 The rest of the document will assume that you are in the ftrace directory
83 (cd /sys/kernel/tracing) and will only concentrate on the files within that
84 directory and not distract from the content with the extended
85 "/sys/kernel/tracing" path name.
86
87That's it! (assuming that you have ftrace configured into your kernel)
88
89After mounting tracefs you will have access to the control and output files
90of ftrace. Here is a list of some of the key files:
91
92
93 Note: all time values are in microseconds.
94
95 current_tracer:
96
97 This is used to set or display the current tracer
d693b288
FCB
98 that is configured. Changing the current tracer clears
99 the ring buffer content as well as the "snapshot" buffer.
1f198e22
CD
100
101 available_tracers:
102
103 This holds the different types of tracers that
104 have been compiled into the kernel. The
105 tracers listed here can be configured by
106 echoing their name into current_tracer.
107
108 tracing_on:
109
110 This sets or displays whether writing to the trace
111 ring buffer is enabled. Echo 0 into this file to disable
112 the tracer or 1 to enable it. Note, this only disables
113 writing to the ring buffer, the tracing overhead may
114 still be occurring.
115
116 The kernel function tracing_off() can be used within the
117 kernel to disable writing to the ring buffer, which will
118 set this file to "0". User space can re-enable tracing by
119 echoing "1" into the file.
120
121 Note, the function and event trigger "traceoff" will also
122 set this file to zero and stop tracing. Which can also
123 be re-enabled by user space using this file.
124
125 trace:
126
127 This file holds the output of the trace in a human
8a815e6b 128 readable format (described below). Opening this file for
d693b288 129 writing with the O_TRUNC flag clears the ring buffer content.
8a815e6b
SRV
130 Note, this file is not a consumer. If tracing is off
131 (no tracer running, or tracing_on is zero), it will produce
132 the same output each time it is read. When tracing is on,
133 it may produce inconsistent results as it tries to read
134 the entire buffer without consuming it.
1f198e22
CD
135
136 trace_pipe:
137
138 The output is the same as the "trace" file but this
139 file is meant to be streamed with live tracing.
140 Reads from this file will block until new data is
141 retrieved. Unlike the "trace" file, this file is a
142 consumer. This means reading from this file causes
143 sequential reads to display more current data. Once
144 data is read from this file, it is consumed, and
145 will not be read again with a sequential read. The
146 "trace" file is static, and if the tracer is not
147 adding more data, it will display the same
8a815e6b 148 information every time it is read.
1f198e22
CD
149
150 trace_options:
151
152 This file lets the user control the amount of data
153 that is displayed in one of the above output
154 files. Options also exist to modify how a tracer
155 or events work (stack traces, timestamps, etc).
156
157 options:
158
159 This is a directory that has a file for every available
160 trace option (also in trace_options). Options may also be set
161 or cleared by writing a "1" or "0" respectively into the
162 corresponding file with the option name.
163
164 tracing_max_latency:
165
166 Some of the tracers record the max latency.
167 For example, the maximum time that interrupts are disabled.
168 The maximum time is saved in this file. The max trace will also be
169 stored, and displayed by "trace". A new max trace will only be
170 recorded if the latency is greater than the value in this file
171 (in microseconds).
172
173 By echoing in a time into this file, no latency will be recorded
174 unless it is greater than the time in this file.
175
176 tracing_thresh:
177
178 Some latency tracers will record a trace whenever the
179 latency is greater than the number in this file.
180 Only active when the file contains a number greater than 0.
181 (in microseconds)
182
183 buffer_size_kb:
184
185 This sets or displays the number of kilobytes each CPU
186 buffer holds. By default, the trace buffers are the same size
187 for each CPU. The displayed number is the size of the
188 CPU buffer and not total size of all buffers. The
189 trace buffers are allocated in pages (blocks of memory
190 that the kernel uses for allocation, usually 4 KB in size).
a65d634e
FCB
191 A few extra pages may be allocated to accommodate buffer management
192 meta-data. If the last page allocated has room for more bytes
1f198e22
CD
193 than requested, the rest of the page will be used,
194 making the actual allocation bigger than requested or shown.
195 ( Note, the size may not be a multiple of the page size
196 due to buffer management meta-data. )
197
198 Buffer sizes for individual CPUs may vary
199 (see "per_cpu/cpu0/buffer_size_kb" below), and if they do
200 this file will show "X".
201
202 buffer_total_size_kb:
203
204 This displays the total combined size of all the trace buffers.
205
206 free_buffer:
207
208 If a process is performing tracing, and the ring buffer should be
209 shrunk "freed" when the process is finished, even if it were to be
210 killed by a signal, this file can be used for that purpose. On close
211 of this file, the ring buffer will be resized to its minimum size.
212 Having a process that is tracing also open this file, when the process
213 exits its file descriptor for this file will be closed, and in doing so,
214 the ring buffer will be "freed".
215
216 It may also stop tracing if disable_on_free option is set.
217
218 tracing_cpumask:
219
220 This is a mask that lets the user only trace on specified CPUs.
221 The format is a hex string representing the CPUs.
222
223 set_ftrace_filter:
224
225 When dynamic ftrace is configured in (see the
226 section below "dynamic ftrace"), the code is dynamically
227 modified (code text rewrite) to disable calling of the
228 function profiler (mcount). This lets tracing be configured
229 in with practically no overhead in performance. This also
230 has a side effect of enabling or disabling specific functions
231 to be traced. Echoing names of functions into this file
232 will limit the trace to only those functions.
32fb7ef6
SM
233 This influences the tracers "function" and "function_graph"
234 and thus also function profiling (see "function_profile_enabled").
1f198e22
CD
235
236 The functions listed in "available_filter_functions" are what
237 can be written into this file.
238
239 This interface also allows for commands to be used. See the
240 "Filter commands" section for more details.
241
5b8914a6 242 As a speed up, since processing strings can be quite expensive
f79b3f33
SRV
243 and requires a check of all functions registered to tracing, instead
244 an index can be written into this file. A number (starting with "1")
245 written will instead select the same corresponding at the line position
246 of the "available_filter_functions" file.
247
1f198e22
CD
248 set_ftrace_notrace:
249
250 This has an effect opposite to that of
251 set_ftrace_filter. Any function that is added here will not
252 be traced. If a function exists in both set_ftrace_filter
253 and set_ftrace_notrace, the function will _not_ be traced.
254
255 set_ftrace_pid:
256
257 Have the function tracer only trace the threads whose PID are
258 listed in this file.
259
260 If the "function-fork" option is set, then when a task whose
261 PID is listed in this file forks, the child's PID will
262 automatically be added to this file, and the child will be
263 traced by the function tracer as well. This option will also
264 cause PIDs of tasks that exit to be removed from the file.
265
266 set_event_pid:
267
268 Have the events only trace a task with a PID listed in this file.
269 Note, sched_switch and sched_wake_up will also trace events
270 listed in this file.
271
272 To have the PIDs of children of tasks with their PID in this file
273 added on fork, enable the "event-fork" option. That option will also
274 cause the PIDs of tasks to be removed from this file when the task
275 exits.
276
277 set_graph_function:
278
279 Functions listed in this file will cause the function graph
280 tracer to only trace these functions and the functions that
281 they call. (See the section "dynamic ftrace" for more details).
32fb7ef6
SM
282 Note, set_ftrace_filter and set_ftrace_notrace still affects
283 what functions are being traced.
1f198e22
CD
284
285 set_graph_notrace:
286
287 Similar to set_graph_function, but will disable function graph
288 tracing when the function is hit until it exits the function.
289 This makes it possible to ignore tracing functions that are called
290 by a specific function.
291
292 available_filter_functions:
293
294 This lists the functions that ftrace has processed and can trace.
295 These are the function names that you can pass to
32fb7ef6
SM
296 "set_ftrace_filter", "set_ftrace_notrace",
297 "set_graph_function", or "set_graph_notrace".
1f198e22
CD
298 (See the section "dynamic ftrace" below for more details.)
299
300 dyn_ftrace_total_info:
301
302 This file is for debugging purposes. The number of functions that
303 have been converted to nops and are available to be traced.
304
305 enabled_functions:
306
307 This file is more for debugging ftrace, but can also be useful
308 in seeing if any function has a callback attached to it.
309 Not only does the trace infrastructure use ftrace function
310 trace utility, but other subsystems might too. This file
311 displays all functions that have a callback attached to them
312 as well as the number of callbacks that have been attached.
313 Note, a callback may also call multiple functions which will
314 not be listed in this count.
315
316 If the callback registered to be traced by a function with
317 the "save regs" attribute (thus even more overhead), a 'R'
318 will be displayed on the same line as the function that
319 is returning registers.
320
321 If the callback registered to be traced by a function with
322 the "ip modify" attribute (thus the regs->ip can be changed),
323 an 'I' will be displayed on the same line as the function that
324 can be overridden.
325
326 If the architecture supports it, it will also show what callback
327 is being directly called by the function. If the count is greater
328 than 1 it most likely will be ftrace_ops_list_func().
329
330 If the callback of the function jumps to a trampoline that is
331 specific to a the callback and not the standard trampoline,
332 its address will be printed as well as the function that the
333 trampoline calls.
334
335 function_profile_enabled:
336
337 When set it will enable all functions with either the function
338 tracer, or if configured, the function graph tracer. It will
339 keep a histogram of the number of functions that were called
340 and if the function graph tracer was configured, it will also keep
341 track of the time spent in those functions. The histogram
342 content can be displayed in the files:
343
1fee4f77 344 trace_stat/function<cpu> ( function0, function1, etc).
1f198e22 345
1fee4f77 346 trace_stat:
1f198e22
CD
347
348 A directory that holds different tracing stats.
349
350 kprobe_events:
351
352 Enable dynamic trace points. See kprobetrace.txt.
353
354 kprobe_profile:
355
356 Dynamic trace points stats. See kprobetrace.txt.
357
358 max_graph_depth:
359
360 Used with the function graph tracer. This is the max depth
361 it will trace into a function. Setting this to a value of
362 one will show only the first kernel function that is called
363 from user space.
364
365 printk_formats:
366
367 This is for tools that read the raw format files. If an event in
368 the ring buffer references a string, only a pointer to the string
369 is recorded into the buffer and not the string itself. This prevents
370 tools from knowing what that string was. This file displays the string
371 and address for the string allowing tools to map the pointers to what
372 the strings were.
373
374 saved_cmdlines:
375
376 Only the pid of the task is recorded in a trace event unless
377 the event specifically saves the task comm as well. Ftrace
378 makes a cache of pid mappings to comms to try to display
379 comms for events. If a pid for a comm is not listed, then
380 "<...>" is displayed in the output.
381
382 If the option "record-cmd" is set to "0", then comms of tasks
383 will not be saved during recording. By default, it is enabled.
384
385 saved_cmdlines_size:
386
387 By default, 128 comms are saved (see "saved_cmdlines" above). To
388 increase or decrease the amount of comms that are cached, echo
5b8914a6 389 the number of comms to cache into this file.
1f198e22
CD
390
391 saved_tgids:
392
393 If the option "record-tgid" is set, on each scheduling context switch
394 the Task Group ID of a task is saved in a table mapping the PID of
395 the thread to its TGID. By default, the "record-tgid" option is
396 disabled.
397
398 snapshot:
399
400 This displays the "snapshot" buffer and also lets the user
401 take a snapshot of the current running trace.
402 See the "Snapshot" section below for more details.
403
404 stack_max_size:
405
406 When the stack tracer is activated, this will display the
407 maximum stack size it has encountered.
408 See the "Stack Trace" section below.
409
410 stack_trace:
411
412 This displays the stack back trace of the largest stack
413 that was encountered when the stack tracer is activated.
414 See the "Stack Trace" section below.
415
416 stack_trace_filter:
417
418 This is similar to "set_ftrace_filter" but it limits what
419 functions the stack tracer will check.
420
421 trace_clock:
422
423 Whenever an event is recorded into the ring buffer, a
424 "timestamp" is added. This stamp comes from a specified
425 clock. By default, ftrace uses the "local" clock. This
426 clock is very fast and strictly per cpu, but on some
427 systems it may not be monotonic with respect to other
428 CPUs. In other words, the local clocks may not be in sync
429 with local clocks on other CPUs.
430
431 Usual clocks for tracing::
432
433 # cat trace_clock
434 [local] global counter x86-tsc
435
436 The clock with the square brackets around it is the one in effect.
437
438 local:
439 Default clock, but may not be in sync across CPUs
440
441 global:
442 This clock is in sync with all CPUs but may
443 be a bit slower than the local clock.
444
445 counter:
446 This is not a clock at all, but literally an atomic
447 counter. It counts up one by one, but is in sync
448 with all CPUs. This is useful when you need to
449 know exactly the order events occurred with respect to
450 each other on different CPUs.
451
452 uptime:
453 This uses the jiffies counter and the time stamp
454 is relative to the time since boot up.
455
456 perf:
457 This makes ftrace use the same clock that perf uses.
458 Eventually perf will be able to read ftrace buffers
459 and this will help out in interleaving the data.
460
461 x86-tsc:
462 Architectures may define their own clocks. For
463 example, x86 uses its own TSC cycle clock here.
464
465 ppc-tb:
466 This uses the powerpc timebase register value.
467 This is in sync across CPUs and can also be used
468 to correlate events across hypervisor/guest if
469 tb_offset is known.
470
471 mono:
472 This uses the fast monotonic clock (CLOCK_MONOTONIC)
473 which is monotonic and is subject to NTP rate adjustments.
474
475 mono_raw:
476 This is the raw monotonic clock (CLOCK_MONOTONIC_RAW)
2a1e03ca 477 which is monotonic but is not subject to any rate adjustments
1f198e22
CD
478 and ticks at the same rate as the hardware clocksource.
479
480 boot:
a3ed0e43
TG
481 This is the boot clock (CLOCK_BOOTTIME) and is based on the
482 fast monotonic clock, but also accounts for time spent in
483 suspend. Since the clock access is designed for use in
484 tracing in the suspend path, some side effects are possible
485 if clock is accessed after the suspend time is accounted before
486 the fast mono clock is updated. In this case, the clock update
487 appears to happen slightly sooner than it normally would have.
488 Also on 32-bit systems, it's possible that the 64-bit boot offset
489 sees a partial update. These effects are rare and post
490 processing should be able to handle them. See comments in the
491 ktime_get_boot_fast_ns() function for more information.
680014d6
LT
492
493 To set a clock, simply echo the clock name into this file::
494
495 # echo global > trace_clock
1f198e22 496
d693b288
FCB
497 Setting a clock clears the ring buffer content as well as the
498 "snapshot" buffer.
499
1f198e22
CD
500 trace_marker:
501
502 This is a very useful file for synchronizing user space
503 with events happening in the kernel. Writing strings into
504 this file will be written into the ftrace buffer.
505
506 It is useful in applications to open this file at the start
507 of the application and just reference the file descriptor
508 for the file::
509
510 void trace_write(const char *fmt, ...)
511 {
512 va_list ap;
513 char buf[256];
514 int n;
515
516 if (trace_fd < 0)
517 return;
518
519 va_start(ap, fmt);
520 n = vsnprintf(buf, 256, fmt, ap);
521 va_end(ap);
522
523 write(trace_fd, buf, n);
524 }
525
526 start::
527
528 trace_fd = open("trace_marker", WR_ONLY);
529
d3439f9d
SRV
530 Note: Writing into the trace_marker file can also initiate triggers
531 that are written into /sys/kernel/tracing/events/ftrace/print/trigger
532 See "Event triggers" in Documentation/trace/events.rst and an
533 example in Documentation/trace/histogram.rst (Section 3.)
534
1f198e22
CD
535 trace_marker_raw:
536
537 This is similar to trace_marker above, but is meant for for binary data
538 to be written to it, where a tool can be used to parse the data
539 from trace_pipe_raw.
540
541 uprobe_events:
542
543 Add dynamic tracepoints in programs.
544 See uprobetracer.txt
545
546 uprobe_profile:
547
548 Uprobe statistics. See uprobetrace.txt
549
550 instances:
551
552 This is a way to make multiple trace buffers where different
553 events can be recorded in different buffers.
554 See "Instances" section below.
555
556 events:
557
558 This is the trace event directory. It holds event tracepoints
559 (also known as static tracepoints) that have been compiled
560 into the kernel. It shows what event tracepoints exist
561 and how they are grouped by system. There are "enable"
562 files at various levels that can enable the tracepoints
563 when a "1" is written to them.
564
565 See events.txt for more information.
566
567 set_event:
568
569 By echoing in the event into this file, will enable that event.
570
571 See events.txt for more information.
572
573 available_events:
574
575 A list of events that can be enabled in tracing.
576
577 See events.txt for more information.
578
2a56bb59
LT
579 timestamp_mode:
580
581 Certain tracers may change the timestamp mode used when
582 logging trace events into the event buffer. Events with
583 different modes can coexist within a buffer but the mode in
584 effect when an event is logged determines which timestamp mode
585 is used for that event. The default timestamp mode is
586 'delta'.
587
588 Usual timestamp modes for tracing:
589
590 # cat timestamp_mode
591 [delta] absolute
592
593 The timestamp mode with the square brackets around it is the
594 one in effect.
595
596 delta: Default timestamp mode - timestamp is a delta against
597 a per-buffer timestamp.
598
599 absolute: The timestamp is a full timestamp, not a delta
600 against some other value. As such it takes up more
601 space and is less efficient.
602
1f198e22
CD
603 hwlat_detector:
604
605 Directory for the Hardware Latency Detector.
606 See "Hardware Latency Detector" section below.
607
608 per_cpu:
609
610 This is a directory that contains the trace per_cpu information.
611
612 per_cpu/cpu0/buffer_size_kb:
613
614 The ftrace buffer is defined per_cpu. That is, there's a separate
615 buffer for each CPU to allow writes to be done atomically,
616 and free from cache bouncing. These buffers may have different
617 size buffers. This file is similar to the buffer_size_kb
618 file, but it only displays or sets the buffer size for the
619 specific CPU. (here cpu0).
620
621 per_cpu/cpu0/trace:
622
623 This is similar to the "trace" file, but it will only display
624 the data specific for the CPU. If written to, it only clears
625 the specific CPU buffer.
626
627 per_cpu/cpu0/trace_pipe
628
629 This is similar to the "trace_pipe" file, and is a consuming
630 read, but it will only display (and consume) the data specific
631 for the CPU.
632
633 per_cpu/cpu0/trace_pipe_raw
634
635 For tools that can parse the ftrace ring buffer binary format,
636 the trace_pipe_raw file can be used to extract the data
637 from the ring buffer directly. With the use of the splice()
638 system call, the buffer data can be quickly transferred to
639 a file or to the network where a server is collecting the
640 data.
641
642 Like trace_pipe, this is a consuming reader, where multiple
643 reads will always produce different data.
644
645 per_cpu/cpu0/snapshot:
646
647 This is similar to the main "snapshot" file, but will only
648 snapshot the current CPU (if supported). It only displays
649 the content of the snapshot for a given CPU, and if
650 written to, only clears this CPU buffer.
651
652 per_cpu/cpu0/snapshot_raw:
653
654 Similar to the trace_pipe_raw, but will read the binary format
655 from the snapshot buffer for the given CPU.
656
657 per_cpu/cpu0/stats:
658
659 This displays certain stats about the ring buffer:
660
661 entries:
662 The number of events that are still in the buffer.
663
664 overrun:
665 The number of lost events due to overwriting when
666 the buffer was full.
667
668 commit overrun:
669 Should always be zero.
670 This gets set if so many events happened within a nested
671 event (ring buffer is re-entrant), that it fills the
672 buffer and starts dropping events.
673
674 bytes:
675 Bytes actually read (not overwritten).
676
677 oldest event ts:
678 The oldest timestamp in the buffer
679
680 now ts:
681 The current timestamp
682
683 dropped events:
684 Events lost due to overwrite option being off.
685
686 read events:
687 The number of events read.
688
689The Tracers
690-----------
691
692Here is the list of current tracers that may be configured.
693
694 "function"
695
696 Function call tracer to trace all kernel functions.
697
698 "function_graph"
699
700 Similar to the function tracer except that the
701 function tracer probes the functions on their entry
702 whereas the function graph tracer traces on both entry
703 and exit of the functions. It then provides the ability
704 to draw a graph of function calls similar to C code
705 source.
706
707 "blk"
708
709 The block tracer. The tracer used by the blktrace user
710 application.
711
712 "hwlat"
713
714 The Hardware Latency tracer is used to detect if the hardware
715 produces any latency. See "Hardware Latency Detector" section
716 below.
717
718 "irqsoff"
719
720 Traces the areas that disable interrupts and saves
721 the trace with the longest max latency.
722 See tracing_max_latency. When a new max is recorded,
723 it replaces the old trace. It is best to view this
724 trace with the latency-format option enabled, which
725 happens automatically when the tracer is selected.
726
727 "preemptoff"
728
729 Similar to irqsoff but traces and records the amount of
730 time for which preemption is disabled.
731
732 "preemptirqsoff"
733
734 Similar to irqsoff and preemptoff, but traces and
735 records the largest time for which irqs and/or preemption
736 is disabled.
737
738 "wakeup"
739
740 Traces and records the max latency that it takes for
741 the highest priority task to get scheduled after
742 it has been woken up.
743 Traces all tasks as an average developer would expect.
744
745 "wakeup_rt"
746
747 Traces and records the max latency that it takes for just
748 RT tasks (as the current "wakeup" does). This is useful
749 for those interested in wake up timings of RT tasks.
750
751 "wakeup_dl"
752
753 Traces and records the max latency that it takes for
754 a SCHED_DEADLINE task to be woken (as the "wakeup" and
755 "wakeup_rt" does).
756
757 "mmiotrace"
758
759 A special tracer that is used to trace binary module.
760 It will trace all the calls that a module makes to the
761 hardware. Everything it writes and reads from the I/O
762 as well.
763
764 "branch"
765
766 This tracer can be configured when tracing likely/unlikely
767 calls within the kernel. It will trace when a likely and
768 unlikely branch is hit and if it was correct in its prediction
769 of being correct.
770
771 "nop"
772
773 This is the "trace nothing" tracer. To remove all
774 tracers from tracing simply echo "nop" into
775 current_tracer.
776
26a94491
TZ
777Error conditions
778----------------
779
780 For most ftrace commands, failure modes are obvious and communicated
781 using standard return codes.
782
783 For other more involved commands, extended error information may be
784 available via the tracing/error_log file. For the commands that
785 support it, reading the tracing/error_log file after an error will
786 display more detailed information about what went wrong, if
787 information is available. The tracing/error_log file is a circular
788 error log displaying a small number (currently, 8) of ftrace errors
789 for the last (8) failed commands.
790
791 The extended error information and usage takes the form shown in
792 this example::
793
794 # echo xxx > /sys/kernel/debug/tracing/events/sched/sched_wakeup/trigger
795 echo: write error: Invalid argument
796
797 # cat /sys/kernel/debug/tracing/error_log
798 [ 5348.887237] location: error: Couldn't yyy: zzz
799 Command: xxx
800 ^
801 [ 7517.023364] location: error: Bad rrr: sss
802 Command: ppp qqq
803 ^
804
805 To clear the error log, echo the empty string into it::
806
807 # echo > /sys/kernel/debug/tracing/error_log
1f198e22
CD
808
809Examples of using the tracer
810----------------------------
811
812Here are typical examples of using the tracers when controlling
813them only with the tracefs interface (without using any
814user-land utilities).
815
816Output format:
817--------------
818
819Here is an example of the output format of the file "trace"::
820
821 # tracer: function
822 #
823 # entries-in-buffer/entries-written: 140080/250280 #P:4
824 #
825 # _-----=> irqs-off
826 # / _----=> need-resched
827 # | / _---=> hardirq/softirq
828 # || / _--=> preempt-depth
829 # ||| / delay
830 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
831 # | | | |||| | |
832 bash-1977 [000] .... 17284.993652: sys_close <-system_call_fastpath
833 bash-1977 [000] .... 17284.993653: __close_fd <-sys_close
834 bash-1977 [000] .... 17284.993653: _raw_spin_lock <-__close_fd
835 sshd-1974 [003] .... 17284.993653: __srcu_read_unlock <-fsnotify
836 bash-1977 [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock
837 bash-1977 [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd
838 bash-1977 [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock
839 bash-1977 [000] .... 17284.993657: filp_close <-__close_fd
840 bash-1977 [000] .... 17284.993657: dnotify_flush <-filp_close
841 sshd-1974 [003] .... 17284.993658: sys_select <-system_call_fastpath
842 ....
843
844A header is printed with the tracer name that is represented by
845the trace. In this case the tracer is "function". Then it shows the
846number of events in the buffer as well as the total number of entries
847that were written. The difference is the number of entries that were
848lost due to the buffer filling up (250280 - 140080 = 110200 events
849lost).
850
851The header explains the content of the events. Task name "bash", the task
852PID "1977", the CPU that it was running on "000", the latency format
853(explained below), the timestamp in <secs>.<usecs> format, the
854function name that was traced "sys_close" and the parent function that
855called this function "system_call_fastpath". The timestamp is the time
856at which the function was entered.
857
858Latency trace format
859--------------------
860
861When the latency-format option is enabled or when one of the latency
862tracers is set, the trace file gives somewhat more information to see
863why a latency happened. Here is a typical trace::
864
865 # tracer: irqsoff
866 #
867 # irqsoff latency trace v1.1.5 on 3.8.0-test+
868 # --------------------------------------------------------------------
869 # latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
870 # -----------------
871 # | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0)
872 # -----------------
873 # => started at: __lock_task_sighand
874 # => ended at: _raw_spin_unlock_irqrestore
875 #
876 #
877 # _------=> CPU#
878 # / _-----=> irqs-off
879 # | / _----=> need-resched
880 # || / _---=> hardirq/softirq
881 # ||| / _--=> preempt-depth
882 # |||| / delay
883 # cmd pid ||||| time | caller
884 # \ / ||||| \ | /
885 ps-6143 2d... 0us!: trace_hardirqs_off <-__lock_task_sighand
886 ps-6143 2d..1 259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore
887 ps-6143 2d..1 263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore
888 ps-6143 2d..1 306us : <stack trace>
889 => trace_hardirqs_on_caller
890 => trace_hardirqs_on
891 => _raw_spin_unlock_irqrestore
892 => do_task_stat
893 => proc_tgid_stat
894 => proc_single_show
895 => seq_read
896 => vfs_read
897 => sys_read
898 => system_call_fastpath
899
900
901This shows that the current tracer is "irqsoff" tracing the time
902for which interrupts were disabled. It gives the trace version (which
903never changes) and the version of the kernel upon which this was executed on
904(3.8). Then it displays the max latency in microseconds (259 us). The number
905of trace entries displayed and the total number (both are four: #4/4).
906VP, KP, SP, and HP are always zero and are reserved for later use.
907#P is the number of online CPUs (#P:4).
908
909The task is the process that was running when the latency
910occurred. (ps pid: 6143).
911
912The start and stop (the functions in which the interrupts were
913disabled and enabled respectively) that caused the latencies:
914
915 - __lock_task_sighand is where the interrupts were disabled.
916 - _raw_spin_unlock_irqrestore is where they were enabled again.
917
918The next lines after the header are the trace itself. The header
919explains which is which.
920
921 cmd: The name of the process in the trace.
922
923 pid: The PID of that process.
924
925 CPU#: The CPU which the process was running on.
926
927 irqs-off: 'd' interrupts are disabled. '.' otherwise.
928 .. caution:: If the architecture does not support a way to
929 read the irq flags variable, an 'X' will always
930 be printed here.
931
932 need-resched:
933 - 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set,
934 - 'n' only TIF_NEED_RESCHED is set,
935 - 'p' only PREEMPT_NEED_RESCHED is set,
936 - '.' otherwise.
937
938 hardirq/softirq:
939 - 'Z' - NMI occurred inside a hardirq
940 - 'z' - NMI is running
941 - 'H' - hard irq occurred inside a softirq.
942 - 'h' - hard irq is running
943 - 's' - soft irq is running
944 - '.' - normal context.
945
946 preempt-depth: The level of preempt_disabled
947
948The above is mostly meaningful for kernel developers.
949
950 time:
951 When the latency-format option is enabled, the trace file
952 output includes a timestamp relative to the start of the
953 trace. This differs from the output when latency-format
954 is disabled, which includes an absolute timestamp.
955
956 delay:
957 This is just to help catch your eye a bit better. And
958 needs to be fixed to be only relative to the same CPU.
959 The marks are determined by the difference between this
960 current trace and the next trace.
961
962 - '$' - greater than 1 second
2a1e03ca
AL
963 - '@' - greater than 100 millisecond
964 - '*' - greater than 10 millisecond
1f198e22
CD
965 - '#' - greater than 1000 microsecond
966 - '!' - greater than 100 microsecond
967 - '+' - greater than 10 microsecond
968 - ' ' - less than or equal to 10 microsecond.
969
970 The rest is the same as the 'trace' file.
971
972 Note, the latency tracers will usually end with a back trace
973 to easily find where the latency occurred.
974
975trace_options
976-------------
977
978The trace_options file (or the options directory) is used to control
979what gets printed in the trace output, or manipulate the tracers.
980To see what is available, simply cat the file::
981
982 cat trace_options
983 print-parent
984 nosym-offset
985 nosym-addr
986 noverbose
987 noraw
988 nohex
989 nobin
990 noblock
991 trace_printk
992 annotate
993 nouserstacktrace
994 nosym-userobj
995 noprintk-msg-only
996 context-info
997 nolatency-format
998 record-cmd
999 norecord-tgid
1000 overwrite
1001 nodisable_on_free
1002 irq-info
1003 markers
1004 noevent-fork
1005 function-trace
1006 nofunction-fork
1007 nodisplay-graph
1008 nostacktrace
1009 nobranch
1010
1011To disable one of the options, echo in the option prepended with
1012"no"::
1013
1014 echo noprint-parent > trace_options
1015
1016To enable an option, leave off the "no"::
1017
1018 echo sym-offset > trace_options
1019
1020Here are the available options:
1021
1022 print-parent
1023 On function traces, display the calling (parent)
1024 function as well as the function being traced.
1025 ::
1026
1027 print-parent:
1028 bash-4000 [01] 1477.606694: simple_strtoul <-kstrtoul
1029
1030 noprint-parent:
1031 bash-4000 [01] 1477.606694: simple_strtoul
1032
1033
1034 sym-offset
1035 Display not only the function name, but also the
1036 offset in the function. For example, instead of
1037 seeing just "ktime_get", you will see
1038 "ktime_get+0xb/0x20".
1039 ::
1040
1041 sym-offset:
1042 bash-4000 [01] 1477.606694: simple_strtoul+0x6/0xa0
1043
1044 sym-addr
1045 This will also display the function address as well
1046 as the function name.
1047 ::
1048
1049 sym-addr:
1050 bash-4000 [01] 1477.606694: simple_strtoul <c0339346>
1051
1052 verbose
1053 This deals with the trace file when the
1054 latency-format option is enabled.
1055 ::
1056
1057 bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \
1058 (+0.000ms): simple_strtoul (kstrtoul)
1059
1060 raw
1061 This will display raw numbers. This option is best for
1062 use with user applications that can translate the raw
1063 numbers better than having it done in the kernel.
1064
1065 hex
1066 Similar to raw, but the numbers will be in a hexadecimal format.
1067
1068 bin
1069 This will print out the formats in raw binary.
1070
1071 block
1072 When set, reading trace_pipe will not block when polled.
1073
1074 trace_printk
1075 Can disable trace_printk() from writing into the buffer.
1076
1077 annotate
1078 It is sometimes confusing when the CPU buffers are full
1079 and one CPU buffer had a lot of events recently, thus
1080 a shorter time frame, were another CPU may have only had
1081 a few events, which lets it have older events. When
1082 the trace is reported, it shows the oldest events first,
1083 and it may look like only one CPU ran (the one with the
1084 oldest events). When the annotate option is set, it will
1085 display when a new CPU buffer started::
1086
1087 <idle>-0 [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on
1088 <idle>-0 [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on
1089 <idle>-0 [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore
1090 ##### CPU 2 buffer started ####
1091 <idle>-0 [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle
1092 <idle>-0 [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog
1093 <idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock
1094
1095 userstacktrace
1096 This option changes the trace. It records a
1097 stacktrace of the current user space thread after
1098 each trace event.
1099
1100 sym-userobj
1101 when user stacktrace are enabled, look up which
1102 object the address belongs to, and print a
1103 relative address. This is especially useful when
1104 ASLR is on, otherwise you don't get a chance to
1105 resolve the address to object/file/line after
1106 the app is no longer running
1107
1108 The lookup is performed when you read
1109 trace,trace_pipe. Example::
1110
1111 a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0
1112 x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6]
1113
1114
1115 printk-msg-only
1116 When set, trace_printk()s will only show the format
1117 and not their parameters (if trace_bprintk() or
1118 trace_bputs() was used to save the trace_printk()).
1119
1120 context-info
1121 Show only the event data. Hides the comm, PID,
1122 timestamp, CPU, and other useful data.
1123
1124 latency-format
1125 This option changes the trace output. When it is enabled,
1126 the trace displays additional information about the
1127 latency, as described in "Latency trace format".
1128
06e0a548
SRV
1129 pause-on-trace
1130 When set, opening the trace file for read, will pause
1131 writing to the ring buffer (as if tracing_on was set to zero).
1132 This simulates the original behavior of the trace file.
1133 When the file is closed, tracing will be enabled again.
1134
1f198e22
CD
1135 record-cmd
1136 When any event or tracer is enabled, a hook is enabled
1137 in the sched_switch trace point to fill comm cache
1138 with mapped pids and comms. But this may cause some
1139 overhead, and if you only care about pids, and not the
1140 name of the task, disabling this option can lower the
1141 impact of tracing. See "saved_cmdlines".
1142
1143 record-tgid
1144 When any event or tracer is enabled, a hook is enabled
1145 in the sched_switch trace point to fill the cache of
1146 mapped Thread Group IDs (TGID) mapping to pids. See
1147 "saved_tgids".
1148
1149 overwrite
1150 This controls what happens when the trace buffer is
1151 full. If "1" (default), the oldest events are
1152 discarded and overwritten. If "0", then the newest
1153 events are discarded.
1154 (see per_cpu/cpu0/stats for overrun and dropped)
1155
1156 disable_on_free
1157 When the free_buffer is closed, tracing will
1158 stop (tracing_on set to 0).
1159
1160 irq-info
1161 Shows the interrupt, preempt count, need resched data.
1162 When disabled, the trace looks like::
1163
1164 # tracer: function
1165 #
1166 # entries-in-buffer/entries-written: 144405/9452052 #P:4
1167 #
1168 # TASK-PID CPU# TIMESTAMP FUNCTION
1169 # | | | | |
1170 <idle>-0 [002] 23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up
1171 <idle>-0 [002] 23636.756054: activate_task <-ttwu_do_activate.constprop.89
1172 <idle>-0 [002] 23636.756055: enqueue_task <-activate_task
1173
1174
1175 markers
1176 When set, the trace_marker is writable (only by root).
1177 When disabled, the trace_marker will error with EINVAL
1178 on write.
1179
1180 event-fork
1181 When set, tasks with PIDs listed in set_event_pid will have
1182 the PIDs of their children added to set_event_pid when those
1183 tasks fork. Also, when tasks with PIDs in set_event_pid exit,
1184 their PIDs will be removed from the file.
1185
1186 function-trace
1187 The latency tracers will enable function tracing
1188 if this option is enabled (default it is). When
1189 it is disabled, the latency tracers do not trace
1190 functions. This keeps the overhead of the tracer down
1191 when performing latency tests.
1192
1193 function-fork
1194 When set, tasks with PIDs listed in set_ftrace_pid will
1195 have the PIDs of their children added to set_ftrace_pid
1196 when those tasks fork. Also, when tasks with PIDs in
1197 set_ftrace_pid exit, their PIDs will be removed from the
1198 file.
1199
1200 display-graph
1201 When set, the latency tracers (irqsoff, wakeup, etc) will
1202 use function graph tracing instead of function tracing.
1203
1204 stacktrace
1205 When set, a stack trace is recorded after any trace event
1206 is recorded.
1207
1208 branch
1209 Enable branch tracing with the tracer. This enables branch
1210 tracer along with the currently set tracer. Enabling this
1211 with the "nop" tracer is the same as just enabling the
1212 "branch" tracer.
1213
1214.. tip:: Some tracers have their own options. They only appear in this
1215 file when the tracer is active. They always appear in the
1216 options directory.
1217
1218
1219Here are the per tracer options:
1220
1221Options for function tracer:
1222
1223 func_stack_trace
1224 When set, a stack trace is recorded after every
1225 function that is recorded. NOTE! Limit the functions
1226 that are recorded before enabling this, with
1227 "set_ftrace_filter" otherwise the system performance
1228 will be critically degraded. Remember to disable
1229 this option before clearing the function filter.
1230
1231Options for function_graph tracer:
1232
1233 Since the function_graph tracer has a slightly different output
1234 it has its own options to control what is displayed.
1235
1236 funcgraph-overrun
1237 When set, the "overrun" of the graph stack is
1238 displayed after each function traced. The
1239 overrun, is when the stack depth of the calls
1240 is greater than what is reserved for each task.
1241 Each task has a fixed array of functions to
1242 trace in the call graph. If the depth of the
1243 calls exceeds that, the function is not traced.
1244 The overrun is the number of functions missed
1245 due to exceeding this array.
1246
1247 funcgraph-cpu
1248 When set, the CPU number of the CPU where the trace
1249 occurred is displayed.
1250
1251 funcgraph-overhead
1252 When set, if the function takes longer than
1253 A certain amount, then a delay marker is
1254 displayed. See "delay" above, under the
1255 header description.
1256
1257 funcgraph-proc
1258 Unlike other tracers, the process' command line
1259 is not displayed by default, but instead only
1260 when a task is traced in and out during a context
1261 switch. Enabling this options has the command
1262 of each process displayed at every line.
1263
1264 funcgraph-duration
1265 At the end of each function (the return)
1266 the duration of the amount of time in the
1267 function is displayed in microseconds.
1268
1269 funcgraph-abstime
1270 When set, the timestamp is displayed at each line.
1271
1272 funcgraph-irqs
1273 When disabled, functions that happen inside an
1274 interrupt will not be traced.
1275
1276 funcgraph-tail
1277 When set, the return event will include the function
1278 that it represents. By default this is off, and
1279 only a closing curly bracket "}" is displayed for
1280 the return of a function.
1281
1282 sleep-time
1283 When running function graph tracer, to include
1284 the time a task schedules out in its function.
1285 When enabled, it will account time the task has been
1286 scheduled out as part of the function call.
1287
1288 graph-time
1289 When running function profiler with function graph tracer,
1290 to include the time to call nested functions. When this is
1291 not set, the time reported for the function will only
1292 include the time the function itself executed for, not the
1293 time for functions that it called.
1294
1295Options for blk tracer:
1296
1297 blk_classic
1298 Shows a more minimalistic output.
1299
1300
1301irqsoff
1302-------
1303
1304When interrupts are disabled, the CPU can not react to any other
1305external event (besides NMIs and SMIs). This prevents the timer
1306interrupt from triggering or the mouse interrupt from letting
1307the kernel know of a new mouse event. The result is a latency
1308with the reaction time.
1309
1310The irqsoff tracer tracks the time for which interrupts are
1311disabled. When a new maximum latency is hit, the tracer saves
1312the trace leading up to that latency point so that every time a
1313new maximum is reached, the old saved trace is discarded and the
1314new trace is saved.
1315
1316To reset the maximum, echo 0 into tracing_max_latency. Here is
1317an example::
1318
1319 # echo 0 > options/function-trace
1320 # echo irqsoff > current_tracer
1321 # echo 1 > tracing_on
1322 # echo 0 > tracing_max_latency
1323 # ls -ltr
1324 [...]
1325 # echo 0 > tracing_on
1326 # cat trace
1327 # tracer: irqsoff
1328 #
1329 # irqsoff latency trace v1.1.5 on 3.8.0-test+
1330 # --------------------------------------------------------------------
1331 # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1332 # -----------------
1333 # | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
1334 # -----------------
1335 # => started at: run_timer_softirq
1336 # => ended at: run_timer_softirq
1337 #
1338 #
1339 # _------=> CPU#
1340 # / _-----=> irqs-off
1341 # | / _----=> need-resched
1342 # || / _---=> hardirq/softirq
1343 # ||| / _--=> preempt-depth
1344 # |||| / delay
1345 # cmd pid ||||| time | caller
1346 # \ / ||||| \ | /
1347 <idle>-0 0d.s2 0us+: _raw_spin_lock_irq <-run_timer_softirq
1348 <idle>-0 0dNs3 17us : _raw_spin_unlock_irq <-run_timer_softirq
1349 <idle>-0 0dNs3 17us+: trace_hardirqs_on <-run_timer_softirq
1350 <idle>-0 0dNs3 25us : <stack trace>
1351 => _raw_spin_unlock_irq
1352 => run_timer_softirq
1353 => __do_softirq
1354 => call_softirq
1355 => do_softirq
1356 => irq_exit
1357 => smp_apic_timer_interrupt
1358 => apic_timer_interrupt
1359 => rcu_idle_exit
1360 => cpu_idle
1361 => rest_init
1362 => start_kernel
1363 => x86_64_start_reservations
1364 => x86_64_start_kernel
1365
1366Here we see that that we had a latency of 16 microseconds (which is
1367very good). The _raw_spin_lock_irq in run_timer_softirq disabled
1368interrupts. The difference between the 16 and the displayed
1369timestamp 25us occurred because the clock was incremented
1370between the time of recording the max latency and the time of
1371recording the function that had that latency.
1372
1373Note the above example had function-trace not set. If we set
1374function-trace, we get a much larger output::
1375
1376 with echo 1 > options/function-trace
1377
1378 # tracer: irqsoff
1379 #
1380 # irqsoff latency trace v1.1.5 on 3.8.0-test+
1381 # --------------------------------------------------------------------
1382 # latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1383 # -----------------
1384 # | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0)
1385 # -----------------
1386 # => started at: ata_scsi_queuecmd
1387 # => ended at: ata_scsi_queuecmd
1388 #
1389 #
1390 # _------=> CPU#
1391 # / _-----=> irqs-off
1392 # | / _----=> need-resched
1393 # || / _---=> hardirq/softirq
1394 # ||| / _--=> preempt-depth
1395 # |||| / delay
1396 # cmd pid ||||| time | caller
1397 # \ / ||||| \ | /
1398 bash-2042 3d... 0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1399 bash-2042 3d... 0us : add_preempt_count <-_raw_spin_lock_irqsave
1400 bash-2042 3d..1 1us : ata_scsi_find_dev <-ata_scsi_queuecmd
1401 bash-2042 3d..1 1us : __ata_scsi_find_dev <-ata_scsi_find_dev
1402 bash-2042 3d..1 2us : ata_find_dev.part.14 <-__ata_scsi_find_dev
1403 bash-2042 3d..1 2us : ata_qc_new_init <-__ata_scsi_queuecmd
1404 bash-2042 3d..1 3us : ata_sg_init <-__ata_scsi_queuecmd
1405 bash-2042 3d..1 4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd
1406 bash-2042 3d..1 4us : ata_build_rw_tf <-ata_scsi_rw_xlat
1407 [...]
1408 bash-2042 3d..1 67us : delay_tsc <-__delay
1409 bash-2042 3d..1 67us : add_preempt_count <-delay_tsc
1410 bash-2042 3d..2 67us : sub_preempt_count <-delay_tsc
1411 bash-2042 3d..1 67us : add_preempt_count <-delay_tsc
1412 bash-2042 3d..2 68us : sub_preempt_count <-delay_tsc
1413 bash-2042 3d..1 68us+: ata_bmdma_start <-ata_bmdma_qc_issue
1414 bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1415 bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1416 bash-2042 3d..1 72us+: trace_hardirqs_on <-ata_scsi_queuecmd
1417 bash-2042 3d..1 120us : <stack trace>
1418 => _raw_spin_unlock_irqrestore
1419 => ata_scsi_queuecmd
1420 => scsi_dispatch_cmd
1421 => scsi_request_fn
1422 => __blk_run_queue_uncond
1423 => __blk_run_queue
1424 => blk_queue_bio
1425 => generic_make_request
1426 => submit_bio
1427 => submit_bh
1428 => __ext3_get_inode_loc
1429 => ext3_iget
1430 => ext3_lookup
1431 => lookup_real
1432 => __lookup_hash
1433 => walk_component
1434 => lookup_last
1435 => path_lookupat
1436 => filename_lookup
1437 => user_path_at_empty
1438 => user_path_at
1439 => vfs_fstatat
1440 => vfs_stat
1441 => sys_newstat
1442 => system_call_fastpath
1443
1444
1445Here we traced a 71 microsecond latency. But we also see all the
1446functions that were called during that time. Note that by
1447enabling function tracing, we incur an added overhead. This
1448overhead may extend the latency times. But nevertheless, this
1449trace has provided some very helpful debugging information.
1450
88d380eb
CD
1451If we prefer function graph output instead of function, we can set
1452display-graph option::
3df5ffd2 1453
88d380eb
CD
1454 with echo 1 > options/display-graph
1455
1456 # tracer: irqsoff
1457 #
1458 # irqsoff latency trace v1.1.5 on 4.20.0-rc6+
1459 # --------------------------------------------------------------------
1460 # latency: 3751 us, #274/274, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:4)
1461 # -----------------
1462 # | task: bash-1507 (uid:0 nice:0 policy:0 rt_prio:0)
1463 # -----------------
1464 # => started at: free_debug_processing
1465 # => ended at: return_to_handler
1466 #
1467 #
1468 # _-----=> irqs-off
1469 # / _----=> need-resched
1470 # | / _---=> hardirq/softirq
1471 # || / _--=> preempt-depth
1472 # ||| /
1473 # REL TIME CPU TASK/PID |||| DURATION FUNCTION CALLS
1474 # | | | | |||| | | | | | |
1475 0 us | 0) bash-1507 | d... | 0.000 us | _raw_spin_lock_irqsave();
1476 0 us | 0) bash-1507 | d..1 | 0.378 us | do_raw_spin_trylock();
1477 1 us | 0) bash-1507 | d..2 | | set_track() {
1478 2 us | 0) bash-1507 | d..2 | | save_stack_trace() {
1479 2 us | 0) bash-1507 | d..2 | | __save_stack_trace() {
1480 3 us | 0) bash-1507 | d..2 | | __unwind_start() {
1481 3 us | 0) bash-1507 | d..2 | | get_stack_info() {
1482 3 us | 0) bash-1507 | d..2 | 0.351 us | in_task_stack();
1483 4 us | 0) bash-1507 | d..2 | 1.107 us | }
1484 [...]
1485 3750 us | 0) bash-1507 | d..1 | 0.516 us | do_raw_spin_unlock();
1486 3750 us | 0) bash-1507 | d..1 | 0.000 us | _raw_spin_unlock_irqrestore();
1487 3764 us | 0) bash-1507 | d..1 | 0.000 us | tracer_hardirqs_on();
1488 bash-1507 0d..1 3792us : <stack trace>
1489 => free_debug_processing
1490 => __slab_free
1491 => kmem_cache_free
1492 => vm_area_free
1493 => remove_vma
1494 => exit_mmap
1495 => mmput
1496 => flush_old_exec
1497 => load_elf_binary
1498 => search_binary_handler
1499 => __do_execve_file.isra.32
1500 => __x64_sys_execve
1501 => do_syscall_64
1502 => entry_SYSCALL_64_after_hwframe
1f198e22
CD
1503
1504preemptoff
1505----------
1506
1507When preemption is disabled, we may be able to receive
1508interrupts but the task cannot be preempted and a higher
1509priority task must wait for preemption to be enabled again
1510before it can preempt a lower priority task.
1511
1512The preemptoff tracer traces the places that disable preemption.
1513Like the irqsoff tracer, it records the maximum latency for
1514which preemption was disabled. The control of preemptoff tracer
1515is much like the irqsoff tracer.
1516::
1517
1518 # echo 0 > options/function-trace
1519 # echo preemptoff > current_tracer
1520 # echo 1 > tracing_on
1521 # echo 0 > tracing_max_latency
1522 # ls -ltr
1523 [...]
1524 # echo 0 > tracing_on
1525 # cat trace
1526 # tracer: preemptoff
1527 #
1528 # preemptoff latency trace v1.1.5 on 3.8.0-test+
1529 # --------------------------------------------------------------------
1530 # latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1531 # -----------------
1532 # | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0)
1533 # -----------------
1534 # => started at: do_IRQ
1535 # => ended at: do_IRQ
1536 #
1537 #
1538 # _------=> CPU#
1539 # / _-----=> irqs-off
1540 # | / _----=> need-resched
1541 # || / _---=> hardirq/softirq
1542 # ||| / _--=> preempt-depth
1543 # |||| / delay
1544 # cmd pid ||||| time | caller
1545 # \ / ||||| \ | /
1546 sshd-1991 1d.h. 0us+: irq_enter <-do_IRQ
1547 sshd-1991 1d..1 46us : irq_exit <-do_IRQ
1548 sshd-1991 1d..1 47us+: trace_preempt_on <-do_IRQ
1549 sshd-1991 1d..1 52us : <stack trace>
1550 => sub_preempt_count
1551 => irq_exit
1552 => do_IRQ
1553 => ret_from_intr
1554
1555
1556This has some more changes. Preemption was disabled when an
1557interrupt came in (notice the 'h'), and was enabled on exit.
1558But we also see that interrupts have been disabled when entering
1559the preempt off section and leaving it (the 'd'). We do not know if
1560interrupts were enabled in the mean time or shortly after this
1561was over.
1562::
1563
1564 # tracer: preemptoff
1565 #
1566 # preemptoff latency trace v1.1.5 on 3.8.0-test+
1567 # --------------------------------------------------------------------
1568 # latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1569 # -----------------
1570 # | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0)
1571 # -----------------
1572 # => started at: wake_up_new_task
1573 # => ended at: task_rq_unlock
1574 #
1575 #
1576 # _------=> CPU#
1577 # / _-----=> irqs-off
1578 # | / _----=> need-resched
1579 # || / _---=> hardirq/softirq
1580 # ||| / _--=> preempt-depth
1581 # |||| / delay
1582 # cmd pid ||||| time | caller
1583 # \ / ||||| \ | /
1584 bash-1994 1d..1 0us : _raw_spin_lock_irqsave <-wake_up_new_task
1585 bash-1994 1d..1 0us : select_task_rq_fair <-select_task_rq
1586 bash-1994 1d..1 1us : __rcu_read_lock <-select_task_rq_fair
1587 bash-1994 1d..1 1us : source_load <-select_task_rq_fair
1588 bash-1994 1d..1 1us : source_load <-select_task_rq_fair
1589 [...]
1590 bash-1994 1d..1 12us : irq_enter <-smp_apic_timer_interrupt
1591 bash-1994 1d..1 12us : rcu_irq_enter <-irq_enter
1592 bash-1994 1d..1 13us : add_preempt_count <-irq_enter
1593 bash-1994 1d.h1 13us : exit_idle <-smp_apic_timer_interrupt
1594 bash-1994 1d.h1 13us : hrtimer_interrupt <-smp_apic_timer_interrupt
1595 bash-1994 1d.h1 13us : _raw_spin_lock <-hrtimer_interrupt
1596 bash-1994 1d.h1 14us : add_preempt_count <-_raw_spin_lock
1597 bash-1994 1d.h2 14us : ktime_get_update_offsets <-hrtimer_interrupt
1598 [...]
1599 bash-1994 1d.h1 35us : lapic_next_event <-clockevents_program_event
1600 bash-1994 1d.h1 35us : irq_exit <-smp_apic_timer_interrupt
1601 bash-1994 1d.h1 36us : sub_preempt_count <-irq_exit
1602 bash-1994 1d..2 36us : do_softirq <-irq_exit
1603 bash-1994 1d..2 36us : __do_softirq <-call_softirq
1604 bash-1994 1d..2 36us : __local_bh_disable <-__do_softirq
1605 bash-1994 1d.s2 37us : add_preempt_count <-_raw_spin_lock_irq
1606 bash-1994 1d.s3 38us : _raw_spin_unlock <-run_timer_softirq
1607 bash-1994 1d.s3 39us : sub_preempt_count <-_raw_spin_unlock
1608 bash-1994 1d.s2 39us : call_timer_fn <-run_timer_softirq
1609 [...]
1610 bash-1994 1dNs2 81us : cpu_needs_another_gp <-rcu_process_callbacks
1611 bash-1994 1dNs2 82us : __local_bh_enable <-__do_softirq
1612 bash-1994 1dNs2 82us : sub_preempt_count <-__local_bh_enable
1613 bash-1994 1dN.2 82us : idle_cpu <-irq_exit
1614 bash-1994 1dN.2 83us : rcu_irq_exit <-irq_exit
1615 bash-1994 1dN.2 83us : sub_preempt_count <-irq_exit
1616 bash-1994 1.N.1 84us : _raw_spin_unlock_irqrestore <-task_rq_unlock
1617 bash-1994 1.N.1 84us+: trace_preempt_on <-task_rq_unlock
1618 bash-1994 1.N.1 104us : <stack trace>
1619 => sub_preempt_count
1620 => _raw_spin_unlock_irqrestore
1621 => task_rq_unlock
1622 => wake_up_new_task
1623 => do_fork
1624 => sys_clone
1625 => stub_clone
1626
1627
1628The above is an example of the preemptoff trace with
1629function-trace set. Here we see that interrupts were not disabled
1630the entire time. The irq_enter code lets us know that we entered
1631an interrupt 'h'. Before that, the functions being traced still
1632show that it is not in an interrupt, but we can see from the
1633functions themselves that this is not the case.
1634
1635preemptirqsoff
1636--------------
1637
1638Knowing the locations that have interrupts disabled or
1639preemption disabled for the longest times is helpful. But
1640sometimes we would like to know when either preemption and/or
1641interrupts are disabled.
1642
1643Consider the following code::
1644
1645 local_irq_disable();
1646 call_function_with_irqs_off();
1647 preempt_disable();
1648 call_function_with_irqs_and_preemption_off();
1649 local_irq_enable();
1650 call_function_with_preemption_off();
1651 preempt_enable();
1652
1653The irqsoff tracer will record the total length of
1654call_function_with_irqs_off() and
1655call_function_with_irqs_and_preemption_off().
1656
1657The preemptoff tracer will record the total length of
1658call_function_with_irqs_and_preemption_off() and
1659call_function_with_preemption_off().
1660
1661But neither will trace the time that interrupts and/or
1662preemption is disabled. This total time is the time that we can
1663not schedule. To record this time, use the preemptirqsoff
1664tracer.
1665
1666Again, using this trace is much like the irqsoff and preemptoff
1667tracers.
1668::
1669
1670 # echo 0 > options/function-trace
1671 # echo preemptirqsoff > current_tracer
1672 # echo 1 > tracing_on
1673 # echo 0 > tracing_max_latency
1674 # ls -ltr
1675 [...]
1676 # echo 0 > tracing_on
1677 # cat trace
1678 # tracer: preemptirqsoff
1679 #
1680 # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1681 # --------------------------------------------------------------------
1682 # latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1683 # -----------------
1684 # | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0)
1685 # -----------------
1686 # => started at: ata_scsi_queuecmd
1687 # => ended at: ata_scsi_queuecmd
1688 #
1689 #
1690 # _------=> CPU#
1691 # / _-----=> irqs-off
1692 # | / _----=> need-resched
1693 # || / _---=> hardirq/softirq
1694 # ||| / _--=> preempt-depth
1695 # |||| / delay
1696 # cmd pid ||||| time | caller
1697 # \ / ||||| \ | /
1698 ls-2230 3d... 0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd
1699 ls-2230 3...1 100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd
1700 ls-2230 3...1 101us+: trace_preempt_on <-ata_scsi_queuecmd
1701 ls-2230 3...1 111us : <stack trace>
1702 => sub_preempt_count
1703 => _raw_spin_unlock_irqrestore
1704 => ata_scsi_queuecmd
1705 => scsi_dispatch_cmd
1706 => scsi_request_fn
1707 => __blk_run_queue_uncond
1708 => __blk_run_queue
1709 => blk_queue_bio
1710 => generic_make_request
1711 => submit_bio
1712 => submit_bh
1713 => ext3_bread
1714 => ext3_dir_bread
1715 => htree_dirblock_to_tree
1716 => ext3_htree_fill_tree
1717 => ext3_readdir
1718 => vfs_readdir
1719 => sys_getdents
1720 => system_call_fastpath
1721
1722
1723The trace_hardirqs_off_thunk is called from assembly on x86 when
1724interrupts are disabled in the assembly code. Without the
1725function tracing, we do not know if interrupts were enabled
1726within the preemption points. We do see that it started with
1727preemption enabled.
1728
1729Here is a trace with function-trace set::
1730
1731 # tracer: preemptirqsoff
1732 #
1733 # preemptirqsoff latency trace v1.1.5 on 3.8.0-test+
1734 # --------------------------------------------------------------------
1735 # latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1736 # -----------------
1737 # | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0)
1738 # -----------------
1739 # => started at: schedule
1740 # => ended at: mutex_unlock
1741 #
1742 #
1743 # _------=> CPU#
1744 # / _-----=> irqs-off
1745 # | / _----=> need-resched
1746 # || / _---=> hardirq/softirq
1747 # ||| / _--=> preempt-depth
1748 # |||| / delay
1749 # cmd pid ||||| time | caller
1750 # \ / ||||| \ | /
1751 kworker/-59 3...1 0us : __schedule <-schedule
1752 kworker/-59 3d..1 0us : rcu_preempt_qs <-rcu_note_context_switch
1753 kworker/-59 3d..1 1us : add_preempt_count <-_raw_spin_lock_irq
1754 kworker/-59 3d..2 1us : deactivate_task <-__schedule
1755 kworker/-59 3d..2 1us : dequeue_task <-deactivate_task
1756 kworker/-59 3d..2 2us : update_rq_clock <-dequeue_task
1757 kworker/-59 3d..2 2us : dequeue_task_fair <-dequeue_task
1758 kworker/-59 3d..2 2us : update_curr <-dequeue_task_fair
1759 kworker/-59 3d..2 2us : update_min_vruntime <-update_curr
1760 kworker/-59 3d..2 3us : cpuacct_charge <-update_curr
1761 kworker/-59 3d..2 3us : __rcu_read_lock <-cpuacct_charge
1762 kworker/-59 3d..2 3us : __rcu_read_unlock <-cpuacct_charge
1763 kworker/-59 3d..2 3us : update_cfs_rq_blocked_load <-dequeue_task_fair
1764 kworker/-59 3d..2 4us : clear_buddies <-dequeue_task_fair
1765 kworker/-59 3d..2 4us : account_entity_dequeue <-dequeue_task_fair
1766 kworker/-59 3d..2 4us : update_min_vruntime <-dequeue_task_fair
1767 kworker/-59 3d..2 4us : update_cfs_shares <-dequeue_task_fair
1768 kworker/-59 3d..2 5us : hrtick_update <-dequeue_task_fair
1769 kworker/-59 3d..2 5us : wq_worker_sleeping <-__schedule
1770 kworker/-59 3d..2 5us : kthread_data <-wq_worker_sleeping
1771 kworker/-59 3d..2 5us : put_prev_task_fair <-__schedule
1772 kworker/-59 3d..2 6us : pick_next_task_fair <-pick_next_task
1773 kworker/-59 3d..2 6us : clear_buddies <-pick_next_task_fair
1774 kworker/-59 3d..2 6us : set_next_entity <-pick_next_task_fair
1775 kworker/-59 3d..2 6us : update_stats_wait_end <-set_next_entity
1776 ls-2269 3d..2 7us : finish_task_switch <-__schedule
1777 ls-2269 3d..2 7us : _raw_spin_unlock_irq <-finish_task_switch
1778 ls-2269 3d..2 8us : do_IRQ <-ret_from_intr
1779 ls-2269 3d..2 8us : irq_enter <-do_IRQ
1780 ls-2269 3d..2 8us : rcu_irq_enter <-irq_enter
1781 ls-2269 3d..2 9us : add_preempt_count <-irq_enter
1782 ls-2269 3d.h2 9us : exit_idle <-do_IRQ
1783 [...]
1784 ls-2269 3d.h3 20us : sub_preempt_count <-_raw_spin_unlock
1785 ls-2269 3d.h2 20us : irq_exit <-do_IRQ
1786 ls-2269 3d.h2 21us : sub_preempt_count <-irq_exit
1787 ls-2269 3d..3 21us : do_softirq <-irq_exit
1788 ls-2269 3d..3 21us : __do_softirq <-call_softirq
1789 ls-2269 3d..3 21us+: __local_bh_disable <-__do_softirq
1790 ls-2269 3d.s4 29us : sub_preempt_count <-_local_bh_enable_ip
1791 ls-2269 3d.s5 29us : sub_preempt_count <-_local_bh_enable_ip
1792 ls-2269 3d.s5 31us : do_IRQ <-ret_from_intr
1793 ls-2269 3d.s5 31us : irq_enter <-do_IRQ
1794 ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter
1795 [...]
1796 ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter
1797 ls-2269 3d.s5 32us : add_preempt_count <-irq_enter
1798 ls-2269 3d.H5 32us : exit_idle <-do_IRQ
1799 ls-2269 3d.H5 32us : handle_irq <-do_IRQ
1800 ls-2269 3d.H5 32us : irq_to_desc <-handle_irq
1801 ls-2269 3d.H5 33us : handle_fasteoi_irq <-handle_irq
1802 [...]
1803 ls-2269 3d.s5 158us : _raw_spin_unlock_irqrestore <-rtl8139_poll
1804 ls-2269 3d.s3 158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action
1805 ls-2269 3d.s3 159us : __local_bh_enable <-__do_softirq
1806 ls-2269 3d.s3 159us : sub_preempt_count <-__local_bh_enable
1807 ls-2269 3d..3 159us : idle_cpu <-irq_exit
1808 ls-2269 3d..3 159us : rcu_irq_exit <-irq_exit
1809 ls-2269 3d..3 160us : sub_preempt_count <-irq_exit
1810 ls-2269 3d... 161us : __mutex_unlock_slowpath <-mutex_unlock
1811 ls-2269 3d... 162us+: trace_hardirqs_on <-mutex_unlock
1812 ls-2269 3d... 186us : <stack trace>
1813 => __mutex_unlock_slowpath
1814 => mutex_unlock
1815 => process_output
1816 => n_tty_write
1817 => tty_write
1818 => vfs_write
1819 => sys_write
1820 => system_call_fastpath
1821
1822This is an interesting trace. It started with kworker running and
1823scheduling out and ls taking over. But as soon as ls released the
1824rq lock and enabled interrupts (but not preemption) an interrupt
1825triggered. When the interrupt finished, it started running softirqs.
1826But while the softirq was running, another interrupt triggered.
1827When an interrupt is running inside a softirq, the annotation is 'H'.
1828
1829
1830wakeup
1831------
1832
1833One common case that people are interested in tracing is the
1834time it takes for a task that is woken to actually wake up.
1835Now for non Real-Time tasks, this can be arbitrary. But tracing
1836it none the less can be interesting.
1837
1838Without function tracing::
1839
1840 # echo 0 > options/function-trace
1841 # echo wakeup > current_tracer
1842 # echo 1 > tracing_on
1843 # echo 0 > tracing_max_latency
1844 # chrt -f 5 sleep 1
1845 # echo 0 > tracing_on
1846 # cat trace
1847 # tracer: wakeup
1848 #
1849 # wakeup latency trace v1.1.5 on 3.8.0-test+
1850 # --------------------------------------------------------------------
1851 # latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1852 # -----------------
1853 # | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0)
1854 # -----------------
1855 #
1856 # _------=> CPU#
1857 # / _-----=> irqs-off
1858 # | / _----=> need-resched
1859 # || / _---=> hardirq/softirq
1860 # ||| / _--=> preempt-depth
1861 # |||| / delay
1862 # cmd pid ||||| time | caller
1863 # \ / ||||| \ | /
1864 <idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H
1865 <idle>-0 3dNs7 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1866 <idle>-0 3d..3 15us : __schedule <-schedule
1867 <idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H
1868
1869The tracer only traces the highest priority task in the system
1870to avoid tracing the normal circumstances. Here we see that
1871the kworker with a nice priority of -20 (not very nice), took
1872just 15 microseconds from the time it woke up, to the time it
1873ran.
1874
1875Non Real-Time tasks are not that interesting. A more interesting
1876trace is to concentrate only on Real-Time tasks.
1877
1878wakeup_rt
1879---------
1880
1881In a Real-Time environment it is very important to know the
1882wakeup time it takes for the highest priority task that is woken
1883up to the time that it executes. This is also known as "schedule
1884latency". I stress the point that this is about RT tasks. It is
1885also important to know the scheduling latency of non-RT tasks,
1886but the average schedule latency is better for non-RT tasks.
1887Tools like LatencyTop are more appropriate for such
1888measurements.
1889
1890Real-Time environments are interested in the worst case latency.
1891That is the longest latency it takes for something to happen,
1892and not the average. We can have a very fast scheduler that may
1893only have a large latency once in a while, but that would not
1894work well with Real-Time tasks. The wakeup_rt tracer was designed
1895to record the worst case wakeups of RT tasks. Non-RT tasks are
1896not recorded because the tracer only records one worst case and
1897tracing non-RT tasks that are unpredictable will overwrite the
1898worst case latency of RT tasks (just run the normal wakeup
1899tracer for a while to see that effect).
1900
1901Since this tracer only deals with RT tasks, we will run this
1902slightly differently than we did with the previous tracers.
1903Instead of performing an 'ls', we will run 'sleep 1' under
1904'chrt' which changes the priority of the task.
1905::
1906
1907 # echo 0 > options/function-trace
1908 # echo wakeup_rt > current_tracer
1909 # echo 1 > tracing_on
1910 # echo 0 > tracing_max_latency
1911 # chrt -f 5 sleep 1
1912 # echo 0 > tracing_on
1913 # cat trace
1914 # tracer: wakeup
1915 #
1916 # tracer: wakeup_rt
1917 #
1918 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1919 # --------------------------------------------------------------------
1920 # latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1921 # -----------------
1922 # | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5)
1923 # -----------------
1924 #
1925 # _------=> CPU#
1926 # / _-----=> irqs-off
1927 # | / _----=> need-resched
1928 # || / _---=> hardirq/softirq
1929 # ||| / _--=> preempt-depth
1930 # |||| / delay
1931 # cmd pid ||||| time | caller
1932 # \ / ||||| \ | /
1933 <idle>-0 3d.h4 0us : 0:120:R + [003] 2389: 94:R sleep
1934 <idle>-0 3d.h4 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up
1935 <idle>-0 3d..3 5us : __schedule <-schedule
1936 <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep
1937
1938
1939Running this on an idle system, we see that it only took 5 microseconds
1940to perform the task switch. Note, since the trace point in the schedule
1941is before the actual "switch", we stop the tracing when the recorded task
1942is about to schedule in. This may change if we add a new marker at the
1943end of the scheduler.
1944
1945Notice that the recorded task is 'sleep' with the PID of 2389
1946and it has an rt_prio of 5. This priority is user-space priority
1947and not the internal kernel priority. The policy is 1 for
1948SCHED_FIFO and 2 for SCHED_RR.
1949
1950Note, that the trace data shows the internal priority (99 - rtprio).
1951::
1952
1953 <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep
1954
1955The 0:120:R means idle was running with a nice priority of 0 (120 - 120)
1956and in the running state 'R'. The sleep task was scheduled in with
19572389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94)
1958and it too is in the running state.
1959
1960Doing the same with chrt -r 5 and function-trace set.
1961::
1962
1963 echo 1 > options/function-trace
1964
1965 # tracer: wakeup_rt
1966 #
1967 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
1968 # --------------------------------------------------------------------
1969 # latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
1970 # -----------------
1971 # | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5)
1972 # -----------------
1973 #
1974 # _------=> CPU#
1975 # / _-----=> irqs-off
1976 # | / _----=> need-resched
1977 # || / _---=> hardirq/softirq
1978 # ||| / _--=> preempt-depth
1979 # |||| / delay
1980 # cmd pid ||||| time | caller
1981 # \ / ||||| \ | /
1982 <idle>-0 3d.h4 1us+: 0:120:R + [003] 2448: 94:R sleep
1983 <idle>-0 3d.h4 2us : ttwu_do_activate.constprop.87 <-try_to_wake_up
1984 <idle>-0 3d.h3 3us : check_preempt_curr <-ttwu_do_wakeup
1985 <idle>-0 3d.h3 3us : resched_curr <-check_preempt_curr
1986 <idle>-0 3dNh3 4us : task_woken_rt <-ttwu_do_wakeup
1987 <idle>-0 3dNh3 4us : _raw_spin_unlock <-try_to_wake_up
1988 <idle>-0 3dNh3 4us : sub_preempt_count <-_raw_spin_unlock
1989 <idle>-0 3dNh2 5us : ttwu_stat <-try_to_wake_up
1990 <idle>-0 3dNh2 5us : _raw_spin_unlock_irqrestore <-try_to_wake_up
1991 <idle>-0 3dNh2 6us : sub_preempt_count <-_raw_spin_unlock_irqrestore
1992 <idle>-0 3dNh1 6us : _raw_spin_lock <-__run_hrtimer
1993 <idle>-0 3dNh1 6us : add_preempt_count <-_raw_spin_lock
1994 <idle>-0 3dNh2 7us : _raw_spin_unlock <-hrtimer_interrupt
1995 <idle>-0 3dNh2 7us : sub_preempt_count <-_raw_spin_unlock
1996 <idle>-0 3dNh1 7us : tick_program_event <-hrtimer_interrupt
1997 <idle>-0 3dNh1 7us : clockevents_program_event <-tick_program_event
1998 <idle>-0 3dNh1 8us : ktime_get <-clockevents_program_event
1999 <idle>-0 3dNh1 8us : lapic_next_event <-clockevents_program_event
2000 <idle>-0 3dNh1 8us : irq_exit <-smp_apic_timer_interrupt
2001 <idle>-0 3dNh1 9us : sub_preempt_count <-irq_exit
2002 <idle>-0 3dN.2 9us : idle_cpu <-irq_exit
2003 <idle>-0 3dN.2 9us : rcu_irq_exit <-irq_exit
2004 <idle>-0 3dN.2 10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit
2005 <idle>-0 3dN.2 10us : sub_preempt_count <-irq_exit
2006 <idle>-0 3.N.1 11us : rcu_idle_exit <-cpu_idle
2007 <idle>-0 3dN.1 11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit
2008 <idle>-0 3.N.1 11us : tick_nohz_idle_exit <-cpu_idle
2009 <idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit
2010 <idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit
2011 <idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit
2012 <idle>-0 3dN.1 13us : cpu_load_update_nohz <-tick_nohz_idle_exit
2013 <idle>-0 3dN.1 13us : _raw_spin_lock <-cpu_load_update_nohz
2014 <idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock
2015 <idle>-0 3dN.2 13us : __cpu_load_update <-cpu_load_update_nohz
2016 <idle>-0 3dN.2 14us : sched_avg_update <-__cpu_load_update
2017 <idle>-0 3dN.2 14us : _raw_spin_unlock <-cpu_load_update_nohz
2018 <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock
2019 <idle>-0 3dN.1 15us : calc_load_nohz_stop <-tick_nohz_idle_exit
2020 <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit
2021 <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit
2022 <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel
2023 <idle>-0 3dN.1 16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel
2024 <idle>-0 3dN.1 16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
2025 <idle>-0 3dN.1 16us : add_preempt_count <-_raw_spin_lock_irqsave
2026 <idle>-0 3dN.2 17us : __remove_hrtimer <-remove_hrtimer.part.16
2027 <idle>-0 3dN.2 17us : hrtimer_force_reprogram <-__remove_hrtimer
2028 <idle>-0 3dN.2 17us : tick_program_event <-hrtimer_force_reprogram
2029 <idle>-0 3dN.2 18us : clockevents_program_event <-tick_program_event
2030 <idle>-0 3dN.2 18us : ktime_get <-clockevents_program_event
2031 <idle>-0 3dN.2 18us : lapic_next_event <-clockevents_program_event
2032 <idle>-0 3dN.2 19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel
2033 <idle>-0 3dN.2 19us : sub_preempt_count <-_raw_spin_unlock_irqrestore
2034 <idle>-0 3dN.1 19us : hrtimer_forward <-tick_nohz_idle_exit
2035 <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward
2036 <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward
2037 <idle>-0 3dN.1 20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2038 <idle>-0 3dN.1 20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns
2039 <idle>-0 3dN.1 21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns
2040 <idle>-0 3dN.1 21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18
2041 <idle>-0 3dN.1 21us : add_preempt_count <-_raw_spin_lock_irqsave
2042 <idle>-0 3dN.2 22us : ktime_add_safe <-__hrtimer_start_range_ns
2043 <idle>-0 3dN.2 22us : enqueue_hrtimer <-__hrtimer_start_range_ns
2044 <idle>-0 3dN.2 22us : tick_program_event <-__hrtimer_start_range_ns
2045 <idle>-0 3dN.2 23us : clockevents_program_event <-tick_program_event
2046 <idle>-0 3dN.2 23us : ktime_get <-clockevents_program_event
2047 <idle>-0 3dN.2 23us : lapic_next_event <-clockevents_program_event
2048 <idle>-0 3dN.2 24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns
2049 <idle>-0 3dN.2 24us : sub_preempt_count <-_raw_spin_unlock_irqrestore
2050 <idle>-0 3dN.1 24us : account_idle_ticks <-tick_nohz_idle_exit
2051 <idle>-0 3dN.1 24us : account_idle_time <-account_idle_ticks
2052 <idle>-0 3.N.1 25us : sub_preempt_count <-cpu_idle
2053 <idle>-0 3.N.. 25us : schedule <-cpu_idle
2054 <idle>-0 3.N.. 25us : __schedule <-preempt_schedule
2055 <idle>-0 3.N.. 26us : add_preempt_count <-__schedule
2056 <idle>-0 3.N.1 26us : rcu_note_context_switch <-__schedule
2057 <idle>-0 3.N.1 26us : rcu_sched_qs <-rcu_note_context_switch
2058 <idle>-0 3dN.1 27us : rcu_preempt_qs <-rcu_note_context_switch
2059 <idle>-0 3.N.1 27us : _raw_spin_lock_irq <-__schedule
2060 <idle>-0 3dN.1 27us : add_preempt_count <-_raw_spin_lock_irq
2061 <idle>-0 3dN.2 28us : put_prev_task_idle <-__schedule
2062 <idle>-0 3dN.2 28us : pick_next_task_stop <-pick_next_task
2063 <idle>-0 3dN.2 28us : pick_next_task_rt <-pick_next_task
2064 <idle>-0 3dN.2 29us : dequeue_pushable_task <-pick_next_task_rt
2065 <idle>-0 3d..3 29us : __schedule <-preempt_schedule
2066 <idle>-0 3d..3 30us : 0:120:R ==> [003] 2448: 94:R sleep
2067
2068This isn't that big of a trace, even with function tracing enabled,
2069so I included the entire trace.
2070
2071The interrupt went off while when the system was idle. Somewhere
2072before task_woken_rt() was called, the NEED_RESCHED flag was set,
2073this is indicated by the first occurrence of the 'N' flag.
2074
2075Latency tracing and events
2076--------------------------
2077As function tracing can induce a much larger latency, but without
2078seeing what happens within the latency it is hard to know what
2079caused it. There is a middle ground, and that is with enabling
2080events.
2081::
2082
2083 # echo 0 > options/function-trace
2084 # echo wakeup_rt > current_tracer
2085 # echo 1 > events/enable
2086 # echo 1 > tracing_on
2087 # echo 0 > tracing_max_latency
2088 # chrt -f 5 sleep 1
2089 # echo 0 > tracing_on
2090 # cat trace
2091 # tracer: wakeup_rt
2092 #
2093 # wakeup_rt latency trace v1.1.5 on 3.8.0-test+
2094 # --------------------------------------------------------------------
2095 # latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
2096 # -----------------
2097 # | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5)
2098 # -----------------
2099 #
2100 # _------=> CPU#
2101 # / _-----=> irqs-off
2102 # | / _----=> need-resched
2103 # || / _---=> hardirq/softirq
2104 # ||| / _--=> preempt-depth
2105 # |||| / delay
2106 # cmd pid ||||| time | caller
2107 # \ / ||||| \ | /
2108 <idle>-0 2d.h4 0us : 0:120:R + [002] 5882: 94:R sleep
2109 <idle>-0 2d.h4 0us : ttwu_do_activate.constprop.87 <-try_to_wake_up
2110 <idle>-0 2d.h4 1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002
2111 <idle>-0 2dNh2 1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8
2112 <idle>-0 2.N.2 2us : power_end: cpu_id=2
2113 <idle>-0 2.N.2 3us : cpu_idle: state=4294967295 cpu_id=2
2114 <idle>-0 2dN.3 4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0
2115 <idle>-0 2dN.3 4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000
2116 <idle>-0 2.N.2 5us : rcu_utilization: Start context switch
2117 <idle>-0 2.N.2 5us : rcu_utilization: End context switch
2118 <idle>-0 2d..3 6us : __schedule <-schedule
2119 <idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep
2120
2121
2122Hardware Latency Detector
2123-------------------------
2124
2125The hardware latency detector is executed by enabling the "hwlat" tracer.
2126
2127NOTE, this tracer will affect the performance of the system as it will
2128periodically make a CPU constantly busy with interrupts disabled.
2129::
2130
2131 # echo hwlat > current_tracer
2132 # sleep 100
2133 # cat trace
2134 # tracer: hwlat
2135 #
b396bfde
SRV
2136 # entries-in-buffer/entries-written: 13/13 #P:8
2137 #
1f198e22
CD
2138 # _-----=> irqs-off
2139 # / _----=> need-resched
2140 # | / _---=> hardirq/softirq
2141 # || / _--=> preempt-depth
2142 # ||| / delay
2143 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
2144 # | | | |||| | |
b396bfde
SRV
2145 <...>-1729 [001] d... 678.473449: #1 inner/outer(us): 11/12 ts:1581527483.343962693 count:6
2146 <...>-1729 [004] d... 689.556542: #2 inner/outer(us): 16/9 ts:1581527494.889008092 count:1
2147 <...>-1729 [005] d... 714.756290: #3 inner/outer(us): 16/16 ts:1581527519.678961629 count:5
2148 <...>-1729 [001] d... 718.788247: #4 inner/outer(us): 9/17 ts:1581527523.889012713 count:1
2149 <...>-1729 [002] d... 719.796341: #5 inner/outer(us): 13/9 ts:1581527524.912872606 count:1
2150 <...>-1729 [006] d... 844.787091: #6 inner/outer(us): 9/12 ts:1581527649.889048502 count:2
2151 <...>-1729 [003] d... 849.827033: #7 inner/outer(us): 18/9 ts:1581527654.889013793 count:1
2152 <...>-1729 [007] d... 853.859002: #8 inner/outer(us): 9/12 ts:1581527658.889065736 count:1
2153 <...>-1729 [001] d... 855.874978: #9 inner/outer(us): 9/11 ts:1581527660.861991877 count:1
2154 <...>-1729 [001] d... 863.938932: #10 inner/outer(us): 9/11 ts:1581527668.970010500 count:1 nmi-total:7 nmi-count:1
2155 <...>-1729 [007] d... 878.050780: #11 inner/outer(us): 9/12 ts:1581527683.385002600 count:1 nmi-total:5 nmi-count:1
2156 <...>-1729 [007] d... 886.114702: #12 inner/outer(us): 9/12 ts:1581527691.385001600 count:1
1f198e22
CD
2157
2158
2159The above output is somewhat the same in the header. All events will have
2160interrupts disabled 'd'. Under the FUNCTION title there is:
2161
2162 #1
2163 This is the count of events recorded that were greater than the
2164 tracing_threshold (See below).
2165
b396bfde 2166 inner/outer(us): 11/11
1f198e22
CD
2167
2168 This shows two numbers as "inner latency" and "outer latency". The test
2169 runs in a loop checking a timestamp twice. The latency detected within
2170 the two timestamps is the "inner latency" and the latency detected
2171 after the previous timestamp and the next timestamp in the loop is
2172 the "outer latency".
2173
b396bfde
SRV
2174 ts:1581527483.343962693
2175
2176 The absolute timestamp that the first latency was recorded in the window.
2177
2178 count:6
1f198e22 2179
b396bfde 2180 The number of times a latency was detected during the window.
1f198e22 2181
b396bfde 2182 nmi-total:7 nmi-count:1
1f198e22
CD
2183
2184 On architectures that support it, if an NMI comes in during the
2185 test, the time spent in NMI is reported in "nmi-total" (in
2186 microseconds).
2187
2188 All architectures that have NMIs will show the "nmi-count" if an
2189 NMI comes in during the test.
2190
2191hwlat files:
2192
2193 tracing_threshold
2194 This gets automatically set to "10" to represent 10
2195 microseconds. This is the threshold of latency that
2196 needs to be detected before the trace will be recorded.
2197
2198 Note, when hwlat tracer is finished (another tracer is
2199 written into "current_tracer"), the original value for
2200 tracing_threshold is placed back into this file.
2201
2202 hwlat_detector/width
2203 The length of time the test runs with interrupts disabled.
2204
2205 hwlat_detector/window
2206 The length of time of the window which the test
2207 runs. That is, the test will run for "width"
2208 microseconds per "window" microseconds
2209
2210 tracing_cpumask
2211 When the test is started. A kernel thread is created that
2212 runs the test. This thread will alternate between CPUs
2213 listed in the tracing_cpumask between each period
2214 (one "window"). To limit the test to specific CPUs
2215 set the mask in this file to only the CPUs that the test
2216 should run on.
2217
2218function
2219--------
2220
2221This tracer is the function tracer. Enabling the function tracer
2222can be done from the debug file system. Make sure the
2223ftrace_enabled is set; otherwise this tracer is a nop.
2224See the "ftrace_enabled" section below.
2225::
2226
2227 # sysctl kernel.ftrace_enabled=1
2228 # echo function > current_tracer
2229 # echo 1 > tracing_on
2230 # usleep 1
2231 # echo 0 > tracing_on
2232 # cat trace
2233 # tracer: function
2234 #
2235 # entries-in-buffer/entries-written: 24799/24799 #P:4
2236 #
2237 # _-----=> irqs-off
2238 # / _----=> need-resched
2239 # | / _---=> hardirq/softirq
2240 # || / _--=> preempt-depth
2241 # ||| / delay
2242 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
2243 # | | | |||| | |
2244 bash-1994 [002] .... 3082.063030: mutex_unlock <-rb_simple_write
2245 bash-1994 [002] .... 3082.063031: __mutex_unlock_slowpath <-mutex_unlock
2246 bash-1994 [002] .... 3082.063031: __fsnotify_parent <-fsnotify_modify
2247 bash-1994 [002] .... 3082.063032: fsnotify <-fsnotify_modify
2248 bash-1994 [002] .... 3082.063032: __srcu_read_lock <-fsnotify
2249 bash-1994 [002] .... 3082.063032: add_preempt_count <-__srcu_read_lock
2250 bash-1994 [002] ...1 3082.063032: sub_preempt_count <-__srcu_read_lock
2251 bash-1994 [002] .... 3082.063033: __srcu_read_unlock <-fsnotify
2252 [...]
2253
2254
2255Note: function tracer uses ring buffers to store the above
2256entries. The newest data may overwrite the oldest data.
2257Sometimes using echo to stop the trace is not sufficient because
2258the tracing could have overwritten the data that you wanted to
2259record. For this reason, it is sometimes better to disable
2260tracing directly from a program. This allows you to stop the
2261tracing at the point that you hit the part that you are
2262interested in. To disable the tracing directly from a C program,
2263something like following code snippet can be used::
2264
2265 int trace_fd;
2266 [...]
2267 int main(int argc, char *argv[]) {
2268 [...]
2269 trace_fd = open(tracing_file("tracing_on"), O_WRONLY);
2270 [...]
2271 if (condition_hit()) {
2272 write(trace_fd, "0", 1);
2273 }
2274 [...]
2275 }
2276
2277
2278Single thread tracing
2279---------------------
2280
2281By writing into set_ftrace_pid you can trace a
2282single thread. For example::
2283
2284 # cat set_ftrace_pid
2285 no pid
2286 # echo 3111 > set_ftrace_pid
2287 # cat set_ftrace_pid
2288 3111
2289 # echo function > current_tracer
2290 # cat trace | head
2291 # tracer: function
2292 #
2293 # TASK-PID CPU# TIMESTAMP FUNCTION
2294 # | | | | |
2295 yum-updatesd-3111 [003] 1637.254676: finish_task_switch <-thread_return
2296 yum-updatesd-3111 [003] 1637.254681: hrtimer_cancel <-schedule_hrtimeout_range
2297 yum-updatesd-3111 [003] 1637.254682: hrtimer_try_to_cancel <-hrtimer_cancel
2298 yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel
2299 yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll
2300 yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll
2301 # echo > set_ftrace_pid
2302 # cat trace |head
2303 # tracer: function
2304 #
2305 # TASK-PID CPU# TIMESTAMP FUNCTION
2306 # | | | | |
2307 ##### CPU 3 buffer started ####
2308 yum-updatesd-3111 [003] 1701.957688: free_poll_entry <-poll_freewait
2309 yum-updatesd-3111 [003] 1701.957689: remove_wait_queue <-free_poll_entry
2310 yum-updatesd-3111 [003] 1701.957691: fput <-free_poll_entry
2311 yum-updatesd-3111 [003] 1701.957692: audit_syscall_exit <-sysret_audit
2312 yum-updatesd-3111 [003] 1701.957693: path_put <-audit_syscall_exit
2313
2314If you want to trace a function when executing, you could use
2315something like this simple program.
2316::
2317
2318 #include <stdio.h>
2319 #include <stdlib.h>
2320 #include <sys/types.h>
2321 #include <sys/stat.h>
2322 #include <fcntl.h>
2323 #include <unistd.h>
2324 #include <string.h>
2325
2326 #define _STR(x) #x
2327 #define STR(x) _STR(x)
2328 #define MAX_PATH 256
2329
2330 const char *find_tracefs(void)
2331 {
2332 static char tracefs[MAX_PATH+1];
2333 static int tracefs_found;
2334 char type[100];
2335 FILE *fp;
2336
2337 if (tracefs_found)
2338 return tracefs;
2339
2340 if ((fp = fopen("/proc/mounts","r")) == NULL) {
2341 perror("/proc/mounts");
2342 return NULL;
2343 }
2344
2345 while (fscanf(fp, "%*s %"
2346 STR(MAX_PATH)
2347 "s %99s %*s %*d %*d\n",
2348 tracefs, type) == 2) {
2349 if (strcmp(type, "tracefs") == 0)
2350 break;
2351 }
2352 fclose(fp);
2353
2354 if (strcmp(type, "tracefs") != 0) {
2355 fprintf(stderr, "tracefs not mounted");
2356 return NULL;
2357 }
2358
2359 strcat(tracefs, "/tracing/");
2360 tracefs_found = 1;
2361
2362 return tracefs;
2363 }
2364
2365 const char *tracing_file(const char *file_name)
2366 {
2367 static char trace_file[MAX_PATH+1];
2368 snprintf(trace_file, MAX_PATH, "%s/%s", find_tracefs(), file_name);
2369 return trace_file;
2370 }
2371
2372 int main (int argc, char **argv)
2373 {
2374 if (argc < 1)
2375 exit(-1);
2376
2377 if (fork() > 0) {
2378 int fd, ffd;
2379 char line[64];
2380 int s;
2381
2382 ffd = open(tracing_file("current_tracer"), O_WRONLY);
2383 if (ffd < 0)
2384 exit(-1);
2385 write(ffd, "nop", 3);
2386
2387 fd = open(tracing_file("set_ftrace_pid"), O_WRONLY);
2388 s = sprintf(line, "%d\n", getpid());
2389 write(fd, line, s);
2390
2391 write(ffd, "function", 8);
2392
2393 close(fd);
2394 close(ffd);
2395
2396 execvp(argv[1], argv+1);
2397 }
2398
2399 return 0;
2400 }
2401
2402Or this simple script!
2403::
2404
2405 #!/bin/bash
2406
2407 tracefs=`sed -ne 's/^tracefs \(.*\) tracefs.*/\1/p' /proc/mounts`
2408 echo nop > $tracefs/tracing/current_tracer
2409 echo 0 > $tracefs/tracing/tracing_on
2410 echo $$ > $tracefs/tracing/set_ftrace_pid
2411 echo function > $tracefs/tracing/current_tracer
2412 echo 1 > $tracefs/tracing/tracing_on
2413 exec "$@"
2414
2415
2416function graph tracer
2417---------------------------
2418
2419This tracer is similar to the function tracer except that it
2420probes a function on its entry and its exit. This is done by
2421using a dynamically allocated stack of return addresses in each
2422task_struct. On function entry the tracer overwrites the return
2423address of each function traced to set a custom probe. Thus the
2424original return address is stored on the stack of return address
2425in the task_struct.
2426
2427Probing on both ends of a function leads to special features
2428such as:
2429
2430- measure of a function's time execution
2431- having a reliable call stack to draw function calls graph
2432
2433This tracer is useful in several situations:
2434
2435- you want to find the reason of a strange kernel behavior and
2436 need to see what happens in detail on any areas (or specific
2437 ones).
2438
2439- you are experiencing weird latencies but it's difficult to
2440 find its origin.
2441
2442- you want to find quickly which path is taken by a specific
2443 function
2444
2445- you just want to peek inside a working kernel and want to see
2446 what happens there.
2447
2448::
2449
2450 # tracer: function_graph
2451 #
2452 # CPU DURATION FUNCTION CALLS
2453 # | | | | | | |
2454
2455 0) | sys_open() {
2456 0) | do_sys_open() {
2457 0) | getname() {
2458 0) | kmem_cache_alloc() {
2459 0) 1.382 us | __might_sleep();
2460 0) 2.478 us | }
2461 0) | strncpy_from_user() {
2462 0) | might_fault() {
2463 0) 1.389 us | __might_sleep();
2464 0) 2.553 us | }
2465 0) 3.807 us | }
2466 0) 7.876 us | }
2467 0) | alloc_fd() {
2468 0) 0.668 us | _spin_lock();
2469 0) 0.570 us | expand_files();
2470 0) 0.586 us | _spin_unlock();
2471
2472
2473There are several columns that can be dynamically
2474enabled/disabled. You can use every combination of options you
2475want, depending on your needs.
2476
2477- The cpu number on which the function executed is default
2478 enabled. It is sometimes better to only trace one cpu (see
2479 tracing_cpu_mask file) or you might sometimes see unordered
2480 function calls while cpu tracing switch.
2481
2482 - hide: echo nofuncgraph-cpu > trace_options
2483 - show: echo funcgraph-cpu > trace_options
2484
2485- The duration (function's time of execution) is displayed on
2486 the closing bracket line of a function or on the same line
2487 than the current function in case of a leaf one. It is default
2488 enabled.
2489
2490 - hide: echo nofuncgraph-duration > trace_options
2491 - show: echo funcgraph-duration > trace_options
2492
2493- The overhead field precedes the duration field in case of
2494 reached duration thresholds.
2495
2496 - hide: echo nofuncgraph-overhead > trace_options
2497 - show: echo funcgraph-overhead > trace_options
2498 - depends on: funcgraph-duration
2499
2500 ie::
2501
2502 3) # 1837.709 us | } /* __switch_to */
2503 3) | finish_task_switch() {
2504 3) 0.313 us | _raw_spin_unlock_irq();
2505 3) 3.177 us | }
2506 3) # 1889.063 us | } /* __schedule */
2507 3) ! 140.417 us | } /* __schedule */
2508 3) # 2034.948 us | } /* schedule */
2509 3) * 33998.59 us | } /* schedule_preempt_disabled */
2510
2511 [...]
2512
2513 1) 0.260 us | msecs_to_jiffies();
2514 1) 0.313 us | __rcu_read_unlock();
2515 1) + 61.770 us | }
2516 1) + 64.479 us | }
2517 1) 0.313 us | rcu_bh_qs();
2518 1) 0.313 us | __local_bh_enable();
2519 1) ! 217.240 us | }
2520 1) 0.365 us | idle_cpu();
2521 1) | rcu_irq_exit() {
2522 1) 0.417 us | rcu_eqs_enter_common.isra.47();
2523 1) 3.125 us | }
2524 1) ! 227.812 us | }
2525 1) ! 457.395 us | }
2526 1) @ 119760.2 us | }
2527
2528 [...]
2529
2530 2) | handle_IPI() {
2531 1) 6.979 us | }
2532 2) 0.417 us | scheduler_ipi();
2533 1) 9.791 us | }
2534 1) + 12.917 us | }
2535 2) 3.490 us | }
2536 1) + 15.729 us | }
2537 1) + 18.542 us | }
2538 2) $ 3594274 us | }
2539
2540Flags::
2541
2542 + means that the function exceeded 10 usecs.
2543 ! means that the function exceeded 100 usecs.
2544 # means that the function exceeded 1000 usecs.
2545 * means that the function exceeded 10 msecs.
2546 @ means that the function exceeded 100 msecs.
2547 $ means that the function exceeded 1 sec.
2548
2549
2550- The task/pid field displays the thread cmdline and pid which
2551 executed the function. It is default disabled.
2552
2553 - hide: echo nofuncgraph-proc > trace_options
2554 - show: echo funcgraph-proc > trace_options
2555
2556 ie::
2557
2558 # tracer: function_graph
2559 #
2560 # CPU TASK/PID DURATION FUNCTION CALLS
2561 # | | | | | | | | |
2562 0) sh-4802 | | d_free() {
2563 0) sh-4802 | | call_rcu() {
2564 0) sh-4802 | | __call_rcu() {
2565 0) sh-4802 | 0.616 us | rcu_process_gp_end();
2566 0) sh-4802 | 0.586 us | check_for_new_grace_period();
2567 0) sh-4802 | 2.899 us | }
2568 0) sh-4802 | 4.040 us | }
2569 0) sh-4802 | 5.151 us | }
2570 0) sh-4802 | + 49.370 us | }
2571
2572
2573- The absolute time field is an absolute timestamp given by the
2574 system clock since it started. A snapshot of this time is
2575 given on each entry/exit of functions
2576
2577 - hide: echo nofuncgraph-abstime > trace_options
2578 - show: echo funcgraph-abstime > trace_options
2579
2580 ie::
2581
2582 #
2583 # TIME CPU DURATION FUNCTION CALLS
2584 # | | | | | | | |
2585 360.774522 | 1) 0.541 us | }
2586 360.774522 | 1) 4.663 us | }
2587 360.774523 | 1) 0.541 us | __wake_up_bit();
2588 360.774524 | 1) 6.796 us | }
2589 360.774524 | 1) 7.952 us | }
2590 360.774525 | 1) 9.063 us | }
2591 360.774525 | 1) 0.615 us | journal_mark_dirty();
2592 360.774527 | 1) 0.578 us | __brelse();
2593 360.774528 | 1) | reiserfs_prepare_for_journal() {
2594 360.774528 | 1) | unlock_buffer() {
2595 360.774529 | 1) | wake_up_bit() {
2596 360.774529 | 1) | bit_waitqueue() {
2597 360.774530 | 1) 0.594 us | __phys_addr();
2598
2599
2600The function name is always displayed after the closing bracket
2601for a function if the start of that function is not in the
2602trace buffer.
2603
2604Display of the function name after the closing bracket may be
2605enabled for functions whose start is in the trace buffer,
2606allowing easier searching with grep for function durations.
2607It is default disabled.
2608
2609 - hide: echo nofuncgraph-tail > trace_options
2610 - show: echo funcgraph-tail > trace_options
2611
2612 Example with nofuncgraph-tail (default)::
2613
2614 0) | putname() {
2615 0) | kmem_cache_free() {
2616 0) 0.518 us | __phys_addr();
2617 0) 1.757 us | }
2618 0) 2.861 us | }
2619
2620 Example with funcgraph-tail::
2621
2622 0) | putname() {
2623 0) | kmem_cache_free() {
2624 0) 0.518 us | __phys_addr();
2625 0) 1.757 us | } /* kmem_cache_free() */
2626 0) 2.861 us | } /* putname() */
2627
2628You can put some comments on specific functions by using
2629trace_printk() For example, if you want to put a comment inside
2630the __might_sleep() function, you just have to include
2631<linux/ftrace.h> and call trace_printk() inside __might_sleep()::
2632
2633 trace_printk("I'm a comment!\n")
2634
2635will produce::
2636
2637 1) | __might_sleep() {
2638 1) | /* I'm a comment! */
2639 1) 1.449 us | }
2640
2641
2642You might find other useful features for this tracer in the
2643following "dynamic ftrace" section such as tracing only specific
2644functions or tasks.
2645
2646dynamic ftrace
2647--------------
2648
2649If CONFIG_DYNAMIC_FTRACE is set, the system will run with
2650virtually no overhead when function tracing is disabled. The way
2651this works is the mcount function call (placed at the start of
2652every kernel function, produced by the -pg switch in gcc),
2653starts of pointing to a simple return. (Enabling FTRACE will
2654include the -pg switch in the compiling of the kernel.)
2655
2656At compile time every C file object is run through the
2657recordmcount program (located in the scripts directory). This
2658program will parse the ELF headers in the C object to find all
2659the locations in the .text section that call mcount. Starting
2a1e03ca 2660with gcc version 4.6, the -mfentry has been added for x86, which
1f198e22
CD
2661calls "__fentry__" instead of "mcount". Which is called before
2662the creation of the stack frame.
2663
2664Note, not all sections are traced. They may be prevented by either
2665a notrace, or blocked another way and all inline functions are not
2666traced. Check the "available_filter_functions" file to see what functions
2667can be traced.
2668
2669A section called "__mcount_loc" is created that holds
2670references to all the mcount/fentry call sites in the .text section.
2671The recordmcount program re-links this section back into the
2672original object. The final linking stage of the kernel will add all these
2673references into a single table.
2674
2675On boot up, before SMP is initialized, the dynamic ftrace code
2676scans this table and updates all the locations into nops. It
2677also records the locations, which are added to the
2678available_filter_functions list. Modules are processed as they
2679are loaded and before they are executed. When a module is
2680unloaded, it also removes its functions from the ftrace function
2681list. This is automatic in the module unload code, and the
2682module author does not need to worry about it.
2683
2684When tracing is enabled, the process of modifying the function
2685tracepoints is dependent on architecture. The old method is to use
2686kstop_machine to prevent races with the CPUs executing code being
2687modified (which can cause the CPU to do undesirable things, especially
2688if the modified code crosses cache (or page) boundaries), and the nops are
2689patched back to calls. But this time, they do not call mcount
2690(which is just a function stub). They now call into the ftrace
2691infrastructure.
2692
2693The new method of modifying the function tracepoints is to place
2694a breakpoint at the location to be modified, sync all CPUs, modify
2695the rest of the instruction not covered by the breakpoint. Sync
2696all CPUs again, and then remove the breakpoint with the finished
2697version to the ftrace call site.
2698
2699Some archs do not even need to monkey around with the synchronization,
2700and can just slap the new code on top of the old without any
2701problems with other CPUs executing it at the same time.
2702
2703One special side-effect to the recording of the functions being
2704traced is that we can now selectively choose which functions we
2705wish to trace and which ones we want the mcount calls to remain
2706as nops.
2707
2708Two files are used, one for enabling and one for disabling the
2709tracing of specified functions. They are:
2710
2711 set_ftrace_filter
2712
2713and
2714
2715 set_ftrace_notrace
2716
2717A list of available functions that you can add to these files is
2718listed in:
2719
2720 available_filter_functions
2721
2722::
2723
2724 # cat available_filter_functions
2725 put_prev_task_idle
2726 kmem_cache_create
2727 pick_next_task_rt
2728 get_online_cpus
2729 pick_next_task_fair
2730 mutex_lock
2731 [...]
2732
2733If I am only interested in sys_nanosleep and hrtimer_interrupt::
2734
2735 # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter
2736 # echo function > current_tracer
2737 # echo 1 > tracing_on
2738 # usleep 1
2739 # echo 0 > tracing_on
2740 # cat trace
2741 # tracer: function
2742 #
2743 # entries-in-buffer/entries-written: 5/5 #P:4
2744 #
2745 # _-----=> irqs-off
2746 # / _----=> need-resched
2747 # | / _---=> hardirq/softirq
2748 # || / _--=> preempt-depth
2749 # ||| / delay
2750 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
2751 # | | | |||| | |
2752 usleep-2665 [001] .... 4186.475355: sys_nanosleep <-system_call_fastpath
2753 <idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
2754 usleep-2665 [001] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2755 <idle>-0 [003] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
2756 <idle>-0 [002] d.h1 4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt
2757
2758To see which functions are being traced, you can cat the file:
2759::
2760
2761 # cat set_ftrace_filter
2762 hrtimer_interrupt
2763 sys_nanosleep
2764
2765
2766Perhaps this is not enough. The filters also allow glob(7) matching.
2767
6234c7bd 2768 ``<match>*``
1f198e22 2769 will match functions that begin with <match>
6234c7bd 2770 ``*<match>``
1f198e22 2771 will match functions that end with <match>
6234c7bd 2772 ``*<match>*``
1f198e22 2773 will match functions that have <match> in it
6234c7bd 2774 ``<match1>*<match2>``
1f198e22
CD
2775 will match functions that begin with <match1> and end with <match2>
2776
2777.. note::
2778 It is better to use quotes to enclose the wild cards,
2779 otherwise the shell may expand the parameters into names
2780 of files in the local directory.
2781
2782::
2783
2784 # echo 'hrtimer_*' > set_ftrace_filter
2785
2786Produces::
2787
2788 # tracer: function
2789 #
2790 # entries-in-buffer/entries-written: 897/897 #P:4
2791 #
2792 # _-----=> irqs-off
2793 # / _----=> need-resched
2794 # | / _---=> hardirq/softirq
2795 # || / _--=> preempt-depth
2796 # ||| / delay
2797 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
2798 # | | | |||| | |
2799 <idle>-0 [003] dN.1 4228.547803: hrtimer_cancel <-tick_nohz_idle_exit
2800 <idle>-0 [003] dN.1 4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel
2801 <idle>-0 [003] dN.2 4228.547805: hrtimer_force_reprogram <-__remove_hrtimer
2802 <idle>-0 [003] dN.1 4228.547805: hrtimer_forward <-tick_nohz_idle_exit
2803 <idle>-0 [003] dN.1 4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11
2804 <idle>-0 [003] d..1 4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt
2805 <idle>-0 [003] d..1 4228.547859: hrtimer_start <-__tick_nohz_idle_enter
2806 <idle>-0 [003] d..2 4228.547860: hrtimer_force_reprogram <-__rem
2807
2808Notice that we lost the sys_nanosleep.
2809::
2810
2811 # cat set_ftrace_filter
2812 hrtimer_run_queues
2813 hrtimer_run_pending
2814 hrtimer_init
2815 hrtimer_cancel
2816 hrtimer_try_to_cancel
2817 hrtimer_forward
2818 hrtimer_start
2819 hrtimer_reprogram
2820 hrtimer_force_reprogram
2821 hrtimer_get_next_event
2822 hrtimer_interrupt
2823 hrtimer_nanosleep
2824 hrtimer_wakeup
2825 hrtimer_get_remaining
2826 hrtimer_get_res
2827 hrtimer_init_sleeper
2828
2829
2830This is because the '>' and '>>' act just like they do in bash.
2831To rewrite the filters, use '>'
2832To append to the filters, use '>>'
2833
2834To clear out a filter so that all functions will be recorded
2835again::
2836
2837 # echo > set_ftrace_filter
2838 # cat set_ftrace_filter
2839 #
2840
2841Again, now we want to append.
2842
2843::
2844
2845 # echo sys_nanosleep > set_ftrace_filter
2846 # cat set_ftrace_filter
2847 sys_nanosleep
2848 # echo 'hrtimer_*' >> set_ftrace_filter
2849 # cat set_ftrace_filter
2850 hrtimer_run_queues
2851 hrtimer_run_pending
2852 hrtimer_init
2853 hrtimer_cancel
2854 hrtimer_try_to_cancel
2855 hrtimer_forward
2856 hrtimer_start
2857 hrtimer_reprogram
2858 hrtimer_force_reprogram
2859 hrtimer_get_next_event
2860 hrtimer_interrupt
2861 sys_nanosleep
2862 hrtimer_nanosleep
2863 hrtimer_wakeup
2864 hrtimer_get_remaining
2865 hrtimer_get_res
2866 hrtimer_init_sleeper
2867
2868
2869The set_ftrace_notrace prevents those functions from being
2870traced.
2871::
2872
2873 # echo '*preempt*' '*lock*' > set_ftrace_notrace
2874
2875Produces::
2876
2877 # tracer: function
2878 #
2879 # entries-in-buffer/entries-written: 39608/39608 #P:4
2880 #
2881 # _-----=> irqs-off
2882 # / _----=> need-resched
2883 # | / _---=> hardirq/softirq
2884 # || / _--=> preempt-depth
2885 # ||| / delay
2886 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
2887 # | | | |||| | |
2888 bash-1994 [000] .... 4342.324896: file_ra_state_init <-do_dentry_open
2889 bash-1994 [000] .... 4342.324897: open_check_o_direct <-do_last
2890 bash-1994 [000] .... 4342.324897: ima_file_check <-do_last
2891 bash-1994 [000] .... 4342.324898: process_measurement <-ima_file_check
2892 bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement
2893 bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action
2894 bash-1994 [000] .... 4342.324899: do_truncate <-do_last
2895 bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate
2896 bash-1994 [000] .... 4342.324899: notify_change <-do_truncate
2897 bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change
2898 bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time
2899 bash-1994 [000] .... 4342.324900: timespec_trunc <-current_fs_time
2900
2901We can see that there's no more lock or preempt tracing.
2902
f79b3f33
SRV
2903Selecting function filters via index
2904------------------------------------
2905
2906Because processing of strings is expensive (the address of the function
2907needs to be looked up before comparing to the string being passed in),
2908an index can be used as well to enable functions. This is useful in the
2909case of setting thousands of specific functions at a time. By passing
2910in a list of numbers, no string processing will occur. Instead, the function
2911at the specific location in the internal array (which corresponds to the
2912functions in the "available_filter_functions" file), is selected.
2913
2914::
2915
2916 # echo 1 > set_ftrace_filter
2917
2918Will select the first function listed in "available_filter_functions"
2919
2920::
2921
2922 # head -1 available_filter_functions
2923 trace_initcall_finish_cb
2924
2925 # cat set_ftrace_filter
2926 trace_initcall_finish_cb
2927
2928 # head -50 available_filter_functions | tail -1
2929 x86_pmu_commit_txn
2930
2931 # echo 1 50 > set_ftrace_filter
2932 # cat set_ftrace_filter
2933 trace_initcall_finish_cb
2934 x86_pmu_commit_txn
1f198e22
CD
2935
2936Dynamic ftrace with the function graph tracer
2937---------------------------------------------
2938
2939Although what has been explained above concerns both the
2940function tracer and the function-graph-tracer, there are some
2941special features only available in the function-graph tracer.
2942
2943If you want to trace only one function and all of its children,
2944you just have to echo its name into set_graph_function::
2945
2946 echo __do_fault > set_graph_function
2947
2948will produce the following "expanded" trace of the __do_fault()
2949function::
2950
2951 0) | __do_fault() {
2952 0) | filemap_fault() {
2953 0) | find_lock_page() {
2954 0) 0.804 us | find_get_page();
2955 0) | __might_sleep() {
2956 0) 1.329 us | }
2957 0) 3.904 us | }
2958 0) 4.979 us | }
2959 0) 0.653 us | _spin_lock();
2960 0) 0.578 us | page_add_file_rmap();
2961 0) 0.525 us | native_set_pte_at();
2962 0) 0.585 us | _spin_unlock();
2963 0) | unlock_page() {
2964 0) 0.541 us | page_waitqueue();
2965 0) 0.639 us | __wake_up_bit();
2966 0) 2.786 us | }
2967 0) + 14.237 us | }
2968 0) | __do_fault() {
2969 0) | filemap_fault() {
2970 0) | find_lock_page() {
2971 0) 0.698 us | find_get_page();
2972 0) | __might_sleep() {
2973 0) 1.412 us | }
2974 0) 3.950 us | }
2975 0) 5.098 us | }
2976 0) 0.631 us | _spin_lock();
2977 0) 0.571 us | page_add_file_rmap();
2978 0) 0.526 us | native_set_pte_at();
2979 0) 0.586 us | _spin_unlock();
2980 0) | unlock_page() {
2981 0) 0.533 us | page_waitqueue();
2982 0) 0.638 us | __wake_up_bit();
2983 0) 2.793 us | }
2984 0) + 14.012 us | }
2985
2986You can also expand several functions at once::
2987
2988 echo sys_open > set_graph_function
2989 echo sys_close >> set_graph_function
2990
2991Now if you want to go back to trace all functions you can clear
2992this special filter via::
2993
2994 echo > set_graph_function
2995
2996
2997ftrace_enabled
2998--------------
2999
3000Note, the proc sysctl ftrace_enable is a big on/off switch for the
3001function tracer. By default it is enabled (when function tracing is
3002enabled in the kernel). If it is disabled, all function tracing is
3003disabled. This includes not only the function tracers for ftrace, but
7162431d
MB
3004also for any other uses (perf, kprobes, stack tracing, profiling, etc). It
3005cannot be disabled if there is a callback with FTRACE_OPS_FL_PERMANENT set
3006registered.
1f198e22
CD
3007
3008Please disable this with care.
3009
3010This can be disable (and enabled) with::
3011
3012 sysctl kernel.ftrace_enabled=0
3013 sysctl kernel.ftrace_enabled=1
3014
3015 or
3016
3017 echo 0 > /proc/sys/kernel/ftrace_enabled
3018 echo 1 > /proc/sys/kernel/ftrace_enabled
3019
3020
3021Filter commands
3022---------------
3023
3024A few commands are supported by the set_ftrace_filter interface.
3025Trace commands have the following format::
3026
3027 <function>:<command>:<parameter>
3028
3029The following commands are supported:
3030
3031- mod:
3032 This command enables function filtering per module. The
3033 parameter defines the module. For example, if only the write*
3034 functions in the ext3 module are desired, run:
3035
3036 echo 'write*:mod:ext3' > set_ftrace_filter
3037
3038 This command interacts with the filter in the same way as
3039 filtering based on function names. Thus, adding more functions
3040 in a different module is accomplished by appending (>>) to the
3041 filter file. Remove specific module functions by prepending
3042 '!'::
3043
3044 echo '!writeback*:mod:ext3' >> set_ftrace_filter
3045
3046 Mod command supports module globbing. Disable tracing for all
3047 functions except a specific module::
3048
3049 echo '!*:mod:!ext3' >> set_ftrace_filter
3050
3051 Disable tracing for all modules, but still trace kernel::
3052
3053 echo '!*:mod:*' >> set_ftrace_filter
3054
3055 Enable filter only for kernel::
3056
3057 echo '*write*:mod:!*' >> set_ftrace_filter
3058
3059 Enable filter for module globbing::
3060
3061 echo '*write*:mod:*snd*' >> set_ftrace_filter
3062
3063- traceon/traceoff:
3064 These commands turn tracing on and off when the specified
3065 functions are hit. The parameter determines how many times the
3066 tracing system is turned on and off. If unspecified, there is
3067 no limit. For example, to disable tracing when a schedule bug
3068 is hit the first 5 times, run::
3069
3070 echo '__schedule_bug:traceoff:5' > set_ftrace_filter
3071
3072 To always disable tracing when __schedule_bug is hit::
3073
3074 echo '__schedule_bug:traceoff' > set_ftrace_filter
3075
3076 These commands are cumulative whether or not they are appended
3077 to set_ftrace_filter. To remove a command, prepend it by '!'
3078 and drop the parameter::
3079
3080 echo '!__schedule_bug:traceoff:0' > set_ftrace_filter
3081
3082 The above removes the traceoff command for __schedule_bug
3083 that have a counter. To remove commands without counters::
3084
3085 echo '!__schedule_bug:traceoff' > set_ftrace_filter
3086
3087- snapshot:
3088 Will cause a snapshot to be triggered when the function is hit.
3089 ::
3090
3091 echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter
3092
3093 To only snapshot once:
3094 ::
3095
3096 echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter
3097
3098 To remove the above commands::
3099
3100 echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter
3101 echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter
3102
3103- enable_event/disable_event:
3104 These commands can enable or disable a trace event. Note, because
3105 function tracing callbacks are very sensitive, when these commands
3106 are registered, the trace point is activated, but disabled in
3107 a "soft" mode. That is, the tracepoint will be called, but
3108 just will not be traced. The event tracepoint stays in this mode
3109 as long as there's a command that triggers it.
3110 ::
3111
3112 echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \
3113 set_ftrace_filter
3114
3115 The format is::
3116
3117 <function>:enable_event:<system>:<event>[:count]
3118 <function>:disable_event:<system>:<event>[:count]
3119
3120 To remove the events commands::
3121
3122 echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \
3123 set_ftrace_filter
3124 echo '!schedule:disable_event:sched:sched_switch' > \
3125 set_ftrace_filter
3126
3127- dump:
3128 When the function is hit, it will dump the contents of the ftrace
3129 ring buffer to the console. This is useful if you need to debug
3130 something, and want to dump the trace when a certain function
2a1e03ca 3131 is hit. Perhaps it's a function that is called before a triple
1f198e22
CD
3132 fault happens and does not allow you to get a regular dump.
3133
3134- cpudump:
3135 When the function is hit, it will dump the contents of the ftrace
3136 ring buffer for the current CPU to the console. Unlike the "dump"
3137 command, it only prints out the contents of the ring buffer for the
3138 CPU that executed the function that triggered the dump.
3139
8a2933cf
MH
3140- stacktrace:
3141 When the function is hit, a stack trace is recorded.
3142
1f198e22
CD
3143trace_pipe
3144----------
3145
3146The trace_pipe outputs the same content as the trace file, but
3147the effect on the tracing is different. Every read from
3148trace_pipe is consumed. This means that subsequent reads will be
3149different. The trace is live.
3150::
3151
3152 # echo function > current_tracer
3153 # cat trace_pipe > /tmp/trace.out &
3154 [1] 4153
3155 # echo 1 > tracing_on
3156 # usleep 1
3157 # echo 0 > tracing_on
3158 # cat trace
3159 # tracer: function
3160 #
3161 # entries-in-buffer/entries-written: 0/0 #P:4
3162 #
3163 # _-----=> irqs-off
3164 # / _----=> need-resched
3165 # | / _---=> hardirq/softirq
3166 # || / _--=> preempt-depth
3167 # ||| / delay
3168 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
3169 # | | | |||| | |
3170
3171 #
3172 # cat /tmp/trace.out
3173 bash-1994 [000] .... 5281.568961: mutex_unlock <-rb_simple_write
3174 bash-1994 [000] .... 5281.568963: __mutex_unlock_slowpath <-mutex_unlock
3175 bash-1994 [000] .... 5281.568963: __fsnotify_parent <-fsnotify_modify
3176 bash-1994 [000] .... 5281.568964: fsnotify <-fsnotify_modify
3177 bash-1994 [000] .... 5281.568964: __srcu_read_lock <-fsnotify
3178 bash-1994 [000] .... 5281.568964: add_preempt_count <-__srcu_read_lock
3179 bash-1994 [000] ...1 5281.568965: sub_preempt_count <-__srcu_read_lock
3180 bash-1994 [000] .... 5281.568965: __srcu_read_unlock <-fsnotify
3181 bash-1994 [000] .... 5281.568967: sys_dup2 <-system_call_fastpath
3182
3183
3184Note, reading the trace_pipe file will block until more input is
f12fcca6
PW
3185added. This is contrary to the trace file. If any process opened
3186the trace file for reading, it will actually disable tracing and
3187prevent new entries from being added. The trace_pipe file does
3188not have this limitation.
1f198e22
CD
3189
3190trace entries
3191-------------
3192
3193Having too much or not enough data can be troublesome in
3194diagnosing an issue in the kernel. The file buffer_size_kb is
3195used to modify the size of the internal trace buffers. The
3196number listed is the number of entries that can be recorded per
3197CPU. To know the full size, multiply the number of possible CPUs
3198with the number of entries.
3199::
3200
3201 # cat buffer_size_kb
3202 1408 (units kilobytes)
3203
3204Or simply read buffer_total_size_kb
3205::
3206
3207 # cat buffer_total_size_kb
3208 5632
3209
3210To modify the buffer, simple echo in a number (in 1024 byte segments).
3211::
3212
3213 # echo 10000 > buffer_size_kb
3214 # cat buffer_size_kb
3215 10000 (units kilobytes)
3216
3217It will try to allocate as much as possible. If you allocate too
3218much, it can cause Out-Of-Memory to trigger.
3219::
3220
3221 # echo 1000000000000 > buffer_size_kb
3222 -bash: echo: write error: Cannot allocate memory
3223 # cat buffer_size_kb
3224 85
3225
3226The per_cpu buffers can be changed individually as well:
3227::
3228
3229 # echo 10000 > per_cpu/cpu0/buffer_size_kb
3230 # echo 100 > per_cpu/cpu1/buffer_size_kb
3231
3232When the per_cpu buffers are not the same, the buffer_size_kb
3233at the top level will just show an X
3234::
3235
3236 # cat buffer_size_kb
3237 X
3238
3239This is where the buffer_total_size_kb is useful:
3240::
3241
3242 # cat buffer_total_size_kb
3243 12916
3244
3245Writing to the top level buffer_size_kb will reset all the buffers
3246to be the same again.
3247
3248Snapshot
3249--------
3250CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature
3251available to all non latency tracers. (Latency tracers which
3252record max latency, such as "irqsoff" or "wakeup", can't use
3253this feature, since those are already using the snapshot
3254mechanism internally.)
3255
3256Snapshot preserves a current trace buffer at a particular point
3257in time without stopping tracing. Ftrace swaps the current
3258buffer with a spare buffer, and tracing continues in the new
3259current (=previous spare) buffer.
3260
3261The following tracefs files in "tracing" are related to this
3262feature:
3263
3264 snapshot:
3265
3266 This is used to take a snapshot and to read the output
3267 of the snapshot. Echo 1 into this file to allocate a
3268 spare buffer and to take a snapshot (swap), then read
3269 the snapshot from this file in the same format as
3270 "trace" (described above in the section "The File
3271 System"). Both reads snapshot and tracing are executable
3272 in parallel. When the spare buffer is allocated, echoing
3273 0 frees it, and echoing else (positive) values clear the
3274 snapshot contents.
3275 More details are shown in the table below.
3276
3277 +--------------+------------+------------+------------+
3278 |status\\input | 0 | 1 | else |
3279 +==============+============+============+============+
3280 |not allocated |(do nothing)| alloc+swap |(do nothing)|
3281 +--------------+------------+------------+------------+
3282 |allocated | free | swap | clear |
3283 +--------------+------------+------------+------------+
3284
3285Here is an example of using the snapshot feature.
3286::
3287
3288 # echo 1 > events/sched/enable
3289 # echo 1 > snapshot
3290 # cat snapshot
3291 # tracer: nop
3292 #
3293 # entries-in-buffer/entries-written: 71/71 #P:8
3294 #
3295 # _-----=> irqs-off
3296 # / _----=> need-resched
3297 # | / _---=> hardirq/softirq
3298 # || / _--=> preempt-depth
3299 # ||| / delay
3300 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
3301 # | | | |||| | |
3302 <idle>-0 [005] d... 2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120
3303 sleep-2242 [005] d... 2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120 prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120
3304 [...]
3305 <idle>-0 [002] d... 2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120
3306
3307 # cat trace
3308 # tracer: nop
3309 #
3310 # entries-in-buffer/entries-written: 77/77 #P:8
3311 #
3312 # _-----=> irqs-off
3313 # / _----=> need-resched
3314 # | / _---=> hardirq/softirq
3315 # || / _--=> preempt-depth
3316 # ||| / delay
3317 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
3318 # | | | |||| | |
3319 <idle>-0 [007] d... 2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120
3320 snapshot-test-2-2229 [002] d... 2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120
3321 [...]
3322
3323
3324If you try to use this snapshot feature when current tracer is
3325one of the latency tracers, you will get the following results.
3326::
3327
3328 # echo wakeup > current_tracer
3329 # echo 1 > snapshot
3330 bash: echo: write error: Device or resource busy
3331 # cat snapshot
3332 cat: snapshot: Device or resource busy
3333
3334
3335Instances
3336---------
3337In the tracefs tracing directory is a directory called "instances".
3338This directory can have new directories created inside of it using
3339mkdir, and removing directories with rmdir. The directory created
3340with mkdir in this directory will already contain files and other
3341directories after it is created.
3342::
3343
3344 # mkdir instances/foo
3345 # ls instances/foo
3346 buffer_size_kb buffer_total_size_kb events free_buffer per_cpu
3347 set_event snapshot trace trace_clock trace_marker trace_options
3348 trace_pipe tracing_on
3349
3350As you can see, the new directory looks similar to the tracing directory
3351itself. In fact, it is very similar, except that the buffer and
5b8914a6 3352events are agnostic from the main directory, or from any other
1f198e22
CD
3353instances that are created.
3354
3355The files in the new directory work just like the files with the
3356same name in the tracing directory except the buffer that is used
3357is a separate and new buffer. The files affect that buffer but do not
3358affect the main buffer with the exception of trace_options. Currently,
3359the trace_options affect all instances and the top level buffer
3360the same, but this may change in future releases. That is, options
3361may become specific to the instance they reside in.
3362
3363Notice that none of the function tracer files are there, nor is
3364current_tracer and available_tracers. This is because the buffers
3365can currently only have events enabled for them.
3366::
3367
3368 # mkdir instances/foo
3369 # mkdir instances/bar
3370 # mkdir instances/zoot
3371 # echo 100000 > buffer_size_kb
3372 # echo 1000 > instances/foo/buffer_size_kb
3373 # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb
3374 # echo function > current_trace
3375 # echo 1 > instances/foo/events/sched/sched_wakeup/enable
3376 # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable
3377 # echo 1 > instances/foo/events/sched/sched_switch/enable
3378 # echo 1 > instances/bar/events/irq/enable
3379 # echo 1 > instances/zoot/events/syscalls/enable
3380 # cat trace_pipe
3381 CPU:2 [LOST 11745 EVENTS]
3382 bash-2044 [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist
3383 bash-2044 [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave
3384 bash-2044 [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist
3385 bash-2044 [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist
3386 bash-2044 [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock
3387 bash-2044 [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype
3388 bash-2044 [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist
3389 bash-2044 [002] d... 10594.481034: zone_statistics <-get_page_from_freelist
3390 bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3391 bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics
3392 bash-2044 [002] .... 10594.481035: arch_dup_task_struct <-copy_process
3393 [...]
3394
3395 # cat instances/foo/trace_pipe
3396 bash-1998 [000] d..4 136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3397 bash-1998 [000] dN.4 136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3398 <idle>-0 [003] d.h3 136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003
3399 <idle>-0 [003] d..3 136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120
3400 rcu_preempt-9 [003] d..3 136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
3401 bash-1998 [000] d..4 136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000
3402 bash-1998 [000] dN.4 136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000
3403 bash-1998 [000] d..3 136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120
3404 kworker/0:1-59 [000] d..4 136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001
3405 kworker/0:1-59 [000] d..3 136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120
3406 [...]
3407
3408 # cat instances/bar/trace_pipe
3409 migration/1-14 [001] d.h3 138.732674: softirq_raise: vec=3 [action=NET_RX]
3410 <idle>-0 [001] dNh3 138.732725: softirq_raise: vec=3 [action=NET_RX]
3411 bash-1998 [000] d.h1 138.733101: softirq_raise: vec=1 [action=TIMER]
3412 bash-1998 [000] d.h1 138.733102: softirq_raise: vec=9 [action=RCU]
3413 bash-1998 [000] ..s2 138.733105: softirq_entry: vec=1 [action=TIMER]
3414 bash-1998 [000] ..s2 138.733106: softirq_exit: vec=1 [action=TIMER]
3415 bash-1998 [000] ..s2 138.733106: softirq_entry: vec=9 [action=RCU]
3416 bash-1998 [000] ..s2 138.733109: softirq_exit: vec=9 [action=RCU]
3417 sshd-1995 [001] d.h1 138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4
3418 sshd-1995 [001] d.h1 138.733280: irq_handler_exit: irq=21 ret=unhandled
3419 sshd-1995 [001] d.h1 138.733281: irq_handler_entry: irq=21 name=eth0
3420 sshd-1995 [001] d.h1 138.733283: irq_handler_exit: irq=21 ret=handled
3421 [...]
3422
3423 # cat instances/zoot/trace
3424 # tracer: nop
3425 #
3426 # entries-in-buffer/entries-written: 18996/18996 #P:4
3427 #
3428 # _-----=> irqs-off
3429 # / _----=> need-resched
3430 # | / _---=> hardirq/softirq
3431 # || / _--=> preempt-depth
3432 # ||| / delay
3433 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
3434 # | | | |||| | |
3435 bash-1998 [000] d... 140.733501: sys_write -> 0x2
3436 bash-1998 [000] d... 140.733504: sys_dup2(oldfd: a, newfd: 1)
3437 bash-1998 [000] d... 140.733506: sys_dup2 -> 0x1
3438 bash-1998 [000] d... 140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0)
3439 bash-1998 [000] d... 140.733509: sys_fcntl -> 0x1
3440 bash-1998 [000] d... 140.733510: sys_close(fd: a)
3441 bash-1998 [000] d... 140.733510: sys_close -> 0x0
3442 bash-1998 [000] d... 140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8)
3443 bash-1998 [000] d... 140.733515: sys_rt_sigprocmask -> 0x0
3444 bash-1998 [000] d... 140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8)
3445 bash-1998 [000] d... 140.733516: sys_rt_sigaction -> 0x0
3446
3447You can see that the trace of the top most trace buffer shows only
3448the function tracing. The foo instance displays wakeups and task
3449switches.
3450
3451To remove the instances, simply delete their directories:
3452::
3453
3454 # rmdir instances/foo
3455 # rmdir instances/bar
3456 # rmdir instances/zoot
3457
3458Note, if a process has a trace file open in one of the instance
3459directories, the rmdir will fail with EBUSY.
3460
3461
3462Stack trace
3463-----------
3464Since the kernel has a fixed sized stack, it is important not to
3465waste it in functions. A kernel developer must be conscience of
3466what they allocate on the stack. If they add too much, the system
3467can be in danger of a stack overflow, and corruption will occur,
3468usually leading to a system panic.
3469
3470There are some tools that check this, usually with interrupts
3471periodically checking usage. But if you can perform a check
3472at every function call that will become very useful. As ftrace provides
3473a function tracer, it makes it convenient to check the stack size
3474at every function call. This is enabled via the stack tracer.
3475
3476CONFIG_STACK_TRACER enables the ftrace stack tracing functionality.
3477To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled.
3478::
3479
3480 # echo 1 > /proc/sys/kernel/stack_tracer_enabled
3481
3482You can also enable it from the kernel command line to trace
3483the stack size of the kernel during boot up, by adding "stacktrace"
3484to the kernel command line parameter.
3485
3486After running it for a few minutes, the output looks like:
3487::
3488
3489 # cat stack_max_size
3490 2928
3491
3492 # cat stack_trace
3493 Depth Size Location (18 entries)
3494 ----- ---- --------
3495 0) 2928 224 update_sd_lb_stats+0xbc/0x4ac
3496 1) 2704 160 find_busiest_group+0x31/0x1f1
3497 2) 2544 256 load_balance+0xd9/0x662
3498 3) 2288 80 idle_balance+0xbb/0x130
3499 4) 2208 128 __schedule+0x26e/0x5b9
3500 5) 2080 16 schedule+0x64/0x66
3501 6) 2064 128 schedule_timeout+0x34/0xe0
3502 7) 1936 112 wait_for_common+0x97/0xf1
3503 8) 1824 16 wait_for_completion+0x1d/0x1f
3504 9) 1808 128 flush_work+0xfe/0x119
3505 10) 1680 16 tty_flush_to_ldisc+0x1e/0x20
3506 11) 1664 48 input_available_p+0x1d/0x5c
3507 12) 1616 48 n_tty_poll+0x6d/0x134
3508 13) 1568 64 tty_poll+0x64/0x7f
3509 14) 1504 880 do_select+0x31e/0x511
3510 15) 624 400 core_sys_select+0x177/0x216
3511 16) 224 96 sys_select+0x91/0xb9
3512 17) 128 128 system_call_fastpath+0x16/0x1b
3513
3514Note, if -mfentry is being used by gcc, functions get traced before
3515they set up the stack frame. This means that leaf level functions
3516are not tested by the stack tracer when -mfentry is used.
3517
3518Currently, -mfentry is used by gcc 4.6.0 and above on x86 only.
3519
3520More
3521----
3522More details can be found in the source code, in the `kernel/trace/*.c` files.