=============================
Per-task statistics interface
=============================


Taskstats is a netlink-based interface for sending per-task and
per-process statistics from the kernel to userspace.

Taskstats was designed to provide the following benefits:

- efficient delivery of statistics during the lifetime of a task and on its exit
- a unified interface for multiple accounting subsystems
- extensibility for use by future accounting patches

Terminology
-----------

"pid", "tid" and "task" are used interchangeably and refer to the standard
Linux task defined by struct task_struct. Per-pid stats are the same as
per-task stats.

"tgid", "process" and "thread group" are used interchangeably and refer to the
tasks that share an mm_struct, i.e. the traditional Unix process. Despite the
use of tgid, there is no special treatment for the task that is the thread group
leader: a process is deemed alive as long as it has any task belonging to it.

Usage
-----

To get statistics during a task's lifetime, userspace opens a unicast netlink
socket (NETLINK_GENERIC family) and sends commands specifying a pid or a tgid.
The response contains statistics for a task (if pid is specified) or the sum of
statistics for all tasks of the process (if tgid is specified).

To obtain statistics for tasks which are exiting, the userspace listener
sends a register command and specifies a cpumask. Whenever a task exits on
one of the cpus in the cpumask, its per-pid statistics are sent to the
registered listener. Using cpumasks limits the data received by any one
listener and assists in flow control over the netlink interface; this is
explained in more detail below.

If the exiting task is the last thread exiting its thread group,
an additional record containing the per-tgid stats is also sent to userspace.
The latter contains the sum of per-pid stats for all threads in the thread
group, both past and present.

getdelays.c is a simple utility demonstrating usage of the taskstats interface
for reporting delay accounting statistics. Users can register cpumasks,
send commands and process responses, listen for per-tid/tgid exit data,
write the data received to a file and do basic flow control by increasing
receive buffer sizes.

Interface
---------

The user-kernel interface is encapsulated in include/linux/taskstats.h.

To avoid this documentation becoming obsolete as the interface evolves, only
an outline of the current version is given. taskstats.h always overrides the
description here.

struct taskstats is the common accounting structure for both per-pid and
per-tgid data. It is versioned and can be extended by each accounting subsystem
that is added to the kernel. The fields and their semantics are defined in the
taskstats.h file.

The data exchanged between user and kernel space is a netlink message belonging
to the NETLINK_GENERIC family and using the netlink attributes interface.
The messages are in the format::

    +----------+- - -+-------------+-------------------+
    | nlmsghdr | Pad |  genlmsghdr | taskstats payload |
    +----------+- - -+-------------+-------------------+


The taskstats payload is one of the following three kinds:

1. Commands: Sent from user to kernel. Commands to get data on
   a pid/tgid consist of one attribute, of type TASKSTATS_CMD_ATTR_PID/TGID,
   containing a u32 pid or tgid in the attribute payload. The pid/tgid denotes
   the task/process for which userspace wants statistics.

   Commands to register/deregister interest in exit data from a set of cpus
   consist of one attribute, of type
   TASKSTATS_CMD_ATTR_REGISTER/DEREGISTER_CPUMASK, containing a cpumask in the
   attribute payload. The cpumask is specified as an ASCII string of
   comma-separated cpu ranges, e.g. to listen to exit data from cpus 1,2,3,5,7,8
   the cpumask would be "1-3,5,7-8". If userspace forgets to deregister interest
   in cpus before closing the listening socket, the kernel cleans up its interest
   set over time. However, for the sake of efficiency, an explicit deregistration
   is advisable.

2. Response for a command: sent from the kernel in response to a userspace
   command. The payload is a series of three attributes of type:

   a) TASKSTATS_TYPE_AGGR_PID/TGID: attribute that carries no data of its own
      but indicates that a pid/tgid and its stats follow.

   b) TASKSTATS_TYPE_PID/TGID: attribute whose payload is the pid/tgid whose stats
      are being returned.

   c) TASKSTATS_TYPE_STATS: attribute with a struct taskstats as payload. The
      same structure is used for both per-pid and per-tgid stats.

3. New message sent by kernel whenever a task exits. The payload consists of a
   series of attributes of the following types:

   a) TASKSTATS_TYPE_AGGR_PID: indicates next two attributes will be pid+stats
   b) TASKSTATS_TYPE_PID: contains exiting task's pid
   c) TASKSTATS_TYPE_STATS: contains the exiting task's per-pid stats
   d) TASKSTATS_TYPE_AGGR_TGID: indicates next two attributes will be tgid+stats
   e) TASKSTATS_TYPE_TGID: contains tgid of process to which task belongs
   f) TASKSTATS_TYPE_STATS: contains the per-tgid stats for exiting task's process

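
As an illustration of the message layout described above, the following Python
sketch packs a pid/tgid query and walks the attributes of a reply. The numeric
constants are copied from the enums in taskstats.h and should be treated as
assumptions to check against that header. The family id is not a fixed value;
it has to be looked up through the generic netlink controller at run time, so
it is a plain parameter here. The parser assumes, as getdelays.c does, that the
pid/tgid and stats attributes arrive nested inside the aggregate attribute.
Helper names such as build_get_command are made up for this example.

```python
import struct

# Assumed values, copied from the enums in taskstats.h and netlink.h.
TASKSTATS_CMD_GET = 1
TASKSTATS_GENL_VERSION = 1
TASKSTATS_CMD_ATTR_PID = 1
TASKSTATS_CMD_ATTR_TGID = 2
TASKSTATS_TYPE_PID = 1
TASKSTATS_TYPE_TGID = 2
TASKSTATS_TYPE_STATS = 3
TASKSTATS_TYPE_AGGR_PID = 4
TASKSTATS_TYPE_AGGR_TGID = 5
NLM_F_REQUEST = 1
NLA_TYPE_MASK = 0x3FFF            # strips the nested/byte-order flag bits
NLA_HDR = struct.Struct("=HH")    # nla_len, nla_type


def align4(n):
    """Round n up to the 4-byte boundary used by netlink."""
    return (n + 3) & ~3


def pack_attr(attr_type, payload):
    """Encode one netlink attribute: header, payload, zero padding."""
    length = NLA_HDR.size + len(payload)
    padding = b"\0" * (align4(length) - length)
    return NLA_HDR.pack(length, attr_type) + payload + padding


def build_get_command(family_id, attr_type, task_id, seq=1, port_id=0):
    """Build nlmsghdr + genlmsghdr + one u32 attribute (a pid or a tgid)."""
    genl = struct.pack("=BBH", TASKSTATS_CMD_GET, TASKSTATS_GENL_VERSION, 0)
    body = genl + pack_attr(attr_type, struct.pack("=I", task_id))
    # nlmsghdr: total length, message type (family id), flags, seq, port id
    header = struct.pack("=IHHII", 16 + len(body), family_id,
                         NLM_F_REQUEST, seq, port_id)
    return header + body


def iter_attrs(buf):
    """Yield (type, payload) for each attribute found in buf."""
    offset = 0
    while offset + NLA_HDR.size <= len(buf):
        length, attr_type = NLA_HDR.unpack_from(buf, offset)
        if length < NLA_HDR.size or offset + length > len(buf):
            break  # malformed or truncated attribute
        yield attr_type & NLA_TYPE_MASK, buf[offset + NLA_HDR.size:offset + length]
        offset += align4(length)


def parse_payload(payload):
    """Return (kind, id, raw stats bytes) for each aggregate in the payload.

    payload is the part of the message that follows genlmsghdr.
    """
    records = []
    for outer_type, nested in iter_attrs(payload):
        if outer_type == TASKSTATS_TYPE_AGGR_PID:
            kind, id_type = "pid", TASKSTATS_TYPE_PID
        elif outer_type == TASKSTATS_TYPE_AGGR_TGID:
            kind, id_type = "tgid", TASKSTATS_TYPE_TGID
        else:
            continue  # attribute type this code does not understand
        task_id, stats = None, None
        for inner_type, data in iter_attrs(nested):
            if inner_type == id_type and len(data) >= 4:
                (task_id,) = struct.unpack_from("=I", data, 0)
            elif inner_type == TASKSTATS_TYPE_STATS:
                stats = data  # raw struct taskstats bytes, left undecoded
        records.append((kind, task_id, stats))
    return records
```

Attribute types the parser does not know are skipped rather than rejected, so
the same loop handles both command responses and exit messages, which may carry
a per-pid aggregate followed by a per-tgid one.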


per-tgid stats
--------------

Taskstats provides per-process stats, in addition to per-task stats, since
resource management is often done at a process granularity and aggregating task
stats in userspace alone is inefficient and potentially inaccurate (due to lack
of atomicity).

However, maintaining per-process stats in addition to per-task stats within the
kernel has space and time overheads. To address this, the taskstats code
accumulates each exiting task's statistics into a process-wide data structure.
When the last task of a process exits, the accumulated process-level data also
gets sent to userspace (along with the per-task data).

When a user queries per-tgid data, the stats of all live threads in the group
are summed and added to the accumulated total for previously exited
threads of the same thread group.

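
The accumulation scheme above can be modelled in a few lines of Python. This is
a toy model for illustration, not the kernel's implementation: the class
ThreadGroup, its method names and the cpu_ns counter are all hypothetical.

```python
from collections import Counter


class ThreadGroup:
    """Toy model of per-tgid accounting (hypothetical, not kernel code)."""

    def __init__(self):
        self.live = {}                  # tid -> Counter with that task's stats
        self.exited_total = Counter()   # stats accumulated from exited tasks

    def add_task(self, tid):
        self.live[tid] = Counter()

    def charge(self, tid, **amounts):
        """Add the given amounts to a live task's counters."""
        self.live[tid].update(amounts)

    def task_exit(self, tid):
        """Fold the exiting task's stats into the process-wide total.

        Returns the per-task stats and, only when the last task exits,
        a copy of the accumulated per-tgid stats (otherwise None).
        """
        stats = self.live.pop(tid)
        self.exited_total.update(stats)
        is_last = not self.live
        return stats, (Counter(self.exited_total) if is_last else None)

    def query_tgid(self):
        """Sum the live tasks and add the total of already exited tasks."""
        total = Counter(self.exited_total)
        for stats in self.live.values():
            total.update(stats)
        return total
```

Only the exit path and the explicit query touch the process-wide total in this
model, which mirrors why the scheme keeps the per-process bookkeeping cheap
while tasks are running.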

Extending taskstats
-------------------

There are two ways to extend the taskstats interface to export more
per-task/process stats as patches to collect them get added to the kernel
in the future:

1. Adding more fields to the end of the existing struct taskstats. Backward
   compatibility is ensured by the version number within the
   structure. Userspace will use only the fields of the struct that correspond
   to the version it is using.

2. Defining separate statistic structs and using the netlink attributes
   interface to return them. Since userspace processes each netlink attribute
   independently, it can always ignore attributes whose type it does not
   understand (because it is using an older version of the interface).


Choosing between 1. and 2. is a matter of trading off flexibility and
overhead. If only a few fields need to be added, then 1. is the preferable
path since the kernel and userspace don't need to incur the overhead of
processing new netlink attributes. But if the new fields expand the existing
struct too much, requiring disparate userspace accounting utilities to
unnecessarily receive large structures whose fields are of no interest, then
extending the attributes structure would be worthwhile.

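
A minimal sketch of how userspace might cope with option 1 is shown below. The
two layouts are invented for illustration and do not match the real struct
taskstats; the only property borrowed from it is that the version number comes
first and that newer versions append fields at the end. The names LAYOUTS,
MY_VERSION and decode_stats are hypothetical.

```python
import struct

# Hypothetical layouts, NOT the real struct taskstats. Each version only
# appends fields to the end of the previous one.
LAYOUTS = {
    1: ("=HxxI", ("version", "exitcode")),
    2: ("=HxxIQ", ("version", "exitcode", "cpu_ns")),
}
MY_VERSION = 2  # newest layout this program was built against


def decode_stats(payload):
    """Decode only the fields both sides know about."""
    (version,) = struct.unpack_from("=H", payload, 0)
    # A newer kernel sends a longer struct: read just the prefix we know.
    # An older kernel sends a shorter struct: fall back to its layout.
    usable = min(version, MY_VERSION)
    fmt, names = LAYOUTS[usable]
    return dict(zip(names, struct.unpack_from(fmt, payload, 0)))
```

Because fields are only ever appended, the prefix that an older program knows
keeps its meaning, and any trailing bytes from a newer kernel are simply left
unread.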

Flow control for taskstats
--------------------------

When the rate of task exits becomes large, a listener may not be able to keep
up with the kernel's rate of sending per-tid/tgid exit data, leading to data
loss. This possibility gets compounded when the taskstats structure gets
extended and the number of cpus grows large.

To avoid losing statistics, userspace should do one or more of the following:

- increase the receive buffer sizes for the netlink sockets opened by
  listeners to receive exit data.

- create more listeners and reduce the number of cpus being listened to by
  each listener. In the extreme case, there could be one listener for each cpu.
  Users may also consider setting the cpu affinity of the listener to the subset
  of cpus to which it listens, especially if they are listening to just one cpu.

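
One way to spread cpus over several listeners is sketched below in Python. It
produces cpumask strings in the comma-separated range format that the register
command expects. The function names and the choice of contiguous, evenly sized
blocks are assumptions made for this example; any other partition works as
well.

```python
def format_cpumask(cpus):
    """Format cpu numbers as comma-separated ranges, e.g. "1-3,5,7-8"."""
    ranges = []
    for cpu in sorted(set(cpus)):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current range
        else:
            ranges.append([cpu, cpu])    # start a new range
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def split_cpus(num_cpus, num_listeners):
    """Return one cpumask string per listener, using contiguous blocks."""
    base, extra = divmod(num_cpus, num_listeners)
    masks, start = [], 0
    for i in range(num_listeners):
        count = base + (1 if i < extra else 0)
        if count:                        # skip listeners left without cpus
            masks.append(format_cpumask(range(start, start + count)))
        start += count
    return masks
```

With 8 cpus and 3 listeners this yields the masks "0-2", "3-5" and "6-7". Each
listener would then send its own mask in a register command.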

Despite these measures, if userspace receives ENOBUFS errors indicating
overflow of receive buffers, it should take measures to handle the
loss of data.
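
A receive loop that notices such overflows could look like the following Python
sketch. It only counts the ENOBUFS events and keeps reading; what to do about
the lost records is left to the caller. The function name, its parameters and
the buffer size are assumptions made for this example, and sock can be any
object with a recv() method, such as a netlink socket.

```python
import errno


def receive_exit_data(sock, handle, max_reads, bufsize=65536):
    """Call recv() up to max_reads times; return the number of overflows."""
    overflows = 0
    for _ in range(max_reads):
        try:
            data = sock.recv(bufsize)
        except OSError as err:
            if err.errno != errno.ENOBUFS:
                raise                    # unrelated error: let the caller see it
            overflows += 1               # the kernel dropped some exit records
            continue
        handle(data)
    return overflows
```

A non-zero return value tells the caller that the statistics collected so far
are incomplete, which it can record alongside the data or use as a signal to
enlarge the receive buffer or add listeners.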