Commit | Line | Data |
---|---|---|
373e8ffa JK |
1 | .. _psi: |
2 | ||
eb414681 JW |
3 | ================================ |
4 | PSI - Pressure Stall Information | |
5 | ================================ | |
6 | ||
7 | :Date: April, 2018 | |
8 | :Author: Johannes Weiner <hannes@cmpxchg.org> | |
9 | ||
10 | When CPU, memory or IO devices are contended, workloads experience | |
11 | latency spikes, throughput losses, and run the risk of OOM kills. | |
12 | ||
13 | Without an accurate measure of such contention, users are forced to | |
14 | either play it safe and under-utilize their hardware resources, or | |
15 | roll the dice and frequently suffer the disruptions resulting from | |
16 | excessive overcommit. | |
17 | ||
18 | The psi feature identifies and quantifies the disruptions caused by | |
19 | such resource crunches and the time impact they have on complex workloads | |
20 | or even entire systems. | |
21 | ||
22 | Having an accurate measure of productivity losses caused by resource | |
23 | scarcity aids users in sizing workloads to hardware--or provisioning | |
24 | hardware according to workload demand. | |
25 | ||
26 | As psi aggregates this information in realtime, systems can be managed | |
27 | dynamically using techniques such as load shedding, migrating jobs to | |
28 | other systems or data centers, or strategically pausing or killing low | |
29 | priority or restartable batch jobs. | |
30 | ||
31 | This allows maximizing hardware utilization without sacrificing | |
32 | workload health or risking major disruptions such as OOM kills. | |
33 | ||
34 | Pressure interface | |
35 | ================== | |
36 | ||
37 | Pressure information for each resource is exported through the | |
38 | respective file in /proc/pressure/ -- cpu, memory, and io. | |
39 | ||
890d550d | 40 | The format is as such:: |
eb414681 | 41 | |
c3123552 MCC |
42 | some avg10=0.00 avg60=0.00 avg300=0.00 total=0 |
43 | full avg10=0.00 avg60=0.00 avg300=0.00 total=0 | |
eb414681 JW |
44 | |
45 | The "some" line indicates the share of time in which at least some | |
46 | tasks are stalled on a given resource. | |
47 | ||
48 | The "full" line indicates the share of time in which all non-idle | |
49 | tasks are stalled on a given resource simultaneously. In this state | |
50 | actual CPU cycles are going to waste, and a workload that spends | |
51 | extended time in this state is considered to be thrashing. This has | |
52 | severe impact on performance, and it's useful to distinguish this | |
53 | situation from a state where some tasks are stalled but the CPU is | |
54 | still doing productive work. As such, time spent in this subset of the | |
55 | stall state is tracked separately and exported in the "full" averages. | |
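
As a rough illustration of the format described above, the following sketch parses a single "some" or "full" line into a struct. The struct and function names (psi_line, psi_parse_line) are hypothetical helpers made up for this example; they are not part of any kernel or libc interface.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one line of a pressure file. */
struct psi_line {
    char kind[8];                 /* "some" or "full" */
    double avg10, avg60, avg300;  /* stall ratios in percent */
    unsigned long long total;     /* cumulative stall time in us */
};

/* Returns 0 on success, -1 if the line does not match the format. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total);

    if (n != 5)
        return -1;
    if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
        return -1;
    return 0;
}
```

A caller would typically read the file line by line (for example with fgets()) and pass each line to the helper.
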
56 | ||
890d550d CZ |
57 | CPU full is undefined at the system level, but has been reported |
58 | since 5.13, so it is set to zero for backward compatibility. | |
59 | ||
be87ab0a WL |
60 | The ratios (in %) are tracked as recent trends over ten, sixty, and |
61 | three hundred second windows, which gives insight into short term events | |
62 | as well as medium and long term trends. The total absolute stall time | |
63 | (in us) is tracked and exported as well, to allow detection of latency | |
64 | spikes which wouldn't necessarily make a dent in the time averages, | |
65 | or to average trends over custom time frames. | |
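
Averaging over a custom time frame can be sketched as follows: sample the "total" field twice and divide the growth in stall time by the wall-clock time between the two reads. The function name psi_pressure_pct and its parameters are assumptions made for this example.

```c
#include <stdint.h>

/*
 * Percentage of wall-clock time spent stalled between two samples.
 * total_prev_us and total_now_us are two readings of the "total" field;
 * elapsed_us is the time between the two reads, measured by the caller.
 * Returns 0.0 for an empty interval or a counter that went backwards.
 */
static double psi_pressure_pct(uint64_t total_prev_us, uint64_t total_now_us,
                               uint64_t elapsed_us)
{
    if (elapsed_us == 0 || total_now_us < total_prev_us)
        return 0.0;
    return 100.0 * (double)(total_now_us - total_prev_us) / (double)elapsed_us;
}
```

For instance, 250ms of additional stall time observed across a 5s sampling interval corresponds to 5% pressure.
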
2ce7135a | 66 | |
0e94682b SB |
67 | Monitoring for pressure thresholds |
68 | ================================== | |
69 | ||
70 | Users can register triggers and use poll() to be woken up when resource | |
71 | pressure exceeds certain thresholds. | |
72 | ||
73 | A trigger describes the maximum cumulative stall time over a specific | |
74 | time window, e.g. 100ms of total stall time within any 500ms window to | |
75 | generate a wakeup event. | |
76 | ||
77 | To register a trigger, the user has to open the psi interface file under | |
78 | /proc/pressure/ representing the resource to be monitored and write the | |
79 | desired threshold and time window. The open file descriptor should be | |
80 | used to wait for trigger events using select(), poll() or epoll(). | |
c3123552 | 81 | The following format is used:: |
0e94682b | 82 | |
c3123552 | 83 | <some|full> <stall amount in us> <time window in us> |
0e94682b SB |
84 | |
85 | For example, writing "some 150000 1000000" into /proc/pressure/memory | |
86 | would add a 150ms threshold for partial memory stall measured within | |
87 | a 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io | |
88 | would add a 50ms threshold for full io stall measured within a 1sec time window. | |
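
The trigger string can be built programmatically instead of being hard-coded. The sketch below formats the three fields into a caller-supplied buffer; psi_format_trigger is a hypothetical helper name, and the result still has to be written to the opened pressure file by the caller.

```c
#include <stdio.h>

/*
 * Format "<some|full> <stall amount in us> <time window in us>" into buf.
 * Returns the string length on success, -1 if buf is too small.
 */
static int psi_format_trigger(char *buf, size_t len, int full,
                              unsigned long long stall_us,
                              unsigned long long window_us)
{
    int n = snprintf(buf, len, "%s %llu %llu",
                     full ? "full" : "some", stall_us, window_us);

    if (n < 0 || (size_t)n >= len)
        return -1;
    return n;
}
```

Calling it with full=0, stall_us=150000 and window_us=1000000 yields the string used in the first example above.
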
89 | ||
90 | Triggers can be set on more than one psi metric, and more than one trigger | |
91 | for the same psi metric can be specified. However, each trigger requires a | |
92 | separate file descriptor so that it can be polled separately from others; | |
93 | a separate open() syscall should therefore be made for each trigger, even | |
a06247c6 SB |
94 | when opening the same psi interface file. Write operations to a file descriptor |
95 | with an already existing psi trigger will fail with EBUSY. | |
0e94682b SB |
96 | |
97 | Monitors activate only when the system enters a stall state for the monitored | |
98 | psi metric and deactivate upon exit from the stall state. While the system is | |
99 | in the stall state, psi signal growth is monitored at a rate of 10 times per | |
100 | tracking window. | |
101 | ||
102 | The kernel accepts window sizes ranging from 500ms to 10s; the minimum | |
103 | monitoring update interval is therefore 50ms and the maximum is 1s. The minimum | |
104 | limit is set to prevent overly frequent polling. The maximum limit is chosen as | |
105 | a high enough number after which monitors are most likely not needed and psi | |
106 | averages can be used instead. | |
107 | ||
d82caa27 DC |
108 | Unprivileged users can also create monitors, with the only limitation that the |
109 | window size must be a multiple of 2s, in order to prevent excessive resource | |
110 | usage. | |
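
The window limits and the resulting update interval can be checked in userspace before a trigger is written. This is only a sketch of the rules stated above (500ms to 10s, 10 updates per window, multiples of 2s for unprivileged users); the macro and function names are invented for the example, and the kernel remains the authority on what it accepts.

```c
#include <stdint.h>

/* Limits as described in the text; the names are local to this example. */
#define EX_WINDOW_MIN_US        500000ULL
#define EX_WINDOW_MAX_US        10000000ULL
#define EX_UNPRIV_STEP_US       2000000ULL
#define EX_UPDATES_PER_WINDOW   10

/*
 * Returns the monitoring update interval in us for a given window,
 * or -1 if the window would be rejected under the rules above.
 */
static int64_t psi_update_interval_us(uint64_t window_us, int privileged)
{
    if (window_us < EX_WINDOW_MIN_US || window_us > EX_WINDOW_MAX_US)
        return -1;
    if (!privileged && window_us % EX_UNPRIV_STEP_US != 0)
        return -1;
    return (int64_t)(window_us / EX_UPDATES_PER_WINDOW);
}
```
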
111 | ||
0e94682b SB |
112 | When activated, a psi monitor stays active for at least the duration of one |
113 | tracking window to avoid repeated activations/deactivations when the system is | |
114 | bouncing in and out of the stall state. | |
115 | ||
116 | Notifications to userspace are rate-limited to one per tracking window. | |
117 | ||
118 | The trigger will de-register when the file descriptor used to define the | |
119 | trigger is closed. | |
120 | ||
121 | Userspace monitor usage example | |
122 | =============================== | |
123 | ||
c3123552 MCC |
124 | :: |
125 | ||
126 | #include <errno.h> | |
127 | #include <fcntl.h> | |
128 | #include <stdio.h> | |
129 | #include <poll.h> | |
130 | #include <string.h> | |
131 | #include <unistd.h> | |
132 | ||
133 | /* | |
134 | * Monitor memory partial stall with 1s tracking window size | |
135 | * and 150ms threshold. | |
136 | */ | |
137 | int main() { | |
0e94682b SB |
138 | const char trig[] = "some 150000 1000000"; |
139 | struct pollfd fds; | |
140 | int n; | |
141 | ||
142 | fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK); | |
143 | if (fds.fd < 0) { | |
144 | printf("/proc/pressure/memory open error: %s\n", | |
145 | strerror(errno)); | |
146 | return 1; | |
147 | } | |
148 | fds.events = POLLPRI; | |
149 | ||
150 | if (write(fds.fd, trig, strlen(trig) + 1) < 0) { | |
151 | printf("/proc/pressure/memory write error: %s\n", | |
152 | strerror(errno)); | |
153 | return 1; | |
154 | } | |
155 | ||
156 | printf("waiting for events...\n"); | |
157 | while (1) { | |
158 | n = poll(&fds, 1, -1); | |
159 | if (n < 0) { | |
160 | printf("poll error: %s\n", strerror(errno)); | |
161 | return 1; | |
162 | } | |
163 | if (fds.revents & POLLERR) { | |
164 | printf("got POLLERR, event source is gone\n"); | |
165 | return 0; | |
166 | } | |
167 | if (fds.revents & POLLPRI) { | |
168 | printf("event triggered!\n"); | |
169 | } else { | |
170 | printf("unknown event received: 0x%x\n", fds.revents); | |
171 | return 1; | |
172 | } | |
173 | } | |
174 | ||
175 | return 0; | |
c3123552 | 176 | } |
0e94682b | 177 | |
2ce7135a JW |
178 | Cgroup2 interface |
179 | ================= | |
180 | ||
25bf1bac | 181 | In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem |
2ce7135a JW |
182 | mounted, pressure stall information is also tracked for tasks grouped |
183 | into cgroups. Each subdirectory in the cgroupfs mountpoint contains | |
184 | cpu.pressure, memory.pressure, and io.pressure files; the format is | |
185 | the same as the /proc/pressure/ files. | |
0e94682b SB |
186 | |
187 | Per-cgroup psi monitors can be specified and used the same way as | |
188 | system-wide ones. |