.. _psi:

================================
PSI - Pressure Stall Information
================================

:Date: April, 2018
:Author: Johannes Weiner <hannes@cmpxchg.org>

When CPU, memory or IO devices are contended, workloads experience
latency spikes, throughput losses, and run the risk of OOM kills.

Without an accurate measure of such contention, users are forced to
either play it safe and under-utilize their hardware resources, or
roll the dice and frequently suffer the disruptions resulting from
excessive overcommit.

The psi feature identifies and quantifies the disruptions caused by
such resource crunches and the time impact they have on complex
workloads or even entire systems.

Having an accurate measure of productivity losses caused by resource
scarcity aids users in sizing workloads to hardware--or provisioning
hardware according to workload demand.

As psi aggregates this information in realtime, systems can be managed
dynamically using techniques such as load shedding, migrating jobs to
other systems or data centers, or strategically pausing or killing low
priority or restartable batch jobs.

This allows maximizing hardware utilization without sacrificing
workload health or risking major disruptions such as OOM kills.

Pressure interface
==================

Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.

The format for CPU is as such::

	some avg10=0.00 avg60=0.00 avg300=0.00 total=0

and for memory and IO::

	some avg10=0.00 avg60=0.00 avg300=0.00 total=0
	full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some
tasks are stalled on a given resource.

The "full" line indicates the share of time in which all non-idle
tasks are stalled on a given resource simultaneously. In this state
actual CPU cycles are going to waste, and a workload that spends
extended time in this state is considered to be thrashing. This has
severe impact on performance, and it's useful to distinguish this
situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.
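
As an illustration of the line format above, the following is a minimal
sketch of a parser for one "some" or "full" line. The struct psi_line type
and the psi_parse_line() helper are hypothetical names introduced for this
example only; they are not part of any kernel or libc interface.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one line of a pressure file. */
struct psi_line {
	char kind[8];			/* "some" or "full" */
	double avg10, avg60, avg300;	/* ratios in % */
	unsigned long long total;	/* cumulative stall time in us */
};

/* Returns 0 on success, -1 if the line does not match the format. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
	int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		       out->kind, &out->avg10, &out->avg60, &out->avg300,
		       &out->total);

	if (n != 5)
		return -1;
	if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
		return -1;
	return 0;
}
```

A caller would typically read a pressure file line by line with fgets() and
pass each line to the helper.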

The ratios (in %) are tracked as recent trends over ten, sixty, and
three hundred second windows, which gives insight into short term events
as well as medium and long term trends. The total absolute stall time
(in us) is tracked and exported as well, to allow detection of latency
spikes which wouldn't necessarily make a dent in the time averages,
or to average trends over custom time frames.
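
Averaging over a custom time frame can be sketched as follows: sample the
"total" field twice and divide the growth by the elapsed wall-clock time.
The psi_pct_between() helper is a hypothetical name, and the timestamps are
assumed to be taken by the caller in microseconds (for example with
clock_gettime()).

```c
#include <stdint.h>

/*
 * Share of time (in %) spent stalled between two samples.
 * total_prev/total_now are "total=" values in us; t_prev_us/t_now_us
 * are caller-supplied timestamps in us.
 * Returns a negative value if the samples are out of order.
 */
static double psi_pct_between(uint64_t total_prev, uint64_t total_now,
			      uint64_t t_prev_us, uint64_t t_now_us)
{
	if (t_now_us <= t_prev_us || total_now < total_prev)
		return -1.0;

	return 100.0 * (double)(total_now - total_prev) /
	       (double)(t_now_us - t_prev_us);
}
```

For instance, 150000us of additional stall time observed across a one second
interval corresponds to a 15% ratio for that interval.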

Monitoring for pressure thresholds
==================================

Users can register triggers and use poll() to be woken up when resource
pressure exceeds certain thresholds.

A trigger describes the maximum cumulative stall time over a specific
time window, e.g. 100ms of total stall time within any 500ms window to
generate a wakeup event.

To register a trigger, the user has to open the psi interface file under
/proc/pressure/ representing the resource to be monitored and write the
desired threshold and time window. The open file descriptor should be
used to wait for trigger events using select(), poll() or epoll().
The following format is used::

	<some|full> <stall amount in us> <time window in us>

For example, writing "some 150000 1000000" into /proc/pressure/memory
would add a 150ms threshold for partial memory stall measured within
a 1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
would add a 50ms threshold for full io stall measured within a 1sec
time window.
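
The trigger string can also be assembled at run time instead of being
hard-coded. The sketch below shows one way to do this with snprintf(). The
psi_format_trigger() helper is a hypothetical name, and rejecting a stall
amount larger than the window is a sanity check chosen for this example,
since such a threshold could never be reached.

```c
#include <stdio.h>
#include <string.h>

/*
 * Writes "<some|full> <stall us> <window us>" into buf.
 * Returns the string length, or -1 on invalid input or truncation.
 */
static int psi_format_trigger(char *buf, size_t size, const char *kind,
			      unsigned long long stall_us,
			      unsigned long long window_us)
{
	int n;

	if (strcmp(kind, "some") != 0 && strcmp(kind, "full") != 0)
		return -1;
	if (stall_us == 0 || stall_us > window_us)
		return -1;	/* sanity check assumed by this sketch */

	n = snprintf(buf, size, "%s %llu %llu", kind, stall_us, window_us);
	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}
```

The resulting buffer is what would be passed to write() on the file
descriptor opened for the pressure file.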

Triggers can be set on more than one psi metric, and more than one trigger
can be specified for the same psi metric. However, each trigger requires a
separate file descriptor so that it can be polled separately from the
others. A separate open() syscall should therefore be made for each
trigger, even when opening the same psi interface file.

Monitors activate only when the system enters a stall state for the
monitored psi metric and deactivate upon exit from the stall state. While
the system is in the stall state, psi signal growth is monitored at a rate
of 10 times per tracking window.

The kernel accepts window sizes ranging from 500ms to 10s; therefore the
minimum monitoring update interval is 50ms and the maximum is 1s. The
minimum is set to prevent overly frequent polling. The maximum is chosen as
a high enough number after which monitors are most likely not needed and
psi averages can be used instead.
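
The relationship between window size and update interval described above
can be written down as a short check. This is a sketch only: the macro and
function names are hypothetical and merely mirror the limits stated in the
text (500ms to 10s, 10 updates per window).

```c
#define PSI_WINDOW_MIN_US	500000ULL	/* 500ms */
#define PSI_WINDOW_MAX_US	10000000ULL	/* 10s */
#define PSI_UPDATES_PER_WINDOW	10

/*
 * Returns the monitoring update interval in us for a given window,
 * or 0 if the window is outside the accepted range.
 */
static unsigned long long psi_update_interval_us(unsigned long long window_us)
{
	if (window_us < PSI_WINDOW_MIN_US || window_us > PSI_WINDOW_MAX_US)
		return 0;
	return window_us / PSI_UPDATES_PER_WINDOW;
}
```

A userspace tool could run such a check before writing a trigger, so that
an out-of-range window is reported before the write() is attempted.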

When activated, a psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when the system
is bouncing in and out of the stall state.

Notifications to userspace are rate-limited to one per tracking window.

The trigger will de-register when the file descriptor used to define the
trigger is closed.

Userspace monitor usage example
===============================

::

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <poll.h>
	#include <string.h>
	#include <unistd.h>

	/*
	 * Monitor memory partial stall with 1s tracking window size
	 * and 150ms threshold.
	 */
	int main(void) {
		const char trig[] = "some 150000 1000000";
		struct pollfd fds;
		int n;

		fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
		if (fds.fd < 0) {
			printf("/proc/pressure/memory open error: %s\n",
				strerror(errno));
			return 1;
		}
		fds.events = POLLPRI;

		if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
			printf("/proc/pressure/memory write error: %s\n",
				strerror(errno));
			return 1;
		}

		printf("waiting for events...\n");
		while (1) {
			n = poll(&fds, 1, -1);
			if (n < 0) {
				printf("poll error: %s\n", strerror(errno));
				return 1;
			}
			if (fds.revents & POLLERR) {
				printf("got POLLERR, event source is gone\n");
				return 0;
			}
			if (fds.revents & POLLPRI) {
				printf("event triggered!\n");
			} else {
				printf("unknown event received: 0x%x\n", fds.revents);
				return 1;
			}
		}

		return 0;
	}

Cgroup2 interface
=================

In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.

Per-cgroup psi monitors can be specified and used the same way as
system-wide ones.
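
Since the per-cgroup files differ from the system-wide ones only in their
location, a monitor can be pointed at a cgroup by changing the path it
opens. The sketch below builds such a path. The psi_cgroup_path() helper is
a hypothetical name, and the cgroup directory is supplied by the caller
because the cgroup2 mountpoint differs between systems.

```c
#include <stdio.h>
#include <string.h>

/*
 * Builds "<cgroup_dir>/<resource>.pressure" for resource "cpu",
 * "memory" or "io". Returns the path length, or -1 on invalid input
 * or truncation.
 */
static int psi_cgroup_path(char *buf, size_t size, const char *cgroup_dir,
			   const char *resource)
{
	int n;

	if (strcmp(resource, "cpu") != 0 && strcmp(resource, "memory") != 0 &&
	    strcmp(resource, "io") != 0)
		return -1;

	n = snprintf(buf, size, "%s/%s.pressure", cgroup_dir, resource);
	if (n < 0 || (size_t)n >= size)
		return -1;
	return n;
}
```

Assuming a cgroup directory such as /sys/fs/cgroup/workload (an example
location, not one mandated by the kernel), the resulting path can be passed
to open() in place of /proc/pressure/memory in the monitor example.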