.. _psi:

================================
PSI - Pressure Stall Information
================================

:Date: April, 2018
:Author: Johannes Weiner <hannes@cmpxchg.org>

When CPU, memory or IO devices are contended, workloads experience
latency spikes, throughput losses, and run the risk of OOM kills.

Without an accurate measure of such contention, users are forced to
either play it safe and under-utilize their hardware resources, or
roll the dice and frequently suffer the disruptions resulting from
excessive overcommit.

The psi feature identifies and quantifies the disruptions caused by
such resource crunches and their time impact on complex workloads
or even entire systems.

Having an accurate measure of productivity losses caused by resource
scarcity aids users in sizing workloads to hardware--or provisioning
hardware according to workload demand.

As psi aggregates this information in realtime, systems can be managed
dynamically using techniques such as load shedding, migrating jobs to
other systems or data centers, or strategically pausing or killing
low-priority or restartable batch jobs.

This allows maximizing hardware utilization without sacrificing
workload health or risking major disruptions such as OOM kills.

Pressure interface
==================

Pressure information for each resource is exported through the
respective file in /proc/pressure/ -- cpu, memory, and io.

The format for CPU is as such::

        some avg10=0.00 avg60=0.00 avg300=0.00 total=0

and for memory and IO::

        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
        full avg10=0.00 avg60=0.00 avg300=0.00 total=0

The "some" line indicates the share of time in which at least some
tasks are stalled on a given resource.

The "full" line indicates the share of time in which all non-idle
tasks are stalled on a given resource simultaneously. In this state
actual CPU cycles are going to waste, and a workload that spends
extended time in this state is considered to be thrashing. This has
severe impact on performance, and it's useful to distinguish this
situation from a state where some tasks are stalled but the CPU is
still doing productive work. As such, time spent in this subset of the
stall state is tracked separately and exported in the "full" averages.

The ratios (in %) are tracked as recent trends over ten, sixty, and
three hundred second windows, which gives insight into short-term events
as well as medium- and long-term trends. The total absolute stall time
(in microseconds) is tracked and exported as well, to allow detection of
latency spikes which wouldn't necessarily make a dent in the time averages,
or to average trends over custom time frames.

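The line format and the use of the total field described above can be
sketched in C as follows. This is only an illustrative sketch: the struct
psi_line type and the psi_parse_line() and psi_stall_percent() helpers are
hypothetical names introduced here, not part of any kernel or library API.
The second helper shows one way to derive a stall percentage over a custom
interval from two readings of total.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of a psi pressure file (hypothetical helper type). */
struct psi_line {
    char kind[8];                 /* "some" or "full" */
    double avg10, avg60, avg300;  /* stall share in percent */
    unsigned long long total_us;  /* cumulative stall time in microseconds */
};

/* Parse one line; returns 0 on success, -1 if the line does not match. */
int psi_parse_line(const char *line, struct psi_line *out)
{
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total_us);

    if (n != 5)
        return -1;
    if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
        return -1;
    return 0;
}

/*
 * Stall share in percent over a custom interval, computed from two
 * readings of the "total" field taken elapsed_us microseconds apart.
 * Returns 0.0 for an empty interval or a counter that went backwards.
 */
double psi_stall_percent(unsigned long long total_before_us,
                         unsigned long long total_after_us,
                         unsigned long long elapsed_us)
{
    if (elapsed_us == 0 || total_after_us < total_before_us)
        return 0.0;
    return 100.0 * (double)(total_after_us - total_before_us) /
           (double)elapsed_us;
}
```

For instance, if total grows by 250000 between two reads taken five seconds
apart, the stall share over that custom window is 5%.
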
Monitoring for pressure thresholds
==================================

Users can register triggers and use poll() to be woken up when resource
pressure exceeds certain thresholds.

A trigger describes the maximum cumulative stall time over a specific
time window, e.g. 100ms of total stall time within any 500ms window to
generate a wakeup event.

To register a trigger, the user has to open the psi interface file under
/proc/pressure/ representing the resource to be monitored and write the
desired threshold and time window. The open file descriptor should be
used to wait for trigger events using select(), poll() or epoll().
The following format is used::

        <some|full> <stall amount in us> <time window in us>

For example, writing "some 150000 1000000" into /proc/pressure/memory
would add a 150ms threshold for partial memory stall measured within a
1sec time window. Writing "full 50000 1000000" into /proc/pressure/io
would add a 50ms threshold for full io stall measured within a 1sec time
window.

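A trigger string in the format above could be assembled as in the sketch
below. The psi_format_trigger() helper is a hypothetical name used for
illustration only; it merely formats the string and performs no validation
of the values.

```c
#include <stdio.h>

/*
 * Build a trigger string in the "<some|full> <stall us> <window us>"
 * format. Returns the string length, or -1 if buf is too small.
 * Hypothetical helper, not a kernel or libc API.
 */
int psi_format_trigger(char *buf, size_t size, int full,
                       unsigned long stall_us, unsigned long window_us)
{
    int n = snprintf(buf, size, "%s %lu %lu",
                     full ? "full" : "some", stall_us, window_us);

    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Calling psi_format_trigger(buf, sizeof(buf), 0, 150000, 1000000) yields the
"some 150000 1000000" string from the first example above.
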
Triggers can be set on more than one psi metric, and more than one trigger
for the same psi metric can be specified. However, each trigger requires a
separate file descriptor so that it can be polled separately from the others;
therefore a separate open() syscall should be made for each trigger, even
when opening the same psi interface file. Write operations to a file descriptor
with an already existing psi trigger will fail with EBUSY.

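The one-descriptor-per-trigger rule can be illustrated with a small helper
that performs a fresh open() for every trigger. This is a sketch under
stated assumptions: psi_register_trigger() is a hypothetical name, and the
write includes the terminating NUL only to mirror the usage example in this
document.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Open a fresh descriptor on a psi file and register one trigger on it.
 * Returns the descriptor to poll, or -1 with errno set on failure.
 * Hypothetical helper, not a kernel or libc API.
 */
int psi_register_trigger(const char *path, const char *trig)
{
    int fd = open(path, O_RDWR | O_NONBLOCK);

    if (fd < 0)
        return -1;
    if (write(fd, trig, strlen(trig) + 1) < 0) {
        int saved = errno;  /* close() may overwrite errno */

        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/*
 * Two triggers on the same file need two calls, hence two descriptors:
 *
 *   int fd_some = psi_register_trigger("/proc/pressure/memory",
 *                                      "some 150000 1000000");
 *   int fd_full = psi_register_trigger("/proc/pressure/memory",
 *                                      "full 50000 1000000");
 */
```

Writing a second trigger to fd_some instead of opening a new descriptor
would hit the EBUSY case described above.
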
Monitors activate only when the system enters the stall state for the
monitored psi metric and deactivate upon exit from the stall state. While the
system is in the stall state, psi signal growth is monitored at a rate of 10
times per tracking window.

The kernel accepts window sizes ranging from 500ms to 10s; therefore the
minimum monitoring update interval is 50ms and the maximum is 1s. The minimum
limit is set to prevent overly frequent polling. The maximum limit is chosen
as a high enough number after which monitors are most likely not needed and
psi averages can be used instead.

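The arithmetic behind those limits can be written out as a short sketch.
The macro and function names below are hypothetical and chosen for this
illustration; the values come from the window range and the 10 updates per
window stated above.

```c
/* Accepted tracking window range, per the text above (hypothetical names). */
#define PSI_WINDOW_MIN_US 500000UL    /* 500ms */
#define PSI_WINDOW_MAX_US 10000000UL  /* 10s */

/* Returns nonzero if window_us lies within the accepted range. */
int psi_window_valid(unsigned long window_us)
{
    return window_us >= PSI_WINDOW_MIN_US && window_us <= PSI_WINDOW_MAX_US;
}

/* Signal growth is checked 10 times per tracking window. */
unsigned long psi_update_interval_us(unsigned long window_us)
{
    return window_us / 10;
}
```

A 500ms window therefore gives a 50ms update interval, and a 10s window
gives a 1s interval.
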
When activated, a psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when the system
is bouncing in and out of the stall state.

Notifications to userspace are rate-limited to one per tracking window.

The trigger will de-register when the file descriptor used to define the
trigger is closed.

Userspace monitor usage example
===============================

::

  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <poll.h>
  #include <string.h>
  #include <unistd.h>

  /*
   * Monitor memory partial stall with 1s tracking window size
   * and 150ms threshold.
   */
  int main(void) {
          const char trig[] = "some 150000 1000000";
          struct pollfd fds;
          int n;

          /* Each trigger needs its own open file descriptor. */
          fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
          if (fds.fd < 0) {
                  printf("/proc/pressure/memory open error: %s\n",
                          strerror(errno));
                  return 1;
          }
          /* Trigger events are reported as POLLPRI. */
          fds.events = POLLPRI;

          /* Register the trigger by writing it to the psi file. */
          if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
                  printf("/proc/pressure/memory write error: %s\n",
                          strerror(errno));
                  return 1;
          }

          printf("waiting for events...\n");
          while (1) {
                  /* Block without timeout until the trigger fires. */
                  n = poll(&fds, 1, -1);
                  if (n < 0) {
                          printf("poll error: %s\n", strerror(errno));
                          return 1;
                  }
                  if (fds.revents & POLLERR) {
                          printf("got POLLERR, event source is gone\n");
                          return 0;
                  }
                  if (fds.revents & POLLPRI) {
                          printf("event triggered!\n");
                  } else {
                          printf("unknown event received: 0x%x\n", fds.revents);
                          return 1;
                  }
          }

          return 0;
  }

Cgroup2 interface
=================

In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
mounted, pressure stall information is also tracked for tasks grouped
into cgroups. Each subdirectory in the cgroupfs mountpoint contains
cpu.pressure, memory.pressure, and io.pressure files; the format is
the same as the /proc/pressure/ files.

Per-cgroup psi monitors can be specified and used the same way as
system-wide ones.
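
As a sketch of how a per-cgroup pressure file could be located, the helper
below joins a cgroup directory and a resource name. The psi_cgroup_path()
name is hypothetical, and the /sys/fs/cgroup mountpoint and the "batch"
cgroup in the usage note are assumptions made for illustration; the actual
mountpoint depends on the system.

```c
#include <stdio.h>

/*
 * Build "<cgroup_dir>/<resource>.pressure", where resource is one of
 * "cpu", "memory" or "io". Returns 0 on success, -1 if buf is too small.
 * Hypothetical helper, not a kernel or libc API.
 */
int psi_cgroup_path(char *buf, size_t size, const char *cgroup_dir,
                    const char *resource)
{
    int n = snprintf(buf, size, "%s/%s.pressure", cgroup_dir, resource);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

With cgroup_dir set to "/sys/fs/cgroup/batch" and resource set to "memory",
the result is "/sys/fs/cgroup/batch/memory.pressure", which can then be
opened and polled like the /proc/pressure/ files.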