Commit | Line | Data |
---|---|---|
6bbe6f57 MCC |
1 | ======================= |
2 | Intel Powerclamp Driver | |
3 | ======================= | |
4 | ||
5 | By: | |
6 | - Arjan van de Ven <arjan@linux.intel.com> | |
7 | - Jacob Pan <jacob.jun.pan@linux.intel.com> | |
8 | ||
9 | .. Contents: | |
d6d71ee4 | 10 | |
d6d71ee4 JP |
11 | (*) Introduction |
12 | - Goals and Objectives | |
13 | ||
14 | (*) Theory of Operation | |
15 | - Idle Injection | |
16 | - Calibration | |
17 | ||
18 | (*) Performance Analysis | |
19 | - Effectiveness and Limitations | |
20 | - Power vs Performance | |
21 | - Scalability | |
22 | - Calibration | |
23 | - Comparison with Alternative Techniques | |
24 | ||
25 | (*) Usage and Interfaces | |
26 | - Generic Thermal Layer (sysfs) | |
27 | - Kernel APIs (TBD) | |
28 | ||
d6d71ee4 JP |
29 | INTRODUCTION |
30 | ============ | |
31 | ||
32 | Consider the situation where a system’s power consumption must be | |
33 | reduced at runtime, due to power budget, thermal constraint, or noise | |
34 | level, and where active cooling is not preferred. Software managed | |
35 | passive power reduction must be performed to prevent the hardware | |
36 | actions that are designed for catastrophic scenarios. | |
37 | ||
38 | Currently, P-states, T-states (clock modulation), and CPU offlining | |
39 | are used for CPU throttling. | |
40 | ||
41 | On Intel CPUs, C-states provide effective power reduction, but so far | |
42 | they’re only used opportunistically, based on workload. With the | |
43 | development of the intel_powerclamp driver, the method of synchronizing | |
44 | idle injection across all online CPU threads was introduced. The goal | |
45 | is to achieve forced and controllable C-state residency. | |
46 | ||
47 | Tests and analysis have been carried out in the areas of power, performance, | |
48 | scalability, and user experience. In many cases, a clear advantage is | |
49 | shown over taking the CPU offline or modulating the CPU clock. | |
50 | ||
51 | ||
d6d71ee4 JP |
52 | THEORY OF OPERATION |
53 | =================== | |
54 | ||
55 | Idle Injection | |
56 | -------------- | |
57 | ||
58 | On modern Intel processors (Nehalem or later), package level C-state | |
59 | residency is available in MSRs, thus also available to the kernel. | |
60 | ||
6bbe6f57 MCC |
61 | These MSRs are:: |
62 | ||
63 | #define MSR_PKG_C2_RESIDENCY 0x60D | |
64 | #define MSR_PKG_C3_RESIDENCY 0x3F8 | |
65 | #define MSR_PKG_C6_RESIDENCY 0x3F9 | |
66 | #define MSR_PKG_C7_RESIDENCY 0x3FA | |
d6d71ee4 JP |
67 | |
68 | If the kernel can also inject idle time to the system, then a | |
69 | closed-loop control system can be established that manages package | |
70 | level C-state. The intel_powerclamp driver is conceived as such a | |
71 | control system, where the target set point is a user-selected idle | |
72 | ratio (based on power reduction), and the error is the difference | |
73 | between the actual package level C-state residency ratio and the target idle | |
74 | ratio. | |
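As a rough illustration of the error term described above, the following Python sketch computes a package C-state residency percentage from two snapshots of the residency MSRs and the TSC, then subtracts the target idle ratio. It assumes the residency counters advance at the TSC rate and ignores counter wraparound. The function names, the sign convention of the error, and the sample values are hypothetical; this is a sketch, not the driver's code.

```python
# MSR addresses as listed in this document.
MSR_PKG_C2_RESIDENCY = 0x60D
MSR_PKG_C3_RESIDENCY = 0x3F8
MSR_PKG_C6_RESIDENCY = 0x3F9
MSR_PKG_C7_RESIDENCY = 0x3FA


def pkg_cstate_ratio(msr_prev, msr_now, tsc_prev, tsc_now):
    """Return the percentage of elapsed TSC ticks spent in package C-states.

    msr_prev and msr_now map an MSR address to its counter value.
    Assumption: the residency counters tick at the TSC rate.
    """
    tsc_delta = tsc_now - tsc_prev
    if tsc_delta <= 0:
        return 0
    msr_delta = sum(msr_now.values()) - sum(msr_prev.values())
    return 100 * msr_delta // tsc_delta


def control_error(actual_ratio, target_ratio):
    """Error term: actual residency ratio minus target idle ratio (assumed sign)."""
    return actual_ratio - target_ratio


# Hypothetical sample: 250,000 residency ticks out of 1,000,000 TSC ticks.
prev = {MSR_PKG_C2_RESIDENCY: 0, MSR_PKG_C3_RESIDENCY: 0,
        MSR_PKG_C6_RESIDENCY: 0, MSR_PKG_C7_RESIDENCY: 0}
now = {MSR_PKG_C2_RESIDENCY: 0, MSR_PKG_C3_RESIDENCY: 0,
       MSR_PKG_C6_RESIDENCY: 200_000, MSR_PKG_C7_RESIDENCY: 50_000}
actual = pkg_cstate_ratio(prev, now, 0, 1_000_000)
error = control_error(actual, target_ratio=30)
```

A negative error in this sketch means the package spent less time in C-states than requested, which is the case where more idle time would need to be injected.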
75 | ||
76 | Injection is controlled by high priority kernel threads, spawned for | |
77 | each online CPU. | |
78 | ||
79 | These kernel threads, with SCHED_FIFO class, are created to perform | |
80 | clamping actions of controlled duty ratio and duration. Each per-CPU | |
81 | thread synchronizes its idle time and duration, based on the rounding | |
82 | of jiffies, so accumulated errors can be prevented to avoid a jittery | |
83 | effect. Threads are also bound to the CPU such that they cannot be | |
84 | migrated, unless the CPU is taken offline. In this case, threads | |
85 | belonging to the offlined CPUs will be terminated immediately. | |
86 | ||
87 | Running as SCHED_FIFO at a relatively high priority also allows this | |
7852fe3a | 88 | scheme to work for both preemptible and non-preemptible kernels. |
d6d71ee4 JP |
89 | Alignment of idle time around jiffies ensures scalability for HZ |
90 | values. This effect can be better visualized using a Perf timechart. | |
91 | The following diagram shows the behavior of kernel thread | |
92 | kidle_inject/cpu. During idle injection, it runs monitor/mwait idle | |
93 | for a given "duration", then relinquishes the CPU to other tasks, | |
94 | until the next time interval. | |
95 | ||
96 | The NOHZ schedule tick is disabled during idle time, but interrupts | |
97 | are not masked. Tests show that the extra wakeups from the scheduler tick | |
98 | have a dramatic impact on the effectiveness of the powerclamp driver | |
99 | on large scale systems (Westmere system with 80 processors). | |
100 | ||
6bbe6f57 MCC |
101 | :: |
102 | ||
103 | CPU0 | |
104 | ____________ ____________ | |
105 | kidle_inject/0 | sleep | mwait | sleep | | |
106 | _________| |________| |_______ | |
107 | duration | |
108 | CPU1 | |
109 | ____________ ____________ | |
110 | kidle_inject/1 | sleep | mwait | sleep | | |
111 | _________| |________| |_______ | |
112 | ^ | |
113 | | | |
114 | | | |
115 | roundup(jiffies, interval) | |
d6d71ee4 JP |
116 | |
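The alignment shown in the diagram can be sketched in a few lines of Python. The `roundup` helper mirrors the usual integer round-up-to-multiple idiom; the jiffies values and the interval of 10 are made-up numbers chosen only to show that threads waking at different times converge on the same boundary.

```python
def roundup(x, interval):
    """Round x up to the next multiple of interval (x itself if already aligned)."""
    return ((x + interval - 1) // interval) * interval


def next_injection_start(now_jiffies, interval):
    """Jiffies value at which a per-CPU thread would begin its next idle period."""
    return roundup(now_jiffies, interval)


# Two hypothetical threads observe different jiffies values but pick the same
# start time, so their idle periods overlap.
cpu0_start = next_injection_start(1003, 10)
cpu1_start = next_injection_start(1007, 10)
```

Because every thread derives its start time from the shared jiffies counter rather than from its own previous wakeup, timing errors do not accumulate from one interval to the next.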
117 | Only one CPU is allowed to collect statistics and update global | |
118 | control parameters. This CPU is referred to as the controlling CPU in | |
119 | this document. The controlling CPU is elected at runtime, with a | |
120 | policy that favors BSP, taking into account the possibility of a CPU | |
121 | hot-plug. | |
122 | ||
123 | In terms of dynamics of the idle control system, package level idle | |
124 | time is considered largely as a non-causal system where its behavior | |
125 | cannot be based on the past or current input. Therefore, the | |
126 | intel_powerclamp driver attempts to enforce the desired idle time | |
127 | instantly as given input (target idle ratio). After injection, | |
05d0066a | 128 | powerclamp monitors the actual idle for a given time window and adjusts |
d6d71ee4 JP |
129 | the next injection accordingly to avoid over/under correction. |
130 | ||
131 | When used in a causal control system, such as a temperature control, | |
132 | it is up to the user of this driver to implement algorithms where | |
133 | past samples and outputs are included in the feedback. For example, a | |
134 | PID-based thermal controller can use the powerclamp driver to | |
135 | maintain a desired target temperature, based on integral and | |
136 | derivative gains of the past samples. | |
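A minimal sketch of such a userspace controller is shown below, assuming a textbook PID loop whose output is clamped to the 0 to 50 percent range accepted by the driver. The class name, gains, and temperatures are hypothetical, and integral windup is deliberately not handled.

```python
class PidIdleController:
    """Map a temperature reading to an idle injection percentage."""

    def __init__(self, kp, ki, kd, target_temp, max_idle=50):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self.target_temp = target_temp
        self.max_idle = max_idle
        self.integral = 0.0
        self.prev_error = None

    def update(self, temp, dt):
        """Return the idle percentage to request for this sample."""
        error = temp - self.target_temp  # positive when running too hot
        self.integral += error * dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        output = (self.kp * error
                  + self.ki * self.integral
                  + self.kd * derivative)
        # Clamp to the range the cooling device accepts.
        return int(min(max(output, 0), self.max_idle))


# Hypothetical proportional-only tuning: 2% idle per degree above 70.
controller = PidIdleController(kp=2.0, ki=0.0, kd=0.0, target_temp=70.0)
idle_pct = controller.update(temp=80.0, dt=1.0)
```

The returned value would then be written to the cooling device's cur_state file on each control period.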
137 | ||
138 | ||
139 | ||
140 | Calibration | |
141 | ----------- | |
142 | During scalability testing, it is observed that synchronized actions | |
143 | among CPUs become challenging as the number of cores grows. This is | |
144 | also true for the ability of a system to enter package level C-states. | |
145 | ||
146 | To make sure the intel_powerclamp driver scales well, online | |
147 | calibration is implemented. The goals for doing such a calibration | |
148 | are: | |
149 | ||
150 | a) determine the effective range of idle injection ratio | |
151 | b) determine the amount of compensation needed at each target ratio | |
152 | ||
153 | Compensation to each target ratio consists of two parts: | |
154 | ||
6bbe6f57 | 155 | a) steady state error compensation |
d6d71ee4 JP |
156 | This is to offset the error occurring when the system can |
157 | enter idle without extra wakeups (such as external interrupts). | |
158 | ||
159 | b) dynamic error compensation | |
160 | When an excessive amount of wakeups occurs during idle, an | |
161 | additional idle ratio can be added to quiet interrupts, by | |
162 | slowing down CPU activities. | |
163 | ||
164 | A debugfs file is provided for the user to examine compensation | |
6bbe6f57 MCC |
165 | progress and results, such as on a Westmere system:: |
166 | ||
167 | [jacob@nex01 ~]$ cat \ | |
168 | /sys/kernel/debug/intel_powerclamp/powerclamp_calib | |
169 | controlling cpu: 0 | |
170 | pct confidence steady dynamic (compensation) | |
171 | 0 0 0 0 | |
172 | 1 1 0 0 | |
173 | 2 1 1 0 | |
174 | 3 3 1 0 | |
175 | 4 3 1 0 | |
176 | 5 3 1 0 | |
177 | 6 3 1 0 | |
178 | 7 3 1 0 | |
179 | 8 3 1 0 | |
180 | ... | |
181 | 30 3 2 0 | |
182 | 31 3 2 0 | |
183 | 32 3 1 0 | |
184 | 33 3 2 0 | |
185 | 34 3 1 0 | |
186 | 35 3 2 0 | |
187 | 36 3 1 0 | |
188 | 37 3 2 0 | |
189 | 38 3 1 0 | |
190 | 39 3 2 0 | |
191 | 40 3 3 0 | |
192 | 41 3 1 0 | |
193 | 42 3 2 0 | |
194 | 43 3 1 0 | |
195 | 44 3 1 0 | |
196 | 45 3 2 0 | |
197 | 46 3 3 0 | |
198 | 47 3 0 0 | |
199 | 48 3 2 0 | |
200 | 49 3 3 0 | |
d6d71ee4 JP |
201 | |
202 | Calibration occurs during runtime. No offline method is available. | |
203 | Steady state compensation is used only when confidence levels of all | |
204 | adjacent ratios have reached a satisfactory level. A confidence level | |
205 | is accumulated based on clean data collected at runtime. Data | |
206 | collected during a period without extra interrupts is considered | |
207 | clean. | |
208 | ||
209 | To compensate for an excessive number of wakeups during idle, additional | |
210 | idle time is injected when such a condition is detected. Currently, | |
211 | we have a simple algorithm to double the injection ratio. A possible | |
212 | enhancement might be to throttle the offending IRQ, such as delaying | |
213 | EOI for level triggered interrupts. But it is a challenge to be | |
214 | non-intrusive to the scheduler or the IRQ core code. | |
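The two compensation rules above can be sketched as follows. This is only an illustration under several assumptions: the confidence threshold of 3 is inferred from the sample debugfs output, missing neighbor entries are skipped, and the result is capped at 50 percent. The names are hypothetical and the driver's actual bookkeeping may differ.

```python
CONFIDENCE_OK = 3    # assumed threshold, inferred from the sample table
MAX_IDLE_RATIO = 50  # assumed cap on the compensated ratio


def steady_compensation(calib, ratio):
    """Return steady-state compensation for ratio, or 0 if data is not trusted.

    calib maps a target ratio to a (confidence, steady_comp) tuple.
    Compensation is applied only when the ratio and its adjacent ratios
    have all reached the confidence threshold.
    """
    if ratio not in calib:
        return 0
    neighbors = [r for r in (ratio - 1, ratio, ratio + 1) if r in calib]
    if any(calib[r][0] < CONFIDENCE_OK for r in neighbors):
        return 0
    return calib[ratio][1]


def compensated_ratio(calib, target, excessive_wakeups):
    """Apply steady-state compensation, then double on excessive wakeups."""
    ratio = target + steady_compensation(calib, target)
    if excessive_wakeups:
        ratio *= 2  # the simple doubling rule described above
    return min(ratio, MAX_IDLE_RATIO)


# A few rows copied from the sample calibration output.
calib = {1: (1, 0), 2: (1, 1), 3: (3, 1), 4: (3, 1), 5: (3, 1),
         30: (3, 2), 31: (3, 2), 32: (3, 1)}
quiet = compensated_ratio(calib, 31, excessive_wakeups=False)
noisy = compensated_ratio(calib, 4, excessive_wakeups=True)
```

With these rows, a target of 3 receives no steady-state compensation because its neighbor at 2 has not yet reached the assumed confidence threshold.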
215 | ||
216 | ||
217 | CPU Online/Offline | |
218 | ------------------ | |
219 | Per-CPU kernel threads are started/stopped upon receiving | |
220 | notifications of CPU hotplug activities. The intel_powerclamp driver | |
221 | keeps track of clamping kernel threads, even after they are migrated | |
222 | to other CPUs, after a CPU offline event. | |
223 | ||
224 | ||
d6d71ee4 | 225 | Performance Analysis |
6bbe6f57 | 226 | ==================== |
d6d71ee4 JP |
227 | This section describes the general performance data collected on |
228 | multiple systems, including Westmere (80P) and Ivy Bridge (4P, 8P). | |
229 | ||
230 | Effectiveness and Limitations | |
231 | ----------------------------- | |
232 | The maximum idle injection ratio is capped at 50 | |
233 | percent. As mentioned earlier, since interrupts are allowed during | |
234 | forced idle time, excessive interrupts could result in less | |
235 | effectiveness. The extreme case would be doing a ping -f to generate | |
236 | flooded network interrupts without much CPU acknowledgement. In this | |
237 | case, little can be done from the idle injection threads. In most | |
238 | normal cases, such as copying a large file with scp, applications can be throttled | |
239 | by the powerclamp driver, since slowing down the CPU also slows down | |
240 | network protocol processing, which in turn reduces interrupts. | |
241 | ||
242 | When the controlling CPU changes control parameters at runtime, it | |
243 | may take an additional period for the rest of the CPUs to catch up | |
244 | with the changes. During this time, idle injection is out of sync, | |
245 | thus not able to enter package C-states at the expected ratio. But | |
246 | this effect is minor, in that in most cases change to the target | |
247 | ratio is updated much less frequently than the idle injection | |
248 | frequency. | |
249 | ||
250 | Scalability | |
251 | ----------- | |
252 | Tests also show a minor, but measurable, difference between the 4P/8P | |
253 | Ivy Bridge system and the 80P Westmere server under 50% idle ratio. | |
254 | More compensation is needed on Westmere for the same amount of | |
255 | target idle ratio. The compensation also increases as the idle ratio | |
256 | gets larger. These observations are the reason for the | |
257 | calibration code. | |
258 | ||
259 | On the IVB 8P system, compared to taking a CPU offline, powerclamp can | |
260 | achieve up to 40% better performance per watt (measured by a spin | |
261 | counter summed over per-CPU counting threads spawned for all running | |
262 | CPUs). | |
263 | ||
d6d71ee4 JP |
264 | Usage and Interfaces |
265 | ==================== | |
266 | The powerclamp driver is registered to the generic thermal layer as a | |
6bbe6f57 | 267 | cooling device. Currently, it’s not bound to any thermal zones:: |
d6d71ee4 | 268 | |
6bbe6f57 MCC |
269 | jacob@chromoly:/sys/class/thermal/cooling_device14$ grep . * |
270 | cur_state:0 | |
271 | max_state:50 | |
272 | type:intel_powerclamp | |
d6d71ee4 | 273 | |
d7335056 JP |
274 | cur_state allows the user to set the desired idle percentage. Writing 0 to |
275 | cur_state will stop idle injection. Writing a value between 1 and | |
276 | max_state will start the idle injection. Reading cur_state returns the | |
277 | actual and current idle percentage. This may not be the same value | |
278 | set by the user, because the current idle percentage depends on workload | |
279 | and includes natural idle. When idle injection is disabled, reading | |
280 | cur_state returns -1 instead of 0, to avoid confusing the | |
281 | 100% busy state with the disabled state. | |
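The read and write semantics above could be wrapped in a small helper like the one below. It is a sketch only: the helper names are hypothetical, the cooling device directory must be supplied by the caller because its index varies between systems, and writing to the real sysfs file requires root.

```python
from pathlib import Path


def set_idle_percent(cooling_dev, percent, max_state=50):
    """Write the desired idle percentage; 0 stops idle injection."""
    if not 0 <= percent <= max_state:
        raise ValueError(f"percent must be between 0 and {max_state}")
    (Path(cooling_dev) / "cur_state").write_text(f"{percent}\n")


def read_idle_percent(cooling_dev):
    """Return the current idle percentage, or None if injection is disabled.

    Per the description above, -1 means idle injection is off, which is
    distinct from 0 (injection on, but the system is 100% busy).
    """
    value = int((Path(cooling_dev) / "cur_state").read_text().strip())
    return None if value == -1 else value


# Example (not executed here; the device index is system-specific):
#   set_idle_percent("/sys/class/thermal/cooling_device14", 25)
#   current = read_idle_percent("/sys/class/thermal/cooling_device14")
```

Returning None for the disabled state keeps callers from treating -1 as a percentage.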
282 | ||
d6d71ee4 | 283 | Example usage: |
6bbe6f57 MCC |
284 | - To inject 25% idle time:: |
285 | ||
286 | $ sudo sh -c "echo 25 > /sys/class/thermal/cooling_device80/cur_state" |
d6d71ee4 JP |
287 | |
288 | If the system is not busy and has more than 25% idle time already, | |
289 | then the powerclamp driver will not start idle injection. Running top | |
290 | will not show idle injection kernel threads. | |
291 | ||
292 | If the system is busy (spin test below) and has less than 25% natural | |
d7335056 JP |
293 | idle time, powerclamp kernel threads will do idle injection. Forced |
294 | idle time is accounted as normal idle, because it takes the same code | |
295 | path as the idle task. | |
296 | ||
297 | In this example, 24.1% idle is shown. This helps the system admin or | |
6bbe6f57 MCC |
298 | user determine the cause of a slowdown when the powerclamp driver is in action:: |
299 | ||
300 | ||
301 | Tasks: 197 total, 1 running, 196 sleeping, 0 stopped, 0 zombie | |
302 | Cpu(s): 71.2%us, 4.7%sy, 0.0%ni, 24.1%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st | |
303 | Mem: 3943228k total, 1689632k used, 2253596k free, 74960k buffers | |
304 | Swap: 4087804k total, 0k used, 4087804k free, 945336k cached | |
305 | ||
306 | PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND | |
307 | 3352 jacob 20 0 262m 644 428 S 286 0.0 0:17.16 spin | |
308 | 3341 root -51 0 0 0 0 D 25 0.0 0:01.62 kidle_inject/0 | |
309 | 3344 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/3 | |
310 | 3342 root -51 0 0 0 0 D 25 0.0 0:01.61 kidle_inject/1 | |
311 | 3343 root -51 0 0 0 0 D 25 0.0 0:01.60 kidle_inject/2 | |
312 | 2935 jacob 20 0 696m 125m 35m S 5 3.3 0:31.11 firefox | |
313 | 1546 root 20 0 158m 20m 6640 S 3 0.5 0:26.97 Xorg | |
314 | 2100 jacob 20 0 1223m 88m 30m S 3 2.3 0:23.68 compiz | |
d6d71ee4 JP |
315 | |
316 | Tests have shown that by using the powerclamp driver as a cooling | |
317 | device, a PID based userspace thermal controller can manage to | |
318 | control CPU temperature effectively, when no other thermal influence | |
319 | is added. For example, an Ultrabook user can compile the kernel under | |
320 | a certain temperature (below most active trip points). |