.. SPDX-License-Identifier: GPL-2.0
.. include:: <isonum.txt>

.. |intel_pstate| replace:: :doc:`intel_pstate <intel_pstate>`

=======================
CPU Performance Scaling
=======================

:Copyright: |copy| 2017 Intel Corporation

:Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>


The Concept of CPU Performance Scaling
======================================

The majority of modern processors are capable of operating in a number of
different clock frequency and voltage configurations, often referred to as
Operating Performance Points or P-states (in ACPI terminology).  As a rule,
the higher the clock frequency and the higher the voltage, the more instructions
can be retired by the CPU over a unit of time, but also the higher the clock
frequency and the higher the voltage, the more energy is consumed over a unit of
time (or the more power is drawn) by the CPU in the given P-state.  Therefore
there is a natural tradeoff between the CPU capacity (the number of instructions
that can be executed over a unit of time) and the power drawn by the CPU.

In some situations it is desirable or even necessary to run the program as fast
as possible and then there is no reason to use any P-states different from the
highest one (i.e. the highest-performance frequency/voltage configuration
available).  In some other cases, however, it may not be necessary to execute
instructions so quickly and maintaining the highest available CPU capacity for a
relatively long time without utilizing it entirely may be regarded as wasteful.
It also may not be physically possible to maintain maximum CPU capacity for too
long for thermal or power supply capacity reasons or similar.  To cover those
cases, there are hardware interfaces allowing CPUs to be switched between
different frequency/voltage configurations or (in the ACPI terminology) to be
put into different P-states.

Typically, they are used along with algorithms to estimate the required CPU
capacity, so as to decide which P-states to put the CPUs into.  Of course, since
the utilization of the system generally changes over time, that has to be done
repeatedly on a regular basis.  The activity by which this happens is referred
to as CPU performance scaling or CPU frequency scaling (because it involves
adjusting the CPU clock frequency).


CPU Performance Scaling in Linux
================================

The Linux kernel supports CPU performance scaling by means of the ``CPUFreq``
(CPU Frequency scaling) subsystem that consists of three layers of code: the
core, scaling governors and scaling drivers.

The ``CPUFreq`` core provides the common code infrastructure and user space
interfaces for all platforms that support CPU performance scaling.  It defines
the basic framework in which the other components operate.

Scaling governors implement algorithms to estimate the required CPU capacity.
As a rule, each governor implements one, possibly parametrized, scaling
algorithm.

Scaling drivers talk to the hardware.  They provide scaling governors with
information on the available P-states (or P-state ranges in some cases) and
access platform-specific hardware interfaces to change CPU P-states as requested
by scaling governors.

In principle, all available scaling governors can be used with every scaling
driver.  That design is based on the observation that the information used by
performance scaling algorithms for P-state selection can be represented in a
platform-independent form in the majority of cases, so it should be possible
to use the same performance scaling algorithm implemented in exactly the same
way regardless of which scaling driver is used.  Consequently, the same set of
scaling governors should be suitable for every supported platform.

However, that observation may not hold for performance scaling algorithms
based on information provided by the hardware itself, for example through
feedback registers, as that information is typically specific to the hardware
interface it comes from and may not be easily represented in an abstract,
platform-independent way.  For this reason, ``CPUFreq`` allows scaling drivers
to bypass the governor layer and implement their own performance scaling
algorithms.  That is done by the |intel_pstate| scaling driver.


``CPUFreq`` Policy Objects
==========================

In some cases the hardware interface for P-state control is shared by multiple
CPUs.  That is, for example, the same register (or set of registers) is used to
control the P-state of multiple CPUs at the same time and writing to it affects
all of those CPUs simultaneously.

Sets of CPUs sharing hardware P-state control interfaces are represented by
``CPUFreq`` as struct cpufreq_policy objects.  For consistency,
struct cpufreq_policy is also used when there is only one CPU in the given
set.

The ``CPUFreq`` core maintains a pointer to a struct cpufreq_policy object for
every CPU in the system, including CPUs that are currently offline.  If multiple
CPUs share the same hardware P-state control interface, all of the pointers
corresponding to them point to the same struct cpufreq_policy object.

``CPUFreq`` uses struct cpufreq_policy as its basic data type and the design
of its user space interface is based on the policy concept.


CPU Initialization
==================

First of all, a scaling driver has to be registered for ``CPUFreq`` to work.
It is only possible to register one scaling driver at a time, so the scaling
driver is expected to be able to handle all CPUs in the system.

The scaling driver may be registered before or after CPU registration.  If
CPUs are registered earlier, the driver core invokes the ``CPUFreq`` core to
take note of all of the already registered CPUs during the registration of the
scaling driver.  In turn, if any CPUs are registered after the registration of
the scaling driver, the ``CPUFreq`` core will be invoked to take note of them
at their registration time.

In any case, the ``CPUFreq`` core is invoked to take note of any logical CPU it
has not seen so far as soon as it is ready to handle that CPU.  [Note that the
logical CPU may be a physical single-core processor, or a single core in a
multicore processor, or a hardware thread in a physical processor or processor
core.  In what follows "CPU" always means "logical CPU" unless explicitly stated
otherwise and the word "processor" is used to refer to the physical part
possibly including multiple logical CPUs.]

Once invoked, the ``CPUFreq`` core checks if the policy pointer is already set
for the given CPU and if so, it skips the policy object creation.  Otherwise,
a new policy object is created and initialized, which involves the creation of
a new policy directory in ``sysfs``, and the policy pointer corresponding to
the given CPU is set to the new policy object's address in memory.

Next, the scaling driver's ``->init()`` callback is invoked with the policy
pointer of the new CPU passed to it as the argument.  That callback is expected
to initialize the performance scaling hardware interface for the given CPU (or,
more precisely, for the set of CPUs sharing the hardware interface it belongs
to, represented by its policy object) and, if the policy object it has been
called for is new, to set parameters of the policy, like the minimum and maximum
frequencies supported by the hardware, the table of available frequencies (if
the set of supported P-states is not a continuous range), and the mask of CPUs
that belong to the same policy (including both online and offline CPUs).  That
mask is then used by the core to populate the policy pointers for all of the
CPUs in it.

The next major initialization step for a new policy object is to attach a
scaling governor to it (to begin with, that is the default scaling governor
determined by the kernel command line or configuration, but it may be changed
later via ``sysfs``).  First, a pointer to the new policy object is passed to
the governor's ``->init()`` callback which is expected to initialize all of the
data structures necessary to handle the given policy and, possibly, to add
a governor ``sysfs`` interface to it.  Next, the governor is started by
invoking its ``->start()`` callback.

That callback is expected to register per-CPU utilization update callbacks for
all of the online CPUs belonging to the given policy with the CPU scheduler.
The utilization update callbacks will be invoked by the CPU scheduler on
important events, like task enqueue and dequeue, on every iteration of the
scheduler tick or generally whenever the CPU utilization may change (from the
scheduler's perspective).  They are expected to carry out computations needed
to determine the P-state to use for the given policy going forward and to
invoke the scaling driver to make changes to the hardware in accordance with
the P-state selection.  The scaling driver may be invoked directly from
scheduler context or asynchronously, via a kernel thread or workqueue, depending
on the configuration and capabilities of the scaling driver and the governor.

Similar steps are taken for policy objects that are not new, but were "inactive"
previously, meaning that all of the CPUs belonging to them were offline.  The
only practical difference in that case is that the ``CPUFreq`` core will attempt
to use the scaling governor previously used with the policy that became
"inactive" (and is re-initialized now) instead of the default governor.

In turn, if a previously offline CPU is being brought back online, but some
other CPUs sharing the policy object with it are online already, there is no
need to re-initialize the policy object at all.  In that case, it is only
necessary to restart the scaling governor so that it can take the new online CPU
into account.  That is achieved by invoking the governor's ``->stop()`` and
``->start()`` callbacks, in this order, for the entire policy.

As mentioned before, the |intel_pstate| scaling driver bypasses the scaling
governor layer of ``CPUFreq`` and provides its own P-state selection algorithms.
Consequently, if |intel_pstate| is used, scaling governors are not attached to
new policy objects.  Instead, the driver's ``->setpolicy()`` callback is invoked
to register per-CPU utilization update callbacks for each policy.  These
callbacks are invoked by the CPU scheduler in the same way as for scaling
governors, but in the |intel_pstate| case they both determine the P-state to
use and change the hardware configuration accordingly in one go from scheduler
context.

The policy objects created during CPU initialization and other data structures
associated with them are torn down when the scaling driver is unregistered
(which happens when the kernel module containing it is unloaded, for example) or
when the last CPU belonging to the given policy is unregistered.


Policy Interface in ``sysfs``
=============================

During the initialization of the kernel, the ``CPUFreq`` core creates a
``sysfs`` directory (kobject) called ``cpufreq`` under
:file:`/sys/devices/system/cpu/`.

That directory contains a ``policyX`` subdirectory (where ``X`` represents an
integer number) for every policy object maintained by the ``CPUFreq`` core.
Each ``policyX`` directory is pointed to by ``cpufreq`` symbolic links
under :file:`/sys/devices/system/cpu/cpuY/` (where ``Y`` represents an integer
that may be different from the one represented by ``X``) for all of the CPUs
associated with (or belonging to) the given policy.  The ``policyX`` directories
in :file:`/sys/devices/system/cpu/cpufreq` each contain policy-specific
attributes (files) to control ``CPUFreq`` behavior for the corresponding policy
objects (that is, for all of the CPUs associated with them).
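
The relationship between the ``cpuY`` and ``policyX`` directories can be
checked from a shell.  What follows is only a minimal sketch: the helper name
``policy_of_cpu`` is hypothetical, and the root directory is passed as an
argument (an assumption made for illustration) so that the function can be
tried on a scratch directory as well as on :file:`/sys/devices/system/cpu`.

```shell
# Print the name of the policy directory that CPU number $2 belongs to,
# by resolving the "cpufreq" symbolic link in its cpuY directory.
# $1 is the root directory (normally /sys/devices/system/cpu).
policy_of_cpu() {
    link="$1/cpu$2/cpufreq"
    # No usable link means no policy is associated with that CPU.
    [ -e "$link" ] || return 1
    # "cd -P" follows the symbolic link to the real policyX directory.
    ( cd -P "$link" && basename "$(pwd -P)" )
}
```

For example, ``policy_of_cpu /sys/devices/system/cpu 0`` prints the name of
the policy directory CPU 0 is linked to, provided that a scaling driver is
registered.  CPUs sharing a policy print the same name.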

Some of those attributes are generic.  They are created by the ``CPUFreq`` core
and their behavior generally does not depend on what scaling driver is in use
and what scaling governor is attached to the given policy.  Some scaling drivers
also add driver-specific attributes to the policy directories in ``sysfs`` to
control policy-specific aspects of driver behavior.

The generic attributes under :file:`/sys/devices/system/cpu/cpufreq/policyX/`
are the following:

``affected_cpus``
    List of online CPUs belonging to this policy (i.e. sharing the hardware
    performance scaling interface represented by the ``policyX`` policy
    object).

``bios_limit``
    If the platform firmware (BIOS) tells the OS to apply an upper limit to
    CPU frequencies, that limit will be reported through this attribute (if
    present).

    The existence of the limit may be a result of some (often unintentional)
    BIOS settings, restrictions coming from a service processor or other
    BIOS/HW-based mechanisms.

    This does not cover ACPI thermal limitations which can be discovered
    through a generic thermal driver.

    This attribute is not present if the scaling driver in use does not
    support it.

``cpuinfo_cur_freq``
    Current frequency of the CPUs belonging to this policy as obtained from
    the hardware (in kHz).

    This is expected to be the frequency the hardware actually runs at.
    If that frequency cannot be determined, this attribute should not
    be present.

``cpuinfo_max_freq``
    Maximum possible operating frequency the CPUs belonging to this policy
    can run at (in kHz).

``cpuinfo_min_freq``
    Minimum possible operating frequency the CPUs belonging to this policy
    can run at (in kHz).

``cpuinfo_transition_latency``
    The time it takes to switch the CPUs belonging to this policy from one
    P-state to another, in nanoseconds.

    If unknown or if known to be so high that the scaling driver does not
    work with the `ondemand`_ governor, -1 (:c:macro:`CPUFREQ_ETERNAL`)
    will be returned by reads from this attribute.

``related_cpus``
    List of all (online and offline) CPUs belonging to this policy.

``scaling_available_governors``
    List of ``CPUFreq`` scaling governors present in the kernel that can
    be attached to this policy or (if the |intel_pstate| scaling driver is
    in use) list of scaling algorithms provided by the driver that can be
    applied to this policy.

    [Note that some governors are modular and it may be necessary to load a
    kernel module for the governor held by it to become available and be
    listed by this attribute.]

``scaling_cur_freq``
    Current frequency of all of the CPUs belonging to this policy (in kHz).

    In the majority of cases, this is the frequency of the last P-state
    requested by the scaling driver from the hardware using the scaling
    interface provided by it, which may or may not reflect the frequency
    the CPU is actually running at (due to hardware design and other
    limitations).

    Some architectures (e.g. ``x86``) may attempt to provide information
    more precisely reflecting the current CPU frequency through this
    attribute, but that still may not be the exact current CPU frequency as
    seen by the hardware at the moment.

``scaling_driver``
    The scaling driver currently in use.

``scaling_governor``
    The scaling governor currently attached to this policy or (if the
    |intel_pstate| scaling driver is in use) the scaling algorithm
    provided by the driver that is currently applied to this policy.

    This attribute is read-write and writing to it will cause a new scaling
    governor to be attached to this policy or a new scaling algorithm
    provided by the scaling driver to be applied to it (in the
    |intel_pstate| case), as indicated by the string written to this
    attribute (which must be one of the names listed by the
    ``scaling_available_governors`` attribute described above).

``scaling_max_freq``
    Maximum frequency the CPUs belonging to this policy are allowed to be
    running at (in kHz).

    This attribute is read-write and writing a string representing an
    integer to it will cause a new limit to be set (it must not be lower
    than the value of the ``scaling_min_freq`` attribute).

``scaling_min_freq``
    Minimum frequency the CPUs belonging to this policy are allowed to be
    running at (in kHz).

    This attribute is read-write and writing a string representing a
    non-negative integer to it will cause a new limit to be set (it must not
    be higher than the value of the ``scaling_max_freq`` attribute).

``scaling_setspeed``
    This attribute is functional only if the `userspace`_ scaling governor
    is attached to the given policy.

    It returns the last frequency requested by the governor (in kHz) or can
    be written to in order to set a new frequency for the policy.
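
The ordering constraint between the ``scaling_min_freq`` and
``scaling_max_freq`` limits can be checked in user space before a new value is
written.  The sketch below is an illustration only: ``set_max_freq`` is a
hypothetical helper, it takes the policy directory as its first argument, and
it says nothing about how the kernel itself treats a value that violates the
constraint.

```shell
# Set a new scaling_max_freq (in kHz) for the policy directory $1,
# refusing values below the current scaling_min_freq.
set_max_freq() {
    dir=$1 new_max=$2
    cur_min=$(cat "$dir/scaling_min_freq") || return 1
    if [ "$new_max" -lt "$cur_min" ]; then
        echo "refusing $new_max kHz: below scaling_min_freq ($cur_min kHz)" >&2
        return 1
    fi
    # Writing the attribute requires sufficient privileges on a real system.
    echo "$new_max" > "$dir/scaling_max_freq"
}
```

A matching helper for ``scaling_min_freq`` would compare the new value against
``scaling_max_freq`` in the same way.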


Generic Scaling Governors
=========================

``CPUFreq`` provides generic scaling governors that can be used with all
scaling drivers.  As stated before, each of them implements a single, possibly
parametrized, performance scaling algorithm.

Scaling governors are attached to policy objects and different policy objects
can be handled by different scaling governors at the same time (although that
may lead to suboptimal results in some cases).

The scaling governor for a given policy object can be changed at any time with
the help of the ``scaling_governor`` policy attribute in ``sysfs``.
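
As a hedged illustration of changing the governor, the sketch below writes a
governor name to ``scaling_governor`` only if that name is listed by
``scaling_available_governors``.  The helper name ``set_governor`` is
hypothetical, and the list is assumed to be whitespace-separated.

```shell
# Attach governor $2 to the policy directory $1, but only if it is listed
# in that policy's scaling_available_governors attribute.
set_governor() {
    dir=$1 wanted=$2
    for name in $(cat "$dir/scaling_available_governors"); do
        if [ "$name" = "$wanted" ]; then
            # Writing the attribute requires sufficient privileges.
            echo "$wanted" > "$dir/scaling_governor"
            return 0
        fi
    done
    echo "governor $wanted is not available for $dir" >&2
    return 1
}
```

On a real system the call might look like
``set_governor /sys/devices/system/cpu/cpufreq/policy0 schedutil``, where
``policy0`` stands for whichever policy directory is present.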

Some governors expose ``sysfs`` attributes to control or fine-tune the scaling
algorithms implemented by them.  Those attributes, referred to as governor
tunables, can be either global (system-wide) or per-policy, depending on the
scaling driver in use.  If the driver requires governor tunables to be
per-policy, they are located in a subdirectory of each policy directory.
Otherwise, they are located in a subdirectory under
:file:`/sys/devices/system/cpu/cpufreq/`.  In either case the name of the
subdirectory containing the governor tunables is the name of the governor
providing them.
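
Because the location of the tunables depends on the scaling driver, a script
may need to look in both places.  The following sketch shows one way to do
that.  The helper name ``tunables_dir`` and its arguments are assumptions made
for illustration.

```shell
# Print the directory holding the tunables of governor $3 for policy $2.
# $1 is the cpufreq directory (normally /sys/devices/system/cpu/cpufreq).
tunables_dir() {
    base=$1 policy=$2 governor=$3
    if [ -d "$base/$policy/$governor" ]; then
        echo "$base/$policy/$governor"      # per-policy tunables
    elif [ -d "$base/$governor" ]; then
        echo "$base/$governor"              # global (system-wide) tunables
    else
        return 1                            # no tunables directory found
    fi
}
```

The per-policy location is checked first, so the result is the directory that
applies to the given policy in either arrangement.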

``performance``
---------------

When attached to a policy object, this governor causes the highest frequency,
within the ``scaling_max_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``performance`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``powersave``
-------------

When attached to a policy object, this governor causes the lowest frequency,
within the ``scaling_min_freq`` policy limit, to be requested for that policy.

The request is made once at the time the governor for the policy is set to
``powersave`` and whenever the ``scaling_max_freq`` or ``scaling_min_freq``
policy limits change after that.

``userspace``
-------------

This governor does not do anything by itself.  Instead, it allows user space
to set the CPU frequency for the policy it is attached to by writing to the
``scaling_setspeed`` attribute of that policy.

``schedutil``
-------------

This governor uses CPU utilization data available from the CPU scheduler.  It
generally is regarded as a part of the CPU scheduler, so it can access the
scheduler's internal data structures directly.

It runs entirely in scheduler context, although in some cases it may need to
invoke the scaling driver asynchronously when it decides that the CPU frequency
should be changed for a given policy (that depends on whether or not the driver
is capable of changing the CPU frequency from scheduler context).

The actions of this governor for a particular CPU depend on the scheduling class
invoking its utilization update callback for that CPU.  If it is invoked by the
RT or deadline scheduling classes, the governor will increase the frequency to
the allowed maximum (that is, the ``scaling_max_freq`` policy limit).  In turn,
if it is invoked by the CFS scheduling class, the governor will use the
Per-Entity Load Tracking (PELT) metric for the root control group of the
given CPU as the CPU utilization estimate (see the *Per-entity load tracking*
LWN.net article [1]_ for a description of the PELT mechanism).  Then, the new
CPU frequency to apply is computed in accordance with the formula

    f = 1.25 * ``f_0`` * ``util`` / ``max``

where ``util`` is the PELT number, ``max`` is the theoretical maximum of
``util``, and ``f_0`` is either the maximum possible CPU frequency for the given
policy (if the PELT number is frequency-invariant), or the current CPU frequency
(otherwise).
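
To make the formula concrete, the sketch below evaluates it in integer
arithmetic.  It is not the governor's code: the function name
``schedutil_target`` is hypothetical, and writing 1.25 * ``f_0`` as
``f_0`` + ``f_0`` / 4 is a choice made here only to avoid fractions in the
shell.

```shell
# Integer form of f = 1.25 * f_0 * util / max.
# $1 = f_0 in kHz, $2 = util, $3 = max (the theoretical maximum of util).
schedutil_target() {
    f0=$1 util=$2 max=$3
    # 1.25 * f_0 is computed as f_0 + f_0 / 4 to stay in integer arithmetic.
    echo $(( (f0 + f0 / 4) * util / max ))
}
```

With ``f_0`` equal to 2000000 kHz, a ``util`` of 512 out of 1024 gives
1250000 kHz.  Because of the 1.25 factor the result reaches ``f_0`` once
``util`` is 80% of ``max``, and the sketch does not apply the policy limits to
larger results.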

This governor also employs a mechanism allowing it to temporarily bump up the
CPU frequency for tasks that have been waiting on I/O most recently, called
"IO-wait boosting".  That happens when the :c:macro:`SCHED_CPUFREQ_IOWAIT` flag
is passed by the scheduler to the governor callback which causes the frequency
to go up to the allowed maximum immediately and then draw back to the value
returned by the above formula over time.

This governor exposes only one tunable:

``rate_limit_us``
    Minimum time (in microseconds) that has to pass between two consecutive
    runs of governor computations (default: 1000 times the scaling driver's
    transition latency).

    The purpose of this tunable is to reduce the scheduler context overhead
    of the governor which might be excessive without it.

This governor generally is regarded as a replacement for the older `ondemand`_
and `conservative`_ governors (described below), as it is simpler and more
tightly integrated with the CPU scheduler, its overhead in terms of CPU context
switches and similar is less significant, and it uses the scheduler's own CPU
utilization metric, so in principle its decisions should not contradict the
decisions made by the other parts of the scheduler.

``ondemand``
------------

This governor uses CPU load as a CPU frequency selection metric.

In order to estimate the current CPU load, it measures the time elapsed between
consecutive invocations of its worker routine and computes the fraction of that
time in which the given CPU was not idle.  The ratio of the non-idle (active)
time to the total CPU time is taken as an estimate of the load.

If this governor is attached to a policy shared by multiple CPUs, the load is
estimated for all of them and the greatest result is taken as the load estimate
for the entire policy.
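
The load estimate can be illustrated with a few lines of shell arithmetic.
This is a sketch under stated assumptions, not the governor's implementation:
the function names are hypothetical, the load is expressed in whole percent,
and the elapsed and idle times are assumed to be given in the same unit.

```shell
# Load of one CPU in percent: share of the sampling interval spent non-idle.
# $1 = total time elapsed, $2 = idle time within it (same unit).
cpu_load_pct() {
    total=$1 idle=$2
    echo $(( 100 * (total - idle) / total ))
}

# Load estimate for a policy: the greatest of the per-CPU loads given.
policy_load_pct() {
    max=0
    for load in "$@"; do
        [ "$load" -gt "$max" ] && max=$load
    done
    echo "$max"
}
```

For instance, a CPU that was idle for 2500 out of 10000 time units has a load
of 75, and a policy whose CPUs show loads of 30, 75 and 10 is treated as
having a load of 75.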

The worker routine of this governor has to run in process context, so it is
invoked asynchronously (via a workqueue) and CPU P-states are updated from
there if necessary.  As a result, the scheduler context overhead from this
governor is minimal, but it causes additional CPU context switches to happen
relatively often and the CPU P-state updates triggered by it can be relatively
irregular.  Also, it affects its own CPU load metric by running code that
reduces the CPU idle time (even though the CPU idle time is only reduced very
slightly by it).

It generally selects CPU frequencies proportional to the estimated load, so that
the value of the ``cpuinfo_max_freq`` policy attribute corresponds to the load of
1 (or 100%), and the value of the ``cpuinfo_min_freq`` policy attribute
corresponds to the load of 0, unless the load exceeds a (configurable)
speedup threshold, in which case it will go straight for the highest frequency
it is allowed to use (the ``scaling_max_freq`` policy limit).
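
The selection rule can be sketched as a linear interpolation with a shortcut
above the threshold.  The function below is an illustration only: its name and
argument order are hypothetical, the load and the threshold are taken in
percent, and it does not clamp the proportional result to the policy limits.

```shell
# $1 = load in percent, $2 = speedup threshold in percent,
# $3 = cpuinfo_min_freq, $4 = cpuinfo_max_freq, $5 = scaling_max_freq (kHz).
ondemand_target() {
    load=$1 threshold=$2 fmin=$3 fmax=$4 limit=$5
    if [ "$load" -gt "$threshold" ]; then
        # Above the threshold: go straight for the highest allowed frequency.
        echo "$limit"
    else
        # Otherwise: load 0 maps to fmin and load 100 maps to fmax.
        echo $(( fmin + load * (fmax - fmin) / 100 ))
    fi
}
```

With a range of 800000 to 2000000 kHz, a threshold of 95 and a limit of
1800000 kHz, a load of 50 selects 1400000 kHz and a load of 98 selects
1800000 kHz.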
466 | ||
467 | This governor exposes the following tunables: | |
468 | ||
469 | ``sampling_rate`` | |
470 | This is how often the governor's worker routine should run, in | |
471 | microseconds. | |
472 | ||
473 | Typically, it is set to values of the order of 10000 (10 ms). Its | |
474 | default value is equal to the value of ``cpuinfo_transition_latency`` | |
475 | for each policy this governor is attached to (but since the unit here | |
476 | is greater by 1000, this means that the time represented by | |
477 | ``sampling_rate`` is 1000 times greater than the transition latency by | |
478 | default). | |
479 | ||
480 | If this tunable is per-policy, the following shell command sets the time | |
481 | represented by it to be 750 times as high as the transition latency:: | |
482 | ||
483 | # echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate | |
484 | ||
2a0e4927 RW |
485 | ``up_threshold`` |
486 | If the estimated CPU load is above this value (in percent), the governor | |
487 | will set the frequency to the maximum value allowed for the policy. | |
488 | Otherwise, the selected frequency will be proportional to the estimated | |
489 | CPU load. | |
490 | ||
491 | ``ignore_nice_load`` | |
492 | If set to 1 (default 0), it will cause the CPU load estimation code to | |
493 | treat the CPU time spent on executing tasks with "nice" levels greater | |
494 | than 0 as CPU idle time. | |
495 | ||
496 | This may be useful if there are tasks in the system that should not be | |
497 | taken into account when deciding what frequency to run the CPUs at. | |
498 | Then, to make that happen it is sufficient to increase the "nice" level | |
499 | of those tasks above 0 and set this attribute to 1. | |
500 | ||
501 | ``sampling_down_factor`` | |
502 | Temporary multiplier, between 1 (default) and 100 inclusive, to apply to | |
503 | the ``sampling_rate`` value if the CPU load goes above ``up_threshold``. | |
504 | ||
505 | This causes the next execution of the governor's worker routine (after | |
506 | setting the frequency to the allowed maximum) to be delayed, so the | |
507 | frequency stays at the maximum level for a longer time. | |
508 | ||
509 | Frequency fluctuations in some bursty workloads may be avoided this way | |
510 | at the cost of additional energy spent on maintaining the maximum CPU | |
511 | capacity. | |
512 | ||
513 | ``powersave_bias`` | |
514 | Reduction factor to apply to the original frequency target of the | |
515 | governor (including the maximum value used when the ``up_threshold`` | |
516 | value is exceeded by the estimated CPU load) or sensitivity threshold | |
517 | for the AMD frequency sensitivity powersave bias driver | |
518 | (:file:`drivers/cpufreq/amd_freq_sensitivity.c`), between 0 and 1000 | |
519 | inclusive. | |
520 | ||
521 | If the AMD frequency sensitivity powersave bias driver is not loaded, | |
522 | the effective frequency to apply is given by | |
523 | ||
524 | f * (1 - ``powersave_bias`` / 1000) | |
525 | ||
526 | where f is the governor's original frequency target. The default value | |
527 | of this attribute is 0 in that case. | |
528 | ||
529 | If the AMD frequency sensitivity powersave bias driver is loaded, the | |
530 | value of this attribute is 400 by default and it is used in a different | |
531 | way. | |
532 | ||
533 | On Family 16h (and later) AMD processors there is a mechanism to get a | |
534 | measured workload sensitivity, between 0 and 100% inclusive, from the | |
535 | hardware. That value can be used to estimate how the performance of the | |
536 | workload running on a CPU will change in response to frequency changes. | |
537 | ||
538 | The performance of a workload with the sensitivity of 0 (memory-bound or | |
539 | IO-bound) is not expected to increase at all as a result of increasing | |
540 | the CPU frequency, whereas workloads with the sensitivity of 100% | |
541 | (CPU-bound) are expected to perform much better if the CPU frequency is | |
542 | increased. | |
543 | ||
544 | If the workload sensitivity is less than the threshold represented by | |
545 | the ``powersave_bias`` value, the sensitivity powersave bias driver | |
546 | will cause the governor to select a frequency lower than its original | |
547 | target, so as to avoid over-provisioning workloads that will not benefit | |
548 | from running at higher CPU frequencies. | |
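
The formula given above for the case in which the AMD frequency sensitivity powersave bias driver is not loaded can be illustrated with a short Python sketch. The function name ``biased_frequency`` and the use of integer kHz values are assumptions made for this example, and the rounding used by the kernel may differ.

```python
def biased_frequency(target_khz: int, powersave_bias: int = 0) -> int:
    """Apply f * (1 - powersave_bias / 1000) to a frequency target.

    Hypothetical helper for illustration only, not the kernel code.
    """
    if not 0 <= powersave_bias <= 1000:
        raise ValueError("powersave_bias must be between 0 and 1000 inclusive")
    # Integer arithmetic keeps the result exact for kHz values.
    return target_khz * (1000 - powersave_bias) // 1000


if __name__ == "__main__":
    # Show the effect of a few bias values on a 2 GHz target.
    for bias in (0, 100, 1000):
        print(bias, biased_frequency(2_000_000, bias))
```

With the default value of 0 the original target is returned unchanged, and a value of 1000 reduces it to 0 in this sketch.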
549 | ||
550 | ``conservative`` | |
551 | ---------------- | |
552 | ||
553 | This governor uses CPU load as a CPU frequency selection metric. | |
554 | ||
555 | It estimates the CPU load in the same way as the `ondemand`_ governor described | |
556 | above, but the CPU frequency selection algorithm implemented by it is different. | |
557 | ||
558 | Namely, it avoids changing the frequency significantly over short time intervals, | |
559 | as such changes may not be suitable for systems with limited power supply capacity | |
560 | (e.g. battery-powered ones). To achieve that, it changes the frequency in relatively | |
561 | small steps, one step at a time, up or down, depending on whether or not a | |
562 | (configurable) threshold has been exceeded by the estimated CPU load. | |
563 | ||
564 | This governor exposes the following tunables: | |
565 | ||
566 | ``freq_step`` | |
567 | Frequency step in percent of the maximum frequency the governor is | |
568 | allowed to set (the ``scaling_max_freq`` policy limit), between 0 and | |
569 | 100 (5 by default). | |
570 | ||
571 | This is how much the frequency is allowed to change in one go. Setting | |
572 | it to 0 will cause the default frequency step (5 percent) to be used | |
573 | and setting it to 100 effectively causes the governor to periodically | |
574 | switch the frequency between the ``scaling_min_freq`` and | |
575 | ``scaling_max_freq`` policy limits. | |
576 | ||
577 | ``down_threshold`` | |
578 | Threshold value (in percent, 20 by default) used to determine the | |
579 | frequency change direction. | |
580 | ||
581 | If the estimated CPU load is greater than the ``up_threshold`` value, the | |
582 | frequency will go up (by ``freq_step``). If the load is less than the | |
583 | ``down_threshold`` value (and the ``sampling_down_factor`` mechanism is not in | |
584 | effect), the frequency will go down. Otherwise, the frequency will not be changed. | |
585 | ||
586 | ``sampling_down_factor`` | |
587 | Frequency decrease deferral factor, between 1 (default) and 10 | |
588 | inclusive. | |
589 | ||
590 | It effectively causes the frequency to go down ``sampling_down_factor`` | |
591 | times slower than it ramps up. | |
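
The interplay of the tunables described above can be sketched as a toy model in Python. This is not the kernel implementation: the class name, its fields and the kHz units are hypothetical. It also assumes that frequency increases are triggered by a separate upper threshold (``up_threshold``, with an assumed value of 80), while ``down_threshold`` controls decreases.

```python
from dataclasses import dataclass

DEFAULT_FREQ_STEP = 5  # percent, used when freq_step is set to 0


@dataclass
class ConservativeSketch:
    """Toy model of the stepping logic; hypothetical, for illustration."""

    min_khz: int                   # scaling_min_freq policy limit
    max_khz: int                   # scaling_max_freq policy limit
    cur_khz: int                   # current frequency
    freq_step: int = 5             # percent of max_khz
    up_threshold: int = 80         # assumed upper threshold (percent)
    down_threshold: int = 20       # percent
    sampling_down_factor: int = 1
    down_skip: int = 0             # samples counted towards the next decrease check

    def step_khz(self) -> int:
        # A freq_step of 0 falls back to the default step of 5 percent.
        pct = self.freq_step or DEFAULT_FREQ_STEP
        return self.max_khz * pct // 100

    def update(self, load: int) -> int:
        """Feed one load estimate (percent) and return the new frequency."""
        if load > self.up_threshold:
            self.down_skip = 0
            self.cur_khz = min(self.cur_khz + self.step_khz(), self.max_khz)
            return self.cur_khz
        # Defer decreases: only every sampling_down_factor-th sample is checked.
        self.down_skip += 1
        if self.down_skip < self.sampling_down_factor:
            return self.cur_khz
        self.down_skip = 0
        if load < self.down_threshold:
            self.cur_khz = max(self.cur_khz - self.step_khz(), self.min_khz)
        return self.cur_khz
```

In this model, setting ``freq_step`` to 100 makes every step span the whole range, so the frequency jumps between the two policy limits, and a ``sampling_down_factor`` of 2 means that only every second low-load sample can lower the frequency.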
592 | ||
593 | ||
594 | Frequency Boost Support | |
595 | ======================= | |
596 | ||
597 | Background | |
598 | ---------- | |
599 | ||
600 | Some processors support a mechanism to raise the operating frequency of some | |
601 | cores in a multicore package temporarily (and above the sustainable frequency | |
602 | threshold for the whole package) under certain conditions, for example if the | |
603 | whole chip is not fully utilized and below its intended thermal or power budget. | |
604 | ||
605 | Different names are used by different vendors to refer to this functionality. | |
606 | For Intel processors it is referred to as "Turbo Boost", AMD calls it | |
607 | "Turbo-Core" or (in technical documentation) "Core Performance Boost" and so on. | |
608 | As a rule, it also is implemented differently by different vendors. The simple | |
609 | term "frequency boost" is used here for brevity to refer to all of those | |
610 | implementations. | |
611 | ||
612 | The frequency boost mechanism may be either hardware-based or software-based. | |
613 | If it is hardware-based (e.g. on x86), the decision to trigger the boosting is | |
614 | made by the hardware (although in general it requires the hardware to be put | |
615 | into a special state in which it can control the CPU frequency within certain | |
616 | limits). If it is software-based (e.g. on ARM), the scaling driver decides | |
617 | whether or not to trigger boosting and when to do that. | |
618 | ||
619 | The ``boost`` File in ``sysfs`` | |
620 | ------------------------------- | |
621 | ||
622 | This file is located under :file:`/sys/devices/system/cpu/cpufreq/` and controls | |
623 | the "boost" setting for the whole system. It is not present if the underlying | |
624 | scaling driver does not support the frequency boost mechanism (or supports it, | |
625 | but provides a driver-specific interface for controlling it, like | |
33fc30b4 | 626 | |intel_pstate|). |
2a0e4927 RW |
627 | |
628 | If the value in this file is 1, the frequency boost mechanism is enabled. This | |
629 | means that either the hardware can be put into states in which it is able to | |
630 | trigger boosting (in the hardware-based case), or the software is allowed to | |
631 | trigger boosting (in the software-based case). It does not mean that boosting | |
632 | is actually in use at the moment on any CPUs in the system. It only means that | |
633 | the use of the frequency boost mechanism is permitted (it still may never be used | |
634 | for other reasons). | |
635 | ||
636 | If the value in this file is 0, the frequency boost mechanism is disabled and | |
637 | cannot be used at all. | |
638 | ||
639 | The only values that can be written to this file are 0 and 1. | |
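
A minimal Python sketch of how the file described above might be accessed from user space follows. The helper names are hypothetical, and the path is a parameter so that the helpers can be tried on an ordinary file. Writing to the real file typically requires root privileges.

```python
from pathlib import Path

# Location given in the text above; overridable for experiments.
BOOST_FILE = Path("/sys/devices/system/cpu/cpufreq/boost")


def boost_supported(path: Path = BOOST_FILE) -> bool:
    """The file is absent if the scaling driver does not expose the knob."""
    return path.exists()


def read_boost(path: Path = BOOST_FILE) -> bool:
    """Return True if the file contains 1 (boosting permitted), False for 0."""
    value = path.read_text().strip()
    if value not in ("0", "1"):
        raise ValueError(f"unexpected boost value: {value!r}")
    return value == "1"


def write_boost(enabled: bool, path: Path = BOOST_FILE) -> None:
    """Write 1 or 0, the only values the file accepts."""
    path.write_text("1" if enabled else "0")
```

Note that ``read_boost()`` returning ``True`` only says that boosting is permitted, not that any CPU is currently boosted.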
640 | ||
641 | Rationale for Boost Control Knob | |
642 | -------------------------------- | |
643 | ||
644 | The frequency boost mechanism is generally intended to help to achieve optimum | |
645 | CPU performance on time scales below software resolution (e.g. below the | |
646 | scheduler tick interval) and it is demonstrably suitable for many workloads, but | |
647 | it may lead to problems in certain situations. | |
648 | ||
649 | For this reason, many systems make it possible to disable the frequency boost | |
650 | mechanism in the platform firmware (BIOS) setup, but that requires the system to | |
651 | be restarted for the setting to be adjusted as desired, which may not be | |
652 | practical at least in some cases. For example: | |
653 | ||
654 | 1. Boosting means overclocking the processor, although under controlled | |
655 | conditions. Generally, the processor's energy consumption increases | |
656 | as a result of increasing its frequency and voltage, even temporarily. | |
657 | That may not be desirable on systems that switch to power sources of | |
658 | limited capacity, such as batteries, so the ability to disable the boost | |
659 | mechanism while the system is running may help there (but that depends on | |
660 | the workload too). | |
661 | ||
662 | 2. In some situations deterministic behavior is more important than | |
663 | performance or energy consumption (or both) and the ability to disable | |
664 | boosting while the system is running may be useful then. | |
665 | ||
666 | 3. To examine the impact of the frequency boost mechanism itself, it is useful | |
667 | to be able to run tests with and without boosting, preferably without | |
668 | restarting the system in the meantime. | |
669 | ||
670 | 4. Reproducible results are important when running benchmarks. Since | |
671 | the boosting functionality depends on the load of the whole package, | |
672 | single-thread performance may vary because of it, which may sometimes lead to | |
673 | unreproducible results. That can be avoided by disabling the | |
674 | frequency boost mechanism before running benchmarks sensitive to that | |
675 | issue. | |
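
For use cases such as items 3 and 4 above, boosting can be switched off for the duration of a test run and restored afterwards. The Python sketch below shows one way this might be done. The ``boost_disabled`` name is hypothetical, and the code assumes that the global ``boost`` file exists and is writable by the caller (typically root).

```python
from contextlib import contextmanager
from pathlib import Path

# Global boost knob location, overridable for experiments.
BOOST_FILE = Path("/sys/devices/system/cpu/cpufreq/boost")


@contextmanager
def boost_disabled(path: Path = BOOST_FILE):
    """Temporarily write 0 to the boost file and restore the old value on exit."""
    previous = path.read_text().strip()
    path.write_text("0")
    try:
        yield
    finally:
        # Restore the earlier setting even if the benchmark raised an error.
        path.write_text(previous)
```

A benchmark would then be run inside a ``with boost_disabled():`` block, so the previous setting comes back even if the run fails.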
676 | ||
677 | Legacy AMD ``cpb`` Knob | |
678 | ----------------------- | |
679 | ||
680 | The AMD powernow-k8 scaling driver supports a ``sysfs`` knob very similar to | |
681 | the global ``boost`` one. It is used for disabling/enabling the "Core | |
682 | Performance Boost" feature of some AMD processors. | |
683 | ||
684 | If present, that knob is located in every ``CPUFreq`` policy directory in | |
685 | ``sysfs`` (:file:`/sys/devices/system/cpu/cpufreq/policyX/`) and is called | |
686 | ``cpb``, which suggests a more fine-grained control interface. The actual | |
687 | implementation, however, works on a system-wide basis and setting that knob | |
688 | for one policy causes the same value of it to be set for all of the other | |
689 | policies at the same time. | |
690 | ||
691 | That knob is still supported on AMD processors that support its underlying | |
692 | hardware feature, but it may be configured out of the kernel (via the | |
693 | :c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option) and the global | |
694 | ``boost`` knob is present regardless. Thus it is always possible to use the | |
695 | ``boost`` knob instead of the ``cpb`` one, which is highly recommended, as that | |
696 | is more consistent with what all of the other systems do (and the ``cpb`` knob | |
697 | may not be supported any more in the future). | |
698 | ||
699 | The ``cpb`` knob is never present for any processors without the underlying | |
700 | hardware feature (e.g. all Intel ones), even if the | |
701 | :c:macro:`CONFIG_X86_ACPI_CPUFREQ_CPB` configuration option is set. | |
702 | ||
703 | ||
1120b0f9 RW |
704 | References |
705 | ========== | |
706 | ||
707 | .. [1] Jonathan Corbet, *Per-entity load tracking*, | |
708 | https://lwn.net/Articles/531853/ |