1 | =============================================== |
2 | ``intel_pstate`` CPU Performance Scaling Driver | |
3 | =============================================== | |
4 | ||
5 | :: | |
6 | ||
7 | Copyright (c) 2017 Intel Corp., Rafael J. Wysocki <rafael.j.wysocki@intel.com> | |
8 | ||
9 | ||
10 | General Information | |
11 | =================== | |
12 | ||
13 | ``intel_pstate`` is a part of the | |
14 | :doc:`CPU performance scaling subsystem <cpufreq>` in the Linux kernel | |
15 | (``CPUFreq``). It is a scaling driver for the Sandy Bridge and later | |
16 | generations of Intel processors. Note, however, that some of those processors | |
17 | may not be supported. [To understand ``intel_pstate`` it is necessary to know | |
18 | how ``CPUFreq`` works in general, so this is the time to read :doc:`cpufreq` if | |
19 | you have not done that yet.] | |
20 | ||
21 | For the processors supported by ``intel_pstate``, the P-state concept is broader | |
22 | than just an operating frequency or an operating performance point (see the | |
23 | `LinuxCon Europe 2015 presentation by Kristen Accardi <LCEU2015_>`_ for more | |
24 | information about that). For this reason, the representation of P-states used | |
25 | by ``intel_pstate`` internally follows the hardware specification (for details | |
26 | refer to `Intel® 64 and IA-32 Architectures Software Developer’s Manual | |
27 | Volume 3: System Programming Guide <SDM_>`_). However, the ``CPUFreq`` core | |
28 | uses frequencies for identifying operating performance points of CPUs and | |
29 | frequencies are involved in the user space interface exposed by it, so | |
30 | ``intel_pstate`` maps its internal representation of P-states to frequencies too | |
31 | (fortunately, that mapping is unambiguous). At the same time, it would not be | |
32 | practical for ``intel_pstate`` to supply the ``CPUFreq`` core with a table of | |
33 | available frequencies due to the possible size of it, so the driver does not do | |
34 | that. Some functionality of the core is limited by that. | |
35 | ||
36 | Since the hardware P-state selection interface used by ``intel_pstate`` is | |
37 | available at the logical CPU level, the driver always works with individual | |
38 | CPUs. Consequently, if ``intel_pstate`` is in use, every ``CPUFreq`` policy | |
39 | object corresponds to one logical CPU and ``CPUFreq`` policies are effectively | |
40 | equivalent to CPUs. In particular, this means that they become "inactive" every | |
41 | time the corresponding CPU is taken offline and need to be re-initialized when | |
42 | it goes back online. | |
43 | ||
44 | ``intel_pstate`` is not modular, so it cannot be unloaded, which means that the | |
45 | only way to pass early-configuration-time parameters to it is via the kernel | |
46 | command line. However, its configuration can be adjusted via ``sysfs`` to a | |
47 | great extent. In some configurations it even is possible to unregister it via | |
48 | ``sysfs`` which allows another ``CPUFreq`` scaling driver to be loaded and | |
49 | registered (see `below <status_attr_>`_). | |
50 | ||
51 | ||
52 | Operation Modes | |
53 | =============== | |
54 | ||
55 | ``intel_pstate`` can operate in three different modes: in the active mode with | |
56 | or without hardware-managed P-states support and in the passive mode. Which of | |
57 | them will be in effect depends on what kernel command line options are used and | |
58 | on the capabilities of the processor. | |
59 | ||
60 | Active Mode | |
61 | ----------- | |
62 | ||
63 | This is the default operation mode of ``intel_pstate``. If it works in this | |
64 | mode, the ``scaling_driver`` policy attribute in ``sysfs`` for all ``CPUFreq`` | |
65 | policies contains the string "intel_pstate". | |
66 | ||
67 | In this mode the driver bypasses the scaling governors layer of ``CPUFreq`` and | |
68 | provides its own scaling algorithms for P-state selection. Those algorithms | |
69 | can be applied to ``CPUFreq`` policies in the same way as generic scaling | |
70 | governors (that is, through the ``scaling_governor`` policy attribute in | |
71 | ``sysfs``). [Note that different P-state selection algorithms may be chosen for | |
72 | different policies, but that is not recommended.] | |
73 | ||
74 | They are not generic scaling governors, but their names are the same as the | |
75 | names of some of those governors. Moreover, confusingly enough, they generally | |
76 | do not work in the same way as the generic governors they share the names with. | |
77 | For example, the ``powersave`` P-state selection algorithm provided by | |
78 | ``intel_pstate`` is not a counterpart of the generic ``powersave`` governor | |
79 | (roughly, it corresponds to the ``schedutil`` and ``ondemand`` governors). | |
80 | ||
81 | There are two P-state selection algorithms provided by ``intel_pstate`` in the | |
82 | active mode: ``powersave`` and ``performance``. The way they both operate | |
83 | depends on whether or not the hardware-managed P-states (HWP) feature has been | |
84 | enabled in the processor and possibly on the processor model. | |
85 | ||
86 | Which of the P-state selection algorithms is used by default depends on the | |
87 | :c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option. | |
88 | Namely, if that option is set, the ``performance`` algorithm will be used by | |
89 | default, and the other one will be used by default if it is not set. | |
90 | ||
91 | Active Mode With HWP | |
92 | ~~~~~~~~~~~~~~~~~~~~ | |
93 | ||
94 | If the processor supports the HWP feature, it will be enabled during the | |
95 | processor initialization and cannot be disabled after that. It is possible | |
96 | to avoid enabling it by passing the ``intel_pstate=no_hwp`` argument to the | |
97 | kernel in the command line. | |
98 | ||
99 | If the HWP feature has been enabled, ``intel_pstate`` relies on the processor to | |
100 | select P-states by itself, but still it can give hints to the processor's | |
101 | internal P-state selection logic. What those hints are depends on which P-state | |
102 | selection algorithm has been applied to the given policy (or to the CPU it | |
103 | corresponds to). | |
104 | ||
105 | Even though the P-state selection is carried out by the processor automatically, | |
106 | ``intel_pstate`` registers utilization update callbacks with the CPU scheduler | |
107 | in this mode. However, they are not used for running a P-state selection | |
108 | algorithm, but for periodic updates of the current CPU frequency information to | |
109 | be made available from the ``scaling_cur_freq`` policy attribute in ``sysfs``. | |
110 | ||
111 | HWP + ``performance`` | |
112 | ..................... | |
113 | ||
114 | In this configuration ``intel_pstate`` will write 0 to the processor's | |
115 | Energy-Performance Preference (EPP) knob (if supported) or its | |
116 | Energy-Performance Bias (EPB) knob (otherwise), which means that the processor's | |
117 | internal P-state selection logic is expected to focus entirely on performance. | |
118 | ||
119 | This will override the EPP/EPB setting coming from the ``sysfs`` interface | |
120 | (see `Energy vs Performance Hints`_ below). | |
121 | ||
122 | Also, in this configuration the range of P-states available to the processor's | |
123 | internal P-state selection logic is always restricted to the upper boundary | |
124 | (that is, the maximum P-state that the driver is allowed to use). | |
125 | ||
126 | HWP + ``powersave`` | |
127 | ................... | |
128 | ||
129 | In this configuration ``intel_pstate`` will set the processor's | |
130 | Energy-Performance Preference (EPP) knob (if supported) or its | |
131 | Energy-Performance Bias (EPB) knob (otherwise) to whatever value it was | |
132 | previously set to via ``sysfs`` (or whatever default value it was | |
133 | set to by the platform firmware). This usually causes the processor's | |
134 | internal P-state selection logic to be less performance-focused. | |
135 | ||
136 | Active Mode Without HWP | |
137 | ~~~~~~~~~~~~~~~~~~~~~~~ | |
138 | ||
139 | This is the default operation mode for processors that do not support the HWP | |
140 | feature. It also is used by default with the ``intel_pstate=no_hwp`` argument | |
141 | in the kernel command line. However, in this mode ``intel_pstate`` may refuse | |
142 | to work with the given processor if it does not recognize it. [Note that | |
143 | ``intel_pstate`` will never refuse to work with any processor with the HWP | |
144 | feature enabled.] | |
145 | ||
146 | In this mode ``intel_pstate`` registers utilization update callbacks with the | |
147 | CPU scheduler in order to run a P-state selection algorithm, either | |
148 | ``powersave`` or ``performance``, depending on the ``scaling_governor`` policy | |
149 | setting in ``sysfs``. The current CPU frequency information to be made | |
150 | available from the ``scaling_cur_freq`` policy attribute in ``sysfs`` is | |
151 | periodically updated by those utilization update callbacks too. | |
152 | ||
153 | ``performance`` | |
154 | ............... | |
155 | ||
156 | Without HWP, this P-state selection algorithm is always the same regardless of | |
157 | the processor model and platform configuration. | |
158 | ||
159 | It selects the maximum P-state it is allowed to use, subject to limits set via | |
160 | ``sysfs``, every time the driver configuration for the given CPU is updated | |
161 | (e.g. via ``sysfs``). | |
162 | ||
163 | This is the default P-state selection algorithm if the | |
164 | :c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option | |
165 | is set. | |
166 | ||
167 | ``powersave`` | |
168 | ............. | |
169 | ||
170 | Without HWP, this P-state selection algorithm is similar to the algorithm | |
171 | implemented by the generic ``schedutil`` scaling governor except that the | |
172 | utilization metric used by it is based on numbers coming from feedback | |
173 | registers of the CPU. It generally selects P-states proportional to the | |
174 | current CPU utilization. | |
175 | ||
176 | This algorithm is run by the driver's utilization update callback for the | |
177 | given CPU when it is invoked by the CPU scheduler, but not more often than | |
178 | every 10 ms. The hardware configuration is not touched if the new P-state | |
179 | turns out to be the same as the current | |
180 | one. | |
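The rate limiting and the "proportional to utilization" rule described above can be pictured with a small model. This is only a sketch under stated assumptions, not the driver's algorithm: the class name, the use of a plain 0.0 to 1.0 utilization value (the driver derives its metric from CPU feedback registers) and the truncating arithmetic are all hypothetical.

```python
class PowersaveSketch:
    """Toy model of a rate-limited, utilization-proportional P-state picker."""

    def __init__(self, min_pstate, max_pstate, interval_ms=10):
        self.min_pstate = min_pstate
        self.max_pstate = max_pstate
        self.interval_ms = interval_ms  # "not more often than every 10 ms"
        self.last_run_ms = None
        self.current = None

    def update(self, now_ms, utilization):
        """Return the new P-state, or None if nothing is written to hardware."""
        if self.last_run_ms is not None and now_ms - self.last_run_ms < self.interval_ms:
            return None  # invoked too soon after the previous run
        self.last_run_ms = now_ms
        # Hypothetical rule: scale the maximum P-state by the utilization,
        # then clamp the result to the supported range.
        target = int(self.max_pstate * utilization)
        target = max(self.min_pstate, min(self.max_pstate, target))
        if target == self.current:
            return None  # same P-state: leave the hardware alone
        self.current = target
        return target
```

In this model a callback arriving 5 ms after the previous run is ignored, and a run that arrives at the P-state already in effect returns ``None`` instead of touching the hardware.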
181 | ||
182 | This is the default P-state selection algorithm if the | |
183 | :c:macro:`CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE` kernel configuration option | |
184 | is not set. | |
185 | ||
186 | Passive Mode | |
187 | ------------ | |
188 | ||
189 | This mode is used if the ``intel_pstate=passive`` argument is passed to the | |
190 | kernel in the command line (it implies the ``intel_pstate=no_hwp`` setting too). | |
191 | Like in the active mode without HWP support, in this mode ``intel_pstate`` may | |
192 | refuse to work with the given processor if it does not recognize it. | |
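To check which ``intel_pstate=`` arguments a system was booted with, the kernel command line can be split into tokens. The following is a minimal sketch; the helper names are hypothetical, and the rule that ``passive`` implies ``no_hwp`` is taken from the text above rather than from the driver source.

```python
PREFIX = "intel_pstate="


def intel_pstate_args(cmdline):
    """Return the values of all ``intel_pstate=`` arguments in a command line."""
    return [tok[len(PREFIX):] for tok in cmdline.split() if tok.startswith(PREFIX)]


def passive_requested(cmdline):
    """True if ``intel_pstate=passive`` is present."""
    return "passive" in intel_pstate_args(cmdline)


def hwp_allowed(cmdline):
    """True unless HWP is ruled out by ``no_hwp`` or, per the text, ``passive``."""
    args = intel_pstate_args(cmdline)
    return "no_hwp" not in args and "passive" not in args
```

On a running system the input string would typically come from ``/proc/cmdline``. Whether HWP is actually enabled also depends on processor support, which this sketch does not examine.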
193 | ||
194 | If the driver works in this mode, the ``scaling_driver`` policy attribute in | |
195 | ``sysfs`` for all ``CPUFreq`` policies contains the string "intel_cpufreq". | |
196 | Then, the driver behaves like a regular ``CPUFreq`` scaling driver. That is, | |
197 | it is invoked by generic scaling governors when necessary to talk to the | |
198 | hardware in order to change the P-state of a CPU (in particular, the | |
199 | ``schedutil`` governor can invoke it directly from scheduler context). | |
200 | ||
201 | While in this mode, ``intel_pstate`` can be used with all of the (generic) | |
202 | scaling governors listed by the ``scaling_available_governors`` policy attribute | |
203 | in ``sysfs`` (and the P-state selection algorithms described above are not | |
204 | used). Then, it is responsible for the configuration of policy objects | |
205 | corresponding to CPUs and provides the ``CPUFreq`` core (and the scaling | |
206 | governors attached to the policy objects) with accurate information on the | |
207 | maximum and minimum operating frequencies supported by the hardware (including | |
208 | the so-called "turbo" frequency ranges). In other words, in the passive mode | |
209 | the entire range of available P-states is exposed by ``intel_pstate`` to the | |
210 | ``CPUFreq`` core. However, in this mode the driver does not register | |
211 | utilization update callbacks with the CPU scheduler and the ``scaling_cur_freq`` | |
212 | information comes from the ``CPUFreq`` core (and is the last frequency selected | |
213 | by the current scaling governor for the given policy). | |
214 | ||
215 | ||
216 | .. _turbo: | |
217 | ||
218 | Turbo P-states Support | |
219 | ====================== | |
220 | ||
221 | In the majority of cases, the entire range of P-states available to | |
222 | ``intel_pstate`` can be divided into two sub-ranges that correspond to | |
223 | different types of processor behavior, above and below a boundary that | |
224 | will be referred to as the "turbo threshold" in what follows. | |
225 | ||
226 | The P-states above the turbo threshold are referred to as "turbo P-states" and | |
227 | the whole sub-range of P-states they belong to is referred to as the "turbo | |
228 | range". These names are related to the Turbo Boost technology allowing a | |
229 | multicore processor to opportunistically increase the P-state of one or more | |
230 | cores if there is enough power to do that and if that is not going to cause the | |
231 | thermal envelope of the processor package to be exceeded. | |
232 | ||
233 | Specifically, if software sets the P-state of a CPU core within the turbo range | |
234 | (that is, above the turbo threshold), the processor is permitted to take over | |
235 | performance scaling control for that core and put it into turbo P-states of its | |
236 | choice going forward. However, that permission is interpreted differently by | |
237 | different processor generations. Namely, the Sandy Bridge generation of | |
238 | processors will never use any P-states above the last one set by software for | |
239 | the given core, even if it is within the turbo range, whereas all of the later | |
240 | processor generations will take it as a license to use any P-states from the | |
241 | turbo range, even above the one set by software. In other words, on those | |
242 | processors setting any P-state from the turbo range will enable the processor | |
243 | to put the given core into all turbo P-states up to and including the maximum | |
244 | supported one as it sees fit. | |
245 | ||
246 | One important property of turbo P-states is that they are not sustainable. More | |
247 | precisely, there is no guarantee that any CPUs will be able to stay in any of | |
248 | those states indefinitely, because the power distribution within the processor | |
249 | package may change over time or the thermal envelope it was designed for might | |
250 | be exceeded if a turbo P-state was used for too long. | |
251 | ||
252 | In turn, the P-states below the turbo threshold generally are sustainable. In | |
253 | fact, if one of them is set by software, the processor is not expected to change | |
254 | it to a lower one unless in a thermal stress or a power limit violation | |
255 | situation (a higher P-state may still be used if it is set for another CPU in | |
256 | the same package at the same time, for example). | |
257 | ||
258 | Some processors allow multiple cores to be in turbo P-states at the same time, | |
259 | but the maximum P-state that can be set for them generally depends on the number | |
260 | of cores running concurrently. The maximum turbo P-state that can be set for 3 | |
261 | cores at the same time usually is lower than the analogous maximum P-state for | |
262 | 2 cores, which in turn usually is lower than the maximum turbo P-state that can | |
263 | be set for 1 core. The one-core maximum turbo P-state is thus the maximum | |
264 | supported one overall. | |
265 | ||
266 | The maximum supported turbo P-state, the turbo threshold (the maximum supported | |
267 | non-turbo P-state) and the minimum supported P-state are specific to the | |
268 | processor model and can be determined by reading the processor's model-specific | |
269 | registers (MSRs). Moreover, some processors support the Configurable TDP | |
270 | (Thermal Design Power) feature and, when that feature is enabled, the turbo | |
271 | threshold effectively becomes a configurable value that can be set by the | |
272 | platform firmware. | |
273 | ||
274 | Unlike ``_PSS`` objects in the ACPI tables, ``intel_pstate`` always exposes | |
275 | the entire range of available P-states, including the whole turbo range, to the | |
276 | ``CPUFreq`` core and (in the passive mode) to generic scaling governors. This | |
277 | generally causes turbo P-states to be set more often when ``intel_pstate`` is | |
278 | used relative to ACPI-based CPU performance scaling (see `below <acpi-cpufreq_>`_ | |
279 | for more information). | |
280 | ||
281 | Moreover, since ``intel_pstate`` always knows what the real turbo threshold is | |
282 | (even if the Configurable TDP feature is enabled in the processor), its | |
283 | ``no_turbo`` attribute in ``sysfs`` (described `below <no_turbo_attr_>`_) should | |
284 | work as expected in all cases (that is, if set to disable turbo P-states, it | |
285 | always should prevent ``intel_pstate`` from using them). | |
286 | ||
287 | ||
288 | Processor Support | |
289 | ================= | |
290 | ||
291 | To handle a given processor ``intel_pstate`` requires a number of different | |
292 | pieces of information on it to be known, including: | |
293 | ||
294 | * The minimum supported P-state. | |
295 | ||
296 | * The maximum supported `non-turbo P-state <turbo_>`_. | |
297 | ||
298 | * Whether or not turbo P-states are supported at all. | |
299 | ||
300 | * The maximum supported `one-core turbo P-state <turbo_>`_ (if turbo P-states | |
301 | are supported). | |
302 | ||
303 | * The scaling formula to translate the driver's internal representation | |
304 | of P-states into frequencies and the other way around. | |
305 | ||
306 | Generally, ways to obtain that information are specific to the processor model | |
307 | or family. Although it often is possible to obtain all of it from the processor | |
308 | itself (using model-specific registers), there are cases in which hardware | |
309 | manuals need to be consulted to get to it too. | |
310 | ||
311 | For this reason, there is a list of supported processors in ``intel_pstate`` and | |
312 | the driver initialization will fail if the detected processor is not in that | |
313 | list, unless it supports the `HWP feature <Active Mode_>`_. [The interface to | |
314 | obtain all of the information listed above is the same for all of the processors | |
315 | supporting the HWP feature, which is why they all are supported by | |
316 | ``intel_pstate``.] | |
317 | ||
318 | ||
319 | User Space Interface in ``sysfs`` | |
320 | ================================= | |
321 | ||
322 | Global Attributes | |
323 | ----------------- | |
324 | ||
325 | ``intel_pstate`` exposes several global attributes (files) in ``sysfs`` to | |
326 | control its functionality at the system level. They are located in the | |
327 | ``/sys/devices/system/cpu/cpufreq/intel_pstate/`` directory and affect all | |
328 | CPUs. | |
329 | ||
330 | Some of them are not present if the ``intel_pstate=per_cpu_perf_limits`` | |
331 | argument is passed to the kernel in the command line. | |
332 | ||
333 | ``max_perf_pct`` | |
334 | Maximum P-state the driver is allowed to set in percent of the | |
335 | maximum supported performance level (the highest supported `turbo | |
336 | P-state <turbo_>`_). | |
337 | ||
338 | This attribute will not be exposed if the | |
339 | ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel | |
340 | command line. | |
341 | ||
342 | ``min_perf_pct`` | |
343 | Minimum P-state the driver is allowed to set in percent of the | |
344 | maximum supported performance level (the highest supported `turbo | |
345 | P-state <turbo_>`_). | |
346 | ||
347 | This attribute will not be exposed if the | |
348 | ``intel_pstate=per_cpu_perf_limits`` argument is present in the kernel | |
349 | command line. | |
350 | ||
351 | ``num_pstates`` | |
352 | Number of P-states supported by the processor (between 0 and 255 | |
353 | inclusive) including both turbo and non-turbo P-states (see | |
354 | `Turbo P-states Support`_). | |
355 | ||
356 | The value of this attribute is not affected by the ``no_turbo`` | |
357 | setting described `below <no_turbo_attr_>`_. | |
358 | ||
359 | This attribute is read-only. | |
360 | ||
361 | ``turbo_pct`` | |
362 | Ratio of the `turbo range <turbo_>`_ size to the size of the entire | |
363 | range of supported P-states, in percent. | |
364 | ||
365 | This attribute is read-only. | |
366 | ||
367 | .. _no_turbo_attr: | |
368 | ||
369 | ``no_turbo`` | |
370 | If set (equal to 1), the driver is not allowed to set any turbo P-states | |
371 | (see `Turbo P-states Support`_). If unset (equal to 0, which is the | |
372 | default), turbo P-states can be set by the driver. | |
373 | [Note that ``intel_pstate`` does not support the general ``boost`` | |
374 | attribute (supported by some other scaling drivers) which is replaced | |
375 | by this one.] | |
376 | ||
377 | This attribute does not affect the maximum supported frequency value | |
378 | supplied to the ``CPUFreq`` core and exposed via the policy interface, | |
379 | but it affects the maximum possible value of per-policy P-state limits | |
380 | (see `Interpretation of Policy Attributes`_ below for details). | |
381 | ||
382 | .. _status_attr: | |
383 | ||
384 | ``status`` | |
385 | Operation mode of the driver: "active", "passive" or "off". | |
386 | ||
387 | "active" | |
388 | The driver is functional and in the `active mode | |
389 | <Active Mode_>`_. | |
390 | ||
391 | "passive" | |
392 | The driver is functional and in the `passive mode | |
393 | <Passive Mode_>`_. | |
394 | ||
395 | "off" | |
396 | The driver is not functional (it is not registered as a scaling | |
397 | driver with the ``CPUFreq`` core). | |
398 | ||
399 | This attribute can be written to in order to change the driver's | |
400 | operation mode or to unregister it. The string written to it must be | |
401 | one of the possible values of it and, if successful, the write will | |
402 | cause the driver to switch over to the operation mode represented by | |
403 | that string - or to be unregistered in the "off" case. [Actually, | |
404 | switching over from the active mode to the passive mode or the other | |
405 | way around causes the driver to be unregistered and registered again | |
406 | with a different set of callbacks, so all of its settings (the global | |
407 | as well as the per-policy ones) are then reset to their default | |
408 | values, possibly depending on the target operation mode.] | |
409 | ||
410 | That only is supported in some configurations, though (for example, if | |
411 | the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, | |
412 | the operation mode of the driver cannot be changed), and if it is not | |
413 | supported in the current configuration, writes to this attribute will | |
414 | fail with an appropriate error. | |
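The global attributes listed above are plain text files, so they can be inspected with ordinary file reads. Below is a minimal sketch that collects whichever of them exist; the function name is hypothetical, and attributes that are absent (for example with ``intel_pstate=per_cpu_perf_limits``) or unreadable are simply skipped.

```python
from pathlib import Path

# Directory named in the text above.
INTEL_PSTATE_DIR = Path("/sys/devices/system/cpu/cpufreq/intel_pstate")

# Attribute names described in this section.
GLOBAL_ATTRS = ("max_perf_pct", "min_perf_pct", "num_pstates",
                "turbo_pct", "no_turbo", "status")


def read_global_attrs(base=INTEL_PSTATE_DIR, names=GLOBAL_ATTRS):
    """Return a dict of the attributes that could be read, as stripped strings."""
    attrs = {}
    for name in names:
        try:
            attrs[name] = (Path(base) / name).read_text().strip()
        except OSError:
            continue  # attribute not exposed in this configuration
    return attrs
```

Calling ``read_global_attrs()`` on a system where the driver is not in use returns an empty dict. The ``base`` parameter exists only so that the function can be exercised against a scratch directory.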
415 | ||
416 | Interpretation of Policy Attributes | |
417 | ----------------------------------- | |
418 | ||
419 | The interpretation of some ``CPUFreq`` policy attributes described in | |
420 | :doc:`cpufreq` is special with ``intel_pstate`` as the current scaling driver | |
421 | and it generally depends on the driver's `operation mode <Operation Modes_>`_. | |
422 | ||
423 | First of all, the values of the ``cpuinfo_max_freq``, ``cpuinfo_min_freq`` and | |
424 | ``scaling_cur_freq`` attributes are produced by applying a processor-specific | |
425 | multiplier to the internal P-state representation used by ``intel_pstate``. | |
426 | Also, the values of the ``scaling_max_freq`` and ``scaling_min_freq`` | |
427 | attributes are capped by the frequency corresponding to the maximum P-state that | |
428 | the driver is allowed to set. | |
429 | ||
430 | If the ``no_turbo`` `global attribute <no_turbo_attr_>`_ is set, the driver is | |
431 | not allowed to use turbo P-states, so the maximum value of ``scaling_max_freq`` | |
432 | and ``scaling_min_freq`` is limited to the maximum non-turbo P-state frequency. | |
433 | Accordingly, setting ``no_turbo`` causes ``scaling_max_freq`` and | |
434 | ``scaling_min_freq`` to go down to that value if they were above it before. | |
435 | However, the old values of ``scaling_max_freq`` and ``scaling_min_freq`` will be | |
436 | restored after unsetting ``no_turbo``, unless these attributes have been written | |
437 | to after ``no_turbo`` was set. | |
438 | ||
439 | If ``no_turbo`` is not set, the maximum possible value of ``scaling_max_freq`` | |
440 | and ``scaling_min_freq`` corresponds to the maximum supported turbo P-state, | |
441 | which also is the value of ``cpuinfo_max_freq`` in either case. | |
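The interaction between ``no_turbo`` and ``scaling_max_freq`` described above can be modelled in a few lines. This is a sketch under stated assumptions, not the driver's implementation: the class, its field names and the 100000 kHz multiplier are hypothetical (the real multiplier is processor-specific), and only the maximum limit is modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PolicyModel:
    min_pstate: int
    max_nonturbo_pstate: int
    max_turbo_pstate: int
    scaling_khz: int = 100_000  # hypothetical multiplier
    no_turbo: bool = False
    requested_max_khz: Optional[int] = None  # last value written by user space

    def khz(self, pstate):
        # Frequency = internal P-state value times the multiplier.
        return pstate * self.scaling_khz

    @property
    def cpuinfo_min_freq(self):
        return self.khz(self.min_pstate)

    @property
    def cpuinfo_max_freq(self):
        # Not affected by no_turbo.
        return self.khz(self.max_turbo_pstate)

    @property
    def limit_cap(self):
        top = self.max_nonturbo_pstate if self.no_turbo else self.max_turbo_pstate
        return self.khz(top)

    @property
    def scaling_max_freq(self):
        requested = self.requested_max_khz
        if requested is None:
            requested = self.cpuinfo_max_freq
        return min(requested, self.limit_cap)

    def write_scaling_max_freq(self, khz):
        # A write made while no_turbo is set is clamped to the non-turbo cap,
        # so there is no higher old value left to restore afterwards.
        self.requested_max_khz = min(max(khz, self.cpuinfo_min_freq), self.limit_cap)
```

Toggling ``no_turbo`` in this model lowers ``scaling_max_freq`` and restores it afterwards, unless the attribute was written while ``no_turbo`` was set.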
442 | ||
443 | Next, the following policy attributes have special meaning if | |
444 | ``intel_pstate`` works in the `active mode <Active Mode_>`_: | |
445 | ||
446 | ``scaling_available_governors`` | |
447 | List of P-state selection algorithms provided by ``intel_pstate``. | |
448 | ||
449 | ``scaling_governor`` | |
450 | P-state selection algorithm provided by ``intel_pstate`` currently in | |
451 | use with the given policy. | |
452 | ||
453 | ``scaling_cur_freq`` | |
454 | Frequency of the average P-state of the CPU represented by the given | |
455 | policy for the time interval between the last two invocations of the | |
456 | driver's utilization update callback by the CPU scheduler for that CPU. | |
457 | ||
458 | The meaning of these attributes in the `passive mode <Passive Mode_>`_ is the | |
459 | same as for other scaling drivers. | |
460 | ||
461 | Additionally, the value of the ``scaling_driver`` attribute for ``intel_pstate`` | |
462 | depends on the operation mode of the driver. Namely, it is either | |
463 | "intel_pstate" (in the `active mode <Active Mode_>`_) or "intel_cpufreq" (in the | |
464 | `passive mode <Passive Mode_>`_). | |
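Since the ``scaling_driver`` string differs between the two modes, it can serve as a quick mode check from user space. A minimal sketch follows; the function name is hypothetical.

```python
def mode_from_scaling_driver(value):
    """Map the contents of ``scaling_driver`` to an intel_pstate operation mode.

    Returns None if another scaling driver is in use.
    """
    modes = {"intel_pstate": "active", "intel_cpufreq": "passive"}
    return modes.get(value.strip())
```

The input would normally be the contents of the ``scaling_driver`` file in a policy directory such as ``/sys/devices/system/cpu/cpufreq/policy0/``.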
465 | ||
466 | Coordination of P-State Limits | |
467 | ------------------------------ | |
468 | ||
469 | ``intel_pstate`` allows P-state limits to be set in two ways: with the help of | |
470 | the ``max_perf_pct`` and ``min_perf_pct`` `global attributes | |
471 | <Global Attributes_>`_ or via the ``scaling_max_freq`` and ``scaling_min_freq`` | |
472 | ``CPUFreq`` policy attributes. The coordination between those limits is based | |
473 | on the following rules, regardless of the current operation mode of the driver: | |
474 | ||
475 | 1. All CPUs are affected by the global limits (that is, none of them can be | |
476 | requested to run faster than the global maximum and none of them can be | |
477 | requested to run slower than the global minimum). | |
478 | ||
479 | 2. Each individual CPU is affected by its own per-policy limits (that is, it | |
480 | cannot be requested to run faster than its own per-policy maximum and it | |
481 | cannot be requested to run slower than its own per-policy minimum). | |
482 | ||
483 | 3. The global and per-policy limits can be set independently. | |
484 | ||
485 | If the `HWP feature is enabled in the processor <Active Mode With HWP_>`_, the | |
486 | resulting effective values are written into its registers whenever the limits | |
487 | change in order to request its internal P-state selection logic to always set | |
488 | P-states within these limits. Otherwise, the limits are taken into account by | |
489 | scaling governors (in the `passive mode <Passive Mode_>`_) and by the driver | |
490 | every time before setting a new P-state for a CPU. | |
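One way to picture the coordination rules is as the intersection of the global and per-policy ranges. The sketch below is illustrative only: the function name, the integer rounding and the handling of a minimum that ends up above the maximum are assumptions, not the driver's actual arithmetic.

```python
def effective_limits(global_min_pct, global_max_pct,
                     policy_min_khz, policy_max_khz, cpuinfo_max_khz):
    """Combine global percentage limits with per-policy frequency limits.

    The percentages are taken relative to the maximum supported performance
    level, represented here by ``cpuinfo_max_khz``.
    """
    global_min_khz = cpuinfo_max_khz * global_min_pct // 100
    global_max_khz = cpuinfo_max_khz * global_max_pct // 100
    # Rules 1 and 2: a CPU may not be asked to run faster than either
    # maximum or slower than either minimum.
    eff_max = min(global_max_khz, policy_max_khz)
    eff_min = max(global_min_khz, policy_min_khz)
    # Assumption: when the two constraints conflict, the maximum wins.
    eff_min = min(eff_min, eff_max)
    return eff_min, eff_max
```

For a CPU with a 4 GHz maximum, global limits of 20% and 80% combined with per-policy limits of 1 GHz and 3.5 GHz give an effective range of 1 GHz to 3.2 GHz in this model.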
491 | ||
492 | Additionally, if the ``intel_pstate=per_cpu_perf_limits`` command line argument | |
493 | is passed to the kernel, ``max_perf_pct`` and ``min_perf_pct`` are not exposed | |
494 | at all and the only way to set the limits is by using the policy attributes. | |
495 | ||
496 | ||
497 | Energy vs Performance Hints | |
498 | --------------------------- | |
499 | ||
500 | If ``intel_pstate`` works in the `active mode with the HWP feature enabled | |
501 | <Active Mode With HWP_>`_ in the processor, additional attributes are present | |
502 | in every ``CPUFreq`` policy directory in ``sysfs``. They are intended to allow | |
503 | user space to help ``intel_pstate`` to adjust the processor's internal P-state | |
504 | selection logic by focusing it on performance or on energy-efficiency, or | |
505 | somewhere between the two extremes: | |
506 | ||
507 | ``energy_performance_preference`` | |
508 | Current value of the energy vs performance hint for the given policy | |
509 | (or the CPU represented by it). | |
510 | ||
511 | The hint can be changed by writing to this attribute. | |
512 | ||
513 | ``energy_performance_available_preferences`` | |
514 | List of strings that can be written to the | |
515 | ``energy_performance_preference`` attribute. | |
516 | ||
517 | They represent different energy vs performance hints and should be | |
518 | self-explanatory, except that ``default`` represents whatever hint | |
519 | value was set by the platform firmware. | |
520 | ||
521 | Strings written to the ``energy_performance_preference`` attribute are | |
522 | internally translated to integer values written to the processor's | |
523 | Energy-Performance Preference (EPP) knob (if supported) or its | |
524 | Energy-Performance Bias (EPB) knob. | |
525 | ||
526 | [Note that tasks may be migrated from one CPU to another by the scheduler's | |
load-balancing algorithm and if different energy vs performance hints are
set for those CPUs, that may lead to undesirable outcomes. To avoid such
issues it is better to set the same energy vs performance hint for all CPUs
or to pin every task potentially sensitive to them to a specific CPU.]

.. _acpi-cpufreq:

``intel_pstate`` vs ``acpi-cpufreq``
====================================

On the majority of systems supported by ``intel_pstate``, the ACPI tables
provided by the platform firmware contain ``_PSS`` objects returning information
that can be used for CPU performance scaling (refer to the `ACPI specification`_
for details on the ``_PSS`` objects and the format of the information returned
by them).

The information returned by the ACPI ``_PSS`` objects is used by the
``acpi-cpufreq`` scaling driver. On systems supported by ``intel_pstate``
the ``acpi-cpufreq`` driver uses the same hardware CPU performance scaling
interface, but the set of P-states it can use is limited by the ``_PSS``
output.

On those systems each ``_PSS`` object returns a list of P-states supported by
the corresponding CPU which basically is a subset of the P-states range that can
be used by ``intel_pstate`` on the same system, with one exception: the whole
`turbo range <turbo_>`_ is represented by one item in it (the topmost one). By
convention, the frequency returned by ``_PSS`` for that item is greater by 1 MHz
than the frequency of the highest non-turbo P-state listed by it, but the
corresponding P-state representation (following the hardware specification)
returned for it matches the maximum supported turbo P-state (or is the
special value 255 meaning essentially "go as high as you can get").

The list of P-states returned by ``_PSS`` is reflected by the table of
available frequencies supplied by ``acpi-cpufreq`` to the ``CPUFreq`` core and
scaling governors and the minimum and maximum supported frequencies reported by
it come from that list as well. In particular, given the special representation
of the turbo range described above, this means that the maximum supported
frequency reported by ``acpi-cpufreq`` is higher by 1 MHz than the frequency
of the highest supported non-turbo P-state listed by ``_PSS`` which, of course,
affects decisions made by the scaling governors, except for ``powersave`` and
``performance``.

For example, if a given governor attempts to select a frequency proportional to
estimated CPU load and maps the load of 100% to the maximum supported frequency
(possibly multiplied by a constant), then it will tend to choose P-states below
the turbo threshold if ``acpi-cpufreq`` is used as the scaling driver, because
in that case the turbo range corresponds to a small fraction of the frequency
band it can use (1 MHz vs 1 GHz or more). In consequence, it will only go to
the turbo range for the highest loads and the other loads above 50% that might
benefit from running at turbo frequencies will be given non-turbo P-states
instead.
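
The effect can be illustrated with a rough calculation. The sketch below
assumes a purely proportional governor (target frequency equal to the load
times the maximum supported frequency, without the constant multiplier mentioned
above and ignoring how the target is resolved to an available P-state) and
made-up frequencies. ``turbo_threshold_pct`` is a hypothetical helper, not a
part of any kernel interface. It prints the load, in percent, above which such
a governor would ask for more than the highest non-turbo frequency::

 # Hypothetical helper, simplified proportional model (an assumption).
 #   $1 - frequency of the highest non-turbo P-state (MHz)
 #   $2 - maximum supported frequency reported by the scaling driver (MHz)
 turbo_threshold_pct() {
     # target = load * max / 100 exceeds base when load > 100 * base / max
     LC_ALL=C awk -v base="$1" -v max="$2" \
         'BEGIN { printf "%.2f\n", 100 * base / max }'
 }

With an assumed highest non-turbo frequency of 3000 MHz,
``turbo_threshold_pct 3000 3001`` (the ``acpi-cpufreq`` view, in which the turbo
range adds 1 MHz) prints ``99.97``, whereas ``turbo_threshold_pct 3000 4000``
(an assumed maximum turbo frequency of 4000 MHz exposed directly) prints
``75.00``.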

One more issue related to that may appear on systems supporting the
`Configurable TDP feature <turbo_>`_ allowing the platform firmware to set the
turbo threshold. Namely, if that is not coordinated with the lists of P-states
returned by ``_PSS`` properly, there may be more than one item corresponding to
a turbo P-state in those lists and there may be a problem with avoiding the
turbo range (if desirable or necessary). Usually, to avoid using turbo
P-states overall, ``acpi-cpufreq`` simply avoids using the topmost state listed
by ``_PSS``, but that is not sufficient when there are other turbo P-states in
the list returned by it.

Apart from the above, ``acpi-cpufreq`` works like ``intel_pstate`` in the
`passive mode <Passive Mode_>`_, except that the number of P-states it can set
is limited to the ones listed by the ACPI ``_PSS`` objects.


Kernel Command Line Options for ``intel_pstate``
================================================

Several kernel command line options can be used to pass early-configuration-time
parameters to ``intel_pstate`` in order to enforce specific behavior of it. All
of them have to be prepended with the ``intel_pstate=`` prefix.

``disable``
	Do not register ``intel_pstate`` as the scaling driver even if the
	processor is supported by it.

``passive``
	Register ``intel_pstate`` in the `passive mode <Passive Mode_>`_ to
	start with.

	This option implies the ``no_hwp`` one described below.

``force``
	Register ``intel_pstate`` as the scaling driver instead of
	``acpi-cpufreq`` even if the latter is preferred on the given system.

	This may prevent some platform features (such as thermal controls and
	power capping) that rely on the availability of ACPI P-states
	information from functioning as expected, so it should be used with
	caution.

	This option does not work with processors that are not supported by
	``intel_pstate`` and on platforms where the ``pcc-cpufreq`` scaling
	driver is used instead of ``acpi-cpufreq``.

``no_hwp``
	Do not enable the `hardware-managed P-states (HWP) feature
	<Active Mode With HWP_>`_ even if it is supported by the processor.

``hwp_only``
	Register ``intel_pstate`` as the scaling driver only if the
	`hardware-managed P-states (HWP) feature <Active Mode With HWP_>`_ is
	supported by the processor.

``support_acpi_ppc``
	Take ACPI ``_PPC`` performance limits into account.

	If the preferred power management profile in the FADT (Fixed ACPI
	Description Table) is set to "Enterprise Server" or "Performance
	Server", the ACPI ``_PPC`` limits are taken into account by default
	and this option has no effect.

``per_cpu_perf_limits``
	Use per-logical-CPU P-State limits (see `Coordination of P-state
	Limits`_ for details).
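
To check which of these options a kernel was booted with, its command line can
be inspected. The following is a minimal sketch: ``intel_pstate_opts`` is a
hypothetical helper name, and it assumes that the options on the command line
are separated by single spaces::

 # Hypothetical helper: print the value of every intel_pstate= option found
 # on a kernel command line, one per line.  Without an argument, the command
 # line of the running kernel is read from /proc/cmdline.
 intel_pstate_opts() {
     cmdline=${1-$(cat /proc/cmdline)}
     printf '%s\n' "$cmdline" | tr ' ' '\n' | sed -n 's/^intel_pstate=//p'
 }

For a made-up command line such as ``root=/dev/sda1 intel_pstate=passive quiet``
passed as the argument, this prints ``passive``. The name of the scaling driver
actually in use can then be read from the ``scaling_driver`` policy attribute in
``sysfs`` (for example, ``/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver``).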


Diagnostics and Tuning
======================

Trace Events
------------

There are two static trace events that can be used for ``intel_pstate``
diagnostics. One of them is the ``cpu_frequency`` trace event generally used
by ``CPUFreq``, and the other one is the ``pstate_sample`` trace event specific
to ``intel_pstate``. Both of them are triggered by ``intel_pstate`` only if
it works in the `active mode <Active Mode_>`_.

The following sequence of shell commands can be used to enable them and see
their output (if the kernel is generally configured to support event tracing)::

 # cd /sys/kernel/debug/tracing/
 # echo 1 > events/power/pstate_sample/enable
 # echo 1 > events/power/cpu_frequency/enable
 # cat trace
 gnome-terminal--4510  [001] ..s.  1177.680733: pstate_sample: core_busy=107 scaled=94 from=26 to=26 mperf=1143818 aperf=1230607 tsc=29838618 freq=2474476
 cat-5235  [002] ..s.  1177.681723: cpu_frequency: state=2900000 cpu_id=2

If ``intel_pstate`` works in the `passive mode <Passive Mode_>`_, the
``cpu_frequency`` trace event will be triggered either by the ``schedutil``
scaling governor (for the policies it is attached to), or by the ``CPUFreq``
core (for the policies with other scaling governors).
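
The trace output can be post-processed with standard text tools. As a minimal
sketch (``cpu_freq_events`` is a hypothetical helper name and the line format
is assumed to match the sample output shown above), the following filter
reduces every ``cpu_frequency`` event to the CPU number and the frequency
reported for it::

 # Hypothetical helper: read trace output on standard input and print
 # "<cpu_id> <state>" for each cpu_frequency event found in it.
 cpu_freq_events() {
     sed -n 's/.*cpu_frequency: state=\([0-9]*\) cpu_id=\([0-9]*\).*/\2 \1/p'
 }

For the ``cpu_frequency`` line in the sample output above it prints
``2 2900000``. It can be fed directly from the trace buffer, for example with
``cpu_freq_events < trace`` in the tracing directory. Writing ``0`` to the
``enable`` files used above turns the trace events off again.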

``ftrace``
----------

The ``ftrace`` interface can be used for low-level diagnostics of
``intel_pstate``. For example, to check how often the function to set a
P-state is called, the ``ftrace`` filter can be set to
:c:func:`intel_pstate_set_pstate`::

 # cd /sys/kernel/debug/tracing/
 # cat available_filter_functions | grep -i pstate
 intel_pstate_set_pstate
 intel_pstate_cpu_init
 ...
 # echo intel_pstate_set_pstate > set_ftrace_filter
 # echo function > current_tracer
 # cat trace | head -15
 # tracer: function
 #
 # entries-in-buffer/entries-written: 80/80   #P:4
 #
 #                              _-----=> irqs-off
 #                             / _----=> need-resched
 #                            | / _---=> hardirq/softirq
 #                            || / _--=> preempt-depth
 #                            ||| /     delay
 #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
 #              | |       |   ||||       |         |
             Xorg-3129  [000] ..s.  2537.644844: intel_pstate_set_pstate <-intel_pstate_timer_func
  gnome-terminal--4510  [002] ..s.  2537.649844: intel_pstate_set_pstate <-intel_pstate_timer_func
      gnome-shell-3409  [001] ..s.  2537.650850: intel_pstate_set_pstate <-intel_pstate_timer_func
           <idle>-0     [000] ..s.  2537.654843: intel_pstate_set_pstate <-intel_pstate_timer_func
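
To turn such a trace into a per-CPU call count, the output can be summarized
with ``awk``. This is only a sketch: ``calls_per_cpu`` is a hypothetical helper
name, and it assumes that the first bracketed number on each line is the CPU
field, as in the output above::

 # Hypothetical helper: read function-tracer output on standard input and
 # print "<cpu> <number of traced calls>" for each CPU, sorted by CPU.
 # Header lines (starting with "#") are skipped.
 calls_per_cpu() {
     awk '!/^#/ && match($0, /\[[0-9]+\]/) {
              count[substr($0, RSTART + 1, RLENGTH - 2)]++
          }
          END { for (cpu in count) print cpu, count[cpu] }' | sort
 }

It can be used as ``calls_per_cpu < trace`` in the tracing directory. Writing
``nop`` to ``current_tracer`` stops the function tracer afterwards.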


.. _LCEU2015: http://events.linuxfoundation.org/sites/events/files/slides/LinuxConEurope_2015.pdf
.. _SDM: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-system-programming-manual-325384.html
.. _ACPI specification: http://www.uefi.org/sites/default/files/resources/ACPI_6_1.pdf