[linux-block.git] / Documentation / scheduler / sched-energy.rst

=======================
Energy Aware Scheduling
=======================

1. Introduction
---------------

Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
the impact of its decisions on the energy consumed by CPUs. EAS relies on an
Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
with a minimal impact on throughput. This document aims at providing an
introduction on how EAS works, what are the main design decisions behind it, and
details what is needed to get it to run.

Before going any further, please note that at the time of writing::

   /!\ EAS does not support platforms with symmetric CPU topologies /!\

EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
because this is where the potential for saving energy through scheduling is
the highest.

The actual EM used by EAS is _not_ maintained by the scheduler, but by a
dedicated framework. For details about this framework and what it provides,
please refer to its documentation (see Documentation/power/energy-model.rst).


2. Background and Terminology
-----------------------------

To make it clear from the start:
 - energy = [joule] (resource like a battery on powered devices)
 - power = energy/time = [joule/second] = [watt]

The goal of EAS is to minimize energy, while still getting the job done. That
is, we want to maximize::

	performance [inst/s]
	--------------------
	    power [W]

which is equivalent to minimizing::

	energy [J]
	-----------
	instruction

while still getting 'good' performance. It is essentially an alternative
optimization objective to the current performance-only objective for the
scheduler. This alternative considers two objectives: energy-efficiency and
performance.

The idea behind introducing an EM is to allow the scheduler to evaluate the
implications of its decisions rather than blindly applying energy-saving
techniques that may have positive effects only on some platforms. At the same
time, the EM must be as simple as possible to minimize the scheduler latency
impact.

In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time
for the scheduler to decide where a task should run (during wake-up), the EM
is used to break the tie between several good CPU candidates and pick the one
that is predicted to yield the best energy consumption without harming the
system's throughput. The predictions made by EAS rely on specific elements of
knowledge about the platform's topology, which include the 'capacity' of CPUs,
and their respective energy costs.


3. Topology information
-----------------------

EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
differentiate CPUs with different computing throughput. The 'capacity' of a CPU
represents the amount of work it can absorb when running at its highest
frequency compared to the most capable CPU of the system. Capacity values are
normalized in a 1024 range, and are comparable with the utilization signals of
tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
to capacity and utilization values, EAS is able to estimate how big/busy a
task/CPU is, and to take this into consideration when evaluating performance vs
energy trade-offs. The capacity of CPUs is provided via arch-specific code
through the arch_scale_cpu_capacity() callback.

The rest of platform knowledge used by EAS is directly read from the Energy
Model (EM) framework. The EM of a platform is composed of a power cost table
per 'performance domain' in the system (see Documentation/power/energy-model.rst
for further details about performance domains).

The scheduler manages references to the EM objects in the topology code when the
scheduling domains are built, or re-built. For each root domain (rd), the
scheduler maintains a singly linked list of all performance domains intersecting
the current rd->span. Each node in the list contains a pointer to a struct
em_perf_domain as provided by the EM framework.

The lists are attached to the root domains in order to cope with exclusive
cpuset configurations. Since the boundaries of exclusive cpusets do not
necessarily match those of performance domains, the lists of different root
domains can contain duplicate elements.

Example 1.
    Let us consider a platform with 12 CPUs, split in 3 performance domains
    (pd0, pd4 and pd8), organized as follows::

	          CPUs:   0 1 2 3 4 5 6 7 8 9 10 11
	          PDs:   |--pd0--|--pd4--|---pd8---|
	          RDs:   |----rd1----|-----rd2-----|

    Now, consider that userspace decided to split the system with two
    exclusive cpusets, hence creating two independent root domains, each
    containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
    above figure. Since pd4 intersects with both rd1 and rd2, it will be
    present in the linked list '->pd' attached to each of them:

       * rd1->pd: pd0 -> pd4
       * rd2->pd: pd4 -> pd8

    Please note that the scheduler will create two duplicate list nodes for
    pd4 (one for each list). However, both just hold a pointer to the same
    shared data structure of the EM framework.

Since the access to these lists can happen concurrently with hotplug and other
things, they are protected by RCU, like the rest of topology structures
manipulated by the scheduler.

EAS also maintains a static key (sched_energy_present) which is enabled when at
least one root domain meets all conditions for EAS to start. Those conditions
are summarized in Section 6.


4. Energy-Aware task placement
------------------------------

EAS overrides the CFS task wake-up balancing code. It uses the EM of the
platform and the PELT signals to choose an energy-efficient target CPU during
wake-up balance. When EAS is enabled, select_task_rq_fair() calls
find_energy_efficient_cpu() to do the placement decision. This function looks
for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
each performance domain since it is the one which will allow us to keep the
frequency the lowest. Then, the function checks if placing the task there could
save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran
in its previous activation.

find_energy_efficient_cpu() uses compute_energy() to estimate what will be the
energy consumed by the system if the waking task was migrated. compute_energy()
looks at the current utilization landscape of the CPUs and adjusts it to
'simulate' the task migration. The EM framework provides the em_pd_energy() API
which computes the expected energy consumption of each performance domain for
the given utilization landscape.

An example of energy-optimized task placement decision is detailed below.

Example 2.
    Let us consider a (fake) platform with 2 independent performance domains
    composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
    are big.

    The scheduler must decide where to place a task P whose util_avg = 200
    and prev_cpu = 0.

    The current utilization landscape of the CPUs is depicted on the graph
    below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively
    Each performance domain has three Operating Performance Points (OPPs).
    The CPU capacity and power cost associated with each OPP is listed in
    the Energy Model table. The util_avg of P is shown on the figures
    below as 'PP'::

     CPU util.
      1024                 - - - - - - -              Energy Model
                                               +-----------+-------------+
                                               |  Little   |     Big     |
       768                 =============       +-----+-----+------+------+
                                               | Cap | Pwr | Cap  | Pwr  |
                                               +-----+-----+------+------+
       512  ===========    - ##- - - - -       | 170 | 50  | 512  | 400  |
                             ##     ##         | 341 | 150 | 768  | 800  |
       341  -PP - - - -      ##     ##         | 512 | 300 | 1024 | 1700 |
             PP              ##     ##         +-----+-----+------+------+
       170  -## - - - -      ##     ##
             ##     ##       ##     ##
           ------------    -------------
            CPU0   CPU1     CPU2   CPU3

      Current OPP: =====       Other OPP: - - -     util_avg (100 each): ##


    find_energy_efficient_cpu() will first look for the CPUs with the
    maximum spare capacity in the two performance domains. In this example,
    CPU1 and CPU3. Then it will estimate the energy of the system if P was
    placed on either of them, and check if that would save some energy
    compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
    (which is coherent with the behaviour of the schedutil CPUFreq
    governor, see Section 6. for more details on this topic).

    **Case 1. P is migrated to CPU1**::

      1024                 - - - - - - -

                                            Energy calculation:
       768                 =============     * CPU0: 200 / 341 * 150 = 88
                                             * CPU1: 300 / 341 * 150 = 131
                                             * CPU2: 600 / 768 * 800 = 625
       512  - - - - - -    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
                             ##     ##          => total_energy = 1364
       341  ===========      ##     ##
                    PP       ##     ##
       170  -## - - PP-      ##     ##
             ##     ##       ##     ##
           ------------    -------------
            CPU0   CPU1     CPU2   CPU3


    **Case 2. P is migrated to CPU3**::

      1024                 - - - - - - -

                                            Energy calculation:
       768                 =============     * CPU0: 200 / 341 * 150 = 88
                                             * CPU1: 100 / 341 * 150 = 43
                                    PP       * CPU2: 600 / 768 * 800 = 625
       512  - - - - - -    - ##- - -PP -     * CPU3: 700 / 768 * 800 = 729
                             ##     ##          => total_energy = 1485
       341  ===========      ##     ##
                             ##     ##
       170  -## - - - -      ##     ##
             ##     ##       ##     ##
           ------------    -------------
            CPU0   CPU1     CPU2   CPU3


    **Case 3. P stays on prev_cpu / CPU 0**::

      1024                 - - - - - - -

                                            Energy calculation:
       768                 =============     * CPU0: 400 / 512 * 300 = 234
                                             * CPU1: 100 / 512 * 300 = 58
                                             * CPU2: 600 / 768 * 800 = 625
       512  ===========    - ##- - - - -     * CPU3: 500 / 768 * 800 = 520
                             ##     ##          => total_energy = 1437
       341  -PP - - - -      ##     ##
             PP              ##     ##
       170  -## - - - -      ##     ##
             ##     ##       ##     ##
           ------------    -------------
            CPU0   CPU1     CPU2   CPU3


    From these calculations, the Case 1 has the lowest total energy. So CPU 1
    is be the best candidate from an energy-efficiency standpoint.

Big CPUs are generally more power hungry than the little ones and are thus used
mainly when a task doesn't fit the littles. However, little CPUs aren't always
necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
of the little CPUs can be less energy-efficient than the lowest OPPs of the
bigs, for example. So, if the little CPUs happen to have enough utilization at
a specific point in time, a small task waking up at that moment could be better
of executing on the big side in order to save energy, even though it would fit
on the little side.

And even in the case where all OPPs of the big CPUs are less energy-efficient
than those of the little, using the big CPUs for a small task might still, under
specific conditions, save energy. Indeed, placing a task on a little CPU can
result in raising the OPP of the entire performance domain, and that will
increase the cost of the tasks already running there. If the waking task is
placed on a big CPU, its own execution cost might be higher than if it was
running on a little, but it won't impact the other tasks of the little CPUs
which will keep running at a lower OPP. So, when considering the total energy
consumed by CPUs, the extra cost of running that one task on a big core can be
smaller than the cost of raising the OPP on the little CPUs for all the other
tasks.

The examples above would be nearly impossible to get right in a generic way, and
for all platforms, without knowing the cost of running at different OPPs on all
CPUs of the system. Thanks to its EM-based design, EAS should cope with them
correctly without too many troubles. However, in order to ensure a minimal
impact on throughput for high-utilization scenarios, EAS also implements another
mechanism called 'over-utilization'.


5. Over-utilization
-------------------

From a general standpoint, the use-cases where EAS can help the most are those
involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
being run, they will require all of the available CPU capacity, and there isn't
much that can be done by the scheduler to save energy without severely harming
throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
'over-utilized' as soon as they are used at more than 80% of their compute
capacity. As long as no CPUs are over-utilized in a root domain, load balancing
is disabled and EAS overridess the wake-up balancing code. EAS is likely to load
the most energy efficient CPUs of the system more than the others if that can be
done without harming throughput. So, the load-balancer is disabled to prevent
it from breaking the energy-efficient task placement found by EAS. It is safe to
do so when the system isn't overutilized since being below the 80% tipping point
implies that:

    a. there is some idle time on all CPUs, so the utilization signals used by
       EAS are likely to accurately represent the 'size' of the various tasks
       in the system;
    b. all tasks should already be provided with enough CPU capacity,
       regardless of their nice values;
    c. since there is spare capacity all tasks must be blocking/sleeping
       regularly and balancing at wake-up is sufficient.

As soon as one CPU goes above the 80% tipping point, at least one of the three
assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
is raised for the entire root domain, EAS is disabled, and the load-balancer is
re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
wake-up and load balance under CPU-bound conditions. This provides a better
respect of the nice values of tasks.

Since the notion of overutilization largely relies on detecting whether or not
there is some idle time in the system, the CPU capacity 'stolen' by higher
(than CFS) scheduling classes (as well as IRQ) must be taken into account. As
such, the detection of overutilization accounts for the capacity used not only
by CFS tasks, but also by the other scheduling classes and IRQ.


6. Dependencies and requirements for EAS
----------------------------------------

Energy Aware Scheduling depends on the CPUs of the system having specific
hardware properties and on other features of the kernel being enabled. This
section lists these dependencies and provides hints as to how they can be met.


6.1 - Asymmetric CPU topology
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^


As mentioned in the introduction, EAS is only supported on platforms with
asymmetric CPU topologies for now. This requirement is checked at run-time by
looking for the presence of the SD_ASYM_CPUCAPACITY_FULL flag when the scheduling
domains are built.

See Documentation/scheduler/sched-capacity.rst for requirements to be met for this
flag to be set in the sched_domain hierarchy.

Please note that EAS is not fundamentally incompatible with SMP, but no
significant savings on SMP platforms have been observed yet. This restriction
could be amended in the future if proven otherwise.


6.2 - Energy Model presence
^^^^^^^^^^^^^^^^^^^^^^^^^^^

EAS uses the EM of a platform to estimate the impact of scheduling decisions on
energy. So, your platform must provide power cost tables to the EM framework in
order to make EAS start. To do so, please refer to documentation of the
independent EM framework in Documentation/power/energy-model.rst.

Please also note that the scheduling domains need to be re-built after the
EM has been registered in order to start EAS.

EAS uses the EM to make a forecasting decision on energy usage and thus it is
more focused on the difference when checking possible options for task
placement. For EAS it doesn't matter whether the EM power values are expressed
in milli-Watts or in an 'abstract scale'.


6.3 - Energy Model complexity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

EAS does not impose any complexity limit on the number of PDs/OPPs/CPUs but
restricts the number of CPUs to EM_MAX_NUM_CPUS to prevent overflows during
the energy estimation.


6.4 - Schedutil governor
^^^^^^^^^^^^^^^^^^^^^^^^

EAS tries to predict at which OPP will the CPUs be running in the close future
in order to estimate their energy consumption. To do so, it is assumed that OPPs
of CPUs follow their utilization.

Although it is very difficult to provide hard guarantees regarding the accuracy
of this assumption in practice (because the hardware might not do what it is
told to do, for example), schedutil as opposed to other CPUFreq governors at
least _requests_ frequencies calculated using the utilization signals.
Consequently, the only sane governor to use together with EAS is schedutil,
because it is the only one providing some degree of consistency between
frequency requests and energy predictions.

Using EAS with any other governor than schedutil is not supported.


6.5 Scale-invariant utilization signals
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to make accurate prediction across CPUs and for all performance
states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
be obtained using the architecture-defined arch_scale{cpu,freq}_capacity()
callbacks.

Using EAS on a platform that doesn't implement these two callbacks is not
supported.


6.6 Multithreading (SMT)
^^^^^^^^^^^^^^^^^^^^^^^^

EAS in its current form is SMT unaware and is not able to leverage
multithreaded hardware to save energy. EAS considers threads as independent
CPUs, which can actually be counter-productive for both performance and energy.

EAS on SMT is not supported.
Commit	Line	Data
d6a3b247 MCC	1	=======================
	2	Energy Aware Scheduling
	3	=======================
81a930d3 QP	4
	5	1. Introduction
	6	---------------
	7
	8	Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict
	9	the impact of its decisions on the energy consumed by CPUs. EAS relies on an
	10	Energy Model (EM) of the CPUs to select an energy efficient CPU for each task,
	11	with a minimal impact on throughput. This document aims at providing an
	12	introduction on how EAS works, what are the main design decisions behind it, and
	13	details what is needed to get it to run.
	14
d6a3b247	15	Before going any further, please note that at the time of writing::
81a930d3 QP	16
	17	/!\ EAS does not support platforms with symmetric CPU topologies /!\
	18
	19	EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE)
	20	because this is where the potential for saving energy through scheduling is
	21	the highest.
	22
	23	The actual EM used by EAS is _not_ maintained by the scheduler, but by a
	24	dedicated framework. For details about this framework and what it provides,
151f4e2b	25	please refer to its documentation (see Documentation/power/energy-model.rst).
81a930d3 QP	26
	27
	28	2. Background and Terminology
	29	-----------------------------
	30
	31	To make it clear from the start:
	32	- energy = [joule] (resource like a battery on powered devices)
	33	- power = energy/time = [joule/second] = [watt]
	34
	35	The goal of EAS is to minimize energy, while still getting the job done. That
d6a3b247	36	is, we want to maximize::
81a930d3 QP	37
	38	performance [inst/s]
	39	--------------------
	40	power [W]
	41
d6a3b247	42	which is equivalent to minimizing::
81a930d3 QP	43
	44	energy [J]
	45	-----------
	46	instruction
	47
	48	while still getting 'good' performance. It is essentially an alternative
	49	optimization objective to the current performance-only objective for the
	50	scheduler. This alternative considers two objectives: energy-efficiency and
	51	performance.
	52
	53	The idea behind introducing an EM is to allow the scheduler to evaluate the
	54	implications of its decisions rather than blindly applying energy-saving
	55	techniques that may have positive effects only on some platforms. At the same
	56	time, the EM must be as simple as possible to minimize the scheduler latency
	57	impact.
	58
	59	In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time
	60	for the scheduler to decide where a task should run (during wake-up), the EM
	61	is used to break the tie between several good CPU candidates and pick the one
	62	that is predicted to yield the best energy consumption without harming the
	63	system's throughput. The predictions made by EAS rely on specific elements of
	64	knowledge about the platform's topology, which include the 'capacity' of CPUs,
	65	and their respective energy costs.
	66
	67
	68	3. Topology information
	69	-----------------------
	70
	71	EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to
	72	differentiate CPUs with different computing throughput. The 'capacity' of a CPU
	73	represents the amount of work it can absorb when running at its highest
	74	frequency compared to the most capable CPU of the system. Capacity values are
	75	normalized in a 1024 range, and are comparable with the utilization signals of
	76	tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks
	77	to capacity and utilization values, EAS is able to estimate how big/busy a
	78	task/CPU is, and to take this into consideration when evaluating performance vs
	79	energy trade-offs. The capacity of CPUs is provided via arch-specific code
	80	through the arch_scale_cpu_capacity() callback.
	81
	82	The rest of platform knowledge used by EAS is directly read from the Energy
	83	Model (EM) framework. The EM of a platform is composed of a power cost table
151f4e2b	84	per 'performance domain' in the system (see Documentation/power/energy-model.rst
d56b699d	85	for further details about performance domains).
81a930d3 QP	86
	87	The scheduler manages references to the EM objects in the topology code when the
	88	scheduling domains are built, or re-built. For each root domain (rd), the
	89	scheduler maintains a singly linked list of all performance domains intersecting
	90	the current rd->span. Each node in the list contains a pointer to a struct
	91	em_perf_domain as provided by the EM framework.
	92
	93	The lists are attached to the root domains in order to cope with exclusive
	94	cpuset configurations. Since the boundaries of exclusive cpusets do not
	95	necessarily match those of performance domains, the lists of different root
	96	domains can contain duplicate elements.
	97
	98	Example 1.
	99	Let us consider a platform with 12 CPUs, split in 3 performance domains
d6a3b247	100	(pd0, pd4 and pd8), organized as follows::
81a930d3 QP	101
	102	CPUs: 0 1 2 3 4 5 6 7 8 9 10 11
	103	PDs: \|--pd0--\|--pd4--\|---pd8---\|
	104	RDs: \|----rd1----\|-----rd2-----\|
	105
	106	Now, consider that userspace decided to split the system with two
	107	exclusive cpusets, hence creating two independent root domains, each
	108	containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the
	109	above figure. Since pd4 intersects with both rd1 and rd2, it will be
	110	present in the linked list '->pd' attached to each of them:
d6a3b247	111
81a930d3 QP	112	* rd1->pd: pd0 -> pd4
	113	* rd2->pd: pd4 -> pd8
	114
	115	Please note that the scheduler will create two duplicate list nodes for
	116	pd4 (one for each list). However, both just hold a pointer to the same
	117	shared data structure of the EM framework.
	118
	119	Since the access to these lists can happen concurrently with hotplug and other
	120	things, they are protected by RCU, like the rest of topology structures
	121	manipulated by the scheduler.
	122
	123	EAS also maintains a static key (sched_energy_present) which is enabled when at
	124	least one root domain meets all conditions for EAS to start. Those conditions
	125	are summarized in Section 6.
	126
	127
	128	4. Energy-Aware task placement
	129	------------------------------
	130
	131	EAS overrides the CFS task wake-up balancing code. It uses the EM of the
	132	platform and the PELT signals to choose an energy-efficient target CPU during
	133	wake-up balance. When EAS is enabled, select_task_rq_fair() calls
	134	find_energy_efficient_cpu() to do the placement decision. This function looks
	135	for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in
	136	each performance domain since it is the one which will allow us to keep the
	137	frequency the lowest. Then, the function checks if placing the task there could
	138	save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran
	139	in its previous activation.
	140
	141	find_energy_efficient_cpu() uses compute_energy() to estimate what will be the
	142	energy consumed by the system if the waking task was migrated. compute_energy()
	143	looks at the current utilization landscape of the CPUs and adjusts it to
	144	'simulate' the task migration. The EM framework provides the em_pd_energy() API
	145	which computes the expected energy consumption of each performance domain for
	146	the given utilization landscape.
	147
	148	An example of energy-optimized task placement decision is detailed below.
	149
	150	Example 2.
	151	Let us consider a (fake) platform with 2 independent performance domains
	152	composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3
	153	are big.
	154
	155	The scheduler must decide where to place a task P whose util_avg = 200
	156	and prev_cpu = 0.
	157
	158	The current utilization landscape of the CPUs is depicted on the graph
	159	below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively
	160	Each performance domain has three Operating Performance Points (OPPs).
	161	The CPU capacity and power cost associated with each OPP is listed in
	162	the Energy Model table. The util_avg of P is shown on the figures
d6a3b247	163	below as 'PP'::
81a930d3	164
d6a3b247	165	CPU util.
81a930d3 QP	166	1024 - - - - - - - Energy Model
	167	+-----------+-------------+
	168	\| Little \| Big \|
	169	768 ============= +-----+-----+------+------+
	170	\| Cap \| Pwr \| Cap \| Pwr \|
	171	+-----+-----+------+------+
	172	512 =========== - ##- - - - - \| 170 \| 50 \| 512 \| 400 \|
	173	## ## \| 341 \| 150 \| 768 \| 800 \|
	174	341 -PP - - - - ## ## \| 512 \| 300 \| 1024 \| 1700 \|
	175	PP ## ## +-----+-----+------+------+
	176	170 -## - - - - ## ##
	177	## ## ## ##
	178	------------ -------------
	179	CPU0 CPU1 CPU2 CPU3
	180
	181	Current OPP: ===== Other OPP: - - - util_avg (100 each): ##
	182
	183
	184	find_energy_efficient_cpu() will first look for the CPUs with the
	185	maximum spare capacity in the two performance domains. In this example,
	186	CPU1 and CPU3. Then it will estimate the energy of the system if P was
	187	placed on either of them, and check if that would save some energy
	188	compared to leaving P on CPU0. EAS assumes that OPPs follow utilization
	189	(which is coherent with the behaviour of the schedutil CPUFreq
	190	governor, see Section 6. for more details on this topic).
	191
d6a3b247	192	Case 1. P is migrated to CPU1::
81a930d3 QP	193
	194	1024 - - - - - - -
	195
	196	Energy calculation:
	197	768 ============= * CPU0: 200 / 341 * 150 = 88
	198	* CPU1: 300 / 341 * 150 = 131
	199	* CPU2: 600 / 768 * 800 = 625
	200	512 - - - - - - - ##- - - - - * CPU3: 500 / 768 * 800 = 520
	201	## ## => total_energy = 1364
	202	341 =========== ## ##
	203	PP ## ##
	204	170 -## - - PP- ## ##
	205	## ## ## ##
	206	------------ -------------
	207	CPU0 CPU1 CPU2 CPU3
	208
	209
d6a3b247	210	Case 2. P is migrated to CPU3::
81a930d3 QP	211
	212	1024 - - - - - - -
	213
	214	Energy calculation:
	215	768 ============= * CPU0: 200 / 341 * 150 = 88
	216	* CPU1: 100 / 341 * 150 = 43
	217	PP * CPU2: 600 / 768 * 800 = 625
	218	512 - - - - - - - ##- - -PP - * CPU3: 700 / 768 * 800 = 729
	219	## ## => total_energy = 1485
	220	341 =========== ## ##
	221	## ##
	222	170 -## - - - - ## ##
	223	## ## ## ##
	224	------------ -------------
	225	CPU0 CPU1 CPU2 CPU3
	226
	227
d6a3b247	228	Case 3. P stays on prev_cpu / CPU 0::
81a930d3 QP	229
	230	1024 - - - - - - -
	231
	232	Energy calculation:
	233	768 ============= * CPU0: 400 / 512 * 300 = 234
	234	* CPU1: 100 / 512 * 300 = 58
	235	* CPU2: 600 / 768 * 800 = 625
	236	512 =========== - ##- - - - - * CPU3: 500 / 768 * 800 = 520
	237	## ## => total_energy = 1437
	238	341 -PP - - - - ## ##
	239	PP ## ##
	240	170 -## - - - - ## ##
	241	## ## ## ##
	242	------------ -------------
	243	CPU0 CPU1 CPU2 CPU3
	244
	245
	246	From these calculations, the Case 1 has the lowest total energy. So CPU 1
	247	is be the best candidate from an energy-efficiency standpoint.
	248
	249	Big CPUs are generally more power hungry than the little ones and are thus used
	250	mainly when a task doesn't fit the littles. However, little CPUs aren't always
	251	necessarily more energy-efficient than big CPUs. For some systems, the high OPPs
	252	of the little CPUs can be less energy-efficient than the lowest OPPs of the
	253	bigs, for example. So, if the little CPUs happen to have enough utilization at
	254	a specific point in time, a small task waking up at that moment could be better
	255	of executing on the big side in order to save energy, even though it would fit
	256	on the little side.
	257
	258	And even in the case where all OPPs of the big CPUs are less energy-efficient
	259	than those of the little, using the big CPUs for a small task might still, under
	260	specific conditions, save energy. Indeed, placing a task on a little CPU can
	261	result in raising the OPP of the entire performance domain, and that will
	262	increase the cost of the tasks already running there. If the waking task is
	263	placed on a big CPU, its own execution cost might be higher than if it was
	264	running on a little, but it won't impact the other tasks of the little CPUs
	265	which will keep running at a lower OPP. So, when considering the total energy
	266	consumed by CPUs, the extra cost of running that one task on a big core can be
	267	smaller than the cost of raising the OPP on the little CPUs for all the other
	268	tasks.
	269
	270	The examples above would be nearly impossible to get right in a generic way, and
	271	for all platforms, without knowing the cost of running at different OPPs on all
	272	CPUs of the system. Thanks to its EM-based design, EAS should cope with them
	273	correctly without too many troubles. However, in order to ensure a minimal
	274	impact on throughput for high-utilization scenarios, EAS also implements another
	275	mechanism called 'over-utilization'.
	276
	277
	278	5. Over-utilization
	279	-------------------
	280
	281	From a general standpoint, the use-cases where EAS can help the most are those
	282	involving a light/medium CPU utilization. Whenever long CPU-bound tasks are
	283	being run, they will require all of the available CPU capacity, and there isn't
d56b699d	284	much that can be done by the scheduler to save energy without severely harming
81a930d3 QP	285	throughput. In order to avoid hurting performance with EAS, CPUs are flagged as
	286	'over-utilized' as soon as they are used at more than 80% of their compute
	287	capacity. As long as no CPUs are over-utilized in a root domain, load balancing
	288	is disabled and EAS overridess the wake-up balancing code. EAS is likely to load
	289	the most energy efficient CPUs of the system more than the others if that can be
	290	done without harming throughput. So, the load-balancer is disabled to prevent
	291	it from breaking the energy-efficient task placement found by EAS. It is safe to
	292	do so when the system isn't overutilized since being below the 80% tipping point
	293	implies that:
	294
	295	a. there is some idle time on all CPUs, so the utilization signals used by
	296	EAS are likely to accurately represent the 'size' of the various tasks
	297	in the system;
	298	b. all tasks should already be provided with enough CPU capacity,
	299	regardless of their nice values;
	300	c. since there is spare capacity all tasks must be blocking/sleeping
	301	regularly and balancing at wake-up is sufficient.
	302
	303	As soon as one CPU goes above the 80% tipping point, at least one of the three
	304	assumptions above becomes incorrect. In this scenario, the 'overutilized' flag
	305	is raised for the entire root domain, EAS is disabled, and the load-balancer is
	306	re-enabled. By doing so, the scheduler falls back onto load-based algorithms for
	307	wake-up and load balance under CPU-bound conditions. This provides a better
	308	respect of the nice values of tasks.
	309
	310	Since the notion of overutilization largely relies on detecting whether or not
	311	there is some idle time in the system, the CPU capacity 'stolen' by higher
	312	(than CFS) scheduling classes (as well as IRQ) must be taken into account. As
	313	such, the detection of overutilization accounts for the capacity used not only
	314	by CFS tasks, but also by the other scheduling classes and IRQ.
	315
	316
	317	6. Dependencies and requirements for EAS
	318	----------------------------------------
	319
	320	Energy Aware Scheduling depends on the CPUs of the system having specific
	321	hardware properties and on other features of the kernel being enabled. This
	322	section lists these dependencies and provides hints as to how they can be met.
	323
	324
d6a3b247 MCC	325	6.1 - Asymmetric CPU topology
	326	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
	327
81a930d3 QP	328
	329	As mentioned in the introduction, EAS is only supported on platforms with
	330	asymmetric CPU topologies for now. This requirement is checked at run-time by
adf3c31e	331	looking for the presence of the SD_ASYM_CPUCAPACITY_FULL flag when the scheduling
81a930d3 QP	332	domains are built.
81a930d3 QP	333
e4e29e78	334	See Documentation/scheduler/sched-capacity.rst for requirements to be met for this
949bcb81	335	flag to be set in the sched_domain hierarchy.
81a930d3 QP	336
	337	Please note that EAS is not fundamentally incompatible with SMP, but no
	338	significant savings on SMP platforms have been observed yet. This restriction
	339	could be amended in the future if proven otherwise.
	340
	341
d6a3b247 MCC	342	6.2 - Energy Model presence
d6a3b247 MCC	343	^^^^^^^^^^^^^^^^^^^^^^^^^^^
81a930d3 QP	344
	345	EAS uses the EM of a platform to estimate the impact of scheduling decisions on
	346	energy. So, your platform must provide power cost tables to the EM framework in
	347	order to make EAS start. To do so, please refer to documentation of the
151f4e2b	348	independent EM framework in Documentation/power/energy-model.rst.
81a930d3 QP	349
	350	Please also note that the scheduling domains need to be re-built after the
	351	EM has been registered in order to start EAS.
	352
5a64f775 LL	353	EAS uses the EM to make a forecasting decision on energy usage and thus it is
	354	more focused on the difference when checking possible options for task
	355	placement. For EAS it doesn't matter whether the EM power values are expressed
	356	in milli-Watts or in an 'abstract scale'.
	357
81a930d3	358
d6a3b247 MCC	359	6.3 - Energy Model complexity
d6a3b247 MCC	360	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
81a930d3	361
5b77261c PG	362	EAS does not impose any complexity limit on the number of PDs/OPPs/CPUs but
	363	restricts the number of CPUs to EM_MAX_NUM_CPUS to prevent overflows during
	364	the energy estimation.
81a930d3 QP	365
81a930d3 QP	366
d6a3b247 MCC	367	6.4 - Schedutil governor
d6a3b247 MCC	368	^^^^^^^^^^^^^^^^^^^^^^^^
81a930d3 QP	369
	370	EAS tries to predict at which OPP will the CPUs be running in the close future
	371	in order to estimate their energy consumption. To do so, it is assumed that OPPs
	372	of CPUs follow their utilization.
	373
	374	Although it is very difficult to provide hard guarantees regarding the accuracy
	375	of this assumption in practice (because the hardware might not do what it is
	376	told to do, for example), schedutil as opposed to other CPUFreq governors at
	377	least _requests_ frequencies calculated using the utilization signals.
	378	Consequently, the only sane governor to use together with EAS is schedutil,
	379	because it is the only one providing some degree of consistency between
	380	frequency requests and energy predictions.
	381
	382	Using EAS with any other governor than schedutil is not supported.
	383
	384
d6a3b247 MCC	385	6.5 Scale-invariant utilization signals
d6a3b247 MCC	386	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
81a930d3 QP	387
	388	In order to make accurate prediction across CPUs and for all performance
	389	states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can
	390	be obtained using the architecture-defined arch_scale{cpu,freq}_capacity()
	391	callbacks.
	392
	393	Using EAS on a platform that doesn't implement these two callbacks is not
	394	supported.
	395
	396
d6a3b247 MCC	397	6.6 Multithreading (SMT)
d6a3b247 MCC	398	^^^^^^^^^^^^^^^^^^^^^^^^
81a930d3 QP	399
	400	EAS in its current form is SMT unaware and is not able to leverage
	401	multithreaded hardware to save energy. EAS considers threads as independent
	402	CPUs, which can actually be counter-productive for both performance and energy.
	403
	404	EAS on SMT is not supported.