[linux-block.git] / Documentation / scheduler / sched-capacity.rst

=========================
Capacity Aware Scheduling
=========================

1. CPU Capacity
===============

1.1 Introduction
----------------

Conventional, homogeneous SMP platforms are composed of purely identical
CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
different performance characteristics - on such platforms, not all CPUs can be
considered equal.

CPU capacity is a measure of the performance a CPU can reach, normalized against
the most performant CPU in the system. Heterogeneous systems are also called
asymmetric CPU capacity systems, as they contain CPUs of different capacities.

Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
from two factors:

- not all CPUs may have the same microarchitecture (µarch).
- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
  physically able to attain the higher Operating Performance Points (OPP).

Arm big.LITTLE systems are an example of both. The big CPUs are more
performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
can.

CPU performance is usually expressed in Millions of Instructions Per Second
(MIPS), which can also be expressed as a given amount of instructions attainable
per Hz, leading to::

  capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)

1.2 Scheduler terms
-------------------

Two different capacity values are used within the scheduler. A CPU's
``original capacity`` is its maximum attainable capacity, i.e. its maximum
attainable performance level. This original capacity is returned by
the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
capacity`` to which some loss of available performance (e.g. time spent
handling IRQs) is subtracted.

Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
while ``original capacity`` is class-agnostic. The rest of this document will use
the term ``capacity`` interchangeably with ``original capacity`` for the sake of
brevity.

1.3 Platform examples
---------------------

1.3.1 Identical OPPs
~~~~~~~~~~~~~~~~~~~~

Consider an hypothetical dual-core asymmetric CPU capacity system where

- work_per_hz(CPU0) = W
- work_per_hz(CPU1) = W/2
- all CPUs are running at the same fixed frequency

By the above definition of capacity:

- capacity(CPU0) = C
- capacity(CPU1) = C/2

To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
be a LITTLE.

With a workload that periodically does a fixed amount of work, you will get an
execution trace like so::

 CPU0 work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time

CPU0 has the highest capacity in the system (C), and completes a fixed amount of
work W in T units of time. On the other hand, CPU1 has half the capacity of
CPU0, and thus only completes W/2 in T.

1.3.2 Different max OPPs
~~~~~~~~~~~~~~~~~~~~~~~~

Usually, CPUs of different capacity values also have different maximum
OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:

- max_freq(CPU0) = F
- max_freq(CPU1) = 2/3 * F

This yields:

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing the same workload as described in 1.3.1, which each CPU running at its
maximum frequency results in::

 CPU0 work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

                            workload on CPU1
 CPU1 work ^
           |     ______________      ______________      ____
           |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

1.4 Representation caveat
-------------------------

It should be noted that having a *single* value to represent differences in CPU
performance is somewhat of a contentious point. The relative performance
difference between two different µarchs could be X% on integer operations, Y% on
floating point operations, Z% on branches, and so on. Still, results using this
simple approach have been satisfactory for now.

2. Task utilization
===================

2.1 Introduction
----------------

Capacity aware scheduling requires an expression of a task's requirements with
regards to CPU capacity. Each scheduler class can express this differently, and
while task utilization is specific to CFS, it is convenient to describe it here
in order to introduce more generic concepts.

Task utilization is a percentage meant to represent the throughput requirements
of a task. A simple approximation of it is the task's duty cycle, i.e.::

  task_util(p) = duty_cycle(p)

On an SMP system with fixed frequencies, 100% utilization suggests the task is a
busy loop. Conversely, 10% utilization hints it is a small periodic task that
spends more time sleeping than executing. Variable CPU frequencies and
asymmetric CPU capacities complexify this somewhat; the following sections will
expand on these.

2.2 Frequency invariance
------------------------

One issue that needs to be taken into account is that a workload's duty cycle is
directly impacted by the current OPP the CPU is running at. Consider running a
periodic workload at a given frequency F::

  CPU work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 25%.

Now, consider running the *same* workload at frequency F/2::

  CPU work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time

This yields duty_cycle(p) == 50%, despite the task having the exact same
behaviour (i.e. executing the same amount of work) in both executions.

The task utilization signal can be made frequency invariant using the following
formula::

  task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))

Applying this formula to the two examples above yields a frequency invariant
task utilization of 25%.

2.3 CPU invariance
------------------

CPU capacity has a similar effect on task utilization in that running an
identical workload on CPUs of different capacity values will yield different
duty cycles.

Consider the system described in 1.3.2., i.e.::

- capacity(CPU0) = C
- capacity(CPU1) = C/3

Executing a given periodic workload on each CPU at their maximum frequency would
result in::

 CPU0 work ^
           |     ____                ____                ____
           |    |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

 CPU1 work ^
           |     ______________      ______________      ____
           |    |              |    |              |    |
           +----+----+----+----+----+----+----+----+----+----+-> time

IOW,

- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency

The task utilization signal can be made CPU invariant using the following
formula::

  task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)

with ``max_capacity`` being the highest CPU capacity value in the
system. Applying this formula to the above example above yields a CPU
invariant task utilization of 25%.

2.4 Invariant task utilization
------------------------------

Both frequency and CPU invariance need to be applied to task utilization in
order to obtain a truly invariant signal. The pseudo-formula for a task
utilization that is both CPU and frequency invariant is thus, for a given
task p::

                                     curr_frequency(cpu)   capacity(cpu)
  task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
                                     max_frequency(cpu)    max_capacity

In other words, invariant task utilization describes the behaviour of a task as
if it were running on the highest-capacity CPU in the system, running at its
maximum frequency.

Any mention of task utilization in the following sections will imply its
invariant form.

2.5 Utilization estimation
--------------------------

Without a crystal ball, task behaviour (and thus task utilization) cannot
accurately be predicted the moment a task first becomes runnable. The CFS class
maintains a handful of CPU and task signals based on the Per-Entity Load
Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
opposed to instantaneous).

This means that while the capacity aware scheduling criteria will be written
considering a "true" task utilization (using a crystal ball), the implementation
will only ever be able to use an estimator thereof.

3. Capacity aware scheduling requirements
=========================================

3.1 CPU capacity
----------------

Linux cannot currently figure out CPU capacity on its own, this information thus
needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
for that purpose.

The arm, arm64, and RISC-V architectures directly map this to the arch_topology driver
CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
Documentation/devicetree/bindings/cpu/cpu-capacity.txt.

3.2 Frequency invariance
------------------------

As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
purpose.

Implementing this function requires figuring out at which frequency each CPU
have been running at. One way to implement this is to leverage hardware counters
whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86,
AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
when the kernel is aware of the switched-to frequency (also employed by
arm/arm64).

4. Scheduler topology
=====================

During the construction of the sched domains, the scheduler will figure out
whether the system exhibits asymmetric CPU capacities. Should that be the
case:

- The sched_asym_cpucapacity static key will be enabled.
- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
  level that spans all unique CPU capacity values.
- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
  CPUs with any range of asymmetry.

The sched_asym_cpucapacity static key is intended to guard sections of code that
cater to asymmetric CPU capacity systems. Do note however that said key is
*system-wide*. Imagine the following setup using cpusets::

  capacity    C/2          C
            ________    ________
           /        \  /        \
  CPUs     0  1  2  3  4  5  6  7
           \__/  \______________/
  cpusets   cs0         cs1

Which could be created via:

.. code-block:: sh

  mkdir /sys/fs/cgroup/cpuset/cs0
  echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems

  mkdir /sys/fs/cgroup/cpuset/cs1
  echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
  echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems

  echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance

Since there *is* CPU capacity asymmetry in the system, the
sched_asym_cpucapacity static key will be enabled. However, the sched_domain
hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
set in that hierarchy, it describes an SMP island and should be treated as such.

Therefore, the 'canonical' pattern for protecting codepaths that cater to
asymmetric CPU capacities is to:

- Check the sched_asym_cpucapacity static key
- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
  the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
  CPU or group thereof)

5. Capacity aware scheduling implementation
===========================================

5.1 CFS
-------

5.1.1 Capacity fitness
~~~~~~~~~~~~~~~~~~~~~~

The main capacity scheduling criterion of CFS is::

  task_util(p) < capacity(task_cpu(p))

This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
task "fits" on its CPU. If it is violated, the task will need to achieve more
work than what its CPU can provide: it will be CPU-bound.

Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
value for a task, either via sched_setattr() or via the cgroup interface (see
Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to
clamp task_util() in the previous criterion.

5.1.2 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

CFS task wakeup CPU selection follows the capacity fitness criterion described
above. On top of that, uclamp is used to clamp the task utilization values,
which lets userspace have more leverage over the CPU selection of CFS
tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::

  clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)

By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
giving it a high uclamp.min value.

.. note::

  Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
  (EAS), which is described in Documentation/scheduler/sched-energy.rst.

5.1.3 Load balancing
~~~~~~~~~~~~~~~~~~~~

A pathological case in the wakeup CPU selection occurs when a task rarely
sleeps, if at all - it thus rarely wakes up, if at all. Consider::

  w == wakeup event

  capacity(CPU0) = C
  capacity(CPU1) = C / 3

                           workload on CPU0
  CPU work ^
           |     _________           _________           ____
           |    |         |         |         |         |
           +----+----+----+----+----+----+----+----+----+----+-> time
                w                   w                   w

                           workload on CPU1
  CPU work ^
           |     ____________________________________________
           |    |
           +----+----+----+----+----+----+----+----+----+----+->
                w

This workload should run on CPU0, but if the task either:

- was improperly scheduled from the start (inaccurate initial
  utilization estimation)
- was properly scheduled from the start, but suddenly needs more
  processing power

then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
the CPU capacity scheduling criterion is violated, and there may not be any more
wakeup event to fix this up via wakeup CPU selection.

Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
put in place to handle this shares the same name. Misfit task migration
leverages the CFS load balancer, more specifically the active load balance part
(which caters to migrating currently running tasks). When load balance happens,
a misfit active load balance will be triggered if a misfit task can be migrated
to a CPU with more capacity than its current one.

5.2 RT
------

5.2.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

RT task wakeup CPU selection searches for a CPU that satisfies::

  task_uclamp_min(p) <= capacity(task_cpu(cpu))

while still following the usual priority constraints. If none of the candidate
CPUs can satisfy this capacity criterion, then strict priority based scheduling
is followed and CPU capacities are ignored.

5.3 DL
------

5.3.1 Wakeup CPU selection
~~~~~~~~~~~~~~~~~~~~~~~~~~

DL task wakeup CPU selection searches for a CPU that satisfies::

  task_bandwidth(p) < capacity(task_cpu(p))

while still respecting the usual bandwidth and deadline constraints. If
none of the candidate CPUs can satisfy this capacity criterion, then the
task will remain on its current CPU.
Commit	Line	Data
65065fd7 VS	1	=========================
	2	Capacity Aware Scheduling
	3	=========================
	4
	5	1. CPU Capacity
	6	===============
	7
	8	1.1 Introduction
	9	----------------
	10
	11	Conventional, homogeneous SMP platforms are composed of purely identical
	12	CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
	13	different performance characteristics - on such platforms, not all CPUs can be
	14	considered equal.
	15
	16	CPU capacity is a measure of the performance a CPU can reach, normalized against
	17	the most performant CPU in the system. Heterogeneous systems are also called
	18	asymmetric CPU capacity systems, as they contain CPUs of different capacities.
	19
	20	Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
	21	from two factors:
	22
	23	- not all CPUs may have the same microarchitecture (µarch).
	24	- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
	25	physically able to attain the higher Operating Performance Points (OPP).
	26
	27	Arm big.LITTLE systems are an example of both. The big CPUs are more
	28	performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
	29	smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
	30	can.
	31
	32	CPU performance is usually expressed in Millions of Instructions Per Second
	33	(MIPS), which can also be expressed as a given amount of instructions attainable
	34	per Hz, leading to::
	35
	36	capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)
	37
	38	1.2 Scheduler terms
	39	-------------------
	40
	41	Two different capacity values are used within the scheduler. A CPU's
7bc26384 VG	42	``original capacity`` is its maximum attainable capacity, i.e. its maximum
	43	attainable performance level. This original capacity is returned by
	44	the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
	45	capacity`` to which some loss of available performance (e.g. time spent
	46	handling IRQs) is subtracted.
65065fd7 VS	47
65065fd7 VS	48	Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
7bc26384 VG	49	while ``original capacity`` is class-agnostic. The rest of this document will use
7bc26384 VG	50	the term ``capacity`` interchangeably with ``original capacity`` for the sake of
65065fd7 VS	51	brevity.
	52
	53	1.3 Platform examples
	54	---------------------
	55
	56	1.3.1 Identical OPPs
	57	~~~~~~~~~~~~~~~~~~~~
	58
	59	Consider an hypothetical dual-core asymmetric CPU capacity system where
	60
	61	- work_per_hz(CPU0) = W
	62	- work_per_hz(CPU1) = W/2
	63	- all CPUs are running at the same fixed frequency
	64
	65	By the above definition of capacity:
	66
	67	- capacity(CPU0) = C
	68	- capacity(CPU1) = C/2
	69
	70	To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
	71	be a LITTLE.
	72
	73	With a workload that periodically does a fixed amount of work, you will get an
	74	execution trace like so::
	75
	76	CPU0 work ^
	77	\| ____ ____ ____
	78	\| \| \| \| \| \| \|
	79	+----+----+----+----+----+----+----+----+----+----+-> time
	80
	81	CPU1 work ^
	82	\| _________ _________ ____
	83	\| \| \| \| \| \|
	84	+----+----+----+----+----+----+----+----+----+----+-> time
	85
	86	CPU0 has the highest capacity in the system (C), and completes a fixed amount of
	87	work W in T units of time. On the other hand, CPU1 has half the capacity of
	88	CPU0, and thus only completes W/2 in T.
	89
	90	1.3.2 Different max OPPs
	91	~~~~~~~~~~~~~~~~~~~~~~~~
	92
	93	Usually, CPUs of different capacity values also have different maximum
	94	OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:
	95
	96	- max_freq(CPU0) = F
	97	- max_freq(CPU1) = 2/3 * F
	98
	99	This yields:
	100
	101	- capacity(CPU0) = C
	102	- capacity(CPU1) = C/3
	103
	104	Executing the same workload as described in 1.3.1, which each CPU running at its
	105	maximum frequency results in::
	106
	107	CPU0 work ^
	108	\| ____ ____ ____
	109	\| \| \| \| \| \| \|
	110	+----+----+----+----+----+----+----+----+----+----+-> time
	111
	112	workload on CPU1
	113	CPU1 work ^
	114	\| ______________ ______________ ____
115	\| \| \| \| \| \|
116	+----+----+----+----+----+----+----+----+----+----+-> time
117
118	1.4 Representation caveat
119	-------------------------
120
121	It should be noted that having a single value to represent differences in CPU
122	performance is somewhat of a contentious point. The relative performance
123	difference between two different µarchs could be X% on integer operations, Y% on
124	floating point operations, Z% on branches, and so on. Still, results using this
125	simple approach have been satisfactory for now.
126
127	2. Task utilization
128	===================
129
130	2.1 Introduction
131	----------------
132
133	Capacity aware scheduling requires an expression of a task's requirements with
134	regards to CPU capacity. Each scheduler class can express this differently, and
135	while task utilization is specific to CFS, it is convenient to describe it here
136	in order to introduce more generic concepts.
137
138	Task utilization is a percentage meant to represent the throughput requirements
139	of a task. A simple approximation of it is the task's duty cycle, i.e.::
140
141	task_util(p) = duty_cycle(p)
142
143	On an SMP system with fixed frequencies, 100% utilization suggests the task is a
144	busy loop. Conversely, 10% utilization hints it is a small periodic task that
145	spends more time sleeping than executing. Variable CPU frequencies and
146	asymmetric CPU capacities complexify this somewhat; the following sections will
147	expand on these.
148
149	2.2 Frequency invariance
150	------------------------
151
152	One issue that needs to be taken into account is that a workload's duty cycle is
153	directly impacted by the current OPP the CPU is running at. Consider running a
154	periodic workload at a given frequency F::
155
156	CPU work ^
157	\| ____ ____ ____
158	\| \| \| \| \| \| \|
159	+----+----+----+----+----+----+----+----+----+----+-> time
160
161	This yields duty_cycle(p) == 25%.
162
163	Now, consider running the same workload at frequency F/2::
164
165	CPU work ^
166	\| _________ _________ ____
167	\| \| \| \| \| \|
168	+----+----+----+----+----+----+----+----+----+----+-> time
169
170	This yields duty_cycle(p) == 50%, despite the task having the exact same
171	behaviour (i.e. executing the same amount of work) in both executions.
172
173	The task utilization signal can be made frequency invariant using the following
174	formula::
175
176	task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))
177
178	Applying this formula to the two examples above yields a frequency invariant
179	task utilization of 25%.
180
181	2.3 CPU invariance
182	------------------
183
184	CPU capacity has a similar effect on task utilization in that running an
185	identical workload on CPUs of different capacity values will yield different
186	duty cycles.
187
188	Consider the system described in 1.3.2., i.e.::
189
190	- capacity(CPU0) = C
191	- capacity(CPU1) = C/3
192
193	Executing a given periodic workload on each CPU at their maximum frequency would
194	result in::
195
196	CPU0 work ^
197	\| ____ ____ ____
198	\| \| \| \| \| \| \|
199	+----+----+----+----+----+----+----+----+----+----+-> time
200
201	CPU1 work ^
202	\| ______________ ______________ ____
203	\| \| \| \| \| \|
204	+----+----+----+----+----+----+----+----+----+----+-> time
205
206	IOW,
207
208	- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
209	- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency
210
211	The task utilization signal can be made CPU invariant using the following
212	formula::
213
214	task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)
215
216	with ``max_capacity`` being the highest CPU capacity value in the
217	system. Applying this formula to the above example above yields a CPU
218	invariant task utilization of 25%.
219
220	2.4 Invariant task utilization
221	------------------------------
222
223	Both frequency and CPU invariance need to be applied to task utilization in
224	order to obtain a truly invariant signal. The pseudo-formula for a task
225	utilization that is both CPU and frequency invariant is thus, for a given
226	task p::
227
228	curr_frequency(cpu) capacity(cpu)
229	task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
230	max_frequency(cpu) max_capacity
231
232	In other words, invariant task utilization describes the behaviour of a task as
233	if it were running on the highest-capacity CPU in the system, running at its
234	maximum frequency.
235
236	Any mention of task utilization in the following sections will imply its
237	invariant form.
238
239	2.5 Utilization estimation
240	--------------------------
241
242	Without a crystal ball, task behaviour (and thus task utilization) cannot
243	accurately be predicted the moment a task first becomes runnable. The CFS class
244	maintains a handful of CPU and task signals based on the Per-Entity Load
245	Tracking (PELT) mechanism, one of those yielding an average utilization (as
246	opposed to instantaneous).
247
248	This means that while the capacity aware scheduling criteria will be written
249	considering a "true" task utilization (using a crystal ball), the implementation
250	will only ever be able to use an estimator thereof.
251
252	3. Capacity aware scheduling requirements
253	=========================================
254
255	3.1 CPU capacity
256	----------------
257
258	Linux cannot currently figure out CPU capacity on its own, this information thus
259	needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
260	for that purpose.
261
5d89176a	262	The arm, arm64, and RISC-V architectures directly map this to the arch_topology driver
65065fd7	263	CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
7d207831	264	Documentation/devicetree/bindings/cpu/cpu-capacity.txt.
65065fd7 VS	265
	266	3.2 Frequency invariance
	267	------------------------
	268
	269	As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
	270	utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
	271	purpose.
	272
	273	Implementing this function requires figuring out at which frequency each CPU
	274	have been running at. One way to implement this is to leverage hardware counters
	275	whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86,
	276	AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
	277	when the kernel is aware of the switched-to frequency (also employed by
	278	arm/arm64).
	279
	280	4. Scheduler topology
	281	=====================
	282
	283	During the construction of the sched domains, the scheduler will figure out
	284	whether the system exhibits asymmetric CPU capacities. Should that be the
	285	case:
	286
	287	- The sched_asym_cpucapacity static key will be enabled.
adf3c31e BM	288	- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
	289	level that spans all unique CPU capacity values.
	290	- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
	291	CPUs with any range of asymmetry.
65065fd7 VS	292
	293	The sched_asym_cpucapacity static key is intended to guard sections of code that
	294	cater to asymmetric CPU capacity systems. Do note however that said key is
	295	system-wide. Imagine the following setup using cpusets::
	296
	297	capacity C/2 C
	298	________ ________
	299	/ \ / \
	300	CPUs 0 1 2 3 4 5 6 7
	301	\__/ \______________/
	302	cpusets cs0 cs1
	303
	304	Which could be created via:
	305
	306	.. code-block:: sh
	307
	308	mkdir /sys/fs/cgroup/cpuset/cs0
	309	echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
	310	echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems
	311
	312	mkdir /sys/fs/cgroup/cpuset/cs1
	313	echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
	314	echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems
	315
	316	echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
	317
	318	Since there is CPU capacity asymmetry in the system, the
	319	sched_asym_cpucapacity static key will be enabled. However, the sched_domain
	320	hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
	321	set in that hierarchy, it describes an SMP island and should be treated as such.
	322
	323	Therefore, the 'canonical' pattern for protecting codepaths that cater to
	324	asymmetric CPU capacities is to:
	325
	326	- Check the sched_asym_cpucapacity static key
	327	- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
	328	the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
	329	CPU or group thereof)
	330
	331	5. Capacity aware scheduling implementation
	332	===========================================
	333
	334	5.1 CFS
	335	-------
	336
	337	5.1.1 Capacity fitness
	338	~~~~~~~~~~~~~~~~~~~~~~
	339
	340	The main capacity scheduling criterion of CFS is::
	341
	342	task_util(p) < capacity(task_cpu(p))
	343
	344	This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
	345	task "fits" on its CPU. If it is violated, the task will need to achieve more
	346	work than what its CPU can provide: it will be CPU-bound.
	347
	348	Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
	349	value for a task, either via sched_setattr() or via the cgroup interface (see
	350	Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to
	351	clamp task_util() in the previous criterion.
	352
	353	5.1.2 Wakeup CPU selection
	354	~~~~~~~~~~~~~~~~~~~~~~~~~~
	355
356	CFS task wakeup CPU selection follows the capacity fitness criterion described
357	above. On top of that, uclamp is used to clamp the task utilization values,
358	which lets userspace have more leverage over the CPU selection of CFS
359	tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::
360
361	clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)
362
363	By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
364	on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
365	periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
366	giving it a high uclamp.min value.
367
368	.. note::
369
370	Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
e4e29e78	371	(EAS), which is described in Documentation/scheduler/sched-energy.rst.
65065fd7 VS	372
	373	5.1.3 Load balancing
	374	~~~~~~~~~~~~~~~~~~~~
	375
	376	A pathological case in the wakeup CPU selection occurs when a task rarely
	377	sleeps, if at all - it thus rarely wakes up, if at all. Consider::
	378
	379	w == wakeup event
	380
	381	capacity(CPU0) = C
	382	capacity(CPU1) = C / 3
	383
	384	workload on CPU0
	385	CPU work ^
	386	\| _________ _________ ____
	387	\| \| \| \| \| \|
	388	+----+----+----+----+----+----+----+----+----+----+-> time
	389	w w w
	390
	391	workload on CPU1
	392	CPU work ^
	393	\| ____________________________________________
	394	\| \|
	395	+----+----+----+----+----+----+----+----+----+----+->
	396	w
	397
	398	This workload should run on CPU0, but if the task either:
	399
	400	- was improperly scheduled from the start (inaccurate initial
	401	utilization estimation)
	402	- was properly scheduled from the start, but suddenly needs more
	403	processing power
	404
	405	then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
	406	the CPU capacity scheduling criterion is violated, and there may not be any more
	407	wakeup event to fix this up via wakeup CPU selection.
	408
	409	Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
	410	put in place to handle this shares the same name. Misfit task migration
	411	leverages the CFS load balancer, more specifically the active load balance part
	412	(which caters to migrating currently running tasks). When load balance happens,
	413	a misfit active load balance will be triggered if a misfit task can be migrated
	414	to a CPU with more capacity than its current one.
	415
	416	5.2 RT
	417	------
	418
	419	5.2.1 Wakeup CPU selection
	420	~~~~~~~~~~~~~~~~~~~~~~~~~~
	421
	422	RT task wakeup CPU selection searches for a CPU that satisfies::
	423
	424	task_uclamp_min(p) <= capacity(task_cpu(cpu))
	425
	426	while still following the usual priority constraints. If none of the candidate
	427	CPUs can satisfy this capacity criterion, then strict priority based scheduling
	428	is followed and CPU capacities are ignored.
	429
	430	5.3 DL
	431	------
	432
	433	5.3.1 Wakeup CPU selection
	434	~~~~~~~~~~~~~~~~~~~~~~~~~~
	435
436	DL task wakeup CPU selection searches for a CPU that satisfies::
437
438	task_bandwidth(p) < capacity(task_cpu(p))
439
440	while still respecting the usual bandwidth and deadline constraints. If
441	none of the candidate CPUs can satisfy this capacity criterion, then the
442	task will remain on its current CPU.