sched/topology: Consolidate and clean up access to a CPU's max compute capacity
[linux-2.6-block.git] / Documentation / scheduler / sched-capacity.rst
CommitLineData
65065fd7
VS
1=========================
2Capacity Aware Scheduling
3=========================
4
51. CPU Capacity
6===============
7
81.1 Introduction
9----------------
10
11Conventional, homogeneous SMP platforms are composed of purely identical
12CPUs. Heterogeneous platforms on the other hand are composed of CPUs with
13different performance characteristics - on such platforms, not all CPUs can be
14considered equal.
15
16CPU capacity is a measure of the performance a CPU can reach, normalized against
17the most performant CPU in the system. Heterogeneous systems are also called
18asymmetric CPU capacity systems, as they contain CPUs of different capacities.
19
20Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems
21from two factors:
22
23- not all CPUs may have the same microarchitecture (µarch).
24- with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be
25 physically able to attain the higher Operating Performance Points (OPP).
26
27Arm big.LITTLE systems are an example of both. The big CPUs are more
28performance-oriented than the LITTLE ones (more pipeline stages, bigger caches,
29smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones
30can.
31
32CPU performance is usually expressed in Millions of Instructions Per Second
33(MIPS), which can also be expressed as a given amount of instructions attainable
34per Hz, leading to::
35
36 capacity(cpu) = work_per_hz(cpu) * max_freq(cpu)
37
381.2 Scheduler terms
39-------------------
40
41Two different capacity values are used within the scheduler. A CPU's
7bc26384
VG
42``original capacity`` is its maximum attainable capacity, i.e. its maximum
43attainable performance level. This original capacity is returned by
44the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original
45capacity`` to which some loss of available performance (e.g. time spent
46handling IRQs) is subtracted.
65065fd7
VS
47
48Note that a CPU's ``capacity`` is solely intended to be used by the CFS class,
7bc26384
VG
49while ``original capacity`` is class-agnostic. The rest of this document will use
50the term ``capacity`` interchangeably with ``original capacity`` for the sake of
65065fd7
VS
51brevity.
52
531.3 Platform examples
54---------------------
55
561.3.1 Identical OPPs
57~~~~~~~~~~~~~~~~~~~~
58
59Consider an hypothetical dual-core asymmetric CPU capacity system where
60
61- work_per_hz(CPU0) = W
62- work_per_hz(CPU1) = W/2
63- all CPUs are running at the same fixed frequency
64
65By the above definition of capacity:
66
67- capacity(CPU0) = C
68- capacity(CPU1) = C/2
69
70To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would
71be a LITTLE.
72
73With a workload that periodically does a fixed amount of work, you will get an
74execution trace like so::
75
76 CPU0 work ^
77 | ____ ____ ____
78 | | | | | | |
79 +----+----+----+----+----+----+----+----+----+----+-> time
80
81 CPU1 work ^
82 | _________ _________ ____
83 | | | | | |
84 +----+----+----+----+----+----+----+----+----+----+-> time
85
86CPU0 has the highest capacity in the system (C), and completes a fixed amount of
87work W in T units of time. On the other hand, CPU1 has half the capacity of
88CPU0, and thus only completes W/2 in T.
89
901.3.2 Different max OPPs
91~~~~~~~~~~~~~~~~~~~~~~~~
92
93Usually, CPUs of different capacity values also have different maximum
94OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with:
95
96- max_freq(CPU0) = F
97- max_freq(CPU1) = 2/3 * F
98
99This yields:
100
101- capacity(CPU0) = C
102- capacity(CPU1) = C/3
103
104Executing the same workload as described in 1.3.1, which each CPU running at its
105maximum frequency results in::
106
107 CPU0 work ^
108 | ____ ____ ____
109 | | | | | | |
110 +----+----+----+----+----+----+----+----+----+----+-> time
111
112 workload on CPU1
113 CPU1 work ^
114 | ______________ ______________ ____
115 | | | | | |
116 +----+----+----+----+----+----+----+----+----+----+-> time
117
1181.4 Representation caveat
119-------------------------
120
121It should be noted that having a *single* value to represent differences in CPU
122performance is somewhat of a contentious point. The relative performance
123difference between two different µarchs could be X% on integer operations, Y% on
124floating point operations, Z% on branches, and so on. Still, results using this
125simple approach have been satisfactory for now.
126
1272. Task utilization
128===================
129
1302.1 Introduction
131----------------
132
133Capacity aware scheduling requires an expression of a task's requirements with
134regards to CPU capacity. Each scheduler class can express this differently, and
135while task utilization is specific to CFS, it is convenient to describe it here
136in order to introduce more generic concepts.
137
138Task utilization is a percentage meant to represent the throughput requirements
139of a task. A simple approximation of it is the task's duty cycle, i.e.::
140
141 task_util(p) = duty_cycle(p)
142
143On an SMP system with fixed frequencies, 100% utilization suggests the task is a
144busy loop. Conversely, 10% utilization hints it is a small periodic task that
145spends more time sleeping than executing. Variable CPU frequencies and
146asymmetric CPU capacities complexify this somewhat; the following sections will
147expand on these.
148
1492.2 Frequency invariance
150------------------------
151
152One issue that needs to be taken into account is that a workload's duty cycle is
153directly impacted by the current OPP the CPU is running at. Consider running a
154periodic workload at a given frequency F::
155
156 CPU work ^
157 | ____ ____ ____
158 | | | | | | |
159 +----+----+----+----+----+----+----+----+----+----+-> time
160
161This yields duty_cycle(p) == 25%.
162
163Now, consider running the *same* workload at frequency F/2::
164
165 CPU work ^
166 | _________ _________ ____
167 | | | | | |
168 +----+----+----+----+----+----+----+----+----+----+-> time
169
170This yields duty_cycle(p) == 50%, despite the task having the exact same
171behaviour (i.e. executing the same amount of work) in both executions.
172
173The task utilization signal can be made frequency invariant using the following
174formula::
175
176 task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu))
177
178Applying this formula to the two examples above yields a frequency invariant
179task utilization of 25%.
180
1812.3 CPU invariance
182------------------
183
184CPU capacity has a similar effect on task utilization in that running an
185identical workload on CPUs of different capacity values will yield different
186duty cycles.
187
188Consider the system described in 1.3.2., i.e.::
189
190- capacity(CPU0) = C
191- capacity(CPU1) = C/3
192
193Executing a given periodic workload on each CPU at their maximum frequency would
194result in::
195
196 CPU0 work ^
197 | ____ ____ ____
198 | | | | | | |
199 +----+----+----+----+----+----+----+----+----+----+-> time
200
201 CPU1 work ^
202 | ______________ ______________ ____
203 | | | | | |
204 +----+----+----+----+----+----+----+----+----+----+-> time
205
206IOW,
207
208- duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency
209- duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency
210
211The task utilization signal can be made CPU invariant using the following
212formula::
213
214 task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity)
215
216with ``max_capacity`` being the highest CPU capacity value in the
217system. Applying this formula to the above example above yields a CPU
218invariant task utilization of 25%.
219
2202.4 Invariant task utilization
221------------------------------
222
223Both frequency and CPU invariance need to be applied to task utilization in
224order to obtain a truly invariant signal. The pseudo-formula for a task
225utilization that is both CPU and frequency invariant is thus, for a given
226task p::
227
228 curr_frequency(cpu) capacity(cpu)
229 task_util_inv(p) = duty_cycle(p) * ------------------- * -------------
230 max_frequency(cpu) max_capacity
231
232In other words, invariant task utilization describes the behaviour of a task as
233if it were running on the highest-capacity CPU in the system, running at its
234maximum frequency.
235
236Any mention of task utilization in the following sections will imply its
237invariant form.
238
2392.5 Utilization estimation
240--------------------------
241
242Without a crystal ball, task behaviour (and thus task utilization) cannot
243accurately be predicted the moment a task first becomes runnable. The CFS class
244maintains a handful of CPU and task signals based on the Per-Entity Load
245Tracking (PELT) mechanism, one of those yielding an *average* utilization (as
246opposed to instantaneous).
247
248This means that while the capacity aware scheduling criteria will be written
249considering a "true" task utilization (using a crystal ball), the implementation
250will only ever be able to use an estimator thereof.
251
2523. Capacity aware scheduling requirements
253=========================================
254
2553.1 CPU capacity
256----------------
257
258Linux cannot currently figure out CPU capacity on its own, this information thus
259needs to be handed to it. Architectures must define arch_scale_cpu_capacity()
260for that purpose.
261
5d89176a 262The arm, arm64, and RISC-V architectures directly map this to the arch_topology driver
65065fd7 263CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see
7d207831 264Documentation/devicetree/bindings/cpu/cpu-capacity.txt.
65065fd7
VS
265
2663.2 Frequency invariance
267------------------------
268
269As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task
270utilization. Architectures must define arch_scale_freq_capacity(cpu) for that
271purpose.
272
273Implementing this function requires figuring out at which frequency each CPU
274have been running at. One way to implement this is to leverage hardware counters
275whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86,
276AMU on arm64). Another is to directly hook into cpufreq frequency transitions,
277when the kernel is aware of the switched-to frequency (also employed by
278arm/arm64).
279
2804. Scheduler topology
281=====================
282
283During the construction of the sched domains, the scheduler will figure out
284whether the system exhibits asymmetric CPU capacities. Should that be the
285case:
286
287- The sched_asym_cpucapacity static key will be enabled.
adf3c31e
BM
288- The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain
289 level that spans all unique CPU capacity values.
290- The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans
291 CPUs with any range of asymmetry.
65065fd7
VS
292
293The sched_asym_cpucapacity static key is intended to guard sections of code that
294cater to asymmetric CPU capacity systems. Do note however that said key is
295*system-wide*. Imagine the following setup using cpusets::
296
297 capacity C/2 C
298 ________ ________
299 / \ / \
300 CPUs 0 1 2 3 4 5 6 7
301 \__/ \______________/
302 cpusets cs0 cs1
303
304Which could be created via:
305
306.. code-block:: sh
307
308 mkdir /sys/fs/cgroup/cpuset/cs0
309 echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus
310 echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems
311
312 mkdir /sys/fs/cgroup/cpuset/cs1
313 echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus
314 echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems
315
316 echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance
317
318Since there *is* CPU capacity asymmetry in the system, the
319sched_asym_cpucapacity static key will be enabled. However, the sched_domain
320hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't
321set in that hierarchy, it describes an SMP island and should be treated as such.
322
323Therefore, the 'canonical' pattern for protecting codepaths that cater to
324asymmetric CPU capacities is to:
325
326- Check the sched_asym_cpucapacity static key
327- If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in
328 the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific
329 CPU or group thereof)
330
3315. Capacity aware scheduling implementation
332===========================================
333
3345.1 CFS
335-------
336
3375.1.1 Capacity fitness
338~~~~~~~~~~~~~~~~~~~~~~
339
340The main capacity scheduling criterion of CFS is::
341
342 task_util(p) < capacity(task_cpu(p))
343
344This is commonly called the capacity fitness criterion, i.e. CFS must ensure a
345task "fits" on its CPU. If it is violated, the task will need to achieve more
346work than what its CPU can provide: it will be CPU-bound.
347
348Furthermore, uclamp lets userspace specify a minimum and a maximum utilization
349value for a task, either via sched_setattr() or via the cgroup interface (see
350Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to
351clamp task_util() in the previous criterion.
352
3535.1.2 Wakeup CPU selection
354~~~~~~~~~~~~~~~~~~~~~~~~~~
355
356CFS task wakeup CPU selection follows the capacity fitness criterion described
357above. On top of that, uclamp is used to clamp the task utilization values,
358which lets userspace have more leverage over the CPU selection of CFS
359tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies::
360
361 clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu)
362
363By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run
364on any CPU by giving it a low uclamp.max value. Conversely, it can force a small
365periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by
366giving it a high uclamp.min value.
367
368.. note::
369
370 Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling
e4e29e78 371 (EAS), which is described in Documentation/scheduler/sched-energy.rst.
65065fd7
VS
372
3735.1.3 Load balancing
374~~~~~~~~~~~~~~~~~~~~
375
376A pathological case in the wakeup CPU selection occurs when a task rarely
377sleeps, if at all - it thus rarely wakes up, if at all. Consider::
378
379 w == wakeup event
380
381 capacity(CPU0) = C
382 capacity(CPU1) = C / 3
383
384 workload on CPU0
385 CPU work ^
386 | _________ _________ ____
387 | | | | | |
388 +----+----+----+----+----+----+----+----+----+----+-> time
389 w w w
390
391 workload on CPU1
392 CPU work ^
393 | ____________________________________________
394 | |
395 +----+----+----+----+----+----+----+----+----+----+->
396 w
397
398This workload should run on CPU0, but if the task either:
399
400- was improperly scheduled from the start (inaccurate initial
401 utilization estimation)
402- was properly scheduled from the start, but suddenly needs more
403 processing power
404
405then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``;
406the CPU capacity scheduling criterion is violated, and there may not be any more
407wakeup event to fix this up via wakeup CPU selection.
408
409Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism
410put in place to handle this shares the same name. Misfit task migration
411leverages the CFS load balancer, more specifically the active load balance part
412(which caters to migrating currently running tasks). When load balance happens,
413a misfit active load balance will be triggered if a misfit task can be migrated
414to a CPU with more capacity than its current one.
415
4165.2 RT
417------
418
4195.2.1 Wakeup CPU selection
420~~~~~~~~~~~~~~~~~~~~~~~~~~
421
422RT task wakeup CPU selection searches for a CPU that satisfies::
423
424 task_uclamp_min(p) <= capacity(task_cpu(cpu))
425
426while still following the usual priority constraints. If none of the candidate
427CPUs can satisfy this capacity criterion, then strict priority based scheduling
428is followed and CPU capacities are ignored.
429
4305.3 DL
431------
432
4335.3.1 Wakeup CPU selection
434~~~~~~~~~~~~~~~~~~~~~~~~~~
435
436DL task wakeup CPU selection searches for a CPU that satisfies::
437
438 task_bandwidth(p) < capacity(task_cpu(p))
439
440while still respecting the usual bandwidth and deadline constraints. If
441none of the candidate CPUs can satisfy this capacity criterion, then the
442task will remain on its current CPU.