Commit | Line | Data |
---|---|---|
65065fd7 VS |
1 | ========================= |
2 | Capacity Aware Scheduling | |
3 | ========================= | |
4 | ||
5 | 1. CPU Capacity | |
6 | =============== | |
7 | ||
8 | 1.1 Introduction | |
9 | ---------------- | |
10 | ||
11 | Conventional, homogeneous SMP platforms are composed of purely identical | |
12 | CPUs. Heterogeneous platforms on the other hand are composed of CPUs with | |
13 | different performance characteristics - on such platforms, not all CPUs can be | |
14 | considered equal. | |
15 | ||
16 | CPU capacity is a measure of the performance a CPU can reach, normalized against | |
17 | the most performant CPU in the system. Heterogeneous systems are also called | |
18 | asymmetric CPU capacity systems, as they contain CPUs of different capacities. | |
19 | ||
20 | Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems | |
21 | from two factors: | |
22 | ||
23 | - not all CPUs may have the same microarchitecture (µarch). | |
24 | - with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be | |
25 | physically able to attain the higher Operating Performance Points (OPP). | |
26 | ||
27 | Arm big.LITTLE systems are an example of both. The big CPUs are more | |
28 | performance-oriented than the LITTLE ones (more pipeline stages, bigger caches, | |
29 | smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones | |
30 | can. | |
31 | ||
32 | CPU performance is usually expressed in Millions of Instructions Per Second | |
33 | (MIPS), which can also be expressed as a given amount of instructions attainable | |
34 | per Hz, leading to:: | |
35 | ||
36 | capacity(cpu) = work_per_hz(cpu) * max_freq(cpu) | |
37 | ||
38 | 1.2 Scheduler terms | |
39 | ------------------- | |
40 | ||
41 | Two different capacity values are used within the scheduler. A CPU's | |
42 | ``capacity_orig`` is its maximum attainable capacity, i.e. its maximum | |
43 | attainable performance level. A CPU's ``capacity`` is its ``capacity_orig`` to | |
44 | which some loss of available performance (e.g. time spent handling IRQs) is | |
45 | subtracted. | |
46 | ||
47 | Note that a CPU's ``capacity`` is solely intended to be used by the CFS class, | |
48 | while ``capacity_orig`` is class-agnostic. The rest of this document will use | |
49 | the term ``capacity`` interchangeably with ``capacity_orig`` for the sake of | |
50 | brevity. | |
51 | ||
52 | 1.3 Platform examples | |
53 | --------------------- | |
54 | ||
55 | 1.3.1 Identical OPPs | |
56 | ~~~~~~~~~~~~~~~~~~~~ | |
57 | ||
58 | Consider an hypothetical dual-core asymmetric CPU capacity system where | |
59 | ||
60 | - work_per_hz(CPU0) = W | |
61 | - work_per_hz(CPU1) = W/2 | |
62 | - all CPUs are running at the same fixed frequency | |
63 | ||
64 | By the above definition of capacity: | |
65 | ||
66 | - capacity(CPU0) = C | |
67 | - capacity(CPU1) = C/2 | |
68 | ||
69 | To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would | |
70 | be a LITTLE. | |
71 | ||
72 | With a workload that periodically does a fixed amount of work, you will get an | |
73 | execution trace like so:: | |
74 | ||
75 | CPU0 work ^ | |
76 | | ____ ____ ____ | |
77 | | | | | | | | | |
78 | +----+----+----+----+----+----+----+----+----+----+-> time | |
79 | ||
80 | CPU1 work ^ | |
81 | | _________ _________ ____ | |
82 | | | | | | | | |
83 | +----+----+----+----+----+----+----+----+----+----+-> time | |
84 | ||
85 | CPU0 has the highest capacity in the system (C), and completes a fixed amount of | |
86 | work W in T units of time. On the other hand, CPU1 has half the capacity of | |
87 | CPU0, and thus only completes W/2 in T. | |
88 | ||
89 | 1.3.2 Different max OPPs | |
90 | ~~~~~~~~~~~~~~~~~~~~~~~~ | |
91 | ||
92 | Usually, CPUs of different capacity values also have different maximum | |
93 | OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with: | |
94 | ||
95 | - max_freq(CPU0) = F | |
96 | - max_freq(CPU1) = 2/3 * F | |
97 | ||
98 | This yields: | |
99 | ||
100 | - capacity(CPU0) = C | |
101 | - capacity(CPU1) = C/3 | |
102 | ||
103 | Executing the same workload as described in 1.3.1, which each CPU running at its | |
104 | maximum frequency results in:: | |
105 | ||
106 | CPU0 work ^ | |
107 | | ____ ____ ____ | |
108 | | | | | | | | | |
109 | +----+----+----+----+----+----+----+----+----+----+-> time | |
110 | ||
111 | workload on CPU1 | |
112 | CPU1 work ^ | |
113 | | ______________ ______________ ____ | |
114 | | | | | | | | |
115 | +----+----+----+----+----+----+----+----+----+----+-> time | |
116 | ||
117 | 1.4 Representation caveat | |
118 | ------------------------- | |
119 | ||
120 | It should be noted that having a *single* value to represent differences in CPU | |
121 | performance is somewhat of a contentious point. The relative performance | |
122 | difference between two different µarchs could be X% on integer operations, Y% on | |
123 | floating point operations, Z% on branches, and so on. Still, results using this | |
124 | simple approach have been satisfactory for now. | |
125 | ||
126 | 2. Task utilization | |
127 | =================== | |
128 | ||
129 | 2.1 Introduction | |
130 | ---------------- | |
131 | ||
132 | Capacity aware scheduling requires an expression of a task's requirements with | |
133 | regards to CPU capacity. Each scheduler class can express this differently, and | |
134 | while task utilization is specific to CFS, it is convenient to describe it here | |
135 | in order to introduce more generic concepts. | |
136 | ||
137 | Task utilization is a percentage meant to represent the throughput requirements | |
138 | of a task. A simple approximation of it is the task's duty cycle, i.e.:: | |
139 | ||
140 | task_util(p) = duty_cycle(p) | |
141 | ||
142 | On an SMP system with fixed frequencies, 100% utilization suggests the task is a | |
143 | busy loop. Conversely, 10% utilization hints it is a small periodic task that | |
144 | spends more time sleeping than executing. Variable CPU frequencies and | |
145 | asymmetric CPU capacities complexify this somewhat; the following sections will | |
146 | expand on these. | |
147 | ||
148 | 2.2 Frequency invariance | |
149 | ------------------------ | |
150 | ||
151 | One issue that needs to be taken into account is that a workload's duty cycle is | |
152 | directly impacted by the current OPP the CPU is running at. Consider running a | |
153 | periodic workload at a given frequency F:: | |
154 | ||
155 | CPU work ^ | |
156 | | ____ ____ ____ | |
157 | | | | | | | | | |
158 | +----+----+----+----+----+----+----+----+----+----+-> time | |
159 | ||
160 | This yields duty_cycle(p) == 25%. | |
161 | ||
162 | Now, consider running the *same* workload at frequency F/2:: | |
163 | ||
164 | CPU work ^ | |
165 | | _________ _________ ____ | |
166 | | | | | | | | |
167 | +----+----+----+----+----+----+----+----+----+----+-> time | |
168 | ||
169 | This yields duty_cycle(p) == 50%, despite the task having the exact same | |
170 | behaviour (i.e. executing the same amount of work) in both executions. | |
171 | ||
172 | The task utilization signal can be made frequency invariant using the following | |
173 | formula:: | |
174 | ||
175 | task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu)) | |
176 | ||
177 | Applying this formula to the two examples above yields a frequency invariant | |
178 | task utilization of 25%. | |
179 | ||
180 | 2.3 CPU invariance | |
181 | ------------------ | |
182 | ||
183 | CPU capacity has a similar effect on task utilization in that running an | |
184 | identical workload on CPUs of different capacity values will yield different | |
185 | duty cycles. | |
186 | ||
187 | Consider the system described in 1.3.2., i.e.:: | |
188 | ||
189 | - capacity(CPU0) = C | |
190 | - capacity(CPU1) = C/3 | |
191 | ||
192 | Executing a given periodic workload on each CPU at their maximum frequency would | |
193 | result in:: | |
194 | ||
195 | CPU0 work ^ | |
196 | | ____ ____ ____ | |
197 | | | | | | | | | |
198 | +----+----+----+----+----+----+----+----+----+----+-> time | |
199 | ||
200 | CPU1 work ^ | |
201 | | ______________ ______________ ____ | |
202 | | | | | | | | |
203 | +----+----+----+----+----+----+----+----+----+----+-> time | |
204 | ||
205 | IOW, | |
206 | ||
207 | - duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency | |
208 | - duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency | |
209 | ||
210 | The task utilization signal can be made CPU invariant using the following | |
211 | formula:: | |
212 | ||
213 | task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity) | |
214 | ||
215 | with ``max_capacity`` being the highest CPU capacity value in the | |
216 | system. Applying this formula to the above example above yields a CPU | |
217 | invariant task utilization of 25%. | |
218 | ||
219 | 2.4 Invariant task utilization | |
220 | ------------------------------ | |
221 | ||
222 | Both frequency and CPU invariance need to be applied to task utilization in | |
223 | order to obtain a truly invariant signal. The pseudo-formula for a task | |
224 | utilization that is both CPU and frequency invariant is thus, for a given | |
225 | task p:: | |
226 | ||
227 | curr_frequency(cpu) capacity(cpu) | |
228 | task_util_inv(p) = duty_cycle(p) * ------------------- * ------------- | |
229 | max_frequency(cpu) max_capacity | |
230 | ||
231 | In other words, invariant task utilization describes the behaviour of a task as | |
232 | if it were running on the highest-capacity CPU in the system, running at its | |
233 | maximum frequency. | |
234 | ||
235 | Any mention of task utilization in the following sections will imply its | |
236 | invariant form. | |
237 | ||
238 | 2.5 Utilization estimation | |
239 | -------------------------- | |
240 | ||
241 | Without a crystal ball, task behaviour (and thus task utilization) cannot | |
242 | accurately be predicted the moment a task first becomes runnable. The CFS class | |
243 | maintains a handful of CPU and task signals based on the Per-Entity Load | |
244 | Tracking (PELT) mechanism, one of those yielding an *average* utilization (as | |
245 | opposed to instantaneous). | |
246 | ||
247 | This means that while the capacity aware scheduling criteria will be written | |
248 | considering a "true" task utilization (using a crystal ball), the implementation | |
249 | will only ever be able to use an estimator thereof. | |
250 | ||
251 | 3. Capacity aware scheduling requirements | |
252 | ========================================= | |
253 | ||
254 | 3.1 CPU capacity | |
255 | ---------------- | |
256 | ||
257 | Linux cannot currently figure out CPU capacity on its own, this information thus | |
258 | needs to be handed to it. Architectures must define arch_scale_cpu_capacity() | |
259 | for that purpose. | |
260 | ||
5d89176a | 261 | The arm, arm64, and RISC-V architectures directly map this to the arch_topology driver |
65065fd7 | 262 | CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see |
7d207831 | 263 | Documentation/devicetree/bindings/cpu/cpu-capacity.txt. |
65065fd7 VS |
264 | |
265 | 3.2 Frequency invariance | |
266 | ------------------------ | |
267 | ||
268 | As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task | |
269 | utilization. Architectures must define arch_scale_freq_capacity(cpu) for that | |
270 | purpose. | |
271 | ||
272 | Implementing this function requires figuring out at which frequency each CPU | |
273 | have been running at. One way to implement this is to leverage hardware counters | |
274 | whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86, | |
275 | AMU on arm64). Another is to directly hook into cpufreq frequency transitions, | |
276 | when the kernel is aware of the switched-to frequency (also employed by | |
277 | arm/arm64). | |
278 | ||
279 | 4. Scheduler topology | |
280 | ===================== | |
281 | ||
282 | During the construction of the sched domains, the scheduler will figure out | |
283 | whether the system exhibits asymmetric CPU capacities. Should that be the | |
284 | case: | |
285 | ||
286 | - The sched_asym_cpucapacity static key will be enabled. | |
adf3c31e BM |
287 | - The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain |
288 | level that spans all unique CPU capacity values. | |
289 | - The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans | |
290 | CPUs with any range of asymmetry. | |
65065fd7 VS |
291 | |
292 | The sched_asym_cpucapacity static key is intended to guard sections of code that | |
293 | cater to asymmetric CPU capacity systems. Do note however that said key is | |
294 | *system-wide*. Imagine the following setup using cpusets:: | |
295 | ||
296 | capacity C/2 C | |
297 | ________ ________ | |
298 | / \ / \ | |
299 | CPUs 0 1 2 3 4 5 6 7 | |
300 | \__/ \______________/ | |
301 | cpusets cs0 cs1 | |
302 | ||
303 | Which could be created via: | |
304 | ||
305 | .. code-block:: sh | |
306 | ||
307 | mkdir /sys/fs/cgroup/cpuset/cs0 | |
308 | echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus | |
309 | echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems | |
310 | ||
311 | mkdir /sys/fs/cgroup/cpuset/cs1 | |
312 | echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus | |
313 | echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems | |
314 | ||
315 | echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance | |
316 | ||
317 | Since there *is* CPU capacity asymmetry in the system, the | |
318 | sched_asym_cpucapacity static key will be enabled. However, the sched_domain | |
319 | hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't | |
320 | set in that hierarchy, it describes an SMP island and should be treated as such. | |
321 | ||
322 | Therefore, the 'canonical' pattern for protecting codepaths that cater to | |
323 | asymmetric CPU capacities is to: | |
324 | ||
325 | - Check the sched_asym_cpucapacity static key | |
326 | - If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in | |
327 | the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific | |
328 | CPU or group thereof) | |
329 | ||
330 | 5. Capacity aware scheduling implementation | |
331 | =========================================== | |
332 | ||
333 | 5.1 CFS | |
334 | ------- | |
335 | ||
336 | 5.1.1 Capacity fitness | |
337 | ~~~~~~~~~~~~~~~~~~~~~~ | |
338 | ||
339 | The main capacity scheduling criterion of CFS is:: | |
340 | ||
341 | task_util(p) < capacity(task_cpu(p)) | |
342 | ||
343 | This is commonly called the capacity fitness criterion, i.e. CFS must ensure a | |
344 | task "fits" on its CPU. If it is violated, the task will need to achieve more | |
345 | work than what its CPU can provide: it will be CPU-bound. | |
346 | ||
347 | Furthermore, uclamp lets userspace specify a minimum and a maximum utilization | |
348 | value for a task, either via sched_setattr() or via the cgroup interface (see | |
349 | Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to | |
350 | clamp task_util() in the previous criterion. | |
351 | ||
352 | 5.1.2 Wakeup CPU selection | |
353 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
354 | ||
355 | CFS task wakeup CPU selection follows the capacity fitness criterion described | |
356 | above. On top of that, uclamp is used to clamp the task utilization values, | |
357 | which lets userspace have more leverage over the CPU selection of CFS | |
358 | tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies:: | |
359 | ||
360 | clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu) | |
361 | ||
362 | By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run | |
363 | on any CPU by giving it a low uclamp.max value. Conversely, it can force a small | |
364 | periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by | |
365 | giving it a high uclamp.min value. | |
366 | ||
367 | .. note:: | |
368 | ||
369 | Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling | |
e4e29e78 | 370 | (EAS), which is described in Documentation/scheduler/sched-energy.rst. |
65065fd7 VS |
371 | |
372 | 5.1.3 Load balancing | |
373 | ~~~~~~~~~~~~~~~~~~~~ | |
374 | ||
375 | A pathological case in the wakeup CPU selection occurs when a task rarely | |
376 | sleeps, if at all - it thus rarely wakes up, if at all. Consider:: | |
377 | ||
378 | w == wakeup event | |
379 | ||
380 | capacity(CPU0) = C | |
381 | capacity(CPU1) = C / 3 | |
382 | ||
383 | workload on CPU0 | |
384 | CPU work ^ | |
385 | | _________ _________ ____ | |
386 | | | | | | | | |
387 | +----+----+----+----+----+----+----+----+----+----+-> time | |
388 | w w w | |
389 | ||
390 | workload on CPU1 | |
391 | CPU work ^ | |
392 | | ____________________________________________ | |
393 | | | | |
394 | +----+----+----+----+----+----+----+----+----+----+-> | |
395 | w | |
396 | ||
397 | This workload should run on CPU0, but if the task either: | |
398 | ||
399 | - was improperly scheduled from the start (inaccurate initial | |
400 | utilization estimation) | |
401 | - was properly scheduled from the start, but suddenly needs more | |
402 | processing power | |
403 | ||
404 | then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``; | |
405 | the CPU capacity scheduling criterion is violated, and there may not be any more | |
406 | wakeup event to fix this up via wakeup CPU selection. | |
407 | ||
408 | Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism | |
409 | put in place to handle this shares the same name. Misfit task migration | |
410 | leverages the CFS load balancer, more specifically the active load balance part | |
411 | (which caters to migrating currently running tasks). When load balance happens, | |
412 | a misfit active load balance will be triggered if a misfit task can be migrated | |
413 | to a CPU with more capacity than its current one. | |
414 | ||
415 | 5.2 RT | |
416 | ------ | |
417 | ||
418 | 5.2.1 Wakeup CPU selection | |
419 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
420 | ||
421 | RT task wakeup CPU selection searches for a CPU that satisfies:: | |
422 | ||
423 | task_uclamp_min(p) <= capacity(task_cpu(cpu)) | |
424 | ||
425 | while still following the usual priority constraints. If none of the candidate | |
426 | CPUs can satisfy this capacity criterion, then strict priority based scheduling | |
427 | is followed and CPU capacities are ignored. | |
428 | ||
429 | 5.3 DL | |
430 | ------ | |
431 | ||
432 | 5.3.1 Wakeup CPU selection | |
433 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
434 | ||
435 | DL task wakeup CPU selection searches for a CPU that satisfies:: | |
436 | ||
437 | task_bandwidth(p) < capacity(task_cpu(p)) | |
438 | ||
439 | while still respecting the usual bandwidth and deadline constraints. If | |
440 | none of the candidate CPUs can satisfy this capacity criterion, then the | |
441 | task will remain on its current CPU. |