Commit | Line | Data |
---|---|---|
65065fd7 VS |
1 | ========================= |
2 | Capacity Aware Scheduling | |
3 | ========================= | |
4 | ||
5 | 1. CPU Capacity | |
6 | =============== | |
7 | ||
8 | 1.1 Introduction | |
9 | ---------------- | |
10 | ||
11 | Conventional, homogeneous SMP platforms are composed of purely identical | |
12 | CPUs. Heterogeneous platforms on the other hand are composed of CPUs with | |
13 | different performance characteristics - on such platforms, not all CPUs can be | |
14 | considered equal. | |
15 | ||
16 | CPU capacity is a measure of the performance a CPU can reach, normalized against | |
17 | the most performant CPU in the system. Heterogeneous systems are also called | |
18 | asymmetric CPU capacity systems, as they contain CPUs of different capacities. | |
19 | ||
20 | Disparity in maximum attainable performance (IOW in maximum CPU capacity) stems | |
21 | from two factors: | |
22 | ||
23 | - not all CPUs may have the same microarchitecture (µarch). | |
24 | - with Dynamic Voltage and Frequency Scaling (DVFS), not all CPUs may be | |
25 | physically able to attain the higher Operating Performance Points (OPP). | |
26 | ||
27 | Arm big.LITTLE systems are an example of both. The big CPUs are more | |
28 | performance-oriented than the LITTLE ones (more pipeline stages, bigger caches, | |
29 | smarter predictors, etc), and can usually reach higher OPPs than the LITTLE ones | |
30 | can. | |
31 | ||
32 | CPU performance is usually expressed in Millions of Instructions Per Second | |
33 | (MIPS), which can also be expressed as a given amount of instructions attainable | |
34 | per Hz, leading to:: | |
35 | ||
36 | capacity(cpu) = work_per_hz(cpu) * max_freq(cpu) | |
37 | ||
38 | 1.2 Scheduler terms | |
39 | ------------------- | |
40 | ||
41 | Two different capacity values are used within the scheduler. A CPU's | |
7bc26384 VG |
42 | ``original capacity`` is its maximum attainable capacity, i.e. its maximum |
43 | attainable performance level. This original capacity is returned by | |
44 | the function arch_scale_cpu_capacity(). A CPU's ``capacity`` is its ``original | |
45 | capacity`` to which some loss of available performance (e.g. time spent | |
46 | handling IRQs) is subtracted. | |
65065fd7 VS |
47 | |
48 | Note that a CPU's ``capacity`` is solely intended to be used by the CFS class, | |
7bc26384 VG |
49 | while ``original capacity`` is class-agnostic. The rest of this document will use |
50 | the term ``capacity`` interchangeably with ``original capacity`` for the sake of | |
65065fd7 VS |
51 | brevity. |
52 | ||
53 | 1.3 Platform examples | |
54 | --------------------- | |
55 | ||
56 | 1.3.1 Identical OPPs | |
57 | ~~~~~~~~~~~~~~~~~~~~ | |
58 | ||
59 | Consider an hypothetical dual-core asymmetric CPU capacity system where | |
60 | ||
61 | - work_per_hz(CPU0) = W | |
62 | - work_per_hz(CPU1) = W/2 | |
63 | - all CPUs are running at the same fixed frequency | |
64 | ||
65 | By the above definition of capacity: | |
66 | ||
67 | - capacity(CPU0) = C | |
68 | - capacity(CPU1) = C/2 | |
69 | ||
70 | To draw the parallel with Arm big.LITTLE, CPU0 would be a big while CPU1 would | |
71 | be a LITTLE. | |
72 | ||
73 | With a workload that periodically does a fixed amount of work, you will get an | |
74 | execution trace like so:: | |
75 | ||
76 | CPU0 work ^ | |
77 | | ____ ____ ____ | |
78 | | | | | | | | | |
79 | +----+----+----+----+----+----+----+----+----+----+-> time | |
80 | ||
81 | CPU1 work ^ | |
82 | | _________ _________ ____ | |
83 | | | | | | | | |
84 | +----+----+----+----+----+----+----+----+----+----+-> time | |
85 | ||
86 | CPU0 has the highest capacity in the system (C), and completes a fixed amount of | |
87 | work W in T units of time. On the other hand, CPU1 has half the capacity of | |
88 | CPU0, and thus only completes W/2 in T. | |
89 | ||
90 | 1.3.2 Different max OPPs | |
91 | ~~~~~~~~~~~~~~~~~~~~~~~~ | |
92 | ||
93 | Usually, CPUs of different capacity values also have different maximum | |
94 | OPPs. Consider the same CPUs as above (i.e. same work_per_hz()) with: | |
95 | ||
96 | - max_freq(CPU0) = F | |
97 | - max_freq(CPU1) = 2/3 * F | |
98 | ||
99 | This yields: | |
100 | ||
101 | - capacity(CPU0) = C | |
102 | - capacity(CPU1) = C/3 | |
103 | ||
104 | Executing the same workload as described in 1.3.1, which each CPU running at its | |
105 | maximum frequency results in:: | |
106 | ||
107 | CPU0 work ^ | |
108 | | ____ ____ ____ | |
109 | | | | | | | | | |
110 | +----+----+----+----+----+----+----+----+----+----+-> time | |
111 | ||
112 | workload on CPU1 | |
113 | CPU1 work ^ | |
114 | | ______________ ______________ ____ | |
115 | | | | | | | | |
116 | +----+----+----+----+----+----+----+----+----+----+-> time | |
117 | ||
118 | 1.4 Representation caveat | |
119 | ------------------------- | |
120 | ||
121 | It should be noted that having a *single* value to represent differences in CPU | |
122 | performance is somewhat of a contentious point. The relative performance | |
123 | difference between two different µarchs could be X% on integer operations, Y% on | |
124 | floating point operations, Z% on branches, and so on. Still, results using this | |
125 | simple approach have been satisfactory for now. | |
126 | ||
127 | 2. Task utilization | |
128 | =================== | |
129 | ||
130 | 2.1 Introduction | |
131 | ---------------- | |
132 | ||
133 | Capacity aware scheduling requires an expression of a task's requirements with | |
134 | regards to CPU capacity. Each scheduler class can express this differently, and | |
135 | while task utilization is specific to CFS, it is convenient to describe it here | |
136 | in order to introduce more generic concepts. | |
137 | ||
138 | Task utilization is a percentage meant to represent the throughput requirements | |
139 | of a task. A simple approximation of it is the task's duty cycle, i.e.:: | |
140 | ||
141 | task_util(p) = duty_cycle(p) | |
142 | ||
143 | On an SMP system with fixed frequencies, 100% utilization suggests the task is a | |
144 | busy loop. Conversely, 10% utilization hints it is a small periodic task that | |
145 | spends more time sleeping than executing. Variable CPU frequencies and | |
146 | asymmetric CPU capacities complexify this somewhat; the following sections will | |
147 | expand on these. | |
148 | ||
149 | 2.2 Frequency invariance | |
150 | ------------------------ | |
151 | ||
152 | One issue that needs to be taken into account is that a workload's duty cycle is | |
153 | directly impacted by the current OPP the CPU is running at. Consider running a | |
154 | periodic workload at a given frequency F:: | |
155 | ||
156 | CPU work ^ | |
157 | | ____ ____ ____ | |
158 | | | | | | | | | |
159 | +----+----+----+----+----+----+----+----+----+----+-> time | |
160 | ||
161 | This yields duty_cycle(p) == 25%. | |
162 | ||
163 | Now, consider running the *same* workload at frequency F/2:: | |
164 | ||
165 | CPU work ^ | |
166 | | _________ _________ ____ | |
167 | | | | | | | | |
168 | +----+----+----+----+----+----+----+----+----+----+-> time | |
169 | ||
170 | This yields duty_cycle(p) == 50%, despite the task having the exact same | |
171 | behaviour (i.e. executing the same amount of work) in both executions. | |
172 | ||
173 | The task utilization signal can be made frequency invariant using the following | |
174 | formula:: | |
175 | ||
176 | task_util_freq_inv(p) = duty_cycle(p) * (curr_frequency(cpu) / max_frequency(cpu)) | |
177 | ||
178 | Applying this formula to the two examples above yields a frequency invariant | |
179 | task utilization of 25%. | |
180 | ||
181 | 2.3 CPU invariance | |
182 | ------------------ | |
183 | ||
184 | CPU capacity has a similar effect on task utilization in that running an | |
185 | identical workload on CPUs of different capacity values will yield different | |
186 | duty cycles. | |
187 | ||
188 | Consider the system described in 1.3.2., i.e.:: | |
189 | ||
190 | - capacity(CPU0) = C | |
191 | - capacity(CPU1) = C/3 | |
192 | ||
193 | Executing a given periodic workload on each CPU at their maximum frequency would | |
194 | result in:: | |
195 | ||
196 | CPU0 work ^ | |
197 | | ____ ____ ____ | |
198 | | | | | | | | | |
199 | +----+----+----+----+----+----+----+----+----+----+-> time | |
200 | ||
201 | CPU1 work ^ | |
202 | | ______________ ______________ ____ | |
203 | | | | | | | | |
204 | +----+----+----+----+----+----+----+----+----+----+-> time | |
205 | ||
206 | IOW, | |
207 | ||
208 | - duty_cycle(p) == 25% if p runs on CPU0 at its maximum frequency | |
209 | - duty_cycle(p) == 75% if p runs on CPU1 at its maximum frequency | |
210 | ||
211 | The task utilization signal can be made CPU invariant using the following | |
212 | formula:: | |
213 | ||
214 | task_util_cpu_inv(p) = duty_cycle(p) * (capacity(cpu) / max_capacity) | |
215 | ||
216 | with ``max_capacity`` being the highest CPU capacity value in the | |
217 | system. Applying this formula to the above example above yields a CPU | |
218 | invariant task utilization of 25%. | |
219 | ||
220 | 2.4 Invariant task utilization | |
221 | ------------------------------ | |
222 | ||
223 | Both frequency and CPU invariance need to be applied to task utilization in | |
224 | order to obtain a truly invariant signal. The pseudo-formula for a task | |
225 | utilization that is both CPU and frequency invariant is thus, for a given | |
226 | task p:: | |
227 | ||
228 | curr_frequency(cpu) capacity(cpu) | |
229 | task_util_inv(p) = duty_cycle(p) * ------------------- * ------------- | |
230 | max_frequency(cpu) max_capacity | |
231 | ||
232 | In other words, invariant task utilization describes the behaviour of a task as | |
233 | if it were running on the highest-capacity CPU in the system, running at its | |
234 | maximum frequency. | |
235 | ||
236 | Any mention of task utilization in the following sections will imply its | |
237 | invariant form. | |
238 | ||
239 | 2.5 Utilization estimation | |
240 | -------------------------- | |
241 | ||
242 | Without a crystal ball, task behaviour (and thus task utilization) cannot | |
243 | accurately be predicted the moment a task first becomes runnable. The CFS class | |
244 | maintains a handful of CPU and task signals based on the Per-Entity Load | |
245 | Tracking (PELT) mechanism, one of those yielding an *average* utilization (as | |
246 | opposed to instantaneous). | |
247 | ||
248 | This means that while the capacity aware scheduling criteria will be written | |
249 | considering a "true" task utilization (using a crystal ball), the implementation | |
250 | will only ever be able to use an estimator thereof. | |
251 | ||
252 | 3. Capacity aware scheduling requirements | |
253 | ========================================= | |
254 | ||
255 | 3.1 CPU capacity | |
256 | ---------------- | |
257 | ||
258 | Linux cannot currently figure out CPU capacity on its own, this information thus | |
259 | needs to be handed to it. Architectures must define arch_scale_cpu_capacity() | |
260 | for that purpose. | |
261 | ||
5d89176a | 262 | The arm, arm64, and RISC-V architectures directly map this to the arch_topology driver |
65065fd7 | 263 | CPU scaling data, which is derived from the capacity-dmips-mhz CPU binding; see |
7d207831 | 264 | Documentation/devicetree/bindings/cpu/cpu-capacity.txt. |
65065fd7 VS |
265 | |
266 | 3.2 Frequency invariance | |
267 | ------------------------ | |
268 | ||
269 | As stated in 2.2, capacity-aware scheduling requires a frequency-invariant task | |
270 | utilization. Architectures must define arch_scale_freq_capacity(cpu) for that | |
271 | purpose. | |
272 | ||
273 | Implementing this function requires figuring out at which frequency each CPU | |
274 | have been running at. One way to implement this is to leverage hardware counters | |
275 | whose increment rate scale with a CPU's current frequency (APERF/MPERF on x86, | |
276 | AMU on arm64). Another is to directly hook into cpufreq frequency transitions, | |
277 | when the kernel is aware of the switched-to frequency (also employed by | |
278 | arm/arm64). | |
279 | ||
280 | 4. Scheduler topology | |
281 | ===================== | |
282 | ||
283 | During the construction of the sched domains, the scheduler will figure out | |
284 | whether the system exhibits asymmetric CPU capacities. Should that be the | |
285 | case: | |
286 | ||
287 | - The sched_asym_cpucapacity static key will be enabled. | |
adf3c31e BM |
288 | - The SD_ASYM_CPUCAPACITY_FULL flag will be set at the lowest sched_domain |
289 | level that spans all unique CPU capacity values. | |
290 | - The SD_ASYM_CPUCAPACITY flag will be set for any sched_domain that spans | |
291 | CPUs with any range of asymmetry. | |
65065fd7 VS |
292 | |
293 | The sched_asym_cpucapacity static key is intended to guard sections of code that | |
294 | cater to asymmetric CPU capacity systems. Do note however that said key is | |
295 | *system-wide*. Imagine the following setup using cpusets:: | |
296 | ||
297 | capacity C/2 C | |
298 | ________ ________ | |
299 | / \ / \ | |
300 | CPUs 0 1 2 3 4 5 6 7 | |
301 | \__/ \______________/ | |
302 | cpusets cs0 cs1 | |
303 | ||
304 | Which could be created via: | |
305 | ||
306 | .. code-block:: sh | |
307 | ||
308 | mkdir /sys/fs/cgroup/cpuset/cs0 | |
309 | echo 0-1 > /sys/fs/cgroup/cpuset/cs0/cpuset.cpus | |
310 | echo 0 > /sys/fs/cgroup/cpuset/cs0/cpuset.mems | |
311 | ||
312 | mkdir /sys/fs/cgroup/cpuset/cs1 | |
313 | echo 2-7 > /sys/fs/cgroup/cpuset/cs1/cpuset.cpus | |
314 | echo 0 > /sys/fs/cgroup/cpuset/cs1/cpuset.mems | |
315 | ||
316 | echo 0 > /sys/fs/cgroup/cpuset/cpuset.sched_load_balance | |
317 | ||
318 | Since there *is* CPU capacity asymmetry in the system, the | |
319 | sched_asym_cpucapacity static key will be enabled. However, the sched_domain | |
320 | hierarchy of CPUs 0-1 spans a single capacity value: SD_ASYM_CPUCAPACITY isn't | |
321 | set in that hierarchy, it describes an SMP island and should be treated as such. | |
322 | ||
323 | Therefore, the 'canonical' pattern for protecting codepaths that cater to | |
324 | asymmetric CPU capacities is to: | |
325 | ||
326 | - Check the sched_asym_cpucapacity static key | |
327 | - If it is enabled, then also check for the presence of SD_ASYM_CPUCAPACITY in | |
328 | the sched_domain hierarchy (if relevant, i.e. the codepath targets a specific | |
329 | CPU or group thereof) | |
330 | ||
331 | 5. Capacity aware scheduling implementation | |
332 | =========================================== | |
333 | ||
334 | 5.1 CFS | |
335 | ------- | |
336 | ||
337 | 5.1.1 Capacity fitness | |
338 | ~~~~~~~~~~~~~~~~~~~~~~ | |
339 | ||
340 | The main capacity scheduling criterion of CFS is:: | |
341 | ||
342 | task_util(p) < capacity(task_cpu(p)) | |
343 | ||
344 | This is commonly called the capacity fitness criterion, i.e. CFS must ensure a | |
345 | task "fits" on its CPU. If it is violated, the task will need to achieve more | |
346 | work than what its CPU can provide: it will be CPU-bound. | |
347 | ||
348 | Furthermore, uclamp lets userspace specify a minimum and a maximum utilization | |
349 | value for a task, either via sched_setattr() or via the cgroup interface (see | |
350 | Documentation/admin-guide/cgroup-v2.rst). As its name imply, this can be used to | |
351 | clamp task_util() in the previous criterion. | |
352 | ||
353 | 5.1.2 Wakeup CPU selection | |
354 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
355 | ||
356 | CFS task wakeup CPU selection follows the capacity fitness criterion described | |
357 | above. On top of that, uclamp is used to clamp the task utilization values, | |
358 | which lets userspace have more leverage over the CPU selection of CFS | |
359 | tasks. IOW, CFS wakeup CPU selection searches for a CPU that satisfies:: | |
360 | ||
361 | clamp(task_util(p), task_uclamp_min(p), task_uclamp_max(p)) < capacity(cpu) | |
362 | ||
363 | By using uclamp, userspace can e.g. allow a busy loop (100% utilization) to run | |
364 | on any CPU by giving it a low uclamp.max value. Conversely, it can force a small | |
365 | periodic task (e.g. 10% utilization) to run on the highest-performance CPUs by | |
366 | giving it a high uclamp.min value. | |
367 | ||
368 | .. note:: | |
369 | ||
370 | Wakeup CPU selection in CFS can be eclipsed by Energy Aware Scheduling | |
e4e29e78 | 371 | (EAS), which is described in Documentation/scheduler/sched-energy.rst. |
65065fd7 VS |
372 | |
373 | 5.1.3 Load balancing | |
374 | ~~~~~~~~~~~~~~~~~~~~ | |
375 | ||
376 | A pathological case in the wakeup CPU selection occurs when a task rarely | |
377 | sleeps, if at all - it thus rarely wakes up, if at all. Consider:: | |
378 | ||
379 | w == wakeup event | |
380 | ||
381 | capacity(CPU0) = C | |
382 | capacity(CPU1) = C / 3 | |
383 | ||
384 | workload on CPU0 | |
385 | CPU work ^ | |
386 | | _________ _________ ____ | |
387 | | | | | | | | |
388 | +----+----+----+----+----+----+----+----+----+----+-> time | |
389 | w w w | |
390 | ||
391 | workload on CPU1 | |
392 | CPU work ^ | |
393 | | ____________________________________________ | |
394 | | | | |
395 | +----+----+----+----+----+----+----+----+----+----+-> | |
396 | w | |
397 | ||
398 | This workload should run on CPU0, but if the task either: | |
399 | ||
400 | - was improperly scheduled from the start (inaccurate initial | |
401 | utilization estimation) | |
402 | - was properly scheduled from the start, but suddenly needs more | |
403 | processing power | |
404 | ||
405 | then it might become CPU-bound, IOW ``task_util(p) > capacity(task_cpu(p))``; | |
406 | the CPU capacity scheduling criterion is violated, and there may not be any more | |
407 | wakeup event to fix this up via wakeup CPU selection. | |
408 | ||
409 | Tasks that are in this situation are dubbed "misfit" tasks, and the mechanism | |
410 | put in place to handle this shares the same name. Misfit task migration | |
411 | leverages the CFS load balancer, more specifically the active load balance part | |
412 | (which caters to migrating currently running tasks). When load balance happens, | |
413 | a misfit active load balance will be triggered if a misfit task can be migrated | |
414 | to a CPU with more capacity than its current one. | |
415 | ||
416 | 5.2 RT | |
417 | ------ | |
418 | ||
419 | 5.2.1 Wakeup CPU selection | |
420 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
421 | ||
422 | RT task wakeup CPU selection searches for a CPU that satisfies:: | |
423 | ||
424 | task_uclamp_min(p) <= capacity(task_cpu(cpu)) | |
425 | ||
426 | while still following the usual priority constraints. If none of the candidate | |
427 | CPUs can satisfy this capacity criterion, then strict priority based scheduling | |
428 | is followed and CPU capacities are ignored. | |
429 | ||
430 | 5.3 DL | |
431 | ------ | |
432 | ||
433 | 5.3.1 Wakeup CPU selection | |
434 | ~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
435 | ||
436 | DL task wakeup CPU selection searches for a CPU that satisfies:: | |
437 | ||
438 | task_bandwidth(p) < capacity(task_cpu(p)) | |
439 | ||
440 | while still respecting the usual bandwidth and deadline constraints. If | |
441 | none of the candidate CPUs can satisfy this capacity criterion, then the | |
442 | task will remain on its current CPU. |