Commit | Line | Data |
---|---|---|
d6a3b247 MCC |
1 | ======================= |
2 | Energy Aware Scheduling | |
3 | ======================= | |
81a930d3 QP |
4 | |
5 | 1. Introduction | |
6 | --------------- | |
7 | ||
8 | Energy Aware Scheduling (or EAS) gives the scheduler the ability to predict | |
9 | the impact of its decisions on the energy consumed by CPUs. EAS relies on an | |
10 | Energy Model (EM) of the CPUs to select an energy efficient CPU for each task, | |
11 | with a minimal impact on throughput. This document aims at providing an | |
12 | introduction on how EAS works, what are the main design decisions behind it, and | |
13 | details what is needed to get it to run. | |
14 | ||
d6a3b247 | 15 | Before going any further, please note that at the time of writing:: |
81a930d3 QP |
16 | |
17 | /!\ EAS does not support platforms with symmetric CPU topologies /!\ | |
18 | ||
19 | EAS operates only on heterogeneous CPU topologies (such as Arm big.LITTLE) | |
20 | because this is where the potential for saving energy through scheduling is | |
21 | the highest. | |
22 | ||
23 | The actual EM used by EAS is _not_ maintained by the scheduler, but by a | |
24 | dedicated framework. For details about this framework and what it provides, | |
151f4e2b | 25 | please refer to its documentation (see Documentation/power/energy-model.rst). |
81a930d3 QP |
26 | |
27 | ||
28 | 2. Background and Terminology | |
29 | ----------------------------- | |
30 | ||
31 | To make it clear from the start: | |
32 | - energy = [joule] (resource like a battery on powered devices) | |
33 | - power = energy/time = [joule/second] = [watt] | |
34 | ||
35 | The goal of EAS is to minimize energy, while still getting the job done. That | |
d6a3b247 | 36 | is, we want to maximize:: |
81a930d3 QP |
37 | |
38 | performance [inst/s] | |
39 | -------------------- | |
40 | power [W] | |
41 | ||
d6a3b247 | 42 | which is equivalent to minimizing:: |
81a930d3 QP |
43 | |
44 | energy [J] | |
45 | ----------- | |
46 | instruction | |
47 | ||
48 | while still getting 'good' performance. It is essentially an alternative | |
49 | optimization objective to the current performance-only objective for the | |
50 | scheduler. This alternative considers two objectives: energy-efficiency and | |
51 | performance. | |
52 | ||
53 | The idea behind introducing an EM is to allow the scheduler to evaluate the | |
54 | implications of its decisions rather than blindly applying energy-saving | |
55 | techniques that may have positive effects only on some platforms. At the same | |
56 | time, the EM must be as simple as possible to minimize the scheduler latency | |
57 | impact. | |
58 | ||
59 | In short, EAS changes the way CFS tasks are assigned to CPUs. When it is time | |
60 | for the scheduler to decide where a task should run (during wake-up), the EM | |
61 | is used to break the tie between several good CPU candidates and pick the one | |
62 | that is predicted to yield the best energy consumption without harming the | |
63 | system's throughput. The predictions made by EAS rely on specific elements of | |
64 | knowledge about the platform's topology, which include the 'capacity' of CPUs, | |
65 | and their respective energy costs. | |
66 | ||
67 | ||
68 | 3. Topology information | |
69 | ----------------------- | |
70 | ||
71 | EAS (as well as the rest of the scheduler) uses the notion of 'capacity' to | |
72 | differentiate CPUs with different computing throughput. The 'capacity' of a CPU | |
73 | represents the amount of work it can absorb when running at its highest | |
74 | frequency compared to the most capable CPU of the system. Capacity values are | |
75 | normalized in a 1024 range, and are comparable with the utilization signals of | |
76 | tasks and CPUs computed by the Per-Entity Load Tracking (PELT) mechanism. Thanks | |
77 | to capacity and utilization values, EAS is able to estimate how big/busy a | |
78 | task/CPU is, and to take this into consideration when evaluating performance vs | |
79 | energy trade-offs. The capacity of CPUs is provided via arch-specific code | |
80 | through the arch_scale_cpu_capacity() callback. | |
81 | ||
82 | The rest of platform knowledge used by EAS is directly read from the Energy | |
83 | Model (EM) framework. The EM of a platform is composed of a power cost table | |
151f4e2b | 84 | per 'performance domain' in the system (see Documentation/power/energy-model.rst |
d56b699d | 85 | for further details about performance domains). |
81a930d3 QP |
86 | |
87 | The scheduler manages references to the EM objects in the topology code when the | |
88 | scheduling domains are built, or re-built. For each root domain (rd), the | |
89 | scheduler maintains a singly linked list of all performance domains intersecting | |
90 | the current rd->span. Each node in the list contains a pointer to a struct | |
91 | em_perf_domain as provided by the EM framework. | |
92 | ||
93 | The lists are attached to the root domains in order to cope with exclusive | |
94 | cpuset configurations. Since the boundaries of exclusive cpusets do not | |
95 | necessarily match those of performance domains, the lists of different root | |
96 | domains can contain duplicate elements. | |
97 | ||
98 | Example 1. | |
99 | Let us consider a platform with 12 CPUs, split in 3 performance domains | |
d6a3b247 | 100 | (pd0, pd4 and pd8), organized as follows:: |
81a930d3 QP |
101 | |
102 | CPUs: 0 1 2 3 4 5 6 7 8 9 10 11 | |
103 | PDs: |--pd0--|--pd4--|---pd8---| | |
104 | RDs: |----rd1----|-----rd2-----| | |
105 | ||
106 | Now, consider that userspace decided to split the system with two | |
107 | exclusive cpusets, hence creating two independent root domains, each | |
108 | containing 6 CPUs. The two root domains are denoted rd1 and rd2 in the | |
109 | above figure. Since pd4 intersects with both rd1 and rd2, it will be | |
110 | present in the linked list '->pd' attached to each of them: | |
d6a3b247 | 111 | |
81a930d3 QP |
112 | * rd1->pd: pd0 -> pd4 |
113 | * rd2->pd: pd4 -> pd8 | |
114 | ||
115 | Please note that the scheduler will create two duplicate list nodes for | |
116 | pd4 (one for each list). However, both just hold a pointer to the same | |
117 | shared data structure of the EM framework. | |
118 | ||
119 | Since the access to these lists can happen concurrently with hotplug and other | |
120 | things, they are protected by RCU, like the rest of topology structures | |
121 | manipulated by the scheduler. | |
122 | ||
123 | EAS also maintains a static key (sched_energy_present) which is enabled when at | |
124 | least one root domain meets all conditions for EAS to start. Those conditions | |
125 | are summarized in Section 6. | |
126 | ||
127 | ||
128 | 4. Energy-Aware task placement | |
129 | ------------------------------ | |
130 | ||
131 | EAS overrides the CFS task wake-up balancing code. It uses the EM of the | |
132 | platform and the PELT signals to choose an energy-efficient target CPU during | |
133 | wake-up balance. When EAS is enabled, select_task_rq_fair() calls | |
134 | find_energy_efficient_cpu() to do the placement decision. This function looks | |
135 | for the CPU with the highest spare capacity (CPU capacity - CPU utilization) in | |
136 | each performance domain since it is the one which will allow us to keep the | |
137 | frequency the lowest. Then, the function checks if placing the task there could | |
138 | save energy compared to leaving it on prev_cpu, i.e. the CPU where the task ran | |
139 | in its previous activation. | |
140 | ||
141 | find_energy_efficient_cpu() uses compute_energy() to estimate what will be the | |
142 | energy consumed by the system if the waking task was migrated. compute_energy() | |
143 | looks at the current utilization landscape of the CPUs and adjusts it to | |
144 | 'simulate' the task migration. The EM framework provides the em_pd_energy() API | |
145 | which computes the expected energy consumption of each performance domain for | |
146 | the given utilization landscape. | |
147 | ||
148 | An example of energy-optimized task placement decision is detailed below. | |
149 | ||
150 | Example 2. | |
151 | Let us consider a (fake) platform with 2 independent performance domains | |
152 | composed of two CPUs each. CPU0 and CPU1 are little CPUs; CPU2 and CPU3 | |
153 | are big. | |
154 | ||
155 | The scheduler must decide where to place a task P whose util_avg = 200 | |
156 | and prev_cpu = 0. | |
157 | ||
158 | The current utilization landscape of the CPUs is depicted on the graph | |
159 | below. CPUs 0-3 have a util_avg of 400, 100, 600 and 500 respectively | |
160 | Each performance domain has three Operating Performance Points (OPPs). | |
161 | The CPU capacity and power cost associated with each OPP is listed in | |
162 | the Energy Model table. The util_avg of P is shown on the figures | |
d6a3b247 | 163 | below as 'PP':: |
81a930d3 | 164 | |
d6a3b247 | 165 | CPU util. |
81a930d3 QP |
166 | 1024 - - - - - - - Energy Model |
167 | +-----------+-------------+ | |
168 | | Little | Big | | |
169 | 768 ============= +-----+-----+------+------+ | |
170 | | Cap | Pwr | Cap | Pwr | | |
171 | +-----+-----+------+------+ | |
172 | 512 =========== - ##- - - - - | 170 | 50 | 512 | 400 | | |
173 | ## ## | 341 | 150 | 768 | 800 | | |
174 | 341 -PP - - - - ## ## | 512 | 300 | 1024 | 1700 | | |
175 | PP ## ## +-----+-----+------+------+ | |
176 | 170 -## - - - - ## ## | |
177 | ## ## ## ## | |
178 | ------------ ------------- | |
179 | CPU0 CPU1 CPU2 CPU3 | |
180 | ||
181 | Current OPP: ===== Other OPP: - - - util_avg (100 each): ## | |
182 | ||
183 | ||
184 | find_energy_efficient_cpu() will first look for the CPUs with the | |
185 | maximum spare capacity in the two performance domains. In this example, | |
186 | CPU1 and CPU3. Then it will estimate the energy of the system if P was | |
187 | placed on either of them, and check if that would save some energy | |
188 | compared to leaving P on CPU0. EAS assumes that OPPs follow utilization | |
189 | (which is coherent with the behaviour of the schedutil CPUFreq | |
190 | governor, see Section 6. for more details on this topic). | |
191 | ||
d6a3b247 | 192 | **Case 1. P is migrated to CPU1**:: |
81a930d3 QP |
193 | |
194 | 1024 - - - - - - - | |
195 | ||
196 | Energy calculation: | |
197 | 768 ============= * CPU0: 200 / 341 * 150 = 88 | |
198 | * CPU1: 300 / 341 * 150 = 131 | |
199 | * CPU2: 600 / 768 * 800 = 625 | |
200 | 512 - - - - - - - ##- - - - - * CPU3: 500 / 768 * 800 = 520 | |
201 | ## ## => total_energy = 1364 | |
202 | 341 =========== ## ## | |
203 | PP ## ## | |
204 | 170 -## - - PP- ## ## | |
205 | ## ## ## ## | |
206 | ------------ ------------- | |
207 | CPU0 CPU1 CPU2 CPU3 | |
208 | ||
209 | ||
d6a3b247 | 210 | **Case 2. P is migrated to CPU3**:: |
81a930d3 QP |
211 | |
212 | 1024 - - - - - - - | |
213 | ||
214 | Energy calculation: | |
215 | 768 ============= * CPU0: 200 / 341 * 150 = 88 | |
216 | * CPU1: 100 / 341 * 150 = 43 | |
217 | PP * CPU2: 600 / 768 * 800 = 625 | |
218 | 512 - - - - - - - ##- - -PP - * CPU3: 700 / 768 * 800 = 729 | |
219 | ## ## => total_energy = 1485 | |
220 | 341 =========== ## ## | |
221 | ## ## | |
222 | 170 -## - - - - ## ## | |
223 | ## ## ## ## | |
224 | ------------ ------------- | |
225 | CPU0 CPU1 CPU2 CPU3 | |
226 | ||
227 | ||
d6a3b247 | 228 | **Case 3. P stays on prev_cpu / CPU 0**:: |
81a930d3 QP |
229 | |
230 | 1024 - - - - - - - | |
231 | ||
232 | Energy calculation: | |
233 | 768 ============= * CPU0: 400 / 512 * 300 = 234 | |
234 | * CPU1: 100 / 512 * 300 = 58 | |
235 | * CPU2: 600 / 768 * 800 = 625 | |
236 | 512 =========== - ##- - - - - * CPU3: 500 / 768 * 800 = 520 | |
237 | ## ## => total_energy = 1437 | |
238 | 341 -PP - - - - ## ## | |
239 | PP ## ## | |
240 | 170 -## - - - - ## ## | |
241 | ## ## ## ## | |
242 | ------------ ------------- | |
243 | CPU0 CPU1 CPU2 CPU3 | |
244 | ||
245 | ||
246 | From these calculations, the Case 1 has the lowest total energy. So CPU 1 | |
247 | is be the best candidate from an energy-efficiency standpoint. | |
248 | ||
249 | Big CPUs are generally more power hungry than the little ones and are thus used | |
250 | mainly when a task doesn't fit the littles. However, little CPUs aren't always | |
251 | necessarily more energy-efficient than big CPUs. For some systems, the high OPPs | |
252 | of the little CPUs can be less energy-efficient than the lowest OPPs of the | |
253 | bigs, for example. So, if the little CPUs happen to have enough utilization at | |
254 | a specific point in time, a small task waking up at that moment could be better | |
255 | of executing on the big side in order to save energy, even though it would fit | |
256 | on the little side. | |
257 | ||
258 | And even in the case where all OPPs of the big CPUs are less energy-efficient | |
259 | than those of the little, using the big CPUs for a small task might still, under | |
260 | specific conditions, save energy. Indeed, placing a task on a little CPU can | |
261 | result in raising the OPP of the entire performance domain, and that will | |
262 | increase the cost of the tasks already running there. If the waking task is | |
263 | placed on a big CPU, its own execution cost might be higher than if it was | |
264 | running on a little, but it won't impact the other tasks of the little CPUs | |
265 | which will keep running at a lower OPP. So, when considering the total energy | |
266 | consumed by CPUs, the extra cost of running that one task on a big core can be | |
267 | smaller than the cost of raising the OPP on the little CPUs for all the other | |
268 | tasks. | |
269 | ||
270 | The examples above would be nearly impossible to get right in a generic way, and | |
271 | for all platforms, without knowing the cost of running at different OPPs on all | |
272 | CPUs of the system. Thanks to its EM-based design, EAS should cope with them | |
273 | correctly without too many troubles. However, in order to ensure a minimal | |
274 | impact on throughput for high-utilization scenarios, EAS also implements another | |
275 | mechanism called 'over-utilization'. | |
276 | ||
277 | ||
278 | 5. Over-utilization | |
279 | ------------------- | |
280 | ||
281 | From a general standpoint, the use-cases where EAS can help the most are those | |
282 | involving a light/medium CPU utilization. Whenever long CPU-bound tasks are | |
283 | being run, they will require all of the available CPU capacity, and there isn't | |
d56b699d | 284 | much that can be done by the scheduler to save energy without severely harming |
81a930d3 QP |
285 | throughput. In order to avoid hurting performance with EAS, CPUs are flagged as |
286 | 'over-utilized' as soon as they are used at more than 80% of their compute | |
287 | capacity. As long as no CPUs are over-utilized in a root domain, load balancing | |
288 | is disabled and EAS overridess the wake-up balancing code. EAS is likely to load | |
289 | the most energy efficient CPUs of the system more than the others if that can be | |
290 | done without harming throughput. So, the load-balancer is disabled to prevent | |
291 | it from breaking the energy-efficient task placement found by EAS. It is safe to | |
292 | do so when the system isn't overutilized since being below the 80% tipping point | |
293 | implies that: | |
294 | ||
295 | a. there is some idle time on all CPUs, so the utilization signals used by | |
296 | EAS are likely to accurately represent the 'size' of the various tasks | |
297 | in the system; | |
298 | b. all tasks should already be provided with enough CPU capacity, | |
299 | regardless of their nice values; | |
300 | c. since there is spare capacity all tasks must be blocking/sleeping | |
301 | regularly and balancing at wake-up is sufficient. | |
302 | ||
303 | As soon as one CPU goes above the 80% tipping point, at least one of the three | |
304 | assumptions above becomes incorrect. In this scenario, the 'overutilized' flag | |
305 | is raised for the entire root domain, EAS is disabled, and the load-balancer is | |
306 | re-enabled. By doing so, the scheduler falls back onto load-based algorithms for | |
307 | wake-up and load balance under CPU-bound conditions. This provides a better | |
308 | respect of the nice values of tasks. | |
309 | ||
310 | Since the notion of overutilization largely relies on detecting whether or not | |
311 | there is some idle time in the system, the CPU capacity 'stolen' by higher | |
312 | (than CFS) scheduling classes (as well as IRQ) must be taken into account. As | |
313 | such, the detection of overutilization accounts for the capacity used not only | |
314 | by CFS tasks, but also by the other scheduling classes and IRQ. | |
315 | ||
316 | ||
317 | 6. Dependencies and requirements for EAS | |
318 | ---------------------------------------- | |
319 | ||
320 | Energy Aware Scheduling depends on the CPUs of the system having specific | |
321 | hardware properties and on other features of the kernel being enabled. This | |
322 | section lists these dependencies and provides hints as to how they can be met. | |
323 | ||
324 | ||
d6a3b247 MCC |
325 | 6.1 - Asymmetric CPU topology |
326 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
327 | ||
81a930d3 QP |
328 | |
329 | As mentioned in the introduction, EAS is only supported on platforms with | |
330 | asymmetric CPU topologies for now. This requirement is checked at run-time by | |
adf3c31e | 331 | looking for the presence of the SD_ASYM_CPUCAPACITY_FULL flag when the scheduling |
81a930d3 QP |
332 | domains are built. |
333 | ||
e4e29e78 | 334 | See Documentation/scheduler/sched-capacity.rst for requirements to be met for this |
949bcb81 | 335 | flag to be set in the sched_domain hierarchy. |
81a930d3 QP |
336 | |
337 | Please note that EAS is not fundamentally incompatible with SMP, but no | |
338 | significant savings on SMP platforms have been observed yet. This restriction | |
339 | could be amended in the future if proven otherwise. | |
340 | ||
341 | ||
d6a3b247 MCC |
342 | 6.2 - Energy Model presence |
343 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
81a930d3 QP |
344 | |
345 | EAS uses the EM of a platform to estimate the impact of scheduling decisions on | |
346 | energy. So, your platform must provide power cost tables to the EM framework in | |
347 | order to make EAS start. To do so, please refer to documentation of the | |
151f4e2b | 348 | independent EM framework in Documentation/power/energy-model.rst. |
81a930d3 QP |
349 | |
350 | Please also note that the scheduling domains need to be re-built after the | |
351 | EM has been registered in order to start EAS. | |
352 | ||
5a64f775 LL |
353 | EAS uses the EM to make a forecasting decision on energy usage and thus it is |
354 | more focused on the difference when checking possible options for task | |
355 | placement. For EAS it doesn't matter whether the EM power values are expressed | |
356 | in milli-Watts or in an 'abstract scale'. | |
357 | ||
81a930d3 | 358 | |
d6a3b247 MCC |
359 | 6.3 - Energy Model complexity |
360 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
81a930d3 | 361 | |
5b77261c PG |
362 | EAS does not impose any complexity limit on the number of PDs/OPPs/CPUs but |
363 | restricts the number of CPUs to EM_MAX_NUM_CPUS to prevent overflows during | |
364 | the energy estimation. | |
81a930d3 QP |
365 | |
366 | ||
d6a3b247 MCC |
367 | 6.4 - Schedutil governor |
368 | ^^^^^^^^^^^^^^^^^^^^^^^^ | |
81a930d3 QP |
369 | |
370 | EAS tries to predict at which OPP will the CPUs be running in the close future | |
371 | in order to estimate their energy consumption. To do so, it is assumed that OPPs | |
372 | of CPUs follow their utilization. | |
373 | ||
374 | Although it is very difficult to provide hard guarantees regarding the accuracy | |
375 | of this assumption in practice (because the hardware might not do what it is | |
376 | told to do, for example), schedutil as opposed to other CPUFreq governors at | |
377 | least _requests_ frequencies calculated using the utilization signals. | |
378 | Consequently, the only sane governor to use together with EAS is schedutil, | |
379 | because it is the only one providing some degree of consistency between | |
380 | frequency requests and energy predictions. | |
381 | ||
382 | Using EAS with any other governor than schedutil is not supported. | |
383 | ||
384 | ||
d6a3b247 MCC |
385 | 6.5 Scale-invariant utilization signals |
386 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |
81a930d3 QP |
387 | |
388 | In order to make accurate prediction across CPUs and for all performance | |
389 | states, EAS needs frequency-invariant and CPU-invariant PELT signals. These can | |
390 | be obtained using the architecture-defined arch_scale{cpu,freq}_capacity() | |
391 | callbacks. | |
392 | ||
393 | Using EAS on a platform that doesn't implement these two callbacks is not | |
394 | supported. | |
395 | ||
396 | ||
d6a3b247 MCC |
397 | 6.6 Multithreading (SMT) |
398 | ^^^^^^^^^^^^^^^^^^^^^^^^ | |
81a930d3 QP |
399 | |
400 | EAS in its current form is SMT unaware and is not able to leverage | |
401 | multithreaded hardware to save energy. EAS considers threads as independent | |
402 | CPUs, which can actually be counter-productive for both performance and energy. | |
403 | ||
404 | EAS on SMT is not supported. |