Commit | Line | Data |
---|---|---|
acbee592 QY |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ==================== | |
4 | Utilization Clamping | |
5 | ==================== | |
6 | ||
7 | 1. Introduction | |
8 | =============== | |
9 | ||
10 | Utilization clamping, also known as util clamp or uclamp, is a scheduler | |
11 | feature that allows user space to help manage the performance requirements | |
12 | of tasks. It was introduced in the v5.3 release. Cgroup support was merged in | |
13 | v5.4. | |
14 | ||
15 | Uclamp is a hinting mechanism that allows the scheduler to understand the | |
16 | performance requirements and restrictions of the tasks, and thus helps the | |
17 | scheduler make better decisions. When the schedutil cpufreq governor is | |
18 | used, util clamp influences the CPU frequency selection as well. | |
19 | ||
20 | Since the scheduler and schedutil are both driven by PELT (util_avg) signals, | |
21 | util clamp acts on that to achieve its goal by clamping the signal to a certain | |
22 | point; hence the name. That is, by clamping utilization we are making the | |
23 | system run at a certain performance point. | |
24 | ||
25 | The right way to view util clamp is as a mechanism to make requests or hints | |
26 | about performance constraints. It consists of two tunables: | |
27 | ||
28 | * UCLAMP_MIN, which sets the lower bound. | |
29 | * UCLAMP_MAX, which sets the upper bound. | |
30 | ||
31 | These two bounds will ensure a task will operate within this performance range | |
32 | of the system. UCLAMP_MIN implies boosting a task, while UCLAMP_MAX implies | |
33 | capping a task. | |
34 | ||
35 | One can tell the system (scheduler) that some tasks require a minimum | |
36 | performance point to operate at to deliver the desired user experience. Or one | |
37 | can tell the system that some tasks should be restricted from consuming too | |
38 | many resources and should not go above a specific performance point. Viewing | |
39 | the uclamp values as performance points rather than utilization is a better | |
40 | abstraction from a user space point of view. | |
41 | ||
42 | As an example, a game can use util clamp to form a feedback loop with its | |
43 | perceived Frames Per Second (FPS). It can dynamically increase the minimum | |
44 | performance point required by its display pipeline to ensure no frame is | |
45 | dropped. It can also dynamically 'prime' up these tasks if it knows in the | |
46 | coming few hundred milliseconds a computationally intensive scene is about to | |
47 | happen. | |
48 | ||
49 | On mobile hardware where the capability of the devices varies a lot, this | |
50 | dynamic feedback loop offers great flexibility to ensure the best user experience | |
51 | given the capabilities of any system. | |
52 | ||
53 | Of course a static configuration is possible too. The exact usage will depend | |
54 | on the system, application and the desired outcome. | |
55 | ||
56 | Another example is in Android where tasks are classified as background, | |
57 | foreground, top-app, etc. Util clamp can be used to constrain how many | |
58 | resources background tasks consume by capping the performance point they | |
59 | can run at. This constraint helps reserve resources for important tasks, like | |
60 | the ones belonging to the currently active app (top-app group). Besides, this | |
61 | helps in limiting how much power they consume. This can be more obvious in | |
62 | heterogeneous systems (e.g. Arm big.LITTLE); the constraint will help bias the | |
63 | background tasks to stay on the little cores which will ensure that: | |
64 | ||
65 | 1. The big cores are free to run top-app tasks immediately. top-app | |
66 | tasks are the tasks the user is currently interacting with, hence | |
67 | the most important tasks in the system. | |
68 | 2. Background tasks don't run on a power hungry core and drain the battery | |
69 | even if they are CPU intensive tasks. | |
70 | ||
71 | .. note:: | |
72 | **little cores**: | |
73 | CPUs with capacity < 1024 | |
74 | ||
75 | **big cores**: | |
76 | CPUs with capacity = 1024 | |
77 | ||
78 | By making these uclamp performance requests, or rather hints, user space can | |
79 | ensure system resources are used optimally to deliver the best possible user | |
80 | experience. | |
81 | ||
82 | Another use case is to help with **overcoming the ramp up latency inherent in | |
83 | how scheduler utilization signal is calculated**. | |
84 | ||
85 | For instance, a busy task that needs to run at the maximum | |
86 | performance point will suffer a delay of ~200ms (PELT HALFLIFE = 32ms) for the | |
87 | scheduler to realize that. This is known to affect workloads like gaming on | |
88 | mobile devices where frames will drop due to slow response time to select the | |
89 | higher frequency required for the tasks to finish their work in time. Setting | |
90 | UCLAMP_MIN=1024 will ensure such tasks will always see the highest performance | |
91 | level when they start running. | |
92 | ||
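The ~200ms figure can be approximated with a simplified geometric model of the signal. The sketch below is an illustration only, not the kernel's PELT code: it assumes 1 ms periods, a per-period decay factor `y` with `y^32 = 0.5`, and a task that runs for the whole of every period. The names `pelt_ramp_up` and `PELT_DECAY_PER_MS` are hypothetical.

```c
#include <assert.h>

/* Simplified model (assumption): 1 ms periods, half-life of 32 periods. */
#define PELT_DECAY_PER_MS 0.978572062	/* 0.5^(1/32) */
#define PELT_MAX_UTIL     1024.0

/*
 * Utilization of an always-running task after `ms` periods, starting
 * from `start`. Each period decays the old value and adds the share
 * contributed by the period that was just fully used.
 */
double pelt_ramp_up(double start, unsigned int ms)
{
	double util = start;

	assert(start >= 0.0 && start <= PELT_MAX_UTIL);
	for (unsigned int i = 0; i < ms; i++)
		util = util * PELT_DECAY_PER_MS +
		       PELT_MAX_UTIL * (1.0 - PELT_DECAY_PER_MS);
	return util;
}
```

Under these assumptions the signal reaches about half of the range after 32 ms and only gets close to 1024 after roughly 200 ms, which is the delay a UCLAMP_MIN request is meant to hide.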
93 | The overall visible effect goes beyond better perceived user | |
94 | experience/performance and stretches to help achieve a better overall | |
95 | performance/watt if used effectively. | |
96 | ||
97 | User space can form a feedback loop with the thermal subsystem too to ensure | |
98 | the device doesn't heat up to the point where it will throttle. | |
99 | ||
100 | Both SCHED_NORMAL/OTHER and SCHED_FIFO/RR honour uclamp requests/hints. | |
101 | ||
102 | In the SCHED_FIFO/RR case, uclamp gives the option to run RT tasks at any | |
103 | performance point rather than being tied to MAX frequency all the time. This | |
104 | can be useful on general purpose systems that run on battery powered devices. | |
105 | ||
106 | Note that by design RT tasks don't have a per-task PELT signal and must always | |
107 | run at a constant frequency to combat non-deterministic DVFS ramp-up delays. | |
108 | ||
109 | Note that using schedutil always implies a single delay to modify the frequency | |
110 | when an RT task wakes up. This cost is unchanged by using uclamp. Uclamp only | |
111 | helps pick what frequency to request instead of schedutil always requesting | |
112 | MAX for all RT tasks. | |
113 | ||
114 | See :ref:`section 3.4 <uclamp-default-values>` for default values and | |
115 | :ref:`3.4.1 <sched-util-clamp-min-rt-default>` on how to change the default | |
116 | value for RT tasks. | |
117 | ||
118 | 2. Design | |
119 | ========= | |
120 | ||
121 | Util clamp is a property of every task in the system. It sets the boundaries of | |
122 | its utilization signal; acting as a bias mechanism that influences certain | |
123 | decisions within the scheduler. | |
124 | ||
125 | The actual utilization signal of a task is never clamped in reality. If you | |
126 | inspect PELT signals at any point of time you should continue to see them | |
127 | intact. Clamping happens only when needed, e.g. when a task wakes up | |
128 | and the scheduler needs to select a suitable CPU for it to run on. | |
129 | ||
130 | Since the goal of util clamp is to allow requesting a minimum and maximum | |
131 | performance point for a task to run on, it must be able to influence the | |
132 | frequency selection as well as task placement to be most effective. Both of | |
133 | these have implications on the utilization value at CPU runqueue (rq for short) | |
134 | level, which brings us to the main design challenge. | |
135 | ||
136 | When a task wakes up on an rq, the utilization signal of the rq will be | |
137 | affected by the uclamp settings of all the tasks enqueued on it. For example if | |
138 | a task requests to run at UCLAMP_MIN = 512, then the util signal of the rq needs | |
139 | to respect this request as well as all other requests from all of the | |
140 | enqueued tasks. | |
141 | ||
142 | To be able to aggregate the util clamp value of all the tasks attached to the | |
143 | rq, uclamp must do some housekeeping at every enqueue/dequeue, which is the | |
144 | scheduler hot path. Hence care must be taken since any slow down will have | |
145 | significant impact on a lot of use cases and could hinder its usability in | |
146 | practice. | |
147 | ||
148 | The way this is handled is by dividing the utilization range into buckets | |
149 | (struct uclamp_bucket) which allows us to reduce the search space from every | |
150 | task on the rq to only a subset of tasks on the top-most bucket. | |
151 | ||
152 | When a task is enqueued, the counter in the matching bucket is incremented, | |
153 | and on dequeue it is decremented. This makes keeping track of the effective | |
154 | uclamp value at rq level a lot easier. | |
155 | ||
156 | As tasks are enqueued and dequeued, we keep track of the current effective | |
157 | uclamp value of the rq. See :ref:`section 2.1 <uclamp-buckets>` for details on | |
158 | how this works. | |
159 | ||
160 | Later, any path that wants to identify the effective uclamp value of the rq | |
161 | simply needs to read this effective uclamp value of the rq at the exact | |
162 | moment it needs to take a decision. | |
163 | ||
164 | For the task placement case, only Energy Aware and Capacity Aware Scheduling | |
165 | (EAS/CAS) make use of uclamp for now, which implies that it is applied on | |
166 | heterogeneous systems only. | |
167 | When a task wakes up, the scheduler will look at the current effective uclamp | |
168 | value of every rq and compare it with the potential new value if the task were | |
169 | to be enqueued there, favoring the rq that will end up with the most energy | |
170 | efficient combination. | |
171 | ||
172 | Similarly in schedutil, when it needs to make a frequency update it will look | |
173 | at the current effective uclamp value of the rq which is influenced by the set | |
174 | of tasks currently enqueued there and select the appropriate frequency that | |
175 | will satisfy constraints from requests. | |
176 | ||
177 | Other paths like setting overutilization state (which effectively disables EAS) | |
178 | make use of uclamp as well. Such cases are considered necessary housekeeping to | |
179 | allow the 2 main use cases above and will not be covered in detail here as they | |
180 | could change with implementation details. | |
181 | ||
182 | .. _uclamp-buckets: | |
183 | ||
184 | 2.1. Buckets | |
185 | ------------ | |
186 | ||
187 | :: | |
188 | ||
189 |            [struct rq] | |
190 | ||
191 |   (bottom)                                                    (top) | |
192 | ||
193 |     0                                                          1024 | |
194 |     |                                                           | | |
195 |     +-----------+-----------+-----------+----   ----+-----------+ | |
196 |     |  Bucket 0 |  Bucket 1 |  Bucket 2 |    ...    |  Bucket N | | |
197 |     +-----------+-----------+-----------+----   ----+-----------+ | |
198 |        :           :                                   : | |
199 |        +- p0       +- p3                               +- p4 | |
200 |        :                                               : | |
201 |        +- p1                                           +- p5 | |
202 |        : | |
203 |        +- p2 | |
204 | ||
205 | ||
206 | .. note:: | |
207 | The diagram above is an illustration rather than a true depiction of the | |
208 | internal data structure. | |
209 | ||
210 | To reduce the search space when trying to decide the effective uclamp value of | |
211 | an rq as tasks are enqueued/dequeued, the whole utilization range is divided | |
212 | into N buckets where N is configured at compile time by setting | |
213 | CONFIG_UCLAMP_BUCKETS_COUNT. By default it is set to 5. | |
214 | ||
215 | The rq has a set of buckets for each uclamp_id tunable: [UCLAMP_MIN, UCLAMP_MAX]. | |
216 | ||
217 | The range of each bucket is 1024/N. For example, for the default value of | |
218 | 5 there will be 5 buckets, each of which will cover the following range: | |
219 | ||
220 | :: | |
221 | ||
222 | DELTA = round_closest(1024/5) = round_closest(204.8) = 205 | |
223 | ||
224 | Bucket 0: [0:204] | |
225 | Bucket 1: [205:409] | |
226 | Bucket 2: [410:614] | |
227 | Bucket 3: [615:819] | |
228 | Bucket 4: [820:1024] | |
229 | ||
230 | When a task p with the following tunable parameters | |
231 | ||
232 | :: | |
233 | ||
234 | p->uclamp[UCLAMP_MIN] = 300 | |
235 | p->uclamp[UCLAMP_MAX] = 1024 | |
236 | ||
237 | is enqueued into the rq, bucket 1 will be incremented for UCLAMP_MIN and bucket | |
238 | 4 will be incremented for UCLAMP_MAX to reflect the fact that the rq has a task in | |
239 | this range. | |
240 | ||
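The mapping from a clamp value to its bucket can be sketched as below. This is a minimal illustration that follows the DELTA arithmetic described above for the default of 5 buckets; the helper name `uclamp_bucket_id` and the macros are written for this example and should be treated as assumptions rather than as the kernel's exact code.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024u
#define UCLAMP_BUCKETS       5u	/* default CONFIG_UCLAMP_BUCKETS_COUNT */

/* Integer division rounded to the closest value: 1024 / 5 -> 205. */
#define DIV_ROUND_CLOSEST(x, d) (((x) + (d) / 2) / (d))
#define UCLAMP_BUCKET_DELTA \
	DIV_ROUND_CLOSEST(SCHED_CAPACITY_SCALE, UCLAMP_BUCKETS)

/* Map a clamp value in [0:1024] to a bucket index in [0:UCLAMP_BUCKETS-1]. */
unsigned int uclamp_bucket_id(unsigned int clamp_value)
{
	unsigned int id;

	assert(clamp_value <= SCHED_CAPACITY_SCALE);
	id = clamp_value / UCLAMP_BUCKET_DELTA;
	/* 1024 / 205 is 4 here, but cap anyway so the top bucket absorbs the tail. */
	return id < UCLAMP_BUCKETS - 1 ? id : UCLAMP_BUCKETS - 1;
}
```

With these values a request of 300 lands in bucket 1 and a request of 1024 lands in bucket 4, matching the example in the text.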
241 | The rq then keeps track of its current effective uclamp value for each | |
242 | uclamp_id. | |
243 | ||
244 | When a task p is enqueued, the rq value changes to: | |
245 | ||
246 | :: | |
247 | ||
248 | // update bucket logic goes here | |
249 | rq->uclamp[UCLAMP_MIN] = max(rq->uclamp[UCLAMP_MIN], p->uclamp[UCLAMP_MIN]) | |
250 | // repeat for UCLAMP_MAX | |
251 | ||
252 | Similarly, when p is dequeued the rq value changes to: | |
253 | ||
254 | :: | |
255 | ||
256 | // update bucket logic goes here | |
257 | rq->uclamp[UCLAMP_MIN] = search_top_bucket_for_highest_value() | |
258 | // repeat for UCLAMP_MAX | |
259 | ||
260 | When all buckets are empty, the rq uclamp values are reset to system defaults. | |
261 | See :ref:`section 3.4 <uclamp-default-values>` for details on default values. | |
262 | ||
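The enqueue and dequeue bookkeeping can be modelled in user space as follows. This is a simplified sketch for a single uclamp_id, assuming 5 buckets; the types `struct bucket` and `struct rq_clamp` and the function names are hypothetical and do not mirror the kernel's data structures. Each bucket keeps a task counter and the highest value it has seen while non-empty, so the rq value is found by scanning buckets from the top instead of scanning every task.

```c
#include <assert.h>

#define NR_BUCKETS   5u
#define BUCKET_DELTA 205u	/* round_closest(1024 / 5) */

struct bucket {
	unsigned int tasks;	/* enqueued tasks that map to this bucket */
	unsigned int value;	/* highest clamp value seen while non-empty */
};

struct rq_clamp {
	struct bucket bucket[NR_BUCKETS];
	unsigned int value;	/* current effective clamp value of the rq */
};

unsigned int bucket_of(unsigned int v)
{
	unsigned int id = v / BUCKET_DELTA;

	return id < NR_BUCKETS - 1 ? id : NR_BUCKETS - 1;
}

/* Value of the top-most non-empty bucket, or dflt when all are empty. */
unsigned int rq_clamp_top_value(const struct rq_clamp *rq, unsigned int dflt)
{
	for (unsigned int i = NR_BUCKETS; i-- > 0; )
		if (rq->bucket[i].tasks)
			return rq->bucket[i].value;
	return dflt;
}

void rq_clamp_enqueue(struct rq_clamp *rq, unsigned int v)
{
	struct bucket *b = &rq->bucket[bucket_of(v)];

	if (b->tasks == 0 || v > b->value)
		b->value = v;
	b->tasks++;
	rq->value = rq_clamp_top_value(rq, v);
}

void rq_clamp_dequeue(struct rq_clamp *rq, unsigned int v, unsigned int dflt)
{
	struct bucket *b = &rq->bucket[bucket_of(v)];

	assert(b->tasks > 0);
	b->tasks--;
	/* The bucket keeps its value until it empties: an approximation. */
	rq->value = rq_clamp_top_value(rq, dflt);
}
```

In this model the cost of finding the new rq value on dequeue is bounded by the number of buckets, at the price of a bucket possibly reporting a slightly stale (higher) value while it still holds tasks.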
263 | ||
264 | 2.2. Max aggregation | |
265 | -------------------- | |
266 | ||
267 | Util clamp is tuned to honour the request for the task that requires the | |
268 | highest performance point. | |
269 | ||
270 | When multiple tasks are attached to the same rq, then util clamp must make sure | |
271 | the task that needs the highest performance point gets it even if there's | |
272 | another task that doesn't need it or is disallowed from reaching this point. | |
273 | ||
274 | For example, if there are multiple tasks attached to an rq with the following | |
275 | values: | |
276 | ||
277 | :: | |
278 | ||
279 | p0->uclamp[UCLAMP_MIN] = 300 | |
280 | p0->uclamp[UCLAMP_MAX] = 900 | |
281 | ||
282 | p1->uclamp[UCLAMP_MIN] = 500 | |
283 | p1->uclamp[UCLAMP_MAX] = 500 | |
284 | ||
285 | then assuming both p0 and p1 are enqueued to the same rq, both UCLAMP_MIN | |
286 | and UCLAMP_MAX become: | |
287 | ||
288 | :: | |
289 | ||
290 | rq->uclamp[UCLAMP_MIN] = max(300, 500) = 500 | |
291 | rq->uclamp[UCLAMP_MAX] = max(900, 500) = 900 | |
292 | ||
293 | As we shall see in :ref:`section 5.1 <uclamp-capping-fail>`, this max | |
294 | aggregation is the cause of one of the limitations when using util clamp, in | |
295 | particular for the UCLAMP_MAX hint when user space would like to save power. | |
296 | ||
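Max aggregation and its effect on the rq utilization can be sketched as below. This is an illustrative model only: `struct task_clamp`, `rq_max_aggregate` and `clamp_util` are hypothetical names, and the real scheduler tracks these values incrementally through buckets rather than looping over tasks.

```c
#include <assert.h>
#include <stddef.h>

struct task_clamp {
	unsigned int min;	/* UCLAMP_MIN */
	unsigned int max;	/* UCLAMP_MAX */
};

/* rq value for each clamp id is the max over all enqueued tasks. */
struct task_clamp rq_max_aggregate(const struct task_clamp *tasks, size_t n)
{
	struct task_clamp rq = { 0, 0 };

	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].min > rq.min)
			rq.min = tasks[i].min;
		if (tasks[i].max > rq.max)
			rq.max = tasks[i].max;
	}
	return rq;
}

/* Clamp a utilization value into the aggregated [min:max] range. */
unsigned int clamp_util(unsigned int util, struct task_clamp rq)
{
	if (util < rq.min)
		return rq.min;
	if (util > rq.max)
		return rq.max;
	return util;
}
```

For the p0 and p1 values above this yields an rq range of [500:900], so a raw utilization of 200 would be seen as 500 and a raw utilization of 1000 as 900.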
297 | 2.3. Hierarchical aggregation | |
298 | ----------------------------- | |
299 | ||
300 | As stated earlier, util clamp is a property of every task in the system. But | |
301 | the actual applied (effective) value can be influenced by more than just the | |
302 | request made by the task or another actor on its behalf (middleware library). | |
303 | ||
304 | The effective util clamp value of any task is restricted as follows: | |
305 | ||
306 | 1. By the uclamp settings defined by the cgroup CPU controller it is attached | |
307 | to, if any. | |
308 | 2. The restricted value in (1) is then further restricted by the system wide | |
309 | uclamp settings. | |
310 | ||
311 | :ref:`Section 3 <uclamp-interfaces>` discusses the interfaces and will expand | |
312 | further on that. | |
313 | ||
314 | For now, suffice it to say that if a task makes a request, its actual effective | |
315 | value will have to adhere to some restrictions imposed by cgroup and system | |
316 | wide settings. | |
317 | ||
318 | The system will still accept the request even if it is effectively beyond the | |
319 | constraints, but as soon as the task moves to a different cgroup or a sysadmin | |
320 | modifies the system settings, the request will be satisfied only if it is | |
321 | within the new constraints. | |
322 | ||
323 | In other words, this aggregation will not cause an error when a task changes | |
324 | its uclamp values, but rather the system may not be able to satisfy requests | |
325 | based on those factors. | |
326 | ||
327 | 2.4. Range | |
328 | ---------- | |
329 | ||
330 | A uclamp performance request has a range of 0 to 1024 inclusive. | |
331 | ||
332 | For the cgroup interface, a percentage is used (that is, 0 to 100 inclusive). | |
333 | Just like other cgroup interfaces, you can use 'max' instead of 100. | |
334 | ||
335 | .. _uclamp-interfaces: | |
336 | ||
337 | 3. Interfaces | |
338 | ============= | |
339 | ||
340 | 3.1. Per task interface | |
341 | ----------------------- | |
342 | ||
343 | The sched_setattr() syscall was extended to accept two new fields: | |
344 | ||
345 | * sched_util_min: requests the minimum performance point the system should run | |
346 | at when this task is running, i.e. the lower performance bound. | |
347 | * sched_util_max: requests the maximum performance point the system should run | |
348 | at when this task is running, i.e. the upper performance bound. | |
349 | ||
350 | For example, the following scenario has 40% to 80% utilization constraints: | |
351 | ||
352 | :: | |
353 | ||
354 | attr->sched_util_min = 40% * 1024; | |
355 | attr->sched_util_max = 80% * 1024; | |
356 | ||
357 | When task @p is running, **the scheduler should try its best to ensure it | |
358 | starts at 40% performance level**. If the task runs for a long enough time so | |
359 | that its actual utilization goes above 80%, the utilization, or performance | |
360 | level, will be capped. | |
361 | ||
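A user space caller might issue such a request roughly as follows. This is a hedged sketch for Linux: it uses the raw `sched_setattr` syscall number and a local copy of the attribute layout named `struct uclamp_sched_attr`, with `UC_FLAG_*` macros that carry the values of the corresponding `SCHED_FLAG_*` definitions in the kernel UAPI headers; those local names and the helpers `pct_to_util` and `set_uclamp_pct` are assumptions made for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Local copy of the attribute layout that includes the util clamp fields. */
struct uclamp_sched_attr {
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
	uint32_t sched_util_min;
	uint32_t sched_util_max;
};

/* Values of SCHED_FLAG_KEEP_POLICY, _KEEP_PARAMS, _UTIL_CLAMP_MIN, _MAX. */
#define UC_FLAG_KEEP_POLICY    0x08
#define UC_FLAG_KEEP_PARAMS    0x10
#define UC_FLAG_UTIL_CLAMP_MIN 0x20
#define UC_FLAG_UTIL_CLAMP_MAX 0x40

/* Convert a percentage to the [0:1024] range, rounding down. */
unsigned int pct_to_util(unsigned int pct)
{
	assert(pct <= 100);
	return pct * 1024 / 100;
}

/*
 * Request [min_pct:max_pct] for task pid (0 means the calling thread)
 * without touching its policy or priority.
 * Returns 0 on success, -1 with errno set on failure.
 */
int set_uclamp_pct(pid_t pid, unsigned int min_pct, unsigned int max_pct)
{
	struct uclamp_sched_attr attr = {
		.size = sizeof(attr),
		.sched_flags = UC_FLAG_KEEP_POLICY | UC_FLAG_KEEP_PARAMS |
			       UC_FLAG_UTIL_CLAMP_MIN | UC_FLAG_UTIL_CLAMP_MAX,
		.sched_util_min = pct_to_util(min_pct),
		.sched_util_max = pct_to_util(max_pct),
	};

	return (int)syscall(SYS_sched_setattr, pid, &attr, 0);
}
```

Calling `set_uclamp_pct(0, 40, 80)` would request 409 and 819 for the current thread. The call can fail, for example on kernels built without util clamp support, so the return value should be checked.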
362 | The special value -1 is used to reset the uclamp settings to the system | |
363 | default. | |
364 | ||
365 | Note that resetting the uclamp value to system default using -1 is not the same | |
366 | as manually setting uclamp value to system default. This distinction is | |
367 | important because as we shall see in system interfaces, the default value for | |
368 | RT could be changed. SCHED_NORMAL/OTHER might gain similar knobs too in the | |
369 | future. | |
370 | ||
371 | 3.2. cgroup interface | |
372 | --------------------- | |
373 | ||
374 | There are two uclamp related values in the CPU cgroup controller: | |
375 | ||
376 | * cpu.uclamp.min | |
377 | * cpu.uclamp.max | |
378 | ||
379 | When a task is attached to a CPU controller, its uclamp values will be impacted | |
380 | as follows: | |
381 | ||
382 | * cpu.uclamp.min is a protection as described in :ref:`section 3-3 of cgroup | |
383 | v2 documentation <cgroupv2-protections-distributor>`. | |
384 | ||
385 | If a task's uclamp_min value is lower than cpu.uclamp.min, then the task will | |
386 | inherit the cgroup cpu.uclamp.min value. | |
387 | ||
388 | In a cgroup hierarchy, effective cpu.uclamp.min is the max of (child, | |
389 | parent). | |
390 | ||
391 | * cpu.uclamp.max is a limit as described in :ref:`section 3-2 of cgroup v2 | |
392 | documentation <cgroupv2-limits-distributor>`. | |
393 | ||
394 | If a task's uclamp_max value is higher than cpu.uclamp.max, then the task will | |
395 | inherit the cgroup cpu.uclamp.max value. | |
396 | ||
397 | In a cgroup hierarchy, effective cpu.uclamp.max is the min of (child, | |
398 | parent). | |
399 | ||
400 | For example, given the following parameters: | |
401 | ||
402 | :: | |
403 | ||
404 | p0->uclamp[UCLAMP_MIN] = // system default; | |
405 | p0->uclamp[UCLAMP_MAX] = // system default; | |
406 | ||
407 | p1->uclamp[UCLAMP_MIN] = 40% * 1024; | |
408 | p1->uclamp[UCLAMP_MAX] = 50% * 1024; | |
409 | ||
410 | cgroup0->cpu.uclamp.min = 20% * 1024; | |
411 | cgroup0->cpu.uclamp.max = 60% * 1024; | |
412 | ||
413 | cgroup1->cpu.uclamp.min = 60% * 1024; | |
414 | cgroup1->cpu.uclamp.max = 100% * 1024; | |
415 | ||
416 | when p0 and p1 are attached to cgroup0, the values become: | |
417 | ||
418 | :: | |
419 | ||
420 | p0->uclamp[UCLAMP_MIN] = cgroup0->cpu.uclamp.min = 20% * 1024; | |
421 | p0->uclamp[UCLAMP_MAX] = cgroup0->cpu.uclamp.max = 60% * 1024; | |
422 | ||
423 | p1->uclamp[UCLAMP_MIN] = 40% * 1024; // intact | |
424 | p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact | |
425 | ||
426 | when p0 and p1 are attached to cgroup1, these instead become: | |
427 | ||
428 | :: | |
429 | ||
430 | p0->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024; | |
431 | p0->uclamp[UCLAMP_MAX] = cgroup1->cpu.uclamp.max = 100% * 1024; | |
432 | ||
433 | p1->uclamp[UCLAMP_MIN] = cgroup1->cpu.uclamp.min = 60% * 1024; | |
434 | p1->uclamp[UCLAMP_MAX] = 50% * 1024; // intact | |
435 | ||
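The cgroup rules described in this section can be written down as a small function. The sketch below follows the wording of this document (the minimum acts as a protection, the maximum as a limit); `struct clamp`, `pct` and `cgroup_restrict` are hypothetical names, and the percentage conversion rounds down, which is an assumption of this example.

```c
#include <assert.h>

struct clamp {
	unsigned int min;	/* UCLAMP_MIN */
	unsigned int max;	/* UCLAMP_MAX */
};

/* Convert a cgroup percentage to the [0:1024] range, rounding down. */
unsigned int pct(unsigned int p)
{
	assert(p <= 100);
	return p * 1024 / 100;
}

/* Apply a cgroup's cpu.uclamp.min/max to a task's requested values. */
struct clamp cgroup_restrict(struct clamp task, struct clamp cg)
{
	struct clamp eff = task;

	if (eff.min < cg.min)	/* protection: raise to the cgroup minimum */
		eff.min = cg.min;
	if (eff.max > cg.max)	/* limit: lower to the cgroup maximum */
		eff.max = cg.max;
	return eff;
}
```

Feeding the p0, p1, cgroup0 and cgroup1 values from the example through `cgroup_restrict()` reproduces the listed results, including the case where p1 in cgroup1 ends up with a minimum above its maximum.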
436 | Note that the cgroup interface allows the cpu.uclamp.max value to be lower than | |
437 | cpu.uclamp.min. Other interfaces don't allow that. | |
438 | ||
439 | 3.3. System interface | |
440 | --------------------- | |
441 | ||
442 | 3.3.1 sched_util_clamp_min | |
443 | -------------------------- | |
444 | ||
445 | System wide limit of allowed UCLAMP_MIN range. By default it is set to 1024, | |
446 | which means that permitted effective UCLAMP_MIN range for tasks is [0:1024]. | |
447 | By changing it to 512 for example the range reduces to [0:512]. This is useful | |
448 | to restrict how much boosting tasks are allowed to acquire. | |
449 | ||
450 | Requests from tasks to go above this knob value will still succeed, but | |
451 | they won't be satisfied until the knob value is more than p->uclamp[UCLAMP_MIN]. | |
452 | ||
453 | The value must be smaller than or equal to sched_util_clamp_max. | |
454 | ||
455 | 3.3.2 sched_util_clamp_max | |
456 | -------------------------- | |
457 | ||
458 | System wide limit of allowed UCLAMP_MAX range. By default it is set to 1024, | |
459 | which means that permitted effective UCLAMP_MAX range for tasks is [0:1024]. | |
460 | ||
461 | By changing it to 512 for example the effective allowed range reduces to | |
462 | [0:512]. This means that no task can run above 512, which implies that all | |
463 | rqs are restricted too. IOW, the whole system is capped to half its performance | |
464 | capacity. | |
465 | ||
466 | This is useful to restrict the overall maximum performance point of the system. | |
467 | For example, it can be handy to limit performance when running low on battery | |
468 | or when the system wants to limit access to more energy hungry performance | |
469 | levels when it's in an idle state or the screen is off. | |
470 | ||
471 | Requests from tasks to go above this knob value will still succeed, but they | |
472 | won't be satisfied until the knob value is more than p->uclamp[UCLAMP_MAX]. | |
473 | ||
474 | The value must be greater than or equal to sched_util_clamp_min. | |
475 | ||
476 | .. _uclamp-default-values: | |
477 | ||
478 | 3.4. Default values | |
479 | ------------------- | |
480 | ||
481 | By default all SCHED_NORMAL/SCHED_OTHER tasks are initialized to: | |
482 | ||
483 | :: | |
484 | ||
485 | p_fair->uclamp[UCLAMP_MIN] = 0 | |
486 | p_fair->uclamp[UCLAMP_MAX] = 1024 | |
487 | ||
488 | That is, by default they are neither boosted nor capped. These defaults cannot be | |
489 | changed at boot or runtime. No argument was made yet as to why we should | |
490 | provide this, but it can be added in the future. | |
491 | ||
492 | For SCHED_FIFO/SCHED_RR tasks: | |
493 | ||
494 | :: | |
495 | ||
496 | p_rt->uclamp[UCLAMP_MIN] = 1024 | |
497 | p_rt->uclamp[UCLAMP_MAX] = 1024 | |
498 | ||
499 | That is, by default they're boosted to run at the maximum performance point of | |
500 | the system which retains the historical behavior of the RT tasks. | |
501 | ||
502 | The default uclamp_min value of RT tasks can be modified at boot or runtime via | |
503 | sysctl. See the section below. | |
504 | ||
505 | .. _sched-util-clamp-min-rt-default: | |
506 | ||
507 | 3.4.1 sched_util_clamp_min_rt_default | |
508 | ------------------------------------- | |
509 | ||
510 | Running RT tasks at maximum performance point is expensive on battery powered | |
511 | devices and not necessary. To allow system developers to offer good performance | |
512 | guarantees for these tasks without pushing them all the way to maximum | |
513 | performance point, this sysctl knob allows tuning the best boost value to | |
514 | address the system requirement without burning power running at maximum | |
515 | performance point all the time. | |
516 | ||
517 | Application developers are encouraged to use the per task util clamp interface | |
518 | to ensure they are performance and power aware. Ideally this knob should be set | |
519 | to 0 by system designers, leaving the task of managing performance | |
520 | requirements to the apps. | |
521 | ||
522 | 4. How to use util clamp | |
523 | ======================== | |
524 | ||
525 | Util clamp promotes the concept of user space assisted power and performance | |
526 | management. At the scheduler level, the information required to make the best | |
527 | decision is not available. However, with util clamp user space can hint to the | |
528 | scheduler to make better decisions about task placement and frequency selection. | |
529 | ||
530 | Best results are achieved by not making any assumptions about the system the | |
531 | application is running on and by using it in conjunction with a feedback loop to | |
532 | dynamically monitor and adjust. Ultimately this will allow for a better user | |
533 | experience at a better perf/watt. | |
534 | ||
535 | For some systems and use cases, static setup will help to achieve good results. | |
536 | Portability will be a problem in this case. How much work one can do at 100, | |
537 | 200 or 1024 is different for each system. Unless there's a specific target | |
538 | system, static setup should be avoided. | |
539 | ||
540 | There are enough possibilities to create a whole framework based on util clamp | |
541 | or a self-contained app that makes use of it directly. | |
542 | ||
543 | 4.1. Boost important and DVFS-latency-sensitive tasks | |
544 | ----------------------------------------------------- | |
545 | ||
546 | A GUI task might not be busy enough to warrant driving the frequency high when it | |
547 | wakes up. However, it needs to finish its work within a specific time window | |
548 | to deliver the desired user experience. The right frequency it requires at | |
549 | wakeup will be system dependent. On some underpowered systems it will be high, | |
550 | on other overpowered ones it will be low or 0. | |
551 | ||
552 | This task can increase its UCLAMP_MIN value every time it misses the deadline | |
553 | to ensure on next wake up it runs at a higher performance point. It should try | |
554 | to approach the lowest UCLAMP_MIN value that allows it to meet its deadline on any | |
555 | particular system to achieve the best possible perf/watt for that system. | |
556 | ||
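One step of such a feedback loop could look like the sketch below. It is purely illustrative: the function `next_uclamp_min`, the step sizes and the policy of slowly lowering the value after a met deadline are assumptions of this example, not something prescribed by util clamp. The result would then be passed to the per-task interface as the new UCLAMP_MIN request.

```c
#include <assert.h>

#define UCLAMP_LIMIT 1024u

/*
 * Compute the next UCLAMP_MIN request.
 * After a missed deadline, boost by step_up (saturating at 1024).
 * Otherwise decay by step_down (saturating at 0) to probe for the
 * lowest value that still meets the deadline on this system.
 */
unsigned int next_uclamp_min(unsigned int cur, int missed_deadline,
			     unsigned int step_up, unsigned int step_down)
{
	assert(cur <= UCLAMP_LIMIT);

	if (missed_deadline)
		return step_up > UCLAMP_LIMIT - cur ? UCLAMP_LIMIT
						    : cur + step_up;
	return cur > step_down ? cur - step_down : 0;
}
```

Using a large step up and a small step down makes the loop react quickly to misses while converging slowly towards the lowest workable value; the right step sizes are system and workload dependent.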
557 | On heterogeneous systems, it might be important for this task to run on | |
558 | a faster CPU. | |
559 | ||
560 | **Generally it is advised to perceive the input as performance level or point | |
561 | which will imply both task placement and frequency selection**. | |
562 | ||
563 | 4.2. Cap background tasks | |
564 | ------------------------- | |
565 | ||
566 | As explained for the Android case in the introduction, any app can lower | |
567 | UCLAMP_MAX for some background tasks that don't care about performance but | |
568 | could end up being busy and consuming unnecessary system resources. | |
569 | ||
570 | 4.3. Powersave mode | |
571 | ------------------- | |
572 | ||
573 | The sched_util_clamp_max system wide interface can be used to limit all tasks from | |
574 | operating at the higher performance points which are usually energy | |
575 | inefficient. | |
576 | ||
577 | This is not unique to uclamp as one can achieve the same by reducing the max | |
578 | frequency of the cpufreq governor. It can be considered a more convenient | |
579 | alternative interface. | |
580 | ||
581 | 4.4. Per-app performance restriction | |
582 | ------------------------------------ | |
583 | ||
584 | Middleware/Utility can provide the user an option to set UCLAMP_MIN/MAX for an | |
585 | app every time it is executed to guarantee a minimum performance point and/or | |
586 | limit it from draining system power at the cost of reduced performance for | |
587 | these apps. | |
588 | ||
589 | If you want to prevent your laptop from heating up while compiling the kernel | |
590 | on the go, and are happy to sacrifice performance to save power, but | |
591 | would still like to keep your browser performance intact, uclamp makes it | |
592 | possible. | |
593 | ||
594 | 5. Limitations | |
595 | ============== | |
596 | ||
597 | .. _uclamp-capping-fail: | |
598 | ||
599 | 5.1. Capping frequency with uclamp_max fails under certain conditions | |
600 | --------------------------------------------------------------------- | |
601 | ||
602 | If task p0 is capped to run at 512: | |
603 | ||
604 | :: | |
605 | ||
606 | p0->uclamp[UCLAMP_MAX] = 512 | |
607 | ||
608 | and it shares the rq with p1 which is free to run at any performance point: | |
609 | ||
610 | :: | |
611 | ||
612 | p1->uclamp[UCLAMP_MAX] = 1024 | |
613 | ||
614 | then due to max aggregation the rq will be allowed to reach max performance | |
615 | point: | |
616 | ||
617 | :: | |
618 | ||
619 | rq->uclamp[UCLAMP_MAX] = max(512, 1024) = 1024 | |
620 | ||
621 | Assuming both p0 and p1 have UCLAMP_MIN = 0, then the frequency selection for | |
622 | the rq will depend on the actual utilization value of the tasks. | |
623 | ||
624 | If p1 is a small task but p0 is a CPU intensive task, then due to the fact that | |
625 | both are running on the same rq, p1 will cause the frequency capping to be lifted | |
626 | from the rq although p1, which is allowed to run at any performance point, | |
627 | doesn't actually need to run at that frequency. | |
628 | ||
629 | 5.2. UCLAMP_MAX can break PELT (util_avg) signal | |
630 | ------------------------------------------------ | |
631 | ||
632 | PELT assumes that frequency will always increase as the signals grow to ensure | |
633 | there's always some idle time on the CPU. But with UCLAMP_MAX, this frequency | |
634 | increase will be prevented which can lead to no idle time in some | |
635 | circumstances. When there's no idle time, a task will be stuck in a busy loop, | |
636 | which would result in util_avg being 1024. | |
637 | ||
638 | Combined with the issue described above in section 5.1, this can lead to unwanted | |
639 | frequency spikes when severely capped tasks share the rq with a small non capped task. | |
640 | ||
641 | As an example, if task p0, which has: | |
642 | ||
643 | :: | |
644 | ||
645 | p0->util_avg = 300 | |
646 | p0->uclamp[UCLAMP_MAX] = 0 | |
647 | ||
648 | wakes up on an idle CPU, then it will run at min frequency (Fmin) this | |
649 | CPU is capable of. The max CPU frequency (Fmax) matters here as well, | |
650 | since it designates the shortest computational time to finish the task's | |
651 | work on this CPU. | |
652 | ||
653 | :: | |
654 | ||
655 | rq->uclamp[UCLAMP_MAX] = 0 | |
656 | ||
657 | If the ratio of Fmax/Fmin is 3, then the maximum value will be: | |
658 | ||
659 | :: | |
660 | ||
661 | 300 * (Fmax/Fmin) = 900 | |
662 | ||
663 | which indicates the CPU will still see idle time since 900 is < 1024. The | |
664 | _actual_ util_avg will not be 900 though, but somewhere between 300 and 900. As | |
665 | long as there's idle time, p0->util_avg updates will be off by some margin, | |
666 | but not proportional to Fmax/Fmin. | |
667 | ||
668 | :: | |
669 | ||
670 | p0->util_avg = 300 + small_error | |
671 | ||
672 | Now if the ratio of Fmax/Fmin is 4, the maximum value becomes: | |
673 | ||
674 | :: | |
675 | ||
676 | 300 * (Fmax/Fmin) = 1200 | |
677 | ||
678 | which is higher than 1024 and indicates that the CPU has no idle time. When | |
679 | this happens, then the _actual_ util_avg will become: | |
680 | ||
681 | :: | |
682 | ||
683 | p0->util_avg = 1024 | |
684 | ||
685 | If task p1, which has the following values, wakes up on this CPU: | |
686 | ||
687 | :: | |
688 | ||
689 | p1->util_avg = 200 | |
690 | p1->uclamp[UCLAMP_MAX] = 1024 | |
691 | ||
692 | then the effective UCLAMP_MAX for the CPU will be 1024 according to max | |
693 | aggregation rule. But since the capped p0 task was running and throttled | |
694 | severely, the rq->util_avg will be: | |
695 | ||
696 | :: | |
697 | ||
698 | p0->util_avg = 1024 | |
699 | p1->util_avg = 200 | |
700 | ||
701 | rq->util_avg = 1024 | |
702 | rq->uclamp[UCLAMP_MAX] = 1024 | |
703 | ||
704 | This leads to a frequency spike, since if p0 wasn't throttled we would get: | |
705 | ||
706 | :: | |
707 | ||
708 | p0->util_avg = 300 | |
709 | p1->util_avg = 200 | |
710 | ||
711 | rq->util_avg = 500 | |
712 | ||
713 | and run somewhere near the mid performance point of that CPU, not the Fmax we get. | |
714 | ||
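The arithmetic used in this section can be captured in a few lines. The sketch below only reproduces the estimate from the text (utilization scaled by Fmax/Fmin and saturating at 1024); `capped_util_bound` and `has_idle_time` are hypothetical helper names, and the frequencies may be given in any unit since only their ratio is used.

```c
#include <assert.h>

#define UTIL_SCALE 1024u

/*
 * Upper bound on the util_avg a task can appear to have when it is
 * forced to run at fmin instead of fmax, saturating at 1024.
 */
unsigned int capped_util_bound(unsigned int util, unsigned int fmax,
			       unsigned int fmin)
{
	unsigned long long scaled;

	assert(fmin > 0 && fmax >= fmin);
	scaled = (unsigned long long)util * fmax / fmin;
	return scaled < UTIL_SCALE ? (unsigned int)scaled : UTIL_SCALE;
}

/* Non-zero when the estimate still leaves idle time on the CPU. */
int has_idle_time(unsigned int util, unsigned int fmax, unsigned int fmin)
{
	return capped_util_bound(util, fmax, fmin) < UTIL_SCALE;
}
```

For a task with util_avg of 300, a ratio of 3 gives a bound of 900 with idle time left, while a ratio of 4 saturates at 1024 with no idle time, which is the case where the signal breaks.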
715 | 5.3. Schedutil response time issues | |
716 | ----------------------------------- | |
717 | ||
718 | schedutil has three limitations: | |
719 | ||
720 | 1. Hardware takes non-zero time to respond to any frequency change | |
721 | request. On some platforms it can be on the order of a few ms. | |
722 | 2. Non fast-switch systems require a worker deadline thread to wake up | |
723 | and perform the frequency change, which adds measurable overhead. | |
724 | 3. schedutil rate_limit_us drops any requests during this rate_limit_us | |
725 | window. | |
726 | ||
727 | If a relatively small task is doing a critical job and requires a certain | |
728 | performance point when it wakes up and starts running, then all these | |
729 | limitations will prevent it from getting what it wants in the time scale it | |
730 | expects. | |
731 | ||
732 | This limitation is not unique to uclamp, but it will be more | |
733 | prevalent with uclamp as we no longer gradually ramp up or down. We could easily be | |
734 | jumping between frequencies depending on the order tasks wake up, and their | |
735 | respective uclamp values. | |
736 | ||
737 | We regard that as a limitation of the capabilities of the underlying system | |
738 | itself. | |
739 | ||
740 | There is room to improve the behavior of schedutil rate_limit_us, but not much | |
741 | to be done for 1 or 2. They are considered hard limitations of the system. |