Commit | Line | Data |
---|---|---|
b57b8496 TY |
1 | ========= |
2 | Schedutil | |
3 | ========= | |
0301925d | 4 | |
b57b8496 | 5 | .. note:: |
0301925d | 6 | |
b57b8496 TY |
7 | All this assumes a linear relation between frequency and work capacity, |
8 | we know this is flawed, but it is the best workable approximation. | |
0301925d PZ |
9 | |
10 | ||
11 | PELT (Per Entity Load Tracking) | |
b57b8496 | 12 | =============================== |
0301925d PZ |
13 | |
14 | With PELT we track some metrics across the various scheduler entities, from | |
15 | individual tasks to task-group slices to CPU runqueues. As the basis for this | |
16 | we use an Exponentially Weighted Moving Average (EWMA), each period (1024us) | |
17 | is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute | |
18 | half, while the rest of history contribute the other half. | |
19 | ||
20 | Specifically: | |
21 | ||
22 | ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ... | |
23 | ||
24 | ewma(u) = ewma_sum(u) / ewma_sum(1) | |
25 | ||
26 | Since this is essentially a progression of an infinite geometric series, the | |
27 | results are composable, that is ewma(A) + ewma(B) = ewma(A+B). This property | |
28 | is key, since it gives the ability to recompose the averages when tasks move | |
29 | around. | |
30 | ||
31 | Note that blocked tasks still contribute to the aggregates (task-group slices | |
32 | and CPU runqueues), which reflects their expected contribution when they | |
33 | resume running. | |
34 | ||
35 | Using this we track 2 key metrics: 'running' and 'runnable'. 'Running' | |
36 | reflects the time an entity spends on the CPU, while 'runnable' reflects the | |
37 | time an entity spends on the runqueue. When there is only a single task these | |
38 | two metrics are the same, but once there is contention for the CPU 'running' | |
39 | will decrease to reflect the fraction of time each task spends on the CPU | |
40 | while 'runnable' will increase to reflect the amount of contention. | |
41 | ||
42 | For more detail see: kernel/sched/pelt.c | |
43 | ||
44 | ||
b57b8496 TY |
45 | Frequency / CPU Invariance |
46 | ========================== | |
0301925d PZ |
47 | |
48 | Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU | |
49 | for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on | |
50 | a big CPU, we allow architectures to scale the time delta with two ratios, one | |
51 | Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio. | |
52 | ||
53 | For simple DVFS architectures (where software is in full control) we trivially | |
b57b8496 | 54 | compute the ratio as:: |
0301925d PZ |
55 | |
56 | f_cur | |
57 | r_dvfs := ----- | |
58 | f_max | |
59 | ||
60 | For more dynamic systems where the hardware is in control of DVFS we use | |
61 | hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio. | |
b57b8496 | 62 | For Intel specifically, we use:: |
0301925d PZ |
63 | |
64 | APERF | |
65 | f_cur := ----- * P0 | |
66 | MPERF | |
67 | ||
68 | 4C-turbo; if available and turbo enabled | |
69 | f_max := { 1C-turbo; if turbo enabled | |
70 | P0; otherwise | |
71 | ||
72 | f_cur | |
73 | r_dvfs := min( 1, ----- ) | |
74 | f_max | |
75 | ||
76 | We pick 4C turbo over 1C turbo to make it slightly more sustainable. | |
77 | ||
78 | r_cpu is determined as the ratio of highest performance level of the current | |
79 | CPU vs the highest performance level of any other CPU in the system. | |
80 | ||
81 | r_tot = r_dvfs * r_cpu | |
82 | ||
83 | The result is that the above 'running' and 'runnable' metrics become invariant | |
84 | of DVFS and CPU type. IOW. we can transfer and compare them between CPUs. | |
85 | ||
86 | For more detail see: | |
87 | ||
88 | - kernel/sched/pelt.h:update_rq_clock_pelt() | |
89 | - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation." | |
90 | - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization" | |
91 | ||
92 | ||
93 | UTIL_EST / UTIL_EST_FASTUP | |
b57b8496 | 94 | ========================== |
0301925d PZ |
95 | |
96 | Because periodic tasks have their averages decayed while they sleep, even | |
97 | though when running their expected utilization will be the same, they suffer a | |
98 | (DVFS) ramp-up after they are running again. | |
99 | ||
100 | To alleviate this (a default enabled option) UTIL_EST drives an Infinite | |
101 | Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is | |
102 | highest. A further default enabled option UTIL_EST_FASTUP modifies the IIR | |
103 | filter to instantly increase and only decay on decrease. | |
104 | ||
105 | A further runqueue wide sum (of runnable tasks) is maintained of: | |
106 | ||
107 | util_est := \Sum_t max( t_running, t_util_est_ewma ) | |
108 | ||
109 | For more detail see: kernel/sched/fair.c:util_est_dequeue() | |
110 | ||
111 | ||
112 | UCLAMP | |
b57b8496 | 113 | ====== |
0301925d PZ |
114 | |
115 | It is possible to set effective u_min and u_max clamps on each CFS or RT task; | |
116 | the runqueue keeps an max aggregate of these clamps for all running tasks. | |
117 | ||
118 | For more detail see: include/uapi/linux/sched/types.h | |
119 | ||
120 | ||
121 | Schedutil / DVFS | |
b57b8496 | 122 | ================ |
0301925d PZ |
123 | |
124 | Every time the scheduler load tracking is updated (task wakeup, task | |
125 | migration, time progression) we call out to schedutil to update the hardware | |
126 | DVFS state. | |
127 | ||
128 | The basis is the CPU runqueue's 'running' metric, which per the above it is | |
129 | the frequency invariant utilization estimate of the CPU. From this we compute | |
b57b8496 | 130 | a desired frequency like:: |
0301925d PZ |
131 | |
132 | max( running, util_est ); if UTIL_EST | |
133 | u_cfs := { running; otherwise | |
134 | ||
135 | clamp( u_cfs + u_rt , u_min, u_max ); if UCLAMP_TASK | |
136 | u_clamp := { u_cfs + u_rt; otherwise | |
137 | ||
138 | u := u_clamp + u_irq + u_dl; [approx. see source for more detail] | |
139 | ||
140 | f_des := min( f_max, 1.25 u * f_max ) | |
141 | ||
b57b8496 | 142 | XXX IO-wait: when the update is due to a task wakeup from IO-completion we |
0301925d PZ |
143 | boost 'u' above. |
144 | ||
145 | This frequency is then used to select a P-state/OPP or directly munged into a | |
146 | CPPC style request to the hardware. | |
147 | ||
148 | XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min | |
149 | required to satisfy the workload. | |
150 | ||
151 | Because these callbacks are directly from the scheduler, the DVFS hardware | |
152 | interaction should be 'fast' and non-blocking. Schedutil supports | |
153 | rate-limiting DVFS requests for when hardware interaction is slow and | |
154 | expensive, this reduces effectiveness. | |
155 | ||
156 | For more information see: kernel/sched/cpufreq_schedutil.c | |
157 | ||
158 | ||
159 | NOTES | |
b57b8496 | 160 | ===== |
0301925d PZ |
161 | |
162 | - On low-load scenarios, where DVFS is most relevant, the 'running' numbers | |
163 | will closely reflect utilization. | |
164 | ||
165 | - In saturated scenarios task movement will cause some transient dips, | |
166 | suppose we have a CPU saturated with 4 tasks, then when we migrate a task | |
167 | to an idle CPU, the old CPU will have a 'running' value of 0.75 while the | |
168 | new CPU will gain 0.25. This is inevitable and time progression will | |
169 | correct this. XXX do we still guarantee f_max due to no idle-time? | |
170 | ||
171 | - Much of the above is about avoiding DVFS dips, and independent DVFS domains | |
172 | having to re-learn / ramp-up when load shifts. | |
173 |