=====================
CFS Bandwidth Control
=====================

.. note::
   This document only discusses CPU bandwidth control for SCHED_NORMAL.
   The SCHED_RT case is covered in Documentation/scheduler/sched-rt-group.rst

CFS bandwidth control is a CONFIG_FAIR_GROUP_SCHED extension which allows the
specification of the maximum CPU bandwidth available to a group or hierarchy.
The bandwidth allowed for a group is specified using a quota and period. Within
each given "period" (microseconds), a task group is allocated up to "quota"
microseconds of CPU time. That quota is assigned to per-cpu run queues in
slices as threads in the cgroup become runnable. Once all quota has been
assigned, any additional requests for quota will result in those threads being
throttled. Throttled threads will not be able to run again until the next
period, when the quota is replenished.

A group's unassigned quota is globally tracked, being refreshed back to
cfs_quota units at each period boundary. As threads consume this bandwidth it
is transferred to cpu-local "silos" on a demand basis. The amount transferred
within each of these updates is tunable and described as the "slice".
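
As a rough illustration of the mechanism above, consider the following toy
model (a sketch in Python, not kernel code; the GlobalPool class and its
names are invented here). Quota is handed out to cpu-local consumers in
slice-sized chunks until the pool runs dry, then refreshed at the period
boundary::

    class GlobalPool:
        """Toy model of a group's globally tracked quota."""
        def __init__(self, quota_us, slice_us=5000):
            self.quota_us = quota_us   # replenished each period
            self.remaining = quota_us  # unassigned quota in the pool
            self.slice_us = slice_us   # amount moved per transfer

        def assign_slice(self):
            """Transfer up to one slice to a cpu-local silo; 0 => throttle."""
            grant = min(self.slice_us, self.remaining)
            self.remaining -= grant
            return grant

        def refresh(self):
            """Called at each period boundary."""
            self.remaining = self.quota_us

    pool = GlobalPool(quota_us=10000)  # 10ms of quota per period
    grants = [pool.assign_slice() for _ in range(3)]
    # 10ms of quota yields two full 5ms slices, then nothing until refresh()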

Burst feature
-------------
This feature borrows time now against our future underrun, at the cost of
increased interference against the other system users. All nicely bounded.

Traditional (UP-EDF) bandwidth control is something like:

  (U = \Sum u_i) <= 1

This guarantees both that every deadline is met and that the system is
stable. After all, if U were > 1, then for every second of walltime,
we'd have to run more than a second of program time, and obviously miss
our deadline, but the next deadline will be further out still; there is
never time to catch up, unbounded fail.

The burst feature observes that a workload doesn't always execute the full
quota; this enables one to describe u_i as a statistical distribution.

For example, have u_i = {x,e}_i, where x is the p(95) and x+e the p(100)
(the traditional WCET). This effectively allows u to be smaller,
increasing the efficiency (we can pack more tasks in the system), but at
the cost of missing deadlines when all the odds line up. However, it
does maintain stability, since every overrun must be paired with an
underrun as long as our x is above the average.

That is, suppose we have 2 tasks, both specify a p(95) value, then we
have a p(95)*p(95) = 90.25% chance both tasks are within their quota and
everything is good. At the same time we have a p(5)*p(5) = 0.25% chance
both tasks will exceed their quota at the same time (guaranteed deadline
fail). Somewhere in between there's a threshold where one exceeds and
the other doesn't underrun enough to compensate; this depends on the
specific CDFs.
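
The joint probabilities quoted above are easy to verify (a quick check,
assuming the two tasks' overruns are independent)::

    # Two independent tasks, each specifying its p(95) utilization.
    p_both_within = 0.95 * 0.95  # both tasks stay within quota
    p_both_exceed = 0.05 * 0.05  # both tasks exceed at the same time

    print(round(p_both_within * 100, 2))  # 90.25 (%)
    print(round(p_both_exceed * 100, 2))  # 0.25 (%)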

At the same time, we can say that the worst case deadline miss will be
\Sum e_i; that is, there is a bounded tardiness (under the assumption
that x+e is indeed WCET).

The interference when using burst is valued by the possibilities for
missing the deadline and the average WCET. Test results showed that when
there are many cgroups or the CPU is under-utilized, the interference is
limited. More details are shown in:
https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/

Management
----------
Quota, period and burst are managed within the cpu subsystem via cgroupfs.

.. note::
   The cgroupfs files described in this section are only applicable
   to cgroup v1. For cgroup v2, see
   :ref:`Documentation/admin-guide/cgroup-v2.rst <cgroup-v2-cpu>`.

- cpu.cfs_quota_us: run-time replenished within a period (in microseconds)
- cpu.cfs_period_us: the length of a period (in microseconds)
- cpu.stat: exports throttling statistics [explained further below]
- cpu.cfs_burst_us: the maximum accumulated run-time (in microseconds)

The default values are::

    cpu.cfs_period_us=100ms
    cpu.cfs_quota_us=-1
    cpu.cfs_burst_us=0

A value of -1 for cpu.cfs_quota_us indicates that the group does not have any
bandwidth restriction in place; such a group is described as an unconstrained
bandwidth group. This represents the traditional work-conserving behavior for
CFS.

Writing any (valid) positive value(s) no smaller than cpu.cfs_burst_us will
enact the specified bandwidth limit. The minimum value allowed for either the
quota or period is 1ms. There is also an upper bound on the period length of
1s. Additional restrictions exist when bandwidth limits are used in a
hierarchical fashion; these are explained in more detail below.

Writing any negative value to cpu.cfs_quota_us will remove the bandwidth limit
and return the group to an unconstrained state once more.

A value of 0 for cpu.cfs_burst_us indicates that the group can not accumulate
any unused bandwidth; the traditional bandwidth control behavior for CFS is
unchanged. Writing any (valid) positive value(s) no larger than
cpu.cfs_quota_us into cpu.cfs_burst_us will enact the cap on unused bandwidth
accumulation.

Any updates to a group's bandwidth specification will result in it becoming
unthrottled if it is in a constrained state.

System wide settings
--------------------
For efficiency, run-time is transferred between the global pool and CPU-local
"silos" in a batch fashion. This greatly reduces global accounting pressure
on large systems. The amount transferred each time such an update is required
is described as the "slice".

This is tunable via procfs::

    /proc/sys/kernel/sched_cfs_bandwidth_slice_us (default=5ms)

Larger slice values will reduce transfer overheads, while smaller values allow
for more fine-grained consumption.

Statistics
----------
A group's bandwidth statistics are exported via 5 fields in cpu.stat.

cpu.stat:

- nr_periods: Number of enforcement intervals that have elapsed.
- nr_throttled: Number of times the group has been throttled/limited.
- throttled_time: The total time duration (in nanoseconds) for which entities
  of the group have been throttled.
- nr_bursts: Number of periods in which a burst occurred.
- burst_time: Cumulative wall-time (in nanoseconds) that any CPUs have used
  above quota in respective periods.

This interface is read-only.
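
The cpu.stat format above is simple key/value text; a small parser makes it
easy to derive ratios from it (a sketch; reading a live file would require a
mounted cgroupfs, so a sample string of the same shape is parsed instead)::

    def parse_cpu_stat(text):
        """Parse "key value" lines, as exported by cpu.stat, into a dict."""
        stats = {}
        for line in text.splitlines():
            key, _, value = line.strip().partition(" ")
            if key:
                stats[key] = int(value)
        return stats

    sample = ("nr_periods 1000\n"
              "nr_throttled 30\n"
              "throttled_time 45000000\n"
              "nr_bursts 5\n"
              "burst_time 2000000\n")

    stats = parse_cpu_stat(sample)
    throttle_ratio = stats["nr_throttled"] / stats["nr_periods"]  # 0.03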

Hierarchical considerations
---------------------------
The interface enforces that an individual entity's bandwidth is always
attainable, that is: max(c_i) <= C. However, over-subscription in the
aggregate case is explicitly allowed to enable work-conserving semantics
within a hierarchy:

  e.g. \Sum (c_i) may exceed C

[ Where C is the parent's bandwidth, and c_i its children ]

There are two ways in which a group may become throttled:

  a. it fully consumes its own quota within a period
  b. a parent's quota is fully consumed within its period

In case b) above, even though the child may have runtime remaining it will not
be allowed to run until the parent's runtime is refreshed.
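
The two throttle conditions can be sketched as a walk up the hierarchy (a toy
model in Python; the Group class and its fields are invented for
illustration)::

    class Group:
        """A group is throttled if it, or any constrained ancestor, has
        consumed its full quota for the current period."""
        def __init__(self, quota_us=None, parent=None):
            self.quota_us = quota_us  # None models quota = -1 (unconstrained)
            self.used_us = 0
            self.parent = parent

        def throttled(self):
            g = self
            while g is not None:
                if g.quota_us is not None and g.used_us >= g.quota_us:
                    return True  # case a) if g is self, case b) otherwise
                g = g.parent
            return False

    parent = Group(quota_us=100000)             # 100ms quota
    child = Group(quota_us=50000, parent=parent)
    child.used_us = 10000                       # well under its own quota
    parent.used_us = 100000                     # but the parent is exhausted
    # child.throttled() is now True via case b)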

CFS Bandwidth Quota Caveats
---------------------------
Once a slice is assigned to a cpu it does not expire. However, all but 1ms of
the slice may be returned to the global pool if all threads on that cpu become
unrunnable. This is configured at compile time by the min_cfs_rq_runtime
variable. This is a performance tweak that helps prevent added contention on
the global lock.

The fact that cpu-local slices do not expire results in some interesting corner
cases that should be understood.

For cgroup cpu-constrained applications that are cpu limited, this is a
relatively moot point because they will naturally consume the entirety of their
quota as well as the entirety of each cpu-local slice in each period. As a
result it is expected that nr_periods roughly equals nr_throttled, and that
cpuacct.usage will increase by roughly cfs_quota_us in each period.

For highly-threaded, non-cpu bound applications this non-expiration nuance
allows applications to briefly burst past their quota limits by the amount of
unused slice on each cpu that the task group is running on (typically at most
1ms per cpu or as defined by min_cfs_rq_runtime). This slight burst only
applies if quota had been assigned to a cpu and then not fully used or returned
in previous periods. This burst amount will not be transferred between cores.
As a result, this mechanism still strictly limits the task group to the quota
average usage, albeit over a longer time window than a single period. This
also limits the burst ability to no more than 1ms per cpu. This provides a
better, more predictable user experience for highly threaded applications with
small quota limits on high core count machines. It also eliminates the
propensity to throttle these applications while simultaneously using less than
quota amounts of cpu. Another way to say this is that by allowing the unused
portion of a slice to remain valid across periods we have decreased the
possibility of wastefully expiring quota on cpu-local silos that don't need a
full slice's amount of cpu time.
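
The bound described above is easy to state numerically (a back-of-the-envelope
sketch; 1ms is the default min_cfs_rq_runtime mentioned earlier)::

    MIN_CFS_RQ_RUNTIME_US = 1000  # 1ms retained per cpu-local silo

    def max_transient_burst_us(ncpus):
        """Worst-case extra usage from non-expiring local slices: at most
        min_cfs_rq_runtime on each cpu the task group runs on."""
        return ncpus * MIN_CFS_RQ_RUNTIME_US

    # On an 8-CPU machine a group may transiently exceed its quota by up to
    # 8ms in a period, repaid out of subsequent periods' quota.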

The interaction between cpu-bound and non-cpu-bound-interactive applications
should also be considered, especially when single core usage hits 100%. If you
gave each of these applications half of a cpu-core and they both got scheduled
on the same CPU it is theoretically possible that the non-cpu bound application
will use up to 1ms additional quota in some periods, thereby preventing the
cpu-bound application from fully using its quota by that same amount. In these
instances it will be up to the CFS algorithm (see sched-design-CFS.rst) to
decide which application is chosen to run, as they will both be runnable and
have remaining quota. This runtime discrepancy will be made up in the following
periods when the interactive application idles.

Examples
--------
1. Limit a group to 1 CPU worth of runtime.

   If period is 250ms and quota is also 250ms, the group will get
   1 CPU worth of runtime every 250ms::

      # echo 250000 > cpu.cfs_quota_us /* quota = 250ms */
      # echo 250000 > cpu.cfs_period_us /* period = 250ms */

2. Limit a group to 2 CPUs worth of runtime on a multi-CPU machine.

   With 500ms period and 1000ms quota, the group can get 2 CPUs worth of
   runtime every 500ms::

      # echo 1000000 > cpu.cfs_quota_us /* quota = 1000ms */
      # echo 500000 > cpu.cfs_period_us /* period = 500ms */

   The larger period here allows for increased burst capacity.

3. Limit a group to 20% of 1 CPU.

   With 50ms period, 10ms quota will be equivalent to 20% of 1 CPU::

      # echo 10000 > cpu.cfs_quota_us /* quota = 10ms */
      # echo 50000 > cpu.cfs_period_us /* period = 50ms */

   By using a small period here we are ensuring a consistent latency
   response at the expense of burst capacity.

4. Limit a group to 40% of 1 CPU, and allow it to accumulate up to an
   additional 20% of 1 CPU, in case accumulation has been done.

   With 50ms period, 20ms quota will be equivalent to 40% of 1 CPU,
   and 10ms burst will be equivalent to 20% of 1 CPU::

      # echo 20000 > cpu.cfs_quota_us /* quota = 20ms */
      # echo 50000 > cpu.cfs_period_us /* period = 50ms */
      # echo 10000 > cpu.cfs_burst_us /* burst = 10ms */

   A larger burst setting (no larger than quota) allows greater burst capacity.
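
The examples above all follow the same relation between period, quota and CPU
share; a small helper makes it explicit (a sketch; the function name is
invented here)::

    def quota_for(cpus, period_us):
        """Quota (in microseconds) granting `cpus` worth of runtime per
        period."""
        return round(cpus * period_us)

    # Reproducing the settings from the examples above:
    # quota_for(1, 250000)   -> 250000   (example 1: 1 CPU, 250ms period)
    # quota_for(2, 500000)   -> 1000000  (example 2: 2 CPUs, 500ms period)
    # quota_for(0.2, 50000)  -> 10000    (example 3: 20% of 1 CPU)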