1 | |
2 | Cgroup unified hierarchy | |
3 | ||
4 | April, 2014 Tejun Heo <tj@kernel.org> | |
5 | ||
6 | This document describes the changes made by unified hierarchy and | |
7 | their rationales. It will eventually be merged into the main cgroup | |
8 | documentation. | |
9 | ||
10 | CONTENTS | |
11 | ||
12 | 1. Background | |
13 | 2. Basic Operation | |
14 | 2-1. Mounting | |
15 | 2-2. cgroup.subtree_control | |
16 | 2-3. cgroup.controllers | |
17 | 3. Structural Constraints | |
18 | 3-1. Top-down | |
19 | 3-2. No internal tasks | |
20 | 4. Delegation |
21 | 4-1. Model of delegation | |
22 | 4-2. Common ancestor rule | |
23 | 5. Other Changes | |
24 | 5-1. [Un]populated Notification | |
25 | 5-2. Other Core Changes | |
26 | 5-3. Controller File Conventions |
27 | 5-3-1. Format | |
28 | 5-3-2. Control Knobs | |
29 | 5-4. Per-Controller Changes | |
30 | 5-4-1. io |
31 | 5-4-2. cpuset |
32 | 5-4-3. memory | |
33 | 6. Planned Changes |
34 | 6-1. CAP for resource control | |
35 | |
36 | ||
37 | 1. Background | |
38 | ||
39 | cgroup allows an arbitrary number of hierarchies and each hierarchy | |
40 | can host any number of controllers. While this seems to provide a | |
41 | high level of flexibility, it isn't quite useful in practice. | |
42 | ||
43 | For example, as there is only one instance of each controller, utility | |
44 | type controllers such as freezer which can be useful in all | |
45 | hierarchies can only be used in one. The issue is exacerbated by the | |
46 | fact that controllers can't be moved around once hierarchies are | |
47 | populated. Another issue is that all controllers bound to a hierarchy | |
48 | are forced to have exactly the same view of the hierarchy. It isn't | |
49 | possible to vary the granularity depending on the specific controller. | |
50 | ||
51 | In practice, these issues heavily limit which controllers can be put | |
52 | on the same hierarchy and most configurations resort to putting each | |
53 | controller on its own hierarchy. Only closely related ones, such as | |
54 | the cpu and cpuacct controllers, make sense to put on the same | |
55 | hierarchy. This often means that userland ends up managing multiple | |
56 | similar hierarchies repeating the same steps on each hierarchy | |
57 | whenever a hierarchy management operation is necessary. | |
58 | ||
59 | Unfortunately, support for multiple hierarchies comes at a steep cost. | |
60 | Internal implementation in cgroup core proper is dazzlingly | |
61 | complicated but more importantly the support for multiple hierarchies | |
62 | restricts how cgroup is used in general and what controllers can do. | |
63 | ||
64 | There's no limit on how many hierarchies there may be, which means | |
65 | that a task's cgroup membership can't be described in finite length. | |
66 | The key may contain any varying number of entries and is unlimited in | |
67 | length, which makes it highly awkward to handle and leads to addition | |
68 | of controllers which exist only to identify membership, which in turn | |
69 | exacerbates the original problem. | |
70 | ||
71 | Also, as a controller can't have any expectation regarding what shape | |
72 | of hierarchies other controllers would be on, each controller has to | |
73 | assume that all other controllers are operating on completely | |
74 | orthogonal hierarchies. This makes it impossible, or at least very | |
75 | cumbersome, for controllers to cooperate with each other. | |
76 | ||
77 | In most use cases, putting controllers on hierarchies which are | |
78 | completely orthogonal to each other isn't necessary. What usually is | |
79 | called for is the ability to have differing levels of granularity | |
80 | depending on the specific controller. In other words, hierarchy may | |
81 | be collapsed from leaf towards root when viewed from specific | |
82 | controllers. For example, a given configuration might not care about | |
83 | how memory is distributed beyond a certain level while still wanting | |
84 | to control how CPU cycles are distributed. | |
85 | ||
86 | Unified hierarchy is the next version of cgroup interface. It aims to | |
87 | address the aforementioned issues by having more structure while | |
88 | retaining enough flexibility for most use cases. Various other | |
89 | general and controller-specific interface issues are also addressed in | |
90 | the process. | |
91 | ||
92 | ||
93 | 2. Basic Operation | |
94 | ||
95 | 2-1. Mounting | |
96 | ||
97 | Currently, unified hierarchy can be mounted with the following mount | |
98 | command. Note that this is still under development and scheduled to | |
99 | change soon. | |
100 | ||
101 | mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT | |
102 | ||
103 | All controllers which support the unified hierarchy and are not bound |
104 | to other hierarchies are automatically bound to unified hierarchy and | |
105 | show up at the root of it. Controllers which are enabled only in the | |
106 | root of unified hierarchy can be bound to other hierarchies. This | |
107 | allows mixing unified hierarchy with the traditional multiple | |
108 | hierarchies in a fully backward compatible way. | |
109 | ||
110 | A controller can be moved across hierarchies only after the controller |
111 | is no longer referenced in its current hierarchy. Because per-cgroup | |
112 | controller states are destroyed asynchronously and controllers may | |
113 | have lingering references, a controller may not show up immediately on | |
114 | the unified hierarchy after the final umount of the previous | |
115 | hierarchy. Similarly, a controller should be fully disabled to be | |
116 | moved out of the unified hierarchy and it may take some time for the | |
117 | disabled controller to become available for other hierarchies; | |
118 | furthermore, due to dependencies among controllers, other controllers | |
119 | may need to be disabled too. | |
120 | ||
121 | While useful for development and manual configurations, dynamically | |
122 | moving controllers between the unified and other hierarchies is | |
123 | strongly discouraged for production use. It is recommended to decide | |
124 | the hierarchies and controller associations before starting to use the | |
125 | controllers. | |
126 | |
127 | ||
128 | 2-2. cgroup.subtree_control | |
129 | ||
130 | All cgroups on unified hierarchy have a "cgroup.subtree_control" file | |
131 | which governs which controllers are enabled on the children of the | |
132 | cgroup. Let's assume a hierarchy like the following. | |
133 | ||
134 |   root - A - B - C | |
135 |                \ D | |
136 | ||
137 | root's "cgroup.subtree_control" file determines which controllers are | |
138 | enabled on A. A's on B. B's on C and D. This coincides with the | |
139 | fact that controllers on the immediate sub-level are used to | |
140 | distribute the resources of the parent. In fact, it's natural to | |
141 | assume that resource control knobs of a child belong to its parent. | |
142 | Enabling a controller in a "cgroup.subtree_control" file declares that | |
143 | distribution of the respective resources of the cgroup will be | |
144 | controlled. Note that this means that controller enable states are | |
145 | shared among siblings. | |
146 | ||
147 | When read, the file contains a space-separated list of currently | |
148 | enabled controllers. A write to the file should contain a | |
149 | space-separated list of controllers with '+' or '-' prefixed (without | |
150 | the quotes). Controllers prefixed with '+' are enabled and '-' | |
151 | disabled. If a controller is listed multiple times, the last entry | |
152 | wins. The specific operations are executed atomically - either all | |
153 | succeed or all fail. | |
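
The write semantics described above can be illustrated with a small userland model. This is only a sketch, not kernel code: the function name apply_subtree_control and the rule that every named controller must appear in the `available` set are assumptions made for the example.

```python
def apply_subtree_control(enabled, available, request):
    """Return the new enabled set after a write such as "+memory -io".

    enabled   -- controllers currently enabled in cgroup.subtree_control
    available -- controllers listed in the same cgroup's cgroup.controllers
    request   -- the string written to the file
    """
    pending = {}  # controller name -> True (enable) or False (disable)
    for token in request.split():
        if len(token) < 2 or token[0] not in "+-":
            raise ValueError(f"malformed entry: {token!r}")
        # A later entry for the same controller overrides an earlier one.
        pending[token[1:]] = token[0] == "+"

    # Validate every entry before changing anything (assumed rule: each
    # name must be available), so the write either fully applies or not at all.
    for name in pending:
        if name not in available:
            raise ValueError(f"controller not available: {name}")

    result = set(enabled)
    for name, enable in pending.items():
        if enable:
            result.add(name)
        else:
            result.discard(name)
    return result
```

Because validation happens before any change is applied, a request containing one bad entry leaves the enabled set untouched, which mirrors the all-or-nothing behavior described in the text.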
154 | ||
155 | ||
156 | 2-3. cgroup.controllers | |
157 | ||
158 | Read-only "cgroup.controllers" file contains a space-separated list of | |
159 | controllers which can be enabled in the cgroup's | |
160 | "cgroup.subtree_control" file. | |
161 | ||
162 | In the root cgroup, this lists controllers which are not bound to | |
163 | other hierarchies and the content changes as controllers are bound to | |
164 | and unbound from other hierarchies. | |
165 | ||
166 | In non-root cgroups, the content of this file equals that of the | |
167 | parent's "cgroup.subtree_control" file as only controllers enabled | |
168 | from the parent can be used in its children. | |
169 | ||
170 | ||
171 | 3. Structural Constraints | |
172 | ||
173 | 3-1. Top-down | |
174 | ||
175 | As it doesn't make sense to nest control of an uncontrolled resource, | |
176 | all non-root "cgroup.subtree_control" files can only contain | |
177 | controllers which are enabled in the parent's "cgroup.subtree_control" | |
178 | file. A controller can be enabled only if the parent has the | |
179 | controller enabled and a controller can't be disabled if one or more | |
180 | children have it enabled. | |
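
The relationship between "cgroup.controllers", "cgroup.subtree_control" and the top-down rule can be modelled in a few lines. The Cgroup class and ConstraintError below are hypothetical names for a toy model; the kernel's actual error codes and data structures are not represented.

```python
class ConstraintError(Exception):
    """Raised when the top-down constraint would be violated (toy model)."""


class Cgroup:
    """Toy model of a cgroup on the unified hierarchy; not kernel code."""

    def __init__(self, name, parent=None, root_controllers=()):
        self.name = name
        self.parent = parent
        self.children = []
        self.subtree_control = set()
        # Only used by the root: stands in for the controllers that are
        # not bound to other hierarchies.
        self._root_controllers = set(root_controllers)
        if parent is not None:
            parent.children.append(self)

    @property
    def controllers(self):
        """Model of cgroup.controllers: what may be enabled in this cgroup."""
        if self.parent is None:
            return set(self._root_controllers)
        return set(self.parent.subtree_control)

    def enable(self, name):
        if name not in self.controllers:
            raise ConstraintError(f"{name} is not available in {self.name}")
        self.subtree_control.add(name)

    def disable(self, name):
        if any(name in child.subtree_control for child in self.children):
            raise ConstraintError(f"{name} is still enabled in a child of {self.name}")
        self.subtree_control.discard(name)
```

In this model a controller has to be enabled from the root downwards and disabled from the leaves upwards, which is the ordering the constraint implies.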
181 | ||
182 | ||
183 | 3-2. No internal tasks | |
184 | ||
185 | One long-standing issue that cgroup faces is the competition between | |
186 | tasks belonging to the parent cgroup and its children cgroups. This | |
187 | is inherently nasty as two different types of entities compete and | |
188 | there is no agreed-upon obvious way to handle it. Different | |
189 | controllers are doing different things. | |
190 | ||
191 | The cpu controller considers tasks and cgroups as equivalents and maps | |
192 | nice levels to cgroup weights. This works for some cases but falls | |
193 | flat when children should be allocated specific ratios of CPU cycles | |
194 | and the number of internal tasks fluctuates - the ratios constantly | |
195 | change as the number of competing entities fluctuates. There also are | |
196 | other issues. The mapping from nice level to weight isn't obvious or | |
197 | universal, and there are various other knobs which simply aren't | |
198 | available for tasks. | |
199 | ||
200 | The io controller implicitly creates a hidden leaf node for each |
201 | cgroup to host the tasks. The hidden leaf has its own copies of all |
202 | the knobs with "leaf_" prefixed. While this allows equivalent control | |
203 | over internal tasks, it comes with serious drawbacks. It always adds an | |
204 | extra layer of nesting which may not be necessary, makes the interface | |
205 | messy and significantly complicates the implementation. | |
206 | ||
207 | The memory controller currently doesn't have a way to control what | |
208 | happens between internal tasks and child cgroups and the behavior is | |
209 | not clearly defined. There have been attempts to add ad-hoc behaviors | |
210 | and knobs to tailor the behavior to specific workloads. Continuing | |
211 | this direction will lead to problems which will be extremely difficult | |
212 | to resolve in the long term. | |
213 | ||
214 | Multiple controllers struggle with internal tasks and have come up with | |
215 | different ways to deal with them; unfortunately, all the approaches in | |
216 | use now are severely flawed and, furthermore, the widely different | |
217 | behaviors make cgroup as a whole highly inconsistent. | |
218 | ||
219 | It is clear that this is something which needs to be addressed from | |
220 | cgroup core proper in a uniform way so that controllers don't need to | |
221 | worry about it and cgroup as a whole shows a consistent and logical | |
222 | behavior. To achieve that, unified hierarchy enforces the following | |
223 | structural constraint: | |
224 | ||
225 | Except for the root, only cgroups which don't contain any task may | |
226 | have controllers enabled in their "cgroup.subtree_control" files. | |
227 | ||
228 | Combined with other properties, this guarantees that, when a | |
229 | controller is looking at the part of the hierarchy which has it | |
230 | enabled, tasks are always only on the leaves. This rules out | |
231 | situations where child cgroups compete against internal tasks of the | |
232 | parent. | |
233 | ||
234 | There are two things to note. Firstly, the root cgroup is exempt from | |
235 | the restriction. Root contains tasks and anonymous resource | |
236 | consumption which can't be associated with any other cgroup and | |
237 | requires special treatment from most controllers. How resource | |
238 | consumption in the root cgroup is governed is up to each controller. | |
239 | ||
240 | Secondly, the restriction doesn't take effect if there is no enabled | |
241 | controller in the cgroup's "cgroup.subtree_control" file. This is | |
242 | important as otherwise it wouldn't be possible to create children of a | |
243 | populated cgroup. To control resource distribution of a cgroup, the | |
244 | cgroup must create children and transfer all its tasks to the children | |
245 | before enabling controllers in its "cgroup.subtree_control" file. | |
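
As a rough illustration of the constraint and of the "create children, move tasks, then enable" sequence, consider the following sketch. The function names and the dictionary that maps cgroup paths to task lists are assumptions made for the example; real migration is done by writing pids to "cgroup.procs".

```python
def can_enable_controller(is_root, task_count):
    """True if a controller may be enabled in cgroup.subtree_control."""
    # The root is exempt; any other cgroup must not contain tasks.
    return is_root or task_count == 0


def prepare_for_control(tasks_by_cgroup, parent, leaf):
    """Move every task of `parent` into its child `leaf` (toy model).

    tasks_by_cgroup maps a cgroup path to the list of pids it contains.
    """
    moved = tasks_by_cgroup.get(parent, [])
    tasks_by_cgroup.setdefault(leaf, []).extend(moved)
    tasks_by_cgroup[parent] = []
    return tasks_by_cgroup
```

Once the parent has been emptied this way, the check passes and controllers can be enabled for its children.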
246 | ||
247 | ||
248 | 4. Delegation |
249 | |
250 | 4-1. Model of delegation |
251 | ||
252 | A cgroup can be delegated to a less privileged user by granting write | |
253 | access to the directory and its "cgroup.procs" file to the user. Note | |
254 | that the resource control knobs in a given directory concern the | |
255 | resources of the parent and thus must not be delegated along with the | |
256 | directory. | |
257 | ||
258 | Once delegated, the user can build a sub-hierarchy under the directory, | |
259 | organize processes as it sees fit and further distribute the resources | |
260 | it got from the parent. The limits and other settings of all resource | |
261 | controllers are hierarchical and regardless of what happens in the | |
262 | delegated sub-hierarchy, nothing can escape the resource restrictions | |
263 | imposed by the parent. | |
264 | ||
265 | Currently, cgroup doesn't impose any restrictions on the number of | |
266 | cgroups in or nesting depth of a delegated sub-hierarchy; however, | |
267 | this may in the future be limited explicitly. | |
268 | ||
269 | ||
270 | 4-2. Common ancestor rule | |
271 | ||
272 | On the unified hierarchy, to write to a "cgroup.procs" file, in | |
273 | addition to the usual write permission to the file and uid match, the | |
274 | writer must also have write access to the "cgroup.procs" file of the | |
275 | common ancestor of the source and destination cgroups. This prevents | |
276 | delegatees from smuggling processes across disjoint sub-hierarchies. | |
277 | ||
278 | Let's say cgroups C0 and C1 have been delegated to user U0 who created | |
279 | C00, C01 under C0 and C10 under C1 as follows. | |
280 | ||
281 |   ~~~~~~~~~~~~~ - C0 - C00 | |
282 |   ~ cgroup    ~      \ C01 | |
283 |   ~ hierarchy ~ | |
284 |   ~~~~~~~~~~~~~ - C1 - C10 | |
285 | ||
286 | C0 and C1 are separate entities in terms of resource distribution | |
287 | regardless of their relative positions in the hierarchy. The | |
288 | resources the processes under C0 are entitled to are controlled by | |
289 | C0's ancestors and may be completely different from C1. It's clear | |
290 | that the intention of delegating C0 to U0 is allowing U0 to organize | |
291 | the processes under C0 and further control the distribution of C0's | |
292 | resources. | |
293 | ||
294 | On traditional hierarchies, if a task has write access to "tasks" or | |
295 | "cgroup.procs" file of a cgroup and its uid agrees with the target, it | |
296 | can move the target to the cgroup. In the above example, U0 will not | |
297 | only be able to move processes in each sub-hierarchy but also across | |
298 | the two sub-hierarchies, effectively allowing it to violate the | |
299 | organizational and resource restrictions implied by the hierarchical | |
300 | structure above C0 and C1. | |
301 | ||
302 | On the unified hierarchy, let's say U0 wants to write the pid of a | |
303 | process which has a matching uid and is currently in C10 into | |
304 | "C00/cgroup.procs". U0 obviously has write access to the file and | |
305 | migration permission on the process; however, the common ancestor of | |
306 | the source cgroup C10 and the destination cgroup C00 is above the | |
307 | points of delegation and U0 would not have write access to its | |
308 | "cgroup.procs" and thus be denied with -EACCES. | |
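
The check in the example above can be sketched as follows. This is a simplified model: cgroups are represented as absolute paths, the part of the hierarchy above C0 and C1 is collapsed into "/", the uid match is assumed to have already succeeded, and the function names are hypothetical.

```python
import posixpath


def common_ancestor(src, dst):
    """Closest cgroup that is an ancestor of (or equal to) both paths."""
    return posixpath.commonpath([src, dst])


def may_migrate(writable_procs, src, dst):
    """Model of the common ancestor rule.

    writable_procs -- set of cgroup paths whose "cgroup.procs" file the
                      writer has write access to
    """
    return dst in writable_procs and common_ancestor(src, dst) in writable_procs
```

With C0 and C1 delegated to U0, the set of writable paths contains both subtrees but not their common ancestor, so a move from C10 to C00 is refused while a move from C00 to C01 is allowed.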
309 | ||
310 | ||
311 | 5. Other Changes | |
312 | ||
313 | 5-1. [Un]populated Notification | |
314 | |
315 | cgroup users often need a way to determine when a cgroup's | |
316 | subhierarchy becomes empty so that it can be cleaned up. cgroup | |
317 | currently provides release_agent for it; unfortunately, this mechanism | |
318 | is riddled with issues. | |
319 | ||
320 | - It delivers events by forking and execing a userland binary | |
321 | specified as the release_agent. This is a long deprecated method of | |
322 | notification delivery. It's extremely heavy, slow and cumbersome to | |
323 | integrate with larger infrastructure. | |
324 | ||
325 | - There is a single monitoring point at the root. There's no way to | |
326 | delegate management of a subtree. | |
327 | ||
328 | - The event isn't recursive. It triggers when a cgroup doesn't have | |
329 | any tasks or child cgroups. Events for internal nodes trigger only | |
330 | after all children are removed. This again makes it impossible to | |
331 | delegate management of a subtree. | |
332 | ||
333 | - Events are filtered from the kernel side. A "notify_on_release" | |
334 | file is used to subscribe to or suppress release events. This is | |
335 | unnecessarily complicated and probably done this way because event | |
336 | delivery itself was expensive. | |
337 | ||
338 | Unified hierarchy implements "populated" field in "cgroup.events" |
339 | interface file which can be used to monitor whether the cgroup's | |
340 | subhierarchy has tasks in it or not. Its value is 0 if there is no | |
341 | task in the cgroup and its descendants; otherwise, 1. poll and | |
342 | [id]notify events are triggered when the value changes. | |
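
The recursive definition of the "populated" value can be written down directly. The sketch below is a model only; the function name and the two dictionaries describing the tree and task placement are assumptions made for the example.

```python
def populated(tree, tasks, cgroup):
    """Return 1 if `cgroup` or any of its descendants has tasks, else 0.

    tree  -- maps a cgroup path to the list of its child cgroup paths
    tasks -- maps a cgroup path to the list of pids it contains
    """
    if tasks.get(cgroup):
        return 1
    return int(any(populated(tree, tasks, child) for child in tree.get(cgroup, [])))
```

Unlike the release_agent event, this value reflects the whole subhierarchy, so an internal cgroup reports 1 as long as any descendant still holds a task.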
343 | |
344 | This is significantly lighter and simpler and trivially allows | |
345 | delegating management of a subhierarchy - subhierarchy monitoring can | |
346 | block further propagation simply by putting itself or another process | |
347 | in the subhierarchy and monitor events that it's interested in from | |
348 | there without interfering with monitoring higher in the tree. | |
349 | ||
350 | In unified hierarchy, the release_agent mechanism is no longer | |
351 | supported and the interface files "release_agent" and | |
352 | "notify_on_release" do not exist. | |
353 | ||
354 | ||
355 | 5-2. Other Core Changes |
356 | |
357 | - None of the mount options is allowed. | |
358 | ||
359 | - remount is disallowed. | |
360 | ||
361 | - rename(2) is disallowed. | |
362 | ||
363 | - The "tasks" file is removed. Everything should be at process | |
364 | granularity. Use the "cgroup.procs" file instead. | |
365 | ||
366 | - The "cgroup.procs" file is not sorted. pids will be unique unless | |
367 | they got recycled in-between reads. | |
368 | ||
369 | - The "cgroup.clone_children" file is removed. | |
370 | ||
371 | - /proc/PID/cgroup keeps reporting the cgroup that a zombie belonged |
372 | to before exiting. If the cgroup is removed before the zombie is | |
373 | reaped, " (deleted)" is appended to the path. | |
374 | ||
375 | |
376 | 5-3. Controller File Conventions |
377 | |
378 | 5-3-1. Format |
379 | ||
380 | In general, all controller files should be in one of the following | |
381 | formats whenever possible. | |
382 | ||
383 | - Values only files | |
384 | ||
385 | VAL0 VAL1...\n | |
386 | ||
387 | - Flat keyed files | |
388 | ||
389 | KEY0 VAL0\n | |
390 | KEY1 VAL1\n | |
391 | ... | |
392 | ||
393 | - Nested keyed files | |
394 | ||
395 | KEY0 SUB_KEY0=VAL00 SUB_KEY1=VAL01... | |
396 | KEY1 SUB_KEY0=VAL10 SUB_KEY1=VAL11... | |
397 | ... | |
398 | ||
399 | For a writeable file, the format for writing should generally match | |
400 | reading; however, controllers may allow omitting later fields or | |
401 | implement restricted shortcuts for most common use cases. | |
402 | ||
403 | For both flat and nested keyed files, only the values for a single key | |
404 | can be written at a time. For nested keyed files, the sub key pairs | |
405 | may be specified in any order and not all pairs have to be specified. | |
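
A reader for the two keyed formats might look like the sketch below. The function names are hypothetical, values are kept as strings, and no attempt is made to handle malformed input.

```python
def parse_flat_keyed(text):
    """Parse "KEY VAL" lines into a dict of strings."""
    result = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split(None, 1)
            result[key] = value
    return result


def parse_nested_keyed(text):
    """Parse "KEY SUB_KEY=VAL ..." lines into a dict of dicts of strings."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Sub key pairs may appear in any order and some may be missing.
        result[fields[0]] = dict(field.split("=", 1) for field in fields[1:])
    return result
```

Since sub keys are looked up by name rather than position, a reader written this way keeps working when fields are reordered or new ones are appended.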
406 | ||
407 | ||
408 | 5-3-2. Control Knobs | |
409 | ||
410 | - Settings for a single feature should generally be implemented in a | |
411 | single file. | |
412 | ||
413 | - In general, the root cgroup should be exempt from resource control | |
414 | and thus shouldn't have resource control knobs. | |
415 | ||
416 | - If a controller implements ratio based resource distribution, the | |
417 | control knob should be named "weight" and have the range [1, 10000] | |
418 | and 100 should be the default value. The values are chosen to allow | |
419 | enough and symmetric bias in both directions while keeping it | |
420 | intuitive (the default is 100%). | |
421 | ||
422 | - If a controller implements an absolute resource guarantee and/or | |
423 | limit, the control knobs should be named "min" and "max" | |
424 | respectively. If a controller implements best effort resource | |
425 | guarantee and/or limit, the control knobs should be named "low" and | |
426 | "high" respectively. | |
427 | ||
428 | In the above four control files, the special token "max" should be | |
429 | used to represent upward infinity for both reading and writing. | |
430 | ||
431 | - If a setting has a configurable default value and specific overrides, | |
432 | the default settings should be keyed with "default" and appear as | |
433 | the first entry in the file. Specific entries can use "default" as | |
434 | their value to indicate inheritance of the default value. | |
435 | ||
436 | - For events which are not very high frequency, an interface file |
437 | "events" should be created which lists event key value pairs. | |
438 | Whenever a notifiable event happens, a file modified event should be | |
439 | generated on the file. | |
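
To make the "weight" convention concrete, the sketch below computes the fraction each sibling would receive under a plain proportional split. This is an idealized model with a hypothetical function name; how a given controller actually turns weights into resource shares, and which siblings count as competing, is up to that controller.

```python
WEIGHT_MIN, WEIGHT_MAX, WEIGHT_DEFAULT = 1, 10000, 100


def shares(weights):
    """Map each sibling to its fraction of the parent's resource.

    weights -- dict of sibling name -> weight in [1, 10000]
    """
    for name, weight in weights.items():
        if not WEIGHT_MIN <= weight <= WEIGHT_MAX:
            raise ValueError(f"weight out of range for {name}: {weight}")
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}
```

With the default of 100 in the middle of the range, a sibling can be biased by up to a factor of 100 in either direction relative to a default-weight peer.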
440 | ||
441 | |
442 | 5-4. Per-Controller Changes | |
443 | ||
444 | 5-4-1. io |
445 | |
446 | - blkio is renamed to io. The interface is overhauled anyway. The |
447 | new name is more in line with the other two major controllers, cpu | |
448 | and memory, and better suited given that it may be used for cgroup | |
449 | writeback without involving block layer. | |
450 | ||
451 | - Everything including stat is always hierarchical making separate | |
452 | recursive stat files pointless and, as no internal node can have | |
453 | tasks, leaf weights are meaningless. The operation model is | |
454 | simplified and the interface is overhauled accordingly. | |
455 | ||
456 | io.stat | |
457 | ||
458 | The stat file. The reported stats are from the point where | |
459 | bio's are issued to request_queue. The stats are counted | |
460 | independent of which policies are enabled. Each line in the | |
461 | file follows the following format. More fields may later be | |
462 | added at the end. | |
463 | ||
464 | $MAJ:$MIN rbytes=$RBYTES wbytes=$WBYTES rios=$RIOS wios=$WIOS | |
465 | ||
466 | io.weight | |
467 | ||
468 | The weight setting, currently only available and effective if | |
469 | cfq-iosched is in use for the target device. The weight is | |
470 | between 1 and 10000 and defaults to 100. The first line |
471 | always contains the default weight in the following format to |
472 | use when per-device setting is missing. | |
473 | ||
474 | default $WEIGHT | |
475 | ||
476 | Subsequent lines list per-device weights of the following | |
477 | format. | |
478 | ||
479 | $MAJ:$MIN $WEIGHT | |
480 | ||
481 | Writing "$WEIGHT" or "default $WEIGHT" changes the default | |
482 | setting. Writing "$MAJ:$MIN $WEIGHT" sets per-device weight | |
483 | while "$MAJ:$MIN default" clears it. | |
484 | ||
485 | This file is available only on non-root cgroups. | |
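
The four write forms accepted by io.weight can be modelled with a dictionary. This is a sketch with hypothetical function names; device numbers are not validated, and rejecting "default default" is an assumption, since the text does not say how such a write is treated.

```python
def write_io_weight(state, text):
    """Apply one write to a dict modelling the content of io.weight.

    state maps "default" and "MAJ:MIN" device numbers to integer weights.
    """
    fields = text.split()
    if len(fields) == 1:
        key, value = "default", fields[0]  # bare "$WEIGHT"
    elif len(fields) == 2:
        key, value = fields
    else:
        raise ValueError(f"malformed write: {text!r}")

    if value == "default":
        if key == "default":
            # Assumption: the default entry itself cannot be cleared.
            raise ValueError("the default weight cannot be cleared")
        state.pop(key, None)  # the device falls back to the default weight
        return state

    weight = int(value)
    if not 1 <= weight <= 10000:
        raise ValueError(f"weight out of range: {weight}")
    state[key] = weight
    return state


def effective_weight(state, device):
    """Weight that applies to `device`: its override or the default."""
    return state.get(device, state["default"])
```

For example, starting from {"default": 100}, writing "8:0 500" adds an override for device 8:0 and writing "8:0 default" removes it again.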
486 | ||
487 | io.max | |
488 | ||
489 | The maximum bandwidth and/or iops setting, only available if | |
490 | blk-throttle is enabled. The file is of the following format. | |
491 | ||
492 | $MAJ:$MIN rbps=$RBPS wbps=$WBPS riops=$RIOPS wiops=$WIOPS | |
493 | ||
494 | ${R|W}BPS are read/write bytes per second and ${R|W}IOPS are | |
495 | read/write IOs per second. "max" indicates no limit. Writing | |
496 | to the file follows the same format but the individual | |
497 | settings may be omitted or specified in any order. |
498 | |
499 | This file is available only on non-root cgroups. | |
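
A write to io.max updates only the settings it names. The sketch below models that behavior and the "max" token; the function names are hypothetical, math.inf stands in for "no limit", and the formatted output is only one plausible rendering of the file.

```python
import math

IO_MAX_KEYS = ("rbps", "wbps", "riops", "wiops")


def write_io_max(limits, text):
    """Apply a write such as "8:0 wbps=1048576 riops=max" to `limits`.

    limits maps "MAJ:MIN" to a dict holding the four settings.
    """
    device, *pairs = text.split()
    # Start from the current settings; unspecified ones keep their value.
    entry = dict(limits.get(device, dict.fromkeys(IO_MAX_KEYS, math.inf)))
    for pair in pairs:
        key, _, value = pair.partition("=")
        if key not in IO_MAX_KEYS or not value:
            raise ValueError(f"malformed setting: {pair!r}")
        entry[key] = math.inf if value == "max" else int(value)
    limits[device] = entry
    return limits


def format_io_max(limits):
    """Render `limits` in the "$MAJ:$MIN rbps=... wiops=..." format."""
    lines = []
    for device in sorted(limits):
        entry = limits[device]
        fields = " ".join(
            f"{key}={'max' if entry[key] == math.inf else entry[key]}"
            for key in IO_MAX_KEYS
        )
        lines.append(f"{device} {fields}")
    return "\n".join(lines)
```

The entry is only stored after every pair has been parsed, so a malformed write leaves the existing limits for the device unchanged.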
500 | |
501 | ||
502 | 5-4-2. cpuset |
503 | |
504 | - Tasks are kept in empty cpusets after hotplug and take on the masks | |
505 | of the nearest non-empty ancestor, instead of being moved to it. | |
506 | ||
507 | - A task can be moved into an empty cpuset, and again it takes on the | |
508 | masks of the nearest non-empty ancestor. | |
509 | ||
510 | ||
511 | 5-4-3. memory |
512 | |
513 | - use_hierarchy is on by default and the cgroup file for the flag is | |
514 | not created. | |
515 | ||
516 | - The original lower boundary, the soft limit, is defined as a limit |
517 | that is per default unset. As a result, the set of cgroups that | |
518 | global reclaim prefers is opt-in, rather than opt-out. The costs | |
519 | for optimizing these mostly negative lookups are so high that the | |
520 | implementation, despite its enormous size, does not even provide the | |
521 | basic desirable behavior. First off, the soft limit has no | |
522 | hierarchical meaning. All configured groups are organized in a | |
523 | global rbtree and treated like equal peers, regardless where they | |
524 | are located in the hierarchy. This makes subtree delegation | |
525 | impossible. Second, the soft limit reclaim pass is so aggressive | |
526 | that it not just introduces high allocation latencies into the | |
527 | system, but also impacts system performance due to overreclaim, to | |
528 | the point where the feature becomes self-defeating. | |
529 | ||
530 | The memory.low boundary on the other hand is a top-down allocated | |
531 | reserve. A cgroup enjoys reclaim protection when it and all its | |
532 | ancestors are below their low boundaries, which makes delegation of | |
533 | subtrees possible. Secondly, new cgroups have no reserve per | |
534 | default and in the common case most cgroups are eligible for the | |
535 | preferred reclaim pass. This allows the new low boundary to be | |
536 | efficiently implemented with just a minor addition to the generic | |
537 | reclaim code, without the need for out-of-band data structures and | |
538 | reclaim passes. Because the generic reclaim code considers all | |
539 | cgroups except for the ones running low in the preferred first | |
540 | reclaim pass, overreclaim of individual groups is eliminated as | |
541 | well, resulting in much better overall workload performance. | |
542 | ||
543 | - The original high boundary, the hard limit, is defined as a strict | |
544 | limit that can not budge, even if the OOM killer has to be called. | |
545 | But this generally goes against the goal of making the most out of | |
546 | the available memory. The memory consumption of workloads varies | |
547 | during runtime, and that requires users to overcommit. But doing | |
548 | that with a strict upper limit requires either a fairly accurate | |
549 | prediction of the working set size or adding slack to the limit. | |
550 | Since working set size estimation is hard and error prone, and | |
551 | getting it wrong results in OOM kills, most users tend to err on the | |
552 | side of a looser limit and end up wasting precious resources. | |
553 | ||
554 | The memory.high boundary on the other hand can be set much more | |
555 | conservatively. When hit, it throttles allocations by forcing them | |
556 | into direct reclaim to work off the excess, but it never invokes the | |
557 | OOM killer. As a result, a high boundary that is chosen too | |
558 | aggressively will not terminate the processes, but instead it will | |
559 | lead to gradual performance degradation. The user can monitor this | |
560 | and make corrections until the minimal memory footprint that still | |
561 | gives acceptable performance is found. | |
562 | ||
563 | In extreme cases, with many concurrent allocations and a complete | |
564 | breakdown of reclaim progress within the group, the high boundary | |
565 | can be exceeded. But even then it's mostly better to satisfy the | |
566 | allocation from the slack available in other groups or the rest of | |
567 | the system than killing the group. Otherwise, memory.max is there | |
568 | to limit this type of spillover and ultimately contain buggy or even | |
569 | malicious applications. | |
570 | ||
571 | - The original control file names are unwieldy and inconsistent in | |
572 | many different ways. For example, the upper boundary hit count is | |
573 | exported in the memory.failcnt file, but an OOM event count has to | |
574 | be manually counted by listening to memory.oom_control events, and | |
575 | lower boundary / soft limit events have to be counted by first | |
576 | setting a threshold for that value and then counting those events. | |
577 | Also, usage and limit files encode their units in the filename. | |
578 | That makes the filenames very long, even though this is not | |
579 | information that a user needs to be reminded of every time they type | |
580 | out those names. | |
581 | ||
582 | To address these naming issues, as well as to signal clearly that | |
583 | the new interface carries a new configuration model, the naming | |
584 | conventions in it necessarily differ from the old interface. | |
585 | ||
586 | - The original limit files indicate the state of an unset limit with a | |
587 | Very High Number, and a configured limit can be unset by echoing -1 | |
588 | into those files. But that very high number is implementation and | |
589 | architecture dependent and not very descriptive. And while -1 can | |
590 | be understood as an underflow into the highest possible value, -2 or | |
591 | -10M etc. do not work, so it's not consistent. | |
592 | ||
593 | memory.low, memory.high, and memory.max will use the string "max" to |
594 | indicate and set the highest possible value. | |
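
Two of the points above lend themselves to a short illustration: parsing a boundary value that may be "max", and the hierarchical condition for memory.low protection. Both functions are hypothetical helpers for a toy model; in particular, the strict "below" comparison follows the wording of this document rather than any knowledge of the implementation.

```python
import math


def parse_boundary(text):
    """Convert the content of memory.low/high/max into a number of bytes."""
    text = text.strip()
    return math.inf if text == "max" else int(text)


def reclaim_protected(chain):
    """True if a cgroup enjoys memory.low reclaim protection.

    chain -- (usage, low) pairs for the cgroup and each of its ancestors
    """
    return all(usage < low for usage, low in chain)
```

A new cgroup with no reserve has a low boundary of 0, so under this model it is never protected, which matches the opt-in behavior described for memory.low.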
595 | |
596 | 6. Planned Changes |
597 | |
598 | 6-1. CAP for resource control |
599 | |
600 | Unified hierarchy will require one of the capabilities(7), which is | |
601 | yet to be decided, for all resource control related knobs. Process | |
602 | organization operations - creation of sub-cgroups and migration of | |
603 | processes in sub-hierarchies - may be delegated by changing the | |
604 | ownership and/or permissions on the cgroup directory and | |
605 | "cgroup.procs" interface file; however, all operations which affect | |
606 | resource control - writes to a "cgroup.subtree_control" file or any | |
607 | controller-specific knobs - will require an explicit CAP privilege. | |
608 | ||
609 | This, in part, is to prevent the cgroup interface from being | |
610 | inadvertently promoted to programmable API used by non-privileged | |
611 | binaries. cgroup exposes various aspects of the system in ways which | |
612 | aren't properly abstracted for direct consumption by regular programs. | |
613 | This is an administration interface much closer to sysctl knobs than | |
614 | system calls. Even the basic access model, being filesystem path | |
615 | based, isn't suitable for direct consumption. There's no way to | |
616 | access "my cgroup" in a race-free way or make multiple operations | |
617 | atomic against migration to another cgroup. | |
618 | ||
619 | Another aspect is that, for better or for worse, the cgroup interface | |
620 | goes through far less scrutiny than regular interfaces for | |
621 | unprivileged userland. The upside is that cgroup is able to expose | |
622 | useful features which may not be suitable for general consumption in a | |
623 | reasonable time frame. It provides a relatively short path between | |
624 | internal details and userland-visible interface. Of course, this | |
625 | shortcut comes with high risk. We go through what we go through for | |
626 | general kernel APIs for good reasons. It may end up leaking internal | |
627 | details in a way which can exert significant pain by locking the | |
628 | kernel into a contract that can't be maintained in a reasonable | |
629 | manner. | |
630 | ||
631 | Also, due to the specific nature, cgroup and its controllers don't | |
632 | tend to attract attention from a wide scope of developers. cgroup's | |
633 | short history is already fraught with severely mis-designed | |
634 | interfaces, unnecessary commitments to and exposing of internal | |
635 | details, and broken and dangerous implementations of various features. | |
636 | ||
637 | Keeping cgroup as an administration interface is both advantageous for | |
638 | its role and imperative given its nature. Some of the cgroup features | |
639 | may make sense for unprivileged access. If deemed justified, those | |
640 | must be further abstracted and implemented as a different interface, | |
641 | be it a system call or process-private filesystem, and survive through | |
642 | the scrutiny that any interface for general consumption is required to | |
643 | go through. | |
644 | ||
645 | Requiring CAP is not a complete solution but should serve as a | |
646 | significant deterrent against spraying cgroup usages in non-privileged | |
647 | programs. |