=====================
Scheduler Nice Design
=====================

This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler (otherwise we'd have done it long ago), because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!)::

                   A
             \     | [timeslice length]
              \    |
               \   |
                \  |
                 \ |
                  \|___100msecs
                   |^ . _
                   |      ^ . _
                   |            ^ . _
  -*----------------------------------*-----> [nice level]
 -20               |                 +19
                   |
                   |

So that if someone wanted to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
we felt to be a bit excessive. Excessive _not_ because it's too small of
a CPU utilization, but because it causes too frequent (once per
millisec) rescheduling (and would thus trash the cache, etc. Remember,
this was long ago when hardware was weaker and caches were smaller, and
people were running number-crunching apps at nice +19).

So for HZ=1000 we changed nice +19 to 5 msecs, because that felt like
the right minimal granularity - and this translates to 5% CPU
utilization. But the fundamental HZ-sensitive property for nice +19
still remained, and we never got a single complaint about nice +19
being too _weak_ in terms of CPU utilization; we only got complaints
about it (still) being too _strong_ :-)

To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies, and their nasty design-level
coupling to timeslices and granularity, it was not really viable.
The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative"::

   int nice(int inc);

   asmlinkage long sys_nice(int increment);

(the first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.
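
The relative nature of the API can be seen from userspace with a short
sketch (the printed values depend on the nice level the program is
started from)::

   #define _DEFAULT_SOURCE
   #include <assert.h>
   #include <errno.h>
   #include <stdio.h>
   #include <unistd.h>

   int main(void)
   {
       /* nice(0) is the idiomatic way to read the current level:
        * it adds nothing and returns the (unchanged) nice value */
       errno = 0;
       int before = nice(0);
       assert(!(before == -1 && errno != 0));

       /* raise our nice level by 5, relative to wherever we started */
       int after = nice(5);
       printf("nice level went from %d to %d\n", before, after);

       /* the increment is relative, and the result is clamped at +19 */
       assert(after == (before + 5 > 19 ? 19 : before + 5));
       return 0;
   }

(Only a positive increment is shown because lowering one's nice level
requires privilege - CAP_SYS_NICE - while raising it is always allowed.)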

With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
proof, and a buggy SCHED_FIFO app can also lock up the system for good.
The new scheduler in v2.6.23 addresses all three types of complaints:

To address the first complaint (of nice levels being not "punchy"
enough), the scheduler was decoupled from 'time slice' and HZ concepts
(and granularity was made a separate concept from nice levels), and thus
it was possible to implement better and more consistent nice +19
support: with the new scheduler, nice +19 tasks get a HZ-independent
1.5%, instead of the variable 3%-5%-9% range they got in the old
scheduler.

To address the second complaint (of nice levels not being consistent),
the new scheduler makes nice(1) have the same CPU utilization effect on
tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task. (one will get 55% of the CPU, the other 45%.) That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from, the 'relative
result' will always be the same.

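The effect of the multiplicative rule can be sketched with a toy model:
assume each nice step scales a task's load weight by roughly 1.25 (the
kernel uses a precomputed integer table, prio_to_weight in v2.6.23, with
slightly rounded values), and give each runnable task CPU time in
proportion to its weight::

   #include <assert.h>
   #include <stdio.h>

   /* Toy model: nice 0 is pegged at a weight of 1024, and each nice
    * step scales the weight by ~1.25 - lighter for positive nice,
    * heavier for negative nice. */
   static double weight(int nice)
   {
       double w = 1024.0;
       for (int i = 0; i < nice; i++)
           w /= 1.25;
       for (int i = 0; i > nice; i--)
           w *= 1.25;
       return w;
   }

   /* CPU share of task A when competing against task B */
   static double cpu_share(int nice_a, int nice_b)
   {
       double wa = weight(nice_a), wb = weight(nice_b);
       return wa / (wa + wb);
   }

   int main(void)
   {
       double s1 = cpu_share(10, 11); /* nice +10 vs nice +11 */
       double s2 = cpu_share(-5, -4); /* nice  -5 vs nice  -4 */

       printf("+10 vs +11: %.1f%% / %.1f%%\n", 100 * s1, 100 * (1 - s1));
       printf(" -5 vs  -4: %.1f%% / %.1f%%\n", 100 * s2, 100 * (1 - s2));

       /* the split depends only on the nice *difference*, not the
        * absolute levels: both pairs come out at roughly 55%/45% */
       assert(s1 - s2 < 1e-9 && s2 - s1 < 1e-9);
       assert(s1 > 0.54 && s1 < 0.57);

       /* nice +19 against a nice 0 task gets roughly 1.5% */
       double s19 = cpu_share(19, 0);
       assert(s19 > 0.013 && s19 < 0.016);
       return 0;
   }

With this rule, a one-step nice difference always yields the same
~55%/45% split, and nice +19 versus nice 0 comes out around 1.5%,
matching the numbers quoted above.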
The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.