=====================
Scheduler Nice Design
=====================

This document explains the thinking about the revamped and streamlined
nice-levels implementation in the new Linux scheduler.

Nice levels were always pretty weak under Linux and people continuously
pestered us to make nice +19 tasks use up much less CPU time.

Unfortunately that was not that easy to implement under the old
scheduler (otherwise we'd have done it long ago), because nice level
support was historically coupled to timeslice length, and timeslice
units were driven by the HZ tick, so the smallest timeslice was 1/HZ.

In the O(1) scheduler (in 2003) we changed negative nice levels to be
much stronger than they were before in 2.4 (and people were happy about
that change), and we also intentionally calibrated the linear timeslice
rule so that nice +19 level would be _exactly_ 1 jiffy. To better
understand it, the timeslice graph went like this (cheesy ASCII art
alert!)::

                   A
             \     | [timeslice length]
              \    |
               \   |
                \  |
                 \ |
                  \|___100msecs
                   |^ . _
                   |      ^ . _
                   |            ^ . _
  -*----------------------------------*-----> [nice level]
 -20               |                 +19
                   |
                   |

So that if someone wanted to really renice tasks, +19 would give a much
bigger hit than the normal linear rule would do. (The solution of
changing the ABI to extend priorities was discarded early on.)

This approach worked to some degree for some time, but later on with
HZ=1000 it caused 1 jiffy to be 1 msec, which meant 0.1% CPU usage which
we felt to be a bit excessive. Excessive _not_ because it's too small of
a CPU utilization, but because it causes too frequent (once per
millisec) rescheduling (and would thus trash the cache, etc. Remember,
this was long ago when hardware was weaker and caches were smaller, and
people were running number-crunching apps at nice +19).

So for HZ=1000 we changed nice +19 to 5 msecs, because that felt like
the right minimal granularity - and this translates to 5% CPU
utilization. But the fundamental HZ-sensitive property for nice +19
still remained, and we never got a single complaint about nice +19
being too _weak_ in terms of CPU utilization; we only got complaints
about it (still) being too _strong_ :-)

To sum it up: we always wanted to make nice levels more consistent, but
within the constraints of HZ and jiffies, and their nasty design-level
coupling to timeslices and granularity, it was not really viable.
The second (less frequent but still periodically occurring) complaint
about Linux's nice level support was its asymmetry around the origin
(which you can see demonstrated in the picture above), or more
accurately: the fact that nice level behavior depended on the _absolute_
nice level as well, while the nice API itself is fundamentally
"relative"::

   int nice(int inc);

   asmlinkage long sys_nice(int increment);

(the first one is the glibc API, the second one is the syscall API.)
Note that the 'inc' is relative to the current nice level. Tools like
bash's "nice" command mirror this relative API.
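
The relative nature of the API can be seen from userspace with a short
sketch (the printed values depend on the nice level the program is
started from)::

   #define _DEFAULT_SOURCE
   #include <assert.h>
   #include <errno.h>
   #include <stdio.h>
   #include <unistd.h>

   int main(void)
   {
       /* nice(0) is the idiomatic way to read the current level:
        * it adds nothing and returns the (unchanged) nice value */
       errno = 0;
       int before = nice(0);
       assert(!(before == -1 && errno != 0));

       /* raise our nice level by 5, relative to wherever we started */
       int after = nice(5);
       printf("nice level went from %d to %d\n", before, after);

       /* the increment is relative, and the result is clamped at +19 */
       assert(after == (before + 5 > 19 ? 19 : before + 5));
       return 0;
   }

(Only a positive increment is shown because lowering one's nice level
requires privilege - CAP_SYS_NICE - while raising it is always allowed.)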

With the old scheduler, if you for example started a niced task with +1
and another task with +2, the CPU split between the two tasks would
depend on the nice level of the parent shell - if it was at nice -10 the
CPU split was different than if it was at +5 or +10.

A third complaint against Linux's nice level support was that negative
nice levels were not 'punchy enough', so lots of people had to resort to
running audio (and other multimedia) apps under RT priorities such as
SCHED_FIFO. But this caused other problems: SCHED_FIFO is not starvation
proof, and a buggy SCHED_FIFO app can also lock up the system for good.
The new scheduler in v2.6.23 addresses all three types of complaints:

To address the first complaint (of nice levels being not "punchy"
enough), the scheduler was decoupled from 'time slice' and HZ concepts
(and granularity was made a separate concept from nice levels), and thus
it was possible to implement better and more consistent nice +19
support: with the new scheduler, nice +19 tasks get a HZ-independent
1.5%, instead of the variable 3%-5%-9% range they got in the old
scheduler.

To address the second complaint (of nice levels not being consistent),
the new scheduler makes nice(1) have the same CPU utilization effect on
tasks, regardless of their absolute nice levels. So on the new
scheduler, running a nice +10 and a nice +11 task has the same CPU
utilization "split" between them as running a nice -5 and a nice -4
task. (one will get 55% of the CPU, the other 45%.) That is why nice
levels were changed to be "multiplicative" (or exponential) - that way
it does not matter which nice level you start out from, the 'relative
result' will always be the same.

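The effect of the multiplicative rule can be sketched with a toy model:
assume each nice step scales a task's load weight by roughly 1.25 (the
kernel uses a precomputed integer table, prio_to_weight in v2.6.23, with
slightly rounded values), and give each runnable task CPU time in
proportion to its weight::

   #include <assert.h>
   #include <stdio.h>

   /* Toy model: nice 0 is pegged at a weight of 1024, and each nice
    * step scales the weight by ~1.25 - lighter for positive nice,
    * heavier for negative nice. */
   static double weight(int nice)
   {
       double w = 1024.0;
       for (int i = 0; i < nice; i++)
           w /= 1.25;
       for (int i = 0; i > nice; i--)
           w *= 1.25;
       return w;
   }

   /* CPU share of task A when competing against task B */
   static double cpu_share(int nice_a, int nice_b)
   {
       double wa = weight(nice_a), wb = weight(nice_b);
       return wa / (wa + wb);
   }

   int main(void)
   {
       double s1 = cpu_share(10, 11); /* nice +10 vs nice +11 */
       double s2 = cpu_share(-5, -4); /* nice  -5 vs nice  -4 */

       printf("+10 vs +11: %.1f%% / %.1f%%\n", 100 * s1, 100 * (1 - s1));
       printf(" -5 vs  -4: %.1f%% / %.1f%%\n", 100 * s2, 100 * (1 - s2));

       /* the split depends only on the nice *difference*, not the
        * absolute levels: both pairs come out at roughly 55%/45% */
       assert(s1 - s2 < 1e-9 && s2 - s1 < 1e-9);
       assert(s1 > 0.54 && s1 < 0.57);

       /* nice +19 against a nice 0 task gets roughly 1.5% */
       double s19 = cpu_share(19, 0);
       assert(s19 > 0.013 && s19 < 0.016);
       return 0;
   }

With this rule, a one-step nice difference always yields the same
~55%/45% split, and nice +19 versus nice 0 comes out around 1.5%,
matching the numbers quoted above.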
The third complaint (of negative nice levels not being "punchy" enough
and forcing audio apps to run under the more dangerous SCHED_FIFO
scheduling policy) is addressed by the new scheduler almost
automatically: stronger negative nice levels are an automatic
side-effect of the recalibrated dynamic range of nice levels.