[linux-2.6-block.git] / Documentation / scheduler / sched-rt-group.txt

				Real-Time group scheduling
				--------------------------

CONTENTS
========

1. Overview
  1.1 The problem
  1.2 The solution
2. The interface
  2.1 System-wide settings
  2.2 Default behaviour
  2.3 Basis for grouping tasks
3. Future plans


1. Overview
===========


1.1 The problem
---------------

Realtime scheduling is all about determinism, a group has to be able to rely on
the amount of bandwidth (eg. CPU time) being constant. In order to schedule
multiple groups of realtime tasks, each group must be assigned a fixed portion
of the CPU time available.  Without a minimum guarantee a realtime group can
obviously fall short. A fuzzy upper limit is of no use since it cannot be
relied upon. Which leaves us with just the single fixed portion.

1.2 The solution
----------------

CPU time is divided by means of specifying how much time can be spent running
in a given period. We allocate this "run time" for each realtime group which
the other realtime groups will not be permitted to use.

Any time not allocated to a realtime group will be used to run normal priority
tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
SCHED_OTHER.

Let's consider an example: a frame fixed realtime renderer must deliver 25
frames a second, which yields a period of 0.04s per frame. Now say it will also
have to play some music and respond to input, leaving it with around 80% CPU
time dedicated for the graphics. We can then give this group a run time of 0.8
* 0.04s = 0.032s.

This way the graphics group will have a 0.04s period with a 0.032s run time
limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s =
0.00015s. So this group can be scheduled with a period of 0.005s and a run time
of 0.00015s.

The remaining CPU time will be used for user input and other tasks. Because
realtime tasks have explicitly allocated the CPU time they need to perform
their tasks, buffer underruns in the graphics or audio can be eliminated.

NOTE: the above example is not fully implemented as of yet (2.6.25). We still
lack an EDF scheduler to make non-uniform periods usable.


2. The Interface
================


2.1 System wide settings
------------------------

The system wide settings are configured under the /proc virtual file system:

/proc/sys/kernel/sched_rt_period_us:
  The scheduling period that is equivalent to 100% CPU bandwidth

/proc/sys/kernel/sched_rt_runtime_us:
  A global limit on how much time realtime scheduling may use.  Even without
  CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
  processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
  available to all realtime groups.

  * Time is specified in us because the interface is s32. This gives an
    operating range from 1us to about 35 minutes.
  * sched_rt_period_us takes values from 1 to INT_MAX.
  * sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
  * A run time of -1 specifies runtime == period, ie. no limit.


2.2 Default behaviour
---------------------

The default values for sched_rt_period_us (1000000 or 1s) and
sched_rt_runtime_us (950000 or 0.95s).  This gives 0.05s to be used by
SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away
realtime tasks will not lock up the machine but leave a little time to recover
it.  By setting runtime to -1 you'd get the old behaviour back.

By default all bandwidth is assigned to the root group and new groups get the
period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
want to assign bandwidth to another group, reduce the root group's bandwidth
and assign some or all of the difference to another group.

Realtime group scheduling means you have to assign a portion of total CPU
bandwidth to the group before it will accept realtime tasks. Therefore you will
not be able to run realtime tasks as any user other than root until you have
done that, even if the user has the rights to run processes with realtime
priority!


2.3 Basis for grouping tasks
----------------------------

There are two compile-time settings for allocating CPU bandwidth. These are
configured using the "Basis for grouping tasks" multiple choice menu under
General setup > Group CPU Scheduler:

a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" =  "user id")

This lets you use the virtual files under
"/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control he CPU time reserved for
each user .

The other option is:

.o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups")

This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us"
to control the CPU time reserved for each control group instead.

For more information on working with control groups, you should read
Documentation/cgroups.txt as well.

Group settings are checked against the following limits in order to keep the configuration
schedulable:

   \Sum_{i} runtime_{i} / global_period <= global_runtime / global_period

For now, this can be simplified to just the following (but see Future plans):

   \Sum_{i} runtime_{i} <= global_runtime


3. Future plans
===============

There is work in progress to make the scheduling period for each group
("/sys/kernel/uids/<uid>/cpu_rt_period_us" or
"/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well.

The constraint on the period is that a subgroup must have a smaller or
equal period to its parent. But realistically its not very useful _yet_
as its prone to starvation without deadline scheduling.

Consider two sibling groups A and B; both have 50% bandwidth, but A's
period is twice the length of B's.

* group A: period=100000us, runtime=10000us
	- this runs for 0.01s once every 0.1s

* group B: period= 50000us, runtime=10000us
	- this runs for 0.01s twice every 0.1s (or once every 0.05 sec).

This means that currently a while (1) loop in A will run for the full period of
B and can starve B's tasks (assuming they are of lower priority) for a whole
period.

The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
full deadline scheduling to the linux kernel. Deadline scheduling the above
groups and treating end of the period as a deadline will ensure that they both
get their allocated time.

Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
the biggest challenge as the current linux PI infrastructure is geared towards
the limited static priority levels 0-139. With deadline scheduling you need to
do deadline inheritance (since priority is inversely proportional to the
deadline delta (deadline - now).

This means the whole PI machinery will have to be reworked - and that is one of
the most complex pieces of code we have.
Commit	Line	Data
b9b158fe VR	1	Real-Time group scheduling
b9b158fe VR	2	--------------------------
9f0c1e56	3
b9b158fe VR	4	CONTENTS
b9b158fe VR	5	========
9f0c1e56	6
b9b158fe VR	7	1. Overview
	8	1.1 The problem
	9	1.2 The solution
	10	2. The interface
	11	2.1 System-wide settings
	12	2.2 Default behaviour
	13	2.3 Basis for grouping tasks
	14	3. Future plans
9f0c1e56	15
9f0c1e56	16
b9b158fe VR	17	1. Overview
b9b158fe VR	18	===========
9f0c1e56	19
9f0c1e56	20
b9b158fe VR	21	1.1 The problem
b9b158fe VR	22	---------------
9f0c1e56	23
b9b158fe VR	24	Realtime scheduling is all about determinism, a group has to be able to rely on
	25	the amount of bandwidth (eg. CPU time) being constant. In order to schedule
	26	multiple groups of realtime tasks, each group must be assigned a fixed portion
	27	of the CPU time available. Without a minimum guarantee a realtime group can
	28	obviously fall short. A fuzzy upper limit is of no use since it cannot be
	29	relied upon. Which leaves us with just the single fixed portion.
9f0c1e56	30
b9b158fe VR	31	1.2 The solution
b9b158fe VR	32	----------------
9f0c1e56	33
b9b158fe VR	34	CPU time is divided by means of specifying how much time can be spent running
	35	in a given period. We allocate this "run time" for each realtime group which
	36	the other realtime groups will not be permitted to use.
9f0c1e56	37
b9b158fe VR	38	Any time not allocated to a realtime group will be used to run normal priority
	39	tasks (SCHED_OTHER). Any allocated run time not used will also be picked up by
	40	SCHED_OTHER.
9f0c1e56	41
b9b158fe VR	42	Let's consider an example: a frame fixed realtime renderer must deliver 25
	43	frames a second, which yields a period of 0.04s per frame. Now say it will also
	44	have to play some music and respond to input, leaving it with around 80% CPU
	45	time dedicated for the graphics. We can then give this group a run time of 0.8
	46	* 0.04s = 0.032s.
9f0c1e56	47
b9b158fe VR	48	This way the graphics group will have a 0.04s period with a 0.032s run time
	49	limit. Now if the audio thread needs to refill the DMA buffer every 0.005s, but
	50	needs only about 3% CPU time to do so, it can do with a 0.03 * 0.005s =
	51	0.00015s. So this group can be scheduled with a period of 0.005s and a run time
	52	of 0.00015s.
9f0c1e56	53
f7d62364	54	The remaining CPU time will be used for user input and other tasks. Because
b9b158fe	55	realtime tasks have explicitly allocated the CPU time they need to perform
f7d62364	56	their tasks, buffer underruns in the graphics or audio can be eliminated.
9f0c1e56	57
b9b158fe VR	58	NOTE: the above example is not fully implemented as of yet (2.6.25). We still
b9b158fe VR	59	lack an EDF scheduler to make non-uniform periods usable.
9f0c1e56	60
9f0c1e56	61
b9b158fe VR	62	2. The Interface
b9b158fe VR	63	================
9f0c1e56	64
9f0c1e56	65
b9b158fe VR	66	2.1 System wide settings
b9b158fe VR	67	------------------------
9f0c1e56	68
b9b158fe	69	The system wide settings are configured under the /proc virtual file system:
9f0c1e56	70
b9b158fe VR	71	/proc/sys/kernel/sched_rt_period_us:
b9b158fe VR	72	The scheduling period that is equivalent to 100% CPU bandwidth
9f0c1e56	73
b9b158fe VR	74	/proc/sys/kernel/sched_rt_runtime_us:
	75	A global limit on how much time realtime scheduling may use. Even without
	76	CONFIG_RT_GROUP_SCHED enabled, this will limit time reserved to realtime
	77	processes. With CONFIG_RT_GROUP_SCHED it signifies the total bandwidth
	78	available to all realtime groups.
	79
	80	* Time is specified in us because the interface is s32. This gives an
	81	operating range from 1us to about 35 minutes.
	82	* sched_rt_period_us takes values from 1 to INT_MAX.
	83	* sched_rt_runtime_us takes values from -1 to (INT_MAX - 1).
	84	* A run time of -1 specifies runtime == period, ie. no limit.
	85
	86
	87	2.2 Default behaviour
	88	---------------------
	89
	90	The default values for sched_rt_period_us (1000000 or 1s) and
	91	sched_rt_runtime_us (950000 or 0.95s). This gives 0.05s to be used by
	92	SCHED_OTHER (non-RT tasks). These defaults were chosen so that a run-away
	93	realtime tasks will not lock up the machine but leave a little time to recover
	94	it. By setting runtime to -1 you'd get the old behaviour back.
	95
	96	By default all bandwidth is assigned to the root group and new groups get the
	97	period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
	98	want to assign bandwidth to another group, reduce the root group's bandwidth
	99	and assign some or all of the difference to another group.
	100
	101	Realtime group scheduling means you have to assign a portion of total CPU
	102	bandwidth to the group before it will accept realtime tasks. Therefore you will
	103	not be able to run realtime tasks as any user other than root until you have
	104	done that, even if the user has the rights to run processes with realtime
	105	priority!
	106
	107
	108	2.3 Basis for grouping tasks
	109	----------------------------
	110
	111	There are two compile-time settings for allocating CPU bandwidth. These are
	112	configured using the "Basis for grouping tasks" multiple choice menu under
	113	General setup > Group CPU Scheduler:
	114
	115	a. CONFIG_USER_SCHED (aka "Basis for grouping tasks" = "user id")
	116
	117	This lets you use the virtual files under
	118	"/sys/kernel/uids/<uid>/cpu_rt_runtime_us" to control he CPU time reserved for
	119	each user .
	120
	121	The other option is:
	122
	123	.o CONFIG_CGROUP_SCHED (aka "Basis for grouping tasks" = "Control groups")
	124
	125	This uses the /cgroup virtual file system and "/cgroup/<cgroup>/cpu.rt_runtime_us"
	126	to control the CPU time reserved for each control group instead.
	127
	128	For more information on working with control groups, you should read
	129	Documentation/cgroups.txt as well.
	130
	131	Group settings are checked against the following limits in order to keep the configuration
	132	schedulable:
9f0c1e56 PZ	133
	134	\Sum_{i} runtime_{i} / global_period <= global_runtime / global_period
	135
b9b158fe VR	136	For now, this can be simplified to just the following (but see Future plans):
	137
	138	\Sum_{i} runtime_{i} <= global_runtime
	139
	140
	141	3. Future plans
	142	===============
	143
	144	There is work in progress to make the scheduling period for each group
	145	("/sys/kernel/uids/<uid>/cpu_rt_period_us" or
	146	"/cgroup/<cgroup>/cpu.rt_period_us" respectively) configurable as well.
	147
	148	The constraint on the period is that a subgroup must have a smaller or
	149	equal period to its parent. But realistically its not very useful _yet_
	150	as its prone to starvation without deadline scheduling.
	151
	152	Consider two sibling groups A and B; both have 50% bandwidth, but A's
	153	period is twice the length of B's.
	154
	155	* group A: period=100000us, runtime=10000us
	156	- this runs for 0.01s once every 0.1s
	157
	158	* group B: period= 50000us, runtime=10000us
	159	- this runs for 0.01s twice every 0.1s (or once every 0.05 sec).
	160
	161	This means that currently a while (1) loop in A will run for the full period of
	162	B and can starve B's tasks (assuming they are of lower priority) for a whole
	163	period.
	164
	165	The next project will be SCHED_EDF (Earliest Deadline First scheduling) to bring
	166	full deadline scheduling to the linux kernel. Deadline scheduling the above
	167	groups and treating end of the period as a deadline will ensure that they both
	168	get their allocated time.
	169
	170	Implementing SCHED_EDF might take a while to complete. Priority Inheritance is
	171	the biggest challenge as the current linux PI infrastructure is geared towards
	172	the limited static priority levels 0-139. With deadline scheduling you need to
	173	do deadline inheritance (since priority is inversely proportional to the
	174	deadline delta (deadline - now).
	175
	176	This means the whole PI machinery will have to be reworked - and that is one of
	177	the most complex pieces of code we have.