Deadline Task Scheduling
------------------------

CONTENTS
========

 0. WARNING
 1. Overview
 2. Scheduling algorithm
 3. Scheduling Real-Time Tasks
 4. Bandwidth management
   4.1 System-wide settings
   4.2 Task interface
   4.3 Default behavior
 5. Tasks CPU affinity
   5.1 SCHED_DEADLINE and cpusets HOWTO
 6. Future plans


0. WARNING
==========

 Fiddling with these settings can result in unpredictable or even unstable
 system behavior. As for -rt (group) scheduling, it is assumed that root users
 know what they're doing.


1. Overview
===========

 The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
 basically an implementation of the Earliest Deadline First (EDF) scheduling
 algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
 that makes it possible to isolate the behavior of tasks from each other.


2. Scheduling algorithm
=======================

 SCHED_DEADLINE uses three parameters, named "runtime", "period", and
 "deadline", to schedule tasks. A SCHED_DEADLINE task is guaranteed to receive
 "runtime" microseconds of execution time every "period" microseconds, and
 these "runtime" microseconds are available within "deadline" microseconds
 from the beginning of the period. In order to implement this behaviour,
 every time the task wakes up, the scheduler computes a "scheduling deadline"
 consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
 scheduled using EDF[1] on these scheduling deadlines (the task with the
 earliest scheduling deadline is selected for execution). Notice that this
 guarantee is respected if a proper "admission control" strategy (see Section
 "4. Bandwidth management") is used.

 Summing up, the CBS[2,3] algorithm assigns scheduling deadlines to tasks so
 that each task runs for at most its runtime every period, avoiding any
 interference between different tasks (bandwidth isolation), while the EDF[1]
 algorithm selects the task with the earliest scheduling deadline as the one
 to be executed next. Thanks to this feature, tasks that do not strictly comply
 with the "traditional" real-time task model (see Section 3) can effectively
 use the new policy.
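
 The EDF selection step can be illustrated with a minimal sketch. This is not
 the kernel's code: the list-of-tuples representation, the function name
 edf_pick() and the task names are all hypothetical, chosen only to show that
 the ready task with the earliest scheduling deadline is the one that runs.

```python
def edf_pick(ready):
    """Return the (scheduling_deadline, name) pair that EDF would run next.

    `ready` is a list of (scheduling_deadline, task_name) tuples. Both the
    representation and the names are made up for illustration.
    """
    if not ready:
        return None
    # Earliest scheduling deadline wins; ties are broken by name here,
    # which is an arbitrary choice of this sketch.
    return min(ready)


ready = [(300, "video"), (120, "audio"), (250, "control")]
print(edf_pick(ready))
```

 Here the task with scheduling deadline 120 is selected, regardless of the
 order in which the tasks became ready.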

 In more detail, the CBS algorithm assigns scheduling deadlines to
 tasks in the following way:

  - Each SCHED_DEADLINE task is characterised by the "runtime",
    "deadline", and "period" parameters;

  - The state of the task is described by a "scheduling deadline", and
    a "remaining runtime". These two parameters are initially set to 0;

  - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
    the scheduler checks if

                 remaining runtime                  runtime
        ----------------------------------    >    ---------
        scheduling deadline - current time           period

    then, if the scheduling deadline is smaller than the current time, or
    this condition holds, the scheduling deadline and the
    remaining runtime are re-initialised as

         scheduling deadline = current time + deadline
         remaining runtime = runtime

    otherwise, the scheduling deadline and the remaining runtime are
    left unchanged;

  - When a SCHED_DEADLINE task executes for an amount of time t, its
    remaining runtime is decreased as

         remaining runtime = remaining runtime - t

    (technically, the runtime is decreased at every tick, or when the
    task is descheduled / preempted);

  - When the remaining runtime becomes less than or equal to 0, the task is
    said to be "throttled" (also known as "depleted" in real-time literature)
    and cannot be scheduled until its scheduling deadline. The "replenishment
    time" for this task (see next item) is set to be equal to the current
    value of the scheduling deadline;

  - When the current time is equal to the replenishment time of a
    throttled task, the scheduling deadline and the remaining runtime are
    updated as

         scheduling deadline = scheduling deadline + period
         remaining runtime = remaining runtime + runtime
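
 The rules above can be sketched as a small user-space model. This is only an
 illustration of the bookkeeping, not the kernel implementation: the class
 CbsTask, its method names and the integer time units are assumptions of this
 sketch. The wake-up test is cross-multiplied so that it needs no division,
 which is equivalent to the fraction above whenever the scheduling deadline is
 not in the past.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CbsTask:
    """Hypothetical model of the per-task CBS state described in the text."""
    runtime: int
    deadline: int
    period: int
    sched_deadline: int = 0             # "scheduling deadline", initially 0
    remaining: int = 0                  # "remaining runtime", initially 0
    replenish_at: Optional[int] = None  # set only while the task is throttled

    def wakeup(self, now: int) -> None:
        """Apply the wake-up rule: reset the state or leave it unchanged."""
        expired = self.sched_deadline < now
        # remaining / (sched_deadline - now) > runtime / period,
        # cross-multiplied (valid because sched_deadline >= now here).
        overflow = (not expired and
                    self.remaining * self.period >
                    self.runtime * (self.sched_deadline - now))
        if expired or overflow:
            self.sched_deadline = now + self.deadline
            self.remaining = self.runtime

    def account(self, t: int) -> bool:
        """Charge t units of execution; return True if now throttled."""
        self.remaining -= t
        if self.remaining <= 0:
            # Replenishment time = current value of the scheduling deadline.
            self.replenish_at = self.sched_deadline
        return self.replenish_at is not None

    def replenish(self) -> None:
        """Call when the current time equals replenish_at."""
        self.sched_deadline += self.period
        self.remaining += self.runtime
        self.replenish_at = None


task = CbsTask(runtime=10, deadline=100, period=100)
task.wakeup(now=10)      # scheduling deadline in the past: state is reset
task.account(4)          # runs for 4 units, then blocks
task.wakeup(now=40)      # 6/70 is not above 10/100: state left unchanged
print(task.sched_deadline, task.remaining)
```

 Waking the same task at time 60 instead would reset its state, because
 6/50 exceeds the reserved bandwidth of 10/100.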


3. Scheduling Real-Time Tasks
=============================

 * BIG FAT WARNING ******************************************************
 *
 * This section contains a (not-thorough) summary on classical deadline
 * scheduling theory, and how it applies to SCHED_DEADLINE.
 * The reader can "safely" skip to Section 4 if only interested in seeing
 * how the scheduling policy can be used. Anyway, we strongly recommend
 * to come back here and continue reading (once the urge for testing is
 * satisfied :P) to be sure of fully understanding all technical details.
 ************************************************************************

 There are no limitations on what kind of task can exploit this new
 scheduling discipline, even if it must be said that it is particularly
 suited for periodic or sporadic real-time tasks that need guarantees on their
 timing behavior, e.g., multimedia, streaming, control applications, etc.

 A typical real-time task is composed of a repetition of computation phases
 (task instances, or jobs) which are activated in a periodic or sporadic
 fashion.
 Each job J_j (where J_j is the j^th job of the task) is characterised by an
 arrival time r_j (the time when the job starts), an amount of computation
 time c_j needed to finish the job, and a job absolute deadline d_j, which
 is the time within which the job should be finished. The maximum execution
 time max_j{c_j} is called "Worst Case Execution Time" (WCET) for the task.
 A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
 sporadic with minimum inter-arrival time P if r_{j+1} >= r_j + P. Finally,
 d_j = r_j + D, where D is the task's relative deadline.

 SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that
 the jobs' deadlines of a task are respected. In order to do this, a task
 must be scheduled by setting:

  - runtime >= WCET
  - deadline = D
  - period <= P

 IOW, if runtime >= WCET and if period is <= P, then the scheduling deadlines
 and the absolute deadlines (d_j) coincide, so a proper admission control
 allows the jobs' absolute deadlines for this task to be respected (this is
 what is called "hard schedulability property" and is an extension of Lemma 1
 of [2]).
 Notice that if runtime > deadline the admission control will surely reject
 this task, as it is not possible to respect its temporal constraints.
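
 As a small worked example of the mapping above, the following sketch derives
 the tightest SCHED_DEADLINE parameters for a task described by (WCET, D, P).
 The function name dl_params() and the use of plain integers in a common time
 unit are assumptions made for illustration only.

```python
def dl_params(wcet, rel_deadline, min_interarrival):
    """Map a real-time task (WCET, D, P) to (runtime, deadline, period).

    Picks the smallest runtime and the largest period allowed by
    runtime >= WCET, deadline = D, period <= P.
    """
    runtime, deadline, period = wcet, rel_deadline, min_interarrival
    if runtime > deadline:
        # Such a task cannot meet its temporal constraints, so the
        # admission control would reject it.
        raise ValueError("runtime > deadline: task would be rejected")
    return runtime, deadline, period


# A job needing at most 10 time units, due 50 units after arrival,
# arriving at least 100 units apart.
print(dl_params(10, 50, 100))
```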

 References:
  1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming
      in a hard-real-time environment. Journal of the Association for
      Computing Machinery, 20(1), 1973.
  2 - L. Abeni, G. Buttazzo. Integrating Multimedia Applications in Hard
      Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
      Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
  3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
      Technical Report. http://disi.unitn.it/~abeni/tr-98-01.pdf

4. Bandwidth management
=======================

 In order for the -deadline scheduling to be effective and useful, it is
 important to have some method to keep the allocation of the available CPU
 bandwidth to the tasks under control. This is usually called "admission
 control" and if it is not performed at all, no guarantee can be given on
 the actual scheduling of the -deadline tasks.

 The interface used to control the fraction of CPU bandwidth that can be
 allocated to -deadline tasks is similar to the one already used for -rt
 tasks with real-time group scheduling (a.k.a. RT-throttling - see
 Documentation/scheduler/sched-rt-group.txt), and is based on readable/
 writable control files located in procfs (for system wide settings).
 Notice that per-group settings (controlled through cgroupfs) are still not
 defined for -deadline tasks, because more discussion is needed in order to
 figure out how we want to manage SCHED_DEADLINE bandwidth at the task group
 level.

 A main difference between deadline bandwidth management and RT-throttling
 is that -deadline tasks have their own bandwidth (while -rt ones don't!),
 and thus we don't need a higher level throttling mechanism to enforce the
 desired bandwidth. Therefore, using this simple interface we can put a cap
 on total utilization of -deadline tasks (i.e., \Sum (runtime_i / period_i) <
 global_dl_utilization_cap).

4.1 System-wide settings
------------------------

 The system-wide settings are configured under the /proc virtual file system.

 For now the -rt knobs are used for -deadline admission control and the
 -deadline runtime is accounted against the -rt runtime. We realise that this
 isn't entirely desirable; however, it is better to have a small interface for
 now, and be able to change it easily later. The ideal situation (see
 Section 6) is to run -rt tasks from a -deadline server; in which case the
 -rt bandwidth is a direct subset of dl_bw.

 This means that, for a root_domain comprising M CPUs, -deadline tasks
 can be created while the sum of their bandwidths stays below:

   M * (sched_rt_runtime_us / sched_rt_period_us)

 It is also possible to disable this bandwidth management logic, and
 thus be free to oversubscribe the system up to any arbitrary level.
 This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.
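
 The admission test described above can be sketched in user space as follows.
 This only mirrors the arithmetic in the text and is not the kernel's code:
 the function names, the (runtime, period) tuple representation and the
 strict "<" comparison (taken from "stays below") are assumptions of this
 sketch. Exact fractions are used to avoid floating-point rounding at the
 boundary.

```python
from fractions import Fraction


def dl_bandwidth_cap(n_cpus, rt_runtime_us, rt_period_us):
    """Return M * (sched_rt_runtime_us / sched_rt_period_us), or None
    when the check is disabled by a runtime of -1."""
    if rt_runtime_us == -1:
        return None
    return Fraction(n_cpus * rt_runtime_us, rt_period_us)


def admits(tasks, new_task, n_cpus, rt_runtime_us, rt_period_us):
    """Would `new_task` fit next to the already admitted `tasks`?

    Each task is a hypothetical (runtime, period) pair in the same unit.
    """
    cap = dl_bandwidth_cap(n_cpus, rt_runtime_us, rt_period_us)
    if cap is None:
        return True  # bandwidth management disabled: anything goes
    total = sum(Fraction(r, p) for r, p in tasks + [new_task])
    return total < cap


# Two CPUs with a 950000/1000000 setting give a cap of 1.9.
admitted = [(50, 100), (50, 100), (50, 100)]           # total 1.5
print(admits(admitted, (30, 100), 2, 950000, 1000000))
print(admits(admitted, (40, 100), 2, 950000, 1000000))
```

 The first request brings the total to 1.8 and fits; the second would reach
 exactly 1.9, which no longer stays below the cap.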
209 | ||
210 | ||
211 | 4.2 Task interface | |
212 | ------------------ | |
213 | ||
214 | Specifying a periodic/sporadic task that executes for a given amount of | |
215 | runtime at each instance, and that is scheduled according to the urgency of | |
216 | its own timing constraints needs, in general, a way of declaring: | |
217 | - a (maximum/typical) instance execution time, | |
218 | - a minimum interval between consecutive instances, | |
219 | - a time constraint by which each instance must be completed. | |
220 | ||
221 | Therefore: | |
222 | * a new struct sched_attr, containing all the necessary fields is | |
223 | provided; | |
224 | * the new scheduling related syscalls that manipulate it, i.e., | |
225 | sched_setattr() and sched_getattr() are implemented. | |
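
 A hedged sketch of how the interface can be driven from user space follows.
 It is not taken from this document: the field layout and the nanosecond
 units follow the sched_setattr(2) man page, SCHED_DEADLINE is policy 6 in
 the kernel headers, and the syscall number 314 is valid for x86_64 only.
 The helper names make_dl_attr() and set_deadline() are hypothetical.

```python
import ctypes
import os

SCHED_DEADLINE = 6        # policy value from the kernel headers
NR_SCHED_SETATTR = 314    # x86_64 only; other architectures differ


class SchedAttr(ctypes.Structure):
    """Mirror of struct sched_attr as laid out in sched_setattr(2)."""
    _fields_ = [
        ("size", ctypes.c_uint32),
        ("sched_policy", ctypes.c_uint32),
        ("sched_flags", ctypes.c_uint64),
        ("sched_nice", ctypes.c_int32),
        ("sched_priority", ctypes.c_uint32),
        ("sched_runtime", ctypes.c_uint64),    # nanoseconds
        ("sched_deadline", ctypes.c_uint64),   # nanoseconds
        ("sched_period", ctypes.c_uint64),     # nanoseconds
    ]


def make_dl_attr(runtime_ns, deadline_ns, period_ns):
    """Build a sched_attr describing a SCHED_DEADLINE reservation."""
    attr = SchedAttr()
    attr.size = ctypes.sizeof(SchedAttr)
    attr.sched_policy = SCHED_DEADLINE
    attr.sched_runtime = runtime_ns
    attr.sched_deadline = deadline_ns
    attr.sched_period = period_ns
    return attr


def set_deadline(pid, attr):
    """Invoke sched_setattr(pid, &attr, 0) through the raw syscall."""
    libc = ctypes.CDLL(None, use_errno=True)
    ret = libc.syscall(ctypes.c_long(NR_SCHED_SETATTR), ctypes.c_int(pid),
                       ctypes.byref(attr), ctypes.c_uint(0))
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


# 10 ms of runtime every 30 ms, to be delivered within 30 ms.
attr = make_dl_attr(10_000_000, 30_000_000, 30_000_000)
print(attr.size, attr.sched_policy)
```

 Calling set_deadline(0, attr) would apply the reservation to the calling
 thread; it is expected to fail without sufficient privileges or when the
 admission control rejects the request.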
226 | ||
227 | ||
228 | 4.3 Default behavior | |
229 | --------------------- | |
230 | ||
231 | The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to | |
232 | 950000. With rt_period equal to 1000000, by default, it means that -deadline | |
233 | tasks can use at most 95%, multiplied by the number of CPUs that compose the | |
234 | root_domain, for each root_domain. | |
235 | ||
236 | A -deadline task cannot fork. | |
237 | ||
5. Tasks CPU affinity
=====================

 -deadline tasks cannot have an affinity mask smaller than the entire
 root_domain they are created on. However, affinities can be specified
 through the cpuset facility (Documentation/cgroups/cpusets.txt).

5.1 SCHED_DEADLINE and cpusets HOWTO
------------------------------------

 An example of a simple configuration (pin a -deadline task to CPU0)
 follows (rt-app is used to create a -deadline task).

 mkdir /dev/cpuset
 mount -t cgroup -o cpuset cpuset /dev/cpuset
 cd /dev/cpuset
 mkdir cpu0
 echo 0 > cpu0/cpuset.cpus
 echo 0 > cpu0/cpuset.mems
 echo 1 > cpuset.cpu_exclusive
 echo 0 > cpuset.sched_load_balance
 echo 1 > cpu0/cpuset.cpu_exclusive
 echo 1 > cpu0/cpuset.mem_exclusive
 echo $$ > cpu0/tasks
 rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to specify
 task affinity)

6. Future plans
===============

 Still missing:

  - refinements to deadline inheritance, especially regarding the possibility
    of retaining bandwidth isolation among non-interacting tasks. This is
    being studied from both theoretical and practical points of view, and
    hopefully we should be able to produce some demonstrative code soon;
  - (c)group based bandwidth management, and maybe scheduling;
  - access control for non-root users (and related security concerns to
    address): what is the best way to allow unprivileged use of the
    mechanisms, and how can non-root users be prevented from "cheating" the
    system?

 As already discussed, we are also planning to merge this work with the EDF
 throttling patches [https://lkml.org/lkml/2010/2/23/239] but we are still in
 the preliminary phases of the merge and we really seek feedback that would
 help us decide on the direction it should take.