Deadline Task Scheduling
------------------------

CONTENTS
========

 0. WARNING
 1. Overview
 2. Scheduling algorithm
 3. Scheduling Real-Time Tasks
 4. Bandwidth management
   4.1 System-wide settings
   4.2 Task interface
   4.3 Default behavior
 5. Tasks CPU affinity
   5.1 SCHED_DEADLINE and cpusets HOWTO
 6. Future plans
 A. Test suite

0. WARNING
==========

Fiddling with these settings can result in unpredictable or even unstable
system behavior. As for -rt (group) scheduling, it is assumed that root users
know what they're doing.


1. Overview
===========

The SCHED_DEADLINE policy contained inside the sched_dl scheduling class is
basically an implementation of the Earliest Deadline First (EDF) scheduling
algorithm, augmented with a mechanism (called Constant Bandwidth Server, CBS)
that makes it possible to isolate the behavior of tasks from each other.

2. Scheduling algorithm
=======================

SCHED_DEADLINE uses three parameters, named "runtime", "period", and
"deadline", to schedule tasks. A SCHED_DEADLINE task should receive
"runtime" microseconds of execution time every "period" microseconds, and
these "runtime" microseconds are available within "deadline" microseconds
from the beginning of the period. In order to implement this behaviour,
every time the task wakes up, the scheduler computes a "scheduling deadline"
consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
scheduled using EDF[1] on these scheduling deadlines (the task with the
earliest scheduling deadline is selected for execution). Notice that the
task actually receives "runtime" time units within "deadline" if a proper
"admission control" strategy (see Section "4. Bandwidth management") is used
(clearly, if the system is overloaded this guarantee cannot be respected).

Summing up, the CBS[2,3] algorithm assigns scheduling deadlines to tasks so
that each task runs for at most its runtime every period, avoiding any
interference between different tasks (bandwidth isolation), while the EDF[1]
algorithm selects the task with the earliest scheduling deadline as the one
to be executed next. Thanks to this feature, tasks that do not strictly comply
with the "traditional" real-time task model (see Section 3) can effectively
use the new policy.

In more detail, the CBS algorithm assigns scheduling deadlines to
tasks in the following way:

 - Each SCHED_DEADLINE task is characterised by the "runtime",
   "deadline", and "period" parameters;

 - The state of the task is described by a "scheduling deadline", and
   a "remaining runtime". These two parameters are initially set to 0;

 - When a SCHED_DEADLINE task wakes up (becomes ready for execution),
   the scheduler checks if

            remaining runtime                    runtime
     ----------------------------------    >    ---------
     scheduling deadline - current time           period

   then, if the scheduling deadline is smaller than the current time, or
   this condition holds, the scheduling deadline and the
   remaining runtime are re-initialised as

         scheduling deadline = current time + deadline
         remaining runtime = runtime

   otherwise, the scheduling deadline and the remaining runtime are
   left unchanged;

 - When a SCHED_DEADLINE task executes for an amount of time t, its
   remaining runtime is decreased as

         remaining runtime = remaining runtime - t

   (technically, the runtime is decreased at every tick, or when the
   task is descheduled / preempted);

 - When the remaining runtime becomes less than or equal to 0, the task is
   said to be "throttled" (also known as "depleted" in real-time literature)
   and cannot be scheduled until its scheduling deadline. The "replenishment
   time" for this task (see next item) is set to be equal to the current
   value of the scheduling deadline;

 - When the current time is equal to the replenishment time of a
   throttled task, the scheduling deadline and the remaining runtime are
   updated as

         scheduling deadline = scheduling deadline + period
         remaining runtime = remaining runtime + runtime

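As a summary of the rules above, here is a minimal user-space C sketch. It is
purely illustrative: the structure, the function names and the floating-point
division are simplifications made for this document, not the actual kernel
implementation (which uses integer arithmetic):

/* Illustrative sketch of the CBS rules above; not kernel code. */
#include <stdint.h>
#include <stdbool.h>

struct cbs_task {
	int64_t  remaining_runtime;   /* remaining runtime (us)            */
	uint64_t scheduling_deadline; /* absolute scheduling deadline (us) */
	uint64_t runtime;             /* "runtime" parameter (us)          */
	uint64_t deadline;            /* "deadline" parameter (us)         */
	uint64_t period;              /* "period" parameter (us)           */
};

/* Rule applied when the task wakes up. */
static void cbs_wakeup(struct cbs_task *t, uint64_t now)
{
	/*
	 * Re-initialise if the old scheduling deadline is in the past
	 * (<= also covers the freshly-initialised case and avoids a zero
	 * denominator) or if the remaining bandwidth would exceed the
	 * reserved bandwidth runtime/period.
	 */
	if (t->scheduling_deadline <= now ||
	    (double)t->remaining_runtime / (t->scheduling_deadline - now) >
	    (double)t->runtime / t->period) {
		t->scheduling_deadline = now + t->deadline;
		t->remaining_runtime = t->runtime;
	}
	/* otherwise: leave scheduling deadline and remaining runtime alone */
}

/* Rule applied while the task runs for 'delta' microseconds. */
static bool cbs_account(struct cbs_task *t, uint64_t delta)
{
	t->remaining_runtime -= (int64_t)delta;
	return t->remaining_runtime <= 0;  /* true => task is throttled */
}

/* Rule applied when the replenishment time (the old deadline) is reached. */
static void cbs_replenish(struct cbs_task *t)
{
	t->scheduling_deadline += t->period;
	t->remaining_runtime += (int64_t)t->runtime;
}
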
3. Scheduling Real-Time Tasks
=============================

 * BIG FAT WARNING ******************************************************
 *
 * This section contains a (not-thorough) summary of classical deadline
 * scheduling theory, and how it applies to SCHED_DEADLINE.
 * The reader can "safely" skip to Section 4 if only interested in seeing
 * how the scheduling policy can be used. Anyway, we strongly recommend
 * coming back here and continuing reading (once the urge for testing is
 * satisfied :P) to be sure of fully understanding all technical details.
 ************************************************************************

There are no limitations on what kind of task can exploit this new
scheduling discipline, even if it must be said that it is particularly
suited for periodic or sporadic real-time tasks that need guarantees on their
timing behavior, e.g., multimedia, streaming, control applications, etc.

A typical real-time task is composed of a repetition of computation phases
(task instances, or jobs) which are activated in a periodic or sporadic
fashion.
Each job J_j (where J_j is the j^th job of the task) is characterised by an
arrival time r_j (the time when the job starts), an amount of computation
time c_j needed to finish the job, and a job absolute deadline d_j, which
is the time within which the job should be finished. The maximum execution
time max_j{c_j} is called "Worst Case Execution Time" (WCET) for the task.
A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
sporadic with minimum inter-arrival time P if r_{j+1} >= r_j + P. Finally,
d_j = r_j + D, where D is the task's relative deadline.
The utilisation of a real-time task is defined as the ratio between its
WCET and its period (or minimum inter-arrival time), and represents
the fraction of CPU time needed to execute the task.

If the total utilisation sum_i(WCET_i/P_i) is larger than M (with M equal
to the number of CPUs), then the scheduler is unable to respect all the
deadlines.
Note that total utilisation is defined as the sum of the utilisations
WCET_i/P_i over all the real-time tasks in the system. When considering
multiple real-time tasks, the parameters of the i-th task are indicated
with the "_i" suffix.
Moreover, if the total utilisation is larger than M, then we risk starving
non real-time tasks by real-time tasks.
If, instead, the total utilisation is smaller than M, then non real-time
tasks will not be starved and the system might be able to respect all the
deadlines.
As a matter of fact, in this case it is possible to provide an upper bound
for tardiness (defined as the maximum between 0 and the difference
between the finishing time of a job and its absolute deadline).
More precisely, it can be proven that using a global EDF scheduler the
maximum tardiness of each task is smaller than or equal to

  ((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max

where WCET_max = max_i{WCET_i} is the maximum WCET, WCET_min=min_i{WCET_i}
is the minimum WCET, and U_max = max_i{WCET_i/P_i} is the maximum utilisation.

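As an example (with purely illustrative numbers, and assuming the total
utilisation is smaller than M), consider M = 4 CPUs, WCET_max = 10ms,
WCET_min = 2ms and U_max = 0.5; the bound above evaluates to

  ((4 − 1) · 10 − 2)/(4 − (4 − 2) · 0.5) + 10 = 28/3 + 10 ≈ 19.3ms

so no job finishes more than about 19.3ms after its absolute deadline.
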
If M=1 (uniprocessor system), or in case of partitioned scheduling (each
real-time task is statically assigned to one and only one CPU), it is
possible to formally check if all the deadlines are respected.
If D_i = P_i for all tasks, then EDF is able to respect all the deadlines
of all the tasks executing on a CPU if and only if the total utilisation
of the tasks running on such a CPU is smaller than or equal to 1.
If D_i != P_i for some task, then it is possible to define the density of
a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
of all the tasks running on a CPU if the sum sum_i WCET_i/min{D_i,P_i} of
the densities of the tasks running on such a CPU is smaller than or equal
to 1 (notice that this condition is only sufficient, and not necessary).

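For example (illustrative numbers), two tasks pinned to the same CPU with
(WCET_1 = 20ms, D_1 = 50ms, P_1 = 100ms) and (WCET_2 = 30ms, D_2 = 80ms,
P_2 = 80ms) have densities 20/50 = 0.4 and 30/80 = 0.375; since
0.4 + 0.375 = 0.775 <= 1, EDF respects all their deadlines on that CPU.
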
On multiprocessor systems with global EDF scheduling (non-partitioned
systems), a sufficient test for schedulability cannot be based on the
utilisations (it can be shown that task sets with utilisations slightly
larger than 1 can miss deadlines regardless of the number of CPUs M).
However, as previously stated, enforcing that the total utilisation is smaller
than M is enough to guarantee that non real-time tasks are not starved and
that the tardiness of real-time tasks has an upper bound.

SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that
the jobs' deadlines of a task are respected. In order to do this, a task
must be scheduled by setting:

  - runtime >= WCET
  - deadline = D
  - period <= P

IOW, if runtime >= WCET and if period is <= P, then the scheduling deadlines
and the absolute deadlines (d_j) coincide, so a proper admission control
makes it possible to respect the jobs' absolute deadlines for this task (this
is what is called "hard schedulability property" and is an extension of
Lemma 1 of [2]).
Notice that if runtime > deadline the admission control will surely reject
this task, as it is not possible to respect its temporal constraints.

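For example (purely illustrative numbers), a sporadic task whose jobs have
WCET = 5ms, relative deadline D = 10ms and minimum inter-arrival time
P = 33ms can be served by setting runtime = 5ms, deadline = 10ms and
period = 33ms; admission control then guarantees that every job receives
its 5ms of CPU time within 10ms from its arrival.
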
References:
 1 - C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming
     in a hard-real-time environment. Journal of the Association for
     Computing Machinery, 20(1), 1973.
 2 - L. Abeni, G. Buttazzo. Integrating Multimedia Applications in Hard
     Real-Time Systems. Proceedings of the 19th IEEE Real-time Systems
     Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
 3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
     Technical Report. http://disi.unitn.it/~abeni/tr-98-01.pdf

4. Bandwidth management
=======================

As previously mentioned, in order for -deadline scheduling to be
effective and useful (that is, to be able to provide "runtime" time units
within "deadline"), it is important to have some method to keep the allocation
of the available fractions of CPU time to the various tasks under control.
This is usually called "admission control" and if it is not performed, then
no guarantee can be given on the actual scheduling of the -deadline tasks.

As already stated in Section 3, a necessary condition to be respected to
correctly schedule a set of real-time tasks is that the total utilisation
is smaller than M. When talking about -deadline tasks, this requires that
the sum of the ratio between runtime and period for all tasks is smaller
than M. Notice that the ratio runtime/period is equivalent to the utilisation
of a "traditional" real-time task, and is also often referred to as
"bandwidth".
The interface used to control the CPU bandwidth that can be allocated
to -deadline tasks is similar to the one already used for -rt
tasks with real-time group scheduling (a.k.a. RT-throttling - see
Documentation/scheduler/sched-rt-group.txt), and is based on readable/
writable control files located in procfs (for system wide settings).
Notice that per-group settings (controlled through cgroupfs) are still not
defined for -deadline tasks, because more discussion is needed in order to
figure out how we want to manage SCHED_DEADLINE bandwidth at the task group
level.

A main difference between deadline bandwidth management and RT-throttling
is that -deadline tasks have bandwidth on their own (while -rt ones don't!),
and thus we don't need a higher level throttling mechanism to enforce the
desired bandwidth. In other words, this means that interface parameters are
only used at admission control time (i.e., when the user calls
sched_setattr()). Scheduling is then performed considering actual tasks'
parameters, so that CPU bandwidth is allocated to SCHED_DEADLINE tasks
respecting their needs in terms of granularity. Therefore, using this simple
interface we can put a cap on total utilization of -deadline tasks (i.e.,
\Sum (runtime_i / period_i) < global_dl_utilization_cap).

4.1 System-wide settings
------------------------

The system-wide settings are configured under the /proc virtual file system.

For now the -rt knobs are used for -deadline admission control and the
-deadline runtime is accounted against the -rt runtime. We realise that this
isn't entirely desirable; however, it is better to have a small interface for
now, and be able to change it easily later. The ideal situation (see 5.) is to
run -rt tasks from a -deadline server; in which case the -rt bandwidth is a
direct subset of dl_bw.

This means that, for a root_domain comprising M CPUs, -deadline tasks
can be created while the sum of their bandwidths stays below:

  M * (sched_rt_runtime_us / sched_rt_period_us)

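With the default values (see Section 4.3), sched_rt_runtime_us = 950000 and
sched_rt_period_us = 1000000, so on a root_domain with, e.g., 4 CPUs the
admissible total -deadline bandwidth is 4 * 0.95 = 3.8.
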
It is also possible to disable this bandwidth management logic, and
thus be free to oversubscribe the system up to any arbitrary level.
This is done by writing -1 in /proc/sys/kernel/sched_rt_runtime_us.

4.2 Task interface
------------------

Specifying a periodic/sporadic task that executes for a given amount of
runtime at each instance, and that is scheduled according to the urgency of
its own timing constraints needs, in general, a way of declaring:
 - a (maximum/typical) instance execution time,
 - a minimum interval between consecutive instances,
 - a time constraint by which each instance must be completed.

Therefore:
 * a new struct sched_attr, containing all the necessary fields, is
   provided;
 * the new scheduling-related syscalls that manipulate it, i.e.,
   sched_setattr() and sched_getattr(), are implemented (a minimal usage
   sketch follows below).

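The following minimal, self-contained sketch shows how a thread can request a
SCHED_DEADLINE reservation of 10ms every 100ms through sched_setattr(). It
assumes a C library that does not provide a wrapper for the syscall (hence
the raw syscall() invocation and the local declaration of struct sched_attr)
and that SYS_sched_setattr is defined by the installed headers; note that the
sched_attr time parameters are expressed in nanoseconds:

/*
 * Minimal sketch (not from the kernel sources): put the calling thread
 * into a SCHED_DEADLINE reservation of 10ms every 100ms.
 */
#define _GNU_SOURCE
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE	6
#endif

struct sched_attr {
	uint32_t size;		/* size of this structure               */
	uint32_t sched_policy;	/* policy (SCHED_DEADLINE here)         */
	uint64_t sched_flags;
	int32_t  sched_nice;	/* used by SCHED_OTHER/SCHED_BATCH      */
	uint32_t sched_priority;/* used by SCHED_FIFO/SCHED_RR          */
	/* SCHED_DEADLINE parameters, in nanoseconds */
	uint64_t sched_runtime;
	uint64_t sched_deadline;
	uint64_t sched_period;
};

static int sched_setattr(pid_t pid, const struct sched_attr *attr,
			 unsigned int flags)
{
	return syscall(SYS_sched_setattr, pid, attr, flags);
}

int main(void)
{
	struct sched_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.sched_policy   = SCHED_DEADLINE;
	attr.sched_runtime  =  10 * 1000 * 1000;	/*  10 ms */
	attr.sched_deadline = 100 * 1000 * 1000;	/* 100 ms */
	attr.sched_period   = 100 * 1000 * 1000;	/* 100 ms */

	if (sched_setattr(0, &attr, 0)) {	/* pid 0 == calling thread */
		perror("sched_setattr");
		return 1;
	}

	/* ... periodic real-time work would go here ... */

	return 0;
}

If admission control (see Section 4.1) rejects the requested bandwidth,
sched_setattr() fails and the thread keeps its previous policy.
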
4.3 Default behavior
--------------------

The default value for SCHED_DEADLINE bandwidth is to have rt_runtime equal to
950000. With rt_period equal to 1000000, by default, it means that -deadline
tasks can use at most 95%, multiplied by the number of CPUs that compose the
root_domain, for each root_domain.
This means that non -deadline tasks will receive at least 5% of the CPU time,
and that -deadline tasks will receive their runtime with a guaranteed
worst-case delay with respect to the "deadline" parameter. If "deadline" =
"period" and the cpuset mechanism is used to implement partitioned scheduling
(see Section 5), then this simple setting of the bandwidth management is able
to deterministically guarantee that -deadline tasks will receive their runtime
in a period.

Finally, notice that in order not to jeopardize the admission control a
-deadline task cannot fork.

5. Tasks CPU affinity
=====================

-deadline tasks cannot have an affinity mask smaller than the entire
root_domain they are created on. However, affinities can be specified
through the cpuset facility (Documentation/cgroups/cpusets.txt).

5.1 SCHED_DEADLINE and cpusets HOWTO
------------------------------------

An example of a simple configuration (pin a -deadline task to CPU0)
follows (rt-app is used to create a -deadline task):

 mkdir /dev/cpuset
 mount -t cgroup -o cpuset cpuset /dev/cpuset
 cd /dev/cpuset
 mkdir cpu0
 echo 0 > cpu0/cpuset.cpus
 echo 0 > cpu0/cpuset.mems
 echo 1 > cpuset.cpu_exclusive
 echo 0 > cpuset.sched_load_balance
 echo 1 > cpu0/cpuset.cpu_exclusive
 echo 1 > cpu0/cpuset.mem_exclusive
 echo $$ > cpu0/tasks
 rt-app -t 100000:10000:d:0 -D5 (it is now actually superfluous to
 specify task affinity)

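In the sequence above, the commands create a cpuset ("cpu0") that exclusively
owns CPU0, disable load balancing at the root level, move the current shell
into the new cpuset and finally start a -deadline task from that shell, so
that the task is confined to CPU0.
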
6. Future plans
===============

Still missing:

 - refinements to deadline inheritance, especially regarding the possibility
   of retaining bandwidth isolation among non-interacting tasks. This is
   being studied from both theoretical and practical points of view, and
   hopefully we should be able to produce some demonstrative code soon;
 - (c)group based bandwidth management, and maybe scheduling;
 - access control for non-root users (and related security concerns to
   address): what is the best way to allow unprivileged use of the
   mechanisms, and how do we prevent non-root users from "cheating" the
   system?

As already discussed, we are also planning to merge this work with the EDF
throttling patches [https://lkml.org/lkml/2010/2/23/239] but we are still in
the preliminary phases of the merge and we really seek feedback that would
help us decide on the direction it should take.

Appendix A. Test suite
======================

The SCHED_DEADLINE policy can be easily tested using two applications that
are part of a wider Linux Scheduler validation suite. The suite is
available as a GitHub repository: https://github.com/scheduler-tools.

The first testing application is called rt-app and can be used to
start multiple threads with specific parameters. rt-app supports
SCHED_{OTHER,FIFO,RR,DEADLINE} scheduling policies and their related
parameters (e.g., niceness, priority, runtime/deadline/period). rt-app
is a valuable tool, as it can be used to synthetically recreate certain
workloads (maybe mimicking real use-cases) and evaluate how the scheduler
behaves under such workloads. In this way, results are easily reproducible.
rt-app is available at: https://github.com/scheduler-tools/rt-app.

Thread parameters can be specified from the command line, with something like
this:

 # rt-app -t 100000:10000:d -t 150000:20000:f:10 -D5

The above creates 2 threads. The first one, scheduled by SCHED_DEADLINE,
executes for 10ms every 100ms. The second one, scheduled at SCHED_FIFO
priority 10, executes for 20ms every 150ms. The test will run for a total
of 5 seconds.

More interestingly, configurations can be described with a json file that
can be passed as input to rt-app with something like this:

 # rt-app my_config.json

The parameters that can be specified with the second method are a superset
of the command line options. Please refer to rt-app documentation for more
details (<rt-app-sources>/doc/*.json).

The second testing application is a modification of schedtool, called
schedtool-dl, which can be used to set up SCHED_DEADLINE parameters for a
certain pid/application. schedtool-dl is available at:
https://github.com/scheduler-tools/schedtool-dl.git.

The usage is straightforward:

 # schedtool -E -t 10000000:100000000 -e ./my_cpuhog_app

With this, my_cpuhog_app is put to run inside a SCHED_DEADLINE reservation
of 10ms every 100ms (note that parameters are expressed in nanoseconds).
You can also use schedtool to create a reservation for an already running
application, given that you know its pid:

 # schedtool -E -t 10000000:100000000 my_app_pid