Commit | Line | Data |
---|---|---|
9e255e2b | 1 | ====================================== |
458f69ef MCC |
2 | NO_HZ: Reducing Scheduling-Clock Ticks |
3 | ====================================== | |
0c87f9b5 PM |
4 | |
5 | ||
6 | This document describes Kconfig options and boot parameters that can | |
7 | reduce the number of scheduling-clock interrupts, thereby improving energy | |
8 | efficiency and reducing OS jitter. Reducing OS jitter is important for | |
9 | some types of computationally intensive high-performance computing (HPC) | |
10 | applications and for real-time applications. | |
11 | ||
295fde89 PM |
12 | There are three main ways of managing scheduling-clock interrupts |
13 | (also known as "scheduling-clock ticks" or simply "ticks"): | |
0c87f9b5 | 14 | |
295fde89 PM |
15 | 1. Never omit scheduling-clock ticks (CONFIG_HZ_PERIODIC=y or |
16 | CONFIG_NO_HZ=n for older kernels). You normally will -not- | |
17 | want to choose this option. | |
0c87f9b5 | 18 | |
295fde89 PM |
19 | 2. Omit scheduling-clock ticks on idle CPUs (CONFIG_NO_HZ_IDLE=y or |
20 | CONFIG_NO_HZ=y for older kernels). This is the most common | |
21 | approach, and should be the default. | |
0c87f9b5 | 22 | |
295fde89 PM |
23 | 3. Omit scheduling-clock ticks on CPUs that are either idle or that |
24 | have only one runnable task (CONFIG_NO_HZ_FULL=y). Unless you | |
25 | are running realtime applications or certain types of HPC | |
26 | workloads, you will normally -not- want this option. | |
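The three choices above correspond to Kconfig symbols, so the tick mode of a given kernel build can be read from its configuration file. The following is a minimal Python sketch, assuming the configuration is available as plain text; the function name tick_mode and its return strings are illustrative and not part of any kernel tooling.

```python
def tick_mode(config_text):
    """Classify a kernel .config text by its scheduling-clock tick mode."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Enabled options look like "CONFIG_FOO=y"; disabled ones appear
        # as comments ("# CONFIG_FOO is not set") and are skipped.
        if line.endswith("=y") and not line.startswith("#"):
            enabled.add(line[:-2])
    # CONFIG_NO_HZ_FULL implies dyntick-idle as well, so test it first.
    if "CONFIG_NO_HZ_FULL" in enabled:
        return "full"
    if "CONFIG_NO_HZ_IDLE" in enabled or "CONFIG_NO_HZ" in enabled:
        return "idle"
    if "CONFIG_HZ_PERIODIC" in enabled:
        return "periodic"
    # Older kernels express the periodic case as CONFIG_NO_HZ being unset,
    # which this sketch cannot tell apart from a missing option.
    return "unknown"
```

For example, a configuration containing CONFIG_NO_HZ_FULL=y is reported as "full" even though CONFIG_NO_HZ_COMMON=y is also present.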
27 | ||
28 | These three cases are described in the following three sections, followed | |
8bdf7a25 PM |
29 | by a fourth section on RCU-specific considerations, a fifth section |
30 | discussing testing, and a sixth and final section listing known issues. | |
0c87f9b5 PM |
31 | |
32 | ||
458f69ef MCC |
33 | Never Omit Scheduling-Clock Ticks |
34 | ================================= | |
295fde89 PM |
35 | |
36 | Very old versions of Linux from the 1990s and the very early 2000s | |
37 | are incapable of omitting scheduling-clock ticks. It turns out that | |
38 | there are some situations where this old-school approach is still the | |
39 | right approach, for example, in heavy workloads with lots of tasks | |
40 | that use short bursts of CPU, where there are very frequent idle | |
41 | periods, but where these idle periods are also quite short (tens or | |
42 | hundreds of microseconds). For these types of workloads, scheduling | |
43 | clock interrupts will normally be delivered anyway because there | |
44 | will frequently be multiple runnable tasks per CPU. In these cases, | |
45 | attempting to turn off the scheduling clock interrupt will have no effect | |
46 | other than increasing the overhead of switching to and from idle and | |
47 | transitioning between user and kernel execution. | |
48 | ||
49 | This mode of operation can be selected using CONFIG_HZ_PERIODIC=y (or | |
50 | CONFIG_NO_HZ=n for older kernels). | |
51 | ||
52 | However, if you are instead running a light workload with long idle | |
53 | periods, failing to omit scheduling-clock interrupts will result in | |
54 | excessive power consumption. This is especially bad on battery-powered | |
55 | devices, where it results in extremely short battery lifetimes. If you | |
56 | are running light workloads, you should therefore read the following | |
57 | section. | |
58 | ||
59 | In addition, if you are running either a real-time workload or an HPC | |
60 | workload with short iterations, the scheduling-clock interrupts can | |
61 | degrade your application's performance. If this describes your workload, | |
62 | you should read the following two sections. | |
63 | ||
64 | ||
458f69ef MCC |
65 | Omit Scheduling-Clock Ticks For Idle CPUs |
66 | ========================================= | |
0c87f9b5 PM |
67 | |
68 | If a CPU is idle, there is little point in sending it a scheduling-clock | |
69 | interrupt. After all, the primary purpose of a scheduling-clock interrupt | |
70 | is to force a busy CPU to shift its attention among multiple duties, | |
71 | and an idle CPU has no duties to shift its attention among. | |
72 | ||
73 | The CONFIG_NO_HZ_IDLE=y Kconfig option causes the kernel to avoid sending | |
74 | scheduling-clock interrupts to idle CPUs, which is critically important | |
75 | both to battery-powered devices and to highly virtualized mainframes. | |
76 | A battery-powered device running a CONFIG_HZ_PERIODIC=y kernel would | |
77 | drain its battery very quickly, easily 2-3 times as fast as would the | |
78 | same device running a CONFIG_NO_HZ_IDLE=y kernel. A mainframe running | |
79 | 1,500 OS instances might find that half of its CPU time was consumed by | |
80 | unnecessary scheduling-clock interrupts. In these situations, there | |
81 | is strong motivation to avoid sending scheduling-clock interrupts to | |
82 | idle CPUs. That said, dyntick-idle mode is not free: | |
83 | ||
84 | 1. It increases the number of instructions executed on the path | |
85 | to and from the idle loop. | |
86 | ||
87 | 2. On many architectures, dyntick-idle mode also increases the | |
88 | number of expensive clock-reprogramming operations. | |
89 | ||
90 | Therefore, systems with aggressive real-time response constraints often | |
91 | run CONFIG_HZ_PERIODIC=y kernels (or CONFIG_NO_HZ=n for older kernels) | |
92 | in order to avoid degrading from-idle transition latencies. | |
93 | ||
94 | An idle CPU that is not receiving scheduling-clock interrupts is said to | |
95 | be "dyntick-idle", "in dyntick-idle mode", "in nohz mode", or "running | |
96 | tickless". The remainder of this document will use "dyntick-idle mode". | |
97 | ||
98 | There is also a boot parameter "nohz=" that can be used to disable | |
99 | dyntick-idle mode in CONFIG_NO_HZ_IDLE=y kernels by specifying "nohz=off". | |
100 | By default, CONFIG_NO_HZ_IDLE=y kernels boot with "nohz=on", enabling | |
101 | dyntick-idle mode. | |
102 | ||
103 | ||
458f69ef MCC |
104 | Omit Scheduling-Clock Ticks For CPUs With Only One Runnable Task |
105 | ================================================================ | |
0c87f9b5 PM |
106 | |
107 | If a CPU has only one runnable task, there is little point in sending it | |
108 | a scheduling-clock interrupt because there is no other task to switch to. | |
295fde89 PM |
109 | Note that omitting scheduling-clock ticks for CPUs with only one runnable |
110 | task implies also omitting them for idle CPUs. | |
0c87f9b5 PM |
111 | |
112 | The CONFIG_NO_HZ_FULL=y Kconfig option causes the kernel to avoid | |
113 | sending scheduling-clock interrupts to CPUs with a single runnable task, | |
114 | and such CPUs are said to be "adaptive-ticks CPUs". This is important | |
115 | for applications with aggressive real-time response constraints because | |
116 | it allows them to improve their worst-case response times by the maximum | |
117 | duration of a scheduling-clock interrupt. It is also important for | |
118 | computationally intensive short-iteration workloads: If any CPU is | |
119 | delayed during a given iteration, all the other CPUs will be forced to | |
120 | wait idle while the delayed CPU finishes. Thus, the delay is multiplied | |
121 | by one less than the number of CPUs. In these situations, there is | |
122 | again strong motivation to avoid sending scheduling-clock interrupts. | |
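The multiplication of delay described above can be made concrete with a short calculation. This is only an illustrative sketch: the function name and the sample numbers are assumptions, not measurements.

```python
def wasted_cpu_time(num_cpus, delay_us):
    """CPU time spent waiting when one CPU is delayed in a lock-step iteration."""
    # Every CPU except the delayed one sits idle for the length of the delay.
    return (num_cpus - 1) * delay_us
```

With a hypothetical 64-CPU job and a 10-microsecond delay on one CPU, wasted_cpu_time(64, 10) gives 630 microseconds of idle CPU time for that single iteration.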
123 | ||
124 | By default, no CPU will be an adaptive-ticks CPU. The "nohz_full=" | |
125 | boot parameter specifies the adaptive-ticks CPUs. For example, | |
126 | "nohz_full=1,6-8" says that CPUs 1, 6, 7, and 8 are to be adaptive-ticks | |
127 | CPUs. Note that you are prohibited from marking all of the CPUs as | |
128 | adaptive-tick CPUs: At least one non-adaptive-tick CPU must remain | |
8bdf7a25 PM |
129 | online to handle timekeeping tasks in order to ensure that system |
130 | calls like gettimeofday() return accurate values on adaptive-tick CPUs. | |
131 | (This is not an issue for CONFIG_NO_HZ_IDLE=y because there are no running | |
132 | user processes to observe slight drifts in clock rate.) Therefore, the | |
133 | boot CPU is prohibited from entering adaptive-ticks mode. Specifying a | |
134 | "nohz_full=" mask that includes the boot CPU will result in a boot-time | |
135 | error message, and the boot CPU will be removed from the mask. Note that | |
136 | this means that your system must have at least two CPUs in order for | |
137 | CONFIG_NO_HZ_FULL=y to do anything for you. | |
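To illustrate how a "nohz_full=" value is interpreted, here is a minimal Python sketch that expands a list such as "1,6-8" and drops the boot CPU, mirroring the behavior described above. It handles only the comma-and-range forms shown in this document, the function names are hypothetical, and treating CPU 0 as the boot CPU is an assumption.

```python
def parse_cpu_list(spec):
    """Expand a list like "1,6-8" into a set of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # A range "first-last" includes both endpoints.
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def adaptive_tick_cpus(spec, boot_cpu=0):
    """Return the sorted adaptive-ticks CPUs, excluding the boot CPU."""
    # boot_cpu=0 is an assumption made for this sketch.
    return sorted(parse_cpu_list(spec) - {boot_cpu})
```

Under these assumptions, adaptive_tick_cpus("1,6-8") yields [1, 6, 7, 8], while adaptive_tick_cpus("0-3") yields [1, 2, 3] because the boot CPU is removed from the mask.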
0c87f9b5 | 138 | |
0c87f9b5 PM |
139 | Finally, adaptive-ticks CPUs must have their RCU callbacks offloaded. |
140 | This is covered in the "RCU Implications" section below. | |
141 | ||
142 | Normally, a CPU remains in adaptive-ticks mode as long as possible. | |
143 | In particular, transitioning to kernel mode does not automatically change | |
144 | the mode. Instead, the CPU will exit adaptive-ticks mode only if needed, | |
145 | for example, if that CPU enqueues an RCU callback. | |
146 | ||
147 | Just as with dyntick-idle mode, the benefits of adaptive-tick mode do | |
148 | not come for free: | |
149 | ||
150 | 1. CONFIG_NO_HZ_FULL selects CONFIG_NO_HZ_COMMON, so you cannot run | |
151 | adaptive ticks without also running dyntick idle. This dependency | |
152 | extends down into the implementation, so that all of the costs | |
153 | of CONFIG_NO_HZ_IDLE are also incurred by CONFIG_NO_HZ_FULL. | |
154 | ||
155 | 2. The user/kernel transitions are slightly more expensive due | |
156 | to the need to inform kernel subsystems (such as RCU) about | |
157 | the change in mode. | |
158 | ||
c2519784 PM |
159 | 3. POSIX CPU timers prevent CPUs from entering adaptive-tick mode. |
160 | Real-time applications needing to take actions based on CPU time | |
161 | consumption need to use other means of doing so. | |
0c87f9b5 PM |
162 | |
163 | 4. If there are more perf events pending than the hardware can | |
164 | accommodate, they are normally round-robined so as to collect | |
165 | all of them over time. Adaptive-tick mode may prevent this | |
166 | round-robining from happening. This will likely be fixed by | |
167 | preventing CPUs with large numbers of perf events pending from | |
168 | entering adaptive-tick mode. | |
169 | ||
170 | 5. Scheduler statistics for adaptive-tick CPUs may be computed | |
171 | slightly differently than those for non-adaptive-tick CPUs. | |
172 | This might in turn perturb load-balancing of real-time tasks. | |
173 | ||
0c87f9b5 PM |
174 | Although improvements are expected over time, adaptive ticks is quite |
175 | useful for many types of real-time and compute-intensive applications. | |
176 | However, the drawbacks listed above mean that adaptive ticks should not | |
177 | (yet) be enabled by default. | |
178 | ||
179 | ||
458f69ef MCC |
180 | RCU Implications |
181 | ================ | |
0c87f9b5 PM |
182 | |
183 | There are situations in which idle CPUs cannot be permitted to | |
184 | enter either dyntick-idle mode or adaptive-tick mode, the most | |
185 | common being when that CPU has RCU callbacks pending. | |
186 | ||
187 | The CONFIG_RCU_FAST_NO_HZ=y Kconfig option may be used to cause such CPUs | |
188 | to enter dyntick-idle mode or adaptive-tick mode anyway. In this case, | |
189 | a timer will awaken these CPUs every four jiffies in order to ensure | |
190 | that the RCU callbacks are processed in a timely fashion. | |
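Because a jiffy lasts 1/HZ seconds, the wall-clock length of the four-jiffy wakeup interval depends on the kernel's CONFIG_HZ setting. The following sketch shows the arithmetic; the function name is illustrative, and the HZ values used in the example are common choices rather than anything this document prescribes.

```python
def wakeup_interval_ms(hz, jiffies=4):
    """Length in milliseconds of a wakeup interval given in jiffies."""
    # One jiffy is 1/HZ seconds, that is, 1000/HZ milliseconds.
    return jiffies * 1000.0 / hz
```

For example, wakeup_interval_ms(250) is 16.0 and wakeup_interval_ms(1000) is 4.0, so a higher HZ value means more frequent wakeups for CPUs with pending callbacks.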
191 | ||
192 | Another approach is to offload RCU callback processing to "rcuo" kthreads | |
193 | using the CONFIG_RCU_NOCB_CPU=y Kconfig option. The specific CPUs to | |
44c65ff2 PM |
194 | offload may be selected using the "rcu_nocbs=" kernel boot parameter, |
195 | which takes a comma-separated list of CPUs and CPU ranges, for example, | |
196 | "1,3-5" selects CPUs 1, 3, 4, and 5. | |
0c87f9b5 PM |
197 | |
198 | The offloaded CPUs will never queue RCU callbacks, and therefore RCU | |
199 | never prevents offloaded CPUs from entering either dyntick-idle mode | |
200 | or adaptive-tick mode. That said, note that it is up to userspace to | |
201 | pin the "rcuo" kthreads to specific CPUs if desired. Otherwise, the | |
202 | scheduler will decide where to run them, which might or might not be | |
203 | where you want them to run. | |
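One possible way for userspace to pin the "rcuo" kthreads is sketched below in Python. It assumes that the kthreads can be recognized by a command name beginning with "rcuo" in /proc/<pid>/comm, that the script runs with sufficient privilege to change their affinity, and that the kernel permits it; the function names and the housekeeping CPU set in the usage note are hypothetical.

```python
import os


def find_kthreads(prefix="rcuo", proc_root="/proc"):
    """Return {pid: name} for tasks whose command name starts with prefix."""
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # the task exited while we were scanning
        if name.startswith(prefix):
            found[int(entry)] = name
    return found


def pin_kthreads(cpus, prefix="rcuo", proc_root="/proc"):
    """Restrict every matching kthread to the given set of CPUs."""
    for pid in find_kthreads(prefix, proc_root):
        # os.sched_setaffinity() is available on Linux only.
        os.sched_setaffinity(pid, cpus)
```

A call such as pin_kthreads({0}) would confine the matching kthreads to CPU 0, keeping them off the adaptive-ticks CPUs.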
204 | ||
205 | ||
458f69ef MCC |
206 | Testing |
207 | ======= | |
8bdf7a25 PM |
208 | |
209 | So you enable all the OS-jitter features described in this document, | |
210 | but do not see any change in your workload's behavior. Is this because | |
211 | your workload isn't affected that much by OS jitter, or is it because | |
212 | something else is in the way? This section helps answer this question | |
213 | by providing a simple OS-jitter test suite, which is available on branch | |
214 | master of the following git archive: | |
215 | ||
216 | git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git | |
217 | ||
218 | Clone this archive and follow the instructions in the README file. | |
219 | This test procedure will produce a trace that will allow you to evaluate | |
220 | whether or not you have succeeded in removing OS jitter from your system. | |
221 | If this trace shows that you have removed OS jitter as much as is | |
222 | possible, then you can conclude that your workload is not all that | |
223 | sensitive to OS jitter. | |
224 | ||
225 | Note: this test requires that your system have at least two CPUs. | |
226 | We do not currently have a good way to remove OS jitter from single-CPU | |
227 | systems. | |
228 | ||
229 | ||
458f69ef MCC |
230 | Known Issues |
231 | ============ | |
0c87f9b5 | 232 | |
458f69ef | 233 | * Dyntick-idle slows transitions to and from idle slightly. |
0c87f9b5 PM |
234 | In practice, this has not been a problem except for the most |
235 | aggressive real-time workloads, which have the option of disabling | |
236 | dyntick-idle mode, an option that most of them take. However, | |
237 | some workloads will no doubt want to use adaptive ticks to | |
238 | eliminate scheduling-clock interrupt latencies. Here are some | |
239 | options for these workloads: | |
240 | ||
241 | a. Use PMQOS from userspace to inform the kernel of your | |
242 | latency requirements (preferred). | |
243 | ||
244 | b. On x86 systems, use the "idle=mwait" boot parameter. | |
245 | ||
246 | c. On x86 systems, use the "intel_idle.max_cstate=" boot parameter | |
247 | to limit the maximum C-state depth. | |
248 | ||
249 | d. On x86 systems, use the "idle=poll" boot parameter. | |
250 | However, please note that use of this parameter can cause | |
251 | your CPU to overheat, which may cause thermal throttling | |
252 | to degrade your latencies -- and that this degradation can | |
253 | be even worse than that of dyntick-idle. Furthermore, | |
254 | this parameter effectively disables Turbo Mode on Intel | |
255 | CPUs, which can significantly reduce maximum performance. | |
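Option (a) above can be sketched from userspace as follows. This example assumes the PM QoS character device /dev/cpu_dma_latency, which is not named in this document, and writes the latency bound to it as a 32-bit binary value in microseconds; the function name is hypothetical.

```python
import struct


def hold_cpu_latency(max_latency_us, path="/dev/cpu_dma_latency"):
    """Request a CPU latency bound and return the open file holding it."""
    # Unbuffered binary mode so the value reaches the device immediately.
    f = open(path, "wb", buffering=0)
    # The bound is written as a native-endian signed 32-bit integer.
    f.write(struct.pack("=i", max_latency_us))
    return f
```

The request remains in effect only while the returned file object stays open, so the caller should keep a reference to it for as long as the latency bound is needed and close it afterwards.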
256 | ||
458f69ef | 257 | * Adaptive-ticks slows user/kernel transitions slightly. |
0c87f9b5 PM |
258 | This is not expected to be a problem for computationally intensive |
259 | workloads, which have few such transitions. Careful benchmarking | |
260 | will be required to determine whether or not other workloads | |
261 | are significantly affected by this effect. | |
262 | ||
458f69ef | 263 | * Adaptive-ticks does not do anything unless there is only one |
0c87f9b5 PM |
264 | runnable task for a given CPU, even though there are a number |
265 | of other situations where the scheduling-clock tick is not | |
266 | needed. To give but one example, consider a CPU that has one | |
267 | runnable high-priority SCHED_FIFO task and an arbitrary number | |
268 | of low-priority SCHED_OTHER tasks. In this case, the CPU is | |
269 | required to run the SCHED_FIFO task until it either blocks or | |
270 | some other higher-priority task awakens on (or is assigned to) | |
271 | this CPU, so there is no point in sending a scheduling-clock | |
272 | interrupt to this CPU. However, the current implementation | |
273 | nevertheless sends scheduling-clock interrupts to CPUs having a | |
274 | single runnable SCHED_FIFO task and multiple runnable SCHED_OTHER | |
275 | tasks, even though these interrupts are unnecessary. | |
276 | ||
ce5f4fc8 PM |
277 | And even when there are multiple runnable tasks on a given CPU, |
278 | there is little point in interrupting that CPU until the current | |
279 | running task's timeslice expires, which is almost always way | |
280 | longer than the time of the next scheduling-clock interrupt. | |
281 | ||
0c87f9b5 PM |
282 | Better handling of these sorts of situations is future work. |
283 | ||
458f69ef | 284 | * A reboot is required to reconfigure both adaptive ticks and RCU |
0c87f9b5 PM |
285 | callback offloading. Runtime reconfiguration could be provided |
286 | if needed; however, due to the complexity of reconfiguring RCU at | |
287 | runtime, there would need to be an earthshakingly good reason. | |
288 | Especially given that you have the straightforward option of | |
289 | simply offloading RCU callbacks from all CPUs and pinning them | |
290 | where you want them whenever you want them pinned. | |
291 | ||
458f69ef | 292 | * Additional configuration is required to deal with other sources |
0c87f9b5 PM |
293 | of OS jitter, including interrupts and system-utility tasks |
294 | and processes. This configuration normally involves binding | |
295 | interrupts and tasks to particular CPUs. | |
296 | ||
458f69ef | 297 | * Some sources of OS jitter can currently be eliminated only by |
0c87f9b5 PM |
298 | constraining the workload. For example, the only way to eliminate |
299 | OS jitter due to global TLB shootdowns is to avoid the unmapping | |
300 | operations (such as kernel module unload operations) that | |
301 | result in these shootdowns. For another example, page faults | |
302 | and TLB misses can be reduced (and in some cases eliminated) by | |
303 | using huge pages and by constraining the amount of memory used | |
304 | by the application. Pre-faulting the working set can also be | |
305 | helpful, especially when combined with the mlock() and mlockall() | |
306 | system calls. | |
307 | ||
458f69ef | 308 | * Unless all CPUs are idle, at least one CPU must keep the |
0c87f9b5 PM |
309 | scheduling-clock interrupt going in order to support accurate |
310 | timekeeping. | |
311 | ||
458f69ef | 312 | * If there might potentially be some adaptive-ticks CPUs, there |
ce5f4fc8 PM |
313 | will be at least one CPU keeping the scheduling-clock interrupt |
314 | going, even if all CPUs are otherwise idle. | |
315 | ||
316 | Better handling of this situation is ongoing work. | |
317 | ||
458f69ef | 318 | * Some process-handling operations still require the occasional |
ce5f4fc8 PM |
319 | scheduling-clock tick. These operations include calculating CPU |
320 | load, maintaining sched average, computing CFS entity vruntime, | |
321 | computing avenrun, and carrying out load balancing. They are | |
322 | currently accommodated by a scheduling-clock tick every second | |
323 | or so. On-going work will eliminate the need even for these | |
324 | infrequent scheduling-clock ticks. |