Commit | Line | Data |
---|---|---|
919e9e63 TG |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | .. _kernel_hacking_locktypes: | |
4 | ||
5 | ========================== | |
6 | Lock types and their rules | |
7 | ========================== | |
8 | ||
9 | Introduction | |
10 | ============ | |
11 | ||
12 | The kernel provides a variety of locking primitives which can be divided | |
1edcd467 | 13 | into three categories: |
919e9e63 TG |
14 | |
15 | - Sleeping locks | |
91710728 | 16 | - CPU local locks |
919e9e63 TG |
17 | - Spinning locks |
18 | ||
19 | This document conceptually describes these lock types and provides rules | |
20 | for their nesting, including the rules for use under PREEMPT_RT. | |
21 | ||
22 | ||
23 | Lock categories | |
24 | =============== | |
25 | ||
26 | Sleeping locks | |
27 | -------------- | |
28 | ||
29 | Sleeping locks can only be acquired in preemptible task context. | |
30 | ||
31 | Although implementations allow try_lock() from other contexts, it is | |
32 | necessary to carefully evaluate the safety of unlock() as well as of | |
33 | try_lock(). Furthermore, it is also necessary to evaluate the debugging | |
34 | versions of these primitives. In short, don't acquire sleeping locks from | |
35 | other contexts unless there is no other option. | |
36 | ||
37 | Sleeping lock types: | |
38 | ||
39 | - mutex | |
40 | - rt_mutex | |
41 | - semaphore | |
42 | - rw_semaphore | |
43 | - ww_mutex | |
44 | - percpu_rw_semaphore | |
45 | ||
46 | On PREEMPT_RT kernels, these lock types are converted to sleeping locks: | |
47 | ||
91710728 | 48 | - local_lock |
919e9e63 TG |
49 | - spinlock_t |
50 | - rwlock_t | |
51 | ||
91710728 TG |
52 | |
53 | CPU local locks | |
54 | --------------- | |
55 | ||
56 | - local_lock | |
57 | ||
58 | On non-PREEMPT_RT kernels, local_lock functions are wrappers around | |
59 | preemption and interrupt disabling primitives. Contrary to other locking | |
60 | mechanisms, disabling preemption or interrupts are pure CPU local | |
61 | concurrency control mechanisms and not suited for inter-CPU concurrency | |
62 | control. | |
63 | ||
64 | ||
919e9e63 TG |
65 | Spinning locks |
66 | -------------- | |
67 | ||
68 | - raw_spinlock_t | |
69 | - bit spinlocks | |
70 | ||
71 | On non-PREEMPT_RT kernels, these lock types are also spinning locks: | |
72 | ||
73 | - spinlock_t | |
74 | - rwlock_t | |
75 | ||
76 | Spinning locks implicitly disable preemption and the lock / unlock functions | |
77 | can have suffixes which apply further protections: | |
78 | ||
79 | =================== ==================================================== | |
80 | _bh() Disable / enable bottom halves (soft interrupts) | |
81 | _irq() Disable / enable interrupts | |
82 | _irqsave/restore() Save and disable / restore interrupt disabled state | |
83 | =================== ==================================================== | |
84 | ||
91710728 | 85 | |
7ecc6aa5 TG |
86 | Owner semantics |
87 | =============== | |
88 | ||
89 | The aforementioned lock types except semaphores have strict owner | |
90 | semantics: | |
91 | ||
92 | The context (task) that acquired the lock must release it. | |
93 | ||
94 | rw_semaphores have a special interface which allows non-owner release for | |
95 | readers. | |
96 | ||
919e9e63 TG |
97 | |
98 | rtmutex | |
99 | ======= | |
100 | ||
101 | RT-mutexes are mutexes with support for priority inheritance (PI). | |
102 | ||
51e69e65 | 103 | PI has limitations on non-PREEMPT_RT kernels due to preemption and |
919e9e63 TG |
104 | interrupt disabled sections. |
105 | ||
106 | PI clearly cannot preempt preemption-disabled or interrupt-disabled | |
107 | regions of code, even on PREEMPT_RT kernels. Instead, PREEMPT_RT kernels | |
108 | execute most such regions of code in preemptible task context, especially | |
109 | interrupt handlers and soft interrupts. This conversion allows spinlock_t | |
110 | and rwlock_t to be implemented via RT-mutexes. | |
111 | ||
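The priority-inheritance idea described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct pi_task, struct pi_mutex and the pi_mutex_*() helpers are hypothetical names invented for this illustration and are not the kernel's rt_mutex API. The model also assumes that a larger number means a higher priority and that only a single lock is involved, so no boosting chains are handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model types; not the kernel's rt_mutex implementation. */
struct pi_task {
	int normal_prio;	/* priority the task runs at on its own */
	int effective_prio;	/* priority after any boosting */
};

struct pi_mutex {
	struct pi_task *owner;	/* NULL while the lock is free */
};

/* Take a free lock and record the owner; a known owner is what makes boosting possible. */
void pi_mutex_lock(struct pi_mutex *m, struct pi_task *t)
{
	assert(m->owner == NULL);
	m->owner = t;
}

/* A waiter blocks on a held lock: lend its priority to the owner if it is higher. */
void pi_mutex_block(struct pi_mutex *m, const struct pi_task *waiter)
{
	assert(m->owner != NULL);
	if (waiter->effective_prio > m->owner->effective_prio)
		m->owner->effective_prio = waiter->effective_prio;
}

/* Release the lock: the owner drops back to its normal priority. */
void pi_mutex_unlock(struct pi_mutex *m)
{
	assert(m->owner != NULL);
	m->owner->effective_prio = m->owner->normal_prio;
	m->owner = NULL;
}
```

In this model a low-priority owner that blocks a high-priority waiter runs at the waiter's priority until it unlocks. The boost depends entirely on the owner pointer, which is why a lock without a recorded owner cannot take part in such a scheme.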
112 | ||
7ecc6aa5 TG |
113 | semaphore |
114 | ========= | |
115 | ||
116 | semaphore is a counting semaphore implementation. | |
117 | ||
118 | Semaphores are often used for both serialization and waiting, but new use | |
119 | cases should instead use separate serialization and wait mechanisms, such | |
120 | as mutexes and completions. | |
121 | ||
122 | semaphores and PREEMPT_RT | |
123 | ------------------------- | |
124 | ||
125 | PREEMPT_RT does not change the semaphore implementation because counting | |
126 | semaphores have no concept of owners, thus preventing PREEMPT_RT from | |
127 | providing priority inheritance for semaphores. After all, an unknown | |
128 | owner cannot be boosted. As a consequence, blocking on semaphores can | |
129 | result in priority inversion. | |
130 | ||
131 | ||
132 | rw_semaphore | |
133 | ============ | |
134 | ||
135 | rw_semaphore is a multiple readers and single writer lock mechanism. | |
136 | ||
137 | On non-PREEMPT_RT kernels the implementation is fair, thus preventing | |
138 | writer starvation. | |
139 | ||
140 | rw_semaphore complies by default with the strict owner semantics, but there | |
141 | exist special-purpose interfaces that allow non-owner release for readers. | |
142 | These interfaces work independently of the kernel configuration. | |
143 | ||
144 | rw_semaphore and PREEMPT_RT | |
145 | --------------------------- | |
146 | ||
147 | PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based | |
148 | implementation, thus changing the fairness: | |
149 | ||
150 | Because an rw_semaphore writer cannot grant its priority to multiple | |
151 | readers, a preempted low-priority reader will continue holding its lock, | |
152 | thus starving even high-priority writers. In contrast, because readers | |
153 | can grant their priority to a writer, a preempted low-priority writer will | |
154 | have its priority boosted until it releases the lock, thus preventing that | |
155 | writer from starving readers. | |
156 | ||
157 | ||
91710728 TG |
158 | local_lock |
159 | ========== | |
160 | ||
161 | local_lock provides a named scope to critical sections which are protected | |
162 | by disabling preemption or interrupts. | |
163 | ||
164 | On non-PREEMPT_RT kernels local_lock operations map to the preemption and | |
165 | interrupt disabling and enabling primitives: | |
166 | ||
94dea151 MR |
167 | =============================== ====================== |
168 | local_lock(&llock) preempt_disable() | |
169 | local_unlock(&llock) preempt_enable() | |
170 | local_lock_irq(&llock) local_irq_disable() | |
171 | local_unlock_irq(&llock) local_irq_enable() | |
172 | local_lock_irqsave(&llock) local_irq_save() | |
173 | local_unlock_irqrestore(&llock) local_irq_restore() | |
174 | =============================== ====================== | |
91710728 TG |
175 | |
176 | The named scope of local_lock has two advantages over the regular | |
177 | primitives: | |
178 | ||
179 | - The lock name allows static analysis and is also a clear documentation | |
180 | of the protection scope while the regular primitives are scopeless and | |
181 | opaque. | |
182 | ||
183 | - If lockdep is enabled the local_lock gains a lockmap which allows | |
184 | validating the correctness of the protection. This can detect cases where | |
185 | e.g. a function using preempt_disable() as protection mechanism is | |
186 | invoked from interrupt or soft-interrupt context. Apart from that, | |
187 | lockdep_assert_held(&llock) works as with any other locking primitive. | |
188 | ||
189 | local_lock and PREEMPT_RT | |
190 | ------------------------- | |
191 | ||
192 | PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing | |
193 | semantics: | |
194 | ||
195 | - All spinlock_t changes also apply to local_lock. | |
196 | ||
197 | local_lock usage | |
198 | ---------------- | |
199 | ||
200 | local_lock should be used in situations where disabling preemption or | |
201 | interrupts is the appropriate form of concurrency control to protect | |
202 | per-CPU data structures on a non-PREEMPT_RT kernel. | |
203 | ||
204 | local_lock is not suitable to protect against preemption or interrupts on a | |
205 | PREEMPT_RT kernel due to the PREEMPT_RT specific spinlock_t semantics. | |
206 | ||
207 | ||
919e9e63 TG |
208 | raw_spinlock_t and spinlock_t |
209 | ============================= | |
210 | ||
211 | raw_spinlock_t | |
212 | -------------- | |
213 | ||
217 | raw_spinlock_t is a strict spinning lock implementation in all kernels, | |
218 | including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical | |
51e69e65 | 219 | core code, low-level interrupt handling and places where disabling |
919e9e63 TG |
220 | preemption or interrupts is required, for example, to safely access |
221 | hardware state. raw_spinlock_t can sometimes also be used when the | |
222 | critical section is tiny, thus avoiding RT-mutex overhead. | |
223 | ||
224 | spinlock_t | |
225 | ---------- | |
226 | ||
7ecc6aa5 | 227 | The semantics of spinlock_t change with the state of PREEMPT_RT. |
919e9e63 | 228 | |
51e69e65 RD |
229 | On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has |
230 | exactly the same semantics. | |
919e9e63 TG |
231 | |
232 | spinlock_t and PREEMPT_RT | |
233 | ------------------------- | |
234 | ||
51e69e65 RD |
235 | On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation |
236 | based on rt_mutex which changes the semantics: | |
919e9e63 | 237 | |
51e69e65 | 238 | - Preemption is not disabled. |
919e9e63 TG |
239 | |
240 | - The hard interrupt related suffixes for spin_lock / spin_unlock | |
51e69e65 RD |
241 | operations (_irq, _irqsave / _irqrestore) do not affect the CPU's |
242 | interrupt disabled state. | |
919e9e63 TG |
243 | |
244 | - The soft interrupt related suffix (_bh()) still disables softirq | |
245 | handlers. | |
246 | ||
247 | Non-PREEMPT_RT kernels disable preemption to get this effect. | |
248 | ||
249 | PREEMPT_RT kernels use a per-CPU lock for serialization which keeps | |
250 | preemption enabled. The lock disables softirq handlers and also | |
251 | prevents reentrancy due to task preemption. | |
252 | ||
253 | PREEMPT_RT kernels preserve all other spinlock_t semantics: | |
254 | ||
255 | - Tasks holding a spinlock_t do not migrate. Non-PREEMPT_RT kernels | |
256 | avoid migration by disabling preemption. PREEMPT_RT kernels instead | |
257 | disable migration, which ensures that pointers to per-CPU variables | |
258 | remain valid even if the task is preempted. | |
259 | ||
260 | - Task state is preserved across spinlock acquisition, ensuring that the | |
261 | task-state rules apply to all kernel configurations. Non-PREEMPT_RT | |
262 | kernels leave task state untouched. However, PREEMPT_RT must change | |
263 | task state if the task blocks during acquisition. Therefore, it saves | |
264 | the current task state before blocking and the corresponding lock wakeup | |
7ecc6aa5 TG |
265 | restores it, as shown below:: |
266 | ||
267 | task->state = TASK_INTERRUPTIBLE | |
268 | lock() | |
269 | block() | |
270 | task->saved_state = task->state | |
271 | task->state = TASK_UNINTERRUPTIBLE | |
272 | schedule() | |
273 | lock wakeup | |
274 | task->state = task->saved_state | |
919e9e63 TG |
275 | |
276 | Other types of wakeups would normally unconditionally set the task state | |
277 | to RUNNING, but that does not work here because the task must remain | |
278 | blocked until the lock becomes available. Therefore, when a non-lock | |
279 | wakeup attempts to awaken a task blocked waiting for a spinlock, it | |
280 | instead sets the saved state to RUNNING. Then, when the lock | |
281 | acquisition completes, the lock wakeup sets the task state to the saved | |
7ecc6aa5 TG |
282 | state, in this case setting it to RUNNING:: |
283 | ||
284 | task->state = TASK_INTERRUPTIBLE | |
285 | lock() | |
286 | block() | |
287 | task->saved_state = task->state | |
288 | task->state = TASK_UNINTERRUPTIBLE | |
289 | schedule() | |
290 | non lock wakeup | |
291 | task->saved_state = TASK_RUNNING | |
292 | ||
293 | lock wakeup | |
294 | task->state = task->saved_state | |
295 | ||
296 | This ensures that the real wakeup cannot be lost. | |
297 | ||
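The task-state handling shown in the two diagrams above can be expressed as a small user-space state machine. This is a minimal sketch, not kernel code: struct model_task, the blocked_on_lock flag and the model_*() helpers are hypothetical names introduced for the illustration, and the three state constants merely mirror the names used in the diagrams.

```c
#include <assert.h>
#include <stdbool.h>

/* State names mirror the diagrams; the values are arbitrary in this model. */
enum task_state { TASK_RUNNING, TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE };

/* Hypothetical stand-in for the relevant task fields. */
struct model_task {
	enum task_state state;
	enum task_state saved_state;
	bool blocked_on_lock;	/* true while waiting for a contended lock */
};

/* The task blocks on a contended lock: save the state, then sleep. */
void model_lock_block(struct model_task *t)
{
	t->saved_state = t->state;
	t->state = TASK_UNINTERRUPTIBLE;
	t->blocked_on_lock = true;
}

/*
 * A non-lock wakeup. While the task waits for the lock, only the saved
 * state is updated, so the task stays blocked but the wakeup is not lost.
 */
void model_regular_wakeup(struct model_task *t)
{
	if (t->blocked_on_lock)
		t->saved_state = TASK_RUNNING;
	else
		t->state = TASK_RUNNING;
}

/* The lock wakeup: the lock was acquired, restore the saved state. */
void model_lock_wakeup(struct model_task *t)
{
	assert(t->blocked_on_lock);
	t->state = t->saved_state;
	t->blocked_on_lock = false;
}
```

Calling model_lock_block() and then model_lock_wakeup() on a task in TASK_INTERRUPTIBLE returns it to TASK_INTERRUPTIBLE, which corresponds to the first diagram. Inserting model_regular_wakeup() between the two calls leaves the task in TASK_RUNNING afterwards, which corresponds to the second.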
919e9e63 TG |
298 | |
299 | rwlock_t | |
300 | ======== | |
301 | ||
302 | rwlock_t is a multiple readers and single writer lock mechanism. | |
303 | ||
304 | Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the | |
305 | suffix rules of spinlock_t apply accordingly. The implementation is fair, | |
306 | thus preventing writer starvation. | |
307 | ||
308 | rwlock_t and PREEMPT_RT | |
309 | ----------------------- | |
310 | ||
311 | PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based | |
312 | implementation, thus changing semantics: | |
313 | ||
314 | - All the spinlock_t changes also apply to rwlock_t. | |
315 | ||
316 | - Because an rwlock_t writer cannot grant its priority to multiple | |
317 | readers, a preempted low-priority reader will continue holding its lock, | |
318 | thus starving even high-priority writers. In contrast, because readers | |
319 | can grant their priority to a writer, a preempted low-priority writer | |
320 | will have its priority boosted until it releases the lock, thus | |
321 | preventing that writer from starving readers. | |
322 | ||
323 | ||
324 | PREEMPT_RT caveats | |
325 | ================== | |
326 | ||
91710728 TG |
327 | local_lock on RT |
328 | ---------------- | |
329 | ||
330 | The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few | |
331 | implications. For example, on a non-PREEMPT_RT kernel the following code | |
332 | sequence works as expected:: | |
333 | ||
334 | local_lock_irq(&local_lock); | |
335 | raw_spin_lock(&lock); | |
336 | ||
337 | and is fully equivalent to:: | |
338 | ||
339 | raw_spin_lock_irq(&lock); | |
340 | ||
341 | On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq() | |
342 | is mapped to a per-CPU spinlock_t which neither disables interrupts nor | |
343 | preemption. The following code sequence works correctly on both | |
344 | PREEMPT_RT and non-PREEMPT_RT kernels:: | |
345 | ||
346 | local_lock_irq(&local_lock); | |
347 | spin_lock(&lock); | |
348 | ||
349 | Another caveat with local locks is that each local_lock has a specific | |
350 | protection scope. So the following substitution is wrong:: | |
351 | ||
352 | func1() | |
353 | { | |
354 | local_irq_save(flags); -> local_lock_irqsave(&local_lock_1, flags); | |
355 | func3(); | |
94dea151 | 356 | local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags); |
91710728 TG |
357 | } |
358 | ||
359 | func2() | |
360 | { | |
361 | local_irq_save(flags); -> local_lock_irqsave(&local_lock_2, flags); | |
362 | func3(); | |
94dea151 | 363 | local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags); |
91710728 TG |
364 | } |
365 | ||
366 | func3() | |
367 | { | |
368 | lockdep_assert_irqs_disabled(); | |
369 | access_protected_data(); | |
370 | } | |
371 | ||
372 | On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel | |
373 | local_lock_1 and local_lock_2 are distinct and cannot serialize the callers | |
374 | of func3(). Also the lockdep assert will trigger on a PREEMPT_RT kernel | |
375 | because local_lock_irqsave() does not disable interrupts due to the | |
376 | PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is:: | |
377 | ||
378 | func1() | |
379 | { | |
380 | local_irq_save(flags); -> local_lock_irqsave(&local_lock, flags); | |
381 | func3(); | |
94dea151 | 382 | local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags); |
91710728 TG |
383 | } |
384 | ||
385 | func2() | |
386 | { | |
387 | local_irq_save(flags); -> local_lock_irqsave(&local_lock, flags); | |
388 | func3(); | |
94dea151 | 389 | local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags); |
91710728 TG |
390 | } |
391 | ||
392 | func3() | |
393 | { | |
394 | lockdep_assert_held(&local_lock); | |
395 | access_protected_data(); | |
396 | } | |
397 | ||
398 | ||
919e9e63 TG |
399 | spinlock_t and rwlock_t |
400 | ----------------------- | |
401 | ||
91710728 | 402 | The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels |
919e9e63 TG |
403 | have a few implications. For example, on a non-PREEMPT_RT kernel the |
404 | following code sequence works as expected:: | |
405 | ||
406 | local_irq_disable(); | |
407 | spin_lock(&lock); | |
408 | ||
409 | and is fully equivalent to:: | |
410 | ||
411 | spin_lock_irq(&lock); | |
412 | ||
413 | The same applies to rwlock_t and the _irqsave() suffix variants. | |
414 | ||
415 | On a PREEMPT_RT kernel this code sequence breaks because RT-mutex requires a | |
416 | fully preemptible context. Instead, use spin_lock_irq() or | |
417 | spin_lock_irqsave() and their unlock counterparts. In cases where the | |
418 | interrupt disabling and locking must remain separate, PREEMPT_RT offers a | |
419 | local_lock mechanism. Acquiring the local_lock pins the task to a CPU, | |
51e69e65 RD |
420 | allowing things like per-CPU interrupt disabled locks to be acquired. |
421 | However, this approach should be used only where absolutely necessary. | |
919e9e63 | 422 | |
91710728 | 423 | A typical scenario is protection of per-CPU variables in thread context:: |
919e9e63 | 424 | |
91710728 TG |
425 | struct foo *p = get_cpu_ptr(&var1); |
426 | ||
427 | spin_lock(&p->lock); | |
428 | p->count += this_cpu_read(var2); | |
429 | ||
430 | This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel | |
431 | this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does | |
432 | not allow acquiring p->lock because get_cpu_ptr() implicitly disables | |
433 | preemption. The following substitution works on both kernels:: | |
434 | ||
435 | struct foo *p; | |
436 | ||
437 | migrate_disable(); | |
438 | p = this_cpu_ptr(&var1); | |
439 | spin_lock(&p->lock); | |
440 | p->count += this_cpu_read(var2); | |
441 | ||
442 | On a non-PREEMPT_RT kernel migrate_disable() maps to preempt_disable() | |
443 | which makes the above code fully equivalent. On a PREEMPT_RT kernel | |
444 | migrate_disable() ensures that the task is pinned on the current CPU which | |
445 | in turn guarantees that the per-CPU accesses to var1 and var2 stay on | |
446 | the same CPU. | |
447 | ||
448 | The migrate_disable() substitution is not valid for the following | |
449 | scenario:: | |
450 | ||
451 | func() | |
452 | { | |
453 | struct foo *p; | |
454 | ||
455 | migrate_disable(); | |
456 | p = this_cpu_ptr(&var1); | |
457 | p->val = func2(); | |
458 | ||
459 | While correct on a non-PREEMPT_RT kernel, this breaks on PREEMPT_RT because | |
460 | here migrate_disable() does not protect against reentrancy from a | |
461 | preempting task. A correct substitution for this case is:: | |
462 | ||
463 | func() | |
464 | { | |
465 | struct foo *p; | |
466 | ||
467 | local_lock(&foo_lock); | |
468 | p = this_cpu_ptr(&var1); | |
469 | p->val = func2(); | |
470 | ||
471 | On a non-PREEMPT_RT kernel this protects against reentrancy by disabling | |
472 | preemption. On a PREEMPT_RT kernel this is achieved by acquiring the | |
473 | underlying per-CPU spinlock. | |
474 | ||
475 | ||
476 | raw_spinlock_t on RT | |
477 | -------------------- | |
919e9e63 TG |
478 | |
479 | Acquiring a raw_spinlock_t disables preemption and possibly also | |
480 | interrupts, so the critical section must avoid acquiring a regular | |
481 | spinlock_t or rwlock_t; for example, the critical section must avoid | |
482 | allocating memory. Thus, on a non-PREEMPT_RT kernel the following code | |
483 | works perfectly:: | |
484 | ||
485 | raw_spin_lock(&lock); | |
486 | p = kmalloc(sizeof(*p), GFP_ATOMIC); | |
487 | ||
488 | But this code fails on PREEMPT_RT kernels because the memory allocator is | |
489 | fully preemptible and therefore cannot be invoked from truly atomic | |
490 | contexts. However, it is perfectly fine to invoke the memory allocator | |
491 | while holding normal non-raw spinlocks because they do not disable | |
492 | preemption on PREEMPT_RT kernels:: | |
493 | ||
494 | spin_lock(&lock); | |
495 | p = kmalloc(sizeof(*p), GFP_ATOMIC); | |
496 | ||
497 | ||
498 | bit spinlocks | |
499 | ------------- | |
500 | ||
7ecc6aa5 TG |
501 | PREEMPT_RT cannot substitute bit spinlocks because a single bit is too |
502 | small to accommodate an RT-mutex. Therefore, the semantics of bit | |
503 | spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t | |
504 | caveats also apply to bit spinlocks. | |
919e9e63 | 505 | |
7ecc6aa5 TG |
506 | Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT |
507 | using conditional (#ifdef'ed) code changes at the usage site. In contrast, | |
508 | usage-site changes are not needed for the spinlock_t substitution. | |
509 | Instead, conditionals in header files and the core locking implementation | |
510 | enable the compiler to do the substitution transparently. | |
919e9e63 TG |
511 | |
512 | ||
513 | Lock type nesting rules | |
514 | ======================= | |
515 | ||
516 | The most basic rules are: | |
517 | ||
91710728 TG |
518 | - Lock types of the same lock category (sleeping, CPU local, spinning) |
519 | can nest arbitrarily as long as they respect the general lock ordering | |
520 | rules to prevent deadlocks. | |
521 | ||
522 | - Sleeping lock types cannot nest inside CPU local and spinning lock types. | |
919e9e63 | 523 | |
91710728 | 524 | - CPU local and spinning lock types can nest inside sleeping lock types. |
919e9e63 | 525 | |
91710728 | 526 | - Spinning lock types can nest inside all lock types. |
919e9e63 | 527 | |
7ecc6aa5 | 528 | These constraints apply both in PREEMPT_RT and otherwise. |
919e9e63 | 529 | |
7ecc6aa5 | 530 | The fact that PREEMPT_RT changes the lock category of spinlock_t and |
91710728 TG |
531 | rwlock_t from spinning to sleeping and substitutes local_lock with a |
532 | per-CPU spinlock_t means that they cannot be acquired while holding a raw | |
533 | spinlock. This results in the following nesting ordering: | |
919e9e63 TG |
534 | |
535 | 1) Sleeping locks | |
91710728 | 536 | 2) spinlock_t, rwlock_t, local_lock |
919e9e63 TG |
537 | 3) raw_spinlock_t and bit spinlocks |
538 | ||
7ecc6aa5 TG |
539 | Lockdep will complain if these constraints are violated, both in |
540 | PREEMPT_RT and otherwise. |
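
The category conversions and the nesting ordering listed above can be summarized in a short user-space lookup. This is a sketch for illustration only: the enum names and the functions lock_category(), nesting_level() and may_nest() are hypothetical and do not exist in the kernel. The model only encodes the three-level ordering from this section and does not check the general lock ordering rules that prevent deadlocks.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical identifiers for the lock types named in this document. */
enum lock_type {
	LT_MUTEX, LT_RT_MUTEX, LT_SEMAPHORE, LT_RW_SEMAPHORE, LT_WW_MUTEX,
	LT_PERCPU_RW_SEMAPHORE, LT_LOCAL_LOCK, LT_SPINLOCK, LT_RWLOCK,
	LT_RAW_SPINLOCK, LT_BIT_SPINLOCK,
};

enum lock_category { CAT_SLEEPING, CAT_CPU_LOCAL, CAT_SPINNING };

/* Category of a lock type, depending on whether PREEMPT_RT is enabled. */
enum lock_category lock_category(enum lock_type t, bool preempt_rt)
{
	switch (t) {
	case LT_RAW_SPINLOCK:
	case LT_BIT_SPINLOCK:
		return CAT_SPINNING;	/* always spinning */
	case LT_SPINLOCK:
	case LT_RWLOCK:
		return preempt_rt ? CAT_SLEEPING : CAT_SPINNING;
	case LT_LOCAL_LOCK:
		return preempt_rt ? CAT_SLEEPING : CAT_CPU_LOCAL;
	default:
		return CAT_SLEEPING;	/* mutex, semaphore and friends */
	}
}

/* Position in the nesting ordering: 1 is outermost, 3 is innermost. */
int nesting_level(enum lock_type t)
{
	switch (t) {
	case LT_RAW_SPINLOCK:
	case LT_BIT_SPINLOCK:
		return 3;
	case LT_SPINLOCK:
	case LT_RWLOCK:
	case LT_LOCAL_LOCK:
		return 2;
	default:
		return 1;
	}
}

/* True if 'inner' may be acquired while 'outer' is held. */
bool may_nest(enum lock_type outer, enum lock_type inner)
{
	int o = nesting_level(outer);
	int i = nesting_level(inner);

	assert(o >= 1 && o <= 3 && i >= 1 && i <= 3);
	return i >= o;
}
```

For example, may_nest(LT_MUTEX, LT_SPINLOCK) is true, while may_nest(LT_RAW_SPINLOCK, LT_SPINLOCK) is false. The levels do not depend on the configuration, matching the statement that the constraints apply both with and without PREEMPT_RT.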