.. SPDX-License-Identifier: GPL-2.0

.. _kernel_hacking_locktypes:

==========================
Lock types and their rules
==========================

Introduction
============

The kernel provides a variety of locking primitives which can be divided
into three categories:

 - Sleeping locks
 - CPU local locks
 - Spinning locks

This document conceptually describes these lock types and provides rules
for their nesting, including the rules for use under PREEMPT_RT.


Lock categories
===============

Sleeping locks
--------------

Sleeping locks can only be acquired in preemptible task context.

Although implementations allow try_lock() from other contexts, it is
necessary to carefully evaluate the safety of unlock() as well as of
try_lock().  Furthermore, it is also necessary to evaluate the debugging
versions of these primitives.  In short, don't acquire sleeping locks from
other contexts unless there is no other option.

Sleeping lock types:

 - mutex
 - rt_mutex
 - semaphore
 - rw_semaphore
 - ww_mutex
 - percpu_rw_semaphore

On PREEMPT_RT kernels, these lock types are converted to sleeping locks:

 - local_lock
 - spinlock_t
 - rwlock_t


CPU local locks
---------------

 - local_lock

On non-PREEMPT_RT kernels, local_lock functions are wrappers around
preemption and interrupt disabling primitives. Contrary to other locking
mechanisms, disabling preemption or interrupts are pure CPU local
concurrency control mechanisms and not suited for inter-CPU concurrency
control.


Spinning locks
--------------

 - raw_spinlock_t
 - bit spinlocks

On non-PREEMPT_RT kernels, these lock types are also spinning locks:

 - spinlock_t
 - rwlock_t

Spinning locks implicitly disable preemption and the lock / unlock functions
can have suffixes which apply further protections:

 ===================  ====================================================
 _bh()                Disable / enable bottom halves (soft interrupts)
 _irq()               Disable / enable interrupts
 _irqsave/restore()   Save and disable / restore interrupt disabled state
 ===================  ====================================================


Owner semantics
===============

The aforementioned lock types except semaphores have strict owner
semantics:

  The context (task) that acquired the lock must release it.

rw_semaphores have a special interface which allows non-owner release for
readers.


rtmutex
=======

RT-mutexes are mutexes with support for priority inheritance (PI).

PI has limitations on non-PREEMPT_RT kernels due to preemption and
interrupt disabled sections.

PI clearly cannot preempt preemption-disabled or interrupt-disabled
regions of code, even on PREEMPT_RT kernels.  Instead, PREEMPT_RT kernels
execute most such regions of code in preemptible task context, especially
interrupt handlers and soft interrupts.  This conversion allows spinlock_t
and rwlock_t to be implemented via RT-mutexes.
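
The boosting idea behind priority inheritance can be sketched as a small
userspace toy model. This is not the kernel's rt_mutex implementation: the
names pi_task, pi_mutex, pi_block_on() and pi_unlock() are hypothetical, the
model tracks a single waiter only, and it assumes that a larger number means
a higher priority::

  #include <stddef.h>

  /* Toy model only: a larger number means a higher priority. */
  struct pi_task {
    int normal_prio;  /* priority assigned to the task */
    int prio;         /* effective priority, possibly boosted */
  };

  struct pi_mutex {
    struct pi_task *owner;  /* NULL while the lock is free */
  };

  /* A waiter blocks on the lock and lends its priority to the owner. */
  static void pi_block_on(struct pi_mutex *m, const struct pi_task *waiter)
  {
    if (m->owner && waiter->prio > m->owner->prio)
      m->owner->prio = waiter->prio;
  }

  /* The owner releases the lock and falls back to its own priority. */
  static void pi_unlock(struct pi_mutex *m)
  {
    if (m->owner)
      m->owner->prio = m->owner->normal_prio;
    m->owner = NULL;
  }

The sketch also shows why an owner is required: without a known owner there
is no task whose priority could be raised.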


semaphore
=========

semaphore is a counting semaphore implementation.

Semaphores are often used for both serialization and waiting, but new use
cases should instead use separate serialization and wait mechanisms, such
as mutexes and completions.

semaphores and PREEMPT_RT
-------------------------

PREEMPT_RT does not change the semaphore implementation because counting
semaphores have no concept of owners, thus preventing PREEMPT_RT from
providing priority inheritance for semaphores.  After all, an unknown
owner cannot be boosted. As a consequence, blocking on semaphores can
result in priority inversion.


rw_semaphore
============

rw_semaphore is a multiple readers and single writer lock mechanism.

On non-PREEMPT_RT kernels the implementation is fair, thus preventing
writer starvation.

rw_semaphore complies by default with the strict owner semantics, but there
exist special-purpose interfaces that allow non-owner release for readers.
These interfaces work independently of the kernel configuration.

rw_semaphore and PREEMPT_RT
---------------------------

PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based
implementation, thus changing the fairness:

 Because an rw_semaphore writer cannot grant its priority to multiple
 readers, a preempted low-priority reader will continue holding its lock,
 thus starving even high-priority writers.  In contrast, because readers
 can grant their priority to a writer, a preempted low-priority writer will
 have its priority boosted until it releases the lock, thus preventing that
 writer from starving readers.


local_lock
==========

local_lock provides a named scope to critical sections which are protected
by disabling preemption or interrupts.

On non-PREEMPT_RT kernels local_lock operations map to the preemption and
interrupt disabling and enabling primitives:

 ===============================  ======================
 local_lock(&llock)               preempt_disable()
 local_unlock(&llock)             preempt_enable()
 local_lock_irq(&llock)           local_irq_disable()
 local_unlock_irq(&llock)         local_irq_enable()
 local_lock_irqsave(&llock)       local_irq_save()
 local_unlock_irqrestore(&llock)  local_irq_restore()
 ===============================  ======================

The named scope of local_lock has two advantages over the regular
primitives:

  - The lock name allows static analysis and is also a clear documentation
    of the protection scope while the regular primitives are scopeless and
    opaque.

  - If lockdep is enabled the local_lock gains a lockmap which allows the
    correctness of the protection to be validated. This can detect cases
    where e.g. a function using preempt_disable() as protection mechanism
    is invoked from interrupt or soft-interrupt context. Aside from that,
    lockdep_assert_held(&llock) works as with any other locking primitive.

local_lock and PREEMPT_RT
-------------------------

PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing
semantics:

  - All spinlock_t changes also apply to local_lock.

local_lock usage
----------------

local_lock should be used in situations where disabling preemption or
interrupts is the appropriate form of concurrency control to protect
per-CPU data structures on a non-PREEMPT_RT kernel.

local_lock is not suitable to protect against preemption or interrupts on a
PREEMPT_RT kernel due to the PREEMPT_RT-specific spinlock_t semantics.


raw_spinlock_t and spinlock_t
=============================

raw_spinlock_t
--------------

raw_spinlock_t is a strict spinning lock implementation in all kernels,
including PREEMPT_RT kernels.  Use raw_spinlock_t only in truly critical
core code, low-level interrupt handling, and places where disabling
preemption or interrupts is required, for example, to safely access
hardware state.  raw_spinlock_t can sometimes also be used when the
critical section is tiny, thus avoiding RT-mutex overhead.

spinlock_t
----------

The semantics of spinlock_t change with the state of PREEMPT_RT.

On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has
exactly the same semantics.

spinlock_t and PREEMPT_RT
-------------------------

On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
based on rt_mutex which changes the semantics:

 - Preemption is not disabled.

 - The hard interrupt related suffixes for spin_lock / spin_unlock
   operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
   interrupt disabled state.

 - The soft interrupt related suffix (_bh()) still disables softirq
   handlers.

   Non-PREEMPT_RT kernels disable preemption to get this effect.

   PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
   preemption enabled. The lock disables softirq handlers and also
   prevents reentrancy due to task preemption.

PREEMPT_RT kernels preserve all other spinlock_t semantics:

 - Tasks holding a spinlock_t do not migrate.  Non-PREEMPT_RT kernels
   avoid migration by disabling preemption.  PREEMPT_RT kernels instead
   disable migration, which ensures that pointers to per-CPU variables
   remain valid even if the task is preempted.

 - Task state is preserved across spinlock acquisition, ensuring that the
   task-state rules apply to all kernel configurations.  Non-PREEMPT_RT
   kernels leave task state untouched.  However, PREEMPT_RT must change
   task state if the task blocks during acquisition.  Therefore, it saves
   the current task state before blocking and the corresponding lock wakeup
   restores it, as shown below::

    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
	 task->state = TASK_UNINTERRUPTIBLE
	 schedule()
					lock wakeup
					  task->state = task->saved_state

   Other types of wakeups would normally unconditionally set the task state
   to RUNNING, but that does not work here because the task must remain
   blocked until the lock becomes available.  Therefore, when a non-lock
   wakeup attempts to awaken a task blocked waiting for a spinlock, it
   instead sets the saved state to RUNNING.  Then, when the lock
   acquisition completes, the lock wakeup sets the task state to the saved
   state, in this case setting it to RUNNING::

    task->state = TASK_INTERRUPTIBLE
     lock()
       block()
         task->saved_state = task->state
	 task->state = TASK_UNINTERRUPTIBLE
	 schedule()
					non lock wakeup
					  task->saved_state = TASK_RUNNING

					lock wakeup
					  task->state = task->saved_state

   This ensures that the real wakeup cannot be lost.
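
The saved-state handling described above can be modelled in a few lines of
userspace C. This is only a sketch of the rule, not the scheduler code: the
type st_task, the field blocked_on_lock and the three helper functions are
hypothetical names chosen for illustration::

  enum st_state { ST_RUNNING, ST_INTERRUPTIBLE, ST_UNINTERRUPTIBLE };

  struct st_task {
    enum st_state state;
    enum st_state saved_state;
    int blocked_on_lock;  /* set while waiting for a spinlock_t */
  };

  /* The task blocks while acquiring a PREEMPT_RT spinlock_t. */
  static void st_lock_block(struct st_task *t)
  {
    t->saved_state = t->state;
    t->state = ST_UNINTERRUPTIBLE;
    t->blocked_on_lock = 1;
  }

  /* A regular wakeup that is not related to the lock. */
  static void st_non_lock_wakeup(struct st_task *t)
  {
    if (t->blocked_on_lock)
      t->saved_state = ST_RUNNING;  /* remember it, stay blocked */
    else
      t->state = ST_RUNNING;
  }

  /* The lock became available: restore the saved state. */
  static void st_lock_wakeup(struct st_task *t)
  {
    t->state = t->saved_state;
    t->blocked_on_lock = 0;
  }

With this model, a task that was TASK_INTERRUPTIBLE before blocking returns
to that state after a plain lock wakeup, while an intervening non-lock
wakeup leaves it RUNNING once the lock has been acquired.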


rwlock_t
========

rwlock_t is a multiple readers and single writer lock mechanism.

Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
suffix rules of spinlock_t apply accordingly. The implementation is fair,
thus preventing writer starvation.

rwlock_t and PREEMPT_RT
-----------------------

PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
implementation, thus changing semantics:

 - All the spinlock_t changes also apply to rwlock_t.

 - Because an rwlock_t writer cannot grant its priority to multiple
   readers, a preempted low-priority reader will continue holding its lock,
   thus starving even high-priority writers.  In contrast, because readers
   can grant their priority to a writer, a preempted low-priority writer
   will have its priority boosted until it releases the lock, thus
   preventing that writer from starving readers.


PREEMPT_RT caveats
==================

local_lock on RT
----------------

The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few
implications. For example, on a non-PREEMPT_RT kernel the following code
sequence works as expected::

  local_lock_irq(&local_lock);
  raw_spin_lock(&lock);

and is fully equivalent to::

   raw_spin_lock_irq(&lock);

On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq()
is mapped to a per-CPU spinlock_t which disables neither interrupts nor
preemption. The following code sequence works correctly on both
PREEMPT_RT and non-PREEMPT_RT kernels::

  local_lock_irq(&local_lock);
  spin_lock(&lock);

Another caveat with local locks is that each local_lock has a specific
protection scope. So the following substitution is wrong::

  func1()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock_1, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags);
  }

  func2()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock_2, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags);
  }

  func3()
  {
    lockdep_assert_irqs_disabled();
    access_protected_data();
  }

On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel
local_lock_1 and local_lock_2 are distinct and cannot serialize the callers
of func3(). Also the lockdep assert will trigger on a PREEMPT_RT kernel
because local_lock_irqsave() does not disable interrupts due to the
PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is::

  func1()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
  }

  func2()
  {
    local_irq_save(flags);    -> local_lock_irqsave(&local_lock, flags);
    func3();
    local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
  }

  func3()
  {
    lockdep_assert_held(&local_lock);
    access_protected_data();
  }


spinlock_t and rwlock_t
-----------------------

The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications.  For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::

   local_irq_disable();
   spin_lock(&lock);

and is fully equivalent to::

   spin_lock_irq(&lock);

Same applies to rwlock_t and the _irqsave() suffix variants.

On a PREEMPT_RT kernel this code sequence breaks because RT-mutex requires
a fully preemptible context.  Instead, use spin_lock_irq() or
spin_lock_irqsave() and their unlock counterparts.  In cases where the
interrupt disabling and locking must remain separate, PREEMPT_RT offers a
local_lock mechanism.  Acquiring the local_lock pins the task to a CPU,
allowing things like per-CPU interrupt disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.

A typical scenario is protection of per-CPU variables in thread context::

  struct foo *p = get_cpu_ptr(&var1);

  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
not allow p->lock to be acquired because get_cpu_ptr() implicitly disables
preemption. The following substitution works on both kernels::

  struct foo *p;

  migrate_disable();
  p = this_cpu_ptr(&var1);
  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

On a non-PREEMPT_RT kernel migrate_disable() maps to preempt_disable()
which makes the above code fully equivalent. On a PREEMPT_RT kernel
migrate_disable() ensures that the task is pinned on the current CPU which
in turn guarantees that the per-CPU accesses to var1 and var2 stay on
the same CPU.

The migrate_disable() substitution is not valid for the following
scenario::

  func()
  {
    struct foo *p;

    migrate_disable();
    p = this_cpu_ptr(&var1);
    p->val = func2();
    migrate_enable();
  }

While correct on a non-PREEMPT_RT kernel, this breaks on PREEMPT_RT because
here migrate_disable() does not protect against reentrancy from a
preempting task. A correct substitution for this case is::

  func()
  {
    struct foo *p;

    local_lock(&foo_lock);
    p = this_cpu_ptr(&var1);
    p->val = func2();
    local_unlock(&foo_lock);
  }

On a non-PREEMPT_RT kernel this protects against reentrancy by disabling
preemption. On a PREEMPT_RT kernel this is achieved by acquiring the
underlying per-CPU spinlock.


raw_spinlock_t on RT
--------------------

Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t.  For example, the critical section must avoid
allocating memory.  Thus, on a non-PREEMPT_RT kernel the following code
works perfectly::

  raw_spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);

But this code fails on PREEMPT_RT kernels because the memory allocator is
fully preemptible and therefore cannot be invoked from truly atomic
contexts.  However, it is perfectly fine to invoke the memory allocator
while holding normal non-raw spinlocks because they do not disable
preemption on PREEMPT_RT kernels::

  spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);


bit spinlocks
-------------

PREEMPT_RT cannot substitute bit spinlocks because a single bit is too
small to accommodate an RT-mutex.  Therefore, the semantics of bit
spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t
caveats also apply to bit spinlocks.

Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site.  In contrast,
usage-site changes are not needed for the spinlock_t substitution.
Instead, conditionals in header files and the core locking implementation
enable the compiler to do the substitution transparently.


Lock type nesting rules
=======================

The most basic rules are:

  - Lock types of the same lock category (sleeping, CPU local, spinning)
    can nest arbitrarily as long as they respect the general lock ordering
    rules to prevent deadlocks.

  - Sleeping lock types cannot nest inside CPU local and spinning lock types.

  - CPU local and spinning lock types can nest inside sleeping lock types.

  - Spinning lock types can nest inside all lock types.

These constraints apply both in PREEMPT_RT and otherwise.

The fact that PREEMPT_RT changes the lock category of spinlock_t and
rwlock_t from spinning to sleeping and substitutes local_lock with a
per-CPU spinlock_t means that they cannot be acquired while holding a raw
spinlock.  This results in the following nesting ordering:

  1) Sleeping locks
  2) spinlock_t, rwlock_t, local_lock
  3) raw_spinlock_t and bit spinlocks

Lockdep will complain if these constraints are violated, both in
PREEMPT_RT and otherwise.
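
As an illustration, the nesting ordering above can be expressed as a simple
level comparison. The following userspace sketch is a simplified model and
not how lockdep works internally: the enum, nest_level() and
nest_allowed() are hypothetical names, only a mutex represents the sleeping
locks, and the general lock ordering rules that prevent deadlocks are not
modelled::

  #include <stdbool.h>

  enum nest_lock {
    NEST_MUTEX,         /* stands in for all sleeping locks */
    NEST_SPINLOCK,
    NEST_RWLOCK,
    NEST_LOCAL_LOCK,
    NEST_RAW_SPINLOCK,
    NEST_BIT_SPINLOCK,
  };

  /* Level of a lock type in the nesting ordering (1 is outermost). */
  static int nest_level(enum nest_lock type)
  {
    switch (type) {
    case NEST_MUTEX:
      return 1;
    case NEST_SPINLOCK:
    case NEST_RWLOCK:
    case NEST_LOCAL_LOCK:
      return 2;
    case NEST_RAW_SPINLOCK:
    case NEST_BIT_SPINLOCK:
      return 3;
    }
    return 0;
  }

  /* A lock may be taken inside another lock of the same or a lower level. */
  static bool nest_allowed(enum nest_lock outer, enum nest_lock inner)
  {
    return nest_level(inner) >= nest_level(outer);
  }

For example, nest_allowed(NEST_MUTEX, NEST_SPINLOCK) is true, while
nest_allowed(NEST_RAW_SPINLOCK, NEST_SPINLOCK) is false.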
Commit	Line	Data
919e9e63 TG	1	.. SPDX-License-Identifier: GPL-2.0
	2
	3	.. _kernel_hacking_locktypes:
	4
	5	==========================
	6	Lock types and their rules
	7	==========================
	8
	9	Introduction
	10	============
	11
	12	The kernel provides a variety of locking primitives which can be divided
1edcd467	13	into three categories:
919e9e63 TG	14
919e9e63 TG	15	- Sleeping locks
91710728	16	- CPU local locks
919e9e63 TG	17	- Spinning locks
	18
	19	This document conceptually describes these lock types and provides rules
	20	for their nesting, including the rules for use under PREEMPT_RT.
	21
	22
	23	Lock categories
	24	===============
	25
	26	Sleeping locks
	27	--------------
	28
	29	Sleeping locks can only be acquired in preemptible task context.
	30
	31	Although implementations allow try_lock() from other contexts, it is
	32	necessary to carefully evaluate the safety of unlock() as well as of
	33	try_lock(). Furthermore, it is also necessary to evaluate the debugging
	34	versions of these primitives. In short, don't acquire sleeping locks from
	35	other contexts unless there is no other option.
	36
	37	Sleeping lock types:
	38
	39	- mutex
	40	- rt_mutex
	41	- semaphore
	42	- rw_semaphore
	43	- ww_mutex
	44	- percpu_rw_semaphore
	45
	46	On PREEMPT_RT kernels, these lock types are converted to sleeping locks:
	47
91710728	48	- local_lock
919e9e63 TG	49	- spinlock_t
	50	- rwlock_t
	51
91710728 TG	52
	53	CPU local locks
	54	---------------
	55
	56	- local_lock
	57
	58	On non-PREEMPT_RT kernels, local_lock functions are wrappers around
	59	preemption and interrupt disabling primitives. Contrary to other locking
	60	mechanisms, disabling preemption or interrupts are pure CPU local
	61	concurrency control mechanisms and not suited for inter-CPU concurrency
	62	control.
	63
	64
919e9e63 TG	65	Spinning locks
	66	--------------
	67
	68	- raw_spinlock_t
	69	- bit spinlocks
	70
	71	On non-PREEMPT_RT kernels, these lock types are also spinning locks:
	72
	73	- spinlock_t
	74	- rwlock_t
	75
	76	Spinning locks implicitly disable preemption and the lock / unlock functions
	77	can have suffixes which apply further protections:
	78
	79	=================== ====================================================
	80	_bh() Disable / enable bottom halves (soft interrupts)
	81	_irq() Disable / enable interrupts
	82	_irqsave/restore() Save and disable / restore interrupt disabled state
	83	=================== ====================================================
	84
91710728	85
7ecc6aa5 TG	86	Owner semantics
	87	===============
	88
	89	The aforementioned lock types except semaphores have strict owner
	90	semantics:
	91
	92	The context (task) that acquired the lock must release it.
	93
	94	rw_semaphores have a special interface which allows non-owner release for
	95	readers.
	96
919e9e63 TG	97
	98	rtmutex
	99	=======
	100
	101	RT-mutexes are mutexes with support for priority inheritance (PI).
	102
51e69e65	103	PI has limitations on non-PREEMPT_RT kernels due to preemption and
919e9e63 TG	104	interrupt disabled sections.
	105
	106	PI clearly cannot preempt preemption-disabled or interrupt-disabled
	107	regions of code, even on PREEMPT_RT kernels. Instead, PREEMPT_RT kernels
	108	execute most such regions of code in preemptible task context, especially
	109	interrupt handlers and soft interrupts. This conversion allows spinlock_t
	110	and rwlock_t to be implemented via RT-mutexes.
	111
	112
7ecc6aa5 TG	113	semaphore
	114	=========
	115
	116	semaphore is a counting semaphore implementation.
	117
	118	Semaphores are often used for both serialization and waiting, but new use
	119	cases should instead use separate serialization and wait mechanisms, such
	120	as mutexes and completions.
	121
	122	semaphores and PREEMPT_RT
	123	----------------------------
	124
	125	PREEMPT_RT does not change the semaphore implementation because counting
	126	semaphores have no concept of owners, thus preventing PREEMPT_RT from
	127	providing priority inheritance for semaphores. After all, an unknown
	128	owner cannot be boosted. As a consequence, blocking on semaphores can
	129	result in priority inversion.
	130
	131
	132	rw_semaphore
	133	============
	134
	135	rw_semaphore is a multiple readers and single writer lock mechanism.
	136
	137	On non-PREEMPT_RT kernels the implementation is fair, thus preventing
	138	writer starvation.
	139
	140	rw_semaphore complies by default with the strict owner semantics, but there
	141	exist special-purpose interfaces that allow non-owner release for readers.
	142	These interfaces work independent of the kernel configuration.
	143
	144	rw_semaphore and PREEMPT_RT
	145	---------------------------
	146
	147	PREEMPT_RT kernels map rw_semaphore to a separate rt_mutex-based
	148	implementation, thus changing the fairness:
	149
	150	Because an rw_semaphore writer cannot grant its priority to multiple
	151	readers, a preempted low-priority reader will continue holding its lock,
	152	thus starving even high-priority writers. In contrast, because readers
	153	can grant their priority to a writer, a preempted low-priority writer will
	154	have its priority boosted until it releases the lock, thus preventing that
	155	writer from starving readers.
	156
	157
91710728 TG	158	local_lock
	159	==========
	160
	161	local_lock provides a named scope to critical sections which are protected
	162	by disabling preemption or interrupts.
	163
	164	On non-PREEMPT_RT kernels local_lock operations map to the preemption and
	165	interrupt disabling and enabling primitives:
	166
94dea151 MR	167	=============================== ======================
	168	local_lock(&llock) preempt_disable()
	169	local_unlock(&llock) preempt_enable()
	170	local_lock_irq(&llock) local_irq_disable()
	171	local_unlock_irq(&llock) local_irq_enable()
	172	local_lock_irqsave(&llock) local_irq_save()
	173	local_unlock_irqrestore(&llock) local_irq_restore()
	174	=============================== ======================
91710728 TG	175
	176	The named scope of local_lock has two advantages over the regular
	177	primitives:
	178
	179	- The lock name allows static analysis and is also a clear documentation
	180	of the protection scope while the regular primitives are scopeless and
	181	opaque.
	182
	183	- If lockdep is enabled the local_lock gains a lockmap which allows to
	184	validate the correctness of the protection. This can detect cases where
	185	e.g. a function using preempt_disable() as protection mechanism is
	186	invoked from interrupt or soft-interrupt context. Aside of that
	187	lockdep_assert_held(&llock) works as with any other locking primitive.
	188
	189	local_lock and PREEMPT_RT
	190	-------------------------
	191
	192	PREEMPT_RT kernels map local_lock to a per-CPU spinlock_t, thus changing
	193	semantics:
	194
	195	- All spinlock_t changes also apply to local_lock.
	196
	197	local_lock usage
	198	----------------
	199
	200	local_lock should be used in situations where disabling preemption or
	201	interrupts is the appropriate form of concurrency control to protect
	202	per-CPU data structures on a non PREEMPT_RT kernel.
	203
	204	local_lock is not suitable to protect against preemption or interrupts on a
	205	PREEMPT_RT kernel due to the PREEMPT_RT specific spinlock_t semantics.
	206
	207
919e9e63 TG	208	raw_spinlock_t and spinlock_t
	209	=============================
	210
	211	raw_spinlock_t
	212	--------------
	213
	214	raw_spinlock_t is a strict spinning lock implementation regardless of the
	215	kernel configuration including PREEMPT_RT enabled kernels.
	216
	217	raw_spinlock_t is a strict spinning lock implementation in all kernels,
	218	including PREEMPT_RT kernels. Use raw_spinlock_t only in real critical
51e69e65	219	core code, low-level interrupt handling and places where disabling
919e9e63 TG	220	preemption or interrupts is required, for example, to safely access
	221	hardware state. raw_spinlock_t can sometimes also be used when the
	222	critical section is tiny, thus avoiding RT-mutex overhead.
	223
	224	spinlock_t
	225	----------
	226
7ecc6aa5	227	The semantics of spinlock_t change with the state of PREEMPT_RT.
919e9e63	228
51e69e65 RD	229	On a non-PREEMPT_RT kernel spinlock_t is mapped to raw_spinlock_t and has
51e69e65 RD	230	exactly the same semantics.
919e9e63 TG	231
	232	spinlock_t and PREEMPT_RT
	233	-------------------------
	234
51e69e65 RD	235	On a PREEMPT_RT kernel spinlock_t is mapped to a separate implementation
51e69e65 RD	236	based on rt_mutex which changes the semantics:
919e9e63	237
51e69e65	238	- Preemption is not disabled.
919e9e63 TG	239
919e9e63 TG	240	- The hard interrupt related suffixes for spin_lock / spin_unlock
51e69e65 RD	241	operations (_irq, _irqsave / _irqrestore) do not affect the CPU's
51e69e65 RD	242	interrupt disabled state.
919e9e63 TG	243
	244	- The soft interrupt related suffix (_bh()) still disables softirq
	245	handlers.
	246
	247	Non-PREEMPT_RT kernels disable preemption to get this effect.
	248
	249	PREEMPT_RT kernels use a per-CPU lock for serialization which keeps
	250	preemption disabled. The lock disables softirq handlers and also
	251	prevents reentrancy due to task preemption.
	252
	253	PREEMPT_RT kernels preserve all other spinlock_t semantics:
	254
	255	- Tasks holding a spinlock_t do not migrate. Non-PREEMPT_RT kernels
	256	avoid migration by disabling preemption. PREEMPT_RT kernels instead
	257	disable migration, which ensures that pointers to per-CPU variables
	258	remain valid even if the task is preempted.
	259
	260	- Task state is preserved across spinlock acquisition, ensuring that the
	261	task-state rules apply to all kernel configurations. Non-PREEMPT_RT
	262	kernels leave task state untouched. However, PREEMPT_RT must change
	263	task state if the task blocks during acquisition. Therefore, it saves
	264	the current task state before blocking and the corresponding lock wakeup
7ecc6aa5 TG	265	restores it, as shown below::
	266
	267	task->state = TASK_INTERRUPTIBLE
	268	lock()
	269	block()
	270	task->saved_state = task->state
	271	task->state = TASK_UNINTERRUPTIBLE
	272	schedule()
	273	lock wakeup
	274	task->state = task->saved_state
919e9e63 TG	275
	276	Other types of wakeups would normally unconditionally set the task state
	277	to RUNNING, but that does not work here because the task must remain
	278	blocked until the lock becomes available. Therefore, when a non-lock
	279	wakeup attempts to awaken a task blocked waiting for a spinlock, it
	280	instead sets the saved state to RUNNING. Then, when the lock
	281	acquisition completes, the lock wakeup sets the task state to the saved
7ecc6aa5 TG	282	state, in this case setting it to RUNNING::
	283
	284	task->state = TASK_INTERRUPTIBLE
	285	lock()
	286	block()
	287	task->saved_state = task->state
	288	task->state = TASK_UNINTERRUPTIBLE
	289	schedule()
	290	non lock wakeup
	291	task->saved_state = TASK_RUNNING
	292
	293	lock wakeup
	294	task->state = task->saved_state
	295
	296	This ensures that the real wakeup cannot be lost.
	297
919e9e63 TG	298
	299	rwlock_t
	300	========
	301
	302	rwlock_t is a multiple readers and single writer lock mechanism.
	303
	304	Non-PREEMPT_RT kernels implement rwlock_t as a spinning lock and the
	305	suffix rules of spinlock_t apply accordingly. The implementation is fair,
	306	thus preventing writer starvation.
	307
	308	rwlock_t and PREEMPT_RT
	309	-----------------------
	310
	311	PREEMPT_RT kernels map rwlock_t to a separate rt_mutex-based
	312	implementation, thus changing semantics:
	313
	314	- All the spinlock_t changes also apply to rwlock_t.
	315
	316	- Because an rwlock_t writer cannot grant its priority to multiple
	317	readers, a preempted low-priority reader will continue holding its lock,
	318	thus starving even high-priority writers. In contrast, because readers
	319	can grant their priority to a writer, a preempted low-priority writer
	320	will have its priority boosted until it releases the lock, thus
	321	preventing that writer from starving readers.
	322
	323
	324	PREEMPT_RT caveats
	325	==================
	326
91710728 TG	327	local_lock on RT
	328	----------------
	329
	330	The mapping of local_lock to spinlock_t on PREEMPT_RT kernels has a few
	331	implications. For example, on a non-PREEMPT_RT kernel the following code
	332	sequence works as expected::
	333
	334	local_lock_irq(&local_lock);
	335	raw_spin_lock(&lock);
	336
	337	and is fully equivalent to::
	338
	339	raw_spin_lock_irq(&lock);
	340
	341	On a PREEMPT_RT kernel this code sequence breaks because local_lock_irq()
	342	is mapped to a per-CPU spinlock_t which neither disables interrupts nor
	343	preemption. The following code sequence works perfectly correct on both
	344	PREEMPT_RT and non-PREEMPT_RT kernels::
	345
	346	local_lock_irq(&local_lock);
	347	spin_lock(&lock);
	348
	349	Another caveat with local locks is that each local_lock has a specific
	350	protection scope. So the following substitution is wrong::
	351
	352	func1()
	353	{
	354	local_irq_save(flags); -> local_lock_irqsave(&local_lock_1, flags);
	355	func3();
94dea151	356	local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_1, flags);
91710728 TG	357	}
	358
	359	func2()
	360	{
	361	local_irq_save(flags); -> local_lock_irqsave(&local_lock_2, flags);
	362	func3();
94dea151	363	local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock_2, flags);
91710728 TG	364	}
	365
	366	func3()
	367	{
	368	lockdep_assert_irqs_disabled();
	369	access_protected_data();
	370	}
	371
	372	On a non-PREEMPT_RT kernel this works correctly, but on a PREEMPT_RT kernel
	373	local_lock_1 and local_lock_2 are distinct and cannot serialize the callers
	374	of func3(). Also the lockdep assert will trigger on a PREEMPT_RT kernel
	375	because local_lock_irqsave() does not disable interrupts due to the
	376	PREEMPT_RT-specific semantics of spinlock_t. The correct substitution is::
	377
	378	func1()
	379	{
	380	local_irq_save(flags); -> local_lock_irqsave(&local_lock, flags);
	381	func3();
94dea151	382	local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
91710728 TG	383	}
	384
	385	func2()
	386	{
	387	local_irq_save(flags); -> local_lock_irqsave(&local_lock, flags);
	388	func3();
94dea151	389	local_irq_restore(flags); -> local_unlock_irqrestore(&local_lock, flags);
91710728 TG	390	}
	391
	392	func3()
	393	{
	394	lockdep_assert_held(&local_lock);
	395	access_protected_data();
	396	}
	397
	398
spinlock_t and rwlock_t
-----------------------

The changes in spinlock_t and rwlock_t semantics on PREEMPT_RT kernels
have a few implications. For example, on a non-PREEMPT_RT kernel the
following code sequence works as expected::

  local_irq_disable();
  spin_lock(&lock);

and is fully equivalent to::

  spin_lock_irq(&lock);

The same applies to rwlock_t and the _irqsave() suffix variants.

On a PREEMPT_RT kernel the first code sequence (local_irq_disable()
followed by spin_lock()) breaks because the RT-mutex requires a fully
preemptible context. Instead, use spin_lock_irq() or
spin_lock_irqsave() and their unlock counterparts. In cases where the
interrupt disabling and locking must remain separate, PREEMPT_RT offers a
local_lock mechanism. Acquiring the local_lock pins the task to a CPU,
allowing things like per-CPU interrupt disabled locks to be acquired.
However, this approach should be used only where absolutely necessary.

A typical scenario is protection of per-CPU variables in thread context::

  struct foo *p = get_cpu_ptr(&var1);

  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
not allow acquiring p->lock because get_cpu_ptr() implicitly disables
preemption. The following substitution works on both kernels::

  struct foo *p;

  migrate_disable();
  p = this_cpu_ptr(&var1);
  spin_lock(&p->lock);
  p->count += this_cpu_read(var2);

On a non-PREEMPT_RT kernel migrate_disable() maps to preempt_disable()
which makes the above code fully equivalent. On a PREEMPT_RT kernel
migrate_disable() ensures that the task is pinned on the current CPU which
in turn guarantees that the per-CPU accesses to var1 and var2 stay on
the same CPU.

The migrate_disable() substitution is not valid for the following
scenario::

  func()
  {
    struct foo *p;

    migrate_disable();
    p = this_cpu_ptr(&var1);
    p->val = func2();
    migrate_enable();
  }

While correct on a non-PREEMPT_RT kernel, this breaks on PREEMPT_RT because
here migrate_disable() does not protect against reentrancy from a
preempting task. A correct substitution for this case is::

  func()
  {
    struct foo *p;

    local_lock(&foo_lock);
    p = this_cpu_ptr(&var1);
    p->val = func2();
    local_unlock(&foo_lock);
  }

On a non-PREEMPT_RT kernel this protects against reentrancy by disabling
preemption. On a PREEMPT_RT kernel this is achieved by acquiring the
underlying per-CPU spinlock.


raw_spinlock_t on RT
--------------------

Acquiring a raw_spinlock_t disables preemption and possibly also
interrupts, so the critical section must avoid acquiring a regular
spinlock_t or rwlock_t. For example, the critical section must avoid
allocating memory. Thus, on a non-PREEMPT_RT kernel the following code
works perfectly::

  raw_spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);

But this code fails on PREEMPT_RT kernels because the memory allocator is
fully preemptible and therefore cannot be invoked from truly atomic
contexts. However, it is perfectly fine to invoke the memory allocator
while holding normal non-raw spinlocks because they do not disable
preemption on PREEMPT_RT kernels::

  spin_lock(&lock);
  p = kmalloc(sizeof(*p), GFP_ATOMIC);


bit spinlocks
-------------

PREEMPT_RT cannot substitute bit spinlocks because a single bit is too
small to accommodate an RT-mutex. Therefore, the semantics of bit
spinlocks are preserved on PREEMPT_RT kernels, so that the raw_spinlock_t
caveats also apply to bit spinlocks.

Some bit spinlocks are replaced with regular spinlock_t for PREEMPT_RT
using conditional (#ifdef'ed) code changes at the usage site. In contrast,
usage-site changes are not needed for the spinlock_t substitution.
Instead, conditionals in header files and the core locking implementation
enable the compiler to do the substitution transparently.


Lock type nesting rules
=======================

The most basic rules are:

 - Lock types of the same lock category (sleeping, CPU local, spinning)
   can nest arbitrarily as long as they respect the general lock ordering
   rules to prevent deadlocks.

 - Sleeping lock types cannot nest inside CPU local and spinning lock types.

 - CPU local and spinning lock types can nest inside sleeping lock types.

 - Spinning lock types can nest inside all lock types.

These constraints apply both in PREEMPT_RT and otherwise.

The fact that PREEMPT_RT changes the lock category of spinlock_t and
rwlock_t from spinning to sleeping and substitutes local_lock with a
per-CPU spinlock_t means that they cannot be acquired while holding a raw
spinlock. This results in the following nesting ordering:

 1) Sleeping locks
 2) spinlock_t, rwlock_t, local_lock
 3) raw_spinlock_t and bit spinlocks

Lockdep will complain if these constraints are violated, both in
PREEMPT_RT and otherwise.
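
The nesting ordering can be written down as a small lookup. The following
is a userspace sketch of the three-level list only, not of how lockdep is
implemented: enum lock_type, nesting_level() and may_nest() are hypothetical
names, LOCK_MUTEX stands for any sleeping lock, and the ordinary lock
ordering rules that prevent deadlocks are deliberately left out:

```c
#include <stdbool.h>

/* Lock types named in the text; the enum itself is hypothetical. */
enum lock_type {
	LOCK_MUTEX,		/* stands for any sleeping lock */
	LOCK_SPINLOCK_T,
	LOCK_RWLOCK_T,
	LOCK_LOCAL_LOCK,
	LOCK_RAW_SPINLOCK_T,
	LOCK_BIT_SPINLOCK,
};

/* Position in the nesting ordering: 1 is outermost, 3 is innermost. */
static int nesting_level(enum lock_type t)
{
	switch (t) {
	case LOCK_MUTEX:
		return 1;
	case LOCK_SPINLOCK_T:
	case LOCK_RWLOCK_T:
	case LOCK_LOCAL_LOCK:
		return 2;
	case LOCK_RAW_SPINLOCK_T:
	case LOCK_BIT_SPINLOCK:
		return 3;
	}
	return 0;	/* not reached for valid enum values */
}

/* True if @inner may be acquired while @outer is held. */
static bool may_nest(enum lock_type outer, enum lock_type inner)
{
	return nesting_level(inner) >= nesting_level(outer);
}
```

For example, may_nest(LOCK_MUTEX, LOCK_SPINLOCK_T) is true, while
may_nest(LOCK_RAW_SPINLOCK_T, LOCK_SPINLOCK_T) is false, which matches the
rule that spinlock_t cannot be acquired while holding a raw spinlock.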