[linux-block.git] / Documentation / RCU / listRCU.rst

.. _list_rcu_doc:

Using RCU to Protect Read-Mostly Linked Lists
=============================================

One of the most common uses of RCU is protecting read-mostly linked lists
(``struct list_head`` in list.h).  One big advantage of this approach is
that all of the required memory ordering is provided by the list macros.
This document describes several list-based RCU use cases.


Example 1: Read-mostly list: Deferred Destruction
-------------------------------------------------

A widely used usecase for RCU lists in the kernel is lockless iteration over
all processes in the system. ``task_struct::tasks`` represents the list node that
links all the processes. The list can be traversed in parallel to any list
additions or removals.

The traversal of the list is done using ``for_each_process()`` which is defined
by the 2 macros::

	#define next_task(p) \
		list_entry_rcu((p)->tasks.next, struct task_struct, tasks)

	#define for_each_process(p) \
		for (p = &init_task ; (p = next_task(p)) != &init_task ; )

The code traversing the list of all processes typically looks like::

	rcu_read_lock();
	for_each_process(p) {
		/* Do something with p */
	}
	rcu_read_unlock();

The simplified and heavily inlined code for removing a process from a
task list is::

	void release_task(struct task_struct *p)
	{
		write_lock(&tasklist_lock);
		list_del_rcu(&p->tasks);
		write_unlock(&tasklist_lock);
		call_rcu(&p->rcu, delayed_put_task_struct);
	}

When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)``
via __exit_signal() and __unhash_process() under ``tasklist_lock``
writer lock protection.  The list_del_rcu() invocation removes
the task from the list of all tasks. The ``tasklist_lock``
prevents concurrent list additions/removals from corrupting the
list. Readers using ``for_each_process()`` are not protected with the
``tasklist_lock``. To prevent readers from noticing changes in the list
pointers, the ``task_struct`` object is freed only after one or more
grace periods elapse, with the help of call_rcu(), which is invoked via
put_task_struct_rcu_user(). This deferring of destruction ensures that
any readers traversing the list will see valid ``p->tasks.next`` pointers
and deletion/freeing can happen in parallel with traversal of the list.
This pattern is also called an **existence lock**, since RCU refrains
from invoking the delayed_put_task_struct() callback function until
all existing readers finish, which guarantees that the ``task_struct``
object in question will remain in existence until after the completion
of all RCU readers that might possibly have a reference to that object.


Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates
----------------------------------------------------------------------

Some reader-writer locking use cases compute a value while holding
the read-side lock, but continue to use that value after that lock is
released.  These use cases are often good candidates for conversion
to RCU.  One prominent example involves network packet routing.
Because the packet-routing data tracks the state of equipment outside
of the computer, it will at times contain stale data.  Therefore, once
the route has been computed, there is no need to hold the routing table
static during transmission of the packet.  After all, you can hold the
routing table static all you want, but that won't keep the external
Internet from changing, and it is the state of the external Internet
that really matters.  In addition, routing entries are typically added
or deleted, rather than being modified in place.  This is a rare example
of the finite speed of light and the non-zero size of atoms actually
helping make synchronization be lighter weight.

A straightforward example of this type of RCU use case may be found in
the system-call auditing support.  For example, a reader-writer locked
implementation of ``audit_filter_task()`` might be as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
	{
		struct audit_entry *e;
		enum audit_state   state;

		read_lock(&auditsc_lock);
		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				if (state == AUDIT_STATE_RECORD)
					*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
				read_unlock(&auditsc_lock);
				return state;
			}
		}
		read_unlock(&auditsc_lock);
		return AUDIT_BUILD_CONTEXT;
	}

Here the list is searched under the lock, but the lock is dropped before
the corresponding value is returned.  By the time that this value is acted
on, the list may well have been modified.  This makes sense, since if
you are turning auditing off, it is OK to audit a few extra system calls.

This means that RCU can be easily applied to the read side, as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk, char **key)
	{
		struct audit_entry *e;
		enum audit_state   state;

		rcu_read_lock();
		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry_rcu(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				if (state == AUDIT_STATE_RECORD)
					*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
				rcu_read_unlock();
				return state;
			}
		}
		rcu_read_unlock();
		return AUDIT_BUILD_CONTEXT;
	}

The read_lock() and read_unlock() calls have become rcu_read_lock()
and rcu_read_unlock(), respectively, and the list_for_each_entry()
has become list_for_each_entry_rcu().  The **_rcu()** list-traversal
primitives add READ_ONCE() and diagnostic checks for incorrect use
outside of an RCU read-side critical section.

The changes to the update side are also straightforward. A reader-writer lock
might be used as follows for deletion and insertion in these simplified
versions of audit_del_rule() and audit_add_rule()::

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		write_lock(&auditsc_lock);
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				list_del(&e->list);
				write_unlock(&auditsc_lock);
				return 0;
			}
		}
		write_unlock(&auditsc_lock);
		return -EFAULT;		/* No matching rule */
	}

	static inline int audit_add_rule(struct audit_entry *entry,
					 struct list_head *list)
	{
		write_lock(&auditsc_lock);
		if (entry->rule.flags & AUDIT_PREPEND) {
			entry->rule.flags &= ~AUDIT_PREPEND;
			list_add(&entry->list, list);
		} else {
			list_add_tail(&entry->list, list);
		}
		write_unlock(&auditsc_lock);
		return 0;
	}

Following are the RCU equivalents for these two functions::

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		/* No need to use the _rcu iterator here, since this is the only
		 * deletion routine. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				list_del_rcu(&e->list);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;		/* No matching rule */
	}

	static inline int audit_add_rule(struct audit_entry *entry,
					 struct list_head *list)
	{
		if (entry->rule.flags & AUDIT_PREPEND) {
			entry->rule.flags &= ~AUDIT_PREPEND;
			list_add_rcu(&entry->list, list);
		} else {
			list_add_tail_rcu(&entry->list, list);
		}
		return 0;
	}

Normally, the write_lock() and write_unlock() would be replaced by a
spin_lock() and a spin_unlock(). But in this case, all callers hold
``audit_filter_mutex``, so no additional locking is required. The
auditsc_lock can therefore be eliminated, since use of RCU eliminates the
need for writers to exclude readers.

The list_del(), list_add(), and list_add_tail() primitives have been
replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
The **_rcu()** list-manipulation primitives add memory barriers that are
needed on weakly ordered CPUs.  The list_del_rcu() primitive omits the
pointer poisoning debug-assist code that would otherwise cause concurrent
readers to fail spectacularly.

So, when readers can tolerate stale data and when entries are either added or
deleted, without in-place modification, it is very easy to use RCU!


Example 3: Handling In-Place Updates
------------------------------------

The system-call auditing code does not update auditing rules in place.  However,
if it did, the reader-writer-locked code to do so might look as follows
(assuming only ``field_count`` is updated, otherwise, the added fields would
need to be filled in)::

	static inline int audit_upd_rule(struct audit_rule *rule,
					 struct list_head *list,
					 __u32 newaction,
					 __u32 newfield_count)
	{
		struct audit_entry *e;
		struct audit_entry *ne;

		write_lock(&auditsc_lock);
		/* Note: audit_filter_mutex held by caller. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				e->rule.action = newaction;
				e->rule.field_count = newfield_count;
				write_unlock(&auditsc_lock);
				return 0;
			}
		}
		write_unlock(&auditsc_lock);
		return -EFAULT;		/* No matching rule */
	}

The RCU version creates a copy, updates the copy, then replaces the old
entry with the newly updated entry.  This sequence of actions, allowing
concurrent reads while making a copy to perform an update, is what gives
RCU (*read-copy update*) its name.

The RCU version of audit_upd_rule() is as follows::

	static inline int audit_upd_rule(struct audit_rule *rule,
					 struct list_head *list,
					 __u32 newaction,
					 __u32 newfield_count)
	{
		struct audit_entry *e;
		struct audit_entry *ne;

		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				ne = kmalloc(sizeof(*entry), GFP_ATOMIC);
				if (ne == NULL)
					return -ENOMEM;
				audit_copy_rule(&ne->rule, &e->rule);
				ne->rule.action = newaction;
				ne->rule.field_count = newfield_count;
				list_replace_rcu(&e->list, &ne->list);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;		/* No matching rule */
	}

Again, this assumes that the caller holds ``audit_filter_mutex``.  Normally, the
writer lock would become a spinlock in this sort of code.

The update_lsm_rule() does something very similar, for those who would
prefer to look at real Linux-kernel code.

Another use of this pattern can be found in the openswitch driver's *connection
tracking table* code in ``ct_limit_set()``.  The table holds connection tracking
entries and has a limit on the maximum entries.  There is one such table
per-zone and hence one *limit* per zone.  The zones are mapped to their limits
through a hashtable using an RCU-managed hlist for the hash chains. When a new
limit is set, a new limit object is allocated and ``ct_limit_set()`` is called
to replace the old limit object with the new one using list_replace_rcu().
The old limit object is then freed after a grace period using kfree_rcu().


Example 4: Eliminating Stale Data
---------------------------------

The auditing example above tolerates stale data, as do most algorithms
that are tracking external state.  After all, given there is a delay
from the time the external state changes before Linux becomes aware
of the change, and so as noted earlier, a small quantity of additional
RCU-induced staleness is generally not a problem.

However, there are many examples where stale data cannot be tolerated.
One example in the Linux kernel is the System V IPC (see the shm_lock()
function in ipc/shm.c).  This code checks a *deleted* flag under a
per-entry spinlock, and, if the *deleted* flag is set, pretends that the
entry does not exist.  For this to be helpful, the search function must
return holding the per-entry spinlock, as shm_lock() does in fact do.

.. _quick_quiz:

Quick Quiz:
	For the deleted-flag technique to be helpful, why is it necessary
	to hold the per-entry lock while returning from the search function?

:ref:`Answer to Quick Quiz <quick_quiz_answer>`

If the system-call audit module were to ever need to reject stale data, one way
to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the
``audit_entry`` structure, and modify audit_filter_task() as follows::

	static enum audit_state audit_filter_task(struct task_struct *tsk)
	{
		struct audit_entry *e;
		enum audit_state   state;

		rcu_read_lock();
		list_for_each_entry_rcu(e, &audit_tsklist, list) {
			if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
				spin_lock(&e->lock);
				if (e->deleted) {
					spin_unlock(&e->lock);
					rcu_read_unlock();
					return AUDIT_BUILD_CONTEXT;
				}
				rcu_read_unlock();
				if (state == AUDIT_STATE_RECORD)
					*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
				return state;
			}
		}
		rcu_read_unlock();
		return AUDIT_BUILD_CONTEXT;
	}

The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
spinlock as follows::

	static inline int audit_del_rule(struct audit_rule *rule,
					 struct list_head *list)
	{
		struct audit_entry *e;

		/* No need to use the _rcu iterator here, since this
		 * is the only deletion routine. */
		list_for_each_entry(e, list, list) {
			if (!audit_compare_rule(rule, &e->rule)) {
				spin_lock(&e->lock);
				list_del_rcu(&e->list);
				e->deleted = 1;
				spin_unlock(&e->lock);
				call_rcu(&e->rcu, audit_free_rule);
				return 0;
			}
		}
		return -EFAULT;		/* No matching rule */
	}

This too assumes that the caller holds ``audit_filter_mutex``.

Note that this example assumes that entries are only added and deleted.
Additional mechanism is required to deal correctly with the update-in-place
performed by audit_upd_rule().  For one thing, audit_upd_rule() would
need to hold the locks of both the old ``audit_entry`` and its replacement
while executing the list_replace_rcu().


Example 5: Skipping Stale Objects
---------------------------------

For some use cases, reader performance can be improved by skipping
stale objects during read-side list traversal, where stale objects
are those that will be removed and destroyed after one or more grace
periods. One such example can be found in the timerfd subsystem. When a
``CLOCK_REALTIME`` clock is reprogrammed (for example due to setting
of the system time) then all programmed ``timerfds`` that depend on
this clock get triggered and processes waiting on them are awakened in
advance of their scheduled expiry. To facilitate this, all such timers
are added to an RCU-managed ``cancel_list`` when they are setup in
``timerfd_setup_cancel()``::

	static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
	{
		spin_lock(&ctx->cancel_lock);
		if ((ctx->clockid == CLOCK_REALTIME ||
		     ctx->clockid == CLOCK_REALTIME_ALARM) &&
		    (flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
			if (!ctx->might_cancel) {
				ctx->might_cancel = true;
				spin_lock(&cancel_lock);
				list_add_rcu(&ctx->clist, &cancel_list);
				spin_unlock(&cancel_lock);
			}
		} else {
			__timerfd_remove_cancel(ctx);
		}
		spin_unlock(&ctx->cancel_lock);
	}

When a timerfd is freed (fd is closed), then the ``might_cancel``
flag of the timerfd object is cleared, the object removed from the
``cancel_list`` and destroyed, as shown in this simplified and inlined
version of timerfd_release()::

	int timerfd_release(struct inode *inode, struct file *file)
	{
		struct timerfd_ctx *ctx = file->private_data;

		spin_lock(&ctx->cancel_lock);
		if (ctx->might_cancel) {
			ctx->might_cancel = false;
			spin_lock(&cancel_lock);
			list_del_rcu(&ctx->clist);
			spin_unlock(&cancel_lock);
		}
		spin_unlock(&ctx->cancel_lock);

		if (isalarm(ctx))
			alarm_cancel(&ctx->t.alarm);
		else
			hrtimer_cancel(&ctx->t.tmr);
		kfree_rcu(ctx, rcu);
		return 0;
	}

If the ``CLOCK_REALTIME`` clock is set, for example by a time server, the
hrtimer framework calls ``timerfd_clock_was_set()`` which walks the
``cancel_list`` and wakes up processes waiting on the timerfd. While iterating
the ``cancel_list``, the ``might_cancel`` flag is consulted to skip stale
objects::

	void timerfd_clock_was_set(void)
	{
		ktime_t moffs = ktime_mono_to_real(0);
		struct timerfd_ctx *ctx;
		unsigned long flags;

		rcu_read_lock();
		list_for_each_entry_rcu(ctx, &cancel_list, clist) {
			if (!ctx->might_cancel)
				continue;
			spin_lock_irqsave(&ctx->wqh.lock, flags);
			if (ctx->moffs != moffs) {
				ctx->moffs = KTIME_MAX;
				ctx->ticks++;
				wake_up_locked_poll(&ctx->wqh, EPOLLIN);
			}
			spin_unlock_irqrestore(&ctx->wqh.lock, flags);
		}
		rcu_read_unlock();
	}

The key point is that because RCU-protected traversal of the
``cancel_list`` happens concurrently with object addition and removal,
sometimes the traversal can access an object that has been removed from
the list. In this example, a flag is used to skip such objects.


Summary
-------

Read-mostly list-based data structures that can tolerate stale data are
the most amenable to use of RCU.  The simplest case is where entries are
either added or deleted from the data structure (or atomically modified
in place), but non-atomic in-place modifications can be handled by making
a copy, updating the copy, then replacing the original with the copy.
If stale data cannot be tolerated, then a *deleted* flag may be used
in conjunction with a per-entry spinlock in order to allow the search
function to reject newly deleted data.

.. _quick_quiz_answer:

Answer to Quick Quiz:
	For the deleted-flag technique to be helpful, why is it necessary
	to hold the per-entry lock while returning from the search function?

	If the search function drops the per-entry lock before returning,
	then the caller will be processing stale data in any case.  If it
	is really OK to be processing stale data, then you don't need a
	*deleted* flag.  If processing stale data really is a problem,
	then you need to hold the per-entry lock across all of the code
	that uses the value that was returned.

:ref:`Back to Quick Quiz <quick_quiz>`
Commit	Line	Data
9422dc24	1	.. _list_rcu_doc:
1da177e4	2
9422dc24 JC	3	Using RCU to Protect Read-Mostly Linked Lists
9422dc24 JC	4	=============================================
1da177e4	5
06e6d1d6 PM	6	One of the most common uses of RCU is protecting read-mostly linked lists
	7	(``struct list_head`` in list.h). One big advantage of this approach is
	8	that all of the required memory ordering is provided by the list macros.
	9	This document describes several list-based RCU use cases.
1da177e4	10
dc8cb9df JFG	11
	12	Example 1: Read-mostly list: Deferred Destruction
	13	-------------------------------------------------
	14
	15	A widely used usecase for RCU lists in the kernel is lockless iteration over
	16	all processes in the system. ``task_struct::tasks`` represents the list node that
	17	links all the processes. The list can be traversed in parallel to any list
	18	additions or removals.
	19
	20	The traversal of the list is done using ``for_each_process()`` which is defined
	21	by the 2 macros::
	22
	23	#define next_task(p) \
	24	list_entry_rcu((p)->tasks.next, struct task_struct, tasks)
	25
	26	#define for_each_process(p) \
	27	for (p = &init_task ; (p = next_task(p)) != &init_task ; )
	28
	29	The code traversing the list of all processes typically looks like::
	30
	31	rcu_read_lock();
	32	for_each_process(p) {
	33	/* Do something with p */
	34	}
	35	rcu_read_unlock();
	36
06e6d1d6 PM	37	The simplified and heavily inlined code for removing a process from a
06e6d1d6 PM	38	task list is::
dc8cb9df JFG	39
	40	void release_task(struct task_struct *p)
	41	{
	42	write_lock(&tasklist_lock);
	43	list_del_rcu(&p->tasks);
	44	write_unlock(&tasklist_lock);
	45	call_rcu(&p->rcu, delayed_put_task_struct);
	46	}
	47
06e6d1d6 PM	48	When a process exits, ``release_task()`` calls ``list_del_rcu(&p->tasks)``
	49	via __exit_signal() and __unhash_process() under ``tasklist_lock``
	50	writer lock protection. The list_del_rcu() invocation removes
	51	the task from the list of all tasks. The ``tasklist_lock``
	52	prevents concurrent list additions/removals from corrupting the
	53	list. Readers using ``for_each_process()`` are not protected with the
	54	``tasklist_lock``. To prevent readers from noticing changes in the list
	55	pointers, the ``task_struct`` object is freed only after one or more
	56	grace periods elapse, with the help of call_rcu(), which is invoked via
	57	put_task_struct_rcu_user(). This deferring of destruction ensures that
	58	any readers traversing the list will see valid ``p->tasks.next`` pointers
	59	and deletion/freeing can happen in parallel with traversal of the list.
	60	This pattern is also called an existence lock, since RCU refrains
	61	from invoking the delayed_put_task_struct() callback function until
	62	all existing readers finish, which guarantees that the ``task_struct``
	63	object in question will remain in existence until after the completion
	64	of all RCU readers that might possibly have a reference to that object.
dc8cb9df JFG	65
	66
	67	Example 2: Read-Side Action Taken Outside of Lock: No In-Place Updates
9422dc24	68	----------------------------------------------------------------------
1da177e4	69
06e6d1d6 PM	70	Some reader-writer locking use cases compute a value while holding
	71	the read-side lock, but continue to use that value after that lock is
	72	released. These use cases are often good candidates for conversion
	73	to RCU. One prominent example involves network packet routing.
	74	Because the packet-routing data tracks the state of equipment outside
	75	of the computer, it will at times contain stale data. Therefore, once
	76	the route has been computed, there is no need to hold the routing table
	77	static during transmission of the packet. After all, you can hold the
	78	routing table static all you want, but that won't keep the external
	79	Internet from changing, and it is the state of the external Internet
	80	that really matters. In addition, routing entries are typically added
	81	or deleted, rather than being modified in place. This is a rare example
	82	of the finite speed of light and the non-zero size of atoms actually
	83	helping make synchronization be lighter weight.
	84
	85	A straightforward example of this type of RCU use case may be found in
	86	the system-call auditing support. For example, a reader-writer locked
dc8cb9df	87	implementation of ``audit_filter_task()`` might be as follows::
1da177e4	88
06e6d1d6	89	static enum audit_state audit_filter_task(struct task_struct tsk, char *key)
1da177e4 LT	90	{
	91	struct audit_entry *e;
	92	enum audit_state state;
	93
	94	read_lock(&auditsc_lock);
dc8cb9df	95	/* Note: audit_filter_mutex held by caller. */
1da177e4 LT	96	list_for_each_entry(e, &audit_tsklist, list) {
1da177e4 LT	97	if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
06e6d1d6 PM	98	if (state == AUDIT_STATE_RECORD)
06e6d1d6 PM	99	*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
1da177e4 LT	100	read_unlock(&auditsc_lock);
	101	return state;
	102	}
	103	}
	104	read_unlock(&auditsc_lock);
	105	return AUDIT_BUILD_CONTEXT;
	106	}
	107
	108	Here the list is searched under the lock, but the lock is dropped before
	109	the corresponding value is returned. By the time that this value is acted
	110	on, the list may well have been modified. This makes sense, since if
	111	you are turning auditing off, it is OK to audit a few extra system calls.
	112
9422dc24	113	This means that RCU can be easily applied to the read side, as follows::
1da177e4	114
06e6d1d6	115	static enum audit_state audit_filter_task(struct task_struct tsk, char *key)
1da177e4 LT	116	{
	117	struct audit_entry *e;
	118	enum audit_state state;
	119
	120	rcu_read_lock();
dc8cb9df	121	/* Note: audit_filter_mutex held by caller. */
1da177e4 LT	122	list_for_each_entry_rcu(e, &audit_tsklist, list) {
1da177e4 LT	123	if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
06e6d1d6 PM	124	if (state == AUDIT_STATE_RECORD)
06e6d1d6 PM	125	*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
1da177e4 LT	126	rcu_read_unlock();
	127	return state;
	128	}
	129	}
	130	rcu_read_unlock();
	131	return AUDIT_BUILD_CONTEXT;
	132	}
	133
06e6d1d6 PM	134	The read_lock() and read_unlock() calls have become rcu_read_lock()
	135	and rcu_read_unlock(), respectively, and the list_for_each_entry()
	136	has become list_for_each_entry_rcu(). The _rcu() list-traversal
	137	primitives add READ_ONCE() and diagnostic checks for incorrect use
	138	outside of an RCU read-side critical section.
1da177e4	139
dc8cb9df	140	The changes to the update side are also straightforward. A reader-writer lock
06e6d1d6 PM	141	might be used as follows for deletion and insertion in these simplified
06e6d1d6 PM	142	versions of audit_del_rule() and audit_add_rule()::
1da177e4 LT	143
	144	static inline int audit_del_rule(struct audit_rule *rule,
	145	struct list_head *list)
	146	{
dc8cb9df	147	struct audit_entry *e;
1da177e4 LT	148
	149	write_lock(&auditsc_lock);
	150	list_for_each_entry(e, list, list) {
	151	if (!audit_compare_rule(rule, &e->rule)) {
	152	list_del(&e->list);
	153	write_unlock(&auditsc_lock);
	154	return 0;
	155	}
	156	}
	157	write_unlock(&auditsc_lock);
	158	return -EFAULT; /* No matching rule */
	159	}
	160
	161	static inline int audit_add_rule(struct audit_entry *entry,
	162	struct list_head *list)
	163	{
	164	write_lock(&auditsc_lock);
	165	if (entry->rule.flags & AUDIT_PREPEND) {
	166	entry->rule.flags &= ~AUDIT_PREPEND;
	167	list_add(&entry->list, list);
	168	} else {
	169	list_add_tail(&entry->list, list);
	170	}
	171	write_unlock(&auditsc_lock);
	172	return 0;
	173	}
	174
9422dc24	175	Following are the RCU equivalents for these two functions::
1da177e4 LT	176
	177	static inline int audit_del_rule(struct audit_rule *rule,
	178	struct list_head *list)
	179	{
dc8cb9df	180	struct audit_entry *e;
1da177e4	181
dc8cb9df	182	/* No need to use the _rcu iterator here, since this is the only
1da177e4 LT	183	* deletion routine. */
	184	list_for_each_entry(e, list, list) {
	185	if (!audit_compare_rule(rule, &e->rule)) {
	186	list_del_rcu(&e->list);
3943ac5d	187	call_rcu(&e->rcu, audit_free_rule);
1da177e4 LT	188	return 0;
	189	}
	190	}
	191	return -EFAULT; /* No matching rule */
	192	}
	193
	194	static inline int audit_add_rule(struct audit_entry *entry,
	195	struct list_head *list)
	196	{
	197	if (entry->rule.flags & AUDIT_PREPEND) {
	198	entry->rule.flags &= ~AUDIT_PREPEND;
	199	list_add_rcu(&entry->list, list);
	200	} else {
	201	list_add_tail_rcu(&entry->list, list);
	202	}
	203	return 0;
	204	}
	205
06e6d1d6	206	Normally, the write_lock() and write_unlock() would be replaced by a
dc8cb9df JFG	207	spin_lock() and a spin_unlock(). But in this case, all callers hold
dc8cb9df JFG	208	``audit_filter_mutex``, so no additional locking is required. The
06e6d1d6	209	auditsc_lock can therefore be eliminated, since use of RCU eliminates the
dc8cb9df	210	need for writers to exclude readers.
1da177e4 LT	211
	212	The list_del(), list_add(), and list_add_tail() primitives have been
	213	replaced by list_del_rcu(), list_add_rcu(), and list_add_tail_rcu().
06e6d1d6 PM	214	The _rcu() list-manipulation primitives add memory barriers that are
06e6d1d6 PM	215	needed on weakly ordered CPUs. The list_del_rcu() primitive omits the
dc8cb9df JFG	216	pointer poisoning debug-assist code that would otherwise cause concurrent
dc8cb9df JFG	217	readers to fail spectacularly.
1da177e4	218
dc8cb9df JFG	219	So, when readers can tolerate stale data and when entries are either added or
dc8cb9df JFG	220	deleted, without in-place modification, it is very easy to use RCU!
1da177e4	221
dc8cb9df JFG	222
dc8cb9df JFG	223	Example 3: Handling In-Place Updates
9422dc24	224	------------------------------------
1da177e4	225
dc8cb9df JFG	226	The system-call auditing code does not update auditing rules in place. However,
	227	if it did, the reader-writer-locked code to do so might look as follows
	228	(assuming only ``field_count`` is updated, otherwise, the added fields would
	229	need to be filled in)::
1da177e4 LT	230
	231	static inline int audit_upd_rule(struct audit_rule *rule,
	232	struct list_head *list,
	233	__u32 newaction,
	234	__u32 newfield_count)
	235	{
dc8cb9df JFG	236	struct audit_entry *e;
dc8cb9df JFG	237	struct audit_entry *ne;
1da177e4 LT	238
1da177e4 LT	239	write_lock(&auditsc_lock);
dc8cb9df	240	/* Note: audit_filter_mutex held by caller. */
1da177e4 LT	241	list_for_each_entry(e, list, list) {
	242	if (!audit_compare_rule(rule, &e->rule)) {
	243	e->rule.action = newaction;
c50a8714	244	e->rule.field_count = newfield_count;
1da177e4 LT	245	write_unlock(&auditsc_lock);
	246	return 0;
	247	}
	248	}
	249	write_unlock(&auditsc_lock);
	250	return -EFAULT; /* No matching rule */
	251	}
	252
	253	The RCU version creates a copy, updates the copy, then replaces the old
	254	entry with the newly updated entry. This sequence of actions, allowing
dc8cb9df	255	concurrent reads while making a copy to perform an update, is what gives
06e6d1d6 PM	256	RCU (read-copy update) its name.
	257
	258	The RCU version of audit_upd_rule() is as follows::
1da177e4 LT	259
	260	static inline int audit_upd_rule(struct audit_rule *rule,
	261	struct list_head *list,
	262	__u32 newaction,
	263	__u32 newfield_count)
	264	{
dc8cb9df JFG	265	struct audit_entry *e;
dc8cb9df JFG	266	struct audit_entry *ne;
1da177e4 LT	267
	268	list_for_each_entry(e, list, list) {
	269	if (!audit_compare_rule(rule, &e->rule)) {
	270	ne = kmalloc(sizeof(*entry), GFP_ATOMIC);
	271	if (ne == NULL)
	272	return -ENOMEM;
	273	audit_copy_rule(&ne->rule, &e->rule);
	274	ne->rule.action = newaction;
c50a8714	275	ne->rule.field_count = newfield_count;
57d34a6c	276	list_replace_rcu(&e->list, &ne->list);
3943ac5d	277	call_rcu(&e->rcu, audit_free_rule);
1da177e4 LT	278	return 0;
	279	}
	280	}
	281	return -EFAULT; /* No matching rule */
	282	}
	283
dc8cb9df JFG	284	Again, this assumes that the caller holds ``audit_filter_mutex``. Normally, the
dc8cb9df JFG	285	writer lock would become a spinlock in this sort of code.
1da177e4	286
06e6d1d6 PM	287	The update_lsm_rule() does something very similar, for those who would
	288	prefer to look at real Linux-kernel code.
	289
dc8cb9df JFG	290	Another use of this pattern can be found in the openswitch driver's *connection
	291	tracking table* code in ``ct_limit_set()``. The table holds connection tracking
	292	entries and has a limit on the maximum entries. There is one such table
	293	per-zone and hence one limit per zone. The zones are mapped to their limits
	294	through a hashtable using an RCU-managed hlist for the hash chains. When a new
	295	limit is set, a new limit object is allocated and ``ct_limit_set()`` is called
	296	to replace the old limit object with the new one using list_replace_rcu().
	297	The old limit object is then freed after a grace period using kfree_rcu().
	298
	299
	300	Example 4: Eliminating Stale Data
9422dc24	301	---------------------------------
1da177e4	302
dc8cb9df	303	The auditing example above tolerates stale data, as do most algorithms
06e6d1d6 PM	304	that are tracking external state. After all, given there is a delay
	305	from the time the external state changes before Linux becomes aware
	306	of the change, and so as noted earlier, a small quantity of additional
	307	RCU-induced staleness is generally not a problem.
1da177e4 LT	308
1da177e4 LT	309	However, there are many examples where stale data cannot be tolerated.
3282b046 SP	310	One example in the Linux kernel is the System V IPC (see the shm_lock()
3282b046 SP	311	function in ipc/shm.c). This code checks a deleted flag under a
dc8cb9df	312	per-entry spinlock, and, if the deleted flag is set, pretends that the
1da177e4	313	entry does not exist. For this to be helpful, the search function must
3282b046	314	return holding the per-entry spinlock, as shm_lock() does in fact do.
dc8cb9df JFG	315
dc8cb9df JFG	316	.. _quick_quiz:
1da177e4	317
9422dc24	318	Quick Quiz:
dc8cb9df JFG	319	For the deleted-flag technique to be helpful, why is it necessary
dc8cb9df JFG	320	to hold the per-entry lock while returning from the search function?
9422dc24	321
dc8cb9df	322	:ref:`Answer to Quick Quiz <quick_quiz_answer>`
1da177e4	323
dc8cb9df JFG	324	If the system-call audit module were to ever need to reject stale data, one way
dc8cb9df JFG	325	to accomplish this would be to add a ``deleted`` flag and a ``lock`` spinlock to the
06e6d1d6	326	``audit_entry`` structure, and modify audit_filter_task() as follows::
1da177e4 LT	327
	328	static enum audit_state audit_filter_task(struct task_struct *tsk)
	329	{
	330	struct audit_entry *e;
	331	enum audit_state state;
	332
	333	rcu_read_lock();
	334	list_for_each_entry_rcu(e, &audit_tsklist, list) {
	335	if (audit_filter_rules(tsk, &e->rule, NULL, &state)) {
	336	spin_lock(&e->lock);
	337	if (e->deleted) {
	338	spin_unlock(&e->lock);
	339	rcu_read_unlock();
	340	return AUDIT_BUILD_CONTEXT;
	341	}
	342	rcu_read_unlock();
06e6d1d6 PM	343	if (state == AUDIT_STATE_RECORD)
06e6d1d6 PM	344	*key = kstrdup(e->rule.filterkey, GFP_ATOMIC);
1da177e4 LT	345	return state;
	346	}
	347	}
	348	rcu_read_unlock();
	349	return AUDIT_BUILD_CONTEXT;
	350	}
	351
dc8cb9df JFG	352	The ``audit_del_rule()`` function would need to set the ``deleted`` flag under the
dc8cb9df JFG	353	spinlock as follows::
1da177e4 LT	354
	355	static inline int audit_del_rule(struct audit_rule *rule,
	356	struct list_head *list)
	357	{
dc8cb9df	358	struct audit_entry *e;
1da177e4	359
dc8cb9df	360	/* No need to use the _rcu iterator here, since this
d19720a9	361	* is the only deletion routine. */
1da177e4 LT	362	list_for_each_entry(e, list, list) {
	363	if (!audit_compare_rule(rule, &e->rule)) {
	364	spin_lock(&e->lock);
	365	list_del_rcu(&e->list);
	366	e->deleted = 1;
	367	spin_unlock(&e->lock);
3943ac5d	368	call_rcu(&e->rcu, audit_free_rule);
1da177e4 LT	369	return 0;
	370	}
	371	}
	372	return -EFAULT; /* No matching rule */
	373	}
	374
dc8cb9df JFG	375	This too assumes that the caller holds ``audit_filter_mutex``.
dc8cb9df JFG	376
06e6d1d6 PM	377	Note that this example assumes that entries are only added and deleted.
	378	Additional mechanism is required to deal correctly with the update-in-place
	379	performed by audit_upd_rule(). For one thing, audit_upd_rule() would
	380	need to hold the locks of both the old ``audit_entry`` and its replacement
	381	while executing the list_replace_rcu().
	382
dc8cb9df JFG	383
	384	Example 5: Skipping Stale Objects
	385	---------------------------------
	386
06e6d1d6 PM	387	For some use cases, reader performance can be improved by skipping
	388	stale objects during read-side list traversal, where stale objects
	389	are those that will be removed and destroyed after one or more grace
	390	periods. One such example can be found in the timerfd subsystem. When a
	391	``CLOCK_REALTIME`` clock is reprogrammed (for example due to setting
	392	of the system time) then all programmed ``timerfds`` that depend on
	393	this clock get triggered and processes waiting on them are awakened in
	394	advance of their scheduled expiry. To facilitate this, all such timers
	395	are added to an RCU-managed ``cancel_list`` when they are setup in
dc8cb9df JFG	396	``timerfd_setup_cancel()``::
	397
	398	static void timerfd_setup_cancel(struct timerfd_ctx *ctx, int flags)
	399	{
	400	spin_lock(&ctx->cancel_lock);
06e6d1d6 PM	401	if ((ctx->clockid == CLOCK_REALTIME \|\|
06e6d1d6 PM	402	ctx->clockid == CLOCK_REALTIME_ALARM) &&
dc8cb9df JFG	403	(flags & TFD_TIMER_ABSTIME) && (flags & TFD_TIMER_CANCEL_ON_SET)) {
	404	if (!ctx->might_cancel) {
	405	ctx->might_cancel = true;
	406	spin_lock(&cancel_lock);
	407	list_add_rcu(&ctx->clist, &cancel_list);
	408	spin_unlock(&cancel_lock);
	409	}
06e6d1d6 PM	410	} else {
06e6d1d6 PM	411	__timerfd_remove_cancel(ctx);
dc8cb9df JFG	412	}
	413	spin_unlock(&ctx->cancel_lock);
	414	}
	415
06e6d1d6 PM	416	When a timerfd is freed (fd is closed), then the ``might_cancel``
	417	flag of the timerfd object is cleared, the object removed from the
	418	``cancel_list`` and destroyed, as shown in this simplified and inlined
	419	version of timerfd_release()::
dc8cb9df JFG	420
	421	int timerfd_release(struct inode inode, struct file file)
	422	{
	423	struct timerfd_ctx *ctx = file->private_data;
	424
	425	spin_lock(&ctx->cancel_lock);
	426	if (ctx->might_cancel) {
	427	ctx->might_cancel = false;
	428	spin_lock(&cancel_lock);
	429	list_del_rcu(&ctx->clist);
	430	spin_unlock(&cancel_lock);
	431	}
	432	spin_unlock(&ctx->cancel_lock);
	433
06e6d1d6 PM	434	if (isalarm(ctx))
	435	alarm_cancel(&ctx->t.alarm);
	436	else
	437	hrtimer_cancel(&ctx->t.tmr);
dc8cb9df JFG	438	kfree_rcu(ctx, rcu);
	439	return 0;
	440	}
	441
	442	If the ``CLOCK_REALTIME`` clock is set, for example by a time server, the
	443	hrtimer framework calls ``timerfd_clock_was_set()`` which walks the
	444	``cancel_list`` and wakes up processes waiting on the timerfd. While iterating
	445	the ``cancel_list``, the ``might_cancel`` flag is consulted to skip stale
	446	objects::
	447
	448	void timerfd_clock_was_set(void)
	449	{
06e6d1d6	450	ktime_t moffs = ktime_mono_to_real(0);
dc8cb9df JFG	451	struct timerfd_ctx *ctx;
	452	unsigned long flags;
	453
	454	rcu_read_lock();
	455	list_for_each_entry_rcu(ctx, &cancel_list, clist) {
	456	if (!ctx->might_cancel)
	457	continue;
	458	spin_lock_irqsave(&ctx->wqh.lock, flags);
06e6d1d6	459	if (ctx->moffs != moffs) {
dc8cb9df JFG	460	ctx->moffs = KTIME_MAX;
	461	ctx->ticks++;
	462	wake_up_locked_poll(&ctx->wqh, EPOLLIN);
	463	}
	464	spin_unlock_irqrestore(&ctx->wqh.lock, flags);
	465	}
	466	rcu_read_unlock();
	467	}
	468
06e6d1d6 PM	469	The key point is that because RCU-protected traversal of the
	470	``cancel_list`` happens concurrently with object addition and removal,
	471	sometimes the traversal can access an object that has been removed from
	472	the list. In this example, a flag is used to skip such objects.
dc8cb9df JFG	473
dc8cb9df JFG	474
1da177e4	475	Summary
9422dc24	476	-------
1da177e4 LT	477
	478	Read-mostly list-based data structures that can tolerate stale data are
	479	the most amenable to use of RCU. The simplest case is where entries are
	480	either added or deleted from the data structure (or atomically modified
	481	in place), but non-atomic in-place modifications can be handled by making
	482	a copy, updating the copy, then replacing the original with the copy.
dc8cb9df	483	If stale data cannot be tolerated, then a deleted flag may be used
1da177e4 LT	484	in conjunction with a per-entry spinlock in order to allow the search
	485	function to reject newly deleted data.
	486
dc8cb9df	487	.. _quick_quiz_answer:
1da177e4	488
9422dc24	489	Answer to Quick Quiz:
dc8cb9df JFG	490	For the deleted-flag technique to be helpful, why is it necessary
dc8cb9df JFG	491	to hold the per-entry lock while returning from the search function?
d19720a9 PM	492
	493	If the search function drops the per-entry lock before returning,
	494	then the caller will be processing stale data in any case. If it
	495	is really OK to be processing stale data, then you don't need a
dc8cb9df	496	deleted flag. If processing stale data really is a problem,
d19720a9 PM	497	then you need to hold the per-entry lock across all of the code
d19720a9 PM	498	that uses the value that was returned.
dc8cb9df JFG	499
dc8cb9df JFG	500	:ref:`Back to Quick Quiz <quick_quiz>`