1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
2 "http://www.w3.org/TR/html4/loose.dtd">
3 <html>
4 <head><title>A Tour Through RCU's Requirements [LWN.net]</title>
5 <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
6
7<h1>A Tour Through RCU's Requirements</h1>
8
9<p>Copyright IBM Corporation, 2015</p>
10<p>Author: Paul E.&nbsp;McKenney</p>
11<p><i>The initial version of this document appeared in the
12<a href="https://lwn.net/">LWN</a> articles
13<a href="https://lwn.net/Articles/652156/">here</a>,
14<a href="https://lwn.net/Articles/652677/">here</a>, and
15<a href="https://lwn.net/Articles/653326/">here</a>.</i></p>
16
17<h2>Introduction</h2>
18
19<p>
20Read-copy update (RCU) is a synchronization mechanism that is often
21used as a replacement for reader-writer locking.
22RCU is unusual in that updaters do not block readers,
23which means that RCU's read-side primitives can be exceedingly fast
24and scalable.
25In addition, updaters can make useful forward progress concurrently
26with readers.
27However, all this concurrency between RCU readers and updaters does raise
28the question of exactly what RCU readers are doing, which in turn
29raises the question of exactly what RCU's requirements are.
30
31<p>
32This document therefore summarizes RCU's requirements, and can be thought
33of as an informal, high-level specification for RCU.
34It is important to understand that RCU's specification is primarily
35empirical in nature;
36in fact, I learned about many of these requirements the hard way.
37This situation might cause some consternation.
38However, not only has this learning process been a lot of fun, it has
39also been a great privilege to work with so many people willing to apply
40technologies in interesting new ways.
41
42<p>
43All that aside, here are the categories of currently known RCU requirements:
44</p>
45
46<ol>
47<li> <a href="#Fundamental Requirements">
48 Fundamental Requirements</a>
49<li> <a href="#Fundamental Non-Requirements">Fundamental Non-Requirements</a>
50<li> <a href="#Parallelism Facts of Life">
51 Parallelism Facts of Life</a>
52<li> <a href="#Quality-of-Implementation Requirements">
53 Quality-of-Implementation Requirements</a>
54<li> <a href="#Linux Kernel Complications">
55 Linux Kernel Complications</a>
56<li> <a href="#Software-Engineering Requirements">
57 Software-Engineering Requirements</a>
58<li> <a href="#Other RCU Flavors">
59 Other RCU Flavors</a>
60<li> <a href="#Possible Future Changes">
61 Possible Future Changes</a>
62</ol>
63
64<p>
65This is followed by a <a href="#Summary">summary</a>,
66which is in turn followed by the inevitable
67<a href="#Answers to Quick Quizzes">answers to the quick quizzes</a>.
68
69<h2><a name="Fundamental Requirements">Fundamental Requirements</a></h2>
70
71<p>
72RCU's fundamental requirements are the closest thing RCU has to hard
73mathematical requirements.
74These are:
75
76<ol>
77<li> <a href="#Grace-Period Guarantee">
78 Grace-Period Guarantee</a>
79<li> <a href="#Publish-Subscribe Guarantee">
80 Publish-Subscribe Guarantee</a>
81<li> <a href="#Memory-Barrier Guarantees">
82 Memory-Barrier Guarantees</a>
83<li> <a href="#RCU Primitives Guaranteed to Execute Unconditionally">
84 RCU Primitives Guaranteed to Execute Unconditionally</a>
85<li> <a href="#Guaranteed Read-to-Write Upgrade">
86 Guaranteed Read-to-Write Upgrade</a>
87</ol>
88
89<h3><a name="Grace-Period Guarantee">Grace-Period Guarantee</a></h3>
90
91<p>
92RCU's grace-period guarantee is unusual in being premeditated:
93Jack Slingwine and I had this guarantee firmly in mind when we started
94work on RCU (then called &ldquo;rclock&rdquo;) in the early 1990s.
95That said, the past two decades of experience with RCU have produced
96a much more detailed understanding of this guarantee.
97
98<p>
99RCU's grace-period guarantee allows updaters to wait for the completion
100of all pre-existing RCU read-side critical sections.
101An RCU read-side critical section
102begins with the marker <tt>rcu_read_lock()</tt> and ends with
103the marker <tt>rcu_read_unlock()</tt>.
104These markers may be nested, and RCU treats a nested set as one
105big RCU read-side critical section.
106Production-quality implementations of <tt>rcu_read_lock()</tt> and
107<tt>rcu_read_unlock()</tt> are extremely lightweight, and in
108fact have exactly zero overhead in Linux kernels built for production
109use with <tt>CONFIG_PREEMPT=n</tt>.
110
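<p>
For example, in the following minimal sketch (where <tt>do_part_one()</tt>
and friends are hypothetical placeholders), the inner marker pair on
lines&nbsp;5 and&nbsp;7 does not end the critical section, which instead
extends from line&nbsp;3 to line&nbsp;9:

<blockquote>
<pre>
 1 void nested_reader(void)
 2 {
 3   rcu_read_lock();
 4   do_part_one();
 5   rcu_read_lock();    /* Nesting is legal... */
 6   do_part_two();
 7   rcu_read_unlock();  /* ...and does not end the critical section. */
 8   do_part_three();
 9   rcu_read_unlock();  /* The critical section ends here. */
10 }
</pre>
</blockquote>
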
111<p>
112This guarantee allows ordering to be enforced with extremely low
113overhead to readers, for example:
114
115<blockquote>
116<pre>
117 1 int x, y;
118 2
119 3 void thread0(void)
120 4 {
121 5 rcu_read_lock();
122 6 r1 = READ_ONCE(x);
123 7 r2 = READ_ONCE(y);
124 8 rcu_read_unlock();
125 9 }
12610
12711 void thread1(void)
12812 {
12913 WRITE_ONCE(x, 1);
13014 synchronize_rcu();
13115 WRITE_ONCE(y, 1);
13216 }
133</pre>
134</blockquote>
135
136<p>
137Because the <tt>synchronize_rcu()</tt> on line&nbsp;14 waits for
138all pre-existing readers, any instance of <tt>thread0()</tt> that
139loads a value of zero from <tt>x</tt> must complete before
140<tt>thread1()</tt> stores to <tt>y</tt>, so that instance must
141also load a value of zero from <tt>y</tt>.
142Similarly, any instance of <tt>thread0()</tt> that loads a value of
143one from <tt>y</tt> must have started after the
144<tt>synchronize_rcu()</tt> started, and must therefore also load
145a value of one from <tt>x</tt>.
146Therefore, the outcome:
147<blockquote>
148<pre>
149(r1 == 0 &amp;&amp; r2 == 1)
150</pre>
151</blockquote>
152cannot happen.
153
154<p>@@QQ@@
155Wait a minute!
156You said that updaters can make useful forward progress concurrently
157with readers, but pre-existing readers will block
158<tt>synchronize_rcu()</tt>!!!
159Just who are you trying to fool???
160<p>@@QQA@@
161First, if updaters do not wish to be blocked by readers, they can use
162<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt>, which will
163be discussed later.
164Second, even when using <tt>synchronize_rcu()</tt>, the other
165update-side code does run concurrently with readers, whether pre-existing
166or not.
167<p>@@QQE@@
168
169<p>
170This scenario resembles one of the first uses of RCU in
171<a href="https://en.wikipedia.org/wiki/DYNIX">DYNIX/ptx</a>,
172which managed a distributed lock manager's transition into
173a state suitable for handling recovery from node failure,
174more or less as follows:
175
176<blockquote>
177<pre>
178 1 #define STATE_NORMAL 0
179 2 #define STATE_WANT_RECOVERY 1
180 3 #define STATE_RECOVERING 2
181 4 #define STATE_WANT_NORMAL 3
182 5
183 6 int state = STATE_NORMAL;
184 7
185 8 void do_something_dlm(void)
186 9 {
18710 int state_snap;
18811
18912 rcu_read_lock();
19013 state_snap = READ_ONCE(state);
19114 if (state_snap == STATE_NORMAL)
19215 do_something();
19316 else
19417 do_something_carefully();
19518 rcu_read_unlock();
19619 }
19720
19821 void start_recovery(void)
19922 {
20023 WRITE_ONCE(state, STATE_WANT_RECOVERY);
20124 synchronize_rcu();
20225 WRITE_ONCE(state, STATE_RECOVERING);
20326 recovery();
20427 WRITE_ONCE(state, STATE_WANT_NORMAL);
20528 synchronize_rcu();
20629 WRITE_ONCE(state, STATE_NORMAL);
20730 }
208</pre>
209</blockquote>
210
211<p>
212The RCU read-side critical section in <tt>do_something_dlm()</tt>
213works with the <tt>synchronize_rcu()</tt> in <tt>start_recovery()</tt>
214to guarantee that <tt>do_something()</tt> never runs concurrently
215with <tt>recovery()</tt>, but with little or no synchronization
216overhead in <tt>do_something_dlm()</tt>.
217
218<p>@@QQ@@
219Why is the <tt>synchronize_rcu()</tt> on line&nbsp;28 needed?
220<p>@@QQA@@
221Without that extra grace period, memory reordering could result in
222<tt>do_something_dlm()</tt> executing <tt>do_something()</tt>
223concurrently with the last bits of <tt>recovery()</tt>.
224<p>@@QQE@@
225
226<p>
227In order to avoid fatal problems such as deadlocks,
228an RCU read-side critical section must not contain calls to
229<tt>synchronize_rcu()</tt>.
230Similarly, an RCU read-side critical section must not
231contain anything that waits, directly or indirectly, on completion of
232an invocation of <tt>synchronize_rcu()</tt>.
233
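<p>
To illustrate the deadlock, consider the following sketch, which reuses
<tt>gp</tt> and <tt>do_something()</tt> from this document's other
examples:
The <tt>synchronize_rcu()</tt> on line&nbsp;7 cannot return until the
enclosing read-side critical section ends, and that critical section
cannot end until <tt>synchronize_rcu()</tt> returns.

<blockquote>
<pre>
 1 void buggy_reader(void)
 2 {
 3   struct foo *p;
 4
 5   rcu_read_lock();
 6   p = rcu_dereference(gp);
 7   synchronize_rcu();  /* BUG: Waits on this very critical section. */
 8   if (p)
 9     do_something(p-&gt;a, p-&gt;b);
10   rcu_read_unlock();
11 }
</pre>
</blockquote>
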
234<p>
235Although RCU's grace-period guarantee is useful in and of itself, with
236<a href="https://lwn.net/Articles/573497/">quite a few use cases</a>,
237it would be good to be able to use RCU to coordinate read-side
238access to linked data structures.
239For this, the grace-period guarantee is not sufficient, as can
240be seen in function <tt>add_gp_buggy()</tt> below.
241We will look at the reader's code later, but in the meantime, just think of
242the reader as locklessly picking up the <tt>gp</tt> pointer,
243and, if the value loaded is non-<tt>NULL</tt>, locklessly accessing the
244<tt>-&gt;a</tt> and <tt>-&gt;b</tt> fields.
245
246<blockquote>
247<pre>
248 1 bool add_gp_buggy(int a, int b)
249 2 {
250 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
251 4 if (!p)
252 5 return false; /* Allocation failure. */
253 6 spin_lock(&amp;gp_lock);
254 7 if (rcu_access_pointer(gp)) {
255 8 spin_unlock(&amp;gp_lock);
256 9 return false;
25710 }
25811 p-&gt;a = a;
25912 p-&gt;b = b;
26013 gp = p; /* ORDERING BUG */
26114 spin_unlock(&amp;gp_lock);
26215 return true;
26316 }
264</pre>
265</blockquote>
266
267<p>
268The problem is that both the compiler and weakly ordered CPUs are within
269their rights to reorder this code as follows:
270
271<blockquote>
272<pre>
273 1 bool add_gp_buggy_optimized(int a, int b)
274 2 {
275 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
276 4 if (!p)
277 5 return false; /* Allocation failure. */
278 6 spin_lock(&amp;gp_lock);
279 7 if (rcu_access_pointer(gp)) {
280 8 spin_unlock(&amp;gp_lock);
281 9 return false;
28210 }
283<b>11 gp = p; /* ORDERING BUG */
28412 p-&gt;a = a;
28513 p-&gt;b = b;</b>
28614 spin_unlock(&amp;gp_lock);
28715 return true;
28816 }
289</pre>
290</blockquote>
291
292<p>
293If an RCU reader fetches <tt>gp</tt> just after
294<tt>add_gp_buggy_optimized</tt> executes line&nbsp;11,
295it will see garbage in the <tt>-&gt;a</tt> and <tt>-&gt;b</tt>
296fields.
297And this is but one of many ways in which compiler and hardware optimizations
298could cause trouble.
299Therefore, we clearly need some way to prevent the compiler and the CPU from
300reordering in this manner, which brings us to the publish-subscribe
301guarantee discussed in the next section.
302
303<h3><a name="Publish-Subscribe Guarantee">Publish-Subscribe Guarantee</a></h3>
304
305<p>
306RCU's publish-subscribe guarantee allows data to be inserted
307into a linked data structure without disrupting RCU readers.
308The updater uses <tt>rcu_assign_pointer()</tt> to insert the
309new data, and readers use <tt>rcu_dereference()</tt> to
310access data, whether new or old.
311The following shows an example of insertion:
312
313<blockquote>
314<pre>
315 1 bool add_gp(int a, int b)
316 2 {
317 3 p = kmalloc(sizeof(*p), GFP_KERNEL);
318 4 if (!p)
319 5 return false; /* Allocation failure. */
320 6 spin_lock(&amp;gp_lock);
321 7 if (rcu_access_pointer(gp)) {
322 8 spin_unlock(&amp;gp_lock);
323 9 return false;
32410 }
32511 p-&gt;a = a;
32612 p-&gt;b = b;
32713 rcu_assign_pointer(gp, p);
32814 spin_unlock(&amp;gp_lock);
32915 return true;
33016 }
331</pre>
332</blockquote>
333
334<p>
335The <tt>rcu_assign_pointer()</tt> on line&nbsp;13 is conceptually
336equivalent to a simple assignment statement, but also guarantees
337that its assignment will
338happen after the two assignments in lines&nbsp;11 and&nbsp;12,
339similar to the C11 <tt>memory_order_release</tt> store operation.
340It also prevents any number of &ldquo;interesting&rdquo; compiler
341optimizations, for example, the use of <tt>gp</tt> as a scratch
342location immediately preceding the assignment.
343
344<p>@@QQ@@
345But <tt>rcu_assign_pointer()</tt> does nothing to prevent the
346two assignments to <tt>p-&gt;a</tt> and <tt>p-&gt;b</tt>
347from being reordered.
348Can't that also cause problems?
349<p>@@QQA@@
350No, it cannot.
351The readers cannot see either of these two fields until
352the assignment to <tt>gp</tt>, by which time both fields are
353fully initialized.
354So reordering the assignments
355to <tt>p-&gt;a</tt> and <tt>p-&gt;b</tt> cannot possibly
356cause any problems.
357<p>@@QQE@@
358
359<p>
360It is tempting to assume that the reader need not do anything special
361to control its accesses to the RCU-protected data,
362as shown in <tt>do_something_gp_buggy()</tt> below:
363
364<blockquote>
365<pre>
366 1 bool do_something_gp_buggy(void)
367 2 {
368 3 rcu_read_lock();
369 4 p = gp; /* OPTIMIZATIONS GALORE!!! */
370 5 if (p) {
371 6 do_something(p-&gt;a, p-&gt;b);
372 7 rcu_read_unlock();
373 8 return true;
374 9 }
37510 rcu_read_unlock();
37611 return false;
37712 }
378</pre>
379</blockquote>
380
381<p>
382However, this temptation must be resisted because there are a
383surprisingly large number of ways that the compiler
384(to say nothing of
385<a href="https://h71000.www7.hp.com/wizard/wiz_2637.html">DEC Alpha CPUs</a>)
386can trip this code up.
387For but one example, if the compiler were short of registers, it
388might choose to refetch from <tt>gp</tt> rather than keeping
389a separate copy in <tt>p</tt> as follows:
390
391<blockquote>
392<pre>
393 1 bool do_something_gp_buggy_optimized(void)
394 2 {
395 3 rcu_read_lock();
396 4 if (gp) { /* OPTIMIZATIONS GALORE!!! */
397<b> 5 do_something(gp-&gt;a, gp-&gt;b);</b>
398 6 rcu_read_unlock();
399 7 return true;
400 8 }
401 9 rcu_read_unlock();
40210 return false;
40311 }
404</pre>
405</blockquote>
406
407<p>
408If this function ran concurrently with a series of updates that
409replaced the current structure with a new one,
410the fetches of <tt>gp-&gt;a</tt>
411and <tt>gp-&gt;b</tt> might well come from two different structures,
412which could cause serious confusion.
413To prevent this (and much else besides), <tt>do_something_gp()</tt> uses
414<tt>rcu_dereference()</tt> to fetch from <tt>gp</tt>:
415
416<blockquote>
417<pre>
418 1 bool do_something_gp(void)
419 2 {
420 3 rcu_read_lock();
421 4 p = rcu_dereference(gp);
422 5 if (p) {
423 6 do_something(p-&gt;a, p-&gt;b);
424 7 rcu_read_unlock();
425 8 return true;
426 9 }
42710 rcu_read_unlock();
42811 return false;
42912 }
430</pre>
431</blockquote>
432
433<p>
434The <tt>rcu_dereference()</tt> uses volatile casts and (for DEC Alpha)
435memory barriers in the Linux kernel.
436Should a
437<a href="http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf">high-quality implementation of C11 <tt>memory_order_consume</tt> [PDF]</a>
438ever appear, then <tt>rcu_dereference()</tt> could be implemented
439as a <tt>memory_order_consume</tt> load.
440Regardless of the exact implementation, a pointer fetched by
441<tt>rcu_dereference()</tt> may not be used outside of the
442outermost RCU read-side critical section containing that
443<tt>rcu_dereference()</tt>, unless protection of
444the corresponding data element has been passed from RCU to some
445other synchronization mechanism, most commonly locking or
446<a href="https://www.kernel.org/doc/Documentation/RCU/rcuref.txt">reference counting</a>.
447
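<p>
For example, the following sketch (which assumes a hypothetical
reference count <tt>-&gt;refcnt</tt> in <tt>struct foo</tt>) hands off
protection from RCU to a reference count, after which the returned
pointer may legitimately be used outside of the critical section:

<blockquote>
<pre>
 1 struct foo *get_gp(void)
 2 {
 3   struct foo *p;
 4
 5   rcu_read_lock();
 6   p = rcu_dereference(gp);
 7   if (p &amp;&amp; !atomic_inc_not_zero(&amp;p-&gt;refcnt))
 8     p = NULL;  /* Element is being freed, pretend it was absent. */
 9   rcu_read_unlock();
10   return p;    /* If non-NULL, protected by the reference count. */
11 }
</pre>
</blockquote>

<p>
In such a scheme, the caller must eventually release its reference, and
the updater must defer freeing the element until the count drops to zero.
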
448<p>
449In short, updaters use <tt>rcu_assign_pointer()</tt> and readers
450use <tt>rcu_dereference()</tt>, and these two RCU API elements
451work together to ensure that readers have a consistent view of
452newly added data elements.
453
454<p>
455Of course, it is also necessary to remove elements from RCU-protected
456data structures, for example, using the following process:
457
458<ol>
459<li> Remove the data element from the enclosing structure.
460<li> Wait for all pre-existing RCU read-side critical sections
461 to complete (because only pre-existing readers can possibly have
462 a reference to the newly removed data element).
463<li> At this point, only the updater has a reference to the
464 newly removed data element, so it can safely reclaim
465 the data element, for example, by passing it to <tt>kfree()</tt>.
466</ol>
467
468This process is implemented by <tt>remove_gp_synchronous()</tt>:
469
470<blockquote>
471<pre>
472 1 bool remove_gp_synchronous(void)
473 2 {
474 3 struct foo *p;
475 4
476 5 spin_lock(&amp;gp_lock);
477 6 p = rcu_access_pointer(gp);
478 7 if (!p) {
479 8 spin_unlock(&amp;gp_lock);
480 9 return false;
48110 }
48211 rcu_assign_pointer(gp, NULL);
48312 spin_unlock(&amp;gp_lock);
48413 synchronize_rcu();
48514 kfree(p);
48615 return true;
48716 }
488</pre>
489</blockquote>
490
491<p>
492This function is straightforward, with line&nbsp;13 waiting for a grace
493period before line&nbsp;14 frees the old data element.
494This waiting ensures that readers will reach line&nbsp;7 of
495<tt>do_something_gp()</tt> before the data element referenced by
496<tt>p</tt> is freed.
497The <tt>rcu_access_pointer()</tt> on line&nbsp;6 is similar to
498<tt>rcu_dereference()</tt>, except that:
499
500<ol>
501<li> The value returned by <tt>rcu_access_pointer()</tt>
502 cannot be dereferenced.
503 If you want to access the value pointed to as well as
504 the pointer itself, use <tt>rcu_dereference()</tt>
505 instead of <tt>rcu_access_pointer()</tt>.
506<li> The call to <tt>rcu_access_pointer()</tt> need not be
507 protected.
508 In contrast, <tt>rcu_dereference()</tt> must either be
509 within an RCU read-side critical section or in a code
510 segment where the pointer cannot change, for example, in
511 code protected by the corresponding update-side lock.
512</ol>
513
514<p>@@QQ@@
515Without the <tt>rcu_dereference()</tt> or the
516<tt>rcu_access_pointer()</tt>, what destructive optimizations
517might the compiler make use of?
518<p>@@QQA@@
519Let's start with what happens to <tt>do_something_gp()</tt>
520if it fails to use <tt>rcu_dereference()</tt>.
521It could reuse a value formerly fetched from this same pointer.
522It could also fetch the pointer from <tt>gp</tt> in a byte-at-a-time
523manner, resulting in <i>load tearing</i>, in turn resulting in a bytewise
524mash-up of two distinct pointer values.
525It might even use value-speculation optimizations, where it makes a wrong
526guess, but by the time it gets around to checking the value, an update
527has changed the pointer to match the wrong guess.
528Too bad about any dereferences that returned pre-initialization garbage
529in the meantime!
530
531<p>
532For <tt>remove_gp_synchronous()</tt>, as long as all modifications
533to <tt>gp</tt> are carried out while holding <tt>gp_lock</tt>,
534the above optimizations are harmless.
535However,
536with <tt>CONFIG_SPARSE_RCU_POINTER=y</tt>,
537<tt>sparse</tt> will complain if you
538define <tt>gp</tt> with <tt>__rcu</tt> and then
539access it without using
540either <tt>rcu_access_pointer()</tt> or <tt>rcu_dereference()</tt>.
541<p>@@QQE@@
542
543<p>
544In short, RCU's publish-subscribe guarantee is provided by the combination
545of <tt>rcu_assign_pointer()</tt> and <tt>rcu_dereference()</tt>.
546This guarantee allows data elements to be safely added to RCU-protected
547linked data structures without disrupting RCU readers.
548This guarantee can be used in combination with the grace-period
549guarantee to also allow data elements to be removed from RCU-protected
550linked data structures, again without disrupting RCU readers.
551
552<p>
553This guarantee was only partially premeditated.
554DYNIX/ptx used an explicit memory barrier for publication, but had nothing
555resembling <tt>rcu_dereference()</tt> for subscription, nor did it
556have anything resembling the <tt>smp_read_barrier_depends()</tt>
557that was later subsumed into <tt>rcu_dereference()</tt>.
558The need for these operations made itself known quite suddenly at a
559late-1990s meeting with the DEC Alpha architects, back in the days when
560DEC was still a free-standing company.
561It took the Alpha architects a good hour to convince me that any sort
562of barrier would ever be needed, and it then took me a good <i>two</i> hours
563to convince them that their documentation did not make this point clear.
564More recent work with the C and C++ standards committees has provided
565much education on tricks and traps from the compiler.
566In short, compilers were much less tricky in the early 1990s, but in
5672015, don't even think about omitting <tt>rcu_dereference()</tt>!
568
569<h3><a name="Memory-Barrier Guarantees">Memory-Barrier Guarantees</a></h3>
570
571<p>
572The previous section's simple linked-data-structure scenario clearly
573demonstrates the need for RCU's stringent memory-ordering guarantees on
574systems with more than one CPU:
575
576<ol>
577<li> Each CPU that has an RCU read-side critical section that
578 begins before <tt>synchronize_rcu()</tt> starts is
579 guaranteed to execute a full memory barrier between the time
580 that the RCU read-side critical section ends and the time that
581 <tt>synchronize_rcu()</tt> returns.
582 Without this guarantee, a pre-existing RCU read-side critical section
583 might hold a reference to the newly removed <tt>struct foo</tt>
584 after the <tt>kfree()</tt> on line&nbsp;14 of
585 <tt>remove_gp_synchronous()</tt>.
586<li> Each CPU that has an RCU read-side critical section that ends
587 after <tt>synchronize_rcu()</tt> returns is guaranteed
588 to execute a full memory barrier between the time that
589 <tt>synchronize_rcu()</tt> begins and the time that the RCU
590 read-side critical section begins.
591 Without this guarantee, a later RCU read-side critical section
592 running after the <tt>kfree()</tt> on line&nbsp;14 of
593 <tt>remove_gp_synchronous()</tt> might
594 later run <tt>do_something_gp()</tt> and find the
595 newly deleted <tt>struct foo</tt>.
596<li> If the task invoking <tt>synchronize_rcu()</tt> remains
597 on a given CPU, then that CPU is guaranteed to execute a full
598 memory barrier sometime during the execution of
599 <tt>synchronize_rcu()</tt>.
600 This guarantee ensures that the <tt>kfree()</tt> on
601 line&nbsp;14 of <tt>remove_gp_synchronous()</tt> really does
602 execute after the removal on line&nbsp;11.
603<li> If the task invoking <tt>synchronize_rcu()</tt> migrates
604 among a group of CPUs during that invocation, then each of the
605 CPUs in that group is guaranteed to execute a full memory barrier
606 sometime during the execution of <tt>synchronize_rcu()</tt>.
607 This guarantee also ensures that the <tt>kfree()</tt> on
608 line&nbsp;14 of <tt>remove_gp_synchronous()</tt> really does
609 execute after the removal on
610 line&nbsp;11, even in the case where the thread executing the
611 <tt>synchronize_rcu()</tt> migrates in the meantime.
612</ol>
613
614<p>@@QQ@@
615Given that multiple CPUs can start RCU read-side critical sections
616at any time without any ordering whatsoever, how can RCU possibly tell whether
617or not a given RCU read-side critical section starts before a
618given instance of <tt>synchronize_rcu()</tt>?
619<p>@@QQA@@
620If RCU cannot tell whether or not a given
621RCU read-side critical section starts before a
622given instance of <tt>synchronize_rcu()</tt>,
623then it must assume that the RCU read-side critical section
624started first.
625In other words, a given instance of <tt>synchronize_rcu()</tt>
626can avoid waiting on a given RCU read-side critical section only
627if it can prove that <tt>synchronize_rcu()</tt> started first.
628<p>@@QQE@@
629
630<p>@@QQ@@
631The first and second guarantees require unbelievably strict ordering!
632Are all these memory barriers <i>really</i> required?
633<p>@@QQA@@
634Yes, they really are required.
635To see why the first guarantee is required, consider the following
636sequence of events:
637
638<ol>
639<li> CPU 1: <tt>rcu_read_lock()</tt>
640<li> CPU 1: <tt>q = rcu_dereference(gp);
641 /* Very likely to return p. */</tt>
642<li> CPU 0: <tt>list_del_rcu(p);</tt>
643<li> CPU 0: <tt>synchronize_rcu()</tt> starts.
644<li> CPU 1: <tt>do_something_with(q-&gt;a);
645 /* No smp_mb(), so might happen after kfree(). */</tt>
646<li> CPU 1: <tt>rcu_read_unlock()</tt>
647<li> CPU 0: <tt>synchronize_rcu()</tt> returns.
648<li> CPU 0: <tt>kfree(p);</tt>
649</ol>
650
651<p>
652Therefore, there absolutely must be a full memory barrier between the
653end of the RCU read-side critical section and the end of the
654grace period.
655
656<p>
657The sequence of events demonstrating the necessity of the second rule
658is roughly similar:
659
660<ol>
661<li> CPU 0: <tt>list_del_rcu(p);</tt>
662<li> CPU 0: <tt>synchronize_rcu()</tt> starts.
663<li> CPU 1: <tt>rcu_read_lock()</tt>
664<li> CPU 1: <tt>q = rcu_dereference(gp);
665 /* Might return p if no memory barrier. */</tt>
666<li> CPU 0: <tt>synchronize_rcu()</tt> returns.
667<li> CPU 0: <tt>kfree(p);</tt>
668<li> CPU 1: <tt>do_something_with(q-&gt;a); /* Boom!!! */</tt>
669<li> CPU 1: <tt>rcu_read_unlock()</tt>
670</ol>
671
672<p>
673And similarly, without a memory barrier between the beginning of the
674grace period and the beginning of the RCU read-side critical section,
675CPU&nbsp;1 might end up accessing the freelist.
676
677<p>
678The &ldquo;as if&rdquo; rule of course applies, so that any implementation
679that acts as if the appropriate memory barriers were in place is a
680correct implementation.
681That said, it is much easier to fool yourself into believing that you have
682adhered to the as-if rule than it is to actually adhere to it!
683<p>@@QQE@@
684
685<p>
686Note that these memory-barrier requirements do not replace the fundamental
687RCU requirement that a grace period wait for all pre-existing readers.
688On the contrary, the memory barriers called out in this section must operate in
689such a way as to <i>enforce</i> this fundamental requirement.
690Of course, different implementations enforce this requirement in different
691ways, but enforce it they must.
692
693<h3><a name="RCU Primitives Guaranteed to Execute Unconditionally">RCU Primitives Guaranteed to Execute Unconditionally</a></h3>
694
695<p>
696The common-case RCU primitives are unconditional.
697They are invoked, they do their job, and they return, with no possibility
698of error, and no need to retry.
699This is a key RCU design philosophy.
700
701<p>
702However, this philosophy is pragmatic rather than pigheaded.
703If someone comes up with a good justification for a particular conditional
704RCU primitive, it might well be implemented and added.
705After all, this guarantee was reverse-engineered, not premeditated.
706The unconditional nature of the RCU primitives was initially an
707accident of implementation, and later experience with conditional
708synchronization primitives caused me to elevate this
709accident to a guarantee.
710Therefore, the justification for adding a conditional primitive to
711RCU would need to be based on detailed and compelling use cases.
712
713<h3><a name="Guaranteed Read-to-Write Upgrade">Guaranteed Read-to-Write Upgrade</a></h3>
714
715<p>
716As far as RCU is concerned, it is always possible to carry out an
717update within an RCU read-side critical section.
718For example, that RCU read-side critical section might search for
719a given data element, and then might acquire the update-side
720spinlock in order to update that element, all while remaining
721in that RCU read-side critical section.
722Of course, it is necessary to exit the RCU read-side critical section
723before invoking <tt>synchronize_rcu()</tt>; however, this
724inconvenience can be avoided through use of the
725<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt> API members
726described later in this document.
727
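<p>
The following sketch (hypothetical, but reusing <tt>gp</tt> and
<tt>gp_lock</tt> from the earlier examples) carries out such an upgrade,
with the recheck on line&nbsp;10 guarding against a concurrent removal:

<blockquote>
<pre>
 1 bool update_gp_b(int new_b)
 2 {
 3   struct foo *p;
 4   bool ret = false;
 5
 6   rcu_read_lock();
 7   p = rcu_dereference(gp);
 8   if (p) {
 9     spin_lock(&amp;gp_lock);  /* Upgrade to updater... */
10     if (p == rcu_access_pointer(gp)) {  /* ...and recheck. */
11       p-&gt;b = new_b;
12       ret = true;
13     }
14     spin_unlock(&amp;gp_lock);
15   }
16   rcu_read_unlock();  /* Still within the reader throughout. */
17   return ret;
18 }
</pre>
</blockquote>
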
728<p>@@QQ@@
729But how does the upgrade-to-write operation exclude other readers?
730<p>@@QQA@@
731It doesn't, just like normal RCU updates, which also do not exclude
732RCU readers.
733<p>@@QQE@@
734
735<p>
736This guarantee allows lookup code to be shared between read-side
737and update-side code, and was premeditated, appearing in the earliest
738DYNIX/ptx RCU documentation.
739
740<h2><a name="Fundamental Non-Requirements">Fundamental Non-Requirements</a></h2>
741
742<p>
743RCU provides extremely lightweight readers, and its read-side guarantees,
744though quite useful, are correspondingly lightweight.
745It is therefore all too easy to assume that RCU is guaranteeing more
746than it really is.
747Of course, the list of things that RCU does not guarantee is infinitely
748long; however, the following sections list a few non-guarantees that
749have caused confusion.
750Except where otherwise noted, these non-guarantees were premeditated.
751
752<ol>
753<li> <a href="#Readers Impose Minimal Ordering">
754 Readers Impose Minimal Ordering</a>
755<li> <a href="#Readers Do Not Exclude Updaters">
756 Readers Do Not Exclude Updaters</a>
757<li> <a href="#Updaters Only Wait For Old Readers">
758 Updaters Only Wait For Old Readers</a>
759<li> <a href="#Grace Periods Don't Partition Read-Side Critical Sections">
760 Grace Periods Don't Partition Read-Side Critical Sections</a>
761<li> <a href="#Read-Side Critical Sections Don't Partition Grace Periods">
762 Read-Side Critical Sections Don't Partition Grace Periods</a>
763<li> <a href="#Disabling Preemption Does Not Block Grace Periods">
764 Disabling Preemption Does Not Block Grace Periods</a>
765</ol>
766
767<h3><a name="Readers Impose Minimal Ordering">Readers Impose Minimal Ordering</a></h3>
768
769<p>
770Reader-side markers such as <tt>rcu_read_lock()</tt> and
771<tt>rcu_read_unlock()</tt> provide absolutely no ordering guarantees
772except through their interaction with the grace-period APIs such as
773<tt>synchronize_rcu()</tt>.
774To see this, consider the following pair of threads:
775
776<blockquote>
777<pre>
778 1 void thread0(void)
779 2 {
780 3 rcu_read_lock();
781 4 WRITE_ONCE(x, 1);
782 5 rcu_read_unlock();
783 6 rcu_read_lock();
784 7 WRITE_ONCE(y, 1);
785 8 rcu_read_unlock();
786 9 }
78710
78811 void thread1(void)
78912 {
79013 rcu_read_lock();
79114 r1 = READ_ONCE(y);
79215 rcu_read_unlock();
79316 rcu_read_lock();
79417 r2 = READ_ONCE(x);
79518 rcu_read_unlock();
79619 }
797</pre>
798</blockquote>
799
800<p>
801After <tt>thread0()</tt> and <tt>thread1()</tt> execute
802concurrently, it is quite possible to have
803
804<blockquote>
805<pre>
806(r1 == 1 &amp;&amp; r2 == 0)
807</pre>
808</blockquote>
809
810(that is, <tt>y</tt> appears to have been assigned before <tt>x</tt>),
811which would not be possible if <tt>rcu_read_lock()</tt> and
812<tt>rcu_read_unlock()</tt> had much in the way of ordering
813properties.
814But they do not, so the CPU is within its rights
815to do significant reordering.
816This is by design: Any significant ordering constraints would slow down
817these fast-path APIs.
818
819<p>@@QQ@@
820Can't the compiler also reorder this code?
821<p>@@QQA@@
822No, the volatile casts in <tt>READ_ONCE()</tt> and
823<tt>WRITE_ONCE()</tt> prevent the compiler from reordering in
824this particular case.
825<p>@@QQE@@
826
827<h3><a name="Readers Do Not Exclude Updaters">Readers Do Not Exclude Updaters</a></h3>
828
829<p>
830Neither <tt>rcu_read_lock()</tt> nor <tt>rcu_read_unlock()</tt>
831exclude updates.
832All they do is to prevent grace periods from ending.
833The following example illustrates this:
834
835<blockquote>
836<pre>
837 1 void thread0(void)
838 2 {
839 3 rcu_read_lock();
840 4 r1 = READ_ONCE(y);
841 5 if (r1) {
842 6 do_something_with_nonzero_x();
843 7 r2 = READ_ONCE(x);
844 8 WARN_ON(!r2); /* BUG!!! */
845 9 }
84610 rcu_read_unlock();
84711 }
84812
84913 void thread1(void)
85014 {
85115 spin_lock(&amp;my_lock);
85216 WRITE_ONCE(x, 1);
85317 WRITE_ONCE(y, 1);
85418 spin_unlock(&amp;my_lock);
85519 }
856</pre>
857</blockquote>
858
859<p>
860If the <tt>thread0()</tt> function's <tt>rcu_read_lock()</tt>
861excluded the <tt>thread1()</tt> function's update,
862the <tt>WARN_ON()</tt> could never fire.
863But the fact is that <tt>rcu_read_lock()</tt> does not exclude
864much of anything aside from subsequent grace periods, of which
865<tt>thread1()</tt> has none, so the
866<tt>WARN_ON()</tt> can and does fire.
867
868<h3><a name="Updaters Only Wait For Old Readers">Updaters Only Wait For Old Readers</a></h3>
869
870<p>
871It might be tempting to assume that after <tt>synchronize_rcu()</tt>
872completes, there are no readers executing.
873This temptation must be avoided because
874new readers can start immediately after <tt>synchronize_rcu()</tt>
875starts, and <tt>synchronize_rcu()</tt> is under no
876obligation to wait for these new readers.
877
878<p>@@QQ@@
879Suppose that synchronize_rcu() did wait until all readers had completed.
880Would the updater be able to rely on this?
881<p>@@QQA@@
882No.
883Even if <tt>synchronize_rcu()</tt> were to wait until
884all readers had completed, a new reader might start immediately after
885<tt>synchronize_rcu()</tt> completed.
886Therefore, the code following
887<tt>synchronize_rcu()</tt> cannot rely on there being no readers
888in any case.
889<p>@@QQE@@
890
891<h3><a name="Grace Periods Don't Partition Read-Side Critical Sections">
892Grace Periods Don't Partition Read-Side Critical Sections</a></h3>
893
894<p>
895It is tempting to assume that if any part of one RCU read-side critical
896section precedes a given grace period, and if any part of another RCU
897read-side critical section follows that same grace period, then all of
898the first RCU read-side critical section must precede all of the second.
899However, this just isn't the case: A single grace period does not
900partition the set of RCU read-side critical sections.
901An example of this situation can be illustrated as follows, where
902<tt>a</tt>, <tt>b</tt>, and <tt>c</tt> are initially all zero:
903
904<blockquote>
905<pre>
906 1 void thread0(void)
907 2 {
908 3 rcu_read_lock();
909 4 WRITE_ONCE(a, 1);
910 5 WRITE_ONCE(b, 1);
911 6 rcu_read_unlock();
912 7 }
913 8
914 9 void thread1(void)
91510 {
91611 r1 = READ_ONCE(a);
91712 synchronize_rcu();
91813 WRITE_ONCE(c, 1);
91914 }
92015
92116 void thread2(void)
92217 {
92318 rcu_read_lock();
92419 r2 = READ_ONCE(b);
92520 r3 = READ_ONCE(c);
92621 rcu_read_unlock();
92722 }
928</pre>
929</blockquote>
930
931<p>
932It turns out that the outcome:
933
934<blockquote>
935<pre>
936(r1 == 1 &amp;&amp; r2 == 0 &amp;&amp; r3 == 1)
937</pre>
938</blockquote>
939
940is entirely possible.
941The following figure shows how this can happen, with each circled
942<tt>QS</tt> indicating the point at which RCU recorded a
943<i>quiescent state</i> for each thread, that is, a state in which
944RCU knows that the thread cannot be in the midst of an RCU read-side
945critical section that started before the current grace period:
946
947<p><img src="GPpartitionReaders1.svg" alt="GPpartitionReaders1.svg" width="60%"></p>
948
949<p>
950If it is necessary to partition RCU read-side critical sections in this
951manner, it is necessary to use two grace periods, where the first
952grace period is known to end before the second grace period starts:
953
954<blockquote>
955<pre>
956 1 void thread0(void)
957 2 {
958 3 rcu_read_lock();
959 4 WRITE_ONCE(a, 1);
960 5 WRITE_ONCE(b, 1);
961 6 rcu_read_unlock();
962 7 }
963 8
964 9 void thread1(void)
96510 {
96611 r1 = READ_ONCE(a);
96712 synchronize_rcu();
96813 WRITE_ONCE(c, 1);
96914 }
97015
97116 void thread2(void)
97217 {
97318 r2 = READ_ONCE(c);
97419 synchronize_rcu();
97520 WRITE_ONCE(d, 1);
97621 }
97722
97823 void thread3(void)
97924 {
98025 rcu_read_lock();
98126 r3 = READ_ONCE(b);
98227 r4 = READ_ONCE(d);
98328 rcu_read_unlock();
98429 }
985</pre>
986</blockquote>
987
988<p>
989Here, if <tt>(r1 == 1)</tt>, then
990<tt>thread0()</tt>'s write to <tt>b</tt> must happen
991before the end of <tt>thread1()</tt>'s grace period.
992If in addition <tt>(r4 == 1)</tt>, then
993<tt>thread3()</tt>'s read from <tt>b</tt> must happen
994after the beginning of <tt>thread2()</tt>'s grace period.
995If it is also the case that <tt>(r2 == 1)</tt>, then the
996end of <tt>thread1()</tt>'s grace period must precede the
997beginning of <tt>thread2()</tt>'s grace period.
998This means that the two RCU read-side critical sections cannot overlap,
999guaranteeing that <tt>(r3 == 1)</tt>.
1000As a result, the outcome:
1001
1002<blockquote>
1003<pre>
1004(r1 == 1 &amp;&amp; r2 == 1 &amp;&amp; r3 == 0 &amp;&amp; r4 == 1)
1005</pre>
1006</blockquote>
1007
1008cannot happen.
1009
1010<p>
1011This non-requirement was also non-premeditated, but became apparent
1012when studying RCU's interaction with memory ordering.
1013
1014<h3><a name="Read-Side Critical Sections Don't Partition Grace Periods">
1015Read-Side Critical Sections Don't Partition Grace Periods</a></h3>
1016
1017<p>
1018It is also tempting to assume that if an RCU read-side critical section
1019happens between a pair of grace periods, then those grace periods cannot
1020overlap.
1021However, this temptation leads nowhere good, as can be illustrated by
1022the following, with all variables initially zero:
1023
1024<blockquote>
1025<pre>
1026 1 void thread0(void)
1027 2 {
1028 3 rcu_read_lock();
1029 4 WRITE_ONCE(a, 1);
1030 5 WRITE_ONCE(b, 1);
1031 6 rcu_read_unlock();
1032 7 }
1033 8
1034 9 void thread1(void)
103510 {
103611 r1 = READ_ONCE(a);
103712 synchronize_rcu();
103813 WRITE_ONCE(c, 1);
103914 }
104015
104116 void thread2(void)
104217 {
104318 rcu_read_lock();
104419 WRITE_ONCE(d, 1);
104520 r2 = READ_ONCE(c);
104621 rcu_read_unlock();
104722 }
104823
104924 void thread3(void)
105025 {
105126 r3 = READ_ONCE(d);
105227 synchronize_rcu();
105328 WRITE_ONCE(e, 1);
105429 }
105530
105631 void thread4(void)
105732 {
105833 rcu_read_lock();
105934 r4 = READ_ONCE(b);
106035 r5 = READ_ONCE(e);
106136 rcu_read_unlock();
106237 }
1063</pre>
1064</blockquote>
1065
1066<p>
1067In this case, the outcome:
1068
1069<blockquote>
1070<pre>
1071(r1 == 1 &amp;&amp; r2 == 1 &amp;&amp; r3 == 1 &amp;&amp; r4 == 0 &amp;&amp; r5 == 1)
1072</pre>
1073</blockquote>
1074
1075is entirely possible, as illustrated below:
1076
1077<p><img src="ReadersPartitionGP1.svg" alt="ReadersPartitionGP1.svg" width="100%"></p>
1078
1079<p>
1080Again, an RCU read-side critical section can overlap almost all of a
1081given grace period, just so long as it does not overlap the entire
1082grace period.
1083As a result, an RCU read-side critical section cannot partition a pair
1084of RCU grace periods.
1085
1086<p>@@QQ@@
1087How long would a sequence of grace periods, each separated by an RCU
1088read-side critical section, have to be in order to partition the RCU
1089read-side critical sections at the beginning and end of the chain?
1090<p>@@QQA@@
1091In theory, an infinite number.
1092In practice, an unknown number that is sensitive to both implementation
1093details and timing considerations.
1094Therefore, even in practice, RCU users must abide by the theoretical rather
1095than the practical answer.
1096<p>@@QQE@@
1097
1098<h3><a name="Disabling Preemption Does Not Block Grace Periods">
1099Disabling Preemption Does Not Block Grace Periods</a></h3>
1100
1101<p>
1102There was a time when disabling preemption on any given CPU would block
1103subsequent grace periods.
1104However, this was an accident of implementation and is not a requirement.
1105And in the current Linux-kernel implementation, disabling preemption
1106on a given CPU in fact does not block grace periods, as Oleg Nesterov
1107<a href="https://lkml.kernel.org/g/20150614193825.GA19582@redhat.com">demonstrated</a>.
1108
1109<p>
1110If you need a preempt-disable region to block grace periods, you need to add
1111<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>, for example
1112as follows:
1113
1114<blockquote>
1115<pre>
1116 1 preempt_disable();
1117 2 rcu_read_lock();
1118 3 do_something();
1119 4 rcu_read_unlock();
1120 5 preempt_enable();
1121 6
1122 7 /* Spinlocks implicitly disable preemption. */
1123 8 spin_lock(&amp;mylock);
1124 9 rcu_read_lock();
112510 do_something();
112611 rcu_read_unlock();
112712 spin_unlock(&amp;mylock);
1128</pre>
1129</blockquote>
1130
1131<p>
1132In theory, you could enter the RCU read-side critical section first,
1133but it is more efficient to keep the entire RCU read-side critical
1134section contained in the preempt-disable region as shown above.
1135Of course, RCU read-side critical sections that extend outside of
1136preempt-disable regions will work correctly, but such critical sections
1137can be preempted, which forces <tt>rcu_read_unlock()</tt> to do
1138more work.
1139And no, this is <i>not</i> an invitation to enclose all of your RCU
1140read-side critical sections within preempt-disable regions, because
1141doing so would degrade real-time response.
1142
1143<p>
1144This non-requirement appeared with preemptible RCU.
1145If you need a grace period that waits on non-preemptible code regions, use
1146<a href="#Sched Flavor">RCU-sched</a>.
1147
1148<h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2>
1149
1150<p>
1151These parallelism facts of life are by no means specific to RCU, but
1152the RCU implementation must abide by them.
1153They therefore bear repeating:
1154
1155<ol>
1156<li> Any CPU or task may be delayed at any time,
1157 and any attempts to avoid these delays by disabling
1158 preemption, interrupts, or whatever are completely futile.
1159 This is most obvious in preemptible user-level
1160 environments and in virtualized environments (where
1161 a given guest OS's VCPUs can be preempted at any time by
1162 the underlying hypervisor), but can also happen in bare-metal
1163 environments due to ECC errors, NMIs, and other hardware
1164 events.
1165 Although a delay of more than about 20 seconds can result
1166 in splats, the RCU implementation is obligated to use
1167 algorithms that can tolerate extremely long delays, but where
1168 &ldquo;extremely long&rdquo; is not long enough to allow
1169 wrap-around when incrementing a 64-bit counter.
1170<li> Both the compiler and the CPU can reorder memory accesses.
1171 Where it matters, RCU must use compiler directives and
1172 memory-barrier instructions to preserve ordering.
1173<li> Conflicting writes to memory locations in any given cache line
1174 will result in expensive cache misses.
1175 Greater numbers of concurrent writes and more-frequent
1176 concurrent writes will result in more dramatic slowdowns.
1177 RCU is therefore obligated to use algorithms that have
1178 sufficient locality to avoid significant performance and
1179 scalability problems.
1180<li> As a rough rule of thumb, only one CPU's worth of processing
1181 may be carried out under the protection of any given exclusive
1182 lock.
1183 RCU must therefore use scalable locking designs.
1184<li> Counters are finite, especially on 32-bit systems.
1185 RCU's use of counters must therefore tolerate counter wrap,
1186 or be designed such that counter wrap would take way more
1187 time than a single system is likely to run.
1188 An uptime of ten years is quite possible, a runtime
1189 of a century much less so.
1190 As an example of the latter, RCU's dyntick-idle nesting counter
1191 allows 54 bits for interrupt nesting level (this counter
1192 is 64 bits even on a 32-bit system).
1193 Overflowing this counter requires 2<sup>54</sup>
1194 half-interrupts on a given CPU without that CPU ever going idle.
1195 If a half-interrupt happened every microsecond, it would take
1196 570 years of runtime to overflow this counter, which is currently
1197 believed to be an acceptably long time (the sketch following this list checks this arithmetic).
1198<li> Linux systems can have thousands of CPUs running a single
1199 Linux kernel in a single shared-memory environment.
1200 RCU must therefore pay close attention to high-end scalability.
1201</ol>
1202
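<p>
The following back-of-the-envelope check (a sketch, not kernel code)
verifies the 570-year figure for the dyntick-idle nesting counter:

<blockquote>
<pre>
 1 double us = (double)(1ULL &lt;&lt; 54);              /* ~1.8e16 microseconds. */
 2 double seconds = us / 1e6;                      /* ~1.8e10 seconds. */
 3 double years = seconds / (365.25 * 24 * 3600);  /* ~570 years. */
</pre>
</blockquote>
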
1203<p>
1204This last parallelism fact of life means that RCU must pay special
1205attention to the preceding facts of life.
1206The idea that Linux might scale to systems with thousands of CPUs would
1207have been met with some skepticism in the 1990s, but these requirements
1208would otherwise have been unsurprising, even in the early 1990s.
1209
1210<h2><a name="Quality-of-Implementation Requirements">Quality-of-Implementation Requirements</a></h2>
1211
1212<p>
1213These sections list quality-of-implementation requirements.
1214Although an RCU implementation that ignores these requirements could
1215still be used, it would likely be subject to limitations that would
1216make it inappropriate for industrial-strength production use.
1217Classes of quality-of-implementation requirements are as follows:
1218
1219<ol>
1220<li> <a href="#Specialization">Specialization</a>
1221<li> <a href="#Performance and Scalability">Performance and Scalability</a>
1222<li> <a href="#Composability">Composability</a>
1223<li> <a href="#Corner Cases">Corner Cases</a>
1224</ol>
1225
1226<p>
1227These classes are covered in the following sections.
1228
1229<h3><a name="Specialization">Specialization</a></h3>
1230
1231<p>
1232RCU is and always has been intended primarily for read-mostly situations, as
1233illustrated by the following figure.
1234This means that RCU's read-side primitives are optimized, often at the
1235expense of its update-side primitives.
1236
1237<p><img src="RCUApplicability.svg" alt="RCUApplicability.svg" width="70%"></p>
1238
1239<p>
1240This focus on read-mostly situations means that RCU must interoperate
1241with other synchronization primitives.
1242For example, the <tt>add_gp()</tt> and <tt>remove_gp_synchronous()</tt>
1243examples discussed earlier use RCU to protect readers and locking to
1244coordinate updaters.
1245However, the need extends much farther, requiring that a variety of
1246synchronization primitives be legal within RCU read-side critical sections,
1247including spinlocks, sequence locks, atomic operations, reference
1248counters, and memory barriers.
1249
1250<p>@@QQ@@
1251What about sleeping locks?
1252<p>@@QQA@@
1253These are forbidden within Linux-kernel RCU read-side critical sections
1254because it is not legal to place a quiescent state (in this case,
1255voluntary context switch) within an RCU read-side critical section.
1256However, sleeping locks may be used within userspace RCU read-side critical
1257sections, and also within Linux-kernel sleepable RCU
1258<a href="#Sleepable RCU">(SRCU)</a>
1259read-side critical sections.
1260In addition, the -rt patchset turns spinlocks into sleeping locks so
1261that the corresponding critical sections can be preempted, which
1262also means that these sleeplockified spinlocks (but not other sleeping locks!)
1263may be acquired within -rt-Linux-kernel RCU read-side critical sections.
1264
1265<p>
1266Note that it <i>is</i> legal for a normal RCU read-side critical section
1267to conditionally acquire a sleeping lock (as in <tt>mutex_trylock()</tt>),
1268but only as long as it does not loop indefinitely attempting to
1269conditionally acquire that sleeping lock.
1270The key point is that things like <tt>mutex_trylock()</tt>
1271either return with the mutex held, or return an error indication if
1272the mutex was not immediately available.
1273Either way, <tt>mutex_trylock()</tt> returns immediately without sleeping, as the sketch following this quick quiz illustrates.
1274<p>@@QQE@@
1275
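<p>
A minimal sketch of such conditional acquisition follows, assuming a
hypothetical per-element mutex <tt>-&gt;m</tt> and update function
<tt>do_update()</tt>.
The key point is that the <tt>mutex_trylock()</tt> on line&nbsp;3 returns
immediately rather than sleeping, so no quiescent state can occur within
the critical section:

<blockquote>
<pre>
 1 rcu_read_lock();
 2 p = rcu_dereference(gp);
 3 if (p &amp;&amp; mutex_trylock(&amp;p-&gt;m)) {
 4   do_update(p);  /* Hypothetical update to the element. */
 5   mutex_unlock(&amp;p-&gt;m);  /* Releasing a mutex does not sleep. */
 6 }
 7 rcu_read_unlock();
</pre>
</blockquote>
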
1276<p>
1277It often comes as a surprise, but many algorithms do not require a
1278consistent view of data and can function quite well without one,
1279with network routing being the poster child.
1280Internet routing algorithms take significant time to propagate
1281updates, so that by the time an update arrives at a given system,
1282that system has been sending network traffic the wrong way for
1283a considerable length of time.
1284Having a few threads continue to send traffic the wrong way for a
1285few more milliseconds is clearly not a problem: In the worst case,
1286TCP retransmissions will eventually get the data where it needs to go.
1287In general, when tracking the state of the universe outside of the
1288computer, some level of inconsistency must be tolerated due to
1289speed-of-light delays if nothing else.
1290
1291<p>
1292Furthermore, uncertainty about external state is inherent in many cases.
1293For example, a pair of veterinarians might use heartbeat to determine
1294whether or not a given cat is alive.
1295But how long should they wait after the last heartbeat to decide that
1296the cat is in fact dead?
1297Waiting less than 400 milliseconds makes no sense because this would
1298mean that a relaxed cat would be considered to cycle between death
1299and life more than 100 times per minute.
1300Moreover, just as with human beings, a cat's heart might stop for
1301some period of time, so the exact wait period is a judgment call.
1302One of our pair of veterinarians might wait 30 seconds before pronouncing
1303the cat dead, while the other might insist on waiting a full minute.
1304The two veterinarians would then disagree on the state of the cat during
1305the final 30 seconds of the minute following the last heartbeat, as
1306fancifully illustrated below:
1307
1308<p><img src="2013-08-is-it-dead.png" alt="2013-08-is-it-dead.png" width="431"></p>
1309
1310<p>
1311Interestingly enough, this same situation applies to hardware.
1312When push comes to shove, how do we tell whether or not some
1313external server has failed?
1314We send messages to it periodically, and declare it failed if we
1315don't receive a response within a given period of time.
1316Policy decisions can usually tolerate short
1317periods of inconsistency.
1318The policy was decided some time ago, and is only now being put into
1319effect, so a few milliseconds of delay is normally inconsequential.
1320
1321<p>
1322However, there are algorithms that absolutely must see consistent data.
1323For example, the translation between a user-level SystemV semaphore
1324ID to the corresponding in-kernel data structure is protected by RCU,
1325but it is absolutely forbidden to update a semaphore that has just been
1326removed.
1327In the Linux kernel, this need for consistency is accommodated by acquiring
1328spinlocks located in the in-kernel data structure from within
1329the RCU read-side critical section, and this is indicated by the
1330green box in the figure above.
1331Many other techniques may be used, and are in fact used within the
1332Linux kernel.
1333
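<p>
The following sketch (with <tt>sem_lookup()</tt>, <tt>-&gt;lock</tt>, and
<tt>-&gt;deleted</tt> standing in for the actual SystemV semaphore code)
illustrates this pattern.
On success, it returns with the per-element spinlock held, so that the
lock rather than RCU protects the element after <tt>rcu_read_unlock()</tt>:

<blockquote>
<pre>
 1 struct sem_struct *sem_find_lock(int semid)
 2 {
 3   struct sem_struct *sem;
 4
 5   rcu_read_lock();
 6   sem = sem_lookup(semid);  /* Hypothetical RCU-protected lookup. */
 7   if (sem) {
 8     spin_lock(&amp;sem-&gt;lock);
 9     if (sem-&gt;deleted) {  /* Removed while we were looking? */
10       spin_unlock(&amp;sem-&gt;lock);
11       sem = NULL;
12     }
13   }
14   rcu_read_unlock();
15   return sem;  /* If non-NULL, -&gt;lock is held. */
16 }
</pre>
</blockquote>
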
1334<p>
1335In short, RCU is not required to maintain consistency, and other
1336mechanisms may be used in concert with RCU when consistency is required.
1337RCU's specialization allows it to do its job extremely well, and its
1338ability to interoperate with other synchronization mechanisms allows
1339the right mix of synchronization tools to be used for a given job.
1340
1341<h3><a name="Performance and Scalability">Performance and Scalability</a></h3>
1342
1343<p>
1344Energy efficiency is a critical component of performance today,
1345and Linux-kernel RCU implementations must therefore avoid unnecessarily
1346awakening idle CPUs.
1347I cannot claim that this requirement was premeditated.
1348In fact, I learned of it during a telephone conversation in which I
1349was given &ldquo;frank and open&rdquo; feedback on the importance
1350of energy efficiency in battery-powered systems and on specific
1351energy-efficiency shortcomings of the Linux-kernel RCU implementation.
1352In my experience, the battery-powered embedded community will consider
1353any unnecessary wakeups to be extremely unfriendly acts.
1354So much so that mere Linux-kernel-mailing-list posts are
1355insufficient to vent their ire.
1356
1357<p>
1358Memory consumption is not particularly important in most
1359situations, and has become decreasingly
1360so as memory sizes have expanded and memory
1361costs have plummeted.
1362However, as I learned from Matt Mackall's
1363<a href="http://elinux.org/Linux_Tiny-FAQ">bloatwatch</a>
1364efforts, memory footprint is critically important on single-CPU systems with
1365non-preemptible (<tt>CONFIG_PREEMPT=n</tt>) kernels, and thus
1366<a href="https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com">tiny RCU</a>
1367was born.
1368Josh Triplett has since taken over the small-memory banner with his
1369<a href="https://tiny.wiki.kernel.org/">Linux kernel tinification</a>
1370project, which resulted in
1371<a href="#Sleepable RCU">SRCU</a>
1372becoming optional for those kernels not needing it.
1373
1374<p>
1375The remaining performance requirements are, for the most part,
1376unsurprising.
1377For example, in keeping with RCU's read-side specialization,
1378<tt>rcu_dereference()</tt> should have negligible overhead (for
1379example, suppression of a few minor compiler optimizations).
1380Similarly, in non-preemptible environments, <tt>rcu_read_lock()</tt> and
1381<tt>rcu_read_unlock()</tt> should have exactly zero overhead.
1382
1383<p>
1384In preemptible environments, in the case where the RCU read-side
1385critical section was not preempted (as will be the case for the
1386highest-priority real-time process), <tt>rcu_read_lock()</tt> and
1387<tt>rcu_read_unlock()</tt> should have minimal overhead.
1388In particular, they should not contain atomic read-modify-write
1389operations, memory-barrier instructions, preemption disabling,
1390interrupt disabling, or backwards branches.
1391However, in the case where the RCU read-side critical section was preempted,
1392<tt>rcu_read_unlock()</tt> may acquire spinlocks and disable interrupts.
1393This is why it is better to nest an RCU read-side critical section
1394within a preempt-disable region than vice versa, at least in cases
1395where that critical section is short enough to avoid unduly degrading
1396real-time latencies.
1397
1398<p>
1399The <tt>synchronize_rcu()</tt> grace-period-wait primitive is
1400optimized for throughput.
1401It may therefore incur several milliseconds of latency in addition to
1402the duration of the longest RCU read-side critical section.
1403On the other hand, multiple concurrent invocations of
1404<tt>synchronize_rcu()</tt> are required to use batching optimizations
1405so that they can be satisfied by a single underlying grace-period-wait
1406operation.
1407For example, in the Linux kernel, it is not unusual for a single
1408grace-period-wait operation to serve more than
1409<a href="https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response">1,000 separate invocations</a>
1410of <tt>synchronize_rcu()</tt>, thus amortizing the per-invocation
1411overhead down to nearly zero.
1412However, the grace-period optimization is also required to avoid
1413measurable degradation of real-time scheduling and interrupt latencies.
1414
1415<p>
1416In some cases, the multi-millisecond <tt>synchronize_rcu()</tt>
1417latencies are unacceptable.
1418In these cases, <tt>synchronize_rcu_expedited()</tt> may be used
1419instead, reducing the grace-period latency down to a few tens of
1420microseconds on small systems, at least in cases where the RCU read-side
1421critical sections are short.
1422There are currently no special latency requirements for
1423<tt>synchronize_rcu_expedited()</tt> on large systems, but,
1424consistent with the empirical nature of the RCU specification,
1425that is subject to change.
1426However, there most definitely are scalability requirements:
1427A storm of <tt>synchronize_rcu_expedited()</tt> invocations on 4096
1428CPUs should at least make reasonable forward progress.
1429In return for its shorter latencies, <tt>synchronize_rcu_expedited()</tt>
1430is permitted to impose modest degradation of real-time latency
1431on non-idle online CPUs.
1432That said, it will likely be necessary to take further steps to reduce this
1433degradation, hopefully to roughly that of a scheduling-clock interrupt.
1434
1435<p>
1436There are a number of situations where even
1437<tt>synchronize_rcu_expedited()</tt>'s reduced grace-period
1438latency is unacceptable.
1439In these situations, the asynchronous <tt>call_rcu()</tt> can be
1440used in place of <tt>synchronize_rcu()</tt> as follows:
1441
1442<blockquote>
1443<pre>
1444 1 struct foo {
1445 2 int a;
1446 3 int b;
1447 4 struct rcu_head rh;
1448 5 };
1449 6
1450 7 static void remove_gp_cb(struct rcu_head *rhp)
1451 8 {
1452 9 struct foo *p = container_of(rhp, struct foo, rh);
145310
145411 kfree(p);
145512 }
145613
145714 bool remove_gp_asynchronous(void)
145815 {
145916 struct foo *p;
146017
146118 spin_lock(&amp;gp_lock);
146219 p = rcu_access_pointer(gp);
146320 if (!p) {
146421 spin_unlock(&amp;gp_lock);
146522 return false;
146623 }
146724 rcu_assign_pointer(gp, NULL);
146825 call_rcu(&amp;p-&gt;rh, remove_gp_cb);
146926 spin_unlock(&amp;gp_lock);
147027 return true;
147128 }
1472</pre>
1473</blockquote>
1474
1475<p>
1476A definition of <tt>struct foo</tt> is finally needed, and appears
1477on lines&nbsp;1-5.
1478The function <tt>remove_gp_cb()</tt> is passed to <tt>call_rcu()</tt>
1479on line&nbsp;25, and will be invoked after the end of a subsequent
1480grace period.
1481This gets the same effect as <tt>remove_gp_synchronous()</tt>,
1482but without forcing the updater to wait for a grace period to elapse.
1483The <tt>call_rcu()</tt> function may be used in a number of
1484situations where neither <tt>synchronize_rcu()</tt> nor
1485<tt>synchronize_rcu_expedited()</tt> would be legal,
1486including within preempt-disable code, <tt>local_bh_disable()</tt> code,
1487interrupt-disable code, and interrupt handlers.
1488However, even <tt>call_rcu()</tt> is illegal within NMI handlers.
1489The callback function (<tt>remove_gp_cb()</tt> in this case) will be
1490executed within a softirq (software interrupt) environment within the
1491Linux kernel,
1492either within a real softirq handler or under the protection
1493of <tt>local_bh_disable()</tt>.
1494In both the Linux kernel and in userspace, it is bad practice to
1495write an RCU callback function that takes too long.
1496Long-running operations should be relegated to separate threads or
1497(in the Linux kernel) workqueues.
1498
1499<p>@@QQ@@
1500Why does line&nbsp;19 use <tt>rcu_access_pointer()</tt>?
1501After all, <tt>call_rcu()</tt> on line&nbsp;25 stores into the
1502structure, which would interact badly with concurrent insertions.
1503Doesn't this mean that <tt>rcu_dereference()</tt> is required?
1504<p>@@QQA@@
1505Presumably the <tt>-&gt;gp_lock</tt> acquired on line&nbsp;18 excludes
1506any changes, including any insertions that <tt>rcu_dereference()</tt>
1507would protect against.
1508Therefore, any insertions will be delayed until after <tt>-&gt;gp_lock</tt>
1509is released on line&nbsp;25, which in turn means that
1510<tt>rcu_access_pointer()</tt> suffices.
1511<p>@@QQE@@
1512
1513<p>
1514However, all that <tt>remove_gp_cb()</tt> is doing is
1515invoking <tt>kfree()</tt> on the data element.
1516This is a common idiom, and is supported by <tt>kfree_rcu()</tt>,
1517which allows &ldquo;fire and forget&rdquo; operation as shown below:
1518
1519<blockquote>
1520<pre>
1521 1 struct foo {
1522 2 int a;
1523 3 int b;
1524 4 struct rcu_head rh;
1525 5 };
1526 6
1527 7 bool remove_gp_faf(void)
1528 8 {
1529 9 struct foo *p;
153010
153111 spin_lock(&amp;gp_lock);
153212 p = rcu_dereference(gp);
153313 if (!p) {
153414 spin_unlock(&amp;gp_lock);
153515 return false;
153616 }
153717 rcu_assign_pointer(gp, NULL);
153818 kfree_rcu(p, rh);
153919 spin_unlock(&amp;gp_lock);
154020 return true;
154121 }
1542</pre>
1543</blockquote>
1544
1545<p>
1546Note that <tt>remove_gp_faf()</tt> simply invokes
1547<tt>kfree_rcu()</tt> and proceeds, without any need to pay any
1548further attention to the subsequent grace period and <tt>kfree()</tt>.
1549It is permissible to invoke <tt>kfree_rcu()</tt> from the same
1550environments as for <tt>call_rcu()</tt>.
1551Interestingly enough, DYNIX/ptx had the equivalents of
1552<tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>, but not
1553<tt>synchronize_rcu()</tt>.
1554This was due to the fact that RCU was not heavily used within DYNIX/ptx,
1555so the very few places that needed something like
1556<tt>synchronize_rcu()</tt> simply open-coded it.
1557
1558<p>@@QQ@@
1559Earlier it was claimed that <tt>call_rcu()</tt> and
1560<tt>kfree_rcu()</tt> allowed updaters to avoid being blocked
1561by readers.
1562But how can that be correct, given that the invocation of the callback
1563and the freeing of the memory (respectively) must still wait for
1564a grace period to elapse?
1565<p>@@QQA@@
1566We could define things this way, but keep in mind that this sort of
1567definition would say that updates in garbage-collected languages
1568cannot complete until the next time the garbage collector runs,
1569which does not seem at all reasonable.
1570The key point is that in most cases, an updater using either
1571<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> can proceed to the
1572next update as soon as it has invoked <tt>call_rcu()</tt> or
1573<tt>kfree_rcu()</tt>, without having to wait for a subsequent
1574grace period.
1575<p>@@QQE@@
1576
1577<p>
1578But what if the updater must wait for the completion of code to be
1579executed after the end of the grace period, but has other tasks
1580that can be carried out in the meantime?
1581The polling-style <tt>get_state_synchronize_rcu()</tt> and
1582<tt>cond_synchronize_rcu()</tt> functions may be used for this
1583purpose, as shown below:
1584
1585<blockquote>
1586<pre>
1587 1 bool remove_gp_poll(void)
1588 2 {
1589 3 struct foo *p;
1590 4 unsigned long s;
1591 5
1592 6 spin_lock(&amp;gp_lock);
1593 7 p = rcu_access_pointer(gp);
1594 8 if (!p) {
1595 9 spin_unlock(&amp;gp_lock);
159610 return false;
159711 }
159812 rcu_assign_pointer(gp, NULL);
159913 spin_unlock(&amp;gp_lock);
160014 s = get_state_synchronize_rcu();
160115 do_something_while_waiting();
160216 cond_synchronize_rcu(s);
160317 kfree(p);
160418 return true;
160519 }
1606</pre>
1607</blockquote>
1608
1609<p>
1610On line&nbsp;14, <tt>get_state_synchronize_rcu()</tt> obtains a
1611&ldquo;cookie&rdquo; from RCU,
1612then line&nbsp;15 carries out other tasks,
1613and finally, line&nbsp;16 returns immediately if a grace period has
1614elapsed in the meantime, but otherwise waits as required.
1615The need for <tt>get_state_synchronize_rcu()</tt> and
1616<tt>cond_synchronize_rcu()</tt> has appeared quite recently,
1617so it is too early to tell whether they will stand the test of time.
1618
1619<p>
1620RCU thus provides a range of tools to allow updaters to strike the
1621required tradeoff between latency, flexibility, and CPU overhead.
1622
1623<h3><a name="Composability">Composability</a></h3>
1624
1625<p>
1626Composability has received much attention in recent years, perhaps in part
1627due to the collision of multicore hardware with object-oriented techniques
1628designed in single-threaded environments for single-threaded use.
1629And in theory, RCU read-side critical sections may be composed, and in
1630fact may be nested arbitrarily deeply.
1631In practice, as with all real-world implementations of composable
1632constructs, there are limitations.
1633
1634<p>
1635Implementations of RCU for which <tt>rcu_read_lock()</tt>
1636and <tt>rcu_read_unlock()</tt> generate no code, such as
1637Linux-kernel RCU when <tt>CONFIG_PREEMPT=n</tt>, can be
1638nested arbitrarily deeply.
1639After all, there is no overhead.
1640Except that if all these instances of <tt>rcu_read_lock()</tt>
1641and <tt>rcu_read_unlock()</tt> are visible to the compiler,
1642compilation will eventually fail due to exhausting memory,
1643mass storage, or user patience, whichever comes first.
1644If the nesting is not visible to the compiler, as is the case with
1645mutually recursive functions each in its own translation unit,
1646stack overflow will result.
1647If the nesting takes the form of loops, either the control variable
1648will overflow or (in the Linux kernel) you will get an RCU CPU stall warning.
1649Nevertheless, this class of RCU implementations is one
1650of the most composable constructs in existence.
1651
1652<p>
1653RCU implementations that explicitly track nesting depth
1654are limited by the nesting-depth counter.
1655For example, the Linux kernel's preemptible RCU limits nesting to
1656<tt>INT_MAX</tt>.
1657This should suffice for almost all practical purposes.
1658That said, a consecutive pair of RCU read-side critical sections
1659between which there is an operation that waits for a grace period
1660cannot be enclosed in another RCU read-side critical section.
1661This is because it is not legal to wait for a grace period within
1662an RCU read-side critical section: To do so would result either
1663in deadlock or
1664in RCU implicitly splitting the enclosing RCU read-side critical
1665section, neither of which is conducive to a long-lived and prosperous
1666kernel.
1667
1668<p>
1669It is worth noting that RCU is not alone in limiting composability.
1670For example, many transactional-memory implementations prohibit
1671composing a pair of transactions separated by an irrevocable
1672operation (for example, a network receive operation).
1673For another example, lock-based critical sections can be composed
1674surprisingly freely, but only if deadlock is avoided.
1675
1676<p>
1677In short, although RCU read-side critical sections are highly composable,
1678care is required in some situations, just as is the case for any other
1679composable synchronization mechanism.
1680
1681<h3><a name="Corner Cases">Corner Cases</a></h3>
1682
1683<p>
1684A given RCU workload might have an endless and intense stream of
1685RCU read-side critical sections, perhaps even so intense that there
1686was never a point in time during which there was not at least one
1687RCU read-side critical section in flight.
1688RCU cannot allow this situation to block grace periods: As long as
1689all the RCU read-side critical sections are finite, grace periods
1690must also be finite.
1691
1692<p>
1693That said, preemptible RCU implementations could potentially result
1694in RCU read-side critical sections being preempted for long durations,
1695which has the effect of creating a long-duration RCU read-side
1696critical section.
1697This situation can arise only in heavily loaded systems, but systems using
1698real-time priorities are of course more vulnerable.
1699Therefore, RCU priority boosting is provided to help deal with this
1700case.
1701That said, the exact requirements on RCU priority boosting will likely
1702evolve as more experience accumulates.
1703
1704<p>
1705Other workloads might have very high update rates.
1706Although one can argue that such workloads should instead use
1707something other than RCU, the fact remains that RCU must
1708handle such workloads gracefully.
1709This requirement is another factor driving batching of grace periods,
1710but it is also the driving force behind the checks for large numbers
1711of queued RCU callbacks in the <tt>call_rcu()</tt> code path.
1712Finally, high update rates should not delay RCU read-side critical
1713sections, although some read-side delays can occur when using
1714<tt>synchronize_rcu_expedited()</tt>, courtesy of this function's use
1715of <tt>try_stop_cpus()</tt>.
1716(In the future, <tt>synchronize_rcu_expedited()</tt> will be
1717converted to use lighter-weight inter-processor interrupts (IPIs),
1718but this will still disturb readers, though to a much smaller degree.)
1719
1720<p>
1721Although all three of these corner cases were understood in the early
17221990s, a simple user-level test consisting of <tt>close(open(path))</tt>
1723in a tight loop
1724in the early 2000s suddenly provided a much deeper appreciation of the
1725high-update-rate corner case.
1726This test also motivated the addition of some RCU code to react to high update
1727rates.  For example, if a given CPU finds itself with more than 10,000
1728RCU callbacks queued, RCU takes evasive action by
1729more aggressively starting grace periods and more aggressively forcing
1730completion of grace-period processing.
1731This evasive action causes the grace period to complete more quickly,
1732but at the cost of restricting RCU's batching optimizations, thus
1733increasing the CPU overhead incurred by that grace period.
1734
1735<h2><a name="Software-Engineering Requirements">
1736Software-Engineering Requirements</a></h2>
1737
1738<p>
1739Between Murphy's Law and &ldquo;To err is human&rdquo;, it is necessary to
1740guard against mishaps and misuse:
1741
1742<ol>
1743<li> It is all too easy to forget to use <tt>rcu_read_lock()</tt>
1744 everywhere that it is needed, so kernels built with
1745 <tt>CONFIG_PROVE_RCU=y</tt> will splat if
1746 <tt>rcu_dereference()</tt> is used outside of an
1747 RCU read-side critical section.
1748 Update-side code can use <tt>rcu_dereference_protected()</tt>,
1749 which takes a
1750 <a href="https://lwn.net/Articles/371986/">lockdep expression</a>
1751 to indicate what is providing the protection.
1752 If the indicated protection is not provided, a lockdep splat
1753 is emitted.  (A sketch of this usage appears just after this list.)
1754
1755 <p>
1756 Code shared between readers and updaters can use
1757 <tt>rcu_dereference_check()</tt>, which also takes a
1758 lockdep expression, and emits a lockdep splat if neither
1759 <tt>rcu_read_lock()</tt> nor the indicated protection
1760 is in place.
1761 In addition, <tt>rcu_dereference_raw()</tt> is used in those
1762 (hopefully rare) cases where the required protection cannot
1763 be easily described.
1764 Finally, <tt>rcu_read_lock_held()</tt> is provided to
1765 allow a function to verify that it has been invoked within
1766 an RCU read-side critical section.
1767 I was made aware of this set of requirements shortly after Thomas
1768 Gleixner audited a number of RCU uses.
1769<li> A given function might wish to check for RCU-related preconditions
1770 upon entry, before using any other RCU API.
1771 The <tt>rcu_lockdep_assert()</tt> does this job,
1772 asserting the expression in kernels having lockdep enabled
1773 and doing nothing otherwise.
1774<li> It is also easy to forget to use <tt>rcu_assign_pointer()</tt>
1775 and <tt>rcu_dereference()</tt>, perhaps (incorrectly)
1776 substituting a simple assignment.
1777 To catch this sort of error, a given RCU-protected pointer may be
1778 tagged with <tt>__rcu</tt>, after which running sparse
1779 with <tt>CONFIG_SPARSE_RCU_POINTER=y</tt> will complain
1780 about simple-assignment accesses to that pointer.
1781 Arnd Bergmann made me aware of this requirement, and also
1782 supplied the needed
1783 <a href="https://lwn.net/Articles/376011/">patch series</a>.
1784<li> Kernels built with <tt>CONFIG_DEBUG_OBJECTS_RCU_HEAD=y</tt>
1785 will splat if a data element is passed to <tt>call_rcu()</tt>
1786 twice in a row, without a grace period in between.
1787 (This error is similar to a double free.)
1788 The corresponding <tt>rcu_head</tt> structures that are
1789 dynamically allocated are automatically tracked, but
1790 <tt>rcu_head</tt> structures allocated on the stack
1791 must be initialized with <tt>init_rcu_head_on_stack()</tt>
1792 and cleaned up with <tt>destroy_rcu_head_on_stack()</tt>.
1793 Similarly, statically allocated non-stack <tt>rcu_head</tt>
1794 structures must be initialized with <tt>init_rcu_head()</tt>
1795 and cleaned up with <tt>destroy_rcu_head()</tt>.
1796 Mathieu Desnoyers made me aware of this requirement, and also
1797 supplied the needed
1798 <a href="https://lkml.kernel.org/g/20100319013024.GA28456@Krystal">patch</a>.
1799<li> An infinite loop in an RCU read-side critical section will
1800 eventually trigger an RCU CPU stall warning splat, with
1801 the duration of &ldquo;eventually&rdquo; being controlled by the
1802 <tt>RCU_CPU_STALL_TIMEOUT</tt> <tt>Kconfig</tt> option, or,
1803 alternatively, by the
1804 <tt>rcupdate.rcu_cpu_stall_timeout</tt> boot/sysfs
1805 parameter.
1806 However, RCU is not obligated to produce this splat
1807 unless there is a grace period waiting on that particular
1808 RCU read-side critical section.
1809 <p>
1810 Some extreme workloads might intentionally delay
1811 RCU grace periods, and systems running those workloads can
1812 be booted with <tt>rcupdate.rcu_cpu_stall_suppress</tt>
1813 to suppress the splats.
1814 This kernel parameter may also be set via <tt>sysfs</tt>.
1815 Furthermore, RCU CPU stall warnings are counter-productive
1816 during sysrq dumps and during panics.
1817 RCU therefore supplies the <tt>rcu_sysrq_start()</tt> and
1818 <tt>rcu_sysrq_end()</tt> API members to be called before
1819 and after long sysrq dumps.
1820 RCU also supplies the <tt>rcu_panic()</tt> notifier that is
1821 automatically invoked at the beginning of a panic to suppress
1822 further RCU CPU stall warnings.
1823
1824 <p>
1825 This requirement made itself known in the early 1990s, pretty
1826 much the first time that it was necessary to debug a CPU stall.
1827 That said, the initial implementation in DYNIX/ptx was quite
1828 generic in comparison with that of Linux.
1829<li> Although it would be very good to detect pointers leaking out
1830 of RCU read-side critical sections, there is currently no
1831 good way of doing this.
1832 One complication is the need to distinguish between pointers
1833 leaking and pointers that have been handed off from RCU to
1834 some other synchronization mechanism, for example, reference
1835 counting.
1836<li> In kernels built with <tt>CONFIG_RCU_TRACE=y</tt>, RCU-related
1837 information is provided via both debugfs and event tracing.
1838<li> Open-coded use of <tt>rcu_assign_pointer()</tt> and
1839 <tt>rcu_dereference()</tt> to create typical linked
1840 data structures can be surprisingly error-prone.
1841 Therefore, RCU-protected
1842 <a href="https://lwn.net/Articles/609973/#RCU List APIs">linked lists</a>
1843 and, more recently, RCU-protected
1844 <a href="https://lwn.net/Articles/612100/">hash tables</a>
1845 are available.
1846 Many other special-purpose RCU-protected data structures are
1847 available in the Linux kernel and the userspace RCU library.
1848<li> Some linked structures are created at compile time, but still
1849 require <tt>__rcu</tt> checking.
1850 The <tt>RCU_POINTER_INITIALIZER()</tt> macro serves this
1851 purpose.
1852<li> It is not necessary to use <tt>rcu_assign_pointer()</tt>
1853 when creating linked structures that are to be published via
1854 a single external pointer.
1855 The <tt>RCU_INIT_POINTER()</tt> macro is provided for
1856 this task and also for assigning <tt>NULL</tt> pointers
1857 at runtime.
1858</ol>
1859
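<p>
For example, the following sketch shows how the <tt>gp</tt> pointer of
the earlier examples might be declared and then accessed from the
update side and from code shared between readers and updaters.
(The helper names are made up, and <tt>gp_lock</tt> is again that of
the earlier examples.)

<blockquote>
<pre>
 1 struct foo __rcu *gp; /* The __rcu tag enables sparse checking. */
 2
 3 static struct foo *get_gp_update_side(void)
 4 {
 5   return rcu_dereference_protected(gp, lockdep_is_held(&amp;gp_lock));
 6 }
 7
 8 static struct foo *get_gp_shared(void)
 9 {
10   return rcu_dereference_check(gp, lockdep_is_held(&amp;gp_lock));
11 }
</pre>
</blockquote>
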
1860<p>
1861This is not a hard-and-fast list: RCU's diagnostic capabilities will
1862continue to be guided by the number and type of usage bugs found
1863in real-world RCU usage.
1864
1865<h2><a name="Linux Kernel Complications">Linux Kernel Complications</a></h2>
1866
1867<p>
1868The Linux kernel provides an interesting environment for all kinds of
1869software, including RCU.
1870Some of the relevant points of interest are as follows:
1871
1872<ol>
1873<li> <a href="#Configuration">Configuration</a>.
1874<li> <a href="#Firmware Interface">Firmware Interface</a>.
1875<li> <a href="#Early Boot">Early Boot</a>.
1876<li> <a href="#Interrupts and NMIs">
1877 Interrupts and non-maskable interrupts (NMIs)</a>.
1878<li> <a href="#Loadable Modules">Loadable Modules</a>.
1879<li> <a href="#Hotplug CPU">Hotplug CPU</a>.
1880<li> <a href="#Scheduler and RCU">Scheduler and RCU</a>.
1881<li> <a href="#Tracing and RCU">Tracing and RCU</a>.
1882<li> <a href="#Energy Efficiency">Energy Efficiency</a>.
1883<li>	<a href="#Memory Efficiency">Memory Efficiency</a>.
1884<li> <a href="#Performance, Scalability, Response Time, and Reliability">
1885 Performance, Scalability, Response Time, and Reliability</a>.
1886</ol>
1887
1888<p>
1889This list is probably incomplete, but it does give a feel for the
1890most notable Linux-kernel complications.
1891Each of the following sections covers one of the above topics.
1892
1893<h3><a name="Configuration">Configuration</a></h3>
1894
1895<p>
1896RCU's goal is automatic configuration, so that almost nobody
1897needs to worry about RCU's <tt>Kconfig</tt> options.
1898And for almost all users, RCU does in fact work well
1899&ldquo;out of the box.&rdquo;
1900
1901<p>
1902However, there are specialized use cases that are handled by
1903kernel boot parameters and <tt>Kconfig</tt> options.
1904Unfortunately, the <tt>Kconfig</tt> system will explicitly ask users
1905about new <tt>Kconfig</tt> options, which requires that almost all of them
1906be hidden behind a <tt>CONFIG_RCU_EXPERT</tt> <tt>Kconfig</tt> option.
1907
1908<p>
1909This all should be quite obvious, but the fact remains that
1910Linus Torvalds recently had to
1911<a href="https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com">remind</a>
1912me of this requirement.
1913
1914<h3><a name="Firmware Interface">Firmware Interface</a></h3>
1915
1916<p>
1917In many cases, the kernel obtains information about the system from the
1918firmware, and sometimes things are lost in translation.
1919Or the translation is accurate, but the original message is bogus.
1920
1921<p>
1922For example, some systems' firmware overreports the number of CPUs,
1923sometimes by a large factor.
1924If RCU naively believed the firmware, as it used to do,
1925it would create too many per-CPU kthreads.
1926Although the resulting system will still run correctly, the extra
1927kthreads needlessly consume memory and can cause confusion
1928when they show up in <tt>ps</tt> listings.
1929
1930<p>
1931RCU must therefore wait for a given CPU to actually come online before
1932it can allow itself to believe that the CPU actually exists.
1933The resulting &ldquo;ghost CPUs&rdquo; (which are never going to
1934come online) cause a number of
1935<a href="https://paulmck.livejournal.com/37494.html">interesting complications</a>.
1936
1937<h3><a name="Early Boot">Early Boot</a></h3>
1938
1939<p>
1940The Linux kernel's boot sequence is an interesting process,
1941and RCU is used early, even before <tt>rcu_init()</tt>
1942is invoked.
1943In fact, a number of RCU's primitives can be used as soon as the
1944initial task's <tt>task_struct</tt> is available and the
1945boot CPU's per-CPU variables are set up.
1946The read-side primitives (<tt>rcu_read_lock()</tt>,
1947<tt>rcu_read_unlock()</tt>, <tt>rcu_dereference()</tt>,
1948and <tt>rcu_access_pointer()</tt>) will operate normally very early on,
1949as will <tt>rcu_assign_pointer()</tt>.
1950
1951<p>
1952Although <tt>call_rcu()</tt> may be invoked at any
1953time during boot, callbacks are not guaranteed to be invoked until after
1954the scheduler is fully up and running.
1955This delay in callback invocation is due to the fact that RCU does not
1956invoke callbacks until it is fully initialized, and this full initialization
1957cannot occur until after the scheduler has initialized itself to the
1958point where RCU can spawn and run its kthreads.
1959In theory, it would be possible to invoke callbacks earlier,
1960however, this is not a panacea because there would be severe restrictions
1961on what operations those callbacks could invoke.
1962
1963<p>
1964Perhaps surprisingly, <tt>synchronize_rcu()</tt>,
1965<a href="#Bottom-Half Flavor"><tt>synchronize_rcu_bh()</tt></a>
1966(<a href="#Bottom-Half Flavor">discussed below</a>),
1967and
1968<a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>
1969will all operate normally
1970during very early boot, the reason being that there is only one CPU
1971and preemption is disabled.
1972This means that a call to <tt>synchronize_rcu()</tt> (or friends) is
1973itself a quiescent
1974state and thus a grace period, so the early-boot implementation can
1975be a no-op.
1976
1977<p>
1978Both <tt>synchronize_rcu_bh()</tt> and <tt>synchronize_sched()</tt>
1979continue to operate normally through the remainder of boot, courtesy
1980of the fact that preemption is disabled across their RCU read-side
1981critical sections and also courtesy of the fact that there is still
1982only one CPU.
1983However, once the scheduler starts initializing, preemption is enabled.
1984There is still only a single CPU, but the fact that preemption is enabled
1985means that the no-op implementation of <tt>synchronize_rcu()</tt> no
1986longer works in <tt>CONFIG_PREEMPT=y</tt> kernels.
1987Therefore, as soon as the scheduler starts initializing, the early-boot
1988fastpath is disabled.
1989This means that <tt>synchronize_rcu()</tt> switches to its runtime
1990mode of operation where it posts callbacks, which in turn means that
1991any call to <tt>synchronize_rcu()</tt> will block until the corresponding
1992callback is invoked.
1993Unfortunately, the callback cannot be invoked until RCU's runtime
1994grace-period machinery is up and running, which cannot happen until
1995the scheduler has initialized itself sufficiently to allow RCU's
1996kthreads to be spawned.
1997Therefore, invoking <tt>synchronize_rcu()</tt> during scheduler
1998initialization can result in deadlock.
1999
2000<p>@@QQ@@
2001So what happens with <tt>synchronize_rcu()</tt> during
2002scheduler initialization for <tt>CONFIG_PREEMPT=n</tt>
2003kernels?
2004<p>@@QQA@@
2005In <tt>CONFIG_PREEMPT=n</tt> kernels, <tt>synchronize_rcu()</tt>
2006maps directly to <tt>synchronize_sched()</tt>.
2007Therefore, <tt>synchronize_rcu()</tt> works normally throughout
2008boot in <tt>CONFIG_PREEMPT=n</tt> kernels.
2009However, your code must also work in <tt>CONFIG_PREEMPT=y</tt> kernels,
2010so it is still necessary to avoid invoking <tt>synchronize_rcu()</tt>
2011during scheduler initialization.
2012<p>@@QQE@@
2013
2014<p>
2015I learned of these boot-time requirements as a result of a series of
2016system hangs.
2017
2018<h3><a name="Interrupts and NMIs">Interrupts and NMIs</a></h3>
2019
2020<p>
2021The Linux kernel has interrupts, and RCU read-side critical sections are
2022legal within interrupt handlers and within interrupt-disabled regions
2023of code, as are invocations of <tt>call_rcu()</tt>.
2024
2025<p>
2026Some Linux-kernel architectures can enter an interrupt handler from
2027non-idle process context, and then just never leave it, instead stealthily
2028transitioning back to process context.
2029This trick is sometimes used to invoke system calls from inside the kernel.
2030These &ldquo;half-interrupts&rdquo; mean that RCU has to be very careful
2031about how it counts interrupt nesting levels.
2032I learned of this requirement the hard way during a rewrite
2033of RCU's dyntick-idle code.
2034
2035<p>
2036The Linux kernel has non-maskable interrupts (NMIs), and
2037RCU read-side critical sections are legal within NMI handlers.
2038Thankfully, RCU update-side primitives, including
2039<tt>call_rcu()</tt>, are prohibited within NMI handlers.
2040
2041<p>
2042The name notwithstanding, some Linux-kernel architectures
2043can have nested NMIs, which RCU must handle correctly.
2044Andy Lutomirski
2045<a href="https://lkml.kernel.org/g/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a>
2046with this requirement;
2047he also kindly surprised me with
2048<a href="https://lkml.kernel.org/g/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a>
2049that meets this requirement.
2050
2051<h3><a name="Loadable Modules">Loadable Modules</a></h3>
2052
2053<p>
2054The Linux kernel has loadable modules, and these modules can
2055also be unloaded.
2056After a given module has been unloaded, any attempt to call
2057one of its functions results in a segmentation fault.
2058The module-unload functions must therefore cancel any
2059delayed calls to loadable-module functions, for example,
2060any outstanding <tt>mod_timer()</tt> must be dealt with
2061via <tt>del_timer_sync()</tt> or similar.
2062
2063<p>
2064Unfortunately, there is no way to cancel an RCU callback;
2065once you invoke <tt>call_rcu()</tt>, the callback function is
2066going to eventually be invoked, unless the system goes down first.
2067Because it is normally considered socially irresponsible to crash the system
2068in response to a module unload request, we need some other way
2069to deal with in-flight RCU callbacks.
2070
2071<p>
2072RCU therefore provides
2073<tt><a href="https://lwn.net/Articles/217484/">rcu_barrier()</a></tt>,
2074which waits until all in-flight RCU callbacks have been invoked.
2075If a module uses <tt>call_rcu()</tt>, its exit function should therefore
2076prevent any future invocation of <tt>call_rcu()</tt>, then invoke
2077<tt>rcu_barrier()</tt>.
2078In theory, the underlying module-unload code could invoke
2079<tt>rcu_barrier()</tt> unconditionally, but in practice this would
2080incur unacceptable latencies.
2081
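<p>
A module's exit function might therefore follow the pattern sketched
below, in which <tt>mymod_exit()</tt>, <tt>unregister_my_hooks()</tt>,
and <tt>mymod_cache</tt> are all hypothetical names:

<blockquote>
<pre>
 1 static void __exit mymod_exit(void)
 2 {
 3   unregister_my_hooks(); /* Prevent future call_rcu() invocations. */
 4   rcu_barrier();         /* Wait for in-flight RCU callbacks. */
 5   kmem_cache_destroy(mymod_cache); /* Now safe to tear down. */
 6 }
 7 module_exit(mymod_exit);
</pre>
</blockquote>
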
2082<p>
2083Nikita Danilov noted this requirement for an analogous filesystem-unmount
2084situation, and Dipankar Sarma incorporated <tt>rcu_barrier()</tt> into RCU.
2085The need for <tt>rcu_barrier()</tt> for module unloading became
2086apparent later.
2087
2088<h3><a name="Hotplug CPU">Hotplug CPU</a></h3>
2089
2090<p>
2091The Linux kernel supports CPU hotplug, which means that CPUs
2092can come and go.
2093It is of course illegal to use any RCU API member from an offline CPU.
2094This requirement was present from day one in DYNIX/ptx, but
2095on the other hand, the Linux kernel's CPU-hotplug implementation
2096is &ldquo;interesting.&rdquo;
2097
2098<p>
2099The Linux-kernel CPU-hotplug implementation has notifiers that
2100are used to allow the various kernel subsystems (including RCU)
2101to respond appropriately to a given CPU-hotplug operation.
2102Most RCU operations may be invoked from CPU-hotplug notifiers,
2103including even normal synchronous grace-period operations
2104such as <tt>synchronize_rcu()</tt>.
2105However, expedited grace-period operations such as
2106<tt>synchronize_rcu_expedited()</tt> are not supported,
2107due to the fact that current implementations block CPU-hotplug
2108operations, which could result in deadlock.
2109
2110<p>
2111In addition, all-callback-wait operations such as
2112<tt>rcu_barrier()</tt> are also not supported, due to the
2113fact that there are phases of CPU-hotplug operations where
2114the outgoing CPU's callbacks will not be invoked until after
2115the CPU-hotplug operation ends, which could also result in deadlock.
2116
2117<h3><a name="Scheduler and RCU">Scheduler and RCU</a></h3>
2118
2119<p>
2120RCU depends on the scheduler, and the scheduler uses RCU to
2121protect some of its data structures.
2122This means the scheduler is forbidden from acquiring
2123the runqueue locks and the priority-inheritance locks
2124in the middle of an outermost RCU read-side critical section unless either
2125(1)&nbsp;it releases them before exiting that same
2126RCU read-side critical section, or
2127(2)&nbsp;interrupts are disabled across
2128that entire RCU read-side critical section.
2129This same prohibition also applies (recursively!) to any lock that is acquired
2130while holding any lock to which this prohibition applies.
2131Adhering to this rule prevents preemptible RCU from invoking
2132<tt>rcu_read_unlock_special()</tt> while either runqueue or
2133priority-inheritance locks are held, thus avoiding deadlock.
2134
2135<p>
2136Prior to v4.4, it was only necessary to disable preemption across
2137RCU read-side critical sections that acquired scheduler locks.
2138In v4.4, expedited grace periods started using IPIs, and these
2139IPIs could force a <tt>rcu_read_unlock()</tt> to take the slowpath.
2140Therefore, this expedited-grace-period change required disabling of
2141interrupts, not just preemption.
2142
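<p>
In other words, a scheduler-lock acquisition within an RCU read-side
critical section following option&nbsp;(1) above might resemble the
following sketch, in which <tt>rq</tt> and <tt>do_something_with()</tt>
are illustrative:

<blockquote>
<pre>
 1 unsigned long flags;
 2
 3 rcu_read_lock();
 4 raw_spin_lock_irqsave(&amp;rq-&gt;lock, flags); /* Scheduler lock. */
 5 do_something_with(rq);
 6 raw_spin_unlock_irqrestore(&amp;rq-&gt;lock, flags); /* Release before... */
 7 rcu_read_unlock(); /* ...exiting the critical section. */
</pre>
</blockquote>
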
2143<p>
2144For RCU's part, the preemptible-RCU <tt>rcu_read_unlock()</tt>
2145implementation must be written carefully to avoid similar deadlocks.
2146In particular, <tt>rcu_read_unlock()</tt> must tolerate an
2147interrupt where the interrupt handler invokes both
2148<tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>.
2149This possibility requires <tt>rcu_read_unlock()</tt> to use
2150negative nesting levels to avoid destructive recursion via
2151interrupt handler's use of RCU.
2152
2153<p>
2154This pair of mutual scheduler-RCU requirements came as a
2155<a href="https://lwn.net/Articles/453002/">complete surprise</a>.
2156
2157<p>
2158As noted above, RCU makes use of kthreads, and it is necessary to
2159avoid excessive CPU-time accumulation by these kthreads.
2160This requirement was no surprise, but RCU's violation of it
2161when running context-switch-heavy workloads on kernels built with
2162<tt>CONFIG_NO_HZ_FULL=y</tt>
2163<a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>.
2164RCU has made good progress towards meeting this requirement, even
2165for context-switch-heavy <tt>CONFIG_NO_HZ_FULL=y</tt> workloads,
2166but there is room for further improvement.
2167
2168<h3><a name="Tracing and RCU">Tracing and RCU</a></h3>
2169
2170<p>
2171It is possible to use tracing on RCU code, but tracing itself
2172uses RCU.
2173For this reason, <tt>rcu_dereference_raw_notrace()</tt>
2174is provided for use by tracing, which avoids the destructive
2175recursion that could otherwise ensue.
2176This API is also used by virtualization in some architectures,
2177where RCU readers execute in environments in which tracing
2178cannot be used.
2179The tracing folks both located the requirement and provided the
2180needed fix, so this surprise requirement was relatively painless.
2181
2182<h3><a name="Energy Efficiency">Energy Efficiency</a></h3>
2183
2184<p>
2185Interrupting idle CPUs is considered socially unacceptable,
2186especially by people with battery-powered embedded systems.
2187RCU therefore conserves energy by detecting which CPUs are
2188idle, including tracking CPUs that have been interrupted from idle.
2189This is a large part of the energy-efficiency requirement,
2190so I learned of this via an irate phone call.
2191
2192<p>
2193Because RCU avoids interrupting idle CPUs, it is illegal to
2194execute an RCU read-side critical section on an idle CPU.
2195(Kernels built with <tt>CONFIG_PROVE_RCU=y</tt> will splat
2196if you try it.)
2197The <tt>RCU_NONIDLE()</tt> macro and <tt>_rcuidle</tt>
2198event tracing are provided to work around this restriction.
2199In addition, <tt>rcu_is_watching()</tt> may be used to
2200test whether or not it is currently legal to run RCU read-side
2201critical sections on this CPU.
2202I learned of the need for diagnostics on the one hand
2203and <tt>RCU_NONIDLE()</tt> on the other while inspecting
2204idle-loop code.
2205Steven Rostedt supplied <tt>_rcuidle</tt> event tracing,
2206which is used quite heavily in the idle loop.
2207
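<p>
For example, idle-loop code that must execute a momentary RCU reader
might use the following sketch, in which <tt>trace_my_event()</tt> is a
hypothetical tracepoint:

<blockquote>
<pre>
 1 if (rcu_is_watching())
 2   trace_my_event();              /* RCU is already watching this CPU. */
 3 else
 4   RCU_NONIDLE(trace_my_event()); /* Momentarily awaken RCU. */
</pre>
</blockquote>
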
2208<p>
2209It is similarly socially unacceptable to interrupt an
2210<tt>nohz_full</tt> CPU running in userspace.
2211RCU must therefore track <tt>nohz_full</tt> userspace
2212execution.
2213And in
2214<a href="https://lwn.net/Articles/558284/"><tt>CONFIG_NO_HZ_FULL_SYSIDLE=y</tt></a>
2215kernels, RCU must separately track idle CPUs on the one hand and
2216CPUs that are either idle or executing in userspace on the other.
2217In both cases, RCU must be able to sample state at two points in
2218time, and be able to determine whether or not some other CPU spent
2219any time idle and/or executing in userspace.
2220
2221<p>
2222These energy-efficiency requirements have proven quite difficult to
2223understand and to meet.  For example, there have been more than five
2224clean-sheet rewrites of RCU's energy-efficiency code, the last of
2225which was finally able to demonstrate
2226<a href="http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf">real energy savings running on real hardware [PDF]</a>.
2227As noted earlier,
2228I learned of many of these requirements via angry phone calls:
2229Flaming me on the Linux-kernel mailing list was apparently not
2230sufficient to fully vent their ire at RCU's energy-efficiency bugs!
2231
2232<h3><a name="Memory Efficiency">Memory Efficiency</a></h3>
2233
2234<p>
2235Although small-memory non-realtime systems can simply use Tiny RCU,
2236code size is only one aspect of memory efficiency.
2237Another aspect is the size of the <tt>rcu_head</tt> structure
2238used by <tt>call_rcu()</tt> and <tt>kfree_rcu()</tt>.
2239Although this structure contains nothing more than a pair of pointers,
2240it does appear in many RCU-protected data structures, including
2241some that are size critical.
2242The <tt>page</tt> structure is a case in point, as evidenced by
2243the many occurrences of the <tt>union</tt> keyword within that structure.
2244
2245<p>
2246This need for memory efficiency is one reason that RCU uses hand-crafted
2247singly linked lists to track the <tt>rcu_head</tt> structures that
2248are waiting for a grace period to elapse.
2249It is also the reason why <tt>rcu_head</tt> structures do not contain
2250debug information, such as fields tracking the file and line of the
2251<tt>call_rcu()</tt> or <tt>kfree_rcu()</tt> that posted them.
2252Although this information might appear in debug-only kernel builds at some
2253point, in the meantime, the <tt>-&gt;func</tt> field will often provide
2254the needed debug information.
2255
2256<p>
2257However, in some cases, the need for memory efficiency leads to even
2258more extreme measures.
2259Returning to the <tt>page</tt> structure, the <tt>rcu_head</tt> field
2260shares storage with a great many other structures that are used at
2261various points in the corresponding page's lifetime.
2262In order to correctly resolve certain
2263<a href="https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com">race conditions</a>,
2264the Linux kernel's memory-management subsystem needs a particular bit
2265to remain zero during all phases of grace-period processing,
2266and that bit happens to map to the bottom bit of the
2267<tt>rcu_head</tt> structure's <tt>-&gt;next</tt> field.
2268RCU makes this guarantee as long as <tt>call_rcu()</tt>
2269is used to post the callback, as opposed to <tt>kfree_rcu()</tt>
2270or some future &ldquo;lazy&rdquo;
2271variant of <tt>call_rcu()</tt> that might one day be created for
2272energy-efficiency purposes.
2273
2274<h3><a name="Performance, Scalability, Response Time, and Reliability">
2275Performance, Scalability, Response Time, and Reliability</a></h3>
2276
2277<p>
2278Expanding on the
2279<a href="#Performance and Scalability">earlier discussion</a>,
2280RCU is used heavily by hot code paths in performance-critical
2281portions of the Linux kernel's networking, security, virtualization,
2282and scheduling code paths.
2283RCU must therefore use efficient implementations, especially in its
2284read-side primitives.
2285To that end, it would be good if preemptible RCU's implementation
2286of <tt>rcu_read_lock()</tt> could be inlined, however, doing
2287this requires resolving <tt>#include</tt> issues with the
2288<tt>task_struct</tt> structure.
2289
2290<p>
2291The Linux kernel supports hardware configurations with up to
22924096 CPUs, which means that RCU must be extremely scalable.
2293Algorithms that involve frequent acquisitions of global locks or
2294frequent atomic operations on global variables simply cannot be
2295tolerated within the RCU implementation.
2296RCU therefore makes heavy use of a combining tree based on the
2297<tt>rcu_node</tt> structure.
2298RCU is required to tolerate all CPUs continuously invoking any
2299combination of RCU's runtime primitives with minimal per-operation
2300overhead.
2301In fact, in many cases, increasing load must <i>decrease</i> the
2302per-operation overhead, witness the batching optimizations for
2303<tt>synchronize_rcu()</tt>, <tt>call_rcu()</tt>,
2304<tt>synchronize_rcu_expedited()</tt>, and <tt>rcu_barrier()</tt>.
2305As a general rule, RCU must cheerfully accept whatever the
2306rest of the Linux kernel decides to throw at it.
2307
2308<p>
2309The Linux kernel is used for real-time workloads, especially
2310in conjunction with the
2311<a href="https://rt.wiki.kernel.org/index.php/Main_Page">-rt patchset</a>.
2312The real-time-latency response requirements are such that the
2313traditional approach of disabling preemption across RCU
2314read-side critical sections is inappropriate.
2315Kernels built with <tt>CONFIG_PREEMPT=y</tt> therefore
2316use an RCU implementation that allows RCU read-side critical
2317sections to be preempted.
2318This requirement made its presence known after users made it
2319clear that an earlier
2320<a href="https://lwn.net/Articles/107930/">real-time patch</a>
2321did not meet their needs, in conjunction with some
2322<a href="https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com">RCU issues</a>
2323encountered by a very early version of the -rt patchset.
2324
2325<p>
2326In addition, RCU must make do with a sub-100-microsecond real-time latency
2327budget.
2328In fact, on smaller systems with the -rt patchset, the Linux kernel
2329provides sub-20-microsecond real-time latencies for the whole kernel,
2330including RCU.
2331RCU's scalability and latency must therefore be sufficient for
2332these sorts of configurations.
2333To my surprise, the sub-100-microsecond real-time latency budget
2334<a href="http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf">
2335applies to even the largest systems [PDF]</a>,
2336up to and including systems with 4096 CPUs.
2337This real-time requirement motivated the grace-period kthread, which
2338also simplified handling of a number of race conditions.
2339
2340<p>
2341Finally, RCU's status as a synchronization primitive means that
2342any RCU failure can result in arbitrary memory corruption that can be
2343extremely difficult to debug.
2344This means that RCU must be extremely reliable, which in
2345practice also means that RCU must have an aggressive stress-test
2346suite.
2347This stress-test suite is called <tt>rcutorture</tt>.
2348
2349<p>
2350Although the need for <tt>rcutorture</tt> was no surprise,
2351the current immense popularity of the Linux kernel is posing
2352interesting&mdash;and perhaps unprecedented&mdash;validation
2353challenges.
2354To see this, keep in mind that there are well over one billion
2355instances of the Linux kernel running today, given Android
2356smartphones, Linux-powered televisions, and servers.
2357This number can be expected to increase sharply with the advent of
2358the celebrated Internet of Things.
2359
2360<p>
2361Suppose that RCU contains a race condition that manifests on average
2362once per million years of runtime.
2363This bug will be occurring about three times per <i>day</i> across
2364the installed base.
2365RCU could simply hide behind hardware error rates, given that no one
2366should really expect their smartphone to last for a million years.
2367However, anyone taking too much comfort from this thought should
2368consider the fact that in most jurisdictions, a successful multi-year
2369test of a given mechanism, which might include a Linux kernel,
2370suffices for a number of types of safety-critical certifications.
2371In fact, rumor has it that the Linux kernel is already being used
2372in production for safety-critical applications.
2373I don't know about you, but I would feel quite bad if a bug in RCU
2374killed someone.
2375Which might explain my recent focus on validation and verification.
2376
2377<h2><a name="Other RCU Flavors">Other RCU Flavors</a></h2>
2378
2379<p>
2380One of the more surprising things about RCU is that there are now
2381no fewer than five <i>flavors</i>, or API families.
2382In addition, the primary flavor that has been the sole focus up to
2383this point has two different implementations, non-preemptible and
2384preemptible.
2385The other four flavors are listed below, with requirements for each
2386described in a separate section.
2387
2388<ol>
2389<li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor</a>
2390<li> <a href="#Sched Flavor">Sched Flavor</a>
2391<li> <a href="#Sleepable RCU">Sleepable RCU</a>
2392<li> <a href="#Tasks RCU">Tasks RCU</a>
2393</ol>
2394
2395<h3><a name="Bottom-Half Flavor">Bottom-Half Flavor</a></h3>
2396
2397<p>
2398The softirq-disable (AKA &ldquo;bottom-half&rdquo;,
2399hence the &ldquo;_bh&rdquo; abbreviations)
2400flavor of RCU, or <i>RCU-bh</i>, was developed by
2401Dipankar Sarma to provide a flavor of RCU that could withstand the
2402network-based denial-of-service attacks researched by Robert
2403Olsson.
2404These attacks placed so much networking load on the system
2405that some of the CPUs never exited softirq execution,
2406which in turn prevented those CPUs from ever executing a context switch,
2407which, in the RCU implementation of that time, prevented grace periods
2408from ever ending.
2409The result was an out-of-memory condition and a system hang.
2410
2411<p>
2412The solution was the creation of RCU-bh, which does
2413<tt>local_bh_disable()</tt>
2414across its read-side critical sections, and which uses the transition
2415from one type of softirq processing to another as a quiescent state
2416in addition to context switch, idle, user mode, and offline.
2417This means that RCU-bh grace periods can complete even when some of
2418the CPUs execute in softirq indefinitely, thus allowing algorithms
2419based on RCU-bh to withstand network-based denial-of-service attacks.
2420
2421<p>
2422Because
2423<tt>rcu_read_lock_bh()</tt> and <tt>rcu_read_unlock_bh()</tt>
2424disable and re-enable softirq handlers, any attempt to start a softirq
2425handler during the
2426RCU-bh read-side critical section will be deferred.
2427In this case, <tt>rcu_read_unlock_bh()</tt>
2428will invoke softirq processing, which can take considerable time.
2429One can of course argue that this softirq overhead should be associated
2430with the code following the RCU-bh read-side critical section rather
2431than <tt>rcu_read_unlock_bh()</tt>, but the fact
2432is that most profiling tools cannot be expected to make this sort
2433of fine distinction.
2434For example, suppose that a three-millisecond-long RCU-bh read-side
2435critical section executes during a time of heavy networking load.
2436There will very likely be an attempt to invoke at least one softirq
2437handler during that three milliseconds, but any such invocation will
2438be delayed until the time of the <tt>rcu_read_unlock_bh()</tt>.
2439This can of course make it appear at first glance as if
2440<tt>rcu_read_unlock_bh()</tt> was executing very slowly.
2441
2442<p>
2443The
2444<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-bh API</a>
2445includes
2446<tt>rcu_read_lock_bh()</tt>,
2447<tt>rcu_read_unlock_bh()</tt>,
2448<tt>rcu_dereference_bh()</tt>,
2449<tt>rcu_dereference_bh_check()</tt>,
2450<tt>synchronize_rcu_bh()</tt>,
2451<tt>synchronize_rcu_bh_expedited()</tt>,
2452<tt>call_rcu_bh()</tt>,
2453<tt>rcu_barrier_bh()</tt>, and
2454<tt>rcu_read_lock_bh_held()</tt>.
2455
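<p>
An RCU-bh reader therefore follows the usual pattern, as in this sketch
(which again reuses the <tt>gp</tt> pointer of the earlier examples):

<blockquote>
<pre>
 1 struct foo *p;
 2
 3 rcu_read_lock_bh();
 4 p = rcu_dereference_bh(gp);
 5 if (p)
 6   do_something(p-&gt;a);
 7 rcu_read_unlock_bh();
</pre>
</blockquote>
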
2456<h3><a name="Sched Flavor">Sched Flavor</a></h3>
2457
2458<p>
2459Before preemptible RCU, waiting for an RCU grace period had the
2460side effect of also waiting for all pre-existing interrupt
2461and NMI handlers.
2462However, there are legitimate preemptible-RCU implementations that
2463do not have this property, given that any point in the code outside
2464of an RCU read-side critical section can be a quiescent state.
2465Therefore, <i>RCU-sched</i> was created, which follows &ldquo;classic&rdquo;
2466RCU in that an RCU-sched grace period waits for pre-existing
2467interrupt and NMI handlers.
2468In kernels built with <tt>CONFIG_PREEMPT=n</tt>, the RCU and RCU-sched
2469APIs have identical implementations, while kernels built with
2470<tt>CONFIG_PREEMPT=y</tt> provide a separate implementation for each.
2471
2472<p>
2473Note well that in <tt>CONFIG_PREEMPT=y</tt> kernels,
2474<tt>rcu_read_lock_sched()</tt> and <tt>rcu_read_unlock_sched()</tt>
2475disable and re-enable preemption, respectively.
2476This means that if there was a preemption attempt during the
2477RCU-sched read-side critical section, <tt>rcu_read_unlock_sched()</tt>
2478will enter the scheduler, with all the latency and overhead entailed.
2479Just as with <tt>rcu_read_unlock_bh()</tt>, this can make it look
2480as if <tt>rcu_read_unlock_sched()</tt> was executing very slowly.
2481However, the highest-priority task won't be preempted, so that task
2482will enjoy low-overhead <tt>rcu_read_unlock_sched()</tt> invocations.
2483
2484<p>
2485The
2486<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">RCU-sched API</a>
2487includes
2488<tt>rcu_read_lock_sched()</tt>,
2489<tt>rcu_read_unlock_sched()</tt>,
2490<tt>rcu_read_lock_sched_notrace()</tt>,
2491<tt>rcu_read_unlock_sched_notrace()</tt>,
2492<tt>rcu_dereference_sched()</tt>,
2493<tt>rcu_dereference_sched_check()</tt>,
2494<tt>synchronize_sched()</tt>,
2495<tt>synchronize_rcu_sched_expedited()</tt>,
2496<tt>call_rcu_sched()</tt>,
2497<tt>rcu_barrier_sched()</tt>, and
2498<tt>rcu_read_lock_sched_held()</tt>.
2499However, anything that disables preemption also marks an RCU-sched
2500read-side critical section, including
2501<tt>preempt_disable()</tt> and <tt>preempt_enable()</tt>,
2502<tt>local_irq_save()</tt> and <tt>local_irq_restore()</tt>,
2503and so on.
2504
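<p>
For example, the following sketch is a legitimate RCU-sched reader,
despite using <tt>preempt_disable()</tt> rather than
<tt>rcu_read_lock_sched()</tt>:

<blockquote>
<pre>
 1 struct foo *p;
 2
 3 preempt_disable(); /* Implicitly begins an RCU-sched reader. */
 4 p = rcu_dereference_sched(gp);
 5 if (p)
 6   do_something(p-&gt;a);
 7 preempt_enable(); /* Implicitly ends the RCU-sched reader. */
</pre>
</blockquote>
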
2505<h3><a name="Sleepable RCU">Sleepable RCU</a></h3>
2506
2507<p>
2508For well over a decade, someone saying &ldquo;I need to block within
2509an RCU read-side critical section&rdquo; was a reliable indication
2510that this someone did not understand RCU.
2511After all, if you are always blocking in an RCU read-side critical
2512section, you can probably afford to use a higher-overhead synchronization
2513mechanism.
2514However, that changed with the advent of the Linux kernel's notifiers,
2515whose RCU read-side critical
2516sections almost never sleep, but sometimes need to.
2517This resulted in the introduction of
2518<a href="https://lwn.net/Articles/202847/">sleepable RCU</a>,
2519or <i>SRCU</i>.
2520
2521<p>
2522SRCU allows different domains to be defined, with each such domain
2523defined by an instance of an <tt>srcu_struct</tt> structure.
2524A pointer to this structure must be passed in to each SRCU function,
2525for example, <tt>synchronize_srcu(&amp;ss)</tt>, where
2526<tt>ss</tt> is the <tt>srcu_struct</tt> structure.
2527The key benefit of these domains is that a slow SRCU reader in one
2528domain does not delay an SRCU grace period in some other domain.
2529That said, one consequence of these domains is that read-side code
2530must pass a &ldquo;cookie&rdquo; from <tt>srcu_read_lock()</tt>
2531to <tt>srcu_read_unlock()</tt>, for example, as follows:
2532
2533<blockquote>
2534<pre>
2535 1 int idx;
2536 2
2537 3 idx = srcu_read_lock(&amp;ss);
2538 4 do_something();
2539 5 srcu_read_unlock(&amp;ss, idx);
2540</pre>
2541</blockquote>
2542
2543<p>
2544As noted above, it is legal to block within SRCU read-side critical sections,
2545however, with great power comes great responsibility.
2546If you block forever in one of a given domain's SRCU read-side critical
2547sections, then that domain's grace periods will also be blocked forever.
2548Of course, one good way to block forever is to deadlock, which can
2549happen if any operation in a given domain's SRCU read-side critical
2550section can block waiting, either directly or indirectly, for that domain's
2551grace period to elapse.
2552For example, this results in a self-deadlock:
2553
2554<blockquote>
2555<pre>
2556 1 int idx;
2557 2
2558 3 idx = srcu_read_lock(&amp;ss);
2559 4 do_something();
2560 5 synchronize_srcu(&amp;ss);
2561 6 srcu_read_unlock(&amp;ss, idx);
2562</pre>
2563</blockquote>
2564
2565<p>
2566However, if line&nbsp;5 acquired a mutex that was held across
2567a <tt>synchronize_srcu()</tt> for domain <tt>ss</tt>,
2568deadlock would still be possible.
2569Furthermore, if line&nbsp;5 acquired a mutex that was held across
2570a <tt>synchronize_srcu()</tt> for some other domain <tt>ss1</tt>,
2571and if an <tt>ss1</tt>-domain SRCU read-side critical section
2572acquired another mutex that was held across an <tt>ss</tt>-domain
2573<tt>synchronize_srcu()</tt>,
2574deadlock would again be possible.
2575Such a deadlock cycle could extend across an arbitrarily large number
2576of different SRCU domains.
2577Again, with great power comes great responsibility.
2578
2579<p>
2580Unlike the other RCU flavors, SRCU read-side critical sections can
2581run on idle and even offline CPUs.
2582This ability requires that <tt>srcu_read_lock()</tt> and
2583<tt>srcu_read_unlock()</tt> contain memory barriers, which means
2584that SRCU readers will run a bit slower than would RCU readers.
2585It also motivates the <tt>smp_mb__after_srcu_read_unlock()</tt>
2586API, which, in combination with <tt>srcu_read_unlock()</tt>,
2587guarantees a full memory barrier.
2588
2589<p>
2590The
2591<a href="https://lwn.net/Articles/609973/#RCU Per-Flavor API Table">SRCU API</a>
2592includes
2593<tt>srcu_read_lock()</tt>,
2594<tt>srcu_read_unlock()</tt>,
2595<tt>srcu_dereference()</tt>,
2596<tt>srcu_dereference_check()</tt>,
2597<tt>synchronize_srcu()</tt>,
2598<tt>synchronize_srcu_expedited()</tt>,
2599<tt>call_srcu()</tt>,
2600<tt>srcu_barrier()</tt>, and
2601<tt>srcu_read_lock_held()</tt>.
2602It also includes
2603<tt>DEFINE_SRCU()</tt>,
2604<tt>DEFINE_STATIC_SRCU()</tt>, and
2605<tt>init_srcu_struct()</tt>
2606APIs for defining and initializing <tt>srcu_struct</tt> structures.
2607
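<p>
For example (a sketch, with arbitrary structure names):

<blockquote>
<pre>
 1 DEFINE_STATIC_SRCU(ss); /* Build-time definition and initialization. */
 2
 3 struct srcu_struct dyn_ss;
 4
 5 void my_init(void)
 6 {
 7   if (init_srcu_struct(&amp;dyn_ss)) /* Run-time initialization. */
 8     pr_warn("init_srcu_struct() failed\n");
 9 }
</pre>
</blockquote>
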
2608<h3><a name="Tasks RCU">Tasks RCU</a></h3>
2609
2610<p>
2611Some forms of tracing use &ldquo;trampolines&rdquo; to handle the
2612binary rewriting required to install different types of probes.
2613It would be good to be able to free old trampolines, which sounds
2614like a job for some form of RCU.
2615However, because it is necessary to be able to install a trace
2616anywhere in the code, it is not possible to use read-side markers
2617such as <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>.
2618In addition, it does not work to have these markers in the trampoline
2619itself, because there would need to be instructions following
2620<tt>rcu_read_unlock()</tt>.
2621Although <tt>synchronize_rcu()</tt> would guarantee that execution
2622reached the <tt>rcu_read_unlock()</tt>, it would not be able to
2623guarantee that execution had completely left the trampoline.
2624
2625<p>
2626The solution, in the form of
2627<a href="https://lwn.net/Articles/607117/"><i>Tasks RCU</i></a>,
2628is to have implicit
2629read-side critical sections that are delimited by voluntary context
2630switches, that is, calls to <tt>schedule()</tt>,
2631<tt>cond_resched_rcu_qs()</tt>, and
2632<tt>synchronize_rcu_tasks()</tt>.
2633In addition, transitions to and from userspace execution also delimit
2634tasks-RCU read-side critical sections.
2635
2636<p>
2637The tasks-RCU API is quite compact, consisting only of
2638<tt>call_rcu_tasks()</tt>,
2639<tt>synchronize_rcu_tasks()</tt>, and
2640<tt>rcu_barrier_tasks()</tt>.
2641
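<p>
For example, a tracer might free a no-longer-used trampoline as follows
(a sketch in which <tt>old_tramp</tt> and <tt>free_trampoline()</tt>
are hypothetical):

<blockquote>
<pre>
 1 /* All probe sites have been redirected away from old_tramp. */
 2 synchronize_rcu_tasks(); /* Wait for tasks to vacate old_tramp. */
 3 free_trampoline(old_tramp);
</pre>
</blockquote>
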
2642<h2><a name="Possible Future Changes">Possible Future Changes</a></h2>
2643
2644<p>
2645One of the tricks that RCU uses to attain update-side scalability is
2646to increase grace-period latency with increasing numbers of CPUs.
2647If this becomes a serious problem, it will be necessary to rework the
2648grace-period state machine so as to avoid the need for the additional
2649latency.
2650
2651<p>
2652Expedited grace periods scan the CPUs, so their latency and overhead
2653increases with increasing numbers of CPUs.
2654If this becomes a serious problem on large systems, it will be necessary
2655to do some redesign to avoid this scalability problem.
2656
2657<p>
2658RCU disables CPU hotplug in a few places, perhaps most notably in the
2659expedited grace-period and <tt>rcu_barrier()</tt> operations.
2660If there is a strong reason to use expedited grace periods in CPU-hotplug
2661notifiers, it will be necessary to avoid disabling CPU hotplug.
2662This would introduce some complexity, so there had better be a <i>very</i>
2663good reason.
2664
2665<p>
2666The tradeoff between grace-period latency on the one hand and interruptions
2667of other CPUs on the other hand may need to be re-examined.
2668The desire is of course for zero grace-period latency as well as zero
2669interprocessor interrupts undertaken during an expedited grace period
2670operation.
2671While this ideal is unlikely to be achievable, it is quite possible that
2672further improvements can be made.
2673
2674<p>
2675The multiprocessor implementations of RCU use a combining tree that
2676groups CPUs so as to reduce lock contention and increase cache locality.
2677However, this combining tree does not spread its memory across NUMA
2678nodes nor does it align the CPU groups with hardware features such
2679as sockets or cores.
2680Such spreading and alignment is currently believed to be unnecessary
2681because the hotpath read-side primitives do not access the combining
2682tree, nor does <tt>call_rcu()</tt> in the common case.
2683If you believe that your architecture needs such spreading and alignment,
2684then your architecture should also benefit from the
2685<tt>rcutree.rcu_fanout_leaf</tt> boot parameter, which can be set
2686to the number of CPUs in a socket, NUMA node, or whatever.
2687If the number of CPUs is too large, use a fraction of the number of
2688CPUs.
2689If the number of CPUs is a large prime number, well, that certainly
2690is an &ldquo;interesting&rdquo; architectural choice!
2691More flexible arrangements might be considered, but only if
2692<tt>rcutree.rcu_fanout_leaf</tt> has proven inadequate, and only
2693if the inadequacy has been demonstrated by a carefully run and
2694realistic system-level workload.
2695
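<p>
For example, a hypothetical system built from 16-CPU sockets might boot
with the following:

<blockquote>
<pre>
 1 rcutree.rcu_fanout_leaf=16
</pre>
</blockquote>
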
2696<p>
2697Please note that arrangements that require RCU to remap CPU numbers will
2698require extremely good demonstration of need and full exploration of
2699alternatives.
2700
2701<p>
2702There is an embarrassingly large number of flavors of RCU, and this
2703number has been increasing over time.
2704Perhaps it will be possible to combine some at some future date.
2705
2706<p>
2707RCU's various kthreads are reasonably recent additions.
2708It is quite likely that adjustments will be required to more gracefully
2709handle extreme loads.
2710It might also be necessary to be able to relate CPU utilization by
2711RCU's kthreads and softirq handlers to the code that instigated this
2712CPU utilization.
2713For example, RCU callback overhead might be charged back to the
2714originating <tt>call_rcu()</tt> instance, though probably not
2715in production kernels.
2716
2717<h2><a name="Summary">Summary</a></h2>
2718
2719<p>
2720This document has presented more than two decades' worth of RCU
2721requirements.
2722Given that the requirements keep changing, this will not be the last
2723word on this subject, but at least it serves to get an important
2724subset of the requirements set forth.
2725
2726<h2><a name="Acknowledgments">Acknowledgments</a></h2>
2727
2728I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar,
2729Oleg Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and
2730Andy Lutomirski for their help in rendering
2731this article human readable, and to Michelle Rankin for her support
2732of this effort.
2733Other contributions are acknowledged in the Linux kernel's git archive.
2734The cartoon is copyright (c) 2013 by Melissa Broussard,
2735and is provided
2736under the terms of the Creative Commons Attribution-Share Alike 3.0
2737United States license.
2738
2739<p>@@QQAL@@
2740
2741</body></html>