1<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
2 "http://www.w3.org/TR/html4/loose.dtd">
3 <html>
4 <head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title>
5 <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
6
7 <p>January 27, 2016</p>
8 <p>This article was contributed by Paul E.&nbsp;McKenney</p>
9
10<h3>Introduction</h3>
11
12This document describes RCU's major data structures and their relationship
13to each other.
14
15<ol>
16<li> <a href="#Data-Structure Relationships">
17 Data-Structure Relationships</a>
18<li> <a href="#The rcu_state Structure">
19 The <tt>rcu_state</tt> Structure</a>
20<li> <a href="#The rcu_node Structure">
21 The <tt>rcu_node</tt> Structure</a>
22<li> <a href="#The rcu_data Structure">
23 The <tt>rcu_data</tt> Structure</a>
24<li> <a href="#The rcu_dynticks Structure">
25 The <tt>rcu_dynticks</tt> Structure</a>
26<li> <a href="#The rcu_head Structure">
27 The <tt>rcu_head</tt> Structure</a>
28<li> <a href="#RCU-Specific Fields in the task_struct Structure">
29 RCU-Specific Fields in the <tt>task_struct</tt> Structure</a>
30<li> <a href="#Accessor Functions">
31 Accessor Functions</a>
32</ol>
33
34At the end we have the
35<a href="#Answers to Quick Quizzes">answers to the quick quizzes</a>.
36
37<h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3>
38
39<p>RCU is for all intents and purposes a large state machine, and its
40data structures maintain the state in such a way as to allow RCU readers
41to execute extremely quickly, while also processing the RCU grace periods
42requested by updaters in an efficient and extremely scalable fashion.
43The efficiency and scalability of RCU updaters is provided primarily
44by a combining tree, as shown below:
45
46</p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%">
47
48</p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure
49containing a tree of <tt>rcu_node</tt> structures.
50Each leaf node of the <tt>rcu_node</tt> tree has up to 16
51<tt>rcu_data</tt> structures associated with it, so that there
52are <tt>NR_CPUS</tt> <tt>rcu_data</tt> structures in all,
53one for each possible CPU.
54This structure is adjusted at boot time, if needed, to handle the
55common case where <tt>nr_cpu_ids</tt> is much less than
56<tt>NR_CPUS</tt>.
57For example, a number of Linux distributions set <tt>NR_CPUS=4096</tt>,
58which results in a three-level <tt>rcu_node</tt> tree.
59If the actual hardware has only 16 CPUs, RCU will adjust itself
60at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node.
61
62</p><p>The purpose of this combining tree is to allow per-CPU events
63such as quiescent states, dyntick-idle transitions,
64and CPU hotplug operations to be processed efficiently
65and scalably.
66Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures,
67and other events are recorded by the leaf-level <tt>rcu_node</tt>
68structures.
69All of these events are combined at each level of the tree until finally
70grace periods are completed at the tree's root <tt>rcu_node</tt>
71structure.
72A grace period can be completed at the root once every CPU
73(or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task)
74has passed through a quiescent state.
75Once a grace period has completed, record of that fact is propagated
76back down the tree.
77
78</p><p>As can be seen from the diagram, on a 64-bit system
79a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout
80of 64 at the root and a fanout of 16 at the leaves.
81
82<table>
83<tr><th>&nbsp;</th></tr>
84<tr><th align="left">Quick Quiz:</th></tr>
85<tr><td>
86 Why isn't the fanout at the leaves also 64?
87</td></tr>
88<tr><th align="left">Answer:</th></tr>
89<tr><td bgcolor="#ffffff"><font color="ffffff">
90 Because there are more types of events that affect the leaf-level
91 <tt>rcu_node</tt> structures than further up the tree.
92 Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of
93	64, the contention on these structures' <tt>-&gt;lock</tt> fields
94	becomes excessive.
95 Experimentation on a wide variety of systems has shown that a fanout
96 of 16 works well for the leaves of the <tt>rcu_node</tt> tree.
97 </font>
98
99 <p><font color="ffffff">Of course, further experience with
100 systems having hundreds or thousands of CPUs may demonstrate
101 that the fanout for the non-leaf <tt>rcu_node</tt> structures
102 must also be reduced.
103 Such reduction can be easily carried out when and if it proves
104 necessary.
105 In the meantime, if you are using such a system and running into
106 contention problems on the non-leaf <tt>rcu_node</tt> structures,
107 you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration
108 parameter to reduce the non-leaf fanout as needed.
109 </font>
110
111 <p><font color="ffffff">Kernels built for systems with
112 strong NUMA characteristics might also need to adjust
113 <tt>CONFIG_RCU_FANOUT</tt> so that the domains of the
114 <tt>rcu_node</tt> structures align with hardware boundaries.
115 However, there has thus far been no need for this.
116</font></td></tr>
117<tr><td>&nbsp;</td></tr>
118</table>
119
120<p>If your system has more than 1,024 CPUs (or more than 512 CPUs on
121a 32-bit system), then RCU will automatically add more levels to the
122tree.
123For example, if you are crazy enough to build a 64-bit system with 65,536
124CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows:
125
126</p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%">
127
128</p><p>RCU currently permits up to a four-level tree, which on a 64-bit system
129accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
13032-bit systems.
131On the other hand, you can set <tt>CONFIG_RCU_FANOUT</tt> to be
132as small as 2 if you wish, which would permit only 16 CPUs, which
133is useful for testing.
134
135</p><p>This multi-level combining tree allows us to get most of the
136performance and scalability
137benefits of partitioning, even though RCU grace-period detection is
138inherently a global operation.
139The trick here is that only the last CPU to report a quiescent state
140into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt>
141structure at the next level up the tree.
142This means that at the leaf-level <tt>rcu_node</tt> structure, only
143one access out of sixteen will progress up the tree.
144For the internal <tt>rcu_node</tt> structures, the situation is even
145more extreme: Only one access out of sixty-four will progress up
146the tree.
147Because the vast majority of the CPUs do not progress up the tree,
148the lock contention remains roughly constant up the tree.
149No matter how many CPUs there are in the system, at most 64 quiescent-state
150reports per grace period will progress all the way to the root
151<tt>rcu_node</tt> structure, thus ensuring that the lock contention
152on that root <tt>rcu_node</tt> structure remains acceptably low.
153
154</p><p>In effect, the combining tree acts like a big shock absorber,
155keeping lock contention under control at all tree levels regardless
156of the level of loading on the system.
157
158</p><p>The Linux kernel actually supports multiple flavors of RCU
159running concurrently, so RCU builds separate data structures for each
160flavor.
161For example, for <tt>CONFIG_TREE_RCU=y</tt> kernels, RCU provides
162rcu_sched and rcu_bh, as shown below:
163
164</p><p><img src="BigTreeClassicRCUBH.svg" alt="BigTreeClassicRCUBH.svg" width="33%">
165
166</p><p>Energy efficiency is increasingly important, and for that
167reason the Linux kernel provides <tt>CONFIG_NO_HZ_IDLE</tt>, which
168turns off the scheduling-clock interrupts on idle CPUs, which in
169turn allows those CPUs to attain deeper sleep states and to consume
170less energy.
171CPUs whose scheduling-clock interrupts have been turned off are
172said to be in <i>dyntick-idle mode</i>.
173RCU must handle dyntick-idle CPUs specially
174because RCU would otherwise wake up each CPU on every grace period,
175which would defeat the whole purpose of <tt>CONFIG_NO_HZ_IDLE</tt>.
176RCU uses the <tt>rcu_dynticks</tt> structure to track
177which CPUs are in dyntick-idle mode, as shown below:
178
179</p><p><img src="BigTreeClassicRCUBHdyntick.svg" alt="BigTreeClassicRCUBHdyntick.svg" width="33%">
180
181</p><p>However, if a CPU is in dyntick-idle mode, it is in that mode
182for all flavors of RCU.
183Therefore, a single <tt>rcu_dynticks</tt> structure is allocated per
184CPU, and all of a given CPU's <tt>rcu_data</tt> structures share
185that <tt>rcu_dynticks</tt>, as shown in the figure.
186
187</p><p>Kernels built with <tt>CONFIG_PREEMPT_RCU</tt> support
188rcu_preempt in addition to rcu_sched and rcu_bh, as shown below:
189
190</p><p><img src="BigTreePreemptRCUBHdyntick.svg" alt="BigTreePreemptRCUBHdyntick.svg" width="35%">
191
192</p><p>RCU updaters wait for normal grace periods by registering
193RCU callbacks, either directly via <tt>call_rcu()</tt> and
194friends (namely <tt>call_rcu_bh()</tt> and <tt>call_rcu_sched()</tt>,
195there being a separate interface per flavor of RCU),
196or indirectly via <tt>synchronize_rcu()</tt> and friends.
197RCU callbacks are represented by <tt>rcu_head</tt> structures,
198which are queued on <tt>rcu_data</tt> structures while they are
199waiting for a grace period to elapse, as shown in the following figure:
200
201</p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%">
202
203</p><p>This figure shows how <tt>TREE_RCU</tt>'s and
204<tt>PREEMPT_RCU</tt>'s major data structures are related.
205Lesser data structures will be introduced with the algorithms that
206make use of them.
207
208</p><p>Note that each of the data structures in the above figure has
209its own synchronization:
210
211<p><ol>
212<li>	Each <tt>rcu_state</tt> structure has a lock and a mutex,
213 and some fields are protected by the corresponding root
214 <tt>rcu_node</tt> structure's lock.
215<li> Each <tt>rcu_node</tt> structure has a spinlock.
216<li> The fields in <tt>rcu_data</tt> are private to the corresponding
217 CPU, although a few can be read and written by other CPUs.
218<li> Similarly, the fields in <tt>rcu_dynticks</tt> are private
219 to the corresponding CPU, although a few can be read by
220 other CPUs.
221</ol>
222
223<p>It is important to note that different data structures can have
224very different ideas about the state of RCU at any given time.
225For but one example, awareness of the start or end of a given RCU
226grace period propagates slowly through the data structures.
227This slow propagation is absolutely necessary for RCU to have good
228read-side performance.
229If this balkanized implementation seems foreign to you, one useful
230trick is to consider each instance of these data structures to be
231a different person, each having the usual slightly different
232view of reality.
233
234</p><p>The general role of each of these data structures is as
235follows:
236
237</p><ol>
238<li> <tt>rcu_state</tt>:
239 This structure forms the interconnection between the
240 <tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
241 tracks grace periods, serves as short-term repository
242 for callbacks orphaned by CPU-hotplug events,
243 maintains <tt>rcu_barrier()</tt> state,
244 tracks expedited grace-period state,
245 and maintains state used to force quiescent states when
246	grace periods extend too long.
247<li> <tt>rcu_node</tt>: This structure forms the combining
248 tree that propagates quiescent-state
249 information from the leaves to the root, and also propagates
250 grace-period information from the root to the leaves.
251 It provides local copies of the grace-period state in order
252 to allow this information to be accessed in a synchronized
253 manner without suffering the scalability limitations that
254 would otherwise be imposed by global locking.
255 In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists
256 of tasks that have blocked while in their current
257 RCU read-side critical section.
258 In <tt>CONFIG_PREEMPT_RCU</tt> with
259 <tt>CONFIG_RCU_BOOST</tt>, it manages the
260 per-<tt>rcu_node</tt> priority-boosting
261 kernel threads (kthreads) and state.
262 Finally, it records CPU-hotplug state in order to determine
263 which CPUs should be ignored during a given grace period.
264<li> <tt>rcu_data</tt>: This per-CPU structure is the
265 focus of quiescent-state detection and RCU callback queuing.
266 It also tracks its relationship to the corresponding leaf
267 <tt>rcu_node</tt> structure to allow more-efficient
268 propagation of quiescent states up the <tt>rcu_node</tt>
269 combining tree.
270 Like the <tt>rcu_node</tt> structure, it provides a local
271 copy of the grace-period information to allow for-free
272 synchronized
273 access to this information from the corresponding CPU.
274 Finally, this structure records past dyntick-idle state
275 for the corresponding CPU and also tracks statistics.
276<li> <tt>rcu_dynticks</tt>:
277 This per-CPU structure tracks the current dyntick-idle
278 state for the corresponding CPU.
279 Unlike the other three structures, the <tt>rcu_dynticks</tt>
280 structure is not replicated per RCU flavor.
281<li> <tt>rcu_head</tt>:
282 This structure represents RCU callbacks, and is the
283 only structure allocated and managed by RCU users.
284 The <tt>rcu_head</tt> structure is normally embedded
285 within the RCU-protected data structure.
286</ol>
287
288<p>If all you wanted from this article was a general notion of how
289RCU's data structures are related, you are done.
290Otherwise, each of the following sections gives more details on
291the <tt>rcu_state</tt>, <tt>rcu_node</tt>, <tt>rcu_data</tt>,
292and <tt>rcu_dynticks</tt> data structures.
293
294<h3><a name="The rcu_state Structure">
295The <tt>rcu_state</tt> Structure</a></h3>
296
297<p>The <tt>rcu_state</tt> structure is the base structure that
298represents a flavor of RCU.
299This structure forms the interconnection between the
300<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
301tracks grace periods, contains the lock used to
302synchronize with CPU-hotplug events,
303and maintains state used to force quiescent states when
304grace periods extend too long.
305
306</p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed,
307singly and in groups, in the following sections.
308The more specialized fields are covered in the discussion of their
309use.
310
311<h5>Relationship to rcu_node and rcu_data Structures</h5>
312
313This portion of the <tt>rcu_state</tt> structure is declared
314as follows:
315
316<pre>
317 1 struct rcu_node node[NUM_RCU_NODES];
318 2 struct rcu_node *level[NUM_RCU_LVLS + 1];
319 3 struct rcu_data __percpu *rda;
320</pre>
321
322<table>
323<tr><th>&nbsp;</th></tr>
324<tr><th align="left">Quick Quiz:</th></tr>
325<tr><td>
326 Wait a minute!
327 You said that the <tt>rcu_node</tt> structures formed a tree,
328 but they are declared as a flat array!
329 What gives?
330</td></tr>
331<tr><th align="left">Answer:</th></tr>
332<tr><td bgcolor="#ffffff"><font color="ffffff">
333 The tree is laid out in the array.
334	The tree is laid out in the array.
335	The first node in the array is the root, the next set of nodes in the
336 nodes in the array are the leaves.
337 </font>
338
339 <p><font color="ffffff">See the following diagrams to see how
340 this works.
341</font></td></tr>
342<tr><td>&nbsp;</td></tr>
343</table>
344
345<p>The <tt>rcu_node</tt> tree is embedded into the
346<tt>-&gt;node[]</tt> array as shown in the following figure:
347
348</p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%">
349
350</p><p>One interesting consequence of this mapping is that a
351breadth-first traversal of the tree is implemented as a simple
352linear scan of the array, which is in fact what the
353<tt>rcu_for_each_node_breadth_first()</tt> macro does.
354This macro is used at the beginnings and ends of grace periods.
355
356</p><p>Each entry of the <tt>-&gt;level</tt> array references
357the first <tt>rcu_node</tt> structure on the corresponding level
358of the tree, for example, as shown below:
359
360</p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%">
361
362</p><p>The zero<sup>th</sup> element of the array references the root
363<tt>rcu_node</tt> structure, the first element references the
364first child of the root <tt>rcu_node</tt>, and finally the second
365element references the first leaf <tt>rcu_node</tt> structure.
366
367</p><p>For whatever it is worth, if you draw the tree to be tree-shaped
368rather than array-shaped, it is easy to draw a planar representation:
369
370</p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%">
371
372</p><p>Finally, the <tt>-&gt;rda</tt> field references a per-CPU
373pointer to the corresponding CPU's <tt>rcu_data</tt> structure.
374
375</p><p>All of these fields are constant once initialization is complete,
376and therefore need no protection.
377
378<h5>Grace-Period Tracking</h5>
379
380<p>This portion of the <tt>rcu_state</tt> structure is declared
381as follows:
382
383<pre>
384 1 unsigned long gpnum;
385 2 unsigned long completed;
386</pre>
387
388<p>RCU grace periods are numbered, and
389the <tt>-&gt;gpnum</tt> field contains the number of the grace
390period that started most recently.
391The <tt>-&gt;completed</tt> field contains the number of the
392grace period that completed most recently.
393If the two fields are equal, the RCU grace period that most recently
394started has already completed, and therefore the corresponding
395flavor of RCU is idle.
396If <tt>-&gt;gpnum</tt> is one greater than <tt>-&gt;completed</tt>,
397then <tt>-&gt;gpnum</tt> gives the number of the current RCU
398grace period, which has not yet completed.
399Any other combination of values indicates that something is broken.
400These two fields are protected by the root <tt>rcu_node</tt>'s
401<tt>-&gt;lock</tt> field.
402
403</p><p>There are <tt>-&gt;gpnum</tt> and <tt>-&gt;completed</tt> fields
404in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures
405as well.
406The fields in the <tt>rcu_state</tt> structure represent the
407most current values, and those of the other structures are compared
408in order to detect the start of a new grace period in a distributed
409fashion.
410The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt>
411(down the tree from the root to the leaves) to <tt>rcu_data</tt>.
412
413<h5>Miscellaneous</h5>
414
415<p>This portion of the <tt>rcu_state</tt> structure is declared
416as follows:
417
418<pre>
419 1 unsigned long gp_max;
420 2 char abbr;
421 3 char *name;
422</pre>
423
424<p>The <tt>-&gt;gp_max</tt> field tracks the duration of the longest
425grace period in jiffies.
426It is protected by the root <tt>rcu_node</tt>'s <tt>-&gt;lock</tt>.
427
428<p>The <tt>-&gt;name</tt> field points to the name of the RCU flavor
429(for example, &ldquo;rcu_sched&rdquo;), and is constant.
430The <tt>-&gt;abbr</tt> field contains a one-character abbreviation,
431for example, &ldquo;s&rdquo; for RCU-sched.
432
433<h3><a name="The rcu_node Structure">
434The <tt>rcu_node</tt> Structure</a></h3>
435
436<p>The <tt>rcu_node</tt> structures form the combining
437tree that propagates quiescent-state
438information from the leaves to the root and also that propagates
439grace-period information from the root down to the leaves.
440They provide local copies of the grace-period state in order
441to allow this information to be accessed in a synchronized
442manner without suffering the scalability limitations that
443would otherwise be imposed by global locking.
444In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists
445of tasks that have blocked while in their current
446RCU read-side critical section.
447In <tt>CONFIG_PREEMPT_RCU</tt> with
448<tt>CONFIG_RCU_BOOST</tt>, they manage the
449per-<tt>rcu_node</tt> priority-boosting
450kernel threads (kthreads) and state.
451Finally, they record CPU-hotplug state in order to determine
452which CPUs should be ignored during a given grace period.
453
454</p><p>The <tt>rcu_node</tt> structure's fields are discussed,
455singly and in groups, in the following sections.
456
457<h5>Connection to Combining Tree</h5>
458
459<p>This portion of the <tt>rcu_node</tt> structure is declared
460as follows:
461
462<pre>
463 1 struct rcu_node *parent;
464 2 u8 level;
465 3 u8 grpnum;
466 4 unsigned long grpmask;
467 5 int grplo;
468 6 int grphi;
469</pre>
470
471<p>The <tt>-&gt;parent</tt> pointer references the <tt>rcu_node</tt>
472one level up in the tree, and is <tt>NULL</tt> for the root
473<tt>rcu_node</tt>.
474The RCU implementation makes heavy use of this field to push quiescent
475states up the tree.
476The <tt>-&gt;level</tt> field gives the level in the tree, with
477the root being at level zero, its children at level one, and so on.
478The <tt>-&gt;grpnum</tt> field gives this node's position within
479the children of its parent, so this number can range between 0 and 31
480on 32-bit systems and between 0 and 63 on 64-bit systems.
481The <tt>-&gt;level</tt> and <tt>-&gt;grpnum</tt> fields are
482used only during initialization and for tracing.
483The <tt>-&gt;grpmask</tt> field is the bitmask counterpart of
484<tt>-&gt;grpnum</tt>, and therefore always has exactly one bit set.
485This mask is used to clear the bit corresponding to this <tt>rcu_node</tt>
486structure in its parent's bitmasks, which are described later.
487Finally, the <tt>-&gt;grplo</tt> and <tt>-&gt;grphi</tt> fields
488contain the lowest and highest numbered CPU served by this
489<tt>rcu_node</tt> structure, respectively.
490
491</p><p>All of these fields are constant, and thus do not require any
492synchronization.
493
494<h5>Synchronization</h5>
495
496<p>This field of the <tt>rcu_node</tt> structure is declared
497as follows:
498
499<pre>
500 1 raw_spinlock_t lock;
501</pre>
502
503<p>This field is used to protect the remaining fields in this structure,
504unless otherwise stated.
505That said, all of the fields in this structure can be accessed without
506locking for tracing purposes.
507Yes, this can result in confusing traces, but better some tracing confusion
508than to be heisenbugged out of existence.
509
510<h5>Grace-Period Tracking</h5>
511
512<p>This portion of the <tt>rcu_node</tt> structure is declared
513as follows:
514
515<pre>
516 1 unsigned long gpnum;
517 2 unsigned long completed;
518</pre>
519
520<p>These fields are the counterparts of the fields of the same name in
521the <tt>rcu_state</tt> structure.
522They each may lag up to one behind their <tt>rcu_state</tt>
523counterparts.
524If a given <tt>rcu_node</tt> structure's <tt>-&gt;gpnum</tt> and
525<tt>-&gt;completed</tt> fields are equal, then this <tt>rcu_node</tt>
526structure believes that RCU is idle.
527Otherwise, as with the <tt>rcu_state</tt> structure,
528the <tt>-&gt;gpnum</tt> field will be one greater than the
529<tt>-&gt;completed</tt> field, with <tt>-&gt;gpnum</tt>
530indicating which grace period this <tt>rcu_node</tt> believes
531is still being waited for.
532
533</p><p>The <tt>-&gt;gpnum</tt> field of each <tt>rcu_node</tt>
534structure is updated at the beginning
535of each grace period, and the <tt>-&gt;completed</tt> fields are
536updated at the end of each grace period.
537
538<h5>Quiescent-State Tracking</h5>
539
540<p>These fields manage the propagation of quiescent states up the
541combining tree.
542
543</p><p>This portion of the <tt>rcu_node</tt> structure has fields
544as follows:
545
546<pre>
547 1 unsigned long qsmask;
548 2 unsigned long expmask;
549 3 unsigned long qsmaskinit;
550 4 unsigned long expmaskinit;
551</pre>
552
553<p>The <tt>-&gt;qsmask</tt> field tracks which of this
554<tt>rcu_node</tt> structure's children still need to report
555quiescent states for the current normal grace period.
556Such children will have a value of 1 in their corresponding bit.
557Note that the leaf <tt>rcu_node</tt> structures should be
558thought of as having <tt>rcu_data</tt> structures as their
559children.
560Similarly, the <tt>-&gt;expmask</tt> field tracks which
561of this <tt>rcu_node</tt> structure's children still need to report
562quiescent states for the current expedited grace period.
563An expedited grace period has
564the same conceptual properties as a normal grace period, but the
565expedited implementation accepts extreme CPU overhead to obtain
566much lower grace-period latency, for example, consuming a few
567tens of microseconds worth of CPU time to reduce grace-period
568duration from milliseconds to tens of microseconds.
569The <tt>-&gt;qsmaskinit</tt> field tracks which of this
570<tt>rcu_node</tt> structure's children cover for at least
571one online CPU.
572This mask is used to initialize <tt>-&gt;qsmask</tt>,
573and <tt>-&gt;expmaskinit</tt> is used to initialize
574<tt>-&gt;expmask</tt>, at the beginnings of the
575normal and expedited grace periods, respectively.
576
577<table>
578<tr><th>&nbsp;</th></tr>
579<tr><th align="left">Quick Quiz:</th></tr>
580<tr><td>
581 Why are these bitmasks protected by locking?
582 Come on, haven't you heard of atomic instructions???
583</td></tr>
584<tr><th align="left">Answer:</th></tr>
585<tr><td bgcolor="#ffffff"><font color="ffffff">
586 Lockless grace-period computation! Such a tantalizing possibility!
587 </font>
588
589 <p><font color="ffffff">But consider the following sequence of events:
590 </font>
591
592 <ol>
593 <li> <font color="ffffff">CPU&nbsp;0 has been in dyntick-idle
594 mode for quite some time.
595 When it wakes up, it notices that the current RCU
596 grace period needs it to report in, so it sets a
597 flag where the scheduling clock interrupt will find it.
598 </font><p>
599 <li> <font color="ffffff">Meanwhile, CPU&nbsp;1 is running
600 <tt>force_quiescent_state()</tt>,
601	and notices that CPU&nbsp;0 has been in dyntick-idle mode,
602 which qualifies as an extended quiescent state.
603 </font><p>
604 <li> <font color="ffffff">CPU&nbsp;0's scheduling clock
605 interrupt fires in the
606 middle of an RCU read-side critical section, and notices
607 that the RCU core needs something, so commences RCU softirq
608 processing.
609 </font>
610 <p>
611 <li> <font color="ffffff">CPU&nbsp;0's softirq handler
612 executes and is just about ready
613 to report its quiescent state up the <tt>rcu_node</tt>
614 tree.
615 </font><p>
616 <li> <font color="ffffff">But CPU&nbsp;1 beats it to the punch,
617 completing the current
618 grace period and starting a new one.
619 </font><p>
620 <li> <font color="ffffff">CPU&nbsp;0 now reports its quiescent
621 state for the wrong
622 grace period.
623 That grace period might now end before the RCU read-side
624 critical section.
625 If that happens, disaster will ensue.
626 </font>
627 </ol>
628
629 <p><font color="ffffff">So the locking is absolutely required in
630 order to coordinate
631 clearing of the bits with the grace-period numbers in
632 <tt>-&gt;gpnum</tt> and <tt>-&gt;completed</tt>.
633</font></td></tr>
634<tr><td>&nbsp;</td></tr>
635</table>
636
637<h5>Blocked-Task Management</h5>
638
639<p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the
640midst of their RCU read-side critical sections, and these tasks
641must be tracked explicitly.
642The details of exactly why and how they are tracked will be covered
643in a separate article on RCU read-side processing.
644For now, it is enough to know that the <tt>rcu_node</tt>
645structure tracks them.
646
647<pre>
648 1 struct list_head blkd_tasks;
649 2 struct list_head *gp_tasks;
650 3 struct list_head *exp_tasks;
651 4 bool wait_blkd_tasks;
652</pre>
653
654<p>The <tt>-&gt;blkd_tasks</tt> field is a list header for
655the list of blocked and preempted tasks.
656As tasks undergo context switches within RCU read-side critical
657sections, their <tt>task_struct</tt> structures are enqueued
658(via the <tt>task_struct</tt>'s <tt>-&gt;rcu_node_entry</tt>
659field) onto the head of the <tt>-&gt;blkd_tasks</tt> list for the
660leaf <tt>rcu_node</tt> structure corresponding to the CPU
661on which the outgoing context switch executed.
662As these tasks later exit their RCU read-side critical sections,
663they remove themselves from the list.
664This list is therefore in reverse time order, so that if one of the tasks
665is blocking the current grace period, all subsequent tasks must
666also be blocking that same grace period.
667Therefore, a single pointer into this list suffices to track
668all tasks blocking a given grace period.
669That pointer is stored in <tt>-&gt;gp_tasks</tt> for normal
670grace periods and in <tt>-&gt;exp_tasks</tt> for expedited
671grace periods.
672These last two fields are <tt>NULL</tt> if either there is
673no grace period in flight or if there are no blocked tasks
674preventing that grace period from completing.
675If either of these two pointers is referencing a task that
676removes itself from the <tt>-&gt;blkd_tasks</tt> list,
677then that task must advance the pointer to the next task on
678the list, or set the pointer to <tt>NULL</tt> if there
679are no subsequent tasks on the list.
680
681</p><p>For example, suppose that tasks&nbsp;T1, T2, and&nbsp;T3 are
682all hard-affinitied to the largest-numbered CPU in the system.
683Then if task&nbsp;T1 blocked in an RCU read-side
684critical section, then an expedited grace period started,
685then task&nbsp;T2 blocked in an RCU read-side critical section,
686then a normal grace period started, and finally task&nbsp;T3 blocked
687in an RCU read-side critical section, then the state of the
688last leaf <tt>rcu_node</tt> structure's blocked-task list
689would be as shown below:
690
691</p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%">
692
693</p><p>Task&nbsp;T1 is blocking both grace periods, task&nbsp;T2 is
694blocking only the normal grace period, and task&nbsp;T3 is blocking
695neither grace period.
696Note that these tasks will not remove themselves from this list
697immediately upon resuming execution.
698They will instead remain on the list until they execute the outermost
699<tt>rcu_read_unlock()</tt> that ends their RCU read-side critical
700section.
701
702<p>
703The <tt>-&gt;wait_blkd_tasks</tt> field indicates whether or not
704the current grace period is waiting on a blocked task.
705
706<h5>Sizing the <tt>rcu_node</tt> Array</h5>
707
708<p>The <tt>rcu_node</tt> array is sized via a series of
709C-preprocessor expressions as follows:
710
711<pre>
712 1 #ifdef CONFIG_RCU_FANOUT
713 2 #define RCU_FANOUT CONFIG_RCU_FANOUT
714 3 #else
715 4 # ifdef CONFIG_64BIT
716 5 # define RCU_FANOUT 64
717 6 # else
718 7 # define RCU_FANOUT 32
719 8 # endif
720 9 #endif
72110
72211 #ifdef CONFIG_RCU_FANOUT_LEAF
72312 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
72413 #else
72514 # ifdef CONFIG_64BIT
72615 # define RCU_FANOUT_LEAF 64
72716 # else
72817 # define RCU_FANOUT_LEAF 32
72918 # endif
73019 #endif
73120
73221 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
73322 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT)
73423 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT)
73524 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT)
73625
73726 #if NR_CPUS &lt;= RCU_FANOUT_1
73827 # define RCU_NUM_LVLS 1
73928 # define NUM_RCU_LVL_0 1
74029 # define NUM_RCU_NODES NUM_RCU_LVL_0
74130 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
74231 # define RCU_NODE_NAME_INIT { "rcu_node_0" }
74332 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
74433 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" }
74534 #elif NR_CPUS &lt;= RCU_FANOUT_2
74635 # define RCU_NUM_LVLS 2
74736 # define NUM_RCU_LVL_0 1
74837 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
74938 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
75039 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
75140 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
75241 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
75342 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" }
75443 #elif NR_CPUS &lt;= RCU_FANOUT_3
75544 # define RCU_NUM_LVLS 3
75645 # define NUM_RCU_LVL_0 1
75746 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
75847 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
75948 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
76049 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
76150 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
76251 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
76352 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" }
76453 #elif NR_CPUS &lt;= RCU_FANOUT_4
76554 # define RCU_NUM_LVLS 4
76655 # define NUM_RCU_LVL_0 1
76756 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
76857 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
76958 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
77059 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
77160 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
77261 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
77362 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
77463 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" }
77564 #else
77665 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
77766 #endif
778</pre>
779
780<p>The maximum number of levels in the <tt>rcu_node</tt> structure
781is currently limited to four, as specified by lines&nbsp;21-24
782and the structure of the subsequent &ldquo;if&rdquo; statement.
783For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which
784should be sufficient for the next few years at least.
785For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which
786should see us through the next decade or so.
787This four-level tree also allows kernels built with
788<tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs,
789which might be useful in very large systems having eight CPUs per
790socket (but please note that no one has yet shown any measurable
791performance degradation due to misaligned socket and <tt>rcu_node</tt>
792boundaries).
793In addition, building kernels with a full four levels of <tt>rcu_node</tt>
794tree permits better testing of RCU's combining-tree code.
795
796</p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children
797are permitted at each non-leaf level of the <tt>rcu_node</tt> tree.
798If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified,
799it is set based on the word size of the system, which is also
800the Kconfig default.
801
802</p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are
803handled by each leaf <tt>rcu_node</tt> structure.
804Experience has shown that allowing a given leaf <tt>rcu_node</tt>
805structure to handle 64 CPUs, as permitted by the number of bits in
806the <tt>-&gt;qsmask</tt> field on a 64-bit system, results in
807excessive contention for the leaf <tt>rcu_node</tt> structures'
808<tt>-&gt;lock</tt> fields.
809The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore
810limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>.
811If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value
812selected is based on the word size of the system, just as for
813<tt>CONFIG_RCU_FANOUT</tt>.
814Lines&nbsp;11-19 perform this computation.
815
816</p><p>Lines&nbsp;21-24 compute the maximum number of CPUs supported by
817a single-level (which contains a single <tt>rcu_node</tt> structure),
818two-level, three-level, and four-level <tt>rcu_node</tt> tree,
819respectively, given the fanout specified by <tt>RCU_FANOUT</tt>
820and <tt>RCU_FANOUT_LEAF</tt>.
821These numbers of CPUs are retained in the
822<tt>RCU_FANOUT_1</tt>,
823<tt>RCU_FANOUT_2</tt>,
824<tt>RCU_FANOUT_3</tt>, and
825<tt>RCU_FANOUT_4</tt>
826C-preprocessor variables, respectively.
827
828</p><p>These variables are used to control the C-preprocessor <tt>#if</tt>
829statement spanning lines&nbsp;26-66 that computes the number of
830<tt>rcu_node</tt> structures required for each level of the tree,
831as well as the number of levels required.
832The number of levels is placed in the <tt>NUM_RCU_LVLS</tt>
833C-preprocessor variable by lines&nbsp;27, 35, 44, and&nbsp;54.
834The number of <tt>rcu_node</tt> structures for the topmost level
835of the tree is always exactly one, and this value is unconditionally
836placed into <tt>NUM_RCU_LVL_0</tt> by lines&nbsp;28, 36, 45, and&nbsp;55.
837The rest of the levels (if any) of the <tt>rcu_node</tt> tree
838are computed by dividing the maximum number of CPUs by the
839fanout supported by the number of levels from the current level down,
840rounding up. This computation is performed by lines&nbsp;37,
84146-47, and&nbsp;56-58.
842Lines&nbsp;31-33, 40-42, 50-52, and&nbsp;61-63 create initializers
843for lockdep lock-class names.
844Finally, lines&nbsp;64-66 produce an error if the maximum number of
845CPUs is too large for the specified fanout.
846
847<h3><a name="The rcu_data Structure">
848The <tt>rcu_data</tt> Structure</a></h3>
849
850<p>The <tt>rcu_data</tt> structure maintains the per-CPU state for the
851corresponding flavor of RCU.
852The fields in this structure may be accessed only from the corresponding
853CPU (and from tracing) unless otherwise stated.
854This structure is the
855focus of quiescent-state detection and RCU callback queuing.
856It also tracks its relationship to the corresponding leaf
857<tt>rcu_node</tt> structure to allow more-efficient
858propagation of quiescent states up the <tt>rcu_node</tt>
859combining tree.
860Like the <tt>rcu_node</tt> structure, it provides a local
861copy of the grace-period information to allow synchronization-free
862access to this information from the corresponding CPU.
864Finally, this structure records past dyntick-idle state
865for the corresponding CPU and also tracks statistics.
866
867</p><p>The <tt>rcu_data</tt> structure's fields are discussed,
868singly and in groups, in the following sections.
869
870<h5>Connection to Other Data Structures</h5>
871
872<p>This portion of the <tt>rcu_data</tt> structure is declared
873as follows:
874
875<pre>
876 1 int cpu;
877 2 struct rcu_state *rsp;
878 3 struct rcu_node *mynode;
879 4 struct rcu_dynticks *dynticks;
880 5 unsigned long grpmask;
881 6 bool beenonline;
882</pre>
883
884<p>The <tt>-&gt;cpu</tt> field contains the number of the
885corresponding CPU, the <tt>-&gt;rsp</tt> pointer references
886the corresponding <tt>rcu_state</tt> structure (and is most frequently
887used to locate the name of the corresponding flavor of RCU for tracing),
888and the <tt>-&gt;mynode</tt> field references the corresponding
889<tt>rcu_node</tt> structure.
890The <tt>-&gt;mynode</tt> field is used to propagate quiescent states
891up the combining tree.
892<p>The <tt>-&gt;dynticks</tt> pointer references the
893<tt>rcu_dynticks</tt> structure corresponding to this
894CPU.
895Recall that a single per-CPU instance of the <tt>rcu_dynticks</tt>
896structure is shared among all flavors of RCU.
897These first four fields are constant and therefore require no
898synchronization.
899
900</p><p>The <tt>-&gt;grpmask</tt> field indicates the bit in
901the <tt>-&gt;mynode-&gt;qsmask</tt> corresponding to this
902<tt>rcu_data</tt> structure, and is also used when propagating
903quiescent states.
904The <tt>-&gt;beenonline</tt> flag is set whenever the corresponding
905CPU comes online, which means that the debugfs tracing need not dump
906out any <tt>rcu_data</tt> structure for which this flag is not set.
907
908<h5>Quiescent-State and Grace-Period Tracking</h5>
909
910<p>This portion of the <tt>rcu_data</tt> structure is declared
911as follows:
912
913<pre>
914 1 unsigned long completed;
915 2 unsigned long gpnum;
916 3 bool cpu_no_qs;
917 4 bool core_needs_qs;
918 5 bool gpwrap;
919 6 unsigned long rcu_qs_ctr_snap;
920</pre>
921
922<p>The <tt>completed</tt> and <tt>gpnum</tt>
923fields are the counterparts of the fields of the same name
924in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures.
925They may each lag up to one behind their <tt>rcu_node</tt>
926counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and
927<tt>CONFIG_NO_HZ_FULL</tt> kernels can lag
928arbitrarily far behind for CPUs in dyntick-idle mode (but these counters
929will catch up upon exit from dyntick-idle mode).
930If a given <tt>rcu_data</tt> structure's <tt>-&gt;gpnum</tt> and
931<tt>-&gt;completed</tt> fields are equal, then this <tt>rcu_data</tt>
932structure believes that RCU is idle.
933Otherwise, as with the <tt>rcu_state</tt> and <tt>rcu_node</tt>
934structures,
935the <tt>-&gt;gpnum</tt> field will be one greater than the
936<tt>-&gt;completed</tt> field, with <tt>-&gt;gpnum</tt>
937indicating which grace period this <tt>rcu_data</tt> believes
938is still being waited for.
939
940<table>
941<tr><th>&nbsp;</th></tr>
942<tr><th align="left">Quick Quiz:</th></tr>
943<tr><td>
944 All this replication of the grace period numbers can only cause
945 massive confusion.
946 Why not just keep a global pair of counters and be done with it???
947</td></tr>
948<tr><th align="left">Answer:</th></tr>
949<tr><td bgcolor="#ffffff"><font color="ffffff">
950 Because if there was only a single global pair of grace-period
951 numbers, there would need to be a single global lock to allow
952 safely accessing and updating them.
953 And if we are not going to have a single global lock, we need
954 to carefully manage the numbers on a per-node basis.
955 Recall from the answer to a previous Quick Quiz that the consequences
956 of applying a previously sampled quiescent state to the wrong
957 grace period are quite severe.
958</font></td></tr>
959<tr><td>&nbsp;</td></tr>
960</table>
961
962<p>The <tt>-&gt;cpu_no_qs</tt> flag indicates that the
963CPU has not yet passed through a quiescent state,
964while the <tt>-&gt;core_needs_qs</tt> flag indicates that the
965RCU core needs a quiescent state from the corresponding CPU.
966The <tt>-&gt;gpwrap</tt> field indicates that the corresponding
967CPU has remained idle for so long that the <tt>completed</tt>
968and <tt>gpnum</tt> counters are in danger of overflow, which
969will cause the CPU to disregard the values of its counters on
970its next exit from idle.
971Finally, the <tt>rcu_qs_ctr_snap</tt> field is used to detect
972cases where a given operation has resulted in a quiescent state
973for all flavors of RCU, for example, <tt>cond_resched_rcu_qs()</tt>.
974
975<h5>RCU Callback Handling</h5>
976
977<p>In the absence of CPU-hotplug events, RCU callbacks are invoked by
978the same CPU that registered them.
979This is strictly a cache-locality optimization: callbacks can and
980do get invoked on CPUs other than the one that registered them.
981After all, if the CPU that registered a given callback has gone
982offline before the callback can be invoked, there really is no other
983choice.
984
985</p><p>This portion of the <tt>rcu_data</tt> structure is declared
986as follows:
987
988<pre>
989 1 struct rcu_head *nxtlist;
990 2 struct rcu_head **nxttail[RCU_NEXT_SIZE];
991 3 unsigned long nxtcompleted[RCU_NEXT_SIZE];
992 4 long qlen_lazy;
993 5 long qlen;
994 6 long qlen_last_fqs_check;
995 7 unsigned long n_force_qs_snap;
996 8 unsigned long n_cbs_invoked;
997 9 unsigned long n_cbs_orphaned;
99810 unsigned long n_cbs_adopted;
99911 long blimit;
1000</pre>
1001
1002<p>The <tt>-&gt;nxtlist</tt> pointer and the
1003<tt>-&gt;nxttail[]</tt> array form a four-segment list with
1004older callbacks near the head and newer ones near the tail.
1005Each segment contains callbacks with the corresponding relationship
1006to the current grace period.
1007The pointer out of the end of each of the four segments is referenced
1008by the element of the <tt>-&gt;nxttail[]</tt> array indexed by
1009<tt>RCU_DONE_TAIL</tt> (for callbacks handled by a prior grace period),
1010<tt>RCU_WAIT_TAIL</tt> (for callbacks waiting on the current grace period),
1011<tt>RCU_NEXT_READY_TAIL</tt> (for callbacks that will wait on the next
1012grace period), and
1013<tt>RCU_NEXT_TAIL</tt> (for callbacks that are not yet associated
1014with a specific grace period)
1015respectively, as shown in the following figure.
1016
1017</p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%">
1018
1019</p><p>In this figure, the <tt>-&gt;nxtlist</tt> pointer references the
1020first
1021RCU callback in the list.
1022The <tt>-&gt;nxttail[RCU_DONE_TAIL]</tt> array element references
1023the <tt>-&gt;nxtlist</tt> pointer itself, indicating that none
1024of the callbacks is ready to invoke.
1025The <tt>-&gt;nxttail[RCU_WAIT_TAIL]</tt> array element references callback
1026CB&nbsp;2's <tt>-&gt;next</tt> pointer, which indicates that
1027CB&nbsp;1 and CB&nbsp;2 are both waiting on the current grace period.
1028The <tt>-&gt;nxttail[RCU_NEXT_READY_TAIL]</tt> array element
1029references the same RCU callback that <tt>-&gt;nxttail[RCU_WAIT_TAIL]</tt>
1030does, which indicates that there are no callbacks waiting on the next
1031RCU grace period.
1032The <tt>-&gt;nxttail[RCU_NEXT_TAIL]</tt> array element references
1033CB&nbsp;4's <tt>-&gt;next</tt> pointer, indicating that all the
1034remaining RCU callbacks have not yet been assigned to an RCU grace
1035period.
1036Note that the <tt>-&gt;nxttail[RCU_NEXT_TAIL]</tt> array element
1037always references the last RCU callback's <tt>-&gt;next</tt> pointer
1038unless the callback list is empty, in which case it references
1039the <tt>-&gt;nxtlist</tt> pointer.
1040
1041</p><p>CPUs advance their callbacks from the
1042<tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the
1043<tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments
1044as grace periods advance.
1045The CPU advances the callbacks in its <tt>rcu_data</tt> structure
1046whenever it notices that another RCU grace period has completed.
1047The CPU detects the completion of an RCU grace period by noticing
1048that the value of its <tt>rcu_data</tt> structure's
1049<tt>-&gt;completed</tt> field differs from that of its leaf
1050<tt>rcu_node</tt> structure.
1051Recall that each <tt>rcu_node</tt> structure's
1052<tt>-&gt;completed</tt> field is updated at the end of each
1053grace period.
1054
1055</p><p>The <tt>-&gt;nxtcompleted[]</tt> array records grace-period
1056numbers corresponding to the list segments.
1057This allows CPUs that go idle for extended periods to determine
1058which of their callbacks are ready to be invoked after reawakening.
1059
1060</p><p>The <tt>-&gt;qlen</tt> counter contains the number of
1061callbacks in <tt>-&gt;nxtlist</tt>, and the
1062<tt>-&gt;qlen_lazy</tt> contains the number of those callbacks that
1063are known to only free memory, and whose invocation can therefore
1064be safely deferred.
1065The <tt>-&gt;qlen_last_fqs_check</tt> and
1066<tt>-&gt;n_force_qs_snap</tt> fields coordinate the forcing of quiescent
1067states from <tt>call_rcu()</tt> and friends when callback
1068lists grow excessively long.
1069
1070</p><p>The <tt>-&gt;n_cbs_invoked</tt>,
1071<tt>-&gt;n_cbs_orphaned</tt>, and <tt>-&gt;n_cbs_adopted</tt>
1072fields count the number of callbacks invoked,
1073sent to other CPUs when this CPU goes offline,
1074and received from other CPUs when those other CPUs go offline,
respectively.
1075Finally, the <tt>-&gt;blimit</tt> counter is the maximum number of
1076RCU callbacks that may be invoked at a given time.
1077
1078<h5>Dyntick-Idle Handling</h5>
1079
1080<p>This portion of the <tt>rcu_data</tt> structure is declared
1081as follows:
1082
1083<pre>
1084 1 int dynticks_snap;
1085 2 unsigned long dynticks_fqs;
1086</pre>
1087
1088The <tt>-&gt;dynticks_snap</tt> field is used to take a snapshot
1089of the corresponding CPU's dyntick-idle state when forcing
1090quiescent states, and is therefore accessed from other CPUs.
1091Finally, the <tt>-&gt;dynticks_fqs</tt> field is used to
1092count the number of times this CPU is determined to be in
1093dyntick-idle state, and is used for tracing and debugging purposes.
1094
1095<h3><a name="The rcu_dynticks Structure">
1096The <tt>rcu_dynticks</tt> Structure</a></h3>
1097
1098<p>The <tt>rcu_dynticks</tt> structure maintains the per-CPU dyntick-idle state
1099for the corresponding CPU.
1100Unlike the other structures, <tt>rcu_dynticks</tt> is not
1101replicated over the different flavors of RCU.
1102The fields in this structure may be accessed only from the corresponding
1103CPU (and from tracing) unless otherwise stated.
1104Its fields are as follows:
1105
1106<pre>
1107 1 int dynticks_nesting;
1108 2 int dynticks_nmi_nesting;
1109 3 atomic_t dynticks;
1110</pre>
1111
1112<p>The <tt>-&gt;dynticks_nesting</tt> field counts the
1113nesting depth of normal interrupts.
1114In addition, this counter is incremented when exiting dyntick-idle
1115mode and decremented when entering it.
1116This counter can therefore be thought of as counting the number
1117of reasons why this CPU cannot be permitted to enter dyntick-idle
1118mode, aside from non-maskable interrupts (NMIs).
1119NMIs are counted by the <tt>-&gt;dynticks_nmi_nesting</tt>
1120field, except that NMIs that interrupt non-dyntick-idle execution
1121are not counted.
1122
1123</p><p>Finally, the <tt>-&gt;dynticks</tt> field counts the corresponding
1124CPU's transitions to and from dyntick-idle mode, so that this counter
1125has an even value when the CPU is in dyntick-idle mode and an odd
1126value otherwise.
1127
1128<table>
1129<tr><th>&nbsp;</th></tr>
1130<tr><th align="left">Quick Quiz:</th></tr>
1131<tr><td>
1132 Why not just count all NMIs?
1133 Wouldn't that be simpler and less error prone?
1134</td></tr>
1135<tr><th align="left">Answer:</th></tr>
1136<tr><td bgcolor="#ffffff"><font color="ffffff">
1137 It seems simpler only until you think hard about how to go about
1138 updating the <tt>rcu_dynticks</tt> structure's
1139 <tt>-&gt;dynticks</tt> field.
1140</font></td></tr>
1141<tr><td>&nbsp;</td></tr>
1142</table>
1143
1144<p>Additional fields are present for some special-purpose
1145builds, and are discussed separately.
1146
1147<h3><a name="The rcu_head Structure">
1148The <tt>rcu_head</tt> Structure</a></h3>
1149
1150<p>Each <tt>rcu_head</tt> structure represents an RCU callback.
1151These structures are normally embedded within RCU-protected data
1152structures whose algorithms use asynchronous grace periods.
1153In contrast, when using algorithms that block waiting for RCU grace periods,
1154RCU users need not provide <tt>rcu_head</tt> structures.
1155
1156</p><p>The <tt>rcu_head</tt> structure has fields as follows:
1157
1158<pre>
1159 1 struct rcu_head *next;
1160 2 void (*func)(struct rcu_head *head);
1161</pre>
1162
1163<p>The <tt>-&gt;next</tt> field is used
1164to link the <tt>rcu_head</tt> structures together in the
1165lists within the <tt>rcu_data</tt> structures.
1166The <tt>-&gt;func</tt> field is a pointer to the function
1167to be called when the callback is ready to be invoked, and
1168this function is passed a pointer to the <tt>rcu_head</tt>
1169structure.
1170However, <tt>kfree_rcu()</tt> uses the <tt>-&gt;func</tt>
1171field to record the offset of the <tt>rcu_head</tt>
1172structure within the enclosing RCU-protected data structure.
1173
1174</p><p>Both of these fields are used internally by RCU.
1175From the viewpoint of RCU users, this structure is an
1176opaque &ldquo;cookie&rdquo;.
1177
1178<table>
1179<tr><th>&nbsp;</th></tr>
1180<tr><th align="left">Quick Quiz:</th></tr>
1181<tr><td>
1182 Given that the callback function <tt>-&gt;func</tt>
1183 is passed a pointer to the <tt>rcu_head</tt> structure,
1184 how is that function supposed to find the beginning of the
1185 enclosing RCU-protected data structure?
1186</td></tr>
1187<tr><th align="left">Answer:</th></tr>
1188<tr><td bgcolor="#ffffff"><font color="ffffff">
1189 In actual practice, there is a separate callback function per
1190 type of RCU-protected data structure.
1191 The callback function can therefore use the <tt>container_of()</tt>
1192 macro in the Linux kernel (or other pointer-manipulation facilities
1193 in other software environments) to find the beginning of the
1194 enclosing structure.
1195</font></td></tr>
1196<tr><td>&nbsp;</td></tr>
1197</table>
1198
1199<h3><a name="RCU-Specific Fields in the task_struct Structure">
1200RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3>
1201
1202<p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some
1203additional fields in the <tt>task_struct</tt> structure:
1204
1205<pre>
1206 1 #ifdef CONFIG_PREEMPT_RCU
1207 2 int rcu_read_lock_nesting;
1208 3 union rcu_special rcu_read_unlock_special;
1209 4 struct list_head rcu_node_entry;
1210 5 struct rcu_node *rcu_blocked_node;
1211 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */
1212 7 #ifdef CONFIG_TASKS_RCU
1213 8 unsigned long rcu_tasks_nvcsw;
1214 9 bool rcu_tasks_holdout;
121510 struct list_head rcu_tasks_holdout_list;
121611 int rcu_tasks_idle_cpu;
121712 #endif /* #ifdef CONFIG_TASKS_RCU */
1218</pre>
1219
1220<p>The <tt>-&gt;rcu_read_lock_nesting</tt> field records the
1221nesting level for RCU read-side critical sections, and
1222the <tt>-&gt;rcu_read_unlock_special</tt> field is a bitmask
1223that records special conditions that require <tt>rcu_read_unlock()</tt>
1224to do additional work.
1225The <tt>-&gt;rcu_node_entry</tt> field is used to form lists of
1226tasks that have blocked within preemptible-RCU read-side critical
1227sections and the <tt>-&gt;rcu_blocked_node</tt> field references
1228the <tt>rcu_node</tt> structure whose list this task is a member of,
1229or <tt>NULL</tt> if it is not blocked within a preemptible-RCU
1230read-side critical section.
1231
1232<p>The <tt>-&gt;rcu_tasks_nvcsw</tt> field tracks the number of
1233voluntary context switches that this task had undergone at the
1234beginning of the current tasks-RCU grace period,
1235<tt>-&gt;rcu_tasks_holdout</tt> is set if the current tasks-RCU
1236grace period is waiting on this task, <tt>-&gt;rcu_tasks_holdout_list</tt>
1237is a list element enqueuing this task on the holdout list,
1238and <tt>-&gt;rcu_tasks_idle_cpu</tt> tracks which CPU this
1239idle task is running on, but only if the task is currently running,
1240that is, if the CPU is currently idle.
1241
1242<h3><a name="Accessor Functions">
1243Accessor Functions</a></h3>
1244
1245<p>The following listing shows the
1246<tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first()</tt>,
1247<tt>rcu_for_each_nonleaf_node_breadth_first()</tt>, and
1248<tt>rcu_for_each_leaf_node()</tt> function and macros:
1249
1250<pre>
1251 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
1252 2 {
1253 3 return &amp;rsp-&gt;node[0];
1254 4 }
1255 5
1256 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
1257 7 for ((rnp) = &amp;(rsp)-&gt;node[0]; \
1258 8 (rnp) &lt; &amp;(rsp)-&gt;node[NUM_RCU_NODES]; (rnp)++)
1259 9
1260 10 #define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
1261 11 for ((rnp) = &amp;(rsp)-&gt;node[0]; \
1262 12 (rnp) &lt; (rsp)-&gt;level[NUM_RCU_LVLS - 1]; (rnp)++)
1263 13
1264 14 #define rcu_for_each_leaf_node(rsp, rnp) \
1265 15 for ((rnp) = (rsp)-&gt;level[NUM_RCU_LVLS - 1]; \
1266 16 (rnp) &lt; &amp;(rsp)-&gt;node[NUM_RCU_NODES]; (rnp)++)
1267</pre>
1268
1269<p>The <tt>rcu_get_root()</tt> function simply returns a pointer to the
1270first element of the specified <tt>rcu_state</tt> structure's
1271<tt>-&gt;node[]</tt> array, which is the root <tt>rcu_node</tt>
1272structure.
1273
1274</p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt>
1275macro takes advantage of the layout of the <tt>rcu_node</tt>
1276structures in the <tt>rcu_state</tt> structure's
1277<tt>-&gt;node[]</tt> array, performing a breadth-first traversal by
1278simply traversing the array in order.
1279The <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> macro operates
1280similarly, but traverses only the first part of the array, thus excluding
1281the leaf <tt>rcu_node</tt> structures.
1282Finally, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only
1283the last part of the array, thus traversing only the leaf
1284<tt>rcu_node</tt> structures.
1285
1286<table>
1287<tr><th>&nbsp;</th></tr>
1288<tr><th align="left">Quick Quiz:</th></tr>
1289<tr><td>
1290 What do <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> and
1291 <tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree
1292 contains only a single node?
1293</td></tr>
1294<tr><th align="left">Answer:</th></tr>
1295<tr><td bgcolor="#ffffff"><font color="ffffff">
1296 In the single-node case,
1297 <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> is a no-op
1298 and <tt>rcu_for_each_leaf_node()</tt> traverses the single node.
1299</font></td></tr>
1300<tr><td>&nbsp;</td></tr>
1301</table>
1302
1303<h3><a name="Summary">
1304Summary</a></h3>
1305
1306Each flavor of RCU is represented by an <tt>rcu_state</tt> structure,
1307which contains a combining tree of <tt>rcu_node</tt> and
1308<tt>rcu_data</tt> structures.
1309Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle
1310state is tracked by an <tt>rcu_dynticks</tt> structure.
1311
1312If you made it this far, you are well prepared to read the code
1313walkthroughs in the other articles in this series.
1314
1315<h3><a name="Acknowledgments">
1316Acknowledgments</a></h3>
1317
1318I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
1319Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn
1320for helping me get this document into a more human-readable state.
1321
1322<h3><a name="Legal Statement">
1323Legal Statement</a></h3>
1324
1325<p>This work represents the view of the author and does not necessarily
1326represent the view of IBM.
1327
1328</p><p>Linux is a registered trademark of Linus Torvalds.
1329
1330</p><p>Other company, product, and service names may be trademarks or
1331service marks of others.
1332
1333</body></html>