1 | <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" |
2 | "http://www.w3.org/TR/html4/loose.dtd"> | |
3 | <html> | |
4 | <head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title> | |
5 | <meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> | |
6 | ||
7 | <p>December 18, 2016</p> |
8 | <p>This article was contributed by Paul E. McKenney</p> |
9 | ||
10 | <h3>Introduction</h3> | |
11 | ||
12 | This document describes RCU's major data structures and their relationship | |
13 | to each other. | |
14 | ||
15 | <ol> | |
16 | <li> <a href="#Data-Structure Relationships"> | |
17 | Data-Structure Relationships</a> | |
18 | <li> <a href="#The rcu_state Structure"> | |
19 | The <tt>rcu_state</tt> Structure</a> | |
20 | <li> <a href="#The rcu_node Structure"> | |
21 | The <tt>rcu_node</tt> Structure</a> | |
22 | <li> <a href="#The rcu_data Structure"> | |
23 | The <tt>rcu_data</tt> Structure</a> | |
24 | <li> <a href="#The rcu_dynticks Structure"> | |
25 | The <tt>rcu_dynticks</tt> Structure</a> | |
26 | <li> <a href="#The rcu_head Structure"> | |
27 | The <tt>rcu_head</tt> Structure</a> | |
28 | <li> <a href="#RCU-Specific Fields in the task_struct Structure"> | |
29 | RCU-Specific Fields in the <tt>task_struct</tt> Structure</a> | |
30 | <li> <a href="#Accessor Functions"> | |
31 | Accessor Functions</a> | |
32 | </ol> | |
33 | ||
34 | <h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3> |
35 | ||
36 | <p>RCU is for all intents and purposes a large state machine, and its | |
37 | data structures maintain the state in such a way as to allow RCU readers | |
38 | to execute extremely quickly, while also processing the RCU grace periods | |
39 | requested by updaters in an efficient and extremely scalable fashion. | |
40 | The efficiency and scalability of RCU updaters is provided primarily | |
41 | by a combining tree, as shown below: | |
42 | ||
43 | </p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%"> | |
44 | ||
45 | </p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure | |
46 | containing a tree of <tt>rcu_node</tt> structures. | |
47 | Each leaf node of the <tt>rcu_node</tt> tree has up to 16 | |
48 | <tt>rcu_data</tt> structures associated with it, so that there | |
49 | are <tt>NR_CPUS</tt> number of <tt>rcu_data</tt> structures, | |
50 | one for each possible CPU. | |
51 | This structure is adjusted at boot time, if needed, to handle the | |
52 | common case where <tt>nr_cpu_ids</tt> is much less than | |
53 | <tt>NR_CPUS</tt>. | |
54 | For example, a number of Linux distributions set <tt>NR_CPUS=4096</tt>, | |
55 | which results in a three-level <tt>rcu_node</tt> tree. | |
56 | If the actual hardware has only 16 CPUs, RCU will adjust itself | |
57 | at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node. | |
58 | ||
59 | </p><p>The purpose of this combining tree is to allow per-CPU events | |
60 | such as quiescent states, dyntick-idle transitions, | |
61 | and CPU hotplug operations to be processed efficiently | |
62 | and scalably. | |
63 | Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures, | |
64 | and other events are recorded by the leaf-level <tt>rcu_node</tt> | |
65 | structures. | |
66 | All of these events are combined at each level of the tree until finally | |
67 | grace periods are completed at the tree's root <tt>rcu_node</tt> | |
68 | structure. | |
69 | A grace period can be completed at the root once every CPU | |
70 | (or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task) | |
71 | has passed through a quiescent state. | |
72 | Once a grace period has completed, record of that fact is propagated | |
73 | back down the tree. | |
74 | ||
75 | </p><p>As can be seen from the diagram, on a 64-bit system | |
76 | a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout | |
77 | of 64 at the root and a fanout of 16 at the leaves. | |
78 | ||
79 | <table> | |
80 | <tr><th> </th></tr> | |
81 | <tr><th align="left">Quick Quiz:</th></tr> | |
82 | <tr><td> | |
83 | Why isn't the fanout at the leaves also 64? | |
84 | </td></tr> | |
85 | <tr><th align="left">Answer:</th></tr> | |
86 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
87 | Because there are more types of events that affect the leaf-level | |
88 | <tt>rcu_node</tt> structures than further up the tree. | |
89 | Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of | |
90 | 64, the contention on these structures' <tt>->lock</tt> fields | |
91 | becomes excessive. | |
92 | Experimentation on a wide variety of systems has shown that a fanout | |
93 | of 16 works well for the leaves of the <tt>rcu_node</tt> tree. | |
94 | </font> | |
95 | ||
96 | <p><font color="ffffff">Of course, further experience with | |
97 | systems having hundreds or thousands of CPUs may demonstrate | |
98 | that the fanout for the non-leaf <tt>rcu_node</tt> structures | |
99 | must also be reduced. | |
100 | Such reduction can be easily carried out when and if it proves | |
101 | necessary. | |
102 | In the meantime, if you are using such a system and running into | |
103 | contention problems on the non-leaf <tt>rcu_node</tt> structures, | |
104 | you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration | |
105 | parameter to reduce the non-leaf fanout as needed. | |
106 | </font> | |
107 | ||
108 | <p><font color="ffffff">Kernels built for systems with | |
109 | strong NUMA characteristics might also need to adjust | |
110 | <tt>CONFIG_RCU_FANOUT</tt> so that the domains of the | |
111 | <tt>rcu_node</tt> structures align with hardware boundaries. | |
112 | However, there has thus far been no need for this. | |
113 | </font></td></tr> | |
114 | <tr><td> </td></tr> | |
115 | </table> | |
116 | ||
117 | <p>If your system has more than 1,024 CPUs (or more than 512 CPUs on | |
118 | a 32-bit system), then RCU will automatically add more levels to the | |
119 | tree. | |
120 | For example, if you are crazy enough to build a 64-bit system with 65,536 | |
121 | CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows: | |
122 | ||
123 | </p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%"> | |
124 | ||
125 | </p><p>RCU currently permits up to a four-level tree, which on a 64-bit system | |
126 | accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for | |
127 | 32-bit systems. | |
128 | On the other hand, you can set <tt>CONFIG_RCU_FANOUT</tt> to be | |
129 | as small as 2 if you wish, which would permit only 16 CPUs, which | |
130 | is useful for testing. | |
131 | ||
132 | </p><p>This multi-level combining tree allows us to get most of the | |
133 | performance and scalability | |
134 | benefits of partitioning, even though RCU grace-period detection is | |
135 | inherently a global operation. | |
136 | The trick here is that only the last CPU to report a quiescent state | |
137 | into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt> | |
138 | structure at the next level up the tree. | |
139 | This means that at the leaf-level <tt>rcu_node</tt> structure, only | |
140 | one access out of sixteen will progress up the tree. | |
141 | For the internal <tt>rcu_node</tt> structures, the situation is even | |
142 | more extreme: Only one access out of sixty-four will progress up | |
143 | the tree. | |
144 | Because the vast majority of the CPUs do not progress up the tree, | |
145 | the lock contention remains roughly constant up the tree. | |
146 | No matter how many CPUs there are in the system, at most 64 quiescent-state | |
147 | reports per grace period will progress all the way to the root | |
148 | <tt>rcu_node</tt> structure, thus ensuring that the lock contention | |
149 | on that root <tt>rcu_node</tt> structure remains acceptably low. | |
150 | ||
151 | </p><p>In effect, the combining tree acts like a big shock absorber, | |
152 | keeping lock contention under control at all tree levels regardless | |
153 | of the level of loading on the system. | |
154 | ||
155 | </p><p>The Linux kernel actually supports multiple flavors of RCU | |
156 | running concurrently, so RCU builds separate data structures for each | |
157 | flavor. | |
158 | For example, for <tt>CONFIG_TREE_RCU=y</tt> kernels, RCU provides | |
159 | rcu_sched and rcu_bh, as shown below: | |
160 | ||
161 | </p><p><img src="BigTreeClassicRCUBH.svg" alt="BigTreeClassicRCUBH.svg" width="33%"> | |
162 | ||
163 | </p><p>Energy efficiency is increasingly important, and for that | |
164 | reason the Linux kernel provides <tt>CONFIG_NO_HZ_IDLE</tt>, which | |
165 | turns off the scheduling-clock interrupts on idle CPUs, which in | |
166 | turn allows those CPUs to attain deeper sleep states and to consume | |
167 | less energy. | |
168 | CPUs whose scheduling-clock interrupts have been turned off are | |
169 | said to be in <i>dyntick-idle mode</i>. | |
170 | RCU must handle dyntick-idle CPUs specially | |
171 | because RCU would otherwise wake up each CPU on every grace period, | |
172 | which would defeat the whole purpose of <tt>CONFIG_NO_HZ_IDLE</tt>. | |
173 | RCU uses the <tt>rcu_dynticks</tt> structure to track | |
174 | which CPUs are in dyntick-idle mode, as shown below: | |
175 | ||
176 | </p><p><img src="BigTreeClassicRCUBHdyntick.svg" alt="BigTreeClassicRCUBHdyntick.svg" width="33%"> | |
177 | ||
178 | </p><p>However, if a CPU is in dyntick-idle mode, it is in that mode | |
179 | for all flavors of RCU. | |
180 | Therefore, a single <tt>rcu_dynticks</tt> structure is allocated per | |
181 | CPU, and all of a given CPU's <tt>rcu_data</tt> structures share | |
182 | that <tt>rcu_dynticks</tt>, as shown in the figure. | |
183 | ||
184 | </p><p>Kernels built with <tt>CONFIG_PREEMPT_RCU</tt> support | |
185 | rcu_preempt in addition to rcu_sched and rcu_bh, as shown below: | |
186 | ||
187 | </p><p><img src="BigTreePreemptRCUBHdyntick.svg" alt="BigTreePreemptRCUBHdyntick.svg" width="35%"> | |
188 | ||
189 | </p><p>RCU updaters wait for normal grace periods by registering | |
190 | RCU callbacks, either directly via <tt>call_rcu()</tt> and | |
191 | friends (namely <tt>call_rcu_bh()</tt> and <tt>call_rcu_sched()</tt>, | |
192 | there being a separate interface per flavor of RCU) | |
193 | or indirectly via <tt>synchronize_rcu()</tt> and friends. | |
194 | RCU callbacks are represented by <tt>rcu_head</tt> structures, | |
195 | which are queued on <tt>rcu_data</tt> structures while they are | |
196 | waiting for a grace period to elapse, as shown in the following figure: | |
197 | ||
198 | </p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%"> | |
199 | ||
200 | </p><p>This figure shows how <tt>TREE_RCU</tt>'s and | |
201 | <tt>PREEMPT_RCU</tt>'s major data structures are related. | |
202 | Lesser data structures will be introduced with the algorithms that | |
203 | make use of them. | |
204 | ||
205 | </p><p>Note that each of the data structures in the above figure has | |
206 | its own synchronization: | |
207 | ||
208 | <p><ol> | |
209 | <li>	Each <tt>rcu_state</tt> structure has a lock and a mutex, | |
210 | and some fields are protected by the corresponding root | |
211 | <tt>rcu_node</tt> structure's lock. | |
212 | <li> Each <tt>rcu_node</tt> structure has a spinlock. | |
213 | <li> The fields in <tt>rcu_data</tt> are private to the corresponding | |
214 | CPU, although a few can be read and written by other CPUs. | |
215 | <li> Similarly, the fields in <tt>rcu_dynticks</tt> are private | |
216 | to the corresponding CPU, although a few can be read by | |
217 | other CPUs. | |
218 | </ol> | |
219 | ||
220 | <p>It is important to note that different data structures can have | |
221 | very different ideas about the state of RCU at any given time. | |
222 | For but one example, awareness of the start or end of a given RCU | |
223 | grace period propagates slowly through the data structures. | |
224 | This slow propagation is absolutely necessary for RCU to have good | |
225 | read-side performance. | |
226 | If this balkanized implementation seems foreign to you, one useful | |
227 | trick is to consider each instance of these data structures to be | |
228 | a different person, each having the usual slightly different | |
229 | view of reality. | |
230 | ||
231 | </p><p>The general role of each of these data structures is as | |
232 | follows: | |
233 | ||
234 | </p><ol> | |
235 | <li> <tt>rcu_state</tt>: | |
236 | This structure forms the interconnection between the | |
237 | <tt>rcu_node</tt> and <tt>rcu_data</tt> structures, | |
238 | tracks grace periods, serves as short-term repository | |
239 | for callbacks orphaned by CPU-hotplug events, | |
240 | maintains <tt>rcu_barrier()</tt> state, | |
241 | tracks expedited grace-period state, | |
242 | and maintains state used to force quiescent states when | |
243 | grace periods extend too long. | |
244 | <li> <tt>rcu_node</tt>: This structure forms the combining | |
245 | tree that propagates quiescent-state | |
246 | information from the leaves to the root, and also propagates | |
247 | grace-period information from the root to the leaves. | |
248 | It provides local copies of the grace-period state in order | |
249 | to allow this information to be accessed in a synchronized | |
250 | manner without suffering the scalability limitations that | |
251 | would otherwise be imposed by global locking. | |
252 | In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists | |
253 | of tasks that have blocked while in their current | |
254 | RCU read-side critical section. | |
255 | In <tt>CONFIG_PREEMPT_RCU</tt> with | |
256 | <tt>CONFIG_RCU_BOOST</tt>, it manages the | |
257 | per-<tt>rcu_node</tt> priority-boosting | |
258 | kernel threads (kthreads) and state. | |
259 | Finally, it records CPU-hotplug state in order to determine | |
260 | which CPUs should be ignored during a given grace period. | |
261 | <li> <tt>rcu_data</tt>: This per-CPU structure is the | |
262 | focus of quiescent-state detection and RCU callback queuing. | |
263 | It also tracks its relationship to the corresponding leaf | |
264 | <tt>rcu_node</tt> structure to allow more-efficient | |
265 | propagation of quiescent states up the <tt>rcu_node</tt> | |
266 | combining tree. | |
267 | Like the <tt>rcu_node</tt> structure, it provides a local | |
268 | copy of the grace-period information to allow for-free | |
269 | synchronized | |
270 | access to this information from the corresponding CPU. | |
271 | Finally, this structure records past dyntick-idle state | |
272 | for the corresponding CPU and also tracks statistics. | |
273 | <li> <tt>rcu_dynticks</tt>: | |
274 | This per-CPU structure tracks the current dyntick-idle | |
275 | state for the corresponding CPU. | |
276 | Unlike the other three structures, the <tt>rcu_dynticks</tt> | |
277 | structure is not replicated per RCU flavor. | |
278 | <li> <tt>rcu_head</tt>: | |
279 | This structure represents RCU callbacks, and is the | |
280 | only structure allocated and managed by RCU users. | |
281 | The <tt>rcu_head</tt> structure is normally embedded | |
282 | within the RCU-protected data structure. | |
283 | </ol> | |
284 | ||
285 | <p>If all you wanted from this article was a general notion of how | |
286 | RCU's data structures are related, you are done. | |
287 | Otherwise, each of the following sections gives more details on | |
288 | the <tt>rcu_state</tt>, <tt>rcu_node</tt>, <tt>rcu_data</tt>, | |
289 | and <tt>rcu_dynticks</tt> data structures. | |
290 | ||
291 | <h3><a name="The rcu_state Structure"> | |
292 | The <tt>rcu_state</tt> Structure</a></h3> | |
293 | ||
294 | <p>The <tt>rcu_state</tt> structure is the base structure that | |
295 | represents a flavor of RCU. | |
296 | This structure forms the interconnection between the | |
297 | <tt>rcu_node</tt> and <tt>rcu_data</tt> structures, | |
298 | tracks grace periods, contains the lock used to | |
299 | synchronize with CPU-hotplug events, | |
300 | and maintains state used to force quiescent states when | |
301 | grace periods extend too long. | |
302 | ||
303 | </p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed, | |
304 | singly and in groups, in the following sections. | |
305 | The more specialized fields are covered in the discussion of their | |
306 | use. | |
307 | ||
308 | <h5>Relationship to rcu_node and rcu_data Structures</h5> | |
309 | ||
310 | This portion of the <tt>rcu_state</tt> structure is declared | |
311 | as follows: | |
312 | ||
313 | <pre> | |
314 | 1 struct rcu_node node[NUM_RCU_NODES]; | |
315 | 2 struct rcu_node *level[NUM_RCU_LVLS + 1]; | |
316 | 3 struct rcu_data __percpu *rda; | |
317 | </pre> | |
318 | ||
319 | <table> | |
320 | <tr><th> </th></tr> | |
321 | <tr><th align="left">Quick Quiz:</th></tr> | |
322 | <tr><td> | |
323 | Wait a minute! | |
324 | You said that the <tt>rcu_node</tt> structures formed a tree, | |
325 | but they are declared as a flat array! | |
326 | What gives? | |
327 | </td></tr> | |
328 | <tr><th align="left">Answer:</th></tr> | |
329 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
330 | The tree is laid out in the array. | |
331 | The first node in the array is the head, the next set of nodes in the | |
332 | array are children of the head node, and so on until the last set of | |
333 | nodes in the array are the leaves. | |
334 | </font> | |
335 | ||
336 | <p><font color="ffffff">See the following diagrams to see how | |
337 | this works. | |
338 | </font></td></tr> | |
339 | <tr><td> </td></tr> | |
340 | </table> | |
341 | ||
342 | <p>The <tt>rcu_node</tt> tree is embedded into the | |
343 | <tt>->node[]</tt> array as shown in the following figure: | |
344 | ||
345 | </p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%"> | |
346 | ||
347 | </p><p>One interesting consequence of this mapping is that a | |
348 | breadth-first traversal of the tree is implemented as a simple | |
349 | linear scan of the array, which is in fact what the | |
350 | <tt>rcu_for_each_node_breadth_first()</tt> macro does. | |
351 | This macro is used at the beginnings and ends of grace periods. | |
352 | ||
353 | </p><p>Each entry of the <tt>->level</tt> array references | |
354 | the first <tt>rcu_node</tt> structure on the corresponding level | |
355 | of the tree, for example, as shown below: | |
356 | ||
357 | </p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%"> | |
358 | ||
359 | </p><p>The zero<sup>th</sup> element of the array references the root | |
360 | <tt>rcu_node</tt> structure, the first element references the | |
361 | first child of the root <tt>rcu_node</tt>, and finally the second | |
362 | element references the first leaf <tt>rcu_node</tt> structure. | |
363 | ||
364 | </p><p>For whatever it is worth, if you draw the tree to be tree-shaped | |
365 | rather than array-shaped, it is easy to draw a planar representation: | |
366 | ||
367 | </p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%"> | |
368 | ||
369 | </p><p>Finally, the <tt>->rda</tt> field references a per-CPU | |
370 | pointer to the corresponding CPU's <tt>rcu_data</tt> structure. | |
371 | ||
372 | </p><p>All of these fields are constant once initialization is complete, | |
373 | and therefore need no protection. | |
374 | ||
375 | <h5>Grace-Period Tracking</h5> | |
376 | ||
377 | <p>This portion of the <tt>rcu_state</tt> structure is declared | |
378 | as follows: | |
379 | ||
380 | <pre> | |
381 | 1 unsigned long gpnum; | |
382 | 2 unsigned long completed; | |
383 | </pre> | |
384 | ||
385 | <p>RCU grace periods are numbered, and | |
386 | the <tt>->gpnum</tt> field contains the number of the grace | |
387 | period that started most recently. | |
388 | The <tt>->completed</tt> field contains the number of the | |
389 | grace period that completed most recently. | |
390 | If the two fields are equal, the RCU grace period that most recently | |
391 | started has already completed, and therefore the corresponding | |
392 | flavor of RCU is idle. | |
393 | If <tt>->gpnum</tt> is one greater than <tt>->completed</tt>, | |
394 | then <tt>->gpnum</tt> gives the number of the current RCU | |
395 | grace period, which has not yet completed. | |
396 | Any other combination of values indicates that something is broken. | |
397 | These two fields are protected by the root <tt>rcu_node</tt>'s | |
398 | <tt>->lock</tt> field. | |
399 | ||
400 | </p><p>There are <tt>->gpnum</tt> and <tt>->completed</tt> fields | |
401 | in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures | |
402 | as well. | |
403 | The fields in the <tt>rcu_state</tt> structure represent the | |
404 | most current values, and those of the other structures are compared | |
405 | in order to detect the start of a new grace period in a distributed | |
406 | fashion. | |
407 | The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt> | |
408 | (down the tree from the root to the leaves) to <tt>rcu_data</tt>. | |
409 | ||
410 | <h5>Miscellaneous</h5> | |
411 | ||
412 | <p>This portion of the <tt>rcu_state</tt> structure is declared | |
413 | as follows: | |
414 | ||
415 | <pre> | |
416 | 1 unsigned long gp_max; | |
417 | 2 char abbr; | |
418 | 3 char *name; | |
419 | </pre> | |
420 | ||
421 | <p>The <tt>->gp_max</tt> field tracks the duration of the longest | |
422 | grace period in jiffies. | |
423 | It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>. | |
424 | ||
425 | <p>The <tt>->name</tt> field points to the name of the RCU flavor | |
426 | (for example, “rcu_sched”), and is constant. | |
427 | The <tt>->abbr</tt> field contains a one-character abbreviation, | |
428 | for example, “s” for RCU-sched. | |
429 | ||
430 | <h3><a name="The rcu_node Structure"> | |
431 | The <tt>rcu_node</tt> Structure</a></h3> | |
432 | ||
433 | <p>The <tt>rcu_node</tt> structures form the combining | |
434 | tree that propagates quiescent-state | |
435 | information from the leaves to the root and also that propagates | |
436 | grace-period information from the root down to the leaves. | |
437 | They provide local copies of the grace-period state in order | |
438 | to allow this information to be accessed in a synchronized | |
439 | manner without suffering the scalability limitations that | |
440 | would otherwise be imposed by global locking. | |
441 | In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists | |
442 | of tasks that have blocked while in their current | |
443 | RCU read-side critical section. | |
444 | In <tt>CONFIG_PREEMPT_RCU</tt> with | |
445 | <tt>CONFIG_RCU_BOOST</tt>, they manage the | |
446 | per-<tt>rcu_node</tt> priority-boosting | |
447 | kernel threads (kthreads) and state. | |
448 | Finally, they record CPU-hotplug state in order to determine | |
449 | which CPUs should be ignored during a given grace period. | |
450 | ||
451 | </p><p>The <tt>rcu_node</tt> structure's fields are discussed, | |
452 | singly and in groups, in the following sections. | |
453 | ||
454 | <h5>Connection to Combining Tree</h5> | |
455 | ||
456 | <p>This portion of the <tt>rcu_node</tt> structure is declared | |
457 | as follows: | |
458 | ||
459 | <pre> | |
460 | 1 struct rcu_node *parent; | |
461 | 2 u8 level; | |
462 | 3 u8 grpnum; | |
463 | 4 unsigned long grpmask; | |
464 | 5 int grplo; | |
465 | 6 int grphi; | |
466 | </pre> | |
467 | ||
468 | <p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt> | |
469 | one level up in the tree, and is <tt>NULL</tt> for the root | |
470 | <tt>rcu_node</tt>. | |
471 | The RCU implementation makes heavy use of this field to push quiescent | |
472 | states up the tree. | |
473 | The <tt>->level</tt> field gives the level in the tree, with | |
474 | the root being at level zero, its children at level one, and so on. | |
475 | The <tt>->grpnum</tt> field gives this node's position within | |
476 | the children of its parent, so this number can range between 0 and 31 | |
477 | on 32-bit systems and between 0 and 63 on 64-bit systems. | |
478 | The <tt>->level</tt> and <tt>->grpnum</tt> fields are | |
479 | used only during initialization and for tracing. | |
480 | The <tt>->grpmask</tt> field is the bitmask counterpart of | |
481 | <tt>->grpnum</tt>, and therefore always has exactly one bit set. | |
482 | This mask is used to clear the bit corresponding to this <tt>rcu_node</tt> | |
483 | structure in its parent's bitmasks, which are described later. | |
484 | Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields | |
485 | contain the lowest and highest numbered CPU served by this | |
486 | <tt>rcu_node</tt> structure, respectively. | |
487 | ||
488 | </p><p>All of these fields are constant, and thus do not require any | |
489 | synchronization. | |
490 | ||
491 | <h5>Synchronization</h5> | |
492 | ||
493 | <p>This field of the <tt>rcu_node</tt> structure is declared | |
494 | as follows: | |
495 | ||
496 | <pre> | |
497 | 1 raw_spinlock_t lock; | |
498 | </pre> | |
499 | ||
500 | <p>This field is used to protect the remaining fields in this structure, | |
501 | unless otherwise stated. | |
502 | That said, all of the fields in this structure can be accessed without | |
503 | locking for tracing purposes. | |
504 | Yes, this can result in confusing traces, but better some tracing confusion | |
505 | than to be heisenbugged out of existence. | |
506 | ||
507 | <h5>Grace-Period Tracking</h5> | |
508 | ||
509 | <p>This portion of the <tt>rcu_node</tt> structure is declared | |
510 | as follows: | |
511 | ||
512 | <pre> | |
513 | 1 unsigned long gpnum; | |
514 | 2 unsigned long completed; | |
515 | </pre> | |
516 | ||
517 | <p>These fields are the counterparts of the fields of the same name in | |
518 | the <tt>rcu_state</tt> structure. | |
519 | They each may lag up to one behind their <tt>rcu_state</tt> | |
520 | counterparts. | |
521 | If a given <tt>rcu_node</tt> structure's <tt>->gpnum</tt> and | |
522 | <tt>->completed</tt> fields are equal, then this <tt>rcu_node</tt> | |
523 | structure believes that RCU is idle. | |
524 | Otherwise, as with the <tt>rcu_state</tt> structure, | |
525 | the <tt>->gpnum</tt> field will be one greater than the | |
526 | <tt>->completed</tt> field, with <tt>->gpnum</tt> | |
527 | indicating which grace period this <tt>rcu_node</tt> believes | |
528 | is still being waited for. | |
529 | ||
530 | </p><p>The <tt>->gpnum</tt> field of each <tt>rcu_node</tt> | |
531 | structure is updated at the beginning | |
532 | of each grace period, and the <tt>->completed</tt> fields are | |
533 | updated at the end of each grace period. | |
534 | ||
535 | <h5>Quiescent-State Tracking</h5> | |
536 | ||
537 | <p>These fields manage the propagation of quiescent states up the | |
538 | combining tree. | |
539 | ||
540 | </p><p>This portion of the <tt>rcu_node</tt> structure has fields | |
541 | as follows: | |
542 | ||
543 | <pre> | |
544 | 1 unsigned long qsmask; | |
545 | 2 unsigned long expmask; | |
546 | 3 unsigned long qsmaskinit; | |
547 | 4 unsigned long expmaskinit; | |
548 | </pre> | |
549 | ||
550 | <p>The <tt>->qsmask</tt> field tracks which of this | |
551 | <tt>rcu_node</tt> structure's children still need to report | |
552 | quiescent states for the current normal grace period. | |
553 | Such children will have a value of 1 in their corresponding bit. | |
554 | Note that the leaf <tt>rcu_node</tt> structures should be | |
555 | thought of as having <tt>rcu_data</tt> structures as their | |
556 | children. | |
557 | Similarly, the <tt>->expmask</tt> field tracks which | |
558 | of this <tt>rcu_node</tt> structure's children still need to report | |
559 | quiescent states for the current expedited grace period. | |
560 | An expedited grace period has | |
561 | the same conceptual properties as a normal grace period, but the | |
562 | expedited implementation accepts extreme CPU overhead to obtain | |
563 | much lower grace-period latency, for example, consuming a few | |
564 | tens of microseconds worth of CPU time to reduce grace-period | |
565 | duration from milliseconds to tens of microseconds. | |
566 | The <tt>->qsmaskinit</tt> field tracks which of this | |
567 | <tt>rcu_node</tt> structure's children cover for at least | |
568 | one online CPU. | |
569 | This mask is used to initialize <tt>->qsmask</tt> at the | |
570 | beginning of each normal grace period, and similarly | |
571 | <tt>->expmaskinit</tt> is used to initialize <tt>->expmask</tt> | |
572 | at the beginning of each expedited grace period. | |
573 | ||
574 | <table> | |
575 | <tr><th> </th></tr> | |
576 | <tr><th align="left">Quick Quiz:</th></tr> | |
577 | <tr><td> | |
578 | Why are these bitmasks protected by locking? | |
579 | Come on, haven't you heard of atomic instructions??? | |
580 | </td></tr> | |
581 | <tr><th align="left">Answer:</th></tr> | |
582 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
583 | Lockless grace-period computation! Such a tantalizing possibility! | |
584 | </font> | |
585 | ||
586 | <p><font color="ffffff">But consider the following sequence of events: | |
587 | </font> | |
588 | ||
589 | <ol> | |
590 | <li> <font color="ffffff">CPU 0 has been in dyntick-idle | |
591 | mode for quite some time. | |
592 | When it wakes up, it notices that the current RCU | |
593 | grace period needs it to report in, so it sets a | |
594 | flag where the scheduling clock interrupt will find it. | |
595 | </font><p> | |
596 | <li> <font color="ffffff">Meanwhile, CPU 1 is running | |
597 | <tt>force_quiescent_state()</tt>, | |
598 | and notices that CPU 0 has been in dyntick idle mode, | |
599 | which qualifies as an extended quiescent state. | |
600 | </font><p> | |
601 | <li> <font color="ffffff">CPU 0's scheduling clock | |
602 | interrupt fires in the | |
603 | middle of an RCU read-side critical section, and notices | |
604 | that the RCU core needs something, so commences RCU softirq | |
605 | processing. | |
606 | </font> | |
607 | <p> | |
608 | <li> <font color="ffffff">CPU 0's softirq handler | |
609 | executes and is just about ready | |
610 | to report its quiescent state up the <tt>rcu_node</tt> | |
611 | tree. | |
612 | </font><p> | |
613 | <li> <font color="ffffff">But CPU 1 beats it to the punch, | |
614 | completing the current | |
615 | grace period and starting a new one. | |
616 | </font><p> | |
617 | <li> <font color="ffffff">CPU 0 now reports its quiescent | |
618 | state for the wrong | |
619 | grace period. | |
620 | That grace period might now end before the RCU read-side | |
621 | critical section. | |
622 | If that happens, disaster will ensue. | |
623 | </font> | |
624 | </ol> | |
625 | ||
626 | <p><font color="ffffff">So the locking is absolutely required in | |
627 | order to coordinate | |
628 | clearing of the bits with the grace-period numbers in | |
629 | <tt>->gpnum</tt> and <tt>->completed</tt>. | |
630 | </font></td></tr> | |
631 | <tr><td> </td></tr> | |
632 | </table> | |
633 | ||
634 | <h5>Blocked-Task Management</h5> | |
635 | ||
636 | <p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the | |
637 | midst of their RCU read-side critical sections, and these tasks | |
638 | must be tracked explicitly. | |
639 | The details of exactly why and how they are tracked will be covered | |
640 | in a separate article on RCU read-side processing. | |
641 | For now, it is enough to know that the <tt>rcu_node</tt> | |
642 | structure tracks them. | |
643 | ||
644 | <pre> | |
645 | 1 struct list_head blkd_tasks; | |
646 | 2 struct list_head *gp_tasks; | |
647 | 3 struct list_head *exp_tasks; | |
648 | 4 bool wait_blkd_tasks; | |
649 | </pre> | |
650 | ||
651 | <p>The <tt>->blkd_tasks</tt> field is a list header for | |
652 | the list of blocked and preempted tasks. | |
653 | As tasks undergo context switches within RCU read-side critical | |
654 | sections, their <tt>task_struct</tt> structures are enqueued | |
655 | (via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt> | |
656 | field) onto the head of the <tt>->blkd_tasks</tt> list for the | |
657 | leaf <tt>rcu_node</tt> structure corresponding to the CPU | |
658 | on which the outgoing context switch executed. | |
659 | As these tasks later exit their RCU read-side critical sections, | |
660 | they remove themselves from the list. | |
661 | This list is therefore in reverse time order, so that if one of the tasks | |
662 | is blocking the current grace period, all subsequent tasks must | |
663 | also be blocking that same grace period. | |
664 | Therefore, a single pointer into this list suffices to track | |
665 | all tasks blocking a given grace period. | |
666 | That pointer is stored in <tt>->gp_tasks</tt> for normal | |
667 | grace periods and in <tt>->exp_tasks</tt> for expedited | |
668 | grace periods. | |
669 | These last two fields are <tt>NULL</tt> if either there is | |
670 | no grace period in flight or if there are no blocked tasks | |
671 | preventing that grace period from completing. | |
672 | If either of these two pointers is referencing a task that | |
673 | removes itself from the <tt>->blkd_tasks</tt> list, | |
674 | then that task must advance the pointer to the next task on | |
675 | the list, or set the pointer to <tt>NULL</tt> if there | |
676 | are no subsequent tasks on the list. | |
677 | ||
678 | </p><p>For example, suppose that tasks T1, T2, and T3 are | |
679 | all hard-affinitied to the largest-numbered CPU in the system. | |
680 | Then if task T1 blocked in an RCU read-side | |
681 | critical section, then an expedited grace period started, | |
682 | then task T2 blocked in an RCU read-side critical section, | |
683 | then a normal grace period started, and finally task T3 blocked | |
684 | in an RCU read-side critical section, then the state of the | |
685 | last leaf <tt>rcu_node</tt> structure's blocked-task list | |
686 | would be as shown below: | |
687 | ||
688 | </p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%"> | |
689 | ||
690 | </p><p>Task T1 is blocking both grace periods, task T2 is | |
691 | blocking only the normal grace period, and task T3 is blocking | |
692 | neither grace period. | |
693 | Note that these tasks will not remove themselves from this list | |
694 | immediately upon resuming execution. | |
695 | They will instead remain on the list until they execute the outermost | |
696 | <tt>rcu_read_unlock()</tt> that ends their RCU read-side critical | |
697 | section. | |
698 | ||
699 | <p> | |
700 | The <tt>->wait_blkd_tasks</tt> field indicates whether or not | |
701 | the current grace period is waiting on a blocked task. | |
702 | ||
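<p>As a concrete illustration, the following user-space sketch (not kernel code; the types and helper names are invented for this example and merely mirror the fields described above) implements the enqueue-at-head and pointer-advancement rules for the blocked-task list:

```c
#include <assert.h>
#include <stddef.h>

/* User-space sketch of the ->blkd_tasks list discipline.
 * Names mirror the kernel's fields, but this is not kernel code. */
struct task {
	struct task *next, *prev;	/* stands in for ->rcu_node_entry */
	const char *name;
};

struct rnp {				/* stands in for a leaf rcu_node */
	struct task blkd_tasks;		/* list header, newest task at head */
	struct task *gp_tasks;		/* oldest-to-newest boundary for normal GP */
	struct task *exp_tasks;		/* same, for expedited GP */
};

static void rnp_init(struct rnp *rnp)
{
	rnp->blkd_tasks.next = rnp->blkd_tasks.prev = &rnp->blkd_tasks;
	rnp->gp_tasks = rnp->exp_tasks = NULL;
}

/* Task blocks in an RCU read-side critical section: enqueue at head. */
static void block_task(struct rnp *rnp, struct task *t)
{
	t->next = rnp->blkd_tasks.next;
	t->prev = &rnp->blkd_tasks;
	t->next->prev = t;
	rnp->blkd_tasks.next = t;
}

/* Task exits its outermost critical section: dequeue, advancing any
 * pointer that references it (NULL if it was last on the list). */
static void unblock_task(struct rnp *rnp, struct task *t)
{
	struct task *succ = t->next == &rnp->blkd_tasks ? NULL : t->next;

	if (rnp->gp_tasks == t)
		rnp->gp_tasks = succ;
	if (rnp->exp_tasks == t)
		rnp->exp_tasks = succ;
	t->prev->next = t->next;
	t->next->prev = t->prev;
}
```

<p>Because the list is kept in reverse time order, advancing toward the tail always moves <tt>->gp_tasks</tt> or <tt>->exp_tasks</tt> to an older task that must still be blocking the same grace period.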
703 | <h5>Sizing the <tt>rcu_node</tt> Array</h5> | |
704 | ||
705 | <p>The <tt>rcu_node</tt> array is sized via a series of | |
706 | C-preprocessor expressions as follows: | |
707 | ||
708 | <pre> | |
709 | 1 #ifdef CONFIG_RCU_FANOUT | |
710 | 2 #define RCU_FANOUT CONFIG_RCU_FANOUT | |
711 | 3 #else | |
712 | 4 # ifdef CONFIG_64BIT | |
713 | 5 # define RCU_FANOUT 64 | |
714 | 6 # else | |
715 | 7 # define RCU_FANOUT 32 | |
716 | 8 # endif | |
717 | 9 #endif | |
718 | 10 | |
719 | 11 #ifdef CONFIG_RCU_FANOUT_LEAF | |
720 | 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF | |
721 | 13 #else | |
722 | 14 # ifdef CONFIG_64BIT | |
723 | 15 # define RCU_FANOUT_LEAF 64 | |
724 | 16 # else | |
725 | 17 # define RCU_FANOUT_LEAF 32 | |
726 | 18 # endif | |
727 | 19 #endif | |
728 | 20 | |
729 | 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) | |
730 | 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) | |
731 | 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) | |
732 | 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) | |
733 | 25 | |
734 | 26 #if NR_CPUS <= RCU_FANOUT_1 | |
735 | 27 # define RCU_NUM_LVLS 1 | |
736 | 28 # define NUM_RCU_LVL_0 1 | |
737 | 29 # define NUM_RCU_NODES NUM_RCU_LVL_0 | |
738 | 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } | |
739 | 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } | |
740 | 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } | |
741 | 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } | |
742 | 34 #elif NR_CPUS <= RCU_FANOUT_2 | |
743 | 35 # define RCU_NUM_LVLS 2 | |
744 | 36 # define NUM_RCU_LVL_0 1 | |
745 | 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) | |
746 | 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) | |
747 | 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } | |
748 | 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } | |
749 | 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } | |
750 | 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } | |
751 | 43 #elif NR_CPUS <= RCU_FANOUT_3 | |
752 | 44 # define RCU_NUM_LVLS 3 | |
753 | 45 # define NUM_RCU_LVL_0 1 | |
754 | 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) | |
755 | 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) | |
756 | 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) | |
757 | 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } | |
758 | 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } | |
759 | 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } | |
760 | 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } | |
761 | 53 #elif NR_CPUS <= RCU_FANOUT_4 | |
762 | 54 # define RCU_NUM_LVLS 4 | |
763 | 55 # define NUM_RCU_LVL_0 1 | |
764 | 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) | |
765 | 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) | |
766 | 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) | |
767 | 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) | |
768 | 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } | |
769 | 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } | |
770 | 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } | |
771 | 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } | |
772 | 64 #else | |
773 | 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" | |
774 | 66 #endif | |
775 | </pre> | |
776 | ||
777 | <p>The maximum number of levels in the <tt>rcu_node</tt> structure | |
778 | is currently limited to four, as specified by lines 21-24 | |
779 | and the structure of the subsequent “if” statement. | |
780 | For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which | |
781 | should be sufficient for the next few years at least. | |
782 | For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which | |
783 | should see us through the next decade or so. | |
784 | This four-level tree also allows kernels built with | |
785 | <tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs, | |
786 | which might be useful in very large systems having eight CPUs per | |
787 | socket (but please note that no one has yet shown any measurable | |
788 | performance degradation due to misaligned socket and <tt>rcu_node</tt> | |
789 | boundaries). | |
790 | In addition, building kernels with a full four levels of <tt>rcu_node</tt> | |
791 | tree permits better testing of RCU's combining-tree code. | |
792 | ||
793 | </p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children | |
794 | are permitted at each non-leaf level of the <tt>rcu_node</tt> tree. | |
795 | If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified, | |
796 | it is set based on the word size of the system, which is also | |
797 | the Kconfig default. | |
798 | ||
799 | </p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are | |
800 | handled by each leaf <tt>rcu_node</tt> structure. | |
801 | Experience has shown that allowing a given leaf <tt>rcu_node</tt> | |
802 | structure to handle 64 CPUs, as permitted by the number of bits in | |
803 | the <tt>->qsmask</tt> field on a 64-bit system, results in | |
804 | excessive contention for the leaf <tt>rcu_node</tt> structures' | |
805 | <tt>->lock</tt> fields. | |
806 | The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore | |
807 | limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>. | |
808 | If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value | |
809 | selected is based on the word size of the system, just as for | |
810 | <tt>CONFIG_RCU_FANOUT</tt>. | |
811 | Lines 11-19 perform this computation. | |
812 | ||
813 | </p><p>Lines 21-24 compute the maximum number of CPUs supported by | |
814 | a single-level (which contains a single <tt>rcu_node</tt> structure), | |
815 | two-level, three-level, and four-level <tt>rcu_node</tt> tree, | |
816 | respectively, given the fanout specified by <tt>RCU_FANOUT</tt> | |
817 | and <tt>RCU_FANOUT_LEAF</tt>. | |
818 | These numbers of CPUs are retained in the | |
819 | <tt>RCU_FANOUT_1</tt>, | |
820 | <tt>RCU_FANOUT_2</tt>, | |
821 | <tt>RCU_FANOUT_3</tt>, and | |
822 | <tt>RCU_FANOUT_4</tt> | |
823 | C-preprocessor variables, respectively. | |
824 | ||
825 | </p><p>These variables are used to control the C-preprocessor <tt>#if</tt> | |
826 | statement spanning lines 26-66 that computes the number of | |
827 | <tt>rcu_node</tt> structures required for each level of the tree, | |
828 | as well as the number of levels required. | |
829 | The number of levels is placed in the <tt>NUM_RCU_LVLS</tt> | |
830 | C-preprocessor variable by lines 27, 35, 44, and 54. | |
831 | The number of <tt>rcu_node</tt> structures for the topmost level | |
832 | of the tree is always exactly one, and this value is unconditionally | |
833 | placed into <tt>NUM_RCU_LVL_0</tt> by lines 28, 36, 45, and 55. | |
834 | The rest of the levels (if any) of the <tt>rcu_node</tt> tree | |
835 | are computed by dividing the maximum number of CPUs by the | |
836 | fanout supported by the number of levels from the current level down, | |
837 | rounding up. This computation is performed by lines 37, | |
838 | 46-47, and 56-58. | |
839 | Lines 31-33, 40-42, 50-52, and 61-63 create initializers | |
840 | for lockdep lock-class names. | |
841 | Finally, lines 64-66 produce an error if the maximum number of | |
842 | CPUs is too large for the specified fanout. | |
843 | ||
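<p>The preprocessor ladder above can be mimicked at run time. The following sketch (illustrative only: the kernel performs this computation entirely at compile time, and the function name is invented) computes the number of levels and the per-level node counts for a given configuration:

```c
#include <assert.h>

/* Run-time analogue of the compile-time sizing expressions above. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Returns the number of tree levels needed, filling num[] with the
 * number of rcu_node structures at each level (num[0] is the root),
 * or -1 if the fanout is insufficient for nr_cpus. */
static int rcu_tree_shape(long nr_cpus, long fanout, long fanout_leaf,
			  long num[4])
{
	long cap[4];			/* RCU_FANOUT_1 .. RCU_FANOUT_4 */
	int lvls, i;

	cap[0] = fanout_leaf;
	for (i = 1; i < 4; i++)
		cap[i] = cap[i - 1] * fanout;
	for (lvls = 1; lvls <= 4; lvls++)
		if (nr_cpus <= cap[lvls - 1])
			break;
	if (lvls > 4)
		return -1;		/* "insufficient for NR_CPUS" */
	num[0] = 1;			/* exactly one root rcu_node */
	for (i = 1; i < lvls; i++)
		num[i] = DIV_ROUND_UP(nr_cpus, cap[lvls - 1 - i]);
	return lvls;
}
```

<p>With the 64-bit defaults (<tt>RCU_FANOUT=64</tt>, <tt>RCU_FANOUT_LEAF=16</tt>), a 64-CPU system thus gets a two-level tree with one root and four leaves, and the four-level limit is reached at 16*64*64*64=4,194,304 CPUs.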
844 | <h3><a name="The rcu_data Structure"> | |
845 | The <tt>rcu_data</tt> Structure</a></h3> | |
846 | ||
847 | <p>The <tt>rcu_data</tt> structure maintains the per-CPU state for the | |
848 | corresponding flavor of RCU. | |
849 | The fields in this structure may be accessed only from the corresponding | |
850 | CPU (and from tracing) unless otherwise stated. | |
851 | This structure is the | |
852 | focus of quiescent-state detection and RCU callback queuing. | |
853 | It also tracks its relationship to the corresponding leaf | |
854 | <tt>rcu_node</tt> structure to allow more-efficient | |
855 | propagation of quiescent states up the <tt>rcu_node</tt> | |
856 | combining tree. | |
857 | Like the <tt>rcu_node</tt> structure, it provides a local | |
858 | copy of the grace-period information to allow | |
859 | synchronization-free | |
860 | access to this information from the corresponding CPU. | |
861 | Finally, this structure records past dyntick-idle state | |
862 | for the corresponding CPU and also tracks statistics. | |
863 | ||
864 | </p><p>The <tt>rcu_data</tt> structure's fields are discussed, | |
865 | singly and in groups, in the following sections. | |
866 | ||
867 | <h5>Connection to Other Data Structures</h5> | |
868 | ||
869 | <p>This portion of the <tt>rcu_data</tt> structure is declared | |
870 | as follows: | |
871 | ||
872 | <pre> | |
873 | 1 int cpu; | |
874 | 2 struct rcu_state *rsp; | |
875 | 3 struct rcu_node *mynode; | |
876 | 4 struct rcu_dynticks *dynticks; | |
877 | 5 unsigned long grpmask; | |
878 | 6 bool beenonline; | |
879 | </pre> | |
880 | ||
881 | <p>The <tt>->cpu</tt> field contains the number of the | |
882 | corresponding CPU, the <tt>->rsp</tt> pointer references | |
883 | the corresponding <tt>rcu_state</tt> structure (and is most frequently | |
884 | used to locate the name of the corresponding flavor of RCU for tracing), | |
885 | and the <tt>->mynode</tt> field references the corresponding | |
886 | <tt>rcu_node</tt> structure. | |
887 | The <tt>->mynode</tt> field is used to propagate quiescent states | |
888 | up the combining tree. | |
889 | <p>The <tt>->dynticks</tt> pointer references the | |
890 | <tt>rcu_dynticks</tt> structure corresponding to this | |
891 | CPU. | |
892 | Recall that a single per-CPU instance of the <tt>rcu_dynticks</tt> | |
893 | structure is shared among all flavors of RCU. | |
894 | These first four fields are constant and therefore require no | |
895 | synchronization. | |
896 | ||
897 | </p><p>The <tt>->grpmask</tt> field indicates the bit in | |
898 | the <tt>->mynode->qsmask</tt> corresponding to this | |
899 | <tt>rcu_data</tt> structure, and is also used when propagating | |
900 | quiescent states. | |
901 | The <tt>->beenonline</tt> flag is set whenever the corresponding | |
902 | CPU comes online, which means that the debugfs tracing need not dump | |
903 | out any <tt>rcu_data</tt> structure for which this flag is not set. | |
904 | ||
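<p>For illustration, here is a minimal sketch of the bit manipulation that <tt>->grpmask</tt> implies (assuming, as in the <tt>rcu_node</tt> discussion, that <tt>grplo</tt> is the lowest-numbered CPU covered by the leaf; the helper names are invented):

```c
#include <assert.h>

/* Sketch of how ->grpmask identifies this CPU's bit within its leaf
 * rcu_node structure's ->qsmask, where grplo is the lowest-numbered
 * CPU covered by that leaf. */
static unsigned long grpmask_for(int cpu, int grplo)
{
	return 1UL << (cpu - grplo);
}

/* Reporting a quiescent state clears this CPU's bit in ->qsmask. */
static unsigned long report_qs(unsigned long qsmask, unsigned long grpmask)
{
	return qsmask & ~grpmask;
}
```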
905 | <h5>Quiescent-State and Grace-Period Tracking</h5> | |
906 | ||
907 | <p>This portion of the <tt>rcu_data</tt> structure is declared | |
908 | as follows: | |
909 | ||
910 | <pre> | |
911 | 1 unsigned long completed; | |
912 | 2 unsigned long gpnum; | |
913 | 3 bool cpu_no_qs; | |
914 | 4 bool core_needs_qs; | |
915 | 5 bool gpwrap; | |
916 | 6 unsigned long rcu_qs_ctr_snap; | |
917 | </pre> | |
918 | ||
919 | <p>The <tt>completed</tt> and <tt>gpnum</tt> | |
920 | fields are the counterparts of the fields of the same name | |
921 | in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures. | |
922 | They may each lag up to one behind their <tt>rcu_node</tt> | |
923 | counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and | |
924 | <tt>CONFIG_NO_HZ_FULL</tt> kernels can lag | |
925 | arbitrarily far behind for CPUs in dyntick-idle mode (but these counters | |
926 | will catch up upon exit from dyntick-idle mode). | |
927 | If a given <tt>rcu_data</tt> structure's <tt>->gpnum</tt> and | |
928 | <tt>->completed</tt> fields are equal, then this <tt>rcu_data</tt> | |
929 | structure believes that RCU is idle. | |
930 | Otherwise, as with the <tt>rcu_state</tt> and <tt>rcu_node</tt> | |
931 | structures, | |
932 | the <tt>->gpnum</tt> field will be one greater than the | |
933 | <tt>->completed</tt> field, with <tt>->gpnum</tt> | |
934 | indicating which grace period this <tt>rcu_data</tt> believes | |
935 | is still being waited for. | |
936 | ||
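<p>The interpretation of this counter pair can be sketched as follows (a user-space illustration; the kernel applies the analogous test to the <tt>rcu_state</tt> structure in <tt>rcu_gp_in_progress()</tt>):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of how the paired grace-period numbers are interpreted. */
struct gp_view {
	unsigned long completed;	/* number of the last completed GP */
	unsigned long gpnum;		/* number of the current GP, if any */
};

/* RCU is believed idle when the two counters are equal; otherwise
 * ->gpnum is one greater and names the GP being waited for. */
static bool gp_in_progress(const struct gp_view *v)
{
	return v->gpnum != v->completed;
}
```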
937 | <table> | |
938 | <tr><th> </th></tr> | |
939 | <tr><th align="left">Quick Quiz:</th></tr> | |
940 | <tr><td> | |
941 | All this replication of the grace period numbers can only cause | |
942 | massive confusion. | |
943 | Why not just keep a global pair of counters and be done with it??? | |
944 | </td></tr> | |
945 | <tr><th align="left">Answer:</th></tr> | |
946 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
947 | Because if there were only a single global pair of grace-period | |
948 | numbers, there would need to be a single global lock to allow | |
949 | safely accessing and updating them. | |
950 | And if we are not going to have a single global lock, we need | |
951 | to carefully manage the numbers on a per-node basis. | |
952 | Recall from the answer to a previous Quick Quiz that the consequences | |
953 | of applying a previously sampled quiescent state to the wrong | |
954 | grace period are quite severe. | |
955 | </font></td></tr> | |
956 | <tr><td> </td></tr> | |
957 | </table> | |
958 | ||
959 | <p>The <tt>->cpu_no_qs</tt> flag indicates that the | |
960 | CPU has not yet passed through a quiescent state, | |
961 | while the <tt>->core_needs_qs</tt> flag indicates that the | |
962 | RCU core needs a quiescent state from the corresponding CPU. | |
963 | The <tt>->gpwrap</tt> field indicates that the corresponding | |
964 | CPU has remained idle for so long that the <tt>completed</tt> | |
965 | and <tt>gpnum</tt> counters are in danger of overflow, which | |
966 | will cause the CPU to disregard the values of its counters on | |
967 | its next exit from idle. | |
968 | Finally, the <tt>rcu_qs_ctr_snap</tt> field is used to detect | |
969 | cases where a given operation has resulted in a quiescent state | |
970 | for all flavors of RCU, for example, <tt>cond_resched_rcu_qs()</tt>. | |
971 | ||
972 | <h5>RCU Callback Handling</h5> | |
973 | ||
974 | <p>In the absence of CPU-hotplug events, RCU callbacks are invoked by | |
975 | the same CPU that registered them. | |
976 | This is strictly a cache-locality optimization: callbacks can and | |
977 | do get invoked on CPUs other than the one that registered them. | |
978 | After all, if the CPU that registered a given callback has gone | |
979 | offline before the callback can be invoked, there really is no other | |
980 | choice. | |
981 | ||
982 | </p><p>This portion of the <tt>rcu_data</tt> structure is declared | |
983 | as follows: | |
984 | ||
985 | <pre> | |
986 | 1 struct rcu_head *nxtlist; | |
987 | 2 struct rcu_head **nxttail[RCU_NEXT_SIZE]; | |
988 | 3 unsigned long nxtcompleted[RCU_NEXT_SIZE]; | |
989 | 4 long qlen_lazy; | |
990 | 5 long qlen; | |
991 | 6 long qlen_last_fqs_check; | |
992 | 7 unsigned long n_force_qs_snap; | |
993 | 8 unsigned long n_cbs_invoked; | |
994 | 9 unsigned long n_cbs_orphaned; | |
995 | 10 unsigned long n_cbs_adopted; | |
996 | 11 long blimit; | |
997 | </pre> | |
998 | ||
999 | <p>The <tt>->nxtlist</tt> pointer and the | |
1000 | <tt>->nxttail[]</tt> array form a four-segment list with | |
1001 | older callbacks near the head and newer ones near the tail. | |
1002 | Each segment contains callbacks with the corresponding relationship | |
1003 | to the current grace period. | |
1004 | The pointer out of the end of each of the four segments is referenced | |
1005 | by the element of the <tt>->nxttail[]</tt> array indexed by | |
1006 | <tt>RCU_DONE_TAIL</tt> (for callbacks handled by a prior grace period), | |
1007 | <tt>RCU_WAIT_TAIL</tt> (for callbacks waiting on the current grace period), | |
1008 | <tt>RCU_NEXT_READY_TAIL</tt> (for callbacks that will wait on the next | |
1009 | grace period), and | |
1010 | <tt>RCU_NEXT_TAIL</tt> (for callbacks that are not yet associated | |
1011 | with a specific grace period) | |
1012 | respectively, as shown in the following figure. | |
1013 | ||
1014 | </p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%"> | |
1015 | ||
1016 | </p><p>In this figure, the <tt>->nxtlist</tt> pointer references the | |
1017 | first | |
1018 | RCU callback in the list. | |
1019 | The <tt>->nxttail[RCU_DONE_TAIL]</tt> array element references | |
1020 | the <tt>->nxtlist</tt> pointer itself, indicating that none | |
1021 | of the callbacks is ready to invoke. | |
1022 | The <tt>->nxttail[RCU_WAIT_TAIL]</tt> array element references callback | |
1023 | CB 2's <tt>->next</tt> pointer, which indicates that | |
1024 | CB 1 and CB 2 are both waiting on the current grace period. | |
1025 | The <tt>->nxttail[RCU_NEXT_READY_TAIL]</tt> array element | |
1026 | references the same RCU callback that <tt>->nxttail[RCU_WAIT_TAIL]</tt> | |
1027 | does, which indicates that there are no callbacks waiting on the next | |
1028 | RCU grace period. | |
1029 | The <tt>->nxttail[RCU_NEXT_TAIL]</tt> array element references | |
1030 | CB 4's <tt>->next</tt> pointer, indicating that all the | |
1031 | remaining RCU callbacks have not yet been assigned to an RCU grace | |
1032 | period. | |
1033 | Note that the <tt>->nxttail[RCU_NEXT_TAIL]</tt> array element | |
1034 | always references the last RCU callback's <tt>->next</tt> pointer | |
1035 | unless the callback list is empty, in which case it references | |
1036 | the <tt>->nxtlist</tt> pointer. | |
1037 | ||
1038 | </p><p>CPUs advance their callbacks from the | |
1039 | <tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the | |
1040 | <tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments | |
1041 | as grace periods advance. | |
1042 | The CPU advances the callbacks in its <tt>rcu_data</tt> structure | |
1043 | whenever it notices that another RCU grace period has completed. | |
1044 | The CPU detects the completion of an RCU grace period by noticing | |
1045 | that the value of its <tt>rcu_data</tt> structure's | |
1046 | <tt>->completed</tt> field differs from that of its leaf | |
1047 | <tt>rcu_node</tt> structure. | |
1048 | Recall that each <tt>rcu_node</tt> structure's | |
1049 | <tt>->completed</tt> field is updated at the end of each | |
1050 | grace period. | |
1051 | ||
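<p>The segment manipulations just described can be sketched in user space as follows (an illustration only: the helper names are invented, and the real kernel code also maintains the <tt>->nxtcompleted[]</tt> numbers and handles many corner cases):

```c
#include <assert.h>
#include <stddef.h>

/* User-space sketch of the four-segment callback list. */
enum { RCU_DONE_TAIL, RCU_WAIT_TAIL, RCU_NEXT_READY_TAIL, RCU_NEXT_TAIL,
       RCU_NEXT_SIZE };

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

struct cblist {
	struct rcu_head *nxtlist;
	struct rcu_head **nxttail[RCU_NEXT_SIZE];
};

static void cblist_init(struct cblist *cl)
{
	int i;

	cl->nxtlist = NULL;
	for (i = 0; i < RCU_NEXT_SIZE; i++)
		cl->nxttail[i] = &cl->nxtlist;	/* empty list: all tails here */
}

/* New callbacks are not yet associated with a grace period, so they
 * are appended to the RCU_NEXT_TAIL segment. */
static void cblist_enqueue(struct cblist *cl, struct rcu_head *rhp)
{
	rhp->next = NULL;
	*cl->nxttail[RCU_NEXT_TAIL] = rhp;
	cl->nxttail[RCU_NEXT_TAIL] = &rhp->next;
}

/* Simplified advance: each segment absorbs its successor, so callbacks
 * migrate one segment toward RCU_DONE_TAIL per grace-period boundary. */
static void cblist_advance(struct cblist *cl)
{
	cl->nxttail[RCU_DONE_TAIL] = cl->nxttail[RCU_WAIT_TAIL];
	cl->nxttail[RCU_WAIT_TAIL] = cl->nxttail[RCU_NEXT_READY_TAIL];
	cl->nxttail[RCU_NEXT_READY_TAIL] = cl->nxttail[RCU_NEXT_TAIL];
}
```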
1052 | </p><p>The <tt>->nxtcompleted[]</tt> array records grace-period | |
1053 | numbers corresponding to the list segments. | |
1054 | This allows CPUs that go idle for extended periods to determine | |
1055 | which of their callbacks are ready to be invoked after reawakening. | |
1056 | ||
1057 | </p><p>The <tt>->qlen</tt> counter contains the number of | |
1058 | callbacks in <tt>->nxtlist</tt>, and the | |
1059 | <tt>->qlen_lazy</tt> counter contains the number of those callbacks that | |
1060 | are known to only free memory, and whose invocation can therefore | |
1061 | be safely deferred. | |
1062 | The <tt>->qlen_last_fqs_check</tt> and | |
1063 | <tt>->n_force_qs_snap</tt> coordinate the forcing of quiescent | |
1064 | states from <tt>call_rcu()</tt> and friends when callback | |
1065 | lists grow excessively long. | |
1066 | ||
1067 | </p><p>The <tt>->n_cbs_invoked</tt>, | |
1068 | <tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt> | |
1069 | fields count the number of callbacks invoked, | |
1070 | sent to other CPUs when this CPU goes offline, | |
1071 | and received from other CPUs when those other CPUs go offline. | |
1072 | Finally, the <tt>->blimit</tt> counter is the maximum number of | |
1073 | RCU callbacks that may be invoked at a given time. | |
1074 | ||
1075 | <h5>Dyntick-Idle Handling</h5> | |
1076 | ||
1077 | <p>This portion of the <tt>rcu_data</tt> structure is declared | |
1078 | as follows: | |
1079 | ||
1080 | <pre> | |
1081 | 1 int dynticks_snap; | |
1082 | 2 unsigned long dynticks_fqs; | |
1083 | </pre> | |
1084 | ||
1085 | <p>The <tt>->dynticks_snap</tt> field is used to take a snapshot | |
1086 | of the corresponding CPU's dyntick-idle state when forcing | |
1087 | quiescent states, and is therefore accessed from other CPUs. | |
1088 | Finally, the <tt>->dynticks_fqs</tt> field is used to | |
1089 | count the number of times this CPU is determined to be in | |
1090 | dyntick-idle state, and is used for tracing and debugging purposes. | |
1091 | ||
1092 | <h3><a name="The rcu_dynticks Structure"> | |
1093 | The <tt>rcu_dynticks</tt> Structure</a></h3> | |
1094 | ||
1095 | <p>The <tt>rcu_dynticks</tt> structure maintains the per-CPU dyntick-idle state | |
1096 | for the corresponding CPU. | |
1097 | Unlike the other structures, <tt>rcu_dynticks</tt> is not | |
1098 | replicated over the different flavors of RCU. | |
1099 | The fields in this structure may be accessed only from the corresponding | |
1100 | CPU (and from tracing) unless otherwise stated. | |
1101 | Its fields are as follows: | |
1102 | ||
1103 | <pre> | |
1104 | 1 int dynticks_nesting; | |
1105 | 2 int dynticks_nmi_nesting; | |
1106 | 3 atomic_t dynticks; | |
0f9be8ca | 1107 | 4 bool rcu_need_heavy_qs; |
9577df9a | 1108 | 5 unsigned long rcu_qs_ctr; |
9226b10d | 1109 | 6 bool rcu_urgent_qs; |
5c145847 PM |
1110 | </pre> |
1111 | ||
1112 | <p>The <tt>->dynticks_nesting</tt> field counts the | |
1113 | nesting depth of normal interrupts. | |
1114 | In addition, this counter is incremented when exiting dyntick-idle | |
1115 | mode and decremented when entering it. | |
1116 | This counter can therefore be thought of as counting the number | |
1117 | of reasons why this CPU cannot be permitted to enter dyntick-idle | |
1118 | mode, aside from non-maskable interrupts (NMIs). | |
1119 | NMIs are counted by the <tt>->dynticks_nmi_nesting</tt> | |
1120 | field, except that NMIs that interrupt non-dyntick-idle execution | |
1121 | are not counted. | |
1122 | ||
abb06b99 | 1123 | </p><p>The <tt>->dynticks</tt> field counts the corresponding |
5c145847 PM |
1124 | CPU's transitions to and from dyntick-idle mode, so that this counter |
1125 | has an even value when the CPU is in dyntick-idle mode and an odd | |
1126 | value otherwise. | |
1127 | ||
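<p>The even/odd convention can be illustrated with a single-threaded sketch (the kernel of course uses atomic operations with careful memory ordering; the helper names here are invented):

```c
#include <assert.h>
#include <stdbool.h>

/* Single-threaded sketch of the even/odd ->dynticks counter. */
struct dynticks {
	int dynticks;	/* even: CPU in dyntick-idle mode; odd: not */
};

static void dyntick_idle_enter(struct dynticks *dt)
{
	dt->dynticks++;			/* odd -> even */
	assert(!(dt->dynticks & 1));
}

static void dyntick_idle_exit(struct dynticks *dt)
{
	dt->dynticks++;			/* even -> odd */
	assert(dt->dynticks & 1);
}

/* A remote CPU forcing quiescent states snapshots the counter: if the
 * snapshot is even, or the counter has since changed, the CPU has been
 * through (or is in) dyntick-idle mode, hence a quiescent state. */
static bool in_quiescent_state(int snap, int cur)
{
	return !(snap & 1) || snap != cur;
}
```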
0f9be8ca | 1128 | </p><p>The <tt>->rcu_need_heavy_qs</tt> field is used |
abb06b99 | 1129 | to record the fact that the RCU core code would really like to |
9577df9a PM |
1130 | see a quiescent state from the corresponding CPU, so much so that |
1131 | it is willing to call for heavy-weight dyntick-counter operations. | |
abb06b99 PM |
1132 | This flag is checked by RCU's context-switch and <tt>cond_resched()</tt> |
1133 | code, which provide a momentary idle sojourn in response. | |
1134 | ||
9226b10d | 1135 | </p><p>The <tt>->rcu_qs_ctr</tt> field is used to record |
9577df9a PM |
1136 | quiescent states from <tt>cond_resched()</tt>. |
1137 | Because <tt>cond_resched()</tt> can execute quite frequently, this | |
1138 | must be quite lightweight, as in a non-atomic increment of this | |
1139 | per-CPU field. | |
1140 | ||
9226b10d PM |
1141 | </p><p>Finally, the <tt>->rcu_urgent_qs</tt> field is used to record |
1142 | the fact that the RCU core code would really like to see a quiescent | |
1143 | state from the corresponding CPU, with the various other fields indicating | |
1144 | just how badly RCU wants this quiescent state. | |
1145 | This flag is checked by RCU's context-switch and <tt>cond_resched()</tt> | |
1146 | code, which, if nothing else, non-atomically increment <tt>->rcu_qs_ctr</tt> | |
1147 | in response. | |
1148 | ||
5c145847 PM |
1149 | <table> |
1150 | <tr><th> </th></tr> | |
1151 | <tr><th align="left">Quick Quiz:</th></tr> | |
1152 | <tr><td> | |
1153 | Why not just count all NMIs? | |
1154 | Wouldn't that be simpler and less error prone? | |
1155 | </td></tr> | |
1156 | <tr><th align="left">Answer:</th></tr> | |
1157 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
1158 | It seems simpler only until you think hard about how to go about | |
1159 | updating the <tt>rcu_dynticks</tt> structure's | |
1160 | <tt>->dynticks</tt> field. | |
1161 | </font></td></tr> | |
1162 | <tr><td> </td></tr> | |
1163 | </table> | |
1164 | ||
1165 | <p>Additional fields are present for some special-purpose | |
1166 | builds, and are discussed separately. | |
1167 | ||
1168 | <h3><a name="The rcu_head Structure"> | |
1169 | The <tt>rcu_head</tt> Structure</a></h3> | |
1170 | ||
1171 | <p>Each <tt>rcu_head</tt> structure represents an RCU callback. | |
1172 | These structures are normally embedded within RCU-protected data | |
1173 | structures whose algorithms use asynchronous grace periods. | |
1174 | In contrast, when using algorithms that block waiting for RCU grace periods, | |
1175 | RCU users need not provide <tt>rcu_head</tt> structures. | |
1176 | ||
1177 | </p><p>The <tt>rcu_head</tt> structure has fields as follows: | |
1178 | ||
1179 | <pre> | |
1180 | 1 struct rcu_head *next; | |
1181 | 2 void (*func)(struct rcu_head *head); | |
1182 | </pre> | |
1183 | ||
1184 | <p>The <tt>->next</tt> field is used | |
1185 | to link the <tt>rcu_head</tt> structures together in the | |
1186 | lists within the <tt>rcu_data</tt> structures. | |
1187 | The <tt>->func</tt> field is a pointer to the function | |
1188 | to be called when the callback is ready to be invoked, and | |
1189 | this function is passed a pointer to the <tt>rcu_head</tt> | |
1190 | structure. | |
1191 | However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt> | |
1192 | field to record the offset of the <tt>rcu_head</tt> | |
1193 | structure within the enclosing RCU-protected data structure. | |
1194 | ||
1195 | </p><p>Both of these fields are used internally by RCU. | |
1196 | From the viewpoint of RCU users, this structure is an | |
1197 | opaque “cookie”. | |
1198 | ||
1199 | <table> | |
1200 | <tr><th> </th></tr> | |
1201 | <tr><th align="left">Quick Quiz:</th></tr> | |
1202 | <tr><td> | |
1203 | Given that the callback function <tt>->func</tt> | |
1204 | is passed a pointer to the <tt>rcu_head</tt> structure, | |
1205 | how is that function supposed to find the beginning of the | |
1206 | enclosing RCU-protected data structure? | |
1207 | </td></tr> | |
1208 | <tr><th align="left">Answer:</th></tr> | |
1209 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
1210 | In actual practice, there is a separate callback function per | |
1211 | type of RCU-protected data structure. | |
1212 | The callback function can therefore use the <tt>container_of()</tt> | |
1213 | macro in the Linux kernel (or other pointer-manipulation facilities | |
1214 | in other software environments) to find the beginning of the | |
1215 | enclosing structure. | |
1216 | </font></td></tr> | |
1217 | <tr><td> </td></tr> | |
1218 | </table> | |
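<p>The <tt>container_of()</tt> pattern from the above answer can be sketched in user space as follows (<tt>struct foo</tt> and <tt>foo_rcu_cb()</tt> are invented names, and a real kernel callback would typically <tt>kfree()</tt> the enclosing structure):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space rendition of the kernel's container_of() macro. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
};

struct foo {
	int a;
	struct rcu_head rh;	/* embedded callback "cookie" */
};

static int last_freed_a;	/* test hook; a real callback just frees */

/* The callback receives a pointer to the embedded rcu_head and maps
 * it back to the start of the enclosing structure. */
static void foo_rcu_cb(struct rcu_head *rhp)
{
	struct foo *fp = container_of(rhp, struct foo, rh);

	last_freed_a = fp->a;
	free(fp);
}
```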
1219 | ||
1220 | <h3><a name="RCU-Specific Fields in the task_struct Structure"> | |
1221 | RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3> | |
1222 | ||
1223 | <p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some | |
1224 | additional fields in the <tt>task_struct</tt> structure: | |
1225 | ||
1226 | <pre> | |
1227 | 1 #ifdef CONFIG_PREEMPT_RCU | |
1228 | 2 int rcu_read_lock_nesting; | |
1229 | 3 union rcu_special rcu_read_unlock_special; | |
1230 | 4 struct list_head rcu_node_entry; | |
1231 | 5 struct rcu_node *rcu_blocked_node; | |
1232 | 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ | |
1233 | 7 #ifdef CONFIG_TASKS_RCU | |
1234 | 8 unsigned long rcu_tasks_nvcsw; | |
1235 | 9 bool rcu_tasks_holdout; | |
1236 | 10 struct list_head rcu_tasks_holdout_list; | |
1237 | 11 int rcu_tasks_idle_cpu; | |
1238 | 12 #endif /* #ifdef CONFIG_TASKS_RCU */ | |
1239 | </pre> | |
1240 | ||
1241 | <p>The <tt>->rcu_read_lock_nesting</tt> field records the | |
1242 | nesting level for RCU read-side critical sections, and | |
1243 | the <tt>->rcu_read_unlock_special</tt> field is a bitmask | |
1244 | that records special conditions that require <tt>rcu_read_unlock()</tt> | |
1245 | to do additional work. | |
1246 | The <tt>->rcu_node_entry</tt> field is used to form lists of | |
1247 | tasks that have blocked within preemptible-RCU read-side critical | |
1248 | sections and the <tt>->rcu_blocked_node</tt> field references | |
1249 | the <tt>rcu_node</tt> structure whose list this task is a member of, | |
1250 | or <tt>NULL</tt> if it is not blocked within a preemptible-RCU | |
1251 | read-side critical section. | |
1252 | ||
1253 | <p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of | |
1254 | voluntary context switches that this task had undergone at the | |
1255 | beginning of the current tasks-RCU grace period, | |
1256 | <tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU | |
1257 | grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt> | |
1258 | is a list element enqueuing this task on the holdout list, | |
1259 | and <tt>->rcu_tasks_idle_cpu</tt> tracks the CPU on which this | |
1260 | idle task is running, but only if the task is currently running, | |
1261 | that is, if the CPU is currently idle. | |
1262 | ||
1263 | <h3><a name="Accessor Functions"> | |
1264 | Accessor Functions</a></h3> | |
1265 | ||
1266 | <p>The following listing shows the | |
1267 | <tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first()</tt>, | |
1268 | <tt>rcu_for_each_nonleaf_node_breadth_first()</tt>, and | |
1269 | <tt>rcu_for_each_leaf_node()</tt> function and macros: | |

<pre>
 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp)
 2 {
 3   return &rsp->node[0];
 4 }
 5
 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \
 7   for ((rnp) = &(rsp)->node[0]; \
 8        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
 9
10 #define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \
11   for ((rnp) = &(rsp)->node[0]; \
12        (rnp) < (rsp)->level[NUM_RCU_LVLS - 1]; (rnp)++)
13
14 #define rcu_for_each_leaf_node(rsp, rnp) \
15   for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \
16        (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++)
</pre>

<p>The <tt>rcu_get_root()</tt> function simply returns a pointer to the
first element of the specified <tt>rcu_state</tt> structure's
<tt>->node[]</tt> array, which is the root <tt>rcu_node</tt>
structure.

</p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt>
macro takes advantage of the layout of the <tt>rcu_node</tt>
structures in the <tt>rcu_state</tt> structure's
<tt>->node[]</tt> array, performing a breadth-first traversal by
simply traversing the array in order.
The <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> macro operates
similarly, but traverses only the first part of the array, thus excluding
the leaf <tt>rcu_node</tt> structures.
Finally, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only
the last part of the array, thus visiting only the leaf
<tt>rcu_node</tt> structures.

<table>
<tr><th>&nbsp;</th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	What do <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> and
	<tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree
	contains only a single node?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	In the single-node case,
	<tt>rcu_for_each_nonleaf_node_breadth_first()</tt> is a no-op
	and <tt>rcu_for_each_leaf_node()</tt> traverses the single node.
</font></td></tr>
<tr><td>&nbsp;</td></tr>
</table>

<h3><a name="Summary">
Summary</a></h3>

So each flavor of RCU is represented by an <tt>rcu_state</tt> structure,
which contains a combining tree of <tt>rcu_node</tt> and
<tt>rcu_data</tt> structures.
Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle
state is tracked by an <tt>rcu_dynticks</tt> structure.

If you made it this far, you are well prepared to read the code
walkthroughs in the other articles in this series.

<h3><a name="Acknowledgments">
Acknowledgments</a></h3>

I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul
Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn
for helping me get this document into a more human-readable state.

<h3><a name="Legal Statement">
Legal Statement</a></h3>

<p>This work represents the view of the author and does not necessarily
represent the view of IBM.

</p><p>Linux is a registered trademark of Linus Torvalds.

</p><p>Other company, product, and service names may be trademarks or
service marks of others.

</body></html>