<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
	"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head><title>A Tour Through TREE_RCU's Data Structures [LWN.net]</title>
<meta HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">

<p>January 27, 2016</p>
<p>This article was contributed by Paul E. McKenney</p>

<h3>Introduction</h3>

This document describes RCU's major data structures and their relationship
to each other.

<ol>
<li>	<a href="#Data-Structure Relationships">
	Data-Structure Relationships</a>
<li>	<a href="#The rcu_state Structure">
	The <tt>rcu_state</tt> Structure</a>
<li>	<a href="#The rcu_node Structure">
	The <tt>rcu_node</tt> Structure</a>
<li>	<a href="#The rcu_data Structure">
	The <tt>rcu_data</tt> Structure</a>
<li>	<a href="#The rcu_dynticks Structure">
	The <tt>rcu_dynticks</tt> Structure</a>
<li>	<a href="#The rcu_head Structure">
	The <tt>rcu_head</tt> Structure</a>
<li>	<a href="#RCU-Specific Fields in the task_struct Structure">
	RCU-Specific Fields in the <tt>task_struct</tt> Structure</a>
<li>	<a href="#Accessor Functions">
	Accessor Functions</a>
</ol>

At the end we have the
<a href="#Answers to Quick Quizzes">answers to the quick quizzes</a>.

<h3><a name="Data-Structure Relationships">Data-Structure Relationships</a></h3>

<p>RCU is for all intents and purposes a large state machine, and its
data structures maintain the state in such a way as to allow RCU readers
to execute extremely quickly, while also processing the RCU grace periods
requested by updaters in an efficient and extremely scalable fashion.
The efficiency and scalability of RCU updaters is provided primarily
by a combining tree, as shown below:

</p><p><img src="BigTreeClassicRCU.svg" alt="BigTreeClassicRCU.svg" width="30%">

</p><p>This diagram shows an enclosing <tt>rcu_state</tt> structure
containing a tree of <tt>rcu_node</tt> structures.
Each leaf node of the <tt>rcu_node</tt> tree has up to 16
<tt>rcu_data</tt> structures associated with it, so that there
are <tt>NR_CPUS</tt> <tt>rcu_data</tt> structures,
one for each possible CPU.
This structure is adjusted at boot time, if needed, to handle the
common case where <tt>nr_cpu_ids</tt> is much less than
<tt>NR_CPUS</tt>.
For example, a number of Linux distributions set <tt>NR_CPUS=4096</tt>,
which results in a three-level <tt>rcu_node</tt> tree.
If the actual hardware has only 16 CPUs, RCU will adjust itself
at boot time, resulting in an <tt>rcu_node</tt> tree with only a single node.
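
<p>To make this boot-time sizing concrete, the following sketch (a
hypothetical stand-alone helper, not kernel code) computes how many
<tt>rcu_node</tt> levels are needed for a given number of CPUs, assuming
a leaf fanout and an interior fanout as described above:

```c
/*
 * Hypothetical helper (not a kernel API): how many levels of rcu_node
 * structures are needed to cover n_cpus CPUs, given the leaf fanout and
 * the interior fanout described in the text.
 */
int rcu_levels_needed(long n_cpus, long leaf_fanout, long fanout)
{
	long capacity = leaf_fanout;	/* CPUs covered by a single leaf */
	int levels = 1;

	while (capacity < n_cpus) {
		capacity *= fanout;	/* each added level multiplies coverage */
		levels++;
	}
	return levels;
}
```

<p>With the default fanouts of 16 at the leaves and 64 elsewhere, 16 CPUs
fit in a single node, while <tt>NR_CPUS=4096</tt> needs three levels,
matching the example above.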
61 | ||
62 | </p><p>The purpose of this combining tree is to allow per-CPU events | |
63 | such as quiescent states, dyntick-idle transitions, | |
64 | and CPU hotplug operations to be processed efficiently | |
65 | and scalably. | |
66 | Quiescent states are recorded by the per-CPU <tt>rcu_data</tt> structures, | |
67 | and other events are recorded by the leaf-level <tt>rcu_node</tt> | |
68 | structures. | |
69 | All of these events are combined at each level of the tree until finally | |
70 | grace periods are completed at the tree's root <tt>rcu_node</tt> | |
71 | structure. | |
72 | A grace period can be completed at the root once every CPU | |
73 | (or, in the case of <tt>CONFIG_PREEMPT_RCU</tt>, task) | |
74 | has passed through a quiescent state. | |
75 | Once a grace period has completed, record of that fact is propagated | |
76 | back down the tree. | |
77 | ||
78 | </p><p>As can be seen from the diagram, on a 64-bit system | |
79 | a two-level tree with 64 leaves can accommodate 1,024 CPUs, with a fanout | |
80 | of 64 at the root and a fanout of 16 at the leaves. | |
81 | ||
82 | <table> | |
83 | <tr><th> </th></tr> | |
84 | <tr><th align="left">Quick Quiz:</th></tr> | |
85 | <tr><td> | |
86 | Why isn't the fanout at the leaves also 64? | |
87 | </td></tr> | |
88 | <tr><th align="left">Answer:</th></tr> | |
89 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
90 | Because there are more types of events that affect the leaf-level | |
91 | <tt>rcu_node</tt> structures than further up the tree. | |
92 | Therefore, if the leaf <tt>rcu_node</tt> structures have fanout of | |
	64, the contention on these structures' <tt>->lock</tt> fields
	becomes excessive.
	Experimentation on a wide variety of systems has shown that a fanout
	of 16 works well for the leaves of the <tt>rcu_node</tt> tree.
</font>

<p><font color="ffffff">Of course, further experience with
systems having hundreds or thousands of CPUs may demonstrate
that the fanout for the non-leaf <tt>rcu_node</tt> structures
must also be reduced.
Such reduction can be easily carried out when and if it proves
necessary.
In the meantime, if you are using such a system and running into
contention problems on the non-leaf <tt>rcu_node</tt> structures,
you may use the <tt>CONFIG_RCU_FANOUT</tt> kernel configuration
parameter to reduce the non-leaf fanout as needed.
</font>

<p><font color="ffffff">Kernels built for systems with
strong NUMA characteristics might also need to adjust
<tt>CONFIG_RCU_FANOUT</tt> so that the domains of the
<tt>rcu_node</tt> structures align with hardware boundaries.
However, there has thus far been no need for this.
</font></td></tr>
<tr><td> </td></tr>
</table>

<p>If your system has more than 1,024 CPUs (or more than 512 CPUs on
a 32-bit system), then RCU will automatically add more levels to the
tree.
For example, if you are crazy enough to build a 64-bit system with 65,536
CPUs, RCU would configure the <tt>rcu_node</tt> tree as follows:

</p><p><img src="HugeTreeClassicRCU.svg" alt="HugeTreeClassicRCU.svg" width="50%">

</p><p>RCU currently permits up to a four-level tree, which on a 64-bit system
accommodates up to 4,194,304 CPUs, though only a mere 524,288 CPUs for
32-bit systems.
On the other hand, you can set <tt>CONFIG_RCU_FANOUT</tt> to be
as small as 2 if you wish, which would permit only 16 CPUs, which
is useful for testing.
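
<p>The capacity figures quoted above follow directly from the fanouts:
a tree of a given depth covers the leaf fanout multiplied by the interior
fanout once per non-leaf level. A quick sketch (hypothetical helper, not
kernel code) checks the arithmetic:

```c
/*
 * Hypothetical helper (not kernel code): maximum number of CPUs that a
 * tree with the given number of levels can cover, with the stated leaf
 * and interior fanouts.
 */
unsigned long rcu_capacity(int levels, unsigned long leaf_fanout,
			   unsigned long fanout)
{
	unsigned long capacity = leaf_fanout;

	while (--levels > 0)
		capacity *= fanout;	/* each extra level multiplies by the fanout */
	return capacity;
}
```

<p>This reproduces the numbers in the text: 1,024 CPUs for a two-level
64-bit tree, 4,194,304 for a four-level 64-bit tree, and 524,288 for a
four-level 32-bit tree (interior fanout of 32).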
134 | ||
135 | </p><p>This multi-level combining tree allows us to get most of the | |
136 | performance and scalability | |
137 | benefits of partitioning, even though RCU grace-period detection is | |
138 | inherently a global operation. | |
139 | The trick here is that only the last CPU to report a quiescent state | |
140 | into a given <tt>rcu_node</tt> structure need advance to the <tt>rcu_node</tt> | |
141 | structure at the next level up the tree. | |
142 | This means that at the leaf-level <tt>rcu_node</tt> structure, only | |
143 | one access out of sixteen will progress up the tree. | |
144 | For the internal <tt>rcu_node</tt> structures, the situation is even | |
145 | more extreme: Only one access out of sixty-four will progress up | |
146 | the tree. | |
147 | Because the vast majority of the CPUs do not progress up the tree, | |
148 | the lock contention remains roughly constant up the tree. | |
149 | No matter how many CPUs there are in the system, at most 64 quiescent-state | |
150 | reports per grace period will progress all the way to the root | |
151 | <tt>rcu_node</tt> structure, thus ensuring that the lock contention | |
152 | on that root <tt>rcu_node</tt> structure remains acceptably low. | |
153 | ||
154 | </p><p>In effect, the combining tree acts like a big shock absorber, | |
155 | keeping lock contention under control at all tree levels regardless | |
156 | of the level of loading on the system. | |
157 | ||
158 | </p><p>The Linux kernel actually supports multiple flavors of RCU | |
159 | running concurrently, so RCU builds separate data structures for each | |
160 | flavor. | |
161 | For example, for <tt>CONFIG_TREE_RCU=y</tt> kernels, RCU provides | |
162 | rcu_sched and rcu_bh, as shown below: | |
163 | ||
164 | </p><p><img src="BigTreeClassicRCUBH.svg" alt="BigTreeClassicRCUBH.svg" width="33%"> | |
165 | ||
166 | </p><p>Energy efficiency is increasingly important, and for that | |
167 | reason the Linux kernel provides <tt>CONFIG_NO_HZ_IDLE</tt>, which | |
168 | turns off the scheduling-clock interrupts on idle CPUs, which in | |
169 | turn allows those CPUs to attain deeper sleep states and to consume | |
170 | less energy. | |
171 | CPUs whose scheduling-clock interrupts have been turned off are | |
172 | said to be in <i>dyntick-idle mode</i>. | |
173 | RCU must handle dyntick-idle CPUs specially | |
174 | because RCU would otherwise wake up each CPU on every grace period, | |
175 | which would defeat the whole purpose of <tt>CONFIG_NO_HZ_IDLE</tt>. | |
176 | RCU uses the <tt>rcu_dynticks</tt> structure to track | |
177 | which CPUs are in dyntick idle mode, as shown below: | |
178 | ||
179 | </p><p><img src="BigTreeClassicRCUBHdyntick.svg" alt="BigTreeClassicRCUBHdyntick.svg" width="33%"> | |
180 | ||
181 | </p><p>However, if a CPU is in dyntick-idle mode, it is in that mode | |
182 | for all flavors of RCU. | |
183 | Therefore, a single <tt>rcu_dynticks</tt> structure is allocated per | |
184 | CPU, and all of a given CPU's <tt>rcu_data</tt> structures share | |
185 | that <tt>rcu_dynticks</tt>, as shown in the figure. | |
186 | ||
187 | </p><p>Kernels built with <tt>CONFIG_PREEMPT_RCU</tt> support | |
188 | rcu_preempt in addition to rcu_sched and rcu_bh, as shown below: | |
189 | ||
190 | </p><p><img src="BigTreePreemptRCUBHdyntick.svg" alt="BigTreePreemptRCUBHdyntick.svg" width="35%"> | |
191 | ||
</p><p>RCU updaters wait for normal grace periods by registering
RCU callbacks, either directly via <tt>call_rcu()</tt> and
friends (namely <tt>call_rcu_bh()</tt> and <tt>call_rcu_sched()</tt>,
there being a separate interface per flavor of RCU)
or indirectly via <tt>synchronize_rcu()</tt> and friends.
RCU callbacks are represented by <tt>rcu_head</tt> structures,
which are queued on <tt>rcu_data</tt> structures while they are
waiting for a grace period to elapse, as shown in the following figure:

</p><p><img src="BigTreePreemptRCUBHdyntickCB.svg" alt="BigTreePreemptRCUBHdyntickCB.svg" width="40%">

</p><p>This figure shows how <tt>TREE_RCU</tt>'s and
<tt>PREEMPT_RCU</tt>'s major data structures are related.
Lesser data structures will be introduced with the algorithms that
make use of them.

</p><p>Note that each of the data structures in the above figure has
its own synchronization:

<p><ol>
<li>	Each <tt>rcu_state</tt> structure has a lock and a mutex,
	and some fields are protected by the corresponding root
	<tt>rcu_node</tt> structure's lock.
<li>	Each <tt>rcu_node</tt> structure has a spinlock.
<li>	The fields in <tt>rcu_data</tt> are private to the corresponding
	CPU, although a few can be read and written by other CPUs.
<li>	Similarly, the fields in <tt>rcu_dynticks</tt> are private
	to the corresponding CPU, although a few can be read by
	other CPUs.
</ol>
222 | ||
223 | <p>It is important to note that different data structures can have | |
224 | very different ideas about the state of RCU at any given time. | |
225 | For but one example, awareness of the start or end of a given RCU | |
226 | grace period propagates slowly through the data structures. | |
227 | This slow propagation is absolutely necessary for RCU to have good | |
228 | read-side performance. | |
229 | If this balkanized implementation seems foreign to you, one useful | |
230 | trick is to consider each instance of these data structures to be | |
231 | a different person, each having the usual slightly different | |
232 | view of reality. | |
233 | ||
234 | </p><p>The general role of each of these data structures is as | |
235 | follows: | |
236 | ||
</p><ol>
<li>	<tt>rcu_state</tt>:
	This structure forms the interconnection between the
	<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
	tracks grace periods, serves as short-term repository
	for callbacks orphaned by CPU-hotplug events,
	maintains <tt>rcu_barrier()</tt> state,
	tracks expedited grace-period state,
	and maintains state used to force quiescent states when
	grace periods extend too long.
<li>	<tt>rcu_node</tt>: This structure forms the combining
	tree that propagates quiescent-state
	information from the leaves to the root, and also propagates
	grace-period information from the root to the leaves.
	It provides local copies of the grace-period state in order
	to allow this information to be accessed in a synchronized
	manner without suffering the scalability limitations that
	would otherwise be imposed by global locking.
	In <tt>CONFIG_PREEMPT_RCU</tt> kernels, it manages the lists
	of tasks that have blocked while in their current
	RCU read-side critical section.
	In <tt>CONFIG_PREEMPT_RCU</tt> with
	<tt>CONFIG_RCU_BOOST</tt>, it manages the
	per-<tt>rcu_node</tt> priority-boosting
	kernel threads (kthreads) and state.
	Finally, it records CPU-hotplug state in order to determine
	which CPUs should be ignored during a given grace period.
<li>	<tt>rcu_data</tt>: This per-CPU structure is the
	focus of quiescent-state detection and RCU callback queuing.
	It also tracks its relationship to the corresponding leaf
	<tt>rcu_node</tt> structure to allow more-efficient
	propagation of quiescent states up the <tt>rcu_node</tt>
	combining tree.
	Like the <tt>rcu_node</tt> structure, it provides a local
	copy of the grace-period information so that the
	corresponding CPU can access this information in a
	synchronized manner for free.
	Finally, this structure records past dyntick-idle state
	for the corresponding CPU and also tracks statistics.
<li>	<tt>rcu_dynticks</tt>:
	This per-CPU structure tracks the current dyntick-idle
	state for the corresponding CPU.
	Unlike the other three structures, the <tt>rcu_dynticks</tt>
	structure is not replicated per RCU flavor.
<li>	<tt>rcu_head</tt>:
	This structure represents RCU callbacks, and is the
	only structure allocated and managed by RCU users.
	The <tt>rcu_head</tt> structure is normally embedded
	within the RCU-protected data structure.
</ol>
287 | ||
288 | <p>If all you wanted from this article was a general notion of how | |
289 | RCU's data structures are related, you are done. | |
290 | Otherwise, each of the following sections give more details on | |
291 | the <tt>rcu_state</tt>, <tt>rcu_node</tt>, <tt>rcu_data</tt>, | |
292 | and <tt>rcu_dynticks</tt> data structures. | |
293 | ||
<h3><a name="The rcu_state Structure">
The <tt>rcu_state</tt> Structure</a></h3>

<p>The <tt>rcu_state</tt> structure is the base structure that
represents a flavor of RCU.
This structure forms the interconnection between the
<tt>rcu_node</tt> and <tt>rcu_data</tt> structures,
tracks grace periods, contains the lock used to
synchronize with CPU-hotplug events,
and maintains state used to force quiescent states when
grace periods extend too long.

</p><p>A few of the <tt>rcu_state</tt> structure's fields are discussed,
singly and in groups, in the following sections.
The more specialized fields are covered in the discussion of their
use.
<h5>Relationship to rcu_node and rcu_data Structures</h5>

This portion of the <tt>rcu_state</tt> structure is declared
as follows:

<pre>
  1   struct rcu_node node[NUM_RCU_NODES];
  2   struct rcu_node *level[NUM_RCU_LVLS + 1];
  3   struct rcu_data __percpu *rda;
</pre>

<table>
<tr><th> </th></tr>
<tr><th align="left">Quick Quiz:</th></tr>
<tr><td>
	Wait a minute!
	You said that the <tt>rcu_node</tt> structures formed a tree,
	but they are declared as a flat array!
	What gives?
</td></tr>
<tr><th align="left">Answer:</th></tr>
<tr><td bgcolor="#ffffff"><font color="ffffff">
	The tree is laid out in the array.
	The first node in the array is the head, the next set of nodes in the
	array are children of the head node, and so on until the last set of
	nodes in the array are the leaves.
</font>

<p><font color="ffffff">See the following diagrams to see how
this works.
</font></td></tr>
<tr><td> </td></tr>
</table>
344 | ||
345 | <p>The <tt>rcu_node</tt> tree is embedded into the | |
346 | <tt>->node[]</tt> array as shown in the following figure: | |
347 | ||
348 | </p><p><img src="TreeMapping.svg" alt="TreeMapping.svg" width="40%"> | |
349 | ||
350 | </p><p>One interesting consequence of this mapping is that a | |
351 | breadth-first traversal of the tree is implemented as a simple | |
352 | linear scan of the array, which is in fact what the | |
353 | <tt>rcu_for_each_node_breadth_first()</tt> macro does. | |
354 | This macro is used at the beginning and ends of grace periods. | |
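
<p>The reason a linear scan suffices can be sketched in a few lines of
stand-alone C (toy structures, not the kernel's): because each level of
the tree occupies a contiguous run of the <tt>->node[]</tt> array, index
order is breadth-first order, so the levels seen during a linear scan
never decrease.

```c
/* Toy stand-in for struct rcu_node: only the ->level field matters here. */
struct rcu_node_sketch {
	int level;	/* 0 for the root, increasing toward the leaves */
};

/*
 * Verify the breadth-first property of a linear scan: the level of each
 * node visited is never less than that of the node before it.
 */
int scan_is_breadth_first(const struct rcu_node_sketch *node, int n)
{
	int i;

	for (i = 1; i < n; i++)
		if (node[i].level < node[i - 1].level)
			return 0;
	return 1;
}

/* One root, two interior nodes, four leaves, laid out level by level. */
const struct rcu_node_sketch demo_nodes[7] = {
	{ 0 }, { 1 }, { 1 }, { 2 }, { 2 }, { 2 }, { 2 },
};
```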
355 | ||
356 | </p><p>Each entry of the <tt>->level</tt> array references | |
357 | the first <tt>rcu_node</tt> structure on the corresponding level | |
358 | of the tree, for example, as shown below: | |
359 | ||
360 | </p><p><img src="TreeMappingLevel.svg" alt="TreeMappingLevel.svg" width="40%"> | |
361 | ||
362 | </p><p>The zero<sup>th</sup> element of the array references the root | |
363 | <tt>rcu_node</tt> structure, the first element references the | |
364 | first child of the root <tt>rcu_node</tt>, and finally the second | |
365 | element references the first leaf <tt>rcu_node</tt> structure. | |
366 | ||
367 | </p><p>For whatever it is worth, if you draw the tree to be tree-shaped | |
368 | rather than array-shaped, it is easy to draw a planar representation: | |
369 | ||
370 | </p><p><img src="TreeLevel.svg" alt="TreeLevel.svg" width="60%"> | |
371 | ||
372 | </p><p>Finally, the <tt>->rda</tt> field references a per-CPU | |
373 | pointer to the corresponding CPU's <tt>rcu_data</tt> structure. | |
374 | ||
375 | </p><p>All of these fields are constant once initialization is complete, | |
376 | and therefore need no protection. | |
377 | ||
378 | <h5>Grace-Period Tracking</h5> | |
379 | ||
380 | <p>This portion of the <tt>rcu_state</tt> structure is declared | |
381 | as follows: | |
382 | ||
383 | <pre> | |
384 | 1 unsigned long gpnum; | |
385 | 2 unsigned long completed; | |
386 | </pre> | |
387 | ||
388 | <p>RCU grace periods are numbered, and | |
389 | the <tt>->gpnum</tt> field contains the number of the grace | |
390 | period that started most recently. | |
391 | The <tt>->completed</tt> field contains the number of the | |
392 | grace period that completed most recently. | |
393 | If the two fields are equal, the RCU grace period that most recently | |
394 | started has already completed, and therefore the corresponding | |
395 | flavor of RCU is idle. | |
396 | If <tt>->gpnum</tt> is one greater than <tt>->completed</tt>, | |
397 | then <tt>->gpnum</tt> gives the number of the current RCU | |
398 | grace period, which has not yet completed. | |
399 | Any other combination of values indicates that something is broken. | |
400 | These two fields are protected by the root <tt>rcu_node</tt>'s | |
401 | <tt>->lock</tt> field. | |
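
<p>This convention is simple enough to capture in a stand-alone sketch
(illustrative only, not kernel code):

```c
/*
 * Sketch of the ->gpnum/->completed convention described above:
 * equal means idle, gpnum == completed + 1 means a grace period is
 * in progress, and anything else indicates breakage.
 */
enum gp_view { GP_IDLE, GP_IN_PROGRESS, GP_BROKEN };

enum gp_view gp_state(unsigned long gpnum, unsigned long completed)
{
	if (gpnum == completed)
		return GP_IDLE;		/* most recent grace period is done */
	if (gpnum == completed + 1)
		return GP_IN_PROGRESS;	/* grace period ->gpnum is under way */
	return GP_BROKEN;		/* any other combination is a bug */
}
```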
402 | ||
403 | </p><p>There are <tt>->gpnum</tt> and <tt>->completed</tt> fields | |
404 | in the <tt>rcu_node</tt> and <tt>rcu_data</tt> structures | |
405 | as well. | |
406 | The fields in the <tt>rcu_state</tt> structure represent the | |
407 | most current values, and those of the other structures are compared | |
408 | in order to detect the start of a new grace period in a distributed | |
409 | fashion. | |
410 | The values flow from <tt>rcu_state</tt> to <tt>rcu_node</tt> | |
411 | (down the tree from the root to the leaves) to <tt>rcu_data</tt>. | |
412 | ||
413 | <h5>Miscellaneous</h5> | |
414 | ||
415 | <p>This portion of the <tt>rcu_state</tt> structure is declared | |
416 | as follows: | |
417 | ||
418 | <pre> | |
419 | 1 unsigned long gp_max; | |
420 | 2 char abbr; | |
421 | 3 char *name; | |
422 | </pre> | |
423 | ||
424 | <p>The <tt>->gp_max</tt> field tracks the duration of the longest | |
425 | grace period in jiffies. | |
426 | It is protected by the root <tt>rcu_node</tt>'s <tt>->lock</tt>. | |
427 | ||
428 | <p>The <tt>->name</tt> field points to the name of the RCU flavor | |
429 | (for example, “rcu_sched”), and is constant. | |
430 | The <tt>->abbr</tt> field contains a one-character abbreviation, | |
431 | for example, “s” for RCU-sched. | |
432 | ||
<h3><a name="The rcu_node Structure">
The <tt>rcu_node</tt> Structure</a></h3>

<p>The <tt>rcu_node</tt> structures form the combining
tree that propagates quiescent-state
information from the leaves to the root and also propagates
grace-period information from the root down to the leaves.
They provide local copies of the grace-period state in order
to allow this information to be accessed in a synchronized
manner without suffering the scalability limitations that
would otherwise be imposed by global locking.
In <tt>CONFIG_PREEMPT_RCU</tt> kernels, they manage the lists
of tasks that have blocked while in their current
RCU read-side critical section.
In <tt>CONFIG_PREEMPT_RCU</tt> with
<tt>CONFIG_RCU_BOOST</tt>, they manage the
per-<tt>rcu_node</tt> priority-boosting
kernel threads (kthreads) and state.
Finally, they record CPU-hotplug state in order to determine
which CPUs should be ignored during a given grace period.
</p><p>The <tt>rcu_node</tt> structure's fields are discussed,
singly and in groups, in the following sections.

<h5>Connection to Combining Tree</h5>

<p>This portion of the <tt>rcu_node</tt> structure is declared
as follows:

<pre>
  1   struct rcu_node *parent;
  2   u8 level;
  3   u8 grpnum;
  4   unsigned long grpmask;
  5   int grplo;
  6   int grphi;
</pre>

<p>The <tt>->parent</tt> pointer references the <tt>rcu_node</tt>
one level up in the tree, and is <tt>NULL</tt> for the root
<tt>rcu_node</tt>.
The RCU implementation makes heavy use of this field to push quiescent
states up the tree.
The <tt>->level</tt> field gives the level in the tree, with
the root being at level zero, its children at level one, and so on.
The <tt>->grpnum</tt> field gives this node's position within
the children of its parent, so this number can range between 0 and 31
on 32-bit systems and between 0 and 63 on 64-bit systems.
The <tt>->level</tt> and <tt>->grpnum</tt> fields are
used only during initialization and for tracing.
The <tt>->grpmask</tt> field is the bitmask counterpart of
<tt>->grpnum</tt>, and therefore always has exactly one bit set.
This mask is used to clear the bit corresponding to this <tt>rcu_node</tt>
structure in its parent's bitmasks, which are described later.
Finally, the <tt>->grplo</tt> and <tt>->grphi</tt> fields
contain the lowest and highest numbered CPU served by this
<tt>rcu_node</tt> structure, respectively.

</p><p>All of these fields are constant, and thus do not require any
synchronization.
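
<p>The <tt>->grpnum</tt>/<tt>->grpmask</tt> relationship can be
sketched as follows (hypothetical helpers, not kernel APIs):

```c
/*
 * ->grpmask is simply the bit corresponding to this node's ->grpnum
 * position among its parent's children, so it always has exactly one
 * bit set; the parent clears that bit when this child has reported.
 */
unsigned long grpmask_of(int grpnum)
{
	return 1UL << grpnum;	/* exactly one bit set */
}

unsigned long clear_child_bit(unsigned long parent_mask, int child_grpnum)
{
	return parent_mask & ~grpmask_of(child_grpnum);
}
```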
493 | ||
494 | <h5>Synchronization</h5> | |
495 | ||
496 | <p>This field of the <tt>rcu_node</tt> structure is declared | |
497 | as follows: | |
498 | ||
499 | <pre> | |
500 | 1 raw_spinlock_t lock; | |
501 | </pre> | |
502 | ||
503 | <p>This field is used to protect the remaining fields in this structure, | |
504 | unless otherwise stated. | |
505 | That said, all of the fields in this structure can be accessed without | |
506 | locking for tracing purposes. | |
507 | Yes, this can result in confusing traces, but better some tracing confusion | |
508 | than to be heisenbugged out of existence. | |
509 | ||
510 | <h5>Grace-Period Tracking</h5> | |
511 | ||
512 | <p>This portion of the <tt>rcu_node</tt> structure is declared | |
513 | as follows: | |
514 | ||
515 | <pre> | |
516 | 1 unsigned long gpnum; | |
517 | 2 unsigned long completed; | |
518 | </pre> | |
519 | ||
<p>These fields are the counterparts of the fields of the same name in
the <tt>rcu_state</tt> structure.
They each may lag up to one behind their <tt>rcu_state</tt>
counterparts.
If a given <tt>rcu_node</tt> structure's <tt>->gpnum</tt> and
<tt>->completed</tt> fields are equal, then this <tt>rcu_node</tt>
structure believes that RCU is idle.
Otherwise, as with the <tt>rcu_state</tt> structure,
the <tt>->gpnum</tt> field will be one greater than the
<tt>->completed</tt> field, with <tt>->gpnum</tt>
indicating which grace period this <tt>rcu_node</tt> believes
is still being waited for.

</p><p>The <tt>->gpnum</tt> field of each <tt>rcu_node</tt>
structure is updated at the beginning
of each grace period, and the <tt>->completed</tt> fields are
updated at the end of each grace period.
<h5>Quiescent-State Tracking</h5>

<p>These fields manage the propagation of quiescent states up the
combining tree.

</p><p>This portion of the <tt>rcu_node</tt> structure has fields
as follows:

<pre>
  1   unsigned long qsmask;
  2   unsigned long expmask;
  3   unsigned long qsmaskinit;
  4   unsigned long expmaskinit;
</pre>

<p>The <tt>->qsmask</tt> field tracks which of this
<tt>rcu_node</tt> structure's children still need to report
quiescent states for the current normal grace period.
Such children will have a value of 1 in their corresponding bit.
Note that the leaf <tt>rcu_node</tt> structures should be
thought of as having <tt>rcu_data</tt> structures as their
children.
Similarly, the <tt>->expmask</tt> field tracks which
of this <tt>rcu_node</tt> structure's children still need to report
quiescent states for the current expedited grace period.
An expedited grace period has
the same conceptual properties as a normal grace period, but the
expedited implementation accepts extreme CPU overhead to obtain
much lower grace-period latency, for example, consuming a few
tens of microseconds worth of CPU time to reduce grace-period
duration from milliseconds to tens of microseconds.
The <tt>->qsmaskinit</tt> field tracks which of this
<tt>rcu_node</tt> structure's children cover for at least
one online CPU.
This mask is used to initialize <tt>->qsmask</tt>,
and <tt>->expmaskinit</tt> is used to initialize
<tt>->expmask</tt>, at the beginning of the
normal and expedited grace periods, respectively.
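
<p>The bitmask-based propagation described earlier can be sketched with
toy structures (locking and grace-period numbering elided, so this is
illustrative only, not the kernel's implementation): each report clears
one bit, and only the report that empties a node's <tt>->qsmask</tt>
continues up toward the root.

```c
#include <stddef.h>

/* Toy stand-in for struct rcu_node: just enough for mask propagation. */
struct node_sketch {
	struct node_sketch *parent;	/* NULL for the root */
	unsigned long qsmask;		/* children still owing reports */
	int grpnum;			/* this node's bit in its parent's ->qsmask */
};

/*
 * Clear the reporter's bit; only the last reporter at each level walks
 * up toward the root.  Returns 1 if this report emptied the root's mask,
 * meaning the grace period can end.
 */
int report_qs(struct node_sketch *np, int grpnum)
{
	for (;;) {
		np->qsmask &= ~(1UL << grpnum);
		if (np->qsmask != 0)
			return 0;	/* other children still owe reports */
		if (np->parent == NULL)
			return 1;	/* root emptied: grace period can end */
		grpnum = np->grpnum;
		np = np->parent;	/* last reporter advances up the tree */
	}
}

/* Four CPUs, two leaves, one root: only the final report reaches the root. */
int demo_four_cpus(void)
{
	struct node_sketch root = { NULL, 0x3UL, 0 };
	struct node_sketch leaf0 = { &root, 0x3UL, 0 };
	struct node_sketch leaf1 = { &root, 0x3UL, 1 };
	int completions = 0;

	completions += report_qs(&leaf0, 0);
	completions += report_qs(&leaf0, 1);
	completions += report_qs(&leaf1, 0);
	completions += report_qs(&leaf1, 1);
	return completions;
}
```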
576 | ||
577 | <table> | |
578 | <tr><th> </th></tr> | |
579 | <tr><th align="left">Quick Quiz:</th></tr> | |
580 | <tr><td> | |
581 | Why are these bitmasks protected by locking? | |
582 | Come on, haven't you heard of atomic instructions??? | |
583 | </td></tr> | |
584 | <tr><th align="left">Answer:</th></tr> | |
585 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
586 | Lockless grace-period computation! Such a tantalizing possibility! | |
587 | </font> | |
588 | ||
589 | <p><font color="ffffff">But consider the following sequence of events: | |
590 | </font> | |
591 | ||
592 | <ol> | |
593 | <li> <font color="ffffff">CPU 0 has been in dyntick-idle | |
594 | mode for quite some time. | |
595 | When it wakes up, it notices that the current RCU | |
596 | grace period needs it to report in, so it sets a | |
597 | flag where the scheduling clock interrupt will find it. | |
598 | </font><p> | |
599 | <li> <font color="ffffff">Meanwhile, CPU 1 is running | |
600 | <tt>force_quiescent_state()</tt>, | |
601 | and notices that CPU 0 has been in dyntick idle mode, | |
602 | which qualifies as an extended quiescent state. | |
603 | </font><p> | |
604 | <li> <font color="ffffff">CPU 0's scheduling clock | |
605 | interrupt fires in the | |
606 | middle of an RCU read-side critical section, and notices | |
607 | that the RCU core needs something, so commences RCU softirq | |
608 | processing. | |
609 | </font> | |
610 | <p> | |
611 | <li> <font color="ffffff">CPU 0's softirq handler | |
612 | executes and is just about ready | |
613 | to report its quiescent state up the <tt>rcu_node</tt> | |
614 | tree. | |
615 | </font><p> | |
616 | <li> <font color="ffffff">But CPU 1 beats it to the punch, | |
617 | completing the current | |
618 | grace period and starting a new one. | |
619 | </font><p> | |
620 | <li> <font color="ffffff">CPU 0 now reports its quiescent | |
621 | state for the wrong | |
622 | grace period. | |
623 | That grace period might now end before the RCU read-side | |
624 | critical section. | |
625 | If that happens, disaster will ensue. | |
626 | </font> | |
627 | </ol> | |
628 | ||
629 | <p><font color="ffffff">So the locking is absolutely required in | |
630 | order to coordinate | |
631 | clearing of the bits with the grace-period numbers in | |
632 | <tt>->gpnum</tt> and <tt>->completed</tt>. | |
633 | </font></td></tr> | |
634 | <tr><td> </td></tr> | |
635 | </table> | |
636 | ||
637 | <h5>Blocked-Task Management</h5> | |
638 | ||
639 | <p><tt>PREEMPT_RCU</tt> allows tasks to be preempted in the | |
640 | midst of their RCU read-side critical sections, and these tasks | |
641 | must be tracked explicitly. | |
642 | The details of exactly why and how they are tracked will be covered | |
643 | in a separate article on RCU read-side processing. | |
644 | For now, it is enough to know that the <tt>rcu_node</tt> | |
645 | structure tracks them. | |
646 | ||
647 | <pre> | |
648 | 1 struct list_head blkd_tasks; | |
649 | 2 struct list_head *gp_tasks; | |
650 | 3 struct list_head *exp_tasks; | |
651 | 4 bool wait_blkd_tasks; | |
652 | </pre> | |
653 | ||
654 | <p>The <tt>->blkd_tasks</tt> field is a list header for | |
655 | the list of blocked and preempted tasks. | |
656 | As tasks undergo context switches within RCU read-side critical | |
657 | sections, their <tt>task_struct</tt> structures are enqueued | |
658 | (via the <tt>task_struct</tt>'s <tt>->rcu_node_entry</tt> | |
659 | field) onto the head of the <tt>->blkd_tasks</tt> list for the | |
660 | leaf <tt>rcu_node</tt> structure corresponding to the CPU | |
661 | on which the outgoing context switch executed. | |
662 | As these tasks later exit their RCU read-side critical sections, | |
663 | they remove themselves from the list. | |
664 | This list is therefore in reverse time order, so that if one of the tasks | |
665 | is blocking the current grace period, all subsequent tasks must | |
666 | also be blocking that same grace period. | |
667 | Therefore, a single pointer into this list suffices to track | |
668 | all tasks blocking a given grace period. | |
669 | That pointer is stored in <tt>->gp_tasks</tt> for normal | |
670 | grace periods and in <tt>->exp_tasks</tt> for expedited | |
671 | grace periods. | |
These last two fields are <tt>NULL</tt> if either there is
no grace period in flight or there are no blocked tasks
preventing that grace period from completing.
675 | If either of these two pointers is referencing a task that | |
676 | removes itself from the <tt>->blkd_tasks</tt> list, | |
677 | then that task must advance the pointer to the next task on | |
678 | the list, or set the pointer to <tt>NULL</tt> if there | |
679 | are no subsequent tasks on the list. | |
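<p>This pointer-advancement rule can be sketched in user-space C.
The following is a minimal, hypothetical reduction of the kernel's
list handling (only the <tt>->gp_tasks</tt> case is shown;
<tt>->exp_tasks</tt> is handled identically):

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel's list primitives, for illustration. */
struct list_head { struct list_head *prev, *next; };

static void list_init(struct list_head *head)
{
	head->prev = head->next = head;
}

static void list_add_head(struct list_head *entry, struct list_head *head)
{
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static void list_del_entry(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

struct task_sketch { struct list_head rcu_node_entry; };

/* Hypothetical reduction of the relevant rcu_node fields. */
struct rnp_sketch {
	struct list_head blkd_tasks;	/* newest blocked task at the head */
	struct list_head *gp_tasks;	/* newest task blocking the normal GP */
};

/* Enqueue at the head, so the list stays in reverse time order. */
static void task_block(struct rnp_sketch *rnp, struct task_sketch *t)
{
	list_add_head(&t->rcu_node_entry, &rnp->blkd_tasks);
}

/*
 * A task leaving the list must advance ->gp_tasks if that pointer
 * references it: to the next (older) task, or to NULL when it was
 * the last task blocking the grace period.
 */
static void task_unblock(struct rnp_sketch *rnp, struct task_sketch *t)
{
	if (rnp->gp_tasks == &t->rcu_node_entry) {
		rnp->gp_tasks = t->rcu_node_entry.next;
		if (rnp->gp_tasks == &rnp->blkd_tasks)
			rnp->gp_tasks = NULL;
	}
	list_del_entry(&t->rcu_node_entry);
}
```

Note that because the list is in reverse time order, following
<tt>->next</tt> from the removed task always reaches an older task,
which must also be blocking the same grace period.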
680 | ||
681 | </p><p>For example, suppose that tasks T1, T2, and T3 are | |
682 | all hard-affinitied to the largest-numbered CPU in the system. | |
683 | Then if task T1 blocked in an RCU read-side | |
684 | critical section, then an expedited grace period started, | |
685 | then task T2 blocked in an RCU read-side critical section, | |
then a normal grace period started, and finally task T3 blocked
687 | in an RCU read-side critical section, then the state of the | |
688 | last leaf <tt>rcu_node</tt> structure's blocked-task list | |
689 | would be as shown below: | |
690 | ||
691 | </p><p><img src="blkd_task.svg" alt="blkd_task.svg" width="60%"> | |
692 | ||
693 | </p><p>Task T1 is blocking both grace periods, task T2 is | |
694 | blocking only the normal grace period, and task T3 is blocking | |
695 | neither grace period. | |
696 | Note that these tasks will not remove themselves from this list | |
697 | immediately upon resuming execution. | |
698 | They will instead remain on the list until they execute the outermost | |
699 | <tt>rcu_read_unlock()</tt> that ends their RCU read-side critical | |
700 | section. | |
701 | ||
702 | <p> | |
703 | The <tt>->wait_blkd_tasks</tt> field indicates whether or not | |
704 | the current grace period is waiting on a blocked task. | |
705 | ||
706 | <h5>Sizing the <tt>rcu_node</tt> Array</h5> | |
707 | ||
708 | <p>The <tt>rcu_node</tt> array is sized via a series of | |
709 | C-preprocessor expressions as follows: | |
710 | ||
711 | <pre> | |
712 | 1 #ifdef CONFIG_RCU_FANOUT | |
713 | 2 #define RCU_FANOUT CONFIG_RCU_FANOUT | |
714 | 3 #else | |
715 | 4 # ifdef CONFIG_64BIT | |
716 | 5 # define RCU_FANOUT 64 | |
717 | 6 # else | |
718 | 7 # define RCU_FANOUT 32 | |
719 | 8 # endif | |
720 | 9 #endif | |
721 | 10 | |
722 | 11 #ifdef CONFIG_RCU_FANOUT_LEAF | |
723 | 12 #define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF | |
724 | 13 #else | |
725 | 14 # ifdef CONFIG_64BIT | |
726 | 15 # define RCU_FANOUT_LEAF 64 | |
727 | 16 # else | |
728 | 17 # define RCU_FANOUT_LEAF 32 | |
729 | 18 # endif | |
730 | 19 #endif | |
731 | 20 | |
732 | 21 #define RCU_FANOUT_1 (RCU_FANOUT_LEAF) | |
733 | 22 #define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT) | |
734 | 23 #define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT) | |
735 | 24 #define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT) | |
736 | 25 | |
737 | 26 #if NR_CPUS <= RCU_FANOUT_1 | |
738 | 27 # define RCU_NUM_LVLS 1 | |
739 | 28 # define NUM_RCU_LVL_0 1 | |
740 | 29 # define NUM_RCU_NODES NUM_RCU_LVL_0 | |
741 | 30 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 } | |
742 | 31 # define RCU_NODE_NAME_INIT { "rcu_node_0" } | |
743 | 32 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" } | |
744 | 33 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0" } | |
745 | 34 #elif NR_CPUS <= RCU_FANOUT_2 | |
746 | 35 # define RCU_NUM_LVLS 2 | |
747 | 36 # define NUM_RCU_LVL_0 1 | |
748 | 37 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) | |
749 | 38 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1) | |
750 | 39 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 } | |
751 | 40 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" } | |
752 | 41 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" } | |
753 | 42 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1" } | |
754 | 43 #elif NR_CPUS <= RCU_FANOUT_3 | |
755 | 44 # define RCU_NUM_LVLS 3 | |
756 | 45 # define NUM_RCU_LVL_0 1 | |
757 | 46 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) | |
758 | 47 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) | |
759 | 48 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2) | |
760 | 49 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 } | |
761 | 50 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" } | |
762 | 51 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" } | |
763 | 52 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2" } | |
764 | 53 #elif NR_CPUS <= RCU_FANOUT_4 | |
765 | 54 # define RCU_NUM_LVLS 4 | |
766 | 55 # define NUM_RCU_LVL_0 1 | |
767 | 56 # define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3) | |
768 | 57 # define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2) | |
769 | 58 # define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1) | |
770 | 59 # define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) | |
771 | 60 # define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 } | |
772 | 61 # define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" } | |
773 | 62 # define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" } | |
774 | 63 # define RCU_EXP_NAME_INIT { "rcu_node_exp_0", "rcu_node_exp_1", "rcu_node_exp_2", "rcu_node_exp_3" } | |
775 | 64 #else | |
776 | 65 # error "CONFIG_RCU_FANOUT insufficient for NR_CPUS" | |
777 | 66 #endif | |
778 | </pre> | |
779 | ||
780 | <p>The maximum number of levels in the <tt>rcu_node</tt> structure | |
781 | is currently limited to four, as specified by lines 21-24 | |
782 | and the structure of the subsequent “if” statement. | |
783 | For 32-bit systems, this allows 16*32*32*32=524,288 CPUs, which | |
784 | should be sufficient for the next few years at least. | |
785 | For 64-bit systems, 16*64*64*64=4,194,304 CPUs is allowed, which | |
786 | should see us through the next decade or so. | |
787 | This four-level tree also allows kernels built with | |
788 | <tt>CONFIG_RCU_FANOUT=8</tt> to support up to 4096 CPUs, | |
789 | which might be useful in very large systems having eight CPUs per | |
790 | socket (but please note that no one has yet shown any measurable | |
791 | performance degradation due to misaligned socket and <tt>rcu_node</tt> | |
792 | boundaries). | |
793 | In addition, building kernels with a full four levels of <tt>rcu_node</tt> | |
794 | tree permits better testing of RCU's combining-tree code. | |
795 | ||
796 | </p><p>The <tt>RCU_FANOUT</tt> symbol controls how many children | |
797 | are permitted at each non-leaf level of the <tt>rcu_node</tt> tree. | |
798 | If the <tt>CONFIG_RCU_FANOUT</tt> Kconfig option is not specified, | |
799 | it is set based on the word size of the system, which is also | |
800 | the Kconfig default. | |
801 | ||
802 | </p><p>The <tt>RCU_FANOUT_LEAF</tt> symbol controls how many CPUs are | |
803 | handled by each leaf <tt>rcu_node</tt> structure. | |
804 | Experience has shown that allowing a given leaf <tt>rcu_node</tt> | |
805 | structure to handle 64 CPUs, as permitted by the number of bits in | |
806 | the <tt>->qsmask</tt> field on a 64-bit system, results in | |
807 | excessive contention for the leaf <tt>rcu_node</tt> structures' | |
808 | <tt>->lock</tt> fields. | |
809 | The number of CPUs per leaf <tt>rcu_node</tt> structure is therefore | |
810 | limited to 16 given the default value of <tt>CONFIG_RCU_FANOUT_LEAF</tt>. | |
811 | If <tt>CONFIG_RCU_FANOUT_LEAF</tt> is unspecified, the value | |
812 | selected is based on the word size of the system, just as for | |
813 | <tt>CONFIG_RCU_FANOUT</tt>. | |
814 | Lines 11-19 perform this computation. | |
815 | ||
816 | </p><p>Lines 21-24 compute the maximum number of CPUs supported by | |
817 | a single-level (which contains a single <tt>rcu_node</tt> structure), | |
818 | two-level, three-level, and four-level <tt>rcu_node</tt> tree, | |
819 | respectively, given the fanout specified by <tt>RCU_FANOUT</tt> | |
820 | and <tt>RCU_FANOUT_LEAF</tt>. | |
821 | These numbers of CPUs are retained in the | |
822 | <tt>RCU_FANOUT_1</tt>, | |
823 | <tt>RCU_FANOUT_2</tt>, | |
824 | <tt>RCU_FANOUT_3</tt>, and | |
825 | <tt>RCU_FANOUT_4</tt> | |
826 | C-preprocessor variables, respectively. | |
827 | ||
828 | </p><p>These variables are used to control the C-preprocessor <tt>#if</tt> | |
829 | statement spanning lines 26-66 that computes the number of | |
830 | <tt>rcu_node</tt> structures required for each level of the tree, | |
831 | as well as the number of levels required. | |
832 | The number of levels is placed in the <tt>NUM_RCU_LVLS</tt> | |
833 | C-preprocessor variable by lines 27, 35, 44, and 54. | |
834 | The number of <tt>rcu_node</tt> structures for the topmost level | |
835 | of the tree is always exactly one, and this value is unconditionally | |
836 | placed into <tt>NUM_RCU_LVL_0</tt> by lines 28, 36, 45, and 55. | |
837 | The rest of the levels (if any) of the <tt>rcu_node</tt> tree | |
838 | are computed by dividing the maximum number of CPUs by the | |
839 | fanout supported by the number of levels from the current level down, | |
840 | rounding up. This computation is performed by lines 37, | |
841 | 46-47, and 56-58. | |
Lines 31-33, 40-42, 50-52, and 61-63 create initializers
for lockdep lock-class names.
844 | Finally, lines 64-66 produce an error if the maximum number of | |
845 | CPUs is too large for the specified fanout. | |
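<p>The effect of this preprocessor ladder can be illustrated by a
hypothetical run-time equivalent (this is not kernel code; it merely
mirrors the capacity check performed by the <tt>#if</tt> statement):

```c
#include <assert.h>

/* Hypothetical run-time rendition of the preprocessor computation:
 * return the number of rcu_node levels needed for nr_cpus CPUs given
 * the two fanouts, or 0 if even four levels are insufficient. */
static int rcu_calc_num_lvls(long nr_cpus, long fanout, long fanout_leaf)
{
	long capacity = fanout_leaf;	/* RCU_FANOUT_1 */
	int lvls = 1;

	while (nr_cpus > capacity && lvls < 4) {	/* up to RCU_FANOUT_4 */
		capacity *= fanout;	/* RCU_FANOUT_2, _3, _4 in turn */
		lvls++;
	}
	return nr_cpus <= capacity ? lvls : 0;
}
```

For example, with a leaf fanout of 16 and an interior fanout of 64,
a 17-CPU system needs a two-level tree, and 16*64*64*64=4,194,304 CPUs
just fits in four levels, matching the arithmetic above.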
846 | ||
847 | <h3><a name="The rcu_data Structure"> | |
848 | The <tt>rcu_data</tt> Structure</a></h3> | |
849 | ||
<p>The <tt>rcu_data</tt> structure maintains the per-CPU state for the
corresponding flavor of RCU.
852 | The fields in this structure may be accessed only from the corresponding | |
853 | CPU (and from tracing) unless otherwise stated. | |
854 | This structure is the | |
855 | focus of quiescent-state detection and RCU callback queuing. | |
856 | It also tracks its relationship to the corresponding leaf | |
857 | <tt>rcu_node</tt> structure to allow more-efficient | |
858 | propagation of quiescent states up the <tt>rcu_node</tt> | |
859 | combining tree. | |
Like the <tt>rcu_node</tt> structure, it provides a local
copy of the grace-period information to allow lock-free
access to this information from the corresponding CPU.
864 | Finally, this structure records past dyntick-idle state | |
865 | for the corresponding CPU and also tracks statistics. | |
866 | ||
867 | </p><p>The <tt>rcu_data</tt> structure's fields are discussed, | |
868 | singly and in groups, in the following sections. | |
869 | ||
870 | <h5>Connection to Other Data Structures</h5> | |
871 | ||
872 | <p>This portion of the <tt>rcu_data</tt> structure is declared | |
873 | as follows: | |
874 | ||
875 | <pre> | |
876 | 1 int cpu; | |
877 | 2 struct rcu_state *rsp; | |
878 | 3 struct rcu_node *mynode; | |
879 | 4 struct rcu_dynticks *dynticks; | |
880 | 5 unsigned long grpmask; | |
881 | 6 bool beenonline; | |
882 | </pre> | |
883 | ||
884 | <p>The <tt>->cpu</tt> field contains the number of the | |
885 | corresponding CPU, the <tt>->rsp</tt> pointer references | |
886 | the corresponding <tt>rcu_state</tt> structure (and is most frequently | |
887 | used to locate the name of the corresponding flavor of RCU for tracing), | |
888 | and the <tt>->mynode</tt> field references the corresponding | |
889 | <tt>rcu_node</tt> structure. | |
The <tt>->mynode</tt> field is used to propagate quiescent states
up the combining tree.
892 | <p>The <tt>->dynticks</tt> pointer references the | |
893 | <tt>rcu_dynticks</tt> structure corresponding to this | |
894 | CPU. | |
895 | Recall that a single per-CPU instance of the <tt>rcu_dynticks</tt> | |
896 | structure is shared among all flavors of RCU. | |
These first four fields are constant and therefore require no
synchronization.
899 | ||
900 | </p><p>The <tt>->grpmask</tt> field indicates the bit in | |
901 | the <tt>->mynode->qsmask</tt> corresponding to this | |
902 | <tt>rcu_data</tt> structure, and is also used when propagating | |
903 | quiescent states. | |
904 | The <tt>->beenonline</tt> flag is set whenever the corresponding | |
905 | CPU comes online, which means that the debugfs tracing need not dump | |
906 | out any <tt>rcu_data</tt> structure for which this flag is not set. | |
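<p>As a rough sketch, assuming the leaf's lowest-numbered CPU is
recorded in a <tt>grplo</tt>-style field as in the <tt>rcu_node</tt>
structure, the <tt>->grpmask</tt> bit can be computed as follows
(illustrative only, not the kernel's initialization code):

```c
#include <assert.h>

/* Sketch of ->grpmask: the bit in the leaf rcu_node's ->qsmask that
 * corresponds to a given CPU, where grplo stands for the lowest-numbered
 * CPU handled by that leaf. */
static unsigned long rcu_grpmask(int cpu, int grplo)
{
	return 1UL << (cpu - grplo);
}
```

Each CPU attached to a given leaf thus owns a distinct bit, which is
what allows a single word to record which CPUs still owe quiescent
states.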
907 | ||
908 | <h5>Quiescent-State and Grace-Period Tracking</h5> | |
909 | ||
910 | <p>This portion of the <tt>rcu_data</tt> structure is declared | |
911 | as follows: | |
912 | ||
913 | <pre> | |
914 | 1 unsigned long completed; | |
915 | 2 unsigned long gpnum; | |
916 | 3 bool cpu_no_qs; | |
917 | 4 bool core_needs_qs; | |
918 | 5 bool gpwrap; | |
919 | 6 unsigned long rcu_qs_ctr_snap; | |
920 | </pre> | |
921 | ||
922 | <p>The <tt>completed</tt> and <tt>gpnum</tt> | |
923 | fields are the counterparts of the fields of the same name | |
924 | in the <tt>rcu_state</tt> and <tt>rcu_node</tt> structures. | |
925 | They may each lag up to one behind their <tt>rcu_node</tt> | |
926 | counterparts, but in <tt>CONFIG_NO_HZ_IDLE</tt> and | |
927 | <tt>CONFIG_NO_HZ_FULL</tt> kernels can lag | |
928 | arbitrarily far behind for CPUs in dyntick-idle mode (but these counters | |
929 | will catch up upon exit from dyntick-idle mode). | |
If a given <tt>rcu_data</tt> structure's <tt>->gpnum</tt> and
<tt>->completed</tt> fields are equal, then this <tt>rcu_data</tt>
structure believes that RCU is idle.
Otherwise, as with the <tt>rcu_state</tt> and <tt>rcu_node</tt>
structures,
the <tt>->gpnum</tt> field will be one greater than the
<tt>->completed</tt> field, with <tt>->gpnum</tt>
indicating which grace period this <tt>rcu_data</tt> structure believes
is still being waited for.
939 | ||
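<p>The resulting idleness check can be sketched as follows (a
hypothetical user-space reduction of the two counters, not the
kernel's code):

```c
#include <assert.h>
#include <stdbool.h>

/* A CPU's local view of grace-period state, reduced to the two
 * counters discussed above. */
struct gp_view {
	unsigned long gpnum;		/* most recent GP start seen */
	unsigned long completed;	/* most recent GP completion seen */
};

/* Equal counters: this rcu_data structure believes RCU is idle. */
static bool rcu_believes_idle(const struct gp_view *v)
{
	return v->gpnum == v->completed;
}
```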
940 | <table> | |
941 | <tr><th> </th></tr> | |
942 | <tr><th align="left">Quick Quiz:</th></tr> | |
943 | <tr><td> | |
944 | All this replication of the grace period numbers can only cause | |
945 | massive confusion. | |
946 | Why not just keep a global pair of counters and be done with it??? | |
947 | </td></tr> | |
948 | <tr><th align="left">Answer:</th></tr> | |
949 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
Because if there were only a single global pair of grace-period
951 | numbers, there would need to be a single global lock to allow | |
952 | safely accessing and updating them. | |
953 | And if we are not going to have a single global lock, we need | |
954 | to carefully manage the numbers on a per-node basis. | |
955 | Recall from the answer to a previous Quick Quiz that the consequences | |
956 | of applying a previously sampled quiescent state to the wrong | |
957 | grace period are quite severe. | |
958 | </font></td></tr> | |
959 | <tr><td> </td></tr> | |
960 | </table> | |
961 | ||
962 | <p>The <tt>->cpu_no_qs</tt> flag indicates that the | |
963 | CPU has not yet passed through a quiescent state, | |
964 | while the <tt>->core_needs_qs</tt> flag indicates that the | |
965 | RCU core needs a quiescent state from the corresponding CPU. | |
966 | The <tt>->gpwrap</tt> field indicates that the corresponding | |
967 | CPU has remained idle for so long that the <tt>completed</tt> | |
968 | and <tt>gpnum</tt> counters are in danger of overflow, which | |
969 | will cause the CPU to disregard the values of its counters on | |
970 | its next exit from idle. | |
971 | Finally, the <tt>rcu_qs_ctr_snap</tt> field is used to detect | |
972 | cases where a given operation has resulted in a quiescent state | |
973 | for all flavors of RCU, for example, <tt>cond_resched_rcu_qs()</tt>. | |
974 | ||
975 | <h5>RCU Callback Handling</h5> | |
976 | ||
977 | <p>In the absence of CPU-hotplug events, RCU callbacks are invoked by | |
978 | the same CPU that registered them. | |
979 | This is strictly a cache-locality optimization: callbacks can and | |
980 | do get invoked on CPUs other than the one that registered them. | |
981 | After all, if the CPU that registered a given callback has gone | |
982 | offline before the callback can be invoked, there really is no other | |
983 | choice. | |
984 | ||
985 | </p><p>This portion of the <tt>rcu_data</tt> structure is declared | |
986 | as follows: | |
987 | ||
988 | <pre> | |
989 | 1 struct rcu_head *nxtlist; | |
990 | 2 struct rcu_head **nxttail[RCU_NEXT_SIZE]; | |
991 | 3 unsigned long nxtcompleted[RCU_NEXT_SIZE]; | |
992 | 4 long qlen_lazy; | |
993 | 5 long qlen; | |
994 | 6 long qlen_last_fqs_check; | |
995 | 7 unsigned long n_force_qs_snap; | |
996 | 8 unsigned long n_cbs_invoked; | |
997 | 9 unsigned long n_cbs_orphaned; | |
998 | 10 unsigned long n_cbs_adopted; | |
999 | 11 long blimit; | |
1000 | </pre> | |
1001 | ||
1002 | <p>The <tt>->nxtlist</tt> pointer and the | |
1003 | <tt>->nxttail[]</tt> array form a four-segment list with | |
1004 | older callbacks near the head and newer ones near the tail. | |
1005 | Each segment contains callbacks with the corresponding relationship | |
1006 | to the current grace period. | |
1007 | The pointer out of the end of each of the four segments is referenced | |
1008 | by the element of the <tt>->nxttail[]</tt> array indexed by | |
1009 | <tt>RCU_DONE_TAIL</tt> (for callbacks handled by a prior grace period), | |
1010 | <tt>RCU_WAIT_TAIL</tt> (for callbacks waiting on the current grace period), | |
1011 | <tt>RCU_NEXT_READY_TAIL</tt> (for callbacks that will wait on the next | |
1012 | grace period), and | |
1013 | <tt>RCU_NEXT_TAIL</tt> (for callbacks that are not yet associated | |
1014 | with a specific grace period) | |
1015 | respectively, as shown in the following figure. | |
1016 | ||
1017 | </p><p><img src="nxtlist.svg" alt="nxtlist.svg" width="40%"> | |
1018 | ||
1019 | </p><p>In this figure, the <tt>->nxtlist</tt> pointer references the | |
1020 | first | |
1021 | RCU callback in the list. | |
1022 | The <tt>->nxttail[RCU_DONE_TAIL]</tt> array element references | |
1023 | the <tt>->nxtlist</tt> pointer itself, indicating that none | |
1024 | of the callbacks is ready to invoke. | |
1025 | The <tt>->nxttail[RCU_WAIT_TAIL]</tt> array element references callback | |
1026 | CB 2's <tt>->next</tt> pointer, which indicates that | |
1027 | CB 1 and CB 2 are both waiting on the current grace period. | |
1028 | The <tt>->nxttail[RCU_NEXT_READY_TAIL]</tt> array element | |
1029 | references the same RCU callback that <tt>->nxttail[RCU_WAIT_TAIL]</tt> | |
1030 | does, which indicates that there are no callbacks waiting on the next | |
1031 | RCU grace period. | |
1032 | The <tt>->nxttail[RCU_NEXT_TAIL]</tt> array element references | |
1033 | CB 4's <tt>->next</tt> pointer, indicating that all the | |
1034 | remaining RCU callbacks have not yet been assigned to an RCU grace | |
1035 | period. | |
1036 | Note that the <tt>->nxttail[RCU_NEXT_TAIL]</tt> array element | |
1037 | always references the last RCU callback's <tt>->next</tt> pointer | |
1038 | unless the callback list is empty, in which case it references | |
1039 | the <tt>->nxtlist</tt> pointer. | |
1040 | ||
1041 | </p><p>CPUs advance their callbacks from the | |
1042 | <tt>RCU_NEXT_TAIL</tt> to the <tt>RCU_NEXT_READY_TAIL</tt> to the | |
1043 | <tt>RCU_WAIT_TAIL</tt> to the <tt>RCU_DONE_TAIL</tt> list segments | |
1044 | as grace periods advance. | |
1045 | The CPU advances the callbacks in its <tt>rcu_data</tt> structure | |
1046 | whenever it notices that another RCU grace period has completed. | |
1047 | The CPU detects the completion of an RCU grace period by noticing | |
1048 | that the value of its <tt>rcu_data</tt> structure's | |
1049 | <tt>->completed</tt> field differs from that of its leaf | |
1050 | <tt>rcu_node</tt> structure. | |
1051 | Recall that each <tt>rcu_node</tt> structure's | |
1052 | <tt>->completed</tt> field is updated at the end of each | |
1053 | grace period. | |
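<p>The segmented list and its advancement can be sketched in miniature
as follows (hypothetical user-space code; the real kernel logic is
considerably more involved and consults <tt>->nxtcompleted[]</tt>):

```c
#include <assert.h>
#include <stddef.h>

/* Indices into ->nxttail[], following the kernel's naming. */
enum { RCU_DONE_TAIL, RCU_WAIT_TAIL, RCU_NEXT_READY_TAIL,
       RCU_NEXT_TAIL, RCU_NEXT_SIZE };

struct cb_sketch { struct cb_sketch *next; };

/* Hypothetical miniature of the ->nxtlist/->nxttail[] scheme. */
struct cblist_sketch {
	struct cb_sketch *nxtlist;
	struct cb_sketch **nxttail[RCU_NEXT_SIZE];
};

static void cblist_init(struct cblist_sketch *cl)
{
	int i;

	cl->nxtlist = NULL;
	for (i = 0; i < RCU_NEXT_SIZE; i++)
		cl->nxttail[i] = &cl->nxtlist;	/* empty: all tails at head */
}

/* call_rcu() equivalent: enqueue after RCU_NEXT_TAIL, not yet
 * associated with any grace period. */
static void cblist_enqueue(struct cblist_sketch *cl, struct cb_sketch *c)
{
	c->next = NULL;
	*cl->nxttail[RCU_NEXT_TAIL] = c;
	cl->nxttail[RCU_NEXT_TAIL] = &c->next;
}

/* One grace period elapses: each segment moves one step toward
 * RCU_DONE_TAIL.  (A simplification: the real code can move segments
 * by more than one step at a time.) */
static void cblist_advance(struct cblist_sketch *cl)
{
	cl->nxttail[RCU_DONE_TAIL] = cl->nxttail[RCU_WAIT_TAIL];
	cl->nxttail[RCU_WAIT_TAIL] = cl->nxttail[RCU_NEXT_READY_TAIL];
	cl->nxttail[RCU_NEXT_READY_TAIL] = cl->nxttail[RCU_NEXT_TAIL];
}

/* Count the callbacks whose grace period has already elapsed. */
static int cblist_n_done(struct cblist_sketch *cl)
{
	struct cb_sketch **cpp;
	int n = 0;

	for (cpp = &cl->nxtlist; cpp != cl->nxttail[RCU_DONE_TAIL];
	     cpp = &(*cpp)->next)
		n++;
	return n;
}
```

Note how the pointer-to-pointer representation makes an empty segment
simply two adjacent <tt>->nxttail[]</tt> entries referencing the same
<tt>->next</tt> pointer, exactly as in the figure above.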
1054 | ||
1055 | </p><p>The <tt>->nxtcompleted[]</tt> array records grace-period | |
1056 | numbers corresponding to the list segments. | |
1057 | This allows CPUs that go idle for extended periods to determine | |
1058 | which of their callbacks are ready to be invoked after reawakening. | |
1059 | ||
</p><p>The <tt>->qlen</tt> counter contains the number of
callbacks in <tt>->nxtlist</tt>, and the
<tt>->qlen_lazy</tt> counter contains the number of those callbacks that
are known to only free memory, and whose invocation can therefore
be safely deferred.
The <tt>->qlen_last_fqs_check</tt> and
<tt>->n_force_qs_snap</tt> fields coordinate the forcing of quiescent
states from <tt>call_rcu()</tt> and friends when callback
lists grow excessively long.
1069 | ||
1070 | </p><p>The <tt>->n_cbs_invoked</tt>, | |
1071 | <tt>->n_cbs_orphaned</tt>, and <tt>->n_cbs_adopted</tt> | |
1072 | fields count the number of callbacks invoked, | |
1073 | sent to other CPUs when this CPU goes offline, | |
1074 | and received from other CPUs when those other CPUs go offline. | |
1075 | Finally, the <tt>->blimit</tt> counter is the maximum number of | |
1076 | RCU callbacks that may be invoked at a given time. | |
1077 | ||
1078 | <h5>Dyntick-Idle Handling</h5> | |
1079 | ||
1080 | <p>This portion of the <tt>rcu_data</tt> structure is declared | |
1081 | as follows: | |
1082 | ||
1083 | <pre> | |
1084 | 1 int dynticks_snap; | |
1085 | 2 unsigned long dynticks_fqs; | |
1086 | </pre> | |
1087 | ||
1088 | The <tt>->dynticks_snap</tt> field is used to take a snapshot | |
1089 | of the corresponding CPU's dyntick-idle state when forcing | |
1090 | quiescent states, and is therefore accessed from other CPUs. | |
1091 | Finally, the <tt>->dynticks_fqs</tt> field is used to | |
1092 | count the number of times this CPU is determined to be in | |
1093 | dyntick-idle state, and is used for tracing and debugging purposes. | |
1094 | ||
1095 | <h3><a name="The rcu_dynticks Structure"> | |
1096 | The <tt>rcu_dynticks</tt> Structure</a></h3> | |
1097 | ||
<p>The <tt>rcu_dynticks</tt> structure maintains the per-CPU dyntick-idle state
for the corresponding CPU.
1100 | Unlike the other structures, <tt>rcu_dynticks</tt> is not | |
1101 | replicated over the different flavors of RCU. | |
1102 | The fields in this structure may be accessed only from the corresponding | |
1103 | CPU (and from tracing) unless otherwise stated. | |
1104 | Its fields are as follows: | |
1105 | ||
1106 | <pre> | |
1107 | 1 int dynticks_nesting; | |
1108 | 2 int dynticks_nmi_nesting; | |
1109 | 3 atomic_t dynticks; | |
1110 | </pre> | |
1111 | ||
1112 | <p>The <tt>->dynticks_nesting</tt> field counts the | |
1113 | nesting depth of normal interrupts. | |
1114 | In addition, this counter is incremented when exiting dyntick-idle | |
1115 | mode and decremented when entering it. | |
1116 | This counter can therefore be thought of as counting the number | |
1117 | of reasons why this CPU cannot be permitted to enter dyntick-idle | |
1118 | mode, aside from non-maskable interrupts (NMIs). | |
1119 | NMIs are counted by the <tt>->dynticks_nmi_nesting</tt> | |
1120 | field, except that NMIs that interrupt non-dyntick-idle execution | |
1121 | are not counted. | |
1122 | ||
1123 | </p><p>Finally, the <tt>->dynticks</tt> field counts the corresponding | |
1124 | CPU's transitions to and from dyntick-idle mode, so that this counter | |
1125 | has an even value when the CPU is in dyntick-idle mode and an odd | |
1126 | value otherwise. | |
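<p>This even/odd convention can be sketched as follows (a
plain-<tt>int</tt> user-space reduction; the kernel uses
<tt>atomic_t</tt> with careful memory ordering):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the ->dynticks even/odd convention: one increment per
 * transition, so an even value means the CPU is in dyntick-idle mode. */
struct dynticks_sketch { int dynticks; };

static void dynticks_idle_enter(struct dynticks_sketch *d)
{
	d->dynticks++;	/* odd -> even */
}

static void dynticks_idle_exit(struct dynticks_sketch *d)
{
	d->dynticks++;	/* even -> odd */
}

static bool dynticks_is_idle(const struct dynticks_sketch *d)
{
	return (d->dynticks & 0x1) == 0;
}

/* Forcing quiescent states compares a snapshot against the current
 * value: if the snapshot showed idle, or the counter has since moved,
 * the CPU has passed through a quiescent state. */
static bool dynticks_in_qs_since(const struct dynticks_sketch *d, int snap)
{
	return (snap & 0x1) == 0 || d->dynticks != snap;
}
```

This is why a remote CPU can take a snapshot (as recorded in the
<tt>rcu_data</tt> structure's <tt>->dynticks_snap</tt> field) and later
decide, without interrupting the target CPU, whether that CPU has been
idle in the meantime.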
1127 | ||
1128 | <table> | |
1129 | <tr><th> </th></tr> | |
1130 | <tr><th align="left">Quick Quiz:</th></tr> | |
1131 | <tr><td> | |
1132 | Why not just count all NMIs? | |
1133 | Wouldn't that be simpler and less error prone? | |
1134 | </td></tr> | |
1135 | <tr><th align="left">Answer:</th></tr> | |
1136 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
1137 | It seems simpler only until you think hard about how to go about | |
1138 | updating the <tt>rcu_dynticks</tt> structure's | |
1139 | <tt>->dynticks</tt> field. | |
1140 | </font></td></tr> | |
1141 | <tr><td> </td></tr> | |
1142 | </table> | |
1143 | ||
1144 | <p>Additional fields are present for some special-purpose | |
1145 | builds, and are discussed separately. | |
1146 | ||
1147 | <h3><a name="The rcu_head Structure"> | |
1148 | The <tt>rcu_head</tt> Structure</a></h3> | |
1149 | ||
1150 | <p>Each <tt>rcu_head</tt> structure represents an RCU callback. | |
1151 | These structures are normally embedded within RCU-protected data | |
1152 | structures whose algorithms use asynchronous grace periods. | |
1153 | In contrast, when using algorithms that block waiting for RCU grace periods, | |
1154 | RCU users need not provide <tt>rcu_head</tt> structures. | |
1155 | ||
1156 | </p><p>The <tt>rcu_head</tt> structure has fields as follows: | |
1157 | ||
1158 | <pre> | |
1159 | 1 struct rcu_head *next; | |
1160 | 2 void (*func)(struct rcu_head *head); | |
1161 | </pre> | |
1162 | ||
1163 | <p>The <tt>->next</tt> field is used | |
1164 | to link the <tt>rcu_head</tt> structures together in the | |
1165 | lists within the <tt>rcu_data</tt> structures. | |
1166 | The <tt>->func</tt> field is a pointer to the function | |
1167 | to be called when the callback is ready to be invoked, and | |
1168 | this function is passed a pointer to the <tt>rcu_head</tt> | |
1169 | structure. | |
1170 | However, <tt>kfree_rcu()</tt> uses the <tt>->func</tt> | |
1171 | field to record the offset of the <tt>rcu_head</tt> | |
1172 | structure within the enclosing RCU-protected data structure. | |
1173 | ||
1174 | </p><p>Both of these fields are used internally by RCU. | |
1175 | From the viewpoint of RCU users, this structure is an | |
1176 | opaque “cookie”. | |
1177 | ||
1178 | <table> | |
1179 | <tr><th> </th></tr> | |
1180 | <tr><th align="left">Quick Quiz:</th></tr> | |
1181 | <tr><td> | |
1182 | Given that the callback function <tt>->func</tt> | |
1183 | is passed a pointer to the <tt>rcu_head</tt> structure, | |
1184 | how is that function supposed to find the beginning of the | |
1185 | enclosing RCU-protected data structure? | |
1186 | </td></tr> | |
1187 | <tr><th align="left">Answer:</th></tr> | |
1188 | <tr><td bgcolor="#ffffff"><font color="ffffff"> | |
1189 | In actual practice, there is a separate callback function per | |
1190 | type of RCU-protected data structure. | |
1191 | The callback function can therefore use the <tt>container_of()</tt> | |
1192 | macro in the Linux kernel (or other pointer-manipulation facilities | |
1193 | in other software environments) to find the beginning of the | |
1194 | enclosing structure. | |
1195 | </font></td></tr> | |
1196 | <tr><td> </td></tr> | |
1197 | </table> | |
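<p>The <tt>container_of()</tt> technique from the answer above can be
demonstrated in user space (the structure and function names here are
hypothetical, chosen for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* User-space rendition of the two rcu_head fields shown above. */
struct rcu_head_sketch {
	struct rcu_head_sketch *next;
	void (*func)(struct rcu_head_sketch *head);
};

/* An RCU-protected structure embedding its callback "cookie". */
struct foo {
	int a;
	struct rcu_head_sketch rh;
};

/* container_of() as in the Linux kernel, minus the type checking. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static int last_freed_a;	/* records the last value seen, for demonstration */

/* The per-type callback recovers the enclosing structure from the
 * rcu_head pointer, uses it, and frees it. */
static void foo_reclaim(struct rcu_head_sketch *head)
{
	struct foo *fp = container_of(head, struct foo, rh);

	last_freed_a = fp->a;
	free(fp);
}
```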
1198 | ||
1199 | <h3><a name="RCU-Specific Fields in the task_struct Structure"> | |
1200 | RCU-Specific Fields in the <tt>task_struct</tt> Structure</a></h3> | |
1201 | ||
1202 | <p>The <tt>CONFIG_PREEMPT_RCU</tt> implementation uses some | |
1203 | additional fields in the <tt>task_struct</tt> structure: | |
1204 | ||
1205 | <pre> | |
1206 | 1 #ifdef CONFIG_PREEMPT_RCU | |
1207 | 2 int rcu_read_lock_nesting; | |
1208 | 3 union rcu_special rcu_read_unlock_special; | |
1209 | 4 struct list_head rcu_node_entry; | |
1210 | 5 struct rcu_node *rcu_blocked_node; | |
1211 | 6 #endif /* #ifdef CONFIG_PREEMPT_RCU */ | |
1212 | 7 #ifdef CONFIG_TASKS_RCU | |
1213 | 8 unsigned long rcu_tasks_nvcsw; | |
1214 | 9 bool rcu_tasks_holdout; | |
1215 | 10 struct list_head rcu_tasks_holdout_list; | |
1216 | 11 int rcu_tasks_idle_cpu; | |
1217 | 12 #endif /* #ifdef CONFIG_TASKS_RCU */ | |
1218 | </pre> | |
1219 | ||
1220 | <p>The <tt>->rcu_read_lock_nesting</tt> field records the | |
1221 | nesting level for RCU read-side critical sections, and | |
1222 | the <tt>->rcu_read_unlock_special</tt> field is a bitmask | |
1223 | that records special conditions that require <tt>rcu_read_unlock()</tt> | |
1224 | to do additional work. | |
1225 | The <tt>->rcu_node_entry</tt> field is used to form lists of | |
1226 | tasks that have blocked within preemptible-RCU read-side critical | |
1227 | sections and the <tt>->rcu_blocked_node</tt> field references | |
1228 | the <tt>rcu_node</tt> structure whose list this task is a member of, | |
1229 | or <tt>NULL</tt> if it is not blocked within a preemptible-RCU | |
1230 | read-side critical section. | |
1231 | ||
1232 | <p>The <tt>->rcu_tasks_nvcsw</tt> field tracks the number of | |
1233 | voluntary context switches that this task had undergone at the | |
1234 | beginning of the current tasks-RCU grace period, | |
1235 | <tt>->rcu_tasks_holdout</tt> is set if the current tasks-RCU | |
1236 | grace period is waiting on this task, <tt>->rcu_tasks_holdout_list</tt> | |
1237 | is a list element enqueuing this task on the holdout list, | |
and <tt>->rcu_tasks_idle_cpu</tt> tracks which CPU this
idle task is running on, but only if the task is currently running,
that is, if the CPU is currently idle.
1241 | ||
1242 | <h3><a name="Accessor Functions"> | |
1243 | Accessor Functions</a></h3> | |
1244 | ||
<p>The following listing shows the
<tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first()</tt>,
<tt>rcu_for_each_nonleaf_node_breadth_first()</tt>, and
<tt>rcu_for_each_leaf_node()</tt> function and macros:
1249 | ||
1250 | <pre> | |
1251 | 1 static struct rcu_node *rcu_get_root(struct rcu_state *rsp) | |
1252 | 2 { | |
1253 | 3 return &rsp->node[0]; | |
1254 | 4 } | |
1255 | 5 | |
1256 | 6 #define rcu_for_each_node_breadth_first(rsp, rnp) \ | |
1257 | 7 for ((rnp) = &(rsp)->node[0]; \ | |
1258 | 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) | |
1259 | 9 | |
1260 | 10 #define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \ | |
1261 | 11 for ((rnp) = &(rsp)->node[0]; \ | |
1262 | 12 (rnp) < (rsp)->level[NUM_RCU_LVLS - 1]; (rnp)++) | |
1263 | 13 | |
1264 | 14 #define rcu_for_each_leaf_node(rsp, rnp) \ | |
1265 | 15 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ | |
1266 | 16 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) | |
1267 | </pre> | |
1268 | ||
<p>The <tt>rcu_get_root()</tt> function simply returns a pointer to the
first element of the specified <tt>rcu_state</tt> structure's
1271 | <tt>->node[]</tt> array, which is the root <tt>rcu_node</tt> | |
1272 | structure. | |
1273 | ||
1274 | </p><p>As noted earlier, the <tt>rcu_for_each_node_breadth_first()</tt> | |
1275 | macro takes advantage of the layout of the <tt>rcu_node</tt> | |
1276 | structures in the <tt>rcu_state</tt> structure's | |
1277 | <tt>->node[]</tt> array, performing a breadth-first traversal by | |
1278 | simply traversing the array in order. | |
1279 | The <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> macro operates | |
1280 | similarly, but traverses only the first part of the array, thus excluding | |
1281 | the leaf <tt>rcu_node</tt> structures. | |
1282 | Finally, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only | |
1283 | the last part of the array, thus traversing only the leaf | |
1284 | <tt>rcu_node</tt> structures. | |
1285 | ||
1286 | <table> | |
1287 | <tr><th> </th></tr> | |
1288 | <tr><th align="left">Quick Quiz:</th></tr> | |
1289 | <tr><td> | |
1290 | What do <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> and | |
1291 | <tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree | |
1292 | contains only a single node? | |
1293 | </td></tr> | |
1294 | <tr><th align="left">Answer:</th></tr> | |
1295 | <tr><td bgcolor="#ffffff"><font color="#ffffff"> | |
1296 | In the single-node case, | |
1297 | <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> is a no-op | |
1298 | and <tt>rcu_for_each_leaf_node()</tt> traverses the single node. | |
1299 | </font></td></tr> | |
1300 | <tr><td> </td></tr> | |
1301 | </table> | |
1302 | ||
1303 | <h3><a name="Summary"> | |
1304 | Summary</a></h3> | |
1305 | ||
1306 | So each flavor of RCU is represented by an <tt>rcu_state</tt> structure, | |
1307 | which contains a combining tree of <tt>rcu_node</tt> and | |
1308 | <tt>rcu_data</tt> structures. | |
1309 | Finally, in <tt>CONFIG_NO_HZ_IDLE</tt> kernels, each CPU's dyntick-idle | |
1310 | state is tracked by an <tt>rcu_dynticks</tt> structure. | |
1311 | ||
1312 | If you made it this far, you are well prepared to read the code | |
1313 | walkthroughs in the other articles in this series. | |
1314 | ||
1315 | <h3><a name="Acknowledgments"> | |
1316 | Acknowledgments</a></h3> | |
1317 | ||
1318 | I owe thanks to Cyrill Gorcunov, Mathieu Desnoyers, Dhaval Giani, Paul | |
1319 | Turner, Abhishek Srivastava, Matt Kowalczyk, and Serge Hallyn | |
1320 | for helping me get this document into a more human-readable state. | |
1321 | ||
1322 | <h3><a name="Legal Statement"> | |
1323 | Legal Statement</a></h3> | |
1324 | ||
1325 | <p>This work represents the view of the author and does not necessarily | |
1326 | represent the view of IBM. | |
1327 | ||
1328 | </p><p>Linux is a registered trademark of Linus Torvalds. | |
1329 | ||
1330 | </p><p>Other company, product, and service names may be trademarks or | |
1331 | service marks of others. | |
1332 | ||
1333 | </body></html> |