=================================
A Tour Through RCU's Requirements
=================================

Copyright IBM Corporation, 2015

Author: Paul E. McKenney

The initial version of this document appeared on
`LWN <https://lwn.net/>`_ in the following articles:
`part 1 <https://lwn.net/Articles/652156/>`_,
`part 2 <https://lwn.net/Articles/652677/>`_, and
`part 3 <https://lwn.net/Articles/653326/>`_.

Introduction
------------

Read-copy update (RCU) is a synchronization mechanism that is often used
as a replacement for reader-writer locking. RCU is unusual in that
updaters do not block readers, which means that RCU's read-side
primitives can be exceedingly fast and scalable. In addition, updaters
can make useful forward progress concurrently with readers. However, all
this concurrency between RCU readers and updaters does raise the
question of exactly what RCU readers are doing, which in turn raises the
question of exactly what RCU's requirements are.

This document therefore summarizes RCU's requirements, and can be
thought of as an informal, high-level specification for RCU. It is
important to understand that RCU's specification is primarily empirical
in nature; in fact, I learned about many of these requirements the hard
way. This situation might cause some consternation. However, not only
has this learning process been a lot of fun, but it has also been a
great privilege to work with so many people willing to apply
technologies in interesting new ways.

All that aside, here are the categories of currently known RCU
requirements:

#. `Fundamental Requirements`_
#. `Fundamental Non-Requirements`_
#. `Parallelism Facts of Life`_
#. `Quality-of-Implementation Requirements`_
#. `Linux Kernel Complications`_
#. `Software-Engineering Requirements`_
#. `Other RCU Flavors`_
#. `Possible Future Changes`_

This is followed by a `summary <#Summary>`__. Note that the answer to
each quick quiz immediately follows the quiz.

Fundamental Requirements
------------------------

RCU's fundamental requirements are the closest thing RCU has to hard
mathematical requirements. These are:

#. `Grace-Period Guarantee`_
#. `Publish/Subscribe Guarantee`_
#. `Memory-Barrier Guarantees`_
#. `RCU Primitives Guaranteed to Execute Unconditionally`_
#. `Guaranteed Read-to-Write Upgrade`_

Grace-Period Guarantee
~~~~~~~~~~~~~~~~~~~~~~

RCU's grace-period guarantee is unusual in being premeditated: Jack
Slingwine and I had this guarantee firmly in mind when we started work
on RCU (then called “rclock”) in the early 1990s. That said, the past
two decades of experience with RCU have produced a much more detailed
understanding of this guarantee.

RCU's grace-period guarantee allows updaters to wait for the completion
of all pre-existing RCU read-side critical sections. An RCU read-side
critical section begins with the marker ``rcu_read_lock()`` and ends
with the marker ``rcu_read_unlock()``. These markers may be nested, and
RCU treats a nested set as one big RCU read-side critical section.
Production-quality implementations of ``rcu_read_lock()`` and
``rcu_read_unlock()`` are extremely lightweight, and in fact have
exactly zero overhead in Linux kernels built for production use with
``CONFIG_PREEMPT=n``.

This guarantee allows ordering to be enforced with extremely low
overhead to readers, for example:

::

    1 int x, y;
    2
    3 void thread0(void)
    4 {
    5   rcu_read_lock();
    6   r1 = READ_ONCE(x);
    7   r2 = READ_ONCE(y);
    8   rcu_read_unlock();
    9 }
   10
   11 void thread1(void)
   12 {
   13   WRITE_ONCE(x, 1);
   14   synchronize_rcu();
   15   WRITE_ONCE(y, 1);
   16 }

Because the ``synchronize_rcu()`` on line 14 waits for all pre-existing
readers, any instance of ``thread0()`` that loads a value of zero from
``x`` must complete before ``thread1()`` stores to ``y``, so that
instance must also load a value of zero from ``y``. Similarly, any
instance of ``thread0()`` that loads a value of one from ``y`` must have
started after the ``synchronize_rcu()`` started, and must therefore also
load a value of one from ``x``. Therefore, the outcome:

::

   (r1 == 0 && r2 == 1)

cannot happen.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Wait a minute! You said that updaters can make useful forward         |
| progress concurrently with readers, but pre-existing readers will     |
| block ``synchronize_rcu()``!!!                                        |
| Just who are you trying to fool???                                    |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| First, if updaters do not wish to be blocked by readers, they can use |
| ``call_rcu()`` or ``kfree_rcu()``, which will be discussed later.     |
| Second, even when using ``synchronize_rcu()``, the other update-side  |
| code does run concurrently with readers, whether pre-existing or not. |
+-----------------------------------------------------------------------+

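For example, a minimal sketch of the non-blocking approach, under the
assumption of a ``struct foo`` containing an ``rcu_head`` (the structure
layout and the ``remove_gp_cb()``/``remove_gp_async()`` names are
illustrative, not part of the original example):

::

   struct foo {
           int a;
           int b;
           struct rcu_head rh;     /* Used by call_rcu(). */
   };

   /* Invoked only after a grace period has elapsed. */
   static void remove_gp_cb(struct rcu_head *rhp)
   {
           struct foo *p = container_of(rhp, struct foo, rh);

           kfree(p);
   }

   /* Removal returns immediately; freeing is deferred. */
   static bool remove_gp_async(void)
   {
           struct foo *p;

           spin_lock(&gp_lock);
           p = rcu_access_pointer(gp);
           if (!p) {
                   spin_unlock(&gp_lock);
                   return false;
           }
           rcu_assign_pointer(gp, NULL);
           call_rcu(&p->rh, remove_gp_cb);
           spin_unlock(&gp_lock);
           return true;
   }
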
This scenario resembles one of the first uses of RCU in
`DYNIX/ptx <https://en.wikipedia.org/wiki/DYNIX>`__, which managed a
distributed lock manager's transition into a state suitable for handling
recovery from node failure, more or less as follows:

::

    1 #define STATE_NORMAL        0
    2 #define STATE_WANT_RECOVERY 1
    3 #define STATE_RECOVERING    2
    4 #define STATE_WANT_NORMAL   3
    5
    6 int state = STATE_NORMAL;
    7
    8 void do_something_dlm(void)
    9 {
   10   int state_snap;
   11
   12   rcu_read_lock();
   13   state_snap = READ_ONCE(state);
   14   if (state_snap == STATE_NORMAL)
   15     do_something();
   16   else
   17     do_something_carefully();
   18   rcu_read_unlock();
   19 }
   20
   21 void start_recovery(void)
   22 {
   23   WRITE_ONCE(state, STATE_WANT_RECOVERY);
   24   synchronize_rcu();
   25   WRITE_ONCE(state, STATE_RECOVERING);
   26   recovery();
   27   WRITE_ONCE(state, STATE_WANT_NORMAL);
   28   synchronize_rcu();
   29   WRITE_ONCE(state, STATE_NORMAL);
   30 }

The RCU read-side critical section in ``do_something_dlm()`` works with
the ``synchronize_rcu()`` in ``start_recovery()`` to guarantee that
``do_something()`` never runs concurrently with ``recovery()``, but with
little or no synchronization overhead in ``do_something_dlm()``.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Why is the ``synchronize_rcu()`` on line 28 needed?                   |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Without that extra grace period, memory reordering could result in    |
| ``do_something_dlm()`` executing ``do_something()`` concurrently with |
| the last bits of ``recovery()``.                                      |
+-----------------------------------------------------------------------+

In order to avoid fatal problems such as deadlocks, an RCU read-side
critical section must not contain calls to ``synchronize_rcu()``.
Similarly, an RCU read-side critical section must not contain anything
that waits, directly or indirectly, on completion of an invocation of
``synchronize_rcu()``.

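To see why, consider this deliberately broken sketch (the function name
is made up for illustration):

::

   void buggy_reader(void)
   {
           rcu_read_lock();
           /*
            * Self-deadlock: the grace period cannot complete until
            * this read-side critical section ends, but the critical
            * section cannot end until synchronize_rcu() returns.
            */
           synchronize_rcu();
           rcu_read_unlock();
   }
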
Although RCU's grace-period guarantee is useful in and of itself, with
`quite a few use cases <https://lwn.net/Articles/573497/>`__, it would
be good to be able to use RCU to coordinate read-side access to linked
data structures. For this, the grace-period guarantee is not sufficient,
as can be seen in function ``add_gp_buggy()`` below. We will look at the
reader's code later, but in the meantime, just think of the reader as
locklessly picking up the ``gp`` pointer, and, if the value loaded is
non-\ ``NULL``, locklessly accessing the ``->a`` and ``->b`` fields.

::

    1 bool add_gp_buggy(int a, int b)
    2 {
    3   p = kmalloc(sizeof(*p), GFP_KERNEL);
    4   if (!p)
    5     return false;
    6   spin_lock(&gp_lock);
    7   if (rcu_access_pointer(gp)) {
    8     spin_unlock(&gp_lock);
    9     return false;
   10   }
   11   p->a = a;
   12   p->b = b;
   13   gp = p; /* ORDERING BUG */
   14   spin_unlock(&gp_lock);
   15   return true;
   16 }

The problem is that both the compiler and weakly ordered CPUs are within
their rights to reorder this code as follows:

::

    1 bool add_gp_buggy_optimized(int a, int b)
    2 {
    3   p = kmalloc(sizeof(*p), GFP_KERNEL);
    4   if (!p)
    5     return false;
    6   spin_lock(&gp_lock);
    7   if (rcu_access_pointer(gp)) {
    8     spin_unlock(&gp_lock);
    9     return false;
   10   }
   11   gp = p; /* ORDERING BUG */
   12   p->a = a;
   13   p->b = b;
   14   spin_unlock(&gp_lock);
   15   return true;
   16 }

If an RCU reader fetches ``gp`` just after ``add_gp_buggy_optimized``
executes line 11, it will see garbage in the ``->a`` and ``->b`` fields.
And this is but one of many ways in which compiler and hardware
optimizations could cause trouble. Therefore, we clearly need some way
to prevent the compiler and the CPU from reordering in this manner,
which brings us to the publish-subscribe guarantee discussed in the next
section.

Publish/Subscribe Guarantee
~~~~~~~~~~~~~~~~~~~~~~~~~~~

RCU's publish-subscribe guarantee allows data to be inserted into a
linked data structure without disrupting RCU readers. The updater uses
``rcu_assign_pointer()`` to insert the new data, and readers use
``rcu_dereference()`` to access data, whether new or old. The following
shows an example of insertion:

::

    1 bool add_gp(int a, int b)
    2 {
    3   p = kmalloc(sizeof(*p), GFP_KERNEL);
    4   if (!p)
    5     return false;
    6   spin_lock(&gp_lock);
    7   if (rcu_access_pointer(gp)) {
    8     spin_unlock(&gp_lock);
    9     return false;
   10   }
   11   p->a = a;
   12   p->b = b;
   13   rcu_assign_pointer(gp, p);
   14   spin_unlock(&gp_lock);
   15   return true;
   16 }

The ``rcu_assign_pointer()`` on line 13 is conceptually equivalent to a
simple assignment statement, but also guarantees that its assignment
will happen after the two assignments in lines 11 and 12, similar to the
C11 ``memory_order_release`` store operation. It also prevents any
number of “interesting” compiler optimizations, for example, the use of
``gp`` as a scratch location immediately preceding the assignment.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But ``rcu_assign_pointer()`` does nothing to prevent the two          |
| assignments to ``p->a`` and ``p->b`` from being reordered. Can't that |
| also cause problems?                                                  |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No, it cannot. The readers cannot see either of these two fields      |
| until the assignment to ``gp``, by which time both fields are fully   |
| initialized. So reordering the assignments to ``p->a`` and ``p->b``   |
| cannot possibly cause any problems.                                   |
+-----------------------------------------------------------------------+

It is tempting to assume that the reader need not do anything special to
control its accesses to the RCU-protected data, as shown in
``do_something_gp_buggy()`` below:

::

    1 bool do_something_gp_buggy(void)
    2 {
    3   rcu_read_lock();
    4   p = gp;  /* OPTIMIZATIONS GALORE!!! */
    5   if (p) {
    6     do_something(p->a, p->b);
    7     rcu_read_unlock();
    8     return true;
    9   }
   10   rcu_read_unlock();
   11   return false;
   12 }

However, this temptation must be resisted because there are a
surprisingly large number of ways that the compiler (to say nothing of
`DEC Alpha CPUs <https://h71000.www7.hp.com/wizard/wiz_2637.html>`__)
can trip this code up. For but one example, if the compiler were short
of registers, it might choose to refetch from ``gp`` rather than keeping
a separate copy in ``p`` as follows:

::

    1 bool do_something_gp_buggy_optimized(void)
    2 {
    3   rcu_read_lock();
    4   if (gp) { /* OPTIMIZATIONS GALORE!!! */
    5     do_something(gp->a, gp->b);
    6     rcu_read_unlock();
    7     return true;
    8   }
    9   rcu_read_unlock();
   10   return false;
   11 }

If this function ran concurrently with a series of updates that replaced
the current structure with a new one, the fetches of ``gp->a`` and
``gp->b`` might well come from two different structures, which could
cause serious confusion. To prevent this (and much else besides),
``do_something_gp()`` uses ``rcu_dereference()`` to fetch from ``gp``:

::

    1 bool do_something_gp(void)
    2 {
    3   rcu_read_lock();
    4   p = rcu_dereference(gp);
    5   if (p) {
    6     do_something(p->a, p->b);
    7     rcu_read_unlock();
    8     return true;
    9   }
   10   rcu_read_unlock();
   11   return false;
   12 }

The ``rcu_dereference()`` uses volatile casts and (for DEC Alpha) memory
barriers in the Linux kernel. Should a `high-quality implementation of
C11 ``memory_order_consume``
[PDF] <http://www.rdrop.com/users/paulmck/RCU/consume.2015.07.13a.pdf>`__
ever appear, then ``rcu_dereference()`` could be implemented as a
``memory_order_consume`` load. Regardless of the exact implementation, a
pointer fetched by ``rcu_dereference()`` may not be used outside of the
outermost RCU read-side critical section containing that
``rcu_dereference()``, unless protection of the corresponding data
element has been passed from RCU to some other synchronization
mechanism, most commonly locking or `reference
counting <https://www.kernel.org/doc/Documentation/RCU/rcuref.txt>`__.

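For example, a reader might pass protection from RCU to a reference
count before exiting the critical section. The following sketch is
illustrative only; the ``refcnt`` field and the ``get_gp()`` name are
assumptions, not part of the original example:

::

   struct foo *get_gp(void)
   {
           struct foo *p;

           rcu_read_lock();
           p = rcu_dereference(gp);
           /* Acquire a reference, unless the object is being freed. */
           if (p && !atomic_inc_not_zero(&p->refcnt))
                   p = NULL;
           rcu_read_unlock();
           return p;  /* Now protected by the reference count. */
   }
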
In short, updaters use ``rcu_assign_pointer()`` and readers use
``rcu_dereference()``, and these two RCU API elements work together to
ensure that readers have a consistent view of newly added data elements.

Of course, it is also necessary to remove elements from RCU-protected
data structures, for example, using the following process:

#. Remove the data element from the enclosing structure.
#. Wait for all pre-existing RCU read-side critical sections to complete
   (because only pre-existing readers can possibly have a reference to
   the newly removed data element).
#. At this point, only the updater has a reference to the newly removed
   data element, so it can safely reclaim the data element, for example,
   by passing it to ``kfree()``.

This process is implemented by ``remove_gp_synchronous()``:

::

    1 bool remove_gp_synchronous(void)
    2 {
    3   struct foo *p;
    4
    5   spin_lock(&gp_lock);
    6   p = rcu_access_pointer(gp);
    7   if (!p) {
    8     spin_unlock(&gp_lock);
    9     return false;
   10   }
   11   rcu_assign_pointer(gp, NULL);
   12   spin_unlock(&gp_lock);
   13   synchronize_rcu();
   14   kfree(p);
   15   return true;
   16 }

This function is straightforward, with line 13 waiting for a grace
period before line 14 frees the old data element. This waiting ensures
that readers will reach line 7 of ``do_something_gp()`` before the data
element referenced by ``p`` is freed. The ``rcu_access_pointer()`` on
line 6 is similar to ``rcu_dereference()``, except that:

#. The value returned by ``rcu_access_pointer()`` cannot be
   dereferenced. If you want to access the value pointed to as well as
   the pointer itself, use ``rcu_dereference()`` instead of
   ``rcu_access_pointer()``.
#. The call to ``rcu_access_pointer()`` need not be protected. In
   contrast, ``rcu_dereference()`` must either be within an RCU
   read-side critical section or in a code segment where the pointer
   cannot change, for example, in code protected by the corresponding
   update-side lock.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Without the ``rcu_dereference()`` or the ``rcu_access_pointer()``,    |
| what destructive optimizations might the compiler make use of?       |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Let's start with what happens to ``do_something_gp()`` if it fails to |
| use ``rcu_dereference()``. It could reuse a value formerly fetched    |
| from this same pointer. It could also fetch the pointer from ``gp``   |
| in a byte-at-a-time manner, resulting in *load tearing*, in turn      |
| resulting in a bytewise mash-up of two distinct pointer values. It    |
| might even use value-speculation optimizations, where it makes a      |
| wrong guess, but by the time it gets around to checking the value, an |
| update has changed the pointer to match the wrong guess. Too bad      |
| about any dereferences that returned pre-initialization garbage in    |
| the meantime!                                                         |
| For ``remove_gp_synchronous()``, as long as all modifications to      |
| ``gp`` are carried out while holding ``gp_lock``, the above           |
| optimizations are harmless. However, ``sparse`` will complain if you  |
| define ``gp`` with ``__rcu`` and then access it without using either  |
| ``rcu_access_pointer()`` or ``rcu_dereference()``.                    |
+-----------------------------------------------------------------------+

454 | ||
455 | In short, RCU's publish-subscribe guarantee is provided by the | |
456 | combination of ``rcu_assign_pointer()`` and ``rcu_dereference()``. This | |
457 | guarantee allows data elements to be safely added to RCU-protected | |
458 | linked data structures without disrupting RCU readers. This guarantee | |
459 | can be used in combination with the grace-period guarantee to also allow | |
460 | data elements to be removed from RCU-protected linked data structures, | |
461 | again without disrupting RCU readers. | |
462 | ||
463 | This guarantee was only partially premeditated. DYNIX/ptx used an | |
464 | explicit memory barrier for publication, but had nothing resembling | |
465 | ``rcu_dereference()`` for subscription, nor did it have anything | |
466 | resembling the ``smp_read_barrier_depends()`` that was later subsumed | |
467 | into ``rcu_dereference()`` and later still into ``READ_ONCE()``. The | |
468 | need for these operations made itself known quite suddenly at a | |
469 | late-1990s meeting with the DEC Alpha architects, back in the days when | |
470 | DEC was still a free-standing company. It took the Alpha architects a | |
471 | good hour to convince me that any sort of barrier would ever be needed, | |
472 | and it then took me a good *two* hours to convince them that their | |
473 | documentation did not make this point clear. More recent work with the C | |
474 | and C++ standards committees have provided much education on tricks and | |
475 | traps from the compiler. In short, compilers were much less tricky in | |
476 | the early 1990s, but in 2015, don't even think about omitting | |
477 | ``rcu_dereference()``! | |
478 | ||
Memory-Barrier Guarantees
~~~~~~~~~~~~~~~~~~~~~~~~~

The previous section's simple linked-data-structure scenario clearly
demonstrates the need for RCU's stringent memory-ordering guarantees on
systems with more than one CPU:

#. Each CPU that has an RCU read-side critical section that begins
   before ``synchronize_rcu()`` starts is guaranteed to execute a full
   memory barrier between the time that the RCU read-side critical
   section ends and the time that ``synchronize_rcu()`` returns. Without
   this guarantee, a pre-existing RCU read-side critical section might
   hold a reference to the newly removed ``struct foo`` after the
   ``kfree()`` on line 14 of ``remove_gp_synchronous()``.
#. Each CPU that has an RCU read-side critical section that ends after
   ``synchronize_rcu()`` returns is guaranteed to execute a full memory
   barrier between the time that ``synchronize_rcu()`` begins and the
   time that the RCU read-side critical section begins. Without this
   guarantee, a later RCU read-side critical section running after the
   ``kfree()`` on line 14 of ``remove_gp_synchronous()`` might run
   ``do_something_gp()`` and find the newly deleted ``struct foo``.
#. If the task invoking ``synchronize_rcu()`` remains on a given CPU,
   then that CPU is guaranteed to execute a full memory barrier sometime
   during the execution of ``synchronize_rcu()``. This guarantee ensures
   that the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really
   does execute after the removal on line 11.
#. If the task invoking ``synchronize_rcu()`` migrates among a group of
   CPUs during that invocation, then each of the CPUs in that group is
   guaranteed to execute a full memory barrier sometime during the
   execution of ``synchronize_rcu()``. This guarantee also ensures that
   the ``kfree()`` on line 14 of ``remove_gp_synchronous()`` really does
   execute after the removal on line 11, even in the case where the
   thread executing the ``synchronize_rcu()`` migrates in the meantime.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Given that multiple CPUs can start RCU read-side critical sections at |
| any time without any ordering whatsoever, how can RCU possibly tell   |
| whether or not a given RCU read-side critical section starts before a |
| given instance of ``synchronize_rcu()``?                              |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| If RCU cannot tell whether or not a given RCU read-side critical      |
| section starts before a given instance of ``synchronize_rcu()``, then |
| it must assume that the RCU read-side critical section started first. |
| In other words, a given instance of ``synchronize_rcu()`` can avoid   |
| waiting on a given RCU read-side critical section only if it can      |
| prove that ``synchronize_rcu()`` started first.                       |
| A related question is “When ``rcu_read_lock()`` doesn't generate any  |
| code, why does it matter how it relates to a grace period?” The       |
| answer is that it is not the relationship of ``rcu_read_lock()``      |
| itself that is important, but rather the relationship of the code     |
| within the enclosed RCU read-side critical section to the code        |
| preceding and following the grace period. If we take this viewpoint,  |
| then a given RCU read-side critical section begins before a given     |
| grace period when some access preceding the grace period observes the |
| effect of some access within the critical section, in which case none |
| of the accesses within the critical section may observe the effects   |
| of any access following the grace period.                             |
|                                                                       |
| As of late 2016, mathematical models of RCU take this viewpoint, for  |
| example, see slides 62 and 63 of the `2016 LinuxCon                   |
| EU <http://www2.rdrop.com/users/paulmck/scalability/paper/LinuxMM.201 |
| 6.10.04c.LCE.pdf>`__                                                  |
| presentation.                                                         |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| The first and second guarantees require unbelievably strict ordering! |
| Are all these memory barriers *really* required?                      |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| Yes, they really are required. To see why the first guarantee is      |
| required, consider the following sequence of events:                  |
|                                                                       |
| #. CPU 1: ``rcu_read_lock()``                                         |
| #. CPU 1: ``q = rcu_dereference(gp); /* Very likely to return p. */`` |
| #. CPU 0: ``list_del_rcu(p);``                                        |
| #. CPU 0: ``synchronize_rcu()`` starts.                               |
| #. CPU 1: ``do_something_with(q->a);``                                |
|    ``/* No smp_mb(), so might happen after kfree(). */``              |
| #. CPU 1: ``rcu_read_unlock()``                                       |
| #. CPU 0: ``synchronize_rcu()`` returns.                              |
| #. CPU 0: ``kfree(p);``                                               |
|                                                                       |
| Therefore, there absolutely must be a full memory barrier between the |
| end of the RCU read-side critical section and the end of the grace    |
| period.                                                               |
|                                                                       |
| The sequence of events demonstrating the necessity of the second rule |
| is roughly similar:                                                   |
|                                                                       |
| #. CPU 0: ``list_del_rcu(p);``                                        |
| #. CPU 0: ``synchronize_rcu()`` starts.                               |
| #. CPU 1: ``rcu_read_lock()``                                         |
| #. CPU 1: ``q = rcu_dereference(gp);``                                |
|    ``/* Might return p if no memory barrier. */``                     |
| #. CPU 0: ``synchronize_rcu()`` returns.                              |
| #. CPU 0: ``kfree(p);``                                               |
| #. CPU 1: ``do_something_with(q->a); /* Boom!!! */``                  |
| #. CPU 1: ``rcu_read_unlock()``                                       |
|                                                                       |
| And similarly, without a memory barrier between the beginning of the  |
| grace period and the beginning of the RCU read-side critical section, |
| CPU 1 might end up accessing the freelist.                            |
|                                                                       |
| The “as if” rule of course applies, so that any implementation that   |
| acts as if the appropriate memory barriers were in place is a correct |
| implementation. That said, it is much easier to fool yourself into    |
| believing that you have adhered to the as-if rule than it is to       |
| actually adhere to it!                                                |
+-----------------------------------------------------------------------+

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| You claim that ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
| absolutely no code in some kernel builds. This means that the         |
| compiler might arbitrarily rearrange consecutive RCU read-side        |
| critical sections. Given such rearrangement, if a given RCU read-side |
| critical section is done, how can you be sure that all prior RCU      |
| read-side critical sections are done? Won't the compiler              |
| rearrangements make that impossible to determine?                     |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| In cases where ``rcu_read_lock()`` and ``rcu_read_unlock()`` generate |
| absolutely no code, RCU infers quiescent states only at special       |
| locations, for example, within the scheduler. Because calls to        |
| ``schedule()`` had better prevent calling-code accesses to shared     |
| variables from being rearranged across the call to ``schedule()``, if |
| RCU detects the end of a given RCU read-side critical section, it     |
| will necessarily detect the end of all prior RCU read-side critical   |
| sections, no matter how aggressively the compiler scrambles the code. |
| Again, this all assumes that the compiler cannot scramble code across |
| calls to the scheduler, out of interrupt handlers, into the idle      |
| loop, into user-mode code, and so on. But if your kernel build allows |
| that sort of scrambling, you have broken far more than just RCU!      |
+-----------------------------------------------------------------------+

Note that these memory-barrier requirements do not replace the
fundamental RCU requirement that a grace period wait for all
pre-existing readers. On the contrary, the memory barriers called out in
this section must operate in such a way as to *enforce* this fundamental
requirement. Of course, different implementations enforce this
requirement in different ways, but enforce it they must.

RCU Primitives Guaranteed to Execute Unconditionally
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The common-case RCU primitives are unconditional. They are invoked, they
do their job, and they return, with no possibility of error, and no need
to retry. This is a key RCU design philosophy.

However, this philosophy is pragmatic rather than pigheaded. If someone
comes up with a good justification for a particular conditional RCU
primitive, it might well be implemented and added. After all, this
guarantee was reverse-engineered, not premeditated. The unconditional
nature of the RCU primitives was initially an accident of
implementation, and later experience with conditional synchronization
primitives caused me to elevate this accident to a guarantee.
Therefore, the justification for adding a conditional primitive to RCU
would need to be based on detailed and compelling use cases.

Guaranteed Read-to-Write Upgrade
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As far as RCU is concerned, it is always possible to carry out an update
within an RCU read-side critical section. For example, that RCU
read-side critical section might search for a given data element, and
then might acquire the update-side spinlock in order to update that
element, all while remaining in that RCU read-side critical section. Of
course, it is necessary to exit the RCU read-side critical section
before invoking ``synchronize_rcu()``; however, this inconvenience can
be avoided through use of the ``call_rcu()`` and ``kfree_rcu()`` API
members described later in this document.

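A minimal sketch of this upgrade pattern, assuming an RCU-protected list
and a per-element spinlock (the list, lock, and field names are
illustrative, not part of the original text):

::

   void upgrade_and_update(int key, int new_b)
   {
           struct foo *p;

           rcu_read_lock();
           list_for_each_entry_rcu(p, &foo_list, list) {
                   if (p->a == key) {
                           /*
                            * Upgrade to writer while still within the
                            * RCU read-side critical section.
                            */
                           spin_lock(&p->lock);
                           p->b = new_b;
                           spin_unlock(&p->lock);
                           break;
                   }
           }
           rcu_read_unlock();
   }
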
+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| But how does the upgrade-to-write operation exclude other readers?    |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| It doesn't, just like normal RCU updates, which also do not exclude   |
| RCU readers.                                                          |
+-----------------------------------------------------------------------+

This guarantee allows lookup code to be shared between read-side and
update-side code, and was premeditated, appearing in the earliest
DYNIX/ptx RCU documentation.

Fundamental Non-Requirements
----------------------------

RCU provides extremely lightweight readers, and its read-side
guarantees, though quite useful, are correspondingly lightweight. It is
therefore all too easy to assume that RCU is guaranteeing more than it
really is. Of course, the list of things that RCU does not guarantee is
infinitely long; however, the following sections list a few
non-guarantees that have caused confusion. Except where otherwise noted,
these non-guarantees were premeditated.

#. `Readers Impose Minimal Ordering`_
#. `Readers Do Not Exclude Updaters`_
#. `Updaters Only Wait For Old Readers`_
#. `Grace Periods Don't Partition Read-Side Critical Sections`_
#. `Read-Side Critical Sections Don't Partition Grace Periods`_

Readers Impose Minimal Ordering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reader-side markers such as ``rcu_read_lock()`` and
``rcu_read_unlock()`` provide absolutely no ordering guarantees except
through their interaction with the grace-period APIs such as
``synchronize_rcu()``. To see this, consider the following pair of
threads:

::

    1 void thread0(void)
    2 {
    3   rcu_read_lock();
    4   WRITE_ONCE(x, 1);
    5   rcu_read_unlock();
    6   rcu_read_lock();
    7   WRITE_ONCE(y, 1);
    8   rcu_read_unlock();
    9 }
   10
   11 void thread1(void)
   12 {
   13   rcu_read_lock();
   14   r1 = READ_ONCE(y);
   15   rcu_read_unlock();
   16   rcu_read_lock();
   17   r2 = READ_ONCE(x);
   18   rcu_read_unlock();
   19 }

After ``thread0()`` and ``thread1()`` execute concurrently, it is quite
possible to have

::

   (r1 == 1 && r2 == 0)

(that is, ``y`` appears to have been assigned before ``x``), which would
not be possible if ``rcu_read_lock()`` and ``rcu_read_unlock()`` had
much in the way of ordering properties. But they do not, so the CPU is
within its rights to do significant reordering. This is by design: Any
significant ordering constraints would slow down these fast-path APIs.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Can't the compiler also reorder this code?                            |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| No, the volatile casts in ``READ_ONCE()`` and ``WRITE_ONCE()``        |
| prevent the compiler from reordering in this particular case.         |
+-----------------------------------------------------------------------+

Readers Do Not Exclude Updaters
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Neither ``rcu_read_lock()`` nor ``rcu_read_unlock()`` exclude updates.
All they do is to prevent grace periods from ending. The following
example illustrates this:

::

    1 void thread0(void)
    2 {
    3   rcu_read_lock();
    4   r1 = READ_ONCE(y);
    5   if (r1) {
    6     do_something_with_nonzero_x();
    7     r2 = READ_ONCE(x);
    8     WARN_ON(!r2); /* BUG!!! */
    9   }
   10   rcu_read_unlock();
   11 }
   12
   13 void thread1(void)
   14 {
   15   spin_lock(&my_lock);
   16   WRITE_ONCE(x, 1);
   17   WRITE_ONCE(y, 1);
   18   spin_unlock(&my_lock);
   19 }

If the ``thread0()`` function's ``rcu_read_lock()`` excluded the
``thread1()`` function's update, the ``WARN_ON()`` could never fire. But
the fact is that ``rcu_read_lock()`` does not exclude much of anything
aside from subsequent grace periods, of which ``thread1()`` has none, so
the ``WARN_ON()`` can and does fire.

Updaters Only Wait For Old Readers
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It might be tempting to assume that after ``synchronize_rcu()``
completes, there are no readers executing. This temptation must be
avoided because new readers can start immediately after
``synchronize_rcu()`` starts, and ``synchronize_rcu()`` is under no
obligation to wait for these new readers.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| Suppose that ``synchronize_rcu()`` did wait until *all* readers had   |
| completed instead of waiting only on pre-existing readers. For how    |
| long would the updater be able to rely on there being no readers?     |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| For no time at all. Even if ``synchronize_rcu()`` were to wait until  |
| all readers had completed, a new reader might start immediately after |
| ``synchronize_rcu()`` completed. Therefore, the code following        |
| ``synchronize_rcu()`` can *never* rely on there being no readers.     |
+-----------------------------------------------------------------------+

Grace Periods Don't Partition Read-Side Critical Sections
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is tempting to assume that if any part of one RCU read-side critical
section precedes a given grace period, and if any part of another RCU
read-side critical section follows that same grace period, then all of
the first RCU read-side critical section must precede all of the second.
However, this just isn't the case: A single grace period does not
partition the set of RCU read-side critical sections. An example of this
situation can be illustrated as follows, where ``a``, ``b``, and ``c``
are initially all zero:

::

    1 void thread0(void)
    2 {
    3   rcu_read_lock();
    4   WRITE_ONCE(a, 1);
    5   WRITE_ONCE(b, 1);
    6   rcu_read_unlock();
    7 }
    8
    9 void thread1(void)
   10 {
   11   r1 = READ_ONCE(a);
   12   synchronize_rcu();
   13   WRITE_ONCE(c, 1);
   14 }
   15
   16 void thread2(void)
   17 {
   18   rcu_read_lock();
   19   r2 = READ_ONCE(b);
   20   r3 = READ_ONCE(c);
   21   rcu_read_unlock();
   22 }

It turns out that the outcome:

::

   (r1 == 1 && r2 == 0 && r3 == 1)

is entirely possible. The following figure shows how this can happen,
with each circled ``QS`` indicating the point at which RCU recorded a
*quiescent state* for each thread, that is, a state in which RCU knows
that the thread cannot be in the midst of an RCU read-side critical
section that started before the current grace period:

.. kernel-figure:: GPpartitionReaders1.svg

If it is necessary to partition RCU read-side critical sections in this
manner, it is necessary to use two grace periods, where the first grace
period is known to end before the second grace period starts:

::

    1 void thread0(void)
    2 {
    3   rcu_read_lock();
    4   WRITE_ONCE(a, 1);
    5   WRITE_ONCE(b, 1);
    6   rcu_read_unlock();
    7 }
    8
    9 void thread1(void)
   10 {
   11   r1 = READ_ONCE(a);
   12   synchronize_rcu();
   13   WRITE_ONCE(c, 1);
   14 }
   15
   16 void thread2(void)
   17 {
   18   r2 = READ_ONCE(c);
   19   synchronize_rcu();
   20   WRITE_ONCE(d, 1);
   21 }
   22
   23 void thread3(void)
   24 {
   25   rcu_read_lock();
   26   r3 = READ_ONCE(b);
   27   r4 = READ_ONCE(d);
   28   rcu_read_unlock();
   29 }

Here, if ``(r1 == 1)``, then ``thread0()``'s write to ``b`` must happen
before the end of ``thread1()``'s grace period. If in addition
``(r4 == 1)``, then ``thread3()``'s read from ``b`` must happen after
the beginning of ``thread2()``'s grace period. If it is also the case
that ``(r2 == 1)``, then the end of ``thread1()``'s grace period must
precede the beginning of ``thread2()``'s grace period. This means that
the two RCU read-side critical sections cannot overlap, guaranteeing
that ``(r3 == 1)``. As a result, the outcome:

::

   (r1 == 1 && r2 == 1 && r3 == 0 && r4 == 1)

cannot happen.

This non-requirement was also non-premeditated, but became apparent when
studying RCU's interaction with memory ordering.

Read-Side Critical Sections Don't Partition Grace Periods
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

It is also tempting to assume that if an RCU read-side critical section
happens between a pair of grace periods, then those grace periods cannot
overlap. However, this temptation leads nowhere good, as can be
illustrated by the following, with all variables initially zero:

::

    1 void thread0(void)
    2 {
    3   rcu_read_lock();
    4   WRITE_ONCE(a, 1);
    5   WRITE_ONCE(b, 1);
    6   rcu_read_unlock();
    7 }
    8
    9 void thread1(void)
   10 {
   11   r1 = READ_ONCE(a);
   12   synchronize_rcu();
   13   WRITE_ONCE(c, 1);
   14 }
   15
   16 void thread2(void)
   17 {
   18   rcu_read_lock();
   19   WRITE_ONCE(d, 1);
   20   r2 = READ_ONCE(c);
   21   rcu_read_unlock();
   22 }
   23
   24 void thread3(void)
   25 {
   26   r3 = READ_ONCE(d);
   27   synchronize_rcu();
   28   WRITE_ONCE(e, 1);
   29 }
   30
   31 void thread4(void)
   32 {
   33   rcu_read_lock();
   34   r4 = READ_ONCE(b);
   35   r5 = READ_ONCE(e);
   36   rcu_read_unlock();
   37 }

In this case, the outcome:

::

   (r1 == 1 && r2 == 1 && r3 == 1 && r4 == 0 && r5 == 1)

is entirely possible, as illustrated below:

.. kernel-figure:: ReadersPartitionGP1.svg

Again, an RCU read-side critical section can overlap almost all of a
given grace period, just so long as it does not overlap the entire grace
period. As a result, an RCU read-side critical section cannot partition
a pair of RCU grace periods.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| How long a sequence of grace periods, each separated by an RCU        |
| read-side critical section, would be required to partition the RCU    |
| read-side critical sections at the beginning and end of the chain?    |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| In theory, an infinite number. In practice, an unknown number that is |
| sensitive to both implementation details and timing considerations.   |
| Therefore, even in practice, RCU users must abide by the theoretical  |
| rather than the practical answer.                                     |
+-----------------------------------------------------------------------+

Parallelism Facts of Life
-------------------------

These parallelism facts of life are by no means specific to RCU, but the
RCU implementation must abide by them. They therefore bear repeating:

#. Any CPU or task may be delayed at any time, and any attempts to avoid
   these delays by disabling preemption, interrupts, or whatever are
   completely futile. This is most obvious in preemptible user-level
   environments and in virtualized environments (where a given guest
   OS's VCPUs can be preempted at any time by the underlying
   hypervisor), but can also happen in bare-metal environments due to
   ECC errors, NMIs, and other hardware events. Although a delay of more
   than about 20 seconds can result in splats, the RCU implementation is
   obligated to use algorithms that can tolerate extremely long delays,
   but where “extremely long” is not long enough to allow wrap-around
   when incrementing a 64-bit counter.
#. Both the compiler and the CPU can reorder memory accesses. Where it
   matters, RCU must use compiler directives and memory-barrier
   instructions to preserve ordering.
#. Conflicting writes to memory locations in any given cache line will
   result in expensive cache misses. Greater numbers of concurrent
   writes and more-frequent concurrent writes will result in more
   dramatic slowdowns. RCU is therefore obligated to use algorithms that
   have sufficient locality to avoid significant performance and
   scalability problems.
#. As a rough rule of thumb, only one CPU's worth of processing may be
   carried out under the protection of any given exclusive lock. RCU
   must therefore use scalable locking designs.
#. Counters are finite, especially on 32-bit systems. RCU's use of
   counters must therefore tolerate counter wrap, or be designed such
   that counter wrap would take way more time than a single system is
   likely to run. An uptime of ten years is quite possible, a runtime of
   a century much less so. As an example of the latter, RCU's
   dyntick-idle nesting counter allows 54 bits for interrupt nesting
   level (this counter is 64 bits even on a 32-bit system). Overflowing
   this counter requires 2\ :sup:`54` half-interrupts on a given CPU
   without that CPU ever going idle. If a half-interrupt happened every
   microsecond, it would take 570 years of runtime to overflow this
   counter, which is currently believed to be an acceptably long time
   (see the arithmetic sketch just after this list).
#. Linux systems can have thousands of CPUs running a single Linux
   kernel in a single shared-memory environment. RCU must therefore pay
   close attention to high-end scalability.

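As a back-of-the-envelope check on that 570-year figure (this
calculation is mine, not from the original text):

::

   2^54 half-interrupts = 18,014,398,509,481,984
   at one half-interrupt per microsecond:
       2^54 us ~= 1.8e10 seconds
   one year ~= 3.156e7 seconds, so:
       1.8e10 / 3.156e7 ~= 571 years
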
This last parallelism fact of life means that RCU must pay special
attention to the preceding facts of life. The idea that Linux might
scale to systems with thousands of CPUs would have been met with some
skepticism in the 1990s, but these requirements would otherwise have
been unsurprising, even in the early 1990s.

Quality-of-Implementation Requirements
--------------------------------------

These sections list quality-of-implementation requirements. Although an
RCU implementation that ignores these requirements could still be used,
it would likely be subject to limitations that would make it
inappropriate for industrial-strength production use. Classes of
quality-of-implementation requirements are as follows:

#. `Specialization`_
#. `Performance and Scalability`_
#. `Forward Progress`_
#. `Composability`_
#. `Corner Cases`_

These classes are covered in the following sections.

Specialization
~~~~~~~~~~~~~~

RCU is and always has been intended primarily for read-mostly
situations, which means that RCU's read-side primitives are optimized,
often at the expense of its update-side primitives. Experience thus far
is captured by the following list of situations:

#. Read-mostly data, where stale and inconsistent data is not a problem:
   RCU works great!
#. Read-mostly data, where data must be consistent: RCU works well.
#. Read-write data, where data must be consistent: RCU *might* work OK.
   Or not.
#. Write-mostly data, where data must be consistent: RCU is very
   unlikely to be the right tool for the job, with the following
   exceptions, where RCU can provide:

   a. Existence guarantees for update-friendly mechanisms.
   b. Wait-free read-side primitives for real-time use.

This focus on read-mostly situations means that RCU must interoperate
with other synchronization primitives. For example, the ``add_gp()`` and
``remove_gp_synchronous()`` examples discussed earlier use RCU to
protect readers and locking to coordinate updaters. However, the need
extends much farther, requiring that a variety of synchronization
primitives be legal within RCU read-side critical sections, including
spinlocks, sequence locks, atomic operations, reference counters, and
memory barriers.

+-----------------------------------------------------------------------+
| **Quick Quiz**:                                                       |
+-----------------------------------------------------------------------+
| What about sleeping locks?                                            |
+-----------------------------------------------------------------------+
| **Answer**:                                                           |
+-----------------------------------------------------------------------+
| These are forbidden within Linux-kernel RCU read-side critical        |
| sections because it is not legal to place a quiescent state (in this  |
| case, voluntary context switch) within an RCU read-side critical      |
| section. However, sleeping locks may be used within userspace RCU     |
| read-side critical sections, and also within Linux-kernel sleepable   |
| RCU `(SRCU) <#Sleepable%20RCU>`__ read-side critical sections. In     |
| addition, the -rt patchset turns spinlocks into sleeping locks so     |
| that the corresponding critical sections can be preempted, which also |
| means that these sleeplockified spinlocks (but not other sleeping     |
| locks!) may be acquired within -rt-Linux-kernel RCU read-side         |
| critical sections.                                                    |
| Note that it *is* legal for a normal RCU read-side critical section   |
| to conditionally acquire a sleeping lock (as in ``mutex_trylock()``), |
| but only as long as it does not loop indefinitely attempting to       |
| conditionally acquire that sleeping lock. The key point is that       |
| things like ``mutex_trylock()`` either return with the mutex held, or |
| return an error indication if the mutex was not immediately           |
| available. Either way, ``mutex_trylock()`` returns immediately        |
| without sleeping.                                                     |
+-----------------------------------------------------------------------+

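A hedged sketch of the legal trylock pattern described above (the
``->mutex`` field and the two helper functions are illustrative
assumptions, not actual Linux-kernel code):

::

   static void try_update_foo(void)
   {
           struct foo *p;

           rcu_read_lock();
           p = rcu_dereference(gp);
           if (p && mutex_trylock(&p->mutex)) {
                   /* Mutex acquired without sleeping. */
                   update_foo(p);
                   mutex_unlock(&p->mutex);
           } else {
                   /* Do not loop retrying; defer the work instead. */
                   queue_deferred_update();
           }
           rcu_read_unlock();
   }
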
It often comes as a surprise that many algorithms do not require a
consistent view of data, but many algorithms can in fact function
correctly in this mode, with network routing being the poster child.
Internet routing algorithms take significant time to propagate updates,
so that by the time an update arrives at a given system, that system
has been sending network traffic the wrong way for a considerable
length of time. Having a few threads continue to send traffic the wrong
way for a few more milliseconds is clearly not a problem: In the worst
case, TCP retransmissions will eventually get the data where it needs
to go. In general, when tracking the state of the universe outside of
the computer, some level of inconsistency must be tolerated due to
speed-of-light delays if nothing else.

Furthermore, uncertainty about external state is inherent in many cases.
For example, a pair of veterinarians might use heartbeat to determine
whether or not a given cat was alive. But how long should they wait
after the last heartbeat to decide that the cat is in fact dead? Waiting
less than 400 milliseconds makes no sense because this would mean that a
relaxed cat would be considered to cycle between death and life more
than 100 times per minute. Moreover, just as with human beings, a cat's
heart might stop for some period of time, so the exact wait period is a
judgment call. One of our pair of veterinarians might wait 30 seconds
before pronouncing the cat dead, while the other might insist on waiting
a full minute. The two veterinarians would then disagree on the state of
the cat during the final 30 seconds of the minute following the last
heartbeat.

Interestingly enough, this same situation applies to hardware. When push
comes to shove, how do we tell whether or not some external server has
failed? We send messages to it periodically, and declare it failed if we
don't receive a response within a given period of time. Policy decisions
can usually tolerate short periods of inconsistency. The policy was
decided some time ago, and is only now being put into effect, so a few
milliseconds of delay is normally inconsequential.

1150 | However, there are algorithms that absolutely must see consistent data. | |
1151 | For example, the translation between a user-level SystemV semaphore ID | |
1152 | to the corresponding in-kernel data structure is protected by RCU, but | |
1153 | it is absolutely forbidden to update a semaphore that has just been | |
1154 | removed. In the Linux kernel, this need for consistency is accommodated | |
1155 | by acquiring spinlocks located in the in-kernel data structure from | |
1156 | within the RCU read-side critical section, and this is indicated by the | |
1157 | green box in the figure above. Many other techniques may be used, and | |
1158 | are in fact used within the Linux kernel. | |
1159 | ||
1160 | In short, RCU is not required to maintain consistency, and other | |
1161 | mechanisms may be used in concert with RCU when consistency is required. | |
1162 | RCU's specialization allows it to do its job extremely well, and its | |
1163 | ability to interoperate with other synchronization mechanisms allows the | |
1164 | right mix of synchronization tools to be used for a given job. | |
1165 | ||
1166 | Performance and Scalability | |
1167 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
1168 | ||
1169 | Energy efficiency is a critical component of performance today, and | |
1170 | Linux-kernel RCU implementations must therefore avoid unnecessarily | |
1171 | awakening idle CPUs. I cannot claim that this requirement was | |
1172 | premeditated. In fact, I learned of it during a telephone conversation | |
1173 | in which I was given “frank and open” feedback on the importance of | |
1174 | energy efficiency in battery-powered systems and on specific | |
1175 | energy-efficiency shortcomings of the Linux-kernel RCU implementation. | |
1176 | In my experience, the battery-powered embedded community will consider | |
1177 | any unnecessary wakeups to be extremely unfriendly acts. So much so that | |
1178 | mere Linux-kernel-mailing-list posts are insufficient to vent their ire. | |
1179 | ||
Memory consumption is not particularly important in most situations,
1181 | and has become decreasingly so as memory sizes have expanded and memory | |
1182 | costs have plummeted. However, as I learned from Matt Mackall's | |
1183 | `bloatwatch <http://elinux.org/Linux_Tiny-FAQ>`__ efforts, memory | |
1184 | footprint is critically important on single-CPU systems with | |
1185 | non-preemptible (``CONFIG_PREEMPT=n``) kernels, and thus `tiny | |
1186 | RCU <https://lkml.kernel.org/g/20090113221724.GA15307@linux.vnet.ibm.com>`__ | |
1187 | was born. Josh Triplett has since taken over the small-memory banner | |
1188 | with his `Linux kernel tinification <https://tiny.wiki.kernel.org/>`__ | |
1189 | project, which resulted in `SRCU <#Sleepable%20RCU>`__ becoming optional | |
1190 | for those kernels not needing it. | |
1191 | ||
1192 | The remaining performance requirements are, for the most part, | |
1193 | unsurprising. For example, in keeping with RCU's read-side | |
1194 | specialization, ``rcu_dereference()`` should have negligible overhead | |
1195 | (for example, suppression of a few minor compiler optimizations). | |
1196 | Similarly, in non-preemptible environments, ``rcu_read_lock()`` and | |
1197 | ``rcu_read_unlock()`` should have exactly zero overhead. | |
1198 | ||
1199 | In preemptible environments, in the case where the RCU read-side | |
1200 | critical section was not preempted (as will be the case for the | |
1201 | highest-priority real-time process), ``rcu_read_lock()`` and | |
1202 | ``rcu_read_unlock()`` should have minimal overhead. In particular, they | |
1203 | should not contain atomic read-modify-write operations, memory-barrier | |
1204 | instructions, preemption disabling, interrupt disabling, or backwards | |
1205 | branches. However, in the case where the RCU read-side critical section | |
1206 | was preempted, ``rcu_read_unlock()`` may acquire spinlocks and disable | |
1207 | interrupts. This is why it is better to nest an RCU read-side critical | |
1208 | section within a preempt-disable region than vice versa, at least in | |
1209 | cases where that critical section is short enough to avoid unduly | |
1210 | degrading real-time latencies. | |
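
For example, under the stated assumption of a short critical section,
the preferred nesting looks like this sketch, in which
``do_something_brief()`` is a stand-in for the reader's work:

::

   preempt_disable();
   rcu_read_lock();        /* This reader cannot now be preempted... */
   do_something_brief();
   rcu_read_unlock();      /* ...so this takes the low-overhead path. */
   preempt_enable();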
1211 | ||
1212 | The ``synchronize_rcu()`` grace-period-wait primitive is optimized for | |
1213 | throughput. It may therefore incur several milliseconds of latency in | |
1214 | addition to the duration of the longest RCU read-side critical section. | |
1215 | On the other hand, multiple concurrent invocations of | |
1216 | ``synchronize_rcu()`` are required to use batching optimizations so that | |
1217 | they can be satisfied by a single underlying grace-period-wait | |
1218 | operation. For example, in the Linux kernel, it is not unusual for a | |
1219 | single grace-period-wait operation to serve more than `1,000 separate | |
1220 | invocations <https://www.usenix.org/conference/2004-usenix-annual-technical-conference/making-rcu-safe-deep-sub-millisecond-response>`__ | |
1221 | of ``synchronize_rcu()``, thus amortizing the per-invocation overhead | |
1222 | down to nearly zero. However, the grace-period optimization is also | |
1223 | required to avoid measurable degradation of real-time scheduling and | |
1224 | interrupt latencies. | |
1225 | ||
1226 | In some cases, the multi-millisecond ``synchronize_rcu()`` latencies are | |
1227 | unacceptable. In these cases, ``synchronize_rcu_expedited()`` may be | |
1228 | used instead, reducing the grace-period latency down to a few tens of | |
1229 | microseconds on small systems, at least in cases where the RCU read-side | |
1230 | critical sections are short. There are currently no special latency | |
1231 | requirements for ``synchronize_rcu_expedited()`` on large systems, but, | |
1232 | consistent with the empirical nature of the RCU specification, that is | |
1233 | subject to change. However, there most definitely are scalability | |
1234 | requirements: A storm of ``synchronize_rcu_expedited()`` invocations on | |
1235 | 4096 CPUs should at least make reasonable forward progress. In return | |
1236 | for its shorter latencies, ``synchronize_rcu_expedited()`` is permitted | |
1237 | to impose modest degradation of real-time latency on non-idle online | |
1238 | CPUs. Here, “modest” means roughly the same latency degradation as a | |
1239 | scheduling-clock interrupt. | |
1240 | ||
1241 | There are a number of situations where even | |
1242 | ``synchronize_rcu_expedited()``'s reduced grace-period latency is | |
1243 | unacceptable. In these situations, the asynchronous ``call_rcu()`` can | |
1244 | be used in place of ``synchronize_rcu()`` as follows: | |
1245 | ||
1246 | :: | |
1247 | ||
1248 | 1 struct foo { | |
1249 | 2 int a; | |
1250 | 3 int b; | |
1251 | 4 struct rcu_head rh; | |
1252 | 5 }; | |
1253 | 6 | |
1254 | 7 static void remove_gp_cb(struct rcu_head *rhp) | |
1255 | 8 { | |
1256 | 9 struct foo *p = container_of(rhp, struct foo, rh); | |
1257 | 10 | |
1258 | 11 kfree(p); | |
1259 | 12 } | |
1260 | 13 | |
1261 | 14 bool remove_gp_asynchronous(void) | |
1262 | 15 { | |
1263 | 16 struct foo *p; | |
1264 | 17 | |
1265 | 18 spin_lock(&gp_lock); | |
1266 | 19 p = rcu_access_pointer(gp); | |
1267 | 20 if (!p) { | |
1268 | 21 spin_unlock(&gp_lock); | |
1269 | 22 return false; | |
1270 | 23 } | |
1271 | 24 rcu_assign_pointer(gp, NULL); | |
1272 | 25 call_rcu(&p->rh, remove_gp_cb); | |
1273 | 26 spin_unlock(&gp_lock); | |
1274 | 27 return true; | |
1275 | 28 } | |
1276 | ||
1277 | A definition of ``struct foo`` is finally needed, and appears on | |
1278 | lines 1-5. The function ``remove_gp_cb()`` is passed to ``call_rcu()`` | |
1279 | on line 25, and will be invoked after the end of a subsequent grace | |
1280 | period. This gets the same effect as ``remove_gp_synchronous()``, but | |
1281 | without forcing the updater to wait for a grace period to elapse. The | |
1282 | ``call_rcu()`` function may be used in a number of situations where | |
1283 | neither ``synchronize_rcu()`` nor ``synchronize_rcu_expedited()`` would | |
1284 | be legal, including within preempt-disable code, ``local_bh_disable()`` | |
1285 | code, interrupt-disable code, and interrupt handlers. However, even | |
1286 | ``call_rcu()`` is illegal within NMI handlers and from idle and offline | |
1287 | CPUs. The callback function (``remove_gp_cb()`` in this case) will be | |
executed within a softirq (software interrupt) environment within the
1289 | Linux kernel, either within a real softirq handler or under the | |
1290 | protection of ``local_bh_disable()``. In both the Linux kernel and in | |
1291 | userspace, it is bad practice to write an RCU callback function that | |
1292 | takes too long. Long-running operations should be relegated to separate | |
1293 | threads or (in the Linux kernel) workqueues. | |
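
For example, a callback that needs to do significant processing might
simply hand off to a workqueue, as in the following sketch. The
``work`` field added to ``struct foo`` and the ``heavy_cleanup()``
function are illustrative assumptions:

::

   /* Long-running operations are fine in workqueue context. */
   static void heavy_cleanup(struct work_struct *wp)
   {
           struct foo *p = container_of(wp, struct foo, work);

           /* ... lengthy processing goes here ... */
           kfree(p);
   }

   /* The RCU callback itself stays short: it just queues the work. */
   static void remove_gp_deferred_cb(struct rcu_head *rhp)
   {
           struct foo *p = container_of(rhp, struct foo, rh);

           INIT_WORK(&p->work, heavy_cleanup);
           schedule_work(&p->work);
   }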
1294 | ||
1295 | +-----------------------------------------------------------------------+ | |
1296 | | **Quick Quiz**: | | |
1297 | +-----------------------------------------------------------------------+ | |
1298 | | Why does line 19 use ``rcu_access_pointer()``? After all, | | |
1299 | | ``call_rcu()`` on line 25 stores into the structure, which would | | |
1300 | | interact badly with concurrent insertions. Doesn't this mean that | | |
1301 | | ``rcu_dereference()`` is required? | | |
1302 | +-----------------------------------------------------------------------+ | |
1303 | | **Answer**: | | |
1304 | +-----------------------------------------------------------------------+ | |
| Presumably the ``gp_lock`` acquired on line 18 excludes any          |
| changes, including any insertions that ``rcu_dereference()`` would   |
| protect against. Therefore, any insertions will be delayed until     |
| after ``gp_lock`` is released on line 26, which in turn means that   |
| ``rcu_access_pointer()`` suffices.                                   |
1310 | +-----------------------------------------------------------------------+ | |
1311 | ||
1312 | However, all that ``remove_gp_cb()`` is doing is invoking ``kfree()`` on | |
1313 | the data element. This is a common idiom, and is supported by | |
1314 | ``kfree_rcu()``, which allows “fire and forget” operation as shown | |
1315 | below: | |
1316 | ||
1317 | :: | |
1318 | ||
1319 | 1 struct foo { | |
1320 | 2 int a; | |
1321 | 3 int b; | |
1322 | 4 struct rcu_head rh; | |
1323 | 5 }; | |
1324 | 6 | |
1325 | 7 bool remove_gp_faf(void) | |
1326 | 8 { | |
1327 | 9 struct foo *p; | |
1328 | 10 | |
1329 | 11 spin_lock(&gp_lock); | |
12     p = rcu_access_pointer(gp);
1331 | 13 if (!p) { | |
1332 | 14 spin_unlock(&gp_lock); | |
1333 | 15 return false; | |
1334 | 16 } | |
1335 | 17 rcu_assign_pointer(gp, NULL); | |
1336 | 18 kfree_rcu(p, rh); | |
1337 | 19 spin_unlock(&gp_lock); | |
1338 | 20 return true; | |
1339 | 21 } | |
1340 | ||
1341 | Note that ``remove_gp_faf()`` simply invokes ``kfree_rcu()`` and | |
1342 | proceeds, without any need to pay any further attention to the | |
1343 | subsequent grace period and ``kfree()``. It is permissible to invoke | |
1344 | ``kfree_rcu()`` from the same environments as for ``call_rcu()``. | |
1345 | Interestingly enough, DYNIX/ptx had the equivalents of ``call_rcu()`` | |
1346 | and ``kfree_rcu()``, but not ``synchronize_rcu()``. This was due to the | |
1347 | fact that RCU was not heavily used within DYNIX/ptx, so the very few | |
1348 | places that needed something like ``synchronize_rcu()`` simply | |
1349 | open-coded it. | |
1350 | ||
1351 | +-----------------------------------------------------------------------+ | |
1352 | | **Quick Quiz**: | | |
1353 | +-----------------------------------------------------------------------+ | |
1354 | | Earlier it was claimed that ``call_rcu()`` and ``kfree_rcu()`` | | |
1355 | | allowed updaters to avoid being blocked by readers. But how can that | | |
1356 | | be correct, given that the invocation of the callback and the freeing | | |
1357 | | of the memory (respectively) must still wait for a grace period to | | |
1358 | | elapse? | | |
1359 | +-----------------------------------------------------------------------+ | |
1360 | | **Answer**: | | |
1361 | +-----------------------------------------------------------------------+ | |
1362 | | We could define things this way, but keep in mind that this sort of | | |
1363 | | definition would say that updates in garbage-collected languages | | |
1364 | | cannot complete until the next time the garbage collector runs, which | | |
1365 | | does not seem at all reasonable. The key point is that in most cases, | | |
1366 | | an updater using either ``call_rcu()`` or ``kfree_rcu()`` can proceed | | |
1367 | | to the next update as soon as it has invoked ``call_rcu()`` or | | |
1368 | | ``kfree_rcu()``, without having to wait for a subsequent grace | | |
1369 | | period. | | |
1370 | +-----------------------------------------------------------------------+ | |
1371 | ||
1372 | But what if the updater must wait for the completion of code to be | |
1373 | executed after the end of the grace period, but has other tasks that can | |
1374 | be carried out in the meantime? The polling-style | |
1375 | ``get_state_synchronize_rcu()`` and ``cond_synchronize_rcu()`` functions | |
1376 | may be used for this purpose, as shown below: | |
1377 | ||
1378 | :: | |
1379 | ||
1380 | 1 bool remove_gp_poll(void) | |
1381 | 2 { | |
1382 | 3 struct foo *p; | |
1383 | 4 unsigned long s; | |
1384 | 5 | |
1385 | 6 spin_lock(&gp_lock); | |
1386 | 7 p = rcu_access_pointer(gp); | |
1387 | 8 if (!p) { | |
1388 | 9 spin_unlock(&gp_lock); | |
1389 | 10 return false; | |
1390 | 11 } | |
1391 | 12 rcu_assign_pointer(gp, NULL); | |
1392 | 13 spin_unlock(&gp_lock); | |
1393 | 14 s = get_state_synchronize_rcu(); | |
1394 | 15 do_something_while_waiting(); | |
1395 | 16 cond_synchronize_rcu(s); | |
1396 | 17 kfree(p); | |
1397 | 18 return true; | |
1398 | 19 } | |
1399 | ||
1400 | On line 14, ``get_state_synchronize_rcu()`` obtains a “cookie” from RCU, | |
1401 | then line 15 carries out other tasks, and finally, line 16 returns | |
1402 | immediately if a grace period has elapsed in the meantime, but otherwise | |
waits as required. The need for ``get_state_synchronize_rcu()`` and
1404 | ``cond_synchronize_rcu()`` has appeared quite recently, so it is too | |
1405 | early to tell whether they will stand the test of time. | |
1406 | ||
1407 | RCU thus provides a range of tools to allow updaters to strike the | |
1408 | required tradeoff between latency, flexibility and CPU overhead. | |
1409 | ||
1410 | Forward Progress | |
1411 | ~~~~~~~~~~~~~~~~ | |
1412 | ||
1413 | In theory, delaying grace-period completion and callback invocation is | |
1414 | harmless. In practice, not only are memory sizes finite but also | |
1415 | callbacks sometimes do wakeups, and sufficiently deferred wakeups can be | |
1416 | difficult to distinguish from system hangs. Therefore, RCU must provide | |
1417 | a number of mechanisms to promote forward progress. | |
1418 | ||
1419 | These mechanisms are not foolproof, nor can they be. For one simple | |
1420 | example, an infinite loop in an RCU read-side critical section must by | |
1421 | definition prevent later grace periods from ever completing. For a more | |
1422 | involved example, consider a 64-CPU system built with | |
1423 | ``CONFIG_RCU_NOCB_CPU=y`` and booted with ``rcu_nocbs=1-63``, where | |
1424 | CPUs 1 through 63 spin in tight loops that invoke ``call_rcu()``. Even | |
1425 | if these tight loops also contain calls to ``cond_resched()`` (thus | |
1426 | allowing grace periods to complete), CPU 0 simply will not be able to | |
1427 | invoke callbacks as fast as the other 63 CPUs can register them, at | |
1428 | least not until the system runs out of memory. In both of these | |
1429 | examples, the Spiderman principle applies: With great power comes great | |
1430 | responsibility. However, short of this level of abuse, RCU is required | |
1431 | to ensure timely completion of grace periods and timely invocation of | |
1432 | callbacks. | |
1433 | ||
1434 | RCU takes the following steps to encourage timely completion of grace | |
1435 | periods: | |
1436 | ||
1437 | #. If a grace period fails to complete within 100 milliseconds, RCU | |
1438 | causes future invocations of ``cond_resched()`` on the holdout CPUs | |
1439 | to provide an RCU quiescent state. RCU also causes those CPUs' | |
1440 | ``need_resched()`` invocations to return ``true``, but only after the | |
corresponding CPU's next scheduling-clock interrupt.
1442 | #. CPUs mentioned in the ``nohz_full`` kernel boot parameter can run | |
1443 | indefinitely in the kernel without scheduling-clock interrupts, which | |
defeats the above ``need_resched()`` stratagem. RCU will therefore
1445 | invoke ``resched_cpu()`` on any ``nohz_full`` CPUs still holding out | |
1446 | after 109 milliseconds. | |
1447 | #. In kernels built with ``CONFIG_RCU_BOOST=y``, if a given task that | |
1448 | has been preempted within an RCU read-side critical section is | |
1449 | holding out for more than 500 milliseconds, RCU will resort to | |
1450 | priority boosting. | |
1451 | #. If a CPU is still holding out 10 seconds into the grace period, RCU | |
1452 | will invoke ``resched_cpu()`` on it regardless of its ``nohz_full`` | |
1453 | state. | |
1454 | ||
1455 | The above values are defaults for systems running with ``HZ=1000``. They | |
1456 | will vary as the value of ``HZ`` varies, and can also be changed using | |
1457 | the relevant Kconfig options and kernel boot parameters. RCU currently | |
1458 | does not do much sanity checking of these parameters, so please use | |
1459 | caution when changing them. Note that these forward-progress measures | |
1460 | are provided only for RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks | |
1461 | RCU <#Tasks%20RCU>`__. | |
1462 | ||
1463 | RCU takes the following steps in ``call_rcu()`` to encourage timely | |
1464 | invocation of callbacks when any given non-\ ``rcu_nocbs`` CPU has | |
1465 | 10,000 callbacks, or has 10,000 more callbacks than it had the last time | |
1466 | encouragement was provided: | |
1467 | ||
1468 | #. Starts a grace period, if one is not already in progress. | |
1469 | #. Forces immediate checking for quiescent states, rather than waiting | |
1470 | for three milliseconds to have elapsed since the beginning of the | |
1471 | grace period. | |
1472 | #. Immediately tags the CPU's callbacks with their grace period | |
1473 | completion numbers, rather than waiting for the ``RCU_SOFTIRQ`` | |
1474 | handler to get around to it. | |
1475 | #. Lifts callback-execution batch limits, which speeds up callback | |
1476 | invocation at the expense of degrading realtime response. | |
1477 | ||
1478 | Again, these are default values when running at ``HZ=1000``, and can be | |
1479 | overridden. Again, these forward-progress measures are provided only for | |
1480 | RCU, not for `SRCU <#Sleepable%20RCU>`__ or `Tasks | |
1481 | RCU <#Tasks%20RCU>`__. Even for RCU, callback-invocation forward | |
1482 | progress for ``rcu_nocbs`` CPUs is much less well-developed, in part | |
1483 | because workloads benefiting from ``rcu_nocbs`` CPUs tend to invoke | |
1484 | ``call_rcu()`` relatively infrequently. If workloads emerge that need | |
1485 | both ``rcu_nocbs`` CPUs and high ``call_rcu()`` invocation rates, then | |
1486 | additional forward-progress work will be required. | |
1487 | ||
1488 | Composability | |
1489 | ~~~~~~~~~~~~~ | |
1490 | ||
1491 | Composability has received much attention in recent years, perhaps in | |
1492 | part due to the collision of multicore hardware with object-oriented | |
1493 | techniques designed in single-threaded environments for single-threaded | |
1494 | use. And in theory, RCU read-side critical sections may be composed, and | |
1495 | in fact may be nested arbitrarily deeply. In practice, as with all | |
1496 | real-world implementations of composable constructs, there are | |
1497 | limitations. | |
1498 | ||
1499 | Implementations of RCU for which ``rcu_read_lock()`` and | |
1500 | ``rcu_read_unlock()`` generate no code, such as Linux-kernel RCU when | |
1501 | ``CONFIG_PREEMPT=n``, can be nested arbitrarily deeply. After all, there | |
1502 | is no overhead. Except that if all these instances of | |
1503 | ``rcu_read_lock()`` and ``rcu_read_unlock()`` are visible to the | |
1504 | compiler, compilation will eventually fail due to exhausting memory, | |
1505 | mass storage, or user patience, whichever comes first. If the nesting is | |
1506 | not visible to the compiler, as is the case with mutually recursive | |
1507 | functions each in its own translation unit, stack overflow will result. | |
1508 | If the nesting takes the form of loops, perhaps in the guise of tail | |
1509 | recursion, either the control variable will overflow or (in the Linux | |
1510 | kernel) you will get an RCU CPU stall warning. Nevertheless, this class | |
1511 | of RCU implementations is one of the most composable constructs in | |
1512 | existence. | |
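
For example, the following sketch is perfectly legal, whether the
inner reader is open-coded as shown or buried within some called
function (the ``do_*()`` calls are stand-ins):

::

   rcu_read_lock();
   do_a_little();
   rcu_read_lock();        /* Legal: read-side critical sections nest. */
   do_a_little_more();
   rcu_read_unlock();
   do_yet_more();
   rcu_read_unlock();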
1513 | ||
1514 | RCU implementations that explicitly track nesting depth are limited by | |
1515 | the nesting-depth counter. For example, the Linux kernel's preemptible | |
1516 | RCU limits nesting to ``INT_MAX``. This should suffice for almost all | |
1517 | practical purposes. That said, a consecutive pair of RCU read-side | |
1518 | critical sections between which there is an operation that waits for a | |
1519 | grace period cannot be enclosed in another RCU read-side critical | |
1520 | section. This is because it is not legal to wait for a grace period | |
1521 | within an RCU read-side critical section: To do so would result either | |
1522 | in deadlock or in RCU implicitly splitting the enclosing RCU read-side | |
1523 | critical section, neither of which is conducive to a long-lived and | |
1524 | prosperous kernel. | |
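
The forbidden composition looks like this sketch:

::

   rcu_read_lock();
   do_something();
   synchronize_rcu();      /* BUG: grace-period wait within a reader!!! */
   do_something_else();
   rcu_read_unlock();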
1525 | ||
1526 | It is worth noting that RCU is not alone in limiting composability. For | |
1527 | example, many transactional-memory implementations prohibit composing a | |
1528 | pair of transactions separated by an irrevocable operation (for example, | |
1529 | a network receive operation). For another example, lock-based critical | |
1530 | sections can be composed surprisingly freely, but only if deadlock is | |
1531 | avoided. | |
1532 | ||
1533 | In short, although RCU read-side critical sections are highly | |
1534 | composable, care is required in some situations, just as is the case for | |
1535 | any other composable synchronization mechanism. | |
1536 | ||
1537 | Corner Cases | |
1538 | ~~~~~~~~~~~~ | |
1539 | ||
1540 | A given RCU workload might have an endless and intense stream of RCU | |
1541 | read-side critical sections, perhaps even so intense that there was | |
1542 | never a point in time during which there was not at least one RCU | |
1543 | read-side critical section in flight. RCU cannot allow this situation to | |
1544 | block grace periods: As long as all the RCU read-side critical sections | |
1545 | are finite, grace periods must also be finite. | |
1546 | ||
1547 | That said, preemptible RCU implementations could potentially result in | |
1548 | RCU read-side critical sections being preempted for long durations, | |
1549 | which has the effect of creating a long-duration RCU read-side critical | |
1550 | section. This situation can arise only in heavily loaded systems, but | |
1551 | systems using real-time priorities are of course more vulnerable. | |
1552 | Therefore, RCU priority boosting is provided to help deal with this | |
case. However, the exact requirements on RCU priority boosting will
1554 | likely evolve as more experience accumulates. | |
1555 | ||
1556 | Other workloads might have very high update rates. Although one can | |
1557 | argue that such workloads should instead use something other than RCU, | |
1558 | the fact remains that RCU must handle such workloads gracefully. This | |
1559 | requirement is another factor driving batching of grace periods, but it | |
1560 | is also the driving force behind the checks for large numbers of queued | |
1561 | RCU callbacks in the ``call_rcu()`` code path. Finally, high update | |
1562 | rates should not delay RCU read-side critical sections, although some | |
1563 | small read-side delays can occur when using | |
1564 | ``synchronize_rcu_expedited()``, courtesy of this function's use of | |
1565 | ``smp_call_function_single()``. | |
1566 | ||
1567 | Although all three of these corner cases were understood in the early | |
1568 | 1990s, a simple user-level test consisting of ``close(open(path))`` in a | |
1569 | tight loop in the early 2000s suddenly provided a much deeper | |
1570 | appreciation of the high-update-rate corner case. This test also | |
motivated the addition of some RCU code to react to high update rates. For
1572 | example, if a given CPU finds itself with more than 10,000 RCU callbacks | |
1573 | queued, it will cause RCU to take evasive action by more aggressively | |
1574 | starting grace periods and more aggressively forcing completion of | |
1575 | grace-period processing. This evasive action causes the grace period to | |
1576 | complete more quickly, but at the cost of restricting RCU's batching | |
1577 | optimizations, thus increasing the CPU overhead incurred by that grace | |
1578 | period. | |
1579 | ||
1580 | Software-Engineering Requirements | |
1581 | --------------------------------- | |
1582 | ||
1583 | Between Murphy's Law and “To err is human”, it is necessary to guard | |
1584 | against mishaps and misuse: | |
1585 | ||
1586 | #. It is all too easy to forget to use ``rcu_read_lock()`` everywhere | |
1587 | that it is needed, so kernels built with ``CONFIG_PROVE_RCU=y`` will | |
1588 | splat if ``rcu_dereference()`` is used outside of an RCU read-side | |
1589 | critical section. Update-side code can use | |
1590 | ``rcu_dereference_protected()``, which takes a `lockdep | |
1591 | expression <https://lwn.net/Articles/371986/>`__ to indicate what is | |
1592 | providing the protection. If the indicated protection is not | |
provided, a lockdep splat is emitted (see the sketch following
this list).
1594 | Code shared between readers and updaters can use | |
1595 | ``rcu_dereference_check()``, which also takes a lockdep expression, | |
1596 | and emits a lockdep splat if neither ``rcu_read_lock()`` nor the | |
1597 | indicated protection is in place. In addition, | |
1598 | ``rcu_dereference_raw()`` is used in those (hopefully rare) cases | |
1599 | where the required protection cannot be easily described. Finally, | |
1600 | ``rcu_read_lock_held()`` is provided to allow a function to verify | |
1601 | that it has been invoked within an RCU read-side critical section. I | |
1602 | was made aware of this set of requirements shortly after Thomas | |
1603 | Gleixner audited a number of RCU uses. | |
1604 | #. A given function might wish to check for RCU-related preconditions | |
1605 | upon entry, before using any other RCU API. The | |
``rcu_lockdep_assert()`` macro does this job, asserting the expression in
1607 | kernels having lockdep enabled and doing nothing otherwise. | |
1608 | #. It is also easy to forget to use ``rcu_assign_pointer()`` and | |
1609 | ``rcu_dereference()``, perhaps (incorrectly) substituting a simple | |
1610 | assignment. To catch this sort of error, a given RCU-protected | |
1611 | pointer may be tagged with ``__rcu``, after which sparse will | |
1612 | complain about simple-assignment accesses to that pointer. Arnd | |
1613 | Bergmann made me aware of this requirement, and also supplied the | |
1614 | needed `patch series <https://lwn.net/Articles/376011/>`__. | |
1615 | #. Kernels built with ``CONFIG_DEBUG_OBJECTS_RCU_HEAD=y`` will splat if | |
1616 | a data element is passed to ``call_rcu()`` twice in a row, without a | |
1617 | grace period in between. (This error is similar to a double free.) | |
1618 | The corresponding ``rcu_head`` structures that are dynamically | |
1619 | allocated are automatically tracked, but ``rcu_head`` structures | |
1620 | allocated on the stack must be initialized with | |
1621 | ``init_rcu_head_on_stack()`` and cleaned up with | |
1622 | ``destroy_rcu_head_on_stack()``. Similarly, statically allocated | |
1623 | non-stack ``rcu_head`` structures must be initialized with | |
``init_rcu_head()`` and cleaned up with ``destroy_rcu_head()``.
(On-stack usage is sketched below, following this list.)
1625 | Mathieu Desnoyers made me aware of this requirement, and also | |
1626 | supplied the needed | |
1627 | `patch <https://lkml.kernel.org/g/20100319013024.GA28456@Krystal>`__. | |
1628 | #. An infinite loop in an RCU read-side critical section will eventually | |
1629 | trigger an RCU CPU stall warning splat, with the duration of | |
1630 | “eventually” being controlled by the ``RCU_CPU_STALL_TIMEOUT`` | |
1631 | ``Kconfig`` option, or, alternatively, by the | |
1632 | ``rcupdate.rcu_cpu_stall_timeout`` boot/sysfs parameter. However, RCU | |
1633 | is not obligated to produce this splat unless there is a grace period | |
1634 | waiting on that particular RCU read-side critical section. | |
1635 | ||
1636 | Some extreme workloads might intentionally delay RCU grace periods, | |
1637 | and systems running those workloads can be booted with | |
1638 | ``rcupdate.rcu_cpu_stall_suppress`` to suppress the splats. This | |
1639 | kernel parameter may also be set via ``sysfs``. Furthermore, RCU CPU | |
1640 | stall warnings are counter-productive during sysrq dumps and during | |
1641 | panics. RCU therefore supplies the ``rcu_sysrq_start()`` and | |
1642 | ``rcu_sysrq_end()`` API members to be called before and after long | |
1643 | sysrq dumps. RCU also supplies the ``rcu_panic()`` notifier that is | |
1644 | automatically invoked at the beginning of a panic to suppress further | |
1645 | RCU CPU stall warnings. | |
1646 | ||
1647 | This requirement made itself known in the early 1990s, pretty much | |
1648 | the first time that it was necessary to debug a CPU stall. That said, | |
1649 | the initial implementation in DYNIX/ptx was quite generic in | |
1650 | comparison with that of Linux. | |
1651 | ||
1652 | #. Although it would be very good to detect pointers leaking out of RCU | |
1653 | read-side critical sections, there is currently no good way of doing | |
1654 | this. One complication is the need to distinguish between pointers | |
1655 | leaking and pointers that have been handed off from RCU to some other | |
1656 | synchronization mechanism, for example, reference counting. | |
1657 | #. In kernels built with ``CONFIG_RCU_TRACE=y``, RCU-related information | |
1658 | is provided via event tracing. | |
1659 | #. Open-coded use of ``rcu_assign_pointer()`` and ``rcu_dereference()`` | |
1660 | to create typical linked data structures can be surprisingly | |
1661 | error-prone. Therefore, RCU-protected `linked | |
1662 | lists <https://lwn.net/Articles/609973/#RCU%20List%20APIs>`__ and, | |
1663 | more recently, RCU-protected `hash | |
1664 | tables <https://lwn.net/Articles/612100/>`__ are available. Many | |
1665 | other special-purpose RCU-protected data structures are available in | |
1666 | the Linux kernel and the userspace RCU library. | |
1667 | #. Some linked structures are created at compile time, but still require | |
1668 | ``__rcu`` checking. The ``RCU_POINTER_INITIALIZER()`` macro serves | |
1669 | this purpose. | |
1670 | #. It is not necessary to use ``rcu_assign_pointer()`` when creating | |
1671 | linked structures that are to be published via a single external | |
1672 | pointer. The ``RCU_INIT_POINTER()`` macro is provided for this task | |
1673 | and also for assigning ``NULL`` pointers at runtime. | |
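
As a sketch of the lockdep expressions called out in the first item
above, and assuming the ``gp`` pointer and ``gp_lock`` from the
earlier examples:

::

   struct foo *p;

   /* Update side: gp_lock is held, so no rcu_read_lock() is needed. */
   p = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));

   /* Code shared between readers and updaters: either suffices. */
   p = rcu_dereference_check(gp,
                             rcu_read_lock_held() ||
                             lockdep_is_held(&gp_lock));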
1674 | ||
This is not a hard-and-fast list: RCU's diagnostic capabilities will
1676 | continue to be guided by the number and type of usage bugs found in | |
1677 | real-world RCU usage. | |
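
As promised above, here is a sketch of on-stack ``rcu_head`` usage, in
which the enclosing function must not return until the callback has
been invoked. The ``on_stack_wait`` structure and ``on_stack_cb()``
function are illustrative, not an existing kernel API:

::

   struct on_stack_wait {
           struct rcu_head head;
           struct completion done;
   };

   static void on_stack_cb(struct rcu_head *rhp)
   {
           struct on_stack_wait *w;

           w = container_of(rhp, struct on_stack_wait, head);
           complete(&w->done);
   }

   static void wait_for_gp_on_stack(void)
   {
           struct on_stack_wait w;

           init_rcu_head_on_stack(&w.head);  /* Debug-objects tracking. */
           init_completion(&w.done);
           call_rcu(&w.head, on_stack_cb);
           wait_for_completion(&w.done);     /* Frame must outlive callback. */
           destroy_rcu_head_on_stack(&w.head);
   }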
1678 | ||
1679 | Linux Kernel Complications | |
1680 | -------------------------- | |
1681 | ||
1682 | The Linux kernel provides an interesting environment for all kinds of | |
1683 | software, including RCU. Some of the relevant points of interest are as | |
1684 | follows: | |
1685 | ||
1686 | #. `Configuration`_ |
1687 | #. `Firmware Interface`_ | |
1688 | #. `Early Boot`_ | |
1689 | #. `Interrupts and NMIs`_ | |
1690 | #. `Loadable Modules`_ | |
1691 | #. `Hotplug CPU`_ | |
1692 | #. `Scheduler and RCU`_ | |
1693 | #. `Tracing and RCU`_ | |
1694 | #. `Energy Efficiency`_ | |
1695 | #. `Scheduling-Clock Interrupts and RCU`_ | |
1696 | #. `Memory Efficiency`_ | |
1697 | #. `Performance, Scalability, Response Time, and Reliability`_ | |
1698 | |
1699 | This list is probably incomplete, but it does give a feel for the most | |
1700 | notable Linux-kernel complications. Each of the following sections | |
1701 | covers one of the above topics. | |
1702 | ||
1703 | Configuration | |
1704 | ~~~~~~~~~~~~~ | |
1705 | ||
1706 | RCU's goal is automatic configuration, so that almost nobody needs to | |
1707 | worry about RCU's ``Kconfig`` options. And for almost all users, RCU | |
1708 | does in fact work well “out of the box.” | |
1709 | ||
1710 | However, there are specialized use cases that are handled by kernel boot | |
1711 | parameters and ``Kconfig`` options. Unfortunately, the ``Kconfig`` | |
1712 | system will explicitly ask users about new ``Kconfig`` options, which | |
1713 | requires almost all of them be hidden behind a ``CONFIG_RCU_EXPERT`` | |
1714 | ``Kconfig`` option. | |
1715 | ||
1716 | This all should be quite obvious, but the fact remains that Linus | |
1717 | Torvalds recently had to | |
1718 | `remind <https://lkml.kernel.org/g/CA+55aFy4wcCwaL4okTs8wXhGZ5h-ibecy_Meg9C4MNQrUnwMcg@mail.gmail.com>`__ | |
1719 | me of this requirement. | |
1720 | ||
1721 | Firmware Interface | |
1722 | ~~~~~~~~~~~~~~~~~~ | |
1723 | ||
In many cases, the kernel obtains information about the system from the
1725 | firmware, and sometimes things are lost in translation. Or the | |
1726 | translation is accurate, but the original message is bogus. | |
1727 | ||
1728 | For example, some systems' firmware overreports the number of CPUs, | |
1729 | sometimes by a large factor. If RCU naively believed the firmware, as it | |
1730 | used to do, it would create too many per-CPU kthreads. Although the | |
1731 | resulting system will still run correctly, the extra kthreads needlessly | |
1732 | consume memory and can cause confusion when they show up in ``ps`` | |
1733 | listings. | |
1734 | ||
1735 | RCU must therefore wait for a given CPU to actually come online before | |
1736 | it can allow itself to believe that the CPU actually exists. The | |
1737 | resulting “ghost CPUs” (which are never going to come online) cause a | |
1738 | number of `interesting | |
1739 | complications <https://paulmck.livejournal.com/37494.html>`__. | |
1740 | ||
1741 | Early Boot | |
1742 | ~~~~~~~~~~ | |
1743 | ||
1744 | The Linux kernel's boot sequence is an interesting process, and RCU is | |
1745 | used early, even before ``rcu_init()`` is invoked. In fact, a number of | |
1746 | RCU's primitives can be used as soon as the initial task's | |
1747 | ``task_struct`` is available and the boot CPU's per-CPU variables are | |
1748 | set up. The read-side primitives (``rcu_read_lock()``, | |
1749 | ``rcu_read_unlock()``, ``rcu_dereference()``, and | |
1750 | ``rcu_access_pointer()``) will operate normally very early on, as will | |
1751 | ``rcu_assign_pointer()``. | |
1752 | ||
1753 | Although ``call_rcu()`` may be invoked at any time during boot, | |
1754 | callbacks are not guaranteed to be invoked until after all of RCU's | |
1755 | kthreads have been spawned, which occurs at ``early_initcall()`` time. | |
1756 | This delay in callback invocation is due to the fact that RCU does not | |
1757 | invoke callbacks until it is fully initialized, and this full | |
1758 | initialization cannot occur until after the scheduler has initialized | |
1759 | itself to the point where RCU can spawn and run its kthreads. In theory, | |
it would be possible to invoke callbacks earlier; however, this is not a
1761 | panacea because there would be severe restrictions on what operations | |
1762 | those callbacks could invoke. | |
1763 | ||
1764 | Perhaps surprisingly, ``synchronize_rcu()`` and | |
``synchronize_rcu_expedited()`` will operate normally during very early
1766 | boot, the reason being that there is only one CPU and preemption is | |
disabled. This means that a call to ``synchronize_rcu()`` (or friends)
1768 | itself is a quiescent state and thus a grace period, so the early-boot | |
1769 | implementation can be a no-op. | |
1770 | ||
1771 | However, once the scheduler has spawned its first kthread, this early | |
1772 | boot trick fails for ``synchronize_rcu()`` (as well as for | |
1773 | ``synchronize_rcu_expedited()``) in ``CONFIG_PREEMPT=y`` kernels. The | |
1774 | reason is that an RCU read-side critical section might be preempted, | |
1775 | which means that a subsequent ``synchronize_rcu()`` really does have to | |
1776 | wait for something, as opposed to simply returning immediately. | |
1777 | Unfortunately, ``synchronize_rcu()`` can't do this until all of its | |
1778 | kthreads are spawned, which doesn't happen until some time during | |
``early_initcall()`` time. But this is no excuse: RCU is nevertheless
1780 | required to correctly handle synchronous grace periods during this time | |
1781 | period. Once all of its kthreads are up and running, RCU starts running | |
1782 | normally. | |
1783 | ||
1784 | +-----------------------------------------------------------------------+ | |
1785 | | **Quick Quiz**: | | |
1786 | +-----------------------------------------------------------------------+ | |
1787 | | How can RCU possibly handle grace periods before all of its kthreads | | |
1788 | | have been spawned??? | | |
1789 | +-----------------------------------------------------------------------+ | |
1790 | | **Answer**: | | |
1791 | +-----------------------------------------------------------------------+ | |
1792 | | Very carefully! | | |
1793 | | During the “dead zone” between the time that the scheduler spawns the | | |
1794 | | first task and the time that all of RCU's kthreads have been spawned, | | |
1795 | | all synchronous grace periods are handled by the expedited | | |
1796 | | grace-period mechanism. At runtime, this expedited mechanism relies | | |
1797 | | on workqueues, but during the dead zone the requesting task itself | | |
1798 | | drives the desired expedited grace period. Because dead-zone | | |
1799 | | execution takes place within task context, everything works. Once the | | |
1800 | | dead zone ends, expedited grace periods go back to using workqueues, | | |
1801 | | as is required to avoid problems that would otherwise occur when a | | |
1802 | | user task received a POSIX signal while driving an expedited grace | | |
1803 | | period. | | |
1804 | | | | |
1805 | | And yes, this does mean that it is unhelpful to send POSIX signals to | | |
1806 | | random tasks between the time that the scheduler spawns its first | | |
1807 | | kthread and the time that RCU's kthreads have all been spawned. If | | |
1808 | | there ever turns out to be a good reason for sending POSIX signals | | |
1809 | | during that time, appropriate adjustments will be made. (If it turns | | |
1810 | | out that POSIX signals are sent during this time for no good reason, | | |
1811 | | other adjustments will be made, appropriate or otherwise.) | | |
1812 | +-----------------------------------------------------------------------+ | |
1813 | ||
1814 | I learned of these boot-time requirements as a result of a series of | |
1815 | system hangs. | |
1816 | ||
1817 | Interrupts and NMIs | |
1818 | ~~~~~~~~~~~~~~~~~~~ | |
1819 | ||
1820 | The Linux kernel has interrupts, and RCU read-side critical sections are | |
1821 | legal within interrupt handlers and within interrupt-disabled regions of | |
1822 | code, as are invocations of ``call_rcu()``. | |
1823 | ||
1824 | Some Linux-kernel architectures can enter an interrupt handler from | |
1825 | non-idle process context, and then just never leave it, instead | |
1826 | stealthily transitioning back to process context. This trick is | |
1827 | sometimes used to invoke system calls from inside the kernel. These | |
1828 | “half-interrupts” mean that RCU has to be very careful about how it | |
1829 | counts interrupt nesting levels. I learned of this requirement the hard | |
1830 | way during a rewrite of RCU's dyntick-idle code. | |
1831 | ||
1832 | The Linux kernel has non-maskable interrupts (NMIs), and RCU read-side | |
1833 | critical sections are legal within NMI handlers. Thankfully, RCU | |
1834 | update-side primitives, including ``call_rcu()``, are prohibited within | |
1835 | NMI handlers. | |
1836 | ||
1837 | The name notwithstanding, some Linux-kernel architectures can have | |
1838 | nested NMIs, which RCU must handle correctly. Andy Lutomirski `surprised | |
1839 | me <https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com>`__ | |
1840 | with this requirement; he also kindly surprised me with `an | |
1841 | algorithm <https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com>`__ | |
1842 | that meets this requirement. | |
1843 | ||
1844 | Furthermore, NMI handlers can be interrupted by what appear to RCU to be | |
1845 | normal interrupts. One way that this can happen is for code that | |
1846 | directly invokes ``rcu_irq_enter()`` and ``rcu_irq_exit()`` to be called | |
1847 | from an NMI handler. This astonishing fact of life prompted the current | |
1848 | code structure, which has ``rcu_irq_enter()`` invoking | |
1849 | ``rcu_nmi_enter()`` and ``rcu_irq_exit()`` invoking ``rcu_nmi_exit()``. | |
1850 | And yes, I also learned of this requirement the hard way. | |
1851 | ||
1852 | Loadable Modules | |
1853 | ~~~~~~~~~~~~~~~~ | |
1854 | ||
1855 | The Linux kernel has loadable modules, and these modules can also be | |
1856 | unloaded. After a given module has been unloaded, any attempt to call | |
1857 | one of its functions results in a segmentation fault. The module-unload | |
1858 | functions must therefore cancel any delayed calls to loadable-module | |
1859 | functions, for example, any outstanding ``mod_timer()`` must be dealt | |
1860 | with via ``del_timer_sync()`` or similar. | |
1861 | ||
1862 | Unfortunately, there is no way to cancel an RCU callback; once you | |
1863 | invoke ``call_rcu()``, the callback function is eventually going to be | |
1864 | invoked, unless the system goes down first. Because it is normally | |
1865 | considered socially irresponsible to crash the system in response to a | |
1866 | module unload request, we need some other way to deal with in-flight RCU | |
1867 | callbacks. | |
1868 | ||
1869 | RCU therefore provides ``rcu_barrier()``, which waits until all | |
1870 | in-flight RCU callbacks have been invoked. If a module uses | |
1871 | ``call_rcu()``, its exit function should therefore prevent any future | |
1872 | invocation of ``call_rcu()``, then invoke ``rcu_barrier()``. In theory, | |
1873 | the underlying module-unload code could invoke ``rcu_barrier()`` | |
1874 | unconditionally, but in practice this would incur unacceptable | |
1875 | latencies. | |
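
For example, a module exit function might follow this sketch, in which
``stop_queueing_callbacks()`` stands in for whatever module-specific
mechanism prevents further invocations of ``call_rcu()``:

::

   static void __exit example_exit(void)
   {
           stop_queueing_callbacks();      /* No new callbacks after this. */
           rcu_barrier();                  /* Wait for in-flight callbacks. */
           /* Module text and data may now safely disappear. */
   }
   module_exit(example_exit);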
1876 | ||
1877 | Nikita Danilov noted this requirement for an analogous | |
1878 | filesystem-unmount situation, and Dipankar Sarma incorporated | |
1879 | ``rcu_barrier()`` into RCU. The need for ``rcu_barrier()`` for module | |
1880 | unloading became apparent later. | |
1881 | ||
1882 | .. important:: | |
1883 | ||
1884 | The ``rcu_barrier()`` function is not, repeat, | |
1885 | *not*, obligated to wait for a grace period. It is instead only required | |
1886 | to wait for RCU callbacks that have already been posted. Therefore, if | |
1887 | there are no RCU callbacks posted anywhere in the system, | |
1888 | ``rcu_barrier()`` is within its rights to return immediately. Even if | |
1889 | there are callbacks posted, ``rcu_barrier()`` does not necessarily need | |
1890 | to wait for a grace period. | |
1891 | ||
1892 | +-----------------------------------------------------------------------+ | |
1893 | | **Quick Quiz**: | | |
1894 | +-----------------------------------------------------------------------+ | |
| Wait a minute! Each RCU callback must wait for a grace period to     |
1896 | | complete, and ``rcu_barrier()`` must wait for each pre-existing | | |
1897 | | callback to be invoked. Doesn't ``rcu_barrier()`` therefore need to | | |
1898 | | wait for a full grace period if there is even one callback posted | | |
1899 | | anywhere in the system? | | |
1900 | +-----------------------------------------------------------------------+ | |
1901 | | **Answer**: | | |
1902 | +-----------------------------------------------------------------------+ | |
1903 | | Absolutely not!!! | | |
| Yes, each RCU callback must wait for a grace period to complete, but |
1905 | | it might well be partly (or even completely) finished waiting by the | | |
1906 | | time ``rcu_barrier()`` is invoked. In that case, ``rcu_barrier()`` | | |
1907 | | need only wait for the remaining portion of the grace period to | | |
1908 | | elapse. So even if there are quite a few callbacks posted, | | |
1909 | | ``rcu_barrier()`` might well return quite quickly. | | |
1910 | | | | |
1911 | | So if you need to wait for a grace period as well as for all | | |
1912 | | pre-existing callbacks, you will need to invoke both | | |
1913 | | ``synchronize_rcu()`` and ``rcu_barrier()``. If latency is a concern, | | |
1914 | | you can always use workqueues to invoke them concurrently. | | |
1915 | +-----------------------------------------------------------------------+ | |
1916 | ||
1917 | Hotplug CPU | |
1918 | ~~~~~~~~~~~ | |
1919 | ||
1920 | The Linux kernel supports CPU hotplug, which means that CPUs can come | |
1921 | and go. It is of course illegal to use any RCU API member from an | |
1922 | offline CPU, with the exception of `SRCU <#Sleepable%20RCU>`__ read-side | |
1923 | critical sections. This requirement was present from day one in | |
1924 | DYNIX/ptx, but on the other hand, the Linux kernel's CPU-hotplug | |
1925 | implementation is “interesting.” | |
1926 | ||
1927 | The Linux-kernel CPU-hotplug implementation has notifiers that are used | |
1928 | to allow the various kernel subsystems (including RCU) to respond | |
1929 | appropriately to a given CPU-hotplug operation. Most RCU operations may | |
1930 | be invoked from CPU-hotplug notifiers, including even synchronous | |
1931 | grace-period operations such as ``synchronize_rcu()`` and | |
1932 | ``synchronize_rcu_expedited()``. | |
1933 | ||
However, all-callback-wait operations such as ``rcu_barrier()`` are
not supported, due to the fact that there are phases of CPU-hotplug
operations where the outgoing CPU's callbacks will not be invoked until
after the CPU-hotplug operation ends, which could result in deadlock.
Furthermore, ``rcu_barrier()`` blocks CPU-hotplug operations
1939 | during its execution, which results in another type of deadlock when | |
1940 | invoked from a CPU-hotplug notifier. | |
1941 | ||
1942 | Scheduler and RCU | |
1943 | ~~~~~~~~~~~~~~~~~ | |
1944 | ||
1945 | RCU depends on the scheduler, and the scheduler uses RCU to protect some | |
1946 | of its data structures. The preemptible-RCU ``rcu_read_unlock()`` | |
1947 | implementation must therefore be written carefully to avoid deadlocks | |
1948 | involving the scheduler's runqueue and priority-inheritance locks. In | |
1949 | particular, ``rcu_read_unlock()`` must tolerate an interrupt where the | |
1950 | interrupt handler invokes both ``rcu_read_lock()`` and | |
1951 | ``rcu_read_unlock()``. This possibility requires ``rcu_read_unlock()`` | |
1952 | to use negative nesting levels to avoid destructive recursion via | |
the interrupt handler's use of RCU.
1954 | ||
1955 | This scheduler-RCU requirement came as a `complete | |
1956 | surprise <https://lwn.net/Articles/453002/>`__. | |
1957 | ||
1958 | As noted above, RCU makes use of kthreads, and it is necessary to avoid | |
1959 | excessive CPU-time accumulation by these kthreads. This requirement was | |
1960 | no surprise, but RCU's violation of it when running context-switch-heavy | |
1961 | workloads when built with ``CONFIG_NO_HZ_FULL=y`` `did come as a | |
1962 | surprise | |
1963 | [PDF] <http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf>`__. | |
1964 | RCU has made good progress towards meeting this requirement, even for | |
1965 | context-switch-heavy ``CONFIG_NO_HZ_FULL=y`` workloads, but there is | |
1966 | room for further improvement. | |
1967 | ||
1968 | It is forbidden to hold any of scheduler's runqueue or | |
1969 | priority-inheritance spinlocks across an ``rcu_read_unlock()`` unless | |
1970 | interrupts have been disabled across the entire RCU read-side critical | |
1971 | section, that is, up to and including the matching ``rcu_read_lock()``. | |
1972 | Violating this restriction can result in deadlocks involving these | |
1973 | scheduler spinlocks. There was hope that this restriction might be | |
1974 | lifted when interrupt-disabled calls to ``rcu_read_unlock()`` started | |
1975 | deferring the reporting of the resulting RCU-preempt quiescent state | |
1976 | until the end of the corresponding interrupts-disabled region. | |
1977 | Unfortunately, timely reporting of the corresponding quiescent state to | |
1978 | expedited grace periods requires a call to ``raise_softirq()``, which | |
1979 | can acquire these scheduler spinlocks. In addition, real-time systems | |
1980 | using RCU priority boosting need this restriction to remain in effect | |
1981 | because deferred quiescent-state reporting would also defer deboosting, | |
1982 | which in turn would degrade real-time latencies. | |
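
In other words, when a scheduler spinlock really must be held across
``rcu_read_unlock()``, the code must follow something like the sketch
below, in which ``some_scheduler_lock`` is an illustrative stand-in
for a runqueue or priority-inheritance lock:

::

   unsigned long flags;

   local_irq_save(flags);          /* Disabled across the entire reader. */
   rcu_read_lock();
   do_something();
   raw_spin_lock(&some_scheduler_lock);
   rcu_read_unlock();              /* OK: interrupts are still disabled. */
   do_something_else();
   raw_spin_unlock(&some_scheduler_lock);
   local_irq_restore(flags);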
1983 | ||
1984 | In theory, if a given RCU read-side critical section could be guaranteed | |
1985 | to be less than one second in duration, holding a scheduler spinlock | |
1986 | across that critical section's ``rcu_read_unlock()`` would require only | |
1987 | that preemption be disabled across the entire RCU read-side critical | |
1988 | section, not interrupts. Unfortunately, given the possibility of vCPU | |
1989 | preemption, long-running interrupts, and so on, it is not possible in | |
1990 | practice to guarantee that a given RCU read-side critical section will | |
1991 | complete in less than one second. Therefore, as noted above, if | |
1992 | scheduler spinlocks are held across a given call to | |
1993 | ``rcu_read_unlock()``, interrupts must be disabled across the entire RCU | |
1994 | read-side critical section. | |
1995 | ||
1996 | Tracing and RCU | |
1997 | ~~~~~~~~~~~~~~~ | |
1998 | ||
1999 | It is possible to use tracing on RCU code, but tracing itself uses RCU. | |
For this reason, ``rcu_dereference_raw_check()`` is provided for use
2001 | by tracing, which avoids the destructive recursion that could otherwise |
2002 | ensue. This API is also used by virtualization in some architectures, | |
2003 | where RCU readers execute in environments in which tracing cannot be | |
2004 | used. The tracing folks both located the requirement and provided the | |
2005 | needed fix, so this surprise requirement was relatively painless. | |
2006 | ||
2007 | Energy Efficiency | |
2008 | ~~~~~~~~~~~~~~~~~ | |
2009 | ||
2010 | Interrupting idle CPUs is considered socially unacceptable, especially | |
2011 | by people with battery-powered embedded systems. RCU therefore conserves | |
2012 | energy by detecting which CPUs are idle, including tracking CPUs that | |
2013 | have been interrupted from idle. This is a large part of the | |
2014 | energy-efficiency requirement, so I learned of this via an irate phone | |
2015 | call. | |
2016 | ||
2017 | Because RCU avoids interrupting idle CPUs, it is illegal to execute an | |
2018 | RCU read-side critical section on an idle CPU. (Kernels built with | |
2019 | ``CONFIG_PROVE_RCU=y`` will splat if you try it.) The ``RCU_NONIDLE()`` | |
macro and ``_rcuidle`` event tracing are provided to work around this
2021 | restriction. In addition, ``rcu_is_watching()`` may be used to test | |
2022 | whether or not it is currently legal to run RCU read-side critical | |
2023 | sections on this CPU. I learned of the need for diagnostics on the one | |
2024 | hand and ``RCU_NONIDLE()`` on the other while inspecting idle-loop code. | |
2025 | Steven Rostedt supplied ``_rcuidle`` event tracing, which is used quite | |
2026 | heavily in the idle loop. However, there are some restrictions on the | |
2027 | code placed within ``RCU_NONIDLE()``: | |
2028 | ||
2029 | #. Blocking is prohibited. In practice, this is not a serious | |
2030 | restriction given that idle tasks are prohibited from blocking to | |
2031 | begin with. | |
#. Although ``RCU_NONIDLE()`` invocations may be nested, they cannot nest
2033 | indefinitely deeply. However, given that they can be nested on the | |
2034 | order of a million deep, even on 32-bit systems, this should not be a | |
2035 | serious restriction. This nesting limit would probably be reached | |
2036 | long after the compiler OOMed or the stack overflowed. | |
2037 | #. Any code path that enters ``RCU_NONIDLE()`` must sequence out of that | |
2038 | same ``RCU_NONIDLE()``. For example, the following is grossly | |
2039 | illegal: | |
2040 | ||
2041 | :: | |
2042 | ||
2043 | 1 RCU_NONIDLE({ | |
2044 | 2 do_something(); | |
2045 | 3 goto bad_idea; /* BUG!!! */ | |
2046 | 4 do_something_else();}); | |
2047 | 5 bad_idea: | |
2048 | ||
2049 | ||
2050 | It is just as illegal to transfer control into the middle of | |
2051 | ``RCU_NONIDLE()``'s argument. Yes, in theory, you could transfer in | |
2052 | as long as you also transferred out, but in practice you could also | |
2053 | expect to get sharply worded review comments. | |
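
   In contrast, straight-line entry and exit is fine, as in this
   sketch, where ``do_something_with_rcu()`` stands in for code
   containing RCU read-side critical sections:

   ::

      RCU_NONIDLE(do_something_with_rcu());  /* Enters and exits normally. */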
2054 | ||
2055 | It is similarly socially unacceptable to interrupt an ``nohz_full`` CPU | |
2056 | running in userspace. RCU must therefore track ``nohz_full`` userspace | |
execution. Specifically, RCU must be able to sample state at two
points in time, and determine whether or not some other CPU spent any
intervening time idle and/or executing in userspace.
2060 | ||
2061 | These energy-efficiency requirements have proven quite difficult to | |
understand and to meet. For example, there have been more than five
2063 | clean-sheet rewrites of RCU's energy-efficiency code, the last of which | |
2064 | was finally able to demonstrate `real energy savings running on real | |
2065 | hardware | |
2066 | [PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/AMPenergy.2013.04.19a.pdf>`__. | |
2067 | As noted earlier, I learned of many of these requirements via angry | |
2068 | phone calls: Flaming me on the Linux-kernel mailing list was apparently | |
2069 | not sufficient to fully vent their ire at RCU's energy-efficiency bugs! | |
2070 | ||
2071 | Scheduling-Clock Interrupts and RCU | |
2072 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
2073 | ||
2074 | The kernel transitions between in-kernel non-idle execution, userspace | |
2075 | execution, and the idle loop. Depending on kernel configuration, RCU | |
2076 | handles these states differently: | |
2077 | ||
2078 | +-----------------+------------------+------------------+-----------------+ | |
2079 | | ``HZ`` Kconfig | In-Kernel | Usermode | Idle | | |
2080 | +=================+==================+==================+=================+ | |
2081 | | ``HZ_PERIODIC`` | Can rely on | Can rely on | Can rely on | | |
2082 | | | scheduling-clock | scheduling-clock | RCU's | | |
2083 | | | interrupt. | interrupt and | dyntick-idle | | |
2084 | | | | its detection | detection. | | |
2085 | | | | of interrupt | | | |
2086 | | | | from usermode. | | | |
2087 | +-----------------+------------------+------------------+-----------------+ | |
2088 | | ``NO_HZ_IDLE`` | Can rely on | Can rely on | Can rely on | | |
2089 | | | scheduling-clock | scheduling-clock | RCU's | | |
2090 | | | interrupt. | interrupt and | dyntick-idle | | |
2091 | | | | its detection | detection. | | |
2092 | | | | of interrupt | | | |
2093 | | | | from usermode. | | | |
2094 | +-----------------+------------------+------------------+-----------------+ | |
2095 | | ``NO_HZ_FULL`` | Can only | Can rely on | Can rely on | | |
2096 | | | sometimes rely | RCU's | RCU's | | |
2097 | | | on | dyntick-idle | dyntick-idle | | |
2098 | | | scheduling-clock | detection. | detection. | | |
2099 | | | interrupt. In | | | | |
2100 | | | other cases, it | | | | |
2101 | | | is necessary to | | | | |
2102 | | | bound kernel | | | | |
2103 | | | execution times | | | | |
2104 | | | and/or use | | | | |
2105 | | | IPIs. | | | | |
2106 | +-----------------+------------------+------------------+-----------------+ | |
2107 | ||
2108 | +-----------------------------------------------------------------------+ | |
2109 | | **Quick Quiz**: | | |
2110 | +-----------------------------------------------------------------------+ | |
2111 | | Why can't ``NO_HZ_FULL`` in-kernel execution rely on the | | |
2112 | | scheduling-clock interrupt, just like ``HZ_PERIODIC`` and | | |
2113 | | ``NO_HZ_IDLE`` do? | | |
2114 | +-----------------------------------------------------------------------+ | |
2115 | | **Answer**: | | |
2116 | +-----------------------------------------------------------------------+ | |
2117 | | Because, as a performance optimization, ``NO_HZ_FULL`` does not | | |
2118 | | necessarily re-enable the scheduling-clock interrupt on entry to each | | |
2119 | | and every system call. | | |
2120 | +-----------------------------------------------------------------------+ | |
2121 | ||
2122 | However, RCU must be reliably informed as to whether any given CPU is | |
2123 | currently in the idle loop, and, for ``NO_HZ_FULL``, also whether that | |
2124 | CPU is executing in usermode, as discussed | |
`earlier <#Energy%20Efficiency>`__. RCU also requires that the
2126 | scheduling-clock interrupt be enabled when RCU needs it to be: | |
2127 | ||
2128 | #. If a CPU is either idle or executing in usermode, and RCU believes it | |
2129 | is non-idle, the scheduling-clock tick had better be running. | |
Otherwise, you will get RCU CPU stall warnings, or at best very long
2131 | (11-second) grace periods, with a pointless IPI waking the CPU from | |
2132 | time to time. | |
2133 | #. If a CPU is in a portion of the kernel that executes RCU read-side | |
2134 | critical sections, and RCU believes this CPU to be idle, you will get | |
2135 | random memory corruption. **DON'T DO THIS!!!** | |
2136 | This is one reason to test with lockdep, which will complain about | |
2137 | this sort of thing. | |
2138 | #. If a CPU is in a portion of the kernel that is absolutely positively | |
2139 | no-joking guaranteed to never execute any RCU read-side critical | |
sections, and RCU believes this CPU to be idle, no problem. This
2141 | sort of thing is used by some architectures for light-weight | |
2142 | exception handlers, which can then avoid the overhead of | |
2143 | ``rcu_irq_enter()`` and ``rcu_irq_exit()`` at exception entry and | |
2144 | exit, respectively. Some go further and avoid the entireties of | |
2145 | ``irq_enter()`` and ``irq_exit()``. | |
2146 | Just make very sure you are running some of your tests with | |
2147 | ``CONFIG_PROVE_RCU=y``, just in case one of your code paths was in | |
2148 | fact joking about not doing RCU read-side critical sections. | |
2149 | #. If a CPU is executing in the kernel with the scheduling-clock | |
2150 | interrupt disabled and RCU believes this CPU to be non-idle, and if | |
2151 | the CPU goes idle (from an RCU perspective) every few jiffies, no | |
2152 | problem. It is usually OK for there to be the occasional gap between | |
2153 | idle periods of up to a second or so. | |
2154 | If the gap grows too long, you get RCU CPU stall warnings. | |
2155 | #. If a CPU is either idle or executing in usermode, and RCU believes it | |
2156 | to be idle, of course no problem. | |
2157 | #. If a CPU is executing in the kernel, the kernel code path is passing | |
2158 | through quiescent states at a reasonable frequency (preferably about | |
2159 | once per few jiffies, but the occasional excursion to a second or so | |
2160 | is usually OK) and the scheduling-clock interrupt is enabled, of | |
2161 | course no problem. | |
2162 | If the gap between a successive pair of quiescent states grows too | |
2163 | long, you get RCU CPU stall warnings. | |
2164 | ||
2165 | +-----------------------------------------------------------------------+ | |
2166 | | **Quick Quiz**: | | |
2167 | +-----------------------------------------------------------------------+ | |
2168 | | But what if my driver has a hardware interrupt handler that can run | | |
| for many seconds? I cannot invoke ``schedule()`` from a hardware     |
2170 | | interrupt handler, after all! | | |
2171 | +-----------------------------------------------------------------------+ | |
2172 | | **Answer**: | | |
2173 | +-----------------------------------------------------------------------+ | |
| One approach is to do ``rcu_irq_exit(); rcu_irq_enter();`` every so  |
2175 | | often. But given that long-running interrupt handlers can cause other | | |
2176 | | problems, not least for response time, shouldn't you work to keep | | |
2177 | | your interrupt handler's runtime within reasonable bounds? | | |
2178 | +-----------------------------------------------------------------------+ | |
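
For those rare cases where a long-running interrupt handler really is
unavoidable, a minimal sketch might look as follows. Here
``more_work_pending()`` and ``process_one_unit_of_work()`` are
hypothetical stand-ins for driver-specific code:

::

      static irqreturn_t my_long_irq_handler(int irq, void *dev)
      {
              while (more_work_pending(dev)) {
                      process_one_unit_of_work(dev);
                      /* Momentarily tell RCU this CPU has left the handler. */
                      rcu_irq_exit();
                      rcu_irq_enter();
              }
              return IRQ_HANDLED;
      }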
2179 | ||
2180 | But as long as RCU is properly informed of kernel state transitions | |
2181 | between in-kernel execution, usermode execution, and idle, and as long | |
2182 | as the scheduling-clock interrupt is enabled when RCU needs it to be, | |
2183 | you can rest assured that the bugs you encounter will be in some other | |
2184 | part of RCU or some other part of the kernel! | |
2185 | ||
2186 | Memory Efficiency | |
2187 | ~~~~~~~~~~~~~~~~~ | |
2188 | ||
2189 | Although small-memory non-realtime systems can simply use Tiny RCU, code | |
2190 | size is only one aspect of memory efficiency. Another aspect is the size | |
2191 | of the ``rcu_head`` structure used by ``call_rcu()`` and | |
2192 | ``kfree_rcu()``. Although this structure contains nothing more than a | |
2193 | pair of pointers, it does appear in many RCU-protected data structures, | |
2194 | including some that are size critical. The ``page`` structure is a case | |
2195 | in point, as evidenced by the many occurrences of the ``union`` keyword | |
2196 | within that structure. | |
2197 | ||
2198 | This need for memory efficiency is one reason that RCU uses hand-crafted | |
2199 | singly linked lists to track the ``rcu_head`` structures that are | |
2200 | waiting for a grace period to elapse. It is also the reason why | |
2201 | ``rcu_head`` structures do not contain debug information, such as fields | |
2202 | tracking the file and line of the ``call_rcu()`` or ``kfree_rcu()`` that | |
2203 | posted them. Although this information might appear in debug-only kernel | |
2204 | builds at some point, in the meantime, the ``->func`` field will often | |
2205 | provide the needed debug information. | |
2206 | ||
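For example, a structure containing an ``rcu_head`` might be freed
using the standard ``call_rcu()`` pattern, as in this minimal sketch
(``struct foo`` and its functions are hypothetical):

::

      struct foo {
              struct list_head list;
              struct rcu_head rh;  /* Just a pair of pointers' worth of space. */
      };

      static void foo_free_cb(struct rcu_head *rhp)
      {
              struct foo *fp = container_of(rhp, struct foo, rh);

              kfree(fp);
      }

      static void foo_release(struct foo *fp)
      {
              list_del_rcu(&fp->list);
              call_rcu(&fp->rh, foo_free_cb); /* ->func points at foo_free_cb(). */
      }
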
2207 | However, in some cases, the need for memory efficiency leads to even | |
2208 | more extreme measures. Returning to the ``page`` structure, the | |
2209 | ``rcu_head`` field shares storage with a great many other structures | |
2210 | that are used at various points in the corresponding page's lifetime. In | |
2211 | order to correctly resolve certain `race | |
2212 | conditions <https://lkml.kernel.org/g/1439976106-137226-1-git-send-email-kirill.shutemov@linux.intel.com>`__, | |
2213 | the Linux kernel's memory-management subsystem needs a particular bit to | |
2214 | remain zero during all phases of grace-period processing, and that bit | |
2215 | happens to map to the bottom bit of the ``rcu_head`` structure's | |
2216 | ``->next`` field. RCU makes this guarantee as long as ``call_rcu()`` is | |
2217 | used to post the callback, as opposed to ``kfree_rcu()`` or some future | |
2218 | “lazy” variant of ``call_rcu()`` that might one day be created for | |
2219 | energy-efficiency purposes. | |
2220 | ||
2221 | That said, there are limits. RCU requires that the ``rcu_head`` | |
2222 | structure be aligned to a two-byte boundary, and passing a misaligned | |
2223 | ``rcu_head`` structure to one of the ``call_rcu()`` family of functions | |
2224 | will result in a splat. It is therefore necessary to exercise caution | |
2225 | when packing structures containing fields of type ``rcu_head``. Why not | |
2226 | a four-byte or even eight-byte alignment requirement? Because the m68k | |
2227 | architecture provides only two-byte alignment, and thus acts as | |
2228 | alignment's least common denominator. | |
2229 | ||
2230 | The reason for reserving the bottom bit of pointers to ``rcu_head`` | |
2231 | structures is to leave the door open to “lazy” callbacks whose | |
2232 | invocations can safely be deferred. Deferring invocation could | |
2233 | potentially have energy-efficiency benefits, but only if the rate of | |
2234 | non-lazy callbacks decreases significantly for some important workload. | |
2235 | In the meantime, reserving the bottom bit keeps this option open in case | |
2236 | it one day becomes useful. | |
2237 | ||
2238 | Performance, Scalability, Response Time, and Reliability | |
2239 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
2240 | ||
2241 | Expanding on the `earlier | |
2242 | discussion <#Performance%20and%20Scalability>`__, RCU is used heavily by | |
2243 | hot code paths in performance-critical portions of the Linux kernel's | |
2244 | networking, security, virtualization, and scheduling code paths. RCU | |
2245 | must therefore use efficient implementations, especially in its | |
2246 | read-side primitives. To that end, it would be good if preemptible RCU's | |
implementation of ``rcu_read_lock()`` could be inlined; however, doing
2248 | this requires resolving ``#include`` issues with the ``task_struct`` | |
2249 | structure. | |
2250 | ||
2251 | The Linux kernel supports hardware configurations with up to 4096 CPUs, | |
2252 | which means that RCU must be extremely scalable. Algorithms that involve | |
2253 | frequent acquisitions of global locks or frequent atomic operations on | |
2254 | global variables simply cannot be tolerated within the RCU | |
2255 | implementation. RCU therefore makes heavy use of a combining tree based | |
2256 | on the ``rcu_node`` structure. RCU is required to tolerate all CPUs | |
2257 | continuously invoking any combination of RCU's runtime primitives with | |
2258 | minimal per-operation overhead. In fact, in many cases, increasing load | |
must *decrease* the per-operation overhead; witness the batching
2260 | optimizations for ``synchronize_rcu()``, ``call_rcu()``, | |
2261 | ``synchronize_rcu_expedited()``, and ``rcu_barrier()``. As a general | |
2262 | rule, RCU must cheerfully accept whatever the rest of the Linux kernel | |
2263 | decides to throw at it. | |
2264 | ||
2265 | The Linux kernel is used for real-time workloads, especially in | |
2266 | conjunction with the `-rt | |
2267 | patchset <https://rt.wiki.kernel.org/index.php/Main_Page>`__. The | |
2268 | real-time-latency response requirements are such that the traditional | |
2269 | approach of disabling preemption across RCU read-side critical sections | |
2270 | is inappropriate. Kernels built with ``CONFIG_PREEMPT=y`` therefore use | |
2271 | an RCU implementation that allows RCU read-side critical sections to be | |
2272 | preempted. This requirement made its presence known after users made it | |
2273 | clear that an earlier `real-time | |
2274 | patch <https://lwn.net/Articles/107930/>`__ did not meet their needs, in | |
2275 | conjunction with some `RCU | |
2276 | issues <https://lkml.kernel.org/g/20050318002026.GA2693@us.ibm.com>`__ | |
2277 | encountered by a very early version of the -rt patchset. | |
2278 | ||
2279 | In addition, RCU must make do with a sub-100-microsecond real-time | |
2280 | latency budget. In fact, on smaller systems with the -rt patchset, the | |
2281 | Linux kernel provides sub-20-microsecond real-time latencies for the | |
2282 | whole kernel, including RCU. RCU's scalability and latency must | |
2283 | therefore be sufficient for these sorts of configurations. To my | |
2284 | surprise, the sub-100-microsecond real-time latency budget `applies to | |
2285 | even the largest systems | |
2286 | [PDF] <http://www.rdrop.com/users/paulmck/realtime/paper/bigrt.2013.01.31a.LCA.pdf>`__, | |
2287 | up to and including systems with 4096 CPUs. This real-time requirement | |
2288 | motivated the grace-period kthread, which also simplified handling of a | |
2289 | number of race conditions. | |
2290 | ||
2291 | RCU must avoid degrading real-time response for CPU-bound threads, | |
2292 | whether executing in usermode (which is one use case for | |
2293 | ``CONFIG_NO_HZ_FULL=y``) or in the kernel. That said, CPU-bound loops in | |
2294 | the kernel must execute ``cond_resched()`` at least once per few tens of | |
2295 | milliseconds in order to avoid receiving an IPI from RCU. | |
2296 | ||
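For example, a CPU-bound in-kernel loop might be structured as in the
following minimal sketch, where ``do_one_unit_of_work()`` is a
hypothetical stand-in:

::

      for (i = 0; i < very_large_number; i++) {
              do_one_unit_of_work(i);
              cond_resched();  /* Supplies RCU with a quiescent state. */
      }
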
2297 | Finally, RCU's status as a synchronization primitive means that any RCU | |
2298 | failure can result in arbitrary memory corruption that can be extremely | |
2299 | difficult to debug. This means that RCU must be extremely reliable, | |
2300 | which in practice also means that RCU must have an aggressive | |
2301 | stress-test suite. This stress-test suite is called ``rcutorture``. | |
2302 | ||
2303 | Although the need for ``rcutorture`` was no surprise, the current | |
2304 | immense popularity of the Linux kernel is posing interesting—and perhaps | |
2305 | unprecedented—validation challenges. To see this, keep in mind that | |
2306 | there are well over one billion instances of the Linux kernel running | |
2307 | today, given Android smartphones, Linux-powered televisions, and | |
2308 | servers. This number can be expected to increase sharply with the advent | |
2309 | of the celebrated Internet of Things. | |
2310 | ||
2311 | Suppose that RCU contains a race condition that manifests on average | |
once per million years of runtime. Across a billion instances, that
works out to about 1,000 failures per year, or roughly three times per
*day*, across the installed base. RCU could simply hide
2314 | behind hardware error rates, given that no one should really expect | |
2315 | their smartphone to last for a million years. However, anyone taking too | |
2316 | much comfort from this thought should consider the fact that in most | |
2317 | jurisdictions, a successful multi-year test of a given mechanism, which | |
2318 | might include a Linux kernel, suffices for a number of types of | |
2319 | safety-critical certifications. In fact, rumor has it that the Linux | |
2320 | kernel is already being used in production for safety-critical | |
2321 | applications. I don't know about you, but I would feel quite bad if a | |
2322 | bug in RCU killed someone. Which might explain my recent focus on | |
2323 | validation and verification. | |
2324 | ||
2325 | Other RCU Flavors | |
2326 | ----------------- | |
2327 | ||
2328 | One of the more surprising things about RCU is that there are now no | |
2329 | fewer than five *flavors*, or API families. In addition, the primary | |
2330 | flavor that has been the sole focus up to this point has two different | |
2331 | implementations, non-preemptible and preemptible. The other four flavors | |
2332 | are listed below, with requirements for each described in a separate | |
2333 | section. | |
2334 | ||
2335 | #. `Bottom-Half Flavor (Historical)`_ |
2336 | #. `Sched Flavor (Historical)`_ | |
2337 | #. `Sleepable RCU`_ | |
2338 | #. `Tasks RCU`_ | |
2339 | |
2340 | Bottom-Half Flavor (Historical) | |
2341 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
2342 | ||
2343 | The RCU-bh flavor of RCU has since been expressed in terms of the other | |
2344 | RCU flavors as part of a consolidation of the three flavors into a | |
2345 | single flavor. The read-side API remains, and continues to disable | |
2346 | softirq and to be accounted for by lockdep. Much of the material in this | |
2347 | section is therefore strictly historical in nature. | |
2348 | ||
2349 | The softirq-disable (AKA “bottom-half”, hence the “_bh” abbreviations) | |
2350 | flavor of RCU, or *RCU-bh*, was developed by Dipankar Sarma to provide a | |
2351 | flavor of RCU that could withstand the network-based denial-of-service | |
2352 | attacks researched by Robert Olsson. These attacks placed so much | |
2353 | networking load on the system that some of the CPUs never exited softirq | |
2354 | execution, which in turn prevented those CPUs from ever executing a | |
2355 | context switch, which, in the RCU implementation of that time, prevented | |
2356 | grace periods from ever ending. The result was an out-of-memory | |
2357 | condition and a system hang. | |
2358 | ||
2359 | The solution was the creation of RCU-bh, which does | |
2360 | ``local_bh_disable()`` across its read-side critical sections, and which | |
2361 | uses the transition from one type of softirq processing to another as a | |
2362 | quiescent state in addition to context switch, idle, user mode, and | |
2363 | offline. This means that RCU-bh grace periods can complete even when | |
2364 | some of the CPUs execute in softirq indefinitely, thus allowing | |
2365 | algorithms based on RCU-bh to withstand network-based denial-of-service | |
2366 | attacks. | |
2367 | ||
2368 | Because ``rcu_read_lock_bh()`` and ``rcu_read_unlock_bh()`` disable and | |
re-enable softirq handlers, any attempt to start a softirq handler
2370 | during the RCU-bh read-side critical section will be deferred. In this | |
2371 | case, ``rcu_read_unlock_bh()`` will invoke softirq processing, which can | |
2372 | take considerable time. One can of course argue that this softirq | |
2373 | overhead should be associated with the code following the RCU-bh | |
2374 | read-side critical section rather than ``rcu_read_unlock_bh()``, but the | |
2375 | fact is that most profiling tools cannot be expected to make this sort | |
2376 | of fine distinction. For example, suppose that a three-millisecond-long | |
2377 | RCU-bh read-side critical section executes during a time of heavy | |
2378 | networking load. There will very likely be an attempt to invoke at least | |
2379 | one softirq handler during that three milliseconds, but any such | |
2380 | invocation will be delayed until the time of the | |
2381 | ``rcu_read_unlock_bh()``. This can of course make it appear at first | |
2382 | glance as if ``rcu_read_unlock_bh()`` was executing very slowly. | |
2383 | ||
2384 | The `RCU-bh | |
2385 | API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ | |
2386 | includes ``rcu_read_lock_bh()``, ``rcu_read_unlock_bh()``, | |
2387 | ``rcu_dereference_bh()``, ``rcu_dereference_bh_check()``, | |
2388 | ``synchronize_rcu_bh()``, ``synchronize_rcu_bh_expedited()``, | |
2389 | ``call_rcu_bh()``, ``rcu_barrier_bh()``, and | |
2390 | ``rcu_read_lock_bh_held()``. However, the update-side APIs are now | |
2391 | simple wrappers for other RCU flavors, namely RCU-sched in | |
``CONFIG_PREEMPT=n`` kernels and RCU-preempt otherwise.
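
The read-side API follows the usual pointer-subscription pattern, as in
this minimal sketch (``gp`` and ``do_something_with()`` are
hypothetical):

::

      rcu_read_lock_bh();
      p = rcu_dereference_bh(gp);
      if (p)
              do_something_with(p);
      rcu_read_unlock_bh();  /* May run any deferred softirq handlers. */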
2393 | ||
2394 | Sched Flavor (Historical) | |
2395 | ~~~~~~~~~~~~~~~~~~~~~~~~~ | |
2396 | ||
2397 | The RCU-sched flavor of RCU has since been expressed in terms of the | |
2398 | other RCU flavors as part of a consolidation of the three flavors into a | |
2399 | single flavor. The read-side API remains, and continues to disable | |
2400 | preemption and to be accounted for by lockdep. Much of the material in | |
2401 | this section is therefore strictly historical in nature. | |
2402 | ||
2403 | Before preemptible RCU, waiting for an RCU grace period had the side | |
2404 | effect of also waiting for all pre-existing interrupt and NMI handlers. | |
2405 | However, there are legitimate preemptible-RCU implementations that do | |
2406 | not have this property, given that any point in the code outside of an | |
2407 | RCU read-side critical section can be a quiescent state. Therefore, | |
2408 | *RCU-sched* was created, which follows “classic” RCU in that an | |
RCU-sched grace period waits for pre-existing interrupt and NMI
2410 | handlers. In kernels built with ``CONFIG_PREEMPT=n``, the RCU and | |
2411 | RCU-sched APIs have identical implementations, while kernels built with | |
2412 | ``CONFIG_PREEMPT=y`` provide a separate implementation for each. | |
2413 | ||
2414 | Note well that in ``CONFIG_PREEMPT=y`` kernels, | |
2415 | ``rcu_read_lock_sched()`` and ``rcu_read_unlock_sched()`` disable and | |
2416 | re-enable preemption, respectively. This means that if there was a | |
2417 | preemption attempt during the RCU-sched read-side critical section, | |
2418 | ``rcu_read_unlock_sched()`` will enter the scheduler, with all the | |
2419 | latency and overhead entailed. Just as with ``rcu_read_unlock_bh()``, | |
2420 | this can make it look as if ``rcu_read_unlock_sched()`` was executing | |
2421 | very slowly. However, the highest-priority task won't be preempted, so | |
2422 | that task will enjoy low-overhead ``rcu_read_unlock_sched()`` | |
2423 | invocations. | |
2424 | ||
2425 | The `RCU-sched | |
2426 | API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ | |
2427 | includes ``rcu_read_lock_sched()``, ``rcu_read_unlock_sched()``, | |
2428 | ``rcu_read_lock_sched_notrace()``, ``rcu_read_unlock_sched_notrace()``, | |
2429 | ``rcu_dereference_sched()``, ``rcu_dereference_sched_check()``, | |
2430 | ``synchronize_sched()``, ``synchronize_rcu_sched_expedited()``, | |
2431 | ``call_rcu_sched()``, ``rcu_barrier_sched()``, and | |
2432 | ``rcu_read_lock_sched_held()``. However, anything that disables | |
2433 | preemption also marks an RCU-sched read-side critical section, including | |
2434 | ``preempt_disable()`` and ``preempt_enable()``, ``local_irq_save()`` and | |
2435 | ``local_irq_restore()``, and so on. | |
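
For example, the following is an implicit RCU-sched read-side critical
section (a minimal sketch; ``gp`` and ``do_something_with()`` are
hypothetical):

::

      preempt_disable();
      p = rcu_dereference_sched(gp);  /* Protected until preempt_enable(). */
      if (p)
              do_something_with(p);
      preempt_enable();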
2436 | ||
2437 | Sleepable RCU | |
2438 | ~~~~~~~~~~~~~ | |
2439 | ||
2440 | For well over a decade, someone saying “I need to block within an RCU | |
2441 | read-side critical section” was a reliable indication that this someone | |
2442 | did not understand RCU. After all, if you are always blocking in an RCU | |
2443 | read-side critical section, you can probably afford to use a | |
2444 | higher-overhead synchronization mechanism. However, that changed with | |
2445 | the advent of the Linux kernel's notifiers, whose RCU read-side critical | |
2446 | sections almost never sleep, but sometimes need to. This resulted in the | |
2447 | introduction of `sleepable RCU <https://lwn.net/Articles/202847/>`__, or | |
2448 | *SRCU*. | |
2449 | ||
2450 | SRCU allows different domains to be defined, with each such domain | |
2451 | defined by an instance of an ``srcu_struct`` structure. A pointer to | |
2452 | this structure must be passed in to each SRCU function, for example, | |
2453 | ``synchronize_srcu(&ss)``, where ``ss`` is the ``srcu_struct`` | |
2454 | structure. The key benefit of these domains is that a slow SRCU reader | |
2455 | in one domain does not delay an SRCU grace period in some other domain. | |
2456 | That said, one consequence of these domains is that read-side code must | |
2457 | pass a “cookie” from ``srcu_read_lock()`` to ``srcu_read_unlock()``, for | |
2458 | example, as follows: | |
2459 | ||
2460 | :: | |
2461 | ||
      int idx;

      idx = srcu_read_lock(&ss);
      do_something();
      srcu_read_unlock(&ss, idx);
2467 | ||
2468 | As noted above, it is legal to block within SRCU read-side critical | |
sections; however, with great power comes great responsibility. If you
2470 | block forever in one of a given domain's SRCU read-side critical | |
2471 | sections, then that domain's grace periods will also be blocked forever. | |
2472 | Of course, one good way to block forever is to deadlock, which can | |
2473 | happen if any operation in a given domain's SRCU read-side critical | |
2474 | section can wait, either directly or indirectly, for that domain's grace | |
2475 | period to elapse. For example, this results in a self-deadlock: | |
2476 | ||
2477 | :: | |
2478 | ||
2479 | 1 int idx; | |
2480 | 2 | |
2481 | 3 idx = srcu_read_lock(&ss); | |
2482 | 4 do_something(); | |
2483 | 5 synchronize_srcu(&ss); | |
2484 | 6 srcu_read_unlock(&ss, idx); | |
2485 | ||
2486 | However, if line 5 acquired a mutex that was held across a | |
2487 | ``synchronize_srcu()`` for domain ``ss``, deadlock would still be | |
2488 | possible. Furthermore, if line 5 acquired a mutex that was held across a | |
2489 | ``synchronize_srcu()`` for some other domain ``ss1``, and if an | |
2490 | ``ss1``-domain SRCU read-side critical section acquired another mutex | |
that was held across an ``ss``-domain ``synchronize_srcu()``, deadlock
2492 | would again be possible. Such a deadlock cycle could extend across an | |
2493 | arbitrarily large number of different SRCU domains. Again, with great | |
2494 | power comes great responsibility. | |
2495 | ||
2496 | Unlike the other RCU flavors, SRCU read-side critical sections can run | |
2497 | on idle and even offline CPUs. This ability requires that | |
2498 | ``srcu_read_lock()`` and ``srcu_read_unlock()`` contain memory barriers, | |
2499 | which means that SRCU readers will run a bit slower than would RCU | |
2500 | readers. It also motivates the ``smp_mb__after_srcu_read_unlock()`` API, | |
2501 | which, in combination with ``srcu_read_unlock()``, guarantees a full | |
2502 | memory barrier. | |
2503 | ||
2504 | Also unlike other RCU flavors, ``synchronize_srcu()`` may **not** be | |
invoked from CPU-hotplug notifiers, because SRCU grace periods make use
of timers and those timers might be temporarily “stranded” on the
outgoing CPU. This stranding of timers
2508 | means that timers posted to the outgoing CPU will not fire until late in | |
2509 | the CPU-hotplug process. The problem is that if a notifier is waiting on | |
2510 | an SRCU grace period, that grace period is waiting on a timer, and that | |
2511 | timer is stranded on the outgoing CPU, then the notifier will never be | |
awakened; in other words, deadlock has occurred. This same situation of
2513 | course also prohibits ``srcu_barrier()`` from being invoked from | |
2514 | CPU-hotplug notifiers. | |
2515 | ||
2516 | SRCU also differs from other RCU flavors in that SRCU's expedited and | |
2517 | non-expedited grace periods are implemented by the same mechanism. This | |
2518 | means that in the current SRCU implementation, expediting a future grace | |
2519 | period has the side effect of expediting all prior grace periods that | |
2520 | have not yet completed. (But please note that this is a property of the | |
2521 | current implementation, not necessarily of future implementations.) In | |
2522 | addition, if SRCU has been idle for longer than the interval specified | |
2523 | by the ``srcutree.exp_holdoff`` kernel boot parameter (25 microseconds | |
2524 | by default), and if a ``synchronize_srcu()`` invocation ends this idle | |
2525 | period, that invocation will be automatically expedited. | |
2526 | ||
2527 | As of v4.12, SRCU's callbacks are maintained per-CPU, eliminating a | |
2528 | locking bottleneck present in prior kernel versions. Although this will | |
2529 | allow users to put much heavier stress on ``call_srcu()``, it is | |
2530 | important to note that SRCU does not yet take any special steps to deal | |
2531 | with callback flooding. So if you are posting (say) 10,000 SRCU | |
2532 | callbacks per second per CPU, you are probably totally OK, but if you | |
2533 | intend to post (say) 1,000,000 SRCU callbacks per second per CPU, please | |
run some tests first. SRCU just might need a few adjustments to deal with
2535 | that sort of load. Of course, your mileage may vary based on the speed | |
2536 | of your CPUs and the size of your memory. | |
2537 | ||
2538 | The `SRCU | |
2539 | API <https://lwn.net/Articles/609973/#RCU%20Per-Flavor%20API%20Table>`__ | |
2540 | includes ``srcu_read_lock()``, ``srcu_read_unlock()``, | |
2541 | ``srcu_dereference()``, ``srcu_dereference_check()``, | |
2542 | ``synchronize_srcu()``, ``synchronize_srcu_expedited()``, | |
2543 | ``call_srcu()``, ``srcu_barrier()``, and ``srcu_read_lock_held()``. It | |
2544 | also includes ``DEFINE_SRCU()``, ``DEFINE_STATIC_SRCU()``, and | |
2545 | ``init_srcu_struct()`` APIs for defining and initializing | |
2546 | ``srcu_struct`` structures. | |
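
For example, a statically allocated domain might be defined and used as
follows (a minimal sketch; ``my_srcu``, ``reader()``, and ``updater()``
are hypothetical):

::

      DEFINE_STATIC_SRCU(my_srcu);

      static void reader(void)
      {
              int idx;

              idx = srcu_read_lock(&my_srcu);
              /* Sleeping is legal here, but never wait on my_srcu's
                 own grace periods! */
              srcu_read_unlock(&my_srcu, idx);
      }

      static void updater(void)
      {
              /* Unlink the item to be freed, then: */
              synchronize_srcu(&my_srcu);
              /* All pre-existing my_srcu readers are done; free the item. */
      }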
2547 | ||
2548 | Tasks RCU | |
2549 | ~~~~~~~~~ | |
2550 | ||
2551 | Some forms of tracing use “trampolines” to handle the binary rewriting | |
2552 | required to install different types of probes. It would be good to be | |
2553 | able to free old trampolines, which sounds like a job for some form of | |
2554 | RCU. However, because it is necessary to be able to install a trace | |
2555 | anywhere in the code, it is not possible to use read-side markers such | |
2556 | as ``rcu_read_lock()`` and ``rcu_read_unlock()``. In addition, it does | |
2557 | not work to have these markers in the trampoline itself, because there | |
2558 | would need to be instructions following ``rcu_read_unlock()``. Although | |
2559 | ``synchronize_rcu()`` would guarantee that execution reached the | |
2560 | ``rcu_read_unlock()``, it would not be able to guarantee that execution | |
2561 | had completely left the trampoline. | |
2562 | ||
2563 | The solution, in the form of `Tasks | |
2564 | RCU <https://lwn.net/Articles/607117/>`__, is to have implicit read-side | |
2565 | critical sections that are delimited by voluntary context switches, that | |
2566 | is, calls to ``schedule()``, ``cond_resched()``, and | |
2567 | ``synchronize_rcu_tasks()``. In addition, transitions to and from | |
2568 | userspace execution also delimit tasks-RCU read-side critical sections. | |
2569 | ||
2570 | The tasks-RCU API is quite compact, consisting only of | |
2571 | ``call_rcu_tasks()``, ``synchronize_rcu_tasks()``, and | |
2572 | ``rcu_barrier_tasks()``. In ``CONFIG_PREEMPT=n`` kernels, trampolines | |
2573 | cannot be preempted, so these APIs map to ``call_rcu()``, | |
2574 | ``synchronize_rcu()``, and ``rcu_barrier()``, respectively. In | |
2575 | ``CONFIG_PREEMPT=y`` kernels, trampolines can be preempted, and these | |
2576 | three APIs are therefore implemented by separate functions that check | |
2577 | for voluntary context switches. | |
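
For example, tracing code might retire an old trampoline as in the
following minimal sketch, where ``disconnect_probe()`` and
``free_trampoline()`` are hypothetical stand-ins:

::

      static void retire_trampoline(struct trampoline *tp)
      {
              disconnect_probe(tp);     /* No new entries into the trampoline. */
              synchronize_rcu_tasks();  /* Wait for all tasks to leave it. */
              free_trampoline(tp);      /* Now safe to free. */
      }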
2578 | ||
2579 | Possible Future Changes | |
2580 | ----------------------- | |
2581 | ||
2582 | One of the tricks that RCU uses to attain update-side scalability is to | |
2583 | increase grace-period latency with increasing numbers of CPUs. If this | |
2584 | becomes a serious problem, it will be necessary to rework the | |
2585 | grace-period state machine so as to avoid the need for the additional | |
2586 | latency. | |
2587 | ||
2588 | RCU disables CPU hotplug in a few places, perhaps most notably in the | |
2589 | ``rcu_barrier()`` operations. If there is a strong reason to use | |
2590 | ``rcu_barrier()`` in CPU-hotplug notifiers, it will be necessary to | |
2591 | avoid disabling CPU hotplug. This would introduce some complexity, so | |
2592 | there had better be a *very* good reason. | |
2593 | ||
2594 | The tradeoff between grace-period latency on the one hand and | |
2595 | interruptions of other CPUs on the other hand may need to be | |
2596 | re-examined. The desire is of course for zero grace-period latency as | |
2597 | well as zero interprocessor interrupts undertaken during an expedited | |
2598 | grace period operation. While this ideal is unlikely to be achievable, | |
2599 | it is quite possible that further improvements can be made. | |
2600 | ||
2601 | The multiprocessor implementations of RCU use a combining tree that | |
2602 | groups CPUs so as to reduce lock contention and increase cache locality. | |
2603 | However, this combining tree does not spread its memory across NUMA | |
2604 | nodes nor does it align the CPU groups with hardware features such as | |
2605 | sockets or cores. Such spreading and alignment is currently believed to | |
2606 | be unnecessary because the hotpath read-side primitives do not access | |
2607 | the combining tree, nor does ``call_rcu()`` in the common case. If you | |
2608 | believe that your architecture needs such spreading and alignment, then | |
2609 | your architecture should also benefit from the | |
2610 | ``rcutree.rcu_fanout_leaf`` boot parameter, which can be set to the | |
2611 | number of CPUs in a socket, NUMA node, or whatever. If the number of | |
2612 | CPUs is too large, use a fraction of the number of CPUs. If the number | |
2613 | of CPUs is a large prime number, well, that certainly is an | |
2614 | “interesting” architectural choice! More flexible arrangements might be | |
2615 | considered, but only if ``rcutree.rcu_fanout_leaf`` has proven | |
2616 | inadequate, and only if the inadequacy has been demonstrated by a | |
2617 | carefully run and realistic system-level workload. | |
2618 | ||
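For example, on a hypothetical system with 16 CPUs per socket, the
leaf-level groups could be aligned to socket boundaries by booting with:

::

      rcutree.rcu_fanout_leaf=16
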
2619 | Please note that arrangements that require RCU to remap CPU numbers will | |
2620 | require extremely good demonstration of need and full exploration of | |
2621 | alternatives. | |
2622 | ||
2623 | RCU's various kthreads are reasonably recent additions. It is quite | |
2624 | likely that adjustments will be required to more gracefully handle | |
2625 | extreme loads. It might also be necessary to be able to relate CPU | |
2626 | utilization by RCU's kthreads and softirq handlers to the code that | |
2627 | instigated this CPU utilization. For example, RCU callback overhead | |
2628 | might be charged back to the originating ``call_rcu()`` instance, though | |
2629 | probably not in production kernels. | |
2630 | ||
2631 | Additional work may be required to provide reasonable forward-progress | |
2632 | guarantees under heavy load for grace periods and for callback | |
2633 | invocation. | |
2634 | ||
2635 | Summary | |
2636 | ------- | |
2637 | ||
This document has presented more than two decades' worth of RCU
2639 | requirements. Given that the requirements keep changing, this will not | |
2640 | be the last word on this subject, but at least it serves to get an | |
2641 | important subset of the requirements set forth. | |
2642 | ||
2643 | Acknowledgments | |
2644 | --------------- | |
2645 | ||
2646 | I am grateful to Steven Rostedt, Lai Jiangshan, Ingo Molnar, Oleg | |
2647 | Nesterov, Borislav Petkov, Peter Zijlstra, Boqun Feng, and Andy | |
2648 | Lutomirski for their help in rendering this article human readable, and | |
2649 | to Michelle Rankin for her support of this effort. Other contributions | |
2650 | are acknowledged in the Linux kernel's git archive. |