.. _rcu_dereference_doc:

PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
===============================================================

Most of the time, you can use values from rcu_dereference() or one of
the similar primitives without worries. Dereferencing (prefix "*"),
field selection ("->"), assignment ("="), address-of ("&"), addition and
subtraction of constants, and casts all work quite naturally and safely.

It is nevertheless possible to get into trouble with other operations.
Follow these rules to keep your RCU code working properly:

-	You must use one of the rcu_dereference() family of primitives
	to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU
	will complain. Worse yet, your code can see random memory-corruption
	bugs due to games that compilers and DEC Alpha can play.
	Without one of the rcu_dereference() primitives, compilers
	can reload the value, and won't your code have fun with two
	different values for a single pointer! Without rcu_dereference(),
	DEC Alpha can load a pointer, dereference that pointer, and
	return data preceding initialization that preceded the store
	of the pointer. (As noted later, in recent kernels READ_ONCE()
	also prevents DEC Alpha from playing these tricks.)

	In addition, the volatile cast in rcu_dereference() prevents the
	compiler from deducing the resulting pointer value. Please see
	the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH"
	for an example where the compiler can in fact deduce the exact
	value of the pointer, and thus cause misordering.

-	In the special case where data is added but is never removed
	while readers are accessing the structure, READ_ONCE() may be used
	instead of rcu_dereference(). In this case, use of READ_ONCE()
	takes on the role of the lockless_dereference() primitive that
	was removed in v4.15.

-	You are only permitted to use rcu_dereference() on pointer values.
	The compiler simply knows too much about integral values to
	trust it to carry dependencies through integer operations.
	There are a very few exceptions, namely that you can temporarily
	cast the pointer to uintptr_t in order to:

	-	Set bits and clear bits down in the must-be-zero low-order
		bits of that pointer. This clearly means that the pointer
		must have alignment constraints, for example, this does
		*not* work in general for char* pointers.

	-	XOR bits to translate pointers, as is done in some
		classic buddy-allocator algorithms.

	It is important to cast the value back to pointer before
	doing much of anything else with it.

-	Avoid cancellation when using the "+" and "-" infix arithmetic
	operators. For example, for a given variable "x", avoid
	"(x-(uintptr_t)x)" for char* pointers. The compiler is within its
	rights to substitute zero for this sort of expression, so that
	subsequent accesses no longer depend on the rcu_dereference(),
	again possibly resulting in bugs due to misordering.

	Of course, if "p" is a pointer from rcu_dereference(), and "a"
	and "b" are integers that happen to be equal, the expression
	"p+a-b" is safe because its value still necessarily depends on
	the rcu_dereference(), thus maintaining proper ordering.

-	If you are using RCU to protect JITed functions, so that the
	"()" function-invocation operator is applied to a value obtained
	(directly or indirectly) from rcu_dereference(), you may need to
	interact directly with the hardware to flush instruction caches.
	This issue arises on some systems when a newly JITed function is
	using the same memory that was used by an earlier JITed function.

-	Do not use the results from relational operators ("==", "!=",
	">", ">=", "<", or "<=") when dereferencing. For example,
	the following (quite strange) code is buggy::

		int *p;
		int *q;

		...

		p = rcu_dereference(gp);
		q = &global_q;
		q += p > &oom_p;
		r1 = *q; /* BUGGY!!! */

	The reason this is buggy is that relational operators are often
	compiled using branches. Although weak-memory machines such as
	ARM or PowerPC do order stores after such branches, they can
	speculate loads, which can again result in misordering bugs.

-	Be very careful about comparing pointers obtained from
	rcu_dereference() against non-NULL values. As Linus Torvalds
	explained, if the two pointers are equal, the compiler could
	substitute the pointer you are comparing against for the pointer
	obtained from rcu_dereference(). For example::

		p = rcu_dereference(gp);
		if (p == &default_struct)
			do_default(p->a);

	Because the compiler now knows that the value of "p" is exactly
	the address of the variable "default_struct", it is free to
	transform this code into the following::

		p = rcu_dereference(gp);
		if (p == &default_struct)
			do_default(default_struct.a);

	On ARM and Power hardware, the load from "default_struct.a"
	can now be speculated, such that it might happen before the
	rcu_dereference(). This could result in bugs due to misordering.

	However, comparisons are OK in the following cases:

	-	The comparison was against the NULL pointer. If the
		compiler knows that the pointer is NULL, you had better
		not be dereferencing it anyway. If the comparison is
		non-equal, the compiler is none the wiser. Therefore,
		it is safe to compare pointers from rcu_dereference()
		against NULL pointers.

	-	The pointer is never dereferenced after being compared.
		Since there are no subsequent dereferences, the compiler
		cannot use anything it learned from the comparison
		to reorder the non-existent subsequent dereferences.
		This sort of comparison occurs frequently when scanning
		RCU-protected circular linked lists.

		Note that if the pointer comparison is done outside
		of an RCU read-side critical section, and the pointer
		is never dereferenced, rcu_access_pointer() should be
		used in place of rcu_dereference(). In most cases,
		it is best to avoid accidental dereferences by testing
		the rcu_access_pointer() return value directly, without
		assigning it to a variable.

		Within an RCU read-side critical section, there is little
		reason to use rcu_access_pointer().

	-	The comparison is against a pointer that references memory
		that was initialized "a long time ago." The reason
		this is safe is that even if misordering occurs, the
		misordering will not affect the accesses that follow
		the comparison. So exactly how long ago is "a long
		time ago"? Here are some possibilities:

		-	Compile time.

		-	Boot time.

		-	Module-init time for module code.

		-	Prior to kthread creation for kthread code.

		-	During some prior acquisition of the lock that
			we now hold.

		-	Before mod_timer() time for a timer handler.

		There are many other possibilities involving the Linux
		kernel's wide array of primitives that cause code to
		be invoked at a later time.

	-	The pointer being compared against also came from
		rcu_dereference(). In this case, both pointers depend
		on one rcu_dereference() or another, so you get proper
		ordering either way.

		That said, this situation can make certain RCU usage
		bugs more likely to happen. Which can be a good thing,
		at least if they happen during testing. An example
		of such an RCU usage bug is shown in the section titled
		"EXAMPLE OF AMPLIFIED RCU-USAGE BUG".

	-	All of the accesses following the comparison are stores,
		so that a control dependency preserves the needed ordering.
		That said, it is easy to get control dependencies wrong.
		Please see the "CONTROL DEPENDENCIES" section of
		Documentation/memory-barriers.txt for more details.

	-	The pointers are not equal *and* the compiler does
		not have enough information to deduce the value of the
		pointer. Note that the volatile cast in rcu_dereference()
		will normally prevent the compiler from knowing too much.

		However, please note that if the compiler knows that the
		pointer takes on only one of two values, a not-equal
		comparison will provide exactly the information that the
		compiler needs to deduce the value of the pointer.

-	Disable any value-speculation optimizations that your compiler
	might provide, especially if you are making use of feedback-based
	optimizations that take data collected from prior runs. Such
	value-speculation optimizations reorder operations by design.

	There is one exception to this rule: Value-speculation
	optimizations that leverage the branch-prediction hardware are
	safe on strongly ordered systems (such as x86), but not on weakly
	ordered systems (such as ARM or Power). Choose your compiler
	command-line options wisely!


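As a rough illustration of the uintptr_t exception in the rules above,
the following userspace sketch keeps a flag in the low-order bit of a
suitably aligned pointer and casts back to a pointer before the value
is dereferenced. The names struct foo, FOO_TAG, foo_set_tag(),
foo_is_tagged(), and foo_clear_tag() are hypothetical and are not kernel
APIs, and the sketch assumes a flat address space in which such casts
round-trip. In real RCU code, the tagged value would be the one loaded
with rcu_dereference():

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical structure: int alignment keeps bit 0 of its address zero. */
struct foo {
	int a;
};

/* Hypothetical flag stored in the must-be-zero low-order bit. */
#define FOO_TAG ((uintptr_t)1)

/* Set the flag. The result is for storage only and must not be dereferenced. */
struct foo *foo_set_tag(struct foo *p)
{
	/* The trick is only valid if alignment leaves bit 0 clear. */
	assert(((uintptr_t)p & FOO_TAG) == 0);
	return (struct foo *)((uintptr_t)p | FOO_TAG);
}

/* Test the flag without dereferencing the pointer. */
int foo_is_tagged(struct foo *p)
{
	return ((uintptr_t)p & FOO_TAG) != 0;
}

/* Clear the flag and cast straight back to a pointer. */
struct foo *foo_clear_tag(struct foo *p)
{
	return (struct foo *)((uintptr_t)p & ~FOO_TAG);
}
```

Only the value returned by foo_clear_tag() should be dereferenced, which
follows the advice to cast back to a pointer before doing much of
anything else with the value.
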
EXAMPLE OF AMPLIFIED RCU-USAGE BUG
----------------------------------

Because updaters can run concurrently with RCU readers, RCU readers can
see stale and/or inconsistent values. If RCU readers need fresh or
consistent values, which they sometimes do, they need to take proper
precautions. To see this, consider the following code fragment::

	struct foo {
		int a;
		int b;
		int c;
	};
	struct foo *gp1;
	struct foo *gp2;

	void updater(void)
	{
		struct foo *p;

		p = kmalloc(...);
		if (p == NULL)
			deal_with_it();
		p->a = 42; /* Each field in its own cache line. */
		p->b = 43;
		p->c = 44;
		rcu_assign_pointer(gp1, p);
		p->b = 143;
		p->c = 144;
		rcu_assign_pointer(gp2, p);
	}

	void reader(void)
	{
		struct foo *p;
		struct foo *q;
		int r1, r2;

		rcu_read_lock();
		p = rcu_dereference(gp2);
		if (p == NULL) {
			rcu_read_unlock(); /* Leave the read-side critical section. */
			return;
		}
		r1 = p->b; /* Guaranteed to get 143. */
		q = rcu_dereference(gp1); /* Guaranteed non-NULL. */
		if (p == q) {
			/* The compiler decides that q->c is same as p->c. */
			r2 = p->c; /* Could get 44 on weakly ordered system. */
		} else {
			r2 = p->c - r1; /* Unconditional access to p->c. */
		}
		rcu_read_unlock();
		do_something_with(r1, r2);
	}

You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible,
but you should not be. After all, the updater might have been invoked
a second time between the time reader() loaded into "r1" and the time
that it loaded into "r2". The fact that this same result can occur due
to some reordering from the compiler and CPUs is beside the point.

But suppose that the reader needs a consistent view?

Then one approach is to use locking, for example, as follows::

	struct foo {
		int a;
		int b;
		int c;
		spinlock_t lock;
	};
	struct foo *gp1;
	struct foo *gp2;

	void updater(void)
	{
		struct foo *p;

		p = kmalloc(...);
		if (p == NULL)
			deal_with_it();
		spin_lock(&p->lock);
		p->a = 42; /* Each field in its own cache line. */
		p->b = 43;
		p->c = 44;
		spin_unlock(&p->lock);
		rcu_assign_pointer(gp1, p);
		spin_lock(&p->lock);
		p->b = 143;
		p->c = 144;
		spin_unlock(&p->lock);
		rcu_assign_pointer(gp2, p);
	}

	void reader(void)
	{
		struct foo *p;
		struct foo *q;
		int r1, r2;

		rcu_read_lock();
		p = rcu_dereference(gp2);
		if (p == NULL) {
			rcu_read_unlock(); /* Leave the read-side critical section. */
			return;
		}
		spin_lock(&p->lock);
		r1 = p->b; /* Guaranteed to get 143. */
		q = rcu_dereference(gp1); /* Guaranteed non-NULL. */
		if (p == q) {
			/* The compiler decides that q->c is same as p->c. */
			r2 = p->c; /* Locking guarantees r2 == 144. */
		} else {
			spin_lock(&q->lock);
			r2 = q->c - r1;
			spin_unlock(&q->lock);
		}
		spin_unlock(&p->lock); /* Release while "p" is still protected by RCU. */
		rcu_read_unlock();
		do_something_with(r1, r2);
	}

As always, use the right tool for the job!


EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH
-----------------------------------------

If a pointer obtained from rcu_dereference() compares not-equal to some
other pointer, the compiler normally has no clue what the value of the
first pointer might be. This lack of knowledge prevents the compiler
from carrying out optimizations that otherwise might destroy the ordering
guarantees that RCU depends on. And the volatile cast in rcu_dereference()
should prevent the compiler from guessing the value.

But without rcu_dereference(), the compiler knows more than you might
expect. Consider the following code fragment::

	struct foo {
		int a;
		int b;
	};
	static struct foo variable1;
	static struct foo variable2;
	static struct foo *gp = &variable1;

	void updater(void)
	{
		initialize_foo(&variable2);
		rcu_assign_pointer(gp, &variable2);
		/*
		 * The above is the only store to gp in this translation unit,
		 * and the address of gp is not exported in any way.
		 */
	}

	int reader(void)
	{
		struct foo *p;

		p = gp;
		barrier();
		if (p == &variable1)
			return p->a; /* Must be variable1.a. */
		else
			return p->b; /* Must be variable2.b. */
	}

Because the compiler can see all stores to "gp", it knows that the only
possible values of "gp" are "variable1" on the one hand and "variable2"
on the other. The comparison in reader() therefore tells the compiler
the exact value of "p" even in the not-equals case. This allows the
compiler to make the return values independent of the load from "gp",
in turn destroying the ordering between this load and the loads of the
return values. This can result in "p->b" returning pre-initialization
garbage values on weakly ordered systems.

In short, rcu_dereference() is *not* optional when you are going to
dereference the resulting pointer.


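For comparison, here is a minimal userspace sketch of the reader above
with the plain load replaced by a volatile load. sketch_rcu_dereference()
is a hypothetical stand-in that mimics only the volatile cast in
rcu_dereference(), and the updater uses the GCC/Clang __atomic_store_n()
builtin as an assumed stand-in for rcu_assign_pointer(). Neither is the
kernel's implementation, and the field values are made up for the
illustration. Because the load is volatile, the compiler is expected to
treat the value of "p" as unknown, so the load of "p->b" remains
dependent on the load from "gp":

```c
struct foo {
	int a;
	int b;
};

static struct foo variable1 = { .a = 1, .b = 2 };	/* illustrative values */
static struct foo variable2;
static struct foo *gp = &variable1;

/* Hypothetical stand-in: one volatile load of the pointer, nothing more. */
#define sketch_rcu_dereference(p) (*(struct foo * volatile *)&(p))

void updater(void)
{
	/* Initialize first, then publish with release semantics. */
	variable2.a = 3;
	variable2.b = 4;
	__atomic_store_n(&gp, &variable2, __ATOMIC_RELEASE);
}

int reader(void)
{
	struct foo *p;

	p = sketch_rcu_dereference(gp);
	if (p == &variable1)
		return p->a; /* variable1 was initialized at compile time. */
	else
		return p->b; /* Still loaded through "p". */
}
```

The equal case remains safe even if the compiler substitutes
"variable1.a", because "variable1" was initialized "a long time ago"
in the sense used by the comparison rules earlier in this document.
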
WHICH MEMBER OF THE rcu_dereference() FAMILY SHOULD YOU USE?
------------------------------------------------------------

First, please avoid using rcu_dereference_raw() and also please avoid
using rcu_dereference_check() and rcu_dereference_protected() with a
second argument with a constant value of 1 (or true, for that matter).
With that caution out of the way, here is some guidance for which
member of the rcu_dereference() family to use in various situations:

1.	If the access needs to be within an RCU read-side critical
	section, use rcu_dereference(). With the new consolidated
	RCU flavors, an RCU read-side critical section is entered
	using rcu_read_lock(), anything that disables bottom halves,
	anything that disables interrupts, or anything that disables
	preemption.

2.	If the access might be within an RCU read-side critical section
	on the one hand, or protected by (say) my_lock on the other,
	use rcu_dereference_check(), for example::

		p1 = rcu_dereference_check(p->rcu_protected_pointer,
					   lockdep_is_held(&my_lock));


3.	If the access might be within an RCU read-side critical section
	on the one hand, or protected by either my_lock or your_lock on
	the other, again use rcu_dereference_check(), for example::

		p1 = rcu_dereference_check(p->rcu_protected_pointer,
					   lockdep_is_held(&my_lock) ||
					   lockdep_is_held(&your_lock));

4.	If the access is on the update side, so that it is always protected
	by my_lock, use rcu_dereference_protected()::

		p1 = rcu_dereference_protected(p->rcu_protected_pointer,
					       lockdep_is_held(&my_lock));

	This can be extended to handle multiple locks as in #3 above,
	and both can be extended to check other conditions as well.

5.	If the protection is supplied by the caller, and is thus unknown
	to this code, that is the rare case when rcu_dereference_raw()
	is appropriate. In addition, rcu_dereference_raw() might be
	appropriate when the lockdep expression would be excessively
	complex, except that a better approach in that case might be to
	take a long hard look at your synchronization design. Still,
	there are data-locking cases where any one of a very large number
	of locks or reference counters suffices to protect the pointer,
	so rcu_dereference_raw() does have its place.

	However, its place is probably quite a bit smaller than one
	might expect given the number of uses in the current kernel.
	Ditto for its synonym, rcu_dereference_check(... , 1), and
	its close relative, rcu_dereference_protected(... , 1).


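The difference between these variants can be modeled as plain predicates.
The sketch below is a userspace model under stated assumptions, not
kernel code: the model_*() function names are hypothetical, "in_reader"
stands in for lockdep believing that the caller is within an RCU
read-side critical section, and "cond" stands in for the lockdep
expression passed as the second argument. It is meant to show that
rcu_dereference_check() accepts either a reader or the supplied
condition, whereas rcu_dereference_protected() accepts only the
supplied condition:

```c
#include <stdbool.h>

/* rcu_dereference(): legal only within an RCU read-side critical section. */
bool model_rcu_dereference_ok(bool in_reader)
{
	return in_reader;
}

/* rcu_dereference_check(): a reader or the supplied condition suffices. */
bool model_rcu_dereference_check_ok(bool in_reader, bool cond)
{
	return in_reader || cond;
}

/* rcu_dereference_protected(): update side, only the condition counts. */
bool model_rcu_dereference_protected_ok(bool in_reader, bool cond)
{
	(void)in_reader; /* Being in a reader does not help here. */
	return cond;
}

/* rcu_dereference_raw(): no check, the caller vouches for protection. */
bool model_rcu_dereference_raw_ok(void)
{
	return true;
}
```

Seen this way, passing a constant 1 as the condition makes every call
look legal to the checker, which is why the text above discourages it.
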
SPARSE CHECKING OF RCU-PROTECTED POINTERS
-----------------------------------------

The sparse static-analysis tool checks for non-RCU access to RCU-protected
pointers, which can result in "interesting" bugs due to compiler
optimizations involving invented loads and perhaps also load tearing.
For example, suppose someone mistakenly does something like this::

	p = q->rcu_protected_pointer;
	do_something_with(p->a);
	do_something_else_with(p->b);

If register pressure is high, the compiler might optimize "p" out
of existence, transforming the code to something like this::

	do_something_with(q->rcu_protected_pointer->a);
	do_something_else_with(q->rcu_protected_pointer->b);

This could fatally disappoint your code if q->rcu_protected_pointer
changed in the meantime. Nor is this a theoretical problem: Exactly
this sort of bug cost Paul E. McKenney (and several of his innocent
colleagues) a three-day weekend back in the early 1990s.

Load tearing could of course result in dereferencing a mashup of a pair
of pointers, which also might fatally disappoint your code.

These problems could have been avoided simply by making the code instead
read as follows::

	p = rcu_dereference(q->rcu_protected_pointer);
	do_something_with(p->a);
	do_something_else_with(p->b);

Unfortunately, these sorts of bugs can be extremely hard to spot during
review. This is where the sparse tool comes into play, along with the
"__rcu" marker. If you mark a pointer declaration, whether in a structure
or as a formal parameter, with "__rcu", sparse will complain if this
pointer is accessed directly. Sparse will also complain if a pointer
not marked with "__rcu" is accessed using rcu_dereference() and
friends. For example, ->rcu_protected_pointer might be declared as
follows::

	struct foo __rcu *rcu_protected_pointer;

Use of "__rcu" is opt-in. If you choose not to use it, then you should
ignore the sparse warnings.

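To make the declaration and the access pattern concrete, here is a small
userspace sketch. It rests on several assumptions: "__rcu" is defined to
expand to nothing (in the kernel it only carries meaning when sparse is
run), struct bar and sum_fields() are hypothetical, and
sketch_rcu_dereference() is a stand-in that performs a single volatile
load rather than the kernel's rcu_dereference():

```c
#ifndef __rcu
#define __rcu	/* Assumption: no-op here, meaningful only to sparse. */
#endif

struct foo {
	int a;
	int b;
};

/* Hypothetical enclosing structure holding the marked pointer. */
struct bar {
	struct foo __rcu *rcu_protected_pointer;
};

/* Hypothetical stand-in: exactly one volatile load of the pointer. */
#define sketch_rcu_dereference(p) (*(struct foo * volatile *)&(p))

int sum_fields(struct bar *q)
{
	/* Load the pointer once, then use only the local copy. */
	struct foo *p = sketch_rcu_dereference(q->rcu_protected_pointer);

	return p->a + p->b; /* Both fields come from the same object. */
}
```

Because "p" is loaded exactly once through the volatile access, the
compiler is not expected to refetch q->rcu_protected_pointer between
the two field accesses, which is the failure described above.
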