This document provides options for those wishing to keep their
memory-ordering lives simple, as is necessary for those whose domain
is complex. After all, there are bugs other than memory-ordering bugs,
and the time spent gaining memory-ordering knowledge is not available
for gaining domain knowledge. Furthermore, the Linux-kernel memory
model (LKMM) is quite complex, with subtle differences in code often
having dramatic effects on correctness.

The options near the beginning of this list are quite simple. The idea
is not that kernel hackers don't already know about them, but rather
that they might need the occasional reminder.

Please note that this is a generic guide, and that specific subsystems
will often have special requirements or idioms. For example, developers
of MMIO-based device drivers will often need to use mb(), rmb(), and
wmb(), and therefore might find smp_mb(), smp_rmb(), and smp_wmb()
to be more natural than smp_load_acquire() and smp_store_release().
On the other hand, those coming in from other environments will likely
be more familiar with these last two.


Single-threaded code
====================

In single-threaded code, there is no reordering, at least assuming
that your toolchain and hardware are working correctly. In addition,
it is generally a mistake to assume your code will only run in a
single-threaded context, as the kernel can enter the same code path
on multiple CPUs at the same time. One important exception is a
function that makes no external data references.

In the general case, you will need to take explicit steps to ensure that
your code really is executed within a single thread that does not access
shared variables. A simple way to achieve this is to define a global lock
that you acquire at the beginning of your code and release at the end,
taking care to ensure that all references to your code's shared data are
also carried out under that same lock. Because only one thread can hold
this lock at a given time, your code will be executed single-threaded.
This approach is called "code locking".
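As a sketch only, the code-locking pattern can be illustrated in userspace C,
with a pthread mutex standing in for the kernel's locking primitives. All
names here are hypothetical:

```c
#include <pthread.h>

/* Hypothetical shared state protected by a single global "code lock". */
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static long demo_counter;

/*
 * The entire code path runs under demo_lock, so it executes
 * single-threaded even when called from many threads at once.
 */
long demo_locked_increment(void)
{
	long ret;

	pthread_mutex_lock(&demo_lock);
	ret = ++demo_counter;	/* Plain accesses are safe inside the critical section. */
	pthread_mutex_unlock(&demo_lock);
	return ret;
}
```

Note that every other access to demo_counter must also be made under
demo_lock for the single-threaded guarantee to hold.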

Code locking can severely limit both performance and scalability, so it
should be used with caution, and only on code paths that execute rarely.
After all, a huge amount of effort was required to remove the Linux
kernel's old "Big Kernel Lock", so let's please be very careful about
adding new "little kernel locks".

One of the advantages of locking is that, in happy contrast with the
year 1981, almost all kernel developers are very familiar with locking.
The Linux kernel's lockdep (CONFIG_PROVE_LOCKING=y) is very helpful with
the formerly feared deadlock scenarios.

Please use the standard locking primitives provided by the kernel rather
than rolling your own. For one thing, the standard primitives interact
properly with lockdep. For another thing, these primitives have been
tuned to deal better with high contention. And for one final thing, it is
surprisingly hard to correctly code production-quality lock acquisition
and release functions. After all, even simple non-production-quality
locking functions must carefully prevent both the CPU and the compiler
from moving code in either direction across the locking function.

Despite the scalability limitations of single-threaded code, RCU
takes this approach for much of its grace-period processing and also
for early-boot operation. The reason RCU is able to scale despite
single-threaded grace-period processing is use of batching, where all
updates that accumulated during one grace period are handled by the
next one. In other words, slowing down grace-period processing makes
it more efficient. Nor is RCU unique: Similar batching optimizations
are used in many I/O operations.


Packaged code
=============

Even if performance and scalability concerns prevent your code from
being completely single-threaded, it is often possible to use library
functions that handle the concurrency nearly or entirely on their own.
This approach delegates any LKMM worries to the library maintainer.

In the kernel, what is the "library"? Quite a bit. It includes the
contents of the lib/ directory and much of the include/linux/ directory,
along with a lot of other heavily used APIs. Heavily used examples
include the list macros (for example, include/linux/{,rcu}list.h),
workqueues, smp_call_function(), and the various hash tables and
search trees.


Data locking
============

With code locking, we use single-threaded code execution to guarantee
serialized access to the data that the code is accessing. However,
we can also achieve this by instead associating the lock with specific
instances of the data structures. This creates a "critical section"
in the code execution that will execute as though it is single-threaded.
By placing all the accesses and modifications to a shared data structure
inside a critical section, we ensure that the execution context that
holds the lock has exclusive access to the shared data.

The poster boy for this approach is the hash table, where placing a lock
in each hash bucket allows operations on different buckets to proceed
concurrently. This works because the buckets do not overlap with each
other, so that an operation on one bucket does not interfere with any
other bucket.
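A minimal userspace sketch of per-bucket data locking, again with pthread
mutexes in place of kernel locks; the table, its size, and all function
names are hypothetical, and the per-bucket count stands in for the bucket's
chain of elements:

```c
#include <pthread.h>

#define DEMO_NBUCKETS 16

/*
 * Hypothetical hash table with one lock per bucket. Operations on
 * different buckets acquire different locks, so they never contend.
 */
static struct demo_bucket {
	pthread_mutex_t lock;
	long nelems;	/* Stand-in for the bucket's chain of elements. */
} demo_table[DEMO_NBUCKETS] = {
	[0 ... DEMO_NBUCKETS - 1] = { PTHREAD_MUTEX_INITIALIZER, 0 },
};

/* Insert under only the target bucket's lock; return the new count. */
long demo_bucket_add(unsigned long key)
{
	struct demo_bucket *b = &demo_table[key % DEMO_NBUCKETS];
	long ret;

	pthread_mutex_lock(&b->lock);
	ret = ++b->nelems;
	pthread_mutex_unlock(&b->lock);
	return ret;
}
```

Keys 1 and 17 hash to the same bucket and serialize against each other,
while key 2 proceeds independently.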

As the number of buckets increases, data locking scales naturally.
In particular, if the amount of data increases with the number of CPUs,
increasing the number of buckets as the number of CPUs increases results
in a naturally scalable data structure.


Per-CPU processing
==================

Partitioning processing and data over CPUs allows each CPU to take
a single-threaded approach while providing excellent performance and
scalability. Of course, there is no free lunch: The dark side of this
excellence is substantially increased memory footprint.

In addition, it is sometimes necessary to update some global view of
this processing and data, in which case something like locking must
be used to protect this global view. This is the approach taken by
the percpu_counter infrastructure. In many cases, there are already
generic/library variants of commonly used per-CPU constructs available.
Please use them rather than rolling your own.
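A userspace sketch of this split, with threads standing in for CPUs:
each "CPU" updates only its own slot without synchronization, and the
rare global read-out takes a lock and sums the slots. This mirrors the
spirit of percpu_counter rather than its exact implementation (which
batches updates), and all names are hypothetical:

```c
#include <pthread.h>

#define DEMO_NCPUS 4

/* One counter slot per "CPU", touched only by its owner on the fast path. */
static long demo_pcpu_count[DEMO_NCPUS];
static pthread_mutex_t demo_fold_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fast path: no synchronization, since only the owning CPU writes here. */
void demo_pcpu_inc(int cpu)
{
	demo_pcpu_count[cpu]++;
}

/* Slow path: build the global view under the lock. */
long demo_pcpu_sum(void)
{
	long sum = 0;
	int i;

	pthread_mutex_lock(&demo_fold_lock);
	for (i = 0; i < DEMO_NCPUS; i++)
		sum += demo_pcpu_count[i];
	pthread_mutex_unlock(&demo_fold_lock);
	return sum;
}
```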

RCU uses DEFINE_PER_CPU*() declarations to create a number of per-CPU
data sets. For example, each CPU does private quiescent-state processing
within its instance of the per-CPU rcu_data structure, and then uses data
locking to report quiescent states up the grace-period combining tree.


Packaged primitives: Sequence locking
=====================================

Lockless programming is considered by many to be more difficult than
lock-based programming, but there are a few lockless design patterns that
have been built out into an API. One of these APIs is sequence locking.
Although this API can be used in extremely complex ways, there are simple
and effective ways of using it that avoid the need to pay attention to
memory ordering.

The basic keep-things-simple rule for sequence locking is "do not write
in read-side code". Yes, you can do writes from within sequence-locking
readers, but it won't be so simple. For example, such writes will be
lockless and should be idempotent.
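A minimal userspace sketch of the sequence-counter idea, using C11 atomics
in place of the kernel's seqcount API (write_seqcount_begin()/end() and
read_seqcount_begin()/retry() in include/linux/seqlock.h). All names are
hypothetical, and the writer is assumed to be serialized by some lock not
shown here:

```c
#include <stdatomic.h>

/*
 * The writer makes demo_seq odd before updating and even afterward;
 * readers retry if they saw an odd value or if the counter changed
 * while they were reading.
 */
static atomic_uint demo_seq;
static int demo_x, demo_y;

/* Writer side: assumed serialized by an external lock. */
void demo_seq_write(int x, int y)
{
	atomic_fetch_add(&demo_seq, 1);	/* Odd: update in progress. */
	demo_x = x;
	demo_y = y;
	atomic_fetch_add(&demo_seq, 1);	/* Even: update complete. */
}

/* Reader side: no writes, per the keep-things-simple rule. */
void demo_seq_read(int *x, int *y)
{
	unsigned int s1, s2;

	do {
		s1 = atomic_load(&demo_seq);
		*x = demo_x;
		*y = demo_y;
		s2 = atomic_load(&demo_seq);
	} while ((s1 & 1) || s1 != s2);
}
```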

For more sophisticated use cases, LKMM can guide you, including use
cases involving combining sequence locking with other synchronization
primitives. (LKMM does not yet know about sequence locking, so it is
currently necessary to open-code it in your litmus tests.)

Additional information may be found in include/linux/seqlock.h.


Packaged primitives: RCU
========================

Another lockless design pattern that has been baked into an API
is RCU. The Linux kernel makes sophisticated use of RCU, but the
keep-things-simple rules for RCU are "do not write in read-side code",
"do not update anything that is visible to and accessed by readers",
and "protect updates with locking".

These rules are illustrated by the functions foo_update_a() and
foo_get_a() shown in Documentation/RCU/whatisRCU.rst. Additional
RCU usage patterns may be found in Documentation/RCU and in the
source code.
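A userspace sketch of the foo_update_a()/foo_get_a() pattern: a release
store stands in for rcu_assign_pointer() and an acquire load for
rcu_dereference(), updates are serialized by a lock, and readers never
write. Deferred reclamation (synchronize_rcu() followed by kfree()) is
omitted, so old structures simply leak in this sketch. All names are
hypothetical:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct demo_foo {
	int a;
};

static _Atomic(struct demo_foo *) demo_gbl_foo;
static pthread_mutex_t demo_foo_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Updater: allocate, initialize, then publish with release semantics. */
void demo_foo_update_a(int new_a)
{
	struct demo_foo *new_fp = malloc(sizeof(*new_fp));

	new_fp->a = new_a;
	pthread_mutex_lock(&demo_foo_mutex);
	/* Publish: readers henceforth see the fully initialized structure. */
	atomic_store_explicit(&demo_gbl_foo, new_fp, memory_order_release);
	pthread_mutex_unlock(&demo_foo_mutex);
}

/* Reader: the acquire load guarantees the initialized ->a is visible. */
int demo_foo_get_a(void)
{
	struct demo_foo *fp =
		atomic_load_explicit(&demo_gbl_foo, memory_order_acquire);

	return fp->a;
}
```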


Packaged primitives: Atomic operations
======================================

Back in the day, the Linux kernel had three types of atomic operations:

1. Initialization and read-out, such as atomic_set() and atomic_read().

2. Operations that did not return a value and provided no ordering,
   such as atomic_inc() and atomic_dec().

3. Operations that returned a value and provided full ordering, such as
   atomic_add_return() and atomic_dec_and_test(). Note that some
   value-returning operations provide full ordering only conditionally.
   For example, cmpxchg() provides ordering only upon success.

More recent kernels have operations that return a value but do not
provide full ordering. These are flagged with either a _relaxed()
suffix (providing no ordering), or an _acquire() or _release() suffix
(providing limited ordering).
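The three classes, plus a _relaxed()-style operation, can be sketched with
C11 <stdatomic.h> standing in for the kernel's atomic_t API; the function
name is hypothetical:

```c
#include <stdatomic.h>

static atomic_int demo_v;

int demo_atomic_classes(void)
{
	int old;

	/* Class 1: initialization and read-out (atomic_set()/atomic_read()). */
	atomic_store(&demo_v, 5);

	/*
	 * Class 2 analogue: value discarded and ordering relaxed,
	 * like atomic_inc().
	 */
	atomic_fetch_add_explicit(&demo_v, 1, memory_order_relaxed);

	/*
	 * Class 3: value-returning and fully ordered (seq_cst here),
	 * the class that atomic_add_return() belongs to.  Note that
	 * C11 fetch_add returns the old value.
	 */
	old = atomic_fetch_add(&demo_v, 1);

	return old + atomic_load(&demo_v);	/* 6 + 7 */
}
```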

Additional information may be found in these files:

Documentation/atomic_t.txt
Documentation/atomic_bitops.txt
Documentation/core-api/refcount-vs-atomic.rst

Reading code using these primitives is often also quite helpful.



Lockless, fully ordered
=======================

When using locking, there often comes a time when it is necessary
to access some variable or another without holding the data lock
that serializes access to that variable.

If you want to keep things simple, use the initialization and read-out
operations from the previous section only when there are no racing
accesses. Otherwise, use only fully ordered operations when accessing
or modifying the variable. This approach guarantees that code prior
to a given access to that variable will be seen by all CPUs as having
happened before any code following any later access to that same
variable.
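This guarantee can be sketched in userspace with C11 seq_cst atomics
standing in for the kernel's fully ordered operations; all names are
hypothetical:

```c
#include <stdatomic.h>

static int demo_payload;
static atomic_int demo_flag;

/*
 * Because every access to demo_flag is fully ordered, the plain store
 * to demo_payload is guaranteed visible to any thread whose later
 * access to demo_flag observes the producer's update.
 */
void demo_produce(int val)
{
	demo_payload = val;			/* Code prior to the ordered access. */
	(void)atomic_fetch_add(&demo_flag, 1);	/* Fully ordered update. */
}

int demo_consume(void)
{
	if (atomic_load(&demo_flag))	/* Fully ordered read. */
		return demo_payload;	/* Guaranteed to see the payload. */
	return -1;			/* Nothing produced yet. */
}
```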

Please note that per-CPU functions are not atomic operations and
hence they do not provide any ordering guarantees at all.

If the lockless accesses are frequently executed reads that are used
only for heuristics, or if they are frequently executed writes that
are used only for statistics, please see the next section.


Lockless statistics and heuristics
==================================

Unordered primitives such as atomic_read(), atomic_set(), READ_ONCE(), and
WRITE_ONCE() can safely be used in some cases. These primitives provide
no ordering, but they do prevent the compiler from carrying out a number
of destructive optimizations (for which please see the next section).
One example use for these primitives is statistics, such as per-CPU
counters exemplified by the rt_cache_stat structure's routing-cache
statistics counters. Another example use case is heuristics, such as
the jiffies_till_first_fqs and jiffies_till_next_fqs kernel parameters
controlling how often RCU scans for idle CPUs.
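A userspace sketch of an unordered statistics counter, with relaxed C11
atomics mirroring the READ_ONCE()/WRITE_ONCE() idiom. The separate load
and store are deliberately not an atomic read-modify-write: racing updates
may lose increments, which is acceptable for statistics, and nothing is
ordered by these accesses. All names are hypothetical:

```c
#include <stdatomic.h>

static atomic_long demo_stat;

void demo_stat_inc(void)
{
	/* Like WRITE_ONCE(x, READ_ONCE(x) + 1): not an atomic RMW. */
	long v = atomic_load_explicit(&demo_stat, memory_order_relaxed);

	atomic_store_explicit(&demo_stat, v + 1, memory_order_relaxed);
}

/* Readers get a plausible recent value, nothing more. */
long demo_stat_read(void)
{
	return atomic_load_explicit(&demo_stat, memory_order_relaxed);
}
```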

But be careful. "Unordered" really does mean "unordered". It is all
too easy to assume ordering, and this assumption must be avoided when
using these primitives.


Don't let the compiler trip you up
==================================

It can be quite tempting to use plain C-language accesses for lockless
loads from and stores to shared variables. Although this is both
possible and quite common in the Linux kernel, it does require a
surprising amount of analysis, care, and knowledge about the compiler.
Yes, some decades ago it was not unfair to consider a C compiler to be
an assembler with added syntax and better portability, but the advent of
sophisticated optimizing compilers means that those days are long gone.
Today's optimizing compilers can profoundly rewrite your code during the
translation process, and have long been ready, willing, and able to do so.

Therefore, if you really need to use C-language assignments instead of
READ_ONCE(), WRITE_ONCE(), and so on, you will need to have a very good
understanding of both the C standard and your compiler. Here are some
introductory references and some tooling to start you on this noble quest:

Who's afraid of a big bad optimizing compiler?
	https://lwn.net/Articles/793253/
Calibrating your fear of big bad optimizing compilers
	https://lwn.net/Articles/799218/
Concurrency bugs should fear the big bad data-race detector (part 1)
	https://lwn.net/Articles/816850/
Concurrency bugs should fear the big bad data-race detector (part 2)
	https://lwn.net/Articles/816854/
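As one concrete illustration, the core idea behind READ_ONCE() and
WRITE_ONCE() is to force each access through a volatile-qualified pointer,
which forbids the compiler from fusing, tearing, or hoisting that load or
store. The kernel's real definitions (see include/asm-generic/rwonce.h)
handle additional cases; the DEMO_ macros below are a userspace sketch
only, and all names are hypothetical:

```c
/* Volatile access forbids the compiler from optimizing the access away. */
#define DEMO_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

static int demo_shared;

/* Each call performs a fresh load that cannot be cached by the compiler. */
int demo_read(void)
{
	return DEMO_READ_ONCE(demo_shared);
}

/* Each call performs a single, untorn store. */
void demo_write(int v)
{
	DEMO_WRITE_ONCE(demo_shared, v);
}
```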


More complex use cases
======================

If the alternatives above do not do what you need, please look at the
recipes-pairs.txt file to peel off the next layer of the memory-ordering
onion.