Commit | Line | Data |
---|---|---|
bd0ccc4a ME |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | .. Copyright (C) 2019, Google LLC. | |
3 | ||
905e672b ME |
4 | The Kernel Concurrency Sanitizer (KCSAN) |
5 | ======================================== | |
6 | ||
e7325b77 ME |
7 | The Kernel Concurrency Sanitizer (KCSAN) is a dynamic race detector, which |
8 | relies on compile-time instrumentation, and uses a watchpoint-based sampling | |
9 | approach to detect races. KCSAN's primary purpose is to detect `data races`_. | |
905e672b ME |
10 | |
11 | Usage | |
12 | ----- | |
13 | ||
e68dcd8e ME |
14 | KCSAN is supported by both GCC and Clang. With GCC we require version 11 or |
15 | later, and with Clang also require version 11 or later. | |
e7325b77 ME |
16 | |
17 | To enable KCSAN configure the kernel with:: | |
905e672b ME |
18 | |
19 | CONFIG_KCSAN = y | |
20 | ||
21 | KCSAN provides several other configuration options to customize behaviour (see | |
e7325b77 | 22 | the respective help text in ``lib/Kconfig.kcsan`` for more info). |
905e672b ME |
23 | |
24 | Error reports | |
25 | ~~~~~~~~~~~~~ | |
26 | ||
27 | A typical data race report looks like this:: | |
28 | ||
29 | ================================================================== | |
b930226f ME |
30 | BUG: KCSAN: data-race in test_kernel_read / test_kernel_write |
31 | ||
32 | write to 0xffffffffc009a628 of 8 bytes by task 487 on cpu 0: | |
33 | test_kernel_write+0x1d/0x30 | |
34 | access_thread+0x89/0xd0 | |
35 | kthread+0x23e/0x260 | |
36 | ret_from_fork+0x22/0x30 | |
37 | ||
38 | read to 0xffffffffc009a628 of 8 bytes by task 488 on cpu 6: | |
39 | test_kernel_read+0x10/0x20 | |
40 | access_thread+0x89/0xd0 | |
41 | kthread+0x23e/0x260 | |
42 | ret_from_fork+0x22/0x30 | |
43 | ||
44 | value changed: 0x00000000000009a6 -> 0x00000000000009b2 | |
905e672b ME |
45 | |
46 | Reported by Kernel Concurrency Sanitizer on: | |
b930226f ME |
47 | CPU: 6 PID: 488 Comm: access_thread Not tainted 5.12.0-rc2+ #1 |
48 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 | |
905e672b ME |
49 | ================================================================== |
50 | ||
51 | The header of the report provides a short summary of the functions involved in | |
52 | the race. It is followed by the access types and stack traces of the 2 threads | |
b930226f ME |
53 | involved in the data race. If KCSAN also observed a value change, the observed |
54 | old value and new value are shown on the "value changed" line respectively. | |
905e672b ME |
55 | |
56 | The other less common type of data race report looks like this:: | |
57 | ||
58 | ================================================================== | |
b930226f ME |
59 | BUG: KCSAN: data-race in test_kernel_rmw_array+0x71/0xd0 |
60 | ||
61 | race at unknown origin, with read to 0xffffffffc009bdb0 of 8 bytes by task 515 on cpu 2: | |
62 | test_kernel_rmw_array+0x71/0xd0 | |
63 | access_thread+0x89/0xd0 | |
64 | kthread+0x23e/0x260 | |
65 | ret_from_fork+0x22/0x30 | |
66 | ||
67 | value changed: 0x0000000000002328 -> 0x0000000000002329 | |
905e672b ME |
68 | |
69 | Reported by Kernel Concurrency Sanitizer on: | |
b930226f ME |
70 | CPU: 2 PID: 515 Comm: access_thread Not tainted 5.12.0-rc2+ #1 |
71 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 | |
905e672b ME |
72 | ================================================================== |
73 | ||
74 | This report is generated where it was not possible to determine the other | |
75 | racing thread, but a race was inferred due to the data value of the watched | |
b930226f ME |
76 | memory location having changed. These reports always show a "value changed" |
77 | line. A common reason for reports of this type are missing instrumentation in | |
78 | the racing thread, but could also occur due to e.g. DMA accesses. Such reports | |
79 | are shown only if ``CONFIG_KCSAN_REPORT_RACE_UNKNOWN_ORIGIN=y``, which is | |
80 | enabled by default. | |
905e672b ME |
81 | |
82 | Selective analysis | |
83 | ~~~~~~~~~~~~~~~~~~ | |
84 | ||
71611774 ME |
85 | It may be desirable to disable data race detection for specific accesses, |
86 | functions, compilation units, or entire subsystems. For static blacklisting, | |
87 | the below options are available: | |
905e672b | 88 | |
71611774 ME |
89 | * KCSAN understands the ``data_race(expr)`` annotation, which tells KCSAN that |
90 | any data races due to accesses in ``expr`` should be ignored and resulting | |
ea048464 | 91 | behaviour when encountering a data race is deemed safe. Please see |
117232c0 | 92 | `"Marking Shared-Memory Accesses" in the LKMM`_ for more information. |
71611774 ME |
93 | |
94 | * Disabling data race detection for entire functions can be accomplished by | |
e7325b77 ME |
95 | using the function attribute ``__no_kcsan``:: |
96 | ||
97 | __no_kcsan | |
98 | void foo(void) { | |
99 | ... | |
100 | ||
101 | To dynamically limit for which functions to generate reports, see the | |
102 | `DebugFS interface`_ blacklist/whitelist feature. | |
103 | ||
71611774 ME |
104 | * To disable data race detection for a particular compilation unit, add to the |
105 | ``Makefile``:: | |
905e672b ME |
106 | |
107 | KCSAN_SANITIZE_file.o := n | |
108 | ||
71611774 ME |
109 | * To disable data race detection for all compilation units listed in a |
110 | ``Makefile``, add to the respective ``Makefile``:: | |
111 | ||
112 | KCSAN_SANITIZE := n | |
905e672b | 113 | |
117232c0 AY |
114 | .. _"Marking Shared-Memory Accesses" in the LKMM: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/access-marking.txt |
115 | ||
e7325b77 ME |
116 | Furthermore, it is possible to tell KCSAN to show or hide entire classes of |
117 | data races, depending on preferences. These can be changed via the following | |
118 | Kconfig options: | |
119 | ||
120 | * ``CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY``: If enabled and a conflicting write | |
121 | is observed via a watchpoint, but the data value of the memory location was | |
122 | observed to remain unchanged, do not report the data race. | |
123 | ||
124 | * ``CONFIG_KCSAN_ASSUME_PLAIN_WRITES_ATOMIC``: Assume that plain aligned writes | |
125 | up to word size are atomic by default. Assumes that such writes are not | |
126 | subject to unsafe compiler optimizations resulting in data races. The option | |
127 | causes KCSAN to not report data races due to conflicts where the only plain | |
128 | accesses are aligned writes up to word size. | |
129 | ||
49f72d53 ME |
130 | * ``CONFIG_KCSAN_PERMISSIVE``: Enable additional permissive rules to ignore |
131 | certain classes of common data races. Unlike the above, the rules are more | |
132 | complex involving value-change patterns, access type, and address. This | |
133 | option depends on ``CONFIG_KCSAN_REPORT_VALUE_CHANGE_ONLY=y``. For details | |
134 | please see the ``kernel/kcsan/permissive.h``. Testers and maintainers that | |
135 | only focus on reports from specific subsystems and not the whole kernel are | |
136 | recommended to disable this option. | |
137 | ||
e675d253 ME |
138 | To use the strictest possible rules, select ``CONFIG_KCSAN_STRICT=y``, which |
139 | configures KCSAN to follow the Linux-kernel memory consistency model (LKMM) as | |
140 | closely as possible. | |
141 | ||
e7325b77 ME |
142 | DebugFS interface |
143 | ~~~~~~~~~~~~~~~~~ | |
144 | ||
145 | The file ``/sys/kernel/debug/kcsan`` provides the following interface: | |
905e672b | 146 | |
e7325b77 | 147 | * Reading ``/sys/kernel/debug/kcsan`` returns various runtime statistics. |
905e672b | 148 | |
e7325b77 ME |
149 | * Writing ``on`` or ``off`` to ``/sys/kernel/debug/kcsan`` allows turning KCSAN |
150 | on or off, respectively. | |
905e672b ME |
151 | |
152 | * Writing ``!some_func_name`` to ``/sys/kernel/debug/kcsan`` adds | |
153 | ``some_func_name`` to the report filter list, which (by default) blacklists | |
154 | reporting data races where either one of the top stackframes are a function | |
155 | in the list. | |
156 | ||
157 | * Writing either ``blacklist`` or ``whitelist`` to ``/sys/kernel/debug/kcsan`` | |
158 | changes the report filtering behaviour. For example, the blacklist feature | |
159 | can be used to silence frequently occurring data races; the whitelist feature | |
160 | can help with reproduction and testing of fixes. | |
161 | ||
e7325b77 ME |
162 | Tuning performance |
163 | ~~~~~~~~~~~~~~~~~~ | |
164 | ||
165 | Core parameters that affect KCSAN's overall performance and bug detection | |
166 | ability are exposed as kernel command-line arguments whose defaults can also be | |
167 | changed via the corresponding Kconfig options. | |
168 | ||
169 | * ``kcsan.skip_watch`` (``CONFIG_KCSAN_SKIP_WATCH``): Number of per-CPU memory | |
170 | operations to skip, before another watchpoint is set up. Setting up | |
171 | watchpoints more frequently will result in the likelihood of races to be | |
172 | observed to increase. This parameter has the most significant impact on | |
173 | overall system performance and race detection ability. | |
174 | ||
175 | * ``kcsan.udelay_task`` (``CONFIG_KCSAN_UDELAY_TASK``): For tasks, the | |
176 | microsecond delay to stall execution after a watchpoint has been set up. | |
177 | Larger values result in the window in which we may observe a race to | |
178 | increase. | |
179 | ||
180 | * ``kcsan.udelay_interrupt`` (``CONFIG_KCSAN_UDELAY_INTERRUPT``): For | |
181 | interrupts, the microsecond delay to stall execution after a watchpoint has | |
182 | been set up. Interrupts have tighter latency requirements, and their delay | |
183 | should generally be smaller than the one chosen for tasks. | |
184 | ||
185 | They may be tweaked at runtime via ``/sys/module/kcsan/parameters/``. | |
186 | ||
905e672b ME |
187 | Data Races |
188 | ---------- | |
189 | ||
e7325b77 ME |
190 | In an execution, two memory accesses form a *data race* if they *conflict*, |
191 | they happen concurrently in different threads, and at least one of them is a | |
192 | *plain access*; they *conflict* if both access the same memory location, and at | |
193 | least one is a write. For a more thorough discussion and definition, see `"Plain | |
194 | Accesses and Data Races" in the LKMM`_. | |
195 | ||
196 | .. _"Plain Accesses and Data Races" in the LKMM: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/memory-model/Documentation/explanation.txt#n1922 | |
905e672b | 197 | |
e7325b77 ME |
198 | Relationship with the Linux-Kernel Memory Consistency Model (LKMM) |
199 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
905e672b ME |
200 | |
201 | The LKMM defines the propagation and ordering rules of various memory | |
202 | operations, which gives developers the ability to reason about concurrent code. | |
203 | Ultimately this allows to determine the possible executions of concurrent code, | |
204 | and if that code is free from data races. | |
205 | ||
e7325b77 | 206 | KCSAN is aware of *marked atomic operations* (``READ_ONCE``, ``WRITE_ONCE``, |
82eb6911 ME |
207 | ``atomic_*``, etc.), and a subset of ordering guarantees implied by memory |
208 | barriers. With ``CONFIG_KCSAN_WEAK_MEMORY=y``, KCSAN models load or store | |
209 | buffering, and can detect missing ``smp_mb()``, ``smp_wmb()``, ``smp_rmb()``, | |
210 | ``smp_store_release()``, and all ``atomic_*`` operations with equivalent | |
211 | implied barriers. | |
212 | ||
213 | Note, KCSAN will not report all data races due to missing memory ordering, | |
214 | specifically where a memory barrier would be required to prohibit subsequent | |
215 | memory operation from reordering before the barrier. Developers should | |
216 | therefore carefully consider the required memory ordering requirements that | |
217 | remain unchecked. | |
e7325b77 ME |
218 | |
219 | Race Detection Beyond Data Races | |
220 | -------------------------------- | |
221 | ||
222 | For code with complex concurrency design, race-condition bugs may not always | |
223 | manifest as data races. Race conditions occur if concurrently executing | |
224 | operations result in unexpected system behaviour. On the other hand, data races | |
225 | are defined at the C-language level. The following macros can be used to check | |
226 | properties of concurrent code where bugs would not manifest as data races. | |
227 | ||
228 | .. kernel-doc:: include/linux/kcsan-checks.h | |
d8949ef1 ME |
229 | :functions: ASSERT_EXCLUSIVE_WRITER ASSERT_EXCLUSIVE_WRITER_SCOPED |
230 | ASSERT_EXCLUSIVE_ACCESS ASSERT_EXCLUSIVE_ACCESS_SCOPED | |
e7325b77 | 231 | ASSERT_EXCLUSIVE_BITS |
905e672b ME |
232 | |
233 | Implementation Details | |
234 | ---------------------- | |
235 | ||
e7325b77 ME |
236 | KCSAN relies on observing that two accesses happen concurrently. Crucially, we |
237 | want to (a) increase the chances of observing races (especially for races that | |
238 | manifest rarely), and (b) be able to actually observe them. We can accomplish | |
239 | (a) by injecting various delays, and (b) by using address watchpoints (or | |
240 | breakpoints). | |
241 | ||
242 | If we deliberately stall a memory access, while we have a watchpoint for its | |
243 | address set up, and then observe the watchpoint to fire, two accesses to the | |
244 | same address just raced. Using hardware watchpoints, this is the approach taken | |
245 | in `DataCollider | |
905e672b ME |
246 | <http://usenix.org/legacy/events/osdi10/tech/full_papers/Erickson.pdf>`_. |
247 | Unlike DataCollider, KCSAN does not use hardware watchpoints, but instead | |
e7325b77 | 248 | relies on compiler instrumentation and "soft watchpoints". |
905e672b | 249 | |
e7325b77 ME |
250 | In KCSAN, watchpoints are implemented using an efficient encoding that stores |
251 | access type, size, and address in a long; the benefits of using "soft | |
252 | watchpoints" are portability and greater flexibility. KCSAN then relies on the | |
253 | compiler instrumenting plain accesses. For each instrumented plain access: | |
905e672b ME |
254 | |
255 | 1. Check if a matching watchpoint exists; if yes, and at least one access is a | |
256 | write, then we encountered a racing access. | |
257 | ||
258 | 2. Periodically, if no matching watchpoint exists, set up a watchpoint and | |
e7325b77 | 259 | stall for a small randomized delay. |
905e672b ME |
260 | |
261 | 3. Also check the data value before the delay, and re-check the data value | |
262 | after delay; if the values mismatch, we infer a race of unknown origin. | |
263 | ||
e7325b77 ME |
264 | To detect data races between plain and marked accesses, KCSAN also annotates |
265 | marked accesses, but only to check if a watchpoint exists; i.e. KCSAN never | |
266 | sets up a watchpoint on marked accesses. By never setting up watchpoints for | |
267 | marked operations, if all accesses to a variable that is accessed concurrently | |
268 | are properly marked, KCSAN will never trigger a watchpoint and therefore never | |
269 | report the accesses. | |
905e672b | 270 | |
82eb6911 ME |
271 | Modeling Weak Memory |
272 | ~~~~~~~~~~~~~~~~~~~~ | |
273 | ||
274 | KCSAN's approach to detecting data races due to missing memory barriers is | |
275 | based on modeling access reordering (with ``CONFIG_KCSAN_WEAK_MEMORY=y``). | |
276 | Each plain memory access for which a watchpoint is set up, is also selected for | |
277 | simulated reordering within the scope of its function (at most 1 in-flight | |
278 | access). | |
279 | ||
280 | Once an access has been selected for reordering, it is checked along every | |
281 | other access until the end of the function scope. If an appropriate memory | |
282 | barrier is encountered, the access will no longer be considered for simulated | |
283 | reordering. | |
284 | ||
285 | When the result of a memory operation should be ordered by a barrier, KCSAN can | |
286 | then detect data races where the conflict only occurs as a result of a missing | |
287 | barrier. Consider the example:: | |
288 | ||
289 | int x, flag; | |
290 | void T1(void) | |
291 | { | |
292 | x = 1; // data race! | |
293 | WRITE_ONCE(flag, 1); // correct: smp_store_release(&flag, 1) | |
294 | } | |
295 | void T2(void) | |
296 | { | |
297 | while (!READ_ONCE(flag)); // correct: smp_load_acquire(&flag) | |
298 | ... = x; // data race! | |
299 | } | |
300 | ||
301 | When weak memory modeling is enabled, KCSAN can consider ``x`` in ``T1`` for | |
302 | simulated reordering. After the write of ``flag``, ``x`` is again checked for | |
303 | concurrent accesses: because ``T2`` is able to proceed after the write of | |
304 | ``flag``, a data race is detected. With the correct barriers in place, ``x`` | |
305 | would not be considered for reordering after the proper release of ``flag``, | |
306 | and no data race would be detected. | |
307 | ||
308 | Deliberate trade-offs in complexity but also practical limitations mean only a | |
309 | subset of data races due to missing memory barriers can be detected. With | |
310 | currently available compiler support, the implementation is limited to modeling | |
311 | the effects of "buffering" (delaying accesses), since the runtime cannot | |
312 | "prefetch" accesses. Also recall that watchpoints are only set up for plain | |
313 | accesses, and the only access type for which KCSAN simulates reordering. This | |
314 | means reordering of marked accesses is not modeled. | |
315 | ||
316 | A consequence of the above is that acquire operations do not require barrier | |
317 | instrumentation (no prefetching). Furthermore, marked accesses introducing | |
318 | address or control dependencies do not require special handling (the marked | |
319 | access cannot be reordered, later dependent accesses cannot be prefetched). | |
320 | ||
905e672b ME |
321 | Key Properties |
322 | ~~~~~~~~~~~~~~ | |
323 | ||
e7325b77 ME |
324 | 1. **Memory Overhead:** The overall memory overhead is only a few MiB |
325 | depending on configuration. The current implementation uses a small array of | |
326 | longs to encode watchpoint information, which is negligible. | |
905e672b ME |
327 | |
328 | 2. **Performance Overhead:** KCSAN's runtime aims to be minimal, using an | |
329 | efficient watchpoint encoding that does not require acquiring any shared | |
330 | locks in the fast-path. For kernel boot on a system with 8 CPUs: | |
331 | ||
332 | - 5.0x slow-down with the default KCSAN config; | |
333 | - 2.8x slow-down from runtime fast-path overhead only (set very large | |
334 | ``KCSAN_SKIP_WATCH`` and unset ``KCSAN_SKIP_WATCH_RANDOMIZE``). | |
335 | ||
336 | 3. **Annotation Overheads:** Minimal annotations are required outside the KCSAN | |
337 | runtime. As a result, maintenance overheads are minimal as the kernel | |
338 | evolves. | |
339 | ||
340 | 4. **Detects Racy Writes from Devices:** Due to checking data values upon | |
341 | setting up watchpoints, racy writes from devices can also be detected. | |
342 | ||
82eb6911 ME |
343 | 5. **Memory Ordering:** KCSAN is aware of only a subset of LKMM ordering rules; |
344 | this may result in missed data races (false negatives). | |
905e672b ME |
345 | |
346 | 6. **Analysis Accuracy:** For observed executions, due to using a sampling | |
347 | strategy, the analysis is *unsound* (false negatives possible), but aims to | |
348 | be complete (no false positives). | |
349 | ||
350 | Alternatives Considered | |
351 | ----------------------- | |
352 | ||
e7325b77 | 353 | An alternative data race detection approach for the kernel can be found in the |
905e672b ME |
354 | `Kernel Thread Sanitizer (KTSAN) <https://github.com/google/ktsan/wiki>`_. |
355 | KTSAN is a happens-before data race detector, which explicitly establishes the | |
356 | happens-before order between memory operations, which can then be used to | |
e7325b77 ME |
357 | determine data races as defined in `Data Races`_. |
358 | ||
359 | To build a correct happens-before relation, KTSAN must be aware of all ordering | |
360 | rules of the LKMM and synchronization primitives. Unfortunately, any omission | |
361 | leads to large numbers of false positives, which is especially detrimental in | |
362 | the context of the kernel which includes numerous custom synchronization | |
363 | mechanisms. To track the happens-before relation, KTSAN's implementation | |
364 | requires metadata for each memory location (shadow memory), which for each page | |
365 | corresponds to 4 pages of shadow memory, and can translate into overhead of | |
366 | tens of GiB on a large system. |