.. SPDX-License-Identifier: GPL-2.0
.. Copyright (C) 2020, Google LLC.

Kernel Electric-Fence (KFENCE)
==============================

Kernel Electric-Fence (KFENCE) is a low-overhead sampling-based memory safety
error detector. KFENCE detects heap out-of-bounds access, use-after-free, and
invalid-free errors.

KFENCE is designed to be enabled in production kernels, and has near zero
performance overhead. Compared to KASAN, KFENCE trades precision for
performance. The main motivation behind KFENCE's design is that with enough
total uptime KFENCE will detect bugs in code paths not typically exercised by
non-production test workloads. One way to quickly achieve a large enough total
uptime is to deploy the tool across a large fleet of machines.

Usage
-----

To enable KFENCE, configure the kernel with::

    CONFIG_KFENCE=y

To build a kernel with KFENCE support, but disabled by default (to enable, set
``kfence.sample_interval`` to a non-zero value), configure the kernel with::

    CONFIG_KFENCE=y
    CONFIG_KFENCE_SAMPLE_INTERVAL=0

KFENCE provides several other configuration options to customize behaviour (see
the respective help text in ``lib/Kconfig.kfence`` for more info).

Tuning performance
~~~~~~~~~~~~~~~~~~

The most important parameter is KFENCE's sample interval, which can be set via
the kernel boot parameter ``kfence.sample_interval`` in milliseconds. The
sample interval determines the frequency with which heap allocations will be
guarded by KFENCE. The default is configurable via the Kconfig option
``CONFIG_KFENCE_SAMPLE_INTERVAL``. Setting ``kfence.sample_interval=0``
disables KFENCE.

The sample interval controls a timer that sets up KFENCE allocations. By
default, to keep the real sample interval predictable, the normal timer also
causes CPU wake-ups when the system is completely idle. This may be undesirable
on power-constrained systems. The boot parameter ``kfence.deferrable=1``
instead switches to a "deferrable" timer which does not force CPU wake-ups on
idle systems, at the risk of unpredictable sample intervals. The default is
configurable via the Kconfig option ``CONFIG_KFENCE_DEFERRABLE``.

.. warning::
   The KUnit test suite is very likely to fail when using a deferrable timer
   since it currently causes very unpredictable sample intervals.

The KFENCE memory pool is of fixed size, and if the pool is exhausted, no
further KFENCE allocations occur. With ``CONFIG_KFENCE_NUM_OBJECTS`` (default
255), the number of available guarded objects can be controlled. Each object
requires 2 pages, one for the object itself and the other one used as a guard
page; object pages are interleaved with guard pages, and every object page is
therefore surrounded by two guard pages.

The total memory dedicated to the KFENCE memory pool can be computed as::

    ( #objects + 1 ) * 2 * PAGE_SIZE

Using the default config, and assuming a page size of 4 KiB, results in
dedicating 2 MiB to the KFENCE memory pool.
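
As a quick check of the formula above, the following Python sketch computes
the pool size. The function name ``kfence_pool_bytes`` is made up for this
illustration; the defaults of 255 objects and 4 KiB pages are the values
quoted in the text.

.. code-block:: python

    def kfence_pool_bytes(num_objects=255, page_size=4096):
        """Return the pool size in bytes: (#objects + 1) * 2 * PAGE_SIZE."""
        return (num_objects + 1) * 2 * page_size

    print(kfence_pool_bytes() // (1024 * 1024), "MiB")  # prints "2 MiB"

Passing a different ``num_objects`` or ``page_size`` gives a rough estimate of
the memory cost before changing ``CONFIG_KFENCE_NUM_OBJECTS``.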

Note: On architectures that support huge pages, KFENCE will ensure that the
pool is using pages of size ``PAGE_SIZE``. This will result in additional page
tables being allocated.

Error reports
~~~~~~~~~~~~~

A typical out-of-bounds access looks like this::

    ==================================================================
    BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa6/0x234

    Out-of-bounds read at 0xffff8c3f2e291fff (1B left of kfence-#72):
     test_out_of_bounds_read+0xa6/0x234
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    kfence-#72: 0xffff8c3f2e292000-0xffff8c3f2e29201f, size=32, cache=kmalloc-32

    allocated by task 484 on cpu 0 at 32.919330s:
     test_alloc+0xfe/0x738
     test_out_of_bounds_read+0x9b/0x234
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    CPU: 0 PID: 484 Comm: kunit_try_catch Not tainted 5.13.0-rc3+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    ==================================================================

The header of the report provides a short summary of the function involved in
the access. It is followed by more detailed information about the access and
its origin. Note that real kernel addresses are only shown when using the
kernel command line option ``no_hash_pointers``.
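
When collecting reports from many machines, the header line can be matched
mechanically. The sketch below assumes the header always has the
``BUG: KFENCE: <error> in <function>+<offset>/<size>`` shape seen in the
examples in this document; the regular expression and the name
``parse_kfence_header`` are illustrative and not part of KFENCE.

.. code-block:: python

    import re

    # Matches headers such as:
    #   BUG: KFENCE: out-of-bounds read in test_out_of_bounds_read+0xa6/0x234
    HEADER_RE = re.compile(
        r"BUG: KFENCE: (?P<error>.+?) in "
        r"(?P<function>\S+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
    )

    def parse_kfence_header(line):
        """Return a dict with error, function, offset and size, or None."""
        match = HEADER_RE.search(line)
        return match.groupdict() if match else None

Grouping reports by the ``error`` and ``function`` fields is one simple way to
deduplicate findings across a fleet.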

Use-after-free accesses are reported as::

    ==================================================================
    BUG: KFENCE: use-after-free read in test_use_after_free_read+0xb3/0x143

    Use-after-free read at 0xffff8c3f2e2a0000 (in kfence-#79):
     test_use_after_free_read+0xb3/0x143
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    kfence-#79: 0xffff8c3f2e2a0000-0xffff8c3f2e2a001f, size=32, cache=kmalloc-32

    allocated by task 488 on cpu 2 at 33.871326s:
     test_alloc+0xfe/0x738
     test_use_after_free_read+0x76/0x143
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    freed by task 488 on cpu 2 at 33.871358s:
     test_use_after_free_read+0xa8/0x143
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    CPU: 2 PID: 488 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    ==================================================================

KFENCE also reports on invalid frees, such as double-frees::

    ==================================================================
    BUG: KFENCE: invalid free in test_double_free+0xdc/0x171

    Invalid free of 0xffff8c3f2e2a4000 (in kfence-#81):
     test_double_free+0xdc/0x171
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    kfence-#81: 0xffff8c3f2e2a4000-0xffff8c3f2e2a401f, size=32, cache=kmalloc-32

    allocated by task 490 on cpu 1 at 34.175321s:
     test_alloc+0xfe/0x738
     test_double_free+0x76/0x171
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    freed by task 490 on cpu 1 at 34.175348s:
     test_double_free+0xa8/0x171
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    CPU: 1 PID: 490 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    ==================================================================

KFENCE also uses pattern-based redzones on the other side of an object's guard
page, to detect out-of-bounds writes on the unprotected side of the object.
These are reported on frees::

    ==================================================================
    BUG: KFENCE: memory corruption in test_kmalloc_aligned_oob_write+0xef/0x184

    Corrupted memory at 0xffff8c3f2e33aff9 [ 0xac . . . . . . ] (in kfence-#156):
     test_kmalloc_aligned_oob_write+0xef/0x184
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    kfence-#156: 0xffff8c3f2e33afb0-0xffff8c3f2e33aff8, size=73, cache=kmalloc-96

    allocated by task 502 on cpu 7 at 42.159302s:
     test_alloc+0xfe/0x738
     test_kmalloc_aligned_oob_write+0x57/0x184
     kunit_try_run_case+0x61/0xa0
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x176/0x1b0
     ret_from_fork+0x22/0x30

    CPU: 7 PID: 502 Comm: kunit_try_catch Tainted: G B 5.13.0-rc3+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
    ==================================================================

For such errors, the address where the corruption occurred as well as the
invalidly written bytes (offset from the address) are shown; in this
representation, '.' denotes an untouched byte. In the example above, ``0xac``
is the value written to the invalid address at offset 0, and the remaining '.'
denote that no following bytes have been touched. Note that real values are
only shown if the kernel was booted with ``no_hash_pointers``; to avoid
information disclosure otherwise, '!' is used instead to denote invalidly
written bytes.
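
The byte list can be decoded mechanically. The following sketch assumes the
``Corrupted memory at <address> [ ... ]`` line format shown in the example
above; the helper name ``corrupted_addresses`` is hypothetical.

.. code-block:: python

    import re

    CORRUPT_RE = re.compile(
        r"Corrupted memory at (?P<addr>0x[0-9a-f]+) \[(?P<bytes>[^\]]*)\]"
    )

    def corrupted_addresses(line):
        """Return the addresses of the invalidly written bytes in the line."""
        match = CORRUPT_RE.search(line)
        if match is None:
            return []
        base = int(match.group("addr"), 16)
        # '.' marks an untouched byte; a hex value or '!' marks a written one.
        return [base + offset
                for offset, token in enumerate(match.group("bytes").split())
                if token != "."]

For the example report, this yields a single address, ``0xffff8c3f2e33aff9``,
which is the first byte past the end of the object.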

And finally, KFENCE may also report on invalid accesses to any protected page
where it was not possible to determine an associated object, e.g. if adjacent
object pages had not yet been allocated::

    ==================================================================
    BUG: KFENCE: invalid read in test_invalid_access+0x26/0xe0

    Invalid read at 0xffffffffb670b00a:
     test_invalid_access+0x26/0xe0
     kunit_try_run_case+0x51/0x85
     kunit_generic_run_threadfn_adapter+0x16/0x30
     kthread+0x137/0x160
     ret_from_fork+0x22/0x30

    CPU: 4 PID: 124 Comm: kunit_try_catch Tainted: G W 5.8.0-rc6+ #7
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    ==================================================================

DebugFS interface
~~~~~~~~~~~~~~~~~

Some debugging information is exposed via debugfs:

* The file ``/sys/kernel/debug/kfence/stats`` provides runtime statistics.

* The file ``/sys/kernel/debug/kfence/objects`` provides a list of objects
  allocated via KFENCE, including those already freed but protected.
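
The statistics can be consumed by monitoring scripts. The sketch below assumes
that the ``stats`` file consists of ``name: value`` lines with integer values;
this document does not specify the format, so treat that as an assumption and
check the file on the running kernel. The function name and the sample text
are made up for illustration.

.. code-block:: python

    def parse_kfence_stats(text):
        """Parse 'name: value' lines into a dict of integers."""
        stats = {}
        for line in text.splitlines():
            name, sep, value = line.partition(":")
            # Skip lines that are not of the assumed 'name: integer' form.
            if sep and value.strip().lstrip("-").isdigit():
                stats[name.strip()] = int(value)
        return stats

    # Hypothetical sample input, not captured from a real system.
    sample = "enabled: 1\ntotal bugs: 0\n"
    print(parse_kfence_stats(sample))

On a real system the text would be read from
``/sys/kernel/debug/kfence/stats``, which typically requires root privileges
and a mounted debugfs.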

Implementation Details
----------------------

Guarded allocations are set up based on the sample interval. After expiration
of the sample interval, the next allocation through the main allocator (SLAB or
SLUB) returns a guarded allocation from the KFENCE object pool (allocation
sizes up to PAGE_SIZE are supported). At this point, the timer is reset, and
the next allocation is set up after the expiration of the interval.
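
The sampling behaviour described above can be modelled in a few lines. This is
a toy model in Python, not the kernel implementation: the class ``SampleGate``
and the explicit ``now_ms`` argument are invented for illustration, and time
is passed in by the caller instead of being driven by a timer.

.. code-block:: python

    class SampleGate:
        """Toy model: at most one guarded allocation per sample interval."""

        def __init__(self, sample_interval_ms):
            self.sample_interval_ms = sample_interval_ms
            self.next_sample_ms = sample_interval_ms

        def alloc(self, now_ms):
            """Return "kfence" if this allocation is guarded, else "slab"."""
            if self.sample_interval_ms == 0:  # an interval of 0 disables KFENCE
                return "slab"
            if now_ms < self.next_sample_ms:  # interval has not expired yet
                return "slab"
            # Interval expired: guard this allocation and reset the timer.
            self.next_sample_ms = now_ms + self.sample_interval_ms
            return "kfence"

With an interval of 100 ms, allocations at 10, 100, 150 and 250 ms come back as
``slab``, ``kfence``, ``slab`` and ``kfence`` respectively, which shows why
only a small fraction of allocations is ever guarded.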

When using ``CONFIG_KFENCE_STATIC_KEYS=y``, KFENCE allocations are "gated"
through the main allocator's fast-path by relying on static branches via the
static keys infrastructure. The static branch is toggled to redirect the
allocation to KFENCE. Depending on sample interval, target workloads, and
system architecture, this may perform better than the simple dynamic branch.
Careful benchmarking is recommended.

KFENCE objects each reside on a dedicated page, at either the left or right
page boundaries selected at random. The pages to the left and right of the
object page are "guard pages", whose attributes are changed to a protected
state, and cause page faults on any attempted access. Such page faults are then
intercepted by KFENCE, which handles the fault gracefully by reporting an
out-of-bounds access, and marking the page as accessible so that the faulting
code can (wrongly) continue executing (set ``panic_on_warn`` to panic instead).

To detect out-of-bounds writes to memory within the object's page itself,
KFENCE also uses pattern-based redzones. For each object page, a redzone is set
up for all non-object memory. For typical alignments, the redzone is only
required on the unguarded side of an object. Because KFENCE must honor the
cache's requested alignment, special alignments may result in unprotected gaps
on either side of an object, all of which are redzoned.

The following figure illustrates the page layout::

    ---+-----------+-----------+-----------+-----------+-----------+---
       | xxxxxxxxx | O :       | xxxxxxxxx |       : O | xxxxxxxxx |
       | xxxxxxxxx | B :       | xxxxxxxxx |       : B | xxxxxxxxx |
       | x GUARD x | J : RED-  | x GUARD x | RED-  : J | x GUARD x |
       | xxxxxxxxx | E :  ZONE | xxxxxxxxx |  ZONE : E | xxxxxxxxx |
       | xxxxxxxxx | C :       | xxxxxxxxx |       : C | xxxxxxxxx |
       | xxxxxxxxx | T :       | xxxxxxxxx |       : T | xxxxxxxxx |
    ---+-----------+-----------+-----------+-----------+-----------+---
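
To make the layout concrete, the sketch below computes where an object sits
within its page. It is an illustration under assumptions rather than the
kernel's code: the function ``place_object`` is hypothetical, left-placed
objects are assumed to start at the page start, and right-placed objects are
assumed to be pushed to the page end and then rounded down to the requested
alignment.

.. code-block:: python

    def place_object(size, align, page_size=4096, right=True):
        """Return (start, end) offsets of the object within its page.

        ``end`` is exclusive; every other byte of the page is redzone.
        """
        if size > page_size:
            raise ValueError("object does not fit in one page")
        if not right:
            return 0, size  # left placement: object starts at the page start
        # Right placement: push to the page end, then align down.
        start = (page_size - size) // align * align
        return start, start + size

    start, end = place_object(73, 8)
    print(hex(start), hex(end - 1), 4096 - end)  # prints "0xfb0 0xff8 7"

With ``size=73`` and an assumed alignment of 8, the result matches the page
offsets of ``kfence-#156`` in the memory-corruption report above
(``...afb0`` to ``...aff8``), and the 7 remaining bytes correspond to the 7
redzone bytes listed in that report.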

Upon deallocation of a KFENCE object, the object's page is again protected and
the object is marked as freed. Any further access to the object causes a fault
and KFENCE reports a use-after-free access. Freed objects are inserted at the
tail of KFENCE's freelist, so that the least recently freed objects are reused
first, and the chances of detecting use-after-frees of recently freed objects
are increased.
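
The effect of the freelist ordering can be shown with a small model. The class
below is a Python toy, not the kernel data structure, and its names are
invented for this example.

.. code-block:: python

    from collections import deque

    class Freelist:
        """Toy FIFO freelist: free() appends at the tail, alloc() pops the head."""

        def __init__(self, objects):
            self._free = deque(objects)

        def alloc(self):
            """Return the least recently freed object, or None if exhausted."""
            return self._free.popleft() if self._free else None

        def free(self, obj):
            """Queue the object behind all other free objects."""
            self._free.append(obj)

    pool = Freelist([0, 1, 2])
    first = pool.alloc()   # object 0
    pool.free(first)       # 0 goes to the tail, behind 1 and 2
    print(pool.alloc())    # prints "1"

Object 0 is not handed out again until objects 1 and 2 have been used, so it
stays protected for as long as possible and a stale pointer to it is more
likely to fault.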

If pool utilization reaches 75% (default) or above, to reduce the risk of the
pool eventually being fully occupied by allocated objects yet ensure diverse
coverage of allocations, KFENCE limits currently covered allocations of the
same source from further filling up the pool. The "source" of an allocation is
based on its partial allocation stack trace. A side-effect is that this also
limits frequent long-lived allocations (e.g. pagecache) of the same source
filling up the pool permanently, which is the most common risk for the pool
becoming full and the sampled allocation rate dropping to zero. The threshold
at which to start limiting currently covered allocations can be configured via
the boot parameter ``kfence.skip_covered_thresh`` (pool usage%).
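
The decision described above can be summarised as a predicate. This sketch is
a simplification with hypothetical names: ``source`` stands in for whatever
identifier is derived from the partial allocation stack trace, and
``covered_sources`` for the set of sources that currently have a guarded
object in the pool.

.. code-block:: python

    def should_skip(used, total, source, covered_sources, thresh_percent=75):
        """Return True if a sampled allocation should not be guarded."""
        # Below the threshold, every sampled allocation is guarded.
        if used * 100 < total * thresh_percent:
            return False
        # At or above it, skip sources that are already covered.
        return source in covered_sources

    print(should_skip(192, 255, "alloc_a", {"alloc_a"}))  # prints "True"
    print(should_skip(192, 255, "alloc_b", {"alloc_a"}))  # prints "False"

New sources are still admitted above the threshold, which is what keeps
coverage diverse while the pool is nearly full.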

Interface
---------

The following describes the functions which are used by allocators as well as
page handling code to set up and deal with KFENCE allocations.

.. kernel-doc:: include/linux/kfence.h
   :functions: is_kfence_address
               kfence_shutdown_cache
               kfence_alloc kfence_free __kfence_free
               kfence_ksize kfence_object_start
               kfence_handle_page_fault

Related Tools
-------------

In userspace, a similar approach is taken by `GWP-ASan
<http://llvm.org/docs/GwpAsan.html>`_. GWP-ASan also relies on guard pages and
a sampling strategy to detect memory unsafety bugs at scale. KFENCE's design is
directly influenced by GWP-ASan, and can be seen as its kernel sibling. Another
similar but non-sampling approach that also inspired the name "KFENCE" can be
found in the userspace `Electric Fence Malloc Debugger
<https://linux.die.net/man/3/efence>`_.

In the kernel, several tools exist to debug memory access errors, and in
particular KASAN can detect all bug classes that KFENCE can detect. While KASAN
is more precise, relying on compiler instrumentation, this comes at a
performance cost.

It is worth highlighting that KASAN and KFENCE are complementary, with
different target environments. For instance, KASAN is the better debugging aid
where test cases or reproducers exist: because KFENCE has a lower chance of
detecting the error, debugging with it would require more effort. Deployments
at scale that cannot afford to enable KASAN, however, would benefit from using
KFENCE to discover bugs due to code paths not exercised by test cases or
fuzzers.