Commit | Line | Data |
---|---|---|
c2ed6743 KC |
1 | ====================== |
2 | Kernel Self-Protection | |
3 | ====================== | |
9f803664 KC |
4 | |
5 | Kernel self-protection is the design and implementation of systems and | |
6 | structures within the Linux kernel to protect against security flaws in | |
7 | the kernel itself. This covers a wide range of issues, including removing | |
8 | entire classes of bugs, blocking security flaw exploitation methods, | |
9 | and actively detecting attack attempts. Not all topics are explored in | |
10 | this document, but it should serve as a reasonable starting point and | |
11 | answer any frequently asked questions. (Patches welcome, of course!) | |
12 | ||
13 | In the worst-case scenario, we assume an unprivileged local attacker | |
14 | has arbitrary read and write access to the kernel's memory. In many | |
15 | cases, bugs being exploited will not provide this level of access, | |
16 | but with systems in place that defend against the worst case we'll | |
17 | cover the more limited cases as well. A higher bar, and one that should | |
18 | still be kept in mind, is protecting the kernel against a _privileged_ | |
19 | local attacker, since the root user has access to a vastly increased | |
20 | attack surface. (Especially when they have the ability to load arbitrary | |
21 | kernel modules.) | |
22 | ||
23 | The goals for successful self-protection systems would be that they | |
24 | are effective, on by default, require no opt-in by developers, have no | |
25 | performance impact, do not impede kernel debugging, and have tests. It | |
26 | is uncommon that all these goals can be met, but it is worth explicitly | |
27 | mentioning them, since these aspects need to be explored, dealt with, | |
28 | and/or accepted. | |
29 | ||
30 | ||
c2ed6743 KC |
31 | Attack Surface Reduction |
32 | ======================== | |
9f803664 KC |
33 | |
34 | The most fundamental defense against security exploits is to reduce the | |
35 | areas of the kernel that can be used to redirect execution. This ranges | |
36 | from limiting the exposed APIs available to userspace, making in-kernel | |
37 | APIs hard to use incorrectly, minimizing the areas of writable kernel | |
38 | memory, etc. | |
39 | ||
c2ed6743 KC |
40 | Strict kernel memory permissions |
41 | -------------------------------- | |
9f803664 KC |
42 | |
43 | When all of kernel memory is writable, it becomes trivial for attacks | |
44 | to redirect execution flow. To reduce the availability of these targets | |
45 | the kernel needs to protect its memory with a tight set of permissions. | |
46 | ||
c2ed6743 KC |
47 | Executable code and read-only data must not be writable |
48 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
9f803664 KC |
49 | |
50 | Any areas of the kernel with executable memory must not be writable. | |
51 | While this obviously includes the kernel text itself, we must consider | |
52 | all additional places too: kernel modules, JIT memory, etc. (There are | |
53 | temporary exceptions to this rule to support things like instruction | |
54 | alternatives, breakpoints, kprobes, etc. If these must exist in a | |
55 | kernel, they are implemented in a way where the memory is temporarily | |
56 | made writable during the update, and then returned to the original | |
57 | permissions.) | |
58 | ||
c2ed6743 KC |
59 | In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and |
60 | ``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not | |
9f803664 KC |
61 | writable, data is not executable, and read-only data is neither writable |
62 | nor executable. | |
63 | ||
ad21fc4f LA |
64 | Most architectures have these options on by default and not user selectable. |
65 | For some architectures like arm that wish to have these be selectable, | |
66 | the architecture Kconfig can select ARCH_OPTIONAL_KERNEL_RWX to enable | |
c2ed6743 | 67 | a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines |
ad21fc4f LA |
68 | the default setting when ARCH_OPTIONAL_KERNEL_RWX is enabled. |
69 | ||
c2ed6743 KC |
70 | Function pointers and sensitive variables must not be writable |
71 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
9f803664 KC |
72 | |
73 | Vast areas of kernel memory contain function pointers that are looked | |
74 | up by the kernel and used to continue execution (e.g. descriptor/vector | |
75 | tables, file/network/etc operation structures, etc). The number of these | |
76 | variables must be reduced to an absolute minimum. | |
77 | ||
78 | Many such variables can be made read-only by setting them "const" | |
79 | so that they live in the .rodata section instead of the .data section | |
80 | of the kernel, gaining the protection of the kernel's strict memory | |
81 | permissions as described above. | |
82 | ||
c2ed6743 | 83 | For variables that are initialized once at ``__init`` time, these can |
b080e521 | 84 | be marked with the ``__ro_after_init`` attribute. |
9f803664 KC |
85 | |
86 | What remains are variables that are updated rarely (e.g. GDT). These | |
87 | will need another infrastructure (similar to the temporary exceptions | |
88 | made to kernel code mentioned above) that allows them to spend the rest | |
89 | of their lifetime read-only. (For example, when being updated, only the | |
90 | CPU thread performing the update would be given uninterruptible write | |
91 | access to the memory.) | |
92 | ||
c2ed6743 KC |
93 | Segregation of kernel memory from userspace memory |
94 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
9f803664 KC |
95 | |
96 | The kernel must never execute userspace memory. The kernel must also never | |
97 | access userspace memory without explicit expectation to do so. These | |
98 | rules can be enforced either by support of hardware-based restrictions | |
99 | (x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains). | |
100 | By blocking userspace memory in this way, execution and data parsing | |
101 | cannot be passed to trivially-controlled userspace memory, forcing | |
102 | attacks to operate entirely in kernel memory. | |
103 | ||
c2ed6743 KC |
104 | Reduced access to syscalls |
105 | -------------------------- | |
9f803664 KC |
106 | |
107 | One trivial way to eliminate many syscalls for 64-bit systems is building | |
c2ed6743 | 108 | without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario. |
9f803664 KC |
109 | |
110 | The "seccomp" system provides an opt-in feature made available to | |
111 | userspace, which provides a way to reduce the number of kernel entry | |
112 | points available to a running process. This limits the breadth of kernel | |
113 | code that can be reached, possibly reducing the availability of a given | |
114 | bug to an attack. | |
115 | ||
116 | An area of improvement would be creating viable ways to keep access to | |
117 | things like compat, user namespaces, BPF creation, and perf limited only | |
118 | to trusted processes. This would keep the scope of kernel entry points | |
119 | restricted to the more regular set normally available to unprivileged | |
120 | userspace. | |
121 | ||
c2ed6743 KC |
122 | Restricting access to kernel modules |
123 | ------------------------------------ | |
9f803664 KC |
124 | |
125 | The kernel should never allow an unprivileged user the ability to | |
126 | load specific kernel modules, since that would provide a facility to | |
127 | unexpectedly extend the available attack surface. (The on-demand loading | |
128 | of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is | |
129 | considered "expected" here, though additional consideration should be | |
130 | given even to these.) For example, loading a filesystem module via an | |
131 | unprivileged socket API is nonsense: only the root or physically local | |
132 | user should trigger filesystem module loading. (And even this can be up | |
133 | for debate in some scenarios.) | |
134 | ||
135 | To protect against even privileged users, systems may need to either | |
136 | disable module loading entirely (e.g. monolithic kernel builds or | |
137 | modules_disabled sysctl), or provide signed modules (e.g. | |
c2ed6743 | 138 | ``CONFIG_MODULE_SIG_FORCE``, or dm-crypt with LoadPin), to keep from having |
9f803664 KC |
139 | root load arbitrary kernel code via the module loader interface. |
140 | ||
141 | ||
c2ed6743 KC |
142 | Memory integrity |
143 | ================ | |
9f803664 KC |
144 | |
145 | There are many memory structures in the kernel that are regularly abused | |
146 | to gain execution control during an attack. By far the most commonly | |
147 | understood is that of the stack buffer overflow in which the return | |
148 | address stored on the stack is overwritten. Many other examples of this | |
149 | kind of attack exist, and protections exist to defend against them. | |
150 | ||
c2ed6743 KC |
151 | Stack buffer overflow |
152 | --------------------- | |
9f803664 KC |
153 | |
154 | The classic stack buffer overflow involves writing past the expected end | |
155 | of a variable stored on the stack, ultimately writing a controlled value | |
156 | to the stack frame's stored return address. The most widely used defense | |
157 | is the presence of a stack canary between the stack variables and the | |
050e9baa | 158 | return address (``CONFIG_STACKPROTECTOR``), which is verified just before |
9f803664 KC |
159 | the function returns. Other defenses include things like shadow stacks. |
160 | ||
c2ed6743 KC |
161 | Stack depth overflow |
162 | -------------------- | |
9f803664 KC |
163 | |
164 | A less well understood attack is using a bug that triggers the | |
165 | kernel to consume stack memory with deep function calls or large stack | |
166 | allocations. With this attack it is possible to write beyond the end of | |
167 | the kernel's preallocated stack space and into sensitive structures. Two | |
168 | important changes need to be made for better protections: moving the | |
169 | sensitive thread_info structure elsewhere, and adding a faulting memory | |
170 | hole at the bottom of the stack to catch these overflows. | |
171 | ||
c2ed6743 KC |
172 | Heap memory integrity |
173 | --------------------- | |
9f803664 KC |
174 | |
175 | The structures used to track heap free lists can be sanity-checked during | |
176 | allocation and freeing to make sure they aren't being used to manipulate | |
177 | other memory areas. | |
178 | ||
c2ed6743 KC |
179 | Counter integrity |
180 | ----------------- | |
9f803664 KC |
181 | |
182 | Many places in the kernel use atomic counters to track object references | |
183 | or perform similar lifetime management. When these counters can be made | |
184 | to wrap (over or under) this traditionally exposes a use-after-free | |
185 | flaw. By trapping atomic wrapping, this class of bug vanishes. | |
186 | ||
c2ed6743 KC |
187 | Size calculation overflow detection |
188 | ----------------------------------- | |
9f803664 KC |
189 | |
190 | Similar to counter overflow, integer overflows (usually size calculations) | |
191 | need to be detected at runtime to kill this class of bug, which | |
192 | traditionally leads to being able to write past the end of kernel buffers. | |
193 | ||
194 | ||
c2ed6743 KC |
195 | Probabilistic defenses |
196 | ====================== | |
9f803664 KC |
197 | |
198 | While many protections can be considered deterministic (e.g. read-only | |
199 | memory cannot be written to), some protections provide only statistical | |
200 | defense, in that an attack must gather enough information about a | |
201 | running system to overcome the defense. While not perfect, these do | |
202 | provide meaningful defenses. | |
203 | ||
c2ed6743 KC |
204 | Canaries, blinding, and other secrets |
205 | ------------------------------------- | |
9f803664 KC |
206 | |
207 | It should be noted that things like the stack canary discussed earlier | |
c9de4a82 KC |
208 | are technically statistical defenses, since they rely on a secret value, |
209 | and such values may become discoverable through an information exposure | |
210 | flaw. | |
9f803664 KC |
211 | |
212 | Blinding literal values for things like JITs, where the executable | |
213 | contents may be partially under the control of userspace, needs a similar | |
214 | secret value. | |
215 | ||
216 | It is critical that the secret values used must be separate (e.g. | |
217 | different canary per stack) and high entropy (e.g. is the RNG actually | |
218 | working?) in order to maximize their success. | |
219 | ||
c2ed6743 KC |
220 | Kernel Address Space Layout Randomization (KASLR) |
221 | ------------------------------------------------- | |
9f803664 KC |
222 | |
223 | Since the location of kernel memory is almost always instrumental in | |
224 | mounting a successful attack, making the location non-deterministic | |
225 | raises the difficulty of an exploit. (Note that this in turn makes | |
c9de4a82 KC |
226 | the value of information exposures higher, since they may be used to |
227 | discover desired memory locations.) | |
9f803664 | 228 | |
c2ed6743 KC |
229 | Text and module base |
230 | ~~~~~~~~~~~~~~~~~~~~ | |
9f803664 KC |
231 | |
232 | By relocating the physical and virtual base address of the kernel at | |
c2ed6743 | 233 | boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be |
9f803664 KC |
234 | frustrated. Additionally, offsetting the module loading base address |
235 | means that even systems that load the same set of modules in the same | |
236 | order every boot will not share a common base address with the rest of | |
237 | the kernel text. | |
238 | ||
c2ed6743 KC |
239 | Stack base |
240 | ~~~~~~~~~~ | |
9f803664 KC |
241 | |
242 | If the base address of the kernel stack is not the same between processes, | |
243 | or even not the same between syscalls, targets on or beyond the stack | |
244 | become more difficult to locate. | |
245 | ||
c2ed6743 KC |
246 | Dynamic memory base |
247 | ~~~~~~~~~~~~~~~~~~~ | |
9f803664 KC |
248 | |
249 | Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up | |
250 | being relatively deterministic in layout due to the order of early-boot | |
251 | initializations. If the base address of these areas is not the same | |
c9de4a82 KC |
252 | between boots, targeting them is frustrated, requiring an information |
253 | exposure specific to the region. | |
254 | ||
c2ed6743 KC |
255 | Structure layout |
256 | ~~~~~~~~~~~~~~~~ | |
c9de4a82 KC |
257 | |
258 | By performing a per-build randomization of the layout of sensitive | |
259 | structures, attacks must either be tuned to known kernel builds or expose | |
260 | enough kernel memory to determine structure layouts before manipulating | |
261 | them. | |
9f803664 KC |
262 | |
263 | ||
c2ed6743 KC |
264 | Preventing Information Exposures |
265 | ================================ | |
9f803664 KC |
266 | |
267 | Since the locations of sensitive structures are the primary target for | |
c9de4a82 | 268 | attacks, it is important to defend against exposure of both kernel memory |
9f803664 KC |
269 | addresses and kernel memory contents (since they may contain kernel |
270 | addresses or other sensitive things like canary values). | |
271 | ||
227d1a61 TH |
272 | Kernel addresses |
273 | ---------------- | |
274 | ||
275 | Printing kernel addresses to userspace leaks sensitive information about | |
276 | the kernel memory layout. Care should be exercised when using any printk | |
277 | specifier that prints the raw address, currently %px, %p[ad] (and %p[sSb] | |
278 | in certain circumstances [*]). Any file written to using one of these | |
279 | specifiers should be readable only by privileged processes. | |
280 | ||
281 | Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1 | |
282 | addresses printed with the specifier %p are hashed before printing. | |
283 | ||
284 | [*] If KALLSYMS is enabled and symbol lookup fails, the raw address is | |
285 | printed. If KALLSYMS is not enabled the raw address is printed. | |
286 | ||
c2ed6743 KC |
287 | Unique identifiers |
288 | ------------------ | |
9f803664 KC |
289 | |
290 | Kernel memory addresses must never be used as identifiers exposed to | |
291 | userspace. Instead, use an atomic counter, an idr, or similar unique | |
292 | identifier. | |
293 | ||
c2ed6743 KC |
294 | Memory initialization |
295 | --------------------- | |
9f803664 KC |
296 | |
297 | Memory copied to userspace must always be fully initialized. If not | |
298 | explicitly memset(), this will require changes to the compiler to make | |
299 | sure structure holes are cleared. | |
300 | ||
c2ed6743 KC |
301 | Memory poisoning |
302 | ---------------- | |
9f803664 | 303 | |
ed535a2d AP |
304 | When releasing memory, it is best to poison the contents, to avoid reuse |
305 | attacks that rely on the old contents of memory. E.g., clear stack on a | |
306 | syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), wipe heap memory on a | |
307 | free. This frustrates many uninitialized variable attacks, stack content | |
308 | exposures, heap content exposures, and use-after-free attacks. | |
9f803664 | 309 | |
c2ed6743 KC |
310 | Destination tracking |
311 | -------------------- | |
9f803664 KC |
312 | |
313 | To help kill classes of bugs that result in kernel addresses being | |
314 | written to userspace, the destination of writes needs to be tracked. If | |
c2ed6743 | 315 | the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files), |
9f803664 | 316 | it should automatically censor sensitive values. |
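
The destination-tracking idea described above can be illustrated with a small userspace sketch. This is not kernel code and does not reflect any existing kernel API: `struct tracked_buf`, `tracked_init()`, and `tracked_put_addr()` are hypothetical names invented for this example. The assumption is simply that a buffer carries a tag recording whether its contents are headed to userspace, and that the formatting helper consults the tag before emitting an address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical output buffer that remembers where its contents will end up. */
struct tracked_buf {
    char data[128];
    size_t len;          /* bytes used, excluding the terminating NUL */
    bool to_userspace;   /* destination tag, fixed when the buffer is set up */
};

/* Prepare an empty buffer and record its destination. */
static void tracked_init(struct tracked_buf *b, bool to_userspace)
{
    assert(b != NULL);
    b->data[0] = '\0';
    b->len = 0;
    b->to_userspace = to_userspace;
}

/*
 * Append "name=<address>" to the buffer. When the buffer is tagged as
 * userspace-bound, the address is replaced with a fixed placeholder, so
 * callers cannot leak it by accident.
 */
static void tracked_put_addr(struct tracked_buf *b, const char *name,
                             const void *addr)
{
    size_t room = sizeof(b->data) - b->len;
    int n;

    if (b->to_userspace)
        n = snprintf(b->data + b->len, room, "%s=(censored)\n", name);
    else
        n = snprintf(b->data + b->len, room, "%s=%p\n", name, (void *)addr);

    if (n > 0 && (size_t)n < room)
        b->len += (size_t)n;       /* the whole entry fit */
    else
        b->data[b->len] = '\0';    /* drop a truncated entry entirely */
}
```

The point of the design is that the censoring decision lives in one place, keyed off the buffer rather than off each call site, so a new caller that prints an address gets the safe behavior by default.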