======================
Kernel Self-Protection
======================

Kernel self-protection is the design and implementation of systems and
structures within the Linux kernel to protect against security flaws in
the kernel itself. This covers a wide range of issues, including removing
entire classes of bugs, blocking security flaw exploitation methods,
and actively detecting attack attempts. Not all topics are explored in
this document, but it should serve as a reasonable starting point and
answer any frequently asked questions. (Patches welcome, of course!)

In the worst-case scenario, we assume an unprivileged local attacker
has arbitrary read and write access to the kernel's memory. In many
cases, bugs being exploited will not provide this level of access,
but with systems in place that defend against the worst case we'll
cover the more limited cases as well. A higher bar, and one that should
still be kept in mind, is protecting the kernel against a *privileged*
local attacker, since the root user has access to a vastly increased
attack surface. (Especially when they have the ability to load arbitrary
kernel modules.)

The goals for successful self-protection systems would be that they
are effective, on by default, require no opt-in by developers, have no
performance impact, do not impede kernel debugging, and have tests. It
is uncommon that all these goals can be met, but it is worth explicitly
mentioning them, since these aspects need to be explored, dealt with,
and/or accepted.


Attack Surface Reduction
========================

The most fundamental defense against security exploits is to reduce the
areas of the kernel that can be used to redirect execution. This ranges
from limiting the exposed APIs available to userspace, making in-kernel
APIs hard to use incorrectly, minimizing the areas of writable kernel
memory, etc.

Strict kernel memory permissions
--------------------------------

When all of kernel memory is writable, it becomes trivial for attacks
to redirect execution flow. To reduce the availability of these targets
the kernel needs to protect its memory with a tight set of permissions.

Executable code and read-only data must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Any areas of the kernel with executable memory must not be writable.
While this obviously includes the kernel text itself, we must consider
all additional places too: kernel modules, JIT memory, etc. (There are
temporary exceptions to this rule to support things like instruction
alternatives, breakpoints, kprobes, etc. If these must exist in a
kernel, they are implemented in a way where the memory is temporarily
made writable during the update, and then returned to the original
permissions.)

In support of this are ``CONFIG_STRICT_KERNEL_RWX`` and
``CONFIG_STRICT_MODULE_RWX``, which seek to make sure that code is not
writable, data is not executable, and read-only data is neither writable
nor executable.

Most architectures have these options on by default and not user selectable.
For some architectures like arm that wish to have these be selectable,
the architecture Kconfig can select ``ARCH_OPTIONAL_KERNEL_RWX`` to enable
a Kconfig prompt. ``CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT`` determines
the default setting when ``ARCH_OPTIONAL_KERNEL_RWX`` is enabled.

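The "temporarily writable during the update" pattern mentioned above can
be illustrated from userspace. The following is only a sketch, assuming a
Linux system with ``mmap()`` and ``mprotect()``; the helper names
``alloc_readonly_page()`` and ``patch_readonly_page()`` are hypothetical
and are not kernel APIs. It shows the shape of the idea, not how the
kernel patches its own text.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: fill one anonymous page, then drop write access. */
void *alloc_readonly_page(const void *src, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	void *p;

	if (len > page)
		return NULL;
	p = mmap(NULL, page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;
	memcpy(p, src, len);
	if (mprotect(p, page, PROT_READ) != 0) {
		munmap(p, page);
		return NULL;
	}
	return p;
}

/*
 * Hypothetical helper: make the page writable only for the duration of
 * the update, then return it to its original read-only permissions.
 * Returns 0 on success and -1 on failure.
 */
int patch_readonly_page(void *page_addr, size_t offset,
			const void *src, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);

	if (offset > page || len > page - offset)
		return -1;
	if (mprotect(page_addr, page, PROT_READ | PROT_WRITE) != 0)
		return -1;
	memcpy((char *)page_addr + offset, src, len);
	return mprotect(page_addr, page, PROT_READ);
}
```

Outside of ``patch_readonly_page()`` any stray write to the page faults,
which is the property the strict permission options aim for.
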
Function pointers and sensitive variables must not be writable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Vast areas of kernel memory contain function pointers that are looked
up by the kernel and used to continue execution (e.g. descriptor/vector
tables, file/network/etc operation structures, etc). The number of these
variables must be reduced to an absolute minimum.

Many such variables can be made read-only by setting them "const"
so that they live in the .rodata section instead of the .data section
of the kernel, gaining the protection of the kernel's strict memory
permissions as described above.

For variables that are initialized once at ``__init`` time, these can
be marked with the ``__ro_after_init`` attribute.

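As a rough illustration of the "const" approach, the sketch below
declares an operations table whose function pointers cannot be
reassigned after build time. The structure and function names
(``demo_ops``, ``demo_open()`` and so on) are invented for this example
and do not correspond to any kernel structure.

```c
#include <stddef.h>

/* Hypothetical operations structure, in the style of a kernel ops table. */
struct demo_ops {
	int (*open)(int id);
	int (*close)(int id);
};

int demo_open(int id)
{
	return id + 1;
}

int demo_close(int id)
{
	return id - 1;
}

/*
 * Declaring the table const lets the toolchain place it in .rodata, so
 * the function pointers are covered by read-only memory permissions.
 * A variable that has to be written once during init would use
 * __ro_after_init in kernel code instead (not available in userspace).
 */
const struct demo_ops demo_default_ops = {
	.open = demo_open,
	.close = demo_close,
};

/* Callers only ever receive a pointer-to-const view of the table. */
int demo_call_open(const struct demo_ops *ops, int id)
{
	return ops->open(id);
}
```

An assignment such as ``demo_default_ops.open = other;`` is rejected by
the compiler, and a write through a cast pointer would fault at runtime
once the section is mapped read-only.
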
What remains are variables that are updated rarely (e.g. GDT). These
will need another infrastructure (similar to the temporary exceptions
made to kernel code mentioned above) that allows them to spend the rest
of their lifetime read-only. (For example, when being updated, only the
CPU thread performing the update would be given uninterruptible write
access to the memory.)

Segregation of kernel memory from userspace memory
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The kernel must never execute userspace memory. The kernel must also never
access userspace memory without explicit expectation to do so. These
rules can be enforced either by support of hardware-based restrictions
(x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
By blocking userspace memory in this way, execution and data parsing
cannot be passed to trivially-controlled userspace memory, forcing
attacks to operate entirely in kernel memory.

Reduced access to syscalls
--------------------------

One trivial way to eliminate many syscalls for 64-bit systems is building
without ``CONFIG_COMPAT``. However, this is rarely a feasible scenario.

The "seccomp" system provides an opt-in feature made available to
userspace, which provides a way to reduce the number of kernel entry
points available to a running process. This limits the breadth of kernel
code that can be reached, possibly reducing the availability of a given
bug to an attack.

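A small userspace sketch of the opt-in nature of seccomp follows. It
assumes a Linux system where ``prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT)``
is permitted; the function name ``seccomp_strict_blocks_getpid()`` is
made up for this example. A child process opts in to strict mode, after
which only ``read()``, ``write()``, ``_exit()`` and ``sigreturn()`` remain
reachable, and then attempts a syscall outside that set.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the child was killed for making a disallowed syscall,
 * 0 if the child was able to make it and exit normally, and -1 on any
 * error (including seccomp not being available in this environment).
 */
int seccomp_strict_blocks_getpid(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Child: opt in to strict mode. */
		if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
			syscall(SYS_exit, 2);
		/* getpid is not in the strict allow-list: SIGKILL expected. */
		syscall(SYS_getpid);
		/* Raw exit, since exit_group is also outside the allow-list. */
		syscall(SYS_exit, 0);
	}
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
		return 1;
	if (WIFEXITED(status) && WEXITSTATUS(status) == 0)
		return 0;
	return -1;
}
```

Real deployments normally use filter mode with a BPF program rather than
strict mode; strict mode is used here only because it needs no filter to
demonstrate the reduced entry-point set.
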
An area of improvement would be creating viable ways to keep access to
things like compat, user namespaces, BPF creation, and perf limited only
to trusted processes. This would keep the scope of kernel entry points
restricted to the regular set normally available to unprivileged
userspace.

Restricting access to kernel modules
------------------------------------

The kernel should never allow an unprivileged user the ability to
load specific kernel modules, since that would provide a facility to
unexpectedly extend the available attack surface. (The on-demand loading
of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
considered "expected" here, though additional consideration should be
given even to these.) For example, loading a filesystem module via an
unprivileged socket API is nonsense: only the root or physically local
user should trigger filesystem module loading. (And even this can be up
for debate in some scenarios.)

To protect against even privileged users, systems may need to either
disable module loading entirely (e.g. monolithic kernel builds or
modules_disabled sysctl), or provide signed modules (e.g.
``CONFIG_MODULE_SIG_FORCE``, or dm-verity with LoadPin), to keep from having
root load arbitrary kernel code via the module loader interface.


Memory integrity
================

There are many memory structures in the kernel that are regularly abused
to gain execution control during an attack. By far the most commonly
understood is that of the stack buffer overflow in which the return
address stored on the stack is overwritten. Many other examples of this
kind of attack exist, and protections exist to defend against them.

Stack buffer overflow
---------------------

The classic stack buffer overflow involves writing past the expected end
of a variable stored on the stack, ultimately writing a controlled value
to the stack frame's stored return address. The most widely used defense
is the presence of a stack canary between the stack variables and the
return address (``CONFIG_STACKPROTECTOR``), which is verified just before
the function returns. Other defenses include things like shadow stacks.

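The canary check can be modelled without relying on undefined behaviour
by treating a structure as the "frame". This is only a simulation under
that assumption: ``struct demo_frame`` and ``demo_copy_and_check()`` are
invented names, and a real stack protector is emitted by the compiler
rather than written by hand.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_SIZE 16

/* Models a stack frame: a local buffer followed by a canary word. */
struct demo_frame {
	char buf[DEMO_BUF_SIZE];
	unsigned long canary;
};

/*
 * Copies len bytes into the frame starting at buf. Lengths larger than
 * DEMO_BUF_SIZE spill into the canary, as a linear overflow would.
 * Returns 0 if the canary is intact on "return", -1 if it was smashed,
 * and -2 if len does not fit in the modelled frame at all.
 */
int demo_copy_and_check(const void *src, size_t len, unsigned long secret)
{
	struct demo_frame frame;

	if (len > sizeof(frame))
		return -2;
	frame.canary = secret;		/* function prologue */
	memcpy(&frame, src, len);	/* deliberately not bounded by buf */
	return frame.canary == secret ? 0 : -1;	/* function epilogue */
}
```

A copy of exactly ``DEMO_BUF_SIZE`` bytes leaves the canary untouched,
while a copy of ``sizeof(struct demo_frame)`` bytes overwrites it and is
detected before the modelled return.
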
Stack depth overflow
--------------------

A less well understood attack is using a bug that triggers the
kernel to consume stack memory with deep function calls or large stack
allocations. With this attack it is possible to write beyond the end of
the kernel's preallocated stack space and into sensitive structures. Two
important changes need to be made for better protections: moving the
sensitive thread_info structure elsewhere, and adding a faulting memory
hole at the bottom of the stack to catch these overflows.

Heap memory integrity
---------------------

The structures used to track heap free lists can be sanity-checked during
allocation and freeing to make sure they aren't being used to manipulate
other memory areas.

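One possible form of such a sanity check is "safe unlinking" on a doubly
linked free list. The sketch below assumes a circular list with a
sentinel head; ``struct free_chunk`` and the ``free_list_*()`` helpers are
hypothetical and do not describe the layout of any particular kernel
allocator.

```c
#include <stddef.h>

/* Hypothetical free-list node embedded at the start of a free chunk. */
struct free_chunk {
	struct free_chunk *next;
	struct free_chunk *prev;
};

/* Initialise an empty circular list whose head points at itself. */
void free_list_init(struct free_chunk *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert chunk c directly after the list head. */
void free_list_add(struct free_chunk *head, struct free_chunk *c)
{
	c->next = head->next;
	c->prev = head;
	head->next->prev = c;
	head->next = c;
}

/*
 * Unlink c only if both neighbours still point back at it. A corrupted
 * next/prev pair would otherwise turn the unlink into a write to an
 * attacker-chosen address. Returns 0 on success, -1 on corruption.
 */
int free_list_del_checked(struct free_chunk *c)
{
	if (c->next == NULL || c->prev == NULL)
		return -1;
	if (c->next->prev != c || c->prev->next != c)
		return -1;
	c->prev->next = c->next;
	c->next->prev = c->prev;
	c->next = NULL;
	c->prev = NULL;
	return 0;
}
```
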
Counter integrity
-----------------

Many places in the kernel use atomic counters to track object references
or perform similar lifetime management. When these counters can be made
to wrap (over or under) this traditionally exposes a use-after-free
flaw. By trapping atomic wrapping, this class of bug vanishes.

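The idea of refusing to wrap can be sketched with a saturating counter.
The version below is deliberately non-atomic and single-threaded to keep
it short, so it is not a substitute for a real atomic reference counter;
``struct demo_ref`` and its helpers are names invented for this example.

```c
#include <limits.h>
#include <stdbool.h>

#define DEMO_REF_SATURATED UINT_MAX

/* Hypothetical reference counter (non-atomic, for illustration only). */
struct demo_ref {
	unsigned int count;
};

/* Take a reference; once saturated the counter is pinned, never wraps. */
void demo_ref_inc(struct demo_ref *r)
{
	if (r->count != DEMO_REF_SATURATED)
		r->count++;
}

/*
 * Drop a reference. Returns true only when the last reference is gone
 * and the object may be freed. A saturated counter stays pinned (the
 * object is leaked rather than freed early), and a decrement from zero
 * is refused instead of wrapping to a large value.
 */
bool demo_ref_dec_and_test(struct demo_ref *r)
{
	if (r->count == DEMO_REF_SATURATED || r->count == 0)
		return false;
	return --r->count == 0;
}
```

Leaking a saturated object trades a small memory leak for the removal of
the use-after-free that a wrapped counter would otherwise allow.
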
Size calculation overflow detection
-----------------------------------

Similar to counter overflow, integer overflows (usually size calculations)
need to be detected at runtime to kill this class of bug, which
traditionally leads to being able to write past the end of kernel buffers.

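A minimal userspace sketch of runtime size checking is shown below. It
assumes a GCC or Clang toolchain, since it relies on the
``__builtin_mul_overflow()`` compiler builtin; ``demo_alloc_array()`` is a
hypothetical wrapper, not an existing allocator interface.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Allocate room for n elements of the given size, refusing the request
 * when n * size does not fit in size_t. Without the check, the product
 * could wrap to a small value and yield an undersized buffer.
 */
void *demo_alloc_array(size_t n, size_t size)
{
	size_t bytes;

	if (__builtin_mul_overflow(n, size, &bytes))
		return NULL;
	return malloc(bytes);
}
```

A request such as ``demo_alloc_array((size_t)-1, 2)`` returns ``NULL``
instead of a buffer sized by the wrapped product.
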
Probabilistic defenses
======================

While many protections can be considered deterministic (e.g. read-only
memory cannot be written to), some protections provide only statistical
defense, in that an attack must gather enough information about a
running system to overcome the defense. While not perfect, these do
provide meaningful defenses.

Canaries, blinding, and other secrets
-------------------------------------

It should be noted that things like the stack canary discussed earlier
are technically statistical defenses, since they rely on a secret value,
and such values may become discoverable through an information exposure
flaw.

Blinding literal values for things like JITs, where the executable
contents may be partially under the control of userspace, needs a similar
secret value.

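Literal blinding can be sketched as an XOR with a secret key, so that a
value chosen by userspace never appears verbatim in executable memory.
The names ``struct blinded_imm``, ``blind_imm()`` and ``unblind_imm()`` are
hypothetical, and the key is passed in by the caller here; a real JIT
would draw it from a random source for each use.

```c
#include <stdint.h>

/* What a JIT would emit in place of the original immediate. */
struct blinded_imm {
	uint32_t masked;	/* imm ^ key, stored in the instruction stream */
	uint32_t key;		/* applied by a following XOR instruction */
};

/* Split an attacker-influenced immediate into two unpredictable halves. */
struct blinded_imm blind_imm(uint32_t imm, uint32_t key)
{
	struct blinded_imm b = { .masked = imm ^ key, .key = key };

	return b;
}

/* Models what the emitted "load masked, then xor key" pair computes. */
uint32_t unblind_imm(struct blinded_imm b)
{
	return b.masked ^ b.key;
}
```

The computed value is unchanged, but neither half is predictable without
knowing the key.
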
It is critical that the secret values used be separate (e.g.
different canary per stack) and high entropy (e.g. is the RNG actually
working?) in order to maximize their success.

Kernel Address Space Layout Randomization (KASLR)
-------------------------------------------------

Since the location of kernel memory is almost always instrumental in
mounting a successful attack, making the location non-deterministic
raises the difficulty of an exploit. (Note that this in turn makes
the value of information exposures higher, since they may be used to
discover desired memory locations.)

Text and module base
~~~~~~~~~~~~~~~~~~~~

By relocating the physical and virtual base address of the kernel at
boot-time (``CONFIG_RANDOMIZE_BASE``), attacks needing kernel code will be
frustrated. Additionally, offsetting the module loading base address
means that even systems that load the same set of modules in the same
order every boot will not share a common base address with the rest of
the kernel text.

Stack base
~~~~~~~~~~

If the base address of the kernel stack is not the same between processes,
or even not the same between syscalls, targets on or beyond the stack
become more difficult to locate.

Dynamic memory base
~~~~~~~~~~~~~~~~~~~

Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
being relatively deterministic in layout due to the order of early-boot
initializations. If the base address of these areas is not the same
between boots, targeting them is frustrated, requiring an information
exposure specific to the region.

Structure layout
~~~~~~~~~~~~~~~~

By performing a per-build randomization of the layout of sensitive
structures, attacks must either be tuned to known kernel builds or expose
enough kernel memory to determine structure layouts before manipulating
them.


Preventing Information Exposures
================================

Since the locations of sensitive structures are the primary target for
attacks, it is important to defend against exposure of both kernel memory
addresses and kernel memory contents (since they may contain kernel
addresses or other sensitive things like canary values).

Kernel addresses
----------------

Printing kernel addresses to userspace leaks sensitive information about
the kernel memory layout. Care should be exercised when using any printk
specifier that prints the raw address, currently %px, %p[ad], (and %p[sSb]
in certain circumstances [*]). Any file written to using one of these
specifiers should be readable only by privileged processes.

Kernels 4.14 and older printed the raw address using %p. As of 4.15-rc1
addresses printed with the specifier %p are hashed before printing.

[*] If KALLSYMS is enabled and symbol lookup fails, the raw address is
printed. If KALLSYMS is not enabled the raw address is printed.

Unique identifiers
------------------

Kernel memory addresses must never be used as identifiers exposed to
userspace. Instead, use an atomic counter, an idr, or similar unique
identifier.

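The atomic counter option might look like the following userspace
sketch, which uses C11 ``<stdatomic.h>`` in place of the kernel's atomic
types. ``struct demo_object`` and ``demo_object_init()`` are hypothetical
names for this example.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Monotonic ID source; the values carry no memory layout information. */
static atomic_uint_fast64_t demo_next_id = 1;

/* Hypothetical object whose identifier is exposed to userspace. */
struct demo_object {
	uint64_t id;
};

/* Assign the next counter value rather than exposing the object address. */
void demo_object_init(struct demo_object *obj)
{
	obj->id = atomic_fetch_add(&demo_next_id, 1);
}
```

Two objects initialized one after another receive consecutive IDs that
reveal nothing about where either object lives in memory.
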
Memory initialization
---------------------

Memory copied to userspace must always be fully initialized. If not
explicitly memset(), this will require changes to the compiler to make
sure structure holes are cleared.

Memory poisoning
----------------

When releasing memory, it is best to poison the contents, to avoid reuse
attacks that rely on the old contents of memory. E.g., clear stack on a
syscall return (``CONFIG_GCC_PLUGIN_STACKLEAK``), wipe heap memory on a
free. This frustrates many uninitialized variable attacks, stack content
exposures, heap content exposures, and use-after-free attacks.

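Wiping heap memory on free can be sketched in userspace as a wrapper
around ``free()``. The helper names are hypothetical, and the pattern byte
0x6b is chosen here to resemble the kernel's slab free poison. The write
goes through a ``volatile`` pointer so that the compiler does not discard
it as a dead store ahead of the ``free()`` call.

```c
#include <stddef.h>
#include <stdlib.h>

#define DEMO_POISON_FREE 0x6b

/* Overwrite len bytes at p with the poison pattern. */
void demo_poison(void *p, size_t len)
{
	volatile unsigned char *b = p;

	while (len--)
		*b++ = DEMO_POISON_FREE;
}

/*
 * Poison the old contents before handing the memory back, so a later
 * allocation (or a dangling pointer) cannot observe stale data. The
 * caller supplies the allocation size, since free() does not expose it.
 */
void demo_free_poisoned(void *p, size_t len)
{
	if (p == NULL)
		return;
	demo_poison(p, len);
	free(p);
}
```
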
Destination tracking
--------------------

To help kill classes of bugs that result in kernel addresses being
written to userspace, the destination of writes needs to be tracked. If
the buffer is destined for userspace (e.g. seq_file backed ``/proc`` files),
it should automatically censor sensitive values.