.. SPDX-License-Identifier: GPL-2.0

=====================================
Virtually Mapped Kernel Stack Support
=====================================

:Author: Shuah Khan <skhan@linuxfoundation.org>

.. contents:: :local:

Overview
--------

This is a compilation of information from the code and the original patch
series that introduced the `Virtually Mapped Kernel Stacks feature
<https://lwn.net/Articles/694348/>`_.
Introduction
------------

Kernel stack overflows are often hard to debug and make the kernel
susceptible to exploits. Problems can show up at a later time, far removed
from the actual overflow, making them difficult to isolate and root-cause.

Virtually mapped kernel stacks with guard pages cause kernel stack
overflows to be caught immediately rather than causing difficult-to-diagnose
corruption.

The HAVE_ARCH_VMAP_STACK and VMAP_STACK configuration options enable
support for virtually mapped stacks with guard pages. This feature
causes reliable faults when the stack overflows. The usability of
the stack trace after an overflow, and the response to the overflow
itself, are architecture dependent.

.. note::
   As of this writing, arm64, powerpc, riscv, s390, um, and x86 have
   support for VMAP_STACK.
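The effect of a guard page can be demonstrated from userspace. The sketch
below is a hypothetical demo, not kernel code: it reserves a stack-sized
region with mmap(), leaves PROT_NONE guard pages on both sides, and probes
the guards with write(2), which reports EFAULT for unmapped memory instead
of delivering a fault::

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t stack_pages = 4;
        /* Reserve guard + stack + guard, all initially inaccessible. */
        size_t total = (stack_pages + 2) * page;
        char *base = mmap(NULL, total, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *stack = base + page;

        /* Open up only the middle; the first and last page stay guards. */
        mprotect(stack, stack_pages * page, PROT_READ | PROT_WRITE);

        /* The usable stack region works normally... */
        memset(stack, 0xaa, stack_pages * page);

        /* ...but probing a guard page with write(2) yields EFAULT. */
        int fds[2];
        pipe(fds);
        ssize_t n = write(fds[1], base, 1);                 /* leading guard */
        printf("leading guard: %s\n",
               (n < 0 && errno == EFAULT) ? "EFAULT" : "mapped");
        n = write(fds[1], stack + stack_pages * page, 1);   /* trailing guard */
        printf("trailing guard: %s\n",
               (n < 0 && errno == EFAULT) ? "EFAULT" : "mapped");
        return 0;
    }

An overflow in the kernel is caught the same way: touching the guard page
produces an immediate fault rather than silently corrupting a neighbor.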
HAVE_ARCH_VMAP_STACK
--------------------

Architectures that can support virtually mapped kernel stacks should
enable this bool configuration option. The requirements are:

- vmalloc space must be large enough to hold many kernel stacks. This
  may rule out many 32-bit architectures.
- Stacks in vmalloc space need to work reliably. For example, if
  vmap page tables are created on demand, either this mechanism
  needs to work while the stack points to a virtual address with
  unpopulated page tables, or arch code (switch_to() and switch_mm(),
  most likely) needs to ensure that the stack's page table entries
  are populated before running on a possibly unpopulated stack.
- If the stack overflows into a guard page, something reasonable
  should happen. The definition of "reasonable" is flexible, but
  instantly rebooting without logging anything would be unfriendly.

VMAP_STACK
----------

The VMAP_STACK bool configuration option, when enabled, allocates virtually
mapped task stacks. This option depends on HAVE_ARCH_VMAP_STACK.

- Enable this if you want to use virtually mapped kernel stacks
  with guard pages. This causes kernel stack overflows to be caught
  immediately rather than causing difficult-to-diagnose corruption.

.. note::

   Using this feature with KASAN requires architecture support
   for backing virtual mappings with real shadow memory, and
   KASAN_VMALLOC must be enabled.

.. note::

   When VMAP_STACK is enabled, it is not possible to run DMA on
   stack-allocated data.

Kernel configuration options and dependencies keep changing. Refer to
the latest code base:

`Kconfig <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/Kconfig>`_

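As a rough orientation, the Kconfig relationship between the two options
looks like the simplified sketch below (the real entries in arch/Kconfig
carry additional dependencies, defaults, and help text, so treat this only
as an illustration)::

    config HAVE_ARCH_VMAP_STACK
            bool

    config VMAP_STACK
            bool "Use a virtually-mapped stack"
            depends on HAVE_ARCH_VMAP_STACK

An architecture selects HAVE_ARCH_VMAP_STACK once it meets the requirements
above; VMAP_STACK is the user-visible switch that depends on it.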
Allocation
----------

When a new kernel thread is created, a thread stack is allocated from
virtually contiguous memory pages from the page level allocator. These
pages are mapped into contiguous kernel virtual space with PAGE_KERNEL
protections.

alloc_thread_stack_node() calls __vmalloc_node_range() to allocate the
stack with PAGE_KERNEL protections.

- Allocated stacks are cached and later reused by new threads, so memcg
  accounting is performed manually on assigning/releasing stacks to tasks.
  Hence, __vmalloc_node_range() is called without __GFP_ACCOUNT.
- The vm_struct is cached so that it can be found when stack freeing is
  initiated from interrupt context; free_thread_stack() can be called in
  interrupt context.
- On arm64, all VMAP'd stacks need to have the same alignment to ensure
  that VMAP'd stack overflow detection works correctly. The arch-specific
  vmap stack allocator takes care of this detail.
- This does not address interrupt stacks, according to the original patch.

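The cache-and-reuse behavior can be modeled in userspace. The toy sketch
below (all names hypothetical; the kernel's real cache lives in
kernel/fork.c) keeps a small free list of mmap'd stacks so that freeing a
stack and allocating a new one avoids going back to the allocator::

    #include <stdio.h>
    #include <sys/mman.h>

    #define STACK_SIZE   (16 * 1024)
    #define CACHE_SLOTS  2

    static void *cache[CACHE_SLOTS];

    static void *stack_alloc(void)
    {
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (cache[i]) {             /* cache hit: reuse without mmap */
                void *s = cache[i];
                cache[i] = NULL;
                return s;
            }
        }
        return mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }

    static void stack_free(void *s)
    {
        for (int i = 0; i < CACHE_SLOTS; i++) {
            if (!cache[i]) {            /* keep it for the next thread */
                cache[i] = s;
                return;
            }
        }
        munmap(s, STACK_SIZE);          /* cache full: really release it */
    }

    int main(void)
    {
        void *a = stack_alloc();
        stack_free(a);
        void *b = stack_alloc();        /* comes from the cache */
        printf("reused: %s\n", a == b ? "yes" : "no");
        stack_free(b);
        return 0;
    }

Because the cache hands stacks back without charging the allocator again,
the kernel performs memcg accounting manually, as noted above.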
Thread stack allocation is initiated from clone(), fork(), vfork(), and
kernel_thread(), via kernel_clone(). These are a few hints for searching
the code base to understand when and how a thread stack is allocated.

The bulk of the code is in:
`kernel/fork.c <https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/fork.c>`_.

The stack_vm_area pointer in task_struct keeps track of the virtually
allocated stack, and a non-null stack_vm_area pointer serves as an
indication that virtually mapped kernel stacks are enabled.

::

        struct vm_struct *stack_vm_area;

Stack overflow handling
-----------------------

Leading and trailing guard pages help detect stack overflows. When the
stack overflows into a guard page, handlers have to be careful not to
overflow the stack again. When handlers are called, it is likely that
very little stack space is left.

On x86, this is done by handling the page fault that indicates the kernel
stack overflow on the double-fault stack.

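The same "handle the fault on a separate stack" idea exists in userspace as
sigaltstack(2). The hypothetical demo below overflows its own stack and
catches the resulting SIGSEGV in a handler that runs on an alternate stack,
because at that point almost no regular stack space remains::

    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>

    static sigjmp_buf env;

    static void handler(int sig)
    {
        /* We are running on the alternate stack here. */
        (void)sig;
        siglongjmp(env, 1);
    }

    static int overflow(int depth)
    {
        volatile char pad[4096];        /* burn one page per frame */
        pad[0] = (char)depth;
        return pad[0] + overflow(depth + 1);
    }

    int main(void)
    {
        stack_t ss = { 0 };
        ss.ss_sp = malloc(SIGSTKSZ);
        ss.ss_size = SIGSTKSZ;
        sigaltstack(&ss, NULL);

        struct sigaction sa = { 0 };
        sa.sa_handler = handler;
        sa.sa_flags = SA_ONSTACK;       /* run handler on the alternate stack */
        sigaction(SIGSEGV, &sa, NULL);

        if (sigsetjmp(env, 1) == 0)
            overflow(1);
        else
            printf("stack overflow caught on alternate stack\n");
        return 0;
    }

Without SA_ONSTACK the handler would try to push a frame onto the already
exhausted stack and fault again, which is exactly the problem the x86
double-fault stack solves in the kernel.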
Testing VMAP allocation with guard pages
----------------------------------------

How do we ensure that VMAP_STACK is actually allocating with a leading
and trailing guard page? The following lkdtm tests can help detect any
regressions.

::

        void lkdtm_STACK_GUARD_PAGE_LEADING()
        void lkdtm_STACK_GUARD_PAGE_TRAILING()

Conclusions
-----------

- A percpu cache of vmalloced stacks appears to be a bit faster than a
  high-order stack allocation, at least when the cache hits.
- THREAD_INFO_IN_TASK gets rid of arch-specific thread_info entirely and
  simply embeds the thread_info (containing only flags) and 'int cpu' into
  task_struct.
- The thread stack can be freed as soon as the task is dead (without
  waiting for RCU), and then, if vmapped stacks are in use, the entire
  stack can be cached for reuse on the same cpu.