Commit | Line | Data |
---|---|---|
01c9b17b DH |
1 | Overview |
2 | ======== | |
3 | ||
4 | Page Table Isolation (pti, previously known as KAISER[1]) is a | |
5 | countermeasure against attacks on the shared user/kernel address | |
6 | space such as the "Meltdown" approach[2]. | |
7 | ||
8 | To mitigate this class of attacks, we create an independent set of | |
9 | page tables for use only when running userspace applications. When | |
10 | the kernel is entered via syscalls, interrupts or exceptions, the | |
11 | page tables are switched to the full "kernel" copy. When the system | |
12 | switches back to user mode, the user copy is used again. | |
13 | ||
14 | The userspace page tables contain only a minimal amount of kernel | |
15 | data: only what is needed to enter/exit the kernel such as the | |
16 | entry/exit functions themselves and the interrupt descriptor table | |
17 | (IDT). A few strictly unnecessary things are also mapped, | |
18 | such as the first C function run when entering an interrupt (see | |
19 | comments in pti.c). | |
20 | ||
21 | This approach helps to ensure that side-channel attacks leveraging | |
22 | the paging structures do not function when PTI is enabled. It can be | |
23 | enabled by setting CONFIG_PAGE_TABLE_ISOLATION=y at compile time. | |
24 | Once enabled at compile-time, it can be disabled at boot with the | |
25 | 'nopti' or 'pti=' kernel parameters (see kernel-parameters.txt). | |
26 | ||
27 | Page Table Management | |
28 | ===================== | |
29 | ||
30 | When PTI is enabled, the kernel manages two sets of page tables. | |
31 | The first set is very similar to the single set which is present in | |
32 | kernels without PTI. This includes a complete mapping of userspace | |
33 | that the kernel can use for things like copy_to_user(). | |
34 | ||
35 | Although _complete_, the user portion of the kernel page tables is | |
36 | crippled by setting the NX bit in the top level. This ensures | |
37 | that any missed kernel->user CR3 switch will immediately crash | |
38 | userspace upon executing its first instruction. | |
39 | ||
40 | The userspace page tables map only the kernel data needed to enter | |
41 | and exit the kernel. This data is entirely contained in the 'struct | |
42 | cpu_entry_area' structure, which is placed in the fixmap. This gives | |
43 | each CPU's copy of the area a compile-time-fixed virtual address. | |
44 | ||
45 | For new userspace mappings, the kernel makes the entries in its | |
46 | page tables like normal. The only difference is when the kernel | |
47 | makes entries in the top (PGD) level. In addition to setting the | |
48 | entry in the main kernel PGD, a copy of the entry is made in the | |
49 | userspace page tables' PGD. | |
50 | ||
51 | This sharing at the PGD level also inherently shares all the lower | |
52 | layers of the page tables. This leaves a single, shared set of | |
53 | userspace page tables to manage: one PTE to lock, one set of | |
54 | accessed bits, one set of dirty bits, and so on. | |
55 | ||
56 | Overhead | |
57 | ======== | |
58 | ||
59 | Protection against side-channel attacks is important, but it | |
60 | comes at a cost: | |
61 | ||
62 | 1. Increased Memory Use | |
63 | a. Each process now needs an order-1 PGD instead of order-0. | |
64 | (Consumes an additional 4k per process). | |
65 | b. The 'cpu_entry_area' structure must be 2MB in size and 2MB | |
66 | aligned so that it can be mapped by setting a single PMD | |
67 | entry. This consumes nearly 2MB of RAM once the kernel | |
68 | is decompressed, but no space in the kernel image itself. | |
69 | ||
70 | 2. Runtime Cost | |
71 | a. CR3 manipulation to switch between the page table copies | |
72 | must be done at interrupt, syscall, and exception entry | |
73 | and exit, although it can be skipped when an interrupt arrives | |
74 | while already in the kernel. Moves to CR3 cost on the order of | |
75 | a hundred cycles and are required at every entry and exit. | |
76 | b. A "trampoline" must be used for SYSCALL entry. This | |
77 | trampoline depends on a smaller set of resources than the | |
78 | non-PTI SYSCALL entry code, so requires mapping fewer | |
79 | things into the userspace page tables. The downside is | |
80 | that stacks must be switched at entry time. | |
81 | c. Global pages are disabled for all kernel structures not | |
82 | mapped into both kernel and userspace page tables. This | |
83 | feature of the MMU allows different processes to share TLB | |
84 | entries mapping the kernel. Losing the feature means more | |
85 | TLB misses after a context switch. The actual loss of | |
86 | performance is very small, however, never exceeding 1%. | |
87 | d. Process Context IDentifiers (PCID) is a CPU feature that | |
88 | allows us to skip flushing the entire TLB when switching page | |
89 | tables by setting a special bit in CR3 when the page tables | |
90 | are changed. This makes switching the page tables (at context | |
91 | switch, or kernel entry/exit) cheaper. But, on systems with | |
92 | PCID support, the context switch code must flush both the user | |
93 | and kernel entries out of the TLB. The user PCID TLB flush is | |
94 | deferred until the exit to userspace, minimizing the cost. | |
95 | See intel.com/sdm for the gory PCID/INVPCID details. | |
96 | e. The userspace page tables must be populated for each new | |
97 | process. Even without PTI, the shared kernel mappings | |
98 | are created by copying top-level (PGD) entries into each | |
99 | new process. But, with PTI, there are now *two* kernel | |
100 | mappings: one in the kernel page tables that maps everything | |
101 | and one for the entry/exit structures. At fork(), we need to | |
102 | copy both. | |
103 | f. In addition to the fork()-time copying, there must also | |
104 | be an update to the userspace PGD any time a set_pgd() is done | |
105 | on a PGD used to map userspace. This ensures that the kernel | |
106 | and userspace copies always map the same userspace | |
107 | memory. | |
108 | g. On systems without PCID support, each CR3 write flushes | |
109 | the entire TLB. That means that each syscall, interrupt | |
110 | or exception flushes the TLB. | |
111 | h. INVPCID is a TLB-flushing instruction which allows flushing | |
112 | of TLB entries for non-current PCIDs. Some systems support | |
113 | PCIDs, but do not support INVPCID. On these systems, addresses | |
114 | can only be flushed from the TLB for the current PCID. When | |
115 | flushing a kernel address, we need to flush all PCIDs, so a | |
116 | single kernel address flush will require a TLB-flushing CR3 | |
117 | write upon the next use of every PCID. | |
118 | ||
119 | Possible Future Work | |
120 | ==================== | |
121 | 1. We can be more careful about not actually writing to CR3 | |
122 | unless its value is actually changed. | |
123 | 2. Allow PTI to be enabled/disabled at runtime in addition to the | |
124 | boot-time switching. | |
125 | ||
126 | Testing | |
127 | ======= | |
128 | ||
129 | To test stability of PTI, the following test procedure is recommended, | |
130 | ideally doing all of these in parallel: | |
131 | ||
132 | 1. Set CONFIG_DEBUG_ENTRY=y | |
133 | 2. Run several copies of all of the tools/testing/selftests/x86/ tests | |
134 | (excluding MPX and protection_keys) in a loop on multiple CPUs for | |
135 | several minutes. These tests frequently uncover corner cases in the | |
136 | kernel entry code. In general, old kernels might cause these tests | |
137 | themselves to crash, but they should never crash the kernel. | |
138 | 3. Run the 'perf' tool in a mode (top or record) that generates many | |
139 | frequent performance monitoring non-maskable interrupts (see "NMI" | |
140 | in /proc/interrupts). This exercises the NMI entry/exit code which | |
141 | is known to trigger bugs in code paths that did not expect to be | |
142 | interrupted, including nested NMIs. Using "-c" boosts the rate of | |
143 | NMIs, and using two -c with separate counters encourages nested NMIs | |
144 | and less deterministic behavior. | |
145 | ||
146 | while true; do perf record -c 10000 -e instructions,cycles -a sleep 10; done | |
147 | ||
148 | 4. Launch a KVM virtual machine. | |
149 | 5. Run 32-bit binaries on systems supporting the SYSCALL instruction. | |
150 | This has been a lightly-tested code path and needs extra scrutiny. | |
151 | ||
152 | Debugging | |
153 | ========= | |
154 | ||
155 | Bugs in PTI cause a few different signatures of crashes | |
156 | that are worth noting here. | |
157 | ||
158 | * Failures of the selftests/x86 code. Usually a bug in one of the | |
159 | more obscure corners of entry_64.S. | |
160 | * Crashes in early boot, especially around CPU bringup. Bugs | |
161 | in the trampoline code or mappings cause these. | |
162 | * Crashes at the first interrupt. Caused by bugs in entry_64.S, | |
163 | like screwing up a page table switch. Also caused by | |
164 | incorrectly mapping the IRQ handler entry code. | |
165 | * Crashes at the first NMI. The NMI code is separate from main | |
166 | interrupt handlers and can have bugs that do not affect | |
167 | normal interrupts. Also caused by incorrectly mapping NMI | |
168 | code. NMIs that interrupt the entry code must be very | |
169 | careful and can be the cause of crashes that show up when | |
170 | running perf. | |
171 | * Kernel crashes at the first exit to userspace. entry_64.S | |
172 | bugs, or failing to map some of the exit code. | |
173 | * Crashes at first interrupt that interrupts userspace. The paths | |
174 | in entry_64.S that return to userspace are sometimes separate | |
175 | from the ones that return to the kernel. | |
176 | * Double faults: overflowing the kernel stack because of page | |
177 | faults upon page faults. Caused by touching non-pti-mapped | |
178 | data in the entry code, or forgetting to switch to kernel | |
179 | CR3 before calling into C functions which are not pti-mapped. | |
180 | * Userspace segfaults early in boot, sometimes manifesting | |
181 | as mount(8) failing to mount the rootfs. These have | |
182 | tended to be TLB invalidation issues. Usually invalidating | |
183 | the wrong PCID, or otherwise missing an invalidation. | |
184 | ||
185 | 1. https://gruss.cc/files/kaiser.pdf | |
186 | 2. https://meltdownattack.com/meltdown.pdf |