.. SPDX-License-Identifier: GPL-2.0

===============================
Software Guard eXtensions (SGX)
===============================

Overview
========

Software Guard eXtensions (SGX) hardware enables user space applications
to set aside private memory regions of code and data:

* Privileged (ring-0) ENCLS functions orchestrate the construction of the
  regions.
* Unprivileged (ring-3) ENCLU functions allow an application to enter and
  execute inside the regions.

These memory regions are called enclaves. An enclave can only be entered at a
fixed set of entry points. Each entry point can hold a single hardware thread
at a time. While the enclave is loaded from a regular binary file by using
ENCLS functions, only the threads inside the enclave can access its memory. The
CPU denies access to the region from outside the enclave, and the region is
encrypted before it leaves the LLC.

Support can be determined by

    ``grep sgx /proc/cpuinfo``

SGX must both be supported in the processor and enabled by the BIOS. If SGX
appears to be unsupported on a system which has hardware support, ensure
support is enabled in the BIOS. If a BIOS presents a choice between "Enabled"
and "Software Enabled" modes for SGX, choose "Enabled".
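The same check can be scripted. The following is a minimal sketch, not a
kernel-provided tool: the helper names ``cpu_has_flag`` and ``read_cpuinfo``
are made up for illustration. Unlike a plain ``grep``, it matches the exact
flag name, so a flag that merely contains the string (for example a launch
control flag such as ``sgx_lc``) is not mistaken for ``sgx`` itself.

```python
def cpu_has_flag(cpuinfo_text, flag="sgx"):
    """Return True if any 'flags' line in cpuinfo_text lists the exact flag."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the "flags" lines are relevant; compare whole tokens.
        if sep and key.strip() == "flags" and flag in value.split():
            return True
    return False


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo text (hypothetical helper; path is the usual procfs file)."""
    with open(path, encoding="ascii", errors="replace") as f:
        return f.read()
```

On a Linux system, ``cpu_has_flag(read_cpuinfo())`` reports whether the flag is
present. A ``False`` result on SGX-capable hardware usually points at the BIOS
setting described above.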

Enclave Page Cache
==================

SGX utilizes an *Enclave Page Cache (EPC)* to store pages that are associated
with an enclave. It is contained in a BIOS-reserved region of physical memory.
Unlike pages used for regular memory, EPC pages can only be accessed from
outside of the enclave during enclave construction with special, limited SGX
instructions.

Only a CPU executing inside an enclave can directly access enclave memory.
However, a CPU executing inside an enclave may access normal memory outside the
enclave.

The kernel manages enclave memory similar to how it treats device memory.

Enclave Page Types
------------------

**SGX Enclave Control Structure (SECS)**
   Enclave's address range, attributes and other global data are defined
   by this structure.

**Regular (REG)**
   Regular EPC pages contain the code and data of an enclave.

**Thread Control Structure (TCS)**
   Thread Control Structure pages define the entry points to an enclave and
   track the execution state of an enclave thread.

**Version Array (VA)**
   Version Array pages contain 512 slots, each of which can contain a version
   number for a page evicted from the EPC.
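As a small worked example of the Version Array capacity, the sketch below
computes how many VA pages are needed to hold version numbers for a given
number of evicted pages. Only the 512-slot figure comes from the text; the
function name ``va_pages_needed`` is made up for illustration.

```python
VA_SLOTS_PER_PAGE = 512  # slots per Version Array page, as stated above


def va_pages_needed(evicted_pages):
    """Return the number of VA pages required to track evicted_pages pages."""
    if evicted_pages < 0:
        raise ValueError("evicted_pages must be non-negative")
    # Ceiling division: a partially filled VA page still occupies one EPC page.
    return -(-evicted_pages // VA_SLOTS_PER_PAGE)
```

For instance, 512 evicted pages fit in one VA page, while 513 require two.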

Enclave Page Cache Map
----------------------

The processor tracks EPC pages in a hardware metadata structure called the
*Enclave Page Cache Map (EPCM)*. The EPCM contains an entry for each EPC page
which describes the owning enclave, access rights and page type, among other
things.

EPCM permissions are separate from the normal page tables. This prevents the
kernel from, for instance, allowing writes to data which an enclave wishes to
remain read-only. EPCM permissions may only impose additional restrictions on
top of normal x86 page permissions.
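The "additional restrictions only" rule can be pictured as an intersection of
two permission sets. The sketch below is a simplified model for illustration,
not a description of the hardware checks: the ``Perm`` flags and
``effective_perms`` are hypothetical names.

```python
import enum


class Perm(enum.Flag):
    """Simplified read/write/execute permission bits (illustrative only)."""
    NONE = 0
    R = 1
    W = 2
    X = 4


def effective_perms(page_table, epcm):
    """An access is allowed only if both the page tables and the EPCM allow it."""
    return page_table & epcm
```

With this model, a kernel mapping of ``Perm.R | Perm.W`` over a page whose EPCM
entry is ``Perm.R`` still yields a read-only page, which is the case described
above.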

For all intents and purposes, the SGX architecture allows the processor to
invalidate all EPCM entries at will. This requires that software be prepared to
handle an EPCM fault at any time. In practice, this can happen on events like
power transitions when the ephemeral key that encrypts enclave memory is lost.

Application interface
=====================

Enclave build functions
-----------------------

In addition to the traditional compiler and linker build process, SGX has a
separate enclave “build” process. Enclaves must be built before they can be
executed (entered). The first step in building an enclave is opening the
**/dev/sgx_enclave** device. Since enclave memory is protected from direct
access, special privileged instructions are then used to copy data into enclave
pages and establish enclave page permissions.

.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
   :functions: sgx_ioc_enclave_create
               sgx_ioc_enclave_add_pages
               sgx_ioc_enclave_init
               sgx_ioc_enclave_provision
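To make the build sequence concrete, the sketch below derives the ioctl request
numbers for the create, add-pages and init steps using the generic Linux
``_IOC`` encoding for x86. The magic value, command numbers and structure sizes
are assumptions taken from a reading of ``arch/x86/include/uapi/asm/sgx.h`` and
should be checked against the header in the kernel tree being used; the
provisioning ioctl is omitted.

```python
# Generic Linux ioctl number encoding (x86 layout).
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = 8
_IOC_SIZESHIFT = 16
_IOC_DIRSHIFT = 30
_IOC_WRITE = 1
_IOC_READ = 2


def _ioc(direction, magic, nr, size):
    """Compose an ioctl request number from its four fields."""
    return ((direction << _IOC_DIRSHIFT) | (size << _IOC_SIZESHIFT) |
            (magic << _IOC_TYPESHIFT) | (nr << _IOC_NRSHIFT))


SGX_MAGIC = 0xA4  # assumed from the uapi header

# Assumed structure sizes: each field is a 64-bit integer.
SIZEOF_ENCLAVE_CREATE = 8      # one field
SIZEOF_ENCLAVE_ADD_PAGES = 48  # six fields
SIZEOF_ENCLAVE_INIT = 8        # one field

SGX_IOC_ENCLAVE_CREATE = _ioc(_IOC_WRITE, SGX_MAGIC, 0x00,
                              SIZEOF_ENCLAVE_CREATE)
SGX_IOC_ENCLAVE_ADD_PAGES = _ioc(_IOC_WRITE | _IOC_READ, SGX_MAGIC, 0x01,
                                 SIZEOF_ENCLAVE_ADD_PAGES)
SGX_IOC_ENCLAVE_INIT = _ioc(_IOC_WRITE, SGX_MAGIC, 0x02,
                            SIZEOF_ENCLAVE_INIT)

# Order in which a loader would issue the requests on /dev/sgx_enclave.
BUILD_ORDER = [
    ("create", SGX_IOC_ENCLAVE_CREATE),
    ("add_pages", SGX_IOC_ENCLAVE_ADD_PAGES),  # repeated per region
    ("init", SGX_IOC_ENCLAVE_INIT),
]
```

A real loader would pass packed structures to ``fcntl.ioctl()`` with these
request numbers; in practice most applications rely on an SGX runtime library
to do this.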

Enclave runtime management
--------------------------

Systems supporting SGX2 additionally support changes to initialized
enclaves: modifying enclave page permissions and type, and dynamically
adding and removing enclave pages. When an enclave accesses an address
within its address range that does not have a backing page, a new
regular page will be dynamically added to the enclave. The enclave is
still required to run EACCEPT on the new page before it can be used.

.. kernel-doc:: arch/x86/kernel/cpu/sgx/ioctl.c
   :functions: sgx_ioc_enclave_restrict_permissions
               sgx_ioc_enclave_modify_types
               sgx_ioc_enclave_remove_pages

Enclave vDSO
------------

Entering an enclave can only be done through SGX-specific EENTER and ERESUME
functions, and is a non-trivial process. Because of the complexity of
transitioning to and from an enclave, enclaves typically utilize a library to
handle the actual transitions. This is roughly analogous to how glibc
implementations are used by most applications to wrap system calls.

Another crucial characteristic of enclaves is that they can generate exceptions
as part of their normal operation that need to be handled in the enclave or are
unique to SGX.

Instead of the traditional signal mechanism to handle these exceptions, SGX
can leverage special exception fixup provided by the vDSO. The kernel-provided
vDSO function wraps low-level transitions to/from the enclave like EENTER and
ERESUME. The vDSO function intercepts exceptions that would otherwise generate
a signal and returns the fault information directly to its caller. This avoids
the need to juggle signal handlers.

.. kernel-doc:: arch/x86/include/uapi/asm/sgx.h
   :functions: vdso_sgx_enter_enclave_t

ksgxd
=====

SGX support includes a kernel thread called *ksgxd*.

EPC sanitization
----------------

ksgxd is started when SGX initializes. Enclave memory is typically ready
for use when the processor powers on or resets. However, if SGX has been in
use since the reset, enclave pages may be in an inconsistent state. This might
occur after a crash and kexec() cycle, for instance. At boot, ksgxd
reinitializes all enclave pages so that they can be allocated and re-used.

The sanitization is done by going through the EPC address space and applying
the EREMOVE function to each physical page. Some enclave pages like SECS pages
have hardware dependencies on other pages which prevents EREMOVE from
functioning. Executing two EREMOVE passes removes the dependencies.
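The reason two passes suffice can be seen in a toy model. The sketch below is a
simulation under simplifying assumptions, not the kernel code: pages are
integers visited in address order, the only dependency is that a SECS page
cannot be removed while any of its child pages is still present, and the names
``eremove_pass`` and ``children_of`` are made up.

```python
def eremove_pass(present, children_of):
    """Try to remove every page once, in address order.

    present:     set of page ids still in the EPC (modified in place)
    children_of: dict mapping a SECS page id to the set of its child page ids
    Returns the list of pages whose removal failed in this pass.
    """
    failed = []
    for page in sorted(present):
        if children_of.get(page, set()) & present:
            # A SECS page with live children: removal fails for now.
            failed.append(page)
        else:
            present.discard(page)
    return failed


def sanitize(present, children_of):
    """Run two passes and return the failures of each pass."""
    first = eremove_pass(present, children_of)
    second = eremove_pass(present, children_of)
    return first, second
```

In this model the first pass removes all child pages and fails only on SECS
pages that were visited before their children; the second pass then finds no
remaining children and removes the SECS pages as well.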

Page reclaimer
--------------

Similar to the core kswapd, ksgxd is responsible for managing the
overcommitment of enclave memory. If the system runs out of enclave memory,
*ksgxd* “swaps” enclave memory to normal memory.

Launch Control
==============

SGX provides a launch control mechanism. After all enclave pages have been
copied, the kernel executes the EINIT function, which initializes the enclave.
Only after this can the CPU execute inside the enclave.

The EINIT function takes an RSA-3072 signature of the enclave measurement. The
function checks that the measurement is correct and that the signature was made
with the key whose SHA256 hash is held in the four
**IA32_SGXLEPUBKEYHASH{0, 1, 2, 3}** MSRs.

Those MSRs can be configured by the BIOS to be either read-only or writable.
Linux supports only the writable configuration in order to give full control to
the kernel on launch control policy. Before calling the EINIT function, the
driver sets the MSRs to match the enclave's signing key.
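The relationship between the signing key and the four MSRs can be sketched as
follows. This is an illustration under two assumptions that are not stated in
the text: that the hash is computed over the raw bytes of the RSA modulus as
stored in the enclave signature structure, and that the 32-byte digest is split
in order into four little-endian 64-bit words. The function name is
hypothetical.

```python
import hashlib
import struct


def lepubkeyhash_msr_values(modulus_bytes):
    """Return four 64-bit values for IA32_SGXLEPUBKEYHASH0..3 (assumed layout)."""
    digest = hashlib.sha256(modulus_bytes).digest()  # 32 bytes
    # Four unsigned 64-bit little-endian words, in digest order.
    return list(struct.unpack("<4Q", digest))
```

For an RSA-3072 key the modulus is 384 bytes long, so a caller would pass a
384-byte ``bytes`` object and receive one value per MSR.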

Encryption engines
==================

In order to conceal the enclave data while it is out of the CPU package, the
memory controller has an encryption engine to transparently encrypt and decrypt
enclave memory.

In CPUs prior to Ice Lake, the Memory Encryption Engine (MEE) is used to
encrypt pages leaving the CPU caches. MEE uses an n-ary Merkle tree with root in
SRAM to maintain integrity of the encrypted data. This provides integrity and
anti-replay protection but does not scale to large memory sizes because the time
required to update the Merkle tree grows logarithmically in relation to the
memory size.
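The logarithmic growth can be illustrated by counting the tree levels that an
update has to touch. The sketch below is generic arithmetic for an n-ary tree;
the arity and leaf counts used with it are arbitrary illustration values, not
parameters of the actual MEE.

```python
def tree_levels(leaves, arity):
    """Return the number of levels between the leaves and the root of an
    n-ary tree that covers the given number of leaves."""
    if leaves < 1 or arity < 2:
        raise ValueError("need at least one leaf and an arity of two or more")
    levels = 0
    capacity = 1
    # Each extra level multiplies the number of leaves the tree can cover.
    while capacity < leaves:
        capacity *= arity
        levels += 1
    return levels
```

With an arity of 8, covering ``8**6`` leaves takes 6 levels, and any leaf count
above that (up to ``8**7``) takes 7, so every update walks one more level each
time the protected memory grows past another power of the arity.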

CPUs starting from Ice Lake use Total Memory Encryption (TME) in the place of
MEE. TME-based SGX implementations do not have an integrity Merkle tree, which
means integrity and replay attacks are not mitigated. However, it includes
additional changes to prevent cipher text from being returned and SW memory
aliases from being created.

DMA to enclave memory is blocked by range registers on both MEE and TME systems
(SDM section 41.10).

Usage Models
============

Shared Library
--------------

Sensitive data and the code that acts on it are partitioned from the application
into a separate library. The library is then linked as a DSO which can be loaded
into an enclave. The application can then make individual function calls into
the enclave through special SGX instructions. A run-time within the enclave is
configured to marshal function parameters into and out of the enclave and to
call the correct library function.

Application Container
---------------------

An application may be loaded into a container enclave which is specially
configured with a library OS and run-time which permits the application to run.
The enclave run-time and library OS work together to execute the application
when a thread enters the enclave.

Impact of Potential Kernel SGX Bugs
===================================

EPC leaks
---------

When EPC page leaks happen, a WARNING like this is shown in dmesg:

"EREMOVE returned ... and an EPC page was leaked. SGX may become unusable..."

This is effectively a kernel use-after-free of an EPC page, and due
to the way SGX works, the bug is detected at freeing. Rather than
adding the page back to the pool of available EPC pages, the kernel
intentionally leaks the page to avoid additional errors in the future.

When this happens, the kernel will likely soon leak more EPC pages, and
SGX will likely become unusable because the memory available to SGX is
limited. However, while this may be fatal to SGX, the rest of the kernel
is unlikely to be impacted and should continue to work.

As a result, when this happens, users should stop running any new
SGX workloads (or just any new workloads), and migrate all valuable
workloads. Although a machine reboot can recover all EPC memory, the bug
should be reported to Linux developers.

Virtual EPC
===========

The implementation also has a virtual EPC driver to support SGX enclaves
in guests. Unlike the SGX driver, an EPC page allocated by the virtual
EPC driver doesn't have a specific enclave associated with it. This is
because KVM doesn't track how a guest uses EPC pages.

As a result, the SGX core page reclaimer doesn't support reclaiming EPC
pages allocated to KVM guests through the virtual EPC driver. If the
user wants to deploy SGX applications both on the host and in guests
on the same machine, the user should reserve enough EPC (by subtracting
the total virtual EPC size of all SGX VMs from the physical EPC size) for
host SGX applications so they can run with acceptable performance.
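The reservation described above is simple arithmetic, sketched below. The
function name and the sizes in the usage note are made-up illustration values;
the real physical EPC size depends on the platform and BIOS configuration.

```python
def host_epc_bytes(physical_epc_bytes, guest_vepc_bytes):
    """Return the EPC left for host enclaves after all guests are provisioned.

    guest_vepc_bytes is an iterable with the virtual EPC size of each SGX VM.
    """
    remaining = physical_epc_bytes - sum(guest_vepc_bytes)
    if remaining < 0:
        # Virtual EPC is not reclaimable, so oversubscribing it cannot work.
        raise ValueError("guests are assigned more EPC than physically exists")
    return remaining
```

For example, with 128 MiB of physical EPC and two guests given 32 MiB and
64 MiB, 32 MiB remain for host SGX applications.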

Architectural behavior is to restore all EPC pages to an uninitialized
state also after a guest reboot. Because this state can be reached only
through the privileged ``ENCLS[EREMOVE]`` instruction, ``/dev/sgx_vepc``
provides the ``SGX_IOC_VEPC_REMOVE_ALL`` ioctl to execute the instruction
on all pages in the virtual EPC.

``EREMOVE`` can fail for three reasons. Userspace must pay attention
to expected failures and handle them as follows:

1. Page removal will always fail when any thread is running in the
   enclave to which the page belongs. In this case the ioctl will
   return ``EBUSY`` independent of whether it has successfully removed
   some pages; userspace can avoid these failures by preventing execution
   of any vcpu which maps the virtual EPC.

2. Page removal will cause a general protection fault if two calls to
   ``EREMOVE`` happen concurrently for pages that refer to the same
   "SECS" metadata pages. This can happen if there are concurrent
   invocations of ``SGX_IOC_VEPC_REMOVE_ALL``, or if a ``/dev/sgx_vepc``
   file descriptor in the guest is closed at the same time as
   ``SGX_IOC_VEPC_REMOVE_ALL``; it will also be reported as ``EBUSY``.
   This can be avoided in userspace by serializing calls to the ioctl()
   and to close(), but in general it should not be a problem.

3. Finally, page removal will fail for SECS metadata pages which still
   have child pages. Child pages can be removed by executing
   ``SGX_IOC_VEPC_REMOVE_ALL`` on all ``/dev/sgx_vepc`` file descriptors
   mapped into the guest. This means that the ioctl() must be called
   twice: an initial set of calls to remove child pages and a subsequent
   set of calls to remove SECS pages. The second set of calls is only
   required for those mappings that returned a nonzero value from the
   first call. It indicates a bug in the kernel or the userspace client
   if any of the second round of ``SGX_IOC_VEPC_REMOVE_ALL`` calls has
   a return code other than 0.
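The two-round procedure in item 3 can be sketched as follows. This is an
illustration rather than a reference implementation: the request number assumes
that ``SGX_IOC_VEPC_REMOVE_ALL`` is defined as ``_IO(0xA4, 0x04)`` in the uapi
header, and ``reset_vepc`` is a made-up helper. The caller is assumed to have
already stopped all vcpus and to call this from a single thread, so the
``EBUSY`` cases in items 1 and 2 surface as an ``OSError`` instead of being
retried.

```python
import fcntl

SGX_MAGIC = 0xA4                                    # assumed from the uapi header
SGX_IOC_VEPC_REMOVE_ALL = (SGX_MAGIC << 8) | 0x04   # _IO(SGX_MAGIC, 0x04)


def _remove_all(fd):
    """Issue the ioctl on one /dev/sgx_vepc descriptor and return its result."""
    return fcntl.ioctl(fd, SGX_IOC_VEPC_REMOVE_ALL)


def reset_vepc(fds, remove_all=_remove_all):
    """Run both rounds over every descriptor; return how many needed round two."""
    # Round one: remove child pages everywhere before touching SECS pages again.
    retry = [fd for fd in fds if remove_all(fd) != 0]
    # Round two: only descriptors that reported leftover pages.
    for fd in retry:
        leftover = remove_all(fd)
        if leftover != 0:
            raise RuntimeError(
                "fd %d: %d pages left after second round" % (fd, leftover))
    return len(retry)
```

The ``remove_all`` parameter exists only so that the control flow can be
exercised without SGX hardware; a VMM would call ``reset_vepc(fds)`` with the
descriptors it has mapped into the guest.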