Commit | Line | Data |
---|---|---|
c51ff2c7 DH |
1 | Memory Protection Keys for Userspace (PKU aka PKEYs) is a feature |
2 | which is found on Intel's Skylake "Scalable Processor" Server CPUs. | |
3 | It will be available in future non-server parts. | |
4 | ||
5 | For anyone wishing to test or use this feature, it is available in | |
6 | Amazon's EC2 C5 instances and is known to work there using an Ubuntu | |
7 | 17.04 image. | |
591b1d8d DH |
8 | |
9 | Memory Protection Keys provides a mechanism for enforcing page-based | |
10 | protections, but without requiring modification of the page tables | |
11 | when an application changes protection domains. It works by | |
12 | dedicating 4 previously ignored bits in each page table entry to a | |
13 | "protection key", giving 16 possible keys. | |
14 | ||
15 | There is also a new user-accessible register (PKRU) with two separate | |
16 | bits (Access Disable and Write Disable) for each key. Being a CPU | |
17 | register, PKRU is inherently thread-local, potentially giving each | |
18 | thread a different set of protections from every other thread. | |
19 | ||
20 | There are two new instructions (RDPKRU/WRPKRU) for reading and writing | |
21 | to the new register. The feature is only available in 64-bit mode, | |
22 | even though there is theoretically space in the PAE PTEs. These | |
23 | permissions are enforced on data access only and have no effect on | |
24 | instruction fetches. | |
25 | ||
c74fe394 DH |
26 | =========================== Syscalls =========================== |
27 | ||
6679dac5 | 28 | There are 3 system calls which directly interact with pkeys: |
c74fe394 DH |
29 | |
30 | int pkey_alloc(unsigned long flags, unsigned long init_access_rights); | |
31 | int pkey_free(int pkey); | |
32 | int pkey_mprotect(unsigned long start, size_t len, | |
33 | unsigned long prot, int pkey); | |
34 | ||
35 | Before a pkey can be used, it must first be allocated with | |
36 | pkey_alloc(). An application calls the WRPKRU instruction | |
37 | directly in order to change access permissions to memory covered | |
38 | with a key. In this example WRPKRU is wrapped by a C function | |
39 | called pkey_set(). | |
40 | ||
41 | int real_prot = PROT_READ|PROT_WRITE; | |
f90e2d9a | 42 | pkey = pkey_alloc(0, PKEY_DISABLE_WRITE); |
c74fe394 DH |
43 | ptr = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_ANONYMOUS|MAP_PRIVATE, -1, 0); |
44 | ret = pkey_mprotect(ptr, PAGE_SIZE, real_prot, pkey); | |
45 | ... application runs here | |
46 | ||
47 | Now, if the application needs to update the data at 'ptr', it can | |
48 | gain access, do the update, then remove its write access: | |
49 | ||
f90e2d9a | 50 | pkey_set(pkey, 0); // clear PKEY_DISABLE_WRITE |
c74fe394 | 51 | *ptr = foo; // assign something |
f90e2d9a | 52 | pkey_set(pkey, PKEY_DISABLE_WRITE); // set PKEY_DISABLE_WRITE again |
c74fe394 DH |
53 | |
54 | When the application frees the memory, it should also free the pkey, | |
55 | since the key is no longer in use: | |
56 | ||
57 | munmap(ptr, PAGE_SIZE); | |
58 | pkey_free(pkey); | |
59 | ||
6679dac5 DH |
60 | (Note: pkey_set() is a wrapper for the RDPKRU and WRPKRU instructions. |
61 | An example implementation can be found in | |
62 | tools/testing/selftests/x86/protection_keys.c) | |
63 | ||
c74fe394 DH |
64 | =========================== Behavior =========================== |
65 | ||
66 | The kernel attempts to make protection keys consistent with the | |
67 | behavior of a plain mprotect(). For instance if you do this: | |
68 | ||
69 | mprotect(ptr, size, PROT_NONE); | |
70 | something(ptr); | |
71 | ||
72 | you can expect the same effects with protection keys when doing this: | |
73 | ||
74 | pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS | PKEY_DISABLE_WRITE); | |
75 | pkey_mprotect(ptr, size, PROT_READ|PROT_WRITE, pkey); | |
76 | something(ptr); | |
77 | ||
78 | That should be true whether something() is a direct access to 'ptr' | |
79 | like: | |
80 | ||
81 | *ptr = foo; | |
82 | ||
83 | or when the kernel does the access on the application's behalf like | |
84 | with a read(): | |
85 | ||
86 | read(fd, ptr, 1); | |
87 | ||
88 | The kernel will send a SIGSEGV in both cases, but si_code will be set | |
89 | to SEGV_PKERR when violating protection keys versus SEGV_ACCERR when | |
90 | the plain mprotect() permissions are violated. |
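
The PKRU layout described earlier (two bits per key, Access Disable and
Write Disable) can be illustrated with a minimal sketch of a pkey_set()-style
helper. This is not the selftest implementation; the names
pkru_with_rights(), pkey_set_sketch() and the SKETCH_* macros are
hypothetical and exist only for this example. It assumes the Access Disable
bit sits at bit 2*pkey and the Write Disable bit at bit 2*pkey+1, and that
the raw instruction wrappers are only executed on a CPU and kernel with
protection keys enabled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the values mirror PKEY_DISABLE_ACCESS and
 * PKEY_DISABLE_WRITE from the Linux uapi headers. */
#define SKETCH_PKEY_DISABLE_ACCESS 0x1u
#define SKETCH_PKEY_DISABLE_WRITE  0x2u
#define SKETCH_NR_PKEYS            16

/* Return the PKRU value that gives 'pkey' the rights in 'rights' while
 * leaving the two bits of every other key untouched. */
static inline uint32_t pkru_with_rights(uint32_t pkru, int pkey,
                                        uint32_t rights)
{
    unsigned int shift;

    assert(pkey >= 0 && pkey < SKETCH_NR_PKEYS);
    assert((rights & ~(SKETCH_PKEY_DISABLE_ACCESS |
                       SKETCH_PKEY_DISABLE_WRITE)) == 0);

    shift = 2u * (unsigned int)pkey;
    pkru &= ~(0x3u << shift);   /* clear the old AD/WD bits for this key */
    pkru |= rights << shift;    /* install the requested bits */
    return pkru;
}

#if defined(__x86_64__) && defined(__GNUC__)
/* RDPKRU: requires ECX = 0, returns PKRU in EAX and clears EDX. */
static inline uint32_t rdpkru(void)
{
    uint32_t eax, edx;

    __asm__ volatile(".byte 0x0f,0x01,0xee"
                     : "=a"(eax), "=d"(edx)
                     : "c"(0));
    return eax;
}

/* WRPKRU: requires ECX = 0 and EDX = 0, writes EAX to PKRU. */
static inline void wrpkru(uint32_t pkru)
{
    __asm__ volatile(".byte 0x0f,0x01,0xef"
                     :
                     : "a"(pkru), "c"(0), "d"(0)
                     : "memory");
}

/* Read-modify-write of PKRU for one key. This faults with SIGILL on
 * hardware or kernels without protection keys, so check for support
 * (for example, a successful pkey_alloc()) before calling it. */
static inline void pkey_set_sketch(int pkey, uint32_t rights)
{
    wrpkru(pkru_with_rights(rdpkru(), pkey, rights));
}
#endif
```

Keeping the bit arithmetic in a separate pure function makes it possible to
check the masking on any machine, while the RDPKRU/WRPKRU wrappers stay
confined to x86-64 builds.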