Commit | Line | Data |
---|---|---|
a4452e67 GKB |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ===================== | |
4 | Syscall User Dispatch | |
5 | ===================== | |
6 | ||
7 | Background | |
8 | ---------- | |
9 | ||
10 | Compatibility layers like Wine need a way to efficiently emulate system | |
11 | calls of only a part of their process - the part that has the | |
12 | incompatible code - while being able to execute native syscalls without | |
13 | a high performance penalty on the native part of the process. Seccomp | |
14 | falls short on this task, since it has limited support for efficiently | |
15 | filtering syscalls based on memory regions, and it doesn't support removing | |
16 | filters. Therefore a new mechanism is necessary. | |
17 | ||
18 | Syscall User Dispatch brings the filtering of the syscall dispatcher | |
19 | address back to userspace. The application is in control of a flip | |
20 | switch, indicating the current personality of the process. A | |
21 | multiple-personality application can then flip the switch without | |
22 | invoking the kernel, when crossing the compatibility layer API | |
23 | boundaries, to enable/disable the syscall redirection and execute | |
24 | syscalls directly (disabled) or send them to be emulated in userspace | |
25 | through a SIGSYS. | |
26 | ||
27 | The goal of this design is to provide very quick compatibility layer | |
28 | boundary crossings, which is achieved by not executing a syscall to change | |
29 | personality every time the compatibility layer executes. Instead, a | |
30 | userspace memory region exposed to the kernel indicates the current | |
31 | personality, and the application simply modifies that variable to | |
32 | configure the mechanism. | |
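
As a rough illustration of the paragraph above, the switch can be modeled as a per-thread byte that a wrapper toggles around calls into the non-native code. This is only a sketch: the names `sud_selector`, `sud_call_foreign` and `sud_peek` are hypothetical, the fallback constant values are assumed to match the kernel UAPI header, and the byte has no effect until it is registered with the kernel through the prctl described in the Interface section.

```c
#include <sys/prctl.h>

/* Fallbacks for older headers; values assumed to match <linux/prctl.h>. */
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#endif
#ifndef SYSCALL_DISPATCH_FILTER_BLOCK
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

/* Hypothetical per-thread switch. It is volatile because, once registered,
 * the kernel reads it on syscall entry behind the compiler's back. */
static __thread volatile char sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;

/* Run fn(arg) with redirection requested, then restore the native
 * personality. Each boundary crossing is a byte store, not a syscall. */
long sud_call_foreign(long (*fn)(void *), void *arg)
{
    long ret;

    sud_selector = SYSCALL_DISPATCH_FILTER_BLOCK;  /* entering non-native code */
    ret = fn(arg);
    sud_selector = SYSCALL_DISPATCH_FILTER_ALLOW;  /* back in native code */
    return ret;
}

/* Stand-in for non-native code: reports the selector value it runs under. */
long sud_peek(void *unused)
{
    (void)unused;
    return sud_selector;
}
```

Calling `sud_call_foreign(sud_peek, NULL)` shows that the callee observes the blocking value while the caller is back on the allowing value once the call returns.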
33 | ||
34 | There is a relatively high cost associated with handling signals on most | |
35 | architectures, like x86, but at least for Wine, syscalls issued by | |
36 | native Windows code are currently not known to be a performance problem, | |
37 | since they are quite rare, at least for modern gaming applications. | |
38 | ||
39 | Since this mechanism is designed to capture syscalls issued by | |
40 | non-native applications, it must function on syscalls whose invocation | |
41 | ABI is completely unexpected to Linux. Syscall User Dispatch therefore | |
42 | doesn't rely on any of the syscall ABI to perform the filtering. It uses | |
43 | only the syscall dispatcher address and the userspace key. | |
44 | ||
45 | As the ABI of these intercepted syscalls is unknown to Linux, these | |
46 | syscalls are not instrumentable via ptrace or the syscall tracepoints. | |
47 | ||
48 | Interface | |
49 | --------- | |
50 | ||
51 | A thread can set up this mechanism on supported kernels by executing the | |
52 | following prctl: | |
53 | ||
54 | prctl(PR_SET_SYSCALL_USER_DISPATCH, <op>, <offset>, <length>, [selector]) | |
55 | ||
56 | <op> is either PR_SYS_DISPATCH_ON or PR_SYS_DISPATCH_OFF, to enable or | |
57 | disable the mechanism globally for that thread. When | |
58 | PR_SYS_DISPATCH_OFF is used, the other fields must be zero. | |
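
The call above could be wrapped roughly as follows. This is a minimal sketch, not an official API: `sud_enable` and `sud_disable` are hypothetical helper names, and the fallback constant values are assumed to match the kernel UAPI header for toolchains that do not define them yet.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; values assumed to match <linux/prctl.h>. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_OFF
#define PR_SYS_DISPATCH_OFF 0
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_ON 1
#endif

/* Enable the mechanism for the calling thread. Returns 0 or -errno
 * (for example -EINVAL on a kernel without Syscall User Dispatch). */
int sud_enable(unsigned long offset, unsigned long length,
               volatile char *selector)
{
    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
              offset, length, (unsigned long)selector) != 0)
        return -errno;
    return 0;
}

/* Disable the mechanism for the calling thread. All other fields are
 * passed as zero, as required for PR_SYS_DISPATCH_OFF. */
int sud_disable(void)
{
    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF,
              0UL, 0UL, 0UL) != 0)
        return -errno;
    return 0;
}
```

Both helpers act only on the calling thread, so each thread that runs non-native code would need its own call and, typically, its own selector byte.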
59 | ||
60 | [<offset>, <offset>+<length>) delimits a memory region | |
61 | from which syscalls are always executed directly, regardless of the | |
62 | userspace selector. This provides a fast path for the C library, which | |
63 | includes the most common syscall dispatchers in the native code | |
64 | applications, and also provides a way for the signal handler to return | |
65 | without triggering a nested SIGSYS on (rt\_)sigreturn. Users of this | |
66 | interface should make sure that at least the signal trampoline code is | |
67 | included in this region. In addition, for architectures that implement the | |
68 | trampoline code on the vDSO, that trampoline is never intercepted. | |
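
The half-open interval rule can be modeled in a few lines. The function below is a sketch of the rule as stated in the text, not a copy of kernel code, and `sud_ip_is_exempt` is a hypothetical name; `ip` stands for the address of the instruction that issued the syscall.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if ip lies in [offset, offset + length). The unsigned subtraction
 * wraps around when ip < offset, so a single comparison checks both the
 * lower and the upper bound. */
bool sud_ip_is_exempt(uintptr_t ip, uintptr_t offset, uintptr_t length)
{
    return ip - offset < length;
}
```

Under this model a length of zero exempts no address at all, so every syscall would then be subject to the selector.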
69 | ||
70 | [selector] is a pointer to a char-sized location in the process memory | |
71 | that provides a quick way to enable or disable syscall redirection | |
72 | thread-wide, without the need to invoke the kernel directly. selector | |
36a6c843 GKB |
73 | can be set to SYSCALL_DISPATCH_FILTER_ALLOW or SYSCALL_DISPATCH_FILTER_BLOCK. |
74 | Any other value should terminate the program with a SIGSYS. | |
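
Putting the pieces of this section together, a small end-to-end sketch might look like the following. It is an illustration under assumptions rather than a reference implementation: the names `demo_selector`, `demo_caught`, `on_sigsys` and `sud_demo` are hypothetical, and the fallback constants are assumed to match the kernel UAPI header. No exempt region is registered (offset and length are zero), so the handler flips the selector back to SYSCALL_DISPATCH_FILTER_ALLOW before returning, which lets the rt_sigreturn issued on return go through without a nested SIGSYS.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; values assumed to match <linux/prctl.h>. */
#ifndef PR_SET_SYSCALL_USER_DISPATCH
#define PR_SET_SYSCALL_USER_DISPATCH 59
#endif
#ifndef PR_SYS_DISPATCH_ON
#define PR_SYS_DISPATCH_OFF 0
#define PR_SYS_DISPATCH_ON 1
#endif
#ifndef SYSCALL_DISPATCH_FILTER_ALLOW
#define SYSCALL_DISPATCH_FILTER_ALLOW 0
#define SYSCALL_DISPATCH_FILTER_BLOCK 1
#endif

static volatile char demo_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
static volatile sig_atomic_t demo_caught = -1;

/* SIGSYS handler: record the intercepted syscall number, then allow
 * syscalls again so that the return from the handler is not redirected. */
static void on_sigsys(int sig, siginfo_t *info, void *uctx)
{
    (void)sig;
    (void)uctx;
    demo_caught = info->si_syscall;
    demo_selector = SYSCALL_DISPATCH_FILTER_ALLOW;
}

/* Returns 1 if one syscall was redirected to the handler, 0 if the kernel
 * does not offer the mechanism, and -1 on any other failure. */
int sud_demo(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigsys;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSYS, &sa, NULL) != 0)
        return -1;

    if (prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_ON,
              0UL, 0UL, (unsigned long)&demo_selector) != 0)
        return 0;                       /* mechanism not available here */

    demo_selector = SYSCALL_DISPATCH_FILTER_BLOCK;
    syscall(SYS_getpid);                /* delivered to on_sigsys instead */

    /* The handler has already set the selector back to ALLOW. */
    prctl(PR_SET_SYSCALL_USER_DISPATCH, PR_SYS_DISPATCH_OFF, 0UL, 0UL, 0UL);
    return demo_caught == SYS_getpid ? 1 : -1;
}
```

A real compatibility layer would emulate the intercepted call inside the handler and place its result in the saved register context, which is architecture-specific and left out of this sketch.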
a4452e67 GKB |
75 | |
76 | Security Notes | |
77 | -------------- | |
78 | ||
79 | Syscall User Dispatch provides functionality for compatibility layers to | |
80 | quickly capture system calls issued by a non-native part of the | |
81 | application, while not impacting the Linux native regions of the | |
82 | process. It is not a mechanism for sandboxing system calls, and it | |
83 | should not be seen as a security mechanism, since it is trivial for a | |
84 | malicious application to subvert the mechanism by jumping to an allowed | |
85 | dispatcher region prior to executing the syscall, or to discover the | |
86 | address and modify the selector value. If the use case requires any | |
87 | kind of security sandboxing, Seccomp should be used instead. | |
88 | ||
89 | Any fork or exec of the existing process resets the mechanism to | |
90 | PR_SYS_DISPATCH_OFF. |