===========================================
Seccomp BPF (SECure COMPuting with filters)
===========================================

Introduction
============

A large number of system calls are exposed to every userland process
with many of them going unused for the entire lifetime of the process.
As system calls change and mature, bugs are found and eradicated. A
certain subset of userland applications benefit by having a reduced set
of available system calls. The resulting set reduces the total kernel
surface exposed to the application. System call filtering is meant for
use with those applications.

Seccomp filtering provides a means for a process to specify a filter for
incoming system calls. The filter is expressed as a Berkeley Packet
Filter (BPF) program, as with socket filters, except that the data
operated on is related to the system call being made: system call
number and the system call arguments. This allows for expressive
filtering of system calls using a filter program language with a long
history of being exposed to userland and a straightforward data set.

Additionally, BPF makes it impossible for users of seccomp to fall prey
to time-of-check-time-of-use (TOCTOU) attacks that are common in system
call interposition frameworks. BPF programs may not dereference
pointers, which constrains all filters to solely evaluating the system
call arguments directly.


What it isn't
=============

System call filtering isn't a sandbox. It provides a clearly defined
mechanism for minimizing the exposed kernel surface. It is meant to be
a tool for sandbox developers to use. Beyond that, policy for logical
behavior and information flow should be managed with a combination of
other system hardening techniques and, potentially, an LSM of your
choosing. Expressive, dynamic filters provide further options down this
path (avoiding pathological sizes or selecting which of the multiplexed
system calls in socketcall() is allowed, for instance) which could be
construed, incorrectly, as a more complete sandboxing solution.


Usage
=====

An additional seccomp mode is added and is enabled using the same
prctl(2) call as the strict seccomp. If the architecture has
``CONFIG_HAVE_ARCH_SECCOMP_FILTER``, then filters may be added as below:

``PR_SET_SECCOMP``:
    Now takes an additional argument which specifies a new filter
    using a BPF program.
    The BPF program will be executed over struct seccomp_data
    reflecting the system call number, arguments, and other
    metadata. The BPF program must then return one of the
    acceptable values to inform the kernel which action should be
    taken.

    Usage::

        prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);

    The 'prog' argument is a pointer to a struct sock_fprog which
    will contain the filter program. If the program is invalid, the
    call will return -1 and set errno to ``EINVAL``.

    If ``fork``/``clone`` and ``execve`` are allowed by ``prog``, any child
    processes will be constrained to the same filters and system
    call ABI as the parent.

    Prior to use, the task must call ``prctl(PR_SET_NO_NEW_PRIVS, 1)`` or
    run with ``CAP_SYS_ADMIN`` privileges in its namespace. If these are not
    true, ``-EACCES`` will be returned. This requirement ensures that filter
    programs cannot be applied to child processes with greater privileges
    than the task that installed them.

    Additionally, if ``prctl(2)`` is allowed by the attached filter,
    additional filters may be layered on which will increase evaluation
    time, but allow for further decreasing the attack surface during
    execution of a process.

The above call returns 0 on success and non-zero on error.

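The calling sequence described in this section can be sketched as follows.
This is a minimal illustration, not a reference implementation: the names
``allow_all``, ``make_prog`` and ``install_filter`` are hypothetical, and the
one-instruction filter simply allows every system call.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

/* Hypothetical one-instruction filter: allow every system call. */
struct sock_filter allow_all[] = {
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Wrap an instruction array in the struct sock_fprog prctl() expects. */
struct sock_fprog make_prog(struct sock_filter *insns, unsigned short len)
{
	struct sock_fprog prog = { .len = len, .filter = insns };

	return prog;
}

/* Returns 0 on success, or -1 with errno set (EINVAL, EACCES, ...). */
int install_filter(const struct sock_fprog *prog)
{
	/* Needed unless the task has CAP_SYS_ADMIN in its namespace. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) != 0)
		return -1;
	return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, prog);
}
```

A caller would build the program with ``make_prog(allow_all, 1)`` and pass its
address to ``install_filter()``. Once installed, the filter cannot be removed
for the lifetime of the task.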

Return values
=============

A seccomp filter may return any of the following values. If multiple
filters exist, the return value for the evaluation of a given system
call will always use the highest-precedence value. (For example,
``SECCOMP_RET_KILL_PROCESS`` will always take precedence.)

In precedence order, they are:

``SECCOMP_RET_KILL_PROCESS``:
    Results in the entire process exiting immediately without executing
    the system call. The exit status of the task (``status & 0x7f``)
    will be ``SIGSYS``, not ``SIGKILL``.

``SECCOMP_RET_KILL_THREAD``:
    Results in the task exiting immediately without executing the
    system call. The exit status of the task (``status & 0x7f``) will
    be ``SIGSYS``, not ``SIGKILL``.

``SECCOMP_RET_TRAP``:
    Results in the kernel sending a ``SIGSYS`` signal to the triggering
    task without executing the system call. ``siginfo->si_call_addr``
    will show the address of the system call instruction, and
    ``siginfo->si_syscall`` and ``siginfo->si_arch`` will indicate which
    syscall was attempted. The program counter will be as though
    the syscall happened (i.e. it will not point to the syscall
    instruction). The return value register will contain an
    arch-dependent value -- if resuming execution, set it to something
    sensible. (The architecture dependency is because replacing
    it with ``-ENOSYS`` could overwrite some useful information.)

    The ``SECCOMP_RET_DATA`` portion of the return value will be passed
    as ``si_errno``.

    ``SIGSYS`` triggered by seccomp will have a si_code of ``SYS_SECCOMP``.

``SECCOMP_RET_ERRNO``:
    Results in the lower 16 bits of the return value being passed
    to userland as the errno without executing the system call.

``SECCOMP_RET_USER_NOTIF``:
    Results in a ``struct seccomp_notif`` message sent on the userspace
    notification fd, if it is attached, or ``-ENOSYS`` if it is not. See
    below for a discussion of how to handle user notifications.

``SECCOMP_RET_TRACE``:
    When returned, this value will cause the kernel to attempt to
    notify a ``ptrace()``-based tracer prior to executing the system
    call. If there is no tracer present, ``-ENOSYS`` is returned to
    userland and the system call is not executed.

    A tracer will be notified if it requests ``PTRACE_O_TRACESECCOMP``
    using ``ptrace(PTRACE_SETOPTIONS)``. The tracer will be notified
    of a ``PTRACE_EVENT_SECCOMP`` and the ``SECCOMP_RET_DATA`` portion of
    the BPF program return value will be available to the tracer
    via ``PTRACE_GETEVENTMSG``.

    The tracer can skip the system call by changing the syscall number
    to -1. Alternatively, the tracer can change the system call
    requested by changing the system call to a valid syscall number. If
    the tracer asks to skip the system call, then the system call will
    appear to return the value that the tracer puts in the return value
    register.

    The seccomp check will not be run again after the tracer is
    notified. (This means that seccomp-based sandboxes MUST NOT
    allow use of ptrace, even of other sandboxed processes, without
    extreme care; ptracers can use this mechanism to escape.)

``SECCOMP_RET_LOG``:
    Results in the system call being executed after it is logged. This
    should be used by application developers to learn which syscalls their
    application needs without having to iterate through multiple test and
    development cycles to build the list.

    This action will only be logged if "log" is present in the
    actions_logged sysctl string.

``SECCOMP_RET_ALLOW``:
    Results in the system call being executed.

Precedence is only determined using the ``SECCOMP_RET_ACTION`` mask. When
multiple filters return values of the same precedence, only the
``SECCOMP_RET_DATA`` from the most recently installed filter will be
returned.

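The precedence rule can be modelled in userspace as shown below. This is a
sketch for reasoning about stacked filters, not kernel code. It assumes the
installed headers provide ``SECCOMP_RET_ACTION_FULL`` and that ``rets[0]``
holds the result of the most recently installed filter. The function name
``combine_filter_results`` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/seccomp.h>

/*
 * Pick the winning result among the values returned by stacked filters.
 * The action bits are compared as a signed 32-bit number, so
 * SECCOMP_RET_KILL_PROCESS (0x80000000) sorts below every other action
 * and therefore always wins.
 */
uint32_t combine_filter_results(const uint32_t *rets, size_t n)
{
	uint32_t best = SECCOMP_RET_ALLOW;
	size_t i;

	for (i = 0; i < n; i++) {
		int32_t cur = (int32_t)(rets[i] & SECCOMP_RET_ACTION_FULL);
		int32_t old = (int32_t)(best & SECCOMP_RET_ACTION_FULL);

		/* Strict '<' keeps the newest filter's data on a tie. */
		if (cur < old)
			best = rets[i];
	}
	return best;
}
```

For example, if one filter returns ``SECCOMP_RET_ALLOW`` and another returns
``SECCOMP_RET_ERRNO`` with data 1, the combined result is the
``SECCOMP_RET_ERRNO`` value, regardless of installation order.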

Pitfalls
========

The biggest pitfall to avoid during use is filtering on system call
number without checking the architecture value. Why? On any
architecture that supports multiple system call invocation conventions,
the system call numbers may vary based on the specific invocation. If
the numbers in the different calling conventions overlap, then checks in
the filters may be abused. Always check the arch value!

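A filter that follows this advice loads ``seccomp_data.arch`` before it looks
at the system call number. The sketch below assumes, purely for illustration,
a program that targets x86-64 only and needs just ``write`` and
``exit_group``; the array name and the chosen policy are hypothetical.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/syscall.h>

struct sock_filter arch_checked_filter[] = {
	/* 0: load the architecture token before anything else. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
	/* 1: expected architecture? skip the kill, else fall through. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
	/* 2: unexpected calling convention: syscall numbers are untrusted. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
	/* 3: only now is the system call number meaningful. */
	BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
	/* 4-5: allow-list; a match jumps to the ALLOW at index 7. */
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write, 2, 0),
	BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group, 1, 0),
	/* 6: everything else fails with EPERM. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
	/* 7: allowed system calls end up here. */
	BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};
```

An allow-list is used here because a deny-list keyed on the system call number
alone is easier to bypass when one architecture token covers more than one
numbering scheme.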

Example
=======

The ``samples/seccomp/`` directory contains both an x86-specific example
and a more generic example of a higher level macro interface for BPF
program generation.


Userspace Notification
======================

The ``SECCOMP_RET_USER_NOTIF`` return code lets seccomp filters pass a
particular syscall to userspace to be handled. This may be useful for
applications like container managers, which wish to intercept particular
syscalls (``mount()``, ``finit_module()``, etc.) and change their behavior.

To acquire a notification FD, use the ``SECCOMP_FILTER_FLAG_NEW_LISTENER``
argument to the ``seccomp()`` syscall:

.. code-block:: c

    fd = seccomp(SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_NEW_LISTENER, &prog);

which (on success) will return a listener fd for the filter, which can then be
passed around via ``SCM_RIGHTS`` or similar. Note that filter fds correspond to
a particular filter, and not a particular task. So if this task then forks,
notifications from both tasks will appear on the same filter fd. Reads and
writes to/from a filter fd are also synchronized, so a filter fd can safely
have many readers.

The interface for a seccomp notification fd consists of the following
structures:

.. code-block:: c

    struct seccomp_notif_sizes {
        __u16 seccomp_notif;
        __u16 seccomp_notif_resp;
        __u16 seccomp_data;
    };

    struct seccomp_notif {
        __u64 id;
        __u32 pid;
        __u32 flags;
        struct seccomp_data data;
    };

    struct seccomp_notif_resp {
        __u64 id;
        __s64 val;
        __s32 error;
        __u32 flags;
    };

The ``struct seccomp_notif_sizes`` structure can be used to determine the size
of the various structures used in seccomp notifications. The size of
``struct seccomp_data`` may change in the future, so code should use:

.. code-block:: c

    struct seccomp_notif_sizes sizes;
    seccomp(SECCOMP_GET_NOTIF_SIZES, 0, &sizes);

to determine the size of the various structures to allocate. See
``samples/seccomp/user-trap.c`` for an example.


Users can call ``ioctl(SECCOMP_IOCTL_NOTIF_RECV)`` on a seccomp notification
fd (optionally waiting with ``poll()`` first) to receive a
``struct seccomp_notif``, which contains four members: a unique-per-filter
``id``, the ``pid`` of the task which triggered this request (which may be 0
if the task is in a pid ns not visible from the listener's pid namespace), a
``flags`` field, and the ``data`` passed to seccomp. The structure should be
zeroed out prior to calling the ioctl.

Userspace can then make a decision based on this information about what to do,
and ``ioctl(SECCOMP_IOCTL_NOTIF_SEND)`` a response, indicating what should be
returned to userspace. The ``id`` member of ``struct seccomp_notif_resp`` should
be the same ``id`` as in ``struct seccomp_notif``.

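The receive/respond cycle can be sketched as below. This is a minimal
illustration that rejects a single notification with ``EPERM``. The function
names are hypothetical, and for brevity the sketch uses fixed-size structures
rather than sizes obtained from ``SECCOMP_GET_NOTIF_SIZES``.

```c
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>
#include <sys/ioctl.h>

/* Fill a response that makes the intercepted syscall fail with err. */
void fill_error_response(const struct seccomp_notif *req,
			 struct seccomp_notif_resp *resp, int err)
{
	memset(resp, 0, sizeof(*resp));
	resp->id = req->id;	/* must match the notification's id */
	resp->error = -err;	/* negated errno value, e.g. -EPERM */
}

/*
 * Receive one notification from listener_fd and reject it with EPERM.
 * Returns 0 on success and -1 if either ioctl fails.
 */
int deny_one_notification(int listener_fd)
{
	struct seccomp_notif req;
	struct seccomp_notif_resp resp;

	/* The structure must be zeroed before SECCOMP_IOCTL_NOTIF_RECV. */
	memset(&req, 0, sizeof(req));
	if (ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_RECV, &req) != 0)
		return -1;

	fill_error_response(&req, &resp, EPERM);
	if (ioctl(listener_fd, SECCOMP_IOCTL_NOTIF_SEND, &resp) != 0)
		return -1;
	return 0;
}
```

A real supervisor would inspect ``req.data.nr`` and ``req.data.args`` before
deciding, and would typically run this in a loop.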

Userspace can also add file descriptors to the notifying process via
``ioctl(SECCOMP_IOCTL_NOTIF_ADDFD)``. The ``id`` member of
``struct seccomp_notif_addfd`` should be the same ``id`` as in
``struct seccomp_notif``. The ``newfd_flags`` member may be used to set flags
like O_CLOEXEC on the file descriptor in the notifying process. If the
supervisor wants to inject the file descriptor with a specific number, it can
set the ``SECCOMP_ADDFD_FLAG_SETFD`` flag and set the ``newfd`` member to
the specific number to use. If that file descriptor is already open in the
notifying process it will be replaced. The supervisor can also add an FD and
respond atomically by using the ``SECCOMP_ADDFD_FLAG_SEND`` flag, and the return
value will be the injected file descriptor number.


The notifying process can be preempted, resulting in the notification being
aborted. This can be problematic when trying to take actions on behalf of the
notifying process that are long-running and typically retryable (for example,
mounting a filesystem). To avoid this, the
``SECCOMP_FILTER_FLAG_WAIT_KILLABLE_RECV`` flag can be set at filter
installation time. This flag makes it such that when a user notification is
received by the supervisor, the notifying process will ignore non-fatal
signals until the response is sent. Signals that are sent prior to the
notification being received by userspace are handled normally.

It is worth noting that ``struct seccomp_data`` contains the values of register
arguments to the syscall, but does not contain pointers to memory. The task's
memory is accessible to suitably privileged tracers via ``ptrace()`` or
``/proc/pid/mem``. However, care should be taken to avoid the TOCTOU mentioned
above in this document: all arguments being read from the tracee's memory
should be read into the tracer's memory before any policy decisions are made.
This allows for an atomic decision on syscall arguments.


Sysctls
=======

Seccomp's sysctl files can be found in the ``/proc/sys/kernel/seccomp/``
directory. Here's a description of each file in that directory:

``actions_avail``:
    A read-only ordered list of seccomp return values (refer to the
    ``SECCOMP_RET_*`` macros above) in string form. The ordering, from
    left-to-right, is the least permissive return value to the most
    permissive return value.

    The list represents the set of seccomp return values supported
    by the kernel. A userspace program may use this list to
    determine if the actions found in the ``seccomp.h``, when the
    program was built, differ from the set of actions actually
    supported in the current running kernel.

``actions_logged``:
    A read-write ordered list of seccomp return values (refer to the
    ``SECCOMP_RET_*`` macros above) that are allowed to be logged. Writes
    to the file do not need to be in ordered form but reads from the file
    will be ordered in the same way as the actions_avail sysctl.

    The ``allow`` string is not accepted in the ``actions_logged`` sysctl
    as it is not possible to log ``SECCOMP_RET_ALLOW`` actions. Attempting
    to write ``allow`` to the sysctl will result in an EINVAL being
    returned.

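A program can consult ``actions_avail`` at run time along the lines of the
sketch below. The helper names are hypothetical, and the sketch assumes the
file holds a single line of space-separated action names.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if action appears as a whole word in a space-separated list. */
int action_in_list(const char *list, const char *action)
{
	size_t len = strlen(action);
	const char *p = list;

	if (len == 0)
		return 0;
	while ((p = strstr(p, action)) != NULL) {
		int starts = (p == list) || p[-1] == ' ';
		int ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return 1;
		p += len;	/* partial match, keep searching */
	}
	return 0;
}

/* Returns 1 if listed, 0 if not, -1 if the sysctl file is unreadable. */
int kernel_supports_action(const char *action)
{
	char buf[256];
	FILE *f = fopen("/proc/sys/kernel/seccomp/actions_avail", "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return action_in_list(buf, action);
}
```

For example, ``kernel_supports_action("log")`` tells a program built against a
newer ``seccomp.h`` whether the running kernel lists the ``log`` action.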

Adding architecture support
===========================

See ``arch/Kconfig`` for the authoritative requirements. In general, if an
architecture supports both ptrace_event and seccomp, it will be able to
support seccomp filter with minor fixup: ``SIGSYS`` support and seccomp return
value checking. Then it must just add ``CONFIG_HAVE_ARCH_SECCOMP_FILTER``
to its arch-specific Kconfig.


Caveats
=======

The vDSO can cause some system calls to run entirely in userspace,
leading to surprises when you run programs on different machines that
fall back to real syscalls. To minimize these surprises on x86, make
sure you test with
``/sys/devices/system/clocksource/clocksource0/current_clocksource`` set to
something like ``acpi_pm``.

On x86-64, vsyscall emulation is enabled by default. (vsyscalls are
legacy variants on vDSO calls.) Currently, emulated vsyscalls will
honor seccomp, with a few oddities:

- A return value of ``SECCOMP_RET_TRAP`` will set a ``si_call_addr`` pointing to
  the vsyscall entry for the given call and not the address after the
  'syscall' instruction. Any code which wants to restart the call
  should be aware that (a) a ret instruction has been emulated and (b)
  trying to resume the syscall will again trigger the standard vsyscall
  emulation security checks, making resuming the syscall mostly
  pointless.

- A return value of ``SECCOMP_RET_TRACE`` will signal the tracer as usual,
  but the syscall may not be changed to another system call using the
  orig_rax register. It may only be changed to -1 in order to skip the
  currently emulated call. Any other change MAY terminate the process.
  The rip value seen by the tracer will be the syscall entry address;
  this is different from normal behavior. The tracer MUST NOT modify
  rip or rsp. (Do not rely on other changes terminating the process.
  They might work. For example, on some kernels, choosing a syscall
  that only exists in future kernels will be correctly emulated by
  returning ``-ENOSYS``.)

To detect this quirky behavior, check for
``(addr & ~0x0C00) == 0xFFFFFFFFFF600000``. (For ``SECCOMP_RET_TRACE``, use
rip. For ``SECCOMP_RET_TRAP``, use ``siginfo->si_call_addr``.) Do not check
any other condition: future kernels may improve vsyscall emulation and current
kernels in vsyscall=native mode will behave differently, but the
instructions at ``0xF...F600{0,4,8,C}00`` will not be system calls in these
cases.

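Written as C, the detection check needs explicit parentheses and a 64-bit
unsigned operand. The helper below is a small sketch of that test; the name
``is_vsyscall_entry`` is hypothetical.

```c
#include <stdint.h>

/*
 * Returns 1 if addr is one of the four legacy vsyscall entry points
 * (0xFFFFFFFFFF600000 plus 0x000, 0x400, 0x800 or 0xC00), else 0.
 */
int is_vsyscall_entry(uint64_t addr)
{
	/* Parenthesize the mask: in C, == binds more tightly than &. */
	return (addr & ~UINT64_C(0x0C00)) == UINT64_C(0xFFFFFFFFFF600000);
}
```

A tracer would pass the rip it observes, and a ``SIGSYS`` handler would pass
``siginfo->si_call_addr`` converted to an integer.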

Note that modern systems are unlikely to use vsyscalls at all -- they
are a legacy feature and they are considerably slower than standard
syscalls. New code will use the vDSO, and vDSO-issued system calls
are indistinguishable from normal system calls.