Commit | Line | Data |
---|---|---|
f504d47b JC |
1 | unshare system call |
2 | =================== | |
0d4c3e7a | 3 | |
f504d47b | 4 | This document describes the new system call, unshare(). The document |
0d4c3e7a JD |
5 | provides an overview of the feature, why it is needed, how it can |
6 | be used, its interface specification, design, implementation and | |
7 | how it can be tested. | |
8 | ||
f504d47b JC |
9 | Change Log |
10 | ---------- | |
0d4c3e7a JD |
11 | version 0.1 Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006 |
12 | ||
f504d47b JC |
13 | Contents |
14 | -------- | |
0d4c3e7a JD |
15 | 1) Overview |
16 | 2) Benefits | |
17 | 3) Cost | |
18 | 4) Requirements | |
19 | 5) Functional Specification | |
20 | 6) High Level Design | |
21 | 7) Low Level Design | |
22 | 8) Test Specification | |
23 | 9) Future Work | |
24 | ||
25 | 1) Overview | |
26 | ----------- | |
f504d47b | 27 | |
0d4c3e7a JD |
28 | Most legacy operating system kernels support an abstraction of threads |
29 | as multiple execution contexts within a process. These kernels provide | |
30 | special resources and mechanisms to maintain these "threads". The Linux | |
31 | kernel, in a clever and simple manner, does not make a distinction | |
32 | between processes and "threads". The kernel allows processes to share | |
33 | resources and thus they can achieve legacy "threads" behavior without | |
34 | requiring additional data structures and mechanisms in the kernel. The | |
35 | power of implementing threads in this manner comes not only from | |
36 | its simplicity but also from allowing application programmers to work | |
37 | outside the confinement of all-or-nothing shared resources of legacy | |
38 | threads. On Linux, at the time of thread creation using the clone system | |
39 | call, applications can selectively choose which resources to share | |
40 | between threads. | |
41 | ||
f504d47b | 42 | The unshare() system call adds a primitive to the Linux thread model that |
0d4c3e7a | 43 | allows threads to selectively 'unshare' any resources that were being |
f504d47b | 44 | shared at the time of their creation. unshare() was conceptualized by |
0d4c3e7a | 45 | Al Viro in August of 2000, on the Linux-Kernel mailing list, as part |
f504d47b | 46 | of the discussion on POSIX threads on Linux. unshare() augments the |
0d4c3e7a | 47 | usefulness of Linux threads for applications that would like to control |
f504d47b | 48 | shared resources without creating a new process. unshare() is a natural |
0d4c3e7a JD |
49 | addition to the set of available primitives on Linux that implement |
50 | the concept of process/thread as a virtual machine. | |
51 | ||
52 | 2) Benefits | |
53 | ----------- | |
f504d47b JC |
54 | |
55 | unshare() would be useful to large application frameworks such as PAM | |
0d4c3e7a JD |
56 | where creating a new process to control sharing/unsharing of process |
57 | resources is not possible. Since namespaces are shared by default | |
f504d47b | 58 | when creating a new process using fork or clone, unshare() can benefit |
0d4c3e7a JD |
59 | even non-threaded applications if they have a need to disassociate |
60 | from default shared namespace. The following lists two use-cases | |
f504d47b | 61 | where unshare() can be used. |
0d4c3e7a JD |
62 | |
63 | 2.1 Per-security context namespaces | |
f504d47b JC |
64 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
65 | ||
66 | unshare() can be used to implement polyinstantiated directories using | |
0d4c3e7a JD |
67 | the kernel's per-process namespace mechanism. Polyinstantiated directories, |
68 | such as per-user and/or per-security context instance of /tmp, /var/tmp or | |
69 | per-security context instance of a user's home directory, isolate user | |
f504d47b | 70 | processes when working with these directories. Using unshare(), a PAM |
0d4c3e7a JD |
71 | module can easily set up a private namespace for a user at login. |
72 | Polyinstantiated directories are required for Common Criteria certification | |
73 | with Labeled System Protection Profile; however, with the availability | |
74 | of the shared-tree feature in the Linux kernel, even regular Linux systems | |
75 | can benefit from setting up private namespaces at login and | |
76 | polyinstantiating /tmp, /var/tmp and other directories deemed | |
77 | appropriate by system administrators. | |
78 | ||
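As a rough illustration of the login-time setup described above, the sketch below gives the caller a private namespace and bind-mounts a per-user directory over /tmp. The helper name `setup_private_tmp` and the instance directory are hypothetical and not taken from any PAM module; the step that marks the mount tree private is an assumption for systems where / is a shared mount. The caller needs CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: give the calling process a private namespace and
 * bind-mount instance_dir over /tmp. Returns 0 on success and -1 on
 * failure, with errno left set by the failing call.
 */
int setup_private_tmp(const char *instance_dir)
{
    /* Detach from the namespace inherited at fork(); needs CAP_SYS_ADMIN. */
    if (unshare(CLONE_NEWNS) == -1) {
        perror("unshare(CLONE_NEWNS)");
        return -1;
    }

    /*
     * Assumption: where / is a shared mount, mark the tree private so the
     * bind mount below does not propagate back to other namespaces.
     */
    if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1) {
        perror("mount(MS_PRIVATE)");
        return -1;
    }

    /* Only this process and its descendants see instance_dir as /tmp. */
    if (mount(instance_dir, "/tmp", NULL, MS_BIND, NULL) == -1) {
        perror("mount(MS_BIND)");
        return -1;
    }
    return 0;
}
```

A login module might call `setup_private_tmp()` with a directory it created for the user's security context before starting the session; how that directory is chosen is outside the scope of this sketch.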
79 | 2.2 Unsharing of virtual memory and/or open files | |
f504d47b JC |
80 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
81 | ||
0d4c3e7a JD |
82 | Consider a client/server application where the server is processing |
83 | client requests by creating processes that share resources such as | |
f504d47b | 84 | virtual memory and open files. Without unshare(), the server has to |
0d4c3e7a | 85 | decide what needs to be shared at the time of creating the process |
f504d47b | 86 | which services the request. unshare() gives the server the ability to |
0d4c3e7a JD |
87 | disassociate parts of the context during the servicing of the |
88 | request. For large and complex middleware application frameworks, this | |
f504d47b | 89 | ability to unshare() after the process was created can be very |
0d4c3e7a JD |
90 | useful. |
91 | ||
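The scenario above can be sketched with a worker that starts out sharing the server's file descriptor table and detaches from it partway through a request. This is only an illustration under stated assumptions: the names `worker` and `server_fd_survives_worker` are hypothetical, /dev/null stands in for a real client connection, and the worker is created with the glibc clone() wrapper.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define WORKER_STACK_SIZE (256 * 1024)

/* Worker created with CLONE_FILES: it starts out sharing the table. */
static int worker(void *arg)
{
    int fd = *(int *)arg;

    /* Partway through the "request", switch to a private copy. */
    if (unshare(CLONE_FILES) == -1)
        return 2;
    close(fd);      /* closes the descriptor in the private table only */
    return 0;
}

/*
 * Returns 1 if the server's descriptor is still open after the worker
 * closed its copy, 0 if it was closed, and -1 if the demonstration could
 * not run (for example because unshare() is not permitted).
 */
int server_fd_survives_worker(void)
{
    int fd, status, survived;
    char *stack;
    pid_t pid;

    fd = open("/dev/null", O_RDONLY);
    if (fd == -1)
        return -1;
    stack = malloc(WORKER_STACK_SIZE);
    if (stack == NULL) {
        close(fd);
        return -1;
    }

    /* The stack grows down on common architectures, so pass its top. */
    pid = clone(worker, stack + WORKER_STACK_SIZE,
                CLONE_FILES | SIGCHLD, &fd);
    if (pid == -1 || waitpid(pid, &status, 0) == -1 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        free(stack);
        close(fd);
        return -1;
    }
    free(stack);

    /* F_GETFD fails with EBADF if the descriptor is no longer open. */
    survived = (fcntl(fd, F_GETFD) != -1);
    if (survived)
        close(fd);
    return survived;
}
```

Without the unshare(CLONE_FILES) call, the worker's close() would act on the shared table and the server would lose the descriptor as well.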
92 | 3) Cost | |
93 | ------- | |
f504d47b JC |
94 | |
95 | In order to not duplicate code and to handle the fact that unshare() | |
0d4c3e7a | 96 | works on an active task (as opposed to clone/fork working on a newly |
f504d47b | 97 | allocated inactive task) unshare() had to make minor reorganizational |
0d4c3e7a JD |
98 | changes to the copy_* functions used by the clone/fork system calls. |
99 | There is a cost associated with altering existing, well tested and | |
100 | stable code to implement a new feature that may not get exercised | |
101 | extensively in the beginning. However, with proper design and code | |
f504d47b | 102 | review of the changes and creation of an unshare() test for the LTP, |
0d4c3e7a JD |
103 | the benefits of this new feature can exceed its cost. |
104 | ||
105 | 4) Requirements | |
106 | --------------- | |
f504d47b JC |
107 | |
108 | unshare() reverses sharing that was done using the clone(2) system call, | |
109 | so unshare() should have an interface similar to clone(2). That is, | |
5e33994d | 110 | since flags in clone(int flags, void \*stack) specify what should |
0d4c3e7a JD |
111 | be shared, similar flags in unshare(int flags) should specify |
112 | what should be unshared. Unfortunately, this may appear to invert | |
113 | the meaning of the flags from the way they are used in clone(2). | |
114 | However, there was no easy solution that was less confusing and that | |
115 | allowed incremental context unsharing in the future without an ABI change. | |
116 | ||
f504d47b | 117 | The unshare() interface should accommodate possible future addition of |
0d4c3e7a | 118 | new context flags without requiring a rebuild of old applications. |
f504d47b | 119 | If and when new context flags are added, the unshare() design should allow |
0d4c3e7a JD |
120 | incremental unsharing of those resources on an as-needed basis. |
121 | ||
122 | 5) Functional Specification | |
123 | --------------------------- | |
f504d47b | 124 | |
0d4c3e7a JD |
125 | NAME |
126 | unshare - disassociate parts of the process execution context | |
127 | ||
128 | SYNOPSIS | |
129 | #include <sched.h> | |
130 | ||
131 | int unshare(int flags); | |
132 | ||
133 | DESCRIPTION | |
f504d47b | 134 | unshare() allows a process to disassociate parts of its execution |
0d4c3e7a JD |
135 | context that are currently being shared with other processes. Part |
136 | of the execution context, such as the namespace, is shared by default | |
137 | when a new process is created using fork(2), while other parts, | |
138 | such as the virtual memory, open file descriptors, etc, may be | |
139 | shared by explicit request to share them when creating a process | |
140 | using clone(2). | |
141 | ||
f504d47b | 142 | The main use of unshare() is to allow a process to control its |
0d4c3e7a JD |
143 | shared execution context without creating a new process. |
144 | ||
145 | The flags argument specifies one of the following constants, or | |
146 | the bitwise OR of several of them. | |
147 | ||
148 | CLONE_FS | |
149 | If CLONE_FS is set, file system information of the caller | |
150 | is disassociated from the shared file system information. | |
151 | ||
152 | CLONE_FILES | |
153 | If CLONE_FILES is set, the file descriptor table of the | |
154 | caller is disassociated from the shared file descriptor | |
155 | table. | |
156 | ||
157 | CLONE_NEWNS | |
158 | If CLONE_NEWNS is set, the namespace of the caller is | |
159 | disassociated from the shared namespace. | |
160 | ||
161 | CLONE_VM | |
162 | If CLONE_VM is set, the virtual memory of the caller is | |
163 | disassociated from the shared virtual memory. | |
164 | ||
165 | RETURN VALUE | |
166 | On success, zero is returned. On failure, -1 is returned and errno is set to indicate the error. |
167 | ||
168 | ERRORS | |
169 | EPERM CLONE_NEWNS was specified by a non-root process (process | |
170 | without CAP_SYS_ADMIN). | |
171 | ||
172 | ENOMEM Cannot allocate sufficient memory to copy parts of caller's | |
173 | context that need to be unshared. | |
174 | ||
175 | EINVAL Invalid flag was specified as an argument. | |
176 | ||
177 | CONFORMING TO | |
178 | The unshare() call is Linux-specific and should not be used | |
179 | in programs intended to be portable. | |
180 | ||
181 | SEE ALSO | |
182 | clone(2), fork(2) | |
183 | ||
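A minimal usage sketch for the interface specified above follows. The wrapper `unshare_or_errno` and the text mapping `unshare_error_text` are hypothetical helpers introduced only for illustration; the error descriptions paraphrase the ERRORS list.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>

/*
 * Hypothetical wrapper: returns 0 on success, otherwise the errno value
 * left by unshare().
 */
int unshare_or_errno(int flags)
{
    if (unshare(flags) == -1)
        return errno;
    return 0;
}

/* Map the documented error values to short descriptions. */
const char *unshare_error_text(int err)
{
    switch (err) {
    case 0:
        return "success";
    case EPERM:
        return "CLONE_NEWNS requested without CAP_SYS_ADMIN";
    case ENOMEM:
        return "not enough memory to copy the shared context";
    case EINVAL:
        return "invalid flag";
    default:
        return "unexpected error";
    }
}
```

A caller that wants private file system information and a private descriptor table could call `unshare_or_errno(CLONE_FS | CLONE_FILES)` and report `unshare_error_text()` of the result. Sandboxed environments may refuse the call for reasons outside this specification, which is why the mapping has a default case.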
184 | 6) High Level Design | |
185 | -------------------- | |
f504d47b JC |
186 | |
187 | Depending on the flags argument, the unshare() system call allocates | |
0d4c3e7a JD |
188 | appropriate process context structures, populates them with values from |
189 | the current shared version, associates newly duplicated structures | |
190 | with the current task structure and releases corresponding shared | |
191 | versions. Helper functions of clone (copy_*) could not be used | |
f504d47b JC |
192 | directly by unshare() for the following two reasons. |
193 | ||
0d4c3e7a | 194 | 1) clone operates on a newly allocated not-yet-active task |
f504d47b JC |
195 | structure, whereas unshare() operates on the current active |
196 | task. Therefore, unshare() has to take the appropriate task_lock() | |
0d4c3e7a | 197 | before associating newly duplicated context structures. |
f504d47b JC |
198 | |
199 | 2) unshare() has to allocate and duplicate all context structures | |
0d4c3e7a JD |
200 | that are being unshared, before associating them with the |
201 | current task and releasing older shared structures. Failure | |
202 | to do so will create race conditions and/or oops when trying | |
203 | to back out due to an error. Consider the case of unsharing | |
204 | both virtual memory and namespace. After successfully unsharing | |
205 | vm, if the system call encounters an error while allocating | |
206 | new namespace structure, the error return code will have to | |
207 | reverse the unsharing of vm. As part of the reversal the | |
208 | system call will have to go back to older, shared, vm | |
209 | structure, which may not exist anymore. | |
210 | ||
211 | Therefore code from copy_* functions that allocated and duplicated | |
212 | current context structure was moved into new dup_* functions. Now, | |
213 | copy_* functions call dup_* functions to allocate and duplicate | |
214 | appropriate context structures and then associate them with the | |
f504d47b | 215 | task structure that is being constructed. The unshare() system call, on |
0d4c3e7a | 216 | the other hand, performs the following: |
f504d47b | 217 | |
0d4c3e7a | 218 | 1) Check flags to force missing, but implied, flags |
f504d47b JC |
219 | |
220 | 2) For each context structure, call the corresponding unshare() | |
0d4c3e7a JD |
221 | helper function to allocate and duplicate a new context |
222 | structure, if the appropriate bit is set in the flags argument. | |
f504d47b | 223 | |
0d4c3e7a JD |
224 | 3) If there is no error in allocation and duplication and there |
225 | are new context structures then lock the current task structure, | |
226 | associate new context structures with the current task structure, | |
227 | and release the lock on the current task structure. | |
f504d47b | 228 | |
0d4c3e7a JD |
229 | 4) Appropriately release older, shared, context structures. |
230 | ||
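The allocate-everything-first ordering in the steps above can be modelled in user space. The sketch below is not kernel code: `struct ctx`, `struct task`, the `UNSHARE_*` bits and the `fail_files` switch (which simulates an allocation failure) are all hypothetical stand-ins, and the point where the kernel would hold task_lock() is only marked by a comment.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for reference-counted context structures. */
struct ctx {
    int refcount;
    int value;
};

struct task {
    struct ctx *fs;
    struct ctx *files;
};

enum { UNSHARE_FS = 1, UNSHARE_FILES = 2 };

/* Allocate a private copy holding a single reference. */
static struct ctx *dup_ctx(const struct ctx *old, int simulate_failure)
{
    struct ctx *copy;

    if (simulate_failure)
        return NULL;
    copy = malloc(sizeof(*copy));
    if (copy != NULL) {
        copy->refcount = 1;
        copy->value = old->value;
    }
    return copy;
}

/* Drop one reference; free the structure when the last one goes away. */
static void put_ctx(struct ctx *c)
{
    if (c != NULL && --c->refcount == 0)
        free(c);
}

int model_unshare(struct task *t, int flags, int fail_files)
{
    struct ctx *new_fs = NULL, *new_files = NULL;
    struct ctx *old_fs = NULL, *old_files = NULL;

    /* Step 2: allocate and duplicate everything; the task is untouched. */
    if (flags & UNSHARE_FS) {
        new_fs = dup_ctx(t->fs, 0);
        if (new_fs == NULL)
            return -ENOMEM;
    }
    if (flags & UNSHARE_FILES) {
        new_files = dup_ctx(t->files, fail_files);
        if (new_files == NULL) {
            put_ctx(new_fs);    /* backing out frees private copies only */
            return -ENOMEM;
        }
    }

    /* Step 3: commit. The kernel would hold task_lock() around this. */
    if (new_fs != NULL) {
        old_fs = t->fs;
        t->fs = new_fs;
    }
    if (new_files != NULL) {
        old_files = t->files;
        t->files = new_files;
    }

    /* Step 4: release the older, shared structures. */
    put_ctx(old_fs);
    put_ctx(old_files);
    return 0;
}
```

Because nothing is attached to the task until every duplication has succeeded, the error path never has to return to a shared structure that may already be gone.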
231 | 7) Low Level Design | |
232 | ------------------- | |
f504d47b JC |
233 | |
234 | Implementation of unshare() can be grouped into the following four | |
0d4c3e7a | 235 | items: |
f504d47b | 236 | |
0d4c3e7a | 237 | a) Reorganization of existing copy_* functions |
f504d47b JC |
238 | |
239 | b) unshare() system call service function | |
240 | ||
241 | c) unshare() helper functions for each different process context | |
242 | ||
0d4c3e7a JD |
243 | d) Registration of system call number for different architectures |
244 | ||
f504d47b JC |
245 | 7.1) Reorganization of copy_* functions |
246 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
247 | ||
248 | Each copy function such as copy_mm, copy_namespace, copy_files, | |
249 | etc, had roughly two components. The first component allocated | |
250 | and duplicated the appropriate structure and the second component | |
251 | linked it to the task structure passed in as an argument to the copy | |
252 | function. The first component was split into its own function. | |
253 | These dup_* functions allocated and duplicated the appropriate | |
254 | context structure. The reorganized copy_* functions invoked | |
255 | their corresponding dup_* functions and then linked the newly | |
256 | duplicated structures to the task structure with which the | |
257 | copy function was called. | |
258 | ||
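The split described in this section might look like the following in miniature. The structure and function names (`fs_ctx`, `dup_fs_ctx`, `copy_fs_ctx`) are hypothetical and heavily simplified; the kernel's real structures and signatures differ.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins; the kernel's structures are far richer. */
struct fs_ctx {
    int umask;
};

struct task {
    struct fs_ctx *fs;
};

/* First component: allocate a new structure and duplicate the old one. */
struct fs_ctx *dup_fs_ctx(const struct fs_ctx *old)
{
    struct fs_ctx *copy = malloc(sizeof(*copy));

    if (copy != NULL)
        *copy = *old;
    return copy;
}

/* Second component: link the duplicate to the task passed in. */
int copy_fs_ctx(struct task *child, const struct fs_ctx *parent_fs)
{
    struct fs_ctx *copy = dup_fs_ctx(parent_fs);

    if (copy == NULL)
        return -ENOMEM;
    child->fs = copy;
    return 0;
}
```

With the duplication step isolated in `dup_fs_ctx()`, a caller that operates on the current task can reuse it without going through the linking step meant for a task under construction.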
259 | 7.2) unshare() system call service function | |
260 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
261 | ||
0d4c3e7a JD |
262 | * Check flags |
263 | Force implied flags. If CLONE_THREAD is set, force CLONE_VM. | |
264 | If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is | |
265 | set and signals are also being shared, force CLONE_THREAD. If | |
266 | CLONE_NEWNS is set, force CLONE_FS. | |
f504d47b | 267 | |
0d4c3e7a JD |
268 | * For each context flag, invoke the corresponding unshare_* |
269 | helper routine with the flags passed into the system call and a | |
270 | reference to a pointer to the new unshared structure. | |
f504d47b | 271 | |
0d4c3e7a JD |
272 | * If any new structures are created by unshare_* helper |
273 | functions, take the task_lock() on the current task, | |
274 | modify appropriate context pointers, and release the | |
275 | task lock. | |
f504d47b | 276 | |
0d4c3e7a JD |
277 | * For all newly unshared structures, release the corresponding |
278 | older, shared, structures. | |
279 | ||
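The flag check in the first bullet above can be written as a small pure function. This is a user-space sketch: the name `force_implied_flags` is hypothetical, and the `signals_shared` parameter stands in for whatever test the kernel applies to decide that signals are being shared.

```c
#define _GNU_SOURCE
#include <sched.h>   /* CLONE_* flag values */

/*
 * Return flags with the implied bits forced on, applying the rules in
 * the order they are listed in the text.
 */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
    if (flags & CLONE_THREAD)
        flags |= CLONE_VM;
    if (flags & CLONE_VM)
        flags |= CLONE_SIGHAND;
    if ((flags & CLONE_SIGHAND) && signals_shared)
        flags |= CLONE_THREAD;
    if (flags & CLONE_NEWNS)
        flags |= CLONE_FS;
    return flags;
}
```

For example, a request for CLONE_NEWNS alone comes back as CLONE_NEWNS | CLONE_FS, which is why unsharing the namespace also unshares file system information.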
f504d47b JC |
280 | 7.3) unshare_* helper functions |
281 | ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | |
0d4c3e7a | 282 | |
f504d47b JC |
283 | For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND, |
284 | and CLONE_THREAD, return -EINVAL since they are not implemented yet. | |
285 | For others, check the flag value to see if the unsharing is | |
286 | required for that structure. If it is, invoke the corresponding | |
287 | dup_* function to allocate and duplicate the structure and return | |
288 | a pointer to it. | |
289 | ||
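The two kinds of helper described above can be sketched as follows. All names here are hypothetical (`fs_ctx`, `unshare_sighand_model`, `unshare_fs_model`), and the sketch assumes that an unsupported helper rejects the call only when its flag is actually requested.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdlib.h>

/* Hypothetical stand-in for a context structure. */
struct fs_ctx {
    int umask;
};

static struct fs_ctx *dup_fs_ctx(const struct fs_ctx *old)
{
    struct fs_ctx *copy = malloc(sizeof(*copy));

    if (copy != NULL)
        *copy = *old;
    return copy;
}

/* Unsupported context: reject the request (assumed: only if requested). */
int unshare_sighand_model(unsigned long flags)
{
    if (flags & CLONE_SIGHAND)
        return -EINVAL;
    return 0;
}

/*
 * Supported context: duplicate only when the flag asks for it.
 * *new_fsp stays NULL when nothing needs to be unshared.
 */
int unshare_fs_model(unsigned long flags, const struct fs_ctx *cur,
                     struct fs_ctx **new_fsp)
{
    *new_fsp = NULL;
    if (!(flags & CLONE_FS))
        return 0;
    *new_fsp = dup_fs_ctx(cur);
    return *new_fsp != NULL ? 0 : -ENOMEM;
}
```

Returning the new structure through a pointer argument leaves the return value free for an error code, so the caller can tell "nothing to do" (zero with a NULL pointer) apart from a failed allocation.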
290 | 7.4) Finally | |
291 | ~~~~~~~~~~~~ | |
292 | ||
293 | Appropriately modify architecture specific code to register the | |
294 | new system call. | |
0d4c3e7a JD |
295 | |
296 | 8) Test Specification | |
297 | --------------------- | |
f504d47b JC |
298 | |
299 | The test for unshare() should test the following: | |
300 | ||
0d4c3e7a | 301 | 1) Valid flags: Test to check that clone flags for signal and |
f504d47b JC |
302 | signal handlers, for which unsharing is not implemented |
303 | yet, return -EINVAL. | |
304 | ||
0d4c3e7a | 305 | 2) Missing/implied flags: Test to make sure that unsharing the |
f504d47b JC |
306 | namespace without specifying unsharing of the filesystem correctly |
307 | unshares both namespace and filesystem information. | |
308 | ||
0d4c3e7a | 309 | 3) For each of the four supported kinds of unsharing (namespace, |
f504d47b JC |
310 | filesystem, files and vm), verify that the system call correctly |
311 | unshares the appropriate structure. Verify that unsharing | |
312 | them individually as well as in combination with each | |
313 | other works as expected. | |
314 | ||
0d4c3e7a | 315 | 4) Concurrent execution: Use shared memory segments and futex on |
f504d47b JC |
316 | an address in the shm segment to synchronize execution of |
317 | about 10 threads. Have a couple of threads execute execve, | |
318 | a couple _exit and the rest unshare with different combinations | |
319 | of flags. Verify that unsharing is performed as expected and | |
320 | that there are no oops or hangs. | |
0d4c3e7a JD |
321 | |
322 | 9) Future Work | |
323 | -------------- | |
f504d47b JC |
324 | |
325 | The current implementation of unshare() does not allow unsharing of | |
0d4c3e7a JD |
326 | signals and signal handlers. Signals are complex to begin with and |
327 | to unshare signals and/or signal handlers of a currently running | |
328 | process is even more complex. If in the future there is a specific | |
329 | need to allow unsharing of signals and/or signal handlers, it can | |
f504d47b JC |
330 | be incrementally added to unshare() without affecting legacy |
331 | applications using unshare(). | |
0d4c3e7a | 332 |