==================
NUMA Memory Policy
==================

What is NUMA Memory Policy?
============================

In the Linux kernel, "memory policy" determines from which node the kernel will
allocate memory in a NUMA system or in an emulated NUMA system. Linux has
supported platforms with Non-Uniform Memory Access architectures since 2.4.?.
The current memory policy support was added to Linux 2.6 around May 2004. This
document attempts to describe the concepts and APIs of the 2.6 memory policy
support.

Memory policies should not be confused with cpusets
(``Documentation/admin-guide/cgroup-v1/cpusets.rst``)
which is an administrative mechanism for restricting the nodes from which
memory may be allocated by a set of processes. Memory policies are a
programming interface that a NUMA-aware application can take advantage of. When
both cpusets and policies are applied to a task, the restrictions of the cpuset
take priority. See :ref:`Memory Policies and cpusets <mem_pol_and_cpusets>`
below for more details.

Memory Policy Concepts
======================

Scope of Memory Policies
------------------------

The Linux kernel supports *scopes* of memory policy, described here from
most general to most specific:

System Default Policy
    this policy is "hard coded" into the kernel. It is the policy
    that governs all page allocations that aren't controlled by
    one of the more specific policy scopes discussed below. When
    the system is "up and running", the system default policy will
    use "local allocation" described below. However, during boot
    up, the system default policy will be set to interleave
    allocations across all nodes with "sufficient" memory, so as
    not to overload the initial boot node with boot-time
    allocations.

Task/Process Policy
    this is an optional, per-task policy. When defined for a
    specific task, this policy controls all page allocations made
    by or on behalf of the task that aren't controlled by a more
    specific scope. If a task does not define a task policy, then
    all page allocations that would have been controlled by the
    task policy "fall back" to the System Default Policy.

    The task policy applies to the entire address space of a task. Thus,
    it is inheritable, and indeed is inherited, across both fork()
    [clone() w/o the CLONE_VM flag] and exec*(). This allows a parent task
    to establish the task policy for a child task exec()'d from an
    executable image that has no awareness of memory policy. See the
    :ref:`Memory Policy APIs <memory_policy_apis>` section,
    below, for an overview of the system call
    that a task may use to set/change its task/process policy; a short
    usage sketch also follows this list.

    In a multi-threaded task, task policies apply only to the thread
    [Linux kernel task] that installs the policy and any threads
    subsequently created by that thread. Any sibling threads existing
    at the time a new task policy is installed retain their current
    policy.

    A task policy applies only to pages allocated after the policy is
    installed. Any pages already faulted in by the task when the task
    changes its task policy remain where they were allocated based on
    the policy at the time they were allocated.

.. _vma_policy:

VMA Policy
    A "VMA" or "Virtual Memory Area" refers to a range of a task's
    virtual address space. A task may define a specific policy for a range
    of its virtual address space. See the
    :ref:`Memory Policy APIs <memory_policy_apis>` section,
    below, for an overview of the mbind() system call used to set a VMA
    policy.

    A VMA policy will govern the allocation of pages that back
    this region of the address space. Any regions of the task's
    address space that don't have an explicit VMA policy will fall
    back to the task policy, which may itself fall back to the
    System Default Policy.

    VMA policies have a few complicating details:

    * VMA policy applies ONLY to anonymous pages. These include
      pages allocated for anonymous segments, such as the task
      stack and heap, and any regions of the address space
      mmap()ed with the MAP_ANONYMOUS flag. If a VMA policy is
      applied to a file mapping, it will be ignored if the mapping
      used the MAP_SHARED flag. If the file mapping used the
      MAP_PRIVATE flag, the VMA policy will only be applied when
      an anonymous page is allocated on an attempt to write to the
      mapping--i.e., at Copy-On-Write.

    * VMA policies are shared between all tasks that share a
      virtual address space--a.k.a. threads--independent of when
      the policy is installed; and they are inherited across
      fork(). However, because VMA policies refer to a specific
      region of a task's address space, and because the address
      space is discarded and recreated on exec*(), VMA policies
      are NOT inheritable across exec(). Thus, only NUMA-aware
      applications may use VMA policies.

    * A task may install a new VMA policy on a sub-range of a
      previously mmap()ed region. When this happens, Linux splits
      the existing virtual memory area into 2 or 3 VMAs, each with
      its own policy.

    * By default, VMA policy applies only to pages allocated after
      the policy is installed. Any pages already faulted into the
      VMA range remain where they were allocated based on the
      policy at the time they were allocated. However, since
      2.6.16, Linux supports page migration via the mbind() system
      call, so that page contents can be moved to match a newly
      installed policy. A sketch showing a VMA policy installed on
      a sub-range appears after this list.

Shared Policy
    Conceptually, shared policies apply to "memory objects" mapped
    shared into one or more tasks' distinct address spaces. An
    application installs shared policies the same way as VMA
    policies--using the mbind() system call specifying a range of
    virtual addresses that map the shared object. However, unlike
    VMA policies, which can be considered to be an attribute of a
    range of a task's address space, shared policies apply
    directly to the shared object. Thus, all tasks that attach to
    the object share the policy, and all pages allocated for the
    shared object, by any task, will obey the shared policy.

    As of 2.6.22, only shared memory segments, created by shmget() or
    mmap(MAP_ANONYMOUS|MAP_SHARED), support shared policy. When shared
    policy support was added to Linux, the associated data structures were
    added to hugetlbfs shmem segments. At the time, hugetlbfs did not
    support allocation at fault time--a.k.a. lazy allocation--so hugetlbfs
    shmem segments were never "hooked up" to the shared policy support.
    Although hugetlbfs segments now support lazy allocation, their support
    for shared policy has not been completed.

    As mentioned above in the :ref:`VMA policies <vma_policy>` section,
    allocations of page cache pages for regular files mmap()ed
    with MAP_SHARED ignore any VMA policy installed on the virtual
    address range backed by the shared file mapping. Rather,
    shared page cache pages, including pages backing private
    mappings that have not yet been written by the task, follow
    task policy, if any, else System Default Policy.

    The shared policy infrastructure supports different policies on subset
    ranges of the shared object. However, Linux still splits the VMA of
    the task that installs the policy for each range of distinct policy.
    Thus, different tasks that attach to a shared memory segment can have
    different VMA configurations mapping that one shared object. This
    can be seen by examining the /proc/<pid>/numa_maps of tasks sharing
    a shared memory region, when one task has installed shared policy on
    one or more ranges of the region.

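The following minimal sketch illustrates the task and VMA policy scopes.
It assumes the numaif.h header shipped with the numactl/libnuma package
and a system with at least two memory nodes; error handling is pared
down for brevity::

    #include <numaif.h>        /* set_mempolicy(), mbind(), MPOL_* */
    #include <sys/mman.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        long page_sz = sysconf(_SC_PAGESIZE);
        unsigned long nodes;
        char *region;

        /* Task policy: interleave this task's future allocations
         * over nodes 0 and 1. */
        nodes = (1UL << 0) | (1UL << 1);
        if (set_mempolicy(MPOL_INTERLEAVE, &nodes, 8 * sizeof(nodes)))
            return 1;

        /* VMA policy: bind pages [16, 32) of a 64-page anonymous
         * mapping to node 0.  Installing a policy on this sub-range
         * splits the original VMA into three VMAs, each with its
         * own policy. */
        region = mmap(NULL, 64 * page_sz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED)
            return 1;
        nodes = 1UL << 0;
        if (mbind(region + 16 * page_sz, 16 * page_sz, MPOL_BIND,
                  &nodes, 8 * sizeof(nodes), 0))
            return 1;

        /* Fault the pages in; placement now follows the policies. */
        memset(region, 0, 64 * page_sz);
        return 0;
    }
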
Components of Memory Policies
-----------------------------

A NUMA memory policy consists of a "mode", optional mode flags, and
an optional set of nodes. The mode determines the behavior of the
policy, the optional mode flags determine the behavior of the mode,
and the optional set of nodes can be viewed as the arguments to the
policy behavior.

Internally, memory policies are implemented by a reference counted
structure, struct mempolicy. Details of this structure will be
discussed in context, below, as required to explain the behavior.

NUMA memory policy supports the following 4 behavioral modes:

Default Mode--MPOL_DEFAULT
    This mode is only used in the memory policy APIs. Internally,
    MPOL_DEFAULT is converted to the NULL memory policy in all
    policy scopes. Any existing non-default policy will simply be
    removed when MPOL_DEFAULT is specified. As a result,
    MPOL_DEFAULT means "fall back to the next most specific policy
    scope."

    For example, a NULL or default task policy will fall back to the
    system default policy. A NULL or default vma policy will fall
    back to the task policy.

    When specified in one of the memory policy APIs, the Default mode
    does not use the optional set of nodes.

    It is an error for the set of nodes specified for this policy to
    be non-empty.

MPOL_BIND
    This mode specifies that memory must come from the set of
    nodes specified by the policy. Memory will be allocated from
    the node in the set with sufficient free memory that is
    closest to the node where the allocation takes place.

MPOL_PREFERRED
    This mode specifies that the allocation should be attempted
    from the single node specified in the policy. If that
    allocation fails, the kernel will search other nodes, in order
    of increasing distance from the preferred node based on
    information provided by the platform firmware.

    Internally, the Preferred policy uses a single node--the
    preferred_node member of struct mempolicy. When the internal
    mode flag MPOL_F_LOCAL is set, the preferred_node is ignored
    and the policy is interpreted as local allocation. "Local"
    allocation policy can be viewed as a Preferred policy that
    starts at the node containing the cpu where the allocation
    takes place.

    It is possible for the user to specify that local allocation
    is always preferred by passing an empty nodemask with this
    mode. If an empty nodemask is passed, the policy cannot use
    the MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES flags
    described below.

MPOL_INTERLEAVE
    This mode specifies that page allocations be interleaved, on a
    page granularity, across the nodes specified in the policy.
    This mode also behaves slightly differently, based on the
    context where it is used:

    For allocation of anonymous pages and shared memory pages,
    Interleave mode indexes the set of nodes specified by the
    policy using the page offset of the faulting address into the
    segment [VMA] containing the address modulo the number of
    nodes specified by the policy. It then attempts to allocate a
    page, starting at the selected node, as if the node had been
    specified by a Preferred policy or had been selected by a
    local allocation. That is, allocation will follow the per
    node zonelist. (A sketch of this index calculation follows
    this list.)

    For allocation of page cache pages, Interleave mode indexes
    the set of nodes specified by the policy using a node counter
    maintained per task. This counter wraps around to the lowest
    specified node after it reaches the highest specified node.
    This will tend to spread the pages out over the nodes
    specified by the policy based on the order in which they are
    allocated, rather than based on any page offset into an
    address range or file. During system boot up, the temporary
    interleaved system default policy works in this mode.

MPOL_PREFERRED_MANY
    This mode specifies that the allocation should be preferably
    satisfied from the nodemask specified in the policy. If there is
    memory pressure on all nodes in the nodemask, the allocation
    can fall back to all existing NUMA nodes. This is effectively
    MPOL_PREFERRED extended to a mask of nodes rather than a single
    node.

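For illustration, the anonymous-page interleave calculation described
above can be sketched as follows. This is a simplified stand-in for the
kernel's logic, not its actual code, and the parameter names are
hypothetical::

    /* Pick the interleave node for a fault at 'fault_addr' inside a
     * VMA starting at 'vma_start', given the policy's node array. */
    unsigned interleave_node(unsigned long fault_addr,
                             unsigned long vma_start,
                             unsigned page_shift,
                             const unsigned *policy_nodes,
                             unsigned nr_nodes)
    {
        /* Page offset of the faulting address within the VMA... */
        unsigned long page_off = (fault_addr - vma_start) >> page_shift;

        /* ...indexes the policy's node set, modulo the node count. */
        return policy_nodes[page_off % nr_nodes];
    }
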
NUMA memory policy supports the following optional mode flags:

MPOL_F_STATIC_NODES
    This flag specifies that the nodemask passed by
    the user should not be remapped if the task or VMA's set of allowed
    nodes changes after the memory policy has been defined.

    Without this flag, any time a mempolicy is rebound because of a
    change in the set of allowed nodes, the preferred nodemask (Preferred
    Many), preferred node (Preferred) or nodemask (Bind, Interleave) is
    remapped to the new set of allowed nodes. This may result in nodes
    being used that were previously undesired.

    With this flag, if the user-specified nodes overlap with the
    nodes allowed by the task's cpuset, then the memory policy is
    applied to their intersection. If the two sets of nodes do not
    overlap, the Default policy is used.

    For example, consider a task that is attached to a cpuset with
    mems 1-3 that sets an Interleave policy over the same set. If
    the cpuset's mems change to 3-5, the Interleave will now occur
    over nodes 3, 4, and 5. With this flag, however, since only node
    3 is allowed from the user's nodemask, the "interleave" only
    occurs over that node. If no nodes from the user's nodemask are
    now allowed, the Default behavior is used. (A sketch contrasting
    this remap with the relative remap follows this list.)

    MPOL_F_STATIC_NODES cannot be combined with the
    MPOL_F_RELATIVE_NODES flag. It also cannot be used for
    MPOL_PREFERRED policies that were created with an empty nodemask
    (local allocation).

MPOL_F_RELATIVE_NODES
    This flag specifies that the nodemask passed
    by the user will be mapped relative to the task's or VMA's
    set of allowed nodes. The kernel stores the user-passed nodemask,
    and if the allowed nodes change, then that original nodemask will
    be remapped relative to the new set of allowed nodes.

    Without this flag (and without MPOL_F_STATIC_NODES), anytime a
    mempolicy is rebound because of a change in the set of allowed
    nodes, the node (Preferred) or nodemask (Bind, Interleave) is
    remapped to the new set of allowed nodes. That remap may not
    preserve the relative nature of the user's passed nodemask to its
    set of allowed nodes upon successive rebinds: a nodemask of
    1,3,5 may be remapped to 7-9 and then to 1-3 if the set of
    allowed nodes is restored to its original state.

    With this flag, the remap is done so that the node numbers from
    the user's passed nodemask are relative to the set of allowed
    nodes. In other words, if nodes 0, 2, and 4 are set in the user's
    nodemask, the policy will be effected over the first (and in the
    Bind or Interleave case, the third and fifth) nodes in the set of
    allowed nodes. The nodemask passed by the user represents nodes
    relative to the task's or VMA's set of allowed nodes.

    If the user's nodemask includes nodes that are outside the range
    of the new set of allowed nodes (for example, node 5 is set in
    the user's nodemask when the set of allowed nodes is only 0-3),
    then the remap wraps around to the beginning of the nodemask and,
    if not already set, sets the node in the mempolicy nodemask.

    For example, consider a task that is attached to a cpuset with
    mems 2-5 that sets an Interleave policy over the same set with
    MPOL_F_RELATIVE_NODES. If the cpuset's mems change to 3-7, the
    interleave now occurs over nodes 3,5-7. If the cpuset's mems
    then change to 0,2-3,5, then the interleave occurs over nodes
    0,2-3,5.

    Thanks to the consistent remapping, applications preparing
    nodemasks to specify memory policies using this flag should
    disregard their current, actual cpuset imposed memory placement
    and prepare the nodemask as if they were always located on
    memory nodes 0 to N-1, where N is the number of memory nodes the
    policy is intended to manage. Let the kernel then remap to the
    set of memory nodes allowed by the task's cpuset, as that may
    change over time.

    MPOL_F_RELATIVE_NODES cannot be combined with the
    MPOL_F_STATIC_NODES flag. It also cannot be used for
    MPOL_PREFERRED policies that were created with an empty nodemask
    (local allocation).

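The two remapping behaviors can be sketched for a single-word nodemask
as follows. This is an illustrative simplification, not the kernel's
actual remap code::

    /* MPOL_F_STATIC_NODES: the policy is applied to the intersection
     * of the user's mask and the allowed mask; if the intersection is
     * empty, the Default behavior is used instead. */
    unsigned long remap_static(unsigned long user_mask,
                               unsigned long allowed)
    {
        return user_mask & allowed;
    }

    /* MPOL_F_RELATIVE_NODES: bit i of the user's mask selects the
     * i-th allowed node, wrapping modulo the number of allowed nodes. */
    unsigned long remap_relative(unsigned long user_mask,
                                 unsigned long allowed)
    {
        unsigned long out = 0;
        int nr_allowed = __builtin_popcountl(allowed);

        if (!nr_allowed)
            return 0;
        for (int i = 0; i < 64; i++) {
            if (!(user_mask & (1UL << i)))
                continue;
            /* Find the (i % nr_allowed)-th set bit in 'allowed'. */
            int target = i % nr_allowed;
            for (int n = 0, seen = 0; n < 64; n++) {
                if (!(allowed & (1UL << n)))
                    continue;
                if (seen++ == target) {
                    out |= 1UL << n;
                    break;
                }
            }
        }
        return out;
    }

Plugging the documented examples into these helpers reproduces the
results above: the static remap of mems 1-3 against allowed mems 3-5
leaves only node 3, and the relative remap of 2-5 against allowed mems
3-7 yields nodes 3,5-7.
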
Memory Policy Reference Counting
================================

To resolve use/free races, struct mempolicy contains an atomic reference
count field. The internal interfaces mpol_get()/mpol_put() increment and
decrement this reference count, respectively. mpol_put() will only free
the structure back to the mempolicy kmem cache when the reference count
goes to zero.

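Schematically, a user of a policy follows the usual get/use/put pattern.
This is a simplified sketch; the real inline helpers live in
include/linux/mempolicy.h, and the lookup function here is hypothetical::

    struct mempolicy *pol = lookup_policy();  /* hypothetical lookup */

    mpol_get(pol);    /* take a reference before using the policy */
    /* ... examine the mode, flags and nodemask; allocate pages ... */
    mpol_put(pol);    /* drop it; frees 'pol' when the count hits zero */
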
When a new memory policy is allocated, its reference count is initialized
to '1', representing the reference held by the task that is installing the
new policy. When a pointer to a memory policy structure is stored in another
structure, another reference is added, as the task's reference will be dropped
on completion of the policy installation.

During run-time "usage" of the policy, we attempt to minimize atomic operations
on the reference count, as this can lead to cache lines bouncing between cpus
and NUMA nodes. "Usage" here means one of the following:

1) querying of the policy, either by the task itself [using the get_mempolicy()
   API discussed below] or by another task using the /proc/<pid>/numa_maps
   interface.

2) examination of the policy to determine the policy mode and associated node
   or node lists, if any, for page allocation. This is considered a "hot
   path". Note that for MPOL_BIND, the "usage" extends across the entire
   allocation process, which may sleep during page reclamation, because the
   BIND policy nodemask is used, by reference, to filter ineligible nodes.

We can avoid taking an extra reference during the usages listed above as
follows:

1) we never need to get/free the system default policy as this is never
   changed nor freed, once the system is up and running.

2) for querying the policy, we do not need to take an extra reference on the
   target task's task policy nor vma policies because we always acquire the
   task's mm's mmap_lock for read during the query. The set_mempolicy() and
   mbind() APIs [see below] always acquire the mmap_lock for write when
   installing or replacing task or vma policies. Thus, there is no possibility
   of a task or thread freeing a policy while another task or thread is
   querying it.

3) Page allocation usage of task or vma policy occurs in the fault path where
   we hold the mmap_lock for read. Again, because replacing the task or vma
   policy requires that the mmap_lock be held for write, the policy can't be
   freed out from under us while we're using it for page allocation.

4) Shared policies require special consideration. One task can replace a
   shared memory policy while another task, with a distinct mmap_lock, is
   querying or allocating a page based on the policy. To resolve this
   potential race, the shared policy infrastructure adds an extra reference
   to the shared policy during lookup while holding a spin lock on the shared
   policy management structure. This requires that we drop this extra
   reference when we're finished "using" the policy. We must drop the
   extra reference on shared policies in the same query/allocation paths
   used for non-shared policies. For this reason, shared policies are marked
   as such, and the extra reference is dropped "conditionally"--i.e., only
   for shared policies.

Because of this extra reference counting, and because we must lookup
shared policies in a tree structure under spinlock, shared policies are
more expensive to use in the page allocation path. This is especially
true for shared policies on shared memory regions shared by tasks running
on different NUMA nodes. This extra overhead can be avoided by always
falling back to task or system default policy for shared memory regions,
or by prefaulting the entire shared memory region into memory and locking
it down. However, this might not be appropriate for all applications.

.. _memory_policy_apis:

Memory Policy APIs
==================

Linux supports 4 system calls for controlling memory policy. These APIs
always affect only the calling task, the calling task's address space, or
some shared object mapped into the calling task's address space.

.. note::
   the headers that define these APIs and the parameter data types for
   user space applications reside in a package that is not part of the
   Linux kernel. The kernel system call interfaces, with the 'sys\_'
   prefix, are defined in <linux/syscalls.h>; the mode and flag
   definitions are defined in <linux/mempolicy.h>.

Set [Task] Memory Policy::

    long set_mempolicy(int mode, const unsigned long *nmask,
                       unsigned long maxnode);

Sets the calling task's "task/process memory policy" to the mode
specified by the 'mode' argument and the set of nodes defined by
'nmask'. 'nmask' points to a bit mask of node ids containing at least
'maxnode' ids. Optional mode flags may be passed by combining the
'mode' argument with the flag (for example: MPOL_INTERLEAVE |
MPOL_F_STATIC_NODES).

See the set_mempolicy(2) man page for more details.

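For instance, a minimal sketch combining a mode with a mode flag, again
assuming the numaif.h header from the numactl/libnuma package::

    #include <numaif.h>
    #include <stdio.h>

    int main(void)
    {
        /* Interleave over nodes 1-3; MPOL_F_STATIC_NODES keeps this
         * mask from being remapped if the cpuset's mems change. */
        unsigned long nodes = (1UL << 1) | (1UL << 2) | (1UL << 3);

        if (set_mempolicy(MPOL_INTERLEAVE | MPOL_F_STATIC_NODES,
                          &nodes, 8 * sizeof(nodes)) != 0) {
            perror("set_mempolicy");
            return 1;
        }
        return 0;
    }
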
Get [Task] Memory Policy or Related Information::

    long get_mempolicy(int *mode,
                       const unsigned long *nmask, unsigned long maxnode,
                       void *addr, int flags);

Queries the "task/process memory policy" of the calling task, or the
policy or location of a specified virtual address, depending on the
'flags' argument.

See the get_mempolicy(2) man page for more details.

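For instance, a sketch querying both the task policy and the location of
a specific page (MPOL_F_NODE and MPOL_F_ADDR are described in the
get_mempolicy(2) man page; numaif.h assumptions as above)::

    #include <numaif.h>
    #include <stdio.h>

    int main(void)
    {
        int mode;
        unsigned long nodes = 0;
        int probe = 0;   /* stack page to locate */

        /* With no addr and no flags, query the task policy. */
        if (get_mempolicy(&mode, &nodes, 8 * sizeof(nodes), NULL, 0))
            perror("get_mempolicy");

        /* With MPOL_F_NODE | MPOL_F_ADDR, 'mode' returns the node
         * currently backing the page containing 'probe'. */
        if (get_mempolicy(&mode, NULL, 0, &probe,
                          MPOL_F_NODE | MPOL_F_ADDR) == 0)
            printf("probe page is on node %d\n", mode);
        return 0;
    }
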
Install VMA/Shared Policy for a Range of Task's Address Space::

    long mbind(void *start, unsigned long len, int mode,
               const unsigned long *nmask, unsigned long maxnode,
               unsigned flags);

mbind() installs the policy specified by (mode, nmask, maxnode) as a
VMA policy for the range of the calling task's address space specified
by the 'start' and 'len' arguments. Additional actions may be
requested via the 'flags' argument.

See the mbind(2) man page for more details.

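For instance, a sketch that binds an anonymous mapping to node 0 and
asks the kernel to migrate any already-faulted pages, per the 2.6.16
behavior noted earlier (numaif.h assumptions as above)::

    #include <numaif.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        long page_sz = sysconf(_SC_PAGESIZE);
        unsigned long nodes = 1UL << 0;   /* node 0 only */
        void *buf = mmap(NULL, 16 * page_sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
            return 1;
        /* MPOL_MF_MOVE also migrates pages already faulted in. */
        if (mbind(buf, 16 * page_sz, MPOL_BIND, &nodes,
                  8 * sizeof(nodes), MPOL_MF_MOVE))
            return 1;
        return 0;
    }
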
Set home node for a Range of Task's Address Space::

    long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
                                     unsigned long home_node,
                                     unsigned long flags);

sys_set_mempolicy_home_node() sets the home node for a VMA policy present in
the task's address range. The system call updates the home node only for the
existing mempolicy range; other address ranges are ignored. The home node is
the NUMA node from which, or closest to which, page allocations for the range
will be satisfied. Specifying a home node overrides the default behavior of
allocating memory close to the node local to the executing CPU.

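For instance, a sketch installing a VMA policy and then pinning its home
node. It assumes a kernel and libc recent enough (Linux 5.17+) to define
__NR_set_mempolicy_home_node; there is no glibc wrapper, so the raw
syscall is used::

    #define _GNU_SOURCE
    #include <numaif.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
        long page_sz = sysconf(_SC_PAGESIZE);
        unsigned long nodes = (1UL << 0) | (1UL << 1);
        void *buf = mmap(NULL, 16 * page_sz, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
            return 1;
        /* A home node applies only to an existing mempolicy range,
         * so install a VMA policy first. */
        if (mbind(buf, 16 * page_sz, MPOL_BIND, &nodes,
                  8 * sizeof(nodes), 0))
            return 1;
        /* Prefer allocations for this range to start at node 1. */
        if (syscall(__NR_set_mempolicy_home_node, (unsigned long)buf,
                    16 * page_sz, 1UL, 0UL))
            return 1;
        return 0;
    }
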

Memory Policy Command Line Interface
====================================

Although not strictly part of the Linux implementation of memory policy,
a command line tool, numactl(8), exists that allows one to:

+ set the task policy for a specified program via set_mempolicy(2), fork(2) and
  exec(2)

+ set the shared policy for a shared memory segment via mbind(2)

The numactl(8) tool is packaged with the run-time version of the library
containing the memory policy system call wrappers. Some distributions
package the headers and compile-time libraries in a separate development
package.

.. _mem_pol_and_cpusets:

Memory Policies and cpusets
===========================

Memory policies work within cpusets as described above. For memory policies
that require a node or set of nodes, the nodes are restricted to the set of
nodes whose memories are allowed by the cpuset constraints. If the nodemask
specified for the policy contains nodes that are not allowed by the cpuset and
MPOL_F_RELATIVE_NODES is not used, the intersection of the set of nodes
specified for the policy and the set of nodes with memory is used. If the
result is the empty set, the policy is considered invalid and cannot be
installed. If MPOL_F_RELATIVE_NODES is used, the policy's nodes are mapped
onto and folded into the task's set of allowed nodes as previously described.

The interaction of memory policies and cpusets can be problematic when tasks
in two cpusets share access to a memory region, such as shared memory segments
created by shmget() or mmap() with the MAP_ANONYMOUS and MAP_SHARED flags. If
any of the tasks installs a shared policy on the region, only nodes whose
memories are allowed in both cpusets may be used in the policies. Obtaining
this information requires "stepping outside" the memory policy APIs to use the
cpuset information and requires that one know in what cpusets other tasks might
be attaching to the shared region. Furthermore, if the cpusets' allowed
memory sets are disjoint, "local" allocation is the only valid policy.