Commit | Line | Data |
---|---|---|
1da177e4 LT |
1 | Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com> |
2 | ||
137b4552 | 3 | ============= |
b9498bfe | 4 | What is NUMA? |
137b4552 | 5 | ============= |
b9498bfe LS |
6 | |
7 | This question can be answered from a couple of perspectives: the | |
8 | hardware view and the Linux software view. | |
9 | ||
10 | From the hardware perspective, a NUMA system is a computer platform that | |
11 | comprises multiple components or assemblies each of which may contain 0 | |
12 | or more CPUs, local memory, and/or IO buses. For brevity and to | |
13 | disambiguate the hardware view of these physical components/assemblies | |
14 | from the software abstraction thereof, we'll call the components/assemblies | |
15 | 'cells' in this document. | |
16 | ||
17 | Each of the 'cells' may be viewed as an SMP [symmetric multi-processor] subset | |
18 | of the system--although some components necessary for a stand-alone SMP system | |
19 | may not be populated on any given cell. The cells of the NUMA system are | |
20 | connected together with some sort of system interconnect--e.g., a crossbar or | |
21 | point-to-point link are common types of NUMA system interconnects. Both of | |
22 | these types of interconnects can be aggregated to create NUMA platforms with | |
23 | cells at multiple distances from other cells. | |
24 | ||
25 | For Linux, the NUMA platforms of interest are primarily what is known as Cache | |
26 | Coherent NUMA or ccNUMA systems. With ccNUMA systems, all memory is visible | |
27 | to and accessible from any CPU attached to any cell and cache coherency | |
28 | is handled in hardware by the processor caches and/or the system interconnect. | |
29 | ||
30 | Memory access time and effective memory bandwidth vary depending on how far | |
31 | away the cell containing the CPU or IO bus making the memory access is from the | |
32 | cell containing the target memory. For example, access to memory by CPUs | |
33 | attached to the same cell will experience faster access times and higher | |
34 | bandwidths than accesses to memory on other, remote cells. NUMA platforms | |
35 | can have cells at multiple remote distances from any given cell. | |
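As a rough illustration of how these distances can be inspected from user space: Linux exposes a per-node `distance` file under `/sys/devices/system/node/`, holding one relative distance per node, with 10 conventionally meaning "local". The sketch below reads those files. The helper names `parse_distance_row` and `read_node_distances` are hypothetical, chosen only for this example.

```python
from pathlib import Path


def parse_distance_row(text):
    """Parse one sysfs distance line, e.g. "10 21", into a list of ints."""
    return [int(field) for field in text.split()]


def read_node_distances(root="/sys/devices/system/node"):
    """Return {node_id: [distance to each node]} for nodes found under root.

    Returns an empty dict when the directory is absent, for example on a
    non-Linux system or a kernel built without NUMA support.
    """
    distances = {}
    base = Path(root)
    if not base.is_dir():
        return distances
    for entry in base.glob("node[0-9]*"):
        dist_file = entry / "distance"
        if dist_file.is_file():
            node_id = int(entry.name[len("node"):])
            distances[node_id] = parse_distance_row(dist_file.read_text())
    return distances
```

On a two-cell machine, `read_node_distances()` might return something like `{0: [10, 21], 1: [21, 10]}`; the actual values depend on the platform firmware.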
36 | ||
37 | Platform vendors don't build NUMA systems just to make software developers' | |
38 | lives interesting. Rather, this architecture is a means to provide scalable | |
39 | memory bandwidth. However, to achieve scalable memory bandwidth, system and | |
40 | application software must arrange for a large majority of the memory references | |
41 | [cache misses] to be to "local" memory--memory on the same cell, if any--or | |
42 | to the closest cell with memory. | |
43 | ||
44 | This leads to the Linux software view of a NUMA system: | |
45 | ||
46 | Linux divides the system's hardware resources into multiple software | |
47 | abstractions called "nodes". Linux maps the nodes onto the physical cells | |
48 | of the hardware platform, abstracting away some of the details for some | |
49 | architectures. As with physical cells, software nodes may contain 0 or more | |
50 | CPUs, memory and/or IO buses. And, again, memory accesses to memory on | |
51 | "closer" nodes--nodes that map to closer cells--will generally experience | |
52 | faster access times and higher effective bandwidth than accesses to more | |
53 | remote cells. | |
54 | ||
55 | For some architectures, such as x86, Linux will "hide" any node representing a | |
56 | physical cell that has no memory attached, and reassign any CPUs attached to | |
57 | that cell to a node representing a cell that does have memory. Thus, on | |
58 | these architectures, one cannot assume that all CPUs that Linux associates with | |
59 | a given node will see the same local memory access times and bandwidth. | |
60 | ||
61 | In addition, for some architectures, again x86 is an example, Linux supports | |
62 | the emulation of additional nodes. For NUMA emulation, Linux will carve up | |
63 | the existing nodes--or the system memory for non-NUMA platforms--into multiple | |
64 | nodes. Each emulated node will manage a fraction of the underlying cells' | |
77a0812c | 65 | physical memory. NUMA emulation is useful for testing NUMA kernel and |
b9498bfe LS |
66 | application features on non-NUMA platforms, and as a sort of memory resource |
67 | management mechanism when used together with cpusets. | |
da82c92f | 68 | [see Documentation/admin-guide/cgroup-v1/cpusets.rst] |
b9498bfe LS |
69 | |
70 | For each node with memory, Linux constructs an independent memory management | |
71 | subsystem, complete with its own free page lists, in-use page lists, usage | |
72 | statistics and locks to mediate access. In addition, Linux constructs for | |
73 | each memory zone [one or more of DMA, DMA32, NORMAL, HIGH_MEMORY, MOVABLE], | |
74 | an ordered "zonelist". A zonelist specifies the zones/nodes to visit when a | |
75 | selected zone/node cannot satisfy the allocation request. This situation, | |
76 | when a zone has no available memory to satisfy a request, is called | |
77 | "overflow" or "fallback". | |
78 | ||
79 | Because some nodes contain multiple zones containing different types of | |
80 | memory, Linux must decide whether to order the zonelists such that allocations | |
81 | fall back to the same zone type on a different node, or to a different zone | |
82 | type on the same node. This is an important consideration because some zones, | |
83 | such as DMA or DMA32, represent relatively scarce resources. Linux chooses | |
c9bff3ee MH |
84 | a default node-ordered zonelist. This means it tries to fall back to other zones |
85 | from the same node before using remote nodes, which are ordered by NUMA distance. | |
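The node-ordered layout described above can be modelled in a few lines. This is a simplified sketch, not the kernel's implementation: the function name, the `ZONE_FALLBACK` preference order, and the dictionary-based inputs are all assumptions made for illustration.

```python
# Assumed zone preference within one node, most preferred first.
ZONE_FALLBACK = ["NORMAL", "DMA32", "DMA"]


def build_node_ordered_zonelist(local_node, distance, zones_by_node):
    """Return a list of (node, zone) pairs in node order.

    Nodes are sorted by distance from local_node (ties broken by node id),
    and all zones of a node are listed before moving to the next node.
    """
    node_order = sorted(
        zones_by_node, key=lambda n: (distance[local_node][n], n)
    )
    zonelist = []
    for node in node_order:
        for zone in ZONE_FALLBACK:
            if zone in zones_by_node[node]:
                zonelist.append((node, zone))
    return zonelist
```

With node 0 holding DMA, DMA32 and NORMAL and node 1 holding only NORMAL, the zonelist for node 0 exhausts every zone on node 0, including the scarce DMA zones, before it reaches node 1.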
b9498bfe LS |
86 | |
87 | By default, Linux will attempt to satisfy memory allocation requests from the | |
88 | node to which the CPU that executes the request is assigned. Specifically, | |
89 | Linux will attempt to allocate from the first node in the appropriate zonelist | |
90 | for the node where the request originates. This is called "local allocation." | |
91 | If the "local" node cannot satisfy the request, the kernel will examine other | |
92 | nodes' zones in the selected zonelist looking for the first zone in the list | |
93 | that can satisfy the request. | |
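A toy model of local allocation with fallback is sketched below. It assumes a zonelist is simply an ordered list of (node, zone) pairs and that a zone can satisfy a request whenever it has enough free pages; the real allocator applies watermarks and other checks that are omitted here. The name `alloc_pages` is reused only as a label for this model.

```python
def alloc_pages(zonelist, free_pages, count):
    """Take count pages from the first zone in zonelist that has enough.

    zonelist:   ordered list of (node, zone) pairs, local node first.
    free_pages: dict mapping (node, zone) to its number of free pages.
    Returns the (node, zone) that satisfied the request, or None.
    """
    for node, zone in zonelist:
        if free_pages.get((node, zone), 0) >= count:
            free_pages[(node, zone)] -= count
            return (node, zone)
    return None
```

The first request that fits is served from the local node; once the local zone is exhausted, the same call returns a zone on the next node in the list, which is the "fallback" case.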
94 | ||
95 | Local allocation will tend to keep subsequent access to the allocated memory | |
96 | "local" to the underlying physical resources and off the system interconnect-- | |
97 | as long as the task on whose behalf the kernel allocated some memory does not | |
98 | later migrate away from that memory. The Linux scheduler is aware of the | |
99 | NUMA topology of the platform--embodied in the "scheduling domains" data | |
d6a3b247 | 100 | structures [see Documentation/scheduler/sched-domains.rst]--and the scheduler |
b9498bfe LS |
101 | attempts to minimize task migration to distant scheduling domains. However, |
102 | the scheduler does not take a task's NUMA footprint into account directly. | |
103 | Thus, under sufficient imbalance, tasks can migrate between nodes and end up | |
104 | remote from their initial node and kernel data structures. | |
105 | ||
106 | System administrators and application designers can restrict a task's migration | |
107 | to improve NUMA locality using various CPU affinity command line interfaces, | |
108 | such as taskset(1) and numactl(1), and program interfaces such as | |
109 | sched_setaffinity(2). Further, one can modify the kernel's default local | |
66e9c46c | 110 | allocation behavior using Linux NUMA memory policy. [see |
ee865889 | 111 | Documentation/admin-guide/mm/numa_memory_policy.rst]. |
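As one possible illustration of the program interface, the sketch below pins a process to the CPUs of a single node. It relies on Python's `os.sched_setaffinity`, which wraps sched_setaffinity(2) on Linux, and on the per-node `cpulist` file in sysfs. The helper names `parse_cpulist` and `pin_to_node` are hypothetical.

```python
import os


def parse_cpulist(text):
    """Parse a kernel CPU list such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a node without CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def pin_to_node(node, pid=0):
    """Restrict pid (0 means the calling process) to the CPUs of node."""
    path = f"/sys/devices/system/node/node{node}/cpulist"
    with open(path) as f:
        cpus = parse_cpulist(f.read())
    if not cpus:
        raise ValueError(f"node {node} has no CPUs")
    os.sched_setaffinity(pid, cpus)  # Linux only
    return cpus
```

Calling `pin_to_node(0)` early in a program keeps the task on node 0's CPUs, so memory it allocates afterwards under the default local allocation policy tends to stay local.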
b9498bfe LS |
112 | |
113 | System administrators can restrict the CPUs and nodes' memories that a non- | |
114 | privileged user can specify in the scheduling or NUMA commands and functions | |
da82c92f | 115 | using control groups and cpusets. [see Documentation/admin-guide/cgroup-v1/cpusets.rst] |
b9498bfe LS |
116 | |
117 | On architectures that do not hide memoryless nodes, Linux will include only | |
118 | zones [nodes] with memory in the zonelists. This means that for a memoryless | |
119 | node the "local memory node"--the node of the first zone in the CPU's node's | |
120 | zonelist--will not be the node itself. Rather, it will be the node that the | |
121 | kernel selected as the nearest node with memory when it built the zonelists. | |
122 | So, default local allocations will still succeed, with the kernel supplying the | |
123 | closest available memory. This is a consequence of the same mechanism that | |
124 | allows such allocations to fall back to other nearby nodes when a node that | |
125 | does contain memory overflows. | |
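The idea of a "local memory node" can be expressed as a nearest-neighbour lookup. The sketch below is a model under stated assumptions, not kernel code: distances are given as a nested dictionary, and ties are broken by the lower node id, which is an arbitrary choice for this example.

```python
def local_memory_node(node, distance, nodes_with_memory):
    """Return the nearest node that has memory.

    For a node that has memory, this is the node itself, because its
    distance to itself is the smallest entry in its distance row.
    """
    if not nodes_with_memory:
        raise ValueError("no node has memory")
    return min(nodes_with_memory, key=lambda n: (distance[node][n], n))
```

For a three-node system in which node 2 has CPUs but no memory and sits closer to node 1 than to node 0, `local_memory_node(2, ...)` yields 1, while nodes 0 and 1 each map to themselves.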
126 | ||
127 | Some kernel allocations do not want or cannot tolerate this allocation fallback | |
128 | behavior. Rather, they want to be sure they get memory from the specified node | |
129 | or get notified that the node has no free memory. This is usually the case when | |
130 | a subsystem allocates per CPU memory resources, for example. | |
131 | ||
132 | A typical model for making such an allocation is to obtain the node id of the | |
133 | node to which the "current CPU" is attached using one of the kernel's | |
134 | numa_node_id() or cpu_to_node() functions and then request memory from only | |
135 | the node id returned. When such an allocation fails, the requesting subsystem | |
136 | may revert to its own fallback path. The slab kernel memory allocator is an | |
137 | example of this. Or, the subsystem may choose to disable or not to enable | |
138 | itself on allocation failure. The kernel profiling subsystem is an example of | |
139 | this. | |
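The model described above, an exact-node request followed by a subsystem-specific fallback, might look like the following sketch. Everything here is illustrative: `alloc_on_node` and `alloc_per_cpu_buffer` are hypothetical names, `cpu_to_node` is a plain dictionary standing in for the kernel function of the same name, and the fallback order (other nodes by ascending id) is an arbitrary assumption.

```python
def alloc_on_node(free_pages, node, count):
    """Allocate only from node; return the node id, or None on failure.

    There is deliberately no fallback to other nodes here.
    """
    if free_pages.get(node, 0) >= count:
        free_pages[node] -= count
        return node
    return None


def alloc_per_cpu_buffer(free_pages, cpu, cpu_to_node, count):
    """Try the CPU's own node first, then the subsystem's own fallback."""
    home = cpu_to_node[cpu]
    got = alloc_on_node(free_pages, home, count)
    if got is not None:
        return got
    # Subsystem-specific fallback: any other node, in ascending id order.
    for node in sorted(free_pages):
        if node != home:
            got = alloc_on_node(free_pages, node, count)
            if got is not None:
                return got
    return None  # the caller may choose to disable itself at this point
```

Keeping the exact-node request separate from the fallback lets each subsystem decide whether remote memory is acceptable or whether it should give up instead.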
140 | ||
141 | If the architecture supports--does not hide--memoryless nodes, then CPUs | |
142 | attached to memoryless nodes would always incur the fallback path overhead | |
143 | or some subsystems would fail to initialize if they attempted to allocate | |
144 | memory exclusively from a node without memory. To support such | |
145 | architectures transparently, kernel subsystems can use the numa_mem_id() | |
146 | or cpu_to_mem() function to locate the "local memory node" for the calling or | |
147 | specified CPU. Again, this is the same node from which default, local page | |
148 | allocations will be attempted. |