.. SPDX-License-Identifier: GPL-2.0

.. _physical_memory_model:

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. There could also be several contiguous ranges at
completely distinct addresses. Finally, on NUMA systems different
memory banks are attached to different CPUs.

Linux abstracts this diversity using one of three memory models:
FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines which
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

.. note::
   At the time of this writing, DISCONTIGMEM is considered deprecated,
   although it is still in use by several architectures.

All the memory models track the status of physical page frames using
struct page arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture-specific setup code should
call the :c:func:`free_area_init` function. Yet, the mappings array is not
usable until the call to :c:func:`memblock_free_all` that hands all the
memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture-specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index into the
`mem_map` array.

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address other than 0.

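The FLATMEM conversion described above can be sketched in plain C. This is
a user-space toy, not the kernel implementation: the one-field
`struct page`, the `NR_PAGES` size, the value chosen for
`ARCH_PFN_OFFSET` and the `flat_*` helper names are all assumptions made
for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the kernel's struct page (assumption: one field only). */
struct page {
    unsigned long flags;
};

#define ARCH_PFN_OFFSET 0x100UL /* assumed: RAM starts at PFN 0x100 */
#define NR_PAGES        64UL    /* assumed size of the toy memory map */

/* One global array covering the whole (toy) physical memory. */
static struct page mem_map[NR_PAGES];

/* PFN -> struct page: index mem_map with PFN - ARCH_PFN_OFFSET. */
static struct page *flat_pfn_to_page(unsigned long pfn)
{
    assert(pfn >= ARCH_PFN_OFFSET && pfn - ARCH_PFN_OFFSET < NR_PAGES);
    return mem_map + (pfn - ARCH_PFN_OFFSET);
}

/* struct page -> PFN: offset inside mem_map plus ARCH_PFN_OFFSET. */
static unsigned long flat_page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

With these assumed values, `flat_pfn_to_page(0x105)` yields `&mem_map[5]`,
and converting that pointer back gives PFN `0x105` again.
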
DISCONTIGMEM
============

The DISCONTIGMEM model treats the physical memory as a collection of
`nodes` similarly to how Linux NUMA support does. For each node Linux
constructs an independent memory management subsystem represented by
`struct pglist_data` (or `pg_data_t` for short). Among other
things, `pg_data_t` holds the `node_mem_map` array that maps
physical pages belonging to that node. The `node_start_pfn` field of
`pg_data_t` is the number of the first page frame belonging to that
node.

The architecture setup code should call :c:func:`free_area_init_node` for
each node in the system to initialize the `pg_data_t` object and its
`node_mem_map`.

Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` -
every physical page frame in a node has a `struct page` entry in the
`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
`flags` field of the `struct page` encodes the node number of the
node hosting that page.

The conversion between a PFN and the `struct page` in the
DISCONTIGMEM model is slightly more complex, as it has to determine
which node hosts the physical page and which `pg_data_t` object
holds the `struct page`.

Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
to convert a PFN to the node number. The opposite conversion helper
:c:func:`page_to_nid` is generic as it uses the node number encoded in
`page->flags`.

Once the node number is known, the PFN, taken relative to
`node_start_pfn`, can be used to index the appropriate `node_mem_map`
array to access the `struct page`. In the other direction, the offset
of the `struct page` from the `node_mem_map` plus `node_start_pfn` is
the PFN of that page.

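The two-step lookup can be illustrated with a small user-space model. It is
only a sketch under assumptions: the two-node layout, the linear search in
`toy_pfn_to_nid()` and all `toy_*` names are hypothetical, `pg_data_t` is
reduced to three fields, and the node number is passed explicitly instead
of being decoded from `page->flags`.

```c
#include <assert.h>
#include <stddef.h>

struct page {
    unsigned long flags;
};

/* Toy pg_data_t holding only the fields this sketch needs. */
typedef struct pglist_data {
    struct page *node_mem_map;        /* per-node array of struct page */
    unsigned long node_start_pfn;     /* first PFN of the node */
    unsigned long node_spanned_pages; /* number of PFNs the node spans */
} pg_data_t;

/* Assumed layout: two nodes of 16 pages each at distinct PFN ranges. */
static struct page node0_map[16], node1_map[16];
static pg_data_t nodes[2] = {
    { node0_map, 0x000UL, 16 },
    { node1_map, 0x400UL, 16 },
};

/* Hypothetical pfn_to_nid(): find the node whose span contains the PFN. */
static int toy_pfn_to_nid(unsigned long pfn)
{
    for (int nid = 0; nid < 2; nid++) {
        unsigned long start = nodes[nid].node_start_pfn;

        if (pfn >= start && pfn - start < nodes[nid].node_spanned_pages)
            return nid;
    }
    return -1; /* the PFN falls into a hole between nodes */
}

/* PFN -> struct page: pick the node, then index its node_mem_map. */
static struct page *toy_pfn_to_page(unsigned long pfn)
{
    int nid = toy_pfn_to_nid(pfn);

    assert(nid >= 0);
    return nodes[nid].node_mem_map + (pfn - nodes[nid].node_start_pfn);
}

/* struct page -> PFN: offset in node_mem_map plus node_start_pfn. */
static unsigned long toy_page_to_pfn(const struct page *page, int nid)
{
    return (unsigned long)(page - nodes[nid].node_mem_map) +
           nodes[nid].node_start_pfn;
}
```

In this model PFN `0x405` resolves to node 1 and to `&node1_map[5]`, while
PFN `0x200` belongs to no node at all.
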
SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with struct mem_section
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored in an encoded form,
together with extra state that aids section management. The section
size and the maximal number of sections are specified using the
`SECTION_SIZE_BITS` and `MAX_PHYSMEM_BITS` constants defined by each
architecture that supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the
actual width of a physical address that an architecture supports,
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}

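The formula is a single shift, as the following sketch shows. The helper
name `nr_mem_sections()` is hypothetical, and the bit widths used in the
remark after the code are assumed example values rather than a statement
about any particular architecture.

```c
#include <assert.h>
#include <limits.h>

/*
 * Number of sections for a given physical address width and section
 * size, both expressed in bits: 2^(max_physmem_bits - section_size_bits).
 */
static unsigned long nr_mem_sections(unsigned int max_physmem_bits,
                                     unsigned int section_size_bits)
{
    unsigned int shift;

    assert(max_physmem_bits > section_size_bits);
    shift = max_physmem_bits - section_size_bits;
    /* The result must fit into an unsigned long. */
    assert(shift < sizeof(unsigned long) * CHAR_BIT);
    return 1UL << shift;
}
```

For instance, assuming 46 physical address bits and 27 section size bits
(128 MiB sections), the result is 2^19 = 524288 sections.
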
The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains PAGE_SIZE worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.

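The row and column arithmetic for the `CONFIG_SPARSEMEM_EXTREME` case can
be sketched as follows. The page size, the number of sections and the
two-field `struct mem_section` are assumptions, and the
`section_nr_to_root()` and `section_nr_to_col()` helpers are hypothetical
names used only in this example.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE       4096UL      /* assumed page size */
#define NR_MEM_SECTIONS (1UL << 19) /* assumed, see the formula above */

/* Toy mem_section; the second field is only a placeholder. */
struct mem_section {
    unsigned long section_mem_map;
    void *placeholder;
};

/* Each row ("root") holds PAGE_SIZE worth of mem_section objects. */
#define SECTIONS_PER_ROOT (PAGE_SIZE / sizeof(struct mem_section))

/* Enough rows to cover every possible section, rounded up. */
#define NR_SECTION_ROOTS \
    ((NR_MEM_SECTIONS + SECTIONS_PER_ROOT - 1) / SECTIONS_PER_ROOT)

/* Row that holds section number nr. */
static unsigned long section_nr_to_root(unsigned long nr)
{
    assert(nr < NR_MEM_SECTIONS);
    return nr / SECTIONS_PER_ROOT;
}

/* Position of section number nr inside its row. */
static unsigned long section_nr_to_col(unsigned long nr)
{
    assert(nr < NR_MEM_SECTIONS);
    return nr % SECTIONS_PER_ROOT;
}
```

Without `CONFIG_SPARSEMEM_EXTREME` the same arithmetic applies with one
section per row, so the row number equals the section number.
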
The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page` - a "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in `page->flags`
and uses the high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index to the array of pages.

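A minimal sketch of the classic sparse lookup is shown below. It makes
several simplifying assumptions: 4 KiB pages, 128 MiB sections, a toy
table of four sections with only section 1 populated, and a
`struct mem_section` that stores a plain pointer. Because the toy pointer
is not biased by the section's first PFN, the sketch indexes the
per-section array with the low PFN bits. The `toy_sections` table and the
`sparse_pfn_to_page()` name are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT        12 /* assumed 4 KiB pages */
#define SECTION_SIZE_BITS 27 /* assumed 128 MiB sections */
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define TOY_NR_SECTIONS   4

struct page {
    unsigned long flags;
};

/* Toy section: a plain pointer to the section's array of struct page. */
struct mem_section {
    struct page *map;
};

/* Only section 1 is populated in this toy setup. */
static struct page section1_map[PAGES_PER_SECTION];
static struct mem_section toy_sections[TOY_NR_SECTIONS] = {
    [1] = { section1_map },
};

/* The high bits of the PFN select the section. */
static unsigned long pfn_to_section_nr(unsigned long pfn)
{
    return pfn >> PFN_SECTION_SHIFT;
}

/* The low bits of the PFN select the page inside the section. */
static struct page *sparse_pfn_to_page(unsigned long pfn)
{
    unsigned long nr = pfn_to_section_nr(pfn);

    assert(nr < TOY_NR_SECTIONS && toy_sections[nr].map != NULL);
    return toy_sections[nr].map + (pfn & (PAGES_PER_SECTION - 1));
}
```

With these values, PFN `PAGES_PER_SECTION + 7` belongs to section 1 and
maps to `&section1_map[7]`.
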
The sparse vmemmap uses a virtually mapped memory map to optimize
:c:func:`pfn_to_page` and :c:func:`page_to_pfn` operations. There is a
global `struct page *vmemmap` pointer that points to a virtually
contiguous array of `struct page` objects. A PFN is an index to that
array and the offset of the `struct page` from `vmemmap` is the PFN of
that page.

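With a virtually contiguous memory map both conversions reduce to pointer
arithmetic. The sketch below models that in user space: the backing array
`toy_vmemmap_area`, its size and the `vmemmap_*` helper names are
assumptions, whereas in the kernel `vmemmap` points into a reserved
virtual address range.

```c
#include <assert.h>
#include <stddef.h>

struct page {
    unsigned long flags;
};

#define TOY_NR_PFNS 1024UL /* assumed size of the toy memory map */

/* Stand-in for the virtually mapped, contiguous array of struct page. */
static struct page toy_vmemmap_area[TOY_NR_PFNS];
static struct page *vmemmap = toy_vmemmap_area;

/* PFN -> struct page: the PFN is the index into vmemmap. */
static struct page *vmemmap_pfn_to_page(unsigned long pfn)
{
    assert(pfn < TOY_NR_PFNS);
    return vmemmap + pfn;
}

/* struct page -> PFN: the offset from vmemmap is the PFN. */
static unsigned long vmemmap_page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - vmemmap);
}
```

Neither direction needs a section lookup, which is why this variant is
described as an optimization over classic sparse.
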
To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with struct vmem_altmap
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========
The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device driver identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users need a
smaller granularity for populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online, its memory ranges are
never exposed through the sysfs memory hotplug API on memory block
boundaries. The implementation relies on this lack of user-API
constraint to allow sub-section sized memory ranges to be specified to
:c:func:`arch_add_memory`, the top-half of memory hotplug. Sub-section
support allows for 2MB as the cross-arch common alignment granularity
for :c:func:`devm_memremap_pages`.

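The 2MB granularity can be expressed as a simple alignment check on a PFN
range. This is an illustrative sketch only: it assumes 4 KiB pages, and
`range_is_subsection_aligned()` is a hypothetical helper, not a function
that the text above attributes to the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT       12 /* assumed 4 KiB pages */
#define SUBSECTION_SHIFT 21 /* 2MB sub-sections, as stated above */
#define PAGES_PER_SUBSECTION (1UL << (SUBSECTION_SHIFT - PAGE_SHIFT))

/*
 * Return true if the PFN range [start_pfn, start_pfn + nr_pages) starts
 * and ends on a sub-section (2MB) boundary.
 */
static bool range_is_subsection_aligned(unsigned long start_pfn,
                                        unsigned long nr_pages)
{
    assert(nr_pages > 0);
    return (start_pfn % PAGES_PER_SUBSECTION) == 0 &&
           (nr_pages % PAGES_PER_SUBSECTION) == 0;
}
```

Under these assumptions a sub-section covers 512 pages, so a range that
starts at PFN 512 and spans 1024 pages passes the check, while one that
starts at PFN 256 does not.
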
The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/vm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/-E topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.