Commit | Line | Data |
---|---|---|
7d10bdbd MR |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | .. _physical_memory_model: | |
4 | ||
5 | ===================== | |
6 | Physical Memory Model | |
7 | ===================== | |
8 | ||
9 | Physical memory in a system may be addressed in different ways. The | |
10 | simplest case is when the physical memory starts at address 0 and | |
11 | spans a contiguous range up to the maximal address. It could be, | |
12 | however, that this range contains small holes that are not accessible | |
13 | for the CPU. Then there could be several contiguous ranges at | |
14 | completely distinct addresses. Finally, there are NUMA systems, where | |
15 | different memory banks are attached to different CPUs. | |
16 | ||
17 | Linux abstracts this diversity using one of the three memory models: | |
18 | FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what | |
19 | memory models it supports, what the default memory model is and | |
20 | whether it is possible to manually override that default. | |
21 | ||
22 | .. note:: | |
23 | At the time of this writing, DISCONTIGMEM is considered deprecated, | |
24 | although it is still in use by several architectures. | |
25 | ||
26 | All the memory models track the status of physical page frames using | |
27 | :c:type:`struct page` arranged in one or more arrays. | |
28 | ||
29 | Regardless of the selected memory model, there exists a one-to-one | |
30 | mapping between the physical page frame number (PFN) and the | |
31 | corresponding `struct page`. | |
32 | ||
33 | Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn` | |
34 | helpers that allow the conversion from PFN to `struct page` and vice | |
35 | versa. | |
36 | ||
37 | FLATMEM | |
38 | ======= | |
39 | ||
40 | The simplest memory model is FLATMEM. This model is suitable for | |
41 | non-NUMA systems with contiguous, or mostly contiguous, physical | |
42 | memory. | |
43 | ||
44 | In the FLATMEM memory model, there is a global `mem_map` array that | |
45 | maps the entire physical memory. For most architectures, the holes | |
46 | have entries in the `mem_map` array. The `struct page` objects | |
47 | corresponding to the holes are never fully initialized. | |
48 | ||
49 | To allocate the `mem_map` array, architecture specific setup code | |
50 | should call :c:func:`free_area_init_node` function or its convenience | |
51 | wrapper :c:func:`free_area_init`. Yet, the mappings array is not | |
52 | usable until the call to :c:func:`memblock_free_all` that hands all | |
53 | the memory to the page allocator. | |
54 | ||
55 | If an architecture enables `CONFIG_ARCH_HAS_HOLES_MEMORYMODEL` option, | |
56 | it may free parts of the `mem_map` array that do not cover the | |
57 | actual physical pages. In such case, the architecture specific | |
58 | :c:func:`pfn_valid` implementation should take the holes in the | |
59 | `mem_map` into account. | |
60 | ||
61 | With FLATMEM, the conversion between a PFN and the `struct page` is | |
62 | straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the | |
63 | `mem_map` array. | |
64 | ||
65 | The `ARCH_PFN_OFFSET` defines the first page frame number for | |
66 | systems with physical memory starting at an address different from 0. | |
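The FLATMEM conversion described above can be sketched in user space as follows. This is a simplified model, not kernel code: the one-field `struct page`, the `flat_` function names, and the values of `ARCH_PFN_OFFSET` and `NR_PAGES` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page; the real structure is much larger. */
struct page { unsigned long flags; };

/* Illustrative values: memory starts at PFN 0x100 and spans 64 page frames. */
#define ARCH_PFN_OFFSET 0x100UL
#define NR_PAGES        64UL

static struct page mem_map[NR_PAGES];

/* PFN -> struct page: index mem_map by the PFN relative to the first frame. */
static struct page *flat_pfn_to_page(unsigned long pfn)
{
	assert(pfn >= ARCH_PFN_OFFSET && pfn - ARCH_PFN_OFFSET < NR_PAGES);
	return mem_map + (pfn - ARCH_PFN_OFFSET);
}

/* struct page -> PFN: offset inside mem_map plus the first frame number. */
static unsigned long flat_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

Both directions are plain pointer arithmetic, which is why FLATMEM is the cheapest of the three models.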
67 | ||
68 | DISCONTIGMEM | |
69 | ============ | |
70 | ||
71 | The DISCONTIGMEM model treats the physical memory as a collection of | |
72 | `nodes` similarly to how Linux NUMA support does. For each node Linux | |
73 | constructs an independent memory management subsystem represented by | |
74 | `struct pglist_data` (or `pg_data_t` for short). Among other | |
75 | things, `pg_data_t` holds the `node_mem_map` array that maps | |
76 | physical pages belonging to that node. The `node_start_pfn` field of | |
77 | `pg_data_t` is the number of the first page frame belonging to that | |
78 | node. | |
79 | ||
80 | The architecture setup code should call :c:func:`free_area_init_node` for | |
81 | each node in the system to initialize the `pg_data_t` object and its | |
82 | `node_mem_map`. | |
83 | ||
84 | Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` - | |
85 | every physical page frame in a node has a `struct page` entry in the | |
86 | `node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the | |
87 | `flags` field of the `struct page` encodes the node number of the | |
88 | node hosting that page. | |
89 | ||
90 | The conversion between a PFN and the `struct page` in the | |
91 | DISCONTIGMEM model is slightly more complex as it has to determine | |
92 | which node hosts the physical page and which `pg_data_t` object | |
93 | holds the `struct page`. | |
94 | ||
95 | Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid` | |
96 | to convert PFN to the node number. The opposite conversion helper | |
97 | :c:func:`page_to_nid` is generic as it uses the node number encoded in | |
98 | page->flags. | |
99 | ||
100 | Once the node number is known, the PFN can be used to index the | |
101 | appropriate `node_mem_map` array to access the `struct page`, and | |
102 | the offset of the `struct page` from the `node_mem_map` plus | |
103 | `node_start_pfn` is the PFN of that page. | |
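A minimal user-space sketch of the two-step DISCONTIGMEM lookup follows. The reduced `pg_data_t`, the `nr_pages` field, the `model_` function names, and the two-node layout are hypothetical. The sketch also passes the node number to the reverse conversion explicitly, whereas the text above notes that the kernel reads it from `page->flags`.

```c
#include <assert.h>
#include <stddef.h>

struct page { unsigned long flags; };

/* Reduced pg_data_t: the two fields named in the text plus a size. */
typedef struct pglist_data {
	struct page *node_mem_map;
	unsigned long node_start_pfn;
	unsigned long nr_pages;	/* hypothetical helper field for this sketch */
} pg_data_t;

#define NR_NODES 2
static struct page node0_map[16], node1_map[16];

/* Two nodes separated by a hole: PFNs 0..15 and 64..79. */
static pg_data_t node_data[NR_NODES] = {
	{ node0_map, 0, 16 },
	{ node1_map, 64, 16 },
};

/* Model of an architecture's pfn_to_nid(): linear search over the nodes. */
static int model_pfn_to_nid(unsigned long pfn)
{
	for (int nid = 0; nid < NR_NODES; nid++) {
		const pg_data_t *pgdat = &node_data[nid];

		if (pfn >= pgdat->node_start_pfn &&
		    pfn - pgdat->node_start_pfn < pgdat->nr_pages)
			return nid;
	}
	return -1;	/* the PFN falls into a hole */
}

/* Step 1: find the node. Step 2: index that node's node_mem_map. */
static struct page *model_pfn_to_page(unsigned long pfn)
{
	int nid = model_pfn_to_nid(pfn);

	assert(nid >= 0);
	return node_data[nid].node_mem_map +
	       (pfn - node_data[nid].node_start_pfn);
}

/* Offset inside node_mem_map plus node_start_pfn gives the PFN back. */
static unsigned long model_page_to_pfn(const struct page *page, int nid)
{
	const pg_data_t *pgdat = &node_data[nid];

	return (unsigned long)(page - pgdat->node_mem_map) +
	       pgdat->node_start_pfn;
}
```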
104 | ||
105 | SPARSEMEM | |
106 | ========= | |
107 | ||
108 | SPARSEMEM is the most versatile memory model available in Linux and it | |
109 | is the only memory model that supports several advanced features such | |
110 | as hot-plug and hot-remove of the physical memory, alternative memory | |
111 | maps for non-volatile memory devices and deferred initialization of | |
112 | the memory map for larger systems. | |
113 | ||
114 | The SPARSEMEM model presents the physical memory as a collection of | |
115 | sections. A section is represented with :c:type:`struct mem_section` | |
116 | that contains `section_mem_map` that is, logically, a pointer to an | |
117 | array of struct pages. However, it is stored with some other magic | |
118 | that aids section management. The section size and the maximal number | |
119 | of sections are specified using `SECTION_SIZE_BITS` and | |
120 | `MAX_PHYSMEM_BITS` constants defined by each architecture that | |
121 | supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the actual width of a | |
122 | physical address that an architecture supports, the | |
123 | `SECTION_SIZE_BITS` is an arbitrary value. | |
124 | ||
125 | The maximal number of sections is denoted `NR_MEM_SECTIONS` and | |
126 | defined as | |
127 | ||
128 | .. math:: | |
129 | ||
130 | NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)} | |
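As a worked example of the formula, the sketch below evaluates it for one set of values. The numbers (46, 27 and a 4 KiB page) are picked for illustration only, and the helper macros `SECTIONS_SHIFT` and `PAGES_PER_SECTION` are modelled on names used in kernel headers rather than taken from the text above.

```c
#include <assert.h>

/* Illustrative values only; every architecture defines its own. */
#define MAX_PHYSMEM_BITS  46
#define SECTION_SIZE_BITS 27
#define PAGE_SHIFT        12

#define SECTIONS_SHIFT    (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
#define NR_MEM_SECTIONS   (1UL << SECTIONS_SHIFT)
#define SECTION_SIZE      (1UL << SECTION_SIZE_BITS)
#define PAGES_PER_SECTION (1UL << (SECTION_SIZE_BITS - PAGE_SHIFT))

/* A section must span at least one page for the model to make sense. */
static_assert(SECTION_SIZE_BITS > PAGE_SHIFT, "section smaller than a page");
```

With these values a section covers 128 MiB (32768 pages) and there can be at most 2^19 = 524288 sections.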
131 | ||
132 | The `mem_section` objects are arranged in a two-dimensional array | |
133 | called `mem_sections`. The size and placement of this array depend | |
134 | on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of | |
135 | sections: | |
136 | ||
137 | * When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections` | |
138 | array is static and has `NR_MEM_SECTIONS` rows. Each row holds a | |
139 | single `mem_section` object. | |
140 | * When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections` | |
141 | array is dynamically allocated. Each row contains PAGE_SIZE worth of | |
142 | `mem_section` objects and the number of rows is calculated to fit | |
143 | all the memory sections. | |
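The row and column arithmetic for the `CONFIG_SPARSEMEM_EXTREME` layout can be sketched as below. The 16-byte `struct mem_section` stand-in, the 4 KiB `PAGE_SIZE`, the section count and the helper names are assumptions for illustration; the real structure and constants differ between kernel versions and architectures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced stand-in: fixed at 16 bytes so the arithmetic is reproducible. */
struct mem_section {
	uint64_t section_mem_map;
	uint64_t pad;		/* hypothetical second field */
};

#define PAGE_SIZE         4096UL
#define NR_MEM_SECTIONS   (1UL << 19)	/* illustrative value */

/* Each row holds PAGE_SIZE worth of mem_section objects. */
#define SECTIONS_PER_ROOT (PAGE_SIZE / sizeof(struct mem_section))

/* Enough rows to cover every possible section, rounded up. */
#define NR_SECTION_ROOTS \
	((NR_MEM_SECTIONS + SECTIONS_PER_ROOT - 1) / SECTIONS_PER_ROOT)

/* Row holding a given section number. */
static unsigned long section_nr_to_row(unsigned long nr)
{
	assert(nr < NR_MEM_SECTIONS);
	return nr / SECTIONS_PER_ROOT;
}

/* Position of the section inside its row. */
static unsigned long section_nr_to_col(unsigned long nr)
{
	assert(nr < NR_MEM_SECTIONS);
	return nr % SECTIONS_PER_ROOT;
}
```

With `CONFIG_SPARSEMEM_EXTREME` disabled, the same arithmetic applies with one section per row, so the row index equals the section number.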
144 | ||
145 | The architecture setup code should call :c:func:`memory_present` for | |
146 | each active memory range or use :c:func:`memblocks_present` or | |
147 | :c:func:`sparse_memory_present_with_active_regions` wrappers to | |
148 | initialize the memory sections. Next, the actual memory maps should be | |
149 | set up using :c:func:`sparse_init`. | |
150 | ||
151 | With SPARSEMEM there are two possible ways to convert a PFN to the | |
152 | corresponding `struct page` - a "classic sparse" and "sparse | |
153 | vmemmap". The selection is made at build time and it is determined by | |
154 | the value of `CONFIG_SPARSEMEM_VMEMMAP`. | |
155 | ||
156 | The classic sparse encodes the section number of a page in page->flags | |
157 | and uses high bits of a PFN to access the section that maps that page | |
158 | frame. Inside a section, the PFN is the index to the array of pages. | |
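The classic sparse lookup can be modelled roughly as follows. This sketch stores a plain pointer in `section_mem_map`, whereas the kernel keeps an encoded value there. It also passes the section number to the reverse conversion instead of reading it from `page->flags`. The tiny section size, the flat `sections` array and the `sparse_` function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct page { unsigned long flags; };
struct mem_section { struct page *section_mem_map; };

/* Deliberately small: 16 pages per section, 4 sections. */
#define PAGE_SHIFT        12
#define SECTION_SIZE_BITS 16
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define NR_MEM_SECTIONS   4

static struct page sec0_pages[PAGES_PER_SECTION];
static struct page sec2_pages[PAGES_PER_SECTION];

/* Sections 1 and 3 are holes and have no memory map at all. */
static struct mem_section sections[NR_MEM_SECTIONS] = {
	{ sec0_pages }, { NULL }, { sec2_pages }, { NULL },
};

/* High PFN bits select the section, low bits index its page array. */
static struct page *sparse_pfn_to_page(unsigned long pfn)
{
	unsigned long nr = pfn >> PFN_SECTION_SHIFT;

	if (nr >= NR_MEM_SECTIONS || !sections[nr].section_mem_map)
		return NULL;
	return sections[nr].section_mem_map + (pfn & (PAGES_PER_SECTION - 1));
}

/* Section number supplies the high bits, array offset the low bits. */
static unsigned long sparse_page_to_pfn(const struct page *page,
					unsigned long nr)
{
	assert(nr < NR_MEM_SECTIONS && sections[nr].section_mem_map);
	return (nr << PFN_SECTION_SHIFT) +
	       (unsigned long)(page - sections[nr].section_mem_map);
}
```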
159 | ||
160 | The sparse vmemmap uses a virtually mapped memory map to optimize | |
161 | pfn_to_page and page_to_pfn operations. There is a global `struct | |
162 | page *vmemmap` pointer that points to a virtually contiguous array of | |
163 | `struct page` objects. A PFN is an index to that array and the | |
164 | offset of the `struct page` from `vmemmap` is the PFN of that | |
165 | page. | |
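In the vmemmap case both conversions reduce to pointer arithmetic on a single base pointer. In the user-space sketch below, an ordinary static array stands in for the virtually mapped range; the array, its size and the `vmemmap_` function names are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct page { unsigned long flags; };

#define NR_PAGES 32UL

/* Stand-in for the reserved virtual range that backs the memory map. */
static struct page backing[NR_PAGES];
static struct page *vmemmap = backing;

/* A PFN is simply an index into the virtually contiguous array. */
static struct page *vmemmap_pfn_to_page(unsigned long pfn)
{
	assert(pfn < NR_PAGES);
	return vmemmap + pfn;
}

/* The offset of the struct page from vmemmap is the PFN. */
static unsigned long vmemmap_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - vmemmap);
}
```

Compared with the classic sparse variant, no section lookup is needed, which is the optimization the text refers to.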
166 | ||
167 | To use vmemmap, an architecture has to reserve a range of virtual | |
168 | addresses that will map the physical pages containing the memory | |
169 | map and make sure that `vmemmap` points to that range. In addition, | |
170 | the architecture should implement :c:func:`vmemmap_populate` method | |
171 | that will allocate the physical memory and create page tables for the | |
172 | virtual memory map. If an architecture does not have any special | |
173 | requirements for the vmemmap mappings, it can use the default | |
174 | :c:func:`vmemmap_populate_basepages` provided by the generic memory | |
175 | management. | |
176 | ||
177 | The virtually mapped memory map allows storing `struct page` objects | |
178 | for persistent memory devices in pre-allocated storage on those | |
179 | devices. This storage is represented with :c:type:`struct vmem_altmap` | |
180 | that is eventually passed to vmemmap_populate() through a long chain | |
181 | of function calls. The vmemmap_populate() implementation may use the | |
182 | `vmem_altmap` along with :c:func:`altmap_alloc_block_buf` helper to | |
183 | allocate the memory map on the persistent memory device. | |
a0653406 DW |
184 | |
185 | ZONE_DEVICE | |
186 | =========== | |
187 | The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer | |
188 | `struct page` `mem_map` services for device driver identified physical | |
189 | address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact | |
190 | that the page objects for these address ranges are never marked online, | |
191 | and that a reference must be taken against the device, not just the page | |
192 | to keep the memory pinned for active use. `ZONE_DEVICE`, via | |
193 | :c:func:`devm_memremap_pages`, performs just enough memory hotplug to | |
194 | turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and | |
195 | :c:func:`get_user_pages` service for the given range of pfns. Since the | |
196 | page reference count never drops below 1, the page is never tracked as |
197 | free memory and the page's `struct list_head lru` space is repurposed | |
198 | for back referencing to the host device / driver that mapped the memory. | |
199 | ||
200 | While `SPARSEMEM` presents memory as a collection of sections, | |
201 | optionally collected into memory blocks, `ZONE_DEVICE` users have a need | |
202 | for smaller granularity of populating the `mem_map`. Given that | |
203 | `ZONE_DEVICE` memory is never marked online it is subsequently never | |
204 | subject to its memory ranges being exposed through the sysfs memory | |
205 | hotplug api on memory block boundaries. The implementation relies on | |
206 | this lack of user-api constraint to allow sub-section sized memory | |
207 | ranges to be specified to :c:func:`arch_add_memory`, the top-half of | |
208 | memory hotplug. Sub-section support allows for 2MB as the cross-arch | |
209 | common alignment granularity for :c:func:`devm_memremap_pages`. | |
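The 2MB granularity can be illustrated with a small alignment check. The constant and the function names are hypothetical and serve only to show the arithmetic; this is not the validation the kernel performs in :c:func:`devm_memremap_pages`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 2MB sub-section granularity mentioned in the text. */
#define SUBSECTION_SIZE (UINT64_C(1) << 21)

/* True when both the start address and the size are 2MB multiples. */
static bool range_is_subsection_aligned(uint64_t start, uint64_t size)
{
	return size != 0 &&
	       (start & (SUBSECTION_SIZE - 1)) == 0 &&
	       (size & (SUBSECTION_SIZE - 1)) == 0;
}

/* Number of sub-sections an aligned range covers. */
static uint64_t subsection_count(uint64_t start, uint64_t size)
{
	assert(range_is_subsection_aligned(start, size));
	return size / SUBSECTION_SIZE;
}
```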
210 | ||
211 | The users of `ZONE_DEVICE` are: | |
212 | ||
213 | * pmem: Map platform persistent memory to be used as a direct-I/O target | |
214 | via DAX mappings. | |
215 | ||
216 | * hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()` | |
217 | event callbacks to allow a device-driver to coordinate memory management | |
218 | events related to device-memory, typically GPU memory. See | |
219 | Documentation/vm/hmm.rst. | |
220 | ||
221 | * p2pdma: Create `struct page` objects to allow peer devices in a | |
222 | PCI/-E topology to coordinate direct-DMA operations between themselves, | |
223 | i.e. bypass host memory. |