.. SPDX-License-Identifier: GPL-2.0

.. _physical_memory_model:

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. This range may,
however, contain small holes that are not accessible to the CPU.
There may also be several contiguous ranges at completely distinct
addresses. Finally, on NUMA systems different memory banks are
attached to different CPUs.

Linux abstracts this diversity using one of the two memory models:
FLATMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

All the memory models track the status of physical page frames using
struct page arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture-specific setup code should
call the :c:func:`free_area_init` function. However, the mappings array
is not usable until the call to :c:func:`memblock_free_all` that hands
all the memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture-specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the
`mem_map` array.

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.

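The FLATMEM conversion described above can be modelled in user space as
plain array indexing. The following is a minimal sketch, not the kernel
implementation: the simplified `struct page`, the `flat_pfn_to_page` and
`flat_page_to_pfn` names, and the values chosen for `ARCH_PFN_OFFSET` and
`NR_PAGES` are assumptions made for illustration.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Hypothetical layout: memory starts at PFN 0x80000 and spans 1024 frames. */
#define ARCH_PFN_OFFSET 0x80000UL
#define NR_PAGES        1024UL

/* One global memory map that covers every page frame, holes included. */
static struct page mem_map[NR_PAGES];

/* PFN - ARCH_PFN_OFFSET is the index into mem_map. */
static struct page *flat_pfn_to_page(unsigned long pfn)
{
	return mem_map + (pfn - ARCH_PFN_OFFSET);
}

/* The reverse: the index in mem_map plus ARCH_PFN_OFFSET is the PFN. */
static unsigned long flat_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

Both directions are a single addition or subtraction, which is why
FLATMEM is the cheapest model when physical memory is (mostly)
contiguous.
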
SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with struct mem_section
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored with some other magic
that aids the sections management. The section size and the maximal
number of sections are specified using the `SECTION_SIZE_BITS` and
`MAX_PHYSMEM_BITS` constants defined by each architecture that
supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the actual width of a
physical address that an architecture supports, the
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}

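As a worked example of the formula above, the sketch below plugs in one
pair of values. The numbers (27 and 46) are illustrative assumptions, not
the values of any particular architecture, and the two helper functions
are hypothetical names used only to expose the results.

```c
/* Illustrative values only; each architecture defines its own. */
#define SECTION_SIZE_BITS 27	/* 2^27 bytes = 128 MiB per section */
#define MAX_PHYSMEM_BITS  46	/* 2^46 bytes of physical address space */

/* NR_MEM_SECTIONS = 2^(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS) */
#define SECTIONS_SHIFT  (MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
#define NR_MEM_SECTIONS (1UL << SECTIONS_SHIFT)

/* Size of one section in bytes. */
static unsigned long section_size_bytes(void)
{
	return 1UL << SECTION_SIZE_BITS;
}

/* Maximal number of sections for the values above: 2^19. */
static unsigned long nr_mem_sections(void)
{
	return NR_MEM_SECTIONS;
}
```

With these assumed values there can be at most 2^19 = 524288 sections of
128 MiB each. A larger `SECTION_SIZE_BITS` reduces the number of
sections at the cost of coarser granularity.
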
The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains PAGE_SIZE worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.

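The two-dimensional layout used when `CONFIG_SPARSEMEM_EXTREME` is
enabled can be sketched in user space as a table of row pointers. This
is a simplified model under stated assumptions: the one-field `struct
mem_section`, the `section_lookup` helper, the chosen `PAGE_SIZE` and
`NR_MEM_SECTIONS` values, and the use of `calloc` with a static table of
row pointers are all illustrative and do not mirror the kernel code.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in; the real struct mem_section holds more state. */
struct mem_section {
	unsigned long section_mem_map;
};

/* Hypothetical sizes chosen for the sketch. */
#define PAGE_SIZE         4096UL
#define NR_MEM_SECTIONS   (1UL << 19)
/* Each row holds PAGE_SIZE worth of mem_section objects. */
#define SECTIONS_PER_ROOT (PAGE_SIZE / sizeof(struct mem_section))
/* Enough rows to fit all the memory sections. */
#define NR_SECTION_ROOTS \
	((NR_MEM_SECTIONS + SECTIONS_PER_ROOT - 1) / SECTIONS_PER_ROOT)

/* Row pointers; a row is allocated only when one of its sections is used. */
static struct mem_section *mem_sections[NR_SECTION_ROOTS];

/*
 * Return the mem_section for section_nr. When create is non-zero, a
 * missing row is allocated; otherwise NULL is returned for it.
 */
static struct mem_section *section_lookup(unsigned long section_nr, int create)
{
	unsigned long root = section_nr / SECTIONS_PER_ROOT;
	unsigned long idx = section_nr % SECTIONS_PER_ROOT;

	if (section_nr >= NR_MEM_SECTIONS)
		return NULL;
	if (!mem_sections[root]) {
		if (!create)
			return NULL;
		mem_sections[root] = calloc(SECTIONS_PER_ROOT,
					    sizeof(struct mem_section));
		if (!mem_sections[root])
			return NULL;
	}
	return &mem_sections[root][idx];
}
```

Rows that describe address ranges without memory are never allocated,
which is the point of the dynamic variant on systems with a very sparse
physical address space.
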
The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page`: "classic sparse" and "sparse vmemmap".
The selection is made at build time and is determined by the value of
`CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in page->flags
and uses the high bits of a PFN to access the section that maps that
page frame. Inside a section, the PFN is the index to the array of pages.

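A user-space sketch of the classic sparse lookup is shown below. It is
deliberately simplified and rests on assumptions: `page->flags` holds
nothing but the section number, `section_mem_map` is a plain pointer
without the extra encoding mentioned earlier, the index inside a section
is taken from the low bits of the PFN, and all sizes and function names
(`sparse_sketch_init`, `sparse_pfn_to_page`, `sparse_page_to_pfn`) are
hypothetical.

```c
#include <stddef.h>

/* Simplified stand-in: flags carries only the section number here. */
struct page {
	unsigned long flags;
};

/* Hypothetical sizes: 4 KiB pages, 64 KiB sections, 8 sections. */
#define PAGE_SHIFT        12
#define SECTION_SIZE_BITS 16
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION (1UL << PFN_SECTION_SHIFT)
#define NR_MEM_SECTIONS   8UL

struct mem_section {
	struct page *section_mem_map;	/* plain pointer in this sketch */
};

static struct page section_pages[NR_MEM_SECTIONS][PAGES_PER_SECTION];
static struct mem_section mem_sections[NR_MEM_SECTIONS];

/* Point every section at its page array and tag pages with the section. */
static void sparse_sketch_init(void)
{
	unsigned long nr, i;

	for (nr = 0; nr < NR_MEM_SECTIONS; nr++) {
		mem_sections[nr].section_mem_map = section_pages[nr];
		for (i = 0; i < PAGES_PER_SECTION; i++)
			section_pages[nr][i].flags = nr;
	}
}

/* High bits of the PFN select the section, low bits the page inside it. */
static struct page *sparse_pfn_to_page(unsigned long pfn)
{
	unsigned long nr = pfn >> PFN_SECTION_SHIFT;
	unsigned long idx = pfn & (PAGES_PER_SECTION - 1);

	return mem_sections[nr].section_mem_map + idx;
}

/* The section number stored in the page locates the owning section. */
static unsigned long sparse_page_to_pfn(const struct page *page)
{
	unsigned long nr = page->flags;
	unsigned long idx =
		(unsigned long)(page - mem_sections[nr].section_mem_map);

	return (nr << PFN_SECTION_SHIFT) + idx;
}
```

Compared with FLATMEM, each conversion costs an extra lookup of the
section, which is the overhead that sparse vmemmap is meant to avoid.
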
The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index to that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.

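With a virtually contiguous memory map, both conversions reduce to
pointer arithmetic on `vmemmap`. The sketch below models this in user
space; the static `vmemmap_backing` array stands in for the reserved
virtual range, and the function names and `NR_PAGES` value are
assumptions made for illustration.

```c
#include <stddef.h>

/* Simplified stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Hypothetical size of the modelled memory map. */
#define NR_PAGES 4096UL

/* Stands in for the virtual address range reserved by the architecture. */
static struct page vmemmap_backing[NR_PAGES];
static struct page *vmemmap = vmemmap_backing;

/* A PFN is an index into the virtually contiguous array. */
static struct page *vmemmap_pfn_to_page(unsigned long pfn)
{
	return vmemmap + pfn;
}

/* The offset of the struct page from vmemmap is its PFN. */
static unsigned long vmemmap_page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - vmemmap);
}
```

No section lookup is involved, so the conversion is as cheap as in
FLATMEM while the backing memory for unused parts of the range does not
have to be populated.
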
To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with struct vmem_altmap
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========
The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device-driver-identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online, it is never subject to its
memory ranges being exposed through the sysfs memory hotplug API on
memory block boundaries. The implementation relies on this lack of a
user-API constraint to allow sub-section sized memory ranges to be
specified to :c:func:`arch_add_memory`, the top-half of memory hotplug.
Sub-section support allows for 2MB as the cross-arch common alignment
granularity for :c:func:`devm_memremap_pages`.

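The 2MB alignment granularity mentioned above amounts to a simple mask
check on the start and size of a range. The helper below is a
hypothetical illustration of that arithmetic; `subsection_aligned` and
`SUBSECTION_SIZE` are names invented for this sketch and are not claimed
to match any kernel interface.

```c
#include <stdbool.h>
#include <stdint.h>

/* 2 MiB sub-section granularity, as stated in the text above. */
#define SUBSECTION_SIZE (2ULL << 20)

/*
 * Return true when a non-empty physical range starts and ends on
 * sub-section boundaries.
 */
static bool subsection_aligned(uint64_t start, uint64_t size)
{
	return size != 0 &&
	       (start & (SUBSECTION_SIZE - 1)) == 0 &&
	       (size & (SUBSECTION_SIZE - 1)) == 0;
}
```

For example, a 2 MiB range starting at 4 GiB passes the check, while the
same range shifted by one 4 KiB page does not.
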
The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/vm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/-E topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.