.. SPDX-License-Identifier: GPL-2.0

.. _physical_memory_model:

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. There could also be several contiguous ranges at
completely distinct addresses. And do not forget about NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of three memory models:
FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines which
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

.. note::
   At the time of this writing, DISCONTIGMEM is considered deprecated,
   although it is still in use by several architectures.

All the memory models track the status of physical page frames using
struct page arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture specific setup code should
call the :c:func:`free_area_init` function. However, the mappings array
is not usable until the call to :c:func:`memblock_free_all` that hands
all the memory to the page allocator.

An architecture may free parts of the `mem_map` array that do not cover the
actual physical pages. In such a case, the architecture specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the
`mem_map` array.

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.
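
The arithmetic above can be modelled in user space. The following is a
minimal sketch, not kernel code: `struct page` is reduced to a stand-in,
and the values of `ARCH_PFN_OFFSET` and `NR_FRAMES` are assumptions chosen
only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page; the real one is far larger. */
struct page {
	unsigned long flags;
};

/* Assumed for illustration: RAM starts at PFN 0x80000, 1024 frames long. */
#define ARCH_PFN_OFFSET	0x80000UL
#define NR_FRAMES	1024UL

/* The single global array that maps all of physical memory. */
static struct page mem_map[NR_FRAMES];

/* PFN - ARCH_PFN_OFFSET is the index into mem_map. */
static struct page *pfn_to_page(unsigned long pfn)
{
	assert(pfn >= ARCH_PFN_OFFSET && pfn - ARCH_PFN_OFFSET < NR_FRAMES);
	return mem_map + (pfn - ARCH_PFN_OFFSET);
}

/* The reverse: the index in mem_map plus ARCH_PFN_OFFSET is the PFN. */
static unsigned long page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - mem_map) + ARCH_PFN_OFFSET;
}
```

Both directions are plain pointer arithmetic on one array, which is why
FLATMEM conversions are the cheapest of the three models.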

DISCONTIGMEM
============

The DISCONTIGMEM model treats the physical memory as a collection of
`nodes` similarly to how Linux NUMA support does. For each node, Linux
constructs an independent memory management subsystem represented by
`struct pglist_data` (or `pg_data_t` for short). Among other
things, `pg_data_t` holds the `node_mem_map` array that maps
physical pages belonging to that node. The `node_start_pfn` field of
`pg_data_t` is the number of the first page frame belonging to that
node.

The architecture setup code should call :c:func:`free_area_init_node` for
each node in the system to initialize the `pg_data_t` object and its
`node_mem_map`.

Every `node_mem_map` behaves exactly as FLATMEM's `mem_map`:
every physical page frame in a node has a `struct page` entry in the
`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
`flags` field of the `struct page` encodes the node number of the
node hosting that page.

The conversion between a PFN and the `struct page` in the
DISCONTIGMEM model is slightly more complex, as it has to determine
which node hosts the physical page and which `pg_data_t` object
holds the `struct page`.

Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
to convert a PFN to the node number. The opposite conversion helper
:c:func:`page_to_nid` is generic as it uses the node number encoded in
`page->flags`.

Once the node number is known, the PFN relative to `node_start_pfn`
indexes the appropriate `node_mem_map` array to access the
`struct page`, and the offset of the `struct page` from the
`node_mem_map` plus `node_start_pfn` is the PFN of that page.
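
The two-step lookup can be sketched as a user-space model. This is an
illustration under stated assumptions, not the kernel implementation: the
two-node layout, the array sizes, the linear search in `pfn_to_nid()`, the
position of the node number in `flags` and the `discontig_init()` helper
are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct page {
	unsigned long flags;	/* low bits hold the node id in this model */
};

#define MAX_NODES	2
#define NODE_FRAMES	256UL
#define NODES_MASK	1UL	/* assumed: one bit is enough for two nodes */

typedef struct pglist_data {
	struct page *node_mem_map;
	unsigned long node_start_pfn;
	unsigned long node_spanned_pages;
} pg_data_t;

static struct page node0_map[NODE_FRAMES];
static struct page node1_map[NODE_FRAMES];

/* Assumed layout: node 0 starts at PFN 0, node 1 at PFN 0x40000. */
static pg_data_t node_data[MAX_NODES] = {
	{ node0_map, 0x00000UL, NODE_FRAMES },
	{ node1_map, 0x40000UL, NODE_FRAMES },
};

/* Record the hosting node in every page, as the model's setup step. */
static void discontig_init(void)
{
	for (int nid = 0; nid < MAX_NODES; nid++)
		for (unsigned long i = 0; i < NODE_FRAMES; i++)
			node_data[nid].node_mem_map[i].flags = (unsigned long)nid;
}

/* Architecture-provided in the kernel; a linear search in this model. */
static int pfn_to_nid(unsigned long pfn)
{
	for (int nid = 0; nid < MAX_NODES; nid++) {
		const pg_data_t *pgdat = &node_data[nid];

		if (pfn >= pgdat->node_start_pfn &&
		    pfn - pgdat->node_start_pfn < pgdat->node_spanned_pages)
			return nid;
	}
	return -1;	/* the PFN falls into a hole between nodes */
}

static int page_to_nid(const struct page *page)
{
	return (int)(page->flags & NODES_MASK);
}

static struct page *pfn_to_page(unsigned long pfn)
{
	int nid = pfn_to_nid(pfn);

	assert(nid >= 0);
	return node_data[nid].node_mem_map +
	       (pfn - node_data[nid].node_start_pfn);
}

static unsigned long page_to_pfn(const struct page *page)
{
	const pg_data_t *pgdat = &node_data[page_to_nid(page)];

	return (unsigned long)(page - pgdat->node_mem_map) +
	       pgdat->node_start_pfn;
}
```

Compared with FLATMEM, every conversion pays for one extra step: finding
the node before the array arithmetic can be done.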

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with struct mem_section
that contains `section_mem_map` that is, logically, a pointer to an
array of struct pages. However, it is stored with some other magic
that aids the sections management. The section size and maximal number
of sections are specified using `SECTION_SIZE_BITS` and
`MAX_PHYSMEM_BITS` constants defined by each architecture that
supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the actual width of a
physical address that an architecture supports, the
`SECTION_SIZE_BITS` is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}
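
As a worked example of this formula, the sketch below plugs in sample
values. The numbers for `PAGE_SHIFT`, `SECTION_SIZE_BITS` and
`MAX_PHYSMEM_BITS` are assumptions chosen for illustration; each
architecture defines its own. The helper names follow common kernel usage
but are written here as a standalone model.

```c
#include <assert.h>

/* Illustrative values only; real values are architecture specific. */
#define PAGE_SHIFT		12	/* 4 KiB pages */
#define SECTION_SIZE_BITS	27	/* 128 MiB sections */
#define MAX_PHYSMEM_BITS	46

#define SECTIONS_SHIFT		(MAX_PHYSMEM_BITS - SECTION_SIZE_BITS)
#define NR_MEM_SECTIONS		(1UL << SECTIONS_SHIFT)

/* Number of PFN bits that select a page inside one section. */
#define PFN_SECTION_SHIFT	(SECTION_SIZE_BITS - PAGE_SHIFT)
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)

/* The high bits of a PFN are the number of the section that maps it. */
static unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

/* First PFN covered by a given section. */
static unsigned long section_nr_to_pfn(unsigned long sec)
{
	assert(sec < NR_MEM_SECTIONS);
	return sec << PFN_SECTION_SHIFT;
}
```

With these sample values there are 2^19 = 524288 possible sections, each
covering 2^15 = 32768 page frames.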

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains PAGE_SIZE worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections.
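
The row and column arithmetic for both layouts can be sketched as follows.
This is a simplified model: the `extreme` argument stands in for the
build-time `CONFIG_SPARSEMEM_EXTREME` choice, the two-field
`struct mem_section` and the constants are assumptions, and the function
names are lower-case stand-ins rather than the kernel's own macros.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in; the real structure has other members. */
struct mem_section {
	unsigned long section_mem_map;
	unsigned long pad;
};

/* Assumed values for illustration. */
#define PAGE_SIZE	4096UL
#define NR_MEM_SECTIONS	(1UL << 19)

/* Objects per row: one without EXTREME, a page worth of them with it. */
static unsigned long sections_per_root(int extreme)
{
	return extreme ? PAGE_SIZE / sizeof(struct mem_section) : 1UL;
}

/* Number of rows, rounded up so that every section has a slot. */
static unsigned long nr_section_roots(int extreme)
{
	unsigned long per_root = sections_per_root(extreme);

	return (NR_MEM_SECTIONS + per_root - 1) / per_root;
}

/* Row that holds a given section number. */
static unsigned long section_nr_to_root(unsigned long sec, int extreme)
{
	assert(sec < NR_MEM_SECTIONS);
	return sec / sections_per_root(extreme);
}

/* Position of the section inside its row. */
static unsigned long section_nr_to_col(unsigned long sec, int extreme)
{
	assert(sec < NR_MEM_SECTIONS);
	return sec % sections_per_root(extreme);
}
```

The dynamic layout lets rows be allocated only for sections that exist,
which matters when the physical address space is large and sparsely
populated.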

The architecture setup code should call sparse_init() to
initialize the memory sections and the memory maps.

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page`: "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in `page->flags`
and uses the high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index to the array of pages.
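
A user-space model of the classic sparse lookup might look like the sketch
below. The tiny geometry (16 pages per section, 8 sections, two of them
populated), the plain pointer in `section_mem_map` and the
`sparse_model_init()` helper are assumptions made for illustration. The
model indexes each per-section array with the low bits of the PFN; the
kernel instead stores `section_mem_map` in an encoded form so that the
whole PFN can serve as the index.

```c
#include <assert.h>
#include <stddef.h>

struct page {
	unsigned long flags;	/* holds the section number in this model */
};

/* Tiny illustrative geometry: 16 pages per section, 8 sections. */
#define PFN_SECTION_SHIFT	4
#define PAGES_PER_SECTION	(1UL << PFN_SECTION_SHIFT)
#define NR_MEM_SECTIONS		8UL

struct mem_section {
	struct page *section_mem_map;	/* NULL if the section is absent */
};

/* Only sections 1 and 5 are populated; the others are holes. */
static struct page sec1_pages[PAGES_PER_SECTION];
static struct page sec5_pages[PAGES_PER_SECTION];

static struct mem_section mem_sections[NR_MEM_SECTIONS] = {
	[1] = { sec1_pages },
	[5] = { sec5_pages },
};

/* Record the section number in the flags of every populated page. */
static void sparse_model_init(void)
{
	for (unsigned long sec = 0; sec < NR_MEM_SECTIONS; sec++) {
		struct page *map = mem_sections[sec].section_mem_map;

		for (unsigned long i = 0; map && i < PAGES_PER_SECTION; i++)
			map[i].flags = sec;
	}
}

/* High PFN bits pick the section, low bits pick the page inside it. */
static struct page *pfn_to_page(unsigned long pfn)
{
	unsigned long sec = pfn >> PFN_SECTION_SHIFT;

	assert(sec < NR_MEM_SECTIONS && mem_sections[sec].section_mem_map);
	return mem_sections[sec].section_mem_map +
	       (pfn & (PAGES_PER_SECTION - 1));
}

/* The section number comes from page->flags, the rest from the offset. */
static unsigned long page_to_pfn(const struct page *page)
{
	unsigned long sec = page->flags;

	return (sec << PFN_SECTION_SHIFT) +
	       (unsigned long)(page - mem_sections[sec].section_mem_map);
}
```

Only populated sections need backing arrays, so holes in the physical
address space cost one empty `mem_section` entry each in this model.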

The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global
`struct page *vmemmap` pointer that points to a virtually contiguous
array of `struct page` objects. A PFN is an index to that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.
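
With vmemmap both conversions collapse to a single addition or
subtraction. The sketch below models that in user space; the fully
populated `vmemmap_backing` array and its size are assumptions, whereas in
the kernel only the parts of the virtual range that cover present memory
are backed by physical pages.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct page. */
struct page {
	unsigned long flags;
};

/* Assumed for illustration: a small, fully backed memory map. */
#define NR_FRAMES	4096UL

static struct page vmemmap_backing[NR_FRAMES];

/* Global pointer to the start of the virtually contiguous array. */
static struct page *const vmemmap = vmemmap_backing;

/* A PFN is directly an index into the array behind vmemmap. */
static struct page *pfn_to_page(unsigned long pfn)
{
	assert(pfn < NR_FRAMES);
	return vmemmap + pfn;
}

/* The offset of a struct page from vmemmap is its PFN. */
static unsigned long page_to_pfn(const struct page *page)
{
	return (unsigned long)(page - vmemmap);
}
```

No section lookup and no access to `page->flags` is needed, which is the
optimization this variant is built around.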

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with struct vmem_altmap
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`vmemmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device driver identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the page,
to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` service for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online, it is subsequently never
subject to its memory ranges being exposed through the sysfs memory
hotplug API on memory block boundaries. The implementation relies on
this lack of user-API constraint to allow sub-section sized memory
ranges to be specified to :c:func:`arch_add_memory`, the top-half of
memory hotplug. Sub-section support allows for 2MB as the cross-arch
common alignment granularity for :c:func:`devm_memremap_pages`.
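
To make the 2MB granularity concrete, the sketch below checks whether a
physical range falls on sub-section boundaries. It is an illustration
only: `range_is_subsection_aligned()` and `subsections_in_range()` are
hypothetical helpers, and treating both the start and the size as subject
to the 2MB alignment is an assumption of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 2MB sub-sections, matching the granularity quoted above. */
#define SUBSECTION_SHIFT	21
#define SUBSECTION_SIZE		(UINT64_C(1) << SUBSECTION_SHIFT)
#define SUBSECTION_MASK		(SUBSECTION_SIZE - 1)

/* Hypothetical helper: do start and size both sit on 2MB boundaries? */
static bool range_is_subsection_aligned(uint64_t start, uint64_t size)
{
	return size != 0 &&
	       (start & SUBSECTION_MASK) == 0 &&
	       (size & SUBSECTION_MASK) == 0;
}

/* Hypothetical helper: how many sub-sections an aligned range spans. */
static uint64_t subsections_in_range(uint64_t start, uint64_t size)
{
	assert(range_is_subsection_aligned(start, size));
	return size >> SUBSECTION_SHIFT;
}
```

For example, a 6MB range starting at the 1GB boundary spans three
sub-sections, while a range starting 4KB past a 2MB boundary does not
qualify in this model.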

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/vm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/-E topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.