.. SPDX-License-Identifier: GPL-2.0

.. _physical_memory_model:

=====================
Physical Memory Model
=====================

Physical memory in a system may be addressed in different ways. The
simplest case is when the physical memory starts at address 0 and
spans a contiguous range up to the maximal address. It could be,
however, that this range contains small holes that are not accessible
to the CPU. Then there could be several contiguous ranges at
completely distinct addresses. And, don't forget about NUMA, where
different memory banks are attached to different CPUs.

Linux abstracts this diversity using one of the three memory models:
FLATMEM, DISCONTIGMEM and SPARSEMEM. Each architecture defines what
memory models it supports, what the default memory model is and
whether it is possible to manually override that default.

.. note::
   At the time of this writing, DISCONTIGMEM is considered deprecated,
   although it is still in use by several architectures.

All the memory models track the status of physical page frames using
:c:type:`struct page` arranged in one or more arrays.

Regardless of the selected memory model, there exists a one-to-one
mapping between the physical page frame number (PFN) and the
corresponding `struct page`.

Each memory model defines :c:func:`pfn_to_page` and :c:func:`page_to_pfn`
helpers that allow the conversion from PFN to `struct page` and vice
versa.

FLATMEM
=======

The simplest memory model is FLATMEM. This model is suitable for
non-NUMA systems with contiguous, or mostly contiguous, physical
memory.

In the FLATMEM memory model, there is a global `mem_map` array that
maps the entire physical memory. For most architectures, the holes
have entries in the `mem_map` array. The `struct page` objects
corresponding to the holes are never fully initialized.

To allocate the `mem_map` array, architecture-specific setup code
should call the :c:func:`free_area_init_node` function or its
convenience wrapper :c:func:`free_area_init`. However, the memory map
is not usable until the call to :c:func:`memblock_free_all` that hands
all the memory to the page allocator.

If an architecture enables the `CONFIG_ARCH_HAS_HOLES_MEMORYMODEL`
option, it may free parts of the `mem_map` array that do not cover
actual physical pages. In such a case, the architecture-specific
:c:func:`pfn_valid` implementation should take the holes in the
`mem_map` into account.

With FLATMEM, the conversion between a PFN and the `struct page` is
straightforward: `PFN - ARCH_PFN_OFFSET` is an index to the
`mem_map` array.

The `ARCH_PFN_OFFSET` defines the first page frame number for
systems with physical memory starting at an address different from 0.
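
With illustrative macro names, the conversion helpers amount to the
following sketch (the real generic helpers live in
`include/asm-generic/memory_model.h`):

.. code-block:: c

   /* A simplified sketch of the FLATMEM conversions: the PFN, offset
    * by the first page frame number, indexes the global mem_map. */
   #define flat_pfn_to_page(pfn)  (mem_map + ((pfn) - ARCH_PFN_OFFSET))
   #define flat_page_to_pfn(page) \
           ((unsigned long)((page) - mem_map) + ARCH_PFN_OFFSET)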

DISCONTIGMEM
============

The DISCONTIGMEM model treats the physical memory as a collection of
`nodes` similarly to how Linux NUMA support does. For each node Linux
constructs an independent memory management subsystem represented by
`struct pglist_data` (or `pg_data_t` for short). Among other
things, `pg_data_t` holds the `node_mem_map` array that maps
physical pages belonging to that node. The `node_start_pfn` field of
`pg_data_t` is the number of the first page frame belonging to that
node.

The architecture setup code should call :c:func:`free_area_init_node` for
each node in the system to initialize the `pg_data_t` object and its
`node_mem_map`.

Every `node_mem_map` behaves exactly as FLATMEM's `mem_map` -
every physical page frame in a node has a `struct page` entry in the
`node_mem_map` array. When DISCONTIGMEM is enabled, a portion of the
`flags` field of the `struct page` encodes the node number of the
node hosting that page.

The conversion between a PFN and the `struct page` in the
DISCONTIGMEM model becomes slightly more complex as it has to determine
which node hosts the physical page and which `pg_data_t` object
holds the `struct page`.

Architectures that support DISCONTIGMEM provide :c:func:`pfn_to_nid`
to convert a PFN to the node number. The opposite conversion helper
:c:func:`page_to_nid` is generic, as it uses the node number encoded in
page->flags.

Once the node number is known, the PFN can be used to index the
appropriate `node_mem_map` array to access the `struct page`, and
the offset of the `struct page` from the `node_mem_map` plus
`node_start_pfn` is the PFN of that page.
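
Put together, the conversions look roughly like the sketch below; the
`discontig_*` names are illustrative, while `NODE_DATA` and the
`pg_data_t` fields are the ones described above:

.. code-block:: c

   /* A simplified sketch of the DISCONTIGMEM conversions: resolve the
    * node first, then use its node_mem_map and node_start_pfn. */
   static struct page *discontig_pfn_to_page(unsigned long pfn)
   {
           struct pglist_data *pgdat = NODE_DATA(pfn_to_nid(pfn));

           return pgdat->node_mem_map + (pfn - pgdat->node_start_pfn);
   }

   static unsigned long discontig_page_to_pfn(struct page *page)
   {
           struct pglist_data *pgdat = NODE_DATA(page_to_nid(page));

           return (unsigned long)(page - pgdat->node_mem_map) +
                  pgdat->node_start_pfn;
   }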

SPARSEMEM
=========

SPARSEMEM is the most versatile memory model available in Linux and it
is the only memory model that supports several advanced features such
as hot-plug and hot-remove of the physical memory, alternative memory
maps for non-volatile memory devices and deferred initialization of
the memory map for larger systems.

The SPARSEMEM model presents the physical memory as a collection of
sections. A section is represented with :c:type:`struct mem_section`
that contains `section_mem_map`, which is, logically, a pointer to an
array of struct pages. However, it is stored with some other magic
that aids the management of the sections. The section size and the
maximal number of sections are specified using `SECTION_SIZE_BITS`
and `MAX_PHYSMEM_BITS` constants defined by each architecture that
supports SPARSEMEM. While `MAX_PHYSMEM_BITS` is the actual width of a
physical address that an architecture supports, `SECTION_SIZE_BITS`
is an arbitrary value.

The maximal number of sections is denoted `NR_MEM_SECTIONS` and
defined as

.. math::

   NR\_MEM\_SECTIONS = 2 ^ {(MAX\_PHYSMEM\_BITS - SECTION\_SIZE\_BITS)}
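
For example, with the purely illustrative values of 27 for
`SECTION_SIZE_BITS` and 46 for `MAX_PHYSMEM_BITS`, each section would
span :math:`2^{27}` bytes (128 MiB) and there would be :math:`2^{19}`
memory sections.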

The `mem_section` objects are arranged in a two-dimensional array
called `mem_sections`. The size and placement of this array depend
on `CONFIG_SPARSEMEM_EXTREME` and the maximal possible number of
sections:

* When `CONFIG_SPARSEMEM_EXTREME` is disabled, the `mem_sections`
  array is static and has `NR_MEM_SECTIONS` rows. Each row holds a
  single `mem_section` object.
* When `CONFIG_SPARSEMEM_EXTREME` is enabled, the `mem_sections`
  array is dynamically allocated. Each row contains PAGE_SIZE worth of
  `mem_section` objects and the number of rows is calculated to fit
  all the memory sections (see the lookup sketch after this list).
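
The sketch below shows how a section number can be resolved to its
`mem_section` object in the `CONFIG_SPARSEMEM_EXTREME` layout; the
names `nr_to_section` and `SECTIONS_PER_ROOT` are illustrative, the
kernel's real helper is :c:func:`__nr_to_section`:

.. code-block:: c

   /* Illustrative only: with CONFIG_SPARSEMEM_EXTREME the rows of
    * mem_sections are allocated on demand, and SECTIONS_PER_ROOT
    * stands for the number of mem_section objects per row. */
   static struct mem_section *nr_to_section(unsigned long nr)
   {
           unsigned long root = nr / SECTIONS_PER_ROOT;

           if (!mem_sections || !mem_sections[root])
                   return NULL;    /* no memory in these sections */

           return &mem_sections[root][nr % SECTIONS_PER_ROOT];
   }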

The architecture setup code should call :c:func:`memory_present` for
each active memory range or use :c:func:`memblocks_present` or
:c:func:`sparse_memory_present_with_active_regions` wrappers to
initialize the memory sections. Next, the actual memory maps should be
set up using :c:func:`sparse_init`.
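
As a rough sketch, an architecture that describes its memory with
memblock might wire this up as follows; `arch_mem_init` is only a
placeholder name for the architecture's own initialization path:

.. code-block:: c

   /* A minimal sketch, assuming a memblock based platform. */
   void __init arch_mem_init(void)
   {
           /* mark every memblock range as present, section by section */
           memblocks_present();

           /* allocate and populate the memory maps for those sections */
           sparse_init();
   }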

With SPARSEMEM there are two possible ways to convert a PFN to the
corresponding `struct page` - a "classic sparse" and "sparse
vmemmap". The selection is made at build time and it is determined by
the value of `CONFIG_SPARSEMEM_VMEMMAP`.

The classic sparse encodes the section number of a page in page->flags
and uses high bits of a PFN to access the section that maps that page
frame. Inside a section, the PFN is the index to the array of pages.
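
In a simplified form, the classic sparse conversions can be sketched
as follows; the `sparse_*` wrapper names are illustrative, while the
section helpers come from `include/linux/mmzone.h`:

.. code-block:: c

   /* A simplified sketch of the "classic sparse" conversions; the
    * encoding of section_mem_map is omitted for clarity. */
   static struct page *sparse_pfn_to_page(unsigned long pfn)
   {
           /* the high PFN bits select the section ... */
           struct mem_section *sec = __pfn_to_section(pfn);

           /* ... and the PFN indexes the section's page array; the
            * stored mem_map pointer is pre-adjusted so that the full
            * PFN can be used as the index */
           return __section_mem_map_addr(sec) + pfn;
   }

   static unsigned long sparse_page_to_pfn(struct page *page)
   {
           /* the section number is encoded in page->flags */
           struct mem_section *sec =
                   __nr_to_section(page_to_section(page));

           return (unsigned long)(page - __section_mem_map_addr(sec));
   }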

The sparse vmemmap uses a virtually mapped memory map to optimize
pfn_to_page and page_to_pfn operations. There is a global `struct
page *vmemmap` pointer that points to a virtually contiguous array of
`struct page` objects. A PFN is an index to that array and the
offset of the `struct page` from `vmemmap` is the PFN of that
page.
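
In effect, the conversions reduce to pointer arithmetic on that array,
roughly (illustrative macro names):

.. code-block:: c

   /* A sketch of the vmemmap conversions: the PFN is the index into
    * the virtually contiguous vmemmap array. */
   #define vmemmap_pfn_to_page(pfn)   (vmemmap + (pfn))
   #define vmemmap_page_to_pfn(page)  ((unsigned long)((page) - vmemmap))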

To use vmemmap, an architecture has to reserve a range of virtual
addresses that will map the physical pages containing the memory
map and make sure that `vmemmap` points to that range. In addition,
the architecture should implement the :c:func:`vmemmap_populate` method
that will allocate the physical memory and create page tables for the
virtual memory map. If an architecture does not have any special
requirements for the vmemmap mappings, it can use the default
:c:func:`vmemmap_populate_basepages` provided by the generic memory
management.
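
For example, an architecture with no special requirements might simply
delegate to the generic helper. The prototypes below follow the
signatures around this kernel version and may differ in others:

.. code-block:: c

   /* A minimal sketch: build the virtual memory map with base pages. */
   int __meminit vmemmap_populate(unsigned long start, unsigned long end,
                                  int node, struct vmem_altmap *altmap)
   {
           return vmemmap_populate_basepages(start, end, node);
   }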

The virtually mapped memory map allows storing `struct page` objects
for persistent memory devices in pre-allocated storage on those
devices. This storage is represented with :c:type:`struct vmem_altmap`
that is eventually passed to vmemmap_populate() through a long chain
of function calls. The vmemmap_populate() implementation may use the
`vmem_altmap` along with the :c:func:`altmap_alloc_block_buf` helper to
allocate the memory map on the persistent memory device.

ZONE_DEVICE
===========

The `ZONE_DEVICE` facility builds upon `SPARSEMEM_VMEMMAP` to offer
`struct page` `mem_map` services for device-driver-identified physical
address ranges. The "device" aspect of `ZONE_DEVICE` relates to the fact
that the page objects for these address ranges are never marked online,
and that a reference must be taken against the device, not just the
page, to keep the memory pinned for active use. `ZONE_DEVICE`, via
:c:func:`devm_memremap_pages`, performs just enough memory hotplug to
turn on :c:func:`pfn_to_page`, :c:func:`page_to_pfn`, and
:c:func:`get_user_pages` services for the given range of pfns. Since the
page reference count never drops below 1, the page is never tracked as
free memory and the page's `struct list_head lru` space is repurposed
for back referencing to the host device / driver that mapped the memory.
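
As a hedged sketch of how a driver might use this facility: the field
and constant names below follow roughly this kernel version's `struct
dev_pagemap` and are illustrative only; consult
`include/linux/memremap.h` for the authoritative definition:

.. code-block:: c

   /* Illustrative only: give a device-owned physical range struct
    * pages so that pfn_to_page()/get_user_pages() work on it. */
   static void *example_map_device_memory(struct device *dev,
                                          phys_addr_t base,
                                          resource_size_t size)
   {
           struct dev_pagemap *pgmap;

           pgmap = devm_kzalloc(dev, sizeof(*pgmap), GFP_KERNEL);
           if (!pgmap)
                   return ERR_PTR(-ENOMEM);

           pgmap->res.start = base;
           pgmap->res.end = base + size - 1;
           pgmap->type = MEMORY_DEVICE_DEVDAX;  /* illustrative choice */

           /* performs the hotplug and creates the memory map */
           return devm_memremap_pages(dev, pgmap);
   }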

While `SPARSEMEM` presents memory as a collection of sections,
optionally collected into memory blocks, `ZONE_DEVICE` users have a need
for smaller granularity of populating the `mem_map`. Given that
`ZONE_DEVICE` memory is never marked online, it is never subject to
having its memory ranges exposed through the sysfs memory hotplug api
on memory block boundaries. The implementation relies on this lack of
user-api constraint to allow sub-section sized memory ranges to be
specified to :c:func:`arch_add_memory`, the top-half of memory hotplug.
Sub-section support allows for 2MB as the cross-arch common alignment
granularity for :c:func:`devm_memremap_pages`.

The users of `ZONE_DEVICE` are:

* pmem: Map platform persistent memory to be used as a direct-I/O target
  via DAX mappings.

* hmm: Extend `ZONE_DEVICE` with `->page_fault()` and `->page_free()`
  event callbacks to allow a device-driver to coordinate memory management
  events related to device-memory, typically GPU memory. See
  Documentation/vm/hmm.rst.

* p2pdma: Create `struct page` objects to allow peer devices in a
  PCI/-E topology to coordinate direct-DMA operations between themselves,
  i.e. bypass host memory.