.. SPDX-License-Identifier: GPL-2.0

=========================================
A vmemmap diet for HugeTLB and Device DAX
=========================================

HugeTLB
=======

This section explains how HugeTLB Vmemmap Optimization (HVO) works.

A ``struct page`` describes a physical page frame. By default, there is a
one-to-one mapping from a page frame to its corresponding ``struct page``.

HugeTLB pages consist of multiple base pages and are supported by many
architectures. See Documentation/admin-guide/mm/hugetlbpage.rst for more
details. On the x86-64 architecture, HugeTLB pages of size 2MB and 1GB are
currently supported. Since the base page size on x86 is 4KB, a 2MB HugeTLB page
consists of 512 base pages and a 1GB HugeTLB page consists of 262144 base pages.
For each base page, there is a corresponding ``struct page``.

Within the HugeTLB subsystem, only the first 4 ``struct page`` are used to
contain unique information about a HugeTLB page. ``__NR_USED_SUBPAGE`` provides
this upper limit. The only 'useful' information in the remaining ``struct page``
is the compound_head field, and this field is the same for all tail pages.

By removing the redundant ``struct page`` for HugeTLB pages, memory can be
returned to the buddy allocator for other uses.

Different architectures support different HugeTLB page sizes. For example, the
following table lists the HugeTLB page sizes supported by the x86 and arm64
architectures. Because arm64 supports 4k, 16k, and 64k base pages as well as
contiguous entries, it supports many HugeTLB page sizes.

+--------------+-----------+-----------------------------------------------+
| Architecture | Page Size |                HugeTLB Page Size              |
+--------------+-----------+-----------+-----------+-----------+-----------+
|    x86-64    |    4KB    |    2MB    |    1GB    |           |           |
+--------------+-----------+-----------+-----------+-----------+-----------+
|              |    4KB    |   64KB    |    2MB    |    32MB   |    1GB    |
|              +-----------+-----------+-----------+-----------+-----------+
|    arm64     |   16KB    |    2MB    |   32MB    |     1GB   |           |
|              +-----------+-----------+-----------+-----------+-----------+
|              |   64KB    |    2MB    |  512MB    |    16GB   |           |
+--------------+-----------+-----------+-----------+-----------+-----------+

When the system boots up, every HugeTLB page has more than one ``struct page``
struct, whose total size is (unit: pages)::

    struct_size = HugeTLB_Size / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE

Where HugeTLB_Size is the size of the HugeTLB page. The size of a HugeTLB page
is always n times PAGE_SIZE, so we get the following relationship::

    HugeTLB_Size = n * PAGE_SIZE

Then::

    struct_size = n * PAGE_SIZE / PAGE_SIZE * sizeof(struct page) / PAGE_SIZE
                = n * sizeof(struct page) / PAGE_SIZE

We can use huge mapping at the pud/pmd level for the HugeTLB page.

For a HugeTLB page mapped at the pmd level::

    struct_size = n * sizeof(struct page) / PAGE_SIZE
                = PAGE_SIZE / sizeof(pte_t) * sizeof(struct page) / PAGE_SIZE
                = sizeof(struct page) / sizeof(pte_t)
                = 64 / 8
                = 8 (pages)

Where n is the number of pte entries that one page can contain, so the value
of n is (PAGE_SIZE / sizeof(pte_t)).

This optimization only supports 64-bit systems, so the value of sizeof(pte_t)
is 8. The optimization is also applicable only when the size of ``struct page``
is a power of two. In most cases, the size of ``struct page`` is 64 bytes (e.g.
x86-64 and arm64). So if we use pmd level mapping for a HugeTLB page, its
``struct page`` structs occupy 8 page frames, whose size depends on the size of
the base page.

For a HugeTLB page mapped at the pud level::

    struct_size = PAGE_SIZE / sizeof(pmd_t) * struct_size(pmd)
                = PAGE_SIZE / 8 * 8 (pages)
                = PAGE_SIZE (pages)

Where struct_size(pmd) is the size of the ``struct page`` structs of a
HugeTLB page mapped at the pmd level.

E.g.: on x86_64, the ``struct page`` structs of a 2MB HugeTLB page occupy 8
page frames, while those of a 1GB HugeTLB page occupy 4096.

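The formulas above can be checked with a few lines of userspace C. This is
only a sketch of the arithmetic, not kernel code: the function names are
hypothetical, and the 64-byte ``struct page`` and 8-byte page table entry are
the assumptions stated in the text.

```c
/* Assumed sizes, taken from the text: 64-byte struct page, 8-byte entries. */
#define STRUCT_PAGE_SIZE 64ULL
#define PT_ENTRY_SIZE     8ULL

/* General formula: struct_size = n * sizeof(struct page) / PAGE_SIZE,
 * where n = HugeTLB_Size / PAGE_SIZE. The result is in pages. */
static unsigned long long vmemmap_pages(unsigned long long hugetlb_size,
                                        unsigned long long page_size)
{
    unsigned long long n = hugetlb_size / page_size;

    return n * STRUCT_PAGE_SIZE / page_size;
}

/* pmd-level mapping: n = PAGE_SIZE / sizeof(pte_t). */
static unsigned long long vmemmap_pages_pmd(unsigned long long page_size)
{
    return (page_size / PT_ENTRY_SIZE) * STRUCT_PAGE_SIZE / page_size;
}

/* pud-level mapping: PAGE_SIZE / sizeof(pmd_t) times the pmd-level result. */
static unsigned long long vmemmap_pages_pud(unsigned long long page_size)
{
    return (page_size / PT_ENTRY_SIZE) * vmemmap_pages_pmd(page_size);
}
```

With a 4KB base page, ``vmemmap_pages(2ULL << 20, 4096)`` and
``vmemmap_pages_pmd(4096)`` both evaluate to 8, and the 1GB (pud-level) case
evaluates to 4096, matching the example above.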
Next, we take the pmd level mapping of the HugeTLB page as an example to
show the internal implementation of this optimization. There are 8 pages of
``struct page`` structs associated with a HugeTLB page which is pmd mapped.

Here is how things look before optimization::

       HugeTLB                    struct pages(8 pages)         page frame(8 pages)
    +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
    |           |                     |     0     | -------------> |     0     |
    |           |                     +-----------+                +-----------+
    |           |                     |     1     | -------------> |     1     |
    |           |                     +-----------+                +-----------+
    |           |                     |     2     | -------------> |     2     |
    |           |                     +-----------+                +-----------+
    |           |                     |     3     | -------------> |     3     |
    |           |                     +-----------+                +-----------+
    |           |                     |     4     | -------------> |     4     |
    |    PMD    |                     +-----------+                +-----------+
    |   level   |                     |     5     | -------------> |     5     |
    |  mapping  |                     +-----------+                +-----------+
    |           |                     |     6     | -------------> |     6     |
    |           |                     +-----------+                +-----------+
    |           |                     |     7     | -------------> |     7     |
    |           |                     +-----------+                +-----------+
    |           |
    |           |
    |           |
    +-----------+

The value of page->compound_head is the same for all tail pages. The first
page of ``struct page`` (page 0) associated with the HugeTLB page contains the 4
``struct page`` necessary to describe the HugeTLB page. The only use of the
remaining pages of ``struct page`` (page 1 to page 7) is to point to
page->compound_head. Therefore, we can remap pages 1 to 7 to page 0. Only 1 page
of ``struct page`` will be used for each HugeTLB page. This will allow us to
free the remaining 7 pages to the buddy allocator.

Here is how things look after remapping::

       HugeTLB                    struct pages(8 pages)         page frame(8 pages)
    +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
    |           |                     |     0     | -------------> |     0     |
    |           |                     +-----------+                +-----------+
    |           |                     |     1     | ---------------^ ^ ^ ^ ^ ^ ^
    |           |                     +-----------+                  | | | | | |
    |           |                     |     2     | -----------------+ | | | | |
    |           |                     +-----------+                    | | | | |
    |           |                     |     3     | -------------------+ | | | |
    |           |                     +-----------+                      | | | |
    |           |                     |     4     | ---------------------+ | | |
    |    PMD    |                     +-----------+                        | | |
    |   level   |                     |     5     | -----------------------+ | |
    |  mapping  |                     +-----------+                          | |
    |           |                     |     6     | -------------------------+ |
    |           |                     +-----------+                            |
    |           |                     |     7     | ---------------------------+
    |           |                     +-----------+
    |           |
    |           |
    |           |
    +-----------+

When a HugeTLB page is freed to the buddy system, we should allocate 7 pages
for vmemmap pages and restore the previous mapping relationship.

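The remap and restore steps for a pmd-mapped HugeTLB page can be modelled with
a small array that records which page frame backs each vmemmap page. This is a
toy userspace model, not the kernel implementation; the type and function
names are hypothetical.

```c
#define VMEMMAP_PAGES 8   /* pmd-mapped HugeTLB page, as in the text */

/* frame[i] is the index of the page frame backing vmemmap page i. */
struct vmemmap_model {
    int frame[VMEMMAP_PAGES];
};

/* Before optimization: every vmemmap page has its own page frame. */
static void vmemmap_init(struct vmemmap_model *m)
{
    for (int i = 0; i < VMEMMAP_PAGES; i++)
        m->frame[i] = i;
}

/* Remap vmemmap pages 1..7 to frame 0.
 * Returns the number of frames that can go back to the buddy allocator. */
static int vmemmap_optimize(struct vmemmap_model *m)
{
    int freed = 0;

    for (int i = 1; i < VMEMMAP_PAGES; i++) {
        if (m->frame[i] != 0) {
            m->frame[i] = 0;
            freed++;
        }
    }
    return freed;
}

/* Undo the remap when the HugeTLB page is freed.
 * Returns the number of frames that had to be allocated again. */
static int vmemmap_restore(struct vmemmap_model *m)
{
    int allocated = 0;

    for (int i = 1; i < VMEMMAP_PAGES; i++) {
        if (m->frame[i] == 0) {
            m->frame[i] = i;
            allocated++;
        }
    }
    return allocated;
}
```

In this model, ``vmemmap_optimize()`` reports 7 freed frames and
``vmemmap_restore()`` reports 7 allocated frames, which mirrors the two
diagrams and the restore step described above.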
A HugeTLB page mapped at the pud level is handled similarly. We can use the
same approach to free (PAGE_SIZE - 1) vmemmap pages.

Apart from HugeTLB pages mapped at the pmd/pud level, some architectures
(e.g. aarch64) provide a contiguous bit in the translation table entries
that hints to the MMU that the entry is one of a contiguous set of entries
that can be cached in a single TLB entry.

The contiguous bit is used to increase the mapping size at the pmd and pte
(last) level. So this type of HugeTLB page can be optimized only when the
size of its ``struct page`` structs is greater than **1** page.

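The "greater than 1 page" rule, together with the power-of-two requirement
mentioned earlier, can be sketched as a predicate. This is an illustration
under the assumption of a 64-byte ``struct page``; ``hvo_optimizable()`` is a
hypothetical name, not a kernel function.

```c
#include <stdbool.h>

#define STRUCT_PAGE_SIZE 64ULL   /* assumed, as in the text */

static bool is_power_of_two(unsigned long long x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

/* Hypothetical helper: true when the struct page structs of one HugeTLB
 * page occupy more than one base page, so there is a tail vmemmap page
 * that could be remapped. */
static bool hvo_optimizable(unsigned long long hugetlb_size,
                            unsigned long long page_size)
{
    unsigned long long vmemmap_bytes =
        hugetlb_size / page_size * STRUCT_PAGE_SIZE;

    if (!is_power_of_two(STRUCT_PAGE_SIZE))
        return false;

    return vmemmap_bytes > page_size;
}
```

Under these assumptions, a 64KB HugeTLB page with 4KB base pages needs only
1KB of ``struct page`` structs and a 2MB HugeTLB page with 16KB base pages
needs 8KB, so neither exceeds one base page, whereas a 32MB HugeTLB page with
16KB base pages does.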
Notice: The head vmemmap page is not freed to the buddy allocator and all
tail vmemmap pages are mapped to the head vmemmap page frame. So we can see
more than one ``struct page`` struct with ``PG_head`` (e.g. 8 per 2 MB HugeTLB
page) associated with each HugeTLB page. ``compound_head()`` can handle this
correctly. There is only **one** head ``struct page``; the tail
``struct page`` with ``PG_head`` are fake head ``struct page``. We need an
approach to distinguish between those two different types of ``struct page`` so
that ``compound_head()`` can return the real head ``struct page`` when the
parameter is a tail ``struct page`` with ``PG_head`` set. The following code
snippet describes how to distinguish between real and fake head ``struct page``.

.. code-block:: c

    if (test_bit(PG_head, &page->flags)) {
        unsigned long head = READ_ONCE(page[1].compound_head);

        if (head & 1) {
            if (head == (unsigned long)page + 1) {
                /* head struct page */
            } else {
                /* tail struct page */
            }
        } else {
            /* head struct page */
        }
    }

We can safely access the fields of **page[1]** when ``PG_head`` is set because
the page is a compound page composed of at least two contiguous pages.
For the implementation, refer to ``page_fixed_fake_head()``.

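To see why the check works, the aliasing can be emulated in userspace. The
sketch below is a model under stated assumptions, not the kernel's
``page_fixed_fake_head()``: ``struct page_model``, ``load()``,
``fixed_fake_head()``, the ``PG_HEAD_BIT`` value and the ``VMEMMAP_BASE``
address are all hypothetical, and virtual addresses are plain integers that
``load()`` translates to the single retained page frame.

```c
#define PG_HEAD_BIT      (1UL << 0)   /* stand-in for PG_head; position arbitrary */
#define STRUCT_PAGE_SIZE 64UL         /* assumed, as in the text */
#define STRUCTS_PER_PAGE 64UL         /* 4096 / 64 */
#define VMEMMAP_BASE     0x100000UL   /* hypothetical address of the head struct page */

/* Reduced stand-in for struct page: only the fields the check reads. */
struct page_model {
    unsigned long flags;
    unsigned long compound_head;
};

/* The one vmemmap page frame that is kept after optimization. */
static struct page_model frame0[STRUCTS_PER_PAGE];

static void model_init(void)
{
    frame0[0].flags = PG_HEAD_BIT;    /* the real head struct page */
    frame0[0].compound_head = 0;
    for (unsigned long i = 1; i < STRUCTS_PER_PAGE; i++) {
        frame0[i].flags = 0;
        frame0[i].compound_head = VMEMMAP_BASE + 1;   /* head address, bit 0 set */
    }
}

/* Emulated page tables: every vmemmap page of this HugeTLB page
 * resolves to frame 0, so offsets repeat every STRUCTS_PER_PAGE structs. */
static const struct page_model *load(unsigned long vaddr)
{
    unsigned long idx = (vaddr - VMEMMAP_BASE) / STRUCT_PAGE_SIZE;

    return &frame0[idx % STRUCTS_PER_PAGE];
}

/* Returns the address of the real head for a real or fake head,
 * and the address itself for anything without PG_head. */
static unsigned long fixed_fake_head(unsigned long vaddr)
{
    if (load(vaddr)->flags & PG_HEAD_BIT) {
        unsigned long head = load(vaddr + STRUCT_PAGE_SIZE)->compound_head;

        if (head & 1)
            return head - 1;
    }
    return vaddr;
}
```

In this model the first ``struct page`` of each of the 8 vmemmap pages reads
back with ``PG_head`` set, yet ``fixed_fake_head()`` resolves all of them to
``VMEMMAP_BASE``, because ``page[1].compound_head`` always names the real head.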
Device DAX
==========

The device-dax interface uses the same tail deduplication technique explained
in the previous chapter, except when used with the vmemmap in
the device (altmap).

The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
For powerpc equivalent details see Documentation/arch/powerpc/vmemmap_dedup.rst

The differences with HugeTLB are relatively minor.

It only uses 3 ``struct page`` for storing all information as opposed
to 4 on HugeTLB pages.

There's no remapping of vmemmap given that device-dax memory is not part of
System RAM ranges initialized at boot. Thus the tail page deduplication
happens at a later stage when we populate the sections. HugeTLB reuses the
head vmemmap page, whereas device-dax reuses the tail vmemmap page. This
results in only half of the savings compared to HugeTLB.

Deduplicated tail pages are not mapped read-only.

Here's how things look on device-dax after the sections are populated::

    +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
    |           |                     |     0     | -------------> |     0     |
    |           |                     +-----------+                +-----------+
    |           |                     |     1     | -------------> |     1     |
    |           |                     +-----------+                +-----------+
    |           |                     |     2     | ----------------^ ^ ^ ^ ^ ^
    |           |                     +-----------+                   | | | | |
    |           |                     |     3     | ------------------+ | | | |
    |           |                     +-----------+                     | | | |
    |           |                     |     4     | --------------------+ | | |
    |    PMD    |                     +-----------+                       | | |
    |   level   |                     |     5     | ----------------------+ | |
    |  mapping  |                     +-----------+                         | |
    |           |                     |     6     | ------------------------+ |
    |           |                     +-----------+                           |
    |           |                     |     7     | --------------------------+
    |           |                     +-----------+
    |           |
    |           |
    |           |
    +-----------+