==============================
Unevictable LRU Infrastructure
==============================

.. contents:: :local:


Introduction
============

This document describes the Linux memory manager's "Unevictable LRU"
infrastructure and the use of this to manage several types of "unevictable"
folios.

The document attempts to provide the overall rationale behind this mechanism
and the rationale for some of the design decisions that drove the
implementation.  The latter design rationale is discussed in the context of an
implementation description.  Admittedly, one can obtain the implementation
details - the "what does it do?" - by reading the code.  One hopes that the
descriptions below add value by providing the answer to "why does it do that?".


The Unevictable LRU
===================

The Unevictable LRU facility adds an additional LRU list to track unevictable
folios and to hide these folios from vmscan.  This mechanism is based on a
patch by Larry Woodman of Red Hat to address several scalability problems with
folio reclaim in Linux.  The problems have been observed at customer sites on
large memory x86_64 systems.

To illustrate this with an example, a non-NUMA x86_64 platform with 128GB of
main memory will have over 32 million 4k pages in a single node.  When a large
fraction of these pages are not evictable for any reason [see below], vmscan
will spend a lot of time scanning the LRU lists looking for the small fraction
of pages that are evictable.  This can result in a situation where all CPUs are
spending 100% of their time in vmscan for hours or days on end, with the system
completely unresponsive.

The unevictable list addresses the following classes of unevictable pages:

 * Those owned by ramfs.

 * Those owned by tmpfs with the noswap mount option.

 * Those mapped into SHM_LOCK'd shared memory regions.

 * Those mapped into VM_LOCKED [mlock()ed] VMAs.

The infrastructure may also be able to handle other conditions that make pages
unevictable, either by definition or by circumstance, in the future.


The Unevictable LRU Folio List
------------------------------

The Unevictable LRU folio list is a lie.  It was never an LRU-ordered list,
but a companion to the LRU-ordered anonymous and file, active and inactive
folio lists; and now it is not even a folio list.  But following familiar
convention, here in this document and in the source, we often imagine it as a
fifth LRU folio list.

The Unevictable LRU infrastructure consists of an additional, per-node, LRU
list called the "unevictable" list and an associated folio flag,
PG_unevictable, to indicate that the folio is being managed on the unevictable
list.

The PG_unevictable flag is analogous to, and mutually exclusive with, the
PG_active flag in that it indicates on which LRU list a folio resides when
PG_lru is set.

The Unevictable LRU infrastructure maintains unevictable folios as if they were
on an additional LRU list for a few reasons:

 (1) We get to "treat unevictable folios just like we treat other folios in the
     system - which means we get to use the same code to manipulate them, the
     same code to isolate them (for migrate, etc.), the same code to keep track
     of the statistics, etc..." [Rik van Riel]

 (2) We want to be able to migrate unevictable folios between nodes for memory
     defragmentation, workload management and memory hotplug.  The Linux kernel
     can only migrate folios that it can successfully isolate from the LRU
     lists (or "Movable" pages: outside of consideration here).  If we were to
     maintain folios elsewhere than on an LRU-like list, where they can be
     detected by folio_isolate_lru(), we would prevent their migration.

The unevictable list does not differentiate between file-backed and
anonymous, swap-backed folios.  This differentiation is only important
while the folios are, in fact, evictable.

The unevictable list benefits from the "arrayification" of the per-node LRU
lists and statistics originally proposed and posted by Christoph Lameter.


Memory Control Group Interaction
--------------------------------

The unevictable LRU facility interacts with the memory control group [aka
memory controller; see Documentation/admin-guide/cgroup-v1/memory.rst] by
extending the lru_list enum.

The memory controller data structure automatically gets a per-node unevictable
list as a result of the "arrayification" of the per-node LRU lists (one per
lru_list enum element).  The memory controller tracks the movement of pages to
and from the unevictable list.

When a memory control group comes under memory pressure, the controller will
not attempt to reclaim pages on the unevictable list.  This has a couple of
effects:

 (1) Because the pages are "hidden" from reclaim on the unevictable list, the
     reclaim process can be more efficient, dealing only with pages that have a
     chance of being reclaimed.

 (2) On the other hand, if too many of the pages charged to the control group
     are unevictable, the evictable portion of the working set of the tasks in
     the control group may not fit into the available memory.  This can cause
     the control group to thrash or to OOM-kill tasks.
120 | ||
a5e4da91 MR |
121 | .. _mark_addr_space_unevict: |
122 | ||
123 | Marking Address Spaces Unevictable | |
c24b7201 DH |
124 | ---------------------------------- |
125 | ||
126 | For facilities such as ramfs none of the pages attached to the address space | |
127 | may be evicted. To prevent eviction of any such pages, the AS_UNEVICTABLE | |
128 | address space flag is provided, and this can be manipulated by a filesystem | |
129 | using a number of wrapper functions: | |
130 | ||
a5e4da91 | 131 | * ``void mapping_set_unevictable(struct address_space *mapping);`` |
c24b7201 DH |
132 | |
133 | Mark the address space as being completely unevictable. | |
134 | ||
a5e4da91 | 135 | * ``void mapping_clear_unevictable(struct address_space *mapping);`` |
c24b7201 DH |
136 | |
137 | Mark the address space as being evictable. | |
138 | ||
a5e4da91 | 139 | * ``int mapping_unevictable(struct address_space *mapping);`` |
c24b7201 DH |
140 | |
141 | Query the address space, and return true if it is completely | |
142 | unevictable. | |
143 | ||
64e3d12f | 144 | These are currently used in three places in the kernel: |
c24b7201 DH |
145 | |
146 | (1) By ramfs to mark the address spaces of its inodes when they are created, | |
147 | and this mark remains for the life of the inode. | |
148 | ||
149 | (2) By SYSV SHM to mark SHM_LOCK'd address spaces until SHM_UNLOCK is called. | |
c24b7201 DH |
150 | Note that SHM_LOCK is not required to page in the locked pages if they're |
151 | swapped out; the application must touch the pages manually if it wants to | |
152 | ensure they're in memory. | |
153 | ||
64e3d12f KHY |
154 | (3) By the i915 driver to mark pinned address space until it's unpinned. The |
155 | amount of unevictable memory marked by i915 driver is roughly the bounded | |
156 | object size in debugfs/dri/0/i915_gem_objects. | |
157 | ||
c24b7201 | 158 | |
a5e4da91 | 159 | Detecting Unevictable Pages |
c24b7201 DH |
160 | --------------------------- |
161 | ||
90c9d13a | 162 | The function folio_evictable() in mm/internal.h determines whether a folio is |
a5e4da91 MR |
163 | evictable or not using the query function outlined above [see section |
164 | :ref:`Marking address spaces unevictable <mark_addr_space_unevict>`] | |
165 | to check the AS_UNEVICTABLE flag. | |
c24b7201 DH |
166 | |
167 | For address spaces that are so marked after being populated (as SHM regions | |
577e9846 | 168 | might be), the lock action (e.g. SHM_LOCK) can be lazy, and need not populate |
c24b7201 DH |
169 | the page tables for the region as does, for example, mlock(), nor need it make |
170 | any special effort to push any pages in the SHM_LOCK'd area to the unevictable | |
90c9d13a | 171 | list. Instead, vmscan will do this if and when it encounters the folios during |
c24b7201 DH |
172 | a reclamation scan. |
173 | ||
577e9846 | 174 | On an unlock action (such as SHM_UNLOCK), the unlocker (e.g. shmctl()) must scan |
c24b7201 DH |
175 | the pages in the region and "rescue" them from the unevictable list if no other |
176 | condition is keeping them unevictable. If an unevictable region is destroyed, | |
177 | the pages are also "rescued" from the unevictable list in the process of | |
178 | freeing them. | |
179 | ||
90c9d13a MWO |
180 | folio_evictable() also checks for mlocked folios by calling |
181 | folio_test_mlocked(), which is set when a folio is faulted into a | |
182 | VM_LOCKED VMA, or found in a VMA being VM_LOCKED. | |


Vmscan's Handling of Unevictable Folios
---------------------------------------

If unevictable folios are culled in the fault path, or moved to the unevictable
list at mlock() or mmap() time, vmscan will not encounter the folios until they
have become evictable again (via munlock() for example) and have been "rescued"
from the unevictable list.  However, there may be situations where we decide,
for the sake of expediency, to leave an unevictable folio on one of the regular
active/inactive LRU lists for vmscan to deal with.  vmscan checks for such
folios in all of the shrink_{active|inactive|page}_list() functions and will
"cull" such folios that it encounters: that is, it diverts those folios to the
unevictable list for the memory cgroup and node being scanned.

There may be situations where a folio is mapped into a VM_LOCKED VMA,
but the folio does not have the mlocked flag set.  Such folios will make
it all the way to shrink_active_list() or shrink_page_list() where they
will be detected when vmscan walks the reverse map in folio_referenced()
or try_to_unmap().  The folio is culled to the unevictable list when it
is released by the shrinker.

To "cull" an unevictable folio, vmscan simply puts the folio back on
the LRU list using folio_putback_lru() - the inverse operation to
folio_isolate_lru() - after dropping the folio lock.  Because the
condition which makes the folio unevictable may change once the folio
is unlocked, __pagevec_lru_add_fn() will recheck the unevictable state
of a folio before placing it on the unevictable list.


MLOCKED Pages
=============

The unevictable folio list is also useful for mlock(), in addition to ramfs and
SYSV SHM.  Note that mlock() is only available in CONFIG_MMU=y situations; in
NOMMU situations, all mappings are effectively mlocked.


History
-------

The "Unevictable mlocked Pages" infrastructure is based on work originally
posted by Nick Piggin in an RFC patch entitled "mm: mlocked pages off LRU".
Nick posted his patch as an alternative to a patch posted by Christoph Lameter
to achieve the same objective: hiding mlocked pages from vmscan.

In Nick's patch, he used one of the struct page LRU list link fields as a count
of VM_LOCKED VMAs that map the page (Rik van Riel had the same idea three years
earlier).  But this use of the link field for a count prevented the management
of the pages on an LRU list, and thus mlocked pages were not migratable as
isolate_lru_page() could not detect them, and the LRU list link field was not
available to the migration subsystem.

Nick resolved this by putting mlocked pages back on the LRU list before
attempting to isolate them, thus abandoning the count of VM_LOCKED VMAs.  When
Nick's patch was integrated with the Unevictable LRU work, the count was
replaced by walking the reverse map when munlocking, to determine whether any
other VM_LOCKED VMAs still mapped the page.

However, walking the reverse map for each page when munlocking was ugly and
inefficient, and could lead to catastrophic contention on a file's rmap lock,
when many processes which had it mlocked were trying to exit.  In 5.18, the
idea of keeping mlock_count in the Unevictable LRU list link field was revived
and put to work, without preventing the migration of mlocked pages.  This is
why the "Unevictable LRU list" cannot be a linked list of pages now; but there
was no use for that linked list anyway - though its size is maintained for
meminfo.


Basic Management
----------------

mlocked pages - pages mapped into a VM_LOCKED VMA - are a class of unevictable
pages.  When such a page has been "noticed" by the memory management subsystem,
the page is marked with the PG_mlocked flag.  This can be manipulated using the
PageMlocked() functions.

A PG_mlocked page will be placed on the unevictable list when it is added to
the LRU.  Such pages can be "noticed" by memory management in several places:

 (1) in the mlock()/mlock2()/mlockall() system call handlers;

 (2) in the mmap() system call handler when mmapping a region with the
     MAP_LOCKED flag;

 (3) mmapping a region in a task that has called mlockall() with the MCL_FUTURE
     flag;

 (4) in the fault path and when a VM_LOCKED stack segment is expanded; or

 (5) as mentioned above, in vmscan:shrink_page_list() when attempting to
     reclaim a page in a VM_LOCKED VMA by folio_referenced() or try_to_unmap().

mlocked pages become unlocked and rescued from the unevictable list when:

 (1) mapped in a range unlocked via the munlock()/munlockall() system calls;

 (2) munmap()'d out of the last VM_LOCKED VMA that maps the page, including
     unmapping at task exit;

 (3) when the page is truncated from the last VM_LOCKED VMA of an mmapped file;
     or

 (4) before a page is COW'd in a VM_LOCKED VMA.
286 | ||
287 | ||
577e9846 HD |
288 | mlock()/mlock2()/mlockall() System Call Handling |
289 | ------------------------------------------------ | |
fa07e787 | 290 | |
577e9846 | 291 | mlock(), mlock2() and mlockall() system call handlers proceed to mlock_fixup() |
c24b7201 | 292 | for each VMA in the range specified by the call. In the case of mlockall(), |
fa07e787 | 293 | this is the entire active address space of the task. Note that mlock_fixup() |
c24b7201 | 294 | is used for both mlocking and munlocking a range of memory. A call to mlock() |
577e9846 HD |
295 | an already VM_LOCKED VMA, or to munlock() a VMA that is not VM_LOCKED, is |
296 | treated as a no-op and mlock_fixup() simply returns. | |
c24b7201 | 297 | |
577e9846 | 298 | If the VMA passes some filtering as described in "Filtering Special VMAs" |
c24b7201 | 299 | below, mlock_fixup() will attempt to merge the VMA with its neighbors or split |
577e9846 | 300 | off a subset of the VMA if the range does not cover the entire VMA. Any pages |
e0650a41 | 301 | already present in the VMA are then marked as mlocked by mlock_folio() via |
577e9846 HD |
302 | mlock_pte_range() via walk_page_range() via mlock_vma_pages_range(). |
303 | ||
304 | Before returning from the system call, do_mlock() or mlockall() will call | |
305 | __mm_populate() to fault in the remaining pages via get_user_pages() and to | |
306 | mark those pages as mlocked as they are faulted. | |
c24b7201 DH |
307 | |
308 | Note that the VMA being mlocked might be mapped with PROT_NONE. In this case, | |
309 | get_user_pages() will be unable to fault in the pages. That's okay. If pages | |
577e9846 HD |
310 | do end up getting faulted into this VM_LOCKED VMA, they will be handled in the |
311 | fault path - which is also how mlock2()'s MLOCK_ONFAULT areas are handled. | |
312 | ||
313 | For each PTE (or PMD) being faulted into a VMA, the page add rmap function | |
7efecffb | 314 | calls mlock_vma_folio(), which calls mlock_folio() when the VMA is VM_LOCKED |
577e9846 | 315 | (unless it is a PTE mapping of a part of a transparent huge page). Or when |
a8265cd9 LS |
316 | it is a newly allocated anonymous page, folio_add_lru_vma() calls |
317 | mlock_new_folio() instead: similar to mlock_folio(), but can make better | |
577e9846 HD |
318 | judgments, since this page is held exclusively and known not to be on LRU yet. |
319 | ||
a8265cd9 LS |
320 | mlock_folio() sets PG_mlocked immediately, then places the page on the CPU's |
321 | mlock folio batch, to batch up the rest of the work to be done under lru_lock by | |
322 | __mlock_folio(). __mlock_folio() sets PG_unevictable, initializes mlock_count | |
577e9846 | 323 | and moves the page to unevictable state ("the unevictable LRU", but with |
a8265cd9 LS |
324 | mlock_count in place of LRU threading). Or if the page was already PG_lru |
325 | and PG_unevictable and PG_mlocked, it simply increments the mlock_count. | |
577e9846 HD |
326 | |
327 | But in practice that may not work ideally: the page may not yet be on an LRU, or | |
328 | it may have been temporarily isolated from LRU. In such cases the mlock_count | |
a8265cd9 | 329 | field cannot be touched, but will be set to 0 later when __munlock_folio() |
577e9846 HD |
330 | returns the page to "LRU". Races prohibit mlock_count from being set to 1 then: |
331 | rather than risk stranding a page indefinitely as unevictable, always err with | |
332 | mlock_count on the low side, so that when munlocked the page will be rescued to | |
333 | an evictable LRU, then perhaps be mlocked again later if vmscan finds it in a | |
334 | VM_LOCKED VMA. | |


Filtering Special VMAs
----------------------

mlock_fixup() filters several classes of "special" VMAs:

1) VMAs with VM_IO or VM_PFNMAP set are skipped entirely.  The pages behind
   these mappings are inherently pinned, so we don't need to mark them as
   mlocked.  In any case, most of the pages have no struct page in which to
   mark them.  Because of this, get_user_pages() will fail for these VMAs,
   so there is no sense in attempting to visit them.

2) VMAs mapping hugetlbfs pages are already effectively pinned into memory.  We
   neither need nor want to mlock() these pages.  But __mm_populate() includes
   hugetlbfs ranges, allocating the huge pages and populating the PTEs.

3) VMAs with VM_DONTEXPAND are generally userspace mappings of kernel pages,
   such as the VDSO page, relay channel pages, etc.  These pages are inherently
   unevictable and are not managed on the LRU lists.  __mm_populate() includes
   these ranges, populating the PTEs if not already populated.

4) VMAs with VM_MIXEDMAP set are not marked VM_LOCKED, but __mm_populate()
   includes these ranges, populating the PTEs if not already populated.

Note that for all of these special VMAs, mlock_fixup() does not set the
VM_LOCKED flag.  Therefore, we won't have to deal with them later during
munlock(), munmap() or task exit.  Neither does mlock_fixup() account these
VMAs against the task's "locked_vm".
364 | ||
365 | ||
a5e4da91 | 366 | munlock()/munlockall() System Call Handling |
c24b7201 DH |
367 | ------------------------------------------- |
368 | ||
577e9846 HD |
369 | The munlock() and munlockall() system calls are handled by the same |
370 | mlock_fixup() function as mlock(), mlock2() and mlockall() system calls are. | |
371 | If called to munlock an already munlocked VMA, mlock_fixup() simply returns. | |
372 | Because of the VMA filtering discussed above, VM_LOCKED will not be set in | |
373 | any "special" VMAs. So, those VMAs will be ignored for munlock. | |
fa07e787 | 374 | |
c24b7201 | 375 | If the VMA is VM_LOCKED, mlock_fixup() again attempts to merge or split off the |
e0650a41 | 376 | specified range. All pages in the VMA are then munlocked by munlock_folio() via |
577e9846 HD |
377 | mlock_pte_range() via walk_page_range() via mlock_vma_pages_range() - the same |
378 | function used when mlocking a VMA range, with new flags for the VMA indicating | |
379 | that it is munlock() being performed. | |
380 | ||
e0650a41 MWO |
381 | munlock_folio() uses the mlock pagevec to batch up work to be done |
382 | under lru_lock by __munlock_folio(). __munlock_folio() decrements the | |
383 | folio's mlock_count, and when that reaches 0 it clears the mlocked flag | |
384 | and clears the unevictable flag, moving the folio from unevictable state | |
385 | to the inactive LRU. | |
577e9846 | 386 | |
e0650a41 | 387 | But in practice that may not work ideally: the folio may not yet have reached |
577e9846 HD |
388 | "the unevictable LRU", or it may have been temporarily isolated from it. In |
389 | those cases its mlock_count field is unusable and must be assumed to be 0: so | |
e0650a41 | 390 | that the folio will be rescued to an evictable LRU, then perhaps be mlocked |
577e9846 | 391 | again later if vmscan finds it in a VM_LOCKED VMA. |


Migrating MLOCKED Pages
-----------------------

A page that is being migrated has been isolated from the LRU lists and is held
locked across unmapping of the page, updating the page's address space entry
and copying the contents and state, until the page table entry has been
replaced with an entry that refers to the new page.  Linux supports migration
of mlocked pages and other unevictable pages.  PG_mlocked is cleared from the
old page when it is unmapped from the last VM_LOCKED VMA, and set when the
new page is mapped in place of the migration entry in a VM_LOCKED VMA.  If the
page was unevictable because mlocked, PG_unevictable follows PG_mlocked; but if
the page was unevictable for other reasons, PG_unevictable is copied
explicitly.

Note that page migration can race with mlocking or munlocking of the same page.
There is mostly no problem since page migration requires unmapping all PTEs of
the old page (including munlock where VM_LOCKED), then mapping in the new page
(including mlock where VM_LOCKED).  The page table locks provide sufficient
synchronization.

However, since mlock_vma_pages_range() starts by setting VM_LOCKED on a VMA,
before mlocking any pages already present, if one of those pages were migrated
before mlock_pte_range() reached it, it would get counted twice in mlock_count.
To prevent that, mlock_vma_pages_range() temporarily marks the VMA as VM_IO,
so that mlock_vma_folio() will skip it.

To complete page migration, we place the old and new pages back onto the LRU
afterwards.  The "unneeded" page - old page on success, new page on failure -
is freed when the reference count held by the migration process is released.


Compacting MLOCKED Pages
------------------------

The memory map can be scanned for compactable regions and the default behavior
is to let unevictable pages be moved.  /proc/sys/vm/compact_unevictable_allowed
controls this behavior (see Documentation/admin-guide/sysctl/vm.rst).  The work
of compaction is mostly handled by the page migration code and the same work
flow as described in Migrating MLOCKED Pages will apply.


MLOCKING Transparent Huge Pages
-------------------------------

A transparent huge page is represented by a single entry on an LRU list.
Therefore, we can only make unevictable an entire compound page, not
individual subpages.

If a user tries to mlock() part of a huge page, and no user mlock()s the
whole of the huge page, we want the rest of the page to be reclaimable.

We cannot just split the page on partial mlock() as split_huge_page() can
fail and a new intermittent failure mode for the syscall is undesirable.

We handle this by keeping PTE-mlocked huge pages on evictable LRU lists:
the PMD on the border of a VM_LOCKED VMA will be split into a PTE table.

This way the huge page is accessible for vmscan.  Under memory pressure the
page will be split, subpages which belong to VM_LOCKED VMAs will be moved
to the unevictable LRU and the rest can be reclaimed.

/proc/meminfo's Unevictable and Mlocked amounts do not include those parts
of a transparent huge page which are mapped only by PTEs in VM_LOCKED VMAs.
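The system-wide accounting discussed here can be inspected directly on a Linux
system (values are in kB and vary by workload):

```shell
# Show the Unevictable and Mlocked counters; parts of a THP mapped only by
# PTEs in VM_LOCKED VMAs are excluded from both.
grep -E '^(Unevictable|Mlocked):' /proc/meminfo
```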


mmap(MAP_LOCKED) System Call Handling
-------------------------------------

In addition to the mlock(), mlock2() and mlockall() system calls, an
application can request that a region of memory be mlocked by supplying the
MAP_LOCKED flag to the mmap() call.  There is one important and subtle
difference here, though.  mmap() + mlock() will fail if the range cannot be
faulted in (e.g. because mm_populate fails), returning ENOMEM, while
mmap(MAP_LOCKED) will not fail.  The mmapped area will still have properties
of the locked area - pages will not get swapped out - but major page faults to
fault memory in might still happen.

Furthermore, any mmap() call or brk() call that expands the heap by a task
that has previously called mlockall() with the MCL_FUTURE flag will result
in the newly mapped memory being mlocked.  Before the unevictable/mlock
changes, the kernel simply called make_pages_present() to allocate pages
and populate the page table.

To mlock a range of memory under the unevictable/mlock infrastructure,
the mmap() handler and task address space expansion functions call
populate_vma_page_range() specifying the vma and the address range to mlock.


munmap()/exit()/exec() System Call Handling
-------------------------------------------

When unmapping an mlocked region of memory, whether by an explicit call to
munmap() or via an internal unmap from exit() or exec() processing, we must
munlock the pages if we're removing the last VM_LOCKED VMA that maps the pages.
Before the unevictable/mlock changes, mlocking did not mark the pages in any
way, so unmapping them required no processing.

For each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls
munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page).

munlock_folio() uses the mlock pagevec to batch up work to be done
under lru_lock by __munlock_folio().  __munlock_folio() decrements the
folio's mlock_count, and when that reaches 0 it clears the mlocked flag
and clears the unevictable flag, moving the folio from unevictable state
to the inactive LRU.

But in practice that may not work ideally: the folio may not yet have reached
"the unevictable LRU", or it may have been temporarily isolated from it.  In
those cases its mlock_count field is unusable and must be assumed to be 0: so
that the folio will be rescued to an evictable LRU, then perhaps be mlocked
again later if vmscan finds it in a VM_LOCKED VMA.


Truncating MLOCKED Pages
------------------------

File truncation or hole punching forcibly unmaps the deleted pages from
userspace; truncation even unmaps and deletes any private anonymous pages
which had been Copied-On-Write from the file pages now being truncated.

Mlocked pages can be munlocked and deleted in this way: like with munmap(),
for each PTE (or PMD) being unmapped from a VMA, folio_remove_rmap_*() calls
munlock_vma_folio(), which calls munlock_folio() when the VMA is VM_LOCKED
(unless it was a PTE mapping of a part of a transparent huge page).

However, if there is a racing munlock(), since mlock_vma_pages_range() starts
munlocking by clearing VM_LOCKED from a VMA, before munlocking all the pages
present, if one of those pages were unmapped by truncation or hole punch before
mlock_pte_range() reached it, it would not be recognized as mlocked by this
VMA, and would not be counted out of mlock_count.  In this rare case, a page
may still appear as PG_mlocked after it has been fully unmapped: and it is left
to release_pages() (or __page_cache_release()) to clear it and update
statistics before freeing (this event is counted in /proc/vmstat
unevictable_pgs_cleared, which is usually 0).


Page Reclaim in shrink_*_list()
-------------------------------

vmscan's shrink_active_list() culls any obviously unevictable pages -
i.e. !page_evictable(page) pages - diverting those to the unevictable list.
However, shrink_active_list() only sees unevictable pages that made it onto the
active/inactive LRU lists.  Note that these pages do not have PG_unevictable
set - otherwise they would be on the unevictable list and shrink_active_list()
would never see them.

Some examples of these unevictable pages on the LRU lists are:

 (1) ramfs pages that have been placed on the LRU lists when first allocated.

 (2) SHM_LOCK'd shared memory pages.  shmctl(SHM_LOCK) does not attempt to
     allocate or fault in the pages in the shared memory region.  This happens
     when an application accesses the page the first time after SHM_LOCK'ing
     the segment.

 (3) pages still mapped into VM_LOCKED VMAs, which should be marked mlocked,
     but events left mlock_count too low, so they were munlocked too early.

vmscan's shrink_inactive_list() and shrink_page_list() also divert obviously
unevictable pages found on the inactive lists to the appropriate memory cgroup
and node unevictable list.

rmap's folio_referenced_one(), called via vmscan's shrink_active_list() or
shrink_page_list(), and rmap's try_to_unmap_one() called via
shrink_page_list(), check for (3) pages still mapped into VM_LOCKED VMAs, and
call mlock_vma_folio() to correct them.  Such pages are culled to the
unevictable list when released by the shrinker.