Documentation/mm/multigen_lru.rst

   1 .. SPDX-License-Identifier: GPL-2.0
   2
   3 =============
   4 Multi-Gen LRU
   5 =============
   6 The multi-gen LRU is an alternative LRU implementation that optimizes
   7 page reclaim and improves performance under memory pressure. Page
   8 reclaim decides the kernel's caching policy and ability to overcommit
   9 memory. It directly impacts the kswapd CPU usage and RAM efficiency.
  10
  11 Design overview
  12 ===============
  13 Objectives
  14 ----------
  15 The design objectives are:
  16
  17 * Good representation of access recency
  18 * Try to profit from spatial locality
  19 * Fast paths to make obvious choices
  20 * Simple self-correcting heuristics
  21
  22 The representation of access recency is at the core of all LRU
  23 implementations. In the multi-gen LRU, each generation represents a
  24 group of pages with similar access recency. Generations establish a
  25 (time-based) common frame of reference and therefore help make better
  26 choices, e.g., between different memcgs on a computer or different
  27 computers in a data center (for job scheduling).
  28
  29 Exploiting spatial locality improves efficiency when gathering the
  30 accessed bit. A rmap walk targets a single page and does not try to
  31 profit from discovering a young PTE. A page table walk can sweep all
  32 the young PTEs in an address space, but the address space can be too
  33 sparse to make a profit. The key is to optimize both methods and use
  34 them in combination.
  35
  36 Fast paths reduce code complexity and runtime overhead. Unmapped pages
  37 do not require TLB flushes; clean pages do not require writeback.
  38 These facts are only helpful when other conditions, e.g., access
  39 recency, are similar. With generations as a common frame of reference,
  40 additional factors stand out. But obvious choices might not be good
  41 choices; thus self-correction is necessary.
  42
  43 The benefits of simple self-correcting heuristics are self-evident.
  44 Again, with generations as a common frame of reference, this becomes
  45 attainable. Specifically, pages in the same generation can be
  46 categorized based on additional factors, and a feedback loop can
  47 statistically compare the refault percentages across those categories
  48 and infer which of them are better choices.
  49
  50 Assumptions
  51 -----------
  52 The protection of hot pages and the selection of cold pages are based
  53 on page access channels and patterns. There are two access channels:
  54
  55 * Accesses through page tables
  56 * Accesses through file descriptors
  57
  58 The protection of the former channel is by design stronger because:
  59
  60 1. The uncertainty in determining the access patterns of the former
  61    channel is higher due to the approximation of the accessed bit.
  62 2. The cost of evicting the former channel is higher due to the TLB
  63    flushes required and the likelihood of encountering the dirty bit.
  64 3. The penalty of underprotecting the former channel is higher because
  65    applications usually do not prepare themselves for major page
  66    faults like they do for blocked I/O. E.g., GUI applications
  67    commonly use dedicated I/O threads to avoid blocking rendering
  68    threads.
  69
  70 There are also two access patterns:
  71
  72 * Accesses exhibiting temporal locality
  73 * Accesses not exhibiting temporal locality
  74
  75 For the reasons listed above, the former channel is assumed to follow
  76 the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is
  77 present, and the latter channel is assumed to follow the latter
  78 pattern unless outlying refaults have been observed.
  79
  80 Workflow overview
  81 =================
  82 Evictable pages are divided into multiple generations for each
  83 ``lruvec``. The youngest generation number is stored in
  84 ``lrugen->max_seq`` for both anon and file types as they are aged on
  85 an equal footing. The oldest generation numbers are stored in
  86 ``lrugen->min_seq[]`` separately for anon and file types as clean file
  87 pages can be evicted regardless of swap constraints. These three
  88 variables are monotonically increasing.
  89
  90 Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)``
  91 bits in order to fit into the gen counter in ``folio->flags``. Each
  92 truncated generation number is an index to ``lrugen->folios[]``. The
  93 sliding window technique is used to track at least ``MIN_NR_GENS`` and
  94 at most ``MAX_NR_GENS`` generations. The gen counter stores a value
  95 within ``[1, MAX_NR_GENS]`` while a page is on one of
  96 ``lrugen->folios[]``; otherwise it stores zero.
  97
  98 Each generation is divided into multiple tiers. A page accessed ``N``
  99 times through file descriptors is in tier ``order_base_2(N)``. Unlike
 100 generations, tiers do not have dedicated ``lrugen->folios[]``. In
 101 contrast to moving across generations, which requires the LRU lock,
 102 moving across tiers only involves atomic operations on
 103 ``folio->flags`` and therefore has a negligible cost. A feedback loop
 104 modeled after the PID controller monitors refaults over all the tiers
 105 from anon and file types and decides which tiers from which types to
 106 evict or protect.
 107
 108 There are two conceptually independent procedures: the aging and the
 109 eviction. They form a closed-loop system, i.e., the page reclaim.
 110
 111 Aging
 112 -----
 113 The aging produces young generations. Given an ``lruvec``, it
 114 increments ``max_seq`` when ``max_seq-min_seq+1`` approaches
 115 ``MIN_NR_GENS``. The aging promotes hot pages to the youngest
 116 generation when it finds them accessed through page tables; the
 117 demotion of cold pages happens consequently when it increments
 118 ``max_seq``. The aging uses page table walks and rmap walks to find
 119 young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list``
 120 and calls ``walk_page_range()`` with each ``mm_struct`` on this list
 121 to scan PTEs, and after each iteration, it increments ``max_seq``. For
 122 the latter, when the eviction walks the rmap and finds a young PTE,
 123 the aging scans the adjacent PTEs. For both, on finding a young PTE,
 124 the aging clears the accessed bit and updates the gen counter of the
 125 page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``.
 126
 127 Eviction
 128 --------
 129 The eviction consumes old generations. Given an ``lruvec``, it
 130 increments ``min_seq`` when ``lrugen->folios[]`` indexed by
 131 ``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to
 132 evict from, it first compares ``min_seq[]`` to select the older type.
 133 If both types are equally old, it selects the one whose first tier has
 134 a lower refault percentage. The first tier contains single-use
 135 unmapped clean pages, which are the best bet. The eviction sorts a
 136 page according to its gen counter if the aging has found this page
 137 accessed through page tables and updated its gen counter. It also
 138 moves a page to the next generation, i.e., ``min_seq+1``, if this page
 139 was accessed multiple times through file descriptors and the feedback
 140 loop has detected outlying refaults from the tier this page is in. To
 141 this end, the feedback loop uses the first tier as the baseline, for
 142 the reason stated earlier.
 143
 144 Working set protection
 145 ----------------------
 146 Each generation is timestamped at birth. If ``lru_gen_min_ttl`` is
 147 set, an ``lruvec`` is protected from the eviction when its oldest
 148 generation was born within ``lru_gen_min_ttl`` milliseconds. In other
 149 words, it prevents the working set of ``lru_gen_min_ttl`` milliseconds
 150 from getting evicted. The OOM killer is triggered if this working set
 151 cannot be kept in memory.
 152
 153 This time-based approach has the following advantages:
 154
 155 1. It is easier to configure because it is agnostic to applications
 156    and memory sizes.
 157 2. It is more reliable because it is directly wired to the OOM killer.
 158
 159 Rmap/PT walk feedback
 160 ---------------------
 161 Searching the rmap for PTEs mapping each page on an LRU list (to test
 162 and clear the accessed bit) can be expensive because pages from
 163 different VMAs (PA space) are not cache friendly to the rmap (VA
 164 space). For workloads mostly using mapped pages, searching the rmap
 165 can incur the highest CPU cost in the reclaim path.
 166
 167 ``lru_gen_look_around()`` exploits spatial locality to reduce the
 168 trips into the rmap. It scans the adjacent PTEs of a young PTE and
 169 promotes hot pages. If the scan was done cacheline efficiently, it
 170 adds the PMD entry pointing to the PTE table to the Bloom filter. This
 171 forms a feedback loop between the eviction and the aging.
 172
 173 Bloom Filters
 174 -------------
 175 Bloom filters are a space and memory efficient data structure for set
 176 membership test, i.e., test if an element is not in the set or may be
 177 in the set.
 178
 179 In the eviction path, specifically, in ``lru_gen_look_around()``, if a
 180 PMD has a sufficient number of hot pages, its address is placed in the
 181 filter. In the aging path, set membership means that the PTE range
 182 will be scanned for young pages.
 183
 184 Note that Bloom filters are probabilistic on set membership. If a test
 185 is false positive, the cost is an additional scan of a range of PTEs,
 186 which may yield hot pages anyway. Parameters of the filter itself can
 187 control the false positive rate in the limit.
 188
 189 Memcg LRU
 190 ---------
 191 An memcg LRU is a per-node LRU of memcgs. It is also an LRU of LRUs,
 192 since each node and memcg combination has an LRU of folios (see
 193 ``mem_cgroup_lruvec()``). Its goal is to improve the scalability of
 194 global reclaim, which is critical to system-wide memory overcommit in
 195 data centers. Note that memcg LRU only applies to global reclaim.
 196
 197 The basic structure of an memcg LRU can be understood by an analogy to
 198 the active/inactive LRU (of folios):
 199
 200 1. It has the young and the old (generations), i.e., the counterparts
 201    to the active and the inactive;
 202 2. The increment of ``max_seq`` triggers promotion, i.e., the
 203    counterpart to activation;
 204 3. Other events trigger similar operations, e.g., offlining an memcg
 205    triggers demotion, i.e., the counterpart to deactivation.
 206
 207 In terms of global reclaim, it has two distinct features:
 208
 209 1. Sharding, which allows each thread to start at a random memcg (in
 210    the old generation) and improves parallelism;
 211 2. Eventual fairness, which allows direct reclaim to bail out at will
 212    and reduces latency without affecting fairness over some time.
 213
 214 In terms of traversing memcgs during global reclaim, it improves the
 215 best-case complexity from O(n) to O(1) and does not affect the
 216 worst-case complexity O(n). Therefore, on average, it has a sublinear
 217 complexity.
 218
 219 Summary
 220 -------
 221 The multi-gen LRU (of folios) can be disassembled into the following
 222 parts:
 223
 224 * Generations
 225 * Rmap walks
 226 * Page table walks
 227 * Bloom filters
 228 * PID controller
 229
 230 The aging and the eviction form a producer-consumer model;
 231 specifically, the latter drives the former by the sliding window over
 232 generations. Within the aging, rmap walks drive page table walks by
 233 inserting hot densely populated page tables to the Bloom filters.
 234 Within the eviction, the PID controller uses refaults as the feedback
 235 to select types to evict and tiers to protect.