Commit | Line | Data |
---|---|---|
07017acb YZ |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ============= | |
4 | Multi-Gen LRU | |
5 | ============= | |
6 | The multi-gen LRU is an alternative LRU implementation that optimizes | |
7 | page reclaim and improves performance under memory pressure. Page | |
8 | reclaim decides the kernel's caching policy and ability to overcommit | |
9 | memory. It directly impacts the kswapd CPU usage and RAM efficiency. | |
10 | ||
11 | Quick start | |
12 | =========== | |
13 | Build the kernel with the following configurations. | |
14 | ||
15 | * ``CONFIG_LRU_GEN=y`` | |
16 | * ``CONFIG_LRU_GEN_ENABLED=y`` | |
17 | ||
18 | All set! | |
19 | ||
20 | Runtime options | |
21 | =============== | |
22 | ``/sys/kernel/mm/lru_gen/`` contains stable ABIs described in the | |
23 | following subsections. | |
24 | ||
25 | Kill switch | |
26 | ----------- | |
27 | ``enabled`` accepts different values to enable or disable the | |
28 | following components. Its default value depends on | |
29 | ``CONFIG_LRU_GEN_ENABLED``. All the components should be enabled | |
30 | unless some of them have unforeseen side effects. Writing to | |
31 | ``enabled`` has no effect when a component is not supported by the | |
32 | hardware, and valid values will be accepted even when the main switch | |
33 | is off. | |
34 | ||
35 | ====== =============================================================== | |
36 | Values Components | |
37 | ====== =============================================================== | |
38 | 0x0001 The main switch for the multi-gen LRU. | |
39 | 0x0002 Clearing the accessed bit in leaf page table entries in large | |
40 | batches, when the MMU sets it (e.g., on x86). This behavior can | |
41 | theoretically worsen lock contention (mmap_lock). If it is | |
42 | disabled, the multi-gen LRU will suffer a minor performance | |
43 | degradation for workloads that contiguously map hot pages, | |
44 | whose accessed bits can be otherwise cleared by fewer larger | |
45 | batches. | |
46 | 0x0004 Clearing the accessed bit in non-leaf page table entries as | |
47 | well, when the MMU sets it (e.g., on x86). This behavior was not | |
48 | verified on x86 varieties other than Intel and AMD. If it is | |
49 | disabled, the multi-gen LRU will suffer a negligible | |
50 | performance degradation. | |
51 | [yYnN] Apply to all the components above. | |
52 | ====== =============================================================== | |
53 | ||
54 | E.g., | |
55 | :: | |
56 | ||
57 | echo y >/sys/kernel/mm/lru_gen/enabled | |
58 | cat /sys/kernel/mm/lru_gen/enabled | |
59 | 0x0007 | |
60 | echo 5 >/sys/kernel/mm/lru_gen/enabled | |
61 | cat /sys/kernel/mm/lru_gen/enabled | |
62 | 0x0005 | |
63 | ||
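The bit values in the table above can be decoded with a few lines of shell. The following is a minimal sketch, not part of the kernel tooling: the function name ``lru_gen_decode`` and the component labels it prints are hypothetical, chosen here for illustration.

```shell
#!/bin/sh
# Sketch: decode a value read from /sys/kernel/mm/lru_gen/enabled.
# lru_gen_decode and the labels below are hypothetical names.
lru_gen_decode() {
    val=$(( $1 ))    # accepts hex input such as 0x0005
    for pair in "1:main switch" \
                "2:leaf accessed-bit batching" \
                "4:non-leaf accessed-bit clearing"; do
        bit=${pair%%:*}     # numeric flag before the colon
        name=${pair#*:}     # label after the colon
        if [ $(( val & bit )) -ne 0 ]; then
            echo "$name: on"
        else
            echo "$name: off"
        fi
    done
}
```

For instance, ``lru_gen_decode "$(cat /sys/kernel/mm/lru_gen/enabled)"`` would report each component on its own line.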
64 | Thrashing prevention | |
65 | -------------------- | |
66 | Personal computers are more sensitive to thrashing because it can | |
67 | cause janks (lags when rendering UI) and negatively impact user | |
68 | experience. The multi-gen LRU offers thrashing prevention to the | |
69 | majority of laptop and desktop users who do not have ``oomd``. | |
70 | ||
71 | Users can write ``N`` to ``min_ttl_ms`` to prevent the working set of | |
72 | ``N`` milliseconds from getting evicted. The OOM killer is triggered | |
73 | if this working set cannot be kept in memory. In other words, this | |
74 | option works as an adjustable pressure relief valve, and when open, it | |
75 | terminates applications that are hopefully not being used. | |
76 | ||
77 | Based on the average human-detectable lag (~100ms), ``N=1000`` usually | |
78 | eliminates intolerable janks due to thrashing. Larger values like | |
79 | ``N=3000`` make janks less noticeable at the risk of premature OOM | |
80 | kills. | |
81 | ||
82 | The default value ``0`` means disabled. | |
83 | ||
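As an illustration, ``min_ttl_ms`` could be set with a small helper like the one below. This is a sketch under stated assumptions: the function name ``set_min_ttl_ms`` and the ``LRU_GEN_DIR`` override are hypothetical, the latter added only so the helper can be pointed at a scratch directory for a dry run.

```shell
#!/bin/sh
# Sketch: set the thrashing-prevention window in milliseconds.
# LRU_GEN_DIR is a hypothetical override; the default is the sysfs
# directory named in this document.
LRU_GEN_DIR=${LRU_GEN_DIR:-/sys/kernel/mm/lru_gen}

set_min_ttl_ms() {
    # Reject anything that is not a non-negative integer.
    case $1 in
        ''|*[!0-9]*)
            echo "min_ttl_ms must be a non-negative integer" >&2
            return 1
            ;;
    esac
    # Write the value, then read it back to confirm what was stored.
    echo "$1" >"$LRU_GEN_DIR/min_ttl_ms" && cat "$LRU_GEN_DIR/min_ttl_ms"
}
```

Running ``set_min_ttl_ms 1000`` as root would request protection for the working set of the last second, and ``set_min_ttl_ms 0`` would disable it again.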
84 | Experimental features | |
85 | ===================== | |
86 | ``/sys/kernel/debug/lru_gen`` accepts commands described in the | |
87 | following subsections. Multiple command lines are supported, as is | |
88 | concatenation with delimiters ``,`` and ``;``. | |
89 | ||
90 | ``/sys/kernel/debug/lru_gen_full`` provides additional stats for | |
91 | debugging. ``CONFIG_LRU_GEN_STATS=y`` keeps historical stats from | |
92 | evicted generations in this file. | |
93 | ||
94 | Working set estimation | |
95 | ---------------------- | |
96 | Working set estimation measures how much memory an application needs | |
97 | in a given time interval, and it is usually done with little impact on | |
98 | the performance of the application. E.g., data centers want to | |
99 | optimize job scheduling (bin packing) to improve memory utilization. | |
100 | When a new job comes in, the job scheduler needs to find out whether | |
101 | each server it manages can allocate a certain amount of memory for | |
102 | this new job before it can pick a candidate. To do so, the job | |
103 | scheduler needs to estimate the working sets of the existing jobs. | |
104 | ||
105 | When it is read, ``lru_gen`` returns a histogram of numbers of pages | |
106 | accessed over different time intervals for each memcg and node. | |
107 | ``MAX_NR_GENS`` decides the number of bins for each histogram. The | |
108 | histograms are noncumulative. | |
109 | :: | |
110 | ||
111 | memcg memcg_id memcg_path | |
112 | node node_id | |
113 | min_gen_nr age_in_ms nr_anon_pages nr_file_pages | |
114 | ... | |
115 | max_gen_nr age_in_ms nr_anon_pages nr_file_pages | |
116 | ||
117 | Each bin contains an estimated number of pages that have been accessed | |
118 | within ``age_in_ms``. E.g., ``min_gen_nr`` contains the coldest pages | |
119 | and ``max_gen_nr`` contains the hottest pages, since ``age_in_ms`` of | |
120 | the former is the largest and that of the latter is the smallest. | |
121 | ||
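Because the histograms are noncumulative, the number of cold pages for a given age threshold is the sum over all bins at least that old. The sketch below shows one way this might be computed with ``awk``. The function name ``cold_pages`` is hypothetical, and the parser assumes only the three line shapes shown in the format above.

```shell
#!/bin/sh
# Sketch: sum anon + file pages in generations whose age_in_ms is at
# least MIN_AGE_MS, per memcg and node. cold_pages is a hypothetical
# helper name.
# Usage: cold_pages MIN_AGE_MS [FILE]
# Output: one "memcg_id node_id nr_pages" line per pair that has cold pages.
cold_pages() {
    awk -v min_age="$1" '
        $1 == "memcg" { memcg = $2; next }   # memcg memcg_id memcg_path
        $1 == "node"  { node = $2; next }    # node node_id
        # gen_nr age_in_ms nr_anon_pages nr_file_pages
        NF == 4 && $2 + 0 >= min_age + 0 { cold[memcg " " node] += $3 + $4 }
        END { for (k in cold) print k, cold[k] }
    ' "${2:-/sys/kernel/debug/lru_gen}" | sort -k1,1n -k2,2n
}
```

For example, ``cold_pages 60000`` would list, for each memcg and node, the pages not accessed in roughly the last minute.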
122 | Users can write the following command to ``lru_gen`` to create a new | |
123 | generation ``max_gen_nr+1``: | |
124 | ||
125 | ``+ memcg_id node_id max_gen_nr [can_swap [force_scan]]`` | |
126 | ||
127 | ``can_swap`` defaults to the swap setting. Setting it to ``1`` forces | |
128 | the scan of anon pages even when swap is off; setting it to ``0`` | |
129 | skips anon pages even when swap is on. ``force_scan`` defaults to | |
130 | ``1``; setting it to ``0`` employs heuristics that reduce the | |
131 | overhead but are likely to reduce the coverage as well. | |
132 | ||
133 | A typical use case is that a job scheduler runs this command at a | |
134 | certain time interval to create new generations, and it ranks the | |
135 | servers it manages based on the sizes of their cold pages defined by | |
136 | this time interval. | |
137 | ||
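The aging step described above might be scripted as follows. This is a sketch rather than a reference implementation: ``lru_gen_max_gen``, ``lru_gen_age`` and the ``LRU_GEN_DEBUG`` override are hypothetical names, and only the ``+`` command syntax itself comes from this document.

```shell
#!/bin/sh
# Sketch: look up max_gen_nr and create the next generation.
# LRU_GEN_DEBUG is a hypothetical override for dry runs.
LRU_GEN_DEBUG=${LRU_GEN_DEBUG:-/sys/kernel/debug/lru_gen}

# Print max_gen_nr (the last generation listed) for a memcg/node pair.
# Usage: lru_gen_max_gen MEMCG_ID NODE_ID [FILE]
lru_gen_max_gen() {
    awk -v m="$1" -v n="$2" '
        $1 == "memcg" { cur_m = $2; next }
        $1 == "node"  { cur_n = $2; next }
        NF == 4 && cur_m == m && cur_n == n { max = $1 }
        END { if (max != "") print max }
    ' "${3:-$LRU_GEN_DEBUG}"
}

# Write the "+" command that creates generation max_gen_nr+1.
# Usage: lru_gen_age MEMCG_ID NODE_ID MAX_GEN_NR [CAN_SWAP [FORCE_SCAN]]
lru_gen_age() {
    if [ $# -lt 3 ] || [ $# -gt 5 ]; then
        echo "usage: lru_gen_age memcg_id node_id max_gen_nr [can_swap [force_scan]]" >&2
        return 1
    fi
    echo "+ $*" >"$LRU_GEN_DEBUG"
}
```

A periodic job could then run ``lru_gen_age 1 0 "$(lru_gen_max_gen 1 0)"`` for each memcg and node of interest, where the IDs ``1`` and ``0`` are placeholders.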
138 | Proactive reclaim | |
139 | ----------------- | |
140 | Proactive reclaim induces page reclaim when there is no memory | |
141 | pressure. It usually targets cold pages only. E.g., when a new job | |
142 | comes in, the job scheduler wants to proactively reclaim cold pages on | |
143 | the server it selected, to improve the chance of successfully landing | |
144 | this new job. | |
145 | ||
146 | Users can write the following command to ``lru_gen`` to evict | |
147 | generations less than or equal to ``min_gen_nr``: | |
148 | ||
149 | ``- memcg_id node_id min_gen_nr [swappiness [nr_to_reclaim]]`` | |
150 | ||
151 | ``min_gen_nr`` should be less than ``max_gen_nr-1``, since | |
152 | ``max_gen_nr`` and ``max_gen_nr-1`` are not fully aged (equivalent to | |
153 | the active list) and therefore cannot be evicted. ``swappiness`` | |
154 | overrides the default value in ``/proc/sys/vm/swappiness``. | |
155 | ``nr_to_reclaim`` limits the number of pages to evict. | |
156 | ||
157 | A typical use case is that a job scheduler runs this command before it | |
158 | tries to land a new job on a server. If it fails to reclaim enough | |
159 | cold pages because their number was overestimated, it retries on the | |
160 | next server according to the ranking result obtained from the working | |
161 | set estimation step. This less forceful approach limits the impact on | |
162 | existing jobs. |
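
The eviction command and its constraint on ``min_gen_nr`` can be wrapped in a small helper. The following is a sketch under stated assumptions: ``lru_gen_evict`` and the ``LRU_GEN_DEBUG`` override are hypothetical, and the helper takes ``max_gen_nr`` as an extra argument purely to check the constraint before writing.

```shell
#!/bin/sh
# Sketch: evict generations <= MIN_GEN_NR after a sanity check.
# LRU_GEN_DEBUG is a hypothetical override for dry runs.
LRU_GEN_DEBUG=${LRU_GEN_DEBUG:-/sys/kernel/debug/lru_gen}

# Usage: lru_gen_evict MEMCG_ID NODE_ID MIN_GEN_NR MAX_GEN_NR \
#                      [SWAPPINESS [NR_TO_RECLAIM]]
lru_gen_evict() {
    if [ $# -lt 4 ] || [ $# -gt 6 ]; then
        echo "usage: lru_gen_evict memcg_id node_id min_gen_nr max_gen_nr [swappiness [nr_to_reclaim]]" >&2
        return 1
    fi
    memcg=$1 node=$2 min_gen=$3 max_gen=$4
    shift 4
    # max_gen_nr and max_gen_nr-1 are not fully aged and cannot be evicted.
    if [ "$min_gen" -ge $(( max_gen - 1 )) ]; then
        echo "min_gen_nr must be less than max_gen_nr-1" >&2
        return 1
    fi
    # MAX_GEN_NR is used only for the check; it is not part of the command.
    set -- "$memcg" "$node" "$min_gen" "$@"
    echo "- $*" >"$LRU_GEN_DEBUG"
}
```

For example, ``lru_gen_evict 1 0 1 3 0 512`` would write ``- 1 0 1 0 512``, asking for at most 512 pages to be evicted with a swappiness of 0; the IDs and counts are placeholders.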