Commit | Line | Data |
---|---|---|
6bf53999 MR |
1 | .. _admin_guide_memory_hotplug: |
2 | ||
ac3332c4 DH |
3 | ================== |
4 | Memory Hot(Un)Plug | |
5 | ================== | |
6867c931 | 6 | |
ac3332c4 DH |
7 | This document describes generic Linux support for memory hot(un)plug with |
8 | a focus on System RAM, including ZONE_MOVABLE support. | |
6867c931 | 9 | |
6bf53999 MR |
10 | .. contents:: :local: |
11 | ||
ac3332c4 DH |
12 | Introduction |
13 | ============ | |
c18c1cce | 14 | |
ac3332c4 DH |
15 | Memory hot(un)plug allows for increasing and decreasing the size of physical |
16 | memory available to a machine at runtime. In the simplest case, it consists of | |
17 | physically plugging or unplugging a DIMM at runtime, coordinated with the | |
18 | operating system. | |
c18c1cce | 19 | |
ac3332c4 | 20 | Memory hot(un)plug is used for various purposes: |
c18c1cce | 21 | |
ac3332c4 DH |
22 | - The physical memory available to a machine can be adjusted at runtime, up- or |
23 | downgrading the memory capacity. This dynamic memory resizing, sometimes | |
24 | referred to as "capacity on demand", is frequently used with virtual machines | |
25 | and logical partitions. | |
26 | ||
27 | - Replacing hardware, such as DIMMs or whole NUMA nodes, without downtime. One | |
28 | example is replacing failing memory modules. | |
c18c1cce | 29 | |
ac3332c4 DH |
30 | - Reducing energy consumption either by physically unplugging memory modules or |
31 | by logically unplugging (parts of) memory modules from Linux. | |
c18c1cce | 32 | |
ac3332c4 DH |
33 | Further, the basic memory hot(un)plug infrastructure in Linux is nowadays also |
34 | used to expose persistent memory, other performance-differentiated memory and | |
35 | reserved memory regions as ordinary system RAM to Linux. | |
6867c931 | 36 | |
ac3332c4 DH |
37 | Linux only supports memory hot(un)plug on selected 64 bit architectures, such as |
38 | x86_64, arm64, ppc64, s390x and ia64. | |
6867c931 | 39 | |
ac3332c4 DH |
40 | Memory Hot(Un)Plug Granularity |
41 | ------------------------------ | |
6867c931 | 42 | |
ac3332c4 DH |
43 | Memory hot(un)plug in Linux uses the SPARSEMEM memory model, which divides the |
44 | physical memory address space into chunks of the same size: memory sections. The | |
45 | size of a memory section is architecture dependent. For example, x86_64 uses | |
46 | 128 MiB and ppc64 uses 16 MiB. | |
6867c931 | 47 | |
ac3332c4 DH |
48 | Memory sections are combined into chunks referred to as "memory blocks". The |
49 | size of a memory block is architecture dependent and corresponds to the smallest | |
50 | granularity that can be hot(un)plugged. The default size of a memory block is | |
51 | the same as memory section size, unless an architecture specifies otherwise. | |
52 | ||
53 | All memory blocks have the same size. | |
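The memory block size of a running system is exported via the ``block_size_bytes`` sysfs file described later in this document. The following is a minimal sketch for converting such a value to MiB. It assumes the file reports a hexadecimal number without a ``0x`` prefix; the function name is purely illustrative.

```shell
# Convert a hexadecimal memory block size to MiB.
# Assumption: the value is hexadecimal and carries no "0x" prefix.
# $1: value as read from block_size_bytes
block_size_mib() {
    echo $(( 0x$1 / 1024 / 1024 ))
}
```

A possible invocation is ``block_size_mib "$(cat /sys/devices/system/memory/block_size_bytes)"``; a value of ``8000000`` corresponds to 128 MiB.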
54 | ||
55 | Phases of Memory Hotplug | |
c18c1cce MCC |
56 | ------------------------ |
57 | ||
ac3332c4 | 58 | Memory hotplug consists of two phases: |
c18c1cce | 59 | |
ac3332c4 DH |
60 | (1) Adding the memory to Linux |
61 | (2) Onlining memory blocks | |
6867c931 | 62 | |
ac3332c4 DH |
63 | In the first phase, metadata, such as the memory map ("memmap") and page tables |
64 | for the direct mapping, is allocated and initialized, and memory blocks are | |
65 | created; the latter also creates sysfs files for managing newly created memory | |
66 | blocks. | |
6867c931 | 67 | |
ac3332c4 DH |
68 | In the second phase, added memory is exposed to the page allocator. After this |
69 | phase, the memory is visible in memory statistics, such as free and total | |
70 | memory, of the system. | |
6867c931 | 71 | |
ac3332c4 DH |
72 | Phases of Memory Hotunplug |
73 | -------------------------- | |
6867c931 | 74 | |
ac3332c4 | 75 | Memory hotunplug consists of two phases: |
6867c931 | 76 | |
ac3332c4 DH |
77 | (1) Offlining memory blocks |
78 | (2) Removing the memory from Linux | |
6867c931 | 79 | |
ac3332c4 DH |
80 | In the first phase, memory is "hidden" from the page allocator again, for |
81 | example, by migrating busy memory to other memory locations and removing all | |
82 | relevant free pages from the page allocator. After this phase, the memory is no | |
83 | longer visible in memory statistics of the system. | |
c18c1cce | 84 | |
ac3332c4 | 85 | In the second phase, the memory blocks are removed and metadata is freed. |
6867c931 | 86 | |
ac3332c4 DH |
87 | Memory Hotplug Notifications |
88 | ============================ | |
6867c931 | 89 | |
ac3332c4 DH |
90 | There are various ways in which Linux is notified about memory hotplug events such |
91 | that it can start adding hotplugged memory. This description is limited to | |
92 | systems that support ACPI; mechanisms specific to other firmware interfaces or | |
93 | virtual machines are not described. | |
56a3c655 | 94 | |
ac3332c4 DH |
95 | ACPI Notifications |
96 | ------------------ | |
6867c931 | 97 | |
ac3332c4 DH |
98 | Platforms that support ACPI, such as x86_64, can support memory hotplug |
99 | notifications via ACPI. | |
6867c931 | 100 | |
ac3332c4 DH |
101 | In general, a firmware supporting memory hotplug defines a memory class object |
102 | HID "PNP0C80". When notified about hotplug of a new memory device, the ACPI | |
103 | driver will hotplug the memory to Linux. | |
c18c1cce | 104 | |
ac3332c4 DH |
105 | If the firmware supports hotplug of NUMA nodes, it defines an object _HID |
106 | "ACPI0004", "PNP0A05", or "PNP0A06". When notified about a hotplug event, all | |
107 | assigned memory devices are added to Linux by the ACPI driver. | |
6867c931 | 108 | |
ac3332c4 DH |
109 | Similarly, Linux can be notified about requests to hotunplug a memory device or |
110 | a NUMA node via ACPI. The ACPI driver will try offlining all relevant memory | |
111 | blocks, and, if successful, hotunplug the memory from Linux. | |
6867c931 | 112 | |
ac3332c4 DH |
113 | Manual Probing |
114 | -------------- | |
6867c931 | 115 | |
ac3332c4 DH |
116 | On some architectures, the firmware may not be able to notify the operating |
117 | system about a memory hotplug event. Instead, the memory has to be manually | |
118 | probed from user space. | |
6867c931 | 119 | |
ac3332c4 | 120 | The probe interface is located at:: |
6867c931 | 121 | |
ac3332c4 | 122 | /sys/devices/system/memory/probe |
c18c1cce | 123 | |
ac3332c4 DH |
124 | Only complete memory blocks can be probed. Individual memory blocks are probed |
125 | by providing the physical start address of the memory block:: | |
c18c1cce | 126 | |
ac3332c4 | 127 | % echo addr > /sys/devices/system/memory/probe |
c18c1cce | 128 | |
ac3332c4 DH |
129 | This results in a memory block for the range [addr, addr + memory_block_size) |
130 | being created. | |
c18c1cce | 131 | |
ac3332c4 | 132 | .. note:: |
56a3c655 | 133 | |
ac3332c4 DH |
134 | Using the probe interface is discouraged as it is easy to crash the kernel, |
135 | because Linux cannot validate user input; this interface might be removed in | |
136 | the future. | |
6867c931 | 137 | |
ac3332c4 DH |
138 | Onlining and Offlining Memory Blocks |
139 | ==================================== | |
6bf53999 | 140 | |
ac3332c4 DH |
141 | After a memory block has been created, Linux has to be instructed to actually |
142 | make use of that memory: the memory block has to be "onlined". | |
6867c931 | 143 | |
ac3332c4 DH |
144 | Before a memory block can be removed, Linux has to stop using any memory part of |
145 | the memory block: the memory block has to be "offlined". | |
6867c931 | 146 | |
ac3332c4 DH |
147 | The Linux kernel can be configured to automatically online added memory |
148 | blocks. Drivers may automatically trigger offlining of memory blocks when | |
149 | attempting hotunplug of memory. Memory blocks can only be removed once | |
150 | offlining has succeeded. | |
c18c1cce | 152 | |
ac3332c4 DH |
153 | Onlining Memory Blocks Manually |
154 | ------------------------------- | |
c18c1cce | 155 | |
ac3332c4 DH |
156 | If auto-onlining of memory blocks isn't enabled, user-space has to manually |
157 | trigger onlining of memory blocks. Often, udev rules are used to automate this | |
158 | task in user space. | |
6867c931 | 159 | |
ac3332c4 | 160 | Onlining of a memory block can be triggered via:: |
6867c931 | 161 | |
ac3332c4 | 162 | % echo online > /sys/devices/system/memory/memoryXXX/state |
c18c1cce | 163 | |
ac3332c4 | 164 | Or alternatively:: |
c18c1cce | 165 | |
ac3332c4 | 166 | % echo 1 > /sys/devices/system/memory/memoryXXX/online |
6867c931 | 167 | |
9e122cc1 DH |
168 | The kernel will select the target zone automatically, depending on the |
169 | configured ``online_policy``. | |
c18c1cce | 170 | |
ac3332c4 DH |
171 | One can explicitly request to associate an offline memory block with |
172 | ZONE_MOVABLE by:: | |
6867c931 | 173 | |
ac3332c4 | 174 | % echo online_movable > /sys/devices/system/memory/memoryXXX/state |
6867c931 | 175 | |
ac3332c4 | 176 | Or one can explicitly request a kernel zone (usually ZONE_NORMAL) by:: |
6bf53999 | 177 | |
ac3332c4 | 178 | % echo online_kernel > /sys/devices/system/memory/memoryXXX/state |
dee5d0d5 | 179 | |
ac3332c4 DH |
180 | In any case, if onlining succeeds, the state of the memory block is changed to |
181 | be "online". If it fails, the state of the memory block will remain unchanged | |
182 | and the above commands will fail. | |
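The manual onlining step above can be sketched as a small loop over all memory block devices. This is only an illustration: the function name is hypothetical, the optional directory argument exists solely to allow trying the function against a test directory, and zone selection is left to the kernel by writing ``online``.

```shell
# Online every memory block that is currently offline.
# $1: sysfs memory directory (optional, defaults to the real one)
# Prints the name of each block that failed to online and returns non-zero
# if at least one block could not be onlined.
online_offline_blocks() {
    dir="${1:-/sys/devices/system/memory}"
    rc=0
    for state in "$dir"/memory*/state; do
        [ -e "$state" ] || continue
        if [ "$(cat "$state")" = "offline" ]; then
            if ! echo online > "$state" 2>/dev/null; then
                basename "$(dirname "$state")"
                rc=1
            fi
        fi
    done
    return "$rc"
}
```

Writing ``online_movable`` or ``online_kernel`` instead of ``online`` would request a specific zone, as described above.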
6867c931 | 183 | |
ac3332c4 DH |
184 | Onlining Memory Blocks Automatically |
185 | ------------------------------------ | |
6bf53999 | 186 | |
ac3332c4 DH |
187 | The kernel can be configured to try auto-onlining of newly added memory blocks. |
188 | If this feature is disabled, the memory blocks will stay offline until | |
189 | explicitly onlined from user space. | |
dee5d0d5 | 190 | |
ac3332c4 | 191 | The configured auto-online behavior can be observed via:: |
c18c1cce | 192 | |
ac3332c4 | 193 | % cat /sys/devices/system/memory/auto_online_blocks |
56a3c655 | 194 | |
ac3332c4 DH |
195 | Auto-onlining can be enabled by writing ``online``, ``online_kernel`` or |
196 | ``online_movable`` to that file, like:: | |
6867c931 | 197 | |
ac3332c4 | 198 | % echo online > /sys/devices/system/memory/auto_online_blocks |
6867c931 | 199 | |
9e122cc1 DH |
200 | Similarly to manual onlining, with ``online`` the kernel will select the |
201 | target zone automatically, depending on the configured ``online_policy``. | |
202 | ||
ac3332c4 DH |
203 | Modifying the auto-online behavior will only affect subsequently added |
204 | memory blocks. | |
6867c931 | 205 | |
ac3332c4 | 206 | .. note:: |
6867c931 | 207 | |
ac3332c4 DH |
208 | In corner cases, auto-onlining can fail. The kernel won't retry. Note that |
209 | auto-onlining is not expected to fail in default configurations. | |
6867c931 | 210 | |
ac3332c4 | 211 | .. note:: |
c18c1cce | 212 | |
ac3332c4 DH |
213 | DLPAR on ppc64 ignores the ``offline`` setting and will still online added |
214 | memory blocks; if onlining fails, memory blocks are removed again. | |
6867c931 | 215 | |
ac3332c4 DH |
216 | Offlining Memory Blocks |
217 | ----------------------- | |
6bf53999 | 218 | |
ac3332c4 DH |
219 | In the current implementation, Linux's memory offlining will try migrating all |
220 | movable pages off the affected memory block. As most kernel allocations, such as | |
221 | page tables, are unmovable, page migration can fail and, therefore, inhibit | |
222 | memory offlining from succeeding. | |
6867c931 | 223 | |
ac3332c4 DH |
224 | Having the memory provided by a memory block managed by ZONE_MOVABLE significantly |
225 | increases memory offlining reliability; still, memory offlining can fail in | |
226 | some corner cases. | |
6867c931 | 227 | |
ac3332c4 DH |
228 | Further, memory offlining might retry for a long time (or even forever), until |
229 | aborted by the user. | |
6867c931 | 230 | |
ac3332c4 | 231 | Offlining of a memory block can be triggered via:: |
6867c931 | 232 | |
ac3332c4 | 233 | % echo offline > /sys/devices/system/memory/memoryXXX/state |
6867c931 | 234 | |
ac3332c4 | 235 | Or alternatively:: |
c18c1cce | 236 | |
ac3332c4 | 237 | % echo 0 > /sys/devices/system/memory/memoryXXX/online |
c18c1cce | 238 | |
ac3332c4 DH |
239 | If offlining succeeds, the state of the memory block is changed to be "offline". |
240 | If it fails, the state of the memory block will remain unchanged and the above | |
241 | commands will fail, for example, via:: | |
6867c931 | 242 | |
ac3332c4 | 243 | bash: echo: write error: Device or resource busy |
6867c931 | 244 | |
ac3332c4 | 245 | or via:: |
6867c931 | 246 | |
ac3332c4 | 247 | bash: echo: write error: Invalid argument |
6867c931 | 248 | |
ac3332c4 DH |
249 | Observing the State of Memory Blocks |
250 | ------------------------------------ | |
c18c1cce | 251 | |
ac3332c4 DH |
252 | The state (online/offline/going-offline) of a memory block can be observed |
253 | either via:: | |
6867c931 | 254 | |
ac3332c4 | 255 | % cat /sys/devices/system/memory/memoryXXX/state |
6867c931 | 256 | |
ac3332c4 | 257 | Or alternatively (1/0) via:: |
31bc3858 | 258 | |
ac3332c4 | 259 | % cat /sys/devices/system/memory/memoryXXX/online |
31bc3858 | 260 | |
ac3332c4 | 261 | For an online memory block, the managing zone can be observed via:: |
31bc3858 | 262 | |
ac3332c4 | 263 | % cat /sys/devices/system/memory/memoryXXX/valid_zones |
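To get an overview instead of inspecting blocks one by one, the per-block ``state`` files can be summarized. A minimal sketch, with a hypothetical function name and an optional directory argument that only exists to make the function easy to try out:

```shell
# Print "<count> <state>" for every state found among the memory blocks.
# $1: sysfs memory directory (optional, defaults to the real one)
memory_block_states() {
    dir="${1:-/sys/devices/system/memory}"
    cat "$dir"/memory*/state 2>/dev/null | sort | uniq -c | awk '{ print $1, $2 }'
}
```

The output contains one line per observed state, for example one line for ``online`` and one for ``offline`` blocks.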
31bc3858 | 264 | |
ac3332c4 DH |
265 | Configuring Memory Hot(Un)Plug |
266 | ============================== | |
6867c931 | 267 | |
ac3332c4 DH |
268 | There are various ways in which system administrators can configure memory |
269 | hot(un)plug and interact with memory blocks, especially, to online them. | |
6867c931 | 270 | |
ac3332c4 DH |
271 | Memory Hot(Un)Plug Configuration via Sysfs |
272 | ------------------------------------------ | |
9f123ab5 | 273 | |
ac3332c4 | 274 | Some memory hot(un)plug properties can be configured or inspected via sysfs in:: |
c18c1cce | 275 | |
ac3332c4 | 276 | /sys/devices/system/memory/ |
511c2aba | 277 | |
ac3332c4 | 278 | The following files are currently defined: |
511c2aba | 279 | |
ac3332c4 DH |
280 | ====================== ========================================================= |
281 | ``auto_online_blocks`` read-write: set or get the default state of new memory | |
282 | blocks; configure auto-onlining. | |
511c2aba | 283 | |
ac3332c4 DH |
284 | The default value depends on the |
285 | CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE kernel configuration | |
286 | option. | |
c18c1cce | 287 | |
ac3332c4 DH |
288 | See the ``state`` property of memory blocks for details. |
289 | ``block_size_bytes`` read-only: the size in bytes of a memory block. | |
290 | ``probe`` write-only: add (probe) selected memory blocks manually | |
291 | from user space by supplying the physical start address. | |
511c2aba | 292 | |
ac3332c4 DH |
293 | Availability depends on the CONFIG_ARCH_MEMORY_PROBE |
294 | kernel configuration option. | |
295 | ``uevent`` read-write: generic udev file for device subsystems. | |
296 | ====================== ========================================================= | |
9f123ab5 | 297 | |
ac3332c4 | 298 | .. note:: |
6867c931 | 299 | |
ac3332c4 DH |
300 | When the CONFIG_MEMORY_FAILURE kernel configuration option is enabled, two |
301 | additional files ``hard_offline_page`` and ``soft_offline_page`` are available | |
302 | to trigger hwpoisoning of pages, for example, for testing purposes. Note that | |
303 | this functionality is not really related to memory hot(un)plug or actual | |
304 | offlining of memory blocks. | |
6867c931 | 305 | |
ac3332c4 DH |
306 | Memory Block Configuration via Sysfs |
307 | ------------------------------------ | |
c18c1cce | 308 | |
ac3332c4 DH |
309 | Each memory block is represented as a memory block device that can be |
310 | onlined or offlined. All memory blocks have their device information located in | |
311 | sysfs. Each present memory block is listed under | |
312 | ``/sys/devices/system/memory`` as:: | |
6867c931 | 313 | |
ac3332c4 | 314 | /sys/devices/system/memory/memoryXXX |
6867c931 | 315 | |
ac3332c4 | 316 | where XXX is the memory block id; the number of digits is variable. |
6867c931 | 317 | |
ac3332c4 DH |
318 | A present memory block indicates that some memory in the range is present; |
319 | however, a memory block might span memory holes. A memory block spanning memory | |
320 | holes cannot be offlined. | |
6867c931 | 321 | |
ac3332c4 DH |
322 | For example, assume 1 GiB memory block size. A device for a memory starting at |
323 | 0x100000000 is ``/sys/devices/system/memory/memory4``:: | |
6867c931 | 324 | |
ac3332c4 | 325 | (0x100000000 / 1 GiB = 4) |
6867c931 | 326 | |
ac3332c4 | 327 | This device covers address range [0x100000000 ... 0x140000000) |
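The calculation in this example can be expressed as a short helper. This is a sketch with a hypothetical function name; it takes the memory block size as an argument instead of reading it from sysfs.

```shell
# Compute the memory block device name and covered range for an address.
# $1: physical address, $2: memory block size in bytes (hex or decimal)
memory_block_for_addr() {
    addr=$(( $1 ))
    size=$(( $2 ))
    id=$(( addr / size ))        # memory block id (XXX)
    start=$(( id * size ))       # start of the block containing addr
    printf 'memory%d [0x%x, 0x%x)\n' "$id" "$start" "$(( start + size ))"
}
```

With the values from the example, ``memory_block_for_addr 0x100000000 0x40000000`` prints ``memory4 [0x100000000, 0x140000000)``.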
6867c931 | 328 | |
ac3332c4 | 329 | The following files are currently defined: |
6867c931 | 330 | |
ac3332c4 DH |
331 | =================== ============================================================ |
332 | ``online`` read-write: simplified interface to trigger onlining / | |
333 | offlining and to observe the state of a memory block. | |
334 | When onlining, the zone is selected automatically. | |
335 | ``phys_device`` read-only: legacy interface only ever used on s390x to | |
336 | expose the covered storage increment. | |
337 | ``phys_index`` read-only: the memory block id (XXX). | |
338 | ``removable`` read-only: legacy interface that indicated whether a memory | |
339 | block was likely to be offlineable or not. Nowadays, the | |
340 | kernel returns ``1`` if and only if it supports memory | |
341 | offlining. | |
342 | ``state`` read-write: advanced interface to trigger onlining / | |
343 | offlining and to observe the state of a memory block. | |
344 | ||
345 | When writing, ``online``, ``offline``, ``online_kernel`` and | |
346 | ``online_movable`` are supported. | |
347 | ||
348 | ``online_movable`` specifies onlining to ZONE_MOVABLE. | |
349 | ``online_kernel`` specifies onlining to the default kernel | |
350 | zone for the memory block, such as ZONE_NORMAL. | |
351 | ``online`` lets the kernel select the zone automatically. | |
352 | ||
353 | When reading, ``online``, ``offline`` and ``going-offline`` | |
354 | may be returned. | |
355 | ``uevent`` read-write: generic uevent file for devices. | |
356 | ``valid_zones`` read-only: when a block is online, shows the zone it | |
357 | belongs to; when a block is offline, shows what zone will | |
358 | manage it when the block is onlined. | |
359 | ||
360 | For online memory blocks, ``DMA``, ``DMA32``, ``Normal``, | |
361 | ``Movable`` and ``none`` may be returned. ``none`` indicates | |
362 | that memory provided by a memory block is managed by | |
363 | multiple zones or spans multiple nodes; such memory blocks | |
364 | cannot be offlined. ``Movable`` indicates ZONE_MOVABLE. | |
365 | Other values indicate a kernel zone. | |
366 | ||
367 | For offline memory blocks, the first column shows the | |
368 | zone the kernel would select when onlining the memory block | |
369 | right now without further specifying a zone. | |
370 | ||
371 | Availability depends on the CONFIG_MEMORY_HOTREMOVE | |
372 | kernel configuration option. | |
373 | =================== ============================================================ | |
c18c1cce MCC |
374 | |
375 | .. note:: | |
6867c931 | 376 | |
ac3332c4 DH |
377 | If the CONFIG_NUMA kernel configuration option is enabled, the memoryXXX/ |
378 | directories can also be accessed via symbolic links located in the | |
379 | ``/sys/devices/system/node/node*`` directories. | |
380 | ||
381 | For example:: | |
382 | ||
383 | /sys/devices/system/node/node0/memory9 -> ../../memory/memory9 | |
384 | ||
385 | A backlink will also be created:: | |
386 | ||
387 | /sys/devices/system/memory/memory9/node0 -> ../../node/node0 | |
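These symbolic links make it possible to list the memory blocks that belong to a given NUMA node. A minimal sketch, assuming CONFIG_NUMA is enabled; the function name and the optional directory argument are illustrative only:

```shell
# List the memory blocks linked below a NUMA node directory.
# $1: node id, $2: sysfs node directory (optional, defaults to the real one)
memory_blocks_of_node() {
    dir="${2:-/sys/devices/system/node}"
    for link in "$dir/node$1"/memory*; do
        [ -e "$link" ] || continue
        basename "$link"
    done
}
```

For example, ``memory_blocks_of_node 0`` prints one ``memoryXXX`` name per line for node 0.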
388 | ||
389 | Command Line Parameters | |
390 | ----------------------- | |
391 | ||
392 | Some command line parameters affect memory hot(un)plug handling. The following | |
393 | command line parameters are relevant: | |
394 | ||
395 | ======================== ======================================================= | |
396 | ``memhp_default_state`` configure auto-onlining by essentially setting | |
397 | ``/sys/devices/system/memory/auto_online_blocks``. | |
9e122cc1 DH |
398 | ``movable_node`` configure automatic zone selection in the kernel when |
399 | using the ``contig-zones`` online policy. When | |
400 | set, the kernel will default to ZONE_MOVABLE when | |
401 | onlining a memory block, unless other zones can be kept | |
402 | contiguous. | |
ac3332c4 DH |
403 | ======================== ======================================================= |
404 | ||
9e122cc1 DH |
405 | See Documentation/admin-guide/kernel-parameters.txt for a more generic |
406 | description of these command line parameters. | |
407 | ||
ac3332c4 DH |
408 | Module Parameters |
409 | ------------------ | |
6867c931 | 410 | |
ac3332c4 DH |
411 | Instead of additional command line parameters or sysfs files, the |
412 | ``memory_hotplug`` subsystem now provides a dedicated namespace for module | |
413 | parameters. Module parameters can be set via the command line by prefixing | |
414 | them with ``memory_hotplug.`` such as:: | |
415 | ||
416 | memory_hotplug.memmap_on_memory=1 | |
417 | ||
418 | and they can be observed (and some even modified at runtime) via:: | |
419 | ||
a8db400f | 420 | /sys/module/memory_hotplug/parameters/ |
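The current values can be dumped by reading every file in that directory. A minimal sketch with a hypothetical function name; the optional directory argument only exists to make the function easy to try out:

```shell
# Print "name=value" for every memory_hotplug module parameter.
# $1: parameters directory (optional, defaults to the real one)
memory_hotplug_params() {
    dir="${1:-/sys/module/memory_hotplug/parameters}"
    for param in "$dir"/*; do
        [ -f "$param" ] || continue
        printf '%s=%s\n' "$(basename "$param")" "$(cat "$param")"
    done
}
```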
ac3332c4 DH |
421 | |
422 | The following module parameters are currently defined: | |
423 | ||
9e122cc1 DH |
424 | ================================ =============================================== |
425 | ``memmap_on_memory`` read-write: Allocate memory for the memmap from | |
426 | the added memory block itself. Even if enabled, | |
427 | actual support depends on various other system | |
428 | properties and should only be regarded as a | |
429 | hint whether the behavior would be desired. | |
430 | ||
431 | While allocating the memmap from the memory | |
432 | block itself makes memory hotplug less likely | |
433 | to fail and keeps the memmap on the same NUMA | |
434 | node in any case, it can fragment physical | |
435 | memory in a way that huge pages in bigger | |
436 | granularity cannot be formed on hotplugged | |
437 | memory. | |
438 | ``online_policy`` read-write: Set the basic policy used for | |
439 | automatic zone selection when onlining memory | |
440 | blocks without specifying a target zone. | |
441 | ``contig-zones`` has been the kernel default | |
442 | before this parameter was added. After an | |
443 | online policy was configured and memory was | |
444 | onlined, the policy should not be changed | |
445 | anymore. | |
446 | ||
447 | When set to ``contig-zones``, the kernel will | |
448 | try keeping zones contiguous. If a memory block | |
449 | intersects multiple zones or no zone, the | |
450 | behavior depends on the ``movable_node`` kernel | |
451 | command line parameter: default to ZONE_MOVABLE | |
452 | if set, default to the applicable kernel zone | |
453 | (usually ZONE_NORMAL) if not set. | |
454 | ||
455 | When set to ``auto-movable``, the kernel will | |
456 | try onlining memory blocks to ZONE_MOVABLE if | |
457 | possible according to the configuration and | |
458 | memory device details. With this policy, one | |
459 | can avoid zone imbalances when eventually | |
460 | hotplugging a lot of memory later and still | |
461 | wanting to be able to hotunplug as much as | |
462 | possible reliably, very desirable in | |
463 | virtualized environments. This policy ignores | |
464 | the ``movable_node`` kernel command line | |
465 | parameter and isn't really applicable in | |
466 | environments that require it (e.g., bare metal | |
467 | with hotunpluggable nodes) where hotplugged | |
468 | memory might be exposed via the | |
469 | firmware-provided memory map early during boot | |
470 | to the system instead of getting detected, | |
471 | added and onlined later during boot (such as | |
472 | done by virtio-mem or by some hypervisors | |
473 | implementing emulated DIMMs). As one example, a | |
474 | hotplugged DIMM will be onlined either | |
475 | completely to ZONE_MOVABLE or completely to | |
476 | ZONE_NORMAL, not a mixture. | |
477 | As another example, as many memory blocks | |
478 | belonging to a virtio-mem device will be | |
479 | onlined to ZONE_MOVABLE as possible, | |
480 | special-casing units of memory blocks that can | |
481 | only get hotunplugged together. *This policy | |
482 | does not protect from setups that are | |
483 | problematic with ZONE_MOVABLE and does not | |
484 | change the zone of memory blocks dynamically | |
485 | after they were onlined.* | |
486 | ``auto_movable_ratio`` read-write: Set the maximum MOVABLE:KERNEL | |
487 | memory ratio in % for the ``auto-movable`` | |
488 | online policy. Whether the ratio applies only | |
489 | for the system across all NUMA nodes or also | |
490 | per NUMA node depends on the | |
491 | ``auto_movable_numa_aware`` configuration. | |
492 | ||
493 | All accounting is based on present memory pages | |
494 | in the zones combined with accounting per | |
495 | memory device. Memory dedicated to the CMA | |
496 | allocator is accounted as MOVABLE, although | |
497 | residing on one of the kernel zones. The | |
498 | possible ratio depends on the actual workload. | |
499 | The kernel default is "301" %, for example, | |
500 | allowing for hotplugging 24 GiB to an 8 GiB VM | |
501 | and automatically onlining all hotplugged | |
502 | memory to ZONE_MOVABLE in many setups. The | |
503 | additional 1% deals with some pages being not | |
504 | present, for example, because of some firmware | |
505 | allocations. | |
506 | ||
507 | Note that ZONE_NORMAL memory provided by one | |
508 | memory device does not allow for more | |
509 | ZONE_MOVABLE memory for a different memory | |
510 | device. As one example, onlining memory of a | |
511 | hotplugged DIMM to ZONE_NORMAL will not allow | |
512 | for another hotplugged DIMM to get onlined to | |
513 | ZONE_MOVABLE automatically. In contrast, memory | |
514 | hotplugged by a virtio-mem device that got | |
515 | onlined to ZONE_NORMAL will allow for more | |
516 | ZONE_MOVABLE memory within *the same* | |
517 | virtio-mem device. | |
518 | ``auto_movable_numa_aware`` read-write: Configure whether the | |
519 | ``auto_movable_ratio`` in the ``auto-movable`` | |
520 | online policy also applies per NUMA | |
521 | node in addition to the whole system across all | |
522 | NUMA nodes. The kernel default is "Y". | |
523 | ||
524 | Disabling NUMA awareness can be helpful when | |
525 | dealing with NUMA nodes that should be | |
526 | completely hotunpluggable, onlining the memory | |
527 | completely to ZONE_MOVABLE automatically if | |
528 | possible. | |
529 | ||
530 | Parameter availability depends on CONFIG_NUMA. | |
531 | ================================ =============================================== | |
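The ``auto_movable_ratio`` example in the table (24 GiB hotplugged to an 8 GiB VM with the default of 301 %) can be checked with simple arithmetic. The sketch below is a deliberate simplification: it ignores the per-memory-device and per-node accounting performed by the kernel, and the function name is hypothetical.

```shell
# Upper bound on MOVABLE memory for a given amount of KERNEL memory.
# Simplified: ignores per-device, per-node and CMA accounting.
# $1: KERNEL memory in MiB, $2: MOVABLE:KERNEL ratio in percent
max_movable_mib() {
    echo $(( $1 * $2 / 100 ))
}
```

For 8192 MiB of KERNEL memory and a ratio of 301, the bound is 24657 MiB, slightly more than the 24576 MiB (24 GiB) from the example; with a ratio of exactly 300 there would be no headroom for pages that are not present.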
ac3332c4 DH |
532 | |
533 | ZONE_MOVABLE | |
534 | ============ | |
535 | ||
536 | ZONE_MOVABLE is an important mechanism for more reliable memory offlining. | |
537 | Further, having system RAM managed by ZONE_MOVABLE instead of one of the | |
538 | kernel zones can increase the number of possible transparent huge pages and | |
539 | dynamically allocated huge pages. | |
540 | ||
541 | Most kernel allocations are unmovable. Important examples include the memory | |
542 | map (usually 1/64th of memory), page tables, and kmalloc(). Such allocations | |
543 | can only be served from the kernel zones. | |
544 | ||
545 | Most user space pages, such as anonymous memory, and page cache pages are | |
546 | movable. Such allocations can be served from ZONE_MOVABLE and the kernel zones. | |
547 | ||
548 | Only movable allocations are served from ZONE_MOVABLE, resulting in unmovable | |
549 | allocations being limited to the kernel zones. Without ZONE_MOVABLE, there is | |
550 | absolutely no guarantee whether a memory block can be offlined successfully. | |
551 | ||
552 | Zone Imbalances | |
553 | --------------- | |
ad2fa371 | 554 | |
ac3332c4 DH |
555 | Having too much system RAM managed by ZONE_MOVABLE is called a zone imbalance, |
556 | which can harm the system or degrade performance. As one example, the kernel | |
557 | might crash because it runs out of free memory for unmovable allocations, | |
558 | although there is still plenty of free memory left in ZONE_MOVABLE. | |
559 | ||
560 | Usually, MOVABLE:KERNEL ratios of up to 3:1 or even 4:1 are fine. Ratios of 63:1 | |
561 | are definitely impossible due to the overhead for the memory map. | |
562 | ||
563 | Actual safe zone ratios depend on the workload. Extreme cases, like excessive | |
564 | long-term pinning of pages, might not be able to deal with ZONE_MOVABLE at all. | |
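To get a rough idea of the current MOVABLE:KERNEL ratio, the per-zone ``present`` page counts can be summed up. The sketch below assumes input in the format of ``/proc/zoneinfo`` (a ``Node N, zone NAME`` header followed by a ``present`` line per zone); it does not account CMA memory as MOVABLE, skips the ``Device`` zone, and uses a hypothetical function name.

```shell
# Read /proc/zoneinfo-style text on stdin and print the MOVABLE:KERNEL
# ratio in percent, based on the "present" page counts of all zones.
movable_kernel_ratio() {
    awk '
        $1 == "Node" && $3 == "zone" { zone = $4 }
        $1 == "present" {
            if (zone == "Movable") movable += $2
            else if (zone != "Device") kernel += $2
        }
        END {
            if (kernel > 0) printf "%d\n", movable * 100 / kernel
            else print "n/a"
        }
    '
}
```

A possible invocation is ``movable_kernel_ratio < /proc/zoneinfo``; a result of 300 corresponds to a 3:1 ratio.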
ad2fa371 | 565 | |
fa965fd5 | 566 | .. note:: |
fa965fd5 | 567 | |
ac3332c4 DH |
568 | CMA memory part of a kernel zone essentially behaves like memory in |
569 | ZONE_MOVABLE and similar considerations apply, especially when combining | |
570 | CMA with ZONE_MOVABLE. | |
6867c931 | 571 | |
ac3332c4 DH |
572 | ZONE_MOVABLE Sizing Considerations |
573 | ---------------------------------- | |
6867c931 | 574 | |
ac3332c4 DH |
575 | We usually expect that a large portion of available system RAM will actually |
576 | be consumed by user space, either directly or indirectly via the page cache. In | |
577 | the normal case, ZONE_MOVABLE can be used when allocating such pages just fine. | |
6867c931 | 578 | |
ac3332c4 DH |
579 | With that in mind, it makes sense that we can have a big portion of system RAM |
580 | managed by ZONE_MOVABLE. However, there are some things to consider when using | |
581 | ZONE_MOVABLE, especially when fine-tuning zone ratios: | |
582 | ||
583 | - Having a lot of offline memory blocks. Even offline memory blocks consume | |
584 | memory for metadata and page tables in the direct map; having a lot of offline | |
585 | memory blocks is not a typical case, though. | |
586 | ||
587 | - Memory ballooning without balloon compaction is incompatible with | |
588 | ZONE_MOVABLE. Only some implementations, such as virtio-balloon and | |
589 | pseries CMM, fully support balloon compaction. | |
590 | ||
591 | Further, the CONFIG_BALLOON_COMPACTION kernel configuration option might be | |
592 | disabled. In that case, balloon inflation will only perform unmovable | |
593 | allocations and silently create a zone imbalance, usually triggered by | |
594 | inflation requests from the hypervisor. | |
595 | ||
596 | - Gigantic pages are unmovable, resulting in user space consuming a | |
597 | lot of unmovable memory. | |
598 | ||
599 | - Huge pages are unmovable when an architecture does not support huge | |
600 | page migration, resulting in a similar issue as with gigantic pages. | |
601 | ||
602 | - Page tables are unmovable. Excessive swapping, mapping extremely large | |
603 | files or ZONE_DEVICE memory can be problematic, although only really relevant | |
604 | in corner cases. When we manage a lot of user space memory that has been | |
605 | swapped out or is served from a file/persistent memory/... we still need a lot | |
606 | of page tables to manage that memory once user space accessed that memory. | |
607 | ||
608 | - In certain DAX configurations the memory map for the device memory will be | |
609 | allocated from the kernel zones. | |
610 | ||
611 | - KASAN can have a significant memory overhead, for example, consuming 1/8th of | |
612 | the total system memory size as (unmovable) tracking metadata. | |
613 | ||
614 | - Long-term pinning of pages. Techniques that rely on long-term pinnings | |
615 | (especially, RDMA and vfio/mdev) are fundamentally problematic with | |
616 | ZONE_MOVABLE, and therefore, memory offlining. Pinned pages cannot reside | |
617 | on ZONE_MOVABLE as that would turn these pages unmovable. Therefore, they | |
618 | have to be migrated off that zone while pinning. Pinning a page can fail | |
619 | even if there is plenty of free memory in ZONE_MOVABLE. | |
620 | ||
621 | In addition, using ZONE_MOVABLE might make page pinning more expensive, | |
622 | because of the page migration overhead. | |
623 | ||
624 | By default, all the memory configured at boot time is managed by the kernel | |
625 | zones and ZONE_MOVABLE is not used. | |
626 | ||
627 | To enable ZONE_MOVABLE to include the memory present at boot and to control the | |
628 | ratio between movable and kernel zones there are two command line options: | |
629 | ``kernelcore=`` and ``movablecore=``. See | |
630 | Documentation/admin-guide/kernel-parameters.rst for their description. | |
631 | ||
632 | Memory Offlining and ZONE_MOVABLE | |
633 | --------------------------------- | |
634 | ||
635 | Even with ZONE_MOVABLE, there are some corner cases where offlining a memory | |
636 | block might fail: | |
637 | ||
638 | - Memory blocks with memory holes; this applies to memory blocks present during | |
639 | boot and can apply to memory blocks hotplugged via the XEN balloon and the | |
640 | Hyper-V balloon. | |
641 | ||
642 | - Mixed NUMA nodes and mixed zones within a single memory block prevent memory | |
643 | offlining; this applies to memory blocks present during boot only. | |
644 | ||
645 | - Special memory blocks prevented by the system from getting offlined. Examples | |
646 | include any memory available during boot on arm64 or memory blocks spanning | |
647 | the crashkernel area on s390x; this usually applies to memory blocks present | |
648 | during boot only. | |
649 | ||
650 | - Memory blocks overlapping with CMA areas cannot be offlined; this applies to | |
651 | memory blocks present during boot only. | |
652 | ||
653 | - Concurrent activity that operates on the same physical memory area, such as | |
654 | allocating gigantic pages, can result in temporary offlining failures. | |
655 | ||
dff03381 MS |
656 | - Out of memory when dissolving huge pages, especially when HugeTLB Vmemmap |
657 | Optimization (HVO) is enabled. | |
ac3332c4 DH |
658 | |
659 | Offlining code may be able to migrate huge page contents, but may not be able | |
660 | to dissolve the source huge page because it fails allocating (unmovable) pages | |
661 | for the vmemmap, because the system might not have free memory in the kernel | |
662 | zones left. | |
663 | ||
664 | Users that depend on memory offlining to succeed for movable zones should | |
665 | carefully consider whether the memory savings gained from this feature are | |
666 | worth the risk of possibly not being able to offline memory in certain | |
667 | situations. | |
668 | ||
669 | Further, when running into out of memory situations while migrating pages, or | |
670 | when still encountering permanently unmovable pages within ZONE_MOVABLE | |
671 | (-> BUG), memory offlining will keep retrying until it eventually succeeds. | |
672 | ||
673 | When offlining is triggered from user space, the offlining context can be | |
674 | terminated by sending a fatal signal. A timeout based offlining can easily be | |
675 | implemented via:: | |
6867c931 | 676 | |
ac3332c4 | 677 | % timeout $TIMEOUT offline_block | failure_handling |
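One way to fill in the ``offline_block`` and ``failure_handling`` placeholders is sketched below. The function name is hypothetical, and the sketch assumes the ``timeout`` utility from GNU coreutils, which by default terminates the command with SIGTERM once the time limit is reached.

```shell
# Try to offline a memory block, giving up after a timeout.
# $1: timeout in seconds, $2: path to the "state" file of the memory block
offline_block_with_timeout() {
    if timeout "$1" sh -c 'echo offline > "$1"' sh "$2"; then
        echo "offlined"
    else
        # Failure handling: report the problem and let the caller decide.
        echo "offlining failed or timed out" >&2
        return 1
    fi
}
```

For example, ``offline_block_with_timeout 30 /sys/devices/system/memory/memoryXXX/state`` gives the kernel 30 seconds to offline the block.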