Documentation for /proc/sys/vm/*	kernel version 2.6.29
	(c) 1998, 1999,  Rik van Riel <riel@nl.linux.org>
	(c) 2008         Peter W. Morreale <pmorreale@novell.com>

For general info and legal blurb, please look in README.

==============================================================

This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel version 2.6.29.

The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel and
the writeout of dirty data to disk.

Default values and initialization routines for most of these
files can be found in mm/swap.c.

Currently, these files are in /proc/sys/vm:

- admin_reserve_kbytes
- block_dump
- compact_memory
- compact_unevictable_allowed
- dirty_background_bytes
- dirty_background_ratio
- dirty_bytes
- dirty_expire_centisecs
- dirty_ratio
- dirtytime_expire_seconds
- dirty_writeback_centisecs
- drop_caches
- extfrag_threshold
- hugetlb_shm_group
- laptop_mode
- legacy_va_layout
- lowmem_reserve_ratio
- max_map_count
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
- min_slab_ratio
- min_unmapped_ratio
- mmap_min_addr
- mmap_rnd_bits
- mmap_rnd_compat_bits
- nr_hugepages
- nr_hugepages_mempolicy
- nr_overcommit_hugepages
- nr_trim_pages         (only if CONFIG_MMU=n)
- numa_zonelist_order
- oom_dump_tasks
- oom_kill_allocating_task
- overcommit_kbytes
- overcommit_memory
- overcommit_ratio
- page-cluster
- panic_on_oom
- percpu_pagelist_fraction
- stat_interval
- stat_refresh
- numa_stat
- swappiness
- user_reserve_kbytes
- vfs_cache_pressure
- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode

==============================================================

admin_reserve_kbytes

The amount of free memory in the system that should be reserved for users
with the capability cap_sys_admin.

admin_reserve_kbytes defaults to min(3% of free pages, 8MB)

That should provide enough for the admin to log in and kill a process,
if necessary, under the default overcommit 'guess' mode.

Systems running under overcommit 'never' should increase this to account
for the full Virtual Memory Size of programs used to recover. Otherwise,
root may not be able to log in to recover the system.

How do you calculate a minimum useful reserve?

sshd or login + bash (or some other shell) + top (or ps, kill, etc.)

For overcommit 'guess', we can sum resident set sizes (RSS).
On x86_64 this is about 8MB.

For overcommit 'never', we can take the max of their virtual sizes (VSZ)
and add the sum of their RSS.
On x86_64 this is about 128MB.

Changing this takes effect whenever an application requests memory.

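The two reserve calculations above can be sketched as a small shell helper.
This is only an illustration: the function name admin_reserve_estimate and
the sample figures are made up, and the input is assumed to be one
"VSZ RSS" pair in kB per recovery program.

```shell
# Hypothetical helper: estimate a minimum admin reserve in kB.
# stdin: one "VSZ_KB RSS_KB" pair per program needed for recovery.
# $1:    "guess" -> sum of RSS; "never" -> max VSZ + sum of RSS.
admin_reserve_estimate() {
    awk -v mode="$1" '
        { sum_rss += $2; if ($1 + 0 > max_vsz) max_vsz = $1 + 0 }
        END {
            if (mode == "never") print max_vsz + sum_rss
            else print sum_rss + 0
        }'
}

# Made-up sample figures (kB) standing in for sshd, bash and top.
printf '%s\n' '72000 5500' '22000 3500' '45000 3000' | admin_reserve_estimate guess
printf '%s\n' '72000 5500' '22000 3500' '45000 3000' | admin_reserve_estimate never
```

On a live system the two columns could be collected with something like
"ps -o vsz=,rss= -p <pid>" for each program of interest.
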
==============================================================

block_dump

block_dump enables block I/O debugging when set to a nonzero value. More
information on block I/O debugging is in Documentation/laptops/laptop-mode.txt.

==============================================================

compact_memory

Available only when CONFIG_COMPACTION is set. When 1 is written to the file,
all zones are compacted such that free memory is available in contiguous
blocks where possible. This can be important, for example, in the allocation
of huge pages, although processes will also directly compact memory as
required.

==============================================================

compact_unevictable_allowed

Available only when CONFIG_COMPACTION is set. When set to 1, compaction is
allowed to examine the unevictable lru (mlocked pages) for pages to compact.
This should be used on systems where stalls for minor page faults are an
acceptable trade for large contiguous free memory. Set to 0 to prevent
compaction from moving pages that are unevictable. Default value is 1.

==============================================================

dirty_background_bytes

Contains the amount of dirty memory at which the background kernel
flusher threads will start writeback.

Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only
one of them may be specified at a time. When one sysctl is written it is
immediately taken into account to evaluate the dirty memory limits and the
other appears as 0 when read.

==============================================================

dirty_background_ratio

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which the background kernel
flusher threads will start writing out dirty data.

The total available memory is not equal to total system memory.

==============================================================

dirty_bytes

Contains the amount of dirty memory at which a process generating disk writes
will itself start writeback.

Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be
specified at a time. When one sysctl is written it is immediately taken into
account to evaluate the dirty memory limits and the other appears as 0 when
read.

Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any
value lower than this limit will be ignored and the old configuration will be
retained.

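The two-page minimum for dirty_bytes can be checked before writing a value.
The sketch below is illustrative only; the function names are made up, and
the page size is taken from "getconf PAGESIZE" unless one is passed in.

```shell
# Smallest dirty_bytes value that is accepted according to the text
# above: two pages, expressed in bytes.
# $1: page size in bytes (optional; defaults to this machine's page size)
min_dirty_bytes() {
    page_size=${1:-$(getconf PAGESIZE)}
    echo $(( 2 * page_size ))
}

# Succeeds if a candidate value is at least two pages.
# $1: candidate dirty_bytes value, $2: page size in bytes (optional)
dirty_bytes_is_valid() {
    [ "$1" -ge "$(min_dirty_bytes "$2")" ]
}

min_dirty_bytes 4096
if dirty_bytes_is_valid 4096 4096; then
    echo "4096 would be accepted"
else
    echo "4096 would be ignored"
fi
```
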
==============================================================

dirty_expire_centisecs

This tunable is used to define when dirty data is old enough to be eligible
for writeout by the kernel flusher threads. It is expressed in hundredths
of a second. Data which has been dirty in-memory for longer than this
interval will be written out next time a flusher thread wakes up.

==============================================================

dirty_ratio

Contains, as a percentage of total available memory that contains free pages
and reclaimable pages, the number of pages at which a process which is
generating disk writes will itself start writing out dirty data.

The total available memory is not equal to total system memory.

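As a rough illustration of how a ratio tunable maps to a page count, the
percentage can be applied to the number of available (free plus reclaimable)
pages. The helper name and the page counts below are made up, and the
kernel's real calculation has more inputs than this sketch.

```shell
# Sketch only: convert dirty_ratio or dirty_background_ratio to pages.
# $1: available pages (free + reclaimable), a hypothetical input
# $2: the ratio, in percent
dirty_threshold_pages() {
    echo $(( $1 * $2 / 100 ))
}

# With 1000000 available pages:
dirty_threshold_pages 1000000 10    # e.g. a background ratio of 10
dirty_threshold_pages 1000000 20    # e.g. a dirty_ratio of 20
```
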
==============================================================

dirtytime_expire_seconds

When a lazytime inode is constantly having its pages dirtied, the inode with
an updated timestamp will never get a chance to be written out. And, if the
only thing that has happened on the file system is a dirtytime inode caused
by an atime update, a worker will be scheduled to make sure that inode
eventually gets pushed out to disk. This tunable is used to define when a
dirty inode is old enough to be eligible for writeback by the kernel flusher
threads. It is also used as the interval at which the dirtytime_writeback
thread is woken up.

==============================================================

dirty_writeback_centisecs

The kernel flusher threads will periodically wake up and write `old' data
out to disk. This tunable expresses the interval between those wakeups, in
hundredths of a second.

Setting this to zero disables periodic writeback altogether.

==============================================================

drop_caches

Writing to this will cause the kernel to drop clean caches, as well as
reclaimable slab objects like dentries and inodes. Once dropped, their
memory becomes free.

To free pagecache:
	echo 1 > /proc/sys/vm/drop_caches
To free reclaimable slab objects (includes dentries and inodes):
	echo 2 > /proc/sys/vm/drop_caches
To free slab objects and pagecache:
	echo 3 > /proc/sys/vm/drop_caches

This is a non-destructive operation and will not free any dirty objects.
To increase the number of objects freed by this operation, the user may run
`sync' prior to writing to /proc/sys/vm/drop_caches. This will minimize the
number of dirty objects on the system and create more candidates to be
dropped.

This file is not a means to control the growth of the various kernel caches
(inodes, dentries, pagecache, etc.). These objects are automatically
reclaimed by the kernel when memory is needed elsewhere on the system.

Use of this file can cause performance problems. Since it discards cached
objects, it may cost a significant amount of I/O and CPU to recreate the
dropped objects, especially if they were under heavy use. Because of this,
use outside of a testing or debugging environment is not recommended.

You may see informational messages in your kernel log when this file is
used:

	cat (1234): drop_caches: 3

These are informational only. They do not mean that anything is wrong
with your system. To disable them, echo 4 (bit 2) into drop_caches.

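The values described above behave as a bit mask. The following helper (a
made-up name, for illustration only) decodes a value into the actions the
text lists; it does not touch /proc/sys/vm/drop_caches itself.

```shell
# Describe what a drop_caches value requests, per the text above.
# $1: value that would be written to /proc/sys/vm/drop_caches
drop_caches_describe() {
    value=$1
    if [ $(( value & 1 )) -ne 0 ]; then
        echo "free pagecache"
    fi
    if [ $(( value & 2 )) -ne 0 ]; then
        echo "free reclaimable slab objects"
    fi
    if [ $(( value & 4 )) -ne 0 ]; then
        echo "disable informational messages"
    fi
    return 0
}

drop_caches_describe 3
```
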
==============================================================

extfrag_threshold

This parameter affects whether the kernel will compact memory or direct
reclaim to satisfy a high-order allocation. The extfrag/extfrag_index file in
debugfs shows what the fragmentation index for each order is in each zone in
the system. Values tending towards 0 imply allocations would fail due to lack
of memory, values towards 1000 imply failures are due to fragmentation and -1
implies that the allocation will succeed as long as watermarks are met.

The kernel will not compact memory in a zone if the
fragmentation index is <= extfrag_threshold. The default value is 500.

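The comparison described above can be written down as a one-line test. The
helper name is made up, and the index values are examples of what might be
read from the debugfs extfrag_index file.

```shell
# Apply the rule from the text: no compaction when the fragmentation
# index is <= extfrag_threshold.
# $1: fragmentation index (-1 .. 1000)
# $2: extfrag_threshold (optional; 500 is the documented default)
extfrag_decision() {
    index=$1
    threshold=${2:-500}
    if [ "$index" -le "$threshold" ]; then
        echo "skip compaction"
    else
        echo "compact"
    fi
}

extfrag_decision -1     # allocation expected to succeed
extfrag_decision 300    # failure attributed mostly to lack of memory
extfrag_decision 800    # failure attributed mostly to fragmentation
```
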
==============================================================

highmem_is_dirtyable

Available only for systems with CONFIG_HIGHMEM enabled (32-bit systems).

This parameter controls whether the high memory is considered for dirty
writers throttling. This is not the case by default, which means that
only the amount of memory directly visible/usable by the kernel can
be dirtied. As a result, on systems with a large amount of memory and
lowmem basically depleted, writers might be throttled too early and
streaming writes can get very slow.

Changing the value to non-zero would allow more memory to be dirtied
and thus allow writers to write more data which can be flushed to the
storage more effectively. Note this also comes with a risk of premature
OOM killer invocation because some writers (e.g. direct block device writes)
can only use the low memory and they can fill it up with dirty data without
any throttling.

==============================================================

hugetlb_shm_group

hugetlb_shm_group contains the group id that is allowed to create SysV
shared memory segments using hugetlb pages.

==============================================================

laptop_mode

laptop_mode is a knob that controls "laptop mode". All the things that are
controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt.

==============================================================

legacy_va_layout

If non-zero, this sysctl disables the new 32-bit mmap layout - the kernel
will use the legacy (2.4) layout for all processes.

==============================================================

lowmem_reserve_ratio

For some specialised workloads on highmem machines it is dangerous for
the kernel to allow process memory to be allocated from the "lowmem"
zone. This is because that memory could then be pinned via the mlock()
system call, or by unavailability of swapspace.

And on large highmem machines this lack of reclaimable lowmem memory
can be fatal.

So the Linux page allocator has a mechanism which prevents allocations
which _could_ use highmem from using too much lowmem. This means that
a certain amount of lowmem is defended from the possibility of being
captured into pinned user memory.

(The same argument applies to the old 16 megabyte ISA DMA region. This
mechanism will also defend that region from allocations which could use
highmem or lowmem).

The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is
in defending these lower zones.

If you have a machine which uses highmem or ISA DMA and your
applications are using mlock(), or if you are running with no swap then
you probably should change the lowmem_reserve_ratio setting.

The lowmem_reserve_ratio is an array. You can see its values by reading this
file.
-
% cat /proc/sys/vm/lowmem_reserve_ratio
256     256     32
-

However, these values are not used directly. The kernel calculates the number
of protection pages for each zone from them. These are shown as an array of
protection pages in /proc/zoneinfo, as follows. (This is an example from an
x86-64 box.) Each zone has an array of protection pages like this.

-
Node 0, zone      DMA
  pages free     1355
        min      3
        low      3
        high     4
        :
        :
    numa_other   0
        protection: (0, 2004, 2004, 2004)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  pagesets
    cpu: 0 pcp: 0
        :
-
These protections are added to the score used to judge whether this zone
should be used for page allocation or should be reclaimed.

In this example, if normal pages (index=2) are requested from this DMA zone
and watermark[WMARK_HIGH] is used as the watermark, the kernel judges that
this zone should not be used because pages_free (1355) is smaller than
watermark + protection[2] (4 + 2004 = 2008). If this protection value were 0,
this zone would be used for the normal page request. If the request is for
the DMA zone (index=0), protection[0] (=0) is used.

zone[i]'s protection[j] is calculated by the following expression.

(i < j):
  zone[i]->protection[j]
  = (total sums of managed_pages from zone[i+1] to zone[j] on the node)
    / lowmem_reserve_ratio[i];
(i = j):
   (should not be protected. = 0;)
(i > j):
   (not necessary, but looks 0)

The default values of lowmem_reserve_ratio[i] are
    256 (if zone[i] means DMA or DMA32 zone)
    32  (others).
As the above expression shows, they are the reciprocal of the ratio.
256 means 1/256. The number of protection pages becomes about "0.39%" of the
total managed pages of higher zones on the node.

If you would like to protect more pages, smaller values are effective.
The minimum value is 1 (1/1 -> 100%). A value less than 1 completely
disables protection of the pages.

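The expression for i < j and the DMA zone example above can be reproduced
with shell arithmetic. This is a sketch under stated assumptions: the helper
names are made up, the figure of 513024 managed pages is chosen only so that
the result matches the 2004 shown in the example, and treating "equal to
watermark + protection" as unusable is an assumption the text does not spell
out.

```shell
# zone[i]->protection[j] for i < j.
# $1: total managed pages in zone[i+1] .. zone[j] on the node
# $2: lowmem_reserve_ratio[i]
zone_protection() {
    echo $(( $1 / $2 ))
}

# Succeeds if the zone may serve the request.
# $1: free pages in the zone
# $2: watermark in use (e.g. the "high" value)
# $3: protection[] entry for the requesting index
zone_usable() {
    [ "$1" -gt $(( $2 + $3 )) ]
}

prot=$(zone_protection 513024 256)
echo "protection = $prot"
if zone_usable 1355 4 "$prot"; then
    echo "DMA zone can be used for the normal page request"
else
    echo "DMA zone is skipped: 1355 is below 4 + $prot"
fi
```
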
==============================================================

max_map_count:

This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap, mprotect, and madvise, and also when loading
shared libraries.

While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.

The default value is 65536.

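To see how close a process is to the limit, the lines of its maps file can be
counted, since /proc/<pid>/maps lists one mapping per line. The helper names
below are made up for illustration.

```shell
# Count the memory map areas listed in a maps file.
# $1: path to a maps file, e.g. /proc/<pid>/maps
count_map_areas() {
    wc -l < "$1" | tr -d ' '
}

# Remaining headroom before the limit is reached.
# $1: current number of areas, $2: max_map_count
map_headroom() {
    echo $(( $2 - $1 ))
}

# Only runs where procfs is available; /proc/self here refers to the
# process that opens the file.
if [ -r /proc/self/maps ]; then
    echo "areas in use: $(count_map_areas /proc/self/maps)"
fi
```
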
==============================================================

memory_failure_early_kill:

Control how to kill processes when an uncorrected memory error (typically
a 2bit error in a memory module) that cannot be handled by the kernel is
detected in the background by hardware. In some cases (like the page
still having a valid copy on disk) the kernel will handle the failure
transparently without affecting any applications. But if there is
no other uptodate copy of the data it will kill processes to prevent any
data corruptions from propagating.

1: Kill all processes that have the corrupted and not reloadable page mapped
as soon as the corruption is detected. Note this is not supported
for a few types of pages, like kernel internally allocated data or
the swap cache, but works for the majority of user pages.

0: Only unmap the corrupted page from all processes and only kill a process
that tries to access it.

The kill is done using a catchable SIGBUS with BUS_MCEERR_AO, so processes can
handle this if they want to.

This is only active on architectures/platforms with advanced machine
check handling and depends on the hardware capabilities.

Applications can override this setting individually with the PR_MCE_KILL prctl.

==============================================================

memory_failure_recovery

Enable memory failure recovery (when supported by the platform)

1: Attempt recovery.

0: Always panic on a memory failure.

==============================================================

min_free_kbytes:

This is used to force the Linux VM to keep a minimum number
of kilobytes free. The VM uses this number to compute a
watermark[WMARK_MIN] value for each lowmem zone in the system.
Each lowmem zone gets a number of reserved free pages based
proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.

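The proportional split mentioned above can be illustrated with simple
arithmetic. This is a simplified sketch, not the kernel's code: the helper
name and the zone sizes are made up, and rounding and any per-zone
adjustments are ignored.

```shell
# Share of min_free_kbytes assigned to one lowmem zone, in pages.
# $1: min_free_kbytes
# $2: page size in kB (e.g. 4)
# $3: pages in this lowmem zone
# $4: pages in all lowmem zones together
zone_min_pages() {
    echo $(( ($1 / $2) * $3 / $4 ))
}

# 65536 kB with 4 kB pages is 16384 pages; a zone holding 3000 of
# 4000 lowmem pages gets three quarters of that.
zone_min_pages 65536 4 3000 4000
```
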
==============================================================

min_slab_ratio:

This is available only on NUMA kernels.

A percentage of the total pages in each zone. On Zone reclaim
(fallback from the local zone occurs) slabs will be reclaimed if more
than this percentage of pages in a zone are reclaimable slab pages.
This ensures that the slab growth stays under control even in NUMA
systems that rarely perform global reclaim.

The default is 5 percent.

Note that slab reclaim is triggered in a per zone / node fashion.
The process of reclaiming slab memory is currently not node specific
and may not be fast.

==============================================================

min_unmapped_ratio:

This is available only on NUMA kernels.

This is a percentage of the total pages in each zone. Zone reclaim will
only occur if more than this percentage of pages are in a state that
zone_reclaim_mode allows to be reclaimed.

If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
against all file-backed unmapped pages including swapcache pages and tmpfs
files. Otherwise, only unmapped pages backed by normal files but not tmpfs
files and similar are considered.

The default is 1 percent.

==============================================================

mmap_min_addr

This file indicates the amount of address space which a user process will
be restricted from mmapping. Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory, userspace processes should not be allowed to write to them. By
default this value is set to 0 and no protections will be enforced by the
security module. Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.

==============================================================

mmap_rnd_bits:

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations on architectures which support
tuning address space randomization. This value will be bounded
by the architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_bits tunable.

==============================================================

mmap_rnd_compat_bits:

This value can be used to select the number of bits to use to
determine the random offset to the base address of vma regions
resulting from mmap allocations for applications run in
compatibility mode on architectures which support tuning address
space randomization. This value will be bounded by the
architecture's minimum and maximum supported values.

This value can be changed after boot using the
/proc/sys/vm/mmap_rnd_compat_bits tunable.

==============================================================

nr_hugepages

Change the minimum size of the hugepage pool.

See Documentation/admin-guide/mm/hugetlbpage.rst

==============================================================

nr_hugepages_mempolicy

Change the size of the hugepage pool at run-time on a specific
set of NUMA nodes.

See Documentation/admin-guide/mm/hugetlbpage.rst

==============================================================

nr_overcommit_hugepages

Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.

See Documentation/admin-guide/mm/hugetlbpage.rst

==============================================================

nr_trim_pages

This is available only on NOMMU kernels.

This value adjusts the excess page trimming behaviour of power-of-2 aligned
NOMMU mmap allocations.

A value of 0 disables trimming of allocations entirely, while a value of 1
trims excess pages aggressively. Any value >= 1 acts as the watermark where
trimming of allocations is initiated.

The default value is 1.

See Documentation/nommu-mmap.txt for more information.

==============================================================

numa_zonelist_order

This sysctl is only for NUMA and it is deprecated. Anything but
Node order will fail!

'where the memory is allocated from' is controlled by zonelists.
(This documentation ignores ZONE_HIGHMEM/ZONE_DMA32 for simple explanation.
 you may be able to read ZONE_DMA as ZONE_DMA32...)

In the non-NUMA case, a zonelist for GFP_KERNEL is ordered as follows.
ZONE_NORMAL -> ZONE_DMA
This means that a memory allocation request for GFP_KERNEL will
get memory from ZONE_DMA only when ZONE_NORMAL is not available.

In the NUMA case, you can think of the following 2 types of order.
Assume 2 node NUMA and below is the zonelist of Node(0)'s GFP_KERNEL

(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA.

Type(A) offers the best locality for processes on Node(0), but ZONE_DMA
will be used before ZONE_NORMAL exhaustion. This increases the possibility of
out-of-memory (OOM) of ZONE_DMA because ZONE_DMA tends to be small.

Type(B) cannot offer the best locality but is more robust against OOM of
the DMA zone.

Type(A) is called "Node" order. Type(B) is "Zone" order.

"Node order" orders the zonelists by node, then by zone within each node.
Specify "[Nn]ode" for node order.

"Zone Order" orders the zonelists by zone type, then by node within each
zone. Specify "[Zz]one" for zone order.

Specify "[Dd]efault" to request automatic configuration.

On 32-bit, the Normal zone needs to be preserved for allocations accessible
by the kernel, so "zone" order will be selected.

On 64-bit, devices that require DMA32/DMA are relatively rare, so "node"
order will be selected.

Default order is recommended unless this is causing problems for your
system/application.

==============================================================

oom_dump_tasks

Enables a system-wide task dump (excluding kernel threads) to be produced
when the kernel performs an OOM-killing and includes such information as
pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
score, and name. This is helpful to determine why the OOM killer was
invoked, to identify the rogue task that caused it, and to determine why
the OOM killer chose the task it did to kill.

If this is set to zero, this information is suppressed. On very
large systems with thousands of tasks it may not be feasible to dump
the memory state information for each one. Such systems should not
be forced to incur a performance penalty in OOM conditions when the
information may not be desired.

If this is set to non-zero, this information is shown whenever the
OOM killer actually kills a memory-hogging task.

The default value is 1 (enabled).

==============================================================

oom_kill_allocating_task

This enables or disables killing the OOM-triggering task in
out-of-memory situations.

If this is set to zero, the OOM killer will scan through the entire
tasklist and select a task based on heuristics to kill. This normally
selects a rogue memory-hogging task that frees up a large amount of
memory when killed.

If this is set to non-zero, the OOM killer simply kills the task that
triggered the out-of-memory condition. This avoids the expensive
tasklist scan.

If panic_on_oom is selected, it takes precedence over whatever value
is used in oom_kill_allocating_task.

The default value is 0.

==============================================================

overcommit_kbytes:

When overcommit_memory is set to 2, the committed address space is not
permitted to exceed swap plus this amount of physical RAM. See below.

Note: overcommit_kbytes is the counterpart of overcommit_ratio. Only one
of them may be specified at a time. Setting one disables the other (which
then appears as 0 when read).

==============================================================

overcommit_memory:

This value contains a flag that enables memory overcommitment.

When this flag is 0, the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.

When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.

When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.
Note that user_reserve_kbytes affects this policy.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.

The default value is 0.

See Documentation/vm/overcommit-accounting.rst and
mm/util.c::__vm_enough_memory() for more information.

==============================================================

overcommit_ratio:

When overcommit_memory is set to 2, the committed address
space is not permitted to exceed swap plus this percentage
of physical RAM. See above.

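The limit described for overcommit_memory=2 can be sketched as follows. The
helper name and the memory sizes are made up, and the sketch follows only the
wording above (swap plus either a percentage of RAM or a fixed amount); see
the references given under overcommit_memory for the full accounting rules.

```shell
# Commit limit in kB when overcommit_memory is 2.
# $1: physical RAM in kB
# $2: swap in kB
# $3: overcommit_ratio, in percent
# $4: overcommit_kbytes (0 means "use the ratio")
commit_limit_kb() {
    if [ "$4" -gt 0 ]; then
        echo $(( $2 + $4 ))
    else
        echo $(( $2 + $1 * $3 / 100 ))
    fi
}

# 8 GB of RAM, 2 GB of swap, ratio of 50 percent:
commit_limit_kb 8388608 2097152 50 0
# Same machine with overcommit_kbytes set to 1 GB instead:
commit_limit_kb 8388608 2097152 0 1048576
```
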
==============================================================

page-cluster

page-cluster controls the number of pages up to which consecutive pages
are read in from swap in a single attempt. This is the swap counterpart
to page cache readahead.
The mentioned consecutivity is not in terms of virtual/physical addresses,
but consecutive on swap space - that means they were swapped out together.

It is a logarithmic value - setting it to zero means "1 page", setting
it to 1 means "2 pages", setting it to 2 means "4 pages", etc.
Zero disables swap readahead completely.

The default value is three (eight pages at a time). There may be some
small benefits in tuning this to a different value if your workload is
swap-intensive.

Lower values mean lower latencies for initial faults, but at the same time
extra faults and I/O delays for subsequent faults on pages that the larger
consecutive readahead would already have brought in.

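Because the value is logarithmic, the number of pages read per attempt is two
raised to the power of page-cluster. A small illustration (the helper name is
made up):

```shell
# Pages read from swap in one attempt for a given page-cluster value.
# $1: page-cluster
page_cluster_pages() {
    echo $(( 1 << $1 ))
}

for n in 0 1 2 3; do
    echo "page-cluster=$n -> $(page_cluster_pages "$n") page(s)"
done
```
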
==============================================================

panic_on_oom

This enables or disables the panic on out-of-memory feature.

If this is set to 0, the kernel will kill some rogue process; this is
done by the oom_killer. Usually, the oom_killer can kill rogue processes
and the system will survive.

If this is set to 1, the kernel panics when out-of-memory happens.
However, if a process limits the nodes it uses by mempolicy/cpusets,
and those nodes run out of memory, one process may be killed by the
oom-killer. No panic occurs in this case, because other nodes' memory
may be free. This means the system as a whole may not be in a fatal
state yet.

If this is set to 2, the kernel panics compulsorily even in the
above-mentioned case. Even if the oom happens under a memory cgroup, the
whole system panics.

The default value is 0.
1 and 2 are for failover of clustering. Please select either
according to your policy of failover.
panic_on_oom=2+kdump gives you a very strong tool to investigate
why the oom happens. You can get a snapshot.

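The three settings can be summarised as a small decision function. This is an
illustrative sketch only: the function name and the "constrained"/"global"
labels are made up, and treating a memory cgroup OOM like the
mempolicy/cpusets case for the value 1 is an assumption drawn from the
wording for the value 2.

```shell
# What happens on OOM for a given panic_on_oom value.
# $1: panic_on_oom (0, 1 or 2)
# $2: "constrained" if the OOM is limited to nodes selected by
#     mempolicy/cpusets (or, by assumption, a memory cgroup),
#     anything else for a system-wide OOM
oom_action() {
    case "$1" in
        2) echo "panic" ;;
        1) if [ "$2" = "constrained" ]; then
               echo "kill"
           else
               echo "panic"
           fi ;;
        *) echo "kill" ;;
    esac
}

oom_action 0 global
oom_action 1 constrained
oom_action 1 global
oom_action 2 constrained
```
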
754=============================================================
755
756percpu_pagelist_fraction
757
758This is the fraction of pages at most (high mark pcp->high) in each zone that
759are allocated for each per cpu page list. The min value for this is 8. It
760means that we don't allow more than 1/8th of pages in each zone to be
761allocated in any single per_cpu_pagelist. This entry only changes the value
762of hot per cpu pagelists. User can specify a number like 100 to allocate
7631/100th of each zone to each per cpu page list.
764
765The batch value of each per cpu pagelist is also updated as a result. It is
766set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
767
768The initial value is zero. Kernel does not use this value at boot time to set
7cd2b0a3
DR
769the high water marks for each per cpu page list. If the user writes '0' to this
770sysctl, it will revert to this default behavior.
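
The arithmetic described above can be sketched in Python as follows. The
function name, the 4 KiB page size (PAGE_SHIFT = 12) and the floor of one on
the batch value are assumptions made for illustration; the boot-time default
selected by a value of zero is not modelled.

```python
PAGE_SHIFT = 12  # assumption: 4 KiB pages


def pcp_high_and_batch(zone_pages: int, fraction: int,
                       page_shift: int = PAGE_SHIFT) -> tuple:
    """Return (high, batch) for one per cpu page list, per the text above."""
    if fraction < 8:
        # 0 selects the boot-time default, which this sketch does not model;
        # other values below 8 are rejected by the documented minimum.
        raise ValueError("percpu_pagelist_fraction must be at least 8")
    high = zone_pages // fraction
    # batch is high/4, capped at PAGE_SHIFT * 8.
    # The lower bound of 1 is an assumption so that batch is never zero.
    batch = min(max(1, high // 4), page_shift * 8)
    return high, batch
```

With a zone of 1,000,000 pages and a fraction of 100, this gives a high mark
of 10,000 pages and a batch capped at 96.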

==============================================================

stat_interval

The time interval at which vm statistics are updated. The default
is 1 second.

==============================================================

stat_refresh

Any read or write (by root only) flushes all the per-cpu vm statistics
into their global totals, for more accurate reports when testing
e.g. cat /proc/sys/vm/stat_refresh /proc/meminfo

As a side-effect, it also checks for negative totals (elsewhere reported
as 0) and "fails" with EINVAL if any are found, with a warning in dmesg.
(At time of writing, a few stats are known sometimes to be found negative,
with no ill effects: errors and warnings on these stats are suppressed.)

==============================================================

numa_stat

This interface allows runtime configuration of numa statistics.

When page allocation performance becomes a bottleneck and you can tolerate
some possible tool breakage and decreased numa counter precision, you can
do:
	echo 0 > /proc/sys/vm/numa_stat

When page allocation performance is not a bottleneck and you want all
tooling to work, you can do:
	echo 1 > /proc/sys/vm/numa_stat

==============================================================

swappiness

This control is used to define how aggressively the kernel will swap
memory pages. Higher values will increase aggressiveness, lower values
decrease the amount of swap. A value of 0 instructs the kernel not to
initiate swap until the amount of free and file-backed pages is less
than the high water mark in a zone.

The default value is 60.

==============================================================

user_reserve_kbytes

When overcommit_memory is set to 2, "never overcommit" mode, reserve
min(3% of current process size, user_reserve_kbytes) of free memory.
This is intended to prevent a user from starting a single memory-hogging
process, such that they cannot recover (kill the hog).

user_reserve_kbytes defaults to min(3% of the current process size, 128MB).

If this is reduced to zero, then the user will be allowed to allocate
all free memory with a single process, minus admin_reserve_kbytes.
Any subsequent attempts to execute a command will result in
"fork: Cannot allocate memory".

Changing this takes effect whenever an application requests memory.
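
The min() expression above can be worked through with a short Python sketch.
The function and constant names are hypothetical, sizes are in kilobytes, and
integer arithmetic is used for the 3% figure.

```python
DEFAULT_CAP_KB = 128 * 1024  # the 128MB cap mentioned above, in kB


def user_reserve_kb(process_size_kb: int,
                    user_reserve_kbytes: int = DEFAULT_CAP_KB) -> int:
    """Memory held back, per min(3% of process size, user_reserve_kbytes)."""
    three_percent = process_size_kb * 3 // 100
    return min(three_percent, user_reserve_kbytes)
```

A process of 1,000,000 kB reserves 30,000 kB, while a process of
10,000,000 kB hits the 131,072 kB cap. Setting the sysctl to zero makes the
reserve zero for every process size.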

==============================================================

vfs_cache_pressure
------------------

This percentage value controls the tendency of the kernel to reclaim
the memory which is used for caching of directory and inode objects.

At the default value of vfs_cache_pressure=100 the kernel will attempt to
reclaim dentries and inodes at a "fair" rate with respect to pagecache and
swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer
to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will
never reclaim dentries and inodes due to memory pressure and this can easily
lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100
causes the kernel to prefer to reclaim dentries and inodes.

Increasing vfs_cache_pressure significantly beyond 100 may have a negative
performance impact. Reclaim code needs to take various locks to find freeable
directory and inode objects. With vfs_cache_pressure=1000, it will look for
ten times more freeable objects than there are.
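
The "ten times" figure follows from treating the value as a percentage
applied to the number of freeable objects. A minimal Python sketch of that
scaling, with a hypothetical function name and under the assumption that the
scaling is a plain integer percentage:

```python
def objects_to_scan(freeable_objects: int,
                    vfs_cache_pressure: int = 100) -> int:
    """Number of dentry/inode objects reclaim will look for."""
    if vfs_cache_pressure < 0:
        raise ValueError("vfs_cache_pressure must be non-negative")
    # 100 is the "fair" rate, 0 never reclaims, 1000 looks for 10x.
    return freeable_objects * vfs_cache_pressure // 100
```

With 500 freeable objects, a setting of 1000 makes reclaim look for 5000
objects, which is where the extra locking overhead comes from.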

=============================================================

watermark_boost_factor:

This factor controls the level of reclaim when memory is being fragmented.
It defines the percentage of the high watermark of a zone that will be
reclaimed if pages of different mobility are being mixed within pageblocks.
The intent is that compaction has less work to do in the future and to
increase the success rate of future high-order allocations such as SLUB
allocations, THP and hugetlbfs pages.

To make it sensible with respect to the watermark_scale_factor parameter,
the unit is in fractions of 10,000. The default value of 15,000 means
that up to 150% of the high watermark will be reclaimed in the event of
a pageblock being mixed due to fragmentation. The level of reclaim is
determined by the number of fragmentation events that occurred in the
recent past. If this value is smaller than a pageblock then a pageblock's
worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
of 0 will disable the feature.
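
The upper bound on the boost can be computed as in the following Python
sketch. The function name is hypothetical, and the pageblock size of 512
pages assumes 4 KiB pages and the 2MB pageblock given as the x86 example.

```python
def max_boost_pages(high_wmark_pages: int,
                    boost_factor: int = 15000,
                    pageblock_pages: int = 512) -> int:
    """Upper bound on pages reclaimed for one watermark boost."""
    if boost_factor == 0:
        # A factor of 0 disables the feature.
        return 0
    # The factor is in fractions of 10,000: 15,000 -> 150%.
    boost = high_wmark_pages * boost_factor // 10000
    # Never less than one pageblock's worth of pages.
    return max(boost, pageblock_pages)
```

For a zone whose high watermark is 1000 pages, the default factor allows up
to 1500 pages to be reclaimed; for a high watermark of 100 pages the result
is raised to one pageblock (512 pages).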

=============================================================

watermark_scale_factor:

This factor controls the aggressiveness of kswapd. It defines the
amount of memory left in a node/system before kswapd is woken up and
how much memory needs to be free before kswapd goes back to sleep.

The unit is in fractions of 10,000. The default value of 10 means the
distances between watermarks are 0.1% of the available memory in the
node/system. The maximum value is 1000, or 10% of memory.

A high rate of threads entering direct reclaim (allocstall) or kswapd
going to sleep prematurely (kswapd_low_wmark_hit_quickly) can indicate
that the number of free pages kswapd maintains for latency reasons is
too small for the allocation bursts occurring in the system. This knob
can then be used to tune kswapd aggressiveness accordingly.
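
To get a feel for the sizes involved, the distance between watermarks can be
estimated with the Python sketch here. The function name is hypothetical and
the calculation only applies the fraction described above; any other
adjustments the kernel may make are not modelled.

```python
def watermark_distance_pages(available_pages: int,
                             scale_factor: int = 10) -> int:
    """Distance between watermarks, in pages, for a node/system."""
    if not 0 <= scale_factor <= 1000:
        raise ValueError("watermark_scale_factor must not exceed 1000")
    # The factor is in fractions of 10,000: 10 -> 0.1%, 1000 -> 10%.
    return available_pages * scale_factor // 10000
```

On a node with 1,000,000 available pages, the default of 10 spaces the
watermarks 1000 pages apart, and the maximum of 1000 spaces them 100,000
pages apart.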

==============================================================

zone_reclaim_mode:

Zone_reclaim_mode allows someone to set more or less aggressive approaches to
reclaim memory when a zone runs out of memory. If it is set to zero then no
zone reclaim occurs. Allocations will be satisfied from other zones / nodes
in the system.

This value is formed by ORing together the following:

1	= Zone reclaim on
2	= Zone reclaim writes dirty pages out
4	= Zone reclaim swaps pages
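
Because the value is a bitmask, a given setting can be decoded by testing
each bit in turn. The Python sketch that follows does this; the names are
hypothetical and the descriptions are taken from the list above.

```python
ZONE_RECLAIM_FLAGS = {
    1: "Zone reclaim on",
    2: "Zone reclaim writes dirty pages out",
    4: "Zone reclaim swaps pages",
}


def decode_zone_reclaim_mode(mode: int) -> list:
    """Return the descriptions of the bits set in zone_reclaim_mode."""
    # An empty list means zone reclaim is disabled (mode 0).
    return [desc for bit, desc in ZONE_RECLAIM_FLAGS.items() if mode & bit]
```

For example, a value of 3 (1 ORed with 2) enables zone reclaim and allows it
to write out dirty pages, but does not allow it to swap pages.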

zone_reclaim_mode is disabled by default. For file servers or workloads
that benefit from having their data cached, zone_reclaim_mode should be
left disabled as the caching effect is likely to be more important than
data locality.

zone_reclaim may be enabled if it's known that the workload is partitioned
such that each partition fits within a NUMA node and that accessing remote
memory would cause a measurable performance reduction. The page allocator
will then reclaim easily reusable pages (those page cache pages that are
currently not used) before allocating off node pages.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up and so effectively
throttle the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore but it preserves the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.

============ End of Document =================================
934============ End of Document =================================