.. _slub:

==========================
Short users guide for SLUB
==========================

The basic philosophy of SLUB is very different from SLAB. SLAB
requires rebuilding the kernel to activate debug options for all
slab caches. SLUB always includes full debugging but it is off by default.
SLUB can enable debugging only for selected slabs in order to avoid
an impact on overall system performance which may make a bug more
difficult to find.

In order to switch debugging on one can add an option ``slub_debug``
to the kernel command line. That will enable full debugging for
all slabs.

Typically one would then use the ``slabinfo`` command to get statistical
data and perform operations on the slabs. By default ``slabinfo`` only lists
slabs that have data in them. See ``slabinfo -h`` for more options when
running the command. ``slabinfo`` can be compiled with
::

        gcc -o slabinfo tools/vm/slabinfo.c

Some of the modes of operation of ``slabinfo`` require that slub debugging
be enabled on the command line. For example, no tracking information will be
available without debugging on and validation can only partially
be performed if debugging was not switched on.

Some more sophisticated uses of slub_debug:
-------------------------------------------

Parameters may be given to ``slub_debug``. If none is specified then full
debugging is enabled. Format:

slub_debug=<Debug-Options>
        Enable options for all slabs

slub_debug=<Debug-Options>,<slab name1>,<slab name2>,...
        Enable options only for select slabs (no spaces
        after a comma)

Multiple blocks of options for all slabs or selected slabs can be given, with
blocks of options delimited by ';'. The last "all slabs" block is applied
to all slabs except those that match one of the "select slabs" blocks. The
options of the first "select slabs" block that matches the slab's name are
applied.

Possible debug options are::

        F               Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS
                        Sorry SLAB legacy issues)
        Z               Red zoning
        P               Poisoning (object and padding)
        U               User tracking (free and alloc)
        T               Trace (please only use on single slabs)
        A               Enable failslab filter mark for the cache
        O               Switch debugging off for caches that would have
                        caused higher minimum slab orders
        -               Switch all debugging off (useful if the kernel is
                        configured with CONFIG_SLUB_DEBUG_ON)

For example, in order to boot just with sanity checks and red zoning one would
specify::

        slub_debug=FZ

Trying to find an issue in the dentry cache? Try::

        slub_debug=,dentry

to only enable debugging on the dentry cache.  You may use an asterisk at the
end of the slab name, in order to cover all slabs with the same prefix.  For
example, here's how you can poison the dentry cache as well as all kmalloc
slabs::

        slub_debug=P,kmalloc-*,dentry

Red zoning and tracking may realign the slab.  We can just apply sanity checks
to the dentry cache with::

        slub_debug=F,dentry

Debugging options may require the minimum possible slab order to increase as
a result of storing the metadata (for example, caches with PAGE_SIZE object
sizes).  This has a higher likelihood of resulting in slab allocation errors
in low memory situations or if there's high fragmentation of memory.  To
switch off debugging for such caches by default, use::

        slub_debug=O

You can apply different options to different lists of slab names, using blocks
of options. This will enable red zoning for dentry and user tracking for
kmalloc. All other slabs will not get any debugging enabled::

        slub_debug=Z,dentry;U,kmalloc-*

You can also enable options (e.g. sanity checks and red zoning) for all caches
except some that are deemed too performance critical and don't need to be
debugged by specifying global debug options followed by a list of slab names
with "-" as options::

        slub_debug=FZ;-,zs_handle,zspage
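
The block rules above can be sketched as a small shell function. This is only
an illustration of the matching order described in this section, not the
kernel's parser. The name ``slub_debug_for`` is made up for this example; it
prints ``full`` where an empty option list selects full debugging and ``none``
where no block applies. It relies on shell pattern matching, which accepts
more wildcards than the trailing asterisk documented here::

        # Hypothetical helper: print the option letters that a slub_debug=
        # value would select for one cache.
        # Usage: slub_debug_for <cache name> <slub_debug value>
        slub_debug_for() {
            cache=$1
            rest=$2
            global=none                  # no "all slabs" block seen yet
            while :; do
                block=${rest%%;*}        # text up to the first ';'
                case $block in
                *,*)                     # "select slabs": opts,name1,name2
                    opts=${block%%,*}
                    names=${block#*,}
                    old_ifs=$IFS; IFS=,
                    set -f               # do not expand '*' against files
                    for pat in $names; do
                        case $cache in
                        $pat)            # first matching block wins
                            IFS=$old_ifs; set +f
                            echo "${opts:-full}"
                            return 0 ;;
                        esac
                    done
                    set +f; IFS=$old_ifs
                    ;;
                *)  global=${block:-full} ;;   # last "all slabs" block wins
                esac
                case $rest in
                *\;*) rest=${rest#*;} ;;       # move on to the next block
                *)    break ;;
                esac
            done
            echo "$global"
        }

With the examples above, ``slub_debug_for kmalloc-64 'Z,dentry;U,kmalloc-*'``
prints ``U`` and ``slub_debug_for zspage 'FZ;-,zs_handle,zspage'`` prints ``-``.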

The state of each debug option for a slab can be found in the respective files
under::

        /sys/kernel/slab/<slab name>/

If the file contains 1, the option is enabled, 0 means disabled. The debug
options from the ``slub_debug`` parameter translate to the following files::

        F       sanity_checks
        Z       red_zone
        P       poison
        U       store_user
        T       trace
        A       failslab

The ``failslab`` file is writable, so writing 1 or 0 will enable or disable
the option at runtime. The write returns -EINVAL if the cache is an alias.
Be careful with tracing: it may spew out lots of information and never stop if
used on the wrong slab.
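
As a minimal sketch, the files listed above can be read in one pass to show
which options are active for a cache. The function name ``slab_debug_state``
and its optional second argument (an alternative root directory, handy for
trying the function outside a live system) are assumptions of this example::

        # Hypothetical helper: print the state of each debug option file.
        # Usage: slab_debug_state <slab name> [root, default /sys/kernel/slab]
        slab_debug_state() {
            dir=${2:-/sys/kernel/slab}/$1
            out=
            for pair in F:sanity_checks Z:red_zone P:poison U:store_user \
                        T:trace A:failslab; do
                letter=${pair%%:*}
                file=${pair#*:}
                if [ -r "$dir/$file" ]; then
                    val=$(cat "$dir/$file")
                else
                    val='?'              # file missing or not readable
                fi
                out="$out${out:+ }$letter=$val"
            done
            echo "$out"
        }

For a cache booted with ``slub_debug=FU,dentry`` one would expect
``slab_debug_state dentry`` to report 1 for ``F`` and ``U`` and 0 for the
other options that are present.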

Slab merging
============

If no debug options are specified then SLUB may merge similar slabs together
in order to reduce overhead and increase cache hotness of objects.
``slabinfo -a`` displays which slabs were merged together.

Slab validation
===============

SLUB can validate all objects if the kernel was booted with slub_debug. In
order to do so you must have the ``slabinfo`` tool. Then you can do
::

        slabinfo -v

which will test all objects. Output will be generated to the syslog.

This also works in a more limited way if the kernel was booted without slab
debugging. In that case ``slabinfo -v`` simply tests all reachable objects.
Usually these are in the cpu slabs and the partial slabs. Full slabs are not
tracked by SLUB in a non-debug situation.

Getting more performance
========================

To some degree SLUB's performance is limited by the need to take the
list_lock once in a while to deal with partial slabs. That overhead is
governed by the order of the allocation for each slab. The allocations
can be influenced by kernel parameters::

        slub_min_objects=x           (default 4)
        slub_min_order=x             (default 0)
        slub_max_order=x             (default 3 (PAGE_ALLOC_COSTLY_ORDER))

``slub_min_objects``
        specifies the minimum number of objects that must fit into one
        slab for the allocation order to be acceptable.  In
        general SLUB will be able to perform this number of
        allocations on a slab without consulting centralized resources
        (list_lock) where contention may occur.

``slub_min_order``
        specifies a minimum order of slabs. This has an effect similar to
        ``slub_min_objects``.

``slub_max_order``
        specifies the order at which ``slub_min_objects`` should no
        longer be checked. This is useful to avoid SLUB trying to
        generate super large order pages to fit ``slub_min_objects``
        of a slab cache with large object sizes into one high order
        page. Setting the command line parameter
        ``debug_guardpage_minorder=N`` (N > 0) forces
        ``slub_max_order`` to 0, which causes slabs to be allocated at
        the minimum possible order.
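
How the three parameters interact can be illustrated with a deliberately
simplified model. It is not the kernel's algorithm, which also weighs wasted
space and per-object debug metadata. The function name ``pick_order`` and the
4096-byte default page size are assumptions of this sketch::

        # Simplified model: start at min_order and raise the order until
        # min_objects fit into the slab or max_order is reached.
        # Usage: pick_order <object size> <min_objects> <min_order> <max_order> [page size]
        pick_order() {
            size=$1
            min_objects=$2
            order=$3
            max_order=$4
            page=${5:-4096}
            while [ "$order" -lt "$max_order" ] &&
                  [ $(( (page << order) / size )) -lt "$min_objects" ]; do
                order=$((order + 1))
            done
            echo "$order"
        }

In this model ``pick_order 2048 4 0 3`` prints ``1``, because two 4096-byte
pages are needed to hold four 2048-byte objects, while ``pick_order 8192 4 0 3``
stops at ``3`` since ``slub_min_objects`` is no longer checked at the maximum
order.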

SLUB Debug output
=================

Here is a sample of slub debug output::

 ====================================================================
 BUG kmalloc-8: Right Redzone overwritten
 --------------------------------------------------------------------

 INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc
 INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58
 INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58
 INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554

 Bytes b4 (0xc90f6d10): 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
 Object   (0xc90f6d20): 31 30 31 39 2e 30 30 35                         1019.005
 Redzone  (0xc90f6d28): 00 cc cc cc                                     .
 Padding  (0xc90f6d50): 5a 5a 5a 5a 5a 5a 5a 5a                         ZZZZZZZZ

   [<c010523d>] dump_trace+0x63/0x1eb
   [<c01053df>] show_trace_log_lvl+0x1a/0x2f
   [<c010601d>] show_trace+0x12/0x14
   [<c0106035>] dump_stack+0x16/0x18
   [<c017e0fa>] object_err+0x143/0x14b
   [<c017e2cc>] check_object+0x66/0x234
   [<c017eb43>] __slab_free+0x239/0x384
   [<c017f446>] kfree+0xa6/0xc6
   [<c02e2335>] get_modalias+0xb9/0xf5
   [<c02e23b7>] dmi_dev_uevent+0x27/0x3c
   [<c027866a>] dev_uevent+0x1ad/0x1da
   [<c0205024>] kobject_uevent_env+0x20a/0x45b
   [<c020527f>] kobject_uevent+0xa/0xf
   [<c02779f1>] store_uevent+0x4f/0x58
   [<c027758e>] dev_attr_store+0x29/0x2f
   [<c01bec4f>] sysfs_write_file+0x16e/0x19c
   [<c0183ba7>] vfs_write+0xd1/0x15a
   [<c01841d7>] sys_write+0x3d/0x72
   [<c0104112>] sysenter_past_esp+0x5f/0x99
   [<b7f7b410>] 0xb7f7b410
   =======================

 FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc

If SLUB encounters a corrupted object (full detection requires the kernel
to be booted with slub_debug) then the following output will be dumped
into the syslog:

1. Description of the problem encountered

   This will be a message in the system log starting with::

     ===============================================
     BUG <slab cache affected>: <What went wrong>
     -----------------------------------------------

     INFO: <corruption start>-<corruption_end> <more info>
     INFO: Slab <address> <slab information>
     INFO: Object <address> <object information>
     INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by
        cpu> pid=<pid of the process>
     INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu>
        pid=<pid of the process>

   (Object allocation / free information is only available if SLAB_STORE_USER is
   set for the slab. slub_debug sets that option.)

2. The object contents if an object was involved.

   Various types of lines can follow the BUG SLUB line:

   Bytes b4 <address> : <bytes>
        Shows a few bytes before the object where the problem was detected.
        Can be useful if the corruption does not stop with the start of the
        object.

   Object <address> : <bytes>
        The bytes of the object. If the object is inactive then the bytes
        typically contain poison values. Any non-poison value shows a
        corruption by a write after free.

   Redzone <address> : <bytes>
        The Redzone following the object. The Redzone is used to detect
        writes after the object. All bytes should always have the same
        value. If there is any deviation then it is due to a write after
        the object boundary.

        (Redzone information is only available if SLAB_RED_ZONE is set.
        slub_debug sets that option.)

   Padding <address> : <bytes>
        Unused data to fill up the space in order to get the next object
        properly aligned. In the debug case we make sure that there are
        at least 4 bytes of padding. This allows the detection of writes
        before the object.

3. A stackdump

   The stackdump describes the location where the error was detected. The cause
   of the corruption is more likely to be found by looking at the function that
   allocated or freed the object.

4. Report on how the problem was dealt with in order to ensure the continued
   operation of the system.

   These are messages in the system log beginning with::

        FIX <slab cache affected>: <corrective action taken>

   In the above sample SLUB found that the Redzone of an active object has
   been overwritten. Here a string of 8 characters was written into an object
   that is 8 bytes long. However, an 8 character string needs a
   terminating 0. That zero has overwritten the first byte of the Redzone field.
   After reporting the details of the issue encountered the FIX SLUB message
   tells us that SLUB has restored the Redzone to its proper value and then
   system operations continue.
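
When many reports accumulate, it can help to reduce the log to one line per
``BUG`` header. The filter below is a rough sketch that assumes header lines
of the form shown above. The function name ``slub_bug_summary`` is made up for
this example, and real logs may carry timestamps or extra annotations that
need a different pattern::

        # Hypothetical filter: read kernel log text on stdin and print
        # "<slab cache affected>: <What went wrong>" for each BUG header.
        slub_bug_summary() {
            sed -n 's/^.*BUG \([^ :]*\)[^:]*: \(.*\)$/\1: \2/p'
        }

For example, ``dmesg | slub_bug_summary | sort | uniq -c`` counts how often
each kind of corruption was reported per cache.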

Emergency operations
====================

Minimal debugging (sanity checks alone) can be enabled by booting with::

        slub_debug=F

This will generally be enough to enable the resiliency features of SLUB
which will keep the system running even if a bad kernel component
keeps corrupting objects. This may be important for production systems.
Performance will be impacted by the sanity checks and there will be a
continual stream of error messages to the syslog but no additional memory
will be used (unlike full debugging).

No guarantees. The kernel component still needs to be fixed. Performance
may be optimized further by locating the slab that experiences corruption
and enabling debugging only for that cache.

For example::

        slub_debug=F,dentry

If the corruption occurs by writing after the end of the object then it
may be advisable to enable a Redzone to avoid corrupting the beginning
of other objects::

        slub_debug=FZ,dentry

Extended slabinfo mode and plotting
===================================

The ``slabinfo`` tool has a special 'extended' ('-X') mode that includes:

- Slabcache Totals
- Slabs sorted by size (up to -N <num> slabs, default 1)
- Slabs sorted by loss (up to -N <num> slabs, default 1)

Additionally, in this mode ``slabinfo`` does not dynamically scale
sizes (G/M/K) and reports everything in bytes (this functionality is
also available to other slabinfo modes via '-B' option) which makes
reporting more precise and accurate. Moreover, in some sense the '-X'
mode also simplifies the analysis of slabs' behaviour, because its
output can be plotted using the ``slabinfo-gnuplot.sh`` script. So it
pushes the analysis from looking through the numbers (tons of numbers)
to something easier -- visual analysis.

To generate plots:

a) collect slabinfo extended records, for example::

        while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done

b) pass stats file(-s) to ``slabinfo-gnuplot.sh`` script::

        slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN]

   The ``slabinfo-gnuplot.sh`` script pre-processes the collected records
   and generates 3 png files (and 3 pre-processing cache files) per STATS
   file:

   - Slabcache Totals: FOO_STATS-totals.png
   - Slabs sorted by size: FOO_STATS-slabs-by-size.png
   - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png
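
The collection loop in step a) runs until interrupted. Where a fixed number of
samples is wanted, a bounded variant can be sketched as follows. The function
name ``collect_slabinfo`` and the ``SLABINFO`` variable (an override for the
path of the ``slabinfo`` binary) are assumptions of this example, not part of
the tools::

        # Hypothetical helper: append <count> 'slabinfo -X' records to a file.
        # Usage: collect_slabinfo <stats file> <count> [interval in seconds]
        collect_slabinfo() {
            out=$1
            count=$2
            interval=${3:-1}
            i=0
            while [ "$i" -lt "$count" ]; do
                "${SLABINFO:-slabinfo}" -X >> "$out" || return 1
                i=$((i + 1))
                if [ "$i" -lt "$count" ]; then
                    sleep "$interval"    # no sleep after the last sample
                fi
            done
        }

``collect_slabinfo FOO_STATS 60`` would then gather 60 one-second samples that
can be passed to ``slabinfo-gnuplot.sh`` as in step b).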

Another use case where ``slabinfo-gnuplot.sh`` can be useful is comparing
slabs' behaviour "prior to" and "after" some code
modification.  To help you out there, ``slabinfo-gnuplot.sh`` script
can 'merge' the `Slabcache Totals` sections from different
measurements. To visually compare N plots:

a) Collect as many STATS1, STATS2, .. STATSN files as you need::

        while [ 1 ]; do slabinfo -X >> STATS<X>; sleep 1; done

b) Pre-process those STATS files::

        slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN

c) Execute ``slabinfo-gnuplot.sh`` in '-t' mode, passing all of the
   generated pre-processed \*-totals::

        slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals

   This will produce a single plot (png file).

   Plots, expectedly, can be large so some fluctuations or small spikes
   can go unnoticed. To deal with that, ``slabinfo-gnuplot.sh`` has two
   options to 'zoom-in'/'zoom-out':

   a) ``-s %d,%d`` -- overrides the default image width and height
   b) ``-r %d,%d`` -- specifies a range of samples to use (for example,
      in ``slabinfo -X >> FOO_STATS; sleep 1;`` case, using a ``-r
      40,60`` range will plot only samples collected between the 40th and
      60th seconds).


DebugFS files for SLUB
======================

For more information about the current state of SLUB caches with the user
tracking debug option enabled, debugfs files are available, typically under
``/sys/kernel/debug/slab/<cache>/`` (created only for caches with enabled user
tracking). There are 2 types of these files with the following debug
information:

1. alloc_traces::

    Prints information about unique allocation traces of the currently
    allocated objects. The output is sorted by frequency of each trace.

    Information in the output:
    Number of objects, allocating function, possible memory wastage of
    kmalloc objects(total/per-object), minimal/average/maximal jiffies
    since alloc, pid range of the allocating processes, cpu mask of
    allocating cpus, numa node mask of origins of memory, and stack trace.

    Example:::

    338 pci_alloc_dev+0x2c/0xa0 waste=521872/1544 age=290837/291891/293509 pid=1 cpus=106 nodes=0-1
        __kmem_cache_alloc_node+0x11f/0x4e0
        kmalloc_trace+0x26/0xa0
        pci_alloc_dev+0x2c/0xa0
        pci_scan_single_device+0xd2/0x150
        pci_scan_slot+0xf7/0x2d0
        pci_scan_child_bus_extend+0x4e/0x360
        acpi_pci_root_create+0x32e/0x3b0
        pci_acpi_scan_root+0x2b9/0x2d0
        acpi_pci_root_add.cold.11+0x110/0xb0a
        acpi_bus_attach+0x262/0x3f0
        device_for_each_child+0xb7/0x110
        acpi_dev_for_each_child+0x77/0xa0
        acpi_bus_attach+0x108/0x3f0
        device_for_each_child+0xb7/0x110
        acpi_dev_for_each_child+0x77/0xa0
        acpi_bus_attach+0x108/0x3f0

2. free_traces::

    Prints information about unique freeing traces of the currently allocated
    objects. The freeing traces thus come from the previous life-cycle of the
    objects and are reported as not available for objects allocated for the first
    time. The output is sorted by frequency of each trace.

    Information in the output:
    Number of objects, freeing function, minimal/average/maximal jiffies since free,
    pid range of the freeing processes, cpu mask of freeing cpus, and stack trace.

    Example:::

    1980 <not-available> age=4294912290 pid=0 cpus=0
    51 acpi_ut_update_ref_count+0x6a6/0x782 age=236886/237027/237772 pid=1 cpus=1
        kfree+0x2db/0x420
        acpi_ut_update_ref_count+0x6a6/0x782
        acpi_ut_update_object_reference+0x1ad/0x234
        acpi_ut_remove_reference+0x7d/0x84
        acpi_rs_get_prt_method_data+0x97/0xd6
        acpi_get_irq_routing_table+0x82/0xc4
        acpi_pci_irq_find_prt_entry+0x8e/0x2e0
        acpi_pci_irq_lookup+0x3a/0x1e0
        acpi_pci_irq_enable+0x77/0x240
        pcibios_enable_device+0x39/0x40
        do_pci_enable_device.part.0+0x5d/0xe0
        pci_enable_device_flags+0xfc/0x120
        pci_enable_device+0x13/0x20
        virtio_pci_probe+0x9e/0x170
        local_pci_probe+0x48/0x80
        pci_device_probe+0x105/0x1c0
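
Since ``alloc_traces`` is sorted by frequency, a different ordering needs a
little post-processing. The sketch below ranks allocation sites by total
wastage instead. It assumes the line layout of the example above (a summary
line whose first field is the object count, followed by indented stack trace
lines), and the function name ``alloc_waste_report`` is made up for this
example::

        # Hypothetical helper: print "<total waste> <objects> <function>"
        # for each summary line, largest total waste first.
        # Usage: alloc_waste_report <path to an alloc_traces file>
        alloc_waste_report() {
            awk '$1 ~ /^[0-9]+$/ {
                    waste = 0            # sites without a waste= field
                    for (i = 3; i <= NF; i++)
                        if ($i ~ /^waste=/) {
                            split(substr($i, 7), w, "/")   # total/per-object
                            waste = w[1]
                        }
                    print waste, $1, $2
                 }' "$1" | sort -rn
        }

It would typically be pointed at ``/sys/kernel/debug/slab/<cache>/alloc_traces``,
which usually requires root privileges and a mounted debugfs.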

Christoph Lameter, May 30, 2007
Sergey Senozhatsky, October 23, 2015