.. _slub:

==========================
Short users guide for SLUB
==========================

The basic philosophy of SLUB is very different from SLAB. SLAB
requires rebuilding the kernel to activate debug options for all
slab caches. SLUB always includes full debugging, but it is off by default.
SLUB can enable debugging only for selected slabs in order to avoid
an impact on overall system performance, which may make a bug more
difficult to find.

To switch debugging on, add the option ``slub_debug``
to the kernel command line. That will enable full debugging for
all slabs.

Typically one would then use the ``slabinfo`` command to get statistical
data and perform operations on the slabs. By default ``slabinfo`` only lists
slabs that have data in them. See ``slabinfo -h`` for more options when
running the command. ``slabinfo`` can be compiled with
::

    gcc -o slabinfo tools/vm/slabinfo.c

Some of the modes of operation of ``slabinfo`` require that slub debugging
be enabled on the command line. For example, no tracking information will be
available without debugging on, and validation can only partially
be performed if debugging was not switched on.

Some more sophisticated uses of slub_debug:
-------------------------------------------

Parameters may be given to ``slub_debug``. If none is specified then full
debugging is enabled. Format:

slub_debug=<Debug-Options>
    Enable options for all slabs

slub_debug=<Debug-Options>,<slab name1>,<slab name2>,...
    Enable options only for select slabs (no spaces
    after a comma)

Possible debug options are::

    F       Sanity checks on (enables SLAB_DEBUG_CONSISTENCY_CHECKS
            Sorry SLAB legacy issues)
    Z       Red zoning
    P       Poisoning (object and padding)
    U       User tracking (free and alloc)
    T       Trace (please only use on single slabs)
    A       Toggle failslab filter mark for the cache
    O       Switch debugging off for caches that would have
            caused higher minimum slab orders
    -       Switch all debugging off (useful if the kernel is
            configured with CONFIG_SLUB_DEBUG_ON)

For example, to boot with only sanity checks and red zoning, specify::

    slub_debug=FZ

Trying to find an issue in the dentry cache? Try::

    slub_debug=,dentry

to only enable debugging on the dentry cache. You may use an asterisk at the
end of the slab name in order to cover all slabs with the same prefix. For
example, here's how you can poison the dentry cache as well as all kmalloc
slabs::

    slub_debug=P,kmalloc-*,dentry

Red zoning and tracking may realign the slab. We can just apply sanity checks
to the dentry cache with::

    slub_debug=F,dentry

Debugging options may require the minimum possible slab order to increase as
a result of storing the metadata (for example, caches with PAGE_SIZE object
sizes). This has a higher likelihood of resulting in slab allocation errors
in low memory situations or if there's high fragmentation of memory. To
switch off debugging for such caches by default, use::

    slub_debug=O

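The parameter format described above can be assembled mechanically. The
following is a minimal sketch and not part of the kernel tree:
``build_slub_debug`` is a hypothetical shell helper that takes the option
letters and an optional list of slab names and prints the resulting command
line parameter. It only knows the option letters listed in this document and
rejects slab names that contain spaces or commas.

```shell
# Hypothetical helper: build a slub_debug= boot parameter.
# $1 = option letters (may be empty), remaining arguments = slab names.
build_slub_debug() {
    opts=$1
    shift
    # Accept only the option letters documented above (F Z P U T A O -).
    case $opts in
        *[!FZPUTAO-]*)
            echo "unknown debug option in '$opts'" >&2
            return 1
            ;;
    esac
    param="slub_debug=$opts"
    for cache in "$@"; do
        # Names are comma-separated with no spaces after a comma, so a
        # name must be non-empty and free of whitespace and commas.
        case $cache in
            *[[:space:],]*|"")
                echo "invalid slab name '$cache'" >&2
                return 1
                ;;
        esac
        param="$param,$cache"
    done
    printf '%s\n' "$param"
}
```

For example, ``build_slub_debug P 'kmalloc-*' dentry`` prints
``slub_debug=P,kmalloc-*,dentry``. Quote names that contain an asterisk so
that the shell does not expand them.
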
If you forgot to enable debugging on the kernel command line, it is
possible to enable debugging manually when the kernel is up. Look at the
contents of::

    /sys/kernel/slab/<slab name>/

Look at the writable files. Writing 1 to them will enable the
corresponding debug option. All options can be set on a slab that does
not contain objects. If the slab already contains objects then only sanity
checks and tracing may be enabled. The other options may cause the realignment
of objects.

Careful with tracing: It may spew out lots of information and never stop if
used on the wrong slab.

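Which files are writable can differ between kernel versions, so it may help
to list them before writing anything. A minimal sketch, assuming a POSIX
shell and a ``find`` that supports ``-maxdepth``;
``list_writable_slab_files`` and the ``SLAB_SYSFS_ROOT`` override are
hypothetical names introduced here only for illustration:

```shell
# Hypothetical helper: list the files of one slab cache that have the
# owner-write permission bit set.
# $1 = slab name. SLAB_SYSFS_ROOT may point at another directory, which
# is only meant for trying the function out against a scratch tree.
list_writable_slab_files() {
    dir="${SLAB_SYSFS_ROOT:-/sys/kernel/slab}/$1"
    [ -d "$dir" ] || { echo "no such slab: $1" >&2; return 1; }
    # Check the mode bits instead of using [ -w ], because root passes
    # the [ -w ] test even for read-only files.
    find "$dir" -maxdepth 1 -type f -perm -200 | sed 's|.*/||' | sort
}
```

A file from that list can then be written as root with
``echo 1 > /sys/kernel/slab/<slab name>/<file>``.
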
Slab merging
============

If no debug options are specified then SLUB may merge similar slabs together
in order to reduce overhead and increase cache hotness of objects.
``slabinfo -a`` displays which slabs were merged together.

Slab validation
===============

SLUB can validate all objects if the kernel was booted with slub_debug. In
order to do so you must have the ``slabinfo`` tool. Then you can do
::

    slabinfo -v

which will test all objects. Output will be generated to the syslog.

This also works in a more limited way if boot was without slab debug.
In that case ``slabinfo -v`` simply tests all reachable objects. Usually
these are in the cpu slabs and the partial slabs. Full slabs are not
tracked by SLUB in a non debug situation.

Getting more performance
========================

To some degree SLUB's performance is limited by the need to take the
list_lock once in a while to deal with partial slabs. That overhead is
governed by the order of the allocation for each slab. The allocations
can be influenced by kernel parameters:

.. slub_min_objects=x           (default 4)
.. slub_min_order=x             (default 0)
.. slub_max_order=x             (default 3 (PAGE_ALLOC_COSTLY_ORDER))

``slub_min_objects``
    specifies the minimum number of objects that must fit into one
    slab in order for the allocation order to be acceptable. In
    general slub will be able to perform this number of
    allocations on a slab without consulting centralized resources
    (list_lock) where contention may occur.

``slub_min_order``
    specifies a minimum order of slabs. This has an effect similar to
    ``slub_min_objects``.

``slub_max_order``
    specifies the order at which ``slub_min_objects`` should no
    longer be checked. This is useful to avoid SLUB trying to
    generate super large order pages to fit ``slub_min_objects``
    of a slab cache with large object sizes into one high order
    page. Setting the command line parameter
    ``debug_guardpage_minorder=N`` (N > 0) forces
    ``slub_max_order`` to 0, which results in the minimum possible
    order of slab allocation.

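The interaction of the three parameters can be illustrated with a small
calculation. This is a deliberately simplified sketch and not the kernel's
algorithm, which additionally accounts for wasted space and per-object
metadata. ``slab_order_for`` is a hypothetical helper, a 4096 byte page is
assumed unless ``PAGE_SIZE`` is set, and the defaults are the ones quoted
above.

```shell
# Hypothetical helper: print the page order a slab would get.
# $1 = object size in bytes, $2 = slub_min_objects (default 4),
# $3 = slub_min_order (default 0), $4 = slub_max_order (default 3).
slab_order_for() {
    size=$1 min_objects=${2:-4} min_order=${3:-0} max_order=${4:-3}
    page_size=${PAGE_SIZE:-4096}    # assumption: 4 KiB pages
    order=$min_order
    # Below slub_max_order, raise the order until min_objects fit.
    while [ "$order" -lt "$max_order" ]; do
        slab_bytes=$((page_size << order))
        [ $((slab_bytes / size)) -ge "$min_objects" ] && break
        order=$((order + 1))
    done
    # From slub_max_order on, min_objects is no longer checked; the
    # slab only has to be large enough to hold a single object.
    while [ $((page_size << order)) -lt "$size" ]; do
        order=$((order + 1))
    done
    echo "$order"
}
```

With the defaults, ``slab_order_for 2048`` prints ``1``: a single page holds
only two such objects, while an order 1 slab (8192 bytes) holds the required
four.
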
SLUB Debug output
=================

Here is a sample of slub debug output::

 ====================================================================
 BUG kmalloc-8: Redzone overwritten
 --------------------------------------------------------------------

 INFO: 0xc90f6d28-0xc90f6d2b. First byte 0x00 instead of 0xcc
 INFO: Slab 0xc528c530 flags=0x400000c3 inuse=61 fp=0xc90f6d58
 INFO: Object 0xc90f6d20 @offset=3360 fp=0xc90f6d58
 INFO: Allocated in get_modalias+0x61/0xf5 age=53 cpu=1 pid=554

 Bytes b4 0xc90f6d10:  00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a ........ZZZZZZZZ
   Object 0xc90f6d20:  31 30 31 39 2e 30 30 35                         1019.005
  Redzone 0xc90f6d28:  00 cc cc cc                                     .
  Padding 0xc90f6d50:  5a 5a 5a 5a 5a 5a 5a 5a                         ZZZZZZZZ

   [<c010523d>] dump_trace+0x63/0x1eb
   [<c01053df>] show_trace_log_lvl+0x1a/0x2f
   [<c010601d>] show_trace+0x12/0x14
   [<c0106035>] dump_stack+0x16/0x18
   [<c017e0fa>] object_err+0x143/0x14b
   [<c017e2cc>] check_object+0x66/0x234
   [<c017eb43>] __slab_free+0x239/0x384
   [<c017f446>] kfree+0xa6/0xc6
   [<c02e2335>] get_modalias+0xb9/0xf5
   [<c02e23b7>] dmi_dev_uevent+0x27/0x3c
   [<c027866a>] dev_uevent+0x1ad/0x1da
   [<c0205024>] kobject_uevent_env+0x20a/0x45b
   [<c020527f>] kobject_uevent+0xa/0xf
   [<c02779f1>] store_uevent+0x4f/0x58
   [<c027758e>] dev_attr_store+0x29/0x2f
   [<c01bec4f>] sysfs_write_file+0x16e/0x19c
   [<c0183ba7>] vfs_write+0xd1/0x15a
   [<c01841d7>] sys_write+0x3d/0x72
   [<c0104112>] sysenter_past_esp+0x5f/0x99
   [<b7f7b410>] 0xb7f7b410
   =======================

 FIX kmalloc-8: Restoring Redzone 0xc90f6d28-0xc90f6d2b=0xcc

If SLUB encounters a corrupted object (full detection requires the kernel
to be booted with slub_debug) then the following output will be dumped
into the syslog:

1. Description of the problem encountered

   This will be a message in the system log starting with::

     ===============================================
     BUG <slab cache affected>: <What went wrong>
     -----------------------------------------------

     INFO: <corruption start>-<corruption_end> <more info>
     INFO: Slab <address> <slab information>
     INFO: Object <address> <object information>
     INFO: Allocated in <kernel function> age=<jiffies since alloc> cpu=<allocated by
        cpu> pid=<pid of the process>
     INFO: Freed in <kernel function> age=<jiffies since free> cpu=<freed by cpu>
        pid=<pid of the process>

   (Object allocation / free information is only available if SLAB_STORE_USER is
   set for the slab. slub_debug sets that option)

2. The object contents if an object was involved.

   Various types of lines can follow the BUG SLUB line:

   Bytes b4 <address> : <bytes>
        Shows a few bytes before the object where the problem was detected.
        Can be useful if the corruption does not stop with the start of the
        object.

   Object <address> : <bytes>
        The bytes of the object. If the object is inactive then the bytes
        typically contain poison values. Any non-poison value shows a
        corruption by a write after free.

   Redzone <address> : <bytes>
        The Redzone following the object. The Redzone is used to detect
        writes after the object. All bytes should always have the same
        value. If there is any deviation then it is due to a write after
        the object boundary.

        (Redzone information is only available if SLAB_RED_ZONE is set.
        slub_debug sets that option)

   Padding <address> : <bytes>
        Unused data to fill up the space in order to get the next object
        properly aligned. In the debug case we make sure that there are
        at least 4 bytes of padding. This allows the detection of writes
        before the object.

3. A stackdump

   The stackdump describes the location where the error was detected. The cause
   of the corruption is more likely to be found by looking at the function that
   allocated or freed the object.

4. Report on how the problem was dealt with in order to ensure the continued
   operation of the system.

   These are messages in the system log beginning with::

        FIX <slab cache affected>: <corrective action taken>

   In the above sample SLUB found that the Redzone of an active object has
   been overwritten. Here a string of 8 characters was written into a slab
   object that is 8 bytes long. However, an 8 character string needs a
   terminating 0. That zero has overwritten the first byte of the Redzone field.
   After reporting the details of the issue encountered the FIX SLUB message
   tells us that SLUB has restored the Redzone to its proper value and then
   system operations continue.

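When many reports end up in the log, the BUG and FIX lines described above
make a convenient summary. A minimal sketch, assuming the log text is
supplied on standard input (for example from ``dmesg``);
``slub_report_lines`` is a hypothetical helper and its pattern is a heuristic
based only on the line formats shown in this document:

```shell
# Hypothetical helper: read log text on stdin and print one line per
# SLUB report header or fix message, as three tab-separated fields:
#   BUG|FIX <TAB> slab cache <TAB> remaining text of the line
slub_report_lines() {
    awk '
        # Match "BUG <cache>: " or "FIX <cache>: " anywhere on the line,
        # since syslog usually prepends a timestamp.
        match($0, /(BUG|FIX) [^ :]+: /) {
            head  = substr($0, RSTART, RLENGTH)   # e.g. "BUG kmalloc-8: "
            rest  = substr($0, RSTART + RLENGTH)  # text after the header
            kind  = substr(head, 1, 3)            # "BUG" or "FIX"
            cache = substr(head, 5, length(head) - 6)
            printf "%s\t%s\t%s\n", kind, cache, rest
        }'
}
```

Fed the sample report above, it prints one line for the BUG header and one
for the FIX message, both naming ``kmalloc-8``.
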
Emergency operations
====================

Minimal debugging (sanity checks alone) can be enabled by booting with::

    slub_debug=F

This will generally be enough to enable the resiliency features of slub,
which will keep the system running even if a bad kernel component
keeps corrupting objects. This may be important for production systems.
Performance will be impacted by the sanity checks and there will be a
continual stream of error messages to the syslog, but no additional memory
will be used (unlike full debugging).

No guarantees. The kernel component still needs to be fixed. Performance
may be optimized further by locating the slab that experiences corruption
and enabling debugging only for that cache.

For example::

    slub_debug=F,dentry

If the corruption occurs by writing after the end of the object then it
may be advisable to enable a Redzone to avoid corrupting the beginning
of other objects::

    slub_debug=FZ,dentry

Extended slabinfo mode and plotting
===================================

The ``slabinfo`` tool has a special 'extended' (``-X``) mode that includes:
 - Slabcache Totals
 - Slabs sorted by size (up to -N <num> slabs, default 1)
 - Slabs sorted by loss (up to -N <num> slabs, default 1)

Additionally, in this mode ``slabinfo`` does not dynamically scale
sizes (G/M/K) and reports everything in bytes (this functionality is
also available to other slabinfo modes via the ``-B`` option), which makes
reporting more precise. Moreover, the ``-X`` mode also simplifies the
analysis of slabs' behaviour, because its output can be plotted using the
``slabinfo-gnuplot.sh`` script. So it pushes the analysis from looking
through the numbers (tons of numbers) to something easier -- visual analysis.

To generate plots:

a) collect slabinfo extended records, for example::

    while [ 1 ]; do slabinfo -X >> FOO_STATS; sleep 1; done

b) pass stats file(-s) to ``slabinfo-gnuplot.sh`` script::

    slabinfo-gnuplot.sh FOO_STATS [FOO_STATS2 .. FOO_STATSN]

   The ``slabinfo-gnuplot.sh`` script pre-processes the collected records
   and generates 3 png files (and 3 pre-processing cache files) per STATS
   file:
   - Slabcache Totals: FOO_STATS-totals.png
   - Slabs sorted by size: FOO_STATS-slabs-by-size.png
   - Slabs sorted by loss: FOO_STATS-slabs-by-loss.png

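The collection loop in step a) runs until it is interrupted. A bounded
variant can be sketched as follows; ``collect_slab_stats`` is a hypothetical
helper, and it runs ``slabinfo -X`` unless another command is given:

```shell
# Hypothetical helper: append a fixed number of samples to a stats file.
# $1 = output file, $2 = number of samples, $3 = seconds between samples,
# remaining arguments = command to sample (default: slabinfo -X).
collect_slab_stats() {
    out=$1 count=$2 interval=$3
    shift 3
    [ $# -gt 0 ] || set -- slabinfo -X
    i=0
    while [ "$i" -lt "$count" ]; do
        # Stop early if the sampled command fails.
        "$@" >> "$out" || return 1
        i=$((i + 1))
        # No need to wait after the last sample.
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}
```

``collect_slab_stats FOO_STATS 60 1`` records 60 samples one second apart,
after which FOO_STATS can be passed to ``slabinfo-gnuplot.sh`` as in step b).
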
Another use case for ``slabinfo-gnuplot.sh`` is comparing slabs' behaviour
"prior to" and "after" some code modification. To help you out there, the
``slabinfo-gnuplot.sh`` script can 'merge' the `Slabcache Totals` sections
from different measurements. To visually compare N plots:

a) Collect as many STATS1, STATS2, .. STATSN files as you need::

    while [ 1 ]; do slabinfo -X >> STATS<X>; sleep 1; done

b) Pre-process those STATS files::

    slabinfo-gnuplot.sh STATS1 STATS2 .. STATSN

c) Execute ``slabinfo-gnuplot.sh`` in '-t' mode, passing all of the
   generated pre-processed \*-totals::

    slabinfo-gnuplot.sh -t STATS1-totals STATS2-totals .. STATSN-totals

   This will produce a single plot (png file).

   Plots, expectedly, can be large so some fluctuations or small spikes
   can go unnoticed. To deal with that, ``slabinfo-gnuplot.sh`` has two
   options to 'zoom-in'/'zoom-out':

   a) ``-s %d,%d`` -- overwrites the default image width and height
   b) ``-r %d,%d`` -- specifies a range of samples to use (for example,
      in the ``slabinfo -X >> FOO_STATS; sleep 1;`` case, using a
      ``-r 40,60`` range will plot only samples collected between the
      40th and 60th seconds).

Christoph Lameter, May 30, 2007
Sergey Senozhatsky, October 23, 2015