Commit | Line | Data |
---|---|---|
4d2e26a3 MCC |
1 | ====================== |
2 | Firmware-Assisted Dump | |
3 | ====================== | |
8e0aa6d4 | 4 | |
4d2e26a3 | 5 | July 2011 |
8e0aa6d4 MS |
6 | |
7 | The goal of firmware-assisted dump is to enable the dump of | |
8 | a crashed system, and to do so from a fully-reset system, and | |
9 | to minimize the total elapsed time until the system is back | |
10 | in production use. | |
11 | ||
1679b96e | 12 | - Firmware-Assisted Dump (FADump) infrastructure is intended to replace |
8e0aa6d4 MS |
13 | the existing phyp assisted dump. |
14 | - FADump uses the same firmware interfaces and memory reservation model | |
15 | as phyp assisted dump. | |
1679b96e | 16 | - Unlike phyp dump, FADump exports the memory dump through /proc/vmcore |
8e0aa6d4 MS |
17 | in the ELF format in the same way as kdump. This helps us reuse the |
18 | kdump infrastructure for dump capture and filtering. | |
19 | - Unlike phyp dump, the userspace tool does not need to refer to any sysfs | |
20 | interface while reading /proc/vmcore. | |
1679b96e | 21 | - Unlike phyp dump, FADump allows user to release all the memory reserved |
8e0aa6d4 | 22 | for dump, with a single operation of echo 1 > /sys/kernel/fadump_release_mem. |
1679b96e | 23 | - Once enabled through kernel boot parameter, FADump can be |
8e0aa6d4 MS |
24 | started/stopped through /sys/kernel/fadump_registered interface (see |
25 | sysfs files section below) and can be easily integrated with kdump | |
26 | service start/stop init scripts. | |
27 | ||
28 | Compared with kdump or other strategies, firmware-assisted | |
29 | dump offers several strong, practical advantages: | |
30 | ||
4d2e26a3 | 31 | - Unlike kdump, the system has been reset, and loaded |
8e0aa6d4 MS |
32 | with a fresh copy of the kernel. In particular, |
33 | PCI and I/O devices have been reinitialized and are | |
34 | in a clean, consistent state. | |
4d2e26a3 | 35 | - Once the dump is copied out, the memory that held the dump |
8e0aa6d4 | 36 | is immediately available to the running kernel. And therefore, |
1679b96e | 37 | unlike kdump, FADump doesn't need a 2nd reboot to get the
8e0aa6d4 MS |
38 | system back to the production configuration. |
39 | ||
40 | The above can only be accomplished by coordination with, | |
41 | and assistance from the Power firmware. The procedure is | |
42 | as follows: | |
43 | ||
4d2e26a3 | 44 | - The first kernel registers the sections of memory with the |
8e0aa6d4 MS |
45 | Power firmware for dump preservation during OS initialization. |
46 | These registered sections of memory are reserved by the first | |
47 | kernel during early boot. | |
48 | ||
fbcafdae HB |
49 | - When the system crashes, the Power firmware will copy the registered |
50 | low memory regions (boot memory) from source to destination area. | |
51 | It will also save hardware PTEs. | |
8e0aa6d4 | 52 | |
4d2e26a3 MCC |
53 | NOTE: |
54 | The term 'boot memory' means size of the low memory chunk | |
8e0aa6d4 MS |
55 | that is required for a kernel to boot successfully when |
56 | booted with restricted memory. By default, the boot memory | |
57 | size will be the larger of 5% of system RAM or 256MB. | |
58 | Alternatively, user can also specify boot memory size | |
92019efc HB |
59 | through boot parameter 'crashkernel=' which will override |
60 | the default calculated size. Use this option if default | |
61 | boot memory size is not sufficient for second kernel to | |
62 | boot successfully. For syntax of crashkernel= parameter, | |
fbcafdae HB |
63 | refer to Documentation/admin-guide/kdump/kdump.rst. If any |
64 | offset is provided in crashkernel= parameter, it will be | |
65 | ignored as FADump uses a predefined offset to reserve memory | |
e7467dc6 | 66 | for boot memory dump preservation in case of a crash. |
8e0aa6d4 | 67 | |
4d2e26a3 | 68 | - After the low memory (boot memory) area has been saved, the |
8e0aa6d4 MS |
69 | firmware will reset PCI and other hardware state. It will |
70 | *not* clear the RAM. It will then launch the bootloader, as | |
71 | normal. | |
72 | ||
fbcafdae HB |
73 | - The freshly booted kernel will notice that there is a new node |
74 | (rtas/ibm,kernel-dump on pSeries or ibm,opal/dump/mpipl-boot | |
75 | on OPAL platform) in the device tree, indicating that | |
8e0aa6d4 MS |
76 | there is crash data available from a previous boot. During |
77 | the early boot OS will reserve rest of the memory above | |
78 | boot memory size effectively booting with restricted memory | |
8468d155 HB |
79 | size. This will make sure that this kernel (also referred |
80 | to as second kernel or capture kernel) will not touch any | |
81 | of the dump memory area. | |
8e0aa6d4 | 82 | |
4d2e26a3 | 83 | - User-space tools will read /proc/vmcore to obtain the contents |
8e0aa6d4 MS |
84 | of memory, which holds the previous crashed kernel dump in ELF |
85 | format. The userspace tools may copy this info to disk, or | |
86 | network, nas, san, iscsi, etc. as desired. | |
87 | ||
4d2e26a3 | 88 | - Once the userspace tool is done saving dump, it will echo |
8e0aa6d4 MS |
89 | '1' to /sys/kernel/fadump_release_mem to release the reserved |
90 | memory back to general use, except the memory required for | |
91 | next firmware-assisted dump registration. | |
92 | ||
4d2e26a3 MCC |
93 | e.g.:: |
94 | ||
8e0aa6d4 MS |
95 | # echo 1 > /sys/kernel/fadump_release_mem |
96 | ||
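The last two steps above (copy the dump out of /proc/vmcore, then release the reserved memory) can be sketched in Python as follows. This is only an illustration under stated assumptions, not how any distribution's kdump scripts are implemented: the function name `save_dump_and_release` and its three parameters are hypothetical, and the paths are passed in so the sketch can also be exercised against ordinary files.

```python
import os
import shutil


def save_dump_and_release(vmcore_path, dest_path, release_path):
    """Copy the dump to dest_path, then write '1' to release_path.

    All three paths are caller-supplied (hypothetical parameters). On a
    FADump capture kernel they would typically be /proc/vmcore, a file
    on persistent storage, and /sys/kernel/fadump_release_mem.
    Returns the number of bytes saved.
    """
    with open(vmcore_path, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
        # Push the copy to stable storage before the memory that holds
        # the dump is handed back to the running kernel.
        dst.flush()
        os.fsync(dst.fileno())
        saved = dst.tell()
    # Equivalent of: echo 1 > /sys/kernel/fadump_release_mem
    with open(release_path, "w") as rel:
        rel.write("1\n")
    return saved
```

The release write is deliberately the last step: once the memory is released, the dump content is no longer guaranteed to be readable, so the copy is synced first.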
97 | Please note that the firmware-assisted dump feature | |
fbcafdae HB |
98 | is only available on POWER6 and above systems on pSeries |
99 | (PowerVM) platform and POWER9 and above systems with OP940 | |
100 | or later firmware versions on PowerNV (OPAL) platform. | |
101 | Note that, OPAL firmware exports ibm,opal/dump node when | |
102 | FADump is supported on PowerNV platform. | |
8e0aa6d4 | 103 | |
58cf055d HB |
104 | On OPAL based machines, the system first boots into an intermediate |
105 | kernel (referred to as petitboot kernel) before booting into the | |
106 | capture kernel. This kernel would have minimal kernel and/or | |
107 | userspace support to process crash data. Such kernel needs to | |
108 | preserve the previously crashed kernel's memory for the subsequent | |
109 | capture kernel boot to process this crash data. Kernel config | |
110 | option CONFIG_PRESERVE_FA_DUMP has to be enabled on such kernel | |
111 | to ensure that crash data is preserved to process later. | |
112 | ||
b3bba79d HB |
113 | -- On OPAL based machines (PowerNV), if the kernel is built with |
114 | CONFIG_OPAL_CORE=y, OPAL memory at the time of crash is also | |
8852c07a | 115 | exported as /sys/firmware/opal/mpipl/core file. This sysfs file is |
b3bba79d HB |
116 | helpful in debugging OPAL crashes with GDB. The kernel memory |
117 | used for exporting this sysfs file can be released by echoing | |
8852c07a | 118 | '1' to /sys/firmware/opal/mpipl/release_core node. |
b3bba79d HB |
119 | |
120 | e.g. | |
8852c07a | 121 | # echo 1 > /sys/firmware/opal/mpipl/release_core |
b3bba79d | 122 | |
8e0aa6d4 | 123 | Implementation details: |
4d2e26a3 | 124 | ----------------------- |
8e0aa6d4 MS |
125 | |
126 | During boot, a check is made to see if firmware supports | |
127 | this feature on that particular machine. If it does, then | |
128 | we check to see if an active dump is waiting for us. If yes | |
129 | then everything but boot memory size of RAM is reserved during | |
130 | early boot (See Fig. 2). This area is released once we finish | |
131 | collecting the dump from user land scripts (e.g. kdump scripts) | |
132 | that are run. If there is dump data, then the | |
133 | /sys/kernel/fadump_release_mem file is created, and the reserved | |
134 | memory is held. | |
135 | ||
fbcafdae HB |
136 | If there is no waiting dump data, then only the memory required to |
137 | hold CPU state, HPTE region, boot memory dump, FADump header and | |
138 | elfcore header, is usually reserved at an offset greater than boot | |
139 | memory size (see Fig. 1). This area is *not* released: this region | |
140 | will be kept permanently reserved, so that it can act as a receptacle | |
141 | for a copy of the boot memory content in addition to CPU state and | |
142 | HPTE region, in the case a crash does occur. | |
143 | ||
144 | Since this reserved memory area is used only after the system crash, | |
145 | there is no point in blocking this significant chunk of memory from | |
146 | production kernel. Hence, the implementation uses the Linux kernel's | |
147 | Contiguous Memory Allocator (CMA) for memory reservation if CMA is | |
148 | configured for kernel. With CMA reservation this memory will be | |
149 | available for applications to use it, while kernel is prevented from | |
150 | using it. With this FADump will still be able to capture all of the | |
151 | kernel memory and most of the user space memory except the user pages | |
152 | that were present in CMA region:: | |
8e0aa6d4 MS |
153 | |
154 | o Memory Reservation during first kernel | |
155 | ||
fbcafdae HB |
156 | Low memory Top of memory |
157 | 0 boot memory size |<--- Reserved dump area --->| | | |
158 | | | | Permanent Reservation | | | |
159 | V V | | V | |
160 | +-----------+-----/ /---+---+----+-------+-----+-----+----+--+ | |
161 | | | |///|////| DUMP | HDR | ELF |////| | | |
162 | +-----------+-----/ /---+---+----+-------+-----+-----+----+--+ | |
163 | | ^ ^ ^ ^ ^ | |
164 | | | | | | | | |
165 | \ CPU HPTE / | | | |
166 | ------------------------------ | | | |
167 | Boot memory content gets transferred | | | |
168 | to reserved area by firmware at the | | | |
169 | time of crash. | | | |
170 | FADump Header | | |
171 | (meta area) | | |
172 | | | |
173 | | | |
174 | Metadata: This area holds a metadata structure whose | |
175 | address is registered with f/w and retrieved in the | |
176 | second kernel after crash, on platforms that support | |
177 | tags (OPAL). Having such structure with info needed | |
178 | to process the crashdump eases dump capture process. | |
8468d155 | 179 | |
8e0aa6d4 MS |
180 | Fig. 1 |
181 | ||
8468d155 | 182 | |
8e0aa6d4 MS |
183 | o Memory Reservation during second kernel after crash |
184 | ||
fbcafdae HB |
185 | Low memory Top of memory |
186 | 0 boot memory size | | |
187 | | |<------------ Crash preserved area ------------>| | |
188 | V V |<--- Reserved dump area --->| | | |
189 | +-----------+-----/ /---+---+----+-------+-----+-----+----+--+ | |
190 | | | |///|////| DUMP | HDR | ELF |////| | | |
191 | +-----------+-----/ /---+---+----+-------+-----+-----+----+--+ | |
192 | | | | |
193 | V V | |
194 | Used by second /proc/vmcore | |
8e0aa6d4 | 195 | kernel to boot |
fbcafdae HB |
196 | |
197 | +---+ | |
198 | |///| -> Regions (CPU, HPTE & Metadata) marked like this in the above | |
199 | +---+ figures are not always present. For example, OPAL platform | |
200 | does not have CPU & HPTE regions while Metadata region is | |
201 | not supported on pSeries currently. | |
202 | ||
8e0aa6d4 MS |
203 | Fig. 2 |
204 | ||
fbcafdae | 205 | |
8468d155 HB |
206 | Currently the dump will be copied from /proc/vmcore to a new file upon |
207 | user intervention. The dump data available through /proc/vmcore will be | |
208 | in ELF format. Hence the existing kdump infrastructure (kdump scripts) | |
209 | to save the dump works fine with minor modifications. KDump scripts on | |
210 | major Distro releases have already been modified to work seamlessly (no | |
211 | user intervention in saving the dump) when FADump is used, instead of | |
212 | KDump, as dump mechanism. | |
8e0aa6d4 MS |
213 | |
214 | The tools to examine the dump will be same as the ones | |
215 | used for kdump. | |
216 | ||
1679b96e | 217 | How to enable firmware-assisted dump (FADump): |
4d2e26a3 | 218 | ---------------------------------------------- |
8e0aa6d4 MS |
219 | |
220 | 1. Set config option CONFIG_FA_DUMP=y and build kernel. | |
221 | 2. Boot into linux kernel with 'fadump=on' kernel cmdline option. | |
1679b96e | 222 | By default, FADump reserved memory will be initialized as CMA area. |
a4e92ce8 | 223 | Alternatively, user can boot linux kernel with 'fadump=nocma' to |
1679b96e | 224 | prevent FADump from using CMA. |
92019efc | 225 | 3. Optionally, user can also set 'crashkernel=' kernel cmdline |
8e0aa6d4 MS |
226 | to specify size of the memory to reserve for boot memory dump |
227 | preservation. | |
228 | ||
4d2e26a3 MCC |
229 | NOTE: |
230 | 1. 'fadump_reserve_mem=' parameter has been deprecated. Instead | |
231 | use 'crashkernel=' to specify size of the memory to reserve | |
232 | for boot memory dump preservation. | |
233 | 2. If firmware-assisted dump fails to reserve memory then it | |
234 | will fall back to the existing kdump mechanism if 'crashkernel=' | |
235 | option is set at kernel cmdline. | |
236 | 3. If the user wants to capture all of user space memory and is OK with | |
237 | reserved memory not being available to the production system, then | |
238 | 'fadump=nocma' kernel parameter can be used to fall back to | |
239 | old behaviour. | |
8e0aa6d4 MS |
240 | |
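The sizing rule for boot memory described earlier (the larger of 5% of system RAM or 256MB, overridden by a size given through 'crashkernel=') can be written out as a small Python sketch. It models only the rule as stated in this document: any alignment, rounding or platform-specific limits the kernel may apply are not represented, parsing of the 'crashkernel=' syntax is left out, and the function and parameter names are hypothetical.

```python
MIB = 1024 * 1024


def default_boot_mem_size(system_ram_bytes):
    """Larger of 5% of system RAM or 256MB, as stated in the text."""
    return max(system_ram_bytes * 5 // 100, 256 * MIB)


def boot_mem_size(system_ram_bytes, crashkernel_bytes=None):
    """Return the boot memory size in bytes.

    crashkernel_bytes is a hypothetical, already-parsed size from the
    'crashkernel=' parameter; when given, it overrides the default.
    """
    if crashkernel_bytes:
        return crashkernel_bytes
    return default_boot_mem_size(system_ram_bytes)
```

For example, under this rule a 4GB system gets the 256MB floor, because 5% of 4GB is less than 256MB, while larger systems scale with RAM.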
241 | Sysfs/debugfs files: | |
4d2e26a3 | 242 | -------------------- |
8e0aa6d4 MS |
243 | |
244 | Firmware-assisted dump feature uses sysfs file system to hold | |
245 | the control files and debugfs file to display memory reserved region. | |
246 | ||
247 | Here is the list of files under kernel sysfs: | |
248 | ||
249 | /sys/kernel/fadump_enabled | |
1679b96e | 250 | This is used to display the FADump status. |
4d2e26a3 | 251 | |
1679b96e HB |
252 | - 0 = FADump is disabled |
253 | - 1 = FADump is enabled | |
8e0aa6d4 MS |
254 | |
255 | This interface can be used by kdump init scripts to identify if | |
1679b96e | 256 | FADump is enabled in the kernel and act accordingly. |
8e0aa6d4 MS |
257 | |
258 | /sys/kernel/fadump_registered | |
1679b96e HB |
259 | This is used to display the FADump registration status as well |
260 | as to control (start/stop) the FADump registration. | |
4d2e26a3 | 261 | |
1679b96e HB |
262 | - 0 = FADump is not registered. |
263 | - 1 = FADump is registered and ready to handle system crash. | |
8e0aa6d4 | 264 | |
1679b96e | 265 | To register FADump, echo 1 > /sys/kernel/fadump_registered; to |
8e0aa6d4 | 266 | un-register and stop FADump, echo 0 > /sys/kernel/fadump_registered. |
1679b96e | 267 | Once FADump is un-registered, a system crash will not |
8e0aa6d4 MS |
268 | be handled and vmcore will not be captured. This interface can be |
269 | easily integrated with kdump service start/stop. | |
270 | ||
d8e73458 SJ |
271 | /sys/kernel/fadump/mem_reserved |
272 | ||
273 | This is used to display the memory reserved by FADump for saving the | |
274 | crash dump. | |
275 | ||
8e0aa6d4 | 276 | /sys/kernel/fadump_release_mem |
1679b96e | 277 | This file is available only when FADump is active during |
8e0aa6d4 MS |
278 | second kernel. This is used to release the reserved memory |
279 | regions that are held for saving crash dump. To release the | |
4d2e26a3 | 280 | reserved memory echo 1 to it:: |
8e0aa6d4 | 281 | |
4d2e26a3 | 282 | echo 1 > /sys/kernel/fadump_release_mem |
8e0aa6d4 MS |
283 | |
284 | After echo 1, the content of the /sys/kernel/debug/powerpc/fadump_region | |
285 | file will change to reflect the new memory reservations. | |
286 | ||
287 | The existing userspace tools (kdump infrastructure) can be easily | |
288 | enhanced to use this interface to release the memory reserved for | |
289 | dump and continue without 2nd reboot. | |
290 | ||
8852c07a SJ |
291 | Note: /sys/kernel/fadump_release_opalcore sysfs has moved to |
292 | /sys/firmware/opal/mpipl/release_core | |
293 | ||
294 | /sys/firmware/opal/mpipl/release_core | |
b3bba79d HB |
295 | |
296 | This file is available only on OPAL based machines when FADump is | |
297 | active during capture kernel. This is used to release the memory | |
8852c07a | 298 | used by the kernel to export /sys/firmware/opal/mpipl/core file. To |
b3bba79d HB |
299 | release this memory, echo '1' to it: |
300 | ||
8852c07a | 301 | echo 1 > /sys/firmware/opal/mpipl/release_core |
b3bba79d | 302 | |
3f5f1f22 SJ |
303 | Note: The following FADump sysfs files are deprecated. |
304 | ||
305 | +----------------------------------+--------------------------------+ | |
306 | | Deprecated | Alternative | | |
307 | +----------------------------------+--------------------------------+ | |
308 | | /sys/kernel/fadump_enabled | /sys/kernel/fadump/enabled | | |
309 | +----------------------------------+--------------------------------+ | |
310 | | /sys/kernel/fadump_registered | /sys/kernel/fadump/registered | | |
311 | +----------------------------------+--------------------------------+ | |
312 | | /sys/kernel/fadump_release_mem | /sys/kernel/fadump/release_mem | | |
313 | +----------------------------------+--------------------------------+ | |
314 | ||
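A tool that has to run on kernels with either naming scheme could prefer the new path from the table above and fall back to the deprecated one. The following Python sketch shows one way to do that. The helper names and the `root` parameter are hypothetical (the latter exists only so the lookup can be pointed at a test directory), and only the file names listed in the table are assumed to exist.

```python
import os


def fadump_sysfs_path(name, root="/"):
    """Return the sysfs path for 'enabled', 'registered' or 'release_mem'.

    Prefers /sys/kernel/fadump/<name>, falls back to the deprecated
    /sys/kernel/fadump_<name>, and returns None if neither exists.
    """
    new_path = os.path.join(root, "sys/kernel/fadump", name)
    old_path = os.path.join(root, "sys/kernel", "fadump_" + name)
    if os.path.exists(new_path):
        return new_path
    if os.path.exists(old_path):
        return old_path
    return None


def fadump_flag(name, root="/"):
    """Read a 0/1 status file; returns True, False, or None if absent."""
    path = fadump_sysfs_path(name, root)
    if path is None:
        return None
    with open(path) as f:
        return f.read().strip() == "1"
```

A kdump init script written along these lines might call `fadump_flag("enabled")` to decide whether to take the FADump path, then `fadump_flag("registered")` to check readiness.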
8e0aa6d4 MS |
315 | Here is the list of files under powerpc debugfs: |
316 | (Assuming debugfs is mounted on /sys/kernel/debug directory.) | |
317 | ||
318 | /sys/kernel/debug/powerpc/fadump_region | |
1679b96e | 319 | This file shows the reserved memory regions if FADump is |
8e0aa6d4 | 320 | enabled; otherwise this file is empty. The output format |
4d2e26a3 MCC |
321 | is:: |
322 | ||
323 | <region>: [<start>-<end>] <reserved-size> bytes, Dumped: <dump-size> | |
8e0aa6d4 | 324 | |
1679b96e HB |
325 | and for kernel DUMP region is: |
326 | ||
327 | DUMP: Src: <src-addr>, Dest: <dest-addr>, Size: <size>, Dumped: # bytes | |
328 | ||
8e0aa6d4 | 329 | e.g. |
1679b96e | 330 | Contents when FADump is registered during first kernel:: |
8e0aa6d4 | 331 | |
4d2e26a3 MCC |
332 | # cat /sys/kernel/debug/powerpc/fadump_region |
333 | CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x0 | |
334 | HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x0 | |
335 | DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x0 | |
8e0aa6d4 | 336 | |
1679b96e | 337 | Contents when FADump is active during second kernel:: |
8e0aa6d4 | 338 | |
4d2e26a3 MCC |
339 | # cat /sys/kernel/debug/powerpc/fadump_region |
340 | CPU : [0x0000006ffb0000-0x0000006fff001f] 0x40020 bytes, Dumped: 0x40020 | |
341 | HPTE: [0x0000006fff0020-0x0000006fff101f] 0x1000 bytes, Dumped: 0x1000 | |
342 | DUMP: [0x0000006fff1020-0x0000007fff101f] 0x10000000 bytes, Dumped: 0x10000000 | |
343 | : [0x00000010000000-0x0000006ffaffff] 0x5ffb0000 bytes, Dumped: 0x5ffb0000 | |
8e0aa6d4 | 344 | |
1679b96e | 345 | |
4d2e26a3 | 346 | NOTE: |
0c1bc6b8 | 347 | Please refer to Documentation/filesystems/debugfs.rst on |
8e0aa6d4 MS |
348 | how to mount the debugfs filesystem. |
349 | ||
350 | ||
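The bracketed fadump_region line format shown above is regular enough to parse mechanically. The Python sketch below is one way to do it, written against the sample output in this document only: the `Region` type and `parse_fadump_region` function are hypothetical names, lines in the alternative "Src:/Dest:" form are skipped rather than parsed, and no claim is made that the format is stable across kernel versions.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    name: str      # e.g. "CPU", "HPTE", "DUMP"; may be empty
    start: int
    end: int
    size: int      # reserved size in bytes
    dumped: int    # bytes actually dumped


_LINE = re.compile(
    r"^\s*(?P<name>[^:\[]*?)\s*:\s*"
    r"\[(?P<start>0x[0-9a-fA-F]+)-(?P<end>0x[0-9a-fA-F]+)\]\s+"
    r"(?P<size>0x[0-9a-fA-F]+)\s+bytes,\s+"
    r"Dumped:\s+(?P<dumped>0x[0-9a-fA-F]+)\s*$"
)


def parse_fadump_region(text):
    """Parse '<region>: [<start>-<end>] <size> bytes, Dumped: <n>' lines."""
    regions = []
    for line in text.splitlines():
        m = _LINE.match(line)
        if m is None:
            continue  # blank lines and other line formats are ignored
        regions.append(
            Region(
                m["name"],
                int(m["start"], 16),
                int(m["end"], 16),
                int(m["size"], 16),
                int(m["dumped"], 16),
            )
        )
    return regions
```

Comparing `dumped` with `size` for each region gives a quick check of whether the firmware captured every registered region in full, as in the second-kernel sample above.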
351 | TODO: | |
352 | ----- | |
4d2e26a3 | 353 | - Need to come up with a better approach to find out a more |
8e0aa6d4 MS |
354 | accurate boot memory size that is required for a kernel to |
355 | boot successfully when booted with restricted memory. | |
1679b96e | 356 | - The FADump implementation introduces a FADump crash info structure |
8e0aa6d4 MS |
357 | in the scratch area before the ELF core header. The idea of introducing |
358 | this structure is to pass some important crash info data to the second | |
359 | kernel which will help second kernel to populate ELF core header with | |
360 | correct data before it gets exported through /proc/vmcore. The current | |
361 | design implementation does not address a possibility of introducing | |
362 | additional fields (in future) to this structure without affecting | |
363 | compatibility. Need to come up with a better approach to address this. |
4d2e26a3 | 364 | |
8e0aa6d4 | 365 | The possible approaches are: |
4d2e26a3 | 366 | |
8e0aa6d4 MS |
367 | 1. Introduce version field for version tracking, bump up the version |
368 | whenever a new field is added to the structure in future. The version | |
369 | field can be used to find out what fields are valid for the current | |
370 | version of the structure. | |
371 | 2. Reserve the area of predefined size (say PAGE_SIZE) for this | |
372 | structure and have unused area as reserved (initialized to zero) | |
373 | for future field additions. | |
4d2e26a3 | 374 | |
8e0aa6d4 | 375 | The advantage of approach 1 over 2 is we don't need to reserve extra space. |
4d2e26a3 | 376 | |
8e0aa6d4 | 377 | Author: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com> |
4d2e26a3 | 378 | |
8e0aa6d4 | 379 | This document is based on the original documentation written for phyp |
4d2e26a3 | 380 | |
8e0aa6d4 | 381 | assisted dump by Linas Vepstas and Manish Ahuja. |