=================================
Debugging hibernation and suspend
=================================

(C) 2007 Rafael J. Wysocki <rjw@sisk.pl>, GPL

1. Testing hibernation (aka suspend to disk or STD)
===================================================

To check if hibernation works, you can try to hibernate in the "reboot" mode::

    # echo reboot > /sys/power/disk
    # echo disk > /sys/power/state

and the system should create a hibernation image, reboot, resume and get back to
the command prompt where you have started the transition. If that happens,
hibernation is most likely to work correctly. Still, you need to repeat the
test at least a couple of times in a row for confidence. [This is necessary,
because some problems only show up on a second attempt at suspending and
resuming the system.] Moreover, hibernating in the "reboot" and "shutdown"
modes causes the PM core to skip some platform-related callbacks which on ACPI
systems might be necessary to make hibernation work. Thus, if your machine
fails to hibernate or resume in the "reboot" mode, you should try the
"platform" mode::

    # echo platform > /sys/power/disk
    # echo disk > /sys/power/state

which is the default and recommended mode of hibernation.

Unfortunately, the "platform" mode of hibernation does not work on some systems
with broken BIOSes. In such cases the "shutdown" mode of hibernation might
work::

    # echo shutdown > /sys/power/disk
    # echo disk > /sys/power/state

(it is similar to the "reboot" mode, but it requires you to press the power
button to make the system resume).
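The three modes above differ only in the string written to /sys/power/disk, so
repeated testing is easier with a small helper. The following is a minimal
sketch and not part of the kernel tree: the function name "hibernate_in_mode"
and the POWER_DIR variable are hypothetical, and POWER_DIR exists only so that
the helper can be pointed at a scratch directory for a dry run.

```shell
# Hypothetical helper: select a hibernation mode and start the transition.
# POWER_DIR is an assumption of this sketch; it defaults to /sys/power.
hibernate_in_mode() {
    mode=$1
    dir=${POWER_DIR:-/sys/power}
    # Select what the kernel does after the image has been written ...
    echo "$mode" > "$dir/disk" || return 1
    # ... and trigger hibernation; the write returns after the resume.
    echo disk > "$dir/state"
}
```

Run it as root, for example "hibernate_in_mode reboot", and repeat the call a
couple of times in a row before concluding that a given mode works.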

If neither "platform" nor "shutdown" hibernation mode works, you will need to
identify what goes wrong.

a) Test modes of hibernation
----------------------------

To find out why hibernation fails on your system, you can use a special testing
facility available if the kernel is compiled with CONFIG_PM_DEBUG set. Then,
there is the file /sys/power/pm_test that can be used to make the hibernation
core run in a test mode. There are 5 test modes available:

freezer
    - test the freezing of processes

devices
    - test the freezing of processes and suspending of devices

platform
    - test the freezing of processes, suspending of devices and platform
      global control methods [1]_

processors
    - test the freezing of processes, suspending of devices, platform
      global control methods [1]_ and the disabling of nonboot CPUs

core
    - test the freezing of processes, suspending of devices, platform global
      control methods\ [1]_, the disabling of nonboot CPUs and suspending
      of platform/system devices

.. [1]

    the platform global control methods are only available on ACPI systems
    and are only tested if the hibernation mode is set to "platform"

To use one of them it is necessary to write the corresponding string to
/sys/power/pm_test (e.g. "devices" to test the freezing of processes and
suspending of devices) and issue the standard hibernation commands. For example,
to use the "devices" test mode along with the "platform" mode of hibernation,
you should do the following::

    # echo devices > /sys/power/pm_test
    # echo platform > /sys/power/disk
    # echo disk > /sys/power/state

Then, the kernel will try to freeze processes, suspend devices, wait a few
seconds (5 by default, but configurable by the suspend.pm_test_delay module
parameter), resume devices and thaw processes. If "platform" is written to
/sys/power/pm_test, then after suspending devices the kernel will additionally
invoke the global control methods (e.g. ACPI global control methods) used to
prepare the platform firmware for hibernation. Next, it will wait a
configurable number of seconds and invoke the platform (e.g. ACPI) global
methods used to cancel hibernation etc.

Writing "none" to /sys/power/pm_test causes the kernel to switch to the normal
hibernation/suspend operations. Also, when open for reading, /sys/power/pm_test
contains a space-separated list of all available tests (including "none" that
represents the normal functionality) in which the current test level is
indicated by square brackets.
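Because the current level is the only bracketed entry, it can be extracted with
a one-line filter. The sketch below is illustrative only: the function name
"current_pm_test" is hypothetical, and the optional file argument is an
assumption added so that the filter can be tried on a saved copy of the file.

```shell
# Hypothetical helper: print the test level currently selected in pm_test,
# that is, the entry shown in square brackets.
# $1 (optional) is a file to read instead of /sys/power/pm_test.
current_pm_test() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "${1:-/sys/power/pm_test}"
}
```

If it prints anything other than "none", the next hibernation or suspend
request will run in a test mode instead of performing the real transition.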

Generally, as you can see, each test level is more "invasive" than the previous
one and the "core" level tests the hardware and drivers as deeply as possible
without creating a hibernation image. Obviously, if the "devices" test fails,
the "platform" test will fail as well and so on. Thus, as a rule of thumb, you
should try the test modes starting from "freezer", through "devices", "platform"
and "processors" up to "core" (repeat the test on each level a couple of times
to make sure that any random factors are avoided).
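The rule of thumb above can be scripted. This is a sketch under stated
assumptions rather than a supported tool: the function name
"run_pm_test_levels" and the POWER_DIR and REPEAT variables are hypothetical,
and the loop can only notice failures that make the write to the state file
return an error. A hang has to be observed by hand.

```shell
# Hypothetical helper: run every pm_test level, least invasive first,
# REPEAT times each, using the hibernation mode currently selected.
# POWER_DIR and REPEAT are assumptions of this sketch.
run_pm_test_levels() {
    dir=${POWER_DIR:-/sys/power}
    for level in freezer devices platform processors core; do
        n=0
        while [ "$n" -lt "${REPEAT:-3}" ]; do
            echo "$level" > "$dir/pm_test" || return 1
            if ! echo disk > "$dir/state"; then
                echo "pm_test level '$level' failed" >&2
                echo none > "$dir/pm_test"
                return 1
            fi
            n=$((n + 1))
        done
        echo "$level: ok"
    done
    # Switch back to the normal hibernation/suspend operations.
    echo none > "$dir/pm_test"
}
```

The first level reported as failed is the one to investigate, since every
deeper level includes the steps of the levels before it.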

If the "freezer" test fails, there is a task that cannot be frozen (in that case
it usually is possible to identify the offending task by analysing the output of
dmesg obtained after the failing test). Failure at this level usually means
that there is a problem with the tasks freezer subsystem that should be
reported.

If the "devices" test fails, most likely there is a driver that cannot suspend
or resume its device (in the latter case the system may hang or become unstable
after the test, so please take that into consideration). To find this driver,
you can carry out a binary search according to the rules:

- if the test fails, unload half of the drivers currently loaded and repeat
  (that would probably involve rebooting the system, so always note what drivers
  have been loaded before the test),
- if the test succeeds, load half of the drivers you have unloaded most
  recently and repeat.
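The search logic can be sketched in shell. This is a simplification of the
rules above, not the exact procedure: it assumes that exactly one driver is at
fault and it tests candidate subsets in isolation. Both "find_bad_module" and
the caller-supplied check command are hypothetical names. The check command
receives a list of module names and must return 0 if the "devices" test passes
with only those modules loaded.

```shell
# find_bad_module CHECK MODULE...
# CHECK is a hypothetical, caller-supplied command (see the text above).
# Assumes exactly one module in the list makes the test fail.
find_bad_module() {
    check=$1
    shift
    while [ "$#" -gt 1 ]; do
        half=$(( $# / 2 ))
        first="" rest="" i=0
        # Split the remaining candidates into two halves.
        for m in "$@"; do
            i=$((i + 1))
            if [ "$i" -le "$half" ]; then
                first="$first $m"
            else
                rest="$rest $m"
            fi
        done
        if "$check" $first; then
            set -- $rest     # first half is fine: the culprit is in the rest
        else
            set -- $first    # failure reproduced: keep halving this half
        fi
    done
    echo "$1"
}
```

In practice the check command would load the given modules (possibly after a
reboot), run the "devices" test and report whether it succeeded, so each step
of the loop may correspond to one manual test cycle.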

Once you have found the failing driver (there can be more than just one of
them), you have to unload it every time before hibernation. In that case please
make sure to report the problem with the driver.

It is also possible that the "devices" test will still fail after you have
unloaded all modules. In that case, you may want to look in your kernel
configuration for the drivers that can be compiled as modules (and test again
with these drivers compiled as modules). You may also try to use some special
kernel command line options such as "noapic", "noacpi" or even "acpi=off".

If the "platform" test fails, there is a problem with the handling of the
platform (e.g. ACPI) firmware on your system. In that case the "platform" mode
of hibernation is not likely to work. You can try the "shutdown" mode, but that
is rather a poor man's workaround.

If the "processors" test fails, the disabling/enabling of nonboot CPUs does not
work (of course, this may only be an issue on SMP systems) and the problem
should be reported. In that case you can also try to switch the nonboot CPUs
off and on using the /sys/devices/system/cpu/cpu*/online sysfs attributes and
see if that works.
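Switching the CPUs off and on by hand can be sketched as follows. The function
name "cycle_cpus" and the CPU_DIR variable are hypothetical. CPU_DIR defaults
to the sysfs directory named above and is only there to allow a dry run against
a scratch directory.

```shell
# Hypothetical helper: take each hot-pluggable CPU offline and bring it back.
# CPU_DIR is an assumption of this sketch; it defaults to the real sysfs path.
cycle_cpus() {
    for f in "${CPU_DIR:-/sys/devices/system/cpu}"/cpu[0-9]*/online; do
        [ -e "$f" ] || continue    # CPUs without an "online" file are skipped
        cpu=$(basename "$(dirname "$f")")
        if echo 0 > "$f" && echo 1 > "$f"; then
            echo "$cpu: ok"
        else
            echo "$cpu: FAILED"
        fi
    done
}
```

A CPU reported as FAILED here points at the same problem the "processors" test
exercises, without involving the rest of the hibernation code.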

If the "core" test fails, which means that suspending of the system/platform
devices has failed (these devices are suspended on one CPU with interrupts off),
the problem is most probably hardware-related and serious, so it should be
reported.

A failure of any of the "platform", "processors" or "core" tests may cause your
system to hang or become unstable, so please beware. Such a failure usually
indicates a serious problem that very well may be related to the hardware, but
please report it anyway.

b) Testing minimal configuration
--------------------------------

If all of the hibernation test modes work, you can boot the system with the
"init=/bin/bash" command line parameter and attempt to hibernate in the
"reboot", "shutdown" and "platform" modes. If that does not work, there
probably is a problem with a driver statically compiled into the kernel and you
can try to compile more drivers as modules, so that they can be tested
individually. Otherwise, there is a problem with a modular driver and you can
find it by loading half of the modules you normally use and binary searching
in accordance with the algorithm:

- if there are n modules loaded and the attempt to suspend and resume fails,
  unload n/2 of the modules and try again (that would probably involve rebooting
  the system),
- if there are n modules loaded and the attempt to suspend and resume succeeds,
  load n/2 modules more and try again.

Again, if you find the offending module or modules, they must be unloaded every
time before hibernation, and please report the problem with them.

c) Using the "test_resume" hibernation option
---------------------------------------------

/sys/power/disk generally tells the kernel what to do after creating a
hibernation image. One of the available options is "test_resume" which
causes the just created image to be used for immediate restoration. Namely,
after doing::

    # echo test_resume > /sys/power/disk
    # echo disk > /sys/power/state

a hibernation image will be created and a resume from it will be triggered
immediately without involving the platform firmware in any way.
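Before relying on this option it may be worth checking that the running kernel
offers it. The sketch below assumes that reading /sys/power/disk yields a
space-separated list of modes with the selected one in square brackets. The
function name "has_disk_mode" and its optional file argument are hypothetical.

```shell
# Hypothetical helper: succeed if the given hibernation mode is listed.
# $1 is the mode name, $2 (optional) a file to read instead of /sys/power/disk.
has_disk_mode() {
    # Drop the brackets around the selected mode, put one mode per line
    # and look for an exact match.
    sed 's/[][]//g' "${2:-/sys/power/disk}" | tr ' ' '\n' | grep -qxF -- "$1"
}
```

For example, "has_disk_mode test_resume && echo available" prints "available"
only when the option can be written to /sys/power/disk.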

That test can be used to check if failures to resume from hibernation are
related to bad interactions with the platform firmware. That is, if the above
works every time, but resume from actual hibernation does not work or is
unreliable, the platform firmware may be responsible for the failures.

On architectures and platforms that support using different kernels to restore
hibernation images (that is, the kernel used to read the image from storage and
load it into memory is different from the one included in the image) or support
kernel address space randomization, it also can be used to check if failures
to resume may be related to the differences between the restore and image
kernels.

d) Advanced debugging
---------------------

If hibernation does not work on your system even in the minimal
configuration and compiling more drivers as modules is not practical or some
modules cannot be unloaded, you can use one of the more advanced debugging
techniques to find the problem. First, if there is a serial port in your box,
you can boot the kernel with the 'no_console_suspend' parameter and try to log
kernel messages using the serial console. This may provide you with some
information about the reasons for the suspend (resume) failure. Alternatively,
it may be possible to use a FireWire port for debugging with firescope
(http://v3.sk/~lkundrak/firescope/). On x86 it is also possible to
use the PM_TRACE mechanism documented in Documentation/power/s2ram.rst .

2. Testing suspend to RAM (STR)
===============================

To verify that the STR works, it is generally more convenient to use the s2ram
tool available from http://suspend.sf.net and documented at
http://en.opensuse.org/SDB:Suspend_to_RAM (S2RAM_LINK).

Namely, after writing "freezer", "devices", "platform", "processors", or "core"
into /sys/power/pm_test (available if the kernel is compiled with
CONFIG_PM_DEBUG set) the suspend code will work in the test mode corresponding
to the given string. The STR test modes are defined in the same way as for
hibernation, so please refer to Section 1 for more information about them. In
particular, the "core" test allows you to test everything except for the actual
invocation of the platform firmware in order to put the system into the sleep
state.
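A single STR test cycle can be sketched as below. It assumes that suspend to
RAM is requested by writing "mem" to /sys/power/state. The function name
"str_pm_test" and the POWER_DIR variable are hypothetical, and POWER_DIR is
only there to allow a dry run against a scratch directory.

```shell
# Hypothetical helper: run one suspend-to-RAM test cycle at the given level.
# POWER_DIR is an assumption of this sketch; it defaults to /sys/power.
str_pm_test() {
    dir=${POWER_DIR:-/sys/power}
    echo "$1" > "$dir/pm_test" || return 1
    # "mem" requests suspend to RAM; in a test mode the kernel stops at the
    # selected level, waits and then resumes.
    echo mem > "$dir/state"
    status=$?
    echo none > "$dir/pm_test"    # back to normal operation
    return "$status"
}
```

Calling "str_pm_test core" as root exercises everything short of the final
firmware invocation and leaves pm_test set to "none" afterwards.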

Among other things, the testing with the help of /sys/power/pm_test may allow
you to identify drivers that fail to suspend or resume their devices. They
should be unloaded every time before an STR transition.

Next, you can follow the instructions at S2RAM_LINK to test the system, but if
it does not work "out of the box", you may need to boot it with
"init=/bin/bash" and test s2ram in the minimal configuration. In that case,
you may be able to search for failing drivers by following the procedure
analogous to the one described in section 1. If you find some failing drivers,
you will have to unload them every time before an STR transition (i.e. before
you run s2ram), and please report the problems with them.

There is a debugfs entry which shows the suspend to RAM statistics. Here is an
example of its output::

    # mount -t debugfs none /sys/kernel/debug
    # cat /sys/kernel/debug/suspend_stats
    success: 20
    fail: 5
    failed_freeze: 0
    failed_prepare: 0
    failed_suspend: 5
    failed_suspend_noirq: 0
    failed_resume: 0
    failed_resume_noirq: 0
    failures:
      last_failed_dev:      alarm
                            adc
      last_failed_errno:    -16
                            -16
      last_failed_step:     suspend
                            suspend

The "success" field is the number of successful suspend to RAM transitions and
the "fail" field is the number of failed ones. The remaining counters give the
number of failures at each step of suspend to RAM. suspend_stats lists only
the last 2 failed devices, together with the corresponding error numbers and
failed steps.
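The two summary counters are easy to post-process. The following sketch
computes the share of failed attempts from the "success" and "fail" lines. The
function name "suspend_failure_rate" and its optional file argument are
hypothetical, and the parsing assumes the "name: value" layout shown in the
example output.

```shell
# Hypothetical helper: print the share of failed suspend attempts.
# $1 (optional) is a file to read instead of the debugfs entry.
suspend_failure_rate() {
    awk -F': *' '
        $1 == "success" { ok = $2 }
        $1 == "fail"    { bad = $2 }
        END {
            total = ok + bad
            if (total > 0)
                printf "%d of %d attempts failed (%d%%)\n", bad, total, 100 * bad / total
        }
    ' "${1:-/sys/kernel/debug/suspend_stats}"
}
```

For the example output above it prints "5 of 25 attempts failed (20%)".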