Commit | Line | Data |
---|---|---|
151f4e2b | 1 | ==================================================================== |
7fef9fc8 | 2 | Interaction of Suspend code (S3) with the CPU hotplug infrastructure |
151f4e2b | 3 | ==================================================================== |
7fef9fc8 | 4 | |
151f4e2b | 5 | (C) 2011 - 2014 Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com> |
7fef9fc8 SB |
6 | |
7 | ||
151f4e2b MCC |
8 | I. Differences between CPU hotplug and Suspend-to-RAM |
9 | ====================================================== | |
10 | ||
11 | How does the regular CPU hotplug code differ from how the Suspend-to-RAM | |
12 | infrastructure uses it internally? And where do they share common code? | |
7fef9fc8 SB |
13 | |
14 | Well, a picture is worth a thousand words... So ASCII art follows :-) | |
15 | ||
16 | [This depicts the current design in the kernel, and focusses only on the | |
17 | interactions involving the freezer and CPU hotplug and also tries to explain | |
18 | the locking involved. It outlines the notifications involved as well. | |
19 | But please note that here, only the call paths are illustrated, with the aim | |
20 | of describing where they take different paths and where they share code. | |
21 | What happens when regular CPU hotplug and Suspend-to-RAM race with each other | |
22 | is not depicted here.] | |
23 | ||
151f4e2b | 24 | On a high level, the suspend-resume cycle goes like this:: |
7fef9fc8 | 25 | |
151f4e2b MCC |
26 | |Freeze| -> |Disable nonboot| -> |Do suspend| -> |Enable nonboot| -> |Thaw | |
27 | |tasks | | cpus | | | | cpus | |tasks| | |
7fef9fc8 SB |
28 | |
29 | ||
151f4e2b | 30 | More details follow:: |
7fef9fc8 SB |
31 | |
32 | Suspend call path | |
33 | ----------------- | |
34 | ||
35 | Write 'mem' to | |
36 | /sys/power/state | |
6237dd13 | 37 | sysfs file |
7fef9fc8 SB |
38 | | |
39 | v | |
55f2503c | 40 | Acquire system_transition_mutex lock |
7fef9fc8 SB |
41 | | |
42 | v | |
43 | Send PM_SUSPEND_PREPARE | |
44 | notifications | |
45 | | | |
46 | v | |
47 | Freeze tasks | |
48 | | | |
49 | | | |
50 | v | |
56555855 | 51 | freeze_secondary_cpus() |
7fef9fc8 SB |
52 | /* start */ |
53 | | | |
54 | v | |
55 | Acquire cpu_add_remove_lock | |
56 | | | |
57 | v | |
58 | Iterate over CURRENTLY | |
59 | online CPUs | |
60 | | | |
61 | | | |
62 | | ---------- | |
63 | v | L | |
64 | ======> _cpu_down() | | |
65 | | [This takes cpuhotplug.lock | | |
66 | Common | before taking down the CPU | | |
67 | code | and releases it when done] | O | |
68 | | While it is at it, notifications | | |
69 | | are sent when notable events occur, | | |
70 | ======> by running all registered callbacks. | | |
71 | | | O | |
72 | | | | |
73 | | | | |
74 | v | | |
75 | Note down these cpus in | P | |
76 | frozen_cpus mask ---------- | |
77 | | | |
78 | v | |
79 | Disable regular cpu hotplug | |
89af7ba5 | 80 | by increasing cpu_hotplug_disabled |
7fef9fc8 SB |
81 | | |
82 | v | |
83 | Release cpu_add_remove_lock | |
84 | | | |
85 | v | |
56555855 | 86 | /* freeze_secondary_cpus() complete */ |
7fef9fc8 SB |
87 | | |
88 | v | |
89 | Do suspend | |
90 | ||
91 | ||
92 | ||
93 | Resume follows the same pattern in reverse, with the counterparts being (in | |
94 | order of execution during resume): | |
151f4e2b | 95 | |
56555855 | 96 | * thaw_secondary_cpus() which involves:: |
151f4e2b | 97 | |
7fef9fc8 | 98 | | Acquire cpu_add_remove_lock |
89af7ba5 | 99 | | Decrease cpu_hotplug_disabled, thereby enabling regular cpu hotplug |
7fef9fc8 SB |
100 | | Call _cpu_up() [for all those cpus in the frozen_cpus mask, in a loop] |
101 | | Release cpu_add_remove_lock | |
102 | v | |
103 | ||
104 | * thaw tasks | |
105 | * send PM_POST_SUSPEND notifications | |
55f2503c | 106 | * Release system_transition_mutex lock. |
7fef9fc8 SB |
107 | |
108 | ||
1992b66d BH |
109 | It is to be noted here that the system_transition_mutex lock is acquired at the |
110 | very beginning, when we are just starting out to suspend, and then released only | |
7fef9fc8 SB |
111 | after the entire cycle is complete (i.e., suspend + resume). |
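The bookkeeping in the suspend and resume paths above can be sketched as a small user-space model. This is only an illustration, not the kernel implementation: the lock is reduced to a boolean flag, `_cpu_down()` and `_cpu_up()` are reduced to flipping an online bit, and `NR_MODEL_CPUS`, the `model_` prefixed functions, the lock helpers and the `primary` parameter (the CPU that stays online) are names assumed for this sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_MODEL_CPUS 4          /* hypothetical CPU count for this model */

bool cpu_add_remove_locked;      /* models cpu_add_remove_lock */
int  cpu_hotplug_disabled;       /* > 0 means regular hotplug is refused */
bool cpu_online[NR_MODEL_CPUS] = { true, true, true, true };
bool frozen_cpus[NR_MODEL_CPUS]; /* CPUs taken down by the suspend path */

void lock_cpu_add_remove(void)
{
    assert(!cpu_add_remove_locked);
    cpu_add_remove_locked = true;
}

void unlock_cpu_add_remove(void)
{
    assert(cpu_add_remove_locked);
    cpu_add_remove_locked = false;
}

/* Suspend side: offline every online CPU except 'primary' and remember them. */
void model_freeze_secondary_cpus(int primary)
{
    assert(primary >= 0 && primary < NR_MODEL_CPUS);
    lock_cpu_add_remove();
    for (int cpu = 0; cpu < NR_MODEL_CPUS; cpu++) {
        if (cpu == primary || !cpu_online[cpu])
            continue;
        cpu_online[cpu] = false;   /* stands in for _cpu_down() */
        frozen_cpus[cpu] = true;   /* note it down in the frozen_cpus mask */
    }
    cpu_hotplug_disabled++;        /* disable regular CPU hotplug */
    unlock_cpu_add_remove();
}

/* Resume side: re-enable regular hotplug, then online the recorded CPUs. */
void model_thaw_secondary_cpus(void)
{
    lock_cpu_add_remove();
    cpu_hotplug_disabled--;
    for (int cpu = 0; cpu < NR_MODEL_CPUS; cpu++) {
        if (!frozen_cpus[cpu])
            continue;
        cpu_online[cpu] = true;    /* stands in for _cpu_up() */
        frozen_cpus[cpu] = false;
    }
    unlock_cpu_add_remove();
}
```

Calling `model_freeze_secondary_cpus(0)` followed by `model_thaw_secondary_cpus()` leaves every CPU online again and `cpu_hotplug_disabled` back at zero, which mirrors the symmetry of the two halves of the cycle.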
112 | ||
151f4e2b MCC |
113 | :: |
114 | ||
7fef9fc8 SB |
115 | |
116 | ||
117 | Regular CPU hotplug call path | |
118 | ----------------------------- | |
119 | ||
120 | Write 0 (or 1) to | |
121 | /sys/devices/system/cpu/cpu*/online | |
122 | sysfs file | |
123 | | | |
124 | | | |
125 | v | |
126 | cpu_down() | |
127 | | | |
128 | v | |
129 | Acquire cpu_add_remove_lock | |
130 | | | |
131 | v | |
89af7ba5 | 132 | If cpu_hotplug_disabled > 0 |
7fef9fc8 SB |
133 | return gracefully |
134 | | | |
135 | | | |
136 | v | |
137 | ======> _cpu_down() | |
138 | | [This takes cpuhotplug.lock | |
139 | Common | before taking down the CPU | |
140 | code | and releases it when done] | |
141 | | While it is at it, notifications | |
142 | | are sent when notable events occur, | |
143 | ======> by running all registered callbacks. | |
144 | | | |
145 | | | |
146 | v | |
147 | Release cpu_add_remove_lock | |
148 | [That's it!, for | |
149 | regular CPU hotplug] | |
150 | ||
151 | ||
152 | ||
153 | So, as can be seen from the two diagrams (the parts marked as "Common code"), | |
154 | regular CPU hotplug and the suspend code path converge at the _cpu_down() and | |
155 | _cpu_up() functions. They differ in the arguments passed to these functions, | |
156 | in that during regular CPU hotplug, 0 is passed for the 'tasks_frozen' | |
157 | argument. But during suspend, since the tasks are already frozen by the time | |
158 | the non-boot CPUs are offlined or onlined, the _cpu_*() functions are called | |
159 | with the 'tasks_frozen' argument set to 1. | |
160 | [See below for some known issues regarding this.] | |
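How the two entry points converge on common code while passing different 'tasks_frozen' values can be sketched as follows. This is a simplified user-space model, not kernel code: `struct hotplug_model` and all three function names are hypothetical, and returning `-EBUSY` is this sketch's assumption for the "return gracefully" step in the diagram.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct hotplug_model {
    int  hotplug_disabled;   /* models cpu_hotplug_disabled */
    bool online;             /* state of the single CPU being modelled */
    int  last_tasks_frozen;  /* value the common code last saw, -1 if never called */
};

/* Common code reached from both paths (stands in for _cpu_down()). */
int common_cpu_down(struct hotplug_model *m, int tasks_frozen)
{
    assert(m);
    m->last_tasks_frozen = tasks_frozen;
    m->online = false;
    return 0;
}

/* Regular hotplug: refused while hotplug is disabled, always passes 0. */
int regular_cpu_down(struct hotplug_model *m)
{
    assert(m);
    if (m->hotplug_disabled > 0)
        return -EBUSY;       /* assumed error value for this sketch */
    return common_cpu_down(m, 0);
}

/* Suspend path: tasks are already frozen at this point, so it passes 1. */
int suspend_cpu_down(struct hotplug_model *m)
{
    return common_cpu_down(m, 1);
}
```

Keeping the frozen/not-frozen distinction as an argument to the shared function, rather than duplicating the function, is what lets callbacks further down choose a code path based on a single flag.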
161 | ||
162 | ||
163 | Important files and functions/entry points: | |
151f4e2b | 164 | ------------------------------------------- |
7fef9fc8 | 165 | |
151f4e2b MCC |
166 | - kernel/power/process.c : freeze_processes(), thaw_processes() |
167 | - kernel/power/suspend.c : suspend_prepare(), suspend_enter(), suspend_finish() | |
1992b66d BH |
168 | - kernel/cpu.c: cpu_[up|down](), _cpu_[up|down](), |
169 | freeze_secondary_cpus(), thaw_secondary_cpus() | |
7fef9fc8 SB |
170 | |
171 | ||
172 | ||
173 | II. What are the issues involved in CPU hotplug? | |
151f4e2b | 174 | ================================================ |
7fef9fc8 SB |
175 | |
176 | There are some interesting situations involving CPU hotplug and microcode | |
177 | update on the CPUs, as discussed below: | |
178 | ||
179 | [Please bear in mind that the kernel requests the microcode images from | |
180 | userspace, using the request_firmware() function defined in | |
df9267f1 | 181 | drivers/base/firmware_loader/main.c] |
7fef9fc8 SB |
182 | |
183 | ||
184 | a. When all the CPUs are identical: | |
185 | ||
186 | This is the most common situation and it is quite straightforward: we want | |
187 | to apply the same microcode revision to each of the CPUs. | |
188 | To give an example of x86, the collect_cpu_info() function defined in | |
189 | arch/x86/kernel/microcode_core.c helps in discovering the type of the CPU | |
190 | and thereby in applying the correct microcode revision to it. | |
191 | But note that the kernel does not maintain a common microcode image for | |
192 | all the CPUs, in order to handle case 'b' described below. | |
193 | ||
194 | ||
195 | b. When some of the CPUs are different from the rest: | |
196 | ||
197 | In this case, since we probably need to apply different microcode revisions | |
198 | to different CPUs, the kernel maintains a copy of the correct microcode | |
199 | image for each CPU (after appropriate CPU type/model discovery using | |
200 | functions such as collect_cpu_info()). | |
201 | ||
202 | ||
203 | c. When a CPU is physically hot-unplugged and a new (and possibly different | |
204 | type of) CPU is hot-plugged into the system: | |
205 | ||
206 | In the current design of the kernel, whenever a CPU is taken offline during | |
207 | a regular CPU hotplug operation, upon receiving the CPU_DEAD notification | |
208 | (which is sent by the CPU hotplug code), the microcode update driver's | |
209 | callback for that event reacts by freeing the kernel's copy of the | |
210 | microcode image for that CPU. | |
211 | ||
212 | Hence, when a new CPU is brought online, the kernel finds that it doesn't | |
213 | have the microcode image, so it does the CPU type/model discovery afresh, | |
214 | requests the appropriate microcode image for that CPU from userspace, and | |
215 | then applies it. | |
216 | ||
217 | For example, in x86, the mc_cpu_callback() function (which is the microcode | |
218 | update driver's callback registered for CPU hotplug events) calls | |
219 | microcode_update_cpu() which would call microcode_init_cpu() in this case, | |
220 | instead of microcode_resume_cpu() when it finds that the kernel doesn't | |
221 | have a valid microcode image. This ensures that the CPU type/model | |
222 | discovery is performed and the right microcode is applied to the CPU after | |
223 | getting it from userspace. | |
224 | ||
225 | ||
226 | d. Handling microcode update during suspend/hibernate: | |
227 | ||
228 | Strictly speaking, during a CPU hotplug operation which does not involve | |
229 | physically removing or inserting CPUs, the CPUs are not actually powered | |
230 | off during a CPU offline. They are just put into the lowest C-states possible. | |
231 | Hence, in such a case, it is not really necessary to re-apply microcode | |
232 | when the CPUs are brought back online, since they wouldn't have lost the | |
233 | image during the CPU offline operation. | |
234 | ||
235 | This is the usual scenario encountered during a resume after a suspend. | |
236 | However, in the case of hibernation, since all the CPUs are completely | |
237 | powered off, during restore it becomes necessary to apply the microcode | |
238 | images to all the CPUs. | |
239 | ||
240 | [Note that we don't expect someone to physically pull out nodes and insert | |
241 | nodes with a different type of CPUs in-between a suspend-resume or a | |
242 | hibernate/restore cycle.] | |
243 | ||
244 | In the current design of the kernel however, during a CPU offline operation | |
f4c09f87 | 245 | as part of the suspend/hibernate cycle (cpuhp_tasks_frozen is set), |
7fef9fc8 SB |
246 | the existing copy of the microcode image in the kernel is not freed up. |
247 | And during the CPU online operations (during resume/restore), since the | |
248 | kernel finds that it already has copies of the microcode images for all the | |
249 | CPUs, it just applies them to the CPUs, avoiding any re-discovery of CPU | |
250 | type/model and the need for validating whether the microcode revisions are | |
251 | right for the CPUs or not (due to the above assumption that physical CPU | |
252 | hotplug will not be done in-between suspend/resume or hibernate/restore | |
253 | cycles). | |
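The policy described in cases 'c' and 'd' can be summarised in a small user-space sketch. It is a model only: `struct ucode_cpu`, `enum ucode_action` and both functions are hypothetical names, and fetching the image from userspace is reduced to setting a flag. The two actions correspond loosely to the microcode_init_cpu() and microcode_resume_cpu() paths named above.

```c
#include <assert.h>
#include <stdbool.h>

enum ucode_action {
    UCODE_INIT,    /* rediscover CPU type/model and fetch a fresh image */
    UCODE_RESUME   /* re-apply the image the kernel already holds */
};

struct ucode_cpu {
    bool has_image;   /* kernel holds a cached microcode image for this CPU */
};

/*
 * CPU offline: a regular hotplug offline frees the cached image, while an
 * offline done as part of suspend/hibernate (tasks frozen) keeps it.
 */
void ucode_cpu_offline(struct ucode_cpu *c, bool tasks_frozen)
{
    assert(c);
    if (!tasks_frozen)
        c->has_image = false;
}

/* CPU online: reuse the cached image if there is one, otherwise start over. */
enum ucode_action ucode_cpu_online(struct ucode_cpu *c)
{
    assert(c);
    if (c->has_image)
        return UCODE_RESUME;
    c->has_image = true;   /* modelled: image obtained from userspace */
    return UCODE_INIT;
}
```

With this model, an offline/online pair with `tasks_frozen` set yields `UCODE_RESUME`, while the same pair during regular hotplug yields `UCODE_INIT`, matching the two behaviours described in the text.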
254 | ||
255 | ||
151f4e2b MCC |
256 | III. Known problems |
257 | =================== | |
258 | ||
259 | Are there any known problems when regular CPU hotplug and suspend race | |
260 | with each other? | |
7fef9fc8 SB |
261 | |
262 | Yes, they are listed below: | |
263 | ||
264 | 1. When invoking regular CPU hotplug, the 'tasks_frozen' argument passed to | |
265 | the _cpu_down() and _cpu_up() functions is *always* 0. | |
266 | This might not reflect the true current state of the system, since the | |
267 | tasks could have been frozen by an out-of-band event such as a suspend | |
f4c09f87 TG |
268 | operation in progress. Hence, the cpuhp_tasks_frozen variable will not |
269 | reflect the frozen state and the CPU hotplug callbacks which evaluate | |
270 | that variable might execute the wrong code path. | |
7fef9fc8 SB |
271 | |
272 | 2. If a regular CPU hotplug stress test happens to race with the freezer due | |
273 | to a suspend operation in progress at the same time, then we could hit the | |
274 | situation described below: | |
275 | ||
276 | * A regular cpu online operation continues its journey from userspace | |
277 | into the kernel, since the freezing has not yet begun. | |
278 | * Then freezer gets to work and freezes userspace. | |
279 | * If cpu online has not yet completed the microcode update by now, | |
280 | it will now start waiting on the frozen userspace in the | |
281 | TASK_UNINTERRUPTIBLE state, in order to get the microcode image. | |
282 | * Now the freezer continues and tries to freeze the remaining tasks. But | |
283 | due to this wait mentioned above, the freezer won't be able to freeze | |
284 | the cpu online hotplug task and hence freezing of tasks fails. | |
285 | ||
286 | As a result of this task freezing failure, the suspend operation gets | |
287 | aborted. |