Commit | Line | Data |
---|---|---|
dc7a12bd | 1 | ========================================================= |
7fe31d28 DM |
2 | Cluster-wide Power-up/power-down race avoidance algorithm |
3 | ========================================================= | |
4 | ||
5 | This file documents the algorithm which is used to coordinate CPU and | |
6 | cluster setup and teardown operations and to manage hardware coherency | |
7 | controls safely. | |
8 | ||
9 | The section "Rationale" explains what the algorithm is for and why it is | |
10 | needed. "Basic model" explains general concepts using a simplified view | |
11 | of the system. The other sections explain the actual details of the | |
12 | algorithm in use. | |
13 | ||
14 | ||
15 | Rationale | |
16 | --------- | |
17 | ||
18 | In a system containing multiple CPUs, it is desirable to have the | |
19 | ability to turn off individual CPUs when the system is idle, reducing | |
20 | power consumption and thermal dissipation. | |
21 | ||
22 | In a system containing multiple clusters of CPUs, it is also desirable | |
23 | to have the ability to turn off entire clusters. | |
24 | ||
25 | Turning entire clusters off and on is a risky business, because it | |
26 | involves performing potentially destructive operations affecting a group | |
27 | of independently running CPUs, while the OS continues to run. This | |
28 | means that we need some coordination in order to ensure that critical | |
29 | cluster-level operations are only performed when it is truly safe to do | |
30 | so. | |
31 | ||
32 | Simple locking may not be sufficient to solve this problem, because | |
33 | mechanisms like Linux spinlocks may rely on coherency mechanisms which | |
34 | are not immediately enabled when a cluster powers up. Since enabling or | |
35 | disabling those mechanisms may itself be a non-atomic operation (such as | |
36 | writing some hardware registers and invalidating large caches), other | |
37 | methods of coordination are required in order to guarantee safe | |
38 | power-down and power-up at the cluster level. | |
39 | ||
40 | The mechanism presented in this document is a coherent-memory-based | |
41 | protocol for performing the needed coordination. It aims to be as | |
42 | lightweight as possible, while providing the required safety properties. | |
43 | ||
44 | ||
45 | Basic model | |
46 | ----------- | |
47 | ||
48 | Each cluster and CPU is assigned a state, as follows: | |
49 | ||
dc7a12bd MCC |
50 | - DOWN |
51 | - COMING_UP | |
52 | - UP | |
53 | - GOING_DOWN | |
54 | ||
55 | :: | |
7fe31d28 DM |
56 | |
57 |     +---------> UP ----------+ | |
58 |     |                        v | |
59 | ||
60 | COMING_UP                GOING_DOWN | |
61 | ||
62 |     ^                        | | |
63 |     +--------- DOWN <--------+ | |
64 | ||
65 | ||
dc7a12bd MCC |
66 | DOWN: |
67 | The CPU or cluster is not coherent, and is either powered off or | |
7fe31d28 DM |
68 | suspended, or is ready to be powered off or suspended. |
69 | ||
dc7a12bd MCC |
70 | COMING_UP: |
71 | The CPU or cluster has committed to moving to the UP state. | |
7fe31d28 DM |
72 | It may be part way through the process of initialisation and |
73 | enabling coherency. | |
74 | ||
dc7a12bd MCC |
75 | UP: |
76 | The CPU or cluster is active and coherent at the hardware | |
7fe31d28 DM |
77 | level. A CPU in this state is not necessarily being used |
78 | actively by the kernel. | |
79 | ||
dc7a12bd MCC |
80 | GOING_DOWN: |
81 | The CPU or cluster has committed to moving to the DOWN | |
7fe31d28 DM |
82 | state. It may be part way through the process of teardown and |
83 | coherency exit. | |
84 | ||
85 | ||
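The four-state cycle of the basic model can be written down as a tiny state machine. The following is a minimal sketch only: the type and function names (`enum basic_state`, `basic_next`, `basic_transition_ok`) are hypothetical and do not appear in the kernel source.

```c
#include <stdbool.h>

/* Hypothetical names: the four states of the basic model. */
enum basic_state { DOWN, COMING_UP, UP, GOING_DOWN };

/* Return the single successor of a state in the basic model's cycle. */
static enum basic_state basic_next(enum basic_state s)
{
	switch (s) {
	case DOWN:       return COMING_UP;
	case COMING_UP:  return UP;
	case UP:         return GOING_DOWN;
	case GOING_DOWN: return DOWN;
	}
	return DOWN; /* not reached for valid input */
}

/* A transition is legal only if it follows the cycle by exactly one step. */
static bool basic_transition_ok(enum basic_state from, enum basic_state to)
{
	return basic_next(from) == to;
}
```

In this sketch there is no direct edge from UP to DOWN: a CPU or cluster must pass through GOING_DOWN first, which is what lets other observers see that teardown is in progress.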
86 | Each CPU has one of these states assigned to it at any point in time. | |
87 | The CPU states are described in the "CPU state" section, below. | |
88 | ||
89 | Each cluster is also assigned a state, but it is necessary to split the | |
90 | state value into two parts (the "cluster" state and "inbound" state) and | |
91 | to introduce additional states in order to avoid races between different | |
92 | CPUs in the cluster simultaneously modifying the state. The cluster- | |
93 | level states are described in the "Cluster state" section. | |
94 | ||
95 | To help distinguish the CPU states from cluster states in this | |
dc7a12bd MCC |
96 | discussion, the state names are given a `CPU_` prefix for the CPU states, |
97 | and a `CLUSTER_` or `INBOUND_` prefix for the cluster states. | |
7fe31d28 DM |
98 | |
99 | ||
100 | CPU state | |
101 | --------- | |
102 | ||
103 | In this algorithm, each individual core in a multi-core processor is | |
104 | referred to as a "CPU". CPUs are assumed to be single-threaded: | |
105 | therefore, a CPU can only be doing one thing at any point in time. | |
106 | ||
107 | This means that CPUs fit the basic model closely. | |
108 | ||
109 | The algorithm defines the following states for each CPU in the system: | |
110 | ||
dc7a12bd MCC |
111 | - CPU_DOWN |
112 | - CPU_COMING_UP | |
113 | - CPU_UP | |
114 | - CPU_GOING_DOWN | |
115 | ||
116 | :: | |
7fe31d28 DM |
117 | |
118 |          cluster setup and | |
119 |         CPU setup complete          policy decision | |
120 |       +-----------> CPU_UP ------------+ | |
121 |       |                                v | |
122 | ||
123 | CPU_COMING_UP                   CPU_GOING_DOWN | |
124 | ||
125 |       ^                                | | |
126 |       +----------- CPU_DOWN <----------+ | |
127 |  policy decision                  CPU teardown complete | |
128 |  or hardware event | |
129 | ||
130 | ||
131 | The definitions of the four states correspond closely to the states of | |
132 | the basic model. | |
133 | ||
134 | Transitions between states occur as follows. | |
135 | ||
136 | A trigger event of "(spontaneous)" means that the CPU can transition to the | |
137 | next state as a result of making local progress only, with no | |
138 | requirement for any external event to happen. | |
139 | ||
140 | ||
141 | CPU_DOWN: | |
7fe31d28 DM |
142 | A CPU reaches the CPU_DOWN state when it is ready for |
143 | power-down. On reaching this state, the CPU will typically | |
144 | power itself down or suspend itself, via a WFI instruction or a | |
145 | firmware call. | |
146 | ||
dc7a12bd MCC |
147 | Next state: |
148 | CPU_COMING_UP | |
149 | Conditions: | |
150 | none | |
7fe31d28 DM |
151 | |
152 | Trigger events: | |
7fe31d28 DM |
153 | a) an explicit hardware power-up operation, resulting |
154 | from a policy decision on another CPU; | |
155 | ||
156 | b) a hardware event, such as an interrupt. | |
157 | ||
158 | ||
159 | CPU_COMING_UP: | |
7fe31d28 DM |
160 | A CPU cannot start participating in hardware coherency until the |
161 | cluster is set up and coherent. If the cluster is not ready, | |
162 | then the CPU will wait in the CPU_COMING_UP state until the | |
163 | cluster has been set up. | |
164 | ||
dc7a12bd MCC |
165 | Next state: |
166 | CPU_UP | |
167 | Conditions: | |
168 | The CPU's parent cluster must be in CLUSTER_UP. | |
169 | Trigger events: | |
170 | Transition of the parent cluster to CLUSTER_UP. | |
7fe31d28 DM |
171 | |
172 | Refer to the "Cluster state" section for a description of the | |
173 | CLUSTER_UP state. | |
174 | ||
175 | ||
176 | CPU_UP: | |
177 | When a CPU reaches the CPU_UP state, it is safe for the CPU to | |
178 | start participating in local coherency. | |
179 | ||
180 | This is done by jumping to the kernel's CPU resume code. | |
181 | ||
182 | Note that the definition of this state is slightly different | |
183 | from the basic model definition: CPU_UP does not mean that the | |
184 | CPU is coherent yet, but it does mean that it is safe to resume | |
185 | the kernel. The kernel handles the rest of the resume | |
186 | procedure, so the remaining steps are not visible as part of the | |
187 | race avoidance algorithm. | |
188 | ||
189 | The CPU remains in this state until an explicit policy decision | |
190 | is made to shut down or suspend the CPU. | |
191 | ||
dc7a12bd MCC |
192 | Next state: |
193 | CPU_GOING_DOWN | |
194 | Conditions: | |
195 | none | |
196 | Trigger events: | |
197 | explicit policy decision | |
7fe31d28 DM |
198 | |
199 | ||
200 | CPU_GOING_DOWN: | |
7fe31d28 DM |
201 | While in this state, the CPU exits coherency, including any |
202 | operations required to achieve this (such as cleaning data | |
203 | caches). | |
204 | ||
dc7a12bd MCC |
205 | Next state: |
206 | CPU_DOWN | |
207 | Conditions: | |
208 | local CPU teardown complete | |
209 | Trigger events: | |
210 | (spontaneous) | |
7fe31d28 DM |
211 | |
212 | ||
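The per-CPU transition rules above can be condensed into a single step function. This is a minimal sketch under stated assumptions: `struct cpu_events` and `cpu_step` are hypothetical names, and the boolean inputs stand in for the conditions and trigger events listed for each state. It models the rules only and performs no hardware operations.

```c
#include <stdbool.h>

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };

/* Hypothetical summary of the conditions and triggers a CPU can observe. */
struct cpu_events {
	bool wakeup;            /* power-up operation or hardware event */
	bool parent_cluster_up; /* parent cluster is in CLUSTER_UP */
	bool policy_power_down; /* explicit decision to shut down or suspend */
	bool teardown_complete; /* local CPU teardown has finished */
};

/* Advance a CPU by at most one state, or stay put if nothing applies. */
static enum cpu_state cpu_step(enum cpu_state s, const struct cpu_events *ev)
{
	switch (s) {
	case CPU_DOWN:
		return ev->wakeup ? CPU_COMING_UP : CPU_DOWN;
	case CPU_COMING_UP:
		/* Wait here until the cluster has been set up. */
		return ev->parent_cluster_up ? CPU_UP : CPU_COMING_UP;
	case CPU_UP:
		return ev->policy_power_down ? CPU_GOING_DOWN : CPU_UP;
	case CPU_GOING_DOWN:
		/* Spontaneous: depends only on local progress. */
		return ev->teardown_complete ? CPU_DOWN : CPU_GOING_DOWN;
	}
	return s;
}
```

Only the CPU_COMING_UP case depends on state owned by something other than the CPU itself, which is why that is the state in which a CPU may have to wait.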
213 | Cluster state | |
214 | ------------- | |
215 | ||
216 | A cluster is a group of connected CPUs with some common resources. | |
217 | Because a cluster contains multiple CPUs, it can be doing multiple | |
218 | things at the same time. This has some implications. In particular, a | |
219 | CPU can start up while another CPU is tearing the cluster down. | |
220 | ||
221 | In this discussion, the "outbound side" is the view of the cluster state | |
222 | as seen by a CPU tearing the cluster down. The "inbound side" is the | |
223 | view of the cluster state as seen by a CPU setting the cluster up. | |
224 | ||
225 | In order to enable safe coordination in such situations, it is important | |
226 | that a CPU which is setting up the cluster can advertise its state | |
227 | independently of the CPU which is tearing down the cluster. For this | |
228 | reason, the cluster state is split into two parts: | |
229 | ||
230 | "cluster" state: The global state of the cluster; or the state | |
dc7a12bd | 231 | on the outbound side: |
7fe31d28 | 232 | |
dc7a12bd MCC |
233 | - CLUSTER_DOWN |
234 | - CLUSTER_UP | |
235 | - CLUSTER_GOING_DOWN | |
7fe31d28 DM |
236 | |
237 | "inbound" state: The state of the cluster on the inbound side. | |
238 | ||
dc7a12bd MCC |
239 | - INBOUND_NOT_COMING_UP |
240 | - INBOUND_COMING_UP | |
7fe31d28 DM |
241 | |
242 | ||
243 | The different pairings of these states result in six possible | |
dc7a12bd | 244 | states for the cluster as a whole:: |
7fe31d28 DM |
245 | |
246 |             CLUSTER_UP | |
247 | +==========> INBOUND_NOT_COMING_UP -------------+ | |
248 | #                                               | | |
249 |                                                 | | |
250 | CLUSTER_UP     <----+                           | | |
251 | INBOUND_COMING_UP   |                           v | |
252 | ||
253 | ^             CLUSTER_GOING_DOWN       CLUSTER_GOING_DOWN | |
254 | #              INBOUND_COMING_UP <=== INBOUND_NOT_COMING_UP | |
255 | ||
256 | CLUSTER_DOWN           |                        | | |
257 | INBOUND_COMING_UP <----+                        | | |
258 |                                                 | | |
259 | ^                                               | | |
260 | +===========     CLUSTER_DOWN      <------------+ | |
261 |              INBOUND_NOT_COMING_UP | |
262 | ||
263 | Transitions -----> can only be made by the outbound CPU, and | |
264 | only involve changes to the "cluster" state. | |
265 | ||
266 | Transitions ===##> can only be made by the inbound CPU, and only | |
267 | involve changes to the "inbound" state, except where there is no | |
268 | further transition possible on the outbound side (i.e., the | |
269 | outbound CPU has put the cluster into the CLUSTER_DOWN state). | |
270 | ||
271 | The race avoidance algorithm does not provide a way to determine | |
272 | which exact CPUs within the cluster play these roles. This must | |
273 | be decided in advance by some other means. Refer to the section | |
274 | "Last man and First man selection" for more explanation. | |
275 | ||
276 | ||
277 | CLUSTER_DOWN/INBOUND_NOT_COMING_UP is the only state where the | |
278 | cluster can actually be powered down. | |
279 | ||
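One way to picture the split state is as two independent fields, each written by only one side, so that the inbound and outbound CPUs never modify the same value. The sketch below is illustrative only: `struct cluster_sync` and `cluster_may_power_off` are hypothetical names, and a real implementation would also need the memory barriers and cache maintenance that this example omits.

```c
#include <stdbool.h>

enum cluster_part { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_part { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

/* Hypothetical layout: one field per side of the protocol. */
struct cluster_sync {
	enum cluster_part cluster; /* "cluster" state, outbound side */
	enum inbound_part inbound; /* "inbound" state, inbound side */
};

/* Only CLUSTER_DOWN/INBOUND_NOT_COMING_UP permits actual power-down. */
static bool cluster_may_power_off(const struct cluster_sync *c)
{
	return c->cluster == CLUSTER_DOWN &&
	       c->inbound == INBOUND_NOT_COMING_UP;
}
```

Three "cluster" values combined with two "inbound" values give the six pairings shown in the diagram.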
280 | The parallelism of the inbound and outbound CPUs is reflected in | |
281 | the existence of two different paths from CLUSTER_GOING_DOWN/ | |
282 | INBOUND_NOT_COMING_UP (corresponding to GOING_DOWN in the basic | |
283 | model) to CLUSTER_DOWN/INBOUND_COMING_UP (corresponding to | |
284 | COMING_UP in the basic model). The second path avoids cluster | |
285 | teardown completely. | |
286 | ||
287 | CLUSTER_UP/INBOUND_COMING_UP is equivalent to UP in the basic | |
288 | model. The final transition to CLUSTER_UP/INBOUND_NOT_COMING_UP | |
289 | is trivial and merely resets the state machine ready for the | |
290 | next cycle. | |
291 | ||
292 | Details of the allowable transitions follow. | |
293 | ||
294 | The next state in each case is notated | |
295 | ||
296 | <cluster state>/<inbound state> (<transitioner>) | |
297 | ||
298 | where the <transitioner> is the side on which the transition | |
299 | can occur; either the inbound or the outbound side. | |
300 | ||
301 | ||
302 | CLUSTER_DOWN/INBOUND_NOT_COMING_UP: | |
dc7a12bd MCC |
303 | Next state: |
304 | CLUSTER_DOWN/INBOUND_COMING_UP (inbound) | |
305 | Conditions: | |
306 | none | |
7fe31d28 | 307 | |
7fe31d28 | 308 | Trigger events: |
7fe31d28 DM |
309 | a) an explicit hardware power-up operation, resulting |
310 | from a policy decision on another CPU; | |
311 | ||
312 | b) a hardware event, such as an interrupt. | |
313 | ||
314 | ||
315 | CLUSTER_DOWN/INBOUND_COMING_UP: | |
316 | ||
317 | In this state, an inbound CPU sets up the cluster, including | |
318 | enabling of hardware coherency at the cluster level and any | |
319 | other operations (such as cache invalidation) which are required | |
320 | in order to achieve this. | |
321 | ||
322 | The purpose of this state is to do sufficient cluster-level | |
323 | setup to enable other CPUs in the cluster to enter coherency | |
324 | safely. | |
325 | ||
dc7a12bd MCC |
326 | Next state: |
327 | CLUSTER_UP/INBOUND_COMING_UP (inbound) | |
328 | Conditions: | |
329 | cluster-level setup and hardware coherency complete | |
330 | Trigger events: | |
331 | (spontaneous) | |
7fe31d28 DM |
332 | |
333 | ||
334 | CLUSTER_UP/INBOUND_COMING_UP: | |
335 | ||
336 | Cluster-level setup is complete and hardware coherency is | |
337 | enabled for the cluster. Other CPUs in the cluster can safely | |
338 | enter coherency. | |
339 | ||
340 | This is a transient state, leading immediately to | |
341 | CLUSTER_UP/INBOUND_NOT_COMING_UP. All other CPUs on the cluster | |
342 | should treat these two states as equivalent. | |
343 | ||
dc7a12bd MCC |
344 | Next state: |
345 | CLUSTER_UP/INBOUND_NOT_COMING_UP (inbound) | |
346 | Conditions: | |
347 | none | |
348 | Trigger events: | |
349 | (spontaneous) | |
7fe31d28 DM |
350 | |
351 | ||
352 | CLUSTER_UP/INBOUND_NOT_COMING_UP: | |
353 | ||
354 | Cluster-level setup is complete and hardware coherency is | |
355 | enabled for the cluster. Other CPUs in the cluster can safely | |
356 | enter coherency. | |
357 | ||
358 | The cluster will remain in this state until a policy decision is | |
359 | made to power the cluster down. | |
360 | ||
dc7a12bd MCC |
361 | Next state: |
362 | CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (outbound) | |
363 | Conditions: | |
364 | none | |
365 | Trigger events: | |
366 | policy decision to power down the cluster | |
7fe31d28 DM |
367 | |
368 | ||
369 | CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP: | |
370 | ||
371 | An outbound CPU is tearing the cluster down. The selected CPU | |
372 | must wait in this state until all CPUs in the cluster are in the | |
373 | CPU_DOWN state. | |
374 | ||
375 | When all CPUs are in the CPU_DOWN state, the cluster can be torn | |
376 | down, for example by cleaning data caches and exiting | |
377 | cluster-level coherency. | |
378 | ||
379 | To avoid unnecessary teardown operations, the outbound CPU | |
380 | should check the inbound cluster state for asynchronous | |
381 | transitions to INBOUND_COMING_UP. Alternatively, individual | |
382 | CPUs can be checked for entry into CPU_COMING_UP or CPU_UP. | |
383 | ||
384 | ||
385 | Next states: | |
386 | ||
387 | CLUSTER_DOWN/INBOUND_NOT_COMING_UP (outbound) | |
dc7a12bd MCC |
388 | Conditions: |
389 | cluster torn down and ready to power off | |
390 | Trigger events: | |
391 | (spontaneous) | |
7fe31d28 DM |
392 | |
393 | CLUSTER_GOING_DOWN/INBOUND_COMING_UP (inbound) | |
dc7a12bd MCC |
394 | Conditions: |
395 | none | |
7fe31d28 | 396 | |
dc7a12bd | 397 | Trigger events: |
7fe31d28 DM |
398 | a) an explicit hardware power-up operation, |
399 | resulting from a policy decision on another | |
400 | CPU; | |
401 | ||
402 | b) a hardware event, such as an interrupt. | |
403 | ||
404 | ||
405 | CLUSTER_GOING_DOWN/INBOUND_COMING_UP: | |
406 | ||
407 | The cluster is (or was) being torn down, but another CPU has | |
408 | come online in the meantime and is trying to set up the cluster | |
409 | again. | |
410 | ||
411 | If the outbound CPU observes this state, it has two choices: | |
412 | ||
413 | a) back out of teardown, restoring the cluster to the | |
414 | CLUSTER_UP state; | |
415 | ||
416 | b) finish tearing the cluster down and put the cluster | |
417 | in the CLUSTER_DOWN state; the inbound CPU will | |
418 | set up the cluster again from there. | |
419 | ||
420 | Choice (a) permits the removal of some latency by avoiding | |
421 | unnecessary teardown and setup operations in situations where | |
422 | the cluster is not really going to be powered down. | |
423 | ||
424 | ||
425 | Next states: | |
426 | ||
427 | CLUSTER_UP/INBOUND_COMING_UP (outbound) | |
dc7a12bd MCC |
428 | Conditions: |
429 | cluster-level setup and hardware | |
7fe31d28 | 430 | coherency complete |
dc7a12bd MCC |
431 | |
432 | Trigger events: | |
433 | (spontaneous) | |
7fe31d28 DM |
434 | |
435 | CLUSTER_DOWN/INBOUND_COMING_UP (outbound) | |
dc7a12bd MCC |
436 | Conditions: |
437 | cluster torn down and ready to power off | |
438 | ||
439 | Trigger events: | |
440 | (spontaneous) | |
7fe31d28 DM |
441 | |
442 | ||
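The allowable cluster-level transitions listed above can be collected into a table and checked mechanically. The following is a minimal sketch, not taken from the kernel: `struct cluster_pair`, `struct cluster_edge`, `enum side` and `cluster_edge_ok` are hypothetical names, and the table simply transcribes the "Next state" entries together with the side that is allowed to make each transition.

```c
#include <stdbool.h>
#include <stddef.h>

enum cluster_part { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_part { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };
enum side { SIDE_INBOUND, SIDE_OUTBOUND };

struct cluster_pair { enum cluster_part cluster; enum inbound_part inbound; };
struct cluster_edge { struct cluster_pair from, to; enum side by; };

/* Every transition documented above, with its <transitioner>. */
static const struct cluster_edge cluster_edges[] = {
	{ { CLUSTER_DOWN, INBOUND_NOT_COMING_UP },
	  { CLUSTER_DOWN, INBOUND_COMING_UP }, SIDE_INBOUND },
	{ { CLUSTER_DOWN, INBOUND_COMING_UP },
	  { CLUSTER_UP, INBOUND_COMING_UP }, SIDE_INBOUND },
	{ { CLUSTER_UP, INBOUND_COMING_UP },
	  { CLUSTER_UP, INBOUND_NOT_COMING_UP }, SIDE_INBOUND },
	{ { CLUSTER_UP, INBOUND_NOT_COMING_UP },
	  { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP }, SIDE_OUTBOUND },
	{ { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP },
	  { CLUSTER_DOWN, INBOUND_NOT_COMING_UP }, SIDE_OUTBOUND },
	{ { CLUSTER_GOING_DOWN, INBOUND_NOT_COMING_UP },
	  { CLUSTER_GOING_DOWN, INBOUND_COMING_UP }, SIDE_INBOUND },
	{ { CLUSTER_GOING_DOWN, INBOUND_COMING_UP },
	  { CLUSTER_UP, INBOUND_COMING_UP }, SIDE_OUTBOUND },
	{ { CLUSTER_GOING_DOWN, INBOUND_COMING_UP },
	  { CLUSTER_DOWN, INBOUND_COMING_UP }, SIDE_OUTBOUND },
};

/* Return true if 'by' is allowed to move the cluster from 'from' to 'to'. */
static bool cluster_edge_ok(struct cluster_pair from, struct cluster_pair to,
			    enum side by)
{
	size_t i;

	for (i = 0; i < sizeof(cluster_edges) / sizeof(cluster_edges[0]); i++) {
		const struct cluster_edge *e = &cluster_edges[i];

		if (e->from.cluster == from.cluster &&
		    e->from.inbound == from.inbound &&
		    e->to.cluster == to.cluster &&
		    e->to.inbound == to.inbound && e->by == by)
			return true;
	}
	return false;
}
```

A table like this is mainly useful for assertions or tests. It makes visible that the only inbound transition which changes the "cluster" part is the one out of CLUSTER_DOWN/INBOUND_COMING_UP, where the outbound side has no further transition to make.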
443 | Last man and First man selection | |
444 | -------------------------------- | |
445 | ||
446 | The CPU which performs cluster tear-down operations on the outbound side | |
447 | is commonly referred to as the "last man". | |
448 | ||
449 | The CPU which performs cluster setup on the inbound side is commonly | |
450 | referred to as the "first man". | |
451 | ||
452 | The race avoidance algorithm documented above does not provide a | |
453 | mechanism to choose which CPUs should play these roles. | |
454 | ||
455 | ||
456 | Last man: | |
457 | ||
458 | When shutting down the cluster, all the CPUs involved are initially | |
459 | executing Linux and hence coherent. Therefore, ordinary spinlocks can | |
460 | be used to select a last man safely, before the CPUs become | |
461 | non-coherent. | |
462 | ||
463 | ||
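As an illustration of last man selection under a lock, the sketch below counts the CPUs still in use and lets the CPU that brings the count to zero take the last man role. Everything here is an assumption made for the example: the use count, the names `cpus_in_use` and `cpu_leaving_is_last_man`, the cluster size, and the use of a C11 `atomic_flag` as a stand-in for a kernel spinlock.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define CPUS_PER_CLUSTER 4 /* assumption for the example */

/* Stand-in for a spinlock; valid only while the CPUs are still coherent. */
static atomic_flag lastman_lock = ATOMIC_FLAG_INIT;
static int cpus_in_use = CPUS_PER_CLUSTER;

static void lastman_lock_acquire(void)
{
	while (atomic_flag_test_and_set_explicit(&lastman_lock,
						 memory_order_acquire))
		; /* spin until the flag was previously clear */
}

static void lastman_lock_release(void)
{
	atomic_flag_clear_explicit(&lastman_lock, memory_order_release);
}

/*
 * Called by a CPU that has decided to power down, before it leaves
 * coherency. Returns true if the caller is the last man.
 */
static bool cpu_leaving_is_last_man(void)
{
	bool last;

	lastman_lock_acquire();
	cpus_in_use--;
	last = (cpus_in_use == 0);
	lastman_lock_release();

	return last;
}
```

The decision must be taken before the calling CPU starts exiting coherency, since the lock depends on the coherency mechanisms discussed in the "Rationale" section.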
464 | First man: | |
465 | ||
466 | Because CPUs may power up asynchronously in response to external wake-up | |
467 | events, a dynamic mechanism is needed to make sure that only one CPU | |
468 | attempts to play the first man role and do the cluster-level | |
469 | initialisation: any other CPUs must wait for this to complete before | |
470 | proceeding. | |
471 | ||
472 | Cluster-level initialisation may involve actions such as configuring | |
473 | coherency controls in the bus fabric. | |
474 | ||
475 | The current implementation in mcpm_head.S uses a separate mutual exclusion | |
476 | mechanism to do this arbitration. This mechanism is documented in | |
477 | detail in vlocks.txt. | |
478 | ||
479 | ||
480 | Features and Limitations | |
481 | ------------------------ | |
482 | ||
483 | Implementation: | |
484 | ||
485 | The current ARM-based implementation is split between | |
486 | arch/arm/common/mcpm_head.S (low-level inbound CPU operations) and | |
487 | arch/arm/common/mcpm_entry.c (everything else): | |
488 | ||
489 | __mcpm_cpu_going_down() signals the transition of a CPU to the | |
dc7a12bd | 490 | CPU_GOING_DOWN state. |
7fe31d28 DM |
491 | |
492 | __mcpm_cpu_down() signals the transition of a CPU to the CPU_DOWN | |
dc7a12bd | 493 | state. |
7fe31d28 DM |
494 | |
495 | A CPU transitions to CPU_COMING_UP and then to CPU_UP via the | |
dc7a12bd MCC |
496 | low-level power-up code in mcpm_head.S. This could |
497 | involve CPU-specific setup code, but in the current | |
498 | implementation it does not. | |
7fe31d28 DM |
499 | |
500 | __mcpm_outbound_enter_critical() and __mcpm_outbound_leave_critical() | |
dc7a12bd MCC |
501 | handle transitions from CLUSTER_UP to CLUSTER_GOING_DOWN |
502 | and from there to CLUSTER_DOWN or back to CLUSTER_UP (in | |
503 | the case of an aborted cluster power-down). | |
7fe31d28 | 504 | |
dc7a12bd MCC |
505 | These functions are more complex than the __mcpm_cpu_*() |
506 | functions due to the extra inter-CPU coordination which | |
507 | is needed for safe transitions at the cluster level. | |
7fe31d28 DM |
508 | |
509 | A cluster transitions from CLUSTER_DOWN back to CLUSTER_UP via | |
dc7a12bd MCC |
510 | the low-level power-up code in mcpm_head.S. This |
511 | typically involves platform-specific setup code, | |
512 | provided by the platform-specific power_up_setup | |
513 | function registered via mcpm_sync_init. | |
7fe31d28 DM |
514 | |
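The outbound-side behaviour described for CLUSTER_GOING_DOWN/INBOUND_NOT_COMING_UP (advertise the teardown, check the inbound state, wait for the other CPUs, and back out if any of them is coming up) can be sketched in plain C. This is a simplified user-space model, not the kernel's code: `struct cluster_sketch`, `outbound_enter_critical_sketch` and `SKETCH_MAX_CPUS` are hypothetical, and the barriers, cache maintenance and low-power waiting that a real implementation needs are deliberately left out.

```c
#include <stdbool.h>

#define SKETCH_MAX_CPUS 4 /* assumption: CPUs per cluster */

enum cpu_state { CPU_DOWN, CPU_COMING_UP, CPU_UP, CPU_GOING_DOWN };
enum cluster_part { CLUSTER_DOWN, CLUSTER_UP, CLUSTER_GOING_DOWN };
enum inbound_part { INBOUND_NOT_COMING_UP, INBOUND_COMING_UP };

struct cluster_sketch {
	volatile enum cpu_state cpus[SKETCH_MAX_CPUS];
	volatile enum cluster_part cluster; /* written by the outbound side */
	volatile enum inbound_part inbound; /* written by the inbound side */
};

/*
 * Called by the last man (CPU 'self'). Returns true if cluster teardown
 * may proceed, or false if it was aborted and the cluster is CLUSTER_UP.
 */
static bool outbound_enter_critical_sketch(struct cluster_sketch *c,
					   unsigned int self)
{
	unsigned int i;

	/* Advertise that teardown has started. */
	c->cluster = CLUSTER_GOING_DOWN;

	/* Back out if an inbound CPU is already setting the cluster up. */
	if (c->inbound == INBOUND_COMING_UP)
		goto abort;

	/* Wait for every other CPU to finish going down. */
	for (i = 0; i < SKETCH_MAX_CPUS; i++) {
		if (i == self)
			continue;
		while (c->cpus[i] == CPU_GOING_DOWN)
			; /* busy-wait until this CPU leaves CPU_GOING_DOWN */
		if (c->cpus[i] != CPU_DOWN)
			goto abort; /* CPU_COMING_UP or CPU_UP */
	}
	return true;

abort:
	c->cluster = CLUSTER_UP;
	return false;
}
```

In this sketch a `false` return corresponds to choice (a) for an aborted power-down: the cluster is put back into CLUSTER_UP without any teardown having been performed.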
515 | Deep topologies: | |
516 | ||
517 | As currently described and implemented, the algorithm does not | |
518 | support CPU topologies involving more than two levels (i.e., | |
519 | clusters of clusters are not supported). The algorithm could be | |
520 | extended by replicating the cluster-level states for the | |
521 | additional topological levels, and modifying the transition | |
522 | rules for the intermediate (non-outermost) cluster levels. | |
523 | ||
524 | ||
525 | Colophon | |
526 | -------- | |
527 | ||
528 | Originally created and documented by Dave Martin for Linaro Limited, in | |
529 | collaboration with Nicolas Pitre and Achin Gupta. | |
530 | ||
531 | Copyright (C) 2012-2013 Linaro Limited | |
532 | Distributed under the terms of Version 2 of the GNU General Public | |
533 | License, as defined in linux/COPYING. |