Commit | Line | Data |
---|---|---|
db9a0975 MCC |
1 | ============================================================= |
2 | An ad-hoc collection of notes on IA64 MCA and INIT processing | |
3 | ============================================================= | |
4 | ||
5 | Feel free to update it with notes about any area that is not clear. | |
8ee9e23d KO |
6 | |
7 | --- | |
8 | ||
9 | MCA/INIT are completely asynchronous. They can occur at any time, when | |
10 | the OS is in any state, including when one of the cpus is already | |
11 | holding a spinlock. Trying to get any lock from MCA/INIT state is | |
12 | asking for deadlock. Also the state of structures that are protected | |
13 | by locks is indeterminate, including linked lists. | |
14 | ||
15 | --- | |
16 | ||
17 | The complicated ia64 MCA process. All of this is mandated by Intel's | |
670e9f34 | 18 | specification for ia64 SAL, error recovery and unwind; it is not as |
8ee9e23d KO |
19 | if we have a choice here. |
20 | ||
21 | * MCA occurs on one cpu, usually due to a double bit memory error. | |
22 | This is the monarch cpu. | |
23 | ||
24 | * SAL sends an MCA rendezvous interrupt (which is a normal interrupt) | |
25 | to all the other cpus, the slaves. | |
26 | ||
27 | * Slave cpus that receive the MCA interrupt call down into SAL; they | |
28 | end up spinning disabled while the MCA is being serviced. | |
29 | ||
30 | * If any slave cpu was already spinning disabled when the MCA occurred | |
31 | then it cannot service the MCA interrupt. SAL waits ~20 seconds then | |
32 | sends an unmaskable INIT event to the slave cpus that have not | |
33 | already rendezvoused. | |
34 | ||
35 | * Because MCA/INIT can be delivered at any time, including when the cpu | |
36 | is down in PAL in physical mode, the registers at the time of the | |
37 | event are _completely_ undefined. In particular the MCA/INIT | |
38 | handlers cannot rely on the thread pointer, PAL physical mode can | |
39 | (and does) modify TP. It is allowed to do that as long as it resets | |
40 | TP on return. However MCA/INIT events expose us to these PAL | |
41 | internal TP changes. Hence curr_task(). | |
42 | ||
43 | * If an MCA/INIT event occurs while the kernel was running (not user | |
44 | space) and the kernel has called PAL then the MCA/INIT handler cannot | |
45 | assume that the kernel stack is in a fit state to be used. Mainly | |
46 | because PAL may or may not maintain the stack pointer internally. | |
47 | Because the MCA/INIT handlers cannot trust the kernel stack, they | |
48 | have to use their own, per-cpu stacks. The MCA/INIT stacks are | |
49 | preformatted with just enough task state to let the relevant handlers | |
50 | do their job. | |
51 | ||
52 | * Unlike most other architectures, the ia64 struct task is embedded in | |
53 | the kernel stack[1]. So switching to a new kernel stack means that | |
54 | we switch to a new task as well. Because various bits of the kernel | |
55 | assume that current points into the struct task, switching to a new | |
56 | stack also means a new value for current. | |
57 | ||
58 | * Once all slaves have rendezvoused and are spinning disabled, the | |
59 | monarch is entered. The monarch now tries to diagnose the problem | |
60 | and decide if it can recover or not. | |
61 | ||
62 | * Part of the monarch's job is to look at the state of all the other | |
63 | tasks. The only way to do that on ia64 is to call the unwinder, | |
64 | as mandated by Intel. | |
65 | ||
66 | * The starting point for the unwind depends on whether a task is | |
67 | running or not. That is, whether it is on a cpu or is blocked. The | |
68 | monarch has to determine whether or not a task is on a cpu before it | |
69 | knows how to start unwinding it. The tasks that received an MCA or | |
70 | INIT event are no longer running; they have been converted to blocked | |
71 | tasks. But (and it's a big but), the cpus that received the MCA | |
72 | rendezvous interrupt are still running on their normal kernel stacks! | |
73 | ||
74 | * To distinguish between these two cases, the monarch must know which | |
75 | tasks are on a cpu and which are not. Hence each slave cpu that | |
76 | switches to an MCA/INIT stack, registers its new stack using | |
77 | set_curr_task(), so the monarch can tell that the _original_ task is | |
78 | no longer running on that cpu. That gives us a decent chance of | |
79 | getting a valid backtrace of the _original_ task. | |
80 | ||
81 | * MCA/INIT can be nested, to a depth of 2 on any cpu. In the case of a | |
82 | nested error, we want diagnostics on the MCA/INIT handler that | |
83 | failed, not on the task that was originally running. Again this | |
84 | requires set_curr_task() so the MCA/INIT handlers can register their | |
85 | own stack as running on that cpu. Then a recursive error gets a | |
86 | trace of the failing handler's "task". | |
87 | ||
db9a0975 MCC |
88 | [1] |
89 | My (Keith Owens) original design called for ia64 to separate its | |
8ee9e23d KO |
90 | struct task and the kernel stacks. Then the MCA/INIT data would be |
91 | chained stacks like i386 interrupt stacks. But that required | |
92 | radical surgery on the rest of ia64, plus extra hard wired TLB | |
93 | entries with its associated performance degradation. David | |
94 | Mosberger vetoed that approach. Which meant that separate kernel | |
95 | stacks meant separate "tasks" for the MCA/INIT handlers. | |
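
The role of curr_task() and set_curr_task() in the steps above can be
sketched with a toy model. This is only an illustration, not the
kernel's code: NR_CPUS, struct task, cpu_curr[] and every model_*
name below are hypothetical stand-ins invented for the sketch. The
point is that once a handler registers its own stack as the current
task on a cpu, the monarch no longer sees the _original_ task as
running there and can unwind it as a blocked task.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4   /* hypothetical cpu count, for the sketch only */

/* Toy stand-in for a task; not the kernel's struct task_struct. */
struct task {
    int pid;
    bool is_mca_init_handler;   /* models _TIF_MCA_INIT being set */
};

/* One "currently running" slot per cpu (hypothetical model state). */
static struct task *cpu_curr[NR_CPUS];

/* Model of curr_task(): which task is registered on this cpu? */
static struct task *model_curr_task(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return cpu_curr[cpu];
}

/* Model of set_curr_task(): register a task as running on this cpu. */
static void model_set_curr_task(int cpu, struct task *t)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    cpu_curr[cpu] = t;
}

/* A task counts as "on a cpu" only if some cpu's slot points at it. */
static bool model_task_on_cpu(const struct task *t)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpu_curr[cpu] == t)
            return true;
    return false;
}
```

In this model, calling model_set_curr_task(cpu, &handler) on entry to
a handler makes model_task_on_cpu(&original) false, which is the
condition the monarch needs before it starts a blocked-task unwind of
the original task.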
96 | ||
97 | --- | |
98 | ||
99 | INIT is less complicated than MCA. Pressing the nmi button or using | |
100 | the equivalent command on the management console sends INIT to all | |
670e9f34 | 101 | cpus. SAL picks one of the cpus as the monarch and the rest are |
8ee9e23d KO |
102 | slaves. All the OS INIT handlers are entered at approximately the same |
103 | time. The OS monarch prints the state of all tasks and returns, after | |
104 | which the slaves return and the system resumes. | |
105 | ||
106 | At least that is what is supposed to happen. Alas there are broken | |
107 | versions of SAL out there. Some drive all the cpus as monarchs. Some | |
108 | drive them all as slaves. Some drive one cpu as monarch, wait for that | |
109 | cpu to return from the OS then drive the rest as slaves. Some versions | |
110 | of SAL cannot even cope with returning from the OS; they spin inside | |
111 | SAL on resume. The OS INIT code has workarounds for some of these | |
112 | broken SAL symptoms, but some simply cannot be fixed from the OS side. | |
113 | ||
114 | --- | |
115 | ||
116 | The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer | |
117 | violations. Unfortunately MCA/INIT start off as massive layer | |
118 | violations (can occur at _any_ time) and they build from there. | |
119 | ||
120 | At least ia64 makes an attempt at recovering from hardware errors, but | |
121 | it is a difficult problem because of the asynchronous nature of these | |
122 | errors. When processing an unmaskable interrupt we sometimes need | |
123 | special code to cope with our inability to take any locks. | |
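
One common shape for such special code is a bounded try-lock: make a
fixed number of attempts and, if the lock is still held, skip the
protected structure instead of spinning forever. The sketch below is
a generic user-space illustration using C11 atomics. It is not taken
from mca.c, and struct model_lock and the model_* functions are
hypothetical names. Even when the lock is obtained, the protected
data may have been mid-update on the interrupted cpu, so a real
handler still has to treat it with suspicion.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical lock type for the sketch; not a kernel spinlock. */
struct model_lock {
    atomic_flag held;
};

/*
 * Try to take the lock at most max_attempts times.  Returns true if
 * the lock was obtained, false if the caller must carry on without it.
 */
static bool model_trylock_bounded(struct model_lock *l, unsigned max_attempts)
{
    assert(l != NULL);
    for (unsigned i = 0; i < max_attempts; i++)
        if (!atomic_flag_test_and_set_explicit(&l->held,
                                               memory_order_acquire))
            return true;
    return false;
}

/* Release a lock previously obtained with model_trylock_bounded(). */
static void model_unlock(struct model_lock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A handler written this way never blocks on a lock that the
interrupted cpu may already hold; the cost is that some diagnostics
are simply skipped when the lock cannot be taken.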
124 | ||
125 | --- | |
126 | ||
127 | How is ia64 MCA/INIT different from x86 NMI? | |
128 | ||
129 | * x86 NMI typically gets delivered to one cpu. MCA/INIT gets sent to | |
130 | all cpus. | |
131 | ||
132 | * x86 NMI cannot be nested. MCA/INIT can be nested, to a depth of 2 | |
133 | per cpu. | |
134 | ||
135 | * x86 has a separate struct task which points to one of multiple kernel | |
136 | stacks. ia64 has the struct task embedded in the single kernel | |
137 | stack, so switching stack means switching task. | |
138 | ||
139 | * x86 does not call the BIOS so the NMI handler does not have to worry | |
140 | about any registers having changed. MCA/INIT can occur while the cpu | |
141 | is in PAL in physical mode, with undefined registers and an undefined | |
142 | kernel stack. | |
143 | ||
144 | * i386 backtrace is not very sensitive to whether a process is running | |
145 | or not. ia64 unwind is very, very sensitive to whether a process is | |
146 | running or not. | |
147 | ||
148 | --- | |
149 | ||
150 | What happens when MCA/INIT is delivered while a cpu is running user | |
151 | space code? | |
152 | ||
153 | The user mode registers are stored in the RSE area of the MCA/INIT stack on | |
154 | entry to the OS and are restored from there on return to SAL, so user | |
155 | mode registers are preserved across a recoverable MCA/INIT. Since the | |
156 | OS has no idea what unwind data is available for the user space stack, | |
157 | MCA/INIT never tries to backtrace user space. Which means that the OS | |
158 | does not bother making the user space process look like a blocked task, | |
159 | i.e. the OS does not copy pt_regs and switch_stack to the user space | |
160 | stack. Also the OS has no idea how big the user space RSE and memory | |
161 | stacks are, which makes it too risky to copy the saved state to a user | |
162 | mode stack. | |
163 | ||
164 | --- | |
165 | ||
166 | How do we get a backtrace on the tasks that were running when MCA/INIT | |
167 | was delivered? | |
168 | ||
169 | mca.c::ia64_mca_modify_original_stack(). That identifies and | |
170 | verifies the original kernel stack, copies the dirty registers from | |
171 | the MCA/INIT stack's RSE to the original stack's RSE, copies the | |
172 | skeleton struct pt_regs and switch_stack to the original stack, fills | |
173 | in the skeleton structures from the PAL minstate area and updates the | |
174 | original stack's thread.ksp. That makes the original stack look | |
175 | exactly like any other blocked task, i.e. it now appears to be | |
176 | sleeping. To get a backtrace, just start with thread.ksp for the | |
177 | original task and unwind like any other sleeping task. | |
178 | ||
179 | --- | |
180 | ||
181 | How do we identify the tasks that were running when MCA/INIT was | |
182 | delivered? | |
183 | ||
184 | If the previous task has been verified and converted to a blocked | |
185 | state, then sos->prev_task on the MCA/INIT stack is updated to point to | |
186 | the previous task. You can look at that field in dumps or debuggers. | |
187 | To help distinguish between the handler and the original tasks, | |
188 | handlers have _TIF_MCA_INIT set in thread_info.flags. | |
189 | ||
190 | The sos data is always in the MCA/INIT handler stack, at offset | |
191 | MCA_SOS_OFFSET. You can get that value from mca_asm.h or calculate it | |
192 | as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct | |
193 | ia64_sal_os_state), with 16 byte alignment for all structures. | |
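
As a rough illustration of that calculation, the sketch below works
down from the top of the stack and rounds each structure's start down
to a 16 byte boundary. The rounding direction, the top-down layout
and the function names are assumptions made for the sketch, and the
sizes are passed in as parameters because the real ones depend on the
kernel build. mca_asm.h remains the authoritative source for the
value.

```c
#include <assert.h>
#include <stddef.h>

/* Round x down to a multiple of 16. */
static size_t align16_down(size_t x)
{
    return x & ~(size_t)15;
}

/*
 * Hypothetical model of the offset calculation described above:
 * pt_regs sits at the top of the handler stack, and the SAL to OS
 * state (sos) sits immediately below it, each 16 byte aligned.
 */
static size_t model_sos_offset(size_t stack_size, size_t pt_regs_size,
                               size_t sos_size)
{
    /* Leave slack so the two round-downs cannot underflow. */
    assert(stack_size >= pt_regs_size + sos_size + 32);

    size_t pt_regs_off = align16_down(stack_size - pt_regs_size);
    return align16_down(pt_regs_off - sos_size);
}
```

Given the base address of an MCA/INIT handler stack in a dump, adding
the offset computed this way (with the real sizes from the running
kernel) should land on the sos data, provided the layout assumptions
above hold.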
194 | ||
195 | Also the comm field of the MCA/INIT task is modified to include the pid | |
196 | of the original task, for humans to use. For example, a comm field of | |
197 | 'MCA 12159' means that pid 12159 was running when the MCA was | |
198 | delivered. |
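
A tool that post-processes dumps could recover the original pid from
such a comm field along the following lines. This is a user-space
sketch, not kernel code. model_parse_handler_comm() is a hypothetical
helper, and accepting an "INIT" prefix in addition to "MCA" is an
assumption made by analogy with the example above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse a handler comm such as "MCA 12159".  On success, store the
 * handler kind in kind[] and the original task's pid in *pid.
 * Returns false for anything that does not look like a handler comm.
 */
static bool model_parse_handler_comm(const char *comm, char kind[16],
                                     int *pid)
{
    assert(comm != NULL && kind != NULL && pid != NULL);

    /* Expect one word (at most 15 chars) followed by a decimal pid. */
    if (sscanf(comm, "%15s %d", kind, pid) != 2)
        return false;

    /* "INIT" is assumed here; the text only shows the "MCA" case. */
    return strcmp(kind, "MCA") == 0 || strcmp(kind, "INIT") == 0;
}
```

For the example in the text, parsing 'MCA 12159' yields kind "MCA"
and pid 12159, while an ordinary comm such as 'bash' is rejected.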