
=============================================================
An ad-hoc collection of notes on IA64 MCA and INIT processing
=============================================================

Feel free to update it with notes about any area that is not clear.

---

MCA/INIT are completely asynchronous.  They can occur at any time, when
the OS is in any state, including when one of the cpus is already
holding a spinlock.  Trying to get any lock from MCA/INIT state is
asking for deadlock.  Also the state of structures that are protected
by locks, including linked lists, is indeterminate.
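
As an illustration of the locking constraint above, here is a minimal
user space sketch of a "try once, never spin" access pattern.  It is
not code from mca.c; struct demo_lock and every demo_* function are
hypothetical names made up for this note, and C11 atomics stand in for
a kernel spinlock.  The point is only that a handler which finds the
lock held must give up on that data rather than wait for it.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a kernel spinlock (not spinlock_t). */
struct demo_lock {
	atomic_flag held;
};

/* Make exactly one attempt to take the lock.  Never spins. */
static bool demo_trylock(struct demo_lock *lock)
{
	return !atomic_flag_test_and_set_explicit(&lock->held,
						  memory_order_acquire);
}

static void demo_unlock(struct demo_lock *lock)
{
	atomic_flag_clear_explicit(&lock->held, memory_order_release);
}

/*
 * Hypothetical handler-side read of a lock protected value.
 * If the interrupted context (or another cpu) holds the lock, the
 * data may be half updated, so report failure instead of spinning.
 */
static bool demo_handler_read(struct demo_lock *lock, const int *value,
			      int *out)
{
	if (!demo_trylock(lock))
		return false;	/* lock held: skip this data */
	*out = *value;
	demo_unlock(lock);
	return true;
}
```

A caller treats a false return as "state unknown" and carries on with
whatever diagnostics do not need that structure.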

---

The complicated ia64 MCA process.  All of this is mandated by Intel's
specification for ia64 SAL, error recovery and unwind; it is not as
if we have a choice here.

* MCA occurs on one cpu, usually due to a double bit memory error.
  This is the monarch cpu.

* SAL sends an MCA rendezvous interrupt (which is a normal interrupt)
  to all the other cpus, the slaves.

* Slave cpus that receive the MCA interrupt call down into SAL, where
  they end up spinning disabled while the MCA is being serviced.

* If any slave cpu was already spinning disabled when the MCA occurred
  then it cannot service the MCA interrupt.  SAL waits ~20 seconds then
  sends an unmaskable INIT event to the slave cpus that have not
  already rendezvoused.

* Because MCA/INIT can be delivered at any time, including when the cpu
  is down in PAL in physical mode, the registers at the time of the
  event are _completely_ undefined.  In particular the MCA/INIT
  handlers cannot rely on the thread pointer, because PAL physical mode
  can (and does) modify TP.  It is allowed to do that as long as it
  resets TP on return.  However MCA/INIT events expose us to these PAL
  internal TP changes.  Hence curr_task().

* If an MCA/INIT event occurs while the kernel was running (not user
  space) and the kernel has called PAL then the MCA/INIT handler cannot
  assume that the kernel stack is in a fit state to be used.  Mainly
  because PAL may or may not maintain the stack pointer internally.
  Because the MCA/INIT handlers cannot trust the kernel stack, they
  have to use their own, per-cpu stacks.  The MCA/INIT stacks are
  preformatted with just enough task state to let the relevant handlers
  do their job.

* Unlike most other architectures, the ia64 struct task is embedded in
  the kernel stack[1].  So switching to a new kernel stack means that
  we switch to a new task as well.  Because various bits of the kernel
  assume that current points into the struct task, switching to a new
  stack also means a new value for current.

* Once all slaves have rendezvoused and are spinning disabled, the
  monarch is entered.  The monarch now tries to diagnose the problem
  and decide if it can recover or not.

* Part of the monarch's job is to look at the state of all the other
  tasks.  The only way to do that on ia64 is to call the unwinder,
  as mandated by Intel.

* The starting point for the unwind depends on whether a task is
  running or not.  That is, whether it is on a cpu or is blocked.  The
  monarch has to determine whether or not a task is on a cpu before it
  knows how to start unwinding it.  The tasks that received an MCA or
  INIT event are no longer running, they have been converted to blocked
  tasks.  But (and it's a big but), the cpus that received the MCA
  rendezvous interrupt are still running on their normal kernel stacks!

* To distinguish between these two cases, the monarch must know which
  tasks are on a cpu and which are not.  Hence each slave cpu that
  switches to an MCA/INIT stack registers its new stack using
  set_curr_task(), so the monarch can tell that the _original_ task is
  no longer running on that cpu.  That gives us a decent chance of
  getting a valid backtrace of the _original_ task.

* MCA/INIT can be nested, to a depth of 2 on any cpu.  In the case of a
  nested error, we want diagnostics on the MCA/INIT handler that
  failed, not on the task that was originally running.  Again this
  requires set_curr_task() so the MCA/INIT handlers can register their
  own stack as running on that cpu.  Then a recursive error gets a
  trace of the failing handler's "task".

[1]
    My (Keith Owens) original design called for ia64 to separate its
    struct task and the kernel stacks.  Then the MCA/INIT data would be
    chained stacks like i386 interrupt stacks.  But that required
    radical surgery on the rest of ia64, plus extra hard wired TLB
    entries with its associated performance degradation.  David
    Mosberger vetoed that approach.  Which meant that separate kernel
    stacks meant separate "tasks" for the MCA/INIT handlers.
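
The bookkeeping described in the list above can be modelled in a few
lines of user space C.  This is only a sketch under assumptions: the
per-cpu table, struct demo_task and the demo_* functions are
hypothetical and are not the scheduler's curr_task() or
set_curr_task().  It shows why registering the handler's stack as the
running task lets the monarch treat the original task as blocked, and
why a nested event then reports the failing handler's "task".

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NR_CPUS 4		/* hypothetical cpu count */

/* Hypothetical, heavily reduced stand-in for struct task. */
struct demo_task {
	const char *comm;
};

/* Which task each cpu is currently running, indexed by cpu number. */
static struct demo_task *demo_cpu_curr[DEMO_NR_CPUS];

/* Look up the running task by cpu number, without using TP. */
static struct demo_task *demo_curr_task(int cpu)
{
	assert(cpu >= 0 && cpu < DEMO_NR_CPUS);
	return demo_cpu_curr[cpu];
}

/* Register next as running on cpu and return the previous task. */
static struct demo_task *demo_set_curr_task(int cpu,
					    struct demo_task *next)
{
	struct demo_task *prev = demo_curr_task(cpu);

	demo_cpu_curr[cpu] = next;
	return prev;
}

/* The monarch's question: is this task on any cpu right now? */
static int demo_task_is_running(const struct demo_task *task)
{
	for (int cpu = 0; cpu < DEMO_NR_CPUS; cpu++)
		if (demo_cpu_curr[cpu] == task)
			return 1;
	return 0;
}
```

In this model a handler that switches stacks calls
demo_set_curr_task(cpu, handler_task) and keeps the returned pointer
as the original task.  From then on demo_task_is_running() is false
for the original task, so it would be unwound as a blocked task.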

---

INIT is less complicated than MCA.  Pressing the NMI button or using
the equivalent command on the management console sends INIT to all
cpus.  SAL picks one of the cpus as the monarch and the rest are
slaves.  All the OS INIT handlers are entered at approximately the same
time.  The OS monarch prints the state of all tasks and returns, after
which the slaves return and the system resumes.

At least that is what is supposed to happen.  Alas there are broken
versions of SAL out there.  Some drive all the cpus as monarchs.  Some
drive them all as slaves.  Some drive one cpu as monarch, wait for that
cpu to return from the OS then drive the rest as slaves.  Some versions
of SAL cannot even cope with returning from the OS, they spin inside
SAL on resume.  The OS INIT code has workarounds for some of these
broken SAL symptoms, but some simply cannot be fixed from the OS side.

---

The scheduler hooks used by ia64 (curr_task, set_curr_task) are layer
violations.  Unfortunately MCA/INIT start off as massive layer
violations (can occur at _any_ time) and they build from there.

At least ia64 makes an attempt at recovering from hardware errors, but
it is a difficult problem because of the asynchronous nature of these
errors.  When processing an unmaskable interrupt we sometimes need
special code to cope with our inability to take any locks.

---

How is ia64 MCA/INIT different from x86 NMI?

* x86 NMI typically gets delivered to one cpu.  MCA/INIT gets sent to
  all cpus.

* x86 NMI cannot be nested.  MCA/INIT can be nested, to a depth of 2
  per cpu.

* x86 has a separate struct task which points to one of multiple kernel
  stacks.  ia64 has the struct task embedded in the single kernel
  stack, so switching stack means switching task.

* x86 does not call the BIOS so the NMI handler does not have to worry
  about any registers having changed.  MCA/INIT can occur while the cpu
  is in PAL in physical mode, with undefined registers and an undefined
  kernel stack.

* i386 backtrace is not very sensitive to whether a process is running
  or not.  ia64 unwind is very, very sensitive to whether a process is
  running or not.

---

What happens when MCA/INIT is delivered while a cpu is running user
space code?

The user mode registers are stored in the RSE area of the MCA/INIT
stack on entry to the OS and are restored from there on return to SAL,
so user mode registers are preserved across a recoverable MCA/INIT.  Since the
OS has no idea what unwind data is available for the user space stack,
MCA/INIT never tries to backtrace user space.  Which means that the OS
does not bother making the user space process look like a blocked task,
i.e. the OS does not copy pt_regs and switch_stack to the user space
stack.  Also the OS has no idea how big the user space RSE and memory
stacks are, which makes it too risky to copy the saved state to a user
mode stack.

---

How do we get a backtrace on the tasks that were running when MCA/INIT
was delivered?

mca.c:::ia64_mca_modify_original_stack().  That identifies and
verifies the original kernel stack, copies the dirty registers from
the MCA/INIT stack's RSE to the original stack's RSE, copies the
skeleton struct pt_regs and switch_stack to the original stack, fills
in the skeleton structures from the PAL minstate area and updates the
original stack's thread.ksp.  That makes the original stack look
exactly like any other blocked task, i.e. it now appears to be
sleeping.  To get a backtrace, just start with thread.ksp for the
original task and unwind like any other sleeping task.

---

How do we identify the tasks that were running when MCA/INIT was
delivered?

If the previous task has been verified and converted to a blocked
state, then sos->prev_task on the MCA/INIT stack is updated to point to
the previous task.  You can look at that field in dumps or debuggers.
To help distinguish between the handler and the original tasks,
handlers have _TIF_MCA_INIT set in thread_info.flags.

The sos data is always in the MCA/INIT handler stack, at offset
MCA_SOS_OFFSET.  You can get that value from mca_asm.h or calculate it
as KERNEL_STACK_SIZE - sizeof(struct pt_regs) - sizeof(struct
ia64_sal_os_state), with 16 byte alignment for all structures.
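
The arithmetic in the previous paragraph can be sketched as below.
This is a worked example, not the definition from mca_asm.h: the
function name is hypothetical, the sizes passed in are placeholders
that the caller must supply, and rounding each intermediate offset
down to a 16 byte boundary is an assumption about what "16 byte
alignment for all structures" means.  For real values, read
mca_asm.h.

```c
#include <stddef.h>

/* Round an offset down to a 16 byte boundary. */
#define DEMO_ALIGN16_DOWN(x)	((x) & ~(size_t)15)

/*
 * Hypothetical helper: compute the offset of the sos data inside an
 * MCA/INIT stack from the stack size and the two structure sizes.
 * The structures are laid out from the top of the stack downwards.
 */
static size_t demo_mca_sos_offset(size_t stack_size,
				  size_t pt_regs_size,
				  size_t sos_size)
{
	size_t pt_regs_off = DEMO_ALIGN16_DOWN(stack_size - pt_regs_size);

	return DEMO_ALIGN16_DOWN(pt_regs_off - sos_size);
}
```

Given the offset, a dump reader would add it to the base address of
the handler stack to find the sos data and then follow sos->prev_task.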

Also the comm field of the MCA/INIT task is modified to include the pid
of the original task, for humans to use.  For example, a comm field of
'MCA 12159' means that pid 12159 was running when the MCA was
delivered.
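
For tools that post-process dumps, splitting such a comm field back
into its parts might look like the sketch below.  The function name is
hypothetical, and accepting an "INIT" prefix alongside "MCA" is an
assumption; only the 'MCA 12159' form is described above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical parser for a handler comm field such as "MCA 12159".
 * On success, copies the handler type into kind (at least 5 bytes),
 * stores the original task's pid in *pid and returns 1.
 * Returns 0 for anything that does not match "<type> <pid>".
 */
static int demo_parse_handler_comm(const char *comm, char *kind,
				   int *pid)
{
	char tag[5];
	char extra;
	int value;

	/* Exactly two conversions: a short tag and a decimal pid. */
	if (sscanf(comm, "%4s %d%c", tag, &value, &extra) != 2)
		return 0;
	if (strcmp(tag, "MCA") != 0 && strcmp(tag, "INIT") != 0)
		return 0;

	strcpy(kind, tag);
	*pid = value;
	return 1;
}
```

An ordinary comm such as "bash" is rejected, so the helper can be run
over every task in a dump to pick out the handler tasks.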