===================
Reliable Stacktrace
===================

This document outlines basic information about reliable stacktracing.

.. Table of Contents:

.. contents:: :local:

1. Introduction
===============

The kernel livepatch consistency model relies on accurately identifying which
functions may have live state and therefore may not be safe to patch. One way
to identify which functions are live is to use a stacktrace.

Existing stacktrace code may not always give an accurate picture of all
functions with live state, and best-effort approaches which can be helpful for
debugging are unsound for livepatching. Livepatching depends on architectures
to provide a *reliable* stacktrace which ensures it never omits any live
functions from a trace.


2. Requirements
===============

Architectures must implement one of the reliable stacktrace functions.
Architectures using CONFIG_ARCH_STACKWALK must implement
'arch_stack_walk_reliable', and other architectures must implement
'save_stack_trace_tsk_reliable'.

Principally, the reliable stacktrace function must ensure that either:

* The trace includes all functions that the task may be returned to, and the
  return code is zero to indicate that the trace is reliable.

* The return code is non-zero to indicate that the trace is not reliable.

.. note::
   In some cases it is legitimate to omit specific functions from the trace,
   but all other functions must be reported. These cases are described in
   further detail below.

Secondly, the reliable stacktrace function must be robust to cases where
the stack or other unwind state is corrupt or otherwise unreliable. The
function should attempt to detect such cases and return a non-zero error
code, and should not get stuck in an infinite loop or access memory in
an unsafe way. Specific cases are described in further detail below.

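
As an illustration of these requirements, the following sketch shows the
general shape of an 'arch_stack_walk_reliable' implementation for an
architecture using CONFIG_ARCH_STACKWALK. The 'struct unwind_state' and the
unwind_*() helpers are hypothetical stand-ins for an architecture's private
unwinder rather than existing kernel interfaces; only the calling convention
(return zero if and only if every live function was reported) is prescribed:

.. code-block:: c

   /*
    * Illustrative sketch only: the unwind_*() helpers and struct
    * unwind_state stand in for an architecture's own unwinder. The
    * stack_trace_consume_fn type and the function prototype come from
    * <linux/stacktrace.h>.
    */
   int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
                                void *cookie, struct task_struct *task)
   {
           struct unwind_state state;

           for (unwind_start(&state, task); !unwind_done(&state);
                unwind_next(&state)) {
                   /* Give up rather than guess when a frame cannot be
                    * proven safe to unwind (see section 4).
                    */
                   if (!unwind_is_reliable(&state))
                           return -EINVAL;

                   if (!consume_entry(cookie, unwind_pc(&state)))
                           return -EINVAL;
           }

           /* The trace must terminate at an expected location (see 4.1). */
           if (!unwind_ended_at_expected_location(&state))
                   return -EINVAL;

           return 0;
   }
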

3. Compile-time analysis
========================

To ensure that kernel code can be correctly unwound in all cases,
architectures may need to verify that code has been compiled in a manner
expected by the unwinder. For example, an unwinder may expect that
functions manipulate the stack pointer in a limited way, or that all
functions use specific prologue and epilogue sequences. Architectures
with such requirements should verify the kernel compilation using
objtool.

In some cases, an unwinder may require metadata to correctly unwind.
Where necessary, this metadata should be generated at build time using
objtool.

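
For illustration only, such metadata often amounts to a table of records,
emitted at build time and sorted by address, which the unwinder consults
rather than trusting the stack contents alone. The names and layout below are
hypothetical (a much-simplified analogue of the ORC data objtool generates on
x86), not an existing kernel interface:

.. code-block:: c

   /* Hypothetical build-time unwind metadata: one record per range of
    * instructions, describing how to find the previous frame.
    */
   struct unwind_record {
           unsigned long   ip_start;   /* first address covered */
           unsigned long   ip_end;     /* one past the last address covered */
           int             ra_offset;  /* return address, relative to SP */
           int             fp_offset;  /* caller's frame pointer, relative to SP */
   };

   /* Table emitted at build time, sorted by ip_start. */
   extern const struct unwind_record __unwind_records_start[];
   extern const struct unwind_record __unwind_records_end[];

   static const struct unwind_record *unwind_record_lookup(unsigned long pc)
   {
           const struct unwind_record *r;

           for (r = __unwind_records_start; r < __unwind_records_end; r++)
                   if (pc >= r->ip_start && pc < r->ip_end)
                           return r;

           /* No metadata for this address: treat the trace as unreliable. */
           return NULL;
   }
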

4. Considerations
=================

The unwinding process varies across architectures, their respective procedure
call standards, and kernel configurations. This section describes common
details that architectures should consider.

4.1 Identifying successful termination
--------------------------------------

Unwinding may terminate early for a number of reasons, including:

* Stack or frame pointer corruption.

* Missing unwind support for an uncommon scenario, or a bug in the unwinder.

* Dynamically generated code (e.g. eBPF) or foreign code (e.g. EFI runtime
  services) not following the conventions expected by the unwinder.

To ensure that this does not result in functions being omitted from the trace,
even if not caught by other checks, it is strongly recommended that
architectures verify that a stacktrace ends at an expected location (a sketch
of such a check follows the list below), e.g.

* Within a specific function that is an entry point to the kernel.

* At a specific location on a stack expected for a kernel entry point.

* On a specific stack expected for a kernel entry point (e.g. if the
  architecture has separate task and IRQ stacks).
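
For example, once the unwinder reports that it is done, a final check along
the following lines can be applied; this is the role played by the
unwind_ended_at_expected_location() helper in the sketch in section 2. The
symbol and helper names here are illustrative, each architecture having its
own equivalents:

.. code-block:: c

   /* Illustrative check that an unwind terminated at an expected
    * location rather than stopping early; the names are hypothetical.
    */
   static bool unwind_ended_at_expected_location(struct unwind_state *state)
   {
           /* e.g. the final record is the task's kernel entry function... */
           if (state->pc == (unsigned long)first_kernel_entry_function)
                   return true;

           /* ...or the final frame sits at the known location where the
            * entry code built its first frame on the task stack.
            */
           if (state->fp == task_initial_frame(state->task))
                   return true;

           return false;
   }
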

4.2 Identifying unwindable code
-------------------------------

Unwinding typically relies on code following specific conventions (e.g.
manipulating a frame pointer), but there can be code which may not follow these
conventions and may require special handling in the unwinder, e.g.

* Exception vectors and entry assembly.

* Procedure Linkage Table (PLT) entries and veneer functions.

* Trampoline assembly (e.g. ftrace, kprobes).

* Dynamically generated code (e.g. eBPF, optprobe trampolines).

* Foreign code (e.g. EFI runtime services).

To ensure that such cases do not result in functions being omitted from a
trace, it is strongly recommended that architectures positively identify code
which is known to be reliable to unwind from, and reject unwinding from all
other code.

Kernel code including modules and eBPF can be distinguished from foreign code
using '__kernel_text_address()'. Checking for this also helps to detect stack
corruption.

There are several ways an architecture may identify kernel code which is deemed
unreliable to unwind from (a sketch follows the list below), e.g.

* Placing such code into special linker sections, and rejecting unwinding from
  any code in these sections.

* Identifying specific portions of code using bounds information.
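
For example, a per-address predicate might combine '__kernel_text_address()'
(an existing kernel helper) with bounds checks on a dedicated linker section;
the section symbols below are illustrative of the approach rather than
existing symbols:

.. code-block:: c

   /* Bounds of code which must not be unwound from; these would be
    * provided by the architecture's linker script (illustrative names).
    */
   extern char __unreliable_unwind_text_start[], __unreliable_unwind_text_end[];

   static bool pc_is_safe_to_unwind_from(unsigned long pc)
   {
           /* Reject foreign code (e.g. EFI) and obvious corruption. Note
            * that dynamically generated kernel code such as eBPF passes
            * this check and may need to be rejected separately.
            */
           if (!__kernel_text_address(pc))
                   return false;

           /* Reject kernel code known to break the unwinder's
            * assumptions, e.g. entry assembly grouped into a dedicated
            * linker section.
            */
           if (pc >= (unsigned long)__unreliable_unwind_text_start &&
               pc < (unsigned long)__unreliable_unwind_text_end)
                   return false;

           return true;
   }
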

4.3 Unwinding across interrupts and exceptions
----------------------------------------------

At function call boundaries the stack and other unwind state is expected to be
in a consistent state suitable for reliable unwinding, but this may not be the
case part-way through a function. For example, during a function prologue or
epilogue a frame pointer may be transiently invalid, or during the function
body the return address may be held in an arbitrary general purpose register.
For some architectures this may change at runtime as a result of dynamic
instrumentation.

If an interrupt or other exception is taken while the stack or other unwind
state is in an inconsistent state, it may not be possible to reliably unwind,
and it may not be possible to identify whether such unwinding will be reliable.
See below for examples.

Architectures which cannot identify when it is reliable to unwind such cases
(or where it is never reliable) must reject unwinding across exception
boundaries. Note that it may be reliable to unwind across certain
exceptions (e.g. IRQ) but unreliable to unwind across other exceptions
(e.g. NMI).

Architectures which can identify when it is reliable to unwind such cases (or
have no such cases) should attempt to unwind across exception boundaries, as
doing so can prevent unnecessarily stalling livepatch consistency checks and
permits livepatch transitions to complete more quickly.
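
As a sketch of the conservative option, an unwinder can simply refuse to
continue through an exception boundary that it cannot prove is safe. The
helper names below are hypothetical; the decision logic is the architecture's
own:

.. code-block:: c

   /* Illustrative handling of an exception boundary during a reliable
    * unwind; the helpers are hypothetical.
    */
   static int unwind_exception_boundary(struct unwind_state *state)
   {
           /* If the architecture cannot prove that the interrupted
            * context was at a point where its stack and return address
            * are consistent, the whole trace must be reported as
            * unreliable.
            */
           if (!arch_exception_frame_is_reliable(state))
                   return -EINVAL;

           /* Otherwise continue from the register state saved at entry. */
           unwind_resume_from_regs(state, state->regs);
           return 0;
   }
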

4.4 Rewriting of return addresses
---------------------------------

Some trampolines temporarily modify the return address of a function in order
to intercept when that function returns with a return trampoline, e.g.

* An ftrace trampoline may modify the return address so that function graph
  tracing can intercept returns.

* A kprobes (or optprobes) trampoline may modify the return address so that
  kretprobes can intercept returns.

When this happens, the original return address will not be in its usual
location. For trampolines which are not subject to live patching, where an
unwinder can reliably determine the original return address and no unwind state
is altered by the trampoline, the unwinder may report the original return
address in place of the trampoline and report this as reliable. Otherwise, an
unwinder must report these cases as unreliable.

Special care is required when identifying the original return address, as this
information is not in a consistent location for the duration of the entry
trampoline or return trampoline. For example, considering the x86_64
'return_to_handler' return trampoline:

.. code-block:: none

   SYM_CODE_START(return_to_handler)
           UNWIND_HINT_EMPTY
           subq $24, %rsp

           /* Save the return values */
           movq %rax, (%rsp)
           movq %rdx, 8(%rsp)
           movq %rbp, %rdi

           call ftrace_return_to_handler

           movq %rax, %rdi
           movq 8(%rsp), %rdx
           movq (%rsp), %rax
           addq $24, %rsp
           JMP_NOSPEC rdi
   SYM_CODE_END(return_to_handler)

While the traced function runs, its return address on the stack points to
the start of return_to_handler, and the original return address is stored in
the task's cur_ret_stack. During this time the unwinder can find the return
address using ftrace_graph_ret_addr().
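
For example, an unwinder can substitute the original return address whenever
it finds return_to_handler as a return address. In the sketch below,
ftrace_graph_ret_addr() and return_to_handler are real kernel symbols (the
helper's exact signature may vary between kernel versions), while the
unwind_state fields are illustrative:

.. code-block:: c

   /* Illustrative recovery of the original return address while the
    * traced function is still running.
    */
   static int unwind_recover_return_address(struct unwind_state *state)
   {
           unsigned long pc = state->pc;

           if (pc == (unsigned long)return_to_handler) {
                   /* 'retp' is the stack location the return address was
                    * read from, which lets ftrace_graph_ret_addr() find
                    * the matching entry when traced calls are nested.
                    */
                   pc = ftrace_graph_ret_addr(state->task, &state->graph_idx,
                                              pc, state->retp);
                   if (pc == (unsigned long)return_to_handler)
                           return -EINVAL; /* could not recover the address */
                   state->pc = pc;
           }

           return 0;
   }
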

When the traced function returns to return_to_handler, there is no longer a
return address on the stack, though the original return address is still stored
in the task's cur_ret_stack. Within ftrace_return_to_handler(), the original
return address is removed from cur_ret_stack and is transiently moved
arbitrarily by the compiler before being returned in rax. The return_to_handler
trampoline moves this into rdi before jumping to it.

Architectures might not always be able to unwind such sequences, such as when
ftrace_return_to_handler() has removed the address from cur_ret_stack, and the
location of the return address cannot be reliably determined.

It is recommended that architectures unwind cases where return_to_handler has
not yet been returned to, but architectures are not required to unwind from the
middle of return_to_handler and can report this as unreliable. Architectures
are not required to unwind from other trampolines which modify the return
address.

4.5 Obscuring of return addresses
---------------------------------

Some trampolines do not rewrite the return address in order to intercept
returns, but do transiently clobber the return address or other unwind state.

For example, the x86_64 implementation of optprobes patches the probed function
with a JMP instruction which targets the associated optprobe trampoline. When
the probe is hit, the CPU will branch to the optprobe trampoline, and the
address of the probed function is not held in any register or on the stack.

Similarly, the arm64 implementation of DYNAMIC_FTRACE_WITH_REGS patches traced
functions with the following:

.. code-block:: none

   MOV X9, X30
   BL <trampoline>

The MOV saves the link register (X30) into X9 to preserve the return address
before the BL clobbers the link register and branches to the trampoline. At the
start of the trampoline, the address of the traced function is in X9 rather
than the link register as would usually be the case.

Architectures must ensure that unwinders either reliably unwind such cases or
report the unwinding as unreliable.
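
A minimal conforming approach, mirroring the linker section idea from section
4.2, is to refuse to report a reliable trace whenever the PC lies within such
a trampoline (the symbol names below are illustrative):

.. code-block:: c

   /* Illustrative bounds of trampolines which transiently clobber the
    * return address or other unwind state.
    */
   extern char __trampoline_text_start[], __trampoline_text_end[];

   static bool pc_in_return_clobbering_trampoline(unsigned long pc)
   {
           return pc >= (unsigned long)__trampoline_text_start &&
                  pc < (unsigned long)__trampoline_text_end;
   }

A reliable stacktrace function would then return a non-zero error code when
this predicate holds for any traced program counter.
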

4.6 Link register unreliability
-------------------------------

On some other architectures, 'call' instructions place the return address into
a link register, and 'return' instructions consume the return address from the
link register without modifying the register. On these architectures software
must save the return address to the stack prior to making a function call. Over
the duration of a function call, the return address may be held in the link
register alone, on the stack alone, or in both locations.

Unwinders typically assume the link register is always live, but this
assumption can lead to unreliable stack traces. For example, consider the
following arm64 assembly for a simple function:

.. code-block:: none

   function:
           STP X29, X30, [SP, -16]!
           MOV X29, SP
           BL <other_function>
           LDP X29, X30, [SP], #16
           RET

At entry to the function, the link register (X30) points to the caller, and the
frame pointer (X29) points to the caller's frame including the caller's return
address. The first two instructions create a new stackframe and update the
frame pointer, and at this point the link register and the frame pointer both
describe this function's return address. A trace at this point may describe
this function twice, and if the function return is being traced, the unwinder
may consume two entries from the fgraph return stack rather than one entry.

The BL invokes 'other_function' with the link register pointing to this
function's LDP and the frame pointer pointing to this function's stackframe.
When 'other_function' returns, the link register is left pointing at the LDP,
and so a trace at this point could result in 'function' appearing twice in the
backtrace.
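
To see how the duplication arises, consider a naive frame pointer walk which
also reports the current link register value. The sketch below is illustrative
rather than kernel code:

.. code-block:: c

   /* An arm64-style frame record: each frame saves the caller's frame
    * pointer and the return address into the caller.
    */
   struct frame_record {
           unsigned long fp;
           unsigned long lr;
   };

   static void naive_walk(unsigned long fp, unsigned long lr,
                          void (*report)(unsigned long addr))
   {
           /* Unsound: immediately after the prologue above, 'lr' and the
            * newly created frame record hold the same return address, so
            * it is reported twice (and function graph tracing would
            * consume two fgraph return stack entries for one return).
            */
           report(lr);

           while (fp) {
                   struct frame_record *rec = (struct frame_record *)fp;

                   report(rec->lr);
                   fp = rec->fp;
           }
   }
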

Similarly, a function may deliberately clobber the LR, e.g.

.. code-block:: none

   caller:
           STP X29, X30, [SP, -16]!
           MOV X29, SP
           ADR LR, <callee>
           BLR LR
           LDP X29, X30, [SP], #16
           RET

The ADR places the address of 'callee' into the LR, before the BLR branches to
this address. If a trace is made immediately after the ADR, 'callee' will
appear to be the parent of 'caller', rather than the child.

Due to cases such as the above, it may only be possible to reliably consume a
link register value at a function call boundary. Architectures where this is
the case must reject unwinding across exception boundaries unless they can
reliably identify when the LR or stack value should be used (e.g. using
metadata generated by objtool).