===================
Reliable Stacktrace
===================

This document outlines basic information about reliable stacktracing.

.. Table of Contents:

.. contents:: :local:

1. Introduction
===============

The kernel livepatch consistency model relies on accurately identifying which
functions may have live state and therefore may not be safe to patch. One way
to identify which functions are live is to use a stacktrace.

Existing stacktrace code may not always give an accurate picture of all
functions with live state, and best-effort approaches which can be helpful for
debugging are unsound for livepatching. Livepatching depends on architectures
to provide a *reliable* stacktrace which ensures it never omits any live
functions from a trace.


2. Requirements
===============

Architectures must implement one of the reliable stacktrace functions.
Architectures using CONFIG_ARCH_STACKWALK must implement
'arch_stack_walk_reliable', and other architectures must implement
'save_stack_trace_tsk_reliable'.

Principally, the reliable stacktrace function must ensure that either:

* The trace includes all functions that the task may be returned to, and the
  return code is zero to indicate that the trace is reliable.

* The return code is non-zero to indicate that the trace is not reliable.

.. note::
   In some cases it is legitimate to omit specific functions from the trace,
   but all other functions must be reported. These cases are described in
   further detail below.

Secondly, the reliable stacktrace function must be robust to cases where
the stack or other unwind state is corrupt or otherwise unreliable. The
function should attempt to detect such cases and return a non-zero error
code, and should not get stuck in an infinite loop or access memory in
an unsafe way. Specific cases are described in further detail below.

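
As an illustration of these requirements, the following sketch shows the
general shape of an 'arch_stack_walk_reliable' implementation for an
architecture using CONFIG_ARCH_STACKWALK. The 'struct unwind_state' and the
unwind_*() helpers are hypothetical stand-ins for an architecture's private
unwinder rather than existing kernel interfaces; only the calling convention
(return zero if and only if every live function was reported) is prescribed:

.. code-block:: c

   /*
    * Illustrative sketch only: the unwind_*() helpers and struct
    * unwind_state stand in for an architecture's own unwinder. The
    * stack_trace_consume_fn type and the function prototype come from
    * <linux/stacktrace.h>.
    */
   int arch_stack_walk_reliable(stack_trace_consume_fn consume_entry,
                                void *cookie, struct task_struct *task)
   {
           struct unwind_state state;

           for (unwind_start(&state, task); !unwind_done(&state);
                unwind_next(&state)) {
                   /* Give up rather than guess when a frame cannot be
                    * proven safe to unwind (see section 4).
                    */
                   if (!unwind_is_reliable(&state))
                           return -EINVAL;

                   if (!consume_entry(cookie, unwind_pc(&state)))
                           return -EINVAL;
           }

           /* The trace must terminate at an expected location (see 4.1). */
           if (!unwind_ended_at_expected_location(&state))
                   return -EINVAL;

           return 0;
   }
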

3. Compile-time analysis
========================

To ensure that kernel code can be correctly unwound in all cases,
architectures may need to verify that code has been compiled in a manner
expected by the unwinder. For example, an unwinder may expect that
functions manipulate the stack pointer in a limited way, or that all
functions use specific prologue and epilogue sequences. Architectures
with such requirements should verify the kernel compilation using
objtool.

In some cases, an unwinder may require metadata to correctly unwind.
Where necessary, this metadata should be generated at build time using
objtool.

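
For illustration only, such metadata often amounts to a table of records,
emitted at build time and sorted by address, which the unwinder consults
rather than trusting the stack contents alone. The names and layout below are
hypothetical (a much-simplified analogue of the ORC data objtool generates on
x86), not an existing kernel interface:

.. code-block:: c

   /* Hypothetical build-time unwind metadata: one record per range of
    * instructions, describing how to find the previous frame.
    */
   struct unwind_record {
           unsigned long   ip_start;   /* first address covered */
           unsigned long   ip_end;     /* one past the last address covered */
           int             ra_offset;  /* return address, relative to SP */
           int             fp_offset;  /* caller's frame pointer, relative to SP */
   };

   /* Table emitted at build time, sorted by ip_start. */
   extern const struct unwind_record __unwind_records_start[];
   extern const struct unwind_record __unwind_records_end[];

   static const struct unwind_record *unwind_record_lookup(unsigned long pc)
   {
           const struct unwind_record *r;

           for (r = __unwind_records_start; r < __unwind_records_end; r++)
                   if (pc >= r->ip_start && pc < r->ip_end)
                           return r;

           /* No metadata for this address: treat the trace as unreliable. */
           return NULL;
   }
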

4. Considerations
=================

The unwinding process varies across architectures, their respective procedure
call standards, and kernel configurations. This section describes common
details that architectures should consider.

4.1 Identifying successful termination
--------------------------------------

Unwinding may terminate early for a number of reasons, including:

* Stack or frame pointer corruption.

* Missing unwind support for an uncommon scenario, or a bug in the unwinder.

* Dynamically generated code (e.g. eBPF) or foreign code (e.g. EFI runtime
  services) not following the conventions expected by the unwinder.

To ensure that this does not result in functions being omitted from the trace,
even if not caught by other checks, it is strongly recommended that
architectures verify that a stacktrace ends at an expected location (a sketch
of such a check follows the list below), e.g.

* Within a specific function that is an entry point to the kernel.

* At a specific location on a stack expected for a kernel entry point.

* On a specific stack expected for a kernel entry point (e.g. if the
  architecture has separate task and IRQ stacks).
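
For example, once the unwinder reports that it is done, a final check along
the following lines can be applied; this is the role played by the
unwind_ended_at_expected_location() helper in the sketch in section 2. The
symbol and helper names here are illustrative, each architecture having its
own equivalents:

.. code-block:: c

   /* Illustrative check that an unwind terminated at an expected
    * location rather than stopping early; the names are hypothetical.
    */
   static bool unwind_ended_at_expected_location(struct unwind_state *state)
   {
           /* e.g. the final record is the task's kernel entry function... */
           if (state->pc == (unsigned long)first_kernel_entry_function)
                   return true;

           /* ...or the final frame sits at the known location where the
            * entry code built its first frame on the task stack.
            */
           if (state->fp == task_initial_frame(state->task))
                   return true;

           return false;
   }
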

4.2 Identifying unwindable code
-------------------------------

Unwinding typically relies on code following specific conventions (e.g.
manipulating a frame pointer), but there can be code which may not follow these
conventions and may require special handling in the unwinder, e.g.

* Exception vectors and entry assembly.

* Procedure Linkage Table (PLT) entries and veneer functions.

* Trampoline assembly (e.g. ftrace, kprobes).

* Dynamically generated code (e.g. eBPF, optprobe trampolines).

* Foreign code (e.g. EFI runtime services).

To ensure that such cases do not result in functions being omitted from a
trace, it is strongly recommended that architectures positively identify code
which is known to be reliable to unwind from, and reject unwinding from all
other code.

Kernel code including modules and eBPF can be distinguished from foreign code
using '__kernel_text_address()'. Checking for this also helps to detect stack
corruption.

There are several ways an architecture may identify kernel code which is deemed
unreliable to unwind from (a sketch follows the list below), e.g.

* Placing such code into special linker sections, and rejecting unwinding from
  any code in these sections.

* Identifying specific portions of code using bounds information.
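
For example, a per-address predicate might combine '__kernel_text_address()'
(an existing kernel helper) with bounds checks on a dedicated linker section;
the section symbols below are illustrative of the approach rather than
existing symbols:

.. code-block:: c

   /* Bounds of code which must not be unwound from; these would be
    * provided by the architecture's linker script (illustrative names).
    */
   extern char __unreliable_unwind_text_start[], __unreliable_unwind_text_end[];

   static bool pc_is_safe_to_unwind_from(unsigned long pc)
   {
           /* Reject foreign code (e.g. EFI) and obvious corruption. Note
            * that dynamically generated kernel code such as eBPF passes
            * this check and may need to be rejected separately.
            */
           if (!__kernel_text_address(pc))
                   return false;

           /* Reject kernel code known to break the unwinder's
            * assumptions, e.g. entry assembly grouped into a dedicated
            * linker section.
            */
           if (pc >= (unsigned long)__unreliable_unwind_text_start &&
               pc < (unsigned long)__unreliable_unwind_text_end)
                   return false;

           return true;
   }
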

4.3 Unwinding across interrupts and exceptions
----------------------------------------------

At function call boundaries the stack and other unwind state is expected to be
in a consistent state suitable for reliable unwinding, but this may not be the
case part-way through a function. For example, during a function prologue or
epilogue a frame pointer may be transiently invalid, or during the function
body the return address may be held in an arbitrary general purpose register.
For some architectures this may change at runtime as a result of dynamic
instrumentation.

If an interrupt or other exception is taken while the stack or other unwind
state is in an inconsistent state, it may not be possible to reliably unwind,
and it may not be possible to identify whether such unwinding will be reliable.
See below for examples.

Architectures which cannot identify when it is reliable to unwind such cases
(or where it is never reliable) must reject unwinding across exception
boundaries. Note that it may be reliable to unwind across certain
exceptions (e.g. IRQ) but unreliable to unwind across other exceptions
(e.g. NMI).

Architectures which can identify when it is reliable to unwind such cases (or
have no such cases) should attempt to unwind across exception boundaries, as
doing so can prevent unnecessarily stalling livepatch consistency checks and
permits livepatch transitions to complete more quickly.
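
As a sketch of the conservative option, an unwinder can simply refuse to
continue through an exception boundary that it cannot prove is safe. The
helper names below are hypothetical; the decision logic is the architecture's
own:

.. code-block:: c

   /* Illustrative handling of an exception boundary during a reliable
    * unwind; the helpers are hypothetical.
    */
   static int unwind_exception_boundary(struct unwind_state *state)
   {
           /* If the architecture cannot prove that the interrupted
            * context was at a point where its stack and return address
            * are consistent, the whole trace must be reported as
            * unreliable.
            */
           if (!arch_exception_frame_is_reliable(state))
                   return -EINVAL;

           /* Otherwise continue from the register state saved at entry. */
           unwind_resume_from_regs(state, state->regs);
           return 0;
   }
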

4.4 Rewriting of return addresses
---------------------------------

Some trampolines temporarily modify the return address of a function in order
to intercept when that function returns with a return trampoline, e.g.

* An ftrace trampoline may modify the return address so that function graph
  tracing can intercept returns.

* A kprobes (or optprobes) trampoline may modify the return address so that
  kretprobes can intercept returns.

When this happens, the original return address will not be in its usual
location. For trampolines which are not subject to live patching, where an
unwinder can reliably determine the original return address and no unwind state
is altered by the trampoline, the unwinder may report the original return
address in place of the trampoline and report this as reliable. Otherwise, an
unwinder must report these cases as unreliable.

Special care is required when identifying the original return address, as this
information is not in a consistent location for the duration of the entry
trampoline or return trampoline. For example, considering the x86_64
'return_to_handler' return trampoline:

.. code-block:: none

   SYM_CODE_START(return_to_handler)
           UNWIND_HINT_EMPTY
           subq $24, %rsp

           /* Save the return values */
           movq %rax, (%rsp)
           movq %rdx, 8(%rsp)
           movq %rbp, %rdi

           call ftrace_return_to_handler

           movq %rax, %rdi
           movq 8(%rsp), %rdx
           movq (%rsp), %rax
           addq $24, %rsp
           JMP_NOSPEC rdi
   SYM_CODE_END(return_to_handler)

While the traced function runs, its return address on the stack points to
the start of return_to_handler, and the original return address is stored in
the task's cur_ret_stack. During this time the unwinder can find the return
address using ftrace_graph_ret_addr().
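
For example, an unwinder can substitute the original return address whenever
it finds return_to_handler as a return address. In the sketch below,
ftrace_graph_ret_addr() and return_to_handler are real kernel symbols (the
helper's exact signature may vary between kernel versions), while the
unwind_state fields are illustrative:

.. code-block:: c

   /* Illustrative recovery of the original return address while the
    * traced function is still running.
    */
   static int unwind_recover_return_address(struct unwind_state *state)
   {
           unsigned long pc = state->pc;

           if (pc == (unsigned long)return_to_handler) {
                   /* 'retp' is the stack location the return address was
                    * read from, which lets ftrace_graph_ret_addr() find
                    * the matching entry when traced calls are nested.
                    */
                   pc = ftrace_graph_ret_addr(state->task, &state->graph_idx,
                                              pc, state->retp);
                   if (pc == (unsigned long)return_to_handler)
                           return -EINVAL; /* could not recover the address */
                   state->pc = pc;
           }

           return 0;
   }
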

When the traced function returns to return_to_handler, there is no longer a
return address on the stack, though the original return address is still stored
in the task's cur_ret_stack. Within ftrace_return_to_handler(), the original
return address is removed from cur_ret_stack and is transiently moved
arbitrarily by the compiler before being returned in rax. The return_to_handler
trampoline moves this into rdi before jumping to it.

Architectures might not always be able to unwind such sequences, such as when
ftrace_return_to_handler() has removed the address from cur_ret_stack, and the
location of the return address cannot be reliably determined.

It is recommended that architectures unwind cases where return_to_handler has
not yet been returned to, but architectures are not required to unwind from the
middle of return_to_handler and can report this as unreliable. Architectures
are not required to unwind from other trampolines which modify the return
address.

4.5 Obscuring of return addresses
---------------------------------

Some trampolines do not rewrite the return address in order to intercept
returns, but do transiently clobber the return address or other unwind state.

For example, the x86_64 implementation of optprobes patches the probed function
with a JMP instruction which targets the associated optprobe trampoline. When
the probe is hit, the CPU will branch to the optprobe trampoline, and the
address of the probed function is not held in any register or on the stack.

Similarly, the arm64 implementation of DYNAMIC_FTRACE_WITH_REGS patches traced
functions with the following:

.. code-block:: none

   MOV X9, X30
   BL <trampoline>

The MOV saves the link register (X30) into X9 to preserve the return address
before the BL clobbers the link register and branches to the trampoline. At the
start of the trampoline, the address of the traced function is in X9 rather
than the link register as would usually be the case.

Architectures must ensure that unwinders either reliably unwind such cases or
report the unwinding as unreliable.
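
A minimal conforming approach, mirroring the linker section idea from section
4.2, is to refuse to report a reliable trace whenever the PC lies within such
a trampoline (the symbol names below are illustrative):

.. code-block:: c

   /* Illustrative bounds of trampolines which transiently clobber the
    * return address or other unwind state.
    */
   extern char __trampoline_text_start[], __trampoline_text_end[];

   static bool pc_in_return_clobbering_trampoline(unsigned long pc)
   {
           return pc >= (unsigned long)__trampoline_text_start &&
                  pc < (unsigned long)__trampoline_text_end;
   }

A reliable stacktrace function would then return a non-zero error code when
this predicate holds for any traced program counter.
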

4.6 Link register unreliability
-------------------------------

On some other architectures, 'call' instructions place the return address into
a link register, and 'return' instructions consume the return address from the
link register without modifying the register. On these architectures software
must save the return address to the stack prior to making a function call. Over
the duration of a function call, the return address may be held in the link
register alone, on the stack alone, or in both locations.

Unwinders typically assume the link register is always live, but this
assumption can lead to unreliable stack traces. For example, consider the
following arm64 assembly for a simple function:

.. code-block:: none

   function:
           STP X29, X30, [SP, -16]!
           MOV X29, SP
           BL <other_function>
           LDP X29, X30, [SP], #16
           RET

At entry to the function, the link register (X30) points to the caller, and the
frame pointer (X29) points to the caller's frame including the caller's return
address. The first two instructions create a new stackframe and update the
frame pointer, and at this point the link register and the frame pointer both
describe this function's return address. A trace at this point may describe
this function twice, and if the function return is being traced, the unwinder
may consume two entries from the fgraph return stack rather than one entry.

The BL invokes 'other_function' with the link register pointing to this
function's LDP and the frame pointer pointing to this function's stackframe.
When 'other_function' returns, the link register is left pointing at the LDP,
and so a trace at this point could result in 'function' appearing twice in the
backtrace.
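
To see how the duplication arises, consider a naive frame pointer walk which
also reports the current link register value. The sketch below is illustrative
rather than kernel code:

.. code-block:: c

   /* An arm64-style frame record: each frame saves the caller's frame
    * pointer and the return address into the caller.
    */
   struct frame_record {
           unsigned long fp;
           unsigned long lr;
   };

   static void naive_walk(unsigned long fp, unsigned long lr,
                          void (*report)(unsigned long addr))
   {
           /* Unsound: immediately after the prologue above, 'lr' and the
            * newly created frame record hold the same return address, so
            * it is reported twice (and function graph tracing would
            * consume two fgraph return stack entries for one return).
            */
           report(lr);

           while (fp) {
                   struct frame_record *rec = (struct frame_record *)fp;

                   report(rec->lr);
                   fp = rec->fp;
           }
   }
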

Similarly, a function may deliberately clobber the LR, e.g.

.. code-block:: none

   caller:
           STP X29, X30, [SP, -16]!
           MOV X29, SP
           ADR LR, <callee>
           BLR LR
           LDP X29, X30, [SP], #16
           RET

The ADR places the address of 'callee' into the LR, before the BLR branches to
this address. If a trace is made immediately after the ADR, 'callee' will
appear to be the parent of 'caller', rather than the child.

Due to cases such as the above, it may only be possible to reliably consume a
link register value at a function call boundary. Architectures where this is
the case must reject unwinding across exception boundaries unless they can
reliably identify when the LR or stack value should be used (e.g. using
metadata generated by objtool).