Commit | Line | Data |
---|---|---|
ac2b4687 CD |
1 | .. SPDX-License-Identifier: GPL-2.0 |
2 | ||
3 | ============= | |
4 | Kernel Stacks | |
5 | ============= | |
6 | ||
d724a9a5 | 7 | Kernel stacks on 64-bit x86 |
ac2b4687 | 8 | =========================== |
d724a9a5 | 9 | |
352f7bae AK |
10 | Most of the text from Keith Owens, hacked by AK |
11 | ||
12 | x86_64 page size (PAGE_SIZE) is 4K. | |
13 | ||
14 | Like all other architectures, x86_64 has a kernel stack for every | |
15 | active thread. These thread stacks are THREAD_SIZE (2*PAGE_SIZE) big. | |
16 | These stacks contain useful data as long as a thread is alive or a | |
17 | zombie. While the thread is in user space the kernel stack is empty | |
18 | except for the thread_info structure at the bottom. | |
19 | ||
20 | In addition to the per thread stacks, there are specialized stacks | |
57d30772 RD |
21 | associated with each CPU. These stacks are only used while the kernel |
22 | is in control on that CPU; when a CPU returns to user space the | |
23 | specialized stacks contain no useful data. The main CPU stacks are: | |
352f7bae | 24 | |
0fe0965e | 25 | * Interrupt stack. IRQ_STACK_SIZE |
352f7bae AK |
26 | |
27 | Used for external hardware interrupts. If this is the first external | |
28 | hardware interrupt (i.e. not a nested hardware interrupt) then the | |
29 | kernel switches from the current task to the interrupt stack. Like | |
7974891d CH |
30 | the split thread and interrupt stacks on i386, this gives more room |
31 | for kernel interrupt processing without having to increase the size | |
32 | of every per thread stack. | |
352f7bae AK |
33 | |
34 | The interrupt stack is also used when processing a softirq. | |
35 | ||
36 | Switching to the kernel interrupt stack is done by software based on a | |
37 | per CPU interrupt nest counter. This is needed because x86-64 "IST" | |
38 | hardware stacks cannot nest without races. | |
39 | ||
40 | x86_64 also has a feature which is not available on i386, the ability | |
41 | to automatically switch to a new stack for designated events such as | |
42 | double fault or NMI, which makes it easier to handle these unusual | |
43 | events on x86_64. This feature is called the Interrupt Stack Table | |
57d30772 RD |
44 | (IST). There can be up to 7 IST entries per CPU. The IST code is an |
45 | index into the Task State Segment (TSS). The IST entries in the TSS | |
46 | point to dedicated stacks; each stack can be a different size. | |
352f7bae | 47 | |
57d30772 | 48 | An IST is selected by a non-zero value in the IST field of an |
352f7bae AK |
49 | interrupt-gate descriptor. When an interrupt occurs and the hardware |
50 | loads such a descriptor, the hardware automatically sets the new stack | |
51 | pointer based on the IST value, then invokes the interrupt handler. If | |
48e08d0f AL |
52 | the interrupt came from user mode, then the interrupt handler prologue |
53 | will switch back to the per-thread stack. If software wants to allow | |
54 | nested IST interrupts then the handler must adjust the IST values on | |
55 | entry to and exit from the interrupt handler. (This is occasionally | |
56 | done, e.g. for debug exceptions.) | |
352f7bae AK |
57 | |
58 | Events with different IST codes (i.e. with different stacks) can be | |
59 | nested. For example, a debug interrupt can safely be interrupted by an | |
60 | NMI. arch/x86_64/kernel/entry.S::paranoidentry adjusts the stack | |
61 | pointers on entry to and exit from all IST events, in theory allowing | |
62 | IST events with the same code to be nested. However, in most cases, the | |
63 | stack size allocated to an IST assumes no nesting for the same code. | |
64 | If that assumption is ever broken then the stacks will become corrupt. | |
65 | ||
ac2b4687 | 66 | The currently assigned IST stacks are: |
352f7bae | 67 | |
8f34c5b5 | 68 | * ESTACK_DF. EXCEPTION_STKSZ (PAGE_SIZE). |
352f7bae AK |
69 | |
70 | Used for interrupt 8 - Double Fault Exception (#DF). | |
71 | ||
57d30772 RD |
72 | Invoked when handling one exception causes another exception. Happens |
73 | when the kernel is very confused (e.g. kernel stack pointer corrupt). | |
74 | Using a separate stack allows the kernel to recover from it well enough | |
75 | in many cases to still output an oops. | |
352f7bae | 76 | |
8f34c5b5 | 77 | * ESTACK_NMI. EXCEPTION_STKSZ (PAGE_SIZE). |
352f7bae AK |
78 | |
79 | Used for non-maskable interrupts (NMI). | |
80 | ||
81 | NMI can be delivered at any time, including when the kernel is in the | |
82 | middle of switching stacks. Using IST for NMI events avoids making | |
83 | assumptions about the previous state of the kernel stack. | |
84 | ||
2a594d4c | 85 | * ESTACK_DB. EXCEPTION_STKSZ (PAGE_SIZE). |
352f7bae AK |
86 | |
87 | Used for hardware debug interrupts (interrupt 1) and for software | |
88 | debug interrupts (INT3). | |
89 | ||
90 | When debugging a kernel, debug interrupts (both hardware and | |
91 | software) can occur at any time. Using IST for these interrupts | |
92 | avoids making assumptions about the previous state of the kernel | |
93 | stack. | |
94 | ||
2a594d4c TG |
95 | To handle nested #DB correctly there exist two instances of DB stacks. On |
96 | #DB entry the IST stack pointer for #DB is switched to the second instance | |
97 | so a nested #DB starts from a clean stack. The nested #DB switches | |
98 | the IST stack pointer to a guard hole to catch triple nesting. | |
99 | ||
8f34c5b5 | 100 | * ESTACK_MCE. EXCEPTION_STKSZ (PAGE_SIZE). |
352f7bae AK |
101 | |
102 | Used for interrupt 18 - Machine Check Exception (#MC). | |
103 | ||
104 | MCE can be delivered at any time, including when the kernel is in the | |
105 | middle of switching stacks. Using IST for MCE events avoids making | |
106 | assumptions about the previous state of the kernel stack. | |
107 | ||
108 | For more details see the Intel IA32 or AMD AMD64 architecture manuals. | |
113b5e37 BP |
109 | |
110 | ||
111 | Printing backtraces on x86 | |
ac2b4687 | 112 | ========================== |
113b5e37 BP |
113 | |
114 | The question about the '?' preceding function names in an x86 stack trace | |
115 | keeps popping up, so here's an in-depth explanation. It helps if the reader | |
116 | stares at print_context_stack() and the whole machinery in and around | |
117 | arch/x86/kernel/dumpstack.c. | |
118 | ||
119 | Adapted from Ingo's mail, Message-ID: <20150521101614.GA10889@gmail.com>: | |
120 | ||
121 | We always scan the full kernel stack for return addresses stored on | |
ac2b4687 | 122 | the kernel stack(s) [1]_, from stack top to stack bottom, and print out |
113b5e37 BP |
123 | anything that 'looks like' a kernel text address. |
124 | ||
125 | If it fits into the frame pointer chain, we print it without a question | |
126 | mark, knowing that it's part of the real backtrace. | |
127 | ||
128 | If the address does not fit into our expected frame pointer chain we | |
129 | still print it, but we print a '?'. It can mean two things: | |
130 | ||
131 | - either the address is not part of the call chain: it's just stale | |
132 | values on the kernel stack, from earlier function calls. This is | |
133 | the common case. | |
134 | ||
135 | - or it is part of the call chain, but the frame pointer was not set | |
136 | up properly within the function, so we don't recognize it. | |
137 | ||
138 | This way we will always print out the real call chain (plus a few more | |
139 | entries), regardless of whether the frame pointer was set up correctly | |
140 | or not - but in most cases we'll get the call chain right as well. The | |
141 | entries printed are strictly in stack order, so you can deduce more | |
142 | information from that as well. | |
143 | ||
144 | The most important property of this method is that we _never_ lose | |
145 | information: we always strive to print _all_ addresses on the stack(s) | |
146 | that look like kernel text addresses, so if debug information is wrong, | |
147 | we still print out the real call chain as well - just with more question | |
148 | marks than ideal. | |
149 | ||
ac2b4687 CD |
150 | .. [1] For things like IRQ and IST stacks, we also scan those stacks, in |
151 | the right order, and try to cross from one stack into another | |
152 | reconstructing the call chain. This works most of the time. |