Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

* Lockdep
* RCU / Context tracking
* Preemption counter
* Tracing
* Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------

Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
          handle_entry(); // <-- must be 'noinstr' or '__always_inline'
          ...

          instrumentation_begin();
          handle_context(); // <-- instrumentable code
          instrumentation_end();

          ...
          handle_exit(); // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.

Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful for protecting, for example, state-switching
code that would malfunction if instrumented.
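
A minimal sketch of this pattern, with purely illustrative function names:

.. code-block:: c

  /* Fragile state switch; instrumenting this would corrupt the state it
   * is switching. Illustrative name, not an existing kernel function. */
  noinstr void fragile_state_switch(void)
  {
          ...
  }

  /* Regular, fully instrumentable kernel code. */
  void instrumentable_caller(void)
  {
          do_traceable_work();
          fragile_state_switch();   /* calling into noinstr code is fine */
          do_more_traceable_work();
  }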

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.

Syscalls
--------

Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
          arch_syscall_enter(regs);
          nr = syscall_enter_from_user_mode(regs, nr);

          instrumentation_begin();
          if (!invoke_syscall(regs, nr) && nr != -1)
                  result_reg(regs) = __sys_ni_syscall(regs);
          instrumentation_end();

          syscall_exit_to_user_mode(regs);
  }

syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

* Lockdep
* RCU / Context tracking
* Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc. After
that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:

* Tracing
* RCU / Context tracking
* Lockdep
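
Conceptually, the two state transitions mirror each other. A simplified
sketch, loosely modeled on the generic implementation (helper names vary
across kernel versions; instrumentation markers are omitted):

.. code-block:: c

  /* Invoked first by syscall_enter_from_user_mode(). */
  void enter_from_user_mode(struct pt_regs *regs)
  {
          lockdep_hardirqs_off(CALLER_ADDR0);    /* Lockdep */
          user_exit_irqoff();                    /* RCU / context tracking */
          trace_hardirqs_off_finish();           /* Tracing */
  }

  /* Invoked last by syscall_exit_to_user_mode(), reverse order. */
  void exit_to_user_mode(void)
  {
          trace_hardirqs_on_prepare();           /* Tracing */
          user_enter_irqoff();                   /* RCU / context tracking */
          lockdep_hardirqs_on(CALLER_ADDR0);     /* Lockdep */
  }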

syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine-grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit.
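
A sketch of such a split, assuming the fine-grained work helpers
syscall_enter_from_user_mode_work() and syscall_exit_to_user_mode_work();
the arch_extra_*_work() names are purely illustrative and interrupt
enabling/disabling is omitted for brevity:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
          /* Must be first: establishes Lockdep/RCU/tracing state. */
          enter_from_user_mode(regs);

          instrumentation_begin();
          arch_extra_entry_work(regs);      /* illustrative extra arch work */
          nr = syscall_enter_from_user_mode_work(regs, nr);

          if (!invoke_syscall(regs, nr) && nr != -1)
                  result_reg(regs) = __sys_ni_syscall(regs);

          syscall_exit_to_user_mode_work(regs);
          arch_extra_exit_work(regs);       /* illustrative extra arch work */
          instrumentation_end();

          /* Must be last: reverses the state transitions. */
          exit_to_user_mode();
  }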

Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.

KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for guests at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
the work handled on return to user space.
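
A sketch of such a loop, assuming the xfer_to_guest_mode_work_pending()
helper alongside xfer_to_guest_mode_handle_work(); run_guest() stands in
for the architecture-specific world switch:

.. code-block:: c

  int vcpu_run(struct kvm_vcpu *vcpu)
  {
          int ret = 0;

          for (;;) {
                  /* Handle pending task work before entering the guest. */
                  if (xfer_to_guest_mode_work_pending()) {
                          ret = xfer_to_guest_mode_handle_work(vcpu);
                          if (ret)
                                  break;
                  }

                  local_irq_disable();
                  /* Same state ordering as exit_to_user_mode(). */
                  kvm_guest_enter_irqoff();

                  run_guest(vcpu);

                  /* Same state ordering as enter_from_user_mode(). */
                  kvm_guest_exit_irqoff();
                  local_irq_enable();
          }
          return ret;
  }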

Do not nest KVM entry/exit transitions because doing so is nonsensical.

Interrupts and regular exceptions
---------------------------------

Interrupt entry and exit handling is slightly more complex than syscall
and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.
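
Conceptually, irqentry_enter() distinguishes these cases as follows. This is
a simplified sketch, loosely based on the generic implementation; the
instrumentation markers are omitted and the RCU entry function is named
ct_irq_enter() in newer kernels:

.. code-block:: c

  irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs)
  {
          irqentry_state_t ret = { .exit_rcu = false };

          if (user_mode(regs)) {
                  /* Same handling as syscall entry. */
                  enter_from_user_mode(regs);
                  return ret;
          }

          if (is_idle_task(current)) {
                  /* Idle task: RCU is not watching yet, so update it. */
                  lockdep_hardirqs_off(CALLER_ADDR0);
                  rcu_irq_enter();
                  trace_hardirqs_off_finish();
                  ret.exit_rcu = true;
                  return ret;
          }

          /* Kernel context: RCU is already watching. */
          lockdep_hardirqs_off(CALLER_ADDR0);
          trace_hardirqs_off_finish();
          return ret;
  }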

The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
          arch_interrupt_enter(regs);
          state = irqentry_enter(regs);

          instrumentation_begin();

          irq_enter_rcu();
          invoke_irq_handler(regs, nr);
          irq_exit_rcu();

          instrumentation_end();

          irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within an
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked, in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.
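
The resulting ordering can be sketched as follows; this is simplified and
loosely based on the generic implementation, with accounting and NOHZ
details reduced to comments:

.. code-block:: c

  void irq_enter_rcu(void)
  {
          /* Makes in_hardirq() return true from here on. */
          preempt_count_add(HARDIRQ_OFFSET);
          /* NOHZ tick state and interrupt time accounting follow. */
          ...
  }

  void irq_exit_rcu(void)
  {
          /* Interrupt time accounting first. */
          ...
          /* Drop HARDIRQ_OFFSET before softirqs run in BH context. */
          preempt_count_sub(HARDIRQ_OFFSET);
          if (!in_interrupt() && local_softirq_pending())
                  invoke_softirq();
          /* NOHZ tick state handling last. */
          ...
  }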

Even though interrupt handlers are expected to run with local interrupts
disabled, interrupt nesting is common from an entry/exit perspective. For
example, softirq handling happens within an irqentry_{enter,exit}() block with
local interrupts enabled. Also, although uncommon, nothing prevents an
interrupt handler from re-enabling interrupts.

Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
the entry code is shared between the two.

NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

* Preemption counter
* Lockdep
* RCU / Context tracking
* Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.
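
A simplified sketch of that ordering, loosely based on the generic
irqentry_nmi_enter(); the context-tracking call is named ct_nmi_enter() or
rcu_nmi_enter() depending on the kernel version:

.. code-block:: c

  irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
  {
          irqentry_state_t irq_state;

          irq_state.lockdep = lockdep_hardirqs_enabled();

          /* First: untraced preemption count update makes in_nmi() true. */
          __preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);

          lockdep_hardirqs_off(CALLER_ADDR0);       /* Lockdep */
          lockdep_hardirq_enter();
          ct_nmi_enter();                           /* RCU / context tracking */

          instrumentation_begin();
          ftrace_nmi_enter();                       /* Tracing, now safe */
          instrumentation_end();

          return irq_state;
  }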

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
          arch_nmi_enter(regs);
          state = irqentry_nmi_enter(regs);

          instrumentation_begin();
          nmi_handler(regs);
          instrumentation_end();

          irqentry_nmi_exit(regs, state);
  }

and for a debug exception, for example, it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
          arch_nmi_enter(regs);

          debug_regs = save_debug_regs();

          if (user_mode(regs)) {
                  state = irqentry_enter(regs);

                  instrumentation_begin();
                  user_mode_debug_handler(regs, debug_regs);
                  instrumentation_end();

                  irqentry_exit(regs, state);
          } else {
                  state = irqentry_nmi_enter(regs);

                  instrumentation_begin();
                  kernel_mode_debug_handler(regs, debug_regs);
                  instrumentation_end();

                  irqentry_nmi_exit(regs, state);
          }
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.

NMIs can happen in any context; for example, an NMI-like exception can be
triggered while handling an NMI. So NMI entry code has to be reentrant and
state updates need to handle nesting.