Entry/exit handling for exceptions, interrupts, syscalls and KVM
================================================================

All transitions between execution domains require state updates which are
subject to strict ordering constraints. State updates are required for the
following:

* Lockdep
* RCU / Context tracking
* Preemption counter
* Tracing
* Time accounting

The update order depends on the transition type and is explained below in
the transition type sections: `Syscalls`_, `KVM`_, `Interrupts and regular
exceptions`_, `NMI and NMI-like exceptions`_.

Non-instrumentable code - noinstr
---------------------------------

Most instrumentation facilities depend on RCU, so instrumentation is prohibited
for entry code before RCU starts watching and exit code after RCU stops
watching. In addition, many architectures must save and restore register state,
which means that (for example) a breakpoint in the breakpoint entry code would
overwrite the debug registers of the initial breakpoint.

Such code must be marked with the 'noinstr' attribute, placing that code into a
special section inaccessible to instrumentation and debug facilities. Some
functions are partially instrumentable, which is handled by marking them
noinstr and using instrumentation_begin() and instrumentation_end() to flag the
instrumentable ranges of code:

.. code-block:: c

  noinstr void entry(void)
  {
          handle_entry(); // <-- must be 'noinstr' or '__always_inline'
          ...

          instrumentation_begin();
          handle_context(); // <-- instrumentable code
          instrumentation_end();

          ...
          handle_exit(); // <-- must be 'noinstr' or '__always_inline'
  }

This allows verification of the 'noinstr' restrictions via objtool on
supported architectures.

Invoking non-instrumentable functions from instrumentable context has no
restrictions and is useful for protecting, for example, state-switching
code that would malfunction if instrumented.
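
A minimal sketch of this pattern, with purely illustrative function names:

.. code-block:: c

  /* Fragile state switch; instrumenting this would corrupt the state it
   * is switching. Illustrative name, not an existing kernel function. */
  noinstr void fragile_state_switch(void)
  {
          ...
  }

  /* Regular, fully instrumentable kernel code. */
  void instrumentable_caller(void)
  {
          do_traceable_work();
          fragile_state_switch();   /* calling into noinstr code is fine */
          do_more_traceable_work();
  }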

All non-instrumentable entry/exit code sections before and after the RCU
state transitions must run with interrupts disabled.

Syscalls
--------

Syscall-entry code starts in assembly code and calls out into low-level C code
after establishing low-level architecture-specific state and stack frames. This
low-level C code must not be instrumented. A typical syscall handling function
invoked from low-level assembly code looks like this:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
          arch_syscall_enter(regs);
          nr = syscall_enter_from_user_mode(regs, nr);

          instrumentation_begin();
          if (!invoke_syscall(regs, nr) && nr != -1)
                  result_reg(regs) = __sys_ni_syscall(regs);
          instrumentation_end();

          syscall_exit_to_user_mode(regs);
  }

syscall_enter_from_user_mode() first invokes enter_from_user_mode() which
establishes state in the following order:

* Lockdep
* RCU / Context tracking
* Tracing

and then invokes the various entry work functions like ptrace, seccomp, audit,
syscall tracing, etc. After all that is done, the instrumentable invoke_syscall
function can be invoked. The instrumentable code section then ends, after which
syscall_exit_to_user_mode() is invoked.

syscall_exit_to_user_mode() handles all work which needs to be done before
returning to user space like tracing, audit, signals, task work etc. After
that it invokes exit_to_user_mode() which again handles the state
transition in the reverse order:

* Tracing
* RCU / Context tracking
* Lockdep
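
Conceptually, the two state transitions mirror each other. A simplified
sketch, loosely modeled on the generic implementation (helper names vary
across kernel versions; instrumentation markers are omitted):

.. code-block:: c

  /* Invoked first by syscall_enter_from_user_mode(). */
  void enter_from_user_mode(struct pt_regs *regs)
  {
          lockdep_hardirqs_off(CALLER_ADDR0);    /* Lockdep */
          user_exit_irqoff();                    /* RCU / context tracking */
          trace_hardirqs_off_finish();           /* Tracing */
  }

  /* Invoked last by syscall_exit_to_user_mode(), reverse order. */
  void exit_to_user_mode(void)
  {
          trace_hardirqs_on_prepare();           /* Tracing */
          user_enter_irqoff();                   /* RCU / context tracking */
          lockdep_hardirqs_on(CALLER_ADDR0);     /* Lockdep */
  }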

syscall_enter_from_user_mode() and syscall_exit_to_user_mode() are also
available as fine-grained subfunctions in cases where the architecture code
has to do extra work between the various steps. In such cases it has to
ensure that enter_from_user_mode() is called first on entry and
exit_to_user_mode() is called last on exit.
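
A sketch of such a split, assuming the fine-grained work helpers
syscall_enter_from_user_mode_work() and syscall_exit_to_user_mode_work();
the arch_extra_*_work() names are purely illustrative and interrupt
enabling/disabling is omitted for brevity:

.. code-block:: c

  noinstr void syscall(struct pt_regs *regs, int nr)
  {
          /* Must be first: establishes Lockdep/RCU/tracing state. */
          enter_from_user_mode(regs);

          instrumentation_begin();
          arch_extra_entry_work(regs);      /* illustrative extra arch work */
          nr = syscall_enter_from_user_mode_work(regs, nr);

          if (!invoke_syscall(regs, nr) && nr != -1)
                  result_reg(regs) = __sys_ni_syscall(regs);

          syscall_exit_to_user_mode_work(regs);
          arch_extra_exit_work(regs);       /* illustrative extra arch work */
          instrumentation_end();

          /* Must be last: reverses the state transitions. */
          exit_to_user_mode();
  }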

Do not nest syscalls. Nested syscalls will cause RCU and/or context tracking
to print a warning.

KVM
---

Entering or exiting guest mode is very similar to syscalls. From the host
kernel point of view the CPU goes off into user space when entering the
guest and returns to the kernel on exit.

kvm_guest_enter_irqoff() is a KVM-specific variant of exit_to_user_mode()
and kvm_guest_exit_irqoff() is the KVM variant of enter_from_user_mode().
The state operations have the same ordering.

Task work handling is done separately for guests at the boundary of the
vcpu_run() loop via xfer_to_guest_mode_handle_work() which is a subset of
the work handled on return to user space.
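
A sketch of such a loop, assuming the xfer_to_guest_mode_work_pending()
helper alongside xfer_to_guest_mode_handle_work(); run_guest() stands in
for the architecture-specific world switch:

.. code-block:: c

  int vcpu_run(struct kvm_vcpu *vcpu)
  {
          int ret = 0;

          for (;;) {
                  /* Handle pending task work before entering the guest. */
                  if (xfer_to_guest_mode_work_pending()) {
                          ret = xfer_to_guest_mode_handle_work(vcpu);
                          if (ret)
                                  break;
                  }

                  local_irq_disable();
                  /* Same state ordering as exit_to_user_mode(). */
                  kvm_guest_enter_irqoff();

                  run_guest(vcpu);

                  /* Same state ordering as enter_from_user_mode(). */
                  kvm_guest_exit_irqoff();
                  local_irq_enable();
          }
          return ret;
  }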

Do not nest KVM entry/exit transitions because doing so is nonsensical.

Interrupts and regular exceptions
---------------------------------

Interrupt entry and exit handling is slightly more complex than syscall
and KVM transitions.

If an interrupt is raised while the CPU executes in user space, the entry
and exit handling is exactly the same as for syscalls.

If the interrupt is raised while the CPU executes in kernel space the entry and
exit handling is slightly different. RCU state is only updated when the
interrupt is raised in the context of the CPU's idle task. Otherwise, RCU will
already be watching. Lockdep and tracing have to be updated unconditionally.

irqentry_enter() and irqentry_exit() provide the implementation for this.
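
Conceptually, irqentry_enter() distinguishes these cases as follows. This is
a simplified sketch, loosely based on the generic implementation; the
instrumentation markers are omitted and the RCU entry function is named
ct_irq_enter() in newer kernels:

.. code-block:: c

  irqentry_state_t noinstr irqentry_enter(struct pt_regs *regs)
  {
          irqentry_state_t ret = { .exit_rcu = false };

          if (user_mode(regs)) {
                  /* Same handling as syscall entry. */
                  enter_from_user_mode(regs);
                  return ret;
          }

          if (is_idle_task(current)) {
                  /* Idle task: RCU is not watching yet, so update it. */
                  lockdep_hardirqs_off(CALLER_ADDR0);
                  rcu_irq_enter();
                  trace_hardirqs_off_finish();
                  ret.exit_rcu = true;
                  return ret;
          }

          /* Kernel context: RCU is already watching. */
          lockdep_hardirqs_off(CALLER_ADDR0);
          trace_hardirqs_off_finish();
          return ret;
  }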

The architecture-specific part looks similar to syscall handling:

.. code-block:: c

  noinstr void interrupt(struct pt_regs *regs, int nr)
  {
          arch_interrupt_enter(regs);
          state = irqentry_enter(regs);

          instrumentation_begin();

          irq_enter_rcu();
          invoke_irq_handler(regs, nr);
          irq_exit_rcu();

          instrumentation_end();

          irqentry_exit(regs, state);
  }

Note that the invocation of the actual interrupt handler is within an
irq_enter_rcu() and irq_exit_rcu() pair.

irq_enter_rcu() updates the preemption count which makes in_hardirq()
return true, handles NOHZ tick state and interrupt time accounting. This
means that up to the point where irq_enter_rcu() is invoked, in_hardirq()
returns false.

irq_exit_rcu() handles interrupt time accounting, undoes the preemption
count update and eventually handles soft interrupts and NOHZ tick state.

In theory, the preemption count could be updated in irqentry_enter(). In
practice, deferring this update to irq_enter_rcu() allows the preemption-count
code to be traced, while also maintaining symmetry with irq_exit_rcu() and
irqentry_exit(), which are described in the next paragraph. The only downside
is that the early entry code up to irq_enter_rcu() must be aware that the
preemption count has not yet been updated with the HARDIRQ_OFFSET state.

Note that irq_exit_rcu() must remove HARDIRQ_OFFSET from the preemption count
before it handles soft interrupts, whose handlers must run in BH context rather
than irq-disabled context. In addition, irqentry_exit() might schedule, which
also requires that HARDIRQ_OFFSET has been removed from the preemption count.
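
The resulting ordering can be sketched as follows; this is simplified and
loosely based on the generic implementation, with accounting and NOHZ
details reduced to comments:

.. code-block:: c

  void irq_enter_rcu(void)
  {
          /* Makes in_hardirq() return true from here on. */
          preempt_count_add(HARDIRQ_OFFSET);
          /* NOHZ tick state and interrupt time accounting follow. */
          ...
  }

  void irq_exit_rcu(void)
  {
          /* Interrupt time accounting first. */
          ...
          /* Drop HARDIRQ_OFFSET before softirqs run in BH context. */
          preempt_count_sub(HARDIRQ_OFFSET);
          if (!in_interrupt() && local_softirq_pending())
                  invoke_softirq();
          /* NOHZ tick state handling last. */
          ...
  }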

Even though interrupt handlers are expected to run with local interrupts
disabled, interrupt nesting is common from an entry/exit perspective. For
example, softirq handling happens within an irqentry_{enter,exit}() block with
local interrupts enabled. Also, although uncommon, nothing prevents an
interrupt handler from re-enabling interrupts.

Interrupt entry/exit code doesn't strictly need to handle reentrancy, since it
runs with local interrupts disabled. But NMIs can happen anytime, and a lot of
the entry code is shared between the two.

NMI and NMI-like exceptions
---------------------------

NMIs and NMI-like exceptions (machine checks, double faults, debug
interrupts, etc.) can hit any context and must be extra careful with
the state.

State changes for debug exceptions and machine-check exceptions depend on
whether these exceptions happened in user-space (breakpoints or watchpoints) or
in kernel mode (code patching). From user-space, they are treated like
interrupts, while from kernel mode they are treated like NMIs.

NMIs and other NMI-like exceptions handle state transitions without
distinguishing between user-mode and kernel-mode origin.

The state update on entry is handled in irqentry_nmi_enter() which updates
state in the following order:

* Preemption counter
* Lockdep
* RCU / Context tracking
* Tracing

The exit counterpart irqentry_nmi_exit() does the reverse operation in the
reverse order.

Note that the update of the preemption counter has to be the first
operation on enter and the last operation on exit. The reason is that both
lockdep and RCU rely on in_nmi() returning true in this case. The
preemption count modification in the NMI entry/exit case must not be
traced.
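
A simplified sketch of that ordering, loosely based on the generic
irqentry_nmi_enter(); the context-tracking call is named ct_nmi_enter() or
rcu_nmi_enter() depending on the kernel version:

.. code-block:: c

  irqentry_state_t noinstr irqentry_nmi_enter(struct pt_regs *regs)
  {
          irqentry_state_t irq_state;

          irq_state.lockdep = lockdep_hardirqs_enabled();

          /* First: untraced preemption count update makes in_nmi() true. */
          __preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);

          lockdep_hardirqs_off(CALLER_ADDR0);       /* Lockdep */
          lockdep_hardirq_enter();
          ct_nmi_enter();                           /* RCU / context tracking */

          instrumentation_begin();
          ftrace_nmi_enter();                       /* Tracing, now safe */
          instrumentation_end();

          return irq_state;
  }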

Architecture-specific code looks like this:

.. code-block:: c

  noinstr void nmi(struct pt_regs *regs)
  {
          arch_nmi_enter(regs);
          state = irqentry_nmi_enter(regs);

          instrumentation_begin();
          nmi_handler(regs);
          instrumentation_end();

          irqentry_nmi_exit(regs, state);
  }

and for a debug exception, for example, it can look like this:

.. code-block:: c

  noinstr void debug(struct pt_regs *regs)
  {
          arch_nmi_enter(regs);

          debug_regs = save_debug_regs();

          if (user_mode(regs)) {
                  state = irqentry_enter(regs);

                  instrumentation_begin();
                  user_mode_debug_handler(regs, debug_regs);
                  instrumentation_end();

                  irqentry_exit(regs, state);
          } else {
                  state = irqentry_nmi_enter(regs);

                  instrumentation_begin();
                  kernel_mode_debug_handler(regs, debug_regs);
                  instrumentation_end();

                  irqentry_nmi_exit(regs, state);
          }
  }

There is no combined irqentry_nmi_if_kernel() function available as the
above cannot be handled in an exception-agnostic way.

NMIs can happen in any context; for example, an NMI-like exception can be
triggered while handling an NMI. So NMI entry code has to be reentrant and
state updates need to handle nesting.