KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock.

For spinlocks, kvm_lock is taken outside kvm->mmu_lock.

Everything else is a leaf: no other lock is taken inside the critical
sections.

2. Exception
------------

Fast page fault:

Fast page fault is the fast path that fixes a guest page fault without
taking the mmu-lock on x86. Currently, a page fault can take the fast path
in one of the following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
tracking, i.e. the SPTE_SPECIAL_MASK is set. That means we need to
restore the saved R/X bits. This is described in more detail below.

2. Write-Protection: The SPTE is present and the fault is caused by
write-protection. That means we just need to change the W bit of the spte.

What we use to avoid all the races are the SPTE_HOST_WRITEABLE and
SPTE_MMU_WRITEABLE bits on the spte:
- SPTE_HOST_WRITEABLE means the gfn is writable on host.
- SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when
  the gfn is writable on guest mmu and it is not write-protected by shadow
  page write-protection.

On the fast page fault path, we use cmpxchg to atomically set the spte W
bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or
restore the saved R/X bits if the VMX_EPT_TRACK_ACCESS mask is set, or both.
This is safe because any change to these bits is detected by cmpxchg.
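
The write-protection case can be sketched in user space with C11 atomics.
This is only a minimal illustration, not the kernel implementation: the bit
positions and the function name are hypothetical, and the sketch assumes the
condition is expressed with the two software bits defined above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, chosen only for this sketch. */
#define SPTE_W              (1ull << 1)
#define SPTE_HOST_WRITEABLE (1ull << 57)
#define SPTE_MMU_WRITEABLE  (1ull << 58)

/*
 * Try to make *sptep writable without holding mmu-lock.
 * Returns true only if the spte still held the value we inspected.
 */
bool fast_pf_fix_write_protect(_Atomic uint64_t *sptep)
{
	uint64_t old_spte = atomic_load(sptep);

	/* Both software bits must be set and W must currently be clear. */
	if (!(old_spte & SPTE_HOST_WRITEABLE) ||
	    !(old_spte & SPTE_MMU_WRITEABLE) ||
	    (old_spte & SPTE_W))
		return false;

	/* Fails if any bit of the spte changed since the load above. */
	return atomic_compare_exchange_strong(sptep, &old_spte,
					      old_spte | SPTE_W);
}
```

If another path modifies the spte between the load and the compare-exchange,
the exchange fails and the fault has to be handled on the slow path.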

But we need to check these cases carefully:
1): The mapping from gfn to pfn
The mapping from gfn to pfn may change, since we can only ensure that the
pfn is not changed during cmpxchg. This is an ABA problem; for example, the
case below can happen:

At the beginning:
gpte = gfn1
gfn1 is mapped to pfn1 on host
spte is the shadow page table entry corresponding with gpte and
spte = pfn1

   VCPU 0                           VCPU 1
on fast page fault path:

   old_spte = *spte;
                                 pfn1 is swapped out:
                                    spte = 0;

                                 pfn1 is re-alloced for gfn2.

                                 gpte is changed to point to
                                 gfn2 by the guest:
                                    spte = pfn1;

   if (cmpxchg(spte, old_spte, old_spte+W))
	mark_page_dirty(vcpu->kvm, gfn1)
             OOPS!!!

We dirty-log gfn1, which means the write to gfn2 is lost in the dirty bitmap.
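
The interleaving above can be replayed sequentially in user space to show why
cmpxchg alone does not catch it. This is a hedged sketch: the function name,
the pfn value and the W bit position are hypothetical, and the two VCPUs are
simulated by ordinary statements in one thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define W_BIT 1ull	/* hypothetical: bit 0 stands in for the W bit */

/*
 * Replays the interleaving step by step. Returns true if the stale
 * cmpxchg succeeds even though the spte was zapped and re-created
 * for a different gfn in between.
 */
bool aba_cmpxchg_succeeds(void)
{
	const uint64_t pfn1 = 0x1000;		/* hypothetical pfn value */
	_Atomic uint64_t spte = pfn1;		/* spte = pfn1, maps gfn1 */

	uint64_t old_spte = atomic_load(&spte);	/* VCPU 0: old_spte = *spte */

	atomic_store(&spte, 0);			/* VCPU 1: pfn1 is swapped out */
	atomic_store(&spte, pfn1);		/* VCPU 1: pfn1 reused for gfn2 */

	/* VCPU 0: the bit pattern matches, so the exchange goes through. */
	return atomic_compare_exchange_strong(&spte, &old_spte,
					      old_spte | W_BIT);
}
```

The exchange succeeds because the spte holds the same bits as before, even
though those bits now describe a mapping for gfn2.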

For a direct sp, we can easily avoid this since the spte of a direct sp is
fixed to its gfn. For an indirect sp, before we do cmpxchg, we call
gfn_to_pfn_atomic() to pin gfn to pfn, because after gfn_to_pfn_atomic():
- We hold a reference on the pfn, which means the pfn cannot be freed and
  reused for another gfn.
- The pfn is writable, which means it cannot be shared between different
  gfns by KSM.

Then we can ensure the dirty bitmap is correctly set for a gfn.

Currently, to simplify things, we disable fast page fault for indirect
shadow pages.

2): Dirty bit tracking
In the original code, the spte can be updated fast (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since the
Accessed bit and Dirty bit cannot be lost.

But this is no longer true with fast page fault, since the spte can be marked
writable between reading the spte and updating it, as in the case below:

At the beginning:
spte.W = 0
spte.Accessed = 1

   VCPU 0                                       VCPU 1
In mmu_spte_clear_track_bits():

   old_spte = *spte;

   /* 'if' condition is satisfied. */
   if (old_spte.Accessed == 1 &&
        old_spte.W == 0)
      spte = 0ull;
                                         on fast page fault path:
                                             spte.W = 1
                                         memory write on the spte:
                                             spte.Dirty = 1


   else
      old_spte = xchg(spte, 0ull)


   if (old_spte.Accessed == 1)
      kvm_set_pfn_accessed(spte.pfn);
   if (old_spte.Dirty == 1)
      kvm_set_pfn_dirty(spte.pfn);
      OOPS!!!

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock (see spte_has_volatile_bits()); this
means the spte is always updated atomically in this case.
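
The atomic variant can be sketched as below. This is an illustration with
C11 atomics rather than the kernel code; the bit positions and the function
name are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, for this sketch only. */
#define SPTE_ACCESSED (1ull << 8)
#define SPTE_DIRTY    (1ull << 9)

/*
 * Clear a "volatile" spte with a single atomic exchange, so that the
 * Accessed/Dirty state reported to the caller is exactly the state the
 * spte had at the moment it was zapped.
 */
uint64_t spte_clear_track_bits_atomic(_Atomic uint64_t *sptep,
				      bool *accessed, bool *dirty)
{
	uint64_t old_spte = atomic_exchange(sptep, 0ull);

	*accessed = !!(old_spte & SPTE_ACCESSED);
	*dirty = !!(old_spte & SPTE_DIRTY);
	return old_spte;
}
```

Because the read and the clear happen in one step, a Dirty bit set
concurrently is either seen in old_spte or never lands in the spte at all.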

3): Flush TLBs due to spte updates
If the spte is updated from writable to read-only, we should flush all TLBs;
otherwise rmap_write_protect will find a read-only spte even though the
writable spte might still be cached in a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path. To make the path easy to audit, we check whether
TLBs need to be flushed for this reason in mmu_spte_update(), since this is
the common function for updating an spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
update the spte atomically, so the race caused by fast page fault is avoided.
See the comments in spte_has_volatile_bits() and mmu_spte_update().
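
The flush decision can be sketched as follows. This is not the body of
mmu_spte_update(); the function name and the W bit position are hypothetical,
and the sketch covers only the writable -> read-only check described above.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTE_W (1ull << 1)	/* hypothetical position of the W bit */

/*
 * Install new_spte atomically and report whether remote TLBs must be
 * flushed. The decision uses the value that was really in the spte at
 * the time of the exchange, not an earlier snapshot, so a W bit set by
 * a concurrent fast page fault is not missed.
 */
bool spte_update_needs_flush(_Atomic uint64_t *sptep, uint64_t new_spte)
{
	uint64_t old_spte = atomic_exchange(sptep, new_spte);

	return (old_spte & SPTE_W) && !(new_spte & SPTE_W);
}
```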

Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits. In this case, when the KVM MMU notifier is called to track accesses to a
page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present
by clearing the RWX bits in the PTE and storing the original R & X bits in
some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the
PTE (using the ignored bit 62). When the VM tries to access the page later on,
a fault is generated and the fast page fault mechanism described above is used
to atomically restore the PTE to a Present state. The W bit is not saved when
the PTE is marked for access tracking. During restoration to the Present
state, the W bit is set depending on whether or not it was a write access. If
it wasn't, then the W bit will remain clear until a write access happens, at
which time it will be set using the Dirty tracking mechanism described above.
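
The marking and restoring steps can be sketched as pure bit manipulation.
This is an assumption-laden illustration: R, W and X are taken as EPT bits 0
to 2 and SPTE_SPECIAL_MASK as bit 62 (as stated above), while SAVED_SHIFT and
both function names are hypothetical. In the real fast path the restored
value would be installed with cmpxchg.

```c
#include <stdbool.h>
#include <stdint.h>

#define EPT_R             (1ull << 0)
#define EPT_W             (1ull << 1)
#define EPT_X             (1ull << 2)
#define EPT_RWX           (EPT_R | EPT_W | EPT_X)
#define SPTE_SPECIAL_MASK (1ull << 62)
#define SAVED_SHIFT       52	/* hypothetical location of the saved R/X bits */

/* Make the spte not-present, remembering only the R and X bits. */
uint64_t mark_for_access_track(uint64_t spte)
{
	uint64_t saved = spte & (EPT_R | EPT_X);

	spte &= ~EPT_RWX;		/* not-present: R, W and X all clear */
	spte |= saved << SAVED_SHIFT;	/* W is deliberately not saved */
	return spte | SPTE_SPECIAL_MASK;
}

/* Rebuild a present spte; W is set only if the fault was a write. */
uint64_t restore_access_track(uint64_t spte, bool write_fault)
{
	uint64_t saved = (spte >> SAVED_SHIFT) & (EPT_R | EPT_X);

	spte &= ~(SPTE_SPECIAL_MASK | ((EPT_R | EPT_X) << SAVED_SHIFT));
	spte |= saved;
	if (write_fault)
		spte |= EPT_W;
	return spte;
}
```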

3. Reference
------------

Name:		kvm_lock
Type:		spinlock_t
Arch:		any
Protects:	- vm_list

Name:		kvm_count_lock
Type:		raw_spinlock_t
Arch:		any
Protects:	- hardware virtualization enable/disable
Comment:	'raw' because hardware enabling/disabling must be atomic wrt
		migration.

Name:		kvm_arch::tsc_write_lock
Type:		raw_spinlock_t
Arch:		x86
Protects:	- kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
		- tsc offset in vmcb
Comment:	'raw' because updating the tsc offsets must not be preempted.

Name:		kvm->mmu_lock
Type:		spinlock_t
Arch:		any
Protects:	- shadow page/shadow tlb entry
Comment:	It is a spinlock since it is used in the mmu notifier.

Name:		kvm->srcu
Type:		srcu lock
Arch:		any
Protects:	- kvm->memslots
		- kvm->buses
Comment:	The srcu read lock must be held while accessing memslots (e.g.
		when using gfn_to_* functions) and while accessing in-kernel
		MMIO/PIO address->device structure mapping (kvm->buses).
		The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
		if it is needed by multiple functions.

Name:		blocked_vcpu_on_cpu_lock
Type:		spinlock_t
Arch:		x86
Protects:	- blocked_vcpu_on_cpu
Comment:	This is a per-CPU lock used for VT-d posted-interrupts.
		When VT-d posted-interrupts are supported and the VM has
		assigned devices, we put the blocked vCPU on the list
		blocked_vcpu_on_cpu, protected by blocked_vcpu_on_cpu_lock.
		When the VT-d hardware issues a wakeup notification event
		because an external interrupt from an assigned device
		arrives, we find the vCPU on the list and wake it up.