Commit | Line | Data |
---|---|---|
38a778aa JK |
1 | KVM Lock Overview |
2 | ================= | |
3 | ||
4 | 1. Acquisition Orders | |
5 | --------------------- | |
6 | ||
58e3948a PB |
7 | The acquisition orders for mutexes are as follows: |
8 | ||
9 | - kvm->lock is taken outside vcpu->mutex | |
10 | ||
11 | - kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock | |
12 | ||
13 | - kvm->slots_lock is taken outside kvm->irq_lock, though acquiring | |
14 | them together is quite rare. | |
15 | ||
3f5ad8be PB |
16 | On x86, vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock. |
17 | ||
18 | For spinlocks, kvm_lock is taken outside kvm->mmu_lock. | |
19 | ||
20 | Everything else is a leaf: no other lock is taken inside the critical | |
21 | sections. | |
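The mutex ordering above can be illustrated with a small userspace sketch. It is not kernel code: `struct demo_kvm`, `demo_kvm_init()` and `demo_update()` are hypothetical names, and pthread mutexes stand in for the kernel mutexes. The only point is that locks are taken outermost first (kvm->lock, then kvm->slots_lock, then kvm->irq_lock) and released in reverse order.

```c
#include <pthread.h>

/* Hypothetical userspace stand-ins for the KVM mutexes named above. */
struct demo_kvm {
    pthread_mutex_t lock;        /* stands in for kvm->lock */
    pthread_mutex_t slots_lock;  /* stands in for kvm->slots_lock */
    pthread_mutex_t irq_lock;    /* stands in for kvm->irq_lock */
    int updates;                 /* dummy state touched under all three */
};

static void demo_kvm_init(struct demo_kvm *kvm)
{
    pthread_mutex_init(&kvm->lock, NULL);
    pthread_mutex_init(&kvm->slots_lock, NULL);
    pthread_mutex_init(&kvm->irq_lock, NULL);
    kvm->updates = 0;
}

/* Take the locks outermost first and release them in reverse order. */
static int demo_update(struct demo_kvm *kvm)
{
    int seen;

    pthread_mutex_lock(&kvm->lock);
    pthread_mutex_lock(&kvm->slots_lock);
    pthread_mutex_lock(&kvm->irq_lock);

    seen = ++kvm->updates;

    pthread_mutex_unlock(&kvm->irq_lock);
    pthread_mutex_unlock(&kvm->slots_lock);
    pthread_mutex_unlock(&kvm->lock);
    return seen;
}
```

Any path that needs more than one of these locks would follow the same nesting; taking them in the opposite order on another path is what creates a deadlock risk.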
38a778aa | 22 | |
58d8b172 XG |
23 | 2. Exception |
24 | ------------ | |
25 | ||
26 | Fast page fault: | |
27 | ||
28 | Fast page fault is the fast path which fixes the guest page fault out of | |
63dbe14d JS |
29 | the mmu-lock on x86. Currently, the page fault can be fast in one of the |
30 | following two cases: | |
31 | ||
32 | 1. Access Tracking: The SPTE is not present, but it is marked for access | |
33 | tracking, i.e. the SPTE_SPECIAL_MASK is set. That means we need to | |
34 | restore the saved R/X bits. This is described in more detail below. | |
35 | ||
36 | 2. Write-Protection: The SPTE is present and the fault is | |
37 | caused by write-protect. That means we just need to change the W bit of the | |
38 | spte. | |
58d8b172 XG |
39 | |
40 | What we use to avoid all the races is the SPTE_HOST_WRITEABLE bit and | |
41 | SPTE_MMU_WRITEABLE bit on the spte: | |
42 | - SPTE_HOST_WRITEABLE means the gfn is writable on host. | |
43 | - SPTE_MMU_WRITEABLE means the gfn is writable on mmu. The bit is set when | |
44 | the gfn is writable on guest mmu and it is not write-protected by shadow | |
45 | page write-protection. | |
46 | ||
47 | On fast page fault path, we will use cmpxchg to atomically set the spte W | |
63dbe14d JS |
48 | bit if spte.SPTE_HOST_WRITEABLE = 1 and spte.SPTE_WRITE_PROTECT = 1, or |
49 | restore the saved R/X bits if VMX_EPT_TRACK_ACCESS mask is set, or both. This | |
58d8b172 XG |
50 | is safe because any change to these bits can be detected by cmpxchg. |
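A minimal userspace sketch of the write-protection case, using C11 atomics in place of the kernel's cmpxchg. The bit positions and the names `DEMO_SPTE_*` and `demo_fast_fix_write_fault()` are assumptions made for illustration, and the eligibility check uses the two software bits introduced above (SPTE_HOST_WRITEABLE and SPTE_MMU_WRITEABLE); the real kernel check may differ.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define DEMO_SPTE_W              (1ULL << 1)
#define DEMO_SPTE_HOST_WRITEABLE (1ULL << 10)
#define DEMO_SPTE_MMU_WRITEABLE  (1ULL << 11)

/*
 * Try to set the W bit without holding any lock.
 * Returns true if the spte was fixed, false if the slow path is needed.
 */
static bool demo_fast_fix_write_fault(_Atomic uint64_t *sptep)
{
    uint64_t need = DEMO_SPTE_HOST_WRITEABLE | DEMO_SPTE_MMU_WRITEABLE;
    uint64_t old_spte = atomic_load(sptep);

    if ((old_spte & need) != need)
        return false; /* not eligible for the lockless fix */

    /* Fails if any bit of the spte changed since old_spte was read. */
    return atomic_compare_exchange_strong(sptep, &old_spte,
                                          old_spte | DEMO_SPTE_W);
}
```

If the compare-and-exchange fails, a caller would typically retry or fall back to the path that takes mmu-lock.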
51 | ||
52 | But we need to carefully check these cases: | |
53 | 1): The mapping from gfn to pfn | |
54 | The mapping from gfn to pfn may be changed since we can only ensure the pfn | |
55 | is not changed during cmpxchg. This is an ABA problem; for example, the case | |
56 | below can happen: | |
57 | ||
58 | At the beginning: | |
59 | gpte = gfn1 | |
60 | gfn1 is mapped to pfn1 on host | |
61 | spte is the shadow page table entry corresponding with gpte and | |
62 | spte = pfn1 | |
63 | ||
64 |    VCPU 0                          VCPU 1 | |
65 | on fast page fault path: | |
66 | ||
67 |    old_spte = *spte; | |
68 |                                    pfn1 is swapped out: | |
69 |                                       spte = 0; | |
70 | ||
71 |                                    pfn1 is re-alloced for gfn2. | |
72 | ||
73 |                                    gpte is changed to point to | |
74 |                                    gfn2 by the guest: | |
75 |                                       spte = pfn1; | |
76 | ||
77 |    if (cmpxchg(spte, old_spte, old_spte+W)) | |
78 |       mark_page_dirty(vcpu->kvm, gfn1) | |
79 |       OOPS!!! | |
80 | ||
81 | We dirty-log gfn1, which means the write to gfn2 is lost in the dirty bitmap. | |
82 | ||
83 | For a direct sp, we can easily avoid this since the spte of a direct sp is | |
84 | fixed to its gfn. For an indirect sp, before we do cmpxchg, we call | |
85 | gfn_to_pfn_atomic() to pin gfn to pfn, because after gfn_to_pfn_atomic(): | |
86 | - We hold a reference on the pfn, which means the pfn cannot be freed and | |
87 | reused for another gfn. | |
88 | - The pfn is writable, which means it cannot be shared between different | |
89 | gfns by KSM. | |
90 | ||
91 | Then, we can ensure the dirty bitmap is correctly set for a gfn. | |
92 | ||
93 | Currently, to keep things simple, fast page fault is disabled for | |
94 | indirect shadow pages. | |
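The ABA hazard described in case 1) can be reproduced in a single thread. The sketch below is only an illustration under simplifying assumptions: the spte is modeled as a bare pfn value, the interleaved steps of the other vCPU are replayed inline, and `ABA_W` and `aba_goes_undetected()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define ABA_W 1ULL /* hypothetical W bit for this sketch */

/*
 * Returns true when the compare-and-exchange succeeds even though the
 * spte was torn down and rebuilt (for a different gfn) in between.
 */
static bool aba_goes_undetected(void)
{
    const uint64_t pfn1 = 0x1000;            /* spte modeled as the pfn only */
    _Atomic uint64_t spte = pfn1;            /* spte maps gfn1 -> pfn1 */

    uint64_t old_spte = atomic_load(&spte);  /* fast path takes its snapshot */

    atomic_store(&spte, 0);                  /* pfn1 is swapped out */
    atomic_store(&spte, pfn1);               /* pfn1 reused, gpte now -> gfn2 */

    /* The value matches the snapshot, so the exchange cannot tell. */
    return atomic_compare_exchange_strong(&spte, &old_spte,
                                          old_spte | ABA_W);
}
```

The exchange compares values only, which is why the text above relies on either a fixed gfn (direct sp) or a pinned pfn to rule the interleaving out.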
95 | ||
96 | 2): Dirty bit tracking | |
97 | In the original code, the spte can be fast updated (non-atomically) if the | |
98 | spte is read-only and the Accessed bit has already been set, since the | |
99 | Accessed bit and Dirty bit cannot be lost in that case. | |
100 | ||
101 | But this no longer holds with fast page fault, since the spte can be marked | |
102 | writable between reading the spte and updating it, as in the case below: | |
103 | ||
104 | At the beginning: | |
105 | spte.W = 0 | |
106 | spte.Accessed = 1 | |
107 | ||
108 |    VCPU 0                                    VCPU 1 | |
109 | In mmu_spte_clear_track_bits(): | |
110 | ||
111 |    old_spte = *spte; | |
112 | ||
113 |    /* 'if' condition is satisfied. */ | |
bb3541f1 | 114 |    if (old_spte.Accessed == 1 && |
58d8b172 XG |
115 |        old_spte.W == 0) |
116 |       spte = 0ull; | |
117 |                                          on fast page fault path: | |
118 |                                             spte.W = 1 | |
119 |                                          memory write on the spte: | |
120 |                                             spte.Dirty = 1 | |
121 | ||
122 | ||
123 |    else | |
124 |       old_spte = xchg(spte, 0ull) | |
125 | ||
126 | ||
bb3541f1 | 127 |    if (old_spte.Accessed == 1) |
58d8b172 XG |
128 |       kvm_set_pfn_accessed(spte.pfn); |
129 |    if (old_spte.Dirty == 1) | |
130 |       kvm_set_pfn_dirty(spte.pfn); | |
131 |       OOPS!!! | |
132 | ||
133 | The Dirty bit is lost in this case. | |
134 | ||
135 | To avoid this kind of issue, we always treat the spte as "volatile" if it | |
136 | can be updated out of mmu-lock (see spte_has_volatile_bits()); that is, | |
17180032 | 137 | the spte is always atomically updated in this case. |
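A small sketch of why the atomic update closes the race in case 2): an exchange returns the value the spte actually held at the moment it was cleared, so bits set concurrently are not lost. The bit positions and the names `TRK_*` and `trk_clear_spte()` are assumptions for illustration, not the kernel's definitions.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bit positions for this sketch. */
#define TRK_W        (1ULL << 1)
#define TRK_ACCESSED (1ULL << 5)
#define TRK_DIRTY    (1ULL << 6)

/*
 * Clear the spte atomically and return the value it held at that
 * instant, so Accessed/Dirty are read from the real final state
 * rather than from an earlier, possibly stale, snapshot.
 */
static uint64_t trk_clear_spte(_Atomic uint64_t *sptep)
{
    return atomic_exchange(sptep, (uint64_t)0);
}
```

A caller would then test the returned value for the Accessed and Dirty bits before propagating them to the pfn.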
58d8b172 XG |
138 | |
139 | 3): Flush TLBs due to spte update | |
140 | If the spte is updated from writable to readonly, we should flush all TLBs, | |
141 | otherwise rmap_write_protect will find a read-only spte, even though the | |
142 | writable spte might be cached on a CPU's TLB. | |
143 | ||
144 | As mentioned before, the spte can be updated to writable out of mmu-lock on | |
145 | the fast page fault path. To make the path easy to audit, we check whether | |
146 | TLBs need to be flushed for this reason in mmu_spte_update(), since this is | |
147 | a common function to update spte (present -> present). | |
148 | ||
149 | Since the spte is "volatile" if it can be updated out of mmu-lock, we always | |
17180032 | 150 | atomically update the spte, so the race caused by fast page fault is avoided. |
58d8b172 XG |
151 | See the comments in spte_has_volatile_bits() and mmu_spte_update(). |
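The flush decision in case 3) can be sketched as follows. This is a simplified illustration, not the body of mmu_spte_update(): `FLUSH_W`, `flush_needed()` and `flush_demo_spte_update()` are hypothetical names, and only the writable-to-read-only transition is modeled.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLUSH_W (1ULL << 1) /* hypothetical W bit for this sketch */

/* A flush is needed when a writable spte becomes read-only. */
static bool flush_needed(uint64_t old_spte, uint64_t new_spte)
{
    return (old_spte & FLUSH_W) && !(new_spte & FLUSH_W);
}

/*
 * Update the spte atomically. The exchange returns the real old value,
 * so a W bit set concurrently by the fast page fault path is seen here.
 */
static bool flush_demo_spte_update(_Atomic uint64_t *sptep, uint64_t new_spte)
{
    uint64_t old_spte = atomic_exchange(sptep, new_spte);

    return flush_needed(old_spte, new_spte);
}
```

Deciding from the exchanged value rather than from an earlier read is what keeps a concurrently granted W bit from being missed.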
152 | ||
63dbe14d JS |
153 | Lockless Access Tracking: |
154 | ||
155 | This is used for Intel CPUs that are using EPT but do not support the EPT A/D | |
156 | bits. In this case, when the KVM MMU notifier is called to track accesses to a | |
157 | page (via kvm_mmu_notifier_clear_flush_young), it marks the PTE as not-present | |
158 | by clearing the RWX bits in the PTE and storing the original R & X bits in | |
159 | some unused/ignored bits. In addition, the SPTE_SPECIAL_MASK is also set on the | |
160 | PTE (using the ignored bit 62). When the VM tries to access the page later on, | |
161 | a fault is generated and the fast page fault mechanism described above is used | |
162 | to atomically restore the PTE to a Present state. The W bit is not saved when | |
163 | the PTE is marked for access tracking and during restoration to the Present | |
164 | state, the W bit is set depending on whether or not it was a write access. If | |
165 | it wasn't, then the W bit will remain clear until a write access happens, at | |
166 | which time it will be set using the Dirty tracking mechanism described above. | |
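The save and restore steps of access tracking can be sketched as plain bit manipulation. The sketch assumes the EPT layout of R/W/X in bits 0 to 2 and uses bit 62 for the special mask as stated above; the location of the saved bits (`ACC_SAVED_SHIFT`) and the function names are hypothetical, and the atomic cmpxchg used by the real fast path is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define ACC_R            (1ULL << 0)
#define ACC_W            (1ULL << 1)
#define ACC_X            (1ULL << 2)
#define ACC_RWX          (ACC_R | ACC_W | ACC_X)
#define ACC_SAVED_SHIFT  52            /* hypothetical spot in ignored bits */
#define ACC_SAVED_MASK   ((ACC_R | ACC_X) << ACC_SAVED_SHIFT)
#define ACC_SPECIAL_MASK (1ULL << 62)  /* ignored bit 62, as in the text */

/* Make the spte not-present, remembering R and X (W is not saved). */
static uint64_t acc_mark_for_tracking(uint64_t spte)
{
    uint64_t saved = spte & (ACC_R | ACC_X);

    spte &= ~ACC_RWX;
    spte |= saved << ACC_SAVED_SHIFT;
    return spte | ACC_SPECIAL_MASK;
}

/* Restore R and X; set W only if the fault was a write access. */
static uint64_t acc_restore(uint64_t spte, bool write_fault)
{
    uint64_t saved = (spte & ACC_SAVED_MASK) >> ACC_SAVED_SHIFT;

    spte &= ~(ACC_SPECIAL_MASK | ACC_SAVED_MASK);
    spte |= saved;
    if (write_fault)
        spte |= ACC_W;
    return spte;
}
```

In this sketch a read fault brings the spte back read-only, so a later write still faults and goes through the write-protection path.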
167 | ||
58d8b172 | 168 | 3. Reference |
38a778aa JK |
169 | ------------ |
170 | ||
171 | Name: kvm_lock | |
2f303b74 | 172 | Type: spinlock_t |
38a778aa JK |
173 | Arch: any |
174 | Protects: - vm_list | |
4a937f96 PB |
175 | |
176 | Name: kvm_count_lock | |
177 | Type: raw_spinlock_t | |
178 | Arch: any | |
179 | Protects: - hardware virtualization enable/disable | |
38a778aa JK |
180 | Comment: 'raw' because hardware enabling/disabling must be atomic with |
181 | respect to migration. | |
182 | ||
183 | Name: kvm_arch::tsc_write_lock | |
184 | Type: raw_spinlock_t | |
185 | Arch: x86 | |
186 | Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset} | |
187 | - tsc offset in vmcb | |
188 | Comment: 'raw' because updating the tsc offsets must not be preempted. | |
58d8b172 XG |
189 | |
190 | Name: kvm->mmu_lock | |
191 | Type: spinlock_t | |
192 | Arch: any | |
193 | Protects: - shadow page/shadow tlb entry | |
194 | Comment: it is a spinlock since it is used in mmu notifier. | |
519192aa TH |
195 | |
196 | Name: kvm->srcu | |
197 | Type: srcu lock | |
198 | Arch: any | |
199 | Protects: - kvm->memslots | |
200 | - kvm->buses | |
201 | Comment: The srcu read lock must be held while accessing memslots (e.g. | |
202 | when using gfn_to_* functions) and while accessing in-kernel | |
203 | MMIO/PIO address->device structure mapping (kvm->buses). | |
204 | The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu | |
205 | if it is needed by multiple functions. | |
bf9f6ac8 FW |
206 | |
207 | Name: blocked_vcpu_on_cpu_lock | |
208 | Type: spinlock_t | |
209 | Arch: x86 | |
210 | Protects: - blocked_vcpu_on_cpu | |
211 | Comment: This is a per-CPU lock used for VT-d posted-interrupts. | |
212 | When VT-d posted-interrupts are supported and the VM has assigned | |
213 | devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu, | |
214 | protected by blocked_vcpu_on_cpu_lock. When VT-d hardware issues a | |
215 | wakeup notification event because an external interrupt from an | |
216 | assigned device has arrived, we find the vCPU on the list and wake | |
217 | it up.