Commit | Line | Data |
---|---|---|
d2d7a611 GC |
1 | KVM-specific MSRs. |
2 | Glauber Costa <glommer@redhat.com>, Red Hat Inc, 2010 | |
3 | ===================================================== | |
4 | ||
5 | KVM makes use of some custom MSRs to service some requests. | |
d2d7a611 GC |
6 | |
7 | Custom MSRs have a range reserved for them, that goes from | |
8 | 0x4b564d00 to 0x4b564dff. There are MSRs outside this area, | |
9 | but they are deprecated and their use is discouraged. | |
10 | ||
11 | Custom MSR list | |
12 | --------------- | |
13 | ||
14 | The currently supported custom MSRs are: | |
15 | ||
16 | MSR_KVM_WALL_CLOCK_NEW: 0x4b564d00 | |
17 | ||
18 | data: 4-byte aligned physical address of a memory area which must be | |
19 | in guest RAM. This memory is expected to hold a copy of the following | |
20 | structure: | |
21 | ||
22 | struct pvclock_wall_clock { | |
23 | u32 version; | |
24 | u32 sec; | |
25 | u32 nsec; | |
26 | } __attribute__((__packed__)); | |
27 | ||
28 | whose data will be filled in by the hypervisor. The hypervisor is only | |
29 | guaranteed to update this data at the moment of MSR write. | |
30 | Users that want to reliably query this information more than once have | |
31 | to write more than once to this MSR. Fields have the following meanings: | |
32 | ||
33 | version: guest has to check version before and after grabbing | |
34 | time information and check that they are both equal and even. | |
35 | An odd version indicates an in-progress update. | |
36 | ||
879238fe | 37 | sec: number of seconds for wallclock at time of boot. |
d2d7a611 | 38 | |
879238fe SF |
39 | nsec: number of nanoseconds for wallclock at time of boot. |
40 | ||
41 | In order to get the current wallclock time, the system_time from | |
42 | MSR_KVM_SYSTEM_TIME_NEW needs to be added. | |
d2d7a611 GC |
43 | |
44 | Note that although MSRs are per-CPU entities, the effect of this | |
45 | particular MSR is global. | |
46 | ||
47 | Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid | |
48 | leaf prior to usage. | |
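The version check described above can be sketched in C as follows. This is a minimal illustration, not the kernel's implementation: read_wall_clock() and its max_tries parameter are hypothetical, the acquire fences are an assumption about how a guest would keep the reads ordered, and the MSR write that asks the hypervisor to fill the area is not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Layout as given in the text above. */
struct pvclock_wall_clock {
	uint32_t version;
	uint32_t sec;
	uint32_t nsec;
} __attribute__((__packed__));

static_assert(sizeof(struct pvclock_wall_clock) == 12, "unexpected layout");

/*
 * Hypothetical helper: copy sec/nsec out of the shared area.
 * Returns 0 on a consistent read, -1 if every attempt saw an odd
 * (update in progress) or changing version.
 */
static int read_wall_clock(const volatile struct pvclock_wall_clock *wc,
			   uint32_t *sec, uint32_t *nsec, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		uint32_t before = wc->version;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
		uint32_t s = wc->sec;
		uint32_t n = wc->nsec;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
		uint32_t after = wc->version;

		/* Accept only if the version was stable and even. */
		if (before == after && (before & 1) == 0) {
			*sec = s;
			*nsec = n;
			return 0;
		}
	}
	return -1;
}
```

The values returned are the wallclock at boot; the system_time from MSR_KVM_SYSTEM_TIME_NEW still has to be added to obtain the current time.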
49 | ||
50 | MSR_KVM_SYSTEM_TIME_NEW: 0x4b564d01 | |
51 | ||
52 | data: 4-byte aligned physical address of a memory area which must be in | |
53 | guest RAM, plus an enable bit in bit 0. This memory is expected to hold | |
54 | a copy of the following structure: | |
55 | ||
56 | struct pvclock_vcpu_time_info { | |
57 | u32 version; | |
58 | u32 pad0; | |
59 | u64 tsc_timestamp; | |
60 | u64 system_time; | |
61 | u32 tsc_to_system_mul; | |
62 | s8 tsc_shift; | |
63 | u8 flags; | |
64 | u8 pad[2]; | |
65 | } __attribute__((__packed__)); /* 32 bytes */ | |
66 | ||
67 | whose data will be filled in by the hypervisor periodically. Only one | |
68 | write, or registration, is needed for each VCPU. The interval between | |
69 | updates of this structure is arbitrary and implementation-dependent. | |
70 | The hypervisor may update this structure at any time it sees fit until | |
71 | anything with bit0 == 0 is written to it. | |
72 | ||
73 | Fields have the following meanings: | |
74 | ||
75 | version: guest has to check version before and after grabbing | |
76 | time information and check that they are both equal and even. | |
77 | An odd version indicates an in-progress update. | |
78 | ||
79 | tsc_timestamp: the tsc value at the current VCPU at the time | |
80 | of the update of this structure. Guests can subtract this value | |
81 | from current tsc to derive a notion of elapsed time since the | |
82 | structure update. | |
83 | ||
84 | system_time: a host notion of monotonic time, including sleep | |
85 | time at the time this structure was last updated. Unit is | |
86 | nanoseconds. | |
87 | ||
879238fe SF |
88 | tsc_to_system_mul: multiplier to be used when converting |
89 | tsc-related quantity to nanoseconds. | |
d2d7a611 | 90 | |
879238fe SF |
91 | tsc_shift: shift to be used when converting tsc-related |
92 | quantity to nanoseconds. This shift will ensure that | |
93 | multiplication with tsc_to_system_mul does not overflow. | |
94 | A positive value denotes a left shift, a negative value | |
95 | a right shift. | |
d2d7a611 | 96 | |
879238fe SF |
97 | The conversion from tsc to nanoseconds involves an additional |
98 | right shift by 32 bits. With this information, guests can | |
99 | derive per-CPU time by doing: | |
d2d7a611 GC |
100 | |
101 | time = (current_tsc - tsc_timestamp) | |
879238fe SF |
102 | if (tsc_shift >= 0) |
103 | time <<= tsc_shift; | |
104 | else | |
105 | time >>= -tsc_shift; | |
106 | time = (time * tsc_to_system_mul) >> 32 | |
d2d7a611 GC |
107 | time = time + system_time |
108 | ||
109 | flags: bits in this field indicate extended capabilities | |
110 | coordinated between the guest and the hypervisor. Availability | |
111 | of specific flags has to be checked in 0x40000001 cpuid leaf. | |
112 | Current flags are: | |
113 | ||
114 | flag bit | cpuid bit | meaning | |
115 | ------------------------------------------------------------- | |
116 | | | time measures taken across | |
117 | 0 | 24 | multiple cpus are guaranteed to | |
118 | | | be monotonic | |
119 | ------------------------------------------------------------- | |
1c0b28c2 EM |
120 | | | guest vcpu has been paused by |
121 | 1 | N/A | the host | |
122 | | | See 4.70 in api.txt | |
123 | ------------------------------------------------------------- | |
d2d7a611 GC |
124 | |
125 | Availability of this MSR must be checked via bit 3 in 0x40000001 cpuid | |
126 | leaf prior to usage. | |
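The time calculation given for tsc_to_system_mul and tsc_shift can be written as a small C function. This is a sketch of the arithmetic only: pvclock_ns() is a hypothetical name, it operates on a snapshot that is assumed to have been copied under the version check already, and it relies on the GCC/Clang unsigned __int128 extension so the 64-bit by 32-bit product cannot overflow before the right shift by 32.

```c
#include <assert.h>
#include <stdint.h>

/* Layout as given in the text above. */
struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  flags;
	uint8_t  pad[2];
} __attribute__((__packed__));

static_assert(sizeof(struct pvclock_vcpu_time_info) == 32, "unexpected layout");

/*
 * Hypothetical helper: convert a TSC reading to nanoseconds using a
 * consistent snapshot of the per-VCPU structure.
 */
static uint64_t pvclock_ns(const struct pvclock_vcpu_time_info *snap,
			   uint64_t current_tsc)
{
	uint64_t delta = current_tsc - snap->tsc_timestamp;

	/* Positive tsc_shift is a left shift, negative a right shift. */
	if (snap->tsc_shift >= 0)
		delta <<= snap->tsc_shift;
	else
		delta >>= -snap->tsc_shift;

	/* Widen to 128 bits for the multiply, then drop the low 32 bits. */
	delta = (uint64_t)(((unsigned __int128)delta *
			    snap->tsc_to_system_mul) >> 32);

	return delta + snap->system_time;
}
```

In a real guest the TSC read and the copy of the structure would both sit inside the version retry loop, so that a concurrent update by the hypervisor is detected.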
127 | ||
128 | ||
129 | MSR_KVM_WALL_CLOCK: 0x11 | |
130 | ||
131 | data and functioning: same as MSR_KVM_WALL_CLOCK_NEW. Use that instead. | |
132 | ||
133 | This MSR falls outside the reserved KVM range and may be removed in the | |
134 | future. Its usage is deprecated. | |
135 | ||
136 | Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid | |
137 | leaf prior to usage. | |
138 | ||
139 | MSR_KVM_SYSTEM_TIME: 0x12 | |
140 | ||
141 | data and functioning: same as MSR_KVM_SYSTEM_TIME_NEW. Use that instead. | |
142 | ||
143 | This MSR falls outside the reserved KVM range and may be removed in the | |
144 | future. Its usage is deprecated. | |
145 | ||
146 | Availability of this MSR must be checked via bit 0 in 0x40000001 cpuid | |
147 | leaf prior to usage. | |
148 | ||
149 | The suggested algorithm for detecting kvmclock presence is then: | |
150 | ||
151 | if (!kvm_para_available()) /* refer to cpuid.txt */ | |
152 | return NON_PRESENT; | |
153 | ||
154 | flags = cpuid_eax(0x40000001); | |
155 | if (flags & (1 << 3)) { | |
156 | msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW; | |
157 | msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW; | |
158 | return PRESENT; | |
159 | } else if (flags & (1 << 0)) { | |
160 | msr_kvm_system_time = MSR_KVM_SYSTEM_TIME; | |
161 | msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK; | |
162 | return PRESENT; | |
163 | } else | |
164 | return NON_PRESENT; | |
344d9588 GN |
165 | |
166 | MSR_KVM_ASYNC_PF_EN: 0x4b564d02 | |
167 | data: Bits 63-6 hold 64-byte aligned physical address of a | |
168 | 64 byte memory area which must be in guest RAM and must be | |
52a5c155 | 169 | zeroed. Bits 5-3 are reserved and should be zero. Bit 0 is 1 |
344d9588 | 170 | when asynchronous page faults are enabled on the vcpu, 0 when |
91690bf3 | 171 | disabled. Bit 1 is 1 if asynchronous page faults can be injected |
52a5c155 | 172 | when vcpu is in cpl == 0. Bit 2 is 1 if asynchronous page faults |
fe2a3027 RK |
173 | are delivered to L1 as #PF vmexits. Bit 2 can be set only if |
174 | KVM_FEATURE_ASYNC_PF_VMEXIT is present in CPUID. | |
344d9588 GN |
175 | |
176 | The first 4 bytes of the 64-byte memory location will be written to by | |
177 | the hypervisor at the time of asynchronous page fault (APF) | |
178 | injection to indicate the type of asynchronous page fault. A value | |
179 | of 1 means that the page referred to by the page fault is not | |
180 | present. A value of 2 means that the page is now available. Disabling | |
181 | interrupts inhibits APFs. The guest must not enable interrupts | |
182 | before the reason is read, or it may be overwritten by another | |
183 | APF. Since APF uses the same exception vector as a regular page | |
184 | fault, the guest must reset the reason to 0 before it does | |
185 | something that can generate a normal page fault. If the APF reason | |
186 | is 0 during a page fault, it means that this is a regular page | |
187 | fault. | |
188 | ||
189 | During delivery of a type 1 APF, cr2 contains a token that will | |
190 | be used to notify the guest when the missing page becomes | |
191 | available. When the page becomes available, a type 2 APF is sent with | |
192 | cr2 set to the token associated with the page. There is a special | |
193 | token, 0xffffffff, which tells the vcpu that it should wake | |
194 | up all processes waiting for APFs and that no individual type 2 APFs | |
195 | will be sent. | |
196 | ||
197 | If APF is disabled while there are outstanding APFs, they will | |
198 | not be delivered. | |
199 | ||
200 | Currently a type 2 APF will always be delivered on the same vcpu as | |
201 | the type 1 APF was, but the guest should not rely on that. | |
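The MSR layout and the reason handling described above can be sketched as follows. All macro, enum and function names here are hypothetical and chosen for illustration; the token is assumed to have been read from cr2 by the caller, and the actual MSR write and exception entry code are not shown.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the MSR bits described above. */
#define APF_BIT_ENABLED      (1ULL << 0)
#define APF_BIT_SEND_AT_CPL0 (1ULL << 1)
#define APF_BIT_PF_VMEXIT    (1ULL << 2)

/* Hypothetical names for the reason values and the special token. */
#define APF_REASON_NOT_PRESENT 1u
#define APF_REASON_PAGE_READY  2u
#define APF_TOKEN_WAKE_ALL     0xffffffffu

enum apf_action {
	APF_REGULAR_FAULT,	/* reason 0: ordinary page fault */
	APF_WAIT,		/* type 1: wait on the token */
	APF_WAKE_ONE,		/* type 2: wake the waiter for the token */
	APF_WAKE_ALL,		/* type 2 with the special token */
	APF_UNKNOWN
};

/* Build the MSR value from a 64-byte aligned address and bits 2-0. */
static uint64_t apf_msr_value(uint64_t area_pa, uint64_t bits)
{
	assert((area_pa & 63) == 0);	/* bits 5-3 stay zero as well */
	assert((bits & ~7ULL) == 0);
	return area_pa | bits;
}

/*
 * Read and reset the reason word, then decide what to do. Intended
 * to run with interrupts still disabled, as the text requires.
 */
static enum apf_action apf_classify(volatile uint32_t *reason_word,
				    uint32_t token)
{
	uint32_t reason = *reason_word;

	*reason_word = 0;	/* reset before anything that may fault */

	switch (reason) {
	case 0:
		return APF_REGULAR_FAULT;
	case APF_REASON_NOT_PRESENT:
		return APF_WAIT;
	case APF_REASON_PAGE_READY:
		return token == APF_TOKEN_WAKE_ALL ? APF_WAKE_ALL
						   : APF_WAKE_ONE;
	default:
		return APF_UNKNOWN;
	}
}
```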
9ddabbe7 GC |
202 | |
203 | MSR_KVM_STEAL_TIME: 0x4b564d03 | |
204 | ||
205 | data: 64-byte aligned physical address of a memory area which must be | |
206 | in guest RAM, plus an enable bit in bit 0. This memory is expected to | |
207 | hold a copy of the following structure: | |
208 | ||
209 | struct kvm_steal_time { | |
210 | __u64 steal; | |
211 | __u32 version; | |
212 | __u32 flags; | |
3dd3e0ce PX |
213 | __u8 preempted; |
214 | __u8 u8_pad[3]; | |
215 | __u32 pad[11]; | |
9ddabbe7 GC |
216 | }; | |
217 | ||
218 | whose data will be filled in by the hypervisor periodically. Only one | |
219 | write, or registration, is needed for each VCPU. The interval between | |
220 | updates of this structure is arbitrary and implementation-dependent. | |
221 | The hypervisor may update this structure at any time it sees fit until | |
222 | anything with bit0 == 0 is written to it. Guest is required to make sure | |
223 | this structure is initialized to zero. | |
224 | ||
225 | Fields have the following meanings: | |
226 | ||
227 | version: a sequence counter. In other words, guest has to check | |
228 | this field before and after grabbing time information and make | |
229 | sure they are both equal and even. An odd version indicates an | |
230 | in-progress update. | |
231 | ||
232 | flags: At this point, always zero. May be used to indicate | |
233 | changes in this structure in the future. | |
234 | ||
235 | steal: the amount of time in which this vCPU did not run, in | |
236 | nanoseconds. Time during which the vcpu is idle will not be | |
237 | reported as steal time. | |
c1af87dc | 238 | |
3dd3e0ce PX |
239 | preempted: indicates whether the vCPU that owns this struct is running or |
240 | not. Non-zero values mean the vCPU has been preempted. Zero | |
241 | means the vCPU is not preempted. NOTE: it is always zero if the | |
242 | hypervisor doesn't support this field. | |
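Reading the steal field under the version sequence counter might look like the sketch below. The helper names are hypothetical, fixed-width C types stand in for the __u64/__u32/__u8 types, and the acquire fences are an assumption about how a guest would order the reads. The loop retries for as long as the version is odd or changes during the read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Layout as given in the text above. */
struct kvm_steal_time {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

static_assert(sizeof(struct kvm_steal_time) == 64, "unexpected layout");

/* Hypothetical helper: return a consistent copy of the steal counter. */
static uint64_t read_steal_ns(const volatile struct kvm_steal_time *st)
{
	uint32_t version;
	uint64_t steal;

	do {
		version = st->version;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
		steal = st->steal;
		__atomic_thread_fence(__ATOMIC_ACQUIRE);
	} while ((version & 1) || version != st->version);

	return steal;
}

/* Hypothetical helper: non-zero preempted means the vCPU is not running. */
static bool vcpu_was_preempted(const volatile struct kvm_steal_time *st)
{
	return st->preempted != 0;
}
```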
243 | ||
c1af87dc MT |
244 | MSR_KVM_EOI_EN: 0x4b564d04 |
245 | data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0 | |
246 | when disabled. Bit 1 is reserved and must be zero. When PV end of | |
247 | interrupt is enabled (bit 0 set), bits 63-2 hold a 4-byte aligned | |
248 | physical address of a 4 byte memory area which must be in guest RAM and | |
249 | must be zeroed. | |
250 | ||
251 | The first, least significant bit of the 4 byte memory location will be | |
252 | written to by the hypervisor, typically at the time of interrupt | |
253 | injection. Value of 1 means that guest can skip writing EOI to the apic | |
254 | (using MSR or MMIO write); instead, it is sufficient to signal | |
255 | EOI by clearing the bit in guest memory - this location will | |
256 | later be polled by the hypervisor. | |
257 | Value of 0 means that the EOI write is required. | |
258 | ||
259 | It is always safe for the guest to ignore the optimization and perform | |
260 | the APIC EOI write anyway. | |
261 | ||
262 | Hypervisor is guaranteed to only modify this least | |
263 | significant bit while in the current VCPU context; this means that | |
264 | guest does not need to use either lock prefix or memory ordering | |
265 | primitives to synchronise with the hypervisor. | |
266 | ||
267 | However, the hypervisor can set and clear this memory bit at any time. | |
268 | Therefore, to make sure the hypervisor does not interrupt the | |
269 | guest and clear the least significant bit in the memory area | |
270 | in the window between the guest testing it to detect | |
271 | whether it can skip the EOI apic write and the guest | |
272 | clearing it to signal EOI to the hypervisor, | |
273 | the guest must both read the least significant bit in the memory area and | |
274 | clear it using a single CPU instruction, such as test and clear, or | |
275 | compare and exchange.
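The read-and-clear requirement can be illustrated with a C11 atomic exchange, which reads the old value and stores zero in one operation. This is a sketch under assumptions: the function names are hypothetical, the exchange clears the whole 4-byte word rather than only bit 0, and an atomic exchange is stronger than the text requires, since no lock prefix or memory ordering is needed against the hypervisor. A guest may prefer a plain test-and-clear instruction instead.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the MSR value: 4-byte aligned address plus enable bit 0. */
static uint64_t pv_eoi_msr_value(uint64_t word_pa)
{
	assert((word_pa & 3) == 0);	/* also keeps reserved bit 1 zero */
	return word_pa | 1;
}

/*
 * Hypothetical helper: read and clear the PV EOI word in one step.
 * Returns true if bit 0 was set, meaning the APIC EOI write can be
 * skipped; false means the guest must perform the EOI write.
 */
static bool pv_eoi_consume(_Atomic uint32_t *word)
{
	uint32_t old = atomic_exchange_explicit(word, 0,
						memory_order_relaxed);

	return (old & 1u) != 0;
}
```

When pv_eoi_consume() returns false, the guest falls back to the normal APIC EOI write, which the text notes is always safe.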