.. SPDX-License-Identifier: GPL-2.0

===========================
Hypercall Op-codes (hcalls)
===========================

Overview
========

Virtualization on 64-bit Power Book3S platforms is based on the PAPR
specification [1]_, which describes the run-time environment for a guest
operating system and how it should interact with the hypervisor for
privileged operations. Currently there are two PAPR-compliant hypervisors:

- **IBM PowerVM (PHYP)**: IBM's proprietary hypervisor that supports AIX,
  IBM-i and Linux as guests (termed Logical Partitions or LPARs). It
  supports the full PAPR specification.

- **Qemu/KVM**: Supports PPC64 Linux guests running on a PPC64 Linux host,
  though it only implements a subset of the PAPR specification called
  LoPAPR [2]_.

On the PPC64 arch, a guest kernel running on top of a PAPR hypervisor is
called a *pseries guest*. A pseries guest runs in supervisor mode (HV=0) and
must issue hypercalls to the hypervisor whenever it needs to perform an
action that is hypervisor-privileged [3]_ or for other services managed by
the hypervisor.

Hence a hypercall (hcall) is essentially a request by the pseries guest
asking the hypervisor to perform a privileged operation on its behalf. The
guest issues the hcall with the necessary input operands. After performing
the privileged operation, the hypervisor returns a status code and output
operands back to the guest.

HCALL ABI
=========
The ABI specification for an hcall between a pseries guest and a PAPR
hypervisor is covered in section 14.5.3 of ref [2]_. The switch to
hypervisor context is done via the **HVCS** instruction, which expects the
opcode for the hcall to be set in *r3* and any in-arguments for the hcall to
be provided in registers *r4-r12*. If values have to be passed through a
memory buffer, the data stored in that buffer should be in big-endian byte
order.

Once control returns to the guest after the hypervisor has serviced the
'HVCS' instruction, the return value of the hcall is available in *r3* and
any out values are returned in registers *r4-r12*. As with in-arguments, any
out values stored in a memory buffer will be in big-endian byte order.

Powerpc arch code provides convenient wrappers named **plpar_hcall_xxx**,
defined in an arch-specific header [4]_, to issue hcalls from the Linux
kernel running as a pseries guest.

Register Conventions
====================

Any hcall should follow the same register convention as described in
section 2.2.1.1 of the "64-Bit ELF V2 ABI Specification: Power
Architecture" [5]_. The table below summarizes these conventions:

+----------+----------+-------------------------------------------+
| Register | Volatile | Purpose                                   |
| Range    | (Y/N)    |                                           |
+==========+==========+===========================================+
| r0       | Y        | Optional-usage                            |
+----------+----------+-------------------------------------------+
| r1       | N        | Stack Pointer                             |
+----------+----------+-------------------------------------------+
| r2       | N        | TOC                                       |
+----------+----------+-------------------------------------------+
| r3       | Y        | hcall opcode/return value                 |
+----------+----------+-------------------------------------------+
| r4-r10   | Y        | in and out values                         |
+----------+----------+-------------------------------------------+
| r11      | Y        | Optional-usage/Environmental pointer      |
+----------+----------+-------------------------------------------+
| r12      | Y        | Optional-usage/Function entry address at  |
|          |          | global entry point                        |
+----------+----------+-------------------------------------------+
| r13      | N        | Thread-Pointer                            |
+----------+----------+-------------------------------------------+
| r14-r31  | N        | Local Variables                           |
+----------+----------+-------------------------------------------+
| LR       | Y        | Link Register                             |
+----------+----------+-------------------------------------------+
| CTR      | Y        | Loop Counter                              |
+----------+----------+-------------------------------------------+
| XER      | Y        | Fixed-point exception register            |
+----------+----------+-------------------------------------------+
| CR0-1    | Y        | Condition register fields                 |
+----------+----------+-------------------------------------------+
| CR2-4    | N        | Condition register fields                 |
+----------+----------+-------------------------------------------+
| CR5-7    | Y        | Condition register fields                 |
+----------+----------+-------------------------------------------+
| Others   | N        |                                           |
+----------+----------+-------------------------------------------+

DRC & DRC Indexes
=================
::

     DR1                                Guest
     +--+        +------------+       +---------+
     |  | <----> |            |       |  User   |
     +--+  DRC1  |            |  DRC  |  Space  |
                 |    PAPR    | Index +---------+
     DR2         | Hypervisor |       |         |
     +--+        |            |<----->| Kernel  |
     |  | <----> |            | Hcall |         |
     +--+  DRC2  +------------+       +---------+

The PAPR hypervisor terms shared hardware resources like PCI devices,
NVDIMMs etc. that are available for use by LPARs as Dynamic Resources (DR).
When a DR is allocated to an LPAR, PHYP creates a data structure called a
Dynamic Resource Connector (DRC) to manage LPAR access. An LPAR refers to a
DRC via an opaque 32-bit number called the DRC-Index. The DRC-Index value is
provided to the LPAR via the device tree, where it's present as an attribute
in the device tree node associated with the DR.

HCALL Return-values
===================

After servicing the hcall, the hypervisor sets the return value in *r3*
indicating success or failure of the hcall. In case of a failure, an error
code indicates the cause of the error. These codes are defined and
documented in an arch-specific header [4]_.

In some cases an hcall can potentially take a long time and needs to be
issued multiple times in order to be completely serviced. Such hcalls
usually accept an opaque value *continue-token* within their argument list,
and a return value of *H_CONTINUE* indicates that the hypervisor hasn't
finished servicing the hcall yet.

To make such hcalls, the guest needs to set *continue-token == 0* for the
initial call and use the hypervisor-returned value of *continue-token* for
each subsequent hcall, until the hypervisor returns a non-*H_CONTINUE*
return value.
HCALL Op-codes
==============

Below is a partial list of HCALLs that are supported by PHYP. For the
corresponding opcode values please look into the arch-specific header [4]_:

**H_SCM_READ_METADATA**

| Input: *drcIndex, offset, buffer-address, numBytesToRead*
| Out: *numBytesRead*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_Hardware*

Given a DRC Index of an NVDIMM, read N bytes from the metadata area
associated with it at the specified offset and copy them to the provided
buffer. The metadata area stores configuration information such as label
information, bad blocks etc. The metadata area is located out-of-band of
the NVDIMM storage area, hence separate access semantics are provided.

**H_SCM_WRITE_METADATA**

| Input: *drcIndex, offset, data, numBytesToWrite*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_P2, H_P4, H_Hardware*

Given a DRC Index of an NVDIMM, write N bytes to the metadata area
associated with it, at the specified offset and from the provided buffer.

**H_SCM_BIND_MEM**

| Input: *drcIndex, startingScmBlockIndex, numScmBlocksToBind,*
| *targetLogicalMemoryAddress, continue-token*
| Out: *continue-token, targetLogicalMemoryAddress, numScmBlocksToBound*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_P4, H_Overlap,*
| *H_Too_Big, H_P5, H_Busy*

Given a DRC-Index of an NVDIMM, map a continuous range of SCM blocks
*(startingScmBlockIndex, startingScmBlockIndex + numScmBlocksToBind)* to
the guest at *targetLogicalMemoryAddress* within the guest physical address
space. In case *targetLogicalMemoryAddress == 0xFFFFFFFF_FFFFFFFF*, the
hypervisor assigns a target address to the guest. The HCALL can fail if the
guest has an active PTE entry to the SCM block being bound.

**H_SCM_UNBIND_MEM**

| Input: *drcIndex, startingScmLogicalMemoryAddress, numScmBlocksToUnbind*
| Out: *numScmBlocksUnbound*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Overlap,*
| *H_Busy, H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*

Given a DRC-Index of an NVDIMM, unmap *numScmBlocksToUnbind* SCM blocks
starting at *startingScmLogicalMemoryAddress* from the guest physical
address space. The HCALL can fail if the guest has an active PTE entry to
the SCM block being unbound.

**H_SCM_QUERY_BLOCK_MEM_BINDING**

| Input: *drcIndex, scmBlockIndex*
| Out: *Guest-Physical-Address*
| Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*

Given a DRC-Index and an SCM block index, return the guest physical
address to which the SCM block is mapped.

**H_SCM_QUERY_LOGICAL_MEM_BINDING**

| Input: *Guest-Physical-Address*
| Out: *drcIndex, scmBlockIndex*
| Return Value: *H_Success, H_Parameter, H_P2, H_NotFound*

Given a guest physical address, return the DRC Index and SCM block index
mapped to that address.

**H_SCM_UNBIND_ALL**

| Input: *scmTargetScope, drcIndex*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_P2, H_P3, H_In_Use, H_Busy,*
| *H_LongBusyOrder1mSec, H_LongBusyOrder10mSec*

Depending on the target scope, unmap all SCM blocks belonging to all
NVDIMMs, or all SCM blocks belonging to a single NVDIMM identified by its
drcIndex, from the LPAR memory.

**H_SCM_HEALTH**

| Input: *drcIndex*
| Out: *health-bitmap (r4), health-bit-valid-bitmap (r5)*
| Return Value: *H_Success, H_Parameter, H_Hardware*

Given a DRC Index, return the info on predictive failure and overall
health of the PMEM device. The asserted bits in the health-bitmap indicate
one or more states (described in the table below) of the PMEM device, and
the health-bit-valid-bitmap indicates which bits in the health-bitmap are
valid. The bits are reported in reverse bit ordering; for example a value
of 0xC400000000000000 indicates that bits 0, 1, and 5 are valid.

Health Bitmap Flags:

+------+-----------------------------------------------------------------------+
|  Bit |  Definition                                                           |
+======+=======================================================================+
|  00  | PMEM device is unable to persist memory contents.                     |
|      | If the system is powered down, nothing will be saved.                 |
+------+-----------------------------------------------------------------------+
|  01  | PMEM device failed to persist memory contents. Either contents were   |
|      | not saved successfully on power down or were not restored properly on |
|      | power up.                                                             |
+------+-----------------------------------------------------------------------+
|  02  | PMEM device contents are persisted from previous IPL. The data from   |
|      | the last boot were successfully restored.                             |
+------+-----------------------------------------------------------------------+
|  03  | PMEM device contents are not persisted from previous IPL. There was   |
|      | no data to restore from the last boot.                                |
+------+-----------------------------------------------------------------------+
|  04  | PMEM device memory life remaining is critically low                   |
+------+-----------------------------------------------------------------------+
|  05  | PMEM device will be garded off next IPL due to failure                |
+------+-----------------------------------------------------------------------+
|  06  | PMEM device contents cannot persist due to current platform health    |
|      | status. A hardware failure may prevent data from being saved or       |
|      | restored.                                                             |
+------+-----------------------------------------------------------------------+
|  07  | PMEM device is unable to persist memory contents in certain           |
|      | conditions                                                            |
+------+-----------------------------------------------------------------------+
|  08  | PMEM device is encrypted                                              |
+------+-----------------------------------------------------------------------+
|  09  | PMEM device has successfully completed a requested erase or secure    |
|      | erase procedure.                                                      |
+------+-----------------------------------------------------------------------+
|10:63 | Reserved / Unused                                                     |
+------+-----------------------------------------------------------------------+

**H_SCM_PERFORMANCE_STATS**

| Input: *drcIndex, resultBuffer Addr*
| Out: *None*
| Return Value: *H_Success, H_Parameter, H_Unsupported, H_Hardware,*
| *H_Authority, H_Privilege*

Given a DRC Index, collect the performance statistics for the NVDIMM and
copy them to the resultBuffer.

**H_SCM_FLUSH**

| Input: *drcIndex, continue-token*
| Out: *continue-token*
| Return Value: *H_SUCCESS, H_Parameter, H_P2, H_BUSY*

Given a DRC Index, flush the data to the backend NVDIMM device.

The hcall returns H_BUSY when the flush takes a longer time and the hcall
needs to be issued multiple times in order to be completely serviced. The
*continue-token* from the output is to be passed in the argument list of
subsequent hcalls to the hypervisor until the hcall is completely serviced,
at which point H_SUCCESS or another error is returned by the hypervisor.

References
==========
.. [1] "Power Architecture Platform Reference"
       https://en.wikipedia.org/wiki/Power_Architecture_Platform_Reference
.. [2] "Linux on Power Architecture Platform Reference"
       https://members.openpowerfoundation.org/document/dl/469
.. [3] "Definitions and Notation" Book III-Section 14.5.3
       https://openpowerfoundation.org/?resource_lib=power-isa-version-3-0
.. [4] arch/powerpc/include/asm/hvcall.h
.. [5] "64-Bit ELF V2 ABI Specification: Power Architecture"
       https://openpowerfoundation.org/?resource_lib=64-bit-elf-v2-abi-specification-power-architecture