Commit | Line | Data |
---|---|---|
96d32e63 MB |
1 | =================================================== |
2 | Scalable Matrix Extension support for AArch64 Linux | |
3 | =================================================== | |
4 | ||
5 | This document outlines briefly the interface provided to userspace by Linux in | |
6 | order to support use of the ARM Scalable Matrix Extension (SME). | |
7 | ||
8 | This is an outline of the most important features and issues only and is not | |
9 | intended to be exhaustive. It should be read in conjunction with the SVE | |
10 | documentation in sve.rst which provides details on the Streaming SVE mode | |
11 | included in SME. | |
12 | ||
13 | This document does not aim to describe the SME architecture or programmer's | |
14 | model. To aid understanding, a minimal description of relevant programmer's | |
15 | model features for SME is included in Appendix A. | |
16 | ||
17 | ||
18 | 1. General | |
19 | ----------- | |
20 | ||
4edc1174 MB |
21 | * PSTATE.SM, PSTATE.ZA, the streaming mode vector length, the ZA and (when |
22 | present) ZTn register state and TPIDR2_EL0 are tracked per thread. | |
96d32e63 MB |
23 | |
24 | * The presence of SME is reported to userspace via HWCAP2_SME in the aux vector | |
25 | AT_HWCAP2 entry. Presence of this flag implies the presence of the SME | |
26 | instructions and registers, and the Linux-specific system interfaces | |
27 | described in this document. SME is reported in /proc/cpuinfo as "sme". | |
28 | ||
4edc1174 MB |
29 | * The presence of SME2 is reported to userspace via HWCAP2_SME2 in the |
30 | aux vector AT_HWCAP2 entry. Presence of this flag implies the presence of | |
31 | the SME2 instructions and ZT0, and the Linux-specific system interfaces | |
32 | described in this document. SME2 is reported in /proc/cpuinfo as "sme2". | |
33 | ||
96d32e63 MB |
34 | * Support for the execution of SME instructions in userspace can also be |
35 | detected by reading the CPU ID register ID_AA64PFR1_EL1 using an MRS | |
36 | instruction, and checking that the value of the SME field is nonzero. [3] | |
37 | ||
38 | It does not guarantee the presence of the system interfaces described in the | |
39 | following sections: software that needs to verify that those interfaces are | |
40 | present must check for HWCAP2_SME instead. | |
41 | ||
42 | * There are a number of optional SME features; the presence of each is | |
43 | reported through AT_HWCAP2 via: | |
44 | ||
45 | HWCAP2_SME_I16I64 | |
46 | HWCAP2_SME_F64F64 | |
47 | HWCAP2_SME_I8I32 | |
48 | HWCAP2_SME_F16F32 | |
49 | HWCAP2_SME_B16F32 | |
50 | HWCAP2_SME_F32F32 | |
51 | HWCAP2_SME_FA64 | |
4edc1174 | 52 | HWCAP2_SME2 |
96d32e63 MB |
53 | |
54 | This list may be extended over time as the SME architecture evolves. | |
55 | ||
56 | These extensions are also reported via the CPU ID register ID_AA64SMFR0_EL1, | |
57 | which userspace can read using an MRS instruction. See elf_hwcaps.rst and | |
58 | cpu-feature-registers.rst for details. | |
59 | ||
60 | * Debuggers should restrict themselves to interacting with the target via the | |
4edc1174 MB |
61 | NT_ARM_SVE, NT_ARM_SSVE, NT_ARM_ZA and NT_ARM_ZT regsets. The recommended |
62 | way of detecting support for these regsets is to connect to a target process | |
96d32e63 MB |
63 | first and then attempt a |
64 | ||
65 | ptrace(PTRACE_GETREGSET, pid, NT_ARM_<regset>, &iov). | |
66 | ||
67 | * Whenever ZA register values are exchanged in memory between userspace and | |
68 | the kernel, the register value is encoded in memory as a series of horizontal | |
69 | vectors from 0 to VL/8-1 stored in the same endianness-invariant format as is | |
70 | used for SVE vectors. | |
71 | ||
72 | * On thread creation TPIDR2_EL0 is preserved unless CLONE_SETTLS is specified, | |
73 | in which case it is set to 0. | |
74 | ||
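The detection rules above can be sketched as a pair of helpers that test an AT_HWCAP2 value. This is a minimal sketch, not a definitive implementation: the fallback bit positions are assumptions mirroring arch/arm64/include/uapi/asm/hwcap.h, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions, mirroring the arm64 uapi hwcap.h definitions. */
#ifndef HWCAP2_SME
#define HWCAP2_SME  (UINT64_C(1) << 23)
#endif
#ifndef HWCAP2_SME2
#define HWCAP2_SME2 (UINT64_C(1) << 37)
#endif

/* True if the SME instructions and the Linux SME interfaces are present. */
bool sme_abi_present(uint64_t hwcap2)
{
    return (hwcap2 & HWCAP2_SME) != 0;
}

/* True if the SME2 instructions and ZT0 are present. */
bool sme2_abi_present(uint64_t hwcap2)
{
    return (hwcap2 & HWCAP2_SME2) != 0;
}
```

On a Linux system the argument would typically come from getauxval(AT_HWCAP2) in <sys/auxv.h>. Checking the hwcap, rather than ID_AA64PFR1_EL1, is what confirms that the system interfaces are available.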
75 | 2. Vector lengths | |
76 | ------------------ | |
77 | ||
78 | SME defines a second vector length, similar to the SVE vector length, which | |
79 | controls the size of the streaming mode SVE vectors and the ZA matrix array. | |
80 | The ZA matrix is square with each side having as many bytes as a streaming | |
81 | mode SVE vector. | |
82 | ||
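As a worked illustration of the square layout, the following sketch computes the ZA storage size and the position of one horizontal vector for a streaming vector length given in bytes. The function names are hypothetical and are not part of any kernel header.

```c
#include <stddef.h>

/* Total ZA storage in bytes for a streaming vector length of vl bytes. */
size_t za_size_bytes(size_t vl)
{
    return vl * vl;
}

/* Byte offset of horizontal vector 'row' (0 <= row < vl) within ZA. */
size_t za_row_offset(size_t vl, size_t row)
{
    return row * vl;
}
```

For example, a streaming vector length of 32 bytes (256 bits) gives a ZA array of 32 rows of 32 bytes, 1024 bytes in total.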
83 | ||
84 | 3. Sharing of streaming and non-streaming mode SVE state | |
85 | --------------------------------------------------------- | |
86 | ||
87 | It is implementation defined which, if any, parts of the SVE state are shared | |
88 | between streaming and non-streaming modes. When switching between modes | |
89 | via software interfaces such as ptrace, if no register content is provided as | |
90 | part of the switch, no state is assumed to be shared and everything is | |
91 | zeroed. | |
92 | ||
93 | ||
94 | 4. System call behaviour | |
95 | ------------------------- | |
96 | ||
97 | * On syscall PSTATE.ZA is preserved; if PSTATE.ZA==1 then the contents of the | |
4edc1174 | 98 | ZA matrix and ZTn (if present) are preserved. |
96d32e63 MB |
99 | |
100 | * On syscall PSTATE.SM will be cleared and the SVE registers will be handled | |
101 | as per the standard SVE ABI. | |
102 | ||
4edc1174 MB |
103 | * None of the SVE registers, ZA or ZTn are used to pass arguments to |
104 | or receive results from any syscall. | |
96d32e63 MB |
105 | |
106 | * On process creation (e.g. clone()) the newly created process will have | |
107 | PSTATE.SM cleared. | |
108 | ||
109 | * All other SME state of a thread, including the currently configured vector | |
110 | length, the state of the PR_SME_VL_INHERIT flag, and the deferred vector | |
111 | length (if any), is preserved across all syscalls, subject to the specific | |
112 | exceptions for execve() described in section 6. | |
113 | ||
114 | ||
115 | 5. Signal handling | |
116 | ------------------- | |
117 | ||
118 | * Signal handlers are invoked with streaming mode and ZA disabled. | |
119 | ||
17d0c4a2 MB |
120 | * A new signal frame record TPIDR2_MAGIC is added formatted as a struct |
121 | tpidr2_context to allow access to TPIDR2_EL0 from signal handlers. | |
122 | ||
96d32e63 MB |
123 | * A new signal frame record za_context encodes the ZA register contents on |
124 | signal delivery. [1] | |
125 | ||
126 | * The signal frame record for ZA always contains basic metadata, in particular | |
127 | the thread's vector length (in za_context.vl). | |
128 | ||
129 | * The ZA matrix may or may not be included in the record, depending on | |
130 | the value of PSTATE.ZA. The registers are present if and only if: | |
131 | za_context.head.size >= ZA_SIG_CONTEXT_SIZE(sve_vq_from_vl(za_context.vl)) | |
132 | in which case PSTATE.ZA == 1. | |
133 | ||
134 | * If matrix data is present, the remainder of the record has a vl-dependent | |
135 | size and layout. Macros ZA_SIG_* are defined [1] to facilitate access to | |
136 | them. | |
137 | ||
138 | * The matrix is stored as a series of horizontal vectors in the same format as | |
139 | is used for SVE vectors. | |
140 | ||
141 | * If the ZA context is too big to fit in sigcontext.__reserved[], then extra | |
142 | space is allocated on the stack and an extra_context record is written in | |
143 | __reserved[] referencing this space. za_context is then written in the | |
144 | extra space. Refer to [1] for further details about this mechanism. | |
145 | ||
4edc1174 MB |
146 | * If ZTn is supported and PSTATE.ZA==1 then a signal frame record for ZTn will |
147 | be generated. | |
148 | ||
149 | * The signal record for ZTn has magic ZT_MAGIC (0x5a544e01) and consists of a | |
150 | standard signal frame header followed by a struct zt_context specifying | |
151 | the number of ZTn registers supported by the system, then zt_context.nregs | |
152 | blocks of 64 bytes of data per register. | |
153 | ||
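The record rules above can be sketched as follows: walk the records in a reserved area until the terminating null record, then decide whether a ZA record carries matrix data. This is a sketch under stated assumptions. The sketch_* structures are hypothetical local mirrors of the definitions in [1], the ZA magic value is assumed to match ZA_MAGIC there, and the walker does not follow extra_context records.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_ZA_MAGIC 0x54366345u /* assumed to match ZA_MAGIC in [1] */

/* Hypothetical mirrors of struct _aarch64_ctx and struct za_context. */
struct sketch_ctx_head { uint32_t magic; uint32_t size; };
struct sketch_za_context {
    struct sketch_ctx_head head;
    uint16_t vl;
    uint16_t reserved[3];
};

/* Return the byte offset of the first record with 'magic', or -1. */
long sketch_find_record(const unsigned char *buf, size_t len, uint32_t magic)
{
    size_t off = 0;

    while (off + sizeof(struct sketch_ctx_head) <= len) {
        struct sketch_ctx_head h;

        memcpy(&h, buf + off, sizeof(h));
        if (h.magic == 0 && h.size == 0)
            break;                      /* terminating null record */
        if (h.size < sizeof(h) || h.size > len - off)
            break;                      /* malformed record */
        if (h.magic == magic)
            return (long)off;
        off += h.size;
    }
    return -1;
}

/* Matrix data is present iff the record is large enough to hold it. */
int sketch_za_has_data(const struct sketch_za_context *za)
{
    /* Register data starts at the header size rounded up to 16 bytes. */
    size_t regs_offset = (sizeof(*za) + 15) / 16 * 16;

    return za->head.size >= regs_offset + (size_t)za->vl * za->vl;
}
```

In a real handler the buffer would be uc_mcontext.__reserved of the ucontext passed to the handler, and the ZA_SIG_* macros from [1] should be preferred over the hand-written arithmetic shown here.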
96d32e63 MB |
154 | |
155 | 5. Signal return | |
156 | ----------------- | |
157 | ||
158 | When returning from a signal handler: | |
159 | ||
160 | * If there is no za_context record in the signal frame, or if the record is | |
161 | present but contains no register data as described in the previous section, | |
162 | then ZA is disabled. | |
163 | ||
164 | * If za_context is present in the signal frame and contains matrix data then | |
165 | PSTATE.ZA is set to 1 and ZA is populated with the specified data. | |
166 | ||
167 | * The vector length cannot be changed via signal return. If za_context.vl in | |
168 | the signal frame does not match the current vector length, the signal return | |
169 | attempt is treated as illegal, resulting in a forced SIGSEGV. | |
170 | ||
4edc1174 MB |
171 | * If ZTn is not supported or PSTATE.ZA==0 then it is illegal to have a |
172 | signal frame record for ZTn, resulting in a forced SIGSEGV. | |
173 | ||
96d32e63 MB |
174 | |
175 | 6. prctl extensions | |
176 | -------------------- | |
177 | ||
178 | Some new prctl() calls are added to allow programs to manage the SME vector | |
179 | length: | |
180 | ||
181 | prctl(PR_SME_SET_VL, unsigned long arg) | |
182 | ||
183 | Sets the vector length of the calling thread and related flags, where | |
184 | arg == vl | flags. Other threads of the calling process are unaffected. | |
185 | ||
186 | vl is the desired vector length, where sve_vl_valid(vl) must be true. | |
187 | ||
188 | flags: | |
189 | ||
190 | PR_SME_VL_INHERIT | |
191 | ||
192 | Inherit the current vector length across execve(). Otherwise, the | |
193 | vector length is reset to the system default at execve(). (See | |
194 | Section 9.) | |
195 | ||
196 | PR_SME_SET_VL_ONEXEC | |
197 | ||
198 | Defer the requested vector length change until the next execve() | |
199 | performed by this thread. | |
200 | ||
201 | The effect is equivalent to implicit execution of the following | |
202 | call immediately after the next execve() (if any) by the thread: | |
203 | ||
204 | prctl(PR_SME_SET_VL, arg & ~PR_SME_SET_VL_ONEXEC) | |
205 | ||
206 | This allows launching of a new program with a different vector | |
207 | length, while avoiding runtime side effects in the caller. | |
208 | ||
209 | Without PR_SME_SET_VL_ONEXEC, the requested change takes effect | |
210 | immediately. | |
211 | ||
212 | ||
213 | Return value: a nonnegative value on success, or a negative value on error: | |
214 | EINVAL: SME not supported, invalid vector length requested, or | |
215 | invalid flags. | |
216 | ||
217 | ||
218 | On success: | |
219 | ||
220 | * Either the calling thread's vector length or the deferred vector length | |
221 | to be applied at the next execve() by the thread (dependent on whether | |
222 | PR_SME_SET_VL_ONEXEC is present in arg), is set to the largest value | |
223 | supported by the system that is less than or equal to vl. If vl == | |
224 | SVE_VL_MAX, the value set will be the largest value supported by the | |
225 | system. | |
226 | ||
227 | * Any previously outstanding deferred vector length change in the calling | |
228 | thread is cancelled. | |
229 | ||
230 | * The returned value describes the resulting configuration, encoded as for | |
231 | PR_SME_GET_VL. The vector length reported in this value is the new | |
232 | current vector length for this thread if PR_SME_SET_VL_ONEXEC was not | |
233 | present in arg; otherwise, the reported vector length is the deferred | |
234 | vector length that will be applied at the next execve() by the calling | |
235 | thread. | |
236 | ||
4edc1174 MB |
237 | * Changing the vector length causes all of ZA, ZTn, P0..P15, FFR and all |
238 | bits of Z0..Z31 except for Z0 bits [127:0] .. Z31 bits [127:0] to become | |
96d32e63 MB |
239 | unspecified, including both streaming and non-streaming SVE state. |
240 | Calling PR_SME_SET_VL with vl equal to the thread's current vector | |
241 | length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag, | |
242 | does not constitute a change to the vector length for this purpose. | |
243 | ||
244 | * Changing the vector length causes PSTATE.ZA and PSTATE.SM to be cleared. | |
245 | Calling PR_SME_SET_VL with vl equal to the thread's current vector | |
246 | length, or calling PR_SME_SET_VL with the PR_SME_SET_VL_ONEXEC flag, | |
247 | does not constitute a change to the vector length for this purpose. | |
248 | ||
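The argument encoding and the "largest supported value less than or equal to vl" rule can be illustrated with a small model. This is only a sketch: the list of supported lengths is a made-up input, the function names are hypothetical, and the flag value is an assumption mirroring include/uapi/linux/prctl.h. The case where no supported length is small enough is not described in this document, so the sketch simply returns 0 for it.

```c
#include <stddef.h>

#ifndef PR_SME_SET_VL_ONEXEC
#define PR_SME_SET_VL_ONEXEC (1 << 18) /* assumed uapi value */
#endif

/* Build the prctl(PR_SME_SET_VL, arg) argument: arg == vl | flags. */
unsigned long sketch_sme_set_vl_arg(unsigned long vl, unsigned long flags)
{
    return vl | flags;
}

/* Largest supported length <= requested, or 0 if there is none. */
unsigned int sketch_pick_vl(unsigned int requested,
                            const unsigned int *supported, size_t n)
{
    unsigned int best = 0;

    for (size_t i = 0; i < n; i++)
        if (supported[i] <= requested && supported[i] > best)
            best = supported[i];
    return best;
}
```

With supported lengths of 16, 32 and 64 bytes, a request for 48 would select 32, and a request for a very large value would select 64.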
249 | ||
250 | prctl(PR_SME_GET_VL) | |
251 | ||
252 | Gets the vector length of the calling thread. | |
253 | ||
254 | The following flag may be OR-ed into the result: | |
255 | ||
256 | PR_SME_VL_INHERIT | |
257 | ||
258 | Vector length will be inherited across execve(). | |
259 | ||
260 | There is no way to determine whether there is an outstanding deferred | |
261 | vector length change (which would only normally be the case between a | |
262 | fork() or vfork() and the corresponding execve() in typical use). | |
263 | ||
264 | To extract the vector length from the result, bitwise and it with | |
265 | PR_SME_VL_LEN_MASK. | |
266 | ||
267 | Return value: a nonnegative value on success, or a negative value on error: | |
268 | EINVAL: SME not supported. | |
269 | ||
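Decoding the value returned by PR_SME_GET_VL (or by a successful PR_SME_SET_VL) might look like the sketch below. The fallback constants are assumptions mirroring include/uapi/linux/prctl.h, and the structure and function names are hypothetical.

```c
#include <stdbool.h>

#ifndef PR_SME_VL_LEN_MASK
#define PR_SME_VL_LEN_MASK 0xffff    /* assumed uapi value */
#endif
#ifndef PR_SME_VL_INHERIT
#define PR_SME_VL_INHERIT  (1 << 17) /* assumed uapi value */
#endif

struct sketch_sme_vl {
    unsigned int vl; /* vector length in bytes */
    bool inherit;    /* PR_SME_VL_INHERIT is set */
};

/* Split a nonnegative prctl() result into vector length and flag. */
struct sketch_sme_vl sketch_decode_sme_vl(long ret)
{
    struct sketch_sme_vl out;

    out.vl = (unsigned int)(ret & PR_SME_VL_LEN_MASK);
    out.inherit = (ret & PR_SME_VL_INHERIT) != 0;
    return out;
}
```

A caller would first check that the prctl() result is nonnegative; a negative result with errno set to EINVAL indicates that SME is not supported.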
270 | ||
271 | 7. ptrace extensions | |
272 | --------------------- | |
273 | ||
274 | * A new regset NT_ARM_SSVE is defined for access to streaming mode SVE | |
275 | state via PTRACE_GETREGSET and PTRACE_SETREGSET; this is documented in | |
276 | sve.rst. | |
277 | ||
278 | * A new regset NT_ARM_ZA is defined for access to ZA state via | |
279 | PTRACE_GETREGSET and PTRACE_SETREGSET. | |
280 | ||
281 | Refer to [2] for definitions. | |
282 | ||
283 | The regset data starts with struct user_za_header, containing: | |
284 | ||
285 | size | |
286 | ||
287 | Size of the complete regset, in bytes. | |
288 | This depends on vl and possibly on other things in the future. | |
289 | ||
290 | If a call to PTRACE_GETREGSET requests less data than the value of | |
291 | size, the caller can allocate a larger buffer and retry in order to | |
292 | read the complete regset. | |
293 | ||
294 | max_size | |
295 | ||
296 | Maximum size in bytes that the regset can grow to for the target | |
297 | thread. The regset won't grow bigger than this even if the target | |
298 | thread changes its vector length etc. | |
299 | ||
300 | vl | |
301 | ||
302 | Target thread's current streaming vector length, in bytes. | |
303 | ||
304 | max_vl | |
305 | ||
306 | Maximum possible streaming vector length for the target thread. | |
307 | ||
308 | flags | |
309 | ||
310 | Zero or more of the following flags, which have the same | |
311 | meaning and behaviour as the corresponding PR_SME_* prctl flags: | |
312 | ||
313 | SME_PT_VL_INHERIT | |
314 | ||
315 | SME_PT_VL_ONEXEC (SETREGSET only). | |
316 | ||
317 | * The effects of changing the vector length and/or flags are equivalent to | |
318 | those documented for PR_SME_SET_VL. | |
319 | ||
320 | The caller must make a further GETREGSET call if it needs to know what VL is | |
321 | actually set by SETREGSET, unless it is known in advance that the requested | |
322 | VL is supported. | |
323 | ||
324 | * The size and layout of the payload depend on the header fields. The | |
325 | SME_PT_ZA_*() macros are provided to facilitate access to the data. | |
326 | ||
327 | * For SETREGSET it is permissible to omit the payload, in which | |
328 | case the vector length and flags are changed and PSTATE.ZA is set to 0 | |
329 | (along with any consequences of those changes). If a payload is provided | |
330 | then PSTATE.ZA will be set to 1. | |
331 | ||
332 | * For SETREGSET, if the requested VL is not supported, the effect will be the | |
333 | same as if the payload were omitted, except that an EIO error is reported. | |
334 | No attempt is made to translate the payload data to the correct layout | |
335 | for the vector length actually set. It is up to the caller to translate the | |
336 | payload layout for the actual VL and retry. | |
337 | ||
338 | * The effect of writing a partial, incomplete payload is unspecified. | |
339 | ||
4edc1174 MB |
340 | * A new regset NT_ARM_ZT is defined for access to ZTn state via |
341 | PTRACE_GETREGSET and PTRACE_SETREGSET. | |
342 | ||
343 | * The NT_ARM_ZT regset consists of a single 512 bit register. | |
344 | ||
345 | * When PSTATE.ZA==0 reads of NT_ARM_ZT will report all bits of ZTn as 0. | |
346 | ||
347 | * Writes to NT_ARM_ZT will set PSTATE.ZA to 1. | |
348 | ||
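A debugger's handling of the NT_ARM_ZA header can be sketched as below. The structure is a hypothetical local mirror of struct user_za_header from [2]. The sketch assumes that the payload begins at the header size rounded up to 16 bytes, and that a regset with no payload reports a size covering the header only; neither detail is stated in this document.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of struct user_za_header from [2]. */
struct sketch_user_za_header {
    uint32_t size;
    uint32_t max_size;
    uint16_t vl;
    uint16_t max_vl;
    uint16_t flags;
    uint16_t reserved;
};

/* Assumed payload offset: header size rounded up to 16 bytes. */
size_t sketch_za_pt_offset(void)
{
    return (sizeof(struct sketch_user_za_header) + 15) / 16 * 16;
}

/* Full regset size for a vector length of vl bytes with ZA data. */
size_t sketch_za_pt_size(uint16_t vl)
{
    return sketch_za_pt_offset() + (size_t)vl * vl;
}

/* True if the regset described by the header includes ZA data. */
int sketch_za_pt_has_payload(const struct sketch_user_za_header *h)
{
    return h->size >= sketch_za_pt_size(h->vl);
}

/* True if a GETREGSET into buf_len bytes should be retried larger. */
int sketch_za_pt_need_retry(const struct sketch_user_za_header *h,
                            size_t buf_len)
{
    return h->size > buf_len;
}
```

A typical flow is to read just the header, call sketch_za_pt_need_retry(), and then repeat PTRACE_GETREGSET with a buffer of at least header.size bytes.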
96d32e63 MB |
349 | |
350 | 8. ELF coredump extensions | |
351 | --------------------------- | |
352 | ||
353 | * NT_ARM_SSVE notes will be added to each coredump for | |
354 | each thread of the dumped process. The contents will be equivalent to the | |
355 | data that would have been read if a PTRACE_GETREGSET of the corresponding | |
356 | type were executed for each thread when the coredump was generated. | |
357 | ||
358 | * A NT_ARM_ZA note will be added to each coredump for each thread of the | |
359 | dumped process. The contents will be equivalent to the data that would have | |
360 | been read if a PTRACE_GETREGSET of NT_ARM_ZA were executed for each thread | |
361 | when the coredump was generated. | |
362 | ||
4edc1174 MB |
363 | * A NT_ARM_ZT note will be added to each coredump for each thread of the |
364 | dumped process. The contents will be equivalent to the data that would have | |
365 | been read if a PTRACE_GETREGSET of NT_ARM_ZT were executed for each thread | |
366 | when the coredump was generated. | |
367 | ||
f285da05 MB |
368 | * The NT_ARM_TLS note will be extended to two registers, the second register |
369 | will contain TPIDR2_EL0 on systems that support SME and will be read as | |
370 | zero with writes ignored otherwise. | |
96d32e63 MB |
371 | |
372 | 9. System runtime configuration | |
373 | -------------------------------- | |
374 | ||
375 | * To mitigate the ABI impact of expansion of the signal frame, a policy | |
376 | mechanism is provided for administrators, distro maintainers and developers | |
377 | to set the default vector length for userspace processes: | |
378 | ||
379 | /proc/sys/abi/sme_default_vector_length | |
380 | ||
381 | Writing the text representation of an integer to this file sets the system | |
382 | default vector length to the specified value, unless the value is greater | |
383 | than the maximum vector length supported by the system in which case the | |
384 | default vector length is set to that maximum. | |
385 | ||
386 | The result can be determined by reopening the file and reading its | |
387 | contents. | |
388 | ||
389 | At boot, the default vector length is initially set to 32 or the maximum | |
390 | supported vector length, whichever is smaller and supported. This | |
391 | determines the initial vector length of the init process (PID 1). | |
392 | ||
393 | Reading this file returns the current system default vector length. | |
394 | ||
395 | * At every execve() call, the new vector length of the new process is set to | |
396 | the system default vector length, unless | |
397 | ||
398 | * PR_SME_VL_INHERIT (or equivalently SME_PT_VL_INHERIT) is set for the | |
399 | calling thread, or | |
400 | ||
401 | * a deferred vector length change is pending, established via the | |
402 | PR_SME_SET_VL_ONEXEC flag (or SME_PT_VL_ONEXEC). | |
403 | ||
404 | * Modifying the system default vector length does not affect the vector length | |
405 | of any existing process or thread that does not make an execve() call. | |
406 | ||
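Reading the system default and modelling the clamp applied on write can be sketched as follows. The function names are hypothetical. The clamp helper only models the "greater than the maximum" rule described above and makes no claim about how other unsupported values are handled.

```c
#include <stdio.h>

/* Read the default vector length from 'path'; return -1 on failure. */
long sketch_read_default_vl(const char *path)
{
    FILE *f = fopen(path, "r");
    long vl = -1;

    if (!f)
        return -1; /* e.g. SME not supported, so the file is absent */
    if (fscanf(f, "%ld", &vl) != 1)
        vl = -1;
    fclose(f);
    return vl;
}

/* Model of the write rule: values above the maximum become the maximum. */
unsigned int sketch_clamp_default_vl(unsigned int requested,
                                     unsigned int max_vl)
{
    return requested > max_vl ? max_vl : requested;
}
```

The path to pass is /proc/sys/abi/sme_default_vector_length. After writing a new value, reopening and rereading the file shows the value that actually took effect.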
407 | ||
408 | Appendix A. SME programmer's model (informative) | |
409 | ================================================= | |
410 | ||
f539316f | 411 | This section provides a minimal description of the additions made by SME to the |
96d32e63 MB |
412 | ARMv8-A programmer's model that are relevant to this document. |
413 | ||
414 | Note: This section is for information only and is not intended to be complete or | |
415 | to replace any architectural specification. | |
416 | ||
417 | A.1. Registers | |
418 | --------------- | |
419 | ||
420 | In A64 state, SME adds the following: | |
421 | ||
422 | * A new mode, streaming mode, in which a subset of the normal FPSIMD and SVE | |
423 | features are available. When supported EL0 software may enter and leave | |
424 | streaming mode at any time. | |
425 | ||
426 | For best system performance, software is strongly encouraged to enable | |
427 | streaming mode only when it is actively being used. | |
428 | ||
429 | * A new vector length controlling the size of ZA and the Z registers when in | |
430 | streaming mode, separately to the vector length used for SVE when not in | |
431 | streaming mode. There is no requirement that either the currently selected | |
432 | vector length or the set of vector lengths supported for the two modes in | |
433 | a given system have any relationship. The streaming mode vector length | |
434 | is referred to as SVL. | |
435 | ||
436 | * A new ZA matrix register. This is a square matrix of SVLxSVL bits. Most | |
437 | operations on ZA require that streaming mode be enabled but ZA can be | |
438 | enabled without streaming mode in order to load, save and retain data. | |
439 | ||
440 | For best system performance, software is strongly encouraged to enable | |
441 | ZA only when it is actively being used. | |
442 | ||
4edc1174 MB |
443 | * A new ZT0 register is introduced when SME2 is present. This is a 512 bit |
444 | register which is accessible when PSTATE.ZA is set, as ZA itself is. | |
445 | ||
96d32e63 MB |
446 | * Two new 1 bit fields in PSTATE which may be controlled via the SMSTART and |
447 | SMSTOP instructions or by access to the SVCR system register: | |
448 | ||
449 | * PSTATE.ZA: if this is 1 then the ZA matrix is accessible and has valid | |
450 | data, while if it is 0 then ZA cannot be accessed. When PSTATE.ZA is | |
451 | changed from 0 to 1 all bits in ZA are cleared. | |
452 | ||
453 | * PSTATE.SM, if this is 1 then the PE is in streaming mode. When the value | |
454 | of PSTATE.SM is changed then it is implementation defined if the subset | |
455 | of the floating point register bits valid in both modes may be retained. | |
456 | Any other bits will be cleared. | |
457 | ||
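The two PSTATE fields can be illustrated by decoding a value read from SVCR. The bit positions used here (SM in bit 0, ZA in bit 1) are taken from the architecture definition of SVCR, not from this document, and the names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_SVCR_SM (UINT64_C(1) << 0) /* assumed: PSTATE.SM */
#define SKETCH_SVCR_ZA (UINT64_C(1) << 1) /* assumed: PSTATE.ZA */

/* True if the PE is in streaming mode. */
bool sketch_streaming_enabled(uint64_t svcr)
{
    return (svcr & SKETCH_SVCR_SM) != 0;
}

/* True if ZA (and ZT0, when SME2 is present) is accessible. */
bool sketch_za_enabled(uint64_t svcr)
{
    return (svcr & SKETCH_SVCR_ZA) != 0;
}
```

The two bits are independent, so a value of 2 describes ZA enabled outside streaming mode, the state used to load, save and retain ZA data.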
458 | ||
459 | References | |
460 | ========== | |
461 | ||
462 | [1] arch/arm64/include/uapi/asm/sigcontext.h | |
463 | AArch64 Linux signal ABI definitions | |
464 | ||
465 | [2] arch/arm64/include/uapi/asm/ptrace.h | |
466 | AArch64 Linux ptrace ABI definitions | |
467 | ||
468 | [3] Documentation/arm64/cpu-feature-registers.rst |