diff --git a/llvm/docs/AArch64SME.rst b/llvm/docs/AArch64SME.rst
new file mode 100644
--- /dev/null
+++ b/llvm/docs/AArch64SME.rst
@@ -0,0 +1,447 @@
+*****************************************************
+Support for AArch64 Scalable Matrix Extension in LLVM
+*****************************************************
+
+.. contents::
+   :local:
+
+1. Introduction
+===============
+
+The :ref:`AArch64 SME ACLE <aarch64_sme_acle>` provides a number of
+attributes for users to control PSTATE.SM and PSTATE.ZA.
+The :ref:`AArch64 SME ABI <aarch64_sme_abi>` describes the requirements for
+calls between functions when at least one of those functions uses PSTATE.SM or
+PSTATE.ZA.
+
+This document describes how the SME ACLE attributes map to LLVM IR
+attributes and how LLVM lowers these attributes to implement the rules and
+requirements of the ABI.
+
+Below we describe the LLVM IR attributes and their relation to the C/C++
+level ACLE attributes:
+
+``aarch64_pstate_sm_enabled``
+  is used for functions with ``__attribute__((arm_streaming))``
+
+``aarch64_pstate_sm_compatible``
+  is used for functions with ``__attribute__((arm_streaming_compatible))``
+
+``aarch64_pstate_sm_body``
+  is used for functions with ``__attribute__((arm_locally_streaming))`` and is
+  only valid on function definitions (not declarations)
+
+``aarch64_pstate_za_new``
+  is used for functions with ``__attribute__((arm_new_za))``
+
+``aarch64_pstate_za_shared``
+  is used for functions with ``__attribute__((arm_shared_za))``
+
+``aarch64_pstate_za_preserved``
+  is used for functions with ``__attribute__((arm_preserves_za))``
+
+Clang must ensure that the above attributes are added both to the
+function's declaration/definition as well as to their call-sites. This is
+important for calls to attributed function pointers, where there is no
+definition or declaration available.
+
+
+2. Handling PSTATE.SM
+=====================
+
+When changing PSTATE.SM the execution of FP/vector operations may be transferred
+to another processing element. This has three important implications:
+
+* The runtime SVE vector length may change.
+
+* The contents of FP/AdvSIMD/SVE registers are zeroed.
+
+* The set of allowable instructions changes.
+
+This leads to certain restrictions on IR and optimizations. For example, it
+is undefined behaviour to share vector-length-dependent state between functions
+that may operate with different values for PSTATE.SM. Front-ends must honour
+these restrictions when generating LLVM IR.
+
+Even though the runtime SVE vector length may change, for the purpose of LLVM IR
+and almost all parts of CodeGen we can assume that the runtime value for
+``vscale`` does not change. If we let the compiler insert the appropriate
+``smstart`` and ``smstop`` instructions around call boundaries, then the effects
+on SVE state can be mitigated. By limiting the state changes to a very brief
+window around the call we can control how the operations are scheduled and how
+live values are preserved between state transitions.
+
+In order to control PSTATE.SM at this level of granularity, we use function and
+callsite attributes rather than intrinsics.
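+
+For example, whether a call must toggle PSTATE.SM is determined entirely by the
+attribute on the call-site. The IR below is a minimal sketch (the names
+``@streaming_fn``, ``@caller`` and ``%fnptr`` are hypothetical) showing the
+attribute on a declaration and on its call-sites, including an indirect call
+where no declaration is available:
+
+.. code-block:: llvm
+
+  declare void @streaming_fn() "aarch64_pstate_sm_enabled"
+
+  define void @caller(ptr %fnptr) nounwind {
+    ; Direct call: the attribute is repeated on the call-site.
+    call void @streaming_fn() "aarch64_pstate_sm_enabled"
+    ; Indirect call: the call-site attribute is the only information the
+    ; compiler has about the callee's streaming interface.
+    call void %fnptr() "aarch64_pstate_sm_enabled"
+    ret void
+  }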
+
+
+Restrictions on attributes
+--------------------------
+
+* It is undefined behaviour to pass or return (pointers to) scalable vector
+  objects to/from functions which may use a different SVE vector length.
+  This includes functions with a non-streaming interface that are marked
+  with ``aarch64_pstate_sm_body``.
+
+* It is not allowed for a function to be decorated with both
+  ``aarch64_pstate_sm_compatible`` and ``aarch64_pstate_sm_enabled``.
+
+* It is not allowed for a function to be decorated with both
+  ``aarch64_pstate_za_new`` and ``aarch64_pstate_za_preserved``.
+
+* It is not allowed for a function to be decorated with both
+  ``aarch64_pstate_za_new`` and ``aarch64_pstate_za_shared``.
+
+These restrictions also apply in the higher-level SME ACLE, which means we can
+emit diagnostics in Clang to warn users about incorrect usage.
+
+
+Compiler inserted streaming-mode changes
+----------------------------------------
+
+The table below describes the transitions in PSTATE.SM that the compiler has to
+account for when lowering calls between functions with different attributes.
+In this table, we use the following abbreviations:
+
+``N``
+  functions with a normal interface (PSTATE.SM=0 on entry, PSTATE.SM=0 on
+  return)
+
+``S``
+  functions with a Streaming interface (PSTATE.SM=1 on entry, PSTATE.SM=1
+  on return)
+
+``SC``
+  functions with a Streaming-Compatible interface (PSTATE.SM can be
+  either 0 or 1 on entry, and is unchanged on return).
+
+Functions with ``__attribute__((arm_locally_streaming))`` are excluded from this
+table because for the caller the attribute is synonymous with 'streaming', and
+for the callee it is merely an implementation detail that is explicitly not
+exposed to the caller.
+
+.. table:: Combinations of calls for functions with different attributes
+
+   ==== ==== ============================== ============================== ==============================
+   From To   Before call                    After call                     After exception
+   ==== ==== ============================== ============================== ==============================
+   N    N
+   N    S    SMSTART                        SMSTOP
+   N    SC
+   S    N    SMSTOP                         SMSTART                        SMSTART
+   S    S                                                                  SMSTART
+   S    SC                                                                 SMSTART
+   SC   N    If PSTATE.SM before call is 1, If PSTATE.SM before call is 1, If PSTATE.SM before call is 1,
+             then SMSTOP                    then SMSTART                   then SMSTART
+   SC   S    If PSTATE.SM before call is 0, If PSTATE.SM before call is 0, If PSTATE.SM before call is 1,
+             then SMSTART                   then SMSTOP                    then SMSTART
+   SC   SC                                                                 If PSTATE.SM before call is 1,
+                                                                           then SMSTART
+   ==== ==== ============================== ============================== ==============================
+
+
+Because changing PSTATE.SM zeroes the FP/vector registers, it is best to emit
+the ``smstart`` and ``smstop`` instructions before register allocation, so that
+the register allocator can spill/reload registers around the mode change.
+
+The compiler should also have sufficient information on which operations are
+part of the call/function's arguments/result and which operations are part of
+the function's body, so that it can place the mode changes in exactly the right
+position. The most suitable place to do this is SelectionDAG, where the call's
+arguments/return values are lowered to implement the specified calling
+convention. SelectionDAG provides Chains and Glue to specify the order of
+operations and gives preliminary control over instruction scheduling.
+
+
+Example of preserving state
+---------------------------
+
+When a function with a normal interface passes and returns a ``float`` value
+to/from a function with a streaming interface, the call-site will need to
+ensure that the argument/result registers are preserved and that no other code
+is scheduled between the ``smstart/smstop`` and the call.
+
+.. code-block:: llvm
+
+  define float @foo(float %f) nounwind {
+    %res = call float @bar(float %f) "aarch64_pstate_sm_enabled"
+    ret float %res
+  }
+
+  declare float @bar(float) "aarch64_pstate_sm_enabled"
+
+The program needs to preserve the value of the floating-point argument and
+return value in register ``s0``:
+
+.. code-block:: none
+
+  foo:                                    // @foo
+  // %bb.0:
+          stp d15, d14, [sp, #-80]!       // 16-byte Folded Spill
+          stp d13, d12, [sp, #16]         // 16-byte Folded Spill
+          stp d11, d10, [sp, #32]         // 16-byte Folded Spill
+          stp d9, d8, [sp, #48]           // 16-byte Folded Spill
+          str x30, [sp, #64]              // 8-byte Folded Spill
+          str s0, [sp, #76]               // 4-byte Folded Spill
+          smstart sm
+          ldr s0, [sp, #76]               // 4-byte Folded Reload
+          bl bar
+          str s0, [sp, #76]               // 4-byte Folded Spill
+          smstop sm
+          ldp d9, d8, [sp, #48]           // 16-byte Folded Reload
+          ldp d11, d10, [sp, #32]         // 16-byte Folded Reload
+          ldp d13, d12, [sp, #16]         // 16-byte Folded Reload
+          ldr s0, [sp, #76]               // 4-byte Folded Reload
+          ldr x30, [sp, #64]              // 8-byte Folded Reload
+          ldp d15, d14, [sp], #80         // 16-byte Folded Reload
+          ret
+
+Setting the correct register masks on the ISD nodes and inserting the
+``smstart/smstop`` in the right places should ensure this is done correctly.
+
+
+Instruction Selection Nodes
+---------------------------
+
+.. code-block:: none
+
+  AArch64ISD::SMSTART Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask]
+  AArch64ISD::SMSTOP  Chain, [SM|ZA|Both], CurrentState, ExpectedState[, RegMask]
+
+The ``SMSTART/SMSTOP`` nodes take ``CurrentState`` and ``ExpectedState``
+operands to support conditional ``smstart/smstop``. The instruction will only
+be executed if ``CurrentState != ExpectedState``.
+
+When ``CurrentState`` and ``ExpectedState`` can be evaluated at compile-time
+(i.e. they are both constants) then an unconditional ``smstart/smstop``
+instruction is emitted. Otherwise the node is matched to a Pseudo instruction
+which expands to a compare/branch and an ``smstart/smstop``. This is necessary
+to implement transitions from ``SC -> N`` and ``SC -> S``.
+
+
+Unchained Function calls
+------------------------
+
+When a function with "``aarch64_pstate_sm_enabled``" calls a function that is
+not streaming-compatible, the compiler has to insert an SMSTOP before the call
+and an SMSTART after the call.
+
+If the function that is called is an intrinsic with no side-effects which in
+turn is lowered to a function call (e.g. ``@llvm.cos()``), then the call to
+``@llvm.cos()`` is not part of any Chain; it can be scheduled freely.
+
+Lowering of a Callsite creates a small chain of nodes which:
+
+- starts a call sequence
+
+- copies input values from virtual registers to physical registers specified by
+  the ABI
+
+- executes a branch-and-link
+
+- stops the call sequence
+
+- copies the output values from their physical registers to virtual registers
+
+When the callsite's Chain is not used, only the result value from the chained
+sequence is used and the Chain itself is discarded.
+
+The ``SMSTART`` and ``SMSTOP`` ISD nodes return a Chain, but no real
+values, so when the ``SMSTART/SMSTOP`` nodes are part of a Chain that isn't
+used, these nodes are not considered for scheduling and are
+removed from the DAG. In order to prevent these nodes
+from being removed, we need a way to ensure the results from the
+``CopyFromReg`` can only be **used after** the ``SMSTART/SMSTOP`` has been
+executed.
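+
+As an illustration of the problem (node numbering and operands are
+hypothetical, written in the same dump notation as the example that follows),
+an ``SMSTART`` whose Chain result has no users is dead and gets pruned:
+
+.. code-block:: none
+
+  t1: ch,glue = ISD::CALL ...
+  t2: res,ch,glue = CopyFromReg t1, ...
+  t3: ch,glue = AArch64ISD::SMSTART t2:1, ...  <- the Chain result of t3 has no
+                                                  users, so the node is removed
+                                                  from the DAG.
+  t6: res = FADD t2, t9                        <- uses t2 directly; nothing
+                                                  orders the FADD after SMSTART.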
+
+We can use a CopyToReg -> CopyFromReg sequence for this, which moves the
+value to/from a virtual register and chains these nodes with the
+SMSTART/SMSTOP to make them part of the expression that calculates
+the result value. The resulting COPY nodes are removed by the register
+allocator.
+
+The example below shows how this is used in a DAG that does not link
+together the result by a Chain, but rather by a value:
+
+.. code-block:: none
+
+  t0: ch,glue = AArch64ISD::SMSTOP ...
+  t1: ch,glue = ISD::CALL ...
+  t2: res,ch,glue = CopyFromReg t1, ...
+  t3: ch,glue = AArch64ISD::SMSTART t2:1, ...  <- this is now part of the
+                                                  expression that returns the
+                                                  result value.
+  t4: ch = CopyToReg t3, Register:f64 %vreg, t2
+  t5: res,ch = CopyFromReg t4, Register:f64 %vreg
+  t6: res = FADD t5, t9
+
+We also need this for locally streaming functions, where an ``SMSTART`` needs
+to be inserted into the DAG at the start of the function.
+
+
+Functions with __attribute__((arm_locally_streaming))
+-----------------------------------------------------
+
+If a function is marked as ``arm_locally_streaming``, then the runtime SVE
+vector length in the prologue/epilogue may be different from the vector length
+in the function's body. This happens because we invoke ``smstart`` after
+setting up the stack-frame and similarly invoke ``smstop`` before deallocating
+the stack-frame.
+
+To ensure that locals are allocated with the correct SVE vector length, we can
+allocate the stack-slots using the streaming vector length through the
+``ADDSVL`` instruction, even when the CPU is not yet in streaming mode.
+
+This only works for locals and not callee-save slots, since LLVM doesn't
+support mixing two different scalable vector lengths in one stack frame. That
+means that the case where a function is marked ``arm_locally_streaming`` and
+needs to spill SVE callee-saves in the prologue is currently unsupported.
+However, it is unlikely for this to happen without user intervention, because
+``arm_locally_streaming`` functions cannot take or return
+vector-length-dependent values; encountering this problem would require
+explicitly forcing the SVE PCS with '``aarch64_sve_pcs``' in combination with
+``arm_locally_streaming``. This combination can be rejected in Clang by
+emitting a diagnostic.
+
+
+An example of how the prologue/epilogue would look for a function that is
+attributed with ``arm_locally_streaming``:
+
+.. code-block:: c++
+
+  #define N 64
+
+  void __attribute__((arm_streaming_compatible)) some_use(svfloat32_t *);
+
+  // Use a float argument type, to check the value isn't clobbered by smstart.
+  // Use a float return type to check the value isn't clobbered by smstop.
+  float __attribute__((noinline, arm_locally_streaming)) foo(float arg) {
+    // Create local for SVE vector to check local is created with correct
+    // size when not yet in streaming mode (ADDSVL).
+    float array[N];
+    svfloat32_t vector;
+
+    some_use(&vector);
+    svst1_f32(svptrue_b32(), &array[0], vector);
+    return array[N - 1] + arg;
+  }
+
+The generated code should use ``ADDSVL`` for allocating the stack space and
+should avoid clobbering the return/argument values:
+
+.. code-block:: none
+
+  _Z3foof:                                // @_Z3foof
+  // %bb.0:                               // %entry
+          stp d15, d14, [sp, #-96]!       // 16-byte Folded Spill
+          stp d13, d12, [sp, #16]         // 16-byte Folded Spill
+          stp d11, d10, [sp, #32]         // 16-byte Folded Spill
+          stp d9, d8, [sp, #48]           // 16-byte Folded Spill
+          stp x29, x30, [sp, #64]         // 16-byte Folded Spill
+          add x29, sp, #64
+          str x28, [sp, #80]              // 8-byte Folded Spill
+          addsvl sp, sp, #-1
+          sub sp, sp, #256
+          str s0, [x29, #28]              // 4-byte Folded Spill
+          smstart sm
+          sub x0, x29, #64
+          addsvl x0, x0, #-1
+          bl _Z10some_usePu13__SVFloat32_t
+          sub x8, x29, #64
+          ptrue p0.s
+          ld1w { z0.s }, p0/z, [x8, #-1, mul vl]
+          ldr s1, [x29, #28]              // 4-byte Folded Reload
+          st1w { z0.s }, p0, [sp]
+          ldr s0, [sp, #252]
+          fadd s0, s0, s1
+          str s0, [x29, #28]              // 4-byte Folded Spill
+          smstop sm
+          ldr s0, [x29, #28]              // 4-byte Folded Reload
+          addsvl sp, sp, #1
+          add sp, sp, #256
+          ldp x29, x30, [sp, #64]         // 16-byte Folded Reload
+          ldp d9, d8, [sp, #48]           // 16-byte Folded Reload
+          ldp d11, d10, [sp, #32]         // 16-byte Folded Reload
+          ldp d13, d12, [sp, #16]         // 16-byte Folded Reload
+          ldr x28, [sp, #80]              // 8-byte Folded Reload
+          ldp d15, d14, [sp], #96         // 16-byte Folded Reload
+          ret
+
+
+Preventing the use of illegal instructions in Streaming Mode
+------------------------------------------------------------
+
+* When executing a program in streaming mode (PSTATE.SM=1) a subset of SVE/SVE2
+  instructions and most AdvSIMD/NEON instructions are invalid.
+
+* When executing a program in normal mode (PSTATE.SM=0), a subset of SME
+  instructions are invalid.
+
+* Streaming-compatible functions must only use instructions that are valid when
+  either PSTATE.SM=0 or PSTATE.SM=1.
+
+The value of PSTATE.SM is not controlled by the feature flags, but rather by
+the function attributes. This means that when compiling for '``+sme``', the
+compiler could generate instructions that are not legal under the function's
+streaming mode. The compiler must therefore use the function attributes to
+ensure it doesn't perform transformations under the assumption that certain
+operations are available at runtime.
+
+We made a conscious choice not to model this with feature flags, because we
+still want to support inline-asm in either mode (with the user placing
+smstart/smstop manually), and this became rather complicated to implement at
+the individual instruction level (see `D120261 <https://reviews.llvm.org/D120261>`_
+and `D121208 <https://reviews.llvm.org/D121208>`_) because of limitations in
+TableGen.
+
+As a first step, this means we'll disable vectorization (LoopVectorize/SLP)
+entirely when a function has any of the ``aarch64_pstate_sm_enabled``,
+``aarch64_pstate_sm_body`` or ``aarch64_pstate_sm_compatible`` attributes,
+in order to avoid the use of vector instructions.
+
+Later on we'll aim to relax these restrictions to enable scalable
+auto-vectorization with a subset of streaming-compatible instructions, but that
+requires changes to the CostModel, Legalization and SelectionDAG lowering.
+
+We will also emit diagnostics in Clang to prevent the use of
+non-streaming(-compatible) operations, e.g. through ACLE intrinsics, when a
+function is decorated with the streaming-mode attributes.
+
+
+Other things to consider
+------------------------
+
+* Inlining must be disabled when the call-site needs to toggle PSTATE.SM or
+  when the callee's function body is executed in a different streaming mode
+  than its caller. This is needed because function calls are the boundaries
+  for streaming-mode changes.
+
+* Tail call optimization must be disabled when the call-site needs to toggle
+  PSTATE.SM, such that the caller can restore the original value of PSTATE.SM
+  (see the sketch below).
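+
+A minimal IR sketch of both points (the function names are hypothetical): a
+non-streaming caller calling a streaming callee.
+
+.. code-block:: llvm
+
+  ; The call-site needs an SMSTART/SMSTOP pair around it, so
+  ; @streaming_callee cannot be inlined into @normal_caller (their bodies run
+  ; in different streaming modes), and even though the call is in tail
+  ; position it cannot become a tail call, because the caller must still
+  ; execute SMSTOP after the call returns.
+  define float @normal_caller(float %x) nounwind {
+    %res = call float @streaming_callee(float %x) "aarch64_pstate_sm_enabled"
+    ret float %res
+  }
+
+  declare float @streaming_callee(float) "aarch64_pstate_sm_enabled"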
+
+
+3. Handling PSTATE.ZA
+=====================
+
+In contrast to PSTATE.SM, enabling PSTATE.ZA does not affect the SVE vector
+length and also doesn't clobber FP/AdvSIMD/SVE registers. This means it is safe
+to toggle PSTATE.ZA using intrinsics. This also makes it simpler to set up a
+lazy-save mechanism for calls to private-ZA functions (i.e. functions that may
+either directly or indirectly clobber ZA state).
+
+For this purpose, we'll introduce a new LLVM IR pass that is run just before
+SelectionDAG.
+
+Setting up a lazy-save
+----------------------
+
+Committing a lazy-save
+----------------------
+
+Exception handling and ZA
+-------------------------
+
+
+4. References
+=============
+
+.. _aarch64_sme_acle:
+
+1. `SME ACLE Pull-request `__
+
+.. _aarch64_sme_abi:
+
+2. `SME ABI Pull-request `__
diff --git a/llvm/docs/UserGuides.rst b/llvm/docs/UserGuides.rst
--- a/llvm/docs/UserGuides.rst
+++ b/llvm/docs/UserGuides.rst
@@ -12,6 +12,7 @@
 .. toctree::
    :hidden:
 
+   AArch64SME
    AddingConstrainedIntrinsics
    AdvancedBuilds
    AliasAnalysis
@@ -228,6 +229,9 @@
 LLVM's support for generating NEON instructions on big endian ARM targets is
 somewhat nonintuitive. This document explains the implementation and rationale.
 
+:doc:`AArch64SME`
+   LLVM's support for AArch64 SME ACLE and ABI.
+
 :doc:`CompileCudaWithLLVM`
    LLVM support for CUDA.