This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/docs/
-
docs/
2/5
AArch64SME.rst
-
UserGuides.rst

Differential D131562

[AArch64][SME] Document SME ABI implementation in LLVM
ClosedPublic

Authored by sdesmalen on Aug 10 2022, 5:44 AM.

Download Raw Diff

Details

Reviewers

efriedma
aemerson
t.p.northover
paulwalker-arm
sagarkulkarni19

Commits

rG4fc2c922fe12: [AArch64][SME] Document SME ABI implementation in LLVM

Summary

Adds a design document for implementing the SME ABI in LLVM. This document
can be used as a reference for follow-up patches that attempt to implement
the ABI.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

sdesmalen created this revision.Aug 10 2022, 5:44 AM

Herald added a project: Restricted Project. · View Herald TranscriptAug 10 2022, 5:45 AM

Herald added a subscriber: kristof.beyls. · View Herald Transcript

sdesmalen requested review of this revision.Aug 10 2022, 5:45 AM

Herald added a project: Restricted Project. · View Herald TranscriptAug 10 2022, 5:45 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

sdesmalen edited the summary of this revision. (Show Details)Aug 10 2022, 5:54 AM

tschuett added a subscriber: tschuett.Aug 10 2022, 6:21 AM

Harbormaster completed remote builds in B180365: Diff 451429.Aug 10 2022, 7:07 AM

sdesmalen mentioned this in D131570: [AArch64][SME] Add utility class for handling SME attributes..Aug 10 2022, 7:39 AM

sdesmalen mentioned this in D131576: [AArch64][SME] Implement ABI for calls to/from streaming functions..Aug 10 2022, 8:15 AM

sdesmalen added a child revision: D131576: [AArch64][SME] Implement ABI for calls to/from streaming functions..Aug 10 2022, 8:21 AM

sdesmalen mentioned this in D131578: [AArch64][SME] Implement ABI for calls from streaming-compatible functions..Aug 10 2022, 8:39 AM

sdesmalen added a child revision: D131578: [AArch64][SME] Implement ABI for calls from streaming-compatible functions..Aug 10 2022, 8:40 AM

sdesmalen mentioned this in D131579: [AArch64][SME] Disable tail-call optimization when streaming mode change or lazy-save may be required..Aug 10 2022, 8:42 AM

sdesmalen added a child revision: D131579: [AArch64][SME] Disable tail-call optimization when streaming mode change or lazy-save may be required..Aug 10 2022, 8:42 AM

sdesmalen mentioned this in D131581: [AArch64][SME] Disable inlining when SME attributes require smstart/smstop or lazy-save..Aug 10 2022, 8:44 AM

sdesmalen added a child revision: D131581: [AArch64][SME] Disable inlining when SME attributes require smstart/smstop or lazy-save..Aug 10 2022, 8:44 AM

sdesmalen mentioned this in D131582: [AArch64][SME] Add support for arm_locally_streaming functions..Aug 10 2022, 8:48 AM

sdesmalen added a child revision: D131582: [AArch64][SME] Add support for arm_locally_streaming functions..Aug 10 2022, 8:49 AM

shruthiashwath added a subscriber: shruthiashwath.Aug 10 2022, 8:51 PM

In one diff you disable inlining for SME functions. I am afraid that there are more incompatible transformations on functions, i.e., function merging, hot-cold splitting, outlining, ...

You state that the vscale can change during the execution of the program. I would have expected a LangRef update.

sdesmalen added reviewers: efriedma, aemerson, t.p.northover, paulwalker-arm, sagarkulkarni19.Aug 11 2022, 6:47 AM

sdesmalen added subscribers: kmclaughlin, david-arm.

I agree a LangRef update makes sense if we're loosening the constraint that vscale is constant across the whole program. Might need some adjustments to make various transforms work; in particular, I suspect GEP constant expressions involving vscale break.

Are aarch64_pstate_sm_compatible functions allowed to call arbitrary functions? I guess there's some way to save/restore PSTATE.SM?

llvm/docs/AArch64SME.rst
279	Not sure the CopyToReg+CopyFromReg thing is actually enough to prevent all the transforms you want to prevent... if I'm following correctly, you really need to prevent any operations from showing up in the middle of the call sequence, and the SelectionDAG scheduler isn't really set up to enforce that sort of constraint. Would be more obviously correct to use a pseduo-instruction for the call, and expand it to smstop+call+smstart using a post-isel hook.
382	You could use a feature flag that reflects the state of PSTATE.SM, but doesn't actually enable/disable any instructions. But I guess that's basically the same thing as a function attribute.
398	Disabling vectorization might not be enough to completely prevent the use of vector instructions, but probably close enough. (For example, we use vector instructions to lower popcount.)

In one diff you disable inlining for SME functions. I am afraid that there are more incompatible transformations on functions, i.e., function merging, hot-cold splitting, outlining, ...

Having different functions using different instruction set extensions isn't that unusual. We support that sort of thing on many different targets, including x86. (I haven't looked at the diffs to see if it's using the usual target hooks for this.)

Would it help to have a idk v2scale or vmscale for the matrix mode and vscale for the non-matrix mode. Then both can be constant over time, but in some states it would invalid to reason about one but not about the other.

In D131562#3715184, @tschuett wrote:

You state that the vscale can change during the execution of the program. I would have expected a LangRef update.

Enabling streaming mode means FP/vector code may be executed on a different processing element. This has implications beyond having a different SVE vector length since it also clears register state and changes the set of allowable instructions. For the purpose of LLVM IR it is sufficient to support only a single vscale as long as the input IR is restricted to not pass/return (pointers to) scalable vectors between such functions and any streaming mode changes happen only on function-call boundaries. The SME ACLE has similar restrictions, so Clang emitting these attributes will ensure these restrictions are honoured in the IR.

We'd rather not open the can of worms of supporting multiple vscales because this is not really a capability we need from LLVM. Even if it was a generic capability in LLVM then there would still be other issues we'd need to tackle in a similar way as outlined in this document, so I'm not sure if it's worth making a change to the LangRef.

llvm/docs/AArch64SME.rst
279	the SelectionDAG scheduler isn't really set up to enforce that sort of constraint It seems that this is exactly what Glue is meant to do. It tells the SelectionDAG scheduler to put these operations in the same SUnit node, which means they get scheduled together without anything else being scheduled in between. Most of the call-chain is glued together in this way. Would be more obviously correct to use a pseduo-instruction for the call, and expand it to smstop+call+smstart using a post-isel hook. I experimented with that, but found that replacing the ISD::CALL with a pseudo node is not really feasible. There are some complications to this approach: At the time of expanding the CALL itself, the operands will already have been lowered to the callconv-defined registers, whereas we need to change PSTATE.SM before that to give the register allocator a chance to spill/reload these registers (while still virtual registers) before it moves them to physical registers defined by the cc. There is a similar problem after the call; it first needs to move the registers to virtual registers before it can invoke `smstart/smstop`. Even if we'd be happy to with the idea of finding an insertion point for smstart/smstop after selection, then we'd need to fiddle with the scheduling afterwards to move the `smstart/smstop` instructions before and after the COPY nodes. For this we need to know which operations are part of the Call, and which are part of the program. But when finding the right insertion point after selection, this information has been lost. This gets even more complicated to recognise when values are passed by reference or via the stack.
398	Yes you're right, we'll probably find cases where we need to limit the code-generator to lower things differently when in streaming mode.

Rephrased section 2 (Handling PSTATE.SM) and replaced some uses of vscale with SVE vector length.

Harbormaster completed remote builds in B181294: Diff 452683.Aug 15 2022, 9:41 AM

Matt added a subscriber: Matt.Aug 15 2022, 1:21 PM

sdesmalen mentioned this in rGcf72dddaefe9: [AArch64][SME] Add utility class for handling SME attributes..Sep 12 2022, 5:46 AM

Gentle ping. Does anyone have some more feedback?

Herald added a subscriber: • pcwang-thead. · View Herald TranscriptSep 12 2022, 6:42 AM

For the purpose of LLVM IR it is sufficient to support only a single vscale as long as the input IR is restricted to not pass/return (pointers to) scalable vectors between such functions and any streaming mode changes happen only on function-call boundaries.

Sure, that makes sense: supporting multiple different scales within a function would be complicated.

But you still need a LangRef change. Interprocedural optimizations need to know that "vscale" computed in the callee might be different from "vscale" computed in the caller. As a practical matter, it probably affects very few optimizations, but certain optimizations might need to be aware of it. (At the moment, this is most likely to be relevant for GEP constant expressions, since we don't really do interprocedural hoisting or sinking. But in theory it could also be relevant for calls to llvm.vscale.)

In D131562#3787300, @efriedma wrote:

For the purpose of LLVM IR it is sufficient to support only a single vscale as long as the input IR is restricted to not pass/return (pointers to) scalable vectors between such functions and any streaming mode changes happen only on function-call boundaries.

Sure, that makes sense: supporting multiple different scales within a function would be complicated.

But you still need a LangRef change. Interprocedural optimizations need to know that "vscale" computed in the callee might be different from "vscale" computed in the caller. As a practical matter, it probably affects very few optimizations, but certain optimizations might need to be aware of it. (At the moment, this is most likely to be relevant for GEP constant expressions, since we don't really do interprocedural hoisting or sinking. But in theory it could also be relevant for calls to llvm.vscale.)

Thanks, I think that's fair enough, we can tone down the wording in the LangRef a bit.

LGTM

This revision is now accepted and ready to land.Sep 15 2022, 3:46 PM

sdesmalen mentioned this in rGb00c36c2958b: [AArch64][SME] Implement ABI for calls to/from streaming functions..Sep 16 2022, 7:08 AM

Closed by commit rG4fc2c922fe12: [AArch64][SME] Document SME ABI implementation in LLVM (authored by sdesmalen). · Explain WhySep 16 2022, 7:49 AM

This revision was automatically updated to reflect the committed changes.

sdesmalen added a commit: rG4fc2c922fe12: [AArch64][SME] Document SME ABI implementation in LLVM.

sdesmalen mentioned this in rGbd4935c175ad: [AArch64][SME] Implement ABI for calls from streaming-compatible functions..

sdesmalen mentioned this in rG5fae000f3610: [AArch64][SME] Disable tail-call optimization when streaming mode change or….Sep 17 2022, 9:22 AM

david-arm mentioned this in rG64bef3d5688b: [AArch64][SME] Disable inlining when SME attributes require smstart/smstop or….Sep 21 2022, 1:36 AM

paulwalker-arm mentioned this in D134648: [LangRef] Update text for vscale to be more flexible but maintain original intent..Sep 26 2022, 8:34 AM

sdesmalen mentioned this in rG02df03c5b7ae: [AArch64][SME] Add support for arm_locally_streaming functions..Oct 14 2022, 6:48 AM

Revision Contents

Path

Size

llvm/

docs/

AArch64SME.rst

447 lines

UserGuides.rst

4 lines

Diff 460759

llvm/docs/AArch64SME.rst

This file was added.

				*****************************************************
				Support for AArch64 Scalable Matrix Extension in LLVM
				*****************************************************

				.. contents::
				:local:

				1. Introduction
				===============

				The :ref:`AArch64 SME ACLE <aarch64_sme_acle>` provides a number of
				attributes for users to control PSTATE.SM and PSTATE.ZA.
				The :ref:`AArch64 SME ABI<aarch64_sme_abi>` describes the requirements for
				calls between functions when at least one of those functions uses PSTATE.SM or
				PSTATE.ZA.

				This document describes how the SME ACLE attributes map to LLVM IR
				attributes and how LLVM lowers these attributes to implement the rules and
				requirements of the ABI.

				Below we describe the LLVM IR attributes and their relation to the C/C++
				level ACLE attributes:

				``aarch64_pstate_sm_enabled``
				is used for functions with ``__attribute__((arm_streaming))``

				``aarch64_pstate_sm_compatible``
				is used for functions with ``__attribute__((arm_streaming_compatible))``

				``aarch64_pstate_sm_body``
				is used for functions with ``__attribute__((arm_locally_streaming))`` and is
				only valid on function definitions (not declarations)

				``aarch64_pstate_za_new``
				is used for functions with ``__attribute__((arm_new_za))``

				``aarch64_pstate_za_shared``
				is used for functions with ``__attribute__((arm_shared_za))``

				``aarch64_pstate_za_preserved``
				is used for functions with ``__attribute__((arm_preserves_za))``

				Clang must ensure that the above attributes are added both to the
				function's declaration/definition as well as to their call-sites. This is
				important for calls to attributed function pointers, where there is no
				definition or declaration available.


				2. Handling PSTATE.SM
				=====================

				When changing PSTATE.SM the execution of FP/vector operations may be transferred
				to another processing element. This has three important implications:

				* The runtime SVE vector length may change.

				* The contents of FP/AdvSIMD/SVE registers are zeroed.

				* The set of allowable instructions changes.

				This leads to certain restrictions on IR and optimizations. For example, it
				is undefined behaviour to share vector-length dependent state between functions
				that may operate with different values for PSTATE.SM. Front-ends must honour
				these restrictions when generating LLVM IR.

				Even though the runtime SVE vector length may change, for the purpose of LLVM IR
				and almost all parts of CodeGen we can assume that the runtime value for
				``vscale`` does not. If we let the compiler insert the appropriate ``smstart``
				and ``smstop`` instructions around call boundaries, then the effects on SVE
				state can be mitigated. By limiting the state changes to a very brief window
				around the call we can control how the operations are scheduled and how live
				values remain preserved between state transitions.

				In order to control PSTATE.SM at this level of granularity, we use function and
				callsite attributes rather than intrinsics.


				Restrictions on attributes
				--------------------------

				* It is undefined behaviour to pass or return (pointers to) scalable vector
				objects to/from functions which may use a different SVE vector length.
				This includes functions with a non-streaming interface, but marked with
				``aarch64_pstate_sm_body``.

				* It is not allowed for a function to be decorated with both
				``aarch64_pstate_sm_compatible`` and ``aarch64_pstate_sm_enabled``.

				* It is not allowed for a function to be decorated with both
				``aarch64_pstate_za_new`` and ``aarch64_pstate_za_preserved``.

				* It is not allowed for a function to be decorated with both
				``aarch64_pstate_za_new`` and ``aarch64_pstate_za_shared``.

				These restrictions also apply in the higher level SME ACLE, which means we can
				emit diagnostics in Clang to signal users about incorrect behaviour.


				Compiler inserted streaming-mode changes
				----------------------------------------

				The table below describes the transitions in PSTATE.SM the compiler has to
				account for when doing calls between functions with different attributes.
				In this table, we use the following abbreviations:

				``N``
				functions with a normal interface (PSTATE.SM=0 on entry, PSTATE.SM=0 on
				return)

				``S``
				functions with a Streaming interface (PSTATE.SM=1 on entry, PSTATE.SM=1
				on return)

				``SC``
				functions with a Streaming-Compatible interface (PSTATE.SM can be
				either 0 or 1 on entry, and is unchanged on return).

				Functions with ``__attribute__((arm_locally_streaming))`` are excluded from this
				table because for the caller the attribute is synonymous to 'streaming', and
				for the callee it is merely an implementation detail that is explicitly not
				exposed to the caller.

				.. table:: Combinations of calls for functions with different attributes

				==== ==== =============================== ============================== ==============================
				From To Before call After call After exception
				==== ==== =============================== ============================== ==============================
				N N
				N S SMSTART SMSTOP
				N SC
				S N SMSTOP SMSTART SMSTART
				S S SMSTART
				S SC SMSTART
				SC N If PSTATE.SM before call is 1, If PSTATE.SM before call is 1, If PSTATE.SM before call is 1,
				then SMSTOP then SMSTART then SMSTART
				SC S If PSTATE.SM before call is 0, If PSTATE.SM before call is 0, If PSTATE.SM before call is 1,
				then SMSTART then SMSTOP then SMSTART
				SC SC If PSTATE.SM before call is 1,
				then SMSTART
				==== ==== =============================== ============================== ==============================


				Because changing PSTATE.SM zeroes the FP/vector registers, it is best to emit
				the ``smstart`` and ``smstop`` instructions before register allocation, so that
				the register allocator can spill/reload registers around the mode change.

				The compiler should also have sufficient information on which operations are
				part of the call/function's arguments/result and which operations are part of
				the function's body, so that it can place the mode changes in exactly the right
				position. The suitable place to do this seems to be SelectionDAG, where it lowers
				the call's arguments/return values to implement the specified calling convention.
				SelectionDAG provides Chains and Glue to specify the order of operations and give
				preliminary control over the instruction's scheduling.


				Example of preserving state
				---------------------------

				When passing and returning a ``float`` value to/from a function
				that has a streaming interface from a function that has a normal interface, the
				call-site will need to ensure that the argument/result registers are preserved
				and that no other code is scheduled in between the ``smstart/smstop`` and the call.

				.. code-block:: llvm

				define float @foo(float %f) nounwind {
				%res = call float @bar(float %f) "aarch64_pstate_sm_enabled"
				ret float %res
				}

				declare float @bar(float) "aarch64_pstate_sm_enabled"

				The program needs to preserve the value of the floating point argument and
				return value in register ``s0``:

				.. code-block:: none

				foo: // @foo
				// %bb.0:
				stp d15, d14, [sp, #-80]! // 16-byte Folded Spill
				stp d13, d12, [sp, #16] // 16-byte Folded Spill
				stp d11, d10, [sp, #32] // 16-byte Folded Spill
				stp d9, d8, [sp, #48] // 16-byte Folded Spill
				str x30, [sp, #64] // 8-byte Folded Spill
				str s0, [sp, #76] // 4-byte Folded Spill
				smstart sm
				ldr s0, [sp, #76] // 4-byte Folded Reload
				bl bar
				str s0, [sp, #76] // 4-byte Folded Spill
				smstop sm
				ldp d9, d8, [sp, #48] // 16-byte Folded Reload
				ldp d11, d10, [sp, #32] // 16-byte Folded Reload
				ldp d13, d12, [sp, #16] // 16-byte Folded Reload
				ldr s0, [sp, #76] // 4-byte Folded Reload
				ldr x30, [sp, #64] // 8-byte Folded Reload
				ldp d15, d14, [sp], #80 // 16-byte Folded Reload
				ret

				Setting the correct register masks on the ISD nodes and inserting the
				``smstart/smstop`` in the right places should ensure this is done correctly.


				Instruction Selection Nodes
				---------------------------

				.. code-block:: none

				AArch64ISD::SMSTART Chain, [SM\|ZA\|Both], CurrentState, ExpectedState[, RegMask]
				AArch64ISD::SMSTOP Chain, [SM\|ZA\|Both], CurrentState, ExpectedState[, RegMask]

				The ``SMSTART/SMSTOP`` nodes take ``CurrentState`` and ``ExpectedState`` operand for
				the case of a conditional SMSTART/SMSTOP. The instruction will only be executed
				if CurrentState != ExpectedState.

				When ``CurrentState`` and ``ExpectedState`` can be evaluated at compile-time
				(i.e. they are both constants) then an unconditional ``smstart/smstop``
				instruction is emitted. Otherwise the node is matched to a Pseudo instruction
				which expands to a compare/branch and a ``smstart/smstop``. This is necessary to
				implement transitions from ``SC -> N`` and ``SC -> S``.


				Unchained Function calls
				------------------------
				When a function with "``aarch64_pstate_sm_enabled``" calls a function that is not
				streaming compatible, the compiler has to insert a SMSTOP before the call and
				insert a SMSTOP after the call.

				If the function that is called is an intrinsic with no side-effects which in
				turn is lowered to a function call (e.g. ``@llvm.cos()``), then the call to
				``@llvm.cos()`` is not part of any Chain; it can be scheduled freely.

				Lowering of a Callsite creates a small chain of nodes which:

				- starts a call sequence

				- copies input values from virtual registers to physical registers specified by
				the ABI

				- executes a branch-and-link

				- stops the call sequence

				- copies the output values from their physical registers to virtual registers

				When the callsite's Chain is not used, only the result value from the chained
				sequence is used, but the Chain itself is discarded.

				The ``SMSTART`` and ``SMSTOP`` ISD nodes return a Chain, but no real
				values, so when the ``SMSTART/SMSTOP`` nodes are part of a Chain that isn't
				used, these nodes are not considered for scheduling and are
				removed from the DAG. In order to prevent these nodes
				from being removed, we need a way to ensure the results from the
				``CopyFromReg`` can only be used after the ``SMSTART/SMSTOP`` has been
				executed.

				We can use a CopyToReg -> CopyFromReg sequence for this, which moves the
				value to/from a virtual register and chains these nodes with the
				SMSTART/SMSTOP to make them part of the expression that calculates
				the result value. The resulting COPY nodes are removed by the register
				allocator.

				The example below shows how this is used in a DAG that does not link
				together the result by a Chain, but rather by a value:

				.. code-block:: none

				t0: ch,glue = AArch64ISD::SMSTOP ...
				t1: ch,glue = ISD::CALL ....
				t2: res,ch,glue = CopyFromReg t1, ...
				t3: ch,glue = AArch64ISD::SMSTART t2:1, .... <- this is now part of the expression that returns the result value.
				t4: ch = CopyToReg t3, Register:f64 %vreg, t2
				t5: res,ch = CopyFromReg t4, Register:f64 %vreg
				t6: res = FADD t5, t9

				We also need this for locally streaming functions, where an ``SMSTART`` needs to
				be inserted into the DAG at the start of the function.

				Functions with __attribute__((arm_locally_streaming))
				-----------------------------------------------------
				efriedmaUnsubmitted Not Done Reply Inline Actions Not sure the CopyToReg+CopyFromReg thing is actually enough to prevent all the transforms you want to prevent... if I'm following correctly, you really need to prevent any operations from showing up in the middle of the call sequence, and the SelectionDAG scheduler isn't really set up to enforce that sort of constraint. Would be more obviously correct to use a pseduo-instruction for the call, and expand it to smstop+call+smstart using a post-isel hook. efriedma: Not sure the CopyToReg+CopyFromReg thing is actually enough to prevent all the transforms you…
				sdesmalenAuthorUnsubmitted Done Reply Inline Actions the SelectionDAG scheduler isn't really set up to enforce that sort of constraint It seems that this is exactly what Glue is meant to do. It tells the SelectionDAG scheduler to put these operations in the same SUnit node, which means they get scheduled together without anything else being scheduled in between. Most of the call-chain is glued together in this way. Would be more obviously correct to use a pseduo-instruction for the call, and expand it to smstop+call+smstart using a post-isel hook. I experimented with that, but found that replacing the ISD::CALL with a pseudo node is not really feasible. There are some complications to this approach: At the time of expanding the CALL itself, the operands will already have been lowered to the callconv-defined registers, whereas we need to change PSTATE.SM before that to give the register allocator a chance to spill/reload these registers (while still virtual registers) before it moves them to physical registers defined by the cc. There is a similar problem after the call; it first needs to move the registers to virtual registers before it can invoke `smstart/smstop`. Even if we'd be happy to with the idea of finding an insertion point for smstart/smstop after selection, then we'd need to fiddle with the scheduling afterwards to move the `smstart/smstop` instructions before and after the COPY nodes. For this we need to know which operations are part of the Call, and which are part of the program. But when finding the right insertion point after selection, this information has been lost. This gets even more complicated to recognise when values are passed by reference or via the stack. sdesmalen: > the SelectionDAG scheduler isn't really set up to enforce that sort of constraint It seems…

				If a function is marked as ``arm_locally_streaming``, then the runtime SVE
				vector length in the prologue/epilogue may be different from the vector length
				in the function's body. This happens because we invoke smstart after setting up
				the stack-frame and similarly invoke smstop before deallocating the stack-frame.

				To ensure we use the correct SVE vector length to allocate the locals with, we
				can use the streaming vector-length to allocate the stack-slots through the
				``ADDSVL`` instruction, even when the CPU is not yet in streaming mode.

				This only works for locals and not callee-save slots, since LLVM doesn't support
				mixing two different scalable vector lengths in one stack frame. That means that the
				case where a function is marked ``arm_locally_streaming`` and needs to spill SVE
				callee-saves in the prologue is currently unsupported. However, it is unlikely
				for this to happen without user intervention, because ``arm_locally_streaming``
				functions cannot take or return vector-length-dependent values. This would otherwise
				require forcing both the SVE PCS using '``aarch64_sve_pcs``' combined with using
				``arm_locally_streaming`` in order to encounter this problem. This combination
				can be prevented in Clang through emitting a diagnostic.


				An example of how the prologue/epilogue would look for a function that is
				attributed with ``arm_locally_streaming``:

				.. code-block:: c++

				#define N 64

				void __attribute__((arm_streaming_compatible)) some_use(svfloat32_t *);

				// Use a float argument type, to check the value isn't clobbered by smstart.
				// Use a float return type to check the value isn't clobbered by smstop.
				float __attribute__((noinline, arm_locally_streaming)) foo(float arg) {
				// Create local for SVE vector to check local is created with correct
				// size when not yet in streaming mode (ADDSVL).
				float array[N];
				svfloat32_t vector;

				some_use(&vector);
				svst1_f32(svptrue_b32(), &array[0], vector);
				return array[N - 1] + arg;
				}

				should use ADDSVL for allocating the stack space and should avoid clobbering
				the return/argument values.

				.. code-block:: none

				_Z3foof: // @_Z3foof
				// %bb.0: // %entry
				stp d15, d14, [sp, #-96]! // 16-byte Folded Spill
				stp d13, d12, [sp, #16] // 16-byte Folded Spill
				stp d11, d10, [sp, #32] // 16-byte Folded Spill
				stp d9, d8, [sp, #48] // 16-byte Folded Spill
				stp x29, x30, [sp, #64] // 16-byte Folded Spill
				add x29, sp, #64
				str x28, [sp, #80] // 8-byte Folded Spill
				addsvl sp, sp, #-1
				sub sp, sp, #256
				str s0, [x29, #28] // 4-byte Folded Spill
				smstart sm
				sub x0, x29, #64
				addsvl x0, x0, #-1
				bl _Z10some_usePu13__SVFloat32_t
				sub x8, x29, #64
				ptrue p0.s
				ld1w { z0.s }, p0/z, [x8, #-1, mul vl]
				ldr s1, [x29, #28] // 4-byte Folded Reload
				st1w { z0.s }, p0, [sp]
				ldr s0, [sp, #252]
				fadd s0, s0, s1
				str s0, [x29, #28] // 4-byte Folded Spill
				smstop sm
				ldr s0, [x29, #28] // 4-byte Folded Reload
				addsvl sp, sp, #1
				add sp, sp, #256
				ldp x29, x30, [sp, #64] // 16-byte Folded Reload
				ldp d9, d8, [sp, #48] // 16-byte Folded Reload
				ldp d11, d10, [sp, #32] // 16-byte Folded Reload
				ldp d13, d12, [sp, #16] // 16-byte Folded Reload
				ldr x28, [sp, #80] // 8-byte Folded Reload
				ldp d15, d14, [sp], #96 // 16-byte Folded Reload
				ret


				Preventing the use of illegal instructions in Streaming Mode
				------------------------------------------------------------

				* When executing a program in streaming-mode (PSTATE.SM=1) a subset of SVE/SVE2
				instructions and most AdvSIMD/NEON instructions are invalid.

				* When executing a program in normal mode (PSTATE.SM=0), a subset of SME
				instructions are invalid.

				* Streaming-compatible functions must only use instructions that are valid when
				either PSTATE.SM=0 or PSTATE.SM=1.

				The value of PSTATE.SM is not controlled by the feature flags, but rather by the
				function attributes. This means that we can compile for '``+sme``' and the compiler
				will code-generate any instructions, even if they are not legal under the requested
				streaming mode. The compiler needs to use the function attributes to ensure the
				compiler doesn't do transformations under the assumption that certain operations
				are available at runtime.
				efriedmaUnsubmitted Not Done Reply Inline Actions You could use a feature flag that reflects the state of PSTATE.SM, but doesn't actually enable/disable any instructions. But I guess that's basically the same thing as a function attribute. efriedma: You could use a feature flag that reflects the state of PSTATE.SM, but doesn't actually…

				We made a conscious choice not to model this with feature flags, because we
				still want to support inline-asm in either mode (with the user placing
				smstart/smstop manually), and this became rather complicated to implement at the
				individual instruction level (see `D120261 <https://reviews.llvm.org/D120261>`_
				and `D121208 <https://reviews.llvm.org/D121208>`_) because of limitations in
				TableGen.

				As a first step, this means we'll disable vectorization (LoopVectorize/SLP)
				entirely when the a function has either of the ``aarch64_pstate_sm_enabled``,
				``aarch64_pstate_sm_body`` or ``aarch64_pstate_sm_compatible`` attributes,
				in order to avoid the use of vector instructions.

				Later on we'll aim to relax these restrictions to enable scalable
				auto-vectorization with a subset of streaming-compatible instructions, but that
				requires changes to the CostModel, Legalization and SelectionDAG lowering.
				efriedmaUnsubmitted Not Done Reply Inline Actions Disabling vectorization might not be enough to completely prevent the use of vector instructions, but probably close enough. (For example, we use vector instructions to lower popcount.) efriedma: Disabling vectorization might not be enough to completely prevent the use of vector…
				sdesmalenAuthorUnsubmitted Done Reply Inline Actions Yes you're right, we'll probably find cases where we need to limit the code-generator to lower things differently when in streaming mode. sdesmalen: Yes you're right, we'll probably find cases where we need to limit the code-generator to lower…

				We will also emit diagnostics in Clang to prevent the use of
				non-streaming(-compatible) operations, e.g. through ACLE intrinsics, when a
				function is decorated with the streaming mode attributes.


				Other things to consider
				------------------------

				* Inlining must be disabled when the call-site needs to toggle PSTATE.SM or
				when the callee's function body is executed in a different streaming mode than
				its caller. This is needed because function calls are the boundaries for
				streaming mode changes.

				* Tail call optimization must be disabled when the call-site needs to toggle
				PSTATE.SM, such that the caller can restore the original value of PSTATE.SM.


				3. Handling PSTATE.ZA
				=====================

				In contrast to PSTATE.SM, enabling PSTATE.ZA does not affect the SVE vector
				length and also doesn't clobber FP/AdvSIMD/SVE registers. This means it is safe
				to toggle PSTATE.ZA using intrinsics. This also makes it simpler to setup a
				lazy-save mechanism for calls to private-ZA functions (i.e. functions that may
				either directly or indirectly clobber ZA state).

				For this purpose, we'll introduce a new LLVM IR pass that is run just before
				SelectionDAG.

				Setting up a lazy-save
				----------------------

				Committing a lazy-save
				----------------------

				Exception handling and ZA
				-------------------------

				4. References
				=============

				.. _aarch64_sme_acle:

				1. `SME ACLE Pull-request <https://github.com/ARM-software/acle/pull/188>`__

				.. _aarch64_sme_abi:

				2. `SME ABI Pull-request <https://github.com/ARM-software/abi-aa/pull/123>`__

llvm/docs/UserGuides.rst

	User Guides			User Guides
	===========			===========

	NOTE: If you are a user who is only interested in using an LLVM-based compiler,			NOTE: If you are a user who is only interested in using an LLVM-based compiler,
	you should look into `Clang <https://clang.llvm.org>`_ instead. The			you should look into `Clang <https://clang.llvm.org>`_ instead. The
	documentation here is intended for users who have a need to work with the			documentation here is intended for users who have a need to work with the
	intermediate LLVM representation.			intermediate LLVM representation.

	.. contents::			.. contents::
	:local:			:local:

	.. toctree::			.. toctree::
	:hidden:			:hidden:

				AArch64SME
	AddingConstrainedIntrinsics			AddingConstrainedIntrinsics
	AdvancedBuilds			AdvancedBuilds
	AliasAnalysis			AliasAnalysis
	AMDGPUUsage			AMDGPUUsage
	Benchmarking			Benchmarking
	BigEndianNEON			BigEndianNEON
	BuildingADistribution			BuildingADistribution
	CFIVerify			CFIVerify
	▲ Show 20 Lines • Show All 201 Lines • ▼ Show 20 Lines

	:doc:`HowToCrossCompileBuiltinsOnArm`			:doc:`HowToCrossCompileBuiltinsOnArm`
	Notes on cross-building and testing the compiler-rt builtins for Arm.			Notes on cross-building and testing the compiler-rt builtins for Arm.

	:doc:`BigEndianNEON`			:doc:`BigEndianNEON`
	LLVM's support for generating NEON instructions on big endian ARM targets is			LLVM's support for generating NEON instructions on big endian ARM targets is
	somewhat nonintuitive. This document explains the implementation and rationale.			somewhat nonintuitive. This document explains the implementation and rationale.

				:doc:`AArch64SME`
				LLVM's support for AArch64 SME ACLE and ABI.

	:doc:`CompileCudaWithLLVM`			:doc:`CompileCudaWithLLVM`
	LLVM support for CUDA.			LLVM support for CUDA.

	:doc:`NVPTXUsage`			:doc:`NVPTXUsage`
	This document describes using the NVPTX backend to compile GPU kernels.			This document describes using the NVPTX backend to compile GPU kernels.

	:doc:`AMDGPUUsage`			:doc:`AMDGPUUsage`
	This document describes using the AMDGPU backend to compile GPU kernels.			This document describes using the AMDGPU backend to compile GPU kernels.
	Show All 20 Lines