This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SME] Document SME ABI implementation in LLVM
ClosedPublic

Authored by sdesmalen on Aug 10 2022, 5:44 AM.

Details

Summary

Adds a design document for implementing the SME ABI in LLVM. This document
can be used as a reference for follow-up patches that attempt to implement
the ABI.

Diff Detail

Event Timeline

sdesmalen created this revision.Aug 10 2022, 5:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 10 2022, 5:45 AM
sdesmalen requested review of this revision.Aug 10 2022, 5:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 10 2022, 5:45 AM
sdesmalen edited the summary of this revision. (Show Details)Aug 10 2022, 5:54 AM

In one diff you disable inlining for SME functions. I am afraid that there are more incompatible transformations on functions, i.e., function merging, hot-cold splitting, outlining, ...

You state that the vscale can change during the execution of the program. I would have expected a LangRef update.

I agree a LangRef update makes sense if we're loosening the constraint that vscale is constant across the whole program. Might need some adjustments to make various transforms work; in particular, I suspect GEP constant expressions involving vscale break.

Are aarch64_pstate_sm_compatible functions allowed to call arbitrary functions? I guess there's some way to save/restore PSTATE.SM?

llvm/docs/AArch64SME.rst
279

Not sure the CopyToReg+CopyFromReg thing is actually enough to prevent all the transforms you want to prevent... if I'm following correctly, you really need to prevent *any* operations from showing up in the middle of the call sequence, and the SelectionDAG scheduler isn't really set up to enforce that sort of constraint. Would be more obviously correct to use a pseduo-instruction for the call, and expand it to smstop+call+smstart using a post-isel hook.

382

You could use a feature flag that reflects the state of PSTATE.SM, but doesn't actually enable/disable any instructions. But I guess that's basically the same thing as a function attribute.

398

Disabling vectorization might not be enough to completely prevent the use of vector instructions, but probably close enough. (For example, we use vector instructions to lower popcount.)

In one diff you disable inlining for SME functions. I am afraid that there are more incompatible transformations on functions, i.e., function merging, hot-cold splitting, outlining, ...

Having different functions using different instruction set extensions isn't that unusual. We support that sort of thing on many different targets, including x86. (I haven't looked at the diffs to see if it's using the usual target hooks for this.)

Would it help to have a idk v2scale or vmscale for the matrix mode and vscale for the non-matrix mode. Then both can be constant over time, but in some states it would invalid to reason about one but not about the other.

You state that the vscale can change during the execution of the program. I would have expected a LangRef update.

Enabling streaming mode means FP/vector code may be executed on a different processing element. This has implications beyond having a different SVE vector length since it also clears register state and changes the set of allowable instructions. For the purpose of LLVM IR it is sufficient to support only a single vscale as long as the input IR is restricted to not pass/return (pointers to) scalable vectors between such functions and any streaming mode changes happen only on function-call boundaries. The SME ACLE has similar restrictions, so Clang emitting these attributes will ensure these restrictions are honoured in the IR.

We'd rather not open the can of worms of supporting multiple vscales because this is not really a capability we need from LLVM. Even if it was a generic capability in LLVM then there would still be other issues we'd need to tackle in a similar way as outlined in this document, so I'm not sure if it's worth making a change to the LangRef.

llvm/docs/AArch64SME.rst
279

the SelectionDAG scheduler isn't really set up to enforce that sort of constraint

It seems that this is exactly what Glue is meant to do. It tells the SelectionDAG scheduler to put these operations in the same SUnit node, which means they get scheduled together without anything else being scheduled in between. Most of the call-chain is glued together in this way.

Would be more obviously correct to use a pseduo-instruction for the call, and expand it to smstop+call+smstart using a post-isel hook.

I experimented with that, but found that replacing the ISD::CALL with a pseudo node is not really feasible. There are some complications to this approach:

  • At the time of expanding the CALL itself, the operands will already have been lowered to the callconv-defined registers, whereas we need to change PSTATE.SM before that to give the register allocator a chance to spill/reload these registers (while still virtual registers) before it moves them to physical registers defined by the cc.
  • There is a similar problem after the call; it first needs to move the registers to virtual registers before it can invoke smstart/smstop.
  • Even if we'd be happy to with the idea of finding an insertion point for smstart/smstop after selection, then we'd need to fiddle with the scheduling afterwards to move the smstart/smstop instructions before and after the COPY nodes. For this we need to know which operations are part of the Call, and which are part of the program. But when finding the right insertion point after selection, this information has been lost. This gets even more complicated to recognise when values are passed by reference or via the stack.
398

Yes you're right, we'll probably find cases where we need to limit the code-generator to lower things differently when in streaming mode.

sdesmalen updated this revision to Diff 452683.Aug 15 2022, 8:25 AM

Rephrased section 2 (Handling PSTATE.SM) and replaced some uses of vscale with SVE vector length.

Matt added a subscriber: Matt.Aug 15 2022, 1:21 PM

Gentle ping. Does anyone have some more feedback?

For the purpose of LLVM IR it is sufficient to support only a single vscale as long as the input IR is restricted to not pass/return (pointers to) scalable vectors between such functions and any streaming mode changes happen only on function-call boundaries.

Sure, that makes sense: supporting multiple different scales within a function would be complicated.

But you still need a LangRef change. Interprocedural optimizations need to know that "vscale" computed in the callee might be different from "vscale" computed in the caller. As a practical matter, it probably affects very few optimizations, but certain optimizations might need to be aware of it. (At the moment, this is most likely to be relevant for GEP constant expressions, since we don't really do interprocedural hoisting or sinking. But in theory it could also be relevant for calls to llvm.vscale.)

For the purpose of LLVM IR it is sufficient to support only a single vscale as long as the input IR is restricted to not pass/return (pointers to) scalable vectors between such functions and any streaming mode changes happen only on function-call boundaries.

Sure, that makes sense: supporting multiple different scales within a function would be complicated.

But you still need a LangRef change. Interprocedural optimizations need to know that "vscale" computed in the callee might be different from "vscale" computed in the caller. As a practical matter, it probably affects very few optimizations, but certain optimizations might need to be aware of it. (At the moment, this is most likely to be relevant for GEP constant expressions, since we don't really do interprocedural hoisting or sinking. But in theory it could also be relevant for calls to llvm.vscale.)

Thanks, I think that's fair enough, we can tone down the wording in the LangRef a bit.

This revision is now accepted and ready to land.Sep 15 2022, 3:46 PM