This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SME] Add support for arm_locally_streaming functions.
ClosedPublic

Authored by sdesmalen on Aug 10 2022, 8:48 AM.

Details

Summary

Functions with aarch64_pstate_sm_body will emit an SMSTART at the start
of the function and an SMSTOP at the end of the function, such that all
operations use the right value for vscale.

Because the placement of these nodes is critically important (i.e. no
vscale-dependent operations should be done before SMSTART has been issued),
we require glueing the CopyFromReg to the Entry node such that we can
insert the SMSTART as part of that glued chain.
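
As a rough illustration of why glue gives this ordering guarantee, the sketch below models glue as a "must be emitted directly after" relation. It is a toy model and not LLVM code: ToyNode, GluedAfter, emitOrder and exampleNodes are hypothetical names, and the real SelectionDAG scheduler is considerably more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy node: GluedAfter is the index of the node this one must directly
// follow, or -1 if the scheduler is free to place it.
struct ToyNode {
  std::string Name;
  int GluedAfter;
};

// Emit node I, then everything glued to it (recursively), so that a glued
// chain is never interrupted by an unrelated node. Assumes the glue
// relation is acyclic.
static void emitWithGlue(const std::vector<ToyNode> &Nodes, std::size_t I,
                         std::vector<std::string> &Out) {
  assert(I < Nodes.size());
  Out.push_back(Nodes[I].Name);
  for (std::size_t J = 0; J < Nodes.size(); ++J)
    if (Nodes[J].GluedAfter == static_cast<int>(I))
      emitWithGlue(Nodes, J, Out);
}

// Walk the nodes in the scheduler's preferred order, but let glue override it.
std::vector<std::string> emitOrder(const std::vector<ToyNode> &Nodes) {
  std::vector<std::string> Out;
  for (std::size_t I = 0; I < Nodes.size(); ++I)
    if (Nodes[I].GluedAfter < 0)
      emitWithGlue(Nodes, I, Out);
  return Out;
}

// Example input: the vscale-dependent add is listed before SMSTART, but the
// glued chain EntryToken -> CopyFromReg -> SMSTART is still emitted first.
std::vector<ToyNode> exampleNodes() {
  return {{"EntryToken", -1},
          {"vscale-dependent add", -1},
          {"CopyFromReg", 0},
          {"SMSTART", 2}};
}
```

Calling emitOrder(exampleNodes()) places the vscale-dependent add after SMSTART even though it appears earlier in the input, which is the property the glued chain is meant to provide.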

More details about the SME attributes and design can be found
in D131562.

Event Timeline

sdesmalen created this revision. Aug 10 2022, 8:48 AM
Herald added a project: Restricted Project. Aug 10 2022, 8:48 AM
sdesmalen requested review of this revision. Aug 10 2022, 8:48 AM
aemerson added inline comments.
llvm/lib/CodeGen/SelectionDAG/InstrEmitter.cpp
1162

?

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
1239

I think we need a better understanding.

sdesmalen updated this revision to Diff 459429. Sep 12 2022, 5:51 AM

Removed FIXME and work-around for constructor of SelectionDAG.

llvm/lib/CodeGen/SelectionDAG/InstrEmitter.cpp
1162

Perhaps removing this without any alternative assert is a little crude, but I wasn't really sure what to test for instead.

The change here is that EntryToken now "becomes part of the schedule" because other instructions can be glued to it, even though the node itself isn't schedulable (it is always guaranteed to be the first instruction). I was hoping someone would have a suggestion here.

llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
1239

You're right. The issue was that calling getVTList with more than one operand went through a different code-path which queries the VTListMap, which is initialised after EntryNode. By reordering these member variables (lexicographically) the issue goes away.
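
The underlying C++ rule is that non-static data members are constructed in declaration order, so a member whose constructor uses another member has to be declared after it. A minimal sketch of that rule follows; TypeListMap, EntryLike, DAGLike and constructionOrder are hypothetical stand-ins and not the actual SelectionDAG members.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Records the order in which the members below are constructed.
static std::vector<std::string> InitLog;

// Hypothetical stand-in for the map consulted when building a multi-type list.
struct TypeListMap {
  std::map<int, int> Entries;
  TypeListMap() { InitLog.push_back("TypeListMap"); }
};

// Hypothetical stand-in for the entry node: its constructor touches the map.
struct EntryLike {
  explicit EntryLike(TypeListMap &Map) {
    assert(!InitLog.empty() && "map must already be constructed");
    InitLog.push_back("EntryLike");
    Map.Entries[0] = 2; // only valid once Map's constructor has run
  }
};

// Members are constructed in declaration order, whatever order the
// constructor's initializer list uses, so Map is declared before Entry.
struct DAGLike {
  TypeListMap Map;
  EntryLike Entry;
  DAGLike() : Map(), Entry(Map) {}
};

// Builds a DAGLike and reports the construction order that was observed.
std::vector<std::string> constructionOrder() {
  InitLog.clear();
  DAGLike D;
  (void)D;
  return InitLog;
}
```

With the declarations swapped, EntryLike's constructor would run against a map whose constructor has not yet executed, which is the kind of problem described above.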

aemerson accepted this revision. Sep 15 2022, 3:11 AM

Seems reasonable to me.

This revision is now accepted and ready to land. Sep 15 2022, 3:11 AM
sdesmalen added inline comments.
llvm/lib/CodeGen/SelectionDAG/InstrEmitter.cpp
1162

@efriedma, I'd like a second opinion on this: are you aware of anything else I might need to worry about here?

I needed to remove this llvm_unreachable because I changed EntryToken to also have Glue, and when using that it makes the EntryToken part of the schedule (even though the node itself can't be scheduled).

efriedma added inline comments. Sep 15 2022, 4:43 PM
llvm/lib/CodeGen/SelectionDAG/InstrEmitter.cpp
1162

This feels pretty isolated; the entry token basically has the same meaning during isel, and we're throwing it away after isel. The only possible side-effect I can think of is that maybe something assumes the entry token only has one successor.

That said, it feels like a sort of strange thing to be inserting smstart/smstop this early... would it make sense to insert it as part of prologue/epilogue lowering instead?

sdesmalen added inline comments. Sep 16 2022, 3:21 AM
llvm/lib/CodeGen/SelectionDAG/InstrEmitter.cpp
1162

I considered that early on but thought doing this at ISEL level would be easier because we get the benefits from the register allocator. Also, we already have the node/mechanisms in place for SelectionDAG, so adding it here seemed sensible.

For a simple example:

define <8 x i32> @foo(<8 x i32> %x) nounwind "aarch64_pstate_sm_body" {
  %y = add <8 x i32> %x, %x
  ret <8 x i32> %y
}

If we were to do this at the point of prologue/epilogue lowering, we'd have the following input:

# *** IR Dump Before Prologue/Epilogue Insertion & Frame Finalization (prologepilog) ***:
# Machine code for function foo: NoPHIs, TracksLiveness, NoVRegs, TiedOpsRewritten, TracksDebugUserValues
Function Live Ins: $q0, $q1

bb.0 (%ir-block.0):
  liveins: $q0, $q1
  renamable $q0 = ADDv4i32 killed renamable $q0, renamable $q0
  renamable $q1 = ADDv4i32 killed renamable $q1, renamable $q1
  RET_ReallyLR implicit $q0, implicit $q1

We'd then first need to analyse which registers are clobbered by the smstart/smstop, add extra spill slots for the input arguments passed in NEON/FP/SVE registers, and then add the spills/smstart/reloads for the prologue and the spills/smstop/reloads for the epilogue. The frame-lowering code is already quite complex. We'd need to end up with something like:

foo:                                    // @foo
// %bb.0:
        sub     sp, sp, #96
        stp     d15, d14, [sp, #32]
        stp     d13, d12, [sp, #48]
        stp     d11, d10, [sp, #64]
        stp     d9, d8, [sp, #80]
        stp     q0, q1, [sp]                    // input operands are clobbered by smstart, so spill to stack
        smstart sm
        ldr     q0, [sp]                        // reload the spilled operand register
        add     v2.4s, v0.4s, v0.4s
        ldr     q0, [sp, #16]               // reload the other spilled operand register
        add     v0.4s, v0.4s, v0.4s
        stp     q2, q0, [sp]                 // spill the results, because output operands are clobbered by smstop
        smstop  sm
        ldp     q0, q1, [sp]                 // reload the results
        ldp     d9, d8, [sp, #80]
        ldp     d11, d10, [sp, #64]
        ldp     d13, d12, [sp, #48]
        ldp     d15, d14, [sp, #32]
        add     sp, sp, #96
        ret
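
The 96-byte frame in the listing above can be checked by hand: two 16-byte Q-register spill slots at offsets 0 and 16, followed by eight 8-byte saves for d8-d15 at offsets 32 to 95. The sketch below reproduces that arithmetic. FrameLayout, alignTo and layoutFrame are hypothetical helpers written for this illustration, and the slot ordering is simply read off the listing rather than taken from the frame-lowering code.

```cpp
#include <cassert>
#include <cstddef>

struct FrameLayout {
  std::size_t VectorSpillOffset; // where the Q-register spill area starts
  std::size_t CalleeSaveOffset;  // where the d8-d15 save area starts
  std::size_t TotalSize;         // amount subtracted from sp
};

// Round Value up to the next multiple of Align (Align must be a power of two).
std::size_t alignTo(std::size_t Value, std::size_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0);
  return (Value + Align - 1) & ~(Align - 1);
}

// Assumed layout, matching the listing: Q spill slots at the bottom of the
// frame, D-register saves above them, total rounded up to 16 bytes.
FrameLayout layoutFrame(std::size_t NumQSpills, std::size_t NumDSaves) {
  const std::size_t QBytes = 16, DBytes = 8, StackAlign = 16;
  FrameLayout L;
  L.VectorSpillOffset = 0;
  L.CalleeSaveOffset = alignTo(NumQSpills * QBytes, StackAlign);
  L.TotalSize = alignTo(L.CalleeSaveOffset + NumDSaves * DBytes, StackAlign);
  return L;
}
```

For the example, layoutFrame(2, 8) gives a save-area offset of 32 and a total size of 96, matching the sub sp, sp, #96 and the stp d15, d14, [sp, #32] in the listing.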

If we insert the smstart/smstop as part of ISEL the register allocator will insert all the spills and fills for free. And we may have some minor benefits from the scheduler which can rearrange code around the spills/fills for the input arguments.

Matt added a subscriber: Matt. Sep 16 2022, 11:40 AM
This revision was landed with ongoing or failed builds. Oct 14 2022, 6:48 AM
This revision was automatically updated to reflect the committed changes.