Page MenuHomePhabricator

[AArch64][RegAllocFast] Add findSpillBefore to TargetRegisterInfo
AbandonedPublic

Authored by tmatheson on Jan 18 2021, 11:25 PM.

Details

Summary

Due to interactions between RegAllocFast and expansion of atomicrmw at -O0,
both ARM and AArch64 backends would emit stores between ldrex and strex,
which clears the exclusive access monitor.

atomicrmw instructions are expanded to loops, where the main MachineBasicBlock
includes a ldrex/strex pair. It then conditionally branches if this atomic
operation was successful. Because of this, the register loaded by ldrex is
LiveOut, and RegAllocFast therefore spills this register. The issue is that it
spills between the ldrex and strex, which invalidates the monitor.

I tried several ways of fixing this which all have problems:

  • Adding a pass after RegAllocFast which moved the str instructions (spills) to after the strex. For more complex sequences like those generated for 64 bit atomics with e.g. nand, this becomes difficult to do.
  • Add new pseudo instructions for atomicrmw which are expanded after register allocation. This would involve duplicating all of the loop creation code. Similar approach has been used before for cmpxchg: https://reviews.llvm.org/D16239?id=52861
  • Stop FastRegAlloc from spilling these registers for these instructions. However, other instructions between ldrex and strex can spill, and it is hard to catch them all.
  • Move the location of the spill to after the strex. This is the approach taken.

To spill after strex, I have added a new function to TargetRegisterInfo which returns an appropriate
point to spill at for a given instruction. For all backends except ARM/AArch64 this just returns the
next instruction. For ARM/AArch64 it returns the strex.

Alternatives:

  • A similar approach could have been applied to calcSpillCost instead, to set a high spill cost between ldrex/strex
  • Pseudo instructions could have been used instead

It is also possible that the cmpxchg pseudoinstructions could be removed, and the same technique used for them.

Diff Detail

Event Timeline

tmatheson created this revision.Jan 18 2021, 11:25 PM
tmatheson requested review of this revision.Jan 18 2021, 11:25 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 18 2021, 11:25 PM

AMDGPU has a similar problem/mechanism handled with TII::isBasicBlockPrologue. I'm not really satisfied with it, and this looks similarly ad-hoc. I'm not really sure what the right solution is, but I've been considering something looking like a new type of label pseudo

llvm/lib/Target/AArch64/AArch64RegisterInfo.cpp
820 ↗(On Diff #317469)

Technically 0 is TargetOpcode::PHI

llvm/lib/Target/ARM/ARMRegisterInfo.cpp
22–23 ↗(On Diff #317469)

Could you use a bundle for this?

Use Optional rather than returning 0 (==TargetOpcode::PHI).
Rename isExclusiveLoad to IsExclusiveLoad.

tmatheson marked an inline comment as done.Jan 27 2021, 12:30 AM
tmatheson added inline comments.
llvm/lib/Target/ARM/ARMRegisterInfo.cpp
22–23 ↗(On Diff #317469)

Potentially, can you elaborate on what you mean exactly? One approach could be to create a bundle in AtomicExpand::insertRMWLLSCLoop to bundle the results of emitLoadLinked and emitStoreConditional. However, this would also need some sort of hook from AtomicExpandPass to use in order to do this in a target dependent way.

mtrofin added inline comments.Jan 27 2021, 8:40 AM
llvm/lib/Target/AArch64/AArch64RegisterInfo.cpp
791 ↗(On Diff #319483)

Nit: "Is" prefix suggests returning a boolean. How about GetExclusiveLoadOpcode?

llvm/lib/Target/ARM/ARMRegisterInfo.cpp
25 ↗(On Diff #319483)

how similar are the 2 implementations - the one in AArch64RegisterInfo and here? Can there be some factoring to improve maintainability?

26 ↗(On Diff #319483)

same comment about name as in AArch64RegisterInfo.cpp

tmatheson updated this revision to Diff 323935.Feb 16 2021, 2:41 AM

This is quite a substantial change in approach, please take a look.

[ARM][AtomicExpand] Bundle exclusive loads and stores created by AtomicExpandPass

AtomicExpandPass expands atomicrmw instructions to loop structures. On
ARM/AArch64, these make use of exclusive load/store instructions. Any additional
store that occurs between these instructions will invalidate the exclusive
access monitor, and potentially cause an infinite loop. Therefore the register
allocator must be prevented from inserting spills between these two points.

The approach taken here is to create a bundle containing all the instructions
between the exclusive load and store. This prevents the register allocator from
inserting spills.

This exposed an issue with RegAllocFast, wherein a virtual register defined
inside the bundle might be assigned the same physical register as a virtual
register with a use that occurs after the def. For example:

%0 = something global
BUNDLE implicit-def %1, implicit %0 {
  %1 = MOVi 123
  store %0, ...
}

In the above example was possible to allocate the same physical register to both
%0 and %1. RegAllocFast has been updated to avoid this. RegAllocGreedy does not
have a similar problem, since it uses liveness analysis.

Finally, UnpackMachineBundles is added after register allocation for ARM/AArch64
to remove the bundles.

tmatheson marked 3 inline comments as done.Feb 16 2021, 2:43 AM
tmatheson added inline comments.
llvm/lib/Target/ARM/ARMRegisterInfo.cpp
25 ↗(On Diff #319483)

I have moved the common code into a common base class in the updated version.

tmatheson marked an inline comment as done.Feb 22 2021, 5:58 AM

Ping

Ping. @arsenm, any comments on new approach using bundles, and the regalloc changes?

lenary added a subscriber: lenary.Mar 1 2021, 7:00 AM

Some comments below, in addition to these questions

  • Have you tested this with both the new and the legacy pass managers?

I think the most important question to be answered about the approach is whether the backends use bundles anywhere else - if it does, this is probably too brittle and pseudo-instructions is a better approach, even though it adds duplication of loop insertion.

llvm/include/llvm/CodeGen/AtomicLoopBundler.h
43

Can you document the use of Derived::IsLoadInstr and Derived::IsStoreInstr? It's not clear from a quick scan of the class that they are required to use the pass.

58–59

Don't you want to find the store *after* the Load? So maybe start at LdIter?

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
613

Does the backend use bundles anywhere else, as we would need to make sure we're not unpacking other bundles at this point by mistake.

llvm/lib/Target/ARM/ARMAtomicLoopBundler.cpp
19

Here you're looking for instructions as introduced by atomic loop expansion - so the predicate should match that (I think).

Alternatively, you could just use MI.mayLoad() if you're looking for any kind of Load?

tmatheson updated this revision to Diff 331240.Mar 17 2021, 5:23 AM
tmatheson marked 3 inline comments as done.
  • Address review comments and clang-tidy
  • Rename isLoadInstr/isStoreInstr to isExclusiveLoad/isExclusiveStore

@lenary sorry for the delay in responding to comments.

I think the most important question to be answered about the approach is whether the backends use bundles anywhere else - if it does, this is probably too brittle and pseudo-instructions is a better approach, even though it adds duplication of loop insertion.

This is only applied to the AArch64 and ARM backends. The other uses of bundles in these backends (or in common passes used by these backends) that I'm aware of are:

  • MVEVPTBlockPass
  • Thumb2ITBlockPass
  • ARMExpandPseudoInsts
  • AArch64ExpandPseudoInsts

These are all added after register allocation, so unbundling before register allocation should not affect them.

There are other uses of bundles to prevent insertion of instructions (D79792) and to prevent reordering (D91048).

foad added a subscriber: foad.Mar 17 2021, 6:25 AM
foad added inline comments.
llvm/lib/CodeGen/RegAllocFast.cpp
1098

Naive question: shouldn't whatever added the operands to the BUNDLE have set the IsEarlyClobber flag appropriately, so you don't need to special-case bundles here?

arsenm added inline comments.Mar 17 2021, 6:34 AM
llvm/include/llvm/CodeGen/AtomicLoopBundler.h
35

I think running selection, and then trying to interpret the instructions to figure out what it did is fraught with peril. You would be better off expanding a pseudoinstruction after register allocation

43

You absolutely cannot rely on machine basic block names

llvm/lib/CodeGen/RegAllocFast.cpp
1093

The order of instructions in the bundle doesn't actually matter. It's not accurate to say a def happens before or after a use inside the bundle

1098

Yes

tmatheson added inline comments.Mar 17 2021, 6:40 AM
llvm/lib/CodeGen/RegAllocFast.cpp
1098

I looked into that, specifically marking any defs which are followed by any use inside the bundle as early-clobber. For example, you might have a bundle that defines %1 and then uses %2. The idea being that RegAllocFast sees only the bundle instruction, and within the bundle these defs/uses act like early-clobbers in that the def must have it's own separate register.

This works well for RegAllocFast, but RegAllocGreedy actually looks at the live ranges, which do not see bundles. At some point it would hit an assertion failure because it would see an early-clobber register (on the bundle instruction) who's live range didn't start at an early-clobber slot (because it was copied from the instruction inside the bundle and started at the r slot).

Trying to avoid this problem seemed like it would require breaking the live ranges semantics and didn't seem like a good path to go down.

tmatheson abandoned this revision.Tue, Mar 23, 4:41 AM

Thank you everyone for the comments, they have been very useful. From the discussions here and internally it seems that neither of these approaches (new target hook for spill location, or using bundles) is the right way forward, and I should look into the pseudo instructions. Since this review has grown quite large I will abandon it and open a new review when I have something working with pseudo expansion.