This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Add support for spilling high registers in Thumb1
Needs ReviewPublic

Authored by petpav01 on Jul 16 2018, 1:40 AM.

Details

Summary

LLVM normally only makes use of low registers in Thumb1 and methods Thumb1InstrInfo::storeRegToStackSlot()/loadRegFromStackSlot() are currently able to store/restore only them. However, it is possible in rare cases that a register allocator might need to spill a high register in the middle of a function as well.

Example:

$ cat test.c
void constraint_h(void) {
  int i;
  asm volatile("@ %0" : : "h" (i) : "r12");
}
$ clang -target arm-none-eabi -march=armv6-m -c test.c
clang-7: [...]/llvm/lib/Target/ARM/Thumb1InstrInfo.cpp:85: virtual void llvm::Thumb1InstrInfo::storeRegToStackSlot(llvm::MachineBasicBlock&, llvm::MachineBasicBlock::iterator, unsigned int, bool, int, const llvm::TargetRegisterClass*, const llvm::TargetRegisterInfo*) const: Assertion `(RC == &ARM::tGPRRegClass || (TargetRegisterInfo::isPhysicalRegister(SrcReg) && isARMLowRegister(SrcReg))) && "Unknown regclass!"' failed.
[...]

The program was compiled at -O0 and so Fast Register Allocator is used. The following happens in this case:

  • Prior to register allocation, MIR looks as follows:
Frame Objects:
  fi#0: size=4, align=4, at location [SP]

bb.0.entry:
  %1:tgpr = tLDRspi %stack.0.i, 0, 14, $noreg :: (dereferenceable load 4 from %ir.i)
  %0:hgpr = COPY %1:tgpr
  INLINEASM &"@ $0" [sideeffect] [attdialect], $0:[reguse:hGPR], %0:hgpr, $1:[clobber], implicit-def early-clobber $r12, !3
  tBX_RET 14, $noreg
  • Fast Register Allocator first satisfies %0:hgpr by selecting r12.
  • When the scan reaches the INLINEASM instruction, the allocator however notices that r12 is clobbered and so it needs to be spilled.
  • The allocator calls Thumb1InstrInfo::storeRegToStackSlot() to store the register in a stack slot but the method does not know how to do it and aborts. This can also result in a miscompilation if LLVM is built without assertions enabled.

Both store and load of a high register in Thumb1 needs an additional low register. For instance, the store is implemented as:

mov %lowReg, %spilledHighReg
str %lowReg, ...

An initial patch in this review extended Thumb1InstrInfo::storeRegToStackSlot() and loadRegFromStackSlot() to allow storing and restoring high registers by inserting a pseudo-instruction that got later lowered after register allocation in ThumbRegisterInfo::eliminateFrameIndex(). This relied on the register scavenger to secure a low register for the sequence. This is possibly problematic when the register pressure is high because ThumbRegisterInfo::saveScavengerRegister() at the moment also tries to make use of high register r12.

The current patch extends RegAllocFast and InlineSpiller to handle a spill with an intermediary directly.

Diff Detail

Event Timeline

petpav01 created this revision.Jul 16 2018, 1:40 AM
thopre added a subscriber: thopre.Jul 16 2018, 9:46 AM
thopre added inline comments.
lib/Target/ARM/Thumb1InstrInfo.cpp
132–133

Can you put a similar comment to the store to stack slot?

This is possibly problematic when the register pressure is high because ThumbRegisterInfo::saveScavengerRegister() currently also tries to make use of high register r12.

Also, the constant islands pass can clobber lr. But given the only way to end up with an "hGPR" register class is inline asm, we could probably work around this issue by excluding ip/lr from allocation order for hGPR.

That said, if we ever want to make the high registers generally allocatable in Thumb1 mode, this patch probably isn't the right solution; instead, we should make the register allocator insert the copy, so we aren't forced to scavenge a register later.

lib/Target/ARM/ARMRegisterInfo.td
209 ↗(On Diff #155628)

This comment isn't really right. It's still worth using the high registers, with appropriate cost constraints: there are a few important instructions which can take high registers as inputs (cmp, add, bx/blx), and even if we're just effectively using them as spill slots, it's cheaper than spilling to the stack.

petpav01 updated this revision to Diff 156217.Jul 19 2018, 12:41 AM

This is possibly problematic when the register pressure is high because ThumbRegisterInfo::saveScavengerRegister() currently also tries to make use of high register r12.

Also, the constant islands pass can clobber lr. But given the only way to end up with an "hGPR" register class is inline asm, we could probably work around this issue by excluding ip/lr from allocation order for hGPR.

That said, if we ever want to make the high registers generally allocatable in Thumb1 mode, this patch probably isn't the right solution; instead, we should make the register allocator insert the copy, so we aren't forced to scavenge a register later.

Sorry, the mentioned idea with the copy is not quite clear to me. Could you please explain it a bit more for me?

lib/Target/ARM/ARMRegisterInfo.td
209 ↗(On Diff #155628)

Updated, hopefully it now makes sense.

lib/Target/ARM/Thumb1InstrInfo.cpp
132–133

Added.

efriedma added a subscriber: wmi.Jul 19 2018, 6:46 PM

Sorry, the mentioned idea with the copy is not quite clear to me. Could you please explain it a bit more for me?

Say the target has a new hook, call it "getRegClassForStackSaveRestore()" or something, which takes a register class, and returns a register class appropriate for stack save/restore operations. Then when a register allocator wants to spill a vreg, it first calls getRegClassForStackSaveRestore(); if that returns a new register class, instead of spilling using storeRegToStackSlot, it makes a new vreg with the returned class, and inserts a COPY to that vreg.

This avoids having to scavenge a register later; the register allocator has more ways to make a register available, so the resulting code is likely more efficient, and it avoids the potential problem of needing to scavenge multiple registers in ThumbRegisterInfo::eliminateFrameIndex.

petpav01 updated this revision to Diff 160538.Aug 14 2018, 2:38 AM

Thanks for the explanation of this idea. Updated patch goes in that direction and provides a prototype of this approach. The implementation is done in Fast Register Allocator (which has its own spiller code) and in InlineSpiller (used by the other LLVM allocators: Basic, Greedy, PBQP).

The implemented approach is to always make a complete spill of a high register to stack instead of initially moving it only to a low register and then spill the low register if actually needed. This allows to keep things a bit simpler to implement and reason about. InlineSpiller could be improved to implement only the mentioned "half-spills" but it does not appear necessary for now. With this problem currently being limited only to inline assembler, I think the Greedy production allocator should not get in a state where it would need to spill high registers.

The patch is not complete but I thought I would ask for feedback on it early, before I dive into solving remaining issues.

Known problems:

  • Reloads of registers that require a COPY instruction should be done by RegAllocFast before other register uses try to get satisfied to provide better assignment possibilities for the temporary register.
  • Helper registers used in high-register reloads should get properly removed from UsedInInstr in RegAllocFast so they can get used by the actual instruction.
  • RegAllocFast uses one temporary virtual register for all COPY instructions that it needs to insert for high-register spills. This is a workaround for LiveRegMap (SparseSet) not being resizable when it is not empty.
  • Operands of COPY instructions inserted by InlineSpiller can get inflated to GPR. This is visible in test hgpr-spill-basic.mir and would cause a problem if the inflated GRP register needed to get subsequently spilled.
  • Interaction with snippets and hoisting in InlineSpiller is likely not really correct.
petpav01 updated this revision to Diff 172930.Nov 7 2018, 5:59 AM
petpav01 edited the summary of this revision. (Show Details)

Updated patch improves the RegAllocFast part and adds more testing for it. InlineSpiller has no new changes.

Description of the changes:

  • Code to allocate an intermediary register for the spill is moved to RegAllocFast::handleIntermediarySpill().
  • RegAllocFast::allocVirtReg() is split into allocVirtReg() and assignVirtReg(). The former method still does most of the allocation work but leaves final update of PhysRegState + LRI->PhysReg and error reporting to assignVirtReg(). This allows handleIntermediarySpill() to call allocVirtReg() to get a free register without updating other state.
  • Spilling all registers prior to a call instruction in RegAllocFast::allocateBasicBlock() is moved before clearing of UsedInInstr so an intermediary does not get allocated to a register used by the instruction.

This is still not a complete patch. Known problems are:

  • handleIntermediarySpill() does not always correctly update debug information (DBG_VALUEs).
  • InlineSpiller still has the same problems as mentioned previously and needs more work.
  • Changes implemented in SparseSet are currently without testing.

Any feedback on this is very welcome, especially whether the overall approach looks sensible or if some different idea would be preferable and better.

Note: There is a ongoing rewrite of RegAllocFast in D52010 which means this patch will need to be somewhat reworked after the rewrite lands but I do not think it should affect the basic idea that is implemented here.