This is an archive of the discontinued LLVM Phabricator instance.

[InlineSpiller] Fix a crash due to lack of forward progress from remat
ClosedPublic

Authored by reames on Dec 11 2017, 4:14 PM.


Details

Reviewers

wmi
qcolombet
mkuper
MatzeB

Commits

rG8befa295ea54: [InlineSpiller] Fix a crash due to lack of forward progress from remat…
rL335077: [InlineSpiller] Fix a crash due to lack of forward progress from remat…

Summary

This is a very ugly fix and I'm hoping someone has a better idea on how to fix this. What's going on is we've got an instruction with more vreg uses than there are physical registers, and rematerialization appears to assume that there's always at least one physical register available in the user instruction. At the moment, the only instructions I know of which have this problem are PATCHPOINT, STATEPOINT, and STACKMAP, but in theory, any instruction which expects to work on a lot of stack slots at once could have this problem.

(The zexts in the test case are simply a stand-in for any rematable operation which does not fold into the use.)

The only other fix I see here is to separate the spiller and remat logic entirely and expose remat as a distinct step within the register allocator. There are some arguments in favour of this (in particular, right now the splitter is much worse at remat than the spiller is), but it's a large restructuring for what is arguably a corner case.
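To make the failure mode in the summary concrete, here is a minimal toy model. It is not LLVM code: `ToyInstr`, `registersNeeded`, and `isAssignable` are hypothetical names invented for illustration. It assumes each use either needs a physical register or is folded into a stack-slot reference, and that a remat always produces a fresh vreg that needs a register at the use.

```cpp
#include <cassert>
#include <cstddef>

// Toy model only; none of these names exist in LLVM.
struct ToyInstr {
  std::size_t NumVRegUses; // register operands before any spilling
};

// How many physical registers the instruction needs once the spiller has
// processed NumSpilled of its operands.
//  - Folded reload: the operand becomes a stack-slot reference, so it no
//    longer needs a register.
//  - Remat: the operand is recomputed into a fresh vreg right before the
//    instruction, so it still needs a register.
std::size_t registersNeeded(const ToyInstr &MI, std::size_t NumSpilled,
                            bool RematInsteadOfFold) {
  assert(NumSpilled <= MI.NumVRegUses && "cannot spill more than we use");
  return RematInsteadOfFold ? MI.NumVRegUses : MI.NumVRegUses - NumSpilled;
}

// Assignment can only succeed if the demand fits in the register file.
bool isAssignable(std::size_t Needed, std::size_t NumPhysRegs) {
  return Needed <= NumPhysRegs;
}
```

With 26 uses and an assumed 15 allocatable registers, spilling every operand but rematting each one leaves the demand at 26, so assignment cannot succeed. Folding 11 reloads instead brings the demand down to 15.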

Diff Detail

Repository: rL LLVM

Event Timeline

reames created this revision. Dec 11 2017, 4:14 PM

Herald added subscribers: bollu, eraman, mcrosier. Dec 11 2017, 4:14 PM

Hi Philip,

I agree that a refactoring of how remat is done would be a good thing, but doing it only for this corner case would be overkill.

The current approach makes sense, but I believe we should use pressure sets or something along those lines, because the current logic won't work generally speaking. See the inline comment.

Cheers,
-Quentin

lib/CodeGen/InlineSpiller.cpp
588 ↗	(On Diff #126466)	Counting by RC is not going to give you the result you want. You can imagine some instruction with different RCs that overlap, e.g., GPRsp, GPR. With the current counting you're going to count them in different buckets and miss the overlap. I believe the pressure sets (look in TargetRegisterInfo) would do what you want. Double check because it's been a while since I've played with it.

Second attempt, this time accounting for sub-registers

reames added inline comments. Dec 21 2017, 11:21 AM

lib/CodeGen/InlineSpiller.cpp
588 ↗	(On Diff #126466)	I looked at pressure sets and they didn't seem to have an obvious application here, but they did lead me to the notion of RegUnits. I think the updated code addresses the aliasing concern you raise, can you confirm? I'm figuring this out as I go along and am not familiar with this area of the code.

ping

qcolombet requested changes to this revision. Jan 5 2018, 10:07 AM

qcolombet added inline comments.

lib/CodeGen/InlineSpiller.cpp
530 ↗	(On Diff #127920)	Just to comment on the TODO, this code is actually not conservative as it assumes all the registers of the largest legal super class are a suitable assignment for the current register. This is generally not true, since this won't take into account the constraints of the encoding of MI. Anyhow, sketching a potential solution in the comment below.
543 ↗	(On Diff #127920)	This is kind of a random check because (1) not all operands in MI are registers, and (2) not all operands in MI have the same constraints, and thus AllocationOrder may not be relevant for all of them.
575 ↗	(On Diff #127920)	I believe the solution should look something like this:
  // Look at getNumAllocatableRegsForConstraints in RegAllocGreedy.
  const TargetRegisterClass *RelaxedRC =
      getLargestLegalSuperClassConstraintedOnMIUsage(MI, VReg);
  // Could be a fatal error, because if we end up in this situation that
  // means the allocation problem is not feasible.
  if (!RelaxedRC)
    return false;
  LiveRegUnits Used(TRI);
  Used.accumulate(MI);
  // Walk the registers in RelaxedRC and check whether one exists whose
  // regunits are all still available.
  for (MCPhysReg PossibleReg : *RelaxedRC) {
    if (MRI.isReserved(PossibleReg))
      continue;
    bool IsAvailable = true;
    for (MCRegUnitIterator Units(PossibleReg, TRI); Units.isValid(); ++Units)
      if (!Used.available(Units)) {
        IsAvailable = false;
        break;
      }
    if (IsAvailable)
      return true;
  }
  return false;
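The availability walk sketched in the inline comment can be illustrated with a small standalone toy model. This is not LLVM code and does not use `LiveRegUnits`: `ToyReg`, `usedUnits`, and `hasAvailableReg` are hypothetical names, and a register is modelled simply as a name plus the set of register units it covers, so that two registers alias exactly when their unit sets intersect.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy model only: a register is a name plus the register units it covers.
struct ToyReg {
  std::string Name;
  std::set<int> Units;
};

// Union of the units touched by the registers an instruction already uses
// (a stand-in for the bookkeeping a live-regunit set would do).
std::set<int> usedUnits(const std::vector<ToyReg> &UsedRegs) {
  std::set<int> Used;
  for (const ToyReg &R : UsedRegs)
    Used.insert(R.Units.begin(), R.Units.end());
  return Used;
}

// True if some candidate register has none of its units in Used.
bool hasAvailableReg(const std::vector<ToyReg> &Candidates,
                     const std::set<int> &Used) {
  for (const ToyReg &R : Candidates) {
    bool IsAvailable = true;
    for (int U : R.Units)
      if (Used.count(U)) {
        IsAvailable = false;
        break;
      }
    if (IsAvailable)
      return true;
  }
  return false;
}
```

Consider a toy machine with wide registers R0 and R1 and narrow registers W0 and W1, where W0 shares a unit with R0 and W1 shares one with R1. If an instruction uses R0 and W1, counting per class suggests one wide register is still free, but the unit-based check reports that neither R0 nor R1 is available. That is the overlap the earlier inline comment about counting by RC warns about.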

This revision now requires changes to proceed. Jan 5 2018, 10:07 AM

Quentin, I read over your suggested approach, but honestly, I didn't entirely follow. Rather than going through another round with an approach that I clearly don't understand, I'd like to go with a quick and dirty hack for the moment if you don't mind. Are you okay with us solving this just for STATEPOINTs at the moment by essentially just disabling remats for STATEPOINT uses?

Are you okay with us solving this just for STATEPOINTs at the moment by essentially just disabling remats for STATEPOINT uses?

Works for me. It would be good to make sure every user of STATEPOINT agrees with that, but I trust you know better whom to coordinate with :).

This revision was not accepted when it landed; it landed in state Needs Review. Jun 19 2018, 2:24 PM

Closed by commit rL335077: [InlineSpiller] Fix a crash due to lack of forward progress from remat… (authored by reames).

This revision was automatically updated to reflect the committed changes.

I went ahead and submitted the workaround, but I want to record one other idea I tried. It didn't work out, but I'm not sure if that's due to something fundamental, or if I just had a non-obvious bug.

I explored introducing another step in RegAllocGreedy's LiveRangeStage after RS_Spill, called RS_SpillWithoutRemat. The idea was that spilling transitioned the new vregs it created not to RS_Done, but to the new stage, which allowed them to be spilled again if needed. By avoiding remat on that second spill, we should avoid infinite retry loops. If we don't remat, then every reload use should be foldable into the STATEPOINT. Given that the existing remats might have fragmented live ranges, we might end up with awful spill, reload, spill, fold patterns, but in theory it should always work.

Another framing would be to split out RS_Remat as a distinct phase before either splitting or spilling. At the moment, we have two different sets of remat logic (in splitting and spilling respectively), so I didn't want to pursue that approach since there might be lots of perturbation of the output. I also wasn't quite sure how the queue order would impact code quality if we rematted, and then deferred spilling until other vregs had been processed.
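The RS_SpillWithoutRemat idea can be sketched as a tiny stage machine. This is a toy stand-in, not the real `LiveRangeStage` enum: only the stage names RS_Spill, RS_SpillWithoutRemat, and RS_Done come from the comment above, while `ToyStage`, `stageAfterSpill`, and `mayRemat` are hypothetical names, and the experiment itself was never committed.

```cpp
#include <cassert>

// Toy stand-in for the stages discussed above; the real enum in
// RegAllocGreedy has more members.
enum ToyStage { RS_Spill, RS_SpillWithoutRemat, RS_Done };

// Stage given to the new vregs created when a range in stage S is spilled.
ToyStage stageAfterSpill(ToyStage S) {
  switch (S) {
  case RS_Spill:
    return RS_SpillWithoutRemat; // may be spilled once more, without remat
  case RS_SpillWithoutRemat:
  case RS_Done:
    return RS_Done; // assumption: no further spilling after this point
  }
  return RS_Done;
}

// Remat is only attempted on the first spill of a range.
bool mayRemat(ToyStage S) { return S == RS_Spill; }
```

Under these assumptions a range goes through at most two spill rounds, and only the first one may rematerialize, which is what would rule out the infinite retry loop.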

Revision Contents

Path	Size
llvm/trunk/lib/CodeGen/InlineSpiller.cpp	26 lines
llvm/trunk/test/CodeGen/X86/statepoint-live-in.ll	71 lines

Diff 151977

llvm/trunk/lib/CodeGen/InlineSpiller.cpp

 ...
 private:
   bool isRegToSpill(unsigned Reg) { return is_contained(RegsToSpill, Reg); }

   bool isSibling(unsigned Reg);
   bool hoistSpillInsideBB(LiveInterval &SpillLI, MachineInstr &CopyMI);
   void eliminateRedundantSpills(LiveInterval &LI, VNInfo *VNI);

   void markValueUsed(LiveInterval*, VNInfo*);
+  bool canGuaranteeAssignmentAfterRemat(unsigned VReg, MachineInstr &MI);
   bool reMaterializeFor(LiveInterval &, MachineInstr &MI);
   void reMaterializeAll();

   bool coalesceStackAccess(MachineInstr *MI, unsigned Reg);
   bool foldMemoryOperand(ArrayRef<std::pair<MachineInstr *, unsigned>>,
                          MachineInstr *LoadMI = nullptr);
   void insertReload(unsigned VReg, SlotIndex, MachineBasicBlock::iterator MI);
   void insertSpill(unsigned VReg, bool isKill, MachineBasicBlock::iterator MI);
 ...
   do {
     ...
     LiveInterval &SnipLI = LIS.getInterval(MI->getOperand(1).getReg());
     assert(isRegToSpill(SnipLI.reg) && "Unexpected register in copy");
     VNInfo *SnipVNI = SnipLI.getVNInfoAt(VNI->def.getRegSlot(true));
     assert(SnipVNI && "Snippet undefined before copy");
     WorkList.push_back(std::make_pair(&SnipLI, SnipVNI));
   } while (!WorkList.empty());
 }

+bool InlineSpiller::canGuaranteeAssignmentAfterRemat(unsigned VReg,
+                                                     MachineInstr &MI) {
+  // Here's a quick explanation of the problem we're trying to handle here:
+  // * There are some pseudo instructions with more vreg uses than there are
+  //   physical registers on the machine.
+  // * This is normally handled by spilling the vreg, and folding the reload
+  //   into the user instruction. (Thus decreasing the number of used vregs
+  //   until the remainder can be assigned to physregs.)
+  // * However, since we may try to spill vregs in any order, we can end up
+  //   trying to spill each operand to the instruction, and then rematting it
+  //   instead. When that happens, the new live intervals (for the remats) are
+  //   expected to be trivially assignable (i.e. RS_Done). However, since we
+  //   may have more remats than physregs, we're guaranteed to fail to assign
+  //   one.
+  // At the moment, we only handle this for STATEPOINTs since they're the only
+  // pseudo op where we've seen this. If we start seeing other instructions
+  // with the same problem, we need to revisit this.
+  return (MI.getOpcode() != TargetOpcode::STATEPOINT);
+}
+
 /// reMaterializeFor - Attempt to rematerialize before MI instead of reloading.
 bool InlineSpiller::reMaterializeFor(LiveInterval &VirtReg, MachineInstr &MI) {
   // Analyze instruction
   SmallVector<std::pair<MachineInstr *, unsigned>, 8> Ops;
   MIBundleOperands::VirtRegInfo RI =
       MIBundleOperands(MI).analyzeVirtReg(VirtReg.reg, &Ops);

   if (!RI.Reads)
 ...
   // fold a load into the instruction. That avoids allocating a new register.
   if (RM.OrigMI->canFoldAsLoad() &&
       foldMemoryOperand(Ops, RM.OrigMI)) {
     Edit->markRematerialized(RM.ParentVNI);
     ++NumFoldedLoads;
     return true;
   }

+  // If we can't guarantee that we'll be able to actually assign the new vreg,
+  // we can't remat.
+  if (!canGuaranteeAssignmentAfterRemat(VirtReg.reg, MI))
+    return false;
+
   // Allocate a new register for the remat.
   unsigned NewVReg = Edit->createFrom(Original);

   // Finally we can rematerialize OrigMI before MI.
   SlotIndex DefIdx =
       Edit->rematerializeAt(*MI.getParent(), MI, NewVReg, RM, TRI);

   // We take the DebugLoc from MI, since OrigMI may be attributed to a
 ...

llvm/trunk/test/CodeGen/X86/statepoint-live-in.ll

 ...
 ; CHECK-NEXT: popq %rbx
 ; CHECK-NEXT: retq
 entry:
   call token (i64, i32, void (), i32, i32, ...) @llvm.experimental.gc.statepoint.p0f_isVoidf(i64 2882400000, i32 0, void () @baz, i32 0, i32 0, i32 0, i32 1, i32 %a)
   call token (i64, i32, void (), i32, i32, ...) @llvm.experimental.gc.statepoint.p0f_isVoidf(i64 2882400000, i32 0, void () @bar, i32 0, i32 2, i32 0, i32 1, i32 %a)
   ret void
 }

+; A variant of test7 where values are not directly foldable from stack slots.
+define void @test7(i32 %a, i32 %b, i32 %c, i32 %d, i32 %e, i32 %f, i32 %g, i32 %h, i32 %i, i32 %j, i32 %k, i32 %l, i32 %m, i32 %n, i32 %o, i32 %p, i32 %q, i32 %r, i32 %s, i32 %t, i32 %u, i32 %v, i32 %w, i32 %x, i32 %y, i32 %z) gc "statepoint-example" {
+; The code for this is terrible, check simply for correctness for the moment
+; CHECK-LABEL: test7:
+; CHECK: callq _bar
+entry:
+  %a64 = zext i32 %a to i64
+  %b64 = zext i32 %b to i64
+  %c64 = zext i32 %c to i64
+  %d64 = zext i32 %d to i64
+  %e64 = zext i32 %e to i64
+  %f64 = zext i32 %f to i64
+  %g64 = zext i32 %g to i64
+  %h64 = zext i32 %h to i64
+  %i64 = zext i32 %i to i64
+  %j64 = zext i32 %j to i64
+  %k64 = zext i32 %k to i64
+  %l64 = zext i32 %l to i64
+  %m64 = zext i32 %m to i64
+  %n64 = zext i32 %n to i64
+  %o64 = zext i32 %o to i64
+  %p64 = zext i32 %p to i64
+  %q64 = zext i32 %q to i64
+  %r64 = zext i32 %r to i64
+  %s64 = zext i32 %s to i64
+  %t64 = zext i32 %t to i64
+  %u64 = zext i32 %u to i64
+  %v64 = zext i32 %v to i64
+  %w64 = zext i32 %w to i64
+  %x64 = zext i32 %x to i64
+  %y64 = zext i32 %y to i64
+  %z64 = zext i32 %z to i64
+  %statepoint_token1 = call token (i64, i32, void (), i32, i32, ...) @llvm.experimental.gc.statepoint.p0f_isVoidf(i64 2882400000, i32 0, void () @bar, i32 0, i32 2, i32 0, i32 26, i64 %a64, i64 %b64, i64 %c64, i64 %d64, i64 %e64, i64 %f64, i64 %g64, i64 %h64, i64 %i64, i64 %j64, i64 %k64, i64 %l64, i64 %m64, i64 %n64, i64 %o64, i64 %p64, i64 %q64, i64 %r64, i64 %s64, i64 %t64, i64 %u64, i64 %v64, i64 %w64, i64 %x64, i64 %y64, i64 %z64)
+  ret void
+}

+; a variant of test7 with mixed types chosen to exercise register aliases
+define void @test8(i32 %a, i32 %b, i32 %c, i32 %d, i32 %e, i32 %f, i32 %g, i32 %h, i32 %i, i32 %j, i32 %k, i32 %l, i32 %m, i32 %n, i32 %o, i32 %p, i32 %q, i32 %r, i32 %s, i32 %t, i32 %u, i32 %v, i32 %w, i32 %x, i32 %y, i32 %z) gc "statepoint-example" {
+; The code for this is terrible, check simply for correctness for the moment
+; CHECK-LABEL: test8:
+; CHECK: callq _bar
+entry:
+  %a8 = trunc i32 %a to i8
+  %b8 = trunc i32 %b to i8
+  %c8 = trunc i32 %c to i8
+  %d8 = trunc i32 %d to i8
+  %e16 = trunc i32 %e to i16
+  %f16 = trunc i32 %f to i16
+  %g16 = trunc i32 %g to i16
+  %h16 = trunc i32 %h to i16
+  %i64 = zext i32 %i to i64
+  %j64 = zext i32 %j to i64
+  %k64 = zext i32 %k to i64
+  %l64 = zext i32 %l to i64
+  %m64 = zext i32 %m to i64
+  %n64 = zext i32 %n to i64
+  %o64 = zext i32 %o to i64
+  %p64 = zext i32 %p to i64
+  %q64 = zext i32 %q to i64
+  %r64 = zext i32 %r to i64
+  %s64 = zext i32 %s to i64
+  %t64 = zext i32 %t to i64
+  %u64 = zext i32 %u to i64
+  %v64 = zext i32 %v to i64
+  %w64 = zext i32 %w to i64
+  %x64 = zext i32 %x to i64
+  %y64 = zext i32 %y to i64
+  %z64 = zext i32 %z to i64
+  %statepoint_token1 = call token (i64, i32, void (), i32, i32, ...) @llvm.experimental.gc.statepoint.p0f_isVoidf(i64 2882400000, i32 0, void () @bar, i32 0, i32 2, i32 0, i32 26, i8 %a8, i8 %b8, i8 %c8, i8 %d8, i16 %e16, i16 %f16, i16 %g16, i16 %h16, i64 %i64, i64 %j64, i64 %k64, i64 %l64, i64 %m64, i64 %n64, i64 %o64, i64 %p64, i64 %q64, i64 %r64, i64 %s64, i64 %t64, i64 %u64, i64 %v64, i64 %w64, i64 %x64, i64 %y64, i64 %z64)
+  ret void
+}

 ; CHECK: Ltmp0-_test1
 ; CHECK: .byte 1
 ; CHECK-NEXT: .byte 0
 ; CHECK-NEXT: .short 4
 ; CHECK-NEXT: .short 5
 ; CHECK-NEXT: .short 0
 ; CHECK-NEXT: .long 0
 ...