This is an archive of the discontinued LLVM Phabricator instance.

[RS4GC] Effective rematerialization at non-entry polls
Abandoned · Public

Authored by reames on Jan 21 2016, 5:10 PM.


Details

Reviewers

igor-laevsky
mjacob
JosephTremoulet

Summary

This is an attempt at addressing 26223. Specifically, it tries to avoid unfortunate register spilling by placing the rematerializations introduced by rewrite-statepoints-for-gc so as to maximize folding and simplification opportunities rather than to minimize execution frequency.

If we have a bit of code like this:
%addr = gep %o, 8
loop {
  if (poll) {
    safepoint();
  }
  load %addr
}

We currently end up rewriting this as:
%addr = gep %o, 8
loop {
  %addr1 = phi (%addr, %addr2)
  if (poll) {
    safepoint();
    %remat = gep %o.relocated, 8
  }
  %addr2 = phi (%addr1, %remat)
  load %addr2
}

This ends up forcing us to rematerialize the address explicitly and likely will cause us to spill/fill the address if register constrained. This creates a bunch of dependent loads (fill from stack, load from result) which show up as hot in a couple of benchmarks.

A much better result would be:
%addr = gep %o, 8
loop {
  if (poll) {
    safepoint();
  }
  %remat = gep %o.relocated, 8
  load %remat
}

This version allows the GEP to be folded directly into x86's native addressing modes.

(Note: For conciseness, I'm not writing the phis for relocating %o, assume they're all there.)
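
To illustrate what "folded into the addressing mode" means at the source level, here is a small C++ sketch. The function name loadFieldAtOffset8 and the fixed 8-byte offset are assumptions chosen for illustration and do not come from the patch. When the address computation sits right next to the load, as here, an optimizing x86-64 compiler can typically express both as one load with a base-plus-displacement operand, so no register is needed to hold the derived address.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reads a 64-bit field located 8 bytes past the object
// pointer. The "O + 8" plays the role of the GEP, the memcpy the role of the
// load. memcpy is used so the read is well defined for any alignment.
std::int64_t loadFieldAtOffset8(const unsigned char *O) {
  assert(O != nullptr && "expected a valid object pointer");
  std::int64_t Value;
  std::memcpy(&Value, O + 8, sizeof Value);
  return Value;
}
```

If the derived address instead arrives through a phi from another block, the backend has to keep it in a register or a stack slot, which is the situation the second listing above describes.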

The particular heuristic chosen here is to push each given remat as late as possible. This has the effect of moving remats closer to uses and preventing the creation of unnecessary and confusing PHI nodes. Empirically, this does appear to help in some of the benchmarks where I encountered this, but I'm getting increasingly uncomfortable with the coupling between RS4GC and CGP. In particular, a better version of this heuristic is already present in CGP.
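
The "as late as possible" scan can be sketched on a toy model. The following is a minimal illustration, not the code from this revision: ToyInst and findLateInsertIndex are hypothetical names, and the model only covers a single straight-line sequence of instructions (no successor blocks, no dominator queries). Starting just after the safepoint, it walks forward and stops at the first instruction that uses the value being replaced, at another statepoint, or at the last instruction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an instruction; not an LLVM type.
struct ToyInst {
  std::string Name;                  // label for the instruction
  std::vector<std::string> Operands; // names of the values it reads
  bool IsStatepoint;                 // true if this is a safepoint/statepoint
};

// Returns the index of the instruction before which the remat would be
// inserted. Start is the index of the instruction right after the safepoint.
std::size_t findLateInsertIndex(const std::vector<ToyInst> &Code,
                                std::size_t Start,
                                const std::string &Replacee) {
  assert(Start < Code.size() && "start must name an existing instruction");
  std::size_t Cursor = Start;
  // The last instruction acts like a terminator: the scan never moves past it.
  while (Cursor + 1 < Code.size()) {
    const ToyInst &I = Code[Cursor];
    bool UsesReplacee = std::find(I.Operands.begin(), I.Operands.end(),
                                  Replacee) != I.Operands.end();
    // Stop before the first user, or before another statepoint, since new
    // defs cannot be moved across a statepoint without updating its live set.
    if (UsesReplacee || I.IsStatepoint)
      break;
    ++Cursor;
  }
  return Cursor;
}
```

For a sequence such as safepoint, compare, load of addr, return, calling findLateInsertIndex with Start = 1 and Replacee = "addr" yields the index of the load, so the remat lands immediately before its first use.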

I think we should probably take this incremental step, but before going much further, factoring the code to share parts of the implementation of CGP might be a good idea. The general problem is that many CGP transforms are hard to perform after RS4GC has run. It may make sense to selectively run them beforehand.

Diff Detail

Event Timeline

reames updated this revision to Diff 45619. Jan 21 2016, 5:10 PM

reames retitled this revision from to [RS4GC] Effective rematerialization at non-entry polls.

reames updated this object.

reames added reviewers: JosephTremoulet, sanjoy, igor-laevsky, mjacob.

reames added a subscriber: llvm-commits.

Herald added subscribers: mcrosier, sanjoy, MatzeB. Jan 21 2016, 5:10 PM

For those curious, the analogous CGP logic is in: CodeGenPrepare::optimizeMemoryInst

mjacob edited edge metadata. Jan 22 2016, 5:05 PM

mjacob added a subscriber: mjacob.

This comment was removed by mjacob.

I think I've found a better fix for this issue. I'm working on a patch for CGP which lets it clean up the phi cycles introduced by the simple rematerialization. Assuming that works out, I'll abandon this patch.

Resigning as reviewer to move this off my worklist.

@reames if you want to resume working on this patch and would like me to help review, please add me back.

I found a better way to approach this. Once I get back to this, I'm going to just fix CGP rather than working around the problem here. Given that, abandoning revision.

Revision Contents

Path	Size
lib/Transforms/Scalar/RewriteStatepointsForGC.cpp	121 lines
test/Transforms/RewriteStatepointsForGC/remat-schedule.ll	119 lines
test/Transforms/RewriteStatepointsForGC/rematerialize-derived-pointers.ll	2 lines

Diff 45619

lib/Transforms/Scalar/RewriteStatepointsForGC.cpp

(57 lines not shown)	static cl::opt<bool> PrintBasePointers("spp-print-base-pointers", cl::Hidden,
cl::init(false));		cl::init(false));

// Cost threshold measuring when it is profitable to rematerialize value instead		// Cost threshold measuring when it is profitable to rematerialize value instead
// of relocating it		// of relocating it
static cl::opt<unsigned>		static cl::opt<unsigned>
RematerializationThreshold("spp-rematerialization-threshold", cl::Hidden,		RematerializationThreshold("spp-rematerialization-threshold", cl::Hidden,
cl::init(6));		cl::init(6));

		// Should we try to find a better place to remat? This is believed to always
		// be a good idea (on x86 at least). The option is mostly useful as a
		// performance debugging aid in identifying any resulting regressions.
		static cl::opt<bool> UseRematSchedule("rs4gc-schedule-remat", cl::Hidden,
		cl::init(true));

#ifdef XDEBUG		#ifdef XDEBUG
static bool ClobberNonLive = true;		static bool ClobberNonLive = true;
#else		#else
static bool ClobberNonLive = false;		static bool ClobberNonLive = false;
#endif		#endif
static cl::opt<bool, true> ClobberNonLiveOverride("rs4gc-clobber-non-live",		static cl::opt<bool, true> ClobberNonLiveOverride("rs4gc-clobber-non-live",
cl::location(ClobberNonLive),		cl::location(ClobberNonLive),
cl::Hidden);		cl::Hidden);
(2,046 lines not shown)	for (Instruction *Instr : Chain) {
} else {		} else {
llvm_unreachable("unsupported instruciton type during rematerialization");		llvm_unreachable("unsupported instruciton type during rematerialization");
}		}
}		}

return Cost;		return Cost;
}		}

		/// Find a basic block which is dominated by all of the uses required by all
		/// the instructions in Chain. Since elements in the remat chain are assumed
		/// to be side effect free, this tells us where it is legal to insert the chain.
		static BasicBlock *findDefScope(DominatorTree &DT, BasicBlock *UseBlock,
		ArrayRef<Instruction*> Chain) {
		SmallSet<BasicBlock*, 16> DefBlocks;
		for (Instruction *Link : Chain)
		for (Value *V : Link->operands())
		if (auto *I = dyn_cast<Instruction>(V))
		DefBlocks.insert(I->getParent());

		BasicBlock *Current = UseBlock;
		while (true) {
		if (DefBlocks.count(Current))
		return Current;
		if (Current == &Current->getParent()->getEntryBlock())
		return Current;
		Current = DT.getNode(Current)->getIDom()->getBlock();
		}
		llvm_unreachable("use not dominated by defs?");
		}

		/// Find a good place to insert a rematerialization chain. The motivation for
		/// this is that the safepoint we're rematerializing for might be buried in a
		/// conditionally executed block or a loop nest. While having the
		/// rematerializations in a rarely executed block would seem ideal, in practice,
		/// it's often better to materialize closer to the uses. In addition to the
		/// solid chance that we can fold the rematerialized address computation into
		/// the load (on x86 at least), if we materialize in the rare path and insert
		/// phis around the loop nest, we can greatly increase register pressure. The
		/// resulting materializations are likely to be spilled and rematerialized (again)
		/// in the fast path where actually needed, but this time without the knowledge
		/// of what's actually being done (due to those darn phis). It's generally far
		/// better to just compute the pointer via an LEA than introduce a load from a
		/// stack slot. Note that the last part here only holds because a) we're
		/// not keeping pointers in registers when lowering in the backend, and b) the
		/// backend can't see back through a complicated nest of phis to realize an
		/// instruction could be rematerialized late.
		static Instruction *findInsertPoint(DominatorTree &DT, Value *Replacee,
		ArrayRef<Instruction*> Chain,
		Instruction *TrivialIP) {

		// The heuristic here is to find the latest point which post dominates the
		// safepoint and is before the first use we can. By pushing into successor
		// blocks, we can greatly reduce the numbers of phis needed since we don't
		// need to merge the remated value with anything from another path. This is
		// specifically relying on the fact that we can remat the original value
		// without side effects along any path, not merely the one which happened to
		// go through the safepoint.

		// TODO: as mentioned in the comment above, this is somewhat x86 specific in
		// its assumptions. Using appropriate target hooks would be useful once
		// we're trying to support other architectures.

		// TODO: it might be a good idea to restrict how aggressive this is using
		// block frequency info or the size of the liveset at the safepoint
		// (i.e. minimum register pressure).

		BasicBlock *Scope = findDefScope(DT, TrivialIP->getParent(), Chain);

		auto hasNextInstruction = [&](Instruction *I) {
		if (!I->isTerminator())
		return true;
		BasicBlock *nextBB = I->getParent()->getUniqueSuccessor();
		return nextBB && DT.dominates(Scope, nextBB);
		};

		auto nextInstruction = [&hasNextInstruction](Instruction *I) {
		assert(hasNextInstruction(I) &&
		"first check if there is a next instruction!");
		if (I->isTerminator())
		return &I->getParent()->getUniqueSuccessor()->front();
		return &*++I->getIterator();
		};

		auto hasUseOf = [](User *U, Value *V) {
		for (unsigned i = 0, E = U->getNumOperands(); i != E; ++i)
		if (U->getOperand(i) == V)
		return true;
		return false;
		};
		Instruction *cursor = TrivialIP;
		for (; hasNextInstruction(cursor);
		cursor = nextInstruction(cursor)) {
		// TODO: consider placing remats in each use, then tracking dominating def
		// until we leave the merge chain or controlling scope.
		if (hasUseOf(cursor, Replacee))
		break;

		// Can't insert new defs past another statepoint without adjusting its
		// live range... but that's okay, this means we didn't find any uses which
		// needed handling. That implies the remat here is trivially dead
		// anyways. We'll insert it for simplicity, but InstSimplify will kill it.
		if (isStatepoint(cursor))
		break;
		}
		return cursor;

		}

// From the statepoint live set pick values that are cheaper to recompute then		// From the statepoint live set pick values that are cheaper to recompute then
// to relocate. Remove this values from the live set, rematerialize them after		// to relocate. Remove this values from the live set, rematerialize them after
// statepoint and record them in "Info" structure. Note that similar to		// statepoint and record them in "Info" structure. Note that similar to
// relocated values we don't do any user adjustments here.		// relocated values we don't do any user adjustments here.
static void rematerializeLiveValues(CallSite CS,		static void rematerializeLiveValues(CallSite CS,
PartiallyConstructedSafepointRecord &Info,		PartiallyConstructedSafepointRecord &Info,
TargetTransformInfo &TTI) {		TargetTransformInfo &TTI,
		DominatorTree &DT) {
const unsigned int ChainLengthThreshold = 10;		const unsigned int ChainLengthThreshold = 10;

// Record values we are going to delete from this statepoint live set.		// Record values we are going to delete from this statepoint live set.
// We can not di this in following loop due to iterator invalidation.		// We can not di this in following loop due to iterator invalidation.
SmallVector<Value *, 32> LiveValuesToBeDeleted;		SmallVector<Value *, 32> LiveValuesToBeDeleted;

for (Value *LiveValue: Info.LiveSet) {		for (Value *LiveValue: Info.LiveSet) {
// For each live pointer find it's defining chain		// For each live pointer find it's defining chain
(31 lines not shown)	for (Value *LiveValue: Info.LiveSet) {
// Clone instructions and record them inside "Info" structure		// Clone instructions and record them inside "Info" structure

// Walk backwards to visit top-most instructions first		// Walk backwards to visit top-most instructions first
std::reverse(ChainToBase.begin(), ChainToBase.end());		std::reverse(ChainToBase.begin(), ChainToBase.end());

// Utility function which clones all instructions from "ChainToBase"		// Utility function which clones all instructions from "ChainToBase"
// and inserts them before "InsertBefore". Returns rematerialized value		// and inserts them before "InsertBefore". Returns rematerialized value
// which should be used after statepoint.		// which should be used after statepoint.
auto rematerializeChain = [&ChainToBase](Instruction *InsertBefore) {		auto rematerializeChain = [&](Instruction *InsertBefore) {
Instruction *LastClonedValue = nullptr;		Instruction *LastClonedValue = nullptr;
Instruction *LastValue = nullptr;		Instruction *LastValue = nullptr;
for (Instruction *Instr: ChainToBase) {		for (Instruction *Instr: ChainToBase) {
// Only GEP's and casts are suported as we need to be careful to not		// Only GEP's and casts are suported as we need to be careful to not
// introduce any new uses of pointers not in the liveset.		// introduce any new uses of pointers not in the liveset.
// Note that it's fine to introduce new uses of pointers which were		// Note that it's fine to introduce new uses of pointers which were
// otherwise not used after this statepoint.		// otherwise not used after this statepoint.
assert(isa<GetElementPtrInst>(Instr) || isa<CastInst>(Instr));		assert(isa<GetElementPtrInst>(Instr) || isa<CastInst>(Instr));
(25 lines not shown)	#endif
return LastClonedValue;		return LastClonedValue;
};		};

// Different cases for calls and invokes. For invokes we need to clone		// Different cases for calls and invokes. For invokes we need to clone
// instructions both on normal and unwind path.		// instructions both on normal and unwind path.
if (CS.isCall()) {		if (CS.isCall()) {
Instruction *InsertBefore = CS.getInstruction()->getNextNode();		Instruction *InsertBefore = CS.getInstruction()->getNextNode();
assert(InsertBefore);		assert(InsertBefore);
		if (UseRematSchedule)
		InsertBefore = findInsertPoint(DT, LiveValue, ChainToBase, InsertBefore);
Instruction *RematerializedValue = rematerializeChain(InsertBefore);		Instruction *RematerializedValue = rematerializeChain(InsertBefore);
Info.RematerializedValues[RematerializedValue] = LiveValue;		Info.RematerializedValues[RematerializedValue] = LiveValue;
} else {		} else {
InvokeInst *Invoke = cast<InvokeInst>(CS.getInstruction());		InvokeInst *Invoke = cast<InvokeInst>(CS.getInstruction());

Instruction *NormalInsertBefore =		Instruction *NormalInsertBefore =
&*Invoke->getNormalDest()->getFirstInsertionPt();		&*Invoke->getNormalDest()->getFirstInsertionPt();
		if (UseRematSchedule)
		NormalInsertBefore = findInsertPoint(DT, LiveValue, ChainToBase,
		NormalInsertBefore);
Instruction *UnwindInsertBefore =		Instruction *UnwindInsertBefore =
&*Invoke->getUnwindDest()->getFirstInsertionPt();		&*Invoke->getUnwindDest()->getFirstInsertionPt();
		if (UseRematSchedule)
		UnwindInsertBefore = findInsertPoint(DT, LiveValue, ChainToBase,
		UnwindInsertBefore);

Instruction *NormalRematerializedValue =		Instruction *NormalRematerializedValue =
rematerializeChain(NormalInsertBefore);		rematerializeChain(NormalInsertBefore);
Instruction *UnwindRematerializedValue =		Instruction *UnwindRematerializedValue =
rematerializeChain(UnwindInsertBefore);		rematerializeChain(UnwindInsertBefore);

Info.RematerializedValues[NormalRematerializedValue] = LiveValue;		Info.RematerializedValues[NormalRematerializedValue] = LiveValue;
Info.RematerializedValues[UnwindRematerializedValue] = LiveValue;		Info.RematerializedValues[UnwindRematerializedValue] = LiveValue;
(154 lines not shown)	for (size_t i = 0; i < Records.size(); i++) {
splitVectorValues(cast<Instruction>(Statepoint), Info.LiveSet,		splitVectorValues(cast<Instruction>(Statepoint), Info.LiveSet,
Info.PointerToBase, DT);		Info.PointerToBase, DT);
}		}

// In order to reduce live set of statepoint we might choose to rematerialize		// In order to reduce live set of statepoint we might choose to rematerialize
// some values instead of relocating them. This is purely an optimization and		// some values instead of relocating them. This is purely an optimization and
// does not influence correctness.		// does not influence correctness.
for (size_t i = 0; i < Records.size(); i++)		for (size_t i = 0; i < Records.size(); i++)
rematerializeLiveValues(ToUpdate[i], Records[i], TTI);		rematerializeLiveValues(ToUpdate[i], Records[i], TTI, DT);

// We need this to safely RAUW and delete call or invoke return values that		// We need this to safely RAUW and delete call or invoke return values that
// may themselves be live over a statepoint. For details, please see usage in		// may themselves be live over a statepoint. For details, please see usage in
// makeStatepointExplicitImpl.		// makeStatepointExplicitImpl.
std::vector<DeferredReplacement> Replacements;		std::vector<DeferredReplacement> Replacements;

// Now run through and replace the existing statepoints with new ones with		// Now run through and replace the existing statepoints with new ones with
// the live variables listed. We do not yet update uses of the values being		// the live variables listed. We do not yet update uses of the values being
(503 lines not shown)

test/Transforms/RewriteStatepointsForGC/remat-schedule.ll

				; RUN: opt %s -rewrite-statepoints-for-gc -S -rs4gc-schedule-remat=1 -rs4gc-use-deopt-bundles 2>&1 | FileCheck %s

				declare void @use_obj16(i16 addrspace(1)*) "gc-leaf-function"
				declare void @use_obj32(i32 addrspace(1)*) "gc-leaf-function"
				declare void @use_obj64(i64 addrspace(1)*) "gc-leaf-function"
				declare void @do_safepoint()

				define void @test(i32 addrspace(1)* %base) gc "statepoint-example" {
				; CHECK-LABEL: @test
				entry:
				%ptr = getelementptr i32, i32 addrspace(1)* %base, i32 15
				; CHECK: getelementptr i32, i32 addrspace(1)* %base, i32 15
				call void @do_safepoint() ["deopt" ()]
				; CHECK: gc.relocate
				; CHECK: bitcast
				; CHECK: br label %next
				br label %next
				next:
				; CHECK: call
				; CHECK: getelementptr i32, i32 addrspace(1)* %base.relocated.casted, i32 15
				; CHECK: call
				call void @use_obj32(i32 addrspace(1)* %base)
				call void @use_obj32(i32 addrspace(1)* %ptr)
				ret void
				}

				; Only need a PHI for the base pointer in this loop. The derived pointer
				; can be nicely remated
				define void @test2(i32 addrspace(1)* %base) gc "statepoint-example" {
				; CHECK-LABEL: test2
				entry:
				%ptr.gep = getelementptr i32, i32 addrspace(1)* %base, i32 15
				; CHECK: getelementptr
				br label %loop

				loop:
				; CHECK: phi i32 addrspace(1)* [ %base, %entry ], [ %base.relocated.casted, %loop ]
				; CHECK: %ptr.gep.remat = getelementptr
				call void @use_obj32(i32 addrspace(1)* %ptr.gep)
				call void @do_safepoint() ["deopt" ()]
				; CHECK: gc.relocate
				br label %loop
				}

				@G = external global i32

				; Can sink the remat into the bottom of the loop which lets
				; it merge with the load
				define i8 @test3(i8 addrspace(1)* %base) gc "statepoint-example" {
				; CHECK-LABEL: test3
				entry:
				%gep = getelementptr i8, i8 addrspace(1)* %base, i64 8
				br label %loop

				loop:
				%iv = phi i32 [0, %entry], [%iv.next, %merge]
				%iv.next = add i32 %iv, 1
				%test = load volatile i32, i32* @G
				%sp_flag = icmp eq i32 %test, 0
				br i1 %sp_flag, label %safepoint, label %merge

				safepoint:
				call void @do_safepoint() [ "deopt" () ]
				br label %merge

				merge:
				; CHECK-LABEL: merge:
				; CHECK: phi i8 addrspace(1)*
				; CHECK: %gep.remat = getelementptr
				%cnd = icmp ne i32 10000, %iv
				br i1 %cnd, label %loop, label %exit

				exit:
				; CHECK-LABEL: exit:
				; CHECK: load i8, i8 addrspace(1)* %gep.remat
				%load = load i8, i8 addrspace(1)* %gep
				ret i8 %load
				}

				; In this case, we can't directly push the remat around the loop, but since we
				; got it to the end of the loop (i.e. no phi for merge), instcombine has no trouble
				; finishing the job for us.
				define i8 @test4(i8 addrspace(1)* %base) gc "statepoint-example" {
				; CHECK-LABEL: test4
				entry:
				%gep = getelementptr i8, i8 addrspace(1)* %base, i64 8
				br label %loop

				loop:
				; CHECK-LABEL: loop:
				; CHECK: phi i8 addrspace(1)* [ %gep, %entry ], [ %gep.remat, %merge ]
				; CHECK: load i8, i8 addrspace(1)*
				%iv = phi i32 [0, %entry], [%iv.next, %merge]
				%iv.next = add i32 %iv, 1
				%load = load i8, i8 addrspace(1)* %gep
				%test = load volatile i32, i32* @G
				%sp_flag = icmp eq i32 %test, 0
				br i1 %sp_flag, label %safepoint, label %merge

				safepoint:
				call void @do_safepoint() [ "deopt" () ]
				br label %merge

				merge:
				; CHECK-LABEL: merge:
				; CHECK: phi i8 addrspace(1)*
				; CHECK: %gep.remat = getelementptr
				%cnd = icmp ne i32 10000, %iv
				br i1 %cnd, label %loop, label %exit

				exit:
				; CHECK-LABEL: exit:
				ret i8 %load
				}




				declare token @llvm.experimental.gc.statepoint.p0f_isVoidf(i64, i32, void ()*, i32, i32, ...)

test/Transforms/RewriteStatepointsForGC/rematerialize-derived-pointers.ll

	; RUN: opt %s -rewrite-statepoints-for-gc -S 2>&1 | FileCheck %s			; RUN: opt %s -rewrite-statepoints-for-gc -S -rs4gc-schedule-remat=0 2>&1 | FileCheck %s

	declare void @use_obj16(i16 addrspace(1)*)			declare void @use_obj16(i16 addrspace(1)*)
	declare void @use_obj32(i32 addrspace(1)*)			declare void @use_obj32(i32 addrspace(1)*)
	declare void @use_obj64(i64 addrspace(1)*)			declare void @use_obj64(i64 addrspace(1)*)
	declare void @do_safepoint()			declare void @do_safepoint()

	define void @"test_gep_const"(i32 addrspace(1)* %base) gc "statepoint-example" {			define void @"test_gep_const"(i32 addrspace(1)* %base) gc "statepoint-example" {
	; CHECK-LABEL: test_gep_const			; CHECK-LABEL: test_gep_const
	(247 lines not shown)