Details

Reviewers

sanjoy
hfinkel

Commits

rG9364432cec63: Unmerge GEPs to reduce register pressure on IndirectBr edges.
rL312930: Unmerge GEPs to reduce register pressure on IndirectBr edges.

Summary

GEP merging can sometimes increase the number of live values, and therefore the
register pressure, across control-flow edges. This causes performance problems,
particularly when the increased register pressure results in spills.

This change implements GEP unmerging around an IndirectBr in certain cases to
mitigate the issue. It is done in the CodeGenPrepare pass, after all GEP
merging has happened.
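
The index arithmetic behind the unmerging can be sketched outside of LLVM. The following is a minimal model, not the patch itself: the `Gep` struct and `unmergeOnto` are hypothetical names, addresses are plain integers, and the element size is assumed to be 1 (as for `i8`).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a single-index GEP over i8: result = base + index.
struct Gep {
  std::int64_t base;   // address of the pointer operand
  std::int64_t index;  // constant index (element size 1 assumed)
  std::int64_t address() const { return base + index; }
};

// Rebase `user` onto the result of `anchor`. Both must share the same
// pointer operand. The rebased GEP uses anchor's result as its base and
// the index difference (UIdx - Idx) as its index.
Gep unmergeOnto(const Gep &anchor, const Gep &user) {
  assert(anchor.base == user.base && "GEPs must share a pointer operand");
  return Gep{anchor.address(), user.index - anchor.index};
}
```

With `anchor = {1000, 1}` and `user = {1000, 3}`, the rebased GEP has base 1001 and index 2 and still computes address 1003, so the original pointer operand no longer needs to stay live for the user.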

With this patch, the Python interpreter loop runs faster by ~5%.

Diff Detail

Build Status

Buildable 9310
Build 9310: arc lint + arc unit

Event Timeline

hjyamauchi created this revision. Aug 15 2017, 3:47 PM

Harbormaster completed remote builds in B9310: Diff 111271. Aug 15 2017, 3:47 PM

junbuml added a subscriber: junbuml. Aug 16 2017, 7:09 AM

Hi Hal, would you take a look at this change?

> With this patch, the Python interpreter loop runs faster by ~5%.

On what platform?

Did you try always doing this (instead of just doing it over indirect branches)? You're possibly increasing the critical path by doing the computation this way, and if you have a processor with a good branch predictor maybe this shows up? But if you told me that it generally always helps, I'd not be too surprised either.

You should be careful not to create constants that aren't cheap to represent if you start with ones that are. Specifically, UGEPIIdx->getValue() - GEPIIdx->getValue() might be large (if one of those values is negative). TTI has getIntImmCost, and if both UIdx and Idx are cheap, but (UIdx - Idx) is expensive, we probably don't want to do this.
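
The concern can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the check that landed: `toyImmCost` is a hypothetical stand-in for a target query such as getIntImmCost (here an immediate counts as cheap if it fits in a signed 8-bit field), and `differenceIsAcceptable` is an invented name.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical cost model, NOT a real TTI query: cost 1 if the immediate
// fits in a signed 8-bit field, cost 2 otherwise.
int toyImmCost(std::int64_t imm) { return (imm >= -128 && imm <= 127) ? 1 : 2; }

// Returns false only in the case the review describes: both original
// indices are cheap, but their difference is expensive to materialize.
bool differenceIsAcceptable(std::int64_t idx, std::int64_t uidx) {
  // Keep the toy model in a range where uidx - idx cannot overflow.
  const std::int64_t limit = std::int64_t{1} << 62;
  assert(idx > -limit && idx < limit && uidx > -limit && uidx < limit);
  const int cheap = 1;
  const bool inputsCheap =
      toyImmCost(idx) <= cheap && toyImmCost(uidx) <= cheap;
  const bool diffCheap = toyImmCost(uidx - idx) <= cheap;
  return !inputsCheap || diffCheap;
}
```

Under this model, indices -100 and 100 are each cheap, but their difference of 200 is not, so the rewrite would be skipped.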

Sorry for the delay.

In D36772#848262, @hfinkel wrote:

> With this patch, the Python interpreter loop runs faster by ~5%.

> On what platform?

x86-64 Haswell.

> Did you try always doing this (instead of just doing it over indirect branches)? You're possibly increasing the critical path by doing the computation this way, and if you have a processor with a good branch predictor maybe this shows up? But if you told me that it generally always helps, I'd not be too surprised either.

Good points.

No, I didn't try always doing this. My thoughts follow:

As you point out, I think there is a tradeoff between potential spills and a potentially longer critical path, and it's not 100% clear which way is better *in general*: that depends on how the CPU works, and at this stage it's not easy to tell whether unmerging would actually save spills. For the indirectbr in the Python interpreter, though, it's a clear win due to fewer spills (on x86-64).

The benefits of restricting this to the relatively rare indirectbr case are that (a) it might limit the impact of a potentially longer critical path, if any, and (b) the impact on compile time should be minimal, because the indirectbr test is the first check we do and the common case never reaches the subsequent, more elaborate checks.

Maybe we should query the target and do this only if the number of registers is low (or only on x86(-64))?

I wish I could formulate this in a better way. I haven't found a better way so far. If you see a better way, please let me know.

> You should be careful not to create constants that aren't cheap to represent if you start with ones that are. Specifically, UGEPIIdx->getValue() - GEPIIdx->getValue() might be large (if one of those values is negative). TTI has getIntImmCost, and if both UIdx and Idx are cheap, but (UIdx - Idx) is expensive, we probably don't want to do this.

Agreed. Will work on this.

Added getIntImmCost checks. Please take another look.

Harbormaster completed remote builds in B9618: Diff 112638. Aug 24 2017, 5:01 PM

As this patch can affect ARM targets, I am doing some benchmarking.
I've got the LNT benchmark results for AArch64 (Cortex-A57). There is no difference in performance. I'll have more results soon.
It would be interesting to see which benchmarks were used to measure the improvements.

In D36772#852865, @eastig wrote:

> As this patch can affect ARM targets, I am doing some benchmarking.
> I've got the LNT benchmark results for AArch64 (Cortex-A57). There is no difference in performance. I'll have more results soon.
> It would be interesting to see which benchmarks were used to measure the improvements.

Good to know.

The improvements I saw were measured with Python programs (like the following) running on the Python 2.7 runtime compiled with LLVM at r309573 on x86-64 Haswell.

for _ in xrange(1, 100000000):
  continue

This patch reduces register spills in the computed goto-based interpreter loop.
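
For readers unfamiliar with the pattern, a computed-goto dispatch loop looks roughly like the sketch below. It is not CPython's interpreter: the three-opcode bytecode and `runBytecode` are hypothetical, and the code relies on the GNU "labels as values" extension (accepted by GCC and Clang, not standard C++). Every handler ends in its own indirect jump, which is what becomes an `indirectbr` with many successor blocks in LLVM IR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bytecode: 0 = halt, 1 = add the next byte to acc,
// 2 = increment acc. Opcodes outside 0..2 are not handled.
int runBytecode(const std::uint8_t *pc) {
  assert(pc != nullptr);
  // GNU extension: &&label takes the address of a label.
  static void *const table[] = {&&op_halt, &&op_add, &&op_inc};
  int acc = 0;
  goto *table[*pc++];  // dispatch the first opcode
op_add:
  acc += *pc++;        // consume the operand byte
  goto *table[*pc++];  // dispatch the next opcode
op_inc:
  ++acc;
  goto *table[*pc++];
op_halt:
  return acc;
}
```

Values such as `pc` stay live across every dispatch edge, so any extra pointer kept alive by a merged GEP adds register pressure to the whole loop.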

Noted the tradeoff between register pressure and critical path in the comment.

Also rebased.

Harbormaster completed remote builds in B9705: Diff 112941. Aug 28 2017, 12:44 PM

This is looking good. A couple additional things...

lib/CodeGen/CodeGenPrepare.cpp
6207	I don't think this check is necessary. GEPIOp is constrained to be defined in SrcBlock, and it's SrcBlock that has the IndirectBr terminator, so any use of GEPIOp outside of SrcBlock keeps it live over the indirect edge. I don't see why we wouldn't unmerge regardless of the parent block here.
6247	You'll also need to make sure that this GEP is not marked as inbounds if GEPI was not:

    if (!GEPI->isInBounds()) {
      UGEPI->setIsInBounds(false);
    }

Otherwise the result of GEPI could be not-in-bounds, resulting in UB if that's used as the input to an inbounds UGEPI.

One comment addressed and another needs clarification.

Harbormaster completed remote builds in B9946: Diff 114084. Sep 6 2017, 3:22 PM

hjyamauchi added inline comments. Sep 6 2017, 3:23 PM

lib/CodeGen/CodeGenPrepare.cpp
6247	I'm not very familiar with how inbounds works. Is an inbounds GEP UB if it takes a non-inbounds GEP as its operand (regardless of whether the index/offset is actually in bounds or not)? For example:

Before:

    %GEPIOp = ...
    %GEPI = gep %GEPIOp 2
    %UGEPI = gep inbounds %GEPIOp 1

After:

    %GEPIOp = ...
    %GEPI = gep %GEPIOp 2
    %UGEPI = gep inbounds %GEPI -1

Suppose "gep %GEPIOp 2" is not in bounds and "gep inbounds %GEPIOp 1" is in bounds. Neither is UB. Is "gep inbounds %GEPI -1" UB just because it takes (non-inbounds) "gep %GEPIOp 2" as an operand, even though the offset/index of "gep inbounds %GEPI -1" is actually in bounds?

hfinkel added inline comments. Sep 6 2017, 5:25 PM

lib/CodeGen/CodeGenPrepare.cpp
6247	Yes, the base pointer needs to be inbounds too. The LangRef says, "If the inbounds keyword is present, the result value of the getelementptr is a poison value if the base pointer is not an in bounds address of an allocated object, or if any of the addresses that would be formed by successive addition of the offsets implied by the indices to the base address with infinitely precise signed arithmetic are not an in bounds address of that allocated object." That's exactly why you need to account for the inbounds here.
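
The rule requested here can be written down as a small truth function. This is a sketch only: `GepFlags` and `flagsAfterRebase` are hypothetical names rather than LLVM API, and the model covers just the inbounds bit.

```cpp
#include <cassert>

// Hypothetical model of the one flag that matters for this discussion.
struct GepFlags {
  bool inBounds;
};

// After UGEPI is rebased onto GEPI, it may keep `inbounds` only if GEPI,
// its new base pointer, was itself inbounds.
GepFlags flagsAfterRebase(GepFlags anchor, GepFlags user) {
  GepFlags result{anchor.inBounds && user.inBounds};
  // Invariant from the review: never inbounds on top of a non-inbounds base.
  assert(!(result.inBounds && !anchor.inBounds));
  return result;
}
```

In the example from the question above, GEPI is not inbounds and UGEPI is, so the rebased UGEPI has to drop the flag.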

Addressed a comment.

Harbormaster completed remote builds in B9980: Diff 114225. Sep 7 2017, 11:59 AM

LGTM

This revision is now accepted and ready to land. Sep 7 2017, 3:14 PM

Rebased.

Harbormaster completed remote builds in B10087: Diff 114637. Sep 11 2017, 10:52 AM

Thanks, Hal.

hjyamauchi closed this revision. Sep 11 2017, 10:53 AM

Diff 111271

lib/CodeGen/CodeGenPrepare.cpp

[6,084 lines not shown]	static bool splitMergedValStore(StoreInst &SI, const DataLayout &DL,
CreateSplitStore(LValue, false);		CreateSplitStore(LValue, false);
CreateSplitStore(HValue, true);		CreateSplitStore(HValue, true);

// Delete the old store.		// Delete the old store.
SI.eraseFromParent();		SI.eraseFromParent();
return true;		return true;
}		}

		// Return true if the GEP has two operands, the first operand is of a sequential
		// type, and the second operand is a constant.
		static bool GEPSequentialConstIndexed(GetElementPtrInst *GEP) {
		gep_type_iterator I = gep_type_begin(*GEP);
		return GEP->getNumOperands() == 2 &&
		I.isSequential() &&
		isa<ConstantInt>(GEP->getOperand(1));
		}

		// Try unmerging GEPs to reduce liveness interference (register pressure) across
		// IndirectBr edges. Since IndirectBr edges tend to touch on many blocks,
		// reducing liveness interference across those edges benefits global register
		// allocation. Currently handles only certain cases.
		//
		// For example, unmerge %GEPI and %UGEPI as below.
		//
		// ---------- BEFORE ----------
		// SrcBlock:
		// ...
		// %GEPIOp = ...
		// ...
		// %GEPI = gep %GEPIOp, Idx
		// ...
		// indirectbr ... [ label %DstB0, label %DstB1, ... label %DstBi ... ]
		// (* %GEPI is alive on the indirectbr edges due to other uses ahead)
		// (* %GEPIOp is alive on the indirectbr edges only because it's used by
		// %UGEPI)
		//
		// DstB0: ... (there may be a gep similar to %UGEPI to be unmerged)
		// DstB1: ... (there may be a gep similar to %UGEPI to be unmerged)
		// ...
		//
		// DstBi:
		// ...
		// %UGEPI = gep %GEPIOp, UIdx
		// ...
		// ---------------------------
		//
		// ---------- AFTER ----------
		// SrcBlock:
		// ... (same as above)
		// (* %GEPI is still alive on the indirectbr edges)
		// (* %GEPIOp is no longer alive on the indirectbr edges as a result of the
		// unmerging)
		// ...
		//
		// DstBi:
		// ...
		// %UGEPI = gep %GEPI, (UIdx-Idx)
		// ...
		// ---------------------------
		//
		// The register pressure on the IndirectBr edges is reduced because %GEPIOp is
		// no longer alive on them.
		//
		// We try to unmerge GEPs here in CodeGenPrepare, as opposed to limiting merging
		// of GEPs in the first place in InstCombiner::visitGetElementPtrInst() so as
		// not to disable further simplifications and optimizations as a result of GEP
		// merging.
		static bool tryUnmergingGEPsAcrossIndirectBr(GetElementPtrInst *GEPI) {
		BasicBlock *SrcBlock = GEPI->getParent();
		// Check that SrcBlock ends with an IndirectBr. If not, give up.
		if (!isa<IndirectBrInst>(SrcBlock->getTerminator()))
		return false;
		// Check that GEPI is a simple gep with a single constant index.
		if (!GEPSequentialConstIndexed(GEPI))
		return false;
		Value *GEPIOp = GEPI->getOperand(0);
		// Check that GEPIOp is an instruction that's also defined in SrcBlock.
		if (!isa<Instruction>(GEPIOp))
		return false;
		auto *GEPIOpI = cast<Instruction>(GEPIOp);
		if (GEPIOpI->getParent() != SrcBlock)
		return false;
		// Check that GEP is used outside the block, meaning it's alive on the
		// IndirectBr edge(s).
		if (find_if(GEPI->users(), [&](User *Usr) {
		if (auto *I = dyn_cast<Instruction>(Usr)) {
		if (I->getParent() != SrcBlock) {
		return true;
		}
		}
		return false;
		}) == GEPI->users().end())
		return false;
		// The second elements of the GEP chains to be unmerged.
		std::vector<GetElementPtrInst *> UGEPIs;
		ConstantInt *GEPIIdx = cast<ConstantInt>(GEPI->getOperand(1));
		// Check each user of GEPIOp to check if unmerging would make GEPIOp not alive
		// on IndirectBr edges.
		for (User *Usr : GEPIOp->users()) {
		if (Usr == GEPI) continue;
		// Check if Usr is an Instruction. If not, give up.
		if (!isa<Instruction>(Usr))
		return false;
		auto *UI = cast<Instruction>(Usr);
		// Check if Usr in the same block as GEPIOp, which is fine, skip.
		if (UI->getParent() == SrcBlock)
		continue;
		// Check if Usr is in a block at the end of one of the IndirectBr edges and
		// that SrcBlock is the only predecessor of it. If not, give up.
		if (UI->getParent()->getSinglePredecessor() != SrcBlock)
		return false;
		// Check if Usr is a GEP. If not, give up.
		if (!isa<GetElementPtrInst>(Usr))
		return false;
		auto *UGEPI = cast<GetElementPtrInst>(Usr);
		// Check if UGEPI is a simple gep with a single constant index and GEPIOp is
		// the pointer operand to it. If so, record it in the vector. If not, give
		// up.
		if (GEPSequentialConstIndexed(UGEPI) && UGEPI->getOperand(0) == GEPIOp &&
		GEPIIdx->getType() ==
		cast<ConstantInt>(UGEPI->getOperand(1))->getType())
		UGEPIs.push_back(UGEPI);
		else
		return false;
		}
		if (UGEPIs.size() == 0)
		return false;
		// Now unmerge between GEPI and UGEPIs.
		for (GetElementPtrInst *UGEPI : UGEPIs) {
		UGEPI->setOperand(0, GEPI);
		ConstantInt *UGEPIIdx = cast<ConstantInt>(UGEPI->getOperand(1));
		Constant *NewUGEPIIdx =
		ConstantInt::get(GEPIIdx->getType(),
		UGEPIIdx->getValue() - GEPIIdx->getValue());
		UGEPI->setOperand(1, NewUGEPIIdx);
		}
		// After unmerging, verify that GEPIOp is actually only used in SrcBlock (not
		// alive on IndirectBr edges).
		assert(find_if(GEPIOp->users(), [&](User *Usr) {
		return cast<Instruction>(Usr)->getParent() != SrcBlock;
		}) == GEPIOp->users().end() && "GEPIOp is used outside SrcBlock");
		return true;
		}

bool CodeGenPrepare::optimizeInst(Instruction *I, bool &ModifiedDT) {		bool CodeGenPrepare::optimizeInst(Instruction *I, bool &ModifiedDT) {
// Bail out if we inserted the instruction to prevent optimizations from		// Bail out if we inserted the instruction to prevent optimizations from
// stepping on each other's toes.		// stepping on each other's toes.
if (InsertedInsts.count(I))		if (InsertedInsts.count(I))
return false;		return false;

if (PHINode *P = dyn_cast<PHINode>(I)) {		if (PHINode *P = dyn_cast<PHINode>(I)) {
// It is possible for very late stage optimizations (such as SimplifyCFG)		// It is possible for very late stage optimizations (such as SimplifyCFG)
// to introduce PHI nodes too late to be cleaned up. If we detect such a		// to introduce PHI nodes too late to be cleaned up. If we detect such a
// trivial PHI, go ahead and zap it here.		// trivial PHI, go ahead and zap it here.
if (Value *V = SimplifyInstruction(P, {DL, TLInfo})) {		if (Value *V = SimplifyInstruction(P, {DL, TLInfo})) {
P->replaceAllUsesWith(V);		P->replaceAllUsesWith(V);
P->eraseFromParent();		P->eraseFromParent();
++NumPHIsElim;		++NumPHIsElim;
return true;		return true;
}		}
return false;		return false;
}		}

if (CastInst *CI = dyn_cast<CastInst>(I)) {		if (CastInst *CI = dyn_cast<CastInst>(I)) {
// If the source of the cast is a constant, then this should have		// If the source of the cast is a constant, then this should have
// already been constant folded. The only reason NOT to constant fold		// already been constant folded. The only reason NOT to constant fold
// it is if something (e.g. LSR) was careful to place the constant		// it is if something (e.g. LSR) was careful to place the constant
// evaluation in a block other than then one that uses it (e.g. to hoist		// evaluation in a block other than then one that uses it (e.g. to hoist
// the address of globals out of a loop). If this is the case, we don't		// the address of globals out of a loop). If this is the case, we don't
// want to forward-subst the cast.		// want to forward-subst the cast.
if (isa<Constant>(CI->getOperand(0)))		if (isa<Constant>(CI->getOperand(0)))
[78 lines not shown]	if (GEPI->hasAllZeroIndices()) {
Instruction *NC = new BitCastInst(GEPI->getOperand(0), GEPI->getType(),		Instruction *NC = new BitCastInst(GEPI->getOperand(0), GEPI->getType(),
GEPI->getName(), GEPI);		GEPI->getName(), GEPI);
GEPI->replaceAllUsesWith(NC);		GEPI->replaceAllUsesWith(NC);
GEPI->eraseFromParent();		GEPI->eraseFromParent();
++NumGEPsElim;		++NumGEPsElim;
optimizeInst(NC, ModifiedDT);		optimizeInst(NC, ModifiedDT);
return true;		return true;
}		}
		if (tryUnmergingGEPsAcrossIndirectBr(GEPI)) {
		return true;
		}
return false;		return false;
}		}

if (CallInst *CI = dyn_cast<CallInst>(I))		if (CallInst *CI = dyn_cast<CallInst>(I))
return optimizeCallInst(CI, ModifiedDT);		return optimizeCallInst(CI, ModifiedDT);

if (SelectInst *SI = dyn_cast<SelectInst>(I))		if (SelectInst *SI = dyn_cast<SelectInst>(I))
return optimizeSelectInst(SI);		return optimizeSelectInst(SI);
[305 lines not shown]

test/Transforms/CodeGenPrepare/gep-unmerging.ll

This file was added.

				; RUN: opt -codegenprepare -S < %s | FileCheck %s

				@exit_addr = constant i8* blockaddress(@gep_unmerging, %exit)
				@op1_addr = constant i8* blockaddress(@gep_unmerging, %op1)
				@op2_addr = constant i8* blockaddress(@gep_unmerging, %op2)
				@op3_addr = constant i8* blockaddress(@gep_unmerging, %op3)
				@dummy = global i8 0

				define void @gep_unmerging(i1 %pred, i8* %p0) {
				entry:
				%table = alloca [256 x i8*]
				%table_0 = getelementptr [256 x i8*], [256 x i8*]* %table, i64 0, i64 0
				%table_1 = getelementptr [256 x i8*], [256 x i8*]* %table, i64 0, i64 1
				%table_2 = getelementptr [256 x i8*], [256 x i8*]* %table, i64 0, i64 2
				%table_3 = getelementptr [256 x i8*], [256 x i8*]* %table, i64 0, i64 3
				%exit_a = load i8*, i8** @exit_addr
				%op1_a = load i8*, i8** @op1_addr
				%op2_a = load i8*, i8** @op2_addr
				%op3_a = load i8*, i8** @op3_addr
				store i8* %exit_a, i8** %table_0
				store i8* %op1_a, i8** %table_1
				store i8* %op2_a, i8** %table_2
				store i8* %op3_a, i8** %table_3
				br label %indirectbr

				op1:
				; CHECK-LABEL: op1:
				; CHECK-NEXT: %p1_inc2 = getelementptr i8, i8* %p_postinc, i64 2
				; CHECK-NEXT: %p1_inc1 = getelementptr i8, i8* %p_postinc, i64 1
				%p1_inc2 = getelementptr i8, i8* %p_preinc, i64 3
				%p1_inc1 = getelementptr i8, i8* %p_preinc, i64 2
				%a10 = load i8, i8* %p_postinc
				%a11 = load i8, i8* %p1_inc1
				%a12 = add i8 %a10, %a11
				store i8 %a12, i8* @dummy
				br i1 %pred, label %indirectbr, label %exit

				op2:
				; CHECK-LABEL: op2:
				; CHECK-NEXT: %p2_inc = getelementptr i8, i8* %p_postinc, i64 1
				%p2_inc = getelementptr i8, i8* %p_preinc, i64 2
				%a2 = load i8, i8* %p_postinc
				store i8 %a2, i8* @dummy
				br i1 %pred, label %indirectbr, label %exit

				op3:
				br i1 %pred, label %indirectbr, label %exit

				indirectbr:
				%p_preinc = phi i8* [%p0, %entry], [%p1_inc2, %op1], [%p2_inc, %op2], [%p_postinc, %op3]
				%p_postinc = getelementptr i8, i8* %p_preinc, i64 1
				%next_op = load i8, i8* %p_preinc
				%p_zext = zext i8 %next_op to i64
				%slot = getelementptr [256 x i8*], [256 x i8*]* %table, i64 0, i64 %p_zext
				%target = load i8*, i8** %slot
				indirectbr i8* %target, [label %exit, label %op1, label %op2]

				exit:
				ret void
				}

This is an archive of the discontinued LLVM Phabricator instance.

Unmerge GEPs to reduce register pressure on IndirectBr edges.
ClosedPublic