This is an archive of the discontinued LLVM Phabricator instance.

[GVN] Perform Scalar PRE on gep indices that feed loads before doing Load PRE
ClosedPublic

Authored by bmakam on Nov 3 2014, 3:25 PM.


Details

Reviewers

HaoLiu
jmolloy
dberlin
resistor
Jiangning
hfinkel
mcrosier
apazos

Summary

All,

This patch addresses the missing PRE opportunities in 450.soplex initially reported by James Molloy.

This patch refactors James' patch to "Make GVN more iterative", based on Daniel's comments that iterating all of GVN over again is a pretty big hammer.
Instead of iterating GVN all over, this patch performs ScalarPRE on any scalar instructions that a load depends on before performing LoadPRE on that load.
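
To make the targeted pattern concrete, here is a hypothetical source-level function with the same control flow as the pre-gep-load.ll test in this revision. It is a reconstruction for illustration only, not code from 450.soplex. The second access to p[0][i] is reached either after the first access (where the sign-extended index and the loaded value already exist) or directly from the switch (where they do not). Once the index computation is made available on both paths, the load itself becomes a PRE candidate.

```cpp
// Illustrative only: mirrors the CFG of pre-gep-load.ll (entry, sw.bb,
// sw.bb2, sw.default, return). Names follow the test, not any benchmark.
double foo(int stat, int i, double **p) {
  switch (stat) {
  case 0:
  case 1: {
    // First use: the index is sign-extended and p[0][i] is loaded here.
    double sub = p[0][i] - 1.0;
    if (sub < 0.0)
      return sub;
  }
  // fall through
  case 2:
    // Second use: reached from the block above (index and value already
    // computed) or straight from the switch (nothing computed yet).
    return 3.0 - p[0][i];
  default:
    return 0.0;
  }
}
```

Called with stat == 2, only the second access executes; with stat == 0 or 1 and a non-negative difference, both execute, and the second index computation and load are redundant.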

When tested on a Cortex-A57, James' initial patch to make GVN more iterative improved 450.soplex by 3%. This patch improved 450.soplex by 7% without iterating GVN all over again.
To achieve this I had to enable the reverse post-order traversal in iterateOnFunction, because the dependent scalar instructions have to be value numbered before ScalarPRE is performed on them. Although traversing in reverse post order is costly in terms of compile time, it may be cheaper than iterating GVN all over again, and it also results in better performance. What do you guys think?
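
As a rough illustration of why the traversal order matters, the sketch below computes a reverse post order over a toy CFG. It does not use LLVM's ReversePostOrderTraversal; the CFG type and function names are assumptions made for this example. In a reverse post order, every block is visited after all of its predecessors except along back edges, so the values feeding a join block have already been numbered when the join is processed.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy CFG (hypothetical type): Succs[b] lists the successors of block b.
using CFG = std::vector<std::vector<std::size_t>>;

// Depth-first walk that records a block only after all of its
// successors have been finished, i.e. in post order.
static void postOrder(const CFG &Succs, std::size_t Block,
                      std::vector<bool> &Visited,
                      std::vector<std::size_t> &Out) {
  Visited[Block] = true;
  for (std::size_t S : Succs[Block])
    if (!Visited[S])
      postOrder(Succs, S, Visited, Out);
  Out.push_back(Block);
}

// Reversing the post order yields an order in which each block comes
// before its successors, apart from back edges.
std::vector<std::size_t> reversePostOrder(const CFG &Succs,
                                          std::size_t Entry) {
  std::vector<bool> Visited(Succs.size(), false);
  std::vector<std::size_t> Order;
  postOrder(Succs, Entry, Visited, Order);
  std::reverse(Order.begin(), Order.end());
  return Order;
}
```

For the diamond 0 -> {1, 2}, 1 -> {3}, 2 -> {3}, this returns 0, 2, 1, 3: the join block 3 comes last. A plain depth-first pre-order over the same graph would visit 0, 1, 3, 2, reaching the join before one of its predecessors.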


Event Timeline

bmakam updated this revision to Diff 15736. Nov 3 2014, 3:25 PM

bmakam retitled this revision from to [GVN] Perform Scalar PRE on gep indices that feed loads before doing Load PRE.

bmakam updated this object.

bmakam edited the test plan for this revision.

bmakam added reviewers: jmolloy, dberlin, apazos, Jiangning, HaoLiu, mcrosier, hfinkel.

bmakam added a subscriber: Unknown Object (MLST).

Thanks for continuing to work on this!

Generally, please post patches with full context. For instructions on how to do this, see: http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface

A 7% speedup sounds nice; does anything else improve? Please also provide some compile-time slowdown numbers, so that we can get a better handle on the cost/benefit analysis.

lib/Transforms/Scalar/GVN.cpp
2655	If the patch turns this on, please just remove the #if.

I turned on the slow path and commented out the fast path. If we can decide we no longer need to keep the fast path around I will clean it up.

I am running a perf run to gather compile times and will update the comment once I get back the results.

[Update1]
While I am still waiting for perf data on other benchmarks in Spec2k/2k6, here is the data I got so far:

a) Compilation times:

compiling clang.bc (Thanks to Jiangning for running the tests)

With the patch,

real 19m56.978s
user 141m16.602s
sys 2m59.942s

Without the patch, (original)

real 19m58.099s
user 141m21.219s
sys 2m58.493s

which is 2s (0.85%) slower, so the slowdown is in the noise range.
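
For readers who want to check timings like these, the following sketch converts `time`-style figures to seconds and computes a relative difference. The helper names are made up for this example. Taking the real and user figures exactly as listed above, the magnitude of the difference comes out below 0.1% in both cases, which supports the conclusion that it is within noise; a gap this small can land on either side from run to run.

```cpp
// Convert a "<minutes>m<seconds>s" timing to seconds.
constexpr double toSeconds(int Minutes, double Seconds) {
  return Minutes * 60.0 + Seconds;
}

// Relative change of Patched versus Baseline in percent.
// Negative means the patched run took less time.
double percentChange(double Baseline, double Patched) {
  return (Patched - Baseline) / Baseline * 100.0;
}

// Figures copied from the comment above.
double realDelta() {
  return percentChange(toSeconds(19, 58.099), toSeconds(19, 56.978));
}
double userDelta() {
  return percentChange(toSeconds(141, 21.219), toSeconds(141, 16.602));
}
```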

On Spec, 433.milc slowed down the most, by 5%. The other slowdowns were:

179.art - 2%
164.gzip - 2%
445.gobmk - 2%

b) Runtime Performance:
Other benchmarks whose performance improved:
447.dealII - 6%
403.gcc - 14%

I will update data on other benchmarks later.

[Update2]
401.bzip2 - 4%
464.h264ref - 4%
186.crafty - 7%

The only regression above the noise range was in 181.mcf, at -3%.

bmakam added inline comments. Nov 4 2014, 6:09 AM

lib/Transforms/Scalar/GVN.cpp
2655	I will turn on the slow path and turn off the fast path in my next patch that I will upload with full context. I am not sure if we still want to keep the fast path commented out in the code or clean it up.

hfinkel added inline comments. Nov 4 2014, 6:39 AM

lib/Transforms/Scalar/GVN.cpp
2655	If you want to still keep it, add a command-line flag to enable it. We don't generally keep commented-out code, as a policy.

bmakam added a reviewer: resistor. Nov 6 2014, 1:42 AM

All,

I updated my comment with the compile-time slowdowns and the other benchmarks that improve with this patch. The compile-time slowdowns are in the noise range, and I do not see any performance regressions greater than 3% in Spec. Based on this data, I would like to get rid of the fast path and will upload a patch removing the commented-out code if you agree. Thanks for reviewing.

Sounds good to me.
I'll review in detail later today

Cleaned up dead code.

Ping.

LGTM modulo one comment

lib/Transforms/Scalar/GVN.cpp
2447	If you are going to add LoadInst, you might as well add all the memory ops (anything where getOpCode() > MemoryOpsBegin && getOpCode() < MemoryOpsEnd). If you do this, I'd add isMemoryOp to Instruction.h alongside isBinaryOp, etc.

bmakam added inline comments. Nov 13 2014, 1:05 PM

lib/Transforms/Scalar/GVN.cpp
2447	Thanks for catching this, Daniel. The LoadInst was unintentional; it is already covered by CurInst->mayReadFromMemory(). I will prepare a patch removing the LoadInst. If LGTM, please feel free to +2 it since I do not have commit rights.
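
The opcode-range check suggested in this thread could look roughly like the sketch below. The enum values are invented for illustration and do not match LLVM's real opcode numbering, and isMemoryOp here is a free function written for this example, not an existing LLVM API. The sketch also assumes a half-open range (>= begin, < end), where MemoryOpsBegin equals the first memory opcode; the comment above writes a strict > on the lower bound.

```cpp
// Hypothetical opcode numbering, for illustration only.
enum Opcode : unsigned {
  Add = 1,
  Sub,
  MemoryOpsBegin,            // marks the first memory opcode
  Alloca = MemoryOpsBegin,
  Load,
  Store,
  GetElementPtr,
  MemoryOpsEnd,              // one past the last memory opcode
  ICmp = MemoryOpsEnd
};

// Range test in the style of an isBinaryOp-like helper.
inline bool isMemoryOp(unsigned Op) {
  return Op >= MemoryOpsBegin && Op < MemoryOpsEnd;
}
```

With this numbering, Alloca, Load, Store and GetElementPtr are classified as memory ops, while Add, Sub and ICmp are not.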

Addressed Daniel's comment and also rebased.

Approved based on Daniel's review.

This revision is now accepted and ready to land. Nov 13 2014, 1:16 PM

Committed r221924.

Revision Contents

Path                                   Size
lib/Transforms/Scalar/GVN.cpp          319 lines
test/Transforms/GVN/pre-gep-load.ll    49 lines

Diff 15736

lib/Transforms/Scalar/GVN.cpp

Context not available.
 #include "llvm/ADT/DepthFirstIterator.h"
 #include "llvm/ADT/Hashing.h"
 #include "llvm/ADT/MapVector.h"
+#include "llvm/ADT/PostOrderIterator.h"
 #include "llvm/ADT/SetVector.h"
 #include "llvm/ADT/SmallPtrSet.h"
 #include "llvm/ADT/Statistic.h"
Context not available.
     void dump(DenseMap<uint32_t, Value*> &d);
     bool iterateOnFunction(Function &F);
     bool performPRE(Function &F);
+    bool performScalarPRE(Instruction *I);
     Value *findLeader(const BasicBlock *BB, uint32_t num);
     void cleanupGlobalSets();
     void verifyRemoved(const Instruction *I) const;
Context not available.
     return false;
   }
 
+  // If this load follows a GEP, see if we can PRE the indices before analyzing.
+  if (GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(LI->getOperand(0))) {
+    for(GetElementPtrInst::op_iterator OI = GEP->idx_begin(),
+        OE = GEP->idx_end(); OI != OE; ++OI)
+      if (Instruction *I = dyn_cast<Instruction>(OI->get()))
+        performScalarPRE(I);
+  }
+
   // Step 2: Analyze the availability of the load
   AvailValInBlkVect ValuesPerBlock;
   UnavailBlkVect UnavailableBlocks;
Context not available.
   return ChangedFunction;
 }
 
-/// performPRE - Perform a purely local form of PRE that looks for diamond
-/// control flow patterns and attempts to perform simple PRE at the join point.
-bool GVN::performPRE(Function &F) {
-  bool Changed = false;
+bool GVN::performScalarPRE(Instruction *CurInst) {
   SmallVector<std::pair<Value*, BasicBlock*>, 8> predMap;
-  for (BasicBlock *CurrentBlock : depth_first(&F.getEntryBlock())) {
-    // Nothing to PRE in the entry block.
-    if (CurrentBlock == &F.getEntryBlock()) continue;
-
-    // Don't perform PRE on a landing pad.
-    if (CurrentBlock->isLandingPad()) continue;
-
-    for (BasicBlock::iterator BI = CurrentBlock->begin(),
-         BE = CurrentBlock->end(); BI != BE; ) {
-      Instruction *CurInst = BI++;
-
-      if (isa<AllocaInst>(CurInst) ||
-          isa<TerminatorInst>(CurInst) || isa<PHINode>(CurInst) ||
-          CurInst->getType()->isVoidTy() ||
-          CurInst->mayReadFromMemory() || CurInst->mayHaveSideEffects() ||
-          isa<DbgInfoIntrinsic>(CurInst))
-        continue;
-
-      // Don't do PRE on compares. The PHI would prevent CodeGenPrepare from
-      // sinking the compare again, and it would force the code generator to
-      // move the i1 from processor flags or predicate registers into a general
-      // purpose register.
-      if (isa<CmpInst>(CurInst))
-        continue;
-
-      // We don't currently value number ANY inline asm calls.
-      if (CallInst *CallI = dyn_cast<CallInst>(CurInst))
-        if (CallI->isInlineAsm())
-          continue;
-
-      uint32_t ValNo = VN.lookup(CurInst);
-
-      // Look for the predecessors for PRE opportunities. We're
-      // only trying to solve the basic diamond case, where
-      // a value is computed in the successor and one predecessor,
-      // but not the other. We also explicitly disallow cases
-      // where the successor is its own predecessor, because they're
-      // more complicated to get right.
-      unsigned NumWith = 0;
-      unsigned NumWithout = 0;
-      BasicBlock *PREPred = nullptr;
-      predMap.clear();
-
-      for (pred_iterator PI = pred_begin(CurrentBlock),
-           PE = pred_end(CurrentBlock); PI != PE; ++PI) {
-        BasicBlock *P = *PI;
-        // We're not interested in PRE where the block is its
-        // own predecessor, or in blocks with predecessors
-        // that are not reachable.
-        if (P == CurrentBlock) {
-          NumWithout = 2;
-          break;
-        } else if (!DT->isReachableFromEntry(P)) {
-          NumWithout = 2;
-          break;
-        }
-
-        Value* predV = findLeader(P, ValNo);
-        if (!predV) {
-          predMap.push_back(std::make_pair(static_cast<Value *>(nullptr), P));
-          PREPred = P;
-          ++NumWithout;
-        } else if (predV == CurInst) {
-          /* CurInst dominates this predecessor. */
-          NumWithout = 2;
-          break;
-        } else {
-          predMap.push_back(std::make_pair(predV, P));
-          ++NumWith;
-        }
-      }
-
-      // Don't do PRE when it might increase code size, i.e. when
-      // we would need to insert instructions in more than one pred.
-      if (NumWithout != 1 || NumWith == 0)
-        continue;
-
-      // Don't do PRE across indirect branch.
-      if (isa<IndirectBrInst>(PREPred->getTerminator()))
-        continue;
-
-      // We can't do PRE safely on a critical edge, so instead we schedule
-      // the edge to be split and perform the PRE the next time we iterate
-      // on the function.
-      unsigned SuccNum = GetSuccessorNumber(PREPred, CurrentBlock);
-      if (isCriticalEdge(PREPred->getTerminator(), SuccNum)) {
-        toSplit.push_back(std::make_pair(PREPred->getTerminator(), SuccNum));
-        continue;
-      }
-
-      // Instantiate the expression in the predecessor that lacked it.
-      // Because we are going top-down through the block, all value numbers
-      // will be available in the predecessor by the time we need them. Any
-      // that weren't originally present will have been instantiated earlier
-      // in this loop.
-      Instruction *PREInstr = CurInst->clone();
-      bool success = true;
-      for (unsigned i = 0, e = CurInst->getNumOperands(); i != e; ++i) {
-        Value *Op = PREInstr->getOperand(i);
-        if (isa<Argument>(Op) || isa<Constant>(Op) || isa<GlobalValue>(Op))
-          continue;
-
-        if (Value *V = findLeader(PREPred, VN.lookup(Op))) {
-          PREInstr->setOperand(i, V);
-        } else {
-          success = false;
-          break;
-        }
-      }
-
-      // Fail out if we encounter an operand that is not available in
-      // the PRE predecessor. This is typically because of loads which
-      // are not value numbered precisely.
-      if (!success) {
-        DEBUG(verifyRemoved(PREInstr));
-        delete PREInstr;
-        continue;
-      }
-
-      PREInstr->insertBefore(PREPred->getTerminator());
-      PREInstr->setName(CurInst->getName() + ".pre");
-      PREInstr->setDebugLoc(CurInst->getDebugLoc());
-      VN.add(PREInstr, ValNo);
-      ++NumGVNPRE;
-
-      // Update the availability map to include the new instruction.
-      addToLeaderTable(ValNo, PREInstr, PREPred);
-
-      // Create a PHI to make the value available in this block.
-      PHINode* Phi = PHINode::Create(CurInst->getType(), predMap.size(),
-                                     CurInst->getName() + ".pre-phi",
-                                     CurrentBlock->begin());
-      for (unsigned i = 0, e = predMap.size(); i != e; ++i) {
-        if (Value *V = predMap[i].first)
-          Phi->addIncoming(V, predMap[i].second);
-        else
-          Phi->addIncoming(PREInstr, PREPred);
-      }
-
-      VN.add(Phi, ValNo);
-      addToLeaderTable(ValNo, Phi, CurrentBlock);
-      Phi->setDebugLoc(CurInst->getDebugLoc());
-      CurInst->replaceAllUsesWith(Phi);
-      if (Phi->getType()->getScalarType()->isPointerTy()) {
-        // Because we have added a PHI-use of the pointer value, it has now
-        // "escaped" from alias analysis' perspective. We need to inform
-        // AA of this.
-        for (unsigned ii = 0, ee = Phi->getNumIncomingValues(); ii != ee;
-             ++ii) {
-          unsigned jj = PHINode::getOperandNumForIncomingValue(ii);
-          VN.getAliasAnalysis()->addEscapingUse(Phi->getOperandUse(jj));
-        }
-
-        if (MD)
-          MD->invalidateCachedPointerInfo(Phi);
-      }
-      VN.erase(CurInst);
-      removeFromLeaderTable(ValNo, CurInst, CurrentBlock);
-
-      DEBUG(dbgs() << "GVN PRE removed: " << *CurInst << '\n');
-      if (MD) MD->removeInstruction(CurInst);
-      DEBUG(verifyRemoved(CurInst));
-      CurInst->eraseFromParent();
-      Changed = true;
+  if (isa<AllocaInst>(CurInst) || isa<LoadInst>(CurInst) ||
+      isa<TerminatorInst>(CurInst) || isa<PHINode>(CurInst) ||
+      CurInst->getType()->isVoidTy() ||
+      CurInst->mayReadFromMemory() || CurInst->mayHaveSideEffects() ||
+      isa<DbgInfoIntrinsic>(CurInst))
+    return false;
+
+  // Don't do PRE on compares. The PHI would prevent CodeGenPrepare from
+  // sinking the compare again, and it would force the code generator to
+  // move the i1 from processor flags or predicate registers into a general
+  // purpose register.
+  if (isa<CmpInst>(CurInst))
+    return false;
+
+  // We don't currently value number ANY inline asm calls.
+  if (CallInst *CallI = dyn_cast<CallInst>(CurInst))
+    if (CallI->isInlineAsm())
+      return false;
+
+  uint32_t ValNo = VN.lookup(CurInst);
+
+  // Look for the predecessors for PRE opportunities. We're
+  // only trying to solve the basic diamond case, where
+  // a value is computed in the successor and one predecessor,
+  // but not the other. We also explicitly disallow cases
+  // where the successor is its own predecessor, because they're
+  // more complicated to get right.
+  unsigned NumWith = 0;
+  unsigned NumWithout = 0;
+  BasicBlock *PREPred = nullptr;
+  BasicBlock *CurrentBlock = CurInst->getParent();
+  predMap.clear();
+
+  for (pred_iterator PI = pred_begin(CurrentBlock),
+       PE = pred_end(CurrentBlock); PI != PE; ++PI) {
+    BasicBlock *P = *PI;
+    // We're not interested in PRE where the block is its
+    // own predecessor, or in blocks with predecessors
+    // that are not reachable.
+    if (P == CurrentBlock) {
+      NumWithout = 2;
+      break;
+    } else if (!DT->isReachableFromEntry(P)) {
+      NumWithout = 2;
+      break;
+    }
+
+    Value* predV = findLeader(P, ValNo);
+    if (!predV) {
+      predMap.push_back(std::make_pair(static_cast<Value *>(nullptr), P));
+      PREPred = P;
+      ++NumWithout;
+    } else if (predV == CurInst) {
+      /* CurInst dominates this predecessor. */
+      NumWithout = 2;
+      break;
+    } else {
+      predMap.push_back(std::make_pair(predV, P));
+      ++NumWith;
+    }
+  }
+
+  // Don't do PRE when it might increase code size, i.e. when
+  // we would need to insert instructions in more than one pred.
+  if (NumWithout != 1 || NumWith == 0)
+    return false;
+
+  // Don't do PRE across indirect branch.
+  if (isa<IndirectBrInst>(PREPred->getTerminator()))
+    return false;
+
+  // We can't do PRE safely on a critical edge, so instead we schedule
+  // the edge to be split and perform the PRE the next time we iterate
+  // on the function.
+  unsigned SuccNum = GetSuccessorNumber(PREPred, CurrentBlock);
+  if (isCriticalEdge(PREPred->getTerminator(), SuccNum)) {
+    toSplit.push_back(std::make_pair(PREPred->getTerminator(), SuccNum));
+    return false;
+  }
+
+  // Instantiate the expression in the predecessor that lacked it.
+  // Because we are going top-down through the block, all value numbers
+  // will be available in the predecessor by the time we need them. Any
+  // that weren't originally present will have been instantiated earlier
+  // in this loop.
+  Instruction *PREInstr = CurInst->clone();
+  bool success = true;
+  for (unsigned i = 0, e = CurInst->getNumOperands(); i != e; ++i) {
+    Value *Op = PREInstr->getOperand(i);
+    if (isa<Argument>(Op) || isa<Constant>(Op) || isa<GlobalValue>(Op))
+      continue;
+
+    if (Value *V = findLeader(PREPred, VN.lookup(Op))) {
+      PREInstr->setOperand(i, V);
+    } else {
+      success = false;
+      break;
+    }
+  }
+
+  // Fail out if we encounter an operand that is not available in
+  // the PRE predecessor. This is typically because of loads which
+  // are not value numbered precisely.
+  if (!success) {
+    DEBUG(verifyRemoved(PREInstr));
+    delete PREInstr;
+    return false;
+  }
+
+  PREInstr->insertBefore(PREPred->getTerminator());
+  PREInstr->setName(CurInst->getName() + ".pre");
+  PREInstr->setDebugLoc(CurInst->getDebugLoc());
+  VN.add(PREInstr, ValNo);
+  ++NumGVNPRE;
+
+  // Update the availability map to include the new instruction.
+  addToLeaderTable(ValNo, PREInstr, PREPred);
+
+  // Create a PHI to make the value available in this block.
+  PHINode* Phi = PHINode::Create(CurInst->getType(), predMap.size(),
+                                 CurInst->getName() + ".pre-phi",
+                                 CurrentBlock->begin());
+  for (unsigned i = 0, e = predMap.size(); i != e; ++i) {
+    if (Value *V = predMap[i].first)
+      Phi->addIncoming(V, predMap[i].second);
+    else
+      Phi->addIncoming(PREInstr, PREPred);
+  }
+
+  VN.add(Phi, ValNo);
+  addToLeaderTable(ValNo, Phi, CurrentBlock);
+  Phi->setDebugLoc(CurInst->getDebugLoc());
+  CurInst->replaceAllUsesWith(Phi);
+  if (Phi->getType()->getScalarType()->isPointerTy()) {
+    // Because we have added a PHI-use of the pointer value, it has now
+    // "escaped" from alias analysis' perspective. We need to inform
+    // AA of this.
+    for (unsigned ii = 0, ee = Phi->getNumIncomingValues(); ii != ee;
+         ++ii) {
+      unsigned jj = PHINode::getOperandNumForIncomingValue(ii);
+      VN.getAliasAnalysis()->addEscapingUse(Phi->getOperandUse(jj));
+    }
+
+    if (MD)
+      MD->invalidateCachedPointerInfo(Phi);
+  }
+  VN.erase(CurInst);
+  removeFromLeaderTable(ValNo, CurInst, CurrentBlock);
+
+  DEBUG(dbgs() << "GVN PRE removed: " << *CurInst << '\n');
+  if (MD) MD->removeInstruction(CurInst);
+  DEBUG(verifyRemoved(CurInst));
+  CurInst->eraseFromParent();
+  return true;
+}
+
+/// performPRE - Perform a purely local form of PRE that looks for diamond
+/// control flow patterns and attempts to perform simple PRE at the join point.
+bool GVN::performPRE(Function &F) {
+  bool Changed = false;
+  for (BasicBlock *CurrentBlock : depth_first(&F.getEntryBlock())) {
+    // Nothing to PRE in the entry block.
+    if (CurrentBlock == &F.getEntryBlock()) continue;
+
+    // Don't perform PRE on a landing pad.
+    if (CurrentBlock->isLandingPad()) continue;
+
+    for (BasicBlock::iterator BI = CurrentBlock->begin(),
+         BE = CurrentBlock->end(); BI != BE; ) {
+      Instruction *CurInst = BI++;
+      Changed = performScalarPRE(CurInst);
     }
   }

Context not available.

   // Top-down walk of the dominator tree
   bool Changed = false;
-#if 0
+#if 1
   // Needed for value numbering with phi construction to work.
   ReversePostOrderTraversal<Function*> RPOT(&F);
   for (ReversePostOrderTraversal<Function*>::rpo_iterator RI = RPOT.begin(),
Context not available.

test/Transforms/GVN/pre-gep-load.ll

This file was added.

; RUN: opt < %s -basicaa -gvn -enable-load-pre -S | FileCheck %s
target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"

define double @foo(i32 %stat, i32 %i, double** %p) {
; CHECK-LABEL: @foo(
entry:
  switch i32 %stat, label %sw.default [
    i32 0, label %sw.bb
    i32 1, label %sw.bb
    i32 2, label %sw.bb2
  ]

sw.bb:                                            ; preds = %entry, %entry
  %idxprom = sext i32 %i to i64
  %arrayidx = getelementptr inbounds double** %p, i64 0
  %0 = load double** %arrayidx, align 8
  %arrayidx1 = getelementptr inbounds double* %0, i64 %idxprom
  %1 = load double* %arrayidx1, align 8
  %sub = fsub double %1, 1.000000e+00
  %cmp = fcmp olt double %sub, 0.000000e+00
  br i1 %cmp, label %if.then, label %if.end

if.then:                                          ; preds = %sw.bb
  br label %return

if.end:                                           ; preds = %sw.bb
  br label %sw.bb2

sw.bb2:                                           ; preds = %if.end, %entry
  %idxprom3 = sext i32 %i to i64
  %arrayidx4 = getelementptr inbounds double** %p, i64 0
  %2 = load double** %arrayidx4, align 8
  %arrayidx5 = getelementptr inbounds double* %2, i64 %idxprom3
  %3 = load double* %arrayidx5, align 8
; CHECK: sw.bb2:
; CHECK-NEXT-NOT: sext
; CHECK-NEXT: phi double [
; CHECK-NOT: load
  %sub6 = fsub double 3.000000e+00, %3
  br label %return

sw.default:                                       ; preds = %entry
  br label %return

return:                                           ; preds = %sw.default, %sw.bb2, %if.then
  %retval.0 = phi double [ 0.000000e+00, %sw.default ], [ %sub6, %sw.bb2 ], [ %sub, %if.then ]
  ret double %retval.0
}
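
The predecessor counting at the heart of performScalarPRE in the diff above can be sketched in isolation. This is a simplified toy, not the LLVM code: PredState and pickPREPred are names invented here, and the sketch leaves out the self-loop, unreachable-predecessor, dominance, indirect-branch and critical-edge checks that the real function performs.

```cpp
#include <vector>

// Toy stand-in for a findLeader query: one entry per predecessor of the
// join block, recording whether that predecessor already has a leader
// for the value number under consideration.
struct PredState {
  int BlockId;
  bool HasLeader;
};

// Returns the id of the single predecessor that would need the
// expression inserted, or -1 when the simple diamond criterion fails.
int pickPREPred(const std::vector<PredState> &Preds) {
  unsigned NumWith = 0;
  unsigned NumWithout = 0;
  int PREPred = -1;
  for (const PredState &P : Preds) {
    if (P.HasLeader) {
      ++NumWith;
    } else {
      ++NumWithout;
      PREPred = P.BlockId;
    }
  }
  // Same condition as the diff: insert into exactly one predecessor,
  // and only if at least one other predecessor already has the value.
  if (NumWithout != 1 || NumWith == 0)
    return -1;
  return PREPred;
}
```

In the test above, sw.bb2 has two predecessors: if.end, where the sext of %i is already available, and entry, where it is not. That is the one-with, one-without shape this check accepts.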