This is an archive of the discontinued LLVM Phabricator instance.

[GVN] Eliminate redundant loads whose addresses are dependent on the result of a select instruction.
AbandonedPublic

Authored by mcrosier on Mar 6 2015, 12:26 PM.

Download Raw Diff

Details

Reviewers

majnemer
• tilmann
• dberlin
nicholas
hfinkel

Summary

This patch attempts to eliminate redundant loads whose addresses are dependent upon a select instruction.

First, look for patterns such as:

(load (gep (select (cond, ptr_a, ptr_b)) idx0, idx1, ..., idxN))
(load (gep ptr_op, (select (cond, idx_a, idx_b)), idx1 ..., idxN))
(load (gep ptr_op, idx0, (select (cond, idx_a, idx_b)), ..., idxN))

Then find similar loads and perform a transformation such as:

Constant zero index with different pointer operands:

a = (load (gep ptr_a 0 0))
b = (load (gep ptr_b 0 0))
c = (load (gep (select (cond, ptr_a, ptr_b)) 0 0))
-->
a = (load (gep ptr_a, 0, 0))
b = (load (gep ptr_b, 0, 0))
c = (select (cond, a, b))

and

GEPs differ by only a single index:

a = (load (gep ptr_op, idx, idx_a))
b = (load (gep ptr_op, idx, idx_b))
c = (load (gep ptr_op, idx, (select (cond, idx_a, idx_b))))
-->
a = (load (gep ptr_op, idx, idx_a))
b = (load (gep ptr_op, idx, idx_b))
c = (select (cond, a, b))

See the PR for additional information.
http://llvm.org/bugs/show_bug.cgi?id=21052

This work is based on a patch done by Tilmann Scheller (attached to PR21052).

Basically, I moved the patch from InstCombine to GVN, so I would have dependency information. I also made the solution a bit more general.

This is a WIP, but I wanted to get some feedback before continuing. After considering the suggestions from Nick L. and David M. (see the bug report), I didn't see how those suggestions would benefit this patch (at least not in local scope).

Here are the tasks that remain:

Use use-def and def-use chains rather than iterate over instructions.
Allow the base load to be fed by stores as well as loads.
Perform additional correctness/perf testing.

Please feel free to comment.

Chad

Diff Detail

Event Timeline

mcrosier updated this revision to Diff 21385.Mar 6 2015, 12:26 PM

mcrosier retitled this revision from to [GVN] Eliminate redundant loads whose addresses are dependent on the result of a select instruction..

mcrosier updated this object.

mcrosier edited the test plan for this revision. (Show Details)

mcrosier added reviewers: • tilmann, nicholas, majnemer.

mcrosier set the repository for this revision to rL LLVM.

mcrosier added a subscriber: Unknown Object (MLST).

mcrosier added subscribers: gberry, mssimpso.Mar 6 2015, 12:45 PM

mcrosier added a subscriber: jmolloy.

Revised patch with the following changes:
-Reworked the code so that we now use def-use and use-def, rather than iterate over all instructions in the block.
-Store can now feed the select. I.e.,

a = (load (gep ptr_a 0 0))
store b (gep ptr_b 0 0))
c = (load (gep (select (cond, ptr_a, ptr_b)) 0 0))
-->
a = (load (gep ptr_a, 0, 0))
store b (gep ptr_b 0 0))
c = (select (cond, a, b))

There were no correctness or performance regressions across spec2000, spec2006, and eembc on an A57 device. The only benchmarks that hit this transform are spec2000/gcc, spec2000/vpr, and spec2006/dealII. I saw a ~1% improvement in vpr, but that's basically noise. :o/

Also, adding Daniel B. to the reviews as he seems to be working in this area as of late.

All, please have a look.

Chad

mcrosier added a reviewer: hfinkel.Mar 12 2015, 2:34 PM

mcrosier abandoned this revision.Mar 16 2015, 5:41 AM

A few comments inline. I didn't spot any obvious problems with the code, but it feels like you're trying to solve a problem in one step that should probably be sub-divided into several. I suggested one possible division, but there might be others that make more sense.

lib/Transforms/Scalar/GVN.cpp
1855	Another way you might be able to phrase this would be to ask GVN if it had a value which would be CSEd with a newly created GEP at the current instruction.
1868	This restriction feels like a hack.
1903	Ignoring your code entirely for a moment.... Your starting IR fragment seems really odd to me. I'd have expected this to be canonicalized as: a_addr = gep ptr_op idx idx_a b_addr = gep ptr_op idx idx_b a = (ld\|st a_addr b = (ld\|st b_addr c = select (cond, a_addr, b_addr) c = load c_addr This form would make your transform far more obvious once we actually got to PRE. Thoughts?
1915	Why only one level?

reames added inline comments.Mar 24 2015, 5:19 PM

lib/Transforms/Scalar/GVN.cpp
2025	Ah, you need to check for a local dependence if you're making that assumption. Spotted that immediately after sending last comments. :)

So, this is an awfully specific redundancy to detect in GVN.

The more general case doesn't use select's but phi's, and is tested in gcc
with GVNPRE with testcases like:

int
foo (int i)
{

int a, b;
if (i)
    a = 3, b = 2;
else
    a = 2, b = 3;
return a + b;

}

and more complex testcase that involve expressions.
But the basic idea is the same - if both values are computed already, the
end is redundant.
(here, the result is constant along both edges)

GVN doesn't touch this (and wouldn't with your patch):

; Function Attrs: nounwind readnone ssp uwtable
define i32 @foo(i32 %i) #0 {

%1 = icmp eq i32 %i, 0
%. = select i1 %1, i32 2, i32 3
%.1 = select i1 %1, i32 3, i32 2
%2 = add nsw i32 %., %.1
ret i32 %2

}

GCC treats this as a GVNPRE problem rather than a GVN one, because it
involves computing available values along edges.

Putting these as selects, sadly, makes it harder for things like PRE to see
this (though you could treat the selects are false blocks with incoming
edges)
It would compute the values are available along each edge already, and
insert a no-cost phi to merge them.

In any case, if we want it, sure. I have no objections as even in GVN,
this is trivial to add to new gvn, it just costs time as a special case
until i rewrite PRE based on GVN.

I've abandoned this patch (originally abandoned on March 16th). Overall, I saw no significant performance improvements across SPEC2K/SPEC2K6 and after further testing there was one significant regression (i.e., 6% on spec2000/vpr), which was the workload I was targeting. While the patch does remove the redundant load, that redundancy was replaced by another set of redundant instructions, fcmp/csel. IIRC, csel instructions are bad for A57 devices in general.

I agree that a more general solution should be considered and if the new PRE pass can enable that solution, I'm in no rush to push this patch forward.

Regardless, I do appreciate everyones feedback!

Out of curiosity, at what phase do you see these redundancies?
GCC used to canonicalize on the CFG version and do select formation late
specifically to address some of these issues.

Not to hijack this discussion, but this is related to something I’ve been looking at (and have discussed with Hal and James previously), which is select vs. phi canonicalization.

Currently SimplifyCFG does most of the converting of phis to selects in IR. There is also early and late if-conversion in some backends, but let’s ignore that for now.

CodeGenPrepare does the reverse (i.e. select -> phi/branch) for some targets and some selects.

SimplifyCFG uses a target instruction cost model of the speculated instructions to decide whether to do phi -> select conversion. It currently uses a cutoff of 2*Basic if I recall correctly.

I’d like to start a discussion on what the “right” canonical representation of these operations is over the phases of optimization. It seems to me that keeping branches around longer would be better since it would allow global optimizations to only have to worry about phis. It would also create blocks where instructions could be sunk. For example:

int selects(int a, int b, int c, int d) {

int x1, x2, x3;

x1 = a / b;

x2 = b / c;

x3 = c / d;

if (a < b) {

    x1 = 0;

    x2 = 0;

    x3 = 0;

}

return x1 + x2 + x3;

If we don’t convert the above to selects, then the CodeSinking pass is able to sink the divides into an else block that it creates. (As an aside this brings up the question of canonical location of instructions w.r.t. sinking/forwarding, which I haven’t seen discussed either).

I believe the downside to keeping phis/branches in the IR (someone please correct me if I’m wrong here), is that we break up the scope of later non-global passes. I don’t have a feel for how big of a problem this is or how hard it would be to fix by e.g. extending these passes to work on single-entry single-exit regions or even superblocks.

Any thoughts on this?

Geoff Berry

Employee of Qualcomm Innovation Center, Inc.

Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project

From: Daniel Berlin [mailto:dberlin@dberlin.org]
Sent: Wednesday, March 25, 2015 10:25 AM
To: reviews+D8120+public+d2c9388bc8837c4c@reviews.llvm.org
Cc: tilmann.scheller@googlemail.com; Nick Lewycky; David Majnemer; Hal Finkel; Philip Reames; James Molloy; mssimpso@codeaurora.org; gberry@codeaurora.org; Commit Messages and Patches for LLVM
Subject: Re: [PATCH] [GVN] Eliminate redundant loads whose addresses are dependent on the result of a select instruction.

Out of curiosity, at what phase do you see these redundancies?

GCC used to canonicalize on the CFG version and do select formation late specifically to address some of these issues.

Revision Contents

Path

Size

lib/

Transforms/

Scalar/

GVN.cpp

130 lines

test/

Transforms/

GVN/

load-gep-select.ll

93 lines

Diff 21385

lib/Transforms/Scalar/GVN.cpp

Context not available.

	// Helper fuctions of redundant load elimination	// Helper fuctions of redundant load elimination
	bool processLoad(LoadInst *L);	bool processLoad(LoadInst *L);
		bool loadGEPSelect(LoadInst *LI, MemDepResult Dep);

	bool processNonLocalLoad(LoadInst *L);	bool processNonLocalLoad(LoadInst *L);
	void AnalyzeLoadAvailability(LoadInst *LI, LoadDepVect &Deps,	void AnalyzeLoadAvailability(LoadInst *LI, LoadDepVect &Deps,
	AvailValInBlkVect &ValuesPerBlock,	AvailValInBlkVect &ValuesPerBlock,
Context not available.
	I->replaceAllUsesWith(Repl);	I->replaceAllUsesWith(Repl);
	}	}

		// Helper function for loadGEPSelect(). Look for the following pattern:
		//
		// (load (gep ptr idx0 idx1 idx2 ))
		//
		// Only match above patterns which satisfy the following additional
		// constraints:
		// - The inbounds attribute must match the inbounds attribute
		// of the base GEP
		// - The number of indicies must match.
		// - The operands, including the pointer operand, only differ by one operand,
		// which is the operand fed by a select.
		//
		static bool isNearMatchingLoadGEP(Value Inst, GetElementPtrInst BaseGEP,
		unsigned SIIdx) {
		if (LoadInst *LI = dyn_cast<LoadInst>(Inst)) {
		Value *Ptr = LI->getPointerOperand();

		if (GetElementPtrInst *GEPI = dyn_cast<GetElementPtrInst>(Ptr)) {
		if (GEPI->isInBounds() != BaseGEP->isInBounds() \|\|
		GEPI->getNumIndices() != BaseGEP->getNumIndices())
		return false;

		// Compute the number of matching indices.
		unsigned NumMatch = 0;
		unsigned NumOps = GEPI->getNumOperands();
		for (unsigned i = 0; i != NumOps; ++i, ++NumMatch)
		if (GEPI->getOperand(i) != BaseGEP->getOperand(i))
		if (i != SIIdx)
		break;

		return NumMatch == NumOps;
		}
		}
		return false;
		reamesUnsubmitted Not Done Reply Inline Actions Another way you might be able to phrase this would be to ask GVN if it had a value which would be CSEd with a newly created GEP at the current instruction. reames: Another way you might be able to phrase this would be to ask GVN if it had a value which would…
		}

		// Do the following transformation:
		//
		// a = (load (gep ptr_op idx idx_a))
		// b = (load (gep ptr_op idx idx_b))
		// c = (load (gep ptr_op idx (select (cond, idx_a, idx_b))))
		// -->
		// a = (load (gep ptr_op idx idx_a))
		// b = (load (gep ptr_op idx idx_b))
		// c = (select (cond, a, b))
		//
		bool GVN::loadGEPSelect(LoadInst *LI, MemDepResult Dep) {
		reamesUnsubmitted Not Done Reply Inline Actions This restriction feels like a hack. reames: This restriction feels like a hack.
		Value *Op = LI->getOperand(0);

		GetElementPtrInst *GEPI = dyn_cast<GetElementPtrInst>(Op);
		if (!GEPI)
		return false;

		// Check whether the GEP is using the result of a select.
		unsigned SIIdx = 0;
		SelectInst *SI = nullptr;
		for (unsigned i = 0, e = GEPI->getNumOperands(); i != e; ++i) {
		Value *Op0 = GEPI->getOperand(i);
		if (ZExtInst *Ext = dyn_cast<ZExtInst>(Op0))
		Op0 = Ext->getOperand(0);
		if (isa<SelectInst>(Op0)) {
		SIIdx = i;
		SI = cast<SelectInst>(Op0);
		break;
		}
		}
		if (!SI)
		return false;

		Value *SelectTrueOp = SI->getTrueValue();
		Value *SelectFalseOp = SI->getFalseValue();

		// Scan from the load until we encounter a dependent instruction or the
		// beginning of the block.
		Instruction *DepInst = Dep.getInst();
		BasicBlock::iterator ScanFrom = LI, ScanEnd;
		if (DepInst)
		ScanEnd = DepInst;
		else
		ScanEnd = LI->getParent()->begin();

		Value *TrueLoad = nullptr;
		reamesUnsubmitted Not Done Reply Inline Actions Ignoring your code entirely for a moment.... Your starting IR fragment seems really odd to me. I'd have expected this to be canonicalized as: a_addr = gep ptr_op idx idx_a b_addr = gep ptr_op idx idx_b a = (ld\|st a_addr b = (ld\|st b_addr c = select (cond, a_addr, b_addr) c = load c_addr This form would make your transform far more obvious once we actually got to PRE. Thoughts? reames: Ignoring your code entirely for a moment.... Your starting IR fragment seems really odd to me.
		Value *FalseLoad = nullptr;

		// Scan the basic block backwards looking for loads with nearly matching geps.
		while (ScanFrom != ScanEnd) {
		Instruction *Inst = --ScanFrom;

		if (!isNearMatchingLoadGEP(Inst, GEPI, SIIdx))
		continue;

		Value *Ptr = cast<LoadInst>(Inst)->getPointerOperand();
		GetElementPtrInst *GEPI2 = cast<GetElementPtrInst>(Ptr);
		Value *GEPOp = GEPI2->getOperand(SIIdx);
		reamesUnsubmitted Not Done Reply Inline Actions Why only one level? reames: Why only one level?

		// Look through zext.
		if (ZExtInst *Ext = dyn_cast<ZExtInst>(GEPOp))
		GEPOp = Ext->getOperand(0);

		// Does the gep op match either of the select ops?
		if (GEPOp == SelectTrueOp)
		TrueLoad = Inst;
		else if (GEPOp == SelectFalseOp)
		FalseLoad = Inst;

		// All preconditions fulfilled, apply the transformation:
		if (TrueLoad && FalseLoad) {
		Value *V = SelectInst::Create(SI->getCondition(), TrueLoad, FalseLoad,
		"ld-gep-sel", LI);
		LI->replaceAllUsesWith(V);

		if (V->getType()->getScalarType()->isPointerTy())
		MD->invalidateCachedPointerInfo(V);
		markInstructionForDeletion(LI);
		++NumGVNLoad;
		return true;
		}
		}
		return false;
		}

	/// Attempt to eliminate a load, first by eliminating it	/// Attempt to eliminate a load, first by eliminating it
	/// locally, and then attempting non-local elimination if that fails.	/// locally, and then attempting non-local elimination if that fails.
	bool GVN::processLoad(LoadInst *L) {	bool GVN::processLoad(LoadInst *L) {
		reamesUnsubmitted Not Done Reply Inline Actions Ah, you need to check for a local dependence if you're making that assumption. Spotted that immediately after sending last comments. :) reames: Ah, you need to check for a local dependence if you're making that assumption. Spotted that…
Context not available.
	return false;	return false;
	}	}

		// Attempt to eliminate redundant loads whose addresses are dependent on
		// the result of a select instruction. If both values have already been
		// loaded into registers the load becomes redundant and can be replaced
		// with a select on the two values instead.
		if (loadGEPSelect(L, Dep))
		return true;

	// If it is defined in another block, try harder.	// If it is defined in another block, try harder.
	if (Dep.isNonLocal())	if (Dep.isNonLocal())
	return processNonLocalLoad(L);	return processNonLocalLoad(L);
Context not available.

test/Transforms/GVN/load-gep-select.ll

This file was added.

				; RUN: opt < %s -basicaa -gvn -enable-load-pre -S \| FileCheck %s

				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				%struct.s = type { i32 }

				@vec = common global i32* null, align 8
				@agg = common global %struct.s** null, align 8

				; unsigned *vec;
				;
				; unsigned test1(unsigned i, unsigned max) {
				; if (vec[i] > vec[i+1])
				; ++i;
				; if (vec[i] > max) <- Remove redundant loads
				; return 1;
				; return 0;
				; }

				; CHECK-LABEL: test1
				; CHECK: load i32, i32* %arrayidx, align 4
				; CHECK: load i32, i32* %arrayidx2, align 4
				; CHECK-NOT: load
				; CHECK: ret i32

				define i32 @test1(i32 %i, i32 %max) {
				entry:
				%idxprom = zext i32 %i to i64
				%0 = load i32, i32* @vec, align 8
				%arrayidx = getelementptr inbounds i32, i32* %0, i64 %idxprom
				%1 = load i32, i32* %arrayidx, align 4
				%add = add i32 %i, 1
				%idxprom1 = zext i32 %add to i64
				%arrayidx2 = getelementptr inbounds i32, i32* %0, i64 %idxprom1
				%2 = load i32, i32* %arrayidx2, align 4
				%cmp = icmp ugt i32 %1, %2
				%add.i = select i1 %cmp, i32 %add, i32 %i
				%idxprom3 = zext i32 %add.i to i64
				%arrayidx4 = getelementptr inbounds i32, i32* %0, i64 %idxprom3
				%3 = load i32, i32* %arrayidx4, align 4
				%cmp5 = icmp ugt i32 %3, %max
				%retval.0 = zext i1 %cmp5 to i32
				ret i32 %retval.0
				}

				; struct s {
				; unsigned field;
				; };
				; struct s **agg;
				;
				; unsigned test2(unsigned i, unsigned max) {
				; if (agg[i]->field > agg[i+1]->field)
				; ++i;
				; if (agg[i]->field > max) <- Remove redundant loads
				; return 1;
				; return 0;
				; }

				; CHECK-LABEL: test2
				; CHECK: load %struct.s, %struct.s* %arrayidx, align 8
				; CHECK: load i32, i32* %field, align 4
				; CHECK: load %struct.s, %struct.s* %arrayidx2, align 8
				; CHECK: load i32, i32* %field3, align 4
				; CHECK-NOT: load
				; CHECK-NOT: load
				; CHECK: ret i32

				define i32 @test2(i32 %i, i32 %max) {
				entry:
				%idxprom = zext i32 %i to i64
				%0 = load %struct.s, %struct.s* @agg, align 8
				%arrayidx = getelementptr inbounds %struct.s, %struct.s* %0, i64 %idxprom
				%1 = load %struct.s, %struct.s* %arrayidx, align 8
				%field = getelementptr inbounds %struct.s, %struct.s* %1, i64 0, i32 0
				%2 = load i32, i32* %field, align 4
				%add = add i32 %i, 1
				%idxprom1 = zext i32 %add to i64
				%arrayidx2 = getelementptr inbounds %struct.s, %struct.s* %0, i64 %idxprom1
				%3 = load %struct.s, %struct.s* %arrayidx2, align 8
				%field3 = getelementptr inbounds %struct.s, %struct.s* %3, i64 0, i32 0
				%4 = load i32, i32* %field3, align 4
				%cmp = icmp ugt i32 %2, %4
				%add.i = select i1 %cmp, i32 %add, i32 %i
				%idxprom4 = zext i32 %add.i to i64
				%arrayidx5 = getelementptr inbounds %struct.s, %struct.s* %0, i64 %idxprom4
				%5 = load %struct.s, %struct.s* %arrayidx5, align 8
				%field6 = getelementptr inbounds %struct.s, %struct.s* %5, i64 0, i32 0
				%6 = load i32, i32* %field6, align 4
				%cmp7 = icmp ugt i32 %6, %max
				%retval.0 = zext i1 %cmp7 to i32
				ret i32 %retval.0
				}