This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Remove single use restriction from InstCombine's explicit sinking code.
Needs Review, Public

Authored by craig.topper on Sep 12 2017, 1:25 PM.


Details

Reviewers

hfinkel
davide
chandlerc
spatel
sebpop
hiraditya

Summary

The explicit sinking code in InstCombine checks to make sure the candidate for sinking only has a single use. But that doesn't really make sense. All that should really matter is that all uses are in the same basic block.

The test case is derived from a real example I saw in a benchmark where we ended up with a chain of conditional ORs that we were unable to sink completely because %select1 has two uses in the successor block.

I fear based on other discussions on the mailing list that this patch may be controversial because InstCombine shouldn't really be doing this sinking at all. But I'm just trying to make it logical until such time that a new sinking pass is implemented.
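The rule proposed in the summary can be sketched outside of LLVM as follows. This is a simplified model and not the patch itself: `Cfg`, `findSinkDest`, and the integer block ids are hypothetical names, and `numPreds` only approximates `getUniquePredecessor()` by counting distinct predecessors.

```cpp
#include <cassert>
#include <vector>

// Toy CFG: blocks are indices. All names are illustrative, not LLVM API.
struct Cfg {
  std::vector<std::vector<int>> succs; // successors of each block
  std::vector<int> numPreds;           // number of distinct predecessors
};

// Returns the block a value defined in defBlock could be sunk into, or -1.
// useBlocks lists the block of every use (for a phi use: the incoming block).
// Sinking is allowed when all uses share one block, that block is an
// immediate successor of defBlock, and it has defBlock as its only predecessor.
int findSinkDest(const Cfg &cfg, int defBlock,
                 const std::vector<int> &useBlocks) {
  assert(defBlock >= 0 && defBlock < static_cast<int>(cfg.succs.size()));
  int dest = -1;
  for (int useBlock : useBlocks) {
    if (useBlock == defBlock)
      return -1; // a use in the defining block pins the value there
    if (dest == -1) {
      bool isSuccessor = false;
      for (int s : cfg.succs[defBlock])
        if (s == useBlock)
          isSuccessor = true;
      if (!isSuccessor || cfg.numPreds[useBlock] != 1)
        return -1; // would need critical-edge splitting or a deeper walk
      dest = useBlock;
    } else if (dest != useBlock) {
      return -1; // uses are spread over more than one block
    }
  }
  return dest; // stays -1 when there are no uses at all
}
```

Under this model, two uses in the same successor block are treated exactly like one, which is the only behavioral difference from the single-use check.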

Diff Detail

Event Timeline

Remove a stray change that snuck in

I think what matters in this case is the {post}dominance relation between the block of the DEF and the block(s) [potentially > 1] of the USEs.
Doing this xform when all the uses are in the same block is correct, but restrictive. So, I think your logic is fine, but this makes me even less convinced that we shouldn't use the dominator tree to drive this analysis (and that it therefore should be a separate pass :)

davide added reviewers: sebpop, hiraditya. Sep 12 2017, 3:19 PM

In D37762#868790, @davide wrote:

I think what matters in this case is the {post}dominance relation between the block of the DEF and the block(s) [potentially > 1] of the USEs.
Doing this xform when all the uses are in the same block is correct, but restrictive. So, I think your logic is fine, but this makes me even less convinced that we shouldn't use the dominator tree to drive this analysis (and that it therefore should be a separate pass :)

Do we see any performance effects from removing this entirely? It's not immediately obvious to me what this enables. Maybe SimplifyCFG later removes some blocks as empty? I don't see why anything in InstCombine would care.

Here's a longer version of what I saw in the benchmark I was looking at:

y1 = 0;
y2 = (x & 2) ? (y1 | C1) : y1
y3 = (x & 4) ? (y2 | C2) : y2
y4 = (x & 8) ? (y3 | C3) : y3
y5 = (x & 16) ? (y4 | C4) : y4
y6 = (x & 32) ? (y5 | C5) : y5
y7 = (x & 64) ? (y6 | C6) : y6
if (x & 254) {
  y8 = (x & 128) ? (y7 | C7) : y7
  store y8 to memory
}

As you can see, y7 is only used inside the if body, but we didn’t sink it. The select for y8, the (y7 | C7) for its left-hand side, and the (x & 128) all started out above the if (x & 254) and were pushed down because they were only needed by the store.

Is there some other pass that I should expect to sink this code so that we don't waste time decoding/executing the resulting ors and cmovs when (x & 254) is false?
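To make the before/after shapes concrete, here is a minimal sketch in plain C++. The constants returned by `C(i)` are hypothetical stand-ins for C1..C7 (the benchmark's real values are not given), and `storeEager`/`storeSunk` are illustrative names; the `bool` return stands in for "a store happened".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for C1..C7; the real constants are not in the thread.
constexpr uint32_t C(int i) { return 0x10001u << (i - 1); }

// One link of the chain: (x & bit) ? (y | c) : y.
static uint32_t step(uint32_t x, uint32_t bit, uint32_t y, uint32_t c) {
  return (x & bit) ? (y | c) : y;
}

// Shape before sinking: y2..y7 are computed even if the branch is not taken.
bool storeEager(uint32_t x, uint32_t *mem) {
  assert(mem != nullptr);
  uint32_t y = 0;                  // y1
  for (int i = 1; i <= 6; ++i)     // y2..y7 test bits 2, 4, ..., 64
    y = step(x, 1u << i, y, C(i));
  if (x & 254) {
    *mem = step(x, 128, y, C(7));  // y8, then the store
    return true;
  }
  return false;
}

// Shape after sinking: the whole chain lives inside the guarded block.
bool storeSunk(uint32_t x, uint32_t *mem) {
  assert(mem != nullptr);
  if (x & 254) {
    uint32_t y = 0;
    for (int i = 1; i <= 7; ++i)
      y = step(x, 1u << i, y, C(i));
    *mem = y;
    return true;
  }
  return false;
}
```

Both functions store the same value for every x; the second simply does no work on the path where (x & 254) is false.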

In D37762#869696, @craig.topper wrote:
Here's a longer version of what I saw in the benchmark I was looking at:

y1 = 0;
y2 = (x & 2) ? (y1 | C1) : y1
y3 = (x & 4) ? (y2 | C2) : y2
y4 = (x & 8) ? (y3 | C3) : y3
y5 = (x & 16) ? (y4 | C4) : y4
y6 = (x & 32) ? (y5 | C5) : y5
y7 = (x & 64) ? (y6 | C6) : y6
if (x & 254) {
  y8 = (x & 128) ? (y7 | C7) : y7
  store y8 to memory
}

As you can see, y7 is only used inside the if body, but we didn’t sink it. The select for y8, the (y7 | C7) for its left-hand side, and the (x & 128) all started out above the if (x & 254) and were pushed down because they were only needed by the store.

Is there some other pass that I should expect to sink this code so that we don't waste time decoding/executing the resulting ors and cmovs when (x & 254) is false?

Interesting. Nothing occurs to me (at the IR level - we do have MachineSink in CodeGen). Should we do this in CGP, or is there a reason to do it earlier?

I may have said this before, but we should figure out what we want our canonical form to be. Execute things as early as possible (i.e., aggressive hoisting), but as few times as possible (i.e., still sink out of loops), is one possibility. Lacking other motivations, that's what I'd recommend.

Has there been any sort of discussion on expanding/using the existing IR level code sinking pass? I am referring to the SinkingPass in scalar/sink.cpp. AFAIK it's only used in the AMDGPU preisel pipeline. I don't know its current state/usability, but the description of the pass is:

This pass moves instructions into successor blocks, when possible, so that
they aren't executed on paths where their results aren't needed.

OK, so maybe I'm chasing a deficiency in machine sinking? Here's a related example (https://godbolt.org/g/YnEvgg): most of the code should have been able to sink below the 'je' at line 11.

In D37762#870140, @rriddle wrote:

Has there been any sort of discussion on expanding/using the existing IR level code sinking pass? I am referring to the SinkingPass in scalar/sink.cpp. AFAIK it's only used in the AMDGPU preisel pipeline. I don't know its current state/usability, but the description of the pass is:

This pass moves instructions into successor blocks, when possible, so that
they aren't executed on paths where their results aren't needed.

Doing this as a late IR pass makes sense. Taking a quick look at the implementation, there may be some improvements that would help (e.g., make it use MemorySSA, better handling of debug info, use a postdom tree). Considering adding this to the default codegen IR pipeline could be a good idea (and maybe better than trying to do a better job at the MI level?).

In D37762#870202, @hfinkel wrote:

In D37762#870140, @rriddle wrote:

Has there been any sort of discussion on expanding/using the existing IR level code sinking pass? I am referring to the SinkingPass in scalar/sink.cpp. AFAIK it's only used in the AMDGPU preisel pipeline. I don't know its current state/usability, but the description of the pass is:

This pass moves instructions into successor blocks, when possible, so that
they aren't executed on paths where their results aren't needed.

Doing this as a late IR pass makes sense. Taking a quick look at the implementation, there may be some improvements that would help (e.g., make it use MemorySSA, better handling of debug info, use a postdom tree). Considering adding this to the default codegen IR pipeline could be a good idea (and maybe better than trying to do a better job at the MI level?).

One of the reasons why this was blocked for improvement was that, until recently, we didn't have a functional post-dominator tree construction (wrt unreachable blocks et similia).
@kuhar and Danny fixed it recently, so that shouldn't be a problem anymore. We might really consider removing this code from InstCombine.

A couple things:

A proper and complete scalar PRE implementation would eliminate the redundant computation parts in your godbolt example (but not the redundant stores).

GCC's PRE at -O3 (which is mostly complete) will transform this into a selection of constants and 3 stores, all using those constants.
A proper NewGVN PRE would do the same.

NewGVN right now will catch the full redundancy:
You will get:

  %phiofops = phi i32 [ 131074, %2 ], [ 196611, %6 ]
  %.0 = phi i32 [ 65537, %6 ], [ 0, %2 ]
  ...
  %.1 = phi i32 [ %phiofops, %10 ], [ %.0, %7 ]

So it eliminates one of the or's already, even without PRE, because it is fully redundant.
The others would require PRE.

The "proper" theoretical way to eliminate the stores in both (and do the sinking above) is Partial Dead Store/Partial Dead Code elimination.

It should put it in the optimal place as well.
There is a PDSE implementation under review (I have someone taking it over).

I would expect PDSE to take care of the first case completely, but as said, it will not sink the computations in your other example, only sink the store.

GCC has no PDSE; it has a simple sinking pass that I implemented to catch the common case (on by default).

It's an IR level pass.

What it does is, for an instruction I in block BB, take all the uses of I (with a phi node use occurring in the appropriate predecessor), and find the nearest common dominator of all the uses.
This place, NCD, is a guaranteed safe location as long as BB dominates it (NCD may actually be above BB in some loop cases).

Then, in the dominator tree between BB and NCD, we want the block that is the most control dependent and in the shallowest loop nest.
So shallower loop nests are always considered better.
Same loop nest level is considered better if execution frequency is significantly lower than NCD execution frequency.
Otherwise, use NCD.

The resulting block is always a safe place to sink (because BB is an ancestor of NCD in the dominator tree, and we are only walking the dominator tree till we hit BB).

You could also use real control dependences to find the thing inside the most non-loop branches :)

In the above case, this sinks it into the pred of the phi, as you want.
In the godbolt example, as mentioned, this is a combination of PRE and PDSE.
The simple sinking pass I wrote is not smart enough to handle the PDSE case you've presented.

I believe LLVM has a similar simple sinking pass.
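The placement heuristic described above can be sketched on a toy dominator tree. This is only an illustration under stated assumptions: `Block`, `nearestCommonDominator`, `dominates`, and `chooseSinkBlock` are hypothetical names (not GCC or LLVM API), only the loop-depth preference is modeled, and the execution-frequency tie-break and control-dependence refinements are left out.

```cpp
#include <cassert>
#include <vector>

// Toy dominator-tree node; blocks are referred to by index.
struct Block {
  int idom;      // immediate dominator, -1 for the entry block
  int domDepth;  // depth in the dominator tree (entry = 0)
  int loopDepth; // loop nesting depth of the block
};

// Nearest common dominator of a and b, found by walking idom links upward.
int nearestCommonDominator(const std::vector<Block> &blocks, int a, int b) {
  while (a != b) {
    if (blocks[a].domDepth < blocks[b].domDepth)
      b = blocks[b].idom;
    else
      a = blocks[a].idom;
  }
  return a;
}

// True if block dom dominates block b (a block dominates itself).
bool dominates(const std::vector<Block> &blocks, int dom, int b) {
  while (b != -1 && blocks[b].domDepth > blocks[dom].domDepth)
    b = blocks[b].idom;
  return b == dom;
}

// Chooses where to place a value defined in defBlock whose uses are in
// useBlocks (a phi use is assumed to be recorded as its incoming block).
// Returns defBlock when the value should stay where it is.
int chooseSinkBlock(const std::vector<Block> &blocks, int defBlock,
                    const std::vector<int> &useBlocks) {
  assert(defBlock >= 0 && defBlock < static_cast<int>(blocks.size()));
  if (useBlocks.empty())
    return defBlock;
  int ncd = useBlocks[0];
  for (int u : useBlocks)
    ncd = nearestCommonDominator(blocks, ncd, u);
  // NCD is only a safe target if the defining block dominates it.
  if (!dominates(blocks, defBlock, ncd))
    return defBlock;
  // Walk the dominator tree from NCD up to defBlock and prefer the
  // shallowest loop nest; on ties, keep the block closest to the uses.
  int best = ncd;
  for (int b = ncd; b != defBlock; b = blocks[b].idom)
    if (blocks[b].loopDepth < blocks[best].loopDepth)
      best = b;
  if (blocks[defBlock].loopDepth < blocks[best].loopDepth)
    best = defBlock;
  return best;
}
```

Because the walk only visits blocks on the dominator-tree path between NCD and the defining block, every candidate it can return is dominated by the definition and dominates all uses.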

Looks like machine sinking fails because it can't handle the cmovs reading eflags. And it only considers one instruction at a time, but to sink the cmov you have to sink the eflags producer and the cmov together.

Here's something even closer to the real benchmark: https://godbolt.org/g/ePkjTu

This is also hitting an issue with SimplifyCFG's merging of conditional stores that resulted in two stores instead of one.

So GCC-7 definitely eliminates all computation in your benchmark at -O3 except for one 'and', and we should be able to do the same with a NewGVN PRE.
It won't do it at O2 because it requires partial antic, which gcc only does at O3.

I've attached the dump of intermediate code.

My conclusion: it's not sane to get it to eliminate computation with the current GVN's scalar PRE.

The store sinking would require PDSE to do in one pass optimally.

Attachment: foo2.c.133t.pre (119 KB)

Is the code gcc generates at O3 really "better"? According to godbolt the resulting assembly looks quite large with lots of spills and reloads.

The assembly is clearly worse, the IR is clearly better.
The IR has removed all computation except a mask of the input.
(Note, in the dump I gave, it has not run DCE, so there's a bunch of dead phis).
In essence, it has turned the entire function into a nested if statement based on the bar & x values, where all branches are storing a constant value.
(or selects, or however you want to canonicalize this)

GCC then does not do a good job of doing something with that in the backend.

This could be codegened precisely the way a switch could be (i.e., lookup table and perfect hash, multiple lookup tables, binary tree, etc).
It can use as many or as few registers as you like.
In fact, in this form, it's also ripe for vectorization.
You could load the constants into AVX512 registers (it should take 2), the bar & x values into another, and use the & to reduce and select the right constant from the vector registers. I think you can do it in AVX2 as well, but haven't double checked.

So, I'm going to say "yes, the IR is better, it could probably use canonicalization or something to take it out of the phi form and put it into a canonical form".
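As a rough illustration of the "codegen it like a switch" idea, the sketch below turns a constant-selecting chain into a single table lookup. It is modeled on the simplified y1..y8 chain from earlier in the thread rather than the godbolt benchmark, and everything in it is an assumption: `C(i)` is a hypothetical stand-in for the constants, and `chainValue`, `buildTable`, and `storeViaTable` are illustrative names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for C1..C7; the real constants are not in the thread.
constexpr uint32_t C(int i) { return 0x10001u << (i - 1); }

// Reference semantics: OR in C(i) for each set bit i (1..7) of x.
uint32_t chainValue(uint32_t x) {
  uint32_t y = 0;
  for (int i = 1; i <= 7; ++i)
    if (x & (1u << i))
      y |= C(i);
  return y;
}

// Precompute the stored constant for every combination of bits 1..7.
std::array<uint32_t, 128> buildTable() {
  std::array<uint32_t, 128> table{};
  for (uint32_t idx = 0; idx < 128; ++idx)
    table[idx] = chainValue(idx << 1);
  return table;
}

// After the mask, the only remaining work is one lookup and one store.
bool storeViaTable(uint32_t x, uint32_t *mem) {
  assert(mem != nullptr);
  static const std::array<uint32_t, 128> table = buildTable();
  uint32_t idx = (x & 254) >> 1;
  if (idx == 0)
    return false; // the guarded block is not entered, nothing is stored
  *mem = table[idx];
  return true;
}
```

Whether a table, a decision tree, or a vector select is the better lowering is a backend cost question; the point is only that the IR leaves nothing but a mask of the input.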

So I think we all agree that doing sinking in InstCombine isn't a good idea in the long term. But I suspect doing something to fix that is going to take some more work.

Do we think there are any issues with this patch in the short term? Either significant performance regressions or compile time issues?

Ping

dberlin removed a subscriber: dberlin. Sep 25 2017, 11:22 AM

Ping

In D37762#868790, @davide wrote:

I think what matters in this case is the {post}dominance relation between the block of the DEF and the block(s) [potentially > 1] of the USEs.

Agreed with this.

Doing this xform when all the uses are in the same block is correct, but restrictive. So, I think your logic is fine, but this makes me even less convinced that we shouldn't use the dominator tree to drive this analysis (and that it therefore should be a separate pass :)

I think GVNSink could be improved to do this. Also, the current implementation in InstCombiner looks like a hack. We are sinking an instruction without an appropriate cost model. Delaying execution of an instruction may not always be a good idea.

Looks like machine sinking fails because it can't handle the cmovs reading eflags. And it only considers one instruction at a time, but to sink the cmov you have to sink the eflags producer and the cmov together.

I have been working on a global scheduling pass (https://reviews.llvm.org/D32140) which can help achieve this. What we need is a good cost model (reducing live range can be one) to drive the transformation.

Revision Contents

Path                                                   Size
lib/Transforms/InstCombine/InstCombineInternal.h       3 lines
lib/Transforms/InstCombine/InstructionCombining.cpp    97 lines
test/Transforms/InstCombine/sink_instruction.ll        26 lines

Diff 114892

lib/Transforms/InstCombine/InstCombineInternal.h

@@ 751 lines not shown @@ private:
   Instruction *SimplifyMemSet(MemSetInst *MI);

   Value *EvaluateInDifferentType(Value *V, Type *Ty, bool isSigned);

   /// \brief Returns a value X such that Val = X * Scale, or null if none.
   ///
   /// If the multiplication is known not to overflow then NoSignedWrap is set.
   Value *Descale(Value *Val, APInt Scale, bool &NoSignedWrap);
+
+  // Try to sink instruction from its current block to one of its successors.
+  void tryToSinkInstruction(Instruction *I);
 };

 } // end namespace llvm.

 #undef DEBUG_TYPE

 #endif

lib/Transforms/InstCombine/InstructionCombining.cpp

@@ 2,824 lines not shown @@ Instruction *InstCombiner::visitLandingPadInst(LandingPadInst &LI) {

   return nullptr;
 }

 /// Try to move the specified instruction from its current block into the
 /// beginning of DestBlock, which can only happen if it's safe to move the
 /// instruction past all of the instructions between it and the end of its
 /// block.
-static bool TryToSinkInstruction(Instruction *I, BasicBlock *DestBlock) {
-  assert(I->hasOneUse() && "Invariants didn't hold!");
+static bool TryToSinkInstructionInto(Instruction *I, BasicBlock *DestBlock) {

   // Cannot move control-flow-involving, volatile loads, vaarg, etc.
   if (isa<PHINode>(I) || I->isEHPad() || I->mayHaveSideEffects() ||
       isa<TerminatorInst>(I))
     return false;

   // Do not sink alloca instructions out of the entry block.
   if (isa<AllocaInst>(I) && I->getParent() ==
@@ 20 lines not shown @@ static bool TryToSinkInstructionInto(Instruction *I, BasicBlock *DestBlock) {
   }

   BasicBlock::iterator InsertPos = DestBlock->getFirstInsertionPt();
   I->moveBefore(&*InsertPos);
   ++NumSunkInst;
   return true;
 }

+void InstCombiner::tryToSinkInstruction(Instruction *I) {
+  BasicBlock *BB = I->getParent();
+
+  BasicBlock *DestBlock = nullptr;
+
+  for (auto UI = I->user_begin(), UE = I->user_end(); UI != UE; ++UI) {
+    Instruction *UserInst = cast<Instruction>(*UI);
+
+    // Get the block the use occurs in.
+    BasicBlock *UserParent;
+    if (PHINode *PN = dyn_cast<PHINode>(UserInst))
+      UserParent = PN->getIncomingBlock(UI.getUse());
+    else
+      UserParent = UserInst->getParent();
+
+    // If this User is the current block, there's nothing to do.
+    if (UserParent == BB)
+      return;
+
+    if (!DestBlock) {
+      bool UserIsSuccessor = false;
+      // See if the user is one of our successors.
+      for (succ_iterator SI = succ_begin(BB), E = succ_end(BB); SI != E; ++SI)
+        if (*SI == UserParent) {
+          UserIsSuccessor = true;
+          break;
+        }
+
+      // If the user is one of our immediate successors, and if that successor
+      // only has us as a predecessors (we'd have to split the critical edge
+      // otherwise), we can keep going.
+      if (!UserIsSuccessor || !UserParent->getUniquePredecessor())
+        return;
+
+      DestBlock = UserParent;
+    } else if (DestBlock != UserParent)
+      return;
+  }
+
+  if (!DestBlock)
+    return;
+
+  // Okay, the CFG is simple enough, try to sink this instruction.
+  if (TryToSinkInstructionInto(I, DestBlock)) {
+    DEBUG(dbgs() << "IC: Sink: " << *I << '\n');
+    MadeIRChange = true;
+    // We'll add uses of the sunk instruction below, but since sinking
+    // can expose opportunities for it's operands add them to the
+    // worklist
+    for (Use &U : I->operands())
+      if (Instruction *OpI = dyn_cast<Instruction>(U.get()))
+        Worklist.Add(OpI);
+  }
+}
+
 bool InstCombiner::run() {
   while (!Worklist.isEmpty()) {
     Instruction *I = Worklist.RemoveOne();
     if (I == nullptr) continue; // skip null values.

     // Check to see if we can DCE the instruction.
     if (isInstructionTriviallyDead(I, &TLI)) {
       DEBUG(dbgs() << "IC: DCE: " << *I << '\n');
@@ 38 lines not shown @@ if (ExpensiveCombines && !I->use_empty() && Ty->isIntOrIntVectorTy()) {
         if (isInstructionTriviallyDead(I, &TLI))
           eraseInstFromFunction(*I);
         MadeIRChange = true;
         continue;
       }
     }

     // See if we can trivially sink this instruction to a successor basic block.
-    if (I->hasOneUse()) {
-      BasicBlock *BB = I->getParent();
-      Instruction *UserInst = cast<Instruction>(*I->user_begin());
-      BasicBlock *UserParent;
-
-      // Get the block the use occurs in.
-      if (PHINode *PN = dyn_cast<PHINode>(UserInst))
-        UserParent = PN->getIncomingBlock(*I->use_begin());
-      else
-        UserParent = UserInst->getParent();
-
-      if (UserParent != BB) {
-        bool UserIsSuccessor = false;
-        // See if the user is one of our successors.
-        for (succ_iterator SI = succ_begin(BB), E = succ_end(BB); SI != E; ++SI)
-          if (*SI == UserParent) {
-            UserIsSuccessor = true;
-            break;
-          }
-
-        // If the user is one of our immediate successors, and if that successor
-        // only has us as a predecessors (we'd have to split the critical edge
-        // otherwise), we can keep going.
-        if (UserIsSuccessor && UserParent->getUniquePredecessor()) {
-          // Okay, the CFG is simple enough, try to sink this instruction.
-          if (TryToSinkInstruction(I, UserParent)) {
-            DEBUG(dbgs() << "IC: Sink: " << *I << '\n');
-            MadeIRChange = true;
-            // We'll add uses of the sunk instruction below, but since sinking
-            // can expose opportunities for it's operands add them to the
-            // worklist
-            for (Use &U : I->operands())
-              if (Instruction *OpI = dyn_cast<Instruction>(U.get()))
-                Worklist.Add(OpI);
-          }
-        }
-      }
-    }
+    tryToSinkInstruction(I);

     // Now that we have an instruction, try combining it to simplify it.
     Builder.SetInsertPoint(I);
     Builder.SetCurrentDebugLocation(I->getDebugLoc());

 #ifndef NDEBUG
     std::string OrigI;
 #endif
@@ 321 lines not shown @@

test/Transforms/InstCombine/sink_instruction.ll

+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
 ; RUN: opt -instcombine -S < %s | FileCheck %s

 ;; This tests that the instructions in the entry blocks are sunk into each
 ;; arm of the 'if'.

 define i32 @test1(i1 %C, i32 %A, i32 %B) {
 ; CHECK-LABEL: @test1(
 entry:
@@ 63 lines not shown @@
 ; CHECK: %0 = load i32, i32* %arrayidx, align 4
   %add = add nsw i32 %0, %i
   br label %sw.epilog

 sw.epilog: ; preds = %entry, %sw.bb
   %sum.0 = phi i32 [ %add, %sw.bb ], [ 0, %entry ]
   ret i32 %sum.0
 }
+
+define i32 @test4(i1 %C, i1 %D, i1 %E, i32 %A) {
+; CHECK-LABEL: @test4(
+; CHECK-NEXT:    br i1 [[C:%.*]], label [[THEN:%.*]], label [[ENDIF:%.*]]
+; CHECK:       then:
+; CHECK-NEXT:    [[OR1:%.*]] = or i32 [[A:%.*]], 65537
+; CHECK-NEXT:    [[SELECT1:%.*]] = select i1 [[D:%.*]], i32 [[OR1]], i32 [[A]]
+; CHECK-NEXT:    [[OR2:%.*]] = or i32 [[SELECT1]], 131074
+; CHECK-NEXT:    [[SELECT2:%.*]] = select i1 [[E:%.*]], i32 [[OR2]], i32 [[SELECT1]]
+; CHECK-NEXT:    ret i32 [[SELECT2]]
+; CHECK:       endif:
+; CHECK-NEXT:    ret i32 [[A]]
+;
+  %or1 = or i32 %A, 65537
+  %select1 = select i1 %D, i32 %or1, i32 %A
+  %or2 = or i32 %select1, 131074
+  %select2 = select i1 %E, i32 %or2, i32 %select1
+  br i1 %C, label %then, label %endif
+
+then:
+  ret i32 %select2
+
+endif:
+  ret i32 %A
+}