The current LICM allows sinking an instruction only when it is exposed to exit blocks through a trivially replaceable PHI, i.e. a PHI whose incoming values are all the same instruction. This change enhances LICM to sink a sinkable instruction through non-trivially replaceable PHIs by splitting the predecessors of the loop exits.
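To make the terminology concrete, here is a minimal hypothetical sketch (not taken from the patch or its tests; all names invented) of an LCSSA exit PHI that current LICM can already sink through, with a note on how the non-trivially replaceable case differs:

    define i32 @trivially_replaceable(i32 %n) {
    entry:
      br label %loop

    loop:
      %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
      %mul = mul i32 %iv, %n            ; sinkable: no uses inside the loop
      %early = icmp eq i32 %iv, 10
      br i1 %early, label %exit, label %latch

    latch:
      %iv.next = add i32 %iv, 1
      %cont = icmp slt i32 %iv.next, %n
      br i1 %cont, label %loop, label %exit

    exit:
      ; Every incoming value is the same instruction (%mul), so this LCSSA PHI
      ; is trivially replaceable: current LICM can sink %mul here and replace
      ; the PHI with the sunk copy.
      %lcssa = phi i32 [ %mul, %loop ], [ %mul, %latch ]
      ; If one incoming value were something else, e.g.
      ;   %lcssa = phi i32 [ %mul, %loop ], [ %n, %latch ]
      ; the PHI would be non-trivially replaceable, and sinking %mul would
      ; require splitting the exit block's predecessors, as this patch does.
      ret i32 %lcssa
    }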
Event Timeline
Addressed Eli's comment.
lib/Transforms/Scalar/LICM.cpp:803
This is to clone only one instruction per exit block, and I couldn't think of a case where the same instruction is cloned several times into the same exit block. Please let me know if you can see any case where we need to update SunkCopies.
lib/Transforms/Scalar/LICM.cpp:803
The reason the SunkCopies map exists is that there could be multiple PHI nodes which use the same instruction in a loop exit block. Maybe it would be more straightforward if you made all the necessary calls to splitPredecessorsOfLoopExit before you actually start sinking instructions?
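A hypothetical sketch of the situation described above (names invented): two LCSSA PHIs in one exit block use the same sinkable instruction, and the SunkCopies map is what keeps the sunk instruction from being cloned once per PHI:

    define i32 @two_phis_one_exit(i32 %n) {
    entry:
      br label %loop

    loop:
      %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]
      %mul = mul i32 %iv, %n            ; sinkable, used by two exit PHIs
      %iv.next = add i32 %iv, 1
      %cond = icmp slt i32 %iv.next, %n
      br i1 %cond, label %loop, label %exit

    exit:
      ; Both PHIs use %mul; SunkCopies ensures only one sunk clone of %mul
      ; is created in this exit block.
      %p1 = phi i32 [ %mul, %loop ]
      %p2 = phi i32 [ %mul, %loop ]
      %sum = add i32 %p1, %p2
      ret i32 %sum
    }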
lib/Transforms/Scalar/LICM.cpp:803
Now, I split non-trivially replaceable PHIs before sinking in sink(). Please take a look and let me know if you have any comments. Thanks, Eli, for the review.
I'd like more eyes on this to figure out if there are potential problems with this transform (maybe there's a performance cost for splitting edges in some cases?).
A few more testcases would be nice; maybe a loop with multiple exit blocks, a loop with more than two exits pointing to the same exit block, and a loop where multiple exit PHI nodes use the same sinkable instruction.
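As a rough, hypothetical skeleton of the second suggestion (a loop with more than two exits pointing to the same exit block; all names invented, not an actual test from the patch):

    define i32 @three_exits_one_exit_block(i32 %n, i32 %a, i32 %b) {
    entry:
      br label %loop

    loop:
      %iv = phi i32 [ 0, %entry ], [ %iv.next, %latch ]
      %mul = mul i32 %iv, %n            ; sinkable: no uses inside the loop
      %c1 = icmp eq i32 %iv, %a
      br i1 %c1, label %exit, label %body

    body:
      %c2 = icmp eq i32 %iv, %b
      br i1 %c2, label %exit, label %latch

    latch:
      %iv.next = add i32 %iv, 1
      %c3 = icmp slt i32 %iv.next, %n
      br i1 %c3, label %loop, label %exit

    exit:
      ; Three exiting blocks reach one exit block; the PHI is non-trivially
      ; replaceable because not every incoming value is %mul.
      %res = phi i32 [ %mul, %loop ], [ %mul, %body ], [ %n, %latch ]
      ret i32 %res
    }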
lib/Transforms/Scalar/LICM.cpp:850
Iterating over a SmallPtrSet is non-deterministic.

lib/Transforms/Scalar/LICM.cpp:903
Maybe put a comment here describing what the first loop is doing?

lib/Transforms/Scalar/LICM.cpp:934
Unnecessary cast.
Overall, i'm unclear on why you can't compute which ones will actually be sunk, and then split the preds.
Also, the example does not even require splitting preds to effect the transform.
These are simply phi of ops form, and there is an equivalent op of phis form you could use directly in the exit block.
For example, you have, in the last test (i converted the checks to the code):
    Out12.split.loop.exit:
      %lcssaphi1 = phi i32 [ %N_addr.0.pn, %ContLoop ]
      %mulres = mul i32 %N, %lcssaphi1
      %subres1 = sub i32 %mulres, %N2
      br label %Out12

    Out12.split.loop.exit1:
      %lcssaphi2 = phi i32 [ %N_addr.0.pn, %Loop ]
      %mulres2 = mul i32 %N, %lcssaphi2
      %subres2 = sub i32 %mulres2, %N
      br label %Out12

    Out12:
      %something = phi i32 [ %subres1, %Out12.split.loop.exit ], [ %subres2, %Out12.split.loop.exit1 ]
This is the same as
    Out12:
      %phires = phi i32 [ %N, %Loop ], [ %N2, %ContLoop ]
      %mulres = mul i32 %N, %N_addr.0.pn
      ; %something here has the same value as the phi named %something above
      %something = sub i32 %mulres, %phires
It seems to me if you can compute which can be sunk before choosing to split the preds, you can do it without splitting the preds.
What am i missing?
lib/Transforms/Scalar/LICM.cpp:929
Is there a reason not to just put all the users in a SmallVector of WeakVH first?
I agree that the test doesn't really require splitting, since the operations of the incoming values are the same (sub/mul) and all operands used in the sinkable chain (%N_addr.0.pn, %N, and %N2) dominate the PHI, so we can directly add PHIs for the operands and share the same operations in the exit block, as you describe. For me, this test case is a special case in which splitting is not required. In most cases, I think splitting is required to sink through a non-trivially replaceable PHI; for example, if the operations of the incoming values are different, or if some operand in the sinkable chain does not dominate the PHI, we have to split the preds.
For me, SimplifyCFG should clean up such foldable cases. The test case seems to be handled properly by SimplifyCFG:
    Out12:
      %N_addr.0.pn.lcssa.sink = phi i32 [ %N_addr.0.pn, %Loop ], [ %N_addr.0.pn, %ContLoop ]
      %N.sink = phi i32 [ %N, %Loop ], [ %N2, %ContLoop ]
      %tmp.6.le = mul i32 %N, %N_addr.0.pn.lcssa.sink
      %tmp.7.le = sub i32 %tmp.6.le, %N.sink
Would it make sense for LICM to split preds for non-trivially replaceable PHIs in general and let SimplifyCFG fold such cases? If there is any case SimplifyCFG cannot fold, we can improve SimplifyCFG.
The case where you can't place it somewhere safe should be the case where the edge is critical.
Otherwise, there should always be somewhere to place the computation safely and correctly.
Now, i admit it may be tricky to come up with such an algorithm, but i'd also be surprised if it doesn't exist in literature already.
> For me, SimplifyCFG should clean up such foldable cases. The test case seems to be handled properly by SimplifyCFG:
>
>     Out12:
>       %N_addr.0.pn.lcssa.sink = phi i32 [ %N_addr.0.pn, %Loop ], [ %N_addr.0.pn, %ContLoop ]
>       %N.sink = phi i32 [ %N, %Loop ], [ %N2, %ContLoop ]
>       %tmp.6.le = mul i32 %N, %N_addr.0.pn.lcssa.sink
>       %tmp.7.le = sub i32 %tmp.6.le, %N.sink
>
> Would it make sense for LICM to split preds for non-trivially replaceable PHIs in general and let SimplifyCFG fold such cases? If there is any case SimplifyCFG cannot fold, we can improve SimplifyCFG.
If we can avoid creating the mess in the first place, i'd rather see us do that.
Changing the CFG like this costs something (incremental dominator updates, etc).
If we don't need to do it, we generally shouldn't do it.
If the algorithm turns out to be wildly complex, sure, let's do the easy thing and split preds and let something clean it up.
But i'd at least like to see us *try*.
Of course, it would be good to avoid unnecessary splitting in the first place if it's not very costly. Let me try to find a reasonable approach for this.
> The case where you can't place it somewhere safe should be the case where the edge is critical.
> Otherwise, there should always be somewhere to place the computation safely and correctly.
I'm not perfectly clear about your above comment. Did you mean that we cannot safely sink a sinkable instruction through a critical edge? So, in such case we should split the critical edge to sink?
If you need help, let me know.
If you can't find one, awesome, let's go with splitting :)
>> The case where you can't place it somewhere safe should be the case where the edge is critical.
>> Otherwise, there should always be somewhere to place the computation safely and correctly.
>
> I'm not perfectly clear about your above comment. Did you mean that we cannot safely sink a sinkable instruction through a critical edge? So, in such case we should split the critical edge to sink?
Yes.
Instructions are really placed on edges in most cases.
That is, you are hoisting or sinking it out of a block, and want it to appear first/last in another block.
This is really edge placement.
But most compilers don't allow code on edges, so you actually have to put it in the block.
In the normal, non-critical edge case, this is safe to do. That is, it's possible to do it and have the instruction execute under only the same conditions it did before.
In the critical edge case, you may need to split the edge in order to have a place to put it.
Let me give you a hoisting example just because people seem to find it easier when i explain it that direction:
    a   b
    | \ /
    c  d
We have the same computation in B and D.
So you want to do PRE on it.
To eliminate the extra computation, you need to place a computation on the edge between A and D. That is the only way to ensure that the computation is still available on all paths to D.
If you could place code on that edge, it would be easy, no muss, no fuss!
You'd be guaranteed it only executes on the a->d path.
LLVM (and most other compilers) can't place code on edges, only in the blocks.
But if you place the computation in block A, it may also be computed on the path to C, even when D is never executed.
That's no good, now your computation may execute when it didn't before (think of a load, for example, or any trapping instruction).
This happens because a->d is a critical edge. It's an edge from a block with multiple successors to a block with multiple predecessors.
The only safe way to place the computation on this edge is to split this edge so it looks like:
    a
    | \
    c  t   b
        \ /
         d
and then place the computation in t.
Sinking has the same problem (imagine you want to sink something from a to d. if you place it at the beginning of d, now it also occurs on the b->d path, where it may never have occurred before. The only way to make it occur just on the a->d path is to split the edge as we did above)
The edges to loop exits may often be critical edges, so you may (but not always) have to split them to place computations safely.
You can actually compute which edges are "blocking" your transformation and split them.
(you could also pre-split all critical edges, at some cost. There is a lot of literature on this, and the TL;DR is that you can prove that a number of dataflow problems do not get optimal answers in the presence of critical edges. LLVM has not chosen this path so far, however, because the cost has outweighed the practical benefit).
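As a small, invented illustration of the loop-exit case discussed above (not from the patch's tests): the exiting block has two successors and the exit block has two predecessors, so the exit edge is critical and has to be split before the sinkable instruction can be placed on it:

    define i32 @critical_loop_exit(i32 %n, i1 %skip) {
    entry:
      br i1 %skip, label %out, label %loop

    loop:
      %iv = phi i32 [ 0, %entry ], [ %iv.next, %loop ]
      %mul = mul i32 %iv, %n            ; sinkable: no uses inside the loop
      %iv.next = add i32 %iv, 1
      %cond = icmp slt i32 %iv.next, %n
      br i1 %cond, label %loop, label %out

    out:
      ; %loop -> %out is a critical edge: %loop has two successors and %out
      ; has two predecessors.  Placing the sunk %mul at the top of %out would
      ; make it execute on the %entry -> %out path as well, where %iv does
      ; not even exist, so the edge must be split first.
      %res = phi i32 [ 0, %entry ], [ %mul, %loop ]
      ret i32 %res
    }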
I looked into whether we can add an extra check to avoid unnecessary splitting, but this seems to me to be a special case, and I don't want to increase complexity in this change for it. Since all incoming edges of a non-trivially replaceable PHI in LCSSA are critical, splitting them by default doesn't seem unreasonable to me.
Splitting is not required only when all incoming values of the non-trivially replaceable PHI are sinkable and they perform the same operation. In that case we can sink a single copy of the operation and modify the PHI to pass the operands to the shared operation, like the IR you showed above. Handling that is a pretty special case, which I believe we can leave to another pass, or extend LICM to cover in a separate patch if needed.
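For instance (a hypothetical fragment with invented names and loop-invariant operands %a and %b, reusing the .split.loop.exit suffix seen in the earlier example), when the incoming values use different opcodes there is no single operation to share in the exit block, and splitting the predecessors is the direct way to sink both:

    ; Before: non-trivially replaceable PHI whose incoming values are
    ; different operations defined in the loop (%mul in %exiting1, %sub in
    ; %exiting2).
    Out:
      %res = phi i32 [ %mul, %exiting1 ], [ %sub, %exiting2 ]

    ; After splitting both predecessors of the exit, each new block receives
    ; the sunk copy belonging to its edge, and %res merges the per-edge
    ; results.
    Out.split.loop.exit:
      %mul.sunk = mul i32 %a, %b
      br label %Out
    Out.split.loop.exit1:
      %sub.sunk = sub i32 %a, %b
      br label %Out
    Out:
      %res = phi i32 [ %mul.sunk, %Out.split.loop.exit ], [ %sub.sunk, %Out.split.loop.exit1 ]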
Added more test cases and made minor updates.
lib/Transforms/Scalar/LICM.cpp:929
After splitPredecessorsOfLoopExit(), user PHIs will be modified: they may no longer be users and may no longer be in the exit block. We also need to get the Use from the user_iterator for the unreachability check on the PHI's incoming block in this loop. Instead of holding the original users and iterating over them, I think restarting the iterator over the valid users might be easier to read. To avoid performing the same check for the same users, I cached visited users. If you look at test16() in the test case, the same PHI appears multiple times as a user. After splitPredecessorsOfLoopExit, that PHI will no longer be a user of the sinkable instruction, so we may not want to revisit it.
Okay.
One reason we don't split critical edges by default in most passes in LLVM is that edges coming out of switches that lead to the same block are pretty much always critical.
Doing so by default for large switches leads to *huge* code blowup.
IE it goes from 1 block to N blocks, where N is the number of switch cases.
If there are multiple layers of switches, it's obviously even worse.
Have you tested this on any switch-heavy code to ensure it does not blow up code size and compile time?
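A hypothetical illustration of that shape (invented names, not from the patch's tests): every case edge below leaves a block with many successors and enters a block with multiple predecessors, so each one is a critical edge, and splitting them all turns one target block into one block per case:

    define i32 @switch_heavy(i32 %x) {
    entry:
      switch i32 %x, label %other [
        i32 1, label %merge
        i32 2, label %merge
        i32 3, label %merge
        i32 4, label %merge
      ]

    other:
      br label %merge

    merge:
      ; %entry has five successor edges and %merge has five incoming edges,
      ; so each %entry -> %merge case edge is a critical edge.
      ret i32 %x
    }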
> Okay.
> One reason we don't split critical edges by default in most passes in LLVM is that edges coming out of switches that lead to the same block are pretty much always critical.
> Doing so by default for large switches leads to *huge* code blowup.
> IE it goes from 1 block to N blocks, where N is the number of switch cases.
> If there are multiple layers of switches, it's obviously even worse.
> Have you tested this on any switch-heavy code to ensure it does not blow up code size and compile time?
I tried building spec2000/spec2006/spec2017 with and without this change for AArch64. The change was applied widely, and as far as I know perlbench uses lots of big switches, but I didn't see any significant change in code size or compile time across the benchmarks.
However, as you point out, it could possibly increase code size even in cases where splitting is required for sinking. Just to be safe, what about checking whether the number of predecessors is larger than some threshold (controlled by a cl::opt) before splitting?
It's possible to get quadratic code growth from LICM sinking. For example:
    int foo();

    int unconditional_break(int n, int *p) {
      int r = 0;
      for (int i = 0; i < n; ++i) {
        int x = foo();
        int a = *p;
        int b = a * a;
        int c = b * b;
        int d = c * c;
        int e = d * d;
        int f = e * e;
        if (x == 1) { r = a; break; }
        if (x == 2) { r = b; break; }
        if (x == 3) { r = c; break; }
        if (x == 4) { r = d; break; }
        if (x == 5) { r = e; break; }
        if (x == 6) { r = f; break; }
      }
      return r;
    }

    int conditional_break(int n, int *p) {
      int r = 0;
      for (int i = 0; i < n; ++i) {
        int x = foo();
        int a = *p;
        int b = a * a;
        int c = b * b;
        int d = c * c;
        int e = d * d;
        int f = e * e;
        if (x == 1) { r = a; if (foo()) break; }
        if (x == 2) { r = b; if (foo()) break; }
        if (x == 3) { r = c; if (foo()) break; }
        if (x == 4) { r = d; if (foo()) break; }
        if (x == 5) { r = e; if (foo()) break; }
        if (x == 6) { r = f; if (foo()) break; }
      }
      return r;
    }
The source contains 5 multiplies, but the optimized IR has 15 due to sinking (presumably because each exit clones only the prefix of the multiply chain it needs: 0 + 1 + 2 + 3 + 4 + 5 = 15). For the first function, sinking happens on trunk. For the second function, this patch will allow sinking, I think.
That said, code-size bloat from LICM sinking hasn't ever shown up as a problem in practice; I doubt it will cause problems here.
If you haven't seen any real code size regressions, i think you should just be prepared to address it in the future, and let's leave it at that for now :)
Thanks for working through all this.
Kindly ping. Please let me know if you have any comments.
Ran tests for spec2000/2006/2017 on AArch64, and there was no clear change in performance or size. However, I observed a 4% performance gain in spec2006/xalancbmk when this is applied together with another LICM patch of mine (https://reviews.llvm.org/D37076).
Minor update in comments and fixed a test case failure in subreg-postra-2.ll by preventing the select instruction from being sunk.
Tony,
Can you please confirm if the change in subreg-postra-2.ll doesn't break the original intention of the test?
Thanks,
Jun
I'm still not sure whether the change in subreg-postra-2.ll is okay. Could anyone who knows this PowerPC test please review the change in the test?
Thanks,
Jun
test/CodeGen/PowerPC/subreg-postra-2.ll:7 (On Diff #120267)
Do you need to change this test at all? Would using the flag -ppc-gep-opt=0 enable you to leave the test as is?
Added "-ppc-gep-opt=0". No change in the test with/without this change. The change in the second RUN is because of using -ppc-gep-opt=0, but this doesn't seem to break the original intention of this test.
Rebased. I will commit this if the change in PowerPC test (subreg-postra-2.ll) is okay.
Thanks,
Jun
Do you need to do something to keep SunkCopies up to date?