This is an archive of the discontinued LLVM Phabricator instance.

[CGP] Fold empty dedicated exit blocks created by loopsimplify.
AbandonedPublic

Authored by bmakam on Jul 18 2017, 3:11 PM.

Details

Reviewers

efriedma
davidxl
mcrosier
junbuml

Summary

If the destination block is an almost-empty exit block, then it was
created by loopsimplify during LSR when it canonicalized the loop,
and it can be merged.

Diff Detail

Event Timeline

bmakam created this revision. Jul 18 2017, 3:11 PM

bmakam mentioned this in D35411: [SimplifyCFG] Defer folding unconditional branches to LateSimplifyCFG if it can destroy canonical loop structure..

bmakam added a parent revision: D35411: [SimplifyCFG] Defer folding unconditional branches to LateSimplifyCFG if it can destroy canonical loop structure.. Jul 18 2017, 3:14 PM

bmakam removed a parent revision: D35411: [SimplifyCFG] Defer folding unconditional branches to LateSimplifyCFG if it can destroy canonical loop structure.. Jul 18 2017, 3:18 PM

bmakam added a child revision: D35411: [SimplifyCFG] Defer folding unconditional branches to LateSimplifyCFG if it can destroy canonical loop structure..

bmakam removed a child revision: D35411: [SimplifyCFG] Defer folding unconditional branches to LateSimplifyCFG if it can destroy canonical loop structure.. Jul 19 2017, 1:50 AM

The underlying issue is that after loopsimplify creates empty exit blocks, CGP cannot clean them up if they are reached from a switch case. I am working on an alternative approach to address this issue.

Updated patch to address the underlying issue.

efriedma added inline comments. Jul 28 2017, 2:55 PM

test/Transforms/CodeGenPrepare/merge-empty-latch-block.ll
67	Probably not relevant, but it looks like there's a missed optimization here: we should rotate this loop.
92	If I'm following correctly, the problem is this block: you want it to go away, but cgp isn't folding it. It looks like isMergingEmptyBlockProfitable is specifically trying to detect cases like this: folding away this BB involves inserting an extra COPY into the while.cond5, and while.cond5 is hotter than while.body.backedge.loopexit, so in theory you could lose performance. In this particular situation, though, you want to fold it anyway? What distinguishes this testcase from the testcase in r289988?

bmakam added inline comments. Jul 31 2017, 8:36 AM

test/Transforms/CodeGenPrepare/merge-empty-latch-block.ll
92	Yes, your understanding is correct. This block was created by loopsimplify during the LSR pass to canonicalize the loop so that it has dedicated exit blocks, with the understanding that simplifyCFG will clean up blocks that are split out but end up being unnecessary. Since we do not run simplifyCFG after LSR, we rely on CGP to fold this block so that the generated code is not pessimized. If I understand correctly, isMergingEmptyBlockProfitable is trying to work around the underlying problem, i.e. new critical edges cannot be split properly in PHIElimination, which results in COPY instructions being inserted into blocks with higher frequency. I'm not sure if GlobalISel can handle this issue, but the temporary solution in r289988 cannot be applied in general: I observed that after r308422, not folding away empty exit blocks pessimizes the generated code and results in a 3% regression in the same benchmark that r289988 originally targeted. My first solution was to fold the empty block only if it was a latch block; this avoided the regression caused by r308422 and also kept the gains from r289988. However, I feel this was papering over the real issue, so I am now folding all the exit blocks, because they were likely added by loopsimplify and need to be cleaned up. This recovered 0.7% of the regression. I am looking for feedback on what a reasonable approach would be and am open to suggestions.
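As a rough illustration of the folding described above, here is a minimal sketch on a toy CFG. It is not LLVM's API: `ToyCFG`, `foldEmptyBlock`, and `makeExampleCFG` are hypothetical names introduced only for this example, and the model ignores PHI nodes and block frequencies entirely. The block names are taken from the test case in this revision.

```cpp
#include <map>
#include <string>
#include <vector>

// Toy CFG: block name -> successor names. Hypothetical model, not LLVM's API.
using ToyCFG = std::map<std::string, std::vector<std::string>>;

// Fold `empty` (assumed to contain only an unconditional branch) into its
// single successor by retargeting every edge that pointed at it. Returns
// false and leaves the CFG untouched if `empty` is unknown, does not have
// exactly one successor, or branches to itself.
bool foldEmptyBlock(ToyCFG &cfg, const std::string &empty) {
  auto it = cfg.find(empty);
  if (it == cfg.end() || it->second.size() != 1)
    return false;
  const std::string dest = it->second.front(); // copy before erasing
  if (dest == empty)
    return false;
  cfg.erase(it);
  for (auto &entry : cfg)
    for (auto &succ : entry.second)
      if (succ == empty)
        succ = dest;
  return true;
}

// The inner loop of the test case, reduced to the edges that matter here.
ToyCFG makeExampleCFG() {
  ToyCFG cfg;
  cfg["while.cond5"] = {"while.body.backedge.loopexit", "while.body7",
                        "for.cond.backedge.loopexit"};
  cfg["while.body7"] = {"while.cond5"};
  cfg["while.body.backedge.loopexit"] = {"while.body.backedge"};
  cfg["while.body.backedge"] = {"while.body"};
  return cfg;
}
```

Folding `while.body.backedge.loopexit` in this model makes `while.cond5` branch straight to `while.body.backedge`, which corresponds to the PHI incoming block `%while.cond5` that the test's CHECK line expects. What the model cannot show is the cost side: the COPY for the PHI then has to live in `while.cond5`, which is the concern raised in the inline comment.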

A gentle ping.

It's hard to predict what exactly PHIElimination will do at this point in the pipeline... I don't have any better suggestions for a heuristic here.

(GlobalISel is irrelevant, I think; we keep around PHI nodes until register allocation.)

lib/CodeGen/CodeGenPrepare.cpp
656	Whitespace.
test/Transforms/CodeGenPrepare/merge-empty-latch-block.ll
50	Needs a comment explaining why you're looking for this pattern.

Update to address Eli's comments.

bmakam marked 2 inline comments as done. Aug 2 2017, 1:02 PM

To clarify, the original version of this patch recovered the full 3% regression, but the new version only recovers 0.7%?

Is the actual value of the PHI operand relevant here? It looks like some of the testcases on r289988 involve constants, which are materialized during isel (and can be substantially more expensive).

In D35584#829791, @efriedma wrote:

To clarify, the original version of this patch recovered the full 3% regression, but the new version only recovers 0.7%?

Correct.

Is the actual value of the PHI operand relevant here? It looks like some of the testcases on r289988 involve constants, which are materialized during isel (and can be substantially more expensive).

Thanks Eli, this looks interesting. Another observation is that increasing cgp-freq-ratio-to-skip-merge to 1000 recovers the full 3% regression. I am trying to isolate two things: the exact basic block for which skipping the merge in CGP keeps copies out of the hot path and improves performance through a reduced dynamic instruction count, and the basic block(s) for which merging eliminates the unnecessary branches and improves performance through better branching/i-cache utilization. This might provide an answer to your question about the relevance of PHI operands.
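To make the role of the frequency ratio concrete, here is a minimal sketch of the cost comparison described in the source comment of isMergingEmptyBlockProfitable: Cost(skipping merging) = Freq(BB) * (Cost(Copy) + Cost(Branch)) and Cost(merging BB) = Freq(Pred) * Cost(Copy). The function `isMergeProfitable` and its plain-integer frequencies are hypothetical; the sketch additionally assumes Cost(Copy) == Cost(Branch), so that the comparison reduces to a single ratio, and it assumes the option is applied as that ratio. The threshold value 2 used in the usage note follows from that assumption and is not stated in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the decision controlled by
// cgp-freq-ratio-to-skip-merge. Frequencies are plain integers here and are
// assumed small enough that the multiplication does not overflow.
//
// Keeping the empty block costs Freq(BB) * (Copy + Branch); merging it costs
// Freq(Pred) * Copy. Assuming Copy == Branch (an assumption of this sketch),
// keeping the block only pays off once Freq(Pred) / Freq(BB) exceeds the
// ratio, so merging is treated as profitable otherwise.
bool isMergeProfitable(std::uint64_t predFreq, std::uint64_t bbFreq,
                       std::uint64_t ratioToSkipMerge) {
  assert(ratioToSkipMerge > 0 && "threshold must be positive");
  return predFreq <= bbFreq * ratioToSkipMerge;
}
```

With a ratio of 2, a predecessor of frequency 100 and an empty block of frequency 10 gives `isMergeProfitable(100, 10, 2) == false`, so the block is kept. Raising the ratio to 1000, as in the experiment above, flips the same case to `true`, which is consistent with the observation that a very large ratio effectively folds every such block.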

I have looked into the PHI operands closely, and I am not convinced that PHI operands involving constants have any influence on the profitability of merging empty blocks. I identified empty exit blocks that improved performance a bit when merged with their successors, yet merging another, similar empty exit block with its destination block hurt performance. The only difference between them is that the successor block is also empty in the case where performance regressed.

Is the actual value of the PHI operand relevant here? It looks like some of the testcases on r289988 involve constants, which are materialized during isel (and can be substantially more expensive).

I took a stab at this and created D37343. Please take a look.

Subsumed by D37343

Revision Contents

Path	Size
lib/CodeGen/CodeGenPrepare.cpp	19 lines
test/Transforms/CodeGenPrepare/merge-empty-latch-block.ll	117 lines

Diff 107187

lib/CodeGen/CodeGenPrepare.cpp

(228 lines not shown)
 class TypePromotionTransaction;

 private:
   bool eliminateFallThrough(Function &F);
   bool eliminateMostlyEmptyBlocks(Function &F);
   BasicBlock *findDestBlockOfMergeableEmptyBlock(BasicBlock *BB);
   bool canMergeBlocks(const BasicBlock *BB, const BasicBlock *DestBB) const;
   void eliminateMostlyEmptyBlock(BasicBlock *BB);
   bool isMergingEmptyBlockProfitable(BasicBlock *BB, BasicBlock *DestBB,
-                                     bool isPreheader);
+                                     bool isPreheader, bool isLatch);
   bool optimizeBlock(BasicBlock &BB, bool &ModifiedDT);
   bool optimizeInst(Instruction *I, bool &ModifiedDT);
   bool optimizeMemoryInst(Instruction *I, Value *Addr,
                           Type *AccessTy, unsigned AS);
   bool optimizeInlineAsmInst(CallInst *CS);
   bool optimizeCallInst(CallInst *CI, bool &ModifiedDT);
   bool optimizeExt(Instruction *&I);
   bool optimizeExtUses(Instruction *I);
(394 lines not shown)
 }

 /// Eliminate blocks that contain only PHI nodes, debug info directives, and an
 /// unconditional branch. Passes before isel (e.g. LSR/loopsimplify) often split
 /// edges in ways that are non-optimal for isel. Start by eliminating these
 /// blocks so we can split them the way we want them.
 bool CodeGenPrepare::eliminateMostlyEmptyBlocks(Function &F) {
   SmallPtrSet<BasicBlock *, 16> Preheaders;
+  SmallPtrSet<BasicBlock *, 16> Latches;
   SmallVector<Loop *, 16> LoopList(LI->begin(), LI->end());
   while (!LoopList.empty()) {
     Loop *L = LoopList.pop_back_val();
     LoopList.insert(LoopList.end(), L->begin(), L->end());
     if (BasicBlock *Preheader = L->getLoopPreheader())
       Preheaders.insert(Preheader);
+    if (BasicBlock *Latch = L->getLoopLatch())
+      Latches.insert(Latch);
   }

   bool MadeChange = false;
   // Note that this intentionally skips the entry block.
   for (Function::iterator I = std::next(F.begin()), E = F.end(); I != E;) {
     BasicBlock *BB = &*I++;
     BasicBlock *DestBB = findDestBlockOfMergeableEmptyBlock(BB);
-    if (!DestBB ||
-        !isMergingEmptyBlockProfitable(BB, DestBB, Preheaders.count(BB)))
+    if (!DestBB || !isMergingEmptyBlockProfitable(
+                       BB, DestBB, Preheaders.count(BB), Latches.count(DestBB)))
       continue;

+    Latches.erase(BB);
+    Preheaders.erase(BB);
     eliminateMostlyEmptyBlock(BB);
     MadeChange = true;
   }
   return MadeChange;
 }

 bool CodeGenPrepare::isMergingEmptyBlockProfitable(BasicBlock *BB,
                                                    BasicBlock *DestBB,
-                                                   bool isPreheader) {
+                                                   bool isPreheader,
+                                                   bool isLatch) {
   // Do not delete loop preheaders if doing so would create a critical edge.
   // Loop preheaders can be good locations to spill registers. If the
   // preheader is deleted and we create a critical edge, registers may be
   // spilled in the loop body instead.
   if (!DisablePreheaderProtect && isPreheader &&
       !(BB->getSinglePredecessor() &&
         BB->getSinglePredecessor()->getSingleSuccessor()))
     return false;

   // Try to skip merging if the unique predecessor of BB is terminated by a
   // switch or indirect branch instruction, and BB is used as an incoming block
   // of PHIs in DestBB. In such case, merging BB and DestBB would cause ISel to
   // add COPY instructions in the predecessor of BB instead of BB (if it is not
   // merged). Note that the critical edge created by merging such blocks wont be
   // split in MachineSink because the jump table is not analyzable. By keeping
   // such empty block (BB), ISel will place COPY instructions in BB, not in the
   // predecessor of BB.
   BasicBlock *Pred = BB->getUniquePredecessor();
   if (!Pred ||
       !(isa<SwitchInst>(Pred->getTerminator()) ||
         isa<IndirectBrInst>(Pred->getTerminator())))
     return true;

+  // If the destination block is almost empty latch block then we can hoist
+  // the jump through the backedge, so it is profitable to merge.
+  if (isLatch && DestBB->getTerminator() == DestBB->getFirstNonPHI())
+    return true;
+
   if (BB->getTerminator() != BB->getFirstNonPHI())
     return true;

   // We use a simple cost heuristic which determine skipping merging is
   // profitable if the cost of skipping merging is less than the cost of
   // merging : Cost(skipping merging) < Cost(merging BB), where the
   // Cost(skipping merging) is Freq(BB) * (Cost(Copy) + Cost(Branch)), and
   // the Cost(merging BB) is Freq(Pred) * Cost(Copy).
(5,782 lines not shown)

test/Transforms/CodeGenPrepare/merge-empty-latch-block.ll

This file was added.

				; RUN: opt -codegenprepare < %s -mtriple=aarch64-none-linux-gnu -S | FileCheck %s

				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				; Expect to merge empty latch block as it will hoist the jump through backedge.
				%struct._IO_FILE = type { i32, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, i8*, %struct._IO_marker*, %struct._IO_FILE*, i32, i32, i64, i16, i8, [1 x i8], i8*, i64, i8*, i8*, i8*, i8*, i64, i32, [20 x i8] }
				%struct._IO_marker = type { %struct._IO_marker*, %struct._IO_FILE*, i32 }

				@finput = external local_unnamed_addr global %struct._IO_FILE*, align 8
				@.str = external hidden unnamed_addr constant [23 x i8], align 1
				@lineno = external local_unnamed_addr global i32, align 4
				@.str.1 = external hidden unnamed_addr constant [21 x i8], align 1

				; Function Attrs: nounwind
				; CHECK-LABEL: @skip_white_space
				define i32 @skip_white_space() local_unnamed_addr #0 {
				entry:
				br label %for.cond

				for.cond: ; preds = %for.cond.backedge, %entry
				%0 = load %struct._IO_FILE*, %struct._IO_FILE** @finput, align 8
				%call11 = tail call i32 @_IO_getc(%struct._IO_FILE* %0)
				switch i32 %call11, label %sw.default [
				i32 47, label %sw.bb
				i32 10, label %sw.bb25
				i32 32, label %for.cond.backedge
				i32 9, label %for.cond.backedge
				i32 12, label %for.cond.backedge
				]

				sw.bb: ; preds = %for.cond
				%1 = load %struct._IO_FILE*, %struct._IO_FILE** @finput, align 8
				%call1 = tail call i32 @_IO_getc(%struct._IO_FILE* %1)
				%cmp = icmp eq i32 %call1, 42
				br i1 %cmp, label %if.end, label %if.then

				if.then: ; preds = %sw.bb
				tail call void @fatals(i8* getelementptr inbounds ([23 x i8], [23 x i8]* @.str, i64 0, i64 0), i32 %call1, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0) #3
				br label %if.end

				if.end: ; preds = %if.then, %sw.bb
				%2 = load %struct._IO_FILE*, %struct._IO_FILE** @finput, align 8
				%call2 = tail call i32 @_IO_getc(%struct._IO_FILE* %2)
				br label %while.body

				; CHECK-LABEL: while.body
				; CHECK: %c.140 = phi i32 [ %call2, %if.end ], [ %call20, %if.else19 ], [ -1, %if.then18 ], [ %call15, %if.then14 ], [ %c.2, %while.cond5 ]
				; CHECK-NOT: %c.140 = phi i32 [ %call2, %if.end ], [ %call20, %if.else19 ], [ -1, %if.then18 ], [ %call15, %if.then14 ], [ %c.2, %while.body.backedge.loopexit ]
				while.body: ; preds = %while.body.backedge, %if.end
				%c.140 = phi i32 [ %call2, %if.end ], [ %c.140.be, %while.body.backedge ]
				switch i32 %c.140, label %if.else19 [
				i32 42, label %while.cond5.preheader
				i32 10, label %if.then14
				i32 -1, label %if.then18
				]

				while.cond5.preheader: ; preds = %while.body
				br label %while.cond5

				while.cond5: ; preds = %while.body7, %while.cond5.preheader
				%c.2 = phi i32 [ %call8, %while.body7 ], [ 42, %while.cond5.preheader ]
				switch i32 %c.2, label %while.body.backedge.loopexit [
				i32 42, label %while.body7
				i32 47, label %for.cond.backedge.loopexit
				]

				while.body7: ; preds = %while.cond5
				%3 = load %struct._IO_FILE*, %struct._IO_FILE** @finput, align 8
				%call8 = tail call i32 @_IO_getc(%struct._IO_FILE* %3)
				br label %while.cond5

				if.then14: ; preds = %while.body
				%4 = load i32, i32* @lineno, align 4
				%inc = add nsw i32 %4, 1
				store i32 %inc, i32* @lineno, align 4
				%5 = load %struct._IO_FILE*, %struct._IO_FILE** @finput, align 8
				%call15 = tail call i32 @_IO_getc(%struct._IO_FILE* %5)
				br label %while.body.backedge

				if.then18: ; preds = %while.body
				tail call void @fatal(i8* getelementptr inbounds ([21 x i8], [21 x i8]* @.str.1, i64 0, i64 0)) #3
				br label %while.body.backedge

				if.else19: ; preds = %while.body
				%6 = load %struct._IO_FILE*, %struct._IO_FILE** @finput, align 8
				%call20 = tail call i32 @_IO_getc(%struct._IO_FILE* %6)
				br label %while.body.backedge

				while.body.backedge.loopexit: ; preds = %while.cond5
				br label %while.body.backedge

				while.body.backedge: ; preds = %while.body.backedge.loopexit, %if.else19, %if.then18, %if.then14
				%c.140.be = phi i32 [ %call20, %if.else19 ], [ -1, %if.then18 ], [ %call15, %if.then14 ], [ %c.2, %while.body.backedge.loopexit ]
				br label %while.body

				sw.bb25: ; preds = %for.cond
				%7 = load i32, i32* @lineno, align 4
				%inc26 = add nsw i32 %7, 1
				store i32 %inc26, i32* @lineno, align 4
				br label %for.cond.backedge

				for.cond.backedge.loopexit: ; preds = %while.cond5
				br label %for.cond.backedge

				for.cond.backedge: ; preds = %for.cond.backedge.loopexit, %sw.bb25, %for.cond, %for.cond, %for.cond
				br label %for.cond

				sw.default: ; preds = %for.cond
				ret i32 %call11
				}
				; Function Attrs: nounwind
				declare i32 @_IO_getc(%struct._IO_FILE* nocapture) local_unnamed_addr #1

				declare void @fatals(i8*, i32, i32, i32, i32, i32, i32, i32, i32) local_unnamed_addr #2

				declare void @fatal(i8*) local_unnamed_addr #2
