This is an archive of the discontinued LLVM Phabricator instance.

[Taildup] Don't tail-duplicate loop header with multiple successors as its latches
ClosedPublic

Authored by junparser on Sep 28 2021, 2:11 AM.

Details

Summary

When TailDup hits a loop with multiple latches like:

//    1 -> 2 <-> 3                 |
//          \  <-> 4               |
//           \   <-> 5             |
//            \---> rest           |

it may transform this loop into multiple loops by duplicating the loop header.
However, this change often has little benefit while making the CFG much more complex.
In some uncommon cases, it causes a large compile-time regression (reported by
@alexfh in D106056).

This patch disables tail duplication in such cases.
PS: It checks for more than 2 latches, since we want to leave the simple conditional-branch case alone.
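
For illustration only, here is a minimal sketch of the kind of check this implies (not the actual patch code; the helper name, its parameters, and the default threshold are assumptions made for this example):

#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineLoopInfo.h"
using namespace llvm;

// Hypothetical helper: returns true when TailBB is a loop header whose loop
// already has more than LatchLimit latches, i.e. more than LatchLimit in-loop
// predecessors feeding back edges into it. Duplicating such a header splits
// the loop and adds many extra edges for little benefit.
static bool isMultiLatchLoopHeader(const MachineBasicBlock *TailBB,
                                   const MachineLoopInfo &MLI,
                                   unsigned LatchLimit = 2) {
  const MachineLoop *L = MLI.getLoopFor(TailBB);
  if (!L || L->getHeader() != TailBB)
    return false; // Only loop headers are of interest here.

  // Every in-loop predecessor of the header is a latch (source of a back edge).
  unsigned NumLatches = 0;
  for (const MachineBasicBlock *Pred : TailBB->predecessors())
    if (L->contains(Pred))
      ++NumLatches;

  // More than two latches usually indicates a switch/jump-table loop rather
  // than a simple conditional back edge.
  return NumLatches > LatchLimit;
}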

TestPlan: check-llvm

Diff Detail

Event Timeline

junparser created this revision.Sep 28 2021, 2:11 AM
junparser requested review of this revision.Sep 28 2021, 2:11 AM
Herald added a project: Restricted Project.Sep 28 2021, 2:11 AM
lkail added a subscriber: lkail.Sep 28 2021, 2:18 AM

Reduced .bc file attached.

$ time llc-after --filetype=asm -O3 huge-switch.ll -o huge-switch.s

real    0m0.454s
user    0m0.436s
sys     0m0.012s

$ time llc-before -O3 --filetype=asm huge-switch.ll -o huge-switch-1.s

real    0m14.714s
user    0m14.297s
sys     0m0.134s

The backend spends a lot of time building the CFG and loop chains, and it also generates worse code.

lkail added a comment.Sep 28 2021, 4:10 AM

IIUC, switching this pattern off might lose some optimization commonly seen in interpreter, e.g., https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables. Do you have any benchmark numbers?

junparser updated this revision to Diff 375769.Sep 28 2021, 8:02 PM

Add a test case showing the diff from this change.

IIUC, switching this pattern off might lose some optimization commonly seen in interpreter, e.g., https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables. Do you have any benchmark numbers?

Hi, thanks for the reminder!
I tested the change with SPEC CPU2017, and there was no performance change. I also tested the case in https://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables and added it as a test case as well. This patch has no effect on most interpreter switch loops.

However, I tested the case provided by @alexfh on top of D106056; without this change it shows about a 1x performance gain. The reduced case is large_loop_switch in the test file.

The root cause is that, before D106056, the LLVM optimizer usually does not turn an unused default case into unreachable. The interpreter switch loop also keeps this branch when it is transformed into a jump table, so the loop takes the form:

//                  
//              / <---------/ / /
//          0-> 1->  2  -> 3  / /            
//              \      \  -> 4 /              
//                \      \ -> 5             
//                  \---> default

The default BB cannot be tail-duplicated into the loop header. With D106056, the unreachable default BB is removed, and tail duplication can then transform the switch loop into a dispatch table.
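
As a made-up illustration (not the large_loop_switch test case itself), such an interpreter-style loop looks roughly like the snippet below; the opcode values are masked so the default case can never execute, yet before D106056 a default BB is still kept when the switch is lowered:

// Hypothetical interpreter loop. Before D106056 the unused default case
// survives into the jump-table lowering, giving the loop the extra exit edge
// shown in the diagram above, so the header cannot be tail-duplicated into
// every latch. Once D106056 proves the default unreachable and removes it,
// tail duplication can rewrite the loop as a per-opcode dispatch table.
int run(const unsigned char *ops, int n) {
  int acc = 0;
  for (int i = 0; i < n; ++i) {
    switch (ops[i] & 0x3) { // only values 0..3 can occur
    case 0: acc += 1; break;
    case 1: acc -= 1; break;
    case 2: acc *= 3; break;
    case 3: acc ^= i; break;
    }
  }
  return acc;
}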

So this change blocks the optimization from D106056. @alexfh, would you like to work around it with -mllvm -tail-dup-indirect-size=4?

Or should we add one more parameter such as -tail-dup-jmptable-size, with a default value of maybe 100?
Any suggestions?

alexfh added a comment.Oct 1 2021, 4:34 PM

So this change blocks the optimization from D106056. @alexfh, would you like to work around it with -mllvm -tail-dup-indirect-size=4?

I found a single huge protobuf generated source that was pushed over the time limit. However, there are likely many others that will just take more time to compile wasting resources and potentially increasing build latency. Having to hunt down all of them and to sprinkle -mllvm -tail-dup-indirect-size=4 all around the build configuration files would be unfortunate.

Or should we add one more parameter such as -tail-dup-jmptable-size, with a default value of maybe 100?
Any suggestions?

Would that stop this particular transformation for large switch cases? Should this always be done?

junparser updated this revision to Diff 378416.Oct 8 2021, 11:32 PM

Add one parameter. @alexfh, could you give this patch a try?

So this change blocks the optimization from D106056. @alexfh, would you like to work around it with -mllvm -tail-dup-indirect-size=4?

I found a single huge protobuf generated source that was pushed over the time limit. However, there are likely many others that will just take more time to compile wasting resources and potentially increasing build latency. Having to hunt down all of them and to sprinkle -mllvm -tail-dup-indirect-size=4 all around the build configuration files would be unfortunate.

Or should we add one more parameter such as -tail-dup-jmptable-size, with a default value of maybe 100?
Any suggestions?

Would that stop this particular transformation for large switch cases? Should this always be done?

Sorry for the late reply. I believe this should stop tail duplication of the large jump-table loops generated from switches.


@lkail @MatzeB, any suggestions on the parameter value of 128?

lkail added a comment.EditedOct 10 2021, 9:37 PM

@lkail @MatzeB, any suggestions on the parameter value of 128?

From the perf data pasted by @alexfh in https://reviews.llvm.org/D106056, the compile-time regression occurs in MachineBlockPlacement when the CFG becomes complex. I think we should fix the issue there, i.e., fix the O(n^2) iteration with something like the following (there are multiple such usages; one is listed here):

diff --git a/llvm/lib/CodeGen/MachineBlockPlacement.cpp b/llvm/lib/CodeGen/MachineBlockPlacement.cpp
index 8a1b4031642d..56c3d3106a19 100644
--- a/llvm/lib/CodeGen/MachineBlockPlacement.cpp
+++ b/llvm/lib/CodeGen/MachineBlockPlacement.cpp
@@ -1905,8 +1905,11 @@ MachineBlockPlacement::TopFallThroughFreq(
       // Check if Top is the best successor of Pred.
       auto TopProb = MBPI->getEdgeProbability(Pred, Top);
       bool TopOK = true;
-      for (MachineBasicBlock *Succ : Pred->successors()) {
-        auto SuccProb = MBPI->getEdgeProbability(Pred, Succ);
+      for (MachineBasicBlock::succ_iterator SI = Pred->succ_begin(),
+                                          SE = Pred->succ_end();
+         SI != SE; ++SI) {
+        MachineBasicBlock *Succ = *SI;
+        auto SuccProb = MBPI->getEdgeProbability(Pred, SI);
         BlockChain *SuccChain = BlockToChain[Succ];
         // Check if Succ can be placed after Pred.
         // Succ should not be in any chain, or it is the head of some chain.

This might give acceptable compile time.
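
For context on the O(n^2) claim: the block-pointer overload of getEdgeProbability has to locate the successor first, roughly as in the sketch below (an assumption about its behavior written from memory, not code copied from this review):

#include "llvm/ADT/STLExtras.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineBranchProbabilityInfo.h"
using namespace llvm;

// Sketch: the pointer-based query effectively scans Src's successor list to
// find Dst before it can look up the probability, so one call is
// O(#successors) and querying every successor of Src, as in the loop being
// patched, is O(#successors^2). The iterator-based overload already encodes
// the edge's position, which is what the suggested diff relies on.
static BranchProbability
edgeProbabilityViaPointer(const MachineBranchProbabilityInfo &MBPI,
                          const MachineBasicBlock *Src,
                          const MachineBasicBlock *Dst) {
  return MBPI.getEdgeProbability(Src, find(Src->successors(), Dst));
}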

@lkail @MatzeB, any suggestions on the parameter value of 128?

From the perf data pasted by @alexfh in https://reviews.llvm.org/D106056, the compile-time regression occurs in MachineBlockPlacement when the CFG becomes complex. I think we should fix the issue there, i.e., fix the O(n^2) iteration.

It's not only the MachineBlockPlacement pass; we saw the Eliminate PHI and Control Flow Optimizer passes also cost much more time. With the IR file I attached above, the pass timings show:

6.1973 ( 46.3%)   0.0000 (  0.0%)   6.1973 ( 45.8%)   6.2801 ( 45.8%)  Eliminate PHI nodes for register allocation
2.3774 ( 17.7%)   0.0000 (  0.0%)   2.3774 ( 17.6%)   2.4110 ( 17.6%)  Control Flow Optimizer
0.9603 (  7.2%)   0.0000 (  0.0%)   0.9603 (  7.1%)   0.9740 (  7.1%)  Branch Probability Basic Block Placement
0.4236 (  3.2%)   0.0680 ( 50.6%)   0.4916 (  3.6%)   0.4978 (  3.6%)  Early Tail Duplication
0.4374 (  3.3%)   0.0000 (  0.0%)   0.4374 (  3.2%)   0.4455 (  3.2%)  Machine code sinking

Also, I have tested this change, and it does not work. The time is not consumed by the find in getEdgeProbability.

lkail added a comment.Oct 11 2021, 3:29 AM

I agree we should add a limitation for such CFG patterns. Otherwise, a quadratic number of edges is added to the CFG, which hurts compile time, especially when there are 1K+ cases. The setting of tail-dup-jmptable-loop-size might be related to the BTB or BTAC, of which I'm not an expert. But for now, we don't want to regress protobuf's compile time; the value of 128 needs @alexfh's confirmation.

llvm/lib/CodeGen/TailDuplicator.cpp
560

nit: Rename A to MBB or TailBB.

572

Please add a comment to explain why we don't tail-duplicate this CFG pattern.

llvm/test/CodeGen/X86/tail-dup-multiple-latch-loop.ll
1

This test should be pre-committed to show the diff.

lkail edited reviewers, added: fhahn; removed: Florian.
junparser updated this revision to Diff 378860.Oct 11 2021, 7:59 PM

address comments.

junparser added inline comments.Oct 11 2021, 7:59 PM
llvm/test/CodeGen/X86/tail-dup-multiple-latch-loop.ll
1

This has been pre-committed.

I agree we should add a limitation for such CFG patterns. Otherwise, a quadratic number of edges is added to the CFG, which hurts compile time, especially when there are 1K+ cases. The setting of tail-dup-jmptable-loop-size might be related to the BTB or BTAC, of which I'm not an expert. But for now, we don't want to regress protobuf's compile time; the value of 128 needs @alexfh's confirmation.

@alexfh, would you give this patch a try?

Hi @lkail, since the value of 128 fixes the test case, can we land this first and then change the value later if we have to? Is that OK?

lkail accepted this revision.Oct 28 2021, 9:15 PM

LGTM, let's land it first to unblock https://reviews.llvm.org/D106056. But one thing to note: the number of Python opcodes is larger than 128. If this patch affects Python's perf, I think we should seek another adequate default value.

This revision is now accepted and ready to land.Oct 28 2021, 9:15 PM

LGTM, let's land it first to unblock https://reviews.llvm.org/D106056. But one thing to note: the number of Python opcodes is larger than 128. If this patch affects Python's perf, I think we should seek another adequate default value.

Do we have any benchmarks for Python? I think I can run a perf test with them before landing this.

lkail added a comment.Oct 31 2021, 8:24 PM

Do we have any benchmarks for Python? I think I can run a perf test with them before landing this.

We don't have it in test-suite yet. In test-suite, we have SingleSource/Benchmarks/Misc/evalloop.c and MultiSource/Benchmarks/TSVC for computed-goto.

Do we have any benchmarks for Python? I think I can run a perf test with them before landing this.

We don't have it in test-suite yet. In test-suite, we have SingleSource/Benchmarks/Misc/evalloop.c and MultiSource/Benchmarks/TSVC for computed-goto.

I've tested Python under pyperformance with USE_COMPUTED_GOTOS disabled, and it shows no performance difference. So let's keep 128 as the default value for now.

This revision was landed with ongoing or failed builds.Nov 1 2021, 12:33 AM
This revision was automatically updated to reflect the committed changes.
lkail added a comment.EditedNov 22 2021, 5:49 PM

As reported in https://reviews.llvm.org/rGc93f93b2e3f28997f794265089fb8138dd5b5f13, we should implement a more general way to avoid adding a quadratic number of edges to CFGs.

As reported in https://reviews.llvm.org/rGc93f93b2e3f28997f794265089fb8138dd5b5f13, we should implement a more general way to avoid adding a quadratic number of edges to CFGs.

We may need loop info or the dominator tree to check the loop header and latches here. I'll revert this as well.