This is an archive of the discontinued LLVM Phabricator instance.

CodeGen: BlockPlacement: Increase tail duplication size for O3.
Closed, Public

Authored by iteratee on Apr 20 2017, 4:28 PM.


Details

Reviewers

Summary

At O3 we are more willing to increase size if we believe it will improve
performance. The current threshold for tail-duplication of 2 instructions is
conservative, and can be relaxed at O3.

Benchmark results:
llvm test-suite:
6% improvement in aha, due to duplication of the loop latch.
3% improvement in hexxagon, for similar reasons.

2% slowdown in lpbench. It seems related to this change, but I couldn't completely diagnose it.

Internal Google benchmark:
4% improvement on internal Google protocol buffer serialization
benchmarks.

Diff Detail

Event Timeline

iteratee created this revision. Apr 20 2017, 4:28 PM

Herald added a subscriber: nemanjai. Apr 20 2017, 4:28 PM

davidxl added inline comments. Apr 20 2017, 4:35 PM

lib/CodeGen/MachineBlockPlacement.cpp
2657	Is it better to have two parameters: TailDupThreshold and TailDupAggressiveThreshold? The latter can be used for O3.

Made the aggressive threshold an option.

iteratee marked an inline comment as done. May 2 2017, 6:01 PM

davidxl added inline comments. May 3 2017, 9:12 AM

lib/CodeGen/MachineBlockPlacement.cpp
2662	I think when the aggressive threshold is also explicitly specified, it should take precedence even at O2. Basically this is the order: (1) explicit aggressive threshold, (2) explicit regular threshold, (3) implicit aggressive at O3 and implicit regular at O2.

iteratee added inline comments. May 4 2017, 3:44 PM

lib/CodeGen/MachineBlockPlacement.cpp
2662	At O3 I think it should be: (1) explicit aggressive threshold, (2) explicit regular threshold, (3) implicit aggressive threshold. At O2 I think it should be: (1) explicit regular threshold, (2) explicit aggressive threshold, (3) implicit regular threshold. For instance, someone may want to adjust both flags globally and compile individual modules at O2 or O3.
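
To make the difference between the two proposed orders concrete, here is a minimal C++ sketch. It is not code from the patch: the function names `pickAggressiveFirst` and `pickPerLevel`, the use of `std::optional` to model "explicitly set on the command line", and the `IsO3` flag are assumptions made for illustration. Only the defaults 2 and 3 are taken from the `cl::init` values in the diff.

```cpp
#include <cassert>
#include <optional>

// Defaults taken from the cl::init values in the diff.
constexpr unsigned DefaultRegular = 2;
constexpr unsigned DefaultAggressive = 3;

// Hypothetical model of the first proposal: an explicit aggressive threshold
// wins at every level, then an explicit regular threshold, then the default
// for the current optimization level.
unsigned pickAggressiveFirst(std::optional<unsigned> Regular,
                             std::optional<unsigned> Aggressive, bool IsO3) {
  if (Aggressive)
    return *Aggressive;
  if (Regular)
    return *Regular;
  return IsO3 ? DefaultAggressive : DefaultRegular;
}

// Hypothetical model of the reply: the flag that matches the optimization
// level is consulted first, then the other explicit flag, then the default.
unsigned pickPerLevel(std::optional<unsigned> Regular,
                      std::optional<unsigned> Aggressive, bool IsO3) {
  if (IsO3) {
    if (Aggressive)
      return *Aggressive;
    if (Regular)
      return *Regular;
    return DefaultAggressive;
  }
  if (Regular)
    return *Regular;
  if (Aggressive)
    return *Aggressive;
  return DefaultRegular;
}

// Both flags set explicitly, module compiled below O3: the orders disagree.
void checkOrdersDiffer() {
  assert(pickAggressiveFirst(5u, 4u, /*IsO3=*/false) == 4u);
  assert(pickPerLevel(5u, 4u, /*IsO3=*/false) == 5u);
}
```

Under this model the two orders give different results only when both flags are set explicitly and the module is compiled below O3, which is the "adjust both flags globally" case mentioned in the reply.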

davidxl added inline comments. May 12 2017, 9:32 AM

lib/CodeGen/MachineBlockPlacement.cpp
2662	Do you have an updated patch with the proposed logic?

No, I wanted to get agreement before I re-wrote it. I can do it if you'd like to see it before deciding.

If either threshold is the only one explicitly set, use that threshold.
Otherwise, if both or neither are set, use the aggressive threshold at O3 and the regular threshold below O3.
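
The rule above can be sketched as a standalone function. This is a simplified model rather than the code in the diff: the `Threshold` struct is a hypothetical stand-in for a `cl::opt<unsigned>` together with `getNumOccurrences() != 0`, and `IsO3` stands in for `PassConfig->getOptLevel() >= CodeGenOpt::Aggressive`.

```cpp
#include <cassert>

// Hypothetical stand-in for a command-line option: its value plus whether
// the user set it explicitly.
struct Threshold {
  unsigned Value;
  bool Explicit;
};

// Sketch of the selection rule: start from the regular threshold, switch to
// the aggressive one if it is the only explicit flag, and at O3 switch to
// the aggressive one unless the regular flag is the only explicit flag.
unsigned selectTailDupSize(Threshold Regular, Threshold Aggressive,
                           bool IsO3) {
  unsigned Size = Regular.Value;
  if (Aggressive.Explicit && !Regular.Explicit)
    Size = Aggressive.Value;
  if (IsO3 && (!Regular.Explicit || Aggressive.Explicit))
    Size = Aggressive.Value;
  return Size;
}

// A few cases using the defaults from the diff (2 regular, 3 aggressive).
void checkSelection() {
  // Neither flag set: regular below O3, aggressive at O3.
  assert(selectTailDupSize({2, false}, {3, false}, false) == 2);
  assert(selectTailDupSize({2, false}, {3, false}, true) == 3);
  // Only the regular flag set: it is used even at O3.
  assert(selectTailDupSize({5, true}, {3, false}, true) == 5);
  // Only the aggressive flag set: it is used even below O3.
  assert(selectTailDupSize({2, false}, {4, true}, false) == 4);
}
```

In this model, setting both flags behaves like setting neither: the optimization level alone decides which of the two values is used.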

lgtm

This revision is now accepted and ready to land. May 12 2017, 4:17 PM

Committed in rL303084

Revision Contents

Path	Size
lib/CodeGen/MachineBlockPlacement.cpp	30 lines
test/CodeGen/PowerPC/tail-dup-layout.ll	97 lines
test/CodeGen/X86/sse1.ll	16 lines

Diff 98850

lib/CodeGen/MachineBlockPlacement.cpp

[127 lines not shown]
// Heuristic for tail duplication.		// Heuristic for tail duplication.
static cl::opt<unsigned> TailDupPlacementThreshold(		static cl::opt<unsigned> TailDupPlacementThreshold(
"tail-dup-placement-threshold",		"tail-dup-placement-threshold",
cl::desc("Instruction cutoff for tail duplication during layout. "		cl::desc("Instruction cutoff for tail duplication during layout. "
"Tail merging during layout is forced to have a threshold "		"Tail merging during layout is forced to have a threshold "
"that won't conflict."), cl::init(2),		"that won't conflict."), cl::init(2),
cl::Hidden);		cl::Hidden);

		// Heuristic for aggressive tail duplication.
		static cl::opt<unsigned> TailDupPlacementAggressiveThreshold(
		"tail-dup-placement-aggressive-threshold",
		cl::desc("Instruction cutoff for aggressive tail duplication during "
		"layout. Used at -O3. Tail merging during layout is forced to "
		"have a threshold that won't conflict."), cl::init(3),
		cl::Hidden);

// Heuristic for tail duplication.		// Heuristic for tail duplication.
static cl::opt<unsigned> TailDupPlacementPenalty(		static cl::opt<unsigned> TailDupPlacementPenalty(
"tail-dup-placement-penalty",		"tail-dup-placement-penalty",
cl::desc("Cost penalty for blocks that can avoid breaking CFG by copying. "		cl::desc("Cost penalty for blocks that can avoid breaking CFG by copying. "
"Copying can increase fallthrough, but it also increases icache "		"Copying can increase fallthrough, but it also increases icache "
"pressure. This parameter controls the penalty to account for that. "		"pressure. This parameter controls the penalty to account for that. "
"Percent as integer."),		"Percent as integer."),
cl::init(2),		cl::init(2),
[2,497 lines not shown]	bool MachineBlockPlacement::runOnMachineFunction(MachineFunction &MF) {

// Initialize PreferredLoopExit to nullptr here since it may never be set if		// Initialize PreferredLoopExit to nullptr here since it may never be set if
// there are no MachineLoops.		// there are no MachineLoops.
PreferredLoopExit = nullptr;		PreferredLoopExit = nullptr;

assert(BlockToChain.empty());		assert(BlockToChain.empty());
assert(ComputedEdges.empty());		assert(ComputedEdges.empty());

		unsigned TailDupSize = TailDupPlacementThreshold;
		// If only the aggressive threshold is explicitly set, use it.
		if (TailDupPlacementAggressiveThreshold.getNumOccurrences() != 0 &&
		TailDupPlacementThreshold.getNumOccurrences() == 0)
		TailDupSize = TailDupPlacementAggressiveThreshold;

		TargetPassConfig *PassConfig = &getAnalysis<TargetPassConfig>();
		// For aggressive optimization, we can adjust some thresholds to be less
		// conservative.
		if (PassConfig->getOptLevel() >= CodeGenOpt::Aggressive) {
		// At O3 we should be more willing to copy blocks for tail duplication. This
		// increases size pressure, so we only do it at O3
		// Do this unless only the regular threshold is explicitly set.
		if (TailDupPlacementThreshold.getNumOccurrences() == 0 ||
		TailDupPlacementAggressiveThreshold.getNumOccurrences() != 0)
		TailDupSize = TailDupPlacementAggressiveThreshold;
		}

if (TailDupPlacement) {		if (TailDupPlacement) {
MPDT = &getAnalysis<MachinePostDominatorTree>();		MPDT = &getAnalysis<MachinePostDominatorTree>();
unsigned TailDupSize = TailDupPlacementThreshold;
if (MF.getFunction()->optForSize())		if (MF.getFunction()->optForSize())
TailDupSize = 1;		TailDupSize = 1;
TailDup.initMF(MF, MBPI, /* LayoutMode */ true, TailDupSize);		TailDup.initMF(MF, MBPI, /* LayoutMode */ true, TailDupSize);
precomputeTriangleChains();		precomputeTriangleChains();
}		}

buildCFGChains();		buildCFGChains();

// Changing the layout can create new tail merging opportunities.		// Changing the layout can create new tail merging opportunities.
TargetPassConfig *PassConfig = &getAnalysis<TargetPassConfig>();
// TailMerge can create jump into if branches that make CFG irreducible for		// TailMerge can create jump into if branches that make CFG irreducible for
// HW that requires structured CFG.		// HW that requires structured CFG.
bool EnableTailMerge = !MF.getTarget().requiresStructuredCFG() &&		bool EnableTailMerge = !MF.getTarget().requiresStructuredCFG() &&
PassConfig->getEnableTailMerge() &&		PassConfig->getEnableTailMerge() &&
BranchFoldPlacement;		BranchFoldPlacement;
// No tail merging opportunities if the block number is less than four.		// No tail merging opportunities if the block number is less than four.
if (MF.size() > 3 && EnableTailMerge) {		if (MF.size() > 3 && EnableTailMerge) {
unsigned TailMergeSize = TailDupPlacementThreshold + 1;		unsigned TailMergeSize = TailDupSize + 1;
BranchFolder BF(/*EnableTailMerge=*/true, /*CommonHoist=*/false, *MBFI,		BranchFolder BF(/*EnableTailMerge=*/true, /*CommonHoist=*/false, *MBFI,
*MBPI, TailMergeSize);		*MBPI, TailMergeSize);

if (BF.OptimizeFunction(MF, TII, MF.getSubtarget().getRegisterInfo(),		if (BF.OptimizeFunction(MF, TII, MF.getSubtarget().getRegisterInfo(),
getAnalysisIfAvailable<MachineModuleInfo>(), MLI,		getAnalysisIfAvailable<MachineModuleInfo>(), MLI,
/*AfterBlockPlacement=*/true)) {		/*AfterBlockPlacement=*/true)) {
// Redo the layout if tail merging creates/removes/moves blocks.		// Redo the layout if tail merging creates/removes/moves blocks.
BlockToChain.clear();		BlockToChain.clear();
[109 lines not shown]

test/CodeGen/PowerPC/tail-dup-layout.ll

; RUN: llc -O2 < %s | FileCheck %s		; RUN: llc -O2 -o - %s | FileCheck --check-prefix=CHECK --check-prefix=CHECK-O2 %s
		; RUN: llc -O3 -o - %s | FileCheck --check-prefix=CHECK --check-prefix=CHECK-O3 %s
target datalayout = "e-m:e-i64:64-n32:64"		target datalayout = "e-m:e-i64:64-n32:64"
target triple = "powerpc64le-grtev4-linux-gnu"		target triple = "powerpc64le-grtev4-linux-gnu"

; Intended layout:		; Intended layout:
; The chain-based outlining produces the layout		; The chain-based outlining produces the layout
; test1		; test1
; test2		; test2
; test3		; test3
[84 lines not shown]	exit:
ret void		ret void
}		}

; Intended layout:		; Intended layout:
; The chain-of-triangles based duplicating produces the layout		; The chain-of-triangles based duplicating produces the layout
; test1		; test1
; test2		; test2
; test3		; test3
; test4
; optional1		; optional1
; optional2		; optional2
; optional3		; optional3
; optional4
; exit		; exit
; even for 50/50 branches.		; even for 50/50 branches.
; Tail duplication puts test n+1 at the end of optional n		; Tail duplication puts test n+1 at the end of optional n
; so optional1 includes a copy of test2 at the end, and branches		; so optional1 includes a copy of test2 at the end, and branches
; to test3 (at the top) or falls through to optional 2.		; to test3 (at the top) or falls through to optional 2.
; The CHECK statements check for the whole string of tests		; The CHECK statements check for the whole string of tests
; and then check that the correct test has been duplicated into the end of		; and then check that the correct test has been duplicated into the end of
; the optional blocks and that the optional blocks are in the correct order.		; the optional blocks and that the optional blocks are in the correct order.
[43 lines not shown]
optional3:		optional3:
call void @c()		call void @c()
br label %exit		br label %exit
exit:		exit:
ret void		ret void
}		}

; Intended layout:		; Intended layout:
		; The chain-of-triangles based duplicating produces the layout when 3
		; instructions are allowed for tail-duplication.
		; test1
		; test2
		; test3
		; optional1
		; optional2
		; optional3
		; exit
		;
		; Otherwise it produces the layout:
		; test1
		; optional1
		; test2
		; optional2
		; test3
		; optional3
		; exit

		;CHECK-LABEL: straight_test_3_instr_test:
		; test1 may have been merged with entry
		;CHECK: mr [[TAGREG:[0-9]+]], 3
		;CHECK: clrlwi {{[0-9]+}}, [[TAGREG]], 30
		;CHECK-NEXT: cmplwi {{[0-9]+}}, 2

		;CHECK-O3-NEXT: bne 0, .[[OPT1LABEL:[_0-9A-Za-z]+]]
		;CHECK-O3-NEXT: # %test2
		;CHECK-O3-NEXT: rlwinm {{[0-9]+}}, [[TAGREG]], 0, 28, 29
		;CHECK-O3-NEXT: cmplwi {{[0-9]+}}, 8
		;CHECK-O3-NEXT: bne 0, .[[OPT2LABEL:[_0-9A-Za-z]+]]
		;CHECK-O3-NEXT: .[[TEST3LABEL:[_0-9A-Za-z]+]]: # %test3
		;CHECK-O3-NEXT: rlwinm {{[0-9]+}}, [[TAGREG]], 0, 26, 27
		;CHECK-O3-NEXT: cmplwi {{[0-9]+}}, 32
		;CHECK-O3-NEXT: bne 0, .[[OPT3LABEL:[_0-9A-Za-z]+]]
		;CHECK-O3-NEXT: .[[EXITLABEL:[_0-9A-Za-z]+]]: # %exit
		;CHECK-O3: blr
		;CHECK-O3-NEXT: .[[OPT1LABEL]]:
		;CHECK-O3: rlwinm {{[0-9]+}}, [[TAGREG]], 0, 28, 29
		;CHECK-O3-NEXT: cmplwi {{[0-9]+}}, 8
		;CHECK-O3-NEXT: beq 0, .[[TEST3LABEL]]
		;CHECK-O3-NEXT: .[[OPT2LABEL]]:
		;CHECK-O3: rlwinm {{[0-9]+}}, [[TAGREG]], 0, 26, 27
		;CHECK-O3-NEXT: cmplwi {{[0-9]+}}, 32
		;CHECK-O3-NEXT: beq 0, .[[EXITLABEL]]
		;CHECK-O3-NEXT: .[[OPT3LABEL]]:
		;CHECK-O3: b .[[EXITLABEL]]

		;CHECK-O2-NEXT: beq 0, .[[TEST2LABEL:[_0-9A-Za-z]+]]
		;CHECK-O2-NEXT: # %optional1
		;CHECK-O2: .[[TEST2LABEL]]: # %test2
		;CHECK-O2-NEXT: rlwinm {{[0-9]+}}, [[TAGREG]], 0, 28, 29
		;CHECK-O2-NEXT: cmplwi {{[0-9]+}}, 8
		;CHECK-O2-NEXT: beq 0, .[[TEST3LABEL:[_0-9A-Za-z]+]]
		;CHECK-O2-NEXT: # %optional2
		;CHECK-O2: .[[TEST3LABEL]]: # %test3
		;CHECK-O2-NEXT: rlwinm {{[0-9]+}}, [[TAGREG]], 0, 26, 27
		;CHECK-O2-NEXT: cmplwi {{[0-9]+}}, 32
		;CHECK-O2-NEXT: beq 0, .[[EXITLABEL:[_0-9A-Za-z]+]]
		;CHECK-O2-NEXT: # %optional3
		;CHECK-O2: .[[EXITLABEL:[_0-9A-Za-z]+]]: # %exit
		;CHECK-O2: blr


		define void @straight_test_3_instr_test(i32 %tag) {
		entry:
		br label %test1
		test1:
		%tagbit1 = and i32 %tag, 3
		%tagbit1eq0 = icmp eq i32 %tagbit1, 2
		br i1 %tagbit1eq0, label %test2, label %optional1, !prof !2
		optional1:
		call void @a()
		br label %test2
		test2:
		%tagbit2 = and i32 %tag, 12
		%tagbit2eq0 = icmp eq i32 %tagbit2, 8
		br i1 %tagbit2eq0, label %test3, label %optional2, !prof !2
		optional2:
		call void @b()
		br label %test3
		test3:
		%tagbit3 = and i32 %tag, 48
		%tagbit3eq0 = icmp eq i32 %tagbit3, 32
		br i1 %tagbit3eq0, label %exit, label %optional3, !prof !1
		optional3:
		call void @c()
		br label %exit
		exit:
		ret void
		}

		; Intended layout:
; The chain-based outlining produces the layout		; The chain-based outlining produces the layout
; entry		; entry
; --- Begin loop ---		; --- Begin loop ---
; for.latch		; for.latch
; for.check		; for.check
; test1		; test1
; test2		; test2
; test3		; test3
[369 lines not shown]

test/CodeGen/X86/sse1.ll

	[60 lines not shown]
	; X32-NEXT: cmpl $0, {{[0-9]+}}(%esp)			; X32-NEXT: cmpl $0, {{[0-9]+}}(%esp)
	; X32-NEXT: jne .LBB1_5			; X32-NEXT: jne .LBB1_5
	; X32-NEXT: .LBB1_4:			; X32-NEXT: .LBB1_4:
	; X32-NEXT: movss {{.*#+}} xmm2 = mem[0],zero,zero,zero			; X32-NEXT: movss {{.*#+}} xmm2 = mem[0],zero,zero,zero
	; X32-NEXT: cmpl $0, {{[0-9]+}}(%esp)			; X32-NEXT: cmpl $0, {{[0-9]+}}(%esp)
	; X32-NEXT: jne .LBB1_8			; X32-NEXT: jne .LBB1_8
	; X32-NEXT: .LBB1_7:			; X32-NEXT: .LBB1_7:
	; X32-NEXT: movss {{.*#+}} xmm3 = mem[0],zero,zero,zero			; X32-NEXT: movss {{.*#+}} xmm3 = mem[0],zero,zero,zero
	; X32-NEXT: jmp .LBB1_9			; X32-NEXT: cmpl $0, {{[0-9]+}}(%esp)
				; X32-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
				; X32-NEXT: je .LBB1_10
				; X32-NEXT: jmp .LBB1_11
	; X32-NEXT: .LBB1_1:			; X32-NEXT: .LBB1_1:
	; X32-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero			; X32-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
	; X32-NEXT: cmpl $0, {{[0-9]+}}(%esp)			; X32-NEXT: cmpl $0, {{[0-9]+}}(%esp)
	; X32-NEXT: je .LBB1_4			; X32-NEXT: je .LBB1_4
	; X32-NEXT: .LBB1_5: # %entry			; X32-NEXT: .LBB1_5: # %entry
	; X32-NEXT: xorps %xmm2, %xmm2			; X32-NEXT: xorps %xmm2, %xmm2
	; X32-NEXT: cmpl $0, {{[0-9]+}}(%esp)			; X32-NEXT: cmpl $0, {{[0-9]+}}(%esp)
	; X32-NEXT: je .LBB1_7			; X32-NEXT: je .LBB1_7
	; X32-NEXT: .LBB1_8: # %entry			; X32-NEXT: .LBB1_8: # %entry
	; X32-NEXT: xorps %xmm3, %xmm3			; X32-NEXT: xorps %xmm3, %xmm3
	; X32-NEXT: .LBB1_9: # %entry
	; X32-NEXT: cmpl $0, {{[0-9]+}}(%esp)			; X32-NEXT: cmpl $0, {{[0-9]+}}(%esp)
	; X32-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]			; X32-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
	; X32-NEXT: jne .LBB1_11			; X32-NEXT: jne .LBB1_11
	; X32-NEXT: # BB#10:			; X32-NEXT: .LBB1_10:
	; X32-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero			; X32-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
	; X32-NEXT: .LBB1_11: # %entry			; X32-NEXT: .LBB1_11: # %entry
	; X32-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]			; X32-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
	; X32-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]			; X32-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
	; X32-NEXT: retl			; X32-NEXT: retl
	;			;
	; X64-LABEL: vselect:			; X64-LABEL: vselect:
	; X64: # BB#0: # %entry			; X64: # BB#0: # %entry
	; X64-NEXT: testl %ecx, %ecx			; X64-NEXT: testl %ecx, %ecx
	; X64-NEXT: xorps %xmm0, %xmm0			; X64-NEXT: xorps %xmm0, %xmm0
	; X64-NEXT: je .LBB1_1			; X64-NEXT: je .LBB1_1
	; X64-NEXT: # BB#2: # %entry			; X64-NEXT: # BB#2: # %entry
	; X64-NEXT: xorps %xmm1, %xmm1			; X64-NEXT: xorps %xmm1, %xmm1
	; X64-NEXT: testl %edx, %edx			; X64-NEXT: testl %edx, %edx
	; X64-NEXT: jne .LBB1_5			; X64-NEXT: jne .LBB1_5
	; X64-NEXT: .LBB1_4:			; X64-NEXT: .LBB1_4:
	; X64-NEXT: movss {{.*#+}} xmm2 = mem[0],zero,zero,zero			; X64-NEXT: movss {{.*#+}} xmm2 = mem[0],zero,zero,zero
	; X64-NEXT: testl %r8d, %r8d			; X64-NEXT: testl %r8d, %r8d
	; X64-NEXT: jne .LBB1_8			; X64-NEXT: jne .LBB1_8
	; X64-NEXT: .LBB1_7:			; X64-NEXT: .LBB1_7:
	; X64-NEXT: movss {{.*#+}} xmm3 = mem[0],zero,zero,zero			; X64-NEXT: movss {{.*#+}} xmm3 = mem[0],zero,zero,zero
	; X64-NEXT: jmp .LBB1_9			; X64-NEXT: testl %esi, %esi
				; X64-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
				; X64-NEXT: je .LBB1_10
				; X64-NEXT: jmp .LBB1_11
	; X64-NEXT: .LBB1_1:			; X64-NEXT: .LBB1_1:
	; X64-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero			; X64-NEXT: movss {{.*#+}} xmm1 = mem[0],zero,zero,zero
	; X64-NEXT: testl %edx, %edx			; X64-NEXT: testl %edx, %edx
	; X64-NEXT: je .LBB1_4			; X64-NEXT: je .LBB1_4
	; X64-NEXT: .LBB1_5: # %entry			; X64-NEXT: .LBB1_5: # %entry
	; X64-NEXT: xorps %xmm2, %xmm2			; X64-NEXT: xorps %xmm2, %xmm2
	; X64-NEXT: testl %r8d, %r8d			; X64-NEXT: testl %r8d, %r8d
	; X64-NEXT: je .LBB1_7			; X64-NEXT: je .LBB1_7
	; X64-NEXT: .LBB1_8: # %entry			; X64-NEXT: .LBB1_8: # %entry
	; X64-NEXT: xorps %xmm3, %xmm3			; X64-NEXT: xorps %xmm3, %xmm3
	; X64-NEXT: .LBB1_9: # %entry
	; X64-NEXT: testl %esi, %esi			; X64-NEXT: testl %esi, %esi
	; X64-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]			; X64-NEXT: unpcklps {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
	; X64-NEXT: jne .LBB1_11			; X64-NEXT: jne .LBB1_11
	; X64-NEXT: # BB#10:			; X64-NEXT: .LBB1_10:
	; X64-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero			; X64-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero
	; X64-NEXT: .LBB1_11: # %entry			; X64-NEXT: .LBB1_11: # %entry
	; X64-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]			; X64-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
	; X64-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]			; X64-NEXT: unpcklps {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]
	; X64-NEXT: retq			; X64-NEXT: retq
	entry:			entry:
	%a1 = icmp eq <4 x i32> %q, zeroinitializer			%a1 = icmp eq <4 x i32> %q, zeroinitializer
	%a14 = select <4 x i1> %a1, <4 x float> <float 1.000000e+00, float 2.000000e+00, float 3.000000e+00, float 4.000000e+0> , <4 x float> zeroinitializer			%a14 = select <4 x i1> %a1, <4 x float> <float 1.000000e+00, float 2.000000e+00, float 3.000000e+00, float 4.000000e+0> , <4 x float> zeroinitializer
	[217 lines not shown]