This is an archive of the discontinued LLVM Phabricator instance.

Differential D76623

[VectorCombine] try to form a better extractelement
ClosedPublic

Authored by spatel on Mar 23 2020, 9:29 AM.

Details

Reviewers

lebedev.ri
RKSimon
jgorbe

Commits

rGce97ce3a5d72: [VectorCombine] try to form a better extractelement

Summary

Extracting to the same index that we are going to insert back into allows forming select ("blend") shuffles and enables further transforms.

Admittedly, this is a quick-fix for a more general problem that I'm hoping to solve by adding transforms for patterns that start with an insertelement.
But this might resolve some regressions known to be caused by the extract-extract transform (although I have not gotten more details on those yet).
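
The point about matching the extract index to the insert index can be illustrated with a small standalone model. This is only a sketch under stated assumptions: it uses plain 4-lane arrays rather than LLVM IR, and the names Vec4, extractThenInsert, and blend are hypothetical helpers, not part of VectorCombine. It shows that extract-then-insert on the same lane is expressible as a lane-wise select ("blend"), while a mismatched lane is not.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a <4 x float> value.
using Vec4 = std::array<float, 4>;

// Scalar form: extract lane ExtLane of V, then insert it into lane InsLane
// of Base.
Vec4 extractThenInsert(const Vec4 &V, std::size_t ExtLane, Vec4 Base,
                       std::size_t InsLane) {
  Base[InsLane] = V[ExtLane];
  return Base;
}

// Select ("blend") shuffle: result lane i comes from lane i of either Base
// or V. No element ever changes lanes.
Vec4 blend(const Vec4 &Base, const Vec4 &V, const std::array<bool, 4> &TakeV) {
  Vec4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = TakeV[I] ? V[I] : Base[I];
  return R;
}
```

With ExtLane == InsLane == 3, extractThenInsert agrees with a blend whose mask takes only lane 3 from V. With ExtLane == 2 and InsLane == 3, the element has to cross lanes, so no blend mask reproduces it in general.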

In the motivating case from PR34724:
https://bugs.llvm.org/show_bug.cgi?id=34724

The combination of subsequent instcombine and codegen transforms gets us this improvement:

vmovshdup	%xmm0, %xmm2    ## xmm2 = xmm0[1,1,3,3]
vhaddps	%xmm1, %xmm1, %xmm4
vmovshdup	%xmm1, %xmm3    ## xmm3 = xmm1[1,1,3,3]
vaddps	%xmm0, %xmm2, %xmm0
vaddps	%xmm1, %xmm3, %xmm1
vshufps	$200, %xmm4, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm4[0,3]
vinsertps	$177, %xmm1, %xmm0, %xmm0 ## xmm0 = zero,xmm0[1,2],xmm1[2]

-->

vmovshdup	%xmm0, %xmm2    ## xmm2 = xmm0[1,1,3,3]
vhaddps	%xmm1, %xmm1, %xmm1
vaddps	%xmm0, %xmm2, %xmm0
vshufps	$200, %xmm1, %xmm0, %xmm0 ## xmm0 = xmm0[0,2],xmm1[0,3]

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision. Mar 23 2020, 9:29 AM

Herald added subscribers: hiraditya, mcrosier. Mar 23 2020, 9:29 AM

While this looks good in principle, i'm starting to have second thoughts about the general design here..

Since this doesn't affect the calculated cost, i wonder if it would be better to go more generic,
and instead implement some kind of lane reordering?

In D76623#1944608, @lebedev.ri wrote:

While this looks good in principle, i'm starting to have second thoughts about the general design here..

Since this doesn't affect the calculated cost, i wonder if it would be better to go more generic,
and instead implement some kind of lane reordering?

Hmm...I'm not seeing what a more generic solution would look like. Do you have an example/suggestion?
If we implement the larger match that starts from an insertelement, I'm imagining this becomes something like InstCombine's canEvaluateZExtd() or AggressiveInstCombine's TruncInstCombine. So we have:
insertelement (some string of scalar ops that are seeded by extractelement and/or constants) --> series of vector ops with extract/insert removed
We could build that up in steps, so start with the simplest case of: insert undef, (binop (extract X), C) --> vector binop X, vecC
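
The "simplest case" named above can be sketched as a standalone model. Assumptions, all mine and not from the patch: the extract and insert use the same lane, the binop is a float add, undef lanes are modelled as std::nullopt, and the helper names scalarForm, vectorForm, and refines are hypothetical. The vector form is acceptable if it agrees with the scalar form on every lane the scalar form defines.

```cpp
#include <array>
#include <cstddef>
#include <optional>

using Vec4 = std::array<float, 4>;
// std::nullopt models an undef lane.
using MaybeVec4 = std::array<std::optional<float>, 4>;

// Scalar form: insert undef, (fadd (extract X, Idx), C), Idx
MaybeVec4 scalarForm(const Vec4 &X, float C, std::size_t Idx) {
  MaybeVec4 R{}; // all lanes start as undef
  R[Idx] = X[Idx] + C;
  return R;
}

// Vector form: fadd X, VecC, where VecC holds C in lane Idx. The other lanes
// of VecC are don't-care; zero is used here only to keep the model simple.
Vec4 vectorForm(const Vec4 &X, float C, std::size_t Idx) {
  Vec4 VecC{};
  VecC[Idx] = C;
  Vec4 R{};
  for (std::size_t I = 0; I < 4; ++I)
    R[I] = X[I] + VecC[I];
  return R;
}

// True if Vector matches Scalar on every lane that Scalar defines.
bool refines(const MaybeVec4 &Scalar, const Vec4 &Vector) {
  for (std::size_t I = 0; I < 4; ++I)
    if (Scalar[I] && *Scalar[I] != Vector[I])
      return false;
  return true;
}
```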

Ping.

LGTM

This revision is now accepted and ready to land. Apr 2 2020, 2:15 PM

LG for now.

In D76623#1944788, @spatel wrote:

Hmm...I'm not seeing what a more generic solution would look like. Do you have an example/suggestion?
If we implement the larger match that starts from an insertelement, I'm imagining this becomes something like InstCombine's canEvaluateZExtd() or AggressiveInstCombine's TruncInstCombine. So we have:
insertelement (some string of scalar ops that are seeded by extractelement and/or constants) --> series of vector ops with extract/insert removed
We could build that up in steps, so start with the simplest case of: insert undef, (binop (extract X), C) --> vector binop X, vecC

Yes, that is the general idea: if we have an insert of a one-use extract
(or all uses are inserts into the same lane?), try to make the extraction
come from the same lane. Though there is a second potential fold here:
start from the extractelement, and try to minimize its cost by moving it into lower lanes.

It does sound involved, i'm not sure if it will be worth it in the end.

Closed by commit rGce97ce3a5d72: [VectorCombine] try to form a better extractelement (authored by spatel). Apr 3 2020, 11:20 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Apr 3 2020, 11:20 AM

Revision Contents

Path                                                       Size
llvm/lib/Transforms/Vectorize/VectorCombine.cpp            25 lines
llvm/test/Transforms/VectorCombine/X86/extract-binop.ll    15 lines
llvm/test/Transforms/VectorCombine/X86/extract-cmp.ll      12 lines

Diff 254864

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

(46 lines not shown)
 /// Compare the relative costs of 2 extracts followed by scalar operation vs.
 /// vector operation(s) followed by extract. Return true if the existing
 /// instructions are cheaper than a vector alternative. Otherwise, return false
 /// and if one of the extracts should be transformed to a shufflevector, set
 /// \p ConvertToShuffle to that extract instruction.
 static bool isExtractExtractCheap(Instruction *Ext0, Instruction *Ext1,
                                   unsigned Opcode,
                                   const TargetTransformInfo &TTI,
-                                  Instruction *&ConvertToShuffle) {
+                                  Instruction *&ConvertToShuffle,
+                                  unsigned PreferredExtractIndex) {
   assert(isa<ConstantInt>(Ext0->getOperand(1)) &&
          isa<ConstantInt>(Ext1->getOperand(1)) &&
          "Expected constant extract indexes");
   Type *ScalarTy = Ext0->getType();
   Type *VecTy = Ext0->getOperand(0)->getType();
   int ScalarOpCost, VectorOpCost;

   // Get cost estimates for scalar and vector versions of the operation.
(62 lines not shown; context: if (Ext0Index == Ext1Index) {)
     // undefined except for 1 lane that is being translated to the remaining
     // extraction lane. Therefore, it is a splat shuffle. Ex:
     // ShufMask = { undef, undef, 0, undef }
     // TODO: The cost model has an option for a "broadcast" shuffle
     // (splat-from-element-0), but no option for a more general splat.
     NewCost +=
         TTI.getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, VecTy);

-    // The more expensive extract will be replaced by a shuffle. If the extracts
-    // have the same cost, replace the extract with the higher index.
+    // The more expensive extract will be replaced by a shuffle. If the costs
+    // are equal and there is a preferred extract index, shuffle the opposite
+    // operand. Otherwise, replace the extract with the higher index.
     if (Extract0Cost > Extract1Cost)
       ConvertToShuffle = Ext0;
     else if (Extract1Cost > Extract0Cost)
       ConvertToShuffle = Ext1;
+    else if (PreferredExtractIndex == Ext0Index)
+      ConvertToShuffle = Ext1;
+    else if (PreferredExtractIndex == Ext1Index)
+      ConvertToShuffle = Ext0;
     else
       ConvertToShuffle = Ext0Index > Ext1Index ? Ext0 : Ext1;
   }

   // Aggressively form a vector op if the cost is equal because the transform
   // may enable further optimization.
   // Codegen can reverse this transform (scalarize) if it was not profitable.
   return OldCost < NewCost;
(56 lines not shown; context: static bool foldExtractExtract(Instruction &I, const TargetTransformInfo &TTI) {)

   Value *V0, *V1;
   uint64_t C0, C1;
   if (!match(Ext0, m_ExtractElement(m_Value(V0), m_ConstantInt(C0))) ||
       !match(Ext1, m_ExtractElement(m_Value(V1), m_ConstantInt(C1))) ||
       V0->getType() != V1->getType())
     return false;

+  // If the scalar value 'I' is going to be re-inserted into a vector, then try
+  // to create an extract to that same element. The extract/insert can be
+  // reduced to a "select shuffle".
+  // TODO: If we add a larger pattern match that starts from an insert, this
+  // probably becomes unnecessary.
+  uint64_t InsertIndex = std::numeric_limits<uint64_t>::max();
+  if (I.hasOneUse())
+    match(I.user_back(), m_InsertElement(m_Value(), m_Value(),
+                                         m_ConstantInt(InsertIndex)));
+
   Instruction *ConvertToShuffle;
-  if (isExtractExtractCheap(Ext0, Ext1, I.getOpcode(), TTI, ConvertToShuffle))
+  if (isExtractExtractCheap(Ext0, Ext1, I.getOpcode(), TTI, ConvertToShuffle,
+                            InsertIndex))
     return false;

   if (ConvertToShuffle) {
     // The shuffle mask is undefined except for 1 lane that is being translated
     // to the cheap extraction lane. Example:
     // ShufMask = { 2, undef, undef, undef }
     uint64_t SplatIndex = ConvertToShuffle == Ext0 ? C0 : C1;
     uint64_t CheapExtIndex = ConvertToShuffle == Ext0 ? C1 : C0;
(153 lines not shown)
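
The tie-break added to isExtractExtractCheap() can be modelled outside LLVM as a pure function. This is a minimal sketch, not the patch itself: the enum Which, the function chooseExtractToShuffle, and the constant kNoPreferredIndex are hypothetical names introduced for illustration, and costs are passed in directly instead of being queried from TargetTransformInfo.

```cpp
#include <limits>

// Hypothetical result type: which of the two extracts becomes a shuffle.
enum class Which { Ext0, Ext1 };

// Hypothetical sentinel for "the result is not re-inserted at a known lane".
constexpr unsigned kNoPreferredIndex = std::numeric_limits<unsigned>::max();

// Mirrors the if/else chain in the diff above.
Which chooseExtractToShuffle(int Extract0Cost, int Extract1Cost,
                             unsigned Ext0Index, unsigned Ext1Index,
                             unsigned PreferredExtractIndex) {
  // The more expensive extract is always the one replaced by a shuffle.
  if (Extract0Cost > Extract1Cost)
    return Which::Ext0;
  if (Extract1Cost > Extract0Cost)
    return Which::Ext1;
  // Equal cost: keep the extract at the preferred lane, shuffle the other.
  if (PreferredExtractIndex == Ext0Index)
    return Which::Ext1;
  if (PreferredExtractIndex == Ext1Index)
    return Which::Ext0;
  // No usable preference: shuffle the extract with the higher index.
  return Ext0Index > Ext1Index ? Which::Ext0 : Which::Ext1;
}
```

For the ins_bo_ext_ext test (extracts at lanes 2 and 3, insert at lane 3, equal costs assumed), the model picks the lane-2 extract for shuffling, so the remaining extract is at lane 3. Without a preference it picks the lane-3 extract, which matches the old CHECK lines.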

llvm/test/Transforms/VectorCombine/X86/extract-binop.ll

(412 lines not shown)
   %e0 = extractelement <16 x float> %x, i32 14
   %e1 = extractelement <16 x float> %x, i32 15
   %r = fadd float %e0, %e1
   ret float %r
 }

 define <4 x float> @ins_bo_ext_ext(<4 x float> %a, <4 x float> %b) {
 ; CHECK-LABEL: @ins_bo_ext_ext(
-; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x float> [[A:%.*]], <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
-; CHECK-NEXT: [[TMP2:%.*]] = fadd <4 x float> [[A]], [[TMP1]]
-; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x float> [[TMP2]], i32 2
+; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x float> [[A:%.*]], <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
+; CHECK-NEXT: [[TMP2:%.*]] = fadd <4 x float> [[TMP1]], [[A]]
+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x float> [[TMP2]], i64 3
 ; CHECK-NEXT: [[V3:%.*]] = insertelement <4 x float> [[B:%.*]], float [[TMP3]], i32 3
 ; CHECK-NEXT: ret <4 x float> [[V3]]
 ;
   %a2 = extractelement <4 x float> %a, i32 2
   %a3 = extractelement <4 x float> %a, i32 3
   %a23 = fadd float %a2, %a3
   %v3 = insertelement <4 x float> %b, float %a23, i32 3
   ret <4 x float> %v3
 }

+; TODO: This is conservatively left to extract from the lower index value,
+; but it is likely that extracting from index 3 is the better option.
+
 define <4 x float> @ins_bo_ext_ext_uses(<4 x float> %a, <4 x float> %b) {
 ; CHECK-LABEL: @ins_bo_ext_ext_uses(
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x float> [[A:%.*]], <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
 ; CHECK-NEXT: [[TMP2:%.*]] = fadd <4 x float> [[A]], [[TMP1]]
 ; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x float> [[TMP2]], i32 2
 ; CHECK-NEXT: call void @use_f32(float [[TMP3]])
 ; CHECK-NEXT: [[V3:%.*]] = insertelement <4 x float> [[B:%.*]], float [[TMP3]], i32 3
 ; CHECK-NEXT: ret <4 x float> [[V3]]
 ;
   %a2 = extractelement <4 x float> %a, i32 2
   %a3 = extractelement <4 x float> %a, i32 3
   %a23 = fadd float %a2, %a3
   call void @use_f32(float %a23)
   %v3 = insertelement <4 x float> %b, float %a23, i32 3
   ret <4 x float> %v3
 }

 define <4 x float> @PR34724(<4 x float> %a, <4 x float> %b) {
 ; CHECK-LABEL: @PR34724(
 ; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x float> [[A:%.*]], <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
 ; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x float> [[B:%.*]], <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
-; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x float> [[B]], <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
+; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x float> [[B]], <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
 ; CHECK-NEXT: [[TMP4:%.*]] = fadd <4 x float> [[A]], [[TMP1]]
 ; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x float> [[TMP4]], i32 2
 ; CHECK-NEXT: [[TMP6:%.*]] = fadd <4 x float> [[B]], [[TMP2]]
 ; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x float> [[TMP6]], i32 0
-; CHECK-NEXT: [[TMP8:%.*]] = fadd <4 x float> [[B]], [[TMP3]]
-; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x float> [[TMP8]], i32 2
+; CHECK-NEXT: [[TMP8:%.*]] = fadd <4 x float> [[TMP3]], [[B]]
+; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x float> [[TMP8]], i64 3
 ; CHECK-NEXT: [[V1:%.*]] = insertelement <4 x float> undef, float [[TMP5]], i32 1
 ; CHECK-NEXT: [[V2:%.*]] = insertelement <4 x float> [[V1]], float [[TMP7]], i32 2
 ; CHECK-NEXT: [[V3:%.*]] = insertelement <4 x float> [[V2]], float [[TMP9]], i32 3
 ; CHECK-NEXT: ret <4 x float> [[V3]]
 ;
   %a0 = extractelement <4 x float> %a, i32 0
   %a1 = extractelement <4 x float> %a, i32 1
   %a2 = extractelement <4 x float> %a, i32 2
(16 lines not shown)

llvm/test/Transforms/VectorCombine/X86/extract-cmp.ll

(155 lines not shown)
 ; SSE-LABEL: @ins_fcmp_ext_ext(
 ; SSE-NEXT: [[A1:%.*]] = extractelement <4 x float> [[A:%.*]], i32 1
 ; SSE-NEXT: [[A2:%.*]] = extractelement <4 x float> [[A]], i32 2
 ; SSE-NEXT: [[A21:%.*]] = fcmp ugt float [[A2]], [[A1]]
 ; SSE-NEXT: [[R:%.*]] = insertelement <4 x i1> [[B:%.*]], i1 [[A21]], i32 2
 ; SSE-NEXT: ret <4 x i1> [[R]]
 ;
 ; AVX-LABEL: @ins_fcmp_ext_ext(
-; AVX-NEXT: [[TMP1:%.*]] = shufflevector <4 x float> [[A:%.*]], <4 x float> undef, <4 x i32> <i32 undef, i32 2, i32 undef, i32 undef>
-; AVX-NEXT: [[TMP2:%.*]] = fcmp ugt <4 x float> [[TMP1]], [[A]]
-; AVX-NEXT: [[TMP3:%.*]] = extractelement <4 x i1> [[TMP2]], i64 1
+; AVX-NEXT: [[TMP1:%.*]] = shufflevector <4 x float> [[A:%.*]], <4 x float> undef, <4 x i32> <i32 undef, i32 undef, i32 1, i32 undef>
+; AVX-NEXT: [[TMP2:%.*]] = fcmp ugt <4 x float> [[A]], [[TMP1]]
+; AVX-NEXT: [[TMP3:%.*]] = extractelement <4 x i1> [[TMP2]], i32 2
 ; AVX-NEXT: [[R:%.*]] = insertelement <4 x i1> [[B:%.*]], i1 [[TMP3]], i32 2
 ; AVX-NEXT: ret <4 x i1> [[R]]
 ;
   %a1 = extractelement <4 x float> %a, i32 1
   %a2 = extractelement <4 x float> %a, i32 2
   %a21 = fcmp ugt float %a2, %a1
   %r = insertelement <4 x i1> %b, i1 %a21, i32 2
   ret <4 x i1> %r
 }

 define <4 x i1> @ins_icmp_ext_ext(<4 x i32> %a, <4 x i1> %b) {
 ; CHECK-LABEL: @ins_icmp_ext_ext(
-; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[A:%.*]], <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 3, i32 undef>
-; CHECK-NEXT: [[TMP2:%.*]] = icmp ule <4 x i32> [[A]], [[TMP1]]
-; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i1> [[TMP2]], i32 2
+; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[A:%.*]], <4 x i32> undef, <4 x i32> <i32 undef, i32 undef, i32 undef, i32 2>
+; CHECK-NEXT: [[TMP2:%.*]] = icmp ule <4 x i32> [[TMP1]], [[A]]
+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i1> [[TMP2]], i64 3
 ; CHECK-NEXT: [[R:%.*]] = insertelement <4 x i1> [[B:%.*]], i1 [[TMP3]], i32 3
 ; CHECK-NEXT: ret <4 x i1> [[R]]
 ;
   %a3 = extractelement <4 x i32> %a, i32 3
   %a2 = extractelement <4 x i32> %a, i32 2
   %a23 = icmp ule i32 %a2, %a3
   %r = insertelement <4 x i1> %b, i1 %a23, i32 3
   ret <4 x i1> %r
 }