This is an archive of the discontinued LLVM Phabricator instance.

Differential D96405

[DAGCombiner] Reduce Shuffle_Vector Nodes Count
Needs Review · Public

Authored by mmarjieh on Feb 10 2021, 3:25 AM.

Details

Reviewers

craig.topper
mkuper
greened
jtony
rs
stefanp
simon_tatham
RKSimon

Summary

[DAGCombiner] Reduce Shuffle_Vector Nodes Count

This patch adds a DAG combine on vector shuffles that tries to reduce
the number of shuffle vector nodes.
For example, instead of generating the following nodes:
t0: v8i32 = vector_shuffle<0,5,u,u,u,u,u,u> t3, undef
t1: v8i32 = vector_shuffle<0,1,10,11,u,u,u,u> t0, t4

Combine to this node:
t2: v8i32 = vector_shuffle<0,5,10,11,u,u,u,u> t3, t4
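
The mask arithmetic behind this example can be sketched outside of LLVM. The following is a minimal standalone model, not the patch itself: `foldUnaryIntoBinaryMask` is a hypothetical helper operating on plain `std::vector<int>` masks, with -1 standing for an undef lane (`u`). It assumes both masks have the same length, and, unlike the patch (which asserts), it propagates an undef lane of the inner shuffle as undef.

```cpp
#include <cassert>
#include <vector>

// Hypothetical standalone model of the mask rewrite; not an LLVM API.
// Mask  : mask of the outer, binary shuffle (values in [0, 2*E) or -1).
// Mask0 : mask of the inner, unary shuffle (its second operand is undef).
// FirstOperandIsUnaryShuffle selects which outer operand is the inner shuffle.
std::vector<int> foldUnaryIntoBinaryMask(const std::vector<int> &Mask,
                                         const std::vector<int> &Mask0,
                                         bool FirstOperandIsUnaryShuffle) {
  const int E = static_cast<int>(Mask.size());
  assert(static_cast<int>(Mask0.size()) == E && "masks must have equal length");

  // Look through the inner shuffle. Lanes that are undef, or that would read
  // the inner shuffle's undef second operand, are treated as undef
  // (an assumption of this sketch).
  auto LookThrough = [&](int Lane) {
    const int V = Mask0[Lane];
    return (V < 0 || V >= E) ? -1 : V;
  };

  std::vector<int> NewMask(E, -1);
  for (int I = 0; I != E; ++I) {
    const int M = Mask[I];
    if (M == -1)
      continue; // Undef lanes stay undef.
    assert(M >= 0 && M < 2 * E && "unexpected shuffle mask value");
    const bool FromSecondOperand = M >= E;
    if (FromSecondOperand == FirstOperandIsUnaryShuffle) {
      // The lane reads the operand that is not being folded: index unchanged.
      NewMask[I] = M;
    } else if (FirstOperandIsUnaryShuffle) {
      // The lane reads the folded first operand: substitute the inner index.
      NewMask[I] = LookThrough(M);
    } else {
      // The lane reads the folded second operand: substitute and re-bias by E.
      const int V = LookThrough(M - E);
      NewMask[I] = (V == -1) ? -1 : V + E;
    }
  }
  return NewMask;
}
```

With the masks from the summary, `foldUnaryIntoBinaryMask({0,1,10,11,-1,-1,-1,-1}, {0,5,-1,-1,-1,-1,-1,-1}, true)` gives `<0,5,10,11,u,u,u,u>`, which matches node t2 above.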

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time            Test
2,181,200 ms    x64 windows > LLVM.CodeGen/PowerPC::vec_perf_shuffle.ll
944,210 ms      x64 windows > LLVM.CodeGen/PowerPC::vec_shuffle_p8vector.ll
944,080 ms      x64 windows > LLVM.CodeGen/PowerPC::vector.ll

Event Timeline

mmarjieh created this revision. Feb 10 2021, 3:25 AM

Herald added subscribers: ecnelises, pengfei, dmgreen and 2 others. Feb 10 2021, 3:25 AM

mmarjieh requested review of this revision. Feb 10 2021, 3:25 AM

Herald added a project: Restricted Project. Feb 10 2021, 3:25 AM

Herald added a subscriber: llvm-commits.

Fix typo in commit message.

mmarjieh added reviewers: RKSimon, craig.topper, mkuper, greened, jtony. Feb 10 2021, 3:36 AM

Harbormaster completed remote builds in B88608: Diff 322641. Feb 10 2021, 4:15 AM

Harbormaster completed remote builds in B88607: Diff 322640. Feb 10 2021, 4:20 AM

RKSimon added inline comments. Feb 10 2021, 6:44 AM

llvm/test/CodeGen/X86/vector-shuffle-combining-avx512bwvl.ll
115 ↗	(On Diff #322641)	At quick glance - this looks wrong, I'd expect this still to be the same vshufpd?

guyblank added a subscriber: guyblank. Feb 10 2021, 11:05 PM

mmarjieh added inline comments. Feb 11 2021, 1:32 AM

llvm/test/CodeGen/X86/vector-shuffle-combining-avx512bwvl.ll
115 ↗	(On Diff #322641)	I am not familiar with X86's ISA. Can you explain why?
Meanwhile, here is how the DAG changes with my patch:

Before this patch:
SelectionDAG has 28 nodes:

t0: ch = EntryToken
t5: v4i64,ch = load<(load 32 from `<4 x i64>* null`, align 8)> t0, Constant:i32<0>, undef:i32
t6: v4i64,ch = load<(load 32 from `<4 x i64>* undef`, align 8)> t0, undef:i32, undef:i32
    t21: ch = TokenFactor t5:1, t6:1
            t36: v8i16 = BUILD_VECTOR Constant:i16<0>, Constant:i16<0>, Constant:i16<0>, Constant:i16<0>, undef:i16, undef:i16, undef:i16, undef:i16
          t52: v2f64 = bitcast t36
        t62: v4f64 = concat_vectors t52, undef:v2f64
      t63: v4f64 = vector_shuffle<u,u,0,0> t62, undef:v4f64
            t37: v8i16 = X86ISD::VTRUNC t5
          t39: v8i16 = sign_extend_inreg t37, ValueType:ch:v8i8
        t44: v2f64 = bitcast t39
            t40: v8i16 = X86ISD::VTRUNC t6
          t41: v8i16 = sign_extend_inreg t40, ValueType:ch:v8i8
        t49: v2f64 = bitcast t41
      t59: v4f64 = concat_vectors t44, t49
    t68: v4f64 = vector_shuffle<4,6,2,3> t63, t59
    t3: i32,ch = load<(load 4 from %fixed-stack.0)> t0, FrameIndex:i32<-1>, undef:i32
  t57: ch = store<(store 32 into %ir.10, align 2)> t21, t68, t3, undef:i32
t29: ch = X86ISD::RET_FLAG t57, TargetConstant:i32<0>

After this patch:
SelectionDAG has 26 nodes:

t0: ch = EntryToken
t5: v4i64,ch = load<(load 32 from `<4 x i64>* null`, align 8)> t0, Constant:i32<0>, undef:i32
t6: v4i64,ch = load<(load 32 from `<4 x i64>* undef`, align 8)> t0, undef:i32, undef:i32
    t21: ch = TokenFactor t5:1, t6:1
            t37: v8i16 = X86ISD::VTRUNC t5
          t39: v8i16 = sign_extend_inreg t37, ValueType:ch:v8i8
        t44: v2f64 = bitcast t39
            t40: v8i16 = X86ISD::VTRUNC t6
          t41: v8i16 = sign_extend_inreg t40, ValueType:ch:v8i8
        t49: v2f64 = bitcast t41
      t59: v4f64 = concat_vectors t44, t49
          t36: v8i16 = BUILD_VECTOR Constant:i16<0>, Constant:i16<0>, Constant:i16<0>, Constant:i16<0>, undef:i16, undef:i16, undef:i16, undef:i16
        t52: v2f64 = bitcast t36
      t62: v4f64 = concat_vectors t52, undef:v2f64
    t64: v4f64 = vector_shuffle<0,2,4,4> t59, t62
    t3: i32,ch = load<(load 4 from %fixed-stack.0)> t0, FrameIndex:i32<-1>, undef:i32
  t57: ch = store<(store 32 into %ir.10, align 2)> t21, t64, t3, undef:i32
t29: ch = X86ISD::RET_FLAG t57, TargetConstant:i32<0>

RKSimon added inline comments. Feb 11 2021, 3:22 AM

llvm/test/CodeGen/X86/vector-shuffle-combining-avx512bwvl.ll
115 ↗	(On Diff #322641)	My mistake - I missed that we were implicitly zeroing the upper elements (xmm -> ymm) - sorry about that

mmarjieh added reviewers: rs, stefanp, simon_tatham. Feb 15 2021, 8:09 AM

Hey guys, I would appreciate it if you could review.
I think this change is beneficial for all targets, since it reduces the number of vector_shuffle DAG nodes and hence the code size.
Since I am not familiar with all targets, could you go over the assembly for your target and verify that the change is beneficial for you?
I counted the instructions in each lit test and saw an improvement in the instruction count.

This doesn't seem like the right direction,
I'd expect that to be a new fold to reduce shuffle count,
because if we only teach some existing fold to do this,
we'll miss such shuffle patterns that appear via other means.

Instead of reducing the number of vector_shuffle DAG nodes
in reduceBuildVecToShuffle, do this in a separate combine.

I wonder if we should be checking whether the resulting shuffle is legal, i.e. see buildLegalVectorShuffle() and its uses.

In D96405#2565621, @lebedev.ri wrote:

This doesn't seem like the right direction,
I'd expect that to be a new fold to reduce shuffle count,
because if we only teach some existing fold to do this,
we'll miss such shuffle patterns that appear via other means.

Agreed.
I added a separate combine for this.
I am still not sure what is the best DAGCombine Level to run this combine.
Currently, I run it after LegalizeDAG.
I would like to receive advice from you on when to run it.

mmarjieh retitled this revision from [DAGCombiner] Improve reduceBuildVecToShuffle Performance to [DAGCombiner] Reduce Shuffle_Vector Nodes Count. Mar 3 2021, 3:05 AM

Herald added a subscriber: steven.zhang. Mar 3 2021, 3:05 AM

mmarjieh edited the summary of this revision. Mar 3 2021, 3:05 AM

Harbormaster completed remote builds in B91761: Diff 327711. Mar 3 2021, 5:53 AM

In D96405#2599708, @mmarjieh wrote:

In D96405#2565621, @lebedev.ri wrote:

This doesn't seem like the right direction,
I'd expect that to be a new fold to reduce shuffle count,
because if we only teach some existing fold to do this,
we'll miss such shuffle patterns that appear via other means.

Agreed.
I added a separate combine for this.
I am still not sure what is the best DAGCombine Level to run this combine.
Currently, I run it after LegalizeDAG.
I would like to receive advice from you on when to run it.

I'm not sure VECTOR_SHUFFLE makes it to the DAGCombine after LegalizeDAG on most targets. Definitely not on X86.

If all types are legal and LegalizeVectorOps doesn't change anything, there are only 2 DAG combine runs that will happen, the very first one before LegalizeTypes and the very last one after LegalizeDAG. The other 2 DAG combiners are conditional on the previous legalizer step making changes. Given that and the X86 behavior I'm not sure you really have an option to delay it.
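
The scheduling described in this comment can be modeled in a few lines. This is only a sketch of the rule as stated above, not LLVM's actual pass driver: `combineRuns` is a hypothetical function, and the two boolean parameters stand in for "the previous legalizer step made changes". The returned strings reuse the names of LLVM's `CombineLevel` values.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model: which DAG combine runs happen for one basic block,
// given whether each conditional legalizer step changed the DAG.
std::vector<std::string> combineRuns(bool LegalizeTypesChanged,
                                     bool LegalizeVectorOpsChanged) {
  std::vector<std::string> Runs;
  Runs.push_back("BeforeLegalizeTypes"); // Always runs first.
  if (LegalizeTypesChanged)
    Runs.push_back("AfterLegalizeTypes"); // Only if type legalization changed something.
  if (LegalizeVectorOpsChanged)
    Runs.push_back("AfterLegalizeVectorOps"); // Only if vector-op legalization changed something.
  Runs.push_back("AfterLegalizeDAG"); // Always runs last.
  return Runs;
}
```

Under this model, `combineRuns(false, false)` contains only the first and last levels, so a combine gated on `Level == AfterLegalizeDAG` gets exactly one chance to fire, and only if a VECTOR_SHUFFLE node is still present at that point.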

I'm not sure how useful this really is - we already have something very similar later in the function that uses the MergeInnerShuffle lambda - maybe we're just missing an edge case there? But as @craig.topper said we need to be careful about performing this in later legalization stages.

@mmarjieh You're probably better off moving this into PPCISelLowering.cpp and making it a PPC only shuffle combine

@RKSimon @craig.topper After investigating and trying out this combine in different DAG combine phases, I noticed that this optimization is not always beneficial for all targets.
Since this combine is beneficial in our target, I already implemented it in the target specific ISelLowering.
If we want to continue with this patch and commit it to the community, I suggest that we introduce a target hook for it.

In D96405#2658431, @mmarjieh wrote:

@RKSimon @craig.topper After investigating and trying out this combine in different DAG combine phases, I noticed that this optimization is not always beneficial for all targets.
Since this combine is beneficial in our target, I already implemented it in the target specific ISelLowering.
If we want to continue with this patch and commit it to the community, I suggest that we introduce a target hook for it.

I assume your target is out of tree? Keeping this as a generic combine with a target hook could be fine, assuming the PowerPC team accept the hook being enabled and we retain test coverage.
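
One possible shape for such a hook, sketched as a standalone model rather than against the real TargetLowering class: every name below (`TargetHooksModel`, `shouldFoldUnaryShuffleIntoBinaryShuffle`, `OptInTargetModel`, `wouldRunCombine`) is hypothetical and does not exist in LLVM. The sketch assumes the hook defaults to off, so targets that do not benefit are unaffected.

```cpp
#include <cassert>

// Hypothetical stand-in for a target-lowering interface; not an LLVM class.
struct TargetHooksModel {
  virtual ~TargetHooksModel() = default;
  // Hypothetical hook: off by default so existing targets keep their output.
  virtual bool shouldFoldUnaryShuffleIntoBinaryShuffle() const { return false; }
};

// A target that opts in to the generic combine.
struct OptInTargetModel : TargetHooksModel {
  bool shouldFoldUnaryShuffleIntoBinaryShuffle() const override { return true; }
};

// The generic combine would run only when both the combine level check from
// the patch and the target hook agree.
bool wouldRunCombine(const TargetHooksModel &TLI, bool IsAfterLegalizeDAG) {
  return IsAfterLegalizeDAG && TLI.shouldFoldUnaryShuffleIntoBinaryShuffle();
}
```

With a default-off hook, retaining test coverage means at least one in-tree target has to enable it, which is the condition raised in the comment above.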

Revision Contents

Path	Size
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	71 lines
llvm/test/CodeGen/PowerPC/pre-inc-disable.ll	100 lines
llvm/test/CodeGen/PowerPC/scalar_vector_test_4.ll	64 lines

Diff 327711

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

[20,793 lines elided] static SDValue replaceShuffleOfInsert(ShuffleVectorSDNode *Shuf,
   return DAG.getNode(ISD::INSERT_VECTOR_ELT, SDLoc(Shuf), Op0.getValueType(),
                      Op1, Op0.getOperand(1), NewInsIndex);
 }

 /// If we have a unary shuffle of a shuffle, see if it can be folded away
 /// completely. This has the potential to lose undef knowledge because the first
 /// shuffle may not have an undef mask element where the second one does. So
 /// only call this after doing simplifications based on demanded elements.
-static SDValue simplifyShuffleOfShuffle(ShuffleVectorSDNode *Shuf) {
+static SDValue simplifyUnaryShuffleOfShuffle(ShuffleVectorSDNode *Shuf) {
   // shuf (shuf0 X, Y, Mask0), undef, Mask
   auto *Shuf0 = dyn_cast<ShuffleVectorSDNode>(Shuf->getOperand(0));
   if (!Shuf0 || !Shuf->getOperand(1).isUndef())
     return SDValue();

   ArrayRef<int> Mask = Shuf->getMask();
   ArrayRef<int> Mask0 = Shuf0->getMask();
   for (int i = 0, e = (int)Mask.size(); i != e; ++i) {
     // Ignore undef elements.
     if (Mask[i] == -1)
       continue;
     assert(Mask[i] >= 0 && Mask[i] < e && "Unexpected shuffle mask value");

     // Is the element of the shuffle operand chosen by this shuffle the same as
     // the element chosen by the shuffle operand itself?
     if (Mask0[Mask[i]] != Mask0[i])
       return SDValue();
   }
   // Every element of this shuffle is identical to the result of the previous
   // shuffle, so we can replace this value.
   return Shuf->getOperand(0);
 }

+/// If we have a binary shuffle \p Shuf of an unary shuffle, fold the unary
+/// shuffle away into \p Shuf and update its mask.
+/// For Example:
+/// Detect this pattern:
+/// t0: v8i32 = vector_shuffle<0,5,u,u,u,u,u,u> t3, undef
+/// t1: v8i32 = vector_shuffle<0,1,10,11,u,u,u,u> t0, t4
+///
+/// Combine to this node:
+/// t2: v8i32 = vector_shuffle<0,5,10,11,u,u,u,u> t3, t4
+static SDValue simplifyBinaryShuffleOfShuffle(ShuffleVectorSDNode *Shuf,
+                                              SelectionDAG &DAG) {
+  if (Shuf->getOperand(1).isUndef())
+    return SDValue();
+  // shuf (shuf0 X, Undef, Mask0), Y, Mask or shuf Y, (shuf0 X, Undef, Mask0)
+  auto *Shuf0 = dyn_cast<ShuffleVectorSDNode>(Shuf->getOperand(0));
+  bool IsFirstOperandUnaryShuffle = true;
+  if (!Shuf0 || !Shuf0->getOperand(1).isUndef()) {
+    Shuf0 = dyn_cast<ShuffleVectorSDNode>(Shuf->getOperand(1));
+    if (!Shuf0 || !Shuf0->getOperand(1).isUndef())
+      return SDValue();
+    IsFirstOperandUnaryShuffle = false;
+  }
+
+  ArrayRef<int> Mask = Shuf->getMask();
+  ArrayRef<int> Mask0 = Shuf0->getMask();
+  SmallVector<int, 8> NewMask(Mask.size(), -1);
+  for (int i = 0, e = (int)Mask.size(); i != e; ++i) {
+    // Ignore undef elements.
+    if (Mask[i] == -1)
+      continue;
+    assert(Mask[i] >= 0 && Mask[i] < 2 * e && "Unexpected shuffle mask value");
+    // Element taken from second operand
+    if (Mask[i] >= e) {
+      // If the first operand is being folded away then the mask is unchanged.
+      if (IsFirstOperandUnaryShuffle)
+        NewMask[i] = Mask[i];
+      else {
+        assert(Mask0[Mask[i] - e] != -1 &&
+               "Unexpected shuffle mask undef value");
+        NewMask[i] = Mask0[Mask[i] - e] + e;
+      }
+    } else {
+      // Element taken from first operand
+      if (IsFirstOperandUnaryShuffle) {
+        assert(Mask0[Mask[i]] != -1 && "Unexpected shuffle mask undef value");
+        NewMask[i] = Mask0[Mask[i]];
+      } else
+        // If the second operand is being folded away then the mask is
+        // unchanged.
+        NewMask[i] = Mask[i];
+    }
+  }
+
+  SDValue ToShuffleLeft =
+      IsFirstOperandUnaryShuffle ? Shuf0->getOperand(0) : Shuf->getOperand(0);
+  SDValue ToShuffleRight =
+      IsFirstOperandUnaryShuffle ? Shuf->getOperand(1) : Shuf0->getOperand(0);
+  return DAG.getVectorShuffle(Shuf->getValueType(0), SDLoc(Shuf), ToShuffleLeft,
+                              ToShuffleRight, NewMask);
+}
+
 SDValue DAGCombiner::visitVECTOR_SHUFFLE(SDNode *N) {
   EVT VT = N->getValueType(0);
   unsigned NumElts = VT.getVectorNumElements();

   SDValue N0 = N->getOperand(0);
   SDValue N1 = N->getOperand(1);

   assert(N0.getValueType() == VT && "Vector shuffle must be normalized in DAG");
[115 lines elided] SDValue DAGCombiner::visitVECTOR_SHUFFLE(SDNode *N) {
   }

   // Simplify source operands based on shuffle mask.
   if (SimplifyDemandedVectorElts(SDValue(N, 0)))
     return SDValue(N, 0);

   // This is intentionally placed after demanded elements simplification because
   // it could eliminate knowledge of undef elements created by this shuffle.
-  if (SDValue ShufOp = simplifyShuffleOfShuffle(SVN))
+  if (SDValue ShufOp = simplifyUnaryShuffleOfShuffle(SVN))
     return ShufOp;

+  if (Level == AfterLegalizeDAG) {
+    SDValue ShufOp = simplifyBinaryShuffleOfShuffle(SVN, DAG);
+    if (ShufOp)
+      return ShufOp;
+  }
+
   // Match shuffles that can be converted to any_vector_extend_in_reg.
   if (SDValue V = combineShuffleToVectorExtend(SVN, DAG, TLI, LegalOperations))
     return V;

   // Combine "truncate_vector_in_reg" style shuffles.
   if (SDValue V = combineTruncationShuffle(SVN, DAG))
     return V;

[2,118 lines elided]

llvm/test/CodeGen/PowerPC/pre-inc-disable.ll

[343 lines elided] entry:
store <4 x i32> %13, <4 x i32>* undef, align 16		store <4 x i32> %13, <4 x i32>* undef, align 16
ret void		ret void
}		}

define void @test16(i16* nocapture readonly %sums, i32 signext %delta, i32 signext %thresh) {		define void @test16(i16* nocapture readonly %sums, i32 signext %delta, i32 signext %thresh) {
; CHECK-LABEL: test16:		; CHECK-LABEL: test16:
; CHECK: # %bb.0: # %entry		; CHECK: # %bb.0: # %entry
; CHECK-NEXT: sldi r4, r4, 1		; CHECK-NEXT: sldi r4, r4, 1
; CHECK-NEXT: li r7, 16
; CHECK-NEXT: add r6, r3, r4		; CHECK-NEXT: add r6, r3, r4
; CHECK-NEXT: lxsihzx v4, r3, r4		; CHECK-NEXT: lxsihzx v2, r3, r4
		; CHECK-NEXT: li r3, 0
		; CHECK-NEXT: mtvsrd v3, r3
; CHECK-NEXT: addis r3, r2, .LCPI3_0@toc@ha		; CHECK-NEXT: addis r3, r2, .LCPI3_0@toc@ha
; CHECK-NEXT: lxsihzx v2, r6, r7
; CHECK-NEXT: li r6, 0
; CHECK-NEXT: addi r3, r3, .LCPI3_0@toc@l		; CHECK-NEXT: addi r3, r3, .LCPI3_0@toc@l
; CHECK-NEXT: mtvsrd v3, r6
; CHECK-NEXT: vsplth v4, v4, 3
; CHECK-NEXT: vsplth v2, v2, 3		; CHECK-NEXT: vsplth v2, v2, 3
; CHECK-NEXT: vmrghh v4, v3, v4		; CHECK-NEXT: lxvx v4, 0, r3
		; CHECK-NEXT: li r3, 16
; CHECK-NEXT: vmrghh v2, v3, v2		; CHECK-NEXT: vmrghh v2, v3, v2
; CHECK-NEXT: vsplth v3, v3, 3		; CHECK-NEXT: vperm v2, v2, v3, v4
; CHECK-NEXT: vmrglw v3, v4, v3		; CHECK-NEXT: lxsihzx v4, r6, r3
		; CHECK-NEXT: addis r3, r2, .LCPI3_1@toc@ha
		; CHECK-NEXT: addi r3, r3, .LCPI3_1@toc@l
		; CHECK-NEXT: vsplth v4, v4, 3
		; CHECK-NEXT: vmrghh v3, v3, v4
; CHECK-NEXT: lxvx v4, 0, r3		; CHECK-NEXT: lxvx v4, 0, r3
; CHECK-NEXT: li r3, 0		; CHECK-NEXT: li r3, 0
; CHECK-NEXT: vperm v2, v2, v3, v4		; CHECK-NEXT: vperm v2, v3, v2, v4
; CHECK-NEXT: xxspltw v3, v2, 2		; CHECK-NEXT: xxspltw v3, v2, 2
; CHECK-NEXT: vadduwm v2, v2, v3		; CHECK-NEXT: vadduwm v2, v2, v3
; CHECK-NEXT: vextuwrx r3, r3, v2		; CHECK-NEXT: vextuwrx r3, r3, v2
; CHECK-NEXT: cmpw r3, r5		; CHECK-NEXT: cmpw r3, r5
; CHECK-NEXT: bgelr+ cr0		; CHECK-NEXT: bgelr+ cr0
; CHECK-NEXT: # %bb.1: # %if.then		; CHECK-NEXT: # %bb.1: # %if.then
;		;
; P9BE-LABEL: test16:		; P9BE-LABEL: test16:
; P9BE: # %bb.0: # %entry		; P9BE: # %bb.0: # %entry
; P9BE-NEXT: sldi r4, r4, 1		; P9BE-NEXT: sldi r4, r4, 1
; P9BE-NEXT: li r7, 16
; P9BE-NEXT: add r6, r3, r4		; P9BE-NEXT: add r6, r3, r4
; P9BE-NEXT: lxsihzx v4, r3, r4		; P9BE-NEXT: lxsihzx v2, r3, r4
		; P9BE-NEXT: li r3, 0
		; P9BE-NEXT: sldi r3, r3, 48
		; P9BE-NEXT: mtvsrd v3, r3
; P9BE-NEXT: addis r3, r2, .LCPI3_0@toc@ha		; P9BE-NEXT: addis r3, r2, .LCPI3_0@toc@ha
; P9BE-NEXT: lxsihzx v2, r6, r7
; P9BE-NEXT: li r6, 0
; P9BE-NEXT: addi r3, r3, .LCPI3_0@toc@l
; P9BE-NEXT: sldi r6, r6, 48
; P9BE-NEXT: vsplth v4, v4, 3
; P9BE-NEXT: mtvsrd v3, r6
; P9BE-NEXT: vsplth v2, v2, 3		; P9BE-NEXT: vsplth v2, v2, 3
; P9BE-NEXT: vmrghh v4, v3, v4		; P9BE-NEXT: addi r3, r3, .LCPI3_0@toc@l
		; P9BE-NEXT: lxvx v4, 0, r3
; P9BE-NEXT: vmrghh v2, v3, v2		; P9BE-NEXT: vmrghh v2, v3, v2
; P9BE-NEXT: vsplth v3, v3, 0		; P9BE-NEXT: li r3, 16
; P9BE-NEXT: vmrghw v3, v3, v4		; P9BE-NEXT: vperm v2, v3, v2, v4
		; P9BE-NEXT: lxsihzx v4, r6, r3
		; P9BE-NEXT: addis r3, r2, .LCPI3_1@toc@ha
		; P9BE-NEXT: addi r3, r3, .LCPI3_1@toc@l
		; P9BE-NEXT: vsplth v4, v4, 3
		; P9BE-NEXT: vmrghh v3, v3, v4
; P9BE-NEXT: lxvx v4, 0, r3		; P9BE-NEXT: lxvx v4, 0, r3
; P9BE-NEXT: li r3, 0		; P9BE-NEXT: li r3, 0
; P9BE-NEXT: vperm v2, v3, v2, v4		; P9BE-NEXT: vperm v2, v2, v3, v4
; P9BE-NEXT: xxspltw v3, v2, 1		; P9BE-NEXT: xxspltw v3, v2, 1
; P9BE-NEXT: vadduwm v2, v2, v3		; P9BE-NEXT: vadduwm v2, v2, v3
; P9BE-NEXT: vextuwlx r3, r3, v2		; P9BE-NEXT: vextuwlx r3, r3, v2
; P9BE-NEXT: cmpw r3, r5		; P9BE-NEXT: cmpw r3, r5
; P9BE-NEXT: bgelr+ cr0		; P9BE-NEXT: bgelr+ cr0
; P9BE-NEXT: # %bb.1: # %if.then		; P9BE-NEXT: # %bb.1: # %if.then
entry:		entry:
%idxprom = sext i32 %delta to i64		%idxprom = sext i32 %delta to i64
[29 lines elided]

define void @test8(i8* nocapture readonly %sums, i32 signext %delta, i32 signext %thresh) {		define void @test8(i8* nocapture readonly %sums, i32 signext %delta, i32 signext %thresh) {
; CHECK-LABEL: test8:		; CHECK-LABEL: test8:
; CHECK: # %bb.0: # %entry		; CHECK: # %bb.0: # %entry
; CHECK-NEXT: add r6, r3, r4		; CHECK-NEXT: add r6, r3, r4
; CHECK-NEXT: lxsibzx v2, r3, r4		; CHECK-NEXT: lxsibzx v2, r3, r4
; CHECK-NEXT: li r3, 0		; CHECK-NEXT: li r3, 0
; CHECK-NEXT: mtvsrd v3, r3		; CHECK-NEXT: mtvsrd v3, r3
; CHECK-NEXT: li r3, 8
; CHECK-NEXT: lxsibzx v5, r6, r3
; CHECK-NEXT: vspltb v4, v3, 7
; CHECK-NEXT: addis r3, r2, .LCPI4_0@toc@ha		; CHECK-NEXT: addis r3, r2, .LCPI4_0@toc@ha
; CHECK-NEXT: vspltb v2, v2, 7
; CHECK-NEXT: addi r3, r3, .LCPI4_0@toc@l		; CHECK-NEXT: addi r3, r3, .LCPI4_0@toc@l
		; CHECK-NEXT: vspltb v5, v3, 7
		; CHECK-NEXT: vspltb v2, v2, 7
		; CHECK-NEXT: lxvx v4, 0, r3
		; CHECK-NEXT: addis r3, r2, .LCPI4_1@toc@ha
		; CHECK-NEXT: addi r3, r3, .LCPI4_1@toc@l
; CHECK-NEXT: vmrghb v2, v3, v2		; CHECK-NEXT: vmrghb v2, v3, v2
		; CHECK-NEXT: lxvx v0, 0, r3
		; CHECK-NEXT: li r3, 8
		; CHECK-NEXT: vperm v2, v2, v3, v4
		; CHECK-NEXT: vperm v2, v2, v5, v0
		; CHECK-NEXT: lxsibzx v5, r6, r3
		; CHECK-NEXT: addis r3, r2, .LCPI4_2@toc@ha
		; CHECK-NEXT: addi r3, r3, .LCPI4_2@toc@l
; CHECK-NEXT: vspltb v5, v5, 7		; CHECK-NEXT: vspltb v5, v5, 7
; CHECK-NEXT: vmrglh v2, v2, v4		; CHECK-NEXT: vmrghb v5, v3, v5
; CHECK-NEXT: vmrghb v3, v3, v5		; CHECK-NEXT: vperm v4, v5, v3, v4
; CHECK-NEXT: vmrglw v2, v2, v4		; CHECK-NEXT: lxvx v5, 0, r3
; CHECK-NEXT: vmrglh v3, v3, v4		; CHECK-NEXT: addis r3, r2, .LCPI4_3@toc@ha
; CHECK-NEXT: vmrglw v3, v4, v3		; CHECK-NEXT: addi r3, r3, .LCPI4_3@toc@l
		; CHECK-NEXT: vperm v3, v3, v4, v5
; CHECK-NEXT: lxvx v4, 0, r3		; CHECK-NEXT: lxvx v4, 0, r3
; CHECK-NEXT: li r3, 0		; CHECK-NEXT: li r3, 0
; CHECK-NEXT: vperm v2, v3, v2, v4		; CHECK-NEXT: vperm v2, v3, v2, v4
; CHECK-NEXT: xxspltw v3, v2, 2		; CHECK-NEXT: xxspltw v3, v2, 2
; CHECK-NEXT: vadduwm v2, v2, v3		; CHECK-NEXT: vadduwm v2, v2, v3
; CHECK-NEXT: vextuwrx r3, r3, v2		; CHECK-NEXT: vextuwrx r3, r3, v2
; CHECK-NEXT: cmpw r3, r5		; CHECK-NEXT: cmpw r3, r5
; CHECK-NEXT: bgelr+ cr0		; CHECK-NEXT: bgelr+ cr0
; CHECK-NEXT: # %bb.1: # %if.then		; CHECK-NEXT: # %bb.1: # %if.then
;		;
; P9BE-LABEL: test8:		; P9BE-LABEL: test8:
; P9BE: # %bb.0: # %entry		; P9BE: # %bb.0: # %entry
; P9BE-NEXT: add r6, r3, r4		; P9BE-NEXT: add r6, r3, r4
; P9BE-NEXT: li r7, 8		; P9BE-NEXT: lxsibzx v2, r3, r4
; P9BE-NEXT: lxsibzx v4, r3, r4		; P9BE-NEXT: li r3, 0
		; P9BE-NEXT: sldi r3, r3, 56
		; P9BE-NEXT: mtvsrd v3, r3
; P9BE-NEXT: addis r3, r2, .LCPI4_0@toc@ha		; P9BE-NEXT: addis r3, r2, .LCPI4_0@toc@ha
; P9BE-NEXT: lxsibzx v2, r6, r7		; P9BE-NEXT: vspltb v2, v2, 7
; P9BE-NEXT: li r6, 0
; P9BE-NEXT: addi r3, r3, .LCPI4_0@toc@l		; P9BE-NEXT: addi r3, r3, .LCPI4_0@toc@l
; P9BE-NEXT: sldi r6, r6, 56		; P9BE-NEXT: lxvx v4, 0, r3
		; P9BE-NEXT: vmrghb v2, v3, v2
		; P9BE-NEXT: li r3, 8
		; P9BE-NEXT: vperm v2, v2, v3, v4
		; P9BE-NEXT: lxsibzx v4, r6, r3
		; P9BE-NEXT: addis r3, r2, .LCPI4_1@toc@ha
		; P9BE-NEXT: addi r3, r3, .LCPI4_1@toc@l
; P9BE-NEXT: vspltb v4, v4, 7		; P9BE-NEXT: vspltb v4, v4, 7
; P9BE-NEXT: mtvsrd v3, r6
; P9BE-NEXT: vspltb v2, v2, 7
; P9BE-NEXT: vmrghb v4, v3, v4		; P9BE-NEXT: vmrghb v4, v3, v4
; P9BE-NEXT: vmrghb v2, v3, v2
; P9BE-NEXT: vspltb v3, v3, 0		; P9BE-NEXT: vspltb v3, v3, 0
; P9BE-NEXT: vmrghh v4, v4, v3		; P9BE-NEXT: vmrghw v2, v2, v4
; P9BE-NEXT: xxspltw v3, v3, 0
; P9BE-NEXT: vmrghw v2, v4, v2
; P9BE-NEXT: lxvx v4, 0, r3		; P9BE-NEXT: lxvx v4, 0, r3
		; P9BE-NEXT: xxspltw v3, v3, 0
; P9BE-NEXT: li r3, 0		; P9BE-NEXT: li r3, 0
; P9BE-NEXT: vperm v2, v3, v2, v4		; P9BE-NEXT: vperm v2, v3, v2, v4
; P9BE-NEXT: xxspltw v3, v2, 1		; P9BE-NEXT: xxspltw v3, v2, 1
; P9BE-NEXT: vadduwm v2, v2, v3		; P9BE-NEXT: vadduwm v2, v2, v3
; P9BE-NEXT: vextuwlx r3, r3, v2		; P9BE-NEXT: vextuwlx r3, r3, v2
; P9BE-NEXT: cmpw r3, r5		; P9BE-NEXT: cmpw r3, r5
; P9BE-NEXT: bgelr+ cr0		; P9BE-NEXT: bgelr+ cr0
; P9BE-NEXT: # %bb.1: # %if.then		; P9BE-NEXT: # %bb.1: # %if.then
[31 lines elided]

llvm/test/CodeGen/PowerPC/scalar_vector_test_4.ll

[248 lines elided] entry:
%vecins = insertelement <4 x float> %vec, float %0, i32 0		%vecins = insertelement <4 x float> %vec, float %0, i32 0
ret <4 x float> %vecins		ret <4 x float> %vecins
}		}

; Function Attrs: norecurse nounwind readonly		; Function Attrs: norecurse nounwind readonly
define <2 x float> @s2v_test_f2(float* nocapture readonly %f64, <2 x float> %vec) {		define <2 x float> @s2v_test_f2(float* nocapture readonly %f64, <2 x float> %vec) {
; P9LE-LABEL: s2v_test_f2:		; P9LE-LABEL: s2v_test_f2:
; P9LE: # %bb.0: # %entry		; P9LE: # %bb.0: # %entry
		; P9LE-NEXT: addis r4, r2, .LCPI6_0@toc@ha
; P9LE-NEXT: addi r3, r3, 4		; P9LE-NEXT: addi r3, r3, 4
; P9LE-NEXT: vmrglw v2, v2, v2		; P9LE-NEXT: addi r4, r4, .LCPI6_0@toc@l
; P9LE-NEXT: lxsiwzx v3, 0, r3		; P9LE-NEXT: lxsiwzx v4, 0, r3
; P9LE-NEXT: vmrghw v2, v2, v3		; P9LE-NEXT: lxvx v3, 0, r4
		; P9LE-NEXT: vperm v2, v2, v4, v3
; P9LE-NEXT: blr		; P9LE-NEXT: blr
;		;
; P9BE-LABEL: s2v_test_f2:		; P9BE-LABEL: s2v_test_f2:
; P9BE: # %bb.0: # %entry		; P9BE: # %bb.0: # %entry
; P9BE-NEXT: addi r3, r3, 4		; P9BE-NEXT: addi r3, r3, 4
; P9BE-NEXT: xxspltw v2, v2, 1		; P9BE-NEXT: xxspltw v2, v2, 1
; P9BE-NEXT: lfiwzx f0, 0, r3		; P9BE-NEXT: lfiwzx f0, 0, r3
; P9BE-NEXT: xxsldwi v3, f0, f0, 1		; P9BE-NEXT: xxsldwi v3, f0, f0, 1
; P9BE-NEXT: vmrghw v2, v3, v2		; P9BE-NEXT: vmrghw v2, v3, v2
; P9BE-NEXT: blr		; P9BE-NEXT: blr
;		;
; P8LE-LABEL: s2v_test_f2:		; P8LE-LABEL: s2v_test_f2:
; P8LE: # %bb.0: # %entry		; P8LE: # %bb.0: # %entry
; P8LE-NEXT: vmrglw v2, v2, v2		; P8LE-NEXT: addis r4, r2, .LCPI6_0@toc@ha
; P8LE-NEXT: addi r3, r3, 4		; P8LE-NEXT: addi r3, r3, 4
; P8LE-NEXT: lxsiwzx v3, 0, r3		; P8LE-NEXT: addi r4, r4, .LCPI6_0@toc@l
; P8LE-NEXT: vmrghw v2, v2, v3		; P8LE-NEXT: lxsiwzx v4, 0, r3
		; P8LE-NEXT: lvx v3, 0, r4
		; P8LE-NEXT: vperm v2, v2, v4, v3
; P8LE-NEXT: blr		; P8LE-NEXT: blr
;		;
; P8BE-LABEL: s2v_test_f2:		; P8BE-LABEL: s2v_test_f2:
; P8BE: # %bb.0: # %entry		; P8BE: # %bb.0: # %entry
; P8BE-NEXT: addi r3, r3, 4		; P8BE-NEXT: addi r3, r3, 4
; P8BE-NEXT: xxspltw v2, v2, 1		; P8BE-NEXT: xxspltw v2, v2, 1
; P8BE-NEXT: lfiwzx f0, 0, r3		; P8BE-NEXT: lfiwzx f0, 0, r3
; P8BE-NEXT: xxsldwi v3, f0, f0, 1		; P8BE-NEXT: xxsldwi v3, f0, f0, 1
; P8BE-NEXT: vmrghw v2, v3, v2		; P8BE-NEXT: vmrghw v2, v3, v2
; P8BE-NEXT: blr		; P8BE-NEXT: blr
entry:		entry:
%arrayidx = getelementptr inbounds float, float* %f64, i64 1		%arrayidx = getelementptr inbounds float, float* %f64, i64 1
%0 = load float, float* %arrayidx, align 8		%0 = load float, float* %arrayidx, align 8
%vecins = insertelement <2 x float> %vec, float %0, i32 0		%vecins = insertelement <2 x float> %vec, float %0, i32 0
ret <2 x float> %vecins		ret <2 x float> %vecins
}		}

; Function Attrs: norecurse nounwind readonly		; Function Attrs: norecurse nounwind readonly
define <2 x float> @s2v_test_f3(float* nocapture readonly %f64, <2 x float> %vec, i32 signext %Idx) {		define <2 x float> @s2v_test_f3(float* nocapture readonly %f64, <2 x float> %vec, i32 signext %Idx) {
; P9LE-LABEL: s2v_test_f3:		; P9LE-LABEL: s2v_test_f3:
; P9LE: # %bb.0: # %entry		; P9LE: # %bb.0: # %entry
; P9LE-NEXT: sldi r4, r7, 2		; P9LE-NEXT: sldi r4, r7, 2
; P9LE-NEXT: vmrglw v2, v2, v2
; P9LE-NEXT: lxsiwzx v3, r3, r4		; P9LE-NEXT: lxsiwzx v3, r3, r4
; P9LE-NEXT: vmrghw v2, v2, v3		; P9LE-NEXT: addis r3, r2, .LCPI7_0@toc@ha
		; P9LE-NEXT: addi r3, r3, .LCPI7_0@toc@l
		; P9LE-NEXT: lxvx v4, 0, r3
		; P9LE-NEXT: vperm v2, v2, v3, v4
; P9LE-NEXT: blr		; P9LE-NEXT: blr
;		;
; P9BE-LABEL: s2v_test_f3:		; P9BE-LABEL: s2v_test_f3:
; P9BE: # %bb.0: # %entry		; P9BE: # %bb.0: # %entry
; P9BE-NEXT: sldi r4, r7, 2		; P9BE-NEXT: sldi r4, r7, 2
; P9BE-NEXT: xxspltw v2, v2, 1		; P9BE-NEXT: xxspltw v2, v2, 1
; P9BE-NEXT: lfiwzx f0, r3, r4		; P9BE-NEXT: lfiwzx f0, r3, r4
; P9BE-NEXT: xxsldwi v3, f0, f0, 1		; P9BE-NEXT: xxsldwi v3, f0, f0, 1
; P9BE-NEXT: vmrghw v2, v3, v2		; P9BE-NEXT: vmrghw v2, v3, v2
; P9BE-NEXT: blr		; P9BE-NEXT: blr
;		;
; P8LE-LABEL: s2v_test_f3:		; P8LE-LABEL: s2v_test_f3:
; P8LE: # %bb.0: # %entry		; P8LE: # %bb.0: # %entry
; P8LE-NEXT: vmrglw v2, v2, v2		; P8LE-NEXT: addis r4, r2, .LCPI7_0@toc@ha
; P8LE-NEXT: sldi r4, r7, 2		; P8LE-NEXT: sldi r5, r7, 2
; P8LE-NEXT: lxsiwzx v3, r3, r4		; P8LE-NEXT: addi r4, r4, .LCPI7_0@toc@l
; P8LE-NEXT: vmrghw v2, v2, v3		; P8LE-NEXT: lxsiwzx v3, r3, r5
		; P8LE-NEXT: lvx v4, 0, r4
		; P8LE-NEXT: vperm v2, v2, v3, v4
; P8LE-NEXT: blr		; P8LE-NEXT: blr
;		;
; P8BE-LABEL: s2v_test_f3:		; P8BE-LABEL: s2v_test_f3:
; P8BE: # %bb.0: # %entry		; P8BE: # %bb.0: # %entry
; P8BE-NEXT: sldi r4, r7, 2		; P8BE-NEXT: sldi r4, r7, 2
; P8BE-NEXT: xxspltw v2, v2, 1		; P8BE-NEXT: xxspltw v2, v2, 1
; P8BE-NEXT: lfiwzx f0, r3, r4		; P8BE-NEXT: lfiwzx f0, r3, r4
; P8BE-NEXT: xxsldwi v3, f0, f0, 1		; P8BE-NEXT: xxsldwi v3, f0, f0, 1
; P8BE-NEXT: vmrghw v2, v3, v2		; P8BE-NEXT: vmrghw v2, v3, v2
; P8BE-NEXT: blr		; P8BE-NEXT: blr
entry:		entry:
%idxprom = sext i32 %Idx to i64		%idxprom = sext i32 %Idx to i64
%arrayidx = getelementptr inbounds float, float* %f64, i64 %idxprom		%arrayidx = getelementptr inbounds float, float* %f64, i64 %idxprom
%0 = load float, float* %arrayidx, align 8		%0 = load float, float* %arrayidx, align 8
%vecins = insertelement <2 x float> %vec, float %0, i32 0		%vecins = insertelement <2 x float> %vec, float %0, i32 0
ret <2 x float> %vecins		ret <2 x float> %vecins
}		}

; Function Attrs: norecurse nounwind readonly		; Function Attrs: norecurse nounwind readonly
define <2 x float> @s2v_test_f4(float* nocapture readonly %f64, <2 x float> %vec) {		define <2 x float> @s2v_test_f4(float* nocapture readonly %f64, <2 x float> %vec) {
; P9LE-LABEL: s2v_test_f4:		; P9LE-LABEL: s2v_test_f4:
; P9LE: # %bb.0: # %entry		; P9LE: # %bb.0: # %entry
		; P9LE-NEXT: addis r4, r2, .LCPI8_0@toc@ha
; P9LE-NEXT: addi r3, r3, 4		; P9LE-NEXT: addi r3, r3, 4
; P9LE-NEXT: vmrglw v2, v2, v2		; P9LE-NEXT: addi r4, r4, .LCPI8_0@toc@l
; P9LE-NEXT: lxsiwzx v3, 0, r3		; P9LE-NEXT: lxsiwzx v4, 0, r3
; P9LE-NEXT: vmrghw v2, v2, v3		; P9LE-NEXT: lxvx v3, 0, r4
		; P9LE-NEXT: vperm v2, v2, v4, v3
; P9LE-NEXT: blr		; P9LE-NEXT: blr
;		;
; P9BE-LABEL: s2v_test_f4:		; P9BE-LABEL: s2v_test_f4:
; P9BE: # %bb.0: # %entry		; P9BE: # %bb.0: # %entry
; P9BE-NEXT: addi r3, r3, 4		; P9BE-NEXT: addi r3, r3, 4
; P9BE-NEXT: xxspltw v2, v2, 1		; P9BE-NEXT: xxspltw v2, v2, 1
; P9BE-NEXT: lfiwzx f0, 0, r3		; P9BE-NEXT: lfiwzx f0, 0, r3
; P9BE-NEXT: xxsldwi v3, f0, f0, 1		; P9BE-NEXT: xxsldwi v3, f0, f0, 1
; P9BE-NEXT: vmrghw v2, v3, v2		; P9BE-NEXT: vmrghw v2, v3, v2
; P9BE-NEXT: blr		; P9BE-NEXT: blr
;		;
; P8LE-LABEL: s2v_test_f4:		; P8LE-LABEL: s2v_test_f4:
; P8LE: # %bb.0: # %entry		; P8LE: # %bb.0: # %entry
; P8LE-NEXT: vmrglw v2, v2, v2		; P8LE-NEXT: addis r4, r2, .LCPI8_0@toc@ha
; P8LE-NEXT: addi r3, r3, 4		; P8LE-NEXT: addi r3, r3, 4
; P8LE-NEXT: lxsiwzx v3, 0, r3		; P8LE-NEXT: addi r4, r4, .LCPI8_0@toc@l
; P8LE-NEXT: vmrghw v2, v2, v3		; P8LE-NEXT: lxsiwzx v4, 0, r3
		; P8LE-NEXT: lvx v3, 0, r4
		; P8LE-NEXT: vperm v2, v2, v4, v3
; P8LE-NEXT: blr		; P8LE-NEXT: blr
;		;
; P8BE-LABEL: s2v_test_f4:		; P8BE-LABEL: s2v_test_f4:
; P8BE: # %bb.0: # %entry		; P8BE: # %bb.0: # %entry
; P8BE-NEXT: addi r3, r3, 4		; P8BE-NEXT: addi r3, r3, 4
; P8BE-NEXT: xxspltw v2, v2, 1		; P8BE-NEXT: xxspltw v2, v2, 1
; P8BE-NEXT: lfiwzx f0, 0, r3		; P8BE-NEXT: lfiwzx f0, 0, r3
; P8BE-NEXT: xxsldwi v3, f0, f0, 1		; P8BE-NEXT: xxsldwi v3, f0, f0, 1
; P8BE-NEXT: vmrghw v2, v3, v2		; P8BE-NEXT: vmrghw v2, v3, v2
; P8BE-NEXT: blr		; P8BE-NEXT: blr
entry:		entry:
%arrayidx = getelementptr inbounds float, float* %f64, i64 1		%arrayidx = getelementptr inbounds float, float* %f64, i64 1
%0 = load float, float* %arrayidx, align 8		%0 = load float, float* %arrayidx, align 8
%vecins = insertelement <2 x float> %vec, float %0, i32 0		%vecins = insertelement <2 x float> %vec, float %0, i32 0
ret <2 x float> %vecins		ret <2 x float> %vecins
}		}

; Function Attrs: norecurse nounwind readonly		; Function Attrs: norecurse nounwind readonly
define <2 x float> @s2v_test_f5(<2 x float> %vec, float* nocapture readonly %ptr1) {		define <2 x float> @s2v_test_f5(<2 x float> %vec, float* nocapture readonly %ptr1) {
; P9LE-LABEL: s2v_test_f5:		; P9LE-LABEL: s2v_test_f5:
; P9LE: # %bb.0: # %entry		; P9LE: # %bb.0: # %entry
; P9LE-NEXT: lxsiwzx v3, 0, r5		; P9LE-NEXT: addis r3, r2, .LCPI9_0@toc@ha
; P9LE-NEXT: vmrglw v2, v2, v2		; P9LE-NEXT: lxsiwzx v4, 0, r5
; P9LE-NEXT: vmrghw v2, v2, v3		; P9LE-NEXT: addi r3, r3, .LCPI9_0@toc@l
		; P9LE-NEXT: lxvx v3, 0, r3
		; P9LE-NEXT: vperm v2, v2, v4, v3
; P9LE-NEXT: blr		; P9LE-NEXT: blr
;		;
; P9BE-LABEL: s2v_test_f5:		; P9BE-LABEL: s2v_test_f5:
; P9BE: # %bb.0: # %entry		; P9BE: # %bb.0: # %entry
; P9BE-NEXT: lfiwzx f0, 0, r5		; P9BE-NEXT: lfiwzx f0, 0, r5
; P9BE-NEXT: xxspltw v2, v2, 1		; P9BE-NEXT: xxspltw v2, v2, 1
; P9BE-NEXT: xxsldwi v3, f0, f0, 1		; P9BE-NEXT: xxsldwi v3, f0, f0, 1
; P9BE-NEXT: vmrghw v2, v3, v2		; P9BE-NEXT: vmrghw v2, v3, v2
; P9BE-NEXT: blr		; P9BE-NEXT: blr
;		;
; P8LE-LABEL: s2v_test_f5:		; P8LE-LABEL: s2v_test_f5:
; P8LE: # %bb.0: # %entry		; P8LE: # %bb.0: # %entry
; P8LE-NEXT: vmrglw v2, v2, v2		; P8LE-NEXT: addis r3, r2, .LCPI9_0@toc@ha
; P8LE-NEXT: lxsiwzx v3, 0, r5		; P8LE-NEXT: lxsiwzx v4, 0, r5
; P8LE-NEXT: vmrghw v2, v2, v3		; P8LE-NEXT: addi r3, r3, .LCPI9_0@toc@l
		; P8LE-NEXT: lvx v3, 0, r3
		; P8LE-NEXT: vperm v2, v2, v4, v3
; P8LE-NEXT: blr		; P8LE-NEXT: blr
;		;
; P8BE-LABEL: s2v_test_f5:		; P8BE-LABEL: s2v_test_f5:
; P8BE: # %bb.0: # %entry		; P8BE: # %bb.0: # %entry
; P8BE-NEXT: lfiwzx f0, 0, r5		; P8BE-NEXT: lfiwzx f0, 0, r5
; P8BE-NEXT: xxspltw v2, v2, 1		; P8BE-NEXT: xxspltw v2, v2, 1
; P8BE-NEXT: xxsldwi v3, f0, f0, 1		; P8BE-NEXT: xxsldwi v3, f0, f0, 1
; P8BE-NEXT: vmrghw v2, v3, v2		; P8BE-NEXT: vmrghw v2, v3, v2
; P8BE-NEXT: blr		; P8BE-NEXT: blr
entry:		entry:
%0 = load float, float* %ptr1, align 8		%0 = load float, float* %ptr1, align 8
%vecins = insertelement <2 x float> %vec, float %0, i32 0		%vecins = insertelement <2 x float> %vec, float %0, i32 0
ret <2 x float> %vecins		ret <2 x float> %vecins
}		}