This is an archive of the discontinued LLVM Phabricator instance.

Power9 Instructions for build_vector improvements
ClosedPublic

Authored by nemanjai on Jun 8 2016, 7:04 AM.

Details

Reviewers

tjablin
wschmidt
cycheng
kbarton
amehsan
hfinkel

Summary

This patch exploits the following Power9 instructions in order to improve some build_vector and extractelement patterns:
mtvsrws
lxvwsx
mtvsrdd
mfvsrld
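
For readers unfamiliar with these instructions, the following is a rough C++ model of what each one does at the register level, and of why the operand order of mtvsrdd differs between little-endian and big-endian targets. It is only a sketch: the VSReg type and all function names are hypothetical (named after the mnemonics), doublewords are numbered as in the ISA (doubleword 0 is the most significant half), and special cases of the real instructions are not modeled.

```cpp
#include <cassert>
#include <cstdint>

// Toy model of a 128-bit VSX register as two doublewords, using ISA
// numbering: dw0 is the most significant half, dw1 the least significant.
struct VSReg {
  std::uint64_t dw0;
  std::uint64_t dw1;
};

// mtvsrdd XT, RA, RB: RA goes to doubleword 0, RB to doubleword 1.
VSReg mtvsrdd(std::uint64_t ra, std::uint64_t rb) {
  VSReg r;
  r.dw0 = ra;
  r.dw1 = rb;
  return r;
}

// mfvsrld RA, XS: copy doubleword 1 (the low half) into a GPR.
std::uint64_t mfvsrld(const VSReg &xs) { return xs.dw1; }

// mtvsrws XT, RA: replicate the low 32 bits of RA into all four words.
VSReg mtvsrws(std::uint64_t ra) {
  std::uint64_t w = ra & 0xFFFFFFFFULL;
  std::uint64_t pair = (w << 32) | w;
  return mtvsrdd(pair, pair);
}

// lxvwsx XT, RA, RB: load one word from memory and splat it.
VSReg lxvwsx(const std::uint32_t *addr) { return mtvsrws(*addr); }

// Hypothetical helpers mirroring the build_vector patterns. On
// little-endian, vector element 0 lives in doubleword 1, so the GPR
// operands are swapped; on big-endian they are passed through in order.
VSReg buildV2I64LittleEndian(std::uint64_t e0, std::uint64_t e1) {
  return mtvsrdd(e1, e0);
}
VSReg buildV2I64BigEndian(std::uint64_t e0, std::uint64_t e1) {
  return mtvsrdd(e0, e1);
}

// Element 0 on little-endian and element 1 on big-endian both sit in
// doubleword 1, so a single mfvsrld extracts them without a swap.
std::uint64_t extractElt0LittleEndian(const VSReg &v) { return mfvsrld(v); }
std::uint64_t extractElt1BigEndian(const VSReg &v) { return mfvsrld(v); }
```

Under this model, buildV2I64LittleEndian(a, b) corresponds to the "mtvsrdd 34, 4, 3" check in test1 of the regression test, and buildV2I64BigEndian(a, b) to "mtvsrdd 34, 3, 4".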

Diff Detail

Repository: rL LLVM

Event Timeline

nemanjai updated this revision to Diff 60029. Jun 8 2016, 7:04 AM

nemanjai retitled this revision from to Power9 Instructions for build_vector improvements.

nemanjai updated this object.

nemanjai added reviewers: hfinkel, kbarton, amehsan, cycheng, wschmidt, tjablin.

nemanjai set the repository for this revision to rL LLVM.

nemanjai added a subscriber: echristo.

amehsan added inline comments. Jun 8 2016, 2:24 PM

lib/Target/PowerPC/PPCInstrVSX.td
2178–2180	I think this should depend on how the extracted element is going to be used. If the subsequent use is somehow in a VSX register we do not want to do this. For example if we extract the integer, then convert it to floating point and do some FP arithmetic on it.

amehsan added inline comments. Jun 8 2016, 2:51 PM

lib/Target/PowerPC/PPCISelLowering.cpp
7441–7451	Is this true even if LOAD has users other than SCALAR_TO_VECTOR?
lib/Target/PowerPC/PPCInstrVSX.td
2178–2180	I am not saying that all cases should be handled in this patch. The example that I provided may need to be handled in DAGCombine, and by the time we reach here this is probably the right decision. That does not need to be in this patch. But I want to make sure that after adding this code, we do not have patterns for which we generate slower code on pwr9 compared to pwr8.

nemanjai added inline comments. Jun 13 2016, 11:02 AM

lib/Target/PowerPC/PPCISelLowering.cpp
7441–7451	Ah, thanks for pointing this out. Yes, there's a missing check for hasOneUse() on the LOAD. It will be in the updated patch (along with a test case to ensure we don't get rid of the splat).
lib/Target/PowerPC/PPCInstrVSX.td
2178–2180	Yes, I think the right thing to do in these cases would be either a DAG combine or a peephole to look for where we move stuff out of VSX registers just to move them back in. In any case, the pattern for Power8 is a swap followed by a direct move. On Power9, we just avoid the initial swap.

Added the missing check for only one use of the load when deciding whether to eliminate the splat when building a vector of i32's on Power9.
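
The single-use condition can be illustrated outside of LLVM with a small toy model. This is only a sketch under stated assumptions: the Node type, the Opcode enum, and canReuseLoadAndSplat are hypothetical stand-ins, not SelectionDAG classes. The idea it illustrates is that the load can only be assumed to become a load-and-splat when nothing else consumes the loaded value.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for DAG node kinds.
enum class Opcode { Load, ScalarToVector, Other };

// Hypothetical toy node: an opcode, its operands, and how many users it has.
struct Node {
  Opcode op;
  std::vector<const Node *> operands;
  unsigned numUses;
};

// Returns true when the shuffle source is a scalar_to_vector fed directly
// by a load whose only user is that scalar_to_vector. Only then is it
// reasonable to assume the load is turned into a load-and-splat, which
// makes a separate splat redundant.
bool canReuseLoadAndSplat(const Node &src) {
  if (src.op != Opcode::ScalarToVector || src.operands.empty())
    return false;
  const Node *input = src.operands[0];
  return input->op == Opcode::Load && input->numUses == 1;
}
```

With a second user on the load, the function returns false and the explicit splat has to stay, which is the situation the added test case is meant to guard.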

amehsan added inline comments. Jun 13 2016, 8:51 PM

lib/Target/PowerPC/PPCInstrVSX.td
2178–2180	That problem already exists on PWR8. For

    define double @test2(<2 x i64> %a) {
    entry:
      %0 = extractelement <2 x i64> %a, i32 0
      %1 = sitofp i64 %0 to double
      ret double %1
    }

we generate

    xxswapd 0, 34
    mfvsrd 3, 0
    mtvsrd 0, 3
    xscvsxddp 1, 0
    blr

I will open a bugzilla item for this.

As we discussed, before you commit the change, please add -verify-machineinstrs to your regression tests. No need to upload the patch again. Thanks.

Some of the new instructions were being emitted for unintended code patterns (such as materializing a vector of zeros). The new sequences were inferior so this update ensures that we emit the better code sequence. For example, due to the "AddedComplexity", the initial patch emitted a load-immediate followed by a direct move for materializing ones or zeros into a vector. A vector of zeros can be produced with a single XXLXOR. A vector of ones can be produced by a splat-immediate (especially now that we have a VSX version of it).
Test case was modified accordingly.
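
As a rough illustration of why the single-instruction forms suffice, the toy C++ model below treats a vector register as 16 bytes. The Bytes16 alias and the function names are hypothetical (named after the mnemonics) and model only the data movement: XOR of any register with itself yields all zeros, and a byte splat-immediate fills every lane with the same value.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a 128-bit vector register as 16 byte lanes.
using Bytes16 = std::array<std::uint8_t, 16>;

// xxspltib XT, IMM8: replicate one byte immediate into all 16 lanes.
Bytes16 xxspltib(std::uint8_t imm) {
  Bytes16 v;
  v.fill(imm);
  return v;
}

// xxlxor XT, XA, XB: bitwise XOR of two registers.
Bytes16 xxlxor(const Bytes16 &a, const Bytes16 &b) {
  Bytes16 r;
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = static_cast<std::uint8_t>(a[i] ^ b[i]);
  return r;
}
```

Neither form needs a GPR or a direct move, which is why a load-immediate followed by a direct move is the inferior sequence for these constants.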

This patch was functionally tested on the Power9 simulator.

Herald added a subscriber: nemanjai. Jul 4 2016, 12:50 PM

This is perhaps minor, but we should rethink the change in PPCInstPrinter.cpp. If this change is needed, then we should change all the print routines in a similar manner.

lib/Target/PowerPC/InstPrinter/PPCInstPrinter.cpp
300	I'm not sure about this change. Why are we printing as unsigned int, instead of unsigned char? It seems like this method, and the method above (printU7ImmOperand) should be using (unsigned char) instead of (unsigned int). It looks like this was done with the printU10ImmOperand below (and probably others, but I didn't look exhaustively).

This revision now requires changes to proceed. Aug 31 2016, 9:43 AM

nemanjai added inline comments. Sep 12 2016, 8:48 AM

lib/Target/PowerPC/InstPrinter/PPCInstPrinter.cpp
300	This is a great point. I don't know why I didn't think of just casting to unsigned char which will implicitly truncate. I'll try that and re-post.

Updated the truncation of the 32-bit unsigned value to 8 bits in PPCInstPrinter.cpp.
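
The effect of truncating through unsigned char can be checked in isolation. The helper below is a hypothetical standalone sketch, not the code in PPCInstPrinter.cpp: it takes a signed immediate, as a BUILD_VECTOR operand would be, and returns the value that would be printed for an unsigned 8-bit field.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converting to unsigned char keeps the low 8 bits
// (the conversion is modulo 256), and returning it as unsigned widens it
// again so that a stream would print a number rather than a character.
unsigned truncateToU8(std::int64_t imm) {
  return static_cast<unsigned char>(imm);
}
```

Under this sketch, -128 maps to 128, -127 to 129, and -1 to 255, which matches the xxspltib immediates checked in test11, test13, and test12 of the regression test.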

LGTM

This revision is now accepted and ready to land. Sep 21 2016, 10:56 AM

Committed revision 282246.

jsji mentioned this in D105596: [PowerPC] Custom Lowering BUILD_VECTOR for v2i64 for P7 as well. Jul 7 2021, 2:59 PM

jsji mentioned this in rG2377eca93c03: [PowerPC] Custom Lowering BUILD_VECTOR for v2i64 for P7 as well. Jul 12 2021, 10:56 AM

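Revision Contents

Path	Size
lib/Target/PowerPC/InstPrinter/PPCInstPrinter.cpp	5 lines
lib/Target/PowerPC/PPCISelLowering.cpp	17 lines
lib/Target/PowerPC/PPCInstrInfo.td	1 line
lib/Target/PowerPC/PPCInstrVSX.td	33 lines
test/CodeGen/PowerPC/power9-moves-and-splats.ll	143 lines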

Diff 60029

lib/Target/PowerPC/InstPrinter/PPCInstPrinter.cpp

@@ ... @@
 void PPCInstPrinter::printU7ImmOperand(const MCInst *MI, unsigned OpNo,
                                        raw_ostream &O) {
   unsigned int Value = MI->getOperand(OpNo).getImm();
   assert(Value <= 127 && "Invalid u7imm argument!");
   O << (unsigned int)Value;
 }

+// Operands of BUILD_VECTOR are signed and we use this to print operands
+// of XXSPLTIB which are unsigned. So we simply truncate to 8 bits and
+// print as unsigned.
 void PPCInstPrinter::printU8ImmOperand(const MCInst *MI, unsigned OpNo,
                                        raw_ostream &O) {
   unsigned int Value = MI->getOperand(OpNo).getImm();
-  assert(Value <= 255 && "Invalid u8imm argument!");
+  Value &= 0xFF;
   O << (unsigned int)Value;
 }

 void PPCInstPrinter::printU10ImmOperand(const MCInst *MI, unsigned OpNo,
                                         raw_ostream &O) {
   unsigned short Value = MI->getOperand(OpNo).getImm();
   assert(Value <= 1023 && "Invalid u10imm argument!");
   O << (unsigned short)Value;

lib/Target/PowerPC/PPCISelLowering.cpp

@@ ... @@ if (Subtarget.hasVSX()) {
       addRegisterClass(MVT::v2i64, &PPC::VSRCRegClass);
     }

     if (Subtarget.hasP8Altivec()) {
       addRegisterClass(MVT::v2i64, &PPC::VRRCRegClass);
       addRegisterClass(MVT::v1i128, &PPC::VRRCRegClass);
     }

+    if (Subtarget.isISA3_0() && Subtarget.hasDirectMove())
+      setOperationAction(ISD::BUILD_VECTOR, MVT::v2i64, Legal);
   }

   if (Subtarget.hasQPX()) {
     setOperationAction(ISD::FADD, MVT::v4f64, Legal);
     setOperationAction(ISD::FSUB, MVT::v4f64, Legal);
     setOperationAction(ISD::FMUL, MVT::v4f64, Legal);
     setOperationAction(ISD::FREM, MVT::v4f64, Expand);

@@ ... @@ if (! BVN->isConstantSplat(APSplatBits, APSplatUndef, SplatBitSize,
     return SDValue();

   unsigned SplatBits = APSplatBits.getZExtValue();
   unsigned SplatUndef = APSplatUndef.getZExtValue();
   unsigned SplatSize = SplatBitSize / 8;

   // First, handle single instruction cases.

+  if (Subtarget.isISA3_0() && Op.getValueType() == MVT::v16i8)
+    return Op;
+
   // All zeros?
   if (SplatBits == 0) {
     // Canonicalize all zero vectors to be v4i32.
     if (Op.getValueType() != MVT::v4i32 || HasAnyUndefs) {
       SDValue Z = DAG.getConstant(0, dl, MVT::v4i32);
       Op = DAG.getNode(ISD::BITCAST, dl, Op.getValueType(), Z);
     }
     return Op;

@@ ... @@ SDValue PPCTargetLowering::LowerVECTOR_SHUFFLE(SDValue Op,
   SDValue V2 = Op.getOperand(1);
   ShuffleVectorSDNode *SVOp = cast<ShuffleVectorSDNode>(Op);
   EVT VT = Op.getValueType();
   bool isLittleEndian = Subtarget.isLittleEndian();

   if (Subtarget.hasVSX()) {
     if (V2.isUndef() && PPC::isSplatShuffleMask(SVOp, 4)) {
       int SplatIdx = PPC::getVSPLTImmediate(SVOp, 4, DAG);

+      // If the source for the shuffle is a scalar_to_vector that came from a
+      // 32-bit load, it will have used LXVWSX so we don't need to splat again.
+      if (Subtarget.isISA3_0() &&
+          ((isLittleEndian && SplatIdx == 3) ||
+           (!isLittleEndian && SplatIdx == 0))) {
+        SDValue Src = V1.getOperand(0);
+        if (Src.getOpcode() == ISD::SCALAR_TO_VECTOR &&
+            Src.getOperand(0).getOpcode() == ISD::LOAD)
+          return V1;
+      }
       SDValue Conv = DAG.getNode(ISD::BITCAST, dl, MVT::v4i32, V1);
       SDValue Splat = DAG.getNode(PPCISD::XXSPLT, dl, MVT::v4i32, Conv,
                                   DAG.getConstant(SplatIdx, dl, MVT::i32));
       return DAG.getNode(ISD::BITCAST, dl, MVT::v16i8, Splat);
     }
   }

   if (Subtarget.hasQPX()) {

lib/Target/PowerPC/PPCInstrInfo.td

@@ ... @@ def imm64SExt16 : Operand<i64>, ImmLeaf<i64, [{
   // sign extended field. Used by instructions like 'addi'.
   return (int64_t)Imm == (short)Imm;
 }]>;
 def immZExt16 : PatLeaf<(imm), [{
   // immZExt16 predicate - True if the immediate fits in a 16-bit zero extended
   // field. Used by instructions like 'ori'.
   return (uint64_t)N->getZExtValue() == (unsigned short)N->getZExtValue();
 }], LO16>;
+def immSExt8 : ImmLeaf<i32, [{ return isInt<8>(Imm); }]>;

 // imm16Shifted* - These match immediates where the low 16-bits are zero. There
 // are two forms: imm16ShiftedSExt and imm16ShiftedZExt. These two forms are
 // identical in 32-bit mode, but in 64-bit mode, they return true if the
 // immediate fits into a sign/zero extended 32-bit immediate (with the low bits
 // clear).
 def imm16ShiftedZExt : PatLeaf<(imm), [{
   // imm16ShiftedZExt predicate - True if only bits in the top 16-bits of the

lib/Target/PowerPC/PPCInstrVSX.td

@@ ... @@ def MTVSRWA : XX1_RS6_RD5_XO<31, 211, (outs vsfrc:$XT), (ins gprc:$rA),
                                [(set f64:$XT, (PPCmtvsra i32:$rA))]>;
   def MTVSRWZ : XX1_RS6_RD5_XO<31, 243, (outs vsfrc:$XT), (ins gprc:$rA),
                                "mtvsrwz $XT, $rA", IIC_VecGeneral,
                                [(set f64:$XT, (PPCmtvsrz i32:$rA))]>;
 } // HasDirectMove

 let Predicates = [IsISA3_0, HasDirectMove] in {
   def MTVSRWS: XX1_RS6_RD5_XO<31, 403, (outs vsrc:$XT), (ins gprc:$rA),
-                              "mtvsrws $XT, $rA", IIC_VecGeneral,
-                              []>;
+                              "mtvsrws $XT, $rA", IIC_VecGeneral, []>;

   def MTVSRDD: XX1Form<31, 435, (outs vsrc:$XT), (ins g8rc:$rA, g8rc:$rB),
                        "mtvsrdd $XT, $rA, $rB", IIC_VecGeneral,
                        []>, Requires<[In64BitMode]>;

   def MFVSRLD: XX1_RS6_RD5_XO<31, 307, (outs g8rc:$rA), (ins vsrc:$XT),
                               "mfvsrld $rA, $XT", IIC_VecGeneral,
                               []>, Requires<[In64BitMode]>;

@@ ... @@
   // bitconvert i64 -> f64
   // (move to FPR, nothing else needed)
   def : Pat<(f64 (bitconvert i64:$S)),
             (f64 (MTVSRD $S))>;
 }

 // The following VSX instructions were introduced in Power ISA 3.0
 def HasP9Vector : Predicate<"PPCSubTarget->hasP9Vector()">;
+let AddedComplexity = 400 in {
 let Predicates = [HasP9Vector] in {

   // [PO VRT XO VRB XO /]
   class X_VT5_XO5_VB5<bits<6> opcode, bits<5> xo2, bits<10> xo, string opc,
                       list<dag> pattern>
     : X_RD5_XO5_RS5<opcode, xo2, xo, (outs vrrc:$vT), (ins vrrc:$vB),
                     !strconcat(opc, " $vT, $vB"), IIC_VecFP, pattern>;

@@ ... @@ let Predicates = [HasP9Vector] in {

   // Store Vector Indexed
   def STXVX : X_XS6_RA5_RB5<31, 396, "stxvx" , vsrc, []>;

   // Store Vector (Left-justified) with Length
   def STXVL : X_XS6_RA5_RB5<31, 397, "stxvl" , vsrc, []>;
   def STXVLL : X_XS6_RA5_RB5<31, 429, "stxvll" , vsrc, []>;
   } // end mayStore

+  def : Pat<(v4i32 (scalar_to_vector (i32 (load xoaddr:$src)))),
+            (v4i32 (LXVWSX xoaddr:$src))>;
+  def : Pat<(v4f32 (scalar_to_vector (f32 (load xoaddr:$src)))),
+            (v4f32 (LXVWSX xoaddr:$src))>;
+  def : Pat<(v4i32 (build_vector i32:$A, i32:$A, i32:$A, i32:$A)),
+            (v4i32 (MTVSRWS $A))>;
+  def : Pat<(v16i8 (build_vector immSExt8:$A, immSExt8:$A, immSExt8:$A,
+                                 immSExt8:$A, immSExt8:$A, immSExt8:$A,
+                                 immSExt8:$A, immSExt8:$A, immSExt8:$A,
+                                 immSExt8:$A, immSExt8:$A, immSExt8:$A,
+                                 immSExt8:$A, immSExt8:$A, immSExt8:$A,
+                                 immSExt8:$A)),
+            (v16i8 (COPY_TO_REGCLASS (XXSPLTIB imm:$A), VSRC))>;
 } // end HasP9Vector
+} // AddedComplexity
+
+let Predicates = [IsISA3_0, HasDirectMove, IsLittleEndian] in {
+  def : Pat<(v2i64 (build_vector i64:$rA, i64:$rB)),
+            (v2i64 (MTVSRDD $rB, $rA))>;
+  def : Pat<(i64 (extractelt v2i64:$A, 0)),
+            (i64 (MFVSRLD $A))>;
+}
+
+let Predicates = [IsISA3_0, HasDirectMove, IsBigEndian] in {
+  def : Pat<(v2i64 (build_vector i64:$rB, i64:$rA)),
+            (v2i64 (MTVSRDD $rB, $rA))>;
+  def : Pat<(i64 (extractelt v2i64:$A, 1)),
+            (i64 (MFVSRLD $A))>;
+}

test/CodeGen/PowerPC/power9-moves-and-splats.ll

				; RUN: llc -mcpu=pwr9 -mtriple=powerpc64le-unknown-linux-gnu < %s | FileCheck %s
				; RUN: llc -mcpu=pwr9 -mtriple=powerpc64-unknown-linux-gnu < %s | FileCheck %s \
				; RUN: --check-prefix=CHECK-BE

				@Globi = external global i32, align 4
				@Globf = external global float, align 4

				define <2 x i64> @test1(i64 %a, i64 %b) {
				entry:
				; CHECK-LABEL: test1
				; CHECK: mtvsrdd 34, 4, 3
				; CHECK-BE-LABEL: test1
				; CHECK-BE: mtvsrdd 34, 3, 4
				%vecins = insertelement <2 x i64> undef, i64 %a, i32 0
				%vecins1 = insertelement <2 x i64> %vecins, i64 %b, i32 1
				ret <2 x i64> %vecins1
				}

				define i64 @test2(<2 x i64> %a) {
				entry:
				; CHECK-LABEL: test2
				; CHECK: mfvsrld 3, 34
				%0 = extractelement <2 x i64> %a, i32 0
				ret i64 %0
				}

				define i64 @test3(<2 x i64> %a) {
				entry:
				; CHECK-BE-LABEL: test3
				; CHECK-BE: mfvsrld 3, 34
				%0 = extractelement <2 x i64> %a, i32 1
				ret i64 %0
				}

				define <4 x i32> @test4(i32* nocapture readonly %in) {
				entry:
				; CHECK-LABEL: test4
				; CHECK: lxvwsx 34, 0, 3
				; CHECK-BE-LABEL: test4
				; CHECK-BE: lxvwsx 34, 0, 3
				%0 = load i32, i32* %in, align 4
				%splat.splatinsert = insertelement <4 x i32> undef, i32 %0, i32 0
				%splat.splat = shufflevector <4 x i32> %splat.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				ret <4 x i32> %splat.splat
				}

				define <4 x float> @test5(float* nocapture readonly %in) {
				entry:
				; CHECK-LABEL: test5
				; CHECK: lxvwsx 34, 0, 3
				; CHECK-BE-LABEL: test5
				; CHECK-BE: lxvwsx 34, 0, 3
				%0 = load float, float* %in, align 4
				%splat.splatinsert = insertelement <4 x float> undef, float %0, i32 0
				%splat.splat = shufflevector <4 x float> %splat.splatinsert, <4 x float> undef, <4 x i32> zeroinitializer
				ret <4 x float> %splat.splat
				}

				define <4 x i32> @test6() {
				entry:
				; CHECK-LABEL: test6
				; CHECK: addis
				; CHECK: ld [[TOC:[0-9]+]], .LC0
				; CHECK: lxvwsx 34, 0, 3
				; CHECK-BE-LABEL: test6
				; CHECK-BE: addis
				; CHECK-BE: ld [[TOC:[0-9]+]], .LC0
				; CHECK-BE: lxvwsx 34, 0, 3
				%0 = load i32, i32* @Globi, align 4
				%splat.splatinsert = insertelement <4 x i32> undef, i32 %0, i32 0
				%splat.splat = shufflevector <4 x i32> %splat.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				ret <4 x i32> %splat.splat
				}

				define <4 x float> @test7() {
				entry:
				; CHECK-LABEL: test7
				; CHECK: addis
				; CHECK: ld [[TOC:[0-9]+]], .LC1
				; CHECK: lxvwsx 34, 0, 3
				; CHECK-BE-LABEL: test7
				; CHECK-BE: addis
				; CHECK-BE: ld [[TOC:[0-9]+]], .LC1
				; CHECK-BE: lxvwsx 34, 0, 3
				%0 = load float, float* @Globf, align 4
				%splat.splatinsert = insertelement <4 x float> undef, float %0, i32 0
				%splat.splat = shufflevector <4 x float> %splat.splatinsert, <4 x float> undef, <4 x i32> zeroinitializer
				ret <4 x float> %splat.splat
				}

				define <16 x i8> @test8() {
				entry:
				; CHECK-LABEL: test8
				; CHECK: xxspltib 34, 0
				; CHECK-BE-LABEL: test8
				; CHECK-BE: xxspltib 34, 0
				ret <16 x i8> zeroinitializer
				}

				define <16 x i8> @test9() {
				entry:
				; CHECK-LABEL: test9
				; CHECK: xxspltib 34, 1
				; CHECK-BE-LABEL: test9
				; CHECK-BE: xxspltib 34, 1
				ret <16 x i8> <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
				}

				define <16 x i8> @test10() {
				entry:
				; CHECK-LABEL: test10
				; CHECK: xxspltib 34, 127
				; CHECK-BE-LABEL: test10
				; CHECK-BE: xxspltib 34, 127
				ret <16 x i8> <i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127>
				}

				define <16 x i8> @test11() {
				entry:
				; CHECK-LABEL: test11
				; CHECK: xxspltib 34, 128
				; CHECK-BE-LABEL: test11
				; CHECK-BE: xxspltib 34, 128
				ret <16 x i8> <i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>
				}

				define <16 x i8> @test12() {
				entry:
				; CHECK-LABEL: test12
				; CHECK: xxspltib 34, 255
				; CHECK-BE-LABEL: test12
				; CHECK-BE: xxspltib 34, 255
				ret <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>
				}

				define <16 x i8> @test13() {
				entry:
				; CHECK-LABEL: test13
				; CHECK: xxspltib 34, 129
				; CHECK-BE-LABEL: test13
				; CHECK-BE: xxspltib 34, 129
				ret <16 x i8> <i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127>
				}