This is an archive of the discontinued LLVM Phabricator instance.

Power9 Instructions for build_vector improvements
Closed, Public

Authored by nemanjai on Jun 8 2016, 7:04 AM.


Details

Reviewers

tjablin
wschmidt
cycheng
kbarton
amehsan
hfinkel

Summary

This patch exploits the following instructions to improve some build_vector and extractelement patterns:
mtvsrws
lxvwsx
mtvsrdd
mfvsrld
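
As a rough illustration of what these four instructions do, the following C++ sketch models a 128-bit VSX register as two doublewords. It is a toy model under stated assumptions, not the patch's code: the `VSR` struct and the `buildV2I64LE` and `demo` helpers are hypothetical names, and the model passes values rather than register numbers. Power8's `mfvsrd` and `xxswapd` are included only for comparison.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Toy model (assumption): a VSX register as two doublewords, numbered as in
// the ISA, so dword[0] is the high-order half and dword[1] the low-order half.
struct VSR {
  std::array<uint64_t, 2> dword;
};

// mtvsrdd: doubleword 0 <- first source, doubleword 1 <- second source.
VSR mtvsrdd(uint64_t ra, uint64_t rb) { return VSR{{ra, rb}}; }

// mfvsrld: return doubleword 1 (the low-order doubleword).
uint64_t mfvsrld(const VSR &xs) { return xs.dword[1]; }

// mfvsrd (Power8 direct move): return doubleword 0.
uint64_t mfvsrd(const VSR &xs) { return xs.dword[0]; }

// xxswapd: exchange the two doublewords.
VSR xxswapd(const VSR &xs) { return VSR{{xs.dword[1], xs.dword[0]}}; }

// mtvsrws: splat the low 32 bits of the source into all four words.
VSR mtvsrws(uint64_t ra) {
  uint64_t word = ra & 0xFFFFFFFFull;
  uint64_t pair = (word << 32) | word;
  return VSR{{pair, pair}};
}

// lxvwsx: load one 32-bit word from memory and splat it.
VSR lxvwsx(const uint32_t *addr) { return mtvsrws(*addr); }

// Hypothetical helper: a little-endian <2 x i64> {e0, e1}. On little-endian
// targets element 0 lives in doubleword 1, hence the swapped operands.
VSR buildV2I64LE(uint64_t e0, uint64_t e1) { return mtvsrdd(e1, e0); }

// Hypothetical demo: extracting element 0 on little-endian.
void demo() {
  VSR v = buildV2I64LE(11, 22);
  // Power8 needs a swap before the direct move; Power9 reads dword 1 directly.
  assert(mfvsrd(xxswapd(v)) == 11);
  assert(mfvsrld(v) == 11);
}
```

This lines up with the little-endian check in the patch's test file, where `test1(i64 %a, i64 %b)` is expected to produce `mtvsrdd 34, 4, 3`, that is, `%b` in doubleword 0 and `%a` in doubleword 1.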

Diff Detail

Repository: rL LLVM

Event Timeline

nemanjai updated this revision to Diff 60029. Jun 8 2016, 7:04 AM

nemanjai retitled this revision to Power9 Instructions for build_vector improvements.

nemanjai updated this object.

nemanjai added reviewers: hfinkel, kbarton, amehsan, cycheng, wschmidt, tjablin.

nemanjai set the repository for this revision to rL LLVM.

nemanjai added a subscriber: echristo.

amehsan added inline comments. Jun 8 2016, 2:24 PM

lib/Target/PowerPC/PPCInstrVSX.td
2243–2245	I think this should depend on how the extracted element is going to be used. If the subsequent use is somehow in a VSX register we do not want to do this. For example if we extract the integer, then convert it to floating point and do some FP arithmetic on it.

amehsan added inline comments. Jun 8 2016, 2:51 PM

lib/Target/PowerPC/PPCISelLowering.cpp
7491–7501	Is this true even if LOAD has users other than SCALAR_TO_VECTOR?
lib/Target/PowerPC/PPCInstrVSX.td
2243–2245	I am not saying that all cases should be handled in this patch. The example that I provided may need to be handled in DAGCombine, and by the time we reach here this is probably the right decision. That does not need to be in this patch. But I want to make sure that after adding this code, we do not have patterns for which we generate slower code on pwr9 compared to pwr8.

nemanjai added inline comments. Jun 13 2016, 11:02 AM

lib/Target/PowerPC/PPCISelLowering.cpp
7491–7501	Ah, thanks for pointing this out. Yes, there's a missing check for hasOneUse() on the LOAD. It will be in the updated patch (along with a test case to ensure we don't get rid of the splat).
lib/Target/PowerPC/PPCInstrVSX.td
2243–2245	Yes, I think the right thing to do in these cases would be either a DAG combine or a peephole to look for where we move stuff out of VSX registers just to move them back in. In any case, the pattern for Power8 is a swap followed by a direct move. On Power9, we just avoid the initial swap.

Added the missing check for only one use of the load when deciding whether to eliminate the splat when building a vector of i32's on Power9.
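
The reasoning behind the one-use check can be sketched with a toy DAG. This is a minimal model, not LLVM's SelectionDAG: the `Node` struct, the `Opcode` enum, and the `canDropSplat` helper are hypothetical names introduced for illustration. The idea is that the splat is redundant only if the load itself can become a load-and-splat, which cannot be assumed when something else also consumes the loaded scalar.

```cpp
#include <vector>

// Hypothetical, heavily simplified stand-ins for DAG opcodes and nodes.
enum class Opcode { Load, ScalarToVector, Add, Other };

struct Node {
  Opcode op;
  std::vector<const Node *> operands;
  int numUses; // how many nodes consume this node's value
};

// Mirrors the shape of the check discussed above: the source must be a
// scalar_to_vector fed by a load, and that load must have exactly one use.
bool canDropSplat(const Node &src) {
  if (src.op != Opcode::ScalarToVector || src.operands.empty())
    return false;
  const Node *input = src.operands[0];
  return input->op == Opcode::Load && input->numUses == 1;
}
```

With a second user of the load (for example an add of the loaded value, as in `test14` of the patch's test file), the helper returns false and the splat has to stay.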

amehsan added inline comments. Jun 13 2016, 8:51 PM

lib/Target/PowerPC/PPCInstrVSX.td
2243–2245	That problem already exists on PWR8. For

    define double @test2(<2 x i64> %a) {
    entry:
      %0 = extractelement <2 x i64> %a, i32 0
      %1 = sitofp i64 %0 to double
      ret double %1
    }

we generate

    xxswapd 0, 34
    mfvsrd 3, 0
    mtvsrd 0, 3
    xscvsxddp 1, 0
    blr

I will open a bugzilla item for this.

As we discussed, before you commit the change, please add -verify-machineinstrs to your regression tests. No need to upload the patch again. Thanks.

Some of the new instructions were being emitted for unintended code patterns (such as materializing a vector of zeros). The new sequences were inferior so this update ensures that we emit the better code sequence. For example, due to the "AddedComplexity", the initial patch emitted a load-immediate followed by a direct move for materializing ones or zeros into a vector. A vector of zeros can be produced with a single XXLXOR. A vector of ones can be produced by a splat-immediate (especially now that we have a VSX version of it).
Test case was modified accordingly.
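
Why a single XOR can produce zeros and a single byte splat can produce all-ones for any element width can be checked with a small byte-level model. This is only a sketch: `Bytes`, `xorSelf`, `splatByte`, and `asWords` are hypothetical names, and the model works on plain byte arrays rather than real vector registers.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Bytes = std::array<uint8_t, 16>;

// XOR-ing a value with itself gives zero whatever the input holds, so no
// immediate has to be loaded first.
Bytes xorSelf(const Bytes &v) {
  Bytes r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = static_cast<uint8_t>(v[i] ^ v[i]);
  return r;
}

// Model of a byte splat-immediate: every byte receives the same value.
Bytes splatByte(uint8_t imm) {
  Bytes r;
  r.fill(imm);
  return r;
}

// Reinterpret the 16 bytes as four 32-bit words.
std::array<uint32_t, 4> asWords(const Bytes &b) {
  std::array<uint32_t, 4> w;
  std::memcpy(w.data(), b.data(), sizeof(w));
  return w;
}
```

Splatting the byte 255 sets every bit, so the same register contents serve as an all-ones vector of bytes, halfwords, words, or doublewords, independent of endianness.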

This patch was functionally tested on the Power9 simulator.

Herald added a subscriber: nemanjai. Jul 4 2016, 12:50 PM

This is perhaps minor, but we should rethink the change in PPCInstPrinter.cpp. If this change is needed, then we should change all the print routines in a similar manner.

lib/Target/PowerPC/InstPrinter/PPCInstPrinter.cpp
328	I'm not sure about this change. Why are we printing as unsigned int, instead of unsigned char? It seems like this method, and the method above (printU7ImmOperand) should be using (unsigned char) instead of (unsigned int). It looks like this was done with the printU10ImmOperand below (and probably others, but I didn't look exhaustively).

This revision now requires changes to proceed. Aug 31 2016, 9:43 AM

nemanjai added inline comments. Sep 12 2016, 8:48 AM

lib/Target/PowerPC/InstPrinter/PPCInstPrinter.cpp
328	This is a great point. I don't know why I didn't think of just casting to unsigned char which will implicitly truncate. I'll try that and re-post.

Updated the truncation of the 32-bit unsigned value to 8 bits in PPCInstPrinter.cpp.
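
The effect of truncating through `unsigned char` can be seen in isolation with a small sketch. `formatU8Imm` is a hypothetical stand-in, not the printer method itself: it takes the signed immediate, truncates it to 8 bits, and widens it again so that it is formatted as a number rather than as a character.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: format a signed immediate as its unsigned 8-bit value.
std::string formatU8Imm(int64_t imm) {
  // Conversion to unsigned char keeps only the low 8 bits (modulo 256).
  unsigned char value = static_cast<unsigned char>(imm);
  // Widen before formatting so the value is treated as an integer.
  return std::to_string(static_cast<unsigned int>(value));
}
```

Under this model a splat of `i8 -1` prints as 255 and `i8 -128` prints as 128, which is what the `xxspltib` checks in the test file expect.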

LGTM

This revision is now accepted and ready to land. Sep 21 2016, 10:56 AM

Committed revision 282246.

jsji mentioned this in D105596: [PowerPC] Custom Lowering BUILD_VECTOR for v2i64 for P7 as well. Jul 7 2021, 2:59 PM

jsji mentioned this in rG2377eca93c03: [PowerPC] Custom Lowering BUILD_VECTOR for v2i64 for P7 as well. Jul 12 2021, 10:56 AM

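Revision Contents

Path	Size
lib/Target/PowerPC/InstPrinter/PPCInstPrinter.cpp	6 lines
lib/Target/PowerPC/PPCISelLowering.cpp	40 lines
lib/Target/PowerPC/PPCInstrFormats.td	7 lines
lib/Target/PowerPC/PPCInstrInfo.td	1 line
lib/Target/PowerPC/PPCInstrVSX.td	48 lines
test/CodeGen/PowerPC/power9-moves-and-splats.ll	167 lines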

Diff 71406

lib/Target/PowerPC/InstPrinter/PPCInstPrinter.cpp

[317 lines hidden]
 void PPCInstPrinter::printU7ImmOperand(const MCInst *MI, unsigned OpNo,
                                        raw_ostream &O) {
   unsigned int Value = MI->getOperand(OpNo).getImm();
   assert(Value <= 127 && "Invalid u7imm argument!");
   O << (unsigned int)Value;
 }

+// Operands of BUILD_VECTOR are signed and we use this to print operands
+// of XXSPLTIB which are unsigned. So we simply truncate to 8 bits and
+// print as unsigned.
 void PPCInstPrinter::printU8ImmOperand(const MCInst *MI, unsigned OpNo,
                                        raw_ostream &O) {
-  unsigned int Value = MI->getOperand(OpNo).getImm();
-  assert(Value <= 255 && "Invalid u8imm argument!");
+  unsigned char Value = MI->getOperand(OpNo).getImm();
   O << (unsigned int)Value;
 }

 void PPCInstPrinter::printU10ImmOperand(const MCInst *MI, unsigned OpNo,
                                         raw_ostream &O) {
   unsigned short Value = MI->getOperand(OpNo).getImm();
   assert(Value <= 1023 && "Invalid u10imm argument!");
   O << (unsigned short)Value;
[144 lines hidden]

lib/Target/PowerPC/PPCISelLowering.cpp

[666 lines hidden; enclosing context: if (Subtarget.hasP8Altivec()) {]
       addRegisterClass(MVT::v2i64, &PPC::VRRCRegClass);
       addRegisterClass(MVT::v1i128, &PPC::VRRCRegClass);
     }

     if (Subtarget.hasP9Vector()) {
       setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v4i32, Custom);
       setOperationAction(ISD::INSERT_VECTOR_ELT, MVT::v4f32, Custom);
     }
+
+    if (Subtarget.isISA3_0() && Subtarget.hasDirectMove())
+      setOperationAction(ISD::BUILD_VECTOR, MVT::v2i64, Legal);
   }

   if (Subtarget.hasQPX()) {
     setOperationAction(ISD::FADD, MVT::v4f64, Legal);
     setOperationAction(ISD::FSUB, MVT::v4f64, Legal);
     setOperationAction(ISD::FMUL, MVT::v4f64, Legal);
     setOperationAction(ISD::FREM, MVT::v4f64, Expand);

[6,391 lines hidden; enclosing context: static SDValue BuildVSLDOI(SDValue LHS, SDValue RHS, unsigned Amt, EVT VT,]

   int Ops[16];
   for (unsigned i = 0; i != 16; ++i)
     Ops[i] = i + Amt;
   SDValue T = DAG.getVectorShuffle(MVT::v16i8, dl, LHS, RHS, Ops);
   return DAG.getNode(ISD::BITCAST, dl, VT, T);
 }

+static bool isNonConstSplatBV(BuildVectorSDNode *BVN, EVT Type) {
+  if (BVN->getValueType(0) != Type)
+    return false;
+  auto OpZero = BVN->getOperand(0);
+  for (int i = 1, e = BVN->getNumOperands(); i < e; i++)
+    if (BVN->getOperand(i) != OpZero)
+      return false;
+  return true;
+}
+
 // If this is a case we can't handle, return null and let the default
 // expansion code take care of it. If we CAN select this case, and if it
 // selects to a single instruction, return Op. Otherwise, if we can codegen
 // this case more efficiently than a constant pool load, lower it to the
 // sequence of ops that should be used.
 SDValue PPCTargetLowering::LowerBUILD_VECTOR(SDValue Op,
                                              SelectionDAG &DAG) const {
   SDLoc dl(Op);
[105 lines hidden; enclosing context: if (Subtarget.hasQPX())]
     return SDValue();

   // Check if this is a splat of a constant value.
   APInt APSplatBits, APSplatUndef;
   unsigned SplatBitSize;
   bool HasAnyUndefs;
   if (! BVN->isConstantSplat(APSplatBits, APSplatUndef, SplatBitSize,
                              HasAnyUndefs, 0, !Subtarget.isLittleEndian()) ||
-      SplatBitSize > 32)
+      SplatBitSize > 32) {
+    // We can splat a non-const value on CPU's that implement ISA 3.0
+    // in two ways: LXVWSX (load and splat) and MTVSRWS(move and splat).
+    auto OpZero = BVN->getOperand(0);
+    bool CanLoadAndSplat = OpZero.getOpcode() == ISD::LOAD &&
+                           BVN->isOnlyUserOf(OpZero.getNode());
+    if (Subtarget.isISA3_0() &&
+        isNonConstSplatBV(BVN, MVT::v4i32) && !CanLoadAndSplat)
+      return Op;
     return SDValue();
+  }

   unsigned SplatBits = APSplatBits.getZExtValue();
   unsigned SplatUndef = APSplatUndef.getZExtValue();
   unsigned SplatSize = SplatBitSize / 8;

   // First, handle single instruction cases.

   // All zeros?
   if (SplatBits == 0) {
     // Canonicalize all zero vectors to be v4i32.
     if (Op.getValueType() != MVT::v4i32 || HasAnyUndefs) {
       SDValue Z = DAG.getConstant(0, dl, MVT::v4i32);
       Op = DAG.getNode(ISD::BITCAST, dl, Op.getValueType(), Z);
     }
     return Op;
   }

+  // We have XXSPLTIB for constant splats one byte wide
+  if (Subtarget.isISA3_0() && Op.getValueType() == MVT::v16i8)
+    return Op;
+
   // If the sign extended value is in the range [-16,15], use VSPLTI[bhw].
   int32_t SextVal= (int32_t(SplatBits << (32-SplatBitSize)) >>
                     (32-SplatBitSize));
   if (SextVal >= -16 && SextVal <= 15)
     return BuildSplatI(SextVal, SplatSize, Op.getValueType(), DAG, dl);

   // Two instruction sequences.

[227 lines hidden; enclosing context: if (Subtarget.hasP9Vector() &&]
     SDValue Ins = DAG.getNode(PPCISD::XXINSERT, dl, MVT::v4i32, Conv1, Conv2,
                               DAG.getConstant(InsertAtByte, dl, MVT::i32));
     return DAG.getNode(ISD::BITCAST, dl, MVT::v16i8, Ins);
   }

   if (Subtarget.hasVSX()) {
     if (V2.isUndef() && PPC::isSplatShuffleMask(SVOp, 4)) {
       int SplatIdx = PPC::getVSPLTImmediate(SVOp, 4, DAG);

+      // If the source for the shuffle is a scalar_to_vector that came from a
+      // 32-bit load, it will have used LXVWSX so we don't need to splat again.
+      if (Subtarget.isISA3_0() &&
+          ((isLittleEndian && SplatIdx == 3) ||
+           (!isLittleEndian && SplatIdx == 0))) {
+        SDValue Src = V1.getOperand(0);
+        if (Src.getOpcode() == ISD::SCALAR_TO_VECTOR &&
+            Src.getOperand(0).getOpcode() == ISD::LOAD &&
+            Src.getOperand(0).hasOneUse())
+          return V1;
+      }
       SDValue Conv = DAG.getNode(ISD::BITCAST, dl, MVT::v4i32, V1);
       SDValue Splat = DAG.getNode(PPCISD::XXSPLT, dl, MVT::v4i32, Conv,
                                   DAG.getConstant(SplatIdx, dl, MVT::i32));
       return DAG.getNode(ISD::BITCAST, dl, MVT::v16i8, Splat);
     }

     // Left shifts of 8 bytes are actually swaps. Convert accordingly.
     if (V2.isUndef() && PPC::isVSLDOIShuffleMask(SVOp, 1, DAG) == 8) {
[4,775 lines hidden]

lib/Target/PowerPC/PPCInstrFormats.td

[1,053 lines hidden; enclosing context: class XX3Form<bits<6> opcode, bits<8> xo, dag OOL, dag IOL, string asmstr,]
   let Inst{11-15} = XA{4-0};
   let Inst{16-20} = XB{4-0};
   let Inst{21-28} = xo;
   let Inst{29} = XA{5};
   let Inst{30} = XB{5};
   let Inst{31} = XT{5};
 }

+class XX3Form_Zero<bits<6> opcode, bits<8> xo, dag OOL, dag IOL, string asmstr,
+                   InstrItinClass itin, list<dag> pattern>
+  : XX3Form<opcode, xo, OOL, IOL, asmstr, itin, pattern> {
+  let XA = XT;
+  let XB = XT;
+}
+
 class XX3Form_1<bits<6> opcode, bits<8> xo, dag OOL, dag IOL, string asmstr,
                 InstrItinClass itin, list<dag> pattern>
   : I<opcode, OOL, IOL, asmstr, itin> {
   bits<3> CR;
   bits<6> XA;
   bits<6> XB;

   let Pattern = pattern;
[899 lines hidden]

lib/Target/PowerPC/PPCInstrInfo.td

[306 lines hidden; enclosing context: def imm64SExt16 : Operand<i64>, ImmLeaf<i64, [{]
   // sign extended field. Used by instructions like 'addi'.
   return (int64_t)Imm == (short)Imm;
 }]>;
 def immZExt16 : PatLeaf<(imm), [{
   // immZExt16 predicate - True if the immediate fits in a 16-bit zero extended
   // field. Used by instructions like 'ori'.
   return (uint64_t)N->getZExtValue() == (unsigned short)N->getZExtValue();
 }], LO16>;
+def immSExt8 : ImmLeaf<i32, [{ return isInt<8>(Imm); }]>;

 // imm16Shifted* - These match immediates where the low 16-bits are zero. There
 // are two forms: imm16ShiftedSExt and imm16ShiftedZExt. These two forms are
 // identical in 32-bit mode, but in 64-bit mode, they return true if the
 // immediate fits into a sign/zero extended 32-bit immediate (with the low bits
 // clear).
 def imm16ShiftedZExt : PatLeaf<(imm), [{
   // imm16ShiftedZExt predicate - True if only bits in the top 16-bits of the
[4,056 lines hidden]

lib/Target/PowerPC/PPCInstrVSX.td

[755 lines hidden; enclosing context: let Uses = [RM] in {]
   def XXLORf: XX3Form<60, 146,
                       (outs vsfrc:$XT), (ins vsfrc:$XA, vsfrc:$XB),
                       "xxlor $XT, $XA, $XB", IIC_VecGeneral, []>;
   def XXLXOR : XX3Form<60, 154,
                        (outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB),
                        "xxlxor $XT, $XA, $XB", IIC_VecGeneral,
                        [(set v4i32:$XT, (xor v4i32:$XA, v4i32:$XB))]>;
   } // isCommutable
+  let isCodeGenOnly = 1 in
+  def XXLXORz : XX3Form_Zero<60, 154, (outs vsrc:$XT), (ins),
+                             "xxlxor $XT, $XT, $XT", IIC_VecGeneral,
+                             [(set v4i32:$XT, (v4i32 immAllZerosV))]>;

   // Permutation Instructions
   def XXMRGHW : XX3Form<60, 18,
                         (outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB),
                         "xxmrghw $XT, $XA, $XB", IIC_VecPerm, []>;
   def XXMRGLW : XX3Form<60, 50,
                         (outs vsrc:$XT), (ins vsrc:$XA, vsrc:$XB),
                         "xxmrglw $XT, $XA, $XB", IIC_VecPerm, []>;
[527 lines hidden; enclosing context: def MTVSRWA : XX1_RS6_RD5_XO<31, 211, (outs vsfrc:$XT), (ins gprc:$rA),]
                                [(set f64:$XT, (PPCmtvsra i32:$rA))]>;
   def MTVSRWZ : XX1_RS6_RD5_XO<31, 243, (outs vsfrc:$XT), (ins gprc:$rA),
                                "mtvsrwz $XT, $rA", IIC_VecGeneral,
                                [(set f64:$XT, (PPCmtvsrz i32:$rA))]>;
 } // HasDirectMove

 let Predicates = [IsISA3_0, HasDirectMove] in {
   def MTVSRWS: XX1_RS6_RD5_XO<31, 403, (outs vsrc:$XT), (ins gprc:$rA),
-                              "mtvsrws $XT, $rA", IIC_VecGeneral,
-                              []>;
+                              "mtvsrws $XT, $rA", IIC_VecGeneral, []>;

   def MTVSRDD: XX1Form<31, 435, (outs vsrc:$XT), (ins g8rc:$rA, g8rc:$rB),
                        "mtvsrdd $XT, $rA, $rB", IIC_VecGeneral,
                        []>, Requires<[In64BitMode]>;

   def MFVSRLD: XX1_RS6_RD5_XO<31, 307, (outs g8rc:$rA), (ins vsrc:$XT),
                               "mfvsrld $rA, $XT", IIC_VecGeneral,
                               []>, Requires<[In64BitMode]>;
[547 lines hidden; enclosing context: def : Pat<(f64 (bitconvert i64:$S)),]
             (f64 (MTVSRD $S))>;
 }

 def AlignValues {
   dag F32_TO_BE_WORD1 = (v4f32 (XXSLDWI (XSCVDPSPN $B), (XSCVDPSPN $B), 3));
   dag I32_TO_BE_WORD1 = (COPY_TO_REGCLASS (MTVSRWZ $B), VSRC);
 }

+// Materialize a zero-vector of long long
+def : Pat<(v2i64 immAllZerosV),
+          (v2i64 (XXLXORz))>;
+
 // The following VSX instructions were introduced in Power ISA 3.0
 def HasP9Vector : Predicate<"PPCSubTarget->hasP9Vector()">;
 let AddedComplexity = 400, Predicates = [HasP9Vector] in {

   // [PO VRT XO VRB XO /]
   class X_VT5_XO5_VB5<bits<6> opcode, bits<5> xo2, bits<10> xo, string opc,
                       list<dag> pattern>
     : X_RD5_XO5_RS5<opcode, xo2, xo, (outs vrrc:$vT), (ins vrrc:$vB),
[348 lines hidden; enclosing context: let AddedComplexity = 400, Predicates = [HasP9Vector] in {]
   def STXVLL : X_XS6_RA5_RB5<31, 429, "stxvll" , vsrc, []>;
   } // end mayStore

   // Patterns for which instructions from ISA 3.0 are a better match
   let Predicates = [IsLittleEndian, HasP9Vector] in {
   def : Pat<(f32 (PPCfcfidus (PPCmtvsrz (i32 (extractelt v4i32:$A, 0))))),
             (f32 (XSCVUXDSP (XXEXTRACTUW $A, 12)))>;
   def : Pat<(f32 (PPCfcfidus (PPCmtvsrz (i32 (extractelt v4i32:$A, 1))))),
             (f32 (XSCVUXDSP (XXEXTRACTUW $A, 8)))>;
   def : Pat<(f32 (PPCfcfidus (PPCmtvsrz (i32 (extractelt v4i32:$A, 2))))),
             (f32 (XSCVUXDSP (XXEXTRACTUW $A, 4)))>;
   def : Pat<(f32 (PPCfcfidus (PPCmtvsrz (i32 (extractelt v4i32:$A, 3))))),
             (f32 (XSCVUXDSP (XXEXTRACTUW $A, 0)))>;
   def : Pat<(v4i32 (insertelt v4i32:$A, i32:$B, 0)),
             (v4i32 (XXINSERTW v4i32:$A, AlignValues.I32_TO_BE_WORD1, 12))>;
   def : Pat<(v4i32 (insertelt v4i32:$A, i32:$B, 1)),
             (v4i32 (XXINSERTW v4i32:$A, AlignValues.I32_TO_BE_WORD1, 8))>;
   def : Pat<(v4i32 (insertelt v4i32:$A, i32:$B, 2)),
             (v4i32 (XXINSERTW v4i32:$A, AlignValues.I32_TO_BE_WORD1, 4))>;
[30 lines hidden; enclosing context: def : Pat<(v4f32 (insertelt v4f32:$A, f32:$B, 0)),]
             (v4f32 (XXINSERTW v4f32:$A, AlignValues.F32_TO_BE_WORD1, 0))>;
   def : Pat<(v4f32 (insertelt v4f32:$A, f32:$B, 1)),
             (v4f32 (XXINSERTW v4f32:$A, AlignValues.F32_TO_BE_WORD1, 4))>;
   def : Pat<(v4f32 (insertelt v4f32:$A, f32:$B, 2)),
             (v4f32 (XXINSERTW v4f32:$A, AlignValues.F32_TO_BE_WORD1, 8))>;
   def : Pat<(v4f32 (insertelt v4f32:$A, f32:$B, 3)),
             (v4f32 (XXINSERTW v4f32:$A, AlignValues.F32_TO_BE_WORD1, 12))>;
   } // IsLittleEndian, HasP9Vector

+  def : Pat<(v4i32 (scalar_to_vector (i32 (load xoaddr:$src)))),
+            (v4i32 (LXVWSX xoaddr:$src))>;
+  def : Pat<(v4f32 (scalar_to_vector (f32 (load xoaddr:$src)))),
+            (v4f32 (LXVWSX xoaddr:$src))>;
+  def : Pat<(v4i32 (build_vector i32:$A, i32:$A, i32:$A, i32:$A)),
+            (v4i32 (MTVSRWS $A))>;
+  def : Pat<(v16i8 (build_vector immSExt8:$A, immSExt8:$A, immSExt8:$A,
+                                 immSExt8:$A, immSExt8:$A, immSExt8:$A,
+                                 immSExt8:$A, immSExt8:$A, immSExt8:$A,
+                                 immSExt8:$A, immSExt8:$A, immSExt8:$A,
+                                 immSExt8:$A, immSExt8:$A, immSExt8:$A,
+                                 immSExt8:$A)),
+            (v16i8 (COPY_TO_REGCLASS (XXSPLTIB imm:$A), VSRC))>;
+  def : Pat<(v16i8 immAllOnesV),
+            (v16i8 (COPY_TO_REGCLASS (XXSPLTIB 255), VSRC))>;
+  def : Pat<(v8i16 immAllOnesV),
+            (v8i16 (COPY_TO_REGCLASS (XXSPLTIB 255), VSRC))>;
+  def : Pat<(v4i32 immAllOnesV),
+            (v4i32 (XXSPLTIB 255))>;
+  def : Pat<(v2i64 immAllOnesV),
+            (v2i64 (XXSPLTIB 255))>;
 } // end HasP9Vector, AddedComplexity
+
+let Predicates = [IsISA3_0, HasDirectMove, IsLittleEndian] in {
+  def : Pat<(v2i64 (build_vector i64:$rA, i64:$rB)),
+            (v2i64 (MTVSRDD $rB, $rA))>;
+  def : Pat<(i64 (extractelt v2i64:$A, 0)),
+            (i64 (MFVSRLD $A))>;
+}
+
+let Predicates = [IsISA3_0, HasDirectMove, IsBigEndian] in {
+  def : Pat<(v2i64 (build_vector i64:$rB, i64:$rA)),
+            (v2i64 (MTVSRDD $rB, $rA))>;
+  def : Pat<(i64 (extractelt v2i64:$A, 1)),
+            (i64 (MFVSRLD $A))>;
+}

test/CodeGen/PowerPC/power9-moves-and-splats.ll

				; RUN: llc -mcpu=pwr9 -mtriple=powerpc64le-unknown-linux-gnu < %s | FileCheck %s
				; RUN: llc -mcpu=pwr9 -mtriple=powerpc64-unknown-linux-gnu < %s | FileCheck %s \
				; RUN: --check-prefix=CHECK-BE

				@Globi = external global i32, align 4
				@Globf = external global float, align 4

				define <2 x i64> @test1(i64 %a, i64 %b) {
				entry:
				; CHECK-LABEL: test1
				; CHECK: mtvsrdd 34, 4, 3
				; CHECK-BE-LABEL: test1
				; CHECK-BE: mtvsrdd 34, 3, 4
				%vecins = insertelement <2 x i64> undef, i64 %a, i32 0
				%vecins1 = insertelement <2 x i64> %vecins, i64 %b, i32 1
				ret <2 x i64> %vecins1
				}

				define i64 @test2(<2 x i64> %a) {
				entry:
				; CHECK-LABEL: test2
				; CHECK: mfvsrld 3, 34
				%0 = extractelement <2 x i64> %a, i32 0
				ret i64 %0
				}

				define i64 @test3(<2 x i64> %a) {
				entry:
				; CHECK-BE-LABEL: test3
				; CHECK-BE: mfvsrld 3, 34
				%0 = extractelement <2 x i64> %a, i32 1
				ret i64 %0
				}

				define <4 x i32> @test4(i32* nocapture readonly %in) {
				entry:
				; CHECK-LABEL: test4
				; CHECK: lxvwsx 34, 0, 3
				; CHECK-NOT: xxspltw
				; CHECK-BE-LABEL: test4
				; CHECK-BE: lxvwsx 34, 0, 3
				; CHECK-BE-NOT: xxspltw
				%0 = load i32, i32* %in, align 4
				%splat.splatinsert = insertelement <4 x i32> undef, i32 %0, i32 0
				%splat.splat = shufflevector <4 x i32> %splat.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				ret <4 x i32> %splat.splat
				}

				define <4 x float> @test5(float* nocapture readonly %in) {
				entry:
				; CHECK-LABEL: test5
				; CHECK: lxvwsx 34, 0, 3
				; CHECK-NOT: xxspltw
				; CHECK-BE-LABEL: test5
				; CHECK-BE: lxvwsx 34, 0, 3
				; CHECK-BE-NOT: xxspltw
				%0 = load float, float* %in, align 4
				%splat.splatinsert = insertelement <4 x float> undef, float %0, i32 0
				%splat.splat = shufflevector <4 x float> %splat.splatinsert, <4 x float> undef, <4 x i32> zeroinitializer
				ret <4 x float> %splat.splat
				}

				define <4 x i32> @test6() {
				entry:
				; CHECK-LABEL: test6
				; CHECK: addis
				; CHECK: ld [[TOC:[0-9]+]], .LC0
				; CHECK: lxvwsx 34, 0, 3
				; CHECK-NOT: xxspltw
				; CHECK-BE-LABEL: test6
				; CHECK-BE: addis
				; CHECK-BE: ld [[TOC:[0-9]+]], .LC0
				; CHECK-BE: lxvwsx 34, 0, 3
				; CHECK-BE-NOT: xxspltw
				%0 = load i32, i32* @Globi, align 4
				%splat.splatinsert = insertelement <4 x i32> undef, i32 %0, i32 0
				%splat.splat = shufflevector <4 x i32> %splat.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				ret <4 x i32> %splat.splat
				}

				define <4 x float> @test7() {
				entry:
				; CHECK-LABEL: test7
				; CHECK: addis
				; CHECK: ld [[TOC:[0-9]+]], .LC1
				; CHECK: lxvwsx 34, 0, 3
				; CHECK-NOT: xxspltw
				; CHECK-BE-LABEL: test7
				; CHECK-BE: addis
				; CHECK-BE: ld [[TOC:[0-9]+]], .LC1
				; CHECK-BE: lxvwsx 34, 0, 3
				; CHECK-BE-NOT: xxspltw
				%0 = load float, float* @Globf, align 4
				%splat.splatinsert = insertelement <4 x float> undef, float %0, i32 0
				%splat.splat = shufflevector <4 x float> %splat.splatinsert, <4 x float> undef, <4 x i32> zeroinitializer
				ret <4 x float> %splat.splat
				}

				define <16 x i8> @test8() {
				entry:
				; CHECK-LABEL: test8
				; CHECK: xxlxor 34, 34, 34
				; CHECK-BE-LABEL: test8
				; CHECK-BE: xxlxor 34, 34, 34
				ret <16 x i8> zeroinitializer
				}

				define <16 x i8> @test9() {
				entry:
				; CHECK-LABEL: test9
				; CHECK: xxspltib 34, 1
				; CHECK-BE-LABEL: test9
				; CHECK-BE: xxspltib 34, 1
				ret <16 x i8> <i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1, i8 1>
				}

				define <16 x i8> @test10() {
				entry:
				; CHECK-LABEL: test10
				; CHECK: xxspltib 34, 127
				; CHECK-BE-LABEL: test10
				; CHECK-BE: xxspltib 34, 127
				ret <16 x i8> <i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127, i8 127>
				}

				define <16 x i8> @test11() {
				entry:
				; CHECK-LABEL: test11
				; CHECK: xxspltib 34, 128
				; CHECK-BE-LABEL: test11
				; CHECK-BE: xxspltib 34, 128
				ret <16 x i8> <i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>
				}

				define <16 x i8> @test12() {
				entry:
				; CHECK-LABEL: test12
				; CHECK: xxspltib 34, 255
				; CHECK-BE-LABEL: test12
				; CHECK-BE: xxspltib 34, 255
				ret <16 x i8> <i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>
				}

				define <16 x i8> @test13() {
				entry:
				; CHECK-LABEL: test13
				; CHECK: xxspltib 34, 129
				; CHECK-BE-LABEL: test13
				; CHECK-BE: xxspltib 34, 129
				ret <16 x i8> <i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127, i8 -127>
				}

				define <4 x i32> @test14(<4 x i32> %a, i32* nocapture readonly %b) {
				entry:
				; CHECK-LABEL: test14
				; CHECK: lwz [[LD:[0-9]+]],
				; CHECK: mtvsrws 34, [[LD]]
				; CHECK-BE-LABEL: test14
				; CHECK-BE: lwz [[LD:[0-9]+]],
				; CHECK-BE: mtvsrws 34, [[LD]]
				%0 = load i32, i32* %b, align 4
				%splat.splatinsert = insertelement <4 x i32> undef, i32 %0, i32 0
				%splat.splat = shufflevector <4 x i32> %splat.splatinsert, <4 x i32> undef, <4 x i32> zeroinitializer
				%1 = add i32 %0, 5
				store i32 %1, i32* %b, align 4
				ret <4 x i32> %splat.splat
				}