This is an archive of the discontinued LLVM Phabricator instance.

Differential D76275

[SystemZ] Use vllez + vle to load, shuffle and zero extend a vector.
Abandoned · Public

Authored by jonpa on Mar 17 2020, 4:13 AM.

Details

Reviewers: uweigand

Summary

(Separated and continued from https://reviews.llvm.org/D75978)

I agree that using VLLEZ to implement the zero-extend is a good idea. I'm wondering about the implementation however; it seems odd to combine two quite different manners of emitting that instruction.

First of all, the "canonicalization in DAGCombiner::visitINSERT_VECTOR_ELT()" problem you mention seems to be a problem in general; the code with the two element-inserts could easily arise just from normal middle-end action (not just that new back-end code you added), and we really ought to be able to handle such code in general. So I think we should try to do that, either by overriding the DAGCombine with our own rule (if that's possible), or possibly handling this as a SystemZDAGToDAGISel::Select rule? In any case, I'm sceptical about the new DAG opcode -- it's always better to avoid those for semantics that *can* be described with standard opcodes, since custom opcodes are opaque to all further analysis by the middle-end ...

The other question is how to handle the zero-extend expansion. Logically, a vector zero-extend is fully equivalent to a shuffle intermixing bytes from the original vector with a zero vector. So I think the back-end ought to handle both cases in the same way. Now, we have logic that will intelligently handle shuffles (the whole GeneralShuffle logic). From what I can see, this never emits an UNPACK -- however, it should in theory be able to emit a MERGE, which would later (if one of the inputs is a zero vector) be transformed into an UNPACK via SystemZTargetLowering::combineMERGE.
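The equivalence between a vector zero-extend and a shuffle with a zero vector can be sketched on plain byte arrays. This is only an illustration under stated assumptions: `Bytes16`, `shuffleBytes`, `zextLow2xI8ToI64` and `zextAsShuffle` are hypothetical names that do not appear in the patch, and the 16-byte array models a big-endian vector register in which byte 7 and byte 15 are the least significant bytes of the two doubleword lanes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Two-input byte shuffle: mask values 0..15 select a byte of A,
// mask values 16..31 select a byte of B.
Bytes16 shuffleBytes(const Bytes16 &A, const Bytes16 &B,
                     const std::array<int, 16> &Mask) {
  Bytes16 Out{};
  for (std::size_t I = 0; I < 16; ++I) {
    assert(Mask[I] >= 0 && Mask[I] < 32 && "mask element out of range");
    Out[I] = Mask[I] < 16 ? A[Mask[I]] : B[Mask[I] - 16];
  }
  return Out;
}

// Direct form: zero-extend the first two i8 elements to two i64 lanes.
// In big-endian lane order the value lands in the last byte of each lane.
Bytes16 zextLow2xI8ToI64(const Bytes16 &In) {
  Bytes16 Out{}; // all bytes start as zero
  Out[7] = In[0];
  Out[15] = In[1];
  return Out;
}

// Shuffle form: the same result, built by intermixing bytes of the input
// with bytes of a zero vector.
Bytes16 zextAsShuffle(const Bytes16 &In) {
  const Bytes16 Zero{};
  std::array<int, 16> Mask;
  Mask.fill(16); // default: take byte 0 of the zero vector
  Mask[7] = 0;   // low byte of lane 0 comes from input element 0
  Mask[15] = 1;  // low byte of lane 1 comes from input element 1
  return shuffleBytes(In, Zero, Mask);
}
```

Both functions produce the same 16 bytes for any input, which is the sense in which the two forms could share one lowering path.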

Sorry for the confusion, but it seems I may have been doing something wrong earlier with the INSERT_VECTOR_ELT nodes: now that I have removed the VLLEZ node, it actually seems to work to just use INSERT_VECTOR_ELT nodes the way I first tried. I was probably also too eager to introduce a new SystemZISD node, thinking that it was the right thing to do (since I wanted the common code *not* to change anything after that point).

The advantage of a VLLEZ node would be that later optimizations cannot rearrange things so that two VLEs and a VGBM 0 result instead. I saw this in a reduced test case, but it does not seem to be a problem on the benchmarks (it never happens there). The only difference on the benchmarks is that without the VLLEZ node, one file gets dsgfr:s instead of dsgr:s; in other words, it can figure out that the extracted element is 32 bits or less. So not having the custom node seems to be an advantage, per what you said above.

Is this patch acceptable now? I haven't checked further if the current code misses any cases of VLLEZ, but at the moment I have no reason to believe it does.

Event Timeline

jonpa created this revision. Mar 17 2020, 4:13 AM

Herald added a subscriber: hiraditya. Mar 17 2020, 4:13 AM

jonpa abandoned this revision. Jun 29 2020, 11:54 PM

Herald added a project: Restricted Project. Jun 29 2020, 11:54 PM

Herald added subscribers: llvm-commits, steven.zhang.

Revision Contents

Path                                               Size
llvm/lib/Target/SystemZ/SystemZISelLowering.h      1 line
llvm/lib/Target/SystemZ/SystemZISelLowering.cpp    82 lines
llvm/test/CodeGen/SystemZ/vec-move-24.ll           47 lines

Diff 250718

llvm/lib/Target/SystemZ/SystemZISelLowering.h

(unchanged lines collapsed; hunk context: private:)
   bool isVectorElementLoad(SDValue Op) const;
   SDValue buildVector(SelectionDAG &DAG, const SDLoc &DL, EVT VT,
                       SmallVectorImpl<SDValue> &Elems) const;
   SDValue lowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerSCALAR_TO_VECTOR(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerINSERT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerEXTRACT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;
+  SDValue tryVLLEZ(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerExtendVectorInreg(SDValue Op, SelectionDAG &DAG,
                                  unsigned UnpackHigh) const;
   SDValue lowerShift(SDValue Op, SelectionDAG &DAG, unsigned ByScalar) const;

   bool canTreatAsByteVector(EVT VT) const;
   SDValue combineExtract(const SDLoc &DL, EVT ElemVT, EVT VecVT, SDValue OrigOp,
                          unsigned Index, DAGCombinerInfo &DCI,
                          bool Force) const;
(unchanged lines collapsed)

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp

(unchanged lines collapsed; hunk context: SystemZTargetLowering::lowerEXTRACT_VECTOR_ELT(SDValue Op,)
   MVT IntVT = MVT::getIntegerVT(VT.getSizeInBits());
   MVT IntVecVT = MVT::getVectorVT(IntVT, VecVT.getVectorNumElements());
   SDValue Res = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, IntVT,
                             DAG.getNode(ISD::BITCAST, DL, IntVecVT, Op0), Op1);
   return DAG.getNode(ISD::BITCAST, DL, VT, Res);
 }

 SDValue
+SystemZTargetLowering::tryVLLEZ(SDValue Op, SelectionDAG &DAG) const {
+  // Replace ZERO_EXTEND_VECTOR_INREG -> VECTOR_SHUFFLE -> LOAD
+  // with
+  // VLE -> VLLEZ
+
+  auto *SVN = dyn_cast<ShuffleVectorSDNode>(Op.getOperand(0));
+  if (!SVN || !SVN->getOperand(1).isUndef())
+    return SDValue();
+
+  // Only generate one additional load (VLE).
+  EVT OutVT = Op.getValueType();
+  EVT InVT = SVN->getValueType(0);
+  if (OutVT != MVT::v2i64 || (InVT != MVT::v8i16 && InVT != MVT::v16i8))
+    return SDValue();
+
+  // Find the load by looking through any type conversions.
+  SDValue Src = SVN->getOperand(0);
+  bool Change = true;
+  while (Change) {
+    Change = false;
+    switch (Src->getOpcode()) {
+    case ISD::BITCAST:
+      if (!Src->getOperand(0).getValueType().isVector())
+        break;
+      LLVM_FALLTHROUGH;
+    case ISD::SCALAR_TO_VECTOR:
+      Src = Src->getOperand(0);
+      Change = true;
+      break;
+    default: break;
+    }
+  }
+  auto *Load = dyn_cast<LoadSDNode>(Src);
+  if (!Load || Load->isVolatile())
+    return SDValue();
+
+  // Load the first element and insert it into a zero vector (-> VLLEZ).
+  EVT NarrowEltVT = InVT.getScalarType();
+  unsigned NarrowEltBytes = NarrowEltVT.getSizeInBits() / 8;
+  EVT PtrVT = getPointerTy(DAG.getDataLayout());
+  SDLoc DL(Load);
+  const SDValue &BaseAddr = Load->getBasePtr();
+  unsigned ByteOffset = SVN->getMaskElt(0) * NarrowEltBytes;
+  SDValue Address = DAG.getNode(ISD::ADD, DL, PtrVT, BaseAddr,
+                                DAG.getIntPtrConstant(ByteOffset, DL));
+  MachineMemOperand *MMO = Load->getMemOperand();
+  SDValue Ld0 = DAG.getLoad(NarrowEltVT, DL, Load->getChain(), Address, MMO);
+  SDValue Zeroes = DAG.getSplatBuildVector(InVT, DL,
+                                           DAG.getConstant(0, DL, MVT::i32));
+  SDValue Ins0 = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, InVT, Zeroes, Ld0,
+      DAG.getVectorIdxConstant(InVT.getVectorNumElements() / 2 - 1, DL));
+
+  // Load the other element (-> VLE).
+  ByteOffset = SVN->getMaskElt(1) * NarrowEltBytes;
+  Address = DAG.getNode(ISD::ADD, DL, PtrVT, BaseAddr,
+                        DAG.getIntPtrConstant(ByteOffset, DL));
+  SDValue Ld1 = DAG.getLoad(NarrowEltVT, DL, Ld0.getValue(1), Address, MMO);
+  SDValue Ins1 = DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, InVT, Ins0, Ld1,
+      DAG.getVectorIdxConstant(InVT.getVectorNumElements() - 1, DL));
+
+  // Update chains.
+  SDValue LoadCh = SDValue(Load, 1);
+  if (!LoadCh.use_empty()) {
+    SDValue TF = DAG.getNode(ISD::TokenFactor, DL, MVT::Other,
+                             Ld0.getValue(1), Ld1.getValue(1));
+    DAG.ReplaceAllUsesOfValueWith(LoadCh, TF);
+    SmallVector<SDValue, 2> Ops;
+    Ops.push_back(LoadCh);
+    Ops.push_back(Ld1.getValue(1));
+    DAG.UpdateNodeOperands(TF.getNode(), Ops);
+  }
+
+  return DAG.getNode(ISD::BITCAST, DL, OutVT, Ins1);
+}
+
+SDValue
 SystemZTargetLowering::lowerExtendVectorInreg(SDValue Op, SelectionDAG &DAG,
                                               unsigned UnpackHigh) const {
+  if (UnpackHigh == SystemZISD::UNPACKL_HIGH) {
+    SDValue Res = tryVLLEZ(Op, DAG);
+    if (Res.getNode())
+      return Res;
+  }
+
   SDValue PackedOp = Op.getOperand(0);
   EVT OutVT = Op.getValueType();
   EVT InVT = PackedOp.getValueType();
   unsigned ToBits = OutVT.getScalarSizeInBits();
   unsigned FromBits = InVT.getScalarSizeInBits();
   do {
     FromBits *= 2;
     EVT OutVT = MVT::getVectorVT(MVT::getIntegerVT(FromBits),
(unchanged lines collapsed)
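
The address and insert-index arithmetic in tryVLLEZ can be checked in isolation. The following is a minimal sketch, not part of the patch: `ElementLoadPlan` and `planElementLoads` are hypothetical names, and the sketch assumes both shuffle mask elements are non-negative (an undef mask element, -1, is not handled here).

```cpp
#include <cassert>

// Hypothetical helper, not part of the patch: where each of the two narrow
// loads reads from, and which narrow vector element it is inserted into.
struct ElementLoadPlan {
  unsigned FirstByteOffset;  // offset of the load inserted into the zero vector
  unsigned FirstInsertIdx;   // last narrow element of the first doubleword
  unsigned SecondByteOffset; // offset of the second element load
  unsigned SecondInsertIdx;  // last narrow element of the second doubleword
};

// NumNarrowElts corresponds to InVT.getVectorNumElements() (16 for v16i8,
// 8 for v8i16); NarrowEltBytes is the element size in bytes; MaskElt0 and
// MaskElt1 correspond to SVN->getMaskElt(0) and SVN->getMaskElt(1).
ElementLoadPlan planElementLoads(unsigned NumNarrowElts,
                                 unsigned NarrowEltBytes, int MaskElt0,
                                 int MaskElt1) {
  assert(MaskElt0 >= 0 && MaskElt1 >= 0 && "undef mask elements not handled");
  ElementLoadPlan Plan;
  Plan.FirstByteOffset = static_cast<unsigned>(MaskElt0) * NarrowEltBytes;
  Plan.FirstInsertIdx = NumNarrowElts / 2 - 1;
  Plan.SecondByteOffset = static_cast<unsigned>(MaskElt1) * NarrowEltBytes;
  Plan.SecondInsertIdx = NumNarrowElts - 1;
  return Plan;
}
```

For a v16i8 input with mask elements 1 and 0, this gives byte offsets 1 and 0 with insert indices 7 and 15; for v8i16 it gives byte offsets 2 and 0 with insert indices 3 and 7. These agree with the displacements and element indices in the vllezb/vleb and vllezh/vleh CHECK lines of vec-move-24.ll.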

llvm/test/CodeGen/SystemZ/vec-move-24.ll

This file was added.

				; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z14 | FileCheck %s
				;
				; Test that shuffled and zero extended vector loads get implemented
				; with vllez + vle in the case of a <2 x i64> result.

				define void @fun0(<2 x i8>* %Src, <2 x i64>* %Dst) {
				; CHECK-LABEL: fun0:
				; CHECK: vllezb %v0, 1(%r2)
				; CHECK-NEXT: vleb %v0, 0(%r2), 15
				; CHECK-NEXT: vst %v0, 0(%r3), 3
				; CHECK-NEXT: br %r14
				%l = load <2 x i8>, <2 x i8>* %Src
				%sh = shufflevector <2 x i8> %l, <2 x i8> undef, <2 x i32> <i32 1, i32 0>
				%z = zext <2 x i8> %sh to <2 x i64>
				store <2 x i64> %z, <2 x i64>* %Dst
				ret void
				}

				define void @fun1(<2 x i16>* %Src, <2 x i64>* %Dst) {
				; CHECK-LABEL: fun1:
				; CHECK: vllezh %v0, 2(%r2)
				; CHECK-NEXT: vleh %v0, 0(%r2), 7
				; CHECK-NEXT: vst %v0, 0(%r3), 3
				; CHECK-NEXT: br %r14
				%l = load <2 x i16>, <2 x i16>* %Src
				%sh = shufflevector <2 x i16> %l, <2 x i16> undef, <2 x i32> <i32 1, i32 0>
				%z = zext <2 x i16> %sh to <2 x i64>
				store <2 x i64> %z, <2 x i64>* %Dst
				ret void
				}

				; Don't use vllez and multiple vle:s.
				define void @fun2(<4 x i16>* %Src, <4 x i32>* %Dst) {
				; CHECK-LABEL: fun2:
				; CHECK: larl %r1, .LCPI2_0
				; CHECK-NEXT: vlrepg %v0, 0(%r2)
				; CHECK-NEXT: vl %v1, 0(%r1), 3
				; CHECK-NEXT: vperm %v0, %v0, %v0, %v1
				; CHECK-NEXT: vuplhh %v0, %v0
				; CHECK-NEXT: vst %v0, 0(%r3), 3
				; CHECK-NEXT: br %r14
				%l = load <4 x i16>, <4 x i16>* %Src
				%sh = shufflevector <4 x i16> %l, <4 x i16> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>
				%z = zext <4 x i16> %sh to <4 x i32>
				store <4 x i32> %z, <4 x i32>* %Dst
				ret void
				}