This is an archive of the discontinued LLVM Phabricator instance.

Differential D75978

[SystemZ] Avoid scalarization of S/UINT_TO_FP
ClosedPublic

Authored by jonpa on Mar 11 2020, 2:21 AM.


Details

Reviewers

uweigand

Commits

rG132f25bcca2e: [SystemZ] Avoid scalarization of [SU]INT_TO_FP ISD-nodes.

Summary

Convert UINT_TO_FP DAG nodes with a wider result vector element type to first do a zero extend of the source vector:

The type legalizer will scalarize a 'v4f64 = uint_to_fp v4i16' as part of widening the illegally typed operand. By zero extending the source before type legalization, the node is instead split and kept as a vector operation.

Doing this before type legalization seemed like a simple solution that will handle any source vectors, such as v4i16, without having to worry about things like widening. The type legalizer seems to scalarize the uint_to_fp if the input/output vector type is not legal.
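As a rough illustration of why the extra zero extend is value-preserving, here is a minimal scalar model of `v4f64 = uint_to_fp v4i16`. The type aliases and the function names `zeroExtend`, `uintToFP` and `uintToFPDirect` are hypothetical and are not part of the patch; the sketch only models the arithmetic, not the SelectionDAG API.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical lane-wise model of the vector types involved.
using V4I16 = std::array<std::uint16_t, 4>;
using V4I64 = std::array<std::uint64_t, 4>;
using V4F64 = std::array<double, 4>;

// Step 1: zero extend each lane to the width of the result element.
V4I64 zeroExtend(const V4I16 &Src) {
  V4I64 Out{};
  for (std::size_t I = 0; I < Src.size(); ++I)
    Out[I] = Src[I]; // unsigned widening fills the upper bits with zeros
  return Out;
}

// Step 2: convert lanes whose width now matches the result width.
V4F64 uintToFP(const V4I64 &Src) {
  V4F64 Out{};
  for (std::size_t I = 0; I < Src.size(); ++I)
    Out[I] = static_cast<double>(Src[I]);
  return Out;
}

// Reference: convert each narrow lane directly.
V4F64 uintToFPDirect(const V4I16 &Src) {
  V4F64 Out{};
  for (std::size_t I = 0; I < Src.size(); ++I)
    Out[I] = static_cast<double>(Src[I]);
  return Out;
}
```

Both paths give the same lanes for any input, so inserting the zero extend changes only how the operation is legalized, not its result.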

I thought about having a check here for a scalarized source operand (e.g. urem), in which case it might be less desirable to first insert the source elements in order to do a vector convert. There were however no cases in the benchmarks where any user of the uint_to_fp was also scalarized, so it seemed that this consideration could be ignored. This was a benchmark that first made me try this:

f470.lbm
cdlfbr         :                    4                    0       -4
vuplhh         :                    0                    2       +2
vuplhf         :                    0                    2       +2
vcdlgb         :                    0                    2       +2
vmrhg          :                   16                   14       -2
vlvgh          :                    0                    2       +2
vlvgb          :                    0                    2       +2
vuplhb         :                    0                    1       +1

The difference here was that for a vector convert with a scalarized source, the vector elements are first loaded and then unpacked. In that case it seems possibly better to do scalar conversions, since they support uint32 -> double and store directly into the right vector register: 2 x scalar conversions + 1 vmrhg. This however did not matter for performance, and it was just one or two cases...

This is the check that was needed (and removed):

+bool SystemZTargetLowering::willBeScalarized(SDValue Op) const {
+  if (Op->isUndef())
+    return false;
+  if (Op->getOpcode() == ISD::VECTOR_SHUFFLE)
+    return willBeScalarized(Op->getOperand(0)) ||
+           willBeScalarized(Op->getOperand(1));
+
+  unsigned ScalarBits = Op.getValueType().getScalarSizeInBits();
+  MVT CheckVT = MVT::getVectorVT(MVT::getIntegerVT(ScalarBits),
+                                 SystemZ::VectorBits / ScalarBits);
+  return (getOperationAction(Op->getOpcode(), CheckVT) == Expand);
+}
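The logic of the removed check can be sketched outside of LLVM with a toy node type. Everything here is hypothetical: `Node`, `Opcode`, `fullWidthLanes` and `isExpanded` stand in for SDValue, ISD opcodes, the `CheckVT` computation and `getOperationAction(...) == Expand`. The assumption that only `URem` is expanded is made up for illustration.

```cpp
#include <vector>

// Hypothetical stand-ins for ISD opcodes.
enum class Opcode { Undef, Load, Add, URem, VectorShuffle };

// Hypothetical stand-in for an SDValue with a vector type.
struct Node {
  Opcode Op;
  unsigned ScalarBits;
  std::vector<const Node *> Operands;
};

constexpr unsigned VectorBits = 128; // width of one vector register

// Lane count of the full-width type that the check inspects (CheckVT).
unsigned fullWidthLanes(unsigned ScalarBits) { return VectorBits / ScalarBits; }

// Assumption for this sketch: only URem is marked Expand for vector types.
bool isExpanded(Opcode Op, unsigned ScalarBits, unsigned Lanes) {
  (void)ScalarBits;
  (void)Lanes;
  return Op == Opcode::URem;
}

// Mirrors the structure of the removed check: undef is never scalarized,
// shuffles are looked through, everything else is a table lookup.
bool willBeScalarized(const Node &N) {
  if (N.Op == Opcode::Undef)
    return false;
  if (N.Op == Opcode::VectorShuffle)
    return willBeScalarized(*N.Operands[0]) ||
           willBeScalarized(*N.Operands[1]);
  return isExpanded(N.Op, N.ScalarBits, fullWidthLanes(N.ScalarBits));
}
```

In this model a shuffle of a `URem` result with undef is reported as scalarized, while a plain load is not.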

 SDValue SystemZTargetLowering::combineUINT_TO_FP(
     SDNode *N, DAGCombinerInfo &DCI) const {

+  // Don't do this if Op is known to be scalarized.
+  if (willBeScalarized(Op))
+    return SDValue();
+

At first I used a check to only do this when the vector conversion was supported (v2f64 on z14 and also v4f32 on z15), but then I figured that it is still better to do the vector zero extend instead of extracting and then extending each element:

f510.parest_r
vlgvh          :                    0                   20      +20
vlgvf          :                  540                  520      -20
llhr           :                   90                  110      +20
vuplhh         :                   17                   12       -5

Use a vllez + vle to load, shuffle and zero extend a vector with two elements in just two instructions:

I first tried to create the vllez with existing patterns by building a load and insert_vector_elt, but:

 t23: i32,ch = load<(load 4 from %ir.Src), anyext from i16> t0, t2, undef:i64
 t22: i32,ch = load<(load 4 from %ir.Src), anyext from i16> t0, t16, undef:i64
     t29: v8i16 = BUILD_VECTOR Constant:i32<0>...
            t31: v8i16 = insert_vector_elt t29, t23, Constant:i32<3>
          t32: v8i16 = insert_vector_elt t31, t22, Constant:i32<0>

=> canonicalization in DAGCombiner::visitINSERT_VECTOR_ELT() =>

      t29: v8i16 = BUILD_VECTOR Constant:i32<0>...
             t34: v8i16 = insert_vector_elt t29, t22, Constant:i32<0>
           t35: v8i16 = insert_vector_elt t34, t23, Constant:i32<3>

, which means that the insert-pattern for vllezh does not work.
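The effect of the reordering can be modeled with a small sketch. It assumes, based only on the DAG dumps above, that the combiner orders a chain of constant-index inserts so that the smallest index ends up innermost, and that the vllezh insert pattern needs the insert applied directly to the zero vector to be at index 3. The names `Insert`, `InsertChain`, `canonicalize` and `matchesVllezh` are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// One insert_vector_elt with a constant index; LoadId names the loaded value.
struct Insert {
  unsigned Index;
  int LoadId;
};

// Chain[0] is the insert applied directly to the all-zeros BUILD_VECTOR,
// later entries wrap the earlier ones.
using InsertChain = std::vector<Insert>;

// Model of the canonical order: smallest index innermost. Reordering is only
// meaningful when all indices are distinct, which is assumed here.
InsertChain canonicalize(InsertChain Chain) {
  std::stable_sort(
      Chain.begin(), Chain.end(),
      [](const Insert &A, const Insert &B) { return A.Index < B.Index; });
  return Chain;
}

// The v8i16 VLLEZ insert pattern wants (insert zero-vector, load, 3).
bool matchesVllezh(const InsertChain &Chain) {
  return !Chain.empty() && Chain.front().Index == 3;
}
```

With the chain built as in the first dump (index 3 innermost, then index 0) the pattern would match, but after the reordering the innermost insert is at index 0 and it no longer does.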

Therefore a new SystemZISD::VLLEZ node seemed needed. The previous pattern for vllez has been renamed, and the new node SystemZISD::VLLEZ matches z_vllezi[8 - 64].
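The address arithmetic that tryVLLEZ() in this revision performs can be sketched as plain integer math. `VllezPlan` and `planVllez` are hypothetical names; the formulas follow the `ByteOffset` and `InsIdx` computations in the diff (mask element times narrow element size, and the last narrow lane as the insert position).

```cpp
#include <array>

// Displacements and lane index for one v2i64 result built from two
// narrow elements of a loaded vector.
struct VllezPlan {
  unsigned VllezOffset; // byte displacement of the zeroing element load
  unsigned VleOffset;   // byte displacement of the second element load
  unsigned VleIndex;    // narrow lane the second element is inserted into
};

// Mask holds the two shuffle mask elements that feed the two result lanes.
VllezPlan planVllez(std::array<unsigned, 2> Mask, unsigned NarrowEltBits) {
  unsigned NarrowEltBytes = NarrowEltBits / 8;
  unsigned NumNarrowElts = 128 / NarrowEltBits; // lanes in a 128-bit vector
  return {Mask[0] * NarrowEltBytes, Mask[1] * NarrowEltBytes,
          NumNarrowElts - 1};
}
```

For example, `planVllez({1, 0}, 16)` gives displacements 2 and 0 with lane 7, which lines up with the `vllezh %v0, 2(%r2)` / `vleh %v0, 0(%r2), 7` pair expected for fun2 in the added test.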

Tried also i32 -> i64, but this only changed one file (did not look like an improvement), and it involved extra searching code through a second VECTOR_SHUFFLE in tryVLLEZ(), so skipped.

Haven't tried v4i8 -> v4i32, since that would involve doing four loads instead of one, which seems a little worrisome.

I saw just a few cases on SPEC 06 where multiple LAYs were generated instead of one (a 13-bit displacement added twice, before both the vllez and the vle). It would have been better to do just one LAY and then use an immediate displacement in the vle. This seems like an existing problem which involves register pressure considerations. Would it be worth trying a clean-up pass pre-RA that tries to minimize the number of LAYs? It could check if the base address was killed at the first LAY and in that case reuse that LAY's result in the second one.

Results

Among the benchmarks, only Imagick was affected performance-wise.

combineUINT_TO_FP() gives 15% improvement
tryVLLEZ() gives another 5-7% improvement.

Total improvement: 20-22%

imagick/morphology.s/MorphologyApply, which is hot, now uses 68 x (vllezh; vleh; vcdlgb).

(GCC does not do any vllez/vcdlgb, and nearly all uint to fp conversions are scalar, but it is still fast).

Diff Detail

Event Timeline

jonpa created this revision. Mar 11 2020, 2:21 AM

Herald added a project: Restricted Project. Mar 11 2020, 2:21 AM

Herald added subscribers: llvm-commits, hiraditya.

Not yet looking at the implementation details, but a couple of comments on the overall approach:

The UINT_TO_FP changes look good to me. The only question I have is whether we should do the same for SINT_TO_FP? (And in either case, this can probably be committed as a separate patch, apart from the VLLEZ changes.)

I agree that using VLLEZ to implement the zero-extend is a good idea. I'm wondering about the implementation however; it seems odd to combine two quite different manners of emitting that instruction.

First of all, the "canonicalization in DAGCombiner::visitINSERT_VECTOR_ELT()" problem you mention seems to be a problem in general; the code with the two element-inserts could easily arise just from normal middle-end action (not just that new back-end code you added), and we really ought to be able to handle such code in general. So I think we should try to do that, either by overriding the DAGCombine with our own rule (if that's possible), or possibly handling this as a SystemZDAGToDAGISel::Select rule? In any case, I'm sceptical about the new DAG opcode -- it's always better to avoid those for semantics that *can* be described with standard opcodes, since custom opcodes are opaque to all further analysis by the middle-end ...

The other question is how to handle the zero-extend expansion. Logically, a vector zero-extend is fully equivalent to a shuffle intermixing bytes from the original vector with a zero vector. So I think the back-end ought to handle both cases in the same way. Now, we have logic that will intelligently handle shuffles (the whole GeneralShuffle logic). From what I can see, this never emits an UNPACK -- however, it should in theory be able to emit a MERGE, which would later (if one of the inputs is a zero vector) be transformed into an UNPACK via SystemZTargetLowering::combineMERGE.

Given that, maybe the optimal solution to catch all cases would be:

first, expand vector zero-extend into a shuffle (always)
second, make sure that the GeneralShuffle logic (and/or combineMERGE) detects the case where an optimal solution is via element inserts into a zero vector (and falls back to UNPACK otherwise)
finally (as said above), make sure that element inserts into a zero vector get implemented via VLLEZ in all cases where this is possible
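The equivalence between a vector zero extend and a shuffle with a zero vector can be checked on a byte-level model. This sketch assumes a big-endian element layout and looks at extending the first four 16-bit elements of a 16-byte vector to 32 bits. `Bytes`, `shuffleBytes`, `zextHighMask`, `readU16` and `readU32` are hypothetical helpers, not back-end functions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes = std::array<std::uint8_t, 16>;

// Byte shuffle of two vectors: mask values 0..15 select from A, 16..31 from B.
Bytes shuffleBytes(const Bytes &A, const Bytes &B,
                   const std::array<unsigned, 16> &Mask) {
  Bytes Out{};
  for (std::size_t I = 0; I < 16; ++I)
    Out[I] = Mask[I] < 16 ? A[Mask[I]] : B[Mask[I] - 16];
  return Out;
}

// Mask that zero extends the first four big-endian u16 elements to u32:
// each result element is two bytes from the zero vector followed by the
// two source bytes.
std::array<unsigned, 16> zextHighMask() {
  std::array<unsigned, 16> Mask{};
  for (unsigned J = 0; J < 4; ++J) {
    Mask[4 * J + 0] = 16; // any byte of the zero vector
    Mask[4 * J + 1] = 16;
    Mask[4 * J + 2] = 2 * J;
    Mask[4 * J + 3] = 2 * J + 1;
  }
  return Mask;
}

// Read big-endian elements back out of the byte vector.
std::uint16_t readU16(const Bytes &V, unsigned Elt) {
  return static_cast<std::uint16_t>((V[2 * Elt] << 8) | V[2 * Elt + 1]);
}

std::uint32_t readU32(const Bytes &V, unsigned Elt) {
  return (static_cast<std::uint32_t>(V[4 * Elt]) << 24) |
         (static_cast<std::uint32_t>(V[4 * Elt + 1]) << 16) |
         (static_cast<std::uint32_t>(V[4 * Elt + 2]) << 8) |
         static_cast<std::uint32_t>(V[4 * Elt + 3]);
}
```

Shuffling a source vector with an all-zero vector through `zextHighMask()` yields 32-bit elements equal to the original 16-bit ones, which is the property the proposed "expand zero-extend into a shuffle" step relies on.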

The UINT_TO_FP changes look good to me. The only question I have is whether we should do the same for SINT_TO_FP? (And in either case, this can probably be committed as a separate patch, apart from the VLLEZ changes.)

Yes, that also looks like a good idea. The patch has been updated to only do the INT_TO_FP conversions; the VLLEZ part will be posted separately.

jonpa retitled this revision from [SystemZ] Avoid scalarization of UINT_TO_FP + improve shuffled zext:ing loads. to [SystemZ] Avoid scalarization of S/UINT_TO_FP. Mar 14 2020, 3:49 AM

LGTM, thanks!

This revision is now accepted and ready to land. Mar 16 2020, 2:27 AM

Closed by commit rG132f25bcca2e: [SystemZ] Avoid scalarization of [SU]INT_TO_FP ISD-nodes. (authored by jonpa). Mar 16 2020, 5:17 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                              Size
llvm/lib/Target/SystemZ/SystemZISelLowering.h     5 lines
llvm/lib/Target/SystemZ/SystemZISelLowering.cpp   107 lines
llvm/lib/Target/SystemZ/SystemZOperators.td       28 lines
llvm/test/CodeGen/SystemZ/vec-move-23.ll          140 lines

Diff 249566

llvm/lib/Target/SystemZ/SystemZISelLowering.h

[... 357 lines not shown ...]
enum NodeType : unsigned {
ATOMIC_CMP_SWAP_128,		ATOMIC_CMP_SWAP_128,

// Byte swapping load/store. Same operands as regular load/store.		// Byte swapping load/store. Same operands as regular load/store.
LRV, STRV,		LRV, STRV,

// Element swapping load/store. Same operands as regular load/store.		// Element swapping load/store. Same operands as regular load/store.
VLER, VSTER,		VLER, VSTER,

		// Zero all bits of vector and load logical element.
		VLLEZ,

// Prefetch from the second operand using the 4-bit control code in		// Prefetch from the second operand using the 4-bit control code in
// the first operand. The code is 1 for a load prefetch and 2 for		// the first operand. The code is 1 for a load prefetch and 2 for
// a store prefetch.		// a store prefetch.
PREFETCH		PREFETCH
};		};

// Return true if OPCODE is some kind of PC-relative address.		// Return true if OPCODE is some kind of PC-relative address.
inline bool isPCREL(unsigned Opcode) {		inline bool isPCREL(unsigned Opcode) {
[... 241 lines not shown ...]
private:
bool isVectorElementLoad(SDValue Op) const;		bool isVectorElementLoad(SDValue Op) const;
SDValue buildVector(SelectionDAG &DAG, const SDLoc &DL, EVT VT,		SDValue buildVector(SelectionDAG &DAG, const SDLoc &DL, EVT VT,
SmallVectorImpl<SDValue> &Elems) const;		SmallVectorImpl<SDValue> &Elems) const;
SDValue lowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG) const;		SDValue lowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG) const;		SDValue lowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerSCALAR_TO_VECTOR(SDValue Op, SelectionDAG &DAG) const;		SDValue lowerSCALAR_TO_VECTOR(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerINSERT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;		SDValue lowerINSERT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerEXTRACT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;		SDValue lowerEXTRACT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;
		SDValue tryVLLEZ(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerExtendVectorInreg(SDValue Op, SelectionDAG &DAG,		SDValue lowerExtendVectorInreg(SDValue Op, SelectionDAG &DAG,
unsigned UnpackHigh) const;		unsigned UnpackHigh) const;
SDValue lowerShift(SDValue Op, SelectionDAG &DAG, unsigned ByScalar) const;		SDValue lowerShift(SDValue Op, SelectionDAG &DAG, unsigned ByScalar) const;

bool canTreatAsByteVector(EVT VT) const;		bool canTreatAsByteVector(EVT VT) const;
SDValue combineExtract(const SDLoc &DL, EVT ElemVT, EVT VecVT, SDValue OrigOp,		SDValue combineExtract(const SDLoc &DL, EVT ElemVT, EVT VecVT, SDValue OrigOp,
unsigned Index, DAGCombinerInfo &DCI,		unsigned Index, DAGCombinerInfo &DCI,
bool Force) const;		bool Force) const;
SDValue combineTruncateExtract(const SDLoc &DL, EVT TruncVT, SDValue Op,		SDValue combineTruncateExtract(const SDLoc &DL, EVT TruncVT, SDValue Op,
DAGCombinerInfo &DCI) const;		DAGCombinerInfo &DCI) const;
SDValue combineZERO_EXTEND(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineZERO_EXTEND(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineSIGN_EXTEND(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineSIGN_EXTEND(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineSIGN_EXTEND_INREG(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineSIGN_EXTEND_INREG(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineMERGE(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineMERGE(SDNode *N, DAGCombinerInfo &DCI) const;
bool canLoadStoreByteSwapped(EVT VT) const;		bool canLoadStoreByteSwapped(EVT VT) const;
SDValue combineLOAD(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineLOAD(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineSTORE(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineSTORE(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineVECTOR_SHUFFLE(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineVECTOR_SHUFFLE(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineEXTRACT_VECTOR_ELT(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineEXTRACT_VECTOR_ELT(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineJOIN_DWORDS(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineJOIN_DWORDS(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineFP_ROUND(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineFP_ROUND(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineFP_EXTEND(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineFP_EXTEND(SDNode *N, DAGCombinerInfo &DCI) const;
		SDValue combineUINT_TO_FP(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineBSWAP(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineBSWAP(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineBR_CCMASK(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineBR_CCMASK(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineSELECT_CCMASK(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineSELECT_CCMASK(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineGET_CCMASK(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineGET_CCMASK(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue combineIntDIVREM(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue combineIntDIVREM(SDNode *N, DAGCombinerInfo &DCI) const;

SDValue unwrapAddress(SDValue N) const override;		SDValue unwrapAddress(SDValue N) const override;

[... 65 lines not shown ...]

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp


[... 635 lines not shown ...]
SystemZTargetLowering::SystemZTargetLowering(const TargetMachine &TM,
setTargetDAGCombine(ISD::SIGN_EXTEND_INREG);		setTargetDAGCombine(ISD::SIGN_EXTEND_INREG);
setTargetDAGCombine(ISD::LOAD);		setTargetDAGCombine(ISD::LOAD);
setTargetDAGCombine(ISD::STORE);		setTargetDAGCombine(ISD::STORE);
setTargetDAGCombine(ISD::VECTOR_SHUFFLE);		setTargetDAGCombine(ISD::VECTOR_SHUFFLE);
setTargetDAGCombine(ISD::EXTRACT_VECTOR_ELT);		setTargetDAGCombine(ISD::EXTRACT_VECTOR_ELT);
setTargetDAGCombine(ISD::FP_ROUND);		setTargetDAGCombine(ISD::FP_ROUND);
setTargetDAGCombine(ISD::STRICT_FP_ROUND);		setTargetDAGCombine(ISD::STRICT_FP_ROUND);
setTargetDAGCombine(ISD::FP_EXTEND);		setTargetDAGCombine(ISD::FP_EXTEND);
		setTargetDAGCombine(ISD::UINT_TO_FP);
setTargetDAGCombine(ISD::STRICT_FP_EXTEND);		setTargetDAGCombine(ISD::STRICT_FP_EXTEND);
setTargetDAGCombine(ISD::BSWAP);		setTargetDAGCombine(ISD::BSWAP);
setTargetDAGCombine(ISD::SDIV);		setTargetDAGCombine(ISD::SDIV);
setTargetDAGCombine(ISD::UDIV);		setTargetDAGCombine(ISD::UDIV);
setTargetDAGCombine(ISD::SREM);		setTargetDAGCombine(ISD::SREM);
setTargetDAGCombine(ISD::UREM);		setTargetDAGCombine(ISD::UREM);

// Handle intrinsics.		// Handle intrinsics.
[... 4,382 lines not shown ...]
SystemZTargetLowering::lowerEXTRACT_VECTOR_ELT(SDValue Op,
MVT IntVT = MVT::getIntegerVT(VT.getSizeInBits());		MVT IntVT = MVT::getIntegerVT(VT.getSizeInBits());
MVT IntVecVT = MVT::getVectorVT(IntVT, VecVT.getVectorNumElements());		MVT IntVecVT = MVT::getVectorVT(IntVT, VecVT.getVectorNumElements());
SDValue Res = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, IntVT,		SDValue Res = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, IntVT,
DAG.getNode(ISD::BITCAST, DL, IntVecVT, Op0), Op1);		DAG.getNode(ISD::BITCAST, DL, IntVecVT, Op0), Op1);
return DAG.getNode(ISD::BITCAST, DL, VT, Res);		return DAG.getNode(ISD::BITCAST, DL, VT, Res);
}		}

SDValue		SDValue
		SystemZTargetLowering::tryVLLEZ(SDValue Op, SelectionDAG &DAG) const {
		  // Replace ZERO_EXTEND_VECTOR_INREG -> VECTOR_SHUFFLE -> LOAD
		  // with
		  // VLE -> VLLEZ

		  auto *SVN = dyn_cast<ShuffleVectorSDNode>(Op.getOperand(0));
		  if (!SVN || !SVN->getOperand(1).isUndef())
		    return SDValue();

		  // Only generate one additional load (VLE).
		  EVT OutVT = Op.getValueType();
		  EVT InVT = SVN->getValueType(0);
		  if (OutVT != MVT::v2i64 || (InVT != MVT::v8i16 && InVT != MVT::v16i8))
		    return SDValue();

		  // Find the load by looking through any type conversions.
		  SDValue Src = SVN->getOperand(0);
		  bool Change = true;
		  while (Change) {
		    Change = false;
		    switch (Src->getOpcode()) {
		    case ISD::BITCAST:
		      if (!Src->getOperand(0).getValueType().isVector())
		        break;
		      LLVM_FALLTHROUGH;
		    case ISD::SCALAR_TO_VECTOR:
		      Src = Src->getOperand(0);
		      Change = true;
		      break;
		    default: break;
		    }
		  }
		  auto *Load = dyn_cast<LoadSDNode>(Src);
		  if (!Load || Load->isVolatile())
		    return SDValue();

		  // First do the VLLEZ, which will zero all other bits of the vector.
		  EVT NarrowEltVT = InVT.getScalarType();
		  unsigned NarrowEltBytes = NarrowEltVT.getSizeInBits() / 8;
		  EVT PtrVT = getPointerTy(DAG.getDataLayout());
		  SDLoc DL(Load);
		  const SDValue &BaseAddr = Load->getBasePtr();
		  unsigned ByteOffset = SVN->getMaskElt(0) * NarrowEltBytes;
		  SDValue Address = DAG.getNode(ISD::ADD, DL, PtrVT, BaseAddr,
		                                DAG.getIntPtrConstant(ByteOffset, DL));
		  SDVTList Tys = DAG.getVTList(InVT, MVT::Other);
		  SDValue Ops[] = { Load->getChain(), Address };
		  MachineMemOperand *MMO = Load->getMemOperand();
		  SDValue VLLEZ = DAG.getMemIntrinsicNode(SystemZISD::VLLEZ, DL, Tys, Ops,
		                                          NarrowEltVT, MMO);
		  // Load the other element.
		  ByteOffset = SVN->getMaskElt(1) * NarrowEltBytes;
		  Address = DAG.getNode(ISD::ADD, DL, PtrVT, BaseAddr,
		                        DAG.getIntPtrConstant(ByteOffset, DL));
		  SDValue EltLd = DAG.getLoad(NarrowEltVT, DL, VLLEZ.getValue(1), Address, MMO);
		  SDValue InsIdx = DAG.getVectorIdxConstant(InVT.getVectorNumElements() - 1, DL);
		  SDValue InsVec =
		    DAG.getNode(ISD::INSERT_VECTOR_ELT, DL, InVT, VLLEZ, EltLd, InsIdx);

		  // Update chains.
		  SDValue LoadCh = SDValue(Load, 1);
		  if (!LoadCh.use_empty()) {
		    SDValue TF = DAG.getNode(ISD::TokenFactor, DL, MVT::Other,
		                             VLLEZ.getValue(1), EltLd.getValue(1));
		    DAG.ReplaceAllUsesOfValueWith(LoadCh, TF);
		    SmallVector<SDValue, 2> Ops;
		    Ops.push_back(LoadCh);
		    Ops.push_back(EltLd.getValue(1));
		    DAG.UpdateNodeOperands(TF.getNode(), Ops);
		  }

		  return DAG.getNode(ISD::BITCAST, DL, OutVT, InsVec);
		}

		SDValue
SystemZTargetLowering::lowerExtendVectorInreg(SDValue Op, SelectionDAG &DAG,		SystemZTargetLowering::lowerExtendVectorInreg(SDValue Op, SelectionDAG &DAG,
unsigned UnpackHigh) const {		unsigned UnpackHigh) const {
		if (UnpackHigh == SystemZISD::UNPACKL_HIGH) {
		SDValue Res = tryVLLEZ(Op, DAG);
		if (Res.getNode())
		return Res;
		}

SDValue PackedOp = Op.getOperand(0);		SDValue PackedOp = Op.getOperand(0);
EVT OutVT = Op.getValueType();		EVT OutVT = Op.getValueType();
EVT InVT = PackedOp.getValueType();		EVT InVT = PackedOp.getValueType();
unsigned ToBits = OutVT.getScalarSizeInBits();		unsigned ToBits = OutVT.getScalarSizeInBits();
unsigned FromBits = InVT.getScalarSizeInBits();		unsigned FromBits = InVT.getScalarSizeInBits();
do {		do {
FromBits *= 2;		FromBits *= 2;
EVT OutVT = MVT::getVectorVT(MVT::getIntegerVT(FromBits),		EVT OutVT = MVT::getVectorVT(MVT::getIntegerVT(FromBits),
[... 390 lines not shown ...]
switch ((SystemZISD::NodeType)Opcode) {
OPCODE(ATOMIC_CMP_SWAP);		OPCODE(ATOMIC_CMP_SWAP);
OPCODE(ATOMIC_LOAD_128);		OPCODE(ATOMIC_LOAD_128);
OPCODE(ATOMIC_STORE_128);		OPCODE(ATOMIC_STORE_128);
OPCODE(ATOMIC_CMP_SWAP_128);		OPCODE(ATOMIC_CMP_SWAP_128);
OPCODE(LRV);		OPCODE(LRV);
OPCODE(STRV);		OPCODE(STRV);
OPCODE(VLER);		OPCODE(VLER);
OPCODE(VSTER);		OPCODE(VSTER);
		OPCODE(VLLEZ);
OPCODE(PREFETCH);		OPCODE(PREFETCH);
}		}
return nullptr;		return nullptr;
#undef OPCODE		#undef OPCODE
}		}

// Return true if VT is a vector whose elements are a whole number of bytes		// Return true if VT is a vector whose elements are a whole number of bytes
// in width. Also check for presence of vector support.		// in width. Also check for presence of vector support.
[... 618 lines not shown ...]
for (auto *U : Vec->uses()) {
return Extract0;		return Extract0;
}		}
}		}
}		}
}		}
return SDValue();		return SDValue();
}		}

		SDValue SystemZTargetLowering::combineUINT_TO_FP(
		    SDNode *N, DAGCombinerInfo &DCI) const {
		  if (DCI.Level != BeforeLegalizeTypes)
		    return SDValue();
		  EVT OutVT = N->getValueType(0);
		  SelectionDAG &DAG = DCI.DAG;
		  SDValue Op = N->getOperand(0);
		  unsigned OutScalarBits = OutVT.getScalarSizeInBits();
		  unsigned InScalarBits = Op->getValueType(0).getScalarSizeInBits();

		  // Insert a zero_extend before type-legalization to avoid scalarization, e.g.:
		  // v2f64 = uint_to_fp v2i16
		  // =>
		  // v2f64 = uint_to_fp (v2i64 zero_extend v2i16)
		  if (OutVT.isVector() && OutScalarBits > InScalarBits) {
		    MVT ExtVT = MVT::getVectorVT(MVT::getIntegerVT(OutVT.getScalarSizeInBits()),
		                                 OutVT.getVectorNumElements());
		    SDValue ExtOp = DAG.getNode(ISD::ZERO_EXTEND, SDLoc(N), ExtVT, Op);
		    return DAG.getNode(ISD::UINT_TO_FP, SDLoc(N), OutVT, ExtOp);
		  }
		  return SDValue();
		}

SDValue SystemZTargetLowering::combineBSWAP(		SDValue SystemZTargetLowering::combineBSWAP(
SDNode *N, DAGCombinerInfo &DCI) const {		SDNode *N, DAGCombinerInfo &DCI) const {
SelectionDAG &DAG = DCI.DAG;		SelectionDAG &DAG = DCI.DAG;
// Combine BSWAP (LOAD) into LRVH/LRV/LRVG/VLBR		// Combine BSWAP (LOAD) into LRVH/LRV/LRVG/VLBR
if (ISD::isNON_EXTLoad(N->getOperand(0).getNode()) &&		if (ISD::isNON_EXTLoad(N->getOperand(0).getNode()) &&
N->getOperand(0).hasOneUse() &&		N->getOperand(0).hasOneUse() &&
canLoadStoreByteSwapped(N->getValueType(0))) {		canLoadStoreByteSwapped(N->getValueType(0))) {
SDValue Load = N->getOperand(0);		SDValue Load = N->getOperand(0);
[... 311 lines not shown ...]
SDValue SystemZTargetLowering::PerformDAGCombine(SDNode *N,
case ISD::STORE: return combineSTORE(N, DCI);		case ISD::STORE: return combineSTORE(N, DCI);
case ISD::VECTOR_SHUFFLE: return combineVECTOR_SHUFFLE(N, DCI);		case ISD::VECTOR_SHUFFLE: return combineVECTOR_SHUFFLE(N, DCI);
case ISD::EXTRACT_VECTOR_ELT: return combineEXTRACT_VECTOR_ELT(N, DCI);		case ISD::EXTRACT_VECTOR_ELT: return combineEXTRACT_VECTOR_ELT(N, DCI);
case SystemZISD::JOIN_DWORDS: return combineJOIN_DWORDS(N, DCI);		case SystemZISD::JOIN_DWORDS: return combineJOIN_DWORDS(N, DCI);
case ISD::STRICT_FP_ROUND:		case ISD::STRICT_FP_ROUND:
case ISD::FP_ROUND: return combineFP_ROUND(N, DCI);		case ISD::FP_ROUND: return combineFP_ROUND(N, DCI);
case ISD::STRICT_FP_EXTEND:		case ISD::STRICT_FP_EXTEND:
case ISD::FP_EXTEND: return combineFP_EXTEND(N, DCI);		case ISD::FP_EXTEND: return combineFP_EXTEND(N, DCI);
		case ISD::UINT_TO_FP: return combineUINT_TO_FP(N, DCI);
case ISD::BSWAP: return combineBSWAP(N, DCI);		case ISD::BSWAP: return combineBSWAP(N, DCI);
case SystemZISD::BR_CCMASK: return combineBR_CCMASK(N, DCI);		case SystemZISD::BR_CCMASK: return combineBR_CCMASK(N, DCI);
case SystemZISD::SELECT_CCMASK: return combineSELECT_CCMASK(N, DCI);		case SystemZISD::SELECT_CCMASK: return combineSELECT_CCMASK(N, DCI);
case SystemZISD::GET_CCMASK: return combineGET_CCMASK(N, DCI);		case SystemZISD::GET_CCMASK: return combineGET_CCMASK(N, DCI);
case ISD::SDIV:		case ISD::SDIV:
case ISD::UDIV:		case ISD::UDIV:
case ISD::SREM:		case ISD::SREM:
case ISD::UREM: return combineIntDIVREM(N, DCI);		case ISD::UREM: return combineIntDIVREM(N, DCI);
[... 1,633 lines not shown ...]

llvm/lib/Target/SystemZ/SystemZOperators.td

[... 286 lines not shown ...]
	def z_loadbswap : SDNode<"SystemZISD::LRV", SDTLoad,			def z_loadbswap : SDNode<"SystemZISD::LRV", SDTLoad,
	[SDNPHasChain, SDNPMayLoad, SDNPMemOperand]>;			[SDNPHasChain, SDNPMayLoad, SDNPMemOperand]>;
	def z_storebswap : SDNode<"SystemZISD::STRV", SDTStore,			def z_storebswap : SDNode<"SystemZISD::STRV", SDTStore,
	[SDNPHasChain, SDNPMayStore, SDNPMemOperand]>;			[SDNPHasChain, SDNPMayStore, SDNPMemOperand]>;
	def z_loadeswap : SDNode<"SystemZISD::VLER", SDTLoad,			def z_loadeswap : SDNode<"SystemZISD::VLER", SDTLoad,
	[SDNPHasChain, SDNPMayLoad, SDNPMemOperand]>;			[SDNPHasChain, SDNPMayLoad, SDNPMemOperand]>;
	def z_storeeswap : SDNode<"SystemZISD::VSTER", SDTStore,			def z_storeeswap : SDNode<"SystemZISD::VSTER", SDTStore,
	[SDNPHasChain, SDNPMayStore, SDNPMemOperand]>;			[SDNPHasChain, SDNPMayStore, SDNPMemOperand]>;
				def z_vllez : SDNode<"SystemZISD::VLLEZ", SDTLoad,
				[SDNPHasChain, SDNPMayLoad, SDNPMemOperand]>;

	def z_tdc : SDNode<"SystemZISD::TDC", SDT_ZTest>;			def z_tdc : SDNode<"SystemZISD::TDC", SDT_ZTest>;

	// Defined because the index is an i32 rather than a pointer.			// Defined because the index is an i32 rather than a pointer.
	def z_vector_insert : SDNode<"ISD::INSERT_VECTOR_ELT",			def z_vector_insert : SDNode<"ISD::INSERT_VECTOR_ELT",
	SDT_ZInsertVectorElt>;			SDT_ZInsertVectorElt>;
	def z_vector_extract : SDNode<"ISD::EXTRACT_VECTOR_ELT",			def z_vector_extract : SDNode<"ISD::EXTRACT_VECTOR_ELT",
	SDT_ZExtractVectorElt>;			SDT_ZExtractVectorElt>;
[... 503 lines not shown ...]
	def z_vlef64 : z_vle<f64, load>;			def z_vlef64 : z_vle<f64, load>;
	// Byte-swapped vector element loads.			// Byte-swapped vector element loads.
	def z_vlebri16 : z_vle<i32, z_loadbswap16>;			def z_vlebri16 : z_vle<i32, z_loadbswap16>;
	def z_vlebri32 : z_vle<i32, z_loadbswap32>;			def z_vlebri32 : z_vle<i32, z_loadbswap32>;
	def z_vlebri64 : z_vle<i64, z_loadbswap64>;			def z_vlebri64 : z_vle<i64, z_loadbswap64>;

	// Load a scalar and insert it into the low element of the high i64 of a			// Load a scalar and insert it into the low element of the high i64 of a
	// zeroed vector.			// zeroed vector.
	class z_vllez<ValueType scalartype, SDPatternOperator load, int index>			class vllez_insertpat<ValueType scalartype, SDPatternOperator load, int index>
	: PatFrag<(ops node:$addr),			: PatFrag<(ops node:$addr),
	(z_vector_insert immAllZerosV,			(z_vector_insert immAllZerosV,
	(scalartype (load node:$addr)), (i32 index))>;			(scalartype (load node:$addr)), (i32 index))>;
	def z_vllezi8 : z_vllez<i32, anyextloadi8, 7>;
	def z_vllezi16 : z_vllez<i32, anyextloadi16, 3>;			class vllez_patterns<ValueType scalartype, SDPatternOperator load, int index>
	def z_vllezi32 : z_vllez<i32, load, 1>;			: PatFrags<(ops node:$addr),
				[(z_vector_insert immAllZerosV,
				(scalartype (load node:$addr)), (i32 index)),
				(z_vllez node:$addr)]>;

				def z_vllezi8 : vllez_patterns<i32, anyextloadi8, 7>;
				def z_vllezi16 : vllez_patterns<i32, anyextloadi16, 3>;
				def z_vllezi32 : vllez_patterns<i32, load, 1>;
	def z_vllezi64 : PatFrags<(ops node:$addr),			def z_vllezi64 : PatFrags<(ops node:$addr),
	[(z_vector_insert immAllZerosV,			[(z_vector_insert immAllZerosV,
	(i64 (load node:$addr)), (i32 0)),			(i64 (load node:$addr)), (i32 0)),
	(z_join_dwords (i64 (load node:$addr)), (i64 0))]>;			(z_join_dwords (i64 (load node:$addr)), (i64 0)),
				(z_vllez node:$addr)]>;
	// We use high merges to form a v4f32 from four f32s. Propagating zero			// We use high merges to form a v4f32 from four f32s. Propagating zero
	// into all elements but index 1 gives this expression.			// into all elements but index 1 gives this expression.
	def z_vllezf32 : PatFrag<(ops node:$addr),			def z_vllezf32 : PatFrag<(ops node:$addr),
	(z_merge_high			(z_merge_high
	(v2i64			(v2i64
	(z_unpackl_high			(z_unpackl_high
	(v4i32			(v4i32
	(bitconvert			(bitconvert
	(v4f32 (scalar_to_vector			(v4f32 (scalar_to_vector
	(f32 (load node:$addr)))))))),			(f32 (load node:$addr)))))))),
	(v2i64			(v2i64
	(bitconvert (v4f32 immAllZerosV))))>;			(bitconvert (v4f32 immAllZerosV))))>;
	def z_vllezf64 : PatFrag<(ops node:$addr),			def z_vllezf64 : PatFrag<(ops node:$addr),
	(z_merge_high			(z_merge_high
	(v2f64 (scalar_to_vector (f64 (load node:$addr)))),			(v2f64 (scalar_to_vector (f64 (load node:$addr)))),
	immAllZerosV)>;			immAllZerosV)>;

	// Similarly for the high element of a zeroed vector.			// Similarly for the high element of a zeroed vector.
	def z_vllezli32 : z_vllez<i32, load, 0>;			def z_vllezli32 : vllez_insertpat<i32, load, 0>;
	def z_vllezlf32 : PatFrag<(ops node:$addr),			def z_vllezlf32 : PatFrag<(ops node:$addr),
	(z_merge_high			(z_merge_high
	(v2i64			(v2i64
	(bitconvert			(bitconvert
	(z_merge_high			(z_merge_high
	(v4f32 (scalar_to_vector			(v4f32 (scalar_to_vector
	(f32 (load node:$addr)))),			(f32 (load node:$addr)))),
	(v4f32 immAllZerosV)))),			(v4f32 immAllZerosV)))),
	(v2i64			(v2i64
	(bitconvert (v4f32 immAllZerosV))))>;			(bitconvert (v4f32 immAllZerosV))))>;

	// Byte-swapped variants.			// Byte-swapped variants.
	def z_vllebrzi16 : z_vllez<i32, z_loadbswap16, 3>;			def z_vllebrzi16 : vllez_insertpat<i32, z_loadbswap16, 3>;
	def z_vllebrzi32 : z_vllez<i32, z_loadbswap32, 1>;			def z_vllebrzi32 : vllez_insertpat<i32, z_loadbswap32, 1>;
	def z_vllebrzli32 : z_vllez<i32, z_loadbswap32, 0>;			def z_vllebrzli32 : vllez_insertpat<i32, z_loadbswap32, 0>;
	def z_vllebrzi64 : PatFrags<(ops node:$addr),			def z_vllebrzi64 : PatFrags<(ops node:$addr),
	[(z_vector_insert immAllZerosV,			[(z_vector_insert immAllZerosV,
	(i64 (z_loadbswap64 node:$addr)),			(i64 (z_loadbswap64 node:$addr)),
	(i32 0)),			(i32 0)),
	(z_join_dwords (i64 (z_loadbswap64 node:$addr)),			(z_join_dwords (i64 (z_loadbswap64 node:$addr)),
	(i64 0))]>;			(i64 0))]>;


[... 55 lines not shown ...]

llvm/test/CodeGen/SystemZ/vec-move-23.ll

This file was added.

				; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z14 | FileCheck %s -check-prefixes=CHECK,Z14
				; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z15 | FileCheck %s -check-prefixes=CHECK,Z15
				;
				; Check that uitofp conversions from a narrower type get a vector zero extend.
				;
				; Also test that shuffled and zero extended vector loads gets implemented
				; with vllez + vle in the case of a <2 x i64> result.

				define void @fun1(<2 x i16>* %Src, <2 x double>* %Dst) {
				; CHECK-LABEL: fun1:
				; CHECK: vlrepf %v0, 0(%r2)
				; CHECK-NEXT: vuplhh %v0, %v0
				; CHECK-NEXT: vuplhf %v0, %v0
				; CHECK-NEXT: vcdlgb %v0, %v0, 0, 0
				; CHECK-NEXT: vst %v0, 0(%r3), 3
				; CHECK-NEXT: br %r14
				%l = load <2 x i16>, <2 x i16>* %Src
				%c = uitofp <2 x i16> %l to <2 x double>
				store <2 x double> %c, <2 x double>* %Dst
				ret void
				}

				define void @fun2(<2 x i16>* %Src, <2 x double>* %Dst) {
				; CHECK-LABEL: fun2:
				; CHECK: vllezh %v0, 2(%r2)
				; CHECK-NEXT: vleh %v0, 0(%r2), 7
				; CHECK-NEXT: vcdlgb %v0, %v0, 0, 0
				; CHECK-NEXT: vst %v0, 0(%r3), 3
				; CHECK-NEXT: br %r14
				%l = load <2 x i16>, <2 x i16>* %Src
				%sh = shufflevector <2 x i16> %l, <2 x i16> undef, <2 x i32> <i32 1, i32 0>
				%c = uitofp <2 x i16> %sh to <2 x double>
				store <2 x double> %c, <2 x double>* %Dst
				ret void
				}

				define void @fun3(<4 x i16>* %Src, <4 x double>* %Dst) {
				; CHECK-LABEL: fun3:
				; CHECK: vllezh %v0, 4(%r2)
				; CHECK-NEXT: vleh %v0, 2(%r2), 7
				; CHECK-NEXT: vllezh %v1, 6(%r2)
				; CHECK-NEXT: vleh %v1, 0(%r2), 7
				; CHECK-NEXT: vcdlgb %v1, %v1, 0, 0
				; CHECK-NEXT: vcdlgb %v0, %v0, 0, 0
				; CHECK-NEXT: vst %v0, 16(%r3), 4
				; CHECK-NEXT: vst %v1, 0(%r3), 4
				; CHECK-NEXT: br %r14
				%l = load <4 x i16>, <4 x i16>* %Src
				%sh = shufflevector <4 x i16> %l, <4 x i16> undef, <4 x i32> <i32 3, i32 0, i32 2, i32 1>
				%c = uitofp <4 x i16> %sh to <4 x double>
				store <4 x double> %c, <4 x double>* %Dst
				ret void
				}

				define void @fun4(<4 x i8>* %Src, <4 x double>* %Dst) {
				; CHECK-LABEL: fun4:
				; CHECK: vllezb %v0, 2(%r2)
				; CHECK-NEXT: vleb %v0, 1(%r2), 15
				; CHECK-NEXT: vllezb %v1, 3(%r2)
				; CHECK-NEXT: vleb %v1, 0(%r2), 15
				; CHECK-NEXT: vcdlgb %v1, %v1, 0, 0
				; CHECK-NEXT: vcdlgb %v0, %v0, 0, 0
				; CHECK-NEXT: vst %v0, 16(%r3), 4
				; CHECK-NEXT: vst %v1, 0(%r3), 4
				; CHECK-NEXT: br %r14
				%l = load <4 x i8>, <4 x i8>* %Src
				%sh = shufflevector <4 x i8> %l, <4 x i8> undef, <4 x i32> <i32 3, i32 0, i32 2, i32 1>
				%c = uitofp <4 x i8> %sh to <4 x double>
				store <4 x double> %c, <4 x double>* %Dst
				ret void
				}

				define void @fun5(<4 x i16>* %Src, <4 x float>* %Dst) {
				; CHECK-LABEL: fun5:
				; Z14: larl %r1, .LCPI4_0
				; Z14-NEXT: vlrepg %v0, 0(%r2)
				; Z14-NEXT: vl %v1, 0(%r1), 3
				; Z14-NEXT: vperm %v0, %v0, %v0, %v1
				; Z14-NEXT: vuplhh %v0, %v0
				; Z14-NEXT: vlgvf %r0, %v0, 3
				; Z14-NEXT: celfbr %f1, 0, %r0, 0
				; Z14-NEXT: vlgvf %r0, %v0, 2
				; Z14-NEXT: celfbr %f2, 0, %r0, 0
				; Z14-NEXT: vlgvf %r0, %v0, 1
				; Z14-NEXT: vmrhf %v1, %v2, %v1
				; Z14-NEXT: celfbr %f2, 0, %r0, 0
				; Z14-NEXT: vlgvf %r0, %v0, 0
				; Z14-NEXT: celfbr %f0, 0, %r0, 0
				; Z14-NEXT: vmrhf %v0, %v0, %v2
				; Z14-NEXT: vmrhg %v0, %v0, %v1
				; Z14-NEXT: vst %v0, 0(%r3), 3
				; Z14-NEXT: br %r14

				; Z15: larl %r1, .LCPI4_0
				; Z15-NEXT: vlrepg %v0, 0(%r2)
				; Z15-NEXT: vl %v1, 0(%r1), 3
				; Z15-NEXT: vperm %v0, %v0, %v0, %v1
				; Z15-NEXT: vuplhh %v0, %v0
				; Z15-NEXT: vcelfb %v0, %v0, 0, 0
				; Z15-NEXT: vst %v0, 0(%r3), 3
				; Z15-NEXT: br %r14
				%l = load <4 x i16>, <4 x i16>* %Src
				%sh = shufflevector <4 x i16> %l, <4 x i16> undef, <4 x i32> <i32 3, i32 0, i32 2, i32 1>
				%c = uitofp <4 x i16> %sh to <4 x float>
				store <4 x float> %c, <4 x float>* %Dst
				ret void
				}

				define void @fun6(<4 x i16>* %Src, <4 x i64>* %Dst) {
				; CHECK-LABEL: fun6:
				; CHECK: vllezh %v0, 6(%r2)
				; CHECK-NEXT: vleh %v0, 0(%r2), 7
				; CHECK-NEXT: vllezh %v1, 4(%r2)
				; CHECK-NEXT: vleh %v1, 2(%r2), 7
				; CHECK-NEXT: vst %v1, 16(%r3), 4
				; CHECK-NEXT: vst %v0, 0(%r3), 4
				; CHECK-NEXT: br %r14
				%l = load <4 x i16>, <4 x i16>* %Src
				%sh = shufflevector <4 x i16> %l, <4 x i16> undef, <4 x i32> <i32 3, i32 0, i32 2, i32 1>
				%z = zext <4 x i16> %sh to <4 x i64>
				store <4 x i64> %z, <4 x i64>* %Dst
				ret void
				}

				; Don't use vllez and multiple vle instructions.
				define void @fun7(<4 x i16>* %Src, <4 x i32>* %Dst) {
				; CHECK-LABEL: fun7:
				; CHECK: larl %r1, .LCPI6_0
				; CHECK-NEXT: vlrepg %v0, 0(%r2)
				; CHECK-NEXT: vl %v1, 0(%r1), 3
				; CHECK-NEXT: vperm %v0, %v0, %v0, %v1
				; CHECK-NEXT: vuplhh %v0, %v0
				; CHECK-NEXT: vst %v0, 0(%r3), 3
				; CHECK-NEXT: br %r14
				%l = load <4 x i16>, <4 x i16>* %Src
				%sh = shufflevector <4 x i16> %l, <4 x i16> undef, <4 x i32> <i32 3, i32 0, i32 2, i32 1>
				%z = zext <4 x i16> %sh to <4 x i32>
				store <4 x i32> %z, <4 x i32>* %Dst
				ret void
				}