This is an archive of the discontinued LLVM Phabricator instance.

Differential D78486

[SystemZ] Expand vector zero-extend into a shuffle.
Closed, Public

Authored by jonpa on Apr 20 2020, 5:44 AM.


Details

Reviewers

uweigand

Commits

rGef7aad0db49f: [SystemZ] Improve handling of ZERO_EXTEND_VECTOR_INREG.

Summary

This is a preparatory patch for generating more VLLEZ instructions, although it also has some benefits on its own.

lowerExtendVectorInreg() no longer always unpacks ZERO_EXTEND_VECTOR_INREG; instead it expands it to a shuffle with a zero vector.
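As a rough illustration of what "a shuffle with a zero vector" means here, the sketch below builds such a mask with plain standard-library types. It is not the patch's code: the function name zextShuffleMask is hypothetical, it assumes the big-endian element layout of SystemZ (zeros go in front of each source element), and it uses increasing indexes into the zero vector, as a later revision of this patch describes.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper, for illustration only.
// Builds a two-input shuffle mask, in units of input elements, that
// zero-extends the first NumOut elements of a NumIn-element vector.
// Indices < NumIn select from the source vector; indices >= NumIn select
// from a second, all-zero vector.
std::vector<int> zextShuffleMask(unsigned NumIn, unsigned NumOut) {
  assert(NumOut != 0 && NumIn % NumOut == 0 && "element counts must divide");
  unsigned InPerOut = NumIn / NumOut;
  std::vector<int> Mask(NumIn);
  unsigned ZeroElt = NumIn; // next index into the zero vector
  for (unsigned Out = 0; Out < NumOut; ++Out) {
    unsigned M = Out * InPerOut;
    unsigned End = M + InPerOut - 1;
    for (; M < End; ++M)
      Mask[M] = ZeroElt++; // leading (most significant) positions: zeros
    Mask[M] = Out;         // last position: the source element itself
  }
  return Mask;
}
```

For a v16i8 to v8i16 extension, zextShuffleMask(16, 8) interleaves zero-vector indices 16..23 with source indices 0..7.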

GeneralShuffle::getNode() now detects and handles cases which involve a zero vector. An advantage of this is that it is now possible to tune when a vperm or unpacking should be used, depending on how many unpacks are needed.
It seemed to work better to use the new UnpackInfo struct to work on the byte level directly, rather than trying to detect byte sequences and from that deduce the unpacks needed.

It is also necessary to handle the cases of a shuffled input source, which must be done cleverly to avoid generating worse code.

An alternative to this - for the purpose of VLLEZ - might be to only expand those potential candidates to shuffles. Expanding all of them generally reduces the number of vperms since more cases can now be handled with unpacks, but there is also a need to find the zero vector and make sure to handle it last in getNode().

When I realized at one point that the number of permutes had increased, I attempted an algorithm to find different orders of shuffling in getNode(). This was however not a simple thing to do, since the DAG nodes could then not be created when evaluating an order of the shuffles. I then found out that putting the new zero vector last handled this problem, and now the number of vperms is the same or lower on all files with this.

On Imagick (this time without ffp-contract=fast), this gave a good improvement *without* VLLEZ. Using the other vllez-patch directly (https://reviews.llvm.org/D76275) gave a slightly bigger improvement. Using this patch and also generating VLLEZs gave the exact same improvement as the first patch. So looking at just this benchmark, I see that this patch is beneficial on its own, but if generating VLLEZ it doesn't really play a role.

Apart from Imagick, I see a 2.5% improvement on x264 if using -max-unpackops=2, which is interesting.

Other notes:

lowerExtendVectorInreg(): All extended bytes are defined to 0, even if the original element was -1. Is this needed, or could it be assumed that a zero-extended undefined element has all bytes undefined as well? Not sure if it matters.

unpacking with a permute: There are a few (140) cases with only two defined elements, where the VPERM of Best.Mask can be done with a VREPI. Most of those cases however required 3 unpacks after the VPERM, and just a dozen could be done with 2 unpacks after. I removed that check since it also did not seem to improve any benchmarks.

New tests: fun0: This particular set of values (i8 compares -> i32) seems to not work on trunk while others do.

Use -debug-only=systemz-lower to see what is going on.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jonpa created this revision. Apr 20 2020, 5:44 AM

Herald added a subscriber: hiraditya. Apr 20 2020, 5:44 AM

jonpa edited the summary of this revision. Apr 20 2020, 5:48 AM

Hi Jonas, I haven't looked into everything in detail, but first one fundamental question: My understanding was that the current GeneralShuffle code would detect the shuffle you generate for a zero-extend as a case of a MERGE. Later, combineMERGE would detect that one input of the MERGE is a zero vector, and replace the merge by an UNPACKL.

Is there some reason why this doesn't work for you? Why do you have to create the UNPACKL already in GeneralShuffle?

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
5247	Now that this routine shares no code at all between the sign- and zero-extend case, I think it would make more sense to just have two different routines lowerSIGN_EXTEND_VECTOR_INREG and lowerZERO_EXTEND_VECTOR_INREG.

Hi Jonas, I haven't looked into everything in detail, but first one fundamental question: My understanding was that the current GeneralShuffle code would detect the shuffle you generate for a zero-extend as a case of a MERGE. Later, combineMERGE would detect that one input of the MERGE is a zero vector, and replace the merge by an UNPACKL.

Is there some reason why this doesn't work for you? Why do you have to create the UNPACKL already in GeneralShuffle?

That will only work in cases where only one unpack is needed, and only if the input does not need any permutation of any kind.

Patch changed to use increasing indexes into the zero vector so that the simple case can still be matched with a MERGE_HIGH.

I am not that sure about the actual best tuning yet, but so far I have aimed to reduce the number of VPERMs. I originally thought that using isByteZero() to check whether a single byte was known zero would allow detecting more cases where only one element or byte was zero. This way, a zero byte in any of the operands would be detected, like a high 0 byte in a wider immediate operand, for instance. I see now however that this is no improvement compared to simply finding a zero or undef vector input, so I have changed the patch to do this instead.

lowerExtendVectorInreg() split into lowerSIGN_EXTEND_VECTOR_INREG() and lowerZERO_EXTEND_VECTOR_INREG().

Herald added a project: Restricted Project. Apr 27 2020, 6:06 AM

Herald added a subscriber: llvm-commits.

It certainly seems to be an improvement on two benchmarks to do just either one unpack or else a vperm (and not multiple unpacks). In fact, i505.mcf actually regressed (2.5%) when doing two unpacks instead of a vperm (with -ffp-contract=fast). So the initial idea of reducing the number of vperms seems to have been proven wrong - it is better to have a single vperm on the critical path rather than multiple unpacks.

Patch simplified to do just one unpack whenever possible.

Tests updated.

In D78486#2012867, @jonpa wrote:

It certainly seems to be an improvement on two benchmarks to do just either one unpack or else a vperm (and not multiple unpacks). In fact, i505.mcf actually regressed (2.5%) when doing two unpacks instead of a vperm (with -ffp-contract=fast). So the initial idea of reducing the number of vperms seems to have been proven wrong - it is better to have a single vperm on the critical path rather than multiple unpacks.

Huh, interesting. That's good to know, and certainly makes the code simpler as well.

I'm wondering: unless I'm missing something, there's still one specific case where you generate a vperm followed by an unpack (the case where you already had a permute as source). Wouldn't it be preferable to just use a single vperm there as well?

I'm wondering: unless I'm missing something, there's still one specific case where you generate a vperm followed by an unpack (the case where you already had a permute as source). Wouldn't it be preferable to just use a single vperm there as well?

This is the case where the input is a permute that has no other users, which means that it's being replaced. So instead of vperm + vperm, we get a vperm + unpack, which follows the same reasoning of replacing a vperm with a single unpack.

For example, in the case with three source ops, where A and B first are combined with a permute, and now AB and C are to be combined: instead of permuting AB and C, the AB permute is changed so that AB and C can be unpacked.

This now happens 100 times on SPEC'17:

                                 patch  <>  "No permute replacement"
vperm          :                23803                23903     +100
vl             :               109720               109797      +77
vuplhb         :                  151                  101      -50
vuplhh         :                  177                  127      -50
vst            :                85032                85078      +46
larl           :               371586               371616      +30
vgbm           :                11637                11644       +7
mvi            :                23380                23381       +1

In D78486#2014509, @jonpa wrote:

I'm wondering: unless I'm missing something, there's still one specific case where you generate a vperm followed by an unpack (the case where you already had a permute as source). Wouldn't it be preferable to just use a single vperm there as well?

This is the case where the input is a permute that has no other users, which means that it's being replaced. So instead of vperm + vperm, we get a vperm + unpack, which follows the same reasoning of replacing a vperm with a single unpack.

For example, in the case with three source ops, where A and B first are combined with a permute, and now AB and C are to be combined: instead of permuting AB and C, the AB permute is changed so that AB and C can be unpacked.

Ah, that's what I missed: the case where the first permute has already two different inputs!

Now I'm wondering whether this couldn't also apply in other cases, beyond just perm + unpack. Would it be possible to handle all cases by recognizing the case where the source of a permute is itself a (single-use) permute early, e.g. in GeneralShuffle::add ? Hmm, reading that just now, it seems there is already such code:

else if (Op.getOpcode() == ISD::VECTOR_SHUFFLE && Op.hasOneUse()) {
  // See whether the bytes we need come from a contiguous part of one
  // operand.

So now I'm wondering instead: why does your new code ever trigger in the first place; why isn't this already handled in GeneralShuffle::add?

I also ran SPEC'2006 once with this new patch and found to my surprise a big (7%) regression on perlbench on z14, which I have never seen before, while on z15 everything was as expected. Even though I was not sure whether this result was trustworthy (given that perl'17 looked ok), I looked into it a bit. I found that things had only changed as expected: multiple unpacks had become vperms instead. Two things were obvious while looking at the generated code:

I see a larl+vl+vgbm inside a single-block loop prior to the vperm. Those three instructions are loop-invariant and should (and could) be hoisted to before the loop. It seems that the target instructions look ok (the load is tagged with a 'constant-pool' MO for instance), but MachineLICM (Post-RA) is being far too conservative about when a physreg def cannot be moved. Post-RA, it is keeping track of physreg clobbers for the entire outer loop and then checks if the inner loop clobbers any of those regs, it seems. That means that our LARL in a single-block loop inside a bigger loop cannot be moved out of the inner loop even though that would make sense. MachineLICM is run also pre-RA, but then only outermost loops are visited.

Doing this by hand in the assembly file was easy, but hoisting those instructions out of the loop proved to be entirely ineffective (no improvement).

One more vperm mask in the constant pool adds 16 bytes in memory, which could possibly lead to some bad side-effects. That seems unlikely to me, however, since inserting that constant in the assembly does not change the function as such - it seems that the constant pool area is separate from the function using the constant. I also tried inserting a dummy use along with the 16-byte constant but that did not slow down the faster version using unpacks.

This is still strange to me and it seems unpredictable to sort out so it is probably not worth further attention for the moment...

So now I'm wondering instead: why does your new code ever trigger in the first place; why isn't this already handled in GeneralShuffle::add?

I think those would be two different cases, where my handling is addressing a *new* permute resulting from combining two operands, when trying to combine (unpack) a third source operand with that first (new) permute.

In D78486#2020042, @jonpa wrote:

So now I'm wondering instead: why does your new code ever trigger in the first place; why isn't this already handled in GeneralShuffle::add?

I think those would be two different cases, where my handling is addressing a *new* permute resulting from combining two operands, when trying to combine (unpack) a third source operand with that first (new) permute.

Ah, I see. So this permute is one that had just been generated via getGeneralPermuteNode earlier in the loop? It's a bit unfortunate to first create that ISD node and then right away delete it again.

Maybe one option to try would be to do the check whether this permutation looks like an unpack (possibly requiring a permutation) at the top level *before* the loop. If this is true, then apply the permutation (in reverse) to the Bytes array, remove the zerovec from the Ops list, and run through the loop. At the end, remember to actually generate the unpack node. This would also remove the need to special-case handling the zerovec through the loop.
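A minimal standalone sketch of this suggestion, using std::vector in place of the LLVM containers. The names matchUnpack and undoUnpack are hypothetical; the byte numbering follows the Bytes convention in this diff (Bytes[I] / 16 is the operand number, -1 means undefined) and assumes the big-endian layout where the zero bytes lead each widened element. Renumbering the remaining operands after the zero vector is removed is left out.

```cpp
#include <cassert>
#include <vector>

constexpr unsigned VectorBytes = 16;

// Hypothetical helper: returns the source element size (1, 2 or 4) if Bytes
// looks like a zero-extending unpack whose zero bytes all come from operand
// ZeroOp, or 0 if no such unpack matches.
unsigned matchUnpack(const std::vector<int> &Bytes, unsigned ZeroOp) {
  assert(Bytes.size() == VectorBytes && "expected a full byte vector");
  for (unsigned From = 1; From <= 4; From *= 2) {
    unsigned To = From * 2;
    bool Match = true;
    for (unsigned I = 0; I < VectorBytes && Match; ++I) {
      if (Bytes[I] == -1)
        continue; // undefined bytes match anything
      bool IsZextByte = (I % To) < From; // leading bytes of each wide element
      unsigned OpNo = unsigned(Bytes[I]) / VectorBytes;
      if (IsZextByte != (OpNo == ZeroOp))
        Match = false;
    }
    if (Match)
      return From;
  }
  return 0;
}

// Hypothetical helper: applies the unpack in reverse, keeping only the source
// bytes, packed into the first half; the second half becomes undefined (-1).
std::vector<int> undoUnpack(const std::vector<int> &Bytes, unsigned From) {
  assert(Bytes.size() == VectorBytes && (From == 1 || From == 2 || From == 4));
  std::vector<int> Out;
  for (unsigned I = 0; I < VectorBytes; I += 2 * From)
    for (unsigned J = 0; J < From; ++J)
      Out.push_back(Bytes[I + From + J]);
  Out.resize(VectorBytes, -1);
  return Out;
}
```

With the reduced byte vector the usual shuffle tree can be built from the remaining operands, and a single unpack of the matched element size would be emitted at the end.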

The whole operation would be worthwhile only if by performing the unpack we can remove the zerovec (i.e. it isn't still used elsewhere), and if by removing the zerovec the depth of the resulting tree (i.e. the critical path length) is also reduced. The depth is the log2 of the number of operands (rounded up), so if reducing the zerovec takes us from 3 to 2 ops, it is worthwhile, but if it takes us from 2 to 1 ops (or from 4 to 3 ops), then it probably is not, since adding the unpack would then just add one more instruction to the critical path.
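The depth argument can be checked with a few lines of standalone code. This is only a sketch of the arithmetic: log2Ceil and removalReducesDepth are hypothetical names, and the tree is assumed to be a balanced tree of two-input shuffles.

```cpp
#include <cassert>

// Smallest D such that 2^D >= N, for N >= 1: the depth of a balanced tree
// of two-input shuffles that combines N operands.
unsigned log2Ceil(unsigned N) {
  assert(N >= 1 && "need at least one operand");
  unsigned D = 0;
  while ((1u << D) < N)
    ++D;
  return D;
}

// True if dropping one operand (the zero vector) makes the tree shallower.
bool removalReducesDepth(unsigned NumOps) {
  assert(NumOps >= 2 && "need a zero vector plus at least one source");
  return log2Ceil(NumOps - 1) < log2Ceil(NumOps);
}
```

Going from 3 to 2 operands reduces the depth from 2 to 1, while going from 4 to 3 leaves it at 2. Going from 2 to 1 also reduces the computed depth, but the trailing unpack adds an instruction back, which is why that case needs separate treatment.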

Does this make sense?

Patch updated per suggestions.

Ah, I see. So this permute is one that had just been generated via getGeneralPermuteNode earlier in the loop? It's a bit unfortunate to first create that ISD node and then right away delete it again.

Yes, that's the permute that was replaced.

Maybe one option to try would be to do the check whether this permutation looks like an unpack (possibly requiring a permutation) at the top level *before* the loop. If this is true, then apply the permutation (in reverse) to the Bytes array, remove the zerovec from the Ops list, and run through the loop. At the end, remember to actually generate the unpack node. This would also remove the need to special-case handling the zerovec through the loop.

Ah, yes... :-) Now that it seems clear that we want to use just a single unpack, it is simpler to do it in this more integrated way in getNode().

The whole operation would be worthwhile only if by performing the unpack we can remove the zerovec (i.e. it isn't still used elsewhere), and if by removing the zerovec the depth of the resulting tree (i.e. the critical path length) is also reduced. The depth is the log2 of the number of operands (rounded up), so if reducing the zerovec takes us from 3 to 2 ops, it is worthwhile, but if it takes us from 2 to 1 ops (or from 4 to 3 ops), then it probably is not, since adding the unpack would then just add one more instruction to the critical path.

That's a very good point - the idea behind building the tree of operations in the loop must be to avoid serialization so we shouldn't increase the depth with a trailing unpack. I added a check for this in case of >2 operands. For the case of a single source node shuffled with a zero vector (2 ops), there is instead a check that the unpack is enough by itself.

In rewriting this now I saw a (previous patch version) case where moving the zero-vector to last in Ops resulted in a vpkg (PACK) instead of a vperm. I am not sure if that was just a rare case, or if there are other operations than unpack that could also replace vperm if zero vector is put last. That is a separate issue from this, though, I think. Going to this version of the patch there were some minor changes which I think are the variations that arise from rearranging the Bytes vector when preparing for the unpack.

I still see two nice improvements overnight: imagick: 13% and x264: 4%. :-)

Patch rebased.

I updated the tests and noticed that not all VGBMs were going away. In order to achieve that I added look-through of BITCAST in isZeroVector(). This changed a few files on SPEC'17:

In total 11 fewer VGBMs on SPEC, over 10 files, but also some unexpected changes:

morphology.s: One *more* vgbm than before. Regalloc now seems to spill the big VGBM interval, and as a result the VGBM is rematerialized in two places, which actually makes for one more VGBM with this. Here the VGBM was used also by a VFMAXDB. The VGBM was available over many blocks.

fx.s: four more new and different VPERM masks (LCPI constants) in the text section. This seems to be because four masks were reused before, while now they are not. This is because only one of the users of that mask had a zero vector operand. With the zero vector this mask is used:

cp#8: <i8 0, i8 1, i8 2, i8 3, i8 4, i8 5, i8 16, i8 17, i8 6, i8 7, i8 8, i8 9, i8 10, i8 11, i8 18, i8 19>

Without the zero vector, this mask instead:
cp#2: <i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 16, i8 17, i8 0, i8 0, i8 0, i8 0, i8 0, i8 0, i8 18, i8 19>

This file also has a few more spills/reloads for some reason, probably because of the increased presence of mask registers. These side-effects are just noticed in this particular file...

It may be possible to treat this as a global problem where the goal is to minimize the number of LCPI vector constants needed, considering that removing the use of the zero vector can instead cause a new mask to be needed, whereas before an already existing mask could be reused. Not sure how needed or easy that would be...

In addition to this, some NFC changes:

Removed the check for the undef vector also in this patch.
Minor fixes such as variable renaming.
Tests no longer need VGBMs :-)

                               master      <>        patch
vuplhh         :                 3323                  241    -3082
vuplhb         :                 3137                  220    -2917
vperm          :                20996                23777    +2781
vpkg           :                 1535                  750     -785
vl             :               109474               109943     +469
vuplhf         :                  778                  329     -449
vmrlg          :                 4481                 4088     -393
vsldb          :                  566                  191     -375
larl           :               371292               371627     +335
cghsi          :                34550                34393     -157
lg             :              1092297              1092153     -144

...

                               master      <>        patch
Spill|Reload   :               683254               683105     -149
Copies         :              1036372              1036441      +69
grep "LCPI.*:" :                17540                17741     +203

About the LCPI vector constants: It seems that those 200 extra constants I noticed were not really due to the optimization of eliminating the zero-vector for PERMUTE. That optimization eliminates 239 VGBMs on SPEC, while adding only 5 more LCPIs. All the other new LCPIs caused by this patch seem to be regular VPERM masks for zero extension that are duplicated once for each function. In other words, I see the same mask in the constant pool with one copy for each function. I looked at this at file scope only (.s / .o), and I am hoping that either that is not a problem, or maybe that the linker optimizes this?

Overnight results look good both with and w/out the zero-vector optimization (not much difference).

jonpa mentioned this in D78488: [SystemZ] Emit VLLEZ from tryShuffleWithZeroVec(). May 21 2020, 1:15 AM

jonpa marked an inline comment as done. May 21 2020, 1:39 AM

jonpa added inline comments.

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
5278	I rewrote this loop from the version on Apr 20 2020, 14:06, so that the indexes into the zero vector would be not just the first byte, but increasing. Back then that led to some matches of MERGE_HIGH, but I see now that it's a no-op, given the optimization of the zero vector we have now added. But I suppose it doesn't matter much either way...

This all looks good to me now. We should verify that performance results are still good, and then it can go in.

This revision is now accepted and ready to land. Jun 5 2020, 8:33 AM

Closed by commit rGef7aad0db49f: [SystemZ] Improve handling of ZERO_EXTEND_VECTOR_INREG. (authored by jonpa). Jun 30 2020, 12:30 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Target/SystemZ/SystemZISelLowering.h	4 lines
llvm/lib/Target/SystemZ/SystemZISelLowering.cpp	171 lines
llvm/test/CodeGen/SystemZ/ (four test files; names not preserved)	20 lines, 11 lines, 49 lines, 25 lines

Diff 274349

llvm/lib/Target/SystemZ/SystemZISelLowering.h

... (621 lines not shown)	private:
bool isVectorElementLoad(SDValue Op) const;		bool isVectorElementLoad(SDValue Op) const;
SDValue buildVector(SelectionDAG &DAG, const SDLoc &DL, EVT VT,		SDValue buildVector(SelectionDAG &DAG, const SDLoc &DL, EVT VT,
SmallVectorImpl<SDValue> &Elems) const;		SmallVectorImpl<SDValue> &Elems) const;
SDValue lowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG) const;		SDValue lowerBUILD_VECTOR(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG) const;		SDValue lowerVECTOR_SHUFFLE(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerSCALAR_TO_VECTOR(SDValue Op, SelectionDAG &DAG) const;		SDValue lowerSCALAR_TO_VECTOR(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerINSERT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;		SDValue lowerINSERT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerEXTRACT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;		SDValue lowerEXTRACT_VECTOR_ELT(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerExtendVectorInreg(SDValue Op, SelectionDAG &DAG,		SDValue lowerSIGN_EXTEND_VECTOR_INREG(SDValue Op, SelectionDAG &DAG) const;
unsigned UnpackHigh) const;		SDValue lowerZERO_EXTEND_VECTOR_INREG(SDValue Op, SelectionDAG &DAG) const;
SDValue lowerShift(SDValue Op, SelectionDAG &DAG, unsigned ByScalar) const;		SDValue lowerShift(SDValue Op, SelectionDAG &DAG, unsigned ByScalar) const;

bool canTreatAsByteVector(EVT VT) const;		bool canTreatAsByteVector(EVT VT) const;
SDValue combineExtract(const SDLoc &DL, EVT ElemVT, EVT VecVT, SDValue OrigOp,		SDValue combineExtract(const SDLoc &DL, EVT ElemVT, EVT VecVT, SDValue OrigOp,
unsigned Index, DAGCombinerInfo &DCI,		unsigned Index, DAGCombinerInfo &DCI,
bool Force) const;		bool Force) const;
SDValue combineTruncateExtract(const SDLoc &DL, EVT TruncVT, SDValue Op,		SDValue combineTruncateExtract(const SDLoc &DL, EVT TruncVT, SDValue Op,
DAGCombinerInfo &DCI) const;		DAGCombinerInfo &DCI) const;
... (89 lines not shown)

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp


... (4,461 lines not shown)	if (P.Opcode == SystemZISD::PERMUTE_DWORDS) {
Op = DAG.getNode(SystemZISD::PACK, DL, OutVT, Op0, Op1);		Op = DAG.getNode(SystemZISD::PACK, DL, OutVT, Op0, Op1);
} else {		} else {
Op = DAG.getNode(P.Opcode, DL, InVT, Op0, Op1);		Op = DAG.getNode(P.Opcode, DL, InVT, Op0, Op1);
}		}
return Op;		return Op;
}		}

static bool isZeroVector(SDValue N) {		static bool isZeroVector(SDValue N) {
		if (N->getOpcode() == ISD::BITCAST)
		N = N->getOperand(0);
if (N->getOpcode() == ISD::SPLAT_VECTOR)		if (N->getOpcode() == ISD::SPLAT_VECTOR)
if (auto *Op = dyn_cast<ConstantSDNode>(N->getOperand(0)))		if (auto *Op = dyn_cast<ConstantSDNode>(N->getOperand(0)))
return Op->getZExtValue() == 0;		return Op->getZExtValue() == 0;
return ISD::isBuildVectorAllZeros(N.getNode());		return ISD::isBuildVectorAllZeros(N.getNode());
}		}

		// Return the index of the zero/undef vector, or UINT32_MAX if not found.
		static uint32_t findZeroVectorIdx(SDValue *Ops, unsigned Num) {
		for (unsigned I = 0; I < Num ; I++)
		if (isZeroVector(Ops[I]))
		return I;
		return UINT32_MAX;
		}

// Bytes is a VPERM-like permute vector, except that -1 is used for		// Bytes is a VPERM-like permute vector, except that -1 is used for
// undefined bytes. Implement it on operands Ops[0] and Ops[1] using		// undefined bytes. Implement it on operands Ops[0] and Ops[1] using
// VSLDB or VPERM.		// VSLDB or VPERM.
static SDValue getGeneralPermuteNode(SelectionDAG &DAG, const SDLoc &DL,		static SDValue getGeneralPermuteNode(SelectionDAG &DAG, const SDLoc &DL,
SDValue *Ops,		SDValue *Ops,
const SmallVectorImpl<int> &Bytes) {		const SmallVectorImpl<int> &Bytes) {
for (unsigned I = 0; I < 2; ++I)		for (unsigned I = 0; I < 2; ++I)
Ops[I] = DAG.getNode(ISD::BITCAST, DL, MVT::v16i8, Ops[I]);		Ops[I] = DAG.getNode(ISD::BITCAST, DL, MVT::v16i8, Ops[I]);

// First see whether VSLDB can be used.		// First see whether VSLDB can be used.
unsigned StartIndex, OpNo0, OpNo1;		unsigned StartIndex, OpNo0, OpNo1;
if (isShlDoublePermute(Bytes, StartIndex, OpNo0, OpNo1))		if (isShlDoublePermute(Bytes, StartIndex, OpNo0, OpNo1))
return DAG.getNode(SystemZISD::SHL_DOUBLE, DL, MVT::v16i8, Ops[OpNo0],		return DAG.getNode(SystemZISD::SHL_DOUBLE, DL, MVT::v16i8, Ops[OpNo0],
Ops[OpNo1],		Ops[OpNo1],
DAG.getTargetConstant(StartIndex, DL, MVT::i32));		DAG.getTargetConstant(StartIndex, DL, MVT::i32));

// Fall back on VPERM. Construct an SDNode for the permute vector. Try to		// Fall back on VPERM. Construct an SDNode for the permute vector. Try to
// eliminate a zero vector by reusing any zero index in the permute vector.		// eliminate a zero vector by reusing any zero index in the permute vector.
unsigned ZeroVecIdx =		unsigned ZeroVecIdx = findZeroVectorIdx(&Ops[0], 2);
isZeroVector(Ops[0]) ? 0 : (isZeroVector(Ops[1]) ? 1 : UINT_MAX);		if (ZeroVecIdx != UINT32_MAX) {
if (ZeroVecIdx != UINT_MAX) {
bool MaskFirst = true;		bool MaskFirst = true;
int ZeroIdx = -1;		int ZeroIdx = -1;
for (unsigned I = 0; I < SystemZ::VectorBytes; ++I) {		for (unsigned I = 0; I < SystemZ::VectorBytes; ++I) {
unsigned OpNo = unsigned(Bytes[I]) / SystemZ::VectorBytes;		unsigned OpNo = unsigned(Bytes[I]) / SystemZ::VectorBytes;
unsigned Byte = unsigned(Bytes[I]) % SystemZ::VectorBytes;		unsigned Byte = unsigned(Bytes[I]) % SystemZ::VectorBytes;
if (OpNo == ZeroVecIdx && I == 0) {		if (OpNo == ZeroVecIdx && I == 0) {
// If the first byte is zero, use mask as first operand.		// If the first byte is zero, use mask as first operand.
ZeroIdx = 0;		ZeroIdx = 0;
... (41 lines not shown)	static SDValue getGeneralPermuteNode(SelectionDAG &DAG, const SDLoc &DL,
SDValue Op2 = DAG.getBuildVector(MVT::v16i8, DL, IndexNodes);		SDValue Op2 = DAG.getBuildVector(MVT::v16i8, DL, IndexNodes);
return DAG.getNode(SystemZISD::PERMUTE, DL, MVT::v16i8, Ops[0],		return DAG.getNode(SystemZISD::PERMUTE, DL, MVT::v16i8, Ops[0],
(!Ops[1].isUndef() ? Ops[1] : Ops[0]), Op2);		(!Ops[1].isUndef() ? Ops[1] : Ops[0]), Op2);
}		}

namespace {		namespace {
// Describes a general N-operand vector shuffle.		// Describes a general N-operand vector shuffle.
struct GeneralShuffle {		struct GeneralShuffle {
GeneralShuffle(EVT vt) : VT(vt) {}		GeneralShuffle(EVT vt) : VT(vt), UnpackFromEltSize(UINT_MAX) {}
void addUndef();		void addUndef();
bool add(SDValue, unsigned);		bool add(SDValue, unsigned);
SDValue getNode(SelectionDAG &, const SDLoc &);		SDValue getNode(SelectionDAG &, const SDLoc &);
		void tryPrepareForUnpack();
		bool unpackWasPrepared() { return UnpackFromEltSize <= 4; }
		SDValue insertUnpackIfPrepared(SelectionDAG &DAG, const SDLoc &DL, SDValue Op);

// The operands of the shuffle.		// The operands of the shuffle.
SmallVector<SDValue, SystemZ::VectorBytes> Ops;		SmallVector<SDValue, SystemZ::VectorBytes> Ops;

// Index I is -1 if byte I of the result is undefined. Otherwise the		// Index I is -1 if byte I of the result is undefined. Otherwise the
// result comes from byte Bytes[I] % SystemZ::VectorBytes of operand		// result comes from byte Bytes[I] % SystemZ::VectorBytes of operand
// Bytes[I] / SystemZ::VectorBytes.		// Bytes[I] / SystemZ::VectorBytes.
SmallVector<int, SystemZ::VectorBytes> Bytes;		SmallVector<int, SystemZ::VectorBytes> Bytes;

// The type of the shuffle result.		// The type of the shuffle result.
EVT VT;		EVT VT;

		// Holds a value of 1, 2 or 4 if a final unpack has been prepared for.
		unsigned UnpackFromEltSize;
};		};
}		}

// Add an extra undefined element to the shuffle.		// Add an extra undefined element to the shuffle.
void GeneralShuffle::addUndef() {		void GeneralShuffle::addUndef() {
unsigned BytesPerElement = VT.getVectorElementType().getStoreSize();		unsigned BytesPerElement = VT.getVectorElementType().getStoreSize();
for (unsigned I = 0; I < BytesPerElement; ++I)		for (unsigned I = 0; I < BytesPerElement; ++I)
Bytes.push_back(-1);		Bytes.push_back(-1);
... (66 lines not shown)

// Return SDNodes for the completed shuffle.		// Return SDNodes for the completed shuffle.
SDValue GeneralShuffle::getNode(SelectionDAG &DAG, const SDLoc &DL) {		SDValue GeneralShuffle::getNode(SelectionDAG &DAG, const SDLoc &DL) {
assert(Bytes.size() == SystemZ::VectorBytes && "Incomplete vector");		assert(Bytes.size() == SystemZ::VectorBytes && "Incomplete vector");

if (Ops.size() == 0)		if (Ops.size() == 0)
return DAG.getUNDEF(VT);		return DAG.getUNDEF(VT);

		// Use a single unpack if possible as the last operation.
		tryPrepareForUnpack();

// Make sure that there are at least two shuffle operands.		// Make sure that there are at least two shuffle operands.
if (Ops.size() == 1)		if (Ops.size() == 1)
Ops.push_back(DAG.getUNDEF(MVT::v16i8));		Ops.push_back(DAG.getUNDEF(MVT::v16i8));

// Create a tree of shuffles, deferring root node until after the loop.		// Create a tree of shuffles, deferring root node until after the loop.
// Try to redistribute the undefined elements of non-root nodes so that		// Try to redistribute the undefined elements of non-root nodes so that
// the non-root shuffles match something like a pack or merge, then adjust		// the non-root shuffles match something like a pack or merge, then adjust
// the parent node's permute vector to compensate for the new order.		// the parent node's permute vector to compensate for the new order.
... (49 lines not shown)	for (unsigned I = 0; I < SystemZ::VectorBytes; ++I)
if (Bytes[I] >= int(SystemZ::VectorBytes))		if (Bytes[I] >= int(SystemZ::VectorBytes))
Bytes[I] -= (Stride - 1) * SystemZ::VectorBytes;		Bytes[I] -= (Stride - 1) * SystemZ::VectorBytes;
}		}

// Look for an instruction that can do the permute without resorting		// Look for an instruction that can do the permute without resorting
// to VPERM.		// to VPERM.
unsigned OpNo0, OpNo1;		unsigned OpNo0, OpNo1;
SDValue Op;		SDValue Op;
if (const Permute *P = matchPermute(Bytes, OpNo0, OpNo1))		if (unpackWasPrepared() && Ops[1].isUndef())
		Op = Ops[0];
		else if (const Permute *P = matchPermute(Bytes, OpNo0, OpNo1))
Op = getPermuteNode(DAG, DL, *P, Ops[OpNo0], Ops[OpNo1]);		Op = getPermuteNode(DAG, DL, *P, Ops[OpNo0], Ops[OpNo1]);
else		else
Op = getGeneralPermuteNode(DAG, DL, &Ops[0], Bytes);		Op = getGeneralPermuteNode(DAG, DL, &Ops[0], Bytes);

		Op = insertUnpackIfPrepared(DAG, DL, Op);

return DAG.getNode(ISD::BITCAST, DL, VT, Op);		return DAG.getNode(ISD::BITCAST, DL, VT, Op);
}		}

		#ifndef NDEBUG
		static void dumpBytes(const SmallVectorImpl<int> &Bytes, std::string Msg) {
		dbgs() << Msg.c_str() << " { ";
		for (unsigned i = 0; i < Bytes.size(); i++)
		dbgs() << Bytes[i] << " ";
		dbgs() << "}\n";
		}
		#endif

		// If the Bytes vector matches an unpack operation, prepare to do the unpack
		// after all else by removing the zero vector and the effect of the unpack on
		// Bytes.
		void GeneralShuffle::tryPrepareForUnpack() {
		uint32_t ZeroVecOpNo = findZeroVectorIdx(&Ops[0], Ops.size());
		if (ZeroVecOpNo == UINT32_MAX || Ops.size() == 1)
		return;

		// Only do this if removing the zero vector reduces the depth, otherwise
		// the critical path will increase with the final unpack.
		if (Ops.size() > 2 &&
		Log2_32_Ceil(Ops.size()) == Log2_32_Ceil(Ops.size() - 1))
		return;

		// Find an unpack that would allow removing the zero vector from Ops.
		UnpackFromEltSize = 1;
		for (; UnpackFromEltSize <= 4; UnpackFromEltSize *= 2) {
		bool MatchUnpack = true;
		SmallVector<int, SystemZ::VectorBytes> SrcBytes;
		for (unsigned Elt = 0; Elt < SystemZ::VectorBytes; Elt++) {
		unsigned ToEltSize = UnpackFromEltSize * 2;
		bool IsZextByte = (Elt % ToEltSize) < UnpackFromEltSize;
		if (!IsZextByte)
		SrcBytes.push_back(Bytes[Elt]);
		if (Bytes[Elt] != -1) {
		unsigned OpNo = unsigned(Bytes[Elt]) / SystemZ::VectorBytes;
		if (IsZextByte != (OpNo == ZeroVecOpNo)) {
		MatchUnpack = false;
		break;
		}
		}
		}
		if (MatchUnpack) {
		if (Ops.size() == 2) {
		// Don't use unpack if a single source operand needs rearrangement.
		for (unsigned i = 0; i < SystemZ::VectorBytes / 2; i++)
		if (SrcBytes[i] != -1 && SrcBytes[i] % 16 != int(i)) {
		UnpackFromEltSize = UINT_MAX;
		return;
		}
		}
		break;
		}
		}
		if (UnpackFromEltSize > 4)
		return;

		LLVM_DEBUG(dbgs() << "Preparing for final unpack of element size "
		<< UnpackFromEltSize << ". Zero vector is Op#" << ZeroVecOpNo
		<< ".\n";
		dumpBytes(Bytes, "Original Bytes vector:"););

		// Apply the unpack in reverse to the Bytes array.
		unsigned B = 0;
		for (unsigned Elt = 0; Elt < SystemZ::VectorBytes;) {
		Elt += UnpackFromEltSize;
		for (unsigned i = 0; i < UnpackFromEltSize; i++, Elt++, B++)
		Bytes[B] = Bytes[Elt];
		}
		while (B < SystemZ::VectorBytes)
		Bytes[B++] = -1;

		// Remove the zero vector from Ops
		Ops.erase(&Ops[ZeroVecOpNo]);
		for (unsigned I = 0; I < SystemZ::VectorBytes; ++I)
		if (Bytes[I] >= 0) {
		unsigned OpNo = unsigned(Bytes[I]) / SystemZ::VectorBytes;
		if (OpNo > ZeroVecOpNo)
		Bytes[I] -= SystemZ::VectorBytes;
		}

		LLVM_DEBUG(dumpBytes(Bytes, "Resulting Bytes vector, zero vector removed:");
		dbgs() << "\n";);
		}
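
The matching and "reverse unpack" steps of tryPrepareForUnpack() can be tried outside of LLVM. The following is only a minimal sketch under stated assumptions: `VectorBytes`, `matchesUnpack` and `reverseUnpack` are hypothetical standalone stand-ins written for illustration (they are not part of the patch), and the single-source rearrangement check and the renumbering of Ops are left out.

```cpp
#include <cassert>
#include <vector>

// Assumption: stand-in for SystemZ::VectorBytes (16-byte vector registers).
constexpr unsigned VectorBytes = 16;

// Hypothetical helper: returns true if Bytes looks like a zero-extending
// unpack from elements of FromEltSize bytes, i.e. the leading bytes of each
// widened element come from the zero vector and the rest do not.
// Undefined bytes (-1) match anything.
bool matchesUnpack(const std::vector<int> &Bytes, unsigned FromEltSize,
                   unsigned ZeroVecOpNo) {
  assert(Bytes.size() == VectorBytes);
  unsigned ToEltSize = FromEltSize * 2;
  for (unsigned Elt = 0; Elt < VectorBytes; ++Elt) {
    bool IsZextByte = (Elt % ToEltSize) < FromEltSize;
    if (Bytes[Elt] == -1)
      continue;
    unsigned OpNo = unsigned(Bytes[Elt]) / VectorBytes;
    if (IsZextByte != (OpNo == ZeroVecOpNo))
      return false;
  }
  return true;
}

// Hypothetical helper: applies the unpack in reverse, compacting the source
// bytes into the front of the vector and marking the remainder undefined.
std::vector<int> reverseUnpack(std::vector<int> Bytes, unsigned FromEltSize) {
  assert(Bytes.size() == VectorBytes);
  unsigned B = 0;
  for (unsigned Elt = 0; Elt < VectorBytes;) {
    Elt += FromEltSize; // Skip the bytes supplied by the zero vector.
    for (unsigned i = 0; i < FromEltSize; ++i, ++Elt, ++B)
      Bytes[B] = Bytes[Elt];
  }
  while (B < VectorBytes)
    Bytes[B++] = -1;
  return Bytes;
}
```

With the zero vector as operand 1 (bytes 16..31), a Bytes vector of { 16 0 16 1 16 2 ... 16 7 } matches an unpack from 1-byte elements, and reversing it leaves bytes 0..7 of operand 0 in the first eight positions with the rest undefined.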

		SDValue GeneralShuffle::insertUnpackIfPrepared(SelectionDAG &DAG,
		const SDLoc &DL,
		SDValue Op) {
		if (!unpackWasPrepared())
		return Op;
		unsigned InBits = UnpackFromEltSize * 8;
		EVT InVT = MVT::getVectorVT(MVT::getIntegerVT(InBits),
		SystemZ::VectorBits / InBits);
		SDValue PackedOp = DAG.getNode(ISD::BITCAST, DL, InVT, Op);
		unsigned OutBits = InBits * 2;
		EVT OutVT = MVT::getVectorVT(MVT::getIntegerVT(OutBits),
		SystemZ::VectorBits / OutBits);
		return DAG.getNode(SystemZISD::UNPACKL_HIGH, DL, OutVT, PackedOp);
		}

// Return true if the given BUILD_VECTOR is a scalar-to-vector conversion.		// Return true if the given BUILD_VECTOR is a scalar-to-vector conversion.
static bool isScalarToVector(SDValue Op) {		static bool isScalarToVector(SDValue Op) {
for (unsigned I = 1, E = Op.getNumOperands(); I != E; ++I)		for (unsigned I = 1, E = Op.getNumOperands(); I != E; ++I)
if (!Op.getOperand(I).isUndef())		if (!Op.getOperand(I).isUndef())
return false;		return false;
return true;		return true;
}		}

[378 lines not shown]	SystemZTargetLowering::lowerEXTRACT_VECTOR_ELT(SDValue Op,
// Otherwise bitcast to the equivalent integer form and extract via a GPR.		// Otherwise bitcast to the equivalent integer form and extract via a GPR.
MVT IntVT = MVT::getIntegerVT(VT.getSizeInBits());		MVT IntVT = MVT::getIntegerVT(VT.getSizeInBits());
MVT IntVecVT = MVT::getVectorVT(IntVT, VecVT.getVectorNumElements());		MVT IntVecVT = MVT::getVectorVT(IntVT, VecVT.getVectorNumElements());
SDValue Res = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, IntVT,		SDValue Res = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, IntVT,
DAG.getNode(ISD::BITCAST, DL, IntVecVT, Op0), Op1);		DAG.getNode(ISD::BITCAST, DL, IntVecVT, Op0), Op1);
return DAG.getNode(ISD::BITCAST, DL, VT, Res);		return DAG.getNode(ISD::BITCAST, DL, VT, Res);
}		}

SDValue		SDValue SystemZTargetLowering::
SystemZTargetLowering::lowerExtendVectorInreg(SDValue Op, SelectionDAG &DAG,		lowerSIGN_EXTEND_VECTOR_INREG(SDValue Op, SelectionDAG &DAG) const {
unsigned UnpackHigh) const {
SDValue PackedOp = Op.getOperand(0);		SDValue PackedOp = Op.getOperand(0);
EVT OutVT = Op.getValueType();		EVT OutVT = Op.getValueType();
EVT InVT = PackedOp.getValueType();		EVT InVT = PackedOp.getValueType();
unsigned ToBits = OutVT.getScalarSizeInBits();		unsigned ToBits = OutVT.getScalarSizeInBits();
unsigned FromBits = InVT.getScalarSizeInBits();		unsigned FromBits = InVT.getScalarSizeInBits();
do {		do {
FromBits *= 2;		FromBits *= 2;
		uweigand: Now that this routine shares no code at all between the sign- and zero-extend case, I think it would make more sense to just have two different routines lowerSIGN_EXTEND_VECTOR_INREG and lowerZERO_EXTEND_VECTOR_INREG.
EVT OutVT = MVT::getVectorVT(MVT::getIntegerVT(FromBits),		EVT OutVT = MVT::getVectorVT(MVT::getIntegerVT(FromBits),
SystemZ::VectorBits / FromBits);		SystemZ::VectorBits / FromBits);
PackedOp = DAG.getNode(UnpackHigh, SDLoc(PackedOp), OutVT, PackedOp);		PackedOp =
		DAG.getNode(SystemZISD::UNPACK_HIGH, SDLoc(PackedOp), OutVT, PackedOp);
} while (FromBits != ToBits);		} while (FromBits != ToBits);
return PackedOp;		return PackedOp;
}		}

		// Lower a ZERO_EXTEND_VECTOR_INREG to a vector shuffle with a zero vector.
		SDValue SystemZTargetLowering::
		lowerZERO_EXTEND_VECTOR_INREG(SDValue Op, SelectionDAG &DAG) const {
		SDValue PackedOp = Op.getOperand(0);
		SDLoc DL(Op);
		EVT OutVT = Op.getValueType();
		EVT InVT = PackedOp.getValueType();
		unsigned InNumElts = InVT.getVectorNumElements();
		unsigned OutNumElts = OutVT.getVectorNumElements();
		unsigned NumInPerOut = InNumElts / OutNumElts;

		SDValue ZeroVec =
		DAG.getSplatVector(InVT, DL, DAG.getConstant(0, DL, InVT.getScalarType()));

		SmallVector<int, 16> Mask(InNumElts);
		unsigned ZeroVecElt = InNumElts;
		for (unsigned PackedElt = 0; PackedElt < OutNumElts; PackedElt++) {
		unsigned MaskElt = PackedElt * NumInPerOut;
		unsigned End = MaskElt + NumInPerOut - 1;
		for (; MaskElt < End; MaskElt++)
		Mask[MaskElt] = ZeroVecElt++;
		Mask[MaskElt] = PackedElt;
		}
		jonpa (author): I rewrote this loop from the version on Apr 20 2020, 14:06, so that the indexes into the zero vector would be not just the first byte, but increasing. Back then that led to some matches of MERGE_HIGH, but I see now that it's a no-op, given the optimization of the zero vector we have now added. But I suppose it doesn't matter much either way.
		SDValue Shuf = DAG.getVectorShuffle(InVT, DL, PackedOp, ZeroVec, Mask);
		return DAG.getNode(ISD::BITCAST, DL, OutVT, Shuf);
		}
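
To see what shuffle mask the loop in lowerZERO_EXTEND_VECTOR_INREG() produces, the same index arithmetic can be run on its own. This is a rough sketch, not the patch itself: `buildZextMask` is a hypothetical helper name, and it takes plain element counts instead of EVTs. Indices below InNumElts select from the packed source and indices from InNumElts upward select from the zero vector.

```cpp
#include <cassert>
#include <vector>

// Hypothetical standalone version of the mask loop. Each output element
// consists of NumInPerOut input-sized lanes: the leading lanes are taken
// from the zero vector (increasing indices) and the last lane is the
// source element itself.
std::vector<int> buildZextMask(unsigned InNumElts, unsigned OutNumElts) {
  assert(OutNumElts != 0 && InNumElts % OutNumElts == 0);
  unsigned NumInPerOut = InNumElts / OutNumElts;
  std::vector<int> Mask(InNumElts);
  unsigned ZeroVecElt = InNumElts; // First index into the zero vector.
  for (unsigned PackedElt = 0; PackedElt < OutNumElts; ++PackedElt) {
    unsigned MaskElt = PackedElt * NumInPerOut;
    unsigned End = MaskElt + NumInPerOut - 1;
    for (; MaskElt < End; ++MaskElt)
      Mask[MaskElt] = ZeroVecElt++;
    Mask[MaskElt] = PackedElt;
  }
  return Mask;
}
```

For a v16i8 input extended to v4i32, `buildZextMask(16, 4)` gives { 16 17 18 0 19 20 21 1 22 23 24 2 25 26 27 3 }: three zero-vector lanes followed by one source byte per 32-bit element, which is the layout GeneralShuffle later tries to recognize as an unpack.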

SDValue SystemZTargetLowering::lowerShift(SDValue Op, SelectionDAG &DAG,		SDValue SystemZTargetLowering::lowerShift(SDValue Op, SelectionDAG &DAG,
unsigned ByScalar) const {		unsigned ByScalar) const {
// Look for cases where a vector shift can use the *_BY_SCALAR form.		// Look for cases where a vector shift can use the *_BY_SCALAR form.
SDValue Op0 = Op.getOperand(0);		SDValue Op0 = Op.getOperand(0);
SDValue Op1 = Op.getOperand(1);		SDValue Op1 = Op.getOperand(1);
SDLoc DL(Op);		SDLoc DL(Op);
EVT VT = Op.getValueType();		EVT VT = Op.getValueType();
unsigned ElemBitSize = VT.getScalarSizeInBits();		unsigned ElemBitSize = VT.getScalarSizeInBits();
[149 lines not shown]	case ISD::VECTOR_SHUFFLE:
return lowerVECTOR_SHUFFLE(Op, DAG);		return lowerVECTOR_SHUFFLE(Op, DAG);
case ISD::SCALAR_TO_VECTOR:		case ISD::SCALAR_TO_VECTOR:
return lowerSCALAR_TO_VECTOR(Op, DAG);		return lowerSCALAR_TO_VECTOR(Op, DAG);
case ISD::INSERT_VECTOR_ELT:		case ISD::INSERT_VECTOR_ELT:
return lowerINSERT_VECTOR_ELT(Op, DAG);		return lowerINSERT_VECTOR_ELT(Op, DAG);
case ISD::EXTRACT_VECTOR_ELT:		case ISD::EXTRACT_VECTOR_ELT:
return lowerEXTRACT_VECTOR_ELT(Op, DAG);		return lowerEXTRACT_VECTOR_ELT(Op, DAG);
case ISD::SIGN_EXTEND_VECTOR_INREG:		case ISD::SIGN_EXTEND_VECTOR_INREG:
return lowerExtendVectorInreg(Op, DAG, SystemZISD::UNPACK_HIGH);		return lowerSIGN_EXTEND_VECTOR_INREG(Op, DAG);
case ISD::ZERO_EXTEND_VECTOR_INREG:		case ISD::ZERO_EXTEND_VECTOR_INREG:
return lowerExtendVectorInreg(Op, DAG, SystemZISD::UNPACKL_HIGH);		return lowerZERO_EXTEND_VECTOR_INREG(Op, DAG);
case ISD::SHL:		case ISD::SHL:
return lowerShift(Op, DAG, SystemZISD::VSHL_BY_SCALAR);		return lowerShift(Op, DAG, SystemZISD::VSHL_BY_SCALAR);
case ISD::SRL:		case ISD::SRL:
return lowerShift(Op, DAG, SystemZISD::VSRL_BY_SCALAR);		return lowerShift(Op, DAG, SystemZISD::VSRL_BY_SCALAR);
case ISD::SRA:		case ISD::SRA:
return lowerShift(Op, DAG, SystemZISD::VSRA_BY_SCALAR);		return lowerShift(Op, DAG, SystemZISD::VSRA_BY_SCALAR);
default:		default:
llvm_unreachable("Unexpected node to lower");		llvm_unreachable("Unexpected node to lower");
[2,963 lines not shown]

llvm/test/CodeGen/SystemZ/vec-move-16.ll

[34 lines not shown]	; No expected output, but must compile.
%val = load <4 x i1>, <4 x i1> *%ptr		%val = load <4 x i1>, <4 x i1> *%ptr
%ret = zext <4 x i1> %val to <4 x i32>		%ret = zext <4 x i1> %val to <4 x i32>
ret <4 x i32> %ret		ret <4 x i32> %ret
}		}

; Test a v4i8->v4i32 extension.		; Test a v4i8->v4i32 extension.
define <4 x i32> @f5(<4 x i8> *%ptr) {		define <4 x i32> @f5(<4 x i8> *%ptr) {
; CHECK-LABEL: f5:		; CHECK-LABEL: f5:
		; CHECK: larl %r1, .LCPI4_0
; CHECK: vlrepf [[REG1:%v[0-9]+]], 0(%r2)		; CHECK: vlrepf [[REG1:%v[0-9]+]], 0(%r2)
; CHECK: vuplhb [[REG2:%v[0-9]+]], [[REG1]]		; CHECK: vl %v1, 0(%r1), 3
; CHECK: vuplhh %v24, [[REG2]]		; CHECK: vperm %v24, %v1, [[REG1]], %v1
; CHECK: br %r14		; CHECK: br %r14
%val = load <4 x i8>, <4 x i8> *%ptr		%val = load <4 x i8>, <4 x i8> *%ptr
%ret = zext <4 x i8> %val to <4 x i32>		%ret = zext <4 x i8> %val to <4 x i32>
ret <4 x i32> %ret		ret <4 x i32> %ret
}		}

; Test a v4i16->v4i32 extension.		; Test a v4i16->v4i32 extension.
define <4 x i32> @f6(<4 x i16> *%ptr) {		define <4 x i32> @f6(<4 x i16> *%ptr) {
[12 lines not shown]	; No expected output, but must compile.
%val = load <2 x i1>, <2 x i1> *%ptr		%val = load <2 x i1>, <2 x i1> *%ptr
%ret = zext <2 x i1> %val to <2 x i64>		%ret = zext <2 x i1> %val to <2 x i64>
ret <2 x i64> %ret		ret <2 x i64> %ret
}		}

; Test a v2i8->v2i64 extension.		; Test a v2i8->v2i64 extension.
define <2 x i64> @f8(<2 x i8> *%ptr) {		define <2 x i64> @f8(<2 x i8> *%ptr) {
; CHECK-LABEL: f8:		; CHECK-LABEL: f8:
		; CHECK: larl %r1, .LCPI7_0
; CHECK: vlreph [[REG1:%v[0-9]+]], 0(%r2)		; CHECK: vlreph [[REG1:%v[0-9]+]], 0(%r2)
; CHECK: vuplhb [[REG2:%v[0-9]+]], [[REG1]]		; CHECK: vl %v1, 0(%r1), 3
; CHECK: vuplhh [[REG3:%v[0-9]+]], [[REG2]]		; CHECK: vperm %v24, %v1, [[REG1]], %v1
; CHECK: vuplhf %v24, [[REG3]]
; CHECK: br %r14		; CHECK: br %r14
%val = load <2 x i8>, <2 x i8> *%ptr		%val = load <2 x i8>, <2 x i8> *%ptr
%ret = zext <2 x i8> %val to <2 x i64>		%ret = zext <2 x i8> %val to <2 x i64>
ret <2 x i64> %ret		ret <2 x i64> %ret
}		}

; Test a v2i16->v2i64 extension.		; Test a v2i16->v2i64 extension.
define <2 x i64> @f9(<2 x i16> *%ptr) {		define <2 x i64> @f9(<2 x i16> *%ptr) {
; CHECK-LABEL: f9:		; CHECK-LABEL: f9:
		; CHECK: larl %r1, .LCPI8_0
; CHECK: vlrepf [[REG1:%v[0-9]+]], 0(%r2)		; CHECK: vlrepf [[REG1:%v[0-9]+]], 0(%r2)
; CHECK: vuplhh [[REG2:%v[0-9]+]], [[REG1]]		; CHECK: vl %v1, 0(%r1), 3
; CHECK: vuplhf %v24, [[REG2]]		; CHECK: vperm %v24, %v1, [[REG1]], %v1
; CHECK: br %r14		; CHECK: br %r14
%val = load <2 x i16>, <2 x i16> *%ptr		%val = load <2 x i16>, <2 x i16> *%ptr
%ret = zext <2 x i16> %val to <2 x i64>		%ret = zext <2 x i16> %val to <2 x i64>
ret <2 x i64> %ret		ret <2 x i64> %ret
}		}

; Test a v2i32->v2i64 extension.		; Test a v2i32->v2i64 extension.
define <2 x i64> @f10(<2 x i32> *%ptr) {		define <2 x i64> @f10(<2 x i32> *%ptr) {
; CHECK-LABEL: f10:		; CHECK-LABEL: f10:
; CHECK: vlrepg [[REG1:%v[0-9]+]], 0(%r2)		; CHECK: vlrepg [[REG1:%v[0-9]+]], 0(%r2)
; CHECK: vuplhf %v24, [[REG1]]		; CHECK: vuplhf %v24, [[REG1]]
; CHECK: br %r14		; CHECK: br %r14
%val = load <2 x i32>, <2 x i32> *%ptr		%val = load <2 x i32>, <2 x i32> *%ptr
%ret = zext <2 x i32> %val to <2 x i64>		%ret = zext <2 x i32> %val to <2 x i64>
ret <2 x i64> %ret		ret <2 x i64> %ret
}		}

llvm/test/CodeGen/SystemZ/vec-move-23.ll

	[62 lines not shown]
	; Z15-NEXT: br %r14			; Z15-NEXT: br %r14
	%c = sitofp <4 x i16> %Src to <4 x float>			%c = sitofp <4 x i16> %Src to <4 x float>
	store <4 x float> %c, <4 x float>* %Dst			store <4 x float> %c, <4 x float>* %Dst
	ret void			ret void
	}			}

	define void @fun4(<2 x i8> %Src, <2 x double>* %Dst) {			define void @fun4(<2 x i8> %Src, <2 x double>* %Dst) {
	; CHECK-LABEL: fun4:			; CHECK-LABEL: fun4:
	; CHECK: vuplhb %v0, %v24			; CHECK: larl %r1, .LCPI4_0
	; CHECK-NEXT: vuplhh %v0, %v0			; CHECK-NEXT: vl %v0, 0(%r1), 3
	; CHECK-NEXT: vuplhf %v0, %v0			; CHECK-NEXT: vperm %v0, %v0, %v24, %v0
	; CHECK-NEXT: vcdlgb %v0, %v0, 0, 0			; CHECK-NEXT: vcdlgb %v0, %v0, 0, 0
	; CHECK-NEXT: vst %v0, 0(%r2), 3			; CHECK-NEXT: vst %v0, 0(%r2), 3
	; CHECK-NEXT: br %r14			; CHECK-NEXT: br %r14
	%c = uitofp <2 x i8> %Src to <2 x double>			%c = uitofp <2 x i8> %Src to <2 x double>
	store <2 x double> %c, <2 x double>* %Dst			store <2 x double> %c, <2 x double>* %Dst
	ret void			ret void
	}			}

	define void @fun5(<2 x i16> %Src, <2 x double>* %Dst) {			define void @fun5(<2 x i16> %Src, <2 x double>* %Dst) {
	; CHECK-LABEL: fun5:			; CHECK-LABEL: fun5:
	; CHECK: vuplhh %v0, %v24			; CHECK: larl %r1, .LCPI5_0
	; CHECK-NEXT: vuplhf %v0, %v0			; CHECK-NEXT: vl %v0, 0(%r1), 3
				; CHECK-NEXT: vperm %v0, %v0, %v24, %v0
	; CHECK-NEXT: vcdlgb %v0, %v0, 0, 0			; CHECK-NEXT: vcdlgb %v0, %v0, 0, 0
	; CHECK-NEXT: vst %v0, 0(%r2), 3			; CHECK-NEXT: vst %v0, 0(%r2), 3
	; CHECK-NEXT: br %r14			; CHECK-NEXT: br %r14
	%c = uitofp <2 x i16> %Src to <2 x double>			%c = uitofp <2 x i16> %Src to <2 x double>
	store <2 x double> %c, <2 x double>* %Dst			store <2 x double> %c, <2 x double>* %Dst
	ret void			ret void
	}			}

	[38 lines not shown]

llvm/test/CodeGen/SystemZ/vec-move-24.ll

This file was added.

				; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z14 | FileCheck %s
				;
				; Test that vperm is not used if a single unpack is enough.

				define <4 x i32> @fun0(<4 x i32>* %Src) nounwind {
				; CHECK-LABEL: fun0:
				; CHECK-NOT: vperm
				%tmp = load <4 x i32>, <4 x i32>* %Src
				%tmp2 = shufflevector <4 x i32> zeroinitializer, <4 x i32> %tmp, <4 x i32> <i32 0, i32 4, i32 2, i32 5>
				ret <4 x i32> %tmp2
				}

				define void @fun1(i8 %Src, <32 x i8>* %Dst) nounwind {
				; CHECK-LABEL: fun1:
				; CHECK-NOT: vperm
				%I0 = insertelement <16 x i8> undef, i8 %Src, i32 0
				%I1 = insertelement <16 x i8> %I0, i8 %Src, i32 1
				%I2 = insertelement <16 x i8> %I1, i8 %Src, i32 2
				%I3 = insertelement <16 x i8> %I2, i8 %Src, i32 3
				%I4 = insertelement <16 x i8> %I3, i8 %Src, i32 4
				%I5 = insertelement <16 x i8> %I4, i8 %Src, i32 5
				%I6 = insertelement <16 x i8> %I5, i8 %Src, i32 6
				%I7 = insertelement <16 x i8> %I6, i8 %Src, i32 7
				%I8 = insertelement <16 x i8> %I7, i8 %Src, i32 8
				%I9 = insertelement <16 x i8> %I8, i8 %Src, i32 9
				%I10 = insertelement <16 x i8> %I9, i8 %Src, i32 10
				%I11 = insertelement <16 x i8> %I10, i8 %Src, i32 11
				%I12 = insertelement <16 x i8> %I11, i8 %Src, i32 12
				%I13 = insertelement <16 x i8> %I12, i8 %Src, i32 13
				%I14 = insertelement <16 x i8> %I13, i8 %Src, i32 14
				%I15 = insertelement <16 x i8> %I14, i8 %Src, i32 15

				%tmp = shufflevector <16 x i8> zeroinitializer,
				<16 x i8> %I15,
				<32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7,
				i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15,
				i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23,
				i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
				%tmp9 = shufflevector <32 x i8> undef,
				<32 x i8> %tmp,
				<32 x i32> <i32 33, i32 32, i32 48, i32 49, i32 1, i32 17, i32 50, i32 51,
				i32 2, i32 18, i32 52, i32 53, i32 3, i32 19, i32 54, i32 55,
				i32 4, i32 20, i32 56, i32 57, i32 5, i32 21, i32 58, i32 59,
				i32 6, i32 22, i32 60, i32 61, i32 7, i32 62, i32 55, i32 63>

				store <32 x i8> %tmp9, <32 x i8>* %Dst
				ret void
				}

llvm/test/CodeGen/SystemZ/vec-zext.ll

	; Test that vector zexts are done efficently with unpack instructions also in			; Test that vector zexts are done efficently also in case of fewer elements
	; case of fewer elements than allowed, e.g. <2 x i32>.			; than allowed, e.g. <2 x i32>.
	;			;
	; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z13 | FileCheck %s			; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z13 | FileCheck %s


	define <2 x i16> @fun1(<2 x i8> %val1) {			define <2 x i16> @fun1(<2 x i8> %val1) {
	; CHECK-LABEL: fun1:			; CHECK-LABEL: fun1:
	; CHECK: vuplhb %v24, %v24			; CHECK: vuplhb %v24, %v24
	; CHECK-NEXT: br %r14			; CHECK-NEXT: br %r14
	%z = zext <2 x i8> %val1 to <2 x i16>			%z = zext <2 x i8> %val1 to <2 x i16>
	ret <2 x i16> %z			ret <2 x i16> %z
	}			}

	define <2 x i32> @fun2(<2 x i8> %val1) {			define <2 x i32> @fun2(<2 x i8> %val1) {
	; CHECK-LABEL: fun2:			; CHECK-LABEL: fun2:
	; CHECK: vuplhb %v0, %v24			; CHECK: larl %r1, .LCPI1_0
	; CHECK-NEXT: vuplhh %v24, %v0			; CHECK-NEXT: vl %v0, 0(%r1), 3
				; CHECK-NEXT: vperm %v24, %v0, %v24, %v0
	; CHECK-NEXT: br %r14			; CHECK-NEXT: br %r14
	%z = zext <2 x i8> %val1 to <2 x i32>			%z = zext <2 x i8> %val1 to <2 x i32>
	ret <2 x i32> %z			ret <2 x i32> %z
	}			}

	define <2 x i64> @fun3(<2 x i8> %val1) {			define <2 x i64> @fun3(<2 x i8> %val1) {
	; CHECK-LABEL: fun3:			; CHECK-LABEL: fun3:
	; CHECK: vuplhb %v0, %v24			; CHECK: larl %r1, .LCPI2_0
	; CHECK-NEXT: vuplhh %v0, %v0			; CHECK-NEXT: vl %v0, 0(%r1), 3
	; CHECK-NEXT: vuplhf %v24, %v0			; CHECK-NEXT: vperm %v24, %v0, %v24, %v0
	; CHECK-NEXT: br %r14			; CHECK-NEXT: br %r14
	%z = zext <2 x i8> %val1 to <2 x i64>			%z = zext <2 x i8> %val1 to <2 x i64>
	ret <2 x i64> %z			ret <2 x i64> %z
	}			}

	define <2 x i32> @fun4(<2 x i16> %val1) {			define <2 x i32> @fun4(<2 x i16> %val1) {
	; CHECK-LABEL: fun4:			; CHECK-LABEL: fun4:
	; CHECK: vuplhh %v24, %v24			; CHECK: vuplhh %v24, %v24
	; CHECK-NEXT: br %r14			; CHECK-NEXT: br %r14
	%z = zext <2 x i16> %val1 to <2 x i32>			%z = zext <2 x i16> %val1 to <2 x i32>
	ret <2 x i32> %z			ret <2 x i32> %z
	}			}

	define <2 x i64> @fun5(<2 x i16> %val1) {			define <2 x i64> @fun5(<2 x i16> %val1) {
	; CHECK-LABEL: fun5:			; CHECK-LABEL: fun5:
	; CHECK: vuplhh %v0, %v24			; CHECK: larl %r1, .LCPI4_0
	; CHECK-NEXT: vuplhf %v24, %v0			; CHECK-NEXT: vl %v0, 0(%r1), 3
				; CHECK-NEXT: vperm %v24, %v0, %v24, %v0
	; CHECK-NEXT: br %r14			; CHECK-NEXT: br %r14
	%z = zext <2 x i16> %val1 to <2 x i64>			%z = zext <2 x i16> %val1 to <2 x i64>
	ret <2 x i64> %z			ret <2 x i64> %z
	}			}

	define <2 x i64> @fun6(<2 x i32> %val1) {			define <2 x i64> @fun6(<2 x i32> %val1) {
	; CHECK-LABEL: fun6:			; CHECK-LABEL: fun6:
	; CHECK: vuplhf %v24, %v24			; CHECK: vuplhf %v24, %v24
	; CHECK-NEXT: br %r14			; CHECK-NEXT: br %r14
	%z = zext <2 x i32> %val1 to <2 x i64>			%z = zext <2 x i32> %val1 to <2 x i64>
	ret <2 x i64> %z			ret <2 x i64> %z
	}			}

	define <4 x i16> @fun7(<4 x i8> %val1) {			define <4 x i16> @fun7(<4 x i8> %val1) {
	; CHECK-LABEL: fun7:			; CHECK-LABEL: fun7:
	; CHECK: vuplhb %v24, %v24			; CHECK: vuplhb %v24, %v24
	; CHECK-NEXT: br %r14			; CHECK-NEXT: br %r14
	%z = zext <4 x i8> %val1 to <4 x i16>			%z = zext <4 x i8> %val1 to <4 x i16>
	ret <4 x i16> %z			ret <4 x i16> %z
	}			}

	define <4 x i32> @fun8(<4 x i8> %val1) {			define <4 x i32> @fun8(<4 x i8> %val1) {
	; CHECK-LABEL: fun8:			; CHECK-LABEL: fun8:
	; CHECK: vuplhb %v0, %v24			; CHECK: larl %r1, .LCPI7_0
	; CHECK-NEXT: vuplhh %v24, %v0			; CHECK-NEXT: vl %v0, 0(%r1), 3
				; CHECK-NEXT: vperm %v24, %v0, %v24, %v0
	; CHECK-NEXT: br %r14			; CHECK-NEXT: br %r14
	%z = zext <4 x i8> %val1 to <4 x i32>			%z = zext <4 x i8> %val1 to <4 x i32>
	ret <4 x i32> %z			ret <4 x i32> %z
	}			}

	define <4 x i32> @fun9(<4 x i16> %val1) {			define <4 x i32> @fun9(<4 x i16> %val1) {
	; CHECK-LABEL: fun9:			; CHECK-LABEL: fun9:
	; CHECK: vuplhh %v24, %v24			; CHECK: vuplhh %v24, %v24
	[13 lines not shown]