This is an archive of the discontinued LLVM Phabricator instance.

llvm/lib/Target/X86/X86ISelLowering.cpp
37911	We still need to do this before the widenSubVector() code - otherwise we'll never be able to simplify any input that doesn't match RootSizeInBits, which are likely to be the most interesting cases imo.
37965	This seems to be really bulky for what it's actually doing. I don't think we need to create this shuffle mask, for instance; we should be able to create a demanded elts mask directly and then trunc/scale it for the input's size. I keep meaning to create a scaleDemandedMask() common helper method as we have several places that would use it (e.g. SelectionDAG.computeKnownBits bitcast handling and other parts of value tracking).

lebedev.ri added inline comments.Sep 7 2021, 3:20 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
37965	That is what i initially came up with, and it's much uglier than this code :) I can do that again, but i'm not sure that will be better.

Introduce ScaleDemandedEltsMask() and use it.

llvm/lib/Target/X86/X86ISelLowering.cpp
37911	I agree, but is this a correctness concern for this patch?
37965	Ok, how about this?

RKSimon added inline comments.Sep 8 2021, 3:27 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
37598	We should assert that (NumElts % NumSrcElts) == 0 \|\| (NumSrcElts % NumElts) == 0 - or return true/false on success/failure.
37911	what correctness?

lebedev.ri added inline comments.Sep 8 2021, 3:29 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
37598	Oops, i meant to do that, but forgot to in the end.
37911	I mean, if we don't do this in this patch, will that lead to miscompiles, or simply to missed optimizations?

Add forgotten assert.
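
The mask scaling discussed in the comments above can be sketched in standalone C++. This is an illustrative stand-in, not the code from the patch: the name `scaleDemandedElts` and the `std::vector<bool>` representation are assumptions chosen to avoid depending on LLVM's `APInt`. When the element count grows, each source bit is replicated; when it shrinks, a destination bit is set if any of the source bits it covers is set. The divisibility precondition is the one requested at line 37598.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a demanded-elements mask: one bool per element.
using EltMask = std::vector<bool>;

// Rescale a demanded-elements mask from Src.size() elements to NumDstElts
// elements that cover the same vector bits.
EltMask scaleDemandedElts(const EltMask &Src, std::size_t NumDstElts) {
  const std::size_t NumSrcElts = Src.size();
  assert(NumSrcElts != 0 && NumDstElts != 0 && "Empty masks are not supported");
  assert((NumDstElts % NumSrcElts == 0 || NumSrcElts % NumDstElts == 0) &&
         "Element counts must divide one another");

  EltMask Dst(NumDstElts, false);
  if (NumDstElts >= NumSrcElts) {
    // Splitting elements: every narrow element inherits its wide parent's bit.
    const std::size_t Scale = NumDstElts / NumSrcElts;
    for (std::size_t I = 0; I != NumDstElts; ++I)
      Dst[I] = Src[I / Scale];
  } else {
    // Merging elements: a wide element is demanded if any part of it is.
    const std::size_t Scale = NumSrcElts / NumDstElts;
    for (std::size_t I = 0; I != NumSrcElts; ++I)
      if (Src[I])
        Dst[I / Scale] = true;
  }
  return Dst;
}
```

Treating a merged element as demanded when any of its parts is demanded is the conservative choice for this use: over-demanding only loses an optimization, while under-demanding could drop elements that are still read.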

RKSimon added inline comments.Sep 8 2021, 3:52 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
37594	This can probably move to the APIntOps helpers
37974	To move this before widening, we should just need to truncate OpDemandedElts based on its size vs RootSizeInBits - we should assert that no lost elts were demanded. Then we can scale it.

Harbormaster completed remote builds in B123024: Diff 371303.Sep 8 2021, 4:02 AM

Try to move before widenSubVector() - now without miscompiles?

llvm/lib/Target/X86/X86ISelLowering.cpp
37594	Let's do that afterwards?
37974	Ok, i admit i've tried to avoid doing that because i don't quite understand all of the logic here. Does this look right? It avoids the miscompiles that were visible in some previous attempt at least.
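
A minimal sketch of the truncation step proposed at line 37974, under simplifying assumptions: the operand has the same element width as the root (so only truncation is shown, not the subsequent scaling), and `demandedEltsForOp` is a hypothetical name rather than a function from the patch. Mask values are interpreted as in the patch, with operand `OpIdx` owning the value range `[OpIdx * N, (OpIdx + 1) * N)`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Which elements of operand OpIdx does the shuffle mask read?
// Mask values in [OpIdx * N, (OpIdx + 1) * N) select from that operand, where
// N = Mask.size(); negative values are sentinels and read from no operand.
// NumOpElts is the operand's real element count, which may be smaller than N
// when the operand is narrower than the root (equal element width assumed).
std::vector<bool> demandedEltsForOp(const std::vector<int> &Mask,
                                    std::size_t OpIdx, std::size_t NumOpElts) {
  assert(NumOpElts <= Mask.size() && "Operand is wider than the root");
  const int N = static_cast<int>(Mask.size());
  const int Lo = static_cast<int>(OpIdx) * N;
  const int Hi = Lo + N;

  std::vector<bool> Demanded(Mask.size(), false);
  for (int MaskElt : Mask)
    if (MaskElt >= Lo && MaskElt < Hi)
      Demanded[static_cast<std::size_t>(MaskElt - Lo)] = true;

  // Truncate to the operand's own width; nothing demanded may be dropped.
  for (std::size_t I = NumOpElts; I != Demanded.size(); ++I)
    assert(!Demanded[I] && "Demanded an element the operand does not have");
  Demanded.resize(NumOpElts);
  return Demanded;
}
```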

Harbormaster completed remote builds in B123037: Diff 371316.Sep 8 2021, 5:36 AM

Did i get it right this time? :)

RKSimon added inline comments.Sep 13 2021, 3:21 AM

llvm/test/CodeGen/X86/insertelement-ones.ll
389	Any luck on improving this?

RKSimon mentioned this in D109683: [APInt] Add APIntOps::ScaleBitMask helper.Sep 13 2021, 4:48 AM

RKSimon mentioned this in rG9db20822f795: [APInt] Add APIntOps::ScaleBitMask helper.Sep 13 2021, 8:28 AM

lebedev.ri added inline comments.Sep 13 2021, 1:44 PM

llvm/test/CodeGen/X86/insertelement-ones.ll
389	This one is obscure. I believe the problem is `X86ISelLowering.cpp`'s `matchBinaryShuffle()`'s `ISD::OR` lowering. We have:

mask: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 30 -2
matchBinaryShuffle()
EltSizeInBits: 8
V1: t4: v16i8,ch = CopyFromReg t0, Register:v16i8 %1
  t3: v16i8 = Register %1
V2: t74: v16i8 = X86ISD::VSHLDQ t51, TargetConstant:i8<14>
  t51: v16i8 = bitcast t50
    t50: v4i32 = scalar_to_vector Constant:i32<255>
      t49: i32 = Constant<255>
  t73: i8 = TargetConstant<14>

We can't say anything about `t4`, but i think it's obvious that `t74` is all-zeros except for the 14th element, which is all-ones. So we can of course lower that as an `or` blend, and we do not care what `t4` is. But the code fails to do that. I think we'd basically have to do `computeKnownBits()` for each element of V1/V2 separately. Should i keep looking?
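
The per-element reasoning in the comment above can be illustrated with a simplified model. This is a sketch under stated assumptions, not the logic of `matchBinaryShuffle()`: the `Known` enum, `shuffle()` and `canLowerBlendAsOr()` are hypothetical names, only in-place blends are handled, and sentinel mask values are not modelled. A lane taken from one input survives an OR unchanged if the other input's lane is known zero, or if the selected lane is itself known all-ones.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V16 = std::array<std::uint8_t, 16>;

// What is known about one i8 lane (hypothetical, per-element knowledge).
enum class Known { Unknown, Zero, AllOnes };

// Reference semantics of a two-input byte shuffle: mask values 0..15 pick
// from V1 and 16..31 pick from V2.
V16 shuffle(const V16 &V1, const V16 &V2, const std::array<int, 16> &Mask) {
  V16 R{};
  for (std::size_t I = 0; I != 16; ++I) {
    assert(Mask[I] >= 0 && Mask[I] < 32 && "Sentinels are not modelled");
    const std::size_t M = static_cast<std::size_t>(Mask[I]);
    R[I] = M < 16 ? V1[M] : V2[M - 16];
  }
  return R;
}

// Can this in-place blend be replaced by a plain V1 | V2?
bool canLowerBlendAsOr(const std::array<int, 16> &Mask,
                       const std::array<Known, 16> &K1,
                       const std::array<Known, 16> &K2) {
  for (std::size_t I = 0; I != 16; ++I) {
    const int Lane = static_cast<int>(I);
    if (Mask[I] == Lane) {
      // Lane comes from V1: V2 must not disturb it.
      if (K2[I] != Known::Zero && K1[I] != Known::AllOnes)
        return false;
    } else if (Mask[I] == Lane + 16) {
      // Lane comes from V2: V1 must not disturb it.
      if (K1[I] != Known::Zero && K2[I] != Known::AllOnes)
        return false;
    } else {
      return false; // Not an in-place blend.
    }
  }
  return true;
}
```

With nothing known about V1 and a V2 that is zero everywhere except one all-ones lane, every lane passes the check, which matches the observation that the contents of `t4` do not matter.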

lebedev.ri added a parent revision: D109726: [X86] Improve `matchBinaryShuffle()`'s `BLEND` lowering with per-element all-zero/all-ones knowledge.Sep 13 2021, 3:10 PM

lebedev.ri added inline comments.

llvm/test/CodeGen/X86/insertelement-ones.ll
389	Ok, got it: D109726

Rebased on top of main+D109726.
The noted regression is gone (but many more took its place).

Harbormaster completed remote builds in B123754: Diff 372352.Sep 13 2021, 4:24 PM

Rebased, NFC.

lebedev.ri added inline comments.Sep 17 2021, 9:30 AM

llvm/test/CodeGen/X86/insertelement-ones.ll

314

Here we have:

Optimized legalized selection DAG: %bb.0 'insert_v16i8_x123456789ABCDEx:'
SelectionDAG has 20 nodes:
  t0: ch = EntryToken
          t2: v16i8,ch = CopyFromReg t0, Register:v16i8 %0
        t19: v16i8 = and t2, t36
        t20: v16i8 = X86ISD::ANDNP t36, t27
      t21: v16i8 = or t19, t20
      t33: v16i8 = X86ISD::VSHLDQ t27, TargetConstant:i8<15>
    t45: v16i8 = or t21, t33
  t12: ch,glue = CopyToReg t0, Register:v16i8 $xmm0, t45
    t26: v4i32 = scalar_to_vector Constant:i32<255>
  t27: v16i8 = bitcast t26
    t38: i64 = X86ISD::Wrapper TargetConstantPool:i64<<16 x i8> <i8 0, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>> 0
  t36: v16i8,ch = load<(load (s128) from constant-pool)> t0, t38, undef:i64
  t13: ch = X86ISD::RET_FLAG t12, TargetConstant:i32<0>, Register:v16i8 $xmm0, t12:1

... so matchBinaryShuffle() again fails to omit the masking,
even though it's obviously redundant here for the reasons seen in D109726.
I would suspect that is because around scalar_to_vector we operate on i32 elt type,
so we don't have all-ones elements until after bitcast.
Without changing computeKnownBits to operate on a specified element width,
i'm not sure it can help us further, and that does not sound like the right fix.
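
Why the element width matters here can be shown with a small standalone model. It assumes a little-endian layout (as on x86), and the helper names `bytesOfI32` and `isAllOnesElt` are hypothetical. The i32 constant 255 is not an all-ones i32 element, but after the bitcast its lowest byte is an all-ones i8 element and the other three bytes are zero.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Little-endian byte view of one i32 element, as a bitcast to i8 lanes on
// x86 would expose it.
std::array<std::uint8_t, 4> bytesOfI32(std::uint32_t X) {
  std::array<std::uint8_t, 4> Bytes{};
  for (std::size_t B = 0; B != 4; ++B)
    Bytes[B] = static_cast<std::uint8_t>(X >> (8 * B));
  return Bytes;
}

// Is the element of EltBytes bytes starting at byte Offset all-ones?
bool isAllOnesElt(const std::array<std::uint8_t, 4> &Bytes, std::size_t Offset,
                  std::size_t EltBytes) {
  assert(Offset + EltBytes <= Bytes.size() && "Element out of range");
  for (std::size_t I = 0; I != EltBytes; ++I)
    if (Bytes[Offset + I] != 0xFF)
      return false;
  return true;
}
```

Viewed as a single 4-byte element the value fails the all-ones test, while byte 0 viewed on its own passes it, so knowledge computed at i32 granularity cannot express the fact that is needed at i8 granularity.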

RKSimon added inline comments.Sep 17 2021, 9:30 AM

llvm/test/CodeGen/X86/insertelement-ones.ll
318	We're going to have to improve INSERT_VECTOR_ELT handling of 0/-1 elements - just AND/OR if we don't have a legal PINSRB instruction (pre-SSE41).
llvm/test/CodeGen/X86/oddshuffles.ll
2268	Looks like we're missing a fold to share scalar_to_vector(x) and scalar_to_vector(trunc(x)) (maybe worth supporting scalar_to_vector(ext(x)) as well)?

Harbormaster completed remote builds in B124420: Diff 373251.Sep 17 2021, 10:08 AM

RKSimon added inline comments.Sep 17 2021, 10:53 AM

llvm/test/CodeGen/X86/insertelement-ones.ll
318	It looks like we might be able to do this more easily by extending lowerShuffleAsBitMask to handle the allones elements case as well as the zero elements case.
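
The AND/OR idea from the two comments above, as a scalar model. This is only a sketch: `insertAllOnes` and `insertZero` are hypothetical names, and the loops stand in for a single vector `por` or `pand` with a constant mask. Inserting an all-ones byte needs only an OR with a mask that is 0xFF in that lane; inserting a zero byte needs only an AND with a mask that is 0x00 in that lane.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using V16 = std::array<std::uint8_t, 16>;

// Insert an all-ones byte at Idx using only a bitwise OR with a constant.
V16 insertAllOnes(V16 V, std::size_t Idx) {
  assert(Idx < V.size() && "Lane index out of range");
  V16 OrMask{}; // all zero
  OrMask[Idx] = 0xFF;
  for (std::size_t I = 0; I != V.size(); ++I)
    V[I] = static_cast<std::uint8_t>(V[I] | OrMask[I]);
  return V;
}

// Insert a zero byte at Idx using only a bitwise AND with a constant.
V16 insertZero(V16 V, std::size_t Idx) {
  assert(Idx < V.size() && "Lane index out of range");
  V16 AndMask;
  AndMask.fill(0xFF);
  AndMask[Idx] = 0x00;
  for (std::size_t I = 0; I != V.size(); ++I)
    V[I] = static_cast<std::uint8_t>(V[I] & AndMask[I]);
  return V;
}
```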

lebedev.ri added inline comments.Sep 17 2021, 11:05 AM

llvm/test/CodeGen/X86/insertelement-ones.ll
318	Note that `X86TargetLowering::LowerINSERT_VECTOR_ELT` isn't even called for this test, since we expand, not legalize, in this case. Marking it as legalize causes crashes "don't know how to legalize", i guess it doesn't retry to legalize via the generic expansion.

lebedev.ri mentioned this in D109989: [X86] Improve i8 all-ones element insertion in pre-SSE4.1.Sep 17 2021, 11:33 AM

Rebased on top of D109989 - llvm/test/CodeGen/X86/insertelement-ones.ll is all good now.

lebedev.ri added a parent revision: D109989: [X86] Improve i8 all-ones element insertion in pre-SSE4.1.Sep 17 2021, 11:39 AM

lebedev.ri added inline comments.

llvm/test/CodeGen/X86/insertelement-ones.ll
318	Done, D109989.

lebedev.ri added inline comments.Sep 17 2021, 11:52 AM

llvm/test/CodeGen/X86/oddshuffles.ll

2268

This stuff is broken.
We're in AVX1 mode, and only have broadcast-from-mem,
yet we've successfully obscured the load via the truncation.

I suppose we could look past ext/trunc of scalar_to_vector operand,
and change it to bitcast/ext of scalar_to_vector itself,
let me see.

Optimized legalized selection DAG: %bb.0 'splat_v3i32:'
SelectionDAG has 28 nodes:
  t0: ch = EntryToken
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t27: i64,ch = load<(load (s64) from %ir.ptr, align 1)> t0, t2, undef:i64
      t24: v8i32 = BUILD_VECTOR Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
          t30: v2i64 = scalar_to_vector t27
        t107: v4i64 = insert_subvector undef:v4i64, t30, Constant:i64<0>
      t108: v8i32 = bitcast t107
    t101: v8i32 = X86ISD::BLENDI t24, t108, TargetConstant:i8<2>
  t19: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t101
      t25: v8i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
          t91: i32 = truncate t27
        t92: v4i32 = X86ISD::VBROADCAST t91
      t94: v8i32 = insert_subvector undef:v8i32, t92, Constant:i64<0>
    t97: v8i32 = X86ISD::BLENDI t25, t94, TargetConstant:i8<4>
  t21: ch,glue = CopyToReg t19, Register:v8i32 $ymm1, t97, t19:1
  t22: ch = X86ISD::RET_FLAG t21, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, t21:1
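
The value relationship behind the sharing idea for this DAG (`t30` is `scalar_to_vector` of the i64 load, `t91` is its `truncate`) can be modelled in a few lines. This is a sketch assuming a little-endian target, with hypothetical helper names; it only shows that the two nodes agree on lane 0 and says nothing about how the fold would be implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Model `bitcast (v2i64 scalar_to_vector X) to v4i32` on a little-endian
// target and return the requested i32 lane. Only lanes 0 and 1 are defined:
// the upper i64 element of scalar_to_vector is undefined.
std::uint32_t i32LaneOfScalarToVectorI64(std::uint64_t X, std::size_t Lane) {
  assert(Lane < 2 && "Lanes 2 and 3 come from the undefined upper element");
  return static_cast<std::uint32_t>(X >> (32 * Lane));
}

// Model `truncate X to i32`.
std::uint32_t truncToI32(std::uint64_t X) {
  return static_cast<std::uint32_t>(X);
}
```

Lane 0 of the bitcast vector always equals the truncated scalar, which is why a broadcast of `trunc(x)` could in principle read lane 0 of the existing `scalar_to_vector(x)` instead of materializing a second value.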

Harbormaster completed remote builds in B124449: Diff 373292.Sep 17 2021, 12:08 PM

lebedev.ri added inline comments.Sep 17 2021, 12:40 PM

llvm/test/CodeGen/X86/oddshuffles.ll

2268

While something like this should work:

diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 5a49f33e46fe..4d7c2c2a8651 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -21824,6 +21824,13 @@ SDValue DAGCombiner::visitSCALAR_TO_VECTOR(SDNode *N) {
     }
   }
 
+  // Fold SCALAR_TO_VECTOR(TRUNCATE(V)) to SCALAR_TO_VECTOR(V),
+  // by making truncation of the operand implicit.
+  if (InVal.getOpcode() == ISD::TRUNCATE && VT.isFixedLengthVector() &&
+      Level < AfterLegalizeDAG)
+    return DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(N), VT,
+                       InVal->getOperand(0));
+
   return SDValue();
 }
 
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 09ba7af6e38a..695cc8303cc1 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -14076,10 +14076,12 @@ static SDValue lowerShuffleAsBroadcast(const SDLoc &DL, MVT VT, SDValue V1,
 
   // If this is a scalar, do the broadcast on this type and bitcast.
   if (!V.getValueType().isVector()) {
-    assert(V.getScalarValueSizeInBits() == NumEltBits &&
-           "Unexpected scalar size");
-    MVT BroadcastVT = MVT::getVectorVT(V.getSimpleValueType(),
-                                       VT.getVectorNumElements());
+    if(V.getValueType().isInteger() &&
+       V.getScalarValueSizeInBits() > NumEltBits)
+      V = DAG.getNode(ISD::TRUNCATE, DL, VT.getScalarType(), V);
+    assert(V.getScalarValueSizeInBits() == NumEltBits && "Unexpected scalar size");
+    MVT BroadcastVT =
+        MVT::getVectorVT(V.getSimpleValueType(), VT.getVectorNumElements());
     return DAG.getBitcast(VT, DAG.getNode(Opcode, DL, BroadcastVT, V));
   }

it doesn't catch anything with the cut-off,
and without it, it exposes numerous places that don't expect this truncation to be implicit.

lebedev.ri added inline comments.Sep 18 2021, 3:22 AM

llvm/test/CodeGen/X86/oddshuffles.ll
2268	I've looked again, and i'm not sure i have enough motivation to tackle all the fallout from the `scalar_to_vector(trunc(x)) --> scalar_to_vector(x)` fold, unless i misunderstood the suggestion.

lebedev.ri mentioned this in rG6a2c2263fbca: [X86] Improve i8 all-ones element insertion in pre-SSE4.1.Sep 18 2021, 12:24 PM

Rebased, NFC.

Harbormaster completed remote builds in B124551: Diff 373433.Sep 18 2021, 1:08 PM

LGTM

llvm/lib/Target/X86/X86ISelLowering.cpp
37950	Lo + Hi
llvm/test/CodeGen/X86/oddshuffles.ll
2268	That's OK - I'll take a look when I get the chance.

This revision is now accepted and ready to land.Sep 19 2021, 5:00 AM

In D109065#3008188, @RKSimon wrote:

LGTM

:/
Thank you for the review!

llvm/test/CodeGen/X86/oddshuffles.ll
2268	Sorry :/

This revision was landed with ongoing or failed builds.Sep 19 2021, 7:25 AM

Closed by commit rG1e72ca94e579: [X86] combineX86ShufflesRecursively(): call… (authored by lebedev.ri).

This revision was automatically updated to reflect the committed changes.

lebedev.ri marked an inline comment as done.

lebedev.ri added a commit: rG1e72ca94e579: [X86] combineX86ShufflesRecursively(): call….

RKSimon mentioned this in rG0e89ff8195e9: [X86] SimplifyDemandedBits - only narrow a broadcast source if we only have one….Sep 19 2021, 2:54 PM

Revision Contents

Path	Size
llvm/lib/Target/X86/X86ISelLowering.cpp	44 lines
llvm/test/CodeGen/X86/insertelement-ones.ll	12 lines
llvm/test/CodeGen/X86/oddshuffles.ll	26 lines
llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-5.ll	2 lines
llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-6.ll	8 lines
llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll	36 lines
llvm/test/CodeGen/X86/vselect.ll	7 lines
llvm/test/CodeGen/X86/x86-interleaved-access.ll	208 lines

Diff 369928

llvm/lib/Target/X86/X86ISelLowering.cpp

Inline comments on `unsigned NumOpElts = Op.getValueType().getVectorNumElements();`:
RKSimon: Op might be a different width to the Root - see the "Widen any subvector shuffle inputs we've collected." code below.
lebedev.ri: I keep hitting the same pitfall.

[... first 37,585 lines not shown ...]
 namespace llvm {
   namespace X86 {
     enum {
       MaxShuffleCombineDepth = 8
     };
   }
 } // namespace llvm

 /// Fully generic combining of x86 shuffle instructions.
 ///
 /// This should be the last combine run over the x86 shuffle instructions. Once
 /// they have been fully optimized, this will recursively consider all chains
 /// of single-use shuffle instructions, build a generic model of the cumulative
 /// shuffle operation, and check for simpler instructions which implement this
 /// operation. We use this primarily for two purposes:
 ///
 /// 1) Collapse generic shuffles to specialized single instructions when
 ///    equivalent. In most cases, this is just an encoding size win, but
 ///    sometimes we will collapse multiple generic shuffles into a single
 ///    special-purpose shuffle.
 /// 2) Look for sequences of shuffle instructions with 3 or more total
 ///    instructions, and replace them with the slightly more expensive SSSE3
[... 278 lines not shown ...]
   if (Depth == 0 && llvm::all_of(Ops, [&](SDValue Op) {
         SmallVector<APInt> RawBits;
         unsigned EltSizeInBits = RootSizeInBits / Mask.size();
         return getTargetConstantBitsFromNode(Op, EltSizeInBits, UndefElts,
                                              RawBits);
       })) {
     return SDValue();
   }

+  // Try to refine our inputs given our knowledge of target shuffle mask.
+  for (auto I : enumerate(Ops)) {
+    int OpIdx = I.index();
+    SDValue &Op = I.value();
+
+    // What range of shuffle mask element values results in picking from Op?
+    int lo = OpIdx * Mask.size();
+    int hi = lo + Mask.size();
+
+    // Which elements of Op do we demand?
+    SmallVector<int, 8> OpDemandedIdentityMask(Mask.size(), -1);
+    for (int MaskElt : Mask) {
+      if (isInRange(MaskElt, lo, hi)) { // Picks from Op?
+        int OpEltIdx = MaskElt - lo;
+        OpDemandedIdentityMask[OpEltIdx] = OpEltIdx;
+      }
+    }
+
+    unsigned NumOpElts = Op.getValueType().getVectorNumElements();
+
+    SmallVector<int, 8> ScaledOpDemandedIdentityMask;
+    bool scaled = scaleShuffleElements(OpDemandedIdentityMask, NumOpElts,
+                                       ScaledOpDemandedIdentityMask);
+    (void)scaled;
+    assert(scaled &&
+           "We should always succeed in scaling the identity shuffle mask!");
+    assert(isSequentialOrUndefInRange(ScaledOpDemandedIdentityMask, 0,
+                                      NumOpElts, 0) &&
+           "Should still have an identity mask after scaling!");
+
+    // Transform (scaled) identity shuffle mask into a demandedelts mask.
+    APInt DemandedOpElts = APInt::getNullValue(NumOpElts);
+    for (int ScaledOpDemandedIdentityMaskElt : ScaledOpDemandedIdentityMask)
+      if (ScaledOpDemandedIdentityMaskElt >= 0)
+        DemandedOpElts.setBit(ScaledOpDemandedIdentityMaskElt);
+
+    // Can this operand be simplified any further, given its demanded elements?
+    if (SDValue NewOp =
+            DAG.getTargetLoweringInfo().SimplifyMultipleUseDemandedVectorElts(
+                Op, DemandedOpElts, DAG))
+      Op = NewOp;
+  }
+  // FIXME: should we rerun resolveTargetShuffleInputsAndMask() now?
+
   // Canonicalize the combined shuffle mask chain with horizontal ops.
   // NOTE: This will update the Ops and Mask.
   if (SDValue HOp = canonicalizeShuffleMaskWithHorizOp(
           Ops, Mask, RootSizeInBits, SDLoc(Root), DAG, Subtarget))
     return DAG.getBitcast(Root.getValueType(), HOp);

   // Widen any subvector shuffle inputs we've collected.
   if (any_of(Ops, [RootSizeInBits](SDValue Op) {
         return Op.getValueSizeInBits() < RootSizeInBits;
       })) {
     for (SDValue &Op : Ops)
       if (Op.getValueSizeInBits() < RootSizeInBits)
         Op = widenSubVector(Op, false, Subtarget, DAG, SDLoc(Op),
                             RootSizeInBits);
     // Reresolve - we might have repeated subvector sources.
     resolveTargetShuffleInputsAndMask(Ops, Mask);
   }

   // We can only combine unary and binary shuffle mask cases.
   if (Ops.size() <= 2) {
     // Minor canonicalization of the accumulated shuffle mask to make it easier
     // to match below. All this does is detect masks with sequential pairs of
     // elements, and shrink them to the half-width mask. It does this in a loop
     // so it will reduce the size of the mask to the minimal width mask which
     // performs an equivalent shuffle.
     while (Mask.size() > 1) {
       SmallVector<int, 64> WidenedMask;
       if (!canWidenShuffleElements(Mask, WidenedMask))
         break;
       Mask = std::move(WidenedMask);
     }

     // Canonicalization of binary shuffle masks to improve pattern matching by
     // commuting the inputs.
     if (Ops.size() == 2 && canonicalizeShuffleMaskWithCommute(Mask)) {
       ShuffleVectorSDNode::commuteMask(Mask);
       std::swap(Ops[0], Ops[1]);
     }

     // Finally, try to combine into a single shuffle instruction.
     return combineX86ShuffleChain(Ops, Root, Mask, Depth, HasVariableMask,
                                   AllowVariableCrossLaneMask,
                                   AllowVariablePerLaneMask, DAG, Subtarget);
   }

   // If that failed and any input is extracted then try to combine as a
[... 15,658 lines not shown ...]

llvm/test/CodeGen/X86/insertelement-ones.ll

[... first 305 lines not shown ...]
 ; AVX512-NEXT: vpblendd {{.*#+}} ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7]
 ; AVX512-NEXT: retq
   %1 = insertelement <16 x i16> %a, i16 -1, i32 0
   %2 = insertelement <16 x i16> %1, i16 -1, i32 6
   %3 = insertelement <16 x i16> %2, i16 -1, i32 15
   ret <16 x i16> %3
 }

 define <16 x i8> @insert_v16i8_x123456789ABCDEx(<16 x i8> %a) {
 ; SSE2-LABEL: insert_v16i8_x123456789ABCDEx:
 ; SSE2: # %bb.0:
 ; SSE2-NEXT: movdqa {{.*#+}} xmm1 = [0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
 ; SSE2-NEXT: pand %xmm1, %xmm0
 ; SSE2-NEXT: movl $255, %eax
 ; SSE2-NEXT: movd %eax, %xmm2
 ; SSE2-NEXT: pandn %xmm2, %xmm1
 ; SSE2-NEXT: por %xmm1, %xmm0
 ; SSE2-NEXT: pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
 ; SSE2-NEXT: pslldq {{.*#+}} xmm2 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm2[0]
 ; SSE2-NEXT: por %xmm2, %xmm0
 ; SSE2-NEXT: retq
[... 50 lines not shown ...]
 ; SSE2-NEXT: movd %eax, %xmm3
 ; SSE2-NEXT: pandn %xmm3, %xmm2
 ; SSE2-NEXT: por %xmm2, %xmm0
 ; SSE2-NEXT: movdqa {{.*#+}} xmm2 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,0]
 ; SSE2-NEXT: pand %xmm2, %xmm0
 ; SSE2-NEXT: movdqa %xmm3, %xmm4
 ; SSE2-NEXT: pslldq {{.*#+}} xmm4 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm4[0]
 ; SSE2-NEXT: por %xmm4, %xmm0
-; SSE2-NEXT: pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE2-NEXT: movdqa {{.*#+}} xmm5 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,0,255]
+; SSE2-NEXT: pand %xmm5, %xmm1
 ; SSE2-NEXT: pslldq {{.*#+}} xmm3 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm3[0,1]
-; SSE2-NEXT: por %xmm3, %xmm1
+; SSE2-NEXT: pandn %xmm3, %xmm5
+; SSE2-NEXT: por %xmm5, %xmm1
 ; SSE2-NEXT: pand %xmm2, %xmm1
 ; SSE2-NEXT: por %xmm4, %xmm1
 ; SSE2-NEXT: retq
 ;
 ; SSE3-LABEL: insert_v32i8_x123456789ABCDEzGHIJKLMNOPQRSTxx:
 ; SSE3: # %bb.0:
 ; SSE3-NEXT: movdqa {{.*#+}} xmm2 = [0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
 ; SSE3-NEXT: pand %xmm2, %xmm0
 ; SSE3-NEXT: movl $255, %eax
 ; SSE3-NEXT: movd %eax, %xmm3
 ; SSE3-NEXT: pandn %xmm3, %xmm2
 ; SSE3-NEXT: por %xmm2, %xmm0
 ; SSE3-NEXT: movdqa {{.*#+}} xmm2 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,0]
 ; SSE3-NEXT: pand %xmm2, %xmm0
 ; SSE3-NEXT: movdqa %xmm3, %xmm4
 ; SSE3-NEXT: pslldq {{.*#+}} xmm4 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm4[0]
 ; SSE3-NEXT: por %xmm4, %xmm0
-; SSE3-NEXT: pand {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1
+; SSE3-NEXT: movdqa {{.*#+}} xmm5 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,0,255]
+; SSE3-NEXT: pand %xmm5, %xmm1
 ; SSE3-NEXT: pslldq {{.*#+}} xmm3 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm3[0,1]
-; SSE3-NEXT: por %xmm3, %xmm1
+; SSE3-NEXT: pandn %xmm3, %xmm5
+; SSE3-NEXT: por %xmm5, %xmm1
 ; SSE3-NEXT: pand %xmm2, %xmm1
 ; SSE3-NEXT: por %xmm4, %xmm1
 ; SSE3-NEXT: retq
 ;
 ; SSSE3-LABEL: insert_v32i8_x123456789ABCDEzGHIJKLMNOPQRSTxx:
 ; SSSE3: # %bb.0:
 ; SSSE3-NEXT: movl $255, %eax
 ; SSSE3-NEXT: movd %eax, %xmm3
[... 61 lines not shown ...]

llvm/test/CodeGen/X86/oddshuffles.ll

	Show First 20 Lines • Show All 2,255 Lines • ▼ Show 20 Lines
	; SSE42-NEXT: pblendw {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3],xmm1[4,5,6,7]			; SSE42-NEXT: pblendw {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3],xmm1[4,5,6,7]
	; SSE42-NEXT: pshufd {{.*#+}} xmm2 = xmm2[1,1,0,1]			; SSE42-NEXT: pshufd {{.*#+}} xmm2 = xmm2[1,1,0,1]
	; SSE42-NEXT: pxor %xmm1, %xmm1			; SSE42-NEXT: pxor %xmm1, %xmm1
	; SSE42-NEXT: xorps %xmm3, %xmm3			; SSE42-NEXT: xorps %xmm3, %xmm3
	; SSE42-NEXT: retq			; SSE42-NEXT: retq
	;			;
	; AVX1-LABEL: splat_v3i32:			; AVX1-LABEL: splat_v3i32:
	; AVX1: # %bb.0:			; AVX1: # %bb.0:
	; AVX1-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero			; AVX1-NEXT: movq (%rdi), %rax
	; AVX1-NEXT: vpinsrd $2, 8(%rdi), %xmm0, %xmm1			; AVX1-NEXT: vmovq %rax, %xmm0
	; AVX1-NEXT: vxorps %xmm2, %xmm2, %xmm2			; AVX1-NEXT: vxorps %xmm1, %xmm1, %xmm1
	; AVX1-NEXT: vblendps {{.*#+}} ymm0 = ymm2[0],ymm0[1],ymm2[2,3,4,5,6,7]			; AVX1-NEXT: vblendps {{.*#+}} ymm0 = ymm1[0],ymm0[1],ymm1[2,3,4,5,6,7]
	; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,1]			; AVX1-NEXT: vmovd %eax, %xmm2
				RKSimon: Looks like we're missing a fold to share scalar_to_vector(x) and scalar_to_vector(trunc(x)) (maybe worth supporting scalar_to_vector(ext(x)) as well)?
				lebedev.ri (Author): This stuff is broken. We're in AVX1 mode, and only have broadcast-from-mem, yet we've successfully obscured the load via the truncation. I suppose we could look past ext/trunc of the scalar_to_vector operand, and change it to bitcast/ext of the scalar_to_vector itself; let me see.
				Optimized legalized selection DAG: %bb.0 'splat_v3i32:'
				SelectionDAG has 28 nodes:
				t0: ch = EntryToken
				t2: i64,ch = CopyFromReg t0, Register:i64 %0
				t27: i64,ch = load<(load (s64) from %ir.ptr, align 1)> t0, t2, undef:i64
				t24: v8i32 = BUILD_VECTOR Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
				t30: v2i64 = scalar_to_vector t27
				t107: v4i64 = insert_subvector undef:v4i64, t30, Constant:i64<0>
				t108: v8i32 = bitcast t107
				t101: v8i32 = X86ISD::BLENDI t24, t108, TargetConstant:i8<2>
				t19: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t101
				t25: v8i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
				t91: i32 = truncate t27
				t92: v4i32 = X86ISD::VBROADCAST t91
				t94: v8i32 = insert_subvector undef:v8i32, t92, Constant:i64<0>
				t97: v8i32 = X86ISD::BLENDI t25, t94, TargetConstant:i8<4>
				t21: ch,glue = CopyToReg t19, Register:v8i32 $ymm1, t97, t19:1
				t22: ch = X86ISD::RET_FLAG t21, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, t21:1
				lebedev.ri (Author): While something like this should work:
				diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
				index 5a49f33e46fe..4d7c2c2a8651 100644
				--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
				+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
				@@ -21824,6 +21824,13 @@ SDValue DAGCombiner::visitSCALAR_TO_VECTOR(SDNode *N) {
				     }
				   }
				 
				+  // Fold SCALAR_TO_VECTOR(TRUNCATE(V)) to SCALAR_TO_VECTOR(V),
				+  // by making truncation of the operand implicit.
				+  if (InVal.getOpcode() == ISD::TRUNCATE && VT.isFixedLengthVector() &&
				+      Level < AfterLegalizeDAG)
				+    return DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(N), VT,
				+                       InVal->getOperand(0));
				+
				   return SDValue();
				 }
				 
				diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
				index 09ba7af6e38a..695cc8303cc1 100644
				--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
				+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
				@@ -14076,10 +14076,12 @@ static SDValue lowerShuffleAsBroadcast(const SDLoc &DL, MVT VT, SDValue V1,
				   // If this is a scalar, do the broadcast on this type and bitcast.
				   if (!V.getValueType().isVector()) {
				-    assert(V.getScalarValueSizeInBits() == NumEltBits &&
				-           "Unexpected scalar size");
				-    MVT BroadcastVT = MVT::getVectorVT(V.getSimpleValueType(),
				-                                       VT.getVectorNumElements());
				+    if(V.getValueType().isInteger() &&
				+       V.getScalarValueSizeInBits() > NumEltBits)
				+      V = DAG.getNode(ISD::TRUNCATE, DL, VT.getScalarType(), V);
				+    assert(V.getScalarValueSizeInBits() == NumEltBits && "Unexpected scalar size");
				+    MVT BroadcastVT =
				+        MVT::getVectorVT(V.getSimpleValueType(), VT.getVectorNumElements());
				     return DAG.getBitcast(VT, DAG.getNode(Opcode, DL, BroadcastVT, V));
				   }
				it doesn't catch anything with the cut-off, and without it, it exposes numerous places that don't expect this truncation to be implicit.
				lebedev.ri (Author): I've looked again, and I'm not sure I have enough motivation to tackle all the fallout from the `scalar_to_vector(trunc(x)) --> scalar_to_vector(x)` fold, unless I misunderstood the suggestion.
				RKSimon: That's OK - I'll take a look when I get the chance.
				lebedev.ri (Author): Sorry :/
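For reference, the fold discussed in this thread relies on `ISD::SCALAR_TO_VECTOR` accepting an integer operand wider than the vector element type and truncating it implicitly. A minimal standalone C++ model of that equivalence follows. `V4I32`, `scalarToVector()` and `scalarToVectorImplicitTrunc()` are hypothetical names for illustration and are not LLVM APIs, and undefined lanes are modeled as empty optionals.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical model of a v4i32 value: a lane is either defined or undef.
using V4I32 = std::array<std::optional<std::uint32_t>, 4>;

// Models scalar_to_vector with an operand of the element type:
// lane 0 holds the scalar, all other lanes are undef.
V4I32 scalarToVector(std::uint32_t X) {
  V4I32 V{}; // Value-initialised: every lane starts out empty (undef).
  V[0] = X;
  return V;
}

// Models scalar_to_vector with a wider integer operand: the operand is
// implicitly truncated to the element width before landing in lane 0.
V4I32 scalarToVectorImplicitTrunc(std::uint64_t X) {
  V4I32 V{};
  V[0] = static_cast<std::uint32_t>(X); // Keeps the low 32 bits only.
  return V;
}
```

In this model `scalarToVector(static_cast<std::uint32_t>(x))` and `scalarToVectorImplicitTrunc(x)` produce the same vector for every 64-bit `x`. That is why the explicit `truncate` node is redundant in principle; the fallout mentioned above comes from code that assumes the operand already has the element type.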
	; AVX1-NEXT: vblendps {{.*#+}} ymm1 = ymm2[0,1],ymm1[2],ymm2[3,4,5,6,7]			; AVX1-NEXT: vpshufd {{.*#+}} xmm2 = xmm2[0,0,0,0]
				; AVX1-NEXT: vblendps {{.*#+}} ymm1 = ymm1[0,1],ymm2[2],ymm1[3,4,5,6,7]
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-SLOW-LABEL: splat_v3i32:			; AVX2-SLOW-LABEL: splat_v3i32:
	; AVX2-SLOW: # %bb.0:			; AVX2-SLOW: # %bb.0:
	; AVX2-SLOW-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero			; AVX2-SLOW-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero
	; AVX2-SLOW-NEXT: vxorps %xmm2, %xmm2, %xmm2			; AVX2-SLOW-NEXT: vxorps %xmm2, %xmm2, %xmm2
	; AVX2-SLOW-NEXT: vblendps {{.*#+}} ymm0 = ymm2[0],ymm1[1],ymm2[2,3,4,5,6,7]			; AVX2-SLOW-NEXT: vblendps {{.*#+}} ymm0 = ymm2[0],ymm1[1],ymm2[2,3,4,5,6,7]
	; AVX2-SLOW-NEXT: vbroadcastss %xmm1, %xmm1			; AVX2-SLOW-NEXT: vbroadcastss %xmm1, %xmm1
	; AVX2-SLOW-NEXT: vblendps {{.*#+}} ymm1 = ymm2[0,1],ymm1[2],ymm2[3,4,5,6,7]			; AVX2-SLOW-NEXT: vblendps {{.*#+}} ymm1 = ymm2[0,1],ymm1[2],ymm2[3,4,5,6,7]
	; AVX2-SLOW-NEXT: retq			; AVX2-SLOW-NEXT: retq
	;			;
	; AVX2-FAST-LABEL: splat_v3i32:			; AVX2-FAST-LABEL: splat_v3i32:
	; AVX2-FAST: # %bb.0:			; AVX2-FAST: # %bb.0:
	; AVX2-FAST-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero			; AVX2-FAST-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
	; AVX2-FAST-NEXT: vpxor %xmm0, %xmm0, %xmm0			; AVX2-FAST-NEXT: vpxor %xmm0, %xmm0, %xmm0
	; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1],ymm0[2,3,4,5,6,7]			; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1],ymm0[2,3,4,5,6,7]
	; AVX2-FAST-NEXT: vpshufb {{.*#+}} ymm1 = zero,zero,zero,zero,zero,zero,zero,zero,ymm1[0,1,2,3],zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero			; AVX2-FAST-NEXT: vpshufb {{.*#+}} ymm1 = zero,zero,zero,zero,zero,zero,zero,zero,ymm1[0,1,2,3],zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero
	; AVX2-FAST-NEXT: retq			; AVX2-FAST-NEXT: retq
	;			;
	; XOP-LABEL: splat_v3i32:			; XOP-LABEL: splat_v3i32:
	; XOP: # %bb.0:			; XOP: # %bb.0:
	; XOP-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero			; XOP-NEXT: movq (%rdi), %rax
	; XOP-NEXT: vpinsrd $2, 8(%rdi), %xmm0, %xmm1			; XOP-NEXT: vmovq %rax, %xmm0
	; XOP-NEXT: vxorps %xmm2, %xmm2, %xmm2			; XOP-NEXT: vxorps %xmm1, %xmm1, %xmm1
	; XOP-NEXT: vblendps {{.*#+}} ymm0 = ymm2[0],ymm0[1],ymm2[2,3,4,5,6,7]			; XOP-NEXT: vblendps {{.*#+}} ymm0 = ymm1[0],ymm0[1],ymm1[2,3,4,5,6,7]
	; XOP-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,1]			; XOP-NEXT: vmovd %eax, %xmm2
	; XOP-NEXT: vblendps {{.*#+}} ymm1 = ymm2[0,1],ymm1[2],ymm2[3,4,5,6,7]			; XOP-NEXT: vpshufd {{.*#+}} xmm2 = xmm2[0,0,0,0]
				; XOP-NEXT: vblendps {{.*#+}} ymm1 = ymm1[0,1],ymm2[2],ymm1[3,4,5,6,7]
	; XOP-NEXT: retq			; XOP-NEXT: retq
	%1 = load <3 x i32>, <3 x i32>* %ptr, align 1			%1 = load <3 x i32>, <3 x i32>* %ptr, align 1
	%2 = shufflevector <3 x i32> %1, <3 x i32> undef, <16 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%2 = shufflevector <3 x i32> %1, <3 x i32> undef, <16 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%3 = shufflevector <16 x i32> <i32 0, i32 undef, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 undef, i32 0, i32 0, i32 0, i32 0, i32 0>, <16 x i32> %2, <16 x i32> <i32 0, i32 17, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 16, i32 11, i32 12, i32 13, i32 14, i32 15>			%3 = shufflevector <16 x i32> <i32 0, i32 undef, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 undef, i32 0, i32 0, i32 0, i32 0, i32 0>, <16 x i32> %2, <16 x i32> <i32 0, i32 17, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 16, i32 11, i32 12, i32 13, i32 14, i32 15>
	ret <16 x i32 > %3			ret <16 x i32 > %3
	}			}

	define <2 x double> @wrongorder(<4 x double> %A, <8 x double>* %P) #0 {			define <2 x double> @wrongorder(<4 x double> %A, <8 x double>* %P) #0 {
	▲ Show 20 Lines • Show All 260 Lines • Show Last 20 Lines

llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-5.ll

	Show First 20 Lines • Show All 488 Lines • ▼ Show 20 Lines
	; AVX2-FAST-PERLANE-NEXT: vpblendw {{.*#+}} xmm2 = xmm3[0,1,2],xmm2[3,4],xmm3[5,6,7]			; AVX2-FAST-PERLANE-NEXT: vpblendw {{.*#+}} xmm2 = xmm3[0,1,2],xmm2[3,4],xmm3[5,6,7]
	; AVX2-FAST-PERLANE-NEXT: vpshufb {{.*#+}} xmm2 = xmm2[8,9,2,3,12,13,6,7,0,1,10,11,u,u,u,u]			; AVX2-FAST-PERLANE-NEXT: vpshufb {{.*#+}} xmm2 = xmm2[8,9,2,3,12,13,6,7,0,1,10,11,u,u,u,u]
	; AVX2-FAST-PERLANE-NEXT: vpblendw {{.*#+}} ymm0 = ymm1[0,1],ymm0[2],ymm1[3],ymm0[4],ymm1[5,6],ymm0[7],ymm1[8,9],ymm0[10],ymm1[11],ymm0[12],ymm1[13,14],ymm0[15]			; AVX2-FAST-PERLANE-NEXT: vpblendw {{.*#+}} ymm0 = ymm1[0,1],ymm0[2],ymm1[3],ymm0[4],ymm1[5,6],ymm0[7],ymm1[8,9],ymm0[10],ymm1[11],ymm0[12],ymm1[13,14],ymm0[15]
	; AVX2-FAST-PERLANE-NEXT: vpermq {{.*#+}} ymm1 = ymm0[2,3,0,1]			; AVX2-FAST-PERLANE-NEXT: vpermq {{.*#+}} ymm1 = ymm0[2,3,0,1]
	; AVX2-FAST-PERLANE-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm1[4],ymm0[5,6],ymm1[7]			; AVX2-FAST-PERLANE-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm1[4],ymm0[5,6],ymm1[7]
	; AVX2-FAST-PERLANE-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[u,u,u,u,u,u,u,u,u,u,u,u,4,5,14,15,24,25,18,19,28,29,22,23,u,u,u,u,u,u,u,u]			; AVX2-FAST-PERLANE-NEXT: vpshufb {{.*#+}} ymm0 = ymm0[u,u,u,u,u,u,u,u,u,u,u,u,4,5,14,15,24,25,18,19,28,29,22,23,u,u,u,u,u,u,u,u]
	; AVX2-FAST-PERLANE-NEXT: vpblendd {{.*#+}} ymm0 = ymm2[0,1,2],ymm0[3,4,5],ymm2[6,7]			; AVX2-FAST-PERLANE-NEXT: vpblendd {{.*#+}} ymm0 = ymm2[0,1,2],ymm0[3,4,5],ymm2[6,7]
	; AVX2-FAST-PERLANE-NEXT: vpshufb {{.*#+}} xmm1 = xmm4[12,13,14,15,4,5,14,15,u,u,u,u,u,u,u,u]			; AVX2-FAST-PERLANE-NEXT: vpshufb {{.*#+}} xmm1 = xmm4[12,13,14,15,4,5,14,15,u,u,u,u,u,u,u,u]
	; AVX2-FAST-PERLANE-NEXT: vpshufb {{.*#+}} xmm2 = xmm5[0,1,2,3,0,1,10,11,u,u,u,u,u,u,u,u]			; AVX2-FAST-PERLANE-NEXT: vpshufb {{.*#+}} xmm2 = xmm5[u,u,u,u,0,1,10,11,u,u,u,u,u,u,u,u]
	; AVX2-FAST-PERLANE-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]			; AVX2-FAST-PERLANE-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
	; AVX2-FAST-PERLANE-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm1			; AVX2-FAST-PERLANE-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm1
	; AVX2-FAST-PERLANE-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]			; AVX2-FAST-PERLANE-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]
	; AVX2-FAST-PERLANE-NEXT: vmovdqa %ymm9, (%rsi)			; AVX2-FAST-PERLANE-NEXT: vmovdqa %ymm9, (%rsi)
	; AVX2-FAST-PERLANE-NEXT: vmovdqa %ymm10, (%rdx)			; AVX2-FAST-PERLANE-NEXT: vmovdqa %ymm10, (%rdx)
	; AVX2-FAST-PERLANE-NEXT: vmovdqa %ymm8, (%rcx)			; AVX2-FAST-PERLANE-NEXT: vmovdqa %ymm8, (%rcx)
	; AVX2-FAST-PERLANE-NEXT: vmovdqa %ymm6, (%r8)			; AVX2-FAST-PERLANE-NEXT: vmovdqa %ymm6, (%r8)
	; AVX2-FAST-PERLANE-NEXT: vmovdqa %ymm0, (%r9)			; AVX2-FAST-PERLANE-NEXT: vmovdqa %ymm0, (%r9)
	Show All 18 Lines

llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-6.ll

	Show First 20 Lines • Show All 303 Lines • ▼ Show 20 Lines
	; AVX2-SLOW-NEXT: vmovdqa 32(%rdi), %ymm14			; AVX2-SLOW-NEXT: vmovdqa 32(%rdi), %ymm14
	; AVX2-SLOW-NEXT: vmovdqa 64(%rdi), %ymm2			; AVX2-SLOW-NEXT: vmovdqa 64(%rdi), %ymm2
	; AVX2-SLOW-NEXT: vmovdqa 96(%rdi), %ymm5			; AVX2-SLOW-NEXT: vmovdqa 96(%rdi), %ymm5
	; AVX2-SLOW-NEXT: vmovdqa 160(%rdi), %ymm15			; AVX2-SLOW-NEXT: vmovdqa 160(%rdi), %ymm15
	; AVX2-SLOW-NEXT: vmovdqa 128(%rdi), %ymm1			; AVX2-SLOW-NEXT: vmovdqa 128(%rdi), %ymm1
	; AVX2-SLOW-NEXT: vmovdqu %ymm1, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill			; AVX2-SLOW-NEXT: vmovdqu %ymm1, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
	; AVX2-SLOW-NEXT: vpblendd {{.*#+}} ymm8 = ymm1[0,1],ymm15[2],ymm1[3,4],ymm15[5],ymm1[6,7]			; AVX2-SLOW-NEXT: vpblendd {{.*#+}} ymm8 = ymm1[0,1],ymm15[2],ymm1[3,4],ymm15[5],ymm1[6,7]
	; AVX2-SLOW-NEXT: vextracti128 $1, %ymm8, %xmm0			; AVX2-SLOW-NEXT: vextracti128 $1, %ymm8, %xmm0
	; AVX2-SLOW-NEXT: vpshufb {{.*#+}} xmm6 = xmm0[0,1,4,5,4,5,u,u,0,1,12,13,u,u,4,5]			; AVX2-SLOW-NEXT: vpshufb {{.*#+}} xmm6 = xmm0[u,u,u,u,u,u,u,u,0,1,12,13,u,u,4,5]
	; AVX2-SLOW-NEXT: vpshuflw {{.*#+}} xmm7 = xmm8[2,2,2,2,4,5,6,7]			; AVX2-SLOW-NEXT: vpshuflw {{.*#+}} xmm7 = xmm8[2,2,2,2,4,5,6,7]
	; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm7 = xmm7[0,1,2,2]			; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm7 = xmm7[0,1,2,2]
	; AVX2-SLOW-NEXT: vpblendw {{.*#+}} xmm6 = xmm6[0,1,2],xmm7[3],xmm6[4,5],xmm7[6],xmm6[7]			; AVX2-SLOW-NEXT: vpblendw {{.*#+}} xmm6 = xmm6[0,1,2],xmm7[3],xmm6[4,5],xmm7[6],xmm6[7]
	; AVX2-SLOW-NEXT: vinserti128 $1, %xmm6, %ymm0, %ymm9			; AVX2-SLOW-NEXT: vinserti128 $1, %xmm6, %ymm0, %ymm9
	; AVX2-SLOW-NEXT: vperm2i128 {{.*#+}} ymm12 = ymm2[2,3],ymm5[2,3]			; AVX2-SLOW-NEXT: vperm2i128 {{.*#+}} ymm12 = ymm2[2,3],ymm5[2,3]
	; AVX2-SLOW-NEXT: vpshufd {{.*#+}} ymm7 = ymm12[0,2,2,1,4,6,6,5]			; AVX2-SLOW-NEXT: vpshufd {{.*#+}} ymm7 = ymm12[0,2,2,1,4,6,6,5]
	; AVX2-SLOW-NEXT: vpshufhw {{.*#+}} ymm11 = ymm7[0,1,2,3,6,6,6,6,8,9,10,11,14,14,14,14]			; AVX2-SLOW-NEXT: vpshufhw {{.*#+}} ymm11 = ymm7[0,1,2,3,6,6,6,6,8,9,10,11,14,14,14,14]
	; AVX2-SLOW-NEXT: vperm2i128 {{.*#+}} ymm10 = ymm2[0,1],ymm5[0,1]			; AVX2-SLOW-NEXT: vperm2i128 {{.*#+}} ymm10 = ymm2[0,1],ymm5[0,1]
	; AVX2-SLOW-NEXT: vpshufd {{.*#+}} ymm5 = ymm10[0,3,2,3,4,7,6,7]			; AVX2-SLOW-NEXT: vpshufd {{.*#+}} ymm5 = ymm10[0,3,2,3,4,7,6,7]
	; AVX2-SLOW-NEXT: vpshuflw {{.*#+}} ymm2 = ymm5[0,2,2,3,4,5,6,7,8,10,10,11,12,13,14,15]			; AVX2-SLOW-NEXT: vpshuflw {{.*#+}} ymm2 = ymm5[0,2,2,3,4,5,6,7,8,10,10,11,12,13,14,15]
	; AVX2-SLOW-NEXT: vpshufd {{.*#+}} ymm2 = ymm2[0,1,2,2,4,5,6,6]			; AVX2-SLOW-NEXT: vpshufd {{.*#+}} ymm2 = ymm2[0,1,2,2,4,5,6,6]
	; AVX2-SLOW-NEXT: vpblendw {{.*#+}} ymm2 = ymm2[0,1],ymm11[2],ymm2[3,4,5,6],ymm11[7],ymm2[8,9],ymm11[10],ymm2[11,12,13,14],ymm11[15]			; AVX2-SLOW-NEXT: vpblendw {{.*#+}} ymm2 = ymm2[0,1],ymm11[2],ymm2[3,4,5,6],ymm11[7],ymm2[8,9],ymm11[10],ymm2[11,12,13,14],ymm11[15]
	; AVX2-SLOW-NEXT: vpblendd {{.*#+}} ymm11 = ymm13[0],ymm14[1],ymm13[2,3],ymm14[4],ymm13[5,6],ymm14[7]			; AVX2-SLOW-NEXT: vpblendd {{.*#+}} ymm11 = ymm13[0],ymm14[1],ymm13[2,3],ymm14[4],ymm13[5,6],ymm14[7]
	; AVX2-SLOW-NEXT: vpshufb {{.*#+}} xmm1 = xmm11[0,1,12,13,u,u,4,5,u,u,u,u,12,13,14,15]			; AVX2-SLOW-NEXT: vpshufb {{.*#+}} xmm1 = xmm11[0,1,12,13,u,u,4,5,u,u,u,u,12,13,14,15]
	; AVX2-SLOW-NEXT: vextracti128 $1, %ymm11, %xmm3			; AVX2-SLOW-NEXT: vextracti128 $1, %ymm11, %xmm3
	; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm4 = xmm3[0,2,0,3]			; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm4 = xmm3[0,2,0,3]
	; AVX2-SLOW-NEXT: vpshufhw {{.*#+}} xmm4 = xmm4[0,1,2,3,4,6,6,7]			; AVX2-SLOW-NEXT: vpshufhw {{.*#+}} xmm4 = xmm4[0,1,2,3,4,6,6,7]
	; AVX2-SLOW-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm4[2],xmm1[3],xmm4[4,5],xmm1[6,7]			; AVX2-SLOW-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm4[2],xmm1[3],xmm4[4,5],xmm1[6,7]
	; AVX2-SLOW-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2],ymm2[3,4,5],ymm1[6,7]			; AVX2-SLOW-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2],ymm2[3,4,5],ymm1[6,7]
	; AVX2-SLOW-NEXT: vpblendw {{.*#+}} ymm2 = ymm1[0,1,2],ymm9[3,4,5,6,7],ymm1[8,9,10],ymm9[11,12,13,14,15]			; AVX2-SLOW-NEXT: vpblendw {{.*#+}} ymm2 = ymm1[0,1,2],ymm9[3,4,5,6,7],ymm1[8,9,10],ymm9[11,12,13,14,15]
	; AVX2-SLOW-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2,3],ymm2[4,5,6,7]			; AVX2-SLOW-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2,3],ymm2[4,5,6,7]
	; AVX2-SLOW-NEXT: vmovdqu %ymm1, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill			; AVX2-SLOW-NEXT: vmovdqu %ymm1, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
	; AVX2-SLOW-NEXT: vpshufhw {{.*#+}} xmm1 = xmm8[0,1,2,3,5,5,5,5]			; AVX2-SLOW-NEXT: vpshufhw {{.*#+}} xmm1 = xmm8[0,1,2,3,5,5,5,5]
	; AVX2-SLOW-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[6,7,2,3,4,5,u,u,2,3,14,15,u,u,6,7]			; AVX2-SLOW-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[u,u,u,u,u,u,u,u,2,3,14,15,u,u,6,7]
	; AVX2-SLOW-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[3],xmm0[4,5],xmm1[6],xmm0[7]			; AVX2-SLOW-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[3],xmm0[4,5],xmm1[6],xmm0[7]
	; AVX2-SLOW-NEXT: vinserti128 $1, %xmm0, %ymm0, %ymm0			; AVX2-SLOW-NEXT: vinserti128 $1, %xmm0, %ymm0, %ymm0
	; AVX2-SLOW-NEXT: vpshufd {{.*#+}} ymm1 = ymm12[2,1,2,1,6,5,6,5]			; AVX2-SLOW-NEXT: vpshufd {{.*#+}} ymm1 = ymm12[2,1,2,1,6,5,6,5]
	; AVX2-SLOW-NEXT: vpshuflw {{.*#+}} ymm1 = ymm1[1,1,1,1,4,5,6,7,9,9,9,9,12,13,14,15]			; AVX2-SLOW-NEXT: vpshuflw {{.*#+}} ymm1 = ymm1[1,1,1,1,4,5,6,7,9,9,9,9,12,13,14,15]
	; AVX2-SLOW-NEXT: vpshuflw {{.*#+}} ymm4 = ymm5[1,3,2,3,4,5,6,7,9,11,10,11,12,13,14,15]			; AVX2-SLOW-NEXT: vpshuflw {{.*#+}} ymm4 = ymm5[1,3,2,3,4,5,6,7,9,11,10,11,12,13,14,15]
	; AVX2-SLOW-NEXT: vpshufhw {{.*#+}} ymm4 = ymm4[0,1,2,3,5,5,5,5,8,9,10,11,13,13,13,13]			; AVX2-SLOW-NEXT: vpshufhw {{.*#+}} ymm4 = ymm4[0,1,2,3,5,5,5,5,8,9,10,11,13,13,13,13]
	; AVX2-SLOW-NEXT: vpblendw {{.*#+}} ymm1 = ymm4[0,1],ymm1[2],ymm4[3,4,5,6],ymm1[7],ymm4[8,9],ymm1[10],ymm4[11,12,13,14],ymm1[15]			; AVX2-SLOW-NEXT: vpblendw {{.*#+}} ymm1 = ymm4[0,1],ymm1[2],ymm4[3,4,5,6],ymm1[7],ymm4[8,9],ymm1[10],ymm4[11,12,13,14],ymm1[15]
	; AVX2-SLOW-NEXT: vpshufb {{.*#+}} xmm3 = xmm3[u,u,u,u,10,11,u,u,2,3,14,15,u,u,u,u]			; AVX2-SLOW-NEXT: vpshufb {{.*#+}} xmm3 = xmm3[u,u,u,u,10,11,u,u,2,3,14,15,u,u,u,u]
	▲ Show 20 Lines • Show All 103 Lines • ▼ Show 20 Lines
	; AVX2-FAST-NEXT: vmovdqa 64(%rdi), %ymm2			; AVX2-FAST-NEXT: vmovdqa 64(%rdi), %ymm2
	; AVX2-FAST-NEXT: vmovdqa 96(%rdi), %ymm5			; AVX2-FAST-NEXT: vmovdqa 96(%rdi), %ymm5
	; AVX2-FAST-NEXT: vmovdqa 160(%rdi), %ymm0			; AVX2-FAST-NEXT: vmovdqa 160(%rdi), %ymm0
	; AVX2-FAST-NEXT: vmovdqu %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill			; AVX2-FAST-NEXT: vmovdqu %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
	; AVX2-FAST-NEXT: vmovdqa 128(%rdi), %ymm13			; AVX2-FAST-NEXT: vmovdqa 128(%rdi), %ymm13
	; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm8 = ymm13[0,1],ymm0[2],ymm13[3,4],ymm0[5],ymm13[6,7]			; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm8 = ymm13[0,1],ymm0[2],ymm13[3,4],ymm0[5],ymm13[6,7]
	; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm6 = xmm8[u,u,u,u,u,u,4,5,u,u,u,u,8,9,u,u]			; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm6 = xmm8[u,u,u,u,u,u,4,5,u,u,u,u,8,9,u,u]
	; AVX2-FAST-NEXT: vextracti128 $1, %ymm8, %xmm0			; AVX2-FAST-NEXT: vextracti128 $1, %ymm8, %xmm0
	; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm7 = xmm0[0,1,4,5,4,5,u,u,0,1,12,13,u,u,4,5]			; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm7 = xmm0[u,u,u,u,u,u,u,u,0,1,12,13,u,u,4,5]
	; AVX2-FAST-NEXT: vpblendw {{.*#+}} xmm6 = xmm7[0,1,2],xmm6[3],xmm7[4,5],xmm6[6],xmm7[7]			; AVX2-FAST-NEXT: vpblendw {{.*#+}} xmm6 = xmm7[0,1,2],xmm6[3],xmm7[4,5],xmm6[6],xmm7[7]
	; AVX2-FAST-NEXT: vinserti128 $1, %xmm6, %ymm0, %ymm9			; AVX2-FAST-NEXT: vinserti128 $1, %xmm6, %ymm0, %ymm9
	; AVX2-FAST-NEXT: vperm2i128 {{.*#+}} ymm10 = ymm2[2,3],ymm5[2,3]			; AVX2-FAST-NEXT: vperm2i128 {{.*#+}} ymm10 = ymm2[2,3],ymm5[2,3]
	; AVX2-FAST-NEXT: vpshufd {{.*#+}} ymm11 = ymm10[2,1,2,1,6,5,6,5]			; AVX2-FAST-NEXT: vpshufd {{.*#+}} ymm11 = ymm10[2,1,2,1,6,5,6,5]
	; AVX2-FAST-NEXT: vpshufb {{.*#+}} ymm12 = ymm11[u,u,u,u,u,u,u,u,u,u,u,u,u,u,12,13,u,u,u,u,16,17,u,u,u,u,u,u,u,u,u,u]			; AVX2-FAST-NEXT: vpshufb {{.*#+}} ymm12 = ymm11[u,u,u,u,u,u,u,u,u,u,u,u,u,u,12,13,u,u,u,u,16,17,u,u,u,u,u,u,u,u,u,u]
	; AVX2-FAST-NEXT: vperm2i128 {{.*#+}} ymm7 = ymm2[0,1],ymm5[0,1]			; AVX2-FAST-NEXT: vperm2i128 {{.*#+}} ymm7 = ymm2[0,1],ymm5[0,1]
	; AVX2-FAST-NEXT: vpshufd {{.*#+}} ymm5 = ymm7[0,3,2,3,4,7,6,7]			; AVX2-FAST-NEXT: vpshufd {{.*#+}} ymm5 = ymm7[0,3,2,3,4,7,6,7]
	; AVX2-FAST-NEXT: vpshufb {{.*#+}} ymm2 = ymm5[u,u,u,u,u,u,u,u,u,u,u,u,8,9,u,u,16,17,20,21,u,u,22,23,u,u,u,u,u,u,u,u]			; AVX2-FAST-NEXT: vpshufb {{.*#+}} ymm2 = ymm5[u,u,u,u,u,u,u,u,u,u,u,u,8,9,u,u,16,17,20,21,u,u,22,23,u,u,u,u,u,u,u,u]
	; AVX2-FAST-NEXT: vpblendw {{.*#+}} ymm2 = ymm2[0,1],ymm12[2],ymm2[3,4,5,6],ymm12[7],ymm2[8,9],ymm12[10],ymm2[11,12,13,14],ymm12[15]			; AVX2-FAST-NEXT: vpblendw {{.*#+}} ymm2 = ymm2[0,1],ymm12[2],ymm2[3,4,5,6],ymm12[7],ymm2[8,9],ymm12[10],ymm2[11,12,13,14],ymm12[15]
	; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm12 = ymm14[0],ymm15[1],ymm14[2,3],ymm15[4],ymm14[5,6],ymm15[7]			; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm12 = ymm14[0],ymm15[1],ymm14[2,3],ymm15[4],ymm14[5,6],ymm15[7]
	; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm1 = xmm12[0,1,12,13,u,u,4,5,u,u,u,u,12,13,14,15]			; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm1 = xmm12[0,1,12,13,u,u,4,5,u,u,u,u,12,13,14,15]
	; AVX2-FAST-NEXT: vextracti128 $1, %ymm12, %xmm3			; AVX2-FAST-NEXT: vextracti128 $1, %ymm12, %xmm3
	; AVX2-FAST-NEXT: vpshufd {{.*#+}} xmm3 = xmm3[2,1,0,3]			; AVX2-FAST-NEXT: vpshufd {{.*#+}} xmm3 = xmm3[2,1,0,3]
	; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm4 = xmm3[u,u,u,u,0,1,u,u,8,9,12,13,u,u,u,u]			; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm4 = xmm3[u,u,u,u,0,1,u,u,8,9,12,13,u,u,u,u]
	; AVX2-FAST-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm4[2],xmm1[3],xmm4[4,5],xmm1[6,7]			; AVX2-FAST-NEXT: vpblendw {{.*#+}} xmm1 = xmm1[0,1],xmm4[2],xmm1[3],xmm4[4,5],xmm1[6,7]
	; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2],ymm2[3,4,5],ymm1[6,7]			; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2],ymm2[3,4,5],ymm1[6,7]
	; AVX2-FAST-NEXT: vpblendw {{.*#+}} ymm2 = ymm1[0,1,2],ymm9[3,4,5,6,7],ymm1[8,9,10],ymm9[11,12,13,14,15]			; AVX2-FAST-NEXT: vpblendw {{.*#+}} ymm2 = ymm1[0,1,2],ymm9[3,4,5,6,7],ymm1[8,9,10],ymm9[11,12,13,14,15]
	; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2,3],ymm2[4,5,6,7]			; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2,3],ymm2[4,5,6,7]
	; AVX2-FAST-NEXT: vmovdqu %ymm1, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill			; AVX2-FAST-NEXT: vmovdqu %ymm1, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
	; AVX2-FAST-NEXT: vpshufhw {{.*#+}} xmm1 = xmm8[0,1,2,3,5,5,5,5]			; AVX2-FAST-NEXT: vpshufhw {{.*#+}} xmm1 = xmm8[0,1,2,3,5,5,5,5]
	; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[6,7,2,3,4,5,u,u,2,3,14,15,u,u,6,7]			; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm0 = xmm0[u,u,u,u,u,u,u,u,2,3,14,15,u,u,6,7]
	; AVX2-FAST-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[3],xmm0[4,5],xmm1[6],xmm0[7]			; AVX2-FAST-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1,2],xmm1[3],xmm0[4,5],xmm1[6],xmm0[7]
	; AVX2-FAST-NEXT: vinserti128 $1, %xmm0, %ymm0, %ymm0			; AVX2-FAST-NEXT: vinserti128 $1, %xmm0, %ymm0, %ymm0
	; AVX2-FAST-NEXT: vpshuflw {{.*#+}} ymm1 = ymm11[1,1,1,1,4,5,6,7,9,9,9,9,12,13,14,15]			; AVX2-FAST-NEXT: vpshuflw {{.*#+}} ymm1 = ymm11[1,1,1,1,4,5,6,7,9,9,9,9,12,13,14,15]
	; AVX2-FAST-NEXT: vpshufb {{.*#+}} ymm4 = ymm5[u,u,u,u,u,u,u,u,u,u,u,u,10,11,u,u,18,19,22,23,u,u,22,23,u,u,u,u,u,u,u,u]			; AVX2-FAST-NEXT: vpshufb {{.*#+}} ymm4 = ymm5[u,u,u,u,u,u,u,u,u,u,u,u,10,11,u,u,18,19,22,23,u,u,22,23,u,u,u,u,u,u,u,u]
	; AVX2-FAST-NEXT: vpblendw {{.*#+}} ymm1 = ymm4[0,1],ymm1[2],ymm4[3,4,5,6],ymm1[7],ymm4[8,9],ymm1[10],ymm4[11,12,13,14],ymm1[15]			; AVX2-FAST-NEXT: vpblendw {{.*#+}} ymm1 = ymm4[0,1],ymm1[2],ymm4[3,4,5,6],ymm1[7],ymm4[8,9],ymm1[10],ymm4[11,12,13,14],ymm1[15]
	; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm4 = xmm12[2,3,14,15,u,u,6,7,u,u,u,u,12,13,14,15]			; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm4 = xmm12[2,3,14,15,u,u,6,7,u,u,u,u,12,13,14,15]
	; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm3 = xmm3[u,u,u,u,2,3,u,u,10,11,14,15,u,u,u,u]			; AVX2-FAST-NEXT: vpshufb {{.*#+}} xmm3 = xmm3[u,u,u,u,2,3,u,u,10,11,14,15,u,u,u,u]
	; AVX2-FAST-NEXT: vpblendw {{.*#+}} xmm3 = xmm4[0,1],xmm3[2],xmm4[3],xmm3[4,5],xmm4[6,7]			; AVX2-FAST-NEXT: vpblendw {{.*#+}} xmm3 = xmm4[0,1],xmm3[2],xmm4[3],xmm3[4,5],xmm4[6,7]
	▲ Show 20 Lines • Show All 101 Lines • Show Last 20 Lines

llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll

	Show First 20 Lines • Show All 489 Lines • ▼ Show 20 Lines
	; X86-AVX512-LABEL: PR48908:			; X86-AVX512-LABEL: PR48908:
	; X86-AVX512: # %bb.0:			; X86-AVX512: # %bb.0:
	; X86-AVX512-NEXT: # kill: def $ymm2 killed $ymm2 def $zmm2			; X86-AVX512-NEXT: # kill: def $ymm2 killed $ymm2 def $zmm2
	; X86-AVX512-NEXT: # kill: def $ymm1 killed $ymm1 def $zmm1			; X86-AVX512-NEXT: # kill: def $ymm1 killed $ymm1 def $zmm1
	; X86-AVX512-NEXT: # kill: def $ymm0 killed $ymm0 def $zmm0			; X86-AVX512-NEXT: # kill: def $ymm0 killed $ymm0 def $zmm0
	; X86-AVX512-NEXT: movl {{[0-9]+}}(%esp), %eax			; X86-AVX512-NEXT: movl {{[0-9]+}}(%esp), %eax
	; X86-AVX512-NEXT: movl {{[0-9]+}}(%esp), %ecx			; X86-AVX512-NEXT: movl {{[0-9]+}}(%esp), %ecx
	; X86-AVX512-NEXT: movl {{[0-9]+}}(%esp), %edx			; X86-AVX512-NEXT: movl {{[0-9]+}}(%esp), %edx
	; X86-AVX512-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm3			; X86-AVX512-NEXT: vperm2f128 {{.*#+}} ymm3 = ymm1[2,3],ymm2[0,1]
	; X86-AVX512-NEXT: vshufpd {{.*#+}} ymm3 = ymm0[0],ymm3[1],ymm0[2],ymm3[2]			; X86-AVX512-NEXT: vshufpd {{.*#+}} ymm3 = ymm1[1],ymm3[0],ymm1[2],ymm3[3]
	; X86-AVX512-NEXT: vperm2f128 {{.*#+}} ymm4 = ymm1[2,3],ymm2[0,1]			; X86-AVX512-NEXT: vmovapd {{.*#+}} ymm4 = [0,0,2,0,8,0,1,0]
	; X86-AVX512-NEXT: vshufpd {{.*#+}} ymm4 = ymm1[1],ymm4[0],ymm1[2],ymm4[3]			; X86-AVX512-NEXT: vpermi2pd %zmm2, %zmm0, %zmm4
	; X86-AVX512-NEXT: vmovapd {{.*#+}} ymm5 = [0,0,3,0,8,0,1,0]			; X86-AVX512-NEXT: vmovapd %ymm4, (%edx)
	; X86-AVX512-NEXT: vpermt2pd %zmm2, %zmm5, %zmm3			; X86-AVX512-NEXT: vmovapd {{.*#+}} ymm4 = [0,0,3,0,10,0,1,0]
	; X86-AVX512-NEXT: vmovapd %ymm3, (%edx)			; X86-AVX512-NEXT: vpermt2pd %zmm0, %zmm4, %zmm3
	; X86-AVX512-NEXT: vmovapd {{.*#+}} ymm3 = [0,0,3,0,10,0,1,0]			; X86-AVX512-NEXT: vmovapd %ymm3, (%ecx)
	; X86-AVX512-NEXT: vpermt2pd %zmm0, %zmm3, %zmm4
	; X86-AVX512-NEXT: vmovapd %ymm4, (%ecx)
	; X86-AVX512-NEXT: vmovapd {{.*#+}} ymm3 = <3,0,11,0,u,u,u,u>			; X86-AVX512-NEXT: vmovapd {{.*#+}} ymm3 = <3,0,11,0,u,u,u,u>
	; X86-AVX512-NEXT: vpermi2pd %zmm1, %zmm0, %zmm3			; X86-AVX512-NEXT: vpermi2pd %zmm1, %zmm0, %zmm3
	; X86-AVX512-NEXT: vmovapd {{.*#+}} ymm0 = [2,0,8,0,9,0,3,0]			; X86-AVX512-NEXT: vmovapd {{.*#+}} ymm0 = [2,0,8,0,9,0,3,0]
	; X86-AVX512-NEXT: vpermi2pd %zmm3, %zmm2, %zmm0			; X86-AVX512-NEXT: vpermi2pd %zmm3, %zmm2, %zmm0
	; X86-AVX512-NEXT: vmovapd %ymm0, (%eax)			; X86-AVX512-NEXT: vmovapd %ymm0, (%eax)
	; X86-AVX512-NEXT: vzeroupper			; X86-AVX512-NEXT: vzeroupper
	; X86-AVX512-NEXT: retl			; X86-AVX512-NEXT: retl
	;			;
	▲ Show 20 Lines • Show All 41 Lines • ▼ Show 20 Lines
	; X64-AVX2-NEXT: vzeroupper			; X64-AVX2-NEXT: vzeroupper
	; X64-AVX2-NEXT: retq			; X64-AVX2-NEXT: retq
	;			;
	; X64-AVX512-LABEL: PR48908:			; X64-AVX512-LABEL: PR48908:
	; X64-AVX512: # %bb.0:			; X64-AVX512: # %bb.0:
	; X64-AVX512-NEXT: # kill: def $ymm2 killed $ymm2 def $zmm2			; X64-AVX512-NEXT: # kill: def $ymm2 killed $ymm2 def $zmm2
	; X64-AVX512-NEXT: # kill: def $ymm1 killed $ymm1 def $zmm1			; X64-AVX512-NEXT: # kill: def $ymm1 killed $ymm1 def $zmm1
	; X64-AVX512-NEXT: # kill: def $ymm0 killed $ymm0 def $zmm0			; X64-AVX512-NEXT: # kill: def $ymm0 killed $ymm0 def $zmm0
	; X64-AVX512-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm3			; X64-AVX512-NEXT: vperm2f128 {{.*#+}} ymm3 = ymm1[2,3],ymm2[0,1]
	; X64-AVX512-NEXT: vshufpd {{.*#+}} ymm3 = ymm0[0],ymm3[1],ymm0[2],ymm3[2]			; X64-AVX512-NEXT: vshufpd {{.*#+}} ymm3 = ymm1[1],ymm3[0],ymm1[2],ymm3[3]
	; X64-AVX512-NEXT: vperm2f128 {{.*#+}} ymm4 = ymm1[2,3],ymm2[0,1]			; X64-AVX512-NEXT: vmovapd {{.*#+}} ymm4 = [0,2,8,1]
	; X64-AVX512-NEXT: vshufpd {{.*#+}} ymm4 = ymm1[1],ymm4[0],ymm1[2],ymm4[3]			; X64-AVX512-NEXT: vpermi2pd %zmm2, %zmm0, %zmm4
	; X64-AVX512-NEXT: vmovapd {{.*#+}} ymm5 = [0,3,8,1]			; X64-AVX512-NEXT: vmovapd %ymm4, (%rdi)
	; X64-AVX512-NEXT: vpermt2pd %zmm2, %zmm5, %zmm3			; X64-AVX512-NEXT: vmovapd {{.*#+}} ymm4 = [0,3,10,1]
	; X64-AVX512-NEXT: vmovapd %ymm3, (%rdi)			; X64-AVX512-NEXT: vpermt2pd %zmm0, %zmm4, %zmm3
	; X64-AVX512-NEXT: vmovapd {{.*#+}} ymm3 = [0,3,10,1]			; X64-AVX512-NEXT: vmovapd %ymm3, (%rsi)
	; X64-AVX512-NEXT: vpermt2pd %zmm0, %zmm3, %zmm4
	; X64-AVX512-NEXT: vmovapd %ymm4, (%rsi)
	; X64-AVX512-NEXT: vmovapd {{.*#+}} ymm3 = <3,11,u,u>			; X64-AVX512-NEXT: vmovapd {{.*#+}} ymm3 = <3,11,u,u>
	; X64-AVX512-NEXT: vpermi2pd %zmm1, %zmm0, %zmm3			; X64-AVX512-NEXT: vpermi2pd %zmm1, %zmm0, %zmm3
	; X64-AVX512-NEXT: vmovapd {{.*#+}} ymm0 = [2,8,9,3]			; X64-AVX512-NEXT: vmovapd {{.*#+}} ymm0 = [2,8,9,3]
	; X64-AVX512-NEXT: vpermi2pd %zmm3, %zmm2, %zmm0			; X64-AVX512-NEXT: vpermi2pd %zmm3, %zmm2, %zmm0
	; X64-AVX512-NEXT: vmovapd %ymm0, (%rdx)			; X64-AVX512-NEXT: vmovapd %ymm0, (%rdx)
	; X64-AVX512-NEXT: vzeroupper			; X64-AVX512-NEXT: vzeroupper
	; X64-AVX512-NEXT: retq			; X64-AVX512-NEXT: retq
	%t0 = shufflevector <4 x double> %v0, <4 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 4>			%t0 = shufflevector <4 x double> %v0, <4 x double> %v1, <4 x i32> <i32 0, i32 1, i32 2, i32 4>

llvm/test/CodeGen/X86/vselect.ll

	define <2 x i32> @simplify_select(i32 %x, <2 x i1> %z) {			define <2 x i32> @simplify_select(i32 %x, <2 x i1> %z) {
	; SSE2-LABEL: simplify_select:			; SSE2-LABEL: simplify_select:
	; SSE2: # %bb.0:			; SSE2: # %bb.0:
	; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]			; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
	; SSE2-NEXT: pslld $31, %xmm0			; SSE2-NEXT: pslld $31, %xmm0
	; SSE2-NEXT: psrad $31, %xmm0			; SSE2-NEXT: psrad $31, %xmm0
	; SSE2-NEXT: movd %edi, %xmm1			; SSE2-NEXT: movd %edi, %xmm1
	; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,0,1,1]			; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,0,1,1]
	; SSE2-NEXT: por %xmm1, %xmm2			; SSE2-NEXT: movdqa %xmm2, %xmm3
				; SSE2-NEXT: por %xmm1, %xmm3
	; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,1],xmm2[1,3]			; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,1],xmm2[1,3]
	; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[2,0],xmm2[1,1]			; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[2,0],xmm2[1,1]
	; SSE2-NEXT: pand %xmm0, %xmm2			; SSE2-NEXT: pand %xmm0, %xmm3
	; SSE2-NEXT: pandn %xmm1, %xmm0			; SSE2-NEXT: pandn %xmm1, %xmm0
	; SSE2-NEXT: por %xmm2, %xmm0			; SSE2-NEXT: por %xmm3, %xmm0
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; SSE41-LABEL: simplify_select:			; SSE41-LABEL: simplify_select:
	; SSE41: # %bb.0:			; SSE41: # %bb.0:
	; SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]			; SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
	; SSE41-NEXT: pslld $31, %xmm0			; SSE41-NEXT: pslld $31, %xmm0
	; SSE41-NEXT: movd %edi, %xmm1			; SSE41-NEXT: movd %edi, %xmm1
	; SSE41-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,0,1,1]			; SSE41-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,0,1,1]

llvm/test/CodeGen/X86/x86-interleaved-access.ll

	; AVX1-NEXT: vinsertf128 $1, %xmm9, %ymm2, %ymm2			; AVX1-NEXT: vinsertf128 $1, %xmm9, %ymm2, %ymm2
	; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
	; AVX1-NEXT: vxorps %ymm0, %ymm2, %ymm0			; AVX1-NEXT: vxorps %ymm0, %ymm2, %ymm0
	; AVX1-NEXT: vxorps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0			; AVX1-NEXT: vxorps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: interleaved_load_vf32_i8_stride4:			; AVX2-LABEL: interleaved_load_vf32_i8_stride4:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vmovdqa {{.*#+}} xmm0 = <u,u,u,u,0,4,8,12,u,u,u,u,u,u,u,u>			; AVX2-NEXT: vmovdqa {{.*#+}} xmm0 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
	; AVX2-NEXT: vmovdqa 112(%rdi), %xmm9			; AVX2-NEXT: vmovdqa (%rdi), %xmm1
	; AVX2-NEXT: vpshufb %xmm0, %xmm9, %xmm1			; AVX2-NEXT: vmovdqa 64(%rdi), %xmm2
	; AVX2-NEXT: vmovdqa 96(%rdi), %xmm10			; AVX2-NEXT: vmovdqa 80(%rdi), %xmm3
	; AVX2-NEXT: vpshufb %xmm0, %xmm10, %xmm3			; AVX2-NEXT: vpshufb %xmm0, %xmm3, %xmm4
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm3[0],xmm1[0],xmm3[1],xmm1[1]			; AVX2-NEXT: vpshufb %xmm0, %xmm2, %xmm5
	; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm1			; AVX2-NEXT: vpunpckldq {{.*#+}} xmm4 = xmm5[0],xmm4[0],xmm5[1],xmm4[1]
	; AVX2-NEXT: vmovdqa {{.*#+}} xmm2 = <0,4,8,12,u,u,u,u,u,u,u,u,u,u,u,u>
	; AVX2-NEXT: vmovdqa 80(%rdi), %xmm12
	; AVX2-NEXT: vpshufb %xmm2, %xmm12, %xmm4
	; AVX2-NEXT: vmovdqa 64(%rdi), %xmm5
	; AVX2-NEXT: vpshufb %xmm2, %xmm5, %xmm6
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm4 = xmm6[0],xmm4[0],xmm6[1],xmm4[1]
	; AVX2-NEXT: vinserti128 $1, %xmm4, %ymm0, %ymm4			; AVX2-NEXT: vinserti128 $1, %xmm4, %ymm0, %ymm4
	; AVX2-NEXT: vpblendd {{.*#+}} ymm8 = ymm4[0,1,2,3,4,5],ymm1[6,7]			; AVX2-NEXT: vpshufb %xmm0, %xmm1, %xmm0
	; AVX2-NEXT: vmovdqa (%rdi), %xmm11			; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm4[4,5,6,7]
	; AVX2-NEXT: vmovdqa 16(%rdi), %xmm13			; AVX2-NEXT: vmovdqa {{.*#+}} xmm4 = <1,5,9,13,u,u,u,u,u,u,u,u,u,u,u,u>
	; AVX2-NEXT: vmovdqa 32(%rdi), %xmm6			; AVX2-NEXT: vpshufb %xmm4, %xmm3, %xmm5
	; AVX2-NEXT: vmovdqa 48(%rdi), %xmm7			; AVX2-NEXT: vpshufb %xmm4, %xmm2, %xmm6
	; AVX2-NEXT: vpshufb %xmm0, %xmm7, %xmm1			; AVX2-NEXT: vpunpckldq {{.*#+}} xmm5 = xmm6[0],xmm5[0],xmm6[1],xmm5[1]
	; AVX2-NEXT: vpshufb %xmm0, %xmm6, %xmm0			; AVX2-NEXT: vinserti128 $1, %xmm5, %ymm0, %ymm5
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]			; AVX2-NEXT: vpshufb %xmm4, %xmm1, %xmm4
	; AVX2-NEXT: vpshufb %xmm2, %xmm13, %xmm1			; AVX2-NEXT: vpblendd {{.*#+}} ymm4 = ymm4[0,1,2,3],ymm5[4,5,6,7]
	; AVX2-NEXT: vpshufb %xmm2, %xmm11, %xmm2			; AVX2-NEXT: vpcmpeqb %ymm4, %ymm0, %ymm0
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]			; AVX2-NEXT: vmovdqa {{.*#+}} xmm4 = <2,6,10,14,u,u,u,u,u,u,u,u,u,u,u,u>
	; AVX2-NEXT: vpblendd {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3]			; AVX2-NEXT: vpshufb %xmm4, %xmm3, %xmm5
	; AVX2-NEXT: vpblendd {{.*#+}} ymm8 = ymm0[0,1,2,3],ymm8[4,5,6,7]			; AVX2-NEXT: vpshufb %xmm4, %xmm2, %xmm6
	; AVX2-NEXT: vmovdqa {{.*#+}} xmm1 = <u,u,u,u,1,5,9,13,u,u,u,u,u,u,u,u>			; AVX2-NEXT: vpunpckldq {{.*#+}} xmm5 = xmm6[0],xmm5[0],xmm6[1],xmm5[1]
	; AVX2-NEXT: vpshufb %xmm1, %xmm9, %xmm2			; AVX2-NEXT: vinserti128 $1, %xmm5, %ymm0, %ymm5
	; AVX2-NEXT: vpshufb %xmm1, %xmm10, %xmm0			; AVX2-NEXT: vpshufb %xmm4, %xmm1, %xmm4
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm2[0],xmm0[1],xmm2[1]			; AVX2-NEXT: vpblendd {{.*#+}} ymm4 = ymm4[0,1,2,3],ymm5[4,5,6,7]
	; AVX2-NEXT: vinserti128 $1, %xmm0, %ymm0, %ymm0			; AVX2-NEXT: vmovdqa {{.*#+}} xmm5 = <3,7,11,15,u,u,u,u,u,u,u,u,u,u,u,u>
	; AVX2-NEXT: vmovdqa {{.*#+}} xmm2 = <1,5,9,13,u,u,u,u,u,u,u,u,u,u,u,u>			; AVX2-NEXT: vpshufb %xmm5, %xmm3, %xmm3
	; AVX2-NEXT: vpshufb %xmm2, %xmm12, %xmm3			; AVX2-NEXT: vpshufb %xmm5, %xmm2, %xmm2
	; AVX2-NEXT: vpshufb %xmm2, %xmm5, %xmm4
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm3 = xmm4[0],xmm3[0],xmm4[1],xmm3[1]
	; AVX2-NEXT: vinserti128 $1, %xmm3, %ymm0, %ymm3
	; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm3[0,1,2,3,4,5],ymm0[6,7]
	; AVX2-NEXT: vpshufb %xmm1, %xmm7, %xmm3
	; AVX2-NEXT: vpshufb %xmm1, %xmm6, %xmm1
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1]
	; AVX2-NEXT: vpshufb %xmm2, %xmm13, %xmm3
	; AVX2-NEXT: vpshufb %xmm2, %xmm11, %xmm2
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
	; AVX2-NEXT: vpblendd {{.*#+}} xmm1 = xmm2[0,1],xmm1[2,3]
	; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7]
	; AVX2-NEXT: vpcmpeqb %ymm0, %ymm8, %ymm8
	; AVX2-NEXT: vmovdqa {{.*#+}} xmm0 = <u,u,u,u,2,6,10,14,u,u,u,u,u,u,u,u>
	; AVX2-NEXT: vpshufb %xmm0, %xmm9, %xmm1
	; AVX2-NEXT: vpshufb %xmm0, %xmm10, %xmm2
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
	; AVX2-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm1
	; AVX2-NEXT: vmovdqa {{.*#+}} xmm2 = <2,6,10,14,u,u,u,u,u,u,u,u,u,u,u,u>
	; AVX2-NEXT: vpshufb %xmm2, %xmm12, %xmm3
	; AVX2-NEXT: vpshufb %xmm2, %xmm5, %xmm4
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm3 = xmm4[0],xmm3[0],xmm4[1],xmm3[1]
	; AVX2-NEXT: vinserti128 $1, %xmm3, %ymm0, %ymm3
	; AVX2-NEXT: vpblendd {{.*#+}} ymm1 = ymm3[0,1,2,3,4,5],ymm1[6,7]
	; AVX2-NEXT: vpshufb %xmm0, %xmm7, %xmm3
	; AVX2-NEXT: vpshufb %xmm0, %xmm6, %xmm0
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm3[0],xmm0[1],xmm3[1]
	; AVX2-NEXT: vpshufb %xmm2, %xmm13, %xmm3
	; AVX2-NEXT: vpshufb %xmm2, %xmm11, %xmm2
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]			; AVX2-NEXT: vpunpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
	; AVX2-NEXT: vpblendd {{.*#+}} xmm0 = xmm2[0,1],xmm0[2,3]
	; AVX2-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm1[4,5,6,7]
	; AVX2-NEXT: vmovdqa {{.*#+}} xmm1 = <u,u,u,u,3,7,11,15,u,u,u,u,u,u,u,u>
	; AVX2-NEXT: vpshufb %xmm1, %xmm9, %xmm2
	; AVX2-NEXT: vpshufb %xmm1, %xmm10, %xmm3
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm2 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
	; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm0, %ymm2			; AVX2-NEXT: vinserti128 $1, %xmm2, %ymm0, %ymm2
	; AVX2-NEXT: vmovdqa {{.*#+}} xmm3 = <3,7,11,15,u,u,u,u,u,u,u,u,u,u,u,u>			; AVX2-NEXT: vpshufb %xmm5, %xmm1, %xmm1
	; AVX2-NEXT: vpshufb %xmm3, %xmm12, %xmm4
	; AVX2-NEXT: vpshufb %xmm3, %xmm5, %xmm5
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm4 = xmm5[0],xmm4[0],xmm5[1],xmm4[1]
	; AVX2-NEXT: vinserti128 $1, %xmm4, %ymm0, %ymm4
	; AVX2-NEXT: vpblendd {{.*#+}} ymm2 = ymm4[0,1,2,3,4,5],ymm2[6,7]
	; AVX2-NEXT: vpshufb %xmm1, %xmm7, %xmm4
	; AVX2-NEXT: vpshufb %xmm1, %xmm6, %xmm1
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1]
	; AVX2-NEXT: vpshufb %xmm3, %xmm13, %xmm4
	; AVX2-NEXT: vpshufb %xmm3, %xmm11, %xmm3
	; AVX2-NEXT: vpunpckldq {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1]
	; AVX2-NEXT: vpblendd {{.*#+}} xmm1 = xmm3[0,1],xmm1[2,3]
	; AVX2-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2,3],ymm2[4,5,6,7]			; AVX2-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2,3],ymm2[4,5,6,7]
	; AVX2-NEXT: vpcmpeqb %ymm1, %ymm0, %ymm0			; AVX2-NEXT: vpcmpeqb %ymm1, %ymm4, %ymm1
	; AVX2-NEXT: vpxor %ymm0, %ymm8, %ymm0			; AVX2-NEXT: vpxor %ymm1, %ymm0, %ymm0
	; AVX2-NEXT: vpxor {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0			; AVX2-NEXT: vpxor {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %ymm0, %ymm0
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	;			;
	; AVX512-LABEL: interleaved_load_vf32_i8_stride4:			; AVX512-LABEL: interleaved_load_vf32_i8_stride4:
	; AVX512: # %bb.0:			; AVX512: # %bb.0:
	; AVX512-NEXT: vmovdqa64 64(%rdi), %zmm0			; AVX512-NEXT: vmovdqa64 64(%rdi), %zmm0
	; AVX512-NEXT: vpmovdb %zmm0, %xmm0			; AVX512-NEXT: vpmovdb %zmm0, %xmm0
	; AVX512-NEXT: vinserti128 $1, %xmm0, %ymm0, %ymm0			; AVX512-NEXT: vinserti128 $1, %xmm0, %ymm0, %ymm0
	; AVX512-NEXT: vmovdqa64 (%rdi), %zmm1			; AVX512-NEXT: vmovdqa64 (%rdi), %zmm1
	; AVX512-NEXT: vpmovdb %zmm1, %xmm1			; AVX512-NEXT: vpmovdb %zmm1, %xmm1
	; AVX512-NEXT: vpblendd {{.*#+}} ymm9 = ymm1[0,1,2,3],ymm0[4,5,6,7]			; AVX512-NEXT: vpblendd {{.*#+}} ymm0 = ymm1[0,1,2,3],ymm0[4,5,6,7]
	; AVX512-NEXT: vmovdqa 64(%rdi), %xmm10			; AVX512-NEXT: vmovdqa (%rdi), %xmm1
	; AVX512-NEXT: vmovdqa 80(%rdi), %xmm11			; AVX512-NEXT: vmovdqa 64(%rdi), %xmm2
	; AVX512-NEXT: vmovdqa 96(%rdi), %xmm12			; AVX512-NEXT: vmovdqa 80(%rdi), %xmm3
	; AVX512-NEXT: vmovdqa 112(%rdi), %xmm14			; AVX512-NEXT: vmovdqa {{.*#+}} xmm4 = <1,5,9,13,u,u,u,u,u,u,u,u,u,u,u,u>
	; AVX512-NEXT: vmovdqa {{.*#+}} xmm1 = <u,u,u,u,1,5,9,13,u,u,u,u,u,u,u,u>			; AVX512-NEXT: vpshufb %xmm4, %xmm3, %xmm5
	; AVX512-NEXT: vpshufb %xmm1, %xmm14, %xmm0			; AVX512-NEXT: vpshufb %xmm4, %xmm2, %xmm6
	; AVX512-NEXT: vpshufb %xmm1, %xmm12, %xmm4			; AVX512-NEXT: vmovdqa {{.*#+}} ymm7 = <u,u,u,u,0,16,u,u>
	; AVX512-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm4[0],xmm0[0],xmm4[1],xmm0[1]			; AVX512-NEXT: vpermt2d %zmm5, %zmm7, %zmm6
	; AVX512-NEXT: vinserti128 $1, %xmm0, %ymm0, %ymm0			; AVX512-NEXT: vpshufb %xmm4, %xmm1, %xmm4
	; AVX512-NEXT: vmovdqa {{.*#+}} xmm2 = <1,5,9,13,u,u,u,u,u,u,u,u,u,u,u,u>			; AVX512-NEXT: vpblendd {{.*#+}} ymm4 = ymm4[0,1,2,3],ymm6[4,5,6,7]
	; AVX512-NEXT: vpshufb %xmm2, %xmm11, %xmm4			; AVX512-NEXT: vmovdqa {{.*#+}} xmm5 = <2,6,10,14,u,u,u,u,u,u,u,u,u,u,u,u>
	; AVX512-NEXT: vpshufb %xmm2, %xmm10, %xmm6			; AVX512-NEXT: vpshufb %xmm5, %xmm3, %xmm6
	; AVX512-NEXT: vpunpckldq {{.*#+}} xmm4 = xmm6[0],xmm4[0],xmm6[1],xmm4[1]			; AVX512-NEXT: vpshufb %xmm5, %xmm2, %xmm8
	; AVX512-NEXT: vinserti128 $1, %xmm4, %ymm0, %ymm4			; AVX512-NEXT: vpermt2d %zmm6, %zmm7, %zmm8
	; AVX512-NEXT: vpblendd {{.*#+}} ymm8 = ymm4[0,1,2,3,4,5],ymm0[6,7]			; AVX512-NEXT: vpshufb %xmm5, %xmm1, %xmm5
	; AVX512-NEXT: vmovdqa (%rdi), %xmm13			; AVX512-NEXT: vpblendd {{.*#+}} ymm5 = ymm5[0,1,2,3],ymm8[4,5,6,7]
	; AVX512-NEXT: vmovdqa 16(%rdi), %xmm6			; AVX512-NEXT: vmovdqa {{.*#+}} xmm6 = <3,7,11,15,u,u,u,u,u,u,u,u,u,u,u,u>
	; AVX512-NEXT: vmovdqa 32(%rdi), %xmm7			; AVX512-NEXT: vpshufb %xmm6, %xmm3, %xmm3
	; AVX512-NEXT: vmovdqa 48(%rdi), %xmm0			; AVX512-NEXT: vpshufb %xmm6, %xmm2, %xmm2
	; AVX512-NEXT: vpshufb %xmm1, %xmm0, %xmm3			; AVX512-NEXT: vpermt2d %zmm3, %zmm7, %zmm2
	; AVX512-NEXT: vpshufb %xmm1, %xmm7, %xmm1			; AVX512-NEXT: vpshufb %xmm6, %xmm1, %xmm1
	; AVX512-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm1[0],xmm3[0],xmm1[1],xmm3[1]			; AVX512-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2,3],ymm2[4,5,6,7]
	; AVX512-NEXT: vpshufb %xmm2, %xmm6, %xmm3			; AVX512-NEXT: vpcmpeqb %zmm4, %zmm0, %k0
	; AVX512-NEXT: vpshufb %xmm2, %xmm13, %xmm2			; AVX512-NEXT: vpcmpeqb %zmm1, %zmm5, %k1
	; AVX512-NEXT: vpunpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
	; AVX512-NEXT: vpblendd {{.*#+}} xmm1 = xmm2[0,1],xmm1[2,3]
	; AVX512-NEXT: vpblendd {{.*#+}} ymm8 = ymm1[0,1,2,3],ymm8[4,5,6,7]
	; AVX512-NEXT: vmovdqa {{.*#+}} xmm1 = <u,u,u,u,2,6,10,14,u,u,u,u,u,u,u,u>
	; AVX512-NEXT: vpshufb %xmm1, %xmm14, %xmm2
	; AVX512-NEXT: vpshufb %xmm1, %xmm12, %xmm3
	; AVX512-NEXT: vpunpckldq {{.*#+}} xmm2 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
	; AVX512-NEXT: vinserti128 $1, %xmm2, %ymm0, %ymm2
	; AVX512-NEXT: vmovdqa {{.*#+}} xmm3 = <2,6,10,14,u,u,u,u,u,u,u,u,u,u,u,u>
	; AVX512-NEXT: vpshufb %xmm3, %xmm11, %xmm4
	; AVX512-NEXT: vpshufb %xmm3, %xmm10, %xmm5
	; AVX512-NEXT: vpunpckldq {{.*#+}} xmm4 = xmm5[0],xmm4[0],xmm5[1],xmm4[1]
	; AVX512-NEXT: vinserti128 $1, %xmm4, %ymm0, %ymm4
	; AVX512-NEXT: vpblendd {{.*#+}} ymm2 = ymm4[0,1,2,3,4,5],ymm2[6,7]
	; AVX512-NEXT: vpshufb %xmm1, %xmm0, %xmm4
	; AVX512-NEXT: vpshufb %xmm1, %xmm7, %xmm1
	; AVX512-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm1[0],xmm4[0],xmm1[1],xmm4[1]
	; AVX512-NEXT: vpshufb %xmm3, %xmm6, %xmm4
	; AVX512-NEXT: vpshufb %xmm3, %xmm13, %xmm3
	; AVX512-NEXT: vpunpckldq {{.*#+}} xmm3 = xmm3[0],xmm4[0],xmm3[1],xmm4[1]
	; AVX512-NEXT: vpblendd {{.*#+}} xmm1 = xmm3[0,1],xmm1[2,3]
	; AVX512-NEXT: vpblendd {{.*#+}} ymm15 = ymm1[0,1,2,3],ymm2[4,5,6,7]
	; AVX512-NEXT: vmovdqa {{.*#+}} xmm2 = <u,u,u,u,3,7,11,15,u,u,u,u,u,u,u,u>
	; AVX512-NEXT: vpshufb %xmm2, %xmm14, %xmm3
	; AVX512-NEXT: vpshufb %xmm2, %xmm12, %xmm4
	; AVX512-NEXT: vpunpckldq {{.*#+}} xmm3 = xmm4[0],xmm3[0],xmm4[1],xmm3[1]
	; AVX512-NEXT: vinserti128 $1, %xmm3, %ymm0, %ymm3
	; AVX512-NEXT: vmovdqa {{.*#+}} xmm4 = <3,7,11,15,u,u,u,u,u,u,u,u,u,u,u,u>
	; AVX512-NEXT: vpshufb %xmm4, %xmm11, %xmm5
	; AVX512-NEXT: vpshufb %xmm4, %xmm10, %xmm1
	; AVX512-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm1[0],xmm5[0],xmm1[1],xmm5[1]
	; AVX512-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm1
	; AVX512-NEXT: vpblendd {{.*#+}} ymm1 = ymm1[0,1,2,3,4,5],ymm3[6,7]
	; AVX512-NEXT: vpshufb %xmm2, %xmm0, %xmm0
	; AVX512-NEXT: vpshufb %xmm2, %xmm7, %xmm2
	; AVX512-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm2[0],xmm0[0],xmm2[1],xmm0[1]
	; AVX512-NEXT: vpshufb %xmm4, %xmm6, %xmm2
	; AVX512-NEXT: vpshufb %xmm4, %xmm13, %xmm3
	; AVX512-NEXT: vpunpckldq {{.*#+}} xmm2 = xmm3[0],xmm2[0],xmm3[1],xmm2[1]
	; AVX512-NEXT: vpblendd {{.*#+}} xmm0 = xmm2[0,1],xmm0[2,3]
	; AVX512-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3],ymm1[4,5,6,7]
	; AVX512-NEXT: vpcmpeqb %zmm8, %zmm9, %k0
	; AVX512-NEXT: vpcmpeqb %zmm0, %zmm15, %k1
	; AVX512-NEXT: kxnord %k1, %k0, %k0			; AVX512-NEXT: kxnord %k1, %k0, %k0
	; AVX512-NEXT: vpmovm2b %k0, %zmm0			; AVX512-NEXT: vpmovm2b %k0, %zmm0
	; AVX512-NEXT: # kill: def $ymm0 killed $ymm0 killed $zmm0			; AVX512-NEXT: # kill: def $ymm0 killed $ymm0 killed $zmm0
	; AVX512-NEXT: retq			; AVX512-NEXT: retq
	%wide.vec = load <128 x i8>, <128 x i8>* %ptr			%wide.vec = load <128 x i8>, <128 x i8>* %ptr
	%v1 = shufflevector <128 x i8> %wide.vec, <128 x i8> undef, <32 x i32> <i32 0, i32 4, i32 8, i32 12, i32 16, i32 20, i32 24, i32 28, i32 32, i32 36, i32 40, i32 44, i32 48, i32 52, i32 56, i32 60, i32 64, i32 68, i32 72, i32 76, i32 80, i32 84, i32 88, i32 92, i32 96, i32 100, i32 104, i32 108, i32 112, i32 116, i32 120, i32 124>			%v1 = shufflevector <128 x i8> %wide.vec, <128 x i8> undef, <32 x i32> <i32 0, i32 4, i32 8, i32 12, i32 16, i32 20, i32 24, i32 28, i32 32, i32 36, i32 40, i32 44, i32 48, i32 52, i32 56, i32 60, i32 64, i32 68, i32 72, i32 76, i32 80, i32 84, i32 88, i32 92, i32 96, i32 100, i32 104, i32 108, i32 112, i32 116, i32 120, i32 124>

	%v2 = shufflevector <128 x i8> %wide.vec, <128 x i8> undef, <32 x i32> <i32 1, i32 5, i32 9, i32 13, i32 17, i32 21, i32 25, i32 29, i32 33, i32 37, i32 41, i32 45, i32 49, i32 53, i32 57, i32 61, i32 65, i32 69, i32 73, i32 77, i32 81, i32 85, i32 89, i32 93, i32 97, i32 101, i32 105, i32 109, i32 113, i32 117, i32 121, i32 125>			%v2 = shufflevector <128 x i8> %wide.vec, <128 x i8> undef, <32 x i32> <i32 1, i32 5, i32 9, i32 13, i32 17, i32 21, i32 25, i32 29, i32 33, i32 37, i32 41, i32 45, i32 49, i32 53, i32 57, i32 61, i32 65, i32 69, i32 73, i32 77, i32 81, i32 85, i32 89, i32 93, i32 97, i32 101, i32 105, i32 109, i32 113, i32 117, i32 121, i32 125>

[X86] combineX86ShufflesRecursively(): call SimplifyMultipleUseDemandedVectorElts() on after finishing recursing
Closed Public

Revision Contents

Diff 369928

llvm/lib/Target/X86/X86ISelLowering.cpp

llvm/test/CodeGen/X86/insertelement-ones.ll

llvm/test/CodeGen/X86/oddshuffles.ll

llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-5.ll

llvm/test/CodeGen/X86/vector-interleaved-load-i16-stride-6.ll

llvm/test/CodeGen/X86/vector-shuffle-combining-avx.ll

llvm/test/CodeGen/X86/vselect.ll

llvm/test/CodeGen/X86/x86-interleaved-access.ll
