This is an archive of the discontinued LLVM Phabricator instance.

llvm/lib/Target/X86/X86ISelLowering.cpp
37921	We still need to do this before the widenSubVector() code - otherwise we'll never be able to simplify any input that doesn't match RootSizeInBits, which are likely to be the most interesting cases imo.
37973	This seems to be really bulky for what its actually doing. I don't think we need to create this shuffle mask for instance, we should be able to create a demanded elts mask directly and then trunc/scale it for the input's size. I keep meaning to create a scaleDemandedMask() common helper method as we have several places that would use it (e.g. SelectionDAG.computeKnownBits bitcast handling and other parts of value tracking).

lebedev.ri added inline comments.Sep 7 2021, 3:20 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
37973	That is what what i initially came up with, and it's much uglier than this code :) I can do that again, but i'm not sure that will be be better.

Introduce ScaleDemandedEltsMask() and use it.

llvm/lib/Target/X86/X86ISelLowering.cpp
37921	I agree, but is this a correctness concern for this patch?
37973	Ok, how about this?

RKSimon added inline comments.Sep 8 2021, 3:27 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
37608	We should assert that (NumElts % NumSrcElts) == 0 \|\| (NumSrcElts % NumElts) == 0 - or return true/false on success/failure.
37921	what correctness?

lebedev.ri added inline comments.Sep 8 2021, 3:29 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
37608	Oops, i meant to do that, but forgot to in the end.
37921	I mean, if we don't do this in this patch, will that lead to miscompiles, or simply to missed optimizations?

Add forgotten assert.

RKSimon added inline comments.Sep 8 2021, 3:52 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
37604	This can probably move to the APIntOps helpers
37982	To move this before widening, we should just need to truncate OpDemandedElts based on its size vs RootSizeInBits - we should assert that no lost elts were demanded. Then we can scale it.

Harbormaster completed remote builds in B123024: Diff 371303.Sep 8 2021, 4:02 AM

Try to move before widenSubVector() - now without miscompiles?

llvm/lib/Target/X86/X86ISelLowering.cpp
37604	Let's do that afterwards?
37982	Ok, i admit i've tried to avoid doing that because i don't quite understand all of the logic here. Does this look right? It avoids the miscompiles that were visible in some previous attempt at least.

Harbormaster completed remote builds in B123037: Diff 371316.Sep 8 2021, 5:36 AM

Did i get it right this time? :)

RKSimon added inline comments.Sep 13 2021, 3:21 AM

llvm/test/CodeGen/X86/insertelement-ones.ll
389 ↗	(On Diff #371316)	Any luck on improving this?

RKSimon mentioned this in D109683: [APInt] Add APIntOps::ScaleBitMask helper.Sep 13 2021, 4:48 AM

RKSimon mentioned this in rG9db20822f795: [APInt] Add APIntOps::ScaleBitMask helper.Sep 13 2021, 8:28 AM

lebedev.ri added inline comments.Sep 13 2021, 1:44 PM

llvm/test/CodeGen/X86/insertelement-ones.ll
389 ↗	(On Diff #371316)	This one is obscure. I believe the problem is `X86ISelLowering.cpp`'s `matchBinaryShuffle()`'s `ISD::OR` lowering. We have: mask: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 30 -2 matchBinaryShuffle() EltSizeInBits: 8 V1: t4: v16i8,ch = CopyFromReg t0, Register:v16i8 %1 t3: v16i8 = Register %1 V2: t74: v16i8 = X86ISD::VSHLDQ t51, TargetConstant:i8<14> t51: v16i8 = bitcast t50 t50: v4i32 = scalar_to_vector Constant:i32<255> t49: i32 = Constant<255> t73: i8 = TargetConstant<14> We can't say anything about `t4`, but i think it's obvious that `t74` is actually an all-zeros except the 14'th element, which is all-ones. So we of course can lower that as an `or` blend, and we do not care what `t4` is. But the code fails to do that. I think we'd basically have to do `computeKnownBits()` for each element of V1/V2 separately. Should i keep looking?

lebedev.ri added a parent revision: D109726: [X86] Improve `matchBinaryShuffle()`'s `BLEND` lowering with per-element all-zero/all-ones knowledge.Sep 13 2021, 3:10 PM

lebedev.ri added inline comments.

llvm/test/CodeGen/X86/insertelement-ones.ll
389 ↗	(On Diff #371316)	Ok, got it: D109726

Rebased ontop main+D109726.
The noted regression is gone (but many more took it's place.)

Harbormaster completed remote builds in B123754: Diff 372352.Sep 13 2021, 4:24 PM

Rebased, NFC.

lebedev.ri added inline comments.Sep 17 2021, 9:30 AM

llvm/test/CodeGen/X86/insertelement-ones.ll

311 ↗

(On Diff #373251)

Here we have:

Optimized legalized selection DAG: %bb.0 'insert_v16i8_x123456789ABCDEx:'
SelectionDAG has 20 nodes:
  t0: ch = EntryToken
          t2: v16i8,ch = CopyFromReg t0, Register:v16i8 %0
        t19: v16i8 = and t2, t36
        t20: v16i8 = X86ISD::ANDNP t36, t27
      t21: v16i8 = or t19, t20
      t33: v16i8 = X86ISD::VSHLDQ t27, TargetConstant:i8<15>
    t45: v16i8 = or t21, t33
  t12: ch,glue = CopyToReg t0, Register:v16i8 $xmm0, t45
    t26: v4i32 = scalar_to_vector Constant:i32<255>
  t27: v16i8 = bitcast t26
    t38: i64 = X86ISD::Wrapper TargetConstantPool:i64<<16 x i8> <i8 0, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1, i8 -1>> 0
  t36: v16i8,ch = load<(load (s128) from constant-pool)> t0, t38, undef:i64
  t13: ch = X86ISD::RET_FLAG t12, TargetConstant:i32<0>, Register:v16i8 $xmm0, t12:1

... so matchBinaryShuffle() again fails to omit the masking,
even though it's obviously redundant here for the reasons seen in D109726.
I would suspect that is because around scalar_to_vector we operate on i32 elt type,
so we don't have all-ones elements until after bitcast.
Without changing computeKnownBits to operate on a specified element width,
i'm not sure it can help us further, and that does not sound like the right fix.

RKSimon added inline comments.Sep 17 2021, 9:30 AM

llvm/test/CodeGen/X86/insertelement-ones.ll
315 ↗	(On Diff #373251)	We're going to have to improve INSERT_VECTOR_ELT handling of 0/-1 elements - just AND/OR if we don't have a legal PINSRB instruction (pre-SSE41).
llvm/test/CodeGen/X86/oddshuffles.ll
2268	Looks like we're missing a fold to share scalar_to_vector(x) and scalar_to_vector(trunc(x)) (maybe worth supporting scalar_to_vector(ext(x)) as well)?

Harbormaster completed remote builds in B124420: Diff 373251.Sep 17 2021, 10:08 AM

RKSimon added inline comments.Sep 17 2021, 10:53 AM

llvm/test/CodeGen/X86/insertelement-ones.ll
315 ↗	(On Diff #373251)	It looks like we might be able to do this more easily by extending lowerShuffleAsBitMask to handle the allones elements case as well as the zero elements case.

lebedev.ri added inline comments.Sep 17 2021, 11:05 AM

llvm/test/CodeGen/X86/insertelement-ones.ll
315 ↗	(On Diff #373251)	Note that `X86TargetLowering::LowerINSERT_VECTOR_ELT` isn't even called for this test, since we expand, not legalize, in this case. Marking it as legalize causes crashes "don't know how to legalize", i guess it doesn't retry to legalize via the generic expansion.

lebedev.ri mentioned this in D109989: [X86] Improve i8 all-ones element insertion in pre-SSE4.1.Sep 17 2021, 11:33 AM

Rebased ontop of D109989 - llvm/test/CodeGen/X86/insertelement-ones.ll is all good now.

lebedev.ri added a parent revision: D109989: [X86] Improve i8 all-ones element insertion in pre-SSE4.1.Sep 17 2021, 11:39 AM

lebedev.ri added inline comments.

llvm/test/CodeGen/X86/insertelement-ones.ll
315 ↗	(On Diff #373251)	Done, D109989.

lebedev.ri added inline comments.Sep 17 2021, 11:52 AM

llvm/test/CodeGen/X86/oddshuffles.ll

2268

This stuff is broken.
We've in AVX1-more, and only have broadcast-from-mem,
yet we've successfully obscured the load via the truncation.

I suppose we could look past ext/trunc of scalar_to_vector operand,
and change it to bitcast/ext of scalar_to_vector itself,
let me see.

Optimized legalized selection DAG: %bb.0 'splat_v3i32:'
SelectionDAG has 28 nodes:
  t0: ch = EntryToken
    t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t27: i64,ch = load<(load (s64) from %ir.ptr, align 1)> t0, t2, undef:i64
      t24: v8i32 = BUILD_VECTOR Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
          t30: v2i64 = scalar_to_vector t27
        t107: v4i64 = insert_subvector undef:v4i64, t30, Constant:i64<0>
      t108: v8i32 = bitcast t107
    t101: v8i32 = X86ISD::BLENDI t24, t108, TargetConstant:i8<2>
  t19: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t101
      t25: v8i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
          t91: i32 = truncate t27
        t92: v4i32 = X86ISD::VBROADCAST t91
      t94: v8i32 = insert_subvector undef:v8i32, t92, Constant:i64<0>
    t97: v8i32 = X86ISD::BLENDI t25, t94, TargetConstant:i8<4>
  t21: ch,glue = CopyToReg t19, Register:v8i32 $ymm1, t97, t19:1
  t22: ch = X86ISD::RET_FLAG t21, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, t21:1

Harbormaster completed remote builds in B124449: Diff 373292.Sep 17 2021, 12:08 PM

lebedev.ri added inline comments.Sep 17 2021, 12:40 PM

llvm/test/CodeGen/X86/oddshuffles.ll

2268

While something like this should work:

diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
index 5a49f33e46fe..4d7c2c2a8651 100644
--- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
+++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
@@ -21824,6 +21824,13 @@ SDValue DAGCombiner::visitSCALAR_TO_VECTOR(SDNode *N) {
     }
   }
 
+  // Fold SCALAR_TO_VECTOR(TRUNCATE(V)) to SCALAR_TO_VECTOR(V),
+  // by making trucation of the operand implicit.
+  if (InVal.getOpcode() == ISD::TRUNCATE && VT.isFixedLengthVector() &&
+      Level < AfterLegalizeDAG)
+    return DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(N), VT,
+                       InVal->getOperand(0));
+
   return SDValue();
 }
 
diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp
index 09ba7af6e38a..695cc8303cc1 100644
--- a/llvm/lib/Target/X86/X86ISelLowering.cpp
+++ b/llvm/lib/Target/X86/X86ISelLowering.cpp
@@ -14076,10 +14076,12 @@ static SDValue lowerShuffleAsBroadcast(const SDLoc &DL, MVT VT, SDValue V1,
 
   // If this is a scalar, do the broadcast on this type and bitcast.
   if (!V.getValueType().isVector()) {
-    assert(V.getScalarValueSizeInBits() == NumEltBits &&
-           "Unexpected scalar size");
-    MVT BroadcastVT = MVT::getVectorVT(V.getSimpleValueType(),
-                                       VT.getVectorNumElements());
+    if(V.getValueType().isInteger() &&
+       V.getScalarValueSizeInBits() > NumEltBits)
+      V = DAG.getNode(ISD::TRUNCATE, DL, VT.getScalarType(), V);
+    assert(V.getScalarValueSizeInBits() == NumEltBits && "Unexpected scalar size");
+    MVT BroadcastVT =
+        MVT::getVectorVT(V.getSimpleValueType(), VT.getVectorNumElements());
     return DAG.getBitcast(VT, DAG.getNode(Opcode, DL, BroadcastVT, V));
   }

it doesn't catch anything with the cut-off,
and without it. it exposes numerous places that don't expect this truncation to be implicit.

lebedev.ri added inline comments.Sep 18 2021, 3:22 AM

llvm/test/CodeGen/X86/oddshuffles.ll
2268	I've looked again, and i'm not sure i have enough motivation to tackle all the fallout from the `scalar_to_vector(trunc(x)) --> scalar_to_vector(x)` fold, unless i misunderstood the suggestion.

lebedev.ri mentioned this in rG6a2c2263fbca: [X86] Improve i8 all-ones element insertion in pre-SSE4.1.Sep 18 2021, 12:24 PM

Rebased, NFC.

Harbormaster completed remote builds in B124551: Diff 373433.Sep 18 2021, 1:08 PM

LGTM

llvm/lib/Target/X86/X86ISelLowering.cpp
37916	Lo + Hi
llvm/test/CodeGen/X86/oddshuffles.ll
2268	That's OK - I'll take a look when I get the chance.

This revision is now accepted and ready to land.Sep 19 2021, 5:00 AM

In D109065#3008188, @RKSimon wrote:

LGTM

:/
Thank you for the review!

llvm/test/CodeGen/X86/oddshuffles.ll
2268	Sorry :/

This revision was landed with ongoing or failed builds.Sep 19 2021, 7:25 AM

Closed by commit rG1e72ca94e579: [X86] combineX86ShufflesRecursively(): call… (authored by lebedev.ri). · Explain Why

This revision was automatically updated to reflect the committed changes.

lebedev.ri marked an inline comment as done.

lebedev.ri added a commit: rG1e72ca94e579: [X86] combineX86ShufflesRecursively(): call….

RKSimon mentioned this in rG0e89ff8195e9: [X86] SimplifyDemandedBits - only narrow a broadcast source if we only have one….Sep 19 2021, 2:54 PM

Revision Contents

Path

Size

llvm/

lib/

Target/

X86/

X86ISelLowering.cpp

42 lines

test/

CodeGen/

X86/

oddshuffles.ll

26 lines

vselect.ll

7 lines

Diff 373292

llvm/lib/Target/X86/X86ISelLowering.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 37,595 Lines • ▼ Show 20 Lines	namespace llvm {
namespace X86 {		namespace X86 {
enum {		enum {
MaxShuffleCombineDepth = 8		MaxShuffleCombineDepth = 8
};		};
}		}
} // namespace llvm		} // namespace llvm

/// Fully generic combining of x86 shuffle instructions.		/// Fully generic combining of x86 shuffle instructions.
///		///
		RKSimonUnsubmitted Not Done Reply Inline Actions This can probably move to the APIntOps helpers RKSimon: This can probably move to the APIntOps helpers
		lebedev.riAuthorUnsubmitted Not Done Reply Inline Actions Let's do that afterwards? lebedev.ri: Let's do that afterwards?
/// This should be the last combine run over the x86 shuffle instructions. Once		/// This should be the last combine run over the x86 shuffle instructions. Once
/// they have been fully optimized, this will recursively consider all chains		/// they have been fully optimized, this will recursively consider all chains
/// of single-use shuffle instructions, build a generic model of the cumulative		/// of single-use shuffle instructions, build a generic model of the cumulative
/// shuffle operation, and check for simpler instructions which implement this		/// shuffle operation, and check for simpler instructions which implement this
		RKSimonUnsubmitted Done Reply Inline Actions We should assert that (NumElts % NumSrcElts) == 0 \|\| (NumSrcElts % NumElts) == 0 - or return true/false on success/failure. RKSimon: We should assert that (NumElts % NumSrcElts) == 0 \|\| (NumSrcElts % NumElts) == 0 - or return…
		lebedev.riAuthorUnsubmitted Done Reply Inline Actions Oops, i meant to do that, but forgot to in the end. lebedev.ri: Oops, i meant to do that, but forgot to in the end.
/// operation. We use this primarily for two purposes:		/// operation. We use this primarily for two purposes:
///		///
/// 1) Collapse generic shuffles to specialized single instructions when		/// 1) Collapse generic shuffles to specialized single instructions when
/// equivalent. In most cases, this is just an encoding size win, but		/// equivalent. In most cases, this is just an encoding size win, but
/// sometimes we will collapse multiple generic shuffles into a single		/// sometimes we will collapse multiple generic shuffles into a single
/// special-purpose shuffle.		/// special-purpose shuffle.
/// 2) Look for sequences of shuffle instructions with 3 or more total		/// 2) Look for sequences of shuffle instructions with 3 or more total
/// instructions, and replace them with the slightly more expensive SSSE3		/// instructions, and replace them with the slightly more expensive SSSE3
▲ Show 20 Lines • Show All 284 Lines • ▼ Show 20 Lines	static SDValue combineX86ShufflesRecursively(
}		}

// Canonicalize the combined shuffle mask chain with horizontal ops.		// Canonicalize the combined shuffle mask chain with horizontal ops.
// NOTE: This will update the Ops and Mask.		// NOTE: This will update the Ops and Mask.
if (SDValue HOp = canonicalizeShuffleMaskWithHorizOp(		if (SDValue HOp = canonicalizeShuffleMaskWithHorizOp(
Ops, Mask, RootSizeInBits, SDLoc(Root), DAG, Subtarget))		Ops, Mask, RootSizeInBits, SDLoc(Root), DAG, Subtarget))
return DAG.getBitcast(Root.getValueType(), HOp);		return DAG.getBitcast(Root.getValueType(), HOp);

		// Try to refine our inputs given our knowledge of target shuffle mask.
		for (auto I : enumerate(Ops)) {
		int OpIdx = I.index();
		SDValue &Op = I.value();

		// What range of shuffle mask element values results in picking from Op?
		int lo = OpIdx * Mask.size();
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'lo' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'lo' [readability-identifier-naming]…
		int hi = lo + Mask.size();
		Lint: Pre-merge checks Inline Actions clang-tidy: warning: invalid case style for variable 'hi' [readability-identifier-naming] not useful Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'hi' [readability-identifier-naming]…
		RKSimonUnsubmitted Not Done Reply Inline Actions Lo + Hi RKSimon: Lo + Hi

		// Which elements of Op do we demand, given the mask's granularity?
		APInt OpDemandedElts(Mask.size(), 0);
		for (int MaskElt : Mask) {
		if (isInRange(MaskElt, lo, hi)) { // Picks from Op?
		RKSimonUnsubmitted Done Reply Inline Actions Op might be a different width to the Root - see the "Widen any subvector shuffle inputs we've collected." code below. RKSimon: Op might be a different width to the Root - see the "Widen any subvector shuffle inputs we've…
		lebedev.riAuthorUnsubmitted Done Reply Inline Actions I keep hitting the same pitfail. lebedev.ri: I keep hitting the same pitfail.
		RKSimonUnsubmitted Not Done Reply Inline Actions We still need to do this before the widenSubVector() code - otherwise we'll never be able to simplify any input that doesn't match RootSizeInBits, which are likely to be the most interesting cases imo. RKSimon: We still need to do this before the widenSubVector() code - otherwise we'll never be able to…
		lebedev.riAuthorUnsubmitted Not Done Reply Inline Actions I agree, but is this a correctness concern for this patch? lebedev.ri: I agree, but is this a correctness concern for this patch?
		RKSimonUnsubmitted Not Done Reply Inline Actions what correctness? RKSimon: what correctness?
		lebedev.riAuthorUnsubmitted Not Done Reply Inline Actions I mean, if we don't do this in this patch, will that lead to miscompiles, or simply to missed optimizations? lebedev.ri: I mean, if we don't do this in this patch, will that lead to miscompiles, or simply to missed…
		int OpEltIdx = MaskElt - lo;
		OpDemandedElts.setBit(OpEltIdx);
		}
		}

		// Is the shuffle result smaller than the root?
		if (Op.getValueSizeInBits() < RootSizeInBits) {
		// We padded the mask with undefs. But we now need to undo that.
		unsigned NumExpectedVectorElts = Mask.size();
		unsigned EltSizeInBits = RootSizeInBits / NumExpectedVectorElts;
		unsigned NumOpVectorElts = Op.getValueSizeInBits() / EltSizeInBits;
		assert(!OpDemandedElts.extractBits(
		NumExpectedVectorElts - NumOpVectorElts, NumOpVectorElts) &&
		"Demanding the virtual undef widening padding?");
		OpDemandedElts = OpDemandedElts.trunc(NumOpVectorElts); // NUW
		}

		// The Op itself may be of different VT, so we need to scale the mask.
		unsigned NumOpElts = Op.getValueType().getVectorNumElements();
		APInt OpScaledDemandedElts = APIntOps::ScaleBitMask(OpDemandedElts, NumOpElts);
		Lint: Pre-merge checks Inline Actions clang-format: please reformat the code - APInt OpScaledDemandedElts = APIntOps::ScaleBitMask(OpDemandedElts, NumOpElts); + APInt OpScaledDemandedElts = + APIntOps::ScaleBitMask(OpDemandedElts, NumOpElts); Lint: Pre-merge checks: clang-format: please reformat the code ``` - APInt OpScaledDemandedElts = APIntOps…

		// Can this operand be simplified any further, given it's demanded elements?
		if (SDValue NewOp =
		DAG.getTargetLoweringInfo().SimplifyMultipleUseDemandedVectorElts(
		Op, OpScaledDemandedElts, DAG))
		Op = NewOp;
		}
		// FIXME: should we rerun resolveTargetShuffleInputsAndMask() now?

// Widen any subvector shuffle inputs we've collected.		// Widen any subvector shuffle inputs we've collected.
if (any_of(Ops, [RootSizeInBits](SDValue Op) {		if (any_of(Ops, [RootSizeInBits](SDValue Op) {
return Op.getValueSizeInBits() < RootSizeInBits;		return Op.getValueSizeInBits() < RootSizeInBits;
})) {		})) {
for (SDValue &Op : Ops)		for (SDValue &Op : Ops)
if (Op.getValueSizeInBits() < RootSizeInBits)		if (Op.getValueSizeInBits() < RootSizeInBits)
Op = widenSubVector(Op, false, Subtarget, DAG, SDLoc(Op),		Op = widenSubVector(Op, false, Subtarget, DAG, SDLoc(Op),
RootSizeInBits);		RootSizeInBits);
// Reresolve - we might have repeated subvector sources.		// Reresolve - we might have repeated subvector sources.
resolveTargetShuffleInputsAndMask(Ops, Mask);		resolveTargetShuffleInputsAndMask(Ops, Mask);
}		}

// We can only combine unary and binary shuffle mask cases.		// We can only combine unary and binary shuffle mask cases.
if (Ops.size() <= 2) {		if (Ops.size() <= 2) {
// Minor canonicalization of the accumulated shuffle mask to make it easier		// Minor canonicalization of the accumulated shuffle mask to make it easier
// to match below. All this does is detect masks with sequential pairs of		// to match below. All this does is detect masks with sequential pairs of
// elements, and shrink them to the half-width mask. It does this in a loop		// elements, and shrink them to the half-width mask. It does this in a loop
// so it will reduce the size of the mask to the minimal width mask which		// so it will reduce the size of the mask to the minimal width mask which
// performs an equivalent shuffle.		// performs an equivalent shuffle.
while (Mask.size() > 1) {		while (Mask.size() > 1) {
SmallVector<int, 64> WidenedMask;		SmallVector<int, 64> WidenedMask;
if (!canWidenShuffleElements(Mask, WidenedMask))		if (!canWidenShuffleElements(Mask, WidenedMask))
break;		break;
		RKSimonUnsubmitted Not Done Reply Inline Actions This seems to be really bulky for what its actually doing. I don't think we need to create this shuffle mask for instance, we should be able to create a demanded elts mask directly and then trunc/scale it for the input's size. I keep meaning to create a scaleDemandedMask() common helper method as we have several places that would use it (e.g. SelectionDAG.computeKnownBits bitcast handling and other parts of value tracking). RKSimon: This seems to be really bulky for what its actually doing. I don't think we need to create this…
		lebedev.riAuthorUnsubmitted Not Done Reply Inline Actions That is what what i initially came up with, and it's much uglier than this code :) I can do that again, but i'm not sure that will be be better. lebedev.ri: That is what what i initially came up with, and it's much uglier than this code :) I can do…
		lebedev.riAuthorUnsubmitted Not Done Reply Inline Actions Ok, how about this? lebedev.ri: Ok, how about this?
Mask = std::move(WidenedMask);		Mask = std::move(WidenedMask);
}		}

// Canonicalization of binary shuffle masks to improve pattern matching by		// Canonicalization of binary shuffle masks to improve pattern matching by
// commuting the inputs.		// commuting the inputs.
if (Ops.size() == 2 && canonicalizeShuffleMaskWithCommute(Mask)) {		if (Ops.size() == 2 && canonicalizeShuffleMaskWithCommute(Mask)) {
ShuffleVectorSDNode::commuteMask(Mask);		ShuffleVectorSDNode::commuteMask(Mask);
std::swap(Ops[0], Ops[1]);		std::swap(Ops[0], Ops[1]);
}		}
		RKSimonUnsubmitted Not Done Reply Inline Actions To move this before widening, we should just need to truncate OpDemandedElts based on its size vs RootSizeInBits - we should assert that no lost elts were demanded. Then we can scale it. RKSimon: To move this before widening, we should just need to truncate OpDemandedElts based on its size…
		lebedev.riAuthorUnsubmitted Not Done Reply Inline Actions Ok, i admit i've tried to avoid doing that because i don't quite understand all of the logic here. Does this look right? It avoids the miscompiles that were visible in some previous attempt at least. lebedev.ri: Ok, i admit i've tried to avoid doing that because i don't quite understand all of the logic…

// Finally, try to combine into a single shuffle instruction.		// Finally, try to combine into a single shuffle instruction.
return combineX86ShuffleChain(Ops, Root, Mask, Depth, HasVariableMask,		return combineX86ShuffleChain(Ops, Root, Mask, Depth, HasVariableMask,
AllowVariableCrossLaneMask,		AllowVariableCrossLaneMask,
AllowVariablePerLaneMask, DAG, Subtarget);		AllowVariablePerLaneMask, DAG, Subtarget);
}		}

// If that failed and any input is extracted then try to combine as a		// If that failed and any input is extracted then try to combine as a
▲ Show 20 Lines • Show All 15,704 Lines • Show Last 20 Lines

llvm/test/CodeGen/X86/oddshuffles.ll

	Show First 20 Lines • Show All 2,255 Lines • ▼ Show 20 Lines
	; SSE42-NEXT: pblendw {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3],xmm1[4,5,6,7]			; SSE42-NEXT: pblendw {{.*#+}} xmm0 = xmm1[0,1],xmm0[2,3],xmm1[4,5,6,7]
	; SSE42-NEXT: pshufd {{.*#+}} xmm2 = xmm2[1,1,0,1]			; SSE42-NEXT: pshufd {{.*#+}} xmm2 = xmm2[1,1,0,1]
	; SSE42-NEXT: pxor %xmm1, %xmm1			; SSE42-NEXT: pxor %xmm1, %xmm1
	; SSE42-NEXT: xorps %xmm3, %xmm3			; SSE42-NEXT: xorps %xmm3, %xmm3
	; SSE42-NEXT: retq			; SSE42-NEXT: retq
	;			;
	; AVX1-LABEL: splat_v3i32:			; AVX1-LABEL: splat_v3i32:
	; AVX1: # %bb.0:			; AVX1: # %bb.0:
	; AVX1-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero			; AVX1-NEXT: movq (%rdi), %rax
	; AVX1-NEXT: vpinsrd $2, 8(%rdi), %xmm0, %xmm1			; AVX1-NEXT: vmovq %rax, %xmm0
	; AVX1-NEXT: vxorps %xmm2, %xmm2, %xmm2			; AVX1-NEXT: vxorps %xmm1, %xmm1, %xmm1
	; AVX1-NEXT: vblendps {{.*#+}} ymm0 = ymm2[0],ymm0[1],ymm2[2,3,4,5,6,7]			; AVX1-NEXT: vblendps {{.*#+}} ymm0 = ymm1[0],ymm0[1],ymm1[2,3,4,5,6,7]
	; AVX1-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,1]			; AVX1-NEXT: vmovd %eax, %xmm2
				RKSimonUnsubmitted Done Reply Inline Actions Looks like we're missing a fold to share scalar_to_vector(x) and scalar_to_vector(trunc(x)) (maybe worth supporting scalar_to_vector(ext(x)) as well)? RKSimon: Looks like we're missing a fold to share scalar_to_vector(x) and scalar_to_vector(trunc(x))…
				lebedev.riAuthorUnsubmitted Done Reply Inline Actions This stuff is broken. We've in AVX1-more, and only have broadcast-from-mem, yet we've successfully obscured the load via the truncation. I suppose we could look past ext/trunc of scalar_to_vector operand, and change it to bitcast/ext of scalar_to_vector itself, let me see. Optimized legalized selection DAG: %bb.0 'splat_v3i32:' SelectionDAG has 28 nodes: t0: ch = EntryToken t2: i64,ch = CopyFromReg t0, Register:i64 %0 t27: i64,ch = load<(load (s64) from %ir.ptr, align 1)> t0, t2, undef:i64 t24: v8i32 = BUILD_VECTOR Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0> t30: v2i64 = scalar_to_vector t27 t107: v4i64 = insert_subvector undef:v4i64, t30, Constant:i64<0> t108: v8i32 = bitcast t107 t101: v8i32 = X86ISD::BLENDI t24, t108, TargetConstant:i8<2> t19: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t101 t25: v8i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0> t91: i32 = truncate t27 t92: v4i32 = X86ISD::VBROADCAST t91 t94: v8i32 = insert_subvector undef:v8i32, t92, Constant:i64<0> t97: v8i32 = X86ISD::BLENDI t25, t94, TargetConstant:i8<4> t21: ch,glue = CopyToReg t19, Register:v8i32 $ymm1, t97, t19:1 t22: ch = X86ISD::RET_FLAG t21, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, t21:1 lebedev.ri: This stuff is broken. We've in AVX1-more, and only have broadcast-from-mem, yet we've…
				lebedev.riAuthorUnsubmitted Done Reply Inline Actions While something like this should work: diff --git a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp index 5a49f33e46fe..4d7c2c2a8651 100644 --- a/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp +++ b/llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp @@ -21824,6 +21824,13 @@ SDValue DAGCombiner::visitSCALAR_TO_VECTOR(SDNode N) { } } + // Fold SCALAR_TO_VECTOR(TRUNCATE(V)) to SCALAR_TO_VECTOR(V), + // by making trucation of the operand implicit. + if (InVal.getOpcode() == ISD::TRUNCATE && VT.isFixedLengthVector() && + Level < AfterLegalizeDAG) + return DAG.getNode(ISD::SCALAR_TO_VECTOR, SDLoc(N), VT, + InVal->getOperand(0)); + return SDValue(); } diff --git a/llvm/lib/Target/X86/X86ISelLowering.cpp b/llvm/lib/Target/X86/X86ISelLowering.cpp index 09ba7af6e38a..695cc8303cc1 100644 --- a/llvm/lib/Target/X86/X86ISelLowering.cpp +++ b/llvm/lib/Target/X86/X86ISelLowering.cpp @@ -14076,10 +14076,12 @@ static SDValue lowerShuffleAsBroadcast(const SDLoc &DL, MVT VT, SDValue V1, // If this is a scalar, do the broadcast on this type and bitcast. if (!V.getValueType().isVector()) { - assert(V.getScalarValueSizeInBits() == NumEltBits && - "Unexpected scalar size"); - MVT BroadcastVT = MVT::getVectorVT(V.getSimpleValueType(), - VT.getVectorNumElements()); + if(V.getValueType().isInteger() && + V.getScalarValueSizeInBits() > NumEltBits) + V = DAG.getNode(ISD::TRUNCATE, DL, VT.getScalarType(), V); + assert(V.getScalarValueSizeInBits() == NumEltBits && "Unexpected scalar size"); + MVT BroadcastVT = + MVT::getVectorVT(V.getSimpleValueType(), VT.getVectorNumElements()); return DAG.getBitcast(VT, DAG.getNode(Opcode, DL, BroadcastVT, V)); } it doesn't catch anything with the cut-off, and without it. it exposes numerous places that don't expect this truncation to be implicit. lebedev.ri:* While something like this should work: ``` diff --git…
				lebedev.riAuthorUnsubmitted Done Reply Inline Actions I've looked again, and i'm not sure i have enough motivation to tackle all the fallout from the `scalar_to_vector(trunc(x)) --> scalar_to_vector(x)` fold, unless i misunderstood the suggestion. lebedev.ri: I've looked again, and i'm not sure i have enough motivation to tackle all the fallout from the…
				RKSimonUnsubmitted Done Reply Inline Actions That's OK - I'll take a look when I get the chance. RKSimon: That's OK - I'll take a look when I get the chance.
				lebedev.riAuthorUnsubmitted Done Reply Inline Actions Sorry :/ lebedev.ri: Sorry :/
	; AVX1-NEXT: vblendps {{.*#+}} ymm1 = ymm2[0,1],ymm1[2],ymm2[3,4,5,6,7]			; AVX1-NEXT: vpshufd {{.*#+}} xmm2 = xmm2[0,0,0,0]
				; AVX1-NEXT: vblendps {{.*#+}} ymm1 = ymm1[0,1],ymm2[2],ymm1[3,4,5,6,7]
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-SLOW-LABEL: splat_v3i32:			; AVX2-SLOW-LABEL: splat_v3i32:
	; AVX2-SLOW: # %bb.0:			; AVX2-SLOW: # %bb.0:
	; AVX2-SLOW-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero			; AVX2-SLOW-NEXT: vmovsd {{.*#+}} xmm1 = mem[0],zero
	; AVX2-SLOW-NEXT: vxorps %xmm2, %xmm2, %xmm2			; AVX2-SLOW-NEXT: vxorps %xmm2, %xmm2, %xmm2
	; AVX2-SLOW-NEXT: vblendps {{.*#+}} ymm0 = ymm2[0],ymm1[1],ymm2[2,3,4,5,6,7]			; AVX2-SLOW-NEXT: vblendps {{.*#+}} ymm0 = ymm2[0],ymm1[1],ymm2[2,3,4,5,6,7]
	; AVX2-SLOW-NEXT: vbroadcastss %xmm1, %xmm1			; AVX2-SLOW-NEXT: vbroadcastss %xmm1, %xmm1
	; AVX2-SLOW-NEXT: vblendps {{.*#+}} ymm1 = ymm2[0,1],ymm1[2],ymm2[3,4,5,6,7]			; AVX2-SLOW-NEXT: vblendps {{.*#+}} ymm1 = ymm2[0,1],ymm1[2],ymm2[3,4,5,6,7]
	; AVX2-SLOW-NEXT: retq			; AVX2-SLOW-NEXT: retq
	;			;
	; AVX2-FAST-LABEL: splat_v3i32:			; AVX2-FAST-LABEL: splat_v3i32:
	; AVX2-FAST: # %bb.0:			; AVX2-FAST: # %bb.0:
	; AVX2-FAST-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero			; AVX2-FAST-NEXT: vmovq {{.*#+}} xmm1 = mem[0],zero
	; AVX2-FAST-NEXT: vpxor %xmm0, %xmm0, %xmm0			; AVX2-FAST-NEXT: vpxor %xmm0, %xmm0, %xmm0
	; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1],ymm0[2,3,4,5,6,7]			; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1],ymm0[2,3,4,5,6,7]
	; AVX2-FAST-NEXT: vpshufb {{.*#+}} ymm1 = zero,zero,zero,zero,zero,zero,zero,zero,ymm1[0,1,2,3],zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero			; AVX2-FAST-NEXT: vpshufb {{.*#+}} ymm1 = zero,zero,zero,zero,zero,zero,zero,zero,ymm1[0,1,2,3],zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero
	; AVX2-FAST-NEXT: retq			; AVX2-FAST-NEXT: retq
	;			;
	; XOP-LABEL: splat_v3i32:			; XOP-LABEL: splat_v3i32:
	; XOP: # %bb.0:			; XOP: # %bb.0:
	; XOP-NEXT: vmovq {{.*#+}} xmm0 = mem[0],zero			; XOP-NEXT: movq (%rdi), %rax
	; XOP-NEXT: vpinsrd $2, 8(%rdi), %xmm0, %xmm1			; XOP-NEXT: vmovq %rax, %xmm0
	; XOP-NEXT: vxorps %xmm2, %xmm2, %xmm2			; XOP-NEXT: vxorps %xmm1, %xmm1, %xmm1
	; XOP-NEXT: vblendps {{.*#+}} ymm0 = ymm2[0],ymm0[1],ymm2[2,3,4,5,6,7]			; XOP-NEXT: vblendps {{.*#+}} ymm0 = ymm1[0],ymm0[1],ymm1[2,3,4,5,6,7]
	; XOP-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,1]			; XOP-NEXT: vmovd %eax, %xmm2
	; XOP-NEXT: vblendps {{.*#+}} ymm1 = ymm2[0,1],ymm1[2],ymm2[3,4,5,6,7]			; XOP-NEXT: vpshufd {{.*#+}} xmm2 = xmm2[0,0,0,0]
				; XOP-NEXT: vblendps {{.*#+}} ymm1 = ymm1[0,1],ymm2[2],ymm1[3,4,5,6,7]
	; XOP-NEXT: retq			; XOP-NEXT: retq
	%1 = load <3 x i32>, <3 x i32>* %ptr, align 1			%1 = load <3 x i32>, <3 x i32>* %ptr, align 1
	%2 = shufflevector <3 x i32> %1, <3 x i32> undef, <16 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>			%2 = shufflevector <3 x i32> %1, <3 x i32> undef, <16 x i32> <i32 0, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
	%3 = shufflevector <16 x i32> <i32 0, i32 undef, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 undef, i32 0, i32 0, i32 0, i32 0, i32 0>, <16 x i32> %2, <16 x i32> <i32 0, i32 17, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 16, i32 11, i32 12, i32 13, i32 14, i32 15>			%3 = shufflevector <16 x i32> <i32 0, i32 undef, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 undef, i32 0, i32 0, i32 0, i32 0, i32 0>, <16 x i32> %2, <16 x i32> <i32 0, i32 17, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 16, i32 11, i32 12, i32 13, i32 14, i32 15>
	ret <16 x i32 > %3			ret <16 x i32 > %3
	}			}

	define <2 x double> @wrongorder(<4 x double> %A, <8 x double>* %P) #0 {			define <2 x double> @wrongorder(<4 x double> %A, <8 x double>* %P) #0 {
	▲ Show 20 Lines • Show All 260 Lines • Show Last 20 Lines

llvm/test/CodeGen/X86/vselect.ll

	Show First 20 Lines • Show All 562 Lines • ▼ Show 20 Lines
	define <2 x i32> @simplify_select(i32 %x, <2 x i1> %z) {			define <2 x i32> @simplify_select(i32 %x, <2 x i1> %z) {
	; SSE2-LABEL: simplify_select:			; SSE2-LABEL: simplify_select:
	; SSE2: # %bb.0:			; SSE2: # %bb.0:
	; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]			; SSE2-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
	; SSE2-NEXT: pslld $31, %xmm0			; SSE2-NEXT: pslld $31, %xmm0
	; SSE2-NEXT: psrad $31, %xmm0			; SSE2-NEXT: psrad $31, %xmm0
	; SSE2-NEXT: movd %edi, %xmm1			; SSE2-NEXT: movd %edi, %xmm1
	; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,0,1,1]			; SSE2-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,0,1,1]
	; SSE2-NEXT: por %xmm1, %xmm2			; SSE2-NEXT: movdqa %xmm2, %xmm3
				; SSE2-NEXT: por %xmm1, %xmm3
	; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,1],xmm2[1,3]			; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,1],xmm2[1,3]
	; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[2,0],xmm2[1,1]			; SSE2-NEXT: shufps {{.*#+}} xmm1 = xmm1[2,0],xmm2[1,1]
	; SSE2-NEXT: pand %xmm0, %xmm2			; SSE2-NEXT: pand %xmm0, %xmm3
	; SSE2-NEXT: pandn %xmm1, %xmm0			; SSE2-NEXT: pandn %xmm1, %xmm0
	; SSE2-NEXT: por %xmm2, %xmm0			; SSE2-NEXT: por %xmm3, %xmm0
	; SSE2-NEXT: retq			; SSE2-NEXT: retq
	;			;
	; SSE41-LABEL: simplify_select:			; SSE41-LABEL: simplify_select:
	; SSE41: # %bb.0:			; SSE41: # %bb.0:
	; SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]			; SSE41-NEXT: pshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
	; SSE41-NEXT: pslld $31, %xmm0			; SSE41-NEXT: pslld $31, %xmm0
	; SSE41-NEXT: movd %edi, %xmm1			; SSE41-NEXT: movd %edi, %xmm1
	; SSE41-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,0,1,1]			; SSE41-NEXT: pshufd {{.*#+}} xmm2 = xmm1[1,0,1,1]
	▲ Show 20 Lines • Show All 90 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[X86] combineX86ShufflesRecursively(): call SimplifyMultipleUseDemandedVectorElts() on after finishing recursingClosedPublic

Details

Diff Detail

Unit TestsFailed

Event Timeline

Revision Contents

Diff 373292

llvm/lib/Target/X86/X86ISelLowering.cpp

llvm/test/CodeGen/X86/oddshuffles.ll

llvm/test/CodeGen/X86/vselect.ll

[X86] combineX86ShufflesRecursively(): call SimplifyMultipleUseDemandedVectorElts() on after finishing recursing
ClosedPublic