llvm/lib/Target/X86/X86ISelLowering.cpp
12665	At least a summary comment would be useful.
12666	Instead of passing Input by reference, why not return it? Passing by reference just makes it look messy, IMO.
12672	Would it be better to assert(isBroadcastShuffleMask(InputMask))? The isNoopOrBroadcastShuffleMask checks below should ensure it, no?
12673	Why not just create an X86ISD::VBROADCAST node? This code is AVX-only, and we have isel patterns that handle the AVX1 cases where load folding fails.

@RKSimon thank you for taking a look!
Hopefully I've addressed the review notes.

Note that I'm not quite sure that this is the right approach.
This paves the way for D108253, which fixes https://bugs.llvm.org/show_bug.cgi?id=50971

llvm/lib/Target/X86/X86ISelLowering.cpp
12666	Note that we also modify `InputMask` - we turn it into an identity mask.
12672	Sure, that can work now.
12673	Oh, hmm. I have not considered that, and that changes the results somewhat...
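
To make the `InputMask` point above concrete, here is a minimal standalone sketch of what "turning a broadcast mask into an identity mask" means once the input itself has been replaced by a splat of its element 0. It uses plain `std::vector<int>` masks (negative = undef) rather than LLVM's `ArrayRef<int>`, and the helper names `isBroadcastMask`, `isNoopMask` and `rewriteAsIdentity` are hypothetical stand-ins, not the functions from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in: true if every defined (non-negative) mask element
// selects element 0 of the input, i.e. the shuffle is a broadcast.
bool isBroadcastMask(const std::vector<int> &Mask) {
  for (int M : Mask)
    if (M >= 0 && M != 0)
      return false;
  return true;
}

// Hypothetical stand-in: true if every defined mask element selects its own
// position, i.e. the shuffle does nothing.
bool isNoopMask(const std::vector<int> &Mask) {
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] >= 0 && Mask[I] != static_cast<int>(I))
      return false;
  return true;
}

// After the input has become a splat of its element 0, every lane already
// holds the wanted value, so each defined mask element can point at its own
// position. Undef elements are left untouched.
void rewriteAsIdentity(std::vector<int> &Mask) {
  assert(isBroadcastMask(Mask) && "expected a broadcast mask");
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] >= 0)
      Mask[I] = static_cast<int>(I);
}
```

This is why the lambda has two outputs: the new input value and the rewritten mask have to change together, otherwise the later no-op checks would still see a broadcast mask.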

Harbormaster completed remote builds in B120376: Diff 367551. Aug 19 2021, 12:00 PM

Rebased, NFC.

lebedev.ri mentioned this in D108411: [X86][TLI] SimplifyDemandedVectorEltsForTargetNode(): don't break apart broadcasts from which not just the 0'th elt is demanded. Aug 19 2021, 2:15 PM

lebedev.ri added a child revision: D108411: [X86][TLI] SimplifyDemandedVectorEltsForTargetNode(): don't break apart broadcasts from which not just the 0'th elt is demanded.

lebedev.ri added inline comments. Aug 19 2021, 2:18 PM

llvm/test/CodeGen/X86/copy-low-subvec-elt-to-high-subvec-elt.ll
287	This regression is being fixed by D108411.

Harbormaster completed remote builds in B120425: Diff 367614. Aug 19 2021, 3:38 PM

Rebased, NFC.

Harbormaster completed remote builds in B120775: Diff 368087. Aug 23 2021, 7:00 AM

lebedev.ri added a reviewer: pengfei. Aug 25 2021, 7:41 AM

RKSimon added inline comments. Aug 25 2021, 10:21 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
12667	Can this comment be reduced to 2 lines? It doesn't seem to fit within 80 columns.
llvm/test/CodeGen/X86/oddshuffles.ll
2284 ↗	(On Diff #368087)	Any luck with this?

lebedev.ri added inline comments. Aug 25 2021, 1:30 PM

llvm/lib/Target/X86/X86ISelLowering.cpp
12667	All my commits are clang-formatted, so this did fit within the 80-col limit. Is this better?

Aargh, phab ate my inline comment :/

lebedev.ri added inline comments. Aug 25 2021, 1:35 PM

llvm/test/CodeGen/X86/oddshuffles.ll
2284 ↗	(On Diff #368087)	I wrote a comment here, and phab just lost it :( This looks like a demanded-elements failure. On the LHS, we successfully dropped this load. When `SimplifyMultipleUseDemandedBits()` looks at this `insert_vector_elt`, the demanded-elements mask says that we demand all elements. The problem, I think, is that we need to decode the target shuffle mask to notice otherwise. Wild guess: perhaps in `SimplifyMultipleUseDemandedBitsForTargetNode()`, after `getTargetShuffleInputs()`, we could call `SimplifyMultipleUseDemandedBits()` on the inputs and recreate the shuffle if that succeeded? I'm not really sure whether there is some other, better place to do that.
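
As an illustration of why the shuffle mask has to be decoded before the demanded elements of an operand can be narrowed, here is a minimal standalone sketch. It is not the LLVM implementation: `DemandedInputs` and `demandedThroughShuffle` are hypothetical names, masks are plain `std::vector<int>` (negative = undef, `[0, N)` = first operand, `[N, 2N)` = second operand), and demanded elements are plain `std::vector<bool>` instead of `APInt`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-operand demanded-element flags (hypothetical helper type).
struct DemandedInputs {
  std::vector<bool> LHS;
  std::vector<bool> RHS;
};

// Propagate demanded output elements of a two-operand shuffle back to its
// operands. Without looking at the mask, a caller would have to assume that
// every element of both operands is demanded.
DemandedInputs demandedThroughShuffle(const std::vector<int> &Mask,
                                      const std::vector<bool> &DemandedOut) {
  assert(Mask.size() == DemandedOut.size() && "one flag per output element");
  const std::size_t N = Mask.size();
  DemandedInputs D{std::vector<bool>(N, false), std::vector<bool>(N, false)};
  for (std::size_t I = 0; I < N; ++I) {
    if (!DemandedOut[I] || Mask[I] < 0)
      continue; // Output element unused or undef: demands nothing.
    const std::size_t Src = static_cast<std::size_t>(Mask[I]);
    assert(Src < 2 * N && "mask index out of range");
    if (Src < N)
      D.LHS[Src] = true;
    else
      D.RHS[Src - N] = true;
  }
  return D;
}
```

For a blend-like mask such as `{0, 9, 2, 3, 4, 5, 6, 7}` with all outputs demanded, only element 1 of the second operand ends up demanded, so a node that only fills in some other element of that operand could in principle be bypassed for this user.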

lebedev.ri added inline comments. Aug 26 2021, 6:23 AM

llvm/test/CodeGen/X86/oddshuffles.ll
2284 ↗	(On Diff #368087)	Actually, that won't work either.

Optimized legalized selection DAG: %bb.0 'splat_v3i32:'
SelectionDAG has 32 nodes:
  t0: ch = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
      t24: v8i32 = BUILD_VECTOR Constant:i32<0>, undef:i32, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>, Constant:i32<0>
    t55: v8i32 = X86ISD::BLENDI t24, t58, TargetConstant:i8<2>
  t19: ch,glue = CopyToReg t0, Register:v8i32 $ymm0, t55
        t69: v32i8 = bitcast t58
          t76: i64 = X86ISD::Wrapper TargetConstantPool:i64<<32 x i8> <i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 0, i8 1, i8 2, i8 3, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128, i8 -128>> 0
        t74: v32i8,ch = load<(load (s256) from constant-pool)> t0, t76, undef:i64
      t71: v32i8 = X86ISD::PSHUFB t69, t74
    t72: v8i32 = bitcast t71
  t21: ch,glue = CopyToReg t19, Register:v8i32 $ymm1, t72, t19:1
          t27: i64,ch = load<(load (s64) from %ir.ptr, align 1)> t0, t2, undef:i64
        t30: v2i64 = scalar_to_vector t27
      t31: v4i32 = bitcast t30
        t28: i64 = add nuw t2, Constant:i64<8>
      t29: i32,ch = load<(load (s32) from %ir.ptr + 8, align 1)> t0, t28, undef:i64
    t61: v4i32 = insert_vector_elt t31, t29, Constant:i64<2>
  t58: v8i32 = insert_subvector undef:v8i32, t61, Constant:i64<0>
  t22: ch = X86ISD::RET_FLAG t21, TargetConstant:i32<0>, Register:v8i32 $ymm0, Register:v8i32 $ymm1, t21:1


===== Instruction selection begins: %bb.0 ''

t29/t61 is what we want to drop, but even if we could recreate the subreg widening, t58 has two uses.
So I guess our only hope is combineX86ShufflesRecursively()?

RKSimon added inline comments. Sep 1 2021, 3:13 AM

llvm/test/CodeGen/X86/oddshuffles.ll
2284 ↗	(On Diff #368087)	What might work is calling SimplifyMultipleUseDemandedVectorElts on each operand at the end of the combineX86ShufflesRecursively recursion just before the calls into combineX86ShuffleChain?

lebedev.ri added inline comments. Sep 1 2021, 3:58 AM

llvm/test/CodeGen/X86/oddshuffles.ll
2284 ↗	(On Diff #368087)	Let's see...

lebedev.ri mentioned this in D109065: [X86] combineX86ShufflesRecursively(): call SimplifyMultipleUseDemandedVectorElts() on after finishing recursing. Sep 1 2021, 8:01 AM

lebedev.ri added inline comments. Sep 1 2021, 9:48 AM

llvm/test/CodeGen/X86/oddshuffles.ll
2284 ↗	(On Diff #368087)	I've locally rebased this patch on top of the suggestion, which I've implemented in D109065, and it does not help.

lebedev.ri mentioned this in D109074: [Codegen][TLI][X86] SimplifyMultipleUseDemandedBits(): 0'th vec subreg widening is free, try to perform it earlier. Sep 1 2021, 10:42 AM

Rebased on top of D109074+D109065; the regression is gone.

lebedev.ri added a parent revision: D109065: [X86] combineX86ShufflesRecursively(): call SimplifyMultipleUseDemandedVectorElts() on after finishing recursing. Sep 1 2021, 10:49 AM

Harbormaster completed remote builds in B122122: Diff 369990. Sep 1 2021, 11:28 AM

lebedev.ri mentioned this in rGf5753125f03a: [Codegen][TLI][X86] SimplifyMultipleUseDemandedBits(): 0'th vec subreg widening…. Sep 1 2021, 2:54 PM

Rebased, NFC.

Harbormaster completed remote builds in B122177: Diff 370078. Sep 1 2021, 3:04 PM

ping

LGTM - cheers

This revision is now accepted and ready to land. Sep 7 2021, 2:43 PM

In D108382#2987966, @RKSimon wrote:

LGTM - cheers

Aha! Thank you for the review.
This depends on D109065, so these two accepted patches in this patch series can't land just yet.

lebedev.ri mentioned this in rG1e72ca94e579: [X86] combineX86ShufflesRecursively(): call…. Sep 19 2021, 7:25 AM

Rebased, NFC.
Going to land this now.

This revision was landed with ongoing or failed builds. Sep 19 2021, 7:36 AM

Closed by commit rG07f1d8f0caa1: [X86] lowerShuffleAsDecomposedShuffleMerge(): if both inputs are… (authored by lebedev.ri).

This revision was automatically updated to reflect the committed changes.

lebedev.ri added a commit: rG07f1d8f0caa1: [X86] lowerShuffleAsDecomposedShuffleMerge(): if both inputs are….

lebedev.ri mentioned this in rG5f2fe48d06c7: [X86][TLI] SimplifyDemandedVectorEltsForTargetNode(): don't break apart…. Sep 19 2021, 7:39 AM

Harbormaster completed remote builds in B124574: Diff 373459. Sep 19 2021, 8:06 AM

Diff 367516

llvm/lib/Target/X86/X86ISelLowering.cpp


[12,606 lines elided] static SDValue lowerShuffleAsByteRotateAndPermute(
   // Check if the ranges are small enough to rotate from either direction.
   if (Range2.second < Range1.first)
     return RotateAndPermute(V1, V2, Range1.first, 0);
   if (Range1.second < Range2.first)
     return RotateAndPermute(V2, V1, Range2.first, NumElts);
   return SDValue();
 }

+static bool isBroadcastShuffleMask(ArrayRef<int> Mask) {
+  return isUndefOrEqual(Mask, 0);
+}
+
+static bool isNoopOrBroadcastShuffleMask(ArrayRef<int> Mask) {
+  return isNoopShuffleMask(Mask) || isBroadcastShuffleMask(Mask);
+}
+
+static SDValue getSplatOfVectorElement(const SDLoc &DL, SDValue Vec, int EltIdx,
+                                       SelectionDAG &DAG) {
+  EVT VT = Vec.getValueType();
+  SDValue ScalarElt =
+      DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, VT.getScalarType(), Vec,
+                  DAG.getIntPtrConstant(EltIdx, DL));
+  return DAG.getSplatBuildVector(VT, DL, ScalarElt);
+}
+
 /// Generic routine to decompose a shuffle and blend into independent
 /// blends and permutes.
 ///
 /// This matches the extremely common pattern for handling combined
 /// shuffle+blend operations on newer X86 ISAs where we have very fast blend
 /// operations. It will try to pick the best arrangement of shuffles and
 /// blends. For vXi8/vXi16 shuffles we may use unpack instead of blend.
 static SDValue lowerShuffleAsDecomposedShuffleMerge(
[17 lines elided] if (M >= 0 && M < NumElts) {
       IsAlternating &= (i & 1) == 0;
     } else if (M >= NumElts) {
       V2Mask[i] = M - NumElts;
       FinalMask[i] = i + NumElts;
       IsAlternating &= (i & 1) == 1;
     }
   }

+  auto canonicalizeBroadcastableInput =
+      [DL, &Subtarget, &DAG](SDValue &Input, MutableArrayRef<int> InputMask) {
+        unsigned EltSizeInBits = Input.getScalarValueSizeInBits();
+        if (!Subtarget.hasAVX2() &&
+            (!Subtarget.hasAVX() || EltSizeInBits < 32 || !MayFoldLoad(Input)))
+          return;
+        if (isNoopShuffleMask(InputMask) || !isBroadcastShuffleMask(InputMask))
+          return;
+        Input = getSplatOfVectorElement(DL, Input, 0, DAG);
+        for (auto I : enumerate(InputMask)) {
[Lint: Pre-merge checks: clang-tidy: warning: invalid case style for variable 'canonicalizeBroadcastableInput' [readability-identifier-naming]]
+          int &InputMaskElt = I.value();
+          if (InputMaskElt >= 0)
+            InputMaskElt = I.index();
+        }
+      };
+
+  if (isNoopOrBroadcastShuffleMask(V1Mask) &&
+      isNoopOrBroadcastShuffleMask(V2Mask)) {
+    canonicalizeBroadcastableInput(V1, V1Mask);
+    canonicalizeBroadcastableInput(V2, V2Mask);
+  }
+
   // Try to lower with the simpler initial blend/unpack/rotate strategies unless
   // one of the input shuffles would be a no-op. We prefer to shuffle inputs as
   // the shuffle may be able to fold with a load or other benefit. However, when
   // we'll have to do 2x as many shuffles in order to achieve this, a 2-input
   // pre-shuffle first is a better strategy.
   if (!isNoopShuffleMask(V1Mask) && !isNoopShuffleMask(V2Mask)) {
     // Only prefer immediate blends to unpack/rotate.
     if (SDValue BlendPerm = lowerShuffleAsBlendAndPermute(DL, VT, V1, V2, Mask,
[40,619 lines elided]
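
For orientation, the mask decomposition that this hunk builds on can be sketched standalone as follows. This is a simplified illustration, not the code from X86ISelLowering.cpp: `Decomposed` and `decomposeShuffleMask` are hypothetical names, masks are plain `std::vector<int>`, and the two assignments in the V1 branch are assumed by symmetry with the visible V2 branch, since those lines are elided in the listing.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The three masks a two-input shuffle is split into (hypothetical type).
struct Decomposed {
  std::vector<int> V1Mask;    // Single-input permute applied to V1.
  std::vector<int> V2Mask;    // Single-input permute applied to V2.
  std::vector<int> FinalMask; // Blend of the two permuted inputs.
};

// Split a two-input shuffle mask (negative = undef, [0, N) = V1,
// [N, 2N) = V2) into per-input permutes followed by a lane-preserving blend.
Decomposed decomposeShuffleMask(const std::vector<int> &Mask) {
  const int NumElts = static_cast<int>(Mask.size());
  Decomposed D{std::vector<int>(Mask.size(), -1),
               std::vector<int>(Mask.size(), -1),
               std::vector<int>(Mask.size(), -1)};
  for (int i = 0; i < NumElts; ++i) {
    const int M = Mask[i];
    assert(M < 2 * NumElts && "mask index out of range");
    if (M >= 0 && M < NumElts) {
      D.V1Mask[i] = M;    // Assumed: elided in the listing.
      D.FinalMask[i] = i; // Assumed: elided in the listing.
    } else if (M >= NumElts) {
      D.V2Mask[i] = M - NumElts;
      D.FinalMask[i] = i + NumElts;
    }
  }
  return D;
}
```

With the mask `<0, 4, 2, 4>` from the `vec256_eltty_i64_source_subvec_0_target_subvec_mask_3_binary` test, this yields a no-op `V1Mask` of `{0, -1, 2, -1}` and a broadcast `V2Mask` of `{-1, 0, -1, 0}`, which is exactly the "no-op or broadcast on both inputs" situation the new lambda is guarded by.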

llvm/test/CodeGen/X86/copy-low-subvec-elt-to-high-subvec-elt.ll

[278 lines elided]
 ; CHECK-NEXT: retq
   %r = shufflevector <4 x i64> %x, <4 x i64> poison, <4 x i32> <i32 0, i32 0, i32 2, i32 0>
   ret <4 x i64> %r
 }

 define <4 x i64> @vec256_eltty_i64_source_subvec_0_target_subvec_mask_3_binary(<4 x i64> %x, <4 x i64> %y) nounwind {
 ; CHECK-LABEL: vec256_eltty_i64_source_subvec_0_target_subvec_mask_3_binary:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: vbroadcastsd %xmm1, %ymm1
 ; CHECK-NEXT: vunpcklpd {{.*#+}} ymm0 = ymm0[0],ymm1[0],ymm0[2],ymm1[2]
 ; CHECK-NEXT: retq
   %r = shufflevector <4 x i64> %x, <4 x i64> %y, <4 x i32> <i32 0, i32 4, i32 2, i32 4>
   ret <4 x i64> %r
 }

 define <4 x i64> @vec256_eltty_i64_source_subvec_1_target_subvec_mask_1_unary(<4 x i64> %x) nounwind {
 ; CHECK-LABEL: vec256_eltty_i64_source_subvec_1_target_subvec_mask_1_unary:
[440 lines elided]
 ; CHECK-NEXT: retq
   %r = shufflevector <32 x i8> %x, <32 x i8> poison, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 0, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
   ret <32 x i8> %r
 }

 define <32 x i8> @vec256_eltty_i8_source_subvec_0_target_subvec_mask_1_binary(<32 x i8> %x, <32 x i8> %y) nounwind {
 ; CHECK-LABEL: vec256_eltty_i8_source_subvec_0_target_subvec_mask_1_binary:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: vpslldq {{.*#+}} xmm1 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0]
+; CHECK-NEXT: vpbroadcastb %xmm1, %ymm1
 ; CHECK-NEXT: vmovdqa {{.*#+}} ymm2 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
 ; CHECK-NEXT: vpblendvb %ymm2, %ymm0, %ymm1, %ymm0
 ; CHECK-NEXT: retq
   %r = shufflevector <32 x i8> %x, <32 x i8> %y, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 32, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
   ret <32 x i8> %r
 }

 define <32 x i8> @vec256_eltty_i8_source_subvec_0_target_subvec_mask_2_unary(<32 x i8> %x) nounwind {
 ; CHECK-LABEL: vec256_eltty_i8_source_subvec_0_target_subvec_mask_2_unary:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: vpbroadcastb %xmm0, %ymm1
 ; CHECK-NEXT: vmovdqa {{.*#+}} ymm2 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,0]
 ; CHECK-NEXT: vpblendvb %ymm2, %ymm0, %ymm1, %ymm0
 ; CHECK-NEXT: retq
   %r = shufflevector <32 x i8> %x, <32 x i8> poison, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 0>
   ret <32 x i8> %r
 }

 define <32 x i8> @vec256_eltty_i8_source_subvec_0_target_subvec_mask_2_binary(<32 x i8> %x, <32 x i8> %y) nounwind {
 ; CHECK-LABEL: vec256_eltty_i8_source_subvec_0_target_subvec_mask_2_binary:
 ; CHECK: # %bb.0:
-; CHECK-NEXT: vpslldq {{.*#+}} xmm1 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0]
-; CHECK-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm1
+; CHECK-NEXT: vpbroadcastb %xmm1, %ymm1
 ; CHECK-NEXT: vmovdqa {{.*#+}} ymm2 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,0]
 ; CHECK-NEXT: vpblendvb %ymm2, %ymm0, %ymm1, %ymm0
 ; CHECK-NEXT: retq
   %r = shufflevector <32 x i8> %x, <32 x i8> %y, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 32>
   ret <32 x i8> %r
 }

 define <32 x i8> @vec256_eltty_i8_source_subvec_0_target_subvec_mask_3_unary(<32 x i8> %x) nounwind {
[19 lines elided] ; CHECK-NEXT: retq
   %r = shufflevector <32 x i8> %x, <32 x i8> %y, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 32, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 32>
   ret <32 x i8> %r
 }

 define <32 x i8> @vec256_eltty_i8_source_subvec_1_target_subvec_mask_1_unary(<32 x i8> %x) nounwind {
 ; CHECK-LABEL: vec256_eltty_i8_source_subvec_1_target_subvec_mask_1_unary:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: vextracti128 $1, %ymm0, %xmm1
-; CHECK-NEXT: vpslldq {{.*#+}} xmm1 = zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,zero,xmm1[0]
+; CHECK-NEXT: vpbroadcastb %xmm1, %ymm1
 ; CHECK-NEXT: vmovdqa {{.*#+}} ymm2 = [255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,0,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255]
 ; CHECK-NEXT: vpblendvb %ymm2, %ymm0, %ymm1, %ymm0
 ; CHECK-NEXT: retq
   %r = shufflevector <32 x i8> %x, <32 x i8> poison, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 16, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
   ret <32 x i8> %r
 }

 define <32 x i8> @vec256_eltty_i8_source_subvec_1_target_subvec_mask_1_binary(<32 x i8> %x, <32 x i8> %y) nounwind {
[55 lines elided]

llvm/test/CodeGen/X86/horizontal-sum.ll

[457 lines elided]
 ; AVX2-FAST-NEXT: vinsertps {{.*#+}} xmm3 = xmm3[0,1,2],xmm4[0]
 ; AVX2-FAST-NEXT: vshufps {{.*#+}} xmm1 = xmm2[1,3],xmm1[1,3]
 ; AVX2-FAST-NEXT: vblendps {{.*#+}} xmm1 = xmm1[0,1,2],xmm4[3]
 ; AVX2-FAST-NEXT: vpaddd %xmm1, %xmm3, %xmm1
 ; AVX2-FAST-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm1[0]
 ; AVX2-FAST-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[2,3,2,3]
 ; AVX2-FAST-NEXT: vinserti128 $1, %xmm1, %ymm0, %ymm0
 ; AVX2-FAST-NEXT: vphaddd %xmm7, %xmm6, %xmm1
-; AVX2-FAST-NEXT: vphaddd %xmm0, %xmm0, %xmm2
-; AVX2-FAST-NEXT: vphaddd %xmm2, %xmm1, %xmm1
+; AVX2-FAST-NEXT: vphaddd %xmm0, %xmm1, %xmm1
 ; AVX2-FAST-NEXT: vpbroadcastq %xmm1, %ymm1
 ; AVX2-FAST-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0,1,2,3,4,5],ymm1[6,7]
 ; AVX2-FAST-NEXT: retq
   %9 = shufflevector <4 x i32> %0, <4 x i32> poison, <2 x i32> <i32 0, i32 2>
   %10 = shufflevector <4 x i32> %0, <4 x i32> poison, <2 x i32> <i32 1, i32 3>
   %11 = add <2 x i32> %9, %10
   %12 = shufflevector <2 x i32> %11, <2 x i32> poison, <2 x i32> <i32 1, i32 undef>
   %13 = add <2 x i32> %11, %12
[601 lines elided]
 ; SSSE3-FAST-NEXT: pshufd {{.*#+}} xmm1 = xmm2[2,3,2,3]
 ; SSSE3-FAST-NEXT: paddd %xmm2, %xmm1
 ; SSSE3-FAST-NEXT: pshufd {{.*#+}} xmm2 = xmm3[2,3,2,3]
 ; SSSE3-FAST-NEXT: paddd %xmm3, %xmm2
 ; SSSE3-FAST-NEXT: phaddd %xmm2, %xmm1
 ; SSSE3-FAST-NEXT: shufps {{.*#+}} xmm0 = xmm0[0,2],xmm1[0,2]
 ; SSSE3-FAST-NEXT: retq
 ;
-; AVX-SLOW-LABEL: reduction_sum_v4i32_v4i32:
-; AVX-SLOW: # %bb.0:
-; AVX-SLOW-NEXT: vpshufd {{.*#+}} xmm4 = xmm0[2,3,2,3]
-; AVX-SLOW-NEXT: vpaddd %xmm4, %xmm0, %xmm0
-; AVX-SLOW-NEXT: vpshufd {{.*#+}} xmm4 = xmm0[1,1,1,1]
-; AVX-SLOW-NEXT: vpshufd {{.*#+}} xmm5 = xmm1[2,3,2,3]
-; AVX-SLOW-NEXT: vpaddd %xmm5, %xmm1, %xmm1
-; AVX-SLOW-NEXT: vpshufd {{.*#+}} xmm5 = xmm1[1,1,1,1]
-; AVX-SLOW-NEXT: vpunpckldq {{.*#+}} xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1]
-; AVX-SLOW-NEXT: vpshufd {{.*#+}} xmm5 = xmm2[2,3,2,3]
-; AVX-SLOW-NEXT: vpaddd %xmm5, %xmm2, %xmm2
-; AVX-SLOW-NEXT: vpshufd {{.*#+}} xmm5 = xmm2[1,1,1,1]
-; AVX-SLOW-NEXT: vpshufd {{.*#+}} xmm6 = xmm3[2,3,2,3]
-; AVX-SLOW-NEXT: vpaddd %xmm6, %xmm3, %xmm3
-; AVX-SLOW-NEXT: vpshufd {{.*#+}} xmm6 = xmm3[1,1,1,1]
-; AVX-SLOW-NEXT: vpunpckldq {{.*#+}} xmm5 = xmm5[0],xmm6[0],xmm5[1],xmm6[1]
-; AVX-SLOW-NEXT: vpunpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
-; AVX-SLOW-NEXT: vpaddd %xmm5, %xmm2, %xmm2
-; AVX-SLOW-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
-; AVX-SLOW-NEXT: vpaddd %xmm4, %xmm0, %xmm0
-; AVX-SLOW-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
-; AVX-SLOW-NEXT: retq
+; AVX1-SLOW-LABEL: reduction_sum_v4i32_v4i32:
+; AVX1-SLOW: # %bb.0:
+; AVX1-SLOW-NEXT: vpshufd {{.*#+}} xmm4 = xmm0[2,3,2,3]
+; AVX1-SLOW-NEXT: vpaddd %xmm4, %xmm0, %xmm0
+; AVX1-SLOW-NEXT: vpshufd {{.*#+}} xmm4 = xmm0[1,1,1,1]
+; AVX1-SLOW-NEXT: vpshufd {{.*#+}} xmm5 = xmm1[2,3,2,3]
+; AVX1-SLOW-NEXT: vpaddd %xmm5, %xmm1, %xmm1
+; AVX1-SLOW-NEXT: vpshufd {{.*#+}} xmm5 = xmm1[1,1,1,1]
+; AVX1-SLOW-NEXT: vpunpckldq {{.*#+}} xmm4 = xmm4[0],xmm5[0],xmm4[1],xmm5[1]
+; AVX1-SLOW-NEXT: vpshufd {{.*#+}} xmm5 = xmm2[2,3,2,3]
+; AVX1-SLOW-NEXT: vpaddd %xmm5, %xmm2, %xmm2
+; AVX1-SLOW-NEXT: vpshufd {{.*#+}} xmm5 = xmm2[1,1,1,1]
+; AVX1-SLOW-NEXT: vpshufd {{.*#+}} xmm6 = xmm3[2,3,2,3]
+; AVX1-SLOW-NEXT: vpaddd %xmm6, %xmm3, %xmm3
+; AVX1-SLOW-NEXT: vpshufd {{.*#+}} xmm6 = xmm3[1,1,1,1]
+; AVX1-SLOW-NEXT: vpunpckldq {{.*#+}} xmm5 = xmm5[0],xmm6[0],xmm5[1],xmm6[1]
+; AVX1-SLOW-NEXT: vpunpckldq {{.*#+}} xmm2 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; AVX1-SLOW-NEXT: vpaddd %xmm5, %xmm2, %xmm2
+; AVX1-SLOW-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; AVX1-SLOW-NEXT: vpaddd %xmm4, %xmm0, %xmm0
+; AVX1-SLOW-NEXT: vpunpcklqdq {{.*#+}} xmm0 = xmm0[0],xmm2[0]
+; AVX1-SLOW-NEXT: retq
 ;
 ; AVX1-FAST-LABEL: reduction_sum_v4i32_v4i32:
 ; AVX1-FAST: # %bb.0:
 ; AVX1-FAST-NEXT: vpshufd {{.*#+}} xmm4 = xmm0[2,3,2,3]
 ; AVX1-FAST-NEXT: vpaddd %xmm4, %xmm0, %xmm0
 ; AVX1-FAST-NEXT: vpshufd {{.*#+}} xmm4 = xmm1[2,3,2,3]
 ; AVX1-FAST-NEXT: vpaddd %xmm4, %xmm1, %xmm1
 ; AVX1-FAST-NEXT: vphaddd %xmm1, %xmm0, %xmm0
 ; AVX1-FAST-NEXT: vpshufd {{.*#+}} xmm1 = xmm2[2,3,2,3]
 ; AVX1-FAST-NEXT: vpaddd %xmm1, %xmm2, %xmm1
 ; AVX1-FAST-NEXT: vpshufd {{.*#+}} xmm2 = xmm3[2,3,2,3]
 ; AVX1-FAST-NEXT: vpaddd %xmm2, %xmm3, %xmm2
 ; AVX1-FAST-NEXT: vphaddd %xmm2, %xmm1, %xmm1
 ; AVX1-FAST-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[0,2,0,2]
 ; AVX1-FAST-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
 ; AVX1-FAST-NEXT: vpblendw {{.*#+}} xmm0 = xmm0[0,1,2,3],xmm1[4,5,6,7]
 ; AVX1-FAST-NEXT: retq
 ;
+; AVX2-SLOW-LABEL: reduction_sum_v4i32_v4i32:
+; AVX2-SLOW: # %bb.0:
+; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm4 = xmm0[2,3,2,3]
+; AVX2-SLOW-NEXT: vpaddd %xmm4, %xmm0, %xmm0
+; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm4 = xmm0[1,1,1,1]
+; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm5 = xmm1[2,3,2,3]
+; AVX2-SLOW-NEXT: vpaddd %xmm5, %xmm1, %xmm1
+; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm5 = xmm2[2,3,2,3]
+; AVX2-SLOW-NEXT: vpaddd %xmm5, %xmm2, %xmm2
+; AVX2-SLOW-NEXT: vpshufd {{.*#+}} xmm5 = xmm3[2,3,2,3]
+; AVX2-SLOW-NEXT: vpaddd %xmm5, %xmm3, %xmm3
+; AVX2-SLOW-NEXT: vpunpckldq {{.*#+}} xmm5 = xmm2[0],xmm3[0],xmm2[1],xmm3[1]
+; AVX2-SLOW-NEXT: vpblendd {{.*#+}} xmm4 = xmm4[0],xmm1[1],xmm4[2,3]
+; AVX2-SLOW-NEXT: vpblendd {{.*#+}} xmm4 = xmm4[0,1],xmm5[2,3]
+; AVX2-SLOW-NEXT: vpunpckldq {{.*#+}} xmm0 = xmm0[0],xmm1[0],xmm0[1],xmm1[1]
+; AVX2-SLOW-NEXT: vpbroadcastd %xmm3, %xmm1
+; AVX2-SLOW-NEXT: vpbroadcastd %xmm2, %xmm2
+; AVX2-SLOW-NEXT: vpunpckldq {{.*#+}} xmm1 = xmm2[0],xmm1[0],xmm2[1],xmm1[1]
+; AVX2-SLOW-NEXT: vpblendd {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3]
+; AVX2-SLOW-NEXT: vpaddd %xmm4, %xmm0, %xmm0
+; AVX2-SLOW-NEXT: retq
+;
 ; AVX2-FAST-LABEL: reduction_sum_v4i32_v4i32:
 ; AVX2-FAST: # %bb.0:
 ; AVX2-FAST-NEXT: vpshufd {{.*#+}} xmm4 = xmm0[2,3,2,3]
 ; AVX2-FAST-NEXT: vpaddd %xmm4, %xmm0, %xmm0
 ; AVX2-FAST-NEXT: vpshufd {{.*#+}} xmm4 = xmm1[2,3,2,3]
 ; AVX2-FAST-NEXT: vpaddd %xmm4, %xmm1, %xmm1
 ; AVX2-FAST-NEXT: vphaddd %xmm1, %xmm0, %xmm0
 ; AVX2-FAST-NEXT: vpshufd {{.*#+}} xmm1 = xmm2[2,3,2,3]
 ; AVX2-FAST-NEXT: vpaddd %xmm1, %xmm2, %xmm1
 ; AVX2-FAST-NEXT: vpshufd {{.*#+}} xmm2 = xmm3[2,3,2,3]
 ; AVX2-FAST-NEXT: vpaddd %xmm2, %xmm3, %xmm2
 ; AVX2-FAST-NEXT: vphaddd %xmm2, %xmm1, %xmm1
	; AVX2-FAST-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[0,2,0,2]			; AVX2-FAST-NEXT: vpshufd {{.*#+}} xmm1 = xmm1[0,1,0,2]
	; AVX2-FAST-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]			; AVX2-FAST-NEXT: vpshufd {{.*#+}} xmm0 = xmm0[0,2,2,3]
	; AVX2-FAST-NEXT: vpblendd {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3]			; AVX2-FAST-NEXT: vpblendd {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3]
	; AVX2-FAST-NEXT: retq			; AVX2-FAST-NEXT: retq
	%5 = call i32 @llvm.vector.reduce.add.i32.v4i32(<4 x i32> %0)			%5 = call i32 @llvm.vector.reduce.add.i32.v4i32(<4 x i32> %0)
	%6 = call i32 @llvm.vector.reduce.add.i32.v4i32(<4 x i32> %1)			%6 = call i32 @llvm.vector.reduce.add.i32.v4i32(<4 x i32> %1)
	%7 = call i32 @llvm.vector.reduce.add.i32.v4i32(<4 x i32> %2)			%7 = call i32 @llvm.vector.reduce.add.i32.v4i32(<4 x i32> %2)
	%8 = call i32 @llvm.vector.reduce.add.i32.v4i32(<4 x i32> %3)			%8 = call i32 @llvm.vector.reduce.add.i32.v4i32(<4 x i32> %3)
	%9 = insertelement <4 x i32> undef, i32 %5, i32 0			%9 = insertelement <4 x i32> undef, i32 %5, i32 0
	%10 = insertelement <4 x i32> %9, i32 %6, i32 1			%10 = insertelement <4 x i32> %9, i32 %6, i32 1
	%11 = insertelement <4 x i32> %10, i32 %7, i32 2			%11 = insertelement <4 x i32> %10, i32 %7, i32 2
	%12 = insertelement <4 x i32> %11, i32 %8, i32 3			%12 = insertelement <4 x i32> %11, i32 %8, i32 3
	ret <4 x i32> %12			ret <4 x i32> %12
	}			}
	declare i32 @llvm.vector.reduce.add.i32.v4i32(<4 x i32>)			declare i32 @llvm.vector.reduce.add.i32.v4i32(<4 x i32>)

This is an archive of the discontinued LLVM Phabricator instance.

[X86] lowerShuffleAsDecomposedShuffleMerge(): if both inputs are broadcastable/identities, canonicalize broadcasts as such
Closed, Public

Details

Diff Detail

Unit Tests: Failed

Event Timeline

Revision Contents

Diff 367516

llvm/lib/Target/X86/X86ISelLowering.cpp

llvm/test/CodeGen/X86/copy-low-subvec-elt-to-high-subvec-elt.ll

llvm/test/CodeGen/X86/horizontal-sum.ll
