This is an archive of the discontinued LLVM Phabricator instance.

[x86] try harder to form 256-bit unpck*
Closed, Public

Authored by spatel on Jan 12 2020, 11:10 AM.

Details

Reviewers

craig.topper
RKSimon

Commits

rG43f60e614a3d: [x86] try harder to form 256-bit unpck*

Summary

This is another part of a problem noted in PR42024:
https://bugs.llvm.org/show_bug.cgi?id=42024

With AVX2, codegen may use awkward 256-bit shuffles, whereas the AVX code gets split into the expected 128-bit unpack instructions. We have to be selective about the types for which we try this, though; otherwise, we can end up with more instructions (as happens for v8x32/v4x64).
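To make the idea concrete, here is a minimal standalone sketch (not the in-tree code) that works purely on shuffle-mask indices. It assumes a single 256-bit input and checks that permuting the 64-bit chunks in the order {0, 2, 1, 3} and then applying the lane-limited unary unpack gives the same result as the "natural" splat2 mask. The function names are hypothetical; the mask loops are modeled on the helpers shown in the diff for X86ISelLowering.h.

```cpp
#include <cassert>
#include <vector>

// Unary x86 unpack mask: interleaving is confined to each 128-bit lane.
std::vector<int> unaryUnpackMask(int numElts, int eltBits, bool lo) {
  int perLane = 128 / eltBits;
  std::vector<int> mask;
  for (int i = 0; i < numElts; ++i) {
    int laneStart = (i / perLane) * perLane;
    int pos = (i % perLane) / 2 + laneStart;
    pos += lo ? 0 : perLane / 2;
    mask.push_back(pos);
  }
  return mask;
}

// "Natural" splat2 mask with no lane limitation: <0,0,1,1,...> for the low
// half, or the same pattern offset by numElts/2 for the high half.
std::vector<int> splat2Mask(int numElts, bool lo) {
  std::vector<int> mask;
  for (int i = 0; i < numElts; ++i)
    mask.push_back(i / 2 + (lo ? 0 : numElts / 2));
  return mask;
}

// The fixed 64-bit chunk permute {0,2,1,3}, expanded to per-element indices.
std::vector<int> chunkPermuteMask(int numElts, int eltBits) {
  const int chunkOrder[4] = {0, 2, 1, 3};
  int perChunk = 64 / eltBits;
  std::vector<int> mask;
  for (int i = 0; i < numElts; ++i)
    mask.push_back(chunkOrder[i / perChunk] * perChunk + i % perChunk);
  return mask;
}

// Compose two shuffles: `inner` is applied first, then `outer`.
std::vector<int> compose(const std::vector<int> &inner,
                         const std::vector<int> &outer) {
  std::vector<int> out;
  for (int idx : outer)
    out.push_back(inner[idx]);
  return out;
}

// True if chunk permute + in-lane unpack equals the natural splat2 shuffle
// for a 256-bit vector with the given element width.
bool permuteThenUnpackIsSplat2(int eltBits, bool lo) {
  int numElts = 256 / eltBits;
  return compose(chunkPermuteMask(numElts, eltBits),
                 unaryUnpackMask(numElts, eltBits, lo)) ==
         splat2Mask(numElts, lo);
}
```

For 8 x 32-bit elements, the in-lane unpack-low mask alone is <0,0,1,1,4,4,5,5>, which is why the plain unpack matcher does not fire on <0,0,1,1,2,2,3,3>; the composed mask does match it. The sketch only shows equivalence of the two forms. It says nothing about which one is cheaper for a given type, which is the selectivity concern raised above.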

Event Timeline

spatel created this revision. Jan 12 2020, 11:10 AM

Herald added a project: Restricted Project. Jan 12 2020, 11:10 AM

Herald added subscribers: hiraditya, mcrosier.

RKSimon added inline comments. Jan 13 2020, 8:24 AM

llvm/lib/Target/X86/X86ISelLowering.cpp
16239	Add a comment and move the VPERMD comment.
16305	Add a comment. This generates a shuffle sequence; it's probably better to place this after all the lowerShuffle* calls that create single (non-variable) instructions.
16410	Add a comment. This generates a shuffle sequence; it's probably better to place this after all the lowerShuffle* calls that create single (non-variable) instructions.

Patch updated:

Added code comments to better explain the strategy.
Moved calls lower and within predicate blocks (expect V2 to be undef) where this makes sense to try.
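The ordering concern can be sketched as a first-match chain: single-instruction patterns are tried first, then the fixed two-instruction sequence, and the variable permute is the fallback. This is a toy model only, assuming a unary 8-element mask. The matcher names and the returned strings are hypothetical and stand in for the SDValue-returning lowerShuffle* calls in X86ISelLowering.cpp.

```cpp
#include <string>
#include <vector>

using Mask = std::vector<int>;

// Each hypothetical matcher returns a description of what it would emit,
// or an empty string when the mask does not fit its pattern.
std::string matchInLaneUnpack(const Mask &m) {
  return m == Mask{0, 0, 1, 1, 4, 4, 5, 5} ? "unpckl" : "";
}

std::string matchPermuteThenUnpack(const Mask &m) {
  return m == Mask{0, 0, 1, 1, 2, 2, 3, 3} ? "vpermq + unpckl" : "";
}

// The variable permute handles any unary mask, so it always matches and
// must therefore come last.
std::string matchVariablePermute(const Mask &) {
  return "vpermd";
}

// Try the matchers from cheapest to most general; the first match wins.
std::string lowerUnaryV8(const Mask &m) {
  typedef std::string (*Matcher)(const Mask &);
  const Matcher ordered[] = {matchInLaneUnpack, matchPermuteThenUnpack,
                             matchVariablePermute};
  for (Matcher match : ordered) {
    std::string result = match(m);
    if (!result.empty())
      return result;
  }
  return "";
}
```

Placing the two-instruction matcher ahead of the single-instruction ones would only matter for masks that both accept, but keeping the cheaper patterns first makes that impossible by construction.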

lebedev.ri added a subscriber: lebedev.ri. Jan 13 2020, 11:08 AM

lebedev.ri added inline comments.

llvm/test/CodeGen/X86/vector-shuffle-256-v8.ll
1721	This looks like some demandedelts deficiency?

RKSimon added inline comments. Jan 13 2020, 11:51 AM

llvm/test/CodeGen/X86/vector-shuffle-256-v8.ll
1721	D66004 might catch it, else it might be due to the constant mask already being lowered - we don't do much to simplify constant vectors already lowered to the constant pool.

spatel marked an inline comment as done. Jan 13 2020, 1:00 PM

spatel added inline comments.

llvm/test/CodeGen/X86/vector-shuffle-256-v8.ll
1721	I think we've already lowered to constant pool loads.

This patch creates the unpack as expected, but then a later combine does:

Combining: t23: v8i32 = X86ISD::UNPCKL t22, t22
Creating new node: t60: v8i32 = BUILD_VECTOR Constant:i32<0>, Constant:i32<0>, Constant:i32<1>, Constant:i32<1>, Constant:i32<2>, Constant:i32<2>, Constant:i32<3>, Constant:i32<3>
Creating new node: t61: v8i32 = X86ISD::VPERMV t60, t4

And then we lower the build_vector:

    t65: v8i32,ch = load<(load 32 from constant-pool)> t0, t67, undef:i64
    t4: v8i32,ch = CopyFromReg t0, Register:v8i32 %1
  t61: v8i32 = X86ISD::VPERMV t65, t4
   t49: v8i32,ch = load<(load 32 from constant-pool)> t0, t51, undef:i64
    t2: v8i32,ch = CopyFromReg t0, Register:v8i32 %0
  t43: v8i32 = X86ISD::VPERMV t49, t2
t16: v8i32 = X86ISD::BLENDI t61, t43, TargetConstant:i8<17>

All of that happens before we reach the BLENDI, so it doesn't seem like there'd be much chance of improvement this late.
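As a sanity check on the DAG above, the following standalone sketch models the final sequence for shuffle_v8i32_08991abb on plain arrays: two variable permutes followed by an immediate blend with 17 (bits 0 and 4 set). It assumes that a set bit selects the element from the second blend operand. The permute indices for the %0 input are taken from the AVX2-FAST check lines (<0,u,1,u,1,u,u,u>) rather than from the dump, and undef indices are modeled as -1. All function and constant names are hypothetical.

```cpp
#include <array>
#include <cstdint>

using V8 = std::array<int, 8>;

// Variable permute: out[i] = src[idx[i]]; a negative index models undef
// and produces 0 here.
V8 permute(const V8 &idx, const V8 &src) {
  V8 out{};
  for (int i = 0; i < 8; ++i)
    out[i] = idx[i] < 0 ? 0 : src[idx[i]];
  return out;
}

// Immediate blend: bit i of imm set selects element i of b, otherwise a.
V8 blend(const V8 &a, const V8 &b, std::uint8_t imm) {
  V8 out{};
  for (int i = 0; i < 8; ++i)
    out[i] = ((imm >> i) & 1) ? b[i] : a[i];
  return out;
}

// Reference two-input shuffle: indices 0..7 read v1, indices 8..15 read v2.
V8 shuffle2(const V8 &v1, const V8 &v2, const V8 &mask) {
  V8 out{};
  for (int i = 0; i < 8; ++i)
    out[i] = mask[i] < 8 ? v1[mask[i]] : v2[mask[i] - 8];
  return out;
}

const V8 kSplat2Lo = {0, 0, 1, 1, 2, 2, 3, 3};          // BUILD_VECTOR t60
const V8 kFromFirst = {0, -1, 1, -1, 1, -1, -1, -1};    // from the check lines
const V8 kShuffleMask = {0, 8, 9, 9, 1, 10, 11, 11};    // "08991abb"

// Models t61, t43, and t16 from the dump for inputs a (%0) and b (%1).
V8 lowered08991abb(const V8 &a, const V8 &b) {
  V8 fromB = permute(kSplat2Lo, b);   // t61
  V8 fromA = permute(kFromFirst, a);  // t43
  return blend(fromB, fromA, 17);     // t16
}
```

In this model the blend keeps only elements 0 and 4 of the %0 permute, so the remaining elements of that permute are don't-care, which is consistent with the undef entries in its mask.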

LGTM - technically this is a limited version of a "lowerShuffleAsLanePermuteAndUNPCK" pattern that we might want to investigate in the future, but it's a more controllable initial implementation.

This revision is now accepted and ready to land. Jan 17 2020, 5:38 AM

Closed by commit rG43f60e614a3d: [x86] try harder to form 256-bit unpck* (authored by spatel). Jan 17 2020, 7:46 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Target/X86/X86ISelLowering.h	15 lines
llvm/lib/Target/X86/X86ISelLowering.cpp	45 lines
llvm/test/CodeGen/X86/vector-interleave.ll	26 lines
llvm/test/CodeGen/X86/vector-shuffle-256-v8.ll	34 lines

Diff 237733

llvm/lib/Target/X86/X86ISelLowering.h

Show First 20 Lines • Show All 1,682 Lines • ▼ Show 20 Lines	for (int i = 0; i < NumElts; ++i) {
unsigned LaneStart = (i / NumEltsInLane) * NumEltsInLane;		unsigned LaneStart = (i / NumEltsInLane) * NumEltsInLane;
int Pos = (i % NumEltsInLane) / 2 + LaneStart;		int Pos = (i % NumEltsInLane) / 2 + LaneStart;
Pos += (Unary ? 0 : NumElts * (i % 2));		Pos += (Unary ? 0 : NumElts * (i % 2));
Pos += (Lo ? 0 : NumEltsInLane / 2);		Pos += (Lo ? 0 : NumEltsInLane / 2);
Mask.push_back(Pos);		Mask.push_back(Pos);
}		}
}		}

		/// Similar to unpacklo/unpackhi, but without the 128-bit lane limitation
		/// imposed by AVX and specific to the unary pattern. Example:
		/// v8iX Lo --> <0, 0, 1, 1, 2, 2, 3, 3>
		/// v8iX Hi --> <4, 4, 5, 5, 6, 6, 7, 7>
		template <typename T = int>
		void createSplat2ShuffleMask(MVT VT, SmallVectorImpl<T> &Mask, bool Lo) {
		assert(Mask.empty() && "Expected an empty shuffle mask vector");
		int NumElts = VT.getVectorNumElements();
		for (int i = 0; i < NumElts; ++i) {
		int Pos = i / 2;
		Pos += (Lo ? 0 : NumElts / 2);
		Mask.push_back(Pos);
		}
		}

/// Helper function to scale a shuffle or target shuffle mask, replacing each		/// Helper function to scale a shuffle or target shuffle mask, replacing each
/// mask index with the scaled sequential indices for an equivalent narrowed		/// mask index with the scaled sequential indices for an equivalent narrowed
/// mask. This is the reverse process to canWidenShuffleElements, but can		/// mask. This is the reverse process to canWidenShuffleElements, but can
/// always succeed.		/// always succeed.
template <typename T>		template <typename T>
void scaleShuffleMask(size_t Scale, ArrayRef<T> Mask,		void scaleShuffleMask(size_t Scale, ArrayRef<T> Mask,
SmallVectorImpl<T> &ScaledMask) {		SmallVectorImpl<T> &ScaledMask) {
assert(0 < Scale && "Unexpected scaling factor");		assert(0 < Scale && "Unexpected scaling factor");
Show All 21 Lines

llvm/lib/Target/X86/X86ISelLowering.cpp

Show First 20 Lines • Show All 10,922 Lines • ▼ Show 20 Lines	static SDValue lowerShuffleWithUNPCK(const SDLoc &DL, MVT VT,

ShuffleVectorSDNode::commuteMask(Unpckh);		ShuffleVectorSDNode::commuteMask(Unpckh);
if (isShuffleEquivalent(V1, V2, Mask, Unpckh))		if (isShuffleEquivalent(V1, V2, Mask, Unpckh))
return DAG.getNode(X86ISD::UNPCKH, DL, VT, V2, V1);		return DAG.getNode(X86ISD::UNPCKH, DL, VT, V2, V1);

return SDValue();		return SDValue();
}		}

		/// Check if the mask can be mapped to a preliminary shuffle (vperm 64-bit)
		/// followed by unpack 256-bit.
		static SDValue lowerShuffleWithUNPCK256(const SDLoc &DL, MVT VT,
		ArrayRef<int> Mask, SDValue V1,
		SDValue V2, SelectionDAG &DAG) {
		SmallVector<int, 32> Unpckl, Unpckh;
		createSplat2ShuffleMask(VT, Unpckl, /* Lo */ true);
		createSplat2ShuffleMask(VT, Unpckh, /* Lo */ false);

		unsigned UnpackOpcode;
		if (isShuffleEquivalent(V1, V2, Mask, Unpckl))
		UnpackOpcode = X86ISD::UNPCKL;
		else if (isShuffleEquivalent(V1, V2, Mask, Unpckh))
		UnpackOpcode = X86ISD::UNPCKH;
		else
		return SDValue();

		// This is a "natural" unpack operation (rather than the 128-bit sectored
		// operation implemented by AVX). We need to rearrange 64-bit chunks of the
		// input in order to use the x86 instruction.
		V1 = DAG.getVectorShuffle(MVT::v4f64, DL, DAG.getBitcast(MVT::v4f64, V1),
		DAG.getUNDEF(MVT::v4f64), {0, 2, 1, 3});
		V1 = DAG.getBitcast(VT, V1);
		return DAG.getNode(UnpackOpcode, DL, VT, V1, V1);
		}

static bool matchShuffleAsVPMOV(ArrayRef<int> Mask, bool SwappedOps,		static bool matchShuffleAsVPMOV(ArrayRef<int> Mask, bool SwappedOps,
int Delta) {		int Delta) {
int Size = (int)Mask.size();		int Size = (int)Mask.size();
int Split = Size / Delta;		int Split = Size / Delta;
int TruncatedVectorStart = SwappedOps ? Size : 0;		int TruncatedVectorStart = SwappedOps ? Size : 0;

// Match for mask starting with e.g.: <8, 10, 12, 14,... or <0, 2, 4, 6,...		// Match for mask starting with e.g.: <8, 10, 12, 14,... or <0, 2, 4, 6,...
if (!isSequentialOrUndefInRange(Mask, 0, Split, TruncatedVectorStart, Delta))		if (!isSequentialOrUndefInRange(Mask, 0, Split, TruncatedVectorStart, Delta))
▲ Show 20 Lines • Show All 5,265 Lines • ▼ Show 20 Lines	if (SDValue Rotate = lowerShuffleAsByteRotate(DL, MVT::v8i32, V1, V2, Mask,
return Rotate;		return Rotate;

// Try to create an in-lane repeating shuffle mask and then shuffle the		// Try to create an in-lane repeating shuffle mask and then shuffle the
// results into the target lanes.		// results into the target lanes.
if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(		if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(
DL, MVT::v8i32, V1, V2, Mask, Subtarget, DAG))		DL, MVT::v8i32, V1, V2, Mask, Subtarget, DAG))
return V;		return V;

// If the shuffle patterns aren't repeated but it is a single input, directly
// generate a cross-lane VPERMD instruction.
if (V2.isUndef()) {		if (V2.isUndef()) {
		// Try to produce a fixed cross-128-bit lane permute followed by unpack
		// because that should be faster than the variable permute alternatives.
		if (SDValue V = lowerShuffleWithUNPCK256(DL, MVT::v8i32, Mask, V1, V2, DAG))
		return V;

		// If the shuffle patterns aren't repeated but it's a single input, directly
		// generate a cross-lane VPERMD instruction.
SDValue VPermMask = getConstVector(Mask, MVT::v8i32, DAG, DL, true);		SDValue VPermMask = getConstVector(Mask, MVT::v8i32, DAG, DL, true);
return DAG.getNode(X86ISD::VPERMV, DL, MVT::v8i32, VPermMask, V1);		return DAG.getNode(X86ISD::VPERMV, DL, MVT::v8i32, VPermMask, V1);
}		}

// Assume that a single SHUFPS is faster than an alternative sequence of		// Assume that a single SHUFPS is faster than an alternative sequence of
// multiple instructions (even if the CPU has a domain penalty).		// multiple instructions (even if the CPU has a domain penalty).
// If some CPU is harmed by the domain switch, we can fix it in a later pass.		// If some CPU is harmed by the domain switch, we can fix it in a later pass.
if (Is128BitLaneRepeatedShuffle && isSingleSHUFPSMask(RepeatedMask)) {		if (Is128BitLaneRepeatedShuffle && isSingleSHUFPSMask(RepeatedMask)) {
▲ Show 20 Lines • Show All 43 Lines • ▼ Show 20 Lines	static SDValue lowerV16I16Shuffle(const SDLoc &DL, ArrayRef<int> Mask,
if (SDValue Blend = lowerShuffleAsBlend(DL, MVT::v16i16, V1, V2, Mask,		if (SDValue Blend = lowerShuffleAsBlend(DL, MVT::v16i16, V1, V2, Mask,
Zeroable, Subtarget, DAG))		Zeroable, Subtarget, DAG))
return Blend;		return Blend;

// Use dedicated unpack instructions for masks that match their pattern.		// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V = lowerShuffleWithUNPCK(DL, MVT::v16i16, Mask, V1, V2, DAG))		if (SDValue V = lowerShuffleWithUNPCK(DL, MVT::v16i16, Mask, V1, V2, DAG))
return V;		return V;

// Use dedicated pack instructions for masks that match their pattern.		// Use dedicated pack instructions for masks that match their pattern.
if (SDValue V = lowerShuffleWithPACK(DL, MVT::v16i16, Mask, V1, V2, DAG,		if (SDValue V = lowerShuffleWithPACK(DL, MVT::v16i16, Mask, V1, V2, DAG,
Subtarget))		Subtarget))
return V;		return V;

// Try to use shift instructions.		// Try to use shift instructions.
if (SDValue Shift = lowerShuffleAsShift(DL, MVT::v16i16, V1, V2, Mask,		if (SDValue Shift = lowerShuffleAsShift(DL, MVT::v16i16, V1, V2, Mask,
Zeroable, Subtarget, DAG))		Zeroable, Subtarget, DAG))
return Shift;		return Shift;

// Try to use byte rotation instructions.		// Try to use byte rotation instructions.
if (SDValue Rotate = lowerShuffleAsByteRotate(DL, MVT::v16i16, V1, V2, Mask,		if (SDValue Rotate = lowerShuffleAsByteRotate(DL, MVT::v16i16, V1, V2, Mask,
Subtarget, DAG))		Subtarget, DAG))
return Rotate;		return Rotate;

// Try to create an in-lane repeating shuffle mask and then shuffle the		// Try to create an in-lane repeating shuffle mask and then shuffle the
// results into the target lanes.		// results into the target lanes.
if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(		if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(
DL, MVT::v16i16, V1, V2, Mask, Subtarget, DAG))		DL, MVT::v16i16, V1, V2, Mask, Subtarget, DAG))
return V;		return V;

if (V2.isUndef()) {		if (V2.isUndef()) {
		// Try to produce a fixed cross-128-bit lane permute followed by unpack
		// because that should be faster than the variable permute alternatives.
		if (SDValue V = lowerShuffleWithUNPCK256(DL, MVT::v16i16, Mask, V1, V2, DAG))
		return V;

// There are no generalized cross-lane shuffle operations available on i16		// There are no generalized cross-lane shuffle operations available on i16
// element types.		// element types.
if (is128BitLaneCrossingShuffleMask(MVT::v16i16, Mask)) {		if (is128BitLaneCrossingShuffleMask(MVT::v16i16, Mask)) {
if (SDValue V = lowerShuffleAsLanePermuteAndPermute(		if (SDValue V = lowerShuffleAsLanePermuteAndPermute(
DL, MVT::v16i16, V1, V2, Mask, DAG, Subtarget))		DL, MVT::v16i16, V1, V2, Mask, DAG, Subtarget))
return V;		return V;

return lowerShuffleAsLanePermuteAndShuffle(DL, MVT::v16i16, V1, V2, Mask,		return lowerShuffleAsLanePermuteAndShuffle(DL, MVT::v16i16, V1, V2, Mask,
▲ Show 20 Lines • Show All 62 Lines • ▼ Show 20 Lines	static SDValue lowerV32I8Shuffle(const SDLoc &DL, ArrayRef<int> Mask,
if (SDValue Blend = lowerShuffleAsBlend(DL, MVT::v32i8, V1, V2, Mask,		if (SDValue Blend = lowerShuffleAsBlend(DL, MVT::v32i8, V1, V2, Mask,
Zeroable, Subtarget, DAG))		Zeroable, Subtarget, DAG))
return Blend;		return Blend;

// Use dedicated unpack instructions for masks that match their pattern.		// Use dedicated unpack instructions for masks that match their pattern.
if (SDValue V = lowerShuffleWithUNPCK(DL, MVT::v32i8, Mask, V1, V2, DAG))		if (SDValue V = lowerShuffleWithUNPCK(DL, MVT::v32i8, Mask, V1, V2, DAG))
return V;		return V;

// Use dedicated pack instructions for masks that match their pattern.		// Use dedicated pack instructions for masks that match their pattern.
if (SDValue V = lowerShuffleWithPACK(DL, MVT::v32i8, Mask, V1, V2, DAG,		if (SDValue V = lowerShuffleWithPACK(DL, MVT::v32i8, Mask, V1, V2, DAG,
Subtarget))		Subtarget))
return V;		return V;

// Try to use shift instructions.		// Try to use shift instructions.
if (SDValue Shift = lowerShuffleAsShift(DL, MVT::v32i8, V1, V2, Mask,		if (SDValue Shift = lowerShuffleAsShift(DL, MVT::v32i8, V1, V2, Mask,
Zeroable, Subtarget, DAG))		Zeroable, Subtarget, DAG))
return Shift;		return Shift;

// Try to use byte rotation instructions.		// Try to use byte rotation instructions.
if (SDValue Rotate = lowerShuffleAsByteRotate(DL, MVT::v32i8, V1, V2, Mask,		if (SDValue Rotate = lowerShuffleAsByteRotate(DL, MVT::v32i8, V1, V2, Mask,
Subtarget, DAG))		Subtarget, DAG))
return Rotate;		return Rotate;

// Try to create an in-lane repeating shuffle mask and then shuffle the		// Try to create an in-lane repeating shuffle mask and then shuffle the
// results into the target lanes.		// results into the target lanes.
if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(		if (SDValue V = lowerShuffleAsRepeatedMaskAndLanePermute(
DL, MVT::v32i8, V1, V2, Mask, Subtarget, DAG))		DL, MVT::v32i8, V1, V2, Mask, Subtarget, DAG))
return V;		return V;

// There are no generalized cross-lane shuffle operations available on i8		// There are no generalized cross-lane shuffle operations available on i8
// element types.		// element types.
if (V2.isUndef() && is128BitLaneCrossingShuffleMask(MVT::v32i8, Mask)) {		if (V2.isUndef() && is128BitLaneCrossingShuffleMask(MVT::v32i8, Mask)) {
		// Try to produce a fixed cross-128-bit lane permute followed by unpack
		// because that should be faster than the variable permute alternatives.
		if (SDValue V = lowerShuffleWithUNPCK256(DL, MVT::v32i8, Mask, V1, V2, DAG))
		return V;

if (SDValue V = lowerShuffleAsLanePermuteAndPermute(		if (SDValue V = lowerShuffleAsLanePermuteAndPermute(
DL, MVT::v32i8, V1, V2, Mask, DAG, Subtarget))		DL, MVT::v32i8, V1, V2, Mask, DAG, Subtarget))
return V;		return V;

return lowerShuffleAsLanePermuteAndShuffle(DL, MVT::v32i8, V1, V2, Mask,		return lowerShuffleAsLanePermuteAndShuffle(DL, MVT::v32i8, V1, V2, Mask,
DAG, Subtarget);		DAG, Subtarget);
}		}

▲ Show 20 Lines • Show All 30,895 Lines • Show Last 20 Lines

llvm/test/CodeGen/X86/vector-interleave.ll

	Show First 20 Lines • Show All 165 Lines • ▼ Show 20 Lines
	; AVX1-NEXT: vmovdqu %xmm1, 48(%rsi)			; AVX1-NEXT: vmovdqu %xmm1, 48(%rsi)
	; AVX1-NEXT: vmovdqu %xmm3, 32(%rsi)			; AVX1-NEXT: vmovdqu %xmm3, 32(%rsi)
	; AVX1-NEXT: vmovdqu %xmm0, 16(%rsi)			; AVX1-NEXT: vmovdqu %xmm0, 16(%rsi)
	; AVX1-NEXT: vmovdqu %xmm2, (%rsi)			; AVX1-NEXT: vmovdqu %xmm2, (%rsi)
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: splat2_i8:			; AVX2-LABEL: splat2_i8:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vmovdqu (%rdi), %ymm0			; AVX2-NEXT: vpermq {{.*#+}} ymm0 = mem[0,2,1,3]
	; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm0[0,1,0,1]			; AVX2-NEXT: vpunpcklbw {{.*#+}} ymm1 = ymm0[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,16,16,17,17,18,18,19,19,20,20,21,21,22,22,23,23]
	; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = [0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15]			; AVX2-NEXT: vpunpckhbw {{.*#+}} ymm0 = ymm0[8,8,9,9,10,10,11,11,12,12,13,13,14,14,15,15,24,24,25,25,26,26,27,27,28,28,29,29,30,30,31,31]
	; AVX2-NEXT: vpshufb %ymm2, %ymm1, %ymm1
	; AVX2-NEXT: vpermq {{.*#+}} ymm0 = ymm0[2,3,2,3]
	; AVX2-NEXT: vpshufb %ymm2, %ymm0, %ymm0
	; AVX2-NEXT: vmovdqu %ymm0, 32(%rsi)			; AVX2-NEXT: vmovdqu %ymm0, 32(%rsi)
	; AVX2-NEXT: vmovdqu %ymm1, (%rsi)			; AVX2-NEXT: vmovdqu %ymm1, (%rsi)
	; AVX2-NEXT: vzeroupper			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%ld32 = load <32 x i8>, <32 x i8>* %s, align 1			%ld32 = load <32 x i8>, <32 x i8>* %s, align 1
	%cat = shufflevector <32 x i8> %ld32, <32 x i8> undef, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>			%cat = shufflevector <32 x i8> %ld32, <32 x i8> undef, <64 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19, i32 20, i32 21, i32 22, i32 23, i32 24, i32 25, i32 26, i32 27, i32 28, i32 29, i32 30, i32 31>
	%cat2 = shufflevector <64 x i8> %cat, <64 x i8> undef, <64 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>			%cat2 = shufflevector <64 x i8> %cat, <64 x i8> undef, <64 x i32> <i32 0, i32 32, i32 1, i32 33, i32 2, i32 34, i32 3, i32 35, i32 4, i32 36, i32 5, i32 37, i32 6, i32 38, i32 7, i32 39, i32 8, i32 40, i32 9, i32 41, i32 10, i32 42, i32 11, i32 43, i32 12, i32 44, i32 13, i32 45, i32 14, i32 46, i32 15, i32 47, i32 16, i32 48, i32 17, i32 49, i32 18, i32 50, i32 19, i32 51, i32 20, i32 52, i32 21, i32 53, i32 22, i32 54, i32 23, i32 55, i32 24, i32 56, i32 25, i32 57, i32 26, i32 58, i32 27, i32 59, i32 28, i32 60, i32 29, i32 61, i32 30, i32 62, i32 31, i32 63>
	store <64 x i8> %cat2, <64 x i8>* %d, align 1			store <64 x i8> %cat2, <64 x i8>* %d, align 1
	Show All 28 Lines
	; AVX1-NEXT: vmovdqu %xmm1, 48(%rsi)			; AVX1-NEXT: vmovdqu %xmm1, 48(%rsi)
	; AVX1-NEXT: vmovdqu %xmm3, 32(%rsi)			; AVX1-NEXT: vmovdqu %xmm3, 32(%rsi)
	; AVX1-NEXT: vmovdqu %xmm0, 16(%rsi)			; AVX1-NEXT: vmovdqu %xmm0, 16(%rsi)
	; AVX1-NEXT: vmovdqu %xmm2, (%rsi)			; AVX1-NEXT: vmovdqu %xmm2, (%rsi)
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: splat2_i16:			; AVX2-LABEL: splat2_i16:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vmovdqu (%rdi), %ymm0			; AVX2-NEXT: vpermq {{.*#+}} ymm0 = mem[0,2,1,3]
	; AVX2-NEXT: vpermq {{.*#+}} ymm1 = ymm0[0,1,0,1]			; AVX2-NEXT: vpunpcklwd {{.*#+}} ymm1 = ymm0[0,0,1,1,2,2,3,3,8,8,9,9,10,10,11,11]
	; AVX2-NEXT: vmovdqa {{.*#+}} ymm2 = [0,1,0,1,2,3,2,3,4,5,4,5,6,7,6,7,8,9,8,9,10,11,10,11,12,13,12,13,14,15,14,15]			; AVX2-NEXT: vpunpckhwd {{.*#+}} ymm0 = ymm0[4,4,5,5,6,6,7,7,12,12,13,13,14,14,15,15]
	; AVX2-NEXT: vpshufb %ymm2, %ymm1, %ymm1
	; AVX2-NEXT: vpermq {{.*#+}} ymm0 = ymm0[2,3,2,3]
	; AVX2-NEXT: vpshufb %ymm2, %ymm0, %ymm0
	; AVX2-NEXT: vmovdqu %ymm0, 32(%rsi)			; AVX2-NEXT: vmovdqu %ymm0, 32(%rsi)
	; AVX2-NEXT: vmovdqu %ymm1, (%rsi)			; AVX2-NEXT: vmovdqu %ymm1, (%rsi)
	; AVX2-NEXT: vzeroupper			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%ld32 = load <16 x i16>, <16 x i16>* %s, align 1			%ld32 = load <16 x i16>, <16 x i16>* %s, align 1
	%cat = shufflevector <16 x i16> %ld32, <16 x i16> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>			%cat = shufflevector <16 x i16> %ld32, <16 x i16> undef, <32 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15>
	%cat2 = shufflevector <32 x i16> %cat, <32 x i16> undef, <32 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23, i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>			%cat2 = shufflevector <32 x i16> %cat, <32 x i16> undef, <32 x i32> <i32 0, i32 16, i32 1, i32 17, i32 2, i32 18, i32 3, i32 19, i32 4, i32 20, i32 5, i32 21, i32 6, i32 22, i32 7, i32 23, i32 8, i32 24, i32 9, i32 25, i32 10, i32 26, i32 11, i32 27, i32 12, i32 28, i32 13, i32 29, i32 14, i32 30, i32 15, i32 31>
	store <32 x i16> %cat2, <32 x i16>* %d, align 1			store <32 x i16> %cat2, <32 x i16>* %d, align 1
	Show All 26 Lines
	; AVX1-NEXT: vmovups %xmm1, 48(%rsi)			; AVX1-NEXT: vmovups %xmm1, 48(%rsi)
	; AVX1-NEXT: vmovups %xmm3, 32(%rsi)			; AVX1-NEXT: vmovups %xmm3, 32(%rsi)
	; AVX1-NEXT: vmovups %xmm0, 16(%rsi)			; AVX1-NEXT: vmovups %xmm0, 16(%rsi)
	; AVX1-NEXT: vmovups %xmm2, (%rsi)			; AVX1-NEXT: vmovups %xmm2, (%rsi)
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-LABEL: splat2_i32:			; AVX2-LABEL: splat2_i32:
	; AVX2: # %bb.0:			; AVX2: # %bb.0:
	; AVX2-NEXT: vmovups (%rdi), %ymm0			; AVX2-NEXT: vpermpd {{.*#+}} ymm0 = mem[0,2,1,3]
	; AVX2-NEXT: vmovaps {{.*#+}} ymm1 = [0,0,1,1,2,2,3,3]			; AVX2-NEXT: vpermilps {{.*#+}} ymm1 = ymm0[0,0,1,1,4,4,5,5]
	; AVX2-NEXT: vpermps %ymm0, %ymm1, %ymm1			; AVX2-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[2,2,3,3,6,6,7,7]
	; AVX2-NEXT: vmovaps {{.*#+}} ymm2 = [4,4,5,5,6,6,7,7]
	; AVX2-NEXT: vpermps %ymm0, %ymm2, %ymm0
	; AVX2-NEXT: vmovups %ymm0, 32(%rsi)			; AVX2-NEXT: vmovups %ymm0, 32(%rsi)
	; AVX2-NEXT: vmovups %ymm1, (%rsi)			; AVX2-NEXT: vmovups %ymm1, (%rsi)
	; AVX2-NEXT: vzeroupper			; AVX2-NEXT: vzeroupper
	; AVX2-NEXT: retq			; AVX2-NEXT: retq
	%ld32 = load <8 x i32>, <8 x i32>* %s, align 1			%ld32 = load <8 x i32>, <8 x i32>* %s, align 1
	%cat = shufflevector <8 x i32> %ld32, <8 x i32> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>			%cat = shufflevector <8 x i32> %ld32, <8 x i32> undef, <16 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7, i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	%cat2 = shufflevector <16 x i32> %cat, <16 x i32> undef, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>			%cat2 = shufflevector <16 x i32> %cat, <16 x i32> undef, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
	store <16 x i32> %cat2, <16 x i32>* %d, align 1			store <16 x i32> %cat2, <16 x i32>* %d, align 1
	▲ Show 20 Lines • Show All 45 Lines • Show Last 20 Lines

llvm/test/CodeGen/X86/vector-shuffle-256-v8.ll

	Show First 20 Lines • Show All 1,511 Lines • ▼ Show 20 Lines
	define <8 x i32> @shuffle_v8i32_00112233(<8 x i32> %a, <8 x i32> %b) {			define <8 x i32> @shuffle_v8i32_00112233(<8 x i32> %a, <8 x i32> %b) {
	; AVX1-LABEL: shuffle_v8i32_00112233:			; AVX1-LABEL: shuffle_v8i32_00112233:
	; AVX1: # %bb.0:			; AVX1: # %bb.0:
	; AVX1-NEXT: vpermilps {{.*#+}} xmm1 = xmm0[0,0,1,1]			; AVX1-NEXT: vpermilps {{.*#+}} xmm1 = xmm0[0,0,1,1]
	; AVX1-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[2,2,3,3]			; AVX1-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[2,2,3,3]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm1, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2OR512VL-LABEL: shuffle_v8i32_00112233:			; AVX2-SLOW-LABEL: shuffle_v8i32_00112233:
	; AVX2OR512VL: # %bb.0:			; AVX2-SLOW: # %bb.0:
	; AVX2OR512VL-NEXT: vmovaps {{.*#+}} ymm1 = [0,0,1,1,2,2,3,3]			; AVX2-SLOW-NEXT: vpermpd {{.*#+}} ymm0 = ymm0[0,2,1,3]
	; AVX2OR512VL-NEXT: vpermps %ymm0, %ymm1, %ymm0			; AVX2-SLOW-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[0,0,1,1,4,4,5,5]
	; AVX2OR512VL-NEXT: retq			; AVX2-SLOW-NEXT: retq
				;
				; AVX2-FAST-LABEL: shuffle_v8i32_00112233:
				; AVX2-FAST: # %bb.0:
				; AVX2-FAST-NEXT: vmovaps {{.*#+}} ymm1 = [0,0,1,1,2,2,3,3]
				; AVX2-FAST-NEXT: vpermps %ymm0, %ymm1, %ymm0
				; AVX2-FAST-NEXT: retq
				;
				; AVX512VL-SLOW-LABEL: shuffle_v8i32_00112233:
				; AVX512VL-SLOW: # %bb.0:
				; AVX512VL-SLOW-NEXT: vpermpd {{.*#+}} ymm0 = ymm0[0,2,1,3]
				; AVX512VL-SLOW-NEXT: vpermilps {{.*#+}} ymm0 = ymm0[0,0,1,1,4,4,5,5]
				; AVX512VL-SLOW-NEXT: retq
				;
				; AVX512VL-FAST-LABEL: shuffle_v8i32_00112233:
				; AVX512VL-FAST: # %bb.0:
				; AVX512VL-FAST-NEXT: vmovaps {{.*#+}} ymm1 = [0,0,1,1,2,2,3,3]
				; AVX512VL-FAST-NEXT: vpermps %ymm0, %ymm1, %ymm0
				; AVX512VL-FAST-NEXT: retq
	%shuffle = shufflevector <8 x i32> %a, <8 x i32> %b, <8 x i32> <i32 0, i32 0, i32 1, i32 1, i32 2, i32 2, i32 3, i32 3>			%shuffle = shufflevector <8 x i32> %a, <8 x i32> %b, <8 x i32> <i32 0, i32 0, i32 1, i32 1, i32 2, i32 2, i32 3, i32 3>
	ret <8 x i32> %shuffle			ret <8 x i32> %shuffle
	}			}

	define <8 x i32> @shuffle_v8i32_00001111(<8 x i32> %a, <8 x i32> %b) {			define <8 x i32> @shuffle_v8i32_00001111(<8 x i32> %a, <8 x i32> %b) {
	; AVX1-LABEL: shuffle_v8i32_00001111:			; AVX1-LABEL: shuffle_v8i32_00001111:
	; AVX1: # %bb.0:			; AVX1: # %bb.0:
	; AVX1-NEXT: vpermilps {{.*#+}} xmm1 = xmm0[0,0,0,0]			; AVX1-NEXT: vpermilps {{.*#+}} xmm1 = xmm0[0,0,0,0]
	▲ Show 20 Lines • Show All 151 Lines • ▼ Show 20 Lines
	; AVX1-NEXT: vshufps {{.*#+}} xmm2 = xmm2[0,2],xmm1[1,1]			; AVX1-NEXT: vshufps {{.*#+}} xmm2 = xmm2[0,2],xmm1[1,1]
	; AVX1-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3]			; AVX1-NEXT: vblendps {{.*#+}} xmm0 = xmm0[0,1],xmm1[2,3]
	; AVX1-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[1,2,3,3]			; AVX1-NEXT: vpermilps {{.*#+}} xmm0 = xmm0[1,2,3,3]
	; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm2, %ymm0			; AVX1-NEXT: vinsertf128 $1, %xmm0, %ymm2, %ymm0
	; AVX1-NEXT: retq			; AVX1-NEXT: retq
	;			;
	; AVX2-SLOW-LABEL: shuffle_v8i32_08991abb:			; AVX2-SLOW-LABEL: shuffle_v8i32_08991abb:
	; AVX2-SLOW: # %bb.0:			; AVX2-SLOW: # %bb.0:
	; AVX2-SLOW-NEXT: vmovdqa {{.*#+}} ymm2 = <u,0,1,1,u,2,3,3>
	; AVX2-SLOW-NEXT: vpermd %ymm1, %ymm2, %ymm1
	; AVX2-SLOW-NEXT: vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero			; AVX2-SLOW-NEXT: vpmovzxdq {{.*#+}} xmm0 = xmm0[0],zero,xmm0[1],zero
	; AVX2-SLOW-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,1,1,3]			; AVX2-SLOW-NEXT: vpermq {{.*#+}} ymm0 = ymm0[0,1,1,3]
				; AVX2-SLOW-NEXT: vpermq {{.*#+}} ymm1 = ymm1[0,2,1,3]
				; AVX2-SLOW-NEXT: vpshufd {{.*#+}} ymm1 = ymm1[0,0,1,1,4,4,5,5]
	; AVX2-SLOW-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3],ymm0[4],ymm1[5,6,7]			; AVX2-SLOW-NEXT: vpblendd {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3],ymm0[4],ymm1[5,6,7]
	; AVX2-SLOW-NEXT: retq			; AVX2-SLOW-NEXT: retq
	;			;
	; AVX2-FAST-LABEL: shuffle_v8i32_08991abb:			; AVX2-FAST-LABEL: shuffle_v8i32_08991abb:
	; AVX2-FAST: # %bb.0:			; AVX2-FAST: # %bb.0:
	; AVX2-FAST-NEXT: vmovaps {{.*#+}} ymm2 = <0,u,1,u,1,u,u,u>			; AVX2-FAST-NEXT: vmovaps {{.*#+}} ymm2 = <0,u,1,u,1,u,u,u>
	; AVX2-FAST-NEXT: vpermps %ymm0, %ymm2, %ymm0			; AVX2-FAST-NEXT: vpermps %ymm0, %ymm2, %ymm0
	; AVX2-FAST-NEXT: vmovaps {{.*#+}} ymm2 = <u,0,1,1,u,2,3,3>			; AVX2-FAST-NEXT: vmovaps {{.*#+}} ymm2 = [0,0,1,1,2,2,3,3]
	; AVX2-FAST-NEXT: vpermps %ymm1, %ymm2, %ymm1			; AVX2-FAST-NEXT: vpermps %ymm1, %ymm2, %ymm1
	; AVX2-FAST-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3],ymm0[4],ymm1[5,6,7]			; AVX2-FAST-NEXT: vblendps {{.*#+}} ymm0 = ymm0[0],ymm1[1,2,3],ymm0[4],ymm1[5,6,7]
	; AVX2-FAST-NEXT: retq			; AVX2-FAST-NEXT: retq
	;			;
	; AVX512VL-LABEL: shuffle_v8i32_08991abb:			; AVX512VL-LABEL: shuffle_v8i32_08991abb:
	; AVX512VL: # %bb.0:			; AVX512VL: # %bb.0:
	; AVX512VL-NEXT: vmovdqa {{.*#+}} ymm2 = [8,0,1,1,9,2,3,3]			; AVX512VL-NEXT: vmovdqa {{.*#+}} ymm2 = [8,0,1,1,9,2,3,3]
	; AVX512VL-NEXT: vpermi2d %ymm0, %ymm1, %ymm2			; AVX512VL-NEXT: vpermi2d %ymm0, %ymm1, %ymm2
	▲ Show 20 Lines • Show All 1,630 Lines • Show Last 20 Lines
