This is an archive of the discontinued LLVM Phabricator instance.

Generate better code for shuffles
ClosedPublic

Authored by mkuper on Dec 16 2014, 2:57 AM.

Details

Reviewers

spatel
chandlerc
nadav
delena
andreadb

Commits

rG047b1a0400fe: [DAGCombine] Slightly improve lowering of BUILD_VECTOR into a shuffle.
rL224429: [DAGCombine] Slightly improve lowering of BUILD_VECTOR into a shuffle.

Summary

This fixes PR15872. The generated code improves from:

vpextrd $1, %xmm0, %eax
vmovd   %xmm0, %ecx
vmovd   %ecx, %xmm1
vpinsrd $1, %eax, %xmm1, %xmm1
vextractf128    $1, %ymm0, %xmm2
vmovd   %xmm2, %eax
vpinsrd $2, %eax, %xmm1, %xmm1
vpextrd $1, %xmm2, %eax
vpinsrd $3, %eax, %xmm1, %xmm1
vpextrd $3, %xmm0, %eax
vpextrd $2, %xmm0, %ecx
vmovd   %ecx, %xmm0
vpinsrd $1, %eax, %xmm0, %xmm0
vpextrd $2, %xmm2, %eax
vpinsrd $2, %eax, %xmm0, %xmm0
vpextrd $3, %xmm2, %eax
vpinsrd $3, %eax, %xmm0, %xmm0
vmovdqa %xmm1, (%rdi)
vzeroupper
retq

to:

vextractf128    $1, %ymm0, %xmm1
vpunpcklqdq     %xmm1, %xmm0, %xmm2 # xmm2 = xmm0[0],xmm1[0]
vpunpckhqdq     %xmm1, %xmm0, %xmm0 # xmm0 = xmm0[1],xmm1[1]
vmovdqa %xmm2, (%rdi)
vzeroupper
retq
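
As a rough illustration of what the shorter sequence (the one starting with `vextractf128`) computes, here is a scalar C++ model. It is only a sketch: the type aliases and helper names (`V4`, `V8`, `lowerHalf`, `upperHalf`, `unpackLowQ`, `unpackHighQ`) are made up for this example and are not LLVM APIs, and the operand order follows the `xmm0[0],xmm1[0]` annotations in the listing above.

```cpp
#include <array>
#include <cassert>

using V4 = std::array<int, 4>; // models an xmm register holding 4 x i32
using V8 = std::array<int, 8>; // models a ymm register holding 8 x i32

// Low 128 bits of the ymm value (what %xmm0 aliases).
V4 lowerHalf(const V8 &V) { return V4{{V[0], V[1], V[2], V[3]}}; }

// Models `vextractf128 $1`: the upper 128 bits of the ymm value.
V4 upperHalf(const V8 &V) { return V4{{V[4], V[5], V[6], V[7]}}; }

// Models vpunpcklqdq: the low 64 bits (two i32s) of A, then the low 64 bits of B.
V4 unpackLowQ(const V4 &A, const V4 &B) {
  return V4{{A[0], A[1], B[0], B[1]}};
}

// Models vpunpckhqdq: the high 64 bits (two i32s) of A, then the high 64 bits of B.
V4 unpackHighQ(const V4 &A, const V4 &B) {
  return V4{{A[2], A[3], B[2], B[3]}};
}
```

Under this model, `unpackLowQ(lowerHalf(v), upperHalf(v))` picks elements 0, 1, 4, 5 of `v`, and `unpackHighQ` picks elements 2, 3, 6, 7.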

Sanjay has a fix for PR21711, which apparently shares the same underlying issue: http://reviews.llvm.org/D6622

This version is more general, but it may be too general. I'm fine with anything in this vein that fixes both PRs.

Diff Detail

Repository: rL LLVM

Event Timeline

mkuper updated this revision to Diff 17327. Dec 16 2014, 2:57 AM

mkuper retitled this revision to Generate better code for shuffles.

mkuper edited the test plan for this revision.

mkuper added reviewers: nadav, andreadb, spatel, chandlerc, delena.

Hi Michael -

Thanks for notifying me about this patch. Can you please rebase the patch against a newer trunk rev? Each hunk failed to patch for me on r224339.

Rebased patch.

By the way, Sanjay, I've verified that the test-cases you have in vec_extract-avx.ll pass with this.

In D6678#101948, @mkuper wrote:

Rebased patch.

This turned out to just be a problem with Windows line-endings in your patch. It didn't patch on a Mac because of that. Not sure if that gets cleaned up on submission or if you need to change that before committing.

In D6678#101948, @mkuper wrote:

By the way, Sanjay, I've verified that the test-cases you have in vec_extract-avx.ll pass with this.

Excellent! I'll leave it to the other reviewers to decide if we need a TLI hook for this; otherwise, LGTM.

This patch is better than D6622 because it handles cases that D6622 bails out on, e.g. when we're extracting across the 128-bit boundary, as in your test case, or:

define void @low_v8f32_to_v4f32(<8 x float> %v, <4 x float>* %ptr) {
 %ext0 = extractelement <8 x float> %v, i32 1   
 %ext1 = extractelement <8 x float> %v, i32 2
 %ext2 = extractelement <8 x float> %v, i32 3
 %ext3 = extractelement <8 x float> %v, i32 4   ; crossing the 128-bit lane
 %ins0 = insertelement <4 x float> undef, float %ext0, i32 0
 %ins1 = insertelement <4 x float> %ins0, float %ext1, i32 1
 %ins2 = insertelement <4 x float> %ins1, float %ext2, i32 2
 %ins3 = insertelement <4 x float> %ins2, float %ext3, i32 3
 store <4 x float> %ins3, <4 x float>* %ptr, align 16
 ret void
}
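
To see why splitting the wide input into two half-width vectors can handle this lane-crossing case without touching the shuffle mask, here is a small scalar model in C++. It is a sketch under assumptions: `shuffle`, `extractHalf`, `viaSplit`, and `viaScalars` are hypothetical helpers written for this example, and `shuffle` assumes the usual two-input convention where mask values 0..3 select from the first input and 4..7 from the second.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using V4 = std::array<int, 4>;
using V8 = std::array<int, 8>;

// Two-input shuffle: mask values 0..3 read from A, values 4..7 read from B.
V4 shuffle(const V4 &A, const V4 &B, const V4 &Mask) {
  V4 R{};
  for (std::size_t i = 0; i < 4; ++i)
    R[i] = Mask[i] < 4 ? A[Mask[i]] : B[Mask[i] - 4];
  return R;
}

// Models a half-vector extract starting at element Start (0 or 4).
V4 extractHalf(const V8 &V, std::size_t Start) {
  V4 R{};
  for (std::size_t i = 0; i < 4; ++i)
    R[i] = V[Start + i];
  return R;
}

// Split the wide vector into its two halves and shuffle them. The mask still
// holds the original element indices into the 8-wide vector.
V4 viaSplit(const V8 &V, const V4 &Mask) {
  return shuffle(extractHalf(V, 0), extractHalf(V, 4), Mask);
}

// Reference: one scalar extract per output element.
V4 viaScalars(const V8 &V, const V4 &Mask) {
  V4 R{};
  for (std::size_t i = 0; i < 4; ++i)
    R[i] = V[Mask[i]];
  return R;
}
```

For the example above, the mask is `{1, 2, 3, 4}`: the first three indices fall in the low half and index 4 becomes element 0 of the high half, so both functions return the same four elements.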

You're avoiding the x86 miscompile that I saw by only generating extract_subvectors along half-vector-length boundaries. The case that I saw failing would be caused by a node like this:

DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, VT, VecIn1, DAG.getIntPtrConstant(1)); // not a half-vector extract

When that ends up in X86ISelLowering's Extract128BitVector(), we quietly round the index down with this integer math:

unsigned NormalizedIdxVal = (((IdxVal * ElVT.getSizeInBits()) / vectorWidth) * ElemsPerChunk);

The comment around there says:

// Idx is an index in the 128 bits we want. It need not be aligned to a 128-bit boundary.

While that matches what the integer math does, I think this code should be asserting that the index is a correct multiple of the number of elements. Otherwise, we're generating an extract that doesn't execute what the input node is asking for.
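
The rounding is easy to reproduce outside of LLVM. The following C++ sketch copies the quoted expression into a standalone function; `normalizeIdx` and `isChunkAligned` are hypothetical names for this example, and computing `ElemsPerChunk` as the vector width divided by the element size is an assumption about the surrounding code.

```cpp
#include <cassert>

// Standalone model of the quoted index normalization. ElemBits stands in for
// ElVT.getSizeInBits(); ElemsPerChunk is assumed to be VectorWidth / ElemBits.
constexpr unsigned normalizeIdx(unsigned IdxVal, unsigned ElemBits,
                                unsigned VectorWidth) {
  // Integer division drops the remainder, so every index inside a chunk
  // collapses to the first element of that chunk.
  return ((IdxVal * ElemBits) / VectorWidth) * (VectorWidth / ElemBits);
}

// The kind of check an assert could perform: the index must start a chunk.
constexpr bool isChunkAligned(unsigned IdxVal, unsigned ElemBits,
                              unsigned VectorWidth) {
  return (IdxVal * ElemBits) % VectorWidth == 0;
}

// 32-bit elements, 128-bit chunks: index 1 silently becomes 0.
static_assert(normalizeIdx(1, 32, 128) == 0, "rounded down to chunk start");
static_assert(normalizeIdx(4, 32, 128) == 4, "aligned index is unchanged");
static_assert(normalizeIdx(7, 32, 128) == 4, "rounded down to chunk start");
static_assert(!isChunkAligned(1, 32, 128), "index 1 is not chunk-aligned");
```

With an index of 1, the node asks for elements 1..4 but the normalized extract would return elements 0..3.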

spatel mentioned this in D6622: generate extract_subvector node to avoid disastrous shuffle vector codegen. Dec 16 2014, 10:50 AM

The line-endings get cleaned up on submission, thankfully.

Regarding the miscompile - yes, that's what I thought. I wasn't sure whether it was a miscompile or whether some magic was going on behind the scenes, but I avoided it anyway, since generating "unaligned" extracts didn't make sense to me from a performance standpoint.
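
The decision this patch makes for a mismatched input type can be summarized with a small C++ model. This is a sketch, not the DAGCombiner code: `Action`, `classify`, and `isHalfVectorIndex` are hypothetical names, and the `HalfExtractIsCheap` flag stands in for the answer from the target hook.

```cpp
#include <cassert>

enum class Action { Widen, Split, Bail };

// Model of the size check: an input half as wide as the output is widened,
// an input twice as wide is split into two half-vector extracts (only if the
// target reports that such an extract is cheap), anything else is rejected.
constexpr Action classify(unsigned InBits, unsigned OutBits,
                          bool HalfExtractIsCheap) {
  return InBits * 2 == OutBits
             ? Action::Widen
             : (InBits == OutBits * 2 && HalfExtractIsCheap) ? Action::Split
                                                             : Action::Bail;
}

// Model of the index test in the X86 hook: only extracts that start at
// element 0 or at the half-way point count as cheap.
constexpr bool isHalfVectorIndex(unsigned ResNumElts, unsigned Index) {
  return Index == 0 || Index == ResNumElts;
}

static_assert(classify(128, 256, false) == Action::Widen, "XMM -> YMM");
static_assert(classify(256, 128, true) == Action::Split, "YMM -> 2 x XMM");
static_assert(classify(256, 128, false) == Action::Bail, "target said no");
static_assert(!isHalfVectorIndex(4, 1), "unaligned extract is never used");
```

Because the split only ever uses indices 0 and the half-way point, the "unaligned" extract discussed above is never created by this path.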

Regarding D6622 - it's good to hear I didn't miss anything. I'll add the test-case from D6622 to my commit.

spatel added inline comments. Dec 16 2014, 11:23 AM

test/CodeGen/X86/vector-shuffle-combining.ll, line 1566 (on Diff #17337):
This isn't passing 'make check' for me. On an AVX2 machine, we generate 'vextracti128' (the integer flavor of the extract instruction).

Fixed my own test (thanks, Sanjay) and added the test case from D6622.

Hi Michael,

thanks for the patch. It looks good to me!

Cheers,
Andrea

This revision is now accepted and ready to land. Dec 17 2014, 3:23 AM

Closed by commit rL224429: [DAGCombine] Slightly improve lowering of BUILD_VECTOR into a shuffle. (authored by mkuper). Dec 17 2014, 4:33 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                        Size
llvm/trunk/include/llvm/Target/TargetLowering.h             9 lines
llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp         33 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.h                 4 lines
llvm/trunk/lib/Target/X86/X86ISelLowering.cpp               8 lines
llvm/trunk/test/CodeGen/X86/vec_extract-avx.ll              82 lines
llvm/trunk/test/CodeGen/X86/vector-shuffle-combining.ll     31 lines

Diff 17386

llvm/trunk/include/llvm/Target/TargetLowering.h

@@ ... @@ public:
   /// just the constant itself.
   /// On some targets it might be more efficient to use a combination of
   /// arithmetic instructions to materialize the constant instead of loading it
   /// from a constant pool.
   virtual bool shouldConvertConstantLoadToIntImm(const APInt &Imm,
                                                  Type *Ty) const {
     return false;
   }

+  /// Return true if EXTRACT_SUBVECTOR is cheap for this result type
+  /// with this index. This is needed because EXTRACT_SUBVECTOR usually
+  /// has custom lowering that depends on the index of the first element,
+  /// and only the target knows which lowering is cheap.
+  virtual bool isExtractSubvectorCheap(EVT ResVT, unsigned Index) const {
+    return false;
+  }
+
   //===--------------------------------------------------------------------===//
   // Runtime Library hooks
   //

   /// Rename the default libcall routine name for the specified libcall.
   void setLibcallName(RTLIB::Libcall Call, const char *Name) {
     LibcallRoutineNames[Call] = Name;
   }

llvm/trunk/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

@@ ... @@ for (unsigned i = 0; i != NumInScalars; ++i) {
         continue;
       }

       // If extracting from the first vector, just use the index directly.
       SDValue Extract = N->getOperand(i);
       SDValue ExtVal = Extract.getOperand(1);
       unsigned ExtIndex = cast<ConstantSDNode>(ExtVal)->getZExtValue();
       if (Extract.getOperand(0) == VecIn1) {
-        if (ExtIndex > VT.getVectorNumElements())
-          return SDValue();
-
         Mask.push_back(ExtIndex);
         continue;
       }

       // Otherwise, use InIdx + VecSize
       Mask.push_back(NumInScalars+ExtIndex);
     }

     // Avoid introducing illegal shuffles with zero.
     if (UsesZeroVector && !TLI.isVectorClearMaskLegal(Mask, VT))
       return SDValue();

     // We can't generate a shuffle node with mismatched input and output types.
     // Attempt to transform a single input vector to the correct type.
     if ((VT != VecIn1.getValueType())) {
       // We don't support shuffeling between TWO values of different types.
       if (VecIn2.getNode())
         return SDValue();

-      // We only support widening of vectors which are half the size of the
-      // output registers. For example XMM->YMM widening on X86 with AVX.
-      if (VecIn1.getValueType().getSizeInBits()*2 != VT.getSizeInBits())
-        return SDValue();
-
       // If the input vector type has a different base type to the output
       // vector type, bail out.
       if (VecIn1.getValueType().getVectorElementType() !=
           VT.getVectorElementType())
         return SDValue();

+      // If the input vector is too small, widen it.
+      // We only support widening of vectors which are half the size of the
+      // output registers. For example XMM->YMM widening on X86 with AVX.
+      EVT VecInT = VecIn1.getValueType();
+      if (VecInT.getSizeInBits() * 2 == VT.getSizeInBits()) {
         // Widen the input vector by adding undef values.
         VecIn1 = DAG.getNode(ISD::CONCAT_VECTORS, dl, VT,
                              VecIn1, DAG.getUNDEF(VecIn1.getValueType()));
+      } else if (VecInT.getSizeInBits() == VT.getSizeInBits() * 2) {
+        // If the input vector is too large, try to split it.
+        if (!TLI.isExtractSubvectorCheap(VT, VT.getVectorNumElements()))
+          return SDValue();
+
+        // Try to replace VecIn1 with two extract_subvectors
+        // No need to update the masks, they should still be correct.
+        VecIn2 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, VT, VecIn1,
+          DAG.getConstant(VT.getVectorNumElements(), TLI.getVectorIdxTy()));
+        VecIn1 = DAG.getNode(ISD::EXTRACT_SUBVECTOR, dl, VT, VecIn1,
+          DAG.getConstant(0, TLI.getVectorIdxTy()));
+        UsesZeroVector = false;
+      } else
+        return SDValue();
     }

     if (UsesZeroVector)
       VecIn2 = VT.isInteger() ? DAG.getConstant(0, VT) :
                                 DAG.getConstantFP(0.0, VT);
     else
       // If VecIn2 is unused then change it to undef.
       VecIn2 = VecIn2.getNode() ? VecIn2 : DAG.getUNDEF(VT);

llvm/trunk/lib/Target/X86/X86ISelLowering.h

@@ ... @@ bool isIntegerTypeFTOL(EVT VT) const {
       return isTargetFTOL() && VT == MVT::i64;
     }

     /// \brief Returns true if it is beneficial to convert a load of a constant
     /// to just the constant itself.
     bool shouldConvertConstantLoadToIntImm(const APInt &Imm,
                                            Type *Ty) const override;

+    /// Return true if EXTRACT_SUBVECTOR is cheap for this result type
+    /// with this index.
+    bool isExtractSubvectorCheap(EVT ResVT, unsigned Index) const override;
+
     /// Intel processors have a unified instruction and data cache
     const char * getClearCacheBuiltinName() const override {
       return nullptr; // nothing to do, move along.
     }

     unsigned getRegisterByName(const char* RegName, EVT VT) const override;

     /// This method returns a target specific FastISel object,

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

@@ ... @@ bool X86TargetLowering::shouldConvertConstantLoadToIntImm(const APInt &Imm,
   assert(Ty->isIntegerTy());

   unsigned BitSize = Ty->getPrimitiveSizeInBits();
   if (BitSize == 0 || BitSize > 64)
     return false;
   return true;
 }

+bool X86TargetLowering::isExtractSubvectorCheap(EVT ResVT,
+                                                unsigned Index) const {
+  if (!isOperationLegalOrCustom(ISD::EXTRACT_SUBVECTOR, ResVT))
+    return false;
+
+  return (Index == 0 || Index == ResVT.getVectorNumElements());
+}
+
 /// isUndefOrInRange - Return true if Val is undef or if its value falls within
 /// the specified range (L, H].
 static bool isUndefOrInRange(int Val, int Low, int Hi) {
   return (Val < 0) || (Val >= Low && Val < Hi);
 }

 /// isUndefOrEqual - Val is either less than zero (undef) or equal to the
 /// specified value.

llvm/trunk/test/CodeGen/X86/vec_extract-avx.ll

+target triple = "x86_64-unknown-unknown"
+
+; RUN: llc < %s -march=x86-64 -mattr=+avx | FileCheck %s
+
+; When extracting multiple consecutive elements from a larger
+; vector into a smaller one, do it efficiently. We should use
+; an EXTRACT_SUBVECTOR node internally rather than a bunch of
+; single element extractions.
+
+; Extracting the low elements only requires using the right kind of store.
+define void @low_v8f32_to_v4f32(<8 x float> %v, <4 x float>* %ptr) {
+  %ext0 = extractelement <8 x float> %v, i32 0
+  %ext1 = extractelement <8 x float> %v, i32 1
+  %ext2 = extractelement <8 x float> %v, i32 2
+  %ext3 = extractelement <8 x float> %v, i32 3
+  %ins0 = insertelement <4 x float> undef, float %ext0, i32 0
+  %ins1 = insertelement <4 x float> %ins0, float %ext1, i32 1
+  %ins2 = insertelement <4 x float> %ins1, float %ext2, i32 2
+  %ins3 = insertelement <4 x float> %ins2, float %ext3, i32 3
+  store <4 x float> %ins3, <4 x float>* %ptr, align 16
+  ret void
+
+; CHECK-LABEL: low_v8f32_to_v4f32
+; CHECK: vmovaps
+; CHECK-NEXT: vzeroupper
+; CHECK-NEXT: retq
+}
+
+; Extracting the high elements requires just one AVX instruction.
+define void @high_v8f32_to_v4f32(<8 x float> %v, <4 x float>* %ptr) {
+  %ext0 = extractelement <8 x float> %v, i32 4
+  %ext1 = extractelement <8 x float> %v, i32 5
+  %ext2 = extractelement <8 x float> %v, i32 6
+  %ext3 = extractelement <8 x float> %v, i32 7
+  %ins0 = insertelement <4 x float> undef, float %ext0, i32 0
+  %ins1 = insertelement <4 x float> %ins0, float %ext1, i32 1
+  %ins2 = insertelement <4 x float> %ins1, float %ext2, i32 2
+  %ins3 = insertelement <4 x float> %ins2, float %ext3, i32 3
+  store <4 x float> %ins3, <4 x float>* %ptr, align 16
+  ret void
+
+; CHECK-LABEL: high_v8f32_to_v4f32
+; CHECK: vextractf128
+; CHECK-NEXT: vzeroupper
+; CHECK-NEXT: retq
+}
+
+; Make sure element type doesn't alter the codegen. Note that
+; if we were actually using the vector in this function and
+; have AVX2, we should generate vextracti128 (the int version).
+define void @high_v8i32_to_v4i32(<8 x i32> %v, <4 x i32>* %ptr) {
+  %ext0 = extractelement <8 x i32> %v, i32 4
+  %ext1 = extractelement <8 x i32> %v, i32 5
+  %ext2 = extractelement <8 x i32> %v, i32 6
+  %ext3 = extractelement <8 x i32> %v, i32 7
+  %ins0 = insertelement <4 x i32> undef, i32 %ext0, i32 0
+  %ins1 = insertelement <4 x i32> %ins0, i32 %ext1, i32 1
+  %ins2 = insertelement <4 x i32> %ins1, i32 %ext2, i32 2
+  %ins3 = insertelement <4 x i32> %ins2, i32 %ext3, i32 3
+  store <4 x i32> %ins3, <4 x i32>* %ptr, align 16
+  ret void
+
+; CHECK-LABEL: high_v8i32_to_v4i32
+; CHECK: vextractf128
+; CHECK-NEXT: vzeroupper
+; CHECK-NEXT: retq
+}
+
+; Make sure that element size doesn't alter the codegen.
+define void @high_v4f64_to_v2f64(<4 x double> %v, <2 x double>* %ptr) {
+  %ext0 = extractelement <4 x double> %v, i32 2
+  %ext1 = extractelement <4 x double> %v, i32 3
+  %ins0 = insertelement <2 x double> undef, double %ext0, i32 0
+  %ins1 = insertelement <2 x double> %ins0, double %ext1, i32 1
+  store <2 x double> %ins1, <2 x double>* %ptr, align 16
+  ret void
+
+; CHECK-LABEL: high_v4f64_to_v2f64
+; CHECK: vextractf128
+; CHECK-NEXT: vzeroupper
+; CHECK-NEXT: retq
+}

llvm/trunk/test/CodeGen/X86/vector-shuffle-combining.ll

@@ ... @@
 ; AVX2: # BB#0:
 ; AVX2-NEXT: vpblendd {{.*#+}} xmm0 = xmm1[0],xmm0[1],xmm1[2,3]
 ; AVX2-NEXT: retq
   %1 = shufflevector <4 x i32> %a, <4 x i32> %b, <4 x i32> <i32 4, i32 1, i32 6, i32 7>
   %2 = shufflevector <4 x i32> %1, <4 x i32> %a, <4 x i32> <i32 0, i32 5, i32 2, i32 3>
   ret <4 x i32> %2
 }

+define <4 x i32> @combine_test21(<8 x i32> %a, <4 x i32>* %ptr) {
+; SSE-LABEL: combine_test21:
+; SSE: # BB#0:
+; SSE-NEXT: movdqa %xmm0, %xmm2
+; SSE-NEXT: punpcklqdq {{.*#+}} xmm2 = xmm2[0],xmm1[0]
+; SSE-NEXT: punpckhqdq {{.*#+}} xmm0 = xmm0[1],xmm1[1]
+; SSE-NEXT: movdqa %xmm2,
+; SSE-NEXT: retq
+;
+; AVX1-LABEL: combine_test21:
+; AVX1: # BB#0:
+; AVX1-NEXT: vextractf128 $1, %ymm0, %xmm1
+; AVX1-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm0[0],xmm1[0]
+; AVX1-NEXT: vpunpckhqdq {{.*#+}} xmm0 = xmm0[1],xmm1[1]
+; AVX1-NEXT: movdqa %xmm2,
+; AVX1-NEXT: vzeroupper
+; AVX1-NEXT: retq
+;
+; AVX2-LABEL: combine_test21:
+; AVX2: # BB#0:
+; AVX2-NEXT: vextracti128 $1, %ymm0, %xmm1
+; AVX2-NEXT: vpunpcklqdq {{.*#+}} xmm2 = xmm0[0],xmm1[0]
+; AVX2-NEXT: vpunpckhqdq {{.*#+}} xmm0 = xmm0[1],xmm1[1]
+; AVX2-NEXT: movdqa %xmm2,
+; AVX2-NEXT: vzeroupper
+; AVX2-NEXT: retq
+  %1 = shufflevector <8 x i32> %a, <8 x i32> %a, <4 x i32> <i32 0, i32 1, i32 4, i32 5>
+  %2 = shufflevector <8 x i32> %a, <8 x i32> %a, <4 x i32> <i32 2, i32 3, i32 6, i32 7>
+  store <4 x i32> %1, <4 x i32>* %ptr, align 16
+  ret <4 x i32> %2
+}
+
 ; Check some negative cases.
 ; FIXME: Do any of these really make sense? Are they redundant with the above tests?

 define <4 x float> @combine_test1b(<4 x float> %a, <4 x float> %b) {
 ; SSE-LABEL: combine_test1b:
 ; SSE: # BB#0:
 ; SSE-NEXT: shufps {{.*#+}} xmm1 = xmm1[0,1,2,0]