This is an archive of the discontinued LLVM Phabricator instance.

[X86, AVX] use blends instead of insert128 with index 0
ClosedPublic

Authored by spatel on Mar 16 2015, 3:23 PM.

Details

Summary

Another case of x86-specific shuffle strength reduction: avoid generating insert*128 instructions with index 0 because they are slower than their non-lane-changing blend equivalents.
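To make the equivalence concrete, here is a minimal Python sketch (not from the patch; the function and variable names are illustrative) modeling the lane semantics that make a `vinsertf128` with index 0 interchangeable with a `vblendps`: inserting a 128-bit value into the low half of a 256-bit register selects exactly the same elements as a blend whose mask takes the low four lanes from the new value.

```python
# Model a 256-bit float vector as a list of eight 32-bit lanes.

def insert128(dst, src, index):
    """Semantics of vinsertf128: replace one 128-bit half of dst with src."""
    out = list(dst)
    out[index * 4:(index + 1) * 4] = src
    return out

def blend(a, b, mask):
    """Semantics of vblendps: per-lane select; bit i set -> take b[i]."""
    return [b[i] if (mask >> i) & 1 else a[i] for i in range(8)]

ymm = [float(i) for i in range(8)]      # existing 256-bit value
xmm = [100.0, 101.0, 102.0, 103.0]      # 128-bit value to insert

# Inserting at index 0 only touches the low lane, so it is exactly a
# blend that takes its low four elements from the (widened) new value.
assert insert128(ymm, xmm, 0) == blend(ymm, xmm + [0.0] * 4, 0x0F)

# An insert at index 1 changes which 128-bit lane is written, so it
# corresponds to the complementary blend mask instead.
assert insert128(ymm, xmm, 1) == blend(ymm, [0.0] * 4 + xmm, 0xF0)
```

The blend form is preferable on real hardware because it does not change lanes, whereas the insert is treated as a lane-crossing operation and has higher latency on common microarchitectures.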

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 22057.Mar 16 2015, 3:23 PM
spatel retitled this revision from to [X86, AVX] use blends instead of insert128 with index 0.
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: andreadb, chandlerc, bruno.
spatel added a subscriber: Unknown Object (MLST).
spatel updated this revision to Diff 22104.Mar 17 2015, 10:54 AM

Updated patch: the previous version of the patch added logic to PerformShuffleCombine256(), but that wouldn't catch every case where we create an INSERT_SUBVECTOR. And yes, there was one more shuffle regression test where we expected an 'inserti128 $0'.

In this version, I've moved the code into Insert128BitVector(). No other functional changes.

This should catch every creation of an INSERT_SUBVECTOR that can be optimized with a BLENDI.

andreadb edited edge metadata.Mar 18 2015, 10:24 AM

Hi Sanjay,

test/CodeGen/X86/avx-cast.ll
43–46 ↗(On Diff #22104)

So, the reason your code doesn't optimize this case for AVX1 is that AVX1 doesn't support integer blends on YMM registers.

However, wouldn't the following code be faster in this case (at least on Intel cpus)?

vxorps %ymm1, %ymm1, %ymm1
vblendps $0, %ymm0, %ymm1, %ymm0

I understand that we want to avoid domain-crossing as much as possible.
However, in this particular case I don't think it is possible (please correct me if I am wrong).
Your code would fall back to selecting a 'vinsertf128'. However, as far as I know, 'vinsertf128' executes in the floating point cluster anyway, so I expect (though I haven't tested it) that using 'vblendps/d' would give us the same (or better on Haswell?) throughput. What do you think?

spatel added inline comments.Mar 18 2015, 2:17 PM
test/CodeGen/X86/avx-cast.ll
43–46 ↗(On Diff #22104)

Hi Andrea -

Thanks for the close reading!

Yes - if you only have AVX and you get to this point, then there's no avoiding the domain-crossing because you won't have vinserti128 either.

I'll redo the check to account for this case.

spatel updated this revision to Diff 22282.Mar 19 2015, 11:06 AM
spatel edited edge metadata.

Patch updated based on feedback from Andrea: for integer ops on AVX1, it's better to generate a wrong-domain vblend than a wrong-domain vinsertf128.

Thanks, Sanjay.
I made a couple of comments (see below). Otherwise the patch looks good to me.

lib/Target/X86/X86ISelLowering.cpp
185–186 ↗(On Diff #22282)

So, the INSERT_SUBVECTOR node is always generated regardless of whether 'ScalarType' is floating point or not. I think it makes sense to factor out the common logic between the floating point and the integer case. For example, you can create the INSERT_SUBVECTOR node immediately after line 175. This will allow you to get rid of the code at around line 203.

test/CodeGen/X86/avx-cast.ll
9–12 ↗(On Diff #22282)

Would it be possible to also have a test where the vector insertion is not performed on a zero vector? All the test cases you modified only seem to cover the case where a vector is inserted into the low 128-bit lane of a zero vector.

spatel added inline comments.Mar 19 2015, 2:13 PM
test/CodeGen/X86/avx-cast.ll
9–12 ↗(On Diff #22282)

Let me make sure I understand the scenario. Is it different than this:

define <4 x double> @shuffle_v4f64_0167(<4 x double> %a, <4 x double> %b) {
; ALL-LABEL: shuffle_v4f64_0167:
; ALL:       # BB#0:
; ALL-NEXT:    vblendpd {{.*#+}} ymm0 = ymm0[0,1],ymm1[2,3]
; ALL-NEXT:    retq
  %shuffle = shufflevector <4 x double> %a, <4 x double> %b, <4 x i32> <i32 0, i32 1, i32 6, i32 7>
  ret <4 x double> %shuffle
}

I think the existing shuffle lowering was already detecting the non-zero version, so there are existing test cases that cover it (the above is in vector-shuffle-256-v4.ll). Please let me know if I've misunderstood.

andreadb accepted this revision.Mar 19 2015, 2:52 PM
andreadb edited edge metadata.
andreadb added inline comments.
test/CodeGen/X86/avx-cast.ll
9–12 ↗(On Diff #22282)

That matches what I originally thought: the non-zero cases are handled by other parts of the shuffle lowering logic.
I just wanted to make sure that we had good test coverage :-).
I think your patch is OK. Thanks!

This revision is now accepted and ready to land.Mar 19 2015, 2:52 PM
This revision was automatically updated to reflect the committed changes.

Thanks, Andrea. I hoisted the common INSERT_SUBVECTOR and checked in at r232773.