This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombine] Don't generate atrocious shuffles for constant vectors
ClosedPublic

Authored by mkuper on Jan 21 2015, 6:25 AM.

Details

Reviewers

spatel
RKSimon
andreadb

Commits

rG25e34d11f3ca: [DAGCombine] Produce better code for constant splats
rG84fad3e5c9f7: [DAGCombine] Produce better code for constant splats
rL226811: [DAGCombine] Produce better code for constant splats

Summary

According to PR22276, generating constant vectors can sometimes result in very odd code.
This is especially true for zero vectors, where a single xor should be sufficient, but shuffles are generated instead.

This adds a DAGCombine that removes a redundant shuffle for this case.
Note that lowering the resulting BUILD_VECTOR to a constant broadcast vs. a constant pool load vs. something more efficient (e.g. a xor for the case of 0) is a target decision, and isn't made here.
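
The fold described in the summary can be illustrated with a toy model. This is only a sketch and does not use the LLVM API: `Elt`, `isSplatMask`, and `foldConstantSplat` are hypothetical names, a BUILD_VECTOR operand is modeled as `std::optional<int>` (empty means "not a constant"), and a negative mask entry stands for an undef lane.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Toy stand-in for a BUILD_VECTOR operand: a known constant, or
// std::nullopt for a value that is not a compile-time constant.
using Elt = std::optional<int>;

// True if every mask entry selects the same, defined source lane.
bool isSplatMask(const std::vector<int> &Mask) {
  if (Mask.empty() || Mask[0] < 0)
    return false;
  for (int M : Mask)
    if (M != Mask[0])
      return false;
  return true;
}

// If Mask splats a constant lane of Src, return the vector built directly
// from that constant; otherwise return std::nullopt (no fold applies).
std::optional<std::vector<int>>
foldConstantSplat(const std::vector<Elt> &Src, const std::vector<int> &Mask) {
  if (!isSplatMask(Mask))
    return std::nullopt;
  std::size_t Idx = static_cast<std::size_t>(Mask[0]);
  if (Idx >= Src.size() || !Src[Idx].has_value())
    return std::nullopt;
  return std::vector<int>(Mask.size(), *Src[Idx]);
}
```

In this model, splatting lane 0 of `<0, undef, undef, undef>` yields the all-zero vector directly, with no shuffle left over. How that vector is then materialized (xor, broadcast, constant pool load) is outside the model, just as it is outside the patch.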

Diff Detail

Event Timeline

mkuper updated this revision to Diff 18511. Jan 21 2015, 6:25 AM

mkuper retitled this revision to [DAGCombine] Don't generate atrocious shuffles for constant vectors.

mkuper updated this object.

mkuper edited the test plan for this revision.

mkuper added reviewers: spatel, chandlerc.

mkuper added a subscriber: Unknown Object (MLST).

Thanks Michael - see comment.

test/CodeGen/X86/widen_shuffle-1.ll
92	This looks like it's leaving a shuffle of a constant vector - shouldn't this be constant folded away?

Thanks, Simon!

test/CodeGen/X86/widen_shuffle-1.ll
92	It should, but that would require an additional DAGCombine. What this code ends up generating is two vector_shuffle nodes: one comes directly from the shufflevector and the other appears during type legalization. This patch simplifies the first one, but the other one remains.

Hi Michael -

This is almost exactly what I had in mind (and probably Simon did too as he was looking at this sort of problem as well).

But shortly after filing the bug, I realized that we also need to handle non-const splats better. E.g., this is a simplification of what I'm seeing in loops where the induction variable is splatted in order to vectorize the loop:

define <4 x i64> @splat(i64 %x) {
  %scalar_to_vector = insertelement <4 x i64> undef, i64 %x, i32 0
  %splat = shufflevector <4 x i64> %scalar_to_vector, <4 x i64> undef, <4 x i32> zeroinitializer
  ret <4 x i64> %splat
}

Can you loosen the splat check to handle more than just constants? The particular case above seems ok with AVX2 (vbroadcastsd is generated), but AVX1 still shows:

vmovq	%rdi, %xmm0
vunpcklpd	%xmm0, %xmm0, %xmm0 ## xmm0 = xmm0[0,0]
vinsertf128	$1, %xmm0, %ymm0, %ymm0
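
For readers less familiar with the IR idiom in `@splat`, here is a small sketch of what the insertelement + shufflevector pair computes. It is a toy model and not LLVM code: `Lane`, `Vec4`, `insertIntoUndef`, and `shuffle` are hypothetical names, and undef is modeled as an empty `std::optional`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// A lane is either a defined value or undef (std::nullopt).
using Lane = std::optional<std::int64_t>;
using Vec4 = std::array<Lane, 4>;

// Models: insertelement <4 x i64> undef, i64 %x, i32 0
Vec4 insertIntoUndef(std::int64_t X) {
  Vec4 V{}; // all lanes start out undef
  V[0] = X;
  return V;
}

// Models: shufflevector %v, undef, Mask. Entries outside [0, 4) are either
// undef mask entries or select the undef second operand, so they stay undef.
Vec4 shuffle(const Vec4 &V, const std::array<int, 4> &Mask) {
  Vec4 R{};
  for (std::size_t I = 0; I != 4; ++I)
    if (Mask[I] >= 0 && Mask[I] < 4)
      R[I] = V[static_cast<std::size_t>(Mask[I])];
  return R;
}

// The @splat function from the comment above: broadcast a runtime scalar.
Vec4 splat(std::int64_t X) {
  return shuffle(insertIntoUndef(X), {0, 0, 0, 0});
}
```

The point of the model is that the result is `<x, x, x, x>` for any runtime `x`, so the pattern is a splat whether or not `x` is a constant.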

Hi Michael,

Most of the canonicalization rules performed by the dag combiner on shuffle nodes are the same rules implemented by method 'SelectionDAG::getVectorShuffle' in SelectionDAG.cpp.
If you add a new canonicalization rule in the DAG combiner's 'visitVECTOR_SHUFFLE', then you probably need to update method 'getVectorShuffle' as well. I guess this would (hopefully) fix the problem with the missing constant folding reported by Simon.

I hope this helps!
Andrea

Thanks Sanjay, Andrea.

Sanjay, just removing the condition doesn't fix this example.
What do you think about committing the constant version first, and then looking at the rest?

Andrea, I'm not sure I understood.
Do you prefer the logic to live only in getVectorShuffle() or be replicated both here and there? I'm not sure the first is sufficient. In any case, I'm not sure that would solve the issue Simon reported. That case also involves a bitcast that changes the number of vector elements, and I don't know if that's currently handled. I'll check.

In D7093#111473, @mkuper wrote:

What do you think about committing the constant version first, and then looking at the rest?

Sure, I have no problem with that. I'll file a separate bug so we don't lose it.

In D7093#111473, @mkuper wrote:

Andrea, I'm not sure I understood.
Do you prefer the logic to live only in getVectorShuffle() or be replicated both here and there? I'm not sure the first is sufficient.

I think you would need to fix both places.
When a new shuffle node is created, method 'getVectorShuffle' ensures that the resulting node is always in a canonical form. The reason the same canonicalization rules in 'getVectorShuffle' are performed by 'visitVECTOR_SHUFFLE' is because we want to preserve the canonical form of shuffles when optimizing the dag. For example, without those rules, we would break the canonical form of a shuffle if Undef gets propagated to its first operand. So, my opinion is that you should replicate that rule in both places. It's not ideal, I agree. Ideally if possible, those rules should be factored out into a common function at some point but with the current implementation we need to update both.
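
A sketch of the "common function" idea, using a toy node type instead of the real SelectionDAG classes. Everything here is hypothetical (`Shuffle`, `canonicalize`, `createShuffle`, `replaceFirstOperand`); the single rule shown is the undef-in-first-operand case mentioned above, with operands named by strings and "undef" marking an undef input.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy shuffle node with N = Mask.size() lanes per operand.
struct Shuffle {
  std::string Op0, Op1;
  std::vector<int> Mask; // -1 = undef; [0, N) selects Op0; [N, 2N) selects Op1
};

// One shared rule: "shuffle undef, v" becomes "shuffle v, undef". Lanes that
// read the old undef operand become undef; the rest are re-indexed.
Shuffle canonicalize(Shuffle S) {
  const int N = static_cast<int>(S.Mask.size());
  if (S.Op0 == "undef" && S.Op1 != "undef") {
    std::swap(S.Op0, S.Op1);
    for (int &M : S.Mask)
      M = (M >= N) ? M - N : -1;
  }
  return S;
}

// Creation path (the role getVectorShuffle plays in this discussion).
Shuffle createShuffle(std::string Op0, std::string Op1, std::vector<int> Mask) {
  return canonicalize({std::move(Op0), std::move(Op1), std::move(Mask)});
}

// Combine path (the role visitVECTOR_SHUFFLE plays): an operand changed
// during optimization, so the same rule has to run again.
Shuffle replaceFirstOperand(Shuffle S, std::string NewOp0) {
  S.Op0 = std::move(NewOp0);
  return canonicalize(std::move(S));
}
```

Both entry points call the same helper, so a rule added there cannot be forgotten on one side. Whether the real code can be factored this way is a separate question.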

In any case, I'm not sure that would solve the issue Simon reported. That case also involves a bitcast that changes the number of vector elements, and I don't know if that's currently handled. I'll check.

Right. getVectorShuffle would look through any bitcast, however it would not try to simplify the shuffle if the bitcast changes the number of elements. So it would not fix the problem with that test.
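
To make the element-count restriction concrete, here is a small standalone sketch (plain C++, no LLVM types; `bitcastToV4I32`, `bitcastToV2I64`, and `splatLane` are hypothetical helpers). It models a vector bitcast as a byte-wise reinterpretation and shows that a lane index taken from the <4 x i32> view does not name the same data in the <2 x i64> view.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

using V2I64 = std::array<std::uint64_t, 2>;
using V4I32 = std::array<std::uint32_t, 4>;
static_assert(sizeof(V2I64) == sizeof(V4I32), "bitcast must preserve size");

// Reinterpret the 16 bytes of a <2 x i64> as a <4 x i32>, like a bitcast.
V4I32 bitcastToV4I32(const V2I64 &V) {
  V4I32 R{};
  std::memcpy(R.data(), V.data(), sizeof(R));
  return R;
}

// The inverse reinterpretation.
V2I64 bitcastToV2I64(const V4I32 &V) {
  V2I64 R{};
  std::memcpy(R.data(), V.data(), sizeof(R));
  return R;
}

// Splat one lane across the whole vector.
template <typename T, std::size_t N>
std::array<T, N> splatLane(const std::array<T, N> &V, std::size_t Lane) {
  std::array<T, N> R{};
  R.fill(V[Lane]);
  return R;
}
```

Splatting lane 1 of the 64-bit vector and splatting lane 1 of its 32-bit view give different bit patterns, so a splat index cannot be carried across such a bitcast unchanged.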

Updated getVectorShuffle()

Also fixed a potential bug with respect to bitcasts. I couldn't actually hit this for constants because bitcasts of constants get resolved earlier, but this is safer, and, in any case, is required if the constant constraint is removed.

Hi Michael,

the patch LGTM. Thanks!

This revision is now accepted and ready to land. Jan 22 2015, 3:21 AM

Closed by commit rL226811: [DAGCombine] Produce better code for constant splats (authored by mkuper). Jan 22 2015, 4:38 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                          Size
lib/CodeGen/SelectionDAG/DAGCombiner.cpp (revision 226569)    20 lines
lib/CodeGen/SelectionDAG/SelectionDAG.cpp (revision 226569)   23 lines
test/CodeGen/X86/splat-const.ll (revision 0)                  40 lines
test/CodeGen/X86/sse41.ll (revision 226569)                   4 lines
test/CodeGen/X86/widen_shuffle-1.ll (revision 226569)         4 lines

Diff 18594

lib/CodeGen/SelectionDAG/DAGCombiner.cpp


[... 11,466 lines not shown ...] for (unsigned i = 0; i != NumElts; ++i) {
       }
       NewMask.push_back(Idx);
     }
     if (Changed)
       return DAG.getVectorShuffle(VT, SDLoc(N), N0, N1, &NewMask[0]);
   }

   // If it is a splat, check if the argument vector is another splat or a
-  // build_vector with all scalar elements the same.
+  // build_vector.
   if (SVN->isSplat() && SVN->getSplatIndex() < (int)NumElts) {
     SDNode *V = N0.getNode();

     // If this is a bit convert that changes the element type of the vector but
     // not the number of vector elements, look through it. Be careful not to
     // look though conversions that change things like v4f32 to v2f64.
     if (V->getOpcode() == ISD::BITCAST) {
       SDValue ConvInput = V->getOperand(0);
[... 20 lines not shown ...] if (V->getOpcode() == ISD::BUILD_VECTOR) {
         if (V->getOperand(i) != Base) {
           AllSame = false;
           break;
         }
       }
       // Splat of <x, x, x, x>, return <x, x, x, x>
       if (AllSame)
         return N0;
+
+      // If the splatted element is a constant, just build the vector out of
+      // constants directly.
+      const SDValue &Splatted = V->getOperand(SVN->getSplatIndex());
+      if (isa<ConstantSDNode>(Splatted) || isa<ConstantFPSDNode>(Splatted)) {
+        SmallVector<SDValue, 8> Ops;
+        for (unsigned i = 0; i != NumElts; ++i) {
+          Ops.push_back(Splatted);
+        }
+        SDValue &NewBV = DAG.getNode(ISD::BUILD_VECTOR, SDLoc(N),
+                                     V->getValueType(0), Ops);
+
+        // We may have jumped through bitcasts, so the type of the
+        // BUILD_VECTOR may not match the type of the shuffle.
+        if (V->getValueType(0) != VT)
+          NewBV = DAG.getNode(ISD::BITCAST, SDLoc(N), VT, NewBV);
+        return NewBV;
+      }
     }
   }

   // There are various patterns used to build up a vector from smaller vectors,
   // subvectors, or elements. Scan chains of these and replace unused insertions
   // or components with undef.
   if (SDValue S = simplifyShuffleOperands(SVN, N0, N1, DAG))
     return S;
[... 1,280 lines not shown ...]

lib/CodeGen/SelectionDAG/SelectionDAG.cpp


[... 1,507 lines not shown ...] SDValue SelectionDAG::getVectorShuffle(EVT VT, SDLoc dl, SDValue N1,
   }
   // Reset our undef status after accounting for the mask.
   N2Undef = N2.getOpcode() == ISD::UNDEF;
   // Re-check whether both sides ended up undef.
   if (N1.getOpcode() == ISD::UNDEF && N2Undef)
     return getUNDEF(VT);

   // If Identity shuffle return that node.
-  bool Identity = true;
+  bool Identity = true, AllSame = true;
   for (unsigned i = 0; i != NElts; ++i) {
     if (MaskVec[i] >= 0 && MaskVec[i] != (int)i) Identity = false;
+    if (MaskVec[i] != MaskVec[0]) AllSame = false;
   }
   if (Identity && NElts)
     return N1;

   // Shuffling a constant splat doesn't change the result.
   if (N2Undef) {
     SDValue V = N1;

[... 17 lines not shown ...] if (auto *BV = dyn_cast<BuildVectorSDNode>(V)) {
         // number of elements match or the value splatted is a zero constant.
         if (V.getValueType().getVectorNumElements() ==
             VT.getVectorNumElements())
           return N1;
         if (auto *C = dyn_cast<ConstantSDNode>(Splat))
           if (C->isNullValue())
             return N1;
       }
+
+      // If the shuffle itself creates a constant splat, build the vector
+      // directly.
+      if (AllSame) {
+        const SDValue &Splatted = BV->getOperand(MaskVec[0]);
+        if (isa<ConstantSDNode>(Splatted) || isa<ConstantFPSDNode>(Splatted)) {
+          SmallVector<SDValue, 8> Ops;
+          for (unsigned i = 0; i != NElts; ++i) {
+            Ops.push_back(Splatted);
+          }
+          SDValue &NewBV = getNode(ISD::BUILD_VECTOR, dl,
+                                   BV->getValueType(0), Ops);
+
+          // We may have jumped through bitcasts, so the type of the
+          // BUILD_VECTOR may not match the type of the shuffle.
+          if (BV->getValueType(0) != VT)
+            NewBV = getNode(ISD::BITCAST, dl, VT, NewBV);
+          return NewBV;
+        }
+      }
     }
   }

   FoldingSetNodeID ID;
   SDValue Ops[2] = { N1, N2 };
   AddNodeIDNode(ID, ISD::VECTOR_SHUFFLE, getVTList(VT), Ops);
   for (unsigned i = 0; i != NElts; ++i)
     ID.AddInteger(MaskVec[i]);
[... 5,304 lines not shown ...]

test/CodeGen/X86/splat-const.ll

+; RUN: llc < %s -mcpu=penryn | FileCheck %s --check-prefix=SSE
+; RUN: llc < %s -mcpu=sandybridge | FileCheck %s --check-prefix=AVX
+; RUN: llc < %s -mcpu=haswell | FileCheck %s --check-prefix=AVX2
+; This checks that lowering for creation of constant vectors is sane and
+; doesn't use redundant shuffles. (fixes PR22276)
+target triple = "x86_64-unknown-unknown"
+
+define <4 x i32> @zero_vector() {
+; SSE-LABEL: zero_vector:
+; SSE: xorps %xmm0, %xmm0
+; SSE-NEXT: retq
+; AVX-LABEL: zero_vector:
+; AVX: vxorps %xmm0, %xmm0, %xmm0
+; AVX-NEXT: retq
+; AVX2-LABEL: zero_vector:
+; AVX2: vxorps %xmm0, %xmm0, %xmm0
+; AVX2-NEXT: retq
+  %zero = insertelement <4 x i32> undef, i32 0, i32 0
+  %splat = shufflevector <4 x i32> %zero, <4 x i32> undef, <4 x i32> zeroinitializer
+  ret <4 x i32> %splat
+}
+
+; Note that for the "const_vector" versions, lowering that uses a shuffle
+; instead of a load would be legitimate, if it's a single broadcast shuffle.
+; (as opposed to the previous mess)
+; However, this is not the current preferred lowering.
+define <4 x i32> @const_vector() {
+; SSE-LABEL: const_vector:
+; SSE: movaps {{.*}}, %xmm0 # xmm0 = [42,42,42,42]
+; SSE-NEXT: retq
+; AVX-LABEL: const_vector:
+; AVX: vmovaps {{.*}}, %xmm0 # xmm0 = [42,42,42,42]
+; AVX-NEXT: retq
+; AVX2-LABEL: const_vector:
+; AVX2: vbroadcastss {{[^%].*}}, %xmm0
+; AVX2-NEXT: retq
+  %const = insertelement <4 x i32> undef, i32 42, i32 0
+  %splat = shufflevector <4 x i32> %const, <4 x i32> undef, <4 x i32> zeroinitializer
+  ret <4 x i32> %splat
+}

test/CodeGen/X86/sse41.ll

[... 999 lines not shown ...] ; X64-NEXT: retq
   ret <4 x float> %ret
 }

 ; Edge case for insertps where we end up with a shuffle with mask=<0, 7, -1, -1>
 define void @insertps_pr20411(i32* noalias nocapture %RET) #1 {
 ; X32-LABEL: insertps_pr20411:
 ; X32: ## BB#0:
 ; X32-NEXT: movl {{[0-9]+}}(%esp), %eax
-; X32-NEXT: pshufd {{.*#+}} xmm0 = mem[3,1,2,3]
+; X32-NEXT: movaps {{.*#+}} xmm0 = [3,3,3,3]
 ; X32-NEXT: insertps $-36, LCPI49_1+12, %xmm0
 ; X32-NEXT: movups %xmm0, (%eax)
 ; X32-NEXT: retl
 ;
 ; X64-LABEL: insertps_pr20411:
 ; X64: ## BB#0:
-; X64-NEXT: pshufd {{.*#+}} xmm0 = mem[3,1,2,3]
+; X64-NEXT: movaps {{.*#+}} xmm0 = [3,3,3,3]
 ; X64-NEXT: insertps $-36, LCPI49_1+{{.*}}(%rip), %xmm0
 ; X64-NEXT: movups %xmm0, (%rdi)
 ; X64-NEXT: retq
   %gather_load = shufflevector <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>, <8 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
   %shuffle109 = shufflevector <4 x i32> <i32 4, i32 5, i32 6, i32 7>, <4 x i32> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3> ; 4 5 6 7
   %shuffle116 = shufflevector <8 x i32> %gather_load, <8 x i32> undef, <4 x i32> <i32 3, i32 undef, i32 undef, i32 undef> ; 3 x x x
   %shuffle117 = shufflevector <4 x i32> %shuffle109, <4 x i32> %shuffle116, <4 x i32> <i32 4, i32 3, i32 undef, i32 undef> ; 3 7 x x
   %ptrcast = bitcast i32* %RET to <4 x i32>*
[... 179 lines not shown ...]

test/CodeGen/X86/widen_shuffle-1.ll

[... 76 lines not shown ...] ; CHECK-NEXT: retl
   ret <8 x i8> %vshuf
 }

 ; PR11389: another CONCAT_VECTORS case
 define void @shuf5(<8 x i8>* %p) nounwind {
 ; CHECK-LABEL: shuf5:
 ; CHECK: # BB#0:
 ; CHECK-NEXT: movl {{[0-9]+}}(%esp), %eax
-; CHECK-NEXT: movdqa {{.*#+}} xmm0 = <4,33,u,u,u,u,u,u>
-; CHECK-NEXT: pshufb {{.*#+}} xmm0 = xmm0[2,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
+; CHECK-NEXT: movdqa {{.*#+}} xmm0 = [33,33,33,33,33,33,33,33]
+; CHECK-NEXT: pshufb {{.*#+}} xmm0 = xmm0[0,2,4,6,8,10,12,14,u,u,u,u,u,u,u,u]
 ; CHECK-NEXT: movlpd %xmm0, (%eax)
 ; CHECK-NEXT: retl
   %v = shufflevector <2 x i8> <i8 4, i8 33>, <2 x i8> undef, <8 x i32> <i32 1, i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
   store <8 x i8> %v, <8 x i8>* %p, align 8
   ret void
 }