This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Combine small-element shuffles of scalar_to_vector in terms of the wider scalar.
AbandonedPublic

Authored by ab on Apr 7 2015, 5:16 PM.

Details

Reviewers

spatel
qcolombet
t.p.northover

Summary

Continuing on D8883 and D8884, now that (if?) we decided on the (vector_shuffle (bitcast (scalar_to_vector))) form, we can further try to use the wider shuffle element type that was the input of the scalar_to_vector, if the shuffle mask permits it.

In practice this lets us recognize special patterns (see the DUP testcase) without needing to teach everything to deal with "this might be an i32 DUP shuffle, but is expressed in terms of i8 <0,1,2,3> sequences"
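
To make the "if the shuffle mask permits it" condition concrete, here is a minimal standalone C++17 sketch of the check. It does not use LLVM's APIs: widenShuffleMask is a hypothetical name, and negative entries stand for undef, as in LLVM shuffle masks. The assumption is that a mask over narrow elements can be rewritten over elements `scale` times wider only when each group of `scale` entries is either all undef or a run of consecutive indices starting on a wide-element boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper, not an LLVM API: try to re-express a shuffle mask over
// narrow elements as a mask over elements that are `scale` times wider.
// Negative entries mean "undef".
std::optional<std::vector<int>> widenShuffleMask(const std::vector<int> &mask,
                                                 int scale) {
  assert(scale > 0 && "scale must be positive");
  if (mask.size() % static_cast<std::size_t>(scale) != 0)
    return std::nullopt;

  std::vector<int> wide;
  for (std::size_t i = 0; i < mask.size(); i += scale) {
    const int first = mask[i];
    // A defined group must start on a wide-element boundary.
    if (first >= 0 && first % scale != 0)
      return std::nullopt;
    for (int j = 0; j < scale; ++j) {
      const int m = mask[i + j];
      // Either the whole group is undef, or it is first, first+1, ...
      if ((m < 0) != (first < 0) || (m >= 0 && m != first + j))
        return std::nullopt;
    }
    wide.push_back(first < 0 ? -1 : first / scale);
  }
  return wide;
}
```

With this sketch, the narrow mask <0,1,0,1,0,1,0,1> and scale 2 widens to <0,0,0,0>, a splat of wide element 0, while <0,0,1,0,1,0,1,0> is rejected because its first group is not a consecutive run.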

Diff Detail

Event Timeline

ab updated this revision to Diff 23384. Apr 7 2015, 5:16 PM

ab retitled this revision to [CodeGen] Combine small-element shuffles of scalar_to_vector in terms of the wider scalar.

ab updated this object.

ab edited the test plan for this revision.

ab added reviewers: qcolombet, t.p.northover, spatel.

ab added parent revisions: D8884: [CodeGen] Combine shuffle from concat+bitcast scalar to avoid the smaller vector type., D8883: [CodeGen] Combine concat_vectors(trunc(scalar), undef) -> scalar_to_vector(scalar).

ab added a subscriber: Unknown Object (MLST).

There is already an optimization that attempts to shuffle 'scalar source' input operands into a single BUILD_VECTOR just above this code. Might it make more sense to add the ability to peek through bitcasts there?

I'm not sure: I initially dismissed it because it works on multiple scalar inputs, for which it makes sense to do a BUILD_VECTOR directly. For a single scalar input, I found it more straightforward to shuffle a SCALAR_TO_VECTOR, but I agree it's not very consistent.

Also, I'm not sure what happens to the BUILD_VECTOR later on, but I can imagine it being expanded to the shuffle anyway, so maybe it does make sense to do that instead. Let me look into it!

-Ahmed

In D8885#153311, @ab wrote:

I'm not sure: I initially dismissed it because it works on multiple scalar inputs, for which it makes sense to do a BUILD_VECTOR directly. For a single scalar input, I found it more straightforward to shuffle a SCALAR_TO_VECTOR, but I agree it's not very consistent.

If you look at the early versions of the patch on D8516 we did have a version that would create a SCALAR_TO_VECTOR if only the first scalar was defined, but there wasn't a good use case for it so it got dropped. Readding that path should be trivial.

Try combining to BUILD_VECTOR instead, adding to the 2-input scalar shuffle combine.
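
As a rough illustration of what the 2-input scalar shuffle combine does, here is a toy C++17 model. It is a sketch under stated assumptions, not LLVM code: Lane, ScalarSource and foldShuffleOfScalarSources are hypothetical names, each input is modelled as a list of named scalars (an empty lane stands for undef), and the one-use and type-legality checks of the real combine are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy model, not LLVM's SDValue: a "scalar source" vector whose lanes are
// named scalars. An empty optional lane stands for undef.
using Lane = std::optional<std::string>;
using ScalarSource = std::vector<Lane>;

// Pick one scalar per mask element from two equally sized inputs.
// Indices in [0, n) select from `a`, indices in [n, 2n) select from `b`,
// and negative indices produce an undef lane.
std::optional<ScalarSource>
foldShuffleOfScalarSources(const ScalarSource &a, const ScalarSource &b,
                           const std::vector<int> &mask) {
  const int n = static_cast<int>(a.size());
  if (n == 0 || b.size() != a.size())
    return std::nullopt;

  ScalarSource result;
  for (int m : mask) {
    if (m < 0) {
      result.push_back(std::nullopt); // undef mask entry
      continue;
    }
    if (m >= 2 * n)
      return std::nullopt; // index outside both inputs
    const ScalarSource &src = (m < n) ? a : b;
    result.push_back(src[static_cast<std::size_t>(m % n)]);
  }
  return result;
}
```

In this model a SCALAR_TO_VECTOR input is a source whose lane 0 is defined and whose remaining lanes are undef, so a single-input shuffle is the case where every lane of `b` is undef.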

I didn't look at two bitcasts at the same time because this only fires because of D8884: otherwise, the vector_shuffle combine is way too early (and the operands are legalized to BUILD_VECTOR/SCALAR_TO_VECTOR right after), and either the operands get lowered between combine rounds, or other combines catch this without needing to peek through bitcasts (usually concat_vectors if you have two vector inputs).

This does change one ARM test, where we replace:

vmov r0, r1, d16
vmov.32 d16[0], r0
vmov.32 d16[1], r1
vext.8 q8, q8, q8, #4

with:

vmov r0, r1, d16
vmov.32 d16[0], r1

Which looks... interesting, but is an improvement nonetheless.

-Ahmed

In D8885#153343, @RKSimon wrote:

If you look at the early versions of the patch on D8516 we did have a version that would create a SCALAR_TO_VECTOR if only the first scalar was defined, but there wasn't a good use case for it so it got dropped. Readding that path should be trivial.

I don't think we need that after all; I agree with your comment there that it's fine to "rely on a target's lowering logic to do the right thing."

-Ahmed

ab mentioned this in D8948: [CodeGen] Combine concat_vectors of scalars into build_vector. Apr 9 2015, 6:29 PM

I only hit this from the SCALAR_TO_VECTOR generated by D8884. With BUILD_VECTOR, this isn't needed after all.

Or rather, I couldn't tickle this into triggering with D8948. Marking as abandoned, will update if I can come up with an actual testcase (the ARM one is pretty uninteresting IMO, but might be a good start).

-Ahmed

Revision Contents

Path                                               Size
lib/CodeGen/SelectionDAG/DAGCombiner.cpp           45 lines
test/CodeGen/AArch64/concat_vectors-combines.ll    29 lines
test/CodeGen/ARM/vector-DAGCombine.ll              3 lines

Diff 23457

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

[11,855 lines hidden; context: for (unsigned I = 0; I != NumConcats; ++I) {]
     }
   }

   return DAG.getNode(ISD::CONCAT_VECTORS, SDLoc(N), VT, Ops);
 }

 static SDValue combineShuffleOfScalarInputs(SDNode *N, SelectionDAG &DAG) {
   const TargetLowering &TLI = DAG.getTargetLoweringInfo();
+  EVT ResVT = N->getValueType(0);
   SDValue N0 = N->getOperand(0);
   SDValue N1 = N->getOperand(1);
   ShuffleVectorSDNode *SVN = cast<ShuffleVectorSDNode>(N);

+  // If we only have one input, peek through bitcasts only if there is one user.
+  // FIXME: Is it useful to look at bitcasts on both sides?
+  if (N1.getOpcode() == ISD::UNDEF)
+    while (N0.getOpcode() == ISD::BITCAST) {
+      if (!N0.hasOneUse())
+        break;
+      N0 = N0.getOperand(0);
+    }
+
+  // The bitcast source needs to have an element size that's a multiple of
+  // the shuffle element size. If it doesn't, revert to the shuffle operand.
+  if (!N0.getValueType().isVector() ||
+      (N0.getValueType().getScalarSizeInBits() % ResVT.getScalarSizeInBits()))
+    N0 = N->getOperand(0);
+
   EVT VT = N0.getValueType();
   EVT SVT = VT.getScalarType();
   const unsigned NumElts = VT.getVectorNumElements();
+  const unsigned ResNumElts = ResVT.getVectorNumElements();
+
+  const int Scale = SVT.getSizeInBits() / ResVT.getScalarSizeInBits();
+  assert((ResNumElts % Scale) == 0);

   SmallVector<SDValue, 8> Ops;
-  for (int M : SVN->getMask()) {
+  for (unsigned i = 0; i != ResNumElts; i += Scale) {
     SDValue Op = DAG.getUNDEF(SVT);
+    int M = SVN->getMaskElt(i);
+
+    if (Scale > 1) {
+      // Normalize undef indices.
+      if (M < 0)
+        M = -Scale;
+      if (M % Scale)
+        return SDValue();
+      // Make sure these are either all undef or consecutive elements.
+      for (int j = 0; j != Scale; ++j) {
+        int InnerM = SVN->getMaskElt(i + j);
+        if (((InnerM < 0) != (M < 0)) ||
+            (InnerM >= 0 && InnerM != M + (int)j))
+          return SDValue();
+      }
+      M /= Scale;
+    }
+
     if (M >= 0) {
       int Idx = M % NumElts;
       SDValue &S = (M < (int)NumElts ? N0 : N1);
       if (S.getOpcode() == ISD::BUILD_VECTOR && S.hasOneUse()) {
         Op = S.getOperand(Idx);
       } else if (S.getOpcode() == ISD::SCALAR_TO_VECTOR && S.hasOneUse()) {
         if (Idx == 0)
           Op = S.getOperand(0);
[9 lines hidden; context: static SDValue combineShuffleOfScalarInputs(SDNode *N, SelectionDAG &DAG) {]
   if (SVT.isInteger())
     for (SDValue &Op : Ops)
       SVT = (SVT.bitsLT(Op.getValueType()) ? Op.getValueType() : SVT);
   if (SVT != VT.getScalarType())
     for (SDValue &Op : Ops)
       Op = TLI.isZExtFree(Op.getValueType(), SVT)
                ? DAG.getZExtOrTrunc(Op, SDLoc(N), SVT)
                : DAG.getSExtOrTrunc(Op, SDLoc(N), SVT);
-  return DAG.getNode(ISD::BUILD_VECTOR, SDLoc(N), VT, Ops);
+  return DAG.getNode(ISD::BITCAST, SDLoc(N), ResVT,
+                     DAG.getNode(ISD::BUILD_VECTOR, SDLoc(N), VT, Ops));
 }

 SDValue DAGCombiner::visitVECTOR_SHUFFLE(SDNode *N) {
   EVT VT = N->getValueType(0);
   unsigned NumElts = VT.getVectorNumElements();

   SDValue N0 = N->getOperand(0);
   SDValue N1 = N->getOperand(1);
[141 lines hidden; context: if (N1.getOpcode() == ISD::UNDEF &&]
       }
     }

     // Try to simplify either the shuffle or the concats.
     if (SDValue V = partitionShuffleOfConcats(N, DAG))
       return V;
   }

-  // Attempt to combine a shuffle of 2 inputs of 'scalar sources' -
+  // Attempt to combine a shuffle of 1 or 2 inputs of 'scalar sources' -
   // BUILD_VECTOR or SCALAR_TO_VECTOR into a single BUILD_VECTOR.
   if (Level < AfterLegalizeVectorOps && TLI.isTypeLegal(VT)) {
     if (SDValue V = combineShuffleOfScalarInputs(N, DAG))
       return V;
   }

   // If this shuffle only has a single input that is a bitcasted shuffle,
   // attempt to merge the 2 shuffles and suitably bitcast the inputs/output
[1,339 lines hidden]

test/CodeGen/AArch64/concat_vectors-combines.ll

[50 lines hidden; context: ; CHECK-NEXT: ret]
   %t = trunc i32 %x to i16
   %0 = bitcast i16 %t to <2 x i8>
   %1 = shufflevector <2 x i8> %0, <2 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2>
   ret <8 x i8> %1
 }

 ; Test the (vector_shuffle (concat_vectors (bitcast (scalar)), undef..), undef, <mask>) pattern.

-; FIXME: This should use DUP.
 define <8 x i8> @test_shuffle_from_concat_scalar_v2i8_to_v8i8_dup(i32 %x) #0 {
 entry:
 ; CHECK-LABEL: test_shuffle_from_concat_scalar_v2i8_to_v8i8_dup:
-; CHECK-NEXT: fmov s[[V0:[0-9]+]], w0
-; CHECK-NEXT: ins.d v[[V0]][1], v[[V0]][0]
-; CHECK-NEXT: movi.4h v[[V1:[0-9]+]], #0x1, lsl #8
-; CHECK-NEXT: tbl.8b v0, { v[[V0]] }, v[[V1]]
+; CHECK-NEXT: dup.4h v0, w0
 ; CHECK-NEXT: ret
   %t = trunc i32 %x to i16
   %0 = bitcast i16 %t to <2 x i8>
   %1 = shufflevector <2 x i8> %0, <2 x i8> undef, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>
   ret <8 x i8> %1
 }

+define <8 x i16> @test_shuffle_from_concat_scalar_v2i16_to_v8i16_dup(i32 %x) #0 {
+entry:
+; CHECK-LABEL: test_shuffle_from_concat_scalar_v2i16_to_v8i16_dup:
+; CHECK-NEXT: dup.4s v0, w0
+; CHECK-NEXT: ret
+  %0 = bitcast i32 %x to <2 x i16>
+  %1 = shufflevector <2 x i16> %0, <2 x i16> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 2, i32 0, i32 1, i32 0, i32 1>
+  ret <8 x i16> %1
+}
+
+define <8 x i16> @test_shuffle_from_concat_scalar_v2i16_to_v8i16_duplike_invalid(i32 %x) #0 {
+entry:
+; CHECK-LABEL: test_shuffle_from_concat_scalar_v2i16_to_v8i16_duplike_invalid:
+; CHECK-NEXT: adrp x[[MASKPTR:[0-9]+]], lCPI{{.*}}
+; CHECK-NEXT: ldr q[[V1:[0-9]+]], [x[[MASKPTR]], lCPI{{.*}}
+; CHECK-NEXT: fmov s[[V0:[0-9]+]], w0
+; CHECK-NEXT: tbl.16b v0, { v[[V0]] }, v[[V1]]
+; CHECK-NEXT: ret
+  %0 = bitcast i32 %x to <2 x i16>
+  %1 = shufflevector <2 x i16> %0, <2 x i16> undef, <8 x i32> <i32 0, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0>
+  ret <8 x i16> %1
+}
+
 define <8 x i8> @test_shuffle_from_concat_scalar_v2i8_to_v8i8(i32 %x) #0 {
 entry:
 ; CHECK-LABEL: test_shuffle_from_concat_scalar_v2i8_to_v8i8:
 ; CHECK-NEXT: adrp x[[MASKPTR:[0-9]+]], lCPI{{.*}}
 ; CHECK-NEXT: ldr d[[V1:[0-9]+]], [x[[MASKPTR]], lCPI{{.*}}
 ; CHECK-NEXT: fmov s[[V0:[0-9]+]], w0
 ; CHECK-NEXT: ins.d v[[V0]][1], v[[V0]][0]
 ; CHECK-NEXT: tbl.8b v0, { v[[V0]] }, v[[V1]]
[35 lines hidden]

test/CodeGen/ARM/vector-DAGCombine.ll

[42 lines hidden; context: entry:]
   br i1 undef, label %bb1, label %bb2

 bb1:
   %0 = bitcast <2 x i64> zeroinitializer to <2 x double>
   %1 = extractelement <2 x double> %0, i32 0
   %2 = bitcast double %1 to i64
   %3 = insertelement <1 x i64> undef, i64 %2, i32 0
 ; CHECK-NOT: vmov s
-; CHECK: vext.8
+; CHECK: vmov r0, r1, d
+; CHECK: vmov r2, r3, d
   %4 = shufflevector <1 x i64> %3, <1 x i64> undef, <2 x i32> <i32 0, i32 1>
   %tmp2006.3 = bitcast <2 x i64> %4 to <16 x i8>
   %5 = shufflevector <16 x i8> %tmp2006.3, <16 x i8> undef, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13, i32 14, i32 15, i32 16, i32 17, i32 18, i32 19>
   %tmp2004.3 = bitcast <16 x i8> %5 to <4 x i32>
   br i1 undef, label %bb2, label %bb1

 bb2:
   %result = phi <4 x i32> [ undef, %entry ], [ %tmp2004.3, %bb1 ]
[195 lines hidden]
