This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Combine shuffle from concat+bitcast scalar to avoid the smaller vector type.
Abandoned, Public

Authored by ab on Apr 7 2015, 5:07 PM.

Details

Reviewers

spatel
qcolombet
RKSimon
chandlerc
andreadb

Summary

Performance-sensitive (hardware-aware) code that deals with small vectors often passes them around (on some targets as scalars, thanks to type coercion), but immediately shuffles them into a native-width vector, so that the actual operations happen on types we can't reasonably do a horrible job at. Promoting small vector types leads to some really nasty code, and people actively try to avoid that now ;)

Anyway, this means a pretty common pattern is this:

(vector_shuffle (v8i8 concat_vectors (v2i8 bitcast (i16)), undef..), M)

which we'll usually end up scalarizing. Instead, we can turn it into:

(vector_shuffle (v8i8 bitcast (v4i16 scalar_to_vector (i16))), M)

which lets us deal with native-width types all the time (this should be the foremost canonicalization goal whenever dealing with vectors IMHO, but I digress).
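To make the claimed equivalence concrete, here is a minimal sketch that models both forms as plain byte lanes, outside of LLVM. It assumes a little-endian lane layout, and every name in it (kUndef, Lanes, viaConcat, viaScalarToVector, and the helpers) is hypothetical and exists only for illustration; none of it is SelectionDAG API. The point is that the mask M can be reused unchanged: lanes 0 and 1 hold the two bytes of the i16 in both forms, and every other lane is undef in both.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative model only: one int per byte lane, kUndef marks an undef lane.
constexpr int kUndef = -1;
using Lanes = std::vector<int>;

// (v2i8 bitcast (i16)), assuming little-endian: the low byte is lane 0.
Lanes bitcastI16ToV2I8(std::uint16_t x) {
  return {x & 0xFF, (x >> 8) & 0xFF};
}

// (v8i8 concat_vectors first, undef..): trailing lanes stay undef.
Lanes concatWithUndef(const Lanes &first, std::size_t totalLanes) {
  assert(first.size() <= totalLanes);
  Lanes out(totalLanes, kUndef);
  for (std::size_t i = 0; i < first.size(); ++i)
    out[i] = first[i];
  return out;
}

// (v8i8 bitcast (v4i16 scalar_to_vector (i16))): only 16-bit lane 0 is
// defined, so only byte lanes 0 and 1 are defined after the bitcast.
Lanes scalarToVectorAsBytes(std::uint16_t x, std::size_t totalBytes) {
  assert(totalBytes >= 2 && totalBytes % 2 == 0);
  Lanes out(totalBytes, kUndef);
  out[0] = x & 0xFF;
  out[1] = (x >> 8) & 0xFF;
  return out;
}

// vector_shuffle with an undef second operand: a negative index, or one that
// selects from the second operand, produces an undef lane.
Lanes shuffle(const Lanes &v, const std::vector<int> &mask) {
  Lanes out;
  for (int m : mask) {
    bool inRange = m >= 0 && static_cast<std::size_t>(m) < v.size();
    out.push_back(inRange ? v[static_cast<std::size_t>(m)] : kUndef);
  }
  return out;
}

// The original pattern and the rewritten one, both using the same mask.
Lanes viaConcat(std::uint16_t x, const std::vector<int> &mask) {
  return shuffle(concatWithUndef(bitcastI16ToV2I8(x), 8), mask);
}

Lanes viaScalarToVector(std::uint16_t x, const std::vector<int> &mask) {
  return shuffle(scalarToVectorAsBytes(x, 8), mask);
}
```

In this model, calling viaConcat and viaScalarToVector with the _dup mask <0, 1, 0, 1, 0, 1, 0, 1> gives the same eight lanes. The sketch says nothing about legalization or codegen quality, which is the actual motivation for the combine.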

Diff Detail

Event Timeline

ab updated this revision to Diff 23382. (Apr 7 2015, 5:07 PM)

ab retitled this revision to "[CodeGen] Combine shuffle from concat+bitcast scalar to avoid the smaller vector type."

ab updated this object.

ab edited the test plan for this revision.

ab added reviewers: spatel, qcolombet.

ab added a parent revision: D8883: [CodeGen] Combine concat_vectors(trunc(scalar), undef) -> scalar_to_vector(scalar).

ab added a subscriber: Unknown Object (MLST).

ab mentioned this in D8885: [CodeGen] Combine small-element shuffles of scalar_to_vector in terms of the wider scalar. (Apr 7 2015, 5:16 PM)

I have a gut feeling telling me this isn't the right solution; let's add more of the shuffle brigade! =)

Forgive the AArch64 testcase; on X86 there's enough heroics that we're able to recover from scalarizing the concat_vectors. I'm a bit fuzzy on the details, so I'll definitely keep looking before committing. In the meantime, guidance welcome!

-Ahmed

Much of this optimization looks like it will be accomplished by D8883, no?

Yes, but there are other combines that can get in the way. For instance, my go-to example is the _dup testcase, which we'll try to turn into just (concat_vectors (bitcast (i16)), (bitcast (i16)), ...), which, when promoted, gets scalarized, and we never recover from that. On X86 we do; that's the part I still need to look into.

From what I see, there are two other alternatives:

not creating CONCAT_VECTORS of illegal types
fixing whatever de-scalarizes the legalized code later on for this testcase on X86 but not on AArch64 (intuition says it's the fact that i16 isn't legal on AArch64, whereas it is on X86, so when we promote the scalar input as well, all hell breaks loose).

The latter seems brittle. The former is basically saying "CONCAT_VECTORS is only the canonical way to concat vectors when they're legal", which I found wrong, but which isn't that shocking when you spell it out.

Thoughts?

-Ahmed

Have you looked at generalizing the visitCONCAT_VECTORS code to create SCALAR_TO_VECTOR or BUILD_VECTOR depending on the contents of the operands? At the moment the SCALAR_TO_VECTOR and BUILD_VECTOR cases are treated separately, but it should be pretty easy to merge them; see D7816.

Now that I look at this again, it seems to me that for the concat problem, a cleaner solution would be to combine (only when the intermediate type is illegal, perhaps?):

(v8i8 concat_vectors (v2i8 bitcast (i16)) x4)

into:

(v8i8 (bitcast (v4i16 BUILD_VECTOR (i16) x4)))

Is that what you're proposing?

-Ahmed
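As a rough illustration of why the two forms above describe the same 64 bits, the following sketch models a bitcast as a plain reinterpretation of bytes via memcpy. It is a toy model and not LLVM code: the type aliases and function names (V4I16, V8I8, concatOfBitcasts, bitcastOfBuildVector, formsAgree) are hypothetical, and it deliberately ignores legality, which is the real reason to prefer the BUILD_VECTOR form here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Illustrative stand-ins for the DAG value types; not LLVM types.
using V4I16 = std::array<std::uint16_t, 4>;
using V8I8 = std::array<std::uint8_t, 8>;

// (v8i8 concat_vectors (v2i8 bitcast (i16)) x4): each scalar is reinterpreted
// as two bytes, and the two-byte pieces are laid out back to back.
V8I8 concatOfBitcasts(const V4I16 &scalars) {
  V8I8 out{};
  for (std::size_t i = 0; i < scalars.size(); ++i)
    std::memcpy(out.data() + 2 * i, &scalars[i], sizeof(std::uint16_t));
  return out;
}

// (v8i8 (bitcast (v4i16 BUILD_VECTOR (i16) x4))): build the four-lane vector
// first, then reinterpret all eight bytes at once.
V8I8 bitcastOfBuildVector(const V4I16 &scalars) {
  assert(sizeof(V4I16) == sizeof(V8I8));
  V8I8 out{};
  std::memcpy(out.data(), scalars.data(), sizeof(out));
  return out;
}

// True when both forms yield the same bytes for the given scalars.
bool formsAgree(const V4I16 &scalars) {
  return concatOfBitcasts(scalars) == bitcastOfBuildVector(scalars);
}
```

Because both helpers copy bytes in host order, formsAgree holds regardless of host endianness. The difference that matters in the DAG is that the second form never materializes the illegal v2i8 type.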

FWIW, this is what I think makes far and away the most sense. We should definitely prefer a build_vector of scalars over concat_vector of bitcasts of scalars.

Yes, that seems the cleanest approach, and it gives us the best chance of further optimization later on.

Alrighty then, superseded by D8948.

Thanks for the help!

-Ahmed

Revision Contents

Path	Size
lib/CodeGen/SelectionDAG/DAGCombiner.cpp	24 lines
test/CodeGen/AArch64/concat_vectors-combines.ll	62 lines

Diff 23382

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

[...] SDValue DAGCombiner::visitVECTOR_SHUFFLE(SDNode *N) {
[...]
   // Attempt to combine a shuffle of concat_vectors.
   if (N0.getOpcode() == ISD::CONCAT_VECTORS &&
       Level < AfterLegalizeVectorOps &&
       (N1.getOpcode() == ISD::UNDEF ||
       (N1.getOpcode() == ISD::CONCAT_VECTORS &&
        N0.getOperand(0).getValueType() == N1.getOperand(0).getValueType()))) {
+
+    // Try to avoid smaller vectors by combining a shuffle of one
+    // CONCAT_VECTORS coming from a scalar to SCALAR_TO_VECTOR instead:
+    //   (vector_shuffle (v8i8 concat_vectors (v2i8 bitcast (i16)), undef..), M)
+    // ->
+    //   (vector_shuffle (v8i8 bitcast (v4i16 scalar_to_vector (i16))), M)
+    if (N1.getOpcode() == ISD::UNDEF &&
+        N0->getOperand(0)->getOpcode() == ISD::BITCAST &&
+        std::all_of(
+            N0->ops().begin() + 1, N0->ops().end(),
+            [](const SDValue &Op) { return Op->getOpcode() == ISD::UNDEF; })) {
+      SDValue Scalar = N0->getOperand(0)->getOperand(0);
+      EVT ScVT = Scalar.getValueType();
+      if (!ScVT.isVector()) {
+        SDLoc dl(N);
+        EVT VecVT = EVT::getVectorVT(*DAG.getContext(), ScVT,
+                                     VT.getSizeInBits() / ScVT.getSizeInBits());
+        return DAG.getVectorShuffle(
+            VT, dl,
+            DAG.getNode(ISD::BITCAST, dl, VT,
+                        DAG.getNode(ISD::SCALAR_TO_VECTOR, dl, VecVT, Scalar)),
+            N1, SVN->getMask());
+      }
+    }

     // Try to simplify either the shuffle or the concats.
     if (SDValue V = partitionShuffleOfConcats(N, DAG))
       return V;
   }

   // Attempt to combine a shuffle of 2 inputs of 'scalar sources' -
   // BUILD_VECTOR or SCALAR_TO_VECTOR into a single BUILD_VECTOR.
   if (Level < AfterLegalizeVectorOps && TLI.isTypeLegal(VT)) {
[...]

test/CodeGen/AArch64/concat_vectors-combines.ll

-; RUN: llc < %s -mtriple arm64-apple-darwin -asm-verbose=false | FileCheck %s
+; RUN: llc < %s -mtriple arm64-apple-darwin -aarch64-collect-loh=false -asm-verbose=false | FileCheck %s
+; LOHs are annoying, disable them.

 target datalayout = "e-m:o-i64:64-i128:128-n32:64-S128"

 ; Test the (concat_vectors (trunc), (trunc)) pattern.

 define <4 x i16> @test_concat_truncate_v2i64_to_v4i16(<2 x i64> %a, <2 x i64> %b) #0 {
 entry:
 ; CHECK-LABEL: test_concat_truncate_v2i64_to_v4i16:
[... 38 lines not shown ...]
 ; CHECK-NEXT: fmov s0, w0
 ; CHECK-NEXT: ret
   %t = trunc i32 %x to i16
   %0 = bitcast i16 %t to <2 x i8>
   %1 = shufflevector <2 x i8> %0, <2 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 2, i32 2, i32 2, i32 2, i32 2>
   ret <8 x i8> %1
 }

+; Test the (vector_shuffle (concat_vectors (bitcast (scalar)), undef..), undef, <mask>) pattern.
+
+; FIXME: This should use DUP.
+define <8 x i8> @test_shuffle_from_concat_scalar_v2i8_to_v8i8_dup(i32 %x) #0 {
+entry:
+; CHECK-LABEL: test_shuffle_from_concat_scalar_v2i8_to_v8i8_dup:
+; CHECK-NEXT: fmov s[[V0:[0-9]+]], w0
+; CHECK-NEXT: ins.d v[[V0]][1], v[[V0]][0]
+; CHECK-NEXT: movi.4h v[[V1:[0-9]+]], #0x1, lsl #8
+; CHECK-NEXT: tbl.8b v0, { v[[V0]] }, v[[V1]]
+; CHECK-NEXT: ret
+  %t = trunc i32 %x to i16
+  %0 = bitcast i16 %t to <2 x i8>
+  %1 = shufflevector <2 x i8> %0, <2 x i8> undef, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>
+  ret <8 x i8> %1
+}
+
+define <8 x i8> @test_shuffle_from_concat_scalar_v2i8_to_v8i8(i32 %x) #0 {
+entry:
+; CHECK-LABEL: test_shuffle_from_concat_scalar_v2i8_to_v8i8:
+; CHECK-NEXT: adrp x[[MASKPTR:[0-9]+]], lCPI{{.*}}
+; CHECK-NEXT: ldr d[[V1:[0-9]+]], [x[[MASKPTR]], lCPI{{.*}}
+; CHECK-NEXT: fmov s[[V0:[0-9]+]], w0
+; CHECK-NEXT: ins.d v[[V0]][1], v[[V0]][0]
+; CHECK-NEXT: tbl.8b v0, { v[[V0]] }, v[[V1]]
+; CHECK-NEXT: ret
+  %t = trunc i32 %x to i16
+  %0 = bitcast i16 %t to <2 x i8>
+  %1 = shufflevector <2 x i8> %0, <2 x i8> undef, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 1, i32 1>
+  ret <8 x i8> %1
+}
+
+define <8 x i8> @test_shuffle_from_concat_scalar_v4i8_to_v8i8(i32 %x) #0 {
+entry:
+; CHECK-LABEL: test_shuffle_from_concat_scalar_v4i8_to_v8i8:
+; CHECK-NEXT: adrp x[[MASKPTR:[0-9]+]], lCPI{{.*}}
+; CHECK-NEXT: ldr d[[V1:[0-9]+]], [x[[MASKPTR]], lCPI{{.*}}
+; CHECK-NEXT: fmov s[[V0:[0-9]+]], w0
+; CHECK-NEXT: ins.d v[[V0]][1], v[[V0]][0]
+; CHECK-NEXT: tbl.8b v0, { v[[V0]] }, v[[V1]]
+; CHECK-NEXT: ret
+  %0 = bitcast i32 %x to <4 x i8>
+  %1 = shufflevector <4 x i8> %0, <4 x i8> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 1, i32 1, i32 1, i32 1>
+  ret <8 x i8> %1
+}
+
+define <8 x i16> @test_shuffle_from_concat_scalar_v2i16_to_v8i16(i32 %x) #0 {
+entry:
+; CHECK-LABEL: test_shuffle_from_concat_scalar_v2i16_to_v8i16:
+; CHECK-NEXT: adrp x[[MASKPTR:[0-9]+]], lCPI{{.*}}
+; CHECK-NEXT: ldr q[[V1:[0-9]+]], [x[[MASKPTR]], lCPI{{.*}}
+; CHECK-NEXT: fmov s[[V0:[0-9]+]], w0
+; CHECK-NEXT: tbl.16b v0, { v[[V0]] }, v[[V1]]
+; CHECK-NEXT: ret
+  %0 = bitcast i32 %x to <2 x i16>
+  %1 = shufflevector <2 x i16> %0, <2 x i16> undef, <8 x i32> <i32 0, i32 1, i32 1, i32 0, i32 1, i32 1, i32 1, i32 1>
+  ret <8 x i16> %1
+}
+
 attributes #0 = { nounwind }
