This is an archive of the discontinued LLVM Phabricator instance.

[PCG] Poor shuffle lane tracking (PR35454)
Needs Review · Public

Authored by kbelochapka on Nov 29 2017, 7:58 PM.


Details

Reviewers

craig.topper
spatel
efriedma
RKSimon

Summary

This fix enables the following transformation:
BINARY-OPERATION(SHUFFLE(vector1, mask), SHUFFLE(vector2, mask)) -> SHUFFLE(BINARY-OPERATION(vector1, vector2), mask)
when the vector type of the BINARY-OPERATION operands differs from the vector type of the SHUFFLE operands, e.g. <4 x i32> and <16 x i8>. Both types must have the same total bit width.
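
For illustration only, the identity behind this transformation can be sketched in plain C++ for the simple case where the shuffle and the binary operation use the same vector type. This is not the patch's code: `shuffle`, `laneAdd`, and `binopCommutesWithShuffle` are hypothetical helpers that model a single-source shufflevector and a lane-wise add on `std::array`, and the sketch assumes every mask index is in range (no undef lanes).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model of a single-source shufflevector: Out[i] = V[Mask[i]].
// Every mask index is assumed to be in range (no undef lanes).
template <typename T, std::size_t N>
std::array<T, N> shuffle(const std::array<T, N> &V,
                         const std::array<std::size_t, N> &Mask) {
  std::array<T, N> Out{};
  for (std::size_t I = 0; I < N; ++I)
    Out[I] = V[Mask[I]];
  return Out;
}

// Hypothetical model of a lane-wise binary operation (wrapping add).
template <typename T, std::size_t N>
std::array<T, N> laneAdd(const std::array<T, N> &A,
                         const std::array<T, N> &B) {
  std::array<T, N> Out{};
  for (std::size_t I = 0; I < N; ++I)
    Out[I] = static_cast<T>(A[I] + B[I]);
  return Out;
}

// Checks BINOP(SHUFFLE(A, M), SHUFFLE(B, M)) == SHUFFLE(BINOP(A, B), M).
template <typename T, std::size_t N>
bool binopCommutesWithShuffle(const std::array<T, N> &A,
                              const std::array<T, N> &B,
                              const std::array<std::size_t, N> &Mask) {
  return laneAdd(shuffle(A, Mask), shuffle(B, Mask)) ==
         shuffle(laneAdd(A, B), Mask);
}
```

Because the operation is applied lane by lane, permuting both inputs with the same mask and then operating gives the same lanes as operating first and permuting once. The patch is about the harder case where bitcasts sit between the shuffles and the binary operation.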

Diff Detail

Event Timeline

kbelochapka created this revision. Nov 29 2017, 7:58 PM

RKSimon added a reviewer: RKSimon. Nov 30 2017, 1:20 AM

spatel added a reviewer: efriedma. Nov 30 2017, 12:51 PM

spatel added inline comments.

lib/Transforms/InstCombine/InstructionCombining.cpp

1460–1462

This is not correct. You can't assume the surrounding instructions when creating an instcombine fold. Your test cases should be minimal and show what happens with those patterns. For this transform, that would be something like this:

define <8 x i16> @shuffle_add(<4 x i32> %v1, <4 x i32> %v2) {
  %shuffle1 = shufflevector <4 x i32> %v1, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %shuffle2 = shufflevector <4 x i32> %v2, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %bc1 = bitcast <4 x i32> %shuffle1 to <8 x i16>
  %bc2 = bitcast <4 x i32> %shuffle2 to <8 x i16>
  %add = add <8 x i16> %bc1, %bc2
  ret <8 x i16> %add
}

With this patch, we go from 5 instructions to 6:

define <8 x i16> @shuffle_add(<4 x i32> %v1, <4 x i32> %v2) {
  %1 = bitcast <4 x i32> %v1 to <8 x i16>
  %2 = bitcast <4 x i32> %v2 to <8 x i16>
  %3 = add <8 x i16> %1, %2
  %4 = bitcast <8 x i16> %3 to <4 x i32>
  %5 = shufflevector <4 x i32> %4, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %6 = bitcast <4 x i32> %5 to <8 x i16>
  ret <8 x i16> %6
}

Maybe this makes sense because we traded a shuffle for a couple of bitcasts?

If we used the narrower element type for the shuffle, then we'd have a reduction in instruction count by eliminating the last 2 bitcasts, but I don't know if that is allowed as a target-independent transform.

spatel added inline comments. Nov 30 2017, 12:54 PM

lib/Transforms/InstCombine/InstructionCombining.cpp
1465–1466	The shuffle mask is always a constant: "The shuffle mask operand is required to be a constant vector with either constant integer or undef values." http://llvm.org/docs/LangRef.html#shufflevector-instruction

efriedma added inline comments. Nov 30 2017, 1:06 PM

lib/Transforms/InstCombine/InstructionCombining.cpp
1460–1462	On some targets, vector bitcasts aren't free (IIRC big-endian ARM is like this). Changing the type of the shuffle is.... maybe a little sketchy. I mean, ideally targets should be able to handle either form, but I'm not sure we actually do that reliably. We don't have good tests for that sort of thing.

spatel added inline comments. Nov 30 2017, 1:12 PM

lib/Transforms/InstCombine/InstructionCombining.cpp
1460–1462	Yeah, I was afraid that was the answer :) So 2 options:
1. Wait to do this in the DAG where we can ask if bitcasts are free.
2. Try to match the larger pattern (shuffle+binop+shuffle) shown in the motivating tests.

Thanks, guys, for the valuable comments. I will reimplement the fix as suggested by Sanjay.

Reimplemented the fix based on the reviewers' recommendations.
The fix now attempts to transform the sequence:
SHUFFLE<T0>(MASK) --> BITCAST<T1> --> BINOP<T1> --> BITCAST<T0> --> SHUFFLE<T0>(MASK)
into:
BITCAST<T1> --> BINOP<T1> --> SHUFFLE<T1>(NEW_MASK)
This is always possible when the BINOP vector element type is narrower than the SHUFFLE vector element type,
and sometimes possible when it is not.
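
How NEW_MASK is computed is not spelled out in this comment. As a rough sketch under that caveat, one plausible reading is: compose MASK with itself (the sequence applies it twice), then re-express the result in T1 lanes. The helpers below (`composeMask`, `narrowMask`, `widenMask`) are hypothetical names, not the patch's or LLVM's API. Masks are single-source, and a negative index stands for an undef lane. `narrowMask` always succeeds, which matches the "always possible" case; `widenMask` is a deliberately conservative version of the "sometimes possible" case and simply rejects undef lanes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mask that applies Mask twice: shuffle(shuffle(X, M), M)[i] == X[M[M[i]]].
// A negative index stands for an undef lane and stays undef.
std::vector<int> composeMask(const std::vector<int> &Mask) {
  std::vector<int> Out;
  Out.reserve(Mask.size());
  for (int M : Mask)
    Out.push_back(M < 0 ? -1 : Mask[static_cast<std::size_t>(M)]);
  return Out;
}

// Re-express a mask over wide lanes as a mask over lanes that are Scale
// times narrower: wide lane M covers narrow lanes M*Scale .. M*Scale+Scale-1.
std::vector<int> narrowMask(const std::vector<int> &Mask, int Scale) {
  std::vector<int> Out;
  for (int M : Mask)
    for (int J = 0; J < Scale; ++J)
      Out.push_back(M < 0 ? -1 : M * Scale + J);
  return Out;
}

// Opposite direction: succeeds only if every group of Scale narrow lanes
// selects one whole, aligned wide lane. Undef lanes are rejected here for
// simplicity. Out is written only on success.
bool widenMask(const std::vector<int> &Mask, int Scale, std::vector<int> &Out) {
  if (Scale <= 0 || Mask.size() % static_cast<std::size_t>(Scale) != 0)
    return false;
  const std::size_t Step = static_cast<std::size_t>(Scale);
  std::vector<int> Result;
  for (std::size_t I = 0; I < Mask.size(); I += Step) {
    const int First = Mask[I];
    if (First < 0 || First % Scale != 0)
      return false;
    for (std::size_t J = 1; J < Step; ++J)
      if (Mask[I + J] != First + static_cast<int>(J))
        return false;
    Result.push_back(First / Scale);
  }
  Out = Result;
  return true;
}
```

For the reversing mask <3, 2, 1, 0> used throughout this review, `narrowMask` with a scale of 2 gives <6, 7, 4, 5, 2, 3, 0, 1>, the <8 x i16> form of the same <4 x i32> permutation. Composing the reversal with itself gives the identity mask, which is consistent with the CHECK lines of the tests in this patch containing no shufflevector at all.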

@kbelochapka I think we can abandon this now the vector combine pass handles PR35454

@spatel Maybe ensure we have all the test coverage from the tests that Konstantin added here?

In D40633#1958062, @RKSimon wrote:

@spatel Maybe ensure we have all the test coverage from the tests that Konstantin added here?

Tests adapted from this patch and added to "PhaseOrdering":
rG389704cc601

I added test comments about what we still can do to improve things. Also note that the fold in -vector-combine relies on the cost model, so we don't get the simplifications for a base SSE2 compile (I assumed from the asm shown in https://bugs.llvm.org/show_bug.cgi?id=35454 that we care about an AVX or later target).

spatel mentioned this in rG389704cc601b: [PhaseOrdering] add shuffle tests based on D40633; NFC. Apr 3 2020, 10:15 AM

spatel mentioned this in D77881: [VectorUtils] add IR-level analysis for widening of shuffle mask. Apr 10 2020, 8:51 AM

spatel mentioned this in rGc23cbefd9d73: [VectorUtils] add IR-level analysis for widening of shuffle mask. Apr 12 2020, 7:28 AM

RKSimon resigned from this revision. May 15 2020, 1:53 AM

Revision Contents

Path	Size
lib/Transforms/InstCombine/InstructionCombining.cpp	43 lines
test/Transforms/InstCombine/vec_shuffle.ll	45 lines

Diff 124869

lib/Transforms/InstCombine/InstructionCombining.cpp

  if (LShuf && RShuf && LShuf->getMask() == RShuf->getMask() &&
      isa<UndefValue>(LShuf->getOperand(1)) &&
      isa<UndefValue>(RShuf->getOperand(1)) &&
      LShuf->getOperand(0)->getType() == RShuf->getOperand(0)->getType()) {
    Value *NewBO = CreateBinOpAsGiven(Inst, LShuf->getOperand(0),
                                      RShuf->getOperand(0), Builder);
    return Builder.CreateShuffleVector(
        NewBO, UndefValue::get(NewBO->getType()), LShuf->getMask());
  }
  // Both arguments of the binary operation are shuffle instructions, but the
  // binary operation vector type differs from the shuffle vector type, e.g.
  // the shuffle operands have type <4 x i32> but the binary operation operands
  // have type <16 x i8>. In this situation, in order to move the shuffle
  // instruction behind the binary operation instruction we need four bitcast
  // instructions: one for each of the two shuffle operands, one for the binary
  // operation result, and one for the shuffle result. That looks like a lot of
  // bitcast instructions, but they will all be eliminated during the
  // subsequent instruction combine phases.
  // Another approach, changing the shuffle instruction data type and
  // recomputing the shuffle mask, has very limited use: we can recompute the
  // shuffle mask only when it is a constant value, and only when we need to
  // change the shuffle vector type from <4 x i32> to <16 x i8>, but not vice
  // versa.
  BitCastInst *LBitCast = dyn_cast<BitCastInst>(LHS);
  BitCastInst *RBitCast = dyn_cast<BitCastInst>(RHS);
  if (LBitCast && RBitCast) {
    Value *LBitCastOp = LBitCast->getOperand(0);
    Value *RBitCastOp = RBitCast->getOperand(0);
    ShuffleVectorInst *LBcShuf = dyn_cast<ShuffleVectorInst>(LBitCastOp);
    ShuffleVectorInst *RBcShuf = dyn_cast<ShuffleVectorInst>(RBitCastOp);

    if (LBcShuf && RBcShuf && LBcShuf->getMask() == RBcShuf->getMask() &&
        isa<UndefValue>(LBcShuf->getOperand(1)) &&
        isa<UndefValue>(RBcShuf->getOperand(1)) &&
        LBcShuf->getOperand(0)->getType() ==
            RBcShuf->getOperand(0)->getType()) {

      Value *LBitCast =
          Builder.CreateBitCast(LBcShuf->getOperand(0), Inst.getType());
      Value *RBitCast =
          Builder.CreateBitCast(RBcShuf->getOperand(0), Inst.getType());

      Value *NewBinOp = CreateBinOpAsGiven(Inst, LBitCast, RBitCast, Builder);
      Value *ShufBitCast = Builder.CreateBitCast(NewBinOp, LBcShuf->getType());

      Value *NewShuf = Builder.CreateShuffleVector(
          ShufBitCast, UndefValue::get(LBcShuf->getType()), LBcShuf->getMask());
      Value *NewBitCast = Builder.CreateBitCast(NewShuf, Inst.getType());
      return NewBitCast;
    }
  }

  // If one argument is a shuffle within one vector, the other is a constant,
  // try moving the shuffle after the binary operation.
  ShuffleVectorInst *Shuffle = nullptr;
  Constant *C1 = nullptr;
  if (isa<ShuffleVectorInst>(LHS)) Shuffle = cast<ShuffleVectorInst>(LHS);
  if (isa<ShuffleVectorInst>(RHS)) Shuffle = cast<ShuffleVectorInst>(RHS);
  if (isa<Constant>(LHS)) C1 = cast<Constant>(LHS);

test/Transforms/InstCombine/vec_shuffle.ll

define <2 x i32*> @pr23113(<4 x i32*> %A) {
; CHECK-LABEL: @pr23113(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32*> %A, <4 x i32*> undef, <2 x i32> <i32 0, i32 1>
; CHECK-NEXT: ret <2 x i32*> [[TMP1]]
;
  %1 = shufflevector <4 x i32*> %A, <4 x i32*> undef, <2 x i32> <i32 0, i32 1>
  ret <2 x i32*> %1
}

; Function Attrs: noinline nounwind uwtable
define <2 x i64> @shuffle_add2_32_16(<2 x i64> %v) {
; CHECK-LABEL: @shuffle_add2_32_16(
; CHECK-NEXT: [[TMP0:%.*]] = bitcast <2 x i64> %v to <8 x i16>
; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x i64> %v to <8 x i16>
; CHECK-NEXT: [[TMP2:%.*]] = add <8 x i16> [[TMP1:%.*]], [[TMP2:%.*]]
; CHECK-NEXT: [[TMP3:%.*]] = bitcast <8 x i16> [[TMP2:%.*]] to <2 x i64>
; CHECK-NEXT: ret <2 x i64> [[TMP3:%.*]]
;
  %bc0 = bitcast <2 x i64> %v to <4 x i32>
  %shuffle = shufflevector <4 x i32> %bc0, <4 x i32> zeroinitializer, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %bc1 = bitcast <4 x i32> %shuffle to <2 x i64>
  %bc2 = bitcast <2 x i64> %bc1 to <8 x i16>
  %add.i = add <8 x i16> %bc2, %bc2
  %bc3 = bitcast <8 x i16> %add.i to <2 x i64>
  %bc4 = bitcast <2 x i64> %bc3 to <4 x i32>
  %shuffle4 = shufflevector <4 x i32> %bc4, <4 x i32> zeroinitializer, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %bc5 = bitcast <4 x i32> %shuffle4 to <2 x i64>
  ret <2 x i64> %bc5
}



; Function Attrs: noinline nounwind uwtable
define <2 x i64> @shuffle_add2_32_8(<2 x i64> %v) {
; CHECK-LABEL: @shuffle_add2_32_8(
; CHECK-NEXT: [[TMP0:%.*]] = bitcast <2 x i64> %v to <16 x i8>
; CHECK-NEXT: [[TMP1:%.*]] = bitcast <2 x i64> %v to <16 x i8>
; CHECK-NEXT: [[TMP2:%.*]] = add <16 x i8> [[TMP1:%.*]], [[TMP2:%.*]]
; CHECK-NEXT: [[TMP3:%.*]] = bitcast <16 x i8> [[TMP2:%.*]] to <2 x i64>
; CHECK-NEXT: ret <2 x i64> [[TMP3:%.*]]
;
  %bc0 = bitcast <2 x i64> %v to <4 x i32>
  %shuffle = shufflevector <4 x i32> %bc0, <4 x i32> zeroinitializer, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %bc1 = bitcast <4 x i32> %shuffle to <2 x i64>
  %bc2 = bitcast <2 x i64> %bc1 to <16 x i8>
  %add.i = add <16 x i8> %bc2, %bc2
  %bc3 = bitcast <16 x i8> %add.i to <2 x i64>
  %bc4 = bitcast <2 x i64> %bc3 to <4 x i32>
  %shuffle4 = shufflevector <4 x i32> %bc4, <4 x i32> zeroinitializer, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %bc5 = bitcast <4 x i32> %shuffle4 to <2 x i64>
  ret <2 x i64> %bc5
}
