This is an archive of the discontinued LLVM Phabricator instance.

[PCG] Poor shuffle lane tracking (PR35454 )
Needs ReviewPublic

Authored by kbelochapka on Nov 29 2017, 7:58 PM.

Details

Summary

Fix allows for the following transformation to take place:
BINARY-OPERATION( SHUFFLE(vector1, mask), SHUFFLE(vector2, mask)) -> SHUFFLE( BINARY-OPERATION(vector1, vector2), mask)
in a case when BINARY-OPERATION instruction operands vector type is different from SHUFFLE instruction operands vector type. e.g. <4 x i32> and <16 x i8>, obviously the both data types need to have the same width.

Diff Detail

Event Timeline

kbelochapka created this revision.Nov 29 2017, 7:58 PM
spatel added inline comments.
lib/Transforms/InstCombine/InstructionCombining.cpp
1715–1717

This is not correct. You can't assume the surrounding instructions when creating an instcombine fold. Your test cases should be minimal and show what happens with those patterns. For this transform, that would be something like this:

define <8 x i16> @shuffle_add(<4 x i32> %v1, <4 x i32> %v2) {
  %shuffle1 = shufflevector <4 x i32> %v1, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %shuffle2 = shufflevector <4 x i32> %v2, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %bc1 = bitcast <4 x i32> %shuffle1 to <8 x i16>
  %bc2 = bitcast <4 x i32> %shuffle2 to <8 x i16>
  %add = add <8 x i16> %bc1, %bc2
  ret <8 x i16> %add
}

With this patch, we go from 5 instructions to 6:

define <8 x i16> @shuffle_add(<4 x i32> %v1, <4 x i32> %v2) {
  %1 = bitcast <4 x i32> %v1 to <8 x i16>
  %2 = bitcast <4 x i32> %v2 to <8 x i16>
  %3 = add <8 x i16> %1, %2
  %4 = bitcast <8 x i16> %3 to <4 x i32>
  %5 = shufflevector <4 x i32> %4, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0>
  %6 = bitcast <4 x i32> %5 to <8 x i16>
  ret <8 x i16> %6
}

Maybe this makes sense because we traded a shuffle for a couple of bitcasts?

If we used the narrower element type for the shuffle, then we'd have a reduction in instruction count by eliminating the last 2 bitcasts, but I don't know if that is allowed as a target-independent transform.

spatel added inline comments.Nov 30 2017, 12:54 PM
lib/Transforms/InstCombine/InstructionCombining.cpp
1720–1721

The shuffle mask is always a constant:
"The shuffle mask operand is required to be a constant vector with either constant integer or undef values."
http://llvm.org/docs/LangRef.html#shufflevector-instruction

efriedma added inline comments.Nov 30 2017, 1:06 PM
lib/Transforms/InstCombine/InstructionCombining.cpp
1715–1717

On some targets, vector bitcasts aren't free (IIRC big-endian ARM is like this).

Changing the type of the shuffle is.... maybe a little sketchy. I mean, ideally targets should be able to handle either form, but I'm not sure we actually do that reliably. We don't have good tests for that sort of thing.

spatel added inline comments.Nov 30 2017, 1:12 PM
lib/Transforms/InstCombine/InstructionCombining.cpp
1715–1717

Yeah, I was afraid that was the answer :)

So 2 options:

  1. Wait to do this in the DAG where we can ask if bitcasts are free.
  2. Try to match the larger pattern (shuffle+binop+shuffle) shown in the motivating tests.

Thanks guys for valuable comments, will reimplementing the fix as suggested by Sanyaj.

Reimplemented the fix based on the reviewers recommendations.
Now the fix makes an attempt to transform sequence :
SHUFFLE<T0>(MASK) --> BITCACT<T1> --> BINOP<T1> --> BITCAST<T0> --> SHUFFLE<T0>(MASK)
into:
BITCAST<T1> --> BINOP<T1> --> SHUFFLE<T1>(NEW_MASK)
It is always possible when sizeof of BINOP vector element type is smaller than sizeof of SHUFFLE vector element type,
and sometimes is possible when it is not.

@kbelochapka I think we can abandon this now the vector combine pass handles PR35454

@spatel Maybe ensure we have all the test coverage from the tests that Konstantin added here?

spatel added a comment.Apr 3 2020, 9:45 AM

@spatel Maybe ensure we have all the test coverage from the tests that Konstantin added here?

Tests adapted from this patch and added to "PhaseOrdering":
rG389704cc601

I added test comments about what we still can do to improve things. Also note that the fold in -vector-combine relies on the cost model, so we don't get the simplifications for a base SSE2 compile (I assumed from the asm shown in https://bugs.llvm.org/show_bug.cgi?id=35454 that we care about an AVX or later target).

RKSimon resigned from this revision.May 15 2020, 1:53 AM