Fix allows for the following transformation to take place:
BINARY-OPERATION( SHUFFLE(vector1, mask), SHUFFLE(vector2, mask)) -> SHUFFLE( BINARY-OPERATION(vector1, vector2), mask)
in a case when BINARY-OPERATION instruction operands vector type is different from SHUFFLE instruction operands vector type. e.g. <4 x i32> and <16 x i8>, obviously the both data types need to have the same width.
Details
- Reviewers
craig.topper spatel efriedma RKSimon
Diff Detail
Event Timeline
lib/Transforms/InstCombine/InstructionCombining.cpp | ||
---|---|---|
1715–1717 | This is not correct. You can't assume the surrounding instructions when creating an instcombine fold. Your test cases should be minimal and show what happens with those patterns. For this transform, that would be something like this: define <8 x i16> @shuffle_add(<4 x i32> %v1, <4 x i32> %v2) { %shuffle1 = shufflevector <4 x i32> %v1, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0> %shuffle2 = shufflevector <4 x i32> %v2, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0> %bc1 = bitcast <4 x i32> %shuffle1 to <8 x i16> %bc2 = bitcast <4 x i32> %shuffle2 to <8 x i16> %add = add <8 x i16> %bc1, %bc2 ret <8 x i16> %add } With this patch, we go from 5 instructions to 6: define <8 x i16> @shuffle_add(<4 x i32> %v1, <4 x i32> %v2) { %1 = bitcast <4 x i32> %v1 to <8 x i16> %2 = bitcast <4 x i32> %v2 to <8 x i16> %3 = add <8 x i16> %1, %2 %4 = bitcast <8 x i16> %3 to <4 x i32> %5 = shufflevector <4 x i32> %4, <4 x i32> undef, <4 x i32> <i32 3, i32 2, i32 1, i32 0> %6 = bitcast <4 x i32> %5 to <8 x i16> ret <8 x i16> %6 } Maybe this makes sense because we traded a shuffle for a couple of bitcasts? If we used the narrower element type for the shuffle, then we'd have a reduction in instruction count by eliminating the last 2 bitcasts, but I don't know if that is allowed as a target-independent transform. |
lib/Transforms/InstCombine/InstructionCombining.cpp | ||
---|---|---|
1720–1721 | The shuffle mask is always a constant: |
lib/Transforms/InstCombine/InstructionCombining.cpp | ||
---|---|---|
1715–1717 | On some targets, vector bitcasts aren't free (IIRC big-endian ARM is like this). Changing the type of the shuffle is.... maybe a little sketchy. I mean, ideally targets should be able to handle either form, but I'm not sure we actually do that reliably. We don't have good tests for that sort of thing. |
lib/Transforms/InstCombine/InstructionCombining.cpp | ||
---|---|---|
1715–1717 | Yeah, I was afraid that was the answer :) So 2 options:
|
Thanks guys for valuable comments, will reimplementing the fix as suggested by Sanyaj.
Reimplemented the fix based on the reviewers recommendations.
Now the fix makes an attempt to transform sequence :
SHUFFLE<T0>(MASK) --> BITCACT<T1> --> BINOP<T1> --> BITCAST<T0> --> SHUFFLE<T0>(MASK)
into:
BITCAST<T1> --> BINOP<T1> --> SHUFFLE<T1>(NEW_MASK)
It is always possible when sizeof of BINOP vector element type is smaller than sizeof of SHUFFLE vector element type,
and sometimes is possible when it is not.
@kbelochapka I think we can abandon this now the vector combine pass handles PR35454
@spatel Maybe ensure we have all the test coverage from the tests that Konstantin added here?
Tests adapted from this patch and added to "PhaseOrdering":
rG389704cc601
I added test comments about what we still can do to improve things. Also note that the fold in -vector-combine relies on the cost model, so we don't get the simplifications for a base SSE2 compile (I assumed from the asm shown in https://bugs.llvm.org/show_bug.cgi?id=35454 that we care about an AVX or later target).
This is not correct. You can't assume the surrounding instructions when creating an instcombine fold. Your test cases should be minimal and show what happens with those patterns. For this transform, that would be something like this:
With this patch, we go from 5 instructions to 6:
Maybe this makes sense because we traded a shuffle for a couple of bitcasts?
If we used the narrower element type for the shuffle, then we'd have a reduction in instruction count by eliminating the last 2 bitcasts, but I don't know if that is allowed as a target-independent transform.