This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] fold trunc ([lshr] (bitcast vector) ) --> extractelement (PR25543)
ClosedPublic

Authored by spatel on Dec 9 2015, 1:12 PM.

Details

Summary

This is a fix for PR25543:
https://llvm.org/bugs/show_bug.cgi?id=25543

The idea is to take the existing fold of:
bitcast ( trunc ( lshr ( bitcast X))) --> extractelement (bitcast X)
( http://reviews.llvm.org/rL112232 )

And break it into 2 less specific transforms so we'll catch more cases such as the example in the bug report:
bitcast ( trunc ( lshr ( bitcast X))) --> bitcast ( extractelement (bitcast X)) --> extractelement (bitcast X)

D14879 handles the 2nd transform: folding of bitcasts around the extractelement.

Diff Detail

Event Timeline

spatel updated this revision to Diff 42329.Dec 9 2015, 1:12 PM
spatel retitled this revision from to [InstCombine] fold trunc ([lshr] (bitcast vector) ) --> extractelement (PR25543).
spatel updated this object.
spatel added reviewers: hfinkel, kuhar, majnemer.
spatel added a subscriber: llvm-commits.
hfinkel added inline comments.Dec 10 2015, 1:13 AM
test/Transforms/InstCombine/bitcast-bigendian.ll
24

I'm a bit surprised by this change. We used to prefer the vector bitcast over the scalar one, and with this change, we prefer the scalar bitcast after the abstract. I can see benefits to this as a canonical form (even through some backends will need to work somewhat harder to retail good code quality: vector bitcasts are often cheaper than scalar ones). However, what happens when both elements are extracted? Do we end up with multiple scalar bitcasts?

spatel added inline comments.Dec 10 2015, 10:12 AM
test/Transforms/InstCombine/bitcast-bigendian.ll
24

I didn't think much of that difference for the existing test case, but for the case that I think you're asking about:

define float @test2(<2 x i32> %A) {
  %tmp28 = bitcast <2 x i32> %A to i64
  %tmp23 = trunc i64 %tmp28 to i32
  %tmp24 = bitcast i32 %tmp23 to float

  %tmp = bitcast <2 x i32> %A to i64
  %lshr = lshr i64 %tmp, 32
  %tmp2 = trunc i64 %lshr to i32
  %tmp4 = bitcast i32 %tmp2 to float 

  %add = fadd float %tmp24, %tmp4
  ret float %add
}

Yes, we'll get multiple scalar bitcasts:

define float @test2(<2 x i32> %A) {
  %tmp23 = extractelement <2 x i32> %A, i32 0
  %tmp24 = bitcast i32 %tmp23 to float
  %tmp2 = extractelement <2 x i32> %A, i32 1
  %tmp4 = bitcast i32 %tmp2 to float
  %add = fadd float %tmp24, %tmp4
  ret float %add
}

And the codegen for that is decidedly worse on all of PPC64+Altivec, AArch64, and x86-64 than what we used to get:

define float @test2(<2 x i32> %A) {
  %1 = bitcast <2 x i32> %A to <2 x float>
  %tmp24 = extractelement <2 x float> %1, i32 0
  %2 = bitcast <2 x i32> %A to <2 x float>
  %tmp4 = extractelement <2 x float> %2, i32 1
  %add = fadd float %tmp24, %tmp4
  ret float %add
}

Should I create a bitcast canonicalization instcombine to hoist bitcasts ahead of extracts?

hfinkel added inline comments.Dec 10 2015, 10:25 AM
test/Transforms/InstCombine/bitcast-bigendian.ll
24

Should I create a bitcast canonicalization instcombine to hoist bitcasts ahead of extracts?

I think it is likely better to emulate the current apparent preference for the vector bitcast over the scalar one(s).

However, given that your code creates a vector bit cast, what code is undoing that to prefer the scalar bitcasts?

spatel added inline comments.Dec 10 2015, 10:51 AM
test/Transforms/InstCombine/bitcast-bigendian.ll
24

I was digging around for that, but it's an illusion. :)
If we have this sequence:

%bc1 = bitcast <2 x i32> %A to i64
%trunc = trunc i64 %bc1 to i32
%bc2 = bitcast i32 %trunc to float

This patch fires at the trunc without ever seeing the 2nd bitcast, and says, "That's an extract!" and it eliminates the need for the first bitcast:

%ext = extractelement <2 x i32> %A, i32 0
%bc2 = bitcast i32 %ext to float

So the remaining bitcasts that we're seeing in these cases are just the original instructions. Without this patch, there was no direct transform of the trunc; we only matched a (bc(trunc(bc x))) pattern, so we could hoist the 2nd bitcast.

To take this to its logical (I hope) conclusion...
We shouldn't need the transform in D14879 at all if we:

  1. Canonicalize bitcasts before extractelements
  2. Transform (bitcast (bitcast X)) --> bitcast X

I'm not sure why we wouldn't allow the 2nd for any types, but CastInst::isEliminableCastPair() says we're not allowed to simplify those if vector and scalar types are intermingled and it's not a roundtrip to the original type:

// If either of the casts are a bitcast from scalar to vector, disallow the
// merging. However, bitcast of A->B->A are allowed.

So for the 8 vector/scalar type combinations for a pair of bitcasts, we don't optimize the middle 6 in this list:

define ppc_fp128 @bitcast_bitcast_scalar_scalar_scalar(i128 %A) {
  %bc1 = bitcast i128 %A to fp128
  %bc2 = bitcast fp128 %bc1 to ppc_fp128
  ret ppc_fp128 %bc2
}

define <2 x i32> @bitcast_bitcast_scalar_scalar_vector(i64 %A) {
  %bc1 = bitcast i64 %A to double
  %bc2 = bitcast double %bc1 to <2 x i32>
  ret <2 x i32> %bc2
}

define double @bitcast_bitcast_scalar_vector_scalar(i64 %A) {
  %bc1 = bitcast i64 %A to <2 x i32>
  %bc2 = bitcast <2 x i32> %bc1 to double
  ret double %bc2
}

define <2 x i32> @bitcast_bitcast_scalar_vector_vector(i64 %A) {
  %bc1 = bitcast i64 %A to <4 x i16>
  %bc2 = bitcast <4 x i16> %bc1 to <2 x i32>
  ret <2 x i32> %bc2
}

define i64 @bitcast_bitcast_vector_scalar_scalar(<2 x i32> %A) {
  %bc1 = bitcast <2 x i32> %A to double
  %bc2 = bitcast double %bc1 to i64
  ret i64 %bc2
}

define <4 x i16> @bitcast_bitcast_vector_scalar_vector(<2 x i32> %A) {
  %bc1 = bitcast <2 x i32> %A to double
  %bc2 = bitcast double %bc1 to <4 x i16>
  ret <4 x i16> %bc2
}

define double @bitcast_bitcast_vector_vector_scalar(<2 x float> %A) {
  %bc1 = bitcast <2 x float> %A to <4 x i16>
  %bc2 = bitcast <4 x i16> %bc1 to double
  ret double %bc2
}

define <2 x i32> @bitcast_bitcast_vector_vector_vector(<2 x float> %A) {
  %bc1 = bitcast <2 x float> %A to <4 x i16>
  %bc2 = bitcast <4 x i16> %bc1 to <2 x i32>
  ret <2 x i32> %bc2
}
hfinkel added a subscriber: nadav.Dec 10 2015, 3:47 PM
hfinkel edited edge metadata.Dec 10 2015, 3:53 PM

To take this to its logical (I hope) conclusion...
We shouldn't need the transform in D14879 at all if we:

  1. Canonicalize bitcasts before extractelements
  2. Transform (bitcast (bitcast X)) --> bitcast X

I'm not sure why we wouldn't allow the 2nd for any types, but CastInst::isEliminableCastPair() says we're not allowed to simplify those if vector and scalar types are intermingled and it's not a roundtrip to the original type:

// If either of the casts are a bitcast from scalar to vector, disallow the
// merging. However, bitcast of A->B->A are allowed.

I've added Nadav, in case he remembers what this is all about. It seems to have been related to a fix to:

https://llvm.org/bugs/show_bug.cgi?id=7311

see also:

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20110829/127115.html
http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20110829/127089.html

and while I'm not 100% sure, I suspect that limitation in isEliminableCastPair is to prevent constant folding from creating illegal instructions (like uitofp i64 %x to <2 x float>) more than any underlying statement about the semantics of bitcasts. Thus, the current limitations in instcombine seem like an unfortunate accident more than some purposeful design.

I think that the canonicalization you propose makes sense.

...

spatel updated this revision to Diff 42639.Dec 12 2015, 9:56 AM
spatel edited edge metadata.

Patch updated.
After:
http://reviews.llvm.org/rL255399 (combine bitcasts)
http://reviews.llvm.org/rL255433 (canonicalize extractelement(bitcast X))

We don't get any changes for the tests in bitcast.ll and bitcast-bigendian.ll, so this patch should now preserve existing behavior other than adding the trunc -> extract optimization.

hfinkel accepted this revision.Dec 12 2015, 4:02 PM
hfinkel edited edge metadata.

LGTM.

This revision is now accepted and ready to land.Dec 12 2015, 4:02 PM
This revision was automatically updated to reflect the committed changes.