[InstCombineCasts] Take in account final size when transforming sext->lshr->trunc patterns
ClosedPublic

Authored by davide on Thu, May 18, 2:20 PM.

Details

Summary

Counter example where this miscompiles:

define i8 @tinky() {

%sext = sext i1 1 to i16
%hibit = lshr i16 %sext, 15
%tr = trunc i16 %hibit to i8
ret i8 %tr

}

define i8 @winky() {

%sext = sext i1 1 to i8
ret i8 %sext

}

which gets folded to:

define i8 @tinky() {

ret i8 1

}

define i8 @winky() {

ret i8 -1

}

Alive confirms: http://rise4fun.com/Alive/plX

Diff Detail

Repository
rL LLVM
davide created this revision.Thu, May 18, 2:20 PM

This also refactors the codepath for clarity (as suggested by Sanjay).

jacobly added inline comments.Thu, May 18, 7:01 PM
lib/Transforms/InstCombine/InstCombineCasts.cpp
608 ↗(On Diff #99493)

This needs to be checked in every case, so maybe it should be factored out.

611 ↗(On Diff #99493)

This code also handles CISize < ASize correctly: http://rise4fun.com/Alive/eSU. There was no need to check ShiftAmt <= SExtSize - ASize in the first place since we only care about zero bits being shifted into the final type.

I think it's good to solve the miscompile case ASAP, but fundamentally, this block of code is error-prone and incomplete because we're missing earlier folds that would have kept us out of trouble from trying to reason about a 3 instruction sequence (trunc(lshr (sext A), C).

For example, the modified PR33078 pattern in the test here could be canonicalized like this

Name: smear_hibit
%sext = sext i3 %x to i16
%r = lshr i16 %sext, 13
  =>
%s = ashr i3 %x, 2
%r = zext i3 %s to i16

Ie, we should lift the shift because narrower ops are easier for value tracking. We might also consider it an optimization / strength-reduction to replace a wide shift with a narrow shift.

If we had this pattern, the zext followed by trunc would get optimized away cleanly using an existing simple fold that's easy to reason about.

In the actual PR33078 problem description case, there is no question that this is a missed fold:

Name: bool_hibit
%sext = sext i1 %x to i16
%hibit = lshr i16 %sext, 15
  =>
%hibit = zext i1 %x to i16

This should again become a zext/trunc pair for PR33078, and we would never reach the buggy block of code.
I would prefer to solve the bug by fixing the most basic patterns. By doing that, we should be able to simplify - or hopefully remove entirely - this block of code that is looking for longer patterns.

So what youre suggesting here is to canonicalize and remove this Xform entirely?

So what youre suggesting here is to canonicalize and remove this Xform entirely?

Yes - but I acknowledge this may be a longer path than the quick fix of just adding extra predicates to this transform directly to avoid the known bug.

If you think it's better to have the quick fix to avoid the known miscompile, I would support that solution with a FIXME note about trying to get rid of this transform.

Here are some tests for the smaller patterns that we might want to match:

; FIXME: The bool bit got smeared across a wide val, but then we zero'd out those bits. This is just a zext.

define i16 @bool_zext(i1 %x) {
  %sext = sext i1 %x to i16
  %hibit = lshr i16 %sext, 15
  ret i16 %hibit
}

define <2 x i8> @bool_zext_splat(<2 x i1> %x) {
  %sext = sext <2 x i1> %x to <2 x i8>
  %hibit = lshr <2 x i8> %sext, <i8 7, i8 7>
  ret <2 x i8> %hibit
}

; FIXME: We could use a narrow arithmetic shift first and then zext.

define i16 @smear_sign_and_widen(i4 %x) {
  %sext = sext i4 %x to i16
  %hibit = lshr i16 %sext, 12
  ret i16 %hibit
}

define <2 x i8> @smear_sign_and_widen_splat(<2 x i6> %x) {
  %sext = sext <2 x i6> %x to <2 x i8>
  %hibit = lshr <2 x i8> %sext, <i8 2, i8 2>
  ret <2 x i8> %hibit
}

; FIXME: All of the replicated sign bits are wiped out by the lshr. This could be lshr+zext.

define i16 @fake_sext(i3 %x) {
  %sext = sext i3 %x to i16
  %sh = lshr i16 %sext, 15
  ret i16 %sh
}

define <2 x i8> @fake_sext_splat(<2 x i3> %x) {
  %sext = sext <2 x i3> %x to <2 x i8>
  %sh = lshr <2 x i8> %sext, <i8 7, i8 7>
  ret <2 x i8> %sh
}

Here are some tests for the smaller patterns that we might want to match:

; FIXME: The bool bit got smeared across a wide val, but then we zero'd out those bits. This is just a zext.

define i16 @bool_zext(i1 %x) {
  %sext = sext i1 %x to i16
  %hibit = lshr i16 %sext, 15
  ret i16 %hibit
}

define <2 x i8> @bool_zext_splat(<2 x i1> %x) {
  %sext = sext <2 x i1> %x to <2 x i8>
  %hibit = lshr <2 x i8> %sext, <i8 7, i8 7>
  ret <2 x i8> %hibit
}

; FIXME: We could use a narrow arithmetic shift first and then zext.

define i16 @smear_sign_and_widen(i4 %x) {
  %sext = sext i4 %x to i16
  %hibit = lshr i16 %sext, 12
  ret i16 %hibit
}

define <2 x i8> @smear_sign_and_widen_splat(<2 x i6> %x) {
  %sext = sext <2 x i6> %x to <2 x i8>
  %hibit = lshr <2 x i8> %sext, <i8 2, i8 2>
  ret <2 x i8> %hibit
}

; FIXME: All of the replicated sign bits are wiped out by the lshr. This could be lshr+zext.

define i16 @fake_sext(i3 %x) {
  %sext = sext i3 %x to i16
  %sh = lshr i16 %sext, 15
  ret i16 %sh
}

define <2 x i8> @fake_sext_splat(<2 x i3> %x) {
  %sext = sext <2 x i3> %x to <2 x i8>
  %sh = lshr <2 x i8> %sext, <i8 7, i8 7>
  ret <2 x i8> %sh
}

Feel free to commit these tests, I think they're good.
I'm worried about fixing the miscompile. I'll commit some tests where we get this wrong, then upload a new version of the patch so that you can take a look.

davide updated this revision to Diff 99614.Fri, May 19, 1:01 PM

Updated after Sanjay's/Jacob's feedback. Thanks for your review!

davide updated this revision to Diff 99615.Fri, May 19, 1:11 PM

Correct version of the patch & clang-formatted.

spatel accepted this revision.Sun, May 21, 8:22 AM

This is strictly safer (does the transform less often), so LGTM. Bonus: we're not creating an unnecessary cast in the case where the result size is the same as the source op.

@davide, let me know if you plan to work on the (lshr (sext X), C) patterns. I checked in the tests with rL303504.

This revision is now accepted and ready to land.Sun, May 21, 8:22 AM

This is strictly safer (does the transform less often), so LGTM. Bonus: we're not creating an unnecessary cast in the case where the result size is the same as the source op.

@davide, let me know if you plan to work on the (lshr (sext X), C) patterns. I checked in the tests with rL303504.

I do plan to work on them, but if you have some bandwidth, any help would be appreciated :)
Should we open a bug so that we don't forget?

This revision was automatically updated to reflect the committed changes.
spatel added a subscriber: dberlin.Mon, May 22, 8:38 AM

This is strictly safer (does the transform less often), so LGTM. Bonus: we're not creating an unnecessary cast in the case where the result size is the same as the source op.

@davide, let me know if you plan to work on the (lshr (sext X), C) patterns. I checked in the tests with rL303504.

I do plan to work on them, but if you have some bandwidth, any help would be appreciated :)
Should we open a bug so that we don't forget?

Bugs reports are always good for me, but since this one is in mental cache right now, I'd rather just squash it now. :)

The interesting thing about this case is that it feeds into the larger current discussion about how to split up InstCombine to make the whole thing less obnoxious.

Take these two cases:

Name: shift_less
%sext = sext i3 %x to i8
%hibit = lshr i8 %sext, 7
  =>
%tr = lshr i3 %x, 2
%hibit = zext i3 %tr to i8

Name: bool_hibit
%sext = sext i1 %x to i8
%hibit = lshr i8 %sext, 7
  =>
%hibit = zext i1 %x to i8

If we think of InstCombine as a pure canonicalizer, we might make a lshr rule like (semi-pseudo-code):

// Are we moving the sign bit to the low bit and widening a value with high zeros?
// lshr (sext X), C --> zext (lshr X, C2)
if (match(Op0, m_SExt(m_Value(X))) && 
    match(Op1, m_APInt(C)) && 
    C->getZextValue() == Op0->getScalarSizeInBits() - 1) {
   return createZext(createLShr(X, X->getScalarSizeInBits() - 1));
}

In other words, we would not special-case the i1 (bool) example and eliminate this no-op before it ever existed:
%tr = lshr i3 %x, 0
...we'd leave that to the optimizer pass because eliminating that instruction is an optimization, not a canonicalization.

Now, we might just say this is a bug in the IRBuilder - why doesn't it check for a zero shift constant before creating an obviously unnecessary instruction? On the other hand, that's not its job. It shouldn't think at all about that special case; that's the caller's responsibility.

We'll find many examples like this if we start trying to create a pure canonicalizer. If I interpreted the statement correctly, @dberlin mentioned this potential inefficiency in:
http://lists.llvm.org/pipermail/llvm-dev/2017-May/113220.html