
[InstCombine] Enhance narrowUDivURem.
Needs Revision · Public

Authored by bixia on May 11 2018, 10:12 AM.

Details

Summary

Transform udiv/urem (zext X), C to zext (udiv/urem X, C') when X has two uses
and both uses can be transformed. This is a common pattern seen in code
generated by the XLA GPU backend.
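The arithmetic fact behind the transform can be checked numerically. This is an illustrative Python sketch (not the LLVM C++ implementation): for an unsigned 32-bit X, `zext` does not change the value, and the udiv quotient always fits back in 32 bits, so the narrow op followed by a zext is lossless.

```python
# Illustrative sketch of: udiv/urem (zext X), C  ->  zext (udiv/urem X, C')
MASK32 = (1 << 32) - 1

def udiv_wide(x32, c):
    za = x32 & MASK32          # zext i32 X to i64: unsigned value unchanged
    return za // c             # udiv i64

def udiv_narrow_then_zext(x32, c):
    q = (x32 & MASK32) // c    # udiv i32: quotient always fits in 32 bits
    assert q <= MASK32         # so the zext back to i64 is lossless
    return q

for x in (0, 1, 33, 4096, MASK32):
    for c in (1, 33, 384):
        assert udiv_wide(x, c) == udiv_narrow_then_zext(x, c)
        assert (x & MASK32) % c <= MASK32   # the urem case is lossless too
```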

Add two test cases.

Diff Detail

Event Timeline

bixia created this revision.May 11 2018, 10:12 AM
lebedev.ri added a subscriber: lebedev.ri.
lebedev.ri added inline comments.
lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
947

I'd guess it would be simpler to use early return.

test/Transforms/InstCombine/udivrem-change-width.ll
88

Add a second test with a different constant that could still be transformed, but isn't.

This (walking users of a value) stretches instcombine beyond the simplistic transforms that I think most people want to see here.

Have a look at the BypassSlowDivision (runs within -codegenprepare I think) and DivRemPairs (-div-rem-pairs) passes.

bixia updated this revision to Diff 146358.May 11 2018, 10:45 AM

Add another test case.

bixia edited the summary of this revision.May 11 2018, 10:52 AM
bixia marked an inline comment as done.May 11 2018, 10:55 AM
bixia added inline comments.
lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
947

I don't quite understand how to do that without duplicating the code inside if (CanConvert). Can you give some more detail?

lebedev.ri added inline comments.May 11 2018, 10:59 AM
lib/Transforms/InstCombine/InstCombineMulDivRem.cpp
947

Now, it's

if (CanConvert) {
  ...
  return new ...;
}

return nullptr;

i'd think you can just do

if (!CanConvert)
  return nullptr;
....
return new ...;
spatel requested changes to this revision.May 11 2018, 11:59 AM

Looked at your example a bit closer, and I don't understand the motivation.

@jlebar improved -correlated-propagation to handle this case in D44102 :

$ ./opt -correlated-propagation divrem.ll -S
define i64 @udiv_urem_twouses_xform(i32 %a) {
  %za = zext i32 %a to i64
  %udiv.lhs.trunc = trunc i64 %za to i32
  %udiv.rhs.trunc = trunc i64 33 to i32
  %udiv1 = udiv i32 %udiv.lhs.trunc, %udiv.rhs.trunc
  %udiv.zext = zext i32 %udiv1 to i64
  %urem.lhs.trunc = trunc i64 %za to i32
  %urem.rhs.trunc = trunc i64 33 to i32
  %urem2 = urem i32 %urem.lhs.trunc, %urem.rhs.trunc
  %urem.zext = zext i32 %urem2 to i64
  %uadd = add i64 %udiv.zext, %urem.zext
  ret i64 %uadd
}
This revision now requires changes to proceed.May 11 2018, 11:59 AM
bixia updated this revision to Diff 146494.May 12 2018, 8:21 PM

Use early return.

bixia added a comment.May 12 2018, 8:36 PM

Yes, you are right that correlated value propagation can handle the case. However, if the divisor is a power of 2 and instcombine is invoked before correlated value propagation (as in opt's pass pipeline), instcombine transforms the i64 udiv/urem into i64 shift/and. My original motivating test case is shown below; the change here is required for opt to transform all of the i64 arithmetic operations into i32 operations.

; Function Attrs: nounwind
declare i32 @get_number() #0
define void @narrow_long_chain_with_udiv_urem(i64* %result) {

%num1 = call i32 @get_number(), !range !0
%block_id = zext i32 %num1 to i64
%num2 = call i32 @get_number(), !range !0
%thread_id = zext i32 %num2 to i64
%tmp = mul nuw nsw i64 %block_id, 64
%linear_index = add nuw nsw i64 %tmp, %thread_id
%tmp1 = udiv i64 %linear_index, 1
%tmp2 = urem i64 %tmp1, 384
%warp_id = udiv i64 %tmp2, 32
%lane_id = urem i64 %tmp2, 32
%tmp3 = mul nsw i64 %warp_id, 8
%tmp4 = add nsw i64 7, %tmp3
%tmp5 = mul nsw i64 32, %tmp4
%tmp6 = add nsw i64 %lane_id, %tmp5
store i64 %tmp6, i64* %result
ret void

}
attributes #0 = { nounwind }
!0 = !{i32 0, i32 9945}
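A quick numeric check (illustrative, not part of the patch) of why this example can be narrowed at all: with !0 = !{i32 0, i32 9945}, each @get_number() result is in [0, 9945), so the widest intermediate, %linear_index = %block_id * 64 + %thread_id, stays far below 2**32 and all the i64 arithmetic fits in i32.

```python
# Largest value allowed by !range !0 (the upper bound is exclusive).
RANGE_MAX = 9944

# %linear_index = %block_id * 64 + %thread_id, at its maximum.
linear_index_max = RANGE_MAX * 64 + RANGE_MAX
assert linear_index_max == 646360
assert linear_index_max < 2**32   # every intermediate fits in i32
```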

Yes, you are right that correlated value propagation can handle the case. However, if the divisor is a power of 2 and instcombine is invoked before correlated value propagation (as in opt's pass pipeline), instcombine transforms the i64 udiv/urem into i64 shift/and.

Why couldn't you teach -correlated-propagation that pattern?

Are you suggesting to teach correlated-propagation to narrow shl/and? InstCombine can perform such narrowing; I don't see why it is better to duplicate this in correlated-propagation. Can you explain? The m_OneUse check in the code I am changing here is simple logic to make sure that the transformation does not increase the number of ZExt instructions. I basically add a little more logic there to say "if there are two uses, and we can guarantee that the transformation does not increase the number of ZExt instructions, we also want to perform the transformation here." What are your concerns with such a change?

Are you suggesting to teach correlated-propagation to narrow shl/and?

Yes.

InstCombine can perform such narrowing; I don't see why it is better to duplicate this in correlated-propagation.

Because as you have already said,

Yes, you are right that correlated value propagation can handle the case.

So you would not be duplicating anything.
In fact, this instcombine change is the one creating duplication.
Because it's the correlated-propagation that already contains such code.

Can you explain? The m_OneUse check in the code I am changing here is simple logic to make sure that the transformation does not increase the number of ZExt instructions. I basically add a little more logic there to say "if there are two uses, and we can guarantee that the transformation does not increase the number of ZExt instructions, we also want to perform the transformation here." What are your concerns with such a change?

bixia added a comment.EditedMay 14 2018, 8:45 AM

Instcombine has udiv/urem narrowing that does not increase the total number of ZExt instructions. Correlated-Propagation has udiv/urem narrowing that can increase the total number of ZExt instructions (kind of a superset of what is in instcombine). These are existing duplicated functionalities. I have considered two possible changes: (1) allow instcombine to increase the total number of ZExt instructions in general (by simply removing the m_OneUse-related checks); (2) as shown in this change, allow instcombine to handle a special case that may increase the number of ZExt instructions by one.
I looked at other similar narrowing transformations in instcombine and found that most of them have a check to avoid increasing the total number of ZExt/SExt instructions (lshr narrowing, for example), but some of them don't have such a check (shl narrowing, for example). I am not sure whether these are by design or due to something else such as ignorance. Do you know about this?

Now you are proposing a third solution: (3) create yet another duplicated-functionality situation between instcombine and correlated-propagation. The additional lshr narrowing and and-narrowing needed to make my test case work (see my note below) do not require control flow information, and putting them into correlated-propagation is overkill.
Can you explain why you believe approach (3) is the best among the three solutions I described?

Note: for my motivating test case, without my change here, instcombine can't narrow this code, mostly because %tmp2 has two uses:

%tmp2 = zext i32 %1 to i64
%lane_id = and i64 %tmp2, 31
%2 = lshr i64 %tmp2, 2

You want to narrow multi-use sequences of code because instcombine can't do that profitably using minimal peepholes:

Name: narrow expression
Pre: C1 u< 9945
  %tmp2 = zext i32 C1 to i64
  %lane_id = and i64 %tmp2, 31
  %shr = lshr i64 %tmp2, 2
  %tmp4 = shl nuw nsw i64 %shr, 5
  %r = or i64 %lane_id, %tmp4
=>
  %nlane_id = and i32 C1, 31
  %nshr = lshr i32 C1, 2
  %ntmp4 = shl i32 %nshr, 5
  %ntmp5 = or i32 %nlane_id, %ntmp4
  %r = zext i32 %ntmp5 to i64

The original motivation for -aggressive-instcombine (TruncInstCombine) was something almost like that - see D38313. Can you extend that? Note that the general problem isn't about udiv/urem, lshr, or any particular binop. It's about narrowing a sequence of arbitrary binops (and maybe even more than binops).

Re: instcombine - if some transform is increasing instruction count by not checking for uses, that's a bug. I think we established that conclusively in D44266.

The problem with trying to do this transform in instcombine by checking the users is that if it's ok to check the users for this 1 transform, then why don't we do that for *every* transform? The compile-time cost of that change would be substantial, and we're trying to limit the cost of instcombine. See recent llvm-dev discussions for more details.

You want to narrow multi-use sequences of code because instcombine can't do that profitably using minimal peepholes:

Name: narrow expression
Pre: C1 u< 9945
  %tmp2 = zext i32 C1 to i64
  %lane_id = and i64 %tmp2, 31
  %shr = lshr i64 %tmp2, 2
  %tmp4 = shl nuw nsw i64 %shr, 5
  %r = or i64 %lane_id, %tmp4
=>
  %nlane_id = and i32 C1, 31
  %nshr = lshr i32 C1, 2
  %ntmp4 = shl i32 %nshr, 5
  %ntmp5 = or i32 %nlane_id, %ntmp4
  %r = zext i32 %ntmp5 to i64

The original motivation for -aggressive-instcombine (TruncInstCombine) was something almost like that - see D38313. Can you extend that? Note that the general problem isn't about udiv/urem, lshr, or any particular binop. It's about narrowing a sequence of arbitrary binops (and maybe even more than binops).

Hmm, there already is llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp.
I'm guessing what you meant is, can that be extended to handle this case, too?
It currently does not: https://godbolt.org/g/CMd7sK

@bixia did you look into that yet by any chance?

Re: instcombine - if some transform is increasing instruction count by not checking for uses, that's a bug. I think we established that conclusively in D44266.

The problem with trying to do this transform in instcombine by checking the users is that if it's ok to check the users for this 1 transform, then why don't we do that for *every* transform? The compile-time cost of that change would be substantial, and we're trying to limit the cost of instcombine. See recent llvm-dev discussions for more details.

The transformation we really want is: (Z(A) / N) * N + Z(A) % N => Z(A) where Z(...) is zero extension. Maybe we can do that directly instead of reasoning about use lists like this (if we do it directly we will trivially not be increasing the number of zero extensions)?
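That identity is easy to sanity-check numerically; this is an illustrative sketch, not code from the patch:

```python
# Check: (Z(A) / N) * N + Z(A) % N == Z(A), where Z is zero extension.
def identity_holds(a32, n):
    za = a32 & ((1 << 32) - 1)        # zext i32 A to i64: value unchanged
    return (za // n) * n + za % n == za

assert all(identity_holds(a, n)
           for a in (0, 7, 2**32 - 1)
           for n in (1, 33, 384))
```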

bixia added a comment.Jun 5 2018, 10:03 PM

What Sanjoy suggested doesn't (completely) solve the problem we have, where UDIV and UREM are used to calculate the dimensional indices from a linear index, and the dimensional indices are then used to access tensors of different shapes (for example, two tensors being accessed both require broadcasting but along different dimensions).
With the current state of D47113, would it be acceptable if I revise my change here to only narrow i64 udiv/urem to i32 using the added patterns?

spatel added a comment.Jun 6 2018, 8:50 AM

What Sanjoy suggested doesn't (completely) solve the problem we have, where UDIV and UREM are used to calculate the dimensional indices from a linear index, and the dimensional indices are then used to access tensors of different shapes (for example, two tensors being accessed both require broadcasting but along different dimensions).
With the current state of D47113, would it be acceptable if I revise my change here to only narrow i64 udiv/urem to i32 using the added patterns?

We're missing the fundamental motivation. This isn't about div/rem based on the IR example posted on May 12. That code is canonicalized to use shifts. So we must account for the fact that the source code could have been written using shifts in the first place. Otherwise, we're chasing an incomplete solution.

So let's look at that example after -instcombine:

define void @narrow_long_chain_with_udiv_urem(i64* %result) {
  %num1 = call i32 @get_number(), !range !0
  %num2 = call i32 @get_number(), !range !0
  %1 = shl nuw nsw i32 %num1, 6
  %addconv = add nuw nsw i32 %1, %num2
  %2 = urem i32 %addconv, 384
  %tmp2 = zext i32 %2 to i64
  %lane_id = and i64 %tmp2, 31
  %3 = lshr i64 %tmp2, 2             <---
  %tmp4 = shl nuw nsw i64 %3, 5      <---
  %tmp5 = or i64 %tmp4, 224
  %tmp6 = or i64 %lane_id, %tmp5
  store i64 %tmp6, i64* %result, align 4
  ret void
}

I've pointed out what I think is the critical sequence: shift right followed by shift left. Why is that a better canonicalization than:

%and = and i64 %x, -4
%left = shl i64 %and, 3

https://rise4fun.com/Alive/Gmc
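The equivalence behind that Alive link can be checked numerically as well (illustrative sketch): lshr by 2 then shl by 5 computes the same i64 value as 'and -4' then shl by 3.

```python
# Two equivalent forms of the critical sequence, modeled in 64-bit arithmetic.
MASK64 = (1 << 64) - 1

def shr_then_shl(x):               # %3 = lshr x, 2 ; %tmp4 = shl %3, 5
    return (((x & MASK64) >> 2) << 5) & MASK64

def and_then_shl(x):               # %and = and x, -4 ; %left = shl %and, 3
    return (((x & MASK64) & -4) << 3) & MASK64

for x in (0, 1, 31, 0xDEADBEEF, MASK64):
    assert shr_then_shl(x) == and_then_shl(x)
```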

If we had that transform, then everything would be shrunken to i32 already as we want/expect:

define void @narrow_long_chain_with_shifts(i64* %result) {
  %num1 = call i32 @get_number(), !range !0
  %num2 = call i32 @get_number(), !range !0
  %1 = shl nuw nsw i32 %num1, 6
  %addconv = add nuw nsw i32 %1, %num2
  %2 = urem i32 %addconv, 384
  %3 = and i32 %2, 31
  %4 = shl nuw nsw i32 %2, 3
  %5 = and i32 %4, 3840
  %6 = or i32 %5, 224
  %tmp61 = or i32 %3, %6
  %tmp6 = zext i32 %tmp61 to i64
  store i64 %tmp6, i64* %result, align 4
  ret void
}

I suspect there's some codegen concern about having a large 'and' constant mask, so if that's a problem, we can add a reverse canonicalization to the DAG to transform back to shift from 'and'. The 'and' seems better for analysis and enabling other transforms (as seen here), so I think we should have this canonicalization in IR.

spatel added a comment.EditedJun 6 2018, 9:02 AM

Note that we don't actively canonicalize to shifts:

 %t1 = shl i64 %x, 3
  %r = and i64 %t1, -32
=>
  %right = lshr i64 %x, 2
  %r = shl i64 %right, 5

https://rise4fun.com/Alive/uz6Y

...so from an instcombine perspective, this is just a case of a *missing* transform (rather than reversing an existing transform).

bixia marked an inline comment as done.Jun 6 2018, 4:02 PM

@spatel Thanks for your comment! I need to revise my example to better match the real case I am looking at, in particular to show that the optimization will increase the number of zext instructions. In this example there are only two zext instructions, but if we narrow all the arithmetic operations to i32 we will need four. In one of your previous comments, you suggested extending TruncInstCombine in aggressive-instcombine. I see the following issues: (1) TruncInstCombine doesn't increase the number of zext/sext instructions. (2) TruncInstCombine currently only handles operations with the property truncate(op(i64), i32) == op(truncate(i64, i32)); div/rem/right-shift, which are needed here, are not such operations. We could probably add value range analysis to resolve this, though. (3) TruncInstCombine is driven by a truncate instruction in the IR, and there is no such truncate instruction in the case here. To handle the case here (such as the "lshr followed by shl" pattern you mentioned), is it acceptable to add a new optimization to aggressive-instcombine that can increase the number of zext/sext instructions?

; Function Attrs: nounwind
declare i32 @get_number() #0
declare i32 @use64_3(i64, i64, i64)

define void @narrow_long_chain_with_udiv_urem_2(i64* %result) {
%num1 = call i32 @get_number(), !range !0
%block_id = zext i32 %num1 to i64
%num2 = call i32 @get_number(), !range !0
%thread_id = zext i32 %num2 to i64
%tmp = mul nuw nsw i64 %block_id, 64
%linear_index = add nuw nsw i64 %tmp, %thread_id
%tmp1 = udiv i64 %linear_index, 1
%x = urem i64 %tmp1, 128
%tmp2 = udiv i64 %tmp1, 128
%y = urem i64 %tmp2, 32
%z = udiv i64 %tmp2, 32
call i32 @use64_3(i64 %x, i64 %y, i64 %z)
%warp_id = udiv i64 %x, 32
%lane_id = urem i64 %x, 32
%tmp3 = mul nsw i64 %warp_id, 8
%tmp4 = add nsw i64 7, %tmp3
%tmp5 = mul nsw i64 32, %tmp4
%tmp6 = add nsw i64 %lane_id, %tmp5
store i64 %tmp6, i64* %result
ret void
}
attributes #0 = { nounwind }
!0 = !{i32 0, i32 9945}

spatel added a comment.Jun 8 2018, 9:28 AM
To handle the case here (such as the pattern "lshr followed by shl" you mentioned) , is it acceptable to add a new optimization to aggressive-instcombine that can increase the number of zext/sext instructions?

I haven't had a chance to look at the new example yet, but let me answer this in general: no. Currently, aggressive-instcombine is a target-independent canonicalization pass just like regular instcombine. And we've said that increasing instruction count is not usually canonical for instcombine.

The only difference for aggressive-instcombine is that we can pursue more expensive pattern matching there because that was deemed acceptable for -O3. So whatever we try to do to use narrow ops should result in equal or less total instructions.

What Sanjoy suggested doesn't (completely) solve the problem we have, where UDIV and UREM are used to calculate the dimensional indices from a linear index, and the dimensional indices are then used to access tensors of different shapes (for example, two tensors being accessed both require broadcasting but along different dimensions).
With the current state of D47113, would it be acceptable if I revise my change here to only narrow i64 udiv/urem to i32 using the added patterns?

We're missing the fundamental motivation. This isn't about div/rem based on the IR example posted on May 12. That code is canonicalized to use shifts. So we must account for the fact that the source code could have been written using shifts in the first place. Otherwise, we're chasing an incomplete solution.

So let's look at that example after -instcombine:

define void @narrow_long_chain_with_udiv_urem(i64* %result) {
  %num1 = call i32 @get_number(), !range !0
  %num2 = call i32 @get_number(), !range !0
  %1 = shl nuw nsw i32 %num1, 6
  %addconv = add nuw nsw i32 %1, %num2
  %2 = urem i32 %addconv, 384
  %tmp2 = zext i32 %2 to i64
  %lane_id = and i64 %tmp2, 31
  %3 = lshr i64 %tmp2, 2             <---
  %tmp4 = shl nuw nsw i64 %3, 5      <---
  %tmp5 = or i64 %tmp4, 224
  %tmp6 = or i64 %lane_id, %tmp5
  store i64 %tmp6, i64* %result, align 4
  ret void
}

I've pointed out what I think is the critical sequence: shift right followed by shift left. Why is that a better canonicalization than:

%and = and i64 %x, -4
%left = shl i64 %and, 3

https://rise4fun.com/Alive/Gmc

If we had that transform, then everything would be shrunken to i32 already as we want/expect:
...
I suspect there's some codegen concern about having a large 'and' constant mask, so if that's a problem, we can add a reverse canonicalization to the DAG to transform back to shift from 'and'. The 'and' seems better for analysis and enabling other transforms (as seen here), so I think we should have this canonicalization in IR.

As suggested in https://bugs.llvm.org/show_bug.cgi?id=37603, I have added such tests in rL334347.
But then i have noticed this: https://github.com/llvm-mirror/llvm/blob/b155a7cf565c5af36bc98ff92917054ade29a2f0/lib/Transforms/InstCombine/InstCombineShifts.cpp#L625-L649

// Be careful about hiding shl instructions behind bit masks. They are used
// to represent multiplies by a constant, and it is important that simple
// arithmetic expressions are still recognizable by scalar evolution.
// The inexact versions are deferred to DAGCombine, so we don't hide shl
// behind a bit mask.

That was added by @spatel in rL293570 / rL293435.
So it seems the current reason why we don't handle different constants in (x >> y) << y (but do handle them in (x << y) >> y!) is SCEV.
Can anyone familiar with SCEV comment, @mkazantsev perhaps?

Correction, that was initially added in rL155136 / rL155362 by @stoklund.

Bixia - is it correct to say that the behavior we want here is that we narrow arithmetic operations from 64 bits to 32 bits, irrespective of how many extra zero extend instructions we need to add? If yes, perhaps this isn't a good match for InstCombine and we should be looking at a specialized target-dependent transformation pass? InstCombine is a "mini LLVM" in that many LLVM passes have a simpler "do no harm" variant implemented in InstCombine, so this won't be anything new design-wise.

Correction, that was initially added in rL155136 / rL155362 by @stoklund.

And as D48229 shows, SCEV is of no concern here.
So only the backend is the probable culprit.
I suppose it will be resolved in D47681 and followups.

lebedev.ri requested changes to this revision.Fri, Jun 21, 10:50 AM

@spatel where are we at with funnel-shift canonicalization with constant shift amounts?
Is it time to fix this via fixing https://bugs.llvm.org/show_bug.cgi?id=37872 ?

This revision now requires changes to proceed.Fri, Jun 21, 10:50 AM
Herald added a project: Restricted Project.Fri, Jun 21, 10:50 AM

@spatel where are we at with funnel-shift canonicalization with constant shift amounts?
Is it time to fix this via fixing https://bugs.llvm.org/show_bug.cgi?id=37872 ?

Yes, I can make that change. I'm not aware of any gaps in IR analysis for funnel-shift / rotate at this point, and the expansion in SDAG should be identical to the current IR sequence:

or (shl X, C), (lshr X, (width - C))

With a constant shift amount, we eliminate the variations and concerns about branch/select/trunc/ext/UB/poison that may exist in the variable shift amount patterns.
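The quoted expansion for a constant-amount rotate (a funnel shift with both operands equal) can be sketched numerically; this is illustrative, assuming width 64 and 0 < C < 64:

```python
# Models: or (shl X, C), (lshr X, (width - C)), for width = 64.
MASK64 = (1 << 64) - 1

def rotl64(x, c):
    assert 0 < c < 64           # shift amounts of 0 or 64 need special casing
    return ((x << c) & MASK64) | ((x & MASK64) >> (64 - c))

assert rotl64(1, 1) == 2
assert rotl64(1 << 63, 1) == 1                       # top bit wraps around
assert rotl64(0x0123456789ABCDEF, 8) == 0x23456789ABCDEF01
```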

@spatel where are we at with funnel-shift canonicalization with constant shift amounts?
Is it time to fix this via fixing https://bugs.llvm.org/show_bug.cgi?id=37872 ?

Yes, I can make that change. I'm not aware of any gaps in IR analysis for funnel-shift / rotate at this point, and the expansion in SDAG should be identical to the current IR sequence:

or (shl X, C), (lshr X, (width - C))

A first hack at this shows that we do have missing canonicalizations: we don't turn funnel-shift into bswap when possible (I thought we had a bug on that one, but I don't see it now), and we don't recognize bswap sequences that contain intermediate bswaps. Regression tests in test/Transforms/InstCombine/bswap.ll are affected.