Transform udiv/urem (zext X), C to zext (udiv/urem X, C') when X has two uses
and both uses can be transformed. This is a common pattern seen in code
generated by the XLA GPU backend.
Add two test cases.
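As a minimal sketch of the rewrite described above (the divisor 33 and the value names are illustrative, not taken from the patch):

  ; before: one zext, two i64 operations on the extended value
  %za  = zext i32 %a to i64
  %div = udiv i64 %za, 33
  %rem = urem i64 %za, 33

  ; after: both uses of %za are narrowed, so the div/rem run in i32,
  ; but each narrowed result now needs its own zext (two zexts instead of one)
  %div.narrow = udiv i32 %a, 33
  %rem.narrow = urem i32 %a, 33
  %div = zext i32 %div.narrow to i64
  %rem = zext i32 %rem.narrow to i64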
Differential D46760: [InstCombine] Enhance narrowUDivURem
Authored by bixia on May 11 2018, 10:12 AM.
Event Timeline

Comment: This (walking users of a value) stretches instcombine beyond the simplistic transforms that I think most people want to see here. Have a look at the BypassSlowDivision (runs within -codegenprepare, I think) and DivRemPairs (-div-rem-pairs) passes.
Comment: Looked at your example a bit closer, and I don't understand the motivation. @jlebar improved -correlated-propagation to handle this case in D44102:

$ ./opt -correlated-propagation divrem.ll -S

define i64 @udiv_urem_twouses_xform(i32 %a) {
  %za = zext i32 %a to i64
  %udiv.lhs.trunc = trunc i64 %za to i32
  %udiv.rhs.trunc = trunc i64 33 to i32
  %udiv1 = udiv i32 %udiv.lhs.trunc, %udiv.rhs.trunc
  %udiv.zext = zext i32 %udiv1 to i64
  %urem.lhs.trunc = trunc i64 %za to i32
  %urem.rhs.trunc = trunc i64 33 to i32
  %urem2 = urem i32 %urem.lhs.trunc, %urem.rhs.trunc
  %urem.zext = zext i32 %urem2 to i64
  %uadd = add i64 %udiv.zext, %urem.zext
  ret i64 %uadd
}

Comment: Yes, you are right that correlated value propagation can handle that case. However, if the divisor is a power of 2 and instcombine runs before correlated value propagation (as it does in the opt pass pipeline), instcombine transforms the i64 udiv/urem into an i64 shift/and first. My original motivating test case is below; the change here is required for opt to narrow all of the i64 arithmetic operations to i32 operations:

; Function Attrs: nounwind
define void @narrow_long_chain_with_udiv_urem(i64* %result) {
  %num1 = call i32 @get_number(), !range !0
  %block_id = zext i32 %num1 to i64
  %num2 = call i32 @get_number(), !range !0
  %thread_id = zext i32 %num2 to i64
  %tmp = mul nuw nsw i64 %block_id, 64
  %linear_index = add nuw nsw i64 %tmp, %thread_id
  %tmp1 = udiv i64 %linear_index, 1
  %tmp2 = urem i64 %tmp1, 384
  %warp_id = udiv i64 %tmp2, 32
  %lane_id = urem i64 %tmp2, 32
  %tmp3 = mul nsw i64 %warp_id, 8
  %tmp4 = add nsw i64 7, %tmp3
  %tmp5 = mul nsw i64 32, %tmp4
  %tmp6 = add nsw i64 %lane_id, %tmp5
  store i64 %tmp6, i64* %result
  ret void
}

Comment: Are you suggesting teaching correlated-propagation to narrow shl/and? InstCombine can perform such narrowing; I don't see why it is better to duplicate this in correlated-propagation. Can you explain? The m_OneUse check in the code I am changing here is simple logic to make sure that the transformation does not increase the number of ZExt instructions. I basically add a little more logic there to say: "if there are two uses, and we can guarantee that the transformation does not increase the number of ZExt instructions, we also want to perform the transformation here." What are your concerns with such a change?
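For reference, a minimal sketch of the power-of-2 case described above (using 32, as in the warp_id/lane_id computation in the test case): by the time correlated value propagation runs, instcombine has already replaced the i64 udiv/urem with an i64 shift/and, so there is no udiv/urem left for it to narrow:

  %za      = zext i32 %a to i64
  %warp_id = udiv i64 %za, 32    ; instcombine folds this to:  lshr i64 %za, 5
  %lane_id = urem i64 %za, 32    ; instcombine folds this to:  and  i64 %za, 31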
Comment: Yes. Because of what you have already said, you would not be duplicating anything.
Comment: Instcombine has udiv/urem narrowing that does not increase the total number of ZExt instructions. Correlated-propagation has udiv/urem narrowing that can increase the total number of ZExt instructions (kind of a superset of what is in instcombine). These are existing duplicated functionalities. I have considered two possible changes to this: (1) allow instcombine to increase the total number of ZExt instructions in general (by simply removing the m_OneUse-related checks); (2) as shown in this change, allow instcombine to handle a special case, possibly increasing the number of ZExt instructions by one. Now you are proposing a third solution: (3) create yet another duplicated-functionality situation between instcombine and correlated-propagation. The additional lshr-narrowing and and-narrowing needed to make my test case work (see my note below) do not require control flow information, and putting them into correlated-propagation is overkill.

Note: for my motivating test case, without my change here, instcombine can't narrow this code, mostly because %tmp2 has two uses:

%tmp2 = zext i32 %1 to i64
%lane_id = and i64 %tmp2, 31
%2 = lshr i64 %tmp2, 2

Comment: You want to narrow multi-use sequences of code because instcombine can't do that profitably using minimal peepholes:

Name: narrow expression
Pre: C1 u< 9945
%tmp2 = zext i32 C1 to i64
%lane_id = and i64 %tmp2, 31
%shr = lshr i64 %tmp2, 2
%tmp4 = shl nuw nsw i64 %shr, 5
%r = or i64 %lane_id, %tmp4
  =>
%nlane_id = and i32 C1, 31
%nshr = lshr i32 C1, 2
%ntmp4 = shl i32 %nshr, 5
%ntmp5 = or i32 %nlane_id, %ntmp4
%r = zext i32 %ntmp5 to i64

The original motivation for -aggressive-instcombine (TruncInstCombine) was something almost like that - see D38313. Can you extend that? Note that the general problem isn't about udiv/urem, lshr, or any particular binop. It's about narrowing a sequence of arbitrary binops (and maybe even more than binops).

Re: instcombine - if some transform is increasing instruction count by not checking for uses, that's a bug. I think we established that conclusively in D44266. The problem with trying to do this transform in instcombine by checking the users is that if it's ok to check the users for this one transform, then why don't we do that for *every* transform? The compile-time cost of that change would be substantial, and we're trying to limit the cost of instcombine. See recent llvm-dev discussions for more details.

Comment: Hmm, there already is llvm/lib/Transforms/AggressiveInstCombine/TruncInstCombine.cpp. @bixia, did you look into that yet by any chance?
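For context, TruncInstCombine (part of -aggressive-instcombine) narrows an expression tree that is rooted at a trunc instruction; a rough sketch of the kind of pattern it already handles (the specific operations and constants here are illustrative):

  %z   = zext i32 %a to i64
  %add = add i64 %z, 15
  %mul = mul i64 %add, 3
  %t   = trunc i64 %mul to i32
    =>
  %add = add i32 %a, 15
  %t   = mul i32 %add, 3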
Comment: The transformation we really want is:

(Z(A) / N) * N + Z(A) % N  =>  Z(A)

where Z(...) is zero extension. Maybe we can do that directly instead of reasoning about use lists like this (if we do it directly, we will trivially not be increasing the number of zero extensions)?

Comment: What Sanjoy suggested doesn't (completely) solve the problem we have, where UDIV and UREM are used to calculate the dimensional indices from a linear index, and the dimensional indices are then used to access tensors of different shapes (for example, two tensors being accessed both require broadcasting, but along different dimensions).
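For concreteness, a small IR instance of the fold Sanjoy describes (the divisor 384 is illustrative); the quotient/remainder recombination collapses back to the zero-extended value, so no zext is added:

  %za  = zext i32 %a to i64
  %div = udiv i64 %za, 384
  %mul = mul i64 %div, 384
  %rem = urem i64 %za, 384
  %sum = add i64 %mul, %rem   ; (%za / 384) * 384 + %za % 384 == %za, so %sum folds to %za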
Comment: We're missing the fundamental motivation. This isn't about div/rem, based on the IR example posted on May 12. That code is canonicalized to use shifts. So we must account for the fact that the source code could have been written using shifts in the first place. Otherwise, we're chasing an incomplete solution. So let's look at that example after -instcombine:

define void @narrow_long_chain_with_udiv_urem(i64* %result) {
  %num1 = call i32 @get_number(), !range !0
  %num2 = call i32 @get_number(), !range !0
  %1 = shl nuw nsw i32 %num1, 6
  %addconv = add nuw nsw i32 %1, %num2
  %2 = urem i32 %addconv, 384
  %tmp2 = zext i32 %2 to i64
  %lane_id = and i64 %tmp2, 31
  %3 = lshr i64 %tmp2, 2         <---
  %tmp4 = shl nuw nsw i64 %3, 5  <---
  %tmp5 = or i64 %tmp4, 224
  %tmp6 = or i64 %lane_id, %tmp5
  store i64 %tmp6, i64* %result, align 4
  ret void
}

I've pointed out what I think is the critical sequence: shift right followed by shift left. Why is that a better canonicalization than:

%and = and i64 %x, -4
%left = shl i64 %and, 3

https://rise4fun.com/Alive/Gmc

If we had that transform, then everything would be shrunken to i32 already, as we want/expect:

define void @narrow_long_chain_with_shifts(i64* %result) {
  %num1 = call i32 @get_number(), !range !0
  %num2 = call i32 @get_number(), !range !0
  %1 = shl nuw nsw i32 %num1, 6
  %addconv = add nuw nsw i32 %1, %num2
  %2 = urem i32 %addconv, 384
  %3 = and i32 %2, 31
  %4 = shl nuw nsw i32 %2, 3
  %5 = and i32 %4, 3840
  %6 = or i32 %5, 224
  %tmp61 = or i32 %3, %6
  %tmp6 = zext i32 %tmp61 to i64
  store i64 %tmp6, i64* %result, align 4
  ret void
}

I suspect there's some codegen concern about having a large 'and' constant mask, so if that's a problem, we can add a reverse canonicalization to the DAG to transform back to shift from 'and'. The 'and' seems better for analysis and for enabling other transforms (as seen here), so I think we should have this canonicalization in IR.

Comment: Note that we don't actively canonicalize to shifts:

%t1 = shl i64 %x, 3
%r = and i64 %t1, -32
  =>
%right = lshr i64 %x, 2
%r = shl i64 %right, 5

https://rise4fun.com/Alive/uz6Y

...so from an instcombine perspective, this is just a case of a *missing* transform (rather than reversing an existing transform).

Comment: @spatel Thanks for your comment! I need to revise my example to better match the real case I am looking at, in particular to show that the optimization will increase the number of zext instructions. In this example, there are only two zext instructions. However, if we narrow all the arithmetic operations to i32, we will need four zext instructions. In one of your previous comments, you suggested extending TruncInstCombine in aggressive-instcombine. I see the following issues: (1) TruncInstCombine doesn't increase the number of zext/sext instructions. (2) TruncInstCombine currently only handles operations with the property truncate(op(i64), i32) == op(truncate(i64, i32)); div/rem/right-shift, which are needed here, aren't among those operations. We could probably add value range analysis to resolve this, though. (3) TruncInstCombine is driven by a truncate instruction in the IR, and there is no such truncate instruction in the case here. To handle the case here (such as the "lshr followed by shl" pattern you mentioned), is it acceptable to add a new optimization to aggressive-instcombine that can increase the number of zext/sext instructions?

; Function Attrs: nounwind
define void @narrow_long_chain_with_udiv_urem_2(i64* %result) {

Comment: I haven't had a chance to look at the new example yet, but let me answer this in general: no. Currently, aggressive-instcombine is a target-independent canonicalization pass just like regular instcombine. And we've said that increasing instruction count is not usually canonical for instcombine. The only difference for aggressive-instcombine is that we can pursue more expensive pattern matching there, because that was deemed acceptable for -O3. So whatever we try to do to use narrow ops should result in an equal or smaller total instruction count.

Comment: As suggested in https://bugs.llvm.org/show_bug.cgi?id=37603, I have added such tests in rL334347.

// Be careful about hiding shl instructions behind bit masks. They are used
// to represent multiplies by a constant, and it is important that simple
// arithmetic expressions are still recognizable by scalar evolution.
// The inexact versions are deferred to DAGCombine, so we don't hide shl
// behind a bit mask.

That was added by @spatel in rL293570 / rL293435.

Comment: Bixia - is it correct to say that the behavior we want here is that we narrow arithmetic operations from 64 bits to 32 bits, irrespective of how many extra zero-extend instructions we need to add? If yes, perhaps this isn't a good match for InstCombine, and we should be looking at a specialized target-dependent transformation pass? InstCombine is a "mini LLVM" in that many LLVM passes have a simpler "do no harm" variant implemented in InstCombine, so this won't be anything new design-wise.

Comment: @spatel, where are we at with funnel-shift canonicalization for constant shift amounts?

Comment: Yes, I can make that change. I'm not aware of any gaps in IR analysis for funnel-shift / rotate at this point, and the expansion in SDAG should be identical to the current IR sequence:

or (shl X, C), (lshr X, (width - C))

With a constant shift amount, we eliminate the variations and concerns about branch/select/trunc/ext/UB/poison that may exist in the variable-shift-amount patterns.

Comment: A first hack at this shows that we do have missing canonicalizations: we don't turn funnel-shift into bswap when possible (I thought we had a bug report on that one, but I don't see it now), and we don't recognize bswap sequences that contain intermediate bswaps. Regression tests in test/Transforms/InstCombine/bswap.ll are affected.

Comment: This review may be stuck/dead; consider abandoning it if it is no longer relevant.
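For reference, the constant-shift-amount rotate pattern discussed above can be written with the funnel-shift intrinsic; a minimal sketch (the rotate amount of 5 is illustrative):

  declare i32 @llvm.fshl.i32(i32, i32, i32)

  define i32 @rotl5(i32 %x) {
    ; rotate left by 5; equivalent to: or (shl i32 %x, 5), (lshr i32 %x, 27)
    %r = call i32 @llvm.fshl.i32(i32 %x, i32 %x, i32 5)
    ret i32 %r
  }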
Comment: I'd guess it would be simpler to use early return.