This is an archive of the discontinued LLVM Phabricator instance.

[DivRemHoist] add a pass to move div/rem pairs into the same block (PR31028)
ClosedPublic

Authored by spatel on Aug 24 2017, 2:54 PM.

Details

Summary

This is intended to be the same functionality as D31037 (EarlyCSE) but implemented as an independent pass, so there's no stretching of scope or feature creep for an existing pass. I also proposed a weaker version of this for SimplifyCFG in D30910, and I initially had almost this same functionality as another lump of CGP in the motivating example of PR31028 ( https://bugs.llvm.org/show_bug.cgi?id=31028 ). It's been a long road. :)

The advantage of positioning this ahead of SimplifyCFG / InstCombine in the pass pipeline is that we'll reduce the positive test cases to a single block:

%rem = urem i8 %a, %b
%div = udiv i8 %a, %b
%cmp = icmp eq i8 %div, 42
%sel = select i1 %cmp, i8 %rem, i8 3
ret i8 %sel

...and that can lead to better codegen even for targets that don't have a joint divrem instruction. D30910 has an AArch64 example of that.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel created this revision.Aug 24 2017, 2:54 PM
efriedma added inline comments.Aug 24 2017, 4:06 PM
lib/Transforms/Scalar/DivRemHoist.cpp
86 ↗(On Diff #112612)

Why is this assertion true? I don't see why two instructions with the same operands must have a dominance relation.

spatel added inline comments.Aug 24 2017, 4:16 PM
lib/Transforms/Scalar/DivRemHoist.cpp
86 ↗(On Diff #112612)

Yes, you're right. The examples I was looking at weren't very diverse. Will fix.

davide added a subscriber: davide.Aug 24 2017, 4:45 PM
spatel updated this revision to Diff 112690.Aug 25 2017, 6:35 AM

Patch updated:
Fixed to check that one op must dominate the other to allow hoisting, and added a test for that.

dberlin added inline comments.Aug 25 2017, 8:54 AM
lib/Transforms/Scalar/DivRemHoist.cpp
78 ↗(On Diff #112690)

You should check dominance of the parents, not the instructions. Though you currently avoid it, DT->dominates is a linear-time test if you use it on instructions in the same block (and that's not obvious here).

The recommendation is to use OrderedInstructions instead if you need to check same-block dominance, but you don't :)

83 ↗(On Diff #112690)

It should go after, to retain the relative ordering.

89 ↗(On Diff #112690)

Ditto, this should be after.

spatel updated this revision to Diff 112732.Aug 25 2017, 1:10 PM
spatel marked 2 inline comments as done.

Patch updated:

  1. Check dominance of the basic blocks rather than the instructions to avoid potential unexpected complexity.
  2. Move the hoisted instruction after (rather than before) the other instruction using removeFromParent()+insertAfter().
lib/Transforms/Scalar/DivRemHoist.cpp
83 ↗(On Diff #112690)

Agreed, that feels better. But now I remember why I picked "moveBefore"... there is no "moveAfter". Is this an oversight in the API, or is there something about 'after' that requires different logic? Looks like other places do:

I->removeFromParent();
I->insertAfter(OtherInst);
dberlin added inline comments.Aug 25 2017, 5:11 PM
lib/Transforms/Scalar/DivRemHoist.cpp
83 ↗(On Diff #112690)

This seems strange to me. Eli tends to be much better than I am at explaining the good reason something is the way it is that I've missed, so let's see what he says.

MemorySSA also uses iplists and has moveAfter.

All it requires there is the equivalent of:

moveBefore(*MovePos->getParent(), ++MovePos->getIterator());

(since splice inserts before)

efriedma added inline comments.Aug 25 2017, 5:43 PM
lib/Transforms/Scalar/DivRemHoist.cpp
83 ↗(On Diff #112690)

I can't think of any reason moveAfter doesn't exist. Looks like it's just an oversight.

dberlin edited edge metadata.Aug 25 2017, 5:55 PM

Okay.
Then feel free to either add it, or just use the canonical idiom, and i'll add it and clean them all up.
Your call.

Okay.
Then feel free to either add it, or just use the canonical idiom, and i'll add it and clean them all up.
Your call.

If there are no other suggestions for this patch, I'd prefer to commit this as-is (with a FIXME comment) to get more testing underway and then come back for the clean-up.

I do have a few other questions (this is my first try at writing a new pass):

  1. Are there guidelines/intuition about where to position this in the opt pipeline? It's currently near the end, but I don't have a good reason for that. I just think that it should be ahead of at least one SimplifyCFG to allow flattening.
  2. Are there guidelines for when to enable this? It should be cheap in compile-time, so I put in -O1. But it's also rare, so that may be an argument for -O2 or even -O3?
  3. Any thoughts about the TODO for PreservedAnalyses near the end of DivRemHoist.cpp?
spatel updated this revision to Diff 113128.Aug 29 2017, 11:50 AM

Patch updated:

  1. Use moveAfter() to simplify the hoisting code (convenience function added with rL312001).
  2. Move the pass later in the pipeline. I noticed that the transform wasn't holding in loops (example below) because loop transforms will sink the sibling op to its use. That defeats the point of this pass, so this time I've made the transform really late, but still before the final -simplifycfg.
  3. Add the pass to the new pass manager pipeline in the equivalent position. I failed to add the pass at all in the last rev (!).
  4. Adding to the new pipeline means there are actually tests to check that the pass is running, so updated those.
  5. Added preservation of GlobalsAA for both pipelines. The lack of this was causing a test failure in test/Transforms/PhaseOrdering/globalaa-retained.ll with the previous placement of the pass. I'm not sure if that would still happen now, but it's safer to make that explicit?

Here's an example of a loop with div/rem where we do not want another pass to re-sink the rem (because codegen can't fix that). Should I add a PhaseOrdering test like this?

define void @rebase_mask(i32 %n, i32 %divisor, i32* %mask) {
entry:
  %cmp = icmp sgt i32 %n, 0
  br i1 %cmp, label %preheader, label %exit

preheader:
  br label %for.body

exit:
  ret void

for.body:
  %i = phi i32 [ %inc, %cleanup ], [ 0, %preheader ]
  %div = sdiv i32 %i, %divisor
  %idxprom = sext i32 %div to i64
  %arrayidx = getelementptr inbounds i32, i32* %mask, i64 %idxprom
  %ld = load i32, i32* %arrayidx, align 4
  %cmp1 = icmp slt i32 %ld, 0
  br i1 %cmp1, label %cleanup, label %if.end

if.end:
  %rem = srem i32 %i, %divisor
  store i32 %rem, i32* %arrayidx, align 4
  br label %cleanup

cleanup:
  %inc = add nuw nsw i32 %i, 1
  %exitcond = icmp eq i32 %inc, %n
  br i1 %exitcond, label %exit, label %for.body
}

On closer inspection, I misdiagnosed who was sinking the rem in my example, but I don't think it makes a difference to the patch. We don't need loops to show the problem...if simplifycfg can't flatten the code, then it's instcombine sinking instructions for some unstated reason:

define void @rebase_mask(i1 %cmp, i32 %num, i32 %divisor, i32* %p1, i32* %p2) {
entry:
  %div = sdiv i32 %num, %divisor
  store i32 %div, i32* %p1
  br i1 %cmp, label %exit, label %if.end

if.end:
  %rem = srem i32 %num, %divisor
  store i32 %rem, i32* %p2
  br label %exit

exit:
  ret void
}

./opt -div-rem-hoist -instcombine divrem.ll -S -debug
...
IC: Sink: %rem = srem i32 %num, %divisor

InstCombine sinks an instruction with a single use into the successor block containing that use, when that block has no other predecessors.
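
Roughly, that rule could be expressed as the following standalone check (a hypothetical helper paraphrasing the behavior described above, not InstCombine's actual code):

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Instruction.h"
using namespace llvm;

// Sketch of the rule described above: an instruction with exactly one use
// may be sunk into the user's block when that block's only predecessor is
// the instruction's current block.
static bool wouldSinkIntoUser(Instruction &I) {
  if (!I.hasOneUse())
    return false;
  auto *UserInst = cast<Instruction>(*I.user_begin()); // the single user
  BasicBlock *DestBB = UserInst->getParent();
  return DestBB != I.getParent() &&
         DestBB->getSinglePredecessor() == I.getParent();
}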

hfinkel added a subscriber: hfinkel.Sep 5 2017, 3:35 PM
hfinkel added inline comments.
lib/Transforms/Scalar/DivRemHoist.cpp
84 ↗(On Diff #113128)

Please don't hard-code 1 here as the basis for comparison to the cost model. The normalization is target specific. Feel free to compare it to a corresponding cost for Instruction::Add, for example.

Also, to be clear, you're using the reciprocal-throughput model here. So if we can perform two adds every cycle, and that has a cost of 1, and we can do only one multiply per cycle, it would have a cost of 2. Is that too much?

You might want to use the user-cost model (which is more like code size in some sense), as that's what SimplifyCFG uses to do speculation. You could call TTI->getOperationCost and compare it to TargetTransformInfo::TCC_Basic. We'll need to clean all of this up at some point, but I think that would be akin to what SimplifyCFG is doing for speculation costs.

spatel added inline comments.Sep 5 2017, 4:28 PM
lib/Transforms/Scalar/DivRemHoist.cpp
84 ↗(On Diff #113128)

Ah, right. I should've remembered all that since I've complained about it before!
Let me take another look at the current state of those cost models and do something less obnoxious than a magic number.

spatel updated this revision to Diff 114020.Sep 6 2017, 9:37 AM

Patch updated:
As I was rummaging around in the cost models (realizing they're quite wrong for div/rem/mul in their own ways) and trying to bend them to express what I wanted...

I decided it would be better to just add a new TTI shim (hasDivRemOp()) to implement the behavior that I initially had when this was a CGP add-on. I realize we're at the edge of IR vs. backend transforms here, so let me know if this is wrong. Based on the existing hooks for things like masked ops, however, this doesn't look out of place to me.

I also figured it was best to just replace a non-hoistable rem instruction in-place and avoid getting into the inaccurate cost model mess to decide when hoisting would be the right thing to do. So now the i128 test for x86 shows that optimization instead of just bailing out. I think that's also the right thing to do for all targets that don't have HW divrem (everything in trunk besides x86?).

hfinkel added a comment.

Is this worthwhile only if the target has a combined div/rem operation? Isn't the multiplication plus subtraction generally cheaper than an independent remainder operation in any case? Maybe, in the latter case, you simply want to do that operational replacement instead of hoisting?

include/llvm/Analysis/TargetTransformInfo.h
452 ↗(On Diff #114020)

I think that you might as well add a Boolean parameter indicating whether this is for a signed or an unsigned division (and then, in the X86 implementation, you can test the right ISD opcode for each).
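
A rough sketch of what such an X86 override might look like (illustrative, not necessarily the committed code; TLI and DL are assumed to be the usual members available in the X86 TTI implementation class):

// Report a combined div/rem only when the corresponding ISD node is legal
// for this type; the IsSigned flag selects SDIVREM vs. UDIVREM.
bool X86TTIImpl::hasDivRemOp(Type *DataType, bool IsSigned) {
  EVT VT = TLI->getValueType(DL, DataType);
  return VT.isSimple() &&
         TLI->isOperationLegal(IsSigned ? ISD::SDIVREM : ISD::UDIVREM, VT);
}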

spatel added a comment.Sep 6 2017, 1:29 PM

Is this worthwhile only if the target has a combined div/rem operation? Isn't the multiplication plus subtraction generally cheaper than an independent remainder operation in any case? Maybe, in the latter case, you simply want to do that operational replacement instead of hoisting?

No. Sorry if this wasn't clear - we're doing the replacement for all targets, not just x86. The hoisting of rem is the special case for x86.

include/llvm/Analysis/TargetTransformInfo.h
452 ↗(On Diff #114020)

Yes - let me update that; that was just an oversight.

spatel updated this revision to Diff 114073.Sep 6 2017, 2:35 PM

My earlier comments about this in Phab don't seem to have made it to the email list, so I'll send directly to the list if this doesn't either...

Patch updated:

  1. Add bool to properly check availability of divrem with correct signedness.
  2. Add PPC target to test file to show that the rem replacement part of the patch works for all targets.
hfinkel added inline comments.Sep 6 2017, 3:00 PM
lib/Transforms/Scalar/DivRemHoist.cpp
50 ↗(On Diff #114073)

I'm missing something: where is DivRemMapKey defined?

85 ↗(On Diff #114073)

I don't understand what's going on here; can you please explain in the comment? Why is this way cheap but hoisting the rem near the div is not? I'm guessing this has something to do with the fact that DAGCombiner::useDivRem isn't called for divisions if TLI.isIntDivCheap returns true, but is always called on remainders? If that's it, should we mirror that logic more directly here (i.e. somehow directly incorporate the result of calling TLI.isIntDivCheap)?

spatel added inline comments.Sep 6 2017, 3:19 PM
lib/Transforms/Scalar/DivRemHoist.cpp
50 ↗(On Diff #114073)

This is an existing struct used for a different div/rem optimization. It lives in:
#include "llvm/Transforms/Utils/BypassSlowDivision.h"

85 ↗(On Diff #114073)

This is independent of anything in the backend. That's why this transform is frustrating - it's half target-independent. :)

Let's see if some more words or repeating the formula would make it better:
"Hoist the div into the rem block. No cost calculation is needed because division is an implicit component of remainder:
X % Y --> X - ((X / Y) * Y)
If the target has a unified instruction for div/rem, then this will occur in a single instruction. If the target does not have a unified instruction for div/rem, then it must calculate remainder as (sub X, (mul (div X, Y), Y)). Either way, hoisting division will be free."
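
A minimal sketch of that decomposition at the IR level (illustrative only: decomposeRem is a hypothetical helper, and it assumes the div already computes X / Y with the same operands and dominates the rem):

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Rebuild X % Y as X - ((X / Y) * Y) next to an existing division, so the
// backend sees a single div feeding both results.
static void decomposeRem(BinaryOperator *Div, BinaryOperator *Rem) {
  IRBuilder<> Builder(Rem);                 // insert right before the rem
  Value *X = Rem->getOperand(0);
  Value *Y = Rem->getOperand(1);
  Value *Mul = Builder.CreateMul(Div, Y);   // (X / Y) * Y
  Value *Sub = Builder.CreateSub(X, Mul);   // X - ((X / Y) * Y)
  Rem->replaceAllUsesWith(Sub);
  Rem->eraseFromParent();
}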

hfinkel added inline comments.Sep 6 2017, 5:28 PM
lib/Transforms/Scalar/DivRemHoist.cpp
50 ↗(On Diff #114073)

Ah, okay.

85 ↗(On Diff #114073)

That is much better, thanks. However, are you making that substitution? It looks like you're just moving the division in this case (i.e. only doing DivInst->moveAfter(RemInst);)?

spatel added inline comments.Sep 6 2017, 5:43 PM
lib/Transforms/Scalar/DivRemHoist.cpp
85 ↗(On Diff #114073)

We can just move the division in this pass because we can assume that the backend will be able to simplify the common division when the div and rem are in the same block.

I suppose at this point we could decompose the dominating rem into div+mul+sub if we know the backend does not have a divrem. I don't think there would be any downside to that (other than we're repeating functionality that is known to exist in the backend)?

hfinkel added inline comments.Sep 6 2017, 5:52 PM
lib/Transforms/Scalar/DivRemHoist.cpp
85 ↗(On Diff #114073)

We can just move the division in this pass because we can assume that the backend will be able to simplify the common division when the div and rem are in the same block.

That's okay, but you should say that in the comment. However...

I suppose at this point we could decompose the dominating rem into div+mul+sub if we know the backend does not have a divrem. I don't think there would be any downside to that (other than we're repeating functionality that is known to exist in the backend)?

If this pass were running early in the pipeline, we'd definitely need to do this (because it would affect inlining/unrolling costs and the like). Because this runs relatively late, that's not a huge concern. However, I'd prefer that you do the decomposition here at the IR level anyway because a) you already have the code here to do it (so there's little added complexity) and b) in case there are any transformations later that are still using IR-level costs, and there very well might be in the backend (even target-specific ones), we might as well be kind to them and give them more-accurate costs.

spatel added inline comments.Sep 7 2017, 7:09 AM
lib/Transforms/Scalar/DivRemHoist.cpp
85 ↗(On Diff #114073)

Agreed - now that we're doing replacement, we might as well make both cases do it.

This does raise a question: what should this pass be called now?
div-rem-cse-or-hoist
div-rem-opt
div-rem-twins
...?

Also, should I rename BypassSlowDivision.h to DivRem____ to make the home of DivRemMapKey more apparent?

spatel updated this revision to Diff 114195.Sep 7 2017, 9:50 AM

Patch updated:

  1. If the target doesn't have a divrem instruction, always decompose the rem.
  2. Rename everything to 'DivRemPairs' to reflect the greater powers of the pass (open to other suggestions).
  3. Improve code comments to better explain how this works.
hfinkel added inline comments.Sep 7 2017, 4:54 PM
lib/Transforms/Scalar/DivRemPairs.cpp
80 ↗(On Diff #114195)

I'd like to remove this check. There are two reasons for this:

  1. Consistency (ending up with the expanded form if we started with the operations in different blocks, but not if they started in the same block, seems unfortunate).
  2. To allow us to get optimal results reliably in the backend in a straightforward way. For example, if a target does not have a legal rem instruction, we get the desired behavior because the rem will be legalized into the code with the div, and SDAG will reuse an existing identical div in the same block. If there is a legal rem instruction, however, this doesn't work automatically. In fact, the PowerPC backend has this comment:
// PowerPC has no SREM/UREM instructions unless we are on P9
// On P9 we may use a hardware instruction to compute the remainder.
// The instructions are not legalized directly because in the cases where the
// result of both the remainder and the division is required it is more
// efficient to compute the remainder from the result of the division rather
// than use the remainder instruction.
if (Subtarget.isISA3_0()) {
  setOperationAction(ISD::SREM, MVT::i32, Custom);
  setOperationAction(ISD::UREM, MVT::i32, Custom);
  setOperationAction(ISD::SREM, MVT::i64, Custom);
  setOperationAction(ISD::UREM, MVT::i64, Custom);
} else {
  setOperationAction(ISD::SREM, MVT::i32, Expand);
  setOperationAction(ISD::UREM, MVT::i32, Expand);
  setOperationAction(ISD::SREM, MVT::i64, Expand);
  setOperationAction(ISD::UREM, MVT::i64, Expand);
}

and if we always expand here, I think we can clean this up.

I think that you can do this just by writing:

if (HasDivRemOp && RemBB == DivBB)
  continue;

bool DivDominates = DT.dominates(DivInst, RemInst);

What Danny said earlier about the instruction-level dominance checks being expensive within the same block is true, but there's no extra expense compared to the BB version if the instructions are in different BBs. EarlyCSE should ensure that there is at most one div or rem with any particular set of arguments, so I don't think we need to do anything special to deal with algorithmic-complexity issues (although if we're really concerned about that, we can use OrderedInstructions).

93 ↗(On Diff #114195)

I wouldn't phrase this as misleading. Arguably, it's more misleading this way. The point is that, if you decompose here, some later pass might disturb the pattern such that it is not recognizable in the backend.

spatel updated this revision to Diff 114420.Sep 8 2017, 12:59 PM

Patch updated:

  1. For a target that doesn't have div+rem, handle the case where the div and rem are in the same block by decomposing the rem.
  2. Add tests for those cases (first 2 tests in the test file).
  3. Update comments to better explain the same block scenario and remove low quality comment.
hfinkel accepted this revision.Sep 8 2017, 2:01 PM

LGTM

This revision is now accepted and ready to land.Sep 8 2017, 2:01 PM
This revision was automatically updated to reflect the committed changes.
jlebar added a subscriber: jlebar.Mar 1 2018, 4:33 PM

After some sleuthing, I discovered that the addition of this new (morally good) pass introduced a moderate regression on NVPTX.

Previously, when given a div/rem pair operating over i64, we'd check if the operands fit in i32 in BypassSlowDivision (during codegenprepare). If so, we'd replace the div/rem pair with i32 div/rem ops. Then in the NVPTX selection dag, we'd compute the rem from this div using the same formulation as here. But since the div was i32, the rem computation would also happen in i32.

After this change, the rem is replaced with i64 arithmetic early on in the pipeline. BypassSlowDivision replaces the i64 div with an i32 div as before. But because this happens during codegenprepare, nobody ever changes the rem computation to happen in 32 bits.

I'm not sure what is the right fix for this. We could teach instcombine to strength-reduce i64 divides where the operands are known to fit into 32 bits into i32 divs? It already does this for zext'ed operands, but not in general. But I'm not sure what is the principled way to tell instcombine whether and how to do this strength reduction. (Like, maybe we want to convert 64-bit divs into 32-bit divs, but do we want to convert 64-bit divs into 8-bit divs, if it fits?)

spatel added a comment.Mar 2 2018, 7:25 AM

After some sleuthing, I discovered that the addition of this new (morally good) pass introduced a moderate regression on NVPTX.

Previously, when given a div/rem pair operating over i64, we'd check if the operands fit in i32 in BypassSlowDivision (during codegenprepare). If so, we'd replace the div/rem pair with i32 div/rem ops. Then in the NVPTX selection dag, we'd compute the rem from this div using the same formulation as here. But since the div was i32, the rem computation would also happen in i32.

After this change, the rem is replaced with i64 arithmetic early on in the pipeline. BypassSlowDivision replaces the i64 div with an i32 div as before. But because this happens during codegenprepare, nobody ever changes the rem computation to happen in 32 bits.

I'm not sure what is the right fix for this. We could teach instcombine to strength-reduce i64 divides where the operands are known to fit into 32 bits into i32 divs? It already does this for zext'ed operands, but not in general. But I'm not sure what is the principled way to tell instcombine whether and how to do this strength reduction. (Like, maybe we want to convert 64-bit divs into 32-bit divs, but do we want to convert 64-bit divs into 8-bit divs, if it fits?)

Can you post an IR example or file a bug that shows the failure? If BypassSlowDivision can get it, but InstCombine can not, then the difference comes down to using computeKnownBits?

Div/rem IR instructions are rare, and we can't do much analysis/combining based on the output, so using ValueTracking to narrow them as a canonicalization step is not an excessive cost IMO.
If that is opposed, we could do the narrowing in this pass as a perf optimization?
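
A hedged sketch of that ValueTracking-based narrowing idea (narrowUDivTo32 is a hypothetical helper, not an existing InstCombine transform, and the i32 target width is just the example from this thread):

#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/Support/KnownBits.h"
using namespace llvm;

// If known bits prove both udiv operands fit in 32 bits, do the division in
// i32 and zero-extend the result back to the original width.
static Value *narrowUDivTo32(BinaryOperator &Div, const DataLayout &DL,
                             IRBuilder<> &Builder) {
  Type *Ty = Div.getType();
  if (!Ty->isIntegerTy() || Ty->getIntegerBitWidth() <= 32)
    return nullptr;
  unsigned ExcessBits = Ty->getIntegerBitWidth() - 32;
  KnownBits LHS = computeKnownBits(Div.getOperand(0), DL);
  KnownBits RHS = computeKnownBits(Div.getOperand(1), DL);
  if (LHS.countMinLeadingZeros() < ExcessBits ||
      RHS.countMinLeadingZeros() < ExcessBits)
    return nullptr;                         // can't prove the operands fit
  Type *I32Ty = Builder.getInt32Ty();
  Value *NarrowDiv =
      Builder.CreateUDiv(Builder.CreateTrunc(Div.getOperand(0), I32Ty),
                         Builder.CreateTrunc(Div.getOperand(1), I32Ty));
  return Builder.CreateZExt(NarrowDiv, Ty);
}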

jlebar added a comment.EditedMar 4 2018, 6:54 PM

Can you post an IR example or file a bug that shows the failure? If BypassSlowDivision can get it, but InstCombine can not, then the difference comes down to using computeKnownBits?

Sure, something like:

target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

define void @foo(i64 %a, i64* %ptr1, i64* %ptr2) {
  %b = and i64 %a, 65535
  %div = udiv i64 %b, 42
  %rem = urem i64 %b, 42
  store i64 %div, i64* %ptr1
  store i64 %rem, i64* %ptr2
  ret void
}
$ opt -codegenprepare -S

define void @foo(i64 %a, i64* %ptr1, i64* %ptr2) {
  %b = and i64 %a, 65535
  %1 = trunc i64 %b to i32
  %2 = udiv i32 %1, 42
  %3 = urem i32 %1, 42
  %4 = zext i32 %2 to i64
  %5 = zext i32 %3 to i64
  store i64 %4, i64* %ptr1
  store i64 %5, i64* %ptr2
  ret void
}

$ opt -codegenprepare -S | llc
[snip]
        shr.u32         %r2, %r1, 1;
        mul.wide.u32    %rd3, %r2, 818089009;
        shr.u64         %rd4, %rd3, 34;
        cvt.u32.u64     %r3, %rd4;
        mul.lo.s32      %r4, %r3, 42;
        sub.s32         %r5, %r1, %r4;

$ opt -div-rem-pairs -codegenprepare -S
define void @foo(i64 %a, i64* %ptr1, i64* %ptr2) {
  %b = and i64 %a, 65535
  %1 = trunc i64 %b to i32
  %2 = udiv i32 %1, 42
  %3 = zext i32 %2 to i64
  %4 = mul i64 %3, 42
  %5 = sub i64 %b, %4
  store i64 %3, i64* %ptr1
  store i64 %5, i64* %ptr2
  ret void
}

$ opt -div-rem-pairs -codegenprepare -S | llc
[snip]
        shr.u32         %r2, %r1, 1;
        mul.wide.u32    %rd4, %r2, 818089009;
        shr.u64         %rd5, %rd4, 34;
        cvt.u32.u64     %r3, %rd5;
        mul.wide.u32    %rd6, %r3, 42;
        sub.s64         %rd7, %rd1, %rd6;
spatel added a comment.Mar 5 2018, 7:39 AM

Can you post an IR example or file a bug that shows the failure? If BypassSlowDivision can get it, but InstCombine can not, then the difference comes down to using computeKnownBits?

Sure, something like:

target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

define void @foo(i64 %a, i64* %ptr1, i64* %ptr2) {
  %b = and i64 %a, 65535
  %div = udiv i64 %b, 42
  %rem = urem i64 %b, 42
  store i64 %div, i64* %ptr1
  store i64 %rem, i64* %ptr2
  ret void
}

Are you confident that the problem is limited to cases with a constant div/rem operand and masking of the variable that could be replaced by a trunc? If so, then we could add a narrow pattern match fix without using computeKnownBits:

Name: udiv_shrink
%b = and i32 %a, 65535
%r = udiv i32 %b, 42

=>

%t = trunc i32 %a to i16
%u = udiv i16 %t, 42
%r = zext i16 %u to i32

https://rise4fun.com/Alive/EHK

I was worried that canEvaluateZExtd() would try to invert that transform, but either by oversight or intention, we don't widen udiv/urem there like most binops.

jlebar added a comment.Mar 5 2018, 7:51 AM

Are you confident that the problem is limited to cases with a constant div/rem operand and masking of the variable that could be replaced by a trunc?

Actually I'm pretty confident the problem is *not* limited to such cases -- sorry for making a misleading testcase. In the code I'm interested in, we know the bit-width of the dividend because of a branch on the dividend's value (x < some_constant) or because an llvm.assume tells us so.

spatel added a comment.Mar 5 2018, 8:07 AM

Are you confident that the problem is limited to cases with a constant div/rem operand and masking of the variable that could be replaced by a trunc?

Actually I'm pretty confident the problem is *not* limited to such cases -- sorry for making a misleading testcase. In the code I'm interested in, we know the bit-width of the dividend because of a branch on the dividend's value (x < some_constant) or because an llvm.assume tells us so.

Ah, then we'll need a stronger solution. If the shrinkability is based on propagation via cmp/br, then that's not something that instcombine would/should handle? Maybe we should add some shrinking logic to correlated-propagation?

Maybe we should add some shrinking logic to correlated-propagation?

That sounds like a plan to me. I put up a (perhaps weak) attempt at writing the patch in D44102.