This is an archive of the discontinued LLVM Phabricator instance.

Teach CorrelatedValuePropagation to reduce the width of udiv/urem instructions.
ClosedPublic

Authored by jlebar on Mar 5 2018, 11:58 AM.

Details

Summary

If the operands of a udiv/urem can be proved to fit within a smaller
power-of-two-sized type, reduce the width of the udiv/urem.

Event Timeline

jlebar created this revision.Mar 5 2018, 11:58 AM

I feel like this could use more/better tests, but before I spend a lot of time on that...am I on the right track here?

Disappointingly, this doesn't work for simple cases where you mask the divisor:

%b = and i64 %a, 65535
%div = udiv i64 %b, 42

It does work for llvm.assume, which I guess is good enough for the specific case I have, but...maybe this is not the right pass to be doing this in? Or should I check known-bits here too? Sorry, I'm an ignoramus when it comes to the target-independent parts of LLVM.

target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare void @llvm.assume(i1)

define void @foo(i64 %a, i64* %ptr1, i64* %ptr2) {
  %cond = icmp ult i64 %a, 1024
  call void @llvm.assume(i1 %cond)
  %div = udiv i64 %a, 42
  %rem = urem i64 %a, 42
  store i64 %div, i64* %ptr1
  store i64 %rem, i64* %ptr2
  ret void
}

becomes, at opt -O2

define void @foo(i64 %a, i64* nocapture %ptr1, i64* nocapture %ptr2) local_unnamed_addr #0 {
  %cond = icmp ult i64 %a, 1024
  tail call void @llvm.assume(i1 %cond)
  %div.lhs.trunc = trunc i64 %a to i16
  %div1 = udiv i16 %div.lhs.trunc, 42
  %div.zext = zext i16 %div1 to i64
  %1 = mul i16 %div1, 42
  %2 = sub i16 %div.lhs.trunc, %1
  %rem.zext = zext i16 %2 to i64
  store i64 %div.zext, i64* %ptr1, align 8
  store i64 %rem.zext, i64* %ptr2, align 8
  ret void
}

which lowers to the following ptx:

shr.u16         %rs2, %rs1, 1;
mul.wide.u16    %r1, %rs2, -15603;
shr.u32         %r2, %r1, 20;
cvt.u16.u32     %rs3, %r2;
cvt.u64.u32     %rd3, %r2;
mul.lo.s16      %rs4, %rs3, 42;
sub.s16         %rs5, %rs1, %rs4;
cvt.u64.u16     %rd4, %rs5;
st.u64  [%rd1], %rd3;
st.u64  [%rd2], %rd4;

This is even nicer than before because we do the magic-number division in 16-widens-to-32-bit instead of (before) doing it in 32-widens-to-64 bit. At least, I hope that's efficient in NVPTX -- if not, that's our backend's problem. :)

I think this is the right approach, but I don't know much about CVP, so adding more potential reviewers.

Context: This is part of improving udiv/urem IR as discussed in the latest comments on D37121.
Motivation: Narrowing the width of these ops improves bit-tracking / analysis and potentially enables further IR transforms.
Bonus: It also aligns with codegen transforms such as the magic number division mentioned by Justin and improves perf for targets that have faster narrow div/rem instructions.

We have basic div/rem narrowing folds in instcombine, but we want to handle cases like this where edge info and/or computeKnownBits would also allow narrowing. Is it preferred to use ValueTracking here to get the single BB case or should that be added to InstCombine?

sanjoy added a comment.Mar 5 2018, 2:44 PM

Disappointingly, this doesn't work for simple cases where you mask the divisor:

%b = and i64 %a, 65535
%div = udiv i64 %b, 42

This is a little surprising to me -- on a first glance it looks like LazyValueInfoImpl::solveBlockValueBinaryOp should be doing the right thing here. But I dug a bit deeper and it looks like getPredicateAt calls getValueAt which only looks at guards and assumes (and never calls solveBlockValueBinaryOp)? I have a hunch that using the getConstantRange call will "fix" this issue, though getPredicateAt definitely should be not weaker than calling getConstantRange and inferring the predicate manually.

llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
457

Instead of making multiple queries, how about using LazyValueInfo::getConstantRange instead?

jlebar added a comment.Mar 6 2018, 2:35 AM

I have a hunch that using the getConstantRange call will "fix" this issue

It does, thanks! It also simplifies the code.

though getPredicateAt definitely should be not weaker than calling getConstantRange and inferring the predicate manually.

Agree. This is a bit of a Chesterton's Fence for me, though. Why would we ever want to do getValueAt as opposed to getValueInBlock? getValueAt is called only by getPredicateAt, and...getPredicateAt *has* a BB, namely CtxI->getParent(). So why not just call getValueInBlock from there, and delete getValueAt? Further adding to my confusion is the fact that the header says that getPredicateAt (only?) looks at assume intrinsics. So it sort of seems intentional, but I have no idea why...

The only users of getPredicateAt are CVP and jump threading. I tried switching getPredicateAt to call getValueInBlock, and it seems to be fine for CVP, but it breaks jump threading tests in what seems to me to be a real way.

diff --git a/llvm/lib/Analysis/LazyValueInfo.cpp b/llvm/lib/Analysis/LazyValueInfo.cpp
index 65fd007dc0b2..04d2143bb904 100644
--- a/llvm/lib/Analysis/LazyValueInfo.cpp
+++ b/llvm/lib/Analysis/LazyValueInfo.cpp
@@ -1701,7 +1701,8 @@ LazyValueInfo::getPredicateAt(unsigned Pred, Value *V, Constant *C,
     else if (Pred == ICmpInst::ICMP_NE)
       return LazyValueInfo::True;
   }
-  ValueLatticeElement Result = getImpl(PImpl, AC, &DL, DT).getValueAt(V, CxtI);
+  ValueLatticeElement Result =
+      getImpl(PImpl, AC, &DL, DT).getValueInBlock(V, CxtI->getParent(), CxtI);
   Tristate Ret = getPredicateResult(Pred, C, Result, DL, TLI);
   if (Ret != Unknown)
     return Ret;

breaks

LLVM :: Analysis/LazyValueAnalysis/lvi-after-jumpthreading.ll
LLVM :: Transforms/JumpThreading/induction.ll

I stared at the jump threading code for a while and it's not at all clear to me why this new implementation of getPredicateAt is wrong for it.

jlebar updated this revision to Diff 137140.Mar 6 2018, 2:38 AM

Use getConstantRange instead of getPredicateAt.

The new testcase that checks that sdiv i32 is narrowed to udiv i8 currently
fails because the sdiv i32 -> udiv i32 transition uses getPredicateAt rather
than getConstantRange.

sanjoy accepted this revision.Mar 6 2018, 11:28 PM
sanjoy added inline comments.
llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
438

Not sure what SDI stands for here -- how about just calling it Inst?

447

How about s/R/OperandRange/?

This revision is now accepted and ready to land.Mar 6 2018, 11:28 PM
This revision was automatically updated to reflect the committed changes.
jlebar marked 2 inline comments as done.
jlebar added a comment.Mar 7 2018, 7:15 AM

Thank you for the reviews, Sanjoy and Sanjay!

I had a brief moment of terror when I realized that (R.getUnsignedMax() + 1).ceilLog2() can overflow. It actually works, because 0.ceilLog2() returns num_bits. But anyway I realized that there's getActiveBits(), which is what I actually wanted.

Submitting...

llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
438

And here I just thought I was thick for not figuring it out. :) Changed.

spatel added a comment.Mar 7 2018, 7:25 AM

Thank you for the reviews, Sanjoy and Sanjay!

Thanks for doing the work. :)
To confirm - with this patch, we're getting all of the motivating cases (edge value propagation, local known bits, llvm.assume) that were affected by D37121? Ie, there's no current motivation to make instcombine try harder to shrink div/rem?

jlebar added a comment.Mar 7 2018, 8:01 AM

To confirm - with this patch, we're getting all of the motivating cases (edge value propagation, local known bits, llvm.assume) that were affected by D37121? Ie, there's no current motivation to make instcombine try harder to shrink div/rem?

I still need to confirm, but I expect so.

I'm going to work on a patch to address the getPredicateAt problems Sanjoy pointed out above.

Also I need to revert this because it's crashing while building clang. Will figure that out, it's probably something dumb.

jlebar added a comment.EditedMar 7 2018, 9:01 AM

Pushed the fix, rL326908. It was indeed something simple: Given e.g. udiv i24 with no constraints on the operands, we noticed that the smallest power of 2 that could contain the operands was i32, and then tried to *expand* the udiv. Which is wrong, and also blew up because we were trying to trunc from i24 -> i32 and zext from i32 -> i24.