If the operands of a udiv/urem can be proved to fit within a smaller
power-of-two-sized type, reduce the width of the udiv/urem.
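For reference, here is a minimal sketch of the transform, assuming the LazyValueInfo and IRBuilder APIs as of this patch. The shape mirrors the patch (a processUDivOrURem helper in CVP), but the body below is illustrative rather than the exact committed code:

#include "llvm/Analysis/LazyValueInfo.h"
#include "llvm/IR/ConstantRange.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/Support/MathExtras.h"
#include <algorithm>
using namespace llvm;

// If LVI proves that both operands of a udiv/urem fit in a smaller
// power-of-two-sized type, perform the operation at that width and
// zero-extend the result back to the original type.
static bool processUDivOrURem(BinaryOperator *Instr, LazyValueInfo *LVI) {
  assert(Instr->getOpcode() == Instruction::UDiv ||
         Instr->getOpcode() == Instruction::URem);
  if (Instr->getType()->isVectorTy())
    return false;

  // Find the smallest power-of-two bit width that can hold both operands.
  unsigned OrigWidth = Instr->getType()->getIntegerBitWidth();
  unsigned MaxActiveBits = 0;
  for (Value *Operand : Instr->operands()) {
    ConstantRange Range = LVI->getConstantRange(Operand, Instr->getParent());
    MaxActiveBits =
        std::max(Range.getUnsignedMax().getActiveBits(), MaxActiveBits);
  }
  // Don't shrink below 8 bits, and bail out if the "narrow" type would not
  // actually be narrower (e.g. udiv i24 with unconstrained operands rounds
  // up to i32 -- see the rL326908 discussion below).
  unsigned NewWidth = std::max<unsigned>(PowerOf2Ceil(MaxActiveBits), 8);
  if (NewWidth >= OrigWidth)
    return false;

  // trunc the operands, do the op at the narrow width, and zext the result.
  IRBuilder<> B(Instr);
  Type *TruncTy = Type::getIntNTy(Instr->getContext(), NewWidth);
  Value *LHS = B.CreateTrunc(Instr->getOperand(0), TruncTy);
  Value *RHS = B.CreateTrunc(Instr->getOperand(1), TruncTy);
  Value *Narrow = B.CreateBinOp(Instr->getOpcode(), LHS, RHS);
  Value *Zext = B.CreateZExt(Narrow, Instr->getType(), Instr->getName());
  Instr->replaceAllUsesWith(Zext);
  Instr->eraseFromParent();
  return true;
}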
Event Timeline
I feel like this could use more/better tests, but before I spend a lot of time on that...am I on the right track here?
Disappointingly, this doesn't work for simple cases where you mask the dividend:

%b = and i64 %a, 65535
%div = udiv i64 %b, 42
It does work for llvm.assume, which I guess is good enough for the specific case I have, but...maybe this is not the right pass to be doing this in? Or should I check known-bits here too? Sorry, I'm an ignoramus when it comes to the target-independent parts of LLVM.
target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare void @llvm.assume(i1)

define void @foo(i64 %a, i64* %ptr1, i64* %ptr2) {
  %cond = icmp ult i64 %a, 1024
  call void @llvm.assume(i1 %cond)
  %div = udiv i64 %a, 42
  %rem = urem i64 %a, 42
  store i64 %div, i64* %ptr1
  store i64 %rem, i64* %ptr2
  ret void
}
becomes, at opt -O2
define void @foo(i64 %a, i64* nocapture %ptr1, i64* nocapture %ptr2) local_unnamed_addr #0 {
  %cond = icmp ult i64 %a, 1024
  tail call void @llvm.assume(i1 %cond)
  %div.lhs.trunc = trunc i64 %a to i16
  %div1 = udiv i16 %div.lhs.trunc, 42
  %div.zext = zext i16 %div1 to i64
  %1 = mul i16 %div1, 42
  %2 = sub i16 %div.lhs.trunc, %1
  %rem.zext = zext i16 %2 to i64
  store i64 %div.zext, i64* %ptr1, align 8
  store i64 %rem.zext, i64* %ptr2, align 8
  ret void
}
which lowers to the following PTX:
shr.u16      %rs2, %rs1, 1;
mul.wide.u16 %r1, %rs2, -15603;
shr.u32      %r2, %r1, 20;
cvt.u16.u32  %rs3, %r2;
cvt.u64.u32  %rd3, %r2;
mul.lo.s16   %rs4, %rs3, 42;
sub.s16      %rs5, %rs1, %rs4;
cvt.u64.u16  %rd4, %rs5;
st.u64       [%rd1], %rd3;
st.u64       [%rd2], %rd4;
This is even nicer than before, because we now do the magic-number division in 16-bit widening to 32-bit, instead of in 32-bit widening to 64-bit as before. At least, I hope that's efficient in NVPTX -- if not, that's our backend's problem. :)
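For the curious, the multiply-shift trick above can be checked exhaustively. The arithmetic here is my own gloss, not part of the patch: -15603 reinterpreted as a u16 is 49933 = ceil(2^21 / 42), and the initial shift by 1 exploits 42 = 2 * 21.

#include <cassert>
#include <cstdint>

int main() {
  // Verify that the PTX sequence (shr.u16; mul.wide.u16 by 49933;
  // shr.u32 by 20) computes x / 42 for every 16-bit x.
  for (uint32_t x = 0; x <= 0xFFFF; ++x) {
    uint32_t q = ((x >> 1) * 49933u) >> 20;
    assert(q == x / 42);
  }
  return 0;
}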
I think this is the right approach, but I don't know much about CVP, so adding more potential reviewers.
Context: This is part of improving udiv/urem IR as discussed in the latest comments on D37121.
Motivation: Narrowing the width of these ops improves bit-tracking / analysis and potentially enables further IR transforms.
Bonus: It also aligns with codegen transforms such as the magic number division mentioned by Justin and improves perf for targets that have faster narrow div/rem instructions.
We have basic div/rem narrowing folds in instcombine, but we want to handle cases like this where edge info and/or computeKnownBits would also allow narrowing. Is it preferred to use ValueTracking here to get the single BB case or should that be added to InstCombine?
This is a little surprising to me -- at first glance it looks like LazyValueInfoImpl::solveBlockValueBinaryOp should be doing the right thing here. But I dug a bit deeper, and it looks like getPredicateAt calls getValueAt, which only looks at guards and assumes (and never calls solveBlockValueBinaryOp)? I have a hunch that using the getConstantRange call will "fix" this issue, though getPredicateAt definitely should not be weaker than calling getConstantRange and inferring the predicate manually.
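To make the asymmetry concrete, here's a sketch assuming the LVI entry points as of this patch; the bound 65536 is just an example:

#include "llvm/Analysis/LazyValueInfo.h"
#include "llvm/IR/ConstantRange.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// The two entry points can answer the same question differently.
static void contrast(LazyValueInfo *LVI, Value *V, Instruction *CxtI) {
  // Bottoms out in getValueAt, which only scans assumes and guards, so it
  // returns Unknown for `%b = and i64 %a, 65535`.
  LazyValueInfo::Tristate T = LVI->getPredicateAt(
      ICmpInst::ICMP_ULT, V, ConstantInt::get(V->getType(), 65536), CxtI);
  // Runs the full block-value solver (including solveBlockValueBinaryOp),
  // so the same fact comes back as the range [0, 65536).
  ConstantRange CR = LVI->getConstantRange(V, CxtI->getParent(), CxtI);
  (void)T;
  (void)CR;
}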
llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp:457 (On Diff #137049)
Instead of making multiple queries, how about using LazyValueInfo::getConstantRange instead?
> I have a hunch that using the getConstantRange call will "fix" this issue
It does, thanks! It also simplifies the code.
> though getPredicateAt definitely should not be weaker than calling getConstantRange and inferring the predicate manually.
Agree. This is a bit of a Chesterton's Fence for me, though. Why would we ever want to do getValueAt as opposed to getValueInBlock? getValueAt is called only by getPredicateAt, and...getPredicateAt *has* a BB, namely CtxI->getParent(). So why not just call getValueInBlock from there, and delete getValueAt? Further adding to my confusion is the fact that the header says that getPredicateAt (only?) looks at assume intrinsics. So it sort of seems intentional, but I have no idea why...
The only users of getPredicateAt are CVP and jump threading. I tried switching getPredicateAt to call getValueInBlock, and it seems to be fine for CVP, but it breaks jump threading tests in what seems to me to be a real way.
diff --git a/llvm/lib/Analysis/LazyValueInfo.cpp b/llvm/lib/Analysis/LazyValueInfo.cpp
index 65fd007dc0b2..04d2143bb904 100644
--- a/llvm/lib/Analysis/LazyValueInfo.cpp
+++ b/llvm/lib/Analysis/LazyValueInfo.cpp
@@ -1701,7 +1701,8 @@ LazyValueInfo::getPredicateAt(unsigned Pred, Value *V, Constant *C,
     else if (Pred == ICmpInst::ICMP_NE)
       return LazyValueInfo::True;
   }
-  ValueLatticeElement Result = getImpl(PImpl, AC, &DL, DT).getValueAt(V, CxtI);
+  ValueLatticeElement Result =
+      getImpl(PImpl, AC, &DL, DT).getValueInBlock(V, CxtI->getParent(), CxtI);
   Tristate Ret = getPredicateResult(Pred, C, Result, DL, TLI);
   if (Ret != Unknown)
     return Ret;
breaks
LLVM :: Analysis/LazyValueAnalysis/lvi-after-jumpthreading.ll
LLVM :: Transforms/JumpThreading/induction.ll
I stared at the jump threading code for a while and it's not at all clear to me why this new implementation of getPredicateAt is wrong for it.
Use getConstantRange instead of getPredicateAt.
The new testcase that checks that sdiv i32 is narrowed to udiv i8 currently
fails because the sdiv i32 -> udiv i32 transition uses getPredicateAt rather
than getConstantRange.
Thank you for the reviews, Sanjoy and Sanjay!
I had a brief moment of terror when I realized that (R.getUnsignedMax() + 1).ceilLog2() can overflow. It actually works, because 0.ceilLog2() returns num_bits. But anyway I realized that there's getActiveBits(), which is what I actually wanted.
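To spell out the wraparound (my own example; the in-tree spelling is APInt::ceilLogBase2()):

#include "llvm/ADT/APInt.h"
#include <cassert>
using namespace llvm;

int main() {
  APInt Max = APInt::getMaxValue(64);  // unconstrained operand: all ones
  // Max + 1 wraps to 0 in 64-bit arithmetic, and APInt(64, 0).ceilLogBase2()
  // happens to return the bit width (64), which is why the original code
  // worked by accident.
  assert((Max + 1).ceilLogBase2() == 64);
  // getActiveBits() asks the intended question directly, with no wrap:
  assert(Max.getActiveBits() == 64);
  assert(APInt(64, 1023).getActiveBits() == 10); // the assume(ult 1024) case
  return 0;
}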
Submitting...
llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp:438 (On Diff #137140)
And here I just thought I was thick for not figuring it out. :) Changed.
Thanks for doing the work. :)
To confirm - with this patch, we're getting all of the motivating cases (edge value propagation, local known bits, llvm.assume) that were affected by D37121? Ie, there's no current motivation to make instcombine try harder to shrink div/rem?
> To confirm - with this patch, we're getting all of the motivating cases (edge value propagation, local known bits, llvm.assume) that were affected by D37121? Ie, there's no current motivation to make instcombine try harder to shrink div/rem?
I still need to confirm, but I expect so.
I'm going to work on a patch to address the getPredicateAt problems Sanjoy pointed out above.
Also, I need to revert this because it's crashing while building clang. Will figure that out; it's probably something dumb.
Pushed the fix, rL326908. It was indeed something simple: given e.g. a udiv i24 with no constraints on the operands, we noticed that the smallest power-of-two type that could contain the operands was i32, and then tried to *expand* the udiv. That's wrong, and it also blew up, because we were trying to trunc from i24 to i32 and zext from i32 to i24.
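Concretely, the failure mode looked like this (a standalone sketch with my own numbers, matching the bail-out in the narrowing code above):

#include "llvm/Support/MathExtras.h"
#include <algorithm>
#include <cassert>

int main() {
  // udiv i24 with unconstrained operands: 24 active bits round up to 32,
  // so the "narrower" type is actually wider.
  unsigned OrigWidth = 24;
  unsigned NewWidth = std::max<unsigned>(llvm::PowerOf2Ceil(24), 8);
  assert(NewWidth == 32);
  // Without the NewWidth >= OrigWidth bail-out, we would have emitted
  // trunc i24 -> i32 and zext i32 -> i24, both of which are invalid
  // (trunc must narrow, zext must widen) -- hence the crash.
  assert(NewWidth >= OrigWidth && "bail out instead of 'narrowing'");
  return 0;
}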