This is an archive of the discontinued LLVM Phabricator instance.

Differential D44102

Teach CorrelatedValuePropagation to reduce the width of udiv/urem instructions.
ClosedPublic

Authored by jlebar on Mar 5 2018, 11:58 AM.


Details

Reviewers

spatel
sanjoy
anna
davide
reames

Commits

rGcb9e89c39b07: Teach CorrelatedValuePropagation to reduce the width of udiv/urem instructions.
rL326898: Teach CorrelatedValuePropagation to reduce the width of udiv/urem instructions.

Summary

If the operands of a udiv/urem can be proved to fit within a smaller
power-of-two-sized type, reduce the width of the udiv/urem.
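As a rough illustration of why the narrowing is safe, the sketch below models the trunc / narrow op / zext sequence in plain C++ and compares it with the wide operation. The helper names (`narrowUDiv16`, `narrowURem16`, `narrowMatchesWide`) are hypothetical and are not part of the patch; the only assumption is that both operands are already known to fit in 16 bits and the divisor is nonzero.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration only; not code from the patch.
// Precondition for both: a <= 0xFFFF, b <= 0xFFFF, b != 0.

// Mirrors: trunc i64 -> i16, udiv i16, zext i16 -> i64.
inline uint64_t narrowUDiv16(uint64_t a, uint64_t b) {
  const uint16_t na = static_cast<uint16_t>(a);
  const uint16_t nb = static_cast<uint16_t>(b);
  return static_cast<uint16_t>(na / nb);
}

// Mirrors: trunc i64 -> i16, urem i16, zext i16 -> i64.
inline uint64_t narrowURem16(uint64_t a, uint64_t b) {
  const uint16_t na = static_cast<uint16_t>(a);
  const uint16_t nb = static_cast<uint16_t>(b);
  return static_cast<uint16_t>(na % nb);
}

// Compares narrow and wide results for every 16-bit dividend and a
// handful of 16-bit divisors.
inline bool narrowMatchesWide() {
  const uint64_t divisors[] = {1, 2, 42, 100, 255, 256, 65535};
  for (uint64_t a = 0; a <= 0xFFFF; ++a)
    for (uint64_t b : divisors)
      if (narrowUDiv16(a, b) != a / b || narrowURem16(a, b) != a % b)
        return false;
  return true;
}
```

The check passes because truncation loses no bits when both operands fit in the narrow type, and an unsigned quotient or remainder can never exceed the dividend.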

Diff Detail

Build Status

Buildable 15678
Build 15678: arc lint + arc unit

Event Timeline

jlebar created this revision. Mar 5 2018, 11:58 AM

Herald added a subscriber: hiraditya. · Mar 5 2018, 11:58 AM

Harbormaster completed remote builds in B15678: Diff 137049. Mar 5 2018, 11:58 AM

I feel like this could use more/better tests, but before I spend a lot of time on that...am I on the right track here?

jlebar mentioned this in D37121: [DivRemHoist] add a pass to move div/rem pairs into the same block (PR31028). Mar 5 2018, 12:05 PM

Disappointingly, this doesn't work for simple cases where you mask the dividend:

%b = and i64 %a, 65535
%div = udiv i64 %b, 42

It does work for llvm.assume, which I guess is good enough for the specific case I have, but...maybe this is not the right pass to be doing this in? Or should I check known-bits here too? Sorry, I'm an ignoramus when it comes to the target-independent parts of LLVM.

target datalayout = "e-i64:64-v16:16-v32:32-n16:32:64"
target triple = "nvptx64-nvidia-cuda"

declare void @llvm.assume(i1)

define void @foo(i64 %a, i64* %ptr1, i64* %ptr2) {
  %cond = icmp ult i64 %a, 1024
  call void @llvm.assume(i1 %cond)
  %div = udiv i64 %a, 42
  %rem = urem i64 %a, 42
  store i64 %div, i64* %ptr1
  store i64 %rem, i64* %ptr2
  ret void
}

becomes, at opt -O2

define void @foo(i64 %a, i64* nocapture %ptr1, i64* nocapture %ptr2) local_unnamed_addr #0 {
  %cond = icmp ult i64 %a, 1024
  tail call void @llvm.assume(i1 %cond)
  %div.lhs.trunc = trunc i64 %a to i16
  %div1 = udiv i16 %div.lhs.trunc, 42
  %div.zext = zext i16 %div1 to i64
  %1 = mul i16 %div1, 42
  %2 = sub i16 %div.lhs.trunc, %1
  %rem.zext = zext i16 %2 to i64
  store i64 %div.zext, i64* %ptr1, align 8
  store i64 %rem.zext, i64* %ptr2, align 8
  ret void
}

which lowers to the following ptx:

shr.u16         %rs2, %rs1, 1;
mul.wide.u16    %r1, %rs2, -15603;
shr.u32         %r2, %r1, 20;
cvt.u16.u32     %rs3, %r2;
cvt.u64.u32     %rd3, %r2;
mul.lo.s16      %rs4, %rs3, 42;
sub.s16         %rs5, %rs1, %rs4;
cvt.u64.u16     %rd4, %rs5;
st.u64  [%rd1], %rd3;
st.u64  [%rd2], %rd4;

This is even nicer than before because we do the magic-number division in 16-widens-to-32-bit instead of (before) doing it in 32-widens-to-64 bit. At least, I hope that's efficient in NVPTX -- if not, that's our backend's problem. :)
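To make the "16-widens-to-32" remark concrete, here is a small C++ model of the first four PTX instructions above. It is a sketch under one stated assumption: the immediate -15603 in `mul.wide.u16` is read as the unsigned 16-bit value 49933 (65536 - 15603). The function names are hypothetical.

```cpp
#include <cstdint>

// Models: shr.u16 by 1, mul.wide.u16 by 49933, shr.u32 by 20.
// 49933 is -15603 reinterpreted as an unsigned 16-bit value (an assumption
// about how the immediate is read, stated in the lead-in).
inline uint16_t udiv42Magic(uint16_t x) {
  const uint16_t half = static_cast<uint16_t>(x >> 1);       // x / 2
  const uint32_t wide = static_cast<uint32_t>(half) * 49933u; // 16 x 16 -> 32
  return static_cast<uint16_t>(wide >> 20);                   // ~ half / 21
}

// Exhaustively compares the magic-number sequence with x / 42 over all
// 16-bit inputs.
inline bool udiv42MagicIsExact() {
  for (uint32_t x = 0; x <= 0xFFFF; ++x)
    if (udiv42Magic(static_cast<uint16_t>(x)) != x / 42)
      return false;
  return true;
}
```

The sequence works because 42 = 2 * 21 and 49933 * 21 = 2^20 + 17, so the multiply-and-shift rounds to the same quotient as dividing by 21 for every 15-bit input, and the 32-bit product cannot overflow.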

I think this is the right approach, but I don't know much about CVP, so adding more potential reviewers.

Context: This is part of improving udiv/urem IR as discussed in the latest comments on D37121.
Motivation: Narrowing the width of these ops improves bit-tracking / analysis and potentially enables further IR transforms.
Bonus: It also aligns with codegen transforms such as the magic number division mentioned by Justin and improves perf for targets that have faster narrow div/rem instructions.

We have basic div/rem narrowing folds in instcombine, but we want to handle cases like this where edge info and/or computeKnownBits would also allow narrowing. Is it preferred to use ValueTracking here to get the single BB case or should that be added to InstCombine?

In D44102#1027656, @jlebar wrote:
Disappointingly, this doesn't work for simple cases where you mask the dividend:
%b = and i64 %a, 65535
%div = udiv i64 %b, 42

This is a little surprising to me -- on a first glance it looks like LazyValueInfoImpl::solveBlockValueBinaryOp should be doing the right thing here. But I dug a bit deeper and it looks like getPredicateAt calls getValueAt which only looks at guards and assumes (and never calls solveBlockValueBinaryOp)? I have a hunch that using the getConstantRange call will "fix" this issue, though getPredicateAt definitely should be not weaker than calling getConstantRange and inferring the predicate manually.

llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
457	Instead of making multiple queries, how about using `LazyValueInfo::getConstantRange` instead?

I have a hunch that using the getConstantRange call will "fix" this issue

It does, thanks! It also simplifies the code.

though getPredicateAt definitely should be not weaker than calling getConstantRange and inferring the predicate manually.

Agree. This is a bit of a Chesterton's Fence for me, though. Why would we ever want to do getValueAt as opposed to getValueInBlock? getValueAt is called only by getPredicateAt, and...getPredicateAt *has* a BB, namely CtxI->getParent(). So why not just call getValueInBlock from there, and delete getValueAt? Further adding to my confusion is the fact that the header says that getPredicateAt (only?) looks at assume intrinsics. So it sort of seems intentional, but I have no idea why...

The only users of getPredicateAt are CVP and jump threading. I tried switching getPredicateAt to call getValueInBlock, and it seems to be fine for CVP, but it breaks jump threading tests in what seems to me to be a real way.

diff --git a/llvm/lib/Analysis/LazyValueInfo.cpp b/llvm/lib/Analysis/LazyValueInfo.cpp
index 65fd007dc0b2..04d2143bb904 100644
--- a/llvm/lib/Analysis/LazyValueInfo.cpp
+++ b/llvm/lib/Analysis/LazyValueInfo.cpp
@@ -1701,7 +1701,8 @@ LazyValueInfo::getPredicateAt(unsigned Pred, Value *V, Constant *C,
     else if (Pred == ICmpInst::ICMP_NE)
       return LazyValueInfo::True;
   }
-  ValueLatticeElement Result = getImpl(PImpl, AC, &DL, DT).getValueAt(V, CxtI);
+  ValueLatticeElement Result =
+      getImpl(PImpl, AC, &DL, DT).getValueInBlock(V, CxtI->getParent(), CxtI);
   Tristate Ret = getPredicateResult(Pred, C, Result, DL, TLI);
   if (Ret != Unknown)
     return Ret;

breaks

LLVM :: Analysis/LazyValueAnalysis/lvi-after-jumpthreading.ll
LLVM :: Transforms/JumpThreading/induction.ll

I stared at the jump threading code for a while and it's not at all clear to me why this new implementation of getPredicateAt is wrong for it.

Use getConstantRange instead of getPredicateAt.

The new testcase that checks that sdiv i32 is narrowed to udiv i8 currently
fails because the sdiv i32 -> udiv i32 transition uses getPredicateAt rather
than getConstantRange.

Harbormaster completed remote builds in B15712: Diff 137140. Mar 6 2018, 2:38 AM

sanjoy accepted this revision. Mar 6 2018, 11:28 PM

sanjoy added inline comments.

llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
438	Not sure what `SDI` stands for here -- how about just calling it `Inst`?
447	How about s/`R`/`OperandRange`/?

This revision is now accepted and ready to land. Mar 6 2018, 11:28 PM

Closed by commit rL326898: Teach CorrelatedValuePropagation to reduce the width of udiv/urem instructions. (authored by jlebar). · Mar 7 2018, 7:14 AM

This revision was automatically updated to reflect the committed changes.

jlebar marked 2 inline comments as done.

Thank you for the reviews, Sanjoy and Sanjay!

I had a brief moment of terror when I realized that (R.getUnsignedMax() + 1).ceilLog2() can overflow. It actually works, because 0.ceilLog2() returns num_bits. But anyway I realized that there's getActiveBits(), which is what I actually wanted.
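The overflow observation can be modeled on a fixed 32-bit integer. This is only a sketch: `activeBits` and `ceilLog2Model` are hypothetical stand-ins, and modeling ceilLog2 as "active bits of x - 1" is an assumption chosen because it reproduces the behavior described in the comment (a zero input yields the full bit width).

```cpp
#include <cstdint>

// Number of bits needed to represent x; 0 when x == 0.
// Hypothetical stand-in for the getActiveBits() idea on a 32-bit value.
inline unsigned activeBits(uint32_t x) {
  unsigned n = 0;
  while (x != 0) {
    ++n;
    x >>= 1;
  }
  return n;
}

// Assumed model of ceilLog2: active bits of (x - 1). With x == 0 the
// subtraction wraps to all ones, so the result is the full width, 32.
inline unsigned ceilLog2Model(uint32_t x) { return activeBits(x - 1u); }

// Width needed for a range whose unsigned maximum is `max`, computed the
// roundabout way; `max + 1` wraps to 0 when max == 0xFFFFFFFF.
inline unsigned widthViaCeilLog2(uint32_t max) { return ceilLog2Model(max + 1u); }
```

Under this model the two routes always agree, since (max + 1) - 1 wraps back to max, which is why the original expression happened to work even at the overflow point. Asking for the active bits of the maximum directly says the same thing without relying on wraparound.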

Submitting...

llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp
438	And here I just thought I was thick for not figuring it out. :) Changed.

In D44102#1029861, @jlebar wrote:

Thank you for the reviews, Sanjoy and Sanjay!

Thanks for doing the work. :)
To confirm - with this patch, we're getting all of the motivating cases (edge value propagation, local known bits, llvm.assume) that were affected by D37121? Ie, there's no current motivation to make instcombine try harder to shrink div/rem?

To confirm - with this patch, we're getting all of the motivating cases (edge value propagation, local known bits, llvm.assume) that were affected by D37121? Ie, there's no current motivation to make instcombine try harder to shrink div/rem?

I still need to confirm, but I expect so.

I'm going to work on a patch to address the getPredicateAt problems Sanjoy pointed out above.

Also I need to revert this because it's crashing while building clang. Will figure that out, it's probably something dumb.

Pushed the fix, rL326908. It was indeed something simple: Given e.g. udiv i24 with no constraints on the operands, we noticed that the smallest power of 2 that could contain the operands was i32, and then tried to *expand* the udiv. Which is wrong, and also blew up because we were trying to trunc from i24 -> i32 and zext from i32 -> i24.
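The failure mode described here can be sketched as a width-selection helper with the missing guard made explicit. This is a hypothetical illustration, not the code from rL326908: `narrowedWidth` and its "0 means do not transform" convention are assumptions, and the 8-bit floor is taken from the loop bound in the original diff.

```cpp
#include <cstdint>

// Hypothetical sketch. Given the original integer width and the number of
// bits the operands are known to need, returns the power-of-two width to
// narrow to, or 0 if the transform should not fire.
inline unsigned narrowedWidth(unsigned origWidth, unsigned neededBits) {
  unsigned width = 8; // smallest width considered
  while (width < neededBits)
    width *= 2;
  // Guard: never "narrow" to a width that is not strictly smaller than the
  // original, e.g. an unconstrained i24 would otherwise round up to i32.
  return width < origWidth ? width : 0;
}
```

With this guard, an unconstrained i24 (24 needed bits) yields 0 rather than 32, while the i64 example with %a < 1024 (10 needed bits) yields 16, matching the i16 udiv shown earlier in the thread.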

jlebar mentioned this in D44252: [CVP] [LVI] Add LVI::getPredicateInBlock and use it in CVP.. Mar 8 2018, 6:01 AM

jlebar mentioned this in rL327252: Back out "Re-land: Teach CorrelatedValuePropagation to reduce the width of…. Mar 12 2018, 2:29 AM

spatel mentioned this in D46760: [InstCombine] Enhance narrowUDivURem.. May 11 2018, 11:59 AM

lebedev.ri mentioned this in D47112: [CVP] Add tests for lshr width reduction. May 19 2018, 4:45 PM

lebedev.ri mentioned this in D47113: [CVP] Teach CorrelatedValuePropagation to reduce the width of lshr instruction..

Revision Contents

Path	Size
llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp	59 lines
llvm/test/Transforms/CorrelatedValuePropagation/udiv.ll	65 lines
llvm/test/Transforms/CorrelatedValuePropagation/urem.ll	65 lines

Diff 137049

llvm/lib/Transforms/Scalar/CorrelatedValuePropagation.cpp

@@ 52 lines hidden @@
 STATISTIC(NumPhis, "Number of phis propagated");
 STATISTIC(NumSelects, "Number of selects propagated");
 STATISTIC(NumMemAccess, "Number of memory access targets propagated");
 STATISTIC(NumCmps, "Number of comparisons propagated");
 STATISTIC(NumReturns, "Number of return values propagated");
 STATISTIC(NumDeadCases, "Number of switch cases removed");
 STATISTIC(NumSDivs, "Number of sdiv converted to udiv");
+STATISTIC(NumUDivs, "Number of udivs whose width was decreased");
 STATISTIC(NumAShrs, "Number of ashr converted to lshr");
 STATISTIC(NumSRems, "Number of srem converted to urem");
 STATISTIC(NumOverflows, "Number of overflow checks removed");

 static cl::opt<bool> DontProcessAdds("cvp-dont-process-adds", cl::init(true));

 namespace {

@@ 360 lines hidden @@ for (Value *O : SDI->operands()) {
     if (Result != LazyValueInfo::True)
       return false;
   }
   return true;
 }

 static bool processSRem(BinaryOperator *SDI, LazyValueInfo *LVI) {
   if (SDI->getType()->isVectorTy() || !hasPositiveOperands(SDI, LVI))
     return false;

   ++NumSRems;
   auto *BO = BinaryOperator::CreateURem(SDI->getOperand(0), SDI->getOperand(1),
                                         SDI->getName(), SDI);
   SDI->replaceAllUsesWith(BO);
   SDI->eraseFromParent();
   return true;
 }

+// Tries to find the smallest power-of-two bit width greater than 8 bits which
+// is sufficient to hold all of the operands of SDI (interpreted as uints).
+static Optional<uint64_t>
+smallestPowerOf2WidthForOperandsOf(BinaryOperator *SDI, LazyValueInfo *LVI) {
+  Optional<uint64_t> Result;
+  auto OrigWidth = SDI->getType()->getIntegerBitWidth();
+  for (uint64_t Width = PowerOf2Floor(OrigWidth - 1); Width >= 8; Width /= 2) {
+    Constant *Max = ConstantInt::get(
+        SDI->getType(), APInt::getAllOnesValue(Width).zext(OrigWidth));
+    if (all_of(SDI->operands(), [&](Value *Operand) {
+          return LVI->getPredicateAt(ICmpInst::ICMP_ULE, Operand, Max, SDI) ==
+                 LazyValueInfo::True;
+        }))
+      Result = Width;
+    else
+      break;
+  }
+  return Result;
+}
+
+/// Try to shrink a udiv/urem's width down to the smallest power of two that's
+/// sufficient to contain its operands.
+static bool processUDivOrURem(BinaryOperator *SDI, LazyValueInfo *LVI) {
+  assert(SDI->getOpcode() == Instruction::UDiv ||
+         SDI->getOpcode() == Instruction::URem);
+  if (SDI->getType()->isVectorTy())
+    return false;
+
+  Optional<uint64_t> TruncBitWidth =
+      smallestPowerOf2WidthForOperandsOf(SDI, LVI);
+  if (!TruncBitWidth)
+    return false;
+
+  ++NumUDivs;
+  auto *TruncTy = Type::getIntNTy(SDI->getContext(), *TruncBitWidth);
+  auto *LHS = CastInst::Create(Instruction::Trunc, SDI->getOperand(0), TruncTy,
+                               SDI->getName() + ".lhs.trunc", SDI);
+  auto *RHS = CastInst::Create(Instruction::Trunc, SDI->getOperand(1), TruncTy,
+                               SDI->getName() + ".rhs.trunc", SDI);
+  auto *BO =
+      BinaryOperator::Create(SDI->getOpcode(), LHS, RHS, SDI->getName(), SDI);
+  auto *Zext = CastInst::Create(Instruction::ZExt, BO, SDI->getType(),
+                                SDI->getName() + ".zext", SDI);
+  if (BO->getOpcode() == Instruction::UDiv)
+    BO->setIsExact(SDI->isExact());
+
+  SDI->replaceAllUsesWith(Zext);
+  SDI->eraseFromParent();
+  return true;
+}
+
 /// See if LazyValueInfo's ability to exploit edge conditions or range
 /// information is sufficient to prove the both operands of this SDiv are
 /// positive. If this is the case, replace the SDiv with a UDiv. Even for local
 /// conditions, this can sometimes prove conditions instcombine can't by
 /// exploiting range information.
 static bool processSDiv(BinaryOperator *SDI, LazyValueInfo *LVI) {
   if (SDI->getType()->isVectorTy() || !hasPositiveOperands(SDI, LVI))
     return false;

   ++NumSDivs;
   auto *BO = BinaryOperator::CreateUDiv(SDI->getOperand(0), SDI->getOperand(1),
                                         SDI->getName(), SDI);
   BO->setIsExact(SDI->isExact());
   SDI->replaceAllUsesWith(BO);
   SDI->eraseFromParent();

+  // Try to simplify our new udiv.
+  processUDivOrURem(BO, LVI);
+
   return true;
 }

 static bool processAShr(BinaryOperator *SDI, LazyValueInfo *LVI) {
   if (SDI->getType()->isVectorTy())
     return false;

   Constant *Zero = ConstantInt::get(SDI->getType(), 0);
@@ 119 lines hidden @@ for (BasicBlock::iterator BI = BB->begin(), BE = BB->end(); BI != BE;) {
         BBChanged |= processCallSite(CallSite(II), LVI);
         break;
       case Instruction::SRem:
         BBChanged |= processSRem(cast<BinaryOperator>(II), LVI);
         break;
       case Instruction::SDiv:
         BBChanged |= processSDiv(cast<BinaryOperator>(II), LVI);
         break;
+      case Instruction::UDiv:
+      case Instruction::URem:
+        BBChanged |= processUDivOrURem(cast<BinaryOperator>(II), LVI);
+        break;
       case Instruction::AShr:
         BBChanged |= processAShr(cast<BinaryOperator>(II), LVI);
         break;
       case Instruction::Add:
         BBChanged |= processAdd(cast<BinaryOperator>(II), LVI);
         break;
       }
     }
@@ 48 lines hidden @@

llvm/test/Transforms/CorrelatedValuePropagation/udiv.ll

This file was added.

; RUN: opt < %s -correlated-propagation -S | FileCheck %s

; CHECK-LABEL: @test1(
define void @test1(i32 %n) {
entry:
  %cmp = icmp ule i32 %n, 65535
  br i1 %cmp, label %bb, label %exit

bb:
; CHECK: udiv i16
  %div = udiv i32 %n, 100
  br label %exit

exit:
  ret void
}

; CHECK-LABEL: @test2(
define void @test2(i32 %n) {
entry:
  %cmp = icmp ule i32 %n, 65536
  br i1 %cmp, label %bb, label %exit

bb:
; CHECK: udiv i32 %n, 100
  %div = udiv i32 %n, 100
  br label %exit

exit:
  ret void
}

; CHECK-LABEL: @test3(
define void @test3(i32 %m, i32 %n) {
entry:
  %cmp1 = icmp ult i32 %m, 65535
  %cmp2 = icmp ult i32 %n, 65535
  %cmp = and i1 %cmp1, %cmp2
  br i1 %cmp, label %bb, label %exit

bb:
; CHECK: udiv i16
  %div = udiv i32 %m, %n
  br label %exit

exit:
  ret void
}

; CHECK-LABEL: @test4(
define void @test4(i32 %m, i32 %n) {
entry:
  %cmp1 = icmp ult i32 %m, 65535
  %cmp2 = icmp ule i32 %n, 65536
  %cmp = and i1 %cmp1, %cmp2
  br i1 %cmp, label %bb, label %exit

bb:
; CHECK: udiv i32 %m, %n
  %div = udiv i32 %m, %n
  br label %exit

exit:
  ret void
}

llvm/test/Transforms/CorrelatedValuePropagation/urem.ll

This file was added.

; RUN: opt < %s -correlated-propagation -S | FileCheck %s

; CHECK-LABEL: @test1(
define void @test1(i32 %n) {
entry:
  %cmp = icmp ule i32 %n, 65535
  br i1 %cmp, label %bb, label %exit

bb:
; CHECK: urem i16
  %div = urem i32 %n, 100
  br label %exit

exit:
  ret void
}

; CHECK-LABEL: @test2(
define void @test2(i32 %n) {
entry:
  %cmp = icmp ule i32 %n, 65536
  br i1 %cmp, label %bb, label %exit

bb:
; CHECK: urem i32 %n, 100
  %div = urem i32 %n, 100
  br label %exit

exit:
  ret void
}

; CHECK-LABEL: @test3(
define void @test3(i32 %m, i32 %n) {
entry:
  %cmp1 = icmp ult i32 %m, 65535
  %cmp2 = icmp ult i32 %n, 65535
  %cmp = and i1 %cmp1, %cmp2
  br i1 %cmp, label %bb, label %exit

bb:
; CHECK: urem i16
  %div = urem i32 %m, %n
  br label %exit

exit:
  ret void
}

; CHECK-LABEL: @test4(
define void @test4(i32 %m, i32 %n) {
entry:
  %cmp1 = icmp ult i32 %m, 65535
  %cmp2 = icmp ule i32 %n, 65536
  %cmp = and i1 %cmp1, %cmp2
  br i1 %cmp, label %bb, label %exit

bb:
; CHECK: urem i32 %m, %n
  %div = urem i32 %m, %n
  br label %exit

exit:
  ret void
}