This is an archive of the discontinued LLVM Phabricator instance.

[BypassSlowDivision] Use ValueTracking to simplify run-time checks
ClosedPublic

Authored by n.bozhenov on Feb 13 2017, 9:55 AM.

Details

Summary

ValueTracking is used for more thorough analysis of operands. Based on the analysis, either run-time checks can be simplified (e.g. check only one operand instead of two) or the transformation can be avoided. For example, it is quite often the case that a divisor is promoted from a shorter type and run-time checks for it are redundant.

With additional compile-time analysis of values, two special cases naturally arise and are addressed by the patch:

  1. Both operands are known to be short enough. Then the long division can simply be replaced with a short one, without any CFG modification.
  2. If the division is unsigned and the dividend is known to be short, then the long division is not needed at all: if the divisor is too big for a short division, the quotient is obviously zero (and the remainder is equal to the dividend). More precisely, the division is not needed whenever (divisor > dividend).
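The second special case can be sketched as follows. This is a hypothetical C++ illustration of the logic (not the pass's actual code), assuming a 64-bit "long" type and a 32-bit "short" type:

```cpp
#include <cstdint>
#include <utility>

// Sketch of the unsigned special case: the dividend is known to fit in
// 32 bits. If the 64-bit divisor is larger than the dividend, the
// quotient is zero and the remainder equals the dividend, so no
// division is needed at all. Otherwise the divisor also fits in 32
// bits, and a short division suffices.
std::pair<uint64_t, uint64_t> divRemShortDividend(uint32_t dividend,
                                                  uint64_t divisor) {
  if (divisor > dividend)
    return {0, dividend}; // quotient 0, remainder == dividend
  uint32_t d = static_cast<uint32_t>(divisor);
  return {dividend / d, dividend % d};
}
```

Either way, the expensive 64-bit divide instruction is avoided.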

Basically, this is D28199 rebased onto D29896.

Diff Detail

Repository
rL LLVM

Event Timeline

n.bozhenov updated this revision to Diff 88555.Feb 15 2017, 9:22 AM

Rebased onto a newer version of D29896.

Yet another rebase.

n.bozhenov updated this revision to Diff 89066.Feb 19 2017, 8:05 AM
jlebar edited edge metadata.Feb 19 2017, 10:17 AM

When we see x/y, in addition to checking whether x and y are known "long", should we also check whether x | y is known-long? This way if we saw something like:

if ((x | y) & 0xffffffff00000000) return x / y;
else return static_cast<int32_t>(x) / static_cast<int32_t>(y);

we wouldn't re-optimize this code.
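For context, the run-time check the pass emits has this single-OR shape. A hedged C++ sketch (an assumed equivalent of the emitted IR, not the pass's exact output):

```cpp
#include <cstdint>

// The pass ORs the operands once and tests the upper half, instead of
// testing each operand separately: if no high bit is set in (x | y),
// both operands fit into 32 bits and the fast path is safe.
uint64_t udivWithBypass(uint64_t x, uint64_t y) {
  if (((x | y) >> 32) == 0) // both operands fit into 32 bits
    return static_cast<uint32_t>(x) / static_cast<uint32_t>(y);
  return x / y; // slow 64-bit division
}
```

This is why hand-written source like the snippet above ends up structurally identical to what the transformation would produce.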

When we see x/y, in addition to checking whether x and y are known "long", should we also check whether x | y is known-long? This way if we saw something like:

if ((x | y) & 0xffffffff00000000) return x / y;
else return static_cast<int32_t>(x) / static_cast<int32_t>(y);

we wouldn't re-optimize this code.

I am afraid that ValueTracking doesn't work this way. It doesn't take into account the point where we analyze a value. So, unfortunately, it cannot yield different results on different if-branches for the same values.

In your example, if both branches may be taken, then (x|y) is neither known-long nor known-short. ValueTracking can determine that (x|y) is known-long only when the else-branch of the if-statement is statically known to be dead.

Do you think it would be too conservative for us to say: We won't optimize a long div if you do a short div with the same operands anywhere in the same function?

Do you think it would be too conservative for us to say: We won't optimize a long div if you do a short div with the same operands anywhere in the same function?

That's an interesting heuristic, but I believe it's beyond the scope of this particular patch. This patch just takes advantage of ValueTracking and simplifies runtime checks where possible.

Do you think it would be too conservative for us to say: We won't optimize a long div if you do a short div with the same operands anywhere in the same function?

That's an interesting heuristic, but I believe it's beyond the scope of this particular patch. This patch just takes advantage of ValueTracking and simplifies runtime checks where possible.

Sure. I would like not to lose track of this idea, though.

jlebar added inline comments.Feb 21 2017, 9:13 AM
lib/Transforms/Utils/BypassSlowDivision.cpp
194 ↗(On Diff #89134)

Maybe, "check if an integer type fits into our bypass type"?

195 ↗(On Diff #89134)

Value *V ? It's not necessarily an operator -- could be a constant or something.

201 ↗(On Diff #89134)

Should we assert that V has the same width as getSlowType()? If its type is shorter I think it works out, but if it's longer, does it still work?

285 ↗(On Diff #89134)

Would actually prefer not to initialize this variable to null -- this way we'll get a compile warning if it's not initialized in both branches of the if below.

291 ↗(On Diff #89134)

Maybe change this to assert((Op1 || Op2) && "Nothing to check") and move up to the top of the function?

314 ↗(On Diff #89134)

I wonder if getValueRange should get the DL itself -- one less step for us here.

test/CodeGen/X86/bypass-slow-division-64.ll
95 ↗(On Diff #89134)

I should have been looking more closely at these testcases -- this seems really fragile because it depends on our exact register allocation? I know that to some degree the registers are fixed because of the calling convention plus the fixed in/out regs of e.g. idiv. But to the extent that the registers are *not* fixed, I don't think we should be relying on a particular register allocation. That will just make unnecessary pain for the next person to come along and change the register allocator.

105 ↗(On Diff #89134)

Similarly here we're relying on the particular label name LBB4_1. This seems extremely fragile -- the test would break if we added a function above this one?

I think we should either do this right, with FileCheck variable matching (i.e. [[XYZ:some_regex]]), or just CHECK for "je" and ignore the label.

Actually, I recall from the last time I touched this that you can run codegenprepare as an opt pass, so you can write tests over LLVM IR instead of over x86 assembly. That seems like a much better way to do this.

n.bozhenov marked 3 inline comments as done.
n.bozhenov added inline comments.
lib/Transforms/Utils/BypassSlowDivision.cpp
194 ↗(On Diff #89134)

Not sure what you mean. Here we're testing not a type (how many bits it has?) but a value (may there be any non-zero upper bits?).

195 ↗(On Diff #89134)

Actually, Op was supposed to mean an operand here. But now I see it's ambiguous.

201 ↗(On Diff #89134)

I have added an assert that type of the value is wider than BypassType.

As for tests, it was discussed in D28196 that it's better to have them that detailed. I agree with you that they are overly fragile and check a lot of actually insignificant details. On the other hand, this approach also has a number of advantages: the tests guarantee that the compiler generates correct code for the testcase, they are easy to create and update, and test diffs induced by patches are explicit and easy to review.

As for tests, it was discussed in D28196 that it's better to have them that detailed.

I read the discussion in that thread differently.

I see it saying: *if* we are writing llc tests, then maybe we should leave them detailed.

However, we don't have to write llc tests. We can write an opt test, and write regular checks over LLVM IR.

lib/Transforms/Utils/BypassSlowDivision.cpp
194 ↗(On Diff #89134)

Yes, I had a typo, I meant "Check if an integer value fits into our bypass type."

What I wanted was the change s/a smaller type/our bypass type/, because this function does not check whether V fits into an arbitrary smaller type, but rather checks specifically if it fits into the bypass type.

n.bozhenov updated this revision to Diff 89517.Feb 23 2017, 8:55 AM
n.bozhenov added reviewers: RKSimon, spatel.
n.bozhenov marked 3 inline comments as done.

Well, it might make sense to write opt tests here... But we already have 3 files with 18 llc tests for division bypassing. Moreover, while working on this patchset the tests were thoroughly reviewed; it was insisted that the tests be written this way, and I had to make a number of commits specifically to modify them. So I don't think it would be a good idea to throw away all the work that has been done and rewrite the tests from scratch now.

spatel edited edge metadata.Feb 23 2017, 9:37 AM

Well, it might make sense to write opt tests here... But we already have 3 files with 18 llc tests for division bypassing. Moreover, while working on this patchset the tests were thoroughly reviewed; it was insisted that the tests be written this way, and I had to make a number of commits specifically to modify them. So I don't think it would be a good idea to throw away all the work that has been done and rewrite the tests from scratch now.

First, a note about x86 codegen testing: it's correct that the FileCheck script often leaves registers hard-coded and includes labels and other gunk like 'kill' comments. Yes, this makes the tests more fragile, but in practice there's been little to complain about vs. the benefits of tighter checks and ease of updating via script. We could certainly improve the script, but there hasn't been much motivation for that AFAIK.

There are good arguments for both continuing as x86 tests and including cleaner IR tests for CGP. When in doubt, do both? :)
Note that we have a script for generating opt FileCheck lines too, so this is easy. I strongly recommend using it if you add IR tests, for the same reasons that we prefer to script the x86 codegen checks.

Here's what that opt FileCheck script output looks like for the tests added in this patch (without applying this patch). So I would add these tests in a pre-commit, apply this patch, run the script again, and we just show the IR diffs when this patch lands:

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -codegenprepare -S -mtriple=x86_64-unknown-unknown    | FileCheck %s

define i64 @Test_no_bypassing(i32 %a, i64 %b) nounwind {
; CHECK-LABEL: @Test_no_bypassing(
; CHECK-NEXT:    [[A_1:%.*]] = zext i32 [[A:%.*]] to i64
; CHECK-NEXT:    [[A_2:%.*]] = sub i64 -1, [[A_1]]
; CHECK-NEXT:    [[RES:%.*]] = srem i64 [[A_2]], [[B:%.*]]
; CHECK-NEXT:    ret i64 [[RES]]
;
  %a.1 = zext i32 %a to i64
  ; %a.2 is always negative so the division cannot be bypassed.
  %a.2 = sub i64 -1, %a.1
  %res = srem i64 %a.2, %b
  ret i64 %res
}

; No OR instruction is needed if one of the operands (divisor) is known
; to fit into 32 bits.
define i64 @Test_check_one_operand(i64 %a, i32 %b) nounwind {
; CHECK-LABEL: @Test_check_one_operand(
; CHECK-NEXT:    [[B_1:%.*]] = zext i32 [[B:%.*]] to i64
; CHECK-NEXT:    [[RES:%.*]] = sdiv i64 [[A:%.*]], [[B_1]]
; CHECK-NEXT:    ret i64 [[RES]]
;
  %b.1 = zext i32 %b to i64
  %res = sdiv i64 %a, %b.1
  ret i64 %res
}

; If both operands are known to fit into 32 bits, then replace the division
; in-place without CFG modification.
define i64 @Test_check_none(i64 %a, i32 %b) nounwind {
; CHECK-LABEL: @Test_check_none(
; CHECK-NEXT:    [[A_1:%.*]] = and i64 [[A:%.*]], 4294967295
; CHECK-NEXT:    [[B_1:%.*]] = zext i32 [[B:%.*]] to i64
; CHECK-NEXT:    [[RES:%.*]] = udiv i64 [[A_1]], [[B_1]]
; CHECK-NEXT:    ret i64 [[RES]]
;
  %a.1 = and i64 %a, 4294967295
  %b.1 = zext i32 %b to i64
  %res = udiv i64 %a.1, %b.1
  ret i64 %res
}

; In case of unsigned long division with a short dividend,
; the long division is not needed any more.
define i64 @Test_special_case(i32 %a, i64 %b) nounwind {
; CHECK-LABEL: @Test_special_case(
; CHECK-NEXT:    [[A_1:%.*]] = zext i32 [[A:%.*]] to i64
; CHECK-NEXT:    [[DIV:%.*]] = udiv i64 [[A_1]], [[B:%.*]]
; CHECK-NEXT:    [[REM:%.*]] = urem i64 [[A_1]], [[B]]
; CHECK-NEXT:    [[RES:%.*]] = add i64 [[DIV]], [[REM]]
; CHECK-NEXT:    ret i64 [[RES]]
;
  %a.1 = zext i32 %a to i64
  %div = udiv i64 %a.1, %b
  %rem = urem i64 %a.1, %b
  %res = add i64 %div, %rem
  ret i64 %res
}

I still do not understand what we are trying to accomplish by writing x86-specific tests for this at all.

This is a 99% target-generic LLVM optimization. The only target-specific part of this transformation is the bypass width. Everything else is totally generic and has absolutely nothing to do with the target ISA.

I can understand having a few tests to check that the bypass widths are set as we expect. But otherwise, it seems to me that we should test this exactly like we test every other target-independent pass in LLVM, and that's by checking LLVM IR and not x86 assembly.

I understand it can be frustrating to throw away work you've done. In this case, however, the argument seems clear-cut. We want to test using LLVM IR where possible, because that's the one language everyone in the LLVM community is assumed to know, and because it's the language that most closely reflects the transformations being made here. Otherwise when these tests fail, someone who doesn't necessarily understand x86 assembly is going to have to read this code and understand whether the failure is meaningful. This is a burden we should avoid where possible, and a burden that we do, by and large, avoid in LLVM as a matter of convention, if not policy.

I still do not understand what we are trying to accomplish by writing x86-specific tests for this at all.

This is a 99% target-generic LLVM optimization. The only target-specific part of this transformation is the bypass width. Everything else is totally generic and has absolutely nothing to do with the target ISA.

Since this is largely target-independent (which jibes with the fact that no existing tests are changing), then yes, I agree that IR tests are better.

For the other patch, where we're running opt and llc in one RUN line - that's definitely not what we want to see in a regression test. If there are multiple transforms happening, there should be multiple tests to check each step in the process.

That's indeed somewhat frustrating... As I understand, now you all believe it's better to drop all the tests from test/CodeGen/X86/bypass-slow-division-32.ll and test/CodeGen/X86/bypass-slow-division-64.ll and replace them with a few target-independent tests.

However, I don't think that dropping the tests is a good idea. Target-independent tests would check that the transformation itself is performed correctly. That's good but not enough. The existing tests check that the compiler indeed generates good code for integer division, which is the whole purpose of the transformation, and this cannot be checked with opt tests. For instance, D28198 wouldn't have been possible without such detailed X86-specific llc tests. Only because of those tests was the inefficiency introduced with D28196 detected immediately and fixed right away.

Moreover, I'm not sure it is at all possible to write correct opt tests for this transformation. The bypassed types are registered only if TM.getOptLevel() >= CodeGenOpt::Default. However, when running opt -codegenprepare, TM.getOptLevel() equals CodeGenOpt::None. Running opt -O2 -codegenprepare is obviously not an option, because it runs all the middle-end optimizations.

However, I don't think that dropping the tests is a good idea. Target-independent tests would check that the transformation itself is performed correctly. That's good but not enough. The existing tests check that the compiler indeed generates a good code for integer division

In LLVM we canonically would write two tests for this. First, we would write an opt check for the transformation. Then, we would write an llc test to check that divs are lowered correctly.

Since you already have IR inputs that exercise the pass, and you have a script to generate IR tests from them, I don't think this should actually involve a lot of work. Indeed, it looks like Sanjay above already converted the tests here from llc to opt tests using a script.

I suspect there are already plenty of llc tests in the tree that check that x86 divs are lowered correctly. But you certainly could look and add some if there are gaps.

For instance, D28198 wouldn't be possible if there were no such detailed X86-specific llc tests.

I am not saying we should not have x86-specific tests. That patch is for an x86-specific lowering, so it's totally appropriate that we have an llc test for it.

I'm also not saying that we should have no x86-specific tests for this pass. Just that most of them should be generic.

Please understand why I am not willing to compromise on this. I (and others in the LLVM community) may be on the hook to maintain these tests. I know almost nothing about the x86 ISA -- I work on GPUs. I do not want to be on the hook to maintain large tests with tons of x86 assembly that I don't understand. Moreover, if I am on the hook for this, I am not going to do a good job of it. (For similar reasons, I wouldn't ask you to be on the hook to maintain tests with tons of GPU assembly.) LLVM IR is the one language that everyone on this project is expected to understand. So where possible (and I understand it's not always possible), we should write tests that use this language.

Moreover, I'm not sure if it is at all possible to write correct opt tests for this transformation. The bypassed types are registered only if TM.getOptLevel() >= CodeGenOpt::Default. However, when running opt -codegenprepare the TM.getOptLevel() equals CodeGenOpt::None. Running opt -O2 -codegenprepare, obviously, is not an option because it runs all the optimizations in the middle-end.

One option would be to target nvptx -- unlike x86, we unconditionally register the bypass types. (There's no GPU-specific code in these tests, since they're just opt tests.)

See e.g. test/Transforms/CodeGenPrepare/NVPTX/bypass-slow-div.ll.

n.bozhenov updated this revision to Diff 89816.Feb 26 2017, 1:12 PM

Ok. I see your point. Indeed, for example, there's no need to run the whole X86 code generator to check that division bypassing is disabled when one of the operands doesn't fit into BypassType. So I have moved the tests introduced by this patch into Transforms/CodeGenPrepare/NVPTX/bypass-slow-div-special-cases.ll. However, I still believe we should keep a few existing tests in CodeGen/X86/bypass-slow-division-64.ll to check that when bypassing fires, it produces good X86 code.

jlebar accepted this revision.Feb 27 2017, 12:01 PM

sgtm, thanks.

This revision is now accepted and ready to land.Feb 27 2017, 12:01 PM
This revision was automatically updated to reflect the committed changes.