This is an archive of the discontinued LLVM Phabricator instance.

[BypassSlowDivision] Use ValueTracking to simplify run-time checks
ClosedPublic

Authored by n.bozhenov on Feb 13 2017, 9:55 AM.

Details

Summary

ValueTracking is used for more thorough analysis of operands. Based on the analysis, either run-time checks can be simplified (e.g. check only one operand instead of two) or the transformation can be avoided. For example, it is quite often the case that a divisor is promoted from a shorter type and run-time checks for it are redundant.

With additional compile-time analysis of values, two special cases naturally arise and are addressed by the patch:

  1. Both operands are known to be short enough. Then the long division can simply be replaced with a short one, without any CFG modification.
  2. If the division is unsigned and the dividend is known to be short, then the long division is not needed at all: if the divisor is too big for a short division, the quotient is obviously zero (and the remainder is equal to the dividend). More precisely, the division is not needed whenever (divisor > dividend).
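The second special case can be sketched as follows. This is a hypothetical C++ illustration of the logic (not the pass's actual code), assuming a 64-bit "long" type and a 32-bit "short" type:

```cpp
#include <cstdint>
#include <utility>

// Sketch of the unsigned special case: the dividend is known to fit in
// 32 bits. If the 64-bit divisor is larger than the dividend, the
// quotient is zero and the remainder equals the dividend, so no
// division is needed at all. Otherwise the divisor also fits in 32
// bits, and a short division suffices.
std::pair<uint64_t, uint64_t> divRemShortDividend(uint32_t dividend,
                                                  uint64_t divisor) {
  if (divisor > dividend)
    return {0, dividend}; // quotient 0, remainder == dividend
  uint32_t d = static_cast<uint32_t>(divisor);
  return {dividend / d, dividend % d};
}
```

Either way, the expensive 64-bit divide instruction is avoided.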

Basically, this is D28199 rebased onto D29896.

Diff Detail

Repository
rL LLVM

Event Timeline

n.bozhenov updated this revision to Diff 88555.Feb 15 2017, 9:22 AM

Rebased onto a newer version of D29896.

Yet another rebase.

n.bozhenov updated this revision to Diff 89066.Feb 19 2017, 8:05 AM
jlebar edited edge metadata.Feb 19 2017, 10:17 AM

When we see x/y, in addition to checking whether x and y are known "long", should we also check whether x | y is known-long? This way if we saw something like:

if ((x | y) & 0xffffffff00000000) return x / y;
else return static_cast<int32_t>(x) / static_cast<int32_t>(y);

we wouldn't re-optimize this code.
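For context, the run-time check the pass emits has this single-OR shape. A hedged C++ sketch (an assumed equivalent of the emitted IR, not the pass's exact output):

```cpp
#include <cstdint>

// The pass ORs the operands once and tests the upper half, instead of
// testing each operand separately: if no high bit is set in (x | y),
// both operands fit into 32 bits and the fast path is safe.
uint64_t udivWithBypass(uint64_t x, uint64_t y) {
  if (((x | y) >> 32) == 0) // both operands fit into 32 bits
    return static_cast<uint32_t>(x) / static_cast<uint32_t>(y);
  return x / y; // slow 64-bit division
}
```

This is why hand-written source like the snippet above ends up structurally identical to what the transformation would produce.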

When we see x/y, in addition to checking whether x and y are known "long", should we also check whether x | y is known-long? This way if we saw something like:

if ((x | y) & 0xffffffff00000000) return x / y;
else return static_cast<int32_t>(x) / static_cast<int32_t>(y);

we wouldn't re-optimize this code.

I am afraid that ValueTracking doesn't work this way. It doesn't take into account the point where we analyze a value. So, unfortunately, it cannot yield different results on different if-branches for the same values.

In your example, if both branches may be taken, then (x|y) is neither known-long nor known-short. ValueTracking can determine that (x|y) is known-long only when the else-branch of the if-statement is statically known to be dead.

Do you think it would be too conservative for us to say: We won't optimize a long div if you do a short div with the same operands anywhere in the same function?

Do you think it would be too conservative for us to say: We won't optimize a long div if you do a short div with the same operands anywhere in the same function?

That's an interesting heuristic, but I believe it's beyond the scope of this particular patch. This patch just takes advantage of ValueTracking and simplifies runtime checks where possible.

Do you think it would be too conservative for us to say: We won't optimize a long div if you do a short div with the same operands anywhere in the same function?

That's an interesting heuristic, but I believe it's beyond the scope of this particular patch. This patch just takes advantage of ValueTracking and simplifies runtime checks where possible.

Sure. I would like not to lose track of this idea, though.

jlebar added inline comments.Feb 21 2017, 9:13 AM
lib/Transforms/Utils/BypassSlowDivision.cpp
194 ↗(On Diff #89134)

Maybe, "check if an integer type fits into our bypass type"?

195 ↗(On Diff #89134)

Value *V ? It's not necessarily an operator -- could be a constant or something.

201 ↗(On Diff #89134)

Should we assert that V has the same width as getSlowType()? If its type is shorter I think it works out, but if it's longer, does it still work?

285 ↗(On Diff #89134)

Would actually prefer not to initialize this variable to null -- this way we'll get a compile warning if it's not initialized in both branches of the if below.

291 ↗(On Diff #89134)

Maybe change this to assert((Op1 || Op2) && "Nothing to check") and move up to the top of the function?

314 ↗(On Diff #89134)

I wonder if getValueRange should get the DL itself -- one less step for us here.

test/CodeGen/X86/bypass-slow-division-64.ll
95 ↗(On Diff #89134)

I should have been looking more closely at these testcases -- this seems really fragile because it depends on our exact register allocation? I know that to some degree the registers are fixed because of the calling convention plus the fixed in/out regs of e.g. idiv. But to the extent that the registers are *not* fixed, I don't think we should be relying on a particular register allocation. That will just make unnecessary pain for the next person to come along and change the register allocator.

105 ↗(On Diff #89134)

Similarly here we're relying on the particular label name LBB4_1. This seems extremely fragile -- the test would break if we added a function above this one?

I think we should either do this right, with FileCheck variable matching (i.e. [[XYZ:some_regex]]), or just CHECK for "je" and ignore the label.

Actually, I recall from the last time I touched this that you can run codegenprepare as an opt pass, so you can write tests over LLVM IR instead of over x86 assembly. That seems like a much better way to do this.

n.bozhenov marked 3 inline comments as done.
n.bozhenov added inline comments.
lib/Transforms/Utils/BypassSlowDivision.cpp
194 ↗(On Diff #89134)

Not sure what you mean. Here we're testing not a type (how many bits it has?) but a value (may there be any non-zero upper bits?).

195 ↗(On Diff #89134)

Actually, Op was supposed to mean an operand here. But now I see it's ambiguous.

201 ↗(On Diff #89134)

I have added an assert that type of the value is wider than BypassType.

As for tests, it was discussed in D28196 that it's better to have them that detailed. I agree with you that they are overly fragile and check a lot of actually insignificant details. On the other hand, this approach also has a number of advantages: the tests guarantee that the compiler generates correct code for the testcase, they are easy to create and update, and test diffs induced by patches are explicit and easy to review.

As for tests, it was discussed in D28196 that it's better to have them that detailed.

I read the discussion in that thread differently.

I see it saying: *if* we are writing llc tests, then maybe we should leave them detailed.

However, we don't have to write llc tests. We can write an opt test, and write regular checks over LLVM IR.

lib/Transforms/Utils/BypassSlowDivision.cpp
194 ↗(On Diff #89134)

Yes, I had a typo, I meant "Check if an integer value fits into our bypass type."

What I wanted was the change s/a smaller type/our bypass type/, because this function does not check whether V fits into an arbitrary smaller type, but rather checks specifically if it fits into the bypass type.

n.bozhenov updated this revision to Diff 89517.Feb 23 2017, 8:55 AM
n.bozhenov added reviewers: RKSimon, spatel.
n.bozhenov marked 3 inline comments as done.

Well, it might make sense to write opt tests here... But we already have 3 files with 18 llc tests for division bypassing. Moreover, while working on this patchset the tests were thoroughly reviewed; it was insisted that the tests be written this way, and I had to make a number of commits specifically to modify them. So I don't think it would be a good idea to throw away all the work that has been done and rewrite the tests from scratch now.

spatel edited edge metadata.Feb 23 2017, 9:37 AM

Well, it might make sense to write opt tests here... But we already have 3 files with 18 llc tests for division bypassing. Moreover, while working on this patchset the tests were thoroughly reviewed; it was insisted that the tests be written this way, and I had to make a number of commits specifically to modify them. So I don't think it would be a good idea to throw away all the work that has been done and rewrite the tests from scratch now.

First, a note about x86 codegen testing: it's correct that the FileCheck script often leaves registers hard-coded and includes labels and other gunk like 'kill' comments. Yes, this makes the tests more fragile, but in practice there's been little to complain about vs. the benefits of tighter checks and ease of updating via script. We could certainly improve the script, but there hasn't been much motivation for that AFAIK.

There are good arguments for both continuing as x86 tests and including cleaner IR tests for CGP. When in doubt, do both? :)
Note that we have a script for generating opt FileCheck lines too, so this is easy. I strongly recommend using it if you add IR tests, for the same reasons that we prefer to script the x86 codegen checks.

Here's what that opt FileCheck script output looks like for the tests added in this patch (without applying this patch). So I would add these tests in a pre-commit, apply this patch, run the script again, and we just show the IR diffs when this patch lands:

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -codegenprepare -S -mtriple=x86_64-unknown-unknown    | FileCheck %s

define i64 @Test_no_bypassing(i32 %a, i64 %b) nounwind {
; CHECK-LABEL: @Test_no_bypassing(
; CHECK-NEXT:    [[A_1:%.*]] = zext i32 [[A:%.*]] to i64
; CHECK-NEXT:    [[A_2:%.*]] = sub i64 -1, [[A_1]]
; CHECK-NEXT:    [[RES:%.*]] = srem i64 [[A_2]], [[B:%.*]]
; CHECK-NEXT:    ret i64 [[RES]]
;
  %a.1 = zext i32 %a to i64
  ; %a.2 is always negative so the division cannot be bypassed.
  %a.2 = sub i64 -1, %a.1
  %res = srem i64 %a.2, %b
  ret i64 %res
}

; No OR instruction is needed if one of the operands (divisor) is known
; to fit into 32 bits.
define i64 @Test_check_one_operand(i64 %a, i32 %b) nounwind {
; CHECK-LABEL: @Test_check_one_operand(
; CHECK-NEXT:    [[B_1:%.*]] = zext i32 [[B:%.*]] to i64
; CHECK-NEXT:    [[RES:%.*]] = sdiv i64 [[A:%.*]], [[B_1]]
; CHECK-NEXT:    ret i64 [[RES]]
;
  %b.1 = zext i32 %b to i64
  %res = sdiv i64 %a, %b.1
  ret i64 %res
}

; If both operands are known to fit into 32 bits, then replace the division
; in-place without CFG modification.
define i64 @Test_check_none(i64 %a, i32 %b) nounwind {
; CHECK-LABEL: @Test_check_none(
; CHECK-NEXT:    [[A_1:%.*]] = and i64 [[A:%.*]], 4294967295
; CHECK-NEXT:    [[B_1:%.*]] = zext i32 [[B:%.*]] to i64
; CHECK-NEXT:    [[RES:%.*]] = udiv i64 [[A_1]], [[B_1]]
; CHECK-NEXT:    ret i64 [[RES]]
;
  %a.1 = and i64 %a, 4294967295
  %b.1 = zext i32 %b to i64
  %res = udiv i64 %a.1, %b.1
  ret i64 %res
}

; In case of unsigned long division with a short dividend,
; the long division is not needed any more.
define i64 @Test_special_case(i32 %a, i64 %b) nounwind {
; CHECK-LABEL: @Test_special_case(
; CHECK-NEXT:    [[A_1:%.*]] = zext i32 [[A:%.*]] to i64
; CHECK-NEXT:    [[DIV:%.*]] = udiv i64 [[A_1]], [[B:%.*]]
; CHECK-NEXT:    [[REM:%.*]] = urem i64 [[A_1]], [[B]]
; CHECK-NEXT:    [[RES:%.*]] = add i64 [[DIV]], [[REM]]
; CHECK-NEXT:    ret i64 [[RES]]
;
  %a.1 = zext i32 %a to i64
  %div = udiv i64 %a.1, %b
  %rem = urem i64 %a.1, %b
  %res = add i64 %div, %rem
  ret i64 %res
}

I still do not understand what we are trying to accomplish by writing x86-specific tests for this at all.

This is a 99% target-generic LLVM optimization. The only target-specific part of this transformation is the bypass width. Everything else is totally generic and has absolutely nothing to do with the target ISA.

I can understand having a few tests to check that the bypass widths are set as we expect. But otherwise, it seems to me that we should test this exactly like we test every other target-independent pass in LLVM, and that's by checking LLVM IR and not x86 assembly.

I understand it can be frustrating to throw away work you've done. In this case, however, the argument seems clear-cut. We want to test using LLVM IR where possible, because that's the one language everyone in the LLVM community is assumed to know, and because it's the language that most closely reflects the transformations being made here. Otherwise when these tests fail, someone who doesn't necessarily understand x86 assembly is going to have to read this code and understand whether the failure is meaningful. This is a burden we should avoid where possible, and a burden that we do, by and large, avoid in LLVM as a matter of convention, if not policy.

I still do not understand what we are trying to accomplish by writing x86-specific tests for this at all.

This is a 99% target-generic LLVM optimization. The only target-specific part of this transformation is the bypass width. Everything else is totally generic and has absolutely nothing to do with the target ISA.

Since this is largely target-independent (which jibes with the fact that no existing tests are changing), then yes, I agree that IR tests are better.

For the other patch, where we're running opt and llc in one RUN line - that's definitely not what we want to see in a regression test. If there are multiple transforms happening, there should be multiple tests to check each step in the process.

That's indeed somewhat frustrating... As I understand, now you all believe it's better to drop all the tests from test/CodeGen/X86/bypass-slow-division-32.ll and test/CodeGen/X86/bypass-slow-division-64.ll and replace them with a few target-independent tests.

However, I don't think that dropping the tests is a good idea. Target-independent tests would check that the transformation itself is performed correctly. That's good but not enough. The existing tests check that the compiler indeed generates good code for integer division, which is the whole purpose of the transformation, and this cannot be checked with opt tests. For instance, D28198 wouldn't have been possible without such detailed X86-specific llc tests. Only because of those tests was the inefficiency introduced with D28196 detected immediately and fixed right away.

Moreover, I'm not sure it is at all possible to write correct opt tests for this transformation. The bypassed types are registered only if TM.getOptLevel() >= CodeGenOpt::Default. However, when running opt -codegenprepare, TM.getOptLevel() equals CodeGenOpt::None. Running opt -O2 -codegenprepare is obviously not an option, because it runs all the middle-end optimizations.

However, I don't think that dropping the tests is a good idea. Target-independent tests would check that the transformation itself is performed correctly. That's good but not enough. The existing tests check that the compiler indeed generates a good code for integer division

In LLVM we canonically would write two tests for this. First, we would write an opt check for the transformation. Then, we would write an llc test to check that divs are lowered correctly.

Since you already have IR inputs that exercise the pass, and you have a script to generate IR tests from them, I don't think this should actually involve a lot of work. Indeed, it looks like Sanjay above already converted the tests here from llc to opt tests using a script.

I suspect there are already plenty of llc tests in the tree that check that x86 divs are lowered correctly. But you certainly could look and add some if there are gaps.

For instance, D28198 wouldn't be possible if there were no such detailed X86-specific llc tests.

I am not saying we should not have x86-specific tests. That patch is for an x86-specific lowering, so it's totally appropriate that we have an llc test for it.

I'm also not saying that we should have no x86-specific tests for this pass. Just that most of them should be generic.

Please understand why I am not willing to compromise on this. I (and others in the LLVM community) may be on the hook to maintain these tests. I know almost nothing about the x86 ISA -- I work on GPUs. I do not want to be on the hook to maintain large tests with tons of x86 assembly that I don't understand. Moreover, if I am on the hook for this, I am not going to do a good job of it. (For similar reasons, I wouldn't ask you to be on the hook to maintain tests with tons of GPU assembly.) LLVM IR is the one language that everyone on this project is expected to understand. So where possible (and I understand it's not always possible), we should write tests that use this language.

Moreover, I'm not sure if it is at all possible to write correct opt tests for this transformation. The bypassed types are registered only if TM.getOptLevel() >= CodeGenOpt::Default. However, when running opt -codegenprepare the TM.getOptLevel() equals CodeGenOpt::None. Running opt -O2 -codegenprepare, obviously, is not an option because it runs all the optimizations in the middle-end.

One option would be to target nvptx -- unlike x86, we unconditionally register the bypass types. (There's no GPU-specific code in these tests, since they're just opt tests.)

See e.g. test/Transforms/CodeGenPrepare/NVPTX/bypass-slow-div.ll.

n.bozhenov updated this revision to Diff 89816.Feb 26 2017, 1:12 PM

Ok. I see your point. Indeed, for example, there's no need to run the whole X86 code generator to check that division bypassing is disabled when one of the operands doesn't fit into BypassType. So I have moved the tests introduced by this patch into Transforms/CodeGenPrepare/NVPTX/bypass-slow-div-special-cases.ll. However, I still believe we should keep a few existing tests in CodeGen/X86/bypass-slow-division-64.ll to check that when bypassing fires, it produces good X86 code.

jlebar accepted this revision.Feb 27 2017, 12:01 PM

sgtm, thanks.

This revision is now accepted and ready to land.Feb 27 2017, 12:01 PM
This revision was automatically updated to reflect the committed changes.