
Infer known bits from dominating conditions

Authored by reames on Feb 17 2015, 3:59 PM.



This patch adds limited support in ValueTracking for inferring known bits of a value from conditional expressions which must be true to reach the instruction we're trying to optimize. At this time, the feature is off by default. Once landed, I'm hoping for feedback from others on both profitability and compile time impact.

Forms of conditional value propagation have been tried in LLVM before and have failed due to compile-time problems. In an attempt to sidestep that, this patch only considers conditions where the edge leaving the branch dominates the context instruction. Rather than walking back along the dominator tree for every value/context-instruction pair, we only consider uses of the value itself when searching for useful conditions. The search is also limited to the first level of the known-bits recursion.

Even with these restrictions, it handles many interesting cases:

  • Early exits from functions
  • Early exits from loops (for context instructions in the loop and after the check)
  • Conditions which control entry into loops, including multi-version loops (such as those produced during vectorization, IRCE, loop unswitch, etc..)

Possible applications include optimizing based on information provided by constructs such as preconditions, assumptions, null checks, and range checks.

When compiling sqlite3.c (using a Release+Asserts build), I get the following timings:
-instcombine -value-tracking-dom-conditions=1: 0.788-0.793s
-instcombine -value-tracking-dom-conditions=0: 0.771-0.773s
-O3 -value-tracking-dom-conditions=1: 14.993-15.391s
-O3 -value-tracking-dom-conditions=0: 14.961-14.984s

There is a slight slowdown in instcombine, but the difference in the O3 times is essentially noise. I measured an outlier (which I discarded in the above) of 16.408 for "-O3 -value-tracking-dom-conditions=0".

Diff Detail


Event Timeline

reames updated this revision to Diff 20111.Feb 17 2015, 3:59 PM
reames retitled this revision from to Infer known bits from dominating conditions.
reames updated this object.
reames edited the test plan for this revision. (Show Details)
reames added reviewers: nlewycky, hfinkel.
reames added a subscriber: Unknown Object (MLST).

Here are the O3 numbers for a Release build (without asserts):
-O3 -value-tracking-dom-conditions=1: 11.045-11.319
-O3 -value-tracking-dom-conditions=0: 10.989-11.300

Taking the two extremes, that's about a 300 ms difference at worst. That's about 2.7%. Given the (informally measured) variance between runs, that's within the noise.

In terms of effectiveness, this allows an extra 500 instcombines (out of ~50k). The resulting IR is slightly larger, though I can't easily determine why. The biggest difference is that a lot more nsw and nuw markers are being added. It appears that instcombine is now handling some cases that LVI and GVN would have previously gotten, since the stats on those drop slightly. (This has the potential of making those passes more powerful by avoiding pass-ordering issues.) I do see both the loop vectorizer and memcpyopt kicking in one extra time each. I have not measured execution time of sqlite.

hfinkel edited edge metadata.Feb 19 2015, 5:47 AM

Thanks for working on this!

I'd really like to see a compile-time comparison with an implementation that just walks up the dominator tree. It seems like that technique would be more powerful (because it could catch conditions where the variable is not a direct argument to the icmp). It might be manageable (with a cutoff for really deep trees, and noting that we can stop searching after we reach the definition of the variable itself).

Also, full-context patches please:

42 ↗(On Diff #20111)

Line too long?

498 ↗(On Diff #20111)

Please make this cutoff a command-line parameter.

510 ↗(On Diff #20111)

Why do you care that the cmp has only one use? Why not look at all of the branches that use the cmp result?

541 ↗(On Diff #20111)

Indeed, can't you share all of that logic that is used by the assume implementation?

Although to really do that you'd have to start with the dominating branches, walking up?

sanjoy added a subscriber: sanjoy.Feb 25 2015, 6:37 PM

Comments inline.

498 ↗(On Diff #20111)

I remember hearing somewhere that walking the use-list in LLVM is cache-miss'y. If (a big if) that is true, then for variables with a large number of uses you're going to spend a lot of time just skimming through the use list here. To that end, does it make sense to have this cutoff in the loop where you're iterating through the use list?

508 ↗(On Diff #20111)

Nit: should be ICmpInst *Cmp. In general, I'd suggest just running this patch through clang-format before checking in.

529 ↗(On Diff #20111)

I don't think this will work if BI->getSuccessor(0) == BI->getSuccessor(1). You need to check if the false branch of BI actually branches to something other than BB0.

531 ↗(On Diff #20111)

The only other place that uses Q.DT checks to see if it isn't nullptr. Is there reason to believe that we actually have the dominator tree?

533 ↗(On Diff #20111)

Actually, it might be cleaner and more powerful to just use the BasicBlockEdge version of dominates here. dominates will also handle the case where BB0 has multiple preds but BB0 dominates all of them.

574 ↗(On Diff #20111)

Nit: should this be 'high bits in RHS'?

Responding to Sanjoy's review comments.

498 ↗(On Diff #20111)

This hasn't been a measurable impact. I'd be a bit leery of affecting the code generation of the loop as well. Besides, once something is in cache (from the first check), accessing it again (in the loop) is cheap.

510 ↗(On Diff #20111)

Compile time. I want to avoid walking potentially long use lists. I'm matching specifically the common pattern of a branch controlled by a comparison right before it. See my comment to Hal about possible future extensions.

529 ↗(On Diff #20111)

You're right. I need to account for that case. Thanks for finding this!

531 ↗(On Diff #20111)

It's checked at the top of the function. We do nothing if we don't have a dom tree.

533 ↗(On Diff #20111)

This is a good idea. Hadn't thought of that interface. Will definitely have to look into that.

574 ↗(On Diff #20111)

Will fix.

To be clear, is this a request for me to evaluate this before submission? I really don't see the need given that a) the current implementation is fast, and b) this is an experimental flag.

Yes, it is. My feeling is that this is not particularly complicated, and if it is, then there's something I'm not considering properly.

To be clear, this is a very important missing capability, and I want to make sure that we don't prematurely optimize here: inconsistencies here will be *very* user visible, and I'm concerned that with your approach, small local code changes will cause drastic changes in the output by turning this logic on and off. If a more general approach is not feasible, then so be it, I'll take what we can get, but I want to make this decision quantitatively, and I don't want to start with a design that will need to be rewritten to make forward progress.

> Why do you care that the cmp has only one use? Why not look at all of the branches that use the cmp result?
We could, and may someday. Doing so involves a potential increase in
compile time. I'm deliberately going for minimal compile time impact,
even at the loss of some effectiveness. I want to be able to turn this
on by default. Most of my actual motivating cases are really simple.

Please measure this. Unless this is really needed, please take it out. Users can change code to introduce extra uses of comparison results in many ways, and doing so should not change the optimizer's ability to make use of the information.

I would strongly prefer going with the fastest implementation now and
evaluating extensions at a later time. If you think it matters and is
cheap enough, feel free to contribute a patch. :)

But I don't know that it is the fastest implementation, or how much that matters. Fast matters, but it is not the only thing that matters. Using a DT walk with a cutoff might actually be faster. Stability of optimization results to small source changes is also really important. Consistency with other optimizer capabilities is also really important.

Let's work together on this. Yes, if you don't want to do this, I'll get to it in a couple of months.

I think all of the issues you raised should be evaluated and addressed. I'm even willing to do most of the work to do so. However, I would really strongly prefer doing that *in tree*. You have access to workloads I don't. So does Nick. So do other interested parties. What I want is to check in a shared implementation (or two, or three), and give each of us a chance to measure on our workloads. Running with a set of flags is generally much easier than building a custom build with a potentially tweaked patch and keeping track of which patch generated which set of results.

Okay. However, being asked to try an alternate implementation technique is part of the code review process. I think there are very good reasons to prefer this alternate strategy if it is practical...

Now, on the technical side again, I think we're coming from two different perspectives. My overwhelming concern is compile time. We've tried various attempts at value property propagation in the past and all of them have failed due to that issue. What I'm aiming to do is find *something* which can at least catch the simple cases and can be turned on without impacting performance. I'd like to push that as far as possible, but that's a second order concern. You seem to be focused more on the effectiveness side, with a second order interest in performance. Given past history, I don't think that's the right approach to take here.

This seems a bit unfair. I am concerned with compile time, but...

  1. ValueTracking is slow currently, and the way it is used by InstCombine (etc.) makes it very expensive. Really, we need a way of caching the results (instead of recomputing things each time each instruction is visited, which can happen many times as each instruction's users are processed). To really fix the compile-time issues here, we need to fix this.
  2. Should we settle for a less-general analysis to work around (1)? It obviously depends on the price. But let's not compound our problems unnecessarily. Walking all the way up the DT is likely to be too expensive, but maybe searching N dominators (for some small N) is not too expensive. This is the implementation we should desire, however, because it will allow us to do the analysis in a general way, and share that code with the code that handles @llvm.assume. Then, once (1) is fixed, we might be able to raise N somewhat. If we cannot do this, even for a small N, then we'll have to settle for doing something even cheaper, at the price of effectiveness, consistency, and future development time. We might choose to do this, but let's not go there without evidence that it is necessary. Otherwise, if there is some value of N that hits your use cases and meets your compile-time impact targets, let's begin as we mean to go on.

Also, can you give an example of a code change that concerns you? At least on the surface this seems pretty stable to me. Early exits from functions and loops give you information the optimizer can exploit. Anything else might, but might not. That seems entirely understandable.

I'm actually concerned about patterns like this:

$ cat /tmp/c.c 
void bar(); // pretend this can be inlined.
void har(); // and/or this too.

void foo1(int a) {
  if (a < 5)
    bar();
  else
    har();
}

bool foo2(int a) {
  if (a < 5)
    bar();
  else
    har();

  return a < 5;
}

where, as you might expect, in foo2 there is a second use of the comparison. Introducing this second use of the comparison should not affect whether or not ValueTracking can deduce some fact about 'a' when processing the inlined code.

clang++ -O3 -emit-llvm -S -o - /tmp/c.c

target datalayout = "E-m:e-i64:64-n32:64"
target triple = "powerpc64-unknown-linux-gnu"

define void @_Z4foo1i(i32 signext %a) #0 {
entry:
  %cmp = icmp slt i32 %a, 5
  br i1 %cmp, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  tail call void @_Z3barv()
  br label %if.end

if.else:                                          ; preds = %entry
  tail call void @_Z3harv()
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  ret void
}

declare void @_Z3barv() #0
declare void @_Z3harv() #0

define zeroext i1 @_Z4foo2i(i32 signext %a) #0 {
entry:
  %cmp = icmp slt i32 %a, 5
  br i1 %cmp, label %if.then, label %if.else

if.then:                                          ; preds = %entry
  tail call void @_Z3barv()
  br label %if.end

if.else:                                          ; preds = %entry
  tail call void @_Z3harv()
  br label %if.end

if.end:                                           ; preds = %if.else, %if.then
  ret i1 %cmp
}

A user can "do nothing wrong", but by repeating the comparison as part of some other expression, cause known-bit information to change inside otherwise unrelated blocks. This would be, as I'm sure you appreciate, quite unfortunate.

Thanks again!

reames updated this revision to Diff 21168.Mar 3 2015, 7:50 PM
reames edited edge metadata.

Updated to incorporate code comments. Additional feature flags and implementation approaches to follow separately.

reames updated this revision to Diff 21169.Mar 3 2015, 8:12 PM

Look at all uses of the comparison as requested by Hal.

On the sqlite3 example I used before, the difference between the two configurations is indistinguishable. I can quantify on a quieter box later, but on my local machine I don't see much difference.

reames added a comment.Mar 3 2015, 8:31 PM

Hal - preliminary numbers on the dominance-based approach make it look like that might be viable. I need to clean up my hacked implementation and fix at least one bug, but my initial impression is that it appears to be fast enough. I'm a bit surprised by this, but it's what the data says.

Worth discussing is whether we should actually incorporate both. They seem to be good at different cases and maybe we should exploit that. The use based approach is good for grabbing facts far from the use site (i.e. function entry guards), while the dominance approach is likely more powerful since it's not restricted to direct uses, but is probably best limited to a local scope. Thoughts?

> Hal - preliminary numbers on the dominance-based approach make it look like that might be viable. I need to clean up my hacked implementation and fix at least one bug, but my initial impression is that it appears to be fast enough. I'm a bit surprised by this, but it's what the data says.

> Worth discussing is whether we should actually incorporate both. They seem to be good at different cases and maybe we should exploit that. The use based approach is good for grabbing facts far from the use site (i.e. function entry guards), while the dominance approach is likely more powerful since it's not restricted to direct uses, but is probably best limited to a local scope. Thoughts?

Yes, perhaps with a FIXME if there is some other "right" way to fix the problem. I've been thinking of something like LVI that keeps track of known bits. What do you think?

reames updated this revision to Diff 21537.Mar 9 2015, 5:49 PM

Updated to include a cleaned up dom tree based implementation. This version should allow performance comparisons between the two approaches.

sanjoy added a comment.Mar 9 2015, 6:02 PM

With the inline comments addressed, I think this is okay to check in, but I'll defer to someone more familiar with the area for an LGTM.

Optional: you might want to add some negative tests too -- where optimizing based on a dominating condition would be a miscompile.

638 ↗(On Diff #21537)

DominatorTree::dominates asserts BBE.isSingleEdge(), so you should check that before calling dominates.

1051 ↗(On Diff #21537)

Whitespace is a little off (elsewhere too), please run clang-format before checkin.

63 ↗(On Diff #21537)

This is an interesting example -- we should be able to fold the addition completely.

(Not commenting on this change) Can instcombine be taught to do something if (KnownZero | KnownOne).isAllOnesValue() (i.e. we have complete information about a value)?

reames added a comment.Mar 9 2015, 6:46 PM

Before anyone bothers running tests against this, please let me upload one more patch. I took another look and realized that I have a return where I should have a continue or break (for the dominating check). The good news (I guess) is that I introduced this bug when adding features Hal requested, so my original numbers are still good. Everything after that is meaningless, of course. :(

reames updated this revision to Diff 21604.Mar 10 2015, 11:11 AM

Updated to fix the early return bug. Unfortunately, I'm now getting the performance impact I really expected to see all along. Short version: We will need to be very selective about this.

Baseline (sqlite -O3): 16.128-16.805

  • Note the high variance. Also, NEW BASELINE NUMBERS. I have no idea why the baseline changed by 50%. I've rebased on ToT, but nothing else should have changed.

Both Dom & Value based (unlimited): 18.416-18.427 (ouch!)
Value only (unlimited): 17.349-17.492
Dom only (unlimited): 17.462-17.613
Value only (< 20 uses): 16.516-17.107
Dom + Value (< 20 each): 17.512-18.054
Value only (< 4 uses): 16.540-16.927
Dom only (< 4 blocks): 16.991-17.366

My takeaway here is that the dominance-walk-based implementation is more expensive than the value-based one.

@Hal, can you give me a LGTM here? I'd really prefer to work on this incrementally in tree with proper review at each change. In particular, that likely would have caught my return typo issue and saved us all some time.

@sanjoy - will fix up whitespace and add a negative test or two

reames updated this revision to Diff 21611.Mar 10 2015, 11:42 AM

Address Sanjoy's review comments.

hfinkel accepted this revision.Mar 10 2015, 1:35 PM
hfinkel edited edge metadata.

LGTM, thanks!

50 ↗(On Diff #21611)

Extra blank lines (if you want to separate these into sections, please add a comment, don't just leave more blank lines).

154 ↗(On Diff #21611)

Extra blank line.

This revision is now accepted and ready to land.Mar 10 2015, 1:35 PM
This revision was automatically updated to reflect the committed changes.

Sorry for taking so long to get back to you. This has been put on the
back burner for the moment.

To answer your question, I don't believe this line of work will help
you. KnownBits would need to infer that some specific bit is non-zero
to prove that the value in question is non-zero. Unless I'm misreading
your example, you don't know this. You really want something more along
the lines of a constant range analysis pass. We have something like that
in LVI today, but I've given absolutely no thought to how to use that in
alias analysis. Daniel's concerns seem entirely reasonable to me and I
suspect would be prohibitive of any attempt before the pass manager has
been *finally* finished. (Feel free to go bug/harass Chandler!)

To give a quick update on this line of work as a whole:

  • Nick was able to run some internal tests for me. His overall results showed that a) compile time wasn't too badly impacted (2-3% on average), but b) lots of bugs were exposed by enabling this. I haven't had a chance to go stomp all those bugs and would definitely welcome help here.

  • I need to get around to implementing a couple of known performance tweaks and completing the branch checking. Currently the code is somewhat incomplete (but functional).

  • I plan to pursue turning this on by default once the previous two issues are addressed.

  • I've also been looking into other approaches for solving the same issues. I see no reason to only solve this one way and have been trying to get some of the easy cases dealt with. Examples include: