This is an archive of the discontinued LLVM Phabricator instance.

[WIP] Add Caching of Known Bits in InstCombine
Needs ReviewPublic

Authored by hfinkel on Mar 22 2017, 6:33 AM.

Details

Summary

This patch adds a known-bits cache for value tracking and uses it from InstCombine. This works, even though the known-bits computation retains its depth cutoff, because InstCombine uses a (generally) top-down walking strategy, so the known bits of a value's operands are generally cached by the time the known bits of the value itself are requested.
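As a rough illustration of the idea (a toy sketch with hypothetical types, not LLVM's actual ValueTracking API), the walk consults the cache before recursing, so with a top-down visitation order the depth cutoff rarely matters once a value's operands have been visited:

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Toy value DAG: a node is a constant, an unknown leaf, or an AND of two
// operands. This stands in for LLVM IR purely for illustration.
struct Value {
  bool IsConst = false;
  uint32_t ConstVal = 0;
  Value *LHS = nullptr, *RHS = nullptr; // operands when this is an AND
};

struct KnownBits {
  uint32_t Zero = 0, One = 0; // bits known to be 0 / known to be 1
};

static const unsigned MaxDepth = 6; // mirrors ValueTracking's depth cutoff

std::unordered_map<Value *, KnownBits> Cache;

KnownBits computeKnownBits(Value *V, unsigned Depth = 0) {
  auto It = Cache.find(V);
  if (It != Cache.end())
    return It->second; // cache hit: operand was computed earlier in the walk

  KnownBits K; // default: nothing known (unknown leaf, or depth exceeded)
  if (V->IsConst) {
    K.One = V->ConstVal;
    K.Zero = ~V->ConstVal;
  } else if (V->LHS && Depth < MaxDepth) {
    KnownBits L = computeKnownBits(V->LHS, Depth + 1);
    KnownBits R = computeKnownBits(V->RHS, Depth + 1);
    K.One = L.One & R.One;  // a bit of an AND is 1 iff both inputs are 1
    K.Zero = L.Zero | R.Zero; // ...and 0 if either input is 0
  }
  Cache[V] = K; // later, deeper queries hit this entry instead of recursing
  return K;
}
```

Because cached entries are returned before the depth check, a chain of values each computed from already-cached operands effectively gets an unlimited-depth result.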

The advantages of doing this are (potentially):

  1. Limit the expense of computing known bits (which currently does a lot of walking), hopefully making InstCombine cheaper.
  2. Effectively provide a known-bits computation of unlimited depth (thus making the optimizer more powerful).

If an assumption contributes to the known bits, then the result cannot be cached (because the cache is not context dependent, but assumptions are context dependent).

This is a work-in-progress (I initially put this together last November, and have just rebased it in light of current discussions). It has one problem I know about: we'd need to invalidate the cached known bits of a value when we remove its nsw/nuw/etc. flags, and this is currently not done. Removing these flags can change the known bits of that value and of all of its users. This is the cause of at least some of the regression-test failures:
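The missing invalidation step might look something like the following (a hypothetical sketch, not code from the patch): when flags are dropped from an instruction, erase its cached entry and, transitively, the entries of everything that uses it:

```cpp
#include <cassert>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Minimal stand-in for an instruction with a use list.
struct Inst {
  std::vector<Inst *> Users;
};

// value -> cached known-bits (represented here as an opaque unsigned).
std::unordered_map<Inst *, unsigned> KnownBitsCache;

void invalidateKnownBits(Inst *I, std::unordered_set<Inst *> &Visited) {
  if (!Visited.insert(I).second)
    return; // already invalidated (the use graph may share nodes)
  KnownBitsCache.erase(I);
  for (Inst *U : I->Users)
    invalidateKnownBits(U, Visited);
}

// Called at the point where a pass strips nsw/nuw from I.
void dropPoisonFlags(Inst *I) {
  // ...clear the flags on I itself, then invalidate I and its users:
  std::unordered_set<Inst *> Visited;
  invalidateKnownBits(I, Visited);
}
```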

Failing Tests (4):
    LLVM :: Transforms/InstCombine/demand_shrink_nsw.ll
    LLVM :: Transforms/InstCombine/icmp.ll
    LLVM :: Transforms/InstCombine/select.ll
    LLVM :: Transforms/InstCombine/sext.ll

I'm posting this in its current state to make it easier to experiment with the relative costs/benefits of caching to the performance of InstCombine.

Diff Detail

Event Timeline

hfinkel created this revision.Mar 22 2017, 6:33 AM
rnk added a subscriber: rnk.Mar 22 2017, 1:38 PM

We can't just keep passing more analysis pointers throughout our simplification utilities. We might want some wrapper of common, almost globally used analyses that optionally contains DL, TLI, DT, AC, etc.
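A minimal sketch of such a wrapper (the names here are hypothetical; later versions of LLVM did add a SimplifyQuery structure along these lines):

```cpp
#include <cassert>

// Forward declarations standing in for the real LLVM analyses.
struct DataLayout;
struct TargetLibraryInfo;
struct DominatorTree;
struct AssumptionCache;

// One object carrying the commonly threaded analysis pointers,
// each optional (may be null).
struct AnalysisBundle {
  const DataLayout *DL = nullptr;
  const TargetLibraryInfo *TLI = nullptr;
  DominatorTree *DT = nullptr;
  AssumptionCache *AC = nullptr;
};

// Simplification utilities then take one parameter instead of four.
// Placeholder logic: fall back to a trivial check when no DT is available.
bool simplifySomething(int X, const AnalysisBundle &AB) {
  return AB.DT == nullptr ? X == 0 : false;
}
```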

Long term, I agree with Danny that computing known bits on demand doesn't seem to scale. We might want to just do the analysis up front and see if that works better.

mzolotukhin edited edge metadata.Mar 22 2017, 2:34 PM

Thanks for posting this, Hal!

I'll run some experiments with it, but in general I support this approach.

@rnk:

Long term, I agree with Danny that computing known bits on demand doesn't seem to scale. We might want to just do the analysis up front and see if that works better.

The problem with computing everything upfront is that at some point we might want to partially invalidate the results. Recomputing everything from scratch after that sounds inefficient, and if we add caching, we'll end up with something like what's proposed here. To me this looks very similar to what we have in SCEV (a map from expression to the corresponding analysis result, which we can invalidate and recompute whenever it's requested again), and I think SCEV has been working pretty well so far in this respect.
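The SCEV-like pattern described here can be sketched as follows (names are hypothetical): a map supporting partial invalidation, with forgotten entries recomputed lazily on the next request:

```cpp
#include <cassert>
#include <unordered_map>

// A cache from key to analysis result. forget() drops one entry; get()
// recomputes it on demand, analogous to SCEV's forgetValue/getSCEV pair.
template <typename Key, typename Result, typename ComputeFn>
class LazyAnalysisCache {
  std::unordered_map<Key, Result> Map;
  ComputeFn Compute;

public:
  explicit LazyAnalysisCache(ComputeFn C) : Compute(C) {}

  const Result &get(const Key &K) {
    auto It = Map.find(K);
    if (It == Map.end())
      It = Map.emplace(K, Compute(K)).first; // recompute on demand
    return It->second;
  }

  void forget(const Key &K) { Map.erase(K); } // partial invalidation
};
```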

Michael

In D31239#707970, @rnk wrote:

We can't just keep passing more analysis pointers throughout our simplification utilities. We might want some wrapper of common, almost globally used analyses that optionally contains DL, TLI, DT, AC, etc.

We definitely should. Actually, this may well just remove code. computeKnownBits's implementation uses a Query structure to contain these, as does InstructionSimplify. It is only the external interfaces that are unwieldy. We might promote these Query structures somehow and then just make the internal interfaces the external ones.

Thanks for posting this, Hal!

I'll run some experiments with it, but in general I support this approach.

@rnk:

Long term, I agree with Danny that computing known bits on demand doesn't seem to scale. We might want to just do the analysis up front and see if that works better.

The problem with computing everything upfront is that at some point we might want to partially invalidate the results. Recomputing everything from scratch after that sounds inefficient, and if we add caching, we'll end up with something like what's proposed here. To me this looks very similar to what we have in SCEV (a map from expression to the corresponding analysis result, which we can invalidate and recompute whenever it's requested again), and I think SCEV has been working pretty well so far in this respect.

This is exactly right. Moreover, perhaps as you're implying, we could make this an actual analysis (like SCEV) - that could easily be cleaner.

Michael

I'm seeing more problems than just nsw/nuw flags here.

The sext.ll test is failing because SimplifyDemandedInstructionBits determined that this

    %and = and i32 %x, 16
    %shl = shl i32 %and, 27

simplified to just the shl, because we were only demanding the MSB of the shift. This occurred after we had cached the known bits for the shl as having its 31 LSBs equal to 0. But without the "and" in there, we can no longer guarantee that the lower bits of the shift result are 0.
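A quick arithmetic check (a sketch assuming 32-bit values, as in the test) illustrates why the cached fact goes stale:

```cpp
#include <cassert>
#include <cstdint>

// With the 'and' in place, only bit 4 of (x & 16) can survive, so after a
// left shift by 27 only bit 31 can be set: the low 31 bits are always 0.
uint32_t withAnd(uint32_t X) { return (X & 16u) << 27; }

// Without the 'and' (a valid rewrite when only the MSB is demanded), lower
// bits of x land in the low 31 bits of the result, so the cached
// "31 low bits are zero" fact no longer holds.
uint32_t withoutAnd(uint32_t X) { return X << 27; }
```

Both functions agree on the demanded MSB (bit 31 comes from bit 4 of x either way), which is why the rewrite is legal; it is only the cached low-bit facts that become wrong.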

I also got a failure on shift.ll that is not reported here. It was caused by getShiftedValue rewriting operands and changing the constants of other shifts. This effectively shifts the known bits, and thus the cache needs to be invalidated.

hfinkel updated this revision to Diff 99822.May 22 2017, 3:02 PM

Rebased patch provided by Craig Topper.

aemerson resigned from this revision.Aug 4 2017, 9:48 AM