This is an archive of the discontinued LLVM Phabricator instance.

[InstSimplify] shl nuw %x, %y -> %x iff KnownBits says %x is negative.
AbandonedPublic

Authored by lebedev.ri on Jun 7 2018, 9:27 AM.

Details

Summary

Follow-up for D47883.

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Jun 7 2018, 9:27 AM
lebedev.ri retitled this revision from [InstSimplify] shl nuw C, %x -> C iff KnownBits says C is negative. to [InstSimplify] shl nuw %x, %y -> %x iff KnownBits says %x is negative..
spatel added a subscriber: efriedma.Jun 7 2018, 9:51 AM

I think we need to make a compile-time vs. perf win assessment on this one. This is probably true for any proposal that uses knownbits analysis in InstSimplify at this point. @efriedma said something that stuck with me in 1 of my proposals a while back: we could use knownbits on *every* value in instsimplify, but that would be very expensive and wouldn't gain much perf.

So I'd like to see debug STATISTICs for this fold to see how often we have 'shl nuw' in real code and how often this fires. Test-suite or compiling clang/llvm itself should be reasonable benchmarks.

I think we need to make a compile-time vs. perf win assessment on this one. This is probably true for any proposal that uses knownbits analysis in InstSimplify at this point. @efriedma said something that stuck with me in 1 of my proposals a while back: we could use knownbits on *every* value in instsimplify, but that would be very expensive and wouldn't gain much perf.

It is of note that in SimplifyRightShift() (which is directly above this function), we always use KnownBits to check for the low bit.

So I'd like to see debug STATISTICs for this fold to see how often we have 'shl nuw' in real code and how often this fires. Test-suite or compiling clang/llvm itself should be reasonable benchmarks.

To be more specific, you want the count of times we call that computeKnownBits(), and how many of those result in the fold, i.e.

counter0++;
if (computeKnownBits(Op0, Q.DL, /*Depth=*/0, Q.AC, Q.CxtI, Q.DT).isNegative()) {
  counter1++;
  return Op0;
}

?

If you use the generic instruction simplify entry point, we literally do call computeKnownBits on every input. It's the last thing we do if no simplications are found. If you call the individual instruction entry points like InstCombine does we don't do this. InstCombine itself will try to do something similar in main worklist processing code, but only when ExpensiveCombines is true. I've never understood why InstSimplify was more aggressive here.

I think we need to make a compile-time vs. perf win assessment on this one. This is probably true for any proposal that uses knownbits analysis in InstSimplify at this point. @efriedma said something that stuck with me in 1 of my proposals a while back: we could use knownbits on *every* value in instsimplify, but that would be very expensive and wouldn't gain much perf.

It is of note that in SimplifyRightShift() (which is directly above this function), we always use KnownBits to check for the low bit.

That's a good point. I didn't see that.

So I'd like to see debug STATISTICs for this fold to see how often we have 'shl nuw' in real code and how often this fires. Test-suite or compiling clang/llvm itself should be reasonable benchmarks.

To be more specific, you want the count of times we call that computeKnownBits(), and how many of those result in the fold, i.e.

counter0++;
if (computeKnownBits(Op0, Q.DL, /*Depth=*/0, Q.AC, Q.CxtI, Q.DT).isNegative()) {
  counter1++;
  return Op0;
}

?

Yes. I suppose the proof of whether this is a good addition will be in the actual compile-time and perf results for test-suite, but I think it would be good to have this micro-stat for our knowledge. My guess is that this is very rare, and the fold is even rarer...but that's just a guess, so having the data will be good.

spatel added a subscriber: hfinkel.Jun 7 2018, 10:55 AM

If you use the generic instruction simplify entry point, we literally do call computeKnownBits on every input. It's the last thing we do if no simplications are found. If you call the individual instruction entry points like InstCombine does we don't do this. InstCombine itself will try to do something similar in main worklist processing code, but only when ExpensiveCombines is true. I've never understood why InstSimplify was more aggressive here.

Hmm...yes. IIRC, the difference with Eli's comment was that we could get knownbits for all operands, not just the inputs - and that would be even more costly of course.
This patch isn't the right place for the larger discussion, but cc'ing @hfinkel as I think both of the mentioned knownbits changes were made in rL251146. The ExpensiveCombines flag for instcombine was added later as a reaction to increased compile-time.

lebedev.ri abandoned this revision.Jun 7 2018, 12:24 PM

Ok, got the self-hosting stats.

$ find -iname *.stats | wc -l
3049
$ find -iname *.stats | xargs grep NumShlNuwNonConstF
./lib/Analysis/CMakeFiles/LLVMAnalysis.dir/LoopInfo.stats:      "instsimplify.NumShlNuwNonConstFolded": 1,
./unittests/ADT/CMakeFiles/ADTTests.dir/PriorityWorklistTest.stats:     "instsimplify.NumShlNuwNonConstFolded": 1,

So counter2, times the fold happened is 2.

I haven't found where is the tooling to process these .stats dumps, so i have manually processed it for the "instsimplify.NumShlNuwNonConst" entries.
The counter1, the times we called computeKnownBits() here is 643566.

I'm guessing it means we don't want this :)

spatel added a comment.Jun 7 2018, 3:17 PM

Ok, got the self-hosting stats.

$ find -iname *.stats | wc -l
3049
$ find -iname *.stats | xargs grep NumShlNuwNonConstF
./lib/Analysis/CMakeFiles/LLVMAnalysis.dir/LoopInfo.stats:      "instsimplify.NumShlNuwNonConstFolded": 1,
./unittests/ADT/CMakeFiles/ADTTests.dir/PriorityWorklistTest.stats:     "instsimplify.NumShlNuwNonConstFolded": 1,

So counter2, times the fold happened is 2.

I haven't found where is the tooling to process these .stats dumps, so i have manually processed it for the "instsimplify.NumShlNuwNonConst" entries.
The counter1, the times we called computeKnownBits() here is 643566.

Thanks for collecting the stats. I also don't know where the tooling for processing stats lives. I did some similar grep last time I wanted to find something like this.

I'm guessing it means we don't want this :)

Yes, that seems like a very small probability of success, but I really don't know where we set the threshold for cost/benefit. If someone finds this hole in real code, we can revisit.