This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Fold (A OR B) AND B code sequence over Phi node
AbandonedPublic

Authored by inouehrs on Mar 19 2018, 7:46 AM.

Details

Summary

This patch intends to enable jump threading for a method whose return type is std::pair<int, bool> or std::pair<bool, int>.
For example, jump threading does not work for the if statement in func.

std::pair<int, bool> callee(int v) {
  int a = dummy(v);
  if (a) return std::make_pair(dummy(v), true);
  else return std::make_pair(v, v < 0);
}

int func(int v) {
  std::pair<int, bool> rc = callee(v);
  if (rc.second) {
    // do something
  }
  return rc.first;
}

SROA, executed before inlining, replaces std::pair with a single i64 without splitting it in both callee and func, since at that point no access to the individual fields is visible to SROA.
After inlining, jump threading fails to identify that the incoming value is a constant due to additional instructions (like or, and, trunc).

This patch finds a phi node that has an OR instruction as an incoming value and an AND instruction as its use. If the OR and AND instructions take the same operand, e.g. (%A OR %B) AND %B, the incoming OR is replaced by %B. For example,

BB1:
  %or = or i64 %val, 1
  br %BB2
BB2:
  %phi = phi i64 [ %or, %BB1 ], ... ; -> phi i64 [ 1, %BB1 ], ...
  %and = and i64 %phi, 1

This helps jump threading identify the opportunity listed above.
Later, the CFG simplification pass performs a similar code modification, but that is too late to help jump threading.

Diff Detail

Event Timeline

inouehrs created this revision.Mar 19 2018, 7:46 AM

This patch improved the performance of protobuf (https://github.com/google/protobuf), which suffers from the problem of jump threading for std::pair<int, bool>.

with this patch
----------------------------------------------------------------------
Benchmark                               Time           CPU Iterations
----------------------------------------------------------------------
google_message2_parse_new          761764 ns     761558 ns        842   105.904MB/s
google_message2_parse_reuse        204430 ns     204397 ns       3422   394.587MB/s
google_message2_parse_newarena     398431 ns     398367 ns       1758   202.457MB/s
google_message2_serialize          116088 ns     116080 ns       6029     694.8MB/s

without this patch
----------------------------------------------------------------------
Benchmark                               Time           CPU Iterations
----------------------------------------------------------------------
google_message2_parse_new          786832 ns     786808 ns        868   102.506MB/s
google_message2_parse_reuse        207647 ns     207631 ns       3368    388.44MB/s
google_message2_parse_newarena     403998 ns     403985 ns       1733   199.642MB/s
google_message2_serialize          117183 ns     117172 ns       5973   688.322MB/s
majnemer added inline comments.
lib/Transforms/InstCombine/InstCombinePHI.cpp
678 ↗(On Diff #138908)

I think you need to return NewPhi.

Missing testcase.

lib/Transforms/InstCombine/InstCombinePHI.cpp
669 ↗(On Diff #138908)

In general, instcombine should not insert new instructions without erasing any existing instructions; in some cases, we'll increase codesize without any benefit.

inouehrs updated this revision to Diff 139578.Mar 23 2018, 4:55 AM
  • Now this optimization creates a new phi node only if the AND instruction can be eliminated, so the code size does not increase.
  • Added test case.

@efriedma Thank you so much for the advice! I added a check before creating a new phi node to confirm that the AND instruction will be eliminated later, so the code size does not increase.
Also, I added a test case; I forgot to include it in the first submission.

inouehrs marked 2 inline comments as done.Mar 23 2018, 5:28 AM
inouehrs added inline comments.
lib/Transforms/InstCombine/InstCombinePHI.cpp
678 ↗(On Diff #138908)

Thanks. I made the function return &Phi since we add NewPhi (rather than replacing Phi) or modify the operands of Phi.

inouehrs updated this revision to Diff 142213.Apr 12 2018, 10:27 AM
  • simplify code
  • rebase to the latest code

Any idea how frequently this triggers on general code (the LLVM testsuite, or something like that)?

lib/Transforms/InstCombine/InstCombinePHI.cpp
653 ↗(On Diff #142213)

Only allowing 64-bit values seems overly restrictive.

657 ↗(On Diff #142213)

We don't usually like to walk uses like this; can you start the pattern-match from the "and"?

682 ↗(On Diff #142213)

cast<>, not dyn_cast<>

692 ↗(On Diff #142213)

Do you need to check hasOneUse() on the "or" here?

Looking specifically for zext+shl seems overly specific to your testcase; I'd like to see something a little more general. Maybe you could check SimplifyAndInst(V, UserVal) != nullptr? Or maybe that's too expensive; not sure.

inouehrs updated this revision to Diff 143229.Apr 19 2018, 11:10 PM
  • addressed comments from Eli.

Only allowing 64-bit values seems overly restrictive.

I agree. I relaxed the condition.

We don't usually like to walk uses like this; can you start the pattern-match from the "and"?

I made the pattern match start from "and".

Do you need to check hasOneUse() on the "or" here?

This patch does not eliminate or modify the "or" and so does not require hasOneUse on it, although later optimizations may further optimize the "or".

Looking specifically for zext+shl seems overly specific to your testcase

I generalized the code using computeKnownBits since SimplifyAndInst does not identify this pattern.

Any idea how frequently this triggers on general code (the LLVM testsuite, or something like that)?

While compiling LLVM itself, this modification happens about 440 times. In the LLVM test-suite it happens only three times.
I think newer programs tend to use a value pair as the return value type more frequently.

Thank you for the comments.

xbolva00 accepted this revision.Apr 20 2018, 12:02 AM
This revision is now accepted and ready to land.Apr 20 2018, 12:02 AM

@efriedma Do you have further comments or suggestions? Thanks!

I'm still not sure this is really as general as it could be, but I guess it's okay.

lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1423

This logic doesn't work if AndVal isn't a constant (e.g. consider the case where the "and" and "or" are in the same basic block).

xbolva00 requested changes to this revision.Apr 28 2018, 2:50 AM
This revision now requires changes to proceed.Apr 28 2018, 2:50 AM
spatel added a subscriber: fhahn.Apr 28 2018, 7:20 AM

I'm still not sure this is really as general as it could be, but I guess it's okay.

I think what this patch really wants to ask/do is: "Does this binop simplify with the incoming value of the phi to 1 of the binop operands? If yes, substitute that value into the phi."

If you look at it that way, then it should be some add-on to the existing foldOpIntoPhi() - and that's probably a smaller and more general patch.

That substitution analysis seems to fall into the gray area -- or not gray depending on your viewpoint :) -- of whether this belongs in (New)GVN or instcombine. (cc @fhahn).

inouehrs updated this revision to Diff 144452.Apr 28 2018, 9:24 AM
inouehrs marked an inline comment as done.Apr 28 2018, 9:37 AM

I think what this patch really wants to ask/do is: "Does this binop simplify with the incoming value of the phi to 1 of the binop operands? If yes, substitute that value into the phi."

Note that this patch intends to optimize only a simple but important case involving std::pair to help jump threading. Other optimizers can already perform more generic optimization of this type of code sequence, but they run too late to help jump threading.

lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1423

Thank you for pointing this out. I added a check for AndVal (previously I checked this only in the if (!Phi->hasOneUse()) block).

I think what this patch really wants to ask/do is: "Does this binop simplify with the incoming value of the phi to 1 of the binop operands? If yes, substitute that value into the phi."

Note that this patch intends to optimize only a simple but important case involving std::pair to help jump threading. Other optimizers can already perform more generic optimization of this type of code sequence, but they run too late to help jump threading.

I understand that we get many other similar cases (for better or worse, here in instcombine). But if we're going to try this, then we might as well generalize it, so it triggers more often and leads to less lumpy optimization. Ie, you should get this case too:

define i64 @phi_with_idempotent_binop(i1 %f, i64 %a) {
entry:
  br i1 %f, label %BB1, label %BB2

BB1:                               
  %and = and i64 %a, 1
  br label %BB2

BB2:                                 
  %phi = phi i64 [ %a, %entry ], [ %and, %BB1 ]
  %or = or i64 %phi, 1
  ret i64 %or
}

If you use a more general matcher (and I think it will be cheaper for the tests you're showing), we can get the motivating cases you've seen and the ones you haven't seen yet. :)

I think the matcher is quite simple. It would look something like this:

if (BinOp->isIdempotent()) { // handle both 'and' and 'or'
  // no limits on what's on the other side of the phi
  Value *V = SimplifyBinOp(BinOp->getOpcode(), Phi->getIncomingValue(Idx),
                           BinOp->getOperand(1));
  if (V && V == BinOp->getOperand(1))
    Phi->setIncomingValue(Idx, V);
}

(add the appropriate checks for uses, commutes, etc)

dberlin added a comment.EditedMay 2 2018, 7:58 AM

I'm still not sure this is really as general as it could be, but I guess it's okay.

I think what this patch really wants to ask/do is: "Does this binop simplify with the incoming value of the phi to 1 of the binop operands? If yes, substitute that value into the phi."

If you look at it that way, then it should be some add-on to the existing foldOpIntoPhi() - and that's probably a smaller and more general patch.

That substitution analysis seems to fall into the gray area -- or not gray depending on your viewpoint :) -- of whether this belongs in (New)GVN or instcombine. (cc @fhahn).

FWIW, the underlying issue for me in doing it in instcombine is that it can never really be good at it in any sane time bound. Without knowing the values of other instructions ahead of time (i.e. some form of value numbering), it has to go figure them out (again and again), or give up and only tackle simple cases around constants (which people never stay happy with). It also has no fine-grained dependency tracking, so it will re-evaluate this all the time. By contrast, NewGVN (or even GVN if you wanted to try it there) only re-evaluates these transforms at the points they could change, and already knows the values of other things in the program, so it can tell when the transform produces a redundancy without further evaluation.

Over time, either the set of cases you catch sucks, you make instcombine slow and complicated, or you make instcombine do value numbering.
None of these seem like appealing options to me ;)

xbolva00 resigned from this revision.May 16 2018, 5:56 AM
inouehrs abandoned this revision.Jul 31 2018, 10:10 PM
inouehrs marked an inline comment as done.

I submitted new patches for instsimplify to catch this opportunity.
https://reviews.llvm.org/D48828
https://reviews.llvm.org/D49981

Thanks for all the advice.