This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Improve profitability check for folding PHI args
Needs ReviewPublic

Authored by uweigand on Aug 1 2017, 1:23 PM.

Details

Summary

FoldPHIArgOpIntoPHI and related routines move operations from PHI
arguments to the result of the PHI node. To avoid creating more
new operations than are removed by the transformation, this is
currently only done if the original operations each have a single
use only (i.e. the PHI node). This is overly pessimistic for
two reasons:

  • A single operation can be used multiple times as input to the same PHI node. This is called out in a FIXME in the code: FIXME: The hasOneUse check will fail for PHIs that use the value more than themselves more than once.
  • Sometimes there are two (or more) PHI nodes using the same set of arguments (e.g. if a loop variable is also used after the end of the loop). If *all* these PHIs are transformed, the original argument operations will still be dead.

This patch implements an IsFoldPHIArgProfitable routine that
performs a more precise check to determine whether after all
PHI nodes are transformed, there will be no more operations
in total than we had before. This check is then used instead
of the hasOneUse checks throughout FoldPHIArgOpIntoPHI and
related routines.

The existing Transforms/LoopUnroll/runtime-loop-multiple-exits.ll
test case now shows further improvement (it was using a single
operation multiple times in the same PHI). A new test added to
Transforms/InstCombine/phi.ll demonstrates the multiple-PHI case.

Diff Detail

Event Timeline

uweigand created this revision.Aug 1 2017, 1:23 PM
dberlin edited edge metadata.Aug 1 2017, 2:34 PM

Please please don't :)
Doing it the way this codepath does it can be exponential time/space
NewGVN will do the same thing, and in fact, catch all possible cases that can exist here in polynomial time.

I think the original check is a good compromise, but you are changing it to spend O(N^2) time per phi node, which, imho, is not good.
I strongly think we should just leave this one alone and be happy for now.
If you have pressing testcases, i'm happy to try to put someone on finishing the small amount of work on NewGVN bugs to get it on by default :)

davide added a subscriber: davide.Aug 1 2017, 2:41 PM

FWIW, I'm currently working on squashing out the remaining problems in NewGVN (see updates in bugzilla for details).
I expect realistically that the pass will be enabled by default in a couple of months, given all the insane amount of fuzzing we did on it.

Well, I'd be fine with NewGVN doing this transform. But at least right now, it doesn't appear to be doing so -- running the new test case (@phi_multi in phi.ll) through "opt -O2 --enable-newgvn" does not move the zext/or out of the loop. Is this supposed to happen?

As to compile time, I didn't notice any increase in real-world tests. While it is true that the new check can be quadratic in the worst case, there are several early exits that are actually taken most of time, so *usually* the check has the same complexity as the hasOneUse test it removed. Only in the case where a value is used in more than one PHI node do we even start the more expensive check ...

NewGVN currently has a limitation that davide is fixing around recursive translation, so it won't get it quite yet.
Note also that the usual transform it performs is to remove the redundancy, not to just sink (which was the original usecase for the transform).
If you mainly care about the sinking, it may require a bit of thought. It's also worth noting our dividing line for instcombine was one expressed to me as "local transforms", and this is most certainly not :)
(IE if this is a local transform, i can express PRE in InstCombine, and have it do global PRE).

Note that the check is not just quadratic in the worst case because instcombine iterates. This mechanism of transforming phi of ops into op of phis can be exponential time worst case even without instcombine's iteration :)
But i'll agree that i would expect the early exits to be hit mostly.

In any case, is this testcase the main testcase you care about? Is it a real performance win for your platform in general (IE how much is it buying us)?

The testcase is extracted from a real-word program. On that program, the transformation (moving some of those operations out of a hot loop) is a significant overall win (about 10% improved performance of the whole program). I agree that this application is a quite special case -- this patch doesn't make much difference to the overall performance of the platform in general.

Agreed that the overall InstCombine algorithm can exhibit performance issues, and should probably be replaced in the long term. But I don't think this particular patch makes those issues significantly worse :-)

The testcase is extracted from a real-word program. On that program, the transformation (moving some of those operations out of a hot loop) is a significant overall win (about 10% improved performance of the whole program). I agree that this application is a quite special case -- this patch doesn't make much difference to the overall performance of the platform in general.

I'd be against doing it for this reason, but i'll leave it to others.

Agreed that the overall InstCombine algorithm can exhibit performance issues, and should probably be replaced in the long term. But I don't think this particular patch makes those issues significantly worse :-)

Let be very clear: I'm actually talking about *this specific transformation*, not instcombine.

The general transformations between
a = x op y
b = x op y
result = phi(a, b)
and
1 = phi(x, y)
2 = phi(x, y)
result = 1 op 2

is exponential when applied repeatedly.

The original version you replaced limited this in a way that it would not be because of the single use restriction
I'm pretty positive yours, combined with instcombine's iteration, can be shown to be exponential worst case

The testcase is extracted from a real-word program. On that program, the transformation (moving some of those operations out of a hot loop) is a significant overall win (about 10% improved performance of the whole program). I agree that this application is a quite special case -- this patch doesn't make much difference to the overall performance of the platform in general.

I'd be against doing it for this reason, but i'll leave it to others.

Agreed that the overall InstCombine algorithm can exhibit performance issues, and should probably be replaced in the long term. But I don't think this particular patch makes those issues significantly worse :-)

Let be very clear: I'm actually talking about *this specific transformation*, not instcombine.

The general transformations between
a = x op y
b = x op y
result = phi(a, b)
and
1 = phi(x, y)
2 = phi(x, y)
result = 1 op 2

is exponential when applied repeatedly.

and NewGVN is only able to avoid such behavior by

  1. memoizing the results properly, since it's a value numberer
  2. Only attempting the transform when it sees something that could possibly be redundant as a result.

(but even then, the transform is still polynomial rather than linear)

Neither of these limitations apply here.
Given the right program, i believe you will happily go off into exponential land :)

Let be very clear: I'm actually talking about *this specific transformation*, not instcombine.

The general transformations between
a = x op y
b = x op y
result = phi(a, b)
and
1 = phi(x, y)
2 = phi(x, y)
result = 1 op 2

is exponential when applied repeatedly.

Well, yes, because that would introduce two phis where there originally was just one. But InstCombine doesn't do that, either with or without my patch. Before the patch, InstCombine used to transform:

a = x op c
b = y op c
result = phi(a, b)
to
tmp = phi(x, y)
result = tmp op c

Now, it will in addition also transform:

a = x op c
b = y op c
result1 = phi(a, b)
result2 = phi(a, b)
to
tmp1 = phi(x, y)
result1 = tmp1 op c
tmp2 = phi(x, y)
result2 = tmp2 op c

So with my patch, it may introduce more copies of the "op" (but still not more that we had before the transformation). But either with or without my patch, there will never be more *phis*. This is not because of the single-use check, but because we only move unary ops, or binary ops where the other operand is invariant, over the phi.

Let be very clear: I'm actually talking about *this specific transformation*, not instcombine.

The general transformations between
a = x op y
b = x op y
result = phi(a, b)
and
1 = phi(x, y)
2 = phi(x, y)
result = 1 op 2

is exponential when applied repeatedly.

Well, yes, because that would introduce two phis where there originally was just one. But InstCombine doesn't do that, either with or without my patch. Before the patch, InstCombine used to transform:

That's the space issue, but the time issue is the actual evaluation .
You have definitely solved the space issue, i will not disagree there.

I was also just giving an example of the generalization of this transform.

Anyway, i've said my say, so i'm going to leave this for someone else to review and decide on :)