This is an archive of the discontinued LLVM Phabricator instance.

[SCEV] Teach SCEVExpander to expand BinPow
ClosedPublic

Authored by mkazantsev on Jun 8 2017, 2:46 AM.

Details

Summary

Current implementation of SCEVExpander demonstrates a very naive behavior when
it deals with power calculation. For example, a SCEV for x^8 looks like

(x * x * x * x * x * x * x * x)

If we try to expand it, it generates a very straightforward sequence of muls, like:

x2 = mul x, x
x3 = mul x2, x
x4 = mul x3, x
    ...
x8 = mul x7, x

This is a non-efficient way of doing that. A better way is to generate a sequence of
binary power calculation. In this case the expanded calculation will look like:

x2 = mul x, x
x4 = mul x2, x2
x8 = mul x4, x4

In some cases the code size reduction for such SCEVs is dramatic. If we had a loop:

x = a;
for (int i = 0; i < 3; i++)
  x = x * x;

And this loop have been fully unrolled, we have something like:

x = a;
x2 = x * x;
x4 = x2 * x2;
x8 = x4 * x4;

The SCEV for x8 is the same as in example above, and if we for some reason
want to expand it, we will generate naively 7 multiplications instead of 3.
The BinPow expansion algorithm here allows to keep code size reasonable.

This patch teaches SCEV Expander to generate a sequence of BinPow multiplications
if we have repeating arguments in SCEVMulExpressions.

Diff Detail

Event Timeline

mkazantsev created this revision.Jun 8 2017, 2:46 AM
mkazantsev edited the summary of this revision. (Show Details)Jun 8 2017, 2:47 AM
dneilson edited edge metadata.Jun 8 2017, 7:35 AM

Have you looked at what happens with addition? i.e.

x = a;
for (i = 0; i < N; i++) x = x + x;

Should still be an exponential number of additions, when it could be just (x * (N << 1)).

lib/Analysis/ScalarEvolutionExpander.cpp
766

How challenging would it be to enhance SCEVMulExpr to represent repeated multiplies by the same value as a pair (value, exponent) rather than as a list of the same value being repeated 'exponent' times?

This patch has addressed not having to expand a ^ (2 ^ N) at runtime, but we still have an exponential number of terms in the OpsAndLoops list, which should quite negatively affect compile time.

773

If Ty is a float type and N isn't a trivial value (like 1 or 2) then it can be better (faster & more accurate) to generate a call to @llvm.powi rather than the tree of multiplies.

Daniel, regarding getAddExpr: SCEV tends to keep its expressions as flat as possible, so if it has two (or more) equal args in SCEVAddExpr, it transforms it into multiplication. You can try to invoke getAddExpr for (a, a, a) and it will return simething like (3 * a). So transforming Adds to Muls is not a problem, the problem with Muls is that we don't have a Power construct in SCEV that would be exactly what we need to represent such expressions. But as I said in the comment, we don't have it and it wouldn't be easy to introduce it.

lib/Analysis/ScalarEvolutionExpander.cpp
766

I agree with your concern. In fact what you propose is to introduce a new construct like SCEVPower. It would be extremely nice to have, but unfortunately it would be a really huge change affecting all SCEV.

773

SCEV doesn't work with FP types, so it is not our case. :) But if it did, such transformations changing the order of calculations could not even be legal at all.

sanjoy requested changes to this revision.Jun 9 2017, 2:36 PM

Hi Maxim,

If I understand your motivation here correctly, I think we've already lost by creating a SCEVMulExpr with an exponentially large number of operands. IMO the right fix would be to prevent creating such expressions in the first place.

This revision now requires changes to proceed.Jun 9 2017, 2:36 PM

Sanjoy, what is the difference between having one SCEV with exponential number of operands or having a SCEV tree with 2 operands in each node, but the same number of operands in total? We won't win anything doing so.

Sanjoy, what is the difference between having one SCEV with exponential number of operands or having a SCEV tree with 2 operands in each node, but the same number of operands in total? We won't win anything doing so.

I don't think you'll need the same number of operands in total in the second case. IIUC, the problematic IR is:

x1 = x0 * x0;
x2 = x1 * x1;
x3 = x2 * x2;
x4 = x3 * x3;

where the SCEV expression for x4 ends up being (x0 * x0 * ... 16 times). This, in turn, gets lowered into:

xa = x0 * x0;
xb = xa * x0;
... 15 instructions

I'm postulating that the real problem was that we created a SCEV for x4 with 16 operands. If we instead had the SCEV for x4 be:

// Writing it this way since printing SCEVs obscures SCEV identity
S0 = x0 * x0;
S1 = S0 * S0;
S2 = S1 * S1;
S3 = S2 * S2;

then the expansion would naturally have been

xa = x0 * x0;  // expand(S0)
xb = xa * xa;  // expand(S1)
xc = xb * xb;  // expand(S2)
xd = xc * xc;  // expand(S3)

(Note that expand tries to re-use already expanded expressions, see InsertedExpressions).

Two caveats though:

  1. Obviously, we don't want to prevent SCEV for merging multiplication-of-multiplication expressions completely -- some of that is required for good simplification. We just want to do that up to a sane limit.
  2. This patch may still be a good idea, but for that you'll have to make a case by claiming that the new expanded form is a more canonical expansion of SCEVMulExpr s, not by claiming it is a bugfix. You'll also need to show that other passes don't get pessimized by such multiplication trees (unlikely).

This patch may still be a good idea, but for that you'll have to make a case by claiming that the new expanded form is a more canonical expansion of SCEVMulExpr s, not by claiming it is a bugfix. You'll also need to show that other passes don't get pessimized by such multiplication trees (unlikely).

Hi Sanjoy,

Actually an example of a better behavior is already present it test_03. If you look at it, you can notice that the new expander logic has changed the order of calculations. Maybe this example is not good enough because the overall number of multiplications remains the same, but if we have something like

x0 = a * a
x1 = x0 * a
x2 = x1 * a
...
xn = x(n-1) *a

then the new logic allows to reduce the overall number of multiplications from O(N) to O(log(N)). Would that be a sufficient justification of this change as improvement? I can add a test that does it.

As for limiting the operands merging to some limit - I am still not convinced that it is a good idea. What if we have the situation I described above, with n being like 1022? In fact, we have a chain of muls that calcupates a ^ 1024. It could come from a loop like:

x = a * a;
for (i = 0; i < 1022; i++)
  x = x * a

Which got fully unrolled. In this case it is profitable to merge one operands in one SCEV, figure out that we actually need a power of 2 and then expand it into BinPow, decreasing the number of resulting muls from 1023 to ~10. If we set a merge limit, we can fail to do that.

Anyways, setting a merge limit for muls would be a separate patch. If you still do think we need it, I will make it. As for this one, if I add a test on what I described (when it reduces the overall number of muls), would that work to approve it?

Thanks!

mkazantsev edited edge metadata.
mkazantsev edited the summary of this revision. (Show Details)

Re-justified this as being an improvement, added a test where a linear calculation is replaced with a logarithmic one. I will make a separate patch to avoid creation of SCEVs with unreasonably huge numbers of operands.

sanjoy accepted this revision.Jun 16 2017, 9:04 PM

lgtm with nits

lib/Analysis/ScalarEvolutionExpander.cpp
767

Avoid braces here.

You could also write this loop using find_if.

Finally I'd perhaps suggest using a uint64_t for Exponent so that an overflow is practically impossible.

775

Can you get rid of this conditional and instead start BinExp from 1?

792

It is a bit hard to read ExpandOpBinPowN() and tell what all local state the call touches. Do you mind extracting out a static helper function so that the data flow is more obvious?

This revision is now accepted and ready to land.Jun 16 2017, 9:04 PM
mkazantsev added inline comments.Jun 18 2017, 8:11 PM
lib/Analysis/ScalarEvolutionExpander.cpp
767

I cannot get rid of braces because of two operands inside. Will split them into two lines.

Isn't it true that the operands here are sorted? If yes, we just need a consecutive sequence of similar operands, find_if here would be an overcomplication.

Agreed with uint64_t.

mkazantsev added inline comments.Jun 18 2017, 8:17 PM
lib/Analysis/ScalarEvolutionExpander.cpp
775

That would make the loop a bit messy, since we are generating the current power differently for p = 1 and the rest, and because we need to generate the next operand before (possibly) using it. I will need an if in loop in this case, something like

for (unsigned BinExp = 1; BinExp <= Exponent; BinExp <<= 1) {
  P = BinExp == 1 ? expandCodeFor(I->second, Ty) : InsertBinop(Instruction::Mul, P, P);
  if (Exponent & BinExp)
    Result = Result ? InsertBinop(Instruction::Mul, Result, P) : P;
}

It is hardly easier to read and worse performance-wise.

mkazantsev added inline comments.Jun 18 2017, 8:26 PM
lib/Analysis/ScalarEvolutionExpander.cpp
792

We call internal SCEVExpander method (like InsertBinop) inside. Do we really want a static helper that takes "this" as argument?

I think the better way to make the data flow obvious here is to capture the required objects explicitly.

mkazantsev marked 3 inline comments as done.
mkazantsev added inline comments.
lib/Analysis/ScalarEvolutionExpander.cpp
767

Added a test that checks that the BinPow is expanded even if the calculation of SCEV included different operands.

This revision was automatically updated to reflect the committed changes.