This is an archive of the discontinued LLVM Phabricator instance.

Modifying reassociate for improved CSE
AbandonedPublic

Authored by escha on Oct 26 2017, 11:45 AM.

Details

Summary

Moved here from the llvm-dev list on their advice.

Copy-paste of description:

When playing around with reassociate I noticed a seemingly obvious optimization that was not getting done anywhere in LLVM, nor in GCC or ICC.

Consider the following trivial function:

void foo(int a, int b, int c, int d, int e, int *res) {
  res[0] = (e * a) * d;
  res[1] = (e * b) * d;
  res[2] = (e * c) * d;
}

This function can be optimized down to 4 multiplies instead of 6 by reassociating so that (e*d) is the common subexpression. However, no compiler I’ve tested does this. I wrote a slightly hacky heuristic algorithm to augment reassociate to do this and tested it.
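For illustration, here is a hand-reassociated sketch of what the optimized output looks like (the function name is mine, not from the patch): since (e*a)*d == (e*d)*a, the product e*d can be computed once and reused.

```cpp
#include <cassert>

// Hypothetical hand-reassociated version of foo: the shared factor
// (e * d) is hoisted, so only 4 multiplies remain instead of 6.
void foo_reassoc(int a, int b, int c, int d, int e, int *res) {
  int ed = e * d;   // 1 multiply, shared by all three results
  res[0] = ed * a;
  res[1] = ed * b;
  res[2] = ed * c;
}
```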

First, before the details, the results: on a large offline test suite of graphics shaders it cut down total instruction count by ~0.9% (!) and float math instruction count by 1.5% (!). I ran it on SPEC out of curiosity and the one standout was that with fast-math, it cut down 470.lbm by ~6% runtime (see Note 1 at the bottom for the probable reason why). Not coincidentally, this is probably the SPEC program that looks most like a graphics shader.

Here’s how it works:

  1. Do reassociate once as normal. Keep track of the linear chains of operations that are saved to the final IR for later.
  2. Create a “pair map”: a mapping from <Instr, Instr> to <unsigned>, with one pair map for each type of BinaryOperation. This map records how often a given operand pair occurs in the source code for a given BinaryOperation. But in addition to counting each actual instruction, we also count all O(N^2) pairs of each linear operand chain. So for example, if the operand chain is this:
a*b*c

we do:

PairMap[Multiply][{a, b}]++;
PairMap[Multiply][{b, c}]++;
PairMap[Multiply][{a, c}]++;

Obviously, if runtime is a concern we can skip this counting for extremely long chains, which are unlikely to occur in real programs, to avoid asymptotically quadratic behavior.

  3. Run reassociate again. All the information is saved from the first time around, so hopefully this won’t be very expensive except for the changes we actually make. But this time, whenever emitting a linear operand chain, pick the operand pair that’s *most common* in the source code (using PairMap) and make that one the first operation. Thus, for example:
(((a*b)*c)*d)*e

if “b*e” is the most common, this becomes:

(((b*e)*a)*c)*d

Now b*e can be CSE’d later! Magic!

This could probably be enhanced to allow for multiple operations to be CSE’d, e.g. by doing something like ((b*e)*(a*c))*d, but that would change the canonicalization regime and make it a more invasive change, I think.

Also, as a tiebreaker, the one I’m currently using is the pair with the lowest maximum rank of its two operands. This makes sense because in this example “a*b” is the first operation in the chain, so we want to pick duplicates that are higher up in the program rather than closer to the leaf. No other tiebreaker I tried seemed to work as well.
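The selection step with this tiebreaker can be sketched as follows (function and variable names are mine; the real code operates on Reassociate's internal operand lists, and only reorders when a pair's count actually exceeds 1):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, std::string>;

// Return the indices of the chain pair with the highest global count;
// break ties by preferring the pair whose larger operand rank is
// smallest, i.e. the duplicate that sits higher up in the program.
std::pair<size_t, size_t>
pickLeadingPair(const std::vector<std::string> &Chain,
                const std::map<Pair, unsigned> &Counts,
                const std::map<std::string, unsigned> &Rank) {
  size_t BestI = 0, BestJ = 1;
  unsigned BestCount = 0, BestMaxRank = ~0u;
  for (size_t I = 0; I < Chain.size(); ++I)
    for (size_t J = I + 1; J < Chain.size(); ++J) {
      Pair Key = std::minmax(Chain[I], Chain[J]);
      auto It = Counts.find(Key);
      unsigned Count = It == Counts.end() ? 0 : It->second;
      unsigned MaxRank = std::max(Rank.at(Chain[I]), Rank.at(Chain[J]));
      if (Count > BestCount ||
          (Count == BestCount && MaxRank < BestMaxRank)) {
        BestCount = Count;
        BestMaxRank = MaxRank;
        BestI = I;
        BestJ = J;
      }
    }
  return {BestI, BestJ};
}
```

On the chain (((a*b)*c)*d)*e from the example, if {b,e} has the highest count this picks b and e, which the pass would then emit as the first operation.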

This is not the cleanest implementation or algorithm, but it’s the most straightforward one I could come up with, it gives us some unreasonably large perf gains, and of course it optimizes “foo” correctly down to 4 multiplies.

Any thoughts? The gains here are… well, more than I expected, to say the least ;-)

—escha

Note 1: see how (1.0 - u2) is a common factor here.

DST_C ( dstGrid ) = (1.0-OMEGA)*SRC_C ( srcGrid ) + DFL1*OMEGA*rho*(1.0                                 - u2);
DST_N ( dstGrid ) = (1.0-OMEGA)*SRC_N ( srcGrid ) + DFL2*OMEGA*rho*(1.0 +       uy*(4.5*uy       + 3.0) - u2);
DST_S ( dstGrid ) = (1.0-OMEGA)*SRC_S ( srcGrid ) + DFL2*OMEGA*rho*(1.0 +       uy*(4.5*uy       - 3.0) - u2);
DST_E ( dstGrid ) = (1.0-OMEGA)*SRC_E ( srcGrid ) + DFL2*OMEGA*rho*(1.0 +       ux*(4.5*ux       + 3.0) - u2);

(etc)

Diff Detail

Repository
rL LLVM

Event Timeline

escha created this revision.Oct 26 2017, 11:45 AM
escha edited the summary of this revision. (Show Details)Oct 26 2017, 12:18 PM
fhahn added a subscriber: fhahn.Oct 26 2017, 3:38 PM
arsenm added inline comments.Oct 27 2017, 3:23 AM
include/llvm/Transforms/Scalar/Reassociate.h
76

I find the use of this somewhat confusing. It seems to really be a loop counter stuck in the class to pass it to the other functions. Is it particularly inconvenient to pass this in to the functions where it's needed? Also a better name might be ReassociateStep or Level (possibly with an enum for the steps)? Pass is pretty overloaded (plus lldb doesn't understand what to do with this since there is also a class named Pass)

lib/Transforms/Scalar/Reassociate.cpp
2222

Leftover debug printing

fhahn added inline comments.Oct 27 2017, 5:15 AM
include/llvm/Transforms/Scalar/Reassociate.h
76

I am probably missing something, but would it be possible to collect the linear chains as a pre-processing step and then do the full reassociate only once? You might have to move out the code to linearize binary ops, but I think it would be a more lightweight and clearer solution.

hiraditya added inline comments.
lib/Transforms/Scalar/Reassociate.cpp
2217

Does this ensure that the original order of evaluation is emitted when there is no preferred order?

escha added inline comments.Oct 27 2017, 10:23 AM
include/llvm/Transforms/Scalar/Reassociate.h
76

Mmm, I'm not sure. The problem is that the chains can *change* as a result of reassociation, which can rewrite instructions. e.g. changing use-counts can change the chains.

You're probably right; the main reason I didn't do that was that I wanted to make minimally invasive changes because I really don't feel I understand the pass deeply enough to restructure it like that safely.

76

And to the name, absolutely, will change on next iteration regardless.

lib/Transforms/Scalar/Reassociate.cpp
2217

I *think* it should? Not 100% sure. It currently doesn't reorder if Score never exceeds 1 but I don't fully trust this code. There's probably cases where it's "wrong" (as in, doing something unnecessary/suboptimal with order, not incorrect code).

This seems very similar to what n-ary reassociate wants to do, I'd like to understand why we shouldn't do it there?
(where it seems to fit perfectly)

This seems very similar to what n-ary reassociate wants to do, I'd like to understand why we shouldn't do it there?
(where it seems to fit perfectly)

Yeah, actually, I don't quite understand what N-ary reassociate *is* compared to this. One thing I do know about it is that it uses SCEV, which completely rules it out for floating point, which is the main thing I care about. Does it handle the integer case here?

fhahn added a reviewer: fhahn.Oct 30 2017, 4:13 AM

Yeah, actually, I don't quite understand what N-ary reassociate *is* compared to this. One thing I do know about it is that it uses SCEV, which completely rules it out for floating point, which is the main thing I care about. Does it handle the integer case here?

It looks like NaryReassociate tries to do the same thing as this patch, although using a different approach. I think ideally we would not have code in 2 passes that tries to do the same thing.

If this patch handles all cases handled by NaryReassociate AND floating point operations, it might make sense to get rid of NaryReassociate. Could you try and check if this patch handles all the NaryReassociate test cases?

Hmm, reading through N-ary reassociate, it looks like it has similar goals but is not completely overlapping in action, for better or worse.

N-ary reassociate runs in dominator order and looks to see if any subexpressions of the current computation have already been computed.

But as far as I understand, it does not consider *reassociating expressions* to find those redundancies. In other words, it doesn't consider the N^2 possible sub-combinations of previous expressions; it only considers them in the order they were emitted. Its main goal seems to be optimizing addressing expressions. So for example, if you have (a+b) and a later expression computes (a+(b+2)), it can find that this is redundant. But if you have (a+(b+1)) and (a+(b+2)) but no (a+b), I don't think it handles that. It also only runs in dominator order, so it can't find arbitrary redundancies.
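My reading of that contrast can be sketched in two tiny cases (a hedged illustration of the description above, not verified against NaryReassociate's actual behavior; function names are mine):

```cpp
#include <cassert>

// Case NaryReassociate handles (as described above): (a + b) already
// exists, so a + (b + 2) can be rewritten as (a + b) + 2 and reuse it.
int handled(int a, int b) {
  int t1 = a + b;   // dominating subexpression
  int t2 = t1 + 2;  // reuses t1 instead of recomputing a + (b + 2)
  return t1 + t2;
}

// Case it reportedly misses: only a + (b + 1) and a + (b + 2) exist;
// (a + b) was never emitted, so there is nothing to match against.
int missed(int a, int b) {
  int t1 = a + (b + 1);
  int t2 = a + (b + 2);
  return t1 + t2;
}
```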

Similarly, my change doesn't handle GetElementPtr in the way N-ary reassociate does, since it wasn't targeted at addressing expressions.

So I agree they're redundant, and kind of share similar goals and purposes, but annoyingly I don't see any way to unify them and achieve the goals of both.

Oh and of course because N-ary uses SCEV it cannot possibly handle floating point without a huge extension of SCEV, I think.

fhahn edited edge metadata.Nov 2 2017, 4:10 AM

Hmm, reading through N-ary reassociate, it looks like it has similar goals but is not completely overlapping in action, for better or worse.

N-ary reassociate runs in dominator order and looks to see if any subexpressions of the current computation have already been computed.

But as far as I understand, it does not consider *reassociating expressions* to find those redundancies. In other words, it doesn't consider the N^2 possible sub-combinations of previous expressions; it only considers them in the order they were emitted. Its main goal seems to be optimizing addressing expressions. So for example, if you have (a+b) and a later expression computes (a+(b+2)), it can find that this is redundant. But if you have (a+(b+1)) and (a+(b+2)) but no (a+b), I don't think it handles that. It also only runs in dominator order, so it can't find arbitrary redundancies.

Similarly, my change doesn't handle GetElementPtr in the way N-ary reassociate does, since it wasn't targeted at addressing expressions.

So I agree they're redundant, and kind of share similar goals and purposes, but annoyingly I don't see any way to unify them and achieve the goals of both.

Thanks. I suppose it should be possible to extend Reassociate to support GetElementPtr, and then we would get what NaryReassociate provides with your patch? I am not saying this should be part of this patch; I am just trying to understand how we can build on it.

Your patch uses function-wide counts for pairs, right? Could this lead to cases where this change reassociates operations based on this global count, even though CSE won't eliminate the reassociated expression (for example because control flow makes it too costly), and produces worse results? I assume this is not an issue for shader code, but it could happen for control-flow-heavy general code. I can run some benchmarks on AArch64 and share results in a couple of days.

include/llvm/Transforms/Scalar/Reassociate.h
76

Ok thanks! I see why it is simpler to just split it up into 2 iterations. What I was thinking about was having a loop that just builds the rankMap and then having the main loop update this map as it deletes instructions. It might be worth considering doing that as follow-up work.

lib/Transforms/Scalar/Reassociate.cpp
2212

Could we order the keys in the pairs to avoid a second lookup here, i.e. insert pairs with <a, b> where a < b?

2251

It looks like we are running reassociate 4 times? Is this needed? From the description it sounds like 2 times should be enough?

2293

It's probably easier to understand if this would be done in a separate function.

escha added a comment.Nov 2 2017, 9:44 AM

Hmm, reading through N-ary reassociate, it looks like it has similar goals but is not completely overlapping in action, for better or worse.

N-ary reassociate runs in dominator order and looks to see if any subexpressions of the current computation have already been computed.

But as far as I understand, it does not consider *reassociating expressions* to find those redundancies. In other words, it doesn't consider the N^2 possible sub-combinations of previous expressions; it only considers them in the order they were emitted. Its main goal seems to be optimizing addressing expressions. So for example, if you have (a+b) and a later expression computes (a+(b+2)), it can find that this is redundant. But if you have (a+(b+1)) and (a+(b+2)) but no (a+b), I don't think it handles that. It also only runs in dominator order, so it can't find arbitrary redundancies.

Similarly, my change doesn't handle GetElementPtr in the way N-ary reassociate does, since it wasn't targeted at addressing expressions.

So I agree they're redundant, and kind of share similar goals and purposes, but annoyingly I don't see any way to unify them and achieve the goals of both.

Thanks. I suppose it should be possible to extend Reassociate to support GetElementPtr, and then we would get what NaryReassociate provides with your patch? I am not saying this should be part of this patch; I am just trying to understand how we can build on it.

I'm not sure. GetElementPtr is a fairly complex instruction that isn't easily mapped to other instructions (i.e. it can represent a whole chain instead of just one). I think the reason that Nary is able to do this is because it uses SCEV, which abstracts over that.

Who wrote Nary? Maybe they have a useful opinion on all this?

Your patch uses function range counts for pairs, right? Could this lead to cases were this change reassociates operations based on this global count, even though CSE won't eliminate the reassociated expression (because it would be too costly because of control flow for example) and produces worse results? I assume this is not an issue for shader code, but could happen for control-flow heavy general code. I can run some benchmarks on AArch64 and share results in a couple of days.

Yeah, I was thinking of that. In the most extreme case, you could end up with disjoint instructions in different control flow paths, but on the other hand, the new GVN-Hoist should ideally catch those. You could use something like a scoped data structure to handle this in theory, maybe?

lib/Transforms/Scalar/Reassociate.cpp
2212

hmm, what would be the best way to do this? It's possible for two instructions to have the same Rank, IIRC. Would pointer order be acceptable? I'd be worried about determinism.

2251

oh, right. this was me experimenting to see if running it twice would get better results... and it did! I'm guessing this is because if you heavily reassociate based on global statistics, it opens up new opportunities, and so on. But it might be unnecessary.

this is probably in part caused by the fact that my patch isn't worklist-based (i.e. it runs over everything repeatedly instead of adding things to a worklist to revisit)

2293

good point. agreed

spatel added a subscriber: spatel.Nov 9 2017, 7:56 AM

Initial benchmark results on AArch64 Cortex-A57 for the LLVM test-suite and SPEC2000: -0.64% geomean of runtimes. Notable changes:

  • SingleSource/Benchmarks/Shootout/Shootout-strcat -27.78%
  • SingleSource/Benchmarks/Linpack/linpack-pc -7.23%

No regressions > 1%.

I will run newer SPEC versions too, but that might take a while.

escha abandoned this revision.Nov 14 2017, 1:31 PM

Closing to start a real patch review.