This is an archive of the discontinued LLVM Phabricator instance.

Enable Operand Reordering for Commutative Instructions in the FunctionComparator/MergeFunctions
Needs ReviewPublic

Authored by rcorcs on Mar 15 2019, 5:11 PM.

Details

Reviewers
jfb
davide
Summary

The function merger should be able to handle small differences if the two functions are still equivalent.
This patch, as the title suggests, enables the function merger to handle simple operand reordering for commutative binary operators.
For example:

mul nsw i32 3, %a

should be equivalent to

mul nsw i32 %a, 3

This check is handled by the FunctionComparator. Whenever it says that two functions are equivalent, the MergeFunctions pass is able to merge them.

I've added a simple unit test to exercise this change. All unit tests pass as expected.

  • Rodrigo Rocha

Diff Detail

Repository
rL LLVM

Event Timeline

rcorcs created this revision.Mar 15 2019, 5:11 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 15 2019, 5:11 PM
jfb added a comment.Mar 15 2019, 5:21 PM

This seems like something that we should canonicalize in general, no? i.e. the function comparator should only compare the canonicalized form (which another pass would ensure), and never expect to see a non-canonical one. I'd much rather make sure something else canonicalizes this than try to merge all permutations.

The example I gave is a simplistic one that gets canonicalized away.
However, this patch also handles more interesting cases, such as the following:

int foo(int *v, int a, int b, int c) {
  int t1 = (3*a);
  int t2 = (c*b);
  v[0] = t1+t2;
  int t3 = (b*7);
  int t4 = (c*a);
  v[1] = t3+t4;
  return v[0]+v[1];
}

int bar(int *v, int a, int b, int c) {
  int t1 = (a*3);
  int t2 = (b*c);
  v[0] = t2+t1;
  int t3 = (7*b);
  int t4 = (a*c);
  v[1] = t4+t3;
  return v[1]+v[0];
}

which clang at -O1, -O2, or -Os (I haven't tested other flags) compiles to something similar to:

define dso_local i32 @foo(i32* nocapture %v, i32 %a, i32 %b, i32 %c) local_unnamed_addr #0 {
entry:
  %mul = mul nsw i32 %a, 3
  %mul1 = mul nsw i32 %c, %b
  %add = add nsw i32 %mul1, %mul
  store i32 %add, i32* %v, align 4, !tbaa !2
  %mul2 = mul nsw i32 %b, 7
  %mul3 = mul nsw i32 %c, %a
  %add4 = add nsw i32 %mul3, %mul2
  %arrayidx5 = getelementptr inbounds i32, i32* %v, i64 1
  store i32 %add4, i32* %arrayidx5, align 4, !tbaa !2
  %add8 = add nsw i32 %add, %add4
  ret i32 %add8
}

define dso_local i32 @bar(i32* nocapture %v, i32 %a, i32 %b, i32 %c) local_unnamed_addr #0 {
entry:
  %mul = mul nsw i32 %a, 3
  %mul1 = mul nsw i32 %c, %b
  %add = add nsw i32 %mul1, %mul
  store i32 %add, i32* %v, align 4, !tbaa !2
  %mul2 = mul nsw i32 %b, 7
  %mul3 = mul nsw i32 %c, %a
  %add4 = add nsw i32 %mul3, %mul2
  %arrayidx5 = getelementptr inbounds i32, i32* %v, i64 1
  store i32 %add4, i32* %arrayidx5, align 4, !tbaa !2
  %add8 = add nsw i32 %add4, %add
  ret i32 %add8
}

Notice that the instruction

  %add8 = add nsw i32 %add4, %add

in @bar has its operands in the "wrong" order compared to the corresponding instruction in @foo.

This makes the existing -mergefunc fail to merge the two functions, but this patch fixes it.

I agree that canonicalization steps help a lot, but, of course, I like the idea of having a more powerful function merger (as a proponent of one).

jfb added a comment.Mar 18 2019, 8:38 AM

The example I gave is a simplistic one that gets canonicalized away.
However, this patch also handles more interesting cases, such as the following:

[...]

This makes the existing -mergefunc fail to merge the two functions, but this patch fixes it.

I agree that canonicalization steps help a lot, but, of course, I like the idea of having a more powerful function merger (as a proponent of one).

Gotcha. Have you seen this occur? Which cases? Are there better mergefuncs-independent canonicalizations that would handle such matching? i.e. can we improve canonicalization so that we don't need to do the matching here?

I want to make sure we don't unnecessarily slow down mergefuncs. We might at some point consider a "slower but more powerful" set of comparators, and "fast but good" ones. As I described on the mailing list last week, mergefuncs can help compile time if we run the simple one early, and way later run the complex one.

lib/Transforms/Utils/FunctionComparator.cpp
780

Please run the added code through clang-format.

928

How does your change behave when hashing? Hashing is an important part of mergefuncs avoiding O(n²) behavior.

unittests/Transforms/Utils/FunctionComparatorTest.cpp
163

You should test the more complex example you describe as well.
In general you should be testing all the binary ops that LLVM has.

172

I don't think you want the tests here, you instead want regular IR tests with FileCheck. See test/Transforms/MergeFunc/.

jfb added a subscriber: dexonsmith.Mar 19 2019, 9:24 AM

I was talking to @dexonsmith about this review, and we like where this will eventually get LLVM. However, what really makes me uneasy is that I'm not sure the example you're trying to optimize happens often enough to warrant the algorithmic complexity of the matching. Here are a few more thoughts:

  • In the limit, you're trying to prove that two function graphs are isomorphic and that's generally NP. We don't want that by default, though it might be interesting to go full NP at high optimization levels for a subset of problems that occur in real code (e.g. use some heuristic to determine when going NP might pay off and not take too long (whatever that means)).
  • Without doing full graph isomorphism, there are probably some limited graph canonicalizations that would generally pay off on real-world code (for mergefuncs and LLVM in general). That sounds like what you tried to do in your paper on sequence alignment, and I am worried about the 15% compile-time overhead there.
  • We want mergefuncs to help compile time, maybe even at -O0, so there needs to be a way to run mergefuncs early with very few complex matching rules.
  • We'd like mergefuncs to be on by default before trying to make it super powerful. Solving the engineering problems that prevent it from being on by default is a good amount of work, but the payoff is that super powerful optimizations will get way more testing and real-world measurements (both on compile-time and size improvements) than if mergefuncs is off.
jfb added a subscriber: vsk.Mar 19 2019, 9:52 AM

Can't we canonicalize to use the smaller index as the LHS?

v[0] + v[1] - ok
v[1] + v[0] - swap the indexes

jfb added a comment.Mar 19 2019, 10:28 AM

Can't we canonicalize to use the smaller index as the LHS?

v[0] + v[1] - ok
v[1] + v[0] - swap the indexes

For arrays that have constant indices sure, but in general no: it could be arrays with variable indices or just straight up variables and then we've got no truly natural ordering.

I understand your point about having a light-weight function merger (FM) that is always on to reduce code size and compilation time.
I agree that perhaps we could have a parametrized merger that enables/disables the more powerful comparisons.
That's why I also suggested that a more powerful merger (like my FM by Sequence Alignment) could be enabled only for heavier code-size optimizations (like -Oz), with a light-weight FM always enabled (or enabled for most optimization settings).

You're right that the general problem is NP, but we can use tricks similar to what SLP does, for example.
In SLP, we are also constantly fighting this trade-off between better isomorphism identification and compilation time.
SLP is also based on finding isomorphic graphs (mostly trees), and it also handles operand reordering for commutative operations: when buildTree_rec processes binary operations, the reordering is done by reorderInputsAccordingToOpcode.
In this patch, I'm kind of borrowing that from SLP.

Regarding the 15% overhead of my FMSA: that is true for my prototype implementation, but I believe the number can be reduced significantly.
For example, in the paper we still perform a lot of wasteful merges (80+%) that we throw away, because we always try to merge with the best candidate from the ranking mechanism.
In most cases even the top candidate is very dissimilar from the function being considered, so we could simply avoid trying to merge them.
We didn't try to improve that for the paper, but we could quickly discard bad candidates, similar to how the ranking itself works.
This, among other things, would definitely reduce the 15% average overhead.

I will look into the canonicalization pass.
I also need to read the discussions you shared with me on the mailing list in more detail. But I'm aware of the MergeSimilarFunctions work, as it is the state of the art that I use for comparison in my paper.

Nicola added a subscriber: Nicola.Apr 8 2019, 3:04 AM
aykevl added a subscriber: aykevl.May 16 2019, 5:42 PM