This is an archive of the discontinued LLVM Phabricator instance.

[x86] add a reassociation optimization to increase ILP via the MachineCombiner pass
ClosedPublic

Authored by spatel on Jun 8 2015, 2:49 PM.

Details

Summary

This is a reimplementation of D9780 at the machine instruction level rather than the DAG.

I'm using the MachineCombiner pass to reassociate scalar single-precision AVX additions (just a starting point; see the TODO comment) to increase ILP when it's safe to do so.

The code is closely based on the existing MachineCombiner optimization that is implemented for AArch64.

I tried the test cases that Mehdi provided in the follow-up thread for r236031, and I don't see any instruction count increase. In the massive test case (~8000 machine instructions) that blew up previously, this optimization will fire ~2000 times. This causes a horrible compile-time increase:

$ time llc -enable-unsafe-fp-math -x86-machine-combiner=0 -mattr=avx spill.ll
1.8 sec
$ time llc -enable-unsafe-fp-math -x86-machine-combiner=1 -mattr=avx spill.ll
35.8 sec

For now, I'm assuming that this is a degenerate test case (for x86 at least) that we don't need to artificially limit, but we could also clip the optimization to only fire N times.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 27331.Jun 8 2015, 2:49 PM
spatel retitled this revision from to [x86] add a reassociation optimization to increase ILP via the MachineCombiner pass.
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added a subscriber: Unknown Object (MLST).
Gerolf edited edge metadata.Jun 8 2015, 3:44 PM

I'm also curious about performance numbers.

Cheers
Gerolf

lib/Target/X86/X86InstrInfo.cpp
6228 ↗(On Diff #27331)

This function doesn't fit on my screen :-) You could integrate it into the combiner or split it into a) Checking for candidate b) Checking for pattern and c) generating a pattern. You also make a design decision and filter (one) pattern (vsadd/vsadd/vsadd) while searching. Instead you could separate search and filter. I can see arguments either way, but please document cost model functionality clearly.

6400 ↗(On Diff #27331)

You could implement this with a two-dimensional array (pattern rows, operand columns) and avoid the code bloat.

lib/Target/X86/X86InstrInfo.h
32 ↗(On Diff #27331)

Can you think of more descriptive names than 1-4? And/or add a comment about the targeted pattern?

spatel added a comment.Jun 8 2015, 5:53 PM

I'm also curious about performance numbers.

Hi Gerolf -

Thanks for the review! I'll work on the inline comments soon.

For perf numbers, I see no difference running the benchmarking subset of test-suite on x86 with AVX and -ffast-math; this is as expected because this pass isn't firing more than a handful of times on any test.

Do you have suggestions for other benchmarks? My motivating examples for this patch are:
https://llvm.org/bugs/show_bug.cgi?id=17305
https://llvm.org/bugs/show_bug.cgi?id=21768
https://llvm.org/bugs/show_bug.cgi?id=23116
http://reviews.llvm.org/D8941

I can create a toy test case and measure that the perf matches the IACA output in:
https://llvm.org/bugs/show_bug.cgi?id=17305#c0

Gerolf added a comment.Jun 8 2015, 6:13 PM

Hi Sanjay

Any HPC benchmark is a good candidate. But I was just curious if you had spotted some noticeable gains.

Cheers
Gerolf

spatel updated this revision to Diff 27400.Jun 9 2015, 2:31 PM
spatel edited edge metadata.

Patch updated based on Gerolf's feedback:

  1. Renamed patterns to be slightly more meaningful
  2. Split functions that got much too big
  3. Replace lengthy and repetitive 'switch' with selectable array for operand indexes

Also, while stepping through in the debugger, I discovered that my attempted recycling of a virtual register was causing the critical path calculation to be wrong, so I changed that to use a new VR (see comment in the code).

Finally, I added a comment regarding the machine combiner's internal logic: it replaces instructions even if there is no outright win in critical path or resources (just a tie). I'm not sure if that should be changed in a separate patch, but I think it would make it easier to generalize this reassociation pattern further while limiting the compile time hit. Ie, we don't need to have a sequence of 3 identical ops for this reassociation optimization, just 2.

Gerolf added a comment.Jun 9 2015, 3:38 PM

Hi Sanjay,

is this on phabricator?

My understanding of the code is that you reassociate patterns like A = ?? op ??; B = A op X; C = B op Y, but your comments suggest A’s statement can use any operand. Please check in isReassocCandidate():

+ // We must match a simple chain of dependent ops.
+ if (checkPrevOpcode && MI1->getOpcode() != AssocOpcode)
+ return nullptr;

and note that for all incarnations checkPrevOpcode is ‘true’. Or do I miss something?

Cheers
Gerolf

Gerolf added a comment.Jun 9 2015, 3:41 PM

Please see below.

spatel added a comment.Jun 9 2015, 4:30 PM

Hi Sanjay,

is this on phabricator?

Yes - although there may be something odd going on here because I got a duplicate email for each of your comments. Let me know if I need to send directly to the mailing list.

My understanding of the code is that you reassociate patterns like A = ?? op ??; B = A op X; C = B op Y, but your comments suggest A’s statement can use any operand. Please check in isReassocCandidate():

+ // We must match a simple chain of dependent ops.
+ if (checkPrevOpcode && MI1->getOpcode() != AssocOpcode)
+ return nullptr;

and note that for all incarnations checkPrevOpcode is ‘true’. Or do I miss something?

Sorry - my comment here and in the code is not clear enough. I would like to set checkPrevOpcode to 'false' in a follow-on patch. This would allow, for example, this sequence to be optimized:

A = ? mul ?   <--- 'mul' could actually be anything here
B = A add X
C = B add Y

For now, I limited the pattern to only match when there are 3 consecutive 'add' ops:

A = ? add ?
B = A add X
C = B add Y

I did this because if we allow the more liberal matching of the first sequence, I'm seeing that the transform will trigger more often, but not necessarily for any benefit because of the logic in the machine combiner that replaces instructions as long as they are no worse than the original sequence. Please let me know if that makes things clearer.

Gerolf added inline comments.Jun 9 2015, 5:39 PM
lib/Target/X86/X86InstrInfo.cpp
6234 ↗(On Diff #27400)

I'm not happy about the check* parameters that control the pattern match, but I guess it is better to revisit that at a later point when you will support more reassociation pattern. Although I still suggest removing checkPrevOpcode even for this commit.

6286 ↗(On Diff #27400)

If you like, I think this could be implemented this by by a 2x2 matrix and avoid the conditionals.

6293 ↗(On Diff #27400)

else?

6313 ↗(On Diff #27400)

For this commit that parameter could be eliminated. I understand your intention, but it confusing for anyone that comes across the code the first time. So I would add it in a later commit. That will do away with the TODO also.

6334 ↗(On Diff #27400)

Right now the code supports A = ?? op ??. I would make that clear in the comment and change it later in the follow up patch that generalizes this one. Same for the equivalent comments above and below.

6429 ↗(On Diff #27400)

How about something like:
Encodes for each pattern where to find operand in Root and Prev. The operand order in the columns is A, B, X, Y.

6436 ↗(On Diff #27400)

I think you could make the array static and only pass Root and Prev to reassociateOps(). That would result in a smaller signature.

spatel added inline comments.Jun 10 2015, 9:25 AM
lib/Target/X86/X86InstrInfo.cpp
6234 ↗(On Diff #27400)

Sounds good - I removed the checkPrevOpcode param.

6286 ↗(On Diff #27400)

I know we're already relying on the enum values as indexes below, but I'd prefer to keep this logic clearer for now instead of encoding.

6293 ↗(On Diff #27400)

I was originally going with this:
http://llvm.org/docs/CodingStandards.html#don-t-use-else-after-a-return

...but I prefer the symmetry of an 'else' here. Fixed.

6313 ↗(On Diff #27400)

Param removed, however, I think that just pushes the TODO into isReassocCandidate() until we support the general case.

I do intend to continue working on this, so hopefully, the TODOs are quite temporary. :)

6334 ↗(On Diff #27400)

I made the comments line up with the more general logic that is actually implemented - so we don't even need the 'A' instruction line. The current limitation on the 3rd instruction is noted in the TODO comment in isReassocCandidate().

6436 ↗(On Diff #27400)

I may not have implemented this as you were envisioning: I moved the array and op selection into reassociateOps(). Now we have the smaller signature, but we need to pass the Pattern value as the selector. Let me know if you see a better way. Thanks!

spatel updated this revision to Diff 27448.Jun 10 2015, 9:28 AM

Patch updated; see previous inline comments and replies.

Thanks for iterating on this! LGTM!

-Gerolf

This revision was automatically updated to reflect the committed changes.