This is an archive of the discontinued LLVM Phabricator instance.

[x86] generalize reassociation optimization in machine combiner to 2 instructions
ClosedPublic

Authored by spatel on Jun 15 2015, 2:05 PM.

Details

Summary

Currently ( D10321, http://reviews.llvm.org/rL239486 ), we can use the machine combiner pass to reassociate the following sequence to reduce the critical path:

A = ? op ?
B = A op X
C = B op Y
-->
A = ? op ?
B = X op Y
C = A op B

'op' is currently limited to x86 AVX scalar FP adds (with fast-math on), but in theory, it could be any associative math/logic op (see TODO in code comment).

This patch generalizes the pattern match to ignore the instruction that defines 'A'. So instead of a sequence of 3 adds, we now only need to find 2 dependent adds and decide if it's worth reassociating them.

This generalization has a compile-time cost because we can now match more instruction sequences and we rely more heavily on the machine combiner to discard sequences where reassociation doesn't improve the critical path.

For example, in the new test case:

A = M div N
B = A add X
C = B add Y

We'll match 2 reassociation patterns, but this transform doesn't reduce the critical path:

A = M div N
B = A add Y
C = B add X

We need the combiner to reject that pattern but select this:

A = M div N
B = X add Y
C = B add A

On Mehdi's (hopefully degenerate for x86) test case from the r236031 post-commit thread, the compile-time increases from ~0.2 sec to 5.0 sec because the combiner completes 3963 reassociations. Using test-suite's benchmarking subset, however, the only test where this completes more than 4 times is linpack; there it reassociates 14 times (used to be 11). But I don't see any compile-time difference from doing that extra optimization work.

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 27712.Jun 15 2015, 2:05 PM
spatel retitled this revision from to [x86] generalize reassociation optimization in machine combiner to 2 instructions.
spatel updated this object.
spatel edited the test plan for this revision. (Show Details)
spatel added reviewers: Gerolf, mehdi_amini, qcolombet.
spatel added a subscriber: Unknown Object (MLST).
Gerolf added inline comments.Jun 15 2015, 9:24 PM
lib/CodeGen/MachineCombiner.cpp
214 ↗(On Diff #27712)

remove outright

218 ↗(On Diff #27712)

This is no longer true. The equation could now be '<'.

lib/Target/X86/X86InstrInfo.cpp
6292 ↗(On Diff #27712)

This function does more than the name suggests. Also, I don't find it intuitive that is records two pattern.

spatel added inline comments.Jun 16 2015, 9:23 AM
lib/CodeGen/MachineCombiner.cpp
214 ↗(On Diff #27712)

Fixed.

218 ↗(On Diff #27712)

I read that statement as <= is still the minimum requirement, but let's see if I can make that clearer. I've added another explanatory statement after the formula to explain the role of the new parameter (NewCodeHasLessInsts). Let me know if you see a better way to word this. Thanks!

lib/Target/X86/X86InstrInfo.cpp
6292 ↗(On Diff #27712)

Looking at the AArch64 implementation, I thought it also could record 2 patterns per call. I agree that we should make this more obvious. Suggestions:

  1. getPatterns()
  2. getMachineCombinerPatterns()
  3. getMachineCombinerPatternsForRootInst()

I opted for #2 in this revision of the patch. Since this change is just a naming difference but affects more files, we could make it a follow-on patch?

spatel updated this revision to Diff 27764.Jun 16 2015, 9:24 AM

Patch updated based on Gerolf's feedback. See previous inline comments and replies.

Gerolf edited edge metadata.Jun 18 2015, 6:57 PM

The code looks pretty good, but I'd like to understand better why the new code investigates more patterns. Also, the compile time increase to 5s looks huge. It is probably correct that it must be an outlier, however, is there anything that can be done to protect from a compile-time spike? On the other hand, the extra compile-time could be a good compile-time/performance trade-off. So one possibility I can think of is to check the rate of success. For example: investigated N association patterns, never found a better code sequence (or perhaps some %threashold instead), so let's not waste more time on association patterns in this function. What do you think?
Thanks for clarifying the AARCH64 and MachineCombiner code!

lib/Target/X86/X86InstrInfo.cpp
6284 ↗(On Diff #27764)

There is a bit of code duplication you can avoid eg. by overloading hasVirtual...() and wrapping the code starting at MRI in a function. Then you would get something like if (hasVirtualRegDefsInBasicBlock(Op1,Op2, MBB) && Sibling=findAssocSibling(Op1,Op2,MBB, Commute) && hasVirtualRegDefsInBasicBlock(Sibling, MBB)) return true; return false;

6316 ↗(On Diff #27764)

Allowing more than one pattern was part of the original design. What confused/confuses me is that in your old code you checked if operands had to be commuted in Root and Prev. But now the code only checks Root and potentially investigates two code sequences instead of one. Isn't that more expensive? And given that the order of the operands in Prev is not checked now, should there be a change in reassociateOps() addressing that?

Phab is slow/down, so sending email to list...

The code looks pretty good, but I'd like to understand better why the new code investigates more patterns. Also, the compile time increase to 5s looks huge. It is probably correct that it must be an outlier, however, is there anything that can be done to protect from a compile-time spike? On the other hand, the extra compile-time could be a good compile-time/performance trade-off. So one possibility I can think of is to check the rate of success. For example: investigated N association patterns, never found a better code sequence (or perhaps some %threashold instead), so let's not waste more time on association patterns in this function. What do you think?

It's certainly possible that we'll cause a compile-time spike with this patch (or even the existing code), but I would prefer to leave the safety harness as a follow-on patch pending some evidence that the case actually exists in the real world. Limiting this patch without that evidence seems like a premature compile-time optimization to me. The extra compile time should always be linear to the number of instructions, so it shouldn't explode too far on us.

lib/Target/X86/X86InstrInfo.cpp
6284 ↗(On Diff #27764)

Good point. I took a slightly different approach to reduce even further!

6316 ↗(On Diff #27764)

reassociateOps() doesn't need any changes because the earlier patch assumed this change was coming; we made it (even the comments) assume the more general pattern could happen.

spatel updated this revision to Diff 28128.Jun 22 2015, 11:09 AM
spatel edited edge metadata.

Patch updated based on Gerolf's feedback.

Also, I checked in the name change: hasPattern() -> getMachineCombinerPatterns().
http://reviews.llvm.org/rL240192

...because that's independent and NFC, so this patch is reduced to just the MachineCombiner and x86 files again.

LGTM, but for compile time please add a FIXME before commit. What more evidence does it need? "On Mehdi's (hopefully degenerate for x86) test case from the r236031 post-commit thread, the compile-time increases from ~0.2 sec to 5.0 sec". However, in the current form the patch should have negligible ct impact in general.

mehdi_amini edited edge metadata.Jun 22 2015, 2:26 PM

For the record: the test didn’t come from an X86 test, it is a simplified version of a real-world GPU shader.


Mehdi

This revision was automatically updated to reflect the committed changes.