This is an archive of the discontinued LLVM Phabricator instance.

X86-FMA3: Implemented commute transformations for FMA*_Int instructions
ClosedPublic

Authored by v_klochkov on Nov 10 2015, 12:43 PM.

Details

Summary

Hello,

Please review the patch that implements the commute transformations for
X86-FMA3 FMA*_Int opcodes (i.e. opcodes generated only for scalar FMA intrinsics).

Previously, the commute transformation was implemented for all FMA3 instructions
except FMA*_Int. Please see ( D13269 ) for details.
So, this change-set is mostly a minor tuning/update of the optimization introduced in ( D13269 ).

X86InstrFMA.td:
Set the 'isCommuteble' attribute to 1 for FMA*_Int opcodes.

X86InstrInfo.cpp:
Added FMA*_Int opcodes to isFMA3() routine.
Added a table containing FMA*_Int in groups of three opcodes in each group(132, 213, 231).

fma-commute-x86.ll:
Tightened the checks.
Changed the FMA opcode generated for scalar intrincis on Windows.
The generated code is different now because previously the FMA*_INt instructions
were not commutable. Now they are.
PeepholeOptimizer tries to do memory folding of operands starting from the 1st operand.
The 1st operand cannot be commuted with 3rd (foldable) operand as it chagnes the intrinsic result.
The 2nd operand can be commuted with 3rd. So, it was commuted. The 2nd operand became 3rd
(that also required the opcode changes from 213 to 132); then the operand was folded
with the load.

fma-commute-x86.ll:
Added some test cases for scalar intrinsics.

Thank you,
Slava

Diff Detail

Repository
rL LLVM

Event Timeline

v_klochkov retitled this revision from to X86-FMA3: Implemented commute transformations for FMA*_Int instructions.
v_klochkov updated this object.
v_klochkov added a reviewer: DavidKreitzer.
v_klochkov added subscribers: llvm-commits, qcolombet.

Hi Slava,

I haven’t looked at the test cases closely yet, but I think the patch is pretty good.
I am suggesting one refactoring to avoid some code duplication.

When the refactoring is done, I’ll look closer into the test cases.

Thanks for working on this!

-Quentin

llvm/lib/Target/X86/X86InstrInfo.cpp
2976 ↗(On Diff #39844)

Add an optional parameter: bool *IsIntrinsic = nullptr.
Set it to false at the beginning of the function.

2994 ↗(On Diff #39844)

Put all the intrinsic opcodes together to set the boolean to true and fall through to the non intrinsic cases.

3502 ↗(On Diff #39844)

At first, I was believing the code would remain simpler if we just merge this table with the existing one.
I was guessing we were doing that because we want to know that the opcode is an intrinsic.
Although that is fair, I don’t like the code duplication this implies.

Now, thinking about it, splitting the table is fine, but we can style avoid the code duplication.
Indeed, let say we have a method that tells us before hand that an opcode is an intrinsic.
Using this knowledge, we can:

  • Make the search only in the appropriate table.
  • Reuse the same code for both tables, this is just a matter of setting the boundaries correctly.

For instance, we could add an optional parameter to isFMA3 that says whether or not the opcode is an intrinsic, then update GroupsNum and some new OpcodeGroups like variable.

3513 ↗(On Diff #39844)

Call isFMA3 to get the intrinsic information.
Set a new OpcodeGroups array to either the Intrinsic opcode array or the regular opcode array.
Set GroupsNum accordingly.

3538 ↗(On Diff #39844)

We wouldn’t need that loop anymore.

3541 ↗(On Diff #39844)

This change is not required anymore.

v_klochkov updated this revision to Diff 39982.Nov 11 2015, 3:22 PM
v_klochkov marked 5 inline comments as done.

Did additional refactoring suggested by Quentin:

  • added an optional parameter IsIntrinsic to isFMA3()
  • removed the duplicating code (loop) from getFMA3OpcodeToCommuteOperands().

Hi Quentin,

Thank you for the so quick code-review.
I did the refactoring your suggested to do.
Please see the additional changes.

Thank you,
Slava

llvm/lib/Target/X86/X86InstrInfo.cpp
2976 ↗(On Diff #39844)

Ok, done.

3502 ↗(On Diff #39844)

Ok, good idea, thank you. I did the proposed additional changes.

3541 ↗(On Diff #39844)

The search loop must find a group of 3 opcodes because this routine is called only after isFMA3() check.
So, technically 'return 0' statement is not reachable now.
I did not remove this check just for additional safety, for example, if someone else adds more opcodes to isFMA3(), but does not add them to the opcode groups defined in this routine. In such scenario it is better to just return 0.

qcolombet accepted this revision.Nov 11 2015, 3:49 PM
qcolombet added a reviewer: qcolombet.

Hi Slava,

LGTM with a couple of comments.
Feel free to commit whenever you want.

Cheers,
-Quentin

llvm/lib/Target/X86/X86InstrInfo.cpp
3547 ↗(On Diff #39982)

Good point!
Turn that into an assert maybe?
That way, if it happens, we would know there are things that can be improved :).

Up to you.

llvm/test/CodeGen/X86/fma-commute-x86.ll
12 ↗(On Diff #39982)

I am guessing you put regular expression here because you want the test to be robust against scheduling changes. This is usually not the way we go, i.e., we tend to use the DAG construct for such cases. That being said, I do not remember whether DAG support NEXT. Anyway, for now, we may just match the actual scheduling, unless you saw it changing?

This revision is now accepted and ready to land.Nov 11 2015, 3:49 PM

Thank you for the comments and for the approval!
-Slava

llvm/lib/Target/X86/X86InstrInfo.cpp
3547 ↗(On Diff #39982)

I'll replace it with an assert. Thank you.

llvm/test/CodeGen/X86/fma-commute-x86.ll
12 ↗(On Diff #39982)

Scheduling may depend on harmless changes. I think I noticed some irregularities in how the parameters are copied to registers (i.e. which param is read first). So, this change was kind of preventive.
Please see my next comment below. It looks like this approach happened to be useful already as it helps to avoid the need in fixing such tests after harmless changes in code-gen.

llvm/test/CodeGen/X86/fma-intrinsics-x86.ll
13 ↗(On Diff #39982)

This test is a good example showing my motivations changing the checks.

There was a misprint at the line 13: "rdi" was occasionally used instead of "rdx".
The test passed with that misprint which means that movaps from rcx was going before movaps from rdx.
After I tried running this test with a harmless patch (hopefully it will be submitted for code-review On Thursday Nov 12), this test failed because movaps from rdx for some unknown reasons started being met first.

I also used "movap{{s|d}}" because in some cases movaps is used to prepare the operand for *SD/*PD instructions, and in some cases movapd is used.

This revision was automatically updated to reflect the committed changes.