Page MenuHomePhabricator

New X86 FMA3*_Int opcodes for scalar FMA intrinsics.

Authored by v_klochkov on Oct 13 2015, 4:00 PM.



This change-set is one in the series of change-sets improving X86-FMA3 optimizations.
Please see (D11370) and (D13269) for more details.

This change-sets adds new X86-FMA3 opcodes that must be used for SCALAR FMA INTRINSICS.
The new FMA*_Int opcodes are similar to existing ADD*_Int, SUB*_Int, MUL*_Int opcodes.

The key difference between FMA* and FMA*_Int opcodes is that FMA*_Int opcodes are handled
more conservatively. For example, it is illegal to commute 1st and 2nd operands of FMA*_Int
as such commute transformation would change the upper bits of the intrinsic result which should be taken
from the 1st operand of the FMA intrinsic.

So, this patch fixes the existing problem in LLVM X86 Code-Gen.

The definitions of X86-FMA3 opcodes were simplified a lot.
Unused or unnecessary template parameters were removed.
Now the definitions look quite similar to definitions of ADD/SUB/MUL opcodes.

Temporarily, the FMA*_Int opcodes are defined as non-commutable.
This constraint was added to reduce the size of the current patch and it will be eliminated
in the next changes very soon.

  • Simplified the definitions of scalar FMA3 opcodes by removing the template parameters that were unused or not really necessary.
  • Added definitions for FMA3 opcodes generated for scalar FMA instructions.


  • Added the new FMA*_Int opcodes to MemoryFoldTable3 to enable memory-folding optimization.


  • Added tests for scalar FMA intrinscis. PHI-213-to-231 optimization should not be used for scalar intrinsics.
  • Added tests for FNMADD and FNMSUB intrinsics to make the test more complete.


  • Added more test cases to check that 1st and 2nd operands of scalar FMAs generated for intrinsics are not commuted anymore.

Diff Detail


Event Timeline

v_klochkov updated this revision to Diff 37297.Oct 13 2015, 4:00 PM
v_klochkov retitled this revision from to New X86 FMA3*_Int opcodes for scalar FMA intrinsics..
v_klochkov updated this object.
v_klochkov added subscribers: ab, qcolombet, llvm-commits.
qcolombet added inline comments.Oct 13 2015, 4:04 PM
4 ↗(On Diff #37297)

Could you add more check lines to check that the registers are set to the right value?
This just this check I guess that previously buggy code would also match and we don’t want that.

Hi Quentin,

Thank you for the comment. I would rather not adding additional checks to that test,
but would remove/cancel the scalar test cases I added to fma-intrinsics-phi-213-to-231.ll test.
Please see see more details in my answer to your inline comment.

Thank you,

4 ↗(On Diff #37297)

The test cases in this LIT test are quite complex, and adding such additional checks should be quite complex and it may
make the test unstable, pass/fail may depend on many optimizations working on the tested loops.

The initial idea of the test fma-intrinsics-phi-213-to-231.ll _probably_ was only
to check if the corresponding PHI-213-to-231 optimization did its work for FMAs with ADD accumulators.
The idea of my changes was to check that the optimization
DID NOT do changes for SCALAR FMA instructions generated for scalar FMA intrinsics.

Your guess that the new scalar test case would pass all checks even without my new patch is correct.
I did an additional investigation and discovered that PHI-213-to-231 optimization never did any changes
for scalar FMA instructions generated for intrinsics because of the additional REG copies inserted
in the pattern (see the base version of the file the lines 194-196).

After giving it some more thought I think that the scalar test cases I added to this test are
redundant as PHI-213-to-231 optimization is just a special instruction inserter which is never
called for the new FMA*_Int instructions.

The other aspects/transformations like commuting operands should be checked by other specialized tests
like fma-commutex86.ll added in (D13269).

Do you agree with my plan to remove the scalar cases from the test?
Also, it may be useful to add test cases for 128-bit packed FMAs to this test,
I just surprisingly realized that the test has only 256-bit test cases in it.

qcolombet added inline comments.Oct 19 2015, 10:11 AM
4 ↗(On Diff #37297)

Do you agree with my plan to remove the scalar cases from the test?

I believe you if you say they are redundant with the one we will add with D13269. So I am fine with that plan.
However, that does not resolve the problem that the check lines for the PHIs test would have still matched the bad code generation.
How do you suggest to fix that?

v_klochkov updated this revision to Diff 38072.Oct 21 2015, 7:10 PM

Updated the unit test fma-intrinsics-phi-213-to-231.ll:

  • cancelled the insertion of new scalar test cases (they were added in previous version of patch);
  • added 128-bit packed test cases
  • tightened the checks;
v_klochkov marked an inline comment as done.Oct 21 2015, 7:14 PM

Thank you for the comments. Please review the new/updated test.

Thank you,

4 ↗(On Diff #38072)

I updated the test. The checks look reasonably tight now.

delena added a subscriber: delena.Oct 28 2015, 6:43 AM

Is it possible to convert intrinsic to FMA node in DAG lowering phase, like we did in X86IntrinsicInfo.h?

1737 ↗(On Diff #38072)

Do you have a test that checks memory folding of intrinsic?

171 ↗(On Diff #38072)

Why do you need so long test in order to check only one operation? The comment is related to all tests.

v_klochkov updated this revision to Diff 38982.Nov 2 2015, 2:03 PM
v_klochkov marked 2 inline comments as done.

Updated the test fma-intrinsics-x86.ll:

  • added Windows target code-gen;
  • added checks for memory folding optimization opt of FMAs generated for intrinsics.

Hi Elena,

Thank you for reviewing the patch.
I have updated it to add more checks testing the memory folding optimization of FMAs generated for intrinsics.

Also, I am pretty sure that there is no need to use the scheme from X86IntrinsicInfo.h for SCALAR FMA intrinsics.
That approach is usable for packed FMA intrinsics, but for SCALAR FMA intrinscis that would require adding new SDNode operations, the existing FMADD,FMSUB,FNMSUB,FNMADD will not work for SCALAR intrinsics.
There are only few exceptions in X86IntinsicInfo.h where SCALAR intrinsics are handled there, but those are really special operations like VGETMANT intrinsics.

Thank you,

1737 ↗(On Diff #38072)

There were no any tests checking memory folding of FMA intrinsics.
In the updated patch updated the test fma-intrinsics-x86.ll such a way that it tests Windows target and it also checks Memory Folding.

171 ↗(On Diff #38072)

This LIT test is created not by me, that would be a good question to author, but in my opinion the test is reasonably long.
(I suppose that you say that each of test cases is long, right?)
Test case here represents a very typical situation when there is an FMA instruction in a loop and there is a loop dependency going through the ADD path of FMA (i.e. through the 3rd operand of a*b + c operation).
This test case looks a pretty laconic representation of a loop.

delena added inline comments.Nov 3 2015, 7:02 AM
1815 ↗(On Diff #38982)

I don't understand how you can use the 231 form for scalar intrinsic:

intr_fmadd_ss( a, b, c) may be translated as

VFMADD213SS a, b, c
VFMADD132SS a, c, b

but you can't generate VFMADD231SS because "a" should go first, you are taking the upper part from it.

171 ↗(On Diff #38982)

The test checks that FMA intrinsic gives the right form of FMA instruction.
I don't understand why do you need a loop here. We wrote a lot of FMA intrinsic tests without any loops.

485 ↗(On Diff #38982)

you check folding vector load into scalar intrinsic.
On AVX-512 we support folding scalar load to scalar intrinsic., by matching scalar_to_vector(loadf32) pattern in td file


Please see the answers to your questions.

Thank you,

1815 ↗(On Diff #38982)

Very good question. In the file I intentionally added a comment noticing that problem.
Please see the line 215 in that file:

// The FMA 231 form can be get only by commuting the 1st operand of 213 or 231
// forms and is possible only after special analysis of all uses of the initial
// instruction. Such analysis do not exist yet and thus introducing the 231
// form of FMA*_Int instructions is done using an optimistic assumption that
// such analysis will be implemented eventually.

BTW, I noticed a misprint in that comment and I'll fix it: "213 or 231" --> "213 or 132".
If ONLY the lowest element of FMA213 result is used then it is possible to commute the 1st operand.
Such analysis exist and used in other compilers.

171 ↗(On Diff #38982)

The loop is needed to get the right form of FMA instruction, i.e. the 231 form is generated when there is a LOOP DEPENDENCY on the ADD path. The test checks that 231 form is generated for such loops.

485 ↗(On Diff #38982)

I agree, the check tests memory folding of vector load into scalar intrinsic.

Memory folding does not work for such test cases (with and without my patch):

__m128d m = _mm_load_sd(mem);
__m128d res = _mm_fmadd_sd(a, b, m);

This should be fixed, and I think I know how to easily do that, but I would rather do that in a separate patch.

delena accepted this revision.Nov 3 2015, 12:40 PM
delena added a reviewer: delena.

I don't have more questions. Thank you.

171 ↗(On Diff #38982)


This revision is now accepted and ready to land.Nov 3 2015, 12:40 PM
This revision was automatically updated to reflect the committed changes.