This is an archive of the discontinued LLVM Phabricator instance.

X86 instr selection: combine ADDSUB + MUL to FMADDSUB
ClosedPublic

Authored by v_klochkov on Dec 23 2016, 9:26 PM.

Details

Summary

Hello,

Please review this patch, which fuses MUL+ADDSUB operations into FMADDSUB
when AVX2 is available.

MUL+ADDSUB sequences are often generated by LLVM (with the -ffast-math flag) for
complex MUL operations.

C code:
#include <complex.h>
_Complex double a, b, dst;
void cmul() {
  dst = a * b;
}

asm without patch:

vmovupd b(%rip), %xmm0
vmovddup        a(%rip), %xmm1  # xmm1 = mem[0,0]
vmulpd  %xmm1, %xmm0, %xmm1 <<<<<<<<<<<<<<<<<<<<<<<
vpermilpd       $1, %xmm0, %xmm0 # xmm0 = xmm0[1,0]
vmovddup        a+8(%rip), %xmm2 # xmm2 = mem[0,0]
vmulpd  %xmm2, %xmm0, %xmm0
vaddsubpd       %xmm0, %xmm1, %xmm0 <<<<<<<<<<<<<<<
vmovupd %xmm0, dst(%rip)

asm with the patch:

vmovupd b(%rip), %xmm0
vmovddup        a(%rip), %xmm1  # xmm1 = mem[0,0]
vpermilpd       $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vmovddup        a+8(%rip), %xmm3 # xmm3 = mem[0,0]
vmulpd  %xmm3, %xmm2, %xmm2
vfmaddsub231pd  %xmm1, %xmm0, %xmm2 <<<<<<<<<<<<<<<<<<<
vmovupd %xmm2, dst(%rip)
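Both sequences compute the standard complex product (ar*br - ai*bi) + i(ar*bi + ai*br); the ADDSUB supplies the alternating subtract/add between the real and imaginary lanes. A minimal C sketch of that lane arithmetic (the function name cmul_lanes is illustrative, not part of the patch):

```c
#include <assert.h>

/* Complex multiply (ar + i*ai) * (br + i*bi), written as the two-lane
 * mul + addsub pattern the compiler emits for <2 x double>:
 *   lane 0 = ar*br - ai*bi   (ADDSUB subtracts in the even lane)
 *   lane 1 = ar*bi + ai*br   (ADDSUB adds in the odd lane)
 * a[0]/a[1] and b[0]/b[1] hold the real/imaginary parts. */
static void cmul_lanes(const double a[2], const double b[2], double dst[2]) {
    double m0 = a[0] * b[0];  /* vmulpd with the broadcast real part */
    double m1 = a[0] * b[1];
    double m2 = a[1] * b[1];  /* vmulpd with the broadcast imaginary part */
    double m3 = a[1] * b[0];
    dst[0] = m0 - m2;         /* vaddsubpd, even lane: subtract */
    dst[1] = m1 + m3;         /* vaddsubpd, odd lane: add */
}
```

With the patch, that final subtract/add is fused with the first multiply into the single vfmaddsub231pd shown above.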

Thank you,
Vyacheslav Klochkov

Diff Detail

Repository
rL LLVM

Event Timeline

v_klochkov updated this revision to Diff 82431.Dec 23 2016, 9:26 PM
v_klochkov retitled this revision from to X86 instr selection: combine ADDSUB + MUL to FMADDSUB.
v_klochkov updated this object.
RKSimon edited edge metadata.Dec 24 2016, 3:23 AM

Thanks for looking at this - I created PR30633 which talked about combining to vfmaddsub/vfmsubadd from buildvector/shuffles of vfmaddadd/vfmaddsub inputs as well, which is likely to happen if the combiner matches individual scalar FMAs first.

llvm/lib/Target/X86/X86ISelLowering.cpp
32048 ↗(On Diff #82431)

You should probably add an assert for the ADDSUB opcode - if this gets called with anything else, something has really gone wrong!

assert(N->getOpcode() == X86ISD::ADDSUB && "Expected X86ISD::ADDSUB opcode");
32049 ↗(On Diff #82431)

Use Subtarget.hasAnyFMA() so FMA4 works as well

32052 ↗(On Diff #82431)

Add a comment explaining that this must match the FMA combine logic in DAGCombiner::visitFADDForFMACombine

llvm/test/CodeGen/X86/fmaddsub-combine.ll
2 ↗(On Diff #82431)

Please regenerate this with utils/update_llc_test_checks.py and add -mattr=+avx2,+fma to the llc command; possibly add a second pass with -mattr=+avx,+fma4 as well.

Please can you add tests with the @llvm.x86.sse3.addsub (and avx equivalent) pd/ps intrinsics.

The existing test looks like it can be simplified as well, if you wish to check for load folding add them as separate tests.

v_klochkov updated this revision to Diff 82713.EditedDec 29 2016, 6:04 PM
v_klochkov edited edge metadata.

Hi Simon,

Thank you for the code-review.
I simplified the test and added more test cases to it.
BTW, I wanted to add support for 512-bit ADDSUB and FMADDSUB, but then realized that
512-bit ADDSUB operations are not yet defined in *.td files.

I cannot add tests with intrinsic calls like: _mm_mul_ps() + _mm_addsub_ps(),
because the addsub intrinsics are lowered too late, much later than the mul intrinsics.
That problem should be fixed separately:

	  *** IR Dump After Remove unused exception handling info ***
	  ; Function Attrs: nounwind uwtable
	  define <4 x float> @fused_mul_addsub(<4 x float> %a, <4 x float> %b) local_unnamed_addr #0 {
	  entry:
	    %call = call fast fastcc <4 x float> @_mm_mul_ps(<4 x float> %a, <4 x float> %b)
	    %call1 = call fast fastcc <4 x float> @_mm_addsub_ps(<4 x float> %call, <4 x float> %a)
	    ret <4 x float> %call1
	  }
	  *** IR Dump After Function Integration/Inlining ***
	  ; Function Attrs: nounwind uwtable
	  define <4 x float> @fused_mul_addsub(<4 x float> %a, <4 x float> %b) local_unnamed_addr #0 {
	  entry:
	    %mul.i = fmul fast <4 x float> %b, %a
	    %0 = call fast <4 x float> @llvm.x86.sse3.addsub.ps(<4 x float> %mul.i, <4 x float> %a) #2
	    ret <4 x float> %0
	  }

Also, ADDSUB is not generated for _Complex float types.
For some reason, that complex MUL is lowered differently now,
i.e. real and imaginary parts are computed independently (MUL+FMA for each part).

Hopefully, this patch is just the 1st in a series of patches for improving performance of _Complex MULs and DIVs.

Thank you,
Vyacheslav Klochkov

delena added a comment.Jan 1 2017, 6:50 AM

BTW, I wanted to add support for 512-bit ADDSUB and FMADDSUB, but then realized that
512-bit ADDSUB operations are not yet defined in *.td files.

All 512-bit instructions are defined in the .td files:
defm VFMADDSUB213 : avx512_fma3p_213_f<0xA6, "vfmaddsub213", X86Fmaddsub, X86FmaddsubRnd>;
..
defm VFMADDSUB231 : avx512_fma3p_231_f<0xB6, "vfmaddsub231", X86Fmaddsub, X86FmaddsubRnd>;
defm VFMSUBADD231 : avx512_fma3p_231_f<0xB7, "vfmsubadd231", X86Fmsubadd, X86FmsubaddRnd>;
..

v_klochkov planned changes to this revision.Jan 1 2017, 10:56 PM

Elena,

I see my mistake now. The ISA defines the ADDSUBPD/PS instructions for 128-bit and 256-bit vectors only;
they are not available for ZMMs.

So, if I want to make it possible to generate 512-bit FMADDSUB instructions, then I'll need to change the recognition of ADDSUB idioms a little, i.e. it should be possible to recognize the ADDSUB idiom but never generate 512-bit X86ISD::ADDSUB nodes.

Thank you,
Vyacheslav Klochkov

So, if I want to make it possible to generate 512-bit FMADDSUB instructions, then I'll need to change the recognition of ADDSUB idioms a little, i.e. it should be possible to recognize the ADDSUB idiom but never generate 512-bit X86ISD::ADDSUB nodes.

Does it make sense to convert a 512-bit ADDSUB operation to FMADDSUB with an all-ones multiplier? What does ICC do?

llvm/lib/Target/X86/X86ISelLowering.cpp
32057 ↗(On Diff #82713)

I see more checks in visitFADDForFMACombine. Can we move all these checks into a separate function? Something like isFMALegalAndProfitable().

llvm/test/CodeGen/X86/fmaddsub-combine.ll
19 ↗(On Diff #82713)

I don't understand the dependency you show in the test.
I suppose that the FMADDSUB test should look like this:

%AB = fmul <2 x double> %A, %B
%Sub = fsub <2 x double> %AB, %C
%Add = fadd <2 x double> %AB, %C
%Addsub = shufflevector <2 x double> %Sub, <2 x double> %Add, <2 x i32> <i32 0, i32 3>

Do you check the "fast" attribute of the operation itself? Or do you rely on global options only?
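The suggested IR sequence can be sanity-checked in plain C. This is an illustrative model (the name fmaddsub_pattern is not from the patch), showing that the <i32 0, i32 3> shuffle mask picks lane 0 of %Sub and lane 1 of %Add, which is exactly the FMADDSUB lane pattern for a 2-element vector:

```c
#include <assert.h>

/* Models the fmul/fsub/fadd/shufflevector sequence:
 *   %AB  = fmul a, b          -> ab[i]  = a[i] * b[i]
 *   %Sub = fsub %AB, c        -> sub[i] = ab[i] - c[i]
 *   %Add = fadd %AB, c        -> add[i] = ab[i] + c[i]
 *   shufflevector %Sub, %Add, <i32 0, i32 3>
 * Shuffle index 0 selects %Sub lane 0; index 3 selects %Add lane 1. */
static void fmaddsub_pattern(const double a[2], const double b[2],
                             const double c[2], double r[2]) {
    double ab[2], sub[2], add[2];
    for (int i = 0; i < 2; ++i) {
        ab[i]  = a[i] * b[i];
        sub[i] = ab[i] - c[i];
        add[i] = ab[i] + c[i];
    }
    r[0] = sub[0];  /* even lane: subtract */
    r[1] = add[1];  /* odd lane: add */
}
```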

Does it make sense to convert a 512-bit ADDSUB operation to FMADDSUB with an all-ones multiplier? What does ICC do?

It is an interesting idea, but it is unlikely to be very beneficial.
ICC does not generate 512-bit FMADDSUB(a, 1, b) in place of the missing ADDSUB(a, b) because it simply never needs plain 512-bit ADDSUB operations.

ADDSUBs are usually needed for Complex MUL and DIV operations and there they are always accompanied by float/double MUL operations, i.e. ICC generates FMADDSUBs and it never needs plain ADDSUBs (at least for targets with AVX512, where FMADDSUB is available).
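The all-ones idea can be stated precisely: ADDSUB(a, b) and FMADDSUB(a, ones, b) have the same lane semantics. A hedged C model of both operations (the helper names addsub4/fmaddsub4 are hypothetical, not from the patch):

```c
#include <assert.h>

/* ADDSUB(a, b):       even lanes a - b, odd lanes a + b.
 * FMADDSUB(x, y, z):  even lanes x*y - z, odd lanes x*y + z.
 * With y = all-ones, FMADDSUB(a, ones, b) reduces to ADDSUB(a, b). */
static void addsub4(const double a[4], const double b[4], double r[4]) {
    for (int i = 0; i < 4; ++i)
        r[i] = (i & 1) ? a[i] + b[i] : a[i] - b[i];
}

static void fmaddsub4(const double x[4], const double y[4],
                      const double z[4], double r[4]) {
    for (int i = 0; i < 4; ++i)
        r[i] = (i & 1) ? x[i] * y[i] + z[i] : x[i] * y[i] - z[i];
}
```

Multiplying by exactly 1.0 is exact in IEEE arithmetic, so the rewrite would be value-preserving; the question above is only whether it is profitable.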

llvm/lib/Target/X86/X86ISelLowering.cpp
32057 ↗(On Diff #82713)

Having such a function would be a good thing, but its exact shape might be debatable, and I do not want to complicate this change-set.
Perhaps it should be added in a separate change-set.

Also, the function visitFADDForFMACombine() is shared across target architectures, so it needs many more checks.
The check here (in X86 part) seems sufficient.

llvm/test/CodeGen/X86/fmaddsub-combine.ll
19 ↗(On Diff #82713)

This code relies on global options only.
This is my first experience working with SDNodes, so please correct me if I am wrong, but SDNodes do not have FastMathFlags, right?

v_klochkov updated this revision to Diff 83206.Jan 5 2017, 2:30 AM
  1. Restructured the ADDSUB idiom recognition. As a result, the 512-bit FMADDSUB idiom can now be recognized and X86ISD::FMADDSUB is generated.

The 512-bit ADDSUB idiom can now be recognized for float and double vectors, but a 512-bit X86ISD::ADDSUB node is never generated because 512-bit ADDSUB instructions are not available in AVX512. The recognition of the 512-bit ADDSUB is needed only as part of the 512-bit FMADDSUB idiom recognition.

  2. Updated the LIT test. Added 512-bit test cases to it. Also made minor updates according to Elena's suggestions.
v_klochkov updated this revision to Diff 83212.Jan 5 2017, 2:42 AM

Sorry, just a minor fix for a harmless misprint in the LIT test (replaced 'FMAx3_512' with 'FMA3_512')

delena added a comment.Jan 5 2017, 3:53 AM

This is my first experience of work on SDNodes..., please fix me if I am wrong..., but SDNodes do not have FastMathFlags, right?

See SDNodeFlags structure.
const SDNodeFlags *Flags = &cast<BinaryWithFlagsSDNode>(N)->Flags;
But I don't know whether it is a criterion for FMA. If yes, it should be checked. If not, the "fast" should be removed from the tests.

llvm/lib/Target/X86/X86ISelLowering.cpp
7096 ↗(On Diff #83212)

Function names should start with lowercase. Same below.

27851 ↗(On Diff #83212)

DAG is not used.

27910 ↗(On Diff #83212)

You consider the situation where the shuffle is resolved before the FMA.
But if the fadd and fsub are already combined with the fmul, the fmaddsub will not be created. I'm not sure that this situation is real, except when the FMAs come from intrinsics.

27924 ↗(On Diff #83212)

VT.is512BitVector()

v_klochkov updated this revision to Diff 83458.Jan 6 2017, 4:28 PM
  • removed the 'fast' attributes from the LIT test;
  • fixed the function names (the first letter must be lowercase);
  • removed the unused 'DAG' parameter from the isAddSub() function;
  • made a few other really minor changes.

Elena, thank you for the review and for the comments.
I made additional changes according to your recommendations.

Currently, there is no fp_contract bit in FastMathFlags or in SDNodeFlags,
so the checks used in isFMAddSub() are similar to the existing ones in visitFADDForFMACombine().
That is not correct in all situations, and there are open bugs about this; perhaps it can be fixed quite soon.
For now, I just removed the 'fast' attributes from the LIT test.

You consider the situation when the shuffle is resolved before FMA.
But if the fadd and fsub are already combined with the fmul, the fmaddsub will not be created. I'm not sure
that this situation is real, except when the FMAs come from intrinsics.

If you are talking about a pattern like this:

t1 = fma <a,b,c>
t2 = fms <a,b,c>
shuffle t1, t2, ...

That is just a totally different pattern. I don't see how it can be _automatically_ generated now.
If it happens to be generated later and is important to optimize better, then it can be recognized/optimized, but that would require a separate patch.

delena accepted this revision.Jan 8 2017, 1:54 AM
delena added a reviewer: delena.
This revision is now accepted and ready to land.Jan 8 2017, 1:54 AM
This revision was automatically updated to reflect the committed changes.