This is an archive of the discontinued LLVM Phabricator instance.

X86 instr selection: combine ADDSUB + MUL to FMADDSUB
ClosedPublic

Authored by v_klochkov on Dec 23 2016, 9:26 PM.

Details

Summary

Hello,

Please review this patch, which fuses MUL+ADDSUB operations into FMADDSUB
when AVX2 is available.

MUL+ADDSUB sequences are often generated by LLVM (with the -ffast-math flag) for
complex MUL operations.

C code:
#include <complex.h>
_Complex double a, b, dst;
void cmul() {
  dst = a * b;
}

asm without patch:

vmovupd b(%rip), %xmm0
vmovddup        a(%rip), %xmm1  # xmm1 = mem[0,0]
vmulpd  %xmm1, %xmm0, %xmm1 <<<<<<<<<<<<<<<<<<<<<<<
vpermilpd       $1, %xmm0, %xmm0 # xmm0 = xmm0[1,0]
vmovddup        a+8(%rip), %xmm2 # xmm2 = mem[0,0]
vmulpd  %xmm2, %xmm0, %xmm0
vaddsubpd       %xmm0, %xmm1, %xmm0 <<<<<<<<<<<<<<<
vmovupd %xmm0, dst(%rip)

asm with the patch:

vmovupd b(%rip), %xmm0
vmovddup        a(%rip), %xmm1  # xmm1 = mem[0,0]
vpermilpd       $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vmovddup        a+8(%rip), %xmm3 # xmm3 = mem[0,0]
vmulpd  %xmm3, %xmm2, %xmm2
vfmaddsub231pd  %xmm1, %xmm0, %xmm2 <<<<<<<<<<<<<<<<<<<
vmovupd %xmm2, dst(%rip)
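Both sequences compute the standard complex product (ar*br - ai*bi) + i(ar*bi + ai*br); the ADDSUB supplies the alternating subtract/add between the real and imaginary lanes. A minimal C sketch of that lane arithmetic (the function name cmul_lanes is illustrative, not part of the patch):

```c
#include <assert.h>

/* Complex multiply (ar + i*ai) * (br + i*bi), written as the two-lane
 * mul + addsub pattern the compiler emits for <2 x double>:
 *   lane 0 = ar*br - ai*bi   (ADDSUB subtracts in the even lane)
 *   lane 1 = ar*bi + ai*br   (ADDSUB adds in the odd lane)
 * a[0]/a[1] and b[0]/b[1] hold the real/imaginary parts. */
static void cmul_lanes(const double a[2], const double b[2], double dst[2]) {
    double m0 = a[0] * b[0];  /* vmulpd with the broadcast real part */
    double m1 = a[0] * b[1];
    double m2 = a[1] * b[1];  /* vmulpd with the broadcast imaginary part */
    double m3 = a[1] * b[0];
    dst[0] = m0 - m2;         /* vaddsubpd, even lane: subtract */
    dst[1] = m1 + m3;         /* vaddsubpd, odd lane: add */
}
```

With the patch, that final subtract/add is fused with the first multiply into the single vfmaddsub231pd shown above.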

Thank you,
Vyacheslav Klochkov

Diff Detail

Repository
rL LLVM

Event Timeline

v_klochkov updated this revision to Diff 82431.Dec 23 2016, 9:26 PM
v_klochkov retitled this revision from to X86 instr selection: combine ADDSUB + MUL to FMADDSUB.
v_klochkov updated this object.
RKSimon edited edge metadata.Dec 24 2016, 3:23 AM

Thanks for looking at this - I created PR30633 which talked about combining to vfmaddsub/vfmsubadd from buildvector/shuffles of vfmaddadd/vfmaddsub inputs as well, which is likely to happen if the combiner matches individual scalar FMAs first.

llvm/lib/Target/X86/X86ISelLowering.cpp
32048 ↗(On Diff #82431)

You should probably add an assert for the ADDSUB opcode - if this gets called with anything else, something has really gone wrong!

assert(N->getOpcode() == X86ISD::ADDSUB && "Expected X86ISD::ADDSUB opcode");
32049 ↗(On Diff #82431)

Use Subtarget.hasAnyFMA() so FMA4 works as well

32052 ↗(On Diff #82431)

Add a comment explaining that this must match the FMA combine logic in DAGCombiner::visitFADDForFMACombine

llvm/test/CodeGen/X86/fmaddsub-combine.ll
2 ↗(On Diff #82431)

Please regenerate this with utils/update_llc_test_checks.py and add -mattr=+avx2,+fma to the llc command; possibly add a second pass with -mattr=+avx,+fma4 as well.

Please can you add tests with the @llvm.x86.sse3.addsub (and avx equivalent) pd/ps intrinsics.

The existing test looks like it can be simplified as well, if you wish to check for load folding add them as separate tests.

v_klochkov updated this revision to Diff 82713.EditedDec 29 2016, 6:04 PM
v_klochkov edited edge metadata.

Hi Simon,

Thank you for the code-review.
I simplified the test and added more test cases to it.
BTW, I wanted to add support for 512-bit ADDSUB and FMADDSUB, but then realized that
512-bit ADDSUB operations are not yet defined in *.td files.

I cannot add tests with intrinsic calls like: _mm_mul_ps() + _mm_addsub_ps(),
because the addsub intrinsics are lowered too late, much later than the mul intrinsics.
That problem should be fixed separately:

	  *** IR Dump After Remove unused exception handling info ***
	  ; Function Attrs: nounwind uwtable
	  define <4 x float> @fused_mul_addsub(<4 x float> %a, <4 x float> %b) local_unnamed_addr #0 {
	  entry:
	    %call = call fast fastcc <4 x float> @_mm_mul_ps(<4 x float> %a, <4 x float> %b)
	    %call1 = call fast fastcc <4 x float> @_mm_addsub_ps(<4 x float> %call, <4 x float> %a)
	    ret <4 x float> %call1
	  }
	  *** IR Dump After Function Integration/Inlining ***
	  ; Function Attrs: nounwind uwtable
	  define <4 x float> @fused_mul_addsub(<4 x float> %a, <4 x float> %b) local_unnamed_addr #0 {
	  entry:
	    %mul.i = fmul fast <4 x float> %b, %a
	    %0 = call fast <4 x float> @llvm.x86.sse3.addsub.ps(<4 x float> %mul.i, <4 x float> %a) #2
	    ret <4 x float> %0
	  }

Also, ADDSUB is not generated for _Complex float types.
For some reason, that complex MUL is lowered differently now,
i.e. real and imaginary parts are computed independently (MUL+FMA for each part).

Hopefully, this patch is just the 1st in a series of patches for improving performance of _Complex MULs and DIVs.

Thank you,
Vyacheslav Klochkov

delena added a comment.Jan 1 2017, 6:50 AM

BTW, I wanted to add support for 512-bit ADDSUB and FMADDSUB, but then realized that
512-bit ADDSUB operations are not yet defined in *.td files.

All 512-bit instructions are defined in the .td files:
defm VFMADDSUB213 : avx512_fma3p_213_f<0xA6, "vfmaddsub213", X86Fmaddsub, X86FmaddsubRnd>;
..
defm VFMADDSUB231 : avx512_fma3p_231_f<0xB6, "vfmaddsub231", X86Fmaddsub, X86FmaddsubRnd>;
defm VFMSUBADD231 : avx512_fma3p_231_f<0xB7, "vfmsubadd231", X86Fmsubadd, X86FmsubaddRnd>;
..

v_klochkov planned changes to this revision.Jan 1 2017, 10:56 PM

Elena,

I see my mistake now. The ISA defines the ADDSUBPD/PS instructions for 128-bit and 256-bit vectors only;
they are not available for ZMMs.

So, if I want to make it possible to generate 512-bit FMADDSUB instructions, then I'll need to change the recognition of ADDSUB idioms a little, i.e. it should be possible to recognize the ADDSUB idiom but never generate 512-bit X86ISD::ADDSUB nodes.

Thank you,
Vyacheslav Klochkov

So, if I want to make it possible to generate 512-bit FMADDSUB instructions, then I'll need to change the recognition of ADDSUB idioms a little, i.e. it should be possible to recognize the ADDSUB idiom but never generate 512-bit X86ISD::ADDSUB nodes.

Does it make sense to convert a 512-bit ADDSUB operation to FMADDSUB with an all-ones multiplier? What does ICC do?

llvm/lib/Target/X86/X86ISelLowering.cpp
32057 ↗(On Diff #82713)

I see more checks in visitFADDForFMACombine. Can we move all these checks into a separate function? Something like isFMALegalAndProfitable().

llvm/test/CodeGen/X86/fmaddsub-combine.ll
19 ↗(On Diff #82713)

I don't understand the dependency you show in the test.
I suppose that the FMADDSUB test should look like this:

%AB = fmul <2 x double> %A, %B
%Sub = fsub <2 x double> %AB, %C
%Add = fadd <2 x double> %AB, %C
%Addsub = shufflevector <2 x double> %Sub, <2 x double> %Add, <2 x i32> <i32 0, i32 3>

Do you check the "fast" attribute of the operation itself? Or do you rely on global options only?
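The suggested IR sequence can be sanity-checked in plain C. This is an illustrative model (the name fmaddsub_pattern is not from the patch), showing that the <i32 0, i32 3> shuffle mask picks lane 0 of %Sub and lane 1 of %Add, which is exactly the FMADDSUB lane pattern for a 2-element vector:

```c
#include <assert.h>

/* Models the fmul/fsub/fadd/shufflevector sequence:
 *   %AB  = fmul a, b          -> ab[i]  = a[i] * b[i]
 *   %Sub = fsub %AB, c        -> sub[i] = ab[i] - c[i]
 *   %Add = fadd %AB, c        -> add[i] = ab[i] + c[i]
 *   shufflevector %Sub, %Add, <i32 0, i32 3>
 * Shuffle index 0 selects %Sub lane 0; index 3 selects %Add lane 1. */
static void fmaddsub_pattern(const double a[2], const double b[2],
                             const double c[2], double r[2]) {
    double ab[2], sub[2], add[2];
    for (int i = 0; i < 2; ++i) {
        ab[i]  = a[i] * b[i];
        sub[i] = ab[i] - c[i];
        add[i] = ab[i] + c[i];
    }
    r[0] = sub[0];  /* even lane: subtract */
    r[1] = add[1];  /* odd lane: add */
}
```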

Does it make sense to convert a 512-bit ADDSUB operation to FMADDSUB with an all-ones multiplier? What does ICC do?

It is an interesting idea, but it is unlikely to be very beneficial.
ICC does not generate 512-bit FMADDSUB(a, 1, b) in place of the missing ADDSUB(a, b) because it simply never needs plain 512-bit ADDSUB operations.

ADDSUBs are usually needed for Complex MUL and DIV operations and there they are always accompanied by float/double MUL operations, i.e. ICC generates FMADDSUBs and it never needs plain ADDSUBs (at least for targets with AVX512, where FMADDSUB is available).
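The all-ones idea can be stated precisely: ADDSUB(a, b) and FMADDSUB(a, ones, b) have the same lane semantics. A hedged C model of both operations (the helper names addsub4/fmaddsub4 are hypothetical, not from the patch):

```c
#include <assert.h>

/* ADDSUB(a, b):       even lanes a - b, odd lanes a + b.
 * FMADDSUB(x, y, z):  even lanes x*y - z, odd lanes x*y + z.
 * With y = all-ones, FMADDSUB(a, ones, b) reduces to ADDSUB(a, b). */
static void addsub4(const double a[4], const double b[4], double r[4]) {
    for (int i = 0; i < 4; ++i)
        r[i] = (i & 1) ? a[i] + b[i] : a[i] - b[i];
}

static void fmaddsub4(const double x[4], const double y[4],
                      const double z[4], double r[4]) {
    for (int i = 0; i < 4; ++i)
        r[i] = (i & 1) ? x[i] * y[i] + z[i] : x[i] * y[i] - z[i];
}
```

Multiplying by exactly 1.0 is exact in IEEE arithmetic, so the rewrite would be value-preserving; the question above is only whether it is profitable.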

llvm/lib/Target/X86/X86ISelLowering.cpp
32057 ↗(On Diff #82713)

Having such a function would be a good thing, but its exact shape might be debatable, and I do not want to complicate this change-set.
Perhaps it should be added in a separate change-set.

Also, the function visitFADDForFMACombine() is shared across target architectures, so it needs many more checks.
The check here (in X86 part) seems sufficient.

llvm/test/CodeGen/X86/fmaddsub-combine.ll
19 ↗(On Diff #82713)

This code relies on global options only.
This is my first experience working with SDNodes, so please correct me if I am wrong, but SDNodes do not have FastMathFlags, right?

v_klochkov updated this revision to Diff 83206.Jan 5 2017, 2:30 AM
  1. Restructured the ADDSUB idiom recognition. As a result, the 512-bit FMADDSUB idiom can now be recognized and X86ISD::FMADDSUB is generated.

The 512-bit ADDSUB idiom can now be recognized for float and double vectors, but a 512-bit X86ISD::ADDSUB node is never generated because 512-bit ADDSUB instructions are not available in AVX512. The recognition of the 512-bit ADDSUB is needed only as part of the 512-bit FMADDSUB idiom recognition.

  2. Updated the LIT test. Added 512-bit test cases to it. Also made minor updates according to Elena's suggestions.
v_klochkov updated this revision to Diff 83212.Jan 5 2017, 2:42 AM

Sorry, just a minor fix for a harmless misprint in the LIT test (replaced 'FMAx3_512' with 'FMA3_512')

delena added a comment.Jan 5 2017, 3:53 AM

This is my first experience of work on SDNodes..., please fix me if I am wrong..., but SDNodes do not have FastMathFlags, right?

See SDNodeFlags structure.
const SDNodeFlags *Flags = &cast<BinaryWithFlagsSDNode>(N)->Flags;
But I don't know whether it is a criterion for FMA. If yes, it should be checked. If not, the "fast" should be removed from the tests.

llvm/lib/Target/X86/X86ISelLowering.cpp
7096 ↗(On Diff #83212)

Function names should start with lowercase. Same below.

27851 ↗(On Diff #83212)

DAG is not used.

27910 ↗(On Diff #83212)

You consider the situation where the shuffle is resolved before the FMA.
But if the fadd and fsub are already combined with the fmul, the fmaddsub will not be created. I'm not sure that this situation is real, except when the FMAs come from intrinsics.

27924 ↗(On Diff #83212)

VT.is512BitVector()

v_klochkov updated this revision to Diff 83458.Jan 6 2017, 4:28 PM
  • removed the 'fast' attributes from the LIT test;
  • fixed the function names (the first letter must be lowercase);
  • removed the unused 'DAG' parameter from the isAddSub() function;
  • made a few other really minor changes.

Elena, thank you for the review and for the comments.
I made additional changes according to your recommendations.

Currently, there is no fp_contract bit in FastMathFlags or in SDNodeFlags,
so the checks used in isFMAddSub() are similar to the existing ones in visitFADDForFMACombine().
That is not correct in all situations, and there are open bugs about this; perhaps it can be fixed quite soon.
For now, I just removed the 'fast' attributes from the LIT test.

You consider the situation when the shuffle is resolved before FMA.
But if the fadd and fsub are already combined with the fmul, the fmaddsub will not be created. I'm not sure
that this situation is real, except when the FMAs come from intrinsics.

If you are talking about a pattern like this:

t1 = fma <a,b,c>
t2 = fms <a,b,c>
shuffle t1, t2, ...

That is just a totally different pattern. I don't see how it can be _automatically_ generated now.
If it happens to be generated later and is important to optimize better, then it can be recognized/optimized, but that would require a separate patch.

delena accepted this revision.Jan 8 2017, 1:54 AM
delena added a reviewer: delena.
This revision is now accepted and ready to land.Jan 8 2017, 1:54 AM
This revision was automatically updated to reflect the committed changes.