This is an archive of the discontinued LLVM Phabricator instance.

X86 instr selection: combine ADDSUB + MUL to FMADDSUB
ClosedPublic

Authored by v_klochkov on Dec 23 2016, 9:26 PM.


Details

Reviewers

spatel
RKSimon
delena

Commits

rL291473: X86-specific path: Implemented the fusing of MUL+ADDSUB to FMADDSUB.

Summary

Hello,

Please review the patch that fuses MUL+ADDSUB operations into FMADDSUB
when AVX2 is available.

MUL+ADDSUB are often generated by LLVM (with -ffast-math flag) for
complex MUL operations.

C code:

#include <complex.h>
_Complex double a, b, dst;
void cmul() {
  dst = a * b;
}

asm without patch:

vmovupd b(%rip), %xmm0
vmovddup        a(%rip), %xmm1  # xmm1 = mem[0,0]
vmulpd  %xmm1, %xmm0, %xmm1 <<<<<<<<<<<<<<<<<<<<<<<
vpermilpd       $1, %xmm0, %xmm0 # xmm0 = xmm0[1,0]
vmovddup        a+8(%rip), %xmm2 # xmm2 = mem[0,0]
vmulpd  %xmm2, %xmm0, %xmm0
vaddsubpd       %xmm0, %xmm1, %xmm0 <<<<<<<<<<<<<<<
vmovupd %xmm0, dst(%rip)

asm with the patch:

vmovupd b(%rip), %xmm0
vmovddup        a(%rip), %xmm1  # xmm1 = mem[0,0]
vpermilpd       $1, %xmm0, %xmm2 # xmm2 = xmm0[1,0]
vmovddup        a+8(%rip), %xmm3 # xmm3 = mem[0,0]
vmulpd  %xmm3, %xmm2, %xmm2
vfmaddsub231pd  %xmm1, %xmm0, %xmm2 <<<<<<<<<<<<<<<<<<<
vmovupd %xmm2, dst(%rip)

Thank you,
Vyacheslav Klochkov
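
To make the two assembly sequences in the summary easier to follow, here is a minimal scalar sketch in C++ of the lane semantics involved. It is not code from the patch: `Vec2`, `addsub`, `fmaddsub`, `cmulUnfused`, and `cmulFused` are hypothetical names chosen for illustration, and `std::fma` is assumed as a stand-in for the single-rounding behavior of the hardware instruction.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Vec2 = std::array<double, 2>;  // models one XMM register of two doubles

// ADDSUB lane semantics: subtract in the even lane, add in the odd lane.
Vec2 addsub(Vec2 x, Vec2 y) { return {x[0] - y[0], x[1] + y[1]}; }

// FMADDSUB lane semantics: a*b - c in the even lane, a*b + c in the odd lane.
// std::fma models the fused (single-rounding) multiply-add.
Vec2 fmaddsub(Vec2 a, Vec2 b, Vec2 c) {
  return {std::fma(a[0], b[0], -c[0]), std::fma(a[1], b[1], c[1])};
}

// Complex multiply following the "asm without patch" sequence.
// a and b are stored as {real, imaginary}.
Vec2 cmulUnfused(Vec2 a, Vec2 b) {
  Vec2 aRe = {a[0], a[0]};                          // vmovddup a
  Vec2 p = {b[0] * aRe[0], b[1] * aRe[1]};          // vmulpd
  Vec2 bSwap = {b[1], b[0]};                        // vpermilpd $1
  Vec2 aIm = {a[1], a[1]};                          // vmovddup a+8
  Vec2 q = {bSwap[0] * aIm[0], bSwap[1] * aIm[1]};  // vmulpd
  return addsub(p, q);                              // vaddsubpd
}

// Complex multiply following the "asm with the patch" sequence:
// the first multiply is folded into the FMADDSUB.
Vec2 cmulFused(Vec2 a, Vec2 b) {
  Vec2 aRe = {a[0], a[0]};                          // vmovddup a
  Vec2 bSwap = {b[1], b[0]};                        // vpermilpd $1
  Vec2 aIm = {a[1], a[1]};                          // vmovddup a+8
  Vec2 q = {bSwap[0] * aIm[0], bSwap[1] * aIm[1]};  // vmulpd
  return fmaddsub(b, aRe, q);                       // vfmaddsub231pd
}
```

For inputs whose products are exactly representable, both functions return the same value; in general the fused version rounds once per lane instead of twice, which is why the transformation is gated on fast-math style options.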

Diff Detail

Event Timeline

v_klochkov updated this revision to Diff 82431. Dec 23 2016, 9:26 PM

v_klochkov retitled this revision from to X86 instr selection: combine ADDSUB + MUL to FMADDSUB.

v_klochkov updated this object.

v_klochkov added subscribers: llvm-commits, craig.topper, delena.

RKSimon added reviewers: RKSimon, spatel. Dec 24 2016, 2:58 AM

Thanks for looking at this - I created PR30633 which talked about combining to vfmaddsub/vfmsubadd from buildvector/shuffles of vfmaddadd/vfmaddsub inputs as well, which is likely to happen if the combiner matches individual scalar FMAs first.

llvm/lib/Target/X86/X86ISelLowering.cpp
32048	You should probably add an assert for the ADDSUB opcode - if this got called it's really gone wrong! assert(N->getOpcode() == X86ISD::ADDSUB && "Expected X86ISD::ADDSUB opcode");
32049	Use Subtarget.hasAnyFMA() so FMA4 works as well
32052	Add a comment explaining that this must match the FMA combine logic in DAGCombiner::visitFADDForFMACombine
llvm/test/CodeGen/X86/fmaddsub-combine.ll
3	Please regenerate this with utils\update_llc_test_checks.py and add -mattr=+avx2,+fma to the llc command, possibly add a second pass with -mattr=+avx,+fma4 as well. Please can you add tests with the @llvm.x86.sse3.addsub (and avx equivalent) pd/ps intrinsics. The existing test looks like it can be simplified as well, if you wish to check for load folding add them as separate tests.

Hi Simon,

Thank you for the code review.
I simplified the test and added more test cases to it.
BTW, I wanted to add support for 512-bit ADDSUB and FMADDSUB, but then realized that
512-bit ADDSUB operations are not yet defined in *.td files.

I cannot add tests with intrinsic calls like _mm_mul_ps() + _mm_addsub_ps(),
because addsub intrinsics are lowered too late, much later than mul intrinsics.
That problem should be fixed separately:

	*** IR Dump After Remove unused exception handling info ***
	; Function Attrs: nounwind uwtable
	define <4 x float> @fused_mul_addsub(<4 x float> %a, <4 x float> %b) local_unnamed_addr #0 {
	entry:
	  %call = call fast fastcc <4 x float> @_mm_mul_ps(<4 x float> %a, <4 x float> %b)
	  %call1 = call fast fastcc <4 x float> @_mm_addsub_ps(<4 x float> %call, <4 x float> %a)
	  ret <4 x float> %call1
	}
	*** IR Dump After Function Integration/Inlining ***
	; Function Attrs: nounwind uwtable
	define <4 x float> @fused_mul_addsub(<4 x float> %a, <4 x float> %b) local_unnamed_addr #0 {
	entry:
	  %mul.i = fmul fast <4 x float> %b, %a
	  %0 = call fast <4 x float> @llvm.x86.sse3.addsub.ps(<4 x float> %mul.i, <4 x float> %a) #2
	  ret <4 x float> %0
	}

Also, ADDSUB is not generated for _Complex float types.
For some reason, that complex MUL is lowered differently now,
i.e. the real and imaginary parts are computed independently (MUL+FMA for each part).

Hopefully, this patch is just the first in a series of patches for improving the performance of _Complex MULs and DIVs.

Thank you,
Vyacheslav Klochkov

In D28087#632599, @v_klochkov wrote:

BTW, I wanted to add support for 512-bit ADDSUB and FMADDSUB, but then realized that
512-bit ADDSUB operations are not yet defined in *.td files.

All 512 instructions are defined in .td:
defm VFMADDSUB213 : avx512_fma3p_213_f<0xA6, "vfmaddsub213", X86Fmaddsub, X86FmaddsubRnd>;
..
defm VFMADDSUB231 : avx512_fma3p_231_f<0xB6, "vfmaddsub231", X86Fmaddsub, X86FmaddsubRnd>;
defm VFMSUBADD231 : avx512_fma3p_231_f<0xB7, "vfmsubadd231", X86Fmsubadd, X86FmsubaddRnd>;
..

Elena,

I see my mistake now. The CPU ISA says that ADDSUBPD/PS instructions are available for 128-bit and 256-bit vectors only.
They are not available for ZMM registers.

So, if I want to make it possible to generate 512-bit FMADDSUB instructions, I'll need to slightly change the recognition of ADDSUB idioms, i.e. it should be possible to recognize a 512-bit ADDSUB but never generate a 512-bit X86ISD::ADDSUB.

Thank you,
Vyacheslav Klochkov

In D28087#633340, @v_klochkov wrote:

So, if I want to make it possible to generate 512-bit FMADDSUB instructions, then I'll need to change a little bit the recognition of ADDSUB idioms, i.e. it should be possible to recognize ADDSUB, but never generate 512-bit X86ISD::ADDSUBs.

Does it make sense to convert a 512-bit ADDSUB operation to FMADDSUB with an all-ones multiplier? What does ICC do?

llvm/lib/Target/X86/X86ISelLowering.cpp
32057	I see more checks in visitFADDForFMACombine. Can we take all these checks to a separate function? Something like isFMALegalAndProfitable().
llvm/test/CodeGen/X86/fmaddsub-combine.ll
19	I don't understand the dependency you show in the test. I suppose that the FMADDSUB test should look like this:
	%AB = fmul <2 x double> %A, %B
	%Sub = fsub <2 x double> %AB, %C
	%Add = fadd <2 x double> %AB, %C
	%Addsub = shufflevector <2 x double> %Sub, <2 x double> %Add, <2 x i32> <i32 0, i32 3>
	Do you check the "fast" attribute of the operation itself, or do you rely on global options only?

Does it make sense to convert a 512-bit ADDSUB operation to FMADDSUB with an all-ones multiplier? What does ICC do?

It is an interesting idea, but it is unlikely to be very beneficial.
ICC does not generate 512-bit FMADDSUB(a, 1, b) when it does not have ADDSUB(a,b) because it just does not need plain 512-bit ADDSUB operations.

ADDSUBs are usually needed for Complex MUL and DIV operations and there they are always accompanied by float/double MUL operations, i.e. ICC generates FMADDSUBs and it never needs plain ADDSUBs (at least for targets with AVX512, where FMADDSUB is available).
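
The idea under discussion, expressing a plain ADDSUB as an FMADDSUB whose multiplier is a vector of ones, can be sketched with a scalar model in C++. This is only an illustration under stated assumptions: `addsub`, `fmaddsub`, and `addsubViaFmaddsub` are hypothetical names, vectors are modeled as `std::vector<double>`, and `std::fma` stands in for the fused hardware operation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// ADDSUB: x - y in even lanes, x + y in odd lanes.
Vec addsub(const Vec& x, const Vec& y) {
  Vec r(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    r[i] = (i % 2 == 0) ? x[i] - y[i] : x[i] + y[i];
  return r;
}

// FMADDSUB: a*b - c in even lanes, a*b + c in odd lanes (single rounding).
Vec fmaddsub(const Vec& a, const Vec& b, const Vec& c) {
  Vec r(a.size());
  for (std::size_t i = 0; i < a.size(); ++i)
    r[i] = std::fma(a[i], b[i], (i % 2 == 0) ? -c[i] : c[i]);
  return r;
}

// ADDSUB(x, y) rewritten as FMADDSUB(x, 1, y). Multiplying by 1.0 is exact,
// so the result matches the plain ADDSUB lane for lane.
Vec addsubViaFmaddsub(const Vec& x, const Vec& y) {
  Vec ones(x.size(), 1.0);
  return fmaddsub(x, ones, y);
}
```

With eight doubles per vector this models the 512-bit case that has no native ADDSUB instruction; the extra cost is materializing the vector of ones.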

llvm/lib/Target/X86/X86ISelLowering.cpp
32057	Having such a function would be good, but its design might be debatable, and I do not want to complicate this change-set. Perhaps it should be added in a separate change-set. Also, visitFADDForFMACombine() is shared by various target architectures, so it needs many more checks. The check here (in the X86 part) seems sufficient.
llvm/test/CodeGen/X86/fmaddsub-combine.ll
19	This code relies on global options only. This is my first experience working with SDNodes; please correct me if I am wrong, but SDNodes do not have FastMathFlags, right?

Restructured the ADDSUB idiom recognition. As a result, the 512-bit FMADDSUB idiom can be recognized and X86ISD::FMADDSUB is generated.

The 512-bit ADDSUB idiom can now be recognized for float and double vectors, but a 512-bit X86ISD::ADDSUB is never generated because 512-bit ADDSUB instructions are not available in AVX512. The recognition of 512-bit ADDSUB is needed only as part of the 512-bit FMADDSUB idiom recognition.

Updated the LIT test. Added 512-bit test cases to it. Also made minor updates according to Elena's suggestion.
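
The ADDSUB idiom discussed here is a shuffle that takes even lanes from an FSUB result and odd lanes from an FADD result, as in the masks <0, 3> and <0, 5, 2, 7> used in the LIT test. A minimal sketch of such a mask check, independent of vector width, might look like the following. `isAddSubShuffleMask` is a hypothetical helper written for illustration, not the isAddSub() function from the patch; it assumes the FSUB node is the first shuffle operand and does not handle undef lanes or commuted operands.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true if `mask` picks lane i of the first operand for even i and
// lane i of the second operand for odd i. Following the shufflevector
// convention, indices >= N (the lane count) refer to the second operand.
bool isAddSubShuffleMask(const std::vector<int>& mask) {
  const int n = static_cast<int>(mask.size());
  if (n == 0 || n % 2 != 0)
    return false;
  for (int i = 0; i < n; ++i) {
    const int expected = (i % 2 == 0) ? i : i + n;
    if (mask[i] != expected)
      return false;
  }
  return true;
}
```

Because the check depends only on the lane count, the same predicate accepts 8-lane double and 16-lane float masks, which is the property needed to recognize the idiom for 512-bit vectors without ever emitting a 512-bit ADDSUB node.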

Sorry, just a minor fix for a harmless misprint in the LIT test (replaced 'FMAx3_512' with 'FMA3_512')

In D28087#636693, @v_klochkov wrote:

This is my first experience working with SDNodes; please correct me if I am wrong, but SDNodes do not have FastMathFlags, right?

See SDNodeFlags structure.
const SDNodeFlags *Flags = &cast<BinaryWithFlagsSDNode>(N)->Flags;
But I don't know whether it is a criterion for FMA formation. If yes, it should be checked. If not, "fast" should be removed from the tests.

llvm/lib/Target/X86/X86ISelLowering.cpp
7095	Function names start with a lowercase letter. Same below.
27753	DAG is not used.
27800	You consider the situation when the shuffle is resolved before FMA. But if fadd and fsub are already combined with fmul, fmaddsub will not be created. I'm not sure that this situation is real, except fma are coming from intrinsics ..
27814	VT.is512BitVector()

removed 'fast' attributes from the LIT test;
fixed the function names (the first letter must be lowercase);
removed the unused 'DAG' parameter from the isAddSub() function;
made a few other minor changes.

Elena, thank you for the review and for the comments.
I made additional changes according to your recommendations.

Currently, there is no fp_contract bit in FastMathFlags or in SDNodeFlags,
so the checks used in isFMAddSub() are similar to the existing ones in visitFADDForFMACombine().
That is not correct in all situations, and there are some bugs for this situation. Perhaps this can be fixed quite soon.
For now I just removed the 'fast' attributes from the LIT test.

You consider the situation when the shuffle is resolved before FMA.
But if fadd and fsub are already combined with fmul, fmaddsub will not be created. I'm not sure
that this situation is real, except fma are coming from intrinsics ..

If you mean a pattern like this:

t1 = fma <a,b,c>
t2 = fms <a,b,c>
shuffle t1,t2,...

That is a totally different pattern. I don't see how it can be _automatically_ generated now.
If it happens to be generated later and is important to optimize better, then it can be recognized/optimized, but that would require a separate patch.
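
For reference, the alternative pattern described above (an fma and an fms over the same operands, blended by a shuffle) computes the same lanes as a single FMADDSUB. The scalar C++ model below is a sketch of that equivalence only; `Vec4`, `fmaLanes`, `fmsLanes`, `blendEvenOdd`, and `fmaddsub` are hypothetical names, and `std::fma` is assumed as the model of a fused multiply-add. It does not describe anything the patch recognizes.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Vec4 = std::array<double, 4>;

// t1 = fma <a,b,c>: a*b + c in every lane.
Vec4 fmaLanes(const Vec4& a, const Vec4& b, const Vec4& c) {
  Vec4 r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = std::fma(a[i], b[i], c[i]);
  return r;
}

// t2 = fms <a,b,c>: a*b - c in every lane.
Vec4 fmsLanes(const Vec4& a, const Vec4& b, const Vec4& c) {
  Vec4 r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = std::fma(a[i], b[i], -c[i]);
  return r;
}

// The shuffle: even lanes from `even`, odd lanes from `odd`.
Vec4 blendEvenOdd(const Vec4& even, const Vec4& odd) {
  Vec4 r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = (i % 2 == 0) ? even[i] : odd[i];
  return r;
}

// FMADDSUB: a*b - c in even lanes, a*b + c in odd lanes.
Vec4 fmaddsub(const Vec4& a, const Vec4& b, const Vec4& c) {
  Vec4 r{};
  for (std::size_t i = 0; i < r.size(); ++i)
    r[i] = std::fma(a[i], b[i], (i % 2 == 0) ? -c[i] : c[i]);
  return r;
}
```

Blending `fmsLanes` into the even lanes and `fmaLanes` into the odd lanes yields the `fmaddsub` result, so a later combine could in principle fold that pattern as well.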

delena accepted this revision. Jan 8 2017, 1:54 AM

delena added a reviewer: delena.

This revision is now accepted and ready to land. Jan 8 2017, 1:54 AM

Closed by commit rL291473: X86-specific path: Implemented the fusing of MUL+ADDSUB to FMADDSUB. (authored by v_klochkov). Jan 9 2017, 12:37 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
llvm/lib/Target/X86/X86ISelLowering.cpp	27 lines
llvm/test/CodeGen/X86/fmaddsub-combine.ll	65 lines

Diff 82713

llvm/lib/Target/X86/X86ISelLowering.cpp


[...] static SDValue LowerToHorizontalOp(const BuildVectorSDNode *BV,
    // operands but one are UNDEF.
    if (NumUndefsLO + NumUndefsHI + 1 >= NumElts)
      return SDValue();

    SDLoc DL(BV);
    SDValue InVec0, InVec1;
    if ((VT == MVT::v4f32 || VT == MVT::v2f64) && Subtarget.hasSSE3()) {
      // Try to match an SSE3 float HADD/HSUB.
      if (isHorizontalBinOp(BV, ISD::FADD, DAG, 0, NumElts, InVec0, InVec1))
        return DAG.getNode(X86ISD::FHADD, DL, VT, InVec0, InVec1);

      if (isHorizontalBinOp(BV, ISD::FSUB, DAG, 0, NumElts, InVec0, InVec1))
        return DAG.getNode(X86ISD::FHSUB, DL, VT, InVec0, InVec1);
    } else if ((VT == MVT::v4i32 || VT == MVT::v8i16) && Subtarget.hasSSSE3()) {
      // Try to match an SSSE3 integer HADD/HSUB.
      if (isHorizontalBinOp(BV, ISD::ADD, DAG, 0, NumElts, InVec0, InVec1))
        return DAG.getNode(X86ISD::HADD, DL, VT, InVec0, InVec1);
[...] static SDValue combineShuffleToAddSub(SDNode *N, const X86Subtarget &Subtarget,
    if ((!Subtarget.hasSSE3() || (VT != MVT::v4f32 && VT != MVT::v2f64)) &&
        (!Subtarget.hasAVX() || (VT != MVT::v8f32 && VT != MVT::v4f64)))
      return SDValue();

    // We only handle target-independent shuffles.
    // FIXME: It would be easy and harmless to use the target shuffle mask
    // extraction tool to support more.
    if (N->getOpcode() != ISD::VECTOR_SHUFFLE)
      return SDValue();

    ArrayRef<int> OrigMask = cast<ShuffleVectorSDNode>(N)->getMask();
    SmallVector<int, 8> Mask(OrigMask.begin(), OrigMask.end());

    SDValue V1 = N->getOperand(0);
    SDValue V2 = N->getOperand(1);

    // We require the first shuffle operand to be the FSUB node, and the second to
[...]
  // if we can express this as a single-source shuffle, that's preferable.
  static SDValue combineShuffleOfConcatUndef(SDNode *N, SelectionDAG &DAG,
                                             const X86Subtarget &Subtarget) {
    if (!Subtarget.hasAVX2() || !isa<ShuffleVectorSDNode>(N))
      return SDValue();

    EVT VT = N->getValueType(0);

    // We only care about shuffles of 128/256-bit vectors of 32/64-bit values.
    if (!VT.is128BitVector() && !VT.is256BitVector())
      return SDValue();

    if (VT.getVectorElementType() != MVT::i32 &&
        VT.getVectorElementType() != MVT::i64 &&
        VT.getVectorElementType() != MVT::f32 &&
        VT.getVectorElementType() != MVT::f64)
      return SDValue();

    SDValue N0 = N->getOperand(0);
    SDValue N1 = N->getOperand(1);

    // Check that both sources are concats with undef.
    if (N0.getOpcode() != ISD::CONCAT_VECTORS ||
        N1.getOpcode() != ISD::CONCAT_VECTORS || N0.getNumOperands() != 2 ||
        N1.getNumOperands() != 2 || !N0.getOperand(1).isUndef() ||
        !N1.getOperand(1).isUndef())
      return SDValue();

    // Construct the new shuffle mask. Elements from the first source retain their
    // index, but elements from the second source no longer need to skip an undef.
    SmallVector<int, 8> Mask;
[...] if (C->getType()->isVectorTy()) {
            return Op0;
      } else if (auto *FPConst = dyn_cast<ConstantFP>(C))
        if (isSignBitValue(FPConst))
          return Op0;
    }
    return SDValue();
  }

+ /// Do target specific dag combines of MUL and ADDSUB nodes into FMADDSUB.
+ static SDValue combineAddsub(SDNode *N, SelectionDAG &DAG,
+                              const X86Subtarget &Subtarget) {
+   assert(N->getOpcode() == X86ISD::ADDSUB && "Expected X86ISD::ADDSUB opcode");
+
+   SDValue Op1 = N->getOperand(0);
+   if (Op1->getOpcode() != ISD::FMUL || !Op1->hasOneUse() ||
+       !Subtarget.hasAnyFMA())
+     return SDValue();
+
+   // These checks must match the similar ones in
+   // DAGCombiner::visitFADDForFMACombine.
+   const TargetOptions &Options = DAG.getTarget().Options;
+   bool AllowFusion =
+       (Options.AllowFPOpFusion == FPOpFusion::Fast || Options.UnsafeFPMath);
+   if (!AllowFusion)
+     return SDValue();
+
+   EVT VT = N->getValueType(0);
+   SDValue Op3 = N->getOperand(1);
+   SDValue Op2 = Op1->getOperand(1);
+   Op1 = Op1->getOperand(0);
+
+   return DAG.getNode(X86ISD::FMADDSUB, SDLoc(N), VT, Op1, Op2, Op3);
+ }

  /// Do target-specific dag combines on floating point negations.
  static SDValue combineFneg(SDNode *N, SelectionDAG &DAG,
                             const X86Subtarget &Subtarget) {
    EVT OrigVT = N->getValueType(0);
    SDValue Arg = isFNEG(N);
    assert(Arg.getNode() && "N is expected to be an FNEG node");

    EVT VT = Arg.getValueType();
[...] SDValue X86TargetLowering::PerformDAGCombine(SDNode *N,
    case ISD::LOAD: return combineLoad(N, DAG, DCI, Subtarget);
    case ISD::MLOAD: return combineMaskedLoad(N, DAG, DCI, Subtarget);
    case ISD::STORE: return combineStore(N, DAG, Subtarget);
    case ISD::MSTORE: return combineMaskedStore(N, DAG, Subtarget);
    case ISD::SINT_TO_FP: return combineSIntToFP(N, DAG, Subtarget);
    case ISD::UINT_TO_FP: return combineUIntToFP(N, DAG, Subtarget);
    case ISD::FADD:
    case ISD::FSUB: return combineFaddFsub(N, DAG, Subtarget);
+   case X86ISD::ADDSUB: return combineAddsub(N, DAG, Subtarget);
    case ISD::FNEG: return combineFneg(N, DAG, Subtarget);
    case ISD::TRUNCATE: return combineTruncate(N, DAG, Subtarget);
    case X86ISD::FAND: return combineFAnd(N, DAG, Subtarget);
    case X86ISD::FANDN: return combineFAndn(N, DAG, Subtarget);
    case X86ISD::FXOR:
    case X86ISD::FOR: return combineFOr(N, DAG, Subtarget);
    case X86ISD::FMIN:
    case X86ISD::FMAX: return combineFMinFMax(N, DAG);
[...]

llvm/test/CodeGen/X86/fmaddsub-combine.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+fma | FileCheck -check-prefix=CHECK -check-prefix=CHECK-FMA3 %s
; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+fma4 | FileCheck -check-prefix=CHECK -check-prefix=CHECK-FMA4 %s

; This test checks the fusing of MUL + ADDSUB to FMADDSUB.

define <2 x double> @mul_addsub_pd128(<2 x double> %A, <2 x double> %B) #0 {
; CHECK-LABEL: mul_addsub_pd128:
; CHECK: # BB#0: # %entry
; CHECK-FMA3-NEXT: vfmaddsub213pd %xmm0, %xmm1, %xmm0
; CHECK-FMA4-NEXT: vfmaddsubpd %xmm0, %xmm1, %xmm0, %xmm0
; CHECK-NEXT: retq
entry:
  %AB = fmul <2 x double> %A, %B
  %Sub = fsub fast <2 x double> %AB, %A
  %Add = fadd fast <2 x double> %AB, %A
  %Addsub = shufflevector <2 x double> %Sub, <2 x double> %Add, <2 x i32> <i32 0, i32 3>
  ret <2 x double> %Addsub
}

define <4 x float> @mul_addsub_ps128(<4 x float> %A, <4 x float> %B) #0 {
; CHECK-LABEL: mul_addsub_ps128:
; CHECK: # BB#0: # %entry
; CHECK-FMA3-NEXT: vfmaddsub213ps %xmm0, %xmm1, %xmm0
; CHECK-FMA4-NEXT: vfmaddsubps %xmm0, %xmm1, %xmm0, %xmm0
; CHECK-NEXT: retq
entry:
  %AB = fmul <4 x float> %A, %B
  %Sub = fsub fast <4 x float> %AB, %A
  %Add = fadd fast <4 x float> %AB, %A
  %Addsub = shufflevector <4 x float> %Sub, <4 x float> %Add, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
  ret <4 x float> %Addsub
}

define <4 x double> @mul_addsub_pd256(<4 x double> %A, <4 x double> %B) #0 {
; CHECK-LABEL: mul_addsub_pd256:
; CHECK: # BB#0: # %entry
; CHECK-FMA3-NEXT: vfmaddsub213pd %ymm0, %ymm1, %ymm0
; CHECK-FMA4-NEXT: vfmaddsubpd %ymm0, %ymm1, %ymm0, %ymm0
; CHECK-NEXT: retq
entry:
  %AB = fmul <4 x double> %A, %B
  %Sub = fsub fast <4 x double> %AB, %A
  %Add = fadd fast <4 x double> %AB, %A
  %Addsub = shufflevector <4 x double> %Sub, <4 x double> %Add, <4 x i32> <i32 0, i32 5, i32 2, i32 7>
  ret <4 x double> %Addsub
}

define <8 x float> @mul_addsub_ps256(<8 x float> %A, <8 x float> %B) #0 {
; CHECK-LABEL: mul_addsub_ps256:
; CHECK: # BB#0: # %entry
; CHECK-FMA3-NEXT: vfmaddsub213ps %ymm0, %ymm1, %ymm0
; CHECK-FMA4-NEXT: vfmaddsubps %ymm0, %ymm1, %ymm0, %ymm0
; CHECK-NEXT: retq
entry:
  %AB = fmul <8 x float> %A, %B
  %Sub = fsub fast <8 x float> %AB, %A
  %Add = fadd fast <8 x float> %AB, %A
  %Addsub = shufflevector <8 x float> %Sub, <8 x float> %Add, <8 x i32> <i32 0, i32 9, i32 2, i32 11, i32 4, i32 13, i32 6, i32 15>
  ret <8 x float> %Addsub
}

attributes #0 = { nounwind "unsafe-fp-math"="true" }