Details

Reviewers

spatel
qcolombet
delena
scanon

Commits

rG3fc3454a0c84: [X86][FMA] Optimize FNEG(FMUL) Patterns
rL254495: [X86][FMA] Optimize FNEG(FMUL) Patterns

Summary

On FMA targets, we can negate a float/double multiply without loading a constant by using FNMSUB instead: -(X*Y) - 0.

Note: As Sanjay mentioned in his bug report, although this is consistently faster (by avoiding the constant load), it does increase register pressure by requiring us to create a zero register. I'm not sure how best to qualify this if people think it's a problem. Restricting the transform to optsize doesn't really help us: we MAY reduce constant pool size (if no other FNEGs are present), but we MAY also increase code size by handling extra stack traffic. We do have precedent for this: we use blendps to zero out elements instead of using the slower insertps; I'm sure there are plenty of other examples.

Fix for PR24366

Diff Detail

Repository: rL LLVM

Event Timeline

RKSimon updated this revision to Diff 40882. Nov 22 2015, 11:41 AM

RKSimon retitled this revision from to [X86][FMA] Optimize FNEG(FMUL) Patterns.

RKSimon updated this object.

RKSimon added reviewers: spatel, delena, qcolombet.

RKSimon set the repository for this revision to rL LLVM.

RKSimon added a subscriber: llvm-commits.

Rebased - moved patch to PerformFNEGCombine

spatel added a reviewer: scanon. Nov 24 2015, 5:36 PM

spatel added inline comments.

lib/Target/X86/X86ISelLowering.cpp
26161 ↗	(On Diff #41085)	Do we need unsafe math for this transform?
26163 ↗	(On Diff #41085)	Aren't the FMA checks enough? Is there some target that assumes that AVX512 includes FMA but does not set the FMA attribute?
26164 ↗	(On Diff #41085)	Nit: could hoist this and reuse below.

Elena, please can you confirm whether we need hasAVX512() as well as hasFMA()?

lib/Target/X86/X86ISelLowering.cpp
26161 ↗	(On Diff #41085)	IMO no, it's not necessary; I was just being conservative. Subtracting from zero at type or internal precision should give the same result.
26163 ↗	(On Diff #41085)	This is what we're doing in all similar cases - I don't have a CPUID spec that discusses whether FMA is guaranteed for AVX512.
26164 ↗	(On Diff #41085)	Easily done - we seem to be very inconsistent as to whether we hoist SDLoc (and sometimes not use it) or only generate them if we actually perform the combine.

AVX512 always contains FMA, you don't need a special check.

Elena

Updated with Sanjay & Elena's comments

In D14909#296372, @delena wrote:

AVX512 always contains FMA, you don't need a special check.

I see what's going on now. This patch needs the AVX512 check in order to work for 512-bit vectors. There should be at least one more test case included with this patch to check the 512-bit vector case. Alternatively, if we want to make 512-bit optimization a separate patch, the predicate should disallow those types.

So I see 2 bugs here. Test case:

define <16 x float> @test_v4f32_fneg_fmul_16(<16 x float> %x, <16 x float> %y) #0 {
  %m = fmul <16 x float> %x, %y
  %n = fsub <16 x float> <float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0>, %m
  ret <16 x float> %n
}

Bug #1: The AVX-512 attribute in the subtarget should imply hasFMA(). Currently, this isn't true, so if you don't explicitly specify FMA, you won't get any expected FMA transforms:

$ ./llc fma.ll -o - -mattr=avx512f
...
   vmulps	%zmm1, %zmm0, %zmm0
   vpxord	LCPI0_0(%rip), %zmm0, %zmm0
   retq

$ ./llc fma.ll -o - -mattr=avx512f,fma
...
  vpxord	%zmm2, %zmm2, %zmm2
  vfnmadd213ps	%zmm2, %zmm1, %zmm0
  retq

Bug #2: This patch isn't checking for legal types, so it will fire for a 512-bit vector on a target that doesn't support it, and we crash:

$ ./llc fma.ll -o - -mattr=fma
  Do not know how to custom type legalize this operation!
  UNREACHABLE executed at /Users/spatel/myllvm/llvm/lib/Target/X86/X86ISelLowering.cpp:19880!

This would also happen if the test case had a type that is illegal for every target, e.g. <17 x float>.

I think the best fix for #2 is to explicitly check for the supported types for a given subtarget rather than waiting for type legalization and then checking the scalar type. This should be a helper function, so we can have the existing code in PerformFMACombine() use it too.
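The helper suggested here might look roughly like the following. This is a self-contained sketch, not LLVM code: `FPType`, `Features`, and `isFMACandidateType` are hypothetical stand-ins for `EVT`, `X86Subtarget`, and the proposed helper. The width rules are an assumption based only on what this review shows (scalar and 128/256-bit types with any FMA flavour, 512-bit types only with AVX-512, and AVX-512 implying FMA as Elena states).

```cpp
// Hypothetical, simplified model of a floating-point value type.
// numElements == 1 means a scalar.
struct FPType {
  unsigned elementBits;
  unsigned numElements;
};

// Hypothetical subtarget feature flags.
struct Features {
  bool fma;
  bool fma4;
  bool avx512;
};

// Assumption: AVX-512 always provides FMA, so any of the three flags enables it.
bool hasAnyFMA(const Features &f) { return f.fma || f.fma4 || f.avx512; }

// Returns true if an FMA node of type `t` could be formed directly on a
// subtarget with features `f`, without relying on type legalization.
bool isFMACandidateType(const FPType &t, const Features &f) {
  if (!hasAnyFMA(f))
    return false;
  // Only f32 and f64 elements are considered.
  if (t.elementBits != 32 && t.elementBits != 64)
    return false;
  if (t.numElements == 1)
    return true; // scalar float/double
  const unsigned totalBits = t.elementBits * t.numElements;
  if (totalBits == 128 || totalBits == 256)
    return true; // XMM/YMM-sized vectors
  // 512-bit vectors need AVX-512; anything else (e.g. <17 x float>) is rejected.
  return totalBits == 512 && f.avx512;
}
```

Under this model, `<16 x float>` with only `fma` set is rejected instead of reaching the custom type legalizer, which is the crash described in Bug #2. The committed patch took a different route and simply waits until the type is legal.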

Thanks Sanjay - I'm going to update fma_patterns.ll and fma_patterns_wide.ll to properly test AVX512, and sync them so more of the tests are tested in the wide pattern as well. That should deal with both AVX512 support and the legal types issues.

RKSimon mentioned this in rL254180: [X86][FMA] Begun adding AVX512 FMA tests. Nov 26 2015, 12:56 PM

RKSimon mentioned this in rL254232: [X86][FMA] More thorough FMA tests. Nov 28 2015, 6:31 AM

RKSimon mentioned this in rL254233: [X86][FMA] Added 512-bit tests to match 128/256-bit tests coverage. Nov 28 2015, 8:07 AM

Updated patch to correctly support 512-bit float types on AVX512 and pre-AVX512 targets.

Note - it turns out we do need the hasAVX512() test to correctly lower on 128-256 bit vectors on AVX512 targets.

scanon added inline comments. Nov 28 2015, 2:52 PM

lib/Target/X86/X86ISelLowering.cpp
26157 ↗	(On Diff #41335)	0 - a*b gets the sign of zero wrong if a or b is exactly zero; the result should be -0, but +0 is returned instead in default rounding. So this should be protected by no signed zeros.
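To make the signed-zero issue concrete, here is a small standalone C++ sketch. It uses `std::fma` as a stand-in for the hardware FMA instruction (an assumption; the function names are illustrative and not LLVM APIs) and assumes the default round-to-nearest mode with no fast-math flags.

```cpp
#include <cmath>

// Reference semantics of fneg(fmul a, b): negate the rounded product.
double negMul(double a, double b) { return -(a * b); }

// FNMADD-style replacement with a +0.0 addend: -(a * b) + 0.0.
// (-a) * b has exactly the value and sign of -(a * b).
double fnmaddZero(double a, double b) { return std::fma(-a, b, 0.0); }

// True when both values compare equal and carry the same sign bit.
bool sameValueAndSign(double x, double y) {
  return x == y && std::signbit(x) == std::signbit(y);
}
```

For non-zero products the two forms agree. With a = 0.0 and b = 1.0, `negMul` yields -0.0 while `fnmaddZero` yields +0.0, because (-0.0) + (+0.0) is +0.0 in round-to-nearest. That difference is what the no-signed-zeros requirement licenses.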

Could this be resolved by using -0 as the constant instead of 0 in non-fast-math mode?

—escha

Could this be resolved by using -0 as the constant instead of 0 in non-fast-math mode?

You'd use VFNMSUB instead to avoid needing to conjure -0. However, that still does the wrong thing in non-default rounding modes. Consider -(0*-1) in round-down. This should produce +0. Instead, we get -0 - (-1*0) = -0.
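The rounding-mode dependence can be reproduced with a short C++ sketch. As assumptions: `std::fma` with a -0.0 addend stands in for FNMSUB with a zero operand, the function name is illustrative, and `volatile` is used only to keep the compiler from evaluating the expression at compile time. Compilers that do not fully honour floating-point environment access may behave differently.

```cpp
#include <cfenv>
#include <cmath>

// Sign bit of the FNMSUB-with-zero form, -(a * b) - 0.0, evaluated as
// fma(-a, b, -0.0) under the requested rounding mode.
bool fnmsubZeroIsNegative(double a, double b, int roundingMode) {
  // volatile keeps the operands opaque so the fma is evaluated at run
  // time, between the two fesetround calls.
  volatile double va = a, vb = b, vc = -0.0;
  const int saved = std::fegetround();
  std::fesetround(roundingMode);
  volatile double result = std::fma(-va, vb, vc);
  std::fesetround(saved); // restore the caller's rounding mode
  const double r = result;
  return std::signbit(r);
}
```

For a = 0.0 and b = -1.0 the exact answer to -(a * b) is +0.0. The sketch returns a non-negative zero under `FE_TONEAREST` but a negative zero under `FE_DOWNWARD`, since (+0.0) + (-0.0) is -0.0 when rounding toward negative infinity.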

So this could be licensed under no-signed-zeros, or also under assume-default-rounding (proposed in http://reviews.llvm.org/D14067).

Updated tests to include versions with/without nsz flags.

Note: I had to tweak the AVX512 tests to target AVX512DQ to lower XOR instructions.

As updated, this seems fine to me.

Thanks, Steve.

Can we assume default rounding for now, change the operand to X86::FNMSUB, remove the nsz check, and leave a 'TODO' comment to revisit this after rounding mode support is added? I think it would be better to have this transform be more general if we can.

Simon - is there a reason to prefer waiting for type legalization before doing the transform? I know that's how it's done in PerformFMACombine(), but it seems that the more common pattern is to check for known legal types for a particular opcode. This allows the transform to fire earlier?

In D14909#298759, @spatel wrote:

Can we assume default rounding for now, change the operand to X86::FNMSUB, remove the nsz check, and leave a 'TODO' comment to revisit this after rounding mode support is added? I think it would be better to have this transform be more general if we can.

I would *really* prefer not to. Not having rounding-mode support means that we should do the conservative thing. Once we have the ability to do the right thing, we can make the default more aggressive.

In D14909#298759, @spatel wrote:

Simon - is there a reason to prefer waiting for type legalization before doing the transform? I know that's how it's done in PerformFMACombine(), but it seems that the more common pattern is to check for known legal types for a particular opcode. This allows the transform to fire earlier?

No reason other than matching the behaviour in PerformFMACombine() - I tend to prefer not creating X86ISD nodes until we really need to - it gives the DAGCombiner as much chance to constant fold etc. as possible.

In D14909#298770, @scanon wrote:

In D14909#298759, @spatel wrote:

Can we assume default rounding for now, change the operand to X86::FNMSUB, remove the nsz check, and leave a 'TODO' comment to revisit this after rounding mode support is added? I think it would be better to have this transform be more general if we can.

I would *really* prefer not to. Not having rounding-mode support means that we should do the conservative thing. Once we have the ability to do the right thing, we can make the default more aggressive.

Ok. We should have a comment here to explain the logic then.
Would it still be better to use FNMSUB now even with the nsz check, so we're minimizing codegen differences after we have rounding-mode support?

In D14909#298784, @spatel wrote:

Would it still be better to use FNMSUB now even with the nsz check, so we're minimizing codegen differences after we have rounding-mode support?

Yes, also because it's more correct (even though the wrong answer is licensed by NSZ, if we can produce the right answer at no cost, we might as well do so).

Updated to use FNMSUB instead of FNMADD.

Added FIXME for once we have rounding control flags

RKSimon updated this object. Nov 30 2015, 2:53 PM

LGTM.

If you don't have a patch in progress for the missing AVX512 commute, add another FIXME comment for that?

This revision is now accepted and ready to land. Dec 1 2015, 8:20 AM

spatel added inline comments. Dec 1 2015, 10:25 AM

lib/Target/X86/X86ISelLowering.cpp
26155 ↗	(On Diff #41439)	We can use "hasAnyFMA()" after: http://reviews.llvm.org/rL254425

Closed by commit rL254495: [X86][FMA] Optimize FNEG(FMUL) Patterns (authored by RKSimon). Dec 2 2015, 1:10 AM

This revision was automatically updated to reflect the committed changes.

In D14909#299484, @spatel wrote:

LGTM.

If you don't have a patch in progress for the missing AVX512 commute, add another FIXME comment for that?

Thanks Sanjay, I have both the missing AVX512 commutations and the FMA scalar domain issues on my todo list.

Diff 41596

llvm/trunk/lib/Target/X86/X86ISelLowering.cpp

[26,151 lines not shown; nearest context: if (((Subtarget->hasSSE3() && (VT == MVT::v4f32 || VT == MVT::v2f64)) ||]
     return DAG.getNode(X86ISD::FHSUB, SDLoc(N), VT, LHS, RHS);
   return SDValue();
 }

 /// Do target-specific dag combines on floating point negations.
 static SDValue PerformFNEGCombine(SDNode *N, SelectionDAG &DAG,
                                   const X86Subtarget *Subtarget) {
   EVT VT = N->getValueType(0);
+  EVT SVT = VT.getScalarType();
   SDValue Arg = N->getOperand(0);
+  SDLoc DL(N);
+
+  // Let legalize expand this if it isn't a legal type yet.
+  if (!DAG.getTargetLoweringInfo().isTypeLegal(VT))
+    return SDValue();
+
+  // If we're negating a FMUL node on a target with FMA, then we can avoid the
+  // use of a constant by performing (-0 - A*B) instead.
+  // FIXME: Check rounding control flags as well once it becomes available.
+  if (Arg.getOpcode() == ISD::FMUL && (SVT == MVT::f32 || SVT == MVT::f64) &&
+      Arg->getFlags()->hasNoSignedZeros() && Subtarget->hasAnyFMA()) {
+    SDValue Zero = DAG.getConstantFP(0.0, DL, VT);
+    return DAG.getNode(X86ISD::FNMSUB, DL, VT, Arg.getOperand(0),
+                       Arg.getOperand(1), Zero);
+  }

   // If we're negating a FMA node, then we can adjust the
   // instruction to include the extra negation.
   if (Arg.hasOneUse()) {
     switch (Arg.getOpcode()) {
     case X86ISD::FMADD:
-      return DAG.getNode(X86ISD::FNMSUB, SDLoc(N), VT, Arg.getOperand(0),
+      return DAG.getNode(X86ISD::FNMSUB, DL, VT, Arg.getOperand(0),
                          Arg.getOperand(1), Arg.getOperand(2));
     case X86ISD::FMSUB:
-      return DAG.getNode(X86ISD::FNMADD, SDLoc(N), VT, Arg.getOperand(0),
+      return DAG.getNode(X86ISD::FNMADD, DL, VT, Arg.getOperand(0),
                          Arg.getOperand(1), Arg.getOperand(2));
     case X86ISD::FNMADD:
-      return DAG.getNode(X86ISD::FMSUB, SDLoc(N), VT, Arg.getOperand(0),
+      return DAG.getNode(X86ISD::FMSUB, DL, VT, Arg.getOperand(0),
                          Arg.getOperand(1), Arg.getOperand(2));
     case X86ISD::FNMSUB:
-      return DAG.getNode(X86ISD::FMADD, SDLoc(N), VT, Arg.getOperand(0),
+      return DAG.getNode(X86ISD::FMADD, DL, VT, Arg.getOperand(0),
                          Arg.getOperand(1), Arg.getOperand(2));
     }
   }
   return SDValue();
 }

 /// Do target-specific dag combines on X86ISD::FOR and X86ISD::FXOR nodes.
 static SDValue PerformFORCombine(SDNode *N, SelectionDAG &DAG,
                                  const X86Subtarget *Subtarget) {
[1,746 lines not shown]

llvm/trunk/test/CodeGen/X86/fma_patterns.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma -fp-contract=fast | FileCheck %s --check-prefix=ALL --check-prefix=FMA
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma4,+fma -fp-contract=fast | FileCheck %s --check-prefix=ALL --check-prefix=FMA4
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma4 -fp-contract=fast | FileCheck %s --check-prefix=ALL --check-prefix=FMA4
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f,+avx512vl -fp-contract=fast | FileCheck %s --check-prefix=ALL --check-prefix=AVX512
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512dq,+avx512vl -fp-contract=fast | FileCheck %s --check-prefix=ALL --check-prefix=AVX512

 ;
 ; Pattern: (fadd (fmul x, y), z) -> (fmadd x,y,z)
 ;

 define float @test_f32_fmadd(float %a0, float %a1, float %a2) {
 ; FMA-LABEL: test_f32_fmadd:
 ; FMA: # BB#0:
[1,090 lines not shown]
 ; AVX512-NEXT: vmovaps %zmm1, %zmm0
 ; AVX512-NEXT: retq
   %m0 = fmul <4 x float> %x, <float 1.0, float 2.0, float 3.0, float 4.0>
   %m1 = fmul <4 x float> %m0, <float 4.0, float 3.0, float 2.0, float 1.0>
   %a = fadd <4 x float> %m1, %y
   ret <4 x float> %a
 }

+; Pattern: (fneg (fmul x, y)) -> (fnmsub x, y, 0)
+
+define double @test_f64_fneg_fmul(double %x, double %y) #0 {
+; FMA-LABEL: test_f64_fneg_fmul:
+; FMA: # BB#0:
+; FMA-NEXT: vxorps %xmm2, %xmm2, %xmm2
+; FMA-NEXT: vfnmsub213sd %xmm2, %xmm1, %xmm0
+; FMA-NEXT: retq
+;
+; FMA4-LABEL: test_f64_fneg_fmul:
+; FMA4: # BB#0:
+; FMA4-NEXT: vxorps %xmm2, %xmm2, %xmm2
+; FMA4-NEXT: vfnmsubsd %xmm2, %xmm1, %xmm0, %xmm0
+; FMA4-NEXT: retq
+;
+; AVX512-LABEL: test_f64_fneg_fmul:
+; AVX512: # BB#0:
+; AVX512-NEXT: vxorps %xmm2, %xmm2, %xmm2
+; AVX512-NEXT: vfnmsub213sd %xmm2, %xmm0, %xmm1
+; AVX512-NEXT: vmovaps %zmm1, %zmm0
+; AVX512-NEXT: retq
+  %m = fmul nsz double %x, %y
+  %n = fsub double -0.0, %m
+  ret double %n
+}
+
+define <4 x float> @test_v4f32_fneg_fmul(<4 x float> %x, <4 x float> %y) #0 {
+; FMA-LABEL: test_v4f32_fneg_fmul:
+; FMA: # BB#0:
+; FMA-NEXT: vxorps %xmm2, %xmm2, %xmm2
+; FMA-NEXT: vfnmsub213ps %xmm2, %xmm1, %xmm0
+; FMA-NEXT: retq
+;
+; FMA4-LABEL: test_v4f32_fneg_fmul:
+; FMA4: # BB#0:
+; FMA4-NEXT: vxorps %xmm2, %xmm2, %xmm2
+; FMA4-NEXT: vfnmsubps %xmm2, %xmm1, %xmm0, %xmm0
+; FMA4-NEXT: retq
+;
+; AVX512-LABEL: test_v4f32_fneg_fmul:
+; AVX512: # BB#0:
+; AVX512-NEXT: vxorps %xmm2, %xmm2, %xmm2
+; AVX512-NEXT: vfnmsub213ps %xmm2, %xmm1, %xmm0
+; AVX512-NEXT: retq
+  %m = fmul nsz <4 x float> %x, %y
+  %n = fsub <4 x float> <float -0.0, float -0.0, float -0.0, float -0.0>, %m
+  ret <4 x float> %n
+}
+
+define <4 x double> @test_v4f64_fneg_fmul(<4 x double> %x, <4 x double> %y) #0 {
+; FMA-LABEL: test_v4f64_fneg_fmul:
+; FMA: # BB#0:
+; FMA-NEXT: vxorpd %ymm2, %ymm2, %ymm2
+; FMA-NEXT: vfnmsub213pd %ymm2, %ymm1, %ymm0
+; FMA-NEXT: retq
+;
+; FMA4-LABEL: test_v4f64_fneg_fmul:
+; FMA4: # BB#0:
+; FMA4-NEXT: vxorpd %ymm2, %ymm2, %ymm2
+; FMA4-NEXT: vfnmsubpd %ymm2, %ymm1, %ymm0, %ymm0
+; FMA4-NEXT: retq
+;
+; AVX512-LABEL: test_v4f64_fneg_fmul:
+; AVX512: # BB#0:
+; AVX512-NEXT: vxorps %ymm2, %ymm2, %ymm2
+; AVX512-NEXT: vfnmsub213pd %ymm2, %ymm1, %ymm0
+; AVX512-NEXT: retq
+  %m = fmul nsz <4 x double> %x, %y
+  %n = fsub <4 x double> <double -0.0, double -0.0, double -0.0, double -0.0>, %m
+  ret <4 x double> %n
+}
+
+define <4 x double> @test_v4f64_fneg_fmul_no_nsz(<4 x double> %x, <4 x double> %y) #0 {
+; ALL-LABEL: test_v4f64_fneg_fmul_no_nsz:
+; ALL: # BB#0:
+; ALL-NEXT: vmulpd %ymm1, %ymm0, %ymm0
+; ALL-NEXT: vxorpd {{.*}}(%rip), %ymm0, %ymm0
+; ALL-NEXT: retq
+  %m = fmul <4 x double> %x, %y
+  %n = fsub <4 x double> <double -0.0, double -0.0, double -0.0, double -0.0>, %m
+  ret <4 x double> %n
+}
+
 attributes #0 = { "unsafe-fp-math"="true" }

llvm/trunk/test/CodeGen/X86/fma_patterns_wide.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma -fp-contract=fast | FileCheck %s --check-prefix=FMA
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma4,+fma -fp-contract=fast | FileCheck %s --check-prefix=FMA4
 ; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx,+fma4 -fp-contract=fast | FileCheck %s --check-prefix=FMA4
-; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f -fp-contract=fast | FileCheck %s --check-prefix=AVX512
+; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512dq -fp-contract=fast | FileCheck %s --check-prefix=AVX512

 ;
 ; Pattern: (fadd (fmul x, y), z) -> (fmadd x,y,z)
 ;

 define <16 x float> @test_16f32_fmadd(<16 x float> %a0, <16 x float> %a1, <16 x float> %a2) {
 ; FMA-LABEL: test_16f32_fmadd:
 ; FMA: # BB#0:
[718 lines not shown]
 ; AVX512-NEXT: vmovaps %zmm1, %zmm0
 ; AVX512-NEXT: retq
   %m0 = fmul <16 x float> %x, <float 1.0, float 2.0, float 3.0, float 4.0, float 5.0, float 6.0, float 7.0, float 8.0, float 9.0, float 10.0, float 11.0, float 12.0, float 13.0, float 14.0, float 15.0, float 16.0>
   %m1 = fmul <16 x float> %m0, <float 16.0, float 15.0, float 14.0, float 13.0, float 12.0, float 11.0, float 10.0, float 9.0, float 8.0, float 7.0, float 6.0, float 5.0, float 4.0, float 3.0, float 2.0, float 1.0>
   %a = fadd <16 x float> %m1, %y
   ret <16 x float> %a
 }

+; Pattern: (fneg (fmul x, y)) -> (fnmsub x, y, 0)
+
+define <16 x float> @test_v16f32_fneg_fmul(<16 x float> %x, <16 x float> %y) #0 {
+; FMA-LABEL: test_v16f32_fneg_fmul:
+; FMA: # BB#0:
+; FMA-NEXT: vxorps %ymm4, %ymm4, %ymm4
+; FMA-NEXT: vfnmsub213ps %ymm4, %ymm2, %ymm0
+; FMA-NEXT: vfnmsub213ps %ymm4, %ymm3, %ymm1
+; FMA-NEXT: retq
+;
+; FMA4-LABEL: test_v16f32_fneg_fmul:
+; FMA4: # BB#0:
+; FMA4-NEXT: vxorps %ymm4, %ymm4, %ymm4
+; FMA4-NEXT: vfnmsubps %ymm4, %ymm2, %ymm0, %ymm0
+; FMA4-NEXT: vfnmsubps %ymm4, %ymm3, %ymm1, %ymm1
+; FMA4-NEXT: retq
+;
+; AVX512-LABEL: test_v16f32_fneg_fmul:
+; AVX512: # BB#0:
+; AVX512-NEXT: vpxord %zmm2, %zmm2, %zmm2
+; AVX512-NEXT: vfnmsub213ps %zmm2, %zmm1, %zmm0
+; AVX512-NEXT: retq
+  %m = fmul nsz <16 x float> %x, %y
+  %n = fsub <16 x float> <float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0, float -0.0>, %m
+  ret <16 x float> %n
+}
+
+define <8 x double> @test_v8f64_fneg_fmul(<8 x double> %x, <8 x double> %y) #0 {
+; FMA-LABEL: test_v8f64_fneg_fmul:
+; FMA: # BB#0:
+; FMA-NEXT: vxorpd %ymm4, %ymm4, %ymm4
+; FMA-NEXT: vfnmsub213pd %ymm4, %ymm2, %ymm0
+; FMA-NEXT: vfnmsub213pd %ymm4, %ymm3, %ymm1
+; FMA-NEXT: retq
+;
+; FMA4-LABEL: test_v8f64_fneg_fmul:
+; FMA4: # BB#0:
+; FMA4-NEXT: vxorpd %ymm4, %ymm4, %ymm4
+; FMA4-NEXT: vfnmsubpd %ymm4, %ymm2, %ymm0, %ymm0
+; FMA4-NEXT: vfnmsubpd %ymm4, %ymm3, %ymm1, %ymm1
+; FMA4-NEXT: retq
+;
+; AVX512-LABEL: test_v8f64_fneg_fmul:
+; AVX512: # BB#0:
+; AVX512-NEXT: vpxord %zmm2, %zmm2, %zmm2
+; AVX512-NEXT: vfnmsub213pd %zmm2, %zmm1, %zmm0
+; AVX512-NEXT: retq
+  %m = fmul nsz <8 x double> %x, %y
+  %n = fsub <8 x double> <double -0.0, double -0.0, double -0.0, double -0.0, double -0.0, double -0.0, double -0.0, double -0.0>, %m
+  ret <8 x double> %n
+}
+
+define <8 x double> @test_v8f64_fneg_fmul_no_nsz(<8 x double> %x, <8 x double> %y) #0 {
+; FMA-LABEL: test_v8f64_fneg_fmul_no_nsz:
+; FMA: # BB#0:
+; FMA-NEXT: vmulpd %ymm3, %ymm1, %ymm1
+; FMA-NEXT: vmulpd %ymm2, %ymm0, %ymm0
+; FMA-NEXT: vmovapd {{.*#+}} ymm2 = [9223372036854775808,9223372036854775808,9223372036854775808,9223372036854775808]
+; FMA-NEXT: vxorpd %ymm2, %ymm0, %ymm0
+; FMA-NEXT: vxorpd %ymm2, %ymm1, %ymm1
+; FMA-NEXT: retq
+;
+; FMA4-LABEL: test_v8f64_fneg_fmul_no_nsz:
+; FMA4: # BB#0:
+; FMA4-NEXT: vmulpd %ymm3, %ymm1, %ymm1
+; FMA4-NEXT: vmulpd %ymm2, %ymm0, %ymm0
+; FMA4-NEXT: vmovapd {{.*#+}} ymm2 = [9223372036854775808,9223372036854775808,9223372036854775808,9223372036854775808]
+; FMA4-NEXT: vxorpd %ymm2, %ymm0, %ymm0
+; FMA4-NEXT: vxorpd %ymm2, %ymm1, %ymm1
+; FMA4-NEXT: retq
+;
+; AVX512-LABEL: test_v8f64_fneg_fmul_no_nsz:
+; AVX512: # BB#0:
+; AVX512-NEXT: vmulpd %zmm1, %zmm0, %zmm0
+; AVX512-NEXT: vxorpd {{.*}}(%rip), %zmm0, %zmm0
+; AVX512-NEXT: retq
+  %m = fmul <8 x double> %x, %y
+  %n = fsub <8 x double> <double -0.0, double -0.0, double -0.0, double -0.0, double -0.0, double -0.0, double -0.0, double -0.0>, %m
+  ret <8 x double> %n
+}
+
 attributes #0 = { "unsafe-fp-math"="true" }

This is an archive of the discontinued LLVM Phabricator instance.

[X86][FMA] Optimize FNEG(FMUL) Patterns
ClosedPublic
