This is an archive of the discontinued LLVM Phabricator instance.

[X86] Enable enableAggressiveFMAFusion to true for FMA capable targets (PR36826)
Changes Planned · Public

Authored by RKSimon on Apr 5 2022, 10:35 AM.

Details

Reviewers

craig.topper
pengfei
spatel
andrew.w.kaylor

Summary

If the X86 subtarget supports FMA, allow it to aggressively generate FMA nodes, even if that duplicates the multiply and leaves both mul(x,y) and fma(x,y,z) in the output.

This demonstrates a likely flaw in the existing enableAggressiveFMAFusion folds: should we fold fadd(fmul(x,y), fmul(x,y)) -> fma(x,y,fmul(x,y))?
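The trade-off in the summary can be sketched with a toy model. This is not the DAGCombiner code: `shouldFuse` and `instructionCount` are hypothetical helpers, and the one-use rule for the non-aggressive case is an assumption about how such folds are usually gated. The only behaviour taken from the summary is that aggressive fusion forms an FMA even when the multiply has other users.

```cpp
#include <cassert>

// Hypothetical gate: fuse fadd(fmul(x, y), z) into fma(x, y, z) when the
// multiply has a single user, or unconditionally in aggressive mode.
bool shouldFuse(bool aggressive, int mulUses) {
  return aggressive || mulUses == 1;
}

// Counts the FP instructions left for one fadd whose operand is a multiply
// with `mulUses` users.
int instructionCount(bool aggressive, int mulUses) {
  if (!shouldFuse(aggressive, mulUses))
    return 2;                  // fmul + fadd
  return mulUses > 1 ? 2 : 1;  // fma, plus the fmul kept for its other users
}
```

In fadd(fmul(x,y), fmul(x,y)) the multiply feeds both operands of the add, so it has two uses. Under this model `instructionCount(true, 2)` and `instructionCount(false, 2)` both return 2: aggressive fusion saves no instruction here and only swaps the add for an fma.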

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,100 ms	x64 debian > LLVM.CodeGen/NVPTX::wmma.py

Event Timeline

RKSimon created this revision. Apr 5 2022, 10:35 AM

Herald added a project: Restricted Project. Apr 5 2022, 10:35 AM

Herald added subscribers: StephenFan, hiraditya.

RKSimon requested review of this revision. Apr 5 2022, 10:35 AM

Herald added a project: Restricted Project. Apr 5 2022, 10:35 AM

craig.topper added inline comments. Apr 5 2022, 11:25 AM

llvm/test/CodeGen/X86/dag-fmf-cse.ll

This comment needs to be updated.

I guess this was the flaw you were referring to?

This looks like a regression on Haswell and Broadwell.

Latencies (cycles)

Instruction	HSW	BDW	SKL
vaddss	3	3	4
vmulss	5	3	4
fma	5	5	4
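The regression claim can be checked with a little arithmetic on the table above. The sketch below assumes the two instructions form a serial dependency chain (the multiply result feeds the add, or the addend of the fma) and ignores throughput and port pressure. The struct and function names are illustrative only.

```cpp
#include <cassert>

// Latencies in cycles, copied from the table above.
struct Latency { int add, mul, fma; };

constexpr Latency HSW{3, 5, 5};
constexpr Latency BDW{3, 3, 5};
constexpr Latency SKL{4, 4, 4};

// Old output: vmulss feeds vaddss.
constexpr int mulThenAdd(Latency l) { return l.mul + l.add; }

// New output: vmulss feeds the addend of vfmadd213ss.
constexpr int mulThenFma(Latency l) { return l.mul + l.fma; }
```

Under these assumptions the chain grows from 8 to 10 cycles on Haswell and from 6 to 8 on Broadwell, and stays at 8 on Skylake.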

Harbormaster completed remote builds in B158031: Diff 420573. Apr 5 2022, 11:37 AM

RKSimon added inline comments. Apr 5 2022, 12:26 PM

llvm/test/CodeGen/X86/dag-fmf-cse.ll
12	Yes, I'm not convinced DAGCombine should be folding the fadd(fmul(x,y),fmul(x,y)) case for any target tbh.

Matt added a subscriber: Matt. Apr 7 2022, 12:02 PM

RKSimon planned changes to this revision. Apr 19 2022, 8:22 AM

Revision Contents

Path	Size
llvm/lib/Target/X86/X86ISelLowering.h	3 lines
llvm/lib/Target/X86/X86ISelLowering.cpp	4 lines
llvm/test/CodeGen/X86/dag-fmf-cse.ll	8 lines
llvm/test/CodeGen/X86/fmsubadd-combine.ll	37 lines

Diff 420573

llvm/lib/Target/X86/X86ISelLowering.h

... (1,104 lines not shown) ... public:

  bool convertSetCCLogicToBitwiseLogic(EVT VT) const override {
    return VT.isScalarInteger();
  }

  /// Vector-sized comparisons are fast using PCMPEQ + PMOVMSK or PTEST.
  MVT hasFastEqualityCompare(unsigned NumBits) const override;

+ /// Force aggressive FMA fusion.
+ bool enableAggressiveFMAFusion(EVT VT) const override;
+
  /// Return the value type to use for ISD::SETCC.
  EVT getSetCCResultType(const DataLayout &DL, LLVMContext &Context,
                         EVT VT) const override;

  bool targetShrinkDemandedConstant(SDValue Op, const APInt &DemandedBits,
                                    const APInt &DemandedElts,
                                    TargetLoweringOpt &TLO) const override;

... (647 lines not shown) ...

llvm/lib/Target/X86/X86ISelLowering.cpp

... (5,849 lines not shown) ... MVT X86TargetLowering::hasFastEqualityCompare(unsigned NumBits) const {

    // TODO: Allow 64-bit type for 32-bit target.
    // TODO: 512-bit types should be allowed, but make sure that those
    // cases are handled in combineVectorSizedSetCCEquality().

    return MVT::INVALID_SIMPLE_VALUE_TYPE;
  }

+ bool X86TargetLowering::enableAggressiveFMAFusion(EVT VT) const {
+   return Subtarget.hasAnyFMA();
+ }
+
  /// Val is the undef sentinel value or equal to the specified value.
  static bool isUndefOrEqual(int Val, int CmpVal) {
    return ((Val == SM_SentinelUndef) || (Val == CmpVal));
  }

  /// Return true if every element in Mask is the undef sentinel value or equal to
  /// the specified value..
  static bool isUndefOrEqual(ArrayRef<int> Mask, int CmpVal) {
... (9,991 lines not shown) ...

llvm/test/CodeGen/X86/dag-fmf-cse.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py

; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=fma -enable-unsafe-fp-math | FileCheck %s

; If fast-math-flags are propagated correctly, the mul1 expression

; should be recognized as a factor in the last fsub, so we should

; see a mul and add, not a mul and fma:

craig.topper (Unsubmitted, Not Done):

This comment needs to be updated.

; a * b - (-a * b) ---> (a * b) + (a * b)

define float @fmf_should_not_break_cse(float %a, float %b) {

; CHECK-LABEL: fmf_should_not_break_cse:

; CHECK: # %bb.0:

- ; CHECK-NEXT: vmulss %xmm1, %xmm0, %xmm0

+ ; CHECK-NEXT: vmulss %xmm1, %xmm0, %xmm2

craig.topper (Unsubmitted, Not Done):

I guess this was the flaw you were referring to?

This looks like a regression on Haswell and Broadwell.

Latencies (cycles)

Instruction	HSW	BDW	SKL
vaddss	3	3	4
vmulss	5	3	4
fma	5	5	4

RKSimon (Author, Unsubmitted, Not Done):

Yes, I'm not convinced DAGCombine should be folding the fadd(fmul(x,y),fmul(x,y)) case for any target tbh.

- ; CHECK-NEXT: vaddss %xmm0, %xmm0, %xmm0

+ ; CHECK-NEXT: vfmadd213ss {{.*#+}} xmm0 = (xmm1 * xmm0) + xmm2

; CHECK-NEXT: retq

%mul1 = fmul fast float %a, %b

%nega = fsub fast float 0.0, %a

%mul2 = fmul fast float %nega, %b

%abx2 = fsub fast float %mul1, %mul2

ret float %abx2

}

define <4 x float> @fmf_should_not_break_cse_vector(<4 x float> %a, <4 x float> %b) {

; CHECK-LABEL: fmf_should_not_break_cse_vector:

; CHECK: # %bb.0:

- ; CHECK-NEXT: vmulps %xmm1, %xmm0, %xmm0

+ ; CHECK-NEXT: vmulps %xmm1, %xmm0, %xmm2

- ; CHECK-NEXT: vaddps %xmm0, %xmm0, %xmm0

+ ; CHECK-NEXT: vfmadd213ps {{.*#+}} xmm0 = (xmm1 * xmm0) + xmm2

; CHECK-NEXT: retq

%mul1 = fmul fast <4 x float> %a, %b

%nega = fsub fast <4 x float> <float 0.0, float 0.0, float 0.0, float 0.0>, %a

%mul2 = fmul fast <4 x float> %nega, %b

%abx2 = fsub fast <4 x float> %mul1, %mul2

ret <4 x float> %abx2

}

llvm/test/CodeGen/X86/fmsubadd-combine.ll

  ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
- ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+avx | FileCheck %s -check-prefixes=CHECK,NOFMA
- ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+fma | FileCheck %s -check-prefixes=CHECK,FMA3,FMA3_256
- ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+fma,+avx512f | FileCheck %s -check-prefixes=CHECK,FMA3,FMA3_512
- ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+fma4 | FileCheck %s -check-prefixes=CHECK,FMA4
+ ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+avx | FileCheck %s -check-prefixes=NOFMA
+ ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+fma | FileCheck %s -check-prefixes=FMA3,FMA3_256
+ ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+fma,+avx512f | FileCheck %s -check-prefixes=FMA3,FMA3_512
+ ; RUN: llc < %s -mtriple=x86_64-unknown-linux-gnu -mattr=+fma4 | FileCheck %s -check-prefixes=FMA4

  ; This test checks the fusing of MUL + SUB/ADD to FMSUBADD.

  define <2 x double> @mul_subadd_pd128(<2 x double> %A, <2 x double> %B, <2 x double> %C) #0 {
  ; NOFMA-LABEL: mul_subadd_pd128:
  ; NOFMA: # %bb.0: # %entry
  ; NOFMA-NEXT: vmulpd %xmm1, %xmm0, %xmm0
  ; NOFMA-NEXT: vsubpd %xmm2, %xmm0, %xmm1
... (167 lines not shown) ... entry:
    %Sub = fsub <16 x float> %AB, %C
    %Add = fadd <16 x float> %AB, %C
    %subadd = shufflevector <16 x float> %Add, <16 x float> %Sub, <16 x i32> <i32 0, i32 17, i32 2, i32 19, i32 4, i32 21, i32 6, i32 23, i32 8, i32 25, i32 10, i32 27, i32 12, i32 29, i32 14, i32 31>
    ret <16 x float> %subadd
  }

  ; This should not be matched to fmsubadd because the mul is on the wrong side of the fsub.
  define <2 x double> @mul_subadd_bad_commute(<2 x double> %A, <2 x double> %B, <2 x double> %C) #0 {
- ; CHECK-LABEL: mul_subadd_bad_commute:
- ; CHECK: # %bb.0: # %entry
- ; CHECK-NEXT: vmulpd %xmm1, %xmm0, %xmm0
- ; CHECK-NEXT: vsubpd %xmm0, %xmm2, %xmm1
- ; CHECK-NEXT: vaddpd %xmm2, %xmm0, %xmm0
- ; CHECK-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
- ; CHECK-NEXT: retq
+ ; NOFMA-LABEL: mul_subadd_bad_commute:
+ ; NOFMA: # %bb.0: # %entry
+ ; NOFMA-NEXT: vmulpd %xmm1, %xmm0, %xmm0
+ ; NOFMA-NEXT: vsubpd %xmm0, %xmm2, %xmm1
+ ; NOFMA-NEXT: vaddpd %xmm2, %xmm0, %xmm0
+ ; NOFMA-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],xmm1[1]
+ ; NOFMA-NEXT: retq
+ ;
+ ; FMA3-LABEL: mul_subadd_bad_commute:
+ ; FMA3: # %bb.0: # %entry
+ ; FMA3-NEXT: vmovapd %xmm1, %xmm3
+ ; FMA3-NEXT: vfnmadd213pd {{.*#+}} xmm3 = -(xmm0 * xmm3) + xmm2
+ ; FMA3-NEXT: vfmadd213pd {{.*#+}} xmm1 = (xmm0 * xmm1) + xmm2
+ ; FMA3-NEXT: vblendpd {{.*#+}} xmm0 = xmm1[0],xmm3[1]
+ ; FMA3-NEXT: retq
+ ;
+ ; FMA4-LABEL: mul_subadd_bad_commute:
+ ; FMA4: # %bb.0: # %entry
+ ; FMA4-NEXT: vfnmaddpd {{.*#+}} xmm3 = -(xmm0 * xmm1) + xmm2
+ ; FMA4-NEXT: vfmaddpd {{.*#+}} xmm0 = (xmm0 * xmm1) + xmm2
+ ; FMA4-NEXT: vblendpd {{.*#+}} xmm0 = xmm0[0],xmm3[1]
+ ; FMA4-NEXT: retq
  entry:
    %AB = fmul <2 x double> %A, %B
    %Sub = fsub <2 x double> %C, %AB
    %Add = fadd <2 x double> %AB, %C
    %subadd = shufflevector <2 x double> %Add, <2 x double> %Sub, <2 x i32> <i32 0, i32 3>
    ret <2 x double> %subadd
  }

  attributes #0 = { nounwind "unsafe-fp-math"="true" }
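
To see what the new FMA3/FMA4 output for mul_subadd_bad_commute computes, the per-lane arithmetic can be modelled in plain C++. This is a sketch, not compiler output: `V2`, `subaddUnfused`, and `subaddFused` are hypothetical names, and the fused version assumes the vfnmadd/vfmadd/vblendpd sequence in the CHECK lines maps onto `std::fma` as commented.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using V2 = std::array<double, 2>;

// Unfused reference: follows the IR of mul_subadd_bad_commute.
V2 subaddUnfused(V2 a, V2 b, V2 c) {
  V2 ab{a[0] * b[0], a[1] * b[1]};     // %AB  = fmul %A, %B
  V2 add{ab[0] + c[0], ab[1] + c[1]};  // %Add = fadd %AB, %C
  V2 sub{c[0] - ab[0], c[1] - ab[1]};  // %Sub = fsub %C, %AB
  return {add[0], sub[1]};             // shufflevector <0, 3>
}

// Fused form: one fmadd and one fnmadd, blended lane by lane.
V2 subaddFused(V2 a, V2 b, V2 c) {
  V2 fmadd{std::fma(a[0], b[0], c[0]), std::fma(a[1], b[1], c[1])};
  V2 fnmadd{std::fma(-a[0], b[0], c[0]), std::fma(-a[1], b[1], c[1])};
  return {fmadd[0], fnmadd[1]};        // lane 0 from fmadd, lane 1 from fnmadd
}
```

Both forms agree whenever the products are exactly representable. In general the fused form rounds once per lane instead of twice, so the results can differ in the last bit; the test permits this because it sets "unsafe-fp-math"="true". The multiply is effectively performed twice, once inside each fused instruction, which is the duplication that aggressive fusion accepts.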
