This is an archive of the discontinued LLVM Phabricator instance.

[x86] split FMA with fast-math-flags to avoid libcall
Closed · Public

Authored by spatel on Jul 16 2020, 12:55 PM.


Details

Reviewers

craig.topper
cameron.mcinally
RKSimon

Commits

rG50afa18772da: [x86] split FMA with fast-math-flags to avoid libcall

Summary

fma reassoc A, B, C --> fadd (fmul A, B), C (when target has no FMA hardware)

C/C++ code may use explicit fma() calls (which become LLVM fma intrinsics in IR) and then be compiled with -ffast-math or similar.
For targets that do not have FMA hardware, we don't want to call out to the math library for a precise but slow FMA result.
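
To illustrate why this split is gated on fast-math flags, here is a small standalone C++ sketch (not part of the patch; the names `fused` and `split` are made up for illustration, and it assumes IEEE-754 doubles with the default rounding mode). It contrasts a true fused multiply-add, which rounds once, with the mul+add form, which rounds the product first.

```cpp
#include <cmath>

// Fused form: a*b+c is computed with a single rounding at the end.
// This is what an explicit fma() call asks for.
double fused(double a, double b, double c) { return std::fma(a, b, c); }

// Split form: the product is rounded to double before the add.
// 'volatile' keeps the compiler from contracting the two operations
// back into a single fused operation.
double split(double a, double b, double c) {
  volatile double product = a * b;
  return product + c;
}
```

For a = 1 + 2^-27, b = 1 - 2^-27 and c = -1, the exact product is 1 - 2^-54, which rounds to 1.0 as a double. The split form therefore returns 0, while the fused form returns -2^-54. The two results can differ, so the rewrite is only acceptable when the flags permit it.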

I tried this as a generic DAGCombine, but it caused infinite looping on more than 1 other target, so there's likely some over-reaching fma formation happening.

There's also a potential intersection of strict FP with fast-math here. I'm not sure who should win that fight, so just deferring to current behavior for that case.


Event Timeline

spatel created this revision. Jul 16 2020, 12:55 PM

Herald added a project: Restricted Project. Jul 16 2020, 12:55 PM

Herald added subscribers: steven.zhang, hiraditya, mcrosier.

Patch updated:
Added a minimal vector test to show larger diffs and confirm that doesn't conflict with existing transforms.

SGTM, but I'll leave the review to someone more familiar with FP.

This looks good besides the Piledriver checks.

Regarding StrictFP: an FMA is not the same as a MUL+ADD in that context. So going to a libcall is the right move.

cameron.mcinally added inline comments. Jul 17 2020, 1:30 PM

llvm/test/CodeGen/X86/fma.ll
Lines 1519–1520: These Piledriver checks disappeared. Was that deliberate?

spatel marked 2 inline comments as done. Jul 17 2020, 1:58 PM

spatel added inline comments.

llvm/test/CodeGen/X86/fma.ll
Lines 1519–1520: They were subsumed into the common prefix of "FMACALL32". It's a bit strange that we don't have a unique prefix for this run line:
; RUN: llc < %s -mtriple=i386-apple-darwin10 -mattr=+avx,-fma,-fma4 -show-mc-encoding | FileCheck %s --check-prefix=FMACALL32
I can try to make that clearer with a cleanup patch.

LGTM

Ah, okay. I should have looked for overlap in the RUN lines. Sorry for the noise.

This revision is now accepted and ready to land. Jul 17 2020, 2:02 PM

Closed by commit rG50afa18772da: [x86] split FMA with fast-math-flags to avoid libcall (authored by spatel). Jul 19 2020, 7:07 AM

This revision was automatically updated to reflect the committed changes.

spatel marked an inline comment as done.

Revision Contents

Path                                       Size
llvm/lib/Target/X86/X86ISelLowering.cpp    17 lines
llvm/test/CodeGen/X86/fma.ll               18 lines

Diff 278579

llvm/lib/Target/X86/X86ISelLowering.cpp


[46,123 lines not shown]
@@ static SDValue combineFMA(SDNode *N, SelectionDAG &DAG,
   EVT VT = N->getValueType(0);
   bool IsStrict = N->isStrictFPOpcode() || N->isTargetStrictFPOpcode();
 
   // Let legalize expand this if it isn't a legal type yet.
   const TargetLowering &TLI = DAG.getTargetLoweringInfo();
   if (!TLI.isTypeLegal(VT))
     return SDValue();
 
-  EVT ScalarVT = VT.getScalarType();
-  if ((ScalarVT != MVT::f32 && ScalarVT != MVT::f64) || !Subtarget.hasAnyFMA())
-    return SDValue();
-
   SDValue A = N->getOperand(IsStrict ? 1 : 0);
   SDValue B = N->getOperand(IsStrict ? 2 : 1);
   SDValue C = N->getOperand(IsStrict ? 3 : 2);
 
+  // If the operation allows fast-math and the target does not support FMA,
+  // split this into mul+add to avoid a libcall.
+  SDNodeFlags Flags = N->getFlags();
+  if (!IsStrict && Flags.hasAllowReassociation() &&
+      TLI.isOperationExpand(ISD::FMA, VT)) {
+    SDValue Fmul = DAG.getNode(ISD::FMUL, dl, VT, A, B, Flags);
+    return DAG.getNode(ISD::FADD, dl, VT, Fmul, C, Flags);
+  }
+
+  EVT ScalarVT = VT.getScalarType();
+  if ((ScalarVT != MVT::f32 && ScalarVT != MVT::f64) || !Subtarget.hasAnyFMA())
+    return SDValue();
+
   auto invertIfNegative = [&DAG, &TLI, &DCI](SDValue &V) {
     bool CodeSize = DAG.getMachineFunction().getFunction().hasOptSize();
     bool LegalOperations = !DCI.isBeforeLegalizeOps();
     if (SDValue NegV = TLI.getCheaperNegatedExpression(V, DAG, LegalOperations,
                                                        CodeSize)) {
       V = NegV;
       return true;
     }
[4,058 lines not shown]
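
The order of the guards in this change can be modelled with a toy decision function. This is only a sketch: `FmaNode`, `Target`, `Action`, and `decide` are hypothetical names invented here and are not LLVM APIs. Each field stands in for one of the queries the combine makes (`isTypeLegal`, the strict-FP check, `hasAllowReassociation`, and `isOperationExpand(ISD::FMA, VT)`).

```cpp
// Hypothetical stand-ins for the properties the combine inspects.
struct FmaNode {
  bool isStrict;      // strict (constrained) FP node
  bool allowReassoc;  // 'reassoc' fast-math flag is set
};

struct Target {
  bool typeLegal;   // the value type is legal for the target
  bool fmaExpands;  // FMA would be expanded, i.e. no FMA hardware
};

enum class Action { LeaveForLegalizer, SplitToMulAdd, KeepFma };

// Mirrors the order of the checks: type legality first, then the new
// fast-math split, then fall through to the pre-existing FMA handling.
Action decide(const FmaNode &n, const Target &t) {
  if (!t.typeLegal)
    return Action::LeaveForLegalizer;
  if (!n.isStrict && n.allowReassoc && t.fmaExpands)
    return Action::SplitToMulAdd;
  return Action::KeepFma;
}
```

In this model a strict FMA is never split, even when `reassoc` is set, which matches the "defer to current behavior" choice described in the summary.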

llvm/test/CodeGen/X86/fma.ll

[67 lines not shown]
 ; FMA32-NEXT: ## xmm1 = (xmm0 * xmm1) + mem
 ; FMA32-NEXT: vmovss %xmm1, (%esp) ## encoding: [0xc5,0xfa,0x11,0x0c,0x24]
 ; FMA32-NEXT: flds (%esp) ## encoding: [0xd9,0x04,0x24]
 ; FMA32-NEXT: popl %eax ## encoding: [0x58]
 ; FMA32-NEXT: retl ## encoding: [0xc3]
 ;
 ; FMACALL32-LABEL: test_f32_reassoc:
 ; FMACALL32: ## %bb.0:
-; FMACALL32-NEXT: jmp _fmaf ## TAILCALL
-; FMACALL32-NEXT: ## encoding: [0xeb,A]
-; FMACALL32-NEXT: ## fixup A - offset: 1, value: _fmaf-1, kind: FK_PCRel_1
+; FMACALL32-NEXT: pushl %eax ## encoding: [0x50]
+; FMACALL32-NEXT: vmovss {{[0-9]+}}(%esp), %xmm0 ## encoding: [0xc5,0xfa,0x10,0x44,0x24,0x08]
+; FMACALL32-NEXT: ## xmm0 = mem[0],zero,zero,zero
+; FMACALL32-NEXT: vmulss {{[0-9]+}}(%esp), %xmm0, %xmm0 ## encoding: [0xc5,0xfa,0x59,0x44,0x24,0x0c]
+; FMACALL32-NEXT: vaddss {{[0-9]+}}(%esp), %xmm0, %xmm0 ## encoding: [0xc5,0xfa,0x58,0x44,0x24,0x10]
+; FMACALL32-NEXT: vmovss %xmm0, (%esp) ## encoding: [0xc5,0xfa,0x11,0x04,0x24]
+; FMACALL32-NEXT: flds (%esp) ## encoding: [0xd9,0x04,0x24]
+; FMACALL32-NEXT: popl %eax ## encoding: [0x58]
+; FMACALL32-NEXT: retl ## encoding: [0xc3]
 ;
 ; FMA64-LABEL: test_f32_reassoc:
 ; FMA64: ## %bb.0:
 ; FMA64-NEXT: vfmadd213ss %xmm2, %xmm1, %xmm0 ## encoding: [0xc4,0xe2,0x71,0xa9,0xc2]
 ; FMA64-NEXT: ## xmm0 = (xmm1 * xmm0) + xmm2
 ; FMA64-NEXT: retq ## encoding: [0xc3]
 ;
 ; FMACALL64-LABEL: test_f32_reassoc:
 ; FMACALL64: ## %bb.0:
-; FMACALL64-NEXT: jmp _fmaf ## TAILCALL
-; FMACALL64-NEXT: ## encoding: [0xeb,A]
-; FMACALL64-NEXT: ## fixup A - offset: 1, value: _fmaf-1, kind: FK_PCRel_1
+; FMACALL64-NEXT: mulss %xmm1, %xmm0 ## encoding: [0xf3,0x0f,0x59,0xc1]
+; FMACALL64-NEXT: addss %xmm2, %xmm0 ## encoding: [0xf3,0x0f,0x58,0xc2]
+; FMACALL64-NEXT: retq ## encoding: [0xc3]
 ;
 ; AVX512-LABEL: test_f32_reassoc:
 ; AVX512: ## %bb.0:
 ; AVX512-NEXT: vfmadd213ss %xmm2, %xmm1, %xmm0 ## EVEX TO VEX Compression encoding: [0xc4,0xe2,0x71,0xa9,0xc2]
 ; AVX512-NEXT: ## xmm0 = (xmm1 * xmm0) + xmm2
 ; AVX512-NEXT: retq ## encoding: [0xc3]
 ;
 ; AVX512VL-LABEL: test_f32_reassoc:
[1,412 lines not shown]
 ; FMACALL32_BDVER2-NEXT: vmovhps {{[0-9]+}}(%esp), %xmm0, %xmm0 ## encoding: [0xc5,0xf8,0x16,0x44,0x24,0x20]
 ; FMACALL32_BDVER2-NEXT: ## xmm0 = xmm0[0,1],mem[0,1]
 ; FMACALL32_BDVER2-NEXT: addl $108, %esp ## encoding: [0x83,0xc4,0x6c]
 ; FMACALL32_BDVER2-NEXT: retl ## encoding: [0xc3]
 entry:
 %call = call <2 x double> @llvm.fma.v2f64(<2 x double> %a, <2 x double> %b, <2 x double> %c)
 ret <2 x double> %call
 }
 
 define <4 x double> @test_v4f64(<4 x double> %a, <4 x double> %b, <4 x double> %c) #0 {
 ; FMA32-LABEL: test_v4f64:
 ; FMA32: ## %bb.0: ## %entry
 ; FMA32-NEXT: vfmadd213pd %ymm2, %ymm1, %ymm0 ## encoding: [0xc4,0xe2,0xf5,0xa8,0xc2]
 ; FMA32-NEXT: ## ymm0 = (ymm1 * ymm0) + ymm2
 ; FMA32-NEXT: retl ## encoding: [0xc3]
 ;
 ; FMA64-LABEL: test_v4f64:
 ; FMA64: ## %bb.0: ## %entry
[538 lines not shown]