This is an archive of the discontinued LLVM Phabricator instance.

guard fsqrt with fmf sub flags
ClosedPublic

Authored by mcberg2017 on Jun 4 2018, 4:53 PM.

Details

Reviewers

spatel
hfinkel
arsenm

Commits

rGcc1c4b691230: guard fsqrt with fmf sub flags
rL334113: guard fsqrt with fmf sub flags

Summary

This change lets FMF subflags on the node, in addition to the global unsafe-math option, guard the optimization. These changes originated from D46483; this revision contains only the fsqrt part.
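
The guard described in the summary can be sketched in isolation. This is a minimal standalone C++ sketch under stated assumptions, not LLVM code: `NodeFlags`, the `unsafeFPMath` parameter, and `allowSqrtEstimate` are hypothetical stand-ins for SDNodeFlags, the global UnsafeFPMath option, and the early-exit test in visitFSQRT.

```cpp
// Hypothetical stand-in for per-node fast-math flags (not LLVM's SDNodeFlags).
struct NodeFlags {
  bool approxFuncs = false;     // models 'afn'
  bool allowReciprocal = false; // models 'arcp'
};

// Returns true when the sqrt-estimate transform may fire: either the
// global unsafe-math option is on, or the node itself carries 'afn'.
// 'arcp' is deliberately ignored in this sketch.
inline bool allowSqrtEstimate(bool unsafeFPMath, const NodeFlags &flags) {
  return unsafeFPMath || flags.approxFuncs;
}
```

With this shape, a node flagged only 'arcp' does not enable the transform unless the global option is set, while a node flagged 'afn' enables it on its own.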

Diff Detail

Event Timeline

mcberg2017 created this revision. Jun 4 2018, 4:53 PM

Herald added a subscriber: nemanjai. Jun 4 2018, 4:53 PM

spatel added subscribers: efriedma, wristow, andrew.w.kaylor. Jun 5 2018, 6:40 AM

spatel added inline comments.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
10897	Here, we're entering a bit of uncharted FMF territory. cc'ing some more potential reviewers for opinions - @wristow @efriedma @andrew.w.kaylor I think we want to allow this transform when the node allows "approximate functions" (afn). I don't think we should care about 'arcp' - the transform of 1.0/sqrt(x) is handled on a different path AFAICT.

mcberg2017 added a reviewer: arsenm. Jun 5 2018, 10:36 AM

Herald added a subscriber: wdng. Jun 5 2018, 10:36 AM

mcberg2017 added inline comments. Jun 5 2018, 10:36 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
10897	Not quite true. AMDGPU, for example, maps 1.0/sqrt to AMDGPUISD::RSQ, which is rsqrt, although only under a partial condition. See SITargetLowering::lowerFastUnsafeFDIV. Adding Matt as this is an AMD topic.

wristow added inline comments. Jun 5 2018, 11:57 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
10897	It does seem to me that 'afn' is what we want here, rather than 'arcp'. I think the point about SITargetLowering::lowerFastUnsafeFDIV means that the code there ought to be checking both 'afn' and 'arcp'.

arsenm added inline comments. Jun 5 2018, 12:01 PM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
10897	The point of the code there is that it doesn't need either. In some situations the instruction does the right thing without any need for a flag.

arsenm added inline comments. Jun 5 2018, 12:08 PM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
10897	For the cases where it's not legal, I suppose checking both makes sense

efriedma added a subscriber: hfinkel. Jun 5 2018, 12:40 PM

efriedma added inline comments.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
10897	I can't see how arcp is relevant to this patch at all; visitFSQRT doesn't deal with division. The flags required for rsqrt were discussed here: https://reviews.llvm.org/D37686#877987 .

My take on Eli's point from D37686 is that afn or unsafe is what we need, then. The code and tests have been updated for that change.

hfinkel added inline comments. Jun 6 2018, 5:49 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
10897	> I can't see how arcp is relevant to this patch at all; visitFSQRT doesn't deal with division.

I agree. This is just a question of how we compute sqrt(x). For an approximation of 1/sqrt(x), then I can see also needing arcp.

spatel added inline comments. Jun 6 2018, 8:19 AM

test/CodeGen/PowerPC/fmf-propagation.ll
368–387	Is there some reason for trimming the output here? I think it's important that we show the entire estimate sequence here to be consistent. Otherwise, it's misleading as it appears that we only need the raw estimate instruction.
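
To illustrate why the raw estimate alone would be misleading, here is a minimal standalone C++ sketch of an estimate-plus-refinement sequence. It is not LLVM's buildSqrtEstimate: `crudeRsqrt` is a hypothetical software stand-in for a hardware estimate instruction (it injects roughly 1% error on purpose), and `sqrtRawEstimate` and `sqrtViaEstimate` are assumed names. The shape (zero check, estimate, one Newton-Raphson step, multiply by x) is only meant to resemble the expanded check lines in the test.

```cpp
#include <cmath>

// Hypothetical stand-in for a hardware reciprocal-sqrt estimate instruction:
// returns 1/sqrt(x) with roughly 1% relative error injected on purpose.
inline float crudeRsqrt(float x) { return (1.0f / std::sqrt(x)) * 1.01f; }

// sqrt(x) from the raw estimate only: x * e.
inline float sqrtRawEstimate(float x) {
  if (x == 0.0f)
    return 0.0f; // avoid 0 * inf = NaN
  return x * crudeRsqrt(x);
}

// sqrt(x) via the estimate plus one Newton-Raphson step:
//   e' = e * (1.5 - 0.5 * x * e * e), then sqrt(x) ~= x * e'.
inline float sqrtViaEstimate(float x) {
  if (x == 0.0f)
    return 0.0f; // avoid 0 * inf = NaN
  float e = crudeRsqrt(x);
  float refined = e * (1.5f - 0.5f * x * e * e);
  return x * refined;
}
```

With the 1% stand-in error, the raw path keeps about 1% relative error, while a single refinement step reduces it by roughly two orders of magnitude. That difference is the reason the full sequence, not just the estimate instruction, is worth showing.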

So should we include arcp in the check so that buildSqrtEstimate fires for arcp opportunities (to produce rsqrt) as well, or is it fine the way it is?

In D47749#1123864, @mcberg2017 wrote:

> So should we include arcp in the check so that buildSqrtEstimate fires for arcp opportunities (to produce rsqrt) as well, or is it fine the way it is?

No - everyone agrees that this transform is predicated with 'afn' alone. The reciprocal sqrt transform is an independent problem/patch.

LGTM.

This revision is now accepted and ready to land. Jun 6 2018, 10:57 AM

Closed by commit rL334113: guard fsqrt with fmf sub flags (authored by mcberg2017). Jun 6 2018, 11:52 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: llvm-commits. Jun 6 2018, 11:52 AM

Revision Contents

Path                                        Size
lib/CodeGen/SelectionDAG/DAGCombiner.cpp    9 lines
test/CodeGen/PowerPC/fmf-propagation.ll     52 lines
test/CodeGen/X86/fmf-flags.ll               15 lines
test/CodeGen/X86/sqrt-fastmath-mir.ll       12 lines

Diff 150165

lib/CodeGen/SelectionDAG/DAGCombiner.cpp

SDValue DAGCombiner::visitFREM(SDNode *N) {

if (SDValue NewSel = foldBinOpIntoSelect(N))		if (SDValue NewSel = foldBinOpIntoSelect(N))
return NewSel;		return NewSel;

return SDValue();		return SDValue();
}		}

SDValue DAGCombiner::visitFSQRT(SDNode *N) {		SDValue DAGCombiner::visitFSQRT(SDNode *N) {
if (!DAG.getTarget().Options.UnsafeFPMath)		SDNodeFlags Flags = N->getFlags();
		if (!DAG.getTarget().Options.UnsafeFPMath &&
		!Flags.hasApproximateFuncs())
return SDValue();		return SDValue();

SDValue N0 = N->getOperand(0);		SDValue N0 = N->getOperand(0);
if (TLI.isFsqrtCheap(N0, DAG))		if (TLI.isFsqrtCheap(N0, DAG))
return SDValue();		return SDValue();

// TODO: FSQRT nodes should have flags that propagate to the created nodes.		// FSQRT nodes have flags that propagate to the created nodes.
// For now, create a Flags object for use with reassociation math transforms.
SDNodeFlags Flags;
Flags.setAllowReassociation(true);
return buildSqrtEstimate(N0, Flags);		return buildSqrtEstimate(N0, Flags);
}		}

/// copysign(x, fp_extend(y)) -> copysign(x, y)		/// copysign(x, fp_extend(y)) -> copysign(x, y)
/// copysign(x, fp_round(y)) -> copysign(x, y)		/// copysign(x, fp_round(y)) -> copysign(x, y)
static inline bool CanCombineFCOPYSIGN_EXTEND_ROUND(SDNode *N) {		static inline bool CanCombineFCOPYSIGN_EXTEND_ROUND(SDNode *N) {
SDValue N1 = N->getOperand(1);		SDValue N1 = N->getOperand(1);
if ((N1.getOpcode() == ISD::FP_EXTEND ||		if ((N1.getOpcode() == ISD::FP_EXTEND ||

test/CodeGen/PowerPC/fmf-propagation.ll

; GLOBAL-NEXT: blr
%mul = fmul fast float %x, 42.0		%mul = fmul fast float %x, 42.0
%fma = call fast float @llvm.fma.f32(float %x, float 7.0, float %mul)		%fma = call fast float @llvm.fma.f32(float %x, float 7.0, float %mul)
ret float %fma		ret float %fma
}		}

; Reduced precision for sqrt is allowed - should use estimate and NR iterations.		; Reduced precision for sqrt is allowed - should use estimate and NR iterations.

; FMFDEBUG-LABEL: Optimized lowered selection DAG: %bb.0 'sqrt_afn:'		; FMFDEBUG-LABEL: Optimized lowered selection DAG: %bb.0 'sqrt_afn:'
; FMFDEBUG: fsqrt afn {{t[0-9]+}}		; FMFDEBUG: fmul afn {{t[0-9]+}}
; FMFDEBUG: Type-legalized selection DAG: %bb.0 'sqrt_afn:'		; FMFDEBUG: Type-legalized selection DAG: %bb.0 'sqrt_afn:'

; GLOBALDEBUG-LABEL: Optimized lowered selection DAG: %bb.0 'sqrt_afn:'		; GLOBALDEBUG-LABEL: Optimized lowered selection DAG: %bb.0 'sqrt_afn:'
; GLOBALDEBUG: fmul reassoc {{t[0-9]+}}		; GLOBALDEBUG: fmul afn {{t[0-9]+}}
; GLOBALDEBUG: Type-legalized selection DAG: %bb.0 'sqrt_afn:'		; GLOBALDEBUG: Type-legalized selection DAG: %bb.0 'sqrt_afn:'

define float @sqrt_afn(float %x) {		define float @sqrt_afn(float %x) {
; FMF-LABEL: sqrt_afn:		; FMF-LABEL: sqrt_afn:
; FMF: # %bb.0:		; FMF: # %bb.0:
; FMF-NEXT: xssqrtsp 1, 1		; FMF-NEXT: xxlxor 0, 0, 0
		; FMF-NEXT: fcmpu 0, 1, 0
		; FMF-NEXT: beq 0, .LBB10_2
		; FMF-NEXT: # %bb.1:
		; FMF-NEXT: addis 3, 2, .LCPI10_0@toc@ha
		; FMF-NEXT: xsrsqrtesp 3, 1
		; FMF-NEXT: addi 3, 3, .LCPI10_0@toc@l
		; FMF-NEXT: lfsx 0, 0, 3
		; FMF-NEXT: xsmulsp 2, 1, 0
		; FMF-NEXT: xsmulsp 4, 3, 3
		; FMF-NEXT: xssubsp 2, 2, 1
		; FMF-NEXT: xsmulsp 2, 2, 4
		; FMF-NEXT: xssubsp 0, 0, 2
		; FMF-NEXT: xsmulsp 0, 3, 0
		; FMF-NEXT: xsmulsp 0, 0, 1
		; FMF-NEXT: .LBB10_2:
		; FMF-NEXT: fmr 1, 0
; FMF-NEXT: blr		; FMF-NEXT: blr
;		;
; GLOBAL-LABEL: sqrt_afn:		; GLOBAL-LABEL: sqrt_afn:
; GLOBAL: # %bb.0:		; GLOBAL: # %bb.0:
; GLOBAL-NEXT: xxlxor 0, 0, 0		; GLOBAL-NEXT: xxlxor 0, 0, 0
; GLOBAL-NEXT: fcmpu 0, 1, 0		; GLOBAL-NEXT: fcmpu 0, 1, 0
; GLOBAL-NEXT: beq 0, .LBB10_2		; GLOBAL-NEXT: beq 0, .LBB10_2
; GLOBAL-NEXT: # %bb.1:		; GLOBAL-NEXT: # %bb.1:
; GLOBAL-NEXT: xsrsqrtesp 2, 1		; GLOBAL-NEXT: xsrsqrtesp 2, 1
[12 lines not shown]
; GLOBAL-NEXT: blr		; GLOBAL-NEXT: blr
%rt = call afn float @llvm.sqrt.f32(float %x)		%rt = call afn float @llvm.sqrt.f32(float %x)
ret float %rt		ret float %rt
}		}

; The call is now fully 'fast'. This implies that approximation is allowed.		; The call is now fully 'fast'. This implies that approximation is allowed.

; FMFDEBUG-LABEL: Optimized lowered selection DAG: %bb.0 'sqrt_fast:'		; FMFDEBUG-LABEL: Optimized lowered selection DAG: %bb.0 'sqrt_fast:'
; FMFDEBUG: fsqrt nnan ninf nsz arcp contract afn reassoc {{t[0-9]+}}		; FMFDEBUG: fmul nnan ninf nsz arcp contract afn reassoc {{t[0-9]+}}
; FMFDEBUG: Type-legalized selection DAG: %bb.0 'sqrt_fast:'		; FMFDEBUG: Type-legalized selection DAG: %bb.0 'sqrt_fast:'

; GLOBALDEBUG-LABEL: Optimized lowered selection DAG: %bb.0 'sqrt_fast:'		; GLOBALDEBUG-LABEL: Optimized lowered selection DAG: %bb.0 'sqrt_fast:'
; GLOBALDEBUG: fmul reassoc {{t[0-9]+}}		; GLOBALDEBUG: fmul nnan ninf nsz arcp contract afn reassoc {{t[0-9]+}}
; GLOBALDEBUG: Type-legalized selection DAG: %bb.0 'sqrt_fast:'		; GLOBALDEBUG: Type-legalized selection DAG: %bb.0 'sqrt_fast:'

define float @sqrt_fast(float %x) {		define float @sqrt_fast(float %x) {
; FMF-LABEL: sqrt_fast:		; FMF-LABEL: sqrt_fast:
; FMF: # %bb.0:		; FMF: # %bb.0:
; FMF-NEXT: xssqrtsp 1, 1		; FMF-NEXT: xxlxor 0, 0, 0
		; FMF-NEXT: fcmpu 0, 1, 0
		; FMF-NEXT: beq 0, .LBB11_2
		; FMF-NEXT: # %bb.1:
		; FMF-NEXT: xsrsqrtesp 2, 1
		; FMF-NEXT: addis 3, 2, .LCPI11_0@toc@ha
		; FMF-NEXT: fneg 0, 1
		; FMF-NEXT: fmr 4, 1
		; FMF-NEXT: addi 3, 3, .LCPI11_0@toc@l
		; FMF-NEXT: lfsx 3, 0, 3
		; FMF-NEXT: xsmaddasp 4, 0, 3
		; FMF-NEXT: xsmulsp 0, 2, 2
		; FMF-NEXT: xsmaddasp 3, 4, 0
		; FMF-NEXT: xsmulsp 0, 2, 3
		; FMF-NEXT: xsmulsp 0, 0, 1
		; FMF-NEXT: .LBB11_2:
		; FMF-NEXT: fmr 1, 0
; FMF-NEXT: blr		; FMF-NEXT: blr
;		;
; GLOBAL-LABEL: sqrt_fast:		; GLOBAL-LABEL: sqrt_fast:
; GLOBAL: # %bb.0:		; GLOBAL: # %bb.0:
; GLOBAL-NEXT: xxlxor 0, 0, 0		; GLOBAL-NEXT: xxlxor 0, 0, 0
; GLOBAL-NEXT: fcmpu 0, 1, 0		; GLOBAL-NEXT: fcmpu 0, 1, 0
; GLOBAL-NEXT: beq 0, .LBB11_2		; GLOBAL-NEXT: beq 0, .LBB11_2
; GLOBAL-NEXT: # %bb.1:		; GLOBAL-NEXT: # %bb.1:
; GLOBAL-NEXT: xsrsqrtesp 2, 1		; GLOBAL-NEXT: xsrsqrtesp 2, 1
; GLOBAL-NEXT: addis 3, 2, .LCPI11_0@toc@ha		; GLOBAL-NEXT: addis 3, 2, .LCPI11_0@toc@ha

test/CodeGen/X86/fmf-flags.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc < %s -mtriple=x86_64-unknown | FileCheck %s -check-prefix=X64			; RUN: llc < %s -mtriple=x86_64-unknown | FileCheck %s -check-prefix=X64
	; RUN: llc < %s -mtriple=i686-unknown | FileCheck %s -check-prefix=X86			; RUN: llc < %s -mtriple=i686-unknown | FileCheck %s -check-prefix=X86

	declare float @llvm.sqrt.f32(float %x);			declare float @llvm.sqrt.f32(float %x);

	define float @fast_recip_sqrt(float %x) {			define float @fast_recip_sqrt(float %x) {
	; X64-LABEL: fast_recip_sqrt:			; X64-LABEL: fast_recip_sqrt:
	; X64: # %bb.0:			; X64: # %bb.0:
	; X64-NEXT: sqrtss %xmm0, %xmm1			; X64-NEXT: rsqrtss %xmm0, %xmm1
	; X64-NEXT: movss {{.*#+}} xmm0 = mem[0],zero,zero,zero			; X64-NEXT: xorps %xmm2, %xmm2
	; X64-NEXT: divss %xmm1, %xmm0			; X64-NEXT: cmpeqss %xmm0, %xmm2
				; X64-NEXT: mulss %xmm1, %xmm0
				; X64-NEXT: movss {{.*}}(%rip), %xmm3
				; X64-NEXT: mulss %xmm0, %xmm3
				; X64-NEXT: mulss %xmm1, %xmm0
				; X64-NEXT: addss {{.*}}(%rip), %xmm0
				; X64-NEXT: mulss %xmm3, %xmm0
				; X64-NEXT: andnps %xmm0, %xmm2
				; X64-NEXT: movss {{.*}}(%rip), %xmm0
				; X64-NEXT: divss %xmm2, %xmm0
	; X64-NEXT: retq			; X64-NEXT: retq
	;			;
	; X86-LABEL: fast_recip_sqrt:			; X86-LABEL: fast_recip_sqrt:
	; X86: # %bb.0:			; X86: # %bb.0:
	; X86-NEXT: flds {{[0-9]+}}(%esp)			; X86-NEXT: flds {{[0-9]+}}(%esp)
	; X86-NEXT: fsqrt			; X86-NEXT: fsqrt
	; X86-NEXT: fld1			; X86-NEXT: fld1
	; X86-NEXT: fdivp %st(1)			; X86-NEXT: fdivp %st(1)

test/CodeGen/X86/sqrt-fastmath-mir.ll

	; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx2,fma -stop-after=expand-isel-pseudos 2>&1 | FileCheck %s			; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx2,fma -stop-after=expand-isel-pseudos 2>&1 | FileCheck %s

	declare float @llvm.sqrt.f32(float) #0			declare float @llvm.sqrt.f32(float) #0

	define float @foo(float %f) #0 {			define float @foo(float %f) #0 {
	; CHECK: {{name: *foo}}			; CHECK: {{name: *foo}}
	; CHECK: body:			; CHECK: body:
	; CHECK: %0:fr32 = COPY $xmm0			; CHECK: %0:fr32 = COPY $xmm0
	; CHECK: %1:fr32 = VRSQRTSSr killed %2, %0			; CHECK: %1:fr32 = VRSQRTSSr killed %2, %0
	; CHECK: %3:fr32 = reassoc VMULSSrr %0, %1			; CHECK: %3:fr32 = VMULSSrr %0, %1
	; CHECK: %4:fr32 = VMOVSSrm			; CHECK: %4:fr32 = VMOVSSrm
	; CHECK: %5:fr32 = VFMADD213SSr %1, killed %3, %4			; CHECK: %5:fr32 = VFMADD213SSr %1, killed %3, %4
	; CHECK: %6:fr32 = VMOVSSrm			; CHECK: %6:fr32 = VMOVSSrm
	; CHECK: %7:fr32 = reassoc VMULSSrr %1, %6			; CHECK: %7:fr32 = VMULSSrr %1, %6
	; CHECK: %8:fr32 = reassoc VMULSSrr killed %7, killed %5			; CHECK: %8:fr32 = VMULSSrr killed %7, killed %5
	; CHECK: %9:fr32 = reassoc VMULSSrr %0, %8			; CHECK: %9:fr32 = VMULSSrr %0, %8
	; CHECK: %10:fr32 = VFMADD213SSr %8, %9, %4			; CHECK: %10:fr32 = VFMADD213SSr %8, %9, %4
	; CHECK: %11:fr32 = reassoc VMULSSrr %9, %6			; CHECK: %11:fr32 = VMULSSrr %9, %6
	; CHECK: %12:fr32 = reassoc VMULSSrr killed %11, killed %10			; CHECK: %12:fr32 = VMULSSrr killed %11, killed %10
	; CHECK: %14:fr32 = FsFLD0SS			; CHECK: %14:fr32 = FsFLD0SS
	; CHECK: %15:fr32 = VCMPSSrr %0, killed %14, 0			; CHECK: %15:fr32 = VCMPSSrr %0, killed %14, 0
	; CHECK: %17:vr128 = VANDNPSrr killed %16, killed %13			; CHECK: %17:vr128 = VANDNPSrr killed %16, killed %13
	; CHECK: $xmm0 = COPY %18			; CHECK: $xmm0 = COPY %18
	; CHECK: RET 0, $xmm0			; CHECK: RET 0, $xmm0
	%call = tail call float @llvm.sqrt.f32(float %f) #1			%call = tail call float @llvm.sqrt.f32(float %f) #1
	ret float %call			ret float %call
	}			}