This is an archive of the discontinued LLVM Phabricator instance.

Differential D114765

[X86][FP16] Only generate approximate rsqrt when Reciprocal is true for half type
Abandoned · Public

Authored by pengfei on Nov 29 2021, 6:37 PM.

Details

Reviewers

craig.topper
LuoYuanke
RKSimon
spatel
LiuChen3

Summary

We have a reasonably fast sqrt and an accurate rsqrt for the half type because
of its limited fraction bits. So we need neither multi-step refinement for
rsqrt nor replacement of sqrt by rsqrt.
This fixes a correctness issue when RefinementSteps = 0.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

pengfei created this revision. Nov 29 2021, 6:37 PM

Herald added a subscriber: hiraditya. Nov 29 2021, 6:37 PM

pengfei requested review of this revision. Nov 29 2021, 6:37 PM

Herald added a project: Restricted Project. Nov 29 2021, 6:37 PM

Herald added a subscriber: llvm-commits.

What's the correctness issue? It's fast math so the answer isn't required to be exact.

@kmclaughlin I meant to make this change in the target-independent code (enabled in D110557), but then I found that the test in your D111657 was affected too. I'll make the change there if you think it is workable for AArch64 as well.

In D114765#3160232, @craig.topper wrote:

What's the correctness issue? It's fast math so the answer isn't required to be exact.

See https://godbolt.org/z/P4Y89hsj4
I guess we need at least RefinementSteps = 1 for correctness when Reciprocal == 0.

In D114765#3160245, @pengfei wrote:

See https://godbolt.org/z/P4Y89hsj4
I guess we need at least RefinementSteps = 1 for correctness when Reciprocal == 0.

That seems like a more general bug. The same issue happens if you force the estimate steps for floating point https://godbolt.org/z/vK8eab8zP

I think for f16 you should also be returning true from X86TargetLowering::isFsqrtCheap which would prevent getSqrtEstimate from being called for the non-reciprocal case.
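
A minimal sketch of the suggestion above, written as a toy model rather than LLVM code. The names ElemType, Lowering, isFsqrtCheapToy and lowerFastSqrtToy are hypothetical, and the sketch leaves out the real hook's check for an existing FRSQRT node. It only illustrates the gating idea: when the cheap-sqrt query answers true, a plain (non-reciprocal) sqrt keeps the hardware instruction and the estimate path is never consulted.

```cpp
// Hypothetical element types for this sketch.
enum class ElemType { F16, F32 };

// Hypothetical lowering choices for a plain, non-reciprocal sqrt.
enum class Lowering { HardwareSqrt, RsqrtEstimate };

// Toy analogue of a "is sqrt cheap?" hook: f16 always answers true,
// other types defer to an assumed subtarget tuning flag.
bool isFsqrtCheapToy(ElemType Ty, bool HasFastSqrt) {
  if (Ty == ElemType::F16)
    return true;
  return HasFastSqrt;
}

// Toy analogue of the caller: only fall back to the rsqrt estimate
// when the hook says sqrt is not cheap.
Lowering lowerFastSqrtToy(ElemType Ty, bool HasFastSqrt) {
  return isFsqrtCheapToy(Ty, HasFastSqrt) ? Lowering::HardwareSqrt
                                          : Lowering::RsqrtEstimate;
}
```

In this model an f16 sqrt never reaches the estimate path, whatever the tuning flag says, while f32 still follows the flag.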

Harbormaster completed remote builds in B136587: Diff 390543. Nov 29 2021, 7:17 PM

Return true for f16 in X86TargetLowering::isFsqrtCheap.

In D114765#3160256, @craig.topper wrote:

That seems like a more general bug. The same issue happens if you force the estimate steps for floating point https://godbolt.org/z/vK8eab8zP

I see. But I'm not sure if it is a bug or by design. Maybe we need an assertion?

In D114765#3160268, @craig.topper wrote:

I think for f16 you should also be returning true from X86TargetLowering::isFsqrtCheap which would prevent getSqrtEstimate from being called for the non-reciprocal case.

Good point!

In D114765#3160387, @pengfei wrote:

I see. But I'm not sure if it is a bug or by design. Maybe we need an assertion?

I think this is a weird interface quirk. The caller interprets returning RefinementSteps = 0 to mean that all needed code has been created and nothing should be done. Theoretically a target could have its own way of handling it without the final FMUL the target independent code inserts. So I think X86 should either insert the final FMUL itself or not do the reciprocal approximation for non-reciprocal if RefinementSteps is 0. NVPTXTargetLowering::getSqrtEstimate does the latter.
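
The quirk described above can be illustrated with a toy model. This is a sketch under stated assumptions, not LLVM's implementation: it uses doubles in place of SelectionDAG nodes, the names targetEstimateToy and buildSqrtEstimateToy and the foldFinalMul switch are hypothetical, and an exact 1/sqrt(x) stands in for the hardware estimate.

```cpp
#include <cmath>

// Toy stand-in for a target estimate hook (hypothetical signature).
// It returns an estimate of 1/sqrt(x). When foldFinalMul is set, it also
// applies the final multiply by x if the caller wants sqrt(x) and asked
// for zero refinement steps.
double targetEstimateToy(double x, int refinementSteps, bool reciprocal,
                         bool foldFinalMul) {
  double est = 1.0 / std::sqrt(x); // stands in for a hardware rsqrt estimate
  if (foldFinalMul && refinementSteps == 0 && !reciprocal)
    est = x * est; // x * (1/sqrt(x)) = sqrt(x)
  return est;
}

// Toy stand-in for the target-independent caller. With zero refinement
// steps it treats the hook's result as final and adds nothing.
double buildSqrtEstimateToy(double x, int refinementSteps, bool reciprocal,
                            bool foldFinalMul) {
  double est = targetEstimateToy(x, refinementSteps, reciprocal, foldFinalMul);
  if (refinementSteps > 0) {
    // Newton-Raphson refinement of an rsqrt estimate.
    for (int i = 0; i < refinementSteps; ++i)
      est = est * (1.5 - 0.5 * x * est * est);
    if (!reciprocal)
      est = est * x; // the final multiply exists only on this path
  }
  return est;
}
```

In this model, asking for sqrt(4) with zero steps and no folding yields 0.5, the reciprocal square root, because nobody performs the last multiply. Folding the multiply into the hook yields 2, which matches the result of the one-step path.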

Harbormaster completed remote builds in B136597: Diff 390562. Nov 29 2021, 10:06 PM

So I think X86 should either insert the final FMUL itself or ...

Insert a FMUL. Thanks Craig!

Harbormaster completed remote builds in B136606: Diff 390577. Nov 29 2021, 11:51 PM

In D114765#3160396, @craig.topper wrote:

I think this is a weird interface quirk. The caller interprets returning RefinementSteps = 0 to mean that all needed code has been created and nothing should be done. Theoretically a target could have its own way of handling it without the final FMUL the target independent code inserts. So I think X86 should either insert the final FMUL itself or not do the reciprocal approximation for non-reciprocal if RefinementSteps is 0. NVPTXTargetLowering::getSqrtEstimate does the latter.

That looks like a long-standing bug for x86. I don't remember if that was a design flaw originally or if the code evolved to handle more patterns, and we just missed that problem.
Better to add the tests first, add the code to fix the bug, then add the code to bypass for f16?

llvm/test/CodeGen/X86/avx512fp16vl-intrinsics.ll
Line 972: Does it make sense to add a scalar test too?
define half @sqrt_half_fast(half %a0, half %a1) {
  %1 = call fast half @llvm.sqrt.f16(half %a0)
  ret half %1
}

Break it into 3 patches: rG65a3de91ab3e, D114843 and D114844.

Revision Contents

Path                                               Size
llvm/lib/Target/X86/X86ISelLowering.cpp            10 lines
llvm/test/CodeGen/X86/avx512fp16vl-intrinsics.ll   9 lines
llvm/test/CodeGen/X86/sqrt-fastmath.ll             71 lines

Diff 390577

llvm/lib/Target/X86/X86ISelLowering.cpp

@@ ... 23,184 lines not shown ... @@ static SDValue EmitCmp(SDValue Op0, SDValue Op1, unsigned X86CC,
   SDValue Sub = DAG.getNode(X86ISD::SUB, dl, VTs, Op0, Op1);
   return Sub.getValue(1);
 }
 
 /// Check if replacement of SQRT with RSQRT should be disabled.
 bool X86TargetLowering::isFsqrtCheap(SDValue Op, SelectionDAG &DAG) const {
   EVT VT = Op.getValueType();
 
+  // We don't need to replace SQRT with RSQRT for half type.
+  if (VT.getScalarType() == MVT::f16)
+    return true;
+
   // We never want to use both SQRT and RSQRT instructions for the same input.
   if (DAG.getNodeIfExists(X86ISD::FRSQRT, DAG.getVTList(VT), Op))
     return false;
 
   if (VT.isVector())
     return Subtarget.hasFastVectorFSQRT();
   return Subtarget.hasFastScalarFSQRT();
 }
@@ ... 22 lines not shown ... @@ if ((VT == MVT::f32 && Subtarget.hasSSE1()) ||
       (VT == MVT::v8f32 && Subtarget.hasAVX()) ||
       (VT == MVT::v16f32 && Subtarget.useAVX512Regs())) {
     if (RefinementSteps == ReciprocalEstimate::Unspecified)
       RefinementSteps = 1;
 
     UseOneConstNR = false;
     // There is no FSQRT for 512-bits, but there is RSQRT14.
     unsigned Opcode = VT == MVT::v16f32 ? X86ISD::RSQRT14 : X86ISD::FRSQRT;
-    return DAG.getNode(Opcode, DL, VT, Op);
+    SDValue Estimate = DAG.getNode(Opcode, DL, VT, Op);
+    if (RefinementSteps == 0 && !Reciprocal)
+      Estimate = DAG.getNode(ISD::FMUL, DL, VT, Op, Estimate);
+    return Estimate;
   }
 
   if (VT.getScalarType() == MVT::f16 && isTypeLegal(VT) &&
       Subtarget.hasFP16()) {
+    assert(Reciprocal && "Don't replace SQRT with RSQRT for half type");
     if (RefinementSteps == ReciprocalEstimate::Unspecified)
       RefinementSteps = 0;
 
     if (VT == MVT::f16) {
       SDValue Zero = DAG.getIntPtrConstant(0, DL);
       SDValue Undef = DAG.getUNDEF(MVT::v8f16);
       Op = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, MVT::v8f16, Op);
       Op = DAG.getNode(X86ISD::RSQRT14S, DL, MVT::v8f16, Undef, Op);
@@ ... 31,171 lines not shown ... @@

llvm/test/CodeGen/X86/avx512fp16vl-intrinsics.ll

@@ ... 963 lines not shown ... @@
 ; CHECK-NEXT: vrsqrtph %xmm0, %xmm0
 ; CHECK-NEXT: vmulph %xmm0, %xmm1, %xmm0
 ; CHECK-NEXT: retq
   %1 = call fast <8 x half> @llvm.sqrt.v8f16(<8 x half> %a0)
   %2 = fdiv fast <8 x half> %a1, %1
   ret <8 x half> %2
 }
 
+define <8 x half> @test_sqrt_ph_128_fast2(<8 x half> %a0, <8 x half> %a1) {
+; CHECK-LABEL: test_sqrt_ph_128_fast2:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsqrtph %xmm0, %xmm0
+; CHECK-NEXT: retq
+  %1 = call fast <8 x half> @llvm.sqrt.v8f16(<8 x half> %a0)
+  ret <8 x half> %1
+}
+
 define <8 x half> @test_mask_sqrt_ph_128(<8 x half> %a0, <8 x half> %passthru, i8 %mask) {
 ; CHECK-LABEL: test_mask_sqrt_ph_128:
 ; CHECK: # %bb.0:
 ; CHECK-NEXT: kmovd %edi, %k1
 ; CHECK-NEXT: vsqrtph %xmm0, %xmm1 {%k1}
 ; CHECK-NEXT: vmovaps %xmm1, %xmm0
 ; CHECK-NEXT: retq
   %1 = call <8 x half> @llvm.sqrt.v8f16(<8 x half> %a0)
@@ ... 364 lines not shown ... @@

llvm/test/CodeGen/X86/sqrt-fastmath.ll

@@ ... 378 lines not shown ... @@
 ; AVX512-NEXT: vmulss {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm1, %xmm1
 ; AVX512-NEXT: vmulss %xmm0, %xmm1, %xmm0
 ; AVX512-NEXT: retq
   %sqrt = tail call float @llvm.sqrt.f32(float %x)
   %div = fdiv fast float 1.0, %sqrt
   ret float %div
 }
 
+define float @f32_estimate2(float %x) #5 {
+; SSE-LABEL: f32_estimate2:
+; SSE: # %bb.0:
+; SSE-NEXT: rsqrtss %xmm0, %xmm1
+; SSE-NEXT: mulss %xmm0, %xmm1
+; SSE-NEXT: andps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE-NEXT: cmpltss {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE-NEXT: andnps %xmm1, %xmm0
+; SSE-NEXT: retq
+;
+; AVX1-LABEL: f32_estimate2:
+; AVX1: # %bb.0:
+; AVX1-NEXT: vrsqrtss %xmm0, %xmm0, %xmm1
+; AVX1-NEXT: vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm2
+; AVX1-NEXT: vmulss %xmm1, %xmm0, %xmm0
+; AVX1-NEXT: vcmpltss {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm2, %xmm1
+; AVX1-NEXT: vandnps %xmm0, %xmm1, %xmm0
+; AVX1-NEXT: retq
+;
+; AVX512-LABEL: f32_estimate2:
+; AVX512: # %bb.0:
+; AVX512-NEXT: vrsqrtss %xmm0, %xmm0, %xmm1
+; AVX512-NEXT: vmulss %xmm1, %xmm0, %xmm1
+; AVX512-NEXT: vbroadcastss {{.*#+}} xmm2 = [NaN,NaN,NaN,NaN]
+; AVX512-NEXT: vandps %xmm2, %xmm0, %xmm0
+; AVX512-NEXT: vcmpltss {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %k1
+; AVX512-NEXT: vxorps %xmm0, %xmm0, %xmm0
+; AVX512-NEXT: vmovss %xmm0, %xmm1, %xmm1 {%k1}
+; AVX512-NEXT: vmovaps %xmm1, %xmm0
+; AVX512-NEXT: retq
+  %sqrt = tail call fast float @llvm.sqrt.f32(float %x)
+  ret float %sqrt
+}
+
 define <4 x float> @v4f32_no_estimate(<4 x float> %x) #0 {
 ; SSE-LABEL: v4f32_no_estimate:
 ; SSE: # %bb.0:
 ; SSE-NEXT: sqrtps %xmm0, %xmm1
 ; SSE-NEXT: movaps {{.*#+}} xmm0 = [1.0E+0,1.0E+0,1.0E+0,1.0E+0]
 ; SSE-NEXT: divps %xmm1, %xmm0
 ; SSE-NEXT: retq
 ;
@@ ... 46 lines not shown ... @@
 ; AVX512-NEXT: vmulps %xmm0, %xmm1, %xmm0
 ; AVX512-NEXT: vmulps %xmm2, %xmm0, %xmm0
 ; AVX512-NEXT: retq
   %sqrt = tail call <4 x float> @llvm.sqrt.v4f32(<4 x float> %x)
   %div = fdiv fast <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>, %sqrt
   ret <4 x float> %div
 }
 
+define <4 x float> @v4f32_estimate2(<4 x float> %x) #5 {
+; SSE-LABEL: v4f32_estimate2:
+; SSE: # %bb.0:
+; SSE-NEXT: rsqrtps %xmm0, %xmm2
+; SSE-NEXT: mulps %xmm0, %xmm2
+; SSE-NEXT: andps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0
+; SSE-NEXT: movaps {{.*#+}} xmm1 = [1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38]
+; SSE-NEXT: cmpleps %xmm0, %xmm1
+; SSE-NEXT: andps %xmm2, %xmm1
+; SSE-NEXT: movaps %xmm1, %xmm0
+; SSE-NEXT: retq
+;
+; AVX1-LABEL: v4f32_estimate2:
+; AVX1: # %bb.0:
+; AVX1-NEXT: vrsqrtps %xmm0, %xmm1
+; AVX1-NEXT: vandps {{\.?LCPI[0-9]+_[0-9]+}}(%rip), %xmm0, %xmm2
+; AVX1-NEXT: vmulps %xmm1, %xmm0, %xmm0
+; AVX1-NEXT: vmovaps {{.*#+}} xmm1 = [1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38]
+; AVX1-NEXT: vcmpleps %xmm2, %xmm1, %xmm1
+; AVX1-NEXT: vandps %xmm0, %xmm1, %xmm0
+; AVX1-NEXT: retq
+;
+; AVX512-LABEL: v4f32_estimate2:
+; AVX512: # %bb.0:
+; AVX512-NEXT: vrsqrtps %xmm0, %xmm1
+; AVX512-NEXT: vmulps %xmm1, %xmm0, %xmm1
+; AVX512-NEXT: vbroadcastss {{.*#+}} xmm2 = [NaN,NaN,NaN,NaN]
+; AVX512-NEXT: vandps %xmm2, %xmm0, %xmm0
+; AVX512-NEXT: vbroadcastss {{.*#+}} xmm2 = [1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38]
+; AVX512-NEXT: vcmpleps %xmm0, %xmm2, %xmm0
+; AVX512-NEXT: vandps %xmm1, %xmm0, %xmm0
+; AVX512-NEXT: retq
+  %sqrt = tail call fast <4 x float> @llvm.sqrt.v4f32(<4 x float> %x)
+  ret <4 x float> %sqrt
+}
+
 define <8 x float> @v8f32_no_estimate(<8 x float> %x) #0 {
 ; SSE-LABEL: v8f32_no_estimate:
 ; SSE: # %bb.0:
 ; SSE-NEXT: sqrtps %xmm1, %xmm2
 ; SSE-NEXT: sqrtps %xmm0, %xmm3
 ; SSE-NEXT: movaps {{.*#+}} xmm1 = [1.0E+0,1.0E+0,1.0E+0,1.0E+0]
 ; SSE-NEXT: movaps %xmm1, %xmm0
 ; SSE-NEXT: divps %xmm3, %xmm0
@@ ... 558 lines not shown ... @@ ; AVX-NEXT: retq
   ret double %sqrt_fast
 }
 
 attributes #0 = { "unsafe-fp-math"="true" "reciprocal-estimates"="!sqrtf,!vec-sqrtf,!divf,!vec-divf" }
 attributes #1 = { "unsafe-fp-math"="true" "reciprocal-estimates"="sqrt,vec-sqrt" }
 attributes #2 = { nounwind readnone }
 attributes #3 = { "unsafe-fp-math"="true" "reciprocal-estimates"="sqrt,vec-sqrt" "denormal-fp-math"="ieee" }
 attributes #4 = { "unsafe-fp-math"="true" "reciprocal-estimates"="sqrt,vec-sqrt" "denormal-fp-math"="ieee,preserve-sign" }
+attributes #5 = { "unsafe-fp-math"="true" "reciprocal-estimates"="all:0" }