This is an archive of the discontinued LLVM Phabricator instance.

Does it buy us anything performance-wise? AFAICT llvm may be generating better code for GPUs w/o fp16 support -- it does xor on the 32-bit value w/o splitting it into 16-bit halves. https://godbolt.org/z/Wjx7ceT75
Or is it needed to flush fp16 denormals consistently?
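
For readers following the xor point above, here is a minimal sketch of negation by sign-bit flip on raw IEEE half bit patterns. The helper names are hypothetical and not part of LLVM. It only illustrates why a packed pair of halves can be negated with one 32-bit xor; it is not the code LLVM emits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not LLVM APIs): negate IEEE half values by flipping
// the sign bit of their raw bit patterns.
constexpr std::uint16_t negF16Bits(std::uint16_t Bits) {
  return static_cast<std::uint16_t>(Bits ^ 0x8000u);
}

// Two halves packed into one 32-bit value: a single 32-bit xor flips both
// sign bits, so the value never has to be split into 16-bit halves.
constexpr std::uint32_t negF16x2Bits(std::uint32_t Bits) {
  return Bits ^ 0x80008000u;
}

// 0x3C00 encodes 1.0 and 0x4000 encodes 2.0 in IEEE half precision.
static_assert(negF16Bits(0x3C00) == 0xBC00, "1.0 becomes -1.0");
static_assert(negF16x2Bits(0x3C004000u) == 0xBC00C000u, "both lanes negated");
```

Note that a plain sign-bit flip leaves subnormal inputs as they are, which is why the flushing question matters.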

This revision is now accepted and ready to land. (Oct 7 2022, 10:30 AM)

In D135428#3843292, @tra wrote:

> Just curious -- what prompts this change?
>
> Does it buy us anything performance-wise? AFAICT llvm may be generating better code for GPUs w/o fp16 support -- it does xor on the 32-bit value w/o splitting it into 16-bit halves. https://godbolt.org/z/Wjx7ceT75
> Or is it needed to flush fp16 denormals consistently?

In all honesty, I don't know what the motivation for this was; it came to my attention as a DPC++ bug (https://github.com/intel/llvm/issues/6958). I do think that your point about flushing behavior is important and should be preserved.
FWIW, using neg directly does not require the bitcasts from Float16x2Regs to Int32Regs or from Float16Regs to Int16Regs that the xor case needs. With this patch the generated PTX is:

        // .globl       test_neg_f16
.visible .func  (.param .b32 func_retval0) test_neg_f16(
        .param .b32 test_neg_f16_param_0
)
{
        .reg .b16       %h<3>;

        ld.param.b16    %h1, [test_neg_f16_param_0];
        neg.f16         %h2, %h1;
        st.param.b16    [func_retval0+0], %h2;
        ret;

}
        // .globl       test_neg_f16x2
.visible .func  (.param .align 4 .b8 func_retval0[4]) test_neg_f16x2(
        .param .align 4 .b8 test_neg_f16x2_param_0[4]
)
{
        .reg .b32       %hh<3>;

        ld.param.b32    %hh1, [test_neg_f16x2_param_0];
        neg.f16x2       %hh2, %hh1;
        st.param.b32    [func_retval0+0], %hh2;
        ret;

}

In D135428#3848687, @jchlanda wrote:

> In all honesty, I don't know what the motivation for this was; it came to my attention as a DPC++ bug (https://github.com/intel/llvm/issues/6958). I do think that your point about flushing behavior is important and should be preserved.
>
> FWIW, using neg directly does not require the bitcasts from Float16x2Regs to Int32Regs or from Float16Regs to Int16Regs that the xor case needs.

Such bitcasts are essentially no-ops once ptxas is done with them. The PTX ends up being a bit more verbose, but that usually has no impact on the SASS. FP and integer values are kept in the same registers on the actual hardware. I've commented on the original bug.

Anyways, I think this change is fine. I just wanted to make sure I'm not missing something.

In D135428#3850180, @tra wrote:

> In D135428#3848687, @jchlanda wrote:
>
> > In all honesty, I don't know what the motivation for this was; it came to my attention as a DPC++ bug (https://github.com/intel/llvm/issues/6958). I do think that your point about flushing behavior is important and should be preserved.
> >
> > FWIW, using neg directly does not require the bitcasts from Float16x2Regs to Int32Regs or from Float16Regs to Int16Regs that the xor case needs.
>
> Such bitcasts are essentially no-ops once ptxas is done with them. The PTX ends up being a bit more verbose, but that usually has no impact on the SASS. FP and integer values are kept in the same registers on the actual hardware. I've commented on the original bug.
>
> Anyways, I think this change is fine. I just wanted to make sure I'm not missing something.

Thank you for explaining and for commenting on the GitHub issue. I had a feeling that those extra moves would be swizzled into nothing when generating SASS.
Would you be so kind as to land this patch for me?

Closed by commit rG8407fdbd691e: [NVPTX] Support neg{.ftz} for f16 and f16x2 (authored by jchlanda, committed by tra). (Oct 13 2022, 10:55 AM)

This revision was automatically updated to reflect the committed changes.

tra added a commit: rG8407fdbd691e: [NVPTX] Support neg{.ftz} for f16 and f16x2.

Revision Contents

Path                                          Size
llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp   10 lines
llvm/lib/Target/NVPTX/NVPTXInstrInfo.td       13 lines
llvm/test/CodeGen/NVPTX/f16-instructions.ll   27 lines

Diff 467539

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

@@ NVPTXTargetLowering::NVPTXTargetLowering(const NVPTXTargetMachine &TM,
   // hardware, only sm_53 and sm_60 have full implementation. Others
   // only have token amount of hardware and are likely to run faster
   // by using fp32 units instead.
   for (const auto &Op : {ISD::FADD, ISD::FMUL, ISD::FSUB, ISD::FMA}) {
     setFP16OperationAction(Op, MVT::f16, Legal, Promote);
     setFP16OperationAction(Op, MVT::v2f16, Legal, Expand);
   }

-  // There's no neg.f16 instruction. Expand to (0-x).
-  setOperationAction(ISD::FNEG, MVT::f16, Expand);
-  setOperationAction(ISD::FNEG, MVT::v2f16, Expand);
+  // f16/f16x2 neg was introduced in PTX 60, SM_53.
+  const bool IsFP16FP16x2NegAvailable = STI.getSmVersion() >= 53 &&
+                                        STI.getPTXVersion() >= 60 &&
+                                        STI.allowFP16Math();
+  for (const auto &VT : {MVT::f16, MVT::v2f16})
+    setOperationAction(ISD::FNEG, VT,
+                       IsFP16FP16x2NegAvailable ? Legal : Expand);

   // (would be) Library functions.

   // These map to conversion instructions for scalar FP types.
   for (const auto &Op : {ISD::FCEIL, ISD::FFLOOR, ISD::FNEARBYINT, ISD::FRINT,
                          ISD::FROUNDEVEN, ISD::FTRUNC}) {
     setOperationAction(Op, MVT::f16, Legal);
     setOperationAction(Op, MVT::f32, Legal);
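
The legality decision in this hunk can be sketched outside LLVM as a small predicate. The `Action` enum and `SubtargetInfo` struct below are hypothetical stand-ins for LLVM's legalize actions and the NVPTX subtarget queries (`getSmVersion()`, `getPTXVersion()`, `allowFP16Math()`); only the three-way condition is taken from the patch.

```cpp
#include <cassert>

// Hypothetical stand-ins for LLVM's legalize actions and subtarget queries.
enum class Action { Legal, Expand };

struct SubtargetInfo {
  unsigned SmVersion;  // e.g. 53 for sm_53
  unsigned PTXVersion; // e.g. 60 for PTX 6.0
  bool AllowFP16Math;  // assumed false when f16 math is disabled
};

// Mirrors the condition in the patch: FNEG on f16/v2f16 is Legal only when
// the SM version, the PTX version, and the f16-math setting all allow it.
constexpr Action fnegF16Action(const SubtargetInfo &STI) {
  return (STI.SmVersion >= 53 && STI.PTXVersion >= 60 && STI.AllowFP16Math)
             ? Action::Legal
             : Action::Expand;
}

static_assert(fnegF16Action({53, 60, true}) == Action::Legal, "native neg");
static_assert(fnegF16Action({52, 60, true}) == Action::Expand, "old SM");
```

When any of the three conditions fails, the operation is expanded, which matches the xor-based output that the CHECK-NOF16 test lines expect.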

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td

 defm FMINNAN : F3<"min.NaN", fminimum>;
 defm FMAXNAN : F3<"max.NaN", fmaximum>;

 defm FABS : F2<"abs", fabs>;
 defm FNEG : F2<"neg", fneg>;
 defm FSQRT : F2<"sqrt.rn", fsqrt>;

 //
+// F16 NEG
+//
+class FNEG_F16_F16X2<string OpcStr, RegisterClass RC, Predicate Pred> :
+      NVPTXInst<(outs RC:$dst), (ins RC:$src),
+                !strconcat(OpcStr, " \t$dst, $src;"),
+                [(set RC:$dst, (fneg RC:$src))]>,
+                Requires<[useFP16Math, hasPTX60, hasSM53, Pred]>;
+def FNEG16_ftz : FNEG_F16_F16X2<"neg.ftz.f16", Float16Regs, doF32FTZ>;
+def FNEG16 : FNEG_F16_F16X2<"neg.f16", Float16Regs, True>;
+def FNEG16x2_ftz : FNEG_F16_F16X2<"neg.ftz.f16x2", Float16x2Regs, doF32FTZ>;
+def FNEG16x2 : FNEG_F16_F16X2<"neg.f16x2", Float16x2Regs, True>;
+
+//
 // F64 division
 //
 def FDIV641r :
   NVPTXInst<(outs Float64Regs:$dst),
             (ins f64imm:$a, Float64Regs:$b),
             "rcp.rn.f64 \t$dst, $b;",
             [(set Float64Regs:$dst, (fdiv DoubleConst1:$a, Float64Regs:$b))]>;
 def FDIV64rr :
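
The four new defs differ only in register class and in whether the `.ftz` modifier is present. A rough way to picture which mnemonic ends up in the PTX is the sketch below; `negMnemonic` is a hypothetical helper, and the assumption that the `doF32FTZ` variants win whenever that predicate holds is inferred from the CHECK-F16-FTZ test lines rather than from the TableGen selection rules themselves.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not an LLVM API): builds the PTX mnemonic that the
// four FNEG_F16_F16X2 defs spell out by hand.
//   IsPacked    - true for <2 x half> (Float16x2Regs), false for half.
//   FlushToZero - assumed to correspond to the doF32FTZ predicate.
inline std::string negMnemonic(bool IsPacked, bool FlushToZero) {
  std::string Mnemonic = "neg";
  if (FlushToZero)
    Mnemonic += ".ftz";
  Mnemonic += IsPacked ? ".f16x2" : ".f16";
  return Mnemonic;
}
```

For example, `negMnemonic(true, true)` yields the string used by `FNEG16x2_ftz`.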

llvm/test/CodeGen/NVPTX/f16-instructions.ll

 ; ## Full FP16 support enabled by default.
 ; RUN: llc < %s -mtriple=nvptx64-nvidia-cuda -mcpu=sm_53 -asm-verbose=false \
 ; RUN: -O0 -disable-post-ra -frame-pointer=all -verify-machineinstrs \
+; RUN: -mattr=+ptx60 \
 ; RUN: | FileCheck -check-prefixes CHECK,CHECK-NOFTZ,CHECK-F16-NOFTZ %s
 ; RUN: %if ptxas %{ \
 ; RUN: llc < %s -mtriple=nvptx64-nvidia-cuda -mcpu=sm_53 -asm-verbose=false \
 ; RUN: -O0 -disable-post-ra -frame-pointer=all -verify-machineinstrs \
+; RUN: -mattr=+ptx60 \
 ; RUN: | %ptxas-verify -arch=sm_53 \
 ; RUN: %}
 ; ## Full FP16 with FTZ
 ; RUN: llc < %s -mtriple=nvptx64-nvidia-cuda -mcpu=sm_53 -asm-verbose=false \
 ; RUN: -O0 -disable-post-ra -frame-pointer=all -verify-machineinstrs \
-; RUN: -denormal-fp-math-f32=preserve-sign \
+; RUN: -denormal-fp-math-f32=preserve-sign -mattr=+ptx60 \
 ; RUN: | FileCheck -check-prefixes CHECK,CHECK-F16-FTZ %s
 ; RUN: %if ptxas %{ \
 ; RUN: llc < %s -mtriple=nvptx64-nvidia-cuda -mcpu=sm_53 -asm-verbose=false \
 ; RUN: -O0 -disable-post-ra -frame-pointer=all -verify-machineinstrs \
-; RUN: -denormal-fp-math-f32=preserve-sign \
+; RUN: -denormal-fp-math-f32=preserve-sign -mattr=+ptx60 \
 ; RUN: | %ptxas-verify -arch=sm_53 \
 ; RUN: %}
 ; ## FP16 support explicitly disabled.
 ; RUN: llc < %s -mtriple=nvptx64-nvidia-cuda -mcpu=sm_53 -asm-verbose=false \
 ; RUN: -O0 -disable-post-ra -frame-pointer=all --nvptx-no-f16-math \
-; RUN: -verify-machineinstrs \
+; RUN: -verify-machineinstrs -mattr=+ptx60 \
 ; RUN: | FileCheck -check-prefixes CHECK,CHECK-NOFTZ,CHECK-NOF16 %s
 ; RUN: %if ptxas %{ \
 ; RUN: llc < %s -mtriple=nvptx64-nvidia-cuda -mcpu=sm_53 -asm-verbose=false \
 ; RUN: -O0 -disable-post-ra -frame-pointer=all --nvptx-no-f16-math \
 ; RUN: | %ptxas-verify -arch=sm_53 \
 ; RUN: %}
 ; ## FP16 is not supported by hardware.
 ; RUN: llc < %s -O0 -mtriple=nvptx64-nvidia-cuda -mcpu=sm_52 -asm-verbose=false \
@@
 ; CHECK-NOF16-NEXT: cvt.rn.f16.f32 [[R:%h[0-9]+]], [[R32]]
 ; CHECK: st.param.b16 [func_retval0+0], [[R]];
 ; CHECK: ret;
 define half @test_fmuladd(half %a, half %b, half %c) #0 {
   %r = call half @llvm.fmuladd.f16(half %a, half %b, half %c)
   ret half %r
 }

+; CHECK-LABEL: test_neg_f16(
+; CHECK-F16-NOFTZ: neg.f16
+; CHECK-F16-FTZ: neg.ftz.f16
+; CHECK-NOF16: xor.b16 %rs{{.}}, %rs{{.}}, -32768
+define half @test_neg_f16(half noundef %arg) #0 {
+  %res = fneg half %arg
+  ret half %res
+}
+
+; CHECK-LABEL: test_neg_f16x2(
+; CHECK-F16-NOFTZ: neg.f16x2
+; CHECK-F16-FTZ: neg.ftz.f16x2
+; CHECK-NOF16: xor.b16 %rs{{.}}, %rs{{.}}, -32768
+; CHECK-NOF16: xor.b16 %rs{{.}}, %rs{{.}}, -32768
+define <2 x half> @test_neg_f16x2(<2 x half> noundef %arg) #0 {
+  %res = fneg <2 x half> %arg
+  ret <2 x half> %res
+}
+
 attributes #0 = { nounwind }
 attributes #1 = { "unsafe-fp-math" = "true" }
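
The CHECK-NOF16 lines expect `xor.b16` with the immediate -32768, while the FTZ configuration expects `neg.ftz`. The sketch below contrasts the two on raw half bit patterns. It assumes that `neg.ftz.f16` flushes a subnormal input to a sign-preserving zero before negating, which is how the PTX ISA describes the `.ftz` modifier; the function names are hypothetical and the model is illustrative, not a hardware reference.

```cpp
#include <cassert>
#include <cstdint>

// The xor.b16 immediate -32768 is the signed spelling of the sign-bit mask.
static_assert(static_cast<std::uint16_t>(-32768) == 0x8000, "sign-bit mask");

// Expansion path: flip the sign bit and leave subnormals untouched.
inline std::uint16_t negByXor(std::uint16_t Bits) {
  return static_cast<std::uint16_t>(Bits ^ 0x8000u);
}

// Assumed model of neg.ftz.f16: a subnormal input (zero exponent, non-zero
// mantissa) is first replaced by a zero of the same sign, then negated.
inline std::uint16_t negFlushToZero(std::uint16_t Bits) {
  const bool IsSubnormal = (Bits & 0x7C00u) == 0 && (Bits & 0x03FFu) != 0;
  const std::uint16_t Flushed =
      IsSubnormal ? static_cast<std::uint16_t>(Bits & 0x8000u) : Bits;
  return static_cast<std::uint16_t>(Flushed ^ 0x8000u);
}
```

Under that assumption the two paths agree on normal values and differ only on subnormal inputs, which is the flushing consistency raised in the review.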
