This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Select V_CVT_*16_F16 more often
ClosedPublic

Authored by jpages on Apr 28 2021, 12:42 PM.

Details

Reviewers

sebastian-ne
arsenm
foad

Commits

rGa1ed39df96bc: [AMDGPU] Select V_CVT_*16_F16 more often

Summary

Improve code generation of fp_to_sint and fp_to_uint
when the integer result is 16 bits wide.

Diff Detail

Event Timeline

jpages created this revision. Apr 28 2021, 12:42 PM

Herald added subscribers: kerbowa, hiraditya, t-tye and 7 others. Apr 28 2021, 12:42 PM

jpages requested review of this revision. Apr 28 2021, 12:42 PM

Herald added a project: Restricted Project. Apr 28 2021, 12:42 PM

Herald added subscribers: llvm-commits, wdng.

jpages added a reviewer: arsenm. Apr 28 2021, 12:45 PM

All of the patterns you added look to me like workarounds for the DAGCombiner or legalizer not doing its job correctly

llvm/lib/Target/AMDGPU/SIInstructions.td
918	This is identical to the pattern attached to the instruction definition, so this shouldn't be doing anything
923	Ditto
928–929	I don't see the advantage of just selecting the 16-bit result op to the 32-bit result. The legalizer should just take care of this
933	Ditto
938	Ditto
943	Ditto
949–955	Would these sexts ever reach the selector? I would expect the combiner to take care of these. I don't see tests for this
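
For context on the "the legalizer should just take care of this" remarks: promoting a conversion with a 16-bit integer result generally means converting at 32 bits and then truncating. The following is a minimal sketch of that idea in plain C++, not LLVM code. The function names are hypothetical, and only inputs whose converted value fits in 16 bits are modeled.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model, not LLVM code: convert straight to a 16-bit result.
static std::int16_t fptosiDirect(float X) {
  assert(X > -32769.0f && X < 32768.0f && "out-of-range input is not modeled");
  return static_cast<std::int16_t>(X);
}

// Hypothetical model of promotion: convert at 32 bits (as a 32-bit
// conversion instruction would), then truncate the result to 16 bits.
static std::int16_t fptosiPromoted(float X) {
  assert(X > -32769.0f && X < 32768.0f && "out-of-range input is not modeled");
  const std::int32_t Wide = static_cast<std::int32_t>(X);
  return static_cast<std::int16_t>(Wide);
}
```

For in-range inputs the two functions agree, which is why a dedicated selection pattern for the promoted form should not be needed.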

Harbormaster completed remote builds in B101483: Diff 341292. Apr 28 2021, 2:36 PM

foad added a subscriber: foad. Apr 28 2021, 3:03 PM

foad added inline comments. Apr 29 2021, 2:59 AM

llvm/test/CodeGen/AMDGPU/fp_to_uint.ll
243–244	Just use a single GCN: check line.
llvm/test/CodeGen/AMDGPU/fptosi.f16.ll
2	Don't do this. Instead change GCN: to SI: on lines 5-8. GCN is supposed to be used for checks that are the same for all architectures (gfx6 onwards).
llvm/test/CodeGen/AMDGPU/fptoui.f16.ll
2	Don't do this.

Thank you for your input.

Rebased the patch and moved the change into the legalizer, which does seem more natural.
I also took the opportunity to do the suggested refactoring and merged LowerFP_TO_SINT and LowerFP_TO_UINT into one function.

@foad, I changed the tests back to the original version; thanks for the comment.

jpages marked 3 inline comments as done. Apr 30 2021, 3:11 PM

LGTM with nits

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2734–2737 ↗	(On Diff #342057)	Combine into one if?
2759 ↗	(On Diff #342057)	Don't need the ternary operator, just directly pass the bool expression
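
The two nits above are generic C++ cleanups. Since the reviewed lines of AMDGPUISelLowering.cpp are not shown on this page, here is a minimal self-contained sketch of both. Every name in it (`Opcode`, `cvt16`, `cvt32`, `selectBefore`, `selectAfter`) is hypothetical, and the selection logic is simplified for illustration only.

```cpp
#include <string>

enum class Opcode { FpToSInt, FpToUInt };

// Hypothetical helpers that pick a mnemonic from a signedness flag.
static std::string cvt16(bool Signed) {
  return Signed ? "v_cvt_i16_f16" : "v_cvt_u16_f16";
}
static std::string cvt32(bool Signed) {
  return Signed ? "v_cvt_i32_f32" : "v_cvt_u32_f32";
}

// Before: nested ifs, and a ternary that only restates the comparison.
static std::string selectBefore(Opcode Op, bool Has16BitInsts, bool SrcIsF16) {
  if (Has16BitInsts) {
    if (SrcIsF16)
      return cvt16(Op == Opcode::FpToSInt ? true : false);
  }
  return cvt32(Op == Opcode::FpToSInt ? true : false);
}

// After: one combined condition, and the bool expression passed directly.
static std::string selectAfter(Opcode Op, bool Has16BitInsts, bool SrcIsF16) {
  if (Has16BitInsts && SrcIsF16)
    return cvt16(Op == Opcode::FpToSInt);
  return cvt32(Op == Opcode::FpToSInt);
}
```

Both versions return the same result for every input; the second is shorter and states the condition once.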

Harbormaster completed remote builds in B102032: Diff 342057. Apr 30 2021, 5:30 PM

Updated for code style

jpages marked 2 inline comments as done. May 3 2021, 8:36 AM

Harbormaster completed remote builds in B102295: Diff 342406. May 3 2021, 9:20 AM

foad accepted this revision. May 4 2021, 1:14 AM

This revision is now accepted and ready to land. May 4 2021, 1:14 AM

Thanks for the review. I don't have commit access to the repo; could someone commit it for me?

This revision was landed with ongoing or failed builds. May 5 2021, 12:58 AM

Closed by commit rGa1ed39df96bc: [AMDGPU] Select V_CVT_*16_F16 more often (authored by jpages, committed by foad).

This revision was automatically updated to reflect the committed changes.

foad added a commit: rGa1ed39df96bc: [AMDGPU] Select V_CVT_*16_F16 more often.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	4 lines
llvm/lib/Target/AMDGPU/SIInstructions.td	42 lines
llvm/test/CodeGen/AMDGPU/fp_to_uint.ll	5 lines
llvm/test/CodeGen/AMDGPU/fptosi.f16.ll	36 lines
llvm/test/CodeGen/AMDGPU/fptoui.f16.ll	27 lines

Diff 341292

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

(524 lines not shown)
if (Subtarget->has16BitInsts()) {

setTruncStoreAction(MVT::i64, MVT::i16, Expand);		setTruncStoreAction(MVT::i64, MVT::i16, Expand);

setOperationAction(ISD::FP16_TO_FP, MVT::i16, Promote);		setOperationAction(ISD::FP16_TO_FP, MVT::i16, Promote);
AddPromotedToType(ISD::FP16_TO_FP, MVT::i16, MVT::i32);		AddPromotedToType(ISD::FP16_TO_FP, MVT::i16, MVT::i32);
setOperationAction(ISD::FP_TO_FP16, MVT::i16, Promote);		setOperationAction(ISD::FP_TO_FP16, MVT::i16, Promote);
AddPromotedToType(ISD::FP_TO_FP16, MVT::i16, MVT::i32);		AddPromotedToType(ISD::FP_TO_FP16, MVT::i16, MVT::i32);

setOperationAction(ISD::FP_TO_SINT, MVT::i16, Promote);		setOperationAction(ISD::FP_TO_SINT, MVT::i16, Legal);
setOperationAction(ISD::FP_TO_UINT, MVT::i16, Promote);		setOperationAction(ISD::FP_TO_UINT, MVT::i16, Legal);

// F16 - Constant Actions.		// F16 - Constant Actions.
setOperationAction(ISD::ConstantFP, MVT::f16, Legal);		setOperationAction(ISD::ConstantFP, MVT::f16, Legal);

// F16 - Load/Store Actions.		// F16 - Load/Store Actions.
setOperationAction(ISD::LOAD, MVT::f16, Promote);		setOperationAction(ISD::LOAD, MVT::f16, Promote);
AddPromotedToType(ISD::LOAD, MVT::f16, MVT::i16);		AddPromotedToType(ISD::LOAD, MVT::f16, MVT::i16);
setOperationAction(ISD::STORE, MVT::f16, Promote);		setOperationAction(ISD::STORE, MVT::f16, Promote);
(11,563 lines not shown)

llvm/lib/Target/AMDGPU/SIInstructions.td

(908 lines not shown)
def : GCNPat <
(V_CVT_F16_F32_e32 (V_CVT_F32_I32_e32 VSrc_b32:$src))		(V_CVT_F16_F32_e32 (V_CVT_F32_I32_e32 VSrc_b32:$src))
>;		>;

def : GCNPat <		def : GCNPat <
(f16 (uint_to_fp i32:$src)),		(f16 (uint_to_fp i32:$src)),
(V_CVT_F16_F32_e32 (V_CVT_F32_U32_e32 VSrc_b32:$src))		(V_CVT_F16_F32_e32 (V_CVT_F32_U32_e32 VSrc_b32:$src))
>;		>;

		def : GCNPat <
		(i16 (fp_to_sint f16:$src)),
		arsenm: This is identical to the pattern attached to the instruction definition, so this shouldn't be doing anything
		(V_CVT_I16_F16_e32 VSrc_b32:$src)
		>;

		def : GCNPat <
		(i16 (fp_to_uint f16:$src)),
		arsenm: Ditto
		(V_CVT_U16_F16_e32 VSrc_b32:$src)
		>;

		def : GCNPat <
		(i16 (fp_to_sint f32:$src)),
		(V_CVT_I32_F32_e32 VSrc_b32:$src)
		arsenm: I don't see the advantage of just selecting the 16-bit result op to the 32-bit result. The legalizer should just take care of this
		>;

		def : GCNPat <
		(i16 (fp_to_uint f32:$src)),
		arsenm: Ditto
		(V_CVT_U32_F32_e32 VSrc_b32:$src)
		>;

		def : GCNPat <
		(i16 (fp_to_sint f64:$src)),
		arsenm: Ditto
		(V_CVT_I32_F64_e32 VReg_64:$src)
		>;

		def : GCNPat <
		(i16 (fp_to_uint f64:$src)),
		arsenm: Ditto
		(V_CVT_U32_F64_e32 VReg_64:$src)
		>;

		let OtherPredicates = [HasSDWA] in {
		def : GCNPat <
		(i32 (sext (i16 (fp_to_sint f16:$src)))),
		(V_LSHLREV_B32_e32 (i32 16), (V_CVT_I16_F16_e32 VSrc_b32:$src))
		>;

		def : GCNPat <
		(i32 (sext (i16 (fp_to_uint f16:$src)))),
		(V_LSHLREV_B32_e32 (i32 16), (V_CVT_U16_F16_e32 VSrc_b32:$src))
		arsenm: Would these sexts ever reach the selector? I would expect the combiner to take care of these. I don't see tests for this
		>;
		} // OtherPredicates = [HasSDWA]

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// VOP2 Patterns		// VOP2 Patterns
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

// NoMods pattern used for mac. If there are any source modifiers then it's		// NoMods pattern used for mac. If there are any source modifiers then it's
// better to select mad instead of mac.		// better to select mad instead of mac.
class FMADPat <ValueType vt, Instruction inst, SDPatternOperator node>		class FMADPat <ValueType vt, Instruction inst, SDPatternOperator node>
: GCNPat <(vt (node (vt (VOP3NoMods vt:$src0)),		: GCNPat <(vt (node (vt (VOP3NoMods vt:$src0)),
(1,929 lines not shown)

llvm/test/CodeGen/AMDGPU/fp_to_uint.ll

	(234 lines not shown)
	define amdgpu_kernel void @fp_to_uint_fabs_f32_to_i1(i1 addrspace(1)* %out, float %in) #0 {			define amdgpu_kernel void @fp_to_uint_fabs_f32_to_i1(i1 addrspace(1)* %out, float %in) #0 {
	%in.fabs = call float @llvm.fabs.f32(float %in)			%in.fabs = call float @llvm.fabs.f32(float %in)
	%conv = fptoui float %in.fabs to i1			%conv = fptoui float %in.fabs to i1
	store i1 %conv, i1 addrspace(1)* %out			store i1 %conv, i1 addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}fp_to_uint_f32_to_i16:			; FUNC-LABEL: {{^}}fp_to_uint_f32_to_i16:
	; The reason different instructions are used on SI and VI is because for
	; SI fp_to_uint is legalized by the type legalizer and for VI it is
	; legalized by the dag legalizer and they legalize fp_to_uint differently.
	; SI: v_cvt_u32_f32_e32 [[VAL:v[0-9]+]], s{{[0-9]+}}			; SI: v_cvt_u32_f32_e32 [[VAL:v[0-9]+]], s{{[0-9]+}}
	; VI: v_cvt_i32_f32_e32 [[VAL:v[0-9]+]], s{{[0-9]+}}			; VI: v_cvt_u32_f32_e32 [[VAL:v[0-9]+]], s{{[0-9]+}}
				foad: Just use a single GCN: check line.
	; GCN: buffer_store_short [[VAL]]			; GCN: buffer_store_short [[VAL]]
	define amdgpu_kernel void @fp_to_uint_f32_to_i16(i16 addrspace(1)* %out, float %in) #0 {			define amdgpu_kernel void @fp_to_uint_f32_to_i16(i16 addrspace(1)* %out, float %in) #0 {
	%uint = fptoui float %in to i16			%uint = fptoui float %in to i16
	store i16 %uint, i16 addrspace(1)* %out			store i16 %uint, i16 addrspace(1)* %out
	ret void			ret void
	}			}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }
	attributes #1 = { nounwind readnone }			attributes #1 = { nounwind readnone }

llvm/test/CodeGen/AMDGPU/fptosi.f16.ll

	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=SI %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=SI %s
	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=fiji -mattr=-flat-for-global -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=VI %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=fiji -mattr=-flat-for-global -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=VI %s
				foad: Don't do this. Instead change GCN: to SI: on lines 5-8. GCN is supposed to be used for checks that are the same for all architectures (gfx6 onwards).

	; GCN-LABEL: {{^}}fptosi_f16_to_i16			; GCN-LABEL: {{^}}fptosi_f16_to_i16
	; GCN: buffer_load_ushort v[[A_F16:[0-9]+]]			; GCN: buffer_load_ushort v[[A_F16:[0-9]+]]

	; GCN: v_cvt_f32_f16_e32 v[[A_F32:[0-9]+]], v[[A_F16]]			; GCN: v_cvt_f32_f16_e32 v[[A_F32:[0-9]+]], v[[A_F16]]
	; GCN: v_cvt_i32_f32_e32 v[[R_I16:[0-9]+]], v[[A_F32]]			; GCN: v_cvt_i32_f32_e32 v[[R_I16:[0-9]+]], v[[A_F32]]

				; VI: buffer_load_ushort v[[A_F16:[0-9]+]]
				; VI: v_cvt_i16_f16_e32 v[[R_I16:[0-9]+]], v[[A_F16]]
				; VI: buffer_store_short v[[R_I16]]

	; GCN: buffer_store_short v[[R_I16]]			; GCN: buffer_store_short v[[R_I16]]
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @fptosi_f16_to_i16(			define amdgpu_kernel void @fptosi_f16_to_i16(
	i16 addrspace(1)* %r,			i16 addrspace(1)* %r,
	half addrspace(1)* %a) {			half addrspace(1)* %a) {
	entry:			entry:
	%a.val = load half, half addrspace(1)* %a			%a.val = load half, half addrspace(1)* %a
	%r.val = fptosi half %a.val to i16			%r.val = fptosi half %a.val to i16
	(44 lines not shown)
	; SI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_V2_F16]]			; SI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_V2_F16]]
	; SI: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]			; SI: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]
	; SI: v_cvt_i32_f32_e32 v[[R_I16_0:[0-9]+]], v[[A_F32_0]]			; SI: v_cvt_i32_f32_e32 v[[R_I16_0:[0-9]+]], v[[A_F32_0]]
	; SI-DAG: v_cvt_i32_f32_e32 v[[R_I16_1:[0-9]+]], v[[A_F32_1]]			; SI-DAG: v_cvt_i32_f32_e32 v[[R_I16_1:[0-9]+]], v[[A_F32_1]]
	; SI-DAG: v_and_b32_e32 v[[R_I16_LO:[0-9]+]], 0xffff, v[[R_I16_0]]			; SI-DAG: v_and_b32_e32 v[[R_I16_LO:[0-9]+]], 0xffff, v[[R_I16_0]]
	; SI: v_lshlrev_b32_e32 v[[R_I16_HI:[0-9]+]], 16, v[[R_I16_1]]			; SI: v_lshlrev_b32_e32 v[[R_I16_HI:[0-9]+]], 16, v[[R_I16_1]]
	; SI: v_or_b32_e32 v[[R_V2_I16:[0-9]+]], v[[R_I16_LO]], v[[R_I16_HI]]			; SI: v_or_b32_e32 v[[R_V2_I16:[0-9]+]], v[[R_I16_LO]], v[[R_I16_HI]]

	; VI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_V2_F16]]			; VI: buffer_load_dword v[[A_V2_F16:[0-9]+]]
	; VI: v_cvt_f32_f16_sdwa v[[A_F32_1:[0-9]+]], v[[A_V2_F16]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; VI: v_cvt_i16_f16_e32 v[[R_I16_LO:[0-9]+]], v[[A_V2_F16]]
	; VI: v_cvt_i32_f32_e32 v[[R_I16_0:[0-9]+]], v[[A_F32_0]]			; VI: v_cvt_i16_f16_sdwa v[[A_I32_1:[0-9]+]], v[[A_V2_F16]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1
	; VI: v_cvt_i32_f32_sdwa v[[R_I16_1:[0-9]+]], v[[A_F32_1]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD			; VI: v_or_b32_sdwa v[[R_V2_I16:[0-9]+]], v[[R_I16_LO]], v[[A_I32_1]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
	; VI: v_or_b32_sdwa v[[R_V2_I16:[0-9]+]], v[[R_I16_0]], v[[R_I16_1]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD			; VI: buffer_store_dword v[[R_V2_I16]]

	; GCN: buffer_store_dword v[[R_V2_I16]]			; GCN: buffer_store_dword v[[R_V2_I16]]
	; GCN: s_endpgm			; GCN: s_endpgm

	define amdgpu_kernel void @fptosi_v2f16_to_v2i16(			define amdgpu_kernel void @fptosi_v2f16_to_v2i16(
	<2 x i16> addrspace(1)* %r,			<2 x i16> addrspace(1)* %r,
	<2 x half> addrspace(1)* %a) {			<2 x half> addrspace(1)* %a) {
	entry:			entry:
	%a.val = load <2 x half>, <2 x half> addrspace(1)* %a			%a.val = load <2 x half>, <2 x half> addrspace(1)* %a
	%r.val = fptosi <2 x half> %a.val to <2 x i16>			%r.val = fptosi <2 x half> %a.val to <2 x i16>
	store <2 x i16> %r.val, <2 x i16> addrspace(1)* %r			store <2 x i16> %r.val, <2 x i16> addrspace(1)* %r
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}fptosi_v2f16_to_v2i32			; GCN-LABEL: {{^}}fptosi_v2f16_to_v2i32
	; GCN: buffer_load_dword			; GCN: buffer_load_dword
	; GCN: v_cvt_f32_f16_e32			; GCN: v_cvt_f32_f16_e32
	; SI: v_cvt_f32_f16_e32			; SI: v_cvt_f32_f16_e32

				; VI: v_cvt_f32_f16_e32
	; VI: v_cvt_f32_f16_sdwa			; VI: v_cvt_f32_f16_sdwa
				; VI: v_cvt_i32_f32_e32
				; VI: v_cvt_i32_f32_e32

	; GCN: v_cvt_i32_f32_e32			; GCN: v_cvt_i32_f32_e32
	; GCN: v_cvt_i32_f32_e32			; GCN: v_cvt_i32_f32_e32
	; GCN: buffer_store_dwordx2			; GCN: buffer_store_dwordx2
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @fptosi_v2f16_to_v2i32(			define amdgpu_kernel void @fptosi_v2f16_to_v2i32(
	<2 x i32> addrspace(1)* %r,			<2 x i32> addrspace(1)* %r,
	<2 x half> addrspace(1)* %a) {			<2 x half> addrspace(1)* %a) {
	entry:			entry:
	%a.val = load <2 x half>, <2 x half> addrspace(1)* %a			%a.val = load <2 x half>, <2 x half> addrspace(1)* %a
	%r.val = fptosi <2 x half> %a.val to <2 x i32>			%r.val = fptosi <2 x half> %a.val to <2 x i32>
	store <2 x i32> %r.val, <2 x i32> addrspace(1)* %r			store <2 x i32> %r.val, <2 x i32> addrspace(1)* %r
	ret void			ret void
	}			}

	; Need to make sure we promote f16 to f32 when converting f16 to i64. Existing			; Need to make sure we promote f16 to f32 when converting f16 to i64. Existing
	; test checks code generated for 'i64 = fp_to_sint f32'.			; test checks code generated for 'i64 = fp_to_sint f32'.

	; GCN-LABEL: {{^}}fptosi_v2f16_to_v2i64			; GCN-LABEL: {{^}}fptosi_v2f16_to_v2i64
	; GCN: buffer_load_dword v[[A_F16_0:[0-9]+]]			; GCN: buffer_load_dword v[[A_F16_0:[0-9]+]]
				; VI: buffer_load_dword v[[A_F16_0:[0-9]+]]

	; SI: v_lshrrev_b32_e32 v[[A_F16_1:[0-9]+]], 16, v[[A_F16_0]]			; SI: v_lshrrev_b32_e32 v[[A_F16_1:[0-9]+]], 16, v[[A_F16_0]]
	; SI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_F16_0]]			; SI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_F16_0]]
	; SI: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]			; SI: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]
	; SI: v_cvt_i32_f32_e32 v[[R_I64_0_Low:[0-9]+]], v[[A_F32_0]]			; SI: v_cvt_i32_f32_e32 v[[R_I64_0_Low:[0-9]+]], v[[A_F32_0]]
	; SI: v_ashrrev_i32_e32 v[[R_I64_0_High:[0-9]+]], 31, v[[R_I64_0_Low]]			; SI: v_ashrrev_i32_e32 v[[R_I64_0_High:[0-9]+]], 31, v[[R_I64_0_Low]]
	; SI: v_cvt_i32_f32_e32 v[[R_I64_1_Low:[0-9]+]], v[[A_F32_1]]			; SI: v_cvt_i32_f32_e32 v[[R_I64_1_Low:[0-9]+]], v[[A_F32_1]]
	; SI: v_ashrrev_i32_e32 v[[R_I64_1_High:[0-9]+]], 31, v[[R_I64_1_Low]]			; SI: v_ashrrev_i32_e32 v[[R_I64_1_High:[0-9]+]], 31, v[[R_I64_1_Low]]
	; VI: v_cvt_f32_f16_sdwa v[[A_F32_1:[0-9]+]], v[[A_F16_0]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; VI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_F16_0]]			; VI: v_cvt_f32_f16_sdwa v[[A_F32_0:[0-9]+]], v[[A_F16_0]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; VI: v_cvt_i32_f32_e32 v[[R_I64_1_Low:[0-9]+]], v[[A_F32_1]]			; VI: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_0]]
	; VI: v_cvt_i32_f32_e32 v[[R_I64_0_Low:[0-9]+]], v[[A_F32_0]]			; VI: v_cvt_i32_f32_e32 v[[R_I64_0_Low:[0-9]+]], v[[A_F32_0]]
	; VI: v_ashrrev_i32_e32 v[[R_I64_1_High:[0-9]+]], 31, v[[R_I64_1_Low]]			; VI: v_cvt_i32_f32_e32 v[[R_I64_1_Low:[0-9]+]], v[[A_F32_1]]
	; VI: v_ashrrev_i32_e32 v[[R_I64_0_High:[0-9]+]], 31, v[[R_I64_0_Low]]			; VI: v_ashrrev_i32_e32 v[[R_I64_0_High:[0-9]+]], 31, v[[R_I64_0_Low]]
				; VI: v_ashrrev_i32_e32 v[[R_I64_1_High:[0-9]+]], 31, v[[R_I64_1_Low]]
				; VI: buffer_store_dwordx4 v{{\[}}[[R_I64_1_Low]]{{\:}}[[R_I64_0_High]]{{\]}}

	; GCN: buffer_store_dwordx4 v{{\[}}[[R_I64_0_Low]]{{\:}}[[R_I64_1_High]]{{\]}}			; GCN: buffer_store_dwordx4 v{{\[}}[[R_I64_0_Low]]{{\:}}[[R_I64_1_High]]{{\]}}
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @fptosi_v2f16_to_v2i64(			define amdgpu_kernel void @fptosi_v2f16_to_v2i64(
	<2 x i64> addrspace(1)* %r,			<2 x i64> addrspace(1)* %r,
	<2 x half> addrspace(1)* %a) {			<2 x half> addrspace(1)* %a) {
	entry:			entry:
	%a.val = load <2 x half>, <2 x half> addrspace(1)* %a			%a.val = load <2 x half>, <2 x half> addrspace(1)* %a
	%r.val = fptosi <2 x half> %a.val to <2 x i64>			%r.val = fptosi <2 x half> %a.val to <2 x i64>
	(16 lines not shown)

llvm/test/CodeGen/AMDGPU/fptoui.f16.ll

; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tahiti -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=SI %s		; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tahiti -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=SI %s
; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=fiji -mattr=-flat-for-global -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=VI %s		; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=fiji -mattr=-flat-for-global -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=VI %s
		foad: Don't do this.

; GCN-LABEL: {{^}}fptoui_f16_to_i16		; GCN-LABEL: {{^}}fptoui_f16_to_i16
; GCN: buffer_load_ushort v[[A_F16:[0-9]+]]		; GCN: buffer_load_ushort v[[A_F16:[0-9]+]]
		; VI: buffer_load_ushort v[[A_F16:[0-9]+]]
; GCN: v_cvt_f32_f16_e32 v[[A_F32:[0-9]+]], v[[A_F16]]		; GCN: v_cvt_f32_f16_e32 v[[A_F32:[0-9]+]], v[[A_F16]]
; SI: v_cvt_u32_f32_e32 v[[R_I16:[0-9]+]], v[[A_F32]]		; SI: v_cvt_u32_f32_e32 v[[R_I16:[0-9]+]], v[[A_F32]]
; VI: v_cvt_i32_f32_e32 v[[R_I16:[0-9]+]], v[[A_F32]]
		; VI: v_cvt_u16_f16_e32 v[[R_I16:[0-9]+]], v[[A_F16]]
		; VI: buffer_store_short v[[R_I16]]

; GCN: buffer_store_short v[[R_I16]]		; GCN: buffer_store_short v[[R_I16]]
; GCN: s_endpgm		; GCN: s_endpgm
define amdgpu_kernel void @fptoui_f16_to_i16(		define amdgpu_kernel void @fptoui_f16_to_i16(
i16 addrspace(1)* %r,		i16 addrspace(1)* %r,
half addrspace(1)* %a) {		half addrspace(1)* %a) {
entry:		entry:
%a.val = load half, half addrspace(1)* %a		%a.val = load half, half addrspace(1)* %a
%r.val = fptoui half %a.val to i16		%r.val = fptoui half %a.val to i16
(34 lines not shown)
entry:
%a.val = load half, half addrspace(1)* %a		%a.val = load half, half addrspace(1)* %a
%r.val = fptoui half %a.val to i64		%r.val = fptoui half %a.val to i64
store i64 %r.val, i64 addrspace(1)* %r		store i64 %r.val, i64 addrspace(1)* %r
ret void		ret void
}		}

; GCN-LABEL: {{^}}fptoui_v2f16_to_v2i16		; GCN-LABEL: {{^}}fptoui_v2f16_to_v2i16
; GCN: buffer_load_dword v[[A_V2_F16:[0-9]+]]		; GCN: buffer_load_dword v[[A_V2_F16:[0-9]+]]
		; VI: buffer_load_dword v[[A_V2_F16:[0-9]+]]

; SI: v_lshrrev_b32_e32 v[[A_F16_1:[0-9]+]], 16, v[[A_V2_F16]]		; SI: v_lshrrev_b32_e32 v[[A_F16_1:[0-9]+]], 16, v[[A_V2_F16]]
; SI-DAG: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]		; SI-DAG: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]
; SI-DAG: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_V2_F16]]		; SI-DAG: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_V2_F16]]
; SI: v_cvt_u32_f32_e32 v[[R_I16_1:[0-9]+]], v[[A_F32_1]]		; SI: v_cvt_u32_f32_e32 v[[R_I16_1:[0-9]+]], v[[A_F32_1]]
; SI: v_cvt_u32_f32_e32 v[[R_I16_0:[0-9]+]], v[[A_F32_0]]		; SI: v_cvt_u32_f32_e32 v[[R_I16_0:[0-9]+]], v[[A_F32_0]]
; SI: v_lshlrev_b32_e32 v[[R_I16_HI:[0-9]+]], 16, v[[R_I16_1]]		; SI: v_lshlrev_b32_e32 v[[R_I16_HI:[0-9]+]], 16, v[[R_I16_1]]
; SI: v_or_b32_e32 v[[R_V2_I16:[0-9]+]], v[[R_I16_0]], v[[R_I16_HI]]		; SI: v_or_b32_e32 v[[R_V2_I16:[0-9]+]], v[[R_I16_0]], v[[R_I16_HI]]

; VI-DAG: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_V2_F16]]		; VI: v_cvt_u16_f16_e32 v[[A_F16_1:[0-9]+]], v[[A_V2_F16]]
; VI-DAG: v_cvt_f32_f16_sdwa v[[A_F32_0:[0-9]+]], v[[A_V2_F16]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; VI: v_cvt_u16_f16_sdwa v[[A_F16_0:[0-9]+]], v[[A_V2_F16]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1
; VI: v_cvt_i32_f32_e32 v[[R_I16_1:[0-9]+]], v[[A_F32_1]]		; VI: v_or_b32_sdwa v[[A_V2_U16:[0-9]+]], v[[A_F16_1]], v[[A_F16_0]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; VI: v_cvt_i32_f32_sdwa v[[R_I16_0:[0-9]+]], v[[A_F32_0]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD
; VI: v_or_b32_sdwa v[[R_V2_I16:[0-9]+]], v[[R_I16_1]], v[[R_I16_0]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD

		; VI: buffer_store_dword v[[A_V2_U16]]
; GCN: buffer_store_dword v[[R_V2_I16]]		; GCN: buffer_store_dword v[[R_V2_I16]]
; GCN: s_endpgm		; GCN: s_endpgm

define amdgpu_kernel void @fptoui_v2f16_to_v2i16(		define amdgpu_kernel void @fptoui_v2f16_to_v2i16(
<2 x i16> addrspace(1)* %r,		<2 x i16> addrspace(1)* %r,
<2 x half> addrspace(1)* %a) {		<2 x half> addrspace(1)* %a) {
entry:		entry:
%a.val = load <2 x half>, <2 x half> addrspace(1)* %a		%a.val = load <2 x half>, <2 x half> addrspace(1)* %a
%r.val = fptoui <2 x half> %a.val to <2 x i16>		%r.val = fptoui <2 x half> %a.val to <2 x i16>
store <2 x i16> %r.val, <2 x i16> addrspace(1)* %r		store <2 x i16> %r.val, <2 x i16> addrspace(1)* %r
(19 lines not shown)
entry:
ret void		ret void
}		}

; Need to make sure we promote f16 to f32 when converting f16 to i64. Existing		; Need to make sure we promote f16 to f32 when converting f16 to i64. Existing
; test checks code generated for 'i64 = fp_to_uint f32'.		; test checks code generated for 'i64 = fp_to_uint f32'.

; GCN-LABEL: {{^}}fptoui_v2f16_to_v2i64		; GCN-LABEL: {{^}}fptoui_v2f16_to_v2i64
; GCN: buffer_load_dword v[[A_F16_0:[0-9]+]]		; GCN: buffer_load_dword v[[A_F16_0:[0-9]+]]
		; VI: buffer_load_dword v[[A_F16_0:[0-9]+]]
; GCN: v_mov_b32_e32 v[[R_I64_1_High:[0-9]+]], 0		; GCN: v_mov_b32_e32 v[[R_I64_1_High:[0-9]+]], 0
; SI: v_lshrrev_b32_e32 v[[A_F16_1:[0-9]+]], 16, v[[A_F16_0]]		; SI: v_lshrrev_b32_e32 v[[A_F16_1:[0-9]+]], 16, v[[A_F16_0]]
; SI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_F16_0]]		; SI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_F16_0]]
; SI: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]		; SI: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]
; SI: v_cvt_u32_f32_e32 v[[R_I64_0_Low:[0-9]+]], v[[A_F32_0]]		; SI: v_cvt_u32_f32_e32 v[[R_I64_0_Low:[0-9]+]], v[[A_F32_0]]
; SI: v_cvt_u32_f32_e32 v[[R_I64_1_Low:[0-9]+]], v[[A_F32_1]]		; SI: v_cvt_u32_f32_e32 v[[R_I64_1_Low:[0-9]+]], v[[A_F32_1]]
; VI: v_cvt_f32_f16_sdwa v[[A_F32_1:[0-9]+]], v[[A_F16_0]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; VI: v_cvt_f32_f16_sdwa v[[A_F32_1:[0-9]+]], v[[A_F16_0]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
; VI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_F16_0]]		; VI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_F16_0]]
(27 lines not shown)