This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Select V_CVT_*16_F16 more often
ClosedPublic

Authored by jpages on Apr 28 2021, 12:42 PM.


Details

Reviewers

sebastian-ne
arsenm
foad

Commits

rGa1ed39df96bc: [AMDGPU] Select V_CVT_*16_F16 more often

Summary

Improve the code generation of fp_to_sint
and fp_to_uint when the integer result is 16 bits wide.

Diff Detail

Event Timeline

jpages created this revision. Apr 28 2021, 12:42 PM

Herald added subscribers: kerbowa, hiraditya, t-tye and 7 others. Apr 28 2021, 12:42 PM

jpages requested review of this revision. Apr 28 2021, 12:42 PM

Herald added a project: Restricted Project. Apr 28 2021, 12:42 PM

Herald added subscribers: llvm-commits, wdng.

jpages added a reviewer: arsenm. Apr 28 2021, 12:45 PM

All of the patterns you added look to me like workarounds for the DAGCombiner or legalizer not doing its job correctly

llvm/lib/Target/AMDGPU/SIInstructions.td
918 ↗	(On Diff #341292)	This is identical to the pattern attached to the instruction definition, so this shouldn't be doing anything
923 ↗	(On Diff #341292)	Ditto
928–929 ↗	(On Diff #341292)	I don't see the advantage of just selecting the 16-bit result op to the 32-bit result. The legalizer should just take care of this
933 ↗	(On Diff #341292)	Ditto
938 ↗	(On Diff #341292)	Ditto
943 ↗	(On Diff #341292)	Ditto
949–955 ↗	(On Diff #341292)	Would these sexts ever reach the selector? I would expect the combiner to take care of these. I don't see tests for this

Harbormaster completed remote builds in B101483: Diff 341292. Apr 28 2021, 2:36 PM

foad added a subscriber: foad. Apr 28 2021, 3:03 PM

foad added inline comments. Apr 29 2021, 2:59 AM

llvm/test/CodeGen/AMDGPU/fp_to_uint.ll
242–243	Just use a single GCN: check line.
llvm/test/CodeGen/AMDGPU/fptosi.f16.ll
2	Don't do this. Instead change GCN: to SI: on lines 5-8. GCN is supposed to be used for checks that are the same for all architectures (gfx6 onwards).
llvm/test/CodeGen/AMDGPU/fptoui.f16.ll
2	Don't do this.
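
The convention in the comment above (GCN: for checks shared by every run, SI: and VI: for target-specific ones) can be pictured with a toy model. This is only an illustrative sketch in C++, not how FileCheck is implemented; `CheckLine` and `activePatterns` are hypothetical names introduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model of one check line: a prefix (e.g. "GCN") and a pattern.
struct CheckLine {
  std::string Prefix;
  std::string Pattern;
};

// Return the patterns that a run with the given active prefixes would verify.
// A line tagged GCN is active in every run that lists GCN; a line tagged
// SI or VI is active only in the run that lists that prefix.
std::vector<std::string>
activePatterns(const std::vector<CheckLine> &Lines,
               const std::vector<std::string> &Prefixes) {
  std::vector<std::string> Out;
  for (const CheckLine &L : Lines)
    if (std::find(Prefixes.begin(), Prefixes.end(), L.Prefix) != Prefixes.end())
      Out.push_back(L.Pattern);
  return Out;
}
```

With the check lines of fptosi_f16_to_i16 modeled this way, the SI run sees the shared load/store checks plus the two SI conversions, while the VI run sees the shared checks plus the single v_cvt_i16_f16_e32.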

Thank you for your input.

I rebased the patch to do the same thing in the legalizer, which does seem more natural.
I also took the opportunity to do the suggested refactoring of LowerFP_TO_SINT/LowerFP_TO_UINT, merging them into one function.

@foad, I changed the tests back to the original version; thanks for the comment.
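
As a reading aid, the decision order of the merged LowerFP_TO_INT in the diff further down can be modeled without any LLVM types. The sketch below is a standalone approximation under stated assumptions: the `Ty` and `Action` enums and the `lowerFpToInt` function are hypothetical stand-ins, and `SrcIsFP16ToFP` stands for the `Src.getOpcode() == ISD::FP16_TO_FP` test.

```cpp
#include <cassert>

// Hypothetical stand-ins for the value types that appear in the patch.
enum class Ty { F16, F32, F64, I16, I32, I64 };

// What the merged lowering decides to do (illustrative names only).
enum class Action {
  KeepAsIs,        // return Op: f16 -> i16 is selected natively
  ConvertI32Trunc, // convert to i32, then TRUNCATE to i16
  ConvertI32SExt,  // convert to i32, then SIGN_EXTEND to i64
  ConvertI32ZExt,  // convert to i32, then ZERO_EXTEND to i64
  LowerFP64,       // defer to LowerFP64_TO_INT
  Default          // return SDValue(): no custom lowering
};

// Mirrors the order of the checks in the diff; Signed is true for FP_TO_SINT.
Action lowerFpToInt(Ty Src, Ty Dest, bool SrcIsFP16ToFP, bool Signed) {
  if (Src == Ty::F16 && Dest == Ty::I16)
    return Action::KeepAsIs;
  if (Dest == Ty::I16 && (Src == Ty::F32 || Src == Ty::F64))
    return Action::ConvertI32Trunc;
  if (Src == Ty::F16 || (Src == Ty::F32 && SrcIsFP16ToFP))
    return Signed ? Action::ConvertI32SExt : Action::ConvertI32ZExt;
  if (Dest == Ty::I64 && Src == Ty::F64)
    return Action::LowerFP64;
  return Action::Default;
}
```

The only place where signedness matters in this model is the choice of extension, which is what lets a single function replace the two near-identical ones.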

jpages marked 3 inline comments as done. Apr 30 2021, 3:11 PM

LGTM with nits

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
2734–2737	Combine into one if?
2759	Don't need the ternary operator, just directly pass the bool expression
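
Both nits are behavior-preserving simplifications. The following is a minimal sketch of the before and after shapes in plain C++; the `Opcode` enum and all function names are hypothetical and only mimic the structure of the reviewed code.

```cpp
#include <cassert>

// Hypothetical stand-in for the two opcodes handled by the merged function.
enum class Opcode { FpToSInt, FpToUInt };

// Before: the ternary only restates the value of the comparison.
bool isSignedVerbose(Opcode Op) {
  return Op == Opcode::FpToSInt ? true : false;
}

// After: pass the bool expression directly.
bool isSigned(Opcode Op) { return Op == Opcode::FpToSInt; }

// Before: two nested ifs with no code between them.
bool selectedNativelyNested(bool SrcIsF16, bool DestIsI16) {
  if (SrcIsF16) {
    if (DestIsI16)
      return true;
  }
  return false;
}

// After: one if with both conditions joined by &&.
bool selectedNativelyCombined(bool SrcIsF16, bool DestIsI16) {
  if (SrcIsF16 && DestIsI16)
    return true;
  return false;
}
```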

Harbormaster completed remote builds in B102032: Diff 342057. Apr 30 2021, 5:30 PM

Updated for code style

jpages marked 2 inline comments as done. May 3 2021, 8:36 AM

Harbormaster completed remote builds in B102295: Diff 342406. May 3 2021, 9:20 AM

foad accepted this revision. May 4 2021, 1:14 AM

This revision is now accepted and ready to land. May 4 2021, 1:14 AM

Thanks for the review. I don't have commit access to the repo. Could someone commit it for me?

This revision was landed with ongoing or failed builds. May 5 2021, 12:58 AM

Closed by commit rGa1ed39df96bc: [AMDGPU] Select V_CVT_*16_F16 more often (authored by jpages, committed by foad).

This revision was automatically updated to reflect the committed changes.

foad added a commit: rGa1ed39df96bc: [AMDGPU] Select V_CVT_*16_F16 more often.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h	3 lines
llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp	51 lines
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	7 lines
llvm/test/CodeGen/AMDGPU/fp_to_uint.ll	10 lines
llvm/test/CodeGen/AMDGPU/fptosi.f16.ll	13 lines
llvm/test/CodeGen/AMDGPU/fptoui.f16.ll	14 lines

Diff 342057

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.h

[...]	protected:

SDValue LowerINT_TO_FP32(SDValue Op, SelectionDAG &DAG, bool Signed) const;		SDValue LowerINT_TO_FP32(SDValue Op, SelectionDAG &DAG, bool Signed) const;
SDValue LowerINT_TO_FP64(SDValue Op, SelectionDAG &DAG, bool Signed) const;		SDValue LowerINT_TO_FP64(SDValue Op, SelectionDAG &DAG, bool Signed) const;
SDValue LowerUINT_TO_FP(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerUINT_TO_FP(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerSINT_TO_FP(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerSINT_TO_FP(SDValue Op, SelectionDAG &DAG) const;

SDValue LowerFP64_TO_INT(SDValue Op, SelectionDAG &DAG, bool Signed) const;		SDValue LowerFP64_TO_INT(SDValue Op, SelectionDAG &DAG, bool Signed) const;
SDValue LowerFP_TO_FP16(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFP_TO_FP16(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerFP_TO_UINT(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerFP_TO_INT(SDValue Op, SelectionDAG &DAG) const;
		Lint: Pre-merge checks: clang-tidy: warning: invalid case style for function 'LowerFP_TO_INT' [readability-identifier-naming]
SDValue LowerFP_TO_SINT(SDValue Op, SelectionDAG &DAG) const;

SDValue LowerSIGN_EXTEND_INREG(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerSIGN_EXTEND_INREG(SDValue Op, SelectionDAG &DAG) const;

protected:		protected:
bool shouldCombineMemoryType(EVT VT) const;		bool shouldCombineMemoryType(EVT VT) const;
SDValue performLoadCombine(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue performLoadCombine(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue performStoreCombine(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue performStoreCombine(SDNode *N, DAGCombinerInfo &DCI) const;
SDValue performAssertSZExtCombine(SDNode *N, DAGCombinerInfo &DCI) const;		SDValue performAssertSZExtCombine(SDNode *N, DAGCombinerInfo &DCI) const;
[...]

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp

[...]	case ISD::FLOG:
return LowerFLOG(Op, DAG, numbers::ln2f);		return LowerFLOG(Op, DAG, numbers::ln2f);
case ISD::FLOG10:		case ISD::FLOG10:
return LowerFLOG(Op, DAG, numbers::ln2f / numbers::ln10f);		return LowerFLOG(Op, DAG, numbers::ln2f / numbers::ln10f);
case ISD::FEXP:		case ISD::FEXP:
return lowerFEXP(Op, DAG);		return lowerFEXP(Op, DAG);
case ISD::SINT_TO_FP: return LowerSINT_TO_FP(Op, DAG);		case ISD::SINT_TO_FP: return LowerSINT_TO_FP(Op, DAG);
case ISD::UINT_TO_FP: return LowerUINT_TO_FP(Op, DAG);		case ISD::UINT_TO_FP: return LowerUINT_TO_FP(Op, DAG);
case ISD::FP_TO_FP16: return LowerFP_TO_FP16(Op, DAG);		case ISD::FP_TO_FP16: return LowerFP_TO_FP16(Op, DAG);
case ISD::FP_TO_SINT: return LowerFP_TO_SINT(Op, DAG);		case ISD::FP_TO_SINT:
case ISD::FP_TO_UINT: return LowerFP_TO_UINT(Op, DAG);		case ISD::FP_TO_UINT:
		return LowerFP_TO_INT(Op, DAG);
case ISD::CTTZ:		case ISD::CTTZ:
case ISD::CTTZ_ZERO_UNDEF:		case ISD::CTTZ_ZERO_UNDEF:
case ISD::CTLZ:		case ISD::CTLZ:
case ISD::CTLZ_ZERO_UNDEF:		case ISD::CTLZ_ZERO_UNDEF:
return LowerCTLZ_CTTZ(Op, DAG);		return LowerCTLZ_CTTZ(Op, DAG);
case ISD::DYNAMIC_STACKALLOC: return LowerDYNAMIC_STACKALLOC(Op, DAG);		case ISD::DYNAMIC_STACKALLOC: return LowerDYNAMIC_STACKALLOC(Op, DAG);
}		}
return Op;		return Op;
[...]	V = DAG.getNode(ISD::SRL, DL, MVT::i32, V,
DAG.getConstant(2, DL, MVT::i32));		DAG.getConstant(2, DL, MVT::i32));
SDValue V0 = DAG.getSelectCC(DL, VLow3, DAG.getConstant(3, DL, MVT::i32),		SDValue V0 = DAG.getSelectCC(DL, VLow3, DAG.getConstant(3, DL, MVT::i32),
One, Zero, ISD::SETEQ);		One, Zero, ISD::SETEQ);
SDValue V1 = DAG.getSelectCC(DL, VLow3, DAG.getConstant(5, DL, MVT::i32),		SDValue V1 = DAG.getSelectCC(DL, VLow3, DAG.getConstant(5, DL, MVT::i32),
One, Zero, ISD::SETGT);		One, Zero, ISD::SETGT);
V1 = DAG.getNode(ISD::OR, DL, MVT::i32, V0, V1);		V1 = DAG.getNode(ISD::OR, DL, MVT::i32, V0, V1);
V = DAG.getNode(ISD::ADD, DL, MVT::i32, V, V1);		V = DAG.getNode(ISD::ADD, DL, MVT::i32, V, V1);

V = DAG.getSelectCC(DL, E, DAG.getConstant(30, DL, MVT::i32),		V = DAG.getSelectCC(DL, E, DAG.getConstant(30, DL, MVT::i32),
DAG.getConstant(0x7c00, DL, MVT::i32), V, ISD::SETGT);		DAG.getConstant(0x7c00, DL, MVT::i32), V, ISD::SETGT);
V = DAG.getSelectCC(DL, E, DAG.getConstant(1039, DL, MVT::i32),		V = DAG.getSelectCC(DL, E, DAG.getConstant(1039, DL, MVT::i32),
I, V, ISD::SETEQ);		I, V, ISD::SETEQ);

// Extract the sign bit.		// Extract the sign bit.
SDValue Sign = DAG.getNode(ISD::SRL, DL, MVT::i32, UH,		SDValue Sign = DAG.getNode(ISD::SRL, DL, MVT::i32, UH,
DAG.getConstant(16, DL, MVT::i32));		DAG.getConstant(16, DL, MVT::i32));
Sign = DAG.getNode(ISD::AND, DL, MVT::i32, Sign,		Sign = DAG.getNode(ISD::AND, DL, MVT::i32, Sign,
DAG.getConstant(0x8000, DL, MVT::i32));		DAG.getConstant(0x8000, DL, MVT::i32));

V = DAG.getNode(ISD::OR, DL, MVT::i32, Sign, V);		V = DAG.getNode(ISD::OR, DL, MVT::i32, Sign, V);
return DAG.getZExtOrTrunc(V, DL, Op.getValueType());		return DAG.getZExtOrTrunc(V, DL, Op.getValueType());
}		}

SDValue AMDGPUTargetLowering::LowerFP_TO_SINT(SDValue Op,		SDValue AMDGPUTargetLowering::LowerFP_TO_INT(SDValue Op,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
SDValue Src = Op.getOperand(0);		SDValue Src = Op.getOperand(0);
		unsigned OpOpcode = Op.getOpcode();
// TODO: Factor out code common with LowerFP_TO_UINT.

EVT SrcVT = Src.getValueType();		EVT SrcVT = Src.getValueType();
if (SrcVT == MVT::f16 ||		EVT DestVT = Op.getValueType();
(SrcVT == MVT::f32 && Src.getOpcode() == ISD::FP16_TO_FP)) {
SDLoc DL(Op);

SDValue FpToInt32 = DAG.getNode(Op.getOpcode(), DL, MVT::i32, Src);		// Will be selected natively
return DAG.getNode(ISD::SIGN_EXTEND, DL, MVT::i64, FpToInt32);		if (SrcVT == MVT::f16) {
		if (DestVT == MVT::i16)
		return Op;
}		}
		arsenm: Combine into one if?

if (Op.getValueType() == MVT::i64 && Src.getValueType() == MVT::f64)		// Promote i16 to i32
return LowerFP64_TO_INT(Op, DAG, true);		if (DestVT == MVT::i16 && (SrcVT == MVT::f32 || SrcVT == MVT::f64)) {
		SDLoc DL(Op);

return SDValue();		SDValue FpToInt32 = DAG.getNode(OpOpcode, DL, MVT::i32, Src);
		return DAG.getNode(ISD::TRUNCATE, DL, MVT::i16, FpToInt32);
}		}

SDValue AMDGPUTargetLowering::LowerFP_TO_UINT(SDValue Op,
SelectionDAG &DAG) const {
SDValue Src = Op.getOperand(0);

// TODO: Factor out code common with LowerFP_TO_SINT.

EVT SrcVT = Src.getValueType();
if (SrcVT == MVT::f16 ||		if (SrcVT == MVT::f16 ||
(SrcVT == MVT::f32 && Src.getOpcode() == ISD::FP16_TO_FP)) {		(SrcVT == MVT::f32 && Src.getOpcode() == ISD::FP16_TO_FP)) {
SDLoc DL(Op);		SDLoc DL(Op);

SDValue FpToUInt32 = DAG.getNode(Op.getOpcode(), DL, MVT::i32, Src);		SDValue FpToInt32 = DAG.getNode(OpOpcode, DL, MVT::i32, Src);
return DAG.getNode(ISD::ZERO_EXTEND, DL, MVT::i64, FpToUInt32);		unsigned Ext =
		OpOpcode == ISD::FP_TO_SINT ? ISD::SIGN_EXTEND : ISD::ZERO_EXTEND;
		return DAG.getNode(Ext, DL, MVT::i64, FpToInt32);
}		}

if (Op.getValueType() == MVT::i64 && Src.getValueType() == MVT::f64)		if (DestVT == MVT::i64 && SrcVT == MVT::f64)
return LowerFP64_TO_INT(Op, DAG, false);		return LowerFP64_TO_INT(Op, DAG,
		OpOpcode == ISD::FP_TO_SINT ? true : false);
		arsenm: Don't need the ternary operator, just directly pass the bool expression

return SDValue();		return SDValue();
}		}

SDValue AMDGPUTargetLowering::LowerSIGN_EXTEND_INREG(SDValue Op,		SDValue AMDGPUTargetLowering::LowerSIGN_EXTEND_INREG(SDValue Op,
SelectionDAG &DAG) const {		SelectionDAG &DAG) const {
EVT ExtraVT = cast<VTSDNode>(Op.getOperand(1))->getVT();		EVT ExtraVT = cast<VTSDNode>(Op.getOperand(1))->getVT();
MVT VT = Op.getSimpleValueType();		MVT VT = Op.getSimpleValueType();
[...]

llvm/lib/Target/AMDGPU/SIISelLowering.cpp


[...]	if (Subtarget->has16BitInsts()) {

setTruncStoreAction(MVT::i64, MVT::i16, Expand);		setTruncStoreAction(MVT::i64, MVT::i16, Expand);

setOperationAction(ISD::FP16_TO_FP, MVT::i16, Promote);		setOperationAction(ISD::FP16_TO_FP, MVT::i16, Promote);
AddPromotedToType(ISD::FP16_TO_FP, MVT::i16, MVT::i32);		AddPromotedToType(ISD::FP16_TO_FP, MVT::i16, MVT::i32);
setOperationAction(ISD::FP_TO_FP16, MVT::i16, Promote);		setOperationAction(ISD::FP_TO_FP16, MVT::i16, Promote);
AddPromotedToType(ISD::FP_TO_FP16, MVT::i16, MVT::i32);		AddPromotedToType(ISD::FP_TO_FP16, MVT::i16, MVT::i32);

setOperationAction(ISD::FP_TO_SINT, MVT::i16, Promote);		setOperationAction(ISD::FP_TO_SINT, MVT::i16, Custom);
setOperationAction(ISD::FP_TO_UINT, MVT::i16, Promote);		setOperationAction(ISD::FP_TO_UINT, MVT::i16, Custom);

// F16 - Constant Actions.		// F16 - Constant Actions.
setOperationAction(ISD::ConstantFP, MVT::f16, Legal);		setOperationAction(ISD::ConstantFP, MVT::f16, Legal);

// F16 - Load/Store Actions.		// F16 - Load/Store Actions.
setOperationAction(ISD::LOAD, MVT::f16, Promote);		setOperationAction(ISD::LOAD, MVT::f16, Promote);
AddPromotedToType(ISD::LOAD, MVT::f16, MVT::i16);		AddPromotedToType(ISD::LOAD, MVT::f16, MVT::i16);
setOperationAction(ISD::STORE, MVT::f16, Promote);		setOperationAction(ISD::STORE, MVT::f16, Promote);
[...]	SDValue SITargetLowering::LowerOperation(SDValue Op, SelectionDAG &DAG) const {
case ISD::FCANONICALIZE:		case ISD::FCANONICALIZE:
case ISD::BSWAP:		case ISD::BSWAP:
return splitUnaryVectorOp(Op, DAG);		return splitUnaryVectorOp(Op, DAG);
case ISD::FMINNUM:		case ISD::FMINNUM:
case ISD::FMAXNUM:		case ISD::FMAXNUM:
return lowerFMINNUM_FMAXNUM(Op, DAG);		return lowerFMINNUM_FMAXNUM(Op, DAG);
case ISD::FMA:		case ISD::FMA:
return splitTernaryVectorOp(Op, DAG);		return splitTernaryVectorOp(Op, DAG);
		case ISD::FP_TO_SINT:
		case ISD::FP_TO_UINT:
		return LowerFP_TO_INT(Op, DAG);
case ISD::SHL:		case ISD::SHL:
case ISD::SRA:		case ISD::SRA:
case ISD::SRL:		case ISD::SRL:
case ISD::ADD:		case ISD::ADD:
case ISD::SUB:		case ISD::SUB:
case ISD::MUL:		case ISD::MUL:
case ISD::SMIN:		case ISD::SMIN:
case ISD::SMAX:		case ISD::SMAX:
[...]

llvm/test/CodeGen/AMDGPU/fp_to_uint.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -allow-deprecated-dag-overlap %s -check-prefixes=GCN,FUNC,SI			; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -allow-deprecated-dag-overlap %s -check-prefixes=GCN,FUNC
	; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -allow-deprecated-dag-overlap %s -check-prefixes=GCN,FUNC,VI			; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -allow-deprecated-dag-overlap %s -check-prefixes=GCN,FUNC
	; RUN: llc -march=r600 -mcpu=redwood < %s | FileCheck -allow-deprecated-dag-overlap %s -check-prefix=EG -check-prefix=FUNC			; RUN: llc -march=r600 -mcpu=redwood < %s | FileCheck -allow-deprecated-dag-overlap %s -check-prefix=EG -check-prefix=FUNC

	declare float @llvm.fabs.f32(float) #1			declare float @llvm.fabs.f32(float) #1

	; FUNC-LABEL: {{^}}fp_to_uint_f32_to_i32:			; FUNC-LABEL: {{^}}fp_to_uint_f32_to_i32:
	; EG: FLT_TO_UINT {{\** *}}T{{[0-9]+\.[XYZW], PV\.[XYZW]}}			; EG: FLT_TO_UINT {{\** *}}T{{[0-9]+\.[XYZW], PV\.[XYZW]}}

	; GCN: v_cvt_u32_f32_e32			; GCN: v_cvt_u32_f32_e32
	[...]
	; GCN: v_cmp_eq_f32_e64 s{{\[[0-9]+:[0-9]+\]}}, 1.0, |s{{[0-9]+}}|			; GCN: v_cmp_eq_f32_e64 s{{\[[0-9]+:[0-9]+\]}}, 1.0, |s{{[0-9]+}}|
	define amdgpu_kernel void @fp_to_uint_fabs_f32_to_i1(i1 addrspace(1)* %out, float %in) #0 {			define amdgpu_kernel void @fp_to_uint_fabs_f32_to_i1(i1 addrspace(1)* %out, float %in) #0 {
	%in.fabs = call float @llvm.fabs.f32(float %in)			%in.fabs = call float @llvm.fabs.f32(float %in)
	%conv = fptoui float %in.fabs to i1			%conv = fptoui float %in.fabs to i1
	store i1 %conv, i1 addrspace(1)* %out			store i1 %conv, i1 addrspace(1)* %out
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}fp_to_uint_f32_to_i16:			; FUNC-LABEL: {{^}}fp_to_uint_f32_to_i16:
	; The reason different instructions are used on SI and VI is because for			; GCN: v_cvt_u32_f32_e32 [[VAL:v[0-9]+]], s{{[0-9]+}}
				foad: Just use a single GCN: check line.
	; SI fp_to_uint is legalized by the type legalizer and for VI it is
	; legalized by the dag legalizer and they legalize fp_to_uint differently.
	; SI: v_cvt_u32_f32_e32 [[VAL:v[0-9]+]], s{{[0-9]+}}
	; VI: v_cvt_i32_f32_e32 [[VAL:v[0-9]+]], s{{[0-9]+}}
	; GCN: buffer_store_short [[VAL]]			; GCN: buffer_store_short [[VAL]]
	define amdgpu_kernel void @fp_to_uint_f32_to_i16(i16 addrspace(1)* %out, float %in) #0 {			define amdgpu_kernel void @fp_to_uint_f32_to_i16(i16 addrspace(1)* %out, float %in) #0 {
	%uint = fptoui float %in to i16			%uint = fptoui float %in to i16
	store i16 %uint, i16 addrspace(1)* %out			store i16 %uint, i16 addrspace(1)* %out
	ret void			ret void
	}			}

	attributes #0 = { nounwind }			attributes #0 = { nounwind }
	attributes #1 = { nounwind readnone }			attributes #1 = { nounwind readnone }
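
The updated checks for fp_to_uint_f32_to_i16 expect a 32-bit conversion followed by a 16-bit store, which matches the "Promote i16 to i32" path in LowerFP_TO_INT (convert to i32, then truncate). A rough C++ analogy, valid only for inputs whose truncated value fits in 16 bits (out-of-range float-to-integer conversion is undefined behavior in C++), is sketched below; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Convert through a 32-bit intermediate, then keep the low 16 bits.
// This mirrors the shape "fp_to_uint to i32, then truncate to i16".
uint16_t toU16ViaU32(float F) {
  uint32_t Wide = static_cast<uint32_t>(F);
  return static_cast<uint16_t>(Wide);
}

// Convert directly to the 16-bit type.
uint16_t toU16Direct(float F) { return static_cast<uint16_t>(F); }
```

For in-range inputs both routes give the same value, which is why a single 32-bit conversion instruction is enough when no native 16-bit conversion applies.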

llvm/test/CodeGen/AMDGPU/fptosi.f16.ll

	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=SI %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=SI %s
	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=fiji -mattr=-flat-for-global -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=VI %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=fiji -mattr=-flat-for-global -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=VI %s
				foad: Don't do this. Instead change GCN: to SI: on lines 5-8. GCN is supposed to be used for checks that are the same for all architectures (gfx6 onwards).

	; GCN-LABEL: {{^}}fptosi_f16_to_i16			; GCN-LABEL: {{^}}fptosi_f16_to_i16
	; GCN: buffer_load_ushort v[[A_F16:[0-9]+]]			; GCN: buffer_load_ushort v[[A_F16:[0-9]+]]
	; GCN: v_cvt_f32_f16_e32 v[[A_F32:[0-9]+]], v[[A_F16]]			; SI: v_cvt_f32_f16_e32 v[[A_F32:[0-9]+]], v[[A_F16]]
	; GCN: v_cvt_i32_f32_e32 v[[R_I16:[0-9]+]], v[[A_F32]]			; SI: v_cvt_i32_f32_e32 v[[R_I16:[0-9]+]], v[[A_F32]]
				; VI: v_cvt_i16_f16_e32 v[[R_I16:[0-9]+]], v[[A_F16]]
	; GCN: buffer_store_short v[[R_I16]]			; GCN: buffer_store_short v[[R_I16]]
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @fptosi_f16_to_i16(			define amdgpu_kernel void @fptosi_f16_to_i16(
	i16 addrspace(1)* %r,			i16 addrspace(1)* %r,
	half addrspace(1)* %a) {			half addrspace(1)* %a) {
	entry:			entry:
	%a.val = load half, half addrspace(1)* %a			%a.val = load half, half addrspace(1)* %a
	%r.val = fptosi half %a.val to i16			%r.val = fptosi half %a.val to i16
	[...]
	; SI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_V2_F16]]			; SI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_V2_F16]]
	; SI: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]			; SI: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]
	; SI: v_cvt_i32_f32_e32 v[[R_I16_0:[0-9]+]], v[[A_F32_0]]			; SI: v_cvt_i32_f32_e32 v[[R_I16_0:[0-9]+]], v[[A_F32_0]]
	; SI-DAG: v_cvt_i32_f32_e32 v[[R_I16_1:[0-9]+]], v[[A_F32_1]]			; SI-DAG: v_cvt_i32_f32_e32 v[[R_I16_1:[0-9]+]], v[[A_F32_1]]
	; SI-DAG: v_and_b32_e32 v[[R_I16_LO:[0-9]+]], 0xffff, v[[R_I16_0]]			; SI-DAG: v_and_b32_e32 v[[R_I16_LO:[0-9]+]], 0xffff, v[[R_I16_0]]
	; SI: v_lshlrev_b32_e32 v[[R_I16_HI:[0-9]+]], 16, v[[R_I16_1]]			; SI: v_lshlrev_b32_e32 v[[R_I16_HI:[0-9]+]], 16, v[[R_I16_1]]
	; SI: v_or_b32_e32 v[[R_V2_I16:[0-9]+]], v[[R_I16_LO]], v[[R_I16_HI]]			; SI: v_or_b32_e32 v[[R_V2_I16:[0-9]+]], v[[R_I16_LO]], v[[R_I16_HI]]

	; VI: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_V2_F16]]			; VI: v_cvt_i16_f16_e32 v[[A_I16_0:[0-9]+]], v[[A_V2_F16]]
	; VI: v_cvt_f32_f16_sdwa v[[A_F32_1:[0-9]+]], v[[A_V2_F16]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; VI: v_cvt_i16_f16_sdwa v[[A_I16_1:[0-9]+]], v[[A_V2_F16]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1
	; VI: v_cvt_i32_f32_e32 v[[R_I16_0:[0-9]+]], v[[A_F32_0]]			; VI: v_or_b32_sdwa v[[R_V2_I16:[0-9]+]], v[[A_I16_0]], v[[A_I16_1]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
	; VI: v_cvt_i32_f32_sdwa v[[R_I16_1:[0-9]+]], v[[A_F32_1]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD
	; VI: v_or_b32_sdwa v[[R_V2_I16:[0-9]+]], v[[R_I16_0]], v[[R_I16_1]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD

	; GCN: buffer_store_dword v[[R_V2_I16]]			; GCN: buffer_store_dword v[[R_V2_I16]]
	; GCN: s_endpgm			; GCN: s_endpgm

	define amdgpu_kernel void @fptosi_v2f16_to_v2i16(			define amdgpu_kernel void @fptosi_v2f16_to_v2i16(
	<2 x i16> addrspace(1)* %r,			<2 x i16> addrspace(1)* %r,
	<2 x half> addrspace(1)* %a) {			<2 x half> addrspace(1)* %a) {
	entry:			entry:
	[...]

llvm/test/CodeGen/AMDGPU/fptoui.f16.ll

	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tahiti -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=SI %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tahiti -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=SI %s
	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=fiji -mattr=-flat-for-global -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=VI %s			; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=fiji -mattr=-flat-for-global -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=GCN -check-prefix=VI %s
				foad: Don't do this.

	; GCN-LABEL: {{^}}fptoui_f16_to_i16			; GCN-LABEL: {{^}}fptoui_f16_to_i16
	; GCN: buffer_load_ushort v[[A_F16:[0-9]+]]			; GCN: buffer_load_ushort v[[A_F16:[0-9]+]]
	; GCN: v_cvt_f32_f16_e32 v[[A_F32:[0-9]+]], v[[A_F16]]			; SI: v_cvt_f32_f16_e32 v[[A_F32:[0-9]+]], v[[A_F16]]
	; SI: v_cvt_u32_f32_e32 v[[R_I16:[0-9]+]], v[[A_F32]]			; SI: v_cvt_u32_f32_e32 v[[R_I16:[0-9]+]], v[[A_F32]]
	; VI: v_cvt_i32_f32_e32 v[[R_I16:[0-9]+]], v[[A_F32]]			; VI: v_cvt_u16_f16_e32 v[[R_I16:[0-9]+]], v[[A_F16]]
	; GCN: buffer_store_short v[[R_I16]]			; GCN: buffer_store_short v[[R_I16]]
	; GCN: s_endpgm			; GCN: s_endpgm
	define amdgpu_kernel void @fptoui_f16_to_i16(			define amdgpu_kernel void @fptoui_f16_to_i16(
	i16 addrspace(1)* %r,			i16 addrspace(1)* %r,
	half addrspace(1)* %a) {			half addrspace(1)* %a) {
	entry:			entry:
	%a.val = load half, half addrspace(1)* %a			%a.val = load half, half addrspace(1)* %a
	%r.val = fptoui half %a.val to i16			%r.val = fptoui half %a.val to i16
	[...]
	; SI: v_lshrrev_b32_e32 v[[A_F16_1:[0-9]+]], 16, v[[A_V2_F16]]			; SI: v_lshrrev_b32_e32 v[[A_F16_1:[0-9]+]], 16, v[[A_V2_F16]]
	; SI-DAG: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]			; SI-DAG: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_F16_1]]
	; SI-DAG: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_V2_F16]]			; SI-DAG: v_cvt_f32_f16_e32 v[[A_F32_0:[0-9]+]], v[[A_V2_F16]]
	; SI: v_cvt_u32_f32_e32 v[[R_I16_1:[0-9]+]], v[[A_F32_1]]			; SI: v_cvt_u32_f32_e32 v[[R_I16_1:[0-9]+]], v[[A_F32_1]]
	; SI: v_cvt_u32_f32_e32 v[[R_I16_0:[0-9]+]], v[[A_F32_0]]			; SI: v_cvt_u32_f32_e32 v[[R_I16_0:[0-9]+]], v[[A_F32_0]]
	; SI: v_lshlrev_b32_e32 v[[R_I16_HI:[0-9]+]], 16, v[[R_I16_1]]			; SI: v_lshlrev_b32_e32 v[[R_I16_HI:[0-9]+]], 16, v[[R_I16_1]]
	; SI: v_or_b32_e32 v[[R_V2_I16:[0-9]+]], v[[R_I16_0]], v[[R_I16_HI]]			; SI: v_or_b32_e32 v[[R_V2_I16:[0-9]+]], v[[R_I16_0]], v[[R_I16_HI]]

	; VI-DAG: v_cvt_f32_f16_e32 v[[A_F32_1:[0-9]+]], v[[A_V2_F16]]			; VI: v_cvt_u16_f16_e32 v[[A_U16_1:[0-9]+]], v[[A_V2_F16]]
	; VI-DAG: v_cvt_f32_f16_sdwa v[[A_F32_0:[0-9]+]], v[[A_V2_F16]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; VI: v_cvt_u16_f16_sdwa v[[R_U16_0:[0-9]+]], v[[A_V2_F16]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:WORD_1
	; VI: v_cvt_i32_f32_e32 v[[R_I16_1:[0-9]+]], v[[A_F32_1]]			; VI: v_or_b32_sdwa v[[R_V2_I16:[0-9]+]], v[[A_U16_1]], v[[R_U16_0]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
	; VI: v_cvt_i32_f32_sdwa v[[R_I16_0:[0-9]+]], v[[A_F32_0]] dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD
	; VI: v_or_b32_sdwa v[[R_V2_I16:[0-9]+]], v[[R_I16_1]], v[[R_I16_0]] dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD

	; GCN: buffer_store_dword v[[R_V2_I16]]			; GCN: buffer_store_dword v[[R_V2_I16]]
	; GCN: s_endpgm			; GCN: s_endpgm

	define amdgpu_kernel void @fptoui_v2f16_to_v2i16(			define amdgpu_kernel void @fptoui_v2f16_to_v2i16(
	<2 x i16> addrspace(1)* %r,			<2 x i16> addrspace(1)* %r,
	<2 x half> addrspace(1)* %a) {			<2 x half> addrspace(1)* %a) {
	entry:			entry:
	[...]