Details

Reviewers

dstuttard
arsenm
tpr

Group Reviewers

Restricted Project

Commits

rG824ca3f3dd85: [AMDGPU] Add intrinsics for 16 bit interpolation
rL352357: [AMDGPU] Add intrinsics for 16 bit interpolation

Summary

Added the intrinsics llvm.amdgcn.interp.p1.f16() and
llvm.amdgcn.interp.p2.f16() and related LIT test.

The p1 intrinsic generates code appropriate for both 16 and 32
bank LDS.
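
As a rough illustration of what "appropriate for both 16 and 32 bank LDS" means, the sketch below models the choice made by the lowering code in this revision. It is a standalone model, not LLVM code: `InterpP1Node`, `selectInterpP1Node`, `needsInterpMov` and `mnemonic` are hypothetical names, and the assertion that only 16 and 32 banks occur is an assumption.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the two selection-DAG node kinds named in the diff.
enum class InterpP1Node { P1LL_F16, P1LV_F16 };

// Model of the choice made when lowering llvm.amdgcn.interp.p1.f16:
// 16-bank LDS parts take the "LV" form, all others take the "LL" form.
InterpP1Node selectInterpP1Node(unsigned ldsBankCount) {
  // Assumption: only 16 and 32 bank configurations exist.
  assert((ldsBankCount == 16 || ldsBankCount == 32) && "unexpected LDS bank count");
  return ldsBankCount == 16 ? InterpP1Node::P1LV_F16 : InterpP1Node::P1LL_F16;
}

// The "LV" form takes an extra src2 operand, which the patch produces with
// an interp_mov of P0 placed before the p1 instruction.
bool needsInterpMov(InterpP1Node node) {
  return node == InterpP1Node::P1LV_F16;
}

// Assembly mnemonic that the LIT test expects for each form.
std::string mnemonic(InterpP1Node node) {
  return node == InterpP1Node::P1LV_F16 ? "v_interp_p1lv_f16"
                                        : "v_interp_p1ll_f16";
}
```

In this model, `selectInterpP1Node(16)` picks the `p1lv` form with a preceding interp_mov, and `selectInterpP1Node(32)` picks the `p1ll` form, which corresponds to the 16BANK and 32BANK check lines in the LIT test.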

Diff Detail

Repository

rL LLVM

Build Status

Buildable 20954
Build 20954: arc lint + arc unit

Event Timeline

timcorringham created this revision. May 11 2018, 6:48 AM

Herald added subscribers: llvm-commits, t-tye, tpr and 6 others. May 11 2018, 6:48 AM

timcorringham added reviewers: Restricted Project, dstuttard, arsenm, tpr. May 11 2018, 6:51 AM

arsenm requested changes to this revision. May 14 2018, 4:26 AM

arsenm added inline comments.

include/llvm/IR/IntrinsicsAMDGPU.td
970	You should add name mangling to the existing intrinsics rather than new intrinsics. The builtin declaration needs to be done in clang for the GCCBuiltin

This revision now requires changes to proceed. May 14 2018, 4:26 AM

timcorringham added inline comments. May 15 2018, 4:56 AM

include/llvm/IR/IntrinsicsAMDGPU.td
970	I now have the clang changes in D46871 (I have added the 32 bit interp builtins too as they were missing). I don't believe it is possible to overload these intrinsics as they have an extra operand compared to the 32 bit versions. Also, apart from the extra operand, the signature of the 16 bit p1 intrinsic is identical to the 32 bit one, so there isn't any type difference to overload on.

Corrected the ordering of operands to interp_p2_f16, added lowered
intrinsics to list of those that are a source of divergence, and
amended LIT test.

I have not overloaded the intrinsics as I don't believe it is possible
in this case as they have an additional operand, and apart from that
additional operand the interp_p1_f16 has the same types as the 32 bit
version, so there are no type differences to provide disambiguation.

In D46754#1104736, @timcorringham wrote:

Corrected the ordering of operands to interp_p2_f16, added lowered
intrinsics to list of those that are a source of divergence, and
amended LIT test.

I have not overloaded the intrinsics as I don't believe it is possible
in this case as they have an additional operand, and apart from that
additional operand the interp_p1_f16 has the same types as the 32 bit
version, so there are no type differences to provide disambiguation.

Is the extra parameter you're referring to the high parameter, which changes whether the register is read from the high or low bits? That shouldn't be exposed in the intrinsic at all. Eliminating the high bit extraction is a codegen optimization pattern.

In D46754#1104799, @arsenm wrote:

In D46754#1104736, @timcorringham wrote:

Corrected the ordering of operands to interp_p2_f16, added lowered
intrinsics to list of those that are a source of divergence, and
amended LIT test.

I have not overloaded the intrinsics as I don't believe it is possible
in this case as they have an additional operand, and apart from that
additional operand the interp_p1_f16 has the same types as the 32 bit
version, so there are no type differences to provide disambiguation.

Is the extra parameter you're referring to the high parameter, which changes whether the register is read from the high or low bits? That shouldn't be exposed in the intrinsic at all. Eliminating the high bit extraction is a codegen optimization pattern.

Or is this bit controlling the weird load from memory? The manual isn't particularly clear to me. I see mention of LDS loads, but also op_sel control of destination bits.

arsenm added inline comments. May 18 2018, 10:55 AM

lib/Target/AMDGPU/AMDGPUSearchableTables.td
52–53	Should get a test in test/DivergenceAnalysis

Even without the high operand I don't think it is possible to overload interp_p1 and interp_p1_f16 as they would have identical types - there is nothing to disambiguate them.

Or is this bit controlling the weird load from memory? The manual isn't particularly clear to me. I see mention of LDS loads, but also op_sel control of destination bits.

Yes, the high bit controls the LDS access. As all the operands to interp_p1_f16 are the same types as for the 32 bit variant, I don't know of any way to deduce the value of the high bit if it isn't specified explicitly.
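
To make the point concrete, here is a minimal sketch, assuming a 32-bit word packs two f16 values with the low half in bits 0..15 (that layout is an assumption, not taken from the ISA manual). `selectHalf` and `selfCheck` are hypothetical helpers, not part of the patch. The operand types are the same whichever half is wanted, so the selection has to arrive as an explicit flag.

```cpp
#include <cassert>
#include <cstdint>

// Returns the raw 16-bit pattern of the half selected by `high` from a
// 32-bit word that packs two f16 values.
// Assumption: the low half lives in bits 0..15, the high half in bits 16..31.
std::uint16_t selectHalf(std::uint32_t packed, bool high) {
  return static_cast<std::uint16_t>(high ? (packed >> 16) : (packed & 0xFFFFu));
}

// Small self-check: the same packed word yields different values depending
// only on `high`, so nothing in the operand types can stand in for the flag.
void selfCheck() {
  const std::uint32_t packed = 0x3C004000u; // high = 0x3C00 (1.0), low = 0x4000 (2.0)
  assert(selectHalf(packed, false) == 0x4000);
  assert(selectHalf(packed, true) == 0x3C00);
}
```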

Added a divergence LIT test for the 16 bit interp intrinsics.

tpr added inline comments. May 22 2018, 11:34 AM

lib/Target/AMDGPU/VOP3Instructions.td
459	Don't forget to fix the problem found with this i1 in testing.

Change the omod operand type to be i32 rather than i1, to avoid
a build failure when building using a debug TableGen.

Harbormaster completed remote builds in B18472: Diff 148075. May 22 2018, 12:30 PM

[AMDGPU] Add intrinsics for 16 bit interpolation

Added a new pass to ensure that the 16 bit interpolation
instructions use the round to zero rounding mode.

Herald added a subscriber: mgorny. Jul 3 2018, 7:31 AM

Harbormaster completed remote builds in B19984: Diff 153913. Jul 3 2018, 7:31 AM

A slightly more performant implementation of the pass to add any
required changes to the double precision rounding mode.

Harbormaster completed remote builds in B20138: Diff 154558. Jul 9 2018, 3:04 AM

Refactored pass to insert rounding mode to use a style more in line
with other LLVM passes. This fails to optimize a few corner cases,
but they are expected to occur very rarely if at all.

Harbormaster completed remote builds in B20196: Diff 154774. Jul 10 2018, 4:12 AM

Changed mode register pass to use an explicit stack instead of recursion.
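
The mode register pass itself is not part of this diff, so the following is only a generic sketch of the recursion-to-explicit-stack change described above, under the assumption that the pass walks a control-flow graph. `Cfg` and `visitBlocks` are hypothetical names, and the real pass may traverse blocks in a different order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CFG: successors[i] lists the successor indices of block i.
using Cfg = std::vector<std::vector<std::size_t>>;

// Visits every block reachable from `entry` exactly once, using an explicit
// worklist instead of recursion so that deep CFGs cannot exhaust the call stack.
std::vector<std::size_t> visitBlocks(const Cfg &successors, std::size_t entry) {
  assert(entry < successors.size() && "entry block out of range");
  std::vector<std::size_t> order;
  std::vector<bool> seen(successors.size(), false);
  std::vector<std::size_t> stack{entry};
  seen[entry] = true;
  while (!stack.empty()) {
    const std::size_t block = stack.back();
    stack.pop_back();
    order.push_back(block);
    // Push successors in reverse so the first successor is popped first.
    const auto &succs = successors[block];
    for (auto it = succs.rbegin(); it != succs.rend(); ++it) {
      if (*it < successors.size() && !seen[*it]) {
        seen[*it] = true; // mark on push so no block is queued twice
        stack.push_back(*it);
      }
    }
  }
  return order;
}
```

Marking a block as seen when it is pushed keeps the stack bounded by the number of blocks, which is the main practical gain over a recursive walk.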

Removed the mode register pass, as that will be introduced as a
separate change.

Harbormaster completed remote builds in B20653: Diff 157007. Jul 24 2018, 4:51 AM

arsenm added inline comments. Jul 27 2018, 1:53 AM

test/CodeGen/AMDGPU/llvm.amdgcn.interp.f16.ll
2–4	Use -'s instead of _'s in the check prefixes
6–8	Might as well just use update_llc_test_checks at this point?

Updated the LIT test as per review comments.

Rebased, and amended LIT test now that the required mode register
pass has been committed.

Herald added a subscriber: jvesely. Dec 18 2018, 1:53 AM

Harbormaster completed remote builds in B26098: Diff 178625. Dec 18 2018, 1:53 AM

arsenm added inline comments. Jan 22 2019, 2:14 PM

test/CodeGen/AMDGPU/llvm.amdgcn.interp.f16.ll
58	Can you add a test case with LDS usage to make sure m0 is properly restored after?

arsenm added inline comments. Jan 22 2019, 2:54 PM

lib/Target/AMDGPU/AMDGPUSearchableTables.td
52–53	Test still missing

Extended llvm.amdgcn.interp.f16.ll to check that m0 is set before
each interp instruction if necessary, and added a new LIT test
to check that the interp f16 intrinsics are identified as being
divergent.

Harbormaster completed remote builds in B27248: Diff 183328. Jan 24 2019, 9:31 AM

timcorringham marked 2 inline comments as done. Jan 24 2019, 9:33 AM

timcorringham added inline comments.

test/CodeGen/AMDGPU/llvm.amdgcn.interp.f16.ll
58	I have added test cases to check that m0 is set up before each of the interp f16 instructions if necessary. I have done this by explicitly writing to m0 rather than using LDS as I couldn't see a way to do the latter, and other tests use the technique of writing to m0.

LGTM

This revision is now accepted and ready to land. Jan 25 2019, 9:13 AM

Closed by commit rL352357: [AMDGPU] Add intrinsics for 16 bit interpolation (authored by timcorringham). Jan 28 2019, 5:48 AM

This revision was automatically updated to reflect the committed changes.

Diff 158515

include/llvm/IR/IntrinsicsAMDGPU.td

	// so it behaves like IntrNoMem.			// so it behaves like IntrNoMem.
	def int_amdgcn_interp_p1 :			def int_amdgcn_interp_p1 :
	GCCBuiltin<"__builtin_amdgcn_interp_p1">,			GCCBuiltin<"__builtin_amdgcn_interp_p1">,
	Intrinsic<[llvm_float_ty],			Intrinsic<[llvm_float_ty],
	[llvm_float_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],			[llvm_float_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrSpeculatable]>;			[IntrNoMem, IntrSpeculatable]>;

	// __builtin_amdgcn_interp_p2 <p1>, <j>, <attr_chan>, <attr>, <m0>			// __builtin_amdgcn_interp_p2 <p1>, <j>, <attr_chan>, <attr>, <m0>
	def int_amdgcn_interp_p2 :			def int_amdgcn_interp_p2 :
	GCCBuiltin<"__builtin_amdgcn_interp_p2">,			GCCBuiltin<"__builtin_amdgcn_interp_p2">,
	Intrinsic<[llvm_float_ty],			Intrinsic<[llvm_float_ty],
	[llvm_float_ty, llvm_float_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],			[llvm_float_ty, llvm_float_ty, llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
	[IntrNoMem, IntrSpeculatable]>;			[IntrNoMem, IntrSpeculatable]>;
	// See int_amdgcn_v_interp_p1 for why this is IntrNoMem.			// See int_amdgcn_v_interp_p1 for why this is IntrNoMem.

				// __builtin_amdgcn_interp_p1_f16 <i>, <attr_chan>, <attr>, <high>, <m0>
				def int_amdgcn_interp_p1_f16 :
				GCCBuiltin<"__builtin_amdgcn_interp_p1_f16">,
				Intrinsic<[llvm_float_ty],
				[llvm_float_ty, llvm_i32_ty, llvm_i32_ty, llvm_i1_ty, llvm_i32_ty],
				[IntrNoMem, IntrSpeculatable]>;

				// __builtin_amdgcn_interp_p2_f16 <p1>, <j>, <attr_chan>, <attr>, <high>, <m0>
				def int_amdgcn_interp_p2_f16 :
				GCCBuiltin<"__builtin_amdgcn_interp_p2_f16">,
				Intrinsic<[llvm_half_ty],
				[llvm_float_ty, llvm_float_ty, llvm_i32_ty, llvm_i32_ty, llvm_i1_ty, llvm_i32_ty],
				[IntrNoMem, IntrSpeculatable]>;

	// Pixel shaders only: whether the current pixel is live (i.e. not a helper			// Pixel shaders only: whether the current pixel is live (i.e. not a helper
	// invocation for derivative computation).			// invocation for derivative computation).
	def int_amdgcn_ps_live : Intrinsic <			def int_amdgcn_ps_live : Intrinsic <
	[llvm_i1_ty],			[llvm_i1_ty],
	[],			[],
	[IntrNoMem]>;			[IntrNoMem]>;

	def int_amdgcn_mbcnt_lo :			def int_amdgcn_mbcnt_lo :

lib/Target/AMDGPU/AMDGPUISelLowering.h

enum NodeType : unsigned {
CONST_DATA_PTR,		CONST_DATA_PTR,
INIT_EXEC,		INIT_EXEC,
INIT_EXEC_FROM_INPUT,		INIT_EXEC_FROM_INPUT,
SENDMSG,		SENDMSG,
SENDMSGHALT,		SENDMSGHALT,
INTERP_MOV,		INTERP_MOV,
INTERP_P1,		INTERP_P1,
INTERP_P2,		INTERP_P2,
		INTERP_P1LL_F16,
		INTERP_P1LV_F16,
		INTERP_P2_F16,
PC_ADD_REL_OFFSET,		PC_ADD_REL_OFFSET,
KILL,		KILL,
DUMMY_CHAIN,		DUMMY_CHAIN,
FIRST_MEM_OPCODE_NUMBER = ISD::FIRST_TARGET_MEMORY_OPCODE,		FIRST_MEM_OPCODE_NUMBER = ISD::FIRST_TARGET_MEMORY_OPCODE,
STORE_MSKOR,		STORE_MSKOR,
LOAD_CONSTANT,		LOAD_CONSTANT,
TBUFFER_STORE_FORMAT,		TBUFFER_STORE_FORMAT,
TBUFFER_STORE_FORMAT_X3,		TBUFFER_STORE_FORMAT_X3,

lib/Target/AMDGPU/AMDGPUISelLowering.cpp

const char* AMDGPUTargetLowering::getTargetNodeName(unsigned Opcode) const {
case AMDGPUISD::FIRST_MEM_OPCODE_NUMBER: break;		case AMDGPUISD::FIRST_MEM_OPCODE_NUMBER: break;
NODE_NAME_CASE(INIT_EXEC)		NODE_NAME_CASE(INIT_EXEC)
NODE_NAME_CASE(INIT_EXEC_FROM_INPUT)		NODE_NAME_CASE(INIT_EXEC_FROM_INPUT)
NODE_NAME_CASE(SENDMSG)		NODE_NAME_CASE(SENDMSG)
NODE_NAME_CASE(SENDMSGHALT)		NODE_NAME_CASE(SENDMSGHALT)
NODE_NAME_CASE(INTERP_MOV)		NODE_NAME_CASE(INTERP_MOV)
NODE_NAME_CASE(INTERP_P1)		NODE_NAME_CASE(INTERP_P1)
NODE_NAME_CASE(INTERP_P2)		NODE_NAME_CASE(INTERP_P2)
		NODE_NAME_CASE(INTERP_P1LL_F16)
		NODE_NAME_CASE(INTERP_P1LV_F16)
		NODE_NAME_CASE(INTERP_P2_F16)
NODE_NAME_CASE(STORE_MSKOR)		NODE_NAME_CASE(STORE_MSKOR)
NODE_NAME_CASE(LOAD_CONSTANT)		NODE_NAME_CASE(LOAD_CONSTANT)
NODE_NAME_CASE(TBUFFER_STORE_FORMAT)		NODE_NAME_CASE(TBUFFER_STORE_FORMAT)
NODE_NAME_CASE(TBUFFER_STORE_FORMAT_X3)		NODE_NAME_CASE(TBUFFER_STORE_FORMAT_X3)
NODE_NAME_CASE(TBUFFER_STORE_FORMAT_D16)		NODE_NAME_CASE(TBUFFER_STORE_FORMAT_D16)
NODE_NAME_CASE(TBUFFER_LOAD_FORMAT)		NODE_NAME_CASE(TBUFFER_LOAD_FORMAT)
NODE_NAME_CASE(TBUFFER_LOAD_FORMAT_D16)		NODE_NAME_CASE(TBUFFER_LOAD_FORMAT_D16)
NODE_NAME_CASE(ATOMIC_CMP_SWAP)		NODE_NAME_CASE(ATOMIC_CMP_SWAP)

lib/Target/AMDGPU/AMDGPUInstrInfo.td

	def AMDGPUinterp_p1 : SDNode<"AMDGPUISD::INTERP_P1",			def AMDGPUinterp_p1 : SDNode<"AMDGPUISD::INTERP_P1",
	SDTypeProfile<1, 3, [SDTCisFP<0>]>,			SDTypeProfile<1, 3, [SDTCisFP<0>]>,
	[SDNPInGlue, SDNPOutGlue]>;			[SDNPInGlue, SDNPOutGlue]>;

	def AMDGPUinterp_p2 : SDNode<"AMDGPUISD::INTERP_P2",			def AMDGPUinterp_p2 : SDNode<"AMDGPUISD::INTERP_P2",
	SDTypeProfile<1, 4, [SDTCisFP<0>]>,			SDTypeProfile<1, 4, [SDTCisFP<0>]>,
	[SDNPInGlue]>;			[SDNPInGlue]>;

				def AMDGPUinterp_p1ll_f16 : SDNode<"AMDGPUISD::INTERP_P1LL_F16",
				SDTypeProfile<1, 7, [SDTCisFP<0>]>,
				[SDNPInGlue, SDNPOutGlue]>;

				def AMDGPUinterp_p1lv_f16 : SDNode<"AMDGPUISD::INTERP_P1LV_F16",
				SDTypeProfile<1, 9, [SDTCisFP<0>]>,
				[SDNPInGlue, SDNPOutGlue]>;

				def AMDGPUinterp_p2_f16 : SDNode<"AMDGPUISD::INTERP_P2_F16",
				SDTypeProfile<1, 8, [SDTCisFP<0>]>,
				[SDNPInGlue]>;

	def AMDGPUkill : SDNode<"AMDGPUISD::KILL", AMDGPUKillSDT,			def AMDGPUkill : SDNode<"AMDGPUISD::KILL", AMDGPUKillSDT,
	[SDNPHasChain, SDNPSideEffect]>;			[SDNPHasChain, SDNPSideEffect]>;

	// SI+ export			// SI+ export
	def AMDGPUExportOp : SDTypeProfile<0, 8, [			def AMDGPUExportOp : SDTypeProfile<0, 8, [
	SDTCisInt<0>, // i8 tgt			SDTCisInt<0>, // i8 tgt
	SDTCisInt<1>, // i8 en			SDTCisInt<1>, // i8 en

lib/Target/AMDGPU/AMDGPUSearchableTables.td

	}			}

	def : SourceOfDivergence<int_amdgcn_workitem_id_x>;			def : SourceOfDivergence<int_amdgcn_workitem_id_x>;
	def : SourceOfDivergence<int_amdgcn_workitem_id_y>;			def : SourceOfDivergence<int_amdgcn_workitem_id_y>;
	def : SourceOfDivergence<int_amdgcn_workitem_id_z>;			def : SourceOfDivergence<int_amdgcn_workitem_id_z>;
	def : SourceOfDivergence<int_amdgcn_interp_mov>;			def : SourceOfDivergence<int_amdgcn_interp_mov>;
	def : SourceOfDivergence<int_amdgcn_interp_p1>;			def : SourceOfDivergence<int_amdgcn_interp_p1>;
	def : SourceOfDivergence<int_amdgcn_interp_p2>;			def : SourceOfDivergence<int_amdgcn_interp_p2>;
				def : SourceOfDivergence<int_amdgcn_interp_p1_f16>;
				def : SourceOfDivergence<int_amdgcn_interp_p2_f16>;
	def : SourceOfDivergence<int_amdgcn_mbcnt_hi>;			def : SourceOfDivergence<int_amdgcn_mbcnt_hi>;
	def : SourceOfDivergence<int_amdgcn_mbcnt_lo>;			def : SourceOfDivergence<int_amdgcn_mbcnt_lo>;
	def : SourceOfDivergence<int_r600_read_tidig_x>;			def : SourceOfDivergence<int_r600_read_tidig_x>;
	def : SourceOfDivergence<int_r600_read_tidig_y>;			def : SourceOfDivergence<int_r600_read_tidig_y>;
	def : SourceOfDivergence<int_r600_read_tidig_z>;			def : SourceOfDivergence<int_r600_read_tidig_z>;
	def : SourceOfDivergence<int_amdgcn_atomic_inc>;			def : SourceOfDivergence<int_amdgcn_atomic_inc>;
	def : SourceOfDivergence<int_amdgcn_atomic_dec>;			def : SourceOfDivergence<int_amdgcn_atomic_dec>;
	def : SourceOfDivergence<int_amdgcn_ds_fadd>;			def : SourceOfDivergence<int_amdgcn_ds_fadd>;

lib/Target/AMDGPU/SIISelLowering.cpp

SDValue SITargetLowering::LowerINTRINSIC_WO_CHAIN(SDValue Op,
}		}
case Intrinsic::amdgcn_interp_p2: {		case Intrinsic::amdgcn_interp_p2: {
SDValue M0 = copyToM0(DAG, DAG.getEntryNode(), DL, Op.getOperand(5));		SDValue M0 = copyToM0(DAG, DAG.getEntryNode(), DL, Op.getOperand(5));
SDValue Glue = SDValue(M0.getNode(), 1);		SDValue Glue = SDValue(M0.getNode(), 1);
return DAG.getNode(AMDGPUISD::INTERP_P2, DL, MVT::f32, Op.getOperand(1),		return DAG.getNode(AMDGPUISD::INTERP_P2, DL, MVT::f32, Op.getOperand(1),
Op.getOperand(2), Op.getOperand(3), Op.getOperand(4),		Op.getOperand(2), Op.getOperand(3), Op.getOperand(4),
Glue);		Glue);
}		}
		case Intrinsic::amdgcn_interp_p1_f16: {
		SDValue M0 = copyToM0(DAG, DAG.getEntryNode(), DL, Op.getOperand(5));
		SDValue Glue = M0.getValue(1);
		if (getSubtarget()->getLDSBankCount() == 16) {
		// 16 bank LDS
		SDValue S = DAG.getNode(AMDGPUISD::INTERP_MOV, DL, MVT::f32,
		DAG.getConstant(2, DL, MVT::i32), // P0
		Op.getOperand(2), // Attrchan
		Op.getOperand(3), // Attr
		Glue);
		SDValue Ops[] = {
		Op.getOperand(1), // Src0
		Op.getOperand(2), // Attrchan
		Op.getOperand(3), // Attr
		DAG.getConstant(0, DL, MVT::i32), // $src0_modifiers
		S, // Src2 - holds two f16 values selected by high
		DAG.getConstant(0, DL, MVT::i32), // $src2_modifiers
		Op.getOperand(4), // high
		DAG.getConstant(0, DL, MVT::i1), // $clamp
		DAG.getConstant(0, DL, MVT::i32) // $omod
		};
		return DAG.getNode(AMDGPUISD::INTERP_P1LV_F16, DL, MVT::f32, Ops);
		} else {
		// 32 bank LDS
		SDValue Ops[] = {
		Op.getOperand(1), // Src0
		Op.getOperand(2), // Attrchan
		Op.getOperand(3), // Attr
		DAG.getConstant(0, DL, MVT::i32), // $src0_modifiers
		Op.getOperand(4), // high
		DAG.getConstant(0, DL, MVT::i1), // $clamp
		DAG.getConstant(0, DL, MVT::i32), // $omod
		Glue
		};
		return DAG.getNode(AMDGPUISD::INTERP_P1LL_F16, DL, MVT::f32, Ops);
		}
		}
		case Intrinsic::amdgcn_interp_p2_f16: {
		SDValue M0 = copyToM0(DAG, DAG.getEntryNode(), DL, Op.getOperand(6));
		SDValue Glue = SDValue(M0.getNode(), 1);
		SDValue Ops[] = {
		Op.getOperand(2), // Src0
		Op.getOperand(3), // Attrchan
		Op.getOperand(4), // Attr
		DAG.getConstant(0, DL, MVT::i32), // $src0_modifiers
		Op.getOperand(1), // Src2
		DAG.getConstant(0, DL, MVT::i32), // $src2_modifiers
		Op.getOperand(5), // high
		DAG.getConstant(0, DL, MVT::i1), // $clamp
		Glue
		};
		return DAG.getNode(AMDGPUISD::INTERP_P2_F16, DL, MVT::f16, Ops);
		}
case Intrinsic::amdgcn_sin:		case Intrinsic::amdgcn_sin:
return DAG.getNode(AMDGPUISD::SIN_HW, DL, VT, Op.getOperand(1));		return DAG.getNode(AMDGPUISD::SIN_HW, DL, VT, Op.getOperand(1));

case Intrinsic::amdgcn_cos:		case Intrinsic::amdgcn_cos:
return DAG.getNode(AMDGPUISD::COS_HW, DL, VT, Op.getOperand(1));		return DAG.getNode(AMDGPUISD::COS_HW, DL, VT, Op.getOperand(1));

case Intrinsic::amdgcn_log_clamp: {		case Intrinsic::amdgcn_log_clamp: {
if (Subtarget->getGeneration() < AMDGPUSubtarget::VOLCANIC_ISLANDS)		if (Subtarget->getGeneration() < AMDGPUSubtarget::VOLCANIC_ISLANDS)

lib/Target/AMDGPU/VOP3Instructions.td

def VOP3b_I64_I1_I32_I32_I64 : VOPProfile<[i64, i32, i32, i64]> {
let Outs64 = (outs DstRC:$vdst, SReg_64:$sdst);		let Outs64 = (outs DstRC:$vdst, SReg_64:$sdst);
let Asm64 = " $vdst, $sdst, $src0, $src1, $src2$clamp";		let Asm64 = " $vdst, $sdst, $src0, $src1, $src2$clamp";
}		}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// VOP3 INTERP		// VOP3 INTERP
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

class VOP3Interp<string OpName, VOPProfile P> : VOP3_Pseudo<OpName, P> {		class VOP3Interp<string OpName, VOPProfile P, list<dag> pattern = []> :
		VOP3_Pseudo<OpName, P, pattern> {
let AsmMatchConverter = "cvtVOP3Interp";		let AsmMatchConverter = "cvtVOP3Interp";
}		}

def VOP3_INTERP : VOPProfile<[f32, f32, i32, untyped]> {		def VOP3_INTERP : VOPProfile<[f32, f32, i32, untyped]> {
let Ins64 = (ins Src0Mod:$src0_modifiers, VRegSrc_32:$src0,		let Ins64 = (ins Src0Mod:$src0_modifiers, VRegSrc_32:$src0,
Attr:$attr, AttrChan:$attrchan,		Attr:$attr, AttrChan:$attrchan,
clampmod:$clamp, omod:$omod);		clampmod:$clamp, omod:$omod);

[... 192 lines not shown ...]

let SubtargetPredicate = Has16BitInsts, isCommutable = 1 in {		let SubtargetPredicate = Has16BitInsts, isCommutable = 1 in {

let renamedInGFX9 = 1 in {		let renamedInGFX9 = 1 in {
def V_MAD_F16 : VOP3Inst <"v_mad_f16", VOP3_Profile<VOP_F16_F16_F16_F16>, fmad>;		def V_MAD_F16 : VOP3Inst <"v_mad_f16", VOP3_Profile<VOP_F16_F16_F16_F16>, fmad>;
def V_MAD_U16 : VOP3Inst <"v_mad_u16", VOP3_Profile<VOP_I16_I16_I16_I16, VOP3_CLAMP>>;		def V_MAD_U16 : VOP3Inst <"v_mad_u16", VOP3_Profile<VOP_I16_I16_I16_I16, VOP3_CLAMP>>;
def V_MAD_I16 : VOP3Inst <"v_mad_i16", VOP3_Profile<VOP_I16_I16_I16_I16, VOP3_CLAMP>>;		def V_MAD_I16 : VOP3Inst <"v_mad_i16", VOP3_Profile<VOP_I16_I16_I16_I16, VOP3_CLAMP>>;
def V_FMA_F16 : VOP3Inst <"v_fma_f16", VOP3_Profile<VOP_F16_F16_F16_F16>, fma>;		def V_FMA_F16 : VOP3Inst <"v_fma_f16", VOP3_Profile<VOP_F16_F16_F16_F16>, fma>;
def V_INTERP_P2_F16 : VOP3Interp <"v_interp_p2_f16", VOP3_INTERP16<[f16, f32, i32, f32]>>;		let Uses = [M0, EXEC] in {
}		def V_INTERP_P2_F16 : VOP3Interp <"v_interp_p2_f16", VOP3_INTERP16<[f16, f32, i32, f32]>,
		[(set f16:$vdst, (AMDGPUinterp_p2_f16 f32:$src0, (i32 imm:$attrchan),
		(i32 imm:$attr),
		(i32 imm:$src0_modifiers),
		(f32 VRegSrc_32:$src2),
		(i32 imm:$src2_modifiers),
		(i1 imm:$high),
		(i1 imm:$clamp)))]>;
		} // End Uses = [M0, EXEC]
		} // End renamedInGfx9 = 1

let SubtargetPredicate = isGFX9 in {		let SubtargetPredicate = isGFX9 in {
def V_MAD_F16_gfx9 : VOP3Inst <"v_mad_f16_gfx9", VOP3_Profile<VOP_F16_F16_F16_F16, VOP3_OPSEL>>;		def V_MAD_F16_gfx9 : VOP3Inst <"v_mad_f16_gfx9", VOP3_Profile<VOP_F16_F16_F16_F16, VOP3_OPSEL>>;
def V_MAD_U16_gfx9 : VOP3Inst <"v_mad_u16_gfx9", VOP3_Profile<VOP_I16_I16_I16_I16, VOP3_OPSEL>>;		def V_MAD_U16_gfx9 : VOP3Inst <"v_mad_u16_gfx9", VOP3_Profile<VOP_I16_I16_I16_I16, VOP3_OPSEL>>;
def V_MAD_I16_gfx9 : VOP3Inst <"v_mad_i16_gfx9", VOP3_Profile<VOP_I16_I16_I16_I16, VOP3_OPSEL>>;		def V_MAD_I16_gfx9 : VOP3Inst <"v_mad_i16_gfx9", VOP3_Profile<VOP_I16_I16_I16_I16, VOP3_OPSEL>>;
def V_FMA_F16_gfx9 : VOP3Inst <"v_fma_f16_gfx9", VOP3_Profile<VOP_F16_F16_F16_F16, VOP3_OPSEL>>;		def V_FMA_F16_gfx9 : VOP3Inst <"v_fma_f16_gfx9", VOP3_Profile<VOP_F16_F16_F16_F16, VOP3_OPSEL>>;
def V_INTERP_P2_F16_gfx9 : VOP3Interp <"v_interp_p2_f16_gfx9", VOP3_INTERP16<[f16, f32, i32, f32]>>;		def V_INTERP_P2_F16_gfx9 : VOP3Interp <"v_interp_p2_f16_gfx9", VOP3_INTERP16<[f16, f32, i32, f32]>>;
} // End SubtargetPredicate = isGFX9		} // End SubtargetPredicate = isGFX9

def V_INTERP_P1LL_F16 : VOP3Interp <"v_interp_p1ll_f16", VOP3_INTERP16<[f32, f32, i32, untyped]>>;		let Uses = [M0, EXEC] in {
def V_INTERP_P1LV_F16 : VOP3Interp <"v_interp_p1lv_f16", VOP3_INTERP16<[f32, f32, i32, f16]>>;		def V_INTERP_P1LL_F16 : VOP3Interp <"v_interp_p1ll_f16", VOP3_INTERP16<[f32, f32, i32, untyped]>,
		[(set f32:$vdst, (AMDGPUinterp_p1ll_f16 f32:$src0, (i32 imm:$attrchan),
		(i32 imm:$attr),
		(i32 imm:$src0_modifiers),
		(i1 imm:$high),
		(i1 imm:$clamp),
		(i32 imm:$omod)))]>;
		def V_INTERP_P1LV_F16 : VOP3Interp <"v_interp_p1lv_f16", VOP3_INTERP16<[f32, f32, i32, f16]>,
		[(set f32:$vdst, (AMDGPUinterp_p1lv_f16 f32:$src0, (i32 imm:$attrchan),
		(i32 imm:$attr),
		(i32 imm:$src0_modifiers),
		(f32 VRegSrc_32:$src2),
		(i32 imm:$src2_modifiers),
		(i1 imm:$high),
		(i1 imm:$clamp),
		(i32 imm:$omod)))]>;
		} // End Uses = [M0, EXEC]

} // End SubtargetPredicate = Has16BitInsts, isCommutable = 1		} // End SubtargetPredicate = Has16BitInsts, isCommutable = 1

let SubtargetPredicate = isVI in {		let SubtargetPredicate = isVI in {
def V_INTERP_P1_F32_e64 : VOP3Interp <"v_interp_p1_f32", VOP3_INTERP>;		def V_INTERP_P1_F32_e64 : VOP3Interp <"v_interp_p1_f32", VOP3_INTERP>;
def V_INTERP_P2_F32_e64 : VOP3Interp <"v_interp_p2_f32", VOP3_INTERP>;		def V_INTERP_P2_F32_e64 : VOP3Interp <"v_interp_p2_f32", VOP3_INTERP>;
def V_INTERP_MOV_F32_e64 : VOP3Interp <"v_interp_mov_f32", VOP3_INTERP_MOV>;		def V_INTERP_MOV_F32_e64 : VOP3Interp <"v_interp_mov_f32", VOP3_INTERP_MOV>;


test/CodeGen/AMDGPU/llvm.amdgcn.interp.f16.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GFX9-32BANK %s
				; RUN: llc -mtriple=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -check-prefixes=GFX8-32BANK %s
				; RUN: llc -mtriple=amdgcn -mcpu=gfx810 -verify-machineinstrs < %s | FileCheck -check-prefixes=GFX8-16BANK %s

				define amdgpu_ps half @interp_f16(float inreg %i, float inreg %j, i32 inreg %m0) #0 {
				; GFX9-32BANK-LABEL: interp_f16:
				; GFX9-32BANK: ; %bb.0: ; %main_body
				; GFX9-32BANK-NEXT: s_mov_b32 m0, s2
				; GFX9-32BANK-NEXT: v_mov_b32_e32 v0, s0
				; GFX9-32BANK-NEXT: v_interp_p1ll_f16 v1, v0, attr2.y
				; GFX9-32BANK-NEXT: v_mov_b32_e32 v2, s1
				; GFX9-32BANK-NEXT: v_interp_p1ll_f16 v0, v0, attr2.y high
				; GFX9-32BANK-NEXT: v_interp_p2_legacy_f16 v1, v2, attr2.y, v1
				; GFX9-32BANK-NEXT: v_interp_p2_legacy_f16 v0, v2, attr2.y, v0 high
				; GFX9-32BANK-NEXT: v_add_f16_e32 v0, v1, v0
				; GFX9-32BANK-NEXT: ; return to shader part epilog
				;
				; GFX8-32BANK-LABEL: interp_f16:
				; GFX8-32BANK: ; %bb.0: ; %main_body
				; GFX8-32BANK-NEXT: s_mov_b32 m0, s2
				; GFX8-32BANK-NEXT: v_mov_b32_e32 v0, s0
				; GFX8-32BANK-NEXT: v_interp_p1ll_f16 v1, v0, attr2.y
				; GFX8-32BANK-NEXT: v_mov_b32_e32 v2, s1
				; GFX8-32BANK-NEXT: v_interp_p1ll_f16 v0, v0, attr2.y high
				; GFX8-32BANK-NEXT: v_interp_p2_f16 v1, v2, attr2.y, v1
				; GFX8-32BANK-NEXT: v_interp_p2_f16 v0, v2, attr2.y, v0 high
				; GFX8-32BANK-NEXT: v_add_f16_e32 v0, v1, v0
				; GFX8-32BANK-NEXT: ; return to shader part epilog
				;
				; GFX8-16BANK-LABEL: interp_f16:
				; GFX8-16BANK: ; %bb.0: ; %main_body
				; GFX8-16BANK-NEXT: s_mov_b32 m0, s2
				; GFX8-16BANK-NEXT: v_interp_mov_f32_e32 v0, p0, attr2.y
				; GFX8-16BANK-NEXT: v_mov_b32_e32 v1, s0
				; GFX8-16BANK-NEXT: v_interp_p1lv_f16 v2, v1, attr2.y, v0
				; GFX8-16BANK-NEXT: v_mov_b32_e32 v3, s1
				; GFX8-16BANK-NEXT: v_interp_p1lv_f16 v0, v1, attr2.y, v0 high
				; GFX8-16BANK-NEXT: v_interp_p2_f16 v2, v3, attr2.y, v2
				; GFX8-16BANK-NEXT: v_interp_p2_f16 v0, v3, attr2.y, v0 high
				; GFX8-16BANK-NEXT: v_add_f16_e32 v0, v2, v0
				; GFX8-16BANK-NEXT: ; return to shader part epilog
				main_body:
				%p1_0 = call float @llvm.amdgcn.interp.p1.f16(float %i, i32 1, i32 2, i1 0, i32 %m0)
				%p2_0 = call half @llvm.amdgcn.interp.p2.f16(float %p1_0, float %j, i32 1, i32 2, i1 0, i32 %m0)
				%p1_1 = call float @llvm.amdgcn.interp.p1.f16(float %i, i32 1, i32 2, i1 1, i32 %m0)
				%p2_1 = call half @llvm.amdgcn.interp.p2.f16(float %p1_1, float %j, i32 1, i32 2, i1 1, i32 %m0)
				%res = fadd half %p2_0, %p2_1
				ret half %res
				}

				; float @llvm.amdgcn.interp.p1.f16(i, attrchan, attr, high, m0)
				declare float @llvm.amdgcn.interp.p1.f16(float, i32, i32, i1, i32) #0
				; half @llvm.amdgcn.interp.p2.f16(p1, j, attrchan, attr, high, m0)
				declare half @llvm.amdgcn.interp.p2.f16(float, float, i32, i32, i1, i32) #0
				declare float @llvm.amdgcn.interp.mov(i32, i32, i32, i32) #0

				attributes #0 = { nounwind readnone }

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add intrinsics for 16 bit interpolation
ClosedPublic
