Details

Reviewers

rampitec
b-sumner

Commits

rGc370d7b33d0a: [AMDGPU] [AMDGPU] Support a fdot2 pattern.
rL337198: [AMDGPU] [AMDGPU] Support a fdot2 pattern.

Summary

Optimize fma((float)S0.x, (float)S1.x, fma((float)S0.y, (float)S1.y, z)) -> fdot2((v2f16)S0, (v2f16)S1, (float)z)

Diff Detail

Repository: rL LLVM

Event Timeline

FarhanaAleen created this revision. Jul 10 2018, 9:36 AM

Herald added subscribers: t-tye, tpr, dstuttard and 5 others. Jul 10 2018, 9:36 AM

rampitec added a reviewer: b-sumner. Jul 10 2018, 9:51 AM

Does fdot2 round its intermediates?
You start with two FMAs, and an FMA performs a * b + c with no intermediate rounding step. So the expression you are converting is quite fancy in terms of rounding:

fma(S0.x, S1.x, fma(S0.y, S1.y, z)) => round((S0.x * S1.x) + round((S0.y * S1.y) + z))

I.e. you have two operations with rounding and two without. I am really unsure that this is what the v_dot2_f32_f16 instruction does.
Then you probably need some conditions to check whether denorms are supported or not.

This operation only rounds a single time, and unfortunately always flushes f32 denorms. Thus this transformation should only be done when unsafe math is requested.
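
The rounding difference raised above can be seen with a small host-side experiment. This is only a sketch: `nested_fma`, `single_rounding_dot2` and `self_check` are hypothetical names, and accumulating in double is an assumed stand-in for "round once", not a model of the v_dot2_f32_f16 hardware (it ignores denormal flushing entirely).

```cpp
#include <cassert>
#include <cmath>

// Two chained FMAs, as in the source pattern. Each std::fma rounds its
// result once, so the final value has been rounded twice.
float nested_fma(float a0, float b0, float a1, float b1, float z) {
  return std::fma(a0, b0, std::fma(a1, b1, z));
}

// Stand-in for a dot product that rounds only once: accumulate in double,
// then convert to float. Assumption: inputs are f16 values widened to f32,
// so each product is exact in double. The double additions can still round
// in general, so this only approximates single rounding.
float single_rounding_dot2(float a0, float b0, float a1, float b1, float z) {
  double sum = static_cast<double>(a0) * b0 + static_cast<double>(a1) * b1 + z;
  return static_cast<float>(sum);
}

// Inputs for which the two forms disagree by one f32 ulp.
void self_check() {
  const float h = std::ldexp(1.0f, -12);  // 2^-12, representable in f16
  // Each product is 2^-24, exactly half an ulp of 1.0f, so each FMA hits a
  // tie and rounds back to 1.0f.
  assert(nested_fma(h, h, h, h, 1.0f) == 1.0f);
  // Summed before rounding, the two products contribute a full ulp (2^-23).
  assert(single_rounding_dot2(h, h, h, h, 1.0f) == std::nextafter(1.0f, 2.0f));
}
```

Because the two forms can differ in the last bit, the fold changes results, which is why the thread ties it to unsafe math or contraction flags.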

In D49146#1157632, @b-sumner wrote:

This operation only rounds a single time, and unfortunately always flushes f32 denorms. Thus this transformation should only be done when unsafe math is requested.

As far as I understand, it should also be legal with -mattr=-fp32-denormals,-fp64-fp16-denormals, i.e. when neither f32 nor f16 denorms are supported. Right? Not that it really helps in the real world.
Otherwise it should be legal if either the UnsafeAlgebra or the AllowContract flag is set on both FMA nodes.

Agreed.

By the way, since types are being mixed, shouldn't the summary say something like optimize fma((float)S0.x, (float)S1.x, fma((float)S0.y, (float)S1.y, S2)) --> fdot2(S0, S1, S2)? We only want this transformation if S0 and S1 are <2 x f16>.

As far as I understand, it should also be legal with -mattr=-fp32-denormals,-fp64-fp16-denormals, i.e. when neither f32 nor f16 denorms are supported. Right? Not that it really helps in the real world.
Otherwise it should be legal if either the UnsafeAlgebra or the AllowContract flag is set on both FMA nodes.

Having the FMA node already guarantees that either UnsafeAlgebra or the AllowContract flag is set on the FAdd/FMUL nodes. We don't need to check them again during the FMA combine, right?

By the way, since types are being mixed, shouldn't the summary say something like optimize fma((float)S0.x, (float)S1.x, fma((float)S0.y, (float)S1.y, S2)) --> fdot2(S0, S1, S2)? We only want this transformation if S0 and S1 are <2 x f16>.

Current pattern matching does not support float element type yet, it will be supported next.

You are right, there is a typo in the summary. It should be:
fma((f16)S0.x, (f16)S1.x, fma((f16)S0.y, (f16)S1.y, (f16)z)) -> ftrunc(fdot2(S0, S1, (f32)z))

In D49146#1157800, @FarhanaAleen wrote:

By the way, since types are being mixed, shouldn't the summary say something like optimize fma((float)S0.x, (float)S1.x, fma((float)S0.y, (float)S1.y, S2)) --> fdot2(S0, S1, S2)? We only want this transformation if S0 and S1 are <2 x f16>.

Current pattern matching does not support float element type yet, it will be supported next.

You are right, there is a typo in the summary. It should be:
fma((f16)S0.x, (f16)S1.x, fma((f16)S0.y, (f16)S1.y, (f16)z)) -> ftrunc(fdot2(S0, S1, (f32)z))

I think I had it right. FMA requires all 3 arguments to be the same type. And we don't want to do this transformation if the first two arguments to each FMA weren't cast from f16 to f32.

In D49146#1157765, @FarhanaAleen wrote:

As far as I understand, it should also be legal with -mattr=-fp32-denormals,-fp64-fp16-denormals, i.e. when neither f32 nor f16 denorms are supported. Right? Not that it really helps in the real world.
Otherwise it should be legal if either the UnsafeAlgebra or the AllowContract flag is set on both FMA nodes.

Having the FMA node already guarantees that either UnsafeAlgebra or the AllowContract flag is set on the FAdd/FMUL nodes. We don't need to check them again during the FMA combine, right?

FMA really restricts only the rounding of the intermediate result; the result of the fma itself is still rounded according to -mattr and other options. So I think you would need to check it anyway.
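
The condition under discussion can be written as a small predicate. This is a minimal sketch under stated assumptions: `FmaFlags`, `FpOptions` and `may_fold_to_fdot2` are hypothetical stand-ins, not LLVM types or APIs. It only illustrates that the contraction flag has to be present on both FMA nodes unless a global option already permits the fold.

```cpp
#include <cassert>

// Hypothetical stand-ins for the per-node fast-math flags and the global
// target options; these are not the real LLVM classes.
struct FmaFlags {
  bool allowContract;
};
struct FpOptions {
  bool fusionFast;    // models -fp-contract=fast
  bool unsafeFPMath;  // models -enable-unsafe-fp-math
};

// The fold is allowed when a global option permits it, or when both the
// outer and the inner FMA carry the contract flag.
bool may_fold_to_fdot2(const FpOptions &opts, FmaFlags outer, FmaFlags inner) {
  if (opts.fusionFast || opts.unsafeFPMath)
    return true;
  return outer.allowContract && inner.allowContract;
}

void self_check() {
  const FpOptions strict{false, false};
  assert(may_fold_to_fdot2(strict, {true}, {true}));
  assert(!may_fold_to_fdot2(strict, {true}, {false}));  // one flag missing
  assert(!may_fold_to_fdot2(strict, {false}, {true}));
  assert(may_fold_to_fdot2({false, true}, {false}, {false}));
  assert(may_fold_to_fdot2({true, false}, {false}, {false}));
}
```

Checking both nodes matters because an FMA can reach the combine without either flag, for example when it was written explicitly in the source.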

In D49146#1157765, @FarhanaAleen wrote:

As far as I understand, it should also be legal with -mattr=-fp32-denormals,-fp64-fp16-denormals, i.e. when neither f32 nor f16 denorms are supported. Right? Not that it really helps in the real world.
Otherwise it should be legal if either the UnsafeAlgebra or the AllowContract flag is set on both FMA nodes.

Having the FMA node already guarantees that either UnsafeAlgebra or the AllowContract flag is set on the FAdd/FMUL nodes. We don't need to check them again during the FMA combine, right?

The fma absolutely does not guarantee this

test/CodeGen/AMDGPU/dotproduct.ll
1 ↗	(On Diff #154822)	Test should probably be named fdot2.ll
2 ↗	(On Diff #154822)	The check prefixes are broken. They both need to include gcn for the check labels to work
23 ↗	(On Diff #154822)	Remove fast and only use the required algebra flag (whatever it’s called now), plus some tests with the different permutations with a missing flag
30 ↗	(On Diff #154822)	Needs tests with source and output modifiers, also with wider vector types

arsenm added inline comments. Jul 10 2018, 2:33 PM

lib/Target/AMDGPU/SIISelLowering.cpp
7474 ↗	(On Diff #154822)	Probably should only do after legalization

FarhanaAleen added inline comments. Jul 12 2018, 11:20 AM

lib/Target/AMDGPU/SIISelLowering.cpp
7474 ↗	(On Diff #154822)	Doing this after legalization makes the analysis complex. Due to type(v2f16)/node legalization we get a long chain of bitcasts in the IR after legalization. Also, the identification of the vector element is not straightforward after the legalization. Would it be illegal to do this before legalization?

Added fast-math flag+allow-contract flag and more test-cases.

rampitec added inline comments. Jul 13 2018, 10:13 AM

lib/Target/AMDGPU/SIISelLowering.cpp
7488 ↗	(On Diff #155416)	That is not exactly true. If denorms are not supported on both f32 and f16 folding is also legal.

arsenm added inline comments. Jul 13 2018, 10:25 AM

lib/Target/AMDGPU/SIISelLowering.cpp
7474 ↗	(On Diff #154822)	No, I just expect introducing custom nodes earlier to break generic optimizations

Reworded the comment about the flag requirements.

rampitec added inline comments. Jul 13 2018, 10:36 AM

lib/Target/AMDGPU/SIISelLowering.cpp
7488 ↗	(On Diff #155416)	The comment is still wrong. It is in fact illegal to do this without regard to the denorm mode. For that reason it is only legal under some conditions, such as unsafe math or allowed contraction.

FarhanaAleen updated this revision to Diff 155509. Jul 13 2018, 3:01 PM

FarhanaAleen added inline comments.

lib/Target/AMDGPU/SIISelLowering.cpp
7488 ↗	(On Diff #155416)	I thought there were two separate cases. In the first, fdot2 handles denormals and requires the denorm mode to be set for that; with that configuration, even under fast-math/fp-contract, we could still decline to generate fdot2 when denorm support is off but we want to support denormals. In the second, fdot2 never handles denormals, regardless of the denorm mode setting; that configuration does not leave much choice. So we don't need to worry about denorm support: fast-math/fp-contract is sufficient to decide whether fdot2 should be generated. I thought it was already implied that this transformation, like any other aggressive floating-point optimization, can only be done under unsafe-fp-math/fp-contract, so the comment did not need to mention unsafe math or allowed contraction explicitly. There are cases where, even with unsafe-fp-math/fp-contract, we don't do a transformation because the hardware does not support subnormals, and fdot2 is an exception to those cases. That is why my comment emphasized the denorm part, with unsafe math implied. But I see your point.
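
To make the denormal point concrete, here is a rough host-side model of an operation that always flushes f32 denormals on its accumulator input and its result. The names `flush_denorm` and `dot2_ftz_model` are hypothetical, and the model is an assumption based only on the behaviour described in this thread; it is not a specification of v_dot2_f32_f16.

```cpp
#include <cassert>
#include <cmath>

// Replace a subnormal f32 with a zero of the same sign; pass everything
// else through unchanged.
float flush_denorm(float x) {
  return std::fpclassify(x) == FP_SUBNORMAL ? std::copysign(0.0f, x) : x;
}

// Assumed model: flush the f32 accumulator, accumulate in double as a
// stand-in for a single final rounding, then flush the f32 result.
float dot2_ftz_model(float a0, float b0, float a1, float b1, float z) {
  double sum = static_cast<double>(a0) * b0 + static_cast<double>(a1) * b1 +
               flush_denorm(z);
  return flush_denorm(static_cast<float>(sum));
}

void self_check() {
  const float tiny = std::ldexp(1.0f, -149);  // smallest positive f32 subnormal
  assert(flush_denorm(tiny) == 0.0f);
  assert(std::signbit(flush_denorm(-tiny)));  // sign of the zero is kept
  assert(flush_denorm(1.0f) == 1.0f);
  // A subnormal accumulator disappears in the model, while ordinary
  // values are unaffected.
  assert(dot2_ftz_model(0.0f, 0.0f, 0.0f, 0.0f, tiny) == 0.0f);
  assert(dot2_ftz_model(1.0f, 1.0f, 1.0f, 1.0f, 1.0f) == 3.0f);
}
```

With denormals enabled, the original pair of FMAs would return the subnormal accumulator unchanged, so the fold is observable in that mode.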

LGTM

This revision is now accepted and ready to land. Jul 13 2018, 3:05 PM

Closed by commit rL337198: [AMDGPU] [AMDGPU] Support a fdot2 pattern. (authored by faaleen). Jul 16 2018, 11:25 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: llvm-commits. Jul 16 2018, 11:25 AM

arsenm added inline comments. Aug 13 2018, 8:30 AM

llvm/trunk/test/CodeGen/AMDGPU/fdot2.ll
232	Probably should try matching the negated form too

Diff 155722

llvm/trunk/lib/Target/AMDGPU/AMDGPUISelLowering.h

@@ enum NodeType : unsigned {
   SMAX3,
   UMAX3,
   FMIN3,
   SMIN3,
   UMIN3,
   FMED3,
   SMED3,
   UMED3,
+  FDOT2,
   URECIP,
   DIV_SCALE,
   DIV_FMAS,
   DIV_FIXUP,
   // For emitting ISD::FMAD when f32 denormals are enabled because mac/mad is
   // treated as an illegal operation.
   FMAD_FTZ,
   TRIG_PREOP, // 1 ULP max error for f64

llvm/trunk/lib/Target/AMDGPU/AMDGPUISelLowering.cpp

@@ const char* AMDGPUTargetLowering::getTargetNodeName(unsigned Opcode) const {
   NODE_NAME_CASE(SMAX3)
   NODE_NAME_CASE(UMAX3)
   NODE_NAME_CASE(FMIN3)
   NODE_NAME_CASE(SMIN3)
   NODE_NAME_CASE(UMIN3)
   NODE_NAME_CASE(FMED3)
   NODE_NAME_CASE(SMED3)
   NODE_NAME_CASE(UMED3)
+  NODE_NAME_CASE(FDOT2)
   NODE_NAME_CASE(URECIP)
   NODE_NAME_CASE(DIV_SCALE)
   NODE_NAME_CASE(DIV_FMAS)
   NODE_NAME_CASE(DIV_FIXUP)
   NODE_NAME_CASE(FMAD_FTZ)
   NODE_NAME_CASE(TRIG_PREOP)
   NODE_NAME_CASE(RCP)
   NODE_NAME_CASE(RSQ)

llvm/trunk/lib/Target/AMDGPU/AMDGPUInstrInfo.td

@@
 >;

 def AMDGPUumed3 : SDNode<"AMDGPUISD::UMED3", AMDGPUDTIntTernaryOp,
   []
 >;

 def AMDGPUfmed3 : SDNode<"AMDGPUISD::FMED3", SDTFPTernaryOp, []>;

+def AMDGPUfdot2 : SDNode<"AMDGPUISD::FDOT2",
+                  SDTypeProfile<1, 3, [SDTCisSameAs<0, 3>, SDTCisSameAs<1, 2>,
+                                       SDTCisFP<0>, SDTCisVec<1>]>,
+                  []>;
+
 def AMDGPUperm : SDNode<"AMDGPUISD::PERM", AMDGPUDTIntTernaryOp, []>;

 def AMDGPUinit_exec : SDNode<"AMDGPUISD::INIT_EXEC",
                               SDTypeProfile<0, 1, [SDTCisInt<0>]>,
                               [SDNPHasChain, SDNPInGlue]>;

 def AMDGPUinit_exec_from_input : SDNode<"AMDGPUISD::INIT_EXEC_FROM_INPUT",
                                          SDTypeProfile<0, 2,

llvm/trunk/lib/Target/AMDGPU/SIISelLowering.h

@@ private:

   unsigned getFusedOpcode(const SelectionDAG &DAG,
                           const SDNode *N0, const SDNode *N1) const;
   SDValue performAddCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performAddCarrySubCarryCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performSubCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performFAddCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performFSubCombine(SDNode *N, DAGCombinerInfo &DCI) const;
+  SDValue performFMACombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performSetCCCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performCvtF32UByteNCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performClampCombine(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue performRcpCombine(SDNode *N, DAGCombinerInfo &DCI) const;

   bool isLegalFlatAddressingMode(const AddrMode &AM) const;
   bool isLegalGlobalAddressingMode(const AddrMode &AM) const;
   bool isLegalMUBUFAddressingMode(const AddrMode &AM) const;

llvm/trunk/lib/Target/AMDGPU/SIISelLowering.cpp

@@ #endif
   setTargetDAGCombine(ISD::ADD);
   setTargetDAGCombine(ISD::ADDCARRY);
   setTargetDAGCombine(ISD::SUB);
   setTargetDAGCombine(ISD::SUBCARRY);
   setTargetDAGCombine(ISD::FADD);
   setTargetDAGCombine(ISD::FSUB);
   setTargetDAGCombine(ISD::FMINNUM);
   setTargetDAGCombine(ISD::FMAXNUM);
+  setTargetDAGCombine(ISD::FMA);
   setTargetDAGCombine(ISD::SMIN);
   setTargetDAGCombine(ISD::SMAX);
   setTargetDAGCombine(ISD::UMIN);
   setTargetDAGCombine(ISD::UMAX);
   setTargetDAGCombine(ISD::SETCC);
   setTargetDAGCombine(ISD::AND);
   setTargetDAGCombine(ISD::OR);
   setTargetDAGCombine(ISD::XOR);
@@ case Intrinsic::amdgcn_fcmp: {
     FCmpInst::Predicate IcInput = static_cast<FCmpInst::Predicate>(CondCode);
     ISD::CondCode CCOpcode = getFCmpCondCode(IcInput);
     return DAG.getNode(AMDGPUISD::SETCC, DL, VT, Op.getOperand(1),
                        Op.getOperand(2), DAG.getCondCode(CCOpcode));
   }
   case Intrinsic::amdgcn_fmed3:
     return DAG.getNode(AMDGPUISD::FMED3, DL, VT,
                        Op.getOperand(1), Op.getOperand(2), Op.getOperand(3));
+  case Intrinsic::amdgcn_fdot2:
+    return DAG.getNode(AMDGPUISD::FDOT2, DL, VT,
+                       Op.getOperand(1), Op.getOperand(2), Op.getOperand(3));
   case Intrinsic::amdgcn_fmul_legacy:
     return DAG.getNode(AMDGPUISD::FMUL_LEGACY, DL, VT,
                        Op.getOperand(1), Op.getOperand(2));
   case Intrinsic::amdgcn_sffbh:
     return DAG.getNode(AMDGPUISD::FFBH_I32, DL, VT, Op.getOperand(1));
   case Intrinsic::amdgcn_sbfe:
     return DAG.getNode(AMDGPUISD::BFE_I32, DL, VT,
                        Op.getOperand(1), Op.getOperand(2), Op.getOperand(3));
@@ if (A == RHS.getOperand(1)) {
         return DAG.getNode(FusedOp, SL, VT, A, NegTwo, LHS);
       }
     }
   }

   return SDValue();
 }

+SDValue SITargetLowering::performFMACombine(SDNode *N,
+                                            DAGCombinerInfo &DCI) const {
+  SelectionDAG &DAG = DCI.DAG;
+  EVT VT = N->getValueType(0);
+  SDLoc SL(N);
+
+  if (!Subtarget->hasDLInsts() || VT != MVT::f32)
+    return SDValue();
+
+  // FMA((F32)S0.x, (F32)S1. x, FMA((F32)S0.y, (F32)S1.y, (F32)z)) ->
+  //   FDOT2((V2F16)S0, (V2F16)S1, (F32)z))
+  SDValue Op1 = N->getOperand(0);
+  SDValue Op2 = N->getOperand(1);
+  SDValue FMA = N->getOperand(2);
+
+  if (FMA.getOpcode() != ISD::FMA ||
+      Op1.getOpcode() != ISD::FP_EXTEND ||
+      Op2.getOpcode() != ISD::FP_EXTEND)
+    return SDValue();
+
+  // fdot2_f32_f16 always flushes fp32 denormal operand and output to zero,
+  // regardless of the denorm mode setting. Therefore, unsafe-fp-math/fp-contract
+  // is sufficient to allow generaing fdot2.
+  const TargetOptions &Options = DAG.getTarget().Options;
+  if (Options.AllowFPOpFusion == FPOpFusion::Fast || Options.UnsafeFPMath ||
+      (N->getFlags().hasAllowContract() &&
+       FMA->getFlags().hasAllowContract())) {
+    Op1 = Op1.getOperand(0);
+    Op2 = Op2.getOperand(0);
+    if (Op1.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
+        Op2.getOpcode() != ISD::EXTRACT_VECTOR_ELT)
+      return SDValue();
+
+    SDValue Vec1 = Op1.getOperand(0);
+    SDValue Idx1 = Op1.getOperand(1);
+    SDValue Vec2 = Op2.getOperand(0);
+
+    SDValue FMAOp1 = FMA.getOperand(0);
+    SDValue FMAOp2 = FMA.getOperand(1);
+    SDValue FMAAcc = FMA.getOperand(2);
+
+    if (FMAOp1.getOpcode() != ISD::FP_EXTEND ||
+        FMAOp2.getOpcode() != ISD::FP_EXTEND)
+      return SDValue();
+
+    FMAOp1 = FMAOp1.getOperand(0);
+    FMAOp2 = FMAOp2.getOperand(0);
+    if (FMAOp1.getOpcode() != ISD::EXTRACT_VECTOR_ELT ||
+        FMAOp2.getOpcode() != ISD::EXTRACT_VECTOR_ELT)
+      return SDValue();
+
+    SDValue Vec3 = FMAOp1.getOperand(0);
+    SDValue Vec4 = FMAOp2.getOperand(0);
+    SDValue Idx2 = FMAOp1.getOperand(1);
+
+    if (Idx1 != Op2.getOperand(1) || Idx2 != FMAOp2.getOperand(1) ||
+        // Idx1 and Idx2 cannot be the same.
+        Idx1 == Idx2)
+      return SDValue();
+
+    if (Vec1 == Vec2 || Vec3 == Vec4)
+      return SDValue();
+
+    if (Vec1.getValueType() != MVT::v2f16 || Vec2.getValueType() != MVT::v2f16)
+      return SDValue();
+
+    if ((Vec1 == Vec3 && Vec2 == Vec4) ||
+        (Vec1 == Vec4 && Vec2 == Vec3))
+      return DAG.getNode(AMDGPUISD::FDOT2, SL, MVT::f32, Vec1, Vec2, FMAAcc);
+  }
+  return SDValue();
+}
+
 SDValue SITargetLowering::performSetCCCombine(SDNode *N,
                                               DAGCombinerInfo &DCI) const {
   SelectionDAG &DAG = DCI.DAG;
   SDLoc SL(N);

   SDValue LHS = N->getOperand(0);
   SDValue RHS = N->getOperand(1);
   EVT VT = LHS.getValueType();
@@ SDValue SITargetLowering::PerformDAGCombine(SDNode *N,
   case ISD::UMIN:
   case AMDGPUISD::FMIN_LEGACY:
   case AMDGPUISD::FMAX_LEGACY: {
     if (DCI.getDAGCombineLevel() >= AfterLegalizeDAG &&
         getTargetMachine().getOptLevel() > CodeGenOpt::None)
       return performMinMaxCombine(N, DCI);
     break;
   }
+  case ISD::FMA:
+    return performFMACombine(N, DCI);
   case ISD::LOAD: {
     if (SDValue Widended = widenLoad(cast<LoadSDNode>(N), DCI))
       return Widended;
     LLVM_FALLTHROUGH;
   }
   case ISD::STORE:
   case ISD::ATOMIC_LOAD:
   case ISD::ATOMIC_STORE:
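
The operand checks in the FMA combine for this file can be modeled outside of SelectionDAG. The sketch below is illustrative only: `ExtElt` and `is_dot2_shape` are hypothetical names, vectors are identified by plain integers, and the <2 x half> type check and the fast-math conditions are deliberately left out.

```cpp
#include <cassert>

// Hypothetical model of one multiplicand, "fpext(extractelement Vec, Idx)":
// which vector it reads from and which lane it extracts.
struct ExtElt {
  int vec;
  int idx;
};

// Shape test for a*b + (c*d + z): each product must use one lane from two
// different vectors, the two products must use different lanes, and both
// products must read the same pair of vectors (in either order).
bool is_dot2_shape(ExtElt a, ExtElt b, ExtElt c, ExtElt d) {
  if (a.idx != b.idx || c.idx != d.idx || a.idx == c.idx)
    return false;
  if (a.vec == b.vec || c.vec == d.vec)
    return false;
  return (a.vec == c.vec && b.vec == d.vec) ||
         (a.vec == d.vec && b.vec == c.vec);
}

void self_check() {
  // src1 is vector 1 and src2 is vector 2, mirroring the test file's names.
  assert(is_dot2_shape({1, 0}, {2, 0}, {1, 1}, {2, 1}));   // dotproduct_f16_f32
  assert(is_dot2_shape({1, 0}, {2, 0}, {2, 1}, {1, 1}));   // swapped operands
  assert(!is_dot2_shape({2, 0}, {2, 1}, {1, 1}, {1, 0}));  // same vector twice
  assert(!is_dot2_shape({1, 0}, {2, 1}, {1, 1}, {2, 0}));  // mismatched lanes
}
```

The four cases in `self_check` correspond to dotproduct_f16_f32, dotproduct_diffvecorder, NotAdotproduct and Diff_Idx_NotAdotproduct in the accompanying test file.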

llvm/trunk/lib/Target/AMDGPU/VOP3PInstructions.td

@@
 }
 }

 defm : MadFmaMixPats<fma, V_FMA_MIX_F32, V_FMA_MIXLO_F16, V_FMA_MIXHI_F16>;
 }

 let SubtargetPredicate = HasDLInsts in {

-def V_DOT2_F32_F16 : VOP3PInst<"v_dot2_f32_f16", VOP3_Profile<VOP_F32_V2F16_V2F16_F32>, int_amdgcn_fdot2>;
+def V_DOT2_F32_F16 : VOP3PInst<"v_dot2_f32_f16", VOP3_Profile<VOP_F32_V2F16_V2F16_F32>, AMDGPUfdot2>;
 def V_DOT2_I32_I16 : VOP3PInst<"v_dot2_i32_i16", VOP3_Profile<VOP_I32_V2I16_V2I16_I32>, int_amdgcn_sdot2>;
 def V_DOT2_U32_U16 : VOP3PInst<"v_dot2_u32_u16", VOP3_Profile<VOP_I32_V2I16_V2I16_I32>, int_amdgcn_udot2>;
 def V_DOT4_I32_I8 : VOP3PInst<"v_dot4_i32_i8", VOP3_Profile<VOP_I32_I32_I32_I32, VOP3_PACKED>, int_amdgcn_sdot4>;
 def V_DOT4_U32_U8 : VOP3PInst<"v_dot4_u32_u8", VOP3_Profile<VOP_I32_I32_I32_I32, VOP3_PACKED>, int_amdgcn_udot4>;
 def V_DOT8_I32_I4 : VOP3PInst<"v_dot8_i32_i4", VOP3_Profile<VOP_I32_I32_I32_I32, VOP3_PACKED>, int_amdgcn_sdot8>;
 def V_DOT8_U32_U4 : VOP3PInst<"v_dot8_u32_u4", VOP3_Profile<VOP_I32_I32_I32_I32, VOP3_PACKED>, int_amdgcn_udot8>;

 } // End SubtargetPredicate = HasDLInsts

llvm/trunk/test/CodeGen/AMDGPU/fdot2.ll

				; RUN: llc -march=amdgcn -mcpu=gfx900 -enable-unsafe-fp-math -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GCN,GFX900
				; RUN: llc -march=amdgcn -mcpu=gfx906 -enable-unsafe-fp-math -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GCN,GFX906-UNSAFE
				; RUN: llc -march=amdgcn -mcpu=gfx906 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GCN,GFX906
				; RUN: llc -march=amdgcn -mcpu=gfx906 -mattr=-fp64-fp16-denormals,-fp32-denormals -fp-contract=fast -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GCN,GFX906-CONTRACT
				; RUN: llc -march=amdgcn -mcpu=gfx906 -mattr=+fp64-fp16-denormals,+fp32-denormals -fp-contract=fast -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GCN,GFX906-DENORM-CONTRACT
				; (fadd (fmul S1.x, S2.x), (fadd (fmul (S1.y, S2.y), z))) -> (fdot2 S1, S2, z)

				; Tests to make sure fdot2 is not generated when vector elements of dot-product expressions
				; are not converted from f16 to f32.
				; GCN-LABEL: {{^}}dotproduct_f16
				; GFX900: v_fma_legacy_f16
				; GCN900: v_fma_legacy_f16

				; GFX906: v_mul_f16_e32
				; GFX906: v_mul_f16_e32

				; GFX906-UNSAFE: v_fma_legacy_f16

				; GFX906-CONTRACT: v_mac_f16_e32
				; GFX906-DENORM-CONTRACT: v_fma_legacy_f16
				define amdgpu_kernel void @dotproduct_f16(<2 x half> addrspace(1)* %src1,
				<2 x half> addrspace(1)* %src2,
				half addrspace(1)* nocapture %dst) {
				entry:
				%src1.vec = load <2 x half>, <2 x half> addrspace(1)* %src1
				%src2.vec = load <2 x half>, <2 x half> addrspace(1)* %src2

				%src1.el1 = extractelement <2 x half> %src1.vec, i64 0
				%src2.el1 = extractelement <2 x half> %src2.vec, i64 0

				%src1.el2 = extractelement <2 x half> %src1.vec, i64 1
				%src2.el2 = extractelement <2 x half> %src2.vec, i64 1

				%mul2 = fmul half %src1.el2, %src2.el2
				%mul1 = fmul half %src1.el1, %src2.el1
				%acc = load half, half addrspace(1)* %dst, align 2
				%acc1 = fadd half %mul2, %acc
				%acc2 = fadd half %mul1, %acc1
				store half %acc2, half addrspace(1)* %dst, align 2
				ret void
				}


				; We only want to generate fdot2 if vector element of dot product is converted from f16 to f32
				; and the vectors are of type <2 x half>
				; GCN-LABEL: {{^}}dotproduct_f16_f32
				; GFX900: v_mad_mix_f32
				; GCN900: v_mad_mix_f32

				; GFX906: v_mad_f32
				; GFX906: v_mac_f32_e32

				; GFX906-UNSAFE: v_dot2_f32_f16

				; GFX906-CONTRACT: v_dot2_f32_f16

				; GFX906-DENORM-CONTRACT: v_dot2_f32_f16
				define amdgpu_kernel void @dotproduct_f16_f32(<2 x half> addrspace(1)* %src1,
				<2 x half> addrspace(1)* %src2,
				float addrspace(1)* nocapture %dst) {
				entry:
				%src1.vec = load <2 x half>, <2 x half> addrspace(1)* %src1
				%src2.vec = load <2 x half>, <2 x half> addrspace(1)* %src2

				%src1.el1 = extractelement <2 x half> %src1.vec, i64 0
				%csrc1.el1 = fpext half %src1.el1 to float
				%src2.el1 = extractelement <2 x half> %src2.vec, i64 0
				%csrc2.el1 = fpext half %src2.el1 to float

				%src1.el2 = extractelement <2 x half> %src1.vec, i64 1
				%csrc1.el2 = fpext half %src1.el2 to float
				%src2.el2 = extractelement <2 x half> %src2.vec, i64 1
				%csrc2.el2 = fpext half %src2.el2 to float

				%mul2 = fmul float %csrc1.el2, %csrc2.el2
				%mul1 = fmul float %csrc1.el1, %csrc2.el1
				%acc = load float, float addrspace(1)* %dst, align 4
				%acc1 = fadd float %mul2, %acc
				%acc2 = fadd float %mul1, %acc1
				store float %acc2, float addrspace(1)* %dst, align 4
				ret void
				}

				; We only want to generate fdot2 if vector element of dot product is converted from f16 to f32
				; and the vectors are of type <2 x half>
				; GCN-LABEL: {{^}}dotproduct_diffvecorder
				; GFX900: v_mad_mix_f32
				; GCN900: v_mad_mix_f32

				; GFX906: v_mad_f32
				; GFX906: v_mac_f32_e32

				; GFX906-UNSAFE: v_dot2_f32_f16

				; GFX906-CONTRACT: v_dot2_f32_f16
				; GFX906-DENORM-CONTRACT: v_dot2_f32_f16
				define amdgpu_kernel void @dotproduct_diffvecorder(<2 x half> addrspace(1)* %src1,
				<2 x half> addrspace(1)* %src2,
				float addrspace(1)* nocapture %dst) {
				entry:
				%src1.vec = load <2 x half>, <2 x half> addrspace(1)* %src1
				%src2.vec = load <2 x half>, <2 x half> addrspace(1)* %src2

				%src1.el1 = extractelement <2 x half> %src1.vec, i64 0
				%csrc1.el1 = fpext half %src1.el1 to float
				%src2.el1 = extractelement <2 x half> %src2.vec, i64 0
				%csrc2.el1 = fpext half %src2.el1 to float

				%src1.el2 = extractelement <2 x half> %src1.vec, i64 1
				%csrc1.el2 = fpext half %src1.el2 to float
				%src2.el2 = extractelement <2 x half> %src2.vec, i64 1
				%csrc2.el2 = fpext half %src2.el2 to float

				%mul2 = fmul float %csrc2.el2, %csrc1.el2
				%mul1 = fmul float %csrc1.el1, %csrc2.el1
				%acc = load float, float addrspace(1)* %dst, align 4
				%acc1 = fadd float %mul2, %acc
				%acc2 = fadd float %mul1, %acc1
				store float %acc2, float addrspace(1)* %dst, align 4
				ret void
				}

				; Tests to make sure dot product is not generated when the vectors are not of <2 x half>.
				; GCN-LABEL: {{^}}dotproduct_v4f16
				; GFX900: v_mad_mix_f32

				; GFX906: v_mad_f32
				; GFX906: v_mac_f32_e32

				; GFX906-UNSAFE: v_fma_mix_f32

				; GFX906-CONTRACT: v_fma_mix_f32
				; GFX906-DENORM-CONTRACT: v_fma_mix_f32
				define amdgpu_kernel void @dotproduct_v4f16(<4 x half> addrspace(1)* %src1,
				<4 x half> addrspace(1)* %src2,
				float addrspace(1)* nocapture %dst) {
				entry:
				%src1.vec = load <4 x half>, <4 x half> addrspace(1)* %src1
				%src2.vec = load <4 x half>, <4 x half> addrspace(1)* %src2

				%src1.el1 = extractelement <4 x half> %src1.vec, i64 0
				%csrc1.el1 = fpext half %src1.el1 to float
				%src2.el1 = extractelement <4 x half> %src2.vec, i64 0
				%csrc2.el1 = fpext half %src2.el1 to float

				%src1.el2 = extractelement <4 x half> %src1.vec, i64 1
				%csrc1.el2 = fpext half %src1.el2 to float
				%src2.el2 = extractelement <4 x half> %src2.vec, i64 1
				%csrc2.el2 = fpext half %src2.el2 to float

				%mul2 = fmul float %csrc1.el2, %csrc2.el2
				%mul1 = fmul float %csrc1.el1, %csrc2.el1
				%acc = load float, float addrspace(1)* %dst, align 4
				%acc1 = fadd float %mul2, %acc
				%acc2 = fadd float %mul1, %acc1
				store float %acc2, float addrspace(1)* %dst, align 4
				ret void
				}

				; GCN-LABEL: {{^}}NotAdotproduct
				; GFX900: v_mad_mix_f32
				; GCN900: v_mad_mix_f32

				; GFX906: v_mad_f32
				; GFX906: v_mac_f32_e32

				; GFX906-UNSAFE: v_fma_mix_f32

				; GFX906-CONTRACT: v_fma_mix_f32
				; GFX906-DENORM-CONTRACT: v_fma_mix_f32
				define amdgpu_kernel void @NotAdotproduct(<2 x half> addrspace(1)* %src1,
				<2 x half> addrspace(1)* %src2,
				float addrspace(1)* nocapture %dst) {
				entry:
				%src1.vec = load <2 x half>, <2 x half> addrspace(1)* %src1
				%src2.vec = load <2 x half>, <2 x half> addrspace(1)* %src2

				%src1.el1 = extractelement <2 x half> %src1.vec, i64 0
				%csrc1.el1 = fpext half %src1.el1 to float
				%src2.el1 = extractelement <2 x half> %src2.vec, i64 0
				%csrc2.el1 = fpext half %src2.el1 to float

				%src1.el2 = extractelement <2 x half> %src1.vec, i64 1
				%csrc1.el2 = fpext half %src1.el2 to float
				%src2.el2 = extractelement <2 x half> %src2.vec, i64 1
				%csrc2.el2 = fpext half %src2.el2 to float

				%mul2 = fmul float %csrc1.el2, %csrc1.el1
				%mul1 = fmul float %csrc2.el1, %csrc2.el2
				%acc = load float, float addrspace(1)* %dst, align 4
				%acc1 = fadd float %mul2, %acc
				%acc2 = fadd float %mul1, %acc1
				store float %acc2, float addrspace(1)* %dst, align 4
				ret void
				}

				; GCN-LABEL: {{^}}Diff_Idx_NotAdotproduct
				; GFX900: v_mad_mix_f32
				; GCN900: v_mad_mix_f32

				; GFX906: v_mad_f32
				; GFX906: v_mac_f32_e32

				; GFX906-UNSAFE: v_fma_mix_f32

				; GFX906-CONTRACT: v_fma_mix_f32
				; GFX906-DENORM-CONTRACT: v_fma_mix_f32
				define amdgpu_kernel void @Diff_Idx_NotAdotproduct(<2 x half> addrspace(1)* %src1,
				<2 x half> addrspace(1)* %src2,
				float addrspace(1)* nocapture %dst) {
				entry:
				%src1.vec = load <2 x half>, <2 x half> addrspace(1)* %src1
				%src2.vec = load <2 x half>, <2 x half> addrspace(1)* %src2

				%src1.el1 = extractelement <2 x half> %src1.vec, i64 0
				%csrc1.el1 = fpext half %src1.el1 to float
				%src2.el1 = extractelement <2 x half> %src2.vec, i64 0
				%csrc2.el1 = fpext half %src2.el1 to float

				%src1.el2 = extractelement <2 x half> %src1.vec, i64 1
				%csrc1.el2 = fpext half %src1.el2 to float
				%src2.el2 = extractelement <2 x half> %src2.vec, i64 1
				%csrc2.el2 = fpext half %src2.el2 to float

				%mul2 = fmul float %csrc1.el2, %csrc2.el1
				%mul1 = fmul float %csrc1.el1, %csrc2.el2
				%acc = load float, float addrspace(1)* %dst, align 4
				%acc1 = fadd float %mul2, %acc
				%acc2 = fadd float %mul1, %acc1
				store float %acc2, float addrspace(1)* %dst, align 4
				ret void
				}
				No newline at end of file

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Support a fdot2 pattern.
ClosedPublic
