This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Generate the correct sequence of code for FDIV32 when correctly-rounded-divide-sqrt is set
Closed · Public

Authored by cfang on Dec 10 2019, 11:21 AM.

Details

Reviewers

b-sumner
arsenm
kerbowa

Summary

As the name suggests, correctly-rounded-divide-sqrt requires the result of division/sqrt to be correctly rounded, and
thus we need to generate the correct sequence of code even when we flush the denormals.

Diff Detail

Event Timeline

cfang created this revision. Dec 10 2019, 11:21 AM

Herald added a project: Restricted Project. Dec 10 2019, 11:21 AM

Herald added subscribers: hiraditya, t-tye, tpr and 6 others.

This looks OK to me, although turning on correctly rounded division any time denorms are enabled is not actually required by OpenCL.

The attribute should not be directly checked (we probably shouldn’t even be putting it on the function). The proper thing to check is the fpmath metadata on the individual instruction. This isn’t propagated into the DAG, so AMDGPUCodeGenPrepare inserts intrinsic calls, which isn’t ideal.

This revision now requires changes to proceed. Dec 10 2019, 8:51 PM

In D71293#1778867, @arsenm wrote:

The attribute should not be directly checked (we probably shouldn’t even be putting it on the function). The proper thing to check is the fpmath metadata on the individual instruction. This isn’t propagated into the DAG, so AMDGPUCodeGenPrepare inserts intrinsic calls, which isn’t ideal.

So what's your suggestion here? The current logic in AMDGPUCodeGenPrepare is to find cases where we can insert the intrinsic to generate "Faster 2.5 ULP division that does not support denormals."
Otherwise SIISelLowering will lower FDIV32 based on UnsafeMath and Denorm support.

Do you want to change this logic to insert new intrinsics to generate the expected sequence of code for fdiv32?

Introduce an intrinsic in AMDGPUCodeGenPrepare to generate correctly rounded fdiv32.

arsenm added inline comments. Jan 9 2020, 2:03 PM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
568–569	The attribute should not be considered at all. Only the fpmath metadata matters. If -cl-fp32-correctly-rounded-divide-sqrt is specified, the regular fdiv instruction should behave correctly.
571–575	An intrinsic should only be introduced when the fdiv differs from the default FP environment. Here you are doing the opposite, and not even considering the denormal mode. You should be inhibiting the insertion of the fdiv.fast if denormals are enabled, not introducing a new intrinsic. You can also consider the afn fast flag and use that to ignore the denormal mode
628–631	There's no need to check the attribute
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll
245	The attribute should be removed

cfang marked 2 inline comments as done. Jan 10 2020, 9:52 AM

cfang added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
568–569	Do you mean that in AMDGPUCodeGenPrepare, we should check the fpmath metadata to keep regular fdiv (instead of an intrinsic) when -cl-fp32-correctly-rounded-divide-sqrt is specified? The issue is, when -cl-fp32-correctly-rounded-divide-sqrt is specified, a simple v_rcp is generated for a fdiv. Apparently the codegen produces the wrong sequence of code for a "regular" fdiv.

arsenm added inline comments. Jan 10 2020, 10:56 AM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
568–569	Yes. We shouldn't even have an IR attribute for this flag. The interpretation of the flag is entirely represented in the use of the !fpmath metadata.

Implement the rcp optimization for fdiv in AMDGPUCodeGenPrepare to insert the amdgcn_rcp intrinsic. For f32 type fdiv,
if fpmath metadata is unavailable, we cannot do the rcp optimization unless fast unsafe math is specified.

Herald added a subscriber: kerbowa. Jan 20 2020, 4:21 PM

The GlobalISel path should also be fixed, but that can be a follow up patch

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
585	Needs comment explaining the interaction between !fpmath requirements and denormals. This could use a chart of different fast math options, FP math and denormal handling and the expected lowering
603	I don't think just allow reciprocal is sufficient without either checking FPMath or afn. I think this needs to be something more like UnsafeFP || isFast || (allowReciprocal && (denormal hasLowAccuracy || approximateFunction))
612–614	It would be clearer to do something like bool NeedHighAccuracy = !FPMath || FPMath->getFPAccuracy() < 2.5
617	Typo metadat
618	It would be clearer to invert this, instead of the logic below relying on the double negative
619	FPMath should be checked once, and in relation to its value only. Checking for the lack of metadata here is imprecise
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7540	Based on the original problem, Flags.hasAllowReciprocal() isn't sufficient here. Without knowledge of !fpmath, this also needs approximate function
7542	Needs comment explaining why
8726	Braces
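
The accuracy check suggested for lines 612–614 can be modeled outside of LLVM. The following is only a sketch: `FPMathAccuracy` is a hypothetical stand-in for the !fpmath metadata on a single fdiv, where `std::nullopt` means no metadata is attached, and it is not the code that landed.

```cpp
#include <optional>

// Hypothetical model of !fpmath on one fdiv instruction:
// std::nullopt = no metadata, otherwise the permitted error in ULPs.
using FPMathAccuracy = std::optional<float>;

// Mirrors the suggested `!FPMath || accuracy < 2.5` expression: without
// metadata the default (full) precision is required.
constexpr bool needHighAccuracy(FPMathAccuracy Accuracy) {
  return !Accuracy.has_value() || *Accuracy < 2.5f;
}

static_assert(needHighAccuracy(std::nullopt), "no !fpmath: full precision");
static_assert(!needHighAccuracy(2.5f), "2.5 ULP permits the fast path");
```

Computing the flag once from the metadata value avoids testing separately for the absence of metadata later on.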

arsenm added a reviewer: kerbowa. Jan 20 2020, 7:49 PM

For GlobalISel, I'm not sure this should reproduce the same fix. We can more plausibly preserve the !fpmath in the gMIR and handle it the right way, instead of hacking around it in AMDGPUCodeGenPrepare. I think a few asserts and the verifier would need to be updated, but it should be possible to allow arbitrary MDNode operands on an instruction, similar to how implicit registers can be added. I think we should disallow implicit register operands on G_* instructions, and instead only allow implicit metadata arguments. The fdiv lowering can then do the right thing with the original !fpmath information

cfang marked 5 inline comments as done. Jan 21 2020, 8:33 AM

cfang added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
612–614	Is < 2.5 ulp the limiting factor that we can not do 1/x -> rcp(x) ?
619	Do you mean here we should check like this: (Ty->isFloatTy() && (HasFP32Denormals || NeedHighAccuracy)); where NeedHighAccuracy is checked like a previous comment?
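
As background for the 2.5 ulp threshold discussed here, one ULP is the spacing between adjacent representable floats. A minimal sketch of measuring the distance between two finite floats in those steps follows; the helper names `orderedBits` and `ulpDistance` are illustrative only and do not come from LLVM. Comparing a computed quotient against a reference that was itself rounded to float only approximates the true ULP error.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Map a float onto an integer line whose ordering matches the float ordering.
inline std::int64_t orderedBits(float X) {
  std::int32_t Bits;
  std::memcpy(&Bits, &X, sizeof(Bits));
  // IEEE-754 floats are sign-magnitude, so mirror the negative half.
  return Bits < 0 ? -static_cast<std::int64_t>(Bits & 0x7fffffff)
                  : static_cast<std::int64_t>(Bits);
}

// Number of representable-float steps separating A and B (finite inputs assumed).
inline std::int64_t ulpDistance(float A, float B) {
  std::int64_t D = orderedBits(A) - orderedBits(B);
  return D < 0 ? -D : D;
}

// Adjacent floats are exactly one step apart; +0.0 and -0.0 coincide.
inline void checkUlpDistance() {
  assert(ulpDistance(1.0f, std::nextafter(1.0f, 2.0f)) == 1);
  assert(ulpDistance(-0.0f, 0.0f) == 0);
}
```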

cfang marked 3 inline comments as done. Jan 21 2020, 9:33 AM

cfang added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
603	Can you explain what is exactly "denormal hasLowAccuracy" here?

Update based on feedback from the reviewer.

arsenm added inline comments. Jan 22 2020, 7:08 AM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
87–89	This should be redundant with the below logic
96–98	This should be fully captured by the logic above

arsenm added inline comments. Jan 22 2020, 7:13 AM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
600	UnsafeDiv is too imprecise here. This should explain in concrete terms why we need to insert the intrinsics and not just refer to the variable names. We need fdiv.fast when we only need 2.5 ULP and denormals are flushed
624	I think this should maybe be rephrased into RcpLegal and UseFDivFast
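
The condition stated for line 600 (fdiv.fast is wanted when only 2.5 ULP is needed and denormals are flushed) can be written as a single expression. This is a hedged sketch using a hypothetical `FDivInfo` struct in place of the pass state; the names follow the review comments, not necessarily the committed code.

```cpp
// Hypothetical summary of the inputs discussed in the review.
struct FDivInfo {
  bool HasFP32Denormals; // f32 denormals are enabled for the function
  bool NeedHighAccuracy; // no !fpmath, or !fpmath tighter than 2.5 ULP
};

// fdiv.fast provides 2.5 ULP and does not support denormals, so it is only
// appropriate when both of those limitations are acceptable.
constexpr bool useFDivFast(FDivInfo Info) {
  return !Info.HasFP32Denormals && !Info.NeedHighAccuracy;
}

static_assert(useFDivFast({false, false}), "flushed denormals, 2.5 ULP is enough");
static_assert(!useFDivFast({true, false}), "denormals enabled: keep the fdiv");
```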

Update based on the comments.

Rewrite the comments of the function visitFDiv;
Rename a few variables.

arsenm added inline comments. Jan 22 2020, 12:58 PM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
616	You can just initialize this below with the logical value instead of setting the value conditionally
640	I think this still isn't quite right. I think this should be (FMF.allowReciprocal() && ((!HasFP32Denormals && !NeedHighAccuracy) || FMF.approxFunc())). As is, this will allow reciprocal when denormals are flushed, but the higher fdiv precision is required, which was the case you were trying to fix in the first place
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7542	This still needs the denormal and type checks
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll
88	This should not have produced rcp since denormals are enabled and it doesn't have afn.
91–92	The name says high accuracy, but 5 ulp is lower accuracy. This didn't form rcp, but I think for the wrong reason
94–98	These two I think are OK because of afn

cfang marked 3 inline comments as done. Jan 22 2020, 2:29 PM

cfang added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
616	Thanks, Will do like that.
640	How could we handle fp16 and fp64? I think HasFP32Denormals only matters for fp32. Also, the issue I am working on seems not related to FMF.allowReciprocal() at all unless arcp is default.

arsenm added inline comments. Jan 22 2020, 2:36 PM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
640	Yes, this also needs to account for FP32denormals. RCP for f16 doesn't care about the fp16 denormal mode

update based on feedback.

Using arcp && (( no denormals && fpmath>=2.5) || afn)
update arcp related LIT tests.
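
The expression quoted in this update can be checked against the cases raised in the review with a small standalone model. This is a sketch under stated assumptions: `FDivFlags` is a hypothetical struct, and `FPMathULP == 0.0f` is used here to mean that no !fpmath metadata is present (default precision).

```cpp
// Hypothetical per-instruction inputs for an f32 fdiv.
struct FDivFlags {
  bool Arcp;             // allow-reciprocal fast-math flag
  bool Afn;              // approximate-functions fast-math flag
  bool HasFP32Denormals; // f32 denormals enabled
  float FPMathULP;       // permitted error in ULPs; 0.0f = no !fpmath (assumption)
};

// arcp && ((no denormals && fpmath >= 2.5) || afn)
constexpr bool rcpLegal(FDivFlags F) {
  return F.Arcp && ((!F.HasFP32Denormals && F.FPMathULP >= 2.5f) || F.Afn);
}

// The case the patch set out to fix: denormals flushed but default precision
// required, so arcp alone must not produce rcp.
static_assert(!rcpLegal({true, false, false, 0.0f}), "needs full precision");
// afn permits the approximation regardless of the denormal mode.
static_assert(rcpLegal({true, true, true, 0.0f}), "afn overrides");
```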

arsenm added inline comments. Jan 23 2020, 7:30 AM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
122	Set UseFDivFast once based on the logical expression below and never mutate it. UseFDivFast should be const
llvm/test/CodeGen/AMDGPU/fdiv.f16.ll
253 ↗	(On Diff #239724)	I don't know what ulp the f16 rcp instruction provides. This test change looks incomplete if there isn't already a case without !fpmath

arsenm added inline comments. Jan 23 2020, 7:42 AM

llvm/test/CodeGen/AMDGPU/fdiv.f16.ll
253 ↗	(On Diff #239724)	I found a document stating this provides "~0.5ulp", so I guess check that value for f16?

cfang marked an inline comment as done. Jan 23 2020, 11:47 AM

cfang added inline comments.

llvm/test/CodeGen/AMDGPU/fdiv.f16.ll
253 ↗	(On Diff #239724)	Currently the logic in DAG lowering does "1/x -> rcp(x)" for fp16 without checking fpmath accuracy. Actually it always does "1/x -> rcp(x)" for fp16 because v_rcp_f16 supports denormals. We need to revisit that logic in DAG lowering, but I would rather do that in a follow-up patch.

Update based on feedback:

const for UseFDivFast variable
Remove the added "!fpmath !0" for an arcp f16 test, because the current logic in DAG lowering generates the same code with/without !fpmath.

TODO (maybe in a follow-up patch): Change the accuracy threshold and apply the threshold to all types. We also need to revisit
the rcp logic in DAG lowering as long as the work in AMDGPUCodeGenPrepare is done.

LGTM

This revision is now accepted and ready to land. Jan 23 2020, 1:11 PM

commit 2531535984ad989ce88aeee23cb92a827da6686e
Author: Changpeng Fang <changpeng.fang@gmail.com>
Date: Thu Jan 23 16:57:43 2020 -0800

Revision Contents

Path	Size
llvm/include/llvm/IR/IntrinsicsAMDGPU.td	6 lines
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp	44 lines
llvm/lib/Target/AMDGPU/SIISelLowering.h	2 lines
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	25 lines
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll	27 lines
llvm/test/CodeGen/AMDGPU/fdiv.ll	65 lines

Diff 236698

llvm/include/llvm/IR/IntrinsicsAMDGPU.td

	Show All 12 Lines
 def int_amdgcn_unreachable : Intrinsic<[], [], [IntrConvergent]>;

 // Emit 2.5 ulp, no denormal division. Should only be inserted by
 // pass based on !fpmath metadata.
 def int_amdgcn_fdiv_fast : Intrinsic<
   [llvm_float_ty], [llvm_float_ty, llvm_float_ty],
   [IntrNoMem, IntrSpeculatable]
 >;
+
+// Emit correctly rounded fp32 divide and sqrt.
+def int_amdgcn_fdiv_rounded : Intrinsic<
+  [llvm_float_ty], [llvm_float_ty, llvm_float_ty],
+  [IntrNoMem, IntrSpeculatable]
+>;
 }
Context not available.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

	Show All 12 Lines
 class AMDGPUCodeGenPrepare : public FunctionPass,
                              public InstVisitor<AMDGPUCodeGenPrepare, bool> {
   const GCNSubtarget *ST = nullptr;
   AssumptionCache *AC = nullptr;
   LegacyDivergenceAnalysis *DA = nullptr;
   Module *Mod = nullptr;
   const DataLayout *DL = nullptr;
   bool HasUnsafeFPMath = false;
+  bool HasCorrectlyRoundedDivideSqrt = false;
   bool HasFP32Denormals = false;

   /// Copies exact/nsw/nuw flags (if any) from binary operation \p I to
   /// binary operation \p V.
   ///
   /// \returns Binary operation \p V.
   /// \returns \p T's base element bit width.
   unsigned getBaseElementBitWidth(const Type *T) const;

   /// \returns Equivalent 32 bit integer type for given type \p T. For example,
   /// if \p T is i7, then i32 is returned; if \p T is <3 x i12>, then <3 x i32>
   /// is returned.
   Type *getI32Ty(IRBuilder<> &B, const Type *T) const;

   /// \returns True if binary operation \p I is a signed binary operation, false
   /// otherwise.
   bool isSigned(const BinaryOperator &I) const;

   /// \returns True if the condition of 'select' operation \p I comes from a
   /// signed 'icmp' operation, false otherwise.
	Show All 12 Lines

   // Reciprocal f32 is handled separately without denormals.
   return HasDenormals ^ IsOne;
 }

 // Insert an intrinsic for fast fdiv for safe math situations where we can
 // reduce precision. Leave fdiv for situations where the generic node is
 // expected to be optimized.
+//
+// Also, insert an intrinsic for safe fdiv when
+// -cl-fp32-correctly-rounded-divide-sqrt is enabled.
 bool AMDGPUCodeGenPrepare::visitFDiv(BinaryOperator &FDiv) {
   Type *Ty = FDiv.getType();

   if (!Ty->getScalarType()->isFloatTy())
     return false;

-  MDNode *FPMath = FDiv.getMetadata(LLVMContext::MD_fpmath);
-  if (!FPMath)
-    return false;
-
   const FPMathOperator *FPOp = cast<const FPMathOperator>(&FDiv);
-  float ULP = FPOp->getFPAccuracy();
-  if (ULP < 2.5f)
-    return false;
-
   FastMathFlags FMF = FPOp->getFastMathFlags();
   bool UnsafeDiv = HasUnsafeFPMath || FMF.isFast() ||
                    FMF.allowReciprocal();

   // With UnsafeDiv node will be optimized to just rcp and mul.
   if (UnsafeDiv)
     return false;

+  bool SafeFast = true;
+  MDNode *FPMath = FDiv.getMetadata(LLVMContext::MD_fpmath);
+  if (!FPMath) {
+    // Insert an intrinsic when -cl-fp32-correctly-rounded-divide-sqrt
+    // is enabled.
+    if (!HasCorrectlyRoundedDivideSqrt)
+      return false;
+    SafeFast = false;
+  } else {
+    float ULP = FPOp->getFPAccuracy();
+    if (ULP < 2.5f)
+      return false;
+  }
+
   IRBuilder<> Builder(FDiv.getParent(), std::next(FDiv.getIterator()), FPMath);
   Builder.setFastMathFlags(FMF);
   Builder.SetCurrentDebugLocation(FDiv.getDebugLoc());

-  Function *Decl = Intrinsic::getDeclaration(Mod, Intrinsic::amdgcn_fdiv_fast);
+  unsigned IntrinsicOpc = SafeFast ? Intrinsic::amdgcn_fdiv_fast
+                                   : Intrinsic::amdgcn_fdiv_rounded;
+
+  Function *Decl = Intrinsic::getDeclaration(Mod, IntrinsicOpc);

   Value *Num = FDiv.getOperand(0);
   Value *Den = FDiv.getOperand(1);

   Value *NewFDiv = nullptr;

   if (VectorType *VT = dyn_cast<VectorType>(Ty)) {
     NewFDiv = UndefValue::get(VT);

     // FIXME: Doesn't do the right thing for cases where the vector is partially
     // constant. This works when the scalarizer pass is run first.
     for (unsigned I = 0, E = VT->getNumElements(); I != E; ++I) {
       Value *NumEltI = Builder.CreateExtractElement(Num, I);
       Value *DenEltI = Builder.CreateExtractElement(Den, I);
       Value *NewElt;

-      if (shouldKeepFDivF32(NumEltI, UnsafeDiv, HasFP32Denormals)) {
+      if (SafeFast && shouldKeepFDivF32(NumEltI, UnsafeDiv, HasFP32Denormals)) {
         NewElt = Builder.CreateFDiv(NumEltI, DenEltI);
       } else {
         NewElt = Builder.CreateCall(Decl, { NumEltI, DenEltI });
       }

       NewFDiv = Builder.CreateInsertElement(NewFDiv, NewElt, I);
     }
   } else {
-    if (!shouldKeepFDivF32(Num, UnsafeDiv, HasFP32Denormals))
+    if (!SafeFast || !shouldKeepFDivF32(Num, UnsafeDiv, HasFP32Denormals))
       NewFDiv = Builder.CreateCall(Decl, { Num, Den });
   }

   if (NewFDiv) {
     FDiv.replaceAllUsesWith(NewFDiv);
     NewFDiv->takeName(&FDiv);
     FDiv.eraseFromParent();
   }

   return !!NewFDiv;
 }

 static bool hasUnsafeFPMath(const Function &F) {
   Attribute Attr = F.getFnAttribute("unsafe-fp-math");
   return Attr.getValueAsString() == "true";
 }

+static bool hasCorrectlyRoundedDivideSqrt(const Function &F) {
+  Attribute Attr = F.getFnAttribute("correctly-rounded-divide-sqrt-fp-math");
+  return (Attr.getValueAsString() == "true");
+}
+
 static std::pair<Value*, Value*> getMul64(IRBuilder<> &Builder,
                                           Value *LHS, Value *RHS) {
   Type *I32Ty = Builder.getInt32Ty();
   Type *I64Ty = Builder.getInt64Ty();

   Value *LHS_EXT64 = Builder.CreateZExt(LHS, I64Ty);
   Value *RHS_EXT64 = Builder.CreateZExt(RHS, I64Ty);
   Value *MUL64 = Builder.CreateMul(LHS_EXT64, RHS_EXT64);
   Value *Lo = Builder.CreateTrunc(MUL64, I32Ty);
   Value *Hi = Builder.CreateLShr(MUL64, Builder.getInt64(32));
   Hi = Builder.CreateTrunc(Hi, I32Ty);
   return std::make_pair(Lo, Hi);
 }

 static Value* getMulHu(IRBuilder<> &Builder, Value *LHS, Value *RHS) {
   return getMul64(Builder, LHS, RHS).second;
	Show All 16 Lines
   if (!TPC)
     return false;

   const AMDGPUTargetMachine &TM = TPC->getTM<AMDGPUTargetMachine>();
   ST = &TM.getSubtarget<GCNSubtarget>(F);
   AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
   DA = &getAnalysis<LegacyDivergenceAnalysis>();
   HasUnsafeFPMath = hasUnsafeFPMath(F);
+  HasCorrectlyRoundedDivideSqrt = hasCorrectlyRoundedDivideSqrt(F);
   HasFP32Denormals = ST->hasFP32Denormals(F);

   bool MadeChange = false;

   for (BasicBlock &BB : F) {
     BasicBlock::iterator Next;
     for (BasicBlock::iterator I = BB.begin(), E = BB.end(); I != E; I = Next) {
       Next = std::next(I);
	Show All 12 Lines

llvm/lib/Target/AMDGPU/SIISelLowering.h

	Show All 12 Lines
   std::pair<SDValue, SDValue> splitBufferOffsets(SDValue Offset,
                                                  SelectionDAG &DAG) const;

   SDValue widenLoad(LoadSDNode *Ld, DAGCombinerInfo &DCI) const;
   SDValue LowerLOAD(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerSELECT(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerFastUnsafeFDIV(SDValue Op, SelectionDAG &DAG) const;
   SDValue lowerFDIV_FAST(SDValue Op, SelectionDAG &DAG) const;
+  SDValue lowerFDIV_ROUNDED(SDValue Op, SelectionDAG &DAG) const;
+  SDValue lowerFDIV_SAFE(SDValue LHS, SDValue RHS, SelectionDAG &DAG) const;
   SDValue LowerFDIV16(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFDIV32(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFDIV64(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerFDIV(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerINT_TO_FP(SDValue Op, SelectionDAG &DAG, bool Signed) const;
   SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerTrig(SDValue Op, SelectionDAG &DAG) const;
   SDValue LowerATOMIC_CMP_SWAP(SDValue Op, SelectionDAG &DAG) const;
	Show All 12 Lines

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

	Show All 12 Lines
	if (!parseCachePolicy(Op.getOperand(3), DAG, &GLC, nullptr,	if (!parseCachePolicy(Op.getOperand(3), DAG, &GLC, nullptr,
	IsGFX10 ? &DLC : nullptr))	IsGFX10 ? &DLC : nullptr))
	return Op;	return Op;
	return lowerSBuffer(VT, DL, Op.getOperand(1), Op.getOperand(2), GLC, DLC,	return lowerSBuffer(VT, DL, Op.getOperand(1), Op.getOperand(2), GLC, DLC,
	DAG);	DAG);
	}	}
	case Intrinsic::amdgcn_fdiv_fast:	case Intrinsic::amdgcn_fdiv_fast:
	return lowerFDIV_FAST(Op, DAG);	return lowerFDIV_FAST(Op, DAG);
		case Intrinsic::amdgcn_fdiv_rounded:
		return lowerFDIV_ROUNDED(Op, DAG);
	case Intrinsic::amdgcn_interp_p1_f16: {	case Intrinsic::amdgcn_interp_p1_f16: {
	SDValue ToM0 = DAG.getCopyToReg(DAG.getEntryNode(), DL, AMDGPU::M0,	SDValue ToM0 = DAG.getCopyToReg(DAG.getEntryNode(), DL, AMDGPU::M0,
	Op.getOperand(5), SDValue());	Op.getOperand(5), SDValue());
	if (getSubtarget()->getLDSBankCount() == 16) {	if (getSubtarget()->getLDSBankCount() == 16) {
	// 16 bank LDS	// 16 bank LDS

	// FIXME: This implicitly will insert a second CopyToReg to M0.	// FIXME: This implicitly will insert a second CopyToReg to M0.
	SDValue S = DAG.getNode(	SDValue S = DAG.getNode(
	ISD::INTRINSIC_WO_CHAIN, DL, MVT::f32,	ISD::INTRINSIC_WO_CHAIN, DL, MVT::f32,
	DAG.getTargetConstant(Intrinsic::amdgcn_interp_mov, DL, MVT::i32),	DAG.getTargetConstant(Intrinsic::amdgcn_interp_mov, DL, MVT::i32),
	DAG.getConstant(2, DL, MVT::i32), // P0	DAG.getConstant(2, DL, MVT::i32), // P0
	Op.getOperand(2), // Attrchan	Op.getOperand(2), // Attrchan
	Op.getOperand(3), // Attr	Op.getOperand(3), // Attr
	Op.getOperand(5)); // m0	Op.getOperand(5)); // m0

	SDValue Ops[] = {	SDValue Ops[] = {
	Op.getOperand(1), // Src0	Op.getOperand(1), // Src0
	Op.getOperand(2), // Attrchan	Op.getOperand(2), // Attrchan
	Op.getOperand(3), // Attr	Op.getOperand(3), // Attr
	DAG.getTargetConstant(0, DL, MVT::i32), // $src0_modifiers	DAG.getTargetConstant(0, DL, MVT::i32), // $src0_modifiers
		arsenm: Needs comment explaining why
		arsenm: Based on the original problem, Flags.hasAllowReciprocal() isn't sufficient here. Without knowledge of !fpmath, this also needs approximate function
		arsenm: This still needs the denormal and type checks
	Show All 12 Lines
	int DPDenormModeDefault = hasFP64FP16Denormals(DAG.getMachineFunction())	int DPDenormModeDefault = hasFP64FP16Denormals(DAG.getMachineFunction())
	? FP_DENORM_FLUSH_NONE	? FP_DENORM_FLUSH_NONE
	: FP_DENORM_FLUSH_IN_FLUSH_OUT;	: FP_DENORM_FLUSH_IN_FLUSH_OUT;

	int Mode = SPDenormMode | (DPDenormModeDefault << 2);	int Mode = SPDenormMode | (DPDenormModeDefault << 2);
	return DAG.getTargetConstant(Mode, SL, MVT::i32);	return DAG.getTargetConstant(Mode, SL, MVT::i32);
	}	}

	SDValue SITargetLowering::LowerFDIV32(SDValue Op, SelectionDAG &DAG) const {
	if (SDValue FastLowered = lowerFastUnsafeFDIV(Op, DAG))
	return FastLowered;

	SDLoc SL(Op);	SDValue SITargetLowering::lowerFDIV_SAFE(SDValue LHS, SDValue RHS, SelectionDAG &DAG) const {
	SDValue LHS = Op.getOperand(0);	SDLoc SL(RHS);
	SDValue RHS = Op.getOperand(1);

	const SDValue One = DAG.getConstantFP(1.0, SL, MVT::f32);	const SDValue One = DAG.getConstantFP(1.0, SL, MVT::f32);

	SDVTList ScaleVT = DAG.getVTList(MVT::f32, MVT::i1);	SDVTList ScaleVT = DAG.getVTList(MVT::f32, MVT::i1);

	SDValue DenominatorScaled = DAG.getNode(AMDGPUISD::DIV_SCALE, SL, ScaleVT,	SDValue DenominatorScaled = DAG.getNode(AMDGPUISD::DIV_SCALE, SL, ScaleVT,
	RHS, RHS, LHS);	RHS, RHS, LHS);
	SDValue NumeratorScaled = DAG.getNode(AMDGPUISD::DIV_SCALE, SL, ScaleVT,	SDValue NumeratorScaled = DAG.getNode(AMDGPUISD::DIV_SCALE, SL, ScaleVT,
	Show All 24 Lines

	SDValue Scale = NumeratorScaled.getValue(1);	SDValue Scale = NumeratorScaled.getValue(1);
	SDValue Fmas = DAG.getNode(AMDGPUISD::DIV_FMAS, SL, MVT::f32,	SDValue Fmas = DAG.getNode(AMDGPUISD::DIV_FMAS, SL, MVT::f32,
	Fma4, Fma1, Fma3, Scale);	Fma4, Fma1, Fma3, Scale);

	return DAG.getNode(AMDGPUISD::DIV_FIXUP, SL, MVT::f32, Fmas, RHS, LHS);	return DAG.getNode(AMDGPUISD::DIV_FIXUP, SL, MVT::f32, Fmas, RHS, LHS);
	}	}

		SDValue SITargetLowering::lowerFDIV_ROUNDED(SDValue Op, SelectionDAG &DAG) const {
		SDValue LHS = Op.getOperand(1);
		SDValue RHS = Op.getOperand(2);
		return lowerFDIV_SAFE(LHS, RHS, DAG);
		}

		SDValue SITargetLowering::LowerFDIV32(SDValue Op, SelectionDAG &DAG) const {
		if (SDValue FastLowered = lowerFastUnsafeFDIV(Op, DAG))
		return FastLowered;

		SDValue LHS = Op.getOperand(0);
		SDValue RHS = Op.getOperand(1);
		return lowerFDIV_SAFE(LHS, RHS, DAG);
		}

	SDValue SITargetLowering::LowerFDIV64(SDValue Op, SelectionDAG &DAG) const {	SDValue SITargetLowering::LowerFDIV64(SDValue Op, SelectionDAG &DAG) const {
	if (DAG.getTarget().Options.UnsafeFPMath)	if (DAG.getTarget().Options.UnsafeFPMath)
	return lowerFastUnsafeFDIV(Op, DAG);	return lowerFastUnsafeFDIV(Op, DAG);

	SDLoc SL(Op);	SDLoc SL(Op);
	SDValue X = Op.getOperand(0);	SDValue X = Op.getOperand(0);
	SDValue Y = Op.getOperand(1);	SDValue Y = Op.getOperand(1);

	const SDValue One = DAG.getConstantFP(1.0, SL, MVT::f64);	const SDValue One = DAG.getConstantFP(1.0, SL, MVT::f64);

	SDVTList ScaleVT = DAG.getVTList(MVT::f64, MVT::i1);	SDVTList ScaleVT = DAG.getVTList(MVT::f64, MVT::i1);

	SDValue DivScale0 = DAG.getNode(AMDGPUISD::DIV_SCALE, SL, ScaleVT, Y, Y, X);	SDValue DivScale0 = DAG.getNode(AMDGPUISD::DIV_SCALE, SL, ScaleVT, Y, Y, X);

	SDValue NegDivScale0 = DAG.getNode(ISD::FNEG, SL, MVT::f64, DivScale0);	SDValue NegDivScale0 = DAG.getNode(ISD::FNEG, SL, MVT::f64, DivScale0);

	SDValue Rcp = DAG.getNode(AMDGPUISD::RCP, SL, MVT::f64, DivScale0);	SDValue Rcp = DAG.getNode(AMDGPUISD::RCP, SL, MVT::f64, DivScale0);

	SDValue Fma0 = DAG.getNode(ISD::FMA, SL, MVT::f64, NegDivScale0, Rcp, One);	SDValue Fma0 = DAG.getNode(ISD::FMA, SL, MVT::f64, NegDivScale0, Rcp, One);

Context not available.
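
The f32 sequence that lowerFDIV_SAFE emits, and that the fdiv.ll checks pin down instruction by instruction, is one reciprocal estimate followed by a chain of fused multiply-adds. A minimal host-side sketch of that chain in plain C++ is given here. The names approx_rcp and refine_div are hypothetical, the deliberately perturbed reciprocal only stands in for v_rcp_f32, and the DIV_SCALE / DIV_FIXUP steps and the denormal mode switch are left out, so this illustrates the arithmetic rather than the actual lowering.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for v_rcp_f32: a reciprocal that is deliberately
// a little inaccurate, so the refinement steps have an error to correct.
static float approx_rcp(float d) {
  return (1.0f / d) * (1.0f + 1e-6f);
}

// Follows the order of the FMA chain checked in fdiv.ll (A..F, then the
// fused step performed by v_div_fmas_f32). Scaling and fixup are omitted.
static float refine_div(float n, float d) {
  assert(d != 0.0f && "sketch does not handle division by zero");
  float r    = approx_rcp(d);
  float fma0 = std::fma(-d, r, 1.0f);      // error of the reciprocal estimate
  float fma1 = std::fma(fma0, r, r);       // refined reciprocal
  float mul  = n * fma1;                   // first quotient estimate
  float fma2 = std::fma(-d, mul, n);       // remainder of that estimate
  float fma3 = std::fma(fma2, fma1, mul);  // refined quotient
  float fma4 = std::fma(-d, fma3, n);      // remainder of the refined quotient
  return std::fma(fma4, fma1, fma3);       // final correction
}
```

With this sketch, refine_div(1.0f, 3.0f) should land within about one ulp of 1.0f / 3.0f even though the starting estimate is off by roughly 1e-6 in relative terms. On the hardware the intermediate remainders can be denormal, which is why the checked sequence enables FP32 denormals around the FMA chain when they are otherwise flushed.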
		arsenm: Braces

llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll

	Show All 12 Lines
	store volatile float %fast.md.25ulp, float addrspace(1)* %out	store volatile float %fast.md.25ulp, float addrspace(1)* %out

	%arcp.md.25ulp = fdiv arcp float %a, %b, !fpmath !0	%arcp.md.25ulp = fdiv arcp float %a, %b, !fpmath !0
	store volatile float %arcp.md.25ulp, float addrspace(1)* %out	store volatile float %arcp.md.25ulp, float addrspace(1)* %out

	ret void	ret void
	}	}

		; CHECK-LABEL: @fdiv_rounded(
		; CHECK: %no.md = call float @llvm.amdgcn.fdiv.rounded(float %a, float %b)
		define amdgpu_kernel void @fdiv_rounded(float addrspace(1)* %out, float %a, float %b) #3 {
		%no.md = fdiv float %a, %b
		store volatile float %no.md, float addrspace(1)* %out

		ret void
		}

		; CHECK-LABEL: @fdiv_rounded_vector(
		; CHECK: %[[Anomd0:[0-9]+]] = extractelement <2 x float> %a, i64 0
		; CHECK: %[[Bnomd0:[0-9]+]] = extractelement <2 x float> %b, i64 0
		; CHECK: %[[FDIVnomd0:[0-9]+]] = call float @llvm.amdgcn.fdiv.rounded(float %[[Anomd0]], float %[[Bnomd0]])
		; CHECK: %[[INSnomd0:[0-9]+]] = insertelement <2 x float> undef, float %[[FDIVnomd0]], i64 0
		; CHECK: %[[Anomd1:[0-9]+]] = extractelement <2 x float> %a, i64 1
		; CHECK: %[[Bnomd1:[0-9]+]] = extractelement <2 x float> %b, i64 1
		; CHECK: %[[FDIVnomd1:[0-9]+]] = call float @llvm.amdgcn.fdiv.rounded(float %[[Anomd1]], float %[[Bnomd1]])
		; CHECK: %no.md = insertelement <2 x float> %[[INSnomd0]], float %[[FDIVnomd1]], i64 1
		define amdgpu_kernel void @fdiv_rounded_vector(<2 x float> addrspace(1)* %out, <2 x float> %a, <2 x float> %b) #3 {
		%no.md = fdiv <2 x float> %a, %b
		store volatile <2 x float> %no.md, <2 x float> addrspace(1)* %out

		ret void
		}


	attributes #0 = { nounwind optnone noinline }	attributes #0 = { nounwind optnone noinline }
	attributes #1 = { nounwind }	attributes #1 = { nounwind }
	attributes #2 = { nounwind "target-features"="+fp32-denormals" }	attributes #2 = { nounwind "target-features"="+fp32-denormals" }
		attributes #3 = { nounwind "correctly-rounded-divide-sqrt-fp-math"="true" }
		arsenm: The attribute should be removed

	; CHECK: !0 = !{float 2.500000e+00}	; CHECK: !0 = !{float 2.500000e+00}
	; CHECK: !1 = !{float 5.000000e-01}	; CHECK: !1 = !{float 5.000000e-01}
	; CHECK: !2 = !{float 1.000000e+00}	; CHECK: !2 = !{float 1.000000e+00}
	; CHECK: !3 = !{float 3.000000e+00}	; CHECK: !3 = !{float 3.000000e+00}

	!0 = !{float 2.500000e+00}	!0 = !{float 2.500000e+00}
	!1 = !{float 5.000000e-01}	!1 = !{float 5.000000e-01}
	!2 = !{float 1.000000e+00}	!2 = !{float 1.000000e+00}
	!3 = !{float 3.000000e+00}	!3 = !{float 3.000000e+00}
Context not available.

llvm/test/CodeGen/AMDGPU/fdiv.ll

	Show All 12 Lines
	; GCN: v_div_fixup_f32 v{{[0-9]+}}, [[FMAS]],	; GCN: v_div_fixup_f32 v{{[0-9]+}}, [[FMAS]],
	define amdgpu_kernel void @fdiv_f32_denormals(float addrspace(1)* %out, float %a, float %b) #2 {	define amdgpu_kernel void @fdiv_f32_denormals(float addrspace(1)* %out, float %a, float %b) #2 {
	entry:	entry:
	%fdiv = fdiv float %a, %b	%fdiv = fdiv float %a, %b
	store float %fdiv, float addrspace(1)* %out	store float %fdiv, float addrspace(1)* %out
	ret void	ret void
	}	}

		; FUNC-LABEL: {{^}}fdiv_f32_correctly_rounded_divide_sqrt:

		; GCN: v_div_scale_f32 [[NUM_SCALE:v[0-9]+]]
		; GCN-DAG: v_div_scale_f32 [[DEN_SCALE:v[0-9]+]]
		; GCN-DAG: v_rcp_f32_e32 [[NUM_RCP:v[0-9]+]], [[NUM_SCALE]]

		; PREGFX10: s_setreg_imm32_b32 hwreg(HW_REG_MODE, 4, 2), 3
		; GFX10: s_denorm_mode 15
		; GCN: v_fma_f32 [[A:v[0-9]+]], -[[NUM_SCALE]], [[NUM_RCP]], 1.0
		; GCN: v_fma_f32 [[B:v[0-9]+]], [[A]], [[NUM_RCP]], [[NUM_RCP]]
		; GCN: v_mul_f32_e32 [[C:v[0-9]+]], [[DEN_SCALE]], [[B]]
		; GCN: v_fma_f32 [[D:v[0-9]+]], -[[NUM_SCALE]], [[C]], [[DEN_SCALE]]
		; GCN: v_fma_f32 [[E:v[0-9]+]], [[D]], [[B]], [[C]]
		; GCN: v_fma_f32 [[F:v[0-9]+]], -[[NUM_SCALE]], [[E]], [[DEN_SCALE]]
		; PREGFX10: s_setreg_imm32_b32 hwreg(HW_REG_MODE, 4, 2), 0
		; GFX10: s_denorm_mode 12
		; GCN: v_div_fmas_f32 [[FMAS:v[0-9]+]], [[F]], [[B]], [[E]]
		; GCN: v_div_fixup_f32 v{{[0-9]+}}, [[FMAS]],

		define amdgpu_kernel void @fdiv_f32_correctly_rounded_divide_sqrt(float addrspace(1)* %out, float %a) #3 {
		entry:
		%fdiv = fdiv float 1.000000e+00, %a
		store float %fdiv, float addrspace(1)* %out
		ret void
		}


		; FUNC-LABEL: {{^}}fdiv_f32_denorms_correctly_rounded_divide_sqrt:

		; GCN: v_div_scale_f32 [[NUM_SCALE:v[0-9]+]]
		; GCN-DAG: v_rcp_f32_e32 [[NUM_RCP:v[0-9]+]], [[NUM_SCALE]]

		; PREGFX10-DAG: v_div_scale_f32 [[DEN_SCALE:v[0-9]+]]
		; PREGFX10-NOT: s_setreg
		; PREGFX10: v_fma_f32 [[A:v[0-9]+]], -[[NUM_SCALE]], [[NUM_RCP]], 1.0
		; PREGFX10: v_fma_f32 [[B:v[0-9]+]], [[A]], [[NUM_RCP]], [[NUM_RCP]]
		; PREGFX10: v_mul_f32_e32 [[C:v[0-9]+]], [[DEN_SCALE]], [[B]]
		; PREGFX10: v_fma_f32 [[D:v[0-9]+]], -[[NUM_SCALE]], [[C]], [[DEN_SCALE]]
		; PREGFX10: v_fma_f32 [[E:v[0-9]+]], [[D]], [[B]], [[C]]
		; PREGFX10: v_fma_f32 [[F:v[0-9]+]], -[[NUM_SCALE]], [[E]], [[DEN_SCALE]]
		; PREGFX10-NOT: s_setreg

		; GFX10-NOT: s_denorm_mode
		; GFX10: v_fma_f32 [[A:v[0-9]+]], -[[NUM_SCALE]], [[NUM_RCP]], 1.0
		; GFX10: v_fmac_f32_e32 [[B:v[0-9]+]], [[A]], [[NUM_RCP]]
		; GFX10: v_div_scale_f32 [[DEN_SCALE:v[0-9]+]]
		; GFX10: v_mul_f32_e32 [[C:v[0-9]+]], [[DEN_SCALE]], [[B]]
		; GFX10: v_fma_f32 [[D:v[0-9]+]], [[C]], -[[NUM_SCALE]], [[DEN_SCALE]]
		; GFX10: v_fmac_f32_e32 [[E:v[0-9]+]], [[D]], [[B]]
		; GFX10: v_fmac_f32_e64 [[F:v[0-9]+]], -[[NUM_SCALE]], [[E]]
		; GFX10-NOT: s_denorm_mode

		; GCN: v_div_fmas_f32 [[FMAS:v[0-9]+]], [[F]], [[B]], [[E]]
		; GCN: v_div_fixup_f32 v{{[0-9]+}}, [[FMAS]],
		define amdgpu_kernel void @fdiv_f32_denorms_correctly_rounded_divide_sqrt(float addrspace(1)* %out, float %a) #4 {
		entry:
		%fdiv = fdiv float 1.000000e+00, %a
		store float %fdiv, float addrspace(1)* %out
		ret void
		}


	; FUNC-LABEL: {{^}}fdiv_25ulp_f32:	; FUNC-LABEL: {{^}}fdiv_25ulp_f32:
	; GCN: v_cndmask_b32	; GCN: v_cndmask_b32
	; GCN: v_mul_f32	; GCN: v_mul_f32
	; GCN: v_rcp_f32	; GCN: v_rcp_f32
	; GCN: v_mul_f32	; GCN: v_mul_f32
	; GCN: v_mul_f32	; GCN: v_mul_f32
	define amdgpu_kernel void @fdiv_25ulp_f32(float addrspace(1)* %out, float %a, float %b) #0 {	define amdgpu_kernel void @fdiv_25ulp_f32(float addrspace(1)* %out, float %a, float %b) #0 {
	entry:	entry:
	Show All 24 Lines
	%result = fdiv arcp <4 x float> %a, %b	%result = fdiv arcp <4 x float> %a, %b
	store <4 x float> %result, <4 x float> addrspace(1)* %out	store <4 x float> %result, <4 x float> addrspace(1)* %out
	ret void	ret void
	}	}

	attributes #0 = { nounwind "enable-unsafe-fp-math"="false" "target-features"="-fp32-denormals,+fp64-fp16-denormals,-flat-for-global" }	attributes #0 = { nounwind "enable-unsafe-fp-math"="false" "target-features"="-fp32-denormals,+fp64-fp16-denormals,-flat-for-global" }
	attributes #1 = { nounwind "enable-unsafe-fp-math"="true" "target-features"="-fp32-denormals,-flat-for-global" }	attributes #1 = { nounwind "enable-unsafe-fp-math"="true" "target-features"="-fp32-denormals,-flat-for-global" }
	attributes #2 = { nounwind "enable-unsafe-fp-math"="false" "target-features"="+fp32-denormals,-flat-for-global" }	attributes #2 = { nounwind "enable-unsafe-fp-math"="false" "target-features"="+fp32-denormals,-flat-for-global" }
		attributes #3 = { nounwind "correctly-rounded-divide-sqrt-fp-math"="true" "target-features"="-fp32-denormals,-flat-for-global" }
		attributes #4 = { nounwind "correctly-rounded-divide-sqrt-fp-math"="true" "target-features"="+fp32-denormals,-flat-for-global" }


	!0 = !{float 2.500000e+00}	!0 = !{float 2.500000e+00}
Context not available.
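
The s_denorm_mode immediates 15 and 12 in the GFX10 checks follow from the packing shown in SIISelLowering.cpp, where the mode is SPDenormMode | (DPDenormModeDefault << 2). A small sketch of that arithmetic is given here. The numeric enum values are an assumption (they are not shown in this diff) and denormMode is a hypothetical helper name.

```cpp
#include <cstdint>

// Two-bit denormal field values. Assumed here, not taken from this diff:
// 0 flushes inputs and outputs, 3 flushes nothing.
enum : uint32_t {
  FP_DENORM_FLUSH_IN_FLUSH_OUT = 0,
  FP_DENORM_FLUSH_NONE = 3,
};

// Packs the single-precision field into bits [1:0] and the
// double/half-precision field into bits [3:2].
constexpr uint32_t denormMode(uint32_t sp, uint32_t dp) {
  return sp | (dp << 2);
}

// FP32 denormals switched on around the FMA chain, FP64/FP16 left enabled.
static_assert(denormMode(FP_DENORM_FLUSH_NONE, FP_DENORM_FLUSH_NONE) == 15,
              "enable value");
// FP32 denormals flushed again afterwards, FP64/FP16 still enabled.
static_assert(denormMode(FP_DENORM_FLUSH_IN_FLUSH_OUT, FP_DENORM_FLUSH_NONE) == 12,
              "restore value");
```

Under the same assumption, the pre-GFX10 checks write only the two-bit FP32 field through s_setreg_imm32_b32 hwreg(HW_REG_MODE, 4, 2), which is consistent with the immediates 3 (enable) and 0 (restore) seen there.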

Revision Contents

Diff 236698

llvm/include/llvm/IR/IntrinsicsAMDGPU.td

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

llvm/lib/Target/AMDGPU/SIISelLowering.h

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll

llvm/test/CodeGen/AMDGPU/fdiv.ll
