This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Generate the correct sequence of code for FDIV32 when correctly-rounded-divide-sqrt is set
ClosedPublic

Authored by cfang on Dec 10 2019, 11:21 AM.


Details

Reviewers

b-sumner
arsenm
kerbowa

Summary

As the name suggests, correctly-rounded-divide-sqrt requires the results of division and sqrt to be correctly rounded,
so we need to generate the correct code sequence even when denormals are flushed.

Diff Detail

Event Timeline

cfang created this revision.Dec 10 2019, 11:21 AM

Herald added a project: Restricted Project.Dec 10 2019, 11:21 AM

Herald added subscribers: hiraditya, t-tye, tpr and 6 others.

This looks OK to me, although turning on correctly rounded division any time denorms are enabled is not actually required by OpenCL.

The attribute should not be directly checked (we probably shouldn’t even be putting it on the function). The proper thing to check is the fpmath metadata on the individual instruction. This isn’t propagated into the DAG, so AMDGPUCodeGenPrepare inserts intrinsic calls, which isn’t ideal.

This revision now requires changes to proceed.Dec 10 2019, 8:51 PM

In D71293#1778867, @arsenm wrote:

The attribute should not be directly checked (we probably shouldn’t even be putting it on the function). The proper thing to check is the fpmath metadata on the individual instruction. This isn’t propagated into the DAG, so AMDGPUCodeGenPrepare inserts intrinsic calls, which isn’t ideal.

So what's your suggestion here? The current logic in AMDGPUCodeGenPrepare is to find cases where we can insert the intrinsic to generate "Faster 2.5 ULP division that does not support denormals."
Otherwise SIISelLowering will lower FDIV32 based on UnsafeMath and denormal support.

Do you want to change this logic to insert new intrinsics to generate the expected sequence of code for fdiv32?

Introduce an intrinsic in AMDGPUCodeGenPrepare to generate correctly rounded fdiv32.

arsenm added inline comments.Jan 9 2020, 2:03 PM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
568–569	The attribute should not be considered at all. Only the fpmath metadata matters. If -cl-fp32-correctly-rounded-divide-sqrt is specified, the regular fdiv instruction should behave correctly.
571–575	An intrinsic should only be introduced when the fdiv differs from the default FP environment. Here you are doing the opposite, and not even considering the denormal mode. You should be inhibiting the insertion of the fdiv.fast if denormals are enabled, not introducing a new intrinsic. You can also consider the afn fast flag and use that to ignore the denormal mode
628–631	There's no need to check the attribute
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll
245	The attribute should be removed

cfang marked 2 inline comments as done.Jan 10 2020, 9:52 AM

cfang added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
568–569	Do you mean that in AMDGPUCodeGenPrepare, we should check the fpmath metadata to keep regular fdiv (instead of an intrinsic) when -cl-fp32-correctly-rounded-divide-sqrt is specified? The issue is, when -cl-fp32-correctly-rounded-divide-sqrt is specified, a simple v_rcp is generated for a fdiv. Apparently the codegen produces the wrong sequence of code for a "regular" fdiv.

arsenm added inline comments.Jan 10 2020, 10:56 AM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
568–569	Yes. We shouldn't even have an IR attribute for this flag. The interpretation of the flag is entirely represented in the use of the !fpmath metadata.

Implement rcp optimization for fdiv in AMDGPUCodeGenPrepare to insert the amdgcn_rcp intrinsic. For f32 type fdiv,
if fpmath metadata is unavailable, we cannot do the rcp optimization unless fast unsafe math is specified.

Herald added a subscriber: kerbowa.Jan 20 2020, 4:21 PM

The GlobalISel path should also be fixed, but that can be a follow up patch

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
595	Needs comment explaining the interaction between !fpmath requirements and denormals. This could use a chart of different fast math options, FP math and denormal handling and the expected lowering
626–631	I don't think just allow reciprocal is sufficient without either checking FPMath or afn. I think this needs to be something more like UnsafeFP || isFast || (allowReciprocal && (denormal hasLowAccuracy || approximateFunction))
637	Typo metadat
638	It would be clearer to invert this, instead of the logic below relying on the double negative
638–640	It would be clearer to do something like bool NeedHighAccuracy = !FPMath || FPMath->getFPAccuracy() < 2.5
639	FPMath should be checked once, and in relation to its value only. Checking for the lack of metadata here is imprecise
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7540	Based on the original problem, Flags.hasAllowReciprocal() isn't sufficient here. Without knowledge of !fpmath, this also needs approximate function
7542	Needs comment explaining why
8726	Braces
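The NeedHighAccuracy suggestion in the inline comments above can be illustrated without any LLVM dependency. This is a minimal sketch, assuming the `!fpmath` accuracy is modeled as a `std::optional<float>`; the alias `FPMathULP` and the function `needHighAccuracy` are hypothetical names, not part of the patch.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the !fpmath metadata on one fdiv: the maximum
// permitted error in ULPs, or std::nullopt when no metadata is attached.
using FPMathULP = std::optional<float>;

// An fdiv without !fpmath must keep the default precision, so it is treated
// the same way as an explicit request tighter than 2.5 ULP.
inline bool needHighAccuracy(FPMathULP Accuracy) {
  // A negative error bound would be meaningless in this model.
  assert(!Accuracy.has_value() || *Accuracy >= 0.0f);
  return !Accuracy.has_value() || *Accuracy < 2.5f;
}
```

With this model, the metadata is consulted once and every later decision reads the single boolean, which is the shape the reviewer asks for.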

arsenm added a reviewer: kerbowa.Jan 20 2020, 7:49 PM

For GlobalISel, I'm not sure this should reproduce the same fix. We can more plausibly preserve the !fpmath in the gMIR and handle it the right way, instead of hacking around it in AMDGPUCodeGenPrepare. I think a few asserts and the verifier would need to be updated, but it should be possible to allow arbitrary MDNode operands on an instruction, similar to how implicit registers can be added. I think we should disallow implicit register operands on G_* instructions, and instead only allow implicit metadata arguments. The fdiv lowering can then do the right thing with the original !fpmath information

cfang marked 5 inline comments as done.Jan 21 2020, 8:33 AM

cfang added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
638–640	Is < 2.5 ulp the limiting factor that we can not do 1/x -> rcp(x) ?
639	Do you mean here we should check like this: (Ty->isFloatTy() && (HasFP32Denormals || NeedHighAccuracy)); where NeedHighAccuracy is checked like a previous comment?
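For readers unfamiliar with the 2.5 ULP threshold being discussed, the following sketch shows one common way to compare an approximate f32 result against a reference result in units in the last place. It is an illustration only: the helper names are hypothetical, and it counts representable floats between two values, which approximates but is not identical to the error measured against the exact real quotient.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Map a float's bit pattern onto an ordered integer line so that adjacent
// finite floats differ by exactly 1 (and -0.0 and +0.0 both map to 0).
inline int64_t orderedBits(float X) {
  int32_t Bits;
  std::memcpy(&Bits, &X, sizeof(Bits));
  // IEEE floats are sign-magnitude, so negative values are mirrored.
  return Bits < 0 ? -static_cast<int64_t>(Bits & 0x7fffffff)
                  : static_cast<int64_t>(Bits);
}

// Number of representable floats between A and B (both must be finite).
inline int64_t ulpDistance(float A, float B) {
  assert(std::isfinite(A) && std::isfinite(B));
  const int64_t D = orderedBits(A) - orderedBits(B);
  return D < 0 ? -D : D;
}

// True when Approx lies within MaxULP representable steps of Reference.
inline bool withinULP(float Approx, float Reference, float MaxULP) {
  return static_cast<float>(ulpDistance(Approx, Reference)) <= MaxULP;
}
```

A call such as `withinULP(approxQuotient, a / b, 2.5f)` would then express the relaxed requirement, while a correctly rounded division corresponds to a distance of 0 from the reference.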

cfang marked 3 inline comments as done.Jan 21 2020, 9:33 AM

cfang added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
626–631	Can you explain what is exactly "denormal hasLowAccuracy" here?

Update based on feedback from the reviewer.

arsenm added inline comments.Jan 22 2020, 7:08 AM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
87–89	This should be redundant with the below logic
96–98	This should be fully captured by the logic above

arsenm added inline comments.Jan 22 2020, 7:13 AM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
600	UnsafeDiv is too imprecise here. This should explain in concrete terms why we need to insert the intrinsics and not just refer to the variable names. We need fdiv.fast when we only need 2.5 ULP and denormals are flushed
624	I think this should maybe be rephrased into RcpLegal and UseFDivFast

Update based on the comments.

Rewrite the comments of the function visitFDiv;
Rename a few variables.

arsenm added inline comments.Jan 22 2020, 12:58 PM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
616	You can just initialize this below with the logical value instead of setting the value conditionally
630	I think this still isn't quite right. I think this should be (FMF.allowReciprocal() && ((!HasFP32Denormals && !NeedHighAccuracy) || FMF.approxFunc())). As is, this will allow reciprocal when denormals are flushed, but the higher fdiv precision is required, which was the case you were trying to fix in the first place
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7543	This still needs the denormal and type checks
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll
88	This should not have produced rcp since denormals are enabled and it doesn't have afn.
91–92	The name says high accuracy, but 5 ulp is lower accuracy. This didn't form rcp, but I think for the wrong reason
94–98	These two I think are OK because of afn

cfang marked 3 inline comments as done.Jan 22 2020, 2:29 PM

cfang added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
616	Thanks, Will do like that.
630	How could we handle fp16 and fp64? I think HasFP32Denormals only matters for fp32. Also, the issue I am working on seems unrelated to FMF.allowReciprocal() unless arcp is the default.

arsenm added inline comments.Jan 22 2020, 2:36 PM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
630	Yes, this also needs to account for FP32denormals. RCP for f16 doesn't care about the fp16 denormal mode

update based on feedback.

Using arcp && (( no denormals && fpmath>=2.5) || afn)
update arcp related LIT tests.
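The condition described in this update can be written as a small, LLVM-free predicate. This is a sketch under stated assumptions: the struct `FDivContext` and both function names are hypothetical, and the flags are modeled as independent booleans even though in LLVM IR `fast` implies the individual flags.

```cpp
// Hypothetical model of the inputs that visitFDiv inspects for one fdiv.
struct FDivContext {
  bool UnsafeFPMath;     // unsafe-fp-math is enabled
  bool Fast;             // the fdiv carries the `fast` flag
  bool Arcp;             // the fdiv carries the `arcp` flag
  bool Afn;              // the fdiv carries the `afn` flag
  bool HasFP32Denormals; // f32 denormals are enabled (not flushed)
  bool NeedHighAccuracy; // no !fpmath, or !fpmath below 2.5 ULP
  bool IsF32;            // the scalar type is float
};

// arcp && ((no denormals && fpmath >= 2.5) || afn), plus the global
// unsafe-fp-math and `fast` overrides.
constexpr bool fastUnsafeRcpLegal(const FDivContext &C) {
  return C.UnsafeFPMath || C.Fast ||
         (C.Arcp && ((!C.HasFP32Denormals && !C.NeedHighAccuracy) || C.Afn));
}

// fdiv.fast is only considered for f32 with relaxed accuracy when the
// reciprocal rewrite is not already legal.
constexpr bool useFDivFast(const FDivContext &C) {
  return C.IsF32 && !C.NeedHighAccuracy && !fastUnsafeRcpLegal(C);
}
```

The case that motivated the patch is `arcp` with flushed denormals but no relaxed `!fpmath`: here `NeedHighAccuracy` is true, so the predicate rejects the reciprocal unless `afn` is also present.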

arsenm added inline comments.Jan 23 2020, 7:30 AM

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
122	Set UseFDivFast once based on the logical expression below and never mutate it. UseFDivFast should be const
llvm/test/CodeGen/AMDGPU/fdiv.f16.ll
253 ↗	(On Diff #239724)	I don't know what ulp the f16 rcp instruction provides. This test change looks incomplete if there isn't already a case without !fpmath

arsenm added inline comments.Jan 23 2020, 7:42 AM

llvm/test/CodeGen/AMDGPU/fdiv.f16.ll
253 ↗	(On Diff #239724)	I found a document stating this provides "~0.5ulp", so I guess check that value for f16?

cfang marked an inline comment as done.Jan 23 2020, 11:47 AM

cfang added inline comments.

llvm/test/CodeGen/AMDGPU/fdiv.f16.ll
253 ↗	(On Diff #239724)	Currently the logic in DAG lowering does "1/x -> rcp(x)" for fp16 without checking fpmath accuracy. Actually it always does "1/x -> rcp(x)" for fp16 because v_rcp_f16 supports denormals. We need to revisit that logic in DAG lowering. But I would rather do that in a follow-up patch.

Update based on feedback:

const for UseFDivFast variable
Remove the added "!fpmath !0" for an arcp f16 test, because the current logic in DAG lowering generates the same code with/without !fpmath.

TODO (maybe in a follow-up patch): Change the accuracy threshold and apply the threshold to all types. Also need to revisit
the rcp logic in DAG lowering once the work in AMDGPUCodeGenPrepare is done.

LGTM

This revision is now accepted and ready to land.Jan 23 2020, 1:11 PM

commit 2531535984ad989ce88aeee23cb92a827da6686e
Author: Changpeng Fang <changpeng.fang@gmail.com>
Date: Thu Jan 23 16:57:43 2020 -0800

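Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp	142 lines
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	78 lines
llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll	198 lines
llvm/test/CodeGen/AMDGPU/fdiv.ll	62 lines
llvm/test/CodeGen/AMDGPU/fdiv32-to-rcp-folding.ll	64 lines
llvm/test/CodeGen/AMDGPU/fneg-combines.ll	22 lines
llvm/test/CodeGen/AMDGPU/known-never-snan.ll	24 lines
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rcp.ll	9 lines
llvm/test/CodeGen/AMDGPU/mul24-pass-ordering.ll	4 lines
llvm/test/CodeGen/AMDGPU/rcp-pattern.ll	10 lines
llvm/test/CodeGen/AMDGPU/rcp_iflag.ll	6 lines
llvm/test/CodeGen/AMDGPU/rsq.ll	22 lines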

Diff 239963

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

	Show All 12 Lines
	Value *NewVal = insertValues(Builder, Ty, ResultVals);			Value *NewVal = insertValues(Builder, Ty, ResultVals);
	NewVal->takeName(&I);			NewVal->takeName(&I);
	I.replaceAllUsesWith(NewVal);			I.replaceAllUsesWith(NewVal);
	I.eraseFromParent();			I.eraseFromParent();

	return true;			return true;
	}			}

	static bool shouldKeepFDivF32(Value *Num, bool UnsafeDiv, bool HasDenormals) {			// Perform RCP optimizations:
				//
				// 1/x -> rcp(x) when fast unsafe rcp is legal or fpmath >= 2.5ULP with
				// denormals flushed.
				//
				// a/b -> a*rcp(b) when fast unsafe rcp is legal.
				static Value *performRCPOpt(Value *Num, Value *Den, bool FastUnsafeRcpLegal,
				IRBuilder<> Builder, MDNode *FPMath, Module *Mod,
				bool HasDenormals, bool NeedHighAccuracy) {

				Type *Ty = Den->getType();
				if (!FastUnsafeRcpLegal && Ty->isFloatTy() &&
				(HasDenormals || NeedHighAccuracy))
				return nullptr;

				Function *Decl = Intrinsic::getDeclaration(Mod, Intrinsic::amdgcn_rcp, Ty);
				if (const ConstantFP *CLHS = dyn_cast<ConstantFP>(Num)) {
				if (FastUnsafeRcpLegal || Ty->isFloatTy() || Ty->isHalfTy()) {
				if (CLHS->isExactlyValue(1.0)) {
				// v_rcp_f32 and v_rsq_f32 do not support denormals, and according to
				// the CI documentation has a worst case error of 1 ulp.
				// OpenCL requires <= 2.5 ulp for 1.0 / x, so it should always be OK to
				// use it as long as we aren't trying to use denormals.
				//
				// v_rcp_f16 and v_rsq_f16 DO support denormals.

				// NOTE: v_sqrt and v_rcp will be combined to v_rsq later. So we don't
				// insert rsq intrinsic here.

				// 1.0 / x -> rcp(x)
				return Builder.CreateCall(Decl, { Den });
				}

				// Same as for 1.0, but expand the sign out of the constant.
				if (CLHS->isExactlyValue(-1.0)) {
				// -1.0 / x -> rcp (fneg x)
				Value *FNeg = Builder.CreateFNeg(Den);
				return Builder.CreateCall(Decl, { FNeg });
				}
				}
				}

				if (FastUnsafeRcpLegal) {
				// Turn into multiply by the reciprocal.
				// x / y -> x * (1.0 / y)
				Value *Recip = Builder.CreateCall(Decl, { Den });
				return Builder.CreateFMul(Num, Recip, "", FPMath);
				}
				return nullptr;
				}

				static bool shouldKeepFDivF32(Value *Num, bool FastUnsafeRcpLegal,
				bool HasDenormals) {
	const ConstantFP *CNum = dyn_cast<ConstantFP>(Num);			const ConstantFP *CNum = dyn_cast<ConstantFP>(Num);
	if (!CNum)			if (!CNum)
	return HasDenormals;			return HasDenormals;

	if (UnsafeDiv)			if (FastUnsafeRcpLegal)
	return true;			return true;

	bool IsOne = CNum->isExactlyValue(+1.0) || CNum->isExactlyValue(-1.0);			bool IsOne = CNum->isExactlyValue(+1.0) || CNum->isExactlyValue(-1.0);

	// Reciprocal f32 is handled separately without denormals.			// Reciprocal f32 is handled separately without denormals.
	return HasDenormals ^ IsOne;			return HasDenormals ^ IsOne;
	}			}

	// Insert an intrinsic for fast fdiv for safe math situations where we can
	// reduce precision. Leave fdiv for situations where the generic node is			// Optimizations is performed based on fpmath, fast math flags as wells as
	// expected to be optimized.			// denormals to lower fdiv using either rcp or fdiv.fast.
				//
				// FastUnsafeRcpLegal: We determine whether it is legal to use rcp based on
				// unsafe-fp-math, fast math flags, denormals and fpmath
				// accuracy request.
				//
				// RCP Optimizations:
				// 1/x -> rcp(x) when fast unsafe rcp is legal or fpmath >= 2.5ULP with
				// denormals flushed.
				// a/b -> a*rcp(b) when fast unsafe rcp is legal.
				//
				// Use fdiv.fast:
				// a/b -> fdiv.fast(a, b) when RCP optimization is not performed and
				// fpmath >= 2.5ULP with denormals flushed.
				//
				// 1/x -> fdiv.fast(1,x) when RCP optimization is not performed and
				// fpmath >= 2.5ULP with denormals.
	bool AMDGPUCodeGenPrepare::visitFDiv(BinaryOperator &FDiv) {			bool AMDGPUCodeGenPrepare::visitFDiv(BinaryOperator &FDiv) {
	Type *Ty = FDiv.getType();

	if (!Ty->getScalarType()->isFloatTy())			Type *Ty = FDiv.getType()->getScalarType();
	return false;

	MDNode *FPMath = FDiv.getMetadata(LLVMContext::MD_fpmath);			// No intrinsic for fdiv16 if target does not support f16.
	if (!FPMath)			if (Ty->isHalfTy() && !ST->has16BitInsts())
	return false;			return false;

	const FPMathOperator *FPOp = cast<const FPMathOperator>(&FDiv);			const FPMathOperator *FPOp = cast<const FPMathOperator>(&FDiv);
	float ULP = FPOp->getFPAccuracy();			MDNode *FPMath = FDiv.getMetadata(LLVMContext::MD_fpmath);
	if (ULP < 2.5f)			const bool NeedHighAccuracy = !FPMath || FPOp->getFPAccuracy() < 2.5f;
	return false;

	FastMathFlags FMF = FPOp->getFastMathFlags();			FastMathFlags FMF = FPOp->getFastMathFlags();
	bool UnsafeDiv = HasUnsafeFPMath || FMF.isFast() ||			// Determine whether it is ok to use rcp based on unsafe-fp-math,
	FMF.allowReciprocal();			// fast math flags, denormals and accuracy request.
				const bool FastUnsafeRcpLegal = HasUnsafeFPMath || FMF.isFast() ||
				(FMF.allowReciprocal() && ((!HasFP32Denormals && !NeedHighAccuracy)
				|| FMF.approxFunc()));

	// With UnsafeDiv node will be optimized to just rcp and mul.			// Use fdiv.fast for only f32, fpmath >= 2.5ULP and rcp is not used.
	if (UnsafeDiv)			const bool UseFDivFast = Ty->isFloatTy() && !NeedHighAccuracy &&
	return false;			!FastUnsafeRcpLegal;

	IRBuilder<> Builder(FDiv.getParent(), std::next(FDiv.getIterator()), FPMath);			IRBuilder<> Builder(FDiv.getParent(), std::next(FDiv.getIterator()));
	Builder.setFastMathFlags(FMF);			Builder.setFastMathFlags(FMF);
	Builder.SetCurrentDebugLocation(FDiv.getDebugLoc());			Builder.SetCurrentDebugLocation(FDiv.getDebugLoc());

	Function *Decl = Intrinsic::getDeclaration(Mod, Intrinsic::amdgcn_fdiv_fast);

	Value *Num = FDiv.getOperand(0);			Value *Num = FDiv.getOperand(0);
	Value *Den = FDiv.getOperand(1);			Value *Den = FDiv.getOperand(1);

	Value *NewFDiv = nullptr;			Value *NewFDiv = nullptr;
				if (VectorType *VT = dyn_cast<VectorType>(FDiv.getType())) {
	if (VectorType *VT = dyn_cast<VectorType>(Ty)) {
	NewFDiv = UndefValue::get(VT);			NewFDiv = UndefValue::get(VT);

	// FIXME: Doesn't do the right thing for cases where the vector is partially			// FIXME: Doesn't do the right thing for cases where the vector is partially
	// constant. This works when the scalarizer pass is run first.			// constant. This works when the scalarizer pass is run first.
	for (unsigned I = 0, E = VT->getNumElements(); I != E; ++I) {			for (unsigned I = 0, E = VT->getNumElements(); I != E; ++I) {
	Value *NumEltI = Builder.CreateExtractElement(Num, I);			Value *NumEltI = Builder.CreateExtractElement(Num, I);
	Value *DenEltI = Builder.CreateExtractElement(Den, I);			Value *DenEltI = Builder.CreateExtractElement(Den, I);
	Value *NewElt;			Value *NewElt = nullptr;
				if (UseFDivFast && !shouldKeepFDivF32(NumEltI, FastUnsafeRcpLegal,
	if (shouldKeepFDivF32(NumEltI, UnsafeDiv, HasFP32Denormals)) {			HasFP32Denormals)) {
	NewElt = Builder.CreateFDiv(NumEltI, DenEltI);			Function *Decl =
	} else {			Intrinsic::getDeclaration(Mod, Intrinsic::amdgcn_fdiv_fast);
	NewElt = Builder.CreateCall(Decl, { NumEltI, DenEltI });			NewElt = Builder.CreateCall(Decl, { NumEltI, DenEltI }, "", FPMath);
	}			}
				if (!NewElt) // Try rcp.
				NewElt = performRCPOpt(NumEltI, DenEltI, FastUnsafeRcpLegal, Builder,
				FPMath, Mod, HasFP32Denormals, NeedHighAccuracy);
				if (!NewElt)
				NewElt = Builder.CreateFDiv(NumEltI, DenEltI, "", FPMath);

	NewFDiv = Builder.CreateInsertElement(NewFDiv, NewElt, I);			NewFDiv = Builder.CreateInsertElement(NewFDiv, NewElt, I);
	}			}
	} else {			} else { // Scalar.
	if (!shouldKeepFDivF32(Num, UnsafeDiv, HasFP32Denormals))			if (UseFDivFast && !shouldKeepFDivF32(Num, FastUnsafeRcpLegal,
	NewFDiv = Builder.CreateCall(Decl, { Num, Den });			HasFP32Denormals)) {
				Function *Decl =
				Intrinsic::getDeclaration(Mod, Intrinsic::amdgcn_fdiv_fast);
				NewFDiv = Builder.CreateCall(Decl, { Num, Den }, "", FPMath);
				}
				if (!NewFDiv) { // Try rcp.
				NewFDiv = performRCPOpt(Num, Den, FastUnsafeRcpLegal, Builder, FPMath,
				Mod, HasFP32Denormals, NeedHighAccuracy);
				}
	}			}

	if (NewFDiv) {			if (NewFDiv) {
	FDiv.replaceAllUsesWith(NewFDiv);			FDiv.replaceAllUsesWith(NewFDiv);
	NewFDiv->takeName(&FDiv);			NewFDiv->takeName(&FDiv);
	FDiv.eraseFromParent();			FDiv.eraseFromParent();
	}			}

	Show All 12 Lines
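One reviewer asked for a chart of the expected lowering. As a reading aid, the scalar f32 path in the right-hand column of the diff above can be modeled as a pure function. This is a sketch, not the committed code: the enums and `lowerScalarF32FDiv` are hypothetical names, the numerator is reduced to four cases, and `FastUnsafeRcpLegal` is taken as an input rather than recomputed from the flags.

```cpp
// Hypothetical classification of the fdiv numerator.
enum class Numerator { Variable, One, MinusOne, OtherConstant };

// Hypothetical names for the possible results of the f32 scalar path.
enum class Lowering { KeepFDiv, FDivFast, Rcp, RcpOfFNeg, MulByRcp };

// Mirrors shouldKeepFDivF32 from the diff.
constexpr bool shouldKeepFDivF32(Numerator Num, bool FastUnsafeRcpLegal,
                                 bool HasDenormals) {
  if (Num == Numerator::Variable)
    return HasDenormals;
  if (FastUnsafeRcpLegal)
    return true;
  const bool IsOne = Num == Numerator::One || Num == Numerator::MinusOne;
  // Reciprocal f32 is handled separately without denormals (boolean XOR).
  return HasDenormals != IsOne;
}

// Mirrors the scalar branch of visitFDiv followed by performRCPOpt for f32.
constexpr Lowering lowerScalarF32FDiv(Numerator Num, bool FastUnsafeRcpLegal,
                                      bool HasDenormals,
                                      bool NeedHighAccuracy) {
  const bool UseFDivFast = !NeedHighAccuracy && !FastUnsafeRcpLegal;
  if (UseFDivFast &&
      !shouldKeepFDivF32(Num, FastUnsafeRcpLegal, HasDenormals))
    return Lowering::FDivFast;

  // performRCPOpt: give up early when rcp would be too inaccurate.
  if (!FastUnsafeRcpLegal && (HasDenormals || NeedHighAccuracy))
    return Lowering::KeepFDiv;
  if (Num == Numerator::One)
    return Lowering::Rcp;       // 1.0 / x -> rcp(x)
  if (Num == Numerator::MinusOne)
    return Lowering::RcpOfFNeg; // -1.0 / x -> rcp(fneg x)
  if (FastUnsafeRcpLegal)
    return Lowering::MulByRcp;  // x / y -> x * rcp(y)
  return Lowering::KeepFDiv;
}
```

Read as a chart: without `!fpmath` and without fast-math flags the plain fdiv is kept; with a 2.5 ULP allowance and flushed denormals, `a / b` becomes fdiv.fast and `1.0 / x` becomes rcp; with denormals enabled, only `1.0 / x` becomes fdiv.fast; and whenever the reciprocal is legal, `a / b` becomes a multiply by rcp.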

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

	Show All 12 Lines
	// instructions.			// instructions.
	SDValue SITargetLowering::lowerFastUnsafeFDIV(SDValue Op,			SDValue SITargetLowering::lowerFastUnsafeFDIV(SDValue Op,
	SelectionDAG &DAG) const {			SelectionDAG &DAG) const {
	SDLoc SL(Op);			SDLoc SL(Op);
	SDValue LHS = Op.getOperand(0);			SDValue LHS = Op.getOperand(0);
	SDValue RHS = Op.getOperand(1);			SDValue RHS = Op.getOperand(1);
	EVT VT = Op.getValueType();			EVT VT = Op.getValueType();
	const SDNodeFlags Flags = Op->getFlags();			const SDNodeFlags Flags = Op->getFlags();
	bool Unsafe = DAG.getTarget().Options.UnsafeFPMath || Flags.hasAllowReciprocal();

	if (!Unsafe && VT == MVT::f32 && hasFP32Denormals(DAG.getMachineFunction()))			bool FastUnsafeRcpLegal = DAG.getTarget().Options.UnsafeFPMath ||
				(Flags.hasAllowReciprocal() &&
				((VT == MVT::f32 && hasFP32Denormals(DAG.getMachineFunction())) ||
				VT == MVT::f16 ||
				Flags.hasApproximateFuncs()));

				// Do rcp optimization only when fast unsafe rcp is legal here.
				// NOTE: We already performed RCP optimization to insert intrinsics in
				// AMDGPUCodeGenPrepare. Ideally there should have no opportunity here to
				// rcp optimization.
				// However, there are cases like FREM, which is expended into a sequence
				// of instructions including FDIV, which may expose new opportunities.
				if (!FastUnsafeRcpLegal)
	return SDValue();			return SDValue();

	if (const ConstantFPSDNode *CLHS = dyn_cast<ConstantFPSDNode>(LHS)) {			if (const ConstantFPSDNode *CLHS = dyn_cast<ConstantFPSDNode>(LHS)) {
	if (Unsafe || VT == MVT::f32 || VT == MVT::f16) {			if (CLHS->isExactlyValue(1.0)) {
	if (CLHS->isExactlyValue(1.0)) {			// v_rcp_f32 and v_rsq_f32 do not support denormals, and according to
	// v_rcp_f32 and v_rsq_f32 do not support denormals, and according to			// the CI documentation has a worst case error of 1 ulp.
	// the CI documentation has a worst case error of 1 ulp.			// OpenCL requires <= 2.5 ulp for 1.0 / x, so it should always be OK to
	// OpenCL requires <= 2.5 ulp for 1.0 / x, so it should always be OK to			// use it as long as we aren't trying to use denormals.
	// use it as long as we aren't trying to use denormals.			//
	//			// v_rcp_f16 and v_rsq_f16 DO support denormals.
	// v_rcp_f16 and v_rsq_f16 DO support denormals.

	// 1.0 / sqrt(x) -> rsq(x)

	// XXX - Is UnsafeFPMath sufficient to do this for f64? The maximum ULP
	// error seems really high at 2^29 ULP.
	if (RHS.getOpcode() == ISD::FSQRT)
	return DAG.getNode(AMDGPUISD::RSQ, SL, VT, RHS.getOperand(0));

	// 1.0 / x -> rcp(x)
	return DAG.getNode(AMDGPUISD::RCP, SL, VT, RHS);
	}

	// Same as for 1.0, but expand the sign out of the constant.			// 1.0 / sqrt(x) -> rsq(x)
	if (CLHS->isExactlyValue(-1.0)) {
	// -1.0 / x -> rcp (fneg x)			// XXX - Is UnsafeFPMath sufficient to do this for f64? The maximum ULP
	SDValue FNegRHS = DAG.getNode(ISD::FNEG, SL, VT, RHS);			// error seems really high at 2^29 ULP.
	return DAG.getNode(AMDGPUISD::RCP, SL, VT, FNegRHS);			if (RHS.getOpcode() == ISD::FSQRT)
	}			return DAG.getNode(AMDGPUISD::RSQ, SL, VT, RHS.getOperand(0));

				// 1.0 / x -> rcp(x)
				return DAG.getNode(AMDGPUISD::RCP, SL, VT, RHS);
	}			}
	}

	if (Unsafe) {			// Same as for 1.0, but expand the sign out of the constant.
	// Turn into multiply by the reciprocal.			if (CLHS->isExactlyValue(-1.0)) {
	// x / y -> x * (1.0 / y)			// -1.0 / x -> rcp (fneg x)
	SDValue Recip = DAG.getNode(AMDGPUISD::RCP, SL, VT, RHS);			SDValue FNegRHS = DAG.getNode(ISD::FNEG, SL, VT, RHS);
	return DAG.getNode(ISD::FMUL, SL, VT, LHS, Recip, Flags);			return DAG.getNode(AMDGPUISD::RCP, SL, VT, FNegRHS);
				}
	}			}

	return SDValue();			// Turn into multiply by the reciprocal.
				// x / y -> x * (1.0 / y)
				SDValue Recip = DAG.getNode(AMDGPUISD::RCP, SL, VT, RHS);
				return DAG.getNode(ISD::FMUL, SL, VT, LHS, Recip, Flags);
	}			}

	static SDValue getFPBinOp(SelectionDAG &DAG, unsigned Opcode, const SDLoc &SL,			static SDValue getFPBinOp(SelectionDAG &DAG, unsigned Opcode, const SDLoc &SL,
	EVT VT, SDValue A, SDValue B, SDValue GlueChain) {			EVT VT, SDValue A, SDValue B, SDValue GlueChain) {
	if (GlueChain->getNumValues() <= 1) {			if (GlueChain->getNumValues() <= 1) {
	return DAG.getNode(Opcode, SL, VT, A, B);			return DAG.getNode(Opcode, SL, VT, A, B);
	}			}

	Show All 17 Lines

	SDValue SITargetLowering::performRcpCombine(SDNode *N,			SDValue SITargetLowering::performRcpCombine(SDNode *N,
	DAGCombinerInfo &DCI) const {			DAGCombinerInfo &DCI) const {
	EVT VT = N->getValueType(0);			EVT VT = N->getValueType(0);
	SDValue N0 = N->getOperand(0);			SDValue N0 = N->getOperand(0);

	if (N0.isUndef())			if (N0.isUndef())
	return N0;			return N0;

	if (VT == MVT::f32 && (N0.getOpcode() == ISD::UINT_TO_FP ||			if (VT == MVT::f32 && (N0.getOpcode() == ISD::UINT_TO_FP ||
	N0.getOpcode() == ISD::SINT_TO_FP)) {			N0.getOpcode() == ISD::SINT_TO_FP)) {
	return DCI.DAG.getNode(AMDGPUISD::RCP_IFLAG, SDLoc(N), VT, N0,			return DCI.DAG.getNode(AMDGPUISD::RCP_IFLAG, SDLoc(N), VT, N0,
	N->getFlags());			N->getFlags());
	}			}

				if ((VT == MVT::f32 \|\| VT == MVT::f16) && N0.getOpcode() == ISD::FSQRT) {
				return DCI.DAG.getNode(AMDGPUISD::RSQ, SDLoc(N), VT,
				N0.getOperand(0), N->getFlags());
				}

	return AMDGPUTargetLowering::performRcpCombine(N, DCI);			return AMDGPUTargetLowering::performRcpCombine(N, DCI);
	}			}

	bool SITargetLowering::isCanonicalized(SelectionDAG &DAG, SDValue Op,			bool SITargetLowering::isCanonicalized(SelectionDAG &DAG, SDValue Op,
	unsigned MaxDepth) const {			unsigned MaxDepth) const {
	unsigned Opcode = Op.getOpcode();			unsigned Opcode = Op.getOpcode();
	if (Opcode == ISD::FCANONICALIZE)			if (Opcode == ISD::FCANONICALIZE)
	return true;			return true;
	Show All 12 Lines

llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll

Show All 10 Lines
}		}

; CHECK-LABEL: @fdiv_fpmath(		; CHECK-LABEL: @fdiv_fpmath(
; CHECK: %no.md = fdiv float %a, %b{{$}}		; CHECK: %no.md = fdiv float %a, %b{{$}}
; CHECK: %md.half.ulp = fdiv float %a, %b, !fpmath !1		; CHECK: %md.half.ulp = fdiv float %a, %b, !fpmath !1
; CHECK: %md.1ulp = fdiv float %a, %b, !fpmath !2		; CHECK: %md.1ulp = fdiv float %a, %b, !fpmath !2
; CHECK: %md.25ulp = call float @llvm.amdgcn.fdiv.fast(float %a, float %b), !fpmath !0		; CHECK: %md.25ulp = call float @llvm.amdgcn.fdiv.fast(float %a, float %b), !fpmath !0
; CHECK: %md.3ulp = call float @llvm.amdgcn.fdiv.fast(float %a, float %b), !fpmath !3		; CHECK: %md.3ulp = call float @llvm.amdgcn.fdiv.fast(float %a, float %b), !fpmath !3
; CHECK: %fast.md.25ulp = fdiv fast float %a, %b, !fpmath !0		; CHECK: %[[FAST_RCP:[0-9]+]] = call fast float @llvm.amdgcn.rcp.f32(float %b)
; CHECK: arcp.md.25ulp = fdiv arcp float %a, %b, !fpmath !0		; CHECK: %fast.md.25ulp = fmul fast float %a, %[[FAST_RCP]], !fpmath !0
		; CHECK: %[[ARCP_RCP:[0-9]+]] = call arcp float @llvm.amdgcn.rcp.f32(float %b)
		; CHECK: arcp.md.25ulp = fmul arcp float %a, %[[ARCP_RCP]], !fpmath !0
define amdgpu_kernel void @fdiv_fpmath(float addrspace(1)* %out, float %a, float %b) #1 {		define amdgpu_kernel void @fdiv_fpmath(float addrspace(1)* %out, float %a, float %b) #1 {
%no.md = fdiv float %a, %b		%no.md = fdiv float %a, %b
store volatile float %no.md, float addrspace(1)* %out		store volatile float %no.md, float addrspace(1)* %out

%md.half.ulp = fdiv float %a, %b, !fpmath !1		%md.half.ulp = fdiv float %a, %b, !fpmath !1
store volatile float %md.half.ulp, float addrspace(1)* %out		store volatile float %md.half.ulp, float addrspace(1)* %out

%md.1ulp = fdiv float %a, %b, !fpmath !2		%md.1ulp = fdiv float %a, %b, !fpmath !2
[11 lines not shown; context: define amdgpu_kernel void @fdiv_fpmath(float addrspace(1)* %out, float %a, float %b) #1 {]
%arcp.md.25ulp = fdiv arcp float %a, %b, !fpmath !0		%arcp.md.25ulp = fdiv arcp float %a, %b, !fpmath !0
store volatile float %arcp.md.25ulp, float addrspace(1)* %out		store volatile float %arcp.md.25ulp, float addrspace(1)* %out

ret void		ret void
}		}

; CHECK-LABEL: @rcp_fdiv_fpmath(		; CHECK-LABEL: @rcp_fdiv_fpmath(
; CHECK: %no.md = fdiv float 1.000000e+00, %x{{$}}		; CHECK: %no.md = fdiv float 1.000000e+00, %x{{$}}
; CHECK: %md.25ulp = fdiv float 1.000000e+00, %x, !fpmath !0		; CHECK: %md.25ulp = call float @llvm.amdgcn.rcp.f32(float %x)
; CHECK: %md.half.ulp = fdiv float 1.000000e+00, %x, !fpmath !1		; CHECK: %md.half.ulp = fdiv float 1.000000e+00, %x, !fpmath !1
; CHECK: %arcp.no.md = fdiv arcp float 1.000000e+00, %x{{$}}		; CHECK: %arcp.no.md = fdiv arcp float 1.000000e+00, %x
; CHECK: %arcp.25ulp = fdiv arcp float 1.000000e+00, %x, !fpmath !0		; CHECK: %arcp.25ulp = call arcp float @llvm.amdgcn.rcp.f32(float %x)
; CHECK: %fast.no.md = fdiv fast float 1.000000e+00, %x{{$}}		; CHECK: %fast.no.md = call fast float @llvm.amdgcn.rcp.f32(float %x)
; CHECK: %fast.25ulp = fdiv fast float 1.000000e+00, %x, !fpmath !0		; CHECK: %fast.25ulp = call fast float @llvm.amdgcn.rcp.f32(float %x)
define amdgpu_kernel void @rcp_fdiv_fpmath(float addrspace(1)* %out, float %x) #1 {		define amdgpu_kernel void @rcp_fdiv_fpmath(float addrspace(1)* %out, float %x) #1 {
%no.md = fdiv float 1.0, %x		%no.md = fdiv float 1.0, %x
store volatile float %no.md, float addrspace(1)* %out		store volatile float %no.md, float addrspace(1)* %out

%md.25ulp = fdiv float 1.0, %x, !fpmath !0		%md.25ulp = fdiv float 1.0, %x, !fpmath !0
store volatile float %md.25ulp, float addrspace(1)* %out		store volatile float %md.25ulp, float addrspace(1)* %out

%md.half.ulp = fdiv float 1.0, %x, !fpmath !1		%md.half.ulp = fdiv float 1.0, %x, !fpmath !1
[9 lines not shown; context: define amdgpu_kernel void @rcp_fdiv_fpmath(float addrspace(1)* %out, float %x) #1 {]
store volatile float %fast.no.md, float addrspace(1)* %out		store volatile float %fast.no.md, float addrspace(1)* %out

%fast.25ulp = fdiv fast float 1.0, %x, !fpmath !0		%fast.25ulp = fdiv fast float 1.0, %x, !fpmath !0
store volatile float %fast.25ulp, float addrspace(1)* %out		store volatile float %fast.25ulp, float addrspace(1)* %out

ret void		ret void
}		}

		; CHECK-LABEL: @rcp_fdiv_arcp_denormal(
		; CHECK: %arcp.low.accuracy = call arcp float @llvm.amdgcn.fdiv.fast(float 1.000000e+00, float %x), !fpmath !0
		; CHECK: %arcp.high.accuracy = fdiv arcp float 1.000000e+00, %x, !fpmath !2
		; CHECK: %arcp.low.afn = call arcp afn float @llvm.amdgcn.rcp.f32(float %x)
		; CHECK: %arcp.high.afn = call arcp afn float @llvm.amdgcn.rcp.f32(float %x)
		define amdgpu_kernel void @rcp_fdiv_arcp_denormal(float addrspace(1)* %out, float %x) #2 {

		%arcp.low.accuracy = fdiv arcp float 1.0, %x, !fpmath !0
		arsenm (inline comment): This should not have produced rcp since denormals are enabled and it doesn't have afn.
		store volatile float %arcp.low.accuracy, float addrspace(1)* %out

		%arcp.high.accuracy = fdiv arcp float 1.0, %x, !fpmath !2
		store volatile float %arcp.high.accuracy, float addrspace(1)* %out
		arsenm (inline comment): The name says high accuracy, but 5 ulp is lower accuracy. This didn't form rcp, but I think for the wrong reason.

		%arcp.low.afn = fdiv arcp afn float 1.0, %x, !fpmath !0
		store volatile float %arcp.low.afn, float addrspace(1)* %out

		%arcp.high.afn = fdiv arcp afn float 1.0, %x, !fpmath !2
		store volatile float %arcp.high.afn, float addrspace(1)* %out
		arsenm (inline comment): These two I think are OK because of afn.

		ret void
		}
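
The inline comments on @rcp_fdiv_arcp_denormal amount to a small decision rule for when `1.0 / x` may be turned into a bare `rcp`. The following is a minimal C++ sketch of that rule as read from the comments, not code from the patch. `FDivInfo`, `mayUseRcp`, and the `RcpUlp` parameter are hypothetical names, and the last branch (comparing the `!fpmath` budget against a caller-supplied accuracy bound when denormals are flushed) is an assumption about the intended behavior rather than something the review states.

```cpp
// Hypothetical model of the reviewer's stated conditions; not part of the patch.
struct FDivInfo {
  bool HasApproxFunc;    // 'afn' fast-math flag is present on the fdiv
  bool DenormalsEnabled; // f32 denormals are enabled for the function
  float AllowedUlp;      // accuracy from !fpmath; 0 means no metadata
};

// RcpUlp is an assumed accuracy bound for the reciprocal, supplied by the
// caller; the review does not give a value for it.
bool mayUseRcp(const FDivInfo &Info, float RcpUlp) {
  // "These two I think are OK because of afn"
  if (Info.HasApproxFunc)
    return true;
  // "should not have produced rcp since denormals are enabled and it
  // doesn't have afn"
  if (Info.DenormalsEnabled)
    return false;
  // Assumption: with denormals flushed, rcp is acceptable only when the
  // requested accuracy is no tighter than what rcp is assumed to deliver.
  return Info.AllowedUlp >= RcpUlp;
}
```

Under this reading, `%arcp.low.accuracy` (arcp only, denormals enabled) would not qualify for `rcp`, while `%arcp.low.afn` and `%arcp.high.afn` would qualify regardless of their `!fpmath` values.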

; CHECK-LABEL: @fdiv_fpmath_vector(		; CHECK-LABEL: @fdiv_fpmath_vector(
; CHECK: %no.md = fdiv <2 x float> %a, %b{{$}}		; CHECK: %[[NO_A0:[0-9]+]] = extractelement <2 x float> %a, i64 0
; CHECK: %md.half.ulp = fdiv <2 x float> %a, %b, !fpmath !1		; CHECK: %[[NO_B0:[0-9]+]] = extractelement <2 x float> %b, i64 0
; CHECK: %md.1ulp = fdiv <2 x float> %a, %b, !fpmath !2		; CHECK: %[[NO_FDIV0:[0-9]+]] = fdiv float %[[NO_A0]], %[[NO_B0]]
		; CHECK: %[[NO_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[NO_FDIV0]], i64 0
		; CHECK: %[[NO_A1:[0-9]+]] = extractelement <2 x float> %a, i64 1
		; CHECK: %[[NO_B1:[0-9]+]] = extractelement <2 x float> %b, i64 1
		; CHECK: %[[NO_FDIV1:[0-9]+]] = fdiv float %[[NO_A1]], %[[NO_B1]]
		; CHECK: %no.md = insertelement <2 x float> %[[NO_INS0]], float %[[NO_FDIV1]], i64 1
		; CHECK: store volatile <2 x float> %no.md, <2 x float> addrspace(1)* %out

		; CHECK: %[[HALF_A0:[0-9]+]] = extractelement <2 x float> %a, i64 0
		; CHECK: %[[HALF_B0:[0-9]+]] = extractelement <2 x float> %b, i64 0
		; CHECK: %[[HALF_FDIV0:[0-9]+]] = fdiv float %[[HALF_A0]], %[[HALF_B0]], !fpmath !1
		; CHECK: %[[HALF_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[HALF_FDIV0]], i64 0
		; CHECK: %[[HALF_A1:[0-9]+]] = extractelement <2 x float> %a, i64 1
		; CHECK: %[[HALF_B1:[0-9]+]] = extractelement <2 x float> %b, i64 1
		; CHECK: %[[HALF_FDIV1:[0-9]+]] = fdiv float %[[HALF_A1]], %[[HALF_B1]], !fpmath !1
		; CHECK: %md.half.ulp = insertelement <2 x float> %[[HALF_INS0]], float %[[HALF_FDIV1]], i64 1
		; CHECK: store volatile <2 x float> %md.half.ulp, <2 x float> addrspace(1)* %out

		; CHECK: %[[ONE_A0:[0-9]+]] = extractelement <2 x float> %a, i64 0
		; CHECK: %[[ONE_B0:[0-9]+]] = extractelement <2 x float> %b, i64 0
		; CHECK: %[[ONE_FDIV0:[0-9]+]] = fdiv float %[[ONE_A0]], %[[ONE_B0]], !fpmath !2
		; CHECK: %[[ONE_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[ONE_FDIV0]], i64 0
		; CHECK: %[[ONE_A1:[0-9]+]] = extractelement <2 x float> %a, i64 1
		; CHECK: %[[ONE_B1:[0-9]+]] = extractelement <2 x float> %b, i64 1
		; CHECK: %[[ONE_FDIV1:[0-9]+]] = fdiv float %[[ONE_A1]], %[[ONE_B1]], !fpmath !2
		; CHECK: %md.1ulp = insertelement <2 x float> %[[ONE_INS0]], float %[[ONE_FDIV1]], i64 1
		; CHECK: store volatile <2 x float> %md.1ulp, <2 x float> addrspace(1)* %out

; CHECK: %[[A0:[0-9]+]] = extractelement <2 x float> %a, i64 0		; CHECK: %[[A0:[0-9]+]] = extractelement <2 x float> %a, i64 0
; CHECK: %[[B0:[0-9]+]] = extractelement <2 x float> %b, i64 0		; CHECK: %[[B0:[0-9]+]] = extractelement <2 x float> %b, i64 0
; CHECK: %[[FDIV0:[0-9]+]] = call float @llvm.amdgcn.fdiv.fast(float %[[A0]], float %[[B0]]), !fpmath !0		; CHECK: %[[FDIV0:[0-9]+]] = call float @llvm.amdgcn.fdiv.fast(float %[[A0]], float %[[B0]]), !fpmath !0
; CHECK: %[[INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[FDIV0]], i64 0		; CHECK: %[[INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[FDIV0]], i64 0
; CHECK: %[[A1:[0-9]+]] = extractelement <2 x float> %a, i64 1		; CHECK: %[[A1:[0-9]+]] = extractelement <2 x float> %a, i64 1
; CHECK: %[[B1:[0-9]+]] = extractelement <2 x float> %b, i64 1		; CHECK: %[[B1:[0-9]+]] = extractelement <2 x float> %b, i64 1
; CHECK: %[[FDIV1:[0-9]+]] = call float @llvm.amdgcn.fdiv.fast(float %[[A1]], float %[[B1]]), !fpmath !0		; CHECK: %[[FDIV1:[0-9]+]] = call float @llvm.amdgcn.fdiv.fast(float %[[A1]], float %[[B1]]), !fpmath !0
[10 lines not shown; context: define amdgpu_kernel void @fdiv_fpmath_vector(<2 x float> addrspace(1)* %out, <2 x float> %a, <2 x float> %b) #1 {]

%md.25ulp = fdiv <2 x float> %a, %b, !fpmath !0		%md.25ulp = fdiv <2 x float> %a, %b, !fpmath !0
store volatile <2 x float> %md.25ulp, <2 x float> addrspace(1)* %out		store volatile <2 x float> %md.25ulp, <2 x float> addrspace(1)* %out

ret void		ret void
}		}

; CHECK-LABEL: @rcp_fdiv_fpmath_vector(		; CHECK-LABEL: @rcp_fdiv_fpmath_vector(
; CHECK: %no.md = fdiv <2 x float> <float 1.000000e+00, float 1.000000e+00>, %x{{$}}		; CHECK: %[[NO0:[0-9]+]] = extractelement <2 x float> %x, i64 0
; CHECK: %md.half.ulp = fdiv <2 x float> <float 1.000000e+00, float 1.000000e+00>, %x, !fpmath !1		; CHECK: %[[NO_FDIV0:[0-9]+]] = fdiv float 1.000000e+00, %[[NO0]]
; CHECK: %arcp.no.md = fdiv arcp <2 x float> <float 1.000000e+00, float 1.000000e+00>, %x{{$}}		; CHECK: %[[NO_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[NO_FDIV0]], i64 0
; CHECK: %fast.no.md = fdiv fast <2 x float> <float 1.000000e+00, float 1.000000e+00>, %x{{$}}		; CHECK: %[[NO1:[0-9]+]] = extractelement <2 x float> %x, i64 1
; CHECK: %arcp.25ulp = fdiv arcp <2 x float> <float 1.000000e+00, float 1.000000e+00>, %x, !fpmath !0		; CHECK: %[[NO_FDIV1:[0-9]+]] = fdiv float 1.000000e+00, %[[NO1]]
; CHECK: %fast.25ulp = fdiv fast <2 x float> <float 1.000000e+00, float 1.000000e+00>, %x, !fpmath !0		; CHECK: %no.md = insertelement <2 x float> %[[NO_INS0]], float %[[NO_FDIV1]], i64 1
		; CHECK: store volatile <2 x float> %no.md, <2 x float> addrspace(1)* %out

		; CHECK: %[[HALF0:[0-9]+]] = extractelement <2 x float> %x, i64 0
		; CHECK: %[[HALF_FDIV0:[0-9]+]] = fdiv float 1.000000e+00, %[[HALF0]], !fpmath !1
		; CHECK: %[[HALF_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[HALF_FDIV0]], i64 0
		; CHECK: %[[HALF1:[0-9]+]] = extractelement <2 x float> %x, i64 1
		; CHECK: %[[HALF_FDIV1:[0-9]+]] = fdiv float 1.000000e+00, %[[HALF1]], !fpmath !1
		; CHECK: %md.half.ulp = insertelement <2 x float> %[[HALF_INS0]], float %[[HALF_FDIV1]], i64 1
		; CHECK: store volatile <2 x float> %md.half.ulp, <2 x float> addrspace(1)* %out

		; CHECK: %[[ARCP_NO0:[0-9]+]] = extractelement <2 x float> %x, i64 0
		; CHECK: %[[ARCP_NO_FDIV0:[0-9]+]] = fdiv arcp float 1.000000e+00, %[[ARCP_NO0]]
		; CHECK: %[[ARCP_NO_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[ARCP_NO_FDIV0]], i64 0
		; CHECK: %[[ARCP_NO1:[0-9]+]] = extractelement <2 x float> %x, i64 1
		; CHECK: %[[ARCP_NO_FDIV1:[0-9]+]] = fdiv arcp float 1.000000e+00, %[[ARCP_NO1]]
		; CHECK: %arcp.no.md = insertelement <2 x float> %[[ARCP_NO_INS0]], float %[[ARCP_NO_FDIV1]], i64 1
		; CHECK: store volatile <2 x float> %arcp.no.md, <2 x float> addrspace(1)* %out

		; CHECK: %[[FAST_NO0:[0-9]+]] = extractelement <2 x float> %x, i64 0
		; CHECK: %[[FAST_NO_RCP0:[0-9]+]] = call fast float @llvm.amdgcn.rcp.f32(float %[[FAST_NO0]])
		; CHECK: %[[FAST_NO_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[FAST_NO_RCP0]], i64 0
		; CHECK: %[[FAST_NO1:[0-9]+]] = extractelement <2 x float> %x, i64 1
		; CHECK: %[[FAST_NO_RCP1:[0-9]+]] = call fast float @llvm.amdgcn.rcp.f32(float %[[FAST_NO1]])
		; CHECK: %fast.no.md = insertelement <2 x float> %[[FAST_NO_INS0]], float %[[FAST_NO_RCP1]], i64 1
		; CHECK: store volatile <2 x float> %fast.no.md, <2 x float> addrspace(1)* %out

		; CHECK: %[[ARCP_250:[0-9]+]] = extractelement <2 x float> %x, i64 0
		; CHECK: %[[ARCP_25_RCP0:[0-9]+]] = call arcp float @llvm.amdgcn.rcp.f32(float %[[ARCP_250]])
		; CHECK: %[[ARCP_25_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[ARCP_25_RCP0]], i64 0
		; CHECK: %[[ARCP_251:[0-9]+]] = extractelement <2 x float> %x, i64 1
		; CHECK: %[[ARCP_25_RCP1:[0-9]+]] = call arcp float @llvm.amdgcn.rcp.f32(float %[[ARCP_251]])
		; CHECK: %arcp.25ulp = insertelement <2 x float> %[[ARCP_25_INS0]], float %[[ARCP_25_RCP1]], i64 1
		; CHECK: store volatile <2 x float> %arcp.25ulp, <2 x float> addrspace(1)* %out

		; CHECK: %[[FAST_250:[0-9]+]] = extractelement <2 x float> %x, i64 0
		; CHECK: %[[FAST_25_RCP0:[0-9]+]] = call fast float @llvm.amdgcn.rcp.f32(float %[[FAST_250]])
		; CHECK: %[[FAST_25_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[FAST_25_RCP0]], i64 0
		; CHECK: %[[FAST_251:[0-9]+]] = extractelement <2 x float> %x, i64 1
		; CHECK: %[[FAST_25_RCP1:[0-9]+]] = call fast float @llvm.amdgcn.rcp.f32(float %[[FAST_251]])
		; CHECK: %fast.25ulp = insertelement <2 x float> %[[FAST_25_INS0]], float %[[FAST_25_RCP1]], i64 1
; CHECK: store volatile <2 x float> %fast.25ulp, <2 x float> addrspace(1)* %out		; CHECK: store volatile <2 x float> %fast.25ulp, <2 x float> addrspace(1)* %out
define amdgpu_kernel void @rcp_fdiv_fpmath_vector(<2 x float> addrspace(1)* %out, <2 x float> %x) #1 {		define amdgpu_kernel void @rcp_fdiv_fpmath_vector(<2 x float> addrspace(1)* %out, <2 x float> %x) #1 {
%no.md = fdiv <2 x float> <float 1.0, float 1.0>, %x		%no.md = fdiv <2 x float> <float 1.0, float 1.0>, %x
store volatile <2 x float> %no.md, <2 x float> addrspace(1)* %out		store volatile <2 x float> %no.md, <2 x float> addrspace(1)* %out

%md.half.ulp = fdiv <2 x float> <float 1.0, float 1.0>, %x, !fpmath !1		%md.half.ulp = fdiv <2 x float> <float 1.0, float 1.0>, %x, !fpmath !1
store volatile <2 x float> %md.half.ulp, <2 x float> addrspace(1)* %out		store volatile <2 x float> %md.half.ulp, <2 x float> addrspace(1)* %out

%arcp.no.md = fdiv arcp <2 x float> <float 1.0, float 1.0>, %x		%arcp.no.md = fdiv arcp <2 x float> <float 1.0, float 1.0>, %x
store volatile <2 x float> %arcp.no.md, <2 x float> addrspace(1)* %out		store volatile <2 x float> %arcp.no.md, <2 x float> addrspace(1)* %out

%fast.no.md = fdiv fast <2 x float> <float 1.0, float 1.0>, %x		%fast.no.md = fdiv fast <2 x float> <float 1.0, float 1.0>, %x
store volatile <2 x float> %fast.no.md, <2 x float> addrspace(1)* %out		store volatile <2 x float> %fast.no.md, <2 x float> addrspace(1)* %out

%arcp.25ulp = fdiv arcp <2 x float> <float 1.0, float 1.0>, %x, !fpmath !0		%arcp.25ulp = fdiv arcp <2 x float> <float 1.0, float 1.0>, %x, !fpmath !0
store volatile <2 x float> %arcp.25ulp, <2 x float> addrspace(1)* %out		store volatile <2 x float> %arcp.25ulp, <2 x float> addrspace(1)* %out

%fast.25ulp = fdiv fast <2 x float> <float 1.0, float 1.0>, %x, !fpmath !0		%fast.25ulp = fdiv fast <2 x float> <float 1.0, float 1.0>, %x, !fpmath !0
store volatile <2 x float> %fast.25ulp, <2 x float> addrspace(1)* %out		store volatile <2 x float> %fast.25ulp, <2 x float> addrspace(1)* %out

ret void		ret void
}		}

; CHECK-LABEL: @rcp_fdiv_fpmath_vector_nonsplat(		; CHECK-LABEL: @rcp_fdiv_fpmath_vector_nonsplat(
; CHECK: %no.md = fdiv <2 x float> <float 1.000000e+00, float 2.000000e+00>, %x		; CHECK: %[[NO0:[0-9]+]] = extractelement <2 x float> %x, i64 0
; CHECK: %arcp.no.md = fdiv arcp <2 x float> <float 1.000000e+00, float 2.000000e+00>, %x		; CHECK: %[[NO_FDIV0:[0-9]+]] = fdiv float 1.000000e+00, %[[NO0]]
; CHECK: %fast.no.md = fdiv fast <2 x float> <float 1.000000e+00, float 2.000000e+00>, %x{{$}}		; CHECK: %[[NO_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[NO_FDIV0]], i64 0
; CHECK: %arcp.25ulp = fdiv arcp <2 x float> <float 1.000000e+00, float 2.000000e+00>, %x, !fpmath !0		; CHECK: %[[NO1:[0-9]+]] = extractelement <2 x float> %x, i64 1
; CHECK: %fast.25ulp = fdiv fast <2 x float> <float 1.000000e+00, float 2.000000e+00>, %x, !fpmath !0		; CHECK: %[[NO_FDIV1:[0-9]+]] = fdiv float 2.000000e+00, %[[NO1]]
; CHECK: store volatile <2 x float> %fast.25ulp		; CHECK: %no.md = insertelement <2 x float> %[[NO_INS0]], float %[[NO_FDIV1]], i64 1
		; CHECK: store volatile <2 x float> %no.md, <2 x float> addrspace(1)* %out

		; CHECK: %[[ARCP_NO0:[0-9]+]] = extractelement <2 x float> %x, i64 0
		; CHECK: %[[ARCP_NO_FDIV0:[0-9]+]] = fdiv arcp float 1.000000e+00, %[[ARCP_NO0]]
		; CHECK: %[[ARCP_NO_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[ARCP_NO_FDIV0]], i64 0
		; CHECK: %[[ARCP_NO1:[0-9]+]] = extractelement <2 x float> %x, i64 1
		; CHECK: %[[ARCP_NO_FDIV1:[0-9]+]] = fdiv arcp float 2.000000e+00, %[[ARCP_NO1]]
		; CHECK: %arcp.no.md = insertelement <2 x float> %[[ARCP_NO_INS0]], float %[[ARCP_NO_FDIV1]], i64 1
		; CHECK: store volatile <2 x float> %arcp.no.md, <2 x float> addrspace(1)* %out

		; CHECK: %[[FAST_NO0:[0-9]+]] = extractelement <2 x float> %x, i64 0
		arsenm (inline comment): The attribute should be removed.
		; CHECK: %[[FAST_NO_RCP0:[0-9]+]] = call fast float @llvm.amdgcn.rcp.f32(float %[[FAST_NO0]])
		; CHECK: %[[FAST_NO_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[FAST_NO_RCP0]], i64 0
		; CHECK: %[[FAST_NO1:[0-9]+]] = extractelement <2 x float> %x, i64 1
		; CHECK: %[[FAST_NO_RCP1:[0-9]+]] = call fast float @llvm.amdgcn.rcp.f32(float %[[FAST_NO1]])
		; CHECK: %[[FAST_NO_MUL1:[0-9]+]] = fmul fast float 2.000000e+00, %[[FAST_NO_RCP1]]
		; CHECK: %fast.no.md = insertelement <2 x float> %[[FAST_NO_INS0]], float %[[FAST_NO_MUL1]], i64 1
		; CHECK: store volatile <2 x float> %fast.no.md, <2 x float> addrspace(1)* %out

		; CHECK: %[[ARCP_250:[0-9]+]] = extractelement <2 x float> %x, i64 0
		; CHECK: %[[ARCP_25_RCP0:[0-9]+]] = call arcp float @llvm.amdgcn.rcp.f32(float %[[ARCP_250]])
		; CHECK: %[[ARCP_25_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[ARCP_25_RCP0]], i64 0
		; CHECK: %[[ARCP_251:[0-9]+]] = extractelement <2 x float> %x, i64 1
		; CHECK: %[[ARCP_25_RCP1:[0-9]+]] = call arcp float @llvm.amdgcn.rcp.f32(float %[[ARCP_251]])
		; CHECK: %[[ARCP_25_MUL1:[0-9]+]] = fmul arcp float 2.000000e+00, %[[ARCP_25_RCP1]]
		; CHECK: %arcp.25ulp = insertelement <2 x float> %[[ARCP_25_INS0]], float %[[ARCP_25_MUL1]], i64 1
		; CHECK: store volatile <2 x float> %arcp.25ulp, <2 x float> addrspace(1)* %out

		; CHECK: %[[FAST_250:[0-9]+]] = extractelement <2 x float> %x, i64 0
		; CHECK: %[[FAST_25_RCP0:[0-9]+]] = call fast float @llvm.amdgcn.rcp.f32(float %[[FAST_250]])
		; CHECK: %[[FAST_25_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[FAST_25_RCP0]], i64 0
		; CHECK: %[[FAST_251:[0-9]+]] = extractelement <2 x float> %x, i64 1
		; CHECK: %[[FAST_25_RCP1:[0-9]+]] = call fast float @llvm.amdgcn.rcp.f32(float %[[FAST_251]])
		; CHECK: %[[FAST_25_MUL1:[0-9]+]] = fmul fast float 2.000000e+00, %[[FAST_25_RCP1]]
		; CHECK: %fast.25ulp = insertelement <2 x float> %[[FAST_25_INS0]], float %[[FAST_25_MUL1]], i64 1
		; CHECK: store volatile <2 x float> %fast.25ulp, <2 x float> addrspace(1)* %out
define amdgpu_kernel void @rcp_fdiv_fpmath_vector_nonsplat(<2 x float> addrspace(1)* %out, <2 x float> %x) #1 {		define amdgpu_kernel void @rcp_fdiv_fpmath_vector_nonsplat(<2 x float> addrspace(1)* %out, <2 x float> %x) #1 {
%no.md = fdiv <2 x float> <float 1.0, float 2.0>, %x		%no.md = fdiv <2 x float> <float 1.0, float 2.0>, %x
store volatile <2 x float> %no.md, <2 x float> addrspace(1)* %out		store volatile <2 x float> %no.md, <2 x float> addrspace(1)* %out

%arcp.no.md = fdiv arcp <2 x float> <float 1.0, float 2.0>, %x		%arcp.no.md = fdiv arcp <2 x float> <float 1.0, float 2.0>, %x
store volatile <2 x float> %arcp.no.md, <2 x float> addrspace(1)* %out		store volatile <2 x float> %arcp.no.md, <2 x float> addrspace(1)* %out

%fast.no.md = fdiv fast <2 x float> <float 1.0, float 2.0>, %x		%fast.no.md = fdiv fast <2 x float> <float 1.0, float 2.0>, %x
store volatile <2 x float> %fast.no.md, <2 x float> addrspace(1)* %out		store volatile <2 x float> %fast.no.md, <2 x float> addrspace(1)* %out

%arcp.25ulp = fdiv arcp <2 x float> <float 1.0, float 2.0>, %x, !fpmath !0		%arcp.25ulp = fdiv arcp <2 x float> <float 1.0, float 2.0>, %x, !fpmath !0
store volatile <2 x float> %arcp.25ulp, <2 x float> addrspace(1)* %out		store volatile <2 x float> %arcp.25ulp, <2 x float> addrspace(1)* %out

%fast.25ulp = fdiv fast <2 x float> <float 1.0, float 2.0>, %x, !fpmath !0		%fast.25ulp = fdiv fast <2 x float> <float 1.0, float 2.0>, %x, !fpmath !0
store volatile <2 x float> %fast.25ulp, <2 x float> addrspace(1)* %out		store volatile <2 x float> %fast.25ulp, <2 x float> addrspace(1)* %out

ret void		ret void
}		}

; FIXME: Should be able to get fdiv for 1.0 component
; CHECK-LABEL: @rcp_fdiv_fpmath_vector_partial_constant(		; CHECK-LABEL: @rcp_fdiv_fpmath_vector_partial_constant(
; CHECK: %arcp.25ulp = fdiv arcp <2 x float> %x.insert, %y, !fpmath !0		; CHECK: %[[ARCP_A0:[0-9]+]] = extractelement <2 x float> %x.insert, i64 0
		; CHECK: %[[ARCP_B0:[0-9]+]] = extractelement <2 x float> %y, i64 0
		; CHECK: %[[ARCP_RCP0:[0-9]+]] = call arcp float @llvm.amdgcn.rcp.f32(float %[[ARCP_B0]])
		; CHECK: %[[ARCP_MUL0:[0-9]+]] = fmul arcp float %[[ARCP_A0]], %[[ARCP_RCP0]], !fpmath !0
		; CHECK: %[[ARCP_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[ARCP_MUL0]], i64 0
		; CHECK: %[[ARCP_A1:[0-9]+]] = extractelement <2 x float> %x.insert, i64 1
		; CHECK: %[[ARCP_B1:[0-9]+]] = extractelement <2 x float> %y, i64 1
		; CHECK: %[[ARCP_RCP1:[0-9]+]] = call arcp float @llvm.amdgcn.rcp.f32(float %[[ARCP_B1]])
		; CHECK: %[[ARCP_MUL1:[0-9]+]] = fmul arcp float %[[ARCP_A1]], %[[ARCP_RCP1]], !fpmath !0
		; CHECK: %arcp.25ulp = insertelement <2 x float> %[[ARCP_INS0]], float %[[ARCP_MUL1]], i64 1
; CHECK: store volatile <2 x float> %arcp.25ulp		; CHECK: store volatile <2 x float> %arcp.25ulp

; CHECK: %fast.25ulp = fdiv fast <2 x float> %x.insert, %y, !fpmath !0		; CHECK: %[[FAST_A0:[0-9]+]] = extractelement <2 x float> %x.insert, i64 0
		; CHECK: %[[FAST_B0:[0-9]+]] = extractelement <2 x float> %y, i64 0
		; CHECK: %[[FAST_RCP0:[0-9]+]] = call fast float @llvm.amdgcn.rcp.f32(float %[[FAST_B0]])
		; CHECK: %[[FAST_MUL0:[0-9]+]] = fmul fast float %[[FAST_A0]], %[[FAST_RCP0]], !fpmath !0
		; CHECK: %[[FAST_INS0:[0-9]+]] = insertelement <2 x float> undef, float %[[FAST_MUL0]], i64 0
		; CHECK: %[[FAST_A1:[0-9]+]] = extractelement <2 x float> %x.insert, i64 1
		; CHECK: %[[FAST_B1:[0-9]+]] = extractelement <2 x float> %y, i64 1
		; CHECK: %[[FAST_RCP1:[0-9]+]] = call fast float @llvm.amdgcn.rcp.f32(float %[[FAST_B1]])
		; CHECK: %[[FAST_MUL1:[0-9]+]] = fmul fast float %[[FAST_A1]], %[[FAST_RCP1]], !fpmath !0
		; CHECK: %fast.25ulp = insertelement <2 x float> %[[FAST_INS0]], float %[[FAST_MUL1]], i64 1
; CHECK: store volatile <2 x float> %fast.25ulp		; CHECK: store volatile <2 x float> %fast.25ulp
define amdgpu_kernel void @rcp_fdiv_fpmath_vector_partial_constant(<2 x float> addrspace(1)* %out, <2 x float> %x, <2 x float> %y) #1 {		define amdgpu_kernel void @rcp_fdiv_fpmath_vector_partial_constant(<2 x float> addrspace(1)* %out, <2 x float> %x, <2 x float> %y) #1 {
%x.insert = insertelement <2 x float> %x, float 1.0, i32 0		%x.insert = insertelement <2 x float> %x, float 1.0, i32 0

%arcp.25ulp = fdiv arcp <2 x float> %x.insert, %y, !fpmath !0		%arcp.25ulp = fdiv arcp <2 x float> %x.insert, %y, !fpmath !0
store volatile <2 x float> %arcp.25ulp, <2 x float> addrspace(1)* %out		store volatile <2 x float> %arcp.25ulp, <2 x float> addrspace(1)* %out

%fast.25ulp = fdiv fast <2 x float> %x.insert, %y, !fpmath !0		%fast.25ulp = fdiv fast <2 x float> %x.insert, %y, !fpmath !0
store volatile <2 x float> %fast.25ulp, <2 x float> addrspace(1)* %out		store volatile <2 x float> %fast.25ulp, <2 x float> addrspace(1)* %out

ret void		ret void
}		}

; CHECK-LABEL: @fdiv_fpmath_f32_denormals(		; CHECK-LABEL: @fdiv_fpmath_f32_denormals(
; CHECK: %no.md = fdiv float %a, %b{{$}}		; CHECK: %no.md = fdiv float %a, %b{{$}}
; CHECK: %md.half.ulp = fdiv float %a, %b, !fpmath !1		; CHECK: %md.half.ulp = fdiv float %a, %b, !fpmath !1
; CHECK: %md.1ulp = fdiv float %a, %b, !fpmath !2		; CHECK: %md.1ulp = fdiv float %a, %b, !fpmath !2
; CHECK: %md.25ulp = fdiv float %a, %b, !fpmath !0		; CHECK: %md.25ulp = fdiv float %a, %b, !fpmath !0
; CHECK: %md.3ulp = fdiv float %a, %b, !fpmath !3		; CHECK: %md.3ulp = fdiv float %a, %b, !fpmath !3
; CHECK: %fast.md.25ulp = fdiv fast float %a, %b, !fpmath !0		; CHECK: %[[RCP_FAST:[0-9]+]] = call fast float @llvm.amdgcn.rcp.f32(float %b)
		; CHECK: %fast.md.25ulp = fmul fast float %a, %[[RCP_FAST]], !fpmath !0
; CHECK: %arcp.md.25ulp = fdiv arcp float %a, %b, !fpmath !0		; CHECK: %arcp.md.25ulp = fdiv arcp float %a, %b, !fpmath !0
define amdgpu_kernel void @fdiv_fpmath_f32_denormals(float addrspace(1)* %out, float %a, float %b) #2 {		define amdgpu_kernel void @fdiv_fpmath_f32_denormals(float addrspace(1)* %out, float %a, float %b) #2 {
%no.md = fdiv float %a, %b		%no.md = fdiv float %a, %b
store volatile float %no.md, float addrspace(1)* %out		store volatile float %no.md, float addrspace(1)* %out

%md.half.ulp = fdiv float %a, %b, !fpmath !1		%md.half.ulp = fdiv float %a, %b, !fpmath !1
store volatile float %md.half.ulp, float addrspace(1)* %out		store volatile float %md.half.ulp, float addrspace(1)* %out

%md.1ulp = fdiv float %a, %b, !fpmath !2		%md.1ulp = fdiv float %a, %b, !fpmath !2
[12 lines not shown]

llvm/test/CodeGen/AMDGPU/fdiv.ll

	[12 lines not shown]
	%b_ptr = getelementptr <4 x float>, <4 x float> addrspace(1)* %in, i32 1	%b_ptr = getelementptr <4 x float>, <4 x float> addrspace(1)* %in, i32 1
	%a = load <4 x float>, <4 x float> addrspace(1) * %in	%a = load <4 x float>, <4 x float> addrspace(1) * %in
	%b = load <4 x float>, <4 x float> addrspace(1) * %b_ptr	%b = load <4 x float>, <4 x float> addrspace(1) * %b_ptr
	%result = fdiv arcp <4 x float> %a, %b	%result = fdiv arcp <4 x float> %a, %b
	store <4 x float> %result, <4 x float> addrspace(1)* %out	store <4 x float> %result, <4 x float> addrspace(1)* %out
	ret void	ret void
	}	}

		; FUNC-LABEL: {{^}}fdiv_f32_correctly_rounded_divide_sqrt:

		; GCN: v_div_scale_f32 [[NUM_SCALE:v[0-9]+]]
		; GCN-DAG: v_div_scale_f32 [[DEN_SCALE:v[0-9]+]]
		; GCN-DAG: v_rcp_f32_e32 [[NUM_RCP:v[0-9]+]], [[NUM_SCALE]]

		; PREGFX10: s_setreg_imm32_b32 hwreg(HW_REG_MODE, 4, 2), 3
		; GFX10: s_denorm_mode 15
		; GCN: v_fma_f32 [[A:v[0-9]+]], -[[NUM_SCALE]], [[NUM_RCP]], 1.0
		; GCN: v_fma_f32 [[B:v[0-9]+]], [[A]], [[NUM_RCP]], [[NUM_RCP]]
		; GCN: v_mul_f32_e32 [[C:v[0-9]+]], [[DEN_SCALE]], [[B]]
		; GCN: v_fma_f32 [[D:v[0-9]+]], -[[NUM_SCALE]], [[C]], [[DEN_SCALE]]
		; GCN: v_fma_f32 [[E:v[0-9]+]], [[D]], [[B]], [[C]]
		; GCN: v_fma_f32 [[F:v[0-9]+]], -[[NUM_SCALE]], [[E]], [[DEN_SCALE]]
		; PREGFX10: s_setreg_imm32_b32 hwreg(HW_REG_MODE, 4, 2), 0
		; GFX10: s_denorm_mode 12
		; GCN: v_div_fmas_f32 [[FMAS:v[0-9]+]], [[F]], [[B]], [[E]]
		; GCN: v_div_fixup_f32 v{{[0-9]+}}, [[FMAS]],

		define amdgpu_kernel void @fdiv_f32_correctly_rounded_divide_sqrt(float addrspace(1)* %out, float %a) #0 {
		entry:
		%fdiv = fdiv float 1.000000e+00, %a
		store float %fdiv, float addrspace(1)* %out
		ret void
		}


		; FUNC-LABEL: {{^}}fdiv_f32_denorms_correctly_rounded_divide_sqrt:

		; GCN: v_div_scale_f32 [[NUM_SCALE:v[0-9]+]]
		; GCN-DAG: v_rcp_f32_e32 [[NUM_RCP:v[0-9]+]], [[NUM_SCALE]]

		; PREGFX10-DAG: v_div_scale_f32 [[DEN_SCALE:v[0-9]+]]
		; PREGFX10-NOT: s_setreg
		; PREGFX10: v_fma_f32 [[A:v[0-9]+]], -[[NUM_SCALE]], [[NUM_RCP]], 1.0
		; PREGFX10: v_fma_f32 [[B:v[0-9]+]], [[A]], [[NUM_RCP]], [[NUM_RCP]]
		; PREGFX10: v_mul_f32_e32 [[C:v[0-9]+]], [[DEN_SCALE]], [[B]]
		; PREGFX10: v_fma_f32 [[D:v[0-9]+]], -[[NUM_SCALE]], [[C]], [[DEN_SCALE]]
		; PREGFX10: v_fma_f32 [[E:v[0-9]+]], [[D]], [[B]], [[C]]
		; PREGFX10: v_fma_f32 [[F:v[0-9]+]], -[[NUM_SCALE]], [[E]], [[DEN_SCALE]]
		; PREGFX10-NOT: s_setreg

		; GFX10-NOT: s_denorm_mode
		; GFX10: v_fma_f32 [[A:v[0-9]+]], -[[NUM_SCALE]], [[NUM_RCP]], 1.0
		; GFX10: v_fmac_f32_e32 [[B:v[0-9]+]], [[A]], [[NUM_RCP]]
		; GFX10: v_div_scale_f32 [[DEN_SCALE:v[0-9]+]]
		; GFX10: v_mul_f32_e32 [[C:v[0-9]+]], [[DEN_SCALE]], [[B]]
		; GFX10: v_fma_f32 [[D:v[0-9]+]], [[C]], -[[NUM_SCALE]], [[DEN_SCALE]]
		; GFX10: v_fmac_f32_e32 [[E:v[0-9]+]], [[D]], [[B]]
		; GFX10: v_fmac_f32_e64 [[F:v[0-9]+]], -[[NUM_SCALE]], [[E]]
		; GFX10-NOT: s_denorm_mode

		; GCN: v_div_fmas_f32 [[FMAS:v[0-9]+]], [[F]], [[B]], [[E]]
		; GCN: v_div_fixup_f32 v{{[0-9]+}}, [[FMAS]],
		define amdgpu_kernel void @fdiv_f32_denorms_correctly_rounded_divide_sqrt(float addrspace(1)* %out, float %a) #2 {
		entry:
		%fdiv = fdiv float 1.000000e+00, %a
		store float %fdiv, float addrspace(1)* %out
		ret void
		}


	attributes #0 = { nounwind "enable-unsafe-fp-math"="false" "target-features"="-fp32-denormals,+fp64-fp16-denormals,-flat-for-global" }	attributes #0 = { nounwind "enable-unsafe-fp-math"="false" "target-features"="-fp32-denormals,+fp64-fp16-denormals,-flat-for-global" }
	attributes #1 = { nounwind "enable-unsafe-fp-math"="true" "target-features"="-fp32-denormals,-flat-for-global" }	attributes #1 = { nounwind "enable-unsafe-fp-math"="true" "target-features"="-fp32-denormals,-flat-for-global" }
	attributes #2 = { nounwind "enable-unsafe-fp-math"="false" "target-features"="+fp32-denormals,-flat-for-global" }	attributes #2 = { nounwind "enable-unsafe-fp-math"="false" "target-features"="+fp32-denormals,-flat-for-global" }

	!0 = !{float 2.500000e+00}	!0 = !{float 2.500000e+00}

llvm/test/CodeGen/AMDGPU/fdiv32-to-rcp-folding.ll

	[12 lines not shown]
	}	}

	; GCN-LABEL: {{^}}div_1_by_x_fast:	; GCN-LABEL: {{^}}div_1_by_x_fast:
	; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0	; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
	; GCN: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[VAL]]	; GCN: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[VAL]]
	; GCN: global_store_dword v[{{[0-9:]+}}], [[RCP]], off	; GCN: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
	define amdgpu_kernel void @div_1_by_x_fast(float addrspace(1)* %arg) {	define amdgpu_kernel void @div_1_by_x_fast(float addrspace(1)* %arg) {
	%load = load float, float addrspace(1)* %arg, align 4	%load = load float, float addrspace(1)* %arg, align 4
	%div = fdiv fast float 1.000000e+00, %load	%div = fdiv fast float 1.000000e+00, %load, !fpmath !0
	store float %div, float addrspace(1)* %arg, align 4	store float %div, float addrspace(1)* %arg, align 4
	ret void	ret void
	}	}

	; GCN-LABEL: {{^}}div_minus_1_by_x_fast:	; GCN-LABEL: {{^}}div_minus_1_by_x_fast:
	; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0	; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
	; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], -[[VAL]]	; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], -[[VAL]]
	; GCN: global_store_dword v[{{[0-9:]+}}], [[RCP]], off	; GCN: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
	define amdgpu_kernel void @div_minus_1_by_x_fast(float addrspace(1)* %arg) {	define amdgpu_kernel void @div_minus_1_by_x_fast(float addrspace(1)* %arg) {
	%load = load float, float addrspace(1)* %arg, align 4	%load = load float, float addrspace(1)* %arg, align 4
	%div = fdiv fast float -1.000000e+00, %load	%div = fdiv fast float -1.000000e+00, %load, !fpmath !0
	store float %div, float addrspace(1)* %arg, align 4	store float %div, float addrspace(1)* %arg, align 4
	ret void	ret void
	}	}

	; GCN-LABEL: {{^}}div_1_by_minus_x_fast:	; GCN-LABEL: {{^}}div_1_by_minus_x_fast:
	; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0	; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
	; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], -[[VAL]]	; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], -[[VAL]]
	; GCN: global_store_dword v[{{[0-9:]+}}], [[RCP]], off	; GCN: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
	define amdgpu_kernel void @div_1_by_minus_x_fast(float addrspace(1)* %arg) {	define amdgpu_kernel void @div_1_by_minus_x_fast(float addrspace(1)* %arg) {
	%load = load float, float addrspace(1)* %arg, align 4	%load = load float, float addrspace(1)* %arg, align 4
	%neg = fsub float -0.000000e+00, %load	%neg = fsub float -0.000000e+00, %load, !fpmath !0
	%div = fdiv fast float 1.000000e+00, %neg	%div = fdiv fast float 1.000000e+00, %neg
	store float %div, float addrspace(1)* %arg, align 4	store float %div, float addrspace(1)* %arg, align 4
	ret void	ret void
	}	}

	; GCN-LABEL: {{^}}div_minus_1_by_minus_x_fast:	; GCN-LABEL: {{^}}div_minus_1_by_minus_x_fast:
	; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0	; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
	; GCN: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[VAL]]	; GCN: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[VAL]]
	; GCN: global_store_dword v[{{[0-9:]+}}], [[RCP]], off	; GCN: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
	define amdgpu_kernel void @div_minus_1_by_minus_x_fast(float addrspace(1)* %arg) {	define amdgpu_kernel void @div_minus_1_by_minus_x_fast(float addrspace(1)* %arg) {
	%load = load float, float addrspace(1)* %arg, align 4	%load = load float, float addrspace(1)* %arg, align 4
	%neg = fsub float -0.000000e+00, %load	%neg = fsub float -0.000000e+00, %load, !fpmath !0
	%div = fdiv fast float -1.000000e+00, %neg	%div = fdiv fast float -1.000000e+00, %neg
	store float %div, float addrspace(1)* %arg, align 4	store float %div, float addrspace(1)* %arg, align 4
	ret void	ret void
	}	}

	; GCN-LABEL: {{^}}div_1_by_x_correctly_rounded:	; GCN-LABEL: {{^}}div_1_by_x_correctly_rounded:
	; GCN-DENORM-DAG: v_div_scale_f32	; GCN-DAG: v_div_scale_f32
	; GCN-DENORM-DAG: v_rcp_f32_e32	; GCN-DAG: v_rcp_f32_e32
	; GCN-DENORM-DAG: v_div_scale_f32	; GCN-DAG: v_div_scale_f32
	; GCN-DENORM: v_div_fmas_f32	; GCN: v_div_fmas_f32
	; GCN-DENORM: v_div_fixup_f32	; GCN: v_div_fixup_f32

	; GCN-FLUSH: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
	; GCN-FLUSH: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[VAL]]
	; GCN-FLUSH: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
	define amdgpu_kernel void @div_1_by_x_correctly_rounded(float addrspace(1)* %arg) {	define amdgpu_kernel void @div_1_by_x_correctly_rounded(float addrspace(1)* %arg) {
	%load = load float, float addrspace(1)* %arg, align 4	%load = load float, float addrspace(1)* %arg, align 4
	%div = fdiv float 1.000000e+00, %load	%div = fdiv float 1.000000e+00, %load
	store float %div, float addrspace(1)* %arg, align 4	store float %div, float addrspace(1)* %arg, align 4
	ret void	ret void
	}	}

	; GCN-LABEL: {{^}}div_minus_1_by_x_correctly_rounded:	; GCN-LABEL: {{^}}div_minus_1_by_x_correctly_rounded:
	; GCN-DENORM-DAG: v_div_scale_f32	; GCN-DAG: v_div_scale_f32
	; GCN-DENORM-DAG: v_rcp_f32_e32	; GCN-DAG: v_rcp_f32_e32
	; GCN-DENORM-DAG: v_div_scale_f32	; GCN-DAG: v_div_scale_f32
	; GCN-DENORM: v_div_fmas_f32	; GCN: v_div_fmas_f32
	; GCN-DENORM: v_div_fixup_f32	; GCN: v_div_fixup_f32

	; GCN-FLUSH: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
	; GCN-FLUSH: v_rcp_f32_e64 [[RCP:v[0-9]+]], -[[VAL]]
	; GCN-FLUSH: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
	define amdgpu_kernel void @div_minus_1_by_x_correctly_rounded(float addrspace(1)* %arg) {	define amdgpu_kernel void @div_minus_1_by_x_correctly_rounded(float addrspace(1)* %arg) {
	%load = load float, float addrspace(1)* %arg, align 4	%load = load float, float addrspace(1)* %arg, align 4
	%div = fdiv float -1.000000e+00, %load	%div = fdiv float -1.000000e+00, %load
	store float %div, float addrspace(1)* %arg, align 4	store float %div, float addrspace(1)* %arg, align 4
	ret void	ret void
	}	}

	; GCN-LABEL: {{^}}div_1_by_minus_x_correctly_rounded:	; GCN-LABEL: {{^}}div_1_by_minus_x_correctly_rounded:
	; GCN-DENORM-DAG: v_div_scale_f32	; GCN-DAG: v_div_scale_f32
	; GCN-DENORM-DAG: v_rcp_f32_e32	; GCN-DAG: v_rcp_f32_e32
	; GCN-DENORM-DAG: v_div_scale_f32	; GCN-DAG: v_div_scale_f32
	; GCN-DENORM: v_div_fmas_f32	; GCN: v_div_fmas_f32
	; GCN-DENORM: v_div_fixup_f32	; GCN: v_div_fixup_f32

	; GCN-FLUSH: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
	; GCN-FLUSH: v_rcp_f32_e64 [[RCP:v[0-9]+]], -[[VAL]]
	; GCN-FLUSH: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
	define amdgpu_kernel void @div_1_by_minus_x_correctly_rounded(float addrspace(1)* %arg) {	define amdgpu_kernel void @div_1_by_minus_x_correctly_rounded(float addrspace(1)* %arg) {
	%load = load float, float addrspace(1)* %arg, align 4	%load = load float, float addrspace(1)* %arg, align 4
	%neg = fsub float -0.000000e+00, %load	%neg = fsub float -0.000000e+00, %load
	%div = fdiv float 1.000000e+00, %neg	%div = fdiv float 1.000000e+00, %neg
	store float %div, float addrspace(1)* %arg, align 4	store float %div, float addrspace(1)* %arg, align 4
	ret void	ret void
	}	}

	; GCN-LABEL: {{^}}div_minus_1_by_minus_x_correctly_rounded:	; GCN-LABEL: {{^}}div_minus_1_by_minus_x_correctly_rounded:
	; GCN-DENORM-DAG: v_div_scale_f32	; GCN-DAG: v_div_scale_f32
	; GCN-DENORM-DAG: v_rcp_f32_e32	; GCN-DAG: v_rcp_f32_e32
	; GCN-DENORM-DAG: v_div_scale_f32	; GCN-DAG: v_div_scale_f32
	; GCN-DENORM: v_div_fmas_f32	; GCN: v_div_fmas_f32
	; GCN-DENORM: v_div_fixup_f32	; GCN: v_div_fixup_f32

	; GCN-FLUSH: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
	; GCN-FLUSH: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[VAL]]
	; GCN-FLUSH: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
	define amdgpu_kernel void @div_minus_1_by_minus_x_correctly_rounded(float addrspace(1)* %arg) {	define amdgpu_kernel void @div_minus_1_by_minus_x_correctly_rounded(float addrspace(1)* %arg) {
	%load = load float, float addrspace(1)* %arg, align 4	%load = load float, float addrspace(1)* %arg, align 4
	%neg = fsub float -0.000000e+00, %load	%neg = fsub float -0.000000e+00, %load
	%div = fdiv float -1.000000e+00, %neg	%div = fdiv float -1.000000e+00, %neg
	store float %div, float addrspace(1)* %arg, align 4	store float %div, float addrspace(1)* %arg, align 4
	ret void	ret void
	}	}

	!0 = !{float 2.500000e+00}	!0 = !{float 2.500000e+00}
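
The `!fpmath` metadata used throughout these tests bounds the acceptable error of a single floating-point instruction, in ULPs. As a rough illustration (a sketch only, not part of the patch; the kernel names `@rcp_relaxed` and `@rcp_strict` are hypothetical), the two kernels below differ only in whether the `fdiv` carries `!fpmath !0`. Going by the check lines in this file, the annotated division would be a candidate for the `v_rcp_f32` path, while the unannotated one would be expected to take the `v_div_scale` / `v_div_fmas` / `v_div_fixup` expansion. The exact output depends on the subtarget and denormal mode.

```llvm
; Hypothetical kernels for illustration only; not taken from the patch.
; Typed-pointer syntax is used to match the tests in this revision.

; Up to 2.5 ULP of error is acceptable, so a reciprocal approximation
; may be selected for this division.
define amdgpu_kernel void @rcp_relaxed(float addrspace(1)* %arg) {
  %load = load float, float addrspace(1)* %arg, align 4
  %div = fdiv float 1.000000e+00, %load, !fpmath !0
  store float %div, float addrspace(1)* %arg, align 4
  ret void
}

; No !fpmath metadata: no relaxed error bound is stated, so the
; division is treated as requiring a correctly rounded result.
define amdgpu_kernel void @rcp_strict(float addrspace(1)* %arg) {
  %load = load float, float addrspace(1)* %arg, align 4
  %div = fdiv float 1.000000e+00, %load
  store float %div, float addrspace(1)* %arg, align 4
  ret void
}

; Maximum acceptable inaccuracy: 2.5 ULP.
!0 = !{float 2.500000e+00}
```

Attaching the bound per instruction, rather than relying on a function attribute, is the approach the reviewer asked for in the discussion above.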

llvm/test/CodeGen/AMDGPU/fneg-combines.ll

	Show All 12 Lines
	ret void	ret void
	}	}

	; This one asserted with -enable-no-signed-zeros-fp-math	; This one asserted with -enable-no-signed-zeros-fp-math
	; GCN-LABEL: {{^}}fneg_fadd_0:	; GCN-LABEL: {{^}}fneg_fadd_0:
	; GCN-SAFE-DAG: v_mad_f32 [[A:v[0-9]+]],	; GCN-SAFE-DAG: v_mad_f32 [[A:v[0-9]+]],
	; GCN-SAFE-DAG: v_cmp_ngt_f32_e32 {{.*}}, [[A]]	; GCN-SAFE-DAG: v_cmp_ngt_f32_e32 {{.*}}, [[A]]
	; GCN-SAFE-DAG: v_cndmask_b32_e64 v{{[0-9]+}}, -[[A]]	; GCN-SAFE-DAG: v_cndmask_b32_e64 v{{[0-9]+}}, -[[A]]
		define amdgpu_ps float @fneg_fadd_0(float inreg %tmp2, float inreg %tmp6, <4 x i32> %arg) local_unnamed_addr #0 {
		.entry:
		%tmp7 = fdiv float 1.000000e+00, %tmp6
		%tmp8 = fmul float 0.000000e+00, %tmp7
		%tmp9 = fmul reassoc nnan arcp contract float 0.000000e+00, %tmp8
		%.i188 = fadd float %tmp9, 0.000000e+00
		%tmp10 = fcmp uge float %.i188, %tmp2
		%tmp11 = fsub float -0.000000e+00, %.i188
		%.i092 = select i1 %tmp10, float %tmp2, float %tmp11
		%tmp12 = fcmp ule float %.i092, 0.000000e+00
		%.i198 = select i1 %tmp12, float 0.000000e+00, float 0x7FF8000000000000
		ret float %.i198
		}

		; This is a workaround because -enable-no-signed-zeros-fp-math does not set up
		; function attribute unsafe-fp-math automatically. Combine with the previous test
		; when that is done.
		; GCN-LABEL: {{^}}fneg_fadd_0_nsz:
	; GCN-NSZ-DAG: v_rcp_f32_e32 [[A:v[0-9]+]],	; GCN-NSZ-DAG: v_rcp_f32_e32 [[A:v[0-9]+]],
	; GCN-NSZ-DAG: v_mov_b32_e32 [[B:v[0-9]+]],	; GCN-NSZ-DAG: v_mov_b32_e32 [[B:v[0-9]+]],
	; GCN-NSZ-DAG: v_mov_b32_e32 [[C:v[0-9]+]],	; GCN-NSZ-DAG: v_mov_b32_e32 [[C:v[0-9]+]],
	; GCN-NSZ-DAG: v_mul_f32_e32 [[D:v[0-9]+]],	; GCN-NSZ-DAG: v_mul_f32_e32 [[D:v[0-9]+]],
	; GCN-NSZ-DAG: v_cmp_nlt_f32_e64 {{.*}}, -[[D]]	; GCN-NSZ-DAG: v_cmp_nlt_f32_e64 {{.*}}, -[[D]]
		define amdgpu_ps float @fneg_fadd_0_nsz(float inreg %tmp2, float inreg %tmp6, <4 x i32> %arg) local_unnamed_addr #2 {
	define amdgpu_ps float @fneg_fadd_0(float inreg %tmp2, float inreg %tmp6, <4 x i32> %arg) local_unnamed_addr #0 {
	.entry:	.entry:
	%tmp7 = fdiv float 1.000000e+00, %tmp6	%tmp7 = fdiv float 1.000000e+00, %tmp6
	%tmp8 = fmul float 0.000000e+00, %tmp7	%tmp8 = fmul float 0.000000e+00, %tmp7
	%tmp9 = fmul reassoc nnan arcp contract float 0.000000e+00, %tmp8	%tmp9 = fmul reassoc nnan arcp contract float 0.000000e+00, %tmp8
	%.i188 = fadd float %tmp9, 0.000000e+00	%.i188 = fadd float %tmp9, 0.000000e+00
	%tmp10 = fcmp uge float %.i188, %tmp2	%tmp10 = fcmp uge float %.i188, %tmp2
	%tmp11 = fsub float -0.000000e+00, %.i188	%tmp11 = fsub float -0.000000e+00, %.i188
	%.i092 = select i1 %tmp10, float %tmp2, float %tmp11	%.i092 = select i1 %tmp10, float %tmp2, float %tmp11
	Show All 24 Lines
	declare float @llvm.amdgcn.rcp.f32(float) #1	declare float @llvm.amdgcn.rcp.f32(float) #1
	declare float @llvm.amdgcn.rcp.legacy(float) #1	declare float @llvm.amdgcn.rcp.legacy(float) #1
	declare float @llvm.amdgcn.fmul.legacy(float, float) #1	declare float @llvm.amdgcn.fmul.legacy(float, float) #1
	declare float @llvm.amdgcn.interp.p1(float, i32, i32, i32) #0	declare float @llvm.amdgcn.interp.p1(float, i32, i32, i32) #0
	declare float @llvm.amdgcn.interp.p2(float, float, i32, i32, i32) #0	declare float @llvm.amdgcn.interp.p2(float, float, i32, i32, i32) #0

	attributes #0 = { nounwind }	attributes #0 = { nounwind }
	attributes #1 = { nounwind readnone }	attributes #1 = { nounwind readnone }
		attributes #2 = { nounwind "unsafe-fp-math"="true" }

llvm/test/CodeGen/AMDGPU/known-never-snan.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=GCN %s	; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=fiji -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefix=GCN %s

	; Mostly overlaps with fmed3.ll to stress specific cases of	; Mostly overlaps with fmed3.ll to stress specific cases of
	; isKnownNeverSNaN.	; isKnownNeverSNaN.

	define float @v_test_known_not_snan_fabs_input_fmed3_r_i_i_f32(float %a) #0 {	define float @v_test_known_not_snan_fabs_input_fmed3_r_i_i_f32(float %a) #0 {
	; GCN-LABEL: v_test_known_not_snan_fabs_input_fmed3_r_i_i_f32:	; GCN-LABEL: v_test_known_not_snan_fabs_input_fmed3_r_i_i_f32:
	; GCN: ; %bb.0:	; GCN: ; %bb.0:
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_rcp_f32_e32 v0, v0	; GCN-NEXT: v_rcp_f32_e32 v0, v0
	; GCN-NEXT: v_med3_f32 v0, |v0|, 2.0, 4.0	; GCN-NEXT: v_med3_f32 v0, |v0|, 2.0, 4.0
	; GCN-NEXT: s_setpc_b64 s[30:31]	; GCN-NEXT: s_setpc_b64 s[30:31]
	%a.nnan.add = fdiv nnan float 1.0, %a	%a.nnan.add = fdiv nnan float 1.0, %a, !fpmath !0
	%known.not.snan = call float @llvm.fabs.f32(float %a.nnan.add)	%known.not.snan = call float @llvm.fabs.f32(float %a.nnan.add)
	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)
	%med = call float @llvm.minnum.f32(float %max, float 4.0)	%med = call float @llvm.minnum.f32(float %max, float 4.0)
	ret float %med	ret float %med
	}	}

	define float @v_test_known_not_snan_fneg_input_fmed3_r_i_i_f32(float %a) #0 {	define float @v_test_known_not_snan_fneg_input_fmed3_r_i_i_f32(float %a) #0 {
	; GCN-LABEL: v_test_known_not_snan_fneg_input_fmed3_r_i_i_f32:	; GCN-LABEL: v_test_known_not_snan_fneg_input_fmed3_r_i_i_f32:
	; GCN: ; %bb.0:	; GCN: ; %bb.0:
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_rcp_f32_e64 v0, -v0	; GCN-NEXT: v_rcp_f32_e32 v0, v0
	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0	; GCN-NEXT: v_med3_f32 v0, -v0, 2.0, 4.0
	; GCN-NEXT: s_setpc_b64 s[30:31]	; GCN-NEXT: s_setpc_b64 s[30:31]
	%a.nnan.add = fdiv nnan float 1.0, %a	%a.nnan.add = fdiv nnan float 1.0, %a, !fpmath !0
	%known.not.snan = fsub float -0.0, %a.nnan.add	%known.not.snan = fsub float -0.0, %a.nnan.add
	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)
	%med = call float @llvm.minnum.f32(float %max, float 4.0)	%med = call float @llvm.minnum.f32(float %max, float 4.0)
	ret float %med	ret float %med
	}	}

	define float @v_test_known_not_snan_fpext_input_fmed3_r_i_i_f32(half %a) #0 {	define float @v_test_known_not_snan_fpext_input_fmed3_r_i_i_f32(half %a) #0 {
	; GCN-LABEL: v_test_known_not_snan_fpext_input_fmed3_r_i_i_f32:	; GCN-LABEL: v_test_known_not_snan_fpext_input_fmed3_r_i_i_f32:
	Show All 24 Lines
	; GCN-LABEL: v_test_known_not_snan_copysign_input_fmed3_r_i_i_f32:	; GCN-LABEL: v_test_known_not_snan_copysign_input_fmed3_r_i_i_f32:
	; GCN: ; %bb.0:	; GCN: ; %bb.0:
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_rcp_f32_e32 v0, v0	; GCN-NEXT: v_rcp_f32_e32 v0, v0
	; GCN-NEXT: s_brev_b32 s4, -2	; GCN-NEXT: s_brev_b32 s4, -2
	; GCN-NEXT: v_bfi_b32 v0, s4, v0, v1	; GCN-NEXT: v_bfi_b32 v0, s4, v0, v1
	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0
	; GCN-NEXT: s_setpc_b64 s[30:31]	; GCN-NEXT: s_setpc_b64 s[30:31]
	%a.nnan.add = fdiv nnan float 1.0, %a	%a.nnan.add = fdiv nnan float 1.0, %a, !fpmath !0
	%known.not.snan = call float @llvm.copysign.f32(float %a.nnan.add, float %sign)	%known.not.snan = call float @llvm.copysign.f32(float %a.nnan.add, float %sign)
	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)
	%med = call float @llvm.minnum.f32(float %max, float 4.0)	%med = call float @llvm.minnum.f32(float %max, float 4.0)
	ret float %med	ret float %med
	}	}

	; Canonicalize always quiets, so nothing is necessary.	; Canonicalize always quiets, so nothing is necessary.
	define float @v_test_known_canonicalize_input_fmed3_r_i_i_f32(float %a) #0 {	define float @v_test_known_canonicalize_input_fmed3_r_i_i_f32(float %a) #0 {
	Show All 13 Lines
	; GCN-LABEL: v_test_known_not_snan_minnum_input_fmed3_r_i_i_f32:	; GCN-LABEL: v_test_known_not_snan_minnum_input_fmed3_r_i_i_f32:
	; GCN: ; %bb.0:	; GCN: ; %bb.0:
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_rcp_f32_e32 v0, v0	; GCN-NEXT: v_rcp_f32_e32 v0, v0
	; GCN-NEXT: v_add_f32_e32 v1, 1.0, v1	; GCN-NEXT: v_add_f32_e32 v1, 1.0, v1
	; GCN-NEXT: v_min_f32_e32 v0, v0, v1	; GCN-NEXT: v_min_f32_e32 v0, v0, v1
	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0
	; GCN-NEXT: s_setpc_b64 s[30:31]	; GCN-NEXT: s_setpc_b64 s[30:31]
	%a.nnan.add = fdiv nnan float 1.0, %a	%a.nnan.add = fdiv nnan float 1.0, %a, !fpmath !0
	%b.nnan.add = fadd nnan float %b, 1.0	%b.nnan.add = fadd nnan float %b, 1.0
	%known.not.snan = call float @llvm.minnum.f32(float %a.nnan.add, float %b.nnan.add)	%known.not.snan = call float @llvm.minnum.f32(float %a.nnan.add, float %b.nnan.add)
	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)
	%med = call float @llvm.minnum.f32(float %max, float 4.0)	%med = call float @llvm.minnum.f32(float %max, float 4.0)
	ret float %med	ret float %med
	}	}

	define float @v_test_known_not_minnum_maybe_nan_src0_input_fmed3_r_i_i_f32(float %a, float %b) #0 {	define float @v_test_known_not_minnum_maybe_nan_src0_input_fmed3_r_i_i_f32(float %a, float %b) #0 {
	Show All 24 Lines
	; GCN-LABEL: v_minnum_possible_nan_rhs_input_fmed3_r_i_i_f32:	; GCN-LABEL: v_minnum_possible_nan_rhs_input_fmed3_r_i_i_f32:
	; GCN: ; %bb.0:	; GCN: ; %bb.0:
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_rcp_f32_e32 v0, v0	; GCN-NEXT: v_rcp_f32_e32 v0, v0
	; GCN-NEXT: v_mul_f32_e32 v1, 1.0, v1	; GCN-NEXT: v_mul_f32_e32 v1, 1.0, v1
	; GCN-NEXT: v_min_f32_e32 v0, v0, v1	; GCN-NEXT: v_min_f32_e32 v0, v0, v1
	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0
	; GCN-NEXT: s_setpc_b64 s[30:31]	; GCN-NEXT: s_setpc_b64 s[30:31]
	%a.nnan.add = fdiv nnan float 1.0, %a	%a.nnan.add = fdiv nnan float 1.0, %a, !fpmath !0
	%known.not.snan = call float @llvm.minnum.f32(float %a.nnan.add, float %b)	%known.not.snan = call float @llvm.minnum.f32(float %a.nnan.add, float %b)
	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)
	%med = call float @llvm.minnum.f32(float %max, float 4.0)	%med = call float @llvm.minnum.f32(float %max, float 4.0)
	ret float %med	ret float %med
	}	}

	define float @v_test_known_not_snan_maxnum_input_fmed3_r_i_i_f32(float %a, float %b) #0 {	define float @v_test_known_not_snan_maxnum_input_fmed3_r_i_i_f32(float %a, float %b) #0 {
	; GCN-LABEL: v_test_known_not_snan_maxnum_input_fmed3_r_i_i_f32:	; GCN-LABEL: v_test_known_not_snan_maxnum_input_fmed3_r_i_i_f32:
	; GCN: ; %bb.0:	; GCN: ; %bb.0:
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_rcp_f32_e32 v0, v0	; GCN-NEXT: v_rcp_f32_e32 v0, v0
	; GCN-NEXT: v_add_f32_e32 v1, 1.0, v1	; GCN-NEXT: v_add_f32_e32 v1, 1.0, v1
	; GCN-NEXT: v_max_f32_e32 v0, v0, v1	; GCN-NEXT: v_max_f32_e32 v0, v0, v1
	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0
	; GCN-NEXT: s_setpc_b64 s[30:31]	; GCN-NEXT: s_setpc_b64 s[30:31]
	%a.nnan.add = fdiv nnan float 1.0, %a	%a.nnan.add = fdiv nnan float 1.0, %a, !fpmath !0
	%b.nnan.add = fadd nnan float %b, 1.0	%b.nnan.add = fadd nnan float %b, 1.0
	%known.not.snan = call float @llvm.maxnum.f32(float %a.nnan.add, float %b.nnan.add)	%known.not.snan = call float @llvm.maxnum.f32(float %a.nnan.add, float %b.nnan.add)
	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)
	%med = call float @llvm.minnum.f32(float %max, float 4.0)	%med = call float @llvm.minnum.f32(float %max, float 4.0)
	ret float %med	ret float %med
	}	}

	define float @v_maxnum_possible_nan_lhs_input_fmed3_r_i_i_f32(float %a, float %b) #0 {	define float @v_maxnum_possible_nan_lhs_input_fmed3_r_i_i_f32(float %a, float %b) #0 {
	Show All 16 Lines
	; GCN-LABEL: v_maxnum_possible_nan_rhs_input_fmed3_r_i_i_f32:	; GCN-LABEL: v_maxnum_possible_nan_rhs_input_fmed3_r_i_i_f32:
	; GCN: ; %bb.0:	; GCN: ; %bb.0:
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_rcp_f32_e32 v0, v0	; GCN-NEXT: v_rcp_f32_e32 v0, v0
	; GCN-NEXT: v_mul_f32_e32 v1, 1.0, v1	; GCN-NEXT: v_mul_f32_e32 v1, 1.0, v1
	; GCN-NEXT: v_max_f32_e32 v0, v0, v1	; GCN-NEXT: v_max_f32_e32 v0, v0, v1
	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0
	; GCN-NEXT: s_setpc_b64 s[30:31]	; GCN-NEXT: s_setpc_b64 s[30:31]
	%a.nnan.add = fdiv nnan float 1.0, %a	%a.nnan.add = fdiv nnan float 1.0, %a, !fpmath !0
	%known.not.snan = call float @llvm.maxnum.f32(float %a.nnan.add, float %b)	%known.not.snan = call float @llvm.maxnum.f32(float %a.nnan.add, float %b)
	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)
	%med = call float @llvm.minnum.f32(float %max, float 4.0)	%med = call float @llvm.minnum.f32(float %max, float 4.0)
	ret float %med	ret float %med
	}	}

	define float @v_test_known_not_snan_select_input_fmed3_r_i_i_f32(float %a, float %b, i32 %c) #0 {	define float @v_test_known_not_snan_select_input_fmed3_r_i_i_f32(float %a, float %b, i32 %c) #0 {
	; GCN-LABEL: v_test_known_not_snan_select_input_fmed3_r_i_i_f32:	; GCN-LABEL: v_test_known_not_snan_select_input_fmed3_r_i_i_f32:
	; GCN: ; %bb.0:	; GCN: ; %bb.0:
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_rcp_f32_e32 v0, v0	; GCN-NEXT: v_rcp_f32_e32 v0, v0
	; GCN-NEXT: v_add_f32_e32 v1, 1.0, v1	; GCN-NEXT: v_add_f32_e32 v1, 1.0, v1
	; GCN-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2	; GCN-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2
	; GCN-NEXT: v_cndmask_b32_e32 v0, v1, v0, vcc	; GCN-NEXT: v_cndmask_b32_e32 v0, v1, v0, vcc
	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0
	; GCN-NEXT: s_setpc_b64 s[30:31]	; GCN-NEXT: s_setpc_b64 s[30:31]
	%a.nnan.add = fdiv nnan float 1.0, %a	%a.nnan.add = fdiv nnan float 1.0, %a, !fpmath !0
	%b.nnan.add = fadd nnan float %b, 1.0	%b.nnan.add = fadd nnan float %b, 1.0
	%cmp = icmp eq i32 %c, 0	%cmp = icmp eq i32 %c, 0
	%known.not.snan = select i1 %cmp, float %a.nnan.add, float %b.nnan.add	%known.not.snan = select i1 %cmp, float %a.nnan.add, float %b.nnan.add
	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)
	%med = call float @llvm.minnum.f32(float %max, float 4.0)	%med = call float @llvm.minnum.f32(float %max, float 4.0)
	ret float %med	ret float %med
	}	}

	Show All 20 Lines
	; GCN: ; %bb.0:	; GCN: ; %bb.0:
	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)	; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GCN-NEXT: v_rcp_f32_e32 v0, v0	; GCN-NEXT: v_rcp_f32_e32 v0, v0
	; GCN-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2	; GCN-NEXT: v_cmp_eq_u32_e32 vcc, 0, v2
	; GCN-NEXT: v_cndmask_b32_e32 v0, v1, v0, vcc	; GCN-NEXT: v_cndmask_b32_e32 v0, v1, v0, vcc
	; GCN-NEXT: v_mul_f32_e32 v0, 1.0, v0	; GCN-NEXT: v_mul_f32_e32 v0, 1.0, v0
	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0	; GCN-NEXT: v_med3_f32 v0, v0, 2.0, 4.0
	; GCN-NEXT: s_setpc_b64 s[30:31]	; GCN-NEXT: s_setpc_b64 s[30:31]
	%a.nnan.add = fdiv nnan float 1.0, %a	%a.nnan.add = fdiv nnan float 1.0, %a, !fpmath !0
	%cmp = icmp eq i32 %c, 0	%cmp = icmp eq i32 %c, 0
	%known.not.snan = select i1 %cmp, float %a.nnan.add, float %b	%known.not.snan = select i1 %cmp, float %a.nnan.add, float %b
	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)	%max = call float @llvm.maxnum.f32(float %known.not.snan, float 2.0)
	%med = call float @llvm.minnum.f32(float %max, float 4.0)	%med = call float @llvm.minnum.f32(float %max, float 4.0)
	ret float %med	ret float %med
	}	}

	define float @v_test_known_not_snan_fadd_input_fmed3_r_i_i_f32(float %a, float %b) #0 {	define float @v_test_known_not_snan_fadd_input_fmed3_r_i_i_f32(float %a, float %b) #0 {
	Show All 24 Lines
	declare float @llvm.amdgcn.frexp.mant.f32(float) #1	declare float @llvm.amdgcn.frexp.mant.f32(float) #1
	declare float @llvm.amdgcn.rcp.f32(float) #1	declare float @llvm.amdgcn.rcp.f32(float) #1
	declare float @llvm.amdgcn.rsq.f32(float) #1	declare float @llvm.amdgcn.rsq.f32(float) #1
	declare float @llvm.amdgcn.fract.f32(float) #1	declare float @llvm.amdgcn.fract.f32(float) #1
	declare float @llvm.amdgcn.cubeid(float, float, float) #0	declare float @llvm.amdgcn.cubeid(float, float, float) #0

	attributes #0 = { nounwind }	attributes #0 = { nounwind }
	attributes #1 = { nounwind readnone speculatable }	attributes #1 = { nounwind readnone speculatable }

		!0 = !{float 2.500000e+00}

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rcp.ll

	Show All 12 Lines
	ret void	ret void
	}	}

	; FUNC-LABEL: {{^}}safe_no_fp32_denormals_rcp_f32:	; FUNC-LABEL: {{^}}safe_no_fp32_denormals_rcp_f32:
	; SI: v_rcp_f32_e32 [[RESULT:v[0-9]+]], s{{[0-9]+}}	; SI: v_rcp_f32_e32 [[RESULT:v[0-9]+]], s{{[0-9]+}}
	; SI-NOT: [[RESULT]]	; SI-NOT: [[RESULT]]
	; SI: buffer_store_dword [[RESULT]]	; SI: buffer_store_dword [[RESULT]]
	define amdgpu_kernel void @safe_no_fp32_denormals_rcp_f32(float addrspace(1)* %out, float %src) #1 {	define amdgpu_kernel void @safe_no_fp32_denormals_rcp_f32(float addrspace(1)* %out, float %src) #1 {
	%rcp = fdiv float 1.0, %src	%rcp = fdiv float 1.0, %src, !fpmath !0
	store float %rcp, float addrspace(1)* %out, align 4	store float %rcp, float addrspace(1)* %out, align 4
	ret void	ret void
	}	}

	; FUNC-LABEL: {{^}}safe_f32_denormals_rcp_pat_f32:	; FUNC-LABEL: {{^}}safe_f32_denormals_rcp_pat_f32:
	; SI: v_rcp_f32_e32 [[RESULT:v[0-9]+]], s{{[0-9]+}}	; SI: v_rcp_f32_e32 [[RESULT:v[0-9]+]], s{{[0-9]+}}
	; SI-NOT: [[RESULT]]	; SI-NOT: [[RESULT]]
	; SI: buffer_store_dword [[RESULT]]	; SI: buffer_store_dword [[RESULT]]
	define amdgpu_kernel void @safe_f32_denormals_rcp_pat_f32(float addrspace(1)* %out, float %src) #4 {	define amdgpu_kernel void @safe_f32_denormals_rcp_pat_f32(float addrspace(1)* %out, float %src) #4 {
	%rcp = fdiv float 1.0, %src	%rcp = fdiv float 1.0, %src, !fpmath !0
	store float %rcp, float addrspace(1)* %out, align 4	store float %rcp, float addrspace(1)* %out, align 4
	ret void	ret void
	}	}

	; FUNC-LABEL: {{^}}unsafe_f32_denormals_rcp_pat_f32:	; FUNC-LABEL: {{^}}unsafe_f32_denormals_rcp_pat_f32:
	; SI: v_div_scale_f32	; SI: v_div_scale_f32
	define amdgpu_kernel void @unsafe_f32_denormals_rcp_pat_f32(float addrspace(1)* %out, float %src) #3 {	define amdgpu_kernel void @unsafe_f32_denormals_rcp_pat_f32(float addrspace(1)* %out, float %src) #3 {
	%rcp = fdiv float 1.0, %src	%rcp = fdiv float 1.0, %src
	store float %rcp, float addrspace(1)* %out, align 4	store float %rcp, float addrspace(1)* %out, align 4
	ret void	ret void
	}	}

	; FUNC-LABEL: {{^}}safe_rsq_rcp_pat_f32:	; FUNC-LABEL: {{^}}safe_rsq_rcp_pat_f32:
	; SI: v_sqrt_f32_e32	; SI: v_rsq_f32_e32
	; SI: v_rcp_f32_e32
	define amdgpu_kernel void @safe_rsq_rcp_pat_f32(float addrspace(1)* %out, float %src) #1 {	define amdgpu_kernel void @safe_rsq_rcp_pat_f32(float addrspace(1)* %out, float %src) #1 {
	%sqrt = call float @llvm.sqrt.f32(float %src)	%sqrt = call float @llvm.sqrt.f32(float %src)
	%rcp = call float @llvm.amdgcn.rcp.f32(float %sqrt)	%rcp = call float @llvm.amdgcn.rcp.f32(float %sqrt)
	store float %rcp, float addrspace(1)* %out, align 4	store float %rcp, float addrspace(1)* %out, align 4
	ret void	ret void
	}	}

	; FUNC-LABEL: {{^}}unsafe_rsq_rcp_pat_f32:	; FUNC-LABEL: {{^}}unsafe_rsq_rcp_pat_f32:
	Show All 24 Lines
	ret void	ret void
	}	}

	attributes #0 = { nounwind readnone }	attributes #0 = { nounwind readnone }
	attributes #1 = { nounwind "unsafe-fp-math"="false" "target-features"="-fp32-denormals" }	attributes #1 = { nounwind "unsafe-fp-math"="false" "target-features"="-fp32-denormals" }
	attributes #2 = { nounwind "unsafe-fp-math"="true" "target-features"="-fp32-denormals" }	attributes #2 = { nounwind "unsafe-fp-math"="true" "target-features"="-fp32-denormals" }
	attributes #3 = { nounwind "unsafe-fp-math"="false" "target-features"="+fp32-denormals" }	attributes #3 = { nounwind "unsafe-fp-math"="false" "target-features"="+fp32-denormals" }
	attributes #4 = { nounwind "unsafe-fp-math"="true" "target-features"="+fp32-denormals" }	attributes #4 = { nounwind "unsafe-fp-math"="true" "target-features"="+fp32-denormals" }

		!0 = !{float 2.500000e+00}

llvm/test/CodeGen/AMDGPU/mul24-pass-ordering.ll

	Show All 12 Lines
	; GFX9-NEXT: s_waitcnt lgkmcnt(0)	; GFX9-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-NEXT: s_setpc_b64 s[30:31]	; GFX9-NEXT: s_setpc_b64 s[30:31]
	bb:	bb:
	%tmp = icmp ult i32 %arg, %arg1	%tmp = icmp ult i32 %arg, %arg1
	br i1 %tmp, label %bb19, label %.loopexit	br i1 %tmp, label %bb19, label %.loopexit

	bb19: ; preds = %bb	bb19: ; preds = %bb
	%tmp20 = uitofp i32 %arg6 to float	%tmp20 = uitofp i32 %arg6 to float
	%tmp21 = fdiv float 1.000000e+00, %tmp20	%tmp21 = fdiv float 1.000000e+00, %tmp20, !fpmath !0
	%tmp22 = and i32 %arg6, 16777215	%tmp22 = and i32 %arg6, 16777215
	br label %bb23	br label %bb23

	.loopexit: ; preds = %bb23, %bb	.loopexit: ; preds = %bb23, %bb
	ret void	ret void

	bb23: ; preds = %bb19, %bb23	bb23: ; preds = %bb19, %bb23
	%tmp24 = phi i32 [ %arg, %bb19 ], [ %tmp47, %bb23 ]	%tmp24 = phi i32 [ %arg, %bb19 ], [ %tmp47, %bb23 ]
	Show All 24 Lines
	ret void	ret void
	}	}

	declare void @foo(i32) #0	declare void @foo(i32) #0
	declare float @llvm.fmuladd.f32(float, float, float) #1	declare float @llvm.fmuladd.f32(float, float, float) #1

	attributes #0 = { nounwind willreturn }	attributes #0 = { nounwind willreturn }
	attributes #1 = { nounwind readnone speculatable }	attributes #1 = { nounwind readnone speculatable }

		!0 = !{float 2.500000e+00}

llvm/test/CodeGen/AMDGPU/rcp-pattern.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=FUNC %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=FUNC %s
	; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=FUNC %s			; RUN: llc -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=GCN -check-prefix=FUNC %s
	; RUN: llc -march=r600 -mcpu=cypress -verify-machineinstrs < %s | FileCheck -check-prefix=EG -check-prefix=FUNC %s			; RUN: llc -march=r600 -mcpu=cypress -verify-machineinstrs < %s | FileCheck -check-prefix=EG -check-prefix=FUNC %s
	; RUN: llc -march=r600 -mcpu=cayman -verify-machineinstrs < %s | FileCheck -check-prefix=EG -check-prefix=FUNC %s			; RUN: llc -march=r600 -mcpu=cayman -verify-machineinstrs < %s | FileCheck -check-prefix=EG -check-prefix=FUNC %s

	; FUNC-LABEL: {{^}}rcp_pat_f32:			; FUNC-LABEL: {{^}}rcp_pat_f32:
	; GCN: s_load_dword [[SRC:s[0-9]+]]			; GCN: s_load_dword [[SRC:s[0-9]+]]
	; GCN: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[SRC]]			; GCN: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[SRC]]
	; GCN: buffer_store_dword [[RCP]]			; GCN: buffer_store_dword [[RCP]]

	; EG: RECIP_IEEE			; EG: RECIP_IEEE
	define amdgpu_kernel void @rcp_pat_f32(float addrspace(1)* %out, float %src) #0 {			define amdgpu_kernel void @rcp_pat_f32(float addrspace(1)* %out, float %src) #0 {
	%rcp = fdiv float 1.0, %src			%rcp = fdiv float 1.0, %src, !fpmath !0
	store float %rcp, float addrspace(1)* %out, align 4			store float %rcp, float addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}rcp_ulp25_pat_f32:			; FUNC-LABEL: {{^}}rcp_ulp25_pat_f32:
	; GCN: s_load_dword [[SRC:s[0-9]+]]			; GCN: s_load_dword [[SRC:s[0-9]+]]
	; GCN: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[SRC]]			; GCN: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[SRC]]
	; GCN: buffer_store_dword [[RCP]]			; GCN: buffer_store_dword [[RCP]]
	Show All 24 Lines
	; FUNC-LABEL: {{^}}rcp_fabs_pat_f32:			; FUNC-LABEL: {{^}}rcp_fabs_pat_f32:
	; GCN: s_load_dword [[SRC:s[0-9]+]]			; GCN: s_load_dword [[SRC:s[0-9]+]]
	; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], |[[SRC]]|			; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], |[[SRC]]|
	; GCN: buffer_store_dword [[RCP]]			; GCN: buffer_store_dword [[RCP]]

	; EG: RECIP_IEEE			; EG: RECIP_IEEE
	define amdgpu_kernel void @rcp_fabs_pat_f32(float addrspace(1)* %out, float %src) #0 {			define amdgpu_kernel void @rcp_fabs_pat_f32(float addrspace(1)* %out, float %src) #0 {
	%src.fabs = call float @llvm.fabs.f32(float %src)			%src.fabs = call float @llvm.fabs.f32(float %src)
	%rcp = fdiv float 1.0, %src.fabs			%rcp = fdiv float 1.0, %src.fabs, !fpmath !0
	store float %rcp, float addrspace(1)* %out, align 4			store float %rcp, float addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}neg_rcp_pat_f32:			; FUNC-LABEL: {{^}}neg_rcp_pat_f32:
	; GCN: s_load_dword [[SRC:s[0-9]+]]			; GCN: s_load_dword [[SRC:s[0-9]+]]
	; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], -[[SRC]]			; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], -[[SRC]]
	; GCN: buffer_store_dword [[RCP]]			; GCN: buffer_store_dword [[RCP]]

	; EG: RECIP_IEEE			; EG: RECIP_IEEE
	define amdgpu_kernel void @neg_rcp_pat_f32(float addrspace(1)* %out, float %src) #0 {			define amdgpu_kernel void @neg_rcp_pat_f32(float addrspace(1)* %out, float %src) #0 {
	%rcp = fdiv float -1.0, %src			%rcp = fdiv float -1.0, %src, !fpmath !0
	store float %rcp, float addrspace(1)* %out, align 4			store float %rcp, float addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}rcp_fabs_fneg_pat_f32:			; FUNC-LABEL: {{^}}rcp_fabs_fneg_pat_f32:
	; GCN: s_load_dword [[SRC:s[0-9]+]]			; GCN: s_load_dword [[SRC:s[0-9]+]]
	; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], -\|[[SRC]]\|			; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], -\|[[SRC]]\|
	; GCN: buffer_store_dword [[RCP]]			; GCN: buffer_store_dword [[RCP]]
	define amdgpu_kernel void @rcp_fabs_fneg_pat_f32(float addrspace(1)* %out, float %src) #0 {			define amdgpu_kernel void @rcp_fabs_fneg_pat_f32(float addrspace(1)* %out, float %src) #0 {
	%src.fabs = call float @llvm.fabs.f32(float %src)			%src.fabs = call float @llvm.fabs.f32(float %src)
	%src.fabs.fneg = fsub float -0.0, %src.fabs			%src.fabs.fneg = fsub float -0.0, %src.fabs
	%rcp = fdiv float 1.0, %src.fabs.fneg			%rcp = fdiv float 1.0, %src.fabs.fneg, !fpmath !0
	store float %rcp, float addrspace(1)* %out, align 4			store float %rcp, float addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}rcp_fabs_fneg_pat_multi_use_f32:			; FUNC-LABEL: {{^}}rcp_fabs_fneg_pat_multi_use_f32:
	; GCN: s_load_dword [[SRC:s[0-9]+]]			; GCN: s_load_dword [[SRC:s[0-9]+]]
	; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], -\|[[SRC]]\|			; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], -\|[[SRC]]\|
	; GCN: v_mul_f32_e64 [[MUL:v[0-9]+]], [[SRC]], -\|[[SRC]]\|			; GCN: v_mul_f32_e64 [[MUL:v[0-9]+]], [[SRC]], -\|[[SRC]]\|
	; GCN: buffer_store_dword [[RCP]]			; GCN: buffer_store_dword [[RCP]]
	; GCN: buffer_store_dword [[MUL]]			; GCN: buffer_store_dword [[MUL]]
	define amdgpu_kernel void @rcp_fabs_fneg_pat_multi_use_f32(float addrspace(1)* %out, float %src) #0 {			define amdgpu_kernel void @rcp_fabs_fneg_pat_multi_use_f32(float addrspace(1)* %out, float %src) #0 {
	%src.fabs = call float @llvm.fabs.f32(float %src)			%src.fabs = call float @llvm.fabs.f32(float %src)
	%src.fabs.fneg = fsub float -0.0, %src.fabs			%src.fabs.fneg = fsub float -0.0, %src.fabs
	%rcp = fdiv float 1.0, %src.fabs.fneg			%rcp = fdiv float 1.0, %src.fabs.fneg, !fpmath !0
	store volatile float %rcp, float addrspace(1)* %out, align 4			store volatile float %rcp, float addrspace(1)* %out, align 4

	%other = fmul float %src, %src.fabs.fneg			%other = fmul float %src, %src.fabs.fneg
	store volatile float %other, float addrspace(1)* %out, align 4			store volatile float %other, float addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; FUNC-LABEL: {{^}}div_arcp_2_x_pat_f32:			; FUNC-LABEL: {{^}}div_arcp_2_x_pat_f32:
	; ... 12 lines of collapsed diff context not shown ...

llvm/test/CodeGen/AMDGPU/rcp_iflag.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck --check-prefix=GCN %s			; RUN: llc -march=amdgcn -verify-machineinstrs < %s \| FileCheck --check-prefix=GCN %s

	; GCN-LABEL: {{^}}rcp_uint:			; GCN-LABEL: {{^}}rcp_uint:
	; GCN: v_rcp_iflag_f32_e32			; GCN: v_rcp_iflag_f32_e32
	define amdgpu_kernel void @rcp_uint(i32 addrspace(1)* %in, float addrspace(1)* %out) {			define amdgpu_kernel void @rcp_uint(i32 addrspace(1)* %in, float addrspace(1)* %out) {
	%load = load i32, i32 addrspace(1)* %in, align 4			%load = load i32, i32 addrspace(1)* %in, align 4
	%cvt = uitofp i32 %load to float			%cvt = uitofp i32 %load to float
	%div = fdiv float 1.000000e+00, %cvt			%div = fdiv float 1.000000e+00, %cvt, !fpmath !0
	store float %div, float addrspace(1)* %out, align 4			store float %div, float addrspace(1)* %out, align 4
	ret void			ret void
	}			}

	; GCN-LABEL: {{^}}rcp_sint:			; GCN-LABEL: {{^}}rcp_sint:
	; GCN: v_rcp_iflag_f32_e32			; GCN: v_rcp_iflag_f32_e32
	define amdgpu_kernel void @rcp_sint(i32 addrspace(1)* %in, float addrspace(1)* %out) {			define amdgpu_kernel void @rcp_sint(i32 addrspace(1)* %in, float addrspace(1)* %out) {
	%load = load i32, i32 addrspace(1)* %in, align 4			%load = load i32, i32 addrspace(1)* %in, align 4
	%cvt = sitofp i32 %load to float			%cvt = sitofp i32 %load to float
	%div = fdiv float 1.000000e+00, %cvt			%div = fdiv float 1.000000e+00, %cvt, !fpmath !0
	store float %div, float addrspace(1)* %out, align 4			store float %div, float addrspace(1)* %out, align 4
	ret void			ret void
	}			}

				!0 = !{float 2.500000e+00}

llvm/test/CodeGen/AMDGPU/rsq.ll

	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mattr=-fp32-denormals -verify-machineinstrs -enable-unsafe-fp-math < %s \| FileCheck -check-prefix=SI-UNSAFE -check-prefix=SI %s	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mattr=-fp32-denormals -verify-machineinstrs -enable-unsafe-fp-math < %s \| FileCheck -check-prefix=SI-UNSAFE -check-prefix=SI %s
	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mattr=-fp32-denormals -verify-machineinstrs < %s \| FileCheck -check-prefix=SI-SAFE -check-prefix=SI %s	; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mattr=-fp32-denormals -verify-machineinstrs < %s \| FileCheck -check-prefix=SI-SAFE -check-prefix=SI %s

	declare i32 @llvm.amdgcn.workitem.id.x() nounwind readnone	declare i32 @llvm.amdgcn.workitem.id.x() nounwind readnone
	declare float @llvm.sqrt.f32(float) nounwind readnone	declare float @llvm.sqrt.f32(float) nounwind readnone
	declare double @llvm.sqrt.f64(double) nounwind readnone	declare double @llvm.sqrt.f64(double) nounwind readnone

	; SI-LABEL: {{^}}rsq_f32:	; SI-LABEL: {{^}}rsq_f32:
	; SI: v_rsq_f32_e32	; SI: v_rsq_f32_e32
	; SI: s_endpgm	; SI: s_endpgm
	define amdgpu_kernel void @rsq_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) nounwind {	define amdgpu_kernel void @rsq_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) nounwind {
	%val = load float, float addrspace(1)* %in, align 4	%val = load float, float addrspace(1)* %in, align 4
	%sqrt = call float @llvm.sqrt.f32(float %val) nounwind readnone	%sqrt = call float @llvm.sqrt.f32(float %val) nounwind readnone
	%div = fdiv float 1.0, %sqrt	%div = fdiv float 1.0, %sqrt, !fpmath !0
	store float %div, float addrspace(1)* %out, align 4	store float %div, float addrspace(1)* %out, align 4
	ret void	ret void
	}	}

	; SI-LABEL: {{^}}rsq_f64:	; SI-LABEL: {{^}}rsq_f64:
	; SI-UNSAFE: v_rsq_f64_e32	; SI-UNSAFE: v_rsq_f64_e32
	; SI-SAFE: v_sqrt_f64_e32	; SI-SAFE: v_sqrt_f64_e32
	; SI: s_endpgm	; SI: s_endpgm
	define amdgpu_kernel void @rsq_f64(double addrspace(1)* noalias %out, double addrspace(1)* noalias %in) nounwind {	define amdgpu_kernel void @rsq_f64(double addrspace(1)* noalias %out, double addrspace(1)* noalias %in) nounwind {
	%val = load double, double addrspace(1)* %in, align 4	%val = load double, double addrspace(1)* %in, align 4
	%sqrt = call double @llvm.sqrt.f64(double %val) nounwind readnone	%sqrt = call double @llvm.sqrt.f64(double %val) nounwind readnone
	%div = fdiv double 1.0, %sqrt	%div = fdiv double 1.0, %sqrt
	store double %div, double addrspace(1)* %out, align 4	store double %div, double addrspace(1)* %out, align 4
	ret void	ret void
	}	}

	; SI-LABEL: {{^}}rsq_f32_sgpr:	; SI-LABEL: {{^}}rsq_f32_sgpr:
	; SI: v_rsq_f32_e32 {{v[0-9]+}}, {{s[0-9]+}}	; SI: v_rsq_f32_e32 {{v[0-9]+}}, {{s[0-9]+}}
	; SI: s_endpgm	; SI: s_endpgm
	define amdgpu_kernel void @rsq_f32_sgpr(float addrspace(1)* noalias %out, float %val) nounwind {	define amdgpu_kernel void @rsq_f32_sgpr(float addrspace(1)* noalias %out, float %val) nounwind {
	%sqrt = call float @llvm.sqrt.f32(float %val) nounwind readnone	%sqrt = call float @llvm.sqrt.f32(float %val) nounwind readnone
	%div = fdiv float 1.0, %sqrt	%div = fdiv float 1.0, %sqrt, !fpmath !0
	store float %div, float addrspace(1)* %out, align 4	store float %div, float addrspace(1)* %out, align 4
	ret void	ret void
	}	}

	; Recognize that this is rsqrt(a) * rcp(b) * c,	; Recognize that this is rsqrt(a) * rcp(b) * c,
	; not 1 / ( 1 / sqrt(a)) * rcp(b) * c.	; not 1 / ( 1 / sqrt(a)) * rcp(b) * c.

	; SI-LABEL: @rsqrt_fmul	; SI-LABEL: @rsqrt_fmul
	; ... 24 lines of collapsed diff context not shown ...
	ret void	ret void
	}	}

	; SI-LABEL: {{^}}neg_rsq_f32:	; SI-LABEL: {{^}}neg_rsq_f32:
	; SI-SAFE: v_sqrt_f32_e32 [[SQRT:v[0-9]+]], v{{[0-9]+}}	; SI-SAFE: v_sqrt_f32_e32 [[SQRT:v[0-9]+]], v{{[0-9]+}}
	; SI-SAFE: v_rcp_f32_e64 [[RSQ:v[0-9]+]], -[[SQRT]]	; SI-SAFE: v_rcp_f32_e64 [[RSQ:v[0-9]+]], -[[SQRT]]
	; SI-SAFE: buffer_store_dword [[RSQ]]	; SI-SAFE: buffer_store_dword [[RSQ]]

	; SI-UNSAFE: v_rsq_f32_e32 [[RSQ:v[0-9]+]], v{{[0-9]+}}	; SI-UNSAFE: v_sqrt_f32_e32 [[SQRT:v[0-9]+]], v{{[0-9]+}}
	; SI-UNSAFE: v_xor_b32_e32 [[NEG_RSQ:v[0-9]+]], 0x80000000, [[RSQ]]	; SI-UNSAFE: v_rcp_f32_e64 [[RSQ:v[0-9]+]], -[[SQRT]]
	; SI-UNSAFE: buffer_store_dword [[NEG_RSQ]]	; SI-UNSAFE: buffer_store_dword [[RSQ]]
	define amdgpu_kernel void @neg_rsq_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) nounwind {	define amdgpu_kernel void @neg_rsq_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) nounwind {
	%val = load float, float addrspace(1)* %in, align 4	%val = load float, float addrspace(1)* %in, align 4
	%sqrt = call float @llvm.sqrt.f32(float %val)	%sqrt = call float @llvm.sqrt.f32(float %val)
	%div = fdiv float -1.0, %sqrt	%div = fdiv float -1.0, %sqrt, !fpmath !0
	store float %div, float addrspace(1)* %out, align 4	store float %div, float addrspace(1)* %out, align 4
	ret void	ret void
	}	}

	; SI-LABEL: {{^}}neg_rsq_f64:	; SI-LABEL: {{^}}neg_rsq_f64:
	; SI-SAFE: v_sqrt_f64_e32	; SI-SAFE: v_sqrt_f64_e32
	; SI-SAFE: v_div_scale_f64	; SI-SAFE: v_div_scale_f64

	; SI-UNSAFE: v_sqrt_f64_e32 [[SQRT:v\[[0-9]+:[0-9]+\]]], v{{\[[0-9]+:[0-9]+\]}}	; SI-UNSAFE: v_sqrt_f64_e32 [[SQRT:v\[[0-9]+:[0-9]+\]]], v{{\[[0-9]+:[0-9]+\]}}
	; SI-UNSAFE: v_rcp_f64_e64 [[RCP:v\[[0-9]+:[0-9]+\]]], -[[SQRT]]	; SI-UNSAFE: v_rcp_f64_e64 [[RCP:v\[[0-9]+:[0-9]+\]]], -[[SQRT]]
	; SI-UNSAFE: buffer_store_dwordx2 [[RCP]]	; SI-UNSAFE: buffer_store_dwordx2 [[RCP]]
	define amdgpu_kernel void @neg_rsq_f64(double addrspace(1)* noalias %out, double addrspace(1)* noalias %in) nounwind {	define amdgpu_kernel void @neg_rsq_f64(double addrspace(1)* noalias %out, double addrspace(1)* noalias %in) nounwind {
	%val = load double, double addrspace(1)* %in, align 4	%val = load double, double addrspace(1)* %in, align 4
	%sqrt = call double @llvm.sqrt.f64(double %val)	%sqrt = call double @llvm.sqrt.f64(double %val)
	%div = fdiv double -1.0, %sqrt	%div = fdiv double -1.0, %sqrt
	store double %div, double addrspace(1)* %out, align 4	store double %div, double addrspace(1)* %out, align 4
	ret void	ret void
	}	}

	; SI-LABEL: {{^}}neg_rsq_neg_f32:	; SI-LABEL: {{^}}neg_rsq_neg_f32:
	; SI-SAFE: v_sqrt_f32_e64 [[SQRT:v[0-9]+]], -v{{[0-9]+}}	; SI-SAFE: v_sqrt_f32_e64 [[SQRT:v[0-9]+]], -v{{[0-9]+}}
	; SI-SAFE: v_rcp_f32_e64 [[RSQ:v[0-9]+]], -[[SQRT]]	; SI-SAFE: v_rcp_f32_e64 [[RSQ:v[0-9]+]], -[[SQRT]]
	; SI-SAFE: buffer_store_dword [[RSQ]]	; SI-SAFE: buffer_store_dword [[RSQ]]

	; SI-UNSAFE: v_rsq_f32_e64 [[RSQ:v[0-9]+]], -v{{[0-9]+}}	; SI-UNSAFE: v_sqrt_f32_e64 [[SQRT:v[0-9]+]], -v{{[0-9]+}}
	; SI-UNSAFE: v_xor_b32_e32 [[NEG_RSQ:v[0-9]+]], 0x80000000, [[RSQ]]	; SI-UNSAFE: v_rcp_f32_e64 [[RSQ:v[0-9]+]], -[[SQRT]]
	; SI-UNSAFE: buffer_store_dword [[NEG_RSQ]]	; SI-UNSAFE: buffer_store_dword [[RSQ]]
	define amdgpu_kernel void @neg_rsq_neg_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) nounwind {	define amdgpu_kernel void @neg_rsq_neg_f32(float addrspace(1)* noalias %out, float addrspace(1)* noalias %in) nounwind {
	%val = load float, float addrspace(1)* %in, align 4	%val = load float, float addrspace(1)* %in, align 4
	%val.fneg = fsub float -0.0, %val	%val.fneg = fsub float -0.0, %val
	%sqrt = call float @llvm.sqrt.f32(float %val.fneg)	%sqrt = call float @llvm.sqrt.f32(float %val.fneg)
	%div = fdiv float -1.0, %sqrt	%div = fdiv float -1.0, %sqrt, !fpmath !0
	store float %div, float addrspace(1)* %out, align 4	store float %div, float addrspace(1)* %out, align 4
	ret void	ret void
	}	}

	; SI-LABEL: {{^}}neg_rsq_neg_f64:	; SI-LABEL: {{^}}neg_rsq_neg_f64:
	; SI-SAFE: v_sqrt_f64_e64 v{{\[[0-9]+:[0-9]+\]}}, -v{{\[[0-9]+:[0-9]+\]}}	; SI-SAFE: v_sqrt_f64_e64 v{{\[[0-9]+:[0-9]+\]}}, -v{{\[[0-9]+:[0-9]+\]}}
	; SI-SAFE: v_div_scale_f64	; SI-SAFE: v_div_scale_f64

	; SI-UNSAFE: v_sqrt_f64_e64 [[SQRT:v\[[0-9]+:[0-9]+\]]], -v{{\[[0-9]+:[0-9]+\]}}	; SI-UNSAFE: v_sqrt_f64_e64 [[SQRT:v\[[0-9]+:[0-9]+\]]], -v{{\[[0-9]+:[0-9]+\]}}
	; SI-UNSAFE: v_rcp_f64_e64 [[RCP:v\[[0-9]+:[0-9]+\]]], -[[SQRT]]	; SI-UNSAFE: v_rcp_f64_e64 [[RCP:v\[[0-9]+:[0-9]+\]]], -[[SQRT]]
	; SI-UNSAFE: buffer_store_dwordx2 [[RCP]]	; SI-UNSAFE: buffer_store_dwordx2 [[RCP]]
	define amdgpu_kernel void @neg_rsq_neg_f64(double addrspace(1)* noalias %out, double addrspace(1)* noalias %in) nounwind {	define amdgpu_kernel void @neg_rsq_neg_f64(double addrspace(1)* noalias %out, double addrspace(1)* noalias %in) nounwind {
	%val = load double, double addrspace(1)* %in, align 4	%val = load double, double addrspace(1)* %in, align 4
	%val.fneg = fsub double -0.0, %val	%val.fneg = fsub double -0.0, %val
	%sqrt = call double @llvm.sqrt.f64(double %val.fneg)	%sqrt = call double @llvm.sqrt.f64(double %val.fneg)
	%div = fdiv double -1.0, %sqrt	%div = fdiv double -1.0, %sqrt
	store double %div, double addrspace(1)* %out, align 4	store double %div, double addrspace(1)* %out, align 4
	ret void	ret void
	}	}

		!0 = !{float 2.500000e+00}
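
A note on the `!fpmath !0` annotations added throughout these tests: in LLVM IR, `!fpmath` metadata attaches a maximum acceptable error, measured in ULPs, to an individual floating-point instruction, and `!0 = !{float 2.500000e+00}` sets that bound to 2.5 ULP. The sketch below is only an illustration of the annotation. The function names are hypothetical and are not part of this patch, and the instruction sequence the backend selects for each case still depends on the target options in the RUN lines.

```llvm
; Hypothetical example, not taken from the patch.

; The division carries !fpmath, so a result within 2.5 ULP is acceptable.
; Per the review discussion, AMDGPUCodeGenPrepare looks for such cases to
; use the faster 2.5 ULP division that does not support denormals.
define float @fdiv_relaxed(float %a, float %b) {
  %q = fdiv float %a, %b, !fpmath !0
  ret float %q
}

; No !fpmath here, so no relaxed accuracy bound is attached to the fdiv.
define float @fdiv_default(float %a, float %b) {
  %q = fdiv float %a, %b
  ret float %q
}

!0 = !{float 2.500000e+00}
```

This matches the pattern in the tests above: the `f32` divisions gain `!fpmath !0`, while the `f64` divisions are left unannotated.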

Revision Contents

Diff 239963

llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll

llvm/test/CodeGen/AMDGPU/fdiv.ll

llvm/test/CodeGen/AMDGPU/fdiv32-to-rcp-folding.ll

llvm/test/CodeGen/AMDGPU/fneg-combines.ll

llvm/test/CodeGen/AMDGPU/known-never-snan.ll

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.rcp.ll

llvm/test/CodeGen/AMDGPU/mul24-pass-ordering.ll

llvm/test/CodeGen/AMDGPU/rcp-pattern.ll

llvm/test/CodeGen/AMDGPU/rcp_iflag.ll

llvm/test/CodeGen/AMDGPU/rsq.ll
