This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Improve reciprocal handling
Closed Public

Authored by rampitec on Jun 5 2018, 5:25 PM.

Details

Reviewers

b-sumner
arsenm

Commits

rGdf61be70b234: [AMDGPU] Improve reciprocal handling
rL334142: [AMDGPU] Improve reciprocal handling

Summary

When denormals are supported we produce a full division for
1.0f / x. That can still be replaced by a faster sequence:

bool c = fabs(x) > 0x1.0p+96f;
float s = c ? 0x1.0p-32f : 1.0f;
x *= s;
return s * v_rcp_f32(x);

provided the requested accuracy is 2.5 ulp or looser (an !fpmath bound
of at least 2.5). The same sequence is used for numerators other than 1.0
when denormals are not supported; in that mode a plain v_rcp_f32 is used
for a 1.0 numerator.

The optimization of 1/x is extended to the case -1/x, which is the
same except for the resulting sign bit.

OpenCL conformance passed with both enabled and disabled denorms.
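
The rewrite described in the summary can be modelled on the host as a rough sketch. Everything named here is an assumption for illustration: `rcp_stub` is a hypothetical stand-in for `v_rcp_f32` (just `1.0f / x`), `scaled_rcp` and `scaled_neg_rcp` do not appear in the patch, and `std::ldexp` builds the constants 2^96 and 2^-32 that the summary writes as hex float literals.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the hardware v_rcp_f32 instruction.
inline float rcp_stub(float x) { return 1.0f / x; }

// Models the sequence from the summary: pre-scale large inputs by 2^-32,
// take the reciprocal, then apply the same scale to the result.
inline float scaled_rcp(float x) {
  const float Limit = std::ldexp(1.0f, 96);  // 0x1.0p+96f
  const float Small = std::ldexp(1.0f, -32); // 0x1.0p-32f
  bool c = std::fabs(x) > Limit;
  float s = c ? Small : 1.0f;
  x *= s;
  return s * rcp_stub(x);
}

// -1/x: the same sequence with the sign flipped on the input.
inline float scaled_neg_rcp(float x) { return scaled_rcp(-x); }

// Usage: small inputs take the unscaled path, 2^100 takes the scaled one.
inline void scaled_rcp_examples() {
  assert(scaled_rcp(2.0f) == 0.5f);
  assert(scaled_neg_rcp(4.0f) == -0.25f);
  assert(scaled_rcp(std::ldexp(1.0f, 100)) == std::ldexp(1.0f, -100));
}
```

In this model an input of 2^100 is pre-scaled to 2^68 before the reciprocal, so the reciprocal step itself never has to return a magnitude below 2^-96. Presumably that is the point of the scaling, since reciprocals of inputs above 2^126 would otherwise land in the denormal range.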

Event Timeline

rampitec created this revision. Jun 5 2018, 5:25 PM

Herald added subscribers: t-tye, Anastasia, tpr and 5 others. Jun 5 2018, 5:25 PM

arsenm added inline comments. Jun 6 2018, 6:38 AM

lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
376	I think if you use the stuff in PatternMatch.h you can easily check for constant splats if you want this to work for vectors too
386–389	Merge into a return of the logically combined condition
423	Check constant first? Also isn't just isa<Constant> sufficient? Not sure why this needs to check it at all since shouldKeepFDivF32 already checks this
test/CodeGen/AMDGPU/fdiv32-to-rcp-folding.ll
98–101	This will only effectively check for one, although I think there's a FileCheck patch out for review to fix this
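
The request on lines 386–389 to merge the branches into a single return can be illustrated with a standalone model of the helper. This is a sketch only: `keepBranchy` mirrors the branch structure shown in this revision of the diff, `keepMerged` is one possible combined form, and the numerator is modelled as a `const float *` that is null when it is not a constant (the real code works on `Value` and `ConstantFP`). None of these function names come from the patch.

```cpp
#include <cassert>

// Branchy form, following the structure in this revision of the diff.
inline bool keepBranchy(const float *CNum, bool UnsafeDiv, bool HasDenormals) {
  if (!CNum)
    return HasDenormals;
  if (UnsafeDiv)
    return true;
  bool IsOne = *CNum == 1.0f || *CNum == -1.0f;
  if (!HasDenormals)
    return IsOne;
  return !IsOne;
}

// One possible merged form: keep the fdiv for unsafe math, or when exactly
// one of "numerator is +/-1.0" and "denormals are enabled" holds.
inline bool keepMerged(const float *CNum, bool UnsafeDiv, bool HasDenormals) {
  if (!CNum)
    return HasDenormals;
  bool IsOne = *CNum == 1.0f || *CNum == -1.0f;
  return UnsafeDiv || (IsOne != HasDenormals);
}

// Exhaustively compares both forms over a few representative numerators.
inline void checkFormsAgree() {
  const float One = 1.0f, MinusOne = -1.0f, Two = 2.0f;
  const float *Nums[] = {nullptr, &One, &MinusOne, &Two};
  for (const float *N : Nums)
    for (int U = 0; U < 2; ++U)
      for (int D = 0; D < 2; ++D)
        assert(keepBranchy(N, U != 0, D != 0) == keepMerged(N, U != 0, D != 0));
}
```

The inequality `IsOne != HasDenormals` acts as a logical XOR, which is what lets the two trailing branches collapse into one expression.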

rampitec updated this revision to Diff 150163. Jun 6 2018, 10:28 AM

rampitec marked 2 inline comments as done.

rampitec added inline comments.

lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
376	It now can work on an arbitrary constant vector, not splat only. Then this helper is used on an element. There is even a test for non-splat.
423	Checking only for a constant is not sufficient. It can replace the fdiv when the numerator is not constant and denormals are not supported. But I have removed it altogether because it will be checked later anyway.
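
The per-element behaviour described in the reply on line 376 can be pictured with a small model. It is a sketch under stated assumptions: every numerator element is a constant, unsafe math is off, and the choice follows the helper's logic from this revision. `Lowering`, `classifyLane` and `classifyVector` are hypothetical names, not part of the patch.

```cpp
#include <cassert>
#include <vector>

enum Lowering { KeepFDiv, FastFDiv };

// Per-lane choice for a constant numerator element when unsafe math is off:
// with denormals, +/-1.0 lanes take the fast path and other constants keep
// the full fdiv; without denormals the roles are swapped.
inline Lowering classifyLane(float Num, bool HasDenormals) {
  bool IsOne = Num == 1.0f || Num == -1.0f;
  bool Keep = HasDenormals ? !IsOne : IsOne;
  return Keep ? KeepFDiv : FastFDiv;
}

// Applies the per-lane choice to every element of a constant vector.
inline std::vector<Lowering> classifyVector(const std::vector<float> &Num,
                                            bool HasDenormals) {
  std::vector<Lowering> Out;
  for (float N : Num)
    Out.push_back(classifyLane(N, HasDenormals));
  return Out;
}

// Usage: the numerator of the non-splat test div_v4_c_by_x_25ulp.
inline void nonSplatExample() {
  const std::vector<float> Num = {2.0f, 1.0f, -1.0f, -2.0f};
  const std::vector<Lowering> Denorm = classifyVector(Num, true);
  assert(Denorm[0] == KeepFDiv && Denorm[1] == FastFDiv);
  assert(Denorm[2] == FastFDiv && Denorm[3] == KeepFDiv);
}
```

This appears consistent with the checks in div_v4_c_by_x_25ulp, where the denormal run expects v_div_scale_f32 only for the 2.0 and -2.0 lanes.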

rampitec added inline comments. Jun 6 2018, 12:16 PM

test/CodeGen/AMDGPU/fdiv32-to-rcp-folding.ll
98–101	Once FileCheck is improved, this will test more than it does now. It is really impossible to make reliable non-DAG checks here in the presence of two schedulers.

LGTM

This revision is now accepted and ready to land. Jun 6 2018, 2:16 PM

Closed by commit rL334142: [AMDGPU] Improve reciprocal handling (authored by rampitec). Jun 6 2018, 3:27 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp	33 lines
test/CodeGen/AMDGPU/fdiv32-to-rcp-folding.ll	459 lines

Diff 150059

lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp

[... 366 lines not shown ...]
Value *TruncRes =		Value *TruncRes =
Builder.CreateTrunc(LShrOp, I.getType());		Builder.CreateTrunc(LShrOp, I.getType());

I.replaceAllUsesWith(TruncRes);		I.replaceAllUsesWith(TruncRes);
I.eraseFromParent();		I.eraseFromParent();

return true;		return true;
}		}

static bool shouldKeepFDivF32(Value *Num, bool UnsafeDiv) {		static bool shouldKeepFDivF32(Value *Num, bool UnsafeDiv, bool HasDenormals) {
const ConstantFP *CNum = dyn_cast<ConstantFP>(Num);		const ConstantFP *CNum = dyn_cast<ConstantFP>(Num);
		arsenm: I think if you use the stuff in PatternMatch.h you can easily check for constant splats if you want this to work for vectors too
		rampitec: It now can work on an arbitrary constant vector, not splat only. Then this helper is used on an element. There is even a test for non-splat.
if (!CNum)		if (!CNum)
return false;		return HasDenormals;

		if (UnsafeDiv)
		return true;

		bool IsOne = CNum->isExactlyValue(+1.0) || CNum->isExactlyValue(-1.0);

// Reciprocal f32 is handled separately without denormals.		// Reciprocal f32 is handled separately without denormals.
return UnsafeDiv || CNum->isExactlyValue(+1.0);		if (!HasDenormals)
		return IsOne;

		return !IsOne;
		arsenm: Merge into a return of the logically combined condition
}		}

// Insert an intrinsic for fast fdiv for safe math situations where we can		// Insert an intrinsic for fast fdiv for safe math situations where we can
// reduce precision. Leave fdiv for situations where the generic node is		// reduce precision. Leave fdiv for situations where the generic node is
// expected to be optimized.		// expected to be optimized.
bool AMDGPUCodeGenPrepare::visitFDiv(BinaryOperator &FDiv) {		bool AMDGPUCodeGenPrepare::visitFDiv(BinaryOperator &FDiv) {
Type *Ty = FDiv.getType();		Type *Ty = FDiv.getType();

[... 9 lines not shown ...]
if (ULP < 2.5f)		if (ULP < 2.5f)
return false;		return false;

FastMathFlags FMF = FPOp->getFastMathFlags();		FastMathFlags FMF = FPOp->getFastMathFlags();
bool UnsafeDiv = HasUnsafeFPMath || FMF.isFast() ||		bool UnsafeDiv = HasUnsafeFPMath || FMF.isFast() ||
FMF.allowReciprocal();		FMF.allowReciprocal();

// With UnsafeDiv node will be optimized to just rcp and mul.		// With UnsafeDiv node will be optimized to just rcp and mul.
if (ST->hasFP32Denormals() || UnsafeDiv)		if (UnsafeDiv)
return false;		return false;


		Value *Num = FDiv.getOperand(0);

		bool HasDenormals = ST->hasFP32Denormals();
		if (shouldKeepFDivF32(Num, UnsafeDiv, HasDenormals) &&
		!isa<ConstantDataVector>(Num))
		arsenm: Check constant first? Also isn't just isa<Constant> sufficient? Not sure why this needs to check it at all since shouldKeepFDivF32 already checks this
		rampitec: Checking only for a constant is not sufficient. It can replace the fdiv when the numerator is not constant and denormals are not supported. But I have removed it altogether because it will be checked later anyway.
		return false;

		Value *Den = FDiv.getOperand(1);

IRBuilder<> Builder(FDiv.getParent(), std::next(FDiv.getIterator()), FPMath);		IRBuilder<> Builder(FDiv.getParent(), std::next(FDiv.getIterator()), FPMath);
Builder.setFastMathFlags(FMF);		Builder.setFastMathFlags(FMF);
Builder.SetCurrentDebugLocation(FDiv.getDebugLoc());		Builder.SetCurrentDebugLocation(FDiv.getDebugLoc());

Function *Decl = Intrinsic::getDeclaration(Mod, Intrinsic::amdgcn_fdiv_fast);		Function *Decl = Intrinsic::getDeclaration(Mod, Intrinsic::amdgcn_fdiv_fast);

Value *Num = FDiv.getOperand(0);
Value *Den = FDiv.getOperand(1);

Value *NewFDiv = nullptr;		Value *NewFDiv = nullptr;

if (VectorType *VT = dyn_cast<VectorType>(Ty)) {		if (VectorType *VT = dyn_cast<VectorType>(Ty)) {
NewFDiv = UndefValue::get(VT);		NewFDiv = UndefValue::get(VT);

// FIXME: Doesn't do the right thing for cases where the vector is partially		// FIXME: Doesn't do the right thing for cases where the vector is partially
// constant. This works when the scalarizer pass is run first.		// constant. This works when the scalarizer pass is run first.
for (unsigned I = 0, E = VT->getNumElements(); I != E; ++I) {		for (unsigned I = 0, E = VT->getNumElements(); I != E; ++I) {
Value *NumEltI = Builder.CreateExtractElement(Num, I);		Value *NumEltI = Builder.CreateExtractElement(Num, I);
Value *DenEltI = Builder.CreateExtractElement(Den, I);		Value *DenEltI = Builder.CreateExtractElement(Den, I);
Value *NewElt;		Value *NewElt;

if (shouldKeepFDivF32(NumEltI, UnsafeDiv)) {		if (shouldKeepFDivF32(NumEltI, UnsafeDiv, HasDenormals)) {
NewElt = Builder.CreateFDiv(NumEltI, DenEltI);		NewElt = Builder.CreateFDiv(NumEltI, DenEltI);
} else {		} else {
NewElt = Builder.CreateCall(Decl, { NumEltI, DenEltI });		NewElt = Builder.CreateCall(Decl, { NumEltI, DenEltI });
}		}

NewFDiv = Builder.CreateInsertElement(NewFDiv, NewElt, I);		NewFDiv = Builder.CreateInsertElement(NewFDiv, NewElt, I);
}		}
} else {		} else {
if (!shouldKeepFDivF32(Num, UnsafeDiv))		if (!shouldKeepFDivF32(Num, UnsafeDiv, HasDenormals))
NewFDiv = Builder.CreateCall(Decl, { Num, Den });		NewFDiv = Builder.CreateCall(Decl, { Num, Den });
}		}

if (NewFDiv) {		if (NewFDiv) {
FDiv.replaceAllUsesWith(NewFDiv);		FDiv.replaceAllUsesWith(NewFDiv);
NewFDiv->takeName(&FDiv);		NewFDiv->takeName(&FDiv);
FDiv.eraseFromParent();		FDiv.eraseFromParent();
}		}
[... 124 lines not shown ...]

test/CodeGen/AMDGPU/fdiv32-to-rcp-folding.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=gfx900 -mattr=+fp32-denormals < %s | FileCheck --check-prefixes=GCN,GCN-DENORM %s
				; RUN: llc -march=amdgcn -mcpu=gfx900 -mattr=-fp32-denormals < %s | FileCheck --check-prefixes=GCN,GCN-FLUSH %s

				; GCN-LABEL: {{^}}div_1_by_x_25ulp:
				; GCN-DENORM-DAG: v_mov_b32_e32 [[L:v[0-9]+]], 0x6f800000
				; GCN-DENORM-DAG: v_mov_b32_e32 [[S:v[0-9]+]], 0x2f800000
				; GCN-DAG: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9:]+}}], 0x0{{$}}
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |[[VAL]]|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 [[SCALE:v[0-9]+]], 1.0, [[S]], vcc
				; GCN-DENORM: v_mul_f32_e32 [[PRESCALED:v[0-9]+]], [[VAL]], [[SCALE]]
				; GCN-DENORM: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[PRESCALED]]
				; GCN-DENORM: v_mul_f32_e32 [[OUT:v[0-9]+]], [[SCALE]], [[RCP]]

				; GCN-FLUSH: v_rcp_f32_e32 [[OUT:v[0-9]+]], [[VAL]]

				; GCN: global_store_dword v[{{[0-9:]+}}], [[OUT]], off
				define amdgpu_kernel void @div_1_by_x_25ulp(float addrspace(1)* %arg) {
				%load = load float, float addrspace(1)* %arg, align 4
				%div = fdiv float 1.000000e+00, %load, !fpmath !0
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				; GCN-LABEL: {{^}}div_minus_1_by_x_25ulp:
				; GCN-DENORM-DAG: v_mov_b32_e32 [[L:v[0-9]+]], 0x6f800000
				; GCN-DENORM-DAG: v_mov_b32_e32 [[S:v[0-9]+]], 0x2f800000
				; GCN-DAG: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9:]+}}], 0x0{{$}}
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |[[VAL]]|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 [[SCALE:v[0-9]+]], 1.0, [[S]], vcc
				; GCN-DENORM: v_mul_f32_e64 [[PRESCALED:v[0-9]+]], [[VAL]], -[[SCALE]]
				; GCN-DENORM: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[PRESCALED]]
				; GCN-DENORM: v_mul_f32_e32 [[OUT:v[0-9]+]], [[SCALE]], [[RCP]]

				; GCN-FLUSH: v_rcp_f32_e64 [[OUT:v[0-9]+]], -[[VAL]]

				; GCN: global_store_dword v[{{[0-9:]+}}], [[OUT]], off
				define amdgpu_kernel void @div_minus_1_by_x_25ulp(float addrspace(1)* %arg) {
				%load = load float, float addrspace(1)* %arg, align 4
				%div = fdiv float -1.000000e+00, %load, !fpmath !0
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				; GCN-LABEL: {{^}}div_1_by_minus_x_25ulp:
				; GCN-DENORM-DAG: v_mov_b32_e32 [[L:v[0-9]+]], 0x6f800000
				; GCN-DENORM-DAG: v_mov_b32_e32 [[S:v[0-9]+]], 0x2f800000
				; GCN-DAG: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9:]+}}], 0x0{{$}}
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |[[VAL]]|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 [[SCALE:v[0-9]+]], 1.0, [[S]], vcc
				; GCN-DENORM: v_mul_f32_e64 [[PRESCALED:v[0-9]+]], -[[VAL]], [[SCALE]]
				; GCN-DENORM: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[PRESCALED]]
				; GCN-DENORM: v_mul_f32_e32 [[OUT:v[0-9]+]], [[SCALE]], [[RCP]]

				; GCN-FLUSH: v_rcp_f32_e64 [[OUT:v[0-9]+]], -[[VAL]]

				; GCN: global_store_dword v[{{[0-9:]+}}], [[OUT]], off
				define amdgpu_kernel void @div_1_by_minus_x_25ulp(float addrspace(1)* %arg) {
				%load = load float, float addrspace(1)* %arg, align 4
				%neg = fsub float -0.000000e+00, %load
				%div = fdiv float 1.000000e+00, %neg, !fpmath !0
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				; GCN-LABEL: {{^}}div_minus_1_by_minus_x_25ulp:
				; GCN-DENORM-DAG: v_mov_b32_e32 [[L:v[0-9]+]], 0x6f800000
				; GCN-DENORM-DAG: v_mov_b32_e32 [[S:v[0-9]+]], 0x2f800000
				; GCN-DAG: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9:]+}}], 0x0{{$}}
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |[[VAL]]|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 [[SCALE:v[0-9]+]], 1.0, [[S]], vcc
				; GCN-DENORM: v_mul_f32_e32 [[PRESCALED:v[0-9]+]], [[VAL]], [[SCALE]]
				; GCN-DENORM: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[PRESCALED]]
				; GCN-DENORM: v_mul_f32_e32 [[OUT:v[0-9]+]], [[SCALE]], [[RCP]]

				; GCN-FLUSH: v_rcp_f32_e32 [[OUT:v[0-9]+]], [[VAL]]

				; GCN: global_store_dword v[{{[0-9:]+}}], [[OUT]], off
				define amdgpu_kernel void @div_minus_1_by_minus_x_25ulp(float addrspace(1)* %arg) {
				%load = load float, float addrspace(1)* %arg, align 4
				%neg = fsub float -0.000000e+00, %load
				%div = fdiv float -1.000000e+00, %neg, !fpmath !0
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				; GCN-LABEL: {{^}}div_v4_1_by_x_25ulp:
				; GCN-DAG: s_load_dwordx4 s{{\[}}[[VAL0:[0-9]+]]:[[VAL3:[0-9]+]]], s[{{[0-9:]+}}], 0x0{{$}}
				; GCN-DENORM-DAG: v_mov_b32_e32 [[L:v[0-9]+]], 0x6f800000
				; GCN-DENORM-DAG: v_mov_b32_e32 [[S:v[0-9]+]], 0x2f800000
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}
				arsenm: This will only effectively check for one, although I think there's a FileCheck patch out for review to fix this
				rampitec: Once FileCheck is improved, this will test more than it does now. It is really impossible to make reliable non-DAG checks here in the presence of two schedulers.
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32

				; GCN-FLUSH: v_rcp_f32_e32 v[[OUT0:[0-9]+]], s[[VAL0]]
				; GCN-FLUSH: v_rcp_f32_e32
				; GCN-FLUSH: v_rcp_f32_e32
				; GCN-FLUSH: v_rcp_f32_e32 v[[OUT3:[0-9]+]], s[[VAL3]]
				; GCN-FLUSH: global_store_dwordx4 v[{{[0-9:]+}}], v{{\[}}[[OUT0]]:[[OUT3]]], off
				define amdgpu_kernel void @div_v4_1_by_x_25ulp(<4 x float> addrspace(1)* %arg) {
				%load = load <4 x float>, <4 x float> addrspace(1)* %arg, align 16
				%div = fdiv <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %load, !fpmath !0
				store <4 x float> %div, <4 x float> addrspace(1)* %arg, align 16
				ret void
				}

				; GCN-LABEL: {{^}}div_v4_minus_1_by_x_25ulp:
				; GCN-DENORM-DAG: v_mov_b32_e32 [[L:v[0-9]+]], 0x6f800000
				; GCN-DENORM-DAG: v_mov_b32_e32 [[S:v[0-9]+]], 0x2f800000
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_mul_f32_e64 v{{[0-9]+}}, s{{[0-9]+}}, -v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e64 v{{[0-9]+}}, s{{[0-9]+}}, -v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e64 v{{[0-9]+}}, s{{[0-9]+}}, -v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e64 v{{[0-9]+}}, s{{[0-9]+}}, -v{{[0-9]+}}
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32

				; GCN-FLUSH: v_rcp_f32_e64 v[[OUT0:[0-9]+]], -s[[VAL0]]
				; GCN-FLUSH: v_rcp_f32_e64
				; GCN-FLUSH: v_rcp_f32_e64
				; GCN-FLUSH: v_rcp_f32_e64 v[[OUT3:[0-9]+]], -s[[VAL3]]
				define amdgpu_kernel void @div_v4_minus_1_by_x_25ulp(<4 x float> addrspace(1)* %arg) {
				%load = load <4 x float>, <4 x float> addrspace(1)* %arg, align 16
				%div = fdiv <4 x float> <float -1.000000e+00, float -1.000000e+00, float -1.000000e+00, float -1.000000e+00>, %load, !fpmath !0
				store <4 x float> %div, <4 x float> addrspace(1)* %arg, align 16
				ret void
				}

				; GCN-LABEL: {{^}}div_v4_1_by_minus_x_25ulp:
				; GCN-DENORM-DAG: v_mov_b32_e32 [[L:v[0-9]+]], 0x6f800000
				; GCN-DENORM-DAG: v_mov_b32_e32 [[S:v[0-9]+]], 0x2f800000
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_mul_f32_e64 v{{[0-9]+}}, -s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e64 v{{[0-9]+}}, -s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e64 v{{[0-9]+}}, -s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e64 v{{[0-9]+}}, -s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32

				; GCN-FLUSH: v_rcp_f32_e64 v[[OUT0:[0-9]+]], -s[[VAL0]]
				; GCN-FLUSH: v_rcp_f32_e64
				; GCN-FLUSH: v_rcp_f32_e64
				; GCN-FLUSH: v_rcp_f32_e64 v[[OUT3:[0-9]+]], -s[[VAL3]]
				; GCN-FLUSH: global_store_dwordx4 v[{{[0-9:]+}}], v{{\[}}[[OUT0]]:[[OUT3]]], off
				define amdgpu_kernel void @div_v4_1_by_minus_x_25ulp(<4 x float> addrspace(1)* %arg) {
				%load = load <4 x float>, <4 x float> addrspace(1)* %arg, align 16
				%neg = fsub <4 x float> <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %load
				%div = fdiv <4 x float> <float 1.000000e+00, float 1.000000e+00, float 1.000000e+00, float 1.000000e+00>, %neg, !fpmath !0
				store <4 x float> %div, <4 x float> addrspace(1)* %arg, align 16
				ret void
				}

				; GCN-LABEL: {{^}}div_v4_minus_1_by_minus_x_25ulp:
				; GCN-DAG: s_load_dwordx4 s{{\[}}[[VAL0:[0-9]+]]:[[VAL3:[0-9]+]]], s[{{[0-9:]+}}], 0x0{{$}}
				; GCN-DENORM-DAG: v_mov_b32_e32 [[L:v[0-9]+]], 0x6f800000
				; GCN-DENORM-DAG: v_mov_b32_e32 [[S:v[0-9]+]], 0x2f800000
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DENORM-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32
				; GCN-DENORM-DAG: v_mul_f32_e32

				; GCN-FLUSH: v_rcp_f32_e32 v[[OUT0:[0-9]+]], s[[VAL0]]
				; GCN-FLUSH: v_rcp_f32_e32
				; GCN-FLUSH: v_rcp_f32_e32
				; GCN-FLUSH: v_rcp_f32_e32 v[[OUT3:[0-9]+]], s[[VAL3]]
				; GCN-FLUSH: global_store_dwordx4 v[{{[0-9:]+}}], v{{\[}}[[OUT0]]:[[OUT3]]], off
				define amdgpu_kernel void @div_v4_minus_1_by_minus_x_25ulp(<4 x float> addrspace(1)* %arg) {
				%load = load <4 x float>, <4 x float> addrspace(1)* %arg, align 16
				%neg = fsub <4 x float> <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %load
				%div = fdiv <4 x float> <float -1.000000e+00, float -1.000000e+00, float -1.000000e+00, float -1.000000e+00>, %neg, !fpmath !0
				store <4 x float> %div, <4 x float> addrspace(1)* %arg, align 16
				ret void
				}

				; GCN-LABEL: {{^}}div_v4_c_by_x_25ulp:
				; GCN-DAG: v_mov_b32_e32 [[L:v[0-9]+]], 0x6f800000
				; GCN-DAG: v_mov_b32_e32 [[S:v[0-9]+]], 0x2f800000
				; GCN-DENORM-DAG: v_div_scale_f32 {{.*}}, 2.0{{$}}
				; GCN-DENORM-DAG: v_div_scale_f32 {{.*}}, 2.0{{$}}
				; GCN-DENORM-DAG: v_div_scale_f32 {{.*}}, -2.0{{$}}
				; GCN-DENORM-DAG: v_div_scale_f32 {{.*}}, -2.0{{$}}
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32

				; GCN-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc

				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e64 v{{[0-9]+}}, s{{[0-9]+}}, -v{{[0-9]+}}
				; GCN-DENORM-DAG: v_rcp_f32_e32 [[RCP1:v[0-9]+]], v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, [[RCP1]]
				; GCN-DENORM-DAG: v_rcp_f32_e32 [[RCP2:v[0-9]+]], v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, [[RCP2]]

				; GCN-DENORM-DAG: v_div_fmas_f32
				; GCN-DENORM-DAG: v_div_fmas_f32
				; GCN-DENORM-DAG: v_div_fixup_f32 {{.*}}, 2.0{{$}}
				; GCN-DENORM-DAG: v_div_fixup_f32 {{.*}}, -2.0{{$}}

				; GCN-FLUSH-DAG: v_rcp_f32_e32
				; GCN-FLUSH-DAG: v_rcp_f32_e64

				; GCN-NOT: v_cmp_gt_f32_e64
				; GCN-NOT: v_cndmask_b32_e32
				; GCN-FLUSH-NOT: v_div

				; GCN: global_store_dwordx4
				define amdgpu_kernel void @div_v4_c_by_x_25ulp(<4 x float> addrspace(1)* %arg) {
				%load = load <4 x float>, <4 x float> addrspace(1)* %arg, align 16
				%div = fdiv <4 x float> <float 2.000000e+00, float 1.000000e+00, float -1.000000e+00, float -2.000000e+00>, %load, !fpmath !0
				store <4 x float> %div, <4 x float> addrspace(1)* %arg, align 16
				ret void
				}

				; GCN-LABEL: {{^}}div_v4_c_by_minus_x_25ulp:
				; GCN-DAG: v_mov_b32_e32 [[L:v[0-9]+]], 0x6f800000
				; GCN-DAG: v_mov_b32_e32 [[S:v[0-9]+]], 0x2f800000
				; GCN-DENORM-DAG: v_div_scale_f32 {{.*}}, -2.0{{$}}
				; GCN-DENORM-DAG: v_div_scale_f32 {{.*}}, -2.0{{$}}
				; GCN-DENORM-DAG: v_div_scale_f32 {{.*}}, -2.0{{$}}
				; GCN-DENORM-DAG: v_div_scale_f32 {{.*}}, -2.0{{$}}
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_rcp_f32_e32

				; GCN-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc
				; GCN-DAG: v_cmp_gt_f32_e64 vcc, |s{{[0-9]+}}|, [[L]]
				; GCN-DAG: v_cndmask_b32_e32 v{{[0-9]+}}, 1.0, [[S]], vcc

				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e64 v{{[0-9]+}}, -s{{[0-9]+}}, v{{[0-9]+}}
				; GCN-DENORM-DAG: v_rcp_f32_e32 [[RCP1:v[0-9]+]], v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, [[RCP1]]
				; GCN-DENORM-DAG: v_rcp_f32_e32 [[RCP2:v[0-9]+]], v{{[0-9]+}}
				; GCN-DENORM-DAG: v_mul_f32_e32 v{{[0-9]+}}, v{{[0-9]+}}, [[RCP2]]

				; GCN-DENORM-DAG: v_div_fmas_f32
				; GCN-DENORM-DAG: v_div_fmas_f32
				; GCN-DENORM-DAG: v_div_fixup_f32 {{.*}}, -2.0{{$}}
				; GCN-DENORM-DAG: v_div_fixup_f32 {{.*}}, -2.0{{$}}

				; GCN-FLUSH-DAG: v_rcp_f32_e32
				; GCN-FLUSH-DAG: v_rcp_f32_e64

				; GCN-NOT: v_cmp_gt_f32_e64
				; GCN-NOT: v_cndmask_b32_e32
				; GCN-FLUSH-NOT: v_div

				; GCN: global_store_dwordx4
				define amdgpu_kernel void @div_v4_c_by_minus_x_25ulp(<4 x float> addrspace(1)* %arg) {
				%load = load <4 x float>, <4 x float> addrspace(1)* %arg, align 16
				%neg = fsub <4 x float> <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %load
				%div = fdiv <4 x float> <float 2.000000e+00, float 1.000000e+00, float -1.000000e+00, float -2.000000e+00>, %neg, !fpmath !0
				store <4 x float> %div, <4 x float> addrspace(1)* %arg, align 16
				ret void
				}

				; GCN-LABEL: {{^}}div_v_by_x_25ulp:
				; GCN-DAG: s_load_dword [[VAL:s[0-9]+]], s[{{[0-9:]+}}], 0x0{{$}}

				; GCN-DENORM-DAG: v_div_scale_f32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_div_scale_f32
				; GCN-DENORM: v_div_fmas_f32
				; GCN-DENORM: v_div_fixup_f32 [[OUT:v[0-9]+]],

				; GCN-FLUSH-DAG: v_mov_b32_e32 [[L:v[0-9]+]], 0x6f800000
				; GCN-FLUSH-DAG: v_mov_b32_e32 [[S:v[0-9]+]], 0x2f800000
				; GCN-FLUSH-DAG: v_cmp_gt_f32_e64 vcc, |[[VAL]]|, [[L]]
				; GCN-FLUSH-DAG: v_cndmask_b32_e32 [[SCALE:v[0-9]+]], 1.0, [[S]], vcc
				; GCN-FLUSH: v_mul_f32_e32 [[PRESCALED:v[0-9]+]], [[VAL]], [[SCALE]]
				; GCN-FLUSH: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[PRESCALED]]
				; GCN-FLUSH: v_mul_f32_e32 [[OUT:v[0-9]+]], [[SCALE]], [[RCP]]

				; GCN: global_store_dword v[{{[0-9:]+}}], [[OUT]], off
				define amdgpu_kernel void @div_v_by_x_25ulp(float addrspace(1)* %arg, float %num) {
				%load = load float, float addrspace(1)* %arg, align 4
				%div = fdiv float %num, %load, !fpmath !0
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				; GCN-LABEL: {{^}}div_1_by_x_fast:
				; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
				; GCN: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[VAL]]
				; GCN: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
				define amdgpu_kernel void @div_1_by_x_fast(float addrspace(1)* %arg) {
				%load = load float, float addrspace(1)* %arg, align 4
				%div = fdiv fast float 1.000000e+00, %load
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				; GCN-LABEL: {{^}}div_minus_1_by_x_fast:
				; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
				; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], -[[VAL]]
				; GCN: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
				define amdgpu_kernel void @div_minus_1_by_x_fast(float addrspace(1)* %arg) {
				%load = load float, float addrspace(1)* %arg, align 4
				%div = fdiv fast float -1.000000e+00, %load
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				; GCN-LABEL: {{^}}div_1_by_minus_x_fast:
				; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
				; GCN: v_rcp_f32_e64 [[RCP:v[0-9]+]], -[[VAL]]
				; GCN: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
				define amdgpu_kernel void @div_1_by_minus_x_fast(float addrspace(1)* %arg) {
				%load = load float, float addrspace(1)* %arg, align 4
				%neg = fsub float -0.000000e+00, %load
				%div = fdiv fast float 1.000000e+00, %neg
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				; GCN-LABEL: {{^}}div_minus_1_by_minus_x_fast:
				; GCN: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
				; GCN: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[VAL]]
				; GCN: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
				define amdgpu_kernel void @div_minus_1_by_minus_x_fast(float addrspace(1)* %arg) {
				%load = load float, float addrspace(1)* %arg, align 4
				%neg = fsub float -0.000000e+00, %load
				%div = fdiv fast float -1.000000e+00, %neg
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				; GCN-LABEL: {{^}}div_1_by_x_correctly_rounded:
				; GCN-DENORM-DAG: v_div_scale_f32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_div_scale_f32
				; GCN-DENORM: v_div_fmas_f32
				; GCN-DENORM: v_div_fixup_f32

				; GCN-FLUSH: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
				; GCN-FLUSH: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[VAL]]
				; GCN-FLUSH: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
				define amdgpu_kernel void @div_1_by_x_correctly_rounded(float addrspace(1)* %arg) {
				%load = load float, float addrspace(1)* %arg, align 4
				%div = fdiv float 1.000000e+00, %load
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				; GCN-LABEL: {{^}}div_minus_1_by_x_correctly_rounded:
				; GCN-DENORM-DAG: v_div_scale_f32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_div_scale_f32
				; GCN-DENORM: v_div_fmas_f32
				; GCN-DENORM: v_div_fixup_f32

				; GCN-FLUSH: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
				; GCN-FLUSH: v_rcp_f32_e64 [[RCP:v[0-9]+]], -[[VAL]]
				; GCN-FLUSH: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
				define amdgpu_kernel void @div_minus_1_by_x_correctly_rounded(float addrspace(1)* %arg) {
				%load = load float, float addrspace(1)* %arg, align 4
				%div = fdiv float -1.000000e+00, %load
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				; GCN-LABEL: {{^}}div_1_by_minus_x_correctly_rounded:
				; GCN-DENORM-DAG: v_div_scale_f32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_div_scale_f32
				; GCN-DENORM: v_div_fmas_f32
				; GCN-DENORM: v_div_fixup_f32

				; GCN-FLUSH: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
				; GCN-FLUSH: v_rcp_f32_e64 [[RCP:v[0-9]+]], -[[VAL]]
				; GCN-FLUSH: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
				define amdgpu_kernel void @div_1_by_minus_x_correctly_rounded(float addrspace(1)* %arg) {
				%load = load float, float addrspace(1)* %arg, align 4
				%neg = fsub float -0.000000e+00, %load
				%div = fdiv float 1.000000e+00, %neg
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				; GCN-LABEL: {{^}}div_minus_1_by_minus_x_correctly_rounded:
				; GCN-DENORM-DAG: v_div_scale_f32
				; GCN-DENORM-DAG: v_rcp_f32_e32
				; GCN-DENORM-DAG: v_div_scale_f32
				; GCN-DENORM: v_div_fmas_f32
				; GCN-DENORM: v_div_fixup_f32

				; GCN-FLUSH: s_load_dword [[VAL:s[0-9]+]], s[0:1], 0x0
				; GCN-FLUSH: v_rcp_f32_e32 [[RCP:v[0-9]+]], [[VAL]]
				; GCN-FLUSH: global_store_dword v[{{[0-9:]+}}], [[RCP]], off
				define amdgpu_kernel void @div_minus_1_by_minus_x_correctly_rounded(float addrspace(1)* %arg) {
				%load = load float, float addrspace(1)* %arg, align 4
				%neg = fsub float -0.000000e+00, %load
				%div = fdiv float -1.000000e+00, %neg
				store float %div, float addrspace(1)* %arg, align 4
				ret void
				}

				!0 = !{float 2.500000e+00}
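
The two immediates that recur in the checks above, 0x6f800000 and 0x2f800000, are the single-precision bit patterns of the thresholds named in the summary (2^96 and 2^-32). A minimal sketch to confirm that on the host, assuming a 32-bit IEEE-754 `float`; `bitsToFloat` and `checkTestConstants` are hypothetical helper names:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterprets a 32-bit pattern as an IEEE-754 single-precision float.
inline float bitsToFloat(std::uint32_t Bits) {
  float F;
  static_assert(sizeof(F) == sizeof(Bits), "float must be 32 bits");
  std::memcpy(&F, &Bits, sizeof(F));
  return F;
}

// Usage: the immediates from the v_mov_b32 checks decode to 2^96 and 2^-32.
inline void checkTestConstants() {
  assert(bitsToFloat(0x6f800000u) == std::ldexp(1.0f, 96));
  assert(bitsToFloat(0x2f800000u) == std::ldexp(1.0f, -32));
}
```

Both patterns have a zero mantissa, so each is an exact power of two: the biased exponents are 223 and 95, which give 96 and -32 after subtracting the bias of 127.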