
Details

Reviewers

cfang
tstellarAMD
arsenm

Commits

rL272044: Differential Revision: http://reviews.llvm.org/D20557

Summary

Lowering floating point division for 32-bit using IEEE 754

Diff Detail

Repository: rL LLVM

Event Timeline

wdng updated this revision to Diff 58190. · May 23 2016, 9:19 PM

wdng retitled this revision to "Lowering floating point division for 32-bit using IEEE 754".

wdng updated this object.

wdng added reviewers: arsenm, tstellarAMD, cfang.

wdng set the repository for this revision to rL LLVM.

Herald added a subscriber: arsenm. · May 23 2016, 9:19 PM

arsenm added inline comments. · May 24 2016, 12:50 PM

lib/Target/AMDGPU/SIISelLowering.cpp
42	Description is inaccurate. Should say something about faster 2.5 ulp fdiv
2031	No space after (
2032	Break on same line
2034	Indentation looks wrong
2047	Don't all caps ON
2076	You don't need the else
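
As background for the "faster 2.5 ulp fdiv" wording: the path guarded by -amdgpu-fast-fdiv pre-scales very large denominators before taking the reciprocal and then multiplies the scale back in. The sketch below is a host-side C++ model of that logic as it appears in the diff on this page, not the backend code itself. The names `fast_fdiv_model` and `bits_to_float` are hypothetical, and an exact `1.0f / x` stands in for v_rcp_f32, so it illustrates the scaling only and says nothing about the 2.5 ulp bound. Reading the scaling as a way to keep the reciprocal out of the denormal range is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float (hypothetical stand-in for
// LLVM's BitsToFloat helper used in the diff).
static float bits_to_float(std::uint32_t bits) {
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Host-side model of the -amdgpu-fast-fdiv path. An exact 1.0f / x replaces
// the hardware reciprocal, so only the scaling logic is modeled.
float fast_fdiv_model(float lhs, float rhs) {
    const float k0 = bits_to_float(0x6f800000u); // 2^96
    const float k1 = bits_to_float(0x2f800000u); // 2^-32
    // Select a pre-scale factor for very large denominators (r2/r3 in the diff).
    const float scale = (std::fabs(rhs) > k0) ? k1 : 1.0f;
    // Reciprocal of the (possibly scaled) denominator (r1/r0 in the diff).
    const float rcp = 1.0f / (rhs * scale);
    // Multiplying by the scale again cancels the pre-scaling.
    return scale * (lhs * rcp);
}
```

Algebraically the result is scale * lhs / (rhs * scale), which is lhs / rhs; the scale factor only changes the magnitude of the intermediate reciprocal.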

wdng updated this revision to Diff 58310. · May 24 2016, 1:50 PM

wdng updated this object.

wdng marked 6 inline comments as done.

wdng removed a subscriber: arsenm.

Needs tests

wdng updated this revision to Diff 58637. · May 26 2016, 10:07 AM

wdng updated this object.

wdng edited edge metadata.

You should add more tests that use the fast math flags, and also run with unsafe-fp-math enabled on the function

test/CodeGen/AMDGPU/fdiv.ll
4	This should come before the r600 line
81–82	-DAG does not work as expected if you specify the same string multiple times in the same -DAG sequence. I would reduce the vector tests to a differentiating instruction or two per element (e.g. div_scale + div_fixup)
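
One way to picture the -DAG concern is a toy matcher in which every pattern is searched independently from the start of the block. This is only an assumption about what "does not work as expected" refers to, not a description of how FileCheck is implemented; the function name `all_dag_patterns_found` is hypothetical. Under this model, repeating a pattern adds no extra checking, because each copy can be satisfied by the same line.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model (assumption, not FileCheck itself): each pattern is looked up
// independently in the whole block, so identical patterns may all be
// satisfied by one and the same line.
bool all_dag_patterns_found(const std::vector<std::string> &lines,
                            const std::vector<std::string> &patterns) {
    for (const std::string &pat : patterns) {
        bool found = false;
        for (const std::string &line : lines) {
            if (line.find(pat) != std::string::npos) {
                found = true;
                break;
            }
        }
        if (!found)
            return false;
    }
    return true;
}
```

In this model, a block with a single v_rcp_f32 and a single v_mul_f32 still satisfies a check list that names each of them twice, which is why a distinguishing instruction per element is the more useful check.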

Hi Matt, you commented: "You should add more tests that use the fast math flags, and also run with unsafe-fp-math enabled on the function". It looks like fast math is not an llc flag.
Could you explain in more detail how to use the fast math flags? For example: what is the fast math flag, how does the compiler backend retrieve it, and what code should we generate for fast math? Thanks a lot!

The llc flag is -enable-unsafe-fp-math. There is also the per-function attribute "unsafe-fp-math"="true". We should probably use that, but I discovered a while ago that if one function uses it, the others must have "unsafe-fp-math"="false", because there is a bug where the setting is reset between functions.

The newly introduced flag "-amdgpu-fast-fdiv" selects between the old div implementation and the newly added IEEE 754 one. For -enable-unsafe-fp-math, could you please let me know which part of the code should be tested? Thanks!

The test should show which division implementation you get when unsafe math is enabled. There also need to be tests for the fast math flags on the individual instructions

Modify & add LIT tests based on Matt's comments

You still need to add tests that only use the fast math flags

lib/Target/AMDGPU/SIISelLowering.cpp
2080–2081	Variables should be capitalized and camel case

Any comments about my latest changes? Thanks!

Capitalize variables to follow the camel case.

In D20557#444512, @wdng wrote:

Any comments about my latest changes? Thanks!

Still need tests with the individual fast math flags

In D20557#444692, @arsenm wrote:

Still need tests with the individual fast math flags

I'm not quite clear about the individual fast math flags. Could you please describe them in detail?

In D20557#444720, @wdng wrote:

I'm not quite clear about the individual fast math flags. Could you please describe them in detail?

There should be tests with just the fast math flags as described here: http://llvm.org/docs/LangRef.html#fast-math-flags

Rather than the globally enabled fast math option.

In D20557#444721, @arsenm wrote:

There should be tests with just the fast math flags as described here: http://llvm.org/docs/LangRef.html#fast-math-flags

Rather than the globally enabled fast math option.

It looks like the compiler chooses a different fdiv implementation when the "-enable-unsafe-fp-math" flag is passed to llc. The generated *.s does show different ISA output, but the breakpoint I put at SITargetLowering::LowerFDIV is never reached. Is this correct?

Based on Matt's comments: Added tests (Merge Changpeng's LIT tests) with just the fast math flags as described here: http://llvm.org/docs/LangRef.html#fast-math-flags, rather than the globally enabled fast math option.

arsenm added inline comments. · Jun 2 2016, 12:04 PM

test/CodeGen/AMDGPU/fdiv.ll
4	fast math + -amdgpu-fast-fdiv would be another combination to try
76	There should be one that only uses arcp as well

wdng added inline comments. · Jun 2 2016, 12:36 PM

test/CodeGen/AMDGPU/fdiv.ll
4	Do you mean fast-math flag + "-amdgpu-fast-fdiv" ?

arsenm added inline comments. · Jun 2 2016, 5:11 PM

test/CodeGen/AMDGPU/fdiv.ll
4	Yes

wdng added inline comments. · Jun 2 2016, 5:41 PM

test/CodeGen/AMDGPU/fdiv.ll
4	Actually, the combination of fast math + -amdgpu-fast-fdiv is already covered by these tests. When fast math flags + -enable-unsafe-fp-math are enabled (fast, arcp, etc.), "if (Flags->hasAllowReciprocal()) { ... }" will be executed. The original, less accurate fpdiv implementation will be executed when only the -amdgpu-fast-fdiv flag is enabled. If we enable fast math + -amdgpu-fast-fdiv, the less accurate fpdiv will be tested.
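
The combinations under discussion can be laid out as a small decision function. This is a sketch that follows the order of checks visible in the diff on this page (the unsafe/arcp check in LowerFastFDIV first, then the -amdgpu-fast-fdiv check, then the new expansion). The type and function names are hypothetical, anything hidden behind "Context not available" is ignored, and it assumes the arcp node flag is actually set when expected, which is the point the reviewer questions.

```cpp
#include <cassert>

// Which f32 division expansion is chosen (names are illustrative only).
enum class FDiv32Path {
    RcpMul,        // x * rcp(y), taken for unsafe math or arcp
    ScaledRcp,     // older, less accurate path behind -amdgpu-fast-fdiv
    DivScaleFixup  // new div_scale / fma / div_fmas / div_fixup expansion
};

struct FDivOptions {
    bool unsafe_fp_math;   // -enable-unsafe-fp-math
    bool arcp_flag;        // the fdiv carries arcp (the fast flag implies it)
    bool amdgpu_fast_fdiv; // -amdgpu-fast-fdiv
};

// Mirrors the order of the checks as read from the diff.
FDiv32Path select_fdiv32_path(const FDivOptions &o) {
    if (o.unsafe_fp_math || o.arcp_flag)
        return FDiv32Path::RcpMul;
    if (o.amdgpu_fast_fdiv)
        return FDiv32Path::ScaledRcp;
    return FDiv32Path::DivScaleFixup;
}
```

Under this reading, a test that combines an arcp or fast fdiv with -amdgpu-fast-fdiv would be expected to show the plain reciprocal-multiply sequence, which is the case the requested extra test would pin down.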

Added tests for fast-math arcp.

LGTM

test/CodeGen/AMDGPU/fdiv.ll
4	Is it? This is what I'm unsure about and why I think there should be an additional test. Are the individual math node flags set if these are globally enabled?

This revision is now accepted and ready to land. · Jun 6 2016, 1:42 PM

artem.tamazov added a subscriber: artem.tamazov. · Jun 7 2016, 4:57 AM

artem.tamazov added inline comments.

test/CodeGen/AMDGPU/fdiv.ll
50–55	AFAIK there should be:
1. Two v_div_scale prior to v_rcp
2. Two (not one) v_fma prior to v_mul
3. Three (not one) v_fma after v_mul
4. v_div_fmas prior to v_div_fixup
Perhaps the test should check that.
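
The instruction counts listed above line up with the FMA chain in LowerFDIV32 from the diff. As a rough illustration, the chain can be modeled on the host with `std::fma`. This sketch makes several assumptions: the function name `refine_div_model` is hypothetical, an exact `1.0f / d` stands in for v_rcp_f32, and the scaling done by div_scale, the conditional rescaling in div_fmas, and the special-case handling in div_fixup are all left out, so it only shows the refinement arithmetic for ordinary inputs.

```cpp
#include <cassert>
#include <cmath>

// Host-side model of the FMA refinement chain (Fma0..Fma4 in the diff).
// Scaling and fixup steps are omitted; n and d are used unscaled.
float refine_div_model(float n, float d) {
    const float rcp = 1.0f / d;                      // stand-in for v_rcp_f32
    const float neg_d = -d;                          // NegDivScale0
    const float fma0 = std::fma(neg_d, rcp, 1.0f);   // 1 - d * rcp
    const float fma1 = std::fma(fma0, rcp, rcp);     // refined reciprocal
    const float mul = n * fma1;                      // first quotient estimate
    const float fma2 = std::fma(neg_d, mul, n);      // residual n - d * mul
    const float fma3 = std::fma(fma2, fma1, mul);    // refined quotient
    const float fma4 = std::fma(neg_d, fma3, n);     // residual n - d * fma3
    // Final correction, standing in for div_fmas without its rescaling.
    return std::fma(fma4, fma1, fma3);
}
```

Counting the steps gives one reciprocal, two FMAs before the multiply, and three FMAs after it, followed by the final fused step that the real sequence performs with v_div_fmas_f32.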

wdng added inline comments. · Jun 7 2016, 8:40 AM

test/CodeGen/AMDGPU/fdiv.ll
50–55	This follows Matt's comment: "-DAG does not work as expected if you specify the same string multiple times in the same -DAG sequence". So we reduced the vector tests to a differentiating instruction or two per element (e.g. div_scale + div_fixup).

Closed by commit rL272044: Differential Revision: http://reviews.llvm.org/D20557 (authored by wdng). · Jun 7 2016, 12:11 PM

This revision was automatically updated to reflect the committed changes.

artem.tamazov added inline comments. · Jun 8 2016, 6:28 AM

test/CodeGen/AMDGPU/fdiv.ll
50–55	OK. However, v_div_fmas_f32 is still missing. Perhaps this is not important.

Fixed LIT test failures in frem.ll and rsq.ll

arsenm added inline comments. · Jun 8 2016, 12:32 PM

test/CodeGen/AMDGPU/frem.ll
1–3	This should also test without -amdgpu-fast-fdiv. -mcpu=SI should also be removed

cfang added inline comments. · Jun 8 2016, 1:02 PM

test/CodeGen/AMDGPU/frem.ll
1–3	Actually I don't suggest we test with -amdgpu-fast-fdiv here at all. This option has already been tested in fdiv.ll, and testing here does not provide any additional value.

Modified LIT test based on Matt's and Changpeng's suggestion.

wdng marked an inline comment as done. · Jun 8 2016, 5:17 PM

wdng added inline comments.

test/CodeGen/AMDGPU/fdiv.ll
50–55	Yes.

LGTM

Please go ahead and commit!

Diff 60120

lib/Target/AMDGPU/SIISelLowering.cpp

Context not available.

	using namespace llvm;	using namespace llvm;

		// -amdgpu-fast-fdiv - Command line option to enable faster 2.5 ulp fdiv.
		static cl::opt<bool> EnableAMDGPUFastFDIV(
		"amdgpu-fast-fdiv",
		cl::desc("Enable faster 2.5 ulp fdiv"),
		arsenm (inline comment, done): Description is inaccurate. Should say something about faster 2.5 ulp fdiv
		cl::init(false));

	static unsigned findFirstFreeSGPR(CCState &CCInfo) {	static unsigned findFirstFreeSGPR(CCState &CCInfo) {
	unsigned NumSGPRs = AMDGPU::SGPR_32RegClass.getNumRegs();	unsigned NumSGPRs = AMDGPU::SGPR_32RegClass.getNumRegs();
	for (unsigned Reg = 0; Reg < NumSGPRs; ++Reg) {	for (unsigned Reg = 0; Reg < NumSGPRs; ++Reg) {
Context not available.
	}	}
	}	}

	if (Unsafe) {	const SDNodeFlags *Flags = Op->getFlags();

		if (Unsafe || Flags->hasAllowReciprocal()) {
	// Turn into multiply by the reciprocal.	// Turn into multiply by the reciprocal.
	// x / y -> x * (1.0 / y)	// x / y -> x * (1.0 / y)
	SDNodeFlags Flags;	SDNodeFlags Flags;
Context not available.

	SDValue SITargetLowering::LowerFDIV32(SDValue Op, SelectionDAG &DAG) const {	SDValue SITargetLowering::LowerFDIV32(SDValue Op, SelectionDAG &DAG) const {
		arsenm (inline comment, done): No space after (
	if (SDValue FastLowered = LowerFastFDIV(Op, DAG))	if (SDValue FastLowered = LowerFastFDIV(Op, DAG))
		arsenm (inline comment, done): Break on same line
	return FastLowered;	return FastLowered;

		arsenm (inline comment, done): Indentation looks wrong
	// This uses v_rcp_f32 which does not handle denormals. Let this hit a	// This uses v_rcp_f32 which does not handle denormals. Let this hit a
	// selection error for now rather than do something incorrect.	// selection error for now rather than do something incorrect.
Context not available.
	SDValue LHS = Op.getOperand(0);	SDValue LHS = Op.getOperand(0);
	SDValue RHS = Op.getOperand(1);	SDValue RHS = Op.getOperand(1);

	SDValue r1 = DAG.getNode(ISD::FABS, SL, MVT::f32, RHS);	// faster 2.5 ulp fdiv when using -amdgpu-fast-fdiv flag
		if (EnableAMDGPUFastFDIV) {
		SDValue r1 = DAG.getNode(ISD::FABS, SL, MVT::f32, RHS);

		arsenm (inline comment, done): Don't all caps ON
	const APFloat K0Val(BitsToFloat(0x6f800000));	const APFloat K0Val(BitsToFloat(0x6f800000));
	const SDValue K0 = DAG.getConstantFP(K0Val, SL, MVT::f32);	const SDValue K0 = DAG.getConstantFP(K0Val, SL, MVT::f32);

	const APFloat K1Val(BitsToFloat(0x2f800000));	const APFloat K1Val(BitsToFloat(0x2f800000));
	const SDValue K1 = DAG.getConstantFP(K1Val, SL, MVT::f32);	const SDValue K1 = DAG.getConstantFP(K1Val, SL, MVT::f32);

	const SDValue One = DAG.getConstantFP(1.0, SL, MVT::f32);	const SDValue One = DAG.getConstantFP(1.0, SL, MVT::f32);

	EVT SetCCVT =	EVT SetCCVT =
	getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), MVT::f32);	getSetCCResultType(DAG.getDataLayout(), *DAG.getContext(), MVT::f32);

	SDValue r2 = DAG.getSetCC(SL, SetCCVT, r1, K0, ISD::SETOGT);	SDValue r2 = DAG.getSetCC(SL, SetCCVT, r1, K0, ISD::SETOGT);

	SDValue r3 = DAG.getNode(ISD::SELECT, SL, MVT::f32, r2, K1, One);	SDValue r3 = DAG.getNode(ISD::SELECT, SL, MVT::f32, r2, K1, One);

	// TODO: Should this propagate fast-math-flags?	// TODO: Should this propagate fast-math-flags?

	r1 = DAG.getNode(ISD::FMUL, SL, MVT::f32, RHS, r3);	r1 = DAG.getNode(ISD::FMUL, SL, MVT::f32, RHS, r3);

	SDValue r0 = DAG.getNode(AMDGPUISD::RCP, SL, MVT::f32, r1);	SDValue r0 = DAG.getNode(AMDGPUISD::RCP, SL, MVT::f32, r1);

	SDValue Mul = DAG.getNode(ISD::FMUL, SL, MVT::f32, LHS, r0);	SDValue Mul = DAG.getNode(ISD::FMUL, SL, MVT::f32, LHS, r0);

	return DAG.getNode(ISD::FMUL, SL, MVT::f32, r3, Mul);	return DAG.getNode(ISD::FMUL, SL, MVT::f32, r3, Mul);
		}

		// Generates more precise fpdiv32.
		const SDValue One = DAG.getConstantFP(1.0, SL, MVT::f32);

		arsenm (inline comment, done): You don't need the else
		SDVTList ScaleVT = DAG.getVTList(MVT::f32, MVT::i1);

		SDValue DenominatorScaled = DAG.getNode(AMDGPUISD::DIV_SCALE, SL, ScaleVT, RHS, RHS, LHS);
		SDValue NumeratorScaled = DAG.getNode(AMDGPUISD::DIV_SCALE, SL, ScaleVT, LHS, RHS, LHS);

		arsenm (inline comment, done): Variables should be capitalized and camel case
		SDValue ApproxRcp = DAG.getNode(AMDGPUISD::RCP, SL, MVT::f32, DenominatorScaled);

		SDValue NegDivScale0 = DAG.getNode(ISD::FNEG, SL, MVT::f32, DenominatorScaled);

		SDValue Fma0 = DAG.getNode(ISD::FMA, SL, MVT::f32, NegDivScale0, ApproxRcp, One);
		SDValue Fma1 = DAG.getNode(ISD::FMA, SL, MVT::f32, Fma0, ApproxRcp, ApproxRcp);

		SDValue Mul = DAG.getNode(ISD::FMUL, SL, MVT::f32, NumeratorScaled, Fma1);

		SDValue Fma2 = DAG.getNode(ISD::FMA, SL, MVT::f32, NegDivScale0, Mul, NumeratorScaled);
		SDValue Fma3 = DAG.getNode(ISD::FMA, SL, MVT::f32, Fma2, Fma1, Mul);
		SDValue Fma4 = DAG.getNode(ISD::FMA, SL, MVT::f32, NegDivScale0, Fma3, NumeratorScaled);

		SDValue Scale = NumeratorScaled.getValue(1);
		SDValue Fmas = DAG.getNode(AMDGPUISD::DIV_FMAS, SL, MVT::f32, Fma4, Fma1, Fma3, Scale);

		return DAG.getNode(AMDGPUISD::DIV_FIXUP, SL, MVT::f32, Fmas, RHS, LHS);
	}	}

	SDValue SITargetLowering::LowerFDIV64(SDValue Op, SelectionDAG &DAG) const {	SDValue SITargetLowering::LowerFDIV64(SDValue Op, SelectionDAG &DAG) const {
Context not available.

test/CodeGen/AMDGPU/fdiv.ll

	; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=SI %s	; RUN: llc -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=SI %s
	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs < %s | FileCheck -check-prefix=SI %s	; RUN: llc -march=amdgcn -mcpu=tonga -verify-machineinstrs -amdgpu-fast-fdiv < %s | FileCheck -check-prefix=SI %s
		; RUN: llc -march=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -check-prefix=I754 %s
		; RUN: llc -march=amdgcn -mcpu=fiji -verify-machineinstrs -enable-unsafe-fp-math < %s | FileCheck -check-prefix=UNSAFE-FP %s
		arsenm (inline comment, done): This should come before the r600 line
		arsenm (inline comment, not done): fast math + -amdgpu-fast-fdiv would be another combination to try
		wdng (author, inline comment, not done): Do you mean fast-math flag + "-amdgpu-fast-fdiv"?
		arsenm (inline comment, not done): Yes
		wdng (author, inline comment, not done): Actually, the combination of fast math + -amdgpu-fast-fdiv is already covered by these tests. When fast math flags + -enable-unsafe-fp-math are enabled (fast, arcp, etc.), "if (Flags->hasAllowReciprocal()) { ... }" will be executed. The original, less accurate fpdiv implementation will be executed when only the -amdgpu-fast-fdiv flag is enabled. If we enable fast math + -amdgpu-fast-fdiv, the less accurate fpdiv will be tested.
		arsenm (inline comment, not done): Is it? This is what I'm unsure about and why I think there should be an additional test. Are the individual math node flags set if these are globally enabled?
	; RUN: llc -march=r600 -mcpu=redwood < %s | FileCheck -check-prefix=R600 %s	; RUN: llc -march=r600 -mcpu=redwood < %s | FileCheck -check-prefix=R600 %s

	; These tests check that fdiv is expanded correctly and also test that the	; These tests check that fdiv is expanded correctly and also test that the
	; scheduler is scheduling the RECIP_IEEE and MUL_IEEE instructions in separate	; scheduler is scheduling the RECIP_IEEE and MUL_IEEE instructions in separate
	; instruction groups.	; instruction groups.

		; These test check that fdiv using unsafe_fp_math, coarse fp div, and IEEE754 fp div.

	; FUNC-LABEL: {{^}}fdiv_f32:	; FUNC-LABEL: {{^}}fdiv_f32:
	; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Z	; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Z
	; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Y	; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Y
	; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[3].X, PS	; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[3].X, PS
	; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[2].W, PS	; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[2].W, PS

		; UNSAFE-FP: v_rcp_f32
		; UNSAFE-FP: v_mul_f32_e32

	; SI-DAG: v_rcp_f32	; SI-DAG: v_rcp_f32
	; SI-DAG: v_mul_f32	; SI-DAG: v_mul_f32

		; I754-DAG: v_div_scale_f32
		; I754-DAG: v_rcp_f32
		; I754-DAG: v_fma_f32
		; I754-DAG: v_mul_f32
		; I754-DAG: v_fma_f32
		; I754-DAG: v_div_fixup_f32
	define void @fdiv_f32(float addrspace(1)* %out, float %a, float %b) {	define void @fdiv_f32(float addrspace(1)* %out, float %a, float %b) {
	entry:	entry:
	%0 = fdiv float %a, %b	%0 = fdiv float %a, %b
Context not available.
	ret void	ret void
	}	}

		; FUNC-LABEL: {{^}}fdiv_f32_fast_math:
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Z
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Y
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[3].X, PS
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[2].W, PS

		; UNSAFE-FP: v_rcp_f32
		; UNSAFE-FP: v_mul_f32_e32

		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		define void @fdiv_f32_fast_math(float addrspace(1)* %out, float %a, float %b) {
		entry:
		%0 = fdiv fast float %a, %b
		store float %0, float addrspace(1)* %out
		ret void
		}

		artem.tamazov (inline comment, not done): AFAIK there should be: (1) two v_div_scale prior to v_rcp; (2) two (not one) v_fma prior to v_mul; (3) three (not one) v_fma after v_mul; (4) v_div_fmas prior to v_div_fixup. Perhaps the test should check that.
		wdng (author, inline comment, not done): Based on Matt's comment: "-DAG does not work as expected if you specify the same string multiple times in the same -DAG sequence". So, we reduce the vector tests to a differentiating instruction or two per element (e.g. div_scale + div_fixup).
		artem.tamazov (inline comment, not done): OK. However, v_div_fmas_f32 is still missing. Perhaps this is not important.
		wdng (author, inline comment, not done): Yes.
		; FUNC-LABEL: {{^}}fdiv_f32_arcp_math:
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Z
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Y
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[3].X, PS
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[2].W, PS

		; UNSAFE-FP: v_rcp_f32
		; UNSAFE-FP: v_mul_f32_e32

		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		define void @fdiv_f32_arcp_math(float addrspace(1)* %out, float %a, float %b) {
		entry:
		%0 = fdiv arcp float %a, %b
		store float %0, float addrspace(1)* %out
		ret void
		}

	; FUNC-LABEL: {{^}}fdiv_v2f32:	; FUNC-LABEL: {{^}}fdiv_v2f32:
	; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Z	; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Z
		arsenm (inline comment, done): There should be one that only uses arcp as well
Context not available.
	; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[3].X, PS	; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[3].X, PS
	; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[2].W, PS	; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[2].W, PS

		; UNSAFE-FP: v_rcp_f32
		; UNSAFE-FP: v_rcp_f32
		; UNSAFE-FP: v_mul_f32_e32
		arsenm (inline comment, done): -DAG does not work as expected if you specify the same string multiple times in the same -DAG sequence. I would reduce the vector tests to a differentiating instruction or two per element (e.g. div_scale + div_fixup)
		; UNSAFE-FP: v_mul_f32_e32

	; SI-DAG: v_rcp_f32	; SI-DAG: v_rcp_f32
	; SI-DAG: v_mul_f32	; SI-DAG: v_mul_f32
	; SI-DAG: v_rcp_f32	; SI-DAG: v_rcp_f32
	; SI-DAG: v_mul_f32	; SI-DAG: v_mul_f32

		; I754: v_div_scale_f32
		; I754: v_div_scale_f32
		; I754: v_div_scale_f32
		; I754: v_div_scale_f32
		; I754: v_div_fixup_f32
		; I754: v_div_fixup_f32
	define void @fdiv_v2f32(<2 x float> addrspace(1)* %out, <2 x float> %a, <2 x float> %b) {	define void @fdiv_v2f32(<2 x float> addrspace(1)* %out, <2 x float> %a, <2 x float> %b) {
	entry:	entry:
	%0 = fdiv <2 x float> %a, %b	%0 = fdiv <2 x float> %a, %b
Context not available.
	ret void	ret void
	}	}

		; FUNC-LABEL: {{^}}fdiv_v2f32_fast_math:
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Z
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Y
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[3].X, PS
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[2].W, PS

		; UNSAFE-FP: v_rcp_f32
		; UNSAFE-FP: v_rcp_f32
		; UNSAFE-FP: v_mul_f32_e32
		; UNSAFE-FP: v_mul_f32_e32

		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		define void @fdiv_v2f32_fast_math(<2 x float> addrspace(1)* %out, <2 x float> %a, <2 x float> %b) {
		entry:
		%0 = fdiv fast <2 x float> %a, %b
		store <2 x float> %0, <2 x float> addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}fdiv_v2f32_arcp_math:
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Z
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW]}}, KC0[3].Y
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[3].X, PS
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW]}}, KC0[2].W, PS

		; UNSAFE-FP: v_rcp_f32
		; UNSAFE-FP: v_rcp_f32
		; UNSAFE-FP: v_mul_f32_e32
		; UNSAFE-FP: v_mul_f32_e32

		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		define void @fdiv_v2f32_arcp_math(<2 x float> addrspace(1)* %out, <2 x float> %a, <2 x float> %b) {
		entry:
		%0 = fdiv arcp <2 x float> %a, %b
		store <2 x float> %0, <2 x float> addrspace(1)* %out
		ret void
		}

	; FUNC-LABEL: {{^}}fdiv_v4f32:	; FUNC-LABEL: {{^}}fdiv_v4f32:
	; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}	; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
	; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}	; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
Context not available.
	; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}, PS	; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}, PS
	; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}, PS	; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}, PS

		; UNSAFE-FP: v_rcp_f32_e32
		; UNSAFE-FP: v_rcp_f32_e32
		; UNSAFE-FP: v_rcp_f32_e32
		; UNSAFE-FP: v_rcp_f32_e32
		; UNSAFE-FP: v_mul_f32_e32
		; UNSAFE-FP: v_mul_f32_e32
		; UNSAFE-FP: v_mul_f32_e32
		; UNSAFE-FP: v_mul_f32_e32

	; SI-DAG: v_rcp_f32	; SI-DAG: v_rcp_f32
	; SI-DAG: v_mul_f32	; SI-DAG: v_mul_f32
	; SI-DAG: v_rcp_f32	; SI-DAG: v_rcp_f32
Context not available.
	; SI-DAG: v_mul_f32	; SI-DAG: v_mul_f32
	; SI-DAG: v_rcp_f32	; SI-DAG: v_rcp_f32
	; SI-DAG: v_mul_f32	; SI-DAG: v_mul_f32

		; I754: v_div_scale_f32
		; I754: v_div_scale_f32
		; I754: v_div_scale_f32
		; I754: v_div_scale_f32
		; I754: v_div_scale_f32
		; I754: v_div_scale_f32
		; I754: v_div_scale_f32
		; I754: v_div_scale_f32
		; I754: v_div_fixup_f32
		; I754: v_div_fixup_f32
		; I754: v_div_fixup_f32
		; I754: v_div_fixup_f32
	define void @fdiv_v4f32(<4 x float> addrspace(1)* %out, <4 x float> addrspace(1)* %in) {	define void @fdiv_v4f32(<4 x float> addrspace(1)* %out, <4 x float> addrspace(1)* %in) {
	%b_ptr = getelementptr <4 x float>, <4 x float> addrspace(1)* %in, i32 1	%b_ptr = getelementptr <4 x float>, <4 x float> addrspace(1)* %in, i32 1
	%a = load <4 x float>, <4 x float> addrspace(1) * %in	%a = load <4 x float>, <4 x float> addrspace(1) * %in
Context not available.
	store <4 x float> %result, <4 x float> addrspace(1)* %out	store <4 x float> %result, <4 x float> addrspace(1)* %out
	ret void	ret void
	}	}

		; FUNC-LABEL: {{^}}fdiv_v4f32_fast_math:
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}, PS
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}, PS
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}, PS
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}, PS

		; UNSAFE-FP: v_rcp_f32_e32
		; UNSAFE-FP: v_rcp_f32_e32
		; UNSAFE-FP: v_rcp_f32_e32
		; UNSAFE-FP: v_rcp_f32_e32
		; UNSAFE-FP: v_mul_f32_e32
		; UNSAFE-FP: v_mul_f32_e32
		; UNSAFE-FP: v_mul_f32_e32
		; UNSAFE-FP: v_mul_f32_e32

		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		define void @fdiv_v4f32_fast_math(<4 x float> addrspace(1)* %out, <4 x float> addrspace(1)* %in) {
		%b_ptr = getelementptr <4 x float>, <4 x float> addrspace(1)* %in, i32 1
		%a = load <4 x float>, <4 x float> addrspace(1) * %in
		%b = load <4 x float>, <4 x float> addrspace(1) * %b_ptr
		%result = fdiv fast <4 x float> %a, %b
		store <4 x float> %result, <4 x float> addrspace(1)* %out
		ret void
		}

		; FUNC-LABEL: {{^}}fdiv_v4f32_arcp_math:
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
		; R600-DAG: RECIP_IEEE * T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}, PS
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}, PS
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}, PS
		; R600-DAG: MUL_IEEE {{\** *}}T{{[0-9]+\.[XYZW], T[0-9]+\.[XYZW]}}, PS

		; UNSAFE-FP: v_rcp_f32_e32
		; UNSAFE-FP: v_rcp_f32_e32
		; UNSAFE-FP: v_rcp_f32_e32
		; UNSAFE-FP: v_rcp_f32_e32
		; UNSAFE-FP: v_mul_f32_e32
		; UNSAFE-FP: v_mul_f32_e32
		; UNSAFE-FP: v_mul_f32_e32
		; UNSAFE-FP: v_mul_f32_e32

		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		; SI-DAG: v_rcp_f32
		; SI-DAG: v_mul_f32
		define void @fdiv_v4f32_arcp_math(<4 x float> addrspace(1)* %out, <4 x float> addrspace(1)* %in) {
		%b_ptr = getelementptr <4 x float>, <4 x float> addrspace(1)* %in, i32 1
		%a = load <4 x float>, <4 x float> addrspace(1) * %in
		%b = load <4 x float>, <4 x float> addrspace(1) * %b_ptr
		%result = fdiv arcp <4 x float> %a, %b
		store <4 x float> %result, <4 x float> addrspace(1)* %out
		ret void
		}
Context not available.

test/CodeGen/AMDGPU/frem.ll

Context not available.
	; FUNC-LABEL: {{^}}frem_f32:	; FUNC-LABEL: {{^}}frem_f32:
	; GCN-DAG: buffer_load_dword [[X:v[0-9]+]], {{.*$}}	; GCN-DAG: buffer_load_dword [[X:v[0-9]+]], {{.*$}}
	; GCN-DAG: buffer_load_dword [[Y:v[0-9]+]], {{.*}} offset:16	; GCN-DAG: buffer_load_dword [[Y:v[0-9]+]], {{.*}} offset:16
	; GCN-DAG: v_cmp	; GCN: v_div_scale_f32
	; GCN-DAG: v_mul_f32
	; GCN: v_rcp_f32_e32	; GCN: v_rcp_f32_e32
		; GCN: v_fma_f32
	; GCN: v_mul_f32_e32	; GCN: v_mul_f32_e32
	; GCN: v_mul_f32_e32	; GCN: v_div_fmas_f32
		; GCN: v_div_fixup_f32
	; GCN: v_trunc_f32_e32	; GCN: v_trunc_f32_e32
	; GCN: v_mad_f32	; GCN: v_mad_f32
	; GCN: s_endpgm	; GCN: s_endpgm
Context not available.

This is an archive of the discontinued LLVM Phabricator instance.

Lowering floating point division for 32-bit using IEEE 754
Closed, Public
