This is an archive of the discontinued LLVM Phabricator instance.

Differential D120150

Constant folding of llvm.amdgcn.trig.preop
Needs ReviewPublic

Authored by arsenm on Feb 18 2022, 11:12 AM.

Details

Reviewers

b-sumner
sameerds
cdevadas
Ravi

Summary

If both parameters to the amdgcn.trig.preop intrinsic (the input and the segment select) are compile-time constants, this patch precomputes the intrinsic's result on the CPU and replaces its uses with the computed constant.

All existing AMDGPU lit tests pass, including the negative cases where the parameters to this intrinsic are variable.
Added a simple test case whose expected output exactly matches the output from the GPU.

Created a small HIP test application that runs the exact compute logic (with the same constants for 2/pi) on the CPU and invokes the intrinsic in a GPU kernel.
Ran the test over the entire range of double floating-point. The CPU outputs match those of the intrinsic on a gfx10 AMD GPU.
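
As a rough illustration of the compute logic described in the summary, here is a minimal standalone C++ sketch of the fold using plain integers. It follows the arithmetic of the posted diff, but `trigPreopModel` and `tableWord` are hypothetical names, the masking of the segment select to 5 bits and the zero-fill past the end of the table are assumptions of this sketch, and host `std::ldexp` stands in for the APFloat `scalbn` the reviewers asked for. Working the patch's test input (345.435, segment 5) through these steps by hand gives 0x2F42371D2126E970, the constant in the lit test.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(uint64_t), "sketch assumes 64-bit double");

// Leading bits of 2/pi, most significant word first (values copied from the patch).
static const uint32_t TwoByPi[] = {
    0xa2f9836e, 0x4e441529, 0xfc2757d1, 0xf534ddc0, 0xdb629599, 0x3c439041,
    0xfe5163ab, 0xdebbc561, 0xb7246e3a, 0x424dd2e0, 0x06492eea, 0x09d1921c,
    0xfe1deb1c, 0xb129a73e, 0xe88235f5, 0x2ebb4484, 0xe99c7026, 0xb45f7e41,
    0x3991d639, 0x835339f4, 0x9c845f8b, 0xbdf9283b, 0x1ff897ff, 0xde05980f,
    0xef2f118b, 0x5a0a6d1f, 0x6d367ecf, 0x27cb09b7, 0x4f463f66, 0x9e5fea2d,
    0x7527bac7, 0xebe5f17b, 0x3d0739f7, 0x8a5292ea, 0x6bfb5fb1, 0x1f8d5d08,
    0x56033046};
static const uint32_t NumWords = sizeof(TwoByPi) / sizeof(TwoByPi[0]);

// Hypothetical helper: table word I, or 0 past the end of the table.
// Zero-filling is an assumption of this sketch, not confirmed hardware behaviour.
static uint64_t tableWord(uint32_t I) { return I < NumWords ? TwoByPi[I] : 0; }

// Hypothetical host model of the fold, written with plain integers.
double trigPreopModel(double Src, uint32_t Segment) {
  uint64_t Bits;
  std::memcpy(&Bits, &Src, sizeof(Bits)); // bit copy without type punning
  uint32_t E = (Bits >> 52) & 0x7ff;      // biased exponent; sign and mantissa unused

  // Assumption: only the low 5 bits of the segment select matter (32 segments).
  uint32_t Shift = 53 * (Segment & 31);
  if (E > 1077)
    Shift += E - 1077;

  uint32_t I = Shift >> 5;       // first 32-bit table word to read
  uint32_t BShift = Shift & 31;  // bit offset inside that word

  uint64_t Thi = (tableWord(I) << 32) | tableWord(I + 1);
  uint64_t Tlo = tableWord(I + 2) << 32;
  if (BShift > 0)
    Thi = (Thi << BShift) | (Tlo >> (64 - BShift));

  Thi >>= 11; // keep the top 53 bits so the conversion to double is exact
  assert(Thi < (1ULL << 53) && "value must fit in a double mantissa");

  int Scale = -53 - static_cast<int>(Shift);
  if (E >= 0x7b0)
    Scale += 128;
  return std::ldexp(static_cast<double>(Thi), Scale);
}
```

Because the sign and mantissa bits are never read, any two inputs with the same exponent give the same result for a given segment in this model.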

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Ravi created this revision. Feb 18 2022, 11:12 AM

Herald added subscribers: foad, kerbowa, hiraditya and 3 others. Feb 18 2022, 11:12 AM

Ravi requested review of this revision. Feb 18 2022, 11:12 AM

Herald added a project: Restricted Project. Feb 18 2022, 11:12 AM

Herald added subscribers: llvm-commits, wdng.

arsenm added inline comments. Feb 18 2022, 11:27 AM

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
329	Relying on size of double, also this is technically undefined.
1010	uint32_t
1026–1027	Early exit and reduce indentation
1029	Indentation off
1044	Should stick with apfloat operations
1048	You can use scalbn for APFloat instead of relying on host ldexp

Harbormaster completed remote builds in B150460: Diff 409976. Feb 18 2022, 11:35 AM

Ran the test over the entire range of double floating-point.

Just curious: how long does it take to test all 2^64 inputs?

In D120150#3332613, @foad wrote:

Ran the test over the entire range of double floating-point.

Just curious: how long does it take to test all 2^64 inputs?

The only dependence is on the exponent so 2K inputs is sufficient.

In D120150#3332613, @foad wrote:

Ran the test over the entire range of double floating-point.

Just curious: how long does it take to test all 2^64 inputs?

It took around 5 days on Ryzen 9 - 5950. Was running a single thread for CPU though. And each input was checked with all 32 segments.

In D120150#3332632, @b-sumner wrote:

In D120150#3332613, @foad wrote:

Ran the test over the entire range of double floating-point.

Just curious: how long does it take to test all 2^64 inputs?

The only dependence is on the exponent so 2K inputs is sufficient.

Yes right. That would have greatly reduced the run times. Missed out, didn't give a thought in that direction.
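
b-sumner's point can be sketched as follows: since only the 11-bit exponent field affects the result, one representative double per biased exponent covers every distinct case. This is a minimal illustration, not the HIP test from the summary; `exponentRepresentatives` and `reducedSweepSize` are hypothetical names, and the count of 32 segments is taken from Ravi's comment above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(double) == sizeof(uint64_t), "sketch assumes 64-bit double");

// One double per biased exponent value, with sign and mantissa bits zero.
// Exponent 0x7ff produces +infinity; how the fold treats it is not covered here.
std::vector<double> exponentRepresentatives() {
  std::vector<double> Reps;
  for (uint64_t E = 0; E < 2048; ++E) {
    uint64_t Bits = E << 52; // place E in the exponent field
    double D;
    std::memcpy(&D, &Bits, sizeof(D));
    Reps.push_back(D);
  }
  assert(Reps.size() == 2048 && "an 11-bit field has 2048 values");
  return Reps;
}

// Number of (input, segment) pairs in the reduced sweep, assuming 32 segments.
uint64_t reducedSweepSize() {
  return static_cast<uint64_t>(exponentRepresentatives().size()) * 32;
}
```

Under these assumptions the sweep shrinks from 2^64 inputs times 32 segments to 2048 inputs times 32 segments.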

Ravi added inline comments. Feb 18 2022, 11:38 PM

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
1044	Yes, will do that.
1048	Yes..will try it out and check for any precision difference. Should be good as long as the internal implementation of APFloat is within 0.5 ULP.

arsenm added inline comments. Feb 21 2022, 6:51 AM

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
1048	Actually we should be constant folding the ldexp intrinsic too. I thought I did that before, but in the code above I don't see it handling arbitrary constants

Fixed all the review comments.

Herald added a project: Restricted Project. May 11 2022, 11:40 AM

Herald added subscribers: kosarev, jsilvanus, hsmhsm.

Ravi marked 6 inline comments as done. May 11 2022, 11:42 AM

Ravi added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
1048	Can be done in another patch

Harbormaster completed remote builds in B163953: Diff 428730. May 11 2022, 1:15 PM

arsenm added inline comments. May 11 2022, 4:18 PM

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
1024	demorgan this
1035	extra parentheses
1042	extra parentheses
1054	extra parentheses
1055	extra parentheses

foad added inline comments. May 12 2022, 8:40 AM

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
1021	Capitalizing the names like CSrc, CSeg, FSrc, NumBits etc is more common.
1029	Just use `->getValue`. And in general I think you should convert CSrc and CSeg to plain uint32_t / uint64_t as soon as possible. You have lots of uses of APInt below where it really is not necessary.

Reverse ping

@Ravi where are these tests you mentioned?

arsenm added inline comments. Nov 16 2022, 1:45 PM

llvm/test/Transforms/InstCombine/AMDGPU/amdgcn-intrinsics.ll
5638 ↗	(On Diff #475913)	I feel like the tests are lacking in sample values but coming up with ones to test every point in the table seems exhausting

Harbormaster completed remote builds in B198064: Diff 475913. Nov 16 2022, 2:51 PM

Add assert, which shows this table lookup is broken.

Also improve poison handling
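
One way to see why an assert on the table lookup could fire is to work out the largest table index the arithmetic can produce. The sketch below assumes the index computation of Diff 428730 (shown further down this page), a 37-word table, and a segment select limited to 0..31; the later diffs that actually added the assert are not shown here, so the helper names `shiftFor`, `lastWordRead`, `inBounds`, and `checkedIndex` are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Number of 32-bit entries in TwoByPi in the posted diff.
constexpr uint32_t TableWords = 37;

// Bit offset into the 2/pi table for a biased exponent and segment select.
constexpr uint32_t shiftFor(uint32_t BiasedExp, uint32_t Segment) {
  return (BiasedExp > 1077 ? BiasedExp - 1077 : 0) + 53 * Segment;
}

// The code reads words I, I + 1 and I + 2, so I + 2 is the last index touched.
constexpr uint32_t lastWordRead(uint32_t BiasedExp, uint32_t Segment) {
  return (shiftFor(BiasedExp, Segment) >> 5) + 2;
}

constexpr bool inBounds(uint32_t BiasedExp, uint32_t Segment) {
  return lastWordRead(BiasedExp, Segment) < TableWords;
}

// The lit test input 345.435 has biased exponent 1031; segment 5 stays in range.
static_assert(inBounds(1031, 5), "lit test case is within the table");
// With the same exponent, segment 22 already reads past the 37 words.
static_assert(!inBounds(1031, 22), "segment 22 runs off the table");
// Worst case: exponent field 0x7ff with segment 31.
static_assert(lastWordRead(0x7ff, 31) == 83, "largest index touched");

// Runtime variant in the spirit of the added assert: returns the first index.
uint32_t checkedIndex(uint32_t BiasedExp, uint32_t Segment) {
  assert(inBounds(BiasedExp, Segment) && "TwoByPi lookup out of range");
  return shiftFor(BiasedExp, Segment) >> 5;
}
```

Under these assumptions, segments 22 and above index past the end of the table even for small inputs, so the lookup needs either a longer table or an explicit out-of-range path.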

Harbormaster completed remote builds in B198289: Diff 476233. Nov 17 2022, 2:01 PM

foad added inline comments. Nov 18 2022, 6:47 AM

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp
1054	This (and some of the calculation below) can just use `int`. No need for APInt.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp	61 lines
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.trig.preop.ll	10 lines

Diff 428730

llvm/lib/Target/AMDGPU/AMDGPUInstCombineIntrinsic.cpp

[320 lines collapsed] return modifyIntrinsicCall(
         // Convert the bias
         if (!OnlyDerivatives && ImageDimIntr->NumBiasArgs != 0) {
           Value *Bias = II.getOperand(ImageDimIntr->BiasIndex);
           Args[ImageDimIntr->BiasIndex] = convertTo16Bit(*Bias, IC.Builder);
         }
       });
 }

 bool GCNTTIImpl::canSimplifyLegacyMulToMul(const Value *Op0, const Value *Op1,
[arsenm, Done] Relying on size of double, also this is technically undefined.
                                            InstCombiner &IC) const {
   // The legacy behaviour is that multiplying +/-0.0 by anything, even NaN or
   // infinity, gives +0.0. If we can prove we don't have one of the special
   // cases then we can use a normal multiply instead.
   // TODO: Create and use isKnownFiniteNonZero instead of just matching
   // constants here.
   if (match(Op0, PatternMatch::m_FiniteNonZero()) ||
       match(Op1, PatternMatch::m_FiniteNonZero())) {
[662 lines collapsed] case Intrinsic::amdgcn_ldexp: {
     // ldexp(x, 0) -> x
     // ldexp(x, undef) -> x
     if (isa<UndefValue>(Op1) || match(Op1, PatternMatch::m_ZeroInt())) {
       return IC.replaceInstUsesWith(II, Op0);
     }

     break;
   }
+  case Intrinsic::amdgcn_trig_preop: {
+
+    const uint32_t TwoByPi[] = {
[arsenm, Done] uint32_t
+        0xa2f9836e, 0x4e441529, 0xfc2757d1, 0xf534ddc0, 0xdb629599, 0x3c439041,
+        0xfe5163ab, 0xdebbc561, 0xb7246e3a, 0x424dd2e0, 0x06492eea, 0x09d1921c,
+        0xfe1deb1c, 0xb129a73e, 0xe88235f5, 0x2ebb4484, 0xe99c7026, 0xb45f7e41,
+        0x3991d639, 0x835339f4, 0x9c845f8b, 0xbdf9283b, 0x1ff897ff, 0xde05980f,
+        0xef2f118b, 0x5a0a6d1f, 0x6d367ecf, 0x27cb09b7, 0x4f463f66, 0x9e5fea2d,
+        0x7527bac7, 0xebe5f17b, 0x3d0739f7, 0x8a5292ea, 0x6bfb5fb1, 0x1f8d5d08,
+        0x56033046};
+
+    Value *Src = II.getArgOperand(0);
+    Value *Segment = II.getArgOperand(1);
+    const ConstantFP *Csrc = dyn_cast<ConstantFP>(Src);
[foad, Not Done] Capitalizing the names like CSrc, CSeg, FSrc, NumBits etc is more common.
+    const ConstantInt *Cseg = dyn_cast<ConstantInt>(Segment);
+
+    if (!(Csrc && Cseg))
[arsenm, Not Done] demorgan this
+      break;
+
+    const APFloat &Fsrc = Csrc->getValueAPF();
[arsenm, Done] Early exit and reduce indentation
+
+    const APInt &SegVal = Cseg->getUniqueInteger();
[arsenm, Done] Indentation off
[foad, Not Done] Just use `->getValue`. And in general I think you should convert CSrc and CSeg to plain uint32_t / uint64_t as soon as possible. You have lots of uses of APInt below where it really is not necessary.
+    bool Ovflow;
+    unsigned Numbits = 32;
+    bool Signed = true;
+
+    APInt EClamp(Numbits, 1077, Signed);
+    APInt E = (Fsrc.bitcastToAPInt()).ashr(52);
[arsenm, Not Done] extra parentheses
+    E &= 0x7ff;
+    E = E.trunc(Numbits);
+    APInt Shift =
+        (E.sgt(EClamp) ? E.ssub_ov(EClamp, Ovflow) : APInt(Numbits, 0, Signed))
+            .sadd_ov(APInt(Numbits, 53, Signed).smul_ov(SegVal, Ovflow),
+                     Ovflow);
+    int32_t I = (Shift.ashr(5)).getSExtValue();
[arsenm, Not Done] extra parentheses
+    APInt Bshift = Shift & 0x1f;
+    Numbits = 64;
[arsenm, Done] Should stick with apfloat operations
[Ravi, Done] Yes, will do that.
+    Signed = false;
+    APInt Thi = APInt(Numbits,
+                      (((uint64_t)TwoByPi[I] << 32) | (uint64_t)TwoByPi[I + 1]),
+                      Signed);
[arsenm, Done] You can use scalbn for APFloat instead of relying on host ldexp
[Ravi, Done] Yes..will try it out and check for any precision difference. Should be good as long as the internal implementation of APFloat is within 0.5 ULP.
[arsenm, Not Done] Actually we should be constant folding the ldexp intrinsic too. I thought I did that before, but in the code above I don't see it handling arbitrary constants
[Ravi, Done] Can be done in another patch
+    APInt Tlo = APInt(Numbits, ((uint64_t)TwoByPi[I + 2] << 32), Signed);
+
+    if (Bshift.sgt(0)) {
+      Numbits = 32;
+      Signed = true;
+      Thi = (Thi.shl(Bshift)) |
[arsenm, Not Done] extra parentheses
[foad, Not Done] This (and some of the calculation below) can just use `int`. No need for APInt.
+            (Tlo.lshr(APInt(Numbits, 64, Signed).ssub_ov(Bshift, Ovflow)));
[arsenm, Not Done] extra parentheses
+    }
+
+    Thi = Thi.lshr(11);
+    APFloat Res = APFloat(Thi.roundToDouble());
+    int32_t Scale = -53 - Shift.getSExtValue();
+
+    if (E.sge(0x7b0))
+      Scale += 128;
+
+    Res = scalbn(Res, Scale, RoundingMode::NearestTiesToEven);
+    double Resd = Res.convertToDouble();
+    return IC.replaceInstUsesWith(II, ConstantFP::get(Src->getType(), Resd));
+  }
   case Intrinsic::amdgcn_fmul_legacy: {
     Value *Op0 = II.getArgOperand(0);
     Value *Op1 = II.getArgOperand(1);

     // The legacy behaviour is that multiplying +/-0.0 by anything, even NaN or
     // infinity, gives +0.0.
     // TODO: Move to InstSimplify?
     if (match(Op0, PatternMatch::m_AnyZeroFP()) ||
[226 lines collapsed]

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.trig.preop.ll

 ; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -verify-machineinstrs < %s | FileCheck -check-prefix=SI %s
 ; RUN: llc -amdgpu-scalarize-global-loads=false -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs < %s | FileCheck -check-prefix=SI %s
+; RUN: opt -S -mtriple=amdgcn-- -mcpu=gfx900 -instcombine < %s | FileCheck -check-prefix=GCN %s
+; RUN: opt -S -mtriple=amdgcn-- -mcpu=gfx1010 -instcombine < %s | FileCheck -check-prefix=GCN %s

 declare double @llvm.amdgcn.trig.preop.f64(double, i32) nounwind readnone

 ; SI-LABEL: {{^}}test_trig_preop_f64:
 ; SI-DAG: buffer_load_dword [[SEG:v[0-9]+]]
 ; SI-DAG: buffer_load_dwordx2 [[SRC:v\[[0-9]+:[0-9]+\]]],
 ; SI: v_trig_preop_f64 [[RESULT:v\[[0-9]+:[0-9]+\]]], [[SRC]], [[SEG]]
 ; SI: buffer_store_dwordx2 [[RESULT]],
[12 lines collapsed]
 ; SI: buffer_store_dwordx2 [[RESULT]],
 ; SI: s_endpgm
 define amdgpu_kernel void @test_trig_preop_f64_imm_segment(double addrspace(1)* %out, double addrspace(1)* %aptr) nounwind {
   %a = load double, double addrspace(1)* %aptr, align 8
   %result = call double @llvm.amdgcn.trig.preop.f64(double %a, i32 7) nounwind readnone
   store double %result, double addrspace(1)* %out, align 8
   ret void
 }

+define protected amdgpu_kernel void @trig_preop_constfold(double addrspace(1)* nocapture %0, double addrspace(1)* nocapture readnone %1, i32 %2){
+; GCN: store double 0x2F42371D2126E970, double addrspace(1)* %0, align 8
+; GCN-NEXT: ret void
+  %4 = tail call contract double @llvm.amdgcn.trig.preop.f64(double 3.454350e+02, i32 5)
+  store double %4, double addrspace(1)* %0, align 8
+  ret void
+}