Details

Reviewers

arsenm

Summary

See https://github.com/RadeonOpenCompute/ROCm/issues/488

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Pierre-vh created this revision. Nov 9 2022, 2:47 AM

Herald added a project: Restricted Project. Nov 9 2022, 2:47 AM

Herald added subscribers: kosarev, foad, kerbowa and 6 others.

Pierre-vh requested review of this revision. Nov 9 2022, 2:47 AM

Herald added subscribers: llvm-commits, wdng.

@arsenm I added the combine as described, but the codegen still doesn't look as good as what the original issue describes. I probably missed something, but I'm not sure what. Can you please advise?

foad added inline comments. Nov 9 2022, 3:15 AM

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3222	This is not correct because it will lose the overflow bit. You should probably only do this if the ADD has a single use.

Fix combine

Pierre-vh marked an inline comment as done. Nov 9 2022, 3:39 AM

Pierre-vh added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3222	Doing it if the add has a single use negates the purpose of the combine as it'll always have 2 uses in the cases we're interested in, but the second use is a trunc to i32. I've adapted the combine so it only does it when users are all truncs to i32, or the srl.
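The problem foad points out can be reproduced with plain integers. The following is a minimal standalone C++ sketch, not SelectionDAG code, and the function names are made up for illustration. It compares the original 64-bit add with the zero-extended 32-bit sum that the first version of the combine substituted for it.

```cpp
#include <cassert>
#include <cstdint>

// The original 64-bit add of two zero-extended 32-bit values.
constexpr std::uint64_t wide_add(std::uint32_t a, std::uint32_t b) {
  return static_cast<std::uint64_t>(a) + b;
}

// The replacement value: the 32-bit sum (wrapping modulo 2^32)
// zero-extended back to 64 bits. Bit 32 of the true sum is dropped.
constexpr std::uint64_t zext_of_narrow_sum(std::uint32_t a, std::uint32_t b) {
  return static_cast<std::uint64_t>(static_cast<std::uint32_t>(a + b));
}

// A user that truncates to 32 bits sees the same value either way...
static_assert(static_cast<std::uint32_t>(wide_add(0xFFFFFFFFu, 1u)) ==
                  static_cast<std::uint32_t>(zext_of_narrow_sum(0xFFFFFFFFu, 1u)),
              "low 32 bits agree");
// ...but a user of the full 64-bit value does not.
static_assert(wide_add(0xFFFFFFFFu, 1u) != zext_of_narrow_sum(0xFFFFFFFFu, 1u),
              "bit 32 is lost by the replacement");
```

This is why restricting the combine to adds whose users are truncations to i32 or the shift itself keeps the replacement sound in this sketch: none of those users can observe the dropped bit.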

foad added inline comments. Nov 9 2022, 3:48 AM

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3212	Could also accept 64-bit constants whose upper 32 bits are 0.
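One way to read this suggestion, sketched in standalone C++ rather than SelectionDAG code: a 64-bit constant whose upper 32 bits are zero is indistinguishable from a zero-extended 32-bit value, so it could stand in for one of the zext operands. The helper names below are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// True when no bit above bit 31 is set, i.e. the constant equals the
// zero-extension of its own low 32 bits.
constexpr bool upper32_are_zero(std::uint64_t c) {
  return (c >> 32) == 0;
}

// Carry out of bit 31 for (zext a) + c, computed with the 64-bit add.
constexpr std::uint64_t wide_carry_with_constant(std::uint32_t a,
                                                 std::uint64_t c) {
  return (static_cast<std::uint64_t>(a) + c) >> 32;
}

// The same carry from a 32-bit add against the truncated constant.
// Only equivalent when upper32_are_zero(c) holds.
constexpr std::uint64_t narrow_carry_with_constant(std::uint32_t a,
                                                   std::uint64_t c) {
  return static_cast<std::uint32_t>(a + static_cast<std::uint32_t>(c)) < a
             ? 1u
             : 0u;
}

static_assert(upper32_are_zero(0x00000000FFFFFFFFull), "fits in 32 bits");
static_assert(!upper32_are_zero(0x0000000100000000ull), "bit 32 is set");
static_assert(wide_carry_with_constant(0xFFFFFFFFu, 1) ==
                  narrow_carry_with_constant(0xFFFFFFFFu, 1),
              "carries agree when the constant fits in 32 bits");
static_assert(wide_carry_with_constant(0u, 0x100000000ull) !=
                  narrow_carry_with_constant(0u, 0x100000000ull),
              "carries differ when the constant has upper bits set");
```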

Harbormaster completed remote builds in B196867: Diff 474217. Nov 9 2022, 4:17 AM

Comments

Harbormaster completed remote builds in B197725: Diff 475410. Nov 15 2022, 3:58 AM

Should also do the GlobalISel version; if we don't add new optimizations there at the same time, it will never catch up.

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3205	Should also move to generic code

Seems to be correct https://alive2.llvm.org/ce/z/VN9-vU
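The identity behind the combine can also be checked outside Alive2. The following is a minimal standalone sketch in plain C++, not LLVM code, and the function names are made up for illustration. It compares the shift-based carry from the source pattern with the carry a 32-bit unsigned add with overflow would report.

```cpp
#include <cassert>
#include <cstdint>

// Carry as written in the source pattern:
//   lshr (add (zext a), (zext b)), 32
constexpr std::uint64_t carry_via_wide_add(std::uint32_t a, std::uint32_t b) {
  return (static_cast<std::uint64_t>(a) + static_cast<std::uint64_t>(b)) >> 32;
}

// Carry as an overflow flag of a 32-bit unsigned add: the wrapped sum is
// smaller than an operand exactly when the add overflowed.
constexpr std::uint64_t carry_via_uaddo(std::uint32_t a, std::uint32_t b) {
  const std::uint32_t sum = a + b;  // wraps modulo 2^32
  return sum < a ? 1u : 0u;
}

static_assert(carry_via_wide_add(0xFFFFFFFFu, 1u) == 1, "carry set");
static_assert(carry_via_uaddo(0xFFFFFFFFu, 1u) == 1, "overflow set");
static_assert(carry_via_wide_add(0x7FFFFFFFu, 1u) == 0, "no carry");
static_assert(carry_via_uaddo(0x7FFFFFFFu, 1u) == 0, "no overflow");
```

The sum of two 32-bit values is at most 2^33 - 2, so the shifted result can only be 0 or 1, which is why a single overflow bit is enough to represent it.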

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3205	This is missing the extends in the input and output
3221	Looking at uses is unusual and I'm not sure why you're doing it
llvm/test/CodeGen/AMDGPU/add_shr_carry.ll
7	Should precommit this test to show the diff

arsenm added inline comments. Nov 15 2022, 3:41 PM

llvm/test/CodeGen/AMDGPU/add_shr_carry.ll
51	This second add is superfluous to the basic pattern https://alive2.llvm.org/ce/z/aLg_Ki

foad added inline comments. Nov 15 2022, 10:55 PM

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3221	As mentioned below, the thinking is that this transform is not profitable unless every use either only wants the overflow bit, or only wants the low 32 bits of the 64 bit result. Otherwise you might as well keep the full 64 bit add.

Rebase on D138104

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3205	What do you mean "generic"? Not checking the types and instead check that the shift amount is 1/2 of the type's size in bits?
3221	Indeed as Jay said, it's because the transformation is only profitable when the users only care about the lower 32 bits and the carry bit.

Harbormaster completed remote builds in B197935: Diff 475723. Nov 16 2022, 1:01 AM

Pierre-vh added a parent revision: D138104: [AMDGPU] Precommit add_shr_carry test. Nov 16 2022, 2:10 AM

Fix comment

Harbormaster completed remote builds in B197943: Diff 475735. Nov 16 2022, 2:51 AM

arsenm added inline comments. Nov 16 2022, 9:22 AM

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3205	Yes, and move to DAGCombiner. You then just need to check that the target UADDO is legal or it's pre-legalize
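A sketch of what "generic" could mean here, in standalone C++ with hypothetical helper names: the carry identity does not depend on the i32/i64 pair, only on the shift amount being the narrow width, which is half the wide width. The 8-bit case is small enough to check exhaustively. The legality check on UADDO that arsenm mentions is a SelectionDAG concern and is not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Shift-based carry for a narrow type N widened to W, where W is required
// to have exactly twice as many bits as N.
template <typename N, typename W>
constexpr W wide_carry(N a, N b) {
  static_assert(std::numeric_limits<W>::digits ==
                    2 * std::numeric_limits<N>::digits,
                "shift amount must be half the wide type's width");
  return static_cast<W>((static_cast<W>(a) + static_cast<W>(b)) >>
                        std::numeric_limits<N>::digits);
}

// Overflow flag of the narrow add: the wrapped sum is below an operand
// exactly when the add overflowed.
template <typename N>
constexpr bool overflow_carry(N a, N b) {
  return static_cast<N>(a + b) < a;
}

// Exhaustive comparison of both forms for every pair of 8-bit inputs.
inline bool identity_holds_for_all_u8() {
  for (unsigned a = 0; a < 256; ++a) {
    for (unsigned b = 0; b < 256; ++b) {
      const auto x = static_cast<std::uint8_t>(a);
      const auto y = static_cast<std::uint8_t>(b);
      const bool wide = wide_carry<std::uint8_t, std::uint16_t>(x, y) != 0;
      if (wide != overflow_carry(x, y))
        return false;
    }
  }
  return true;
}
```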

arsenm added inline comments. Nov 16 2022, 9:23 AM

llvm/test/CodeGen/AMDGPU/add_shr_carry.ll
238	Testcase with multiple uses?

arsenm added inline comments. Nov 16 2022, 9:25 AM

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp
3214	Should try to short circuit the second known bits call if the first one fails the countMinLeadingZeros check
3222	There's no point in looking for multiple TRUNCATE users. Those would have been automagically CSEd
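The short-circuit remark refers to a later revision of the diff that is not shown on this page, so the following is only an illustration of the idea in standalone C++. `LeadingZeroQuery` is a hypothetical stand-in for an expensive known-bits query; it is not an LLVM type.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an expensive known-bits query. It records how
// often it was evaluated so the short-circuit can be observed.
struct LeadingZeroQuery {
  unsigned min_leading_zeros;
  mutable unsigned calls = 0;

  unsigned operator()() const {
    ++calls;
    return min_leading_zeros;
  }
};

// && evaluates its right operand only when the left one is true, so the
// second query is skipped whenever the first operand already fails.
inline bool both_fit_in_32_bits(const LeadingZeroQuery &lhs,
                                const LeadingZeroQuery &rhs) {
  return lhs() >= 32 && rhs() >= 32;
}
```

With a first operand that has fewer than 32 known leading zeros, `rhs.calls` stays at 0 in this sketch.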

Apparently there is already an in flight version of this at D106139

In D137705#3932390, @arsenm wrote:

    Apparently there is already an in flight version of this at D106139

Right, I think the reviewers there asked to move it to InstCombine, but I think we also want it in the backend, right?
Perhaps it should be kept as a target-specific combine for now? It can always be moved later if needed

In D137705#3933081, @Pierre-vh wrote:

    In D137705#3932390, @arsenm wrote:

        Apparently there is already an in flight version of this at D106139

    Right, I think the reviewers there asked to move it to InstCombine, but I think we also want it in the backend, right?

Combines in the backend are primarily for patterns that arise as the result of legalization. Looking at this again, I'm inclined to have it in instcombine primarily.

    Perhaps it should be kept as a target-specific combine for now? It can always be moved later if needed

It definitely should be done generically

In D137705#3935192, @arsenm wrote:

    In D137705#3933081, @Pierre-vh wrote:

        In D137705#3932390, @arsenm wrote:

            Apparently there is already an in flight version of this at D106139

        Right, I think the reviewers there asked to move it to InstCombine, but I think we also want it in the backend, right?

    Combines in the backend are primarily for patterns that arise as the result of legalization. Looking at this again, I'm inclined to have it in instcombine primarily.

        Perhaps it should be kept as a target-specific combine for now? It can always be moved later if needed

    It definitely should be done generically

So this stack of diff should go and an instcombine implementation done instead?
D107552 was also in-flight for this. Should I take it over?
There was some discussion about it being a less understandable canonical form though

D138814

Diff 474210

llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp

[First 3,196 lines collapsed; enclosing context: if (auto *Mask = dyn_cast<ConstantSDNode>(LHS.getOperand(1))) {]

            DAG.getNode(ISD::SRL, SL, VT, LHS.getOperand(1), N->getOperand(1)));
      }
    }
  }

  if (VT != MVT::i64)
    return SDValue();

  // fold (i64 (shr (add (zext a, i64), (zext b, i64)), 32)) -> (uaddo a,
  // b).overflow
[inline] arsenm (Not Done): Should also move to generic code
[inline] Pierre-vh (Done): What do you mean "generic"? Not checking the types and instead check that the shift amount is 1/2 of the type's size in bits?
[inline] arsenm (Not Done): Yes, and move to DAGCombiner. You then just need to check that the target UADDO is legal or it's pre-legalize
[inline] arsenm (Done): This is missing the extends in the input and output
  if (ShiftAmt == 32 && LHS.getOpcode() == ISD::ADD) {
    SDValue AddLHS = LHS->getOperand(0);
    SDValue AddRHS = LHS->getOperand(1);

    const auto Is32to64ZExt = [](SDValue V) -> bool {
      return V->getOpcode() == ISD::ZERO_EXTEND &&
[inline] foad (Done): Could also accept 64-bit constants whose upper 32 bits are 0.
             V->getOperand(0)->getValueType(0) == MVT::i32 &&
             V->getValueType(0) == MVT::i64;
[inline] arsenm (Not Done): Should try to short circuit the second known bits call if the first one fails the countMinLeadingZeros check
    };

    if (Is32to64ZExt(AddLHS) && Is32to64ZExt(AddRHS)) {
      // Create a i32 uaddo
      SDValue A = AddLHS->getOperand(0);
      SDValue B = AddRHS->getOperand(0);
      SDValue UADDO = DAG.getNode(ISD::UADDO, SL, {MVT::i32, MVT::i1}, {A, B});
[inline] arsenm (Done): Looking at uses is unusual and I'm not sure why you're doing it
[inline] foad (Done): As mentioned below, the thinking is that this transform is not profitable unless every use either only wants the overflow bit, or only wants the low 32 bits of the 64 bit result. Otherwise you might as well keep the full 64 bit add.
[inline] Pierre-vh (Done): Indeed as Jay said, it's because the transformation is only profitable when the users only care about the lower 32 bits and the carry bit.
      // Replace the original add with (i64 (zext (uaddo ...)))
[inline] foad (Done): This is not correct because it will lose the overflow bit. You should probably only do this if the ADD has a single use.
[inline] Pierre-vh (Done): Doing it if the add has a single use negates the purpose of the combine as it'll always have 2 uses in the cases we're interested in, but the second use is a trunc to i32. I've adapted the combine so it only does it when users are all truncs to i32, or the srl.
[inline] arsenm (Not Done): There's no point in looking for multiple TRUNCATE users. Those would have been automagically CSEd
      DAG.ReplaceAllUsesOfValueWith(
          LHS, DAG.getNode(ISD::ZERO_EXTEND, SL, VT, {UADDO}));
      // Replace this right-shift with (i64 (zext (uaddo.overflow ...)))
      return DAG.getNode(ISD::ZERO_EXTEND, SL, VT, {UADDO.getValue(1)});
    }
  }

  if (ShiftAmt < 32)
    return SDValue();

  // srl i64:x, C for C >= 32
  // =>
  // build_pair (srl hi_32(x), C - 32), 0
  SDValue Zero = DAG.getConstant(0, SL, MVT::i32);

[Remaining 1,662 lines collapsed]

llvm/test/CodeGen/AMDGPU/add_shr_carry.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=amdgcn-amd-mesa3d -mcpu=fiji -verify-machineinstrs | FileCheck -check-prefix=VI %s
				; RUN: llc < %s -mtriple=amdgcn-amd-mesa3d -mcpu=gfx900 -verify-machineinstrs | FileCheck -check-prefix=GFX9 %s
				; RUN: llc < %s -mtriple=amdgcn-amd-mesa3d -mcpu=gfx1010 -verify-machineinstrs | FileCheck -check-prefix=GFX10 %s
				; RUN: llc < %s -mtriple=amdgcn-amd-mesa3d -mcpu=gfx1100 -amdgpu-enable-delay-alu=0 -verify-machineinstrs | FileCheck -check-prefix=GFX11 %s

				define i64 @basic(i32 %a, i32 %b, i64 %c) {
				[inline] arsenm (Done): Should precommit this test to show the diff
				; VI-LABEL: basic:
				; VI: ; %bb.0: ; %entry
				; VI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; VI-NEXT: v_add_u32_e32 v0, vcc, v0, v1
				; VI-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc
				; VI-NEXT: v_add_u32_e32 v0, vcc, v2, v0
				; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v3, vcc
				; VI-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: basic:
				; GFX9: ; %bb.0: ; %entry
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, v0, v1
				; GFX9-NEXT: v_cndmask_b32_e64 v0, 0, 1, vcc
				; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, v2, v0
				; GFX9-NEXT: v_addc_co_u32_e32 v1, vcc, 0, v3, vcc
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: basic:
				; GFX10: ; %bb.0: ; %entry
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_add_co_u32 v0, s4, v0, v1
				; GFX10-NEXT: v_cndmask_b32_e64 v0, 0, 1, s4
				; GFX10-NEXT: v_add_co_u32 v0, vcc_lo, v2, v0
				; GFX10-NEXT: v_add_co_ci_u32_e32 v1, vcc_lo, 0, v3, vcc_lo
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX11-LABEL: basic:
				; GFX11: ; %bb.0: ; %entry
				; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX11-NEXT: v_add_co_u32 v0, s0, v0, v1
				; GFX11-NEXT: v_cndmask_b32_e64 v0, 0, 1, s0
				; GFX11-NEXT: v_add_co_u32 v0, vcc_lo, v2, v0
				; GFX11-NEXT: v_add_co_ci_u32_e32 v1, vcc_lo, 0, v3, vcc_lo
				; GFX11-NEXT: s_setpc_b64 s[30:31]
				entry:
				%a.zext = zext i32 %a to i64
				%b.zext = zext i32 %b to i64
				%add.a.b = add i64 %a.zext, %b.zext
				%shr = lshr i64 %add.a.b, 32
				%add.c.shr = add i64 %c, %shr
				ret i64 %add.c.shr
				[inline] arsenm (Done): This second add is superfluous to the basic pattern https://alive2.llvm.org/ce/z/aLg_Ki
				}

				define <3 x i32> @add_i96(<3 x i32> %0, <3 x i32> %1) #0 {
				; VI-LABEL: add_i96:
				; VI: ; %bb.0:
				; VI-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; VI-NEXT: v_add_u32_e32 v1, vcc, v4, v1
				; VI-NEXT: v_addc_u32_e64 v4, s[4:5], 0, 0, vcc
				; VI-NEXT: v_add_u32_e32 v0, vcc, v3, v0
				; VI-NEXT: v_cndmask_b32_e64 v3, 0, 1, vcc
				; VI-NEXT: v_add_u32_e32 v1, vcc, v1, v3
				; VI-NEXT: v_addc_u32_e32 v3, vcc, 0, v4, vcc
				; VI-NEXT: v_add_u32_e32 v2, vcc, v5, v2
				; VI-NEXT: v_add_u32_e32 v2, vcc, v2, v3
				; VI-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: add_i96:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_add_co_u32_e32 v1, vcc, v4, v1
				; GFX9-NEXT: v_addc_co_u32_e64 v4, s[4:5], 0, 0, vcc
				; GFX9-NEXT: v_add_co_u32_e32 v0, vcc, v3, v0
				; GFX9-NEXT: v_cndmask_b32_e64 v3, 0, 1, vcc
				; GFX9-NEXT: v_add_co_u32_e32 v1, vcc, v1, v3
				; GFX9-NEXT: v_addc_co_u32_e32 v3, vcc, 0, v4, vcc
				; GFX9-NEXT: v_add3_u32 v2, v5, v2, v3
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX10-LABEL: add_i96:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: v_add_co_u32 v0, s4, v3, v0
				; GFX10-NEXT: v_cndmask_b32_e64 v3, 0, 1, s4
				; GFX10-NEXT: v_add_co_u32 v1, s4, v4, v1
				; GFX10-NEXT: v_add_co_ci_u32_e64 v4, s4, 0, 0, s4
				; GFX10-NEXT: v_add_co_u32 v1, vcc_lo, v1, v3
				; GFX10-NEXT: v_add_co_ci_u32_e32 v3, vcc_lo, 0, v4, vcc_lo
				; GFX10-NEXT: v_add3_u32 v2, v5, v2, v3
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX11-LABEL: add_i96:
				; GFX11: ; %bb.0:
				; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX11-NEXT: v_add_co_u32 v0, s0, v3, v0
				; GFX11-NEXT: v_cndmask_b32_e64 v3, 0, 1, s0
				; GFX11-NEXT: v_add_co_u32 v1, s0, v4, v1
				; GFX11-NEXT: v_add_co_ci_u32_e64 v4, null, 0, 0, s0
				; GFX11-NEXT: v_add_co_u32 v1, vcc_lo, v1, v3
				; GFX11-NEXT: v_add_co_ci_u32_e32 v3, vcc_lo, 0, v4, vcc_lo
				; GFX11-NEXT: v_add3_u32 v2, v5, v2, v3
				; GFX11-NEXT: s_setpc_b64 s[30:31]
				%3 = extractelement <3 x i32> %0, i64 0
				%4 = zext i32 %3 to i64
				%5 = extractelement <3 x i32> %1, i64 0
				%6 = zext i32 %5 to i64
				%7 = add nuw nsw i64 %6, %4
				%8 = extractelement <3 x i32> %0, i64 1
				%9 = zext i32 %8 to i64
				%10 = extractelement <3 x i32> %1, i64 1
				%11 = zext i32 %10 to i64
				%12 = add nuw nsw i64 %11, %9
				%13 = lshr i64 %7, 32
				%14 = add nuw nsw i64 %12, %13
				%15 = extractelement <3 x i32> %0, i64 2
				%16 = extractelement <3 x i32> %1, i64 2
				%17 = add i32 %16, %15
				%18 = lshr i64 %14, 32
				%19 = trunc i64 %18 to i32
				%20 = add i32 %17, %19
				%21 = trunc i64 %7 to i32
				%22 = insertelement <3 x i32> undef, i32 %21, i32 0
				%23 = trunc i64 %14 to i32
				%24 = insertelement <3 x i32> %22, i32 %23, i32 1
				%25 = insertelement <3 x i32> %24, i32 %20, i32 2
				ret <3 x i32> %25
				}
				[inline] arsenm (Not Done): Testcase with multiple uses?
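For readers less used to the IR, the add_i96 test appears to implement a 96-bit addition over three 32-bit limbs, propagating each carry with the add-then-shift pattern the combine targets. A rough standalone C++ equivalent is sketched below; the `U96` alias and `add_u96` name are illustrative and do not come from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Three 32-bit limbs, least significant first (matching element 0 of the
// <3 x i32> vectors in the test).
using U96 = std::array<std::uint32_t, 3>;

inline U96 add_u96(const U96 &x, const U96 &y) {
  // Limb 0: 64-bit add of the zero-extended limbs (%7 in the IR).
  const std::uint64_t s0 = std::uint64_t{x[0]} + y[0];
  // Limb 1: add the limbs, then the carry of limb 0, s0 >> 32 (%14).
  const std::uint64_t s1 = std::uint64_t{x[1]} + y[1] + (s0 >> 32);
  // Limb 2: plain 32-bit add plus the truncated carry of limb 1 (%20).
  // The carry out of the top limb is discarded.
  const std::uint32_t s2 = x[2] + y[2] + static_cast<std::uint32_t>(s1 >> 32);
  return U96{{static_cast<std::uint32_t>(s0), static_cast<std::uint32_t>(s1),
              s2}};
}
```

Each of the two 64-bit sums has both a truncating user and a shift-by-32 user, which is the two-use situation discussed in the inline comments.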

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add DAG Combine for right-shift carry add to uaddo
AbandonedPublic
