This is an archive of the discontinued LLVM Phabricator instance.

Differential D150246

AMDGPU: Fix issue in shl(or) combine
Closed · Public

Authored by ruiling on May 9 2023, 8:47 PM.

Details

Reviewers

arsenm
foad

Commits

rG60d9010aaf0f: AMDGPU: Fix issue in shl(or) combine

Summary

The code is doing the optimization:
((a | c1) << c2) ==> (a << c2) + (c1 << c2)
But this is only valid if a and c1 have no common bits being set.
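
The validity condition in the summary can be checked with plain integer arithmetic. The sketch below is a standalone illustration, not code from the patch; the function names are hypothetical. It shows that the ADD rewrite agrees with the original expression when a and c1 share no bits, and that it counts a shared bit twice when they overlap.

```cpp
#include <cassert>
#include <cstdint>

// Original expression: shift the OR'ed value.
constexpr uint32_t shlOfOr(uint32_t a, uint32_t c1, unsigned c2) {
  return (a | c1) << c2;
}

// ADD rewrite from the summary; only equal to shlOfOr when (a & c1) == 0.
constexpr uint32_t addOfShl(uint32_t a, uint32_t c1, unsigned c2) {
  return (a << c2) + (c1 << c2);
}

// OR rewrite; equal to shlOfOr for any a and c1.
constexpr uint32_t orOfShl(uint32_t a, uint32_t c1, unsigned c2) {
  return (a << c2) | (c1 << c2);
}

// Hypothetical helper that exercises both cases with small values.
inline void checkShlOrRewrites() {
  // Disjoint bits (a = 0b110, c1 = 0b001): all three forms agree.
  assert(shlOfOr(6, 1, 3) == addOfShl(6, 1, 3));
  assert(shlOfOr(6, 1, 3) == orOfShl(6, 1, 3));
  // Overlapping bits (a = 1, c1 = 1): the ADD rewrite counts the shared bit twice.
  assert(shlOfOr(1, 1, 3) == 8);
  assert(addOfShl(1, 1, 3) == 16);
  assert(orOfShl(1, 1, 3) == 8);
}
```

The overlapping case uses the same constant as the negative test discussed later in the review (an or with 1 on an unconstrained index).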

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ruiling created this revision. May 9 2023, 8:47 PM

Herald added a project: Restricted Project. May 9 2023, 8:47 PM

Herald added subscribers: StephenFan, kerbowa, hiraditya and 5 others.

ruiling requested review of this revision. May 9 2023, 8:47 PM

Herald added a project: Restricted Project. May 9 2023, 8:47 PM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B231022: Diff 520897. May 9 2023, 9:16 PM

The correct transformation would be ((a | c1) << c2) ==> (a << c2) | (c1 << c2),
but that form does not help fold the constant offset into the memory
instructions.

It should help because SelectionDAG::isBaseWithConstantOffset knows how to match OR (if the known bits do not overlap) as well as ADD.

Do you mean we can optimize shl(or) with the help of known bits? I think I agree with you, but I am not sure it really helps in practical cases. The isBaseWithConstantOffset you mentioned is specifically designed to work on stack slot accesses, and I am not sure such patterns show up more broadly. I also think that kind of optimization should be done separately, perhaps in the common LLVM code, and the lit test would need to be redesigned as well. I would rather fix the problematic transformation first. Does that sound OK to you?
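
For readers unfamiliar with the known-bits reasoning mentioned here, the following is a minimal standalone sketch of the idea, not LLVM's implementation. The Known struct and all function names are simplified, hypothetical stand-ins. Two values provably share no set bits when, at every bit position, at least one of them is known to be zero.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for a known-bits record (hypothetical, for illustration):
// a set bit in Zero means that bit of the value is known to be 0,
// a set bit in One means it is known to be 1.
struct Known {
  uint32_t Zero;
  uint32_t One;
};

// Every bit of a constant is known.
constexpr Known knownConstant(uint32_t c) { return Known{~c, c}; }

// Nothing is known about an arbitrary input value.
constexpr Known knownNothing() { return Known{0u, 0u}; }

// Shifting left by s (s < 32) makes the low s bits known zero.
constexpr Known knownShl(Known k, unsigned s) {
  return Known{(k.Zero << s) | ((1u << s) - 1u), k.One << s};
}

// No common set bits if at least one side is known zero at every position.
constexpr bool noCommonBits(Known a, Known b) {
  return (a.Zero | b.Zero) == 0xFFFFFFFFu;
}

inline void checkKnownBitsExamples() {
  // or i32 %idx, 1 with an unconstrained %idx: bit 0 may overlap.
  assert(!noCommonBits(knownNothing(), knownConstant(1)));
  // or i32 (shl %idx, 1), 1: bit 0 of the left operand is known zero.
  assert(noCommonBits(knownShl(knownNothing(), 1), knownConstant(1)));
}
```

The check is conservative: it only answers "yes" when disjointness can be proven from what is known, so an unconstrained operand blocks the rewrite even if its runtime value happens to be disjoint.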

fix the issue instead of removing the optimization.

ruiling retitled this revision from "AMDGPU: remove an illegal transform for shl(or)" to "AMDGPU: Fix issue in shl(or) combine". May 11 2023, 7:32 AM

ruiling edited the summary of this revision.

Harbormaster completed remote builds in B231331: Diff 521303. May 11 2023, 8:29 AM

Is there a negative test for the common bits case?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9577	Don’t know why this has to change

This revision is now accepted and ready to land. May 12 2023, 1:44 AM

In D150246#4337140, @arsenm wrote:

Is there a negative test for the common bits case?

I have added one: shl_or_ptr_not_combine_2use_lds

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9577	I don't know what happened :( Will fix it.

This revision was landed with ongoing or failed builds. May 12 2023, 4:51 AM

Closed by commit rG60d9010aaf0f: AMDGPU: Fix issue in shl(or) combine (authored by ruiling).

This revision was automatically updated to reflect the committed changes.

ruiling added a commit: rG60d9010aaf0f: AMDGPU: Fix issue in shl(or) combine.

I think this is OK, but wouldn't it be simpler to transform ((a | c1) << c2) ==> (a << c2) | (c1 << c2) and remove the knownbits check? Or do you think that would make the generated code worse overall?

In D150246#4341656, @foad wrote:

I think this is OK, but wouldn't it be simpler to transform ((a | c1) << c2) ==> (a << c2) | (c1 << c2) and remove the knownbits check? Or do you think that would make the generated code worse overall?

I observed more assembly instructions for one typical IR input with the transform ((a | c1) << c2) ==> (a << c2) | (c1 << c2). That's why I wanted to remove the code in the first version. But later I found that the common code tries hard to make (shl (or x, c1), c2) -> add (shl x, c2), (shl c1, c2) happen, so I just fixed the issue to keep helping the cases where it applies.

JonChesterfield added a subscriber: JonChesterfield. May 15 2023, 8:49 AM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	13 lines
llvm/test/CodeGen/AMDGPU/shl_add_ptr.ll	28 lines

Diff 521606

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

[...]
   SDValue SignAsF32 =
       DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, MVT::f32, SignAsVector,
                   DAG.getConstant(1, DL, MVT::i32));

   return DAG.getNode(ISD::FCOPYSIGN, DL, N->getValueType(0), N->getOperand(0),
                      SignAsF32);
 }

 // (shl (add x, c1), c2) -> add (shl x, c2), (shl c1, c2)
+// (shl (or x, c1), c2) -> add (shl x, c2), (shl c1, c2) iff x and c1 share no
+// bits

 // This is a variant of
 // (mul (add x, c1), c2) -> add (mul x, c2), (mul c1, c2),
 //
 // The normal DAG combiner will do this, but only if the add has one use since
 // that would increase the number of instructions.
 //
 // This prevents us from seeing a constant offset that can be folded into a
 // memory instruction's addressing mode. If we know the resulting add offset of
 // a pointer can be folded into an addressing offset, we can replace the pointer
 // operand with the add of new constant offset. This eliminates one of the uses,
 // and may allow the remaining use to also be simplified.
 //
 SDValue SITargetLowering::performSHLPtrCombine(SDNode *N,
                                                unsigned AddrSpace,
                                                EVT MemVT,
                                                DAGCombinerInfo &DCI) const {
   SDValue N0 = N->getOperand(0);
   SDValue N1 = N->getOperand(1);

   // We only do this to handle cases where it's profitable when there are
   // multiple uses of the add, so defer to the standard combine.
   if ((N0.getOpcode() != ISD::ADD && N0.getOpcode() != ISD::OR) ||
       N0->hasOneUse())
     return SDValue();

   const ConstantSDNode *CN1 = dyn_cast<ConstantSDNode>(N1);
   if (!CN1)
     return SDValue();

   const ConstantSDNode *CAdd = dyn_cast<ConstantSDNode>(N0.getOperand(1));
   if (!CAdd)
     return SDValue();

-  // If the resulting offset is too large, we can't fold it into the addressing
-  // mode offset.
+  SelectionDAG &DAG = DCI.DAG;
+
+  if (N0->getOpcode() == ISD::OR &&
+      !DAG.haveNoCommonBitsSet(N0.getOperand(0), N0.getOperand(1)))
+    return SDValue();
+
+  // If the resulting offset is too large, we can't fold it into the
+  // addressing mode offset.
   APInt Offset = CAdd->getAPIntValue() << CN1->getAPIntValue();
   Type *Ty = MemVT.getTypeForEVT(*DCI.DAG.getContext());

   AddrMode AM;
   AM.HasBaseReg = true;
   AM.BaseOffs = Offset.getSExtValue();
   if (!isLegalAddressingMode(DCI.DAG.getDataLayout(), AM, Ty, AddrSpace))
     return SDValue();

-  SelectionDAG &DAG = DCI.DAG;
   SDLoc SL(N);
   EVT VT = N->getValueType(0);

   SDValue ShlX = DAG.getNode(ISD::SHL, SL, VT, N0.getOperand(0), N1);
   SDValue COffset = DAG.getConstant(Offset, SL, VT);

   SDNodeFlags Flags;
   Flags.setNoUnsignedWrap(N->getFlags().hasNoUnsignedWrap() &&
[...]

llvm/test/CodeGen/AMDGPU/shl_add_ptr.ll

[...]
 define void @shl_add_ptr_combine_2use_both_max_private_offset(i16 zeroext %idx.arg) #0 {
[...]
   %shl1 = shl i32 %idx.add, 5
   %ptr0 = inttoptr i32 %shl0 to ptr addrspace(5)
   %ptr1 = inttoptr i32 %shl1 to ptr addrspace(5)
   store volatile i32 9, ptr addrspace(5) %ptr0
   store volatile i32 10, ptr addrspace(5) %ptr1
   ret void
 }

-; FIXME: This or should fold into an offset on the write
 ; GCN-LABEL: {{^}}shl_or_ptr_combine_2use_lds:
-; GCN: v_lshlrev_b32_e32 [[SCALE0:v[0-9]+]], 3, v0
-; GCN: v_or_b32_e32 [[SCALE1:v[0-9]+]], 32, [[SCALE0]]
-; GCN: v_lshlrev_b32_e32 [[SCALE2:v[0-9]+]], 4, v0
-; GCN: ds_write_b32 [[SCALE1]], v{{[0-9]+}}
-; GCN: ds_write_b32 [[SCALE2]], v{{[0-9]+}} offset:64
+; GCN-DAG: ds_write_b32 v{{[0-9]+}}, v{{[0-9]+}} offset:8
+; GCN-DAG: ds_write_b32 v{{[0-9]+}}, v{{[0-9]+}} offset:16
 define void @shl_or_ptr_combine_2use_lds(i32 %idx) #0 {
-  %idx.add = or i32 %idx, 4
+  %idx.shl = shl i32 %idx, 1
+  %idx.add = or i32 %idx.shl, 1
   %shl0 = shl i32 %idx.add, 3
   %shl1 = shl i32 %idx.add, 4
   %ptr0 = inttoptr i32 %shl0 to ptr addrspace(3)
   %ptr1 = inttoptr i32 %shl1 to ptr addrspace(3)
   store volatile i32 9, ptr addrspace(3) %ptr0
   store volatile i32 10, ptr addrspace(3) %ptr1
   ret void
 }
-
-; GCN-LABEL: {{^}}shl_or_ptr_combine_2use_max_lds_offset:
-; GCN-DAG: v_lshlrev_b32_e32 [[SCALE0:v[0-9]+]], 3, v0
-; GCN-DAG: v_lshlrev_b32_e32 [[SCALE1:v[0-9]+]], 4, v0
-; GCN-DAG: ds_write_b32 [[SCALE0]], v{{[0-9]+}} offset:65528
-; GCN-DAG: v_or_b32_e32 [[ADD1:v[0-9]+]], 0x1fff0, [[SCALE1]]
-; GCN: ds_write_b32 [[ADD1]], v{{[0-9]+$}}
-define void @shl_or_ptr_combine_2use_max_lds_offset(i32 %idx) #0 {
-  %idx.add = or i32 %idx, 8191
+; GCN-LABEL: {{^}}shl_or_ptr_not_combine_2use_lds:
+; GCN: v_or_b32_e32 [[OR:v[0-9]+]], 1, v0
+; GCN-DAG: v_lshlrev_b32_e32 [[SCALE0:v[0-9]+]], 3, [[OR]]
+; GCN-DAG: v_lshlrev_b32_e32 [[SCALE1:v[0-9]+]], 4, [[OR]]
+; GCN-DAG: ds_write_b32 [[SCALE0]], v{{[0-9]+}}{{$}}
+; GCN-DAG: ds_write_b32 [[SCALE1]], v{{[0-9]+}}{{$}}
+define void @shl_or_ptr_not_combine_2use_lds(i32 %idx) #0 {
+  %idx.add = or i32 %idx, 1
   %shl0 = shl i32 %idx.add, 3
   %shl1 = shl i32 %idx.add, 4
   %ptr0 = inttoptr i32 %shl0 to ptr addrspace(3)
   %ptr1 = inttoptr i32 %shl1 to ptr addrspace(3)
   store volatile i32 9, ptr addrspace(3) %ptr0
   store volatile i32 10, ptr addrspace(3) %ptr1
   ret void
 }

 attributes #0 = { nounwind }
 attributes #1 = { nounwind readnone }
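
The offsets in the test's check lines follow from the constant that the combine splits off, c1 << c2. The sketch below is a standalone illustration with hypothetical names. The 16-bit unsigned offset limit is an assumption made for this example, chosen because it is consistent with the removed shl_or_ptr_combine_2use_max_lds_offset test; the patch itself asks isLegalAddressingMode rather than hard-coding a limit.

```cpp
#include <cassert>
#include <cstdint>

// Constant that the combine splits off from (x | c1) << c2.
constexpr uint64_t foldedOffset(uint64_t c1, unsigned c2) { return c1 << c2; }

// Assumption for this sketch: the offset field is an unsigned 16-bit immediate.
constexpr uint64_t kAssumedMaxOffset = 0xFFFF;

constexpr bool fitsAssumedOffsetField(uint64_t offset) {
  return offset <= kAssumedMaxOffset;
}

inline void checkTestOffsets() {
  // shl_or_ptr_combine_2use_lds: or with 1, then shifts by 3 and 4.
  assert(foldedOffset(1, 3) == 8);   // matches "offset:8"
  assert(foldedOffset(1, 4) == 16);  // matches "offset:16"
  // Removed max-offset test: or with 8191, then shifts by 3 and 4.
  assert(foldedOffset(8191, 3) == 65528);
  assert(fitsAssumedOffsetField(foldedOffset(8191, 3)));
  assert(foldedOffset(8191, 4) == 0x1fff0);
  assert(!fitsAssumedOffsetField(foldedOffset(8191, 4)));
}
```

Under that assumption, 65528 could be folded into the write while 0x1fff0 had to stay as a separate v_or_b32, which is what the removed check lines expected.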