This is an archive of the discontinued LLVM Phabricator instance.

Added single use check to ShrinkDemandedConstant
ClosedPublic

Authored by rampitec on Jan 3 2019, 12:38 PM.

Details

Summary

Fixes the cvt_f32_ubyte combine: performCvtF32UByteNCombine() could shrink
the source node to its demanded bits even when the node had other uses.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Jan 3 2019, 12:38 PM
arsenm added inline comments.Jan 3 2019, 2:06 PM
lib/Target/AMDGPU/SIISelLowering.cpp
8589–8590 ↗(On Diff #180126)

Should skip both if !hasOneUse. I thought SimplifyDemandedBits does this check already, so I don't know why ShrinkDemandedConstant doesn't also

arsenm added a comment.Jan 3 2019, 2:13 PM

Probably should try fixing ShrinkDemandedConstant first

rampitec marked an inline comment as done.Jan 3 2019, 2:26 PM

Probably should try fixing ShrinkDemandedConstant first

I do not see that as a bug in the first place. You create a mask and ask ShrinkDemandedConstant to do the job; my take is that it is assumed you have done the needed checks and know what you are doing. Adding the check there would just prevent legal optimizations elsewhere.

lib/Target/AMDGPU/SIISelLowering.cpp
8589–8590 ↗(On Diff #180126)

SimplifyDemandedBits does that check, but it does more, so in some cases it is able to work even with multiple uses.

rampitec marked an inline comment as done.Jan 4 2019, 12:27 PM

Probably should try fixing ShrinkDemandedConstant first

I have checked the existing uses of ShrinkDemandedConstant; it does not look like these uses really care about the multiple-use situation. Therefore I have moved the single-use check there.

Concerning SimplifyDemandedBits: it handles multiple uses gracefully, which is documented:

Helper for SimplifyDemandedBits that can simplify an operation with multiple uses.

rampitec updated this revision to Diff 180303.Jan 4 2019, 12:28 PM
rampitec retitled this revision from [AMDGPU] Fixed cvt_f32_ubyte combine to Added single use check to ShrinkDemandedConstant.
rampitec edited the summary of this revision. (Show Details)

Moved single use check into ShrinkDemandedConstant.

arsenm accepted this revision.Jan 4 2019, 8:38 PM

LGTM

This revision is now accepted and ready to land.Jan 4 2019, 8:38 PM
This revision was automatically updated to reflect the committed changes.
fhahn added a subscriber: fhahn.Jan 7 2019, 6:32 AM

I am seeing a regression with this change on AArch64, where we do not use shrunk constants, e.g. for and/or if they have multiple users. I am not entirely sure what the problem is this patch is solving, as I am not familiar with AMDGPU. Is it a case where shrinking the constant leads to shrinking the related Op and in the end this introduces unnecessary conversion code, as multiple users require different bitwidths?

If that is the case, I am not sure if this is the best place to fix the issue. IIUC, shrinkDemandedBits just shrinks the constant of the passed in Op, but not the width of the result type of Op. So it seems to me that shrinking the constant should be independent of the users of Op.

Here is a DAG exposing the problem:

t33: i32 = or t42, Constant:i32<-2147483647>
        t37: f32 = bitcast t33
        t66: f32 = CVT_F32_UBYTE0 t33
      t38: f32 = fadd t37, t66

The node t66 only needs the low byte of the t33 "or" node. The other use of the "or" is the bitcast and, subsequently, the t38 fadd, which needs the full dword.
The "Op" argument of ShrinkDemandedConstant() is the t33 "or", with Demanded = 0xff as needed by t66.

If the optimization succeeds it will replace "or t42, 0x80000001" with "or t42, 1". However, this "or" has another use which needs the full range of the value.
It is a bug not to take the other uses into account.

It is unfortunate this led to a regression on the AArch64 side. Are you sure a similar issue can never happen with AArch64?
Theoretically it is possible to move the single-use check past the targetShrinkDemandedConstant() call, which I presume actually does the job for AArch64.
I think that should restore the missing optimization in that specific case, but I am not sure there will be no correctness issues elsewhere. Here is what I am speaking about:

--- a/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -350,13 +350,13 @@ bool TargetLowering::ShrinkDemandedConstant(SDValue Op, const APInt &Demanded,
   SDLoc DL(Op);
   unsigned Opcode = Op.getOpcode();

-  if (!Op.hasOneUse())
-    return false;
-
   // Do target-specific constant optimization.
   if (targetShrinkDemandedConstant(Op, Demanded, TLO))
     return TLO.New.getNode();

+  if (!Op.hasOneUse())
+    return false;
+
   // FIXME: ISD::SELECT, ISD::SELECT_CC
   switch (Opcode) {
   default:

In fact the bit width of all the operations is the same; it is just that t66 is only known to read the low byte of its operand. It is worth noting that t66 is a custom AMDGPU node.
Consider t66 being a shift left by 24 instead:

t33: i32 = or t42, Constant:i32<-2147483647>
        t66: i32 = shl t33, 24
        t38: i32 = add t33, t66

There is nothing target specific here: t66 is likewise only known to read the low 8 bits of t33, and it is not unreasonable to call ShrinkDemandedConstant on t33 with Demanded = 0xff. That would expose the same bug regardless of target.

After several hours of trying to come up with a testcase that would fail with any other node or target, I found none.
I have moved the check into the target: D56406. Maybe later a check will be needed somewhere in the common code.