This is an archive of the discontinued LLVM Phabricator instance.

Added single use check to ShrinkDemandedConstant
ClosedPublic

Authored by rampitec on Jan 3 2019, 12:38 PM.

Details

Summary

Fixes the cvt_f32_ubyte combine: performCvtF32UByteNCombine() could shrink
the source node to its demanded bits even when the node had other uses.

Diff Detail

Repository
rL LLVM

Event Timeline

rampitec created this revision.Jan 3 2019, 12:38 PM
arsenm added inline comments.Jan 3 2019, 2:06 PM
lib/Target/AMDGPU/SIISelLowering.cpp
8589–8590 ↗(On Diff #180126)

Should skip both if !hasOneUse. I thought SimplifyDemandedBits does this check already, so I don't know why ShrinkDemandedConstant doesn't also

arsenm added a comment.Jan 3 2019, 2:13 PM

Probably should try fixing ShrinkDemandedConstant first

rampitec marked an inline comment as done.Jan 3 2019, 2:26 PM

Probably should try fixing ShrinkDemandedConstant first

I do not see that as a bug in the first place. You create a mask and ask ShrinkDemandedConstant to do the job; my take is that it is assumed you have done the needed checks and know what you are doing. Adding the check there would just prevent legal optimizations elsewhere.

lib/Target/AMDGPU/SIISelLowering.cpp
8589–8590 ↗(On Diff #180126)

SimplifyDemandedBits does that check, but it does more, so in some cases it is able to work even with multiple uses.

rampitec marked an inline comment as done.Jan 4 2019, 12:27 PM

Probably should try fixing ShrinkDemandedConstant first

I have checked the existing uses of ShrinkDemandedConstant; it does not look like these uses really care about the multiple-use situation. Therefore I have moved the single-use check there.

Concerning SimplifyDemandedBits: it handles multiple uses gracefully, which is documented:

Helper for SimplifyDemandedBits that can simplify an operation with multiple uses.

rampitec updated this revision to Diff 180303.Jan 4 2019, 12:28 PM
rampitec retitled this revision from [AMDGPU] Fixed cvt_f32_ubyte combine to Added single use check to ShrinkDemandedConstant.
rampitec edited the summary of this revision. (Show Details)

Moved single use check into ShrinkDemandedConstant.

arsenm accepted this revision.Jan 4 2019, 8:38 PM

LGTM

This revision is now accepted and ready to land.Jan 4 2019, 8:38 PM
This revision was automatically updated to reflect the committed changes.
fhahn added a subscriber: fhahn.Jan 7 2019, 6:32 AM

I am seeing a regression with this change on AArch64, where we do not use shrunk constants, e.g. for and/or if they have multiple users. I am not entirely sure what the problem is this patch is solving, as I am not familiar with AMDGPU. Is it a case where shrinking the constant leads to shrinking the related Op and in the end this introduces unnecessary conversion code, as multiple users require different bitwidths?

If that is the case, I am not sure if this is the best place to fix the issue. IIUC, shrinkDemandedBits just shrinks the constant of the passed in Op, but not the width of the result type of Op. So it seems to me that shrinking the constant should be independent of the users of Op.

Here is a DAG exposing the problem:

t33: i32 = or t42, Constant:i32<-2147483647>
        t37: f32 = bitcast t33
        t66: f32 = CVT_F32_UBYTE0 t33
      t38: f32 = fadd t37, t66

The node t66 only needs the low byte of the t33 "or" node. The other use of the "or" is the bitcast and, subsequently, the t38 fadd, which needs the full dword.
The "Op" argument of ShrinkDemandedConstant() is the t33 "or", with Demanded = 0xff as needed by t66.

If the optimization succeeds it will replace "or t42, 0x80000001" with "or t42, 1". However, this "or" has another use which needs the full range of the value.
It is a bug not to take the other uses into account.

It is unfortunate this led to a regression on the AArch64 side. Are you sure a similar issue can never happen with AArch64?
Theoretically it is possible to move the single-use check past the targetShrinkDemandedConstant() call, which I presume actually does the job for AArch64.
I think that should restore the missing optimization in that specific case, but I am not sure there will be no correctness issues elsewhere. Here is what I am speaking about:

--- a/lib/CodeGen/SelectionDAG/TargetLowering.cpp
+++ b/lib/CodeGen/SelectionDAG/TargetLowering.cpp
@@ -350,13 +350,13 @@ bool TargetLowering::ShrinkDemandedConstant(SDValue Op, const APInt &Demanded,
   SDLoc DL(Op);
   unsigned Opcode = Op.getOpcode();

-  if (!Op.hasOneUse())
-    return false;
-
   // Do target-specific constant optimization.
   if (targetShrinkDemandedConstant(Op, Demanded, TLO))
     return TLO.New.getNode();

+  if (!Op.hasOneUse())
+    return false;
+
   // FIXME: ISD::SELECT, ISD::SELECT_CC
   switch (Opcode) {
   default:

In fact the bit width of all the operations is the same; it is just that t66 is only known to read the low byte of its operand. It is worth noting that t66 is a custom AMDGPU node.
Consider t66 being a shift left by 24 instead:

t33: i32 = or t42, Constant:i32<-2147483647>
        t66: i32 = shl t33, 24
        t38: i32 = add t33, t66

There is nothing target specific here: t66 is likewise only known to read the low 8 bits of t33, and it is not unreasonable to call ShrinkDemandedConstant on t33 with Demanded = 0xff. That would expose the same bug regardless of target.

After several hours of trying to come up with a testcase that would fail with any other node or target, I found none.
I have moved the check into the target: D56406. Maybe later a check will be needed somewhere in the common code.