Fixes cvt_f32_ubyte combine. performCvtF32UByteNCombine() could shrink
source node to demanded bits only even if there are other uses.
Details
Diff Detail
Event Timeline
lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
8589–8590 ↗ | (On Diff #180126) | Should skip both if !hasOneUse. I thought SimplifyDemandedBits does this check already, so I don't know why ShrinkDemandedConstant doesn't also |
I do not see that a bug in the first place. You are creating mask and ask ShrinkDemandedConstant to do the job. My take it is assumed you did the checks needed and know what are you doing. I will just prevent legal optimizations elsewhere this way.
lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
8589–8590 ↗ | (On Diff #180126) | SimplifyDemandedBits does that check, but it does more, so in some cases it is able to work even with multiple uses. |
I have checked existing uses of ShrinkDemandedConstant, it doe snot look these uses really care about multiple use situation. Therefor I have moved single use check there.
Concerning SimplifyDemandedBits is handles multiple uses gracefully, which is documented:
Helper for SimplifyDemandedBits that can simplify an operation with multiple uses.
I am seeing a regression with this change on AArch64, where we do not use shrinked constants, e.g. for and/or if they have multiple users. I am not entirely sure what the problem is this patch is solving, as I am not familiar with AMDGPU. Is it a case where shrinking the constant leads to shrinking the related Op and in the end this introduces unnecessary conversion code, as multiple users require different bitwidths?
If that is the case, I am not sure if this is the best place to fix the issue. IIUC, shrinkDemandedBits just shrinks the constant of the passed in Op, but not the width of the result type of Op. So it seems to me that shrinking the constant should be independent of the users of Op.
Here is DAG exposing the problem:
t33: i32 = or t42, Constant:i32<-2147483647> t37: f32 = bitcast t33 t66: f32 = CVT_F32_UBYTE0 t33 t38: f32 = fadd t37, t66
The node t66 only needs a low byte from the t33 "or" node. The other use of "or" is bitcast and subsequently t38 fadd which needs full dword.
The "Op" argument of ShrinkDemandedConstant() is t33 "or" and Demanded = 0xff as needed for t66.
If optimization succeeds it will replace "or t42, 0x80000001" with "or t42, 1". However, this "or" has another use which needs the full range of the value.
That is a bug not to take other uses into account.
It is unfortunate this lead to regression on AArch64 side. Are you sure a similar issue can never happen with AArch64?
Theoretically that is possible to move a single use check past targetShrinkDemandedConstant() call which actually do the job for AArch I presume.
I think it should correct lack of optimization in that specific case, but I am not sure there will be no correctness issues elsewhere. Here is what I am speaking about:
--- a/lib/CodeGen/SelectionDAG/TargetLowering.cpp +++ b/lib/CodeGen/SelectionDAG/TargetLowering.cpp @@ -350,13 +350,13 @@ bool TargetLowering::ShrinkDemandedConstant(SDValue Op, const APInt &Demanded, SDLoc DL(Op); unsigned Opcode = Op.getOpcode(); - if (!Op.hasOneUse()) - return false; - // Do target-specific constant optimization. if (targetShrinkDemandedConstant(Op, Demanded, TLO)) return TLO.New.getNode(); + if (!Op.hasOneUse()) + return false; + // FIXME: ISD::SELECT, ISD::SELECT_CC switch (Opcode) { default:
In fact the bit width of all operations is the same. It just t66 only known to read low byte from its operand. I believe it worth nothing t66 is custom AMDGPU node.
Consider t66 is a shift left by 24:
t33: i32 = or t42, Constant:i32<-2147483647> t66: i32 = shl t33, 24 t38: i32 = add t33, t66
There is nothing target specific here. t66 also only known to read low 8 bits of t33 and that is not unreasonable to call ShrinkDemandedConstant on t33 with Demanded = 0xff. It will expose the same bug regardless of target.
After several hours trying to come up with a testcase which would fail with any other node or target I found none.
I have moved the check into the target: D56406. Maybe later a check will be needed somewhere in the common code.