
[DAGCombine][X86][AMDGPU][AArch64] (srl (shl x, c1), c2) with c1 != c2 handling
Changes Planned · Public

Authored by lebedev.ri on May 18 2019, 6:24 AM.

Details

Summary

https://rise4fun.com/Alive/6bVL

The AArch64 change is clearly a regression and will need a separate fix.
The AMDGPU change looks bad either way: bfe is not used.
The X86 changes look neutral-to-positive, except the vector cases, which are clearly regressions.
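For reference, a minimal sketch of the identity behind the combine (the Alive link above is the authoritative proof; the function names here are illustrative, not from the patch): a shift pair with unequal constants is equivalent to a single net shift by |c1 - c2| followed by a mask that clears the top c2 bits.

```cpp
#include <cassert>
#include <cstdint>

// (x << c1) >> c2 as the target sees it after the combine is *not* applied.
uint32_t shiftPair(uint32_t x, unsigned c1, unsigned c2) {
  return (x << c1) >> c2;
}

// Equivalent shift+mask form, assuming 0 <= c1, c2 < 32:
// shift x by the net amount |c1 - c2| in the appropriate direction,
// then clear the top c2 bits, which the srl would have zeroed.
uint32_t shiftPlusMask(uint32_t x, unsigned c1, unsigned c2) {
  uint32_t mask = UINT32_MAX >> c2; // keeps the low 32 - c2 bits
  uint32_t shifted = (c1 >= c2) ? (x << (c1 - c2))  // net left shift
                                : (x >> (c2 - c1)); // net right shift
  return shifted & mask;
}
```

Which form is cheaper is exactly the per-target question this review is debating: two shifts need no immediate materialization, while shift+mask can fold the mask into a bitfield-extract on some targets.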

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.May 18 2019, 6:24 AM
lebedev.ri marked 2 inline comments as done.May 18 2019, 3:18 PM
lebedev.ri added subscribers: arsenm, compnerd.

Looked at changes:

  • I'll leave the x86 vector cases for later, since I actually wanted to look into the reverse transform and only looked at this for consistency.
  • I don't know what to do with the AArch64 regression. I can hide it with shouldFoldConstantShiftPairToMask(), but it is there regardless (tests added). Thoughts?
  • That leaves AMDGPU?
test/CodeGen/AArch64/arm64-bitfield-extract.ll
396

After actually looking at -debug output, this regression happens because SimplifyDemandedBits()
ignores the AArch64TargetLowering::isDesirableToCommuteWithShift() override.
So when we get to AArch64TargetLowering::isBitfieldExtractOp(), we have
So when we get to AArch64TargetLowering::isBitfieldExtractOp(), we have

t18: i32 = srl t15, Constant:i64<2>
  t15: i32 = or t26, t24
    t26: i32 = and t7, Constant:i32<1073741816>
      t7: i32,ch = load<(load 4 from %ir.y, align 8)> t0, t2, undef:i64
        t2: i64,ch = CopyFromReg t0, Register:i64 %0
          t1: i64 = Register %0
        t6: i64 = undef
      t25: i32 = Constant<1073741816>
    t24: i32 = and t12, Constant:i32<4>
      t12: i32 = srl t4, Constant:i64<16>
        t4: i32,ch = CopyFromReg t0, Register:i32 %1
          t3: i32 = Register %1
        t11: i64 = Constant<16>
      t23: i32 = Constant<4>
  t17: i64 = Constant<2>

I'm not sure how to turn this pattern on its head to produce ubfx again.
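For context, a hedged sketch of what the lost instruction computes: AArch64's UBFX (unsigned bitfield extract) yields the `width` bits starting at bit `lsb`, i.e. a right shift followed by a mask. The matcher looks for the `(and (srl x, lsb), mask)` shape, which the srl-of-or-of-ands DAG above no longer has. The helper name is illustrative, not from LLVM.

```cpp
#include <cassert>
#include <cstdint>

// What AArch64 UBFX Rd, Rn, #lsb, #width computes, assuming
// lsb + width <= 32 and width < 32.
uint32_t ubfx(uint32_t x, unsigned lsb, unsigned width) {
  return (x >> lsb) & ((1u << width) - 1u);
}
```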

test/CodeGen/AMDGPU/llvm.amdgcn.ubfe.ll
686–687

@arsenm will AMDGPU prefer 2 shifts or shift+mask here?

arsenm added inline comments.May 18 2019, 3:28 PM
test/CodeGen/AMDGPU/llvm.amdgcn.ubfe.ll
686–687

In this particular case, they're the same. In general, 2 shifts is probably better: the mask value is less likely to be an inline immediate. It seems like we have a BFE matching problem, though.

686–687

For 64-bit, shift and mask would be better

Looks like AMDGPU changes are neutral too.
And now that I think about it, the AArch64 regression should be solvable (hidable) by an inverse transform.
Should I look into that before or after this patch?

RKSimon added inline comments.Jun 26 2019, 11:34 AM
test/CodeGen/AArch64/arm64-bitfield-extract.ll
396

@lebedev.ri Which SimplifyDemandedBits transform is missing this?

xbolva00 added inline comments.
test/CodeGen/X86/rotate-extract-vector.ll
168

@RKSimon is this better?

lebedev.ri marked 2 inline comments as done.Jun 26 2019, 11:58 AM
lebedev.ri added inline comments.
test/CodeGen/AArch64/arm64-bitfield-extract.ll
396

The other way around: some SimplifyDemandedBits transform produces this, and the rest of the pipeline is unable to recover.

RKSimon added inline comments.Jun 27 2019, 3:40 AM
test/CodeGen/X86/rotate-extract-vector.ll
168

Unlikely - a pair of shifts by uniform constants is almost certainly better

lebedev.ri planned changes to this revision.Jun 27 2019, 3:53 AM
lebedev.ri marked an inline comment as done.
lebedev.ri marked an inline comment as done and an inline comment as not done.Jul 18 2019, 8:56 AM
lebedev.ri added inline comments.
test/CodeGen/AArch64/arm64-bitfield-extract.ll
396

Still don't have any ideas on how to approach this.

test/CodeGen/AMDGPU/llvm.amdgcn.ubfe.ll
686–687
bool AMDGPUTargetLowering::shouldFoldConstantShiftPairToMask(
    const SDNode *N, CombineLevel Level) const {
  EVT VT = N->getValueType(0);
  return VT.isScalarInteger() && VT.getScalarSizeInBits() == 64;
}

results in many AMDGPU test regressions. I can submit a patch, but I thought I'd double-check first: is that what you meant?

test/CodeGen/X86/rotate-extract-vector.ll
168

Hmm, could you be more specific please?
For vectors, do we want to keep two shifts even if shift amounts are equal, or only if shift amounts are unequal?
Also, what does "almost certainly better" mean given that we have -mattr=+fast-vector-shift-masks?