This is an archive of the discontinued LLVM Phabricator instance.

[DAG] getNode() - fold (zext (trunc x)) -> x iff the upper bits are known zero - add SRL support
Closed, Public

Authored by RKSimon on Sep 20 2023, 9:58 AM.

Details

Summary

This is part of the work to address the D155472 regressions. There are a number of issues with generalizing this fold, which is why I'm just adding SRL support for now.
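For context, a minimal sketch of the shape of this fold inside SelectionDAG::getNode (simplified and hypothetical, not the exact patch -- per the discussion below, the real change filters the inner opcode to ISD::AssertZext and, with this patch, ISD::SRL, and carries extra guards):

// Hypothetical, simplified sketch of the fold for ISD::ZERO_EXTEND of N1.
if (N1.getOpcode() == ISD::TRUNCATE) {
  SDValue OpOp = N1.getOperand(0);
  if (OpOp.getValueType() == VT &&
      (OpOp.getOpcode() == ISD::AssertZext || OpOp.getOpcode() == ISD::SRL)) {
    // zext(trunc(x)) == x only if the bits above the truncation width are
    // already known to be zero.
    APInt HiBits = APInt::getBitsSetFrom(VT.getScalarSizeInBits(),
                                         N1.getScalarValueSizeInBits());
    if (MaskedValueIsZero(OpOp, HiBits))
      return OpOp;
  }
}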

The fold encourages the creation of ISD::FSHR nodes, which was breaking a lot of the AMDGPUISD::PERM tests, so I've added ISD::FSHR handling to the matchPERM helper.

Event Timeline

RKSimon created this revision. Sep 20 2023, 9:58 AM
Herald added a project: Restricted Project. Sep 20 2023, 9:58 AM
RKSimon requested review of this revision. Sep 20 2023, 9:58 AM
Herald added a subscriber: wdng.
goldstein.w.n added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
5698

Why does the ISD::AssertZext need a one-use check?

But while you're at it, I think ISD::AND would also work here, right?

> What is the best way to generalize the performOrCombine handling to support ISD::FSHR as well?

We should ideally produce AMDGPUISD::PERM whenever we have an i32 that is a byte permutation of two sources (except for some special cases), and it's best to do this via DAGCombining for target-specific reasons. In practice, such byte permutations typically have ISD::OR as the root of the tree (which is why I began with that combine); however, other cases exist where the tree does not have ISD::OR as the root (e.g. AMDGPUISD::PERM, ISD::FSHR, etc). It seems to me the best way to generalize is to extract the common code from the ISD::OR combine and use it to enable new roots / entry points -- this is an extension of work we should do anyway. I hacked together an implementation of what I have just described, and this patch no longer causes the v_perm miss regressions. @foad @arsenm any objections?
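For illustration, the extraction described above might look roughly like this (a sketch only -- the helper name matches the one adopted later in this review, but the signatures and surrounding code are assumptions, not the actual patch):

// Sketch: pull the byte-permutation matching out of the ISD::OR combine so
// that other roots can reuse it. Signatures are assumed for illustration.
static SDValue matchPERM(SDNode *N, TargetLowering::DAGCombinerInfo &DCI);

SDValue SITargetLowering::performOrCombine(SDNode *N,
                                           DAGCombinerInfo &DCI) const {
  // Existing entry point: most byte permutations are rooted at ISD::OR.
  if (SDValue Perm = matchPERM(N, DCI))
    return Perm;
  // ... the remaining OR combines ...
  return SDValue();
}

SDValue SITargetLowering::performFSHRCombine(SDNode *N,
                                             DAGCombinerInfo &DCI) const {
  // New entry point: byte permutations rooted at ISD::FSHR.
  if (SDValue Perm = matchPERM(N, DCI))
    return Perm;
  return SDValue();
}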

RKSimon added inline comments. Sep 20 2023, 1:36 PM
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
5698

Everything should work here - eventually the getOpcode() filters will go away entirely. The problem we're having is that a number of AArch64 mul folds create various zext(trunc()) patterns and then expect them to still exist later in the combine.

As usual with LLVM it comes down to yak shaving :)

RKSimon updated this revision to Diff 557146. Sep 20 2023, 1:38 PM

@jrbyrnes I've pulled the AMDGPUISD::PERM matching out of performOrCombine, made it into a 'matchPERM' helper, and set up ISD::FSHR to use it - is this what you had in mind?

jrbyrnes added a comment. Edited Sep 20 2023, 3:24 PM

Disregard the previous version of this comment -- I did not see the most recent changes.

Can you please bring in the tests from https://github.com/llvm/llvm-project/pull/66965?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10775 (On Diff #557146)

Can we also fail on vector types?

10791 (On Diff #557146)

This seems to be broken if Index = 1, BytesProvided = 2, and ByteShift = 3 -- we will attempt to get the 2nd byte of a 16-bit operand, which is out of range (only bytes 0 and 1 are valid).

Perhaps it was meant to be Index + (BytesProvided - ByteShift)? But this still doesn't account for the case where ByteShift > n * BytesProvided.
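To make the arithmetic concrete with those numbers: in the version below, the shift is first reduced modulo the bit width, so for a 16-bit fshr a nominal ByteShift of 3 becomes 1. The wanted byte is then mapped onto the 4-byte concatenation of the two operands: with Index = 1, (Index + ByteShift) % 4 = 2, which falls in the high operand (operand 0) as its byte 0 -- so no per-operand byte index can ever exceed BytesProvided - 1.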

How do you feel about

case ISD::FSHR: {
  // Only handle scalar fshr with a constant shift amount.
  auto ShiftOp = dyn_cast<ConstantSDNode>(Op->getOperand(2));
  if (!ShiftOp || Op.getValueType().isVector())
    return std::nullopt;

  uint64_t BitsProvided = Op.getValueSizeInBits();
  if (BitsProvided % 8 != 0)
    return std::nullopt;

  // The shift amount only matters modulo the bit width; reject shifts that
  // are not a whole number of bytes.
  uint64_t BitShift = ShiftOp->getAPIntValue().urem(BitsProvided);
  if (BitShift % 8)
    return std::nullopt;

  // fshr concatenates its two operands (2 * BitsProvided bits, i.e.
  // BitsProvided / 4 bytes) and shifts the result right.
  uint64_t ConcatSizeInBytes = BitsProvided / 4;
  uint64_t ByteShift = BitShift / 8;

  // Map the requested byte onto the concatenation: the high bytes come
  // from operand 0, the low bytes from operand 1.
  auto NewIndex = (Index + ByteShift) % ConcatSizeInBytes;
  auto BytesProvided = BitsProvided / 8;
  auto NextOp = Op.getOperand(NewIndex >= BytesProvided ? 0 : 1);
  NewIndex %= BytesProvided;
  return calculateSrcByte(NextOp, StartingIndex, NewIndex);
}
llvm/test/CodeGen/AMDGPU/permute.ll
127 (On Diff #557146)

I can look into this after this patch lands.

jrbyrnes added inline comments. Sep 20 2023, 5:41 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10791 (On Diff #557146)

Oops -- should have been

return calculateByteProvider(NextOp, NewIndex, Depth + 1, StartingIndex);

RKSimon updated this revision to Diff 557170. Sep 21 2023, 4:37 AM
RKSimon edited the summary of this revision. (Show Details)

Update with @jrbyrnes' suggestions and add an early-out so we prefer an existing ISD::FSHR node to an AMDGPUISD::PERM equivalent.

RKSimon added inline comments.
llvm/lib/CodeGen/SelectionDAG/SelectionDAG.cpp
5698

@goldstein.w.n Hopefully D159537 will address the last regressions, which will allow me to remove the OpOp.getOpcode() filter entirely.

Thanks -- the AMDGPUISD::PERM side LGTM

Thanks - any further comments?

foad added a comment. Sep 22 2023, 3:31 AM

@jrbyrnes shouldn't we land the AMDGPU changes separately first? Or would it all be dead code without the rest of this patch?

> @jrbyrnes shouldn't we land the AMDGPU changes separately first? Or would it all be dead code without the rest of this patch?

I can pre-commit the matchPERM helper as an NFC-ish change to reduce the diff, if that helps?

RKSimon updated this revision to Diff 557246. Sep 22 2023, 8:23 AM
RKSimon edited the summary of this revision. (Show Details)

rebase

> @jrbyrnes shouldn't we land the AMDGPU changes separately first? Or would it all be dead code without the rest of this patch?

The changes to calculateByteProvider may actually be NFC, because sub-32-bit ISD::FSHRs get legalized into ISD::OR, which is already handled -- but enabling the ISD::FSHR combine and installing matchPERM there is not.

The changes look okay, but it seems it would be best to land them separately first.

This revision was not accepted when it landed; it landed in state Needs Review. Sep 24 2023, 5:50 AM
This revision was landed with ongoing or failed builds.
This revision was automatically updated to reflect the committed changes.