This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU]: Allow combining into v_dot4
ClosedPublic

Authored by jrbyrnes on Jul 21 2023, 2:21 PM.

Details

Summary

Adds the algorithm to match and select v_dot4 instructions in combining, and removes the patterns from selection. The patterns are fragile, and fail to match when byte extraction code is slightly different, or any optimizations alters the add / mul structure of the tree. The DAG combining approach is more flexible, and should not result in much overhead given all the early exits.

For kernels that should select into these instructions, doing so is vitally important. Not only is performance much improved, but failing to select into them can result in severe code bloat which drastically degrades compile time.

The extended perm matching is a happy consequence of whitelisting EXTRACT_VECT_ELT i32s as ultimate srcs of bytes.

Diff Detail

Event Timeline

jrbyrnes created this revision.Jul 21 2023, 2:21 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 21 2023, 2:21 PM
jrbyrnes requested review of this revision.Jul 21 2023, 2:21 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 21 2023, 2:21 PM
jrbyrnes added inline comments.Jul 21 2023, 2:22 PM
llvm/lib/Target/AMDGPU/VOP3PInstructions.td
293 ↗(On Diff #543065)

Should we delete these?

jrbyrnes updated this revision to Diff 543123.Jul 21 2023, 5:16 PM

Fix some errors.

jrbyrnes planned changes to this revision.EditedJul 26 2023, 4:53 PM

Nothing necessarily planned at the moment, just want to block the review for now.

It may make more sense to tune vectorization cost model (for i8 and potentially i16) to produce something like

%m = mul < n x i8> %v0, %v1
%o = llvm.vector.reduce.add.vni8(%m)
%op.rdx = add %o, %scalar

Then lower to mfma or v_dot in CodeGenPrepare.

e.g.

%op.rdx = v_dot4_i32_i8 %v0, %v1, %scalar

Instead of scalarizing the sequence and trying to combine all possible variants.

Need to finish investigation before unblocking review.

SLP vectorization should be tuned but that seems like a separate issue. Trees corresponding to v_dot4 often have s/zext as the final dest is 32 bit, but the arithmetic operations involve 8 bit operands. By introducing s/zext into the tree, we confuse the SLP vectorization cost model as it thinks it is vectorizing 32bit operands. The main issue is that cost model only looks at one node of the vectorizable tree at a time to calculate cost, instead of also considering the sequence as a whole. If we were to vectorize, codegen may be significantly less complex for these.

Plan is to move forward with this patch, and potentially tune vectorization in later work.

jrbyrnes updated this revision to Diff 547328.Aug 4 2023, 1:27 PM

Rebase + Extended algorithm for more complete coverage of potential trees.

Still a WIP while I determine if it is coverage / feature complete, and need to make lit testing more robust.

jrbyrnes updated this revision to Diff 549513.Aug 11 2023, 1:49 PM

Rebase + clean up code. Still running tests but no longer a WIP.

jrbyrnes retitled this revision from [AMDGPU] WIP: Allow matching into v_dot4 to [AMDGPU]: Allow combining into v_dot4.Aug 11 2023, 1:49 PM
jrbyrnes edited the summary of this revision. (Show Details)
jrbyrnes updated this revision to Diff 550814.Aug 16 2023, 11:07 AM

Fix non-determinism -- iteration order of DenseMap. Use SmallVector instead (worst case lookup is non factor due to size)

arsenm requested changes to this revision.Aug 18 2023, 7:26 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12365

don't need .has_value() part

12369

Ditto

12438–12439

is_contained

12523

1 || == 2?

12567

Don't know why we have getSubtarget, you can just use Subtarget-> directly

12574

SDValue TempNode(N, 0)

12576–12579

Don't understand what you are doing with this opcode to index check

12604

Don't need has_value()

12621

Don't need has_value

12665–12666

is_contained

12691

use deleted_node instead of optional opcode

12692

Can you avoid generation checks?

12698

Can you go through the intrinsics instead of going straight to the machine node?

This revision now requires changes to proceed.Aug 18 2023, 7:26 AM
jrbyrnes updated this revision to Diff 552475.Aug 22 2023, 12:45 PM
jrbyrnes marked 13 inline comments as done.

Rebase (for https://reviews.llvm.org/D158468) and lower with intrinsics.

Allowing combining in pre-legalize phase.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12438–12439

We use the found value here

12576–12579

It is a convenience which allows getting the mul operand without having to do operand checks every time.

arsenm added inline comments.Aug 23 2023, 5:13 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12531

auto &[Something, Mask] : ?

12538–12554

I thought we combined the mul24 intrinsics to the nodes specifically so you don't need to do this

12699–12700

DAG.getTargetConstant(IsSigned ? Intrinsic::amdgcn_sdot4 : Intrinsic::amdgcn_udot4, SL, MVT::i32)

jrbyrnes updated this revision to Diff 553220.Aug 24 2023, 11:57 AM
jrbyrnes marked 3 inline comments as done.

Address Comments

arsenm accepted this revision.Sep 5 2023, 1:46 PM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12374

Missing newline

12471

maybe comment skipping bswap?

12605

extra parens

This revision is now accepted and ready to land.Sep 5 2023, 1:46 PM
jrbyrnes updated this revision to Diff 555935.Sep 5 2023, 2:34 PM
jrbyrnes marked 3 inline comments as done.

Address comments

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12471

It's the original SDValue, not bswap. I'll add comment.

def : GCNPat <
  (i32 (bswap i32:$a)),
  (V_PERM_B32_e64 (i32 0), VSrc_b32:$a, (S_MOV_B32 (i32 0x00010203)))
>;
arsenm accepted this revision.Sep 7 2023, 12:48 PM
This revision was landed with ongoing or failed builds.Sep 7 2023, 1:06 PM
This revision was automatically updated to reflect the committed changes.
jrbyrnes reopened this revision.Sep 12 2023, 4:58 PM

Reopen for review as it has been reverted, now includes https://github.com/llvm/llvm-project/pull/65995

This revision is now accepted and ready to land.Sep 12 2023, 4:58 PM
jrbyrnes updated this revision to Diff 556879.EditedSep 15 2023, 1:19 PM

IsSigned tracks whether or not to produce an instruction with signed behavior. In some cases, we are able to determine this based on the semantics of the top-level instruction, however, in other cases, we need more information. For such cases, we must look to the tree itself.

In cases of vectorized arithmetic reduction instructions, we typically see a widening of type. We can use the signedness of the extension to determine the signedness semantics required for the instruction we will ultimately produce.

However, other clients of calculateByteProvider may not be producing arithmetic instructions, and, in these cases, there may be no requirement to track the signedness semantics. Thus, we do not need to fail if we are unable to detemine the signedness.

This removes the requirement that ByteProviders must determine IsSigned.

jrbyrnes updated this revision to Diff 557255.Sep 22 2023, 1:12 PM

Extract signedness checking

jrbyrnes updated this revision to Diff 557258.Sep 22 2023, 2:00 PM

Fix dereference issue + nits (reorganize logic + comments)

jrbyrnes updated this revision to Diff 557369.Sep 26 2023, 11:37 AM

Fix signedness handling of any_extend

bcahoon added a subscriber: bcahoon.Oct 3 2023, 9:24 AM

Just some minor comments/questions.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10724

add period at the end.

12361

This function is used for any mul, not just mul24?

12556

I think the check includes any extend (or unknown?) as well as signed?

Maybe say, If we have MUL_u24 without unsigned semantics, then fail.

12560

BTW, is this the same as :
if (!Src0.IsSigned.value_or(false) && MulOpcode == AMDGPUISD::MUL_I24)

jrbyrnes updated this revision to Diff 557593.Oct 4 2023, 11:12 AM
jrbyrnes marked 4 inline comments as done.

Address comments + update handling of AtomicSDNode + MemIntrinsic

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12556

This check is checking for conflicting signedness semantics. In the case where we don't have signedness info from the ByteProvider, the signedness is irrelevant so we should say it doesn't conflict.

In the case where we don't have signedness information, then two things could have occurred:

  1. We are accumulating into 8 bit register, and have not done any extensions. In this case, the upper bits are irrelevant, and we may use either version of the dot.
  2. We have exclusively used any_extends. Same as case 1, the upper bits are irrelevant.

There is a third scenario which I have accounted for in the latest version. Previously, we would not have signedness info if we encountered an unhandled node (MemIntrinsic / AtomicSDNode). However, in this case, the upper bits may be relevant. Thus, instead of throwing away signedness info in this situations, we now fail.

12560

Right nice catch -- except we should not fail if we don't have signedness info.

llvm/test/CodeGen/AMDGPU/idot4s.ll
146

Answering offline question about "neg_lo:[1,1,0]"

These tests are introduced via this patch, and the modifier indicates we need signedness semantics for both operands. neg_lo:[1,1,0] was introduced originally by rebasing on top of changes from https://reviews.llvm.org/D158468?vs=552172&id=552466#toc I believe.

bcahoon accepted this revision.Oct 4 2023, 12:48 PM

Changes LGTM.

This revision was landed with ongoing or failed builds.Oct 4 2023, 1:32 PM
This revision was automatically updated to reflect the committed changes.