This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Lowering VGPR to SGPR copies to v_readfirstlane_b32 if profitable.
ClosedPublic

Authored by alex-t on Jun 21 2022, 2:11 AM.

Details

Summary

Since the divergence-driven instruction selection has been enabled for AMDGPU,
all the uniform instructions are expected to be selected to SALU form, except those not having one.
VGPR to SGPR copies appear in MIR to connect values producers and consumers. This change implements an algorithm
that evolves a reasonable tradeoff between the profit achieved from keeping the uniform instructions in SALU form
and overhead introduced by the data transfer between the VGPRs and SGPRs.

Diff Detail

Event Timeline

alex-t created this revision.Jun 21 2022, 2:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2022, 2:11 AM
alex-t requested review of this revision.Jun 21 2022, 2:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 21 2022, 2:11 AM
alex-t updated this revision to Diff 438602.Jun 21 2022, 2:27 AM

debug output fixed

vpykhtin added inline comments.Jun 23 2022, 5:29 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
941

a bit misleading name, maybe "ShouldGoToVALU"?

970

ditto

982

Copies.count(J) and Copies[J] lookups J twice

1021

if (LowerSpecialCase ,,, ) continue to reduce nesting

1062

replace Inst->getParent()->end() with E

1070

why isSGPRReg is in the loop? It isn't being changed in the loop

1095

Copies[CurID] and Copies.count(CurID) lookups for CurID twice

1099

ditto for S

alex-t updated this revision to Diff 439694.Jun 24 2022, 3:33 AM

LIT tests changed. Description updated.

alex-t marked 8 inline comments as done.Jun 24 2022, 3:35 AM
alex-t added reviewers: rampitec, foad.
alex-t updated this revision to Diff 440619.Jun 28 2022, 7:31 AM

LIT tests updated

arsenm added inline comments.Jun 29 2022, 12:08 PM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
579

What's the advantage of splitting this into a separate loop?

alex-t added inline comments.Jun 29 2022, 1:49 PM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
579

We, basically, have 3 phases - analysis that collect information related to the particular copy, a decision that is made based on the information collected, and lowering that lowers the copy according to the decision made.
If processed into the main loop now, the info collected is changed by the REG_SEQUENCE & Co lowering and appears not actual at the moment when copies are lowered. The simplest example is the copy that has a SALU chain long enough to be v_readfirstlane_b32 but the nearest user is REG_SEQUENCE with one VGPR input.
We collect information that suggests we convert the copy into the v_readfirstlane_b32 but REG_SEQUENCE will be passed to moveToVALU because of the VGPR operand.
The only advantage is the possibility to split the change into one, analyzing and lowering just the copies, and another one, extending the algorithm to process the REG_SEQUENCE, PHIs & Co.
The latter is the next change and is upcoming soon.
The plan is to incrementally integrate both and then optimize compile time collecting the information in one loop.

foad added inline comments.Jul 4 2022, 8:16 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
949

Spelling "penalty"

966

Spelling "penalty"

979

Can just be a set, doesn't need to be a SetVector

1034

Can just be a set, doesn't need to be a SetVector

alex-t updated this revision to Diff 442303.Jul 5 2022, 7:43 AM

Changes according to the reviewer comments.

alex-t marked 4 inline comments as done.Jul 5 2022, 7:44 AM
alex-t updated this revision to Diff 443331.Jul 8 2022, 12:57 PM

Fixed bug caused 5 vulkan CTS tests to fail. Also, fixed "sibling" copies counting code.
It now consider subregs of the same register as different sources.

ping
does anybody has objections?

foad added a comment.Jul 12 2022, 3:02 AM

[AMDGPU] VGPR to SGPR copies lowering

Needs a better subject line. Previously moving all VGPR uses from SALU to VALU was required for correctness. Now it is not, so this pass is a heuristic that decides whether to move SALU to VALU or to insert a readfirstlane instruction.

In future I would like VGPR-to-SGPR copies to be legal, and always implemented as a readfirstlane, so that this whole pass could be truly optional and we could skip it at -O0.

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
97–103

& should be after space, not before. Doesn't clang-format fix this? Please run git clang-format @^ on the patch anyway.

99

Likewise.

578

I don't understand this comment.

1124

Typo readfirstlane

1128

Should be const. But why is 4 enough? Isn't there some way you can get this programmatically from SIRegisterInfo?

1135

Don't need the "if", you can call getSubRegClass unconditionally.

1140

Don't need the "if", you can use the subreg version in all cases.

rampitec added inline comments.Jul 12 2022, 1:48 PM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
99

camelLowerCase the name.

Please also add a comment what special cases does it handle.

933

Second to ask for a clang-format run.

944

Double space after "to".

954

#if !defined(NDEBUG) || defined(LLVM_ENABLE_DUMP) around dump().

1038

Capitalize 'Worklist'.

1050

Typo "comtribute".

1068

Why ++I is in parenthesis?

1143

What happens to 16 bit subregs?

alex-t updated this revision to Diff 444267.Jul 13 2022, 8:19 AM
alex-t marked 6 inline comments as done.

clang-formatted.
Changes addressing the reviewers comments.

alex-t marked 4 inline comments as done.Jul 13 2022, 8:22 AM
alex-t added inline comments.Jul 13 2022, 8:27 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
1143

VGPR to SGPR copies are inserted by InstrEmitter to adjust the VALU result to the SALU consumer.
The 16bits in VGPR result are packed and adjusted to the consumer by inserting the EXCTRACT_ELEMENT lowered in another place.
What kind of adjustment would you recommend if we have a 16bit VGPR source?
Zero-extend it to 32bit?

[AMDGPU] VGPR to SGPR copies lowering

In future I would like VGPR-to-SGPR copies to be legal, and always implemented as a readfirstlane, so that this whole pass could be truly optional and we could skip it at -O0.

They are already legal. As soon as I integrate part2 that takes care of the REG_SEQUENCE and PHIs we can lower all to v_readfirstlane_b32 at -O0.
Now we cannot because the REG_SEQUENCE and PHIs having VGPR input are converted to VALU unconditionally and we are going to have bugs similar to that just fixed in VK-CTS.

alex-t retitled this revision from [AMDGPU] VGPR to SGPR copies lowering to [AMDGPU] Lowering VGPR to SGPR copies to v_readfirstlane_b32 if profitable..Jul 13 2022, 8:37 AM
alex-t marked an inline comment as done.Jul 13 2022, 8:42 AM
alex-t added inline comments.
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
1128

For now just SIRegisterInfo::getSubRegFromChannel is used. Later I would like to change the SIInstrInfo::readlaneVGPRToSGPR to serve all cases.

rampitec added inline comments.Jul 13 2022, 10:46 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
946

You probably need to reset NextID to zero with each run of the pass. Better though make it a normal member of the Pass class itself.

1143

Assume the input like:

%0:SGPR_LO16 = COPY %1.lo16:VGPR_32

If I read it right it will produce V_READFIRSTLANE_B32 with a 16 bit destination and source, which does not work. Assume that selection managed to produce such input, which path will it take here?

1148

Run clang-format again?

rampitec added inline comments.Jul 13 2022, 11:05 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
1143

JBTW, right now it seems to go via moveToVALU:

# RUN: llc -march=amdgcn -mcpu=gfx1100 -run-pass=si-fix-sgpr-copies -verify-machineinstrs -o - %s

---
name:            v16_to_s16
body:             |
  bb.0:
    %0:vgpr_32 = IMPLICIT_DEF
    %1:sgpr_lo16 = COPY %0.lo16:vgpr_32
    %2:vgpr_lo16 = COPY %1
    S_ENDPGM 0, implicit %1, implicit %2
...

Results in:

%0:vgpr_32 = IMPLICIT_DEF
%3:vgpr_lo16 = COPY %0.lo16
%2:vgpr_lo16 = COPY %3
S_ENDPGM 0, implicit %3, implicit %2

Perhaps I'm blind, but I don't see where the heuristic takes into account the number of ALU instructions that would be moved from SALU to VALU.

Perhaps I'm blind, but I don't see where the heuristic takes into account the number of ALU instructions that would be moved from SALU to VALU.

I am not sure that I understand the question.
Currently, we only have VGPR to SGPR copies for the cases where uniform instruction produces VGPR. It's uniform users that are SALU need SGPR.
For each copy we have a choice - convert all its users to VALU or replace the copy with v_readfirstlane_b32. The algorithm computes the tradeoff.
Inserting v_readfirstlane_b32 we add yet one more VALU instruction. If the SALU chain spawned by the copy is 2 instructions long, inserting v_readfirstlane_b32 is an overkill.
We agreed on the heuristic: 1 v_readfirstlane_b32 for at least 3 SALU.
The score is the length of SALU chain minus the number of v_readfirstlane_b32 to insert and minus the number of SGPR to VGPR copies that need to get the result back to VALU.

So, if we decide to move the whole chain to VALU it already means that keeping it SALU (by replacing the copy with v_readfirstlane_b32) is not profitable.
The number of the SALU instructions that are converting VALU, in this case, is the SChain.size()

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
1143

That is what I tried to explain. To reach the place we're talking about the copy should spawn SALU chain long enough to be selected for v_readfirstlane_b32. We have no SALU instructions that accept 16bit operand explicitly. All 16bit SALU really take 32bit SGPR but only use HI/LO half of it. So, we cannot create the MIR for the case you are concerned with.

alex-t updated this revision to Diff 444667.Jul 14 2022, 8:02 AM

NextID static member removed

alex-t updated this revision to Diff 444670.Jul 14 2022, 8:11 AM

clang-format

alex-t marked 4 inline comments as done.Jul 14 2022, 8:12 AM
rampitec added inline comments.Jul 14 2022, 11:14 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
87

It is uninitialized.

The number of the SALU instructions that are converting VALU, in this case, is the SChain.size()

Thanks for the explanation, it all makes sense now. I misread what SChain does, sorry about that.

alex-t updated this revision to Diff 444799.Jul 14 2022, 2:26 PM

16bit source register case separated for the sake of readability

alex-t marked an inline comment as done.Jul 14 2022, 2:27 PM
This revision is now accepted and ready to land.Jul 14 2022, 2:41 PM
thakis added a subscriber: thakis.Jul 14 2022, 3:10 PM

looks like this breaks tests: http://45.33.8.238/linux/81242/step_12.txt

Please take a look and revert for now if it takes a while to fix.

looks like this breaks tests: http://45.33.8.238/linux/81242/step_12.txt

Please take a look and revert for now if it takes a while to fix.

Fixed by the 65abc3a869bf67984b4d393afa2fb16b1690e70d

foad added a comment.Jul 15 2022, 2:17 AM

In future I would like VGPR-to-SGPR copies to be legal

They are already legal.

Currently SIInstrInfo::copyPhysReg does this:

if (RI.isSGPRClass(RC)) {
  if (!RI.isSGPRClass(SrcRC)) {
    reportIllegalCopy(this, MBB, MI, DL, DestReg, SrcReg, KillSrc);
    return;
  }

That is why I say they are illegal. I would like to change this, so that copyPhysReg will allow them and implement them by emitting readfirstlane.

In future I would like VGPR-to-SGPR copies to be legal

They are already legal.

Currently SIInstrInfo::copyPhysReg does this:

if (RI.isSGPRClass(RC)) {
  if (!RI.isSGPRClass(SrcRC)) {
    reportIllegalCopy(this, MBB, MI, DL, DestReg, SrcReg, KillSrc);
    return;
  }

That is why I say they are illegal. I would like to change this, so that copyPhysReg will allow them and implement them by emitting readfirstlane.

OK. You already can do this now. I have pointed you to the experimental branch on my github that propagates the DA information to MIR.
In this branch there is an assert if VGPR to SGPR copy defining instruction is uniform. I had to support a list of the exceptions for those instructions that a divergent but require SGPR operand. Not really a long list of them.
I passed through all the AMDGPU LIT tests w/o asserts.
For more information see https://github.com/alex-t/llvm-project/tree/dd_isel_exp: 850646bb02c204da602c7f4e654c7b65e59b6912
The branch is a pure draft for internal use. You may be interested in the DA information propagated to MIR and the exception instruction list.

As for the -O0, the pass may be excluded as soon as I integrate part2 of this change that covers REG_SEQUENCE and PHIs

llvm/test/CodeGen/AMDGPU/agpr-copy-no-free-registers.ll