This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix vgpr2sgpr copy analysis to check scalar operands of buffer instructions use scalar registers.
AbandonedPublic

Authored by skc7 on Sep 22 2022, 3:43 AM.

Details

Summary

In si-fix-sgpr-copies pass, lowering of COPY instruction (vgpr to sgpr) to VALU or to v_readfirstlane_b32 is done. It is decided based on the SALU instructions users of result of COPY. It misses the case where the use of result of COPY need to be scalar register only. Example: In buffer instructions, there are scalar operands (srsrc, sOffset) which will only accept scalar registers.

This change lowers the vgpr2sgpr copies to use v_readfirstlane_b32, for scalar operands of MUBUF/MTBUF.

Diff Detail

Event Timeline

skc7 created this revision.Sep 22 2022, 3:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 22 2022, 3:43 AM
skc7 requested review of this revision.Sep 22 2022, 3:43 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 22 2022, 3:43 AM
foad added a reviewer: alex-t.Sep 22 2022, 3:52 AM
arsenm added inline comments.Sep 22 2022, 5:55 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
947

The pass is already doing a walk over use operands, you shouldn't need another use list walk

958

This is a property of specific operands

skc7 updated this revision to Diff 462539.Sep 23 2022, 10:07 AM
skc7 added a reviewer: bcahoon.

Use existing use-list walk to identify mubuf/mtbuf instructions. Delete copyResultUseNeedToBeSgpr method previously introduced. Changes as suggested by @arsenm review.

foad added inline comments.Sep 23 2022, 11:01 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
923

I don't understand why we need special cases for particular named operands. Why can't this be inferred from the operand's register class? @alex-t?

skc7 updated this revision to Diff 463184.Sep 27 2022, 4:29 AM

check for use of copy result as a scalar operand (soffset or srsrc) in mubuf/mtbuf instructions.

skc7 added inline comments.Sep 27 2022, 4:33 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
923

soffset and srsrc need to be scalar registers in mubuf/mtbuf instructions.. Check for use of COPY's result is done here for these operands.

skc7 updated this revision to Diff 464708.Oct 3 2022, 9:11 AM

Rebase.

skc7 added inline comments.Oct 31 2022, 2:51 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
947

Removed the previous extra use list walk.

958

soffset and srsrc operands are checked now for register use

alex-t added a comment.EditedOct 31 2022, 7:06 AM

Sorry for my tediousness but
I would like to see any inspirational reason for this change.

The change in the LIT test suggests one - we get rid of the waterfall loop, but I would like to see (if possible) a simpler case and verbal description regarding "why do we need this?"
It is a good idea just because not any SGPR use requires such special processing but the 128-bit SGPR in MUBUF instruction is legalized by creating the waterfall loop which affects performance. So, the justification here is necessary.

I also would like to see the test case where the VGPR to SGPR copy result has multiple uses.
It would be useful to check what happens if there are multiple VGPR uses besides the one that requires SGPR.

In general, the approach now looks like an attempt to hack into the concrete input pattern.

Although, I don't know other cases which add a significant penalty for moveToVALU. We should think of some unified per-opcode penalty interface if we have more cases.

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
923

The fact that the specific use requires SGPR surely can be checked for any use.
What I don't like here is the hack forcing the decision to be made for v_readfistlane.
The algorithm that makes a decision is a complex tradeoff and we should not "tune" it in such a way for each particular case. I would consider the "SGPR required" as a weight value for the solver rather than the ultimate condition.

This comment was removed by alex-t.
alex-t added inline comments.Oct 31 2022, 4:32 PM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
916

What is the reason for swapping these lines?
if this

TRI->isSGPRReg(*MRI, Reg) && !TII->isVALU(*Inst)

is not true we don't need to process register users.

918

Inst not necessarily is a copy. You can have MUBUF/MTBUF instruction terminating the SALU chain of arbitrary length. The V2S copy result may be (for example) used by some long arithmetic sequence and then used as an operand of the REG_SEQUENCE which produces SReg_128, which in order used as an operand in MUBUF instruction.

922

Maybe just

if (MRI->getRegClass(Reg) == &AMDGPU::SReg_128RegClass &&
   (TII->isMUBUF(Opc) || TII->isMTBUF(Opc))

?

alex-t added inline comments.Oct 31 2022, 5:16 PM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
915
if (TRI->isSGPRReg(*MRI, Reg) && !TII->isVALU(*Inst)) {
  for (auto &U : MRI->use_instructions(Reg)) {
    unsigned Opc = U.getOpcode();
    if (MRI->getRegClass(Reg) == &AMDGPU::SGPR_128RegClass &&
        (TII->isMUBUF(Opc) || TII->isMTBUF(Opc))) {
      Info.HasMUBUFSGPR128 = true;
    }
    Users.push_back(&U);
  }
}
skc7 updated this revision to Diff 472699.Nov 2 2022, 11:04 AM
skc7 retitled this revision from [AMDGPU] Fix vgpr2sgpr copy to scalar operands of buffer instructions. to [AMDGPU] Fix vgpr2sgpr copy analysis to check scalar operands of buffer instructions use scalar registers..
skc7 edited the summary of this revision. (Show Details)

changes as suggested by @alex-t.

skc7 added inline comments.Nov 2 2022, 11:10 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
916

Fixed previous patch. Swap for and if was my mistake. Also added check for AMDGPU::SReg_32RegClass used by soffset of MUBUF/MTBUF.

923

HasMBUFScalarReg flag would be set to true for scalar operands of MUBUF/MTBUF. This would lower copy to use v_readfirstlane_b32.

skc7 added a comment.Nov 2 2022, 11:21 AM

Sorry for my tediousness but
I would like to see any inspirational reason for this change.

The change in the LIT test suggests one - we get rid of the waterfall loop, but I would like to see (if possible) a simpler case and verbal description regarding "why do we need this?"
It is a good idea just because not any SGPR use requires such special processing but the 128-bit SGPR in MUBUF instruction is legalized by creating the waterfall loop which affects performance. So, the justification here is necessary.

I also would like to see the test case where the VGPR to SGPR copy result has multiple uses.
It would be useful to check what happens if there are multiple VGPR uses besides the one that requires SGPR.

In general, the approach now looks like an attempt to hack into the concrete input pattern.

Although, I don't know other cases which add a significant penalty for moveToVALU. We should think of some unified per-opcode penalty interface if we have more cases.

%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec ::

si-fix-sgpr-copies pass converts this as below

%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %5:vgpr_32, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)

soffset which is supposed to be scalar register has been converted to use vgpr. This is failing in the backend with error "Illegal virtual register for instruction".

With this patch I'm trying to fix this issue by checking that scalar registers are used for soffset and srsrc of MUBUF/MTBUF.

alex-t added inline comments.Nov 2 2022, 3:10 PM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
941

Setting Info->Score to 0 means that it NEEDS to be VALU!
Setting it to zero if SChain is empty means that V2S copy has no SALU descendants and definitely needs to be VALU. So early returns to avoid the rest of the computations.

In your case, you return false specifically to indicate that this particular V2S copy has in its SALU chain MUBUF that requires SGPR and it should be NEVER converted to VALU even if its chain is short or empty.

foad added inline comments.Nov 3 2022, 12:12 AM
llvm/test/CodeGen/AMDGPU/vgpr-descriptor-waterfall-loop-idom-update.ll
5 ↗(On Diff #472699)

This test is *supposed* to generate a waterfall loop, because it is using a non-uniform descriptor in a buffer load. Your patch changes it so that it does not generate a waterfall loop. I don't think that is correct.

skc7 updated this revision to Diff 472885.Nov 3 2022, 3:09 AM

Rebase. Fix for failing tests.

skc7 added inline comments.Nov 3 2022, 3:12 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
941

Since needToBeConvertedToVALU returns without any score calculation in this case, I assumed I need to make the score zero and return. Thanks for helping with this.

foad added a comment.Nov 7 2022, 1:52 AM

What is the motivation for this patch? Is there a bug? Or an existing test that you are trying to generate better code for?

This is a complicated area because in some cases SIFixSGPRCopies has to know what legalizeOperands is going to do, and legalizeOperands can do complicated things like introducing waterfall loops. In the long term it would be better if legalizeOperands stopped doing that and SelectionDAG produced code that was already correct without legalization (e.g. by generating waterfall loops during selection, or even in CodeGenPrepare if we come up with a way to express them in IR). On the other hand, in the long term, it would be nice to abandon SelectionDAG in favour of GlobalISel.

It is decided based on the SALU instructions users of result of COPY.

This is OK because we (mostly @alex-t!) have put a lot of effort into ensuring that SALU instructions are only selected for uniform operations.

It misses the case where the use of result of COPY need to be scalar register only. Example: In buffer instructions, there are scalar operands (srsrc, sOffset) which will only accept scalar registers.

This is probably not OK because SelectionDAG still selects buffer instruction where the "scalar operand" is actually a divergent value. It relies on legalizeOperands to fix this by inserting a waterfall loop.

skc7 added a comment.Nov 7 2022, 3:27 AM

What is the motivation for this patch? Is there a bug? Or an existing test that you are trying to generate better code for?

This is a complicated area because in some cases SIFixSGPRCopies has to know what legalizeOperands is going to do, and legalizeOperands can do complicated things like introducing waterfall loops. In the long term it would be better if legalizeOperands stopped doing that and SelectionDAG produced code that was already correct without legalization (e.g. by generating waterfall loops during selection, or even in CodeGenPrepare if we come up with a way to express them in IR). On the other hand, in the long term, it would be nice to abandon SelectionDAG in favour of GlobalISel.

It is decided based on the SALU instructions users of result of COPY.

This is OK because we (mostly @alex-t!) have put a lot of effort into ensuring that SALU instructions are only selected for uniform operations.

It misses the case where the use of result of COPY need to be scalar register only. Example: In buffer instructions, there are scalar operands (srsrc, sOffset) which will only accept scalar registers.

This is probably not OK because SelectionDAG still selects buffer instruction where the "scalar operand" is actually a divergent value. It relies on legalizeOperands to fix this by inserting a waterfall loop.

%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec ::

si-fix-sgpr-copies pass converts this as below:

%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %5:vgpr_32, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)

soffset which is supposed to be scalar register has been converted to use vgpr. This is failing in the backend with error "Illegal virtual register for instruction".

Currently, we are seeing such VGPR to SGPR copies and SIFixSGPRCopies pass is using a vgpr for soffset of MBUF.
I'm trying to fix this issue by checking that scalar registers are used for soffset and srsrc of MUBUF/MTBUF. True, if SelectionDAG already produced code that was legal, that would not require any such fixes.

foad added a comment.Nov 7 2022, 5:41 AM

%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec ::

I need more context. Is %5 uniform?

alex-t added a comment.Nov 7 2022, 9:08 AM

%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec ::

I need more context. Is %5 uniform?

I think that I've got an idea behind this patch. Let's say %5 is uniform. Then we've got to try to promote all the %8 descendants to SALU if possible.
In some cases, it appears that such a copy has few or even no SALU descendants, and according to the common algorithm should be converted to VALU.
When the conversion is done, legalizeOperands creates the waterfall loop which is obviously much worse than inserting the v_readfirstlane_b32.
As far as I understand, @skc7 addresses this scenario and aims to avoid an unnecessary waterfall loop.
BTW, if %5 is divergent we have a bug in ISel. We now should not have any V2S copy with the divergent source.

skc7 added a comment.Nov 7 2022, 9:12 AM

%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec ::

I need more context. Is %5 uniform?

define <4 x i32> @extract0_bitcast_raw_buffer_load_v4i32(<4 x i32> inreg %rsrc, i32 %ofs, i32 %sofs) local_unnamed_addr #0 {
%var = tail call <4 x i32> @llvm.amdgcn.raw.buffer.load.v4i32(<4 x i32> %rsrc, i32 %ofs, i32 %sofs, i32 0)
ret <4 x i32> %var
}

IR dump after amdgpu-isel:

bb.0 (%ir-block.0):
liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
%5:vgpr_32 = COPY $vgpr5
%4:vgpr_32 = COPY $vgpr4
%3:vgpr_32 = COPY $vgpr3
%2:vgpr_32 = COPY $vgpr2
%1:vgpr_32 = COPY $vgpr1
%0:vgpr_32 = COPY $vgpr0
%6:sgpr_128 = REG_SEQUENCE %0:vgpr_32, %subreg.sub0, %1:vgpr_32, %subreg.sub1, %2:vgpr_32, %subreg.sub2, %3:vgpr_32, %subreg.sub3
%8:sreg_32 = COPY %5:vgpr_32
%7:vreg_128 = BUFFER_LOAD_DWORDX4_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 4)
%9:vgpr_32 = COPY %7.sub0:vreg_128
%10:vgpr_32 = COPY %7.sub1:vreg_128
%11:vgpr_32 = COPY %7.sub2:vreg_128
%12:vgpr_32 = COPY %7.sub3:vreg_128
$vgpr0 = COPY %9:vgpr_32
$vgpr1 = COPY %10:vgpr_32
$vgpr2 = COPY %11:vgpr_32
$vgpr3 = COPY %12:vgpr_32
SI_RETURN implicit $vgpr0, implicit $vgpr1, implicit $vgpr2, implicit $vgpr3
arsenm added a comment.Nov 7 2022, 9:44 AM

Without additional context these should be using waterfall loops, not readfirstlane

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
107

This isn't really specific to MUBUF instructions; it's any operand that has to be scalar. We have to waterfall calls as well

919

Avoid calling getRegClass multiple times

921

The MUBUF/MTBUF part isn't interesting, it's the operand being an SGPR and not trivially rewritable as vector

skc7 added a comment.Nov 8 2022, 12:46 AM

%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec ::

I need more context. Is %5 uniform?

I think that I've got an idea behind this patch. Let's say %5 is uniform. Then we've got to try to promote all the %8 descendants to SALU if possible.
In some cases, it appears that such a copy has few or even no SALU descendants, and according to the common algorithm should be converted to VALU.
When the conversion is done, legalizeOperands creates the waterfall loop which is obviously much worse than inserting the v_readfirstlane_b32.
As far as I understand, @skc7 addresses this scenario and aims to avoid an unnecessary waterfall loop.
BTW, if %5 is divergent we have a bug in ISel. We now should not have any V2S copy with the divergent source.

@alex-t @arsenm @foad This patch has been put up to fix the vgpr2sgpr copy instruction whose result is used as vgpr for srsrc/soffset of MUBUF/MTBUF. This is to fix the issue "Illegal virtual register for instruction". Shall I move to the initial version of the patch which just deals with vgpr2sgpr copy instruction? Legalizing the operands and coming to generic solution to fix such issues is beyond the scope of the patch as far as I understand.

foad added a comment.Nov 8 2022, 2:42 AM

BTW, if %5 is divergent we have a bug in ISel. We now should not have any V2S copy with the divergent source.

Look at the MIR that @skc7 quoted. %5 is divergent - it's copied from a vgpr function argument.

foad added a comment.Nov 9 2022, 6:45 AM

Hi @skc7, I've looked at the failure in your new test @llvm_amdgcn_raw_buffer_load_f32. I think this is a case that the AMDGPU backend has never handled properly, even before @alex-t rewrote SIFixSGPRCopies in D128252.

You have a BUFFER_LOAD_DWORD_OFFEN instruction with a divergent (vgpr) value for the $soffset operand. SIInstrInfo::legalizeOperands ought to fix this somehow, but it does not. See the FIXME here:

// Legalize MUBUF* instructions.
int RsrcIdx =
    AMDGPU::getNamedOperandIdx(MI.getOpcode(), AMDGPU::OpName::srsrc);
if (RsrcIdx != -1) {
  // We have an MUBUF instruction
  MachineOperand *Rsrc = &MI.getOperand(RsrcIdx);
  unsigned RsrcRC = get(MI.getOpcode()).OpInfo[RsrcIdx].RegClass;
  if (RI.getCommonSubClass(MRI.getRegClass(Rsrc->getReg()),
                           RI.getRegClass(RsrcRC))) {
    // The operands are legal.
    // FIXME: We may need to legalize operands besides srsrc.
    return CreatedBB;
  }

It only expects to legalize the $srsrc operand, which it will do by creating a waterfall loop. To legalize the $soffset operand it could either do something clever with addressing, like adding it into the $offset operand (buffer addressing is complicated so that might not be valid), or it could create another waterfall loop. Or it could use readfirstlane, which is a bit of a cop-out but would at least avoid crashing the compiler for now.

define <4 x i32> @extract0_bitcast_raw_buffer_load_v4i32(<4 x i32> inreg %rsrc, i32 %ofs, i32 %sofs) local_unnamed_addr #0 {
%var = tail call <4 x i32> @llvm.amdgcn.raw.buffer.load.v4i32(<4 x i32> %rsrc, i32 %ofs, i32 %sofs, i32 0)
ret <4 x i32> %var
}

IR dump after amdgpu-isel:

bb.0 (%ir-block.0):
liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
%5:vgpr_32 = COPY $vgpr5
%8:sreg_32 = COPY %5:vgpr_32
%7:vreg_128 = BUFFER_LOAD_DWORDX4_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 4)

Arguments of non-kernel function are divergent. So, $vgpr5 is divergent and %5 as well.

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
921

We are walking SALU def-use chain to compute the score to decide if it is profitable to insert v_readfirstlane_b32 or convert a copy and all the chain to VALU.
If we remove

(TII->isMUBUF(Opc) || TII->isMTBUF(Opc))

we will end up prohibiting any VALU conversion. Because all the SALU result registers are SGPRs and many of them have SGPR_128RegClass and SReg_32RegClass. We don't want to cut all of them.

The idea was that the exact opcodes are exceptional because following the common logic leads to inserting the waterfall loop that slows down the execution.

alex-t added a comment.EditedNov 10 2022, 10:15 AM

BTW, if %5 is divergent we have a bug in ISel. We now should not have any V2S copy with the divergent source.

Look at the MIR that @skc7 quoted. %5 is divergent - it's copied from a vgpr function argument.

The BUFFER_LOAD_DWORDX4_OFFEN is one of (as I remember correctly 5) the exceptional opcodes for which V2S copy is created even in case the copy source is divergent.
There is no bug in ISel. We have the value in VGPR because it is divergent and this is correct. The V2S copy is created in InstrEmitter just because the opcode requires SGPR.
We have yet several other such opcodes.

V_WRITELANE_B32, S_BUFFER_LOAD_DWORD_IMM, BUFFER_LOAD_FORMAT_X_OFFSET, BUFFER_LOAD_FORMAT_X_IDXEN, BUFFER_LOAD_FORMAT_X_OFFEN, BUFFER_LOAD_FORMAT_X_BOTHEN, IMAGE_SAMPLE_V1_V2
And this is really a TODO. For each of them, we should make a design and change legalizeOperand correspondingly.

skc7 added a comment.Nov 30 2022, 3:19 AM

BTW, if %5 is divergent we have a bug in ISel. We now should not have any V2S copy with the divergent source.

Look at the MIR that @skc7 quoted. %5 is divergent - it's copied from a vgpr function argument.

The BUFFER_LOAD_DWORDX4_OFFEN is one of (as I remember correctly 5) the exceptional opcodes for which V2S copy is created even in case the copy source is divergent.
There is no bug in ISel. We have the value in VGPR because it is divergent and this is correct. The V2S copy is created in InstrEmitter just because the opcode requires SGPR.
We have yet several other such opcodes.

V_WRITELANE_B32, S_BUFFER_LOAD_DWORD_IMM, BUFFER_LOAD_FORMAT_X_OFFSET, BUFFER_LOAD_FORMAT_X_IDXEN, BUFFER_LOAD_FORMAT_X_OFFEN, BUFFER_LOAD_FORMAT_X_BOTHEN, IMAGE_SAMPLE_V1_V2
And this is really a TODO. For each of them, we should make a design and change legalizeOperand correspondingly.

@alex-t @foad @arsenm As was suggested, legalizeOperand needs to be updated to support the mentioned opcodes which have similar issue. We will take it up as a parallel task and try to fix it.

But as a temporary work around, shall we update HasMBUFScalarReg to true, when a "V2S copy" is found and its result is used by MUBUF/MTBUF for scalar operands. HasMBUFScalarReg flag determines copy lowering way to v_readfirstlane_b32
This fixes the specific issue we encountered with BUFFER_LOAD_DWORDX4_OFFEN. "Illegal virtual register for instruction"

Register Reg = Inst->getOperand(0).getReg();
if (TRI->isSGPRReg(*MRI, Reg) && !TII->isVALU(*Inst)) {
for (auto &U : MRI->use_instructions(Reg)) {
   Users.push_back(&U);
   if (Inst->isCopy()) {
     unsigned Opc = U.getOpcode();
     if (Reg.isVirtual() &&
         (MRI->getRegClass(Reg) == &AMDGPU::SGPR_128RegClass ||
          MRI->getRegClass(Reg) == &AMDGPU::SReg_32RegClass) &&
         (TII->isMUBUF(Opc) || TII->isMTBUF(Opc))) {
       Info.HasMBUFScalarReg = true;
     }
   }
 }
}
skc7 updated this revision to Diff 481592.Dec 9 2022, 3:29 AM

Made changes to only identify copy and its result used by soffset of MUBUF/MTBUF. needToBeConvertedToVALU returns false if such pattern is found.
This also fixes vgpr-descriptor-waterfall-loop-idom-update.ll test, where the previous revision of the patch doesn't generate waterfall loop.

foad added a comment.Dec 9 2022, 8:35 AM

I think the best quick fix would be something like this in legalizeOperands, not changing SIFixSGPRCopies:

diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index c14b8df1f390..b79d343e0ed5 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -6005,6 +6005,12 @@ SIInstrInfo::legalizeOperands(MachineInstr &MI,
       AMDGPU::getNamedOperandIdx(MI.getOpcode(), AMDGPU::OpName::srsrc);
   if (RsrcIdx != -1) {
     // We have an MUBUF instruction
+    MachineOperand *SOff = getNamedOperand(MI, AMDGPU::OpName::soffset);
+    if (SOff->isReg() && !RI.isSGPRClass(MRI.getRegClass(SOff->getReg()))) {
+      Register SGPR = readlaneVGPRToSGPR(SOff->getReg(), MI, MRI);
+      SOff->setReg(SGPR);
+    }
+
     MachineOperand *Rsrc = &MI.getOperand(RsrcIdx);
     unsigned RsrcRC = get(MI.getOpcode()).OpInfo[RsrcIdx].RegClass;
     if (RI.getCommonSubClass(MRI.getRegClass(Rsrc->getReg()),
arsenm requested changes to this revision.Jul 18 2023, 5:02 AM

Can this be abandoned now?

This revision now requires changes to proceed.Jul 18 2023, 5:02 AM
skc7 abandoned this revision.Jul 19 2023, 10:28 PM