This is an archive of the discontinued LLVM Phabricator instance.


Differential D134423

[AMDGPU] Fix vgpr2sgpr copy analysis to check scalar operands of buffer instructions use scalar registers.
Abandoned · Public

Authored by skc7 on Sep 22 2022, 3:43 AM.


Details

Reviewers

arsenm
cdevadas
rampitec
alex-t
bcahoon

Summary

The si-fix-sgpr-copies pass lowers each VGPR-to-SGPR COPY either by moving its users to VALU or by inserting v_readfirstlane_b32. The choice is based on the SALU instructions that use the result of the COPY. This misses the case where a use of the COPY result must be a scalar register. For example, buffer instructions have scalar operands (srsrc, soffset) that only accept scalar registers.

This change lowers the vgpr2sgpr copies to use v_readfirstlane_b32, for scalar operands of MUBUF/MTBUF.
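As a rough illustration of the rule proposed in this summary, the sketch below models the decision outside of LLVM. The names `Use`, `Lowering`, `requiresScalarReg` and `chooseLowering` are hypothetical and exist only for this example; they are not part of SIFixSGPRCopies, and the operand-name check is an assumption about how such a test could be phrased.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one use of the COPY result (not an LLVM type).
struct Use {
  bool IsBufferInst;   // true if the user is a MUBUF/MTBUF instruction
  std::string OpName;  // named operand that reads the COPY result
};

enum class Lowering { MoveToVALU, ReadFirstLane };

// True when the use is a scalar-only operand of a buffer instruction.
bool requiresScalarReg(const Use &U) {
  return U.IsBufferInst && (U.OpName == "srsrc" || U.OpName == "soffset");
}

// HeuristicChoice is whatever the existing SALU-user analysis picked.
Lowering chooseLowering(const std::vector<Use> &Uses,
                        Lowering HeuristicChoice) {
  for (const Use &U : Uses)
    if (requiresScalarReg(U))
      return Lowering::ReadFirstLane;  // a VGPR would be illegal here
  return HeuristicChoice;
}
```

With a single use as `soffset` of a buffer instruction, the sketch returns `Lowering::ReadFirstLane` regardless of the heuristic's choice; with only vector operands it leaves the heuristic's choice unchanged.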

Diff Detail

Unit Tests: Failed

Time	Test
100 ms	x64 debian > Flang.Fir::boxproc.fir
1,250 ms	x64 debian > Flang.Fir::target-rewrite-arg-position.fir
50 ms	x64 debian > Flang.Fir::target-rewrite-boxchar.fir
30 ms	x64 debian > Flang.Fir::target-rewrite-char-proc.fir
50 ms	x64 debian > Flang.Fir::target.fir
Full test results: 12 failed.

Event Timeline

skc7 created this revision. Sep 22 2022, 3:43 AM

Herald added a project: Restricted Project. Sep 22 2022, 3:43 AM

Herald added subscribers: kosarev, foad, kerbowa and 8 others.

skc7 requested review of this revision. Sep 22 2022, 3:43 AM

Herald added a project: Restricted Project. Sep 22 2022, 3:43 AM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B188133: Diff 462123. Sep 22 2022, 3:43 AM

foad added a reviewer: alex-t. Sep 22 2022, 3:52 AM

arsenm added inline comments. Sep 22 2022, 5:55 AM

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
947	The pass is already doing a walk over use operands, you shouldn't need another use list walk
958	This is a property of specific operands

Use existing use-list walk to identify mubuf/mtbuf instructions. Delete copyResultUseNeedToBeSgpr method previously introduced. Changes as suggested by @arsenm review.

Harbormaster completed remote builds in B188442: Diff 462539. Sep 23 2022, 10:08 AM

foad added inline comments. Sep 23 2022, 11:01 AM

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
923	I don't understand why we need special cases for particular named operands. Why can't this be inferred from the operand's register class? @alex-t?

Check for use of the COPY result as a scalar operand (soffset or srsrc) in MUBUF/MTBUF instructions.

skc7 added inline comments. Sep 27 2022, 4:33 AM

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
923	soffset and srsrc need to be scalar registers in MUBUF/MTBUF instructions. The check for the use of the COPY's result is done here for these operands.

Harbormaster completed remote builds in B188915: Diff 463184. Sep 27 2022, 5:14 AM

Rebase.

Harbormaster completed remote builds in B189997: Diff 464708. Oct 3 2022, 10:03 AM

Rebase

Harbormaster completed remote builds in B194625: Diff 471121. Oct 27 2022, 6:10 AM

Rebase

Harbormaster completed remote builds in B194895: Diff 471490. Oct 28 2022, 5:03 AM

skc7 added inline comments. Oct 31 2022, 2:51 AM

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
947	Removed the previous extra use list walk.
958	soffset and srsrc operands are checked now for register use

Sorry for my tediousness but
I would like to see any inspirational reason for this change.

The change in the LIT test suggests one - we get rid of the waterfall loop, but I would like to see (if possible) a simpler case and verbal description regarding "why do we need this?"
It is a good idea just because not any SGPR use requires such special processing but the 128-bit SGPR in MUBUF instruction is legalized by creating the waterfall loop which affects performance. So, the justification here is necessary.

I also would like to see the test case where the VGPR to SGPR copy result has multiple uses.
It would be useful to check what happens if there are multiple VGPR uses besides the one that requires SGPR.

In general, the approach now looks like an attempt to hack into the concrete input pattern.

Although, I don't know other cases which add a significant penalty for moveToVALU. We should think of some unified per-opcode penalty interface if we have more cases.

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
923	The fact that the specific use requires SGPR surely can be checked for any use. What I don't like here is the hack forcing the decision to be made for v_readfirstlane. The algorithm that makes a decision is a complex tradeoff and we should not "tune" it in such a way for each particular case. I would consider the "SGPR required" as a weight value for the solver rather than the ultimate condition.

alex-t added a comment. Oct 31 2022, 11:26 AM

This comment was removed by alex-t.

alex-t added inline comments. Oct 31 2022, 4:32 PM

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
916	What is the reason for swapping these lines? If TRI->isSGPRReg(*MRI, Reg) && !TII->isVALU(*Inst) is not true, we don't need to process register users.
918	Inst is not necessarily a copy. You can have a MUBUF/MTBUF instruction terminating a SALU chain of arbitrary length. The V2S copy result may (for example) be used by some long arithmetic sequence and then as an operand of the REG_SEQUENCE that produces an SReg_128, which in turn is used as an operand of the MUBUF instruction.
922	Maybe just if (MRI->getRegClass(Reg) == &AMDGPU::SReg_128RegClass && (TII->isMUBUF(Opc) || TII->isMTBUF(Opc)))?

alex-t added inline comments. Oct 31 2022, 5:16 PM

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp

915

if (TRI->isSGPRReg(*MRI, Reg) && !TII->isVALU(*Inst)) {
  for (auto &U : MRI->use_instructions(Reg)) {
    unsigned Opc = U.getOpcode();
    if (MRI->getRegClass(Reg) == &AMDGPU::SGPR_128RegClass &&
        (TII->isMUBUF(Opc) || TII->isMTBUF(Opc))) {
      Info.HasMUBUFSGPR128 = true;
    }
    Users.push_back(&U);
  }
}
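To make the chain argument from the inline comments concrete, here is a minimal self-contained sketch of a def-use walk that reports whether a 128-bit scalar value somewhere on the chain ends up as an operand of a buffer instruction. `Inst` and `chainFeedsBufferSGPR128` are hypothetical names for this illustration; the real pass works on `MachineInstr` and `MachineRegisterInfo`, which this model does not attempt to reproduce.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of a machine instruction (not an LLVM type).
struct Inst {
  bool DefinesSGPR;                // result lives in a scalar register
  unsigned DefBits;                // width of that result in bits
  bool IsBuffer;                   // MUBUF/MTBUF instruction
  std::vector<std::size_t> Users;  // indices of instructions using the result
};

// Walks the scalar def-use chain that starts at Prog[Copy] and reports whether
// any 128-bit SGPR value on that chain is consumed by a buffer instruction.
bool chainFeedsBufferSGPR128(const std::vector<Inst> &Prog, std::size_t Copy) {
  std::vector<bool> Seen(Prog.size(), false);
  std::vector<std::size_t> Worklist(1, Copy);
  while (!Worklist.empty()) {
    std::size_t I = Worklist.back();
    Worklist.pop_back();
    assert(I < Prog.size() && "instruction index out of range");
    if (Seen[I])
      continue;
    Seen[I] = true;
    const Inst &Def = Prog[I];
    if (!Def.DefinesSGPR)
      continue;  // vector results end the scalar chain
    for (std::size_t U : Def.Users) {
      assert(U < Prog.size() && "user index out of range");
      if (Def.DefBits == 128 && Prog[U].IsBuffer)
        return true;
      Worklist.push_back(U);
    }
  }
  return false;
}
```

In this model a COPY followed by a scalar add, a REG_SEQUENCE producing 128 bits, and a buffer load is flagged even though the COPY itself is several steps away from the buffer instruction. A 32-bit value used directly by a buffer instruction is not flagged, which mirrors the SGPR_128-only condition in the suggestion above.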

Changes as suggested by @alex-t.

skc7 added inline comments. Nov 2 2022, 11:10 AM

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
916	Fixed the previous patch. Swapping the for and the if was my mistake. Also added a check for AMDGPU::SReg_32RegClass, which is used by soffset of MUBUF/MTBUF.
923	The HasMBUFScalarReg flag would be set to true for scalar operands of MUBUF/MTBUF. This would lower the copy to use v_readfirstlane_b32.

In D134423#3896343, @alex-t wrote:

Sorry for my tediousness but
I would like to see any inspirational reason for this change.

The change in the LIT test suggests one - we get rid of the waterfall loop, but I would like to see (if possible) a simpler case and verbal description regarding "why do we need this?"
It is a good idea just because not any SGPR use requires such special processing but the 128-bit SGPR in MUBUF instruction is legalized by creating the waterfall loop which affects performance. So, the justification here is necessary.

I also would like to see the test case where the VGPR to SGPR copy result has multiple uses.
It would be useful to check what happens if there are multiple VGPR uses besides the one that requires SGPR.

In general, the approach now looks like an attempt to hack into the concrete input pattern.

Although, I don't know other cases which add a significant penalty for moveToVALU. We should think of some unified per-opcode penalty interface if we have more cases.

%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec ::

si-fix-sgpr-copies pass converts this as below

%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %5:vgpr_32, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)

soffset, which is supposed to be a scalar register, has been converted to use a VGPR. This fails in the backend with the error "Illegal virtual register for instruction".

With this patch I'm trying to fix this issue by checking that scalar registers are used for soffset and srsrc of MUBUF/MTBUF.

Harbormaster completed remote builds in B195763: Diff 472699. Nov 2 2022, 12:42 PM

alex-t added inline comments. Nov 2 2022, 3:10 PM

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
941	Setting Info->Score to 0 means that it NEEDS to be VALU! Setting it to zero if SChain is empty means that V2S copy has no SALU descendants and definitely needs to be VALU. So early returns to avoid the rest of the computations. In your case, you return false specifically to indicate that this particular V2S copy has in its SALU chain MUBUF that requires SGPR and it should be NEVER converted to VALU even if its chain is short or empty.
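The return-value convention described in this comment can be sketched as follows. This is a simplified model, not the function from the patch: `CopyInfo` is a hypothetical trimmed-down stand-in for `V2SCopyInfo`, and the `Penalty` and `Threshold` parameters are assumptions standing in for the pass's actual scoring.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, trimmed-down copy descriptor for this illustration only.
struct CopyInfo {
  std::size_t SChainSize;  // number of SALU instructions reachable from the copy
  bool HasMBUFScalarReg;   // flag proposed in this revision
  unsigned Score;          // 0 means "needs to be VALU"
};

// Returns true when the copy and its chain should be moved to VALU.
bool needToBeConvertedToVALU(CopyInfo &Info, unsigned Penalty,
                             unsigned Threshold) {
  if (Info.HasMBUFScalarReg)
    return false;  // keep the copy scalar; Score is deliberately left alone
  if (Info.SChainSize == 0) {
    Info.Score = 0;  // no SALU descendants: definitely VALU
    return true;
  }
  // Assumed scoring: profit is the chain length, reduced by a penalty.
  const unsigned Profit = static_cast<unsigned>(Info.SChainSize);
  Info.Score = Penalty > Profit ? 0 : Profit - Penalty;
  return Info.Score < Threshold;
}
```

The point of the early `return false` is that it bypasses the empty-chain case: a flagged copy with no SALU descendants is still kept scalar, and its score is not forced to zero.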

foad added inline comments. Nov 3 2022, 12:12 AM

llvm/test/CodeGen/AMDGPU/vgpr-descriptor-waterfall-loop-idom-update.ll
5 ↗	(On Diff #472699)	This test is supposed to generate a waterfall loop, because it is using a non-uniform descriptor in a buffer load. Your patch changes it so that it does not generate a waterfall loop. I don't think that is correct.

Rebase. Fix for failing tests.

skc7 added inline comments. Nov 3 2022, 3:12 AM

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
941	Since needToBeConvertedToVALU returns without any score calculation in this case, I assumed I need to make the score zero and return. Thanks for helping with this.

Harbormaster completed remote builds in B195897: Diff 472885. Nov 3 2022, 3:52 AM

What is the motivation for this patch? Is there a bug? Or an existing test that you are trying to generate better code for?

This is a complicated area because in some cases SIFixSGPRCopies has to know what legalizeOperands is going to do, and legalizeOperands can do complicated things like introducing waterfall loops. In the long term it would be better if legalizeOperands stopped doing that and SelectionDAG produced code that was already correct without legalization (e.g. by generating waterfall loops during selection, or even in CodeGenPrepare if we come up with a way to express them in IR). On the other hand, in the long term, it would be nice to abandon SelectionDAG in favour of GlobalISel.

It is decided based on the SALU instructions users of result of COPY.

This is OK because we (mostly @alex-t!) have put a lot of effort into ensuring that SALU instructions are only selected for uniform operations.

It misses the case where the use of result of COPY need to be scalar register only. Example: In buffer instructions, there are scalar operands (srsrc, sOffset) which will only accept scalar registers.

This is probably not OK because SelectionDAG still selects buffer instruction where the "scalar operand" is actually a divergent value. It relies on legalizeOperands to fix this by inserting a waterfall loop.

In D134423#3911623, @foad wrote:

What is the motivation for this patch? Is there a bug? Or an existing test that you are trying to generate better code for?

This is a complicated area because in some cases SIFixSGPRCopies has to know what legalizeOperands is going to do, and legalizeOperands can do complicated things like introducing waterfall loops. In the long term it would be better if legalizeOperands stopped doing that and SelectionDAG produced code that was already correct without legalization (e.g. by generating waterfall loops during selection, or even in CodeGenPrepare if we come up with a way to express them in IR). On the other hand, in the long term, it would be nice to abandon SelectionDAG in favour of GlobalISel.

It is decided based on the SALU instructions users of result of COPY.

This is OK because we (mostly @alex-t!) have put a lot of effort into ensuring that SALU instructions are only selected for uniform operations.

It misses the case where the use of result of COPY need to be scalar register only. Example: In buffer instructions, there are scalar operands (srsrc, sOffset) which will only accept scalar registers.

This is probably not OK because SelectionDAG still selects buffer instruction where the "scalar operand" is actually a divergent value. It relies on legalizeOperands to fix this by inserting a waterfall loop.

%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec ::

si-fix-sgpr-copies pass converts this as below:

%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %5:vgpr_32, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s32), align 1, addrspace 4)

soffset, which is supposed to be a scalar register, has been converted to use a VGPR. This fails in the backend with the error "Illegal virtual register for instruction".

Currently, we are seeing such VGPR to SGPR copies, and the SIFixSGPRCopies pass ends up using a VGPR for soffset of MUBUF.
I'm trying to fix this issue by checking that scalar registers are used for soffset and srsrc of MUBUF/MTBUF. True, if SelectionDAG already produced code that was legal, no such fix would be required.

In D134423#3911726, @skc7 wrote:

%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec ::

I need more context. Is %5 uniform?

In D134423#3911947, @foad wrote:

In D134423#3911726, @skc7 wrote:

%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec ::

I need more context. Is %5 uniform?

I think that I've got an idea behind this patch. Let's say %5 is uniform. Then we've got to try to promote all the %8 descendants to SALU if possible.
In some cases, it appears that such a copy has few or even no SALU descendants, and according to the common algorithm should be converted to VALU.
When the conversion is done, legalizeOperands creates the waterfall loop which is obviously much worse than inserting the v_readfirstlane_b32.
As far as I understand, @skc7 addresses this scenario and aims to avoid an unnecessary waterfall loop.
BTW, if %5 is divergent we have a bug in ISel. We now should not have any V2S copy with the divergent source.

In D134423#3911947, @foad wrote:

In D134423#3911726, @skc7 wrote:

%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec ::

I need more context. Is %5 uniform?

define <4 x i32> @extract0_bitcast_raw_buffer_load_v4i32(<4 x i32> inreg %rsrc, i32 %ofs, i32 %sofs) local_unnamed_addr #0 {
%var = tail call <4 x i32> @llvm.amdgcn.raw.buffer.load.v4i32(<4 x i32> %rsrc, i32 %ofs, i32 %sofs, i32 0)
ret <4 x i32> %var
}

IR dump after amdgpu-isel:

bb.0 (%ir-block.0):
liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
%5:vgpr_32 = COPY $vgpr5
%4:vgpr_32 = COPY $vgpr4
%3:vgpr_32 = COPY $vgpr3
%2:vgpr_32 = COPY $vgpr2
%1:vgpr_32 = COPY $vgpr1
%0:vgpr_32 = COPY $vgpr0
%6:sgpr_128 = REG_SEQUENCE %0:vgpr_32, %subreg.sub0, %1:vgpr_32, %subreg.sub1, %2:vgpr_32, %subreg.sub2, %3:vgpr_32, %subreg.sub3
%8:sreg_32 = COPY %5:vgpr_32
%7:vreg_128 = BUFFER_LOAD_DWORDX4_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 4)
%9:vgpr_32 = COPY %7.sub0:vreg_128
%10:vgpr_32 = COPY %7.sub1:vreg_128
%11:vgpr_32 = COPY %7.sub2:vreg_128
%12:vgpr_32 = COPY %7.sub3:vreg_128
$vgpr0 = COPY %9:vgpr_32
$vgpr1 = COPY %10:vgpr_32
$vgpr2 = COPY %11:vgpr_32
$vgpr3 = COPY %12:vgpr_32
SI_RETURN implicit $vgpr0, implicit $vgpr1, implicit $vgpr2, implicit $vgpr3

Without additional context these should be using waterfall loops, not readfirstlane

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
107	This isn't really specific to MUBUF instructions; it's any operand that has to be scalar. We have to waterfall calls as well
919	Avoid calling getRegClass multiple times
921	The MUBUF/MTBUF part isn't interesting, it's the operand being an SGPR and not trivially rewritable as vector

In D134423#3912577, @alex-t wrote:

In D134423#3911947, @foad wrote:

In D134423#3911726, @skc7 wrote:

%8:sreg_32 = COPY %5:vgpr_32
%7:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec ::

I need more context. Is %5 uniform?

I think that I've got an idea behind this patch. Let's say %5 is uniform. Then we've got to try to promote all the %8 descendants to SALU if possible.
In some cases, it appears that such a copy has few or even no SALU descendants, and according to the common algorithm should be converted to VALU.
When the conversion is done, legalizeOperands creates the waterfall loop which is obviously much worse than inserting the v_readfirstlane_b32.
As far as I understand, @skc7 addresses this scenario and aims to avoid an unnecessary waterfall loop.
BTW, if %5 is divergent we have a bug in ISel. We now should not have any V2S copy with the divergent source.

@alex-t @arsenm @foad This patch has been put up to fix the vgpr2sgpr copy instruction whose result is used as vgpr for srsrc/soffset of MUBUF/MTBUF. This is to fix the issue "Illegal virtual register for instruction". Shall I move to the initial version of the patch which just deals with vgpr2sgpr copy instruction? Legalizing the operands and coming to generic solution to fix such issues is beyond the scope of the patch as far as I understand.

In D134423#3912577, @alex-t wrote:

BTW, if %5 is divergent we have a bug in ISel. We now should not have any V2S copy with the divergent source.

Look at the MIR that @skc7 quoted. %5 is divergent - it's copied from a vgpr function argument.

Hi @skc7, I've looked at the failure in your new test @llvm_amdgcn_raw_buffer_load_f32. I think this is a case that the AMDGPU backend has never handled properly, even before @alex-t rewrote SIFixSGPRCopies in D128252.

You have a BUFFER_LOAD_DWORD_OFFEN instruction with a divergent (vgpr) value for the $soffset operand. SIInstrInfo::legalizeOperands ought to fix this somehow, but it does not. See the FIXME here:

// Legalize MUBUF* instructions.
int RsrcIdx =
    AMDGPU::getNamedOperandIdx(MI.getOpcode(), AMDGPU::OpName::srsrc);
if (RsrcIdx != -1) {
  // We have an MUBUF instruction
  MachineOperand *Rsrc = &MI.getOperand(RsrcIdx);
  unsigned RsrcRC = get(MI.getOpcode()).OpInfo[RsrcIdx].RegClass;
  if (RI.getCommonSubClass(MRI.getRegClass(Rsrc->getReg()),
                           RI.getRegClass(RsrcRC))) {
    // The operands are legal.
    // FIXME: We may need to legalize operands besides srsrc.
    return CreatedBB;
  }

It only expects to legalize the $srsrc operand, which it will do by creating a waterfall loop. To legalize the $soffset operand it could either do something clever with addressing, like adding it into the $offset operand (buffer addressing is complicated so that might not be valid), or it could create another waterfall loop. Or it could use readfirstlane, which is a bit of a cop-out but would at least avoid crashing the compiler for now.
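The difference between the two legalization options can be seen in a small lane-level simulation. This is only a model of the semantics under simple assumptions (a wave is a vector of lane values plus an execution mask, and at least one lane is active); `Wave`, `readFirstLane` and `waterfall` are hypothetical names, not backend functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One wavefront: per-lane values of a VGPR plus an execution mask.
struct Wave {
  std::vector<std::uint32_t> Lanes;  // value held by each lane
  std::vector<bool> Exec;            // true if the lane is active
};

bool anyActive(const Wave &W) {
  for (std::size_t L = 0; L < W.Exec.size(); ++L)
    if (W.Exec[L])
      return true;
  return false;
}

// Models v_readfirstlane_b32: the value held by the first active lane.
std::uint32_t readFirstLane(const Wave &W) {
  assert(W.Lanes.size() == W.Exec.size() && anyActive(W));
  for (std::size_t L = 0; L < W.Lanes.size(); ++L)
    if (W.Exec[L])
      return W.Lanes[L];
  return 0;  // unreachable given the precondition
}

// Models a waterfall loop: each iteration serves all lanes that share the
// first active lane's value, then disables them. Returns the scalar value
// each lane was served with and counts the iterations.
std::vector<std::uint32_t> waterfall(Wave W, unsigned &Iterations) {
  std::vector<std::uint32_t> Used(W.Lanes.size(), 0);
  Iterations = 0;
  while (anyActive(W)) {
    const std::uint32_t S = readFirstLane(W);
    for (std::size_t L = 0; L < W.Lanes.size(); ++L) {
      if (W.Exec[L] && W.Lanes[L] == S) {
        Used[L] = S;
        W.Exec[L] = false;
      }
    }
    ++Iterations;
  }
  return Used;
}
```

For lane values {5, 7, 5, 9} a single readfirstlane yields 5 for every lane, which is wrong for two of them, while the waterfall model serves every lane with its own value at the cost of three iterations. For a uniform value both agree and the loop runs once, which is why readfirstlane is only safe when the source is known to be uniform.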

define <4 x i32> @extract0_bitcast_raw_buffer_load_v4i32(<4 x i32> inreg %rsrc, i32 %ofs, i32 %sofs) local_unnamed_addr #0 {
%var = tail call <4 x i32> @llvm.amdgcn.raw.buffer.load.v4i32(<4 x i32> %rsrc, i32 %ofs, i32 %sofs, i32 0)
ret <4 x i32> %var
}

IR dump after amdgpu-isel:

bb.0 (%ir-block.0):
liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5

%5:vgpr_32 = COPY $vgpr5

%8:sreg_32 = COPY %5:vgpr_32

%7:vreg_128 = BUFFER_LOAD_DWORDX4_OFFEN %4:vgpr_32, killed %6:sgpr_128, %8:sreg_32, 0, 0, 0, 0, implicit $exec :: (dereferenceable load (s128), align 1, addrspace 4)

Arguments of non-kernel function are divergent. So, $vgpr5 is divergent and %5 as well.

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
921	We are walking the SALU def-use chain to compute the score that decides whether it is profitable to insert v_readfirstlane_b32 or to convert the copy and its whole chain to VALU. If we remove (TII->isMUBUF(Opc) || TII->isMTBUF(Opc)), we will end up prohibiting any VALU conversion, because all the SALU result registers are SGPRs and many of them have SGPR_128RegClass or SReg_32RegClass. We don't want to cut all of them. The idea was that these exact opcodes are exceptional because following the common logic leads to inserting the waterfall loop, which slows down execution.

In D134423#3914502, @foad wrote:

In D134423#3912577, @alex-t wrote:

BTW, if %5 is divergent we have a bug in ISel. We now should not have any V2S copy with the divergent source.

Look at the MIR that @skc7 quoted. %5 is divergent - it's copied from a vgpr function argument.

The BUFFER_LOAD_DWORDX4_OFFEN is one of (as I remember correctly 5) the exceptional opcodes for which V2S copy is created even in case the copy source is divergent.
There is no bug in ISel. We have the value in VGPR because it is divergent and this is correct. The V2S copy is created in InstrEmitter just because the opcode requires SGPR.
We have yet several other such opcodes.

V_WRITELANE_B32, S_BUFFER_LOAD_DWORD_IMM, BUFFER_LOAD_FORMAT_X_OFFSET, BUFFER_LOAD_FORMAT_X_IDXEN, BUFFER_LOAD_FORMAT_X_OFFEN, BUFFER_LOAD_FORMAT_X_BOTHEN, IMAGE_SAMPLE_V1_V2
And this is really a TODO. For each of them, we should make a design and change legalizeOperand correspondingly.

In D134423#3919858, @alex-t wrote:

In D134423#3914502, @foad wrote:

In D134423#3912577, @alex-t wrote:

BTW, if %5 is divergent we have a bug in ISel. We now should not have any V2S copy with the divergent source.

Look at the MIR that @skc7 quoted. %5 is divergent - it's copied from a vgpr function argument.

The BUFFER_LOAD_DWORDX4_OFFEN is one of (as I remember correctly 5) the exceptional opcodes for which V2S copy is created even in case the copy source is divergent.
There is no bug in ISel. We have the value in VGPR because it is divergent and this is correct. The V2S copy is created in InstrEmitter just because the opcode requires SGPR.
We have yet several other such opcodes.

V_WRITELANE_B32, S_BUFFER_LOAD_DWORD_IMM, BUFFER_LOAD_FORMAT_X_OFFSET, BUFFER_LOAD_FORMAT_X_IDXEN, BUFFER_LOAD_FORMAT_X_OFFEN, BUFFER_LOAD_FORMAT_X_BOTHEN, IMAGE_SAMPLE_V1_V2
And this is really a TODO. For each of them, we should make a design and change legalizeOperand correspondingly.

@alex-t @foad @arsenm As was suggested, legalizeOperand needs to be updated to support the mentioned opcodes which have similar issue. We will take it up as a parallel task and try to fix it.

But as a temporary workaround, shall we set HasMBUFScalarReg to true when a "V2S copy" is found and its result is used by MUBUF/MTBUF for scalar operands? The HasMBUFScalarReg flag makes the copy be lowered to v_readfirstlane_b32.
This fixes the specific issue we encountered with BUFFER_LOAD_DWORDX4_OFFEN: "Illegal virtual register for instruction".

Register Reg = Inst->getOperand(0).getReg();
if (TRI->isSGPRReg(*MRI, Reg) && !TII->isVALU(*Inst)) {
  for (auto &U : MRI->use_instructions(Reg)) {
    Users.push_back(&U);
    if (Inst->isCopy()) {
      unsigned Opc = U.getOpcode();
      if (Reg.isVirtual() &&
          (MRI->getRegClass(Reg) == &AMDGPU::SGPR_128RegClass ||
           MRI->getRegClass(Reg) == &AMDGPU::SReg_32RegClass) &&
          (TII->isMUBUF(Opc) || TII->isMTBUF(Opc))) {
        Info.HasMBUFScalarReg = true;
      }
    }
  }
}

Made changes to only identify copy and its result used by soffset of MUBUF/MTBUF. needToBeConvertedToVALU returns false if such pattern is found.
This also fixes vgpr-descriptor-waterfall-loop-idom-update.ll test, where the previous revision of the patch doesn't generate waterfall loop.

Harbormaster completed remote builds in B202194: Diff 481592. Dec 9 2022, 4:46 AM

I think the best quick fix would be something like this in legalizeOperands, not changing SIFixSGPRCopies:

diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index c14b8df1f390..b79d343e0ed5 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -6005,6 +6005,12 @@ SIInstrInfo::legalizeOperands(MachineInstr &MI,
       AMDGPU::getNamedOperandIdx(MI.getOpcode(), AMDGPU::OpName::srsrc);
   if (RsrcIdx != -1) {
     // We have an MUBUF instruction
+    MachineOperand *SOff = getNamedOperand(MI, AMDGPU::OpName::soffset);
+    if (SOff->isReg() && !RI.isSGPRClass(MRI.getRegClass(SOff->getReg()))) {
+      Register SGPR = readlaneVGPRToSGPR(SOff->getReg(), MI, MRI);
+      SOff->setReg(SGPR);
+    }
+
     MachineOperand *Rsrc = &MI.getOperand(RsrcIdx);
     unsigned RsrcRC = get(MI.getOpcode()).OpInfo[RsrcIdx].RegClass;
     if (RI.getCommonSubClass(MRI.getRegClass(Rsrc->getReg()),
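The shape of this quick fix can be modelled without LLVM as below. Everything here is hypothetical scaffolding for illustration: `MiniFunc`, `Operand`, `insertReadFirstLane` and `legalizeSOffset` are invented names, and the null check on the operand is an assumption of this sketch (the posted diff dereferences `SOff` directly).

```cpp
#include <cassert>
#include <vector>

enum class RegClass { SGPR, VGPR };

// Hypothetical miniature function: register classes indexed by virtual
// register number, plus a count of inserted readfirstlane instructions.
struct MiniFunc {
  std::vector<RegClass> Classes;
  unsigned NumReadFirstLanes;
};

struct Operand {
  bool IsReg;
  unsigned Reg;
};

// Stand-in for inserting a v_readfirstlane_b32: creates a fresh SGPR that
// holds the first active lane's value of VReg and returns its number.
unsigned insertReadFirstLane(MiniFunc &F, unsigned VReg) {
  assert(VReg < F.Classes.size() && F.Classes[VReg] == RegClass::VGPR);
  F.Classes.push_back(RegClass::SGPR);
  ++F.NumReadFirstLanes;
  return static_cast<unsigned>(F.Classes.size() - 1);
}

// Rewrites soffset in place when it is a register outside the SGPR class.
void legalizeSOffset(MiniFunc &F, Operand *SOff) {
  if (!SOff || !SOff->IsReg)
    return;  // no soffset operand, or an immediate: nothing to do
  assert(SOff->Reg < F.Classes.size());
  if (F.Classes[SOff->Reg] != RegClass::SGPR)
    SOff->Reg = insertReadFirstLane(F, SOff->Reg);
}
```

An operand that is already scalar is left untouched, so the fix only costs an extra instruction in the case that previously produced the "Illegal virtual register for instruction" failure.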

Can this be abandoned now?

This revision now requires changes to proceed. Jul 18 2023, 5:02 AM

skc7 abandoned this revision. Jul 19 2023, 10:28 PM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp	16 lines
llvm/test/CodeGen/AMDGPU/si-fix-sgpr-copies-buf.ll	290 lines

Diff 481592

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp

Show First 20 Lines • Show All 95 Lines • ▼ Show 20 Lines	public:
unsigned Score;		unsigned Score;
// Actual count of v_readfirstlane_b32		// Actual count of v_readfirstlane_b32
// which need to be inserted to keep SChain SALU		// which need to be inserted to keep SChain SALU
unsigned NumReadfirstlanes;		unsigned NumReadfirstlanes;
// Current score state. To speedup selection V2SCopyInfos for processing		// Current score state. To speedup selection V2SCopyInfos for processing
bool NeedToBeConvertedToVALU = false;		bool NeedToBeConvertedToVALU = false;
// Unique ID. Used as a key for mapping to keep permanent order.		// Unique ID. Used as a key for mapping to keep permanent order.
unsigned ID;		unsigned ID;
		// Flag to check if MUBUF/MTBUF needs scalar register.
		bool HasMBUFScalarReg = false;

// Count of another VGPR to SGPR copies that contribute to the		// Count of another VGPR to SGPR copies that contribute to the
		arsenmUnsubmitted Not Done Reply Inline Actions This isn't really specific to MUBUF instructions; it's any operand that has to be scalar. We have to waterfall calls as well arsenm: This isn't really specific to MUBUF instructions; it's any operand that has to be scalar. We…
// current copy SChain		// current copy SChain
unsigned SiblingPenalty = 0;		unsigned SiblingPenalty = 0;
SetVector<unsigned> Siblings;		SetVector<unsigned> Siblings;
V2SCopyInfo() : Copy(nullptr), ID(0){};		V2SCopyInfo() : Copy(nullptr), ID(0){};
V2SCopyInfo(unsigned Id, MachineInstr *C, unsigned Width)		V2SCopyInfo(unsigned Id, MachineInstr *C, unsigned Width)
: Copy(C), NumSVCopies(0), NumReadfirstlanes(Width / 32), ID(Id){};		: Copy(C), NumSVCopies(0), NumReadfirstlanes(Width / 32), ID(Id){};
#if !defined(NDEBUG) \|\| defined(LLVM_ENABLE_DUMP)		#if !defined(NDEBUG) \|\| defined(LLVM_ENABLE_DUMP)
void dump() {		void dump() {
▲ Show 20 Lines • Show All 791 Lines • ▼ Show 20 Lines	if ((TII->isSALU(*Inst) && Inst->isCompare()) \|\|
auto E = Inst->getParent()->end();		auto E = Inst->getParent()->end();
while (++I != E && !I->findRegisterDefOperand(AMDGPU::SCC)) {		while (++I != E && !I->findRegisterDefOperand(AMDGPU::SCC)) {
if (I->readsRegister(AMDGPU::SCC))		if (I->readsRegister(AMDGPU::SCC))
Users.push_back(&*I);		Users.push_back(&*I);
}		}
} else if (Inst->getNumExplicitDefs() != 0) {		} else if (Inst->getNumExplicitDefs() != 0) {
Register Reg = Inst->getOperand(0).getReg();		Register Reg = Inst->getOperand(0).getReg();
if (TRI->isSGPRReg(MRI, Reg) && !TII->isVALU(Inst))		if (TRI->isSGPRReg(MRI, Reg) && !TII->isVALU(Inst))
for (auto &U : MRI->use_instructions(Reg))		for (auto &U : MRI->use_instructions(Reg)) {
		alex-tUnsubmitted Not Done Reply Inline Actions if (TRI->isSGPRReg(MRI, Reg) && !TII->isVALU(Inst)) { for (auto &U : MRI->use_instructions(Reg)) { unsigned Opc = U.getOpcode(); if (MRI->getRegClass(Reg) == &AMDGPU::SGPR_128RegClass && (TII->isMUBUF(Opc) \|\| TII->isMTBUF(Opc))) { Info.HasMUBUFSGPR128 = true; } Users.push_back(&U); } } alex-t: if (TRI->isSGPRReg(MRI, Reg) && !TII->isVALU(Inst)) { for (auto &U : MRI…
Users.push_back(&U);		Users.push_back(&U);
		alex-tUnsubmitted Not Done Reply Inline Actions What is the reason for swapping these lines? if this TRI->isSGPRReg(MRI, Reg) && !TII->isVALU(Inst) is not true we don't need to process register users. alex-t: What is the reason for swapping these lines? if this ``` TRI->isSGPRReg(*MRI, Reg) && !TII…
		skc7AuthorUnsubmitted Done Reply Inline Actions Fixed previous patch. Swap for and if was my mistake. Also added check for AMDGPU::SReg_32RegClass used by soffset of MUBUF/MTBUF. skc7: Fixed previous patch. Swap for and if was my mistake. Also added check for AMDGPU…
		if (Inst->isCopy()) {
		unsigned Opc = U.getOpcode();
		alex-tUnsubmitted Not Done Reply Inline Actions Inst not necessarily is a copy. You can have MUBUF/MTBUF instruction terminating the SALU chain of arbitrary length. The V2S copy result may be (for example) used by some long arithmetic sequence and then used as an operand of the REG_SEQUENCE which produces SReg_128, which in order used as an operand in MUBUF instruction. alex-t: Inst not necessarily is a copy. You can have MUBUF/MTBUF instruction terminating the SALU chain…
		if (Reg.isVirtual() &&
		arsenmUnsubmitted Not Done Reply Inline Actions Avoid calling getRegClass multiple times arsenm: Avoid calling getRegClass multiple times
		(MRI->getRegClass(Reg) == &AMDGPU::SReg_32RegClass) &&
		(TII->isMUBUF(Opc) \|\| TII->isMTBUF(Opc))) {
		arsenmUnsubmitted Not Done Reply Inline Actions The MUBUF/MTBUF part isn't interesting, it's the operand being an SGPR and not trivially rewritable as vector arsenm: The MUBUF/MTBUF part isn't interesting, it's the operand being an SGPR and not trivially…
		alex-tUnsubmitted Not Done Reply Inline Actions We are walking SALU def-use chain to compute the score to decide if it is profitable to insert v_readfirstlane_b32 or convert a copy and all the chain to VALU. If we remove (TII->isMUBUF(Opc) \|\| TII->isMTBUF(Opc)) we will end up prohibiting any VALU conversion. Because all the SALU result registers are SGPRs and many of them have SGPR_128RegClass and SReg_32RegClass. We don't want to cut all of them. The idea was that the exact opcodes are exceptional because following the common logic leads to inserting the waterfall loop that slows down the execution. alex-t: We are walking SALU def-use chain to compute the score to decide if it is profitable to insert…
		Info.HasMBUFScalarReg = true;
		alex-tUnsubmitted Not Done Reply Inline Actions Maybe just if (MRI->getRegClass(Reg) == &AMDGPU::SReg_128RegClass && (TII->isMUBUF(Opc) \|\| TII->isMTBUF(Opc)) ? alex-t: Maybe just ``` if (MRI->getRegClass(Reg) == &AMDGPU::SReg_128RegClass &&…
		}
		foadUnsubmitted Not Done Reply Inline Actions I don't understand why we need special cases for particular named operands. Why can't this be inferred from the operand's register class? @alex-t? foad: I don't understand why we need special cases for particular named operands. Why can't this be…
		skc7AuthorUnsubmitted Done Reply Inline Actions soffset and srsrc need to be scalar registers in mubuf/mtbuf instructions.. Check for use of COPY's result is done here for these operands. skc7: soffset and srsrc need to be scalar registers in mubuf/mtbuf instructions.. Check for use of…
		alex-tUnsubmitted Not Done Reply Inline Actions The fact that the specific use requires SGPR surely can be checked for any use. What I don't like here is the hack forcing the decision to be made for v_readfistlane. The algorithm that makes a decision is a complex tradeoff and we should not "tune" it in such a way for each particular case. I would consider the "SGPR required" as a weight value for the solver rather than the ultimate condition. alex-t: The fact that the specific use requires SGPR surely can be checked for any use. What I don't…
		skc7AuthorUnsubmitted Done Reply Inline Actions HasMBUFScalarReg flag would be set to true for scalar operands of MUBUF/MTBUF. This would lower copy to use v_readfirstlane_b32. skc7: HasMBUFScalarReg flag would be set to true for scalar operands of MUBUF/MTBUF. This would lower…
		}
		}
}		}
for (auto U : Users) {		for (auto U : Users) {
if (TII->isSALU(*U))		if (TII->isSALU(*U))
Info.SChain.insert(U);		Info.SChain.insert(U);
AnalysisWorklist.push_back(U);		AnalysisWorklist.push_back(U);
}		}
}		}
V2SCopies[Info.ID] = Info;		V2SCopies[Info.ID] = Info;
}		}

// The main function that computes the VGPR to SGPR copy score		// The main function that computes the VGPR to SGPR copy score
// and determines copy further lowering way: v_readfirstlane_b32 or moveToVALU		// and determines copy further lowering way: v_readfirstlane_b32 or moveToVALU
bool SIFixSGPRCopies::needToBeConvertedToVALU(V2SCopyInfo *Info) {		bool SIFixSGPRCopies::needToBeConvertedToVALU(V2SCopyInfo *Info) {
		if (Info->HasMBUFScalarReg) {
		return false;
		}
		alex-t (not done): Setting Info->Score to 0 means that it NEEDS to be VALU! Setting it to zero when SChain is empty means that the V2S copy has no SALU descendants and definitely needs to be VALU, so the function returns early to avoid the rest of the computation. In your case, you return false specifically to indicate that this particular V2S copy has a MUBUF in its SALU chain that requires an SGPR, and that it should NEVER be converted to VALU even if its chain is short or empty.
		skc7 (author, done): Since needToBeConvertedToVALU returns without any score calculation in this case, I assumed I needed to set the score to zero and return. Thanks for helping with this.
if (Info->SChain.empty()) {		if (Info->SChain.empty()) {
Info->Score = 0;		Info->Score = 0;
return true;		return true;
}		}
Info->Siblings = SiblingPenalty[*std::max_element(		Info->Siblings = SiblingPenalty[*std::max_element(
Info->SChain.begin(), Info->SChain.end(),		Info->SChain.begin(), Info->SChain.end(),
		arsenm (not done): The pass is already doing a walk over use operands; you shouldn't need another use-list walk.
		skc7 (author, done): Removed the previous extra use-list walk.
[&](MachineInstr *A, MachineInstr *B) -> bool {		[&](MachineInstr *A, MachineInstr *B) -> bool {
return SiblingPenalty[A].size() < SiblingPenalty[B].size();		return SiblingPenalty[A].size() < SiblingPenalty[B].size();
})];		})];
Info->Siblings.remove_if([&](unsigned ID) { return ID == Info->ID; });		Info->Siblings.remove_if([&](unsigned ID) { return ID == Info->ID; });
// The loop below computes the number of another VGPR to SGPR V2SCopies		// The loop below computes the number of another VGPR to SGPR V2SCopies
// which contribute to the current copy SALU chain. We assume that all the		// which contribute to the current copy SALU chain. We assume that all the
// V2SCopies with the same source virtual register will be squashed to one		// V2SCopies with the same source virtual register will be squashed to one
// by regalloc. Also we take care of the V2SCopies of the differnt subregs		// by regalloc. Also we take care of the V2SCopies of the differnt subregs
// of the same register.		// of the same register.
SmallSet<std::pair<Register, unsigned>, 4> SrcRegs;		SmallSet<std::pair<Register, unsigned>, 4> SrcRegs;
for (auto J : Info->Siblings) {		for (auto J : Info->Siblings) {
		arsenm (not done): This is a property of specific operands.
		skc7 (author, done): The soffset and srsrc operands are now checked for register use.
auto InfoIt = V2SCopies.find(J);		auto InfoIt = V2SCopies.find(J);
if (InfoIt != V2SCopies.end()) {		if (InfoIt != V2SCopies.end()) {
MachineInstr *SiblingCopy = InfoIt->getSecond().Copy;		MachineInstr *SiblingCopy = InfoIt->getSecond().Copy;
if (SiblingCopy->isImplicitDef())		if (SiblingCopy->isImplicitDef())
// the COPY has already been MoveToVALUed		// the COPY has already been MoveToVALUed
continue;		continue;

SrcRegs.insert(std::make_pair(SiblingCopy->getOperand(1).getReg(),		SrcRegs.insert(std::make_pair(SiblingCopy->getOperand(1).getReg(),
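The disagreement in the inline comments above (a hard early exit for buffer operands versus treating "SGPR required" as a weight in the score) can be illustrated with a small standalone sketch. This is not the code from the patch or from LLVM: `ToyCopyInfo`, `computeScore`, `ToyThreshold`, `needToBeConvertedToVALUWeighted`, and the profit/penalty formula are all hypothetical stand-ins chosen for illustration, and only the `HasMBUFScalarReg` early return and the "empty chain gives score 0" rule are taken from the diff.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for the pass's V2SCopyInfo. Field names echo the review, but
// this struct is hypothetical and holds plain counters instead of LLVM types.
struct ToyCopyInfo {
  std::size_t SChainSize = 0;     // SALU users reachable from the copy
  unsigned NumReadfirstlanes = 0; // readfirstlanes needed to stay scalar
  unsigned SiblingPenalty = 0;    // other V2S copies feeding the same chain
  bool HasMBUFScalarReg = false;  // copy feeds srsrc/soffset of MUBUF/MTBUF
  unsigned Score = 0;             // 0 means "must become VALU"
};

// Assumed cut-off, for illustration only.
constexpr unsigned ToyThreshold = 3;

// Assumed scoring step: profit is the chain length (plus any extra weight),
// penalty is the extra work needed to keep the chain scalar.
unsigned computeScore(const ToyCopyInfo &Info, unsigned ExtraProfit = 0) {
  unsigned Profit = static_cast<unsigned>(Info.SChainSize) + ExtraProfit;
  unsigned Penalty = Info.NumReadfirstlanes + Info.SiblingPenalty;
  return Penalty > Profit ? 0 : Profit - Penalty;
}

// Shape of the patch under review: a hard early exit for buffer operands.
bool needToBeConvertedToVALU(ToyCopyInfo &Info) {
  if (Info.HasMBUFScalarReg)
    return false; // always keep scalar; the score is never computed
  if (Info.SChainSize == 0) {
    Info.Score = 0;
    return true;
  }
  Info.Score = computeScore(Info);
  return Info.Score < ToyThreshold;
}

// Shape of the alternative raised in review: "SGPR required" only adds
// weight to the score, so a large enough penalty can still outweigh it.
bool needToBeConvertedToVALUWeighted(ToyCopyInfo &Info,
                                     unsigned SGPRRequiredWeight) {
  assert(SGPRRequiredWeight > 0 && "weight must be positive");
  unsigned Extra = Info.HasMBUFScalarReg ? SGPRRequiredWeight : 0;
  if (Info.SChainSize == 0 && Extra == 0) {
    Info.Score = 0;
    return true;
  }
  Info.Score = computeScore(Info, Extra);
  return Info.Score < ToyThreshold;
}
```

In this toy model the first function never converts a copy that feeds a buffer scalar operand, whatever the penalties are, while the second one lets the usual tradeoff run with the buffer use counted as extra profit.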

llvm/test/CodeGen/AMDGPU/si-fix-sgpr-copies-buf.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=gfx906 -stop-after=si-fix-sgpr-copies -verify-machineinstrs -o - %s | FileCheck %s

				define float @llvm_amdgcn_raw_buffer_load_f32(<4 x i32> inreg %rsrc, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_buffer_load_f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr5
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: [[BUFFER_LOAD_DWORD_OFFEN:%[0-9]+]]:vgpr_32 = BUFFER_LOAD_DWORD_OFFEN {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, implicit $exec

				%val = call float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0)
				ret float %val
				}

				define float @llvm_amdgcn_raw_tbuffer_load_f32(<4 x i32> inreg %rsrc, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_tbuffer_load_f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr5
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: [[TBUFFER_LOAD_FORMAT_X_OFFEN:%[0-9]+]]:vgpr_32 = TBUFFER_LOAD_FORMAT_X_OFFEN {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, 0, implicit $exec

				%val = call float @llvm.amdgcn.raw.tbuffer.load.f32(<4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0, i32 0)
				ret float %val
				}

				define <2 x float> @llvm_amdgcn_raw_buffer_load_v2f32(<4 x i32> inreg %rsrc, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_buffer_load_v2f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr5
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: [[BUFFER_LOAD_DWORD_OFFEN:%[0-9]+]]:vreg_64 = BUFFER_LOAD_DWORDX2_OFFEN {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, implicit $exec

				%val = call <2 x float> @llvm.amdgcn.raw.buffer.load.v2f32(<4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0)
				ret <2 x float> %val
				}

				define <2 x float> @llvm_amdgcn_raw_tbuffer_load_v2f32(<4 x i32> inreg %rsrc, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_tbuffer_load_v2f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr5
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: [[TBUFFER_LOAD_FORMAT_XY_OFFEN:%[0-9]+]]:vreg_64 = TBUFFER_LOAD_FORMAT_XY_OFFEN {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, 0, implicit $exec

				%val = call <2 x float> @llvm.amdgcn.raw.tbuffer.load.v2f32(<4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0, i32 0)
				ret <2 x float> %val
				}

				define <3 x float> @llvm_amdgcn_raw_buffer_load_v3f32(<4 x i32> inreg %rsrc, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_buffer_load_v3f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr5
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: [[BUFFER_LOAD_DWORD_OFFEN:%[0-9]+]]:vreg_96 = BUFFER_LOAD_DWORDX3_OFFEN {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, implicit $exec

				%val = call <3 x float> @llvm.amdgcn.raw.buffer.load.v3f32(<4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0)
				ret <3 x float> %val
				}

				define <3 x float> @llvm_amdgcn_raw_tbuffer_load_v3f32(<4 x i32> inreg %rsrc, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_tbuffer_load_v3f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr5
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: [[TBUFFER_LOAD_FORMAT_XYZ_OFFEN:%[0-9]+]]:vreg_96 = TBUFFER_LOAD_FORMAT_XYZ_OFFEN {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, 0, implicit $exec

				%val = call <3 x float> @llvm.amdgcn.raw.tbuffer.load.v3f32(<4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0, i32 0)
				ret <3 x float> %val
				}

				define <4 x float> @llvm_amdgcn_raw_buffer_load_v4f32(<4 x i32> inreg %rsrc, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_buffer_load_v4f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr5
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: [[BUFFER_LOAD_DWORD_OFFEN:%[0-9]+]]:vreg_128 = BUFFER_LOAD_DWORDX4_OFFEN {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, implicit $exec

				%val = call <4 x float> @llvm.amdgcn.raw.buffer.load.v4f32(<4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0)
				ret <4 x float> %val
				}

				define <4 x float> @llvm_amdgcn_raw_tbuffer_load_v4f32(<4 x i32> inreg %rsrc, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_tbuffer_load_v4f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr5
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: [[TBUFFER_LOAD_FORMAT_XYZW_OFFEN:%[0-9]+]]:vreg_128 = TBUFFER_LOAD_FORMAT_XYZW_OFFEN {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, 0, implicit $exec

				%val = call <4 x float> @llvm.amdgcn.raw.tbuffer.load.v4f32(<4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0, i32 0)
				ret <4 x float> %val
				}

				define void @llvm_amdgcn_raw_buffer_store_f32(<4 x i32> inreg %rsrc, float %val, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_buffer_store_f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr6
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: BUFFER_STORE_DWORD_OFFEN_exact {{.*}}, {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, implicit $exec

				call void @llvm.amdgcn.raw.buffer.store.f32(float %val, <4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0)
				ret void
				}

				define void @llvm_amdgcn_raw_tbuffer_store_f32(<4 x i32> inreg %rsrc, float %val, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_tbuffer_store_f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr6
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: TBUFFER_STORE_FORMAT_X_OFFEN_exact {{.*}}, {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, 0, implicit $exec

				call void @llvm.amdgcn.raw.tbuffer.store.f32(float %val, <4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0, i32 0)
				ret void
				}

				define void @llvm_amdgcn_raw_buffer_store_v2f32(<4 x i32> inreg %rsrc, <2 x float> %val, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_buffer_store_v2f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6, $vgpr7
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr7
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: BUFFER_STORE_DWORDX2_OFFEN_exact {{.*}}, {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, implicit $exec

				call void @llvm.amdgcn.raw.buffer.store.v2f32(<2 x float> %val, <4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0)
				ret void
				}

				define void @llvm_amdgcn_raw_tbuffer_store_v2f32(<4 x i32> inreg %rsrc, <2 x float> %val, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_tbuffer_store_v2f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6, $vgpr7
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr7
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: TBUFFER_STORE_FORMAT_XY_OFFEN_exact {{.*}}, {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, 0, implicit $exec

				call void @llvm.amdgcn.raw.tbuffer.store.v2f32(<2 x float> %val, <4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0, i32 0)
				ret void
				}

				define void @llvm_amdgcn_raw_buffer_store_v3f32(<4 x i32> inreg %rsrc, <3 x float> %val, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_buffer_store_v3f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6, $vgpr7, $vgpr8
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr8
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: BUFFER_STORE_DWORDX3_OFFEN_exact {{.*}}, {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, implicit $exec

				call void @llvm.amdgcn.raw.buffer.store.v3f32(<3 x float> %val, <4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0)
				ret void
				}

				define void @llvm_amdgcn_raw_tbuffer_store_v3f32(<4 x i32> inreg %rsrc, <3 x float> %val, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_tbuffer_store_v3f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6, $vgpr7, $vgpr8
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr8
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: TBUFFER_STORE_FORMAT_XYZ_OFFEN_exact {{.*}}, {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, 0, implicit $exec

				call void @llvm.amdgcn.raw.tbuffer.store.v3f32(<3 x float> %val, <4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0, i32 0)
				ret void
				}

				define void @llvm_amdgcn_raw_buffer_store_v4f32(<4 x i32> inreg %rsrc, <4 x float> %val, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_buffer_store_v4f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6, $vgpr7, $vgpr8, $vgpr9
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr9
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: BUFFER_STORE_DWORDX4_OFFEN_exact {{.*}}, {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, implicit $exec

				call void @llvm.amdgcn.raw.buffer.store.v4f32(<4 x float> %val, <4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0)
				ret void
				}

				define void @llvm_amdgcn_raw_tbuffer_store_v4f32(<4 x i32> inreg %rsrc, <4 x float> %val, i32 %voffset, i32 inreg %soffset) {
				; CHECK-LABEL: name: llvm_amdgcn_raw_tbuffer_store_v4f32
				; CHECK: bb.0 (%ir-block.0):
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0, $vgpr1, $vgpr2, $vgpr3, $vgpr4, $vgpr5, $vgpr6, $vgpr7, $vgpr8, $vgpr9
				; CHECK-NEXT: {{ $}}
				; CHECK: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr9
				; CHECK: [[V_READFIRSTLANE_B32:%[0-9]+]]:sreg_32 = V_READFIRSTLANE_B32 [[COPY]], implicit $exec
				; CHECK: bb.2:
				; CHECK-NEXT: successors: %bb.1(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: TBUFFER_STORE_FORMAT_XYZW_OFFEN_exact {{.*}}, {{.*}}, killed {{.*}}, [[V_READFIRSTLANE_B32]], 0, 0, 0, 0, implicit $exec

				call void @llvm.amdgcn.raw.tbuffer.store.v4f32(<4 x float> %val, <4 x i32> %rsrc, i32 %voffset, i32 %soffset, i32 0, i32 0)
				ret void
				}

				declare float @llvm.amdgcn.raw.buffer.load.f32(<4 x i32>, i32, i32, i32 )
				declare float @llvm.amdgcn.raw.tbuffer.load.f32(<4 x i32>, i32, i32, i32, i32)
				declare <2 x float> @llvm.amdgcn.raw.buffer.load.v2f32(<4 x i32>, i32, i32, i32)
				declare <2 x float> @llvm.amdgcn.raw.tbuffer.load.v2f32(<4 x i32>, i32, i32, i32, i32)
				declare <3 x float> @llvm.amdgcn.raw.buffer.load.v3f32(<4 x i32>, i32, i32, i32)
				declare <3 x float> @llvm.amdgcn.raw.tbuffer.load.v3f32(<4 x i32>, i32, i32, i32, i32)
				declare <4 x float> @llvm.amdgcn.raw.buffer.load.v4f32(<4 x i32>, i32, i32, i32)
				declare <4 x float> @llvm.amdgcn.raw.tbuffer.load.v4f32(<4 x i32>, i32, i32, i32, i32)
				declare void @llvm.amdgcn.raw.buffer.store.f32(float, <4 x i32>, i32, i32, i32)
				declare void @llvm.amdgcn.raw.tbuffer.store.f32(float, <4 x i32>, i32, i32, i32, i32)
				declare void @llvm.amdgcn.raw.buffer.store.v2f32(<2 x float>, <4 x i32>, i32, i32, i32)
				declare void @llvm.amdgcn.raw.tbuffer.store.v2f32(<2 x float>, <4 x i32>, i32, i32, i32, i32)
				declare void @llvm.amdgcn.raw.buffer.store.v3f32(<3 x float>, <4 x i32>, i32, i32, i32)
				declare void @llvm.amdgcn.raw.tbuffer.store.v3f32(<3 x float>, <4 x i32>, i32, i32, i32, i32)
				declare void @llvm.amdgcn.raw.buffer.store.v4f32(<4 x float>, <4 x i32>, i32, i32, i32)
				declare void @llvm.amdgcn.raw.tbuffer.store.v4f32(<4 x float>, <4 x i32>, i32, i32, i32, i32)

Revision Contents

Diff 481592

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp

llvm/test/CodeGen/AMDGPU/si-fix-sgpr-copies-buf.ll
