This is an archive of the discontinued LLVM Phabricator instance.

Differential D54882

[AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos
ClosedPublic

Authored by ronlieb on Nov 25 2018, 3:50 PM.

Details

Reviewers

arsenm
rampitec
vpykhtin

Group Reviewers

Restricted Project

Commits

rG16de4fd2ebac: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos
rL348132: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos

Summary

The introduction of S_{ADD|SUB}_U64_PSEUDO instructions, which are decomposed
into VOP3 instruction pairs, precludes the use of SDWA to encode a constant.
S_ADD_U64_PSEUDO becomes:

V_ADD_I32_e64
V_ADDC_U32_e64

and S_SUB_U64_PSEUDO becomes:

V_SUB_I32_e64
V_SUBB_U32_e64

SDWA (sub-dword addressing) is supported on VOP1 and VOP2 instructions,
but not on VOP3 instructions.

We want to fold the bit-and operand into the encoding of the V_ADD_I32
instruction. This requires transforming the VOP3 form of the instruction
into its VOP2 form (_e32). For example, given

%19:vgpr_32 = V_AND_B32_e32 255,
    killed %16:vgpr_32, implicit $exec
%47:vgpr_32, %49:sreg_64_xexec = V_ADD_I32_e64
    %26.sub0:vreg_64, %19:vgpr_32, implicit $exec
%48:vgpr_32, dead %50:sreg_64_xexec = V_ADDC_U32_e64
    %26.sub1:vreg_64, %54:vgpr_32, killed %49:sreg_64_xexec, implicit $exec

the shrink to _e32 allows the SDWA encoding, and the sequence becomes

%47:vgpr_32 = V_ADD_I32_sdwa
    0, %26.sub0:vreg_64, 0, killed %16:vgpr_32, 0, 6, 0, 6, 0,
    implicit-def $vcc, implicit $exec
%48:vgpr_32 = V_ADDC_U32_e32
    0, %26.sub1:vreg_64, implicit-def $vcc, implicit $vcc, implicit $exec
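
To make the decomposition concrete, here is a minimal C++ sketch (not LLVM code) of what the instruction pair computes: a 32-bit add with carry-out for the low half, an add-with-carry for the high half, and a BYTE_0 source select standing in for the folded V_AND_B32 255. All struct and function names are hypothetical, and the assumption that the high half adds zero is inferred from the _e32 form shown above.

```cpp
#include <cassert>
#include <cstdint>

// Result of a 32-bit add that also produces a carry-out bit (the value the
// low add writes to its carry register, or to VCC in the e32 form).
struct AddResult {
  uint32_t Value;
  bool Carry;
};

// Models the low-half add: 32-bit add with carry-out.
AddResult addCarryOut(uint32_t A, uint32_t B) {
  uint32_t Sum = A + B;
  return {Sum, Sum < A}; // unsigned wrap-around means a carry occurred
}

// Models the high-half add: 32-bit add with carry-in (carry-out ignored).
uint32_t addCarryIn(uint32_t A, uint32_t B, bool CarryIn) {
  return A + B + (CarryIn ? 1u : 0u);
}

// Hypothetical helper: selects byte 0 of a dword, which is the effect a
// BYTE_0 source select has in place of a separate "and with 255".
uint32_t selectByte0(uint32_t V) { return V & 0xFFu; }

// 64-bit add of Base and the low byte of Offset, written as the decomposed
// pair. Assumption: the high half adds the constant zero plus the carry.
uint64_t add64WithByteOperand(uint64_t Base, uint32_t Offset) {
  uint32_t Lo = static_cast<uint32_t>(Base);
  uint32_t Hi = static_cast<uint32_t>(Base >> 32);
  AddResult LoSum = addCarryOut(Lo, selectByte0(Offset));
  uint32_t HiSum = addCarryIn(Hi, 0u, LoSum.Carry);
  return (static_cast<uint64_t>(HiSum) << 32) | LoSum.Value;
}
```

The carry produced by the low add is the only link between the two halves, which is why the register that holds it matters so much in the review discussion that follows.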

Event Timeline

ronlieb created this revision. Nov 25 2018, 3:50 PM

Herald added subscribers: llvm-commits, t-tye, tpr and 6 others. Nov 25 2018, 3:50 PM

arsenm added inline comments. Nov 26 2018, 8:36 AM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
903–910	There's no point in checking these since they must be present
918–921	We don't want to rely on kill flags. You should check the uses of the vreg instead
963–964	You have code checking for the carry ins, but you don't handle those here

A MIR test would be more useful for checking the carry in cases

rampitec marked an inline comment as done. Nov 26 2018, 8:49 AM

rampitec added inline comments.

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
883	Having a VOP2 pseudo does not necessarily mean there is target VOP2 instruction. I would suggest calling pseudoToMCOpcode() in addition on that VOP2 opcode.
898	Same here.
901	This is not SDWADesc, it is just VOP2Desc.
923	You need to check for modifiers which you are going to drop by this conversion. For instance incoming instruction can be VOP3 OpSel. If it is OpSel you need to check that OpSel modifiers are trivial (e.g. have no effect and equivalent to VOP2). The same about VOP3 modifiers abs and neg. You also need a mir test to check all negative cases when you cannot fold.
963–964	Right. It does not seem to be specific to just these instructions. In general any VOP3 can go through it.
test/CodeGen/AMDGPU/sdwa-op64-test.ll
2	You need to add fiji run line.

Essentially this is a limited version of shrinking. So I have several questions:

Why not to run shrink pass before sdwa instead?
If we want to shrink individual instructions, should not we just add SIInstrInfo::shrink() interface and call it from SIShrinkInstructions as well?
At the very least, the checks need to use SIInstrInfo::canShrink() instead of the many and insufficient checks here. For instance, this code does not check for valid register classes, which SIInstrInfo::canShrink() does. We really need to stop cloning that code.
In general SIInstrInfo::canShrink() should be extended to handle OpSel (via SIInstrInfo::hasModifiers()).
SIInstrInfo::canShrink() shall call SIInstrInfo::hasVALU32BitEncoding() to check for presence of target instruction.

rampitec added a reviewer: vpykhtin. Nov 26 2018, 9:39 AM

New patch arriving momentarily ...

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
923	leaving this one open, to work on the MIR test, and think about what modifiers might affect the V_ADD and V_SUB instructions.
963–964	The CarryIn/Out related code is located in pseudoOpConvertedToVOP2()
963–964	i changed the name to pseudoOpConvertedToVOP2 to reflect that we look for a possible ADD or SUB that resulted from a previously lowered S_ADD_U64_PSEUDO or S_SUB_U64_PSEUDO. The function pseudoOpConvertedToVOP2 further validates that we have a lowered pseudo and returns true if it was able to perform the conversion. also added comments kind of like the above ...

New patch addressing many (but not all) of the review comments.
Will look into the shrink related comments soon ...

In D54882#1308240, @rampitec wrote:

Essentially this is a limited version of shrinking. So I have several questions:

Why not to run shrink pass before sdwa instead?

I tried adding Shrink pass before PeepholeSDWA and observed 88 lit test failures.
i tried moving Shrink pass before Peephole SDWA and observed 25 lit test failures

If we want to shrink individual instructions, should not we just add SIInstrInfo::shrink() interface and call it from SIShrinkInstructions as well?

not sure ...

At the very least, the checks need to use SIInstrInfo::canShrink() instead of the many and insufficient checks here. For instance, this code does not check for valid register classes, which SIInstrInfo::canShrink() does. We really need to stop cloning that code.

Great suggestion, i added calls to canShrink to replace the existing approach. Thanks for pointing that out.

In general SIInstrInfo::canShrink() should be extended to handle OpSel (via SIInstrInfo::hasModifiers()).

I am not really trying to handle OpSel, and i have modified the current patch to preclude it. We could open another defect to look into this later.

SIInstrInfo::canShrink() shall call SIInstrInfo::hasVALU32BitEncoding() to check for presence of target instruction.

I added this to canShrink and i think it helps clean up the patch, good suggestion.

New patch coming soon...
still need to construct an MIR test for negative testing.

In D54882#1309583, @ronlieb wrote:

I tried adding Shrink pass before PeepholeSDWA and observed 88 lit test failures.
i tried moving Shrink pass before Peephole SDWA and observed 25 lit test failures

Which may be a good thing if these failures are progressions (as I suspect) and not regressions. Are they progressions?
That is the point of other comments too, this patch is limited to handle just two instructions while there is a clear possibility to do it for almost any VOP3.

In D54882#1309588, @rampitec wrote:

Which may be a good thing if these failures are progressions (as I suspect) and not regressions. Are they progressions?
That is the point of other comments too, this patch is limited to handle just two instructions while there is a clear possibility to do it for almost any VOP3.

I would also assume many of these failures are just commute which is attempted by shrink pass. That is normal and would only need to change the tests.

Incorporated changes for some of the Shrink suggestions.
Still need to do an MIR test.
Also, investigate if moving/adding Shrink pass results in test progressions or regressions

In D54882#1309605, @rampitec wrote:

I would also assume many of these failures are just commute which is attempted by shrink pass. That is normal and would only need to change the tests.

i tried an experiment of simply invoking the Shrink pass a 2nd time.

addPass(createSIShrinkInstructionsPass());
addPass(createSIShrinkInstructionsPass());

which resulted in 74 failures, and they do seem to be commute changes primarily (did not look at them all)
So then, i added a 3rd invocation and zero failures (i'm still laughing at this one).

In D54882#1309639, @ronlieb wrote:
i tried an experiment of simply invoking the Shrink pass a 2nd time.
addPass(createSIShrinkInstructionsPass());
addPass(createSIShrinkInstructionsPass());
which resulted in 74 failures, and they do seem to be commute changes primarily (did not look at them all)
So then, i added a 3rd invocation and zero failures (i'm still laughing at this one).

So it must be commute. I guess you just need to add a new shrink pass before sdwa. Does it help to deal with these two instructions, e.g. does it help your lit test?

If yes there are two options, either:

Revert the commute in shrink pass if it did not help.
Just update tests.

In D54882#1309642, @rampitec wrote:
So it must be commute. I guess you just need to add a new shrink pass before sdwa. Does it help to deal with these two instructions, e.g. does it help your lit test?

If yes there are two options, either:

Revert the commute in shrink pass if it did not help.

Just update tests.

Adding the shrink pass just before Peephole SDWA does not help the lit test, it made no difference.
I think at this point i should proceed with adding the MIR test to verify when we cannot fold.

In D54882#1309708, @ronlieb wrote:

Adding the shrink pass just before Peephole SDWA does not help the lit test, it made no difference.
I think at this point i should proceed with adding the MIR test to verify when we cannot fold.

Do you know why was it unable to shrink these instructions?

In D54882#1309711, @rampitec wrote:

Do you know why was it unable to shrink these instructions?

SIInstrInfo::splitScalar64BitAddSub converts the S_ADD_U64_PSEUDO into the two add instructions which use the SReg_64_XEXECRegClass instead of VCC.
Later, when the SIShrinkInstructions::runOnMachineFunction pass runs, it sees that the carry regs are not VCC, simply marks them with a hint to later convert to VCC,
and then continues without doing a transformation.

if (SDst) {
  if (SDst->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(SDst->getReg()))
      MRI.setRegAllocationHint(SDst->getReg(), 0, AMDGPU::VCC);
    continue;
  }

  // All of the instructions with carry outs also have an SGPR input in
  // src2.
  if (Src2 && Src2->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(Src2->getReg()))
      MRI.setRegAllocationHint(Src2->getReg(), 0, AMDGPU::VCC);

    continue;
  }
}
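
The control flow of that excerpt can be modelled outside LLVM. The following is a hedged, self-contained C++ sketch, assuming a toy register numbering in which 0 stands for physical VCC and every other number is a virtual register; `CarryInst` and `tryShrinkCarryInst` are hypothetical names, not LLVM APIs.

```cpp
#include <cassert>
#include <map>

// Toy register numbering (an assumption for illustration): 0 stands for the
// physical VCC register, any other value is treated as a virtual register.
constexpr unsigned VCC = 0;

struct CarryInst {
  unsigned CarryOutReg; // register receiving the carry-out (sdst)
  bool HasCarryIn;      // true for ADDC/SUBB style instructions
  unsigned CarryInReg;  // register supplying the carry-in (src2), if any
};

// Returns true if the instruction could be shrunk to its e32 form. Otherwise
// it records an allocation hint toward VCC for the offending register and
// gives up, leaving the shrink to a later run after register allocation.
bool tryShrinkCarryInst(const CarryInst &I,
                        std::map<unsigned, unsigned> &Hints) {
  if (I.CarryOutReg != VCC) {
    Hints[I.CarryOutReg] = VCC;
    return false;
  }
  if (I.HasCarryIn && I.CarryInReg != VCC) {
    Hints[I.CarryInReg] = VCC;
    return false;
  }
  return true;
}
```

In this model a freshly lowered pair, whose carries live in virtual registers, always takes the early-exit path, which matches the observation that adding a shrink pass before the SDWA peephole made no difference.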

In D54882#1309888, @ronlieb wrote:
SIInstrInfo::splitScalar64BitAddSub converts the S_ADD_U64_PSEUDO into the two add instructions which use the SReg_64_XEXECRegClass instead of VCC.
Later, when the SIShrinkInstructions::runOnMachineFunction pass runs, it sees that the carry regs are not VCC, simply marks them with a hint to later convert to VCC,
and then continues without doing a transformation.
if (SDst) {
  if (SDst->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(SDst->getReg()))
      MRI.setRegAllocationHint(SDst->getReg(), 0, AMDGPU::VCC);
    continue;
  }

  // All of the instructions with carry outs also have an SGPR input in
  // src2.
  if (Src2 && Src2->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(Src2->getReg()))
      MRI.setRegAllocationHint(Src2->getReg(), 0, AMDGPU::VCC);

    continue;
  }
}

OK, that makes sense. It just leaves it to post-RA shrink that way. Probably shrink pass could do the same, but it is not desirable as it limits scheduling opportunities.

But I see the problem in your code now: you do not check that vcc is not clobbered or used in between of two instructions.
I also think you need to shrink both instructions, otherwise you have carry-in of addc and carry-out of add in different registers, which just happen to be allocated to the same vcc. Note, that isConvertibleToSDWA() returning true does not guarantee final sdwa conversion, so you can end up with vop3 form for the first instruction anyway.

In D54882#1309928, @rampitec wrote:
OK, that makes sense. It just leaves it to post-RA shrink that way. Probably shrink pass could do the same, but it is not desirable as it limits scheduling opportunities.

But I see the problem in your code now: you do not check that vcc is not clobbered or used in between of two instructions.
I also think you need to shrink both instructions, otherwise you have carry-in of addc and carry-out of add in different registers, which just happen to be allocated to the same vcc. Note, that isConvertibleToSDWA() returning true does not guarantee final sdwa conversion, so you can end up with vop3 form for the first instruction anyway.

i think the following code does make sure that there are no intervening uses, i can also strengthen it to make sure the defining instruction of the CarryIn is the first ADD instruction.
+ if (!MRI->hasOneUse(CarryIn->getReg()) || !MRI->use_empty(CarryOut->getReg()))
+ return false;

Regarding the need to shrink both 'add' instructions, yes i can see where that is needed.
I added additional code to shrink the 1st add as well, and i had to restructure it a bit to lower the e64's just before doing the SDWA's.

Revised patch in transit.

In D54882#1310543, @ronlieb wrote:
i think the following code does make sure that there are no intervening uses, i can also strengthen it to make sure the defining instruction of the CarryIn is the first ADD instruction.
+ if (!MRI->hasOneUse(CarryIn->getReg()) || !MRI->use_empty(CarryOut->getReg()))
+ return false;

It does not. It only checks that original sreg is not used. However you are replacing original carry sreg with vcc by shrinking instruction, and you do not check vcc uses.

latest approach: transform the pair of ADDs/SUBs into e32, and tighten up check on def/use from one ADD to the other.

MIR test still pending.

In D54882#1310546, @rampitec wrote:
It does not. It only checks that original sreg is not used. However you are replacing original carry sreg with vcc by shrinking instruction, and you do not check vcc uses.

well now that you mention it, yes i see that is the case, i shall look into the VCC, thanks.
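
The hazard being discussed can be reproduced with a small simulation. This is an illustrative C++ sketch, not LLVM code: it assumes both adds have already been shrunk so that the carry travels through a single shared flag standing in for VCC, and the opcode names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy opcodes. After shrinking, the low add writes its carry to VCC and the
// high add reads it back; ClobberVCC is any unrelated instruction in between
// that also defines VCC (for example a compare).
enum class Op { AddLo, ClobberVCC, AddcHi };

struct Machine {
  uint32_t Lo = 0, Hi = 0;
  bool VCC = false;
};

// Executes Seq on a machine holding the 64-bit Value and returns the 64-bit
// result; AddLo adds Addend to the low half.
uint64_t run(const std::vector<Op> &Seq, uint64_t Value, uint32_t Addend) {
  Machine M;
  M.Lo = static_cast<uint32_t>(Value);
  M.Hi = static_cast<uint32_t>(Value >> 32);
  for (Op O : Seq) {
    switch (O) {
    case Op::AddLo: {
      uint32_t Sum = M.Lo + Addend;
      M.VCC = Sum < M.Lo; // carry-out written to VCC
      M.Lo = Sum;
      break;
    }
    case Op::ClobberVCC:
      M.VCC = false; // overwrites the pending carry
      break;
    case Op::AddcHi:
      M.Hi += M.VCC ? 1u : 0u; // carry-in read from VCC
      break;
    }
  }
  return (static_cast<uint64_t>(M.Hi) << 32) | M.Lo;
}
```

With the back-to-back pair, `0xFFFFFFFF + 1` yields `0x100000000`; with a VCC definition in between, the carry is lost and the result is `0`. A use-count check on the original virtual carry register cannot see this, because the intervening instruction never mentions that register.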

rampitec added inline comments. Nov 27 2018, 2:50 PM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
910	This check is not needed. findSingleRegUse() already did it.

ronlieb marked 3 inline comments as done. Nov 29 2018, 9:38 AM

ronlieb added inline comments.

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
923	at present the patch only accepts the ADD|SUB variants. I am adding a check for Mods which, if present, will make us reject the instructions, i.e. not perform the _e64 -> _e32 change. MIR test added.

Added MIR test, and changes per review comments.

Rebased

rampitec added inline comments. Nov 29 2018, 11:46 AM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
861	Not necessarily, a vcc_lo or vcc_hi can be used.
867	There is absolutely no guarantee vcc is not live at the beginning of the block. You need to query liveness before MI (MBB::computeRegisterLiveness) and only scan from MI to MISucc.
878	Use does not kill register, it is not a destructive read.
941	You do not need these checks. First the presence of modifiers does not prevent the shrink. When they are non-zero, that prevents it. And this is already tested by canShrink().
test/CodeGen/AMDGPU/sdwa-ops.mir
2	I do not see a test with modifiers.

ronlieb marked 6 inline comments as done. Nov 30 2018, 8:05 AM

ronlieb added inline comments.

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
861	seems like computeRegisterLiveness is a much better approach to determining VCC liveness than the clunky function VCCUsable, so i can simply toss this function out and use the computeRegisterLiveness which also handles the Subregs of VCC
test/CodeGen/AMDGPU/sdwa-ops.mir
2	i am having difficulty trying to construct a V_ADD_I32 or V_ADDC_U32 instruction with an abs or neg modifiers. In particular, the architecture ref gfx9 has comments like the following regarding input modifiers for vop1, vop2, vop3 "In general, negation and absolute value are only supported for floating point input operands (operands with a type of F16, F32, or F64); they are not supported for integer or untyped inputs." Do you know of an MIR example which has modifiers?

rampitec added inline comments. Nov 30 2018, 8:13 AM

test/CodeGen/AMDGPU/sdwa-ops.mir
2	Ah, right. You are only tracking these two int instructions.

ronlieb updated this revision to Diff 176136. Nov 30 2018, 8:37 AM

ronlieb marked an inline comment as done.

rampitec added inline comments. Nov 30 2018, 8:43 AM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
911	That is not sufficient. You have checked that VCC is dead at the MI, fine. Now you need to check that it is not defined anywhere between MI and MISucc.

rampitec added inline comments. Nov 30 2018, 8:45 AM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
911	And write tests for it of course: non-dead vcc, vcc def in between of two and vcc_lo (partial) def.

ronlieb updated this revision to Diff 176182. Nov 30 2018, 12:45 PM

ronlieb marked 3 inline comments as done.

rampitec added inline comments. Nov 30 2018, 2:41 PM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
913	The fact that it is dead at the MISucc does not tell you anything. It can be defined and killed in between the two instructions.

ronlieb updated this revision to Diff 176235. Nov 30 2018, 5:33 PM

ronlieb marked an inline comment as done.

rampitec added inline comments. Nov 30 2018, 5:44 PM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914	Almost there. I = std::next(MI) and you do not need E, it is confusing.
test/CodeGen/AMDGPU/sdwa-ops.mir
355	I wonder why the verifier does not complain about this instruction. Ah, I see. Please add -verify-machineinstrs to the run lines. Anyway, you need a test for what you are checking in the code: a vcc def in between the two instructions.

ronlieb marked an inline comment as done. Nov 30 2018, 6:01 PM

ronlieb added inline comments.

test/CodeGen/AMDGPU/sdwa-ops.mir
355	good catch: adding -verify-machineinstrs. See line 364, this has a def of $vcc between the ADD and ADDC, is that what you are suggesting?

ronlieb marked an inline comment as done. Nov 30 2018, 6:05 PM

ronlieb added inline comments.

test/CodeGen/AMDGPU/sdwa-ops.mir
355	sorry, i meant line 262

ronlieb updated this revision to Diff 176238. Nov 30 2018, 6:35 PM

ronlieb marked 2 inline comments as done.

rampitec added inline comments. Dec 2 2018, 12:20 AM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
915	I mean you do not have to check MI itself. ++I was OK: MachineBasicBlock::const_iterator I = std::next(MI);
test/CodeGen/AMDGPU/sdwa-ops.mir
355	I still do not see how is it legal to copy 32 bit register into 64 bit.
355	It does not help. You need a test where vcc is defined and killed in between of MI and MISucc.

ronlieb updated this revision to Diff 176283. Dec 2 2018, 7:24 AM

ronlieb marked 3 inline comments as done.

ronlieb marked 2 inline comments as done. Dec 2 2018, 7:28 AM

ronlieb added inline comments.

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
915	i could not use std::next(MI) in the initializer, it caused bus errors for consecutive MI,MISucc. This works:
+ // Check if VCC is referenced in range of (MI,MISucc].
+ MachineBasicBlock::const_iterator I = MI;
+ for (++I; I != MISucc; ++I) {
test/CodeGen/AMDGPU/sdwa-ops.mir
355	Fixed it at lines 351,354 , and also above at lines 321,325
355	added subtest test12_add_co_sdwa test for $vcc defined and used between adds, should not generate GFX9-LABEL: name: test12_add_co_sdwa

rampitec added inline comments. Dec 2 2018, 9:34 AM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
915	Take the iterator.
test/CodeGen/AMDGPU/sdwa-ops.mir
387	killed $vcc

ronlieb updated this revision to Diff 176289. Dec 2 2018, 11:05 AM

ronlieb marked 4 inline comments as done.

rampitec added inline comments. Dec 2 2018, 12:01 PM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914	Ugh.. Why do you need an iterator from an iterator?! This may even fail if next is end(). std::next() already returns you an iterator.

ronlieb marked an inline comment as done. Dec 2 2018, 12:56 PM

ronlieb added inline comments.

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914	The short answer is i have not found another mechanism to increment I to the next MI that both compiles and does not abort. In this particular context, MI and MISucc are both references to valid instructions in the basic block, and further MISucc depends on MI, so it is after MI in the instruction iteration order. This loop is not executed unless we have found both MI and MISucc, so that means we will not see MBB.end(), as we will give up upon encountering MISucc. Further, in practice, MI and MISucc are very close to each other, if not actually next to each other. All that said, the cost of calling modifiesRegister one extra time is hopefully not too high in practice. My preference at this point is to use one of the two following; personally, i think #1 is the simplest.
+ // Check if VCC is referenced in range of MI and MISucc.
+ for (MachineBasicBlock::const_iterator I = MI; I != MISucc;
+      I = std::next(I)) {
+   if (I->modifiesRegister(AMDGPU::VCC, TRI))

+ // Check if VCC is referenced in range of (MI,MISucc].
+ MachineBasicBlock::const_iterator I = MI;
+ for (++I; I != MISucc; ++I) {
+   if (I->modifiesRegister(AMDGPU::VCC, TRI))

ronlieb marked an inline comment as done. Dec 2 2018, 3:39 PM

ronlieb added inline comments.

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914	i need to use iterator instead of const_iterator:
// Check if VCC is referenced in range of (MI,MISucc].
for (MachineBasicBlock::iterator I = std::next(MI.getIterator()); I != MISucc; ++I) {

ronlieb updated this revision to Diff 176305. Dec 2 2018, 3:40 PM

AlexVlx added a subscriber: AlexVlx. Dec 2 2018, 4:19 PM

AlexVlx added inline comments.

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914	Hello Ron. It is probably more hygienic (and may have spared you the pain here) to use auto in this case, since it's one of the instances where it shines. Your loop would become `for (auto I = std::next(MI.getIterator()); I != MISucc; ++I) {...}`, which also makes it robust against `MI.getIterator()` changing its return type (as unlikely as that'd be).

ronlieb marked an inline comment as done. Dec 2 2018, 4:44 PM

ronlieb added inline comments.

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914	a good suggestion, thanks. It needs to be this to compile (MISucc needs to be an iterator, and the end condition should not be recomputed on each iteration):
for (auto I = std::next(MI.getIterator()), E = MISucc.getIterator(); I != E; ++I) {

ronlieb updated this revision to Diff 176308. Dec 2 2018, 4:48 PM

AlexVlx added inline comments. Dec 2 2018, 5:10 PM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp

914

Right, I ignored that. To be fair, at this point it may be worth considering just doing the following:

const auto It = std::find_if(MI.getIterator(), MISucc.getIterator(), [](const MachineInstr &I) {
    return I.modifiesRegister(AMDGPU::VCC, TRI);
});

if (It != MISucc.getIterator()) return;

// OR

for (auto &&I : iterator_range{MI.getIterator(), MISucc.getIterator()}) {
    if (I.modifiesRegister(AMDGPU::VCC, TRI);
        return;
}

But it's completely your call and mostly cosmetic at this point. Apologies for the side-trip:)

LGTM

This revision is now accepted and ready to land.Dec 2 2018, 6:01 PM

ronlieb marked an inline comment as done.Dec 2 2018, 6:09 PM

ronlieb added inline comments.

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914	I tried both suggestions (side note: both need std::next). The first approach runs into issues accessing the class member pointer TRI. The second one gave me syntactic fits. Just now, I see that Stats gave it the highly coveted LGTM, so I am going to run a PSDB on this patch and then merge it in the morning.
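
One plausible reading of the two problems mentioned above: in the `std::find_if` version the lambda has an empty capture list, so it cannot reach the `TRI` member, and in the range-based version the `if` line has an unbalanced parenthesis and a stray semicolon. Both suggested forms also start at `MI` itself rather than at `std::next`. The sketch below shows repaired forms under those assumptions. `Instr`, `Range`, `Pass`, and the `VCC` constant are hypothetical stand-ins built on `std::list`, not the LLVM types.

```cpp
#include <algorithm>
#include <iterator>
#include <list>

// Hypothetical stand-ins for the LLVM types; none of this is LLVM API.
constexpr unsigned VCC = 3; // arbitrary register number for the sketch

struct Instr {
  unsigned DefMask; // bit R set means the instruction writes register R
  bool modifiesRegister(unsigned Reg, const void * /*TRI*/) const {
    return ((DefMask >> Reg) & 1u) != 0;
  }
};

// Minimal iterator-pair range usable in a range-based for loop.
template <typename It> struct Range {
  It B, E;
  It begin() const { return B; }
  It end() const { return E; }
};

struct Pass {
  using Iter = std::list<Instr>::const_iterator;
  const void *TRI = nullptr; // stand-in for the pass's TRI member

  // std::find_if form: `this` is captured so the lambda can read TRI.
  bool clobberedFindIf(Iter MI, Iter MISucc) const {
    auto It = std::find_if(std::next(MI), MISucc, [this](const Instr &I) {
      return I.modifiesRegister(VCC, TRI);
    });
    return It != MISucc;
  }

  // Range-based form: balanced parentheses, no semicolon after the if.
  bool clobberedRangeFor(Iter MI, Iter MISucc) const {
    for (const Instr &I : Range<Iter>{std::next(MI), MISucc})
      if (I.modifiesRegister(VCC, TRI))
        return true;
    return false;
  }
};
```

Inside LLVM itself, `llvm::make_range` from `llvm/ADT/iterator_range.h` builds an iterator pair range without relying on class template argument deduction, which would likely be the closer equivalent of the hand-written `Range` helper here.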

rampitec added inline comments.Dec 2 2018, 6:17 PM

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914	In fact, find_if looks cool, but it does not simplify the source, reduce the number of lines, or simplify syntax analysis/optimization of the source. Meanwhile, it is somewhat strange that we have to replicate this code across many places in LLVM. After all, it is pretty standard to need to check whether a phys reg can be used between two iterators, yet this is only available as a standard utility function during RA, as checkInterference. What might be worthwhile is to generalize it and factor it into a common utility.

Closed by commit rL348132: [AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos (authored by ronlieb).Dec 3 2018, 5:07 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path	Size
lib/Target/AMDGPU/SIInstrInfo.cpp	4 lines
lib/Target/AMDGPU/SIPeepholeSDWA.cpp	97 lines
test/CodeGen/AMDGPU/sdwa-op64-test.ll	74 lines
test/CodeGen/AMDGPU/sdwa-ops.mir	390 lines

Diff 176308

lib/Target/AMDGPU/SIInstrInfo.cpp

[...]	if (Src1 && (!Src1->isReg() || !RI.isVGPR(MRI, Src1->getReg()) ||
hasModifiersSet(MI, AMDGPU::OpName::src1_modifiers)))		hasModifiersSet(MI, AMDGPU::OpName::src1_modifiers)))
return false;		return false;

// We don't need to check src0, all input types are legal, so just make sure		// We don't need to check src0, all input types are legal, so just make sure
// src0 isn't using any modifiers.		// src0 isn't using any modifiers.
if (hasModifiersSet(MI, AMDGPU::OpName::src0_modifiers))		if (hasModifiersSet(MI, AMDGPU::OpName::src0_modifiers))
return false;		return false;

		// Can it be shrunk to a valid 32 bit opcode?
		if (!hasVALU32BitEncoding(MI.getOpcode()))
		return false;

// Check output modifiers		// Check output modifiers
return !hasModifiersSet(MI, AMDGPU::OpName::omod) &&		return !hasModifiersSet(MI, AMDGPU::OpName::omod) &&
!hasModifiersSet(MI, AMDGPU::OpName::clamp);		!hasModifiersSet(MI, AMDGPU::OpName::clamp);
}		}

// Set VCC operand with all flags from \p Orig, except for setting it as		// Set VCC operand with all flags from \p Orig, except for setting it as
// implicit.		// implicit.
static void copyFlagsToImplicitVCC(MachineInstr &MI,		static void copyFlagsToImplicitVCC(MachineInstr &MI,

lib/Target/AMDGPU/SIPeepholeSDWA.cpp

[...]	public:

SIPeepholeSDWA() : MachineFunctionPass(ID) {		SIPeepholeSDWA() : MachineFunctionPass(ID) {
initializeSIPeepholeSDWAPass(*PassRegistry::getPassRegistry());		initializeSIPeepholeSDWAPass(*PassRegistry::getPassRegistry());
}		}

bool runOnMachineFunction(MachineFunction &MF) override;		bool runOnMachineFunction(MachineFunction &MF) override;
void matchSDWAOperands(MachineBasicBlock &MBB);		void matchSDWAOperands(MachineBasicBlock &MBB);
std::unique_ptr<SDWAOperand> matchSDWAOperand(MachineInstr &MI);		std::unique_ptr<SDWAOperand> matchSDWAOperand(MachineInstr &MI);
bool isConvertibleToSDWA(const MachineInstr &MI, const GCNSubtarget &ST) const;		bool isConvertibleToSDWA(MachineInstr &MI, const GCNSubtarget &ST) const;
		void pseudoOpConvertToVOP2(MachineInstr &MI,
		const GCNSubtarget &ST) const;
bool convertToSDWA(MachineInstr &MI, const SDWAOperandsVector &SDWAOperands);		bool convertToSDWA(MachineInstr &MI, const SDWAOperandsVector &SDWAOperands);
void legalizeScalarOperands(MachineInstr &MI, const GCNSubtarget &ST) const;		void legalizeScalarOperands(MachineInstr &MI, const GCNSubtarget &ST) const;

StringRef getPassName() const override { return "SI Peephole SDWA"; }		StringRef getPassName() const override { return "SI Peephole SDWA"; }

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.setPreservesCFG();		AU.setPreservesCFG();
MachineFunctionPass::getAnalysisUsage(AU);		MachineFunctionPass::getAnalysisUsage(AU);
[...]	for (MachineInstr &MI : MBB) {
if (auto Operand = matchSDWAOperand(MI)) {		if (auto Operand = matchSDWAOperand(MI)) {
LLVM_DEBUG(dbgs() << "Match: " << MI << "To: " << *Operand << '\n');		LLVM_DEBUG(dbgs() << "Match: " << MI << "To: " << *Operand << '\n');
SDWAOperands[&MI] = std::move(Operand);		SDWAOperands[&MI] = std::move(Operand);
++NumSDWAPatternsFound;		++NumSDWAPatternsFound;
}		}
}		}
}		}

bool SIPeepholeSDWA::isConvertibleToSDWA(const MachineInstr &MI,		// Convert the V_ADDC_U32_e64 into V_ADDC_U32_e32, and
		// V_ADD_I32_e64 into V_ADD_I32_e32. This allows isConvertibleToSDWA
		// to perform its transformation on V_ADD_I32_e32 into V_ADD_I32_sdwa.
		rampitec (Done): Not necessarily, a vcc_lo or vcc_hi can be used.
		ronlieb (Done): Seems like computeRegisterLiveness is a much better approach to determining VCC liveness than the clunky function VCCUsable, so I can simply toss this function out and use computeRegisterLiveness, which also handles the subregs of VCC.
		//
		// We are transforming from a VOP3 into a VOP2 form of the instruction.
		// %19:vgpr_32 = V_AND_B32_e32 255,
		// killed %16:vgpr_32, implicit $exec
		// %47:vgpr_32, %49:sreg_64_xexec = V_ADD_I32_e64
		// %26.sub0:vreg_64, %19:vgpr_32, implicit $exec
		rampitec (Done): There is absolutely no guarantee vcc is not live at the beginning of the block. You need to query liveness before MI (MBB::computeRegisterLiveness) and only scan from MI to MISucc.
		// %48:vgpr_32, dead %50:sreg_64_xexec = V_ADDC_U32_e64
		// %26.sub1:vreg_64, %54:vgpr_32, killed %49:sreg_64_xexec, implicit $exec
		//
		// becomes
		// %47:vgpr_32 = V_ADD_I32_sdwa
		// 0, %26.sub0:vreg_64, 0, killed %16:vgpr_32, 0, 6, 0, 6, 0,
		// implicit-def $vcc, implicit $exec
		// %48:vgpr_32 = V_ADDC_U32_e32
		// 0, %26.sub1:vreg_64, implicit-def $vcc, implicit $vcc, implicit $exec
		void SIPeepholeSDWA::pseudoOpConvertToVOP2(MachineInstr &MI,
		const GCNSubtarget &ST) const {
		rampitec (Done): A use does not kill the register; it is not a destructive read.
		int Opc = MI.getOpcode();
		assert((Opc == AMDGPU::V_ADD_I32_e64 || Opc == AMDGPU::V_SUB_I32_e64) &&
		"Currently only handles V_ADD_I32_e64 or V_SUB_I32_e64");

		// Can the candidate MI be shrunk?
		rampitec (Done): Having a VOP2 pseudo does not necessarily mean there is a target VOP2 instruction. I would suggest additionally calling pseudoToMCOpcode() on that VOP2 opcode.
		if (!TII->canShrink(MI, *MRI))
		return;
		Opc = AMDGPU::getVOPe32(Opc);
		// Find the related ADD instruction.
		const MachineOperand *Sdst = TII->getNamedOperand(MI, AMDGPU::OpName::sdst);
		if (!Sdst)
		return;
		MachineOperand *NextOp = findSingleRegUse(Sdst, MRI);
		if (!NextOp)
		return;
		MachineInstr &MISucc = *NextOp->getParent();
		// Can the successor be shrunk?
		if (!TII->canShrink(MISucc, *MRI))
		return;
		int SuccOpc = AMDGPU::getVOPe32(MISucc.getOpcode());
		rampitec (Done): Same here.
		// Make sure the carry in/out are subsequently unused.
		MachineOperand *CarryIn = TII->getNamedOperand(MISucc, AMDGPU::OpName::src2);
		if (!CarryIn)
		rampitec (Done): This is not SDWADesc, it is just VOP2Desc.
		return;
		MachineOperand *CarryOut = TII->getNamedOperand(MISucc, AMDGPU::OpName::sdst);
		if (!CarryOut)
		return;
		if (!MRI->hasOneUse(CarryIn->getReg()) || !MRI->use_empty(CarryOut->getReg()))
		return;
		// Make sure VCC or its subregs are dead before MI.
		MachineBasicBlock &MBB = *MI.getParent();
		auto Liveness = MBB.computeRegisterLiveness(TRI, AMDGPU::VCC, MI, 25);
		arsenm (Done): There's no point in checking these since they must be present.
		rampitec (Done): This check is not needed. findSingleRegUse() already did it.
		if (Liveness != MachineBasicBlock::LQR_Dead)
		rampitec (Done): That is not sufficient. You have checked that VCC is dead at the MI, fine. Now you need to check that it is not defined anywhere between MI and MISucc.
		rampitec (Done): And write tests for it of course: non-dead vcc, a vcc def in between the two, and a vcc_lo (partial) def.
		return;
		// Check if VCC is referenced in range of (MI,MISucc].
		rampitec (Done): The fact that it is dead at the MISucc does not give anything. It can be defined and killed in between the two instructions.
		for (auto I = std::next(MI.getIterator()), E = MISucc.getIterator();
		rampitec (Done): Almost there. I = std::next(MI) and you do not need E, it is confusing.
		rampitec (Not Done): Ugh.. Why do you need an iterator from an iterator?! This may even fail if next is end(). std::next() already returns you an iterator.
		I != E; ++I) {
		rampitec (Done): I mean you do not have to check MI itself. ++I was OK: MachineBasicBlock::const_iterator I = std::next(MI);
		ronlieb (Done): I could not use std::next(MI) in the initializer; it caused bus errors for consecutive MI, MISucc. This works:
		+ // Check if VCC is referenced in range of (MI,MISucc].
		+ MachineBasicBlock::const_iterator I = MI;
		+ for (++I; I != MISucc; ++I) {
		rampitec (Done): Take the iterator.
		if (I->modifiesRegister(AMDGPU::VCC, TRI))
		return;
		}
		// Make the two new e32 instruction variants.
		// Replace MI with V_{SUB|ADD}_I32_e32
		auto NewMI = BuildMI(MBB, MI, MI.getDebugLoc(), TII->get(Opc));
		arsenm (Done): We don't want to rely on kill flags. You should check the uses of the vreg instead.
		NewMI.add(*TII->getNamedOperand(MI, AMDGPU::OpName::vdst));
		NewMI.add(*TII->getNamedOperand(MI, AMDGPU::OpName::src0));
		rampitec (Done): You need to check for modifiers which you are going to drop by this conversion. For instance, the incoming instruction can be VOP3 OpSel. If it is OpSel, you need to check that the OpSel modifiers are trivial (e.g. have no effect and are equivalent to VOP2). The same goes for the VOP3 modifiers abs and neg. You also need a MIR test to check all negative cases when you cannot fold.
		ronlieb (Done): Leaving this one open to work on the MIR test and to think about what modifiers might affect the V_ADD and V_SUB instructions.
		ronlieb (Done): At present the patch only accepts the ADD|SUB variants. I am adding a check for mods; if they are present we will reject the instructions, i.e. not perform the _e64 -> _e32 change. MIR test added.
		NewMI.add(*TII->getNamedOperand(MI, AMDGPU::OpName::src1));
		MI.eraseFromParent();
		// Replace MISucc with V_{SUBB|ADDC}_U32_e32
		auto NewInst = BuildMI(MBB, MISucc, MISucc.getDebugLoc(), TII->get(SuccOpc));
		NewInst.add(*TII->getNamedOperand(MISucc, AMDGPU::OpName::vdst));
		NewInst.add(*TII->getNamedOperand(MISucc, AMDGPU::OpName::src0));
		NewInst.add(*TII->getNamedOperand(MISucc, AMDGPU::OpName::src1));
		MISucc.eraseFromParent();
		}

		bool SIPeepholeSDWA::isConvertibleToSDWA(MachineInstr &MI,
const GCNSubtarget &ST) const {		const GCNSubtarget &ST) const {
// Check if this is already an SDWA instruction		// Check if this is already an SDWA instruction
unsigned Opc = MI.getOpcode();		unsigned Opc = MI.getOpcode();
if (TII->isSDWA(Opc))		if (TII->isSDWA(Opc))
return true;		return true;

// Check if this instruction has opcode that supports SDWA		// Check if this instruction has opcode that supports SDWA
		rampitec (Done): You do not need these checks. First, the presence of modifiers does not prevent the shrink; it is non-zero modifiers that prevent it. And this is already tested by canShrink().
if (AMDGPU::getSDWAOp(Opc) == -1)		if (AMDGPU::getSDWAOp(Opc) == -1)
Opc = AMDGPU::getVOPe32(Opc);		Opc = AMDGPU::getVOPe32(Opc);

if (AMDGPU::getSDWAOp(Opc) == -1)		if (AMDGPU::getSDWAOp(Opc) == -1)
return false;		return false;

if (!ST.hasSDWAOmod() && TII->hasModifiersSet(MI, AMDGPU::OpName::omod))		if (!ST.hasSDWAOmod() && TII->hasModifiersSet(MI, AMDGPU::OpName::omod))
return false;		return false;

if (TII->isVOPC(Opc)) {		if (TII->isVOPC(Opc)) {
if (!ST.hasSDWASdst()) {		if (!ST.hasSDWASdst()) {
const MachineOperand *SDst = TII->getNamedOperand(MI, AMDGPU::OpName::sdst);		const MachineOperand *SDst = TII->getNamedOperand(MI, AMDGPU::OpName::sdst);
if (SDst && SDst->getReg() != AMDGPU::VCC)		if (SDst && SDst->getReg() != AMDGPU::VCC)
return false;		return false;
}		}

if (!ST.hasSDWAOutModsVOPC() &&		if (!ST.hasSDWAOutModsVOPC() &&
(TII->hasModifiersSet(MI, AMDGPU::OpName::clamp) ||		(TII->hasModifiersSet(MI, AMDGPU::OpName::clamp) ||
TII->hasModifiersSet(MI, AMDGPU::OpName::omod)))		TII->hasModifiersSet(MI, AMDGPU::OpName::omod)))
return false;		return false;

} else if (TII->getNamedOperand(MI, AMDGPU::OpName::sdst) ||		} else if (TII->getNamedOperand(MI, AMDGPU::OpName::sdst) ||
!TII->getNamedOperand(MI, AMDGPU::OpName::vdst)) {		!TII->getNamedOperand(MI, AMDGPU::OpName::vdst)) {
		arsenm (Done): You have code checking for the carry ins, but you don't handle those here.
		rampitec (Done): Right. It does not seem to be specific to just these instructions. In general any VOP3 can go through it.
		ronlieb (Done): I changed the name to pseudoOpConvertedToVOP2 to reflect that we look for a possible ADD or SUB that resulted from a previously lowered V_ADD_U64_PSEUDO or V_SUB_U64_PSEUDO. The function pseudoOpConvertedToVOP2 further validates that we have a lowered pseudo and returns true if it was able to perform the conversion. I also added comments kind of like the above ...
		ronlieb (Done): The CarryIn/Out related code is located in pseudoOpConvertedToVOP2().
return false;		return false;
}		}

if (!ST.hasSDWAMac() && (Opc == AMDGPU::V_MAC_F16_e32 ||		if (!ST.hasSDWAMac() && (Opc == AMDGPU::V_MAC_F16_e32 ||
Opc == AMDGPU::V_MAC_F32_e32))		Opc == AMDGPU::V_MAC_F32_e32))
return false;		return false;

// FIXME: has SDWA but require handling of implicit VCC use		// FIXME: has SDWA but require handling of implicit VCC use
[...]	bool SIPeepholeSDWA::runOnMachineFunction(MachineFunction &MF) {
TRI = ST.getRegisterInfo();		TRI = ST.getRegisterInfo();
TII = ST.getInstrInfo();		TII = ST.getInstrInfo();

// Find all SDWA operands in MF.		// Find all SDWA operands in MF.
bool Ret = false;		bool Ret = false;
for (MachineBasicBlock &MBB : MF) {		for (MachineBasicBlock &MBB : MF) {
bool Changed = false;		bool Changed = false;
do {		do {
		// Preprocess the ADD/SUB pairs so they could be SDWA'ed.
		// Look for a possible ADD or SUB that resulted from a previously lowered
		// V_{ADD|SUB}_U64_PSEUDO. The function pseudoOpConvertToVOP2
		// lowers the pair of instructions into e32 form.
		matchSDWAOperands(MBB);
		for (const auto &OperandPair : SDWAOperands) {
		const auto &Operand = OperandPair.second;
		MachineInstr *PotentialMI = Operand->potentialToConvert(TII);
		if (PotentialMI &&
		(PotentialMI->getOpcode() == AMDGPU::V_ADD_I32_e64 ||
		PotentialMI->getOpcode() == AMDGPU::V_SUB_I32_e64))
		pseudoOpConvertToVOP2(*PotentialMI, ST);
		}
		SDWAOperands.clear();

		// Generate potential match list.
matchSDWAOperands(MBB);		matchSDWAOperands(MBB);

for (const auto &OperandPair : SDWAOperands) {		for (const auto &OperandPair : SDWAOperands) {
const auto &Operand = OperandPair.second;		const auto &Operand = OperandPair.second;
MachineInstr *PotentialMI = Operand->potentialToConvert(TII);		MachineInstr *PotentialMI = Operand->potentialToConvert(TII);
if (PotentialMI && isConvertibleToSDWA(*PotentialMI, ST)) {		if (PotentialMI && isConvertibleToSDWA(*PotentialMI, ST)) {
PotentialMatches[PotentialMI].push_back(Operand.get());		PotentialMatches[PotentialMI].push_back(Operand.get());
}		}

test/CodeGen/AMDGPU/sdwa-op64-test.ll

This file was added.

				; RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -check-prefixes=GFX9,GCN %s
				; RUN: llc -march=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -check-prefixes=FIJI,GCN %s
				rampitec (Done): You need to add a fiji run line.

				; GCN-LABEL: {{^}}test_add_co_sdwa:
				; GFX9: v_add_co_u32_sdwa v{{[0-9]+}}, vcc, v{{[0-9]+}}, v{{[0-9]+}} dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
				; GFX9: v_addc_co_u32_e32 v{{[0-9]+}}, vcc, 0, v{{[0-9]+}}, vcc{{$}}
				; FIJI: v_add_u32_sdwa v{{[0-9]+}}, vcc, v{{[0-9]+}}, v{{[0-9]+}} dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
				; FIJI: v_addc_u32_e32 v{{[0-9]+}}, vcc, 0, v{{[0-9]+}}, vcc{{$}}
				define amdgpu_kernel void @test_add_co_sdwa(i64 addrspace(1)* %arg, i32 addrspace(1)* %arg1) #0 {
				bb:
				%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()
				%tmp3 = getelementptr inbounds i32, i32 addrspace(1)* %arg1, i32 %tmp
				%tmp4 = load i32, i32 addrspace(1)* %tmp3, align 4
				%tmp5 = and i32 %tmp4, 255
				%tmp6 = zext i32 %tmp5 to i64
				%tmp7 = getelementptr inbounds i64, i64 addrspace(1)* %arg, i32 %tmp
				%tmp8 = load i64, i64 addrspace(1)* %tmp7, align 8
				%tmp9 = add nsw i64 %tmp8, %tmp6
				store i64 %tmp9, i64 addrspace(1)* %tmp7, align 8
				ret void
				}


				; GCN-LABEL: {{^}}test_sub_co_sdwa:
				; GFX9: v_sub_co_u32_sdwa v{{[0-9]+}}, vcc, v{{[0-9]+}}, v{{[0-9]+}} dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
				; GFX9: v_subbrev_co_u32_e32 v{{[0-9]+}}, vcc, 0, v{{[0-9]+}}, vcc{{$}}
				; FIJI: v_sub_u32_sdwa v{{[0-9]+}}, vcc, v{{[0-9]+}}, v{{[0-9]+}} dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
				; FIJI: v_subbrev_u32_e32 v{{[0-9]+}}, vcc, 0, v{{[0-9]+}}, vcc{{$}}
				define amdgpu_kernel void @test_sub_co_sdwa(i64 addrspace(1)* %arg, i32 addrspace(1)* %arg1) #0 {
				bb:
				%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()
				%tmp3 = getelementptr inbounds i32, i32 addrspace(1)* %arg1, i32 %tmp
				%tmp4 = load i32, i32 addrspace(1)* %tmp3, align 4
				%tmp5 = and i32 %tmp4, 255
				%tmp6 = zext i32 %tmp5 to i64
				%tmp7 = getelementptr inbounds i64, i64 addrspace(1)* %arg, i32 %tmp
				%tmp8 = load i64, i64 addrspace(1)* %tmp7, align 8
				%tmp9 = sub nsw i64 %tmp8, %tmp6
				store i64 %tmp9, i64 addrspace(1)* %tmp7, align 8
				ret void
				}

				; GCN-LABEL: {{^}}test1_add_co_sdwa:
				; GFX9: v_add_co_u32_sdwa v{{[0-9]+}}, vcc, v{{[0-9]+}}, v{{[0-9]+}} dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
				; GFX9: v_addc_co_u32_e32 v{{[0-9]+}}, vcc, 0, v{{[0-9]+}}, vcc{{$}}
				; GFX9: v_add_co_u32_sdwa v{{[0-9]+}}, vcc, v{{[0-9]+}}, v{{[0-9]+}} dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
				; GFX9: v_addc_co_u32_e32 v{{[0-9]+}}, vcc, 0, v{{[0-9]+}}, vcc{{$}}
				; FIJI: v_add_u32_sdwa v{{[0-9]+}}, vcc, v{{[0-9]+}}, v{{[0-9]+}} dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
				; FIJI: v_addc_u32_e32 v{{[0-9]+}}, vcc, 0, v{{[0-9]+}}, vcc{{$}}
				; FIJI: v_add_u32_sdwa v{{[0-9]+}}, vcc, v{{[0-9]+}}, v{{[0-9]+}} dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:BYTE_0
				; FIJI: v_addc_u32_e32 v{{[0-9]+}}, vcc, 0, v{{[0-9]+}}, vcc{{$}}
				define amdgpu_kernel void @test1_add_co_sdwa(i64 addrspace(1)* %arg, i32 addrspace(1)* %arg1, i64 addrspace(1)* %arg2) #0 {
				bb:
				%tmp = tail call i32 @llvm.amdgcn.workitem.id.x()
				%tmp3 = getelementptr inbounds i32, i32 addrspace(1)* %arg1, i32 %tmp
				%tmp4 = load i32, i32 addrspace(1)* %tmp3, align 4
				%tmp5 = and i32 %tmp4, 255
				%tmp6 = zext i32 %tmp5 to i64
				%tmp7 = getelementptr inbounds i64, i64 addrspace(1)* %arg, i32 %tmp
				%tmp8 = load i64, i64 addrspace(1)* %tmp7, align 8
				%tmp9 = add nsw i64 %tmp8, %tmp6
				store i64 %tmp9, i64 addrspace(1)* %tmp7, align 8
				%tmp13 = getelementptr inbounds i32, i32 addrspace(1)* %arg1, i32 %tmp
				%tmp14 = load i32, i32 addrspace(1)* %tmp13, align 4
				%tmp15 = and i32 %tmp14, 255
				%tmp16 = zext i32 %tmp15 to i64
				%tmp17 = getelementptr inbounds i64, i64 addrspace(1)* %arg2, i32 %tmp
				%tmp18 = load i64, i64 addrspace(1)* %tmp17, align 8
				%tmp19 = add nsw i64 %tmp18, %tmp16
				store i64 %tmp19, i64 addrspace(1)* %tmp17, align 8
				ret void
				}

				declare i32 @llvm.amdgcn.workitem.id.x()
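
What these tests exercise can be modelled in plain integer arithmetic. The following is a minimal sketch, assuming nothing beyond standard C++; the function names are hypothetical and model only the arithmetic, not the instruction encodings. The 64-bit add is split into a low 32-bit add that produces a carry and a high add that consumes it, and `src1_sel:BYTE_0` is modelled as the `and ... 255` that the SDWA form absorbs.

```cpp
#include <cstdint>

// Low half: 32-bit add that also reports the carry out (the role VCC plays).
inline uint32_t addLo(uint32_t A, uint32_t B, bool &Carry) {
  uint32_t Sum = A + B;
  Carry = Sum < A; // unsigned wraparound means a carry occurred
  return Sum;
}

// High half: 32-bit add with carry in.
inline uint32_t addHi(uint32_t A, uint32_t B, bool Carry) {
  return A + B + (Carry ? 1u : 0u);
}

// Models src1_sel:BYTE_0: only the lowest byte of the operand is used.
inline uint32_t byte0(uint32_t V) { return V & 0xFFu; }

// Computes X + zext(Y & 255) as a lo/hi pair of 32-bit operations.
inline uint64_t add64ViaPair(uint64_t X, uint32_t Y) {
  bool Carry = false;
  uint32_t Lo = addLo(static_cast<uint32_t>(X), byte0(Y), Carry);
  // The zero-extended operand contributes 0 to the high half.
  uint32_t Hi = addHi(static_cast<uint32_t>(X >> 32), 0u, Carry);
  return (static_cast<uint64_t>(Hi) << 32) | Lo;
}
```

The constant 0 passed to `addHi` corresponds to the literal `0` operand that the CHECK lines expect on the add-with-carry instruction.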

test/CodeGen/AMDGPU/sdwa-ops.mir

This file was added.

				# RUN: llc -march=amdgcn -mcpu=gfx900 -verify-machineinstrs -run-pass=si-peephole-sdwa -o - %s | FileCheck -check-prefix=GFX9 %s
				# RUN: llc -march=amdgcn -mcpu=fiji -verify-machineinstrs -run-pass=si-peephole-sdwa -o - %s | FileCheck -check-prefix=GFX9 %s
				rampitec (Not Done): I do not see a test with modifiers.
				ronlieb (Done): I am having difficulty trying to construct a V_ADD_I32 or V_ADDC_U32 instruction with abs or neg modifiers. In particular, the gfx9 architecture reference has comments like the following regarding input modifiers for VOP1, VOP2, VOP3: "In general, negation and absolute value are only supported for floating point input operands (operands with a type of F16, F32, or F64); they are not supported for integer or untyped inputs." Do you know of a MIR example which has modifiers?
				rampitec (Done): Ah, right. You are only tracking these two int instructions.

				# test for 3 consecutive _sdwa's
				# GFX9-LABEL: name: test1_add_co_sdwa
				# GFX9: V_ADD_I32_sdwa
				# GFX9-NEXT: V_ADDC_U32_e32
				# GFX9: V_ADD_I32_sdwa
				# GFX9-NEXT: V_ADDC_U32_e32
				# GFX9: V_ADD_I32_sdwa
				# GFX9-NEXT: V_ADDC_U32_e32
				---
				name: test1_add_co_sdwa
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32, preferred-register: '' }
				liveins:
				- { reg: '$vgpr0', virtual-reg: '%0' }
				- { reg: '$sgpr0_sgpr1', virtual-reg: '%1' }
				body: |
				bb.0:
				liveins: $vgpr0, $sgpr0_sgpr1

				%1:sgpr_64 = COPY $sgpr0_sgpr1
				%0:vgpr_32 = COPY $vgpr0
				%22:sreg_32_xm0 = S_MOV_B32 255
				%23:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%30:vreg_64 = COPY $sgpr0_sgpr1
				%63:vgpr_32, %65:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %23, implicit $exec
				%64:vgpr_32, dead %66:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, killed %65, implicit $exec
				%62:vreg_64 = REG_SEQUENCE %63, %subreg.sub0, %64, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %30, %62, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)

				%161:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%163:vgpr_32, %165:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %161, implicit $exec
				%164:vgpr_32, dead %166:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, killed %165, implicit $exec
				%162:vreg_64 = REG_SEQUENCE %163, %subreg.sub0, %164, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %30, %162, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)

				%171:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%173:vgpr_32, %175:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %171, implicit $exec
				%174:vgpr_32, dead %176:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, killed %175, implicit $exec
				%172:vreg_64 = REG_SEQUENCE %173, %subreg.sub0, %174, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %30, %172, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)

				...

				# test for VCC interference on sdwa, should generate 1 xform only
				# GFX9-LABEL: name: test2_add_co_sdwa
				# GFX9: V_ADD_I32_sdwa
				# GFX9: V_ADDC_U32_e32
				# GFX9-NOT: V_ADD_I32_sdwa
				# GFX9-NOT: V_ADDC_U32_e32
				---
				name: test2_add_co_sdwa
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32, preferred-register: '' }
				liveins:
				- { reg: '$vgpr0', virtual-reg: '%0' }
				- { reg: '$sgpr0_sgpr1', virtual-reg: '%1' }
				body: |
				bb.0:
				liveins: $vgpr0, $sgpr0_sgpr1

				%1:sgpr_64 = COPY $sgpr0_sgpr1
				%0:vgpr_32 = COPY $vgpr0
				%22:sreg_32_xm0 = S_MOV_B32 255
				%23:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%30:vreg_64 = COPY $sgpr0_sgpr1
				%63:vgpr_32, %65:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %23, implicit $exec

				%161:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%163:vgpr_32, %165:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %161, implicit $exec
				%164:vgpr_32, dead %166:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, killed %165, implicit $exec
				%162:vreg_64 = REG_SEQUENCE %163, %subreg.sub0, %164, %subreg.sub1

				%64:vgpr_32, dead %66:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, killed %65, implicit $exec
				%62:vreg_64 = REG_SEQUENCE %63, %subreg.sub0, %64, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %30, %62, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)

				%161:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%163:vgpr_32, %165:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %161, implicit $exec
				%164:vgpr_32, dead %166:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, killed %165, implicit $exec
				%162:vreg_64 = REG_SEQUENCE %163, %subreg.sub0, %164, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %30, %162, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)

				...

				# test for CarryOut used, should reject
				# GFX9-LABEL: name: test3_add_co_sdwa
				# GFX9: V_ADD_I32_e64
				# GFX9: V_ADDC_U32_e64
				# GFX9-NOT: V_ADD_I32_sdwa
				# GFX9-NOT: V_ADDC_U32_e32
				---
				name: test3_add_co_sdwa
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32, preferred-register: '' }
				liveins:
				- { reg: '$vgpr0', virtual-reg: '%0' }
				- { reg: '$sgpr0_sgpr1', virtual-reg: '%1' }
				body: |
				bb.0:
				liveins: $vgpr0, $sgpr0_sgpr1

				%1:sgpr_64 = COPY $sgpr0_sgpr1
				%0:vgpr_32 = COPY $vgpr0
				%22:sreg_32_xm0 = S_MOV_B32 255
				%23:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%30:vreg_64 = COPY $sgpr0_sgpr1
				%63:vgpr_32, %65:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %23, implicit $exec
				%64:vgpr_32, %66:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, killed %65, implicit $exec
				%62:vreg_64 = REG_SEQUENCE %63, %subreg.sub0, %66, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %30, %62, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)

				...

				# test for CarryIn used more than once, should reject
				# GFX9-LABEL: name: test4_add_co_sdwa
				# GFX9: V_ADD_I32_e64
				# GFX9: V_ADDC_U32_e64
				# GFX9-NOT: V_ADD_I32_sdwa
				# GFX9-NOT: V_ADDC_U32_e32
				---
				name: test4_add_co_sdwa
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32, preferred-register: '' }
				liveins:
				- { reg: '$vgpr0', virtual-reg: '%0' }
				- { reg: '$sgpr0_sgpr1', virtual-reg: '%1' }
				body: |
				bb.0:
				liveins: $vgpr0, $sgpr0_sgpr1

				%1:sgpr_64 = COPY $sgpr0_sgpr1
				%0:vgpr_32 = COPY $vgpr0
				%22:sreg_32_xm0 = S_MOV_B32 255
				%23:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%30:vreg_64 = COPY $sgpr0_sgpr1
				%63:vgpr_32, %65:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %23, implicit $exec
				%64:vgpr_32, %66:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, %65, implicit $exec
				%62:vreg_64 = REG_SEQUENCE %63, %subreg.sub0, %65, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %30, %62, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)


				...

				# test for simple example, should generate sdwa
				# GFX9-LABEL: name: test5_add_co_sdwa
				# GFX9: V_ADD_I32_sdwa
				# GFX9: V_ADDC_U32_e32
				---
				name: test5_add_co_sdwa
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32, preferred-register: '' }
				liveins:
				- { reg: '$vgpr0', virtual-reg: '%0' }
				- { reg: '$sgpr0_sgpr1', virtual-reg: '%1' }
				body: |
				bb.0:
				liveins: $vgpr0, $sgpr0_sgpr1

				%1:sgpr_64 = COPY $sgpr0_sgpr1
				%0:vgpr_32 = COPY $vgpr0
				%22:sreg_32_xm0 = S_MOV_B32 255
				%23:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%30:vreg_64 = COPY $sgpr0_sgpr1
				%63:vgpr_32, %65:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %23, implicit $exec
				%64:vgpr_32, %66:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, %65, implicit $exec
				%62:vreg_64 = REG_SEQUENCE %63, %subreg.sub0, %64, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %30, %62, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)


				...

				# test for V_ADD_I32_e64 only, should reject
				# GFX9-LABEL: name: test6_add_co_sdwa
				# GFX9: V_ADD_I32_e64
				# GFX9-NOT: V_ADD_I32_sdwa
				# GFX9-NOT: V_ADDC_U32_e32
				---
				name: test6_add_co_sdwa
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32, preferred-register: '' }
				liveins:
				- { reg: '$vgpr0', virtual-reg: '%0' }
				- { reg: '$sgpr0_sgpr1', virtual-reg: '%1' }
				body: |
				bb.0:
				liveins: $vgpr0, $sgpr0_sgpr1

				%1:sgpr_64 = COPY $sgpr0_sgpr1
				%0:vgpr_32 = COPY $vgpr0
				%22:sreg_32_xm0 = S_MOV_B32 255
				%23:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%30:vreg_64 = COPY $sgpr0_sgpr1
				%63:vgpr_32, %65:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %23, implicit $exec
				%62:vreg_64 = REG_SEQUENCE %63, %subreg.sub0, %23, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %30, %62, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)


				...

				# test for V_ADDC_U32_e64 only, should reject
				# GFX9-LABEL: name: test7_add_co_sdwa
				# GFX9: V_ADDC_U32_e64
				# GFX9-NOT: V_ADD_I32_sdwa
				# GFX9-NOT: V_ADDC_U32_e32
				---
				name: test7_add_co_sdwa
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32, preferred-register: '' }
				liveins:
				- { reg: '$vgpr0', virtual-reg: '%0' }
				- { reg: '$sgpr0_sgpr1', virtual-reg: '%1' }
				body: |
				bb.0:
				liveins: $vgpr0, $sgpr0_sgpr1

				%1:sgpr_64 = COPY $sgpr0_sgpr1
				%0:vgpr_32 = COPY $vgpr0
				%22:sreg_32_xm0 = S_MOV_B32 255
				%23:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%24:sreg_64_xexec = COPY $sgpr0_sgpr1

				%30:vreg_64 = COPY $sgpr0_sgpr1
				%64:vgpr_32, %66:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, %24, implicit $exec
				%62:vreg_64 = REG_SEQUENCE %23, %subreg.sub0, %23, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %30, %62, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)


				...

				# test for $vcc defined between two adds, should not generate
				# GFX9-LABEL: name: test8_add_co_sdwa
				# GFX9-NOT: V_ADD_I32_sdwa
				# GFX9: V_ADDC_U32_e64
				---
				name: test8_add_co_sdwa
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32, preferred-register: '' }
				liveins:
				- { reg: '$vgpr0', virtual-reg: '%0' }
				- { reg: '$sgpr0_sgpr1', virtual-reg: '%1' }
				body: |
				bb.0:
				liveins: $vgpr0, $sgpr0_sgpr1

				%1:sgpr_64 = COPY $sgpr0_sgpr1
				%0:vgpr_32 = COPY $vgpr0
				%22:sreg_32_xm0 = S_MOV_B32 255
				%23:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%30:vreg_64 = COPY $sgpr0_sgpr1
				%63:vgpr_32, %65:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %23, implicit $exec
				$vcc = COPY %30
				%64:vgpr_32, %66:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, %65, implicit $exec
				%31:vreg_64 = COPY $vcc
				%62:vreg_64 = REG_SEQUENCE %63, %subreg.sub0, %64, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %31, %62, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)


				...

				# test for non dead $vcc, should not generate
				# GFX9-LABEL: name: test9_add_co_sdwa
				# GFX9-NOT: V_ADD_I32_sdwa
				# GFX9: V_ADDC_U32_e64
				---
				name: test9_add_co_sdwa
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32, preferred-register: '' }
				liveins:
				- { reg: '$vgpr0', virtual-reg: '%0' }
				- { reg: '$sgpr0_sgpr1', virtual-reg: '%1' }
				body: |
				bb.0:
				liveins: $vgpr0, $sgpr0_sgpr1

				%1:sgpr_64 = COPY $sgpr0_sgpr1
				%0:vgpr_32 = COPY $vgpr0
				%22:sreg_32_xm0 = S_MOV_B32 255
				%30:vreg_64 = COPY $sgpr0_sgpr1
				$vcc = COPY %30
				%23:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%63:vgpr_32, %65:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %23, implicit $exec
				%64:vgpr_32, %66:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, %65, implicit $exec
				%31:vreg_64 = COPY $vcc
				%62:vreg_64 = REG_SEQUENCE %63, %subreg.sub0, %64, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %31, %62, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)

				...

				# test for def $vcc_lo, should not generate
				# GFX9-LABEL: name: test10_add_co_sdwa
				# GFX9-NOT: V_ADD_I32_sdwa
				# GFX9: V_ADDC_U32_e64
				---
				name: test10_add_co_sdwa
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32, preferred-register: '' }
				liveins:
				- { reg: '$vgpr0', virtual-reg: '%0' }
				- { reg: '$sgpr0_sgpr1', virtual-reg: '%1' }
				body: |
				bb.0:
				liveins: $vgpr0, $sgpr0_sgpr1

				%1:sgpr_64 = COPY $sgpr0_sgpr1
				%0:vgpr_32 = COPY $vgpr0
				%22:sreg_32_xm0 = S_MOV_B32 255
				%30:vreg_64 = COPY $sgpr0_sgpr1
				$vcc_lo = COPY %30.sub0
				%23:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%63:vgpr_32, %65:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %23, implicit $exec
				%31:vgpr_32 = COPY $vcc_lo
				%32:vreg_64 = REG_SEQUENCE %31, %subreg.sub0, %23, %subreg.sub1
				%64:vgpr_32, %66:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, %65, implicit $exec
				%62:vreg_64 = REG_SEQUENCE %63, %subreg.sub0, %64, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %32, %62, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)

				...

				# test for read $vcc_hi, should not generate
				# GFX9-LABEL: name: test11_add_co_sdwa
				# GFX9-NOT: V_ADD_I32_sdwa
				# GFX9: V_ADDC_U32_e64
				---
				name: test11_add_co_sdwa
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32, preferred-register: '' }
				liveins:
				- { reg: '$vgpr0', virtual-reg: '%0' }
				- { reg: '$sgpr0_sgpr1', virtual-reg: '%1' }
				body: |
				bb.0:
				liveins: $vgpr0, $sgpr0_sgpr1

				%1:sgpr_64 = COPY $sgpr0_sgpr1
				%0:vgpr_32 = COPY $vgpr0
				%22:sreg_32_xm0 = S_MOV_B32 255
				%30:vreg_64 = COPY $sgpr0_sgpr1
				$vcc_hi = COPY %30.sub0
				%23:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%63:vgpr_32, %65:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %23, implicit $exec
				%31:vgpr_32 = COPY $vcc_hi
				rampitec: I wonder why the verifier does not complain about this instruction. Ah, I see. Please add -verify-machineinstrs to the run lines. Anyway, you need a test for what you are checking in the code: a vcc def in between the two instructions.
				ronlieb: Good catch: adding -verify-machineinstrs. See line 364; this has a def of $vcc between the ADD and ADDC. Is that what you are suggesting?
				ronlieb: Sorry, I meant line 262.
				rampitec: It does not help. You need a test where vcc is defined and killed in between MI and MISucc.
				ronlieb: Added subtest test12_add_co_sdwa ("test for $vcc defined and used between adds, should not generate", GFX9-LABEL: name: test12_add_co_sdwa).
				rampitec: I still do not see how it is legal to copy a 32-bit register into a 64-bit one.
				ronlieb: Fixed it at lines 351,354, and also above at lines 321,325.
				%32:vreg_64 = REG_SEQUENCE %31, %subreg.sub0, %23, %subreg.sub1
				%64:vgpr_32, %66:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, %65, implicit $exec
				%62:vreg_64 = REG_SEQUENCE %63, %subreg.sub0, %64, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %32, %62, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)

				...

				# test for $vcc defined and used between adds, should not generate
				# GFX9-LABEL: name: test12_add_co_sdwa
				# GFX9-NOT: V_ADD_I32_sdwa
				# GFX9: V_ADDC_U32_e64
				---
				name: test12_add_co_sdwa
				tracksRegLiveness: true
				registers:
				- { id: 0, class: vgpr_32, preferred-register: '' }
				liveins:
				- { reg: '$vgpr0', virtual-reg: '%0' }
				- { reg: '$sgpr0_sgpr1', virtual-reg: '%1' }
				body: |
				bb.0:
				liveins: $vgpr0, $sgpr0_sgpr1

				%1:sgpr_64 = COPY $sgpr0_sgpr1
				%0:vgpr_32 = COPY $vgpr0
				%22:sreg_32_xm0 = S_MOV_B32 255
				%30:vreg_64 = COPY $sgpr0_sgpr1
				%23:vgpr_32 = V_AND_B32_e32 %22, %0, implicit $exec
				%63:vgpr_32, %65:sreg_64_xexec = V_ADD_I32_e64 %30.sub0, %23, implicit $exec
				$vcc = COPY %30
				%31:vreg_64 = COPY killed $vcc
				%64:vgpr_32, %66:sreg_64_xexec = V_ADDC_U32_e64 %30.sub1, %0, %65, implicit $exec
				rampitec: killed $vcc
				%62:vreg_64 = REG_SEQUENCE %63, %subreg.sub0, %64, %subreg.sub1
				GLOBAL_STORE_DWORDX2_SADDR %31, %62, %1, 0, 0, 0, implicit $exec, implicit $exec :: (store 8)
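
The rejection cases in test8 through test12 all come down to one question: can $vcc carry the value from the add to the addc without disturbing anything else? The following is a minimal sketch of that idea as a self-contained toy model. It is not the code from SIPeepholeSDWA.cpp and does not use the LLVM API; `ToyInstr`, `touchesVcc`, and `canUseVccForCarry` are hypothetical names introduced only for illustration, and the model ignores block live-outs and other details a real liveness query would handle.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy stand-in for a machine instruction: an opcode plus the physical
// registers it defines and reads. Hypothetical; not LLVM's MachineInstr.
struct ToyInstr {
    std::string opcode;
    std::set<std::string> defs;
    std::set<std::string> uses;
};

// True if the set names $vcc or either of its 32-bit halves.
static bool touchesVcc(const std::set<std::string> &regs) {
    return regs.count("$vcc") > 0 || regs.count("$vcc_lo") > 0 ||
           regs.count("$vcc_hi") > 0;
}

// Simplified legality check (an assumption of this sketch, not the patch):
// the carry may be routed through $vcc from block[addIdx] to block[addcIdx]
// only if
//   1. no instruction strictly between them defines or reads $vcc, and
//   2. no instruction after the addc reads $vcc before it is fully
//      redefined, i.e. no older $vcc value has to survive the pair.
bool canUseVccForCarry(const std::vector<ToyInstr> &block,
                       std::size_t addIdx, std::size_t addcIdx) {
    assert(addIdx < addcIdx && addcIdx < block.size());
    for (std::size_t i = addIdx + 1; i < addcIdx; ++i)
        if (touchesVcc(block[i].defs) || touchesVcc(block[i].uses))
            return false;
    for (std::size_t i = addcIdx + 1; i < block.size(); ++i) {
        if (touchesVcc(block[i].uses))
            return false; // an earlier $vcc value is still live here
        if (block[i].defs.count("$vcc") > 0)
            break; // full redefinition: anything later sees the new value
    }
    return true;
}
```

In this model, a block shaped like test12 (a `$vcc` def and a killed use between the two adds) fails condition 1, while a block shaped like test9 (a `$vcc` def before the add and a read after the addc) fails condition 2.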

Revision Contents

Diff 176308
