Page MenuHomePhabricator

[AMDGPU] Add sdwa support for ADD|SUB U64 decomposed Pseudos
ClosedPublic

Authored by ronlieb on Nov 25 2018, 3:50 PM.

Details

Summary

The introduction of S_{ADD|SUB}_U64_PSEUDO instructions which are decomposed
into VOP3 instruction pairs for S_ADD_U64_PSEUDO:

V_ADD_I32_e64
V_ADDC_U32_e64

and for S_SUB_U64_PSEUDO

V_SUB_I32_e64
V_SUBB_U32_e64

preclude the use of SDWA to encode a constant.
SDWA: Sub-Dword addressing is supported on VOP1 and VOP2 instructions,
but not on VOP3 instructions.

We desire to fold the bit-and operand into the instruction encoding
for the V_ADD_I32 instruction. This requires that we transform the
VOP3 into a VOP2 form of the instruction (_e32).

%19:vgpr_32 = V_AND_B32_e32 255,
    killed %16:vgpr_32, implicit $exec
%47:vgpr_32, %49:sreg_64_xexec = V_ADD_I32_e64
    %26.sub0:vreg_64, %19:vgpr_32, implicit $exec
%48:vgpr_32, dead %50:sreg_64_xexec = V_ADDC_U32_e64
    %26.sub1:vreg_64, %54:vgpr_32, killed %49:sreg_64_xexec, implicit $exec

which then allows the SDWA encoding and becomes

%47:vgpr_32 = V_ADD_I32_sdwa
    0, %26.sub0:vreg_64, 0, killed %16:vgpr_32, 0, 6, 0, 6, 0,
    implicit-def $vcc, implicit $exec
%48:vgpr_32 = V_ADDC_U32_e32
    0, %26.sub1:vreg_64, implicit-def $vcc, implicit $vcc, implicit $exec

Diff Detail

Repository
rL LLVM

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

A MIR test would be more useful for checking the carry in cases

rampitec marked an inline comment as done.Nov 26 2018, 8:49 AM
rampitec added inline comments.
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
882 ↗(On Diff #175187)

Having a VOP2 pseudo does not necessarily mean there is target VOP2 instruction. I would suggest calling pseudoToMCOpcode() in addition on that VOP2 opcode.

897 ↗(On Diff #175187)

Same here.

900 ↗(On Diff #175187)

This is not SDWADesc, it is just VOP2Desc.

922 ↗(On Diff #175187)

You need to check for modifiers which you are going to drop by this conversion. For instance incoming instruction can be VOP3 OpSel. If it is OpSel you need to check that OpSel modifiers are trivial (e.g. have no effect and equivalent to VOP2). The same about VOP3 modifiers abs and neg.

You also need a mir test to check all negative cases when you cannot fold.

961–962 ↗(On Diff #175187)

Right. It does not seem to be specific to just these instructions. In general any VOP3 can go through it.

test/CodeGen/AMDGPU/sdwa-op64-test.ll
1 ↗(On Diff #175187)

You need to add fiji run line.

Essentially this is a limited version of shrinking. So I have several questions:

  1. Why not to run shrink pass before sdwa instead?
  2. If we want to shrink individual instructions, should not we just add SIInstrInfo::shrink() interface and call it from SIShrinkInstructions as well?
  3. A the very least checks needs to use SIInstrInfo::canShrink() instead of many and insufficient checks here. For instance this code does not check for valid register classes which SIInstrInfo::canShrink() does. We really need to stop cloning that code.
  4. In general SIInstrInfo::canShrink() should be extended to handle OpSel (via SIInstrInfo::hasModifiers()).
  5. SIInstrInfo::canShrink() shall call SIInstrInfo::hasVALU32BitEncoding() to check for presence of target instruction.
ronlieb marked 10 inline comments as done.Nov 26 2018, 5:31 PM

New patch arriving momentarily ...

lib/Target/AMDGPU/SIPeepholeSDWA.cpp
922 ↗(On Diff #175187)

leaving this one open, to work on the MIR test, and think about what modifiers might affect the V_ADD and B_SUB instructions.

961–962 ↗(On Diff #175187)

The CarryIn/Out related code is located in pseudoOpConvertedToVOP2()

961–962 ↗(On Diff #175187)

i changed the name to pseudoOpConvertedToVop2 to reflect that
we look for a possible ADD or SUB that resulted from a previously lowered
V_ADD_U64_PSEUDO or V_SUB_U64_PSEUDO. The function pseudoOpConvertedToVOP2 further validates that we have a lowered pseudo and returns true if it was able to perform the conversion.

also added comments kind of like the above ...

ronlieb updated this revision to Diff 175369.Nov 26 2018, 5:36 PM
ronlieb marked an inline comment as done.

New patch addressing many (but not all) of the review comments.
Will look into the shrink related comments soon ...

Essentially this is a limited version of shrinking. So I have several questions:

  1. Why not to run shrink pass before sdwa instead?

I tried adding Shrink pass before PeepholeSDWA and observed 88 lit test failures.
i tried moving Shrink pass before Peephole SDWA and observed 25 lit test failures

  1. If we want to shrink individual instructions, should not we just add SIInstrInfo::shrink() interface and call it from SIShrinkInstructions as well?

not sure ...

  1. A the very least checks needs to use SIInstrInfo::canShrink() instead of many and insufficient checks here. For instance this code does not check for valid register classes which SIInstrInfo::canShrink() does. We really need to stop cloning that code.

Great suggestion, i added calls to canShrink to replace the exsisting approach. Thanks for pointing that out.

  1. In general SIInstrInfo::canShrink() should be extended to handle OpSel (via SIInstrInfo::hasModifiers()).

I am not really trying to handle OpSel, and i have modified the current patch to preclude it. We could open another defect to look into this later.

  1. SIInstrInfo::canShrink() shall call SIInstrInfo::hasVALU32BitEncoding() to check for presence of target instruction.

I added this to canShrink and i think it helps clean up the patch, good suggestion.

New patch coming soon...
still need to construct an MIR test for negative testing.

Essentially this is a limited version of shrinking. So I have several questions:

  1. Why not to run shrink pass before sdwa instead?

I tried adding Shrink pass before PeepholeSDWA and observed 88 lit test failures.
i tried moving Shrink pass before Peephole SDWA and observed 25 lit test failures

Which may be a good thing if these failures are progressions (as I suspect) and not regressions. Are they progressions?
That is the point of other comments too, this patch is limited to handle just two instructions while there is a clear possibility to do it for almost any VOP3.

Essentially this is a limited version of shrinking. So I have several questions:

  1. Why not to run shrink pass before sdwa instead?

I tried adding Shrink pass before PeepholeSDWA and observed 88 lit test failures.
i tried moving Shrink pass before Peephole SDWA and observed 25 lit test failures

Which may be a good thing if these failures are progressions (as I suspect) and not regressions. Are they progressions?
That is the point of other comments too, this patch is limited to handle just two instructions while there is a clear possibility to do it for almost any VOP3.

I would also assume many of these failures are just commute which is attempted by shrink pass. That is normal and would only need to change the tests.

ronlieb updated this revision to Diff 175470.Nov 27 2018, 6:28 AM

Incorporated changes for Some of the Shrink suggestions.
Still need to do an MIR test.
Also, investigate if moving/adding Shrink pass results in test progressions or regressions

Essentially this is a limited version of shrinking. So I have several questions:

  1. Why not to run shrink pass before sdwa instead?

I tried adding Shrink pass before PeepholeSDWA and observed 88 lit test failures.
i tried moving Shrink pass before Peephole SDWA and observed 25 lit test failures

Which may be a good thing if these failures are progressions (as I suspect) and not regressions. Are they progressions?
That is the point of other comments too, this patch is limited to handle just two instructions while there is a clear possibility to do it for almost any VOP3.

I would also assume many of these failures are just commute which is attempted by shrink pass. That is normal and would only need to change the tests.

i tried an experiment of simply invoking the Shrink pass a 2nd time.

addPass(createSIShrinkInstructionsPass());
addPass(createSIShrinkInstructionsPass());

which resulted in 74 failures, and they do seem to be commute changes primarily (did not look at them all)
So then, i added a 3rd invocation and zero failures (i'm still laughing at this one).

Essentially this is a limited version of shrinking. So I have several questions:

  1. Why not to run shrink pass before sdwa instead?

I tried adding Shrink pass before PeepholeSDWA and observed 88 lit test failures.
i tried moving Shrink pass before Peephole SDWA and observed 25 lit test failures

Which may be a good thing if these failures are progressions (as I suspect) and not regressions. Are they progressions?
That is the point of other comments too, this patch is limited to handle just two instructions while there is a clear possibility to do it for almost any VOP3.

I would also assume many of these failures are just commute which is attempted by shrink pass. That is normal and would only need to change the tests.

i tried an experiment of simply invoking the Shrink pass a 2nd time.

addPass(createSIShrinkInstructionsPass());
addPass(createSIShrinkInstructionsPass());

which resulted in 74 failures, and they do seem to be commute changes primarily (did not look at them all)
So then, i added a 3rd invocation and zero failures (i'm still laughing at this one).

So it must be commute. I guess you just need to add a new shrink pass before sdwa. Does it help to deal with these two instructions, e.g. does it help you lit test?

If yes there are two options, either:

  1. Revert the commute in shrink pass if it did not help.
  2. Just update tests.

Essentially this is a limited version of shrinking. So I have several questions:

  1. Why not to run shrink pass before sdwa instead?

I tried adding Shrink pass before PeepholeSDWA and observed 88 lit test failures.
i tried moving Shrink pass before Peephole SDWA and observed 25 lit test failures

Which may be a good thing if these failures are progressions (as I suspect) and not regressions. Are they progressions?
That is the point of other comments too, this patch is limited to handle just two instructions while there is a clear possibility to do it for almost any VOP3.

I would also assume many of these failures are just commute which is attempted by shrink pass. That is normal and would only need to change the tests.

i tried an experiment of simply invoking the Shrink pass a 2nd time.

addPass(createSIShrinkInstructionsPass());
addPass(createSIShrinkInstructionsPass());

which resulted in 74 failures, and they do seem to be commute changes primarily (did not look at them all)
So then, i added a 3rd invocation and zero failures (i'm still laughing at this one).

So it must be commute. I guess you just need to add a new shrink pass before sdwa. Does it help to deal with these two instructions, e.g. does it help you lit test?

If yes there are two options, either:

  1. Revert the commute in shrink pass if it did not help.
  2. Just update tests.

Adding the shrink pass just before Peephole SDWA does not help the lit test, it made no difference.
I think at this point i should proceed with adding the MIR test to verify when we cannot fold.

Adding the shrink pass just before Peephole SDWA does not help the lit test, it made no difference.
I think at this point i should proceed with adding the MIR test to verify when we cannot fold.

Do you know why was it unable to shrink these instructions?

Adding the shrink pass just before Peephole SDWA does not help the lit test, it made no difference.
I think at this point i should proceed with adding the MIR test to verify when we cannot fold.

Do you know why was it unable to shrink these instructions?

SIInstrInfo::splitScalar64BitAddSub converts the S_ADD_U64_PSEUDO into the two add instructions which use the SReg_64_XEXECRegClass instead of VCC.
Later when SIShrinkInstructions::runOnMachineFunction pass runs, it sees that the Carry regs are not VCC and simply marks them with a hint to later convert to VCC ,
and then continues without doing a transformation.

if (SDst) {
  if (SDst->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(SDst->getReg()))
      MRI.setRegAllocationHint(SDst->getReg(), 0, AMDGPU::VCC);
    continue;
  }

  // All of the instructions with carry outs also have an SGPR input in
  // src2.
  if (Src2 && Src2->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(Src2->getReg()))
      MRI.setRegAllocationHint(Src2->getReg(), 0, AMDGPU::VCC);

    continue;
  }
}

Adding the shrink pass just before Peephole SDWA does not help the lit test, it made no difference.
I think at this point i should proceed with adding the MIR test to verify when we cannot fold.

Do you know why was it unable to shrink these instructions?

SIInstrInfo::splitScalar64BitAddSub converts the S_ADD_U64_PSEUDO into the two add instructions which use the SReg_64_XEXECRegClass instead of VCC.
Later when SIShrinkInstructions::runOnMachineFunction pass runs, it sees that the Carry regs are not VCC and simply marks them with a hint to later convert to VCC ,
and then continues without doing a transformation.

if (SDst) {
  if (SDst->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(SDst->getReg()))
      MRI.setRegAllocationHint(SDst->getReg(), 0, AMDGPU::VCC);
    continue;
  }
 
  // All of the instructions with carry outs also have an SGPR input in
  // src2.
  if (Src2 && Src2->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(Src2->getReg()))
      MRI.setRegAllocationHint(Src2->getReg(), 0, AMDGPU::VCC);
 
    continue;
  }
}

OK, that makes sense. It just leaves it to post-RA shrink that way. Probably shrink pass could do the same, but it is not desirable as it limits scheduling opportunities.

But I see the problem in your code now: you do not check that vcc is not clobbered or used in between of two instructions.
I also think you need to shrink both instructions, otherwise you have carry-in of addc and carry-out of add in different registers, which just happen to be allocated to the same vcc. Note, that isConvertibleToSDWA() returning true does not guarantee final sdwa conversion, so you can end up with vop3 form for the first instruction anyway.

Adding the shrink pass just before Peephole SDWA does not help the lit test, it made no difference.
I think at this point i should proceed with adding the MIR test to verify when we cannot fold.

Do you know why was it unable to shrink these instructions?

SIInstrInfo::splitScalar64BitAddSub converts the S_ADD_U64_PSEUDO into the two add instructions which use the SReg_64_XEXECRegClass instead of VCC.
Later when SIShrinkInstructions::runOnMachineFunction pass runs, it sees that the Carry regs are not VCC and simply marks them with a hint to later convert to VCC ,
and then continues without doing a transformation.

if (SDst) {
  if (SDst->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(SDst->getReg()))
      MRI.setRegAllocationHint(SDst->getReg(), 0, AMDGPU::VCC);
    continue;
  }
 
  // All of the instructions with carry outs also have an SGPR input in
  // src2.
  if (Src2 && Src2->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(Src2->getReg()))
      MRI.setRegAllocationHint(Src2->getReg(), 0, AMDGPU::VCC);
 
    continue;
  }
}

OK, that makes sense. It just leaves it to post-RA shrink that way. Probably shrink pass could do the same, but it is not desirable as it limits scheduling opportunities.

But I see the problem in your code now: you do not check that vcc is not clobbered or used in between of two instructions.
I also think you need to shrink both instructions, otherwise you have carry-in of addc and carry-out of add in different registers, which just happen to be allocated to the same vcc. Note, that isConvertibleToSDWA() returning true does not guarantee final sdwa conversion, so you can end up with vop3 form for the first instruction anyway.

i think the following code does make sure that there are no intervening uses, i can also strengthen it to make sure the defining instruction of the CarryIn is the first ADD instruction.
+ if (!MRI->hasOneUse(CarryIn->getReg()) || !MRI->use_empty(CarryOut->getReg()))
+ return false;

Regarding the need to shrink both 'add' instructions, yes i can see where that is needed.
I added additional code to shrink the 1st add as well, and i had to restructure it a bit to lower the e64's just before doing the SDWA's.

Revised patch in transit.

Adding the shrink pass just before Peephole SDWA does not help the lit test, it made no difference.
I think at this point i should proceed with adding the MIR test to verify when we cannot fold.

Do you know why was it unable to shrink these instructions?

SIInstrInfo::splitScalar64BitAddSub converts the S_ADD_U64_PSEUDO into the two add instructions which use the SReg_64_XEXECRegClass instead of VCC.
Later when SIShrinkInstructions::runOnMachineFunction pass runs, it sees that the Carry regs are not VCC and simply marks them with a hint to later convert to VCC ,
and then continues without doing a transformation.

if (SDst) {
  if (SDst->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(SDst->getReg()))
      MRI.setRegAllocationHint(SDst->getReg(), 0, AMDGPU::VCC);
    continue;
  }
 
  // All of the instructions with carry outs also have an SGPR input in
  // src2.
  if (Src2 && Src2->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(Src2->getReg()))
      MRI.setRegAllocationHint(Src2->getReg(), 0, AMDGPU::VCC);
 
    continue;
  }
}

OK, that makes sense. It just leaves it to post-RA shrink that way. Probably shrink pass could do the same, but it is not desirable as it limits scheduling opportunities.

But I see the problem in your code now: you do not check that vcc is not clobbered or used in between of two instructions.
I also think you need to shrink both instructions, otherwise you have carry-in of addc and carry-out of add in different registers, which just happen to be allocated to the same vcc. Note, that isConvertibleToSDWA() returning true does not guarantee final sdwa conversion, so you can end up with vop3 form for the first instruction anyway.

i think the following code does make sure that there are no intervening uses, i can also strengthen it to make sure the defining instruction of the CarryIn is the first ADD instruction.
+ if (!MRI->hasOneUse(CarryIn->getReg()) || !MRI->use_empty(CarryOut->getReg()))
+ return false;

It does not. It only checks that original sreg is not used. However you are replacing original carry sreg with vcc by shrinking instruction, and you do not check vcc uses.

ronlieb updated this revision to Diff 175587.Nov 27 2018, 2:42 PM

latest approach: transform the pair of ADDs/SUBs into e32, and tighten up check on def/use from one ADD to the other.

MIR test still pending.

Adding the shrink pass just before Peephole SDWA does not help the lit test, it made no difference.
I think at this point i should proceed with adding the MIR test to verify when we cannot fold.

Do you know why was it unable to shrink these instructions?

SIInstrInfo::splitScalar64BitAddSub converts the S_ADD_U64_PSEUDO into the two add instructions which use the SReg_64_XEXECRegClass instead of VCC.
Later when SIShrinkInstructions::runOnMachineFunction pass runs, it sees that the Carry regs are not VCC and simply marks them with a hint to later convert to VCC ,
and then continues without doing a transformation.

if (SDst) {
  if (SDst->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(SDst->getReg()))
      MRI.setRegAllocationHint(SDst->getReg(), 0, AMDGPU::VCC);
    continue;
  }
 
  // All of the instructions with carry outs also have an SGPR input in
  // src2.
  if (Src2 && Src2->getReg() != AMDGPU::VCC) {
    if (TargetRegisterInfo::isVirtualRegister(Src2->getReg()))
      MRI.setRegAllocationHint(Src2->getReg(), 0, AMDGPU::VCC);
 
    continue;
  }
}

OK, that makes sense. It just leaves it to post-RA shrink that way. Probably shrink pass could do the same, but it is not desirable as it limits scheduling opportunities.

But I see the problem in your code now: you do not check that vcc is not clobbered or used in between of two instructions.
I also think you need to shrink both instructions, otherwise you have carry-in of addc and carry-out of add in different registers, which just happen to be allocated to the same vcc. Note, that isConvertibleToSDWA() returning true does not guarantee final sdwa conversion, so you can end up with vop3 form for the first instruction anyway.

i think the following code does make sure that there are no intervening uses, i can also strengthen it to make sure the defining instruction of the CarryIn is the first ADD instruction.
+ if (!MRI->hasOneUse(CarryIn->getReg()) || !MRI->use_empty(CarryOut->getReg()))
+ return false;

It does not. It only checks that original sreg is not used. However you are replacing original carry sreg with vcc by shrinking instruction, and you do not check vcc uses.

well now that you mention it, yes i see that is the case, i shall look into the VCC, thanks.

rampitec added inline comments.Nov 27 2018, 2:50 PM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
910 ↗(On Diff #175587)

This check is not needed. findSingleRegUse() already did it.

ronlieb marked 3 inline comments as done.Nov 29 2018, 9:38 AM
ronlieb added inline comments.
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
922 ↗(On Diff #175187)

at present the patch only accepts the ADD|SUB variants. I am adding a check for Mods which if present we will reject the instructions, ie not perform the _e64 -> _e32 change.

MIR Test added.

ronlieb updated this revision to Diff 175890.Nov 29 2018, 9:39 AM

Added MIR test, and changes per review comments.

rampitec added inline comments.Nov 29 2018, 11:46 AM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
862 ↗(On Diff #175903)

Not necessarily, a vcc_lo or hcc_hi can be used.

868 ↗(On Diff #175903)

There is absolutely no guarantee vcc is not live at the beginning of the block.
You need to query liveness before MI (MBB::computeRegisterLiveness) and only scan from MI to MISucc.

879 ↗(On Diff #175903)

Use does not kill register, it is not a destructive read.

942 ↗(On Diff #175903)

You do not need these checks. First the presence of modifiers does not prevent the shrink. When they are non-zero, that prevents it. And this is already tested by canShrink().

test/CodeGen/AMDGPU/sdwa-ops.mir
1 ↗(On Diff #175903)

I do not see a test with modifiers.

ronlieb marked 6 inline comments as done.Nov 30 2018, 8:05 AM
ronlieb added inline comments.
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
862 ↗(On Diff #175903)

seems like computeRegisterLiveness is a much better approach to determining VCC liveness than the clunky function VCCUsable, so i can simply toss this function out and use the computeRegisterLiveness which also handles the Subregs of VCC

test/CodeGen/AMDGPU/sdwa-ops.mir
1 ↗(On Diff #175903)

i am having difficulty trying to construct a V_ADD_I32 or V_ADDC_U32 instruction with an abs or neg modifiers. In particular, the architecture ref gfx9 has comments like the following regarding input modifiers for vop1, vop2, vop3

"In general, negation and absolute value are only supported for floating point input operands (operands with a type of F16, F32, or F64); they are not supported for integer or untyped inputs."

Do you know of an MIR example which has modifiers?

rampitec added inline comments.Nov 30 2018, 8:13 AM
test/CodeGen/AMDGPU/sdwa-ops.mir
1 ↗(On Diff #175903)

Ah, right. You are only tracking these two int instructions.

ronlieb updated this revision to Diff 176136.Nov 30 2018, 8:37 AM
ronlieb marked an inline comment as done.
rampitec added inline comments.Nov 30 2018, 8:43 AM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
911 ↗(On Diff #176136)

That is not sufficient. You have checked that VCC is dead and the MI, fine. Now you need to check that it is not defined anywhere between MI and MISucc.

rampitec added inline comments.Nov 30 2018, 8:45 AM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
911 ↗(On Diff #176136)

And write tests for it of course: non-dead vcc, vcc def in between of two and vcc_lo (partial) def.

ronlieb updated this revision to Diff 176182.Nov 30 2018, 12:45 PM
ronlieb marked 3 inline comments as done.
rampitec added inline comments.Nov 30 2018, 2:41 PM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
913 ↗(On Diff #176182)

The fact ut dead at the MISucc does not give anything. It can be defined and killed in between of two instructions.

ronlieb updated this revision to Diff 176235.Nov 30 2018, 5:33 PM
ronlieb marked an inline comment as done.
rampitec added inline comments.Nov 30 2018, 5:44 PM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914 ↗(On Diff #176235)

Almost there. I = std::next(MI) and you do not need E, it is confusing.

test/CodeGen/AMDGPU/sdwa-ops.mir
354 ↗(On Diff #176235)

I wander why verifier does not complain on this instruction. Ah, I see. Please add -verify-machineinstrs to run lines.

Anyway, you need a test for what you are checking in the code: vcc def in between of two instructions.

ronlieb marked an inline comment as done.Nov 30 2018, 6:01 PM
ronlieb added inline comments.
test/CodeGen/AMDGPU/sdwa-ops.mir
354 ↗(On Diff #176235)

good catch: adding -verify-machineinstrs

see line 364 , this has a def of $vcc between the ADD and ADDC, is that what you are suggesting.

ronlieb marked an inline comment as done.Nov 30 2018, 6:05 PM
ronlieb added inline comments.
test/CodeGen/AMDGPU/sdwa-ops.mir
354 ↗(On Diff #176235)

sorry , i meant line 262

ronlieb updated this revision to Diff 176238.Nov 30 2018, 6:35 PM
ronlieb marked 2 inline comments as done.
rampitec added inline comments.Dec 2 2018, 12:20 AM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
915 ↗(On Diff #176238)

I mean you do not have to check MI itself. ++I was OK:

MachineBasicBlock::const_iterator I = std::next(MI);
test/CodeGen/AMDGPU/sdwa-ops.mir
354 ↗(On Diff #176235)

It does not help. You need a test where vcc is defined and killed in between of MI and MISucc.

354 ↗(On Diff #176238)

I still do not see how is it legal to copy 32 bit register into 64 bit.

ronlieb updated this revision to Diff 176283.Dec 2 2018, 7:24 AM
ronlieb marked 3 inline comments as done.
ronlieb marked 2 inline comments as done.Dec 2 2018, 7:28 AM
ronlieb added inline comments.
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
915 ↗(On Diff #176238)

i could not use std::next(MI) in the initializer, it caused bus errors for consective MI,MISucc.

This works
+ // Check if VCC is referenced in range of (MI,MISucc].
+ MachineBasicBlock::const_iterator I = MI;
+ for (++I; I != MISucc; ++I) {

test/CodeGen/AMDGPU/sdwa-ops.mir
354 ↗(On Diff #176235)

added subtest test12_add_co_sdwa

  1. test for $vcc defined and used between adds, should not generate
  2. GFX9-LABEL: name: test12_add_co_sdwa
354 ↗(On Diff #176238)

Fixed it at lines 351,354 , and also above at lines 321,325

rampitec added inline comments.Dec 2 2018, 9:34 AM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
915 ↗(On Diff #176238)

Take the iterator.

test/CodeGen/AMDGPU/sdwa-ops.mir
386 ↗(On Diff #176283)

killed $vcc

ronlieb updated this revision to Diff 176289.Dec 2 2018, 11:05 AM
ronlieb marked 4 inline comments as done.
rampitec added inline comments.Dec 2 2018, 12:01 PM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914 ↗(On Diff #176289)

Ugh.. Why do you need an interator from iterator?! This may even fail if next is end(). std::next() already returns you an iterator.

ronlieb marked an inline comment as done.Dec 2 2018, 12:56 PM
ronlieb added inline comments.
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914 ↗(On Diff #176289)

The short answer is i have not found another mechanism to increment I to the next MI, that both compiles and does not abort.

In this particular context, MI and MISucc are both references to valid instructions in the basic block, and further that MISucc depends on MI,so its after MI in the instruction iteration order.

This loops is not executed unless we have found both MI and MISucc, so that means we will not see MBB.end() as we will give up upon encountering MISucc.

Further , in practice, MI and MISucc are very close to each other, if not actually next o each other.

All that said, the savings from calling modifiesRegister one extra time is hopefully not too expensive in practice.

My preference at this point is to use one of the two following:
personally, i think #1 is the simplest.

+ // Check if VCC is referenced in range of MI and MISucc.
+ for (MachineBasicBlock::const_iterator I = MI; I != MISucc;
+ I = std::next(I)) {
+ if (I->modifiesRegister(AMDGPU::VCC, TRI))

+ // Check if VCC is referenced in range of (MI,MISucc].
+ MachineBasicBlock::const_iterator I = MI;
+ for (++I; I != MISucc; ++I) {
+ if (I->modifiesRegister(AMDGPU::VCC, TRI))

ronlieb marked an inline comment as done.Dec 2 2018, 3:39 PM
ronlieb added inline comments.
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914 ↗(On Diff #176289)

i need to use iterator instead of const_iterator

// Check if VCC is referenced in range of (MI,MISucc].
for (MachineBasicBlock::iterator I = std::next(MI.getIterator());
     I != MISucc; ++I) {
ronlieb updated this revision to Diff 176305.Dec 2 2018, 3:40 PM
AlexVlx added a subscriber: AlexVlx.Dec 2 2018, 4:19 PM
AlexVlx added inline comments.
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914 ↗(On Diff #176289)

Hello Ron. It is probably more hygienic (and may have spared you the pain here) to use auto in this case, since it's one of the instances where it shines. Your loop would become for (auto I = std::next(MI.getIterator()); I != MISucc; ++I) {...}, which also makes it robust against MI.getIterator() changing its return type (as unlikely as that'd be).

ronlieb marked an inline comment as done.Dec 2 2018, 4:44 PM
ronlieb added inline comments.
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914 ↗(On Diff #176289)

a good suggestion, thanks. It needs to be this to compile
(MISucc needs to be an iterator, and not recompute end condition on each iteration.)

for (auto I = std::next(MI.getIterator()), E = MISucc.getIterator(); 
     I != E; ++I) {
ronlieb updated this revision to Diff 176308.Dec 2 2018, 4:48 PM
AlexVlx added inline comments.Dec 2 2018, 5:10 PM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914 ↗(On Diff #176289)

Right, I ignored that. To be fair, at this point it may be worth considering just doing the following:

const auto It = std::find_if(MI.getIterator(), MISucc.getIterator(), [](const MachineInstr &I) {
    return I.modifiesRegister(AMDGPU::VCC, TRI);
});

if (It != MISucc.getIterator()) return;

// OR

for (auto &&I : iterator_range{MI.getIterator(), MISucc.getIterator()}) {
    if (I.modifiesRegister(AMDGPU::VCC, TRI);
        return;
}

But it's completely your call and mostly cosmetic at this point. Apologies for the side-trip:)

rampitec accepted this revision.Dec 2 2018, 6:01 PM

LGTM

This revision is now accepted and ready to land.Dec 2 2018, 6:01 PM
ronlieb marked an inline comment as done.Dec 2 2018, 6:09 PM
ronlieb added inline comments.
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914 ↗(On Diff #176289)

I tried both suggestions, (side note: both need to access std::next)
the first approach runs into issues with class pointer access TRI.
The second one gave me syntactic fits.

Just now ,i see that Stats gave it the highly coveted LGTM.
so i am going run a PSDB on this patch, and then merge it i the morning.

rampitec added inline comments.Dec 2 2018, 6:17 PM
lib/Target/AMDGPU/SIPeepholeSDWA.cpp
914 ↗(On Diff #176289)

In fact find_if looks cool, but does not simplify source, reduce number of lines or simplifies syntax analysis/optimization of the source.

Meanwhile, it is somewhat strange we have to replicate this code across many places in the llvm. After all it is pretty standard we need to check a phys reg can be used in between of two iterators, yet it is only available as a standard utility function during RA as checkInterference. What might be worth is to generalize and factor it into a common utility.

This revision was automatically updated to reflect the committed changes.