This is an archive of the discontinued LLVM Phabricator instance.

[MachineSink] Check block prologue interference
ClosedPublic

Authored by critson on Mar 9 2022, 1:33 AM.

Details

Summary

Sinking must check for interference between the block prologue
and the instruction being sunk.
Specifically check for clobbering of uses by the prologue, and
overwrites to prologue defined registers by the sunk instruction.

Diff Detail

Event Timeline

critson created this revision.Mar 9 2022, 1:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 9 2022, 1:33 AM
critson requested review of this revision.Mar 9 2022, 1:33 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 9 2022, 1:33 AM

What do you mean by "block prologue"? Can you tell me which of the 2 COPYs in your test is falsely moved with your test and which instruction clobbers the register?

(Intuitively I would expect a live-in value to be available after all PHI instructions so I don't see why there can be a clobber when something is inserted at the beginning of the basic block...)

critson updated this revision to Diff 414262.Mar 9 2022, 6:39 PM
  • Rebase on top of pre-committed test

What do you mean by "block prologue"? Can you tell me which of the 2 COPYs in your test is falsely moved with your test and which instruction clobbers the register?

(Intuitively I would expect a live-in value to be available after all PHI instructions so I don't see why there can be a clobber when something is inserted at the beginning of the basic block...)

I have pre-committed the test, so the offending sink should now be obvious.

This occurs because SkipPHIsAndLabels skips block prologue instructions as determined by the target, something which is not obvious from the function name.
On AMDGPU the S_AND_SAVEEXEC_B64 is considered part of the block prologue because it is setting up the correct execution mask for the instructions in the block; however RA has assigned the register $sgpr4 to the destination of this instruction, effectively clobbering the source of the COPY being sunk.

LGTM.

@rampitec was that also meant to be an accept?

rampitec accepted this revision.Mar 14 2022, 5:28 PM

LGTM.

@rampitec was that also meant to be an accept?

That meant to be just LGTM, but I guess after some time passed accepted.

This revision is now accepted and ready to land.Mar 14 2022, 5:28 PM
ruiling added reviewers: arsenm, nhaehnle.EditedMar 15 2022, 1:58 AM

I agree this issue should be fixed here. Machine sinking should check for register dependency between the sunk instruction and the prologue instruction in the successor block.
But I think there are two kinds of register dependency need to be checked:

a.) the definition of a register which would be used in successor prologue instruction.
b.) the instruction which has a source physical register being overwritten by successor prologue instruction.

I think you are fixing the second kind of dependency currently. Maybe check the first kind of dependency in the same change?
The first kind of def-use dependency should also apply to the pre-RA sinking.

I am not sure whether other uses of SkipPHIsXXX() in the backend would also need such kind of register dependency check, but I think that needs to be investigated separately.

Aside from this, whether we should treat a S_AND_XXX exec as prologue might change in the future. IMO, it is only S_OR_XXX exec like instructions which would activating more threads should really be treated as prologue instruction.

arsenm added inline comments.Mar 15 2022, 2:34 PM
llvm/lib/CodeGen/MachineSink.cpp
1855–1870

Is there not a helper for this somewhere?

I agree this issue should be fixed here. Machine sinking should check for register dependency between the sunk instruction and the prologue instruction in the successor block.
But I think there are two kinds of register dependency need to be checked:

a.) the definition of a register which would be used in successor prologue instruction.
b.) the instruction which has a source physical register being overwritten by successor prologue instruction.

I think you are fixing the second kind of dependency currently. Maybe check the first kind of dependency in the same change?
The first kind of def-use dependency should also apply to the pre-RA sinking.

Presumably (a) only applies to pre-RA sinking? As post-RA everything should be physical registers?
At the moment I think we are getting away without checking because most virtual registers are still unique from SSA form.
However, I agree the risk exists.

I am not sure whether other uses of SkipPHIsXXX() in the backend would also need such kind of register dependency check, but I think that needs to be investigated separately.

Agreed.

Aside from this, whether we should treat a S_AND_XXX exec as prologue might change in the future. IMO, it is only S_OR_XXX exec like instructions which would activating more threads should really be treated as prologue instruction.

Yes, it would be preferable to narrow the definition of prologue instructions in the long run.

Try this. I don't know what would be the most likely pattern in real cases, this is just to show that sinking $sgpr0_sgpr1 definition instruction into successor block would break the register dependency of type (a).

---
name:            _amdgpu_ps_main
alignment:       1
tracksRegLiveness: true
registers:       []
liveins:
  - { reg: '$sgpr4', virtual-reg: '' }
body:             |
  bb.0:
    successors: %bb.1(0x80000000)
    liveins: $sgpr4, $sgpr5

    renamable $sgpr9 = COPY $sgpr4
    renamable $vgpr5 = IMPLICIT_DEF
    renamable $sgpr0_sgpr1 = COPY $sgpr4_sgpr5
    S_BRANCH %bb.1

  bb.1:
    successors: %bb.2(0x40000000), %bb.8(0x40000000)
    liveins: $sgpr6, $sgpr9, $sgpr0_sgpr1

    $sgpr4_sgpr5 = S_AND_SAVEEXEC_B64 $sgpr0_sgpr1, implicit-def $exec, implicit-def $scc, implicit $exec
    renamable $sgpr14_sgpr15 = S_XOR_B64 $exec, killed renamable $sgpr4_sgpr5, implicit-def dead $scc
    S_CBRANCH_EXECZ %bb.8, implicit $exec
    S_BRANCH %bb.2

  bb.2:
    successors: %bb.8(0x40000000)
    liveins: $sgpr6

    $m0 = COPY killed renamable $sgpr6
    S_BRANCH %bb.8

  bb.8:

    S_ENDPGM 0

...
critson updated this revision to Diff 415707.Mar 15 2022, 11:18 PM
  • Generalize to any target defined block prologue instruction interfering with a sink candidate
  • Extend to pre-RA machine sink
critson marked an inline comment as done.Mar 15 2022, 11:22 PM

Try this. I don't know what would be the most likely pattern in real cases, this is just to show that sinking $sgpr0_sgpr1 definition instruction into successor block would break the register dependency of type (a).

snipped

Thanks for this example. I used this as a basis for a test case, but used separate sources for the two COPY instructions.

llvm/lib/CodeGen/MachineSink.cpp
1855–1870

I could not see immediately find one.
Also the newer code is now special cased to only operate on target defined block prologue instructions, which is more efficient.

Thanks for the change, mostly looks good to me. with only a few minor comments.

llvm/lib/CodeGen/MachineSink.cpp
1307

maybe start from BB->getFirstNonPHI()? so you don't need to pass PHI to isBasicBlockPrologue().

1324

Could you also return true if PI->modifiesRegister()? This is mainly for case like the sunk instruction define whole piece of the register, but there is some prologue instruction overwrite the sub-register. With this check, we would be much like what hasRegisterDependency() has done.

1851

May be better to move this log after the interference check?

critson updated this revision to Diff 416114.Mar 17 2022, 2:55 AM
critson marked 2 inline comments as done.
  • Address case of sunk instruction clobbering register partially defined in block prologue
ruiling accepted this revision.Mar 17 2022, 5:22 AM

To the best of my knowledge, this should work. But please wait one or two days in case others have more comment. Please also update commit message before push.

llvm/lib/CodeGen/MachineSink.cpp
1332
auto *DefOp = PI->findRegisterDefOperand(...);
if (DefOp && !DefOp->isDead())
  return true;
critson updated this revision to Diff 416366.Mar 17 2022, 5:57 PM
critson marked an inline comment as done.
  • Incorporate reviewer feedback.

Thanks!
I won't submit this for at least 72 hours.

llvm/lib/CodeGen/MachineSink.cpp
1324

Yes, I guess this edge case exists.
We need to be careful not to consider dead defs, as AMDGPU block prologue instructions often define $scc as a side-effect.

1851

I don't think that makes sense, then the debug output will not indicate what was rejected due to interference. The pre-RA sink outputs this information before performing its checks.

critson retitled this revision from [MachineSink] Check block prologue does not clobber uses to [MachineSink] Check block prologue interference.Mar 21 2022, 7:08 PM
critson edited the summary of this revision. (Show Details)
This revision was landed with ongoing or failed builds.Mar 21 2022, 7:16 PM
This revision was automatically updated to reflect the committed changes.