This is an archive of the discontinued LLVM Phabricator instance.

[MachineSink] Allow sinking of constant or ignorable physreg uses
ClosedPublic

Authored by vangthao on Dec 20 2021, 12:33 PM.

Details

Summary

For AMDGPU, any use of the physical register EXEC prevents sinking even if it is not a real physical register read. Add check to see if a physical
register use can be ignored for sinking.

Also perform same constant and ignorable physical register check when considering sinking in loops.

Diff Detail

Unit TestsFailed

Event Timeline

vangthao created this revision.Dec 20 2021, 12:33 PM
vangthao requested review of this revision.Dec 20 2021, 12:33 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 20 2021, 12:33 PM

I am not sure I can prove to myself this is legal. For example you are sinking a def into a loop with divergent condition and this def is used after the loop. Can this happen? If so a def might be done with an exec smaller than a use which creates an undef. Hoisting was OK because def was moved into a direction where exec is strictly not less than before. Did you run PSDB on it?

I am not sure I can prove to myself this is legal. For example you are sinking a def into a loop with divergent condition and this def is used after the loop. Can this happen?

If SuccToSinkTo exists, all uses of the def must be dominated by block SuccToSinkTo. So I think it is ok to do the sinking.

isConstantPhysReg() seems ok as any def to such register has no meaning.

Better let an AMDGPU expert have a look especially at the code gen changes.

I am not sure I can prove to myself this is legal. For example you are sinking a def into a loop with divergent condition and this def is used after the loop. Can this happen? If so a def might be done with an exec smaller than a use which creates an undef. Hoisting was OK because def was moved into a direction where exec is strictly not less than before. Did you run PSDB on it?

It's ok to sink a def with a smaller exec set as long as the use set is still a subset of the new def point exec

I am not sure I can prove to myself this is legal. For example you are sinking a def into a loop with divergent condition and this def is used after the loop. Can this happen? If so a def might be done with an exec smaller than a use which creates an undef. Hoisting was OK because def was moved into a direction where exec is strictly not less than before. Did you run PSDB on it?

I believe the requirement for defs to dominate all uses prevents this from happening but I can add more test cases to check for this. This passed PSDB.

I am not sure I can prove to myself this is legal. For example you are sinking a def into a loop with divergent condition and this def is used after the loop. Can this happen? If so a def might be done with an exec smaller than a use which creates an undef. Hoisting was OK because def was moved into a direction where exec is strictly not less than before. Did you run PSDB on it?

I believe the requirement for defs to dominate all uses prevents this from happening but I can add more test cases to check for this. This passed PSDB.

IR is essentially a single thread representation. The implicit exec use is our way to model mutithreaded divergence. Consider this transformation which shall now become legal:

int lid = get_local_id(0);      int lid = get_local_id(0);
int i = 0;                      int i = 0;
x = def();                      do {
do {                        =>    x = def();
  use1(x);                        use1(x);
} while(i++ < lid);             } while(i++ < lid);
use2(x);                        use2(x);

def dominates use2 in both cases, but in the second case not with every lane. All lanes except first will use an undef.

IR is essentially a single thread representation. The implicit exec use is our way to model mutithreaded divergence. Consider this transformation which shall now become legal:

int lid = get_local_id(0);      int lid = get_local_id(0);
int i = 0;                      int i = 0;
x = def();                      do {
do {                        =>    x = def();
  use1(x);                        use1(x);
} while(i++ < lid);             } while(i++ < lid);
use2(x);                        use2(x);

def dominates use2 in both cases, but in the second case not with every lane. All lanes except first will use an undef.

We will not sink into a loop if the def is outside of the loop. In the test case loop_sink_fmac, the def was already in a loop and was why it was able to be sinked.

IR is essentially a single thread representation. The implicit exec use is our way to model mutithreaded divergence. Consider this transformation which shall now become legal:

int lid = get_local_id(0);      int lid = get_local_id(0);
int i = 0;                      int i = 0;
x = def();                      do {
do {                        =>    x = def();
  use1(x);                        use1(x);
} while(i++ < lid);             } while(i++ < lid);
use2(x);                        use2(x);

def dominates use2 in both cases, but in the second case not with every lane. All lanes except first will use an undef.

We will not sink into a loop if the def is outside of the loop. In the test case loop_sink_fmac, the def was already in a loop and was why it was able to be sinked.

If you are sure then it is OK. Note my example was with a loop which is guaranteed to execute at least one iteration. @arsenm ?

vangthao updated this revision to Diff 395768.Dec 21 2021, 4:11 PM

Changed name of some tests for better clarity. Added more negative test cases.

lkail added a subscriber: lkail.Dec 21 2021, 7:36 PM
rampitec accepted this revision.Dec 22 2021, 11:28 AM

LGTM given the new tests.

This revision is now accepted and ready to land.Dec 22 2021, 11:28 AM

This change seems to sink v_cmp instructions, which creates different results if the exec mask changed and that makes several Vulkan tests fail.
I put a reproducer here: https://gist.github.com/Flakebi/fd1d91a806b60ec330e9f61e19fe62ac
Compile with llc -mtriple=amdgcn--amdpal -mcpu=gfx1010 -verify-machineinstrs -start-before=machine-sink -stop-after=machine-sink PipelineVsFs_0xDD57C231E25DA514.mir -o PipelineVsFs_0xDD57C231E25DA514-after.mir
and the %104:sreg_64 = V_CMP_NE_U32_e64 %89, %101, implicit $exec instruction will be sunk from bb.5 into bb.6. For reference, the pipeline is from the dEQP-VK.subgroups.arithmetic.framebuffer.subgroupexclusiveadd_float_vertex CTS test.

This change seems to sink v_cmp instructions, which creates different results if the exec mask changed and that makes several Vulkan tests fail.
I put a reproducer here: https://gist.github.com/Flakebi/fd1d91a806b60ec330e9f61e19fe62ac
Compile with llc -mtriple=amdgcn--amdpal -mcpu=gfx1010 -verify-machineinstrs -start-before=machine-sink -stop-after=machine-sink PipelineVsFs_0xDD57C231E25DA514.mir -o PipelineVsFs_0xDD57C231E25DA514-after.mir
and the %104:sreg_64 = V_CMP_NE_U32_e64 %89, %101, implicit $exec instruction will be sunk from bb.5 into bb.6. For reference, the pipeline is from the dEQP-VK.subgroups.arithmetic.framebuffer.subgroupexclusiveadd_float_vertex CTS test.

Does this help?

diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 88996f455227..0678ceeeea21 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -130,10 +130,29 @@ bool SIInstrInfo::isReallyTriviallyReMaterializable(const MachineInstr &MI,
   return false;
 }

+static bool readsExecAsData(const MachineInstr &MI) {
+  if (MI.isCompare())
+    return true;
+
+  switch (MI.getOpcode()) {
+  default:
+    break;
+  case AMDGPU::V_READFIRSTLANE_B32:
+  case AMDGPU::V_CNDMASK_B64_PSEUDO:
+  case AMDGPU::V_CNDMASK_B32_dpp:
+  case AMDGPU::V_CNDMASK_B32_e32:
+  case AMDGPU::V_CNDMASK_B32_e64:
+  case AMDGPU::V_CNDMASK_B32_sdwa:
+    return true;
+  }
+
+  return false;
+}
+
 bool SIInstrInfo::isIgnorableUse(const MachineOperand &MO) const {
   // Any implicit use of exec by VALU is not a real register read.
   return MO.getReg() == AMDGPU::EXEC && MO.isImplicit() &&
-         isVALU(*MO.getParent());
+         isVALU(*MO.getParent()) && !readsExecAsData(*MO.getParent());
 }

 bool SIInstrInfo::areLoadsFromSameBasePtr(SDNode *Load0, SDNode *Load1,

Another, shorter, approach may be:

diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index 88996f455227..f85a71941e66 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -132,8 +132,10 @@ bool SIInstrInfo::isReallyTriviallyReMaterializable(const MachineInstr &MI,

 bool SIInstrInfo::isIgnorableUse(const MachineOperand &MO) const {
   // Any implicit use of exec by VALU is not a real register read.
+  // isRematerializable check excludes instructions reading EXEC as data,
+  // such as compares, v_cndmask_b32, and readfirstlane.
   return MO.getReg() == AMDGPU::EXEC && MO.isImplicit() &&
-         isVALU(*MO.getParent());
+         isVALU(*MO.getParent()) && MO.getParent()->isRematerializable();
 }

 bool SIInstrInfo::areLoadsFromSameBasePtr(SDNode *Load0, SDNode *Load1,

Both patches seem to fix the bug

Both patches seem to fix the bug

D117814