This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Add a pass to rewrite certain undef in PHI
ClosedPublic

Authored by ruiling on Sep 13 2022, 10:25 PM.

Details

Summary

For the pattern of IR (%if terminates with a divergent branch.),
divergence analysis will report %phi as uniform to help optimal code
generation.

%if
| \
| %then
| /
%endif: %phi = phi [ %uniform, %if ], [ %undef, %then ]

In the backend, %phi and %uniform will be assigned a scalar register.
But the %undef from %then will make the scalar register dead in %then.
This will likely cause the register being over-written in %then. To fix
the issue, we will rewrite %undef as %uniform. For details, please refer
the comment in AMDGPURewriteUndefForPHI.cpp. Currently there is no test
changes shown, but this is mandatory for later changes.

Diff Detail

Event Timeline

ruiling created this revision.Sep 13 2022, 10:25 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 13 2022, 10:25 PM
ruiling requested review of this revision.Sep 13 2022, 10:25 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 13 2022, 10:25 PM
ruiling updated this revision to Diff 460001.Sep 14 2022, 12:48 AM

return correct Changed status.

sameerds added inline comments.Sep 14 2022, 2:24 AM
llvm/lib/Target/AMDGPU/AMDGPURewriteUndefForPHI.cpp
117
123

If this is not true, then does DominateBB dominate IncomingBB?

ruiling added inline comments.Sep 14 2022, 6:07 AM
llvm/lib/Target/AMDGPU/AMDGPURewriteUndefForPHI.cpp
117

sounds good.

123

No, this is used to handle case like with_uniform_region_inside added in this change. If IncomingBB does not dominate DominateBB, they may be successors of the same block in case their common predecessor has uniform branch.

ruiling updated this revision to Diff 460062.Sep 14 2022, 6:17 AM

Change per Sameer's suggestion.

ruiling added inline comments.Sep 14 2022, 8:52 PM
llvm/test/CodeGen/AMDGPU/rewrite-undef-for-phi.ll
93

In case we have different IR "%c2 = phi float [ undef, %if ], [ %c, %entry ], [ %some_other_value, %loop ]", %c2 will be divergent after StructurizeCFG. But such kind of "joining divergent threads in loop header" will be split by SIAnnotateControlFlow. So %c2 will be lowered to something like:

split.cf
  %c2_split = phi float [ undef, %if ], [ %c, %entry ]
  call void @llvm.amdgcn.end.cf()
  br label %loop

loop:
  %c2 = phi float [%c2_split, %split.cf], [%some_other_value, %loop]

%c2_split should also be rewritten by this pass. So I have to move the pass after SIAnnotateControlFlow to handle this situation.

sameerds added inline comments.Sep 14 2022, 10:01 PM
llvm/lib/Target/AMDGPU/AMDGPURewriteUndefForPHI.cpp
123

My confusion is that the "all_of" check later in the code only ensures that DominateBB dominates all the undefined values. But how does one prove that UniqueDefinedIncoming dominates all the uses of the PHI?

sameerds added inline comments.Sep 14 2022, 10:18 PM
llvm/lib/Target/AMDGPU/AMDGPURewriteUndefForPHI.cpp
12

I would say "we assume that the DA reports ..." rather than "expect".

35

Does "usually" mean that sometimes the backend will not allocate a scalar? In the case that it is allocated a vector register, then the generated code will be wrong?

llvm/test/CodeGen/AMDGPU/rewrite-undef-for-phi.ll
64

Since DominateBB is only checked on defined values, here it is true that %entry dominates %bb3. Depending on how we traverse the incoming vaues, if DominateBB is initially %bb3, then later when IncomingBB is %entry, DominateBB does not dominate IncomingBB, but it is true in the opposite direction.

ruiling updated this revision to Diff 460299.Sep 14 2022, 11:20 PM

Move the pass after SIAnnotateControlFlow.
Simplify the backedge undef logic since joining divergent threads at loop header has been handled by SIAnnotateControlFlow.

Move the pass after SIAnnotateControlFlow.
Simplify the backedge undef logic since joining divergent threads at loop header has been handled by SIAnnotateControlFlow.

Is there an advantage in making this assumption that it always runs after SIAnnotateControlFlow? It makes the pass less useful for anything other than the current AMDGPU. If there is no harm in keeping the backedge logic, then the pass will remain useful for any future use where such loops do occur.

ruiling added inline comments.Sep 15 2022, 12:17 AM
llvm/lib/Target/AMDGPU/AMDGPURewriteUndefForPHI.cpp
35

My simple thinking is in case the backend choose to allocate a vector register, the backend should make sure this is correctly propagated, all the further uses in the def-use chain should get vector register allocated, the threads with undefined value might generate garbage in corresponding register lanes. But that is still correct.

123

I think this should be true based on my understanding of structurized CFG (For a block, if it has a predecessor with divergent terminator which dominates any other non-backedge predecessor, then the predecessor should dominate the block itself.). an assertion here should work. An extra check "DominateBB dominates BB" should work well, which ensures we are doing the transformation correctly.

llvm/test/CodeGen/AMDGPU/rewrite-undef-for-phi.ll
64

I am not sue whether you have further question here. Since we are checking for pattern B described in the cpp file. But we might have complex single-entry-single-exit region inside an if-then structure which are skipped during structurization. If we are processing a pattern B in the IR, then the DominateBB should be the dominator of BB.

Move the pass after SIAnnotateControlFlow.
Simplify the backedge undef logic since joining divergent threads at loop header has been handled by SIAnnotateControlFlow.

Is there an advantage in making this assumption that it always runs after SIAnnotateControlFlow? It makes the pass less useful for anything other than the current AMDGPU. If there is no harm in keeping the backedge logic, then the pass will remain useful for any future use where such loops do occur.

Please see my inline comment, we cannot work correctly for such kind of loops which may appear before SIAnnotateControlFlow for now. We should move the block-split logic inside SIAnnotateControlFlow into structurizer itself later, then the pass should be able to work with any pass after StructurizeCFG.

llvm/test/CodeGen/AMDGPU/rewrite-undef-for-phi.ll
93

In case we have different IR "%c2 = phi float [ undef, %if ], [ %c, %entry ], [ %some_other_value, %loop ]", %c2 will be divergent after StructurizeCFG. But such kind of "joining divergent threads in loop header" will be split by SIAnnotateControlFlow. So %c2 will be lowered to something like:

split.cf
  %c2_split = phi float [ undef, %if ], [ %c, %entry ]
  call void @llvm.amdgcn.end.cf()
  br label %loop

loop:
  %c2 = phi float [%c2_split, %split.cf], [%some_other_value, %loop]

%c2_split should also be rewritten by this pass. So I have to move the pass after SIAnnotateControlFlow to handle this situation.

93

This is the major motivation to move the pass after SIAnnotateControlFlow, because we cannot work correctly for this case.

ruiling updated this revision to Diff 460316.Sep 15 2022, 12:53 AM

address review comments.

ruiling updated this revision to Diff 460319.Sep 15 2022, 1:02 AM

fix one compile error.

Move the pass after SIAnnotateControlFlow.
Simplify the backedge undef logic since joining divergent threads at loop header has been handled by SIAnnotateControlFlow.

Is there an advantage in making this assumption that it always runs after SIAnnotateControlFlow? It makes the pass less useful for anything other than the current AMDGPU. If there is no harm in keeping the backedge logic, then the pass will remain useful for any future use where such loops do occur.

Please see my inline comment, we cannot work correctly for such kind of loops which may appear before SIAnnotateControlFlow for now. We should move the block-split logic inside SIAnnotateControlFlow into structurizer itself later, then the pass should be able to work with any pass after StructurizeCFG.

I understand that the block-split produces new opportunities for uniform phis, so we want to run this pass after "SIAnnotateControlFlow". But that does not automatically mean that we should assume that this is the *only* place it will ever run. By "work correctly", do you mean that the original handling of back-edges will produce incorrect IR? If not, that treatment of back-edges need not be removed.

Is there also an assumption that this pass works only with a structurized CFG? If both StructurizeCFG and SIAnnotateControlFlow are *necessary* for this pass to produce correct IR, then this needs to be document. If they are not, then this pass should be designed to work even without them, in order to preserve future opportunity. If loop backedges need extra work, at least we should document it in the code comments.

ruiling updated this revision to Diff 460418.Sep 15 2022, 8:12 AM

Add more comments about the limitation of this pass.

I understand that the block-split produces new opportunities for uniform phis, so we want to run this pass after "SIAnnotateControlFlow". But that does not automatically mean that we should assume that this is the *only* place it will ever run. By "work correctly", do you mean that the original handling of back-edges will produce incorrect IR? If not, that treatment of back-edges need not be removed.

By "work correctly", I mean the MachineIR translated from the LLVM IR should be correct. The handling of back-edge will produce IR that will be translated to incorrect MachineIR.

Is there also an assumption that this pass works only with a structurized CFG? If both StructurizeCFG and SIAnnotateControlFlow are *necessary* for this pass to produce correct IR, then this needs to be document. If they are not, then this pass should be designed to work even without them, in order to preserve future opportunity. If loop backedges need extra work, at least we should document it in the code comments.

I have updated the comment for both issues. Please have a second look. Thanks very much for the careful review!

sameerds accepted this revision.Sep 18 2022, 10:26 PM
This revision is now accepted and ready to land.Sep 18 2022, 10:26 PM

I'm confused about problem this is trying to solve. Is this a correctness fix, or optimization? If it's a correctness fix this feels like the wrong way to address it

I'm confused about problem this is trying to solve. Is this a correctness fix, or optimization? If it's a correctness fix this feels like the wrong way to address it

Yes, this is correctness fix to make the optimization changes (D132449 D132450) work correctly. This also allows StructurizeCFG to skip structured CFG.

Some offline discussion with @arsenm, he is fine with an IR pass to do this if this helps the optimization. @arsenm also said that maybe we can change the behavior of divergence analysis to report this as divergent, while I think we can keep this behavior now and re-visit the problem later if we meet with new issue on this. So, I will push this change to move forward.

This revision was landed with ongoing or failed builds.Sep 25 2022, 6:56 PM
This revision was automatically updated to reflect the committed changes.