This is an archive of the discontinued LLVM Phabricator instance.

MachineSink: Fix sinking VGPR def out of a divergent loop
ClosedPublic

Authored by arsenm on Jul 14 2023, 3:23 PM.

Details

Summary

This fixes sinking a VGPR def out of a loop past the reconvergence
point at the SI_END_CF. There was a prior fix which introduced
blockPrologueInterferes (D121277) to fix the same basic problem for
the post RA sink. This also had the special case isIgnorableUse case
which was incorrect, because in some contexts the exec use is not
ignorable.

I'm thinking of a new way to represent this which will avoid needing
hasIgnorableUse and isBasicBlockPrologue.

Fixes: SWDEV-407790

Diff Detail

Event Timeline

arsenm created this revision.Jul 14 2023, 3:23 PM
arsenm requested review of this revision.Jul 14 2023, 3:23 PM
Herald added a project: Restricted Project. · View Herald TranscriptJul 14 2023, 3:23 PM
Herald added a subscriber: wdng. · View Herald Transcript

Not sure what to make of that Fixes tag. The change to isInlineAsmBrIndirectTarget LGTM; unsure about the rest.

llvm/lib/CodeGen/MachineSink.cpp
1024

consider using const auto here.

LGTM - but should probably have more eyes on it.

This revision is now accepted and ready to land.Jul 17 2023, 10:14 AM
ruiling added inline comments.Jul 18 2023, 7:58 AM
llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir
54

Sorry I don't see why we are not allowed to sink such kind of loop-invariant v_add out of the loop. For this specific case, the result vgpr should be the same with and without the change, right?

arsenm added inline comments.Jul 18 2023, 8:31 AM
llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir
54

Maybe we could sink it if we introduced a new block before the reconvergence at si_end_cf.

This testcase has 2 loops, nested with unrelated conditions. By sinking here, it's sinking from the body of the first loop into the second loop. It now executes for lanes which would not have taken the inner loop

ruiling added inline comments.Jul 18 2023, 9:00 AM
llvm/test/CodeGen/AMDGPU/machine-sink-loop-var-out-of-divergent-loop-swdev407790.mir
54

I am not sure whether I read the CFG correctly. The bb.4 here is inside the inner loop, and the bb.5 is the reconvergence point for the inner loop. So all the lanes that are active after the SI_END_CF in bb.5 (restore the exec as before entering inner loop) should have been active in the inner loop (bb4).

I think a more accurate description of the real problem is: sinking vgpr def which has a loop-variant sgpr source out of divergent loop is wrong for AMDGPU. For other cases, it is still legal to move out of loop. The problem is not specific to nested loops, it also applies to single loop. The test case has been over-simplified and does not show original problem (both .ll and .mir test). We can keep the code as now, but I still think it's better to update the tests to show the real problem to make it easy to revisit the issue.

I think a more accurate description of the real problem is: sinking vgpr def which has a loop-variant sgpr source out of divergent loop is wrong for AMDGPU. For other cases, it is still legal to move out of loop. The problem is not specific to nested loops, it also applies to single loop. The test case has been over-simplified and does not show original problem (both .ll and .mir test). We can keep the code as now, but I still think it's better to update the tests to show the real problem to make it easy to revisit the issue.

What do you suggest? I think we need to come up with a way to properly model this

I think a more accurate description of the real problem is: sinking vgpr def which has a loop-variant sgpr source out of divergent loop is wrong for AMDGPU. For other cases, it is still legal to move out of loop. The problem is not specific to nested loops, it also applies to single loop. The test case has been over-simplified and does not show original problem (both .ll and .mir test). We can keep the code as now, but I still think it's better to update the tests to show the real problem to make it easy to revisit the issue.

What do you suggest? I think we need to come up with a way to properly model this

I was thinking about something along the lines of marking basic block edges as read or write convergent, and then adding some notion of read/write convergence that approximately means VALU, VALU cross lane, scalar or divergent control flow