This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Invert the handling of skip insertion.
ClosedPublic

Authored by cdevadas on Sep 26 2019, 10:36 AM.

Details

Summary

Current implementation of skip insertion (SIInsertSkip) makes it a mandatory pass
required for correctness. The idea was to have this handling as an optional pass.
This patch inserts the s_cbranch_execz upfront during SILowerControlFlow to skip over
the sections of code when no lanes are active. SIRemoveShortExecBranches tries to
remove the skips for short branches. It also tries to retain s_cbranch_execz
for the cases where the skip branch may be necessary.

The new pass, SIRemoveShortExecBranches will replace the handling of
skip insertion in the existing SIInsertSkip Pass.

Diff Detail

Repository
rL LLVM

Event Timeline

cdevadas created this revision.Sep 26 2019, 10:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 26 2019, 10:36 AM
cdevadas retitled this revision from Invert the handling of skip insertion. to [AMDGPU] Invert the handling of skip insertion..Sep 26 2019, 10:38 AM
arsenm added inline comments.Sep 26 2019, 12:36 PM
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
37

You could make this a static member and use cl::location with the flag

51–53

This isn't necessary

78–86

Should use analyzeBranch instead

88–89

This function should probably be rewritten at some point, but for now it's probably not important. I would rename it to sound stronger. mustRetainExeczBranch or something?

149

You don't need to scan forward through the whole block to find the branches. You can just check getFirstTerminator (or just call analyzeBranch on the block and check the condition type)

test/CodeGen/AMDGPU/convergent-inlineasm.ll
7

I assume the branch was here before and this isn't a change?

cdevadas marked an inline comment as done.Sep 27 2019, 6:00 AM
cdevadas added inline comments.
test/CodeGen/AMDGPU/convergent-inlineasm.ll
7

Yes, the branch was here even earlier.

Thank you for working on this.

There are a bunch of high-level problems with this code which are really due to its history and not due to your changes, but I'd appreciate if we could get things right now as the code is rewritten. Mostly, the decision logic is overly complex and at least partially wrong because it makes assumptions about the order of basic blocks (i..e, the outer loop of shouldRetainSkips).

Let's rethink what the condition should be. Here's an attempt: s_cbranch_execz should be removed if

  • Not taking the branch when EXEC=0 will end up falling through to the branch target anyway.[0]
  • The fall-through code sequence has no unwanted side effects
  • The fall-through code sequence is short and cheap[1]
  • Heuristically, the cheapness requirement should arguably mean that the fall-through code sequence contains no branches at all. After all, if we're going to hit another branch anyway, we may as well just take the original EXECZ branch. This also simplifies the check that we correctly fall through to the branch target. (An exception could perhaps be made for a nested s_cbranch_execz, if the overall code sequence is short and cheap enough that it makes sense to remove both of them.)

How does that sound? This is a significant conceptual change to how the pass works, but I think it's for the better.

[0] There is an interesting subtle point here. The way we currently lower the original thread-based control flow, this fall-through property is always guaranteed except for subtleties involving VCC[Z] branches in loops. However, perhaps we won't always lower control flow in the same way, and perhaps we will one day generate code which doesn't have this property.

[1] What we really want is that cost of fall-through code < p_taken * (cost of taken branch) + p_nottaken (cost of not-taken branch + cost of fall-through code).

lib/Target/AMDGPU/SILowerControlFlow.cpp
247–248

Comment needs to be updated.

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
10–12

This should say that it removes EXECZ branches for short branches, right?

Also... "try to retain"? I do hope we always succeed at that :)

cdevadas marked 2 inline comments as done.Oct 1 2019, 6:14 AM

You are right, we need to redesign the function shouldRetainSkips, especially in computing the cost. It is not guaranteed that the order of 'From' to 'To' blocks is a fall-through.
There could essentially be a nested control-flow which makes the cost computation a little complex. We can only approximate the number of instructions in the region.
We have talked about it earlier and trying to make the current design more close to how SIInsertSkip works now.

lib/Target/AMDGPU/SILowerControlFlow.cpp
247–248

I will change it.

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
10–12

Yes, it is misleading. I will fix the comment.

cdevadas updated this revision to Diff 223422.Oct 6 2019, 8:29 AM

incorporated the suggestions + rebase

nhaehnle added inline comments.Oct 8 2019, 8:06 AM
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
113

analyzeBranch's return value must be checked.

117–118

What's the logic here behind using domination as a criterion?

cdevadas marked 3 inline comments as done.Oct 8 2019, 10:08 AM
cdevadas added inline comments.
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
113

Sure. Will add that.

117–118

There could be a situation in which execnz (inserted during SI_LOOP lowering) can be inverted to execz by an optimization (for instance, BranchFolding). This execz should always be retained. This special check is added to handle it.
Unfortunately, I couldn't write/find a test-case to reproduce it.

arsenm added inline comments.Oct 8 2019, 10:25 AM
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
114–115

I think this reinterpreting analyzeBranch's outputs the way is potentially confusing.

I think you don't actually need to check analyzeBranch directly here; I think MachineBasicBlock::getFallThrough does exactly this anyway (and handles the case where there's an unconditional branch as well)

117–118

I'm not sure dominance is sufficient for irreducible loops, which you won't run into in practice (as in, they probably hit another control flow bug long before this) but we should handle it correctly

nhaehnle added inline comments.Oct 9 2019, 3:41 AM
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
117–118

I have seen irreducible loops go all the way through compilation (because they triggered a bug somewhere, I believe in waitcount insertion), so yeah, that needs to be handled correctly.

I still think a reasonable way to do this is just to scan forward like mustRetainExeczBranch already does, see if we encounter the execz target block during that scan, and only remove the execz branch in that case.

cdevadas marked 2 inline comments as done.Oct 23 2019, 7:37 AM
cdevadas added inline comments.
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
114–115

The check was necessary when it is a direct fallthrough; the analyzeBranch returns without assigning the FalseMBB. All I eventually require is FalseMBB field.

MachineBasicBlock::getFallThrough returns the fallthrough branch and doesn't serve the real purpose (getting the FlaseMBB) esp. when the false path is taken via. an unconditional branch.
With the following sequence, for instance, getFallThrough() returns %bb.1. But what we need is %bb.3

bb.0:
        successors: %bb.3, %bb.1
        -------------
        S_CBRANCH_EXECZ %bb.1
        S_BRANCH %bb.3
  bb.1:
        ;  predecessors: %bb.0, %bb.3
        successors: %bb.2, %bb.4

I believe, extracting the FalseMBB from the successor_list would be a better idea.
The SrcMBB will always have exactly two successors.

cdevadas updated this revision to Diff 228875.Nov 12 2019, 5:25 AM

Checked the return value of analyzeBranch.
Considered only the forward branches to avoid back-edges from this optimization.

arsenm added inline comments.Nov 13 2019, 2:24 AM
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
64–76

You're already doing a forward scan through the blocks in mustRetainExeczBranch, so it seems like this should just be checked there

83–85

I was thinking this would go in a separate wrapper function along with the analyzeBranch call. The connection to analyzeBranch isn't obvious at this point

cdevadas updated this revision to Diff 229461.Nov 15 2019, 1:03 AM

Created a wrapper to get the true & false branch targets. Also, used BB numbering to identify the forward jumps.

arsenm added inline comments.Nov 19 2019, 4:39 AM
lib/Target/AMDGPU/SIInsertSkips.cpp
469–475

Shouldn't this be dead code now? i.e. you shouldn't be adding a case, and should be removing the SI_MASK_BRANCH part

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
67

This name is slightly misleading. getBlockDestinations?

140

I'm not sure if potentially renumbering the blocks counts as a change, but it probably doesn't really matter. It's not a guaranteed property between passes I guess

cdevadas marked 3 inline comments as done.Nov 19 2019, 5:03 AM
cdevadas added inline comments.
lib/Target/AMDGPU/SIInsertSkips.cpp
469–475

Yes. But retained this code for the existing mir tests added for SIInsertSkip. This file will go away entirely after handling the kill intrinsics and other unrelated implementations elsewhere. Once the new design is in the upstream, I will clean-up the mir tests too.

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
67

sure, I will use this name.

140

Inserted the renumbering here to fix (restore) any broken BB numbering if any prior optimization pass has removed/added the BBs.

cdevadas updated this revision to Diff 230193.Nov 19 2019, 9:39 PM

incorporated the suggestion (used getBlockDestinations for the function name)

arsenm accepted this revision.Dec 10 2019, 8:31 AM

LGTM with grammar fix, although this isn't complete. There's still work to kill the existing pass

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
13

s/is no unwanted sideeffects/are no unwanted side effects/

This revision is now accepted and ready to land.Dec 10 2019, 8:31 AM
This revision was automatically updated to reflect the committed changes.
foad added a subscriber: foad.Jan 15 2020, 3:41 AM
hakzsam added a subscriber: hakzsam.EditedJan 30 2020, 7:50 AM

Hi there,

This change introduced GPU hangs with RADV. I know that it has been reverted since but release/10.x doesn't have the revert, hence it's still broken.
Can the revert (or the correct fix) be backported to LLVM 10 ?

Thanks!

Hi there,

This change introduced GPU hangs with RADV. I know that it has been reverted since but release/10.x doesn't have the revert, hence it's still broken.
Can the revert (or the correct fix) be backported to LLVM 10 ?

Thanks!

https://bugs.llvm.org/show_bug.cgi?id=44720