Page MenuHomePhabricator

[AMDGPU] Invert the handling of skip insertion.
Needs ReviewPublic

Authored by cdevadas on Sep 26 2019, 10:36 AM.

Details

Reviewers
arsenm
Summary

Current implementation of skip insertion (SIInsertSkip) makes it a mandatory pass
required for correctness. The idea was to have this handling as an optional pass.
This patch inserts the s_cbranch_execz upfront during SILowerControlFlow to skip over
the sections of code when no lanes are active. SIRemoveShortExecBranches tries to
remove the skips for short branches. It also tries to retain s_cbranch_execz
for the cases where the skip branch may be necessary.

The new pass, SIRemoveShortExecBranches will replace the handling of
skip insertion in the existing SIInsertSkip Pass.

Diff Detail

Repository
rL LLVM

Event Timeline

cdevadas created this revision.Sep 26 2019, 10:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 26 2019, 10:36 AM
cdevadas retitled this revision from Invert the handling of skip insertion. to [AMDGPU] Invert the handling of skip insertion..Sep 26 2019, 10:38 AM
arsenm added inline comments.Sep 26 2019, 12:36 PM
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
38

You could make this a static member and use cl::location with the flag

52–54

This isn't necessary

79–87

Should use analyzeBranch instead

89–90

This function should probably be rewritten at some point, but for now it's probably not important. I would rename it to sound stronger. mustRetainExeczBranch or something?

150

You don't need to scan forward through the whole block to find the branches. You can just check getFirstTerminator (or just call analyzeBranch on the block and check the condition type)

test/CodeGen/AMDGPU/convergent-inlineasm.ll
7–8

I assume the branch was here before and this isn't a change?

cdevadas marked an inline comment as done.Sep 27 2019, 6:00 AM
cdevadas added inline comments.
test/CodeGen/AMDGPU/convergent-inlineasm.ll
7–8

Yes, the branch was here even earlier.

Thank you for working on this.

There are a bunch of high-level problems with this code which are really due to its history and not due to your changes, but I'd appreciate if we could get things right now as the code is rewritten. Mostly, the decision logic is overly complex and at least partially wrong because it makes assumptions about the order of basic blocks (i..e, the outer loop of shouldRetainSkips).

Let's rethink what the condition should be. Here's an attempt: s_cbranch_execz should be removed if

  • Not taking the branch when EXEC=0 will end up falling through to the branch target anyway.[0]
  • The fall-through code sequence has no unwanted side effects
  • The fall-through code sequence is short and cheap[1]
  • Heuristically, the cheapness requirement should arguably mean that the fall-through code sequence contains no branches at all. After all, if we're going to hit another branch anyway, we may as well just take the original EXECZ branch. This also simplifies the check that we correctly fall through to the branch target. (An exception could perhaps be made for a nested s_cbranch_execz, if the overall code sequence is short and cheap enough that it makes sense to remove both of them.)

How does that sound? This is a significant conceptual change to how the pass works, but I think it's for the better.

[0] There is an interesting subtle point here. The way we currently lower the original thread-based control flow, this fall-through property is always guaranteed except for subtleties involving VCC[Z] branches in loops. However, perhaps we won't always lower control flow in the same way, and perhaps we will one day generate code which doesn't have this property.

[1] What we really want is that cost of fall-through code < p_taken * (cost of taken branch) + p_nottaken (cost of not-taken branch + cost of fall-through code).

lib/Target/AMDGPU/SILowerControlFlow.cpp
246–247

Comment needs to be updated.

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
11–13

This should say that it removes EXECZ branches for short branches, right?

Also... "try to retain"? I do hope we always succeed at that :)

cdevadas marked 2 inline comments as done.Oct 1 2019, 6:14 AM

You are right, we need to redesign the function shouldRetainSkips, especially in computing the cost. It is not guaranteed that the order of 'From' to 'To' blocks is a fall-through.
There could essentially be a nested control-flow which makes the cost computation a little complex. We can only approximate the number of instructions in the region.
We have talked about it earlier and trying to make the current design more close to how SIInsertSkip works now.

lib/Target/AMDGPU/SILowerControlFlow.cpp
246–247

I will change it.

lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
11–13

Yes, it is misleading. I will fix the comment.

cdevadas updated this revision to Diff 223422.Oct 6 2019, 8:29 AM

incorporated the suggestions + rebase

nhaehnle added inline comments.Oct 8 2019, 8:06 AM
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
113

analyzeBranch's return value must be checked.

117–118

What's the logic here behind using domination as a criterion?

cdevadas marked 3 inline comments as done.Oct 8 2019, 10:08 AM
cdevadas added inline comments.
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
113

Sure. Will add that.

117–118

There could be a situation in which execnz (inserted during SI_LOOP lowering) can be inverted to execz by an optimization (for instance, BranchFolding). This execz should always be retained. This special check is added to handle it.
Unfortunately, I couldn't write/find a test-case to reproduce it.

arsenm added inline comments.Oct 8 2019, 10:25 AM
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
114–115

I think this reinterpreting analyzeBranch's outputs the way is potentially confusing.

I think you don't actually need to check analyzeBranch directly here; I think MachineBasicBlock::getFallThrough does exactly this anyway (and handles the case where there's an unconditional branch as well)

117–118

I'm not sure dominance is sufficient for irreducible loops, which you won't run into in practice (as in, they probably hit another control flow bug long before this) but we should handle it correctly

nhaehnle added inline comments.Oct 9 2019, 3:41 AM
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
117–118

I have seen irreducible loops go all the way through compilation (because they triggered a bug somewhere, I believe in waitcount insertion), so yeah, that needs to be handled correctly.

I still think a reasonable way to do this is just to scan forward like mustRetainExeczBranch already does, see if we encounter the execz target block during that scan, and only remove the execz branch in that case.

cdevadas marked 2 inline comments as done.Wed, Oct 23, 7:37 AM
cdevadas added inline comments.
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
114–115

The check was necessary when it is a direct fallthrough; the analyzeBranch returns without assigning the FalseMBB. All I eventually require is FalseMBB field.

MachineBasicBlock::getFallThrough returns the fallthrough branch and doesn't serve the real purpose (getting the FlaseMBB) esp. when the false path is taken via. an unconditional branch.
With the following sequence, for instance, getFallThrough() returns %bb.1. But what we need is %bb.3

bb.0:
        successors: %bb.3, %bb.1
        -------------
        S_CBRANCH_EXECZ %bb.1
        S_BRANCH %bb.3
  bb.1:
        ;  predecessors: %bb.0, %bb.3
        successors: %bb.2, %bb.4

I believe, extracting the FalseMBB from the successor_list would be a better idea.
The SrcMBB will always have exactly two successors.

cdevadas updated this revision to Diff 228875.Tue, Nov 12, 5:25 AM

Checked the return value of analyzeBranch.
Considered only the forward branches to avoid back-edges from this optimization.

arsenm added inline comments.Wed, Nov 13, 2:24 AM
lib/Target/AMDGPU/SIRemoveShortExecBranches.cpp
64–76

You're already doing a forward scan through the blocks in mustRetainExeczBranch, so it seems like this should just be checked there

83–85

I was thinking this would go in a separate wrapper function along with the analyzeBranch call. The connection to analyzeBranch isn't obvious at this point

cdevadas updated this revision to Diff 229461.Fri, Nov 15, 1:03 AM

Created a wrapper to get the true & false branch targets. Also, used BB numbering to identify the forward jumps.