This is an archive of the discontinued LLVM Phabricator instance.

Codegen: Make chains from trellis-shaped CFGs
ClosedPublic

Authored by iteratee on Jan 10 2017, 11:39 AM.

Details

Summary

Lay out trellis-shaped CFGs optimally.
A trellis of the shape below:

A     B
|\   /|
| \ / |
|  X  |
| / \ |
|/   \|
C     D

would be laid out A; B->C ; D by the current layout algorithm. Now we identify
trellises and lay them out either A->C; B->D or A->D; B->C. This scales with an
increasing number of predecessors. A trellis is a a group of 2 or more
predecessor blocks that all have the same successors.

because of this we can tail duplicate to extend existing trellises.

As an example consider the following CFG:

  B   D   F   H
 / \ / \ / \ / \
A---C---E---G---Ret

Where A,C,E,G are all small (Currently 2 instructions).

The CFG preserving layout is then A,B,C,D,E,F,G,H,Ret.

The current code will copy C into B, E into D and G into F and yield the layout
A,C,B(C),E,D(E),F(G),G,H,ret

define void @straight_test(i32 %tag) {
entry:
  br label %test1
test1: ; A
  %tagbit1 = and i32 %tag, 1
  %tagbit1eq0 = icmp eq i32 %tagbit1, 0
  br i1 %tagbit1eq0, label %test2, label %optional1
optional1: ; B
  call void @a()
  br label %test2
test2: ; C
  %tagbit2 = and i32 %tag, 2
  %tagbit2eq0 = icmp eq i32 %tagbit2, 0
  br i1 %tagbit2eq0, label %test3, label %optional2
optional2: ; D
  call void @b()
  br label %test3
test3: ; E
  %tagbit3 = and i32 %tag, 4
  %tagbit3eq0 = icmp eq i32 %tagbit3, 0
  br i1 %tagbit3eq0, label %test4, label %optional3
optional3: ; F
  call void @c()
  br label %test4
test4: ; G
  %tagbit4 = and i32 %tag, 8
  %tagbit4eq0 = icmp eq i32 %tagbit4, 0
  br i1 %tagbit4eq0, label %exit, label %optional4
optional4: ; H
  call void @d()
  br label %exit
exit:
  ret void
}

here is the layout after D27742:

straight_test:                          # @straight_test
; ... Prologue elided
; BB#0:                                 # %entry ; A (merged with test1)
; ... More prologue elided
        mr 30, 3
        andi. 3, 30, 1
        bc 12, 1, .LBB0_2
; BB#1:                                 # %test2 ; C
        rlwinm. 3, 30, 0, 30, 30
        beq      0, .LBB0_3
        b .LBB0_4
.LBB0_2:                                # %optional1 ; B (copy of C)
        bl a
        nop
        rlwinm. 3, 30, 0, 30, 30
        bne      0, .LBB0_4
.LBB0_3:                                # %test3 ; E
        rlwinm. 3, 30, 0, 29, 29
        beq      0, .LBB0_5
        b .LBB0_6
.LBB0_4:                                # %optional2 ; D (copy of E)
        bl b
        nop
        rlwinm. 3, 30, 0, 29, 29
        bne      0, .LBB0_6
.LBB0_5:                                # %test4 ; G
        rlwinm. 3, 30, 0, 28, 28
        beq      0, .LBB0_8
        b .LBB0_7
.LBB0_6:                                # %optional3 ; F (copy of G)
        bl c
        nop
        rlwinm. 3, 30, 0, 28, 28
        beq      0, .LBB0_8
.LBB0_7:                                # %optional4 ; H
        bl d
        nop
.LBB0_8:                                # %exit ; Ret
        ld 30, 96(1)                    # 8-byte Folded Reload
        addi 1, 1, 112
        ld 0, 16(1)
        mtlr 0
        blr

The tail-duplication has produced some benefit, but it has also produced a
trellis which is not laid out optimally. With this patch, we improve the layouts
of such trellises, and decrease the cost calculation for tail-duplication
accordingly.

This patch produces the layout A,C,E,G,B,D,F,H,Ret. This layout does have
back edges, which is a negative, but it has a bigger compensating
positive, which is that it handles the case where there are long strings
of skipped blocks much better than the original layout. Both layouts
handle runs of executed blocks equally well. Branch prediction also
improves if there is any correlation between subsequent optional blocks.

Here is the resulting concrete layout:

straight_test:                          # @straight_test
; BB#0:                                 # %entry ; A (merged with test1)
        mr 30, 3
        andi. 3, 30, 1
        bc 12, 1, .LBB0_4
; BB#1:                                 # %test2 ; C
        rlwinm. 3, 30, 0, 30, 30
        bne      0, .LBB0_5
.LBB0_2:                                # %test3 ; E
        rlwinm. 3, 30, 0, 29, 29
        bne      0, .LBB0_6
.LBB0_3:                                # %test4 ; G
        rlwinm. 3, 30, 0, 28, 28
        bne      0, .LBB0_7
        b .LBB0_8
.LBB0_4:                                # %optional1 ; B (Copy of C)
        bl a
        nop
        rlwinm. 3, 30, 0, 30, 30
        beq      0, .LBB0_2
.LBB0_5:                                # %optional2 ; D (Copy of E)
        bl b
        nop
        rlwinm. 3, 30, 0, 29, 29
        beq      0, .LBB0_3
.LBB0_6:                                # %optional3 ; F (Copy of G)
        bl c
        nop
        rlwinm. 3, 30, 0, 28, 28
        beq      0, .LBB0_8
.LBB0_7:                                # %optional4 ; H
        bl d
        nop
.LBB0_8:                                # %exit

Diff Detail

Event Timeline

iteratee updated this revision to Diff 83836.Jan 10 2017, 11:39 AM
iteratee retitled this revision from to Codegen: Make chains from lattice-shaped CFGs.
iteratee updated this object.
iteratee added a reviewer: davidxl.
iteratee set the repository for this revision to rL LLVM.
iteratee added subscribers: arsenm, wdng, nhaehnle and 12 others.
davidxl added inline comments.Jan 10 2017, 1:27 PM
lib/CodeGen/MachineBlockPlacement.cpp
973

Is it better to relax the condition such that as long as Succ can be dup'd into one of the unplaced predecessors, return true?

979

Change the second 'C' to 'C (+ BB') ' where BB' is the dup of BB

982

Better change 'B' to 'BB' to match the function argument name

988

Change E to 'Succ'

iteratee updated this revision to Diff 83879.Jan 10 2017, 2:49 PM
iteratee removed rL LLVM as the repository for this revision.
iteratee marked 4 inline comments as done.

Comment update as requested.

iteratee added inline comments.Jan 10 2017, 2:50 PM
lib/CodeGen/MachineBlockPlacement.cpp
973

That isn't sufficient, because we need layout to be repeatable. So when we encounter the lattice the second time, there will be no more copies left to make. Chandler also wanted to allow another pass besides layout to do the duplication (perhaps with a larger threshold or different heuristic).

davidxl edited edge metadata.Jan 10 2017, 2:56 PM

I am not sure I understand your reply about 'repeatability'. Can you elaborate? The suggestion is that if 'Succ' can be dup'ed into 'D' which is unplaced, return true. It basically does the same as skipping 'C'. Besides, checking the successors is not the reliable way to determine if C is block with duplicated bb.

iteratee updated this revision to Diff 83883.Jan 10 2017, 3:27 PM
iteratee edited edge metadata.

Add comments elaborating on why we use a CFG check rather than checking for partial copiability of blocks or keeping a list of blocks with copies.

I am not sure I understand your reply about 'repeatability'. Can you elaborate? The suggestion is that if 'Succ' can be dup'ed into 'D' which is unplaced, return true. It basically does the same as skipping 'C'. Besides, checking the successors is not the reliable way to determine if C is block with duplicated bb.

Discussed offline and comments updated to reflect discussion.

iteratee updated this revision to Diff 84161.Jan 12 2017, 12:03 PM
iteratee edited edge metadata.

This patch now adjusts the probability accounting in D28583 to account for lattice layout. This means that more duplications occur. Tests that were in D28583 are now here.

iteratee updated this revision to Diff 85085.Jan 19 2017, 6:29 PM

Rebased and modified the probability calculations to account for lattices.

iteratee updated this revision to Diff 85523.Jan 23 2017, 8:31 PM

Rebasing.

Added lattice check outside of the tail-duplication code.
Updated tests to match.
Added test for non tail-dup lattice.

Is this patch up to date?

iteratee updated this revision to Diff 86937.Feb 2 2017, 10:54 PM

Looking over this, It feels like I can split the lattice portion out and put it in first. Even together they aren't very big, so If I can get some initial opinions on this before I start that, it would be appreciated.

I wanted to lay out lattice-type CFG's correctly even if the blocks are larger than what we would tail-duplicate.
To do that I worked out that for a lattice, we can compute which pair of edges forms the optimal fallthrough and use those edges. We don't really have to worry about CFG breaking, because with a lattice we can easily compute the optimal fallthrough pair and take it.

Lattice:
A Set of Predecessor blocks P that all have the same Successors S when |S| >= 2, |P| >=2 and S ∩ P = ∅
We can treat this as a graph optimization problem. There's a well known general algorithm, (the hungarian algorithm) but I just solved it for size 2 because that's the only size we really care about.

Thanks for taking a look, I'll try and get the plain lattice code separated from the tail duplication code tomorrow.

iteratee updated this revision to Diff 87043.Feb 3 2017, 3:10 PM
iteratee edited the summary of this revision. (Show Details)
iteratee set the repository for this revision to rL LLVM.

Looking over this, It feels like I can split the lattice portion out and put it in first. Even together they aren't very big, so If I can get some initial opinions on this before I start that, it would be appreciated.

I thought about this, and looked at what would go into the 2 patches. I don't think it's worth it now.

I updated the comments to make it more clear that we're doing this for general lattices and tail-duplication gets the benefit as well.

davidxl added inline comments.Feb 6 2017, 11:17 AM
lib/CodeGen/MachineBlockPlacement.cpp
760

merge error here -- two returns

833

tail duplication can create this pattern - why is it skipped?
`

 `BB    Pred
  |  \      /|
  |    \  /  |
  |      /\  |
  |   /    S2  
  | /      /
S1  <-+`

`

836

merge this with the insert before?

850

Have a high level description of the selection algo here as comments.

1223

unrelated change?

1317

unrelated change?

iteratee updated this revision to Diff 87333.Feb 6 2017, 4:28 PM
iteratee marked an inline comment as done.

Handle Lattices with inter-successor edges.

More comments.

iteratee marked 4 inline comments as done.Feb 6 2017, 4:29 PM
iteratee added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
836

It can't, we have to count those predecessors, we just don't have to check them again.

davidxl added inline comments.Feb 7 2017, 11:02 AM
lib/CodeGen/MachineBlockPlacement.cpp
836

Can you move insert call down after ++PreCount. If insertion returns false, continue.

851

Since lattice with triangle is handled, the definition of lattice here is no longer accurate -- some predecessor does not have the same successors as BB.

881

how does this handle triangle case? In the example, S2 is predecessor of S1, but does not have the same successors of BB.

914

It is better to split the BestEdges into two BestIncomingEdges vector one for each viable successors and then sort them (or BestOutgoingEdges vectors one for each precessor). This makes the following code much more readable.

935

don't conflict (aka sharing the same successor)

946

If BestEdges are split, there is no need to do linear search for the best incoming edge for each successor -- after sorting, they are already accessible.

Split BestEdges according to Predecessor is fine too. Either way, it is easy to detect conflict.

1278

Extract the special handling of lattice code into a helper to make the main flow of the caller cleaner.

iteratee updated this revision to Diff 87511.Feb 7 2017, 1:30 PM
iteratee marked 4 inline comments as done.

Add back missing statement to handle triangles lost in merge.
Other tidying.

iteratee marked 4 inline comments as done.Feb 7 2017, 1:31 PM
iteratee added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
881

There was a merge mistake where it disappeared, sorry. I put it back.

914

Good catch, that is simpler.

1278

It already is in 2 helpers. Ignoring the logging the logic is:
if (isLattice)

return getBestLattice();

I don't see how a helper would be useful in this case.

davidxl added inline comments.Feb 7 2017, 1:48 PM
lib/CodeGen/MachineBlockPlacement.cpp
1278

I suggest pushing isLattice check into getBestLatticeSuccessor call. Also pushing the debug tracking code there too. The caller will be like

if (LatticeSucc = getBestLatticeSuccessor(...)) {

return BlockAndTailDupResult(LatticeSucc, false}'

}

1296

Is this null? can you explain this code?

iteratee updated this revision to Diff 87524.Feb 7 2017, 2:43 PM
iteratee marked an inline comment as done.

Move logging so that control flow is more obvious.

iteratee added inline comments.Feb 7 2017, 2:52 PM
lib/CodeGen/MachineBlockPlacement.cpp
1278

Keeping them separate makes the early return logic simpler, but I moved the debugging code.

1296

Yes. If BB is part of a lattice, but not an optimal edge, then we return early. We've already determined that all of BB's successors have a better fallthrough predecessor.

davidxl added inline comments.Feb 7 2017, 3:52 PM
lib/CodeGen/MachineBlockPlacement.cpp
907

The assert seems redundant - lattice shape check already checks number of predecessors.

911

Since the non-lattice based layout algorithm looks at cfg edges in forward direction (i.e. look at successor edges), it looks wrong to use best incoming edges to detect conflict. The conflict exists when the best outgoing edges from two predecessor share the same successor. Example: (skip to the end of this example to see general algorithm).

  BB    Pred
  |  \      /|
  |    \  /  |
  |      /\  |
  |   /    S2  
  | /      
S1
  1. If best outgoing edges of BB is BB->S1, while the best outgoing edges of Pred is Pred->S2, then there is no conflict.

    1.1 If there is no triangle (from S2->S1), then the best successor for BB here is S1. 1.2 If there is an edge from S2 to S1 (forming triangle), if Freq(S2->S1) > Freq(BB->S1), then we prefer layout Pred->S2->S1, so BB's best successor should be null.
  1. If the best outgoing edge of BB and Pred is the same say S1, then there is a conflict. 2.1. no triangle: if Freq(BB->S1) > Freq(Pred->S1), return S1 as the best successor for BB, otherwise S2 2.2. there is an edge from S2 to S1: ....

Actually if you look at all the special cases, there is a more general algorithm to get the optimal solution for all shapes: among all 4 edges (5 edges for the triangular case), find two non conflicting edges as the fallthrough edges such that their frequency sum is maximal. (two edges are conflicting if they share either source or dest BB).

iteratee updated this revision to Diff 87562.Feb 7 2017, 4:33 PM

If the lattice contains a triangle, we may want to tail duplicate instead. Check for that.

iteratee updated this revision to Diff 87569.Feb 7 2017, 5:26 PM
iteratee set the repository for this revision to rL LLVM.
iteratee added a reviewer: jlebar.
davidxl added inline comments.Feb 8 2017, 12:17 PM
lib/CodeGen/MachineBlockPlacement.cpp
886

Using tuple does not increase readability (e.g. mapping get<0> ... to actual field). It is better to just use a struct.

904

To greatly increase readability, please put code between line 917 and line 934 into a helper function: getBestFallEdgesInLattice(..) with comment like:

// Find two non-conflicting edges with maximal total frequency in the lattice to be used as fall through.

davidxl added inline comments.Feb 8 2017, 12:24 PM
lib/CodeGen/MachineBlockPlacement.cpp
923

Do early return if condition is not true to reduce nesting level.

927

The code will be cleaner if early return is done here too and let the tail merging checking follow.

949

Why do you need to do chain merging here? The method is supposed to do analysis only.

iteratee updated this revision to Diff 87725.Feb 8 2017, 3:35 PM
iteratee marked 4 inline comments as done.Feb 8 2017, 3:40 PM
iteratee added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
927

Tail duplication is the early return case.

949

Well, I'm open to suggestions, but if we don't merge the triangle edge, it doesn't get chosen. We know it's optimal right now.

Should we have a src->dest map that saves the other side of lattices and then follow it when we are trying to choose the successor for src?
That would be cleaner.

jlebar edited edge metadata.Feb 9 2017, 5:55 PM

Kyle asked me to look over this from an outsider's perspective and check for understandability. But after starting on it, I think this would be easier after David's comments have been addressed. I just have a few comments for now.

lib/CodeGen/MachineBlockPlacement.cpp
810

One thing that's confusing to me is: What exactly is "in" the lattice?

We say that BB is part of the lattice if its successors form a lattice. So it sounds like a recursive definition. But then the criteria for whether BB's successors "form a lattice" is different from the criteria for whether BB is itself in a lattice.

The other thing I have no intuition for here is, why do we use the mathematical word "lattice" for this shape? I'm sure there's a good reason, and understanding that might help in general.

973

Can we run clang-format over this patch? git-clang-format, included in the clang sources, can run it just over your changes, so you don't have to reformat the whole file.

For instance, this line appears to fit in 80 chars, so doesn't need to be wrapped. Also we usually put "&&" at the end of the line, not the beginning. And there are some lines that appear to be longer than 80 chars.

If it helps, I have a script that runs git-clang-format on every arc diff so I don't have to remember to do it myself. https://github.com/jlebar/conf/blob/master/bin/arc

1283

Nit, we usually omit these braces, even though the if-body is multiline.

iteratee updated this revision to Diff 88033.Feb 10 2017, 12:20 PM

Patch actually updated to match comments from last time.

davidxl added inline comments.Feb 12 2017, 10:43 PM
lib/CodeGen/MachineBlockPlacement.cpp
949

ideally this should not happen. If it happens, it means the lattice based cost analysis is not consistent with current heuristic of determine better layout predecessor (which is likely the case). We should probably enhance that logic in the future?

996

This comment does not seem to be correct. D can not be C's fall through. The lattice analysis has decided that D is the fall-through of BB.

You won't need to check all Preds of Succ either -- only Preds that are potential layout predecessor of Succ. In this case, C can not be Succ's layout predecessor.

iteratee updated this revision to Diff 88227.Feb 13 2017, 10:59 AM
iteratee marked an inline comment as done.

only git-clang-format

iteratee marked an inline comment as done.Feb 13 2017, 11:07 AM
iteratee added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
810

I didn't have a good word for it. I also can't find one googling. I'm open to suggestions. Lattice matched my initial intuitions, but really it's a subgraph where all maximal linear matchings are the same size.

949

I don't agree. We are conservative on purpose in the normal layout. When we find that we have a lattice, the need to be conservative shrinks. I don't expect them to get the same answer. Even if we did, I think it would be wasteful to re-compute the answer.

996

I reread it. It's correct. D should be the fallthrough successor of (C+BB).

You won't need to check all Preds of Succ either -- only Preds that are potential layout predecessor of Succ. In this case, C can not be Succ's layout predecessor.

I'm not sure what you mean by the above. The comment already says "unplaced predecessors"

iteratee updated this revision to Diff 88230.Feb 13 2017, 11:08 AM

Improved a comment, removed braces.

davidxl added inline comments.Feb 13 2017, 12:06 PM
lib/CodeGen/MachineBlockPlacement.cpp
949

It is not ideal, but my point still holds -- it is not a good idea to embed chain transformation code inside analysis code. If you want to improve the situation so that the analysis result can be better maintained -- that is fine -- but probably as a diferent patch.

996

For the lattice including BB, C+BB, Succ and D, only when {BestA, BestB} == { BB->D, D->Succ}, will the tail duplication check is called. Does it mean D can not be C+BB's successor?

What I meant is you can pass 'D' as a parameter to this call and only check if Succ can be tail dup into D.

iteratee added inline comments.Feb 13 2017, 12:53 PM
lib/CodeGen/MachineBlockPlacement.cpp
949

Then I'll do it the way I suggested. We already did the analysis for the other side of the lattice, we should save it, and then trust it when we get there.

996

Yes, but if we tail-duplicate then C+BB has D as a fallthrough. That's why we can ignore it.

iteratee updated this revision to Diff 88254.Feb 13 2017, 1:54 PM
iteratee retitled this revision from Codegen: Make chains from lattice-shaped CFGs to Codegen: Make chains from trellis-shaped CFGs.
iteratee edited the summary of this revision. (Show Details)

Save the analysis for the other side of the trellis so that we don't recompute it later.

Rename lattice to trellis, because it better matches existing usage.

jlebar accepted this revision.Feb 13 2017, 3:05 PM

With some changes this looks good for the part I was asked to review.

lib/CodeGen/MachineBlockPlacement.cpp
851

a trellis

852

trellises

870

Nit, auto*, or even just write out the type? "for (auto foo : bar)" is scary because it looks like you might be copying a nontrivially-sized object.

878

Do you want to avoid the double map lookup on SuccPred here? This function looks hot.

914

It looks like you don't actually care about anything other than the first two elements of this stable_sort?

Last time I checked, std::stable_sort was relatively slow, and prone to allocate heap memory. If this is hot, you may want to avoid std::stable_sort. std::nth_element will sort of do what you want, except it doesn't seem to be stable. You may do better with a custom "GetTopTwo" function.

Or maybe that's a premature optimization. :)

938

trellises? Now I'm not sure if this a typo or not, but I can't find "trelliss" as the plural in any dictionary I have onhand.

954

Nit, I'd move this down to under the comment that says "Collect the edge frequencies". Especially since the "2" in the constructor only makes sense after the if statement below.

959

Maybe "and a trellis of that size is basically unheard of"?

982

Any reason you don't want to return a SmallVector from getBestNonConflictingEdges? Or even an std::pair? Then you could say

WeightedEdge BestA, BestB;
std::tie(BestA, BestB) = getBestNonConflictingEdges(BB, Edges);

which seems closer to what you mean.

1001

Personally I'd prefer not to use "auto" here. MachineBasicBlock* isn't too much to type out, and otherwise there's no obvious, nearby anchor for the reader as to what's going on.

1278

Nit, capital letter

1285

This is unrelated to your patch, so you don't need to change it or anything, but if we always check !BlockFilter || BlockFilter->count(Foo), then shouldn't we just pass BlockFilter by reference and initialize it to an empty map when necessary? This would be much more ergonomic.

1292

Capital "i", lower-case "U", period.

This revision is now accepted and ready to land.Feb 13 2017, 3:05 PM
iteratee updated this revision to Diff 88277.Feb 13 2017, 4:49 PM
iteratee marked 7 inline comments as done.
davidxl added inline comments.Feb 14 2017, 3:19 PM
test/CodeGen/PowerPC/tail-dup-layout.ll
204

The term 'unavoidable' is not well defined -- why is 'then2' unavoidable?

254

Is this comment relevant here?

267

change name to trellis_test

275

Using non-equal branch probability here to make the result more obvious in different scenarios:

  1. there are conflict in best incoming edges
  2. there are no conflict etc.
305

We probably also need a test for trellis+triangle shape (without taildup)

323

change name.

iteratee updated this revision to Diff 88578.Feb 15 2017, 10:38 AM
iteratee marked 2 inline comments as done.

Improved test coverage in response to comments.

iteratee marked an inline comment as done.Feb 15 2017, 10:39 AM
iteratee added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
914

I don't expect the lists to be large in practice, so I'll just use stable_sort for now. It's simple enough to revisit if it becomes a bottleneck.

938

I messed up find and replace. It's fixed now.

test/CodeGen/PowerPC/tail-dup-layout.ll
254

Yes.

275

The test is now larger. It handles conflicting incoming edges, and a couple of non-conflicting edges, and a triangle.

305

f->ret is a triangle edge.

I'll make the test bigger with non-balanced edges to cover more scenarios.

This revision was automatically updated to reflect the committed changes.