# Codegen: Make chains from lattice-shaped CFGsAbandonedPublicActions

Authored by iteratee on May 20 2016, 6:33 PM.

# Details

Reviewers
 davidxl • tstellarAMD haicheng
Summary

This change extends D27742 to allow a chain of triangles to
tail-duplicate and produce a lattice. The essential change is that if a
predecessor has the same successors as a layout predecessors, we ignore
that block when considering if we can tail-duplicate into unplaced
predecessors.

As an example consider the following CFG:

```  B   D   F   H
/ \ / \ / \ / \
A---C---E---G---Ret```

Where A,C,E,G are all small (Currently 2 instructions).

The CFG preserving layout is then A,B,C,D,E,F,G,H,Ret.

The current code will copy C into B, E into D and G into F and yield the layout
A,C,B(C),E,D(E),F(G),G,H,ret

```define void @straight_test(i32 %tag) {
entry:
br label %test1
test1: ; A
%tagbit1 = and i32 %tag, 1
%tagbit1eq0 = icmp eq i32 %tagbit1, 0
br i1 %tagbit1eq0, label %test2, label %optional1
optional1: ; B
call void @a()
br label %test2
test2: ; C
%tagbit2 = and i32 %tag, 2
%tagbit2eq0 = icmp eq i32 %tagbit2, 0
br i1 %tagbit2eq0, label %test3, label %optional2
optional2: ; D
call void @b()
br label %test3
test3: ; E
%tagbit3 = and i32 %tag, 4
%tagbit3eq0 = icmp eq i32 %tagbit3, 0
br i1 %tagbit3eq0, label %test4, label %optional3
optional3: ; F
call void @c()
br label %test4
test4: ; G
%tagbit4 = and i32 %tag, 8
%tagbit4eq0 = icmp eq i32 %tagbit4, 0
br i1 %tagbit4eq0, label %exit, label %optional4
optional4: ; H
call void @d()
br label %exit
exit:
ret void
}```

here is the layout after D27742:

```straight_test:                          # @straight_test
; ... Prologue elided
; BB#0:                                 # %entry ; A (merged with test1)
; ... More prologue elided
mr 30, 3
andi. 3, 30, 1
bc 12, 1, .LBB0_2
; BB#1:                                 # %test2 ; C
rlwinm. 3, 30, 0, 30, 30
beq      0, .LBB0_3
b .LBB0_4
.LBB0_2:                                # %optional1 ; B (copy of C)
bl a
nop
rlwinm. 3, 30, 0, 30, 30
bne      0, .LBB0_4
.LBB0_3:                                # %test3 ; E
rlwinm. 3, 30, 0, 29, 29
beq      0, .LBB0_5
b .LBB0_6
.LBB0_4:                                # %optional2 ; D (copy of E)
bl b
nop
rlwinm. 3, 30, 0, 29, 29
bne      0, .LBB0_6
.LBB0_5:                                # %test4 ; G
rlwinm. 3, 30, 0, 28, 28
beq      0, .LBB0_8
b .LBB0_7
.LBB0_6:                                # %optional3 ; F (copy of G)
bl c
nop
rlwinm. 3, 30, 0, 28, 28
beq      0, .LBB0_8
.LBB0_7:                                # %optional4 ; H
bl d
nop
.LBB0_8:                                # %exit ; Ret
ld 30, 96(1)                    # 8-byte Folded Reload
ld 0, 16(1)
mtlr 0
blr```

This is where the more bold strategy of this patch comes in. We allow E
to be placed, even though its predecessor B (after copying C) is
unplaced, because it is lattice shaped after tail-duplication.
This then produces the layout A,C,E,G,B,D,F,H,Ret. This layout does have
back edges, which is a negative, but it has a bigger compensating
positive, which is that it handles the case where there are long strings
of skipped blocks much better than the original layout. Both layouts
handle runs of executed blocks equally well. Branch prediction also
improves if there is any correlation between subsequent optional blocks.

Here is the resulting concrete layout:

```straight_test:                          # @straight_test
; BB#0:                                 # %entry ; A (merged with test1)
mr 30, 3
andi. 3, 30, 1
bc 12, 1, .LBB0_4
; BB#1:                                 # %test2 ; C
rlwinm. 3, 30, 0, 30, 30
bne      0, .LBB0_5
.LBB0_2:                                # %test3 ; E
rlwinm. 3, 30, 0, 29, 29
bne      0, .LBB0_6
.LBB0_3:                                # %test4 ; G
rlwinm. 3, 30, 0, 28, 28
bne      0, .LBB0_7
b .LBB0_8
.LBB0_4:                                # %optional1 ; B (Copy of C)
bl a
nop
rlwinm. 3, 30, 0, 30, 30
beq      0, .LBB0_2
.LBB0_5:                                # %optional2 ; D (Copy of E)
bl b
nop
rlwinm. 3, 30, 0, 29, 29
beq      0, .LBB0_3
.LBB0_6:                                # %optional3 ; F (Copy of G)
bl c
nop
rlwinm. 3, 30, 0, 28, 28
beq      0, .LBB0_8
.LBB0_7:                                # %optional4 ; H
bl d
nop
.LBB0_8:                                # %exit```

# Diff Detail

### Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
iteratee updated this revision to Diff 58548.May 25 2016, 5:32 PM
iteratee removed rL LLVM as the repository for this revision.

Added fixes for AMDGPU tests, as some intervening change has enabled this optimization for that target.

iteratee edited edge metadata.May 31 2016, 11:35 AM

Did a quick run through for clarity. A few inline comments. Few requests to break things up. Check for coding style nits across the entire set of code and feel free to run clang format on the lines you've changed.

Thanks for the work so far!

-eric

lib/CodeGen/MachineBlockPlacement.cpp
724

Sadly I'm not sure if you're adding or deleting whitespace here. Either way feel free to do it separately.

756–757

Go ahead and commit this separately (along with the one below).

Also "mismatch" and you shouldn't need the \n.

825–826

Can you document everything that's going on here more please? In particular, what's going on with the callback here and why it needs to be a callback rather than happening on the spot.

1288

Once again I can't remember if we have autobrief turned on or not...

1297

Formatting.

1312

Coding style nit: no braces around single lines.

iteratee updated this revision to Diff 59969.Jun 7 2016, 4:42 PM
iteratee marked 2 inline comments as done.

iteratee updated this revision to Diff 59970.Jun 7 2016, 4:45 PM
iteratee marked an inline comment as done.

lib/CodeGen/MachineBlockPlacement.cpp
825–826

I've added a comment to this effect, but the reason it has to be a callback is because none of the things that occur would be valid after deleting the block. (use after free).

As to the rest, the function is broken up into small chunks with a comment as to what each chunk is doing. Is there something more you'd like to see?

1288

Even if we do, it's probably better to match the existing style for this change, and clean it up in a separate patch.

iteratee updated this revision to Diff 60557.Jun 13 2016, 10:28 AM

davidxl edited edge metadata.Jun 13 2016, 10:52 AM

Kyle, can you update your patch and do a rebase -- there were recent restructure changes in MBP which can make the code cleaner.

iteratee updated this revision to Diff 62151.Jun 28 2016, 4:25 PM

Added changes to handle re-laying out code that had been tail-duplicated into the same shape. Necessary to work correctly with tail merging during layout.

Herald added a subscriber: nemanjai. Jun 28 2016, 4:25 PM

OK, this took longer than I thought it would, because I had to come up with
a good way to interact with tail-merging during layout. Please take a look
now.

Kyle.

iteratee updated this revision to Diff 62155.Jun 28 2016, 4:41 PM

minor cleanups.

thanks. I don't seem to find explicit test cases added for this change. can you add one ?

Please also update the description with a real motivation example -- the original code and the pseudo code after the transformation.

lib/CodeGen/MachineBlockPlacement.cpp
316

This document does not help understand the meaning. Can add a reference to detailed description of the algoirthm in other place (e.g. function definition).

408

Brief documentation.

411

Same here.

1118

Please outline this big part into its own method.

lib/CodeGen/TailDuplicator.cpp
787 ↗(On Diff #62155)

Split out the refactor change.

872 ↗(On Diff #62155)

Split out the clean-up changes.

test/CodeGen/PowerPC/tail-dup-layout.ll
1

This test case can use some simplifications. Why not just do simple function call in optional branches? The test block can also be simplified for instance testing input parameters.

iteratee updated this revision to Diff 67760.Aug 11 2016, 3:29 PM
iteratee updated this object.
iteratee marked 2 inline comments as done.

Lots of updates. Mainly pulled some of the changes into D18226 and expanded the commit message.

iteratee updated this object.Aug 11 2016, 3:35 PM
iteratee updated this revision to Diff 67764.Aug 11 2016, 3:39 PM

Minor fix.

iteratee updated this revision to Diff 67767.Aug 11 2016, 3:53 PM
iteratee marked 5 inline comments as done.

include/llvm/Analysis/LoopInfoImpl.h
188 ↗(On Diff #67764)

I should probably split this out.

lib/CodeGen/MachineBlockPlacement.cpp
1118

This was pulled into D18226 and placed in its own method there.

lib/CodeGen/TailDuplicator.cpp
872 ↗(On Diff #62155)

I think you mean the line below. I'll split that out. The line above isn't clean up.

iteratee updated this revision to Diff 68636.Aug 18 2016, 5:06 PM

Simple rebase.

iteratee marked 2 inline comments as done.Aug 29 2016, 5:19 PM

There are two independent problems that this patch tries to address.

1. Enable tail duplication for cases when current layout prefers topological order
2. Handling a sequence of tail-duplicatable blocks.

Please split out 1) and 2) into two different patches.

For patch 1), I don't think it is the right approach to piggy back the implementation on the outline heuristics, please split it out. Ideally, the fix should be simply add one new heuristic checked before hasBetterLayoutSuccessor check or preserve top order only when tail dup is not good:

```if (hasBetterLayoutSuccessor(... ) ) {
if (!IsTailDupCandidate(Succ)) {
continue;
}
}```
iteratee updated this revision to Diff 81332.Dec 13 2016, 5:17 PM

This is now MUCH shorter.

I realized with some help from davidxl that I didn't need to tie this to the outlining.

Also, because we need to recognize the pattern that occurs from repeated tail-duplication (So that when we repeat layout, we get the same result), we just recognize the pattern instead of using the delay set, as it's redundant.

I need to rewrite the description, but the code should be much easier to review now. The change the placement algorithm is now 22 lines, and most of that is a utility function for CFG matching.

Herald edited edge metadata. Dec 13 2016, 5:17 PM
lib/CodeGen/MachineBlockPlacement.cpp
572

Add a documentation line to this method.

623

Add more explanation here (as comment) and possible with a simple example?

arsenm added a subscriber: arsenm.Dec 15 2016, 1:03 PM
test/CodeGen/AMDGPU/convergent-inlineasm.ll
32

Unnecessary whitespace change

iteratee updated this revision to Diff 83465.Jan 6 2017, 5:11 PM

Rebase and re-write description

Herald edited edge metadata. Jan 6 2017, 5:11 PM
iteratee retitled this revision from Codegen: Outline for chains of tail-duplicable blocks. to Codegen: Make chains from lattice-shaped CFGs.Jan 6 2017, 5:15 PM
iteratee updated this object.
iteratee updated this revision to Diff 83729.Jan 9 2017, 4:04 PM
iteratee marked 5 inline comments as done.

Herald edited edge metadata. Jan 9 2017, 4:04 PM
iteratee updated this revision to Diff 83739.Jan 9 2017, 4:32 PM

Herald edited edge metadata. Jan 9 2017, 4:32 PM

What is the base revision of this patch?

Since the patch has been rewritten, is it possible to create a new patch (after D27742 lands) and abandon this one? A clean restart can simplify things a lot.

What is the base revision of this patch?

The base is D27742

Since the patch has been rewritten, is it possible to create a new patch (after D27742 lands) and abandon this one? A clean restart can simplify things a lot.

I don't really want to submit D27742 without this patch. I created a new patch as you requested and marked it as a child of D27742.