This is an archive of the discontinued LLVM Phabricator instance.

Using branch probability to guide critical edge splitting.
ClosedPublic

Authored by danielcdh on Sep 21 2016, 3:55 PM.

Details

Summary

The original heuristic to break critical edge during machine sink is relatively conservertive: when there is only one instruction sinkable to the critical edge, it is likely that the machine sink pass will not break the critical edge. This leads to many speculative instructions executed at runtime. However, with profile info, we could model the splitting benefits: if the critical edge has 50% taken rate, it would always be beneficial to split the critical edge to avoid the speculated runtime instructions. This patch uses profile to guide critical edge splitting in machine sink pass.

The performance impact on speccpu2006 on Intel sandybridge machines:

spec/2006/fp/C++/444.namd 25.3 +0.26%
spec/2006/fp/C++/447.dealII 45.96 -0.10%
spec/2006/fp/C++/450.soplex 41.97 +1.49%
spec/2006/fp/C++/453.povray 36.83 -0.96%
spec/2006/fp/C/433.milc 23.81 +0.32%
spec/2006/fp/C/470.lbm 41.17 +0.34%
spec/2006/fp/C/482.sphinx3 48.13 +0.69%
spec/2006/int/C++/471.omnetpp 22.45 +3.25%
spec/2006/int/C++/473.astar 21.35 -2.06%
spec/2006/int/C++/483.xalancbmk 36.02 -2.39%
spec/2006/int/C/400.perlbench 33.7 -0.17%
spec/2006/int/C/401.bzip2 22.9 +0.52%
spec/2006/int/C/403.gcc 32.42 -0.54%
spec/2006/int/C/429.mcf 39.59 +0.19%
spec/2006/int/C/445.gobmk 26.98 -0.00%
spec/2006/int/C/456.hmmer 24.52 -0.18%
spec/2006/int/C/458.sjeng 28.26 +0.02%
spec/2006/int/C/462.libquantum 55.44 +3.74%
spec/2006/int/C/464.h264ref 46.67 -0.39%

geometric mean +0.20%

Manually checked 473 and 471 to verify the diff is in the noise range.

Diff Detail

Event Timeline

danielcdh updated this revision to Diff 72127.Sep 21 2016, 3:55 PM
danielcdh retitled this revision from to Using branch probability to guide critical edge splitting..
danielcdh updated this object.
danielcdh added a reviewer: davidxl.
danielcdh added a subscriber: llvm-commits.
danielcdh updated this revision to Diff 72514.Sep 26 2016, 10:12 AM

Lower the splitting threshold and put it in an option.

davidxl added inline comments.Sep 26 2016, 12:07 PM
test/CodeGen/X86/phys_subreg_coalesce-2.ll
19

What is this change about?

test/CodeGen/X86/pr2659.ll
17

What is this change about?

danielcdh added inline comments.Sep 26 2016, 1:12 PM
test/CodeGen/X86/phys_subreg_coalesce-2.ll
19

This patch will sink the mov to critical path, which appears after loop body. So the updates in the test changes to check that there is not "mov" in the loop body, i.e. between "forbody" and "jbe"

test/CodeGen/X86/pr2659.ll
17

The movl $1 will be sinked to critical edge, which is layouted after the loop body.

davidxl added inline comments.Sep 26 2016, 1:31 PM
test/CodeGen/X86/phys_subreg_coalesce-2.ll
19

should you check there is no move between forcond.preheader and for.body.preheader instead?

davidxl edited edge metadata.

Renato, can you help take a look at the changes in ARM test cases?

David

danielcdh added inline comments.Sep 26 2016, 1:46 PM
test/CodeGen/X86/phys_subreg_coalesce-2.ll
19

The purpose of the original test is to test if all moves are hoisted from loop body to loop header. With this patch, one of the hoisted move is sinked to critical edge (still outside loop body), but there are still moves in the preheader thus it will be hard to check specific move has been sinked (as they could be assigned any register).

We have other tests to cover test cases where instructions should be sinked to critical edge, thus maybe in this patch, we just maintain the original check without adding new check for the machine sink?

davidxl added inline comments.Sep 26 2016, 1:50 PM
test/CodeGen/X86/phys_subreg_coalesce-2.ll
19

Ok. Perhaps just a comment that there is no mov in loop body.

danielcdh updated this revision to Diff 72572.Sep 26 2016, 2:37 PM
danielcdh edited edge metadata.

update

Can you also add an explicit test case ?

test/CodeGen/ARM/code-placement.ll
15

Can you change it to use lit variable for the label ?

35

The layout change in function 't2' is weird -- there are unnecessary branch back and forth introduced. Can you check why (not directly related to this patch -- but triggers some bugs).

danielcdh updated this revision to Diff 73162.Sep 30 2016, 4:35 PM
danielcdh marked an inline comment as done.

update and add new test

davidxl added inline comments.Oct 3 2016, 11:26 AM
test/CodeGen/ARM/code-placement.ll
35

I looked at this a little more. The layout change is caused by loop rotation (in MBP) exposed. The loop rotation here is beneficial because the Freq of the outer loop's back edge is larger than the sum of the exit edge of the inner loop and the incoming edge of the outer loop.

oops, my original comment about t2 was unsubmitted...

I think the if-conversion leads to the loop rotation difference as observed by David?

test/CodeGen/ARM/code-placement.ll
35

For t2, there are originally 2 conditional jumps in level-1 loop. machine sink splitted one critical edge, making it able to be if-converted later (by machine if-converter). As a result, there is one less conditional jump in the level-1 loop, and thus the other conditional jump needs to become backedge, and thus need to have one unconditional branch from l-1 loop preheader to header.

yes, loop rotation is caused by ifconvert. The cfg does not have an edge to the function with predicated exit sequence like this:

add r12, r12, #1
cmp r12, r0
moveq r0, r1
popeq {r4, pc}
b LBB0_1

This looks like an unconditional loop backedge, which leads to MBP to consider this block a candidate to be loop top. With a better loop rotation algorithm, this opportunity should be discovered even when CFG is properly modeled.

Ping...

Renato, could you help take a look at this patch?

Thanks,
Dehao

A related comment -- (no need to be addressed here):

If a instruction defines value that is used via more than one cold edges (split or to be split), the MachineSink should probably be enhanced to let it sunk. For instance in the attached the CFG, mov $0x8000000, %r12d should be sunk into La0, Lb0 and Lb7 from L40.

digraph foo {
node [shape=box]
size="11,8.5"
_exit_ [label="_exit_\n ", style=filled, color=gray]
L0000000000000000 [label="L0000000000000000\n<foo>: \n ", style=filled, color=gray]
L0 [label="L0\npush %rbp \n ", style=filled, color=gray]
L1 [label="L1\npush %r15 \npush %r14 \n ", style=filled, color=gray]
L5 [label="L5\npush %r13 \n ", style=filled, color=gray]
L7 [label="L7\npush %r12 \npush %rbx \n ", style=filled, color=gray]
La [label="La\nsub $0x18,%rsp \n ", style=filled, color=gray]
Le [label="Le\nmov %rcx,%r14 \nmov %esi,%ebp \ntest %ebp,%ebp \njle e7 <foo+0xe7>\n ", style=filled, color=gray]
L1b [label="L1b\n.. \n ", style=filled, color=gray]
L40 [label="L40\n..\nmov $0x80000000,%r12d \ntest %eax,%eax \njs a0 <foo+0xa0>\n ", style=filled, color=gray]
L52 [label="L52\n..\njs b0 <foo+0xb0>\n ", style=filled, color=gray]
L5f [label="L5f\n..\njs b7 <foo+0xb7>\n ", style=filled, color=gray]
L71 [label="L71\n..\nmov %eax,%r12d \nand $0x80000000,%r12d \n..jmp c0 <foo+0xc0>\n", style=filled, color=gray]
La0 [label="La0\nxor %ebx,%ebx \njmp c0 <foo+0xc0>\ndata32 data32 nopw\n ", style=filled, color=gray]
Lb0 [label="Lb0\nmov $0x1,%ebx \njmp c0 <foo+0xc0>\n ", style=filled, color=gray]
Lb7 [label="Lb7\nmov $0x2,%ebx \nnopl 0x0(%rax) \n ", style=filled, color=gray]
Lc0 [label="Lc0\nmov %ebx,%edi \nmov %r12d,%esi \ncallq ca <foo+0xca>\nxor %eax,%eax \n.. \njs df <foo+0xdf>\n ", style=filled, color=gray]
Ld5 [label="Ld5\nmov %ebx,%edi \nmov %r12d,%esi \ncallq df <foo+0xdf>\n ", style=filled, color=gray]
Ldf [label="Ldf\ndec %ebp \njne 40 <foo+0x40>\n ", style=filled, color=gray]
Le7 [label="Le7\nadd $0x18,%rsp \npop %rbx \npop %r12 \npop %r13 \npop %r14 \npop %r15 \npop %rbp \nretq \n ", style=filled, color=gray]
L0000000000000000 -> L0 ;
L0 -> L1 ;
L1 -> L5 ;
L5 -> L7 ;
L7 -> La ;
La -> Le ;
Le -> Le7 ;
Le -> L1b ;
L1b -> L40 ;
L40 -> La0 ;
L40 -> L52 ;
L52 -> Lb0 ;
L52 -> L5f ;
L5f -> Lb7 ;
L5f -> L71 ;
L71 -> Lc0 ;
La0 -> Lc0 ;
Lb0 -> Lc0 ;
Lb7 -> Lc0 ;
Lc0 -> Ldf ;
Lc0 -> Ld5 ;
Ld5 -> Ldf ;
Ldf -> L40 ;
Ldf -> Le7 ;
Le7 -> _exit_ ;
}

davidxl accepted this revision.Oct 18 2016, 9:09 AM
davidxl edited edge metadata.

lgtm

(The branch threshold may need to be more biased but we won't know for sure for now. May need to tune it ..)

This revision is now accepted and ready to land.Oct 18 2016, 9:09 AM
danielcdh updated this revision to Diff 75075.Oct 18 2016, 2:45 PM
danielcdh edited edge metadata.

update new failing testcase.

danielcdh closed this revision.Oct 18 2016, 2:45 PM
This revision was automatically updated to reflect the committed changes.
danielcdh reopened this revision.Oct 18 2016, 4:24 PM

This introduced ppc build failure as MachineBranchProbabilityInfo has not be declared as dependency.

This revision is now accepted and ready to land.Oct 18 2016, 4:24 PM
danielcdh updated this revision to Diff 75088.Oct 18 2016, 4:27 PM

add dependency.

danielcdh updated this revision to Diff 75089.Oct 18 2016, 4:31 PM

add machine-sink

danielcdh closed this revision.Oct 18 2016, 4:33 PM
danielcdh reopened this revision.Oct 20 2016, 9:36 AM

reopen again for bug fixing...

This revision is now accepted and ready to land.Oct 20 2016, 9:36 AM
danielcdh updated this revision to Diff 75321.Oct 20 2016, 11:07 AM

fix the bug

The bug is fixed:

The assumption "From->isSuccessor(To)" is untrue before calling MBPI->getEdgeProbability(From, To). Add check to make sure From->To edge exists before checking edge probability.

danielcdh closed this revision.Oct 20 2016, 11:16 AM