This is an archive of the discontinued LLVM Phabricator instance.

Aggressive choosing best loop top
AbandonedPublic

Authored by cycheng on May 6 2016, 6:06 AM.

Details

Summary

We want to find better loop top for this common (and similar) pattern:

          entry               
            |                 
------> loop.header (body)    
|97%    /       \             
|      /50%      \50%         
--- latch <--- if.then        
       \    97%  /            
        \3%     /3%           
         loop.end

Currently, Branch Probability Basic Block Placement will generate BlockChain in this order:
entry -> loop.header -> if.then -> latch -> loop.end

This order cause latch needs an branch jumping back to loop.header when condition is true.

Better BlockChain order would be:
entry -> latch -> loop.header -> if.then -> loop.end

So latch can fall through loop.header without this jump.

Thanks Carrot for pointing out this performance issue: https://llvm.org/bugs/show_bug.cgi?id=25782

We also test this patch on Power8 by running SPEC2006, gcc and libquantum get 5% improvements.

Diff Detail

Event Timeline

cycheng updated this revision to Diff 56406.May 6 2016, 6:06 AM
cycheng retitled this revision from to Aggressive choosing best loop top.
cycheng updated this object.
cycheng added a subscriber: llvm-commits.
amehsan edited edge metadata.May 6 2016, 8:10 AM

Maybe we already have this and I missed it. Could you add a testcase, that has the pattern you want to optimize. That will help reviewing the code, and also prevent breaking your this optimization in the future.

amehsan added inline comments.May 9 2016, 5:36 AM
lib/CodeGen/MachineBlockPlacement.cpp
726–742

Could this hurt performance for the following example

     --------->entry
    |  |         |
    |  |    loop header
    |  |         |    \
    |  |         |      \
    |  ---- --Latch1<----- if.then
    |           |           |
    ----------Latch2 <------

Probabilities that matter: loop.heaer--->if.then 99%,  if then->Latch 1 99%, Latch1 ---> header 80%

The problem is that before your change the branch from if.then to Latch 1 is fall thru. After your change it is a jump back. If we have a single latch this shouldn't hurt the performance, but with multiple latches, the number of fall-thrus that your code, removes is more than those that it creates. I am not sure how important this is.

Checking for a single latch should be straightforward using LoopInfo API if this is really a problematic edge case.

amehsan added inline comments.May 9 2016, 8:22 AM
lib/CodeGen/MachineBlockPlacement.cpp
726–742

In the graph that I have drawn, arrows from Latch1 and Latch2 should go back to loop header. I hope it was not confusing.

cycheng added inline comments.May 10 2016, 4:47 AM
lib/CodeGen/MachineBlockPlacement.cpp
726–742

Thanks Ehsan for pointing out this!

I changed latch1 --> loop.header probability to 90%, because my patch checks this probability, if it <= 80%, then skip latch1.

         entry
           |
+----> loop.header
|       /      \
| 90%  /1%      \99%
+--latch1 <--- if.then
|     |    99%   |
|     v 10%      | 1%
---latch2 <------+

original:       this patch:

latch2          latch1
loop.header     loop.header
if.then         if.then
latch1          latch2

When do the original win:
loop.header -> if.then -> latch1 -> latch2 -> loop.header

this patch require two additional branch:

  • if.then -> latch1
  • latch2 -> loop.header

When do this patch win:
loop.header -> latch1 -> loop.header

I think I should check the probability of loop.header -> latch1, 50% looks like we can still get benefit, consider the two path again, with loop.header -> latch1 = 50%:

  1. loop.header -> if.then -> latch1 -> latch2 -> loop.header
    • 0.5 * 0.99 * 0.1 * 1.0 = (path probability) = 0.0495
    • original: * 1 (branch) = 0.0495
    • this patch: * 3 (branch) = 0.1485
  2. loop.header -> latch1 -> loop.header
    • 0.5 * 0.90 = (path probability) = 0.45
    • original: * 2 (branch) = 0.9
    • this patch: * 1 (branch) = 0.45

Right?

amehsan added inline comments.May 10 2016, 7:46 AM
lib/CodeGen/MachineBlockPlacement.cpp
726–742

The question is how much we can rely on these numbers? The probabilities are chosen statically by the compiler (or under FDO they come from training workload). So we might have probabilities during compilation that allow us to perform the code change, but real probabilities during execution may differ, and cause slow down.

For single latch case, your change is very robust, and does not cause degradation, even if actual probabilities differ from our static assumptions. But for multiple latch case this is not true.

There is still one possibility: we may want to accept the risk of degradation in some cases to have improvement in other cases.

Personally, I prefer to be conservative and disable it for multiple latches, unless some reason for accepting risky change is given. But I may be wrong.

cycheng updated this revision to Diff 56713.May 10 2016, 7:55 AM
cycheng edited edge metadata.

Fix AMDGPU test case failure

Have you benchmarked how this changes performance on other architectures? If you can't we'll need to get others do to so, as this is likely to have pretty widespread effects.

Also should get some of the other folks who've bene looking at loop layout to look at this. CC'ing David at least.

lib/CodeGen/AtomicExpandPass.cpp
900

Sink this to where it is used?

932–934

This comment doesn't really help here. What is "aggressive-best-top" and what does it mean?

I think what you're actually doing is trying to give probabilities to override the default loop probabilities because the spin loop for an atomic isn't expected to loop often if at all. I'm actually surprised you would have a 50% probability of looping here, I would have expected a relatively low probability of spinning to be more appropriate.

I think you should:

  1. split this out into an independent change with its own description and test case
  2. improve this comment to talk generically about the fact that the loop isn't expected to be taken unless there is contention.

Then the rest of this change can be predicated on that. I'm assuming that you can test this independently because its actually annotating the IR.

lib/CodeGen/MachineBlockPlacement.cpp
114–117

Any particular reason for a flag? Or for calling this aggressive? It seems pretty straight forward to try to select the best loop top among latches.

780

I would't reference system-z or a test case which might go away. You have the generic description here and can add details about why this is bad in a generic sense.

789

Why would you handle it if cold at all? Not clear why you need a different threshold here than the threshold used throughout this code for cold.

lib/Target/SystemZ/SystemZISelLowering.cpp
5298–5299

Much like in the atomic expand pass, I'd talk about the semantic *reason* why these probabilities are correct, rather than about some hypothetical layout.

Is this testable independently? If so, I would separate this too into its own patch.

davidxl edited edge metadata.May 11 2016, 5:24 PM

Sorry I missed this.

Chandler thanks for looping me in. I will take a look.

By briefly looking at the summary, it seems to me this is exactly the cost based loop rotation enhancement (by Cong) is supposed to solve. This option is currently off by default, and also guarded by Profile feedback.

Can you comment out the guard at line 1555 that checks profile data, and then use option -mllvm precise-rotation-cost=true to see if it solves your problem? We have plan to turn that on by default at some point after more performance experiment.

bmakam added a subscriber: bmakam.May 11 2016, 6:40 PM

Thanks for working on this. This pattern is also seen in spec2006/mcf, I will test this patch on AArch64/Kryo and let you know the results. FWIW, I think this is always good when the Loop has a single latch only, but for loops with multiple latches it might not always be beneficial without knowing the actual branch probabilities.

I verified that cost based rotation algorithm handles the issue as expected.

See the new test case added in r269267.

A new internal option is also added to help enable this rotation llc -force-precise-rotation-cost without requiring PGO. I recommend exploring existing rotation strategy (i.e., help by testing your use cases with performance data so that we can turn it on by default if desired) instead of adding more redundant logic here.

The missing weight fixes can be extracted out separately.

I don't think the change in MBP is the right thing to make -- we don't any more 'pattern' matching code like this in MBP, which can be very fragile. Please try the cost based loop rotation which is a general solution. For different targets, there are also tuning parameters JumpInstCost and MisfetchCost that can be tuned in a target dependent way.

lib/CodeGen/AtomicExpandPass.cpp
900

please split out change in this file into a different patch.

lib/Target/SystemZ/SystemZISelLowering.cpp
5271

This should also be split out into a different patch.

davidxl added inline comments.May 11 2016, 10:33 PM
test/CodeGen/X86/code_placement_loop_rotation2.ll
8

By the way, this example shows that the new expected layout is not as optimal as the one when -force-precise-rotation-cost is on (see below CHECK-PROFILE).

The total branch cost of the optimal layout [e, f, c, d, h, b, g ] is:

C(c->e) + C(f->h) + C(d->f) + C(h->exit) + C(b->c) + C(g->h)

while the total cost of aggressive loop top layout [h, b, g, f, c, d, e] is

C(c->e) + C(f->h) + C(d->f) + C(h->exit) + C(b->c) + C(g->h) + C(e->f)

Edge e->f is in the inner loop and it is a hot edge , the additional cost of C(e->f) can be high.

cycheng planned changes to this revision.May 12 2016, 8:27 AM

davidxl:

Yes, -force-precise-rotation-cost=true solve this issue, and the result looks better then this patch. I hope we can enable it by default.

Test on Power8 with ref data size (largest size), iteration = 1, cpu frequency governor is "performance" (maximum speed), bind on physical-cpu 0:

  • llvm: r269273 (2016 May 12)
  • -force-precise-rotation-cost=true v.s. false, > 1.0 means true is good.
  • Result:
400.perlbench 	1.00x
401.bzip2     	1.00x
403.gcc       	1.07x
429.mcf       	1.04x
445.gobmk     	1.02x
456.hmmer     	1.00x
458.sjeng     	1.01x
462.libquantum	1.00x
464.h264ref   	1.04x
471.omnetpp   	1.06x
473.astar     	0.97x
483.xalancbmk 	0.99x
  • llvm r267873 (2016 Apr 28) + this patch
  • aggressive-best-top=true v.s. false, > 1.0 means true is good.
  • Result:
400.perlbench 	1.01x
401.bzip2     	1.01x
403.gcc       	1.07x
429.mcf       	1.00x
445.gobmk     	1.02x
456.hmmer     	0.98x
458.sjeng     	0.99x
462.libquantum	1.03x
464.h264ref   	1.06x
471.omnetpp   	1.03x
473.astar     	1.00x
483.xalancbmk 	0.99x

So I thought I should use existing rotation strategy.

By the way, please look at this pattern:

          entry               
            |                 
------> loop.header (body)    
|97%    /       \             
|      /50%      \50%         
--- latch <--- if.then        
       |
       |3%
   loop.end

Current strategy generates this order: entry loop.header if.then latch loop.end
But I thought this pattern is also eligible to optimize, right?

The cost of loop rotation calculated by current rotation strategy:

BB#1 ('header') to the top: 273
BB#4 ('if.then') to the top: 297
BB#2 ('latch') to the top: 297
lib/CodeGen/AtomicExpandPass.cpp
932–934

Exactly! And yes, that loop isn't expected to be taken, I should use lower value.

lib/CodeGen/MachineBlockPlacement.cpp
114–117

For testing purpose :P

test/CodeGen/X86/code_placement_loop_rotation2.ll
8

Thanks for pointing out this : D

chandlerc:

I can test on PowerPC and x86, but I have no idea on other backend : (

CY

cycheng abandoned this revision.May 18 2016, 5:14 PM

Abandon this patch because there is already a mature mechanism: "Precise (Loop) Rotation Cost", I should rely on it.
Thanks for all of your help.