Page MenuHomePhabricator

[BPI] Improve static heuristics for "cold" paths.
Needs ReviewPublic

Authored by ebrevnov on May 6 2020, 5:36 AM.

Details

Summary

Current approach doesn't work well in cases when multiple paths are predicted to be "cold". By "cold" paths I mean those containing "unreachable" instruction, call marked with 'cold' attribute and 'unwind' handler of 'invoke' instruction. The issue is that heuristics are applied one by one until the first match and essentially ignores relative hotness/coldness
of other paths.

New approach unifies processing of "cold" paths by assigning predefined absolute weight to each block estimated to be "cold". Then we propagate these weights up/down IR similarly to existing approach. And finally set up edge probabilities based on estimated block weights.

One important difference is how we propagate weight up. Existing approach propagates the same weight to all blocks that are post-dominated by a block with some "known" weight. This is useless at least because it always gives 50\50 distribution which is assumed by default anyway. Worse, it causes the algorithm to skip further heuristics and can miss setting more accurate probability. New algorithm propagates the weight up only to the blocks that dominates and post-dominated by a block with some "known" weight. In other words, those blocks that are either always executed or not executed together.

In addition new approach processes loops in an uniform way as well. Essentially loop exit edges are estimated as "cold" paths relative to back edges and should be considered uniformly with other coldness/hotness markers.

Diff Detail

Unit TestsFailed

TimeTest
60 mswindows > LLVM.DebugInfo/X86::addr-tu-to-non-tu.ll
Script: -- : 'RUN: at line 1'; c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\llc.exe -filetype=obj -O0 -generate-type-units -split-dwarf-file=x.dwo < C:\ws\w16n2-1\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\addr-tu-to-non-tu.ll | c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\llvm-dwarfdump.exe -debug-info -debug-types - | c:\ws\w16n2-1\llvm-project\premerge-checks\build\bin\filecheck.exe --implicit-check-not=Unit --implicit-check-not=contents --implicit-check-not=declaration C:\ws\w16n2-1\llvm-project\premerge-checks\llvm\test\DebugInfo\X86\addr-tu-to-non-tu.ll

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
john.brawn added inline comments.Jun 3 2020, 10:57 AM
llvm/include/llvm/Analysis/BranchProbabilityInfo.h
214

Could do with a comment here explaining what these types mean - in particular the value type of SccBlockTypeMap is a bitmask of SccBlockType, so explaining that a block can be of more than one type would be good.

340

What's TODO here (and similar for the TODOs below)?

371–412

Would it be possible to move all of these is/getSCC functions into SccInfo, to avoid cluttering up BranchProbabilityInfo?

372

Should this be getSccBlockType?

373

Probably calculateSccBlockInfo would be better here.

llvm/lib/Analysis/BranchProbabilityInfo.cpp
905

Weight, not Weigth

919

This is making sure that the divide isn't rounding down to zero, yes? In which case I don't think it'll correctly handle the case where getEstimatedEdgeWeight returns ZERO_WEIGHT.

ebrevnov updated this revision to Diff 273316.Jun 25 2020, 5:48 AM
ebrevnov marked 6 inline comments as done.

Fixes according to comments + rebase

ebrevnov marked an inline comment as done.Jun 25 2020, 5:54 AM

Hi @john.brawn,

Sorry for keeping silence for that long. I had to work on higher priority tasks before I got a chance to return back to this. I think I fixed all your remarks. Please take a look.

Thanks
Evgeniy

ebrevnov updated this revision to Diff 273323.Jun 25 2020, 5:59 AM

Minor update

Can you provide some performace data (SPEC/llvm suite) ?

Can you provide some performace data (SPEC/llvm suite) ?

By SPEC I guess you mean SPEC CPU2006/2017, right? Unfortunately, I don't have access to this benchmark. Would be nice if somebody could check it for me. On my side I can check SPEC JVM2008 and some other java benchmarks.

Regarding LLVM suite. I tried running it about half year ago but got negative experience. The results were not consistent and showed big variation. I can give one more try though. Any advice on running it is highly appreciated.

ebrevnov updated this revision to Diff 273603.Jun 26 2020, 12:31 AM

Formatting

Can you provide some performace data (SPEC/llvm suite) ?

By SPEC I guess you mean SPEC CPU2006/2017, right? Unfortunately, I don't have access to this benchmark. Would be nice if somebody could check it for me. On my side I can check SPEC JVM2008 and some other java benchmarks.

Regarding LLVM suite. I tried running it about half year ago but got negative experience. The results were not consistent and showed big variation. I can give one more try though. Any advice on running it is highly appreciated.

Yeah, any benchmark data would be great.

I measured performance on bunch of java related benchmarks we have in house including SPECJVM2008, SPECJbb2015, DaCapo9.12 and others. I don't see any noticeable impact on performance. Please let me know if you need more details on these experiments or some additional perf data.

This patch is a little large to review properly. Is there any major refactoring (NFC or mostly NFC) that can be separated out?

llvm/lib/Analysis/BranchProbabilityInfo.cpp
126

why is it set to 0xffff?

This patch is a little large to review properly. Is there any major refactoring (NFC or mostly NFC) that can be separated out?

I think I should be able to extract one or two NFCs out of this. Let me do that.

The PGOUse pass can choose not to annotate any branches with total weights == 0. Now the question becomes how do we tell PGOUse pass whether the entry should be set to 0 or leave it not set. There are two ways to do it (to signal it is not really cold, but unknown):

  1. Remove the function from the indexed format profile;
  2. set all counts to some sentinel value such as -1.
ebrevnov added a comment.EditedJul 24 2020, 5:01 AM

I have moved part of the change into a separate review D84514. Let's start with that...

Marked as WIP since I'm going to extract one more part out from this review.

ebrevnov retitled this revision from [BPI] Improve static heuristics for "cold" paths. to [WIP][BPI] Improve static heuristics for "cold" paths..Jul 28 2020, 6:10 AM
ebrevnov updated this revision to Diff 282186.Jul 31 2020, 4:11 AM

Rebased on top of D84838

ebrevnov added inline comments.Jul 31 2020, 4:27 AM
llvm/lib/Analysis/BranchProbabilityInfo.cpp
126

Short answer is to preserve same branch probability as before.
Old implementation assigned CC_NONTAKEN_WEIGHT weight to "normal" path and CC_TAKEN_WEIGHT to "cold" path. Absolute values are irrelevant and ratio CC_NONTAKEN_WEIGHT/CC_TAKEN_WEIGHT = 16 defines relative probability of "normal" and "cold" branches. New implementation uses pre-selected weight for all "nornal" paths DEFAULT_WEIGHT. Thus relative ratio is calculated as DEFAULT_WEIGHT/COLD_WEIGHT = 0xfffff/0xffff = 16.

ebrevnov retitled this revision from [WIP][BPI] Improve static heuristics for "cold" paths. to [BPI] Improve static heuristics for "cold" paths..Jul 31 2020, 4:27 AM

It's a bit shorter now. Let's see if it's manageable now. Further reduction would be problematic.

A lot of test changes can probably be extracted out as NFC (to make the string test more robust) -- this will reduce the size of the patch.

Do you have performance numbers related to the change?

llvm/lib/Analysis/BranchProbabilityInfo.cpp
126

Why not define a value for DEFAULT_WEIGHT, and define COLD_WEIGHT to be 1/16 of the DEFAULT weight? It is more readable that way.

652

document the method and params.

660

explain here. Why the early estimated weight takes precedence?

736

why doing RPO walk? the algorithm does not seem to depend on this order.

llvm/test/Analysis/BlockFrequencyInfo/redundant_edges.ll
12

why is the new estimation better?

A lot of test changes can probably be extracted out as NFC (to make the string test more robust) -- this will reduce the size of the patch.

Unfortunately, in most cases, I don't see how this can be done with out loosing current coverage. There are few cases like 'successors: %bb.3(0x80000000), %bb.4(0x00000000)' where we could remove exact numbers and just check block names. I think we better keep exact probabilities in all other cases. That actually helps to catch unintended changes.

Do you have performance numbers related to the change?

I measured performance on bunch of java related benchmarks we have in house including SPECJVM2008, SPECJbb2015, DaCapo9.12 and others. I don't see any noticeable impact on performance.

llvm/lib/Analysis/BranchProbabilityInfo.cpp
126

This is just the way how weights were defined before (using direct numbers instead of ratio). Please note all other cases (like ZH_TAKEN_WEIGHT, ZH_NONTAKEN_WEIGHT) don't use ratio as well. Another theoretical reason could be an ability to represent ratios with non zero remainder.

652

sure

660

will add a comment.

736

You are right and it should not affect correctness but could cause the algorithm to do more iterations. 'While' loop following this one propagates weights from successors to predecessors. If any of the successors not yet evaluated it will trigger an additional iteration when get evaluated. That means it's preferred to evaluate successors before predecessors.

llvm/test/Analysis/BlockFrequencyInfo/redundant_edges.ll
12

This is tough question. Let's take an extreme case when loop has infinite number of exits. In this case probability to take back branch should go to zero and loop frequency should be 1.0. That means more exits we have less probable to got to the next iteration.

Existing algorithm completely ignores number of exits and always assumes loop makes 32 iteration. New algorithm assumes that back branch weight is 32 times higher than default exiting weight. Thus in case of one loop exit both algorithms give the following probability of the back branch:

edge loop -> loop probability is 0x7c000000 / 0x80000000 = 96.88% [HOT edge]

But because in this case there are two exiting edges probability of the back branch is calculated as '32*DW/(32*DW+2*DW)==32/34=0.94"

edge loop -> loop probability is 0x783e0f84 / 0x80000000 = 93.94%

I think this is pretty reasonable probability much better matching general high level picture.

There is one cavity though. Please note in this example we have two exiting branches from latch. If we had side exit from another block even new algorithm won't take it into account. I think we should put a TODO on that and follow up later. What do you think?

ebrevnov added inline comments.Aug 11 2020, 11:46 PM
llvm/lib/Analysis/BranchProbabilityInfo.cpp
652

sure

In fact, there is a documentation in header file. In source file I can put more details on behavior in case of rewrite attempt as you requested bellow.

ebrevnov updated this revision to Diff 284984.Aug 12 2020, 12:04 AM

Added a comment to updateEstimatedBlockWeight

wenlei added a subscriber: hoyFB.Aug 12 2020, 9:58 PM
wenlei added a subscriber: wenlei.

Is this work based on any paper/implementation?
Can we add some documentation at the top of the file to get an overall idea of the cost model?

llvm/lib/Analysis/BranchProbabilityInfo.cpp
716

nit: inconsistent comments '' vs '/'

753

do we have a tab here?

This comment was removed by xbolva00.

Is this work based on any paper/implementation?

I don't propose anything completely new here. New algorithm does essentially the same as existing one but

  1. fixes several correctness issues
  2. extends the approach to handle loops and invokes in a universal way

Can we add some documentation at the top of the file to get an overall idea of the cost model?

I don't think "cost model" is applicable here....anyway I can try to describe how thing work using some example. Is this what you are looking for?

llvm/lib/Analysis/BranchProbabilityInfo.cpp
716

Will remove extra /

753

No, I don't have tabs in the code. Looks like phabricator represents indention increase by 4 this way.

..anyway I can try to describe how thing work using some example. Is this what you are looking for?

yes. Thank you.

please run clang-format.

ebrevnov updated this revision to Diff 286234.Aug 18 2020, 3:03 AM

Added description of the algorithm to the header + formatting.

please run clang-format.

Done

..anyway I can try to describe how thing work using some example. Is this what you are looking for?

yes. Thank you.

Done

Is the test failure related to this patch?

Is the test failure related to this patch?

I doubt it is. But I will double check before commit.

ebrevnov edited the summary of this revision. (Show Details)Wed, Sep 16, 10:19 PM