This is an archive of the discontinued LLVM Phabricator instance.

[MBP] Move a latch block with conditional exit and multi predecessors to top of loop
ClosedPublic

Authored by Carrot on Feb 13 2018, 2:15 PM.

Details

Summary

The current findBestLoopTop can find and move only one kind of block to the top: a latch block with a single successor. Another common case is:

  • a latch block
  • it has two successors: one is the loop header, the other is an exit
  • it has more than one predecessor

If it is placed below one of its predecessors P, only P can fall through to it; all other predecessors need a jump to it, followed by another conditional jump to the loop header. If it is instead moved before the loop header, all of its predecessors jump to it, and it then falls through to the loop header. So every predecessor except P saves one taken branch.
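
As a minimal illustration (a hypothetical C++ loop; the block names and the before/after orderings are only a sketch of the idea, not output of the pass):

// "for.inc" is the latch: two successors (the header "for.body" and the
// exit) and two predecessors ("for.body" and "if.then").
void scale(int *a, int n) {
  for (int i = 0; i < n; i++) {  // for.body: loop header
    if (a[i] < 0)                // predecessor 1 of the latch
      a[i] = -a[i];              // if.then: predecessor 2 of the latch
  }                              // for.inc: latch, conditionally exits
}

// Old layout: for.body, if.then, for.inc
//   if.then falls through to the latch, but for.body needs a taken branch
//   to reach it, plus the taken back edge: up to two taken branches per
//   iteration.
// New layout: for.inc, for.body, if.then
//   both predecessors branch to the latch, which then falls through to the
//   header: one taken branch per iteration on either path.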

Diff Detail

Event Timeline

Carrot created this revision.Feb 13 2018, 2:15 PM

Please add a few new test cases first.

Carrot updated this revision to Diff 134325.Feb 14 2018, 2:56 PM

Add a new test case.

davidxl added inline comments.Feb 15 2018, 9:52 AM
test/CodeGen/X86/move_latch_to_loop_top.ll
39

This is not the optimal rotation for the loop. The optimal rotation should be

true
latch
header
false

in which case there is one more fall through from true to latch.

I think the enhancement should shoot for getting the optimal layout.

Carrot updated this revision to Diff 135512.Feb 22 2018, 1:53 PM

Add more code to iteratively find a new block that can be moved before the old loop top block, reducing taken branches. Now it lays out the test case optimally, as David suggested.

Carrot marked an inline comment as done.Feb 22 2018, 1:54 PM
davidxl added inline comments.Feb 26 2018, 9:20 AM
lib/CodeGen/MachineBlockPlacement.cpp
1769–1929

BottomBlock --> bottom block BB

1770

the following case.

1779

Add some arrows in the graph and explain more in text why it is not beneficial

1803

Add more comment about the intention of this method:

The method checks whether the number of reduced taken branches is less than the number of increased taken branches (to the exit block when rotation happens); if so, it returns true.

1988

It makes the assumption that there is an existing fall-through to the exit BB. If not, it is always beneficial to rotate.

davidxl added inline comments.Feb 26 2018, 9:20 AM
lib/CodeGen/MachineBlockPlacement.cpp
1833

--> and it has more than one predecessor.

1935

Add comment that the reduced taken branches will be compared with the increased taken branch to the loop exit block.

1972

Why is this check needed?

test/CodeGen/X86/move_latch_to_loop_top.ll
75

perhaps draw a diagram here.

Carrot updated this revision to Diff 136143.Feb 27 2018, 1:26 PM
Carrot marked 7 inline comments as done.
Carrot added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
1972

Only two patterns are handled; neither case has more than 2 successors.

1988

Yes.
But laying out the loop body and rotating the loop happen after this function, so it is difficult to guess which BB will be put at the bottom of the loop. To be conservative, assume the current candidate BB can be laid out at the bottom and fall through to the exit BB.

Carrot updated this revision to Diff 201075.May 23 2019, 2:50 PM

Update the patch to the current code base, and also make two improvements:

  • Do more precise cost analysis based on the increased number of fallthroughs (a sketch of the idea follows below).
  • Also enable findBestLoopTop when profile information is available.
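
Conceptually, the more precise cost analysis weighs the branch frequency that becomes fall-through against the frequency of the new taken branch to the exit. A standalone sketch of that comparison (hypothetical names and plain integers instead of LLVM's BlockFrequency; not the in-tree code):

#include <cstdint>
#include <vector>

// One incoming edge of the candidate latch.
struct PredEdge {
  uint64_t Freq;        // estimated or profiled execution frequency
  bool WasFallThrough;  // did this edge already fall through before the move?
};

// Net gain of moving the candidate latch to the loop top: every predecessor
// edge that previously required a taken branch now falls through, while the
// latch's conditional exit (frequency ExitFreq) turns into a taken branch.
uint64_t fallThroughGain(const std::vector<PredEdge> &Preds, uint64_t ExitFreq) {
  uint64_t Gained = 0;
  for (const PredEdge &E : Preds)
    if (!E.WasFallThrough)
      Gained += E.Freq;
  return Gained > ExitFreq ? Gained - ExitFreq : 0; // move only if positive
}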
davidxl added inline comments.May 24 2019, 4:19 PM
lib/CodeGen/MachineBlockPlacement.cpp
1947–1948

Move this check into the caller function.

1981

This check is not necessary. See example at https://reviews.llvm.org/F8921829

If B6 is selected as the new loop top, the fall through frequencies can be increased from 99 to 150.

Carrot updated this revision to Diff 202299.May 30 2019, 2:04 PM
Carrot marked 2 inline comments as done.
Carrot edited the summary of this revision. (Show Details)

For all the test case updates, please also validate whether they make sense, if possible.

lib/CodeGen/MachineBlockPlacement.cpp
1808

This part looks complete. Are all the paths covered by some tests?

1918

The logic looks correct. Are all cases covered by tests?

1977

The comment should be fixed.

1983

More generally, you want the largest pred edge frequency to be smaller than the new back edge frequency.

1985

The 'else' here seems wrong. It needs to fall through to do the same check.

Carrot updated this revision to Diff 203015.Jun 4 2019, 1:54 PM
Carrot marked 6 inline comments as done.
Carrot added inline comments.
lib/CodeGen/MachineBlockPlacement.cpp
1808

New tests added.

1918

A lot of existing test cases are impacted by this patch, and this function is extensively exercised by those test cases.

1981

You are right. This code was used as a heuristic when I didn't quantitatively compute the number of fallthroughs. Now that we have a more precise cost model, it should be removed.

1983

Again, it was a heuristic from before FallThroughGains was implemented. Now it can be removed.

Carrot added a comment.Jun 7 2019, 3:31 PM

Some analysis of the test case changes.

test/CodeGen/AArch64/neg-imm.ll
The control flow graph looks like:

    entry
      |
      V
-->for.body
|     |\
|     | \
|     |  \
|     | if.then3
|     |  /
|     | /
|     |/
---for.inc
      |
      V
for.cond.cleanup

The original layout is:

entry
for.body
if.then3
for.inc
for.cond.cleanup

For each loop iteration there are two taken branches.

The new layout is:

entry
for.inc
for.body
if.then3
for.cond.cleanup

For each loop iteration there is one taken branch.

test/CodeGen/AMDGPU/optimize-negated-cond.ll
The control flow of function @negated_cond_dominated_blocks is:

  bb
   |
   V
  bb4 <--
  /\    |
 /  \   |
bb5 bb6 |
 \  /   |
  \/    |
  bb7 ---
   |
   V
  bb3

The original layout is:

bb
bb4
bb6
bb5
bb7
bb3

For each loop iteration there are two taken branches.

New layout is

bb
bb6
bb7
bb4
bb5
bb3

For each loop iteration there is one taken branch.

test/CodeGen/Hexagon/redundant-branching2.ll
It is also a diamond-shaped loop:

   |
  b3 <--
  /\   |
 /  \  |
b4  b5 |
 \  /  |
  \/   |
  b6----
   |

Original layout is

b3
b4
b5
b6

New layout is

b5
b6
b3
b4

The new layout can reduce 1 taken branch per iteration.

test/CodeGen/X86/widen_arith-*.ll
The control flow graph is:

    entry
      |
      V
-->forcond ---
|     |      |
|     V      |
---forbody   |
             |
             V
         afterfor

Original layout is:

entry
forbody
forcond
afterfor

New layout is:

entry
forcond
forbody
afterfor

It shouldn't have a performance impact, but the new layout is more natural and more readable.

davidxl accepted this revision.Jun 10 2019, 11:05 AM

lgtm.

Warning: with static profiling, the layout strategy based on the 'precise' cost model may be off. If for some reason this causes issues later, the change should be guarded with a 'hasProfile' check.

This revision is now accepted and ready to land.Jun 10 2019, 11:05 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJun 14 2019, 4:05 PM

Do you have any benchmark numbers to show that this is generally profitable? From our downstream testing, it is not clear that this change is beneficial.

Do you have any benchmark numbers to show that this is generally profitable? From our downstream testing, it is not clear that this change is beneficial.

We got performance improvement in our internal search benchmark.

This change causes a 35% regression on a very simple loop which is the hot part of our internal micro benchmark.
This loop takes 99% of the total execution time and has a reasonably large number of iterations.
All measurements were performed on an Intel(R) Xeon(R) CPU E3-1240 v5 @ 3.50GHz.

Here is the loop in question before the change:

25.41 | 0x30020e20:   c5fa10 vmovss    12(%rcx,%rsi,4), %xmm0
 5.12 | 0x30020e26:   c5f82e vucomiss    %xmm0, %xmm0
 0.61 | 0x30020e2a:   7a1d   jp    29                              ; 0x30020e49
 0.61 | 0x30020e2c:   c5fa10 vmovss    12(%rax,%rsi,4), %xmm2
28.07 | 0x30020e32:   c5f82e vucomiss    %xmm1, %xmm0
 4.10 | 0x30020e36:   7506   jne    6                              ; 0x30020e3e
      | 0x30020e38:   0f8bd0 jnp    720                            ; 0x3002110e
 0.41 | 0x30020e3e:   c5fac2 vcmpless    %xmm2, %xmm0, %xmm3
 0.41 | 0x30020e43:   c4e369 vblendvps    %xmm3, %xmm0, %xmm2, %xmm0
35.04 | 0x30020e49:   c5fa11 vmovss    %xmm0, 12(%rdx,%rsi,4)
 0.20 | 0x30020e4f:   48ffc6 incq    %rsi
      | 0x30020e52:   4839de cmpq    %rbx, %rsi
      | 0x30020e55:   72c9   jb    -55                             ; 0x30020e20

After the change:

      | 0x30020a40:   c5fa11 vmovss    %xmm0, 12(%rdx,%rsi,4)
      | 0x30020a46:   48ffc6 incq    %rsi
27.25 | 0x30020a49:   4839de cmpq    %rbx, %rsi
      | 0x30020a4c:   0f836e jae    -146                           ; 0x300209c0
      | 0x30020a52:   c5fa10 vmovss    12(%rcx,%rsi,4), %xmm0
      | 0x30020a58:   c5f82e vucomiss    %xmm0, %xmm0
      | 0x30020a5c:   7ae2   jp    -30                             ; 0x30020a40
27.46 | 0x30020a5e:   c5fa10 vmovss    12(%rax,%rsi,4), %xmm2
      | 0x30020a64:   c5f82e vucomiss    %xmm1, %xmm0
      | 0x30020a68:   7506   jne    6                              ; 0x30020a70
      | 0x30020a6a:   0f8bfd jnp    509                            ; 0x30020c6d
      | 0x30020a70:   c5fac2 vcmpless    %xmm2, %xmm0, %xmm3
23.36 | 0x30020a75:   c4e369 vblendvps    %xmm3, %xmm0, %xmm2, %xmm0
21.93 | 0x30020a7b:   ebc3   jmp    -61                            ; 0x30020a40

So far I don't fully understand why this causes a 35% slowdown.
One note is that by minimizing the number of taken branches we actually increase the number of branch instructions in the loop, which increases code size.
Moreover, we increase the number of branch instructions executed at runtime on the old fall-through path and don't decrease it on any other path.
While these two facts may negatively affect performance, they don't explain a 35% slowdown. Something more complicated is happening behind the scenes.

This example shows that this optimization is not always beneficial and requires a more complicated profitability heuristic.

Any ideas?

xbolva00 added a subscriber: xbolva00.EditedJul 18 2019, 4:04 AM

Can you provide benchmark results from "public" benchmarks like SPEC or the LLVM test-suite? I think this should be mandatory for landing performance-critical patches (always provide results from SPEC/the test-suite).

I think it is not enough to land critical patches just because "we see an improvement in the internal benchmark" - either open source your benchmark or, if not, please provide data from the test-suite.

xbolva00 added a subscriber: hans.Jul 18 2019, 4:05 AM

(should be probably reverted from 9.0 branch)

@hans

Does the test case slow down with PGO or not? Also, do you have branch misprediction perf data? (Large slowdowns like this are usually triggered by side effects like that.) Is there a trimmed-down version that demonstrates the issue?

Does the test case slow down with PGO or not?

No, it's not PGO.

Also, do you have branch misprediction perf data? (Large slowdowns like this are usually triggered by side effects like that.)

I do. There are no branch/data/instruction mispredictions. The LSD works 100% in both cases as well. Even after the regression, CPI is 0.3, which means we are execution bound. If I run the benchmark under perf, the scores change a little and there is a 19% difference instead of the original 35%. About half of the slowdown (8.3%) comes from increased path length due to the extra jump instruction. The rest comes from a 10% CPI increase. I'm checking different hypotheses on what could cause that, but no success so far.

Is there a trimmed-down version that demonstrates the issue?

It is a simple loop sequentially reading floating-point values (non-NaNs) from two arrays and writing the minimum value to a third array:

for (int i = 0; i < 320000; i++) {
   c[i] = min(a[i], b[i]);
}

Carrot, can you help look into this? The cost model should account for the extra direct jumps introduced as well.

Evgeniy, could you try to build your code with FDO? The layout code is based on profile information; if that is not available, the statically estimated profile information is used. Since the loaded values are never NaN, there should be no branch from the loop header to the latch, but the estimated profile gives that edge a non-trivial probability, so the latch is moved before the header, and one taken branch is saved on the NaN path. I think the static profile estimation can be enhanced to treat a floating-point number as not a NaN, or as a NaN only with a very small probability.

I tried to reproduce your result with

clang++ -O2 -c d43256.cc -save-temps

#include <algorithm>

#define N 320000

float a[N];
float b[N];
float c[N];

using namespace std;

void foo(int M) {
  for (int i = 0; i < M; i++) {
    c[i] = min(a[i], b[i]);
  }
}

But I got a totally different code sequence. Could you help by giving more complete reproduction steps?
Thanks a lot!

hans added a comment.Jul 22 2019, 1:12 PM

(should be probably reverted from 9.0 branch)

@hans

Thanks for the heads up; I'll keep an eye on this.

I'd prefer to see this either reverted or fixed on trunk, and then merge that to the 9.0 branch.

Here is a C++ equivalent of my original code (which is actually a Java application) for you to reproduce:

clang++ -c -O2 floatmin.cpp -march=skylake

extern float a[];
extern float b[];
extern float c[];

bool foo(int M, bool flag) {
  for (int i = 0; i < M; i++) {
    float x = a[i];
    float y = b[i];
    float min;
    if (x != x) {
      min = x;   // a is NaN
    }
    else if (y == 0.0f) {
      goto fail;
    }
    else {
     min = (x <= y) ? x : y;
    }
    c[i] = min;
  }

  return true;
fail:
  return false;
}

With the C++ reproducer I can measure only about a 9% slowdown. In this case the CPI is identical (for the original test case I still don't know the root cause of the CPI difference) and all of the slowdown comes from increased path length due to one extra jump.

With this reproducer in hand you can gather profile data if needed. But that's a separate story. I don't think we can afford such a regression when a profile is not available.
You probably could assume the worst case when a profile is not available, but I believe it won't help here; the root cause is that the heuristic just doesn't take the extra jump into account.

Thanks for the test case, I can reproduce it now.

As I suspected, the problem is in BranchProbabilityAnalysis. For a floating-point unordered comparison, it estimates the probability as 12/32, while for the loop exit the probability is 3.1%. With these probability numbers, moving the latch before the header is beneficial. Unfortunately there is never a NaN in the input.

Even for a normal program a NaN is extremely rare, so we should give a much smaller probability to the unordered comparison.

ebrevnov, with the following patch the test case is laid out correctly. Can you try it with your actual code?

Index: BranchProbabilityInfo.cpp
--- BranchProbabilityInfo.cpp (revision 366821)
+++ BranchProbabilityInfo.cpp (working copy)
@@ -118,6 +118,12 @@
 static const uint32_t FPH_TAKEN_WEIGHT = 20;
 static const uint32_t FPH_NONTAKEN_WEIGHT = 12;
 
+/// This is the probability for an ordered floating point comparison.
+static const uint32_t FPH_ORD_WEIGHT = 1024 * 1024 - 1;
+
+/// This is the probability for an unordered floating point comparison, it means
+/// one or two of the operands are NaN, it should be extremely rare.
+static const uint32_t FPH_UNO_WEIGHT = 1;
+
 /// Invoke-terminating normal branch taken weight
 ///
 /// This is the weight for branching to the normal destination of an invoke
@@ -778,6 +784,8 @@
   if (!FCmp)
     return false;
 
+  uint32_t TakenWeight = FPH_TAKEN_WEIGHT;
+  uint32_t NontakenWeight = FPH_NONTAKEN_WEIGHT;
   bool isProb;
   if (FCmp->isEquality()) {
     // f1 == f2 -> Unlikely
@@ -786,9 +794,13 @@
   } else if (FCmp->getPredicate() == FCmpInst::FCMP_ORD) {
     // !isnan -> Likely
     isProb = true;
+    TakenWeight = FPH_ORD_WEIGHT;
+    NontakenWeight = FPH_UNO_WEIGHT;
   } else if (FCmp->getPredicate() == FCmpInst::FCMP_UNO) {
     // isnan -> Unlikely
     isProb = false;
+    TakenWeight = FPH_ORD_WEIGHT;
+    NontakenWeight = FPH_UNO_WEIGHT;
   } else {
     return false;
   }
@@ -798,8 +810,7 @@
 
   if (!isProb)
     std::swap(TakenIdx, NonTakenIdx);
-  BranchProbability TakenProb(FPH_TAKEN_WEIGHT,
-                              FPH_TAKEN_WEIGHT + FPH_NONTAKEN_WEIGHT);
+  BranchProbability TakenProb(TakenWeight, TakenWeight + NontakenWeight);
   setEdgeProbability(BB, TakenIdx, TakenProb);
   setEdgeProbability(BB, NonTakenIdx, TakenProb.getCompl());
   return true;

Hi @Carrot,

I checked your patch on the original test case and it does resolve the issue. Unfortunately, we have 4 other test cases that regressed from 1% to 5%. The bad news is that we don't have sources for these cases. At a glance they don't involve floating-point operations, so it is no surprise that they are unaffected. I can continue the investigation and try to provide a short reproducer if needed. My fear here is that the original change touches a very wide class of cases and we won't be able to fix all of them one by one (or it will take too much time). Can we think of a more realistic approach to deal with the regressions?

Thanks
Evgeniy

Is this reverted (at least from 9.0, @hans) yet?

hans added a comment.Jul 25 2019, 7:44 AM

Is this reverted (at least from 9.0, @hans) yet?

No. The preferred approach is to resolve on trunk first (reverting or fixing) and then merging that to the release branch.

It sounds like there is still discussion ongoing here. Perhaps it should be reverted on trunk in the meantime?

I suspect other affected cases are due to bad static profile data too.

For now, I think the best way forward is to enable this transformation only when real profile data is available.

@ebrevnov, it's better to provide a reproducer. Otherwise I can't analyze the problem that impacts your code. Four is not a big number.

hans added a comment.Jul 26 2019, 10:28 AM

Do you have any benchmark numbers to show that this is generally profitable? From our downstream testing, it is not clear that this change is beneficial.

We got performance improvement in our internal search benchmark.

How does this transformation impact the benchmark when not using profile data?

Do you have any benchmark numbers to show that this is generally profitable? From our downstream testing, it is not clear that this change is beneficial.

We got performance improvement in our internal search benchmark.

How does this transformation impact the benchmark when not using profile data?

The benchmark is running; I will report the result once it is finished.

Do you have any benchmark numbers to show that this is generally profitable? From our downstream testing, it is not clear that this change is beneficial.

We got performance improvement in our internal search benchmark.

How does this transformation impact the benchmark when not using profile data?

In plain mode we also got a performance improvement; the speedup is a little smaller than in FDO mode.

@ebrevnov, it's better to provide a reproducer. Otherwise I can't analyze the problem that impacts your code. Four is not a big number.

Four is not big, but who knows how many other cases will arise. I was looking at another case with a ~5% regression. As I mentioned, sources are not available for this case and I was not able to identify anything special just by looking at the assembly. As in the previous case, we have one extra jump, which accounts for about 3.5% of the instructions in the loop. Here is the assembly for the loop after the optimization:

0x3001bf40:   4c8975 movq    %r14, -72(%rbp)
0x3001bf44:   4c894d movq    %r9, -48(%rbp)
0x3001bf48:   48894d movq    %rcx, -56(%rbp)
0x3001bf4c:   4c8945 movq    %r8, -64(%rbp)
0x3001bf50:   48895d movq    %rbx, -80(%rbp)
0x3001bf54:   4889cf movq    %rcx, %rdi
0x3001bf57:   90     nop
0x3001bf58:   e863f1 callq    -69277                       ; 0x3000b0c0
0x3001bf5d:   4c8b75 movq    -72(%rbp), %r14
0x3001bf61:   488b4d movq    -56(%rbp), %rcx
0x3001bf65:   4c8b4d movq    -48(%rbp), %r9
0x3001bf69:   4c8b45 movq    -64(%rbp), %r8
0x3001bf6d:   48ffc3 incq    %rbx
0x3001bf70:   410fb6 movzbl    148(%r9), %eax
0x3001bf78:   84c0   testb    %al, %al
0x3001bf7a:   7572   jne    114                            ; 0x3001bfee
0x3001bf7c:   498b04 movq    (%r12), %rax
0x3001bf80:   498b55 movq    (%r13), %rdx
0x3001bf84:   4885c2 testq    %rax, %rdx
0x3001bf87:   751b   jne    27                             ; 0x3001bfa4
0x3001bf89:   488b78 movq    16(%rax), %rdi
0x3001bf8d:   4885fa testq    %rdi, %rdx
0x3001bf90:   7526   jne    38                             ; 0x3001bfb8
0x3001bf92:   f64708 testb    $1, 8(%rdi)            ; 0x3001bfd4
0x3001bf96:   be5f49 movl    $346463, %esi
0x3001bf9b:   74a3   je    -93                             ; 0x3001bf40
0x3001bf9d:   be8b10 movl    $4235, %esi
0x3001bfa2:   eb9c   jmp    -100                           ; 0x3001bf40

Thanks
Evgeniy

Do you have any benchmark numbers to show that this is generally profitable? From our downstream testing, it is not clear that this change is beneficial.

We got performance improvement in our internal search benchmark.

How does this transformation impact the benchmark when not using profile data?

In plain mode we also got a performance improvement; the speedup is a little smaller than in FDO mode.

I have a general question/comment. By now it's more or less evident that the benefit of the optimization heavily depends on the correctness of the profile information. That means that in the general case there is no way to reason about its effectiveness. Thus I believe it should be turned off when there is no profile.

Do you have any benchmark numbers to show that this is generally profitable? From our downstream testing, it is not clear that this change is beneficial.

We got performance improvement in our internal search benchmark.

How does this transformation impact the benchmark when not using profile data?

In plain mode we also got a performance improvement; the speedup is a little smaller than in FDO mode.

I have a general question/comment. By now it's more or less evident that the benefit of the optimization heavily depends on the correctness of the profile information. That means that in the general case there is no way to reason about its effectiveness. Thus I believe it should be turned off when there is no profile.

The optimization decision is based on profile information; if a real profile is not available, the statically estimated profile information (generated by BranchProbabilityInfo.cpp) is used. So if an unreasonable probability is generated, as in your first case, or if a user program has more atypical runtime behavior than BPI expects, it may make a bad decision.
Since you have strong concerns about the optimization without real profile information, I will restore the old behavior when no real profile information is available.
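
For illustration, a minimal standalone sketch of the planned gating decision (the function and parameter names here are hypothetical, not the actual MachineBlockPlacement code): the aggressive latch-to-top layout would only run with a real profile, unless the user explicitly forces the precise rotation cost model.

#include <cstdio>

// Hypothetical decision helper; in the real pass the inputs would come from
// the function's profile data and the force-precise-rotation-cost option.
static bool useAggressiveLoopTop(bool HasRealProfileData,
                                 bool ForcePreciseRotationCost) {
  return HasRealProfileData || ForcePreciseRotationCost;
}

int main() {
  std::printf("plain build: %d\n", useAggressiveLoopTop(false, false)); // 0
  std::printf("FDO build:   %d\n", useAggressiveLoopTop(true, false));  // 1
  std::printf("forced:      %d\n", useAggressiveLoopTop(false, true));  // 1
}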

Do you have any benchmark numbers to show that this is generally profitable? From our downstream testing, it is not clear that this change is beneficial.

We got performance improvement in our internal search benchmark.

How does this transformation impact the benchmark when not using profile data?

In plain mode we also got a performance improvement; the speedup is a little smaller than in FDO mode.

I have a general question/comment. By now it's more or less evident that the benefit of the optimization heavily depends on the correctness of the profile information. That means that in the general case there is no way to reason about its effectiveness. Thus I believe it should be turned off when there is no profile.

The optimization decision is based on profile information; if a real profile is not available, the statically estimated profile information (generated by BranchProbabilityInfo.cpp) is used. So if an unreasonable probability is generated, as in your first case, or if a user program has more atypical runtime behavior than BPI expects, it may make a bad decision.
Since you have strong concerns about the optimization without real profile information, I will restore the old behavior when no real profile information is available.

I would suggest putting the optimization under an option and disabling it by default for now. Once all the problems are resolved we can change the default. What do you think?

Do you have any benchmark numbers to show that this is generally profitable? From our downstream testing, it is not clear that this change is beneficial.

We got performance improvement in our internal search benchmark.

How does this transformation impact the benchmark when not using profile data?

In plain mode we also got a performance improvement; the speedup is a little smaller than in FDO mode.

I have a general question/comment. By now it's more or less evident that the benefit of the optimization heavily depends on the correctness of the profile information. That means that in the general case there is no way to reason about its effectiveness. Thus I believe it should be turned off when there is no profile.

The optimization decision is based on profile information; if a real profile is not available, the statically estimated profile information (generated by BranchProbabilityInfo.cpp) is used. So if an unreasonable probability is generated, as in your first case, or if a user program has more atypical runtime behavior than BPI expects, it may make a bad decision.
Since you have strong concerns about the optimization without real profile information, I will restore the old behavior when no real profile information is available.

I took a deeper look at the second failing case and found something interesting. In fact, there is profile information for the loop, but it's incomplete. For example, there is profile data for the method entry, the loop back edge and some other branches in the loop. Only two branches don't have a profile. (You may wonder how that's possible; the original application is a Java program where a profile is typically available but may be unavailable for inlined methods.) I think your heuristic makes the wrong decision due to the absence of a profile for these two branches, which were generated from one select instruction over a boolean variable. In most cases you really can't statically estimate whether a boolean is false or true.

I think we need to be conservative and not account for potential benefit from branches without a representative profile.

Thanks
Evgeniy

I would suggest putting the optimization under an option and disabling it by default for now. Once all the problems are resolved we can change the default. What do you think?

After restoring the original behavior in plain mode, you can use -force-precise-rotation-cost=true to get this more aggressive loop layout.
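
For example (assuming the option is reachable through -mllvm like other MachineBlockPlacement flags; the exact invocation is illustrative):

clang++ -O2 -mllvm -force-precise-rotation-cost=true -c floatmin.cpp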

hans added a comment.Aug 1 2019, 1:36 AM

I would suggest putting the optimization under an option and disabling it by default for now. Once all the problems are resolved we can change the default. What do you think?

After restoring the original behavior in plain mode, you can use -force-precise-rotation-cost=true to get this more aggressive loop layout.

Is there a patch in progress for restoring the original behaviour in non-profile mode? It would be nice if we could get this resolved soon.

Carrot added a comment.Aug 1 2019, 4:00 PM

I would suggest putting the optimization under an option and disabling it by default for now. Once all the problems are resolved we can change the default. What do you think?

After restoring the original behavior in plain mode, you can use -force-precise-rotation-cost=true to get this more aggressive loop layout.

Is there a patch in progress for restoring the original behaviour in non-profile mode? It would be nice if we could get this resolved soon.

Yes; some test cases added or modified after this patch need to be adjusted. The new patch can be ready tomorrow.

Carrot added a comment.Aug 2 2019, 1:28 PM

Patch https://reviews.llvm.org/D65673 restores the original layout in plain mode.

We are also seeing this patch slow down one of our internal benchmarks and speed up another one on the Qualcomm Hexagon target.
In both cases the statically estimated profile is used - and the static profile is representative. In both cases D43256 basically lays out the executed hot code closer together, improving cache utilization. However, in both cases we see the critical path length and the number of jumps in the critical path increase. So a precise cost model is a good idea. We spent some time analyzing why one benchmark got worse - we can see more mispredicts - but there may be more going on under the hood.
For the other benchmark, the one that speeds up, we see that the new layout lowers pressure on an internal branch target hardware resource - the critical loop has a lot of calls that have already increased pressure on that resource.
We don't have sources for these benchmarks.

We have verified that D65673 restores the old behavior on the benchmark that got worse, and throwing the flag -force-precise-rotation-cost=true keeps the improvement on the other one.

Carrot added a comment.Aug 8 2019, 2:03 PM

@hjagasia, thank you for the verification.

hans added a comment.Aug 9 2019, 2:05 AM

Patch https://reviews.llvm.org/D65673 restores the original layout in plain mode.

I see that this was committed in r368339. I will merge that to the release branch once it's baked in trunk for a little while.

hans added a comment.Aug 12 2019, 7:30 AM

Patch https://reviews.llvm.org/D65673 restores the original layout in plain mode.

I see that this was committed in r368339. I will merge that to the release branch once it's baked in trunk for a little while.

That was reverted in r368579 since it broke the Chromium build.

@Carrot: Can you please prioritize this so we can get it fixed in time for the LLVM 9 release?

@Carrot: Any update on this? We have slowdowns in our benchmarks because of this, as mentioned earlier.

@hans, I think this patch should be reverted from 9.0 so that rc3 is “fixed”.

hans added a comment.Aug 22 2019, 7:00 AM

@hans, I think this patch should be reverted from 9.0 so that rc3 is “fixed”.

I'd rather see the fix land on trunk first (reverting on the branch is also not trivial, there are merge conflicts in several test files). From the discussion at http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20190819/686087.html, it should be ready to go in again.

xbolva00 added a comment.EditedAug 22 2019, 7:21 AM

OK, thanks! I didn't know there was a working fix.

@hans, I think this patch should be reverted from 9.0 so that rc3 is “fixed”.

I'd rather see the fix land on trunk first (reverting on the branch is also not trivial, there are merge conflicts in several test files). From the discussion at http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20190819/686087.html, it should be ready to go in again.

It is committed as r369664.

hans added a comment.Aug 27 2019, 7:38 AM

@hans, I think this patch should be reverted from 9.0 so that rc3 is “fixed”.

I'd rather see the fix land on trunk first (reverting on the branch is also not trivial, there are merge conflicts in several test files). From the discussion at http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20190819/686087.html, it should be ready to go in again.

It is committed as r369664.

I see there is a follow-up comment on the commit email saying it fails in the verifier.

I think at this point we're too close to the LLVM 9 release to take this. Hopefully it can get fixed and stabilized on trunk, and then be merged to LLVM 9.0.1.

Maybe it's not worth reverting it even from 9.0.1.

This commit (see below) says that disabling this new feature regressed many benchmarks. So it is probably okay to leave it as is, possibly tuning the heuristic a bit.

Anyway, with a static profile some code's performance improves and some regresses, but if this patch improves a lot of code, we shouldn't disable it just because some people see a few regressions.

https://reviews.llvm.org/rGf9f81289e6864ca3f09df16bad0ffc3ca58c3162#684441

Hi,
It seems like this is supposed to increase the average number of fallthroughs in the case of simple nested loops; however, it also seems to increase the total number of branch instructions both inside and outside the outer loop.
On our target this seems to do more harm than good, as increasing the number of branches is significantly more harmful than having more taken conditional branches.
Instead of a conditional branch that is usually taken and jumps from the end to the start of the outer loop, we have an unconditional branch that jumps from the end of the inner loop to the end of the outer loop, which then conditionally jumps out of the outer loop or falls through to the next iteration of the outer loop.
This increases the total number of executed branch instructions by 1.

Did I understand this correctly?

Is there a way to tweak this optimization to avoid generating additional branches? Or would this mean completely disabling it for targets that don't benefit from additional fallthroughs when they come at the cost of additional branches?

Thanks

Carrot added a comment.Jan 5 2021, 2:59 PM

@GalZohar, the various layout algorithms in MachineBlockPlacement mainly consider the number of fall-throughs and the dynamic number of branch instructions (usually they are consistent) according to branch probabilities. So you can try building your application with profiling. Or you can compile this file with -Os, since the improvement in this patch is disabled with -Os.

@GalZohar, the various layout algorithms in MachineBlockPlacement mainly consider the number of fall-throughs and the dynamic number of branch instructions (usually they are consistent) according to branch probabilities. So you can try building your application with profiling. Or you can compile this file with -Os, since the improvement in this patch is disabled with -Os.

In the simple example I have, the number of branches executed on the most common control flow path is increased by 1 due to this transformation.
It seems that in nested loops, where the outer-loop latch is moved to fall through into the outer-loop header, the inner loop needs an additional exit branch, which increases the number of dynamic branches by 1 for every outer-loop iteration that also executes the inner loop. Is this intentional?
Disabling this completely degrades performance in some more complex examples where the total number of branches is not increased and therefore this optimization is beneficial. I would prefer to keep this when the number of branches isn't increased but skip it otherwise, as in the nested loop example above.
I'm not sure how profiling would help my situation, as it seems like the number of branches may be increased regardless of the block frequency values, since there is one path that executes an extra branch instruction.

@GalZohar, without a test case I can't say what the problem is.

It is not intentional to increase the number of branches on any particular path. All of the algorithms are driven by branch probabilities. If you didn't collect a profile, LLVM guesses a probability for each branch; it's reasonable for most code. But it's not rare for the guessed probability to differ from the actual probability, and then it may not result in a good layout.

So I strongly suggest you try a profiling build. MBP is one of the passes that gets the most performance improvement from profiling.

@GalZohar, without a test case I can't say what the problem is.

It is not intentional to increase the number of branches on any particular path. All of the algorithms are driven by branch probabilities. If you didn't collect a profile, LLVM guesses a probability for each branch; it's reasonable for most code. But it's not rare for the guessed probability to differ from the actual probability, and then it may not result in a good layout.

So I strongly suggest you try a profiling build. MBP is one of the passes that gets the most performance improvement from profiling.

While we hope to eventually support profiling builds, we must also optimize as well as possible without profiling.
We have a very simple case where profiling shouldn't be needed. I think this would be the optimal layout:

BB1 (Entry) -> BB2, BB5
BB2 -> BB3, BB4
BB3 -> BB3, BB4
BB4 -> BB2, BB5
BB5 (Exit)

This way all blocks (except the exit) have a single branch instruction.
With this optimization BB4 is moved before BB2, which results in an additional branch from BB3 to BB4 that wasn't needed before (BB3 had only a single branch and will now need two). All other blocks still have a single branch afterwards. BB1 also gets an additional branch instruction, but that is acceptable as it's outside the loop. A hypothetical source-level shape for this CFG is sketched below.

Is this intentional, or is something broken with the frequencies? Assuming this is intended, I still don't understand how this transformation is good regardless of frequencies, assuming the number of branches is always more important than the number of fallthroughs.
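
For illustration only, a hypothetical C++ function whose CFG roughly matches the block structure above (the block assignments are approximate and depend on the compiler):

// BB1: entry test, BB2: outer-loop header, BB3: self-looping inner loop,
// BB4: outer-loop latch, BB5: exit.
void shrink(int *a, int n, int m) {
  for (int i = 0; i < n; i++) {   // BB1 guards entry; BB4 tests the back edge
    if (a[i] > 0) {               // BB2 -> BB3 (inner loop) or BB4 (latch)
      int j = 0;
      do {                        // BB3 -> BB3 (self loop) or BB4
        a[i] -= 1;
      } while (++j < m);
    }
  }                               // BB4 -> BB2 (back edge) or BB5 (exit)
}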

@GalZohar, thanks for the example, now I understand your problem. The new layout does increase the number of executed branches on the path BB2->BB3->BB4->BB2. Unfortunately most of the current MBP algorithms don't consider this factor. When considering only the number of fall-throughs, the new layout has more fall-throughs and fewer taken branches.

Since the number of executed branches does have a performance impact on your target, you are welcome to enhance MBP to take this factor into account on the relevant targets.

@GalZohar, thanks for the example, now I understand your problem. The new layout does increase the number of executed branches on the path BB2->BB3->BB4->BB2. Unfortunately most of the current MBP algorithms don't consider this factor. When considering only the number of fall-throughs, the new layout has more fall-throughs and fewer taken branches.

Since the number of executed branches does have a performance impact on your target, you are welcome to enhance MBP to take this factor into account on the relevant targets.

Is this a general problem with MBP then? I haven't noticed it before this optimization, but maybe that was just luck or I wasn't investigating the right examples. I'm trying to understand where the best place to fix it is, and whether it's even feasible.

Is this a general problem with MBP then? I haven't noticed it before this optimization, but maybe that was just luck or I wasn't investigating the right examples. I'm trying to understand where the best place to fix it is, and whether it's even feasible.

Mostly; there is only one function that considers extra jump instructions, rotateLoopWithProfile, but it is not called by default.

Herald added a project: Restricted Project. · View Herald TranscriptOct 20 2022, 3:41 AM
Herald added a subscriber: kosarev. · View Herald Transcript