This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] Don't do if-conversion if there is a long dependence chain
ClosedPublic

Authored by Carrot on Oct 26 2017, 4:24 PM.

Details

Summary

The motivated test case is

struct ICmp {

bool operator()(int *a, int *b) const { return *a < *b; }

};

typedef set<int *, ICmp> PSet;

void foo(PSet* pset, int* l) {

pset->insert(l);

}

The actual hot code is in function _M_get_insert_unique_pos

 while (__x != 0)
{
  __y = __x; 
  __comp = _M_impl._M_key_compare(__k, _S_key(__x));
  __x = __comp ? _S_left(__x) : _S_right(__x);
}

LLVM generates following code:

.LBB1_2:

                                
movq    %rdx, %rbx
movq    32(%rbx), %rcx
movl    (%rcx), %ecx
leaq    24(%rbx), %rdx
leaq    16(%rbx), %rsi
cmpl    %ecx, %eax
cmovlq  %rsi, %rdx
movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_2

GCC generates:

7.53 │4137d0:   mov    0x10(%rbx),%rax                                   
0.01 │4137d4:   mov    $0x1,%edi                                               
     │4137d9:   test   %rax,%rax                                                                         
0.95 │4137dc: ↓ je     4137f7 <BM_CallF(testing::benchmark::State&)+0xd7>     
            
1.18 │4137de:   mov    %rax,%rbx

40.92 │4137e1: mov 0x20(%rbx),%rax
34.82 │4137e5: mov (%rax),%ecx

     │4137e7:   cmp    %r14d,%ecx                                                                    
1.83 │4137ea: ↑ jg     4137d0 <BM_CallF(testing::benchmark::State&)+0xb0>  
           
8.05 │4137ec:   mov    0x18(%rbx),%rax                                   
0.01 │4137f0:   xor    %edi,%edi                               
     │4137f2:   test   %rax,%rax                                                   
0.50 │4137f5: ↑ jne    4137de <BM_CallF(testing::benchmark::State&)+0xbe>

The gcc generated code is 15% faster. The reason is in llvm generated code, there is a long dependence chain including cmov and later instructions, all of them must be executed one by one. In gcc generated code, the instructions after branch don't have data dependence on instructions before branch, so they can be executed in parallel. Even if the branch is highly unpredictable(like visiting RBTree in this case), if the dependence chain is long enough, the correctly predicted execution can still bring overall performance improvement.

We have observed several internal applications benefited from this optimization. All of them visit tree like data structures.

Before if-conversion, this patch finds out the length of the dependence chain (only the part that can be executed in parallel with code after branch) before cmp, if the length is longer than a threshold, then give up if-conversion.

With this patch, now I can get

// -O2
.LBB1_3:

                              
movq    %rdx, %rbx
movq    32(%rbx), %rcx
movl    (%rcx), %ecx
cmpl    %ecx, %eax
jge     .LBB1_5

BB#4:

                                
leaq    16(%rbx), %rdx
jmp     .LBB1_6
.p2align        4, 0x90

.LBB1_5:

                                
leaq    24(%rbx), %rdx

.LBB1_6:

                                
movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_3

// -O3
.LBB1_3:

                             
movq    %rdx, %rbx
movq    32(%rbx), %rcx
movl    (%rcx), %ecx
cmpl    %ecx, %eax
jge     .LBB1_5

BB#4:

leaq    16(%rbx), %rdx
movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_3
jmp     .LBB1_7
.p2align        4, 0x90

.LBB1_5:

                                
leaq    24(%rbx), %rdx
movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_3

Still worse than gcc generated code, but those are separate issues (code layout, extra lea), and it is already faster than cmov version.

Diff Detail

Event Timeline

Carrot created this revision.Oct 26 2017, 4:24 PM

Don't know why llvm generated code formatted like a mess. Try again.

The original llvm generated code:

0.49 │40e200:   mov    %rcx,%rbx

45.49 │40e203: mov 0x20(%rbx),%rcx
29.17 │40e207: mov (%rcx),%edx

0.06 │40e209:   lea    0x18(%rbx),%rcx                                                                
     │40e20d:   lea    0x10(%rbx),%rsi                                                                 
1.95 │40e211:   cmp    %edx,%eax                                                                    
3.12 │40e213:   cmovl  %rsi,%rcx

13.21 │40e217: mov (%rcx),%rcx

0.00 │40e21a:   test   %rcx,%rcx                                                  
1.57 │40e21d: ↑ jne    40e200 <std::pair<std::_Rb_tree_iterator<Lock*>, bool> std::_Rb_tree<Lock*, Lock*, std::_Identity<Lock*>, LCmp, std::allocator<Lock*> >::_M_insert_unique<Lock* const&>(Lock* const&)+0x30>

The llvm generated code with this patch

// -O2
.LBB1_3:

movq    %rdx, %rbx
movq    32(%rbx), %rcx
movl    (%rcx), %ecx
cmpl    %ecx, %eax
jge     .LBB1_5

leaq    16(%rbx), %rdx
jmp     .LBB1_6
.p2align        4, 0x90

.LBB1_5:

leaq    24(%rbx), %rdx

.LBB1_6:

movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_3

// -O3
.LBB1_3:

movq    %rdx, %rbx
movq    32(%rbx), %rcx
movl    (%rcx), %ecx
cmpl    %ecx, %eax
jge     .LBB1_5

leaq    16(%rbx), %rdx
movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_3
jmp     .LBB1_7
.p2align        4, 0x90

.LBB1_5:

leaq    24(%rbx), %rdx
movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_3

.LBB1_7:

This revision was automatically updated to reflect the committed changes.
Carrot reopened this revision.Nov 16 2017, 10:48 AM

I unintentionally closed this patch by pasting its url to another svn patch.
It's still waiting for review, so reopen it.

Carrot updated this revision to Diff 123686.Nov 20 2017, 4:23 PM
Carrot edited the summary of this revision. (Show Details)

Rebase to head of trunk.

mgrang added a subscriber: mgrang.Nov 20 2017, 4:47 PM
mgrang added inline comments.
lib/Transforms/Utils/SimplifyCFG.cpp
405

Use uppercase variable names.

Shouldn't size be unsigned?

413

Shouldn't latency be unsigned?

414

Uppercase variable names.

417

Uppercase variable names.

423

Uppercase variable names.

440

Uppercase variable names.

Check if br is not null.

460

You can combine the assignment and check into a single if:

if (PHINode *PN = dyn_cast<PHINode>(II))
481

Same here. Can combine the assignment and check into single if.

505

How big are the keys of your map? Can you use llvm's map containers (like DenseMap, etc)?

511

Uppercase variable names.

Carrot updated this revision to Diff 124014.Nov 22 2017, 3:26 PM
Carrot marked 10 inline comments as done.
Carrot added inline comments.
lib/Transforms/Utils/SimplifyCFG.cpp
505

The size of the map is BB.size() - 1, and can't be greater than 40.

Rather than trying to prevent the flattening from happening in the first place, it might make more sense to reverse it later in the pass pipeline. That would be more general, and avoid blocking other optimizations. The x86 backend already does this in some limited cases; see X86CmovConversion.cpp.

Rather than trying to prevent the flattening from happening in the first place, it might make more sense to reverse it later in the pass pipeline. That would be more general, and avoid blocking other optimizations. The x86 backend already does this in some limited cases; see X86CmovConversion.cpp.

Thank you for the alternative method. This optimization should be target independent. It impacts all OOO processors. I tested the same motivated test case on power8, and observed 10% performance difference.

hfinkel edited edge metadata.Dec 11 2017, 9:33 PM

You have a bunch of variable names here with underscores, which is not our usual convention. BB1_chain -> BB1Chain, etc.

lib/Transforms/Utils/SimplifyCFG.cpp
413

Please explicitly note in the comment that the 'or' here is determined by the 'LongestChain' parameter.

497

in new BB -> in this new BB

500

I find this sentence confusing. You're trying to say that the control dependence is faster than the data dependence, because the control dependence can be speculated (and thus, the second part can execute in parallel with the first). Right?

This comment should also explain what this is checking. Something like: When considering whether to perform if-conversion, find the length of the dependence chain in BB1 (only the part that can be executed in parallel with code after branch in BB2) before cmp, and if the length is longer than a threshold, don't perform if-conversion." Document what SpeculationSize is.

518

Please add a cl::opt for this threshold.

527

You should comment that you're estimating that if there are lots of other instructions in the new BB, they'll be other instructions for the processor to issue regardless of the length of this new dependence chain.

528

Should this 2 be the SchedModel's IssueWidth?

539

Is there a way to think of the proper value of DependenceChainLatency in terms of the branch-misprediction penalty?

Carrot updated this revision to Diff 127157.Dec 15 2017, 10:15 AM
Carrot marked 5 inline comments as done.

Thanks a lot for the English language correction.

Carrot added inline comments.Dec 15 2017, 10:16 AM
lib/Transforms/Utils/SimplifyCFG.cpp
528

In real world applications, an IPC of 2 is already very good for 3 issue or 4 issue processors. Higher IPC is usually found in programs with small kernel. So I think 2 is more reasonable for most applications.

539

DependenceChainLatency depends on many factors, besides branch misprediction penalty, there are also taken branch latency, branch mis prediction rate, cmov latency.

hfinkel added inline comments.Dec 17 2017, 8:39 AM
lib/Transforms/Utils/SimplifyCFG.cpp
528

In real world applications, an IPC of 2 is already very good for 3 issue or 4 issue processors. Higher IPC is usually found in programs with small kernel. So I think 2 is more reasonable for most applications.

Fair enough. As we both know, there are all sorts of issues affecting this, but I agree that for non-loop code with small basic blocks, this is likely true. Please mention this assumption in the comment.

539

Please add this to the comment.

Carrot updated this revision to Diff 127423.Dec 18 2017, 3:21 PM
Carrot marked 2 inline comments as done.
hfinkel accepted this revision.Dec 19 2017, 9:10 PM

LGTM, however, the testing here is fairly light. We should have some cases where we do perform the if conversion because of the DependenceChainLatency check, where we do perform the if conversion because of the IPC check, and because of the BB size check. Also, where the if-converted block has some non-trivial adjustment because of LatencyAdjustment.

lib/Transforms/Utils/SimplifyCFG.cpp
414

the compare instruction. -> the compare instruction feeding the block's conditional branch.

501

... because the data dependence is changed into control dependence, and ...

This is backward. It should say, "... because the control dependence is transformed into a data dependence, and the control dependence can be speculated, and ...".

527

in small BB. -> in a small BB.

This revision is now accepted and ready to land.Dec 19 2017, 9:10 PM
Carrot updated this revision to Diff 127953.Dec 21 2017, 2:56 PM
Carrot marked 3 inline comments as done.

Add more test cases.
Will check in this version.

This revision was automatically updated to reflect the committed changes.

I don't think this patch's approach is the right one, and we should probably pursue Eli's suggestion.... It also causes very significant regressions in benchmarks, so I think we should revert it for now.

The core problem I see here is that we're pretty radically changing the canonical form of the IR. This is going to be really disruptive. Doing this so early on means that we can't recognize common, direct patterns in terms of select. While perhaps it would be better to go and teach everything to generically reason about either a select or a phi around control flow, that will need a really massive undertaking.

And this isn't just a theoretical problem. This patch regresses performance on simple code patterns like counting things in sets by a huge amount because there, we were able to do things like use the adc instruction or other dedicated hardware:
https://reviews.llvm.org/P8055 or

#include <bitset>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

int main() {
  std::bitset<256> set;
  volatile unsigned seed = 0;
  std::mt19937 rng(seed);
  for (int i = 0; i < 256; ++i)
    set[i] = rng() < (rng.max() / 2) ? false : true;

  volatile unsigned size = 4096;
  std::vector<unsigned char> data;
  for (int i = 0; i < size; ++i)
    data.push_back(rng() & 0xFF);

  volatile int iterations = 100000;
  long int true_count = 0, all_count = 0;

  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < iterations; ++i)
    for (auto c : data) {
      if (set[c])
        ++true_count;
      ++all_count;
    }
  auto stop = std::chrono::high_resolution_clock::now();

  std::cout << "Did " << iterations << " iterations over " << size
            << " queries: " << true_count << " hits of " << all_count << "\n";
  std::cout << "Took "
            << std::chrono::duration<double, std::nano>(stop - start).count() /
                   iterations
            << " nanoseconds per iteration.\n";
}

This benchmark runs over 2x slower for me on both AMD and Intel hardware. That seems like a really huge regression. And I have extracted this out of actual benchmark code we have internally. I think we should probably revert this patch until there is a clear consensus and/or more comprehensive performance data supporting a particular direction here.

Another way of looking at the problem with this approach is that it seems to be trying to solve an issue pretty specifically with cmov. We actually have a dedicated pass that tries *specifically* to reason about the length of dependency chains and the proportional cost of a cmov versus a conditional branch. That seems like a much more appropriate place to handle these things, and that is exactly what Eli pointed you at. I've done some work there as well, and I really think that is a good approach. If other architectures need similar logic, we could try to factor it out and share it between targets, but I think the ideas around what constitutes a long dependency will be somewhat architecture specific. For example, I could easily imagine a predicated architecture where the patterns in this patch don't form long dependency chains.

reverted, will investigate it further.

The performance regression of https://reviews.llvm.org/P8055 is not because of missing adc instruction.

Slow version:

.LBB0_65:

movzbl  (%rdx), %esi
movq    %rsi, %rdi
shrq    $3, %rdi
andl    $24, %edi
movq    32(%rsp,%rdi), %rdi
btq     %rsi, %rdi
jae     .LBB0_67

addq    $1, %rbx

.LBB0_67:

addq    $1, %rdx
cmpq    %rdx, %r13
jne     .LBB0_65

Fast version:

.LBB0_65:

movzbl  (%rdx), %esi
movq    %rsi, %rdi
shrq    $3, %rdi
andl    $24, %edi
movq    32(%rsp,%rdi), %rdi
btq     %rsi, %rdi
adcq    $0, %rbx                                 // result of if conversion
addq    $1, %rdx
cmpq    %rdx, %r13
jne     .LBB0_65

The result of if conversion %rbx is not used in later in BB67, so although there is long dependence chain in BB65, it does not prevent the processor execute instructions in BB67 and later BBs. For the branch version, it doesn't make instructions in BB67 executed earlier, but it bring extra penalty of branch misprediction, so it is slower.

In function FindLongDependenceChain, I should also check the result of if-conversion can really delay other instructions in BB2, if not, like in this case, we should do if-conversion.

Doing this so early on means that we can't recognize common, direct patterns in terms of select. While perhaps it would be better to go and teach everything to generically reason about either a select or a phi around control flow, that will need a really massive undertaking.

FWIW, I believe that the outcome of the select/GVN discussion (e.g., from PR34603) has been that this is exactly what we need to do: we need to move select formation late in the pipeline, and likely canonicalize away from select early.