Download Raw Diff

Details

Reviewers

hfinkel
iteratee

Commits

rG33250340f4dc: [SimplifyCFG] Don't do if-conversion if there is a long dependence chain
rG433e8d3e0428: [PPC] Change i32 constant in store instruction to i64
rL321377: [SimplifyCFG] Don't do if-conversion if there is a long dependence chain
rL318436: [PPC] Change i32 constant in store instruction to i64

Summary

The motivated test case is

struct ICmp {

bool operator()(int *a, int *b) const { return *a < *b; }

};

typedef set<int *, ICmp> PSet;

void foo(PSet* pset, int* l) {

pset->insert(l);

}

The actual hot code is in function _M_get_insert_unique_pos

 while (__x != 0)
{
  __y = __x; 
  __comp = _M_impl._M_key_compare(__k, _S_key(__x));
  __x = __comp ? _S_left(__x) : _S_right(__x);
}

LLVM generates following code:

.LBB1_2:

                                
movq    %rdx, %rbx
movq    32(%rbx), %rcx
movl    (%rcx), %ecx
leaq    24(%rbx), %rdx
leaq    16(%rbx), %rsi
cmpl    %ecx, %eax
cmovlq  %rsi, %rdx
movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_2

GCC generates:

7.53 │4137d0:   mov    0x10(%rbx),%rax                                   
0.01 │4137d4:   mov    $0x1,%edi                                               
     │4137d9:   test   %rax,%rax                                                                         
0.95 │4137dc: ↓ je     4137f7 <BM_CallF(testing::benchmark::State&)+0xd7>     
            
1.18 │4137de:   mov    %rax,%rbx

40.92 │4137e1: mov 0x20(%rbx),%rax
34.82 │4137e5: mov (%rax),%ecx

     │4137e7:   cmp    %r14d,%ecx                                                                    
1.83 │4137ea: ↑ jg     4137d0 <BM_CallF(testing::benchmark::State&)+0xb0>  
           
8.05 │4137ec:   mov    0x18(%rbx),%rax                                   
0.01 │4137f0:   xor    %edi,%edi                               
     │4137f2:   test   %rax,%rax                                                   
0.50 │4137f5: ↑ jne    4137de <BM_CallF(testing::benchmark::State&)+0xbe>

The gcc generated code is 15% faster. The reason is in llvm generated code, there is a long dependence chain including cmov and later instructions, all of them must be executed one by one. In gcc generated code, the instructions after branch don't have data dependence on instructions before branch, so they can be executed in parallel. Even if the branch is highly unpredictable(like visiting RBTree in this case), if the dependence chain is long enough, the correctly predicted execution can still bring overall performance improvement.

We have observed several internal applications benefited from this optimization. All of them visit tree like data structures.

Before if-conversion, this patch finds out the length of the dependence chain (only the part that can be executed in parallel with code after branch) before cmp, if the length is longer than a threshold, then give up if-conversion.

With this patch, now I can get

// -O2
.LBB1_3:

                              
movq    %rdx, %rbx
movq    32(%rbx), %rcx
movl    (%rcx), %ecx
cmpl    %ecx, %eax
jge     .LBB1_5

BB#4:

                                
leaq    16(%rbx), %rdx
jmp     .LBB1_6
.p2align        4, 0x90

.LBB1_5:

                                
leaq    24(%rbx), %rdx

.LBB1_6:

                                
movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_3

// -O3
.LBB1_3:

                             
movq    %rdx, %rbx
movq    32(%rbx), %rcx
movl    (%rcx), %ecx
cmpl    %ecx, %eax
jge     .LBB1_5

BB#4:

leaq    16(%rbx), %rdx
movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_3
jmp     .LBB1_7
.p2align        4, 0x90

.LBB1_5:

                                
leaq    24(%rbx), %rdx
movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_3

Still worse than gcc generated code, but those are separate issues (code layout, extra lea), and it is already faster than cmov version.

Diff Detail

Repository: rL LLVM

Event Timeline

Carrot created this revision.Oct 26 2017, 4:24 PM

Don't know why llvm generated code formatted like a mess. Try again.

The original llvm generated code:

0.49 │40e200:   mov    %rcx,%rbx

45.49 │40e203: mov 0x20(%rbx),%rcx
29.17 │40e207: mov (%rcx),%edx

0.06 │40e209:   lea    0x18(%rbx),%rcx                                                                
     │40e20d:   lea    0x10(%rbx),%rsi                                                                 
1.95 │40e211:   cmp    %edx,%eax                                                                    
3.12 │40e213:   cmovl  %rsi,%rcx

13.21 │40e217: mov (%rcx),%rcx

0.00 │40e21a:   test   %rcx,%rcx                                                  
1.57 │40e21d: ↑ jne    40e200 <std::pair<std::_Rb_tree_iterator<Lock*>, bool> std::_Rb_tree<Lock*, Lock*, std::_Identity<Lock*>, LCmp, std::allocator<Lock*> >::_M_insert_unique<Lock* const&>(Lock* const&)+0x30>

The llvm generated code with this patch

// -O2
.LBB1_3:

movq    %rdx, %rbx
movq    32(%rbx), %rcx
movl    (%rcx), %ecx
cmpl    %ecx, %eax
jge     .LBB1_5

leaq    16(%rbx), %rdx
jmp     .LBB1_6
.p2align        4, 0x90

.LBB1_5:

leaq    24(%rbx), %rdx

.LBB1_6:

movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_3

// -O3
.LBB1_3:

movq    %rdx, %rbx
movq    32(%rbx), %rcx
movl    (%rcx), %ecx
cmpl    %ecx, %eax
jge     .LBB1_5

leaq    16(%rbx), %rdx
movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_3
jmp     .LBB1_7
.p2align        4, 0x90

.LBB1_5:

leaq    24(%rbx), %rdx
movq    (%rdx), %rdx
testq   %rdx, %rdx
jne     .LBB1_3

.LBB1_7:

Carrot added reviewers: hfinkel, iteratee.Nov 2 2017, 11:04 AM

ping

Closed by commit rL318436: [PPC] Change i32 constant in store instruction to i64 (authored by Carrot). · Explain WhyNov 16 2017, 10:27 AM

This revision was automatically updated to reflect the committed changes.

I unintentionally closed this patch by pasting its url to another svn patch.
It's still waiting for review, so reopen it.

Rebase to head of trunk.

mgrang added a subscriber: mgrang.Nov 20 2017, 4:47 PM

mgrang added inline comments.

lib/Transforms/Utils/SimplifyCFG.cpp
405 ↗	(On Diff #123686)	Use uppercase variable names. Shouldn't size be unsigned?
413 ↗	(On Diff #123686)	Shouldn't latency be unsigned?
414 ↗	(On Diff #123686)	Uppercase variable names.
417 ↗	(On Diff #123686)	Uppercase variable names.
423 ↗	(On Diff #123686)	Uppercase variable names.
440 ↗	(On Diff #123686)	Uppercase variable names. Check if br is not null.
460 ↗	(On Diff #123686)	You can combine the assignment and check into a single if: if (PHINode *PN = dyn_cast<PHINode>(II))
481 ↗	(On Diff #123686)	Same here. Can combine the assignment and check into single if.
505 ↗	(On Diff #123686)	How big are the keys of your map? Can you use llvm's map containers (like DenseMap, etc)?
511 ↗	(On Diff #123686)	Uppercase variable names.

Carrot updated this revision to Diff 124014.Nov 22 2017, 3:26 PM

Carrot marked 10 inline comments as done.

Carrot added inline comments.

lib/Transforms/Utils/SimplifyCFG.cpp
505 ↗	(On Diff #123686)	The size of the map is BB.size() - 1, and can't be greater than 40.

Rather than trying to prevent the flattening from happening in the first place, it might make more sense to reverse it later in the pass pipeline. That would be more general, and avoid blocking other optimizations. The x86 backend already does this in some limited cases; see X86CmovConversion.cpp.

In D39352#933324, @efriedma wrote:

Rather than trying to prevent the flattening from happening in the first place, it might make more sense to reverse it later in the pass pipeline. That would be more general, and avoid blocking other optimizations. The x86 backend already does this in some limited cases; see X86CmovConversion.cpp.

Thank you for the alternative method. This optimization should be target independent. It impacts all OOO processors. I tested the same motivated test case on power8, and observed 10% performance difference.

ping

You have a bunch of variable names here with underscores, which is not our usual convention. BB1_chain -> BB1Chain, etc.

lib/Transforms/Utils/SimplifyCFG.cpp
413 ↗	(On Diff #124014)	Please explicitly note in the comment that the 'or' here is determined by the 'LongestChain' parameter.
497 ↗	(On Diff #124014)	in new BB -> in this new BB
500 ↗	(On Diff #124014)	I find this sentence confusing. You're trying to say that the control dependence is faster than the data dependence, because the control dependence can be speculated (and thus, the second part can execute in parallel with the first). Right? This comment should also explain what this is checking. Something like: When considering whether to perform if-conversion, find the length of the dependence chain in BB1 (only the part that can be executed in parallel with code after branch in BB2) before cmp, and if the length is longer than a threshold, don't perform if-conversion." Document what SpeculationSize is.
518 ↗	(On Diff #124014)	Please add a cl::opt for this threshold.
527 ↗	(On Diff #124014)	You should comment that you're estimating that if there are lots of other instructions in the new BB, they'll be other instructions for the processor to issue regardless of the length of this new dependence chain.
528 ↗	(On Diff #124014)	Should this 2 be the SchedModel's IssueWidth?
539 ↗	(On Diff #124014)	Is there a way to think of the proper value of DependenceChainLatency in terms of the branch-misprediction penalty?

Thanks a lot for the English language correction.

Carrot added inline comments.Dec 15 2017, 10:16 AM

lib/Transforms/Utils/SimplifyCFG.cpp
528 ↗	(On Diff #124014)	In real world applications, an IPC of 2 is already very good for 3 issue or 4 issue processors. Higher IPC is usually found in programs with small kernel. So I think 2 is more reasonable for most applications.
539 ↗	(On Diff #124014)	DependenceChainLatency depends on many factors, besides branch misprediction penalty, there are also taken branch latency, branch mis prediction rate, cmov latency.

hfinkel added inline comments.Dec 17 2017, 8:39 AM

lib/Transforms/Utils/SimplifyCFG.cpp
528 ↗	(On Diff #124014)	In real world applications, an IPC of 2 is already very good for 3 issue or 4 issue processors. Higher IPC is usually found in programs with small kernel. So I think 2 is more reasonable for most applications. Fair enough. As we both know, there are all sorts of issues affecting this, but I agree that for non-loop code with small basic blocks, this is likely true. Please mention this assumption in the comment.
539 ↗	(On Diff #124014)	Please add this to the comment.

Carrot updated this revision to Diff 127423.Dec 18 2017, 3:21 PM

Carrot marked 2 inline comments as done.

LGTM, however, the testing here is fairly light. We should have some cases where we do perform the if conversion because of the DependenceChainLatency check, where we do perform the if conversion because of the IPC check, and because of the BB size check. Also, where the if-converted block has some non-trivial adjustment because of LatencyAdjustment.

lib/Transforms/Utils/SimplifyCFG.cpp
419 ↗	(On Diff #127423)	the compare instruction. -> the compare instruction feeding the block's conditional branch.
506 ↗	(On Diff #127423)	... because the data dependence is changed into control dependence, and ... This is backward. It should say, "... because the control dependence is transformed into a data dependence, and the control dependence can be speculated, and ...".
532 ↗	(On Diff #127423)	in small BB. -> in a small BB.

This revision is now accepted and ready to land.Dec 19 2017, 9:10 PM

hfinkel mentioned this in D41361: [SimplifyCFG] Avoid quadratic on a predecessors number behavior in instruction sinking..Dec 20 2017, 4:10 PM

Add more test cases.
Will check in this version.

Closed by commit rL321377: [SimplifyCFG] Don't do if-conversion if there is a long dependence chain (authored by Carrot). · Explain WhyDec 22 2017, 10:55 AM

This revision was automatically updated to reflect the committed changes.

I don't think this patch's approach is the right one, and we should probably pursue Eli's suggestion.... It also causes very significant regressions in benchmarks, so I think we should revert it for now.

The core problem I see here is that we're pretty radically changing the canonical form of the IR. This is going to be really disruptive. Doing this so early on means that we can't recognize common, direct patterns in terms of select. While perhaps it would be better to go and teach everything to generically reason about either a select or a phi around control flow, that will need a really massive undertaking.

And this isn't just a theoretical problem. This patch regresses performance on simple code patterns like counting things in sets by a huge amount because there, we were able to do things like use the adc instruction or other dedicated hardware:
https://reviews.llvm.org/P8055 or

#include <bitset>
#include <chrono>
#include <iostream>
#include <random>
#include <vector>

int main() {
  std::bitset<256> set;
  volatile unsigned seed = 0;
  std::mt19937 rng(seed);
  for (int i = 0; i < 256; ++i)
    set[i] = rng() < (rng.max() / 2) ? false : true;

  volatile unsigned size = 4096;
  std::vector<unsigned char> data;
  for (int i = 0; i < size; ++i)
    data.push_back(rng() & 0xFF);

  volatile int iterations = 100000;
  long int true_count = 0, all_count = 0;

  auto start = std::chrono::high_resolution_clock::now();
  for (int i = 0; i < iterations; ++i)
    for (auto c : data) {
      if (set[c])
        ++true_count;
      ++all_count;
    }
  auto stop = std::chrono::high_resolution_clock::now();

  std::cout << "Did " << iterations << " iterations over " << size
            << " queries: " << true_count << " hits of " << all_count << "\n";
  std::cout << "Took "
            << std::chrono::duration<double, std::nano>(stop - start).count() /
                   iterations
            << " nanoseconds per iteration.\n";
}

This benchmark runs over 2x slower for me on both AMD and Intel hardware. That seems like a really huge regression. And I have extracted this out of actual benchmark code we have internally. I think we should probably revert this patch until there is a clear consensus and/or more comprehensive performance data supporting a particular direction here.

Another way of looking at the problem with this approach is that it seems to be trying to solve an issue pretty specifically with cmov. We actually have a dedicated pass that tries *specifically* to reason about the length of dependency chains and the proportional cost of a cmov versus a conditional branch. That seems like a much more appropriate place to handle these things, and that is exactly what Eli pointed you at. I've done some work there as well, and I really think that is a good approach. If other architectures need similar logic, we could try to factor it out and share it between targets, but I think the ideas around what constitutes a long dependency will be somewhat architecture specific. For example, I could easily imagine a predicated architecture where the patterns in this patch don't form long dependency chains.

reverted, will investigate it further.

The performance regression of https://reviews.llvm.org/P8055 is not because of missing adc instruction.

Slow version:

.LBB0_65:

movzbl  (%rdx), %esi
movq    %rsi, %rdi
shrq    $3, %rdi
andl    $24, %edi
movq    32(%rsp,%rdi), %rdi
btq     %rsi, %rdi
jae     .LBB0_67

addq    $1, %rbx

.LBB0_67:

addq    $1, %rdx
cmpq    %rdx, %r13
jne     .LBB0_65

Fast version:

.LBB0_65:

movzbl  (%rdx), %esi
movq    %rsi, %rdi
shrq    $3, %rdi
andl    $24, %edi
movq    32(%rsp,%rdi), %rdi
btq     %rsi, %rdi
adcq    $0, %rbx                                 // result of if conversion
addq    $1, %rdx
cmpq    %rdx, %r13
jne     .LBB0_65

The result of if conversion %rbx is not used in later in BB67, so although there is long dependence chain in BB65, it does not prevent the processor execute instructions in BB67 and later BBs. For the branch version, it doesn't make instructions in BB67 executed earlier, but it bring extra penalty of branch misprediction, so it is slower.

In function FindLongDependenceChain, I should also check the result of if-conversion can really delay other instructions in BB2, if not, like in this case, we should do if-conversion.

Doing this so early on means that we can't recognize common, direct patterns in terms of select. While perhaps it would be better to go and teach everything to generically reason about either a select or a phi around control flow, that will need a really massive undertaking.

FWIW, I believe that the outcome of the select/GVN discussion (e.g., from PR34603) has been that this is exactly what we need to do: we need to move select formation late in the pipeline, and likely canonicalize away from select early.

This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] Don't do if-conversion if there is a long dependence chain
ClosedPublic

Details

BB#4:

Diff Detail

Event Timeline

Revision Contents

Diff 123207

llvm/trunk/include/llvm/CodeGen/SelectionDAGNodes.h

llvm/trunk/lib/Target/PowerPC/PPCISelLowering.cpp

llvm/trunk/test/CodeGen/PowerPC/store-constant.ll

This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] Don't do if-conversion if there is a long dependence chainClosedPublic

Details

BB#4:

Diff Detail

Event Timeline

Revision Contents

Diff 123207

llvm/trunk/include/llvm/CodeGen/SelectionDAGNodes.h

llvm/trunk/lib/Target/PowerPC/PPCISelLowering.cpp

llvm/trunk/test/CodeGen/PowerPC/store-constant.ll

[SimplifyCFG] Don't do if-conversion if there is a long dependence chain
ClosedPublic