The motivated test case is
struct ICmp {
bool operator()(int *a, int *b) const { return *a < *b; }};
typedef set<int *, ICmp> PSet;
void foo(PSet* pset, int* l) {
pset->insert(l);
}
The actual hot code is in function _M_get_insert_unique_pos
 while (__x != 0)
{
  __y = __x; 
  __comp = _M_impl._M_key_compare(__k, _S_key(__x));
  __x = __comp ? _S_left(__x) : _S_right(__x);
}LLVM generates following code:
.LBB1_2:
movq %rdx, %rbx movq 32(%rbx), %rcx movl (%rcx), %ecx leaq 24(%rbx), %rdx leaq 16(%rbx), %rsi cmpl %ecx, %eax cmovlq %rsi, %rdx movq (%rdx), %rdx testq %rdx, %rdx jne .LBB1_2
GCC generates:
7.53 │4137d0:   mov    0x10(%rbx),%rax                                   
0.01 │4137d4:   mov    $0x1,%edi                                               
     │4137d9:   test   %rax,%rax                                                                         
0.95 │4137dc: ↓ je     4137f7 <BM_CallF(testing::benchmark::State&)+0xd7>     
            
1.18 │4137de:   mov    %rax,%rbx40.92 │4137e1:   mov    0x20(%rbx),%rax                                               
 34.82 │4137e5:   mov    (%rax),%ecx
     │4137e7:   cmp    %r14d,%ecx                                                                    
1.83 │4137ea: ↑ jg     4137d0 <BM_CallF(testing::benchmark::State&)+0xb0>  
           
8.05 │4137ec:   mov    0x18(%rbx),%rax                                   
0.01 │4137f0:   xor    %edi,%edi                               
     │4137f2:   test   %rax,%rax                                                   
0.50 │4137f5: ↑ jne    4137de <BM_CallF(testing::benchmark::State&)+0xbe>The gcc generated code is 15% faster. The reason is in llvm generated code, there is a long dependence chain including cmov and later instructions, all of them must be executed one by one. In gcc generated code, the instructions after branch don't have data dependence on instructions before branch, so they can be executed in parallel. Even if the branch is highly unpredictable(like visiting RBTree in this case), if the dependence chain is long enough, the correctly predicted execution can still bring overall performance improvement.
We have observed several internal applications benefited from this optimization. All of them visit tree like data structures.
Before if-conversion, this patch finds out the length of the dependence chain (only the part that can be executed in parallel with code after branch) before cmp, if the length is longer than a threshold, then give up if-conversion.
With this patch, now I can get
// -O2
.LBB1_3:
movq %rdx, %rbx movq 32(%rbx), %rcx movl (%rcx), %ecx cmpl %ecx, %eax jge .LBB1_5
BB#4:
leaq 16(%rbx), %rdx jmp .LBB1_6 .p2align 4, 0x90
.LBB1_5:
leaq 24(%rbx), %rdx
.LBB1_6:
movq (%rdx), %rdx testq %rdx, %rdx jne .LBB1_3
// -O3
.LBB1_3:
movq %rdx, %rbx movq 32(%rbx), %rcx movl (%rcx), %ecx cmpl %ecx, %eax jge .LBB1_5
BB#4:
leaq 16(%rbx), %rdx movq (%rdx), %rdx testq %rdx, %rdx jne .LBB1_3 jmp .LBB1_7 .p2align 4, 0x90
.LBB1_5:
leaq 24(%rbx), %rdx movq (%rdx), %rdx testq %rdx, %rdx jne .LBB1_3
Still worse than gcc generated code, but those are separate issues (code layout, extra lea), and it is already faster than cmov version.