Currently, JT only processes (clones) BBs with multiple successors, with the aim of threading a predecessor to one of the successor BBs. This misses opportunities to handle a return BB whose return value can be simplified by threading (cloning).
Example:
#include <array>
#include <algorithm>
constexpr std::array<int, 3> x = {1, 7, 17};
bool Contains(int i) {
  return std::find(x.begin(), x.end(), i) != x.end();
}
Clang produces inefficient code:
_Z8Containsi: # @_Z8Containsi
        .cfi_startproc
# %bb.0:
        cmpl    $1, %edi
        je      .LBB0_1
# %bb.2:
        cmpl    $7, %edi
        jne     .LBB0_3
# %bb.4:
        movl    $_ZL1x+4, %eax
        jmp     .LBB0_5
.LBB0_1:
        movl    $_ZL1x, %eax
        jmp     .LBB0_5
.LBB0_3:
        cmpl    $17, %edi
        movl    $_ZL1x+8, %ecx
        movl    $_ZL1x+12, %eax
        cmoveq  %rcx, %rax
.LBB0_5:
        movl    $_ZL1x+12, %ecx
        cmpq    %rcx, %rax
        setne   %al
        retq
While GCC produces:
_Z8Containsi:
.LFB1534:
        .cfi_startproc
        movl    $1, %eax
        cmpl    $1, %edi
        je      .L1
        cmpl    $7, %edi
        je      .L1
        cmpl    $17, %edi
        sete    %al
.L1:
        ret
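For readers who prefer source over assembly, the difference between the two listings above can be sketched in C++ roughly as follows. This is only an illustration of the control-flow shape, not compiler output; `kTable`, `ContainsUnthreaded`, and `ContainsThreaded` are hypothetical names introduced here.

```cpp
#include <array>

// Same values as `x` in the example; the name kTable is hypothetical.
constexpr std::array<int, 3> kTable = {1, 7, 17};

// Shape of the un-threaded code: every path first materializes a pointer,
// then all paths funnel into one shared return block that compares the
// pointer against end().
bool ContainsUnthreaded(int i) {
  const int *begin = kTable.data();
  const int *end = begin + kTable.size();
  const int *p;
  if (i == 1)
    p = begin;
  else if (i == 7)
    p = begin + 1;
  else
    p = (i == 17) ? begin + 2 : end;
  return p != end; // shared return block
}

// Shape after the return block is cloned into its predecessors: on each
// path the pointer is known, so the comparison folds to a constant.
bool ContainsThreaded(int i) {
  if (i == 1)
    return true;
  if (i == 7)
    return true;
  return i == 17;
}
```

Both functions return the same result for every input; the second simply never needs to form the pointer.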
This patch addresses the issue. After the fix, the generated code looks like:
_Z8Containsi: # @_Z8Containsi
        .cfi_startproc
        addl    $-1, %edi
        cmpl    $16, %edi
        ja      .LBB0_2
        movl    $65601, %eax            # imm = 0x10041
        movl    %edi, %ecx
        shrl    %cl, %eax
        andb    $1, %al
        retq
.LBB0_2:                                # %_ZSt4findIPKiiET_S2_S2_RKT0_.exit.thread
        xorl    %eax, %eax
        retq
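The final listing is a bit-mask lookup: 0x10041 has bits 0, 6, and 16 set, which correspond to i - 1 for i = 1, 7, 17. A minimal C++ sketch of the same computation, assuming 32-bit `int` (the function name `ContainsBitmask` is hypothetical and not part of the patch):

```cpp
#include <cstdint>

// Hand-written equivalent of the assembly after the fix.
bool ContainsBitmask(int i) {
  // Mirrors `addl $-1, %edi`; done in unsigned arithmetic so that values
  // below 1 wrap around to large numbers and fail the range check.
  std::uint32_t idx = static_cast<std::uint32_t>(i) - 1u;
  if (idx > 16u) // cmpl $16, %edi ; ja .LBB0_2
    return false;
  // movl $65601, %eax ; shrl %cl, %eax ; andb $1, %al
  return ((0x10041u >> idx) & 1u) != 0;
}
```

For example, `ContainsBitmask(7)` tests bit 6 of the mask and returns true, while `ContainsBitmask(8)` tests bit 7 and returns false.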