Differential D22782: Added 'inline' attribute to __init to inline basic_string's constructor

Authored by laxmansole on Jul 25 2016, 2:41 PM.

Details

basic_string's constructor calls __init, which was not getting inlined. This prevented optimization of a const string, because __init appeared as an opaque call between the string's definition and its uses.

Worked in collaboration with Aditya Kumar.
Event Timeline

Comment Actions

I think we can write a testcase that shows that copy constructors are not optimized away unless the string constructor is inlined. This patch fixes the performance of a proprietary benchmark when compiled with libc++, closing the performance gap with the same benchmark compiled with the GNU libstdc++. Overall, with this patch we execute half a billion fewer instructions out of about 10 billion.

Comment Actions

We are looking at the libcxx test suite; it seems there are only correctness/functionality tests. Do you have any pointers on how to add tests that check whether functions are inlined? In LLVM this is easy because we can print the assembly and use FileCheck/grep to CHECK/CHECK-NOT that a string is present.

Comment Actions

$ cat foo.cpp
#include <string>

int foo(const std::string name);
int bar() {
    return foo("bar");
}
$ clang++ -S -O3 -fno-exceptions foo.cpp

Assembly output without patch:

_Z3barv: // @_Z3barv
// BB#0: // %entry
sub sp, sp, #64 // =64
adrp x1, .L.str
add x1, x1, :lo12:.L.str
add x0, sp, #8 // =8
orr w2, wzr, #0x3
stp xzr, x19, [sp, #24] // 8-byte Folded Spill
stp x29, x30, [sp, #48] // 8-byte Folded Spill
add x29, sp, #48 // =48
stp xzr, xzr, [sp, #8]
bl _ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE6__initEPKcm
add x0, sp, #8 // =8
bl _Z3fooNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE
and w19, w0, #0x1
add x0, sp, #8 // =8
bl _ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED1Ev
mov w0, w19
ldp x29, x30, [sp, #48] // 8-byte Folded Reload
ldr x19, [sp, #32] // 8-byte Folded Reload
add sp, sp, #64 // =64
ret

Assembly output with patch:

_Z3barv: // @_Z3barv
// BB#0: // %entry
sub sp, sp, #64 // =64
orr w8, wzr, #0x6
mov w9, #114
mov w10, #24930
add x0, sp, #8 // =8
stp xzr, x19, [sp, #24] // 8-byte Folded Spill
stp x29, x30, [sp, #48] // 8-byte Folded Spill
add x29, sp, #48 // =48
stp xzr, xzr, [sp, #8]
strb w8, [sp, #8]
strb w9, [sp, #11]
sturh w10, [sp, #9]
strb wzr, [sp, #12]
bl _Z3fooNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE
and w19, w0, #0x1
add x0, sp, #8 // =8
bl _ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED1Ev
mov w0, w19
ldp x29, x30, [sp, #48] // 8-byte Folded Reload
ldr x19, [sp, #32] // 8-byte Folded Reload
add sp, sp, #64 // =64
ret

We can see that the call to _ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE6__initEPKcm has been inlined away.

Comment Actions

The change itself LGTM, although we probably want to inline the forward/input iterator __init's as well. However, I would like to see a small benchmark that demonstrates the performance change. Please try to write the benchmark using Google Benchmark.

Comment Actions

Sure, thanks.

Comment Actions

Added the inline attribute to the forward/input iterator __init's.

Comment Actions

I would love to see a benchmark with this, but I've done enough investigating on my own that I *know* this patch is beneficial.

Comment Actions

This patch was motivated by perf analysis we did on a proprietary benchmark, in which we have seen a reduction of about 1 billion instructions (out of 10B) on x86_64-linux and aarch64-linux.