basic_string's constructor calls init which was not getting inlined.
This prevented optimization of const string as init would appear as a call in between a string's def and use.
Worked in collaboration with Aditya Kumar.
Differential D22782
Added 'inline' attribute to __init to inline the basic_string's constructor laxmansole on Jul 25 2016, 2:41 PM. Authored by
Details
basic_string's constructor calls init which was not getting inlined. Worked in collaboration with Aditya Kumar.
Diff Detail
Event TimelineComment Actions I think we can write a testcase that shows that copy constructors are not optimized away unless the string constructor is inlined. This patch fixes the performance of a proprietary benchmark when compiled with libc++, closing the performance gap with the same benchmark compiled with the gnu libstdc++. Overall with this patch we have half a billion fewer instructions out of about 10 billion. Comment Actions We are looking at the libcxx testsuite, it seems there are only correctness/functionality tests. Do you have any pointers on how to add tests to check if functions are inlined. In llvm it is easy to do because we can print the assembly and use FileCheck/grep to CHECK/CHECK-NOT if a string is present. Comment Actions $ cat foo.cpp int foo(const std::string name); int bar(){ return foo("bar"); } $clang++ -S -O3 -fno-exceptions foo.cpp Assembly output without patch: _Z3barv: // @_Z3barv // BB#0: // %entry sub sp, sp, #64 // =64 adrp x1, .L.str add x1, x1, :lo12:.L.str add x0, sp, #8 // =8 orr w2, wzr, #0x3 stp xzr, x19, [sp, #24] // 8-byte Folded Spill stp x29, x30, [sp, #48] // 8-byte Folded Spill add x29, sp, #48 // =48 stp xzr, xzr, [sp, #8] bl _ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE6__initEPKcm add x0, sp, #8 // =8 bl _Z3fooNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE and w19, w0, #0x1 add x0, sp, #8 // =8 bl _ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED1Ev mov w0, w19 ldp x29, x30, [sp, #48] // 8-byte Folded Reload ldr x19, [sp, #32] // 8-byte Folded Reload add sp, sp, #64 // =64 ret Assembly output with patch: _Z3barv: // @_Z3barv // BB#0: // %entry sub sp, sp, #64 // =64 orr w8, wzr, #0x6 mov w9, #114 mov w10, #24930 add x0, sp, #8 // =8 stp xzr, x19, [sp, #24] // 8-byte Folded Spill stp x29, x30, [sp, #48] // 8-byte Folded Spill add x29, sp, #48 // =48 stp xzr, xzr, [sp, #8] strb w8, [sp, #8] strb w9, [sp, #11] sturh w10, [sp, #9] strb wzr, [sp, #12] bl _Z3fooNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE and w19, w0, #0x1 add x0, sp, #8 // =8 bl _ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEED1Ev mov w0, w19 ldp x29, x30, [sp, #48] // 8-byte Folded Reload ldr x19, [sp, #32] // 8-byte Folded Reload add sp, sp, #64 // =64 ret We can see that call to _ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE6__initEPKcm get inlined. Comment Actions The change itself LGTM, although we probably want to inline the forward/input iterator __init's as well. However I would like to see a small benchmark that demonstrates the performance change. Please try and write the benchmark using Google Benchmark. Comment Actions Sure, Thanks, Comment Actions Added inline attribute to the forward/input iterator __init's. Comment Actions I would love to see a benchmark with this, but I've done enough investigating on my own that I *know* this patch is beneficial. Comment Actions This patch was motivated by perf analysis we did on a proprietary benchmark in which we have seen a reduction of about 1 billion instructions (out of 10B) on x86_64-linux and aarch64-linux. |