This is an archive of the discontinued LLVM Phabricator instance.

partially inline copy constructor basic_string(const basic_string&[, allocator])
AbandonedPublic

Authored by mvels on Oct 7 2019, 5:58 PM.

Details

Summary

This change optimizes the copy constructor using partial inlining.

  • adds default_value_tag() to memory, to support default initialization
  • inlines copy contructor: non SSO init delegated to instantiated __init_long() method

Note that this change does not consider existing value initialization ctors, most would likely benefit from default initialization.

Generated code is small, i.e, considerably smaller than other hot inlined functions such as move ctor

given:

void StringCopyCtor(void* mem, const std::string& s) {
    std::string*p = new(mem) std::string{s};
}

asm:

        cmp     byte ptr [rsi + 23], 0
        js      .LBB0_2
        mov     rax, qword ptr [rsi + 16]
        mov     qword ptr [rdi + 16], rax
        movups  xmm0, xmmword ptr [rsi]
        movups  xmmword ptr [rdi], xmm0
        ret
.LBB0_2:
        jmp     std::basic_string::__init_long # TAILCALL

Benchmarks::
BM_StringCopy_Empty                                           5.18ns ± 7%             1.53ns ± 5%  -70.45%        (p=0.000 n=10+10)
BM_StringCopy_Small                                           5.18ns ± 7%             1.54ns ± 5%  -70.21%        (p=0.000 n=10+10)
BM_StringCopy_Large                                           19.0ns ± 1%             19.3ns ± 1%   +1.77%        (p=0.000 n=10+10)
BM_StringCopy_Huge                                             321ns ± 4%              310ns ± 1%     ~            (p=0.408 n=10+8)

Event Timeline

mvels created this revision.Oct 7 2019, 5:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 7 2019, 5:58 PM
mvels retitled this revision from This change optimizes the copy constructor using partial inlining. - add __default_value_tag() to memory, to support default initialization - inline copy contructor: non SSO init delegated to instantiated __init_long() method Note that this... to Optimize copy constructor.Oct 7 2019, 6:03 PM
mvels edited the summary of this revision. (Show Details)
mvels added reviewers: ldionne, EricWF.
mvels updated this revision to Diff 223713.Oct 7 2019, 6:06 PM
mvels edited the summary of this revision. (Show Details)
  • Added inline

A couple of things:

  1. Can you give your patches a better title? "Optimize copy constructor" is not very informative. Which copy constructor?
  2. What does this do it the dylib - where basic_string<char> and basic_string<wchar_t> are externally instantiated?
mvels retitled this revision from Optimize copy constructor to partially inline copy constructor basic_string(const basic_string&[, allocator]).Oct 7 2019, 6:50 PM
mvels added a comment.Oct 7 2019, 7:22 PM
  1. I updated the title
  1. I have not considered the implications for the dylib case, I would imagine that __init_long() needs to be exported if we allow inlining of the copy ctor.

From talking to EricWF, I understand we could alternatively mark the constructor _LIBCPP_INLINE_VISIBILITY and add code in string.cc to force instantiation of the code for any existing libs / code expecting the method to be instantiated.

I'd appreciate any suggestions or feedback you;'d have here

mvels edited the summary of this revision. (Show Details)Oct 7 2019, 7:27 PM
Harbormaster completed remote builds in B39136: Diff 223713.
mvels added a comment.Oct 9 2019, 9:50 AM

libc++.so code for std::string ctor (demangled / mixed source)

00000000000b7be0 <std_string::basic_string(std_string const&)>:
{
   b7be0:	55                   	push   %rbp
   b7be1:	48 89 e5             	mov    %rsp,%rbp
   b7be4:	41 57                	push   %r15
   b7be6:	41 56                	push   %r14
   b7be8:	41 55                	push   %r13
   b7bea:	41 54                	push   %r12
   b7bec:	53                   	push   %rbx
   b7bed:	50                   	push   %rax
   b7bee:	48 89 fb             	mov    %rdi,%rbx
        {return bool(__r_.first().__s.__size_ & __short_mask);}
   b7bf1:	80 7e 17 00          	cmpb   $0x0,0x17(%rsi)
    if (!__str.__is_long())
   b7bf5:	78 10                	js     b7c07 <std_string::basic_string(std_string const&)+0x27>
        __r_.first().__r = __str.__r_.first().__r;
   b7bf7:	48 8b 46 10          	mov    0x10(%rsi),%rax
   b7bfb:	48 89 43 10          	mov    %rax,0x10(%rbx)
   b7bff:	0f 10 06             	movups (%rsi),%xmm0
   b7c02:	0f 11 03             	movups %xmm0,(%rbx)
   b7c05:	eb 58                	jmp    b7c5f <std_string::basic_string(std_string const&)+0x7f>
        {return __r_.first().__l.__data_;}
   b7c07:	4c 8b 3e             	mov    (%rsi),%r15
        {return __r_.first().__l.__size_;}
   b7c0a:	4c 8b 76 08          	mov    0x8(%rsi),%r14
            {return (__s + (__a-1)) & ~(__a-1);}
   b7c0e:	49 8d 46 10          	lea    0x10(%r14),%rax
   b7c12:	48 83 e0 f0          	and    $0xfffffffffffffff0,%rax
  pointer __p = __alloc_traits::allocate(__alloc(), __cap + 1);
   b7c16:	49 83 fe 17          	cmp    $0x17,%r14
   b7c1a:	41 bd 17 00 00 00    	mov    $0x17,%r13d
   b7c20:	4c 0f 43 e8          	cmovae %rax,%r13
  return __builtin_operator_new(__size);
   b7c24:	4c 89 ef             	mov    %r13,%rdi
   b7c27:	e8 74 c0 00 00       	callq  c3ca0 <operator new(unsigned long)@plt>
   b7c2c:	49 89 c4             	mov    %rax,%r12
        {__r_.first().__l.__data_ = __p;}
   b7c2f:	48 89 03             	mov    %rax,(%rbx)
   b7c32:	48 b8 00 00 00 00 00 	movabs $0x8000000000000000,%rax
   b7c39:	00 00 80 
        {__r_.first().__l.__cap_  = __long_mask | __s;}
   b7c3c:	4c 09 e8             	or     %r13,%rax
   b7c3f:	48 89 43 10          	mov    %rax,0x10(%rbx)
        {__r_.first().__l.__size_ = __s;}
   b7c43:	4c 89 73 08          	mov    %r14,0x8(%rbx)
            return __n == 0 ? __s1 : (char_type*)memcpy(__s1, __s2, __n);
   b7c47:	4d 85 f6             	test   %r14,%r14
   b7c4a:	74 0e                	je     b7c5a <std_string::basic_string(std_string const&)+0x7a>
   b7c4c:	4c 89 e7             	mov    %r12,%rdi
   b7c4f:	4c 89 fe             	mov    %r15,%rsi
   b7c52:	4c 89 f2             	mov    %r14,%rdx
   b7c55:	e8 b6 c0 00 00       	callq  c3d10 <memcpy@plt>
    void assign(char_type& __c1, const char_type& __c2) _NOEXCEPT {__c1 = __c2;}
   b7c5a:	43 c6 04 34 00       	movb   $0x0,(%r12,%r14,1)
}
   b7c5f:	48 83 c4 08          	add    $0x8,%rsp
   b7c63:	5b                   	pop    %rbx
   b7c64:	41 5c                	pop    %r12
   b7c66:	41 5d                	pop    %r13
   b7c68:	41 5e                	pop    %r14
   b7c6a:	41 5f                	pop    %r15
   b7c6c:	5d                   	pop    %rbp
   b7c6d:	c3                   	retq

This change highlights ABI issues (init_long is not a forwards compatible change / technically breaks ABI)

https://github.com/llvm/llvm-project/compare/master...martijnvels:string-optimize-copy-ctor contains an idea for creating STABLE/UNSTABLE macros which more easily allows us to switch and/or split ABI functions for UNSTABLE without STABLE.

ldionne requested changes to this revision.Oct 15 2019, 11:27 AM

Note: There does not appear to be anything at https://github.com/llvm/llvm-project/compare/master...martijnvels:string-optimize-copy-ctor

The ABI break wasn't immediately obvious to me at first, so let's dissect the change to make sure we're all on the same page. First, we're adding a two new symbols to the dylib via the explicit instantiation of std::basic_string inside the dylib:

__ZNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE11__init_longERKS5_ => std::basic_string<char, std::char_traits<char>, std::allocator<char> >::__init_long(std::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
__ZNSt3__112basic_stringIwNS_11char_traitsIwEENS_9allocatorIwEEE11__init_longERKS5_ => std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> >::__init_long(std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > const&)

That's just adding symbols, which is ABI-cool. Second, there's two possibilities for existing code that has been compiled against old libc++ headers.

  1. basic_string(const basic_string& __str) has been inlined into the code, and there's thus a call to __init() inside the dylib. That's OK, since we're not removing symbols from the dylib, so that code can still call __init() if we execute it against a new dylib.
  2. basic_string(const basic_string& __str) has not been inlined, so there's a call to basic_string(const basic_string& __str) inside the dylib. That's OK too since we are not removing that symbol. The new basic_string(const basic_string& __str) inside the dylib will have a different implementation after that patch, but it does the same thing as it used to.

The only problem I see is building an application against the latest headers, with the intent of using it against an older dylib. Then, if basic_string(const basic_string& __str) is inlined into the application, that application contains a call to __init_long() in the dylib. However, when run against an old dylib that doesn't have __init_long() yet, the dynamic linker will fail to find the symbol at runtime. I don't know whether that's technically considered an ABI break, but it's definitely not something we want.

Are we on the same page?

If so, then one way this problem can be circumvented is by having the knowledge of when a certain function (here __init_long()) was introduced in the dylib. If the deployment target is not known to contain __init_long(), we'd fall back to another implementation that does not require it in the dylib (such as emitting __init_long() in the application). We have precedent for doing that, for example for the iostreams. From <__config>:

// The stream API was dropped and re-added in the dylib shipped on macOS
// and iOS. We can only assume the dylib to provide these definitions for
// macosx >= 10.9 and ios >= 7.0. Otherwise, the definitions are available
// from the headers, but not from the dylib. Explicit instantiation
// declarations for streams exist conditionally to this; if we provide
// an explicit instantiation declaration and we try to deploy to a dylib
// that does not provide those symbols, we'll get a load-time error.
#if !defined(_LIBCPP_BUILDING_LIBRARY) &&                                      \
    ((defined(__ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__) &&                \
      __ENVIRONMENT_MAC_OS_X_VERSION_MIN_REQUIRED__ < 1090) ||                 \
     (defined(__ENVIRONMENT_IPHONE_OS_VERSION_MIN_REQUIRED__) &&               \
      __ENVIRONMENT_IPHONE_OS_VERSION_MIN_REQUIRED__ < 70000))
#  define _LIBCPP_DO_NOT_ASSUME_STREAMS_EXPLICIT_INSTANTIATION_IN_DYLIB
#endif

We then drop the explicit instantiation _declaration_ of iostreams from the headers when we can't guarantee that the dylib on the deployment target will contain their definition. This is not terribly pretty, however we could devise a solution similar to that in this case.

Another mechanism we have to work with this kind of issue is availability markup. We basically annotate when functions are introduced in the dylib, and the compiler will issue an error if one tries to use a function in the dylib in conjunction with a deployment target that does not contain that function. This is similar in the sense that both try to prevent forward incompatibilities from hurting, however I would recommend against this approach in the current case because this is only an optimization (std::string needs to stay usable when deploying to older platforms, but some performance/code size hit may be reasonable).

libcxx/include/string
1856

You removed a closing > here.

This revision now requires changes to proceed.Oct 15 2019, 11:27 AM
mvels updated this revision to Diff 225205.EditedOct 16 2019, 6:21 AM

Updated with ABI in unstable

mvels updated this revision to Diff 225210.Oct 16 2019, 6:45 AM
  1. Updating D68617: partially inline copy constructor basic_string(const basic_string&[, allocator]) #
  2. Enter a brief description of the changes included in this update.
  3. The first line is used as subject, next lines as comment. #
  4. If you intended to create a new revision, use:
  5. $ arc diff --create

Added __config

mvels marked an inline comment as done.Oct 16 2019, 7:44 AM

Thanks for the elaborate comments Louis!

Sorry for the non working link, I messed with my fork, synced to a new machine and manage to wipe out the branch...

I discussed ABI implications with among others EricWf, and we should probably take it a step further in that we can't safely introduce new symbols and inlined dependencies to these symbols.

In this latest diff I included one possible solution to allow these ABI changes for UNSTABLE while keeping these a no-op for STABLE.

Ie, for STABLE we want to retain the ctors in the ABI, and not introduce new symbols, e.g.:

basic_string(const basic_string&);

_LIBCPP_HIDE_FROM_ABI
__init_long(const basic_string&);

For UNSTABLE, we want to explicitly inline the ctor, and outline the __init_long();

_LIBCPP_HIDE_FROM_ABI
basic_string(const basic_string&);

__init_long(const basic_string&);

This may deserve a more explicit (small) design doc or consensus. It is however a recurring issue for various methods we know to have a small, inlineable fast path which is now always a remote library call, for example, there are similar opportunities in assign, erase, etc, which does add up to significant speedups and CPU cost reductions .

Conditional ABI inclusion / exclusion macros in __config allow for a cleaner change in string than more noisy #ifdef ...UNSTABLE... blocks. This is just a suggested option, I imagine there are more ways to approach this. I don't believe there is a save option that offers forward and backwards ABI compatibility, so we can only change the inlining and add symbols (i.e., ABI break) in UNSTABLE.

As you highlight, any compilation with new headers linking against older libs or dylibs are a concern. More so as that these can pop up in deployment / runtime scenarios, which makes me firmly believe we should have some mechanism to apply these changes to UNSTABLE only with minimal noise in the string header.

ldionne requested changes to this revision.Oct 16 2019, 7:44 AM
ldionne added inline comments.
libcxx/include/__config
773

I'm really uncomfortable with piggy-backing on the fact that non-inline member functions with an explicit instantiation declaration won't be inlined into user code. We're also excluding this optimization from most builds which are not using the unstable ABI.

I think it's possible to do better by encoding the knowledge of when __init_long() was added to the shared library. As far as you're concerned, you could basically write code using _LIBCPP_HAS_INIT_LONG_IN_DYLIB (or whatever), and then vendors could go and define that macro when suitable for them.

We could also define all these _LIBCPP_HAS_<function-name>_IN_DYLIB macros in case we're using the unstable ABI. WDYT?

This revision now requires changes to proceed.Oct 16 2019, 7:44 AM
ldionne added inline comments.Oct 16 2019, 7:49 AM
libcxx/include/__config
773

So, basically, my suggestion would be this:

  1. #define _LIBCPP_HAS_INIT_LONG_IN_DYLIB in __config only when _LIBCPP_ABI_UNSTABLE is defined
  2. Write the code in <string> such that it takes advantage of __init_long when it _is_in the dylib, and it doesn't when it's not (but in both cases the code should work).
  3. I'll then go and also define _LIBCPP_HAS_INIT_LONG_IN_DYLIB when it makes sense based on the deployment target, and other vendors can do the same.

This seems like a clean solution where libc++ ends up being optimal in all cases where it can be. This also introduces a structured way of making this sort of change.

mvels marked an inline comment as done.Oct 16 2019, 8:23 AM
mvels added inline comments.
libcxx/include/__config
773

I'm really uncomfortable with piggy-backing on the fact that non-inline member functions with an explicit instantiation declaration won't be inlined into user code. We're also excluding this optimization from most builds which are not using the unstable ABI.

These defines are primarily intended to say 'exclude from ABI yes/no for Stable/Unstable. We may not need the 'inline' hint, I basically added this following existing definitions in string for inline visibility. I may also miss a finer point you are making, I understand that clang now never considers external template methods for inlining, which I understand is WAD? Even so, we could keep the ctor in the ABI and add 'inline', but then we still have the 'may call methods introduced later' ABI break.

_LIBCPP_HAS_INIT_LONG_IN_DYLIB

I have 2 main concerns here

  1. this requires that vendors / clients have full control over all their dependencies and used dylib versions, which are defined at compilation time with little control at deployment and runtime. If I use some 'foo.so' from some other team or vendor, then this has repercussions on which libc++ I use or deploy. We basically move the 'you may break ABI' decision elsewhere in a 'per case' basis?
  1. I have a good couple similar small inlining opportunities which all run into the same ABI dillema, I don't think long list of method ABI inclusions are desirable, and I am concerned this 'selectively pick you ABI compatibility is a safe long term option. I really like the idea that you have 'stable' (and no surprises) or 'unstable', and you manage your full build , dependency and deployment story (which does apply to Google).

This seems like a clean solution where libc++ ends up being optimal in all cases where it can be. This also introduces a structured way of making this sort of change

I think we are on similar paths on 'allow this optimization where possible / safe'. Perhaps we divert on what 'safe' is in this context.

My 'safety' concern scenario would be some team X providing some foo.so to various other teams. When does team X safely turn this __INIT_LONG on? It can't safely do this IMHO, we basically make it a risk factor for team X, and they may not even realize the implications. This is only safe in the context of an explicit LIBCPP version, or using an explicit 'unstable' libc++.

ldionne added inline comments.Oct 16 2019, 8:50 AM
libcxx/include/__config
773

These defines are primarily intended to say 'exclude from ABI yes/no for Stable/Unstable.

My main problem with using _LIBCPP_ABI_UNSTABLE as the _sole_ determinant for whether to perform these optimizations is that most vendors are not on the unstable ABI. _LIBCPP_ABI_UNSTABLE is a bigger gun than we need to use here, and the result is that many users that _could_ get the optimization will not, simply because they are on the stable ABI.

I have a good couple similar small inlining opportunities which all run into the same ABI dillema

If they are all semantically related, I think it makes sense to hide them behind the same macro (kind of like _LIBCPP_DO_NOT_ASSUME_STREAMS_EXPLICIT_INSTANTIATION_IN_DYLIB, which doesn't list each and every stream method).

I also think that class-wide explicit instantiations are somewhat greedy in that they end up including a lot more things in the dylib than we might want. For example, do we really mean to explicitly instantiate std::basic_string<wchar_t>::__init_long() in the dylib, or would the one for std::basic_string<char> be sufficient? I guess the point I'm trying to make is that everything we put in the dylib should be well reasoned, and as a corollary there should be few things in the dylib. So there should also be only few macros like _LIBCPP_HAS_INIT_LONG_IN_DYLIB that need to exist.

I think we are on similar paths on 'allow this optimization where possible / safe'. Perhaps we divert on what 'safe' is in this context.

I think we might be very close to an agreement, but perhaps I did not properly explain the solution I propose. So, basically, what I'm saying is to make the code work whether __init_long() is in the dylib or not, and don't tie that decision to whether we're using the unstable ABI. Then, through the -target option (or -mmacosx-version-min and similar switches), the user tells the compiler which target they want to deploy to. The compiler sets some macros based on that, and libc++ checks those macros to see whether the dylib on the desired target contains the required functionality. If no information about the deployment target is known (e.g. for vendors that don't collect that information), we can assume that the functionality isn't there to be conservative. If we're using the unstable ABI, on the other hand, we can assume that the dylib contains the required functionality regardless of the deployment target.

This all happens in the headers on the user side, i.e. when building an application that needs to link against libc++.dylib. This is, in essence, a superset of what you're proposing with the stable/unstable ABI solution, but it allows the majority of users who (1) are using the stable ABI but (2) only need to deploy to recent targets that contain __init_long() to benefit from the optimization.

If the default is to be conservative (and not assume __init_long in the dylib) unless we have enough information to tell otherwise, I don't see how my approach can be unsafe (but perhaps I missed something).

mvels marked an inline comment as done.Oct 16 2019, 10:19 AM
mvels added inline comments.
libcxx/include/__config
773

_LIBCPP_ABI_UNSTABLE is a bigger gun than we need to use here, and the result is that many users that _could_ get the optimization will not, simply because they are on the stable ABI.

I agree it is a very big gun. This comes from my understanding that the ABI is a 'big thing', but I'll happily defer to feature based enabling, (i.e., selective ABI breaking)

Re naming, I would personally find 'use optimized ctor' the feature to select, __init_long() is a requirement for that, I don't see it directly as 'rthe feature' ?

I.e.:

// Inline copy constructor code for small strings.
// Requires the presence of '__init_long' in lib/ dylib
#define _LIBCPP_ABI_INLINE_SMALL_COPY_CTOR

versus

// Assume the presence of __init_long in the library / dylib, allowing for 
// SSO optimizations and inlining of the copy constructor.
#define _LIBCPP_ABI_INLINE_SMALL_COPY_CTOR

I could be at peace with both, my discomfort with the latter is that it does not tell me directly what we are actually turning on. One further complication is where we have dual requirements. For example, some other optimization requires A and B, meaning, enabling A and B as separate features triggering the real feature we actually want.

And then there is the fact that clang aggressively inlines non ABI functions, which is undesirable. Which is why I crafted the changes here to have __init_long() only inline if the ctor is in the ABI to avoid expanding the size of inlined code.

I am leaning towards '_LIBCPP_ABI_INLINE_SMALL_COPY_CTOR' (and subsequent small ABI breakages) in that case. Alternatively, name these '_LIBCPP_ABI_INLINE_SSO_OPERATIONS' which would cover current and future optimizations. Although this come close to 'allow any ABI breakage, now and future, i.e. unstable).

Condensed: is there use case for people want 'some unstable, but not all'? I honestly don't know an answer there :) This is a wide topic, I very much appreciate your feedback on this.

I added alternaive DIFF in https://reviews.llvm.org/D69061

This works for me, even as I am not all that excited about the readability / maintainability of the ifdef / endif constructs.

mvels updated this revision to Diff 225750.Oct 19 2019, 3:07 AM

Adapt per feature ABI breakage for this change

mvels marked 4 inline comments as done.Oct 19 2019, 3:10 AM
mvels added inline comments.
libcxx/include/__config
773

PTAL

mvels added a comment.Nov 25 2019, 7:12 AM

Following discussions with ericwf & ldionne, we be upstreaming these changes incrementally and address the ABI separately

mvels abandoned this revision.Nov 25 2019, 10:20 AM