This is an archive of the discontinued LLVM Phabricator instance.

Allow the compiler to optimize `string == "literal string"`.
Accepted · Public

Authored by sbenza on Mar 29 2019, 8:45 AM.

Details

Reviewers
EricWF
Summary

The compiler needs to be able to inline the call to compare for the
optimizer to do its job here. When properly inlined, the optimizer can
elide the call to memcmp and do a hardcoded comparison both for the size
and the contents of the string.
It dramatically simplifies the generated code for things like x == ""
and x == "012345678".
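
As an illustration only (hypothetical helper functions, not part of the patch), this is roughly what the optimizer can reduce the comparisons to once the size check and the character comparison are visible to it:

// Hypothetical sketch of the effective result after inlining; not libc++ code.
#include <cstring>
#include <string>

bool eq_empty(const std::string& s) {
    // x == "" needs no memcmp at all: only the size has to be zero.
    return s.size() == 0;
}

bool eq_abc(const std::string& s) {
    // x == "abc" becomes a size check plus a small fixed-length comparison
    // that the compiler can expand inline.
    return s.size() == 3 && std::memcmp(s.data(), "abc", 3) == 0;
}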

Improves the BM_StringRelationalLiteral_Eq_Small_* benchmarks.

Event Timeline

sbenza created this revision.Mar 29 2019, 8:45 AM

Passing-by remark: is there before/after LLVM IR?
Maybe the problem is obvious?

I don't know how much this will improve performance (there should be, as Roman says, before/after IR to verify this), but I'm satisfied with the correctness of this patch.

So, Roman and I did some playing around with Compiler Explorer, and we discovered that if you define _LIBCPP_DISABLE_EXTERN_TEMPLATE, so that you're not using the string code that's in the dylib, then you get the benefits of this patch without any code changes.

The reason this doesn't get inlined is that basic_string<char, char_traits<char>, allocator<char>> (and also the wchar_t variant) is instantiated in the dylib.
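
For reference, a minimal sketch of that workaround (assuming the macro is defined before any libc++ header is included):

// Sketch: opt out of the dylib's extern template instantiations so the
// string member functions are instantiated, and therefore inlinable, in
// this translation unit.
#define _LIBCPP_DISABLE_EXTERN_TEMPLATE
#include <string>

bool is_empty(const std::string& s) {
    return s == "";  // compare can now be inlined and folded
}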

This code replicates the logic for operator==, rather than keeping it in one place. That makes me mildly uncomfortable.

Maybe the generated code would be better if basic_string::compare were inlined, but the compiler chose not to do it. char_traits::compare, on the other hand, is much smaller and easier to inline.
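
For comparison, char_traits<char>::compare is essentially a length-guarded memcmp. A simplified sketch (not the exact libc++ source):

// Simplified sketch of what char_traits<char>::compare boils down to;
// the real libc++ implementation differs in details (constexpr handling, etc.).
#include <cstddef>
#include <cstring>

int traits_compare(const char* s1, const char* s2, std::size_t n) {
    if (n == 0)
        return 0;  // guard so memcmp is never called with null pointers
    return std::memcmp(s1, s2, n);
}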

IR examples before and after this change: https://gcc.godbolt.org/z/0N90W2

For something like str == "".
Before:

define dso_local zeroext i1 @_Z5EmptyRKNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE(%"class.std::__1::basic_string"* dereferenceable(24)) local_unnamed_addr #0 personality i32 (...)* @__gxx_personality_v0 !dbg !918 {
  call void @llvm.dbg.value(metadata %"class.std::__1::basic_string"* %0, metadata !927, metadata !DIExpression()), !dbg !928
  call void @llvm.dbg.value(metadata %"class.std::__1::basic_string"* %0, metadata !929, metadata !DIExpression()) #4, !dbg !1031
  call void @llvm.dbg.value(metadata i8* getelementptr inbounds ([1 x i8], [1 x i8]* @.str, i64 0, i64 0), metadata !936, metadata !DIExpression()) #4, !dbg !1033
  call void @llvm.dbg.value(metadata i64 0, metadata !937, metadata !DIExpression()) #4, !dbg !1034
  call void @llvm.dbg.value(metadata %"class.std::__1::basic_string"* %0, metadata !1035, metadata !DIExpression()) #4, !dbg !1152
  call void @llvm.dbg.value(metadata %"class.std::__1::basic_string"* %0, metadata !1155, metadata !DIExpression()) #4, !dbg !1161
  %2 = bitcast %"class.std::__1::basic_string"* %0 to i8*, !dbg !1163
  %3 = load i8, i8* %2, align 8, !dbg !1163, !tbaa !1164
  %4 = and i8 %3, 1, !dbg !1167
  %5 = icmp eq i8 %4, 0, !dbg !1168
  %6 = getelementptr inbounds %"class.std::__1::basic_string", %"class.std::__1::basic_string"* %0, i64 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 1, !dbg !1169
  %7 = load i64, i64* %6, align 8, !dbg !1169
  %8 = lshr i8 %3, 1, !dbg !1169
  %9 = zext i8 %8 to i64, !dbg !1169
  %10 = select i1 %5, i64 %9, i64 %7, !dbg !1169
  %11 = icmp eq i64 %10, 0, !dbg !1170
  br i1 %11, label %12, label %19, !dbg !1171

12:                                               ; preds = %1
  %13 = invoke i32 @_ZNKSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE7compareEmmPKcm(%"class.std::__1::basic_string"* nonnull %0, i64 0, i64 -1, i8* getelementptr inbounds ([1 x i8], [1 x i8]* @.str, i64 0, i64 0), i64 0)
          to label %14 unwind label %16, !dbg !1172

14:                                               ; preds = %12
  %15 = icmp eq i32 %13, 0, !dbg !1173
  br label %19, !dbg !1174

16:                                               ; preds = %12
  %17 = landingpad { i8*, i32 }
          catch i8* null, !dbg !1175
  %18 = extractvalue { i8*, i32 } %17, 0, !dbg !1175
  tail call void @__clang_call_terminate(i8* %18) #5, !dbg !1175
  unreachable, !dbg !1175

19:                                               ; preds = %1, %14
  %20 = phi i1 [ %15, %14 ], [ false, %1 ], !dbg !1176
  ret i1 %20, !dbg !1177
}

After:

define dso_local zeroext i1 @_Z5EmptyRKNSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEEE(%"class.std::__1::basic_string"* nocapture readonly dereferenceable(24)) local_unnamed_addr #0 !dbg !918 {
  call void @llvm.dbg.value(metadata %"class.std::__1::basic_string"* %0, metadata !927, metadata !DIExpression()), !dbg !928
  call void @llvm.dbg.value(metadata %"class.std::__1::basic_string"* %0, metadata !929, metadata !DIExpression()) #3, !dbg !1031
  call void @llvm.dbg.value(metadata i64 0, metadata !937, metadata !DIExpression()) #3, !dbg !1033
  call void @llvm.dbg.value(metadata %"class.std::__1::basic_string"* %0, metadata !1034, metadata !DIExpression()) #3, !dbg !1151
  call void @llvm.dbg.value(metadata %"class.std::__1::basic_string"* %0, metadata !1154, metadata !DIExpression()) #3, !dbg !1160
  %2 = bitcast %"class.std::__1::basic_string"* %0 to i8*, !dbg !1162
  %3 = load i8, i8* %2, align 8, !dbg !1162, !tbaa !1163
  %4 = and i8 %3, 1, !dbg !1166
  %5 = icmp eq i8 %4, 0, !dbg !1167
  %6 = getelementptr inbounds %"class.std::__1::basic_string", %"class.std::__1::basic_string"* %0, i64 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 1, !dbg !1168
  %7 = load i64, i64* %6, align 8, !dbg !1168
  %8 = lshr i8 %3, 1, !dbg !1168
  %9 = zext i8 %8 to i64, !dbg !1168
  %10 = select i1 %5, i64 %9, i64 %7, !dbg !1168
  %11 = icmp eq i64 %10, 0, !dbg !1169
  ret i1 %11, !dbg !1170
}

Other simple comparisons like str == "ABCDEFG" get better too (see the link above).

I'm adding a few more benchmarks in https://reviews.llvm.org/D59781 that show the improvement.

Thanks for debugging that. I didn't realize it didn't try to inline on purpose.
Maybe we just need to mark the function as inline, even if we also have the extern template instantiation.
That way the compiler can inline it if it helps, but won't actually generate the weak symbol in the object files otherwise.
Eg, something like:

--- a/libcxx/include/string
+++ b/libcxx/include/string
@@ -3709,7 +3709,7 @@ basic_string<_CharT, _Traits, _Allocator>::compare(const basic_string& __str) co
 }
 
 template <class _CharT, class _Traits, class _Allocator>
-int
+inline int
 basic_string<_CharT, _Traits, _Allocator>::compare(size_type __pos1,
                                                    size_type __n1,
                                                    const value_type* __s,

does the same job without requiring changes to the implementation.

Thanks for debugging that. I didn't realize it didn't try to inline on purpose.
Maybe we just need to mark the function as inline, even if we also have the extern template instantiation.

I don't think so, because then it would (might?) not end up in the dylib, and that would break *everything* that uses it.

sbenza updated this revision to Diff 197107.Apr 29 2019, 7:28 AM

Use inline to solve the problem instead of changing the implementation.

sbenza retitled this revision from Make it easier for the compiler to optimize `operator==(string,char*)`. to Allow the compiler to optimize `string == "literal string"`..Apr 29 2019, 7:29 AM
sbenza edited the summary of this revision. (Show Details)

Thanks for debugging that. I didn't realize it didn't try to inline on purpose.
Maybe we just need to mark the function as inline, even if we also have the extern template instantiation.

I don't think so, because then it would (might?) not end up in the dylib, and that would break *everything* that uses it.

What solution do you propose?
The status quo bloats and slows down callers for very common simple comparisons with literal strings.
Can we somehow mark it inline but force it to be in the library? Maybe with an explicit instantiation of that one function?
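
A rough, unverified sketch of the explicit-instantiation idea (whether this actually keeps the exported symbol once the member is marked inline would need to be checked):

// Unverified sketch: with the header declaring the member inline, an
// explicit instantiation definition of just that member in the library's
// own source file would, if it works as hoped, still emit the out-of-line
// symbol that the dylib currently exports.
#include <string>

template int std::basic_string<char>::compare(
    std::basic_string<char>::size_type, std::basic_string<char>::size_type,
    const char*, std::basic_string<char>::size_type) const;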

sbenza updated this revision to Diff 198284.May 6 2019, 8:16 AM

Another option for the optimization.

sbenza added a comment.May 6 2019, 8:24 AM

As an example of the code reduction:

str == "":

Before the change

Dump of assembler code for function StringEqCStrLiteralEmpty(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&):
   0x0000000000407a10 <+0>:	push   %rax
   0x0000000000407a11 <+1>:	movzbl (%rdi),%eax
   0x0000000000407a14 <+4>:	test   $0x1,%al
   0x0000000000407a16 <+6>:	je     0x407a25 <StringEqCStrLiteralEmpty(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+21>
   0x0000000000407a18 <+8>:	mov    0x8(%rdi),%rax
   0x0000000000407a1c <+12>:	test   %rax,%rax
   0x0000000000407a1f <+15>:	je     0x407a2d <StringEqCStrLiteralEmpty(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+29>
   0x0000000000407a21 <+17>:	xor    %eax,%eax
   0x0000000000407a23 <+19>:	pop    %rcx
   0x0000000000407a24 <+20>:	retq   
   0x0000000000407a25 <+21>:	shr    %rax
   0x0000000000407a28 <+24>:	test   %rax,%rax
   0x0000000000407a2b <+27>:	jne    0x407a21 <StringEqCStrLiteralEmpty(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+17>
   0x0000000000407a2d <+29>:	lea    0x791ab(%rip),%rcx        # 0x480bdf
   0x0000000000407a34 <+36>:	xor    %esi,%esi
   0x0000000000407a36 <+38>:	mov    $0xffffffffffffffff,%rdx
   0x0000000000407a3d <+45>:	xor    %r8d,%r8d
   0x0000000000407a40 <+48>:	callq  0x405250 <_ZNKSt3__112basic_stringIcNS_11char_traitsIcEENS_9allocatorIcEEE7compareEmmPKcm@plt>
   0x0000000000407a45 <+53>:	test   %eax,%eax
   0x0000000000407a47 <+55>:	sete   %al
   0x0000000000407a4a <+58>:	pop    %rcx
   0x0000000000407a4b <+59>:	retq   
   0x0000000000407a4c <+60>:	mov    %rax,%rdi
   0x0000000000407a4f <+63>:	callq  0x408000 <__clang_call_terminate>
End of assembler dump.

After the change

Dump of assembler code for function StringEqCStrLiteralEmpty(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&):
   0x0000000000407a10 <+0>:	movzbl (%rdi),%eax
   0x0000000000407a13 <+3>:	test   $0x1,%al
   0x0000000000407a15 <+5>:	je     0x407a1d <StringEqCStrLiteralEmpty(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+13>
   0x0000000000407a17 <+7>:	mov    0x8(%rdi),%rax
   0x0000000000407a1b <+11>:	jmp    0x407a20 <StringEqCStrLiteralEmpty(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&)+16>
   0x0000000000407a1d <+13>:	shr    %rax
   0x0000000000407a20 <+16>:	test   %rax,%rax
   0x0000000000407a23 <+19>:	sete   %al
   0x0000000000407a26 <+22>:	retq   
End of assembler dump.
EricWF accepted this revision.May 7 2019, 4:56 PM

This LGTM. Thanks for all your hard work on this, Sam.

I verified the export lists don't change. I'll land this tomorrow morning once the inline comments have been addressed.

libcxx/include/string
3887

Can we add a brief comment that mentions this specialization is here to provide better code gen?

This revision is now accepted and ready to land.May 7 2019, 4:56 PM

I'm not happy with this. In particular, it does NOT call through the character traits, which it is required to do by the standard.

sbenza added a comment.May 8 2019, 7:49 AM

I'm not happy with this. In particular, it does NOT call through the character traits, which it is required to do by the standard.

Doesn't this fall under the as-if rule? There should not be a way for well-behaved code to know the difference. Users can't specialize char_traits<char>, and they can't specialize basic_string<char, char_traits<char>, allocator<char>>.

I can go through the char_traits, but char_traits::compare() has an extra branch that is unnecessary here. compare() checks that the size is not 0, which is useful to avoid calling memcmp with null pointers. But there is no possibility of null pointers here. The .data() will never return null, and the char* can't be null either (there is an assertion for that, and we passed it to strlen).

Can you offer a solution?
It is not good for the user to have all that code bloat and runtime cost when doing something as simple as == "".

I can go through the char_traits, but char_traits::compare() has an extra branch that is unnecessary here. compare() checks that the size is not 0, which is useful to avoid calling memcmp with null pointers. But there is no possibility of null pointers here. The .data() will never return null, and the char* can't be null either (there is an assertion for that, and we passed it to strlen).

Would applying some __builtin_assume calls in strategic locations let us promise to the optimizer that the inputs to compare are non-null and the length is non-zero?
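
For example, something along these lines (a hypothetical sketch with invented names, not the patch itself):

#include <cstddef>
#include <string>

// Hypothetical sketch: promise the optimizer the pointers are non-null so
// the null-pointer guard inside the traits compare can be folded away.
bool eq_literal(const std::string& s, const char* lit, std::size_t len) {
    if (s.size() != len)
        return false;
    const char* p = s.data();
    __builtin_assume(p != nullptr);    // .data() never returns null
    __builtin_assume(lit != nullptr);  // the literal pointer is never null
    return std::char_traits<char>::compare(p, lit, len) == 0;
}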

@EricWF asked me:

Can you restate your concern over bypassing char_traits?

  1. The real problem here is that the string code is externally instantiated in the dylib, and thus not available to be inlined. This patch in no way addresses that, merely works around it in this one particular case for one particular operation (equality comparison) on one particular kind of data (narrow character literals). I'm not really fond of solutions like that.
  2. There's been no attempt to show that this is a performance bottleneck; just assertions that "the status quo bloats and slows down callers for very common simple comparisons with literal strings". Improving a micro-benchmark may or may not have an effect on real-world code.
  3. The standard is quite clear here. The comparison calls string::compare, which calls char_traits::compare. With this patch, we're no longer doing that. This is probably allowed by the "as if" rule, but still seems "wrong" to me.

People who do not care about ABI stability could (should?) remove the external template instantiations, and get the benefits of inlining here (and in other calls). If someone were to do this and report back, we would have data to argue about whether or not this patch makes any difference to actual users.

@EricWF asked me:

Can you restate your concern over bypassing char_traits?

Thanks, Marshall.

  1. The real problem here is that the string code is externally instantiated in the dylib, and thus not available to be inlined. This patch in no way addresses that, merely works around it in this one particular case for one particular operation (equality comparison) on one particular kind of data (narrow character literals). I'm not really fond of solutions like that.

That's a potential part of the performance problem, but this change also addresses some other missed optimizations that only occur when LLVM sees the construct memcmp(...) == 0. When LLVM sees that, it knows it only has to detect whether the strings differ, not how they actually differ.
See https://github.com/llvm/llvm-project/blob/master/llvm/lib/Transforms/Utils/SimplifyLibCalls.cpp#L941-L943
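
A small illustration of that pattern (hypothetical function, not from the patch): when the result is only tested against zero, LLVM knows it merely has to detect whether the buffers differ, not how they order, and can lower the call to a cheaper equality check such as bcmp or inline loads.

#include <cstddef>
#include <cstring>

// memcmp(...) == 0: only equality matters, so the three-way comparison
// can be replaced with a cheaper "are the bytes equal" check.
bool same_bytes(const char* a, const char* b, std::size_t n) {
    return std::memcmp(a, b, n) == 0;
}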

  2. There's been no attempt to show that this is a performance bottleneck; just assertions that "the status quo bloats and slows down callers for very common simple comparisons with literal strings". Improving a micro-benchmark may or may not have an effect on real-world code.

Benchmarks demonstrating the improvement were previously checked in. Microbenchmarks are going to be the only easy way to demonstrate the performance difference, but we have confirmed the performance win on real-world code.

  3. The standard is quite clear here. The comparison calls string::compare, which calls char_traits::compare. With this patch, we're no longer doing that. This is probably allowed by the "as if" rule, but still seems "wrong" to me.

Right, this is allowed under the as-if rule. The wording also isn't an "Effects: Equivalent to" clause, so the standard only specifies what the return value should be, not how it's actually computed.
It's a bit unfortunate to duplicate code, but the performance wins seem worth it.

People who do not care about ABI stability could (should?) remove the external template instantiations, and get the benefits of inlining here (and in other calls). If someone were to do this and report back, we would have data to argue about whether or not this patch makes any difference to actual users.

As I mentioned earlier, this optimization is more than just allowing for compiler inlining. It's trying to get the memcmp -> bcmp optimization.

Hopefully this addresses your concerns. I plan to move this patch forward later today.