This is an archive of the discontinued LLVM Phabricator instance.

[libcxx] Improving std::vector<char> and std::deque<char> perfomance
AbandonedPublic

Authored by danlark on Mar 23 2018, 3:48 AM.

Details

Summary

Consider the following code.

#include <memory>
#include <vector>

class TestClass {
public:
    TestClass(size_t size)
        : Data(size)
    {
    }
private:
    std::vector<char> Data;
};

int main(void) {
    std::unique_ptr<TestClass> test;
    for (int i = 0; i < 100000; ++i)
        test.reset(new TestClass(0x10000));
    return 0;
}

For clang 5.0.1 it works for 14sec on my laptop. If you replace char by short it becomes 35 times faster(wow). The main difference in the generated code that for char no memset is called inside __construct_at_end function.

By manipulating a local variable in the loop, this lets it be fully optimized away.

Prior to this change, this would be generated (on x86-64):

51,79c58,66
<   movq  %rax, 8(%rbx)
<   movq  %rax, (%rbx)
<   movq  %rax, %rcx
<   addq  $65536, %rcx            # imm = 0x10000
<   movq  %rcx, 16(%rbx)
<   movq  $-65536, %rcx           # imm = 0xFFFFFFFFFFFF0000
<   .align  16, 0x90
< .LBB0_4:                                #   Parent Loop BB0_1 Depth=1
<                                         # =>  This Inner Loop Header: Depth=2
<   movb  $0, (%rax)
<   movq  8(%rbx), %rax
<   leaq  1(%rax), %rdx
<   movq  %rdx, 8(%rbx)
<   movb  $0, 1(%rax)
<   movq  8(%rbx), %rax
<   leaq  1(%rax), %rdx
<   movq  %rdx, 8(%rbx)
<   movb  $0, 1(%rax)
<   movq  8(%rbx), %rax
<   leaq  1(%rax), %rdx
<   movq  %rdx, 8(%rbx)
<   movb  $0, 1(%rax)
<   movq  8(%rbx), %rax
<   incq  %rax
<   movq  %rax, 8(%rbx)
<   addq  $4, %rcx
<   jne  .LBB0_4
< # BB#5:                                 # %_ZN9TestClassC2Em.exit
<                                         #   in Loop: Header=BB0_1 Depth=1
---
>   movq  %rax, (%r12)
>   movq  %rax, %rbx
>   addq  $65536, %rbx            # imm = 0x10000
>   movq  %rbx, 16(%r12)
>   xorl  %esi, %esi
>   movl  $65536, %edx            # imm = 0x10000
>   movq  %rax, %rdi
>   callq  memset
>   movq  %rbx, 8(%r12)
81,82c68,69

Diff Detail

Repository
rCXX libc++

Event Timeline

danlark created this revision.Mar 23 2018, 3:48 AM
lichray accepted this revision.Mar 25 2018, 12:22 AM
lichray added a subscriber: lichray.

Something std::byte couldn't help lol

This revision is now accepted and ready to land.Mar 25 2018, 12:22 AM
mclow.lists requested changes to this revision.Mar 26 2018, 9:59 AM

Please don't commit this.

libcxx/trunk/include/__split_buffer
201

I have been asked specifically by the optimizer folks to NOT do things like this in libc++, but rather to file bugs against the optimizer.

And I have done so for this exact case: https://bugs.llvm.org/show_bug.cgi?id=35637

This revision now requires changes to proceed.Mar 26 2018, 9:59 AM
lichray added inline comments.Mar 26 2018, 1:03 PM
libcxx/trunk/include/__split_buffer
201

From the thread I didn't see that the compiler side asked you not to do so.

And I disagree with the view. libc++ shouldn't *wait* for compilers, because we don't dictate users' compiler choices. This change doesn't make libc++ worse to coming compilers, and makes libc++ better on existing compilers, so what benefit we get by not approving this?

danlark added inline comments.Jul 18 2018, 4:31 AM
libcxx/trunk/include/__split_buffer
201

So, what is the status? Are we waiting for the compiler code-gen fix?

At Yandex we are using patched version like half a year or more.

https://github.com/catboost/catboost/blob/master/contrib/libs/cxxsupp/libcxx/include/vector#L995

I just tried this (on Compiler Explorer) using LLVM 7, and the code for my original test in https://bugs.llvm.org/show_bug.cgi?id=35637 is now optimal.
Looking briefly at your test case, it seems to be fixed now too. Can you confirm or disprove, please?

I just tried this (on Compiler Explorer) using LLVM 7, and the code for my original test in https://bugs.llvm.org/show_bug.cgi?id=35637 is now optimal.
Looking briefly at your test case, it seems to be fixed now too. Can you confirm or disprove, please?

I don't see that it was fixed, f2 is clearing byte still

https://gcc.godbolt.org/z/kYuXJQ

I just tried this (on Compiler Explorer) using LLVM 7, and the code for my original test in https://bugs.llvm.org/show_bug.cgi?id=35637 is now optimal.

Pilot error - it's still the same.

I just tried this (on Compiler Explorer) using LLVM 7, and the code for my original test in https://bugs.llvm.org/show_bug.cgi?id=35637 is now optimal.

Pilot error - it's still the same.

Not too surprising, i don't recall any of the memset llvm differentials progressing.
Though i think there was a consensus in some one of them that the transform itself is valid.
Not sure which patch was the closest one to completion, or what it would take though.

hiraditya added inline comments.May 27 2020, 2:31 PM
libcxx/trunk/include/__split_buffer
201

It would be great to get this patch in. Waiting for compiler for this optimization seems overkill.

danlark marked an inline comment as done.May 27 2020, 2:46 PM
danlark added inline comments.
libcxx/trunk/include/__split_buffer
201

It was separately submitted by the libcxx mainterner in July 2019 -- https://reviews.llvm.org/rL367183

ldionne added inline comments.May 28 2020, 6:38 AM
libcxx/trunk/include/__split_buffer
201

Did https://reviews.llvm.org/rL367183 fix your problem? If so, let's abandon this patch and you can remove your downstream patch as well.

danlark abandoned this revision.May 28 2020, 7:07 AM