This is an archive of the discontinued LLVM Phabricator instance.

Differential D44823

[libcxx] Improving std::vector<char> and std::deque<char> performance
Abandoned, Public

Authored by danlark on Mar 23 2018, 3:48 AM.

Details

Reviewers

EricWF
mclow.lists
lichray

Summary

Consider the following code.

#include <cstddef>
#include <memory>
#include <vector>

class TestClass {
public:
    TestClass(size_t size)
        : Data(size)
    {
    }
private:
    std::vector<char> Data;
};

int main(void) {
    std::unique_ptr<TestClass> test;
    for (int i = 0; i < 100000; ++i)
        test.reset(new TestClass(0x10000));
    return 0;
}

With clang 5.0.1, this takes 14 seconds to run on my laptop. If you replace char with short, it becomes 35 times faster (wow). The main difference in the generated code is that, for char, no memset is called inside the __construct_at_end function.
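To reproduce the char versus short comparison, the test case can be parameterized on the element type and timed. The following is a minimal sketch, not part of the patch: the template parameter and the helper name seconds_for_resets are hypothetical, and the numbers will depend on the compiler, standard library version, and optimization flags.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>
#include <vector>

// Same shape as the TestClass above, but templated on the element type
// so that char and short can be compared (the template is an assumption).
template <class T>
class TestClass {
public:
    explicit TestClass(std::size_t size)
        : Data(size)
    {
    }
private:
    std::vector<T> Data;
};

// Hypothetical helper: runs the reset loop and returns elapsed seconds.
template <class T>
double seconds_for_resets(int iterations, std::size_t size) {
    assert(iterations >= 0);
    std::unique_ptr<TestClass<T>> test;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        test.reset(new TestClass<T>(size));
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling seconds_for_resets<char>(100000, 0x10000) and seconds_for_resets<short>(100000, 0x10000) mirrors the loop above. Note that the short variant zeroes twice as many bytes, since the element count stays the same.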

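This patch manipulates a local variable in the loop instead of updating the vector's end pointer in memory on every iteration, which lets the compiler optimize the whole loop into a single memset call.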
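The idea can be sketched outside of libc++ as follows. This is an illustration under stated assumptions, not the code from the patch: ByteBuffer and its member names are hypothetical stand-ins for a vector-like container. The likely reason the member version is slow for char is that a store through a char pointer may alias any object, including the end pointer itself, so the compiler has to reload and store it on each iteration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical vector-like buffer; not libc++'s actual implementation.
struct ByteBuffer {
    char* begin_;
    char* end_;
    char* cap_;

    explicit ByteBuffer(std::size_t capacity)
        : begin_(new char[capacity]), end_(begin_), cap_(begin_ + capacity) {}
    ~ByteBuffer() { delete[] begin_; }
    ByteBuffer(const ByteBuffer&) = delete;
    ByteBuffer& operator=(const ByteBuffer&) = delete;

    std::size_t size() const { return static_cast<std::size_t>(end_ - begin_); }
    std::size_t room() const { return static_cast<std::size_t>(cap_ - end_); }

    // Updates the member on every iteration. The char store may alias end_,
    // so the compiler is typically forced to reload end_ each time.
    void append_zeros_member(std::size_t n) {
        assert(n <= room());
        for (; n > 0; --n) {
            *end_ = 0;
            ++end_;
        }
    }

    // Works on a local copy and writes the member back once. The local
    // pointer cannot be modified by the stores, so the loop is a plain fill
    // that an optimizer can recognize as memset.
    void append_zeros_local(std::size_t n) {
        assert(n <= room());
        char* p = end_;
        for (; n > 0; --n, ++p)
            *p = 0;
        end_ = p;
    }
};
```

Both functions leave the buffer in the same state; only the shape of the loop differs, which is what the assembly diff in this summary reflects.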

The diff below compares the generated x86-64 code: lines marked < were generated prior to this change, lines marked > are generated with it.

51,79c58,66
<   movq  %rax, 8(%rbx)
<   movq  %rax, (%rbx)
<   movq  %rax, %rcx
<   addq  $65536, %rcx            # imm = 0x10000
<   movq  %rcx, 16(%rbx)
<   movq  $-65536, %rcx           # imm = 0xFFFFFFFFFFFF0000
<   .align  16, 0x90
< .LBB0_4:                                #   Parent Loop BB0_1 Depth=1
<                                         # =>  This Inner Loop Header: Depth=2
<   movb  $0, (%rax)
<   movq  8(%rbx), %rax
<   leaq  1(%rax), %rdx
<   movq  %rdx, 8(%rbx)
<   movb  $0, 1(%rax)
<   movq  8(%rbx), %rax
<   leaq  1(%rax), %rdx
<   movq  %rdx, 8(%rbx)
<   movb  $0, 1(%rax)
<   movq  8(%rbx), %rax
<   leaq  1(%rax), %rdx
<   movq  %rdx, 8(%rbx)
<   movb  $0, 1(%rax)
<   movq  8(%rbx), %rax
<   incq  %rax
<   movq  %rax, 8(%rbx)
<   addq  $4, %rcx
<   jne  .LBB0_4
< # BB#5:                                 # %_ZN9TestClassC2Em.exit
<                                         #   in Loop: Header=BB0_1 Depth=1
---
>   movq  %rax, (%r12)
>   movq  %rax, %rbx
>   addq  $65536, %rbx            # imm = 0x10000
>   movq  %rbx, 16(%r12)
>   xorl  %esi, %esi
>   movl  $65536, %edx            # imm = 0x10000
>   movq  %rax, %rdi
>   callq  memset
>   movq  %rbx, 8(%r12)
81,82c68,69

Diff Detail

Repository: rCXX libc++

Event Timeline

danlark created this revision. Mar 23 2018, 3:48 AM

Herald added subscribers: cfe-commits, christof. Mar 23 2018, 3:48 AM

Something std::byte couldn't help lol

This revision is now accepted and ready to land. Mar 25 2018, 12:22 AM

Please don't commit this.

libcxx/trunk/include/__split_buffer
201	I have been asked specifically by the optimizer folks to NOT do things like this in libc++, but rather to file bugs against the optimizer. And I have done so for this exact case: https://bugs.llvm.org/show_bug.cgi?id=35637

This revision now requires changes to proceed. Mar 26 2018, 9:59 AM

lichray added inline comments. Mar 26 2018, 1:03 PM

libcxx/trunk/include/__split_buffer
201	From the thread, I didn't see that the compiler side asked you not to do so, and I disagree with that view. libc++ shouldn't wait for compilers, because we don't dictate users' compiler choices. This change doesn't make libc++ worse for future compilers, and it makes libc++ better on existing compilers, so what benefit do we get by not approving this?

danlark added inline comments. Jul 18 2018, 4:31 AM

libcxx/trunk/include/__split_buffer
201	So, what is the status? Are we waiting for the compiler code-gen fix? At Yandex we have been using a patched version for half a year or more. https://github.com/catboost/catboost/blob/master/contrib/libs/cxxsupp/libcxx/include/vector#L995

Herald added a subscriber: ldionne. Jul 18 2018, 4:31 AM

I just tried this (on Compiler Explorer) using LLVM 7, and the code for my original test in https://bugs.llvm.org/show_bug.cgi?id=35637 is now optimal.
Looking briefly at your test case, it seems to be fixed now too. Can you confirm or disprove, please?

Herald added a subscriber: libcxx-commits. Jan 29 2019, 9:18 AM

In D44823#1375590, @mclow.lists wrote:

I just tried this (on Compiler Explorer) using LLVM 7, and the code for my original test in https://bugs.llvm.org/show_bug.cgi?id=35637 is now optimal.
Looking briefly at your test case, it seems to be fixed now too. Can you confirm or disprove, please?

I don't see that it was fixed; f2 is still clearing byte by byte.

https://gcc.godbolt.org/z/kYuXJQ

In D44823#1375590, @mclow.lists wrote:

I just tried this (on Compiler Explorer) using LLVM 7, and the code for my original test in https://bugs.llvm.org/show_bug.cgi?id=35637 is now optimal.

Pilot error - it's still the same.

In D44823#1375654, @mclow.lists wrote:

In D44823#1375590, @mclow.lists wrote:

I just tried this (on Compiler Explorer) using LLVM 7, and the code for my original test in https://bugs.llvm.org/show_bug.cgi?id=35637 is now optimal.

Pilot error - it's still the same.

Not too surprising, I don't recall any of the memset LLVM differentials progressing.
Though I think there was a consensus in one of them that the transform itself is valid.
Not sure which patch was the closest to completion, or what it would take, though.

hiraditya added a subscriber: hiraditya. Apr 26 2020, 8:36 PM

hiraditya added inline comments. May 27 2020, 2:31 PM

libcxx/trunk/include/__split_buffer
201	It would be great to get this patch in. Waiting for the compiler to perform this optimization seems like overkill.

danlark marked an inline comment as done. May 27 2020, 2:46 PM

danlark added inline comments.

libcxx/trunk/include/__split_buffer
201	It was separately submitted by the libcxx maintainer in July 2019 -- https://reviews.llvm.org/rL367183

ldionne added inline comments. May 28 2020, 6:38 AM

libcxx/trunk/include/__split_buffer
201	Did https://reviews.llvm.org/rL367183 fix your problem? If so, let's abandon this patch and you can remove your downstream patch as well.

danlark abandoned this revision. May 28 2020, 7:07 AM
