This is an archive of the discontinued LLVM Phabricator instance.

Speedup to_string for integers using zero-copy.
AbandonedPublic

Authored by Mordante on May 3 2021, 4:27 AM.

Details

Reviewers
Quuxplusone
Roman-Koshelev
Group Reviewers
Restricted Project

Diff Detail

Event Timeline

Roman-Koshelev requested review of this revision.May 3 2021, 4:27 AM
Roman-Koshelev created this revision.
Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2021, 4:27 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
grandinj added inline comments.
libcxx/src/string.cpp
445

just a drive-by commentator - why is the wstring version so different from the string version? Surely they should be practically the same?

Quuxplusone added inline comments.
libcxx/src/string.cpp
445

I infer it's because string().capacity() > bufsize, but wstring().capacity() < bufsize. (A default-initialized basic_string<Ch> has non-zero capacity because of the Small String Optimization.)
At the very least, line 429 should add _LIBCPP_ASSERT(buf.capacity() >= bufsize); in order to catch a failure of that invariant as early and predictably as possible.

I also wondered if the wstring version should use the same trick — buf.resize(bufsize) followed by buf.resize(res.ptr - buf.data()). But:

  • bufsize varies from 11 (for int) to 20 (for long long)
  • string().capacity() is 22
  • wstring().capacity() is 4 (i.e., 16 over sizeof(wchar_t))

So if hypothetically we did wstring buf(bufsize), that'd allocate between 44 and 80 bytes of heap every time, versus the current approach allocating between 0 and 80. (And in the case someone's stringifying a small number like "0" or "42", the current approach allocates 0 bytes, which is great.) So, having them look at least a bit different is good IMO.

448

/accomodate/accommodate/ while you're here (or, look out for a merge-conflict if I fix it in main first ;))

466

Do you have a benchmark showing the performance impact? I think you should try to quantify the impact somehow — after all, all you're saving is a very predictable branch and a single stack-to-stack memcpy, right? (OTOH, I don't see how this could possibly hurt performance.)

grandinj added inline comments.May 3 2021, 6:12 AM
libcxx/src/string.cpp
445

wchar is 2 bytes not 4,
but yes, I agree with the rest of your analysis, this approach means that we only need to allocate zero or once.
Thanks for explaining.

wchar is 2 bytes not 4,

On some platforms. On others, it is 4.

Roman-Koshelev added inline comments.May 4 2021, 9:10 AM
libcxx/src/string.cpp
445

Note that in wchar* it will not work to write a number with the to_chars function. Otherwise, it would be possible to do this optimization at least for types that fit into capacity()

466

No need to make copying where it is not needed.

Also

std::string buf;
buf.resize(buf.capacity());

with c++20 completely constexpr

Mordante requested changes to this revision.May 4 2021, 10:24 AM
Mordante added a subscriber: Mordante.

Thanks for your patch.
Please investigate the ARM build errors.

libcxx/src/string.cpp
429

Can you add an assert the buffer-size is large enough? You now depend on an implementation detail of std::string. Something like:
_LIBCPP_ASSERT(buf.size(), std::numeric_limits<V>::digits10 + 2, "Capacity is too small");
If somebody adds 128-bit support this will probably break.

466

+1 for a benchmark. The change that might degrade the performance are the resize calls.

This revision now requires changes to proceed.May 4 2021, 10:24 AM

As I said

std::string buf;
buf.resize(buf.capacity());

with C ++ 20 completely constexpr and equivalent

char buf[bufsize];

It is also obvious that

string(buf, res.ptr);

worse than

buf.resize(res.ptr - buf.data());
Roman-Koshelev added a comment.EditedMay 6 2021, 10:11 PM

Benchmarks

$ g++ -v

gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1)

$ g++ main.cpp -O3 -lbenchmark -std=c++20
$ ./a.out

2021-05-07T21:32:02+03:00
Running ./a.out
Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 12288 KiB (x4)
Load Average: 1.01, 0.76, 0.56
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
to_stringNewBench/1         501 ns          501 ns      1367384
to_stringNewBench/2         537 ns          537 ns      1264337
to_stringNewBench/3         788 ns          726 ns       992086
to_stringNewBench/4         824 ns          777 ns       929673
to_stringNewBench/5         981 ns          921 ns       785107
to_stringNewBench/6        1071 ns         1000 ns       725562
to_stringNewBench/7        1201 ns         1129 ns       644721
to_stringNewBench/8        1109 ns         1109 ns       626050
to_stringNewBench/9        1164 ns         1164 ns       472391
to_stringNewBench/10       1795 ns         1676 ns       440870
to_stringNewBench/11       1208 ns         1134 ns       644179
to_stringNewBench/12       1715 ns         1608 ns       444597
to_stringNewBench/13       1247 ns         1168 ns       650858
to_stringNewBench/14       1121 ns         1046 ns       652401
to_stringNewBench/15       1784 ns         1640 ns       413110
to_stringNewBench/16       1065 ns         1064 ns       619175
to_stringOldBench/1         308 ns          288 ns      2437366
to_stringOldBench/2         697 ns          697 ns      1022340
to_stringOldBench/3         792 ns          791 ns       694918
to_stringOldBench/4         938 ns          881 ns       818895
to_stringOldBench/5        1107 ns         1032 ns       704530
to_stringOldBench/6        1529 ns         1435 ns       505306
to_stringOldBench/7        1346 ns         1265 ns       579740
to_stringOldBench/8        1559 ns         1456 ns       501307
to_stringOldBench/9        1701 ns         1567 ns       461477
to_stringOldBench/10       1733 ns         1642 ns       430777
to_stringOldBench/11       4683 ns         4378 ns       161100
to_stringOldBench/12       2030 ns         1884 ns       368859
to_stringOldBench/13       5351 ns         5019 ns       142119
to_stringOldBench/14       4544 ns         4541 ns       156984
to_stringOldBench/15       1807 ns         1806 ns       327187
to_stringOldBench/16       5348 ns         4958 ns       138795

$ clang++ -v

Ubuntu clang version 12.0.0-1ubuntu1

$ clang++ main.cpp -O3 -lbenchmark -std=c++20
$ ./a.out

2021-05-07T21:31:28+03:00
Running ./a.out
Run on (4 X 3000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 12288 KiB (x4)
Load Average: 0.70, 0.67, 0.53
---------------------------------------------------------------
Benchmark                     Time             CPU   Iterations
---------------------------------------------------------------
to_stringNewBench/1         489 ns          489 ns      1374672
to_stringNewBench/2         615 ns          570 ns      1266976
to_stringNewBench/3         783 ns          733 ns      1010880
to_stringNewBench/4         787 ns          787 ns       877548
to_stringNewBench/5         769 ns          768 ns       806926
to_stringNewBench/6        1022 ns          959 ns       759196
to_stringNewBench/7        1002 ns          926 ns       764341
to_stringNewBench/8        1051 ns          986 ns       719005
to_stringNewBench/9        1175 ns         1108 ns       658466
to_stringNewBench/10       1437 ns         1334 ns       548946
to_stringNewBench/11        894 ns          894 ns       790186
to_stringNewBench/12       1295 ns         1294 ns       438192
to_stringNewBench/13       1044 ns          981 ns       770102
to_stringNewBench/14        939 ns          853 ns       766173
to_stringNewBench/15       1466 ns         1383 ns       507655
to_stringNewBench/16        957 ns          901 ns       816543
to_stringOldBench/1         398 ns          372 ns      1874514
to_stringOldBench/2         916 ns          851 ns       835852
to_stringOldBench/3         961 ns          904 ns       790508
to_stringOldBench/4         911 ns          911 ns       754365
to_stringOldBench/5        1030 ns         1029 ns       521805
to_stringOldBench/6        1293 ns         1236 ns       581640
to_stringOldBench/7        1272 ns         1193 ns       608069
to_stringOldBench/8        1363 ns         1270 ns       573086
to_stringOldBench/9        1494 ns         1399 ns       521500
to_stringOldBench/10       1754 ns         1644 ns       443738
to_stringOldBench/11       4812 ns         4517 ns       164884
to_stringOldBench/12       1810 ns         1681 ns       441442
to_stringOldBench/13       4956 ns         4643 ns       164099
to_stringOldBench/14       4612 ns         4292 ns       163186
to_stringOldBench/15       1657 ns         1656 ns       431159
to_stringOldBench/16       4795 ns         4484 ns       160867
Roman-Koshelev added a comment.EditedMay 6 2021, 10:12 PM

Source

#include <charconv>
#include <random>
#include <string>

#include <benchmark/benchmark.h>

template <typename V>
std::string i_to_stringN(const V v)
{
    std::string buf;
    buf.resize(buf.capacity());
    const auto res = std::to_chars(buf.data(), buf.data() + buf.size(), v);
    buf.resize(res.ptr - buf.data());
    return buf;
}

std::string  to_stringN (int val)                { return i_to_stringN(val); }
std::string  to_stringN (long val)               { return i_to_stringN(val); }
std::string  to_stringN (long long val)          { return i_to_stringN(val); }
std::string  to_stringN (unsigned val)           { return i_to_stringN(val); }
std::string  to_stringN (unsigned long val)      { return i_to_stringN(val); }
std::string  to_stringN (unsigned long long val) { return i_to_stringN(val); }

template <typename V>
std::string i_to_stringO(const V v)
{
    constexpr size_t bufsize = std::numeric_limits<V>::digits10 + 2;
    char buf[bufsize];
    const auto res = std::to_chars(buf, buf + bufsize, v);
    return std::string(buf, res.ptr);
}

std::string  to_stringO (int val)                { return i_to_stringO(val); }
std::string  to_stringO (long val)               { return i_to_stringO(val); }
std::string  to_stringO (long long val)          { return i_to_stringO(val); }
std::string  to_stringO (unsigned val)           { return i_to_stringO(val); }
std::string  to_stringO (unsigned long val)      { return i_to_stringO(val); }
std::string  to_stringO (unsigned long long val) { return i_to_stringO(val); }

template <unsigned N> struct make_mem {
  using type = std::conditional_t<N <= 9, std::uint32_t, std::uint64_t>;
};

template <unsigned N> using make_mem_t = typename make_mem<N>::type;

std::random_device rd;
std::mt19937 gen(rd());

constexpr unsigned pow(unsigned base, unsigned exp)
{
    unsigned res = 1;
    for (unsigned i = 0; i < exp; ++i) {
        res *= base;
    }
    return res;
}

template<unsigned N, bool IsNew>
static void to_stringD(benchmark::State& state) {
    using Ty = make_mem_t<N>;
    std::uniform_int_distribution<Ty> distrib(pow(10, N-1), pow(10, N)-1);
    std::vector<Ty> vec;
    vec.reserve(128);
    for(int i=0; i<128; ++i) {
        vec.push_back(distrib(gen));
    }
    for (auto _ : state) {
        for(Ty i : vec) {
            if constexpr (IsNew) {
                std::string res = to_stringN(i);
                benchmark::DoNotOptimize(res);
            } else {
                std::string res = to_stringO(i);
                benchmark::DoNotOptimize(res);
            }
        }
    }
}

template<bool IsNew>
struct to_stringBench {

    static void run(benchmark::State& state) {
        switch(state.range(0)) {
            case 1: { to_stringD<1, IsNew>(state); return; }
            case 2: { to_stringD<2, IsNew>(state); return; }
            case 3: { to_stringD<3, IsNew>(state); return; }
            case 4: { to_stringD<4, IsNew>(state); return; }
            case 5: { to_stringD<5, IsNew>(state); return; }
            case 6: { to_stringD<6, IsNew>(state); return; }
            case 7: { to_stringD<7, IsNew>(state); return; }
            case 8: { to_stringD<8, IsNew>(state); return; }
            case 9: { to_stringD<9, IsNew>(state); return; }
            case 10: { to_stringD<10, IsNew>(state); return; }
            case 11: { to_stringD<11, IsNew>(state); return; }
            case 12: { to_stringD<12, IsNew>(state); return; }
            case 13: { to_stringD<13, IsNew>(state); return; }
            case 14: { to_stringD<14, IsNew>(state); return; }
            case 15: { to_stringD<15, IsNew>(state); return; }
            case 16: { to_stringD<16, IsNew>(state); return; }

            default: assert(false);
        }
    }

};

void to_stringNewBench(benchmark::State& state) {
    to_stringBench<true>::run(state);
}

void to_stringOldBench(benchmark::State& state) {
    to_stringBench<false>::run(state);
}

BENCHMARK(to_stringNewBench)->DenseRange(1, 16);

BENCHMARK(to_stringOldBench)->DenseRange(1, 16);

BENCHMARK_MAIN();

Updating D101752: Speedup to_string for integers using zero-copy.

Quuxplusone requested changes to this revision.May 8 2021, 8:25 AM

@Roman-Koshelev: By "benchmark," I think we mean putting it in the diff proper. Here's the patch I was about to suggest... https://pastebin.com/raw/Jg2htyVG ...except that then I ran that benchmark (with and without the patch to src/) and was unable to detect any performance improvement. Changing the first resize to __resize_default_init helped, but not enough.

$ bin/clang++ -O2 ../libcxx/benchmarks/to_string.bench.cpp lib/libc++.a lib/libc++abi.a -I ../libcxx/utils/google-benchmark/include/ -L /usr/local/lib -lbenchmark -o new.exe
$ ./new.exe
2021-05-08T11:16:51-04:00
Running ./new.exe
Run on (8 X 2200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 1.72, 2.07, 2.17
----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
BM_ToString/1         9156 ns         9152 ns        74252
BM_ToString/2         9207 ns         9203 ns        74929
BM_ToString/3        10199 ns        10192 ns        67968
BM_ToString/4        10034 ns        10027 ns        68924
BM_ToString/5        11141 ns        11118 ns        63743
BM_ToString/6        11534 ns        11485 ns        62568
BM_ToString/7        11926 ns        11925 ns        58581
BM_ToString/8        12282 ns        12257 ns        52108
BM_ToString/9        13072 ns        13033 ns        51947
BM_ToString/10       13010 ns        12994 ns        53775
BM_ToWstring/1        6809 ns         6808 ns       101851
BM_ToWstring/2        7863 ns         7849 ns        88482
BM_ToWstring/3        8748 ns         8735 ns        77942
BM_ToWstring/4        9448 ns         9409 ns        77254
BM_ToWstring/5       88436 ns        88398 ns         7768
BM_ToWstring/6       88524 ns        88499 ns         7703
BM_ToWstring/7       90242 ns        90205 ns         7622
BM_ToWstring/8       92892 ns        92575 ns         7279
BM_ToWstring/9       90693 ns        90591 ns         7435
BM_ToWstring/10      93344 ns        93128 ns         7646
$ git checkout main -- ../libcxx/src/string.cpp
$ ninja cxx ; ninja cxx ; ninja cxx
[5/5] Creating library symlink lib/libc++.1.dylib lib/libc++.dylib
[1/1] Linking CXX static library lib/libc++.a
ninja: no work to do.
$ bin/clang++ -O2 ../libcxx/benchmarks/to_string.bench.cpp lib/libc++.a lib/libc++abi.a -I ../libcxx/utils/google-benchmark/include/ -L /usr/local/lib -lbenchmark -o old.exe
$ ./old.exe
2021-05-08T11:18:56-04:00
Running ./old.exe
Run on (8 X 2200 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x4)
  L1 Instruction 32 KiB (x4)
  L2 Unified 256 KiB (x4)
  L3 Unified 6144 KiB (x1)
Load Average: 2.54, 2.45, 2.31
----------------------------------------------------------
Benchmark                Time             CPU   Iterations
----------------------------------------------------------
BM_ToString/1         8390 ns         8386 ns        79859
BM_ToString/2         9494 ns         9488 ns        72932
BM_ToString/3        10157 ns        10152 ns        67160
BM_ToString/4        10708 ns        10691 ns        66018
BM_ToString/5        12084 ns        12068 ns        57079
BM_ToString/6        12080 ns        12074 ns        56325
BM_ToString/7        13303 ns        13295 ns        51666
BM_ToString/8        12070 ns        12062 ns        56550
BM_ToString/9        12610 ns        12605 ns        55151
BM_ToString/10       12362 ns        12357 ns        55914
BM_ToWstring/1        6791 ns         6787 ns       101672
BM_ToWstring/2        7698 ns         7696 ns        89471
BM_ToWstring/3        8682 ns         8672 ns        79495
BM_ToWstring/4        9499 ns         9466 ns        76316
BM_ToWstring/5       93702 ns        93611 ns         7603
BM_ToWstring/6       95741 ns        95541 ns         7602
BM_ToWstring/7       98949 ns        98798 ns         6976
BM_ToWstring/8       94875 ns        94835 ns         7649
BM_ToWstring/9       94720 ns        94675 ns         7441
BM_ToWstring/10      94846 ns        94802 ns         7011
This revision now requires changes to proceed.May 8 2021, 8:25 AM

Are you still interested in working on this patch? We're moving to GitHub PRs and like to clean up the review queue.

Herald added a project: Restricted Project. · View Herald TranscriptSep 11 2023, 10:16 AM
Mordante commandeered this revision.Nov 1 2023, 10:37 AM
Mordante edited reviewers, added: Roman-Koshelev; removed: Mordante.

I'll abandon this patch, please open a GH PR when you want to pursue this patch.

Mordante abandoned this revision.Nov 1 2023, 10:37 AM