This is an archive of the discontinued LLVM Phabricator instance.

Improve string::find
ClosedPublic

Authored by hiraditya on Nov 23 2016, 3:08 PM.

Details

Summary

Improving string::find such that it ultimately gets converted to calls to memchr and memcmp. This is an intermediate patch to see if there is an interest
in this optimization. I'm planning to propagate this change to similar functions string::rfind ... etc.

Worked in collaboration with Sebastian Pop.

Diff Detail

Repository
rL LLVM

Event Timeline

hiraditya updated this revision to Diff 79156.Nov 23 2016, 3:08 PM
hiraditya retitled this revision from to Improve string::find.
hiraditya updated this object.
hiraditya added reviewers: mclow.lists, EricWF.
hiraditya added subscribers: sebpop, cfe-commits.

Please also post the performance numbers from the added benchmarks with and without the change to the lib.

libcxx/benchmarks/string.bench.cpp
12 ↗(On Diff #79156)

Let's call this benchmark "BM_StringFindNoMatch"

21 ↗(On Diff #79156)

Please reword to remove the reference to __phase2.

23 ↗(On Diff #79156)

Let's call this benchmark "BM_StringFindAllMatch"

mclow.lists edited edge metadata.Nov 28 2016, 10:38 AM

Definitely want to see numbers.

Definitely want to see numbers.

master:
Benchmark                         Time           CPU Iterations
---------------------------------------------------------------
BM_StringFindPhase1/10            4 ns          4 ns  161568945
BM_StringFindPhase1/64           28 ns         28 ns   24937005
BM_StringFindPhase1/512         132 ns        132 ns    5321432
BM_StringFindPhase1/4k          956 ns        956 ns     732038
BM_StringFindPhase1/32k        7622 ns       7621 ns      92121
BM_StringFindPhase1/128k      30485 ns      30483 ns      22938
BM_StringFindPhase2/1             4 ns          4 ns  191884173
BM_StringFindPhase2/8             7 ns          7 ns  105129099
BM_StringFindPhase2/64           34 ns         34 ns   20582485
BM_StringFindPhase2/512         187 ns        187 ns    3736654
BM_StringFindPhase2/4k         1415 ns       1414 ns     495342
BM_StringFindPhase2/32k       11221 ns      11220 ns      62393
BM_StringFindPhase2/128k      44952 ns      44949 ns      15595

master + patch:
Benchmark                         Time           CPU Iterations
---------------------------------------------------------------
BM_StringFindPhase1/10            3 ns          3 ns  204725158
BM_StringFindPhase1/64            5 ns          5 ns  146268957
BM_StringFindPhase1/512          12 ns         12 ns   60176557
BM_StringFindPhase1/4k           67 ns         67 ns   10508533
BM_StringFindPhase1/32k         503 ns        503 ns    1000000
BM_StringFindPhase1/128k       2396 ns       2396 ns     292701
BM_StringFindPhase2/1             6 ns          6 ns  127378641
BM_StringFindPhase2/8             6 ns          6 ns  117407388
BM_StringFindPhase2/64            7 ns          7 ns   95998016
BM_StringFindPhase2/512          18 ns         18 ns   38135928
BM_StringFindPhase2/4k           83 ns         83 ns    8452337
BM_StringFindPhase2/32k         762 ns        762 ns     925744
BM_StringFindPhase2/128k       3255 ns       3255 ns     215141

The numbers are from an x86_64-linux machine: Intel(R) Core(TM) i7-4790K CPU @ 4.00GHz
Overall the patch is a win on the synthetic benchmark.
We have also seen this in a proprietary benchmark where it accounted for about 10% speedup.

hiraditya updated this revision to Diff 79908.EditedDec 1 2016, 7:23 AM
hiraditya edited edge metadata.

Updated the benchmark:
Without patch:

Run on (8 X 3400 MHz CPU s)
2016-12-01 09:20:38
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
Benchmark                           Time           CPU Iterations
-----------------------------------------------------------------
BM_StringFindNoMatch/10             4 ns          4 ns  166421326
BM_StringFindNoMatch/64            37 ns         37 ns   18754392
BM_StringFindNoMatch/512          268 ns        268 ns    2586060
BM_StringFindNoMatch/4k          2143 ns       2144 ns     328342
BM_StringFindNoMatch/32k        16910 ns      16917 ns      40623
BM_StringFindNoMatch/128k       67577 ns      67602 ns      10138
BM_StringFindAllMatch/1             3 ns          3 ns  265163471
BM_StringFindAllMatch/8             6 ns          6 ns  112582467
BM_StringFindAllMatch/64           36 ns         36 ns   19566457
BM_StringFindAllMatch/512         209 ns        209 ns    3318893
BM_StringFindAllMatch/4k         1618 ns       1618 ns     432963
BM_StringFindAllMatch/32k       12909 ns      12914 ns      54317
BM_StringFindAllMatch/128k      48342 ns      48361 ns      13922
BM_StringFindMatch1/1           33777 ns      33790 ns      20698
BM_StringFindMatch1/8           33940 ns      33953 ns      20619
BM_StringFindMatch1/64          34038 ns      34051 ns      20571
BM_StringFindMatch1/512         34217 ns      34230 ns      20480
BM_StringFindMatch1/4k          35510 ns      35524 ns      19752
BM_StringFindMatch1/32k         46438 ns      46456 ns      15030
BM_StringFindMatch2/1           33839 ns      33852 ns      20648
BM_StringFindMatch2/8           33950 ns      33963 ns      20594
BM_StringFindMatch2/64          33846 ns      33859 ns      20668
BM_StringFindMatch2/512         34023 ns      34036 ns      20279
BM_StringFindMatch2/4k          35422 ns      35436 ns      19716
BM_StringFindMatch2/32k         46570 ns      46588 ns      15027

With patch:

Run on (8 X 3400 MHz CPU s)
2016-12-01 09:15:15
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
Benchmark                           Time           CPU Iterations
-----------------------------------------------------------------
BM_StringFindNoMatch/10             5 ns          5 ns  133724346
BM_StringFindNoMatch/64             6 ns          6 ns  119312184
BM_StringFindNoMatch/512           13 ns         13 ns   51539628
BM_StringFindNoMatch/4k            77 ns         77 ns    8935934
BM_StringFindNoMatch/32k          551 ns        551 ns    1222808
BM_StringFindNoMatch/128k        2684 ns       2685 ns     259957
BM_StringFindAllMatch/1             7 ns          7 ns   98017959
BM_StringFindAllMatch/8             7 ns          7 ns   91466911
BM_StringFindAllMatch/64            8 ns          8 ns   85707392
BM_StringFindAllMatch/512          20 ns         20 ns   34490895
BM_StringFindAllMatch/4k           93 ns         93 ns    7360375
BM_StringFindAllMatch/32k         827 ns        828 ns     829944
BM_StringFindAllMatch/128k       3593 ns       3594 ns     195815
BM_StringFindMatch1/1            1332 ns       1332 ns     516354
BM_StringFindMatch1/8            1336 ns       1336 ns     495876
BM_StringFindMatch1/64           1338 ns       1339 ns     516656
BM_StringFindMatch1/512          1357 ns       1357 ns     510717
BM_StringFindMatch1/4k           1485 ns       1486 ns     461228
BM_StringFindMatch1/32k          2235 ns       2236 ns     318253
BM_StringFindMatch2/1            1335 ns       1335 ns     517105
BM_StringFindMatch2/8            1336 ns       1337 ns     518004
BM_StringFindMatch2/64           1344 ns       1345 ns     511751
BM_StringFindMatch2/512          1361 ns       1361 ns     508150
BM_StringFindMatch2/4k           1611 ns       1611 ns     463388
BM_StringFindMatch2/32k          2187 ns       2187 ns     317532
tcanens added inline comments.
include/__string
543 ↗(On Diff #79908)

A character traits class need only accept pointers, so the name _RandomAccessIterator is misleading when you are passing them directly to _Traits::find/_Traits::compare. Why not just const _CharT*? Then you can strip out all the iterator_traits circumlocution as well.

548 ↗(On Diff #79908)

For example, since _RandomAccessIterator must be a pointer type, this can't possibly be anything other than ptrdiff_t.

568 ↗(On Diff #79908)

The function-style cast is redundant.

hiraditya updated this revision to Diff 80090.Dec 2 2016, 10:16 AM

Addressed Tim's comments.

hiraditya marked 3 inline comments as done.Dec 2 2016, 10:16 AM

This is starting to look good.

libcxx/include/__string
549 ↗(On Diff #80090)

Is there a reason that you calculate the end pointer(s) from first + len in the calling function, then recover the length here?

Also, __len1 and __len2 should be const.

hiraditya added inline comments.Dec 2 2016, 11:48 AM
libcxx/include/__string
549 ↗(On Diff #80090)

Recovering length here is only making the interface uniform, I don't have any specific reason for that. This function is going to be inlined anyways. It started this way because I copied the interface from std::__search in the header file algorithm.

len2 can be a const, since len1 changes in the loop, for this to make const I would have to introduce another variable. I'll update the patch.

hiraditya updated this revision to Diff 80134.Dec 2 2016, 2:27 PM

Addressed Marshall's comments.

EricWF edited edge metadata.Dec 5 2016, 2:09 AM

@mclow.lists I'm going to leave this review up to you. I don't see any immediate issues.

EricWF accepted this revision.Dec 30 2016, 2:26 AM
EricWF edited edge metadata.

Holy crap those improvements are impressive.

This LGTM. I'm assuming @mclow.lists has nothing left to say about this.

libcxx/include/__string
542 ↗(On Diff #80134)

Please add inline to the template so -fvisibility-inlines-hidden works with it.

This revision is now accepted and ready to land.Dec 30 2016, 2:26 AM
This revision was automatically updated to reflect the committed changes.

Holy crap those improvements are impressive.

You're welcome.

This LGTM. I'm assuming @mclow.lists has nothing left to say about this.

Thanks for your review and LGTM.
I addressed your last comment, added some more comments, tested on x86_64-linux, and committed.

Just for the record, I ported the patch to gcc/libstdc++ as well: PR66414 optimize std::string::find

https://github.com/gcc-mirror/gcc/commit/fc7ebc4b8d9ad7e2891b7f72152e8a2b7543cd65

Tangmou added a subscriber: Tangmou.May 3 2022, 9:17 PM
Tangmou added inline comments.
libcxx/trunk/include/__string
557

Sorry for the comment after such a long time, but I have a question about this patch.

Since we have already calculated __len1 and ensured that __len1 < __len2 before the loop, can we just skip the length calculation and the comparison in the first loop cycle? And thus we can replace the while loop with do-while or keep the while loop but delete the length calculation and the comparison before the loop?

Herald added a project: Restricted Project. · View Herald TranscriptMay 3 2022, 9:17 PM
Tangmou added inline comments.May 9 2022, 7:51 PM
libcxx/trunk/include/__string
557

Sorry, I made a mistake. Please ignore it!