Details
- Reviewers
ldionne - Group Reviewers
Restricted Project - Commits
- rG1fd08edd5808: [libc++] Forward to std::{,w}memchr in std::find
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Benchmarks:
--------------------------------------------------------------- Benchmark old new --------------------------------------------------------------- bm_find<char>/1 10.5 ns 10.5 ns bm_find<char>/2 18.3 ns 10.4 ns bm_find<char>/3 21.1 ns 10.4 ns bm_find<char>/4 22.4 ns 10.3 ns bm_find<char>/5 23.3 ns 10.3 ns bm_find<char>/6 23.7 ns 10.4 ns bm_find<char>/7 24.0 ns 10.3 ns bm_find<char>/8 24.5 ns 10.2 ns bm_find<char>/16 26.2 ns 10.2 ns bm_find<char>/64 29.9 ns 20.8 ns bm_find<char>/512 80.2 ns 36.0 ns bm_find<char>/4096 523 ns 43.0 ns bm_find<char>/32768 3838 ns 105 ns bm_find<char>/262144 30832 ns 988 ns bm_find<char>/1048576 122541 ns 4854 ns bm_find<short>/1 10.5 ns 10.4 ns bm_find<short>/2 18.3 ns 18.5 ns bm_find<short>/3 21.2 ns 21.4 ns bm_find<short>/4 22.6 ns 22.7 ns bm_find<short>/5 23.6 ns 23.6 ns bm_find<short>/6 24.1 ns 24.2 ns bm_find<short>/7 24.5 ns 24.6 ns bm_find<short>/8 24.6 ns 24.8 ns bm_find<short>/16 26.1 ns 26.6 ns bm_find<short>/64 33.3 ns 31.3 ns bm_find<short>/512 134 ns 80.4 ns bm_find<short>/4096 967 ns 499 ns bm_find<short>/32768 7670 ns 3842 ns bm_find<short>/262144 61106 ns 31371 ns bm_find<short>/1048576 244510 ns 122120 ns bm_find<int>/1 10.4 ns 12.1 ns bm_find<int>/2 18.4 ns 12.0 ns bm_find<int>/3 21.9 ns 12.0 ns bm_find<int>/4 23.3 ns 11.9 ns bm_find<int>/5 24.2 ns 12.1 ns bm_find<int>/6 24.7 ns 12.0 ns bm_find<int>/7 24.8 ns 11.9 ns bm_find<int>/8 25.2 ns 11.9 ns bm_find<int>/16 26.2 ns 21.9 ns bm_find<int>/64 30.5 ns 33.8 ns bm_find<int>/512 80.4 ns 40.4 ns bm_find<int>/4096 497 ns 71.7 ns bm_find<int>/32768 3844 ns 498 ns bm_find<int>/262144 30722 ns 4901 ns bm_find<int>/1048576 123303 ns 22864 ns bm_ranges_find<char>/1 10.4 ns 10.4 ns bm_ranges_find<char>/2 18.3 ns 10.4 ns bm_ranges_find<char>/3 21.1 ns 10.4 ns bm_ranges_find<char>/4 22.3 ns 10.3 ns bm_ranges_find<char>/5 23.2 ns 10.3 ns bm_ranges_find<char>/6 23.6 ns 10.3 ns bm_ranges_find<char>/7 24.0 ns 10.4 ns bm_ranges_find<char>/8 24.4 ns 10.5 ns bm_ranges_find<char>/16 26.0 ns 10.3 ns bm_ranges_find<char>/64 30.6 ns 20.9 ns bm_ranges_find<char>/512 80.2 ns 36.2 ns bm_ranges_find<char>/4096 496 ns 43.1 ns bm_ranges_find<char>/32768 3844 ns 106 ns bm_ranges_find<char>/262144 30632 ns 994 ns bm_ranges_find<char>/1048576 123127 ns 5099 ns bm_ranges_find<short>/1 10.5 ns 10.5 ns bm_ranges_find<short>/2 18.4 ns 18.3 ns bm_ranges_find<short>/3 21.2 ns 21.2 ns bm_ranges_find<short>/4 22.7 ns 22.5 ns bm_ranges_find<short>/5 23.6 ns 23.4 ns bm_ranges_find<short>/6 24.1 ns 23.8 ns bm_ranges_find<short>/7 24.4 ns 24.1 ns bm_ranges_find<short>/8 24.7 ns 24.5 ns bm_ranges_find<short>/16 26.2 ns 26.1 ns bm_ranges_find<short>/64 33.2 ns 30.4 ns bm_ranges_find<short>/512 134 ns 80.4 ns bm_ranges_find<short>/4096 969 ns 506 ns bm_ranges_find<short>/32768 7648 ns 3853 ns bm_ranges_find<short>/262144 61198 ns 30507 ns bm_ranges_find<short>/1048576 248057 ns 122080 ns bm_ranges_find<int>/1 10.5 ns 11.8 ns bm_ranges_find<int>/2 18.5 ns 11.9 ns bm_ranges_find<int>/3 22.0 ns 11.9 ns bm_ranges_find<int>/4 23.4 ns 11.8 ns bm_ranges_find<int>/5 24.3 ns 11.8 ns bm_ranges_find<int>/6 24.7 ns 12.0 ns bm_ranges_find<int>/7 24.8 ns 11.9 ns bm_ranges_find<int>/8 25.1 ns 11.8 ns bm_ranges_find<int>/16 26.2 ns 21.9 ns bm_ranges_find<int>/64 30.7 ns 33.8 ns bm_ranges_find<int>/512 80.4 ns 39.7 ns bm_ranges_find<int>/4096 498 ns 72.6 ns bm_ranges_find<int>/32768 3842 ns 506 ns bm_ranges_find<int>/262144 30566 ns 5311 ns bm_ranges_find<int>/1048576 123103 ns 22743 ns
libcxx/include/__algorithm/ranges_find.h | ||
---|---|---|
54 | Maybe this conveys the intent a bit better? | |
libcxx/include/__string/constexpr_c_functions.h | ||
98–122 | I would suggest reorganizing like this, to make it more obvious what the intent of using __builtin_char_memchr is. if (__libcpp_is_constant_evaluated()) { // use __builtin_char_memchr to optimize constexpr evaluation if we can #if __has_builtin(__builtin_char_memchr) if (is_same<_Tp, char>::value) return __builtin_char_memchr(__str, __char, __count); #endif for (; __count; --__count) { if (*__str == __char) return __str; ++__str; } return nullptr; } else { return static_cast<_Tp*>(__builtin_memchr(__str, __char, __count)); } | |
100 | ||
libcxx/include/cwchar | ||
231 | static_assert("diagnostic") You also seem to be missing an important requirement on _Up? You need _Up to be trivially copyable to wchar_t or else your std::wmemcpy may not be valid? I would actually just replace _Up by wchar_t since we don't need the additional generalization, at least for now. | |
232 | You probably want to remove_cv_t<_Tp> here to also catch the case where _Tp* == wchar_t const*. | |
233–234 | Now you don't need this anymore if you take wchar_t __char as input. | |
235 | Does reintepret_cast<T>(object-of-type-T) work inside constexpr? If not then this doesn't compile when constant-evaluated, and it means your test coverage is also insufficient in that area. Either way please make sure that is tested. |
libcxx/benchmarks/algorithms/find.bench.cpp | ||
---|---|---|
10 | I really would like the benchmark results you posted in the commit message. | |
libcxx/include/__algorithm/find.h | ||
54 | Why not test is_same_v<_Tp, wchar_t>? | |
58 | Does this function work with char16_t or char32_t (depending on sizeof(wchar_t).) | |
libcxx/include/__string/constexpr_c_functions.h | ||
106–116 | Note that the current way it's written generates unneeded code | |
libcxx/include/cwchar | ||
255 | This seems duplicated code from the char version, can we combine it in a templated function. |
libcxx/include/__algorithm/find.h | ||
---|---|---|
54 | Because that would disable the optimization for other types that have the same size as wchar_t. For these types this optimization is probably UB, but that shouldn't be a problem in practice, since it's on an ABI boundary and no compiler currently optimizes wmemchr specifically (and IMO no compiler should ever optimize based on this, since there is no potential gain I can see). | |
58 | See my comment above. | |
libcxx/include/__string/constexpr_c_functions.h | ||
106–116 | This looks quite surprising to me, and I'm not sure it's worth the little bit of time the compiler spends on this. | |
libcxx/include/cwchar | ||
255 | I don't think it's worth it. The loop is really simple, and deduplicating this seems like a lot of extra complexity compared to the benefit. |
libcxx/include/cwchar | ||
---|---|---|
232 | Should we also assert the we have the same alignment as wchar_t here? | |
263–264 | For the non-constexpr case, I would do this to make sure that we really never exploit any UB, even if it may seem to be benign right now: if constexpr (std::is_trivial_v<_Tp>) { // we know it's safe to reinterpret cast (at least in C++20 with implicit lifetime types) return reinterpret_cast<_Tp*>(std::wmemchr(reinterpret_cast<__apply_cv<_Tp, wchar_t>::type*>(__str), __char, __count)); } else { // loop } Concretely, this won't change anything since we already only call this function with integral types, however if we were to e.g. specialize __is_trivially_equality_comparable for a non-trivial type, then my suggested code would work whereas the original code might be (benign) UB. WDYT? | |
libcxx/test/std/algorithms/alg.nonmodifying/alg.find/find.pass.cpp | ||
10 | We should also update the tests for ranges::find. | |
105–128 | We should also have a test that checks that if multiple elements match, we return the first one. | |
120 | Let's also add a test for an empty range. | |
124 | This shouldn't do the same compare that would be generated anyway, otherwise we are not testing properly that we don't use memchr. | |
126 | We also need to test with types that have different sizes so we trigger the memchr and the wmemchr code paths. Right now we don't trigger both with non-integral types. |
libcxx/include/__algorithm/find.h | ||
---|---|---|
54 | The function you call has the following signature inline _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 const wchar_t* __constexpr_wmemchr(const wchar_t* __str, wchar_t __char, size_t __count) So how can this work when _Tp is not a wchar_t? That requires a reinterpret_cast which is not allowed during constant evaluation. What am I missing? | |
libcxx/include/__string/constexpr_c_functions.h | ||
106–116 | This is something we have used before. Now it will generate dead code. | |
libcxx/test/std/algorithms/alg.nonmodifying/alg.find/find.pass.cpp | ||
145 | I really would like to see tests for all types that will use this optimization. Especially to make sure the code is valid during constant evaluation. |
libcxx/include/__algorithm/find.h | ||
---|---|---|
54 | Your signature is wrong, __constexpr_wmemchr is templated. | |
libcxx/include/__string/constexpr_c_functions.h | ||
106–116 | clang doesn't even emit the condition from the fronted AFAICT, let alone the backend: https://godbolt.org/z/87bvzbhbT. This is at best a minimal compile time speedup. | |
libcxx/test/std/algorithms/alg.nonmodifying/alg.find/find.pass.cpp | ||
145 | We currently only have integral types and pointers marked trivially equality comparable, and I don't think it's worth testing for pointers specifically, since they are for all intents and purposes integers with a funny syntax in this case. |
libcxx/include/cwchar | ||
---|---|---|
263–264 | I went for the "only optimize if we have __builtin_wmemchr" option instead. IMO we should try to apply this optimization as broadly as possible, and we can at least heavily influence how clang optimizes __builtin_wmemchr. This means that GCC won't optimize this as well, but I think this is OK given that by far most libc++ users use clang. |
Mostly LGTM but I'd like to see again before merging, especially after splitting up the test changes.
libcxx/benchmarks/algorithms/find.bench.cpp | ||
---|---|---|
22 | We should stop/start the timing to avoid including this bit in the benchmark, since this is calling into mt199whatever which is non trivial. You can use state.PauseTiming() and ResumeTiming() to do this. You should also check whether that introduces a bunch of noise in the timings, I've noticed that before. | |
libcxx/include/__algorithm/find.h | ||
65 | HIDE_FROM_ABI while we're at it. | |
libcxx/include/__string/constexpr_c_functions.h | ||
118–120 | I would suggest this instead, otherwise it looks like the memcpy is trying to tell us something but it's really just a more complicated way to do what's below: return static_cast<_Tp*>(__builtin_memchr(__str, *reinterpret_cast<char*>(&__value), __count)); | |
libcxx/test/std/algorithms/alg.nonmodifying/alg.find/ranges.find.pass.cpp | ||
198 | Could you please split these removals in a different review so we can talk about them separately? |
libcxx/benchmarks/algorithms/find.bench.cpp | ||
---|---|---|
22 | When adding them the times are in the hundreds of nanoseconds for single elements. Also, the documentation says: // NOTE: PauseTiming()/ResumeTiming() are relatively // heavyweight, and so their use should generally be avoided // within each benchmark iteration, if possible. |
I think this might have broken one of the sanitizer buildbots:
https://lab.llvm.org/buildbot/#/builders/37/builds/22340
+ diff -u /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/lib/sanitizer_common/symbolizer/scripts/global_symbols.txt undefined.new +wmemchr U + echo 'Failed: unexpected symbols'
(it might just need to have that line added?)
Yeah, that looks like a compiler-rt thing. I guess they check which symbols they rely on (and apparently now they rely on ẁmemchr`). Given that wmemcpy, wmemmove and wmemset are also listed there, it's probably fine to add wmemchr.
The libcxx tests might also need updating:
https://lab.llvm.org/buildbot/#/builders/74/builds/19459/steps/14/logs/stdio
*** /b/sanitizer-x86_64-linux-bootstrap-msan/build/llvm-project/libcxx/test/libcxx/transitive_includes/cxx26.csv --- /b/sanitizer-x86_64-linux-bootstrap-msan/build/libcxx_build_msan/test/libcxx/Output/transitive_includes.sh.cpp.dir/t.tmp/transitive_includes.csv *************** *** 3,8 **** --- 3,9 ---- algorithmcstdint algorithmcstring algorithmctime + algorithmcwchar algorithmexecution algorithminitializer_list algorithmiosfwd *************** *** 127,138 **** --- 128,141 ---- cstddefversion ctgmathccomplex ctgmathcmath + cwcharcstddef cwcharcwctype cwctypecctype dequecompare dequecstddef dequecstdint dequecstring + dequecwchar dequeinitializer_list dequelimits dequenew *************** *** 442,447 **** --- 445,451 ---- randomversion rangescompare rangescstddef + rangescwchar rangesinitializer_list rangesiosfwd rangesiterator *************** *** 640,645 **** --- 644,650 ---- vectorcstdint vectorcstdlib vectorcstring + vectorcwchar vectorinitializer_list vectoriosfwd vectorlimits
I really would like the benchmark results you posted in the commit message.