This is an archive of the discontinued LLVM Phabricator instance.

[libc++] Unroll loop in std::find_if and define other std::find variants in terms of it
Abandoned · Public

Authored by ldionne on Oct 20 2021, 9:22 AM.

Details

Reviewers
fhahn
Quuxplusone
Group Reviewers
Restricted Project
Summary

It was reported that libstdc++'s std::find was considerably faster than
libc++'s std::find in some cases. This was tracked down to some manual
loop unrolling being done in libstdc++'s implementation.

This patch modifies std::find_if to perform such loop unrolling, and
also modifies std::find and std::find_if_not to call std::find_if
as an implementation detail so they can take advantage of any optimization
we might do in std::find_if.
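
For readers of this archive, here is a rough sketch of the shape being described. The helper names, the unroll factor of 4, and the restriction to random-access iterators are illustrative assumptions; the actual diff is not reproduced here, and the review comments below (e.g. about break versus return) refer to the real patch rather than to this sketch.

template <class _Iter, class _Pred>
_Iter __find_if_unrolled(_Iter __first, _Iter __last, _Pred __pred) {
  // Process blocks of four elements while at least four remain.
  while (__last - __first >= 4) {
    if (__pred(*__first))       return __first;
    if (__pred(*(__first + 1))) return __first + 1;
    if (__pred(*(__first + 2))) return __first + 2;
    if (__pred(*(__first + 3))) return __first + 3;
    __first += 4;
  }
  // Handle the remaining zero to three elements.
  for (; __first != __last; ++__first)
    if (__pred(*__first))
      return __first;
  return __first;
}

template <class _Iter, class _Tp>
_Iter __find_in_terms_of_find_if(_Iter __first, _Iter __last, const _Tp& __value) {
  return __find_if_unrolled(__first, __last,
                            [&](const auto& __e) { return __e == __value; });
}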

As a fly-by fix, this commit significantly improves the test coverage for
the three std::find_XXXX algorithms, which was previously insufficient to
find issues in this patch.

rdar://72436671

Diff Detail

Event Timeline

ldionne requested review of this revision. Oct 20 2021, 9:22 AM
ldionne created this revision.
Herald added a project: Restricted Project. · View Herald Transcript · Oct 20 2021, 9:22 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript

I normally push back against this sort of manual "optimization", but in this case it seems to really make a difference for the case I was told about. We can have a discussion about whether this is an actual improvement. If we end up not taking the change, I'll at least take the test changes.

> I normally push back against this sort of manual "optimization", but in this case it seems to really make a difference for the case I was told about. We can have a discussion about whether this is an actual improvement. If we end up not taking the change, I'll at least take the test changes.

I would also prefer that the compiler do this kind of transformation when it's beneficial.
Can you add a benchmark so we can measure the performance gains and code size changes?

libcxx/test/std/algorithms/alg.nonmodifying/alg.find/find_if.pass.cpp
62

Since the algorithm is no longer straightforward, I think it would be good to validate the number of times the predicate is evaluated, just to make sure we didn't accidentally increase the number of evaluations.
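
A minimal sketch of what such a check could look like (this is not the existing test; the values and the expected count are made up for illustration, and since the standard only guarantees at most last - first applications of the predicate, the assertion pins libc++'s behaviour rather than a standard requirement):

#include <algorithm>
#include <cassert>

int main() {
  int a[] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
  int calls = 0;
  auto pred = [&](int x) { ++calls; return x == 7; };
  int* it = std::find_if(a, a + 10, pred);
  assert(it == a + 6);
  assert(calls == 7); // one evaluation per element up to and including the match
  return 0;
}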

Quuxplusone requested changes to this revision. Oct 20 2021, 11:49 AM
Quuxplusone added a subscriber: Quuxplusone.
Quuxplusone added inline comments.
libcxx/include/__algorithm/find.h
1

High-level motivation question: This performance difference that you see, is it only at -O0? I would hope that the compiler does the obvious loop unrolling at -O2 at least. I'd also be interested in seeing a Godbolt link of libc++ versus libstdc++, and perhaps even some new benchmark as part of the PR.

29

This flips the order from *first == value to value == *first. I think we're within our rights to do that, technically; but it feels gratuitous.

libcxx/test/std/algorithms/alg.nonmodifying/alg.find/find_if.pass.cpp
62

+1, I think your current implementation has 4 instances of break that should be return __first, and this causes 1 extra evaluation of __pred in most cases.

This revision now requires changes to proceed. Oct 20 2021, 11:49 AM
Mordante added inline comments. Oct 20 2021, 12:08 PM
libcxx/test/std/algorithms/alg.nonmodifying/alg.find/find_if.pass.cpp
62

I too wondered why a return wasn't used instead of a break. Initially I also expected one extra evaluation, but after studying the algorithm further I _think_ it's correct: switch (__last - __first) should be larger than 3 when the for loop exits in one of the breaks.

FWIW, this isn't my prime area, but I can't not comment here.
As stated in a previous review, I think this is a pretty bad idea.
Here's an incomplete list of reasons:

  1. This is clearly a compiler deficiency. Fixing it there will benefit everyone.
  2. How do you know that unrolling is always beneficial? What about -Os/-Oz? What about in-order CPUs? Why unroll 4x specifically?
  3. What happens with more complex predicates? The old code had a single call; now you have how many? Will this affect (prevent) inlining in some cases?
  4. The old code seems straightforward to auto-vectorize (for certain predicates); the new one, not so much.

Re: all

Thanks a lot for chiming in. Morally, I share everybody's views here -- this is a compiler issue and we shouldn't be working around it in the library. This was reported to me and @fhahn asked me to look into whether it might be possible to implement this in the library. So I did, but I agree I'd rather not check this in if the compiler can be fixed instead. I'll let Florian explain what a compiler fix would involve.

libcxx/include/__algorithm/find.h
1
libcxx/test/std/algorithms/alg.nonmodifying/alg.find/find_if.pass.cpp
62

Yeah, I could return instead of break -- but I'm pretty sure the current algorithm is correct. We'll find out when I add tests counting the number of times the predicate is called -- that's a good suggestion whatever we end up doing.

If the compiler isn't smart enough, how about giving it a hint?
https://clang.llvm.org/docs/LanguageExtensions.html#loop-unrolling
Adding #pragma clang loop unroll_count(4) to our find causes the loop to unroll
https://clang.godbolt.org/z/687793Te4
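
A sketch of where the hint would go, assuming the plain loop structure of the current implementation (the function name here is made up; only the pragma placement matters):

template <class _InputIterator, class _Tp>
_InputIterator __find_with_unroll_hint(_InputIterator __first, _InputIterator __last, const _Tp& __value) {
  // Ask clang to unroll this loop 4x; the pragma is a hint attached to the
  // loop that immediately follows it.
#pragma clang loop unroll_count(4)
  for (; __first != __last; ++__first)
    if (*__first == __value)
      break;
  return __first;
}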

> If the compiler isn't smart enough, how about giving it a hint?
> https://clang.llvm.org/docs/LanguageExtensions.html#loop-unrolling
> Adding #pragma clang loop unroll_count(4) to our find causes the loop to unroll
> https://clang.godbolt.org/z/687793Te4

Interesting suggestion actually! The correct comparison would be this: https://clang.godbolt.org/z/PETM1TvrG

I'll wait for @fhahn to provide more context about why we'd want to unroll in the library at all, and then I can apply your suggestions if we decide to move forward.

fhahn added a comment. Nov 2 2021, 8:26 AM

> Re: all
>
> Thanks a lot for chiming in. Morally, I share everybody's views here -- this is a compiler issue and we shouldn't be working around it in the library. This was reported to me and @fhahn asked me to look into whether it might be possible to implement this in the library. So I did, but I agree I'd rather not check this in if the compiler can be fixed instead. I'll let Florian explain what a compiler fix would involve.

I think the main problem with optimising the code in LLVM at the moment is that we lose the information that all elements between begin() and end() are dereferenceable (simplified IR shown below), e.g. the IR function could get passed a garbage end pointer and only exit via the early exit. In the original C++ source, we have stronger guarantees allowing for better optimisations.

Ideally we would vectorise the search loop, but I don't think that's possible without more information in the IR. While I agree it is unfortunate that the IR seems to be missing crucial information and the compiler should do better, I don't think there's an easy fix for now.

I'd like the Clang + libc++ package to provide the best performance for our users, and surgically modifying the libc++ implementation might be the easiest first step in that direction. As @ldionne mentioned, we encountered use cases where this difference in the std::find implementation causes a 10% performance difference compared to GCC + libstdc++ for a largish application (*not* a std::find-specific micro-benchmark).

As reviewers already pointed out, if we go with a source modification, the implementation should probably only kick in for data types that can be checked cheaply (e.g. integers, floats and pointers).
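
Purely as an illustration of that gating (the trait name and the exact set of types are assumptions, not something decided in this review):

#include <string>
#include <type_traits>

// Hypothetical gate: only enable a specialised find path for types whose
// comparisons are cheap and free of user-defined side effects.
template <class _Tp>
struct __cheaply_comparable
    : std::integral_constant<bool, std::is_arithmetic<_Tp>::value ||
                                   std::is_pointer<_Tp>::value> {};

static_assert(__cheaply_comparable<int>::value, "integers qualify");
static_assert(__cheaply_comparable<float*>::value, "pointers qualify");
static_assert(!__cheaply_comparable<std::string>::value, "class types do not");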

I am definitely also interested in working towards a way to provide dereferenceability info for the begin() & end() range. (If there's already a way I missed, please let me know :)).

define i8** @src(i8** %begin, i8** %end, i8* %tgt) {
entry:
  %cmp.i.i.not8.i = icmp eq i8** %begin, %end
  br i1 %cmp.i.i.not8.i, label %exit, label %header

header:
  %ptr.iv = phi i8** [ %ptr.iv.next, %latch ], [ %begin, %entry ]
  %lv = load i8*, i8** %ptr.iv, align 8
  %cmp = icmp eq i8* %lv, %tgt
  br i1 %cmp, label %exit, label %latch

latch:
  %ptr.iv.next = getelementptr inbounds i8*, i8** %ptr.iv, i64 1
  %cmp.i.i.not.i = icmp eq i8** %ptr.iv.next, %end
  br i1 %cmp.i.i.not.i, label %exit, label %header

exit:
  %res.lcssa = phi i8** [ %begin, %entry ], [ %end, %latch ], [ %ptr.iv, %header ]
  ret i8** %res.lcssa
}

I also took a stab at adding a first set of benchmarks to libcxx/benchmarks/algorithms.bench.cpp, but I think something is not quite right. I'm not very familiar with the libc++ benchmarking system, so if anybody can spot any issues, please let me know.

diff --git a/libcxx/benchmarks/algorithms.bench.cpp b/libcxx/benchmarks/algorithms.bench.cpp
index 564d89d659cb..f0e610df6ba0 100644
--- a/libcxx/benchmarks/algorithms.bench.cpp
+++ b/libcxx/benchmarks/algorithms.bench.cpp
@@ -309,6 +309,48 @@ struct PopHeap {
   };
 };

+template <class ValueType, class Order>
+struct FindMid {
+  size_t Quantity;
+
+  void run(benchmark::State& state) const {
+    runOpOnCopies<ValueType>(state, Quantity, Order(), BatchSize::CountElements, [](auto& Copy) {
+      auto I = std::find(Copy.begin(), Copy.end(), Copy[Copy.size() / 2]);
+      // Try to prevent the compiler from optimizing out std::find.
+      benchmark::DoNotOptimize(I);
+      benchmark::DoNotOptimize(Copy);
+      std::swap(*I, *Copy.begin());
+    });
+  }
+
+  bool skip() const { return Order() == ::Order::Heap; }
+
+  std::string name() const {
+    return "BM_FindMid" + ValueType::name() + Order::name() + "_" + std::to_string(Quantity);
+  };
+};
+
+template <class ValueType, class Order>
+struct FindEnd {
+  size_t Quantity;
+
+  void run(benchmark::State& state) const {
+    runOpOnCopies<ValueType>(state, Quantity, Order(), BatchSize::CountElements, [](auto& Copy) {
+      auto I = std::find(Copy.begin(), Copy.end(), Copy.back());
+      // Try to prevent the compiler from optimizing out std::find.
+      benchmark::DoNotOptimize(I);
+      benchmark::DoNotOptimize(Copy);
+      std::swap(*I, *Copy.begin());
+    });
+  }
+
+  bool skip() const { return Order() == ::Order::Heap; }
+
+  std::string name() const {
+    return "BM_FindEnd" + ValueType::name() + Order::name() + "_" + std::to_string(Quantity);
+  };
+};
+
 } // namespace

 int main(int argc, char** argv) {
@@ -333,5 +375,7 @@ int main(int argc, char** argv) {
       Quantities);
   makeCartesianProductBenchmark<PushHeap, AllValueTypes, AllOrders>(Quantities);
   makeCartesianProductBenchmark<PopHeap, AllValueTypes>(Quantities);
+  makeCartesianProductBenchmark<FindMid, AllValueTypes, AllOrders>(Quantities);
+  makeCartesianProductBenchmark<FindEnd, AllValueTypes, AllOrders>(Quantities);
   benchmark::RunSpecifiedBenchmarks();
 }
libcxx/include/__algorithm/find_if.h
90

Idle thought: @fhahn's commentary makes me realize that this PR is uncomfortably adjacent to the ongoing LWG(?) debate over

const char *first = "hello world";
const char *last = std::ranges::find(first, std::unreachable_sentinel, '\0');

The first question is: because unreachable_sentinel by definition can't be reached by incrementing first, does that mean that [first, unreachable_sentinel) lacks valid-range-ness in the same way that [first, first-1) or [first, nullptr) lacks valid-range-ness, in which case the above has UB?
Related, but perhaps orthogonal: does that mean that find in practice should be allowed to read from elements beyond the match, as long as they're still in the provided range? E.g.

int buf[100];  // uninitialized garbage
buf[0] = 42; buf[1] = 43; buf[2] = 44;
int *last = std::find(buf, buf+100, 44);  // could this read buf[3] and thus have UB?
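
To make the concern concrete, here is one hypothetical way an implementation could end up touching buf[3]: a block-wise loop that evaluates a whole group of comparisons before locating the first match. This is an illustration of the question being asked, not code from this patch.

int* blockwise_find(int* first, int* last, int value) {
  while (last - first >= 4) {
    // All four comparisons are evaluated before any result is inspected,
    // so first[3] is read even if first[0], first[1] or first[2] matched.
    bool m0 = first[0] == value;
    bool m1 = first[1] == value;
    bool m2 = first[2] == value;
    bool m3 = first[3] == value;
    if (m0) return first;
    if (m1) return first + 1;
    if (m2) return first + 2;
    if (m3) return first + 3;
    first += 4;
  }
  for (; first != last; ++first)
    if (*first == value)
      return first;
  return first;
}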

I think this specific PR isn't doing anything new and dangerous right now, but the commentary about "vectorization" and "information that pointers are dereferenceable" and "better optimization" seems very scary to me.

fhahn added inline comments. Nov 3 2021, 4:24 PM
libcxx/include/__algorithm/find_if.h
90

That's an interesting point! I naively assumed that both the start and end iterators need to refer to a valid range.

This observation is key to the vectorisation route, I think, so if it doesn't hold in general, that route might not be worth pursuing.

ldionne abandoned this revision. Nov 3 2023, 8:42 AM

I am going to abandon this patch. After consideration, I do not think that manually unrolling std::find_if like this is the way to go. It leads to much worse code size, and we really shouldn't be doing these kinds of optimizations manually; the compiler should learn to do them instead. Also, since this patch was created, we've improved std::find so that it calls memchr when it can, which is even better for both code size and speed.
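
For context on that last point, a simplified sketch of the kind of forwarding meant here; the helper name and the constraints are illustrative, and the real dispatch in libc++ is more careful about which value and element types qualify:

#include <cstring>
#include <type_traits>

// Simplified illustration: for contiguous ranges of byte-sized integral
// elements, the search can be forwarded to memchr.
template <class _Tp>
const _Tp* __find_bytewise(const _Tp* __first, const _Tp* __last, _Tp __value) {
  static_assert(std::is_integral<_Tp>::value && sizeof(_Tp) == 1,
                "only valid for byte-sized integral types");
  if (__first == __last)
    return __last;
  // memchr compares against (unsigned char)__value; a real implementation
  // must also verify that the searched-for value is representable in _Tp.
  const void* __ret = std::memchr(__first, static_cast<unsigned char>(__value),
                                  static_cast<std::size_t>(__last - __first));
  return __ret != nullptr ? static_cast<const _Tp*>(__ret) : __last;
}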

I'll rebase the test improvements on top of main and upload a patch on GitHub though.

Herald added a project: Restricted Project. · View Herald Transcript · Nov 3 2023, 8:42 AM
Herald added a subscriber: StephenFan. · View Herald Transcript