This is an archive of the discontinued LLVM Phabricator instance.

[LoopIdiomRecognize] BCmp loop idiom recognition
AbandonedPublic

Authored by lebedev.ri on Apr 25 2019, 1:22 PM.

Details

Summary

@mclow.lists brought up this issue up in IRC.
It is a reasonably common problem to compare some two values for equality.
Those may be just some integers, strings or arrays of integers.

In C, there is memcmp(), bcmp() functions.
In C++, there exists std::equal() algorithm.
One can also write that function manually.

libstdc++'s std::equal() is specialized to directly call memcmp() for
various types, but not std::byte from C++2a. https://godbolt.org/z/mx2ejJ

libc++ does not do anything like that, it simply relies on simple C++'s
operator==(). https://godbolt.org/z/er0Zwf (GOOD!)

So likely, there exists a certain performance opportunities.
Let's compare performance of naive std::equal() (no memcmp()) with one that
is using memcmp() (in this case, compiled with modified compiler).

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <iterator>
#include <limits>
#include <random>
#include <type_traits>
#include <utility>
#include <vector>

#include "benchmark/benchmark.h"

template <class T>
bool equal(T* a, T* a_end, T* b) noexcept {
  for (; a != a_end; ++a, ++b) {
    if (*a != *b) return false;
  }
  return true;
}

template <typename T>
std::vector<T> getVectorOfRandomNumbers(size_t count) {
  std::random_device rd;
  std::mt19937 gen(rd());
  std::uniform_int_distribution<T> dis(std::numeric_limits<T>::min(),
                                       std::numeric_limits<T>::max());
  std::vector<T> v;
  v.reserve(count);
  std::generate_n(std::back_inserter(v), count,
                  [&dis, &gen]() { return dis(gen); });
  assert(v.size() == count);
  return v;
}

struct Identical {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    auto Tmp = getVectorOfRandomNumbers<T>(count);
    return std::make_pair(Tmp, std::move(Tmp));
  }
};

struct InequalHalfway {
  template <typename T>
  static std::pair<std::vector<T>, std::vector<T>> Gen(size_t count) {
    auto V0 = getVectorOfRandomNumbers<T>(count);
    auto V1 = V0;
    V1[V1.size() / size_t(2)]++;  // just change the value.
    return std::make_pair(std::move(V0), std::move(V1));
  }
};

template <class T, class Gen>
void BM_bcmp(benchmark::State& state) {
  const size_t Length = state.range(0);

  const std::pair<std::vector<T>, std::vector<T>> Data =
      Gen::template Gen<T>(Length);
  const std::vector<T>& a = Data.first;
  const std::vector<T>& b = Data.second;
  assert(a.size() == Length && b.size() == a.size());

  benchmark::ClobberMemory();
  benchmark::DoNotOptimize(a);
  benchmark::DoNotOptimize(a.data());
  benchmark::DoNotOptimize(b);
  benchmark::DoNotOptimize(b.data());

  for (auto _ : state) {
    const bool is_equal = equal(a.data(), a.data() + a.size(), b.data());
    benchmark::DoNotOptimize(is_equal);
  }
  state.SetComplexityN(Length);
  state.counters["eltcnt"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariant);
  state.counters["eltcnt/sec"] =
      benchmark::Counter(Length, benchmark::Counter::kIsIterationInvariantRate);
  const size_t BytesRead = 2 * sizeof(T) * Length;
  state.counters["bytes_read/iteration"] =
      benchmark::Counter(BytesRead, benchmark::Counter::kDefaults,
                         benchmark::Counter::OneK::kIs1024);
  state.counters["bytes_read/sec"] = benchmark::Counter(
      BytesRead, benchmark::Counter::kIsIterationInvariantRate,
      benchmark::Counter::OneK::kIs1024);
}

template <typename T>
static void CustomArguments(benchmark::internal::Benchmark* b) {
  const size_t L2SizeBytes = []() {
    for (const benchmark::CPUInfo::CacheInfo& I :
         benchmark::CPUInfo::Get().caches) {
      if (I.level == 2) return I.size;
    }
    return 0;
  }();
  // What is the largest range we can check to always fit within given L2 cache?
  const size_t MaxLen = L2SizeBytes / /*total bufs*/ 2 /
                        /*maximal elt size*/ sizeof(T) / /*safety margin*/ 2;
  b->RangeMultiplier(2)->Range(1, MaxLen)->Complexity(benchmark::oN);
}

BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, Identical)
    ->Apply(CustomArguments<uint8_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, Identical)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, Identical)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, Identical)
    ->Apply(CustomArguments<uint64_t>);

BENCHMARK_TEMPLATE(BM_bcmp, uint8_t, InequalHalfway)
    ->Apply(CustomArguments<uint8_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint16_t, InequalHalfway)
    ->Apply(CustomArguments<uint16_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint32_t, InequalHalfway)
    ->Apply(CustomArguments<uint32_t>);
BENCHMARK_TEMPLATE(BM_bcmp, uint64_t, InequalHalfway)
    ->Apply(CustomArguments<uint64_t>);

$ ~/src/googlebenchmark/tools/compare.py --no-utest benchmarks build-{old,new}/test/llvm-bcmp-bench
RUNNING: build-old/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpb6PEUx
2019-04-25 21:17:11
Running build-old/test/llvm-bcmp-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 0.65, 3.90, 4.14
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000           432131 ns       432101 ns         1613 bytes_read/iteration=1000k bytes_read/sec=2.20706G/s eltcnt=825.856M eltcnt/sec=1.18491G/s
BM_bcmp<uint8_t, Identical>_BigO               0.86 N          0.86 N
BM_bcmp<uint8_t, Identical>_RMS                   8 %             8 %
<...>
BM_bcmp<uint16_t, Identical>/256000          161408 ns       161409 ns         4027 bytes_read/iteration=1000k bytes_read/sec=5.90843G/s eltcnt=1030.91M eltcnt/sec=1.58603G/s
BM_bcmp<uint16_t, Identical>_BigO              0.67 N          0.67 N
BM_bcmp<uint16_t, Identical>_RMS                 25 %            25 %
<...>
BM_bcmp<uint32_t, Identical>/128000           81497 ns        81488 ns         8415 bytes_read/iteration=1000k bytes_read/sec=11.7032G/s eltcnt=1077.12M eltcnt/sec=1.57078G/s
BM_bcmp<uint32_t, Identical>_BigO              0.71 N          0.71 N
BM_bcmp<uint32_t, Identical>_RMS                 42 %            42 %
<...>
BM_bcmp<uint64_t, Identical>/64000            50138 ns        50138 ns        10909 bytes_read/iteration=1000k bytes_read/sec=19.0209G/s eltcnt=698.176M eltcnt/sec=1.27647G/s
BM_bcmp<uint64_t, Identical>_BigO              0.84 N          0.84 N
BM_bcmp<uint64_t, Identical>_RMS                 27 %            27 %
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000      192405 ns       192392 ns         3638 bytes_read/iteration=1000k bytes_read/sec=4.95694G/s eltcnt=1.86266G eltcnt/sec=2.66124G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO          0.38 N          0.38 N
BM_bcmp<uint8_t, InequalHalfway>_RMS              3 %             3 %
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000     127858 ns       127860 ns         5477 bytes_read/iteration=1000k bytes_read/sec=7.45873G/s eltcnt=1.40211G eltcnt/sec=2.00219G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO         0.50 N          0.50 N
BM_bcmp<uint16_t, InequalHalfway>_RMS             0 %             0 %
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000      49140 ns        49140 ns        14281 bytes_read/iteration=1000k bytes_read/sec=19.4072G/s eltcnt=1.82797G eltcnt/sec=2.60478G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO         0.40 N          0.40 N
BM_bcmp<uint32_t, InequalHalfway>_RMS            18 %            18 %
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000       32101 ns        32099 ns        21786 bytes_read/iteration=1000k bytes_read/sec=29.7101G/s eltcnt=1.3943G eltcnt/sec=1.99381G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO         0.50 N          0.50 N
BM_bcmp<uint64_t, InequalHalfway>_RMS             1 %             1 %
RUNNING: build-new/test/llvm-bcmp-bench --benchmark_out=/tmp/tmpQ46PP0
2019-04-25 21:19:29
Running build-new/test/llvm-bcmp-bench
Run on (8 X 4000 MHz CPU s)
CPU Caches:
  L1 Data 16K (x8)
  L1 Instruction 64K (x4)
  L2 Unified 2048K (x4)
  L3 Unified 8192K (x1)
Load Average: 1.01, 2.85, 3.71
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000            18593 ns        18590 ns        37565 bytes_read/iteration=1000k bytes_read/sec=51.2991G/s eltcnt=19.2333G eltcnt/sec=27.541G/s
BM_bcmp<uint8_t, Identical>_BigO               0.04 N          0.04 N
BM_bcmp<uint8_t, Identical>_RMS                  37 %            37 %
<...>
BM_bcmp<uint16_t, Identical>/256000           18950 ns        18948 ns        37223 bytes_read/iteration=1000k bytes_read/sec=50.3324G/s eltcnt=9.52909G eltcnt/sec=13.511G/s
BM_bcmp<uint16_t, Identical>_BigO              0.08 N          0.08 N
BM_bcmp<uint16_t, Identical>_RMS                 34 %            34 %
<...>
BM_bcmp<uint32_t, Identical>/128000           18627 ns        18627 ns        37895 bytes_read/iteration=1000k bytes_read/sec=51.198G/s eltcnt=4.85056G eltcnt/sec=6.87168G/s
BM_bcmp<uint32_t, Identical>_BigO              0.16 N          0.16 N
BM_bcmp<uint32_t, Identical>_RMS                 35 %            35 %
<...>
BM_bcmp<uint64_t, Identical>/64000            18855 ns        18855 ns        37458 bytes_read/iteration=1000k bytes_read/sec=50.5791G/s eltcnt=2.39731G eltcnt/sec=3.3943G/s
BM_bcmp<uint64_t, Identical>_BigO              0.32 N          0.32 N
BM_bcmp<uint64_t, Identical>_RMS                 33 %            33 %
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000        9570 ns         9569 ns        73500 bytes_read/iteration=1000k bytes_read/sec=99.6601G/s eltcnt=37.632G eltcnt/sec=53.5046G/s
BM_bcmp<uint8_t, InequalHalfway>_BigO          0.02 N          0.02 N
BM_bcmp<uint8_t, InequalHalfway>_RMS             29 %            29 %
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000       9547 ns         9547 ns        74343 bytes_read/iteration=1000k bytes_read/sec=99.8971G/s eltcnt=19.0318G eltcnt/sec=26.8159G/s
BM_bcmp<uint16_t, InequalHalfway>_BigO         0.04 N          0.04 N
BM_bcmp<uint16_t, InequalHalfway>_RMS            29 %            29 %
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000       9396 ns         9394 ns        73521 bytes_read/iteration=1000k bytes_read/sec=101.518G/s eltcnt=9.41069G eltcnt/sec=13.6255G/s
BM_bcmp<uint32_t, InequalHalfway>_BigO         0.08 N          0.08 N
BM_bcmp<uint32_t, InequalHalfway>_RMS            30 %            30 %
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000        9499 ns         9498 ns        73802 bytes_read/iteration=1000k bytes_read/sec=100.405G/s eltcnt=4.72333G eltcnt/sec=6.73808G/s
BM_bcmp<uint64_t, InequalHalfway>_BigO         0.16 N          0.16 N
BM_bcmp<uint64_t, InequalHalfway>_RMS            28 %            28 %
Comparing build-old/test/llvm-bcmp-bench to build-new/test/llvm-bcmp-bench
Benchmark                                                  Time             CPU      Time Old      Time New       CPU Old       CPU New
---------------------------------------------------------------------------------------------------------------------------------------
<...>
BM_bcmp<uint8_t, Identical>/512000                      -0.9570         -0.9570        432131         18593        432101         18590
<...>
BM_bcmp<uint16_t, Identical>/256000                     -0.8826         -0.8826        161408         18950        161409         18948
<...>
BM_bcmp<uint32_t, Identical>/128000                     -0.7714         -0.7714         81497         18627         81488         18627
<...>
BM_bcmp<uint64_t, Identical>/64000                      -0.6239         -0.6239         50138         18855         50138         18855
<...>
BM_bcmp<uint8_t, InequalHalfway>/512000                 -0.9503         -0.9503        192405          9570        192392          9569
<...>
BM_bcmp<uint16_t, InequalHalfway>/256000                -0.9253         -0.9253        127858          9547        127860          9547
<...>
BM_bcmp<uint32_t, InequalHalfway>/128000                -0.8088         -0.8088         49140          9396         49140          9394
<...>
BM_bcmp<uint64_t, InequalHalfway>/64000                 -0.7041         -0.7041         32101          9499         32099          9498

What can we tell from the benchmark?

  • Performance of naive equality check somewhat improves with element size, maxing out at eltcnt/sec=1.58603G/s for uint16_t, or bytes_read/sec=19.0209G/s for uint64_t. I think, that instability implies performance problems.
  • Performance of memcmp()-aware benchmark always maxes out at around bytes_read/sec=51.2991G/s for every type. That is 2.6x the throughput of the naive variant!
  • eltcnt/sec metric for the memcmp()-aware benchmark maxes out at eltcnt/sec=27.541G/s for uint8_t (was: eltcnt/sec=1.18491G/s, so 24x) and linearly decreases with element size. For uint64_t, it's ~4x+ the elements/second.
  • The call obvious is more pricey than the loop, with small element count. As it can be seen from the full output , the memcmp() is almost universally worse, independent of the element size (and thus buffer size) when element count is less than 8.

So all in all, bcmp idiom does indeed pose untapped performance headroom.
This diff does implement said idiom recognition. I think a reasonable test
coverage is present, but do tell if there is anything obvious missing.

Now, quality. This does succeed to build and pass the test-suite, at least
without any non-bundled elements.


This transform fires 91 times:

$ /build/test-suite/utils/compare.py -m loop-idiom.NumBCmp result-new.json
Tests: 1149
Metric: loop-idiom.NumBCmp

Program                                         result-new

MultiSourc...Benchmarks/7zip/7zip-benchmark    79.00
MultiSource/Applications/d/make_dparser         3.00
SingleSource/UnitTests/vla                      2.00
MultiSource/Applications/Burg/burg              1.00
MultiSourc.../Applications/JM/lencod/lencod     1.00
MultiSource/Applications/lemon/lemon            1.00
MultiSource/Benchmarks/Bullet/bullet            1.00
MultiSourc...e/Benchmarks/MallocBench/gs/gs     1.00
MultiSourc...gs-C/TimberWolfMC/timberwolfmc     1.00
MultiSourc...Prolangs-C/simulator/simulator     1.00

The size changes are:
I'm not sure what's going on with SingleSource/UnitTests/vla.test yet, did not look.

$ /build/test-suite/utils/compare.py -m size..text result-{old,new}.json --filter-hash
Tests: 1149
Same hash: 907 (filtered out)
Remaining: 242
Metric: size..text

Program                                        result-old result-new diff
test-suite...ingleSource/UnitTests/vla.test   753.00     833.00     10.6%
test-suite...marks/7zip/7zip-benchmark.test   1001697.00 966657.00  -3.5%
test-suite...ngs-C/simulator/simulator.test   32369.00   32321.00   -0.1%
test-suite...plications/d/make_dparser.test   89585.00   89505.00   -0.1%
test-suite...ce/Applications/Burg/burg.test   40817.00   40785.00   -0.1%
test-suite.../Applications/lemon/lemon.test   47281.00   47249.00   -0.1%
test-suite...TimberWolfMC/timberwolfmc.test   250065.00  250113.00   0.0%
test-suite...chmarks/MallocBench/gs/gs.test   149889.00  149873.00  -0.0%
test-suite...ications/JM/lencod/lencod.test   769585.00  769569.00  -0.0%
test-suite.../Benchmarks/Bullet/bullet.test   770049.00  770049.00   0.0%
test-suite...HMARK_ANISTROPIC_DIFFUSION/128    NaN        NaN        nan%
test-suite...HMARK_ANISTROPIC_DIFFUSION/256    NaN        NaN        nan%
test-suite...CHMARK_ANISTROPIC_DIFFUSION/64    NaN        NaN        nan%
test-suite...CHMARK_ANISTROPIC_DIFFUSION/32    NaN        NaN        nan%
test-suite...ENCHMARK_BILATERAL_FILTER/64/4    NaN        NaN        nan%
Geomean difference                                                   nan%
         result-old    result-new       diff
count  1.000000e+01  10.00000      10.000000
mean   3.152090e+05  311695.40000  0.006749
std    3.790398e+05  372091.42232  0.036605
min    7.530000e+02  833.00000    -0.034981
25%    4.243300e+04  42401.00000  -0.000866
50%    1.197370e+05  119689.00000 -0.000392
75%    6.397050e+05  639705.00000 -0.000005
max    1.001697e+06  966657.00000  0.106242

I don't have timings though.

And now to the code. The basic idea is to completely replace the whole loop.
If we can't fully kill it, don't transform.
I have left one or two comments in the code, so hopefully it can be understood.

Also, there is a few TODO's that i have left for follow-ups:

  • widening of memcmp()/bcmp()
  • step smaller than the comparison size
  • Metadata propagation
  • more than two blocks as long as there is still a single backedge?
  • ???

Diff Detail

Event Timeline

lebedev.ri created this revision.Apr 25 2019, 1:22 PM

Rebased, ping, any thoughts, no matter how high-level?

reames requested changes to this revision.May 28 2019, 4:14 PM

Initial set of changes to reduce patch complexity. Note that I've skipped most of the interesting logic and will not review it until unrelated pieces have been split apart.

lib/Transforms/Scalar/LoopIdiomRecognize.cpp
147 ↗(On Diff #198225)

The addition of ORE appears to be a separable change which causes test diffs of it's own. Please split and review separately.

376 ↗(On Diff #198225)

Unrelated changes, please split and land. (Applies for almost all LLVM_DEBUG usage.)

1853 ↗(On Diff #198225)

Nope, remove.

1976 ↗(On Diff #198225)

You don't appear to be bailing out for atomic or volatile loads.

1989 ↗(On Diff #198225)

Just mod 8. LLVM does not support bytes of other size.

2127 ↗(On Diff #198225)

Actually, you do handle volatile. Move this closer to the def for readability would you?

test/Transforms/LoopIdiom/bcmp-basic.ll
2 ↗(On Diff #198225)

Unrelated test change. Remove please.

This revision now requires changes to proceed.May 28 2019, 4:14 PM
lebedev.ri marked 6 inline comments as done.

Thank you for taking a look!
I believe i have addressed review notes.

lebedev.ri added inline comments.
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1989 ↗(On Diff #198225)

Is this sufficiently-obscure, or shall i inline the variable, too?

2127 ↗(On Diff #198225)

The important difference is - if i bailout at the very beginning, no further analysis will be done.
But if i bailout here, then if this fires then one who sees this diagnostic may know that
the issue here isn't the volatile loads themselves, but rather the missing codegen.
As in, all other legality checks have passed here.

With that info, the comment still stands?

test/Transforms/LoopIdiom/bcmp-basic.ll
2 ↗(On Diff #198225)

How is this unrelated? This is one of the dedicated test files for this diff,
and that change is showing that the patch does not anger any of these verifiers.

Use ore::NV

(Review interrupted, a few useful comments, more to come)

lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1989 ↗(On Diff #198225)

Looks fine.

2127 ↗(On Diff #198225)

Really not sure what that gives the user? If the loop contains a volatile or ordered load, we're done. The transform isn't going to run, no matter what other legality predicates are met.

1924 ↗(On Diff #202070)

Using m_CombineAnd like this is correct, but I found quite confusing. You might want to consider pulling out a helper function and using early return matching instead.

e.g.
struct MatchedBCmpCompairsonState {...};
bool matchBCmpComparison(Instruction *Term, MatchState&)

2020 ↗(On Diff #202070)

If the loop has unique exits, then each phi must have a single predecessor which must be in the loop. LoopSimplify creates unique exits, so I'd just bail on the non-unique case.

2036 ↗(On Diff #202070)

Given the base of each object and the length, you can form a MemoryLocation for the entire portion being read from both arrays. Given that, a simple alias query can check for modification or ordering.

MemoryLocation LocA(Src1, Size);
for (Instruction &I : instructions(L))

if (I aliases LocA) return false;

Note that if you length is non-constant, this ends up being quite conservative, but so does your current code.

See also: mayLoopAccessLocation

test/Transforms/LoopIdiom/bcmp-basic.ll
2 ↗(On Diff #198225)

Presumably it didn't anger any of the verifiiers before, so the change can be checked in and doesn't need to be part of this diff.

test/Transforms/LoopIdiom/bcmp-debugify-remarks.ll
2 ↗(On Diff #202070)

Same here.

test/Transforms/LoopIdiom/bcmp-widening.ll
2 ↗(On Diff #202070)

Same here.

lebedev.ri marked 2 inline comments as done.May 31 2019, 12:31 PM

Thank you for taking a look!

I think it is unproductive for me to reply/fix until there is a review with the entire change in mind, not parts of it.

lib/Transforms/Scalar/LoopIdiomRecognize.cpp
2127 ↗(On Diff #198225)

(note the OptimizationRemarkMissed and the LLVM_DEBUG printout?)

1924 ↗(On Diff #202070)

I'm sorry, i do not understand what you are saying here.

This for sure will need refactoring when adding memcmp support,
but until then i'm not yet seeing how to rewrite this
while improving readability at the same time.

2020 ↗(On Diff #202070)

Similarly, i do not understand what you are saying here.
You have read the comment just before these loops?

// No loop instructions must be used outside of the loop.

It doesn't matter whether we are in LCSSA form or not (we are),
whether we have dedicated exits or not (we do),
whether we have unique exits or not (that's the blocks we are checking here, so we do?).

The check is specifically that no instructions "other than the latches" are used outside of the loop.
As in, that we can fully erase the entire loop and replace it with a single bcmp call.
It's great the LCSSA form allows us to only check the PHI nodes, but i'm not sure how else
the check i'm looking for can be performed.

This is tested with @loop_instruction_used_in_phi_node_outside_loop in bcmp-negative-tests.ll.

2036 ↗(On Diff #202070)

I think i should wait until there is a full review..
Similarly, i'm not sure what this is about.
The end goal here is completely zap the entire old loop,
to fully delete it out of existence, replace with a single bcmp call.
If the loop has any other side-effects aside from those you'd get if you inline bcmp,
then the loop can not be transformed.

test/Transforms/LoopIdiom/bcmp-basic.ll
2 ↗(On Diff #198225)

Presumably it didn't anger any of the verifiiers before

But that is precisely the point here, those options aren't free, why have them in-tree for
who knows how long (until this change lands, which is far from given) until then?

nikic added a subscriber: nikic.Jun 5 2019, 2:02 PM
nikic added inline comments.
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
2129 ↗(On Diff #202070)

Is this kind reasoning legal on the level of LLVM IR? Especially if no inbounds GEPs are involved, aren't we just dealing in raw memory and there isn't necessary any object with representable size involved?

(Context: Wondering about this in https://reviews.llvm.org/D61934#inline-559489 and your comment seems like a plausible explanation.)

lebedev.ri marked an inline comment as done.Jun 5 2019, 2:25 PM
lebedev.ri added inline comments.
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
2129 ↗(On Diff #202070)

I don't have an answer.
The obvious alternative would be to do saturating multiplication, i guess?

xbolva00 added a subscriber: xbolva00.

This seems interesting!

Since @courbet did some work with bcmp, I added him as a reviewer.

Yeah, based on reviews/time on my patches, loop-related reviews seems most congested.

courbet added inline comments.Aug 22 2019, 2:24 PM
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1924 ↗(On Diff #202070)

I think what was meant is: If you refactor the matches to separate functions, the control flow for this function becomes clearer:

struct CmpLoopStructure {
  Value *BCmpValue, *LatchCmpValue;
  BasicBlock *HeaderBrEqualBB, *HeaderBrUnequalBB;
  BasicBlock *LatchBrFinishBB, *LatchBrContinueBB;
};

CmpLoopStructure CmpLoop;

if (!matchCmpLoopStructure(LoopHeaderBB, CmpLoop)) {
  return false;
}

struct CmpOfLoads {
  ICmpInst::Predicate BCmpPred;
  Value *LoadSrcA, *LoadSrcB;
  Value *LoadA_, *LoadB_;
}

if (!matchCmpOfLoads(BcmpLoop.BcmpValue, CmpOfLoads)) {
  return false;
}
1947 ↗(On Diff #202070)

I don't find the comment very clear. Maybe rephrase as: "FIXME: memcmp()/bcmp() calls have the same semantics as icmp. Match them too."

1953 ↗(On Diff #202070)

This is redundant with the previous check. Do you want to remove it from the previous one (this will make handling memcmp() easier in the future).

1961 ↗(On Diff #202070)

Typo: "Be vary"

1986 ↗(On Diff #202070)

If you refactor as mentioned above, you can do the casts in the subfunctions. That being said I don't think the framework allows the cast during the match, which would be cool.

2129 ↗(On Diff #202070)

Saturating multiplication would actually change the semantics of the code. The memset loop idiom already has this issue actually.
I don't think the explanation is convincing because we could indeed be loading from raw memory. I guess the only thing that is saving us right now is that it's unlikely that people are memset/memcmp-ing more than size_t-max bytes of memory in real life. But the optimization is buggy :(

lebedev.ri marked 8 inline comments as done and an inline comment as not done.

@courbet thank you for taking a look!

Rebased, chopped LoopIdiomRecognize::detectBCmpIdiom() into several smaller functions.

lebedev.ri added inline comments.Aug 23 2019, 1:51 AM
lib/Transforms/Scalar/LoopIdiomRecognize.cpp
1986 ↗(On Diff #202070)

Indeed, you can't cast during patternmatch, this seems like the best place for this.

2129 ↗(On Diff #202070)

Added FIXME for now.

courbet accepted this revision.Aug 23 2019, 10:12 AM

LGTM. Maybe let's way a couple days to see if @reames has blocking issues ?

llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
228

Is there any reason for the trailing _ ?

2579

rm

lebedev.ri added inline comments.Aug 23 2019, 10:20 AM
llvm/lib/Transforms/Scalar/LoopIdiomRecognize.cpp
228

Right. No reason now that they are in a struct. I'll drop trailing _.

lebedev.ri marked 2 inline comments as done.

Dropped unneeded _ from member variable names.

LGTM. Maybe let's way a couple days to see if @reames has blocking issues ?

@reames please could you clarify whether you have blocking issues here?

Any other loop folk want to review? :)

FWIW i personally don't expect this to have any issues, the legality check
is really (too?) strict and the rewriting part is doing the right thing
in all the edge-cases i have stumbled into in test-suite.

On Wed, Aug 28, 2019 at 11:44 PM Philip Reames <> wrote:

I have not been following the thread recently; you do not need to wait
for me if someone else has reviewed the code.

Philip

Well, okay then.
Reviews never catch everything anyways.

Proceeding to landing.

This revision was not accepted when it landed; it landed in state Needs Review.Aug 30 2019, 2:56 AM
This revision was automatically updated to reflect the committed changes.
lebedev.ri reopened this revision.Sep 3 2019, 10:14 AM

Reverted in rL370788 due to https://bugs.llvm.org/show_bug.cgi?id=43206, will investigate...

This revision was not accepted when it landed; it landed in state Needs Review.Oct 12 2019, 8:37 AM
This revision was automatically updated to reflect the committed changes.

I found a strange sanitizer error as a result of this revision, could you take a quick look at https://bugs.llvm.org/show_bug.cgi?id=43870?

lebedev.ri abandoned this revision.Nov 2 2019, 2:50 AM

Since this was reverted I am wondering..

libstdc++'s std::equal() is specialized to directly call memcmp() for various types.

Is this legal? If yes, libcxx should do it too.