This is an archive of the discontinued LLVM Phabricator instance.

[APInt] Optimize umul_ov
ClosedPublic

Authored by MaskRay on Apr 14 2019, 7:49 AM.

Diff Detail

Repository
rL LLVM

Event Timeline

MaskRay created this revision.Apr 14 2019, 7:49 AM
nikic added inline comments.Apr 14 2019, 8:34 AM
lib/Support/APInt.cpp
1912 ↗(On Diff #195067)

In this form, this will only be correct if the bit-width is *exactly* 64-bit. For smaller bit-widths you'll have to consider a combination of the intrinsic overflow flag, and whether any bits beyond the bit-width are set. (It might make sense to also special-case <= 32 bits, in which case checking bits in the result of a normal multiply would be sufficient.)

Is this trying to address some performance problem? Any numbers?

nikic added a comment.Apr 14 2019, 8:53 AM

Is this trying to address some performance problem? Any numbers?

I think even if it doesn't address a specific performance problem, this is a good change (well, once it is correct...), because you don't have to treat mul_ov as an expensive operation anymore. E.g. after this change I think we could drop the code in https://github.com/llvm-mirror/llvm/blob/2b38b84a12ff96d6862b7927f92cbdae77653deb/lib/Analysis/ValueTracking.cpp#L4001-L4008 and the big comment in the function without being concerned about a possible perf impact.

MaskRay updated this revision to Diff 195091.Apr 14 2019, 7:55 PM
MaskRay retitled this revision from [APInt] Use __builtin_[su]mulll_overflow for [su]mul_ov if available to [APInt] Optimize umul_ov.

Repurpose this patch

MaskRay marked 2 inline comments as done.Apr 14 2019, 8:10 PM

Is this trying to address some performance problem? Any numbers?

I'm using a -DBUILD_SHARED_LIBS=on build so there is some indirection cost. Anyway, The performance improvement is obvious:

#include <cstdlib>
#include <llvm/ADT/APInt.h>
using namespace std;

#define FOR(i, a, b) for (remove_cv<remove_reference<decltype(b)>::type>::type i = (a); i < (b); i++)
#define REP(i, n) FOR(i, 0, n)

int main() {
  const unsigned N = 64;
  REP(i, 10000) {
    unsigned a = rand(), b = rand();
    REP(i, 40000) {
      bool overflow;
      llvm::APInt A(N, a), B(N, b);
      auto C = A.umul_ov(B, overflow);
      asm volatile("" : : "r"(C[0]), "r"(overflow) : "memory");
    }
  }
}

Before (two udiv calls; actually one suffices): 12.8129 +- 0.0952 seconds time elapsed ( +- 0.74% )
After (leading zeros + mul + shift + plus): 3.6257 +- 0.0296 seconds time elapsed ( +- 0.82% )

lib/Support/APInt.cpp
1912 ↗(On Diff #195067)

Good catch. I forgot about that. Changed the purpose of this patch.

MaskRay marked an inline comment as done.Apr 14 2019, 8:13 PM

I have also fuzzed it for some time with libFuzzer to give me more confidence that it is correct.

#include <llvm/ADT/APInt.h>

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  using namespace llvm;
  unsigned n;
  uint64_t a, b;
  memcpy(&n, Data, 1);
  n = n%64+1;
  memcpy(&a, Data+1, 8);
  memcpy(&b, Data+9, 8);

  bool overflow, overflow1;
  APInt A(n, a), B(n, b), C = A.umul_ov(B, overflow), D = A * B;
  if (C != D) __builtin_trap();

  if (A != 0 && B != 0) overflow1 = D.udiv(A) != B;
  else overflow1 = false;
  if (overflow != overflow1) __builtin_trap();

  return 0;
}
MaskRay updated this revision to Diff 195158.Apr 15 2019, 5:58 AM

Fold a lz test

nikic accepted this revision.Apr 18 2019, 12:43 PM

LGTM. It might still make sense to optimize the single word case separately, but for the general case I don't think we can do much better than this.

lib/Support/APInt.cpp
1919 ↗(On Diff #195158)

Is there a reason to use operator*(RHS) over *this * RHS here?

This revision is now accepted and ready to land.Apr 18 2019, 12:43 PM
MaskRay updated this revision to Diff 195854.Apr 18 2019, 6:42 PM

operator*(RHS) -> *this * RHS

MaskRay marked an inline comment as done.Apr 18 2019, 6:45 PM
This revision was automatically updated to reflect the committed changes.