Page MenuHomePhabricator

[builtins] Make __umodsi3/__udivdi3/__umoddi3 standalone (shift and subtract)
ClosedPublic

Authored by MaskRay on Apr 10 2020, 3:34 PM.

Details

Summary

@kamleshbhalui reported that when the Standard Extension M
(Multiplication and Division) is disabled for RISC-V,
__udivdi3 will call __udivmodti4 which will in turn calls __udivdi3.

This patch moves __udivsi3 (shift and subtract) to int_div_impl.inc
__udivXi3, optimize a bit, add a __umodXi3, and use __udivXi3 and
__umodXi3 to define __udivsi3 __umodsi3 __udivdi3 __umoddi3.

Diff Detail

Event Timeline

MaskRay created this revision.Apr 10 2020, 3:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 10 2020, 3:34 PM
Herald added subscribers: Restricted Project, luismarques, s.egerton and 3 others. · View Herald Transcript
kamleshbhalui requested changes to this revision.Apr 10 2020, 6:53 PM

Code review in D77744 is performance-wise better than this.

This revision now requires changes to proceed.Apr 10 2020, 6:53 PM
MaskRay updated this revision to Diff 256743.Apr 10 2020, 10:26 PM

Add __builtin_clzll to speed up

kamleshbhalui requested changes to this revision.Apr 11 2020, 12:59 AM

it's still less optimal.

This revision now requires changes to proceed.Apr 11 2020, 12:59 AM

it's still less optimal.

How?

for this input
a=3325549423267495579 ,b=2351357301745665781
and compiled with -O2 with clang-9.
it takes 299 ns.
while D77744 takes only 91ns.

here is compiler detils i tried upon.

clang version 9.0.0 (tags/RELEASE_900/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/clang_9.0.0/bin

MaskRay updated this revision to Diff 256868.Apr 12 2020, 11:40 AM
MaskRay retitled this revision from [builtins] Implement __udivdi3 and __umoddi3 (shift and subtract) to [builtins] Make __umodsi3/__udivdi3/__umoddi3 standalone (shift and subtract).
MaskRay edited the summary of this revision. (Show Details)

Re-purpose

here is compiler detils i tried upon.

clang version 9.0.0 (tags/RELEASE_900/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/clang_9.0.0/bin

OK. I benchmarked __udivsi3. That version is actually slightly faster than D77744. I moved that function to int_div_impl.inc

kamleshbhalui accepted this revision.Apr 12 2020, 6:33 PM

Thanks, LGTM.

This revision is now accepted and ready to land.Apr 12 2020, 6:33 PM

OK. I benchmarked __udivsi3. That version is actually slightly faster than D77744. I moved that function to int_div_impl.inc

What was the benchmark methodology?
I'm wondering if PC benchmark results make sense for the kinds of targets that are likely to benefit from these software implementations. For microcontrollers I also wonder if a simpler but smaller implementation might be a better trade-off, but I understand that the code came from udivsi3.c so that concern is arguably outside the scope of this patch.
Ideally this patch should have test changes that show its impact and how it fixes the problem.

MaskRay added a comment.EditedApr 13 2020, 2:00 PM

OK. I benchmarked __udivsi3. That version is actually slightly faster than D77744. I moved that function to int_div_impl.inc

What was the benchmark methodology?

int main() {
#ifdef B
  du_int s=1;
  for (int i=0; i <100000000; i++) {
    du_int a = xorshift64(&s);
    du_int b = xorshift64(&s);
    du_int c = div(a, b);
    asm volatile("" : : "r"(c));
  }
#else

  for (int a = 0; a < 2000; a++)
    for (int b = 1; b < 2000; b++) {
      if (div(a, b) != a/b) {
        printf("%d/%d\n", a, b);
        return 1;
      }
      if (mod(a, b) != a%b) {
        printf("%d%%%d\n", a, b);
        return 1;
      }
    }
#endif
}

-DB for benchmark. perf stat -r 30 ./a
The default for correctness. compiler-rt/test/builtins/Unit/ has some large number tests.

I'm wondering if PC benchmark results make sense for the kinds of targets that are likely to benefit from these software implementations. For microcontrollers I also wonder if a simpler but smaller implementation might be a better trade-off, but I understand that the code came from udivsi3.c so that concern is arguably outside the scope of this patch.

I don't think the performance matters that much. compiler-rt/lib/builtins as is is not optimized for every architecture which may lack division instructions. We probably want to go the libgcc extreme route.

Ideally this patch should have test changes that show its impact and how it fixes the problem.

compiler-rt/test/builtins/Unit/ has some tests. This is a runtime issue which only arises in some configurations. I believe riscv32 witht M extension can trigger the problem. It is not really feasible to have a unittest.

This revision was automatically updated to reflect the committed changes.
bjope added a subscriber: bjope.Apr 16 2020, 2:01 AM
bjope added inline comments.
compiler-rt/lib/builtins/int_div_impl.inc
1

This broke things for us downstream (with different char size).

Do you mind if I change the condition to

sizeof(a) * CHAR_BIT== 64

Btw,. shouldn't there be a file header in this file with copyright/license information etc (see http://llvm.org/docs/CodingStandards.html#file-headers)? How could that slip through review?

bjope added inline comments.Apr 16 2020, 7:32 AM
compiler-rt/lib/builtins/int_div_impl.inc
1
MaskRay marked an inline comment as done.Apr 16 2020, 8:25 AM
MaskRay added inline comments.
compiler-rt/lib/builtins/int_div_impl.inc
1

This broke things for us downstream (with different char size).

Can you clarify the CHAR_BIT value on your platform? sizeof(a) * CHAR_BIT == 64 does not make the code clearer. A better version than the current one will be sizeof(a) == sizeof(unsigned long long)

Lots of places can have implication that CHAR_BIT=8. Importantly, POSIX CX requires CHAR_BIT to be 8.
Supporting a CHAR_BIT!=8 may be impractical in practice... For your particular problem, if sizeof(a) == sizeof(unsigned long long) looks good to you, I can fix that. I am more concerned that supporting a different CHAR_BIT!=8 will require numerous fixups everywhere in LLVM and will impose huge mental burden for various places people use bitwise arithemtics and many other operations.

bjope added inline comments.Apr 17 2020, 12:57 AM
compiler-rt/lib/builtins/int_div_impl.inc
1

We certainly got some other fixes downstream for CHAR_BIT==16, so this was just "yet another" problem that popped up for something that has been working in the past. I can keep this as yet-another-size-of-char-fixup downstream, but thought it wouldn't hurt to make this change.

I suggested using CHAR_BIT for two reasons:

  1. I was a bit fooled by the defines in int_lib.h that __builtin_clzll was expecting uint64_t as input. But now I realize that "unsigned long long" is the correct type for the argument.
  2. I got the feeling that sizeof(...) * CHAR_BIT was a quite common pattern in compiler-rt/lib/builtins.

I have updated https://reviews.llvm.org/D78300 according to your suggestion. It makes sense considering (1) above. Maybe we can continue the discussion in that review.