This is an archive of the discontinued LLVM Phabricator instance.

[builtins] Make __umodsi3/__udivdi3/__umoddi3 standalone (shift and subtract)
ClosedPublic

Authored by MaskRay on Apr 10 2020, 3:34 PM.

Details

Summary

@kamleshbhalui reported that when the Standard Extension M
(Multiplication and Division) is disabled for RISC-V,
__udivdi3 will call __udivmodti4 which will in turn calls __udivdi3.

This patch moves __udivsi3 (shift and subtract) to int_div_impl.inc
__udivXi3, optimize a bit, add a __umodXi3, and use __udivXi3 and
__umodXi3 to define __udivsi3 __umodsi3 __udivdi3 __umoddi3.

Diff Detail

Event Timeline

MaskRay created this revision.Apr 10 2020, 3:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 10 2020, 3:34 PM
Herald added subscribers: Restricted Project, luismarques, s.egerton and 3 others. · View Herald Transcript
kamleshbhalui requested changes to this revision.Apr 10 2020, 6:53 PM

Code review in D77744 is performance-wise better than this.

This revision now requires changes to proceed.Apr 10 2020, 6:53 PM
MaskRay updated this revision to Diff 256743.Apr 10 2020, 10:26 PM

Add __builtin_clzll to speed up

kamleshbhalui requested changes to this revision.Apr 11 2020, 12:59 AM

it's still less optimal.

This revision now requires changes to proceed.Apr 11 2020, 12:59 AM

it's still less optimal.

How?

for this input
a=3325549423267495579 ,b=2351357301745665781
and compiled with -O2 with clang-9.
it takes 299 ns.
while D77744 takes only 91ns.

here is compiler detils i tried upon.

clang version 9.0.0 (tags/RELEASE_900/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/clang_9.0.0/bin

MaskRay updated this revision to Diff 256868.Apr 12 2020, 11:40 AM
MaskRay retitled this revision from [builtins] Implement __udivdi3 and __umoddi3 (shift and subtract) to [builtins] Make __umodsi3/__udivdi3/__umoddi3 standalone (shift and subtract).
MaskRay edited the summary of this revision. (Show Details)

Re-purpose

here is compiler detils i tried upon.

clang version 9.0.0 (tags/RELEASE_900/final)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/local/clang_9.0.0/bin

OK. I benchmarked __udivsi3. That version is actually slightly faster than D77744. I moved that function to int_div_impl.inc

kamleshbhalui accepted this revision.Apr 12 2020, 6:33 PM

Thanks, LGTM.

This revision is now accepted and ready to land.Apr 12 2020, 6:33 PM

OK. I benchmarked __udivsi3. That version is actually slightly faster than D77744. I moved that function to int_div_impl.inc

What was the benchmark methodology?
I'm wondering if PC benchmark results make sense for the kinds of targets that are likely to benefit from these software implementations. For microcontrollers I also wonder if a simpler but smaller implementation might be a better trade-off, but I understand that the code came from udivsi3.c so that concern is arguably outside the scope of this patch.
Ideally this patch should have test changes that show its impact and how it fixes the problem.

MaskRay added a comment.EditedApr 13 2020, 2:00 PM

OK. I benchmarked __udivsi3. That version is actually slightly faster than D77744. I moved that function to int_div_impl.inc

What was the benchmark methodology?

int main() {
#ifdef B
  du_int s=1;
  for (int i=0; i <100000000; i++) {
    du_int a = xorshift64(&s);
    du_int b = xorshift64(&s);
    du_int c = div(a, b);
    asm volatile("" : : "r"(c));
  }
#else

  for (int a = 0; a < 2000; a++)
    for (int b = 1; b < 2000; b++) {
      if (div(a, b) != a/b) {
        printf("%d/%d\n", a, b);
        return 1;
      }
      if (mod(a, b) != a%b) {
        printf("%d%%%d\n", a, b);
        return 1;
      }
    }
#endif
}

-DB for benchmark. perf stat -r 30 ./a
The default for correctness. compiler-rt/test/builtins/Unit/ has some large number tests.

I'm wondering if PC benchmark results make sense for the kinds of targets that are likely to benefit from these software implementations. For microcontrollers I also wonder if a simpler but smaller implementation might be a better trade-off, but I understand that the code came from udivsi3.c so that concern is arguably outside the scope of this patch.

I don't think the performance matters that much. compiler-rt/lib/builtins as is is not optimized for every architecture which may lack division instructions. We probably want to go the libgcc extreme route.

Ideally this patch should have test changes that show its impact and how it fixes the problem.

compiler-rt/test/builtins/Unit/ has some tests. This is a runtime issue which only arises in some configurations. I believe riscv32 witht M extension can trigger the problem. It is not really feasible to have a unittest.

This revision was automatically updated to reflect the committed changes.
bjope added a subscriber: bjope.Apr 16 2020, 2:01 AM
bjope added inline comments.
compiler-rt/lib/builtins/int_div_impl.inc
2

This broke things for us downstream (with different char size).

Do you mind if I change the condition to

sizeof(a) * CHAR_BIT== 64

Btw,. shouldn't there be a file header in this file with copyright/license information etc (see http://llvm.org/docs/CodingStandards.html#file-headers)? How could that slip through review?

bjope added inline comments.Apr 16 2020, 7:32 AM
compiler-rt/lib/builtins/int_div_impl.inc
2
MaskRay marked an inline comment as done.Apr 16 2020, 8:25 AM
MaskRay added inline comments.
compiler-rt/lib/builtins/int_div_impl.inc
2

This broke things for us downstream (with different char size).

Can you clarify the CHAR_BIT value on your platform? sizeof(a) * CHAR_BIT == 64 does not make the code clearer. A better version than the current one will be sizeof(a) == sizeof(unsigned long long)

Lots of places can have implication that CHAR_BIT=8. Importantly, POSIX CX requires CHAR_BIT to be 8.
Supporting a CHAR_BIT!=8 may be impractical in practice... For your particular problem, if sizeof(a) == sizeof(unsigned long long) looks good to you, I can fix that. I am more concerned that supporting a different CHAR_BIT!=8 will require numerous fixups everywhere in LLVM and will impose huge mental burden for various places people use bitwise arithemtics and many other operations.

bjope added inline comments.Apr 17 2020, 12:57 AM
compiler-rt/lib/builtins/int_div_impl.inc
2

We certainly got some other fixes downstream for CHAR_BIT==16, so this was just "yet another" problem that popped up for something that has been working in the past. I can keep this as yet-another-size-of-char-fixup downstream, but thought it wouldn't hurt to make this change.

I suggested using CHAR_BIT for two reasons:

  1. I was a bit fooled by the defines in int_lib.h that __builtin_clzll was expecting uint64_t as input. But now I realize that "unsigned long long" is the correct type for the argument.
  2. I got the feeling that sizeof(...) * CHAR_BIT was a quite common pattern in compiler-rt/lib/builtins.

I have updated https://reviews.llvm.org/D78300 according to your suggestion. It makes sense considering (1) above. Maybe we can continue the discussion in that review.

tonic added a subscriber: tonic.Apr 21 2022, 12:06 PM

Unfortunately, this patch violates copyright of the The PowerPC Compiler Writer's Guide. Please revert this patch, remove the copyrighted code, and you can submit a new patch for the other changes. For future patches that use guides with copyright, please consult with the LLVM Foundation board before committing. Thank you!

Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2022, 12:06 PM
MaskRay added a subscriber: ddunbar.EditedApr 21 2022, 12:42 PM

Unfortunately, this patch violates copyright of the The PowerPC Compiler Writer's Guide. Please revert this patch, remove the copyrighted code, and you can submit a new patch for the other changes. For future patches that use guides with copyright, please consult with the LLVM Foundation board before committing. Thank you!

If this request is related to "// Translated from Figure 3-40 of The PowerPC Compiler Writer's Guide" in the code, then I think it should apply to the very first commit
fd089990f76ec6392dbd0138ab874d90c632d2a4 (2009) by @ddunbar .

This patch merely copied this old comment. I don't think I am the appropriate one to take any action here.

Please also read the nature of the patch :)

tonic added a comment.Apr 21 2022, 1:25 PM

Unfortunately, this patch violates copyright of the The PowerPC Compiler Writer's Guide. Please revert this patch, remove the copyrighted code, and you can submit a new patch for the other changes. For future patches that use guides with copyright, please consult with the LLVM Foundation board before committing. Thank you!

If this request is related to "// Translated from Figure 3-40 of The PowerPC Compiler Writer's Guide" in the code, then I think it should apply to the very first commit
fd089990f76ec6392dbd0138ab874d90c632d2a4 (2009) by @ddunbar .

This patch merely copied this old comment. I don't think I am the appropriate one to take any action here.

Please also read the nature of the patch :)

My apologies for missing that. I will be creating a patch to remove this then and other references.

Unfortunately, this patch violates copyright of the The PowerPC Compiler Writer's Guide. Please revert this patch, remove the copyrighted code, and you can submit a new patch for the other changes. For future patches that use guides with copyright, please consult with the LLVM Foundation board before committing. Thank you!

If this request is related to "// Translated from Figure 3-40 of The PowerPC Compiler Writer's Guide" in the code, then I think it should apply to the very first commit
fd089990f76ec6392dbd0138ab874d90c632d2a4 (2009) by @ddunbar .

This patch merely copied this old comment. I don't think I am the appropriate one to take any action here.

Please also read the nature of the patch :)

My apologies for missing that. I will be creating a patch to remove this then and other references.

I believe that the comment may be referring to the algorithm rather than the implementation itself as the book provides assembly listings for PPC.

tonic added a comment.EditedApr 21 2022, 2:44 PM

Unfortunately, this patch violates copyright of the The PowerPC Compiler Writer's Guide. Please revert this patch, remove the copyrighted code, and you can submit a new patch for the other changes. For future patches that use guides with copyright, please consult with the LLVM Foundation board before committing. Thank you!

If this request is related to "// Translated from Figure 3-40 of The PowerPC Compiler Writer's Guide" in the code, then I think it should apply to the very first commit
fd089990f76ec6392dbd0138ab874d90c632d2a4 (2009) by @ddunbar .

This patch merely copied this old comment. I don't think I am the appropriate one to take any action here.

Please also read the nature of the patch :)

My apologies for missing that. I will be creating a patch to remove this then and other references.

I believe that the comment may be referring to the algorithm rather than the implementation itself as the book provides assembly listings for PPC.

The LLVM Foundation legal counsel has advised us that it violates copyright.

My apologies for missing that. I will be creating a patch to remove this then and other references.

I believe that the comment may be referring to the algorithm rather than the implementation itself as the book provides assembly listings for PPC.

The LLVM Foundation legal counsel has advised us that it violates copyright.

Okay, thank you for checking.

lkail added a subscriber: lkail.Apr 22 2022, 1:20 AM

I'm working with the IBM legal team to see if there is any way we can get an exception or workaround for the copyright (these probably aren't the correct legal terms, but hopefully people understand the intention). I can update here once I have more information.

lenary removed a subscriber: lenary.Apr 28 2022, 8:52 AM