This is an archive of the discontinued LLVM Phabricator instance.

compiler-rt: Add udivmodei5 to builtins and add bitint library
AbandonedPublic

Authored by mgehre-amd on Feb 22 2022, 7:23 AM.

Details

Summary

According to the RFC [0], this review contains the compiler-rt parts of large integer divison for _BitInt.

It adds the functions

/// Computes the unsigned division of a / b.
/// Writes the quotient to quo and the remainder to rem.
///
/// words denotes the number of elements in a and b.
COMPILER_RT_ABI void __udivmodei5(su_int *quo, su_int *rem,
                                  const su_int *a, const su_int *b,
                                  unsigned int words);

/// Computes the signed division of a / b.
/// Writes the quotient to quo and the remainder to rem.
///
/// words denotes the number of elements in a and b.
/// Warning: Might modify a and b.
COMPILER_RT_ABI void __divmodei5(su_int *quo, su_int *rem,
                                 su_int *a, su_int *b,
                                 unsigned int words);

into builtins.
In addition it introduces a new "bitint" library containing only those new functions,
which is meant as a way to provide those when using libgcc as runtime.

[0] https://discourse.llvm.org/t/rfc-add-support-for-division-of-large-bitint-builtins-selectiondag-globalisel-clang/60329

Diff Detail

Event Timeline

mgehre-amd created this revision.Feb 22 2022, 7:23 AM
mgehre-amd requested review of this revision.Feb 22 2022, 7:23 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 22 2022, 7:23 AM
Herald added a subscriber: Restricted Project. · View Herald Transcript
programmerjake requested changes to this revision.Feb 22 2022, 11:57 AM

As far as I can tell, this code seems to assume it's running on a little-endian system. I think it needs to correctly handle what happens when bigints are stored in the other endian (where the first word is the most-significant).

Also, the division code seems to assume that the most-significant word is always non-zero or the whole value is zero, which would allow it to use the number of words as a proxy for how big the value is numerically, which isn't the case.

This revision now requires changes to proceed.Feb 22 2022, 11:57 AM

As far as I can tell, this code seems to assume it's running on a little-endian system. I think it needs to correctly handle what happens when bigints are stored in the other endian (where the first word is the most-significant).

Yes, that's a good observation. I don't have access to a big-endian system right now. Should I disable building those functions on big-endian systems for this PR?

Also, the division code seems to assume that the most-significant word is always non-zero or the whole value is zero, which would allow it to use the number of words as a proxy for how big the value is numerically, which isn't the case.

Could you please tell me why you think that? I don't believe it's true, and I think the tests in divmodei5_test.c show that,
specifically if (test__divmodei5(2, 1, 2, 0, nbits, a_scratch, b_scratch, quo_scratch, rem_scratch)) does a 2/1 division on integers of various sizes.

  • Mark functions as weak linkage when they are included into the bitint library
  • Ensure that helper functions are all declared 'static'
  • Uncomment some test cases in compiler-rt/test/bitint/Unit/bitint_test.c

As far as I can tell, this code seems to assume it's running on a little-endian system. I think it needs to correctly handle what happens when bigints are stored in the other endian (where the first word is the most-significant).

Yes, that's a good observation. I don't have access to a big-endian system right now. Should I disable building those functions on big-endian systems for this PR?

probably...

Also, the division code seems to assume that the most-significant word is always non-zero or the whole value is zero, which would allow it to use the number of words as a proxy for how big the value is numerically, which isn't the case.

Could you please tell me why you think that? I don't believe it's true, and I think the tests in divmodei5_test.c show that,
specifically if (test__divmodei5(2, 1, 2, 0, nbits, a_scratch, b_scratch, quo_scratch, rem_scratch)) does a 2/1 division on integers of various sizes.

I left an inline comment on one such place.

compiler-rt/lib/builtins/udivmodei5.c
293

here's one place where the number of words is incorrectly assumed to be some indicator of numeric value

Herald added a project: Restricted Project. · View Herald TranscriptMar 11 2022, 4:31 AM
fzou1 added a subscriber: fzou1.Mar 15 2022, 5:41 AM

Ping @mgehre-amd : Are you still working on this? I see that @programmerjake has a bit of feedback, was wondering if you had a chance to look at it.

Hey, yes it's still near the top of my TODO list. Had a few other things to do, but I'm planning to come back to this.

I was thinking how to properly test that my fixed version would work fine in big endian - any idea?

And I was considering whether it was less effort to rewrite this to fix the remaining comments or to start
with a simpler algorithm like a shift-subtract division. I'll first try to fix the current one and see how much that takes.

I unfortunately don't really understand this patch very well, but I believe @FreddyYe spent some time on a similar patch that was never upstreamed, so he's likely my company's best contributor for comments/opinions.

Hey, yes it's still near the top of my TODO list. Had a few other things to do, but I'm planning to come back to this.

I was thinking how to properly test that my fixed version would work fine in big endian - any idea?

IIRC I've had success using qemu-user-static and schroot and debootstrap to create a powerpc64 (BE) debian chroot that seems to work for me...my system is Ubuntu 20.04 x86_64. Unfortunately, I don't have any handy instructions for how to do that as I did it quite a while ago and I can't remember all the details.

To be honest, I'm not familiar in compiler-rt component. And my experience before was mainly to find out the feasibility to support arbitrary-bits-division for _BitInt(). Fortunately, my proposal was basically same as this one: emit a new libcall in backend; add library support; use malloc()... And I even used a more simple division algorithm model: sub-shift, which has been shared in the RFC link. So my experience before may not be very helpful here. Anyway I can try my best. So everyone is OK with use malloc(), right? That was my biggest concern. BTW FYI the largest _BitInt() support is 8,388,608 . Hi @mgehre-amd Will you upload another patch to set the bitint.a as a default link library of clang when this patch gets landed in? I believe FE will reopen _BitInt() > 128 bit unless division is supported by default but not by option --rtlib=compiler-rt. @erichkeane @aaron.ballman

compiler-rt/lib/builtins/udivmodei5.c
42

Copy from Erich's comment: can this be uint64_t type?

47

int n = 0;

101

unsigned

FreddyYe added inline comments.Mar 21 2022, 11:47 PM
compiler-rt/lib/builtins/udivmodei5.c
153

Maybe it'd better not to include <stdint.h> but to use macros in int_types.h like other integer APIs:

typedef int32_t si_int;
typedef uint32_t su_int;
typedef int64_t di_int;
typedef uint64_t du_int;
433

NFC align naming convention: BitWidth and bits

To be honest, I'm not familiar in compiler-rt component. And my experience before was mainly to find out the feasibility to support arbitrary-bits-division for _BitInt(). Fortunately, my proposal was basically same as this one: emit a new libcall in backend; add library support; use malloc()... And I even used a more simple division algorithm model: sub-shift, which has been shared in the RFC link. So my experience before may not be very helpful here. Anyway I can try my best. So everyone is OK with use malloc(), right? That was my biggest concern. BTW FYI the largest _BitInt() support is 8,388,608 . Hi @mgehre-amd Will you upload another patch to set the bitint.a as a default link library of clang when this patch gets landed in? I believe FE will reopen _BitInt() > 128 bit unless division is supported by default but not by option --rtlib=compiler-rt. @erichkeane @aaron.ballman

Yes, that is correct, we cannot enable _BitInt>128 unless we have the library support 'compiled in'. I believe we need to make bitint.a be a default library that gets linked in.

I opened the review for the clang changes (including linking bitint by default) here: https://reviews.llvm.org/D122234

So everyone is OK with use malloc(), right? That was my biggest concern.

imho if at all possible, the temporary buffers should be passed in from the caller rather than malloc-ed, that way they can be allocated on the stack (or heap, or somewhere else) by the caller who knows exactly how big they should be and can have them be a local variable -- no dynamic alloca or malloc needed. This would allow using >128-bit ints on embedded systems without malloc, and also allows the caller to use the same temporary for multiple division operations, rather than having to call the memory allocator every time.

So everyone is OK with use malloc(), right? That was my biggest concern.

imho if at all possible, the temporary buffers should be passed in from the caller rather than malloc-ed, that way they can be allocated on the stack (or heap, or somewhere else) by the caller who knows exactly how big they should be and can have them be a local variable -- no dynamic alloca or malloc needed. This would allow using >128-bit ints on embedded systems without malloc, and also allows the caller to use the same temporary for multiple division operations, rather than having to call the memory allocator every time.

This sounds like a really good idea to me. The LLVM code generation "Knows" how big this value needs to be and can do it on the stack of the caller. So it would generate IR the equivalent of:

void func_that_bitints(__BitInt(999) a, __BitInt(999) b) {

unsigned int buffer[1024/8]; // Or whatever
something_else = __divmodei5(buffer, <stuff>);

}

Right?

In that case I will go for a subtract-shift implementation because Knuth seem to require multiple temporary buffers, and it would be odd to expose all of them through the API.
It also means that I likely need to modify the LLVM backend to emit calls to divmodei5 and not divei4 or __modei4.

This sounds like a really good idea to me. The LLVM code generation "Knows" how big this value needs to be and can do it on the stack of the caller. So it would generate IR the equivalent of:

...

Right?

looks good to me! I think the buffer should be big enough for the Knuth division algorithm (the one APInt uses afaict), even if we decide we want the shift-subtract algorithm at first...that way we can upgrade later.

Implement a basic long division
Make it work on big endian machines
Only provide

void __divmodei5(su_int *quo, su_int *rem, su_int *a, su_int *b,
                               unsigned int words)

and

void __udivmodei5(su_int *quo, su_int *rem, const su_int *a,
                                const su_int *b, unsigned int words)

to avoid allocations.

mgehre-amd edited the summary of this revision. (Show Details)Mar 28 2022, 11:45 AM
programmerjake requested changes to this revision.Mar 28 2022, 1:59 PM
programmerjake added inline comments.
compiler-rt/lib/builtins/int_lib.h
119

this should be changed to pass in a buffer big enough for knuth division...
that is words * 2 + 1 (the buffer can just be a but zero-extended). also b needs to be non-const since knuth division needs to modify it.

This revision now requires changes to proceed.Mar 28 2022, 1:59 PM
mgehre-amd added inline comments.Mar 29 2022, 1:57 AM
compiler-rt/lib/builtins/int_lib.h
119

Maybe it's a stupid question, but does Knuth actually need extra scratch space when it's allowed to modify the inputs?

In APInt::divide, I see four temporary arrays being allocated (U, V, Q, R), each with a size of words (except for U which is words + 1).
But then LHS and RHS are copied in to U and V. And after Knuth's algorithms, Q and R are copied into quo and rem.

Now if we allow to modify the arguments, we don't need to the copies and thus we can directly use the space of a, b, quo and rem, can't we?

programmerjake added inline comments.Mar 29 2022, 4:31 AM
compiler-rt/lib/builtins/int_lib.h
119

hmm...i'd have to check the algorithm again...at the very least a has to have at least 1 extra word unless you want to branch every time you index a (or maybe peel off one iteration of the loop...sounds like a pain).

Knuth's algorithm (from Knuth's book):
inputs:
u = (u[1] u[2] ... u[m+n])
v = (v[1] v[2] ... v[n])

u is extended to (u[0] u[1] u[2] ... u[m+n]) by normalization step.

outputs:
floor(u/v) = (q[0] q[1] ... q[m])
u mod v = (r[1] r[2] ... r[n])

Require 'a' to be a large integer with n + 1 words; update tests accordingly

mgehre-amd marked 8 inline comments as done.Mar 29 2022, 7:57 AM
programmerjake accepted this revision.Mar 29 2022, 11:06 AM

looks good to me! you may want someone else to also approve this since I am not a regular LLVM contributor.

This revision is now accepted and ready to land.Mar 29 2022, 11:06 AM

@howard.hinnant, as Code Owner, do you know a good person to review/approve this? @programmerjake would like a second opinion.

@howard.hinnant, as Code Owner, do you know a good person to review/approve this? @programmerjake would like a second opinion.

According to Phab, Howard has not been active in over a year, so if he doesn't respond in a timely manner I think we should make the decision on consensus with the people actively reviewing here. (Also, if Howard doesn't weigh in because he's no longer active in the community, we may want to consider poking around to see if we can scare up a new code owner for this component.)

@FreddyYe spent a lot of time investigating this same thing internally, so I suggest that he'd be the best person to approve this of those who are paying attention.

I am no longer active in this community.

I am no longer active in this community.

Thanks for confirming, Howard (and thank you for all the community efforts you've put in over the years)!

FreddyYe accepted this revision.Apr 7 2022, 6:57 PM

LGTM. Will you update the backend libcall?

LGTM. Will you update the backend libcall?

Yes, I have the change locally already.

This revision was landed with ongoing or failed builds.Apr 7 2022, 11:43 PM
This revision was automatically updated to reflect the committed changes.

@mgehre-amd Please can you take a look at the buildbots breaks: https://lab.llvm.org/buildbot/#/builders/93/builds/8214

Yes, sorry, I'll push a fix.

@mgehre-amd this has broken our Linux GCC And Windows MSVC builds. Both complain that:

.../compiler-rt/lib/builtins/udivmodei5.c:27:32: error: initializer element is not constant
 static const su_int WORD_MSB = (su_int)1 << (WORD_SIZE_IN_BITS - 1);

Also __attribute__((weak)) is not supported for MSVC.

The following patch fixes the Windows MSVC build:

diff --git a/compiler-rt/lib/builtins/udivmodei5.c b/compiler-rt/lib/builtins/udivmodei5.c
index d1cc62c19135..5a31b9c1e685 100644
--- a/compiler-rt/lib/builtins/udivmodei5.c
+++ b/compiler-rt/lib/builtins/udivmodei5.c
@@ -15,7 +15,7 @@
 // When this is built into the bitint library, provide symbols with
 // weak linkage to allow gracefully upgrading once libgcc contains those
 // symbols.
-#ifdef IS_BITINT_LIBRARY
+#if defined(IS_BITINT_LIBRARY) && !defined(_WIN32)
 #define WEAK_IF_BITINT_LIBRARY __attribute__((weak))
 #else
 #define WEAK_IF_BITINT_LIBRARY
@@ -24,7 +24,7 @@
 static const int WORD_SIZE_IN_BITS = sizeof(su_int) * CHAR_BIT;

 /// A mask with the most significant bit set.
-static const su_int WORD_MSB = (su_int)1 << (WORD_SIZE_IN_BITS - 1);
+static const su_int WORD_MSB = (su_int)1 << ((sizeof(su_int) * CHAR_BIT) - 1);

 // Define an index such that a[WORD_IDX(0)] is the least significant word
 // and a[WORD_IDX(words-1)] is the most significant word.

There was an incorrect printf format specifier usage fixed by 8843245ddd2d

I am testing check-builtins and will revert this patch in one minute.

Is there a specification for __udivmodei5 and __divmodei5? How are the names picked? These function names compete with the namespace specified by libgcc and we need to be careful to prevent collision.

compiler-rt/lib/bitint/CMakeLists.txt
2

If lib/bitint is specific to lib/builtins, it should probably be placed in lib/builtins/

12

delete blank line

compiler-rt/lib/builtins/udivmodei5.c
20

Can you elaborate how WEAK_IF_BITINT_LIBRARY is intended to be used with libgcc, libgcc.a or libgcc_s.so.1? If libgcc_s.so.1 => STB_WEAK does not affect symbol resolution and is not different from STB_GLOBAL.

MaskRay reopened this revision.Apr 8 2022, 12:46 PM
This revision is now accepted and ready to land.Apr 8 2022, 12:46 PM
MaskRay requested changes to this revision.Apr 8 2022, 12:46 PM
This revision now requires changes to proceed.Apr 8 2022, 12:46 PM
MaskRay added inline comments.Apr 8 2022, 12:51 PM
compiler-rt/lib/builtins/int_lib.h
129

unsigned int => unsigned. Some older files use unsigned int but the new interface doesn't need to follow them.

compiler-rt/test/builtins/Unit/divmodei5_test.c
94

These variables are not freed, causing memory leaks.

unsigned int => unsigned.

If you intend to use su_int, the type should be su_int *

MaskRay added a comment.EditedApr 8 2022, 12:57 PM

si di ti suffixes in libgcc.a libgcc_s.so.1 function names are related to https://gcc.gnu.org/onlinedocs/gccint/Machine-Modes.html
I am concerned whether GCC may reserve ei for other things, given that many letters from [a-z] have been taken for [a-z]i. It would be better to check with GCC folks that they will not use ei for different things.

mgehre-amd added inline comments.May 3 2022, 12:02 AM
compiler-rt/lib/builtins/udivmodei5.c
20

The idea is that clang unconditionally links to libbitint, which contains __udivmodei5 (etc) with weak linkage.

In addition, clang either links to libgcc or compiler-rt builtins, depending on the user's choice.

If the user selected an "older" libgcc.a (i.e. without udivmodei5), then udivmodei5 from libbitint is used by the linker.
If the user selected a "newer" libgcc.a (i.e. with udivmodei5), then udivmodei5 from libgcc is used by the linker. (There are no patches yet to add udivmodei5 to libgcc).
If the user links against "newer" libgcc_s.so.1, I think the
udivmodei5 from libbitint is used by the linker, but at least there is no linker error.
If the user selected compiler-rt (i.e. with udivmodei5), then udivmodei5 from compiler-rt is used by the linker.

After libgcc has adopted __udivmodei5 for a while, we can phase out libbitint.

unsigned int -> unsigned
Fix memory leaks in unit tests

MaskRay added inline comments.May 3 2022, 9:37 PM
compiler-rt/lib/builtins/udivmodei5.c
20

So libbitint is linked even in --rtlib=libgcc mode? That may be a bit weird.

I am not sure the proposed libgcc.a member shadowing scheme will work.
See https://maskray.me/blog/2021-06-20-symbol-processing#archive-processing

Say these functions are provided in libgcc.a(ei5.o). If after libbitint is processed, all symbols defined by libgcc.a(ei5.o) are either defined or unreferenced, libgcc.a(ei5.o) won't be extracted, even if it provides STB_GLOBAL definitions.

MaskRay requested changes to this revision.May 3 2022, 9:38 PM
This revision now requires changes to proceed.May 3 2022, 9:38 PM
MaskRay added a subscriber: marxin.EditedMay 3 2022, 9:40 PM

If the intention is indeed to have interop with libgcc, please don't reland until GCC folks are happy. I hereby requested changes.

Answering https://discourse.llvm.org/t/rfc-add-support-for-division-of-large-bitint-builtins-selectiondag-globalisel-clang/60329/8 here because AFAICT marxin isn't on discourse.llvm.org

@MaskRay, any idea who I could ping to figure out how those names are reserved for gcc? Or should we alternatively take a name that is not in GCCs “namespace”? How could a name look then?

Perhaps @marxin

mgehre-amd added inline comments.May 3 2022, 11:42 PM
compiler-rt/lib/builtins/udivmodei5.c
20

So libbitint is linked even in --rtlib=libgcc mode? That may be a bit weird.

This is exactly the use case for libbitint: To allow users of libgcc (which is the default on clang on Linux) to use _BitInt while libgcc / distributions have not catched up with __divmodei5 and similar.

libgcc.a(ei5.o) won't be extracted, even if it provides STB_GLOBAL definitions.

My main driver for making the symbols "weak" in libbitint is to avoid multiple definitions errors once libgcc also provides __divmodei5. You are saying that
even with normal linkage in libbitint, no multiple definition error will occur? Then I can remove the weak linkage from libbitint.

Alternatively, I can also completely remove libbitint from this PR. Then users can only use _BitInt division in clang when they either use compiler-rt or a future version of libgcc.
What do you think?

MaskRay added inline comments.May 4 2022, 12:13 PM
compiler-rt/lib/builtins/udivmodei5.c
20

You are saying that even with normal linkage in libbitint, no multiple definition error will occur? Then I can remove the weak linkage from libbitint.

Sometimes. If the member libgcc.a satisfies the requirement I described in the blog post.

In this case I think removing WEAK is correct.

Instead of putting the __divmodei5/__udivmodei5 routines in compiler-rt, can we just make the backend embed them into the object file?

Fundamentally, adding new methods to the builtins library is problematic. In general, we don't control the link step: we link against whatever builtins library the user provides, which is some random version of libgcc/compiler-rt.builtins/some handwritten replacement/etc. So adding any new names will likely cause problems for users trying to use the new features. Putting the name into a new "bitint" library is just going to confuse users more.

Really, we don't need the symbols to be in the builtins library; the implementation is entirely self-contained, so we can just emit it into the object files that need it. We can make it weak+hidden with some LLVM-specific name to ensure we don't bloat the final executable. The only complicated part is that the backend needs to generate the implementation, and it only understands LLVM IR, not C.

(I think this general approach would also be useful for other routines; for example, we currently emit overflowing arithmetic inline because there's no external implementation we can call.)

Instead of putting the __divmodei5/__udivmodei5 routines in compiler-rt, can we just make the backend embed them into the object file?

I agree for smaller builtin function. Here, I would need to check how much LLVM IR is necessary. I think it can be cut down significantly if we generate a specialized
division function for each bitwidth, but it would increase the overall program size when using multiple different bit width.
And the approach would also increase the size of every object file using the large division and make the linker work harder in removing those duplicated functions.

Is there an existing intrinsic/LLVM instruction for which LLVM embeds a call in the object file today?

Instead of putting the __divmodei5/__udivmodei5 routines in compiler-rt, can we just make the backend embed them into the object file?

I agree for smaller builtin function. Here, I would need to check how much LLVM IR is necessary.

In the future we (LLVM) will probably want to switch the algorithm used to Knuth's Algorithm D, which is quite complex...I know I wouldn't want to try writing that in C++ that generates LLVM IR...it would probably be more than 2k lines of nearly incomprehensible code. Also, I expect specialization based on input/output sizes to not have much effect for Algorithm D, imho we should just have the generic functions and call them rather than generating a separate function per bit-width.

For an example of how complex it is, see the C code we (Libre-SoC) are currently working on to develop new instructions for OpenPower to accelerate big-int arithmetic:
ignore lines 236-402, they're #ifdef-ed out.
https://git.libre-soc.org/?p=libreriscv.git;a=blob;f=openpower/sv/biginteger/divmnu64.c;h=10b05d5aba2a857fac55f5873734a9f7fae257d3;hb=HEAD#l153
if you're curious, the instructions we're developing: https://libre-soc.org/openpower/sv/biginteger/

And the approach would also increase the size of every object file using the large division and make the linker work harder in removing those duplicated functions.

This is a downside... but I think it's worthwhile anyway.

Is there an existing intrinsic/LLVM instruction for which LLVM embeds a call in the object file today?

No, we haven't really been doing this; at least, not specifically to lower intrinsics. (There are passes that generate functions as part of other features.) I figured it might be a good opportunity to start using this approach, since it's new functionality.

In the future we (LLVM) will probably want to switch the algorithm used to Knuth's Algorithm D, which is quite complex...I know I wouldn't want to try writing that in C++ that generates LLVM IR...it would probably be more than 2k lines of nearly incomprehensible code. Also, I expect specialization based on input/output sizes to not have much effect for Algorithm D, imho we should just have the generic functions and call them rather than generating a separate function per bit-width.

It wouldn't necessarily have to be raw C++ calls to IRBuilder. We could, for example, embed textual LLVM IR, and call the LLVM IR parser. Or even use clang to generate LLVM IR ahead of time... we just can't actually call clang at runtime (since clang isn't linked in to LLVM, in general).

Or even use clang to generate LLVM IR ahead of time...

This may not be a good idea for the feature of _BitInt() , which aims to offering more optimization opportunities for IR like i256, i257, i512... @erichkeane

There is some push back on the gcc mailing list both on requiring an extra word to enable Knuth [0]
and generally defining the libgcc function before _BitInt is implemented in gcc [1].
There are also concerns about whether this would be consistent with the ABI,
which is isn't finalized on all targets yet.

I see a few options going forward:

  1. Continue discussion with the gcc folks.

I'm not sure that I can move the discussion anywhere.

  1. I prototyped replacing the division with an inlined and fully unrolled version of
set_zero(quo, words);
set_zero(rem, words);

unsigned bits = words * WORD_SIZE_IN_BITS;
for (int i = bits - 1; i >= 0; --i) {
  left_shift_1(rem, words);                    // rem <<= 1;
  set_bit_0(rem, words, get_bit(a, words, i)); // rem(bit 0) = a(bit i);
  if (ucompare(rem, b, words) >= 0) {          // if (rem >= b)
    subtract(rem, b, words);                   // rem -= b;
    set_bit(quo, words, i);                    // quo(bit i) = 1;
  }
}

during SelectionDAG legalization. This seems to generally work, requires no compiler-rt
extension, but leads to ~5600 instructions on AArch64 for a single udiv i256.
I just created a SelectionDAG node for every line above, and let SelectionDAG legalize
those further to native types.

  1. Do the expansion of large udivs on LLVM IR (in TargetPassConfig::addIRPasses ?).

This has the advantage that we are allowed to create loops and functions, so we can
keep the loop rolled to save on the instruction count. We can also keep the function
outlined. We would still specialize based on the integer type, so
the implementation for i256 would look like

define i256 __llvm__udivrem256(i256 a, i256 b) {
  ...
loop_body:
  ..
  ; rem <<= 1;
  %rem_sh = shl i256 %rem, 1
  ; rem(bit 0) = a(bit i);
  %a_n1 = lshr i256 %a, %i
  %a_n = and i256 %a_n1, 1
  %rem_sb = or i256 %rem_sh, %a_n
   ...

which makes the pass simple to write. The backend later expands those big integer types
into native ones.

  1. Continue with the current approach, but rename the functions to __llvm_divmodei5 to be independent

of what gcc will decide to do.
We still need build libbitint and link to it whenever the user selects libgcc as their runtime library.
WEAK_IF_BITINT_LIBRARY can be removed; the functions always have strong linkage.

What do you think?

[0] https://gcc.gnu.org/pipermail/gcc/2022-May/238659.html
[1] https://gcc.gnu.org/pipermail/gcc/2022-May/238666.html

I have tried to circumvent various issues around changing compiler-rt by synthesizing the division function directly in LLVM: https://reviews.llvm.org/D126644

Note, libgcc for GCC 14 now provides library functions for _BitInt multiplication, division+modulo and conversions to/from binary and decimal floating point formats.

compiler-rt/test/builtins/Unit/divmodei5_test.c