This is an archive of the discontinued LLVM Phabricator instance.

Populating 'libmgpu.a' for math on the GPU
ClosedPublic

Authored by AntonRydahl on Jun 20 2023, 8:29 PM.

Details

Summary

This commit adresses the discussions from patch D152575. From previous discussions, we agreed that __builtin math functions should be used as
long as they would compile to NVPTX and AMD-GCN/AMD-HSA targets. I found that the __builtin functions compiled to GPU code in all cases. That was tested in the following way:

bash
clang++ -O3 -pthread -fno-dwarf2-cfi-asm -fno-asynchronous-unwind-tables -mcpu=gfx1100 --target=amdgcn-amd-amdhsa -nogpulib -I../llvm-project/libc -fno-pie -emit-llvm -S
<file>.cpp -o <file>.ll
llvm-as <file>.ll -o <file>.bc
llc -mcpu=gfx1100 -filetype=obj -relocation-model=pic <file>.bc -o <file>.o

However, I have not tested if the code performed well on the GPU target or if it compiles to NVPTX.

__builtin Functions

The following __builtin functions were added.

  • Added modf and modff.
  • Added nearbyint and nearbyintf.
  • Added nextafter and nextafterf.
  • Added remainder and remainderf.
  • Added remquo and remquof.
  • Added rint and rintf.
  • Added scalbn and scalbnf.
  • Added sinh and sinhf.
  • Added sqrt and sqrtf.
  • Added tan and tanf.
  • Added tanh and tanhf.
  • Added trunc and truncf.

Vendor Functions

The following vendor functions were added, because the __builtin versions do not exist in the LLVM project, as far as I am aware.

  • Added sincos and sincosf.

libc Header Files

The following header files were introduced to libc:

  • sinh.
  • tanh.

Diff Detail

Event Timeline

AntonRydahl created this revision.Jun 20 2023, 8:29 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 20 2023, 8:29 PM
AntonRydahl requested review of this revision.Jun 20 2023, 8:29 PM

I'm wondering in general if we should bother supporting the long double variants of some of these. Both AMDGPU and NVPTX explicitly state that long double is just double on the GPU architecture so they're equivalent to the regular doubleversions.

I'm wondering in general if we should bother supporting the long double variants of some of these. Both AMDGPU and NVPTX explicitly state that long double is just double on the GPU architecture so they're equivalent to the regular doubleversions.

I don't think we should, at least not by default. It would be very surprising, IMHO.

I don't think we should, at least not by default. It would be very surprising, IMHO.

Agreed, I think having these default to a linker failure is a good way to state that it's not supported. IIRC CUDA's nvcc gives you warnings if you use a long double since it's just a double. So, please remove the long double variants from this patch for now.

Thank you! I see that it does not make sense to have the long double versions if a long double is equivalent to a double on the GPUs. I will remove them later today.

AntonRydahl updated this revision to Diff 533389.EditedJun 21 2023, 2:02 PM
AntonRydahl edited the summary of this revision. (Show Details)

As suggested, the long double versions of the math functions were removed in this commit.

AntonRydahl edited the summary of this revision. (Show Details)Jun 21 2023, 2:06 PM
jhuber6 added a subscriber: arsenm.Jun 21 2023, 2:34 PM
jhuber6 added inline comments.
libc/src/math/gpu/sqrt.cpp
14

According to @arsenm these aren't correct now so we should proabably use the vendor versions for now.

arsenm added inline comments.Jun 21 2023, 2:36 PM
libc/src/math/gpu/sqrt.cpp
14

The f64 sqrt patch is basically postable now (the basic correct path is done, fast math 1 / sqrt(x)) folds still need to be done)

arsenm added inline comments.Jun 21 2023, 2:43 PM
libc/src/math/gpu/sqrt.cpp
14
AntonRydahl added inline comments.Jun 21 2023, 3:02 PM
libc/src/math/gpu/sqrt.cpp
14

Should I also add the vendor versions of the trigonometric functions such as sinh and tan?

14

Should I just add D153472 as a parent of this patch?

arsenm added inline comments.Jun 21 2023, 3:13 PM
libc/src/math/gpu/sqrt.cpp
14

Sure why not, they're there. Depends how much you want to put in one patch.

Nobody's reliant on this code right now so I don't think it matter which lands first

Added vendor versions of trigonometric functions as some of them do not compile to NVPTX.

jhuber6 accepted this revision.Jun 22 2023, 12:02 PM

Looks reasonable, we will definitely need some tests for these soon.

This revision is now accepted and ready to land.Jun 22 2023, 12:02 PM

Rebased the patch on upstream LLVM.

arsenm added inline comments.Jul 31 2023, 10:51 AM
libc/src/math/gpu/vendor/amdgpu/amdgpu.h
26

I thought I had already added an ocml_sincos_stret but I guess not, should switch to that whenever that gets added

libc/src/math/gpu/vendor/amdgpu/declarations.h
18

This won't actually work, the underlying pointer uses a private pointer. you can't simply declare as flat and call it. Probably should just define the struct return variant and use that. it's a lot less ugly than dealing with the pointer wrapping

jhuber6 added inline comments.Jul 31 2023, 10:54 AM
libc/src/math/gpu/vendor/amdgpu/declarations.h
18

Variables declared on the stack should be private as far as I know. I figured the semantics are the same as the regular sincos where we just write to whatever pointer we're given. If it's a stack pointer it'll be private, if it's a global it won't be and it'll be up to the user to not have that conlift.

arsenm added inline comments.Jul 31 2023, 10:57 AM
libc/src/math/gpu/vendor/amdgpu/declarations.h
18

The ocml functions aren’t magic. This is declared with the wrong type

jhuber6 added inline comments.Jul 31 2023, 11:00 AM
libc/src/math/gpu/vendor/amdgpu/declarations.h
18

Ah, so internally it expects private pointers and we need to make sure that whatever address space cast is required is done prior to calling it? I'm assuming that'll be something like __attribute__((address_space(5))) and we'll need to manually convert from it. I may need to have some utility header that assigns global, private, and local address spaces depending on the target since this is all pure C++ without the standard address space checks you'd get in OpenCL or something.

arsenm added inline comments.Jul 31 2023, 11:19 AM
libc/src/math/gpu/vendor/amdgpu/declarations.h
18

Yes. I'm looking into adding the stret variants but having a naming things is hard problem

We found that __builtin_nextafter and __builtin_nextafter were not correctly lowered. Therefore, they have been replaced with vendor versions.

We found that __builtin_nextafter and __builtin_nextafter were not correctly lowered. Therefore, they have been replaced with vendor versions.

nextafter isn't that complicated to implement, could just write one

We found that __builtin_nextafter and __builtin_nextafter were not correctly lowered. Therefore, they have been replaced with vendor versions.

nextafter isn't that complicated to implement, could just write one

The libc library has a generic version that's most likely suitable to use. If we just list nextafter in entrypoints.txt but do not provide a vendor or gpu implementation it should be used.

We found that __builtin_nextafter and __builtin_nextafter were not correctly lowered. Therefore, they have been replaced with vendor versions.

nextafter isn't that complicated to implement, could just write one

The libc library has a generic version that's most likely suitable to use. If we just list nextafter in entrypoints.txt but do not provide a vendor or gpu implementation it should be used.

Looks a branchy/early-returny in ways the optimizer isn't aggressive enough at speculating

jhuber6 added a subscriber: lntue.Jul 31 2023, 4:56 PM

Looks a branchy/early-returny in ways the optimizer isn't aggressive enough at speculating

Yeah, probably something we could ask @lntue about. Looking at https://github.com/RadeonOpenCompute/ROCm-Device-Libs/blob/c1a736ae458f49e526932b3da611f6bd571a1c47/ocml/src/nextafterD.cl#L4 it looks much less branchy, presumably all the ternaries will get put into predicate registers for AMDGPU / NVPTX.

Shall I just update the patch to include the entry points generic.nextafter and generic.nextafterf rather than the vendor versions?

We can probably just go with the generic version even if it may be a little slow, more likely to pass the tests. Probably just worth noting we could implement it more optimally later.

We can probably just go with the generic version even if it may be a little slow, more likely to pass the tests. Probably just worth noting we could implement it more optimally later.

I’d just go for the overload for now

We can probably just go with the generic version even if it may be a little slow, more likely to pass the tests. Probably just worth noting we could implement it more optimally later.

I’d just go for the overload for now

I am sorry, but I don't understand what you mean by overloading in this context. Could you please phrase it in another way?

We can probably just go with the generic version even if it may be a little slow, more likely to pass the tests. Probably just worth noting we could implement it more optimally later.

I’d just go for the overload for now

I am sorry, but I don't understand what you mean by overloading in this context. Could you please phrase it in another way?

Leave what you have calling the ocml nextafter

We can probably just go with the generic version even if it may be a little slow, more likely to pass the tests. Probably just worth noting we could implement it more optimally later.

I’d just go for the overload for now

I am sorry, but I don't understand what you mean by overloading in this context. Could you please phrase it in another way?

Leave what you have calling the ocml nextafter

Thanks for the clarification!