This is an archive of the discontinued LLVM Phabricator instance.

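Paths

libc/config/gpu/entrypoints.txt
libc/config/gpu/headers.txt
libc/src/math/CMakeLists.txt
libc/src/math/gpu/CMakeLists.txt
libc/src/math/gpu/round.cpp
libc/src/math/gpu/roundf.cpp
libc/src/math/gpu/roundl.cpp
libc/src/math/gpu/vendor/CMakeLists.txt
libc/src/math/gpu/vendor/amdgpu/amdgpu.h
libc/src/math/gpu/vendor/amdgpu/declarations.h
libc/src/math/gpu/vendor/amdgpu/platform.h
libc/src/math/gpu/vendor/common.h
libc/src/math/gpu/vendor/nvptx/declarations.h
libc/src/math/gpu/vendor/nvptx/nvptx.h
libc/src/math/gpu/vendor/sin.cpp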

Differential D152486

[libc] Begin implementing a 'libmgpu.a' for math on the GPU
ClosedPublic

Authored by jhuber6 on Jun 8 2023, 4:31 PM.

Details

Reviewers

tra
yaxunl
arsenm
jdoerfert
tianshilei1992
JonChesterfield
sivachandra
lntue
michaelrj
gregrodgers
ye-luo

Commits

rG8060d96aed7c: [libc] Begin implementing a 'libmgpu.a' for math on the GPU

Summary

This patch adds an outline to begin adding a libmgpu.a file for
providing math on the GPU. Currently, this is most likely going to be
wrapping around existing vendor libraries and placing them in a more
usable format. Long term, we would like to provide our own
implementations of math functions that can be used instead.

This patch works by simply forwarding the calls to the standard C math
library calls like sin to the appropriate vendor call like __nv_sin.
Currently, we will use the vendor libraries directly and link them in
via -mlink-builtin-bitcode. This is necessary because of bizarre
interactions with the generic bitcode: -mlink-builtin-bitcode
internalizes and only links in the used symbols. Furthermore, it
propagates the target's default attributes, and it is the only "truly"
correct way to pull in these vendor bitcode libraries without error.

If the vendor libraries are not available at build time, we will still
create the libmgpu.a, but we will expect that the vendor library
definitions will be provided by the user's compilation as is made
possible by https://reviews.llvm.org/D152442.
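
The forwarding described in this summary can be pictured with a minimal host-side sketch. The names `vendor_stub::vendor_sin` and `sketch_sin` are hypothetical stand-ins chosen for illustration: the real patch forwards to vendor symbols such as __nv_sin from the vendor bitcode, which cannot be called in an ordinary host build, so `std::sin` plays that role here.

```cpp
#include <cmath>

// Hypothetical stand-in for a vendor entry point such as __nv_sin. The real
// symbol comes from the vendor bitcode library; std::sin substitutes for it
// so that this sketch builds on a host compiler.
namespace vendor_stub {
inline double vendor_sin(double x) { return std::sin(x); }
} // namespace vendor_stub

// Internal layer that hides which vendor call is used on a given target.
namespace internal {
inline double sin(double x) { return vendor_stub::vendor_sin(x); }
} // namespace internal

// Public entry point. A real libm would export this as plain `sin`; the name
// is changed here (an assumption of the sketch) to avoid clashing with the
// host's own math library.
double sketch_sin(double x) { return internal::sin(x); }
```

Callers only ever see the standard-looking entry point, so the call stays recognizable to the optimizer until the library is linked in.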

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jhuber6 created this revision. Jun 8 2023, 4:31 PM

Herald added projects: Restricted Project, Restricted Project. Jun 8 2023, 4:31 PM

Herald added subscribers: libc-commits, mattd, asavonic and 2 others.

jhuber6 requested review of this revision. Jun 8 2023, 4:31 PM

Herald added a subscriber: wdng. Jun 8 2023, 4:31 PM

Harbormaster completed remote builds in B237624: Diff 529775. Jun 8 2023, 4:37 PM

Specifically for ocml functions, we're really close to not requiring the internalization. D149715 is the main piece I need to remove the last subtarget features.

The wavesize is still a bit problematic. Ideally we would have separate wave32 and wave64 builds, and not allow mixing wavesizes in a single module. With un-wavesized library IR, you can kind of get away with relying on the global subtarget. (which I guess is the same problem you have with target-cpu)

In D152486#4407279, @arsenm wrote:

Specifically for ocml functions, we're really close to not requiring the internalization. D149715 is the main piece I need to remove the last subtarget features.

The wavesize is still a bit problematic. Ideally we would have separate wave32 and wave64 builds, and not allow mixing wavesizes in a single module. With un-wavesized library IR, you can kind of get away with relying on the global subtarget. (which I guess is the same problem you have with target-cpu)

I haven't seen the wavesize used in ocml fortunately. If we fix the requirement to perform attribute propagation the second issue is that all the device functions in the ROCm device libraries have protected visibility so LTO can't optimize them out without some hacks. Being able to link these in regularly would be nice since the only way I could think of to perform the correct attribute propagation was to re-run clang with the merged LTO bitcode. However long term it would be nice to have a libm on the GPU that lived upstream in this repository. We could potentially port the OpenCL in the ROCm device libraries, but I don't know how popular that would be internally at AMD.

jhuber6 retitled this revision from [libc] Begin implementing a 'libmgpu.a' for math on th GPU to [libc] Begin implementing a 'libmgpu.a' for math on the GPU. Jun 8 2023, 5:34 PM

Missing inline on the internal definition.

Harbormaster completed remote builds in B237738: Diff 529925. Jun 9 2023, 5:49 AM

@Matt good news on ocml, thanks. I think we should add a wave size intrinsic, unconditionally expand it somewhere in clang and the backend and replace the current magic variable in ocml with it. The main appeal to encoding it that way is we can also immediately kill dead branches on the value of the intrinsic which would solve various O0 related problems from invalid but not called code reaching the backend.

@Joseph I'd really like us to be able to get rid of the header files that currently do the remapping from libm calls to the vendor ones. Have you spoken to the hip / cuda / openmp people to see if any of them are in principle willing to part with the current header translation thing?

In D152486#4408610, @JonChesterfield wrote:

@Joseph I'd really like us to be able to get rid of the header files that currently do the remapping from libm calls to the vendor ones. Have you spoken to the hip / cuda / openmp people to see if any of them are in principle willing to part with the current header translation thing?

Killing off the headers would be ideal. As it stands, there are a few deficiencies in this approach.

This currently relies on a build-time dependency on libdevice.10.bc or ocml.bc for the best support. We can link it late with the support in D152442, but that's currently an opt-in thing because it will result in longer compile times. Ideally we would have the files we need in a binary blob upstream somewhere or we implement the math functions without relying on them.
We don't handle special-case math optimizations that the vendors have. The headers allow us at compile time to select -ffast-math variants from libdevice.10.bc or to enable the controls for things like __oclc_unsafe_math_opt.
The NVPTX backend currently cannot codegen the math intrinsics so we'll need to fix that before this can be used.
Obviously this requires building the libc project which would then need to be a mandatory thing for math if we remove the headers.

In D152486#4408610, @JonChesterfield wrote:

@Matt good news on ocml, thanks. I think we should add a wave size intrinsic, unconditionally expand it somewhere in clang and the backend and replace the current magic variable in ocml with it.

We used to do that, and it doesn't work. We can't codegen different subtargets within the same function like that. Really having wave32 and wave64 coexist in the same module leads to a variety of untenable situations

Math optimisations were a bit of a mess a few years ago. Some in clang, some in llvm, some in backends, some based on intrinsics and some based on name of libm functions. There's a decent chance ocml_sin(0) won't constant fold and sin(0) will for example. There's a decent chance we'd win overall by keeping the functions named after libm functions until at least some of the middle end has run, before translating them into the vendor ones latish. Phase ordering is always tricky.

In D152486#4408674, @arsenm wrote:

In D152486#4408610, @JonChesterfield wrote:

@Matt good news on ocml, thanks. I think we should add a wave size intrinsic, unconditionally expand it somewhere in clang and the backend and replace the current magic variable in ocml with it.

We used to do that, and it doesn't work. We can't codegen different subtargets within the same function like that. Really having wave32 and wave64 coexist in the same module leads to a variety of untenable situations

I don't think that's quite what I'm proposing.

Let ocml be wavesize agnostic, modulo a call to a clang intrinsic __builtin_amdgcn_wavesize() or whatever. Get the ocml IR out of clang however. Some of it is handwritten as IR, some is opencl, whatever.

Let clang take an argument for wavesize, or default it from the arch, or however we want to pick between 32 and 64. Write that information in the IR somewhere and/or pass it to the backend.

Expand the wavesize intrinsic with something that robustly deletes the dead basic block when it's used as a branch, otherwise to the 32|64 literal. At no point does an IR module contain wave32 and wave64 code, but some of the library code doesn't know which it's going to be.

I'm aware that it's possible to have a wave32 kernel and a wave64 kernel in the same IR module, and consider that so full of hazards that clang should reject it up front, and people who want to run mixed wavesize kernels from a single application can build them into separate code objects.
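
As a rough illustration of the proposal above, the following C++17 sketch models a wave size query that is fixed for a whole build and library code that branches on it. The names `wavesize` and `reduction_steps` are hypothetical, and a template parameter stands in for the compiler argument; the actual proposal concerns a clang intrinsic expanded in the compiler, not templates.

```cpp
// Hypothetical stand-in for a wave size query. In the proposal this would be
// a compiler intrinsic that is expanded to the literal 32 or 64 for the whole
// module; a template parameter models that choice here.
template <unsigned WaveSize> constexpr unsigned wavesize() {
  static_assert(WaveSize == 32 || WaveSize == 64, "only wave32 and wave64");
  return WaveSize;
}

// Library code written once without knowing the wave size. The branch is
// resolved at compile time, so the untaken side is discarded and never
// reaches code generation.
template <unsigned WaveSize> constexpr unsigned reduction_steps() {
  if constexpr (wavesize<WaveSize>() == 64)
    return 6; // log2(64) halving steps
  else
    return 5; // log2(32) halving steps
}
```

Each instantiation contains code for exactly one wave size, which mirrors the requirement that no module ever mixes wave32 and wave64 code.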

In D152486#4408675, @JonChesterfield wrote:

Math optimisations were a bit of a mess a few years ago.

And they still are. The one I'd currently like to tackle is the fact that ocml treats select llvm generic math intrinsics like they're target specific fast operations. The worst offenders are sqrt and log.

elmcdonough added a subscriber: elmcdonough. Jun 9 2023, 10:52 AM

jhuber6 added a child revision: D152575: Added modf for NVPTX and AMDGPU targets to implement 'libmgpu.a' for math on the GPU. Jun 9 2023, 12:39 PM

Couple of questions:

If all that libmgpu.a is doing is to route to the vendor specific implementations, what exactly is the benefit provided by libmgpu.a?
How do GPU applications use and link to the math functions today without libmgpu.a?

Based on the answers to the above questions, I might have more questions and/or comments.

In D152486#4410189, @sivachandra wrote:

Couple of questions:

If all that libmgpu.a is doing is to route to the vendor specific implementations, what exactly is the benefit provided by libmgpu.a?

The vendor libraries provide math functions that have different names and the format they are provided in presents unfortunate design challenges. This allows us to provide math functions as a library using a more convenient format and it allows us to only resolve these to the vendor versions at (LTO) link time. The primary benefit of this is that LLVM generally understands what a sin call is, but does not understand what an __nv_sin call is. To that end, keeping the calls until the link job and using our libm.a should allow more optimizations. Additionally, this remaps the vendor library into something that's more usable with our interface.

How do GPU applications use and link to the math functions today without libmgpu.a?

Based on the answers to the above questions, I might have more questions and/or comments.

The current solution is to use a wrapper header that's forcibly included before each offloading compilation that maps for example sin to __nv_sin on the GPU. This is then resolved per-TU by the -mlink-builtin-bitcode option that you see used in this patch in a similar way. There is no linking per-se currently, a lot of the work I did last year was to allow GPUs to have more traditional linking steps when using offloading languages.

@jdoerfert or @tra will probably be able to give you a more detailed answer, I remember there being some talks a long time ago about making this libm. There's also an implementation of it in AMD's downstream fork for use with OpenMP offloading to Fortran so this is a way to bring that upstream.

It's also worth noting that the vendor libraries are not strictly conformant to what a standard libm would contain, for example there is no errno handling nor raising of floating point exceptions. Additionally, I think a good long-term goal would be to have GPU math implementations directly in-tree here. But I am not a math person. The ROCm device libraries are open source so inspiration could be drawn there if we were to implement our own version https://github.com/RadeonOpenCompute/ROCm-Device-Libs/tree/amd-stg-open/ocml/src. However that's not the immediate value of having a libm.a for the GPU so I'm not planning on working on that front for the time being.

elmcdonough added inline comments. Jun 9 2023, 4:02 PM

libc/src/math/gpu/amdgpu/amdgpu.h
18	(On Diff #529925)	Shouldn't this be `internal` to stay consistent with nvptx.h and sin.cpp? I wasn't able to get this patch to build with the `vendor` namespace, but changing it to `internal` got it to compile.

jhuber6 added inline comments. Jun 9 2023, 4:03 PM

libc/src/math/gpu/amdgpu/amdgpu.h
18	(On Diff #529925)	Yeah, that was a mistake I forgot to update.

It might be appealing to provide a normal math.h like interface via the libc project, but at that point, why should it be the libc project which should provide it? You can do all of this outside of the libc project. We ideally want the bring up of libc project's math on a new target architecture to be like this:

Implement platform specific primitives. Example, fma operations, fenv manipulation functions.
As the primitives are being implemented, start adding math functions to the list of entrypoints.

Any reason why we cannot take this approach? Maybe there are certain operational reasons. For example, if some functions come from the libc project and the rest from elsewhere, who provides the math.h header file?

Another point to keep in mind is that the libc project's math implementations come with an additional promise: the results they produce are correctly rounded in all rounding modes. Which means they produce the same result on any target architecture which is IEEE-754 compliant, in all rounding modes.

In D152486#4410351, @sivachandra wrote:

It might be appealing to provide a normal math.h like interface via the libc project, but at that point, why should it be the libc project which should provide it?

Whoever provides math.h should provide the implementation. As you mentioned, two places will cause problems.

You can do all of this outside of the libc project. We ideally want the bring up of libc project's math on a new target architecture to be like this:

Implement platform specific primitives. Example, fma operations, fenv manipulation functions.

As the primitives are being implemented, start adding math functions to the list of entrypoints.

Any reason why we cannot take this approach? Maybe there are certain operational reasons. For example, if some functions come from the libc project and the rest from elsewhere, who provides the math.h header file?

Another point to keep in mind is that the libc project's math implementations come with an additional promise: the results they produce are correctly rounded in all rounding modes. Which means they produce the same result on any target architecture which is IEEE-754 compliant, in all rounding modes.

@jhuber6 wants to get rid of the vendor math, but for now that is all we have. It would be great to have compliant alternatives, if they are as fast, or a choice for the user. Right now we have neither.

All that said, what would be a better place for the math impl for GPUs, which for now is vendor based? It seems to me that next to libc_gpu is the right place, but if you insist we cannot do this we need an alternative.

In D152486#4410351, @sivachandra wrote:

It might be appealing to provide a normal math.h like interface via the libc project, but at that point, why should it be the libc project which should provide it? You can do all of this outside of the libc project. We ideally want the bring up of libc project's math on a new target architecture to be like this:

Implement platform specific primitives. Example, fma operations, fenv manipulation functions.

As the primitives are being implemented, start adding math functions to the list of entrypoints.

Any reason why we cannot take this approach? Maybe there are certain operational reasons. For example, if some functions come from the libc project and the rest from elsewhere, who provides the math.h header file?

Implementing an entirely new GPU math library would be a much larger undertaking than re-using the implementations that GPU users are already dependent on. We have all the infrastructure ready due to the libc work to provide the libm and the appropriate headers with minimal changes, as shown by this patch. Even though we are depending on something external, we want to provide the headers from here and integrate them with the rest of the offloading toolchain. I don't think there's anywhere else that's suitable in the LLVM tree for an offloading language agnostic math library (e.g. HIP, CUDA, OpenMP, etc. can use it). I still think of this as "an implementation of libm". I'd like to have something that's immediately useful as the GPU groups have been discussing implementing a libm interface for GPU math for at least three years now.

Another point to keep in mind is that the libc project's math implementations come with an additional promise: the results they produce are correctly rounded in all rounding modes. Which means they produce the same result on any target architecture which is IEEE-754 compliant, in all rounding modes.

I'm not actually certain on the status of this, or if the vendor math libraries support altering rounding modes beyond a few optimization tweaks. It's definitely a limitation.

elmcdonough mentioned this in D152603: [libc] Add math functions to AMD/NVPTX libm. Jun 9 2023, 7:26 PM

elmcdonough added a child revision: D152603: [libc] Add math functions to AMD/NVPTX libm.

Tweak a few things.

I think what would be valuable moving forward is a way to choose between the
implementation in libc and simply using the vendor libraries. That would allow
us to provide a libm.a that is immediately useful and then perform an
iterative implementation of other math functions.

Harbormaster completed remote builds in B238164: Diff 530475. Jun 12 2023, 5:40 AM

math.h is traditionally part of libc but it probably doesn't matter if we create a libm directory next to libc and put the code in there, then change the link order to be libm followed by libc, on the basis that libm is more likely to want to use libc functions than the inverse.

We could always move the code and merge the archives later.

There are many points to address, some raised here, and a few others raised on Discourse. I will try to respond to all of them with my thoughts in an abstract sense on how we ought to approach this.

We want users to use this libc because it provides something they cannot get elsewhere but we don't want the libc project to be a convenience wrapper. So, if there is a way for GPU programmers to already use vendor libraries, I do not think we want the libc project to be the agent which makes the use of the vendor libraries convenient.
That we don't want the libc project to be the agent to make it easy to use the vendor libraries should not stop us from taking a practical approach. Especially because the libc project's math library is not complete enough. The default GPU config in the libc project should use what is available in the libc project and get the rest from the vendor libraries. As more math functions get implemented, the default GPU config should switch over to the implementations available in the libc project.
Users who want something different (as in, not the default) should have the freedom and the ability to pick and choose from the libc project or the vendor libraries. Which means that the libc project should provide a large number of config options, one for every math function, to switch between the in-tree version and the vendor library. This might sound like a lot of pain, but I would really vote for making it painful. Users not using the default should have a very good reason, and hence will take that pain. Likewise, this pain will also serve as an incentive for the GPU port owners to improve the math functions provided by the libc project.
There are questions around accuracy, performance and control flow optimization (branches et al.) for the GPU. The GPU port owners should work with the math owners to set up configuration options and/or tune the math library for GPU builds. For example, the GPU port might tolerate a large error in favor of reducing the number of branches in the code. I am sure @lntue will be happy to work with you on this.

In D152486#4415455, @sivachandra wrote:

There are many points to address, some raised here, and a few others raised on Discourse. I will try to respond to all of them with my thoughts in an abstract sense on how we ought to approach this.

We want users to use this libc because it provides something they cannot get elsewhere but we don't want the libc project to be a convenience wrapper. So, if there is a way for GPU programmers to already use vendor libraries, I do not think we want the libc project to be the agent which makes the use of the vendor libraries convenient.

That we don't want the libc project to be the agent to make it easy to use the vendor libraries should not stop us from taking a practical approach. Especially because the libc project's math library is not complete enough. The default GPU config in the libc project should use what is available in the libc project and get the rest from the vendor libraries. As more math functions get implemented, the default GPU config should switch over to the implementations available in the libc project.

This is reasonable, I think supporting both is a good option so we can configure it as needed. However, we should probably get some basic performance numbers in to compare the implementation. It's probably not reasonable to make it the default if it's orders of magnitude slower since we don't want the default user experience to leave a bad impression. Thanks for being open to this as there is immediate utility in the GPU community for these features. @jdoerfert is eager to have these implemented specifically.

Users who want something different (as in, not the default) should have the freedom and the ability to pick and choose from the libc project or the vendor libraries. Which means that the libc project should provide a large number of config options, one for every math function, to switch between the in-tree version and the vendor library. This might sound like a lot of pain, but I would really vote for making it painful. Users not using the default should have a very good reason, and hence will take that pain. Likewise, this pain will also serve as an incentive for the GPU port owners to improve the math functions provided by the libc project.

Okay, I will make a follow-up patch for this that presents some config options. Tentatively the easiest would be to make a switch on or off like LIBC_GPU_VENDOR_MATH which we'll start out as on and then gradually move away from. If a function isn't implemented we will default to the vendor version. Then we would maybe use LIBC_GPU_VENDOR_MATH_FUNCTIONS= list to explicitly set the functions that we want to be provided by the vendor libraries. Would that be sufficient?
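
One way to read the selection policy sketched in this reply is as a small decision function. The following C++ model is only an interpretation: the struct, its field names and `select_provider` are hypothetical, and the assumption that an enabled LIBC_GPU_VENDOR_MATH sends every function to the vendor library is a guess at the intended semantics, since the options were still tentative at this point in the review.

```cpp
#include <set>
#include <string>

enum class Provider { Libc, Vendor };

// Hypothetical model of the tentative options discussed in the review.
struct GpuMathConfig {
  bool vendor_math = true;                // models LIBC_GPU_VENDOR_MATH
  std::set<std::string> vendor_functions; // models LIBC_GPU_VENDOR_MATH_FUNCTIONS
  std::set<std::string> implemented;      // functions libc implements itself
};

// Decide where a math function comes from under the assumed policy.
Provider select_provider(const GpuMathConfig &cfg, const std::string &fn) {
  if (cfg.vendor_math)
    return Provider::Vendor; // assumed: global switch prefers the vendor
  if (cfg.vendor_functions.count(fn))
    return Provider::Vendor; // explicitly requested from the vendor
  if (!cfg.implemented.count(fn))
    return Provider::Vendor; // not implemented in libc: fall back
  return Provider::Libc;
}
```

With the global switch off, only functions that libc implements and that are not explicitly listed resolve to the in-tree version, which matches the fallback behaviour described in the reply.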

There are questions around accuracy, performance and control flow optimization (branches et al.) for the GPU. The GPU port owners should work with the math owners to set up configuration options and/or tune the math library for GPU builds. For example, the GPU port might tolerate a large error in favor of reducing the number of branches in the code. I am sure @lntue will be happy to work with you on this.

I'd be happy to work with @lntue on this as well. For starters I could enable a way to benchmark GPU kernels called via the nvptx-loader or amdhsa-loader which should let @lntue more easily test functions. Several of these functions are simply intrinsics and can be implemented as such, I'm sure @arsenm knows which ones. Otherwise we can simply check the output from the device libraries and see which ones are being mapped to intrinsics. I'm actually somewhat interested in this field as well so I wouldn't mind learning more about math libraries in the process, so let me know.

Updating to include support for the round function which does not depend on
the vendor libraries. This should hopefully show how more generic functionality
can be added. E.g. if we enable an entrypoint we can either have a vendor
implementation, a generic implementation, or a native GPU implementation. This
will be more easily controlled in the future but for now this should be a
sufficient outline to implement what we want.
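
A builtin-based entrypoint of the kind this update describes might look like the sketch below. The `_sketch` function names are hypothetical and the libc entrypoint macros are omitted; the only assumption about the toolchain is that the compiler provides the `__builtin_round` family, which Clang and GCC do.

```cpp
// Sketch of math entrypoints that rely on compiler builtins instead of a
// vendor library. The function names are illustrative, not the libc ones.
double round_sketch(double x) { return __builtin_round(x); }
float roundf_sketch(float x) { return __builtin_roundf(x); }
long double roundl_sketch(long double x) { return __builtin_roundl(x); }
```

Because the compiler lowers the builtin for whatever target is being built, the same source can serve as the generic implementation without any vendor dependency.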

Harbormaster completed remote builds in B238482: Diff 530894. Jun 13 2023, 7:06 AM

I'm happy with choices, and I'm more than happy if we could avoid vendor math libraries at some point.
The round example is a nice one already, we can probably have more like that fairly early, e.g., abs.

If we can get some level of math functions in, we will start building the tooling to include it, and circumvent the math wrappers in clang.
Then we can start measuring performance for applications. Maybe we should also find/build a micro test suite for math functions (on the GPU).
That would allow us to know when we can replace vendor math with our own math by default.

AntonRydahl mentioned this in D152575: Added modf for NVPTX and AMDGPU targets to implement 'libmgpu.a' for math on the GPU. Jun 13 2023, 9:08 PM

Code and structuring LGTM but a lot of comments/suggestions around the messaging. For the next step, I really want to see the mass of libc's own implementations increase with more builtin wrappers before increasing the mass of vendor wrappers.

libc/config/gpu/entrypoints.txt
87	Normally, we add the entire family of functions when adding primitive operations. In this case, you should add `roundf` and `roundl` also.
libc/src/math/gpu/CMakeLists.txt
8	Use vendor wrappers for GPU math
libc/src/math/gpu/round.cpp
15	Just saying: Normally, on IEEE-754 conformant HW, the instructions generated by builtins exhibit standard conformance. I am not sure if the GPU floating point instructions are IEEE-754 conformant. If not, we should maybe say that on the math status pages somewhere. You can do this separately of course.
libc/src/math/gpu/vendor/CMakeLists.txt
2	The wording here has to be fixed. The first sentence should say something like: Math functions not yet available in the libc project, or those not yet tuned for GPU workloads are provided as wrappers over vendor libraries. The current wording will become a license to start adding wrappers for everything. For example: https://reviews.llvm.org/D152575 Also, a lot of this information should be moved to `math/gpu/CMakeLists.txt`.
5–6	In the future, we will use implementations from the libc project and not provide these wrappers. Use the word "not" even if we do carry an option to wrap vendor libraries for a long time.
10	AMDGPU math functions falling back to vendor libraries will wrap ROCm device library. Feel free to use more appropriate wording, but the point I want to make is that the message should not appear like a blanket statement.
20	Ditto NVPTX math functions falling back to vendor libraries will wrap CUDA device library.

JonChesterfield added inline comments. Jun 14 2023, 1:23 AM

libc/src/math/gpu/vendor/amdgpu/platform.h
36	This oclc isa version thing is not great. We should patch clang to emit the value directly instead of linking in an IR file with the same name that defines that value, then this massive branch thing evaporates

jhuber6 added inline comments. Jun 14 2023, 5:01 AM

libc/src/math/gpu/vendor/amdgpu/platform.h
36	I tried doing that at some point but gave up, primarily because we mix per-TU controls (fast math like above) and per-executable ones (like this). I don't see a point in only enabling this single one. I think @arsenm was looking at removing the need for the per-TU math controls which would make that patch viable again.

The ISA one is special. There's a direct correspondence between compiler argument and which value is set, and only sorrow could come from splicing in an IR file with a different value to that passed to the compiler, and it'll kill off this massive ifdef mess immediately. Net reduction in code and fewer failure modes. Even faster compilation as we don't have to open a file

In D152486#4420656, @JonChesterfield wrote:

The ISA one is special. There's a direct correspondence between compiler argument and which value is set, and only sorrow could come from splicing in an IR file with a different value to that passed to the compiler, and it'll kill off this massive ifdef mess immediately. Net reduction in code and fewer failure modes. Even faster compilation as we don't have to open a file

Killing off these globals is good in the long run, but I don't want it to be a blocker on this patch.

libc/src/math/gpu/round.cpp
15	Yeah I'll want a documentation page for this once we get it in. Although I myself am not sure what these map to.

Addressing comments

Harbormaster completed remote builds in B238763: Diff 531276. Jun 14 2023, 5:38 AM

Forgot closing parens

And this one, sorry for the spam.

Harbormaster completed remote builds in B238800: Diff 531322. Jun 14 2023, 7:32 AM

lntue added inline comments. Jun 14 2023, 7:33 AM

libc/config/gpu/entrypoints.txt
84–90	Can you also update the math implementation status table at https://github.com/llvm/llvm-project/blob/main/libc/docs/math/index.rst ? Thanks,

jhuber6 added inline comments. Jun 14 2023, 7:35 AM

libc/config/gpu/entrypoints.txt
84–90	I was thinking of doing this in a separate patch, and we may want to have a different category for the GPU that indicates if it's implemented natively or if we rely on the vendor. Then we'd have some documentation about the considerations of GPU math, like reduced precision, ULP modes, linking to vendor documentation for those functions, etc.

lntue added inline comments. Jun 14 2023, 7:42 AM

libc/config/gpu/entrypoints.txt
84–90	A separate patch is fine for me. I plan to use this table for enabled entrypoints. For extra information about how they are implemented and what are the options / tradeoffs, you can put them under https://libc.llvm.org/math/#algorithms-implementation-details , or add a separate page and add a link to it in that section.

jhuber6 added a child revision: D152923: [libc] Add support for FMA in the GPU utilities. Jun 14 2023, 7:42 AM

In D152486#4420071, @sivachandra wrote:

Code and structuring LGTM but a lot of comments/suggestions around the messaging. For the next step, I really want to see the mass of libc's own implementations increase with more builtin wrappers before increasing the mass of vendor wrappers.

Our plan is to move all math functions from the headers here. In the process we'll try to use builtins and generic versions as much as possible. Afterwards we will have a list of things that are still vendor wrappers so we can cross them off one by one. Is that OK?

sivachandra accepted this revision. Jun 14 2023, 10:10 AM

sivachandra added inline comments.

libc/config/gpu/entrypoints.txt
84–90	Yes, I vote for a separate GPU math status page as it seems like it is going to be different even wrt the math library promise we make about the CPU implementations.

This revision is now accepted and ready to land. Jun 14 2023, 10:10 AM

Closed by commit rG8060d96aed7c: [libc] Begin implementing a 'libmgpu.a' for math on the GPU (authored by jhuber6). Jun 14 2023, 10:59 AM

This revision was automatically updated to reflect the committed changes.

jhuber6 added a commit: rG8060d96aed7c: [libc] Begin implementing a 'libmgpu.a' for math on the GPU.

elmcdonough mentioned this in rG546c9b3f6a54: [libc] Add math functions to AMD/NVPTX libm. Jul 26 2023, 1:02 AM

Revision Contents

Path	Size
libc/
config/
gpu/
entrypoints.txt	10 lines
headers.txt	1 line
src/
math/
CMakeLists.txt	13 lines
gpu/
(name missing)	34 lines
(name missing)	16 lines
(name missing)	16 lines
(name missing)	23 lines
vendor/
CMakeLists.txt	41 lines
amdgpu/
(name missing)	25 lines
(name missing)	20 lines
(name missing)	110 lines
(name missing)	22 lines
nvptx/
declarations.h	20 lines
nvptx.h	24 lines
sin.cpp	18 lines

Diff 531276

libc/config/gpu/entrypoints.txt

[75 unchanged lines not shown]	set(TARGET_LIBC_ENTRYPOINTS
# stdio.h entrypoints		# stdio.h entrypoints
libc.src.stdio.puts		libc.src.stdio.puts
libc.src.stdio.fputs		libc.src.stdio.fputs
libc.src.stdio.stdin		libc.src.stdio.stdin
libc.src.stdio.stdout		libc.src.stdio.stdout
libc.src.stdio.stderr		libc.src.stdio.stderr
)		)

		set(TARGET_LIBM_ENTRYPOINTS
		# math.h entrypoints
		libc.src.math.sin
		libc.src.math.round
		sivachandra (inline comment, Not Done): Normally, we add the entire family of functions when adding primitive operations. In this case, you should add `roundf` and `roundl` also.
		libc.src.math.roundf
		libc.src.math.roundl
		)
		lntue (inline comment, Not Done): Can you also update the math implementation status table at https://github.com/llvm/llvm-project/blob/main/libc/docs/math/index.rst ? Thanks,
		jhuber6 (author, inline comment, Done): I was thinking of doing this in a separate patch, and we may want to have a different category for the GPU that indicates whether a function is implemented natively or relies on the vendor. Then we'd have some documentation about the considerations of GPU math, like reduced precision, ULP modes, links to vendor documentation for those functions, etc.
		lntue (inline comment, Not Done): A separate patch is fine for me. I plan to use this table for enabled entrypoints. For extra information about how they are implemented and what the options and tradeoffs are, you can put it under https://libc.llvm.org/math/#algorithms-implementation-details , or add a separate page and link to it from that section.
		sivachandra (inline comment, Done): Yes, I vote for a separate GPU math status page, as it seems like it is going to be different even with respect to the math library promise we make about the CPU implementations.

set(TARGET_LLVMLIBC_ENTRYPOINTS		set(TARGET_LLVMLIBC_ENTRYPOINTS
${TARGET_LIBC_ENTRYPOINTS}		${TARGET_LIBC_ENTRYPOINTS}
		${TARGET_LIBM_ENTRYPOINTS}
)		)

libc/config/gpu/headers.txt

	set(TARGET_PUBLIC_HEADERS			set(TARGET_PUBLIC_HEADERS
	libc.include.ctype			libc.include.ctype
	libc.include.string			libc.include.string
				libc.include.math
	libc.include.fenv			libc.include.fenv
	libc.include.errno			libc.include.errno
	libc.include.stdlib			libc.include.stdlib
	libc.include.stdio			libc.include.stdio
	)			)

libc/src/math/CMakeLists.txt

[12 unchanged lines not shown]	add_entrypoint_object(
${name}		${name}
ALIAS		ALIAS
DEPENDS		DEPENDS
.${LIBC_TARGET_ARCHITECTURE}.${name}		.${LIBC_TARGET_ARCHITECTURE}.${name}
)		)
return()		return()
endif()		endif()

		# The GPU optionally depends on vendor libraries. If we emitted one of these
		# entrypoints it means the user requested it and we should use it instead.
		get_fq_target_name("${LIBC_TARGET_ARCHITECTURE}.vendor.${name}" fq_vendor_specific_target_name)
		if(TARGET ${fq_vendor_specific_target_name})
		add_entrypoint_object(
		${name}
		ALIAS
		DEPENDS
		.${LIBC_TARGET_ARCHITECTURE}.vendor.${name}
		)
		return()
		endif()

get_fq_target_name("generic.${name}" fq_generic_target_name)		get_fq_target_name("generic.${name}" fq_generic_target_name)
if(TARGET ${fq_generic_target_name})		if(TARGET ${fq_generic_target_name})
add_entrypoint_object(		add_entrypoint_object(
${name}		${name}
ALIAS		ALIAS
DEPENDS		DEPENDS
.generic.${name}		.generic.${name}
)		)
[166 unchanged lines not shown]
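The hunk above inserts a vendor lookup between the architecture-specific alias and the generic alias. A minimal sketch of that lookup order follows, written in C++ only to make the priority explicit. The function name, the `"gpu"` architecture string, and the idea of passing the known targets as a set are all assumptions for illustration; the real logic lives in CMake and the hidden context lines may differ.

```cpp
#include <set>
#include <string>

// Hypothetical model of the alias lookup in libc/src/math/CMakeLists.txt.
// Returns the first existing target in priority order, or an empty string
// when none of the three candidates exists.
inline std::string resolve_math_target(const std::set<std::string> &targets,
                                       const std::string &arch,
                                       const std::string &name) {
  const std::string candidates[] = {
      arch + "." + name,        // architecture-specific, e.g. "gpu.round"
      arch + ".vendor." + name, // vendor wrapper, e.g. "gpu.vendor.sin"
      "generic." + name,        // generic implementation, e.g. "generic.sin"
  };
  for (const std::string &candidate : candidates)
    if (targets.count(candidate) != 0)
      return candidate;
  return std::string();
}
```

With targets `gpu.vendor.sin` and `generic.sin` both present, the sketch picks `gpu.vendor.sin`, which mirrors the comment in the hunk: if a vendor entrypoint was emitted, the user requested it and it wins over the generic version.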

libc/src/math/gpu/CMakeLists.txt

This file was added.

				# Math functions not yet available in the libc project, or those not yet tuned
				# for GPU workloads are provided as wrappers over vendor libraries. If we find
				# them ahead of time we will import them statically. Otherwise, we will keep
				# them as external references and expect them to be resolved by the user when
				# they compile. In the future, we will use implementations from the 'libc'
				# project and not provide these wrappers.
				add_subdirectory(vendor)

				sivachandra (inline comment, Done): "Use vendor wrappers for GPU math"
				# For the GPU we want to be able to optionally depend on the vendor libraries
				# until we have a suitable replacement inside `libc`.
				# TODO: We should have an option to enable or disable these on a per-function
				# basis.
				option(LIBC_GPU_VENDOR_MATH "Use vendor wrappers for GPU math" ON)
				function(add_math_entrypoint_gpu_object name)
				get_fq_target_name("vendor.${name}" fq_vendor_specific_target_name)
				if(TARGET ${fq_vendor_specific_target_name} AND ${LIBC_GPU_VENDOR_MATH})
				return()
				endif()

				add_entrypoint_object(
				${name}
				${ARGN}
				)
				endfunction()

				add_math_entrypoint_gpu_object(
				round
				SRCS
				round.cpp
				HDRS
				../round.h
				COMPILE_OPTIONS
				-O2
				)

libc/src/math/gpu/round.cpp

This file was added.

				//===-- Implementation of the GPU round function --------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "src/math/round.h"
				#include "src/__support/common.h"

				namespace __llvm_libc {

				LLVM_LIBC_FUNCTION(double, round, (double x)) { return __builtin_round(x); }

				sivachandra (inline comment, Done): Just saying: normally, on IEEE-754 conformant HW, the instructions generated by builtins exhibit standard conformance. I am not sure if the GPU floating point instructions are IEEE-754 conformant. If not, we should maybe say that on the math status pages somewhere. You can do this separately of course.
				jhuber6 (author, inline comment, Done): Yeah, I'll want a documentation page for this once we get it in, although I myself am not sure what these map to.
				} // namespace __llvm_libc
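The conformance question raised in the inline comments can be spot-checked against the C standard's definition of `round`: halfway cases round away from zero. The sketch below is a host-side illustration under that assumption, not something from the patch. `reference_round` and `round_matches_reference` are hypothetical names, and `std::round` stands in for the `__builtin_round` call used in `round.cpp`.

```cpp
#include <cmath>

// Reference: round to nearest integer, halfway cases away from zero.
// The subtraction is exact for finite doubles, so no double rounding occurs.
inline double reference_round(double x) {
  double truncated = std::trunc(x);
  if (std::fabs(x - truncated) >= 0.5)
    truncated += std::copysign(1.0, x);
  return truncated;
}

// Compares the library result with the reference, treating NaNs as equal
// and distinguishing +0.0 from -0.0.
inline bool round_matches_reference(double x) {
  const double got = std::round(x);
  const double want = reference_round(x);
  if (std::isnan(got) || std::isnan(want))
    return std::isnan(got) && std::isnan(want);
  return got == want && std::signbit(got) == std::signbit(want);
}
```

A check of this shape would have to run on the device to say anything about the GPU instructions; on a host it only demonstrates the expected semantics.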

libc/src/math/gpu/roundf.cpp

This file was added.

				//===-- Implementation of the GPU roundf function -------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "src/math/roundf.h"
				#include "src/__support/common.h"

				namespace __llvm_libc {

				LLVM_LIBC_FUNCTION(float, roundf, (float x)) { return __builtin_roundf(x); }

				} // namespace __llvm_libc

libc/src/math/gpu/roundl.cpp

This file was added.

				//===-- Implementation of the GPU roundl function -------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "src/math/roundl.h"
				#include "src/__support/FPUtil/PlatformDefs.h"
				#include "src/__support/common.h"

				namespace __llvm_libc {

				#ifndef LONG_DOUBLE_IS_DOUBLE
				#error "GPU targets do not support long doubles"
				#endif

				LLVM_LIBC_FUNCTION(long double, roundl, (long double x)) {
				return __builtin_round(x);
				}

				} // namespace __llvm_libc
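`roundl.cpp` relies on `long double` having the same representation as `double`, which is why forwarding to the `double` builtin is acceptable there. A minimal sketch of how such a property could be tested at compile time with `<cfloat>` follows. This is an assumption for illustration: `long_double_is_double` is a hypothetical helper, and the patch itself uses the `LONG_DOUBLE_IS_DOUBLE` macro from `PlatformDefs.h`, whose definition is not shown here.

```cpp
#include <cfloat>

// Hypothetical check: true when long double has the same mantissa width and
// exponent range as double on the current target.
constexpr bool long_double_is_double() {
  return LDBL_MANT_DIG == DBL_MANT_DIG && LDBL_MAX_EXP == DBL_MAX_EXP &&
         LDBL_MIN_EXP == DBL_MIN_EXP;
}
```

On targets with an extended `long double`, the helper returns false, which corresponds to the case where the `#error` in `roundl.cpp` would fire.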

libc/src/math/gpu/vendor/CMakeLists.txt

This file was added.

				find_package(AMDDeviceLibs QUIET HINTS ${CMAKE_INSTALL_PREFIX} PATHS /opt/rocm)
				if(AMDDeviceLibs_FOUND)
				sivachandra (inline comment, Done): The wording here has to be fixed. The first sentence should say something like: "Math functions not yet available in the libc project, or those not yet tuned for GPU workloads are provided as wrappers over vendor libraries." The current wording will become a license to start adding wrappers for everything. For example: https://reviews.llvm.org/D152575. Also, a lot of this information should be moved to `math/gpu/CMakeLists.txt`.
				message(STATUS "Found the ROCm device library. Implementations falling back "
				"to the vendor libraries will be resolved statically.")
				get_target_property(ocml_path ocml IMPORTED_LOCATION)
				list(APPEND bitcode_link_flags
				"SHELL:-Xclang -mlink-builtin-bitcode -Xclang ${ocml_path}")
				sivachandra (inline comment, Done): "In the future, we will use implementations from the libc project and not provide these wrappers." Use the word "not" even if we do carry an option to wrap vendor libraries for a long time.
				else()
				message(STATUS "Could not find the ROCm device library. Unimplemented "
				"functions will be an external reference to the vendor libraries.")
				sivachandra (inline comment, Done): "AMDGPU math functions falling back to vendor libraries will wrap ROCm device library." Feel free to use more appropriate wording, but the point I want to make is that the message should not appear like a blanket statement.
				endif()

				find_package(CUDAToolkit QUIET)
				if(CUDAToolkit_FOUND)
				set(libdevice_path ${CUDAToolkit_BIN_DIR}/../nvvm/libdevice/libdevice.10.bc)
				if (EXISTS ${libdevice_path})
				message(STATUS "Found the CUDA device library. Implementations falling back "
				"to the vendor libraries will be resolved statically.")
				list(APPEND bitcode_link_flags
				"SHELL:-Xclang -mlink-builtin-bitcode -Xclang ${libdevice_path}")
				sivachandra (inline comment, Done): Ditto: "NVPTX math functions falling back to vendor libraries will wrap CUDA device library."
				endif()
				else()
				message(STATUS "Could not find the CUDA device library. Unimplemented "
				"functions will be an external reference to the vendor libraries.")
				endif()

				# FIXME: We need a way to pass the library to only the NVPTX / AMDGPU build.
				# This shouldn't cause issues because we only link in needed symbols, but it
				# will link in identity metadata from both libraries. This silences the warning.
				list(APPEND bitcode_link_flags "-Wno-linker-warnings")

				add_entrypoint_object(
				sin
				SRCS
				sin.cpp
				HDRS
				../../sin.h
				COMPILE_OPTIONS
				${bitcode_link_flags}
				-O2
				)

libc/src/math/gpu/vendor/amdgpu/amdgpu.h

This file was added.

				//===-- AMDGPU specific definitions for math support ----------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_SRC_MATH_GPU_AMDGPU_H
				#define LLVM_LIBC_SRC_MATH_GPU_AMDGPU_H

				#include "declarations.h"
				#include "platform.h"

				#include "src/__support/macros/attributes.h"

				namespace __llvm_libc {
				namespace internal {

				LIBC_INLINE double sin(double x) { return __ocml_sin_f64(x); }

				} // namespace internal
				} // namespace __llvm_libc

				#endif // LLVM_LIBC_SRC_MATH_GPU_AMDGPU_H

libc/src/math/gpu/vendor/amdgpu/declarations.h

This file was added.

				//===-- AMDGPU specific declarations for math support ---------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_SRC_MATH_GPU_AMDGPU_DECLARATIONS_H
				#define LLVM_LIBC_SRC_MATH_GPU_AMDGPU_DECLARATIONS_H

				namespace __llvm_libc {

				extern "C" {
				double __ocml_sin_f64(double);
				}

				} // namespace __llvm_libc

				#endif // LLVM_LIBC_SRC_MATH_GPU_AMDGPU_DECLARATIONS_H

libc/src/math/gpu/vendor/amdgpu/platform.h

This file was added.

				//===-- AMDGPU specific platform definitions for math support -------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_SRC_MATH_GPU_AMDGPU_PLATFORM_H
				#define LLVM_LIBC_SRC_MATH_GPU_AMDGPU_PLATFORM_H

				#include <stdint.h>

				namespace __llvm_libc {

				// The ROCm device library uses control globals to alter codegen for the
				// different targets. To avoid needing to link them in manually we simply
				// define them here.
				extern "C" {

				// Disable unsafe math optimizations in the implementation.
				extern const uint8_t __oclc_unsafe_math_opt = 0;

				// Disable denormalization at zero optimizations in the implementation.
				extern const uint8_t __oclc_daz_opt = 0;

				// Disable rounding optimizations for 32-bit square roots.
				extern const uint8_t __oclc_correctly_rounded_sqrt32 = 0;

				// Disable finite math optimizations.
				extern const uint8_t __oclc_finite_only_opt = 0;

				#if defined(__gfx700__)
				extern const uint32_t __oclc_ISA_version = 7000;
				#elif defined(__gfx701__)
				extern const uint32_t __oclc_ISA_version = 7001;
				JonChesterfield (inline comment, Not Done): This oclc isa version thing is not great. We should patch clang to emit the value directly instead of linking in an IR file with the same name that defines that value, then this massive branch thing evaporates.
				jhuber6 (author, inline comment, Done): I tried doing that at some point but gave up, primarily because we mix per-TU controls (fast math like above) and per-executable ones (like this). I don't see a point in only enabling this single one. I think @arsenm was looking at removing the need for the per-TU math controls, which would make that patch viable again.
				#elif defined(__gfx702__)
				extern const uint32_t __oclc_ISA_version = 7002;
				#elif defined(__gfx703__)
				extern const uint32_t __oclc_ISA_version = 7003;
				#elif defined(__gfx704__)
				extern const uint32_t __oclc_ISA_version = 7004;
				#elif defined(__gfx705__)
				extern const uint32_t __oclc_ISA_version = 7005;
				#elif defined(__gfx801__)
				extern const uint32_t __oclc_ISA_version = 8001;
				#elif defined(__gfx802__)
				extern const uint32_t __oclc_ISA_version = 8002;
				#elif defined(__gfx803__)
				extern const uint32_t __oclc_ISA_version = 8003;
				#elif defined(__gfx805__)
				extern const uint32_t __oclc_ISA_version = 8005;
				#elif defined(__gfx810__)
				extern const uint32_t __oclc_ISA_version = 8100;
				#elif defined(__gfx900__)
				extern const uint32_t __oclc_ISA_version = 9000;
				#elif defined(__gfx902__)
				extern const uint32_t __oclc_ISA_version = 9002;
				#elif defined(__gfx904__)
				extern const uint32_t __oclc_ISA_version = 9004;
				#elif defined(__gfx906__)
				extern const uint32_t __oclc_ISA_version = 9006;
				#elif defined(__gfx908__)
				extern const uint32_t __oclc_ISA_version = 9008;
				#elif defined(__gfx909__)
				extern const uint32_t __oclc_ISA_version = 9009;
				#elif defined(__gfx90a__)
				extern const uint32_t __oclc_ISA_version = 9010;
				#elif defined(__gfx90c__)
				extern const uint32_t __oclc_ISA_version = 9012;
				#elif defined(__gfx940__)
				extern const uint32_t __oclc_ISA_version = 9400;
				#elif defined(__gfx1010__)
				extern const uint32_t __oclc_ISA_version = 10100;
				#elif defined(__gfx1011__)
				extern const uint32_t __oclc_ISA_version = 10101;
				#elif defined(__gfx1012__)
				extern const uint32_t __oclc_ISA_version = 10102;
				#elif defined(__gfx1013__)
				extern const uint32_t __oclc_ISA_version = 10103;
				#elif defined(__gfx1030__)
				extern const uint32_t __oclc_ISA_version = 10300;
				#elif defined(__gfx1031__)
				extern const uint32_t __oclc_ISA_version = 10301;
				#elif defined(__gfx1032__)
				extern const uint32_t __oclc_ISA_version = 10302;
				#elif defined(__gfx1033__)
				extern const uint32_t __oclc_ISA_version = 10303;
				#elif defined(__gfx1034__)
				extern const uint32_t __oclc_ISA_version = 10304;
				#elif defined(__gfx1035__)
				extern const uint32_t __oclc_ISA_version = 10305;
				#elif defined(__gfx1036__)
				extern const uint32_t __oclc_ISA_version = 10306;
				#elif defined(__gfx1100__)
				extern const uint32_t __oclc_ISA_version = 11000;
				#elif defined(__gfx1101__)
				extern const uint32_t __oclc_ISA_version = 11001;
				#elif defined(__gfx1102__)
				extern const uint32_t __oclc_ISA_version = 11002;
				#elif defined(__gfx1103__)
				extern const uint32_t __oclc_ISA_version = 11003;
				#else
				#error "Unknown AMDGPU architecture"
				#endif
				}

				} // namespace __llvm_libc

				#endif // LLVM_LIBC_SRC_MATH_GPU_AMDGPU_PLATFORM_H
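Every entry in the `#elif` chain above follows one pattern: for a name of the form `gfx<major><minor><stepping>`, where minor and stepping are single hexadecimal digits, the value is `major * 1000 + minor * 100 + stepping`. The sketch below encodes that observation. It is an illustration only: `isa_version_from_name` and `hex_digit_value` are hypothetical helpers, not part of the patch, clang, or the ROCm device libraries, and the pattern has only been checked against the names listed in this file.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Converts one lowercase hexadecimal digit to its value, or -1 if invalid.
inline int hex_digit_value(char c) {
  if (c >= '0' && c <= '9')
    return c - '0';
  if (c >= 'a' && c <= 'f')
    return 10 + (c - 'a');
  return -1;
}

// Decodes names such as "gfx90a" or "gfx1030". Returns 0 for anything that
// does not look like "gfx" + decimal major + hex minor + hex stepping.
inline std::uint32_t isa_version_from_name(const std::string &arch) {
  if (arch.size() < 6 || arch.compare(0, 3, "gfx") != 0)
    return 0;
  const int minor = hex_digit_value(arch[arch.size() - 2]);
  const int stepping = hex_digit_value(arch[arch.size() - 1]);
  if (minor < 0 || stepping < 0)
    return 0;
  std::uint32_t major = 0;
  for (std::size_t i = 3; i + 2 < arch.size(); ++i) {
    if (arch[i] < '0' || arch[i] > '9')
      return 0;
    major = major * 10 + static_cast<std::uint32_t>(arch[i] - '0');
  }
  return major * 1000 + static_cast<std::uint32_t>(minor) * 100 +
         static_cast<std::uint32_t>(stepping);
}
```

For example, `gfx90a` decodes to major 9, minor 0, stepping 10, giving 9010, which matches the table. A rule like this is the kind of thing the compiler could apply directly if it emitted the value itself, as suggested in the inline comment.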

libc/src/math/gpu/vendor/common.h

This file was added.

				//===-- Common interface for compiling the GPU math -----------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_SRC_MATH_GPU_COMMON_H
				#define LLVM_LIBC_SRC_MATH_GPU_COMMON_H

				#include "src/__support/macros/properties/architectures.h"

				#if defined(LIBC_TARGET_ARCH_IS_AMDGPU)
				#include "amdgpu/amdgpu.h"
				#elif defined(LIBC_TARGET_ARCH_IS_NVPTX)
				#include "nvptx/nvptx.h"
				#else
				#error "Unsupported platform"
				#endif

				#endif // LLVM_LIBC_SRC_MATH_GPU_COMMON_H

libc/src/math/gpu/vendor/nvptx/declarations.h

This file was added.

				//===-- NVPTX specific declarations for math support ----------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_SRC_MATH_GPU_NVPTX_DECLARATIONS_H
				#define LLVM_LIBC_SRC_MATH_GPU_NVPTX_DECLARATIONS_H

				namespace __llvm_libc {

				extern "C" {
				double __nv_sin(double);
				}

				} // namespace __llvm_libc

				#endif // LLVM_LIBC_SRC_MATH_GPU_NVPTX_DECLARATIONS_H

libc/src/math/gpu/vendor/nvptx/nvptx.h

This file was added.

				//===-- NVPTX specific definitions for math support -----------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#ifndef LLVM_LIBC_SRC_MATH_GPU_NVPTX_H
				#define LLVM_LIBC_SRC_MATH_GPU_NVPTX_H

				#include "declarations.h"

				#include "src/__support/macros/attributes.h"

				namespace __llvm_libc {
				namespace internal {

				LIBC_INLINE double sin(double x) { return __nv_sin(x); }

				} // namespace internal
				} // namespace __llvm_libc

				#endif // LLVM_LIBC_SRC_MATH_GPU_NVPTX_H

libc/src/math/gpu/vendor/sin.cpp

This file was added.

				//===-- Implementation of the sin function for GPU ------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//

				#include "src/math/sin.h"
				#include "src/__support/common.h"

				#include "common.h"

				namespace __llvm_libc {

				LLVM_LIBC_FUNCTION(double, sin, (double x)) { return internal::sin(x); }

				} // namespace __llvm_libc
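Taken together, `common.h`, the per-vendor headers, and `sin.cpp` form a three-layer forwarding chain: the public entry point calls `internal::sin`, which is selected at compile time and calls the vendor symbol. A host-buildable sketch of that shape follows. Everything in it is a stand-in: the `sketch` and `vendor_stub` namespaces and the `SKETCH_TARGET_NVPTX` macro are hypothetical, and `std::sin` replaces the real `__ocml_sin_f64` and `__nv_sin` symbols so the example links without a GPU toolchain.

```cpp
#include <cmath>

// Stand-ins for the vendor entry points. In the patch these are
// __ocml_sin_f64 (AMDGPU) and __nv_sin (NVPTX), declared extern "C".
namespace vendor_stub {
inline double amdgpu_sin(double x) { return std::sin(x); }
inline double nvptx_sin(double x) { return std::sin(x); }
} // namespace vendor_stub

namespace sketch {
namespace internal {
// Compile-time selection, mirroring the role of common.h. The macro name is
// an assumption for this sketch.
#if defined(SKETCH_TARGET_NVPTX)
inline double sin(double x) { return vendor_stub::nvptx_sin(x); }
#else
inline double sin(double x) { return vendor_stub::amdgpu_sin(x); }
#endif
} // namespace internal

// Public entry point, mirroring the role of vendor/sin.cpp.
inline double sin(double x) { return internal::sin(x); }
} // namespace sketch
```

The benefit of the extra `internal` layer is that the entry point source stays identical across vendors; only the header selected by the architecture macro changes.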
