This is an archive of the discontinued LLVM Phabricator instance.

[libc] Begin implementing a 'libmgpu.a' for math on the GPU
ClosedPublic

Authored by jhuber6 on Jun 8 2023, 4:31 PM.

Details

Summary

This patch adds an outline to begin adding a libmgpu.a file for
providing math on the GPU. Currently, this will most likely wrap the
existing vendor libraries and place them in a more usable format. Long
term, we would like to provide our own implementations of math
functions that can be used instead.

This patch works by simply forwarding standard C math library calls
like sin to the appropriate vendor call like __nv_sin. Currently, we
will use the vendor libraries directly and link them in via
-mlink-builtin-bitcode. This is necessary because of bizarre
interactions with the generic bitcode: -mlink-builtin-bitcode
internalizes and only links in the used symbols, it propagates the
target's default attributes, and it is the only "truly" correct way to
pull in these vendor bitcode libraries without error.
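
In code, each wrapper amounts to roughly the following sketch (simplified; the actual entrypoints in this patch go through libc's usual function macros, and the vendor declaration shown is the one provided by libdevice.10.bc):

  // Sketch only: forward the standard name to the vendor bitcode symbol.
  extern "C" double __nv_sin(double);

  namespace __llvm_libc {
  double sin(double x) { return __nv_sin(x); }
  } // namespace __llvm_libc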

If the vendor libraries are not available at build time, we will still
create libmgpu.a, but we will expect the vendor library definitions to
be provided by the user's compilation, as made possible by
https://reviews.llvm.org/D152442.

Diff Detail

Event Timeline

jhuber6 created this revision.Jun 8 2023, 4:31 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptJun 8 2023, 4:31 PM
jhuber6 requested review of this revision.Jun 8 2023, 4:31 PM
arsenm added a comment.EditedJun 8 2023, 4:38 PM

Specifically for ocml functions, we're really close to not requiring the internalization. D149715 is the main piece I need to remove the last subtarget features.

The wavesize is still a bit problematic. Ideally we would have separate wave32 and wave64 builds, and not allow mixing wavesizes in a single module. With un-wavesized library IR, you can kind of get away with relying on the global subtarget (which I guess is the same problem you have with target-cpu).

Specifically for ocml functions, we're really close to not requiring the internalization. D149715 is the main piece I need to remove the last subtarget features.

The wavesize is still a bit problematic. Ideally we would have separate wave32 and wave64 builds, and not allow mixing wavesizes in a single module. With un-wavesized library IR, you can kind of get away with relying on the global subtarget (which I guess is the same problem you have with target-cpu).

Fortunately, I haven't seen the wavesize used in ocml. Even if we fix the requirement to perform attribute propagation, the second issue is that all the device functions in the ROCm device libraries have protected visibility, so LTO can't optimize them out without some hacks. Being able to link these in regularly would be nice, since the only way I could think of to perform the correct attribute propagation was to re-run clang with the merged LTO bitcode. However, long term it would be nice to have a libm on the GPU that lived upstream in this repository. We could potentially port the OpenCL in the ROCm device libraries, but I don't know how popular that would be internally at AMD.

jhuber6 retitled this revision from [libc] Begin implementing a 'libmgpu.a' for math on th GPU to [libc] Begin implementing a 'libmgpu.a' for math on the GPU.Jun 8 2023, 5:34 PM
jhuber6 updated this revision to Diff 529925.Jun 9 2023, 5:24 AM

Missing inline on the internal definition.

@Matt good news on ocml, thanks. I think we should add a wave size intrinsic, unconditionally expand it somewhere in clang and the backend and replace the current magic variable in ocml with it. The main appeal of encoding it that way is that we can also immediately kill dead branches on the value of the intrinsic, which would solve various O0-related problems from invalid-but-not-called code reaching the backend.

@Joseph I'd really like us to be able to get rid of the header files that currently do the remapping from libm calls to the vendor ones. Have you spoken to the hip / cuda / openmp people to see if any of them are in principle willing to part with the current header translation thing?

@Joseph I'd really like us to be able to get rid of the header files that currently do the remapping from libm calls to the vendor ones. Have you spoken to the hip / cuda / openmp people to see if any of them are in principle willing to part with the current header translation thing?

Killing off the headers would be ideal; as it stands, there are a few deficiencies in this approach.

  • This currently relies on a build-time dependency on libdevice.10.bc or ocml.bc for the best support. We can link it late with the support in D152442, but that's currently an opt-in thing because it will result in longer compile times. Ideally we would have the files we need in a binary blob upstream somewhere, or we would implement the math functions without relying on them.
  • We don't handle the special-case math optimizations that the vendors have. The headers allow us, at compile time, to select -ffast-math variants from libdevice.10.bc or to enable the controls for things like __oclc_unsafe_math_opt.
  • The NVPTX backend currently cannot codegen the math intrinsics, so we'll need to fix that before this can be used.
  • Obviously this requires building the libc project, which would then need to be mandatory for math if we remove the headers.
arsenm added a comment.Jun 9 2023, 6:59 AM

@Matt good news on ocml, thanks. I think we should add a wave size intrinsic, unconditionally expand it somewhere in clang and the backend and replace the current magic variable in ocml with it.

We used to do that, and it doesn't work. We can't codegen different subtargets within the same function like that. Really, having wave32 and wave64 coexist in the same module leads to a variety of untenable situations.

Math optimisations were a bit of a mess a few years ago: some in clang, some in llvm, some in backends, some based on intrinsics, and some based on the name of libm functions. There's a decent chance ocml_sin(0) won't constant fold and sin(0) will, for example. There's a decent chance we'd win overall by keeping the functions named after libm functions until at least some of the middle end has run, before translating them into the vendor ones latish. Phase ordering is always tricky.

JonChesterfield added a comment.EditedJun 9 2023, 7:04 AM

@Matt good news on ocml, thanks. I think we should add a wave size intrinsic, unconditionally expand it somewhere in clang and the backend and replace the current magic variable in ocml with it.

We used to do that, and it doesn't work. We can't codegen different subtargets within the same function like that. Really, having wave32 and wave64 coexist in the same module leads to a variety of untenable situations.

I don't think that's quite what I'm proposing.

Let ocml be wavesize agnostic, modulo a call to a clang intrinsic __builtin_amdgcn_wavesize() or whatever. Get the ocml IR out of clang however. Some of it is handwritten as IR, some is opencl, whatever.

Let clang take an argument for wavesize, or default it from the arch, or however we want to pick between 32 and 64. Write that information in the IR somewhere and/or pass it to the backend.

Expand the wavesize intrinsic to something that robustly deletes the dead basic block when it's used as a branch condition, and otherwise to the 32|64 literal. At no point does an IR module contain both wave32 and wave64 code, but some of the library code doesn't know which it's going to be.
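
As a sketch of what this lets library code look like (the wavesize intrinsic here is the hypothetical one proposed above, and the helper is purely illustrative):

  // Wavesize-agnostic library code branching on a hypothetical intrinsic.
  unsigned active_lanes(bool p) {
    if (__builtin_amdgcn_wavesize() == 64) // hypothetical intrinsic
      return __builtin_popcountll(__builtin_amdgcn_ballot_w64(p));
    return __builtin_popcount(__builtin_amdgcn_ballot_w32(p));
  }
  // Once the intrinsic is expanded to 32 or 64, the dead branch is deleted,
  // so the code reaching the backend never mixes wave32 and wave64 paths.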

I'm aware that it's possible to have a wave32 kernel and a wave64 kernel in the same IR module, and consider that so full of hazards that clang should reject it up front, and people who want to run mixed wavesize kernels from a single application can build them into separate code objects.

arsenm added a comment.Jun 9 2023, 7:05 AM

Math optimisations were a bit of a mess a few years ago.

And they still are. The one I'd currently like to tackle is the fact that ocml treats select llvm generic math intrinsics like they're target-specific fast operations. The worst offenders are sqrt and log.

Couple of questions:

  1. If all that libmgpu.a is doing is to route to the vendor specific implementations, what exactly is the benefit provided by libmgpu.a?
  2. How do GPU applications use and link to the math functions today without libmgpu.a?

Based on the answers to the above questions, I might have more questions and/or comments.

Couple of questions:

  1. If all that libmgpu.a is doing is to route to the vendor specific implementations, what exactly is the benefit provided by libmgpu.a?

The vendor libraries provide math functions that have different names, and the format they are provided in presents unfortunate design challenges. This allows us to provide math functions as a library in a more convenient format, and it allows us to resolve these to the vendor versions only at (LTO) link time. The primary benefit of this is that LLVM generally understands what a sin call is, but does not understand what an __nv_sin call is. To that end, keeping the standard calls until the link job and using our libm.a should allow more optimizations. Additionally, this remaps the vendor library into something that's more usable with our interface.
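
The optimization point shows up even on a trivial example (illustrative only):

  extern "C" double sin(double);      // known libm semantics
  extern "C" double __nv_sin(double); // opaque vendor symbol

  double folds() { return sin(0.0); }         // the optimizer can fold this to 0.0
  double survives() { return __nv_sin(0.0); } // the call must be kept as-is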

  2. How do GPU applications use and link to the math functions today without libmgpu.a?

Based on the answers to the above questions, I might have more questions and/or comments.

The current solution is to use a wrapper header, forcibly included before each offloading compilation, that maps, for example, sin to __nv_sin on the GPU. This is then resolved per-TU by the -mlink-builtin-bitcode option that you see used in a similar way in this patch. There is no linking per se currently; a lot of the work I did last year was to allow GPUs to have more traditional linking steps when using offloading languages.
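
Those wrapper headers boil down to something like this (heavily simplified; the real clang headers also handle the float variants, overloads, and the AMDGPU equivalents):

  #if defined(__NVPTX__)
  extern "C" double __nv_sin(double);
  static inline double sin(double __x) { return __nv_sin(__x); }
  #endif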

@jdoerfert or @tra will probably be able to give you a more detailed answer; I remember there being some talk a long time ago about making this libm. There's also an implementation of it in AMD's downstream fork for use with OpenMP offloading to Fortran, so this is a way to bring that upstream.

It's also worth noting that the vendor libraries are not strictly conformant to what a standard libm would contain; for example, there is no errno handling or raising of floating point exceptions. Additionally, I think a good long-term goal would be to have GPU math implementations directly in-tree here. But I am not a math person. The ROCm device libraries are open source, so inspiration could be drawn from there if we were to implement our own version: https://github.com/RadeonOpenCompute/ROCm-Device-Libs/tree/amd-stg-open/ocml/src. However, that's not the immediate value of having a libm.a for the GPU, so I'm not planning on working on that front for the time being.

elmcdonough added inline comments.Jun 9 2023, 4:02 PM
libc/src/math/gpu/amdgpu/amdgpu.h
19

Shouldn't this be internal to stay consistent with nvptx.h and sin.cpp? I wasn't able to get this patch to build with the vendor namespace, but changing it to internal got it to compile.

jhuber6 added inline comments.Jun 9 2023, 4:03 PM
libc/src/math/gpu/amdgpu/amdgpu.h
19

Yeah, that was a mistake; I forgot to update it.

It might be appealing to provide a normal math.h-like interface via the libc project, but at that point, why should it be the libc project that provides it? You can do all of this outside of the libc project. Ideally, we want the bring-up of the libc project's math on a new target architecture to be like this:

  1. Implement platform-specific primitives. Examples: fma operations, fenv manipulation functions (see the sketch below).
  2. As the primitives are being implemented, start adding math functions to the list of entrypoints.

Any reason why we cannot take this approach? Maybe there are certain operational reasons. For example, if some functions come from the libc project and the rest from elsewhere, who provides the math.h header file?

Another point to keep in mind is that the libc project's math implementations come with an additional promise: the results they produce are correctly rounded in all rounding modes. This means they produce the same result, in all rounding modes, on any target architecture that is IEEE-754 compliant.
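
For what it's worth, step 1 above usually means thin wrappers over compiler builtins where the target has a native instruction, along these lines (a sketch under that assumption, not code from this patch):

  namespace __llvm_libc {
  namespace fputil {
  // Assumes __builtin_fma lowers to the GPU's fused multiply-add instruction.
  inline double fma(double x, double y, double z) {
    return __builtin_fma(x, y, z);
  }
  } // namespace fputil
  } // namespace __llvm_libc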

It might be appealing to provide a normal math.h-like interface via the libc project, but at that point, why should it be the libc project that provides it?

Whoever provides math.h should provide the implementation. As you mentioned, two places will cause problems.

You can do all of this outside of the libc project. Ideally, we want the bring-up of the libc project's math on a new target architecture to be like this:

  1. Implement platform-specific primitives. Examples: fma operations, fenv manipulation functions.
  2. As the primitives are being implemented, start adding math functions to the list of entrypoints.

Any reason why we cannot take this approach? Maybe there are certain operational reasons. For example, if some functions come from the libc project and the rest from elsewhere, who provides the math.h header file?

Another point to keep in mind is that the libc project's math implementations come with an additional promise: the results they produce are correctly rounded in all rounding modes. This means they produce the same result, in all rounding modes, on any target architecture that is IEEE-754 compliant.

@jhuber6 wants to get rid of the vendor math, but for now that is all we have. It would be great to have compliant alternatives, if they are as fast, or a choice for the user. Right now we have neither.

All that said, what would be a better place for the math impl for GPUs, which for now is vendor based? It seems to me that next to libc_gpu is the right place, but if you insist we cannot do this we need an alternative.

It might be appealing to provide a normal math.h-like interface via the libc project, but at that point, why should it be the libc project that provides it? You can do all of this outside of the libc project. Ideally, we want the bring-up of the libc project's math on a new target architecture to be like this:

  1. Implement platform-specific primitives. Examples: fma operations, fenv manipulation functions.
  2. As the primitives are being implemented, start adding math functions to the list of entrypoints.

Any reason why we cannot take this approach? Maybe there are certain operational reasons. For example, if some functions come from the libc project and the rest from elsewhere, who provides the math.h header file?

Implementing an entirely new GPU math library would be a much larger undertaking than re-using the implementations that GPU users already depend on. We have all the infrastructure ready, due to the libc work, to provide the libm and the appropriate headers with minimal changes, as shown by this patch. Even though we are depending on something external, we want to provide the headers from here and integrate them with the rest of the offloading toolchain; I don't think there's anywhere else in the LLVM tree that's suitable for an offloading-language-agnostic math library (i.e., one that HIP, CUDA, OpenMP, etc. can use). I still think of this as "an implementation of libm". I'd like to have something that's immediately useful, as the GPU groups have been discussing implementing a libm interface for GPU math for at least three years now.

Another point to keep in mind is that the libc project's math implementations come with an additional promise: the results they produce are correctly rounded in all rounding modes. This means they produce the same result, in all rounding modes, on any target architecture that is IEEE-754 compliant.

I'm not actually certain on the status of this, or if the vendor math libraries support altering rounding modes beyond a few optimization tweaks. It's definitely a limitation.

jhuber6 updated this revision to Diff 530475.Jun 12 2023, 5:34 AM

Tweak a few things.

I think what would be valuable moving forward is a way to choose between the
implementation in libc and simply using the vendor libraries. That would allow
us to provide a libm.a that is immediately useful and then perform an
iterative implementation of other math functions.

math.h is traditionally part of libc, but it probably doesn't matter if we create a libm directory next to libc, put the code in there, and then change the link order to be libm followed by libc, on the basis that libm is more likely to want to use libc functions than the inverse.

We could always move the code and merge the archives later.

There are many points to address, some raised here and a few others raised on Discourse. I will try to respond to all of them with my thoughts, in an abstract sense, on how we ought to approach this.

  • We want users to use this libc because it provides something they cannot get elsewhere but we don't want the libc project to be a convenience wrapper. So, if there is a way for GPU programmers to already use vendor libraries, I do not think we want the libc project to be the agent which makes the use of the vendor libraries convenient.
  • That we don't want the libc project to be the agent to make it easy to use the vendor libraries should not stop us from taking a practical approach. Especially because the libc project's math library is not complete enough. The default GPU config in the libc project should use what is available in the libc project and get the rest from the vendor libraries. As more math functions get implemented, the default GPU config should switch over to the implementations available in the libc project.
  • Users who want something different (as in, not the default) should have the freedom and the ability to pick and choose from the libc project or the vendor libraries. This means that the libc project should provide a large number of config options, one for every math function, to switch between the in-tree version and the vendor library. This might sound like a lot of pain, but I would really vote for making it painful. Users not using the default should have a very good reason, and hence will take that pain. Likewise, this pain will also serve as an incentive for the GPU port owners to improve the math functions provided by the libc project.
  • There are questions around accuracy, performance and control flow optimization (branches et al.) for the GPU. The GPU port owners should work with the math owners to set up configuration options and/or tune the math library for GPU builds. For example, the GPU port might tolerate a large error in favor of reducing the number of branches in the code. I am sure @lntue will be happy to work with you on this.

There are many points to address, some raised here and a few others raised on Discourse. I will try to respond to all of them with my thoughts, in an abstract sense, on how we ought to approach this.

  • We want users to use this libc because it provides something they cannot get elsewhere but we don't want the libc project to be a convenience wrapper. So, if there is a way for GPU programmers to already use vendor libraries, I do not think we want the libc project to be the agent which makes the use of the vendor libraries convenient.
  • That we don't want the libc project to be the agent to make it easy to use the vendor libraries should not stop us from taking a practical approach. Especially because the libc project's math library is not complete enough. The default GPU config in the libc project should use what is available in the libc project and get the rest from the vendor libraries. As more math functions get implemented, the default GPU config should switch over to the implementations available in the libc project.

This is reasonable; I think supporting both is a good option so we can configure it as needed. However, we should probably get some basic performance numbers in to compare the implementations. It's probably not reasonable to make it the default if it's orders of magnitude slower, since we don't want the default user experience to leave a bad impression. Thanks for being open to this, as there is immediate utility in the GPU community for these features. @jdoerfert is eager to have these implemented specifically.

  • Users who want something different (as in, not the default) should have the freedom and the ability to pick and choose from the libc project or the vendor libraries. This means that the libc project should provide a large number of config options, one for every math function, to switch between the in-tree version and the vendor library. This might sound like a lot of pain, but I would really vote for making it painful. Users not using the default should have a very good reason, and hence will take that pain. Likewise, this pain will also serve as an incentive for the GPU port owners to improve the math functions provided by the libc project.

Okay, I will make a follow-up patch for this that presents some config options. Tentatively, the easiest would be an on/off switch like LIBC_GPU_VENDOR_MATH, which we'll start out as on and then gradually move away from. If a function isn't implemented, we will default to the vendor version. Then we could maybe use a LIBC_GPU_VENDOR_MATH_FUNCTIONS= list to explicitly set the functions that we want to be provided by the vendor libraries. Would that be sufficient?
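
In source form that could end up looking roughly like the following (entirely hypothetical; the macro name and the in-tree implementation are placeholders):

  namespace __llvm_libc {
  double sin(double x) {
  #ifdef LIBC_MATH_SIN_USE_VENDOR
    return internal::sin(x); // vendor wrapper over __nv_sin / __ocml_sin_f64
  #else
    return generic::sin(x);  // placeholder for a future in-tree implementation
  #endif
  }
  } // namespace __llvm_libc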

  • There are questions around accuracy, performance and control flow optimization (branches et al.) for the GPU. The GPU port owners should work with the math owners to set up configuration options and/or tune the math library for GPU builds. For example, the GPU port might tolerate a large error in favor of reducing the number of branches in the code. I am sure @lntue will be happy to work with you on this.

I'd be happy to work with @lntue on this as well. For starters, I could enable a way to benchmark GPU kernels called via the nvptx-loader or amdhsa-loader, which should let @lntue more easily test functions. Several of these functions are simply intrinsics and can be implemented as such; I'm sure @arsenm knows which ones. Otherwise, we can simply check the output from the device libraries and see which ones are being mapped to intrinsics. I'm actually somewhat interested in this field as well and wouldn't mind learning more about math libraries in the process, so let me know.

jhuber6 updated this revision to Diff 530894.Jun 13 2023, 7:00 AM

Updating to include support for the round function, which does not depend on
the vendor libraries. This should hopefully show how more generic functionality
can be added: if we enable an entrypoint, we can have either a vendor
implementation, a generic implementation, or a native GPU implementation. This
will be more easily controlled in the future, but for now this should be a
sufficient outline to implement what we want.
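
For reference, the vendor-independent shape is roughly this (simplified; the actual file uses the LLVM_LIBC_FUNCTION macro):

  namespace __llvm_libc {
  // round maps directly onto a builtin the backend lowers to a GPU instruction.
  double round(double x) { return __builtin_round(x); }
  } // namespace __llvm_libc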

I'm happy with choices, and I'm more than happy if we could avoid vendor math libraries at some point.
The round example is a nice one already; we can probably have more like that fairly early, e.g., abs.

If we can get some level of math functions in, we will start building the tooling to include it, and circumvent the math wrappers in clang.
Then we can start measuring performance for applications. Maybe we should also find/build a micro test suite for math functions (on the GPU).
That would allow us to know when we can replace vendor math with our own math by default.

Code and structuring LGTM but a lot of comments/suggestions around the messaging. For the next step, I really want to see the mass of libc's own implementations increase with more builtin wrappers before increasing the mass of vendor wrappers.

libc/config/gpu/entrypoints.txt
87

Normally, we add the entire family of functions when adding primitive operations. In this case, you should add roundf and roundl also.

libc/src/math/gpu/CMakeLists.txt
8
Use vendor wrappers for GPU math
libc/src/math/gpu/round.cpp
14 ↗(On Diff #530894)

Just saying: normally, on IEEE-754 conformant HW, the instructions generated by builtins exhibit standard conformance. I am not sure if the GPU floating point instructions are IEEE-754 conformant. If not, we should maybe say that on the math status pages somewhere. You can do this separately, of course.

libc/src/math/gpu/vendor/CMakeLists.txt
1 ↗(On Diff #530894)

The wording here has to be fixed. The first sentence should say something like:

Math functions not yet available in the libc project, or those not yet tuned for GPU workloads are provided as wrappers over vendor libraries.

The current wording will become a license to start adding wrappers for everything. For example: https://reviews.llvm.org/D152575

Also, a lot of this information should be moved to math/gpu/CMakeLists.txt.

4–5 ↗(On Diff #530894)
In the future, we will use implementations from the libc project and not provide these wrappers.

Use the word "not" even if we do carry an option to wrap vendor libraries for a long time.

9 ↗(On Diff #530894)
AMDGPU math functions falling back to vendor libraries will wrap the ROCm device library.

Feel free to use more appropriate wording, but the point I want to make is that the message should not appear like a blanket statement.

19 ↗(On Diff #530894)

Ditto

NVPTX math functions falling back to vendor libraries will wrap the CUDA device library.
libc/src/math/gpu/vendor/amdgpu/platform.h
35 ↗(On Diff #530894)

This oclc isa version thing is not great. We should patch clang to emit the value directly instead of linking in an IR file with the same name that defines that value; then this massive branch thing evaporates.
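
For readers without the diff open, the branching in question has roughly this shape (targets and values here are illustrative, not copied from the patch):

  // One definition per target, selected by the macro clang defines for the
  // current -mcpu; the real header has a case for every supported gfx arch.
  #if defined(__gfx900__)
  extern "C" const unsigned __oclc_ISA_version = 9000;
  #elif defined(__gfx1030__)
  extern "C" const unsigned __oclc_ISA_version = 10300;
  #endif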

jhuber6 added inline comments.Jun 14 2023, 5:01 AM
libc/src/math/gpu/vendor/amdgpu/platform.h
35 ↗(On Diff #530894)

I tried doing that at some point but gave up, primarily because we mix per-TU controls (fast math, like above) and per-executable ones (like this). I don't see a point in only enabling this single one. I think @arsenm was looking at removing the need for the per-TU math controls, which would make that patch viable again.

The ISA one is special. There's a direct correspondence between the compiler argument and which value is set, and only sorrow could come from splicing in an IR file with a different value to that passed to the compiler. It'll kill off this massive ifdef mess immediately: a net reduction in code and fewer failure modes, and even faster compilation as we don't have to open a file.

jhuber6 marked 5 inline comments as done.Jun 14 2023, 5:29 AM

The ISA one is special. There's a direct correspondence between the compiler argument and which value is set, and only sorrow could come from splicing in an IR file with a different value to that passed to the compiler. It'll kill off this massive ifdef mess immediately: a net reduction in code and fewer failure modes, and even faster compilation as we don't have to open a file.

Killing off these globals is good in the long run, but I don't want it to be a blocker on this patch.

libc/src/math/gpu/round.cpp
14 ↗(On Diff #530894)

Yeah, I'll want a documentation page for this once we get it in, although I myself am not sure what these map to.

jhuber6 updated this revision to Diff 531276.Jun 14 2023, 5:29 AM

Addressing comments

jhuber6 updated this revision to Diff 531320.Jun 14 2023, 7:25 AM

Forgot closing parens

jhuber6 updated this revision to Diff 531322.Jun 14 2023, 7:25 AM

And this one, sorry for the spam.

lntue added inline comments.Jun 14 2023, 7:33 AM
libc/config/gpu/entrypoints.txt
84–90

Can you also update the math implementation status table at https://github.com/llvm/llvm-project/blob/main/libc/docs/math/index.rst ?
Thanks,

jhuber6 added inline comments.Jun 14 2023, 7:35 AM
libc/config/gpu/entrypoints.txt
84–90

I was thinking of doing this in a separate patch, and we may want to have a different category for the GPU that indicates if it's implemented natively or if we rely on the vendor. Then we'd have some documentation about the considerations of GPU math, like reduced precision, ULP modes, linking to vendor documentation for those functions, etc.

lntue added inline comments.Jun 14 2023, 7:42 AM
libc/config/gpu/entrypoints.txt
84–90

A separate patch is fine for me. I plan to use this table for enabled entrypoints. For extra information about how they are implemented and what the options/tradeoffs are, you can put it under https://libc.llvm.org/math/#algorithms-implementation-details, or add a separate page and link to it in that section.

Code and structuring LGTM but a lot of comments/suggestions around the messaging. For the next step, I really want to see the mass of libc's own implementations increase with more builtin wrappers before increasing the mass of vendor wrappers.

Our plan is to move all math functions from the headers here. In the process we'll try to use builtins and generic versions as much as possible. We will then have a list of things that are still vendor wrappers, so we can cross them off one by one. Is that OK?

sivachandra accepted this revision.Jun 14 2023, 10:10 AM
sivachandra added inline comments.
libc/config/gpu/entrypoints.txt
84–90

Yes, I vote for a separate GPU math status page, as it seems like it is going to be different even wrt the math library promise we make about the CPU implementations.

This revision is now accepted and ready to land.Jun 14 2023, 10:10 AM