This is an archive of the discontinued LLVM Phabricator instance.

Added modf for NVPTX and AMDGPU targets to implement 'libmgpu.a' for math on the GPU
Abandoned · Public

Authored by AntonRydahl on Jun 9 2023, 11:53 AM.

Details

Summary

This patch is a follow-up to D152468. It replaces calls to modf with __nv_modf or __ocml_modf_f64. If this approach looks right, I will continue to submit equivalent changes for the other math functions.
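
For illustration, a minimal sketch of what such a vendor wrapper looks like, assuming a libdevice-style declaration of __nv_modf and the LLVM_LIBC_FUNCTION entrypoint macro from the parent patch (treat the exact header path and macro spelling as assumptions):

    #include "src/__support/common.h" // assumed home of LLVM_LIBC_FUNCTION

    // Vendor symbol, resolved against NVIDIA's libdevice at link time.
    extern "C" double __nv_modf(double x, double *iptr);

    // modf splits x into fractional and integral parts; the libc entrypoint
    // simply forwards to the vendor implementation.
    LLVM_LIBC_FUNCTION(double, modf, (double x, double *iptr)) {
      return __nv_modf(x, iptr);
    }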

Diff Detail

Event Timeline

AntonRydahl created this revision. · Jun 9 2023, 11:53 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald Transcript · Jun 9 2023, 11:53 AM
AntonRydahl requested review of this revision. · Jun 9 2023, 11:53 AM

Remember to run clang-format via git clang-format HEAD~1, then git add -A; git commit --amend to update the commit if anything changed.

libc/src/math/gpu/CMakeLists.txt
47 ↗(On Diff #530045)

Don't need to copy this comment everywhere.

libc/src/math/gpu/amdgpu/amdgpu.h
18 ↗(On Diff #530045)

This should be internal; I forgot about that when I updated.

24 ↗(On Diff #530045)

We're not using OpenMP, so these won't work, nor should they be necessary. Address space five maps to local (thread-private) memory, which should be the default here for a variable on the stack. The only reason this was necessary for OpenMP is the memory model OpenMP uses to mimic the CPU: CPU threads can share stack data by default, but GPU threads can't, so by default we don't put things on the stack on the GPU. Since this is compiled directly for the GPU, it should already be stack memory. Ditto, the cast shouldn't be necessary.
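
To make the point concrete, a hypothetical sketch: compiled directly for amdgcn-amd-amdhsa, an ordinary automatic variable already lives in the thread's private stack, so neither an address-space attribute nor a cast should be needed (the plain-pointer signature for __ocml_modf_f64 is an assumption here):

    // Vendor symbol from ROCm's OCML device library.
    extern "C" double __ocml_modf_f64(double x, double *iptr);

    double frac_part(double x) {
      double integral; // plain stack variable; no __attribute__((address_space(5)))
      return __ocml_modf_f64(x, &integral); // and no cast at the call site
    }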

Updating D152575: Added modf for NVPTX and AMDGPU targets to implement 'libmgpu.a' for math on the GPU

The OpenMP specifics have been removed, the namespace changed to internal, and git clang-format has been run.

Make sure to amend your new changes into your last commit and then arc diff. This revision only contains your new edits.

Thanks, Ethan! I will see if I know how to roll it back. Do you think I should use arc diff HEAD~2 --update D152575?

That should work, I think.

This comment was removed by AntonRydahl.

Squashed the two commits into one.

Updated this patch in accordance with the changes to the parent patch, https://reviews.llvm.org/D152486.

This update contains more functions following the same template.

The boilerplate-to-functionality ratio is pretty extreme here. How does the libc subproject feel about code generators?

If we can run Python scripts to generate the code at build time, great.

For an out-of-tree target, I remember generating source code in CMake because people didn't like running Python at build time.

If neither of those is acceptable, perhaps we should do the X-macro thing.
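
For reference, a hypothetical sketch of the X-macro option: one table of (libc name, vendor symbol) pairs that expands into all the unary wrappers at once. The macro names are illustrative, not anything from this patch:

    // Vendor declarations, resolved against libdevice at link time.
    extern "C" double __nv_sin(double);
    extern "C" double __nv_cos(double);

    // One row per function; X decides what each row expands to.
    #define LIBC_GPU_UNARY_MATH(X) \
      X(sin, __nv_sin)             \
      X(cos, __nv_cos)

    // Expand the table into wrapper definitions.
    #define DEFINE_VENDOR_WRAPPER(name, vendor) \
      double name(double x) { return vendor(x); }
    LIBC_GPU_UNARY_MATH(DEFINE_VENDOR_WRAPPER)
    #undef DEFINE_VENDOR_WRAPPER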

I would rather implement the functions added by this patch as builtin wrappers instead of vendor wrappers. See the round example in D152468.
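
For context, the round example referred to is essentially a one-line forward to a compiler builtin; a sketch, with the same assumptions about the entrypoint macro as above:

    #include "src/__support/common.h" // assumed home of LLVM_LIBC_FUNCTION

    // __builtin_round lowers directly to the target's rounding instruction,
    // so no vendor library is involved at all.
    LLVM_LIBC_FUNCTION(double, round, (double x)) {
      return __builtin_round(x);
    }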

> I would rather implement the functions added by this patch as builtin wrappers instead of vendor wrappers. See the round example in D152468.

I think we're missing a canonicalisation opportunity in libm. It's never totally clear whether an optimisation should target a C function with a known name or an intrinsic. There's a fair chance -O0 C++ produces a wrapper that handles the overload under a different name. OpenMP variants introduce another name-mangling scheme. Then we have the __nv_sin / __ocml_cos set. I don't know what Fortran calls it, maybe _sin.

I think we should have an IR intrinsic for each libm function (or at least most of them) and transform the various source names to the intrinsics in the front end, or generally as aggressively as we can. Then optimise them: at least constant fold, but trig identities might be fair game as well. Then lower to whatever mix of libm functions and native instructions the backend sees fit.

Errno is a pain here, but even if errno is not disabled we can still constant-fold the cases that don't set it. Likewise, there are various fast-math flags, which are a mess, but it seems more likely that we can handle them consistently and correctly if it's all localised in one place.

This diff isn't the right venue to propose that; I should probably try the forum that replaced the mailing lists.

On the discussion forum, it has been proposed that we should test which versions perform better on GPUs: https://discourse.llvm.org/t/libm-conformance-and-timing-ci-for-gpus/71362

Maybe that would be a better fit for this discussion.

> I would rather implement the functions added by this patch as builtin wrappers instead of vendor wrappers. See the round example in D152468.

Should I include both the vendor and builtin functions in this patch?

That's a good question. We'd probably like to have both if we want to run performance tests, but I'm not sure that's a good reason to keep them in-tree if we have a suitable alternative. @sivachandra What do you think?

Do you reckon it would be good to keep the vendor definitions as a fallback if LIBC_HAS_BUILTIN doesn't return true?
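
A hypothetical sketch of that fallback, assuming LIBC_HAS_BUILTIN is a thin wrapper over the compiler's __has_builtin check:

    extern "C" double __nv_modf(double, double *); // vendor fallback

    double modf_impl(double x, double *iptr) {
    #if __has_builtin(__builtin_modf)
      return __builtin_modf(x, iptr); // preferred: the compiler builtin
    #else
      return __nv_modf(x, iptr);      // fallback: the vendor library
    #endif
    }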

JonChesterfield added a comment. · Edited · Jun 14 2023, 11:39 AM

The default is a good question. Libm isn't _that_ big a library. Getting the functions right is difficult, but relatively well established by this point; they've been implemented many times. Getting them fast and right is probably always a vendor-specific thing: you need to know the ISA you're targeting and put more effort in.

I think we could reasonably aspire to have something slow and wrong for every function implemented in-tree. I'm personally OK with "return 0" levels of wrong in the first instance, but others may disagree.

If we're aiming at slow+wrong+complete in the first instance, the default can reasonably be to use the libc implementation unless specified otherwise. Vendors will always want to use their own implementations, and that's fine; they can use the same hook provided for users who want to bring their own.

Fun question: do we want to aspire to finer-grained replacement than wholesale? I claim _no_, on the grounds that we won't test every permutation that results, and we shouldn't ship something untested.

(My side interest here is to end up with something that Fortran can use basically unchanged, which is not the case for libm-via-header-files)

libc/src/math/gpu/vendor/nvptx/nvptx.h
19

Why is this stuff all in a header? It means it has to be different for different targets even though the interface is the same.

I was hoping for float remainderf(float x, float y); in a header and the NVPTX-specific implementation in a source file.
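
In other words, something like the following split (file names and namespace are illustrative):

    // math_decls.h -- one target-agnostic interface shared by every target.
    namespace internal {
    float remainderf(float x, float y);
    } // namespace internal

    // nvptx/remainderf.cpp -- the NVPTX-specific definition.
    extern "C" float __nv_remainderf(float, float); // from libdevice
    namespace internal {
    float remainderf(float x, float y) { return __nv_remainderf(x, y); }
    } // namespace internal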

> That's a good question. We'd probably like to have both if we want to run performance tests, but I'm not sure that's a good reason to keep them in-tree if we have a suitable alternative. @sivachandra What do you think?

Per my understanding of the earlier discussion, we will add a vendor wrapper only if the in-tree implementation is not proven to be a good enough replacement for the vendor implementation. If that is correct, I would like to stick to that. You shouldn't need a wrapper in the libc for differential testing.

That said, will it ever be the case that a builtin for a floating-point primitive is not available? If yes, the ideal approach is to "fix the compiler". If that is not practical, we can take up adding vendor wrappers at that time with reduced scope (as in, that option is taken only if the builtin is not available).

> Per my understanding of the earlier discussion, we will add a vendor wrapper only if the in-tree implementation is not proven to be a good enough replacement for the vendor implementation. If that is correct, I would like to stick to that. You shouldn't need a wrapper in the libc for differential testing.

It's not strictly necessary, but it will make it a lot easier if we can simply enable or disable a flag to check the performance delta, either in the implementation itself or in an application. My proposal is that we implement all the vendor targets at once to get a mostly complete facade of a libm.a that we can test. This will mostly just be copying the existing headers at https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h and https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_hip_math.h. The expectation is that the default order of preference will be native GPU implementation > generic implementation > vendor implementation, with an extra configuration variable. That is, since we have a native implementation of sinf, we will only use the vendor version of sinf if the user specified it in some list like LIBC_GPU_VENDOR_MATH=sinf. The only burden with this approach is carrying around a few extra entrypoints that are only enabled if specified by the user, but in exchange we get something that's functional immediately and can be incrementally improved. However, I do think that anything that can be replaced with a builtin shouldn't be considered a 'vendor' implementation and can simply be placed in the regular source.
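
A hypothetical sketch of that selection order for sinf, where a per-function macro would be derived from a build list such as LIBC_GPU_VENDOR_MATH=sinf (all names below are illustrative):

    extern "C" float __nv_sinf(float); // vendor symbol, used only on request

    namespace internal {
    float generic_sinf(float x); // assumed in-tree implementation
    } // namespace internal

    float sinf_impl(float x) {
    #ifdef LIBC_GPU_VENDOR_SINF         // defined when the user opts in
      return __nv_sinf(x);              // vendor implementation
    #else
      return internal::generic_sinf(x); // native/generic default
    #endif
    }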

> That said, will it ever be the case that a builtin for a floating-point primitive is not available? If yes, the ideal approach is to "fix the compiler". If that is not practical, we can take up adding vendor wrappers at that time with reduced scope (as in, that option is taken only if the builtin is not available).

Yes, fixing the compiler will be the preferred approach here. Since we *only* build the GPU libraries with an up-to-date clang we can always patch the compiler in parallel with the GPU implementation.

Thanks for being understanding with this whole project. The GPU implementation brings in a lot of non-standard requirements, but I think it's a compelling project, so I'm glad you've stuck with it thus far.

sivachandra added a comment. · Edited · Jun 14 2023, 11:58 PM

> It's not strictly necessary, but it will make it a lot easier if we can simply enable or disable a flag to check the performance delta, either in the implementation itself or in an application.

I would expect that a libc implementation of a GPU math function, even if it is just a builtin wrapper, is being added because it is expected/proven to be either better than or equivalent to the vendor implementation. It is the burden of the GPU libc developer to verify that, and making that convenient is definitely something that can be accommodated in the libc project. We do have a number of such conveniences for comparison against the system libc. But all of them live outside of the libc (as in, there are no alternate wrapper entrypoints); in fact, for many functions the wrapper indirection adds close to 50% overhead, so it is not meaningful to compare against wrappers.

About builtin wrappers: I will be surprised if vendor libraries are doing anything different to get better performance. If that is really the case, the builtins should be fixed.

> My proposal is that we implement all the vendor targets at once to get a mostly complete facade of a libm.a that we can test. This will mostly just be copying the existing headers at https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_cuda_math.h and https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__clang_hip_math.h. The expectation is that the default order of preference will be native GPU implementation > generic implementation > vendor implementation, with an extra configuration variable. That is, since we have a native implementation of sinf, we will only use the vendor version of sinf if the user specified it in some list like LIBC_GPU_VENDOR_MATH=sinf. The only burden with this approach is carrying around a few extra entrypoints that are only enabled if specified by the user, but in exchange we get something that's functional immediately and can be incrementally improved. However, I do think that anything that can be replaced with a builtin shouldn't be considered a 'vendor' implementation and can simply be placed in the regular source.

I thought we had agreed on this plan already. So, as a first step, add builtin wrappers wherever possible. Next, add vendor wrappers and their libc-implementation alternates when available. Then comes the long tail of actually comparing/verifying/improving the libc implementations to make them the default. I would really like it if the libc implementations were the default to begin with, but as a practical decision in these initial stages we will make the vendor wrappers the default. But vendor wrappers should be added only if there are no builtins that implement the corresponding floating-point operations.

> I would expect that a libc implementation of a GPU math function, even if it is just a builtin wrapper, is being added because it is expected/proven to be either better than or equivalent to the vendor implementation. It is the burden of the GPU libc developer to verify that, and making that convenient is definitely something that can be accommodated in the libc project. We do have a number of such conveniences for comparison against the system libc. But all of them live outside of the libc (as in, there are no alternate wrapper entrypoints); in fact, for many functions the wrapper indirection adds close to 50% overhead, so it is not meaningful to compare against wrappers.

That's reasonable; we can bias towards in-tree implementations as long as the performance is somewhat on par.

> About builtin wrappers: I will be surprised if vendor libraries are doing anything different to get better performance. If that is really the case, the builtins should be fixed.

So, the vendor libraries are presented as LLVM IR. This means they will ultimately use the same intrinsics and be compiled by the exact same compiler, so there should be no difference. We should probably scan for any vendor implementation that can be replaced with one of the __nvvm or __builtin functions. However, we should keep in mind that the floating-point behavior of these functions will probably diverge somewhat from libc's correct-rounding guarantee; we'll need to say as much when we document this.

> I thought we had agreed on this plan already. So, as a first step, add builtin wrappers wherever possible. Next, add vendor wrappers and their libc-implementation alternates when available. Then comes the long tail of actually comparing/verifying/improving the libc implementations to make them the default. I would really like it if the libc implementations were the default to begin with, but as a practical decision in these initial stages we will make the vendor wrappers the default. But vendor wrappers should be added only if there are no builtins that implement the corresponding floating-point operations.

Yes, that's very reasonable. There's no reason to even compare the performance if it just lowers to an intrinsic, as I mentioned above. So we should add all vendor functions except the ones that can be trivially recreated without calling into the accompanying library.

I will make a new patch where I test whether the __builtin math functions compile for --target=amdgcn-amd-amdhsa and --target=nvptx64-nvidia-cuda. If a builtin compiles successfully, I will add a __builtin wrapper; if not, I will add __ocml and __nv wrappers. Does that sound good to you, @sivachandra?
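
A minimal probe for that test might look like the following; compile it for each triple and check whether the builtin lowers inline or leaves a call to an undefined libm symbol (the exact invocation is an assumption):

    // builtin_probe.cpp -- e.g.
    //   clang++ --target=nvptx64-nvidia-cuda -S -o - builtin_probe.cpp
    //   clang++ --target=amdgcn-amd-amdhsa  -S -o - builtin_probe.cpp
    // A leftover call to an undefined 'round' or 'modf' symbol means the
    // builtin is not usable on that target.
    double probe_round(double x) { return __builtin_round(x); }
    double probe_modf(double x, double *iptr) { return __builtin_modf(x, iptr); }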

> I will make a new patch where I test whether the __builtin math functions compile for --target=amdgcn-amd-amdhsa and --target=nvptx64-nvidia-cuda. If a builtin compiles successfully, I will add a __builtin wrapper; if not, I will add __ocml and __nv wrappers. Does that sound good to you, @sivachandra?

SGTM

AntonRydahl abandoned this revision. · Jul 27 2023, 10:16 AM