This is an archive of the discontinued LLVM Phabricator instance.

[Libomptarget] Create device math wrappers
Needs Revision · Public

Authored by jhuber6 on Mar 11 2022, 7:57 AM.

Details

Summary

This patch creates new bitcode libraries to be used when compiling for
the device. They define math function wrappers that first map the
generic __omp_sin calls to the original math function's name, and then
map that math function to the device-specific __nv_sin. This
level of indirection was necessary to avoid the declarations in
<math.h> that are not compatible with the device.
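The two-step mapping described in the summary can be sketched on the host roughly as follows. This is only an illustration under stated assumptions: the `sketch` namespace and the names `vendor_sin` and `omp_sin` are hypothetical stand-ins for `__nv_sin` and `__omp_sin`, and the vendor routine is faked with `std::sin` so the snippet compiles without a GPU toolchain.

```cpp
#include <cmath>

namespace sketch {
// Hypothetical stand-in for the vendor routine (__nv_sin on NVPTX).
// On the host it simply forwards to std::sin.
inline double vendor_sin(double x) { return std::sin(x); }

// Step 2: the function carrying the plain libm name is defined in terms of
// the vendor routine. In the patch this lives in the bitcode library, so the
// device never needs the declarations from the system <math.h>.
inline double sin(double x) { return vendor_sin(x); }

// Step 1: the generic entry point (__omp_sin in the patch) forwards to the
// libm-named function.
inline double omp_sin(double x) { return sketch::sin(x); }
} // namespace sketch
```

A call such as `sketch::omp_sin(1.0)` passes through the libm-named layer before reaching the vendor stand-in, which mirrors the order given in the summary.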

Depends on D121466

Diff Detail

Event Timeline

jhuber6 created this revision. Mar 11 2022, 7:57 AM
Herald added a project: Restricted Project. Mar 11 2022, 7:57 AM
Herald added a subscriber: mgorny.
jhuber6 requested review of this revision. Mar 11 2022, 7:57 AM

What's the benefit of making it a standalone library instead of part of device runtime?

Johannes wanted this math library so we can get regular math calls on the device. We can't just include math.h and expect it to work, because it contains several things that aren't compatible with the device. We originally just had wrappers that mapped the calls to the device versions, but this meant we lost the optimizations LLVM has for the standard math functions. That being said, this is a lot of extra code and we'll need to see if it makes a performance difference. (It also makes compilation much slower.)

Yeah, I understand the importance of having our own math library. I'm just not clear on the benefit of keeping it separate from our device runtime. CUDA's libdevice.bc, for example, contains everything.

Oh, I see. I just kept them separate because including it is optional and I wasn't sure I wanted to always have it. Once this is more mature and works as expected, it'd probably be a good idea to just put it in the regular libdevice.

yaxunl added a subscriber: yaxunl. Mar 11 2022, 8:51 AM

Is __clang_hip_libdevice_declares.h copied from clang/lib/Headers? How are you going to maintain it if it changes in the original place? Why not use the original copy instead of duplicating it?

I had to make a few adjustments to get it to build correctly. I'll probably just add those to a new macro in the original header and include them here later.

JonChesterfield requested changes to this revision. Mar 11 2022, 9:11 AM

OK, I think I follow.

What we have is:
1/ Long list of libm symbols as standardised with some ad hoc extra ones
2/ Header file mapping libm symbols onto cuda functions and intrinsics
3/ Header file mapping libm symbols onto hip (well, ocml) functions and intrinsics

What we want is:
Applications #include math.h and stuff works. Maybe an extra #include math_gpu.h containing the ad hoc extra ones.
Optimisations in LLVM that are tied to the C symbol name work

Thus, we instantiate a bitcode library that maps libm + some extra symbols onto the cuda/ocml library by reusing the 'header' file with some macro hackery to make it do the right thing across libraries.

We might want to split OpenMPMath.h into standard and extensions, because cuda put the extensions in the global namespace and some applications written in openmp are going to define the same symbols.

Strategy looks sound to me. It's unfortunate that the symbol remap tables are written as C++ header files, but at least they already exist. Copying and pasting them into openmp is going to be a maintenance disaster almost immediately. I think we need to add some macros to the headers in clang and include them via CMakeLists, setting the include path appropriately.

The review of adding those macros can include a link to this diff to show that said macros are the lesser of two evils.

AMD have already done something rather like this for Fortran (which cannot use the C++ headers), see https://github.com/RadeonOpenCompute/llvm-project/blob/amd-stg-open/openmp/libomptarget/libm/src/libm.c with corresponding macros added to the clang headers.

I'm therefore marking this 'requested changes', with the proviso that the change I'm requesting is that we add more ugliness to the clang headers to make this work out OK.

We might want to put a file called math.h in the override directory and not #include_next the system math.h on the gpu.

openmp/libomptarget/DeviceLib/src/OpenMPMath.cpp:14

This is quite code generator / xmacro friendly. Should be able to have something closer to

#define M(ARITY, SYMBOL) ...

M(1, abs)
M(1, fabs)
M(1, acos)
M(1, cos)
...
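The x-macro idea suggested above can be illustrated with a small standalone sketch. Everything here is hypothetical (the list name `SKETCH_MATH_LIST`, the `MathEntry` table, and the four entries chosen); it only shows how one symbol list can be expanded several times with different definitions of `M`, and it is not the code from the patch.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical symbol list. Each entry is (ARITY, SYMBOL), following the
// parameter order used in the comment above.
#define SKETCH_MATH_LIST(M)                                                    \
  M(1, fabs)                                                                   \
  M(1, cos)                                                                    \
  M(2, atan2)                                                                  \
  M(3, fma)

struct MathEntry {
  const char *name;
  int arity;
};

// Expansion 1: build a lookup table of names and arities.
#define AS_ENTRY(ARITY, SYMBOL) {#SYMBOL, ARITY},
static const MathEntry kMathTable[] = {SKETCH_MATH_LIST(AS_ENTRY)};
#undef AS_ENTRY

// Expansion 2: count the entries by expanding each one to "+1".
#define AS_ONE(ARITY, SYMBOL) +1
static const std::size_t kMathCount = 0 SKETCH_MATH_LIST(AS_ONE);
#undef AS_ONE

// Returns the arity recorded for a symbol, or -1 if it is not in the list.
inline int arity_of(const char *name) {
  for (const MathEntry &e : kMathTable)
    if (std::strcmp(e.name, name) == 0)
      return e.arity;
  return -1;
}
```

The same list could just as well be expanded into wrapper definitions, which is the use the review has in mind; the table is used here only because it is easy to check.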

This revision now requires changes to proceed. Mar 11 2022, 9:11 AM

We might want to split OpenMPMath.h into standard and extensions, because cuda put the extensions in the global namespace and some applications written in openmp are going to define the same symbols.

Agreed. We put that on the TODO list.

Strategy looks sound to me. It's unfortunate that the symbol remap tables are written as C++ header files but at least they already exist. Copy&pasting them into openmp is going to be a maintenance disaster almost immediately, I think we need to add some macros to the headers in clang and include them via CMakeLists setting the include path appropriately.

We want a single version of these. All but sin -> {nv,ocml,...}_sin should be unique. That is not the case right now. This stuff is part of the unique set of files and should not live in openmp/... but rather clang/deviceLibs or similar.
For now, assume this review is just so we have a place to put the things. We are still making things work and once we have we can move stuff to the proper place.

The review of adding those macros can include a link to this diff to show that said macros are the lesser of two evils.

Changing to macros is something we can do and might even allow us to cut down the files by one.
With something like MAP(from, to, return_type, arg_types) we might be able to create the wrappers from sin -> llvm_gpu_sin, and from llvm_gpu_sin -> sin, and from sin -> __{nv,ocml}_sin.

All that said, the macros could have been done before, and can be done after this rewrite.

We might want to put a file called math.h in the override directory and not #include_next the system math.h on the gpu.

Sounds good to me.

Changing to macros is something we can do and might even allow us to cut down the files by one.
With something like MAP(from, to, return_type, arg_types) we might be able to create the wrappers from sin -> llvm_gpu_sin, and from llvm_gpu_sin -> sin, and from sin -> __{nv,ocml}_sin.

All that said, the macros could have been done before, and can be done after this rewrite.

I'm going to look at that now - have got a few use cases for essentially the same macro. It'll be MAP(from, to, arity) though, writing out the types by hand is tedious and avoidable.
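A rough sketch of how an arity-only mapping macro could avoid spelling out types is shown below. It is an assumption-laden illustration, not the header proposed in D121499: `fn_traits`, `SKETCH_WRAP_1`, `SKETCH_WRAP_2`, `SKETCH_MAP`, and the `vendor_*` functions are all hypothetical names, and only arities 1 and 2 are covered. The return and parameter types are recovered from the target's function-pointer type, so the caller supplies only names and an arity.

```cpp
#include <cstddef>
#include <tuple>

namespace sketch {
// Recovers the return type and the I-th parameter type from a
// function-pointer type.
template <typename F> struct fn_traits;
template <typename R, typename... A> struct fn_traits<R (*)(A...)> {
  using ret = R;
  template <std::size_t I>
  using arg = typename std::tuple_element<I, std::tuple<A...>>::type;
};

// Hypothetical stand-ins for vendor routines (non-overloaded on purpose, so
// that taking their address is unambiguous).
inline double vendor_half(double x) { return x * 0.5; }
inline double vendor_add(double x, double y) { return x + y; }
} // namespace sketch

// One wrapper generator per arity; FP is the target's function-pointer type.
#define SKETCH_WRAP_1(FROM, TO, FP)                                            \
  inline sketch::fn_traits<FP>::ret FROM(sketch::fn_traits<FP>::arg<0> a) {    \
    return TO(a);                                                              \
  }
#define SKETCH_WRAP_2(FROM, TO, FP)                                            \
  inline sketch::fn_traits<FP>::ret FROM(sketch::fn_traits<FP>::arg<0> a,      \
                                         sketch::fn_traits<FP>::arg<1> b) {    \
    return TO(a, b);                                                           \
  }

// Dispatches on the literal arity by token pasting.
#define SKETCH_MAP(FROM, TO, ARITY)                                            \
  SKETCH_WRAP_##ARITY(FROM, TO, decltype(&TO))

SKETCH_MAP(my_half, sketch::vendor_half, 1)
SKETCH_MAP(my_add, sketch::vendor_add, 2)
```

Because `decltype(&TO)` needs a single unambiguous function, this approach would not work directly on an overloaded target; that limitation applies to this sketch and may or may not apply to the real header.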

Given the header proposed in D121499, OpenMPMath.cpp can be replaced with:

#include "OpenMPMath.h"

#include "make_function.h"

#define M(SYMBOL, ARITY)                                                       \
  __DEVICE__ MAKE_FUNCTION(__omp_##SYMBOL, SYMBOL, decltype(&SYMBOL), ARITY)

M(abs, 1);
M(fabs, 1);
M(acos, 1);
M(acosf, 1);
M(acosh, 1);
M(acoshf, 1);
M(asin, 1);
M(asinf, 1);
M(asinh, 1);
M(asinhf, 1);
M(atan, 1);
M(atan2, 2);
M(atan2f, 2);
M(atanf, 1);
M(atanh, 1);
M(atanhf, 1);
M(cbrt, 1);
M(cbrtf, 1);
M(ceil, 1);
M(ceilf, 1);
M(copysign, 2);
M(copysignf, 2);
M(cos, 1);
M(cosf, 1);
M(cosh, 1);
M(coshf, 1);
M(cospi, 1);
M(cospif, 1);
M(cyl_bessel_i0, 1);
M(cyl_bessel_i0f, 1);
M(cyl_bessel_i1, 1);
M(cyl_bessel_i1f, 1);
M(erf, 1);
M(erfc, 1);
M(erfcf, 1);
M(erfcinv, 1);
M(erfcinvf, 1);
M(erfcx, 1);
M(erfcxf, 1);
M(erff, 1);
M(erfinv, 1);
M(erfinvf, 1);
M(exp, 1);
M(exp10, 1);
M(exp10f, 1);
M(exp2, 1);
M(exp2f, 1);
M(expf, 1);
M(expm1, 1);
M(expm1f, 1);
M(fabsf, 1);
M(fdim, 2);
M(fdimf, 2);
M(fdivide, 2);
M(fdividef, 2);
M(floor, 1);
M(floorf, 1);
M(fma, 3);
M(fmaf, 3);
M(fmax, 2);
M(fmaxf, 2);
M(fmin, 2);
M(fminf, 2);
M(fmod, 2);
M(fmodf, 2);
M(frexp, 2);
M(frexpf, 2);
M(hypot, 2);
M(hypotf, 2);
M(ilogb, 1);
M(ilogbf, 1);
M(j0, 1);
M(j0f, 1);
M(j1, 1);
M(j1f, 1);
M(jn, 2);
M(jnf, 2);
M(labs, 1);
M(ldexp, 2);
M(ldexpf, 2);
M(lgamma, 1);
M(lgammaf, 1);
M(llabs, 1);
M(llmax, 2);
M(llmin, 2);
M(llrint, 1);
M(llrintf, 1);
M(llround, 1);
M(llroundf, 1);
M(round, 1);
M(roundf, 1);
M(log, 1);
M(log10, 1);
M(log10f, 1);
M(log1p, 1);
M(log1pf, 1);
M(log2, 1);
M(log2f, 1);
M(logb, 1);
M(logbf, 1);
M(logf, 1);
M(lrint, 1);
M(lrintf, 1);
M(lround, 1);
M(lroundf, 1);
M(max, 2);
M(min, 2);
M(modf, 2);
M(modff, 2);
M(nearbyint, 1);
M(nearbyintf, 1);
M(nextafter, 2);
M(nextafterf, 2);
M(norm, 2);
M(norm3d, 3);
M(norm3df, 3);
M(norm4d, 4);
M(norm4df, 4);
M(normcdf, 1);
M(normcdff, 1);
M(normcdfinv, 1);
M(normcdfinvf, 1);
M(normf, 2);
M(pow, 2);
M(powf, 2);
M(powi, 2);
M(powif, 2);
M(rcbrt, 1);
M(rcbrtf, 1);
M(remainder, 2);
M(remainderf, 2);
M(remquo, 3);
M(remquof, 3);
M(rhypot, 2);
M(rhypotf, 2);
M(rint, 1);
M(rintf, 1);
M(rnorm, 2);
M(rnorm3d, 3);
M(rnorm3df, 3);
M(rnorm4d, 4);
M(rnorm4df, 4);
M(rnormf, 2);
M(rsqrt, 1);
M(rsqrtf, 1);
M(scalbn, 2);
M(scalbnf, 2);
M(scalbln, 2);
M(scalblnf, 2);
M(sin, 1);
M(sincos, 3);
M(sincosf, 3);
M(sincospi, 3);
M(sincospif, 3);
M(sinf, 1);
M(sinh, 1);
M(sinhf, 1);
M(sinpi, 1);
M(sinpif, 1);
M(sqrt, 1);
M(sqrtf, 1);
M(tan, 1);
M(tanf, 1);
M(tanh, 1);
M(tanhf, 1);
M(tgamma, 1);
M(tgammaf, 1);
M(trunc, 1);
M(truncf, 1);
M(ullmax, 2);
M(ullmin, 2);
M(umax, 2);
M(umin, 2);
M(y0, 1);
M(y0f, 1);
M(y1, 1);
M(y1f, 1);
M(yn, 2);
M(ynf, 2);