
[OpenMP] Add math functions support in OpenMP offloading
Abandoned · Public

Authored by gtbercea on Apr 19 2019, 11:12 AM.

Details

Summary

This patch adds an OpenMP specific math functions header to the lib/Headers folder and ensures it is passed to Clang.

Note:
This is an example of how support for math functions could be implemented. Before expanding this to include other math functions please let me know if you have any comments, concerns or proposed changes.

Diff Detail

Event Timeline

gtbercea created this revision. Apr 19 2019, 11:12 AM
ABataev added inline comments. Apr 19 2019, 11:23 AM
lib/Headers/__clang_openmp_math.h
15

Also, versions for float and long double

22

Add powf(float), powl(long double), sinf(float), sinl(long double)

ABataev added inline comments. Apr 19 2019, 11:26 AM
lib/Headers/__clang_openmp_math.h
3

Why __CLANG_OMP_CMATH_H__? Your file is ..._math.h, not ..._cmath.h. Plus, it seems to me, you're missing the standard header for the file.

jdoerfert added inline comments. Apr 19 2019, 1:40 PM
include/clang/Driver/ToolChain.h
575

Copy & paste comment

lib/Headers/__clang_openmp_math.h
6

Why is this NVPTX specific?

To follow up on my comment about why this is NVPTX specific:

Is there a reason why this has to happen in the Cuda ToolChain part?
I would have assumed us to add the declarations similar to the ones provided in __clang_openmp_math.h whenever we may compile for a target.
So, if we have any OpenMP target related code in the TU, we add the header __clang_openmp_target_math.h which defines "common" math functions as you did in __clang_openmp_math.h (without the NVPTX guard). The runtime will then implement __kmpc_XXXX as it sees fit.

gtbercea updated this revision to Diff 195915. Apr 19 2019, 2:25 PM
gtbercea edited the summary of this revision.
  • Address comments.
gtbercea marked 5 inline comments as done. Apr 19 2019, 2:26 PM

So the scheme is: pow is defined in __clang_openmp_math.h to call __kmpc_pow. This lives in libomptarget-nvptx (both bc and static lib) and just calls pow which works because nvcc and Clang in CUDA mode make sure that the call gets routed into libdevice?

Did you test that something like pow(d, 2) is optimized by LLVM to d * d? There's a pass doing so (can't recall the name) and from my previous attempts it didn't work well if you hid the function name instead of the known pow one.

gtbercea added a comment (edited). Apr 24 2019, 6:58 AM

So the scheme is: pow is defined in __clang_openmp_math.h to call __kmpc_pow. This lives in libomptarget-nvptx (both bc and static lib) and just calls pow which works because nvcc and Clang in CUDA mode make sure that the call gets routed into libdevice?

Did you test that something like pow(d, 2) is optimized by LLVM to d * d? There's a pass doing so (can't recall the name) and from my previous attempts it didn't work well if you hid the function name instead of the known pow one.

The transformation was blocked by a check in optimizePow() which prevented pow(x,2) from becoming x*x. By adding the pow functions to the TLI, the transformation now applies. This has been fixed in the LLVM patch. SQRT is eliminated as usual; no change for that.

gtbercea updated this revision to Diff 196619. Apr 25 2019, 6:04 AM
  • Use macros.
gtbercea retitled this revision from "[OpenMP][WIP] Add math functions support in OpenMP offloading" to "[OpenMP] Add math functions support in OpenMP offloading". Apr 25 2019, 11:07 AM

The last two comments in D47849 indicated exploration of a different approach, and one which still seems superior to this one. Can you please comment on why you're now pursuing this approach instead?

gtbercea added a comment (edited). Apr 25 2019, 1:10 PM

The last two comments in D47849 indicated exploration of a different approach, and one which still seems superior to this one. Can you please comment on why you're now pursuing this approach instead?

This solution is following Alexey's suggestions. This solution allows the optimization of math calls if they apply (example: pow(x,2) => x*x ) which was one of the issues in the previous solution I implemented.

This comment was removed by gtbercea.

The last two comments in D47849 indicated exploration of a different approach, and one which still seems superior to this one. Can you please comment on why you're now pursuing this approach instead?

Hal, as far as I can tell, this solution is similar to yours but with a slightly different implementation. If there are particular aspects about this patch you would like to discuss/give feedback on please let me know.

gtbercea updated this revision to Diff 196725. Apr 25 2019, 1:59 PM
  • Update patch.
jdoerfert added inline comments. Apr 29 2019, 7:05 PM
lib/Driver/ToolChains/Clang.cpp
1159

Here is another "NVPTX" specialization that I don't think we need. At least with more targets we need to relax this condition.

lib/Headers/__clang_openmp_math.h
13

Why is this NVPTX specific (again)?

The last two comments in D47849 indicated exploration of a different approach, and one which still seems superior to this one. Can you please comment on why you're now pursuing this approach instead?

...

Hal, as far as I can tell, this solution is similar to yours but with a slightly different implementation. If there are particular aspects about this patch you would like to discuss/give feedback on please let me know.

The solution I suggested had the advantages of:

  1. Being able to directly reuse the code in __clang_cuda_device_functions.h. On the other hand, using this solution we need to implement a wrapper function for every math function. When __clang_cuda_device_functions.h is updated, we need to update the OpenMP wrapper as well.
  2. Providing access to wrappers for other CUDA intrinsics in a natural way (e.g., rnorm3d) [it looks a bit nicer to provide a host version of rnorm3d than __nv_rnorm3d in user code].
  3. Being similar to the "declare variant" functionality from OpenMP 5, and thus, I suspect, closer to the solution we'll eventually be able to apply in a standard way to all targets.
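For reference, the OpenMP 5 "declare variant" functionality mentioned in point 3 selects a target-specific implementation declaratively. A rough sketch (assumptions: requires an OpenMP 5 compiler with offloading support, and __nv_sin stands in for the libdevice entry point; this is not the patch's mechanism):

```c
#pragma omp declare target
double __nv_sin(double); /* provided by libdevice on NVPTX */
#pragma omp end declare target

/* When compiling for an nvptx device context, calls to sin resolve
   to __nv_sin; the host keeps its regular libm sin. */
#pragma omp declare variant(__nv_sin) match(device = {arch(nvptx, nvptx64)})
double sin(double x);
```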

What are all of the long-double functions going to do on NVPTX?

This solution is following Alexey's suggestions. This solution allows the optimization of math calls if they apply (example: pow(x,2) => x*x ) which was one of the issues in the previous solution I implemented.

So we're also missing that optimization for CUDA code when compiling with Clang? Isn't this also something that, regardless, should be fixed?

Also, how fragile is this? We inline bottom up but this optimization needs to apply before inlining?

Finally, regardless of all of this, do we really need to preinclude this header? Can't we do this with a math.h wrapper?

The last two comments in D47849 indicated exploration of a different approach, and one which still seems superior to this one. Can you please comment on why you're now pursuing this approach instead?

...

Hal, as far as I can tell, this solution is similar to yours but with a slightly different implementation. If there are particular aspects about this patch you would like to discuss/give feedback on please let me know.

The solution I suggested had the advantages of:

  1. Being able to directly reuse the code in __clang_cuda_device_functions.h. On the other hand, using this solution we need to implement a wrapper function for every math function. When __clang_cuda_device_functions.h is updated, we need to update the OpenMP wrapper as well.

I'd even go as far as to argue that __clang_cuda_device_functions.h should include the internal math.h wrapper to get all math functions. See also the next comment.

  2. Providing access to wrappers for other CUDA intrinsics in a natural way (e.g., rnorm3d) [it looks a bit nicer to provide a host version of rnorm3d than __nv_rnorm3d in user code].

@hfinkel
I don't see why you want to mix CUDA intrinsics with math.h overloads. I added a rough outline of how I imagined the internal math.h header to look like as a comment in D47849. Could you elaborate how that differs from what you imagine and how the other intrinsics come in?

  3. Being similar to the "declare variant" functionality from OpenMP 5, and thus, I suspect, closer to the solution we'll eventually be able to apply in a standard way to all targets.

I can see this.

This solution is following Alexey's suggestions. This solution allows the optimization of math calls if they apply (example: pow(x,2) => x*x ) which was one of the issues in the previous solution I implemented.

So we're also missing that optimization for CUDA code when compiling with Clang? Isn't this also something that, regardless, should be fixed?

Maybe through a general built-in recognition and lowering into target specific implementations/intrinsics late again?

The last two comments in D47849 indicated exploration of a different approach, and one which still seems superior to this one. Can you please comment on why you're now pursuing this approach instead?

...

Hal, as far as I can tell, this solution is similar to yours but with a slightly different implementation. If there are particular aspects about this patch you would like to discuss/give feedback on please let me know.

The solution I suggested had the advantages of:

  1. Being able to directly reuse the code in __clang_cuda_device_functions.h. On the other hand, using this solution we need to implement a wrapper function for every math function. When __clang_cuda_device_functions.h is updated, we need to update the OpenMP wrapper as well.

I'd even go as far as to argue that __clang_cuda_device_functions.h should include the internal math.h wrapper to get all math functions. See also the next comment.

  2. Providing access to wrappers for other CUDA intrinsics in a natural way (e.g., rnorm3d) [it looks a bit nicer to provide a host version of rnorm3d than __nv_rnorm3d in user code].

@hfinkel
I don't see why you want to mix CUDA intrinsics with math.h overloads.

What I had in mind was matching non-standard functions in a standard way. For example, let's just say that I have a CUDA kernel that uses the rnorm3d function, or I otherwise have a function that I'd like to write in OpenMP that will make good use of this CUDA function (because it happens to have an efficient device implementation). This is a function that CUDA provides, in the global namespace, although it's not standard.

Then I can do something like this (depending on how we setup the implementation):

double rnorm3d(double a,  double b, double c) {
  return sqrt(a*a + b*b + c*c);
}

...

#pragma omp target
{
  double a = ..., b = ..., c = ...;
  double r = rnorm3d(a, b, c);
}

and, if we use the CUDA math headers for CUDA math-function support, then this might "just work." To be clear, I can see an argument for having this work being a bad idea ;) -- but it has the advantage of providing a way to take advantage of system-specific functions while still writing completely-portable code.

I added a rough outline of how I imagined the internal math.h header to look like as a comment in D47849. Could you elaborate how that differs from what you imagine and how the other intrinsics come in?

That looks like what I had in mind (including __clang_cuda_device_functions.h to get the device functions.)

  3. Being similar to the "declare variant" functionality from OpenMP 5, and thus, I suspect, closer to the solution we'll eventually be able to apply in a standard way to all targets.

I can see this.

This solution is following Alexey's suggestions. This solution allows the optimization of math calls if they apply (example: pow(x,2) => x*x ) which was one of the issues in the previous solution I implemented.

So we're also missing that optimization for CUDA code when compiling with Clang? Isn't this also something that, regardless, should be fixed?

Maybe through a general built-in recognition and lowering into target specific implementations/intrinsics late again?

I suspect that we need to match the intrinsics and perform the optimizations in LLVM at that level in order to get the optimizations for CUDA.

tra added a comment. Apr 30 2019, 10:33 AM

+1 to Hal's comments.

@jdoerfert :

I'd even go as far as to argue that __clang_cuda_device_functions.h should include the internal math.h wrapper to get all math functions. See also the next comment.

I'd argue the other way around -- include __clang_cuda_device_functions.h from math.h and do not preinclude anything.
If the user does not include math.h, it should not have its namespace polluted by some random stuff. NVCC did this, but that's one of the most annoying 'features' we have to be compatible with for the sake of keeping existing nvcc-compilable CUDA code happy.

If users do include math.h, it should do the right thing, for both sides of the compilation.
IMO It's math.h that should be triggering pulling device functions in.

lib/Driver/ToolChains/Clang.cpp
1157–1159

This functionality is OpenMP-specific, but the function name AddMathDeviceFunctions() is not. I'd rather keep the OpenMP specialization down where it can be easily seen. Could this check be pushed down into CudaInstallationDetector::AddMathDeviceFunctions()?

Another potential problem here is that this file will be pre-included only for the device. It will potentially result in more observable semantic differences between host and device compilations. I don't know if it matters for OpenMP, though.

IMO intercepting math.h and providing device-specific overloads *in addition* to the regular math.h functions would be a better approach.

Another problem with pre-included files is that sometimes users may intentionally need to *not* include them.
For CUDA we have -nocudainc flag. Your change, at least, will need something similar, IMO.

The last two comments in D47849 indicated exploration of a different approach, and one which still seems superior to this one. Can you please comment on why you're now pursuing this approach instead?

...

Hal, as far as I can tell, this solution is similar to yours but with a slightly different implementation. If there are particular aspects about this patch you would like to discuss/give feedback on please let me know.

The solution I suggested had the advantages of:

  1. Being able to directly reuse the code in __clang_cuda_device_functions.h. On the other hand, using this solution we need to implement a wrapper function for every math function. When __clang_cuda_device_functions.h is updated, we need to update the OpenMP wrapper as well.

I'd even go as far as to argue that __clang_cuda_device_functions.h should include the internal math.h wrapper to get all math functions. See also the next comment.

  2. Providing access to wrappers for other CUDA intrinsics in a natural way (e.g., rnorm3d) [it looks a bit nicer to provide a host version of rnorm3d than __nv_rnorm3d in user code].

@hfinkel
I don't see why you want to mix CUDA intrinsics with math.h overloads.

What I had in mind was matching non-standard functions in a standard way. For example, let's just say that I have a CUDA kernel that uses the rnorm3d function, or I otherwise have a function that I'd like to write in OpenMP that will make good use of this CUDA function (because it happens to have an efficient device implementation). This is a function that CUDA provides, in the global namespace, although it's not standard.

Then I can do something like this (depending on how we setup the implementation):

double rnorm3d(double a,  double b, double c) {
  return sqrt(a*a + b*b + c*c);
}
 
...
 
#pragma omp target
{
  double a = ..., b = ..., c = ...;
  double r = rnorm3d(a, b, c);
}

and, if we use the CUDA math headers for CUDA math-function support, then this might "just work." To be clear, I can see an argument for having this work being a bad idea ;) -- but it has the advantage of providing a way to take advantage of system-specific functions while still writing completely-portable code.

Matching rnorm3d and replacing it with some nvvm "intrinsic" is not something I'd like to see depend on whether math.h was included. As you say, that is not how it works in Cuda either. I'm in favor of reusing the built-in recognition mechanism:
That is, if the target is nvptx and the name is rnorm3d, we match that name and use the appropriate intrinsic, as we already do for other targets.

I added a rough outline of how I imagined the internal math.h header to look like as a comment in D47849. Could you elaborate how that differs from what you imagine and how the other intrinsics come in?

That looks like what I had in mind (including __clang_cuda_device_functions.h to get the device functions.)

  3. Being similar to the "declare variant" functionality from OpenMP 5, and thus, I suspect, closer to the solution we'll eventually be able to apply in a standard way to all targets.

I can see this.

This solution is following Alexey's suggestions. This solution allows the optimization of math calls if they apply (example: pow(x,2) => x*x ) which was one of the issues in the previous solution I implemented.

So we're also missing that optimization for CUDA code when compiling with Clang? Isn't this also something that, regardless, should be fixed?

Maybe through a general built-in recognition and lowering into target specific implementations/intrinsics late again?

I suspect that we need to match the intrinsics and perform the optimizations in LLVM at that level in order to get the optimizations for CUDA.

That seems reasonable to me. We could also match other intrinsics, e.g., rnorm3d, here as well, both by name and by the computation pattern.

In D60907#1484643, @tra wrote:

+1 to Hal's comments.

@jdoerfert :

I'd even go as far as to argue that __clang_cuda_device_functions.h should include the internal math.h wrapper to get all math functions. See also the next comment.

I'd argue the other way around -- include __clang_cuda_device_functions.h from math.h and do not preinclude anything.
If the user does not include math.h, it should not have its namespace polluted by some random stuff. NVCC did this, but that's one of the most annoying 'features' we have to be compatible with for the sake of keeping existing nvcc-compilable CUDA code happy.

If users do include math.h, it should do the right thing, for both sides of the compilation.
IMO It's math.h that should be triggering pulling device functions in.

I actually don't want to preinclude anything and my arguments are (mostly) for the OpenMP offloading code path not necessarily Cuda.
Maybe to clarify, what I want is:

  1. Make sure the clang/Headers/math.h is found first if math.h is included.
  2. Use a scheme similar to the one described https://reviews.llvm.org/D47849#1483653 in clang/Headers/math.h
  3. Only add math.h function overloads in our math.h. <- This is debatable
  4. Include clang/Headers/math.h from __clang_cuda_device_functions.h to avoid duplication of math function declarations.
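Points 1-3 above could translate into a wrapper along these lines (a sketch of the idea only; the guards, the #include_next, and the placement of overloads are assumptions based on the outline referenced in D47849, not code from this patch):

```c
/* clang/Headers/math.h -- placed so it is found before the system
   math.h when compiling OpenMP target code (sketch). */
#if defined(_OPENMP) && defined(__NVPTX__)
/* math.h function overloads for the device would be declared here,
   e.g. forwarding to the libdevice implementations. */
#endif
/* Always chain to the system math.h for the host declarations. */
#include_next <math.h>
```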
tra added a comment. Apr 30 2019, 11:40 AM

I actually don't want to preinclude anything and my arguments are (mostly) for the OpenMP offloading code path not necessarily Cuda.
Maybe to clarify, what I want is:

  1. Make sure the clang/Headers/math.h is found first if math.h is included.
  2. Use a scheme similar to the one described https://reviews.llvm.org/D47849#1483653 in clang/Headers/math.h
  3. Only add math.h function overloads in our math.h. <- This is debatable

Agreed.

  4. Include clang/Headers/math.h from __clang_cuda_device_functions.h to avoid duplication of math function declarations.

This is not needed for CUDA. math.h is included early on in __clang_cuda_runtime_wrapper.h (via <cmath>), so by the time __clang_cuda_device_functions.h is included, math.h has already been included one way or another -- either in step 3 above, or directly by the __clang_cuda_runtime_wrapper.h

gtbercea abandoned this revision. May 15 2019, 12:53 PM
gtbercea marked an inline comment as done.

Replaced by: D61399