Page MenuHomePhabricator

[OpenMP][Clang][NVPTX] Enable math functions called in an OpenMP NVPTX target device region to be resolved as device-native function calls
Needs ReviewPublic

Authored by gtbercea on Jun 6 2018, 3:45 PM.

Details

Summary

In current Clang, on the OpenMP NVPTX toolchain, math functions are resolved as math functions for the host. For example, a call to sqrt() in a target region will result in an LLVM-IR call which looks like this:

call double sqrt(double %1)

This patch allows for math functions in OpenMP NVPTX target regions to call the same math functions that CUDA code calls. For example, for sqrt we get:

call double @llvm.nvvm.sqrt.rn.d(double %1)

This is necessary for both correctness and performance.

Diff Detail

Event Timeline

gtbercea created this revision.Jun 6 2018, 3:45 PM

Add tests for C++ and move OpenMP specific tests to OpenMP directory

lib/Headers/__clang_cuda_device_functions.h
28

Do we really need to include all that stuff here? Will it work with C++, especially with the latest versions of the standard?

44

Do you really need "inline" if you are using 'alwsys_inline' attribute already? Will it work on Windows?

IMO this goes into the right direction, we should use the fast implementation in libdevice. If LLVM doesn't lower these calls in the NVPTX backend, I think it's ok to use header wrappers as CUDA already does.

Two questions:

  1. Can you explain where this is important for "correctness"? Yesterday I compiled a code using sqrt and it seems to spit out the correct results. Maybe that's relevant for other functions?
  2. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.
lib/Headers/__clang_cuda_device_functions.h
65

Why is that only valid for C++?

IMO this goes into the right direction, we should use the fast implementation in libdevice. If LLVM doesn't lower these calls in the NVPTX backend, I think it's ok to use header wrappers as CUDA already does.

Two questions:

  1. Can you explain where this is important for "correctness"? Yesterday I compiled a code using sqrt and it seems to spit out the correct results. Maybe that's relevant for other functions?
  2. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?

    ` In file included from /usr/include/math.h:413: /usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm asm ("pmovmskb %1, %0" : "=r" (m) : "x" (x)); ^ /usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm asm ("pmovmskb %1, %0" : "=r" (m) : "x" (x)); ^ 2 errors generated. `

Hrmm. I thought that we had fixed that already.

In case it's helpful, in an out-of-tree experimental target I have I ran into a similar problem, and to fix that I wrote the following code in the target's getTargetDefines function (in lib/Basic/Targets):

// If used as an OpenMP target on x86, x86 target feature macros are defined. math.h
// and other system headers will include inline asm if these are defined.
Builder.undefineMacro("__SSE2_MATH__");
Builder.undefineMacro("__SSE_MATH__");
gtbercea added inline comments.Jun 7 2018, 7:17 AM
lib/Headers/__clang_cuda_device_functions.h
65

C does not support overloading of functions.

IMO this goes into the right direction, we should use the fast implementation in libdevice. If LLVM doesn't lower these calls in the NVPTX backend, I think it's ok to use header wrappers as CUDA already does.

Two questions:

  1. Can you explain where this is important for "correctness"? Yesterday I compiled a code using sqrt and it seems to spit out the correct results. Maybe that's relevant for other functions?
  2. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?

    ` In file included from /usr/include/math.h:413: /usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm asm ("pmovmskb %1, %0" : "=r" (m) : "x" (x)); ^ /usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm asm ("pmovmskb %1, %0" : "=r" (m) : "x" (x)); ^ 2 errors generated. `

It's precisely the issue which you report here. Since you don't use device specific math functions, you can run into the problem where you may end up calling assembly instructions for a different architecture. I may have mis-classified this as a correctness issue.

It's precisely the issue which you report here. Since you don't use device specific math functions, you can run into the problem where you may end up calling assembly instructions for a different architecture. I may have mis-classified this as a correctness issue.

I think the issue is slightly different, the assembly is not necessarily in the called functions, as I said sqrt seems to work fine. Clang just errors because they are included via the header.

This is because clang::InitializePreprocessor has this:

// FIXME: This will create multiple definitions for most of the predefined
// macros. This is not the right way to handle this.
if ((LangOpts.CUDA || LangOpts.OpenMPIsDevice) && PP.getAuxTargetInfo())
  InitializePredefinedMacros(*PP.getAuxTargetInfo(), LangOpts, FEOpts,
                             Builder);

So we will end up with all host defines (including __SSE2_MATH__ as @hfinkel wrote) during target compilation :-(

tra added a comment.Jun 7 2018, 10:15 AM

IMO this goes into the right direction, we should use the fast implementation in libdevice. If LLVM doesn't lower these calls in the NVPTX backend, I think it's ok to use header wrappers as CUDA already does.

Using wrapper headers may be OK solution for now. Ideally we should grow our own equivalent of device-side libm so we don't have to rely on libdevice bitcode.

Two questions:

  1. Can you explain where this is important for "correctness"? Yesterday I compiled a code using sqrt and it seems to spit out the correct results. Maybe that's relevant for other functions?
  2. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?

    ` In file included from /usr/include/math.h:413: /usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm asm ("pmovmskb %1, %0" : "=r" (m) : "x" (x)); ^ /usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm asm ("pmovmskb %1, %0" : "=r" (m) : "x" (x)); ^ 2 errors generated. `

Avoiding conflicts between host and device implementations of the same functions in C++ requires use of attribute-based overloading (https://goo.gl/EXnymm). For CUDA compilation, we provide device-side overloads with device attributes but otherwise identical signatures. We may need to extend it to work in C mode, too. Clang already has attribute((overloadable)), so basic overloading mechanisms should be there already.

lib/Headers/__clang_cuda_device_functions.h
1584–1586

I think it should've been return __nv_llabs(__a) here and the definition of long long llabs() should remain back where it was.

I just stumbled upon a very interesting situation.

I noticed that, for OpenMP, the use of device math functions happens as I expected for -O0. For -O1 or higher math functions such as "sqrt" resolve to llvm builtins/intrinsics:

call double @llvm.sqrt.f64(double %1)

instead of the nvvm variant.

The surprising part (at least to me) is that the same llvm intrinsic is used when I use Clang to compile CUDA kernel code calling the "sqrt" function. I would have expected that the NVVM variant would be called for CUDA code.

Interestingly, for the "pow" function the expected device version of the function i.e.:

@__internal_accurate_pow(double %14, double %4)

is used for both CUDA and OpenMP NVPTX targets (with this patch applied of course).

Is it ok for CUDA kernels to call llvm intrinsics instead of the device specific math library functions?
If it's ok for CUDA can this be ok for OpenMP NVPTX too?
If not we probably need to fix it for both toolchains.

tra added a comment.Jul 20 2018, 3:59 PM

I just stumbled upon a very interesting situation.

I noticed that, for OpenMP, the use of device math functions happens as I expected for -O0. For -O1 or higher math functions such as "sqrt" resolve to llvm builtins/intrinsics:

call double @llvm.sqrt.f64(double %1)

instead of the nvvm variant.

I believe we do have a pass that attempts to replace some nvvm intrinsics with their llvm equivalent. It allows us to optimize the code better. My guess would be that the change does not happen with -O0.

The surprising part (at least to me) is that the same llvm intrinsic is used when I use Clang to compile CUDA kernel code calling the "sqrt" function. I would have expected that the NVVM variant would be called for CUDA code.

What we may end up generating for any given standard library call from the device side depends on number of factors and may vary.
Here's what typically happens:

  • clang parses CUDA headers and pulls 'standard' C math functions and bits of C++ overloads. These usually call __something.
  • CUDA versions up to 8.0 provided those something() functions which *usually* called nv_something() in libdevice.
  • As of CUDA-9 something became NVCC's compiler builtins and clang has to provide its own implementation -- clang_cuda_device_functions.h. This implementation may use whatever works that does the job. Any of builtin.../nvvm.../__nv_... are fair game, as long as it works.
  • CUDA wrapper headers in clang do some magic to make math parts of standard C++ library working by magic by providing some functions to do the right thing. Usually those forward to the C math functions, but it may not always be the case.
  • LLVM may update some __nvvm* intrinsics to their llvm equivalent.

In the end you may end up with somewhat different IR depending on the function and the CUDA version clang used.

Is it ok for CUDA kernels to call llvm intrinsics instead of the device specific math library functions?

It depends. We can not lower all LLVM intrinsics. Generally you can't use intrinsics that are lowered to external library call.

If it's ok for CUDA can this be ok for OpenMP NVPTX too?
If not we probably need to fix it for both toolchains.

I don't have an answer for these. OpenMP seems to have somewhat different requirements compared to C++ which we assume for CUDA.

On thing you do need to consider, though, is that the wrapper headers are rather unstable. Their goal is to provide a glue between half-broken CUDA headers and the user's code. They are not intended to provide any sort of stability to anyone else. Every new CUDA version brings new and exciting changes to its headers which requires fair amount of changes in the wrappers.

If all you need is C math functions, it *may* be OK, but, perhaps, there may be a better approach.
Why not compile a real math library to bitcode and avoid all this weirdness with gluing together half-broken pieces of CUDA that are broken by design? Unlike real CUDA compilation, you don't have the constraint that you have to match NVCC 1:1. If you have your own device-side math library you could use regular math headers and link real libm.bc instead of CUDA's libdevice. The rumors of "high performance" functions in the libdevice are somewhat exaggerated , IMO. If you take a look at the IR in the libdevice of recent CUDA version, you will see that a lot of the functions just call their llvm counterpart. If it turns out that in some case llvm generates slower code than what nvidia provides, I'm sure it will be possible to implement a reasonably fast replacement.

  1. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

Hrmm. I thought that we had fixed that already.

In case it's helpful, in an out-of-tree experimental target I have I ran into a similar problem, and to fix that I wrote the following code in the target's getTargetDefines function (in lib/Basic/Targets):

// If used as an OpenMP target on x86, x86 target feature macros are defined. math.h
// and other system headers will include inline asm if these are defined.
Builder.undefineMacro("__SSE2_MATH__");
Builder.undefineMacro("__SSE_MATH__");

Just found another workaround:

diff --git a/lib/Sema/SemaStmtAsm.cpp b/lib/Sema/SemaStmtAsm.cpp
index 0db15ea..b95f949 100644
--- a/lib/Sema/SemaStmtAsm.cpp
+++ b/lib/Sema/SemaStmtAsm.cpp
@@ -306,7 +306,9 @@ StmtResult Sema::ActOnGCCAsmStmt(SourceLocation AsmLoc, bool IsSimple,
 
     TargetInfo::ConstraintInfo Info(Literal->getString(), InputName);
     if (!Context.getTargetInfo().validateInputConstraint(OutputConstraintInfos,
-                                                         Info)) {
+                                                         Info) &&
+        !(Context.getLangOpts().OpenMPIsDevice &&
+          Context.getSourceManager().isInSystemHeader(AsmLoc))) {
       return StmtError(Diag(Literal->getLocStart(),
                             diag::err_asm_invalid_input_constraint)
                        << Info.getConstraintStr());

This will ignore all errors during OpenMP device codegen from system headers when the inline assembly is not used. In that case (calling signbit) you'll get

In file included from math.c:2:
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:143:10: error: couldn't allocate input reg for constraint 'x'
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
         ^
1 error generated.

Not sure if that's acceptable...

  1. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

Hrmm. I thought that we had fixed that already.

In case it's helpful, in an out-of-tree experimental target I have I ran into a similar problem, and to fix that I wrote the following code in the target's getTargetDefines function (in lib/Basic/Targets):

// If used as an OpenMP target on x86, x86 target feature macros are defined. math.h
// and other system headers will include inline asm if these are defined.
Builder.undefineMacro("__SSE2_MATH__");
Builder.undefineMacro("__SSE_MATH__");

Just found another workaround:

diff --git a/lib/Sema/SemaStmtAsm.cpp b/lib/Sema/SemaStmtAsm.cpp
index 0db15ea..b95f949 100644
--- a/lib/Sema/SemaStmtAsm.cpp
+++ b/lib/Sema/SemaStmtAsm.cpp
@@ -306,7 +306,9 @@ StmtResult Sema::ActOnGCCAsmStmt(SourceLocation AsmLoc, bool IsSimple,
 
     TargetInfo::ConstraintInfo Info(Literal->getString(), InputName);
     if (!Context.getTargetInfo().validateInputConstraint(OutputConstraintInfos,
-                                                         Info)) {
+                                                         Info) &&
+        !(Context.getLangOpts().OpenMPIsDevice &&
+          Context.getSourceManager().isInSystemHeader(AsmLoc))) {
       return StmtError(Diag(Literal->getLocStart(),
                             diag::err_asm_invalid_input_constraint)
                        << Info.getConstraintStr());

This will ignore all errors during OpenMP device codegen from system headers when the inline assembly is not used. In that case (calling signbit) you'll get

In file included from math.c:2:
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:143:10: error: couldn't allocate input reg for constraint 'x'
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
         ^
1 error generated.

Not sure if that's acceptable...

Hrmm. Doesn't that make it so that whatever functions are implemented using that inline assembly will not be callable from target code (or, perhaps worse, will crash the backend if called)?

Hrmm. Doesn't that make it so that whatever functions are implemented using that inline assembly will not be callable from target code (or, perhaps worse, will crash the backend if called)?

You are right :-(

However I'm getting worried about a more general case, not all inline assembly is guarded by #ifdefs that we could hope to get right. For example take sys/io.h which currently throws 18 errors when compiling with offloading to GPUs, even with -O0. The inline assembly is only guarded by #if defined __GNUC__ && __GNUC__ >= 2 which should be defined by any modern compiler claiming compatibility with GCC. I'm not sure this particular header will ever end up in an OpenMP application, but others with inline assembly will. From a quick grep it looks like some headers dealing with atomic operations have inline assembly and even eigen3/Eigen/src/Core/util/Memory.h for finding the cpuid.

Coming back to the original problem: Maybe we need to undefine optimization macros as in your patch to get as many correct inline functions as possible AND ignore errors from inline assembly as in my patch to not break when including weird headers?

Hrmm. Doesn't that make it so that whatever functions are implemented using that inline assembly will not be callable from target code (or, perhaps worse, will crash the backend if called)?

You are right :-(

However I'm getting worried about a more general case, not all inline assembly is guarded by #ifdefs that we could hope to get right. For example take sys/io.h which currently throws 18 errors when compiling with offloading to GPUs, even with -O0. The inline assembly is only guarded by #if defined __GNUC__ && __GNUC__ >= 2 which should be defined by any modern compiler claiming compatibility with GCC. I'm not sure this particular header will ever end up in an OpenMP application, but others with inline assembly will. From a quick grep it looks like some headers dealing with atomic operations have inline assembly and even eigen3/Eigen/src/Core/util/Memory.h for finding the cpuid.

Coming back to the original problem: Maybe we need to undefine optimization macros as in your patch to get as many correct inline functions as possible AND ignore errors from inline assembly as in my patch to not break when including weird headers?

The problem is that the inline assembly might actually be for the target, instead of the host, because we also have target preprocessor macros defined, and it's going to be hard to tell. I'm not sure that there's a great solution here, and I agree that having something more general than undefining some specific things that happen to matter for math.h would be better. As you point out, this is not just a system-header problem. We might indeed want to undefine all of the target-feature-related macros (although that won't always be sufficient, because we need basic arch macros for the system headers to work at all, and those are generally enough to guard some inline asm).

Maybe the following makes sense: Only define the host macros, minus target-feature ones, when compiling for the target in the context of the system headers. That makes the system headers work while providing a "clean" preprocessor environment for the rest of the code (and, thus, retains our ability to complain about bad inline asm).

The problem is that the inline assembly might actually be for the target, instead of the host, because we also have target preprocessor macros defined, and it's going to be hard to tell. I'm not sure that there's a great solution here, and I agree that having something more general than undefining some specific things that happen to matter for math.h would be better. As you point out, this is not just a system-header problem. We might indeed want to undefine all of the target-feature-related macros (although that won't always be sufficient, because we need basic arch macros for the system headers to work at all, and those are generally enough to guard some inline asm).

I think there was a reason for pulling in the host defines. I'd have to look at the commit message though...

Maybe the following makes sense: Only define the host macros, minus target-feature ones, when compiling for the target in the context of the system headers. That makes the system headers work while providing a "clean" preprocessor environment for the rest of the code (and, thus, retains our ability to complain about bad inline asm).

I'm not sure how that's going to help with Eigen: Just including Eigen/Core will pull in the other header file I mentioned with inline assembly. That's completely independent of preprocessor macros, I think it's enough the library's build system detected the host architecture during install.

The problem is that the inline assembly might actually be for the target, instead of the host, because we also have target preprocessor macros defined, and it's going to be hard to tell. I'm not sure that there's a great solution here, and I agree that having something more general than undefining some specific things that happen to matter for math.h would be better. As you point out, this is not just a system-header problem. We might indeed want to undefine all of the target-feature-related macros (although that won't always be sufficient, because we need basic arch macros for the system headers to work at all, and those are generally enough to guard some inline asm).

I think there was a reason for pulling in the host defines. I'd have to look at the commit message though...

As I recall, it's mostly to make glibc's bits/wordsize.h work.

Maybe the following makes sense: Only define the host macros, minus target-feature ones, when compiling for the target in the context of the system headers. That makes the system headers work while providing a "clean" preprocessor environment for the rest of the code (and, thus, retains our ability to complain about bad inline asm).

I'm not sure how that's going to help with Eigen: Just including Eigen/Core will pull in the other header file I mentioned with inline assembly. That's completely independent of preprocessor macros, I think it's enough the library's build system detected the host architecture during install.

I don't see any good way to satisfy Eigen in that form. I think that we'll need to update it to understand not to use host inline as when compiling for a target.

gtbercea updated this revision to Diff 159335.Aug 6 2018, 10:36 AM
Fix function call.
gtbercea marked an inline comment as done.Aug 7 2018, 8:15 AM

Do we still need this? I think what we really need to solve is the problem of (host) inline assembly in the header files...

Do we still need this? I think what we really need to solve is the problem of (host) inline assembly in the header files...

Don't we want to use device specific math functions?
It's not just about avoiding some the host specific assembly, it's also about getting an implementation tailored to the device.

gtbercea updated this revision to Diff 159574.Aug 7 2018, 12:40 PM

Prevent math builtins from being used for nvptx toolchain.

Don't we want to use device specific math functions?
It's not just about avoiding some the host specific assembly, it's also about getting an implementation tailored to the device.

Ok, so you are already talking about performance. I think we should fix correctness first, in particular the compiler shouldn't complain whenever <math.h> is included.

I experimented with adding only a minimum of target defines (__amd64__ and __x86_64__): While I think this is a step into the right direction it still fails when including <fenv.h>.

Btw the GCC folks don't have a complete solution either: If you compile with -O2 you get the same complaints once the code starts calling signbit. Maybe Clang should also implement lazy Sema checking for device side compilation?

Ok, so you are already talking about performance. I think we should fix correctness first, in particular the compiler shouldn't complain whenever <math.h> is included.

This patch is concerned with calling device functions when you're on the device. The correctness issues you mention are orthogonal to this and should be handled by another patch. I don't think this patch should be held up any longer.

gtbercea marked an inline comment as done.Aug 8 2018, 5:46 AM

This patch is concerned with calling device functions when you're on the device. The correctness issues you mention are orthogonal to this and should be handled by another patch. I don't think this patch should be held up any longer.

I'm confused by now, could you please highlight the point that I'm missing?

IIRC you started to work on this to fix the problem with inline assembly (see https://reviews.llvm.org/D47849#1125019). AFAICS this patch fixes declarations of math functions but you still cannot include math.h which most "correct" codes do.

In D47849#1170670, @tra wrote:

The rumors of "high performance" functions in the libdevice are somewhat exaggerated , IMO. If you take a look at the IR in the libdevice of recent CUDA version, you will see that a lot of the functions just call their llvm counterpart. If it turns out that in some case llvm generates slower code than what nvidia provides, I'm sure it will be possible to implement a reasonably fast replacement.

So regarding performance it's not yet clear to me which cases actually benefit: Is there a particular function that is slow if LLVM's backend resolves the call vs. the wrapper script directly calls libdevice?
If I understand @tra's comment correctly, I think we should have clear evidence (ie a small "benchmark") that this patch actually improves performance.

This patch is concerned with calling device functions when you're on the device. The correctness issues you mention are orthogonal to this and should be handled by another patch. I don't think this patch should be held up any longer.

I'm confused by now, could you please highlight the point that I'm missing?

You're bringing up the correctness of the header files which is a detail that is orthogonal to this patch. Even if the header files worked correctly I would still want to use the libdevice functions. Fixing the header files themselves should be therefore done in a separate patch.
Using the libdevice functions guarantees correctness (no weird assembly instructions that the device doesn't recognize etc.) and may improve performance (if for example the libdevice contained device specific assembly).

The purpose of this patch is to call NVIDIA's libdevice math functions which should in principle be more efficient in terms of runtime and register usage. Not all of them may be more effecient today (like @tra suggested) but some of them will be. Maybe others will be improved in the future, maybe not, again that's an orthogonal point. The benefit of using libdevice functions is that any improvements NVIDIA makes we will be there to use them in the OpenMP NVPTX toolchain. The premise of the OpenMP NVPTX toolchain is that it will leverage as much of the CUDA toolchain as possible.

Another point is that users specifically ask for NVIDIA math functions to be called on the device when using OpenMP NVPTX device offloading. The libdevice library offers __nv_fast_* variants of some math functions. Users want to have access to those functions and other functions that the libdevice library contains.

IIRC you started to work on this to fix the problem with inline assembly (see https://reviews.llvm.org/D47849#1125019). AFAICS this patch fixes declarations of math functions but you still cannot include math.h which most "correct" codes do.

I'm not sure what you mean by this. This patch enables me to include math.h.

IIRC you started to work on this to fix the problem with inline assembly (see https://reviews.llvm.org/D47849#1125019). AFAICS this patch fixes declarations of math functions but you still cannot include math.h which most "correct" codes do.

I'm not sure what you mean by this. This patch enables me to include math.h.

math.c:

#include <math.h>

executed commands:

 $ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -c math.c -O2
In file included from math.c:1:
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

IIRC you started to work on this to fix the problem with inline assembly (see https://reviews.llvm.org/D47849#1125019). AFAICS this patch fixes declarations of math functions but you still cannot include math.h which most "correct" codes do.

I'm not sure what you mean by this. This patch enables me to include math.h.

math.c:

#include <math.h>

executed commands:

 $ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -c math.c -O2
In file included from math.c:1:
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

I do not get that error.

I do not get that error.

In the beginning you said that you were facing the same error. Did that go away in the meantime?
Are you testing on x86 or Power? With optimizations enabled?

ye-luo added a subscriber: ye-luo.Aug 8 2018, 8:02 AM

IIRC you started to work on this to fix the problem with inline assembly (see https://reviews.llvm.org/D47849#1125019). AFAICS this patch fixes declarations of math functions but you still cannot include math.h which most "correct" codes do.

I'm not sure what you mean by this. This patch enables me to include math.h.

math.c:

#include <math.h>

executed commands:

 $ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -c math.c -O2
In file included from math.c:1:
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

We are probably linking against different math.h files. I don't seem to have a mathinline.h with those instructions. Perhaps this is an x86 specific error.

I think I know what's happening. I think the host math.h is still included but not necessarily used. Math functions resolve to math functions in the CUDA header first (that's what this patch does). This patch doesn't prevent math.h from being included.

I do not get that error.

In the beginning you said that you were facing the same error. Did that go away in the meantime?
Are you testing on x86 or Power? With optimizations enabled?

Since I'm running on Power I was facing a similar problem related to host assembly instructions on device but not exactly the same error.

The error you are seeing is that the NVPTX target doesn't regard "x" as a valid input constraint. x is an x86 specific constraint which I don't have on the Power side.

The problems I was having were related to the math functions on the device resolving to host math functions which contained host assembly instructions which were not recognized by NVPTX. This patch fixes that issue.

Perhaps the inclusion of the host math.h should just be prevented for device code?

@Hahnfeld do you get the same error if you compile with clang++ instead of clang?

@Hahnfeld do you get the same error if you compile with clang++ instead of clang?

Yes, with both trunk and this patch applied. It's the same header after all...

Hahnfeld added a subscriber: Hahnfeld.

I feel like there is no progress in the discussion (here and off-list), partly because we might still not be talking about the same things. So I'm stepping down from this revision to unblock review from somebody else.

Here's my current understanding of the issue(s):

  • math.h (or transitively included files) on both PowerPC and x86 contain inline assembly.
    • On x86 Clang directly bails out because the code is using the x input constraint which doesn't exist for NVPTX (-> invalid input constraint 'x' in asm).
    • From my understanding the header passes Sema analysis on PowerPC, but rejects CodeGen because the assembly instructions are invalid on NVPTX?
  • This problem can be avoided (for testing purposes; including math.h should be fixed as well some day!) by explicitly declaring all needed math functions (like extern double exp(double);)
    • Without additional flags this makes Clang emit Intrinsic Functions like @llvm.exp.f64 for NVPTX.
    • That's because IsMathErrnoDefault() returns false for the Cuda ToolChain. This behaviour can be overwritten using -fmath-errno (the test case nvptx_device_math_functions.c uses this flag; I'm not sure why?)
  • That at least looks to be producing correct IR in both cases which is then passed to the backend:
    1. For intrinsic functions (with some notable exceptions) the backend complains Cannot select: [...] ExternalSymbol'exp'.
      • Some exceptions are sqrt.f32, sqrt.f64, sin.f32 and cos.f32: The backend will directly lower them to the corresponding PTX instruction. Unfortunately there is none for exp...
    2. For "real" function calls (like call double @exp(double %3)) nvlink will throw Undefined reference errors.

This patch takes the following approach:

  1. Avoid intrinsics for math builtins by passing -fno-math-builtin for device compilation.
  2. Use the CUDA header to redirect math functions to their libdevice equivalents in the frontend, mostly just prefixed by __nv_ (for example exp(a) -> __nv_exp(a)).

The downside of this approach is that LLVM doesn't recognize these function calls and doesn't perform optimizations to fold libcalls. For example pow(a, 2) is transformed into a multiplication but __nv_pow(a, 2) is not.

IMO this goes into the right direction, we should use the fast implementation in libdevice.

So yeah, my comment seems to be outdated if these simple optimizations don't happen anymore with this patch: I don't want to use a fast pow(a, 2), I don't want to call a library function for that at all.

We could of course make LLVM recognize the calls to libdevice and handle them the same way. But that's adding more workarounds to make this patch not regress on easy cases (in terms of transformations).
Another approach would be to make the NVPTX backend lower remaining calls of math functions to libdevice equivalents. I came across D34708 which seems to go into that direction (but doesn't work out-of-the-box after fixing some build errors, complaing about Undefined external symbols because libdevice is optimized away as it wasn't needed before)...

The downside of this approach is that LLVM doesn't recognize these function calls and doesn't perform optimizations to fold libcalls. For example pow(a, 2) is transformed into a multiplication but __nv_pow(a, 2) is not.

Doesn't CUDA have the same problem?

I don't want to use a fast `pow(a, 2)`, I don't want to call a library function for that at all.

I do believe you won't end up calling a function. If you're compiling with optimizations on this will be inlined.

Thanks @Hahnfeld for your suggestions.

Unfortunately doing the lowering in the backend one would need to replace the math function calls with calls to libdevice function calls. I have not been able to do that in an elegant way. Encoding the interface to libdevice is just not a clean process not to mention that any changes to libdevice will have to be tracked manually with every new CUDA version. It does not make the code more maintainable, on the contrary I think it makes it harder to track libdevice changes.

On the same note, clang-cuda doesn't do the pow(a,2) -> a*a optimization, I checked. It is something that needs to be fixed for Clang-CUDA first before OpenMP can make use of it. OpenMP-NVPTX toolchain is designed to exist on top of the CUDA toolchain. It therefore inherits all the clang-cuda benefits and in this particular case, limitations.

As for the Sema check error you report (the one related to the x restriction), I think the fix you proposed is good and should be pushed in a separate patch.

Just to address any generality concerns:

This patch fixes the problem of calling libdevice math functions for all platform combinations. It ensures that the OpenMP NVPTX target region will NOT call any host math functions (which ever host that may be) IF equivalent device functions are available.

I think there was a confusion regarding header file inclusion. This patch does not address any issues that might arise from the user including header files (be it math.h or some other header). Any failure related to header file inclusion (such as the reported x restriction issue on x86) is unrelated to what this patch aims to do. Before the functionality in this patch can kick in, any user-included headers must successfully pass all checks in place for the NVPTX toolchain. A fix in the direction of the one proposed in one of the comments above is probably required. The fix would also needs its own separate patch.

gtbercea updated this revision to Diff 160598.Aug 14 2018, 8:36 AM

Add __NO_MATH_INLINES macro for the NVPTX toolchain to prevent any host assembly from seeping onto the device.

I like the idea of using an automatic include as a cc1 option (-include). However, I would prefer a more general automatic include for OpenMP, not just for math functions (clang_cuda_device_functions.h). Clang cuda automatically includes clang_cuda_runtime_wrapper.h. It includes other files as needed like clang_cuda_device_functions.h. Lets hypothetically call my proposed automatic include for OpenMP , clang_openmp_runtime_wrapper.h.

Just because clang cuda defines functions in clang_cuda_device_functins.h and automatically includes them does not make it right for OpenMP. In general, function definitions in headers should be avoided. The current function definitions in clang_cuda_device_functions.h only work for hostile nv GPUs :). This is how we can avoid function definitions in the headers. In a new openmp build process, we can build libm-nvptx.bc. This can be done by compiling __clang_cuda_device_functions.h as a device-only compile. Assuming current naming conventions, these files would be installed in the same directory as libomptarget.so (.../lib).

How do we tell clang cc1 to use this bc library? Use -mlink-builtin-bitcode. AddMathDeviceFunctions would then look something like this.

if (this is for device cc1) {

CC1Args.push_back("-mlink-builtin-bitcode");
if ( getTriple().isNVPTX())
  CC1Args.push_back(DriverArgs.MakeArgString("libm-nvptx.bc"));
if ( getTriple().getArch() == llvm::Triple::amdgcn);
  CC1Args.push_back(DriverArgs.MakeArgString("libm-amdgcn.bc"));

}

You can think of libm-<arch>.bc file as the device library equivalent of the host libm.so or libm.a. This concept of "host-consistent" library definitions can go beyond math libraries. In fact, I believe we should co-opt the -l (--library) option. The driver toolchain should look for device bc libraries for any -lX command line option. This gives us a strategy for adding user-defined device libraries.

The above code hints at the idea of architecture specific bc files (nvptx vs amdgcn). The nvptx version would call into the cuda libdevice. For radeon processors, we may want processor-optimized versions of the libraries, just like there are sub-architecture optimized versions of the cuda libdevice. If we build --cuda-cuda-gpu-arch optimized versions of math bc libs, then the above code will get a bit more complex depending on naming convention of the bc lib and the value of
--cuda-gpu-arch (which should have an alias --offload-arch).

Using a bc lib, significantly reduces the complexity of clang_openmp_runtime_wrapper.h. We do not not need or see math device function definitions or the nv headers that they need. However, it does need to correct the behaviour of rogue system headers that define host-optimized functions. We can fix this by adding the following to clang_openmp_runtime_wrapper.h so that host passes still get host-optimized functions.

#if defined(AMDGCN) || defined(NVPTX)
#define NO_INLINE 1
#endif

There is a tradeoff to using pre-compiled bc libs. It makes compile-time macro logic hard to implement. For example, we cant do this

#if defined(CLANG_CUDA_APPROX_TRANSCENDENTALS)
#define FAST_OR_SLOW(fast, slow) fast
#else
#define
FAST_OR_SLOW(fast, slow) slow
#endif

The openmp build process would either need to build alternative bc libraries for each option or a supplemental bc library to address these types of options.
If some option is turned on, then an alternative lib or particular ordering of libs would be used to build the clang cc1 command.
For example, the above code for AddMathDeviceFunctions would have this

...
if ( getTriple().isNVPTX()) {
   if (LangOpts.CUDADeviceApproxTranscendentals || LangOpts.FastMath) {
     CC1Args.push_back("-mlink-builtin-bitcode");
     CC1Args.push_back(DriverArgs.MakeArgString("libm-fast-nvptx.bc"));
   }
   CC1Args.push_back("-mlink-builtin-bitcode");
   CC1Args.push_back(DriverArgs.MakeArgString("libm-nvptx.bc"));
}

I personally believe that pre-built bc libraries with some consistency to their host-equivalent libraries is a more sane approach for device libraries than complex header logic that is customized for each architecture.

tra added a comment.Aug 22 2018, 3:16 PM

__clang_cuda_device_functions.h is not intended to be a device-side math.h, despite having a lot of overlap/similarities. It may change at any time we get new CUDA version.
I would suggest writing an OpenMP-specific replacement for math.h which would map to whatever device-specific function OpenMP needs. For NVPTX that may be libdevice, for which you have declarations in __clang_cuda_libdevice_declares.h. Using part of __clang_cuda_device_functions.h may be a decent starting point for NVPTX, but OpenMP will likely need to provide an equivalent for other back-ends, too.

lib/Basic/Targets/NVPTX.cpp
232

This relies on implementation detail of particular variant of the header file you're assuming all compilations will include. This is a workaround of the real problem (attempting to use headers from machine X while targeting Y) at best.

D50845 is dealing with the issue of headers for target code. Hopefully, they'll find a way to provide device-specific headers, so you don't rely on host headers being parseable during device-side compilation.

lib/Driver/ToolChains/Clang.cpp
4758

Could you elaborate on why you don't want the builtins?
Builtins are enabled and are useful for CUDA. What makes their use different for OpenMP?
Are you doing it to guarantee that math functions remain unresolved in IR so you could link them in from external bitcode?

gtbercea added inline comments.Aug 23 2018, 8:16 AM
lib/Basic/Targets/NVPTX.cpp
232

I agree. The proper fix would be what the other patch is attempting to do.

lib/Driver/ToolChains/Clang.cpp
4758

That's right. I don't particularly like this approach as this leads to OpenMP-NVPTX toolchain missing out on optimizations such as replacing math function call with basic operations ( pow(a,2) -> a*a for example).
I am trying to fix this in a future patch by allowing intrinsics/builtins to propagate.