This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP][Clang][NVPTX] Enable math functions called in an OpenMP NVPTX target device region to be resolved as device-native function calls
AbandonedPublic

Authored by gtbercea on Jun 6 2018, 3:45 PM.

Details

Summary

In current Clang, on the OpenMP NVPTX toolchain, math functions called in a target region are resolved to the host's math functions. For example, a call to sqrt() in a target region results in an LLVM-IR call that looks like this:

call double sqrt(double %1)

This patch allows math functions called in OpenMP NVPTX target regions to resolve to the same device math functions that CUDA code calls. For example, for sqrt we get:

call double @llvm.nvvm.sqrt.rn.d(double %1)

This is necessary for both correctness and performance.
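
As a minimal illustration (not part of the patch itself; the function and variable names are made up for the example), this is the kind of code where the resolution matters:

// Sketch only: with this patch, the sqrt() call inside the target region is
// expected to resolve to the device-native implementation (e.g. the
// @llvm.nvvm.sqrt.rn.d path above) instead of the host's sqrt.
#include <math.h>

void scale_by_sqrt(double *x, int n) {
  #pragma omp target teams distribute parallel for map(tofrom : x[0:n])
  for (int i = 0; i < n; ++i)
    x[i] = sqrt(x[i]);
}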

Diff Detail

Event Timeline

gtbercea created this revision.Jun 6 2018, 3:45 PM

Add tests for C++ and move OpenMP specific tests to OpenMP directory

lib/Headers/__clang_cuda_device_functions.h
28

Do we really need to include all that stuff here? Will it work with C++, especially with the latest versions of the standard?

44

Do you really need "inline" if you are using the 'always_inline' attribute already? Will it work on Windows?

IMO this goes in the right direction; we should use the fast implementation in libdevice. If LLVM doesn't lower these calls in the NVPTX backend, I think it's ok to use header wrappers as CUDA already does.

Two questions:

  1. Can you explain where this is important for "correctness"? Yesterday I compiled a code using sqrt and it seems to spit out the correct results. Maybe that's relevant for other functions?
  2. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.
lib/Headers/__clang_cuda_device_functions.h
65

Why is that only valid for C++?

IMO this goes in the right direction; we should use the fast implementation in libdevice. If LLVM doesn't lower these calls in the NVPTX backend, I think it's ok to use header wrappers as CUDA already does.

Two questions:

  1. Can you explain where this is important for "correctness"? Yesterday I compiled a code using sqrt and it seems to spit out the correct results. Maybe that's relevant for other functions?
  2. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

Hrmm. I thought that we had fixed that already.

In case it's helpful, in an out-of-tree experimental target I have I ran into a similar problem, and to fix that I wrote the following code in the target's getTargetDefines function (in lib/Basic/Targets):

// If used as an OpenMP target on x86, x86 target feature macros are defined. math.h
// and other system headers will include inline asm if these are defined.
Builder.undefineMacro("__SSE2_MATH__");
Builder.undefineMacro("__SSE_MATH__");
gtbercea added inline comments.Jun 7 2018, 7:17 AM
lib/Headers/__clang_cuda_device_functions.h
65

C does not support overloading of functions.

IMO this goes in the right direction; we should use the fast implementation in libdevice. If LLVM doesn't lower these calls in the NVPTX backend, I think it's ok to use header wrappers as CUDA already does.

Two questions:

  1. Can you explain where this is important for "correctness"? Yesterday I compiled a code using sqrt and it seems to spit out the correct results. Maybe that's relevant for other functions?
  2. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

It's precisely the issue you report here. Since you don't use device-specific math functions, you can end up calling assembly instructions intended for a different architecture. I may have mis-classified this as a correctness issue.

It's precisely the issue you report here. Since you don't use device-specific math functions, you can end up calling assembly instructions intended for a different architecture. I may have mis-classified this as a correctness issue.

I think the issue is slightly different: the assembly is not necessarily in the called functions; as I said, sqrt seems to work fine. Clang just errors because the inline assembly is pulled in via the header.

This is because clang::InitializePreprocessor has this:

// FIXME: This will create multiple definitions for most of the predefined
// macros. This is not the right way to handle this.
if ((LangOpts.CUDA || LangOpts.OpenMPIsDevice) && PP.getAuxTargetInfo())
  InitializePredefinedMacros(*PP.getAuxTargetInfo(), LangOpts, FEOpts,
                             Builder);

So we will end up with all host defines (including __SSE2_MATH__ as @hfinkel wrote) during target compilation :-(

tra added a comment.Jun 7 2018, 10:15 AM

IMO this goes in the right direction; we should use the fast implementation in libdevice. If LLVM doesn't lower these calls in the NVPTX backend, I think it's ok to use header wrappers as CUDA already does.

Using wrapper headers may be an OK solution for now. Ideally we should grow our own equivalent of a device-side libm so we don't have to rely on libdevice bitcode.

Two questions:

  1. Can you explain where this is important for "correctness"? Yesterday I compiled a code using sqrt and it seems to spit out the correct results. Maybe that's relevant for other functions?
  2. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

Avoiding conflicts between host and device implementations of the same functions in C++ requires use of attribute-based overloading (https://goo.gl/EXnymm). For CUDA compilation, we provide device-side overloads with __device__ attributes but otherwise identical signatures. We may need to extend it to work in C mode, too. Clang already has __attribute__((overloadable)), so basic overloading mechanisms should be there already.
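
For readers unfamiliar with that attribute, here is a small, self-contained C sketch of the overloading mechanism referred to above (the my_abs name is made up for the example; the host/device split itself would still need CUDA-style attributes or OpenMP variants on top of this):

// Plain C, compiles with clang: __attribute__((overloadable)) lets several
// functions share one name, with resolution by argument type.
#include <stdio.h>

__attribute__((overloadable)) double my_abs(double x) { return x < 0 ? -x : x; }
__attribute__((overloadable)) float  my_abs(float x)  { return x < 0 ? -x : x; }

int main(void) {
  // The double overload is picked for -2.0, the float overload for -2.0f.
  printf("%f %f\n", my_abs(-2.0), (double)my_abs(-2.0f));
  return 0;
}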

lib/Headers/__clang_cuda_device_functions.h
1584–1586

I think it should've been return __nv_llabs(__a) here and the definition of long long llabs() should remain back where it was.

I just stumbled upon a very interesting situation.

I noticed that, for OpenMP, the use of device math functions happens as I expected for -O0. For -O1 or higher math functions such as "sqrt" resolve to llvm builtins/intrinsics:

call double @llvm.sqrt.f64(double %1)

instead of the nvvm variant.

The surprising part (at least to me) is that the same llvm intrinsic is used when I use Clang to compile CUDA kernel code calling the "sqrt" function. I would have expected that the NVVM variant would be called for CUDA code.

Interestingly, for the "pow" function the expected device version of the function i.e.:

@__internal_accurate_pow(double %14, double %4)

is used for both CUDA and OpenMP NVPTX targets (with this patch applied of course).

Is it ok for CUDA kernels to call llvm intrinsics instead of the device specific math library functions?
If it's ok for CUDA can this be ok for OpenMP NVPTX too?
If not we probably need to fix it for both toolchains.

tra added a comment.Jul 20 2018, 3:59 PM

I just stumbled upon a very interesting situation.

I noticed that, for OpenMP, the use of device math functions happens as I expected for -O0. For -O1 or higher math functions such as "sqrt" resolve to llvm builtins/intrinsics:

call double @llvm.sqrt.f64(double %1)

instead of the nvvm variant.

I believe we do have a pass that attempts to replace some nvvm intrinsics with their llvm equivalent. It allows us to optimize the code better. My guess would be that the change does not happen with -O0.

The surprising part (at least to me) is that the same llvm intrinsic is used when I use Clang to compile CUDA kernel code calling the "sqrt" function. I would have expected that the NVVM variant would be called for CUDA code.

What we may end up generating for any given standard library call from the device side depends on a number of factors and may vary.
Here's what typically happens:

  • clang parses CUDA headers and pulls 'standard' C math functions and bits of C++ overloads. These usually call __something.
  • CUDA versions up to 8.0 provided those __something() functions, which *usually* called __nv_something() in libdevice.
  • As of CUDA-9, __something became NVCC's compiler builtins and clang has to provide its own implementation -- __clang_cuda_device_functions.h. This implementation may use whatever works that does the job. Any of __builtin_.../__nvvm_.../__nv_... are fair game, as long as it works.
  • CUDA wrapper headers in clang do some magic to make the math parts of the standard C++ library work by providing some functions to do the right thing. Usually those forward to the C math functions, but that may not always be the case.
  • LLVM may update some __nvvm* intrinsics to their llvm equivalent.

In the end you may end up with somewhat different IR depending on the function and the CUDA version clang used.

Is it ok for CUDA kernels to call llvm intrinsics instead of the device specific math library functions?

It depends. We cannot lower all LLVM intrinsics. Generally you can't use intrinsics that are lowered to an external library call.

If it's ok for CUDA can this be ok for OpenMP NVPTX too?
If not we probably need to fix it for both toolchains.

I don't have an answer for these. OpenMP seems to have somewhat different requirements compared to C++ which we assume for CUDA.

One thing you do need to consider, though, is that the wrapper headers are rather unstable. Their goal is to provide glue between half-broken CUDA headers and the user's code. They are not intended to provide any sort of stability to anyone else. Every new CUDA version brings new and exciting changes to its headers, which requires a fair amount of changes in the wrappers.

If all you need is C math functions, it *may* be OK, but, perhaps, there may be a better approach.
Why not compile a real math library to bitcode and avoid all this weirdness with gluing together half-broken pieces of CUDA that are broken by design? Unlike real CUDA compilation, you don't have the constraint that you have to match NVCC 1:1. If you have your own device-side math library you could use regular math headers and link a real libm.bc instead of CUDA's libdevice. The rumors of "high performance" functions in libdevice are somewhat exaggerated, IMO. If you take a look at the IR in the libdevice of a recent CUDA version, you will see that a lot of the functions just call their LLVM counterpart. If it turns out that in some case LLVM generates slower code than what NVIDIA provides, I'm sure it will be possible to implement a reasonably fast replacement.

  1. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

Hrmm. I thought that we had fixed that already.

In case it's helpful, in an out-of-tree experimental target I have I ran into a similar problem, and to fix that I wrote the following code in the target's getTargetDefines function (in lib/Basic/Targets):

// If used as an OpenMP target on x86, x86 target feature macros are defined. math.h
// and other system headers will include inline asm if these are defined.
Builder.undefineMacro("__SSE2_MATH__");
Builder.undefineMacro("__SSE_MATH__");

Just found another workaround:

diff --git a/lib/Sema/SemaStmtAsm.cpp b/lib/Sema/SemaStmtAsm.cpp
index 0db15ea..b95f949 100644
--- a/lib/Sema/SemaStmtAsm.cpp
+++ b/lib/Sema/SemaStmtAsm.cpp
@@ -306,7 +306,9 @@ StmtResult Sema::ActOnGCCAsmStmt(SourceLocation AsmLoc, bool IsSimple,
 
     TargetInfo::ConstraintInfo Info(Literal->getString(), InputName);
     if (!Context.getTargetInfo().validateInputConstraint(OutputConstraintInfos,
-                                                         Info)) {
+                                                         Info) &&
+        !(Context.getLangOpts().OpenMPIsDevice &&
+          Context.getSourceManager().isInSystemHeader(AsmLoc))) {
       return StmtError(Diag(Literal->getLocStart(),
                             diag::err_asm_invalid_input_constraint)
                        << Info.getConstraintStr());

This will ignore such errors from system headers during OpenMP device codegen as long as the inline assembly is not actually used. If it is used (for example by calling signbit) you'll get

In file included from math.c:2:
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:143:10: error: couldn't allocate input reg for constraint 'x'
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
         ^
1 error generated.

Not sure if that's acceptable...

  1. Incidentally I ran into a closely related problem: I can't #include <math.h> in translation units compiled for offloading, Clang complains about inline assembly for x86 (see below). Does that work for you?
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

Hrmm. I thought that we had fixed that already.

In case it's helpful, in an out-of-tree experimental target I have I ran into a similar problem, and to fix that I wrote the following code in the target's getTargetDefines function (in lib/Basic/Targets):

// If used as an OpenMP target on x86, x86 target feature macros are defined. math.h
// and other system headers will include inline asm if these are defined.
Builder.undefineMacro("__SSE2_MATH__");
Builder.undefineMacro("__SSE_MATH__");

Just found another workaround:

diff --git a/lib/Sema/SemaStmtAsm.cpp b/lib/Sema/SemaStmtAsm.cpp
index 0db15ea..b95f949 100644
--- a/lib/Sema/SemaStmtAsm.cpp
+++ b/lib/Sema/SemaStmtAsm.cpp
@@ -306,7 +306,9 @@ StmtResult Sema::ActOnGCCAsmStmt(SourceLocation AsmLoc, bool IsSimple,
 
     TargetInfo::ConstraintInfo Info(Literal->getString(), InputName);
     if (!Context.getTargetInfo().validateInputConstraint(OutputConstraintInfos,
-                                                         Info)) {
+                                                         Info) &&
+        !(Context.getLangOpts().OpenMPIsDevice &&
+          Context.getSourceManager().isInSystemHeader(AsmLoc))) {
       return StmtError(Diag(Literal->getLocStart(),
                             diag::err_asm_invalid_input_constraint)
                        << Info.getConstraintStr());

This will ignore such errors from system headers during OpenMP device codegen as long as the inline assembly is not actually used. If it is used (for example by calling signbit) you'll get

In file included from math.c:2:
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:143:10: error: couldn't allocate input reg for constraint 'x'
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
         ^
1 error generated.

Not sure if that's acceptable...

Hrmm. Doesn't that make it so that whatever functions are implemented using that inline assembly will not be callable from target code (or, perhaps worse, will crash the backend if called)?

Hrmm. Doesn't that make it so that whatever functions are implemented using that inline assembly will not be callable from target code (or, perhaps worse, will crash the backend if called)?

You are right :-(

However I'm getting worried about a more general case: not all inline assembly is guarded by #ifdefs that we could hope to get right. For example, take sys/io.h which currently throws 18 errors when compiling with offloading to GPUs, even with -O0. The inline assembly is only guarded by #if defined __GNUC__ && __GNUC__ >= 2, which holds for any modern compiler claiming compatibility with GCC. I'm not sure this particular header will ever end up in an OpenMP application, but others with inline assembly will. From a quick grep it looks like some headers dealing with atomic operations have inline assembly, and even eigen3/Eigen/src/Core/util/Memory.h uses it for finding the cpuid.

Coming back to the original problem: Maybe we need to undefine optimization macros as in your patch to get as many correct inline functions as possible AND ignore errors from inline assembly as in my patch to not break when including weird headers?

Hrmm. Doesn't that make it so that whatever functions are implemented using that inline assembly will not be callable from target code (or, perhaps worse, will crash the backend if called)?

You are right :-(

However I'm getting worried about a more general case: not all inline assembly is guarded by #ifdefs that we could hope to get right. For example, take sys/io.h which currently throws 18 errors when compiling with offloading to GPUs, even with -O0. The inline assembly is only guarded by #if defined __GNUC__ && __GNUC__ >= 2, which holds for any modern compiler claiming compatibility with GCC. I'm not sure this particular header will ever end up in an OpenMP application, but others with inline assembly will. From a quick grep it looks like some headers dealing with atomic operations have inline assembly, and even eigen3/Eigen/src/Core/util/Memory.h uses it for finding the cpuid.

Coming back to the original problem: Maybe we need to undefine optimization macros as in your patch to get as many correct inline functions as possible AND ignore errors from inline assembly as in my patch to not break when including weird headers?

The problem is that the inline assembly might actually be for the target, instead of the host, because we also have target preprocessor macros defined, and it's going to be hard to tell. I'm not sure that there's a great solution here, and I agree that having something more general than undefining some specific things that happen to matter for math.h would be better. As you point out, this is not just a system-header problem. We might indeed want to undefine all of the target-feature-related macros (although that won't always be sufficient, because we need basic arch macros for the system headers to work at all, and those are generally enough to guard some inline asm).

Maybe the following makes sense: Only define the host macros, minus target-feature ones, when compiling for the target in the context of the system headers. That makes the system headers work while providing a "clean" preprocessor environment for the rest of the code (and, thus, retains our ability to complain about bad inline asm).

The problem is that the inline assembly might actually be for the target, instead of the host, because we also have target preprocessor macros defined, and it's going to be hard to tell. I'm not sure that there's a great solution here, and I agree that having something more general than undefining some specific things that happen to matter for math.h would be better. As you point out, this is not just a system-header problem. We might indeed want to undefine all of the target-feature-related macros (although that won't always be sufficient, because we need basic arch macros for the system headers to work at all, and those are generally enough to guard some inline asm).

I think there was a reason for pulling in the host defines. I'd have to look at the commit message though...

Maybe the following makes sense: Only define the host macros, minus target-feature ones, when compiling for the target in the context of the system headers. That makes the system headers work while providing a "clean" preprocessor environment for the rest of the code (and, thus, retains our ability to complain about bad inline asm).

I'm not sure how that's going to help with Eigen: just including Eigen/Core will pull in the other header file I mentioned with inline assembly. That's completely independent of preprocessor macros; I think it's enough that the library's build system detected the host architecture during installation.

The problem is that the inline assembly might actually be for the target, instead of the host, because we also have target preprocessor macros defined, and it's going to be hard to tell. I'm not sure that there's a great solution here, and I agree that having something more general than undefining some specific things that happen to matter for math.h would be better. As you point out, this is not just a system-header problem. We might indeed want to undefine all of the target-feature-related macros (although that won't always be sufficient, because we need basic arch macros for the system headers to work at all, and those are generally enough to guard some inline asm).

I think there was a reason for pulling in the host defines. I'd have to look at the commit message though...

As I recall, it's mostly to make glibc's bits/wordsize.h work.

Maybe the following makes sense: Only define the host macros, minus target-feature ones, when compiling for the target in the context of the system headers. That makes the system headers work while providing a "clean" preprocessor environment for the rest of the code (and, thus, retains our ability to complain about bad inline asm).

I'm not sure how that's going to help with Eigen: just including Eigen/Core will pull in the other header file I mentioned with inline assembly. That's completely independent of preprocessor macros; I think it's enough that the library's build system detected the host architecture during installation.

I don't see any good way to satisfy Eigen in that form. I think that we'll need to update it to understand not to use host inline asm when compiling for a target.

gtbercea updated this revision to Diff 159335.Aug 6 2018, 10:36 AM
Fix function call.
gtbercea marked an inline comment as done.Aug 7 2018, 8:15 AM

Do we still need this? I think what we really need to solve is the problem of (host) inline assembly in the header files...

Do we still need this? I think what we really need to solve is the problem of (host) inline assembly in the header files...

Don't we want to use device-specific math functions?
It's not just about avoiding some of the host-specific assembly; it's also about getting an implementation tailored to the device.

gtbercea updated this revision to Diff 159574.Aug 7 2018, 12:40 PM

Prevent math builtins from being used for nvptx toolchain.

Don't we want to use device-specific math functions?
It's not just about avoiding some of the host-specific assembly; it's also about getting an implementation tailored to the device.

Ok, so you are already talking about performance. I think we should fix correctness first, in particular the compiler shouldn't complain whenever <math.h> is included.

I experimented with adding only a minimum of target defines (__amd64__ and __x86_64__): while I think this is a step in the right direction, it still fails when including <fenv.h>.

Btw the GCC folks don't have a complete solution either: If you compile with -O2 you get the same complaints once the code starts calling signbit. Maybe Clang should also implement lazy Sema checking for device side compilation?

Ok, so you are already talking about performance. I think we should fix correctness first, in particular the compiler shouldn't complain whenever <math.h> is included.

This patch is concerned with calling device functions when you're on the device. The correctness issues you mention are orthogonal to this and should be handled by another patch. I don't think this patch should be held up any longer.

gtbercea marked an inline comment as done.Aug 8 2018, 5:46 AM

This patch is concerned with calling device functions when you're on the device. The correctness issues you mention are orthogonal to this and should be handled by another patch. I don't think this patch should be held up any longer.

I'm confused by now, could you please highlight the point that I'm missing?

IIRC you started to work on this to fix the problem with inline assembly (see https://reviews.llvm.org/D47849#1125019). AFAICS this patch fixes declarations of math functions but you still cannot include math.h which most "correct" codes do.

In D47849#1170670, @tra wrote:

The rumors of "high performance" functions in the libdevice are somewhat exaggerated , IMO. If you take a look at the IR in the libdevice of recent CUDA version, you will see that a lot of the functions just call their llvm counterpart. If it turns out that in some case llvm generates slower code than what nvidia provides, I'm sure it will be possible to implement a reasonably fast replacement.

So regarding performance it's not yet clear to me which cases actually benefit: is there a particular function that is slow if LLVM's backend resolves the call vs. the wrapper header directly calling libdevice?
If I understand @tra's comment correctly, I think we should have clear evidence (i.e., a small "benchmark") that this patch actually improves performance.

This patch is concerned with calling device functions when you're on the device. The correctness issues you mention are orthogonal to this and should be handled by another patch. I don't think this patch should be held up any longer.

I'm confused by now, could you please highlight the point that I'm missing?

You're bringing up the correctness of the header files, which is a detail that is orthogonal to this patch. Even if the header files worked correctly I would still want to use the libdevice functions. Fixing the header files themselves should therefore be done in a separate patch.
Using the libdevice functions guarantees correctness (no weird assembly instructions that the device doesn't recognize, etc.) and may improve performance (if, for example, libdevice contains device-specific assembly).

The purpose of this patch is to call NVIDIA's libdevice math functions, which should in principle be more efficient in terms of runtime and register usage. Not all of them may be more efficient today (as @tra suggested) but some of them will be. Maybe others will be improved in the future, maybe not; again, that's an orthogonal point. The benefit of using libdevice functions is that any improvements NVIDIA makes become available to the OpenMP NVPTX toolchain. The premise of the OpenMP NVPTX toolchain is that it will leverage as much of the CUDA toolchain as possible.

Another point is that users specifically ask for NVIDIA math functions to be called on the device when using OpenMP NVPTX device offloading. The libdevice library offers __nv_fast_* variants of some math functions. Users want to have access to those functions and other functions that the libdevice library contains.
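
As a hedged sketch of what such direct access could look like (assuming libdevice is linked in, as this patch arranges; the explicit declaration and the scale() function are made up for illustration, while __nv_fast_powf is an existing libdevice entry point):

// Sketch: calling a libdevice fast-math variant explicitly from a target region.
#pragma omp declare target
float __nv_fast_powf(float, float); // resolved against libdevice at link time
#pragma omp end declare target

void scale(float *x, int n, float e) {
  #pragma omp target teams distribute parallel for map(tofrom : x[0:n])
  for (int i = 0; i < n; ++i)
    x[i] = __nv_fast_powf(x[i], e);
}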

IIRC you started to work on this to fix the problem with inline assembly (see https://reviews.llvm.org/D47849#1125019). AFAICS this patch fixes declarations of math functions but you still cannot include math.h which most "correct" codes do.

I'm not sure what you mean by this. This patch enables me to include math.h.

IIRC you started to work on this to fix the problem with inline assembly (see https://reviews.llvm.org/D47849#1125019). AFAICS this patch fixes declarations of math functions but you still cannot include math.h which most "correct" codes do.

I'm not sure what you mean by this. This patch enables me to include math.h.

math.c:

#include <math.h>

executed commands:

 $ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -c math.c -O2
In file included from math.c:1:
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

IIRC you started to work on this to fix the problem with inline assembly (see https://reviews.llvm.org/D47849#1125019). AFAICS this patch fixes declarations of math functions but you still cannot include math.h which most "correct" codes do.

I'm not sure what you mean by this. This patch enables me to include math.h.

math.c:

#include <math.h>

executed commands:

 $ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -c math.c -O2
In file included from math.c:1:
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

I do not get that error.

I do not get that error.

In the beginning you said that you were facing the same error. Did that go away in the meantime?
Are you testing on x86 or Power? With optimizations enabled?

ye-luo added a subscriber: ye-luo.Aug 8 2018, 8:02 AM

IIRC you started to work on this to fix the problem with inline assembly (see https://reviews.llvm.org/D47849#1125019). AFAICS this patch fixes declarations of math functions but you still cannot include math.h which most "correct" codes do.

I'm not sure what you mean by this. This patch enables me to include math.h.

math.c:

#include <math.h>

executed commands:

 $ clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -c math.c -O2
In file included from math.c:1:
In file included from /usr/include/math.h:413:
/usr/include/bits/mathinline.h:131:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
/usr/include/bits/mathinline.h:143:43: error: invalid input constraint 'x' in asm
  __asm ("pmovmskb %1, %0" : "=r" (__m) : "x" (__x));
                                          ^
2 errors generated.

We are probably using different math.h files. I don't seem to have a mathinline.h with those instructions. Perhaps this is an x86-specific error.

I think I know what's happening. I think the host math.h is still included but not necessarily used. Math functions resolve to math functions in the CUDA header first (that's what this patch does). This patch doesn't prevent math.h from being included.

I do not get that error.

In the beginning you said that you were facing the same error. Did that go away in the meantime?
Are you testing on x86 or Power? With optimizations enabled?

Since I'm running on Power I was facing a similar problem related to host assembly instructions on device but not exactly the same error.

The error you are seeing is that the NVPTX target doesn't regard "x" as a valid input constraint. "x" is an x86-specific constraint which doesn't come up on the Power side.

The problems I was having were related to the math functions on the device resolving to host math functions which contained host assembly instructions which were not recognized by NVPTX. This patch fixes that issue.

Perhaps the inclusion of the host math.h should just be prevented for device code?

@Hahnfeld do you get the same error if you compile with clang++ instead of clang?

@Hahnfeld do you get the same error if you compile with clang++ instead of clang?

Yes, with both trunk and this patch applied. It's the same header after all...

Hahnfeld added a subscriber: Hahnfeld.

I feel like there is no progress in the discussion (here and off-list), partly because we might still not be talking about the same things. So I'm stepping down from this revision to unblock review from somebody else.

Here's my current understanding of the issue(s):

  • math.h (or transitively included files) on both PowerPC and x86 contain inline assembly.
    • On x86 Clang directly bails out because the code is using the x input constraint which doesn't exist for NVPTX (-> invalid input constraint 'x' in asm).
    • From my understanding the header passes Sema analysis on PowerPC, but is rejected during CodeGen because the assembly instructions are invalid on NVPTX?
  • This problem can be avoided (for testing purposes; including math.h should be fixed as well some day!) by explicitly declaring all needed math functions (like extern double exp(double);)
    • Without additional flags this makes Clang emit Intrinsic Functions like @llvm.exp.f64 for NVPTX.
    • That's because IsMathErrnoDefault() returns false for the Cuda ToolChain. This behaviour can be overridden using -fmath-errno (the test case nvptx_device_math_functions.c uses this flag; I'm not sure why?)
  • That at least looks to be producing correct IR in both cases which is then passed to the backend:
    1. For intrinsic functions (with some notable exceptions) the backend complains Cannot select: [...] ExternalSymbol'exp'.
      • Some exceptions are sqrt.f32, sqrt.f64, sin.f32 and cos.f32: The backend will directly lower them to the corresponding PTX instruction. Unfortunately there is none for exp...
    2. For "real" function calls (like call double @exp(double %3)) nvlink will throw Undefined reference errors.

This patch takes the following approach:

  1. Avoid intrinsics for math builtins by passing -fno-math-builtin for device compilation.
  2. Use the CUDA header to redirect math functions to their libdevice equivalents in the frontend, mostly just prefixed by __nv_ (for example exp(a) -> __nv_exp(a)).

The downside of this approach is that LLVM doesn't recognize these function calls and doesn't perform optimizations to fold libcalls. For example pow(a, 2) is transformed into a multiplication but __nv_pow(a, 2) is not.
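
To make the trade-off concrete, a small illustrative example (not taken from the patch's tests; the function name square is made up):

#include <math.h>

// With builtins enabled, an -O2 compile folds this into a single multiply
// (a * a). With -fno-math-builtin plus the header mapping pow -> __nv_pow,
// the IR instead keeps an opaque call to __nv_pow(a, 2.0), which LLVM does
// not recognize and therefore does not fold.
double square(double a) {
  return pow(a, 2);
}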

IMO this goes in the right direction; we should use the fast implementation in libdevice.

So yeah, my comment seems to be outdated if these simple optimizations don't happen anymore with this patch: I don't want to use a fast pow(a, 2), I don't want to call a library function for that at all.

We could of course make LLVM recognize the calls to libdevice and handle them the same way. But that's adding more workarounds to make this patch not regress on easy cases (in terms of transformations).
Another approach would be to make the NVPTX backend lower remaining calls of math functions to libdevice equivalents. I came across D34708 which seems to go in that direction (but it doesn't work out-of-the-box after fixing some build errors, complaining about undefined external symbols because libdevice is optimized away as it wasn't needed before)...

The downside of this approach is that LLVM doesn't recognize these function calls and doesn't perform optimizations to fold libcalls. For example pow(a, 2) is transformed into a multiplication but __nv_pow(a, 2) is not.

Doesn't CUDA have the same problem?

I don't want to use a fast pow(a, 2), I don't want to call a library function for that at all.

I do believe you won't end up calling a function. If you're compiling with optimizations on this will be inlined.

Thanks @Hahnfeld for your suggestions.

Unfortunately, to do the lowering in the backend one would need to replace the math function calls with calls to libdevice functions. I have not been able to do that in an elegant way. Encoding the interface to libdevice is just not a clean process, not to mention that any changes to libdevice would have to be tracked manually with every new CUDA version. It does not make the code more maintainable; on the contrary, I think it makes it harder to track libdevice changes.

On the same note, Clang CUDA doesn't do the pow(a,2) -> a*a optimization; I checked. It is something that needs to be fixed for Clang CUDA first before OpenMP can make use of it. The OpenMP NVPTX toolchain is designed to sit on top of the CUDA toolchain. It therefore inherits all of the Clang CUDA benefits and, in this particular case, its limitations.

As for the Sema check error you report (the one related to the 'x' constraint), I think the fix you proposed is good and should be pushed in a separate patch.

Just to address any generality concerns:

This patch fixes the problem of calling libdevice math functions for all platform combinations. It ensures that an OpenMP NVPTX target region will NOT call any host math functions (whichever host that may be) IF equivalent device functions are available.

I think there was some confusion regarding header file inclusion. This patch does not address any issues that might arise from the user including header files (be it math.h or some other header). Any failure related to header file inclusion (such as the reported 'x' constraint issue on x86) is unrelated to what this patch aims to do. Before the functionality in this patch can kick in, any user-included headers must successfully pass all checks in place for the NVPTX toolchain. A fix in the direction of the one proposed in one of the comments above is probably required. That fix would also need its own separate patch.

gtbercea updated this revision to Diff 160598.Aug 14 2018, 8:36 AM

Add __NO_MATH_INLINES macro for the NVPTX toolchain to prevent any host assembly from seeping onto the device.

I like the idea of using an automatic include as a cc1 option (-include). However, I would prefer a more general automatic include for OpenMP, not just for math functions (__clang_cuda_device_functions.h). Clang CUDA automatically includes __clang_cuda_runtime_wrapper.h, which in turn includes other files as needed, like __clang_cuda_device_functions.h. Let's hypothetically call my proposed automatic include for OpenMP __clang_openmp_runtime_wrapper.h.

Just because Clang CUDA defines functions in __clang_cuda_device_functions.h and automatically includes them does not make it right for OpenMP. In general, function definitions in headers should be avoided. The current function definitions in __clang_cuda_device_functions.h only work for hostile nv GPUs :). Here is how we can avoid function definitions in the headers: in a new OpenMP build process, we can build libm-nvptx.bc. This can be done by compiling __clang_cuda_device_functions.h as a device-only compile. Assuming current naming conventions, these files would be installed in the same directory as libomptarget.so (.../lib).

How do we tell clang cc1 to use this bc library? Use -mlink-builtin-bitcode. AddMathDeviceFunctions would then look something like this.

if (/* this is the device-side cc1 */) {
  if (getTriple().isNVPTX()) {
    CC1Args.push_back("-mlink-builtin-bitcode");
    CC1Args.push_back(DriverArgs.MakeArgString("libm-nvptx.bc"));
  }
  if (getTriple().getArch() == llvm::Triple::amdgcn) {
    CC1Args.push_back("-mlink-builtin-bitcode");
    CC1Args.push_back(DriverArgs.MakeArgString("libm-amdgcn.bc"));
  }
}

You can think of the libm-<arch>.bc file as the device-library equivalent of the host libm.so or libm.a. This concept of "host-consistent" library definitions can go beyond math libraries. In fact, I believe we should co-opt the -l (--library) option. The driver toolchain should look for device bc libraries for any -lX command line option. This gives us a strategy for adding user-defined device libraries.

The above code hints at the idea of architecture-specific bc files (nvptx vs amdgcn). The nvptx version would call into the CUDA libdevice. For Radeon processors, we may want processor-optimized versions of the libraries, just like there are sub-architecture-optimized versions of the CUDA libdevice. If we build --cuda-gpu-arch-optimized versions of the math bc libs, then the above code will get a bit more complex, depending on the naming convention of the bc lib and the value of --cuda-gpu-arch (which should have an alias --offload-arch).

Using a bc lib significantly reduces the complexity of __clang_openmp_runtime_wrapper.h. We do not need or see math device function definitions or the nv headers that they need. However, we still need to correct the behaviour of rogue system headers that define host-optimized functions. We can fix this by adding the following to __clang_openmp_runtime_wrapper.h so that host passes still get host-optimized functions.

#if defined(__AMDGCN__) || defined(__NVPTX__)
#define __NO_INLINE__ 1
#endif

There is a tradeoff to using pre-compiled bc libs. It makes compile-time macro logic hard to implement. For example, we can't do this:

#if defined(__CLANG_CUDA_APPROX_TRANSCENDENTALS__)
#define FAST_OR_SLOW(fast, slow) fast
#else
#define FAST_OR_SLOW(fast, slow) slow
#endif

The OpenMP build process would either need to build alternative bc libraries for each option, or a supplemental bc library to address these types of options.
If some option is turned on, then an alternative lib or a particular ordering of libs would be used to build the clang cc1 command.
For example, the above code for AddMathDeviceFunctions would have this:

...
if (getTriple().isNVPTX()) {
  if (LangOpts.CUDADeviceApproxTranscendentals || LangOpts.FastMath) {
    CC1Args.push_back("-mlink-builtin-bitcode");
    CC1Args.push_back(DriverArgs.MakeArgString("libm-fast-nvptx.bc"));
  }
  CC1Args.push_back("-mlink-builtin-bitcode");
  CC1Args.push_back(DriverArgs.MakeArgString("libm-nvptx.bc"));
}

I personally believe that pre-built bc libraries with some consistency with their host-equivalent libraries are a saner approach for device libraries than complex header logic that is customized for each architecture.

tra added a comment.Aug 22 2018, 3:16 PM

__clang_cuda_device_functions.h is not intended to be a device-side math.h, despite having a lot of overlap/similarities. It may change any time we pick up a new CUDA version.
I would suggest writing an OpenMP-specific replacement for math.h which would map to whatever device-specific function OpenMP needs. For NVPTX that may be libdevice, for which you have declarations in __clang_cuda_libdevice_declares.h. Using part of __clang_cuda_device_functions.h may be a decent starting point for NVPTX, but OpenMP will likely need to provide an equivalent for other back-ends, too.

lib/Basic/Targets/NVPTX.cpp
232 ↗(On Diff #160598)

This relies on an implementation detail of a particular variant of the header file you're assuming all compilations will include. This is, at best, a workaround for the real problem (attempting to use headers from machine X while targeting Y).

D50845 is dealing with the issue of headers for target code. Hopefully, they'll find a way to provide device-specific headers, so you don't rely on host headers being parseable during device-side compilation.

lib/Driver/ToolChains/Clang.cpp
4758

Could you elaborate on why you don't want the builtins?
Builtins are enabled and are useful for CUDA. What makes their use different for OpenMP?
Are you doing it to guarantee that math functions remain unresolved in IR so you could link them in from external bitcode?

gtbercea added inline comments.Aug 23 2018, 8:16 AM
lib/Basic/Targets/NVPTX.cpp
232 ↗(On Diff #160598)

I agree. The proper fix would be what the other patch is attempting to do.

lib/Driver/ToolChains/Clang.cpp
4758

That's right. I don't particularly like this approach as it leads to the OpenMP NVPTX toolchain missing out on optimizations such as replacing a math function call with basic operations (pow(a,2) -> a*a, for example).
I am trying to fix this in a future patch by allowing intrinsics/builtins to propagate.

Herald added a project: Restricted Project. · View Herald TranscriptFeb 14 2019, 12:30 PM

We need to make progress on this, and I'd like to suggest a path forward...

First, we have a fundamental problem here: Using host headers to declare functions for the device execution environment isn't sound. Those host headers can do anything, and while some platforms might provide a way to make the host headers more friendly (e.g., by defining __NO_MATH_INLINES), these mechanisms are neither robust nor portable. Thus, we should not rely on host headers to define functions that might be available on the device. However, even when compiling for the device, code meant only for host execution must be semantically analyzable. This, in general, requires the host headers. So we have a situation in which we must both use the host headers during device compilation (to keep the semantic analysis of the surrounding host code working) and also can't use the host headers to provide definitions for use by device code (e.g., because those host headers might provide definitions relying on host inline asm or intrinsics, use types not lowerable in device code, provide declarations with linkage-affecting attributes not lowerable for the device, etc.).

This is, or is very similar to, the problem that the host/device overloading addresses in CUDA. It is also the problem, or very similar to the problem, that the new OpenMP 5 declare variant directive is intended to address. Johannes and I discussed this earlier today, and I suggest that we:

  1. Add a math.h wrapper to clang/lib/Headers, which generally just does an include_next of math.h, but provides us with the ability to customize this behavior. Writing a header for OpenMP on NVIDIA GPUs which is essentially identical to the math.h functions in __clang_cuda_device_functions.h would be unfortunate, and as CUDA does provide the underlying execution environment for OpenMP target offload on NVIDIA GPUs, duplicative even in principle. We don't need to alter the default global namespace, however, but can include this file from the wrapper math.h.
  2. We should allow host/device overloading in OpenMP mode. As an extension, we could directly reuse the CUDA host/device overloading capability - this also has the advantage of allowing us to directly reuse __clang_cuda_device_functions.h (and perhaps do a similar thing to pick up the device-side printf, etc. from __clang_cuda_runtime_wrapper.h). In the future, we can extend these to provide overloading using OpenMP declare variant, if desired, when in OpenMP mode.

Thoughts?

jprice added a subscriber: jprice.Mar 20 2019, 2:56 AM
tra added a comment.Mar 20 2019, 10:06 AM

This is, or is very similar to, the problem that the host/device overloading addresses in CUDA.

IIRC the difference was that OpenMP didn't have an explicit notion of host/device functions, which made it hard to apply host/device overloading in practice.

It is also the problem, or very similar to the problem, that the new OpenMP 5 declare variant directive is intended to address. Johannes and I discussed this earlier today, and I suggest that we:

Interesting. declare variant sounds (according to the OpenMP TR7 document) like __device__ on steroids. That may indeed make things work. Actually, I would like __device__ to eventually work like declare variant, so we can have multiple __device__ overloads specialized for a particular GPU architecture without relying on the preprocessor's __CUDA_ARCH__.

  1. Add a math.h wrapper to clang/lib/Headers, which generally just does an include_next of math.h, but provides us with the ability to customize this behavior. Writing a header for OpenMP on NVIDIA GPUs which is essentially identical to the math.h functions in __clang_cuda_device_functions.h would be unfortunate, and as CUDA does provide the underlying execution environment for OpenMP target offload on NVIDIA GPUs, duplicative even in principle. We don't need to alter the default global namespace, however, but can include this file from the wrapper math.h.

Using __clang_cuda_device_functions.h in addition to the math.h wrapper should be fine. It gives us a path to provide a device-side standard math library implementation, and the math.h wrapper provides a convenient point to hook in the implementation for platforms other than CUDA.

  1. We should allow host/device overloading in OpenMP mode. As an extension, we could directly reuse the CUDA host/device overloading capability - this also has the advantage of allowing us to directly reuse __clang_cuda_device_functions.h (and perhaps do a similar thing to pick up the device-side printf, etc. from __clang_cuda_runtime_wrapper.h). In the future, we can extend these to provide overloading using OpenMP declare variant, if desired, when in OpenMP mode.

Is OpenMP still essentially C-based? Host/device overloading relies on C++ machinery. I think it should work with __attribute__((overloadable)) but it's not been tested.

We may need to restructure bits and pieces of CUDA-related headers to make them reusable by OpenMP. I guess that with declare variant we may be able to reuse most of the headers as is by treating __device__ as if the function was a variant for NVPTX back-end.

Thoughts?

SGTM. Let me know if something in the CUDA-related headers gets in the way.

Thank you both for the feedback.

It's good to see that there's interest in moving this forward. I will try to refactor this patch according to Hal's suggestions and see if there are any blockers.

Thanks!

We need to make progress on this, and I'd like to suggest a path forward...

First, we have a fundamental problem here: Using host headers to declare functions for the device execution environment isn't sound. Those host headers can do anything, and while some platforms might provide a way to make the host headers more friendly (e.g., by defining __NO_MATH_INLINES), these mechanisms are neither robust nor portable. Thus, we should not rely on host headers to define functions that might be available on the device. However, even when compiling for the device, code meant only for host execution must be semantically analyzable. This, in general, requires the host headers. So we have a situation in which we must both use the host headers during device compilation (to keep the semantic analysis of the surrounding host code working) and also can't use the host headers to provide definitions for use by device code (e.g., because those host headers might provide definitions relying on host inline asm or intrinsics, use types not lowerable in device code, provide declarations with linkage-affecting attributes not lowerable for the device, etc.).

This is, or is very similar to, the problem that the host/device overloading addresses in CUDA. It is also the problem, or very similar to the problem, that the new OpenMP 5 declare variant directive is intended to address. Johannes and I discussed this earlier today, and I suggest that we:

  1. Add a math.h wrapper to clang/lib/Headers, which generally just does an include_next of math.h, but provides us with the ability to customize this behavior. Writing a header for OpenMP on NVIDIA GPUs which is essentially identical to the math.h functions in __clang_cuda_device_functions.h would be unfortunate, and as CUDA does provide the underlying execution environment for OpenMP target offload on NVIDIA GPUs, duplicative even in principle. We don't need to alter the default global namespace, however, but can include this file from the wrapper math.h.

I imagine this to look something along the lines of:

// File: clang/lib/Headers/math.h

#ifdef __CUDA__
  #include "CUDA_INCLUDE_DIR/cuda_math.h"
#elif defined(...)
  ...
#endif

#include_next "math.h"

So: a clang-internal math.h wrapper which, depending on the target, includes all "math.h" headers in the right order.
The overload resolution should pick the right version even if multiple versions are declared.

gtbercea abandoned this revision.May 15 2019, 12:54 PM

Replaced by: D61399