[CUDA/OpenMP] Define only some host macros during device compilation
Closed, Public

Authored by Hahnfeld on Aug 16 2018, 7:56 AM.

Details

Summary

When compiling CUDA or OpenMP device code, Clang parses header files
that expect certain predefined macros from the host architecture. To
make this work, the compiler passes the host triple via the -aux-triple
argument and (until now) pulls in all macros for that "auxiliary triple"
unconditionally.

However, this results in defines like __SSE_MATH__ that trigger
inline assembly making use of the "advertised" target features. See
the discussion of D47849 and PR38464 for a detailed explanation of
the problems encountered.

Instead of blacklisting "known bad" examples, this patch starts adding
defines that are needed for certain headers like bits/wordsize.h and
bits/mathinline.h.
The disadvantage of this approach is that it decouples the definitions
from their target toolchain. However, in my opinion it's more important
to keep the definitions for one header close together. For one, this
includes clear documentation of why these particular defines are needed.
Furthermore, it simplifies maintenance because adding defines for a new
header, or support for a new aux-triple, only needs to touch one piece
of code.

Diff Detail

Repository
rL LLVM
Hahnfeld created this revision. Aug 16 2018, 7:56 AM
Hahnfeld edited the summary of this revision. Aug 16 2018, 7:56 AM
Hahnfeld added inline comments.
test/SemaCUDA/builtins.cu
15–17 ↗(On Diff #161032)

@tra I'm not sure here: Do we want __PTX__ to be defined during host compilation? I can't think of a valid use case, but you have more experience with user code.

Maybe for device compilation we also should define __NO_MATH_INLINES and __NO_STRING_INLINES macros to disable inline assembly in glibc?

Maybe for device compilation we also should define __NO_MATH_INLINES and __NO_STRING_INLINES macros to disable inline assembly in glibc?

The problem is that __NO_MATH_INLINES doesn't even avoid all inline assembly from bits/mathinline.h :-( Incidentally, Clang already defines __NO_MATH_INLINES for x86 (due to an old bug which was fixed long ago), and on CentOS we still have problems as described in PR38464.

As a second thought: This might be valid for NVPTX, but I don't think it's a good idea for x86-like offloading targets - they might well profit from inline assembly code.

Maybe for device compilation we also should define __NO_MATH_INLINES and __NO_STRING_INLINES macros to disable inline assembly in glibc?

The problem is that __NO_MATH_INLINES doesn't even avoid all inline assembly from bits/mathinline.h :-( Incidentally, Clang already defines __NO_MATH_INLINES for x86 (due to an old bug which was fixed long ago), and on CentOS we still have problems as described in PR38464.

As a second thought: This might be valid for NVPTX, but I don't think it's a good idea for x86-like offloading targets - they might well profit from inline assembly code.

I'm not saying that we should define those macros for all targets, only for NVPTX. But still, it may disable some inline assembly for other architectures.

tra added a subscriber: pcc. Aug 16 2018, 11:12 AM

Maybe for device compilation we also should define __NO_MATH_INLINES and __NO_STRING_INLINES macros to disable inline assembly in glibc?

The problem is that __NO_MATH_INLINES doesn't even avoid all inline assembly from bits/mathinline.h :-( Incidentally, Clang already defines __NO_MATH_INLINES for x86 (due to an old bug which was fixed long ago), and on CentOS we still have problems as described in PR38464.

As a second thought: This might be valid for NVPTX, but I don't think it's a good idea for x86-like offloading targets - they might well profit from inline assembly code.

I'm not saying that we should define those macros for all targets, only for NVPTX. But still, it may disable some inline assembly for other architectures.

IMO, trying to avoid inline assembly by defining (or not) some macros and hoping for the best is rather fragile, as we'll have to chase *all* patches that the host's math.h may have on any given system.

If I understand it correctly, the root cause of this exercise is that we want to compile for the GPU using plain C. CUDA avoids this issue by separating device and host code via target attributes, and clang has a few special cases to ignore inline assembly errors in the host code if we're compiling for the device. For OpenMP there's no such separation, at least not in the system headers.

Perhaps we can just add another special case for inline assembly & OpenMP. If there's an error in inline assembly during device compilation and we see that the function came from the system headers, then ignore the error, poison the function, etc. That said, I don't know enough about OpenMP to tell whether that's feasible or whether that's sufficient.

Another option would be to implement some sort of attribute-based overloading. Then OpenMP can provide its own version of the device-side library function without clashing with system headers.

<start of general handwaving>
On a side note, I did spend about a year and got 3 almost-but-not-quite-working 'solutions' to exactly this problem during the early days of adding CUDA support to clang. I'm very thoroughly convinced that verbatim use of headers from platform A to make them work on platform B is not feasible unless you have control of both sets of headers. Considering that system headers are *not* under our control, we do need to have a way for them to coexist without clashing. Preprocessor magic may work in limited circumstances (e.g. we only need to deal with two variants of headers that never change), but the cases where that approach falls apart are rather easy to come by. Clang's __clang_cuda_runtime_wrapper.h is a horrible example of that -- it sort of works, but with every CUDA release I cross my fingers and hope that they didn't decide to change *anything* in their headers.

test/SemaCUDA/builtins.cu
15–17 ↗(On Diff #161032)

I'm not sure what the reason was for adding this macro. @pcc did it a long time ago in rL157173. Perhaps he has a better idea about the purpose.

AFAICT, it's not used by the CUDA headers, nor can I find any uses in any of the CUDA sources we have (excluding clang's own tests). The only use case I see is in cl_kernel.h in Apple's Xcode SDK.
System/Library/Frameworks/OpenCL.framework/Versions/A/lib/clang/2.0/include/cl_kernel.h
It's used there to #define a lot of convert_FOO to __builtin_convert(FOO...).

Based on that single use case, not defining __PTX__ for the host compilation should probably be OK.

In D50845#1202838, @tra wrote:

Maybe for device compilation we also should define __NO_MATH_INLINES and __NO_STRING_INLINES macros to disable inline assembly in glibc?

The problem is that __NO_MATH_INLINES doesn't even avoid all inline assembly from bits/mathinline.h :-( Incidentally, Clang already defines __NO_MATH_INLINES for x86 (due to an old bug which was fixed long ago), and on CentOS we still have problems as described in PR38464.

As a second thought: This might be valid for NVPTX, but I don't think it's a good idea for x86-like offloading targets - they might well profit from inline assembly code.

I'm not saying that we should define those macros for all targets, only for NVPTX. But still, it may disable some inline assembly for other architectures.

IMO, trying to avoid inline assembly by defining (or not) some macros and hoping for the best is rather fragile, as we'll have to chase *all* patches that the host's math.h may have on any given system.

Completely agree here: This patch tries to pick the low-hanging fruit that happens to fix #include <math.h> on most systems (and addresses a long-standing FIXME in the code). I know there are more headers that define inline assembly unconditionally and need more advanced fixes (see below).

If I understand it correctly, the root cause of this exercise is that we want to compile for the GPU using plain C. CUDA avoids this issue by separating device and host code via target attributes, and clang has a few special cases to ignore inline assembly errors in the host code if we're compiling for the device. For OpenMP there's no such separation, at least not in the system headers.

Yes, that's one of the nice properties of CUDA (for the compiler). There used to be the same restriction for OpenMP, where all functions used in target regions needed to be put in declare target. However, that was relaxed in favor of implicitly marking all called functions in that TU declare target.
So ideally I think Clang should determine which functions are really declare target (either explicitly or implicitly) and only run semantic analysis on them. If a function is then found to be "broken", it's perfectly desirable to error back to the user.

Another option would be to implement some sort of attribute-based overloading. Then OpenMP can provide its own version of the device-side library function without clashing with system headers.

I'm thinking about what the desired behavior is here. So, if we have a situation where the target is the host, then we really only have one set of headers and we want everything to work as it does today. math.h should be math.h, with the same relevant preprocessor context. When the host and target differ, then we have a restricted set of external/system device functions available on the device (which may or may not have anything to do with the set of system functions provided by the host's system headers). As a result, we should really have a separate header that has those actually-available functions. When targeting NVPTX, why don't we have the included math.h be CUDA's math.h? In the end, those are the functions we need to call when we generate code. Right?

As a result, we should really have a separate header that has those actually-available functions. When targeting NVPTX, why don't we have the included math.h be CUDA's math.h? In the end, those are the functions we need to call when we generate code. Right?

That's what D47849 deals with.

If I understand it correctly, the root cause of this exercise is that we want to compile for the GPU using plain C. CUDA avoids this issue by separating device and host code via target attributes, and clang has a few special cases to ignore inline assembly errors in the host code if we're compiling for the device. For OpenMP there's no such separation, at least not in the system headers.

Yes, that's one of the nice properties of CUDA (for the compiler). There used to be the same restriction for OpenMP, where all functions used in target regions needed to be put in declare target. However, that was relaxed in favor of implicitly marking all called functions in that TU declare target.
So ideally I think Clang should determine which functions are really declare target (either explicitly or implicitly) and only run semantic analysis on them. If a function is then found to be "broken", it's perfectly desirable to error back to the user.

It is not possible for OpenMP because we support implicit declare target functions. Clang cannot identify whether the function is going to be used on the device or not during sema analysis.

As a result, we should really have a separate header that has those actually-available functions. When targeting NVPTX, why don't we have the included math.h be CUDA's math.h? In the end, those are the functions we need to call when we generate code. Right?

That's what D47849 deals with.

Yes, but it doesn't get CUDA's math.h. Maybe I misunderstand how this works (and I very well might, because it's not clear that CUDA has a math.h by that name), but that patch tries to avoid problems with the host's math.h and then also injects __clang_cuda_device_functions.h into the device compilation. How does this compare to when you include math.h in Clang's CUDA mode? It seems to be that we want to somehow map standard includes, where applicable, to include files in CUDA's include/crt directory (e.g., crt/math_functions.h and crt/common_functions.h for stdio.h for printf), and nothing else ends up being available (because it is, in fact, not available).

If I understand it correctly, the root cause of this exercise is that we want to compile for the GPU using plain C. CUDA avoids this issue by separating device and host code via target attributes, and clang has a few special cases to ignore inline assembly errors in the host code if we're compiling for the device. For OpenMP there's no such separation, at least not in the system headers.

Yes, that's one of the nice properties of CUDA (for the compiler). There used to be the same restriction for OpenMP, where all functions used in target regions needed to be put in declare target. However, that was relaxed in favor of implicitly marking all called functions in that TU declare target.
So ideally I think Clang should determine which functions are really declare target (either explicitly or implicitly) and only run semantic analysis on them. If a function is then found to be "broken", it's perfectly desirable to error back to the user.

It is not possible for OpenMP because we support implicit declare target functions. Clang cannot identify whether the function is going to be used on the device or not during sema analysis.

Sounds like that is a recipe for just disabling sema analysis for all implicit declare target functions.

As a result, we should really have a separate header that has those actually-available functions. When targeting NVPTX, why don't we have the included math.h be CUDA's math.h? In the end, those are the functions we need to call when we generate code. Right?

That's what D47849 deals with.

Yes, but it doesn't get CUDA's math.h. Maybe I misunderstand how this works (and I very well might, because it's not clear that CUDA has a math.h by that name), but that patch tries to avoid problems with the host's math.h and then also injects __clang_cuda_device_functions.h into the device compilation. How does this compare to when you include math.h in Clang's CUDA mode? It seems to be that we want to somehow map standard includes, where applicable, to include files in CUDA's include/crt directory (e.g., crt/math_functions.h and crt/common_functions.h for stdio.h for printf), and nothing else ends up being available (because it is, in fact, not available).

There's no CUDA-specific math.h, unless you want to regard __clang_cuda_device_functions.h as a math header. The patch is using the same approach as CUDA and redirecting the function calls to device-specific function calls. The parts of that patch which deal with host header compatibility would more naturally belong in a patch like this one, so ultimately they won't be part of that patch. I'm currently working on improving the patch, though, by eliminating the __clang_cuda_device_functions.h injection and eliminating the need to disable the built-ins.

tra added a comment. Aug 16 2018, 1:57 PM

As a result, we should really have a separate header that has those actually-available functions. When targeting NVPTX, why don't we have the included math.h be CUDA's math.h? In the end, those are the functions we need to call when we generate code. Right?

That's what D47849 deals with.

Yes, but it doesn't get CUDA's math.h. Maybe I misunderstand how this works (and I very well might, because it's not clear that CUDA has a math.h by that name), but that patch tries to avoid problems with the host's math.h and then also injects __clang_cuda_device_functions.h into the device compilation. How does this compare to when you include math.h in Clang's CUDA mode? It seems to be that we want to somehow map standard includes, where applicable, to include files in CUDA's include/crt directory (e.g., crt/math_functions.h and crt/common_functions.h for stdio.h for printf), and nothing else ends up being available (because it is, in fact, not available).

There's no CUDA-specific math.h, unless you want to regard __clang_cuda_device_functions.h as a math header.

True. We rely on CUDA SDK which defines a subset of standard libc/libm functions with __device__ attribute.

__clang_cuda_device_functions.h just provides a set of substitutes that became nvcc's builtins and are no longer implemented in CUDA headers.
It's not supposed to replace math.h and may change with next version of CUDA which may need to cope with some other quirk of CUDA's headers.

The patch is using the same approach as CUDA and redirecting the function calls to device-specific function calls. The parts of that patch which deal with host header compatibility would more naturally belong in a patch like this one, so ultimately they won't be part of that patch. I'm currently working on improving the patch, though, by eliminating the __clang_cuda_device_functions.h injection and eliminating the need to disable the built-ins.

This sounds great. When you do have a device-side implementation of the math library, it would probably be worth considering making CUDA use it, instead of the current hacks to adapt to CUDA's headers. This would simplify things a bit and give us much better control over the implementation.

So ideally I think Clang should determine which functions are really declare target (either explicitly or implicitly) and only run semantic analysis on them. If a function is then found to be "broken", it's perfectly desirable to error back to the user.

It is not possible for OpenMP because we support implicit declare target functions. Clang cannot identify whether the function is going to be used on the device or not during sema analysis.

You are right, we can't do this during device compilation because we don't have an AST before Sema.

However I'm currently thinking about the following:

  1. Identify explicit and implicit declare target functions during host Sema and CodeGen.
  2. Attach meta-data for all of them to LLVM IR .bc which is passed via -fopenmp-host-ir-file-path. I think we already do something similar for outlined target regions?
  3. During device Sema query that meta-data so Clang knows when a function will be called from within a target region. Skip analysis of functions that are not needed for the device, just as CUDA does.
  4. Check that we don't need functions that weren't marked in 2. That's to catch users doing something like:
#pragma omp target
{
#ifdef __NVPTX__
  target_func();
#endif
}

For now that's just an idea, I didn't start implementing any of this yet. Do you think that could work?

So ideally I think Clang should determine which functions are really declare target (either explicitly or implicitly) and only run semantic analysis on them. If a function is then found to be "broken", it's perfectly desirable to error back to the user.

It is not possible for OpenMP because we support implicit declare target functions. Clang cannot identify whether the function is going to be used on the device or not during sema analysis.

You are right, we can't do this during device compilation because we don't have an AST before Sema.

However I'm currently thinking about the following:

  1. Identify explicit and implicit declare target functions during host Sema and CodeGen.
  2. Attach meta-data for all of them to LLVM IR .bc which is passed via -fopenmp-host-ir-file-path. I think we already do something similar for outlined target regions?
  3. During device Sema query that meta-data so Clang knows when a function will be called from within a target region. Skip analysis of functions that are not needed for the device, just as CUDA does.
  4. Check that we don't need functions that weren't marked in 2. That's to catch users doing something like:
#pragma omp target
{
#ifdef __NVPTX__
  target_func();
#endif
}

    For now that's just an idea, I didn't start implementing any of this yet. Do you think that could work?

I thought about this approach already, but it won't work in general. The main problem here is that the host and device compilation phases may end up with different sets of implicit declare target functions. The issue is not so much the user code as the system libraries, which may use different sets of functions.
Another problem is that the user may use a function that has some host assembly inside. In this case we still need to emit an error message; otherwise, we may end up with a compiler crash.

The best solution is to use only device specific header files. Device compilation phase should use system header files for the host at all.

So ideally I think Clang should determine which functions are really declare target (either explicitly or implicitly) and only run semantic analysis on them. If a function is then found to be "broken", it's perfectly desirable to error back to the user.

It is not possible for OpenMP because we support implicit declare target functions. Clang cannot identify whether the function is going to be used on the device or not during sema analysis.

You are right, we can't do this during device compilation because we don't have an AST before Sema.

However I'm currently thinking about the following:

  1. Identify explicit and implicit declare target functions during host Sema and CodeGen.
  2. Attach meta-data for all of them to LLVM IR .bc which is passed via -fopenmp-host-ir-file-path. I think we already do something similar for outlined target regions?
  3. During device Sema query that meta-data so Clang knows when a function will be called from within a target region. Skip analysis of functions that are not needed for the device, just as CUDA does.
  4. Check that we don't need functions that weren't marked in 2. That's to catch users doing something like:
#pragma omp target
{
#ifdef __NVPTX__
  target_func();
#endif
}

    For now that's just an idea, I didn't start implementing any of this yet. Do you think that could work?

I thought about this approach already, but it won't work in general. The main problem here is that the host and device compilation phases may end up with different sets of implicit declare target functions. The issue is not so much the user code as the system libraries, which may use different sets of functions.

How common is that for functions that are used in target regions? In the worst case we can make my fourth point a warning and lose Sema checking for those functions.

Another problem is that the user may use a function that has some host assembly inside. In this case we still need to emit an error message; otherwise, we may end up with a compiler crash.

Once we know which functions are used, they can be checked as usual.

The best solution is to use only device specific header files. Device compilation phase should use system header files for the host at all.

You mean "shouldn't use system header files for the host"? I think that may be hard to achieve, especially if we want to Sema check all of the source code during device compilation.

So ideally I think Clang should determine which functions are really declare target (either explicitly or implicitly) and only run semantic analysis on them. If a function is then found to be "broken", it's perfectly desirable to error back to the user.

It is not possible for OpenMP because we support implicit declare target functions. Clang cannot identify whether the function is going to be used on the device or not during sema analysis.

You are right, we can't do this during device compilation because we don't have an AST before Sema.

However I'm currently thinking about the following:

  1. Identify explicit and implicit declare target functions during host Sema and CodeGen.
  2. Attach meta-data for all of them to LLVM IR .bc which is passed via -fopenmp-host-ir-file-path. I think we already do something similar for outlined target regions?
  3. During device Sema query that meta-data so Clang knows when a function will be called from within a target region. Skip analysis of functions that are not needed for the device, just as CUDA does.
  4. Check that we don't need functions that weren't marked in 2. That's to catch users doing something like:
#pragma omp target
{
#ifdef __NVPTX__
  target_func();
#endif
}

    For now that's just an idea, I didn't start implementing any of this yet. Do you think that could work?

I thought about this approach already, but it won't work in general. The main problem here is that the host and device compilation phases may end up with different sets of implicit declare target functions. The issue is not so much the user code as the system libraries, which may use different sets of functions.

How common is that for functions that are used in target regions? In the worst case we can make my fourth point a warning and lose Sema checking for those functions.

It does not matter how common it is. If the bad situation can happen, it will happen.
A warning won't work here because, again, you may end up with code that causes a compiler crash for the device. For example, if a system function uses throw/catch statements, we may emit a warning for this function but still have trouble during codegen.

Another problem is that the user may use a function that has some host assembly inside. In this case we still need to emit an error message; otherwise, we may end up with a compiler crash.

Once we know which functions are used, they can be checked as usual.

The best solution is to use only device specific header files. Device compilation phase should use system header files for the host at all.

You mean "shouldn't use system header files for the host"? I think that may be hard to achieve, especially if we want to Sema check all of the source code during device compilation.

Yes, I mean should not. Yes, this is hard to achieve but that's the only complete and correct solution. Everything else looks like a non-stable hack.

I thought about this approach already, but it won't work in general. The main problem here is that the host and device compilation phases may end up with different sets of implicit declare target functions. The issue is not so much the user code as the system libraries, which may use different sets of functions.

How common is that for functions that are used in target regions? In the worst case we can make my fourth point a warning and lose Sema checking for those functions.

It does not matter how common it is. If the bad situation can happen, it will happen.
A warning won't work here because, again, you may end up with code that causes a compiler crash for the device. For example, if a system function uses throw/catch statements, we may emit a warning for this function but still have trouble during codegen.

Right, a warning wasn't a good thought. We really want strict checking and would have to error out when we find a function that wasn't implicitly declare target on the host.
I meant to ask how common that would be. If it's only some known functions, we could handle them separately.

The best solution is to use only device specific header files. Device compilation phase should use system header files for the host at all.

You mean "shouldn't use system header files for the host"? I think that may be hard to achieve, especially if we want to Sema check all of the source code during device compilation.

Yes, I mean should not. Yes, this is hard to achieve but that's the only complete and correct solution. Everything else looks like a non-stable hack.

How do you propose to handle inline assembly in non-system header files?

Right, a warning wasn't a good thought. We really want strict checking and would have to error out when we find a function that wasn't implicitly declare target on the host.
I meant to ask how common that would be. If it's only some known functions, we could handle them separately.

Again, it does not matter how common this situation is. We cannot rely on probability here; we need to make the compiler work correctly in all possible situations, no matter how often they occur.

The best solution is to use only device specific header files. Device compilation phase should use system header files for the host at all.

You mean "shouldn't use system header files for the host"? I think that may be hard to achieve, especially if we want to Sema check all of the source code during device compilation.

Yes, I mean should not. Yes, this is hard to achieve but that's the only complete and correct solution. Everything else looks like a non-stable hack.

How do you propose to handle inline assembly in non-system header files?

Just as usual: if the assembly is supported by the device, it's OK; otherwise, an error message.

Right, a warning wasn't a good thought. We really want strict checking and would have to error out when we find a function that wasn't implicitly declare target on the host.
I meant to ask how common that would be. If it's only some known functions, we could handle them separately.

Again, it does not matter how common this situation is. We cannot rely on probability here; we need to make the compiler work correctly in all possible situations, no matter how often they occur.

Got that, I agree on the conservative approach: If we find a function to be called that wasn't checked (because it wasn't implicitly declare target on the host) the compiler can error out. That should be correct in all cases, shouldn't it?

There's a trade-off here:

  • How many TUs pass full analysis and how many don't? (today's situation; we know that some headers don't work)
  • How many TUs pass when we only check called functions (and error if we call non-checked ones) and how many regress compared to today's situation?

If the number of regressions is zero for all practical situations but we can compile some important cases, that should be a win.

The best solution is to use only device specific header files. Device compilation phase should use system header files for the host at all.

You mean "shouldn't use system header files for the host"? I think that may be hard to achieve, especially if we want to Sema check all of the source code during device compilation.

Yes, I mean should not. Yes, this is hard to achieve but that's the only complete and correct solution. Everything else looks like a non-stable hack.

How do you propose to handle inline assembly in non-system header files?

Just as usual: if the assembly is supported by the device, it's OK; otherwise, an error message.

Even if the function is never called? That would mean you can't include any Eigen header...

Maybe for device compilation we also should define __NO_MATH_INLINES and __NO_STRING_INLINES macros to disable inline assembly in glibc?

Coming back to this original question:

  • I just searched the headers on CentOS and Arch Linux, and all cases considering these macros are guarded by #ifndef __x86_64__, which this patch still propagates for device compilation.
  • From the CentOS package for PPC64LE it looks like the only affected case is in bits/fenvinline.h, which defines the macros fegetround(), feraiseexcept(__excepts), and feclearexcept(__excepts). All matches in bits/mathinline.h are guarded by #ifndef __powerpc64__ or don't use inline assembly, which should be fine.

As I'm not primarily developing on Power (and can't test such a change), I'd ask you to create a patch adding these macros.

Right, a warning wasn't a good thought. We really want strict checking and would have to error out when we find a function that wasn't implicitly declare target on the host.
I meant to ask how common that would be. If it's only some known functions, we could handle them separately.

Again, it does not matter how common this situation is. We cannot rely on probability here; we need to make the compiler work correctly in all possible situations, no matter how often they occur.

Got that, I agree on the conservative approach: If we find a function to be called that wasn't checked (because it wasn't implicitly declare target on the host) the compiler can error out. That should be correct in all cases, shouldn't it?

There's a trade-off here:

  • How many TUs pass full analysis and how many don't? (today's situation; we know that some headers don't work)
  • How many TUs pass when we only check called functions (and error if we call non-checked ones) and how many regress compared to today's situation? If the number of regressions is zero for all practical situations but we can compile some important cases, that should be a win.

I need to think about it. We need to estimate all pros and cons here. It might work.

The best solution is to use only device specific header files. Device compilation phase should use system header files for the host at all.

You mean "shouldn't use system header files for the host"? I think that may be hard to achieve, especially if we want to Sema check all of the source code during device compilation.

Yes, I mean should not. Yes, this is hard to achieve but that's the only complete and correct solution. Everything else looks like a non-stable hack.

How do you propose to handle inline assembly in non-system header files?

Just like as usual - if the assembler is supported by the device - it is ok, otherwise - error message.

Even if the function is never called? That would mean you can't include any Eigen header...

Yes, that's the problem.

Got that, I agree on the conservative approach: If we find a function to be called that wasn't checked (because it wasn't implicitly declare target on the host) the compiler can error out. That should be correct in all cases, shouldn't it?

There's a trade-off here:

  • How many TUs pass full analysis and how many don't? (today's situation; we know that some headers don't work)
  • How many TUs pass when we only check called functions (and error if we call non-checked ones) and how many regress compared to today's situation? If the number of regressions is zero for all practical situations but we can compile some important cases, that should be a win.

I need to think about it. We need to estimate all pros and cons here. It might work.

I'll try to put together a prototype so that we can actually test.

The discussion has kind of moved away from the original patch, probably because the problem is larger than the definition of some host macros. However, I still think that this patch improves the situation.

gregrodgers added a comment.EditedAug 23 2018, 1:37 PM

I have a longer comment on header files, but let me first understand this patch.

IIUC, the concept of this patch is to fake the macros so that the device pass thinks it is seeing a host.

if ((LangOpts.CUDA || LangOpts.OpenMPIsDevice) && PP.getAuxTargetInfo())

InitializePredefinedAuxMacros(*PP.getAuxTargetInfo(), Builder);

That would be counterproductive because well-behaved headers that only
provide optimized asm definitions would wrap that asm with

#ifdef __x86_64__

do some x86 asm function definition;

#else

just provide the function declaration;

#endif

What am I missing?

What am I missing?

As discussed above, this patch doesn't fix this problem. However, we still need __x86_64__ because bits/wordsize.h uses it to determine whether we are 64- or 32-bit.

Hahnfeld planned changes to this revision.Aug 23 2018, 1:47 PM

This patch breaks C++ and CUDA compilation at the moment, sorry. I need to find and add more macros that turn out to be needed.

Hahnfeld updated this revision to Diff 162328.Aug 24 2018, 12:33 AM

Add required macros for compiling C++ code.

tra accepted this revision.Aug 24 2018, 10:32 AM

Please keep an eye on CUDA buildbot http://lab.llvm.org:8011/builders/clang-cuda-build.
It runs a fair amount of tests with libc++ and a handful of libstdc++ versions and may serve as a canary if these changes break something.

This revision is now accepted and ready to land.Aug 24 2018, 10:32 AM
Hahnfeld updated this revision to Diff 162543.EditedAug 25 2018, 6:32 AM

Based on libc++ I guessed some more macros that may be needed on macOS and Windows. As I can't test these myself, it would be great if somebody else could report whether this change regresses CUDA support on those platforms.

In D50845#1212643, @tra wrote:

Please keep an eye on CUDA buildbot http://lab.llvm.org:8011/builders/clang-cuda-build.
It runs a fair amount of tests with libc++ and a handful of libstdc++ versions and may serve as a canary if these changes break something.

I just tested locally and std::remainder fails with CUDA 8.0.44 when compiling for C++11 or later - both with and without this patch. My guess is that this version has a bug, because all tests pass with CUDA 9.2.88.

I'll land this change now and watch the buildbot for any problems, thanks.

This revision was automatically updated to reflect the committed changes.
tra added a comment.Aug 30 2018, 11:20 AM

FYI. This breaks our CUDA compilation. I haven't figured out what exactly is wrong yet. I may need to unroll the patch if the fix is not obvious.

In D50845#1219709, @tra wrote:

FYI. This breaks our CUDA compilation. I haven't figured out what exactly is wrong yet. I may need to unroll the patch if the fix is not obvious.

+1. I think this patch must be reverted; it breaks compilation on many platforms and requires manual fine-tuning.

Do you have invocations or headers that don't work? The problem is that the previous code defined all macros unconditionally, so it will be hard to find the necessary macros afterwards...

In D50845#1219709, @tra wrote:

FYI. This breaks our CUDA compilation. I haven't figured out what exactly is wrong yet. I may need to unroll the patch if the fix is not obvious.

Agreed. Patches D51446 and D51312 apply fixes for the PPC64 toolchain. Similar fixes are probably needed for other architectures.
In general, it looks like this patch leads to some host macros having to be defined again for the auxiliary triple case. It is not clear to me how to exhaustively identify the missing macros; so far it's been just trial and error.

In general, it looks like this patch leads to some host macros having to be defined again for the auxiliary triple case. It is not clear to me how to exhaustively identify the missing macros; so far it's been just trial and error.

Well, that's the point of this patch, isn't it? Again, the current approach is to just define all macros, which is definitely broken.

tra added a comment.Aug 30 2018, 11:39 AM

In our case the headers come from a relatively old glibc, and the compiler errors out on this:

/* This function is used in the `isfinite' macro.  */
__MATH_INLINE int
__NTH (__finite (double __x))
{
  return (__extension__
	  (((((union { double __d; int __i[2]; }) {__d: __x}).__i[1]
	     | 0x800fffffu) + 1) >> 31));
}

expanded to this:

extern __inline __attribute__ ((__always_inline__)) __attribute__ ((__gnu_inline__)) int
 __finite (double __x) throw ()
{
  return (__extension__
   (((((union { double __d; int __i[2]; }) {__d: __x}).__i[1]
      | 0x800fffffu) + 1) >> 31));
}

The error:

.../include/bits/mathinline.h:945:9: error: '(anonymous union at .../include/bits/mathinline.h:945:9)' cannot be defined in a type specifier
          (((((union { double __d; int __i[2]; }) {__d: __x}).__i[1]
               ^
.../include/bits/mathinline.h:945:55: error: member reference base type 'void' is not a structure or union
          (((((union { double __d; int __i[2]; }) {__d: __x}).__i[1]
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~

Also, whatever macros we generate do not prevent the headers from using x86 inline assembly. I see quite a bit of inline asm code in the preprocessed output. The headers are from glibc ~2.19.

tra added a comment.Aug 30 2018, 11:44 AM

In general, it looks like this patch leads to some host macros having to be defined again for the auxiliary triple case. It is not clear to me how to exhaustively identify the missing macros; so far it's been just trial and error.

Well, that's the point of this patch, isn't it? Again, the current approach is to just define all macros, which is definitely broken.

I would agree that it does not work for OpenMP, which relies on host headers being usable for device compilation.
It works OK for CUDA, as device code can coexist with the host code.

Perhaps the patch should keep existing behavior for CUDA and cherry-pick macros for OpenMP compilation only.

In D50845#1219746, @tra wrote:

Also, whatever macros we generate do not prevent headers from using x86 inline assembly. I see quite a few inline asm code in preprocessed output. The headers are from libc ~2.19.

Did you try adding

Builder.defineMacro("__NO_MATH_INLINES");
tra added a comment.Aug 30 2018, 12:05 PM

I've sent out D51501. It unbreaks CUDA compilation and keeps OpenMP unchanged.

In D50845#1219797, @tra wrote:

I've sent out D51501. It unbreaks CUDA compilation and keeps OpenMP unchanged.

I think a full revert would make more sense. And you definitely want to reinstate

// FIXME: This will create multiple definitions for most of the predefined
// macros. This is not the right way to handle this.

which is what I meant by "broken".

In any case, I'd like to request some more time to investigate. For now it looks like Clang was never able to parse that code, so we cannot have come across this in the past.

In D50845#1219746, @tra wrote:

In our case the headers come from a relatively old glibc, and the compiler errors out on this:

/* This function is used in the `isfinite' macro.  */
__MATH_INLINE int
__NTH (__finite (double __x))
{
  return (__extension__
	  (((((union { double __d; int __i[2]; }) {__d: __x}).__i[1]
	     | 0x800fffffu) + 1) >> 31));
}

expanded to this:

extern __inline __attribute__ ((__always_inline__)) __attribute__ ((__gnu_inline__)) int
 __finite (double __x) throw ()
{
  return (__extension__
   (((((union { double __d; int __i[2]; }) {__d: __x}).__i[1]
      | 0x800fffffu) + 1) >> 31));
}

The error:

.../include/bits/mathinline.h:945:9: error: '(anonymous union at .../include/bits/mathinline.h:945:9)' cannot be defined in a type specifier
          (((((union { double __d; int __i[2]; }) {__d: __x}).__i[1]
               ^
.../include/bits/mathinline.h:945:55: error: member reference base type 'void' is not a structure or union
          (((((union { double __d; int __i[2]; }) {__d: __x}).__i[1]
             ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~

Also, whatever macros we generate do not prevent the headers from using x86 inline assembly. I see quite a bit of inline asm code in the preprocessed output. The headers are from glibc ~2.19.

Ok, the top preprocessor condition for that function is #ifndef __SSE2_MATH__ - the exact same macro that was part of the motivation. Can you please test compiling a simple C file (including math.h) with -mno-sse? My guess would be that this is broken as well.
If so, I'm fine with reverting, because I need to teach Clang to allow anonymous unions in type specifiers to make that weird system header work with this patch.

tra added a comment.EditedAug 30 2018, 1:00 PM

Ok, the top preprocessor condition for that function is #ifndef __SSE2_MATH__ - the exact same macro that was part of the motivation. Can you please test compiling a simple C file (including math.h) with -mno-sse? My guess would be that this is broken as well.
If so, I'm fine with reverting, because I need to teach Clang to allow anonymous unions in type specifiers to make that weird system header work with this patch.

It compiles fine. The code that causes the problem is also conditional on !defined __NO_MATH_INLINES, and that is always defined for x86, so compilation only breaks when we compile for NVPTX.

Still, the issue seems to be way too hairy for a one-line fix, so I'll proceed with the unroll if you don't beat me to it.

In D50845#1219853, @tra wrote:

Ok, the top preprocessor condition for that function is #ifndef __SSE2_MATH__ - the exact same macro that was part of the motivation. Can you please test compiling a simple C file (including math.h) with -mno-sse? My guess would be that this is broken as well.
If so, I'm fine with reverting, because I need to teach Clang to allow anonymous unions in type specifiers to make that weird system header work with this patch.

It compiles fine. The code that causes the problem is also conditional on !defined __NO_MATH_INLINES, and that is always defined for x86, so compilation only breaks when we compile for NVPTX.

(which references a bug fixed in 2010 IIRC).

Still, the issue seems to be way too hairy for a one-line fix, so I'll proceed with the unroll if you don't beat me to it.

Please go ahead. You'll probably get conflicts because of D51446, but removing InitializePredefinedAuxMacros and the new test completely should do.

removing InitializePredefinedAuxMacros and the new test completely should do.

Yep, they also contain D51312, in case you're rolling back individual commits.

removing InitializePredefinedAuxMacros and the new test completely should do.

Yep, they also contain D51312, in case you're rolling back individual commits.

Err, yes, that's the one I wanted to link.

tra added a comment.Aug 30 2018, 1:25 PM

That, and r340967 D51441. I'm running check-clang now and will land reverted changes shortly.

tra added a comment.Aug 30 2018, 1:54 PM

Tests reverted in rL341118