
[OpenMP] Initial support for std::complex in target regions
Accepted · Public

Authored by jdoerfert on May 31 2020, 1:08 PM.

Details

Summary

This simply follows the scheme we have for other wrappers. It resolves
the current link problem, e.g., __muldc3 not found, when std::complex
operations are used on a device.

In "CUDA mode" this should allow simple complex operations to work in
target regions. Normal mode doesn't work because the globalization in
the std::complex operators is somehow broken. This will most likely not
make calls to complex math functions, e.g., sin, work properly, but
that is more complex (pun intended) anyway.

Diff Detail

Event Timeline

jdoerfert created this revision.May 31 2020, 1:08 PM
Herald added a project: Restricted Project. · View Herald Transcript · May 31 2020, 1:08 PM
jdoerfert updated this revision to Diff 267531.May 31 2020, 3:57 PM

Fix tests, add C support

tra added a comment.Jun 1 2020, 10:17 AM

Hmm. I'm pretty sure tensorflow is using std::complex for various types. I'm surprised that we haven't seen these functions missing.
Plain CUDA (e.g. https://godbolt.org/z/Us6oXC) code appears to have no references to __mul* or __div*, at least for optimized builds, but they do pop up in unoptimized ones. Curiously enough, unoptimized code compiled with -stdlib=libc++ --std=c++11 does not need the soft-float functions. That would explain why we don't see the build breaks.

These differences suggest that these changes may need to be more nuanced with regard to the standard c++ library version and, possibly, the C++ standard used.
If possible, I would prefer to limit interference with the standard libraries only to the cases where it's necessary.

clang/lib/Headers/__clang_cuda_complex_builtins.h
29

Nit: this creates the impression that we fall back on the double variant of the function, while in reality we'll end up using std::isnan<float>.
Perhaps it would be better to use the fully specialized function template name in all these macros. It would also avoid potential issues if someone, somewhere, adds other overloads. E.g., we may end up facing std::complex<half>, which may make overload resolution ambiguous in some cases.

63

The soft-float library has a bunch of other functions: https://gcc.gnu.org/onlinedocs/gccint/Soft-float-library-routines.html

I wonder why only the complex variants of the soft-float support functions are missing.
Does it mean that x86 code also relies on the library to do complex multiplication?
If x86 can do complex ops, why can't nvptx?
If x86 can't, would it make sense to teach it?

jdoerfert marked 2 inline comments as done.Jun 1 2020, 11:44 AM
In D80897#2066723, @tra wrote:

Hmm. I'm pretty sure tensorflow is using std::complex for various types. I'm surprised that we haven't seen these functions missing.

Which functions and missing from where? In CUDA-mode we did provide __XXXXc3 already.

Plain CUDA (e.g. https://godbolt.org/z/Us6oXC) code appears to have no references to __mul* or __div*, at least for optimized builds, but they do pop up in unoptimized ones. Curiously enough, unoptimized code compiled with -stdlib=libc++ --std=c++11 does not need the soft-float functions. That would explain why we don't see the build breaks.

It's not that simple and, tbh, I don't have the full picture yet. Plain (clang) CUDA uses these functions (https://godbolt.org/z/dp_FY2); they just disappear after inlining because of the linkage. If you enable -ffast-math, however, they are not used (https://godbolt.org/z/_N-STh). I couldn't run with -stdlib=libc++ locally, and godbolt cuts off the output, so I'm not sure whether they are used and inlined or not used at all.

These differences suggest that these changes may need to be more nuanced with regard to the standard c++ library version and, possibly, the C++ standard used.
If possible, I would prefer to limit interference with the standard libraries only to the cases where it's necessary.

The way I understand this is that we can always provide correct weak versions of __XXXXc3 without any correctness issues. They will be stripped if they are not needed anyway. That said, this patch should not modify the CUDA behavior (except minor float vs. double corrections in the __XXXXc3 methods). Could you elaborate on what interference you expect?

clang/lib/Headers/__clang_cuda_complex_builtins.h
29

No problem. I'll just use std::NAME for all of them.

63

I wonder why only the complex variants of the soft-float support functions are missing.

I would guess others are conceptually missing too, the question is if we need them. I did grep the clang source for 7 non-complex soft-float support functions from the different categories listed in the gcc docs, none was found.

Does it mean that x86 code also relies on the library to do complex multiplication?

I think so, yes. Some system library will provide the implementation of __muldc3 for the slow path of a complex multiplication.

If x86 can do complex ops, why can't nvptx?
If x86 can't, would make sense to teach it?

I think I don't understand this (and maybe the question above). What we do in CUDA right now, and with this patch in OpenMP, is to provide the __XXXXc3 functions on the device. Usually they are in some system library that we just don't have on the device, so we have to provide them somehow.

arsenm added a subscriber: arsenm.Jun 1 2020, 12:10 PM
arsenm added inline comments.
clang/lib/Headers/__clang_cuda_complex_builtins.h
136–137

Why does this try to preserve the sign of a NaN? They are meaningless.

tra added a comment.Jun 1 2020, 12:27 PM
In D80897#2066723, @tra wrote:

Hmm. I'm pretty sure tensorflow is using std::complex for various types. I'm surprised that we haven't seen these functions missing.

Which functions and missing from where? In CUDA-mode we did provide __XXXXc3 already.

I mean the __XXXXc3 functions added by the patch. I've tried with clang as it is now, before your patch.

Plain CUDA (e.g. https://godbolt.org/z/Us6oXC) code appears to have no references to __mul* or __div*, at least for optimized builds, but they do pop up in unoptimized ones. Curiously enough, unoptimized code compiled with -stdlib=libc++ --std=c++11 does not need the soft-float functions. That would explain why we don't see the build breaks.

It's not that simple and, tbh, I don't have the full picture yet. Plain (clang) CUDA uses these functions (https://godbolt.org/z/dp_FY2); they just disappear after inlining because of the linkage. If you enable -ffast-math, however, they are not used (https://godbolt.org/z/_N-STh). I couldn't run with -stdlib=libc++ locally, and godbolt cuts off the output, so I'm not sure whether they are used and inlined or not used at all.

I've checked it locally and verified that adding --stdlib=libc++ -std=c++11 to your first example shows that the __*c3 functions do not appear in the IR regardless of inlining or opt level.
I wonder what it is that libstdc++ does that makes those functions show up in the IR. AFAICT, they are not invoked directly by the library, so it must be something clang has generated. Perhaps something should be fixed there.

These differences suggest that these changes may need to be more nuanced with regard to the standard c++ library version and, possibly, the C++ standard used.
If possible, I would prefer to limit interference with the standard libraries only to the cases where it's necessary.

The way I understand this is that we can always provide correct weak versions of __XXXXc3 without any correctness issues. They will be stripped if they are not needed anyway. That said, this patch should not modify the CUDA behavior (except minor float vs. double corrections in the __XXXXc3 methods). Could you elaborate on what interference you expect?

One example would be if/when we grow better libm support for GPUs. Granted, it's just a few functions and we could just remove these instances then.
I agree that adding these functions now will probably not interfere with anything we have now -- they are device-side overloads and nobody calls them directly.
The suggestion was based on a general principle of minimizing the changes that overlap with the standard libraries -- there are quite a few versions out there and I can't predict what quirks of theirs I'm not aware of. I've been burned by that too many times not to be wary.

clang/lib/Headers/__clang_cuda_complex_builtins.h
63

I'm OK with providing device-side equivalents of the host standard library.

What I'm trying to figure out is why we don't need to do it in some cases.
In the case where we do rely on these functions but don't have them, we have at least two choices -- provide the missing functions (this patch) or ensure we never need these functions (what I'm trying to figure out). If there's a way to reliably ensure that we don't need these functions, I'd prefer that.

Right now the observation is that libc++ somehow avoids it. If we can improve clang so that libstdc++ also works without falling back on the __*c3 functions, that may be a better fix for this. That said, I don't understand yet why/how the standard C++ libraries end up with different code in this case.

jdoerfert marked an inline comment as done.Jun 2 2020, 3:06 PM

I tried to determine why we don't emit such calls for C++11 and libstdc++, but I was not successful :( Tracking back from the emission led to the generic expression codegen without any (obvious) check of the runtime library or std versions.

clang/lib/Headers/__clang_cuda_complex_builtins.h
136–137

Idk [I only work here... ;)]

I guess the algorithm was once copied from libc++; unclear if the one in there is still the same. We could check.

jdoerfert marked an inline comment as done.Jun 3 2020, 12:47 PM
jdoerfert added inline comments.
clang/lib/Headers/__clang_cuda_complex_builtins.h
42

This will actually not work right now as we do not overload isinf/isnan/isfinite properly in C++ mode. I first have to find a solution for that mess.

@tra After chatting with @hfinkel I know now why we don't see the calls in the libc++ case: libc++ implements std::complex without _Complex types, while libstdc++ uses them. If the user uses _Complex directly, we need these functions for sure, as the standard defines them (I think): https://godbolt.org/z/jcXgnH

So we need them and I would like to reuse them in the OpenMP offload path :)

@JonChesterfield @hfinkel @tra ping

I would really like to land this before the release branches off to allow people to use complex in target regions.

JonChesterfield accepted this revision.Thu, Jul 2, 5:17 PM

I think this change is good. The library story is a bit difficult, but fundamentally OpenMP needs a shim of some sort to map target math functions onto the libm of the underlying device.

For nvptx, that's the CUDA library. AMDGCN has math functions and may need another shim to map them to libm.

include_next is nasty, but that's the existing pattern for some library headers.
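The wrapper pattern referred to here looks roughly like this (a simplified sketch of the existing clang wrapper-header scheme, not the exact contents of the patch):

```cpp
// Simplified sketch of a clang wrapper header for <complex>: mark the
// device-side pieces, then forward to the real standard library header via
// #include_next, which continues the header search in the remaining
// include paths.
#pragma omp declare target
// ... device-side definitions of the __XXXXc3 helpers would go here ...
#pragma omp end declare target

#include_next <complex>
```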

clang/test/Headers/Inputs/include/complex
10

Can we #include from libc++ instead? Needs some cmake to skip the test if the library is unavailable, but it spares duplicating this class.

This revision is now accepted and ready to land.Thu, Jul 2, 5:17 PM