This is an archive of the discontinued LLVM Phabricator instance.

[CUDA] Add option to mark most functions inside <complex> as host+device.
AbandonedPublic

Authored by jlebar on Mar 21 2016, 1:40 PM.

Details

Reviewers
tra
rnk
Summary

The clang driver flag --cuda-allow-std-complex translates into the cc1
flag -fcuda-allow-std-complex. With this flag, we mark all functions
declared inside <complex> within namespace std as host+device, other
than operator>> and operator<<, which rely on iostreams, which are not
supported in CUDA device code.

Diff Detail

Event Timeline

jlebar updated this revision to Diff 51221.Mar 21 2016, 1:40 PM
jlebar retitled this revision from to [CUDA] Add option to mark most functions inside <complex> as host+device..
jlebar updated this object.
jlebar added reviewers: tra, rnk.
jlebar added subscribers: cfe-commits, jhen.
tra accepted this revision.Mar 21 2016, 1:55 PM
tra edited edge metadata.

One minor question, LGTM otherwise.

lib/Sema/SemaCUDA.cpp
474

Can C++ library headers ever be non-system? I.e., can someone use libc++ via -I?

This revision is now accepted and ready to land.Mar 21 2016, 1:55 PM
jlebar added inline comments.Mar 21 2016, 2:23 PM
lib/Sema/SemaCUDA.cpp
474

Good question, I have no idea if that's supposed to work. Reid, do you know?

rsmith added a subscriber: rsmith.Mar 21 2016, 2:42 PM

I would much prefer for us to, say, provide a <complex> header that wraps the system one and does something like

// <complex>
#pragma clang cuda_implicit_host_device {
#include_next <complex>
#pragma clang cuda_implicit_host_device }

or to provide an explicit list of the functions that we're promoting to __host__ __device__, or to require people to use a CUDA-compatible standard library if they want CUDA-compatible standard library behaviour.

include/clang/Driver/Options.td
383–384

I don't think it's reasonable to have something this hacky / arbitrary in the stable Clang driver interface.

lib/Sema/SemaCUDA.cpp
479–481

I don't think this works: the standard library might factor parts of <complex> out into separate header files. For instance, libstdc++ 4.4 includes the TR1 pieces of <complex> in that way.

rnk added inline comments.Mar 21 2016, 2:46 PM
lib/Sema/SemaCUDA.cpp
474

libc++'s <complex> has this pragma in it:

#pragma GCC system_header

So we should be safe regardless of the flags used to find it.

483–488

I'd do this check after the system header test and before the "complex" test, since it's probably faster.

485

There's no cast on the RHS, so I'd spell out CXXRecordDecl here to make things more obvious.

lib/Sema/SemaDecl.cpp
8344

Do you want this to apply to declarations as well as definitions? Your test uses that functionality.

Thanks for the suggestions, Richard. I'm not sure any of them will work, but I don't defend this patch as anything other than a hack, so if we can come up with something that works for what we need to accomplish and is cleaner, that's great.

I would much prefer for us to, say, provide a <complex> header that wraps the system one and does something like

// <complex>
#pragma clang cuda_implicit_host_device {
#include_next <complex>
#pragma clang cuda_implicit_host_device }

We considered this and ruled it out for two reasons:

  1. We'd have to exclude operator>> and operator<<, presumably with additional pragmas, and
  2. We'd have to exclude everything included by <complex>.

Of course with enough pragmas anything is possible, but at this point it seemed to become substantially more complicated than this (admittedly awful) hack.

or to provide an explicit list of the functions that we're promoting to __host__ __device__

The problem with that is that libstdc++ uses many helper functions, which we'd also have to enumerate. Baking those kinds of implementation details into clang seemed worse than this hack.

or to require people to use a CUDA-compatible standard library if they want CUDA-compatible standard library behaviour.

I think asking people to use a custom standard library is a nonstarter for e.g. OSS tensorflow, and I suspect it would be a considerable amount of work to accomplish in google3. (Not to suggest that two wrongs make a right, but we already have many similar hacks in place to match nvcc's behavior with standard library functions -- the main difference here is that we're spelling the hack in clang's C++ as opposed to in __clang_cuda_runtime_wrapper.h.)

tra added a comment.Mar 21 2016, 2:56 PM

I would much prefer for us to, say, provide a <complex> header that wraps the system one and does something like

// <complex>
#pragma clang cuda_implicit_host_device {
#include_next <complex>
#pragma clang cuda_implicit_host_device }

or to provide an explicit list of the functions that we're promoting to __host__ __device__, or to require people to use a CUDA-compatible standard library if they want CUDA-compatible standard library behaviour.

We'll still need some filtering as not everything inside <complex> should be __host__ __device__.

include/clang/Driver/Options.td
383–384

What would be a better way to enable this 'feature'? I guess we could live with -Xclang -fcuda-allow-std-complex for now, but that does not seem to be a particularly good way to give the user control, either.

Perhaps we should have some sort of --cuda-enable-extension=foo option to control CUDA hacks.

Here are two other approaches we considered and rejected, for the record:

  1. Copy-paste a <complex> implementation from e.g. libc++ into __clang_cuda_runtime_wrapper.h, and edit it appropriately. Then #define the real <complex>'s include guards.

    Main problem with this is the obvious one: We're copying a big chunk of the standard library into the compiler, where it doesn't belong, and now we have two divergent copies of this code to maintain. In addition, we can't necessarily use libc++, since we need to support pre-C++11 and AIUI libc++ does not.
  2. Provide __device__ overloads for all the functions defined in <complex>. This almost works, except that we do not (currently) have a way to let you inject new overloads for member functions into classes we don't own. E.g. we can add a __device__ overload std::real(const complex<T>&), just as we could overload std::real in any other way, but we can't add a new __device__ overload to std::complex<T>::real().

    This approach also has a similar problem to (1), which is that we'd end up copy/pasting almost all of <complex> into the compiler.
include/clang/Driver/Options.td
383–384

I don't think it's reasonable to have something this hacky / arbitrary in the stable Clang driver interface.

This is an important feature for a lot of projects, including tensorflow and eigen. No matter how we define the flag, I suspect people are going to use it en masse. (Most projects I've seen pass the equivalent flag to nvcc.) Once many or even most projects rely on it, I suspect we'll have difficulty changing this flag, regardless of whether or not it is officially part of our stable API.

There's also the issue of discoverability. nvcc actually gives a nice error message when you try to use std::complex -- it seems pretty unfriendly not to even list the relevant flag in clang --help.

I don't feel particularly strongly about this, though -- I'm more concerned about getting something that works.

jlebar added inline comments.Mar 21 2016, 3:20 PM
include/clang/Driver/Options.td
383–384

An alternative wrt the flag is to enable it by default. This would be somewhat consistent with existing behavior, wherein we make most std math functions available without a special flag, even though they're not technically host-device. The main difference here is that there we're matching nvcc's default behavior, whereas here we're actually going further than nvcc -- nvcc by default doesn't let you touch std::complex from device code at all, and with a flag, you can touch its *constexpr* functions -- which is not actually very much.

Nonetheless, since the user-visible effect is consistent with our approach of making std math stuff available, and since this shouldn't make us reject code nvcc accepts, I'd be more OK with hiding the flag that turns it off.

rnk edited edge metadata.Mar 21 2016, 3:21 PM

I would much prefer for us to, say, provide a <complex> header that wraps the system one and does something like

// <complex>
#pragma clang cuda_implicit_host_device {
#include_next <complex>
#pragma clang cuda_implicit_host_device }

or to provide an explicit list of the functions that we're promoting to __host__ __device__, or to require people to use a CUDA-compatible standard library if they want CUDA-compatible standard library behaviour.

I don't really like include_next wrapper headers, but adding a pragma spelling of the cuda device attributes might be nice. There would still be issues with the streaming operators, though.

include/clang/Driver/Options.td
383–384

What if we had a catchall nvcc quirks mode flag similar to -fms-compatibility? We probably don't want a super-fine-grained LangOpt like this.

jlebar added inline comments.Mar 21 2016, 3:23 PM
include/clang/Driver/Options.td
383–384

What if we had a catchall nvcc quirks mode flag similar to -fms-compatibility?

I think we midair'ed on this. See above comment about turning this flag on by default -- calling this "nvcc compat" wouldn't quite be right. We could certainly have a broader flag, but I'm not sure at the moment what else would reasonably go in with this one.

rsmith added inline comments.Mar 21 2016, 3:35 PM
include/clang/Driver/Options.td
383–384

I'd find either of these suggestions (-fnvcc-compatibility or a cc1-only flag to turn this behaviour off) more palatable than the current approach.

I'd also be a lot happier about this if we can view it as a short-term workaround, with the longer-term fix being to get the host/device attributes added to standard library implementations (even if it turns out we can never actually remove this workaround in practice). If we can legitimately claim that this is the way that CUDA is intended to work, and the missing attributes in <complex> are a bug in that header (in CUDA mode), then that provides a solid justification for having this complexity in Clang.

lib/Sema/SemaCUDA.cpp
464–465

Does nvcc do this "constexpr implies __host__ __device__" thing only for functions declared within <complex>, or for all functions?

Another alternative strategy: a wrapper <complex> header that does this:

#include // ... union of includes from libc++ and libstdc++ <complex>
#define constexpr __host__ __device__ constexpr
#include_next <complex>
#undef constexpr
485

Parent can't be null for a CXXMethodDecl, so just Method->getParent()->isInStdNamespace() would work.

jlebar added inline comments.Mar 21 2016, 3:48 PM
include/clang/Driver/Options.td
383–384

If we can legitimately claim that this is the way that CUDA is intended to work, and the missing attributes in <complex> are a bug in that header (in CUDA mode), then that provides a solid justification for having this complexity in Clang.

I think that

  1. the number of people passing --relaxed-constexpr to nvcc just so they can use a limited subset of std::complex, and
  2. the fact that we're already doing this for (basically all) other std math functions

may be decent arguments for this. But I don't know if I'm a great judge of what we can legitimately claim here.

lib/Sema/SemaCUDA.cpp
464–465

Does nvcc do this "constexpr implies host device" thing only for functions declared within <complex>, or for all functions?

All functions. Although std::complex is the main use I've observed.

Another alternative strategy: a wrapper <complex> header that does this:

That one is quite clever, although I'm not sure about enumerating all of the includes from the headers. I guess that should be reasonably stable...

I think I would like to get full complex support, though, if we can agree on a path towards that. The current limitation is silly, it seems clear that people want this, and the constexpr thing gives you but a shadow of the actual library.

479–481

Hm, that is unfortunate. One option would be to say that we just don't support this. Otherwise we have to go down the road of identifying all the relevant functions...

rsmith added inline comments.Mar 21 2016, 4:27 PM
lib/Sema/SemaCUDA.cpp
479–481

I've not checked GCC 5 onwards, but it looks like in the 4.x series, this is the only problem of this kind, and only affects the TR1 pieces (which it seems we probably don't need to care about supporting here). libc++ doesn't currently have any problems of this kind. Obviously it's unknown what issues we'll see with other standard library implementations.

rsmith added inline comments.Mar 21 2016, 4:32 PM
lib/Sema/SemaCUDA.cpp
464–465

Supporting a "constexpr implies __host__ __device__" feature for all functions seems a lot cleaner than the approach taken by this patch, and will presumably improve NVCC compatibility in other cases too (though perhaps they're quite rare). This seems like a very odd pair of features to link in this way, but if we're going to have something weird like this to support existing NVCC-targeting code, using the same approach may be better. This would also mean we would not be further extending NVCC's extension.

jlebar abandoned this revision.Mar 22 2016, 2:56 PM
jlebar marked 7 inline comments as done.

Okay, after much discussion, we've decided to go with --relaxed-constexpr instead of this. I have a patch for that which seems to mostly work, will send it out soon.