This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][HIP] Fix host/device based overload resolution
ClosedPublic

Authored by yaxunl on Apr 11 2020, 1:02 PM.

Details

Summary

Currently clang fails to compile the following CUDA program in device compilation:

__host__ int foo(int x) {
     return 1;
}

template<class T>
__device__ __host__ int foo(T x) {
    return 2;
}

__device__ __host__ int bar() {
    return foo(1);
}

__global__ void test(int *a) {
    *a = bar();
}

This is because foo is resolved to the __host__ foo instead of the __device__ __host__ foo.
This appears to be a bug, since the __device__ __host__ foo is a viable callee for the call
to foo, yet clang is unable to choose it.

nvcc has a similar issue:

https://cuda.godbolt.org/z/bGijLc

although it only emits a warning and does not fail to compile. It emits a trap in the
generated code so that the call fails at run time.

This patch fixes that.

Diff Detail

Event Timeline

yaxunl created this revision.Apr 11 2020, 1:02 PM
yaxunl edited the summary of this revision. (Show Details)Apr 11 2020, 1:10 PM

If nvcc ignores host/device-ness when selecting overloads, that's probably the specified behavior, right? I agree that it would be better to not ignore it, but Clang shouldn't just make up better rules for languages with external specifications.


cuda-clang does not always follow nvcc's behavior. For example, cuda-clang only allows an incomplete array type for extern shared variables, whereas nvcc allows other types. If cuda-clang were supposed to follow nvcc's behavior in every aspect, we would approve https://reviews.llvm.org/D73979 , but that is not the case.

Therefore, I think we should discuss whether this is really a bug, and whether the fix can cause any unwanted side effects.


BTW cuda-clang already differs considerably from nvcc regarding host/device-based overload resolution. For example, the following code is valid in cuda-clang before my change but invalid in nvcc (https://cuda.godbolt.org/z/qwpKZe). So if we wanted to follow nvcc's resolution rules, we would need a total revamp of device/host-related overload resolution in cuda-clang.

__host__ int foo(int x) {
     return 1;
}

template<class T>
__device__ int foo(T x) {
    return 2;
}

__device__ int bar() {
    return foo(1);
}

__global__ void test(int *a) {
    *a = bar();
}

I'm not saying that we need to be bug-for-bug-compatible with nvcc, I'm just saying that we should be able to point to *something* to justify our behavior. I take it that the CUDA spec has rules for some amount of host/device-based overloading? What are they based on?


I checked the CUDA SDK documentation and did not find useful information about overload resolution based on host/device attributes. I guess the rule can only be deduced from nvcc's behavior.

Based on https://reviews.llvm.org/D12453, https://reviews.llvm.org/D18416, and https://bcain-llvm.readthedocs.io/projects/llvm/en/latest/CompileCudaWithLLVM/#overloading-based-on-host-and-device-attributes, cuda-clang has different overload resolution rules based on host/device attributes. This is an intentional design decision.

rjmccall added a comment.EditedApr 11 2020, 8:19 PM


Okay, thanks, that's all I needed. We don't need to re-litigate it.

That spec says that there's a preference given to functions according to host/device-ness. The question, then, is how that actually interacts with the normal overload resolution rules. The "deletion" approach suggests that it's meant to be the most important thing in the comparison. It seems to me that, given the wording of the specification, deletion is the wrong implementation approach, and that instead this check should just be performed in isBetterOverloadCandidate so that a candidate that better matches the host/device-ness of the caller is always considered a better candidate.

yaxunl updated this revision to Diff 256903.Apr 12 2020, 6:27 PM
yaxunl retitled this revision from [CUDA][HIP] Fix overload resolution issue for device host functions to [CUDA][HIP] Fix host/device based overload resolution.

Revised per John's comments.

rjmccall added inline comments.Apr 12 2020, 8:47 PM
clang/lib/Sema/SemaOverload.cpp
9491

Please add [CUDA] or something similar to the top of this comment so that readers can immediately know that it's dialect-specific.

At a high level, this part of the rule is essentially saying that CUDA non-emittability is a kind of non-viability. Should we just make non-emittable functions get flagged as non-viable (which will avoid a lot of relatively expensive conversion checking), or is it important to be able to select non-emittable candidates over candidates that are non-viable for other reasons?

9781

If we move anything below this check, it needs to figure out a tri-state so that it can return false if Cand2 is a better candidate than Cand1. Now, that only matters if multiversion functions are supported under CUDA, but if you're relying on them not being supported, that should at least be commented on.

9784

Okay, let's think about the right place to put this check in the ordering; we don't want different extensions to get into a who-comes-last competition.

  • Certainly this should have lower priority than the standard-defined preferences like argument conversion ranks or enable_if partial-ordering.
  • The preference for pass-object-size parameters is probably most similar to a type-based-overloading decision and so should take priority.
  • I would say that this should take priority over function multi-versioning. Function multi-versioning is all about making specialized versions of the "same function", whereas I think host/device overloading is meant to be semantically broader than that.

What do you think?

Regardless, the rationale for the order should be explained in comments.

yaxunl marked 6 inline comments as done.Apr 13 2020, 7:04 AM
yaxunl added inline comments.
clang/lib/Sema/SemaOverload.cpp
9491

There are two situations for "bad" callees:

  1. The callee should never be called. It is not just an invalid call in codegen, but also an invalid call in the AST, e.g. a host function calling a device function. In CUDA call preference this is termed "never", and clang already removes such callees from the overload candidates.
  2. The callee should not be called in codegen, but may be called in the AST. This happens when a __host__ __device__ function calls a "wrong-sided" function, e.g. in device compilation, a __host__ __device__ function calls a __host__ function. This is valid in the AST, since the __host__ __device__ function may be an inline function which is only called by a __host__ function. There is a deferred diagnostic for the wrong-sided call, triggered only if the caller is emitted. However, in overload resolution, if no better candidates are available, wrong-sided candidates are still viable.
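A minimal illustration of the second situation (function names are hypothetical):

```cuda
__host__ int get_env();                  // host-only; no __device__ overload

__host__ __device__ int wrapper() {
  return get_env();  // wrong-sided in device compilation: valid in the AST,
                     // diagnosed (deferred) only if wrapper() is emitted
                     // for the device
}

__host__ int caller() { return wrapper(); }  // wrapper() is only used from
                                             // host code, so device compilation
                                             // never emits it and no error
                                             // is issued
```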
9781

Multiversion host functions are orthogonal to CUDA and therefore should be supported. Multiversioning of device, host device, and global functions is not supported. However, this change does not make things worse, and multiversioning should continue to work if it is ever supported for them.

Host/device-based overload resolution is mostly about determining the viability of a function. If two functions are both viable, other factors should take precedence in the preference. This general rule is already applied to cases other than multiversioning; I think it should also apply to multiversioning.

I will make isBetterMultiversionCandidate tri-state.

9784

I will add comments for the rationale of preference.

I commented the preference between multiversion and host/device in another comment.

yaxunl updated this revision to Diff 256973.Apr 13 2020, 7:41 AM
yaxunl marked 3 inline comments as done.

fix preference for multiversion. add comments. add more tests for wrong-sided function.

rjmccall added a subscriber: echristo.
rjmccall added inline comments.
clang/lib/Sema/SemaOverload.cpp
9491

Oh, I see what you're saying; sorry, I mis-read the code. So anything with a preference *worse* than wrong-sided is outright non-viable; there's a very strong preference against wrong-sided calls that takes priority over all of the normal overload-resolution rules; and then there's a very weak preference against non-exact matches that everything else takes priority over. Okay.

9781

This general rule has been taken for cases other than multiversion, I think it should also apply to multiversion.

Well, but the multiversion people could say the same: that multiversioning is for picking an alternative among otherwise-identical functions, and HD and H functions are not otherwise-identical.

CC'ing @echristo for his thoughts on the right ordering here.

tra added a comment.Apr 13 2020, 11:36 AM

LGTM in principle. That said, my gut feeling is that this patch has a good chance of breaking something in sufficiently convoluted CUDA code like Eigen. When you land this patch, I'd appreciate if you could do it on a workday morning (Pacific time) so I'm around to test it on our code and revert if something unexpected pops up.

On a side note, this case is another point towards having to redo handling of __host__ __device__. There are way too many corner cases all over the place. Things will only get worse as we move towards newer C++ standard where a lot more code becomes constexpr which is implicitly HD. Having calls from HD functions resolve in a different way during host/device compilation is observable and may result in host and device code diverging unexpectedly.

yaxunl updated this revision to Diff 259458.Apr 22 2020, 7:53 PM

Revised to let host/device take precedence over multiversion, as John suggested.

yaxunl marked 2 inline comments as done.Apr 22 2020, 7:54 PM

Okay, one minor fix.

clang/lib/Sema/SemaOverload.cpp
9387

This is neglecting the case where they're both invalid.

echristo added inline comments.
clang/lib/Sema/SemaOverload.cpp
9781

Adding @erichkeane here as well.

I think this makes sense, but I can see a reason to multiversion a function that will run on host and device. A version of some matrix mult that takes advantage of 3 host architectures and one cuda one? Am I missing something here?

yaxunl marked an inline comment as done.Apr 23 2020, 12:19 PM
yaxunl added inline comments.
clang/lib/Sema/SemaOverload.cpp
9781

My understanding is that a multiversion function is for a specific CPU (or GPU). Let's say we want to have a function f for gfx900, gfx906, sandybridge, and ivybridge; shouldn't they be more like

__host__ __attribute__((cpu_specific(sandybridge))) f();
__host__ __attribute__((cpu_specific(ivybridge))) f();
__device__ __attribute__((cpu_specific(gfx900))) f();
__device__ __attribute__((cpu_specific(gfx906))) f();

instead of all __device__ __host__ functions?

erichkeane added inline comments.Apr 23 2020, 12:32 PM
clang/lib/Sema/SemaOverload.cpp
9781

IMO, it doesn't make sense for functions to be BOTH host and device; they'd have to be just one. Otherwise I'm not sure how the resolver behavior is supposed to work. The whole idea is that the definition is chosen at runtime.

Unless __host__ __device__ void foo(); is TWO declaration chains (meaning two separate AST entries), it doesn't make sense to have multiversioning on it (and then, how it would be spelled is awkward/confusing to me).

In the above case, if those 4 declarations are not two separate root AST nodes, multiversioning won't work.

rjmccall added inline comments.Apr 23 2020, 4:40 PM
clang/lib/Sema/SemaOverload.cpp
9781

There are certainly functions that ought to be usable from either host or device context — any inline function that just does ordinary language things should be in that category. Also IIUC many declarations are *inferred* to be __host__ __device__, or can be mass-annotated with pragmas, and those reasons are probably the main ones this might matter — we might include a header in CUDA mode that declares a multi-versioned function, and we should handle it right.

My read of how CUDA programmers expect this to work is that they see the __host__ / __device__ attributes as primarily a mechanism for catching problems where you're using the wrong functions for the current configuration. That is, while we allow overloading by __host__/__device__-ness, users expect those attributes to mostly be used as a filter for what's "really there" rather than really strictly segregating the namespace. So I would say that CUDA programmers would probably expect the interaction with multiversioning to be:

  • Programmers can put __host__, __device__, or both on a variant depending on where it was usable.
  • Dispatches should simply ignore any variants that aren't usable for the current configuration.

And specifically they would not expect e.g. a __host__ dispatch function to only consider __host__ variants — it should be able to dispatch to anything available, which is to say, it should also include __host__ __device__ variants. Similarly (and probably more usefully), a __host__ __device__ dispatch function being compiled for the device should also consider pure __device__ functions, and so on.

If we accept that, then I think it gives us a much better idea for how to resolve the priority of the overload rules. The main impact of isBetterMultiversionCandidate is to try to ensure that we're looking at the __attribute__((cpu_dispatch)) function instead of one of the __attribute__((cpu_specific)) variants. (It has no effect on __attribute__((target)) multi-versioning, mostly because it doesn't need to: target-specific variants don't show up in lookup with __attribute__((target)).) That rule should take precedence over the CUDA preference for exact matches, because e.g. if we're compiling this:

__host__ __device__ int magic(void) __attribute__((cpu_dispatch("...")));
__host__ __device__ int magic(void) __attribute__((cpu_specific(generic)));
__host__ int magic(void) __attribute__((cpu_specific(mmx)));
__host__ int magic(void) __attribute__((cpu_specific(sse)));
__device__ int magic(void) __attribute__((cpu_specific(some_device_feature)));
__device__ int magic(void) __attribute__((cpu_specific(some_other_device_feature)));

then we don't want the compiler to prefer a CPU-specific variant over the dispatch function just because one of the variants was marked __host__.

tra accepted this revision.Apr 23 2020, 6:06 PM
tra added inline comments.
clang/lib/Sema/SemaOverload.cpp
9781

It's a bit more complicated and a bit less straightforward than that. :-( https://goo.gl/EXnymm
Handling of target attributes is where clang is very different from the NVCC, so no matter which mental model of "CUDA programmer" you pick, there's another one which will not match.

In the existing code __host__ __device__ is commonly used as a sledgehammer to work around NVCC's limitations. It does not allow attribute-based overloading, so the only way you can specialize a function for host/device is via something like this:

__host__ __device__ void foo() {
#if __CUDA_ARCH__ > 0
 // GPU code
#else
 // CPU code.
#endif
}

With clang you can write separate overloaded functions and we'll do our best to pick the one you meant to call. Alas, there are cases where it's ambiguous and depends on the callee's attributes, which may depend on theirs. When something ends up being called from different contexts, interesting things start happening. With more functions becoming constexpr (those are implicitly HD), we'll be running into such impossible-to-do-the-right-thing situations more often. The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolution considering only same-side functions, which will, in effect, completely separate the host and device namespaces.
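For comparison, the attribute-based overloading that clang (but not nvcc) accepts looks roughly like this (a sketch with hypothetical names):

```cuda
// clang-only: overload on host/device attributes instead of #if __CUDA_ARCH__
__host__ void foo() { /* CPU implementation */ }
__device__ void foo() { /* GPU implementation */ }

__host__ __device__ void bar() {
  foo();  // clang resolves to the same-side foo() in each compilation
}
```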

Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. On one hand, a kernel launch is already a form of runtime dispatch, only it's the CUDA runtime that does the dispatching, based on the GPU one attempts to run the kernel on. __device__ functions are always compiled for a specific GPU variant. Also, GPU variants often have different instruction sets and can't be mixed together in the same object file at all, so there are no variants left by the time we're running the code: it's already compiled for precisely the GPU we're running on. Almost. Technically, GPUs in the same family do share an instruction set, but I'm not sure runtime dispatch would buy us much there, as the hardware differences are relatively minor.

This revision is now accepted and ready to land.Apr 23 2020, 6:06 PM
rjmccall added inline comments.Apr 23 2020, 7:47 PM
clang/lib/Sema/SemaOverload.cpp
9781

The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolutions only considering same-side functions which will, in effect, completely separate host and device name spaces.

Okay. Well, even if you completely split host and device functions, I think we'd still want to prefer dispatch functions over variant functions before preferring H over HD.

Although... I suppose we *do* want to consider H vs. HD before looking at the more arbitrary factors that isBetterMultiversionCandidate looks at, like the number of architectures in the dispatch. Honestly, though, those just seem like bad rules that we should drop from the code.

Run-time dispatch is also somewhat irrelevant to CUDA. Sort of.

I understand that there's very little reason (or even ability) to use multiversioning in device code, but it can certainly happen in host code, right? Still, I guess the easiest thing would just be to forbid multiversioned functions on the device.

yaxunl updated this revision to Diff 259796.Apr 23 2020, 9:04 PM
yaxunl marked an inline comment as done.

Revised per John's comments.

yaxunl marked 6 inline comments as done.Apr 24 2020, 5:21 AM
yaxunl added inline comments.
clang/lib/Sema/SemaOverload.cpp
9781

Will change back the precedence of multiversion to be over host/device.

yaxunl updated this revision to Diff 259870.Apr 24 2020, 5:24 AM
yaxunl marked an inline comment as done.

change the precedence of multiversion to be over host/device-ness.

tra added inline comments.Apr 24 2020, 9:46 AM
clang/lib/Sema/SemaOverload.cpp
9781

@rjmccall I'm OK with your reasoning & this patch. As long as the change does not break existing code, I'm fine.

rjmccall accepted this revision.Apr 24 2020, 9:48 AM

Thanks, Yaxun. LGTM.

@tra Is it OK if I commit it now, or is it better to wait until next Monday morning? Thanks.

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project.Apr 24 2020, 12:26 PM
tra added a comment.Apr 24 2020, 1:03 PM

Go ahead. I'll revert it if it breaks anything on our side.

In D77954#2002580, @tra wrote:

Go ahead. I'll revert it if it breaks anything on our side.

Thanks. Done by b46b1a916d44216f0c70de55ae2123eb9de69027

Sorry -- this change broke overload resolution for operator new, as it is declared in system headers. I'm reverting the patch.

$ cat /tmp/in.cu.cc
#define __device__ __attribute__((device))
void* operator new(__SIZE_TYPE__ size);
__device__ void *operator new(__SIZE_TYPE__ size);
void *x = new int;
$ clang -fsyntax-only --cuda-device-only --target=x86_64-grtev4-linux-gnu -x cuda -nocudalib -nocudainc -std=gnu++17 /tmp/in.cu.cc
/tmp/in.cu.cc:4:11: error: call to 'operator new' is ambiguous
void *x = new int;
          ^
/tmp/in.cu.cc:2:7: note: candidate function
void* operator new(__SIZE_TYPE__ size);
      ^
/tmp/in.cu.cc:3:18: note: candidate function
__device__ void *operator new(__SIZE_TYPE__ size);
                 ^
1 error generated when compiling for sm_20.


Thanks. Fixed in https://reviews.llvm.org/D78970

tra added a comment.May 5 2020, 11:53 AM

It appears that re-landed b46b1a916d44216f0c70de55ae2123eb9de69027 has created another compilation regression. I don't have a simple reproducer yet, so here's the error message for now:

llvm_unstable/toolchain/bin/../include/c++/v1/tuple:232:15: error: call to implicitly-deleted copy constructor of 'std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>'
            : __value_(_VSTD::forward<_Tp>(__t))
              ^        ~~~~~~~~~~~~~~~~~~~~~~~~
llvm_unstable/toolchain/bin/../include/c++/v1/tuple:388:13: note: in instantiation of function template specialization 'std::__u::__tuple_leaf<0, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, false>::__tuple_leaf<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, void>' requested here
            __tuple_leaf<_Uf, _Tf>(_VSTD::forward<_Up>(__u))...,
            ^
llvm_unstable/toolchain/bin/../include/c++/v1/tuple:793:15: note: in instantiation of function template specialization 'std::__u::__tuple_impl<std::__u::__tuple_indices<0, 1>, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>::__tuple_impl<0, 1, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>' requested here
            : __base_(typename __make_tuple_indices<sizeof...(_Up)>::type(),
              ^
llvm_unstable/toolchain/bin/../include/c++/v1/thread:297:17: note: in instantiation of function template specialization 'std::__u::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>, false, false>' requested here
            new _Gp(std::move(__tsp),
                ^
./third_party/eigen3/unsupported/Eigen/CXX11/src/ThreadPool/ThreadEnvironment.h:24:42: note: in instantiation of function template specialization 'std::__u::thread::thread<std::__u::function<void ()>, void>' requested here
    EnvThread(std::function<void()> f) : thr_(std::move(f)) {}
                                         ^
llvm_unstable/toolchain/bin/../include/c++/v1/memory:2583:3: note: copy constructor is implicitly deleted because 'unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>' has a user-declared move constructor
  unique_ptr(unique_ptr&& __u) _NOEXCEPT
  ^
1 error generated when compiling for sm_60.
yaxunl added a comment.May 5 2020, 2:04 PM
In D77954#2021026, @tra wrote:

It appears that re-landed b46b1a916d44216f0c70de55ae2123eb9de69027 has created another compilation regression (error message quoted above).

Implicit __host__ __device__ functions may be promoted by a pragma, but may not themselves be qualified to be __host__ __device__ functions.

Since they are promoted from host functions, they are good citizens in host compilation, but may incur diagnostics in device compilation, because their callees may be missing on the device side. Since we cannot defer all of the diagnostics, once that happens, we are doomed.

So now we can understand the reason for the previous behavior: in a __host__ __device__ function, a same-side candidate is always preferred over a wrong-sided candidate; however, a __device__ __host__ candidate is not preferred over a wrong-sided candidate. Instead, their other properties take precedence, and only if all else is equal is the __device__ __host__ candidate preferred over the wrong-sided one.

I will put in a workaround: in device compilation, in implicit __device__ __host__ callers, I will keep the old behavior, that is, an implicit __device__ __host__ candidate has equal preference with a wrong-sided candidate. By doing this, we will in most cases resolve the overload the same way as if the callers and callees were host functions, i.e. the same way as in their expected environment. This ensures that: 1. we will not end up with no viable candidate; 2. we will not have ambiguity, since we know the call is resolvable in host compilation.

For explicit __device__ __host__ functions, we do not need the workaround, since they are intended for both host and device and are supposed to work for both.
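For reference, clang's pragma-based mass-annotation mentioned earlier in this thread looks roughly like this; a function promoted this way is implicitly __host__ __device__ even though its callees may be host-only (function names are hypothetical):

```cuda
int helper();  // plain host function; no __device__ overload exists

#pragma clang force_cuda_host_device begin
// Everything declared here is treated as implicitly __host__ __device__.
int promoted() { return helper(); }  // fine on host; on device this is a
                                     // wrong-sided call, diagnosed only if
                                     // promoted() is actually emitted
#pragma clang force_cuda_host_device end
```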

tra added a comment.May 5 2020, 2:23 PM


LMK when you have something. I can give it a spin internally.