This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang/
-
lib/Sema/
-
Sema/
16/17
SemaOverload.cpp
-
test/SemaCUDA/
-
SemaCUDA/
-
function-overload.cu

Differential D77954

[CUDA][HIP] Fix host/device based overload resolution
ClosedPublic

Authored by yaxunl on Apr 11 2020, 1:02 PM.

Download Raw Diff

Details

Reviewers

tra
rjmccall
echristo

Commits

rGc77a4078e010: [CUDA][HIP] Fix host/device based overload resolution

Summary

Currently clang fails to compile the following CUDA program in device compilation:

__host__ int foo(int x) {
     return 1;
}

template<class T>
__device__ __host__ int foo(T x) {
    return 2;
}

__device__ __host__ int bar() {
    return foo(1);
}

__global__ void test(int *a) {
    *a = bar();
}

This is due to foo is resolved to the __host__ foo instead of __device__ __host__ foo.
This seems to be a bug since __device__ __host__ foo is a viable callee for foo whereas
clang is unable to choose it.

nvcc has similar issue

https://cuda.godbolt.org/z/bGijLc

Although it only emits a warning and does not fail to compile. It emits a trap in the code
so that it will fail at run time.

This patch fixes that.

Diff Detail

Event Timeline

yaxunl created this revision.Apr 11 2020, 1:02 PM

yaxunl edited the summary of this revision. (Show Details)Apr 11 2020, 1:10 PM

If nvcc ignores host/device-ness when selecting overloads, that's probably the specified behavior, right? I agree that it would be better to not ignore it, but Clang shouldn't just make up better rules for languages with external specifications.

In D77954#1976294, @rjmccall wrote:

If nvcc ignores host/device-ness when selecting overloads, that's probably the specified behavior, right? I agree that it would be better to not ignore it, but Clang shouldn't just make up better rules for languages with external specifications.

cuda-clang does not always follow nvcc's behavior. For example, cuda-clang only allows incomplete array type for extern shared variables, whereas nvcc allows other types. If cuda-clang is supposed to follow nvcc's behavior in every aspects, we should approve https://reviews.llvm.org/D73979 , but it is not the case.

Therefore, I think we should discuss whether this is really a bug, and whether the fix can cause any unwanted side effect.

In D77954#1976316, @yaxunl wrote:

In D77954#1976294, @rjmccall wrote:

If nvcc ignores host/device-ness when selecting overloads, that's probably the specified behavior, right? I agree that it would be better to not ignore it, but Clang shouldn't just make up better rules for languages with external specifications.

cuda-clang does not always follow nvcc's behavior. For example, cuda-clang only allows incomplete array type for extern shared variables, whereas nvcc allows other types. If cuda-clang is supposed to follow nvcc's behavior in every aspects, we should approve https://reviews.llvm.org/D73979 , but it is not the case.

Therefore, I think we should discuss whether this is really a bug, and whether the fix can cause any unwanted side effect.

BTW cuda-clang is already quite different than nvcc regarding host/device-based overloading resolution. For example, the following code is valid in cuda-clang before my change but invalid in nvcc https://cuda.godbolt.org/z/qwpKZe . So if we want to follow nvcc's resolution rule we need a total revamp of device/host related resolution in cuda-clang.

__host__ int foo(int x) {
     return 1;
}

template<class T>
__device__ int foo(T x) {
    return 2;
}

__device__ int bar() {
    return foo(1);
}

__global__ void test(int *a) {
    *a = bar();
}

I'm not saying that we need to be bug-for-bug-compatible with nvcc, I'm just saying that we should be able to point to *something* to justify our behavior. I take it that the CUDA spec has rules for some amount of host/device-based overloading? What are they based on?

In D77954#1976378, @rjmccall wrote:

I'm not saying that we need to be bug-for-bug-compatible with nvcc, I'm just saying that we should be able to point to *something* to justify our behavior. I take it that the CUDA spec has rules for some amount of host/device-based overloading? What are they based on?

I checked CUDA SDK documentation and did not find useful information about overloading resolution based on host/device attributes. I guess the rule can only be deduced from nvcc behavior.

Based on https://reviews.llvm.org/D12453, https://reviews.llvm.org/D18416, and https://bcain-llvm.readthedocs.io/projects/llvm/en/latest/CompileCudaWithLLVM/#overloading-based-on-host-and-device-attributes, cuda-clang has different overload resolution rules based host/device attributes. This is intentional design decision.

In D77954#1976456, @yaxunl wrote:

In D77954#1976378, @rjmccall wrote:

I'm not saying that we need to be bug-for-bug-compatible with nvcc, I'm just saying that we should be able to point to *something* to justify our behavior. I take it that the CUDA spec has rules for some amount of host/device-based overloading? What are they based on?

I checked CUDA SDK documentation and did not find useful information about overloading resolution based on host/device attributes. I guess the rule can only be deduced from nvcc behavior.

Based on https://reviews.llvm.org/D12453, https://reviews.llvm.org/D18416, and https://bcain-llvm.readthedocs.io/projects/llvm/en/latest/CompileCudaWithLLVM/#overloading-based-on-host-and-device-attributes, cuda-clang has different overload resolution rules based host/device attributes. This is intentional design decision.

Okay, thanks, that's all I needed. We don't need to re-litigate it.

That spec says that there's a preference given to functions according to host/device-ness. The question, then, is how that actually interacts with the normal overload resolution rules. The "deletion" approach suggests that it's meant to be the most important thing in the comparison. It seems to me that, given the wording of the specification, deletion is the wrong implementation approach, and that instead this check should just be performed in isBetterOverloadCandidate so that a candidate that better matches the host/device-ness of the caller is always considered a better candidate.

Revised by John's comments.

rjmccall added inline comments.Apr 12 2020, 8:47 PM

clang/lib/Sema/SemaOverload.cpp
9481	Please add `[CUDA]` or something similar to the top of this comment so that readers can immediately know that it's dialect-specific. At a high level, this part of the rule is essentially saying that CUDA non-emittability is a kind of non-viability. Should we just make non-emittable functions get flagged as non-viable (which will avoid a lot of relatively expensive conversion checking), or is it important to be able to select non-emittable candidates over candidates that are non-viable for other reasons?
9726	If we move anything below this check, it needs to figure out a tri-state so that it can return false if `Cand2` is a better candidate than `Cand1`. Now, that only matters if multiversion functions are supported under CUDA, but if you're relying on them not being supported, that should at least be commented on.
9729	Okay, let's think about the right place to put this check in the ordering; we don't want different extensions to get into a who-comes-last competition. Certainly this should have lower priority than the standard-defined preferences like argument conversion ranks or `enable_if` partial-ordering. The preference for pass-object-size parameters is probably most similar to a type-based-overloading decision and so should take priority. I would say that this should take priority over function multi-versioning. Function multi-versioning is all about making specialized versions of the "same function", whereas I think host/device overloading is meant to be semantically broader than that. What do you think? Regardless, the rationale for the order should be explained in comments.

yaxunl marked 6 inline comments as done.Apr 13 2020, 7:04 AM

yaxunl added inline comments.

clang/lib/Sema/SemaOverload.cpp
9481	There are two situations for "bad" callees: the callee should never be called. It is not just invalid call in codegen, but also invalid call in AST. e.g. a host function call a device function. In CUDA call preference, it is termed "never". And clang already removed such callees from overload candidates. the callee should not be called in codegen, but may be called in AST. This happens with `__host__ __device__` functions when calling a "wrong sided" function. e.g. in device compilation, a `__host__ __device__` function calls a `__host__` function. This is valid in AST since the `__host__ __device__` function may be an inline function which is only called by a `__host__` function. There is a deferred diagnostic for the wrong-sided call, which is triggered only if the caller is emitted. However in overloading resolution, if no better candidates are available, wrong-sided candidates are still viable.
9726	multiversion host functions is orthogonal to CUDA therefore should be supported. multiversion in device, host device, and global functions are not supported. However this change does not make things worse, and should continue to work if they are supported. host/device based overloading resolution is mostly for determining viability of a function. If two functions are both viable, other factors should take precedence in preference. This general rule has been taken for cases other than multiversion, I think it should also apply to multiversion. I will make isBetterMultiversionCandidate three states.
9729	I will add comments for the rationale of preference. I commented the preference between multiversion and host/device in another comment.

fix preference for multiversion. add comments. add more tests for wrong-sided function.

rjmccall added a reviewer: echristo.Apr 13 2020, 10:14 AM

rjmccall added a subscriber: echristo.

rjmccall added inline comments.

clang/lib/Sema/SemaOverload.cpp
9481	Oh, I see what you're saying; sorry, I mis-read the code. So anything with a preference worse than wrong-sided is outright non-viable; there's a very strong preference against wrong-sided calls that takes priority of all of the normal overload-resolution rules; and then there's a very weak preference against non-exact matches that everything else takes priority over. Okay.
9726	This general rule has been taken for cases other than multiversion, I think it should also apply to multiversion. Well, but the multiversion people could say the same: that multiversioning is for picking an alternative among otherwise-identical functions, and HD and H functions are not otherwise-identical. CC'ing @echristo for his thoughts on the right ordering here.

LGTM in principle. That said, my gut feeling is that this patch has a good chance of breaking something in sufficiently convoluted CUDA code like Eigen. When you land this patch, I'd appreciate if you could do it on a workday morning (Pacific time) so I'm around to test it on our code and revert if something unexpected pops up.

On a side note, this case is another point towards having to redo handling of __host__ __device__. There are way too many corner cases all over the place. Things will only get worse as we move towards newer C++ standard where a lot more code becomes constexpr which is implicitly HD. Having calls from HD functions resolve in a different way during host/device compilation is observable and may result in host and device code diverging unexpectedly.

ping

Revised to let host/device take precedence over multiversion, as John suggested.

yaxunl marked 2 inline comments as done.Apr 22 2020, 7:54 PM

Okay, one minor fix.

clang/lib/Sema/SemaOverload.cpp
9389	This is neglecting the case where they're both invalid.

echristo added a subscriber: erichkeane.Apr 23 2020, 11:03 AM

echristo added inline comments.

clang/lib/Sema/SemaOverload.cpp
9726	Adding @erichkeane here as well. I think this makes sense, but I can see a reason to multiversion a function that will run on host and device. A version of some matrix mult that takes advantage of 3 host architectures and one cuda one? Am I missing something here?

yaxunl marked an inline comment as done.Apr 23 2020, 12:19 PM

yaxunl added inline comments.

clang/lib/Sema/SemaOverload.cpp
9726	My understanding is that a multiversion function is for a specific cpu(gpu). Let's say we want to have a function f for gfx900, gfx906, sandybridge, ivybridge, shouldn't they be more like __host__ __attribute__((cpu_specific(sandybridge))) f(); __host__ __attribute__((cpu_specific(ivybridge))) f(); __device__ __attribute__((cpu_specific(gfx900))) f(); __device__ __attribute__((cpu_specific(gfx906))) f(); instead of all `__device__ __host__` functions?

erichkeane added inline comments.Apr 23 2020, 12:32 PM

clang/lib/Sema/SemaOverload.cpp
9726	IMO, it doesn't make sense for functions to functions be BOTH host and device, they'd have to be just one. Otherwise I'm not sure how the resolver behavior is supposed to work. The whole idea is that the definition is chosen at runtime. Unless host __device void foo(); is TWO declaration chains (meaning two separate AST entries), it doesn't make sense to have multiverison on it (and then, how it would be spelled is awkward/confusing to me). In the above case, if those 4 declarations are not 2 separate root- AST nodes, multiversioning won't work.

rjmccall added inline comments.Apr 23 2020, 4:40 PM

clang/lib/Sema/SemaOverload.cpp
9726	There are certainly functions that ought to be usable from either host or device context — any inline function that just does ordinary language things should be in that category. Also IIUC many declarations are inferred to be `__host__ __device__`, or can be mass-annotated with pragmas, and those reasons are probably the main ones this might matter — we might include a header in CUDA mode that declares a multi-versioned function, and we should handle it right. My read of how CUDA programmers expect this to work is that they see the `__host__` / `__device__` attributes as primarily a mechanism for catching problems where you're using the wrong functions for the current configuration. That is, while we allow overloading by `__host__`/`__device__`-ness, users expect those attributes to mostly be used as a filter for what's "really there" rather than really strictly segregating the namespace. So I would say that CUDA programmers would probably expect the interaction with multiversioning to be: Programmers can put `__host__`, `__device__`, or both on a variant depending on where it was usable. Dispatches should simply ignore any variants that aren't usable for the current configuration. And specifically they would not expect e.g. a `__host__` dispatch function to only consider `__host__` variants — it should be able to dispatch to anything available, which is to say, it should also include `__host__ __device__` variants. Similarly (and probably more usefully), a `__host__ __device__` dispatch function being compiled for the device should also consider pure `__device__` functions, and so on. If we accept that, then I think it gives us a much better idea for how to resolve the priority of the overload rules. The main impact of `isBetterMultiversionCandidate` is to try to ensure that we're looking at the `__attribute__((cpu_dispatch))` function instead of one of the `__attribute__((cpu_specific))` variants. (It has no effect on `__attribute__((target))` multi-versioning, mostly because it doesn't need to: target-specific variants don't show up in lookup with `__attribute__((target))`.) That rule should take precedence over the CUDA preference for exact matches, because e.g. if we're compiling this: __host__ __device__ int magic(void) __attribute__((cpu_dispatch("..."))); __host__ __device__ int magic(void) __attribute__((cpu_specific(generic))); __host__ int magic(void) __attribute__((cpu_specific(mmx))); __host__ int magic(void) __attribute__((cpu_specific(sse))); __device__ int magic(void) __attribute__((cpu_specific(some_device_feature))); __device__ int magic(void) __attribute__((cpu_specific(some_other_device_feature))); then we don't want the compiler to prefer a CPU-specific variant over the dispatch function just because one of the variant was marked `__host__`.

tra accepted this revision.Apr 23 2020, 6:06 PM

tra added inline comments.

clang/lib/Sema/SemaOverload.cpp
9726	It's a bit more complicated and a bit less straightforward than that. :-( https://goo.gl/EXnymm Handling of target attributes is where clang is very different from the NVCC, so no matter which mental model of "CUDA programmer" you pick, there's another one which will not match. In the existing code `__host__ __device__` is commonly used as a sledgehammer to work around NVCC's limitations. It does not allow attribute-based overloading, so the only way you can specialize a function for host/device is via something like this: __host__ __device__ void foo() { #if __CUDA_ARCH__ > 0 // GPU code #else // CPU code. #endif } With clang you can write separate overloaded functions and we'll do our best to pick the one you meant to call. Alas, there are cases where it's ambiguous and depends on the callee's attributes, which may depend on theirs. When something ends up being called from different contexts, interesting things start happening. With more functions becoming constexpr (those are implicitly HD), we'll be running into such impossible-to-do-the-right-thing situations more often. The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolutions only considering same-side functions which will, in effect, completely separate host and device name spaces. Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. On one hand kernel launch is already a form of runtime dispatch, only it's CUDA runtime does the dispatching based on the GPU one attempts to run the kernel on. `__device__` functions are always compiled for the specific GPU variant. Also, GPU variants often have different instruction sets and can't be mixed together in the same object file at all, so there's no variants once we're running the code as it's already compiled for precisely the GPU we're running on. Almost. Technically GPUs in the same family do share the same instruction sets, but I'm not sure runtime dispatch would buy us much there as the hardware differences are relatively minor.

This revision is now accepted and ready to land.Apr 23 2020, 6:06 PM

rjmccall added inline comments.Apr 23 2020, 7:47 PM

clang/lib/Sema/SemaOverload.cpp
9726	The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolutions only considering same-side functions which will, in effect, completely separate host and device name spaces. Okay. Well, even if you completely split host and device functions, I think we'd still want to prefer dispatch functions over variant functions before preferring H over HD. Although... I suppose we do want to consider H vs. HD before looking at the more arbitrary factors that `isBetterMultiversionCandidate` looks at, like the number of architectures in the dispatch. Honestly, though, those just seem like bad rules that we should drop from the code. Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. I understand that there's very little reason (or even ability) to use multiversioning in device code, but it can certainly happen in host code, right? Still, I guess the easiest thing would just be to forbid multiversioned functions on the device.

Revised by John's comments.

yaxunl marked 6 inline comments as done.Apr 24 2020, 5:21 AM

yaxunl added inline comments.

clang/lib/Sema/SemaOverload.cpp
9726	Will change back the precedence of multiversion to be over host/device.

change the precedence of multiversion to be over host/device-ness.

tra added inline comments.Apr 24 2020, 9:46 AM

clang/lib/Sema/SemaOverload.cpp
9726	@rjmccall I'm OK with your reasoning & this patch. As long as the change does not break existing code, I'm fine.

Thanks, Yaxun. LGTM.

@tra Is it OK I commit it now? Or better wait next Monday morning? Thanks.

Closed by commit rGc77a4078e010: [CUDA][HIP] Fix host/device based overload resolution (authored by yaxunl). · Explain WhyApr 24 2020, 12:26 PM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. · View Herald TranscriptApr 24 2020, 12:26 PM

Go ahead. I'll revert it if it breaks anything on our side.

In D77954#2002580, @tra wrote:

Go ahead. I'll revert it if it breaks anything on our side.

Thanks. Done by b46b1a916d44216f0c70de55ae2123eb9de69027

Sorry -- this change broke overload resolution for operator new, as it is declared in system headers. I'm reverting the patch.

$ cat /tmp/in.cu.cc
#define __device__ __attribute__((device))
void* operator new(__SIZE_TYPE__ size);
__device__ void *operator new(__SIZE_TYPE__ size);
void *x = new int;
$ clang -fsyntax-only --cuda-device-only --target=x86_64-grtev4-linux-gnu -x cuda -nocudalib -nocudainc -std=gnu++17 /tmp/in.cu.cc
/tmp/in.cu.cc:4:11: error: call to 'operator new' is ambiguous
void *x = new int;
          ^
/tmp/in.cu.cc:2:7: note: candidate function
void* operator new(__SIZE_TYPE__ size);
      ^
/tmp/in.cu.cc:3:18: note: candidate function
__device__ void *operator new(__SIZE_TYPE__ size);
                 ^
1 error generated when compiling for sm_20.

In D77954#2005313, @gribozavr2 wrote:

Sorry -- this change broke overload resolution for operator new, as it is declared in system headers. I'm reverting the patch.

$ cat /tmp/in.cu.cc
#define __device__ __attribute__((device))
void* operator new(__SIZE_TYPE__ size);
__device__ void *operator new(__SIZE_TYPE__ size);
void *x = new int;
$ clang -fsyntax-only --cuda-device-only --target=x86_64-grtev4-linux-gnu -x cuda -nocudalib -nocudainc -std=gnu++17 /tmp/in.cu.cc
/tmp/in.cu.cc:4:11: error: call to 'operator new' is ambiguous
void *x = new int;
          ^
/tmp/in.cu.cc:2:7: note: candidate function
void* operator new(__SIZE_TYPE__ size);
      ^
/tmp/in.cu.cc:3:18: note: candidate function
__device__ void *operator new(__SIZE_TYPE__ size);
                 ^
1 error generated when compiling for sm_20.

Thanks. Fixed in https://reviews.llvm.org/D78970

It appears that re-landed b46b1a916d44216f0c70de55ae2123eb9de69027 has created another compilation regression. I don't have a simple reproducer yet, so here's the error message for now:

llvm_unstable/toolchain/bin/../include/c++/v1/tuple:232:15: error: call to implicitly-deleted copy constructor of 'std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>'
            : __value_(_VSTD::forward<_Tp>(__t))
              ^        ~~~~~~~~~~~~~~~~~~~~~~~~
llvm_unstable/toolchain/bin/../include/c++/v1/tuple:388:13: note: in instantiation of function template specialization 'std::__u::__tuple_leaf<0, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, false>::__tuple_leaf<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, void>' requested here
            __tuple_leaf<_Uf, _Tf>(_VSTD::forward<_Up>(__u))...,
            ^
llvm_unstable/toolchain/bin/../include/c++/v1/tuple:793:15: note: in instantiation of function template specialization 'std::__u::__tuple_impl<std::__u::__tuple_indices<0, 1>, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>::__tuple_impl<0, 1, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>' requested here
            : __base_(typename __make_tuple_indices<sizeof...(_Up)>::type(),
              ^
llvm_unstable/toolchain/bin/../include/c++/v1/thread:297:17: note: in instantiation of function template specialization 'std::__u::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>, false, false>' requested here
            new _Gp(std::move(__tsp),
                ^
./third_party/eigen3/unsupported/Eigen/CXX11/src/ThreadPool/ThreadEnvironment.h:24:42: note: in instantiation of function template specialization 'std::__u::thread::thread<std::__u::function<void ()>, void>' requested here
    EnvThread(std::function<void()> f) : thr_(std::move(f)) {}
                                         ^
llvm_unstable/toolchain/bin/../include/c++/v1/memory:2583:3: note: copy constructor is implicitly deleted because 'unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>' has a user-declared move constructor
  unique_ptr(unique_ptr&& __u) _NOEXCEPT
  ^
1 error generated when compiling for sm_60.

In D77954#2021026, @tra wrote:

It appears that re-landed b46b1a916d44216f0c70de55ae2123eb9de69027 has created another compilation regression. I don't have a simple reproducer yet, so here's the error message for now:

llvm_unstable/toolchain/bin/../include/c++/v1/tuple:232:15: error: call to implicitly-deleted copy constructor of 'std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>'
            : __value_(_VSTD::forward<_Tp>(__t))
              ^        ~~~~~~~~~~~~~~~~~~~~~~~~
llvm_unstable/toolchain/bin/../include/c++/v1/tuple:388:13: note: in instantiation of function template specialization 'std::__u::__tuple_leaf<0, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, false>::__tuple_leaf<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, void>' requested here
            __tuple_leaf<_Uf, _Tf>(_VSTD::forward<_Up>(__u))...,
            ^
llvm_unstable/toolchain/bin/../include/c++/v1/tuple:793:15: note: in instantiation of function template specialization 'std::__u::__tuple_impl<std::__u::__tuple_indices<0, 1>, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>::__tuple_impl<0, 1, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>' requested here
            : __base_(typename __make_tuple_indices<sizeof...(_Up)>::type(),
              ^
llvm_unstable/toolchain/bin/../include/c++/v1/thread:297:17: note: in instantiation of function template specialization 'std::__u::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>, false, false>' requested here
            new _Gp(std::move(__tsp),
                ^
./third_party/eigen3/unsupported/Eigen/CXX11/src/ThreadPool/ThreadEnvironment.h:24:42: note: in instantiation of function template specialization 'std::__u::thread::thread<std::__u::function<void ()>, void>' requested here
    EnvThread(std::function<void()> f) : thr_(std::move(f)) {}
                                         ^
llvm_unstable/toolchain/bin/../include/c++/v1/memory:2583:3: note: copy constructor is implicitly deleted because 'unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>' has a user-declared move constructor
  unique_ptr(unique_ptr&& __u) _NOEXCEPT
  ^
1 error generated when compiling for sm_60.

For implicit __host__ __device__ functions, they may be promoted by pragma but themselves may not be qualified as __host__ __device__ functions.

Since they are promoted from host functions, they are good citizens in host compilation, but may incur diagnostics in device compilation, because their callees may be missing in device side. Since we cannot defer all the diagnostics, once such things happen, we are doomed.

So now we can understand why the previous behavior: that is, in a __host__ __device__ function, same-side candidate is always preferred over wrong-sided candidate. However, __device__ __host__ candidate is not preferred over wrong-sided candidate. On the other hand, their other properties take precedence. Only if all others are equal, __device__ __host__ candidate is preferred over wrong-sided candidate.

I will put a workaround: In device compilation, in implicit __device__ __host__ callers, I will keep the old behavior, that is, implicit __device__ __host__ candidate has equal preference with wrong-sided candidate. By doing this, we will in most cases resolve the overloading the same way as if the callers and callees are host functions, therefore resolved the same way as in their expected environment. This will make sure: 1. we will not end up with no viable candidate 2. we will not have ambiguity, since we know it is resolvable in host compilation.

For explicit __device__ __host__ functions, we do not need the workaround, since they are intended for host and device and are supposed to work for both host and device.

FYI, I've just reverted it in bf6a26b066382e0f41bf023c781d84061c542307

In D77954#2021299, @yaxunl wrote:

I will put a workaround: In device compilation, in implicit __device__ __host__ callers, I will keep the old behavior, that is, implicit __device__ __host__ candidate has equal preference with wrong-sided candidate. By doing this, we will in most cases resolve the overloading the same way as if the callers and callees are host functions, therefore resolved the same way as in their expected environment. This will make sure: 1. we will not end up with no viable candidate 2. we will not have ambiguity, since we know it is resolvable in host compilation.

LMK when you have something. I can give it a spin internally.

tra added a reverting change: rGbf6a26b06638: Revert D77954 -- it breaks Eigen & Tensorflow..May 5 2020, 2:36 PM

yaxunl mentioned this in D79526: [CUDA][HIP] Workaround for resolving host device function against wrong-sided function.May 6 2020, 4:03 PM

yaxunl mentioned this in D84364: [CUDA][HIP] Defer overloading resolution diagnostics for host device functions.Aug 6 2020, 3:08 PM

Revision Contents

Path

Size

clang/

lib/

Sema/

SemaOverload.cpp

6 lines

test/

SemaCUDA/

function-overload.cu

11 lines

Diff 256781

clang/lib/Sema/SemaOverload.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 9,380 Lines • ▼ Show 20 Lines	if (!Cand1.Function \|\| !Cand1.Function->isMultiVersion() \|\| !Cand2.Function \|\|
return false;		return false;

// If Cand1 is invalid, it cannot be a better match, if Cand2 is invalid, this		// If Cand1 is invalid, it cannot be a better match, if Cand2 is invalid, this
// is obviously better.		// is obviously better.
if (Cand1.Function->isInvalidDecl()) return false;		if (Cand1.Function->isInvalidDecl()) return false;
if (Cand2.Function->isInvalidDecl()) return true;		if (Cand2.Function->isInvalidDecl()) return true;

// If this is a cpu_dispatch/cpu_specific multiversion situation, prefer		// If this is a cpu_dispatch/cpu_specific multiversion situation, prefer
// cpu_dispatch, else arbitrarily based on the identifiers.		// cpu_dispatch, else arbitrarily based on the identifiers.
		rjmccallUnsubmitted Done Reply Inline Actions This is neglecting the case where they're both invalid. rjmccall: This is neglecting the case where they're both invalid.
bool Cand1CPUDisp = Cand1.Function->hasAttr<CPUDispatchAttr>();		bool Cand1CPUDisp = Cand1.Function->hasAttr<CPUDispatchAttr>();
bool Cand2CPUDisp = Cand2.Function->hasAttr<CPUDispatchAttr>();		bool Cand2CPUDisp = Cand2.Function->hasAttr<CPUDispatchAttr>();
const auto *Cand1CPUSpec = Cand1.Function->getAttr<CPUSpecificAttr>();		const auto *Cand1CPUSpec = Cand1.Function->getAttr<CPUSpecificAttr>();
const auto *Cand2CPUSpec = Cand2.Function->getAttr<CPUSpecificAttr>();		const auto *Cand2CPUSpec = Cand2.Function->getAttr<CPUSpecificAttr>();

if (!Cand1CPUDisp && !Cand2CPUDisp && !Cand1CPUSpec && !Cand2CPUSpec)		if (!Cand1CPUDisp && !Cand2CPUDisp && !Cand1CPUSpec && !Cand2CPUSpec)
return false;		return false;

▲ Show 20 Lines • Show All 75 Lines • ▼ Show 20 Lines	bool clang::isBetterOverloadCandidate(
if (!Cand2.Viable)		if (!Cand2.Viable)
return Cand1.Viable;		return Cand1.Viable;
else if (!Cand1.Viable)		else if (!Cand1.Viable)
return false;		return false;

// C++ [over.match.best]p1:		// C++ [over.match.best]p1:
//		//
// -- if F is a static member function, ICS1(F) is defined such		// -- if F is a static member function, ICS1(F) is defined such
// that ICS1(F) is neither better nor worse than ICS1(G) for		// that ICS1(F) is neither better nor worse than ICS1(G) for
		rjmccallUnsubmitted Done Reply Inline Actions Please add `[CUDA]` or something similar to the top of this comment so that readers can immediately know that it's dialect-specific. At a high level, this part of the rule is essentially saying that CUDA non-emittability is a kind of non-viability. Should we just make non-emittable functions get flagged as non-viable (which will avoid a lot of relatively expensive conversion checking), or is it important to be able to select non-emittable candidates over candidates that are non-viable for other reasons? rjmccall: Please add `[CUDA]` or something similar to the top of this comment so that readers can…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions There are two situations for "bad" callees: the callee should never be called. It is not just invalid call in codegen, but also invalid call in AST. e.g. a host function call a device function. In CUDA call preference, it is termed "never". And clang already removed such callees from overload candidates. the callee should not be called in codegen, but may be called in AST. This happens with `__host__ __device__` functions when calling a "wrong sided" function. e.g. in device compilation, a `__host__ __device__` function calls a `__host__` function. This is valid in AST since the `__host__ __device__` function may be an inline function which is only called by a `__host__` function. There is a deferred diagnostic for the wrong-sided call, which is triggered only if the caller is emitted. However in overloading resolution, if no better candidates are available, wrong-sided candidates are still viable. yaxunl: There are two situations for "bad" callees: 1. the callee should never be called. It is not…
		rjmccallUnsubmitted Done Reply Inline Actions Oh, I see what you're saying; sorry, I mis-read the code. So anything with a preference worse than wrong-sided is outright non-viable; there's a very strong preference against wrong-sided calls that takes priority of all of the normal overload-resolution rules; and then there's a very weak preference against non-exact matches that everything else takes priority over. Okay. rjmccall: Oh, I see what you're saying; sorry, I mis-read the code. So anything with a preference…
// any function G, and, symmetrically, ICS1(G) is neither		// any function G, and, symmetrically, ICS1(G) is neither
// better nor worse than ICS1(F).		// better nor worse than ICS1(F).
unsigned StartArg = 0;		unsigned StartArg = 0;
if (Cand1.IgnoreObjectArgument \|\| Cand2.IgnoreObjectArgument)		if (Cand1.IgnoreObjectArgument \|\| Cand2.IgnoreObjectArgument)
StartArg = 1;		StartArg = 1;

auto IsIllFormedConversion = [&](const ImplicitConversionSequence &ICS) {		auto IsIllFormedConversion = [&](const ImplicitConversionSequence &ICS) {
// We don't allow incompatible pointer conversions in C++.		// We don't allow incompatible pointer conversions in C++.
▲ Show 20 Lines • Show All 228 Lines • ▼ Show 20 Lines	bool clang::isBetterOverloadCandidate(
bool HasPS1 = Cand1.Function != nullptr &&		bool HasPS1 = Cand1.Function != nullptr &&
functionHasPassObjectSizeParams(Cand1.Function);		functionHasPassObjectSizeParams(Cand1.Function);
bool HasPS2 = Cand2.Function != nullptr &&		bool HasPS2 = Cand2.Function != nullptr &&
functionHasPassObjectSizeParams(Cand2.Function);		functionHasPassObjectSizeParams(Cand2.Function);
if (HasPS1 != HasPS2 && HasPS1)		if (HasPS1 != HasPS2 && HasPS1)
return true;		return true;

return isBetterMultiversionCandidate(Cand1, Cand2);		return isBetterMultiversionCandidate(Cand1, Cand2);
}		}
		rjmccallUnsubmitted Done Reply Inline Actions If we move anything below this check, it needs to figure out a tri-state so that it can return false if `Cand2` is a better candidate than `Cand1`. Now, that only matters if multiversion functions are supported under CUDA, but if you're relying on them not being supported, that should at least be commented on. rjmccall: If we move anything below this check, it needs to figure out a tri-state so that it can return…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions multiversion host functions is orthogonal to CUDA therefore should be supported. multiversion in device, host device, and global functions are not supported. However this change does not make things worse, and should continue to work if they are supported. host/device based overloading resolution is mostly for determining viability of a function. If two functions are both viable, other factors should take precedence in preference. This general rule has been taken for cases other than multiversion, I think it should also apply to multiversion. I will make isBetterMultiversionCandidate three states. yaxunl: multiversion host functions is orthogonal to CUDA therefore should be supported. multiversion…
		rjmccallUnsubmitted Done Reply Inline Actions This general rule has been taken for cases other than multiversion, I think it should also apply to multiversion. Well, but the multiversion people could say the same: that multiversioning is for picking an alternative among otherwise-identical functions, and HD and H functions are not otherwise-identical. CC'ing @echristo for his thoughts on the right ordering here. rjmccall: > This general rule has been taken for cases other than multiversion, I think it should also…
		echristoUnsubmitted Done Reply Inline Actions Adding @erichkeane here as well. I think this makes sense, but I can see a reason to multiversion a function that will run on host and device. A version of some matrix mult that takes advantage of 3 host architectures and one cuda one? Am I missing something here? echristo: Adding @erichkeane here as well. I think this makes sense, but I can see a reason to…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions My understanding is that a multiversion function is for a specific cpu(gpu). Let's say we want to have a function f for gfx900, gfx906, sandybridge, ivybridge, shouldn't they be more like __host__ __attribute__((cpu_specific(sandybridge))) f(); __host__ __attribute__((cpu_specific(ivybridge))) f(); __device__ __attribute__((cpu_specific(gfx900))) f(); __device__ __attribute__((cpu_specific(gfx906))) f(); instead of all `__device__ __host__` functions? yaxunl: My understanding is that a multiversion function is for a specific cpu(gpu). Let's say we want…
		erichkeaneUnsubmitted Done Reply Inline Actions IMO, it doesn't make sense for functions to functions be BOTH host and device, they'd have to be just one. Otherwise I'm not sure how the resolver behavior is supposed to work. The whole idea is that the definition is chosen at runtime. Unless host __device void foo(); is TWO declaration chains (meaning two separate AST entries), it doesn't make sense to have multiverison on it (and then, how it would be spelled is awkward/confusing to me). In the above case, if those 4 declarations are not 2 separate root- AST nodes, multiversioning won't work. erichkeane: IMO, it doesn't make sense for functions to functions be BOTH host and device, they'd have to…
		rjmccallUnsubmitted Done Reply Inline Actions There are certainly functions that ought to be usable from either host or device context — any inline function that just does ordinary language things should be in that category. Also IIUC many declarations are inferred to be `__host__ __device__`, or can be mass-annotated with pragmas, and those reasons are probably the main ones this might matter — we might include a header in CUDA mode that declares a multi-versioned function, and we should handle it right. My read of how CUDA programmers expect this to work is that they see the `__host__` / `__device__` attributes as primarily a mechanism for catching problems where you're using the wrong functions for the current configuration. That is, while we allow overloading by `__host__`/`__device__`-ness, users expect those attributes to mostly be used as a filter for what's "really there" rather than really strictly segregating the namespace. So I would say that CUDA programmers would probably expect the interaction with multiversioning to be: Programmers can put `__host__`, `__device__`, or both on a variant depending on where it was usable. Dispatches should simply ignore any variants that aren't usable for the current configuration. And specifically they would not expect e.g. a `__host__` dispatch function to only consider `__host__` variants — it should be able to dispatch to anything available, which is to say, it should also include `__host__ __device__` variants. Similarly (and probably more usefully), a `__host__ __device__` dispatch function being compiled for the device should also consider pure `__device__` functions, and so on. If we accept that, then I think it gives us a much better idea for how to resolve the priority of the overload rules. The main impact of `isBetterMultiversionCandidate` is to try to ensure that we're looking at the `__attribute__((cpu_dispatch))` function instead of one of the `__attribute__((cpu_specific))` variants. (It has no effect on `__attribute__((target))` multi-versioning, mostly because it doesn't need to: target-specific variants don't show up in lookup with `__attribute__((target))`.) That rule should take precedence over the CUDA preference for exact matches, because e.g. if we're compiling this: __host__ __device__ int magic(void) __attribute__((cpu_dispatch("..."))); __host__ __device__ int magic(void) __attribute__((cpu_specific(generic))); __host__ int magic(void) __attribute__((cpu_specific(mmx))); __host__ int magic(void) __attribute__((cpu_specific(sse))); __device__ int magic(void) __attribute__((cpu_specific(some_device_feature))); __device__ int magic(void) __attribute__((cpu_specific(some_other_device_feature))); then we don't want the compiler to prefer a CPU-specific variant over the dispatch function just because one of the variant was marked `__host__`. rjmccall: There are certainly functions that ought to be usable from either host or device context — any…
		traUnsubmitted Done Reply Inline Actions It's a bit more complicated and a bit less straightforward than that. :-( https://goo.gl/EXnymm Handling of target attributes is where clang is very different from the NVCC, so no matter which mental model of "CUDA programmer" you pick, there's another one which will not match. In the existing code `__host__ __device__` is commonly used as a sledgehammer to work around NVCC's limitations. It does not allow attribute-based overloading, so the only way you can specialize a function for host/device is via something like this: __host__ __device__ void foo() { #if __CUDA_ARCH__ > 0 // GPU code #else // CPU code. #endif } With clang you can write separate overloaded functions and we'll do our best to pick the one you meant to call. Alas, there are cases where it's ambiguous and depends on the callee's attributes, which may depend on theirs. When something ends up being called from different contexts, interesting things start happening. With more functions becoming constexpr (those are implicitly HD), we'll be running into such impossible-to-do-the-right-thing situations more often. The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolutions only considering same-side functions which will, in effect, completely separate host and device name spaces. Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. On one hand kernel launch is already a form of runtime dispatch, only it's CUDA runtime does the dispatching based on the GPU one attempts to run the kernel on. `__device__` functions are always compiled for the specific GPU variant. Also, GPU variants often have different instruction sets and can't be mixed together in the same object file at all, so there's no variants once we're running the code as it's already compiled for precisely the GPU we're running on. Almost. Technically GPUs in the same family do share the same instruction sets, but I'm not sure runtime dispatch would buy us much there as the hardware differences are relatively minor. tra: It's a bit more complicated and a bit less straightforward than that. :-( https://goo.
		rjmccallUnsubmitted Done Reply Inline Actions The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolutions only considering same-side functions which will, in effect, completely separate host and device name spaces. Okay. Well, even if you completely split host and device functions, I think we'd still want to prefer dispatch functions over variant functions before preferring H over HD. Although... I suppose we do want to consider H vs. HD before looking at the more arbitrary factors that `isBetterMultiversionCandidate` looks at, like the number of architectures in the dispatch. Honestly, though, those just seem like bad rules that we should drop from the code. Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. I understand that there's very little reason (or even ability) to use multiversioning in device code, but it can certainly happen in host code, right? Still, I guess the easiest thing would just be to forbid multiversioned functions on the device. rjmccall: > The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions Will change back the precedence of multiversion to be over host/device. yaxunl: Will change back the precedence of multiversion to be over host/device.
		traUnsubmitted Not Done Reply Inline Actions @rjmccall I'm OK with your reasoning & this patch. As long as the change does not break existing code, I'm fine. tra: @rjmccall I'm OK with your reasoning & this patch. As long as the change does not break…

/// Determine whether two declarations are "equivalent" for the purposes of		/// Determine whether two declarations are "equivalent" for the purposes of
/// name lookup and overload resolution. This applies when the same internal/no		/// name lookup and overload resolution. This applies when the same internal/no
		rjmccallUnsubmitted Done Reply Inline Actions Okay, let's think about the right place to put this check in the ordering; we don't want different extensions to get into a who-comes-last competition. Certainly this should have lower priority than the standard-defined preferences like argument conversion ranks or `enable_if` partial-ordering. The preference for pass-object-size parameters is probably most similar to a type-based-overloading decision and so should take priority. I would say that this should take priority over function multi-versioning. Function multi-versioning is all about making specialized versions of the "same function", whereas I think host/device overloading is meant to be semantically broader than that. What do you think? Regardless, the rationale for the order should be explained in comments. rjmccall: Okay, let's think about the right place to put this check in the ordering; we don't want…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions I will add comments for the rationale of preference. I commented the preference between multiversion and host/device in another comment. yaxunl: I will add comments for the rationale of preference. I commented the preference between…
/// linkage entity is defined by two modules (probably by textually including		/// linkage entity is defined by two modules (probably by textually including
/// the same header). In such a case, we don't consider the declarations to		/// the same header). In such a case, we don't consider the declarations to
/// declare the same entity, but we also don't want lookups with both		/// declare the same entity, but we also don't want lookups with both
/// declarations visible to be ambiguous in some cases (this happens when using		/// declarations visible to be ambiguous in some cases (this happens when using
/// a modularized libstdc++).		/// a modularized libstdc++).
bool Sema::isEquivalentInternalLinkageDeclaration(const NamedDecl *A,		bool Sema::isEquivalentInternalLinkageDeclaration(const NamedDecl *A,
const NamedDecl *B) {		const NamedDecl *B) {
auto *VA = dyn_cast_or_null<ValueDecl>(A);		auto *VA = dyn_cast_or_null<ValueDecl>(A);
▲ Show 20 Lines • Show All 78 Lines • ▼ Show 20 Lines	OverloadCandidateSet::BestViableFunction(Sema &S, SourceLocation Loc,
// candidate call is WrongSide and the other is SameSide, we ignore		// candidate call is WrongSide and the other is SameSide, we ignore
// the WrongSide candidate.		// the WrongSide candidate.
if (S.getLangOpts().CUDA) {		if (S.getLangOpts().CUDA) {
const FunctionDecl *Caller = dyn_cast<FunctionDecl>(S.CurContext);		const FunctionDecl *Caller = dyn_cast<FunctionDecl>(S.CurContext);
bool ContainsSameSideCandidate =		bool ContainsSameSideCandidate =
llvm::any_of(Candidates, [&](OverloadCandidate *Cand) {		llvm::any_of(Candidates, [&](OverloadCandidate *Cand) {
// Check viable function only.		// Check viable function only.
return Cand->Viable && Cand->Function &&		return Cand->Viable && Cand->Function &&
		(S.IdentifyCUDAPreference(Caller, Cand->Function) ==
		Sema::CFP_SameSide \|\|
S.IdentifyCUDAPreference(Caller, Cand->Function) ==		S.IdentifyCUDAPreference(Caller, Cand->Function) ==
Sema::CFP_SameSide;		Sema::CFP_HostDevice);
});		});
if (ContainsSameSideCandidate) {		if (ContainsSameSideCandidate) {
auto IsWrongSideCandidate = [&](OverloadCandidate *Cand) {		auto IsWrongSideCandidate = [&](OverloadCandidate *Cand) {
// Check viable function only to avoid unnecessary data copying/moving.		// Check viable function only to avoid unnecessary data copying/moving.
return Cand->Viable && Cand->Function &&		return Cand->Viable && Cand->Function &&
S.IdentifyCUDAPreference(Caller, Cand->Function) ==		S.IdentifyCUDAPreference(Caller, Cand->Function) ==
Sema::CFP_WrongSide;		Sema::CFP_WrongSide;
};		};
▲ Show 20 Lines • Show All 5,025 Lines • Show Last 20 Lines

clang/test/SemaCUDA/function-overload.cu

	Show First 20 Lines • Show All 325 Lines • ▼ Show 20 Lines
	__device__ void test_device_calls_template_fn() {			__device__ void test_device_calls_template_fn() {
	DeviceReturnTy ret1 = template_vs_function(1.0f);			DeviceReturnTy ret1 = template_vs_function(1.0f);
	DeviceReturnTy ret2 = template_vs_function(2.0);			DeviceReturnTy ret2 = template_vs_function(2.0);
	}			}

	// If we have a mix of HD and H-only or D-only candidates in the overload set,			// If we have a mix of HD and H-only or D-only candidates in the overload set,
	// normal C++ overload resolution rules apply first.			// normal C++ overload resolution rules apply first.
	template <typename T> TemplateReturnTy template_vs_hd_function(T arg)			template <typename T> TemplateReturnTy template_vs_hd_function(T arg)
	#ifdef __CUDA_ARCH__
	//expected-note@-2 {{declared here}}
	#endif
	{			{
	return TemplateReturnTy();			return TemplateReturnTy();
	}			}
	__host__ __device__ HostDeviceReturnTy template_vs_hd_function(float arg) {			__host__ __device__ HostDeviceReturnTy template_vs_hd_function(float arg) {
	return HostDeviceReturnTy();			return HostDeviceReturnTy();
	}			}

	__host__ __device__ void test_host_device_calls_hd_template() {			__host__ __device__ void test_host_device_calls_hd_template() {
	HostDeviceReturnTy ret1 = template_vs_hd_function(1.0f);
	TemplateReturnTy ret2 = template_vs_hd_function(1);
	#ifdef __CUDA_ARCH__			#ifdef __CUDA_ARCH__
	// expected-error@-2 {{reference to __host__ function 'template_vs_hd_function<int>' in __host__ __device__ function}}			typedef HostDeviceReturnTy ExpectedReturnTy;
				#else
				typedef TemplateReturnTy ExpectedReturnTy;
	#endif			#endif
				HostDeviceReturnTy ret1 = template_vs_hd_function(1.0f);
				ExpectedReturnTy ret2 = template_vs_hd_function(1);
	}			}

	__host__ void test_host_calls_hd_template() {			__host__ void test_host_calls_hd_template() {
	HostDeviceReturnTy ret1 = template_vs_hd_function(1.0f);			HostDeviceReturnTy ret1 = template_vs_hd_function(1.0f);
	TemplateReturnTy ret2 = template_vs_hd_function(1);			TemplateReturnTy ret2 = template_vs_hd_function(1);
	}			}

	__device__ void test_device_calls_hd_template() {			__device__ void test_device_calls_hd_template() {
	▲ Show 20 Lines • Show All 64 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][HIP] Fix host/device based overload resolutionClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 256781

clang/lib/Sema/SemaOverload.cpp

clang/test/SemaCUDA/function-overload.cu

[CUDA][HIP] Fix host/device based overload resolution
ClosedPublic