This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang/
-
lib/Sema/
-
Sema/
16/17
SemaOverload.cpp
-
test/SemaCUDA/
-
SemaCUDA/
-
function-overload.cu

Differential D77954

[CUDA][HIP] Fix host/device based overload resolution
ClosedPublic

Authored by yaxunl on Apr 11 2020, 1:02 PM.

Download Raw Diff

Details

Reviewers

tra
rjmccall
echristo

Commits

rGc77a4078e010: [CUDA][HIP] Fix host/device based overload resolution

Summary

Currently clang fails to compile the following CUDA program in device compilation:

__host__ int foo(int x) {
     return 1;
}

template<class T>
__device__ __host__ int foo(T x) {
    return 2;
}

__device__ __host__ int bar() {
    return foo(1);
}

__global__ void test(int *a) {
    *a = bar();
}

This is due to foo is resolved to the __host__ foo instead of __device__ __host__ foo.
This seems to be a bug since __device__ __host__ foo is a viable callee for foo whereas
clang is unable to choose it.

nvcc has similar issue

https://cuda.godbolt.org/z/bGijLc

Although it only emits a warning and does not fail to compile. It emits a trap in the code
so that it will fail at run time.

This patch fixes that.

Diff Detail

Event Timeline

yaxunl created this revision.Apr 11 2020, 1:02 PM

yaxunl edited the summary of this revision. (Show Details)Apr 11 2020, 1:10 PM

If nvcc ignores host/device-ness when selecting overloads, that's probably the specified behavior, right? I agree that it would be better to not ignore it, but Clang shouldn't just make up better rules for languages with external specifications.

In D77954#1976294, @rjmccall wrote:

If nvcc ignores host/device-ness when selecting overloads, that's probably the specified behavior, right? I agree that it would be better to not ignore it, but Clang shouldn't just make up better rules for languages with external specifications.

cuda-clang does not always follow nvcc's behavior. For example, cuda-clang only allows incomplete array type for extern shared variables, whereas nvcc allows other types. If cuda-clang is supposed to follow nvcc's behavior in every aspects, we should approve https://reviews.llvm.org/D73979 , but it is not the case.

Therefore, I think we should discuss whether this is really a bug, and whether the fix can cause any unwanted side effect.

In D77954#1976316, @yaxunl wrote:

In D77954#1976294, @rjmccall wrote:

If nvcc ignores host/device-ness when selecting overloads, that's probably the specified behavior, right? I agree that it would be better to not ignore it, but Clang shouldn't just make up better rules for languages with external specifications.

cuda-clang does not always follow nvcc's behavior. For example, cuda-clang only allows incomplete array type for extern shared variables, whereas nvcc allows other types. If cuda-clang is supposed to follow nvcc's behavior in every aspects, we should approve https://reviews.llvm.org/D73979 , but it is not the case.

Therefore, I think we should discuss whether this is really a bug, and whether the fix can cause any unwanted side effect.

BTW cuda-clang is already quite different than nvcc regarding host/device-based overloading resolution. For example, the following code is valid in cuda-clang before my change but invalid in nvcc https://cuda.godbolt.org/z/qwpKZe . So if we want to follow nvcc's resolution rule we need a total revamp of device/host related resolution in cuda-clang.

__host__ int foo(int x) {
     return 1;
}

template<class T>
__device__ int foo(T x) {
    return 2;
}

__device__ int bar() {
    return foo(1);
}

__global__ void test(int *a) {
    *a = bar();
}

I'm not saying that we need to be bug-for-bug-compatible with nvcc, I'm just saying that we should be able to point to *something* to justify our behavior. I take it that the CUDA spec has rules for some amount of host/device-based overloading? What are they based on?

In D77954#1976378, @rjmccall wrote:

I'm not saying that we need to be bug-for-bug-compatible with nvcc, I'm just saying that we should be able to point to *something* to justify our behavior. I take it that the CUDA spec has rules for some amount of host/device-based overloading? What are they based on?

I checked CUDA SDK documentation and did not find useful information about overloading resolution based on host/device attributes. I guess the rule can only be deduced from nvcc behavior.

Based on https://reviews.llvm.org/D12453, https://reviews.llvm.org/D18416, and https://bcain-llvm.readthedocs.io/projects/llvm/en/latest/CompileCudaWithLLVM/#overloading-based-on-host-and-device-attributes, cuda-clang has different overload resolution rules based host/device attributes. This is intentional design decision.

In D77954#1976456, @yaxunl wrote:

In D77954#1976378, @rjmccall wrote:

I'm not saying that we need to be bug-for-bug-compatible with nvcc, I'm just saying that we should be able to point to *something* to justify our behavior. I take it that the CUDA spec has rules for some amount of host/device-based overloading? What are they based on?

I checked CUDA SDK documentation and did not find useful information about overloading resolution based on host/device attributes. I guess the rule can only be deduced from nvcc behavior.

Based on https://reviews.llvm.org/D12453, https://reviews.llvm.org/D18416, and https://bcain-llvm.readthedocs.io/projects/llvm/en/latest/CompileCudaWithLLVM/#overloading-based-on-host-and-device-attributes, cuda-clang has different overload resolution rules based host/device attributes. This is intentional design decision.

Okay, thanks, that's all I needed. We don't need to re-litigate it.

That spec says that there's a preference given to functions according to host/device-ness. The question, then, is how that actually interacts with the normal overload resolution rules. The "deletion" approach suggests that it's meant to be the most important thing in the comparison. It seems to me that, given the wording of the specification, deletion is the wrong implementation approach, and that instead this check should just be performed in isBetterOverloadCandidate so that a candidate that better matches the host/device-ness of the caller is always considered a better candidate.

Revised by John's comments.

rjmccall added inline comments.Apr 12 2020, 8:47 PM

clang/lib/Sema/SemaOverload.cpp
9491	Please add `[CUDA]` or something similar to the top of this comment so that readers can immediately know that it's dialect-specific. At a high level, this part of the rule is essentially saying that CUDA non-emittability is a kind of non-viability. Should we just make non-emittable functions get flagged as non-viable (which will avoid a lot of relatively expensive conversion checking), or is it important to be able to select non-emittable candidates over candidates that are non-viable for other reasons?
9781	If we move anything below this check, it needs to figure out a tri-state so that it can return false if `Cand2` is a better candidate than `Cand1`. Now, that only matters if multiversion functions are supported under CUDA, but if you're relying on them not being supported, that should at least be commented on.
9784	Okay, let's think about the right place to put this check in the ordering; we don't want different extensions to get into a who-comes-last competition. Certainly this should have lower priority than the standard-defined preferences like argument conversion ranks or `enable_if` partial-ordering. The preference for pass-object-size parameters is probably most similar to a type-based-overloading decision and so should take priority. I would say that this should take priority over function multi-versioning. Function multi-versioning is all about making specialized versions of the "same function", whereas I think host/device overloading is meant to be semantically broader than that. What do you think? Regardless, the rationale for the order should be explained in comments.

yaxunl marked 6 inline comments as done.Apr 13 2020, 7:04 AM

yaxunl added inline comments.

clang/lib/Sema/SemaOverload.cpp
9491	There are two situations for "bad" callees: the callee should never be called. It is not just invalid call in codegen, but also invalid call in AST. e.g. a host function call a device function. In CUDA call preference, it is termed "never". And clang already removed such callees from overload candidates. the callee should not be called in codegen, but may be called in AST. This happens with `__host__ __device__` functions when calling a "wrong sided" function. e.g. in device compilation, a `__host__ __device__` function calls a `__host__` function. This is valid in AST since the `__host__ __device__` function may be an inline function which is only called by a `__host__` function. There is a deferred diagnostic for the wrong-sided call, which is triggered only if the caller is emitted. However in overloading resolution, if no better candidates are available, wrong-sided candidates are still viable.
9781	multiversion host functions is orthogonal to CUDA therefore should be supported. multiversion in device, host device, and global functions are not supported. However this change does not make things worse, and should continue to work if they are supported. host/device based overloading resolution is mostly for determining viability of a function. If two functions are both viable, other factors should take precedence in preference. This general rule has been taken for cases other than multiversion, I think it should also apply to multiversion. I will make isBetterMultiversionCandidate three states.
9784	I will add comments for the rationale of preference. I commented the preference between multiversion and host/device in another comment.

fix preference for multiversion. add comments. add more tests for wrong-sided function.

rjmccall added a reviewer: echristo.Apr 13 2020, 10:14 AM

rjmccall added a subscriber: echristo.

rjmccall added inline comments.

clang/lib/Sema/SemaOverload.cpp
9491	Oh, I see what you're saying; sorry, I mis-read the code. So anything with a preference worse than wrong-sided is outright non-viable; there's a very strong preference against wrong-sided calls that takes priority of all of the normal overload-resolution rules; and then there's a very weak preference against non-exact matches that everything else takes priority over. Okay.
9781	This general rule has been taken for cases other than multiversion, I think it should also apply to multiversion. Well, but the multiversion people could say the same: that multiversioning is for picking an alternative among otherwise-identical functions, and HD and H functions are not otherwise-identical. CC'ing @echristo for his thoughts on the right ordering here.

LGTM in principle. That said, my gut feeling is that this patch has a good chance of breaking something in sufficiently convoluted CUDA code like Eigen. When you land this patch, I'd appreciate if you could do it on a workday morning (Pacific time) so I'm around to test it on our code and revert if something unexpected pops up.

On a side note, this case is another point towards having to redo handling of __host__ __device__. There are way too many corner cases all over the place. Things will only get worse as we move towards newer C++ standard where a lot more code becomes constexpr which is implicitly HD. Having calls from HD functions resolve in a different way during host/device compilation is observable and may result in host and device code diverging unexpectedly.

ping

Revised to let host/device take precedence over multiversion, as John suggested.

yaxunl marked 2 inline comments as done.Apr 22 2020, 7:54 PM

Okay, one minor fix.

clang/lib/Sema/SemaOverload.cpp
9387	This is neglecting the case where they're both invalid.

echristo added a subscriber: erichkeane.Apr 23 2020, 11:03 AM

echristo added inline comments.

clang/lib/Sema/SemaOverload.cpp
9781	Adding @erichkeane here as well. I think this makes sense, but I can see a reason to multiversion a function that will run on host and device. A version of some matrix mult that takes advantage of 3 host architectures and one cuda one? Am I missing something here?

yaxunl marked an inline comment as done.Apr 23 2020, 12:19 PM

yaxunl added inline comments.

clang/lib/Sema/SemaOverload.cpp
9781	My understanding is that a multiversion function is for a specific cpu(gpu). Let's say we want to have a function f for gfx900, gfx906, sandybridge, ivybridge, shouldn't they be more like __host__ __attribute__((cpu_specific(sandybridge))) f(); __host__ __attribute__((cpu_specific(ivybridge))) f(); __device__ __attribute__((cpu_specific(gfx900))) f(); __device__ __attribute__((cpu_specific(gfx906))) f(); instead of all `__device__ __host__` functions?

erichkeane added inline comments.Apr 23 2020, 12:32 PM

clang/lib/Sema/SemaOverload.cpp
9781	IMO, it doesn't make sense for functions to functions be BOTH host and device, they'd have to be just one. Otherwise I'm not sure how the resolver behavior is supposed to work. The whole idea is that the definition is chosen at runtime. Unless host __device void foo(); is TWO declaration chains (meaning two separate AST entries), it doesn't make sense to have multiverison on it (and then, how it would be spelled is awkward/confusing to me). In the above case, if those 4 declarations are not 2 separate root- AST nodes, multiversioning won't work.

rjmccall added inline comments.Apr 23 2020, 4:40 PM

clang/lib/Sema/SemaOverload.cpp
9781	There are certainly functions that ought to be usable from either host or device context — any inline function that just does ordinary language things should be in that category. Also IIUC many declarations are inferred to be `__host__ __device__`, or can be mass-annotated with pragmas, and those reasons are probably the main ones this might matter — we might include a header in CUDA mode that declares a multi-versioned function, and we should handle it right. My read of how CUDA programmers expect this to work is that they see the `__host__` / `__device__` attributes as primarily a mechanism for catching problems where you're using the wrong functions for the current configuration. That is, while we allow overloading by `__host__`/`__device__`-ness, users expect those attributes to mostly be used as a filter for what's "really there" rather than really strictly segregating the namespace. So I would say that CUDA programmers would probably expect the interaction with multiversioning to be: Programmers can put `__host__`, `__device__`, or both on a variant depending on where it was usable. Dispatches should simply ignore any variants that aren't usable for the current configuration. And specifically they would not expect e.g. a `__host__` dispatch function to only consider `__host__` variants — it should be able to dispatch to anything available, which is to say, it should also include `__host__ __device__` variants. Similarly (and probably more usefully), a `__host__ __device__` dispatch function being compiled for the device should also consider pure `__device__` functions, and so on. If we accept that, then I think it gives us a much better idea for how to resolve the priority of the overload rules. The main impact of `isBetterMultiversionCandidate` is to try to ensure that we're looking at the `__attribute__((cpu_dispatch))` function instead of one of the `__attribute__((cpu_specific))` variants. (It has no effect on `__attribute__((target))` multi-versioning, mostly because it doesn't need to: target-specific variants don't show up in lookup with `__attribute__((target))`.) That rule should take precedence over the CUDA preference for exact matches, because e.g. if we're compiling this: __host__ __device__ int magic(void) __attribute__((cpu_dispatch("..."))); __host__ __device__ int magic(void) __attribute__((cpu_specific(generic))); __host__ int magic(void) __attribute__((cpu_specific(mmx))); __host__ int magic(void) __attribute__((cpu_specific(sse))); __device__ int magic(void) __attribute__((cpu_specific(some_device_feature))); __device__ int magic(void) __attribute__((cpu_specific(some_other_device_feature))); then we don't want the compiler to prefer a CPU-specific variant over the dispatch function just because one of the variant was marked `__host__`.

tra accepted this revision.Apr 23 2020, 6:06 PM

tra added inline comments.

clang/lib/Sema/SemaOverload.cpp
9781	It's a bit more complicated and a bit less straightforward than that. :-( https://goo.gl/EXnymm Handling of target attributes is where clang is very different from the NVCC, so no matter which mental model of "CUDA programmer" you pick, there's another one which will not match. In the existing code `__host__ __device__` is commonly used as a sledgehammer to work around NVCC's limitations. It does not allow attribute-based overloading, so the only way you can specialize a function for host/device is via something like this: __host__ __device__ void foo() { #if __CUDA_ARCH__ > 0 // GPU code #else // CPU code. #endif } With clang you can write separate overloaded functions and we'll do our best to pick the one you meant to call. Alas, there are cases where it's ambiguous and depends on the callee's attributes, which may depend on theirs. When something ends up being called from different contexts, interesting things start happening. With more functions becoming constexpr (those are implicitly HD), we'll be running into such impossible-to-do-the-right-thing situations more often. The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolutions only considering same-side functions which will, in effect, completely separate host and device name spaces. Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. On one hand kernel launch is already a form of runtime dispatch, only it's CUDA runtime does the dispatching based on the GPU one attempts to run the kernel on. `__device__` functions are always compiled for the specific GPU variant. Also, GPU variants often have different instruction sets and can't be mixed together in the same object file at all, so there's no variants once we're running the code as it's already compiled for precisely the GPU we're running on. Almost. Technically GPUs in the same family do share the same instruction sets, but I'm not sure runtime dispatch would buy us much there as the hardware differences are relatively minor.

This revision is now accepted and ready to land.Apr 23 2020, 6:06 PM

rjmccall added inline comments.Apr 23 2020, 7:47 PM

clang/lib/Sema/SemaOverload.cpp
9781	The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolutions only considering same-side functions which will, in effect, completely separate host and device name spaces. Okay. Well, even if you completely split host and device functions, I think we'd still want to prefer dispatch functions over variant functions before preferring H over HD. Although... I suppose we do want to consider H vs. HD before looking at the more arbitrary factors that `isBetterMultiversionCandidate` looks at, like the number of architectures in the dispatch. Honestly, though, those just seem like bad rules that we should drop from the code. Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. I understand that there's very little reason (or even ability) to use multiversioning in device code, but it can certainly happen in host code, right? Still, I guess the easiest thing would just be to forbid multiversioned functions on the device.

Revised by John's comments.

yaxunl marked 6 inline comments as done.Apr 24 2020, 5:21 AM

yaxunl added inline comments.

clang/lib/Sema/SemaOverload.cpp
9781	Will change back the precedence of multiversion to be over host/device.

change the precedence of multiversion to be over host/device-ness.

tra added inline comments.Apr 24 2020, 9:46 AM

clang/lib/Sema/SemaOverload.cpp
9781	@rjmccall I'm OK with your reasoning & this patch. As long as the change does not break existing code, I'm fine.

Thanks, Yaxun. LGTM.

@tra Is it OK I commit it now? Or better wait next Monday morning? Thanks.

Closed by commit rGc77a4078e010: [CUDA][HIP] Fix host/device based overload resolution (authored by yaxunl). · Explain WhyApr 24 2020, 12:26 PM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. · View Herald TranscriptApr 24 2020, 12:26 PM

Go ahead. I'll revert it if it breaks anything on our side.

In D77954#2002580, @tra wrote:

Go ahead. I'll revert it if it breaks anything on our side.

Thanks. Done by b46b1a916d44216f0c70de55ae2123eb9de69027

Sorry -- this change broke overload resolution for operator new, as it is declared in system headers. I'm reverting the patch.

$ cat /tmp/in.cu.cc
#define __device__ __attribute__((device))
void* operator new(__SIZE_TYPE__ size);
__device__ void *operator new(__SIZE_TYPE__ size);
void *x = new int;
$ clang -fsyntax-only --cuda-device-only --target=x86_64-grtev4-linux-gnu -x cuda -nocudalib -nocudainc -std=gnu++17 /tmp/in.cu.cc
/tmp/in.cu.cc:4:11: error: call to 'operator new' is ambiguous
void *x = new int;
          ^
/tmp/in.cu.cc:2:7: note: candidate function
void* operator new(__SIZE_TYPE__ size);
      ^
/tmp/in.cu.cc:3:18: note: candidate function
__device__ void *operator new(__SIZE_TYPE__ size);
                 ^
1 error generated when compiling for sm_20.

In D77954#2005313, @gribozavr2 wrote:

Sorry -- this change broke overload resolution for operator new, as it is declared in system headers. I'm reverting the patch.

$ cat /tmp/in.cu.cc
#define __device__ __attribute__((device))
void* operator new(__SIZE_TYPE__ size);
__device__ void *operator new(__SIZE_TYPE__ size);
void *x = new int;
$ clang -fsyntax-only --cuda-device-only --target=x86_64-grtev4-linux-gnu -x cuda -nocudalib -nocudainc -std=gnu++17 /tmp/in.cu.cc
/tmp/in.cu.cc:4:11: error: call to 'operator new' is ambiguous
void *x = new int;
          ^
/tmp/in.cu.cc:2:7: note: candidate function
void* operator new(__SIZE_TYPE__ size);
      ^
/tmp/in.cu.cc:3:18: note: candidate function
__device__ void *operator new(__SIZE_TYPE__ size);
                 ^
1 error generated when compiling for sm_20.

Thanks. Fixed in https://reviews.llvm.org/D78970

It appears that re-landed b46b1a916d44216f0c70de55ae2123eb9de69027 has created another compilation regression. I don't have a simple reproducer yet, so here's the error message for now:

llvm_unstable/toolchain/bin/../include/c++/v1/tuple:232:15: error: call to implicitly-deleted copy constructor of 'std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>'
            : __value_(_VSTD::forward<_Tp>(__t))
              ^        ~~~~~~~~~~~~~~~~~~~~~~~~
llvm_unstable/toolchain/bin/../include/c++/v1/tuple:388:13: note: in instantiation of function template specialization 'std::__u::__tuple_leaf<0, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, false>::__tuple_leaf<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, void>' requested here
            __tuple_leaf<_Uf, _Tf>(_VSTD::forward<_Up>(__u))...,
            ^
llvm_unstable/toolchain/bin/../include/c++/v1/tuple:793:15: note: in instantiation of function template specialization 'std::__u::__tuple_impl<std::__u::__tuple_indices<0, 1>, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>::__tuple_impl<0, 1, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>' requested here
            : __base_(typename __make_tuple_indices<sizeof...(_Up)>::type(),
              ^
llvm_unstable/toolchain/bin/../include/c++/v1/thread:297:17: note: in instantiation of function template specialization 'std::__u::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>, false, false>' requested here
            new _Gp(std::move(__tsp),
                ^
./third_party/eigen3/unsupported/Eigen/CXX11/src/ThreadPool/ThreadEnvironment.h:24:42: note: in instantiation of function template specialization 'std::__u::thread::thread<std::__u::function<void ()>, void>' requested here
    EnvThread(std::function<void()> f) : thr_(std::move(f)) {}
                                         ^
llvm_unstable/toolchain/bin/../include/c++/v1/memory:2583:3: note: copy constructor is implicitly deleted because 'unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>' has a user-declared move constructor
  unique_ptr(unique_ptr&& __u) _NOEXCEPT
  ^
1 error generated when compiling for sm_60.

In D77954#2021026, @tra wrote:

It appears that re-landed b46b1a916d44216f0c70de55ae2123eb9de69027 has created another compilation regression. I don't have a simple reproducer yet, so here's the error message for now:

llvm_unstable/toolchain/bin/../include/c++/v1/tuple:232:15: error: call to implicitly-deleted copy constructor of 'std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>'
            : __value_(_VSTD::forward<_Tp>(__t))
              ^        ~~~~~~~~~~~~~~~~~~~~~~~~
llvm_unstable/toolchain/bin/../include/c++/v1/tuple:388:13: note: in instantiation of function template specialization 'std::__u::__tuple_leaf<0, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, false>::__tuple_leaf<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, void>' requested here
            __tuple_leaf<_Uf, _Tf>(_VSTD::forward<_Up>(__u))...,
            ^
llvm_unstable/toolchain/bin/../include/c++/v1/tuple:793:15: note: in instantiation of function template specialization 'std::__u::__tuple_impl<std::__u::__tuple_indices<0, 1>, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>::__tuple_impl<0, 1, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>, std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>' requested here
            : __base_(typename __make_tuple_indices<sizeof...(_Up)>::type(),
              ^
llvm_unstable/toolchain/bin/../include/c++/v1/thread:297:17: note: in instantiation of function template specialization 'std::__u::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>>::tuple<std::__u::unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>, std::__u::function<void ()>, false, false>' requested here
            new _Gp(std::move(__tsp),
                ^
./third_party/eigen3/unsupported/Eigen/CXX11/src/ThreadPool/ThreadEnvironment.h:24:42: note: in instantiation of function template specialization 'std::__u::thread::thread<std::__u::function<void ()>, void>' requested here
    EnvThread(std::function<void()> f) : thr_(std::move(f)) {}
                                         ^
llvm_unstable/toolchain/bin/../include/c++/v1/memory:2583:3: note: copy constructor is implicitly deleted because 'unique_ptr<std::__u::__thread_struct, std::__u::default_delete<std::__u::__thread_struct>>' has a user-declared move constructor
  unique_ptr(unique_ptr&& __u) _NOEXCEPT
  ^
1 error generated when compiling for sm_60.

For implicit __host__ __device__ functions, they may be promoted by pragma but themselves may not be qualified as __host__ __device__ functions.

Since they are promoted from host functions, they are good citizens in host compilation, but may incur diagnostics in device compilation, because their callees may be missing in device side. Since we cannot defer all the diagnostics, once such things happen, we are doomed.

So now we can understand why the previous behavior: that is, in a __host__ __device__ function, same-side candidate is always preferred over wrong-sided candidate. However, __device__ __host__ candidate is not preferred over wrong-sided candidate. On the other hand, their other properties take precedence. Only if all others are equal, __device__ __host__ candidate is preferred over wrong-sided candidate.

I will put a workaround: In device compilation, in implicit __device__ __host__ callers, I will keep the old behavior, that is, implicit __device__ __host__ candidate has equal preference with wrong-sided candidate. By doing this, we will in most cases resolve the overloading the same way as if the callers and callees are host functions, therefore resolved the same way as in their expected environment. This will make sure: 1. we will not end up with no viable candidate 2. we will not have ambiguity, since we know it is resolvable in host compilation.

For explicit __device__ __host__ functions, we do not need the workaround, since they are intended for host and device and are supposed to work for both host and device.

FYI, I've just reverted it in bf6a26b066382e0f41bf023c781d84061c542307

In D77954#2021299, @yaxunl wrote:

I will put a workaround: In device compilation, in implicit __device__ __host__ callers, I will keep the old behavior, that is, implicit __device__ __host__ candidate has equal preference with wrong-sided candidate. By doing this, we will in most cases resolve the overloading the same way as if the callers and callees are host functions, therefore resolved the same way as in their expected environment. This will make sure: 1. we will not end up with no viable candidate 2. we will not have ambiguity, since we know it is resolvable in host compilation.

LMK when you have something. I can give it a spin internally.

tra added a reverting change: rGbf6a26b06638: Revert D77954 -- it breaks Eigen & Tensorflow..May 5 2020, 2:36 PM

yaxunl mentioned this in D79526: [CUDA][HIP] Workaround for resolving host device function against wrong-sided function.May 6 2020, 4:03 PM

yaxunl mentioned this in D84364: [CUDA][HIP] Defer overloading resolution diagnostics for host device functions.Aug 6 2020, 3:08 PM

Revision Contents

Path

Size

clang/

lib/

Sema/

SemaOverload.cpp

133 lines

test/

SemaCUDA/

function-overload.cu

50 lines

Diff 259796

clang/lib/Sema/SemaOverload.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 9,368 Lines • ▼ Show 20 Lines	for (auto Pair : zip_longest(Cand1Attrs, Cand2Attrs)) {
(*Cand2A)->getCond()->Profile(Cand2ID, S.getASTContext(), true);		(*Cand2A)->getCond()->Profile(Cand2ID, S.getASTContext(), true);
if (Cand1ID != Cand2ID)		if (Cand1ID != Cand2ID)
return Comparison::Worse;		return Comparison::Worse;
}		}

return Comparison::Equal;		return Comparison::Equal;
}		}

static bool isBetterMultiversionCandidate(const OverloadCandidate &Cand1,		static Comparison
		isBetterMultiversionCandidate(const OverloadCandidate &Cand1,
const OverloadCandidate &Cand2) {		const OverloadCandidate &Cand2) {
if (!Cand1.Function \|\| !Cand1.Function->isMultiVersion() \|\| !Cand2.Function \|\|		if (!Cand1.Function \|\| !Cand1.Function->isMultiVersion() \|\| !Cand2.Function \|\|
!Cand2.Function->isMultiVersion())		!Cand2.Function->isMultiVersion())
return false;		return Comparison::Equal;

// If Cand1 is invalid, it cannot be a better match, if Cand2 is invalid, this		// If both are invalid, they are equal. If one of them is invalid, the other
// is obviously better.		// is better.
if (Cand1.Function->isInvalidDecl()) return false;		if (Cand1.Function->isInvalidDecl()) {
if (Cand2.Function->isInvalidDecl()) return true;		if (Cand2.Function->isInvalidDecl())
		rjmccallUnsubmitted Done Reply Inline Actions This is neglecting the case where they're both invalid. rjmccall: This is neglecting the case where they're both invalid.
		return Comparison::Equal;
		return Comparison::Worse;
		}
		if (Cand2.Function->isInvalidDecl())
		return Comparison::Better;

// If this is a cpu_dispatch/cpu_specific multiversion situation, prefer		// If this is a cpu_dispatch/cpu_specific multiversion situation, prefer
// cpu_dispatch, else arbitrarily based on the identifiers.		// cpu_dispatch, else arbitrarily based on the identifiers.
bool Cand1CPUDisp = Cand1.Function->hasAttr<CPUDispatchAttr>();		bool Cand1CPUDisp = Cand1.Function->hasAttr<CPUDispatchAttr>();
bool Cand2CPUDisp = Cand2.Function->hasAttr<CPUDispatchAttr>();		bool Cand2CPUDisp = Cand2.Function->hasAttr<CPUDispatchAttr>();
const auto *Cand1CPUSpec = Cand1.Function->getAttr<CPUSpecificAttr>();		const auto *Cand1CPUSpec = Cand1.Function->getAttr<CPUSpecificAttr>();
const auto *Cand2CPUSpec = Cand2.Function->getAttr<CPUSpecificAttr>();		const auto *Cand2CPUSpec = Cand2.Function->getAttr<CPUSpecificAttr>();

if (!Cand1CPUDisp && !Cand2CPUDisp && !Cand1CPUSpec && !Cand2CPUSpec)		if (!Cand1CPUDisp && !Cand2CPUDisp && !Cand1CPUSpec && !Cand2CPUSpec)
return false;		return Comparison::Equal;

if (Cand1CPUDisp && !Cand2CPUDisp)		if (Cand1CPUDisp && !Cand2CPUDisp)
return true;		return Comparison::Better;
if (Cand2CPUDisp && !Cand1CPUDisp)		if (Cand2CPUDisp && !Cand1CPUDisp)
return false;		return Comparison::Worse;

if (Cand1CPUSpec && Cand2CPUSpec) {		if (Cand1CPUSpec && Cand2CPUSpec) {
if (Cand1CPUSpec->cpus_size() != Cand2CPUSpec->cpus_size())		if (Cand1CPUSpec->cpus_size() != Cand2CPUSpec->cpus_size())
return Cand1CPUSpec->cpus_size() < Cand2CPUSpec->cpus_size();		return Cand1CPUSpec->cpus_size() < Cand2CPUSpec->cpus_size()
		? Comparison::Better
		: Comparison::Worse;

std::pair<CPUSpecificAttr::cpus_iterator, CPUSpecificAttr::cpus_iterator>		std::pair<CPUSpecificAttr::cpus_iterator, CPUSpecificAttr::cpus_iterator>
FirstDiff = std::mismatch(		FirstDiff = std::mismatch(
Cand1CPUSpec->cpus_begin(), Cand1CPUSpec->cpus_end(),		Cand1CPUSpec->cpus_begin(), Cand1CPUSpec->cpus_end(),
Cand2CPUSpec->cpus_begin(),		Cand2CPUSpec->cpus_begin(),
[](const IdentifierInfo LHS, const IdentifierInfo RHS) {		[](const IdentifierInfo LHS, const IdentifierInfo RHS) {
return LHS->getName() == RHS->getName();		return LHS->getName() == RHS->getName();
});		});

assert(FirstDiff.first != Cand1CPUSpec->cpus_end() &&		assert(FirstDiff.first != Cand1CPUSpec->cpus_end() &&
"Two different cpu-specific versions should not have the same "		"Two different cpu-specific versions should not have the same "
"identifier list, otherwise they'd be the same decl!");		"identifier list, otherwise they'd be the same decl!");
return (FirstDiff.first)->getName() < (FirstDiff.second)->getName();		return (FirstDiff.first)->getName() < (FirstDiff.second)->getName()
		? Comparison::Better
		: Comparison::Worse;
}		}
llvm_unreachable("No way to get here unless both had cpu_dispatch");		llvm_unreachable("No way to get here unless both had cpu_dispatch");
}		}

/// Compute the type of the implicit object parameter for the given function,		/// Compute the type of the implicit object parameter for the given function,
/// if any. Returns None if there is no implicit object parameter, and a null		/// if any. Returns None if there is no implicit object parameter, and a null
/// QualType if there is a 'matches anything' implicit object parameter.		/// QualType if there is a 'matches anything' implicit object parameter.
static Optional<QualType> getImplicitObjectParamType(ASTContext &Context,		static Optional<QualType> getImplicitObjectParamType(ASTContext &Context,
▲ Show 20 Lines • Show All 43 Lines • ▼ Show 20 Lines	bool clang::isBetterOverloadCandidate(
SourceLocation Loc, OverloadCandidateSet::CandidateSetKind Kind) {		SourceLocation Loc, OverloadCandidateSet::CandidateSetKind Kind) {
// Define viable functions to be better candidates than non-viable		// Define viable functions to be better candidates than non-viable
// functions.		// functions.
if (!Cand2.Viable)		if (!Cand2.Viable)
return Cand1.Viable;		return Cand1.Viable;
else if (!Cand1.Viable)		else if (!Cand1.Viable)
return false;		return false;

		// [CUDA] A function with 'never' preference is marked not viable, therefore
		// is never shown up here. The worst preference shown up here is 'wrong side',
		// e.g. a host function called by a device host function in device
		// compilation. This is valid AST as long as the host device function is not
		rjmccallUnsubmitted Done Reply Inline Actions Please add `[CUDA]` or something similar to the top of this comment so that readers can immediately know that it's dialect-specific. At a high level, this part of the rule is essentially saying that CUDA non-emittability is a kind of non-viability. Should we just make non-emittable functions get flagged as non-viable (which will avoid a lot of relatively expensive conversion checking), or is it important to be able to select non-emittable candidates over candidates that are non-viable for other reasons? rjmccall: Please add `[CUDA]` or something similar to the top of this comment so that readers can…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions There are two situations for "bad" callees: the callee should never be called. It is not just invalid call in codegen, but also invalid call in AST. e.g. a host function call a device function. In CUDA call preference, it is termed "never". And clang already removed such callees from overload candidates. the callee should not be called in codegen, but may be called in AST. This happens with `__host__ __device__` functions when calling a "wrong sided" function. e.g. in device compilation, a `__host__ __device__` function calls a `__host__` function. This is valid in AST since the `__host__ __device__` function may be an inline function which is only called by a `__host__` function. There is a deferred diagnostic for the wrong-sided call, which is triggered only if the caller is emitted. However in overloading resolution, if no better candidates are available, wrong-sided candidates are still viable. yaxunl: There are two situations for "bad" callees: 1. the callee should never be called. It is not…
		rjmccallUnsubmitted Done Reply Inline Actions Oh, I see what you're saying; sorry, I mis-read the code. So anything with a preference worse than wrong-sided is outright non-viable; there's a very strong preference against wrong-sided calls that takes priority of all of the normal overload-resolution rules; and then there's a very weak preference against non-exact matches that everything else takes priority over. Okay. rjmccall: Oh, I see what you're saying; sorry, I mis-read the code. So anything with a preference…
		// emitted, e.g. it is an inline function which is called only by a host
		// function. A deferred diagnostic will be triggered if it is emitted.
		// However a wrong-sided function is still a viable candidate here.
		//
		// If Cand1 can be emitted and Cand2 cannot be emitted in the current
		// context, Cand1 is better than Cand2. If Cand1 can not be emitted and Cand2
		// can be emitted, Cand1 is not better than Cand2. This rule should have
		// precedence over other rules.
		//
		// If both Cand1 and Cand2 can be emitted, or neither can be emitted, then
		// other rules should be used to determine which is better, except
		// multiversion. This is because host/device based overloading resolution is
		// mostly for determining viability of a function. If two functions are both
		// viable, other factors should take precedence in preference, e.g. the
		// standard-defined preferences like argument conversion ranks or enable_if
		// partial-ordering. The preference for pass-object-size parameters is
		// probably most similar to a type-based-overloading decision and so should
		// take priority.
		//
		// ToDo: multiversion currently only works for host functions. The host/device
		// attribute takes precedence over multiversion since we only need to compare
		// host vs host device or device vs host device, whereas multiversion does
		// not make sense for host device functions. This may need revisit if issues
		// arise when multiversion is supported on device.
		//
		// If other rules cannot determine which is better, CUDA preference will be
		// used again to determine which is better.
		//
		// TODO: Currently IdentifyCUDAPreference does not return correct values
		// for functions called in global variable initializers due to missing
		// correct context about device/host. Therefore we can only enforce this
		// rule when there is a caller. We should enforce this rule for functions
		// in global variable initializers once proper context is added.
		if (S.getLangOpts().CUDA && Cand1.Function && Cand2.Function) {
		if (FunctionDecl *Caller = dyn_cast<FunctionDecl>(S.CurContext)) {
		auto P1 = S.IdentifyCUDAPreference(Caller, Cand1.Function);
		auto P2 = S.IdentifyCUDAPreference(Caller, Cand2.Function);
		assert(P1 != Sema::CFP_Never && P2 != Sema::CFP_Never);
		auto Cand1Emittable = P1 > Sema::CFP_WrongSide;
		auto Cand2Emittable = P2 > Sema::CFP_WrongSide;
		if (Cand1Emittable && !Cand2Emittable)
		return true;
		if (!Cand1Emittable && Cand2Emittable)
		return false;
		}
		}

// C++ [over.match.best]p1:		// C++ [over.match.best]p1:
//		//
// -- if F is a static member function, ICS1(F) is defined such		// -- if F is a static member function, ICS1(F) is defined such
// that ICS1(F) is neither better nor worse than ICS1(G) for		// that ICS1(F) is neither better nor worse than ICS1(G) for
// any function G, and, symmetrically, ICS1(G) is neither		// any function G, and, symmetrically, ICS1(G) is neither
// better nor worse than ICS1(F).		// better nor worse than ICS1(F).
unsigned StartArg = 0;		unsigned StartArg = 0;
if (Cand1.IgnoreObjectArgument \|\| Cand2.IgnoreObjectArgument)		if (Cand1.IgnoreObjectArgument \|\| Cand2.IgnoreObjectArgument)
▲ Show 20 Lines • Show All 218 Lines • ▼ Show 20 Lines	bool clang::isBetterOverloadCandidate(

// Check for enable_if value-based overload resolution.		// Check for enable_if value-based overload resolution.
if (Cand1.Function && Cand2.Function) {		if (Cand1.Function && Cand2.Function) {
Comparison Cmp = compareEnableIfAttrs(S, Cand1.Function, Cand2.Function);		Comparison Cmp = compareEnableIfAttrs(S, Cand1.Function, Cand2.Function);
if (Cmp != Comparison::Equal)		if (Cmp != Comparison::Equal)
return Cmp == Comparison::Better;		return Cmp == Comparison::Better;
}		}

if (S.getLangOpts().CUDA && Cand1.Function && Cand2.Function) {
FunctionDecl *Caller = dyn_cast<FunctionDecl>(S.CurContext);
return S.IdentifyCUDAPreference(Caller, Cand1.Function) >
S.IdentifyCUDAPreference(Caller, Cand2.Function);
}

bool HasPS1 = Cand1.Function != nullptr &&		bool HasPS1 = Cand1.Function != nullptr &&
functionHasPassObjectSizeParams(Cand1.Function);		functionHasPassObjectSizeParams(Cand1.Function);
bool HasPS2 = Cand2.Function != nullptr &&		bool HasPS2 = Cand2.Function != nullptr &&
functionHasPassObjectSizeParams(Cand2.Function);		functionHasPassObjectSizeParams(Cand2.Function);
if (HasPS1 != HasPS2 && HasPS1)		if (HasPS1 != HasPS2 && HasPS1)
return true;		return true;

return isBetterMultiversionCandidate(Cand1, Cand2);		// If other rules cannot determine which is better, CUDA preference is used
		// to determine which is better.
		rjmccallUnsubmitted Done Reply Inline Actions If we move anything below this check, it needs to figure out a tri-state so that it can return false if `Cand2` is a better candidate than `Cand1`. Now, that only matters if multiversion functions are supported under CUDA, but if you're relying on them not being supported, that should at least be commented on. rjmccall: If we move anything below this check, it needs to figure out a tri-state so that it can return…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions multiversion host functions is orthogonal to CUDA therefore should be supported. multiversion in device, host device, and global functions are not supported. However this change does not make things worse, and should continue to work if they are supported. host/device based overloading resolution is mostly for determining viability of a function. If two functions are both viable, other factors should take precedence in preference. This general rule has been taken for cases other than multiversion, I think it should also apply to multiversion. I will make isBetterMultiversionCandidate three states. yaxunl: multiversion host functions is orthogonal to CUDA therefore should be supported. multiversion…
		rjmccallUnsubmitted Done Reply Inline Actions This general rule has been taken for cases other than multiversion, I think it should also apply to multiversion. Well, but the multiversion people could say the same: that multiversioning is for picking an alternative among otherwise-identical functions, and HD and H functions are not otherwise-identical. CC'ing @echristo for his thoughts on the right ordering here. rjmccall: > This general rule has been taken for cases other than multiversion, I think it should also…
		echristoUnsubmitted Done Reply Inline Actions Adding @erichkeane here as well. I think this makes sense, but I can see a reason to multiversion a function that will run on host and device. A version of some matrix mult that takes advantage of 3 host architectures and one cuda one? Am I missing something here? echristo: Adding @erichkeane here as well. I think this makes sense, but I can see a reason to…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions My understanding is that a multiversion function is for a specific cpu(gpu). Let's say we want to have a function f for gfx900, gfx906, sandybridge, ivybridge, shouldn't they be more like __host__ __attribute__((cpu_specific(sandybridge))) f(); __host__ __attribute__((cpu_specific(ivybridge))) f(); __device__ __attribute__((cpu_specific(gfx900))) f(); __device__ __attribute__((cpu_specific(gfx906))) f(); instead of all `__device__ __host__` functions? yaxunl: My understanding is that a multiversion function is for a specific cpu(gpu). Let's say we want…
		erichkeaneUnsubmitted Done Reply Inline Actions IMO, it doesn't make sense for functions to functions be BOTH host and device, they'd have to be just one. Otherwise I'm not sure how the resolver behavior is supposed to work. The whole idea is that the definition is chosen at runtime. Unless host __device void foo(); is TWO declaration chains (meaning two separate AST entries), it doesn't make sense to have multiverison on it (and then, how it would be spelled is awkward/confusing to me). In the above case, if those 4 declarations are not 2 separate root- AST nodes, multiversioning won't work. erichkeane: IMO, it doesn't make sense for functions to functions be BOTH host and device, they'd have to…
		rjmccallUnsubmitted Done Reply Inline Actions There are certainly functions that ought to be usable from either host or device context — any inline function that just does ordinary language things should be in that category. Also IIUC many declarations are inferred to be `__host__ __device__`, or can be mass-annotated with pragmas, and those reasons are probably the main ones this might matter — we might include a header in CUDA mode that declares a multi-versioned function, and we should handle it right. My read of how CUDA programmers expect this to work is that they see the `__host__` / `__device__` attributes as primarily a mechanism for catching problems where you're using the wrong functions for the current configuration. That is, while we allow overloading by `__host__`/`__device__`-ness, users expect those attributes to mostly be used as a filter for what's "really there" rather than really strictly segregating the namespace. So I would say that CUDA programmers would probably expect the interaction with multiversioning to be: Programmers can put `__host__`, `__device__`, or both on a variant depending on where it was usable. Dispatches should simply ignore any variants that aren't usable for the current configuration. And specifically they would not expect e.g. a `__host__` dispatch function to only consider `__host__` variants — it should be able to dispatch to anything available, which is to say, it should also include `__host__ __device__` variants. Similarly (and probably more usefully), a `__host__ __device__` dispatch function being compiled for the device should also consider pure `__device__` functions, and so on. If we accept that, then I think it gives us a much better idea for how to resolve the priority of the overload rules. The main impact of `isBetterMultiversionCandidate` is to try to ensure that we're looking at the `__attribute__((cpu_dispatch))` function instead of one of the `__attribute__((cpu_specific))` variants. (It has no effect on `__attribute__((target))` multi-versioning, mostly because it doesn't need to: target-specific variants don't show up in lookup with `__attribute__((target))`.) That rule should take precedence over the CUDA preference for exact matches, because e.g. if we're compiling this: __host__ __device__ int magic(void) __attribute__((cpu_dispatch("..."))); __host__ __device__ int magic(void) __attribute__((cpu_specific(generic))); __host__ int magic(void) __attribute__((cpu_specific(mmx))); __host__ int magic(void) __attribute__((cpu_specific(sse))); __device__ int magic(void) __attribute__((cpu_specific(some_device_feature))); __device__ int magic(void) __attribute__((cpu_specific(some_other_device_feature))); then we don't want the compiler to prefer a CPU-specific variant over the dispatch function just because one of the variant was marked `__host__`. rjmccall: There are certainly functions that ought to be usable from either host or device context — any…
		traUnsubmitted Done Reply Inline Actions It's a bit more complicated and a bit less straightforward than that. :-( https://goo.gl/EXnymm Handling of target attributes is where clang is very different from the NVCC, so no matter which mental model of "CUDA programmer" you pick, there's another one which will not match. In the existing code `__host__ __device__` is commonly used as a sledgehammer to work around NVCC's limitations. It does not allow attribute-based overloading, so the only way you can specialize a function for host/device is via something like this: __host__ __device__ void foo() { #if __CUDA_ARCH__ > 0 // GPU code #else // CPU code. #endif } With clang you can write separate overloaded functions and we'll do our best to pick the one you meant to call. Alas, there are cases where it's ambiguous and depends on the callee's attributes, which may depend on theirs. When something ends up being called from different contexts, interesting things start happening. With more functions becoming constexpr (those are implicitly HD), we'll be running into such impossible-to-do-the-right-thing situations more often. The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolutions only considering same-side functions which will, in effect, completely separate host and device name spaces. Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. On one hand kernel launch is already a form of runtime dispatch, only it's CUDA runtime does the dispatching based on the GPU one attempts to run the kernel on. `__device__` functions are always compiled for the specific GPU variant. Also, GPU variants often have different instruction sets and can't be mixed together in the same object file at all, so there's no variants once we're running the code as it's already compiled for precisely the GPU we're running on. Almost. Technically GPUs in the same family do share the same instruction sets, but I'm not sure runtime dispatch would buy us much there as the hardware differences are relatively minor. tra: It's a bit more complicated and a bit less straightforward than that. :-( https://goo.
		rjmccallUnsubmitted Done Reply Inline Actions The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D functions and do overload resolutions only considering same-side functions which will, in effect, completely separate host and device name spaces. Okay. Well, even if you completely split host and device functions, I think we'd still want to prefer dispatch functions over variant functions before preferring H over HD. Although... I suppose we do want to consider H vs. HD before looking at the more arbitrary factors that `isBetterMultiversionCandidate` looks at, like the number of architectures in the dispatch. Honestly, though, those just seem like bad rules that we should drop from the code. Run-time dispatch is also somewhat irrelevant to CUDA. Sort of. I understand that there's very little reason (or even ability) to use multiversioning in device code, but it can certainly happen in host code, right? Still, I guess the easiest thing would just be to forbid multiversioned functions on the device. rjmccall: > The only reliable way to avoid such ambiguity is to 'clone' HD functions into separate H & D…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions Will change back the precedence of multiversion to be over host/device. yaxunl: Will change back the precedence of multiversion to be over host/device.
		traUnsubmitted Not Done Reply Inline Actions @rjmccall I'm OK with your reasoning & this patch. As long as the change does not break existing code, I'm fine. tra: @rjmccall I'm OK with your reasoning & this patch. As long as the change does not break…
		if (S.getLangOpts().CUDA && Cand1.Function && Cand2.Function) {
		if (FunctionDecl *Caller = dyn_cast<FunctionDecl>(S.CurContext)) {
		return S.IdentifyCUDAPreference(Caller, Cand1.Function) >
		rjmccallUnsubmitted Done Reply Inline Actions Okay, let's think about the right place to put this check in the ordering; we don't want different extensions to get into a who-comes-last competition. Certainly this should have lower priority than the standard-defined preferences like argument conversion ranks or `enable_if` partial-ordering. The preference for pass-object-size parameters is probably most similar to a type-based-overloading decision and so should take priority. I would say that this should take priority over function multi-versioning. Function multi-versioning is all about making specialized versions of the "same function", whereas I think host/device overloading is meant to be semantically broader than that. What do you think? Regardless, the rationale for the order should be explained in comments. rjmccall: Okay, let's think about the right place to put this check in the ordering; we don't want…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions I will add comments for the rationale of preference. I commented the preference between multiversion and host/device in another comment. yaxunl: I will add comments for the rationale of preference. I commented the preference between…
		S.IdentifyCUDAPreference(Caller, Cand2.Function);
		}
		}

		auto MV = isBetterMultiversionCandidate(Cand1, Cand2);
		if (MV == Comparison::Better)
		return true;

		return false;
}		}

/// Determine whether two declarations are "equivalent" for the purposes of		/// Determine whether two declarations are "equivalent" for the purposes of
/// name lookup and overload resolution. This applies when the same internal/no		/// name lookup and overload resolution. This applies when the same internal/no
/// linkage entity is defined by two modules (probably by textually including		/// linkage entity is defined by two modules (probably by textually including
/// the same header). In such a case, we don't consider the declarations to		/// the same header). In such a case, we don't consider the declarations to
/// declare the same entity, but we also don't want lookups with both		/// declare the same entity, but we also don't want lookups with both
/// declarations visible to be ambiguous in some cases (this happens when using		/// declarations visible to be ambiguous in some cases (this happens when using
▲ Show 20 Lines • Show All 69 Lines • ▼ Show 20 Lines
/// \returns The result of overload resolution.		/// \returns The result of overload resolution.
OverloadingResult		OverloadingResult
OverloadCandidateSet::BestViableFunction(Sema &S, SourceLocation Loc,		OverloadCandidateSet::BestViableFunction(Sema &S, SourceLocation Loc,
iterator &Best) {		iterator &Best) {
llvm::SmallVector<OverloadCandidate *, 16> Candidates;		llvm::SmallVector<OverloadCandidate *, 16> Candidates;
std::transform(begin(), end(), std::back_inserter(Candidates),		std::transform(begin(), end(), std::back_inserter(Candidates),
[](OverloadCandidate &Cand) { return &Cand; });		[](OverloadCandidate &Cand) { return &Cand; });

// [CUDA] HD->H or HD->D calls are technically not allowed by CUDA but
// are accepted by both clang and NVCC. However, during a particular
// compilation mode only one call variant is viable. We need to
// exclude non-viable overload candidates from consideration based
// only on their host/device attributes. Specifically, if one
// candidate call is WrongSide and the other is SameSide, we ignore
// the WrongSide candidate.
if (S.getLangOpts().CUDA) {
const FunctionDecl *Caller = dyn_cast<FunctionDecl>(S.CurContext);
bool ContainsSameSideCandidate =
llvm::any_of(Candidates, [&](OverloadCandidate *Cand) {
// Check viable function only.
return Cand->Viable && Cand->Function &&
S.IdentifyCUDAPreference(Caller, Cand->Function) ==
Sema::CFP_SameSide;
});
if (ContainsSameSideCandidate) {
auto IsWrongSideCandidate = [&](OverloadCandidate *Cand) {
// Check viable function only to avoid unnecessary data copying/moving.
return Cand->Viable && Cand->Function &&
S.IdentifyCUDAPreference(Caller, Cand->Function) ==
Sema::CFP_WrongSide;
};
llvm::erase_if(Candidates, IsWrongSideCandidate);
}
}

// Find the best viable function.		// Find the best viable function.
Best = end();		Best = end();
for (auto *Cand : Candidates) {		for (auto *Cand : Candidates) {
Cand->Best = false;		Cand->Best = false;
if (Cand->Viable)		if (Cand->Viable)
if (Best == end() \|\|		if (Best == end() \|\|
isBetterOverloadCandidate(S, Cand, Best, Loc, Kind))		isBetterOverloadCandidate(S, Cand, Best, Loc, Kind))
Best = Cand;		Best = Cand;
▲ Show 20 Lines • Show All 5,011 Lines • Show Last 20 Lines

clang/test/SemaCUDA/function-overload.cu

	Show First 20 Lines • Show All 325 Lines • ▼ Show 20 Lines
	__device__ void test_device_calls_template_fn() {			__device__ void test_device_calls_template_fn() {
	DeviceReturnTy ret1 = template_vs_function(1.0f);			DeviceReturnTy ret1 = template_vs_function(1.0f);
	DeviceReturnTy ret2 = template_vs_function(2.0);			DeviceReturnTy ret2 = template_vs_function(2.0);
	}			}

	// If we have a mix of HD and H-only or D-only candidates in the overload set,			// If we have a mix of HD and H-only or D-only candidates in the overload set,
	// normal C++ overload resolution rules apply first.			// normal C++ overload resolution rules apply first.
	template <typename T> TemplateReturnTy template_vs_hd_function(T arg)			template <typename T> TemplateReturnTy template_vs_hd_function(T arg)
	#ifdef __CUDA_ARCH__
	//expected-note@-2 {{declared here}}
	#endif
	{			{
	return TemplateReturnTy();			return TemplateReturnTy();
	}			}
	__host__ __device__ HostDeviceReturnTy template_vs_hd_function(float arg) {			__host__ __device__ HostDeviceReturnTy template_vs_hd_function(float arg) {
	return HostDeviceReturnTy();			return HostDeviceReturnTy();
	}			}

	__host__ __device__ void test_host_device_calls_hd_template() {			__host__ __device__ void test_host_device_calls_hd_template() {
	HostDeviceReturnTy ret1 = template_vs_hd_function(1.0f);
	TemplateReturnTy ret2 = template_vs_hd_function(1);
	#ifdef __CUDA_ARCH__			#ifdef __CUDA_ARCH__
	// expected-error@-2 {{reference to __host__ function 'template_vs_hd_function<int>' in __host__ __device__ function}}			typedef HostDeviceReturnTy ExpectedReturnTy;
				#else
				typedef TemplateReturnTy ExpectedReturnTy;
	#endif			#endif
				HostDeviceReturnTy ret1 = template_vs_hd_function(1.0f);
				ExpectedReturnTy ret2 = template_vs_hd_function(1);
	}			}

	__host__ void test_host_calls_hd_template() {			__host__ void test_host_calls_hd_template() {
	HostDeviceReturnTy ret1 = template_vs_hd_function(1.0f);			HostDeviceReturnTy ret1 = template_vs_hd_function(1.0f);
	TemplateReturnTy ret2 = template_vs_hd_function(1);			TemplateReturnTy ret2 = template_vs_hd_function(1);
	}			}

	__device__ void test_device_calls_hd_template() {			__device__ void test_device_calls_hd_template() {
	HostDeviceReturnTy ret1 = template_vs_hd_function(1.0f);			HostDeviceReturnTy ret1 = template_vs_hd_function(1.0f);
	// Host-only function template is not callable with strict call checks,			// Host-only function template is not callable with strict call checks,
	// so for device side HD function will be the only choice.			// so for device side HD function will be the only choice.
	HostDeviceReturnTy ret2 = template_vs_hd_function(1);			HostDeviceReturnTy ret2 = template_vs_hd_function(1);
	}			}

	// Check that overloads still work the same way on both host and			// Check that overloads still work the same way on both host and
	// device side when the overload set contains only functions from one			// device side when the overload set contains only functions from one
	// side of compilation.			// side of compilation.
	__device__ DeviceReturnTy device_only_function(int arg) { return DeviceReturnTy(); }			__device__ DeviceReturnTy device_only_function(int arg) { return DeviceReturnTy(); }
	__device__ DeviceReturnTy2 device_only_function(float arg) { return DeviceReturnTy2(); }			__device__ DeviceReturnTy2 device_only_function(float arg) { return DeviceReturnTy2(); }
	#ifndef __CUDA_ARCH__			#ifndef __CUDA_ARCH__
	// expected-note@-3 {{'device_only_function' declared here}}			// expected-note@-3 2{{'device_only_function' declared here}}
	// expected-note@-3 {{'device_only_function' declared here}}			// expected-note@-3 2{{'device_only_function' declared here}}
	#endif			#endif
	__host__ HostReturnTy host_only_function(int arg) { return HostReturnTy(); }			__host__ HostReturnTy host_only_function(int arg) { return HostReturnTy(); }
	__host__ HostReturnTy2 host_only_function(float arg) { return HostReturnTy2(); }			__host__ HostReturnTy2 host_only_function(float arg) { return HostReturnTy2(); }
	#ifdef __CUDA_ARCH__			#ifdef __CUDA_ARCH__
	// expected-note@-3 {{'host_only_function' declared here}}			// expected-note@-3 2{{'host_only_function' declared here}}
	// expected-note@-3 {{'host_only_function' declared here}}			// expected-note@-3 2{{'host_only_function' declared here}}
	#endif			#endif

	__host__ __device__ void test_host_device_single_side_overloading() {			__host__ __device__ void test_host_device_single_side_overloading() {
	DeviceReturnTy ret1 = device_only_function(1);			DeviceReturnTy ret1 = device_only_function(1);
	DeviceReturnTy2 ret2 = device_only_function(1.0f);			DeviceReturnTy2 ret2 = device_only_function(1.0f);
	#ifndef __CUDA_ARCH__			#ifndef __CUDA_ARCH__
	// expected-error@-3 {{reference to __device__ function 'device_only_function' in __host__ __device__ function}}			// expected-error@-3 {{reference to __device__ function 'device_only_function' in __host__ __device__ function}}
	// expected-error@-3 {{reference to __device__ function 'device_only_function' in __host__ __device__ function}}			// expected-error@-3 {{reference to __device__ function 'device_only_function' in __host__ __device__ function}}
	#endif			#endif
	HostReturnTy ret3 = host_only_function(1);			HostReturnTy ret3 = host_only_function(1);
	HostReturnTy2 ret4 = host_only_function(1.0f);			HostReturnTy2 ret4 = host_only_function(1.0f);
	#ifdef __CUDA_ARCH__			#ifdef __CUDA_ARCH__
	// expected-error@-3 {{reference to __host__ function 'host_only_function' in __host__ __device__ function}}			// expected-error@-3 {{reference to __host__ function 'host_only_function' in __host__ __device__ function}}
	// expected-error@-3 {{reference to __host__ function 'host_only_function' in __host__ __device__ function}}			// expected-error@-3 {{reference to __host__ function 'host_only_function' in __host__ __device__ function}}
	#endif			#endif
	}			}

				// wrong-sided overloading should not cause diagnostic unless it is emitted.
				// This inline function is not emitted.
				inline __host__ __device__ void test_host_device_wrong_side_overloading_inline_no_diag() {
				DeviceReturnTy ret1 = device_only_function(1);
				DeviceReturnTy2 ret2 = device_only_function(1.0f);
				HostReturnTy ret3 = host_only_function(1);
				HostReturnTy2 ret4 = host_only_function(1.0f);
				}

				// wrong-sided overloading should cause diagnostic if it is emitted.
				// This inline function is emitted since it is called by an emitted function.
				inline __host__ __device__ void test_host_device_wrong_side_overloading_inline_diag() {
				DeviceReturnTy ret1 = device_only_function(1);
				DeviceReturnTy2 ret2 = device_only_function(1.0f);
				#ifndef __CUDA_ARCH__
				// expected-error@-3 {{reference to __device__ function 'device_only_function' in __host__ __device__ function}}
				// expected-error@-3 {{reference to __device__ function 'device_only_function' in __host__ __device__ function}}
				#endif
				HostReturnTy ret3 = host_only_function(1);
				HostReturnTy2 ret4 = host_only_function(1.0f);
				#ifdef __CUDA_ARCH__
				// expected-error@-3 {{reference to __host__ function 'host_only_function' in __host__ __device__ function}}
				// expected-error@-3 {{reference to __host__ function 'host_only_function' in __host__ __device__ function}}
				#endif
				}

				__host__ __device__ void test_host_device_wrong_side_overloading_inline_diag_caller() {
				test_host_device_wrong_side_overloading_inline_diag();
				// expected-note@-1 {{called by 'test_host_device_wrong_side_overloading_inline_diag_caller'}}
				}

	// Verify that we allow overloading function templates.			// Verify that we allow overloading function templates.
	template <typename T> __host__ T template_overload(const T &a) { return a; };			template <typename T> __host__ T template_overload(const T &a) { return a; };
	template <typename T> __device__ T template_overload(const T &a) { return a; };			template <typename T> __device__ T template_overload(const T &a) { return a; };

	__host__ void test_host_template_overload() {			__host__ void test_host_template_overload() {
	template_overload(1); // OK. Attribute-based overloading picks __host__ variant.			template_overload(1); // OK. Attribute-based overloading picks __host__ variant.
	}			}
	__device__ void test_device_template_overload() {			__device__ void test_device_template_overload() {
	Show All 19 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][HIP] Fix host/device based overload resolutionClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 259796

clang/lib/Sema/SemaOverload.cpp

clang/test/SemaCUDA/function-overload.cu

[CUDA][HIP] Fix host/device based overload resolution
ClosedPublic