
[CUDA][HIP] Workaround for resolving host device function against wrong-sided function
ClosedPublic

Authored by yaxunl on May 6 2020, 4:03 PM.

Details

Summary

https://reviews.llvm.org/D77954 caused regressions due to diagnostics in implicit
host device functions.

The implicit host device functions are often functions in system headers forced to be host device by pragmas.

Some of them are valid host device functions that can be emitted in both host and device compilation.

Some of them are valid host functions but invalid device functions. In device compilation they incur
diagnostics. However, as long as these diagnostics are deferred and these functions are not emitted,
this is fine.

Before D77954, in host device callers, host device candidates were not favored over wrong-sided candidates,
which preserves the overload resolution result as if the caller and the candidates were host functions.
This makes sure the callee does not cause other issues, e.g. type mismatches or const-ness issues. If the
selected function is a host device function, it is a viable callee. If the selected function is a host
function, the caller is not a valid host device function; this results in a diagnostic, but it can be deferred.

The problem is that we have to give host device candidates equal preference with wrong-sided candidates. If
users really intend to favor a host device candidate over a wrong-sided candidate, they cannot get the
expected selection.
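To make the intended selection concrete, here is a hypothetical sketch (names and signatures invented for illustration, not taken from the patch) of a case where the user wants the host device candidate but cannot get it:

```cuda
// Device compilation of an explicit host device caller (hypothetical code).
__host__ void transform(int);                 // wrong-sided, but the better match
__host__ __device__ void transform(double);   // the overload the user intends

__host__ __device__ void caller() {
  // With equal preference between host device and wrong-sided candidates,
  // conversion rank decides and the host transform(int) wins; favoring
  // host device candidates would select transform(double) instead.
  transform(1);
}
```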

Ideally, we should be able to defer all diagnostics for functions that are not known to be emitted. In that case we could
use the correct preference. If diagnostics occur due to an overload resolution change, as long as the function
is not emitted, it is fine.

Unfortunately, deferring all diagnostics is not trivial work. Even deferring only the diagnostics related to
overload resolution is not simple.

For now, the most feasible workaround seems to be to treat implicit host device functions and explicit host
device functions differently. Basically, in device compilation, keep the old behavior for implicit host
device functions, i.e. give host device candidates and wrong-sided candidates equal preference. For explicit
host device functions, favor host device candidates over wrong-sided candidates.

The rationale is that explicit host device functions are blessed by the user to be valid host device functions,
that is, they should not cause diagnostics in either host or device compilation. If diagnostics occur, the user is
able to fix them. However, there is no guarantee that an implicit host device function can be compiled in
device compilation, therefore we need to preserve its overload resolution in device compilation.
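A minimal sketch contrasting the two cases described above (hypothetical code; a constexpr function stands in for an implicit host device function, since clang implicitly marks constexpr functions host device in CUDA mode):

```cuda
__host__ int pick(int);                 // wrong-sided in device compilation
__host__ __device__ int pick(double);   // host device candidate

// Explicit host device: blessed by the user, so in device compilation the
// host device pick(double) is favored over the wrong-sided pick(int).
__host__ __device__ int explicit_hd() { return pick(1); }

// Implicit host device (constexpr): keep the old behavior and resolve as a
// host function would, so the better-matching pick(int) is still selected;
// the resulting wrong-side diagnostic is deferred and harmless unless
// implicit_hd() is actually emitted for the device.
constexpr int implicit_hd() { return pick(1); }
```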

Diff Detail

Event Timeline

yaxunl created this revision.May 6 2020, 4:03 PM
tra added a comment.May 7 2020, 12:12 PM

I've tested the patch on our sources and it still breaks tensorflow compilation, though in a different way:

In file included from third_party/tensorflow/core/kernels/slice_op_gpu.cu.cc:22:
In file included from ./third_party/tensorflow/core/framework/register_types.h:20:
In file included from ./third_party/tensorflow/core/framework/numeric_types.h:28:
In file included from ./third_party/tensorflow/core/platform/types.h:22:
In file included from ./third_party/tensorflow/core/platform/tstring.h:24:
In file included from ./third_party/tensorflow/core/platform/cord.h:23:
In file included from ./third_party/tensorflow/core/platform/google/cord.h:19:
In file included from ./third_party/absl/strings/cord.h:89:
./third_party/absl/strings/internal/cord_internal.h:34:16: error: no matching constructor for initialization of 'std::atomic<int32_t>' (aka 'atomic<int>')
  Refcount() : count_{1} {}
               ^     ~~~
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1778:8: note: candidate constructor (the implicit copy constructor) not viable: no known conversion from 'int' to 'const std::__u::atomic<int>' for 1st argument
struct atomic
       ^
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1784:5: note: candidate constructor not viable: requires 0 arguments, but 1 was provided
    atomic() _NOEXCEPT _LIBCPP_DEFAULT
    ^
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1807:52: error: call to deleted constructor of '__atomic_base<base::scheduling::Schedulable *>'
    _LIBCPP_CONSTEXPR atomic(_Tp* __d) _NOEXCEPT : __base(__d) {}
                                                   ^      ~~~
./third_party/absl/base/internal/thread_identity.h:162:66: note: in instantiation of member function 'std::__u::atomic<base::scheduling::Schedulable *>::atomic' requested here
    std::atomic<base::scheduling::Schedulable*> bound_schedulable{nullptr};
                                                                 ^
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1675:5: note: '__atomic_base' has been explicitly marked deleted here
    __atomic_base(const __atomic_base&) = delete;
    ^
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1786:51: error: call to implicitly-deleted copy constructor of '__atomic_base<long>'
    _LIBCPP_CONSTEXPR atomic(_Tp __d) _NOEXCEPT : __base(__d) {}
                                                  ^      ~~~
./third_party/absl/synchronization/mutex.h:927:25: note: in instantiation of member function 'std::__u::atomic<long>::atomic' requested here
inline Mutex::Mutex() : mu_(0) {
                        ^
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1698:7: note: copy constructor of '__atomic_base<long, true>' is implicitly deleted because base class '__atomic_base<long, false>' has a deleted copy constructor
    : public __atomic_base<_Tp, false>
      ^
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1675:5: note: '__atomic_base' has been explicitly marked deleted here
    __atomic_base(const __atomic_base&) = delete;
    ^
yaxunl updated this revision to Diff 262858.May 8 2020, 5:15 AM
yaxunl edited the summary of this revision. (Show Details)

Fix regression: only treat implicit host device candidates as inferior in device compilation.

yaxunl added a comment.May 8 2020, 5:19 AM
In D79526#2025761, @tra wrote:

I've tested the patch on our sources and it still breaks tensorflow compilation, though in a different way.

Looks like we went overboard in treating implicit host device candidates as inferior. They should be treated
as inferior in device compilation, not in host compilation. Here, because they were treated as inferior
to same-sided candidates in host compilation, they changed overload resolution in host compilation
and therefore caused the failure there.

I have updated the patch to treat implicit host device candidates as inferior only in device compilation.

tra added a comment.May 8 2020, 10:45 AM

The latest version of the patch works well enough to compile tensorflow. That's the good news.

Looks like we went overboard in treating implicit host device candidates as inferior. They should be treated
as inferior in device compilation, not in host compilation. Here, because they were treated as inferior
to same-sided candidates in host compilation, they changed overload resolution in host compilation
and therefore caused the failure there.

I have updated the patch to treat implicit host device candidates as inferior only in device compilation.

I'm concerned that this creates inconsistency in how overload resolution works during host and device compilation.
In general they should behave the same. I.e. a test where this change is needed during device-side compilation will require the same change on the host side, if you swap H and D attributes on the functions in the test.

Speaking of tests, it would be great to add a test illustrating this scenario.

In D79526#2027242, @tra wrote:

The latest version of the patch works well enough to compile tensorflow. That's the good news.

Looks like we went overboard in treating implicit host device candidates as inferior. They should be treated
as inferior in device compilation, not in host compilation. Here, because they were treated as inferior
to same-sided candidates in host compilation, they changed overload resolution in host compilation
and therefore caused the failure there.

I have updated the patch to treat implicit host device candidates as inferior only in device compilation.

I'm concerned that this creates inconsistency in how overload resolution works during host and device compilation.
In general they should behave the same. I.e. a test where this change is needed during device-side compilation will require the same change on the host side, if you swap H and D attributes on the functions in the test.

Speaking of tests, it would be great to add a test illustrating this scenario.

I added a test at line 483 for the situation.

For implicit host device functions, since they are not guaranteed to work in device compilation, we can only resolve them as if they were host functions. This causes asymmetry, but implicit host device functions are originally host functions, so they are biased toward host compilation to begin with. Only the original resolution guarantees no other issues. For example, in the failed TF compilation, some ctor of std::atomic becomes an implicit host device function because it is constexpr. It should be treated as wrong-sided in device compilation, but as same-sided in host compilation; otherwise it changes the resolution in host compilation and causes other issues.

tra added a subscriber: wash.May 8 2020, 1:16 PM

For implicit host device functions, since they are not guaranteed to work in device compilation, we can only resolve them as if they were host functions. This causes asymmetry, but implicit host device functions are originally host functions, so they are biased toward host compilation to begin with.

I don't think the assertion that implicit host device functions are originally host functions is always true. While in practice most such functions may indeed come from existing host code (e.g. the standard library), I don't see any inherent reason why they can't come from code written for the GPU. E.g. thrust is likely to have some implicitly HD functions in code that was not intended for CPUs, and your assumption will be wrong there. Even if such a case may not exist now, it would not be unreasonable for users to have such code on the device.
This overload resolution difference is observable and will likely create new corner cases in convoluted enough C++ code.

I think we need something more principled than "happens to work for existing code".

Only the original resolution guarantees no other issues. For example, in the failed TF compilation, some ctor of std::atomic becomes an implicit host device function because it is constexpr. It should be treated as wrong-sided in device compilation, but as same-sided in host compilation; otherwise it changes the resolution in host compilation and causes other issues.

It may be true for atomic, where we do need a GPU-specific implementation. However, I can also see classes with constexpr constructors that are perfectly usable on both sides and do not have to be treated as wrong-sided.

TBH, I do not see any reasonable way to deal with this with the current implementation of how HD functions are treated. This patch and its base do improve things somewhat, but it all comes at the cost of further complexity and potentially paints us even deeper into a corner. Current behavior is already rather hard to explain.

Some time back @wash from NVIDIA was asking about improving HD function handling. Maybe it's time for all interested parties to get together and come up with a better solution. Not in this patch, obviously.

clang/test/SemaCUDA/function-overload.cu
464–470

These tests only verify that the code compiles; they do not guarantee that we've picked the correct overload.
You should give the callees different return types and assign the result to a variable of the intended type. See test_host_device_calls_hd_template() on line 341 for an example.

tra added a comment.May 8 2020, 2:23 PM

This one is just an FYI. I've managed to reduce the failure in the first version of this patch, and it looks rather odd because the reduced test case has nothing to do with CUDA. Instead, it appears to introduce a difference in the compilation of regular host-only C++ code with -x cuda vs -x c++. I'm not sure how/why the first version caused this or why the latest one fixes it. It may be worth double-checking that we're not missing something here.

template <class a> a b;
auto c(...);
template <class d> constexpr auto c(d) -> decltype(0);
struct e {
  template <class ad, class... f> static auto g(ad, f...) {
    h<e, decltype(b<f>)...>;
  }
  struct i {
    template <class, class... f> static constexpr auto j(f... k) { c(k...); }
  };
  template <class, class... f> static auto h() { i::j<int, f...>; }
};
class l {
  l() {
    e::g([] {}, this);
  }
};

The latest version of this patch works, but the previous one failed with an error when the example was compiled as CUDA, but not when it was compiled as C++:

$ bin/clang++ -x cuda argmax.cc -ferror-limit=1 -fsyntax-only --cuda-host-only -nocudalib -nocudainc -fsized-deallocation -std=c++17

argmax.cc:9:68: error: function 'c' with deduced return type cannot be used before it is defined
    template <class, class... f> static constexpr auto j(f... k) { c(k...); }
                                                                   ^
argmax.cc:11:53: note: in instantiation of function template specialization 'e::i::j<int, l *>' requested here
  template <class, class... f> static auto h() { i::j<int, f...>; }
                                                    ^
argmax.cc:6:5: note: in instantiation of function template specialization 'e::h<e, l *>' requested here
    h<e, decltype(b<f>)...>;
    ^
argmax.cc:15:8: note: in instantiation of function template specialization 'e::g<(lambda at argmax.cc:15:10), l *>' requested here
    e::g([] {}, this);
       ^
argmax.cc:2:6: note: 'c' declared here
auto c(...);
     ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
2 errors generated when compiling for host.
$ bin/clang++ -x c++ argmax.cc -ferror-limit=1 -fsyntax-only --cuda-host-only -nocudalib -nocudainc -fsized-deallocation -std=c++17

clang-11: warning: argument unused during compilation: '-nocudainc' [-Wunused-command-line-argument]
argmax.cc:11:50: warning: expression result unused [-Wunused-value]
  template <class, class... f> static auto h() { i::j<int, f...>; }
                                                 ^~~~~~~~~~~~~~~
argmax.cc:6:5: note: in instantiation of function template specialization 'e::h<e, l *>' requested here
    h<e, decltype(b<f>)...>;
    ^
argmax.cc:15:8: note: in instantiation of function template specialization 'e::g<(lambda at argmax.cc:15:10), l *>' requested here
    e::g([] {}, this);
       ^
argmax.cc:6:5: warning: expression result unused [-Wunused-value]
    h<e, decltype(b<f>)...>;
    ^~~~~~~~~~~~~~~~~~~~~~~
argmax.cc:15:8: note: in instantiation of function template specialization 'e::g<(lambda at argmax.cc:15:10), l *>' requested here
    e::g([] {}, this);
       ^
argmax.cc:3:35: warning: inline function 'c<l *>' is not defined [-Wundefined-inline]
template <class d> constexpr auto c(d) -> decltype(0);
                                  ^
argmax.cc:9:68: note: used here
    template <class, class... f> static constexpr auto j(f... k) { c(k...); }
                                                                   ^
3 warnings generated.
clang/include/clang/Sema/Sema.h
11663

Plumbing an optional output argument through multiple levels of callers is rather hard to follow, especially considering that it's not set in all code paths. Perhaps we can turn IsImplicitHDAttr into a separate function and call it from isBetterOverloadCandidate().

yaxunl marked 4 inline comments as done.May 10 2020, 6:35 AM
In D79526#2027695, @tra wrote:

This one is just an FYI. I've managed to reduce the failure in the first version of this patch, and it looks rather odd because the reduced test case has nothing to do with CUDA. Instead, it appears to introduce a difference in the compilation of regular host-only C++ code with -x cuda vs -x c++. I'm not sure how/why the first version caused this or why the latest one fixes it. It may be worth double-checking that we're not missing something here.

template <class a> a b;
auto c(...);
template <class d> constexpr auto c(d) -> decltype(0);
struct e {
  template <class ad, class... f> static auto g(ad, f...) {
    h<e, decltype(b<f>)...>;
  }
  struct i {
    template <class, class... f> static constexpr auto j(f... k) { c(k...); }
  };
  template <class, class... f> static auto h() { i::j<int, f...>; }
};
class l {
  l() {
    e::g([] {}, this);
  }
};

Function j is an implicit host device function; it calls function c. There are two candidates: the first is a host function, the second is an implicit host device function.

Assuming this code is originally C++ code, the author intends the second to be chosen, since it is a better match. The code will fail to compile if the first is chosen, since its return type cannot be deduced.

Now we compile it as CUDA code, and constexpr functions automatically become implicit host device functions. In host compilation we do not need special handling, since host device candidates and same-sided candidates are both viable. There was a bug that applied the special handling of implicit host device functions in host compilation, which was fixed by my last update.

Basically, we only need special handling for implicit host device functions in device compilation. In host compilation we always use normal overload resolution. For explicit host device functions we always use normal overload resolution.
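The reduced case above can be distilled to its core (a sketch with renamed identifiers, following the explanation above):

```cuda
auto pick(...);                         // host; deduced return type, never defined
template <class T>
constexpr auto pick(T) -> decltype(0);  // constexpr => implicitly host device

// Implicit host device caller: resolving as in host compilation selects the
// template, which is a better match than the C-variadic overload. If the
// host pick(...) were selected instead, compilation would fail because its
// deduced return type cannot be used before the function is defined.
constexpr auto caller(int k) { return pick(k); }
```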

clang/include/clang/Sema/Sema.h
11663

will do

clang/test/SemaCUDA/function-overload.cu
464–470

They have different return types: the right one returns double and the wrong one returns void. If the wrong one is chosen, there is a syntax error, since the caller returns double.

yaxunl updated this revision to Diff 263041.May 10 2020, 6:39 AM
yaxunl marked 2 inline comments as done.

Introduce Sema::IsCUDAImplicitHostDeviceFunction() and remove changes to IdentifyCUDATarget and IdentifyCUDAPreference. Added one more test.

In D79526#2027552, @tra wrote:

For implicit host device functions, since they are not guaranteed to work in device compilation, we can only resolve them as if they were host functions. This causes asymmetry, but implicit host device functions are originally host functions, so they are biased toward host compilation to begin with.

I don't think the assertion that implicit host device functions are originally host functions is always true. While in practice most such functions may indeed come from existing host code (e.g. the standard library), I don't see any inherent reason why they can't come from code written for the GPU. E.g. thrust is likely to have some implicitly HD functions in code that was not intended for CPUs, and your assumption will be wrong there. Even if such a case may not exist now, it would not be unreasonable for users to have such code on the device.
This overload resolution difference is observable and will likely create new corner cases in convoluted enough C++ code.

I agree that currently it is possible to force a device function to be implicitly host device via pragma. However, it is arguable whether we should apply special handling of overload resolution in this case. We apply special handling of overload resolution because we cannot modify some system headers that were originally intended for the host. If a function was originally a device function, it is CUDA/HIP code; it should follow the normal overload resolution rules and should be fixed if issues occur when it is marked as a host device function.

I think we need something more principled than "happens to work for existing code".

Only the original resolution guarantees no other issues. For example, in the failed TF compilation, some ctor of std::atomic becomes an implicit host device function because it is constexpr. It should be treated as wrong-sided in device compilation, but as same-sided in host compilation; otherwise it changes the resolution in host compilation and causes other issues.

It may be true for atomic, where we do need a GPU-specific implementation. However, I can also see classes with constexpr constructors that are perfectly usable on both sides and do not have to be treated as wrong-sided.

Before this patch (together with the reverted commit), host device candidates were always treated with the same preference as wrong-sided candidates in device compilation, so a wrong-sided candidate could hide a viable host device candidate. This patch fixes that for most cases, including: 1. host compilation; 2. explicit host device callers; 3. explicit host device callees. Only in device compilation, when an implicit host device caller calls an implicit host device callee, do we apply the special 'incorrect' overload resolution rule. If the special handling has an undesirable effect on users' code, they can mark either the caller or the callee as explicit host device to bypass it.

TBH, I do not see any reasonable way to deal with this with the current implementation of how HD functions are treated. This patch and its base do improve things somewhat, but it all comes at the cost of further complexity and potentially paints us even deeper into a corner. Current behavior is already rather hard to explain.

Some time back @wash from NVIDIA was asking about improving HD function handling. Maybe it's time for all interested parties to figure out whether it's time to come up with a better solution. Not in this patch, obviously.

This patch is trying to fix the incorrect overload resolution rule for host device callees in host device callers: a host device callee should be favored over a wrong-sided callee, but currently it is not.

If we reject this patch, we have to live with the incorrect overloading rule until a better fix is implemented.

The complexity introduced by this patch is that it needs a special rule for implicit host device callers and implicit host device callees in device compilation, where an implicit host device callee is not favored over a wrong-sided callee, to preserve the overload resolution result as if they were both host callees. This allows some functions in system headers to become implicit host device functions without causing undeferrable diagnostics.

The complexity introduced in the compiler code is not significant: a new function Sema::IsCUDAImplicitHostDeviceFunction is introduced and used in isBetterOverloadCandidate to detect the special situation that needs special handling. The code for the special handling itself is trivial.

The complexity introduced in the overload resolution rule is somewhat concerning.

Before this patch, the rule was: same-sided candidates are favored over wrong-sided candidates, but host device candidates have the same preference as wrong-sided candidates.

After this patch, the rule is: same-sided candidates and host device candidates are both favored over wrong-sided candidates, except for implicit host device functions in device compilation, which preserve the original resolution.

The reason for the exception is that the implicit host device caller may be in a system header that users cannot modify. In device compilation, favoring implicit host device candidates over host candidates may change the resolution result, which incurs diagnostics.
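A sketch of why the exception matters (hypothetical stand-ins for declarations in a system header the user cannot modify):

```cuda
int use(long);            // host function; the better match below
constexpr int use(short); // constexpr => implicitly host device

// Also implicitly host device. In device compilation, favoring the host
// device use(short) over the wrong-sided use(long) would change the result
// relative to host compilation and could incur new diagnostics; preserving
// host resolution keeps use(long) selected and defers any wrong-side
// diagnostic until the caller is actually emitted for the device.
constexpr int in_system_header(long v) { return use(v); }
```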

Alternative solution I can think of:

  1. Defer all diagnostics possibly incurred by the overload resolution change. Since the host device function is invalid, it cannot really be used by device code; as long as it is not actually emitted, it should be OK. However, this requires all possible diagnostics incurred by the overload resolution change to be deferred, which requires significant changes since 1) the PartialDiagnosticBuilder currently supports fewer input types than the DiagnosticBuilder; 2) a deferred diagnostic may have accompanying notes that need to be deferred in coordination; 3) if there are control-flow changes depending on whether diagnostics happen, they need to be modified so that compilation continues.
  2. Change the precedence of host-ness: if selecting a candidate would incur an error, it is not favored by host-ness, i.e. we would rather choose a wrong-sided candidate that does not cause another error than an implicit host device candidate that does.
tra added inline comments.May 11 2020, 12:23 PM
clang/include/clang/Sema/Sema.h
11670

I think this can be static as it does not need Sema's state.

clang/lib/Sema/SemaCUDA.cpp
217–220

Is it possible for us to ever end up here with an explicitly set attribute but with an implicit function? If that were to happen, we'd return true and that would be incorrect.
Perhaps add an assert to make sure it does not happen, or always return A->isImplicit() if an attribute is already set.

clang/test/SemaCUDA/function-overload.cu
464–470

Ah. I've missed it. Could you change the types to struct CorrectOverloadRetTy/struct IncorrectOverloadRetTy to make it more obvious?

yaxunl marked 6 inline comments as done.May 11 2020, 1:21 PM
yaxunl added inline comments.
clang/include/clang/Sema/Sema.h
11670

will do

clang/lib/Sema/SemaCUDA.cpp
217–220

will return A->isImplicit()

clang/test/SemaCUDA/function-overload.cu
464–470

will do

yaxunl updated this revision to Diff 263268.May 11 2020, 1:55 PM
yaxunl marked 3 inline comments as done.

Revised per Artem's comments.

tra accepted this revision.May 11 2020, 3:23 PM

LGTM, modulo cosmetic test changes mentioned below.

clang/test/SemaCUDA/function-overload.cu
465

Is inline necessary in these new tests? Please remove it where it's not needed.

479

Nit: Incorrect should not have C capitalized as it's one word.

488–515

Please move this test below the other two as keeping them together is useful to illustrate the differences in behavior of overloading in explicit HD vs implicit HD functions.

This revision is now accepted and ready to land.May 11 2020, 3:23 PM
yaxunl marked 6 inline comments as done.May 11 2020, 9:53 PM
yaxunl added inline comments.
clang/test/SemaCUDA/function-overload.cu
465

It is not needed by the callee but is needed by the caller to make sure it causes deferred diagnostics. Will remove it from the callees.

479

will fix.

488–515

will do

This revision was automatically updated to reflect the committed changes.
yaxunl marked 3 inline comments as done.
Herald added a project: Restricted Project. · View Herald Transcript · May 12 2020, 5:52 AM
tra added a comment.May 18 2020, 12:21 PM

Commit e03394c6a6ff5832aa43259d4b8345f40ca6a22c still breaks some existing CUDA code (got failures in pytorch and Eigen). I'll revert the patch and send you a reduced reproducer.

tra added a comment.May 18 2020, 2:51 PM

Reduced test case:

struct a {
  __attribute__((device)) a(short);
  __attribute__((device)) operator unsigned() const;
  __attribute__((device)) operator int() const;
};
struct b {
  a d;
};
void f(b g) { b e = g; }

Failure:

$ bin/clang++ -x cuda aten.cc -fsyntax-only  --cuda-path=$HOME/local/cuda-10.1 --cuda-device-only --cuda-gpu-arch=sm_60 -stdlib=libc++ -std=c++17 -ferror-limit=1

aten.cc:6:8: error: conversion from 'const a' to 'short' is ambiguous
struct b {
       ^
aten.cc:9:21: note: in implicit copy constructor for 'b' first required here
void f(b g) { b e = g; }
                    ^
aten.cc:3:27: note: candidate function
  __attribute__((device)) operator unsigned() const;
                          ^
aten.cc:4:27: note: candidate function
  __attribute__((device)) operator int() const;
                          ^
aten.cc:2:34: note: passing argument to parameter here
  __attribute__((device)) a(short);
                                 ^
1 error generated when compiling for sm_60.

The same code compiles fine in C++, and I would expect it to work the same way on the device side.

In D79526#2042680, @tra wrote:

The same code compiles fine in C++, and I would expect it to work the same way on the device side.

a and b both have an implicit HD copy ctor. In device compilation, the copy ctor of b calls the copy ctor of a. There are two candidates: the implicit HD copy ctor of a, and the device ctor a(short).

In my previous fix, I made H and implicit HD candidates equal; however, I forgot about the relation between D candidates and HD candidates. I incorrectly made D favored over HD and H. This caused the inferior device candidate a(short) to be chosen over the copy ctor of a.

I have a fix for this https://reviews.llvm.org/D80450