
[CUDA][HIP] Workaround for resolving host device function against wrong-sided function
ClosedPublic

Authored by yaxunl on May 6 2020, 4:03 PM.

Details

Summary

https://reviews.llvm.org/D77954 caused regressions due to diagnostics in implicit
host device functions.

The implicit host device functions are often functions in system headers forced to be host device by pragmas.

Some of them are valid host device functions that can be emitted in both host and device compilation.

Some of them are valid host functions but invalid device functions. In device compilation they incur
diagnostics. However, as long as these diagnostics are deferred and these functions are not emitted,
this is fine.

Before D77954, in host device callers, host device candidates were not favored over wrong-sided candidates,
which preserves the overload resolution result as if the caller and the candidates were host functions.
This makes sure the callee does not cause other issues, e.g. type mismatches or const-ness issues. If the
selected function is a host device function, it is a viable callee. If the selected function is a host
function, the caller is not a valid host device function; this results in a diagnostic, but it can be deferred.

The problem is that we have to give host device candidates equal preference with wrong-sided candidates. If
users really intend to favor a host device candidate over a wrong-sided candidate, they cannot get the
expected selection.
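To make the intended selection concrete, here is a hypothetical sketch (names and signatures invented for illustration, not taken from the patch) of a case where the user wants the host device candidate but cannot get it:

```cuda
// Device compilation of an explicit host device caller (hypothetical code).
__host__ void transform(int);                 // wrong-sided, but the better match
__host__ __device__ void transform(double);   // the overload the user intends

__host__ __device__ void caller() {
  // With equal preference between host device and wrong-sided candidates,
  // conversion rank decides and the host transform(int) wins; favoring
  // host device candidates would select transform(double) instead.
  transform(1);
}
```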

Ideally, we should be able to defer all diagnostics for functions that are not known to be emitted. In that case we could
use the correct preference. If diagnostics occur due to an overload resolution change, as long as the function
is not emitted, it is fine.

Unfortunately, deferring all diagnostics is not trivial work. Even deferring only the diagnostics related to
overload resolution is not simple.

For now, the most feasible workaround seems to be to treat implicit host device functions and explicit host
device functions differently. Basically, in device compilation, keep the old behavior for implicit host
device functions, i.e. give host device candidates and wrong-sided candidates equal preference. For explicit
host device functions, favor host device candidates over wrong-sided candidates.

The rationale is that explicit host device functions are blessed by the user to be valid host device functions,
that is, they should not cause diagnostics in either host or device compilation. If diagnostics occur, the user is
able to fix them. However, there is no guarantee that an implicit host device function can be compiled in
device compilation, therefore we need to preserve its overload resolution in device compilation.
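A minimal sketch contrasting the two cases described above (hypothetical code; a constexpr function stands in for an implicit host device function, since clang implicitly marks constexpr functions host device in CUDA mode):

```cuda
__host__ int pick(int);                 // wrong-sided in device compilation
__host__ __device__ int pick(double);   // host device candidate

// Explicit host device: blessed by the user, so in device compilation the
// host device pick(double) is favored over the wrong-sided pick(int).
__host__ __device__ int explicit_hd() { return pick(1); }

// Implicit host device (constexpr): keep the old behavior and resolve as a
// host function would, so the better-matching pick(int) is still selected;
// the resulting wrong-side diagnostic is deferred and harmless unless
// implicit_hd() is actually emitted for the device.
constexpr int implicit_hd() { return pick(1); }
```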

Diff Detail

Event Timeline

yaxunl created this revision.May 6 2020, 4:03 PM
tra added a comment.May 7 2020, 12:12 PM

I've tested the patch on our sources and it still breaks tensorflow compilation, though in a different way:

In file included from third_party/tensorflow/core/kernels/slice_op_gpu.cu.cc:22:
In file included from ./third_party/tensorflow/core/framework/register_types.h:20:
In file included from ./third_party/tensorflow/core/framework/numeric_types.h:28:
In file included from ./third_party/tensorflow/core/platform/types.h:22:
In file included from ./third_party/tensorflow/core/platform/tstring.h:24:
In file included from ./third_party/tensorflow/core/platform/cord.h:23:
In file included from ./third_party/tensorflow/core/platform/google/cord.h:19:
In file included from ./third_party/absl/strings/cord.h:89:
./third_party/absl/strings/internal/cord_internal.h:34:16: error: no matching constructor for initialization of 'std::atomic<int32_t>' (aka 'atomic<int>')
  Refcount() : count_{1} {}
               ^     ~~~
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1778:8: note: candidate constructor (the implicit copy constructor) not viable: no known conversion from 'int' to 'const std::__u::atomic<int>' for 1st argument
struct atomic
       ^
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1784:5: note: candidate constructor not viable: requires 0 arguments, but 1 was provided
    atomic() _NOEXCEPT _LIBCPP_DEFAULT
    ^
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1807:52: error: call to deleted constructor of '__atomic_base<base::scheduling::Schedulable *>'
    _LIBCPP_CONSTEXPR atomic(_Tp* __d) _NOEXCEPT : __base(__d) {}
                                                   ^      ~~~
./third_party/absl/base/internal/thread_identity.h:162:66: note: in instantiation of member function 'std::__u::atomic<base::scheduling::Schedulable *>::atomic' requested here
    std::atomic<base::scheduling::Schedulable*> bound_schedulable{nullptr};
                                                                 ^
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1675:5: note: '__atomic_base' has been explicitly marked deleted here
    __atomic_base(const __atomic_base&) = delete;
    ^
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1786:51: error: call to implicitly-deleted copy constructor of '__atomic_base<long>'
    _LIBCPP_CONSTEXPR atomic(_Tp __d) _NOEXCEPT : __base(__d) {}
                                                  ^      ~~~
./third_party/absl/synchronization/mutex.h:927:25: note: in instantiation of member function 'std::__u::atomic<long>::atomic' requested here
inline Mutex::Mutex() : mu_(0) {
                        ^
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1698:7: note: copy constructor of '__atomic_base<long, true>' is implicitly deleted because base class '__atomic_base<long, false>' has a deleted copy constructor
    : public __atomic_base<_Tp, false>
      ^
third_party/crosstool/v18/llvm_unstable/toolchain/bin/../include/c++/v1/atomic:1675:5: note: '__atomic_base' has been explicitly marked deleted here
    __atomic_base(const __atomic_base&) = delete;
    ^
yaxunl updated this revision to Diff 262858.May 8 2020, 5:15 AM
yaxunl edited the summary of this revision. (Show Details)

Fix regression: only treat implicit host device candidates as inferior in device compilation.

yaxunl added a comment.May 8 2020, 5:19 AM
In D79526#2025761, @tra wrote:

I've tested the patch on our sources and it still breaks tensorflow compilation, though in a different way.

Looks like we went overboard in treating implicit host device candidates as inferior. They should be treated
as inferior in device compilation, not in host compilation. Here, because they were treated as inferior
to same-sided candidates in host compilation, they changed overload resolution in host compilation
and therefore caused the failure there.

I have updated the patch to treat implicit host device candidates as inferior only in device compilation.

tra added a comment.May 8 2020, 10:45 AM

The latest version of the patch works well enough to compile tensorflow. That's the good news.

Looks like we went overboard in treating implicit host device candidates as inferior. They should be treated
as inferior in device compilation, not in host compilation. Here, because they were treated as inferior
to same-sided candidates in host compilation, they changed overload resolution in host compilation
and therefore caused the failure there.

I have updated the patch to treat implicit host device candidates as inferior only in device compilation.

I'm concerned that this creates inconsistency in how overload resolution works during host and device compilation.
In general they should behave the same. I.e. a test where this change is needed during device-side compilation will require the same change on the host side, if you swap H and D attributes on the functions in the test.

Speaking of tests, it would be great to add a test illustrating this scenario.

In D79526#2027242, @tra wrote:

The latest version of the patch works well enough to compile tensorflow. That's the good news.

Looks like we went overboard in treating implicit host device candidates as inferior. They should be treated
as inferior in device compilation, not in host compilation. Here, because they were treated as inferior
to same-sided candidates in host compilation, they changed overload resolution in host compilation
and therefore caused the failure there.

I have updated the patch to treat implicit host device candidates as inferior only in device compilation.

I'm concerned that this creates inconsistency in how overload resolution works during host and device compilation.
In general they should behave the same. I.e. a test where this change is needed during device-side compilation will require the same change on the host side, if you swap H and D attributes on the functions in the test.

Speaking of tests, it would be great to add a test illustrating this scenario.

I added a test at line 483 for the situation.

For implicit host device functions, since they are not guaranteed to work in device compilation, we can only resolve them as if they were host functions. This causes asymmetry, but implicit host device functions are originally host functions, so they are biased toward host compilation to begin with. Only the original resolution guarantees no other issues. For example, in the failed TF compilation, some ctor of std::atomic becomes an implicit host device function because it is constexpr. It should be treated as wrong-sided in device compilation, but as same-sided in host compilation; otherwise it changes the resolution in host compilation and causes other issues.

tra added a subscriber: wash.May 8 2020, 1:16 PM

For implicit host device functions, since they are not guaranteed to work in device compilation, we can only resolve them as if they were host functions. This causes asymmetry, but implicit host device functions are originally host functions, so they are biased toward host compilation to begin with.

I don't think the assertion that implicit host device functions are originally host functions is always true. While in practice most such functions may indeed come from existing host code (e.g. the standard library), I don't see any inherent reason why they can't come from code written for the GPU. E.g. thrust is likely to have some implicitly HD functions in code that was not intended for CPUs, and your assumption will be wrong there. Even if such a case may not exist now, it would not be unreasonable for users to have such code on the device.
This overload resolution difference is observable and will likely create new corner cases in convoluted enough C++ code.

I think we need something more principled than "happens to work for existing code".

Only the original resolution guarantees no other issues. For example, in the failed TF compilation, some ctor of std::atomic becomes an implicit host device function because it is constexpr. It should be treated as wrong-sided in device compilation, but as same-sided in host compilation; otherwise it changes the resolution in host compilation and causes other issues.

It may be true for atomic, where we do need a GPU-specific implementation. However, I can also see classes with constexpr constructors that are perfectly usable on both sides and do not have to be treated as wrong-sided.

TBH, I do not see any reasonable way to deal with this with the current implementation of how HD functions are treated. This patch and its base do improve things somewhat, but it all comes at the cost of further complexity and potentially paints us even deeper into a corner. Current behavior is already rather hard to explain.

Some time back @wash from NVIDIA was asking about improving HD function handling. Maybe it's time for all interested parties to get together and come up with a better solution. Not in this patch, obviously.

clang/test/SemaCUDA/function-overload.cu
464–470

These tests only verify that the code compiles; they do not guarantee that we've picked the correct overload.
You should give the callees different return types and assign the result to a variable of the intended type. See test_host_device_calls_hd_template() on line 341 for an example.

tra added a comment.May 8 2020, 2:23 PM

This one is just an FYI. I've managed to reduce the failure in the first version of this patch, and it looks rather odd because the reduced test case has nothing to do with CUDA. Instead, it appears to introduce a difference in the compilation of regular host-only C++ code with -x cuda vs -x c++. I'm not sure how/why the first version caused this or why the latest one fixes it. It may be worth double-checking that we're not missing something here.

template <class a> a b;
auto c(...);
template <class d> constexpr auto c(d) -> decltype(0);
struct e {
  template <class ad, class... f> static auto g(ad, f...) {
    h<e, decltype(b<f>)...>;
  }
  struct i {
    template <class, class... f> static constexpr auto j(f... k) { c(k...); }
  };
  template <class, class... f> static auto h() { i::j<int, f...>; }
};
class l {
  l() {
    e::g([] {}, this);
  }
};

The latest version of this patch works, but the previous one failed with an error when the example was compiled as CUDA, but not when it was compiled as C++:

$ bin/clang++ -x cuda argmax.cc -ferror-limit=1 -fsyntax-only --cuda-host-only -nocudalib -nocudainc -fsized-deallocation -std=c++17

argmax.cc:9:68: error: function 'c' with deduced return type cannot be used before it is defined
    template <class, class... f> static constexpr auto j(f... k) { c(k...); }
                                                                   ^
argmax.cc:11:53: note: in instantiation of function template specialization 'e::i::j<int, l *>' requested here
  template <class, class... f> static auto h() { i::j<int, f...>; }
                                                    ^
argmax.cc:6:5: note: in instantiation of function template specialization 'e::h<e, l *>' requested here
    h<e, decltype(b<f>)...>;
    ^
argmax.cc:15:8: note: in instantiation of function template specialization 'e::g<(lambda at argmax.cc:15:10), l *>' requested here
    e::g([] {}, this);
       ^
argmax.cc:2:6: note: 'c' declared here
auto c(...);
     ^
fatal error: too many errors emitted, stopping now [-ferror-limit=]
2 errors generated when compiling for host.
$ bin/clang++ -x c++ argmax.cc -ferror-limit=1 -fsyntax-only --cuda-host-only -nocudalib -nocudainc -fsized-deallocation -std=c++17

clang-11: warning: argument unused during compilation: '-nocudainc' [-Wunused-command-line-argument]
argmax.cc:11:50: warning: expression result unused [-Wunused-value]
  template <class, class... f> static auto h() { i::j<int, f...>; }
                                                 ^~~~~~~~~~~~~~~
argmax.cc:6:5: note: in instantiation of function template specialization 'e::h<e, l *>' requested here
    h<e, decltype(b<f>)...>;
    ^
argmax.cc:15:8: note: in instantiation of function template specialization 'e::g<(lambda at argmax.cc:15:10), l *>' requested here
    e::g([] {}, this);
       ^
argmax.cc:6:5: warning: expression result unused [-Wunused-value]
    h<e, decltype(b<f>)...>;
    ^~~~~~~~~~~~~~~~~~~~~~~
argmax.cc:15:8: note: in instantiation of function template specialization 'e::g<(lambda at argmax.cc:15:10), l *>' requested here
    e::g([] {}, this);
       ^
argmax.cc:3:35: warning: inline function 'c<l *>' is not defined [-Wundefined-inline]
template <class d> constexpr auto c(d) -> decltype(0);
                                  ^
argmax.cc:9:68: note: used here
    template <class, class... f> static constexpr auto j(f... k) { c(k...); }
                                                                   ^
3 warnings generated.
clang/include/clang/Sema/Sema.h
11663

Plumbing an optional output argument through multiple levels of callers is rather hard to follow, especially considering that it's not set in all code paths. Perhaps we can turn IsImplicitHDAttr into a separate function and call it from isBetterOverloadCandidate().

yaxunl marked 4 inline comments as done.May 10 2020, 6:35 AM
In D79526#2027695, @tra wrote:

This one is just an FYI. I've managed to reduce the failure in the first version of this patch, and it looks rather odd because the reduced test case has nothing to do with CUDA. Instead, it appears to introduce a difference in the compilation of regular host-only C++ code with -x cuda vs -x c++. I'm not sure how/why the first version caused this or why the latest one fixes it. It may be worth double-checking that we're not missing something here.

template <class a> a b;
auto c(...);
template <class d> constexpr auto c(d) -> decltype(0);
struct e {
  template <class ad, class... f> static auto g(ad, f...) {
    h<e, decltype(b<f>)...>;
  }
  struct i {
    template <class, class... f> static constexpr auto j(f... k) { c(k...); }
  };
  template <class, class... f> static auto h() { i::j<int, f...>; }
};
class l {
  l() {
    e::g([] {}, this);
  }
};

Function j is an implicit host device function; it calls function c. There are two candidates: the first is a host function, the second is an implicit host device function.

Assuming this code is originally C++ code, the author intends the second to be chosen, since it is a better match. The code will fail to compile if the first is chosen, since its return type cannot be deduced.

Now we compile it as CUDA code, and constexpr functions automatically become implicit host device functions. In host compilation we do not need special handling, since host device candidates and same-sided candidates are both viable. There was a bug that applied the special handling of implicit host device functions in host compilation, which was fixed by my last update.

Basically, we only need special handling for implicit host device functions in device compilation. In host compilation we always use normal overload resolution. For explicit host device functions we always use normal overload resolution.
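The reduced case above can be distilled to its core (a sketch with renamed identifiers, following the explanation above):

```cuda
auto pick(...);                         // host; deduced return type, never defined
template <class T>
constexpr auto pick(T) -> decltype(0);  // constexpr => implicitly host device

// Implicit host device caller: resolving as in host compilation selects the
// template, which is a better match than the C-variadic overload. If the
// host pick(...) were selected instead, compilation would fail because its
// deduced return type cannot be used before the function is defined.
constexpr auto caller(int k) { return pick(k); }
```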

clang/include/clang/Sema/Sema.h
11663

will do

clang/test/SemaCUDA/function-overload.cu
464–470

They have different return types: the right one returns double and the wrong one returns void. If the wrong one is chosen, there is a syntax error, since the caller returns double.

yaxunl updated this revision to Diff 263041.May 10 2020, 6:39 AM
yaxunl marked 2 inline comments as done.

Introduce Sema::IsCUDAImplicitHostDeviceFunction() and remove changes to IdentifyCUDATarget and IdentifyCUDAPreference. Added one more test.

In D79526#2027552, @tra wrote:

For implicit host device functions, since they are not guaranteed to work in device compilation, we can only resolve them as if they were host functions. This causes asymmetry, but implicit host device functions are originally host functions, so they are biased toward host compilation to begin with.

I don't think the assertion that implicit host device functions are originally host functions is always true. While in practice most such functions may indeed come from existing host code (e.g. the standard library), I don't see any inherent reason why they can't come from code written for the GPU. E.g. thrust is likely to have some implicitly HD functions in code that was not intended for CPUs, and your assumption will be wrong there. Even if such a case may not exist now, it would not be unreasonable for users to have such code on the device.
This overload resolution difference is observable and will likely create new corner cases in convoluted enough C++ code.

I agree that currently it is possible to force a device function to be implicitly host device via pragma. However, it is arguable whether we should apply special handling of overload resolution in this case. We apply special handling of overload resolution because we cannot modify some system headers that were originally intended for the host. If a function was originally a device function, it is CUDA/HIP code; it should follow the normal overload resolution rules and should be fixed if issues occur when it is marked as a host device function.

I think we need something more principled than "happens to work for existing code".

Only the original resolution guarantees no other issues. For example, in the failed TF compilation, some ctor of std::atomic becomes an implicit host device function because it is constexpr. It should be treated as wrong-sided in device compilation, but as same-sided in host compilation; otherwise it changes the resolution in host compilation and causes other issues.

It may be true for atomic, where we do need a GPU-specific implementation. However, I can also see classes with constexpr constructors that are perfectly usable on both sides and do not have to be treated as wrong-sided.

Before this patch (together with the reverted commit), host device candidates were always treated with the same preference as wrong-sided candidates in device compilation, so a wrong-sided candidate could hide a viable host device candidate. This patch fixes that for most cases, including: 1. host compilation; 2. explicit host device callers; 3. explicit host device callees. Only in device compilation, when an implicit host device caller calls an implicit host device callee, do we apply the special 'incorrect' overload resolution rule. If the special handling has an undesirable effect on users' code, they can mark either the caller or the callee as explicit host device to bypass it.

TBH, I do not see any reasonable way to deal with this with the current implementation of how HD functions are treated. This patch and its base do improve things somewhat, but it all comes at the cost of further complexity and potentially paints us even deeper into a corner. Current behavior is already rather hard to explain.

Some time back @wash from NVIDIA was asking about improving HD function handling. Maybe it's time for all interested parties to figure out whether it's time to come up with a better solution. Not in this patch, obviously.

This patch is trying to fix the incorrect overload resolution rule for host device callees in host device callers: a host device callee should be favored over a wrong-sided callee, but currently it is not.

If we reject this patch, we have to live with the incorrect overloading rule until a better fix is implemented.

The complexity introduced by this patch is that it needs a special rule for implicit host device callers and implicit host device callees in device compilation, where an implicit host device callee is not favored over a wrong-sided callee, to preserve the overload resolution result as if they were both host callees. This allows some functions in system headers to become implicit host device functions without causing undeferrable diagnostics.

The complexity introduced in the compiler code is not significant: a new function Sema::IsCUDAImplicitHostDeviceFunction is introduced and used in isBetterOverloadCandidate to detect the special situation that needs special handling. The code for the special handling itself is trivial.

The complexity introduced in the overload resolution rule is somewhat concerning.

Before this patch, the rule was: same-sided candidates are favored over wrong-sided candidates, but host device candidates have the same preference as wrong-sided candidates.

After this patch, the rule is: same-sided candidates and host device candidates are both favored over wrong-sided candidates, except for implicit host device functions in device compilation, which preserve the original resolution.

The reason for the exception is that the implicit host device caller may be in a system header that users cannot modify. In device compilation, favoring implicit host device candidates over host candidates may change the resolution result, which incurs diagnostics.
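A sketch of why the exception matters (hypothetical stand-ins for declarations in a system header the user cannot modify):

```cuda
int use(long);            // host function; the better match below
constexpr int use(short); // constexpr => implicitly host device

// Also implicitly host device. In device compilation, favoring the host
// device use(short) over the wrong-sided use(long) would change the result
// relative to host compilation and could incur new diagnostics; preserving
// host resolution keeps use(long) selected and defers any wrong-side
// diagnostic until the caller is actually emitted for the device.
constexpr int in_system_header(long v) { return use(v); }
```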

Alternative solution I can think of:

  1. Defer all diagnostics possibly incurred by the overload resolution change. Since the host device function is invalid, it cannot really be used by device code; as long as it is not actually emitted, it should be OK. However, this requires all possible diagnostics incurred by the overload resolution change to be deferred, which requires significant changes since 1) the PartialDiagnosticBuilder currently supports fewer input types than the DiagnosticBuilder; 2) a deferred diagnostic may have accompanying notes that need to be deferred in coordination; 3) if there are control-flow changes depending on whether diagnostics happen, they need to be modified so that compilation continues.
  2. Change the precedence of host-ness: if selecting a candidate would incur an error, it is not favored by host-ness, i.e. we would rather choose a wrong-sided candidate that does not cause another error than an implicit host device candidate that does.
tra added inline comments.May 11 2020, 12:23 PM
clang/include/clang/Sema/Sema.h
11670

I think this can be static as it does not need Sema's state.

clang/lib/Sema/SemaCUDA.cpp
217–220

Is it possible for us to ever end up here with an explicitly set attribute but with an implicit function? If that were to happen, we'd return true and that would be incorrect.
Perhaps add an assert to make sure it does not happen, or always return A->isImplicit() if an attribute is already set.

clang/test/SemaCUDA/function-overload.cu
464–470

Ah. I've missed it. Could you change the types to struct CorrectOverloadRetTy/struct IncorrectOverloadRetTy to make it more obvious?

yaxunl marked 6 inline comments as done.May 11 2020, 1:21 PM
yaxunl added inline comments.
clang/include/clang/Sema/Sema.h
11670

will do

clang/lib/Sema/SemaCUDA.cpp
217–220

will return A->isImplicit()

clang/test/SemaCUDA/function-overload.cu
464–470

will do

yaxunl updated this revision to Diff 263268.May 11 2020, 1:55 PM
yaxunl marked 3 inline comments as done.

Revised per Artem's comments.

tra accepted this revision.May 11 2020, 3:23 PM

LGTM, modulo cosmetic test changes mentioned below.

clang/test/SemaCUDA/function-overload.cu
465

Is inline necessary in these new tests? Please remove it where it's not needed.

479

Nit: Incorrect should not have C capitalized as it's one word.

488–515

Please move this test below the other two as keeping them together is useful to illustrate the differences in behavior of overloading in explicit HD vs implicit HD functions.

This revision is now accepted and ready to land.May 11 2020, 3:23 PM
yaxunl marked 6 inline comments as done.May 11 2020, 9:53 PM
yaxunl added inline comments.
clang/test/SemaCUDA/function-overload.cu
465

It is not needed by the callee but is needed by the caller to make sure it causes deferred diagnostics. Will remove it from the callees.

479

will fix.

488–515

will do

This revision was automatically updated to reflect the committed changes.
yaxunl marked 3 inline comments as done.
Herald added a project: Restricted Project. · View Herald Transcript · May 12 2020, 5:52 AM
tra added a comment.May 18 2020, 12:21 PM

Commit e03394c6a6ff5832aa43259d4b8345f40ca6a22c still breaks some existing CUDA code (got failures in pytorch and Eigen). I'll revert the patch and send you a reduced reproducer.

tra added a comment.May 18 2020, 2:51 PM

Reduced test case:

struct a {
  __attribute__((device)) a(short);
  __attribute__((device)) operator unsigned() const;
  __attribute__((device)) operator int() const;
};
struct b {
  a d;
};
void f(b g) { b e = g; }

Failure:

$ bin/clang++ -x cuda aten.cc -fsyntax-only  --cuda-path=$HOME/local/cuda-10.1 --cuda-device-only --cuda-gpu-arch=sm_60 -stdlib=libc++ -std=c++17 -ferror-limit=1

aten.cc:6:8: error: conversion from 'const a' to 'short' is ambiguous
struct b {
       ^
aten.cc:9:21: note: in implicit copy constructor for 'b' first required here
void f(b g) { b e = g; }
                    ^
aten.cc:3:27: note: candidate function
  __attribute__((device)) operator unsigned() const;
                          ^
aten.cc:4:27: note: candidate function
  __attribute__((device)) operator int() const;
                          ^
aten.cc:2:34: note: passing argument to parameter here
  __attribute__((device)) a(short);
                                 ^
1 error generated when compiling for sm_60.

The same code compiles fine in C++, and I would expect it to work the same way on the device side.

In D79526#2042680, @tra wrote:

The same code compiles fine in C++, and I would expect it to work the same way on the device side.

a and b both have an implicit HD copy ctor. In device compilation, the copy ctor of b calls the copy ctor of a. There are two candidates: the implicit HD copy ctor of a, and the device ctor a(short).

In my previous fix, I made H and implicit HD candidates equal; however, I forgot about the relation between D candidates and HD candidates. I incorrectly made D favored over HD and H. This caused the inferior device candidate a(short) to be chosen over the copy ctor of a.

I have a fix for this https://reviews.llvm.org/D80450