Page MenuHomePhabricator

[CUDA][HIP] Fix HD function resolution
ClosedPublic

Authored by yaxunl on May 22 2020, 10:33 AM.

Details

Summary

Add option -ffix-overload-resolution, which is off by default.

When -ffix-overload-resolution is off, keep the original behavior.
Otherwise enable the correct hostness based overloading resolution.

Diff Detail

Event Timeline

yaxunl created this revision.May 22 2020, 10:33 AM
tra added a comment.May 26 2020, 11:27 AM

Is this patch supposed to be used with D79526 or instead of it?

clang/test/SemaCUDA/function-overload.cu
463

__device__, etc. are defined by the included "Inputs/cuda.h" and can be used here to make it more readable.

yaxunl updated this revision to Diff 266366.May 26 2020, 4:08 PM

Fix test.

tra added a comment.Jun 3 2020, 9:39 AM
In D80450#2055463, @tra wrote:

Is this patch supposed to be used with D79526 or instead of it?

^^^ I don't think this has been answered. I would like to test this change before it lands.

yaxunl added a comment.Jun 3 2020, 9:55 AM
In D80450#2071696, @tra wrote:
In D80450#2055463, @tra wrote:

Is this patch supposed to be used with D79526 or instead of it?

^^^ I don't think this has been answered. I would like to test this change before it lands.

sorry I missed that. Yes this patch is used on top of D79526.

tra accepted this revision.Jun 3 2020, 12:45 PM

LGTM. Combined with D79526 it appears to work for tensorflow build.

This revision is now accepted and ready to land.Jun 3 2020, 12:45 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptJun 4 2020, 2:24 PM
tra added a comment.Jun 11 2020, 10:01 AM

Reproducer for the regression. https://gist.github.com/Artem-B/183e9cfc28c6b04c1c862c853b5d9575
It's not particularly small, but that's as far as I could get it reduced.

With the patch, an attempt to instantiate ag on line 36 (in the reproducer sources I linked to above) results in ambiguity between two templates on lines 33 and 24 that are in different namespaces.
Previously it picked the template on line 28.

tra added a comment.Jun 11 2020, 11:36 AM
In D80450#2087938, @tra wrote:

Reproducer for the regression. https://gist.github.com/Artem-B/183e9cfc28c6b04c1c862c853b5d9575
It's not particularly small, but that's as far as I could get it reduced.

With the patch, an attempt to instantiate ag on line 36 (in the reproducer sources I linked to above) results in ambiguity between two templates on lines 33 and 24 that are in different namespaces.
Previously it picked the template on line 28.

Managed to simplify the reproducer down to this which now reports that a host candidate has been ignored. This may explain why we ended up with the ambiguity when other overloads were present.

template <typename> struct a {};
namespace b {
struct c : a<int> {};
template <typename d> void ag(d);
} // namespace b
template <typename ae>
__attribute__((host)) __attribute__((device)) int ag(a<ae>) {
  ae e;
  ag(e);
}
void f() { ag<b::c>; }
In D80450#2088129, @tra wrote:
In D80450#2087938, @tra wrote:

Reproducer for the regression. https://gist.github.com/Artem-B/183e9cfc28c6b04c1c862c853b5d9575
It's not particularly small, but that's as far as I could get it reduced.

With the patch, an attempt to instantiate ag on line 36 (in the reproducer sources I linked to above) results in ambiguity between two templates on lines 33 and 24 that are in different namespaces.
Previously it picked the template on line 28.

Managed to simplify the reproducer down to this which now reports that a host candidate has been ignored. This may explain why we ended up with the ambiguity when other overloads were present.

template <typename> struct a {};
namespace b {
struct c : a<int> {};
template <typename d> void ag(d);
} // namespace b
template <typename ae>
__attribute__((host)) __attribute__((device)) int ag(a<ae>) {
  ae e;
  ag(e);
}
void f() { ag<b::c>; }

The error only happens in device compilation.

For the call ag(e). There are two candidates:

  1. ag in namespace b. The function arguments can match. However it is a host function, therefore is a wrong-sided candidate and not viable.
  1. ag in default name space. It is a host device function. However the function arguments requires a<ae>, therefore cannot match.

Before my patch, wrong-sided candidate is allowed. clang resolves to candidate 1 and this results in a diagnostic about host function referenced in device host function, which can be deferred. Since f() is not emitted on device side, the deferred diags is not emitted.

After my patch, wrong-sided candidate is not allowed. clang resolves to candidate 2, which results in a diagnostic that no matching function, which is not a deferred diagnostics by default and is emitted even if f() is not emitted on device side.

In a sense, the diagnostic is correct, since ag(a<ae>) cannot be emitted on device side. This can be fixed by either make ag(a<ae>) a host function or make ag(d) a host device function.

In the original testcase (https://gist.github.com/Artem-B/183e9cfc28c6b04c1c862c853b5d9575)

Before my change, call at line 36 resolves to wrong-sided candidate at line 29 since that is the best match for argument types. This results in a deferred diag which allows device compilation to pass.

After my change, call at line 36 resolves to two host device candidates. This results in diagnostics about ambiguity which is not deferred by default. Therefore the compilation fails.

Basically it all boils down to the issue that overloading resolution diagnostics are not deferred by default.

I think first of all we need to exclude wrong-sided candidates as this patch does, otherwise we cannot have correct hostness based overloading resolution and fix bugs like https://bugs.llvm.org/show_bug.cgi?id=46922 .

However by doing this we changes the existing overloading resolution incur some overloading resolution diags. Unless we defer these diags, we may break some existing CUDA/HIP code.

Fortunately https://reviews.llvm.org/D84364 is already landed, which allows deferring overloading resolution diags under option -fgpu-defer-diags.

I think a conservative solution is that we keep the old overloading resolution behavior by default (i.e. do not exclude wrong-sided candidates), whereas enable the correct hostness based overloading resolution when -fgpu-defer-diags is on. If developers would like correct hostness based overloading resolution, they can use -fgpu-defer-diags. Then as -fgpu-defer-diags become stable, we turn it on by default.

yaxunl reopened this revision.Nov 26 2020, 8:43 AM

reopen for fixing the regression

This revision is now accepted and ready to land.Nov 26 2020, 8:43 AM
yaxunl updated this revision to Diff 308032.Nov 27 2020, 6:10 AM
yaxunl retitled this revision from [CUDA][HIP] Fix implicit HD function resolution to [CUDA][HIP] Fix HD function resolution.
yaxunl edited the summary of this revision. (Show Details)

If -fgpu-defer-diags is off, keep original behavior.

tra added a comment.Nov 30 2020, 12:01 PM

For the call ag(e). There are two candidates:

  1. ag in namespace b. The function arguments can match. However it is a host function, therefore is a wrong-sided candidate and not viable.
  1. ag in default name space. It is a host device function. However the function arguments requires a<ae>, therefore cannot match.

Before my patch, wrong-sided candidate is allowed. clang resolves to candidate 1 and this results in a diagnostic about host function referenced in device host function, which can be deferred. Since f() is not emitted on device side, the deferred diags is not emitted.

This used to be a fairly common pattern in existing CUDA code. A lot of templated code had __host__ __device__ slapped on it because NVCC had no target overloading and it's hard to control who/where/how will instantiate particular template. Some of those templates could only be instantiated on one side of the compilation. Clang's only choice was to allow the wrong-side candidates and/or defer the diagnostics. I vaguely recall that it was one of the trickier bits of the overload resolution rules to handle.

After my patch, wrong-sided candidate is not allowed. clang resolves to candidate 2, which results in a diagnostic that no matching function, which is not a deferred diagnostics by default and is emitted even if f() is not emitted on device side.
...
Basically it all boils down to the issue that overloading resolution diagnostics are not deferred by default.

Looks that way.

I think a conservative solution is that we keep the old overloading resolution behavior by default (i.e. do not exclude wrong-sided candidates), whereas enable the correct hostness based overloading resolution when -fgpu-defer-diags is on. If developers would like correct hostness based overloading resolution, they can use -fgpu-defer-diags. Then as -fgpu-defer-diags become stable, we turn it on by default.

SGTM. I'll check how the patch fares on our CUDA code.

clang/test/SemaCUDA/function-overload.cu
614

competes->compete

tra added inline comments.Nov 30 2020, 12:05 PM
clang/test/SemaCUDA/function-overload.cu
616

One thing that bothers me about this comment is that -fgpu-defer-diag apparently changes the result of the overload resolution, not just deferring diags.

617–618

It would be great to have a test where those diagnostics *do* fire.

In D80450#2423706, @tra wrote:

SGTM. I'll check how the patch fares on our CUDA code.

Please hold on. I just found a regression due to old behavior not fully recovered in certain case. I will update the patch for fixing the regression.

yaxunl marked an inline comment as done.Nov 30 2020, 12:35 PM
yaxunl added inline comments.
clang/test/SemaCUDA/function-overload.cu
616

without -fgpu-defer-diag we have to keep the old incorrect overloading resolution since otherwise it breaks existing code.

We can only have correct overloading resolution with -fgpu-defer-diag on.

If we want to have correct overloading resolution, not depending on whether -fgpu-defer-diag is on or off, we have to turn on -fgpu-defer-diag by default. In this case no existing code will be broken.

tra added inline comments.Nov 30 2020, 1:04 PM
clang/test/SemaCUDA/function-overload.cu
616

We can only have correct overloading resolution with -fgpu-defer-diag on.

-fgpu-defer-diags is a prerequisite for fixing overload resolution. I'm fine with that.
Making it serve the double duty of affecting the overload resolution is what I was pointing at.

We should have a knob fix-overload-resolution which would then turn -fgpu-defer-diag on, not the other way around.

yaxunl marked 2 inline comments as done.Nov 30 2020, 1:17 PM
yaxunl added inline comments.
clang/test/SemaCUDA/function-overload.cu
616

That makes sense. Will add -ffix-overload-resolution.

yaxunl updated this revision to Diff 308515.Nov 30 2020, 7:11 PM
yaxunl edited the summary of this revision. (Show Details)

Add -ffix-overload-resolution and fix a regression.

tra accepted this revision.Dec 1 2020, 1:02 PM

LGTM.

I'd suggest adding more details on the background of this change to the commit log (point to the comment in the isBetterOverloadCandidate ?) and outline the intention to enable the new way to do overloading after some soak time.

Also, naming. -ffix-overload-resolution is rather non-specific. I didn't mean to use it literally. The problem is that I can't think of a good descriptive name for what we do here. -fgpu-fix-wrong-side-overloads ? Something else?

clang/lib/Sema/SemaOverload.cpp
9621

The comment uses device/host for both function attributes and when it refers to the compilation phase. It would help to make it more readable if function attributes would be distinct from compilation phase. E.g. by using __host__ __device__ or HD.

yaxunl marked 2 inline comments as done.Dec 2 2020, 8:27 AM
In D80450#2426507, @tra wrote:

LGTM.

I'd suggest adding more details on the background of this change to the commit log (point to the comment in the isBetterOverloadCandidate ?) and outline the intention to enable the new way to do overloading after some soak time.

Will do.

Also, naming. -ffix-overload-resolution is rather non-specific. I didn't mean to use it literally. The problem is that I can't think of a good descriptive name for what we do here. -fgpu-fix-wrong-side-overloads ? Something else?

How about -fgpu-exclude-wrong-side-overloads? Since what this patch does is always excluding wrong side overloads whereas previously only excluding wrong side overloads if there are same side overloads.

clang/lib/Sema/SemaOverload.cpp
9621

will use H/D/HD for function attribute when committing.

tra added a comment.Dec 2 2020, 10:12 AM

Also, naming. -ffix-overload-resolution is rather non-specific. I didn't mean to use it literally. The problem is that I can't think of a good descriptive name for what we do here. -fgpu-fix-wrong-side-overloads ? Something else?

How about -fgpu-exclude-wrong-side-overloads? Since what this patch does is always excluding wrong side overloads whereas previously only excluding wrong side overloads if there are same side overloads.

SGTM. Maybe, also make it hidden. I don't think it's useful for the end users.

This revision was automatically updated to reflect the committed changes.
yaxunl marked an inline comment as done.