This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][HIP] Let lambda be host device by default
ClosedPublic

Authored by yaxunl on Apr 22 2020, 10:52 AM.

Details

Summary

This patch let lambda be host device by default and adds diagnostics for capturing host variable by reference in device lambda.

Diff Detail

Event Timeline

yaxunl created this revision.Apr 22 2020, 10:52 AM
tra added a reviewer: rsmith.Apr 22 2020, 11:50 AM
tra added a subscriber: rsmith.

Summoning @rsmith as I'm sure that there are interesting corner cases in lambda handling that we didn't consider.

Making lambdas implicitly HD will make it easier to write the code which can't be instantiated on one side of the compilation. That's probably observable via SFINAE, but I can't tell whether that matters.
By default I'd rather err on handling lambdas the same way as we do regular user-authored functions.

clang/lib/Sema/SemaCUDA.cpp
759

What about __global__ lambdas? We probably don't want to add HD attributes on them here.

clang/test/CodeGenCUDA/lambda.cu
17–27

The test example may not be doing what it's seemingly supposed to be doing:
https://cppinsights.io/s/3a5c42ff

h() gets a temporary host-side object which keeps the reference to a and that reference will actually point to the host-side shadow of the actual device-side a. When you get to execute g it's this may not be very usable on device side and thus f.operator() will probably not work.

Alas, we currently have no diagnostics for that kind of error.

Change it to a non-capturing lambda, perhaps?

yaxunl marked 4 inline comments as done.Apr 22 2020, 7:09 PM
yaxunl added inline comments.
clang/lib/Sema/SemaCUDA.cpp
759

lambda is not allowed to be kernel. I will add a lit test for that.

clang/test/CodeGenCUDA/lambda.cu
17–27

It works.

We need to think about this in device compilation. In device compilation, global variable is a device variable, the lambda is a device host function, therefore the lambda is accessing the real a, not the shadow.

In the host compilation, the lambda is not really called, therefore it is not emitted.

I will update the lit test with these checks.

yaxunl updated this revision to Diff 259452.Apr 22 2020, 7:13 PM
yaxunl marked 2 inline comments as done.

Add a negative test for lambda kernel. Add more checks to codegen test.

yaxunl updated this revision to Diff 259453.Apr 22 2020, 7:16 PM

clean up test.

pfultz2 added inline comments.Apr 23 2020, 7:32 PM
clang/lib/Sema/SemaCUDA.cpp
760

Shouldn't we add these attributes if there are no host and device attributes? This seems like it will treat []() __device__ {} as host device.

yaxunl marked 2 inline comments as done.Apr 23 2020, 8:10 PM
yaxunl added inline comments.
clang/lib/Sema/SemaCUDA.cpp
760

There is check on line 716. We only reach here if there is no device and host attrs on the function.

Seems reasonable

clang/test/CodeGenCUDA/lambda.cu
30

Typo

yaxunl updated this revision to Diff 259957.Apr 24 2020, 12:20 PM
yaxunl marked 2 inline comments as done.

Fix typo

rjmccall accepted this revision.Apr 24 2020, 12:33 PM
This revision is now accepted and ready to land.Apr 24 2020, 12:33 PM
tra added inline comments.Apr 24 2020, 1:39 PM
clang/test/CodeGenCUDA/lambda.cu
17–27

Clang manages to see through to the initializer of a, but I'm not sure how much we can rely on this.
In general, f.operator() for a capturing lambda needs to access captured variables via this which points to a temporary objects created and passed to g by the host. You can see it if you capture a local variable: https://godbolt.org/z/99389o

Anyways, it's an issue orthogonal to this patch. My concern is that tests are often used as an example of things that are OK to do, and capturing lambdas are a pretty big foot-shooting gun when used with CUDA. It's very easy to do wrong thing without compiler complaining about them.
While accessing a does work, it appears to do so by accident, rather than by design.

I'm fairly confident that I can hide the initializer with sufficiently complicated code, force clang to access a via this and make everything fail at runtime. IMO, what we have here is a 'happens to work' situation. I do not want to call it 'guaranteed to work' without making sure that it always does.

In order to demonstrate that lambda is host/device, you do not need it to be a capturing lambda. You can make it call an overloaded function with host and device variants and verify that the lambda works on host and device sides.

yaxunl updated this revision to Diff 260094.Apr 25 2020, 7:10 AM

Added more tests

yaxunl marked 2 inline comments as done.Apr 25 2020, 7:25 AM
yaxunl added inline comments.
clang/test/CodeGenCUDA/lambda.cu
17–27

I added one more test, where a lambda function calls a template function which is overloaded with a host version and a device version. The lambda is called in both host function and in kernel. Test shows correct version of template function are emitted in host and device compilation.

I think it is not a surprise that the lambda function is able to resolve the host/device-ness of the callee correctly. We are doing resolution in a host device function and the two candidates are same-side vs wrong-side.

hliao requested changes to this revision.Apr 25 2020, 9:20 AM
hliao added a subscriber: hliao.
In D78655#1997491, @tra wrote:

Summoning @rsmith as I'm sure that there are interesting corner cases in lambda handling that we didn't consider.

Making lambdas implicitly HD will make it easier to write the code which can't be instantiated on one side of the compilation. That's probably observable via SFINAE, but I can't tell whether that matters.
By default I'd rather err on handling lambdas the same way as we do regular user-authored functions.

Marking lambda __device__/__device__ __host__ should be

In D78655#1997491, @tra wrote:

Summoning @rsmith as I'm sure that there are interesting corner cases in lambda handling that we didn't consider.

Making lambdas implicitly HD will make it easier to write the code which can't be instantiated on one side of the compilation. That's probably observable via SFINAE, but I can't tell whether that matters.
By default I'd rather err on handling lambdas the same way as we do regular user-authored functions.

Marking lambda with proper attributes helps check the potential harmful captures.

In D78655#1997491, @tra wrote:

Summoning @rsmith as I'm sure that there are interesting corner cases in lambda handling that we didn't consider.

Making lambdas implicitly HD will make it easier to write the code which can't be instantiated on one side of the compilation. That's probably observable via SFINAE, but I can't tell whether that matters.
By default I'd rather err on handling lambdas the same way as we do regular user-authored functions.

I though the goal of adding HD/D attributes for lambda is to make the static check easier as lambda used in device code or device lambda is sensitive to captures. Invalid capture may render error accidentally without static check, says we capture host variable reference in a device lambda. That makes the final code invalid. Allowing regular lambda to be used in global or device function is considering harmful.

This revision now requires changes to proceed.Apr 25 2020, 9:20 AM
hliao added inline comments.Apr 25 2020, 9:22 AM
clang/test/CodeGenCUDA/lambda.cu
84

We are allowing regular lambda to be used in the device functions. That should be explicitly marked by making that lambda __device__ or __host__ __device__. Even though we may not have static checks for capture so far, that should be easily extended with those attributes.

yaxunl marked an inline comment as done.Apr 25 2020, 2:19 PM

I though the goal of adding HD/D attributes for lambda is to make the static check easier as lambda used in device code or device lambda is sensitive to captures. Invalid capture may render error accidentally without static check, says we capture host variable reference in a device lambda. That makes the final code invalid. Allowing regular lambda to be used in global or device function is considering harmful.

Inferring a lambda function by default as __host__ __device__ does not mean skipping the check for harmful captures.

If we add such checks, it does not matter whether the __host__ __device__ attribute is explicit or implicit, they go through the same check.

How to infer the device/host-ness of a lambda function is a usability issue. It is orthogonal to the issue of missing diagnostics about captures.

Forcing users to explicitly mark a lambda function as __device__ __host__ itself does not help diagnose the harmful captures if such diags do not exist.

Let's think about a lambda function which captures references to host variables. If it is only used in host code, as in ordinary C++ host code. Marking it host device implicitly does not change anything, since it is not emitted in device code. If it is used in device code, it will likely cause mem fault at run time since currently we do not diagnose it. Does it help if we force users to mark it __device__ __host__? I don't think so. Users will just reluctantly add __device__ __host__ to it and they still end up as mem fault. If we add the diagnostic about the harmful captures, it does not matter whether the __device__ __host__ attribute is explicit or implicit, users get the same diagnostic about harmful captures. So the effect is the same. However, usability is improved if users do not need to add __device__ __host__ by themselves.

Does inferring lambda function as __device__ __host__ by default making the diagnostic about harmful captures more difficult? No. It should be the same for lambdas with explicit __device__ __host__. This needs to be a deferred diagnostic like those we already did, which are only emitted if the function is really emitted. It does not matter whether the device/host attrs are implicit or explicit.

says we capture host variable reference in a device lambda.

Is that required to be an error? I know @AlexVlx added support to hcc at one point to capture host variables by reference. So it seems to be possible for it to work correctly. So it doesn't seem to be like reason enough to disallow implicit HD.

hliao added a comment.Apr 29 2020, 9:41 PM

I though the goal of adding HD/D attributes for lambda is to make the static check easier as lambda used in device code or device lambda is sensitive to captures. Invalid capture may render error accidentally without static check, says we capture host variable reference in a device lambda. That makes the final code invalid. Allowing regular lambda to be used in global or device function is considering harmful.

Inferring a lambda function by default as __host__ __device__ does not mean skipping the check for harmful captures.

If we add such checks, it does not matter whether the __host__ __device__ attribute is explicit or implicit, they go through the same check.

How to infer the device/host-ness of a lambda function is a usability issue. It is orthogonal to the issue of missing diagnostics about captures.

Forcing users to explicitly mark a lambda function as __device__ __host__ itself does not help diagnose the harmful captures if such diags do not exist.

Let's think about a lambda function which captures references to host variables. If it is only used in host code, as in ordinary C++ host code. Marking it host device implicitly does not change anything, since it is not emitted in device code. If it is used in device code, it will likely cause mem fault at run time since currently we do not diagnose it. Does it help if we force users to mark it __device__ __host__? I don't think so. Users will just reluctantly add __device__ __host__ to it and they still end up as mem fault. If we add the diagnostic about the harmful captures, it does not matter whether the __device__ __host__ attribute is explicit or implicit, users get the same diagnostic about harmful captures. So the effect is the same. However, usability is improved if users do not need to add __device__ __host__ by themselves.

Does inferring lambda function as __device__ __host__ by default making the diagnostic about harmful captures more difficult? No. It should be the same for lambdas with explicit __device__ __host__. This needs to be a deferred diagnostic like those we already did, which are only emitted if the function is really emitted. It does not matter whether the device/host attrs are implicit or explicit.

Not only the capture is an issue, like a regular function, lambda could also access non-local variables/functions. Without marking proper HD attributes explicitly, it's difficult or expensive to statically check the use of them. If we have to guess that attributes based on how they are called, we may find cases where conflicting results may be derived depending on call sites. It would be quite confusing if developers struggle to solve one but trigger another one or more. From the other perspective, a lambda is just another function and should have consistent rule for its usage, resolution, and etc.

hliao added a comment.Apr 29 2020, 9:47 PM

says we capture host variable reference in a device lambda.

Is that required to be an error? I know @AlexVlx added support to hcc at one point to capture host variables by reference. So it seems to be possible for it to work correctly. So it doesn't seem to be like reason enough to disallow implicit HD.

Do we have a mechanism to allow device code to access the local auto variable on the host side?

Not only the capture is an issue, like a regular function, lambda could also access non-local variables/functions.

In practice this is not an issue. Hcc will implictly treat anything inlinable as host device, and user's are not confused or surprised when they use non-local variables/reference that are on the host.

From the other perspective, a lambda is just another function and should have consistent rule for its usage, resolution, and etc.

But its not like another function. It has internal linkage(even when using global variables declared with inline). Lambdas are also implicitly constexpr whereas a function need to explicitly declare constexpr. Making lambdas implicitly HD whereas function need to be explicit seems to be consistent with how lambdas work with constexpr.

tra added a comment.Apr 30 2020, 11:02 AM

Not only the capture is an issue, like a regular function, lambda could also access non-local variables/functions.

In practice this is not an issue. Hcc will implictly treat anything inlinable as host device, and user's are not confused or surprised when they use non-local variables/reference that are on the host.

From the other perspective, a lambda is just another function and should have consistent rule for its usage, resolution, and etc.

But its not like another function.

+1. Lambda is an object. In addition to operator(), it may also have local storage for captured variables and can also be mutable, which makes it even more complicated. I.e. if I pass a mutable lambda by reference to the GPU kernel, will the same lambda called on host do the same thing when it's called on the device? In principle it would work if GPU and host operate un a uniform memory (i.e. all memory is accessible at the same addresses from both host and the GPU), but I don't think it's a universally true assumption. E.g. GPU-to-GPU memory accesses are not always possible, even when they technically share the same address space with the host. I can create a capturing lambda on one GPU, pass it to the host (may work fine there), pass it to another GPU and it will fail.

Lambdas are also implicitly constexpr whereas a function need to explicitly declare constexpr.

According to cppreference, it's only true since C++17 and, AFAICT, only for capture-less lambdas.

Making lambdas implicitly HD whereas function need to be explicit seems to be consistent with how lambdas work with constexpr.

Considering they are not always constexpr, this assertion is not true, either.

If/when operator() does get constexpr treatment by compiler, we should already derive HD attributes from constexpr. If we do not, then that's what needs to be fixed. That at least would make sense from consistency standpoint as we currently do treat all other constexpr functions as HD.

I.e. if I pass a mutable lambda by reference to the GPU kernel

I dont think we are enabling passing host objects by reference through functions. Although it could be possible to capture the mutable lambda by reference by another lambda.

will the same lambda called on host do the same thing when it's called on the device?

Yes, just as the same as capturing a host variable by reference and using it on the device.

In principle it would work if GPU and host operate un a uniform memory

A unified memory is not necessary. What is needed is a coordination between the compiler and runtime.

We dont support capturing host variable by reference, so maybe we can restrict the implicit HD to lambdas that don't capture by reference?

According to cppreference, it's only true since C++17 and, AFAICT, only for capture-less lambdas.

You can capture as well, if its in a constexpr context.

Considering they are not always constexpr, this assertion is not true, either.

Yes, we seem to delay this. It is always HD but not always emitted for both host and device.

The issue would be if users tried to detect HD using SFINAE. It could be a false claim, but maybe it doesnt matter. More importantly, if the lambda is called in a unevaluated context, will the compiler still emit the function or will it produce a hard error instead of a substitution failure? I assume something like this would compile:

template<class F>
__host__ auto is_host(F f) -> decltype(f(), std::true_type{});
std::false_type is_host(...);

template<class F>
__device__ auto is_device(F f) -> decltype(f(), std::true_type{});
std::false_type is_device(...);

__host__ void f();

void g()
{
    auto l = []{ f(); };
    using on_host = decltype(is_host(l));
    static_assert(on_host{}, "Lambda not on host");
    using on_device = decltype(is_device(l));
    static_assert(on_device{}, "Lambda not on device");
}

If/when operator() does get constexpr treatment by compiler, we should already derive HD attributes from constexpr. If we do not, then that's what needs to be fixed.

How does the compiler implement this? Does it add constexpr attribute onto the operator() or does the constexpr-evalutation visits the lambda as if it were constexpr? It seems the latter would be more effecient, and it would be similar to what we are doing with HD. The only difference is that a function can be overloaded with __host__ and __device__ whereas that is not possible with constexpr. So a difference could be detected by the user, but maybe that doesn't matter

That at least would make sense from consistency standpoint as we currently do treat all other constexpr functions as HD.

I mean consistent across the different attributes not in the interpretation of constexpr. A lambda that only calls constexpr functions implicitly has constexpr attribute. So, a lambda that only calls device functions(or HD) should implicitly have the __device__ attribute.

tra added a comment.Apr 30 2020, 3:26 PM

I.e. if I pass a mutable lambda by reference to the GPU kernel

I dont think we are enabling passing host objects by reference through functions. Although it could be possible to capture the mutable lambda by reference by another lambda.

You're not enabling it, but I don't think you prevent it either. In any case, the point was that we don't have a consistent way to handle capturing lambdas in CUDA&HIP. Hence my suggestion for not using it in the tests which would implicitly suggest otherwise.

will the same lambda called on host do the same thing when it's called on the device?

Yes, just as the same as capturing a host variable by reference and using it on the device.

I don't think I understand. Capturing happens when/where the lambda object is created. It captures addresses as seen at the point in time on the processor which executed the constructor.
When the reference to the lambda object is passed to another processor all bets are generally off. Again, besides the point of not using capturing lambda in the test it's somewhat irrelevant for this patch.

In principle it would work if GPU and host operate un a uniform memory

A unified memory is not necessary. What is needed is a coordination between the compiler and runtime.

We dont support capturing host variable by reference, so maybe we can restrict the implicit HD to lambdas that don't capture by reference?

Another point that capturing lambdas are not something ready for the prime time.

According to cppreference, it's only true since C++17 and, AFAICT, only for capture-less lambdas.

You can capture as well, if its in a constexpr context.

You can, but the point is that the lambda's operator()is not *always* constexpr and your assertion does not change this.

Considering they are not always constexpr, this assertion is not true, either.

Yes, we seem to delay this. It is always HD but not always emitted for both host and device.

Could you elaborate? I'm not sure what you mean by we seem to delay this and what does it have to do with the assertion that lambdas are not always constexpr by default?

If/when operator() does get constexpr treatment by compiler, we should already derive HD attributes from constexpr. If we do not, then that's what needs to be fixed.

How does the compiler implement this? Does it add constexpr attribute onto the operator() or does the constexpr-evalutation visits the lambda as if it were constexpr? It seems the latter would be more effecient, and it would be similar to what we are doing with HD. The only difference is that a function can be overloaded with __host__ and __device__ whereas that is not possible with constexpr. So a difference could be detected by the user, but maybe that doesn't matter

I think it's done here:
https://github.com/llvm/llvm-project/blob/master/clang/lib/Sema/SemaCUDA.cpp#L557

We basically slap implicit HD on constexpr functions when we process function declarations. It's likely that lambdas may go through a different code path and miss this.

That at least would make sense from consistency standpoint as we currently do treat all other constexpr functions as HD.

I mean consistent across the different attributes not in the interpretation of constexpr. A lambda that only calls constexpr functions implicitly has constexpr attribute. So, a lambda that only calls device functions(or HD) should implicitly have the __device__ attribute.

Again, it's a subject for a wider discussion beyond the scope of this patch. Right now we're only dealing with the question of whether lambdas should always be HD by default.
I'm OK with making constexpr lambdas HD as it would match how we handle regular functions. I don't think non-constexpr lambdas should be HD.

It captures addresses as seen at the point in time on the processor which executed the constructor.

Yea and the same happens when assigning the address to a pointer, which is later used on a different device.

Another point that capturing lambdas are not something ready for the prime time.

The same issues exist with functions. We dont prevent passing a pointer to host memory to a device function. I guess because the analysis to do so is incomplete and expensive. A lambda capturing by reference does seem simpler to analyze at least for implicit HD.

Could you elaborate? I'm not sure what you mean by we seem to delay this and what does it have to do with the assertion that lambdas are not always constexpr by default?

Lambdas are not always constexpr, and this patch doesnt make lambdas to always be generated for host and device, even though, it does always have a HD attribute. Instead it pushes the decision to emit the lambda for host or device to later when we are emitting the code for codegen(at least thats how I understand this happening, @yaxunl can correct me if I am wrong here).

I think it's done here:

I actually meant how constexpr lambdas was implemented, which I can see here:

https://github.com/llvm/llvm-project/blob/master/clang/lib/Sema/SemaExpr.cpp#L16087

It doesn't annotate a lambda as constexpr. Instead it tries to evaluate all lambdas as constexpr when its used in constexpr context. This is similar to what we do for HD. We treat all lambdas as HD, but then only emit it for the device when its called in a device function. The big difference is that constexpr doesn't participate in overload resolution so constexpr lambdas are not observable by the user whereas host and device attributes are.

We basically slap implicit HD on constexpr functions when we process function declarations. It's likely that lambdas may go through a different code path and miss this.

Yea, which wont work for lambdas since NewD->isConstexpr() will return false(unless the user explicitly adds constexpr). We could traverse the AST to see if the lambda only calls constexpr functions and then annotate it with HD(we could also extend this to HD functions as well). However, this seems costly.

It would be better to take the same approach for constexpr lambdas and treat all lambdas as potentially HD(which is what this patch seems to do).

yaxunl updated this revision to Diff 261811.May 4 2020, 7:31 AM
yaxunl retitled this revision from [HIP] Let lambda be host device by default to [HIP] Add -fhip-lambda-host-device.
yaxunl edited the summary of this revision. (Show Details)

Revised.

Is it possible to add a test like this?

kernel<<<1,1>>>([=](){ 
    auto f = [&]{ hd(); };
    f(); 
});

That should not have a compiler error.

yaxunl updated this revision to Diff 261869.May 4 2020, 10:35 AM

add one more test

tra accepted this revision.May 4 2020, 11:05 AM

LGTM. Thank you for adding the checks for capturing lambdas and for putting this behind the flag.
I've asked @rsmith to chime in in case there's anything else about the lambdas we need to deal with. Please wait a day or two before landing the patch to give him a chance to reply.

clang/test/SemaCUDA/lambda.cu
26–36

We may need a better diagnostic for this. Here we've correctly rejected captured lambdas, but the diagnostic is a generic 'can't use that'.
If would be helpful to let user know that we can't use that because of the capturing lambdas.

rsmith added a comment.May 4 2020, 6:44 PM

There are two behaviors that seem to make sense:

  • Treat lambdas as implicitly HD (like constexpr functions) in all CUDA / HIP language modes. I don't think it makes sense for lambdas to become implicitly HD in C++17 simply because they become implicitly constexpr, nor for their HDness to depend on whether their parameter types happen to be literal types, etc. So in C++17, where lambdas are constexpr whenever they can be, the logical behavior would seem to be that lambdas are implicitly HD. And then for consistency with that, I'd expect them to be implicitly HD across all language modes.
  • Implicitly give lambdas the same HD-ness as the enclosing function (if there is one).

I would think the best choice may be to do both of these things: if there is an enclosing function, inherit its host or device attributes. And if not, then treat the lambda as implicitly HD. A slight variation on that, that might be better: lambdas with no lambda-capture are implicitly HD; lambdas with any lambda-capture (which must therefore have an enclosing function) inherit the enclosing function's HDness.

(Note that if we go this way, it makes no difference if there are reference captures, because they're always references on the same "side".)

clang/include/clang/Basic/LangOptions.def
247 ↗(On Diff #261869)

Is it really appropriate to have a flag for this? I would have expected that either the correct HIP behavior would be that lambdas are implicitly HD, or not, and Clang should just use whichever behavior is correct. (I don't know what authority decides what is and is not "correct" for HIP, but still.)

If this is a difference between versions of HIP, we generally model that by having a single "version" field rather than ways of turning on/off the individual changes.

clang/include/clang/Driver/Options.td
628–632 ↗(On Diff #261869)

You document these as "Let [lambdas be HD]" (which I would understand to mean "permit lambdas to be HD"), but the actual effect appears to be "make lambdas be HD by default".

clang/lib/Sema/SemaCUDA.cpp
753–758

This check appears to prevent lambdas appearing in any context outside a function from being implicitly HD. Is that what you want? Eg:

auto foo = [] {}; // not implicitly HD
760

The reference captures check seems quite strange to me. A copy capture of a pointer could have the same problem, as could a copy capture of a class that contains a reference or a pointer. As could an init-capture.

These kinds of quirky language rules are usually more trouble than they're worth.

I agree with Richard that just making lambdas HD by default in all modes seems like the right rule.

tra added a comment.May 5 2020, 9:32 AM

I agree with Richard that just making lambdas HD by default in all modes seems like the right rule.

Ack. Let's give it a try. I'll test this on our code and see what falls out. Stay tuned.

A slight variation on that, that might be better: lambdas with no lambda-capture are implicitly HD; lambdas with any lambda-capture (which must therefore have an enclosing function) inherit the enclosing function's HDness.

Lambdas should already inherit the enclosing functions HDness. Keeping capturing lambdas as implictly HD matches closer the behavior in HIP/HCC, and as we are porting code it is not always clear which lambdas need explicit HD annotation since running on the device is an implementation detail.

Capturing lambdas has its pitfalls but they are no different from the same pitfalls that happen with asynchronous programming or signal callbacks.

clang/lib/Sema/SemaCUDA.cpp
760

Capturing by value is not always an error, only when copying a pointer to a host variable. but this requires a lot more static analysis to diagnose. However, capturing by reference is almost always wrong(at least with the current HIP) when the context is host and the lambda is called on the device.

Therefore, we avoid this scenario by not making such lambdas implicitly HD, but the error message may not be quite as clear. It is a quirky language rule, and we could remove this restriction and rely on a warning or static analysis to diagnose the issue.

tra added a comment.May 7 2020, 1:45 PM
In D78655#2020651, @tra wrote:

Ack. Let's give it a try. I'll test this on our code and see what falls out. Stay tuned.

The patch seems to cause no issues. I've ran it with local changes that enable it unconditionally for CUDA. I'm OK with making this the default behavior.

yaxunl updated this revision to Diff 272171.Jun 19 2020, 1:50 PM
yaxunl retitled this revision from [HIP] Add -fhip-lambda-host-device to [CUDA][HIP] Let non-caputuring lambda be host device.
yaxunl edited the summary of this revision. (Show Details)

Revised by Richard's comments.

yaxunl updated this revision to Diff 272225.Jun 19 2020, 7:05 PM
yaxunl marked 4 inline comments as done.

improve diagnostic message

yaxunl marked 2 inline comments as done.Jun 19 2020, 7:07 PM

There are two behaviors that seem to make sense:

  • Treat lambdas as implicitly HD (like constexpr functions) in all CUDA / HIP language modes. I don't think it makes sense for lambdas to become implicitly HD in C++17 simply because they become implicitly constexpr, nor for their HDness to depend on whether their parameter types happen to be literal types, etc. So in C++17, where lambdas are constexpr whenever they can be, the logical behavior would seem to be that lambdas are implicitly HD. And then for consistency with that, I'd expect them to be implicitly HD across all language modes.
  • Implicitly give lambdas the same HD-ness as the enclosing function (if there is one).

I would think the best choice may be to do both of these things: if there is an enclosing function, inherit its host or device attributes. And if not, then treat the lambda as implicitly HD. A slight variation on that, that might be better: lambdas with no lambda-capture are implicitly HD; lambdas with any lambda-capture (which must therefore have an enclosing function) inherit the enclosing function's HDness.

(Note that if we go this way, it makes no difference if there are reference captures, because they're always references on the same "side".)

Sorry for the delay. I updated the patch as you suggested:

  • lambdas without enclosing function are implicitly HD
  • lambdas with no lambda-capture are implicitly HD
  • lambdas with any lambda-capture (which must therefore have an enclosing function) inherit the enclosing function's HDness.
clang/test/SemaCUDA/lambda.cu
26–36

added more information about lambda function to the diagnostic message

tra added a comment.Jun 22 2020, 9:58 AM
  • lambdas with any lambda-capture (which must therefore have an enclosing function) inherit the enclosing function's HDness.

Nit: *any* capture does not necessarily imply existence of the enclosing function. One can explicitly capture variables in the global scope. E.g.:

int x;
auto l = [x=x](){ return x;};
yaxunl marked an inline comment as done.Jun 22 2020, 11:21 AM
In D78655#2107016, @tra wrote:
  • lambdas with any lambda-capture (which must therefore have an enclosing function) inherit the enclosing function's HDness.

Nit: *any* capture does not necessarily imply existence of the enclosing function. One can explicitly capture variables in the global scope. E.g.:

int x;
auto l = [x=x](){ return x;};

It seems we can only promote non-capturing lambdas, no matter whether it has enclosing function or not.

tra added a comment.Jun 22 2020, 12:45 PM

It seems we can only promote non-capturing lambdas, no matter whether it has enclosing function or not.

I'd be OK with promoting only non-capturing lambdas until we figure out a consistent way to deal with the capturing ones.
Or we can promote captured ones, too and rely on postponed diags to guard against producing wrong-side code. We may need to improve that a bit anyways. E.g. what should we do if I write something like this:

__device__ int dv;
int hv;

__host__ __device__ int hd() {
  return [a = dv, b=hv](){ return a + b;}();
}

https://godbolt.org/z/op_FE6 -- NVCC complains about not being able to access hv in a device function (which makes sense considering that it converts HD -> D after source splitting, but clang happily allows capturing both variables (but will likely fail during ptxas due to the fact that there will be no hv on device side).

yaxunl updated this revision to Diff 272539.Jun 22 2020, 1:50 PM

Only make non-capturing lambda host and device by default.

pfultz2 added inline comments.Jun 22 2020, 2:13 PM
clang/lib/Sema/SemaCUDA.cpp
753

There should at least be a flag to enable capturing lambdas to be implicitly HD. I dont really understand the rational for making capturing lambdas not implicitly HD. It seems like its trying to prevent using an address to host on the device, but I dont see how this prevents that at all.

This will also break the compilation in rocm. Should we use a fork of llvm to compile rocm?

tra added inline comments.Jun 22 2020, 3:30 PM
clang/lib/Sema/SemaCUDA.cpp
753

@pfultz2:

This will also break the compilation in rocm. Should we use a fork of llvm to compile rocm?

Could you give an example to demonstrate current use and how it will break? My understanding that the patch *relaxes* the restrictions on lambdas so in theory not promoting capturing lambdas preserves the status quo.

As for the fork, my response would be an empathic "no, please don't do it". Fork == different compiler == showstopper for various use cases. It would definitely be an issue for us at Google.

Considering that we're still probing our way towards making lambdas more useful, it may be a bit premature to heavily depend on any particular implementation detail of an experimental feature, even if it happens to work. We'll need to figure out an approach that will be sustainable long-term and forked compiler is a rather large and hard-to-maintain hammer for this. In my experience, adapting source code ends up being more manageable long-term.

Could you give an example to demonstrate current use and how it will break?

Here is place where it would break:

https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/src/targets/gpu/device/include/migraphx/gpu/device/multi_index.hpp#L129

This change was already included in a fork of llvm in rocm 3.5 and 3.6 releases which is why this compiles. This also compiles using the hcc-based hip compilers which is what previous rocm versions used. It would be best if this can be upstreamed, so we dont have to hold on to these extra changes in a fork.

Part of the motivation for this change was that it wasn't always clear in code where the __device__ attribute is needed with lambdas sometimes. It also makes it more consistent with constexpr lambdas and hcc-based hip compiler. Including this for capturing lambdas will make this simpler and easier to understand.

If there are concerns about making it default for capturing lambdas, then can we at least just have a flag to enable this for capturing lambdas?

tra added a comment.Jun 23 2020, 10:14 AM

Could you give an example to demonstrate current use and how it will break?

Here is place where it would break:

https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/src/targets/gpu/device/include/migraphx/gpu/device/multi_index.hpp#L129

This change was already included in a fork of llvm in rocm 3.5 and 3.6 releases which is why this compiles. This also compiles using the hcc-based hip compilers which is what previous rocm versions used. It would be best if this can be upstreamed, so we dont have to hold on to these extra changes in a fork.

It may be OK to require updated software in order to switch to a new compiler. E.g. it would be unreasonable for clang to compile all existing HCC code. Nor did we promise to compile all existing CUDA code when it was at the point in time where HIP is now -- new compiler emerging in an ecosystem with existing code which compiles and works fine with the incumbent compiler, but needs some tweaks to compile/work with clang. There will be some back and forth before we reach the equilibrium where most things compile and work.

You may need to make some portability tweaks to your code to make it work with upstream and internal clang versions + hcc. This is roughly what's been done to existing CUDA code -- pretty much all major libraries that use CUDA (tensorflow, Thrust, cutlas, cub, pytorch) had to have minor tweaks to make it portable to clang.

Now, back to the specifics of your example. I'm still not 100% sure I understand what the problem is. Can you boil down the use case to an example on godbolt? Not just the lambda itself, but also the way it's intended to be used. It does not need to compile, I just need it to understand your use case and the problem.
I can imaging passing lambda type as a template parameter which would make it hard to predict/control where/how it will finally be instantiated or used, but it would be great to have a practical example.

Part of the motivation for this change was that it wasn't always clear in code where the __device__ attribute is needed with lambdas sometimes. It also makes it more consistent with constexpr lambdas and hcc-based hip compiler. Including this for capturing lambdas will make this simpler and easier to understand.

If there are concerns about making it default for capturing lambdas, then can we at least just have a flag to enable this for capturing lambdas?

I've just pointed that the assumption that having the capture implies having enclosing function is invalid. We've already decided to proceed with promotion of all lambdas in general, so it's mostly the matter of taking care of implementation details.
Dealing with capturing lambdas in a separate patch is one option. IMO it makes sense in general as capturing lambdas do have their own distinct quirks, while promotion of non-capturing lambdas are relatively uncontroversial.
If Sam decides to incorporate support for capturing lambdas in this patch, we could still do it by restricting the capturing lambda promotion to the ones within a function scope only. I.e. lambdas created in global scope would still be host.

ashi1 added a subscriber: ashi1.Jun 25 2020, 3:10 PM
ashi1 added inline comments.
clang/test/CodeGenCUDA/lambda.cu
54

There is a typo here, DEV-LABEL

Now, back to the specifics of your example. I'm still not 100% sure I understand what the problem is. Can you boil down the use case to an example on godbolt?

I dont have a specific example, but there could be code like this generic clip operator:

template<class F, class T>
void clip(F f,
          const T& min_val,
          const T& max_val)
{

    f([=](auto x) {
        return ::min<decltype(x)>(::max<decltype(x)>(min_val, x), max_val);
    });
}

Its not clear to the writer of the generic function that it needs to declare the lambda with an explicit HD.

If Sam decides to incorporate support for capturing lambdas in this patch, we could still do it by restricting the capturing lambda promotion to the ones within a function scope only. I.e. lambdas created in global scope would still be host.

I think that would be acceptable. I dont think global scope capturing lambdas are very common due to possible ODR issues.

hliao added a comment.Jun 26 2020, 9:37 AM

Now, back to the specifics of your example. I'm still not 100% sure I understand what the problem is. Can you boil down the use case to an example on godbolt?

I dont have a specific example, but there could be code like this generic clip operator:

template<class F, class T>
void clip(F f,
          const T& min_val,
          const T& max_val)
{

    f([=](auto x) {
        return ::min<decltype(x)>(::max<decltype(x)>(min_val, x), max_val);
    });
}

What's the expected HD property of this template function clip? Why we cannot ask developers to add the same HD property for that inner lambda?

Its not clear to the writer of the generic function that it needs to declare the lambda with an explicit HD.

If Sam decides to incorporate support for capturing lambdas in this patch, we could still do it by restricting the capturing lambda promotion to the ones within a function scope only. I.e. lambdas created in global scope would still be host.

I think that would be acceptable. I dont think global scope capturing lambdas are very common due to possible ODR issues.

What's the expected HD property of this template function clip?

I think it is intended to be host-only. The function f will launch a kernel or threads to utilize the passed lambda.

Ideally, it would be nice to make all inlineable functions implicitly HD. There is a pragma to force HD, but it is broken(due to forcing HD on functions annotated as host-only or device-only). It would be nice to have a flag to enable such behavior(instead of a pragma). Requiring all these explicit HD annotations just seems like we are moving backwards.

yaxunl updated this revision to Diff 274589.Jun 30 2020, 1:30 PM
yaxunl retitled this revision from [CUDA][HIP] Let non-caputuring lambda be host device to [CUDA][HIP] Let lambda be host device by default.
yaxunl edited the summary of this revision. (Show Details)

Added diagnostics for capturing host variables on device lambda and made lambda host device by default.

tra added inline comments.Jun 30 2020, 2:33 PM
clang/lib/Sema/SemaCUDA.cpp
749

What does the return value mean? We don't seem to check it anyways. If we don't care about the result, perhaps the function should be void.
If we do, then it would be good to document its purpose and returned values and, probably, rename it to better indicate what is it it's supposed to check.

clang/lib/Sema/SemaLambda.cpp
1784

I would expect Sema-level diags to be produced during both host and device compilation. Some of the diags for HD may need to be postponed until after things have been CodeGen'ed, but the checks should happen nevertheless.

pfultz2 added inline comments.Jun 30 2020, 6:07 PM
clang/test/SemaCUDA/lambda.cu
28

Will this still produce diagnostics when the lambda is explicitly __device__? Maybe you could add a test case for that.

kernel<<<1,1>>>([&]() __device__ { hd(b); });
yaxunl marked 10 inline comments as done.Jul 1 2020, 5:48 AM
yaxunl added inline comments.
clang/lib/Sema/SemaCUDA.cpp
749

it should return void. fixed.

clang/lib/Sema/SemaLambda.cpp
1784

A host device function could be emitted in both host and device compilation. The way the deferred diags works is that they are only emitted when the function is sure to be emitted in a specific compilation. In host compilation, when a host device function is sure to be emitted, it is emitted as host function, therefore diags for host compilation and only diags for host compilation should be emitted. The same with device compilation.

This makes sense since we do not know if a host device function will be emitted in device compilation when we are doing host compilation, since to do that we have to go through the whole device compilation whereas currently device compilation and host compilation are separate process.

That said, when we emit diags for captures by reference, we should only emit them when the lambdas are emitted as device functions. When they are emitted as host functions in host compilation, these captures are valid and should not be diagnosed.

clang/test/CodeGenCUDA/lambda.cu
54

fixed. thanks

clang/test/SemaCUDA/lambda.cu
28

yes. added a test

yaxunl updated this revision to Diff 274774.Jul 1 2020, 5:52 AM
yaxunl marked 4 inline comments as done.

revised by Artem's and Paul's comments

tra added inline comments.Jul 6 2020, 1:33 PM
clang/lib/Sema/SemaLambda.cpp
1784

HD lambda capturing something that's side-specific is similar to HD function calling side-specific function. I still think that the general principle applies to both sides of the compilation.

We may be saved by the fact that functions seem to be the only things that are truly side-specific and wrong-side access will already produce postponed diagnostics. For variables we'll end up capturing shadows, which is in the gray area -- we're allowed to use pointers & sizeof(), but not the non-const values.

Perhaps this condition check should be folded into the CUDACheckLambdaCapture and add few comments about the reasons for particular OK/not-OK choices we make there.

hliao resigned from this revision.Jul 7 2020, 11:37 AM
This revision is now accepted and ready to land.Jul 7 2020, 11:37 AM
yaxunl updated this revision to Diff 276211.Jul 7 2020, 2:35 PM

refactor CUDACheckLambdaCapture and add comments

tra added inline comments.Jul 7 2020, 3:57 PM
clang/lib/Sema/SemaCUDA.cpp
756–758

I don't think this is completely correct. Postponed diags get emitted if we know we're attempoting to codegen wrong things.
E.g. during host compilation when HD function used by host code ends up attempting to call a device function.
It also works in the other direction -- it kicks in during device compilation when HD function calls a host function.
AFAICT it has nothing to do with what happens on the other side of the compilation, but rather what we're attempting to codegen during *this* compilation.

I don't think that we can reason that checks can be done on the host side only, based only on the argument you're making above (at least based on the way I understand it).

The point you're making below that a captured lambda created by device code can't ever be used by the host code is probably a better argument why the check may not be necessary.

yaxunl updated this revision to Diff 276417.Jul 8 2020, 7:07 AM

revised by Artem's comments

yaxunl marked 3 inline comments as done.Jul 8 2020, 7:08 AM
yaxunl added inline comments.
clang/lib/Sema/SemaCUDA.cpp
756–758

Revised the comments.

tra accepted this revision.Jul 8 2020, 9:31 AM

LGTM.

This revision was automatically updated to reflect the committed changes.
yaxunl marked an inline comment as done.
Herald added a project: Restricted Project. · View Herald TranscriptJul 8 2020, 10:11 AM