This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang/
-
include/clang/
-
clang/
-
Basic/
-
DiagnosticSemaKinds.td
-
Sema/
-
Sema.h
-
lib/Sema/
-
Sema/
13/13
SemaCUDA.cpp
3/3
SemaLambda.cpp
-
test/
-
CodeGenCUDA/
8/8
lambda.cu
-
SemaCUDA/
-
Inputs/
-
cuda.h
4/4
lambda.cu

Differential D78655

[CUDA][HIP] Let lambda be host device by default
ClosedPublic

Authored by yaxunl on Apr 22 2020, 10:52 AM.

Download Raw Diff

Details

Reviewers

tra
rjmccall
rsmith
hliao

Commits

rG1eaad01046c8: [CUDA][HIP] Let lambda be host device by default

Summary

This patch let lambda be host device by default and adds diagnostics for capturing host variable by reference in device lambda.

Diff Detail

Event Timeline

yaxunl created this revision.Apr 22 2020, 10:52 AM

Summoning @rsmith as I'm sure that there are interesting corner cases in lambda handling that we didn't consider.

Making lambdas implicitly HD will make it easier to write the code which can't be instantiated on one side of the compilation. That's probably observable via SFINAE, but I can't tell whether that matters.
By default I'd rather err on handling lambdas the same way as we do regular user-authored functions.

clang/lib/Sema/SemaCUDA.cpp
778	What about `__global__` lambdas? We probably don't want to add HD attributes on them here.
clang/test/CodeGenCUDA/lambda.cu
17–27	The test example may not be doing what it's seemingly supposed to be doing: https://cppinsights.io/s/3a5c42ff `h()` gets a temporary host-side object which keeps the reference to `a` and that reference will actually point to the host-side shadow of the actual device-side `a`. When you get to execute `g` it's `this` may not be very usable on device side and thus `f.operator()` will probably not work. Alas, we currently have no diagnostics for that kind of error. Change it to a non-capturing lambda, perhaps?

yaxunl marked 4 inline comments as done.Apr 22 2020, 7:09 PM

yaxunl added inline comments.

clang/lib/Sema/SemaCUDA.cpp
778	lambda is not allowed to be kernel. I will add a lit test for that.
clang/test/CodeGenCUDA/lambda.cu
17–27	It works. We need to think about this in device compilation. In device compilation, global variable is a device variable, the lambda is a device host function, therefore the lambda is accessing the real a, not the shadow. In the host compilation, the lambda is not really called, therefore it is not emitted. I will update the lit test with these checks.

Add a negative test for lambda kernel. Add more checks to codegen test.

clean up test.

pfultz2 added inline comments.Apr 23 2020, 7:32 PM

clang/lib/Sema/SemaCUDA.cpp
779	Shouldn't we add these attributes if there are no host and device attributes? This seems like it will treat `[]() __device__ {}` as host device.

yaxunl marked 2 inline comments as done.Apr 23 2020, 8:10 PM

yaxunl added inline comments.

clang/lib/Sema/SemaCUDA.cpp
779	There is check on line 716. We only reach here if there is no device and host attrs on the function.

Seems reasonable

clang/test/CodeGenCUDA/lambda.cu
30	Typo

Fix typo

rjmccall accepted this revision.Apr 24 2020, 12:33 PM

This revision is now accepted and ready to land.Apr 24 2020, 12:33 PM

tra added inline comments.Apr 24 2020, 1:39 PM

clang/test/CodeGenCUDA/lambda.cu
17–27	Clang manages to see through to the initializer of `a`, but I'm not sure how much we can rely on this. In general, `f.operator()` for a capturing lambda needs to access captured variables via `this` which points to a temporary objects created and passed to `g` by the host. You can see it if you capture a local variable: https://godbolt.org/z/99389o Anyways, it's an issue orthogonal to this patch. My concern is that tests are often used as an example of things that are OK to do, and capturing lambdas are a pretty big foot-shooting gun when used with CUDA. It's very easy to do wrong thing without compiler complaining about them. While accessing `a` does work, it appears to do so by accident, rather than by design. I'm fairly confident that I can hide the initializer with sufficiently complicated code, force clang to access `a` via `this` and make everything fail at runtime. IMO, what we have here is a 'happens to work' situation. I do not want to call it 'guaranteed to work' without making sure that it always does. In order to demonstrate that lambda is host/device, you do not need it to be a capturing lambda. You can make it call an overloaded function with host and device variants and verify that the lambda works on host and device sides.

Added more tests

yaxunl marked 2 inline comments as done.Apr 25 2020, 7:25 AM

yaxunl added inline comments.

clang/test/CodeGenCUDA/lambda.cu
17–27	I added one more test, where a lambda function calls a template function which is overloaded with a host version and a device version. The lambda is called in both host function and in kernel. Test shows correct version of template function are emitted in host and device compilation. I think it is not a surprise that the lambda function is able to resolve the host/device-ness of the callee correctly. We are doing resolution in a host device function and the two candidates are same-side vs wrong-side.

In D78655#1997491, @tra wrote:

Summoning @rsmith as I'm sure that there are interesting corner cases in lambda handling that we didn't consider.

Making lambdas implicitly HD will make it easier to write the code which can't be instantiated on one side of the compilation. That's probably observable via SFINAE, but I can't tell whether that matters.
By default I'd rather err on handling lambdas the same way as we do regular user-authored functions.

Marking lambda __device__/__device__ __host__ should be

In D78655#1997491, @tra wrote:

Summoning @rsmith as I'm sure that there are interesting corner cases in lambda handling that we didn't consider.

Making lambdas implicitly HD will make it easier to write the code which can't be instantiated on one side of the compilation. That's probably observable via SFINAE, but I can't tell whether that matters.
By default I'd rather err on handling lambdas the same way as we do regular user-authored functions.

Marking lambda with proper attributes helps check the potential harmful captures.

In D78655#1997491, @tra wrote:

Summoning @rsmith as I'm sure that there are interesting corner cases in lambda handling that we didn't consider.

Making lambdas implicitly HD will make it easier to write the code which can't be instantiated on one side of the compilation. That's probably observable via SFINAE, but I can't tell whether that matters.
By default I'd rather err on handling lambdas the same way as we do regular user-authored functions.

I though the goal of adding HD/D attributes for lambda is to make the static check easier as lambda used in device code or device lambda is sensitive to captures. Invalid capture may render error accidentally without static check, says we capture host variable reference in a device lambda. That makes the final code invalid. Allowing regular lambda to be used in global or device function is considering harmful.

This revision now requires changes to proceed.Apr 25 2020, 9:20 AM

hliao added inline comments.Apr 25 2020, 9:22 AM

clang/test/CodeGenCUDA/lambda.cu
84	We are allowing regular lambda to be used in the device functions. That should be explicitly marked by making that lambda `__device__` or `__host__ __device__`. Even though we may not have static checks for capture so far, that should be easily extended with those attributes.

In D78655#2003681, @hliao wrote:

I though the goal of adding HD/D attributes for lambda is to make the static check easier as lambda used in device code or device lambda is sensitive to captures. Invalid capture may render error accidentally without static check, says we capture host variable reference in a device lambda. That makes the final code invalid. Allowing regular lambda to be used in global or device function is considering harmful.

Inferring a lambda function by default as __host__ __device__ does not mean skipping the check for harmful captures.

If we add such checks, it does not matter whether the __host__ __device__ attribute is explicit or implicit, they go through the same check.

How to infer the device/host-ness of a lambda function is a usability issue. It is orthogonal to the issue of missing diagnostics about captures.

Forcing users to explicitly mark a lambda function as __device__ __host__ itself does not help diagnose the harmful captures if such diags do not exist.

Let's think about a lambda function which captures references to host variables. If it is only used in host code, as in ordinary C++ host code. Marking it host device implicitly does not change anything, since it is not emitted in device code. If it is used in device code, it will likely cause mem fault at run time since currently we do not diagnose it. Does it help if we force users to mark it __device__ __host__? I don't think so. Users will just reluctantly add __device__ __host__ to it and they still end up as mem fault. If we add the diagnostic about the harmful captures, it does not matter whether the __device__ __host__ attribute is explicit or implicit, users get the same diagnostic about harmful captures. So the effect is the same. However, usability is improved if users do not need to add __device__ __host__ by themselves.

Does inferring lambda function as __device__ __host__ by default making the diagnostic about harmful captures more difficult? No. It should be the same for lambdas with explicit __device__ __host__. This needs to be a deferred diagnostic like those we already did, which are only emitted if the function is really emitted. It does not matter whether the device/host attrs are implicit or explicit.

says we capture host variable reference in a device lambda.

Is that required to be an error? I know @AlexVlx added support to hcc at one point to capture host variables by reference. So it seems to be possible for it to work correctly. So it doesn't seem to be like reason enough to disallow implicit HD.

In D78655#2003837, @yaxunl wrote:

In D78655#2003681, @hliao wrote:

I though the goal of adding HD/D attributes for lambda is to make the static check easier as lambda used in device code or device lambda is sensitive to captures. Invalid capture may render error accidentally without static check, says we capture host variable reference in a device lambda. That makes the final code invalid. Allowing regular lambda to be used in global or device function is considering harmful.

Inferring a lambda function by default as __host__ __device__ does not mean skipping the check for harmful captures.

If we add such checks, it does not matter whether the __host__ __device__ attribute is explicit or implicit, they go through the same check.

How to infer the device/host-ness of a lambda function is a usability issue. It is orthogonal to the issue of missing diagnostics about captures.

Forcing users to explicitly mark a lambda function as __device__ __host__ itself does not help diagnose the harmful captures if such diags do not exist.

Let's think about a lambda function which captures references to host variables. If it is only used in host code, as in ordinary C++ host code. Marking it host device implicitly does not change anything, since it is not emitted in device code. If it is used in device code, it will likely cause mem fault at run time since currently we do not diagnose it. Does it help if we force users to mark it __device__ __host__? I don't think so. Users will just reluctantly add __device__ __host__ to it and they still end up as mem fault. If we add the diagnostic about the harmful captures, it does not matter whether the __device__ __host__ attribute is explicit or implicit, users get the same diagnostic about harmful captures. So the effect is the same. However, usability is improved if users do not need to add __device__ __host__ by themselves.

Does inferring lambda function as __device__ __host__ by default making the diagnostic about harmful captures more difficult? No. It should be the same for lambdas with explicit __device__ __host__. This needs to be a deferred diagnostic like those we already did, which are only emitted if the function is really emitted. It does not matter whether the device/host attrs are implicit or explicit.

Not only the capture is an issue, like a regular function, lambda could also access non-local variables/functions. Without marking proper HD attributes explicitly, it's difficult or expensive to statically check the use of them. If we have to guess that attributes based on how they are called, we may find cases where conflicting results may be derived depending on call sites. It would be quite confusing if developers struggle to solve one but trigger another one or more. From the other perspective, a lambda is just another function and should have consistent rule for its usage, resolution, and etc.

In D78655#2011823, @pfultz2 wrote:

says we capture host variable reference in a device lambda.

Is that required to be an error? I know @AlexVlx added support to hcc at one point to capture host variables by reference. So it seems to be possible for it to work correctly. So it doesn't seem to be like reason enough to disallow implicit HD.

Do we have a mechanism to allow device code to access the local auto variable on the host side?

Not only the capture is an issue, like a regular function, lambda could also access non-local variables/functions.

In practice this is not an issue. Hcc will implictly treat anything inlinable as host device, and user's are not confused or surprised when they use non-local variables/reference that are on the host.

From the other perspective, a lambda is just another function and should have consistent rule for its usage, resolution, and etc.

But its not like another function. It has internal linkage(even when using global variables declared with inline). Lambdas are also implicitly constexpr whereas a function need to explicitly declare constexpr. Making lambdas implicitly HD whereas function need to be explicit seems to be consistent with how lambdas work with constexpr.

In D78655#2013226, @pfultz2 wrote:

Not only the capture is an issue, like a regular function, lambda could also access non-local variables/functions.

In practice this is not an issue. Hcc will implictly treat anything inlinable as host device, and user's are not confused or surprised when they use non-local variables/reference that are on the host.

From the other perspective, a lambda is just another function and should have consistent rule for its usage, resolution, and etc.

But its not like another function.

+1. Lambda is an object. In addition to operator(), it may also have local storage for captured variables and can also be mutable, which makes it even more complicated. I.e. if I pass a mutable lambda by reference to the GPU kernel, will the same lambda called on host do the same thing when it's called on the device? In principle it would work if GPU and host operate un a uniform memory (i.e. all memory is accessible at the same addresses from both host and the GPU), but I don't think it's a universally true assumption. E.g. GPU-to-GPU memory accesses are not always possible, even when they technically share the same address space with the host. I can create a capturing lambda on one GPU, pass it to the host (may work fine there), pass it to another GPU and it will fail.

Lambdas are also implicitly constexpr whereas a function need to explicitly declare constexpr.

According to cppreference, it's only true since C++17 and, AFAICT, only for capture-less lambdas.

Making lambdas implicitly HD whereas function need to be explicit seems to be consistent with how lambdas work with constexpr.

Considering they are not always constexpr, this assertion is not true, either.

If/when operator() does get constexpr treatment by compiler, we should already derive HD attributes from constexpr. If we do not, then that's what needs to be fixed. That at least would make sense from consistency standpoint as we currently do treat all other constexpr functions as HD.

I.e. if I pass a mutable lambda by reference to the GPU kernel

I dont think we are enabling passing host objects by reference through functions. Although it could be possible to capture the mutable lambda by reference by another lambda.

will the same lambda called on host do the same thing when it's called on the device?

Yes, just as the same as capturing a host variable by reference and using it on the device.

In principle it would work if GPU and host operate un a uniform memory

A unified memory is not necessary. What is needed is a coordination between the compiler and runtime.

We dont support capturing host variable by reference, so maybe we can restrict the implicit HD to lambdas that don't capture by reference?

According to cppreference, it's only true since C++17 and, AFAICT, only for capture-less lambdas.

You can capture as well, if its in a constexpr context.

Considering they are not always constexpr, this assertion is not true, either.

Yes, we seem to delay this. It is always HD but not always emitted for both host and device.

The issue would be if users tried to detect HD using SFINAE. It could be a false claim, but maybe it doesnt matter. More importantly, if the lambda is called in a unevaluated context, will the compiler still emit the function or will it produce a hard error instead of a substitution failure? I assume something like this would compile:

template<class F>
__host__ auto is_host(F f) -> decltype(f(), std::true_type{});
std::false_type is_host(...);

template<class F>
__device__ auto is_device(F f) -> decltype(f(), std::true_type{});
std::false_type is_device(...);

__host__ void f();

void g()
{
    auto l = []{ f(); };
    using on_host = decltype(is_host(l));
    static_assert(on_host{}, "Lambda not on host");
    using on_device = decltype(is_device(l));
    static_assert(on_device{}, "Lambda not on device");
}

If/when operator() does get constexpr treatment by compiler, we should already derive HD attributes from constexpr. If we do not, then that's what needs to be fixed.

How does the compiler implement this? Does it add constexpr attribute onto the operator() or does the constexpr-evalutation visits the lambda as if it were constexpr? It seems the latter would be more effecient, and it would be similar to what we are doing with HD. The only difference is that a function can be overloaded with __host__ and __device__ whereas that is not possible with constexpr. So a difference could be detected by the user, but maybe that doesn't matter

That at least would make sense from consistency standpoint as we currently do treat all other constexpr functions as HD.

I mean consistent across the different attributes not in the interpretation of constexpr. A lambda that only calls constexpr functions implicitly has constexpr attribute. So, a lambda that only calls device functions(or HD) should implicitly have the __device__ attribute.

In D78655#2013616, @pfultz2 wrote:

I.e. if I pass a mutable lambda by reference to the GPU kernel

I dont think we are enabling passing host objects by reference through functions. Although it could be possible to capture the mutable lambda by reference by another lambda.

You're not enabling it, but I don't think you prevent it either. In any case, the point was that we don't have a consistent way to handle capturing lambdas in CUDA&HIP. Hence my suggestion for not using it in the tests which would implicitly suggest otherwise.

will the same lambda called on host do the same thing when it's called on the device?

Yes, just as the same as capturing a host variable by reference and using it on the device.

I don't think I understand. Capturing happens when/where the lambda object is created. It captures addresses as seen at the point in time on the processor which executed the constructor.
When the reference to the lambda object is passed to another processor all bets are generally off. Again, besides the point of not using capturing lambda in the test it's somewhat irrelevant for this patch.

In principle it would work if GPU and host operate un a uniform memory

A unified memory is not necessary. What is needed is a coordination between the compiler and runtime.

We dont support capturing host variable by reference, so maybe we can restrict the implicit HD to lambdas that don't capture by reference?

Another point that capturing lambdas are not something ready for the prime time.

According to cppreference, it's only true since C++17 and, AFAICT, only for capture-less lambdas.

You can capture as well, if its in a constexpr context.

You can, but the point is that the lambda's operator()is not *always* constexpr and your assertion does not change this.

Considering they are not always constexpr, this assertion is not true, either.

Yes, we seem to delay this. It is always HD but not always emitted for both host and device.

Could you elaborate? I'm not sure what you mean by we seem to delay this and what does it have to do with the assertion that lambdas are not always constexpr by default?

If/when operator() does get constexpr treatment by compiler, we should already derive HD attributes from constexpr. If we do not, then that's what needs to be fixed.

How does the compiler implement this? Does it add constexpr attribute onto the operator() or does the constexpr-evalutation visits the lambda as if it were constexpr? It seems the latter would be more effecient, and it would be similar to what we are doing with HD. The only difference is that a function can be overloaded with __host__ and __device__ whereas that is not possible with constexpr. So a difference could be detected by the user, but maybe that doesn't matter

I think it's done here:
https://github.com/llvm/llvm-project/blob/master/clang/lib/Sema/SemaCUDA.cpp#L557

We basically slap implicit HD on constexpr functions when we process function declarations. It's likely that lambdas may go through a different code path and miss this.

That at least would make sense from consistency standpoint as we currently do treat all other constexpr functions as HD.

I mean consistent across the different attributes not in the interpretation of constexpr. A lambda that only calls constexpr functions implicitly has constexpr attribute. So, a lambda that only calls device functions(or HD) should implicitly have the __device__ attribute.

Again, it's a subject for a wider discussion beyond the scope of this patch. Right now we're only dealing with the question of whether lambdas should always be HD by default.
I'm OK with making constexpr lambdas HD as it would match how we handle regular functions. I don't think non-constexpr lambdas should be HD.

It captures addresses as seen at the point in time on the processor which executed the constructor.

Yea and the same happens when assigning the address to a pointer, which is later used on a different device.

Another point that capturing lambdas are not something ready for the prime time.

The same issues exist with functions. We dont prevent passing a pointer to host memory to a device function. I guess because the analysis to do so is incomplete and expensive. A lambda capturing by reference does seem simpler to analyze at least for implicit HD.

Could you elaborate? I'm not sure what you mean by we seem to delay this and what does it have to do with the assertion that lambdas are not always constexpr by default?

Lambdas are not always constexpr, and this patch doesnt make lambdas to always be generated for host and device, even though, it does always have a HD attribute. Instead it pushes the decision to emit the lambda for host or device to later when we are emitting the code for codegen(at least thats how I understand this happening, @yaxunl can correct me if I am wrong here).

I think it's done here:

I actually meant how constexpr lambdas was implemented, which I can see here:

https://github.com/llvm/llvm-project/blob/master/clang/lib/Sema/SemaExpr.cpp#L16087

It doesn't annotate a lambda as constexpr. Instead it tries to evaluate all lambdas as constexpr when its used in constexpr context. This is similar to what we do for HD. We treat all lambdas as HD, but then only emit it for the device when its called in a device function. The big difference is that constexpr doesn't participate in overload resolution so constexpr lambdas are not observable by the user whereas host and device attributes are.

We basically slap implicit HD on constexpr functions when we process function declarations. It's likely that lambdas may go through a different code path and miss this.

Yea, which wont work for lambdas since NewD->isConstexpr() will return false(unless the user explicitly adds constexpr). We could traverse the AST to see if the lambda only calls constexpr functions and then annotate it with HD(we could also extend this to HD functions as well). However, this seems costly.

It would be better to take the same approach for constexpr lambdas and treat all lambdas as potentially HD(which is what this patch seems to do).

Revised.

Is it possible to add a test like this?

kernel<<<1,1>>>([=](){ 
    auto f = [&]{ hd(); };
    f(); 
});

That should not have a compiler error.

add one more test

LGTM. Thank you for adding the checks for capturing lambdas and for putting this behind the flag.
I've asked @rsmith to chime in in case there's anything else about the lambdas we need to deal with. Please wait a day or two before landing the patch to give him a chance to reply.

clang/test/SemaCUDA/lambda.cu
26–36	We may need a better diagnostic for this. Here we've correctly rejected captured lambdas, but the diagnostic is a generic 'can't use that'. If would be helpful to let user know that we can't use that because of the capturing lambdas.

There are two behaviors that seem to make sense:

Treat lambdas as implicitly HD (like constexpr functions) in all CUDA / HIP language modes. I don't think it makes sense for lambdas to become implicitly HD in C++17 simply because they become implicitly constexpr, nor for their HDness to depend on whether their parameter types happen to be literal types, etc. So in C++17, where lambdas are constexpr whenever they can be, the logical behavior would seem to be that lambdas are implicitly HD. And then for consistency with that, I'd expect them to be implicitly HD across all language modes.
Implicitly give lambdas the same HD-ness as the enclosing function (if there is one).

I would think the best choice may be to do both of these things: if there is an enclosing function, inherit its host or device attributes. And if not, then treat the lambda as implicitly HD. A slight variation on that, that might be better: lambdas with no lambda-capture are implicitly HD; lambdas with any lambda-capture (which must therefore have an enclosing function) inherit the enclosing function's HDness.

(Note that if we go this way, it makes no difference if there are reference captures, because they're always references on the same "side".)

clang/include/clang/Basic/LangOptions.def
247 ↗	(On Diff #261869)	Is it really appropriate to have a flag for this? I would have expected that either the correct HIP behavior would be that lambdas are implicitly HD, or not, and Clang should just use whichever behavior is correct. (I don't know what authority decides what is and is not "correct" for HIP, but still.) If this is a difference between versions of HIP, we generally model that by having a single "version" field rather than ways of turning on/off the individual changes.
clang/include/clang/Driver/Options.td
628–632 ↗	(On Diff #261869)	You document these as "Let [lambdas be HD]" (which I would understand to mean "permit lambdas to be HD"), but the actual effect appears to be "make lambdas be HD by default".
clang/lib/Sema/SemaCUDA.cpp
776–777	This check appears to prevent lambdas appearing in any context outside a function from being implicitly HD. Is that what you want? Eg: auto foo = [] {}; // not implicitly HD
779	The reference captures check seems quite strange to me. A copy capture of a pointer could have the same problem, as could a copy capture of a class that contains a reference or a pointer. As could an init-capture. These kinds of quirky language rules are usually more trouble than they're worth.

I agree with Richard that just making lambdas HD by default in all modes seems like the right rule.

I agree with Richard that just making lambdas HD by default in all modes seems like the right rule.

Ack. Let's give it a try. I'll test this on our code and see what falls out. Stay tuned.

A slight variation on that, that might be better: lambdas with no lambda-capture are implicitly HD; lambdas with any lambda-capture (which must therefore have an enclosing function) inherit the enclosing function's HDness.

Lambdas should already inherit the enclosing functions HDness. Keeping capturing lambdas as implictly HD matches closer the behavior in HIP/HCC, and as we are porting code it is not always clear which lambdas need explicit HD annotation since running on the device is an implementation detail.

Capturing lambdas has its pitfalls but they are no different from the same pitfalls that happen with asynchronous programming or signal callbacks.

clang/lib/Sema/SemaCUDA.cpp
779	Capturing by value is not always an error, only when copying a pointer to a host variable. but this requires a lot more static analysis to diagnose. However, capturing by reference is almost always wrong(at least with the current HIP) when the context is host and the lambda is called on the device. Therefore, we avoid this scenario by not making such lambdas implicitly HD, but the error message may not be quite as clear. It is a quirky language rule, and we could remove this restriction and rely on a warning or static analysis to diagnose the issue.

In D78655#2020651, @tra wrote:

Ack. Let's give it a try. I'll test this on our code and see what falls out. Stay tuned.

The patch seems to cause no issues. I've ran it with local changes that enable it unconditionally for CUDA. I'm OK with making this the default behavior.

Revised by Richard's comments.

improve diagnostic message

In D78655#2019290, @rsmith wrote:

There are two behaviors that seem to make sense:

Treat lambdas as implicitly HD (like constexpr functions) in all CUDA / HIP language modes. I don't think it makes sense for lambdas to become implicitly HD in C++17 simply because they become implicitly constexpr, nor for their HDness to depend on whether their parameter types happen to be literal types, etc. So in C++17, where lambdas are constexpr whenever they can be, the logical behavior would seem to be that lambdas are implicitly HD. And then for consistency with that, I'd expect them to be implicitly HD across all language modes.

Implicitly give lambdas the same HD-ness as the enclosing function (if there is one).

I would think the best choice may be to do both of these things: if there is an enclosing function, inherit its host or device attributes. And if not, then treat the lambda as implicitly HD. A slight variation on that, that might be better: lambdas with no lambda-capture are implicitly HD; lambdas with any lambda-capture (which must therefore have an enclosing function) inherit the enclosing function's HDness.

(Note that if we go this way, it makes no difference if there are reference captures, because they're always references on the same "side".)

Sorry for the delay. I updated the patch as you suggested:

lambdas without enclosing function are implicitly HD
lambdas with no lambda-capture are implicitly HD
lambdas with any lambda-capture (which must therefore have an enclosing function) inherit the enclosing function's HDness.

clang/test/SemaCUDA/lambda.cu
26–36	added more information about lambda function to the diagnostic message

In D78655#2105058, @yaxunl wrote:

lambdas with any lambda-capture (which must therefore have an enclosing function) inherit the enclosing function's HDness.

Nit: *any* capture does not necessarily imply existence of the enclosing function. One can explicitly capture variables in the global scope. E.g.:

int x;
auto l = [x=x](){ return x;};

In D78655#2107016, @tra wrote:
In D78655#2105058, @yaxunl wrote:

lambdas with any lambda-capture (which must therefore have an enclosing function) inherit the enclosing function's HDness.

Nit: *any* capture does not necessarily imply existence of the enclosing function. One can explicitly capture variables in the global scope. E.g.:
int x;
auto l = [x=x](){ return x;};

It seems we can only promote non-capturing lambdas, no matter whether it has enclosing function or not.

In D78655#2107190, @yaxunl wrote:

It seems we can only promote non-capturing lambdas, no matter whether it has enclosing function or not.

I'd be OK with promoting only non-capturing lambdas until we figure out a consistent way to deal with the capturing ones.
Or we can promote captured ones, too and rely on postponed diags to guard against producing wrong-side code. We may need to improve that a bit anyways. E.g. what should we do if I write something like this:

__device__ int dv;
int hv;

__host__ __device__ int hd() {
  return [a = dv, b=hv](){ return a + b;}();
}

https://godbolt.org/z/op_FE6 -- NVCC complains about not being able to access hv in a device function (which makes sense considering that it converts HD -> D after source splitting, but clang happily allows capturing both variables (but will likely fail during ptxas due to the fact that there will be no hv on device side).

Only make non-capturing lambda host and device by default.

pfultz2 added inline comments.Jun 22 2020, 2:13 PM

clang/lib/Sema/SemaCUDA.cpp
777	There should at least be a flag to enable capturing lambdas to be implicitly HD. I dont really understand the rational for making capturing lambdas not implicitly HD. It seems like its trying to prevent using an address to host on the device, but I dont see how this prevents that at all. This will also break the compilation in rocm. Should we use a fork of llvm to compile rocm?

tra added inline comments.Jun 22 2020, 3:30 PM

clang/lib/Sema/SemaCUDA.cpp
777	@pfultz2: This will also break the compilation in rocm. Should we use a fork of llvm to compile rocm? Could you give an example to demonstrate current use and how it will break? My understanding that the patch relaxes the restrictions on lambdas so in theory not promoting capturing lambdas preserves the status quo. As for the fork, my response would be an empathic "no, please don't do it". Fork == different compiler == showstopper for various use cases. It would definitely be an issue for us at Google. Considering that we're still probing our way towards making lambdas more useful, it may be a bit premature to heavily depend on any particular implementation detail of an experimental feature, even if it happens to work. We'll need to figure out an approach that will be sustainable long-term and forked compiler is a rather large and hard-to-maintain hammer for this. In my experience, adapting source code ends up being more manageable long-term.

Could you give an example to demonstrate current use and how it will break?

Here is place where it would break:

https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/src/targets/gpu/device/include/migraphx/gpu/device/multi_index.hpp#L129

This change was already included in a fork of llvm in rocm 3.5 and 3.6 releases which is why this compiles. This also compiles using the hcc-based hip compilers which is what previous rocm versions used. It would be best if this can be upstreamed, so we dont have to hold on to these extra changes in a fork.

Part of the motivation for this change was that it wasn't always clear in code where the __device__ attribute is needed with lambdas sometimes. It also makes it more consistent with constexpr lambdas and hcc-based hip compiler. Including this for capturing lambdas will make this simpler and easier to understand.

If there are concerns about making it default for capturing lambdas, then can we at least just have a flag to enable this for capturing lambdas?

In D78655#2108047, @pfultz2 wrote:

Could you give an example to demonstrate current use and how it will break?

Here is place where it would break:

https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/src/targets/gpu/device/include/migraphx/gpu/device/multi_index.hpp#L129

This change was already included in a fork of llvm in rocm 3.5 and 3.6 releases which is why this compiles. This also compiles using the hcc-based hip compilers which is what previous rocm versions used. It would be best if this can be upstreamed, so we dont have to hold on to these extra changes in a fork.

It may be OK to require updated software in order to switch to a new compiler. E.g. it would be unreasonable for clang to compile all existing HCC code. Nor did we promise to compile all existing CUDA code when it was at the point in time where HIP is now -- new compiler emerging in an ecosystem with existing code which compiles and works fine with the incumbent compiler, but needs some tweaks to compile/work with clang. There will be some back and forth before we reach the equilibrium where most things compile and work.

You may need to make some portability tweaks to your code to make it work with upstream and internal clang versions + hcc. This is roughly what's been done to existing CUDA code -- pretty much all major libraries that use CUDA (tensorflow, Thrust, cutlas, cub, pytorch) had to have minor tweaks to make it portable to clang.

Now, back to the specifics of your example. I'm still not 100% sure I understand what the problem is. Can you boil down the use case to an example on godbolt? Not just the lambda itself, but also the way it's intended to be used. It does not need to compile, I just need it to understand your use case and the problem.
I can imaging passing lambda type as a template parameter which would make it hard to predict/control where/how it will finally be instantiated or used, but it would be great to have a practical example.

Part of the motivation for this change was that it wasn't always clear in code where the __device__ attribute is needed with lambdas sometimes. It also makes it more consistent with constexpr lambdas and hcc-based hip compiler. Including this for capturing lambdas will make this simpler and easier to understand.

If there are concerns about making it default for capturing lambdas, then can we at least just have a flag to enable this for capturing lambdas?

I've just pointed that the assumption that having the capture implies having enclosing function is invalid. We've already decided to proceed with promotion of all lambdas in general, so it's mostly the matter of taking care of implementation details.
Dealing with capturing lambdas in a separate patch is one option. IMO it makes sense in general as capturing lambdas do have their own distinct quirks, while promotion of non-capturing lambdas are relatively uncontroversial.
If Sam decides to incorporate support for capturing lambdas in this patch, we could still do it by restricting the capturing lambda promotion to the ones within a function scope only. I.e. lambdas created in global scope would still be host.

ashi1 added a subscriber: ashi1.Jun 25 2020, 3:10 PM

ashi1 added inline comments.

clang/test/CodeGenCUDA/lambda.cu
54	There is a typo here, DEV-LABEL

Now, back to the specifics of your example. I'm still not 100% sure I understand what the problem is. Can you boil down the use case to an example on godbolt?

I dont have a specific example, but there could be code like this generic clip operator:

template<class F, class T>
void clip(F f,
          const T& min_val,
          const T& max_val)
{

    f([=](auto x) {
        return ::min<decltype(x)>(::max<decltype(x)>(min_val, x), max_val);
    });
}

Its not clear to the writer of the generic function that it needs to declare the lambda with an explicit HD.

If Sam decides to incorporate support for capturing lambdas in this patch, we could still do it by restricting the capturing lambda promotion to the ones within a function scope only. I.e. lambdas created in global scope would still be host.

I think that would be acceptable. I dont think global scope capturing lambdas are very common due to possible ODR issues.

In D78655#2117216, @pfultz2 wrote:
Now, back to the specifics of your example. I'm still not 100% sure I understand what the problem is. Can you boil down the use case to an example on godbolt?

I dont have a specific example, but there could be code like this generic clip operator:
template<class F, class T>
void clip(F f,
          const T& min_val,
          const T& max_val)
{

    f([=](auto x) {
        return ::min<decltype(x)>(::max<decltype(x)>(min_val, x), max_val);
    });
}

What's the expected HD property of this template function clip? Why we cannot ask developers to add the same HD property for that inner lambda?

Its not clear to the writer of the generic function that it needs to declare the lambda with an explicit HD.

If Sam decides to incorporate support for capturing lambdas in this patch, we could still do it by restricting the capturing lambda promotion to the ones within a function scope only. I.e. lambdas created in global scope would still be host.

I think that would be acceptable. I dont think global scope capturing lambdas are very common due to possible ODR issues.

What's the expected HD property of this template function clip?

I think it is intended to be host-only. The function f will launch a kernel or threads to utilize the passed lambda.

Ideally, it would be nice to make all inlineable functions implicitly HD. There is a pragma to force HD, but it is broken(due to forcing HD on functions annotated as host-only or device-only). It would be nice to have a flag to enable such behavior(instead of a pragma). Requiring all these explicit HD annotations just seems like we are moving backwards.

Added diagnostics for capturing host variables on device lambda and made lambda host device by default.

tra added inline comments.Jun 30 2020, 2:33 PM

clang/lib/Sema/SemaCUDA.cpp
750	What does the return value mean? We don't seem to check it anyways. If we don't care about the result, perhaps the function should be void. If we do, then it would be good to document its purpose and returned values and, probably, rename it to better indicate what is it it's supposed to check.
clang/lib/Sema/SemaLambda.cpp
1783	I would expect Sema-level diags to be produced during both host and device compilation. Some of the diags for HD may need to be postponed until after things have been CodeGen'ed, but the checks should happen nevertheless.

pfultz2 added inline comments.Jun 30 2020, 6:07 PM

clang/test/SemaCUDA/lambda.cu
27	Will this still produce diagnostics when the lambda is explicitly `__device__`? Maybe you could add a test case for that. kernel<<<1,1>>>([&]() __device__ { hd(b); });

yaxunl marked 10 inline comments as done.Jul 1 2020, 5:48 AM

yaxunl added inline comments.

clang/lib/Sema/SemaCUDA.cpp
750	it should return void. fixed.
clang/lib/Sema/SemaLambda.cpp
1783	A host device function could be emitted in both host and device compilation. The way the deferred diags works is that they are only emitted when the function is sure to be emitted in a specific compilation. In host compilation, when a host device function is sure to be emitted, it is emitted as host function, therefore diags for host compilation and only diags for host compilation should be emitted. The same with device compilation. This makes sense since we do not know if a host device function will be emitted in device compilation when we are doing host compilation, since to do that we have to go through the whole device compilation whereas currently device compilation and host compilation are separate process. That said, when we emit diags for captures by reference, we should only emit them when the lambdas are emitted as device functions. When they are emitted as host functions in host compilation, these captures are valid and should not be diagnosed.
clang/test/CodeGenCUDA/lambda.cu
54	fixed. thanks
clang/test/SemaCUDA/lambda.cu
27	yes. added a test

revised by Artem's and Paul's comments

tra added inline comments.Jul 6 2020, 1:33 PM

clang/lib/Sema/SemaLambda.cpp
1783	HD lambda capturing something that's side-specific is similar to HD function calling side-specific function. I still think that the general principle applies to both sides of the compilation. We may be saved by the fact that functions seem to be the only things that are truly side-specific and wrong-side access will already produce postponed diagnostics. For variables we'll end up capturing shadows, which is in the gray area -- we're allowed to use pointers & sizeof(), but not the non-const values. Perhaps this condition check should be folded into the `CUDACheckLambdaCapture` and add few comments about the reasons for particular OK/not-OK choices we make there.

hliao resigned from this revision.Jul 7 2020, 11:37 AM

This revision is now accepted and ready to land.Jul 7 2020, 11:37 AM

refactor CUDACheckLambdaCapture and add comments

tra added inline comments.Jul 7 2020, 3:57 PM

clang/lib/Sema/SemaCUDA.cpp
757–759	I don't think this is completely correct. Postponed diags get emitted if we know we're attempoting to codegen wrong things. E.g. during host compilation when HD function used by host code ends up attempting to call a device function. It also works in the other direction -- it kicks in during device compilation when HD function calls a host function. AFAICT it has nothing to do with what happens on the other side of the compilation, but rather what we're attempting to codegen during this compilation. I don't think that we can reason that checks can be done on the host side only, based only on the argument you're making above (at least based on the way I understand it). The point you're making below that a captured lambda created by device code can't ever be used by the host code is probably a better argument why the check may not be necessary.

revised by Artem's comments

yaxunl marked 3 inline comments as done.Jul 8 2020, 7:08 AM

yaxunl added inline comments.

clang/lib/Sema/SemaCUDA.cpp
757–759	Revised the comments.

LGTM.

Closed by commit rG1eaad01046c8: [CUDA][HIP] Let lambda be host device by default (authored by yaxunl). · Explain WhyJul 8 2020, 10:11 AM

This revision was automatically updated to reflect the committed changes.

yaxunl marked an inline comment as done.

Herald added a project: Restricted Project. · View Herald TranscriptJul 8 2020, 10:11 AM

Revision Contents

Path

Size

clang/

include/

clang/

Basic/

DiagnosticSemaKinds.td

4 lines

Sema/

Sema.h

7 lines

lib/

Sema/

SemaCUDA.cpp

36 lines

SemaLambda.cpp

6 lines

test/

CodeGenCUDA/

lambda.cu

85 lines

SemaCUDA/

Inputs/

cuda.h

14 lines

lambda.cu

70 lines

Diff 274589

clang/include/clang/Basic/DiagnosticSemaKinds.td

This file is larger than 256 KB, so syntax highlighting is disabled by default.

	Show First 20 Lines • Show All 7,987 Lines • ▼ Show 20 Lines
	def err_global_call_not_config : Error<			def err_global_call_not_config : Error<
	"call to global function %0 not configured">;			"call to global function %0 not configured">;
	def err_ref_bad_target : Error<			def err_ref_bad_target : Error<
	"reference to %select{__device__\|__global__\|__host__\|__host__ __device__}0 "			"reference to %select{__device__\|__global__\|__host__\|__host__ __device__}0 "
	"function %1 in %select{__device__\|__global__\|__host__\|__host__ __device__}2 function">;			"function %1 in %select{__device__\|__global__\|__host__\|__host__ __device__}2 function">;
	def err_ref_bad_target_global_initializer : Error<			def err_ref_bad_target_global_initializer : Error<
	"reference to %select{__device__\|__global__\|__host__\|__host__ __device__}0 "			"reference to %select{__device__\|__global__\|__host__\|__host__ __device__}0 "
	"function %1 in global initializer">;			"function %1 in global initializer">;
				def err_capture_bad_target : Error<
				"capture host variable %0 by reference in device or host device lambda function">;
				def err_capture_bad_target_this_ptr : Error<
				"capture host side class data member by this pointer in device or host device lambda function">;
	def warn_kern_is_method : Extension<			def warn_kern_is_method : Extension<
	"kernel function %0 is a member function; this may not be accepted by nvcc">,			"kernel function %0 is a member function; this may not be accepted by nvcc">,
	InGroup<CudaCompat>;			InGroup<CudaCompat>;
	def warn_kern_is_inline : Warning<			def warn_kern_is_inline : Warning<
	"ignored 'inline' attribute on kernel function %0">,			"ignored 'inline' attribute on kernel function %0">,
	InGroup<CudaCompat>;			InGroup<CudaCompat>;
	def err_variadic_device_fn : Error<			def err_variadic_device_fn : Error<
	"CUDA device code does not support variadic functions">;			"CUDA device code does not support variadic functions">;
	▲ Show 20 Lines • Show All 2,850 Lines • Show Last 20 Lines

clang/include/clang/Sema/Sema.h

This file is larger than 256 KB, so syntax highlighting is disabled by default.

	Show First 20 Lines • Show All 9,991 Lines • ▼ Show 20 Lines
	///			///
	/// Will only create deferred diagnostics for a given SourceLocation once,			/// Will only create deferred diagnostics for a given SourceLocation once,
	/// so you can safely call this multiple times without generating duplicate			/// so you can safely call this multiple times without generating duplicate
	/// deferred errors.			/// deferred errors.
	///			///
	/// - Otherwise, returns true without emitting any diagnostics.			/// - Otherwise, returns true without emitting any diagnostics.
	bool CheckCUDACall(SourceLocation Loc, FunctionDecl *Callee);			bool CheckCUDACall(SourceLocation Loc, FunctionDecl *Callee);

				bool CUDACheckLambdaCapture(CXXMethodDecl *D, const sema::Capture &Capture);

	/// Set __device__ or __host__ __device__ attributes on the given lambda			/// Set __device__ or __host__ __device__ attributes on the given lambda
	/// operator() method.			/// operator() method.
	///			///
	/// CUDA lambdas declared inside __device__ or __global__ functions inherit			/// CUDA lambdas by default is host device function unless it has explicit
	/// the __device__ attribute. Similarly, lambdas inside __host__ __device__			/// host or device attribute.
	/// functions become __host__ __device__ themselves.
	void CUDASetLambdaAttrs(CXXMethodDecl *Method);			void CUDASetLambdaAttrs(CXXMethodDecl *Method);

	/// Finds a function in \p Matches with highest calling priority			/// Finds a function in \p Matches with highest calling priority
	/// from \p Caller context and erases all functions with lower			/// from \p Caller context and erases all functions with lower
	/// calling priority.			/// calling priority.
	void EraseUnwantedCUDAMatches(			void EraseUnwantedCUDAMatches(
	const FunctionDecl *Caller,			const FunctionDecl *Caller,
	SmallVectorImpl<std::pair<DeclAccessPair, FunctionDecl *>> &Matches);			SmallVectorImpl<std::pair<DeclAccessPair, FunctionDecl *>> &Matches);
	▲ Show 20 Lines • Show All 786 Lines • Show Last 20 Lines

clang/lib/Sema/SemaCUDA.cpp

Show All 11 Lines

#include "clang/AST/ASTContext.h"		#include "clang/AST/ASTContext.h"
#include "clang/AST/Decl.h"		#include "clang/AST/Decl.h"
#include "clang/AST/ExprCXX.h"		#include "clang/AST/ExprCXX.h"
#include "clang/Basic/Cuda.h"		#include "clang/Basic/Cuda.h"
#include "clang/Basic/TargetInfo.h"		#include "clang/Basic/TargetInfo.h"
#include "clang/Lex/Preprocessor.h"		#include "clang/Lex/Preprocessor.h"
#include "clang/Sema/Lookup.h"		#include "clang/Sema/Lookup.h"
		#include "clang/Sema/ScopeInfo.h"
#include "clang/Sema/Sema.h"		#include "clang/Sema/Sema.h"
#include "clang/Sema/SemaDiagnostic.h"		#include "clang/Sema/SemaDiagnostic.h"
#include "clang/Sema/SemaInternal.h"		#include "clang/Sema/SemaInternal.h"
#include "clang/Sema/Template.h"		#include "clang/Sema/Template.h"
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
using namespace clang;		using namespace clang;

▲ Show 20 Lines • Show All 713 Lines • ▼ Show 20 Lines	DeviceDiagBuilder(DiagKind, Loc, diag::err_ref_bad_target, Caller, *this)
<< IdentifyCUDATarget(Callee) << Callee << IdentifyCUDATarget(Caller);		<< IdentifyCUDATarget(Callee) << Callee << IdentifyCUDATarget(Caller);
DeviceDiagBuilder(DiagKind, Callee->getLocation(), diag::note_previous_decl,		DeviceDiagBuilder(DiagKind, Callee->getLocation(), diag::note_previous_decl,
Caller, *this)		Caller, *this)
<< Callee;		<< Callee;
return DiagKind != DeviceDiagBuilder::K_Immediate &&		return DiagKind != DeviceDiagBuilder::K_Immediate &&
DiagKind != DeviceDiagBuilder::K_ImmediateWithCallStack;		DiagKind != DeviceDiagBuilder::K_ImmediateWithCallStack;
}		}

		bool Sema::CUDACheckLambdaCapture(CXXMethodDecl *Callee,
		traUnsubmitted Done Reply Inline Actions What does the return value mean? We don't seem to check it anyways. If we don't care about the result, perhaps the function should be void. If we do, then it would be good to document its purpose and returned values and, probably, rename it to better indicate what is it it's supposed to check. tra: What does the return value mean? We don't seem to check it anyways. If we don't care about the…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions it should return void. fixed. yaxunl: it should return void. fixed.
		const sema::Capture &Capture) {
		FunctionDecl *Caller = dyn_cast<FunctionDecl>(CurContext);
		if (!Caller)
		return true;
		bool CalleeIsDevice = Callee->hasAttr<CUDADeviceAttr>();
		bool CallerIsHost =
		!Caller->hasAttr<CUDAGlobalAttr>() && !Caller->hasAttr<CUDADeviceAttr>();
		bool ShouldCheck = CalleeIsDevice && CallerIsHost;
		if (!ShouldCheck \|\| !Capture.isReferenceCapture())
		traUnsubmitted Done Reply Inline Actions I don't think this is completely correct. Postponed diags get emitted if we know we're attempoting to codegen wrong things. E.g. during host compilation when HD function used by host code ends up attempting to call a device function. It also works in the other direction -- it kicks in during device compilation when HD function calls a host function. AFAICT it has nothing to do with what happens on the other side of the compilation, but rather what we're attempting to codegen during this compilation. I don't think that we can reason that checks can be done on the host side only, based only on the argument you're making above (at least based on the way I understand it). The point you're making below that a captured lambda created by device code can't ever be used by the host code is probably a better argument why the check may not be necessary. tra: I don't think this is completely correct. Postponed diags get emitted if we know we're…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions Revised the comments. yaxunl: Revised the comments.
		return true;
		auto DiagKind = DeviceDiagBuilder::K_Deferred;
		if (Capture.isVariableCapture()) {
		DeviceDiagBuilder(DiagKind, Capture.getLocation(),
		diag::err_capture_bad_target, Callee, *this)
		<< Capture.getVariable();
		} else if (Capture.isThisCapture()) {
		DeviceDiagBuilder(DiagKind, Capture.getLocation(),
		diag::err_capture_bad_target_this_ptr, Callee, *this);
		}
		return true;
		}

void Sema::CUDASetLambdaAttrs(CXXMethodDecl *Method) {		void Sema::CUDASetLambdaAttrs(CXXMethodDecl *Method) {
assert(getLangOpts().CUDA && "Should only be called during CUDA compilation");		assert(getLangOpts().CUDA && "Should only be called during CUDA compilation");
if (Method->hasAttr<CUDAHostAttr>() \|\| Method->hasAttr<CUDADeviceAttr>())		if (Method->hasAttr<CUDAHostAttr>() \|\| Method->hasAttr<CUDADeviceAttr>())
return;		return;
FunctionDecl *CurFn = dyn_cast<FunctionDecl>(CurContext);
if (!CurFn)
return;
CUDAFunctionTarget Target = IdentifyCUDATarget(CurFn);
if (Target == CFT_Global \|\| Target == CFT_Device) {
Method->addAttr(CUDADeviceAttr::CreateImplicit(Context));
} else if (Target == CFT_HostDevice) {
Method->addAttr(CUDADeviceAttr::CreateImplicit(Context));		Method->addAttr(CUDADeviceAttr::CreateImplicit(Context));
		rsmithUnsubmitted Done Reply Inline Actions This check appears to prevent lambdas appearing in any context outside a function from being implicitly HD. Is that what you want? Eg: auto foo = [] {}; // not implicitly HD rsmith: This check appears to prevent lambdas appearing in any context outside a function from being…
		pfultz2Unsubmitted Done Reply Inline Actions There should at least be a flag to enable capturing lambdas to be implicitly HD. I dont really understand the rational for making capturing lambdas not implicitly HD. It seems like its trying to prevent using an address to host on the device, but I dont see how this prevents that at all. This will also break the compilation in rocm. Should we use a fork of llvm to compile rocm? pfultz2: There should at least be a flag to enable capturing lambdas to be implicitly HD. I dont really…
		traUnsubmitted Done Reply Inline Actions @pfultz2: This will also break the compilation in rocm. Should we use a fork of llvm to compile rocm? Could you give an example to demonstrate current use and how it will break? My understanding that the patch relaxes the restrictions on lambdas so in theory not promoting capturing lambdas preserves the status quo. As for the fork, my response would be an empathic "no, please don't do it". Fork == different compiler == showstopper for various use cases. It would definitely be an issue for us at Google. Considering that we're still probing our way towards making lambdas more useful, it may be a bit premature to heavily depend on any particular implementation detail of an experimental feature, even if it happens to work. We'll need to figure out an approach that will be sustainable long-term and forked compiler is a rather large and hard-to-maintain hammer for this. In my experience, adapting source code ends up being more manageable long-term. tra: @pfultz2: > This will also break the compilation in rocm. Should we use a fork of llvm to…
Method->addAttr(CUDAHostAttr::CreateImplicit(Context));		Method->addAttr(CUDAHostAttr::CreateImplicit(Context));
		traUnsubmitted Done Reply Inline Actions What about `__global__` lambdas? We probably don't want to add HD attributes on them here. tra: What about `__global__` lambdas? We probably don't want to add HD attributes on them here.
		yaxunlAuthorUnsubmitted Done Reply Inline Actions lambda is not allowed to be kernel. I will add a lit test for that. yaxunl: lambda is not allowed to be kernel. I will add a lit test for that.
}		}
		pfultz2Unsubmitted Done Reply Inline Actions Shouldn't we add these attributes if there are no host and device attributes? This seems like it will treat `[]() __device__ {}` as host device. pfultz2: Shouldn't we add these attributes if there are no host and device attributes? This seems like…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions There is check on line 716. We only reach here if there is no device and host attrs on the function. yaxunl: There is check on line 716. We only reach here if there is no device and host attrs on the…
		rsmithUnsubmitted Done Reply Inline Actions The reference captures check seems quite strange to me. A copy capture of a pointer could have the same problem, as could a copy capture of a class that contains a reference or a pointer. As could an init-capture. These kinds of quirky language rules are usually more trouble than they're worth. rsmith: The reference captures check seems quite strange to me. A copy capture of a pointer could have…
		pfultz2Unsubmitted Done Reply Inline Actions Capturing by value is not always an error, only when copying a pointer to a host variable. but this requires a lot more static analysis to diagnose. However, capturing by reference is almost always wrong(at least with the current HIP) when the context is host and the lambda is called on the device. Therefore, we avoid this scenario by not making such lambdas implicitly HD, but the error message may not be quite as clear. It is a quirky language rule, and we could remove this restriction and rely on a warning or static analysis to diagnose the issue. pfultz2: Capturing by value is not always an error, only when copying a pointer to a host variable. but…
}

void Sema::checkCUDATargetOverload(FunctionDecl *NewFD,		void Sema::checkCUDATargetOverload(FunctionDecl *NewFD,
const LookupResult &Previous) {		const LookupResult &Previous) {
assert(getLangOpts().CUDA && "Should only be called during CUDA compilation");		assert(getLangOpts().CUDA && "Should only be called during CUDA compilation");
CUDAFunctionTarget NewTarget = IdentifyCUDATarget(NewFD);		CUDAFunctionTarget NewTarget = IdentifyCUDATarget(NewFD);
for (NamedDecl *OldND : Previous) {		for (NamedDecl *OldND : Previous) {
FunctionDecl *OldFD = OldND->getAsFunction();		FunctionDecl *OldFD = OldND->getAsFunction();
if (!OldFD)		if (!OldFD)
▲ Show 20 Lines • Show All 53 Lines • Show Last 20 Lines

clang/lib/Sema/SemaLambda.cpp

Show First 20 Lines • Show All 984 Lines • ▼ Show 20 Lines	void Sema::ActOnStartOfLambdaDefinition(LambdaIntroducer &Intro,

// code_seg attribute on lambda apply to the method.		// code_seg attribute on lambda apply to the method.
if (Attr A = getImplicitCodeSegOrSectionAttrForFunction(Method, /IsDefinition=*/true))		if (Attr A = getImplicitCodeSegOrSectionAttrForFunction(Method, /IsDefinition=*/true))
Method->addAttr(A);		Method->addAttr(A);

// Attributes on the lambda apply to the method.		// Attributes on the lambda apply to the method.
ProcessDeclAttributes(CurScope, Method, ParamInfo);		ProcessDeclAttributes(CurScope, Method, ParamInfo);

// CUDA lambdas get implicit attributes based on the scope in which they're		// CUDA lambdas get implicit host and device attributes.
// declared.
if (getLangOpts().CUDA)		if (getLangOpts().CUDA)
CUDASetLambdaAttrs(Method);		CUDASetLambdaAttrs(Method);

// Number the lambda for linkage purposes if necessary.		// Number the lambda for linkage purposes if necessary.
handleLambdaNumbering(Class, Method);		handleLambdaNumbering(Class, Method);

// Introduce the function call operator as the current declaration context.		// Introduce the function call operator as the current declaration context.
PushDeclContext(CurScope, Method);		PushDeclContext(CurScope, Method);
▲ Show 20 Lines • Show All 772 Lines • ▼ Show 20 Lines	for (unsigned I = 0, N = LSI->Captures.size(); I != N; ++I) {
// FIXME: Skip this capture if the capture is not used, the initializer		// FIXME: Skip this capture if the capture is not used, the initializer
// has no side-effects, the type of the capture is trivial, and the		// has no side-effects, the type of the capture is trivial, and the
// lambda is not externally visible.		// lambda is not externally visible.

// Add a FieldDecl for the capture and form its initializer.		// Add a FieldDecl for the capture and form its initializer.
BuildCaptureField(Class, From);		BuildCaptureField(Class, From);
Captures.push_back(Capture);		Captures.push_back(Capture);
CaptureInits.push_back(Init.get());		CaptureInits.push_back(Init.get());

		if (LangOpts.CUDA && LangOpts.CUDAIsDevice)
		traUnsubmitted Done Reply Inline Actions I would expect Sema-level diags to be produced during both host and device compilation. Some of the diags for HD may need to be postponed until after things have been CodeGen'ed, but the checks should happen nevertheless. tra: I would expect Sema-level diags to be produced during both host and device compilation. Some of…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions A host device function could be emitted in both host and device compilation. The way the deferred diags works is that they are only emitted when the function is sure to be emitted in a specific compilation. In host compilation, when a host device function is sure to be emitted, it is emitted as host function, therefore diags for host compilation and only diags for host compilation should be emitted. The same with device compilation. This makes sense since we do not know if a host device function will be emitted in device compilation when we are doing host compilation, since to do that we have to go through the whole device compilation whereas currently device compilation and host compilation are separate process. That said, when we emit diags for captures by reference, we should only emit them when the lambdas are emitted as device functions. When they are emitted as host functions in host compilation, these captures are valid and should not be diagnosed. yaxunl: A host device function could be emitted in both host and device compilation. The way the…
		traUnsubmitted Done Reply Inline Actions HD lambda capturing something that's side-specific is similar to HD function calling side-specific function. I still think that the general principle applies to both sides of the compilation. We may be saved by the fact that functions seem to be the only things that are truly side-specific and wrong-side access will already produce postponed diagnostics. For variables we'll end up capturing shadows, which is in the gray area -- we're allowed to use pointers & sizeof(), but not the non-const values. Perhaps this condition check should be folded into the `CUDACheckLambdaCapture` and add few comments about the reasons for particular OK/not-OK choices we make there. tra: HD lambda capturing something that's side-specific is similar to HD function calling side…
		CUDACheckLambdaCapture(CallOperator, From);
}		}

Class->setCaptures(Captures);		Class->setCaptures(Captures);

// C++11 [expr.prim.lambda]p6:		// C++11 [expr.prim.lambda]p6:
// The closure type for a lambda-expression with no lambda-capture		// The closure type for a lambda-expression with no lambda-capture
// has a public non-virtual non-explicit const conversion function		// has a public non-virtual non-explicit const conversion function
// to pointer to function having the same parameter and return		// to pointer to function having the same parameter and return
▲ Show 20 Lines • Show All 148 Lines • Show Last 20 Lines

clang/test/CodeGenCUDA/lambda.cu

This file was added.

				// RUN: %clang_cc1 -x hip -emit-llvm -std=c++11 %s -o - \
				// RUN: -triple x86_64-linux-gnu \
				// RUN: \| FileCheck -check-prefix=HOST %s
				// RUN: %clang_cc1 -x hip -emit-llvm -std=c++11 %s -o - \
				// RUN: -triple amdgcn-amd-amdhsa -fcuda-is-device \
				// RUN: \| FileCheck -check-prefix=DEV %s

				#include "Inputs/cuda.h"

				// Device side kernel name.
				// HOST: @[[KERN_CAPTURE:[0-9]+]] = {{.*}} c"_Z1gIZ12test_capturevEUlvE_EvT_\00"
				// HOST: @[[KERN_RESOLVE:[0-9]+]] = {{.*}} c"_Z1gIZ12test_resolvevEUlvE_EvT_\00"

				// Check functions emitted for test_capture in host compilation.
				// Check lambda is not emitted in host compilation.
				// HOST-LABEL: define void @_Z12test_capturev
				// HOST: call void @_Z19test_capture_helperIZ12test_capturevEUlvE_EvT_
				// HOST-LABEL: define internal void @_Z19test_capture_helperIZ12test_capturevEUlvE_EvT_
				// HOST: call void @_Z16__device_stub__gIZ12test_capturevEUlvE_EvT_
				// HOST-NOT: define{{.*}}@_ZZ4mainENKUlvE_clEv

				// Check functions emitted for test_resolve in host compilation.
				// Check host version of template function 'overloaded' is emitted and called
				// by the lambda function.
				// HOST-LABEL: define void @_Z12test_resolvev
				// HOST: call void @_Z19test_resolve_helperIZ12test_resolvevEUlvE_EvT_()
				// HOST-LABEL: define internal void @_Z19test_resolve_helperIZ12test_resolvevEUlvE_EvT_
				traUnsubmitted Done Reply Inline Actions The test example may not be doing what it's seemingly supposed to be doing: https://cppinsights.io/s/3a5c42ff `h()` gets a temporary host-side object which keeps the reference to `a` and that reference will actually point to the host-side shadow of the actual device-side `a`. When you get to execute `g` it's `this` may not be very usable on device side and thus `f.operator()` will probably not work. Alas, we currently have no diagnostics for that kind of error. Change it to a non-capturing lambda, perhaps? tra: The test example may not be doing what it's seemingly supposed to be doing: https://cppinsights.
				yaxunlAuthorUnsubmitted Done Reply Inline Actions It works. We need to think about this in device compilation. In device compilation, global variable is a device variable, the lambda is a device host function, therefore the lambda is accessing the real a, not the shadow. In the host compilation, the lambda is not really called, therefore it is not emitted. I will update the lit test with these checks. yaxunl: It works. We need to think about this in device compilation. In device compilation, global…
				traUnsubmitted Done Reply Inline Actions Clang manages to see through to the initializer of `a`, but I'm not sure how much we can rely on this. In general, `f.operator()` for a capturing lambda needs to access captured variables via `this` which points to a temporary objects created and passed to `g` by the host. You can see it if you capture a local variable: https://godbolt.org/z/99389o Anyways, it's an issue orthogonal to this patch. My concern is that tests are often used as an example of things that are OK to do, and capturing lambdas are a pretty big foot-shooting gun when used with CUDA. It's very easy to do wrong thing without compiler complaining about them. While accessing `a` does work, it appears to do so by accident, rather than by design. I'm fairly confident that I can hide the initializer with sufficiently complicated code, force clang to access `a` via `this` and make everything fail at runtime. IMO, what we have here is a 'happens to work' situation. I do not want to call it 'guaranteed to work' without making sure that it always does. In order to demonstrate that lambda is host/device, you do not need it to be a capturing lambda. You can make it call an overloaded function with host and device variants and verify that the lambda works on host and device sides. tra: Clang manages to see through to the initializer of `a`, but I'm not sure how much we can rely…
				yaxunlAuthorUnsubmitted Done Reply Inline Actions I added one more test, where a lambda function calls a template function which is overloaded with a host version and a device version. The lambda is called in both host function and in kernel. Test shows correct version of template function are emitted in host and device compilation. I think it is not a surprise that the lambda function is able to resolve the host/device-ness of the callee correctly. We are doing resolution in a host device function and the two candidates are same-side vs wrong-side. yaxunl: I added one more test, where a lambda function calls a template function which is overloaded…
				// HOST: call void @_Z16__device_stub__gIZ12test_resolvevEUlvE_EvT_
				// HOST: call void @_ZZ12test_resolvevENKUlvE_clEv
				// HOST-LABEL: define internal void @_ZZ12test_resolvevENKUlvE_clEv
				rjmccallUnsubmitted Done Reply Inline Actions Typo rjmccall: Typo
				// HOST: call i32 @_Z10overloadedIiET_v
				// HOST-LABEL: define linkonce_odr i32 @_Z10overloadedIiET_v
				// HOST: ret i32 2

				// Check kernel is registered with correct device side kernel name.
				// HOST: @__hipRegisterFunction({{.*}}@[[KERN_CAPTURE]]
				// HOST: @__hipRegisterFunction({{.*}}@[[KERN_RESOLVE]]

				// DEV: @a = addrspace(1) externally_initialized global i32 0

				// Check functions emitted for test_capture in device compilation.
				// Check lambda is emitted in device compilation and accessing device variable.
				// DEV-LABEL: define amdgpu_kernel void @_Z1gIZ12test_capturevEUlvE_EvT_
				// DEV: call void @_ZZ12test_capturevENKUlvE_clEv
				// DEV-LABEL: define internal void @_ZZ12test_capturevENKUlvE_clEv
				// DEV: store i32 1, i32* addrspacecast (i32 addrspace(1)* @a to i32*)

				// Check functions emitted for test_resolve in device compilation.
				// Check device version of template function 'overloaded' is emitted and called
				// by the lambda function.
				// DEV-LABEL: define amdgpu_kernel void @_Z1gIZ12test_resolvevEUlvE_EvT_
				// DEV: call void @_ZZ12test_resolvevENKUlvE_clEv
				// DEV-LABE: define internal void @_ZZ12test_resolvevENKUlvE_clEv
				// DEV: call i32 @_Z10overloadedIiET_v
				ashi1Unsubmitted Done Reply Inline Actions There is a typo here, DEV-LABEL ashi1: There is a typo here, DEV-LABEL
				yaxunlAuthorUnsubmitted Done Reply Inline Actions fixed. thanks yaxunl: fixed. thanks
				// DEV-LABEL: define linkonce_odr i32 @_Z10overloadedIiET_v
				// DEV: ret i32 1

				__device__ int a;

				template<class T>
				__device__ T overloaded() { return 1; }

				template<class T>
				__host__ T overloaded() { return 2; }

				template<class F>
				__global__ void g(F f) { f(); }

				template<class F>
				void test_capture_helper(F f) { g<<<1,1>>>(f); }

				template<class F>
				void test_resolve_helper(F f) { g<<<1,1>>>(f); f(); }

				// Test capture of device variable in lambda function.
				void test_capture(void) {
				test_capture_helper([](){ a = 1;});
				}

				// Test resolving host/device function in lambda function.
				// Callee should resolve to correct host/device function based on where
				// the lambda function is called, not where it is defined.
				void test_resolve(void) {
				test_resolve_helper([](){ overloaded<int>();});
				hliaoUnsubmitted Done Reply Inline Actions We are allowing regular lambda to be used in the device functions. That should be explicitly marked by making that lambda `__device__` or `__host__ __device__`. Even though we may not have static checks for capture so far, that should be easily extended with those attributes. hliao: We are allowing regular lambda to be used in the device functions. That should be explicitly…
				}

clang/test/SemaCUDA/Inputs/cuda.h

	Show All 11 Lines
	#define __shared__ __attribute__((shared))			#define __shared__ __attribute__((shared))
	#define __launch_bounds__(...) __attribute__((launch_bounds(__VA_ARGS__)))			#define __launch_bounds__(...) __attribute__((launch_bounds(__VA_ARGS__)))

	struct dim3 {			struct dim3 {
	unsigned x, y, z;			unsigned x, y, z;
	__host__ __device__ dim3(unsigned x, unsigned y = 1, unsigned z = 1) : x(x), y(y), z(z) {}			__host__ __device__ dim3(unsigned x, unsigned y = 1, unsigned z = 1) : x(x), y(y), z(z) {}
	};			};

				#ifdef __HIP__
				typedef struct hipStream *hipStream_t;
				typedef enum hipError {} hipError_t;
				int hipConfigureCall(dim3 gridSize, dim3 blockSize, size_t sharedSize = 0,
				hipStream_t stream = 0);
				extern "C" hipError_t __hipPushCallConfiguration(dim3 gridSize, dim3 blockSize,
				size_t sharedSize = 0,
				hipStream_t stream = 0);
				extern "C" hipError_t hipLaunchKernel(const void *func, dim3 gridDim,
				dim3 blockDim, void **args,
				size_t sharedMem,
				hipStream_t stream);
				#else
	typedef struct cudaStream *cudaStream_t;			typedef struct cudaStream *cudaStream_t;
	typedef enum cudaError {} cudaError_t;			typedef enum cudaError {} cudaError_t;

	extern "C" int cudaConfigureCall(dim3 gridSize, dim3 blockSize,			extern "C" int cudaConfigureCall(dim3 gridSize, dim3 blockSize,
	size_t sharedSize = 0,			size_t sharedSize = 0,
	cudaStream_t stream = 0);			cudaStream_t stream = 0);
	extern "C" int __cudaPushCallConfiguration(dim3 gridSize, dim3 blockSize,			extern "C" int __cudaPushCallConfiguration(dim3 gridSize, dim3 blockSize,
	size_t sharedSize = 0,			size_t sharedSize = 0,
	cudaStream_t stream = 0);			cudaStream_t stream = 0);
	extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim,			extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim,
	dim3 blockDim, void **args,			dim3 blockDim, void **args,
	size_t sharedMem, cudaStream_t stream);			size_t sharedMem, cudaStream_t stream);
				#endif

	// Host- and device-side placement new overloads.			// Host- and device-side placement new overloads.
	void operator new(__SIZE_TYPE__, void p) { return p; }			void operator new(__SIZE_TYPE__, void p) { return p; }
	void operator new[](__SIZE_TYPE__, void p) { return p; }			void operator new[](__SIZE_TYPE__, void p) { return p; }
	__device__ void operator new(__SIZE_TYPE__, void p) { return p; }			__device__ void operator new(__SIZE_TYPE__, void p) { return p; }
	__device__ void operator new[](__SIZE_TYPE__, void p) { return p; }			__device__ void operator new[](__SIZE_TYPE__, void p) { return p; }

	#endif // !__NVCC__			#endif // !__NVCC__

clang/test/SemaCUDA/lambda.cu

This file was added.

				// RUN: %clang_cc1 -std=c++17 -fsyntax-only -verify=com %s
				// RUN: %clang_cc1 -std=c++17 -fsyntax-only -fcuda-is-device -verify=com,dev %s

				#include "Inputs/cuda.h"

				auto global_lambda = [] () { return 123; };

				template<class F>
				__global__ void kernel(F f) { f(); }
				// dev-note@-1 6{{called by 'kernel<(lambda}}

				__host__ __device__ void hd(int x);

				class A {
				int b;
				public:
				void test() {
				[=](){ hd(b); }();

				[&](){ hd(b); }();

				kernel<<<1,1>>>([](){ hd(0); });

				kernel<<<1,1>>>([=](){ hd(b); });
				// dev-error@-1 {{capture host side class data member by this pointer in device or host device lambda function}}

				kernel<<<1,1>>>([&](){ hd(b); });
				pfultz2Unsubmitted Done Reply Inline Actions Will this still produce diagnostics when the lambda is explicitly `__device__`? Maybe you could add a test case for that. kernel<<<1,1>>>([&]() __device__ { hd(b); }); pfultz2: Will this still produce diagnostics when the lambda is explicitly `__device__`? Maybe you could…
				yaxunlAuthorUnsubmitted Done Reply Inline Actions yes. added a test yaxunl: yes. added a test
				// dev-error@-1 {{capture host side class data member by this pointer in device or host device lambda function}}

				kernel<<<1,1>>>([&](){
				auto f = [&]{ hd(b); };
				// dev-error@-1 {{capture host side class data member by this pointer in device or host device lambda function}}
				f();
				});
				}
				};
				traUnsubmitted Done Reply Inline Actions We may need a better diagnostic for this. Here we've correctly rejected captured lambdas, but the diagnostic is a generic 'can't use that'. If would be helpful to let user know that we can't use that because of the capturing lambdas. tra: We may need a better diagnostic for this. Here we've correctly rejected captured lambdas, but…
				yaxunlAuthorUnsubmitted Done Reply Inline Actions added more information about lambda function to the diagnostic message yaxunl: added more information about lambda function to the diagnostic message

				int main(void) {
				auto lambda_kernel = [&]__global__(){};
				// com-error@-1 {{kernel function 'operator()' must be a free function or static member function}}

				int b;
				[&](){ hd(b); }();

				[=, &b](){ hd(b); }();

				kernel<<<1,1>>>(global_lambda);

				kernel<<<1,1>>>([](){ hd(0); });

				kernel<<<1,1>>>([=](){ hd(b); });

				kernel<<<1,1>>>([b](){ hd(b); });

				kernel<<<1,1>>>([&](){ hd(b); });
				// dev-error@-1 {{capture host variable 'b' by reference in device or host device lambda function}}

				kernel<<<1,1>>>([=, &b](){ hd(b); });
				// dev-error@-1 {{capture host variable 'b' by reference in device or host device lambda function}}

				kernel<<<1,1>>>([&, b](){ hd(b); });

				kernel<<<1,1>>>([&](){
				auto f = [&]{ hd(b); };
				// dev-error@-1 {{capture host variable 'b' by reference in device or host device lambda function}}
				f();
				});

				return 0;
				}

This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][HIP] Let lambda be host device by defaultClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 274589

clang/include/clang/Basic/DiagnosticSemaKinds.td

clang/include/clang/Sema/Sema.h

clang/lib/Sema/SemaCUDA.cpp

clang/lib/Sema/SemaLambda.cpp

clang/test/CodeGenCUDA/lambda.cu

clang/test/SemaCUDA/Inputs/cuda.h

clang/test/SemaCUDA/lambda.cu

[CUDA][HIP] Let lambda be host device by default
ClosedPublic