This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][HIP] Fix function scope static variable
Needs ReviewPublic

Authored by yaxunl on Jan 27 2021, 1:58 PM.

Details

Reviewers
tra
Summary

Currently static variables are allowed in device, global, and host device
functions.

A static variable in device and global functions is supposed to have
implicit device attribute. Currently it does not. This causes incorrect
diagnostics about host variables accessed by device functions.

Another issue is static device variables are allowed in host functions
since host functions could pass them to kernels for useful computations.
Currently they are not emitted in device compilation, which should be
fixed.

This patch also handles static variables in host device functions
and function scope static managed variables, and externalization
of such variables for fno-gpu-rdc case.

Diff Detail

Event Timeline

yaxunl requested review of this revision.Jan 27 2021, 1:58 PM
yaxunl created this revision.
tra added a comment.Jan 27 2021, 4:39 PM

A static variable in device and global functions is supposed to have
implicit device attribute. Currently it does not. This causes incorrect
diagnostics about host variables accessed by device functions.

Correct diagnostics sevice-side local static vars is a valid concern.
Could you elaborate on why are static variables in device functions are supposed to be __device__? I'm not quite sure that it's been established. At least not as a full __device__ variable, with runtime registration and the host-side shadow.

Judging by the tests and the comments, it may be better to rephrase the purpose of this patch along the lines that it allows treating a subset of the static variables for which the host may need to know device-side address as __device__, with all the overhead it entails. static vars that can't be created in the host code, remain purely static on device. When I read the patch description for the first time, it sounded more invasive than it actually is.

clang/lib/CodeGen/CodeGenModule.cpp
101

Nit. "This class does that" could be dropped. I'd generally follow a "<this thing> does <that> for <this reason>" structure.
E.g something along these lines:

Helper class for emitting device-side static variables created in host-side functions. While we do not emit host-side functions on device, we still need to emit the static variables the host code will expect to see on the device.
clang/lib/Sema/SemaCUDA.cpp
533–540 ↗(On Diff #319658)

This does not seem to be directly relevant for this patch. Perhaps move it into a separate patch?

clang/lib/Sema/SemaDecl.cpp
7247–7250

This is somewhat confusing. I guess the issue is that we're conflating all the functionality implied by the __device__ attribute and the accessible on device which is a subset of it. For the static vars in D functions you only need for it to be accessible on device, IMO. For HD functions, you do need the full __device__ functionality, with host shadow and runtime registration.

While adding implicit __device__ works for statics in the device-only functions, it's a bit of an overkill. It also gives us a somewhat different AST between host/device compilations.

Perhaps we can handle statics in device-only functions w/o adding implicit __device__. Can we check the parent of the variable instead when we check whether we're allowed to reference the variable?

clang/test/CodeGenCUDA/func-scope-static-var.cu
55

What's the reason for externalizing the variables for no-rdc only?
If we do not externalize them, then we'll potentially have a problem with the host code attempting to get variable's device-side address and fail at runtime, because it's not visible on device.

I think the right thing to do here is to always externalize them, but add unique suffix for RDC.

88

Nit: static variables w/o attributes are implicitly __device__. Or By default, static variables are implicitly __device__.

It's also not clear what you mean by which are independent. It may be better to add more details in a separate sentence.

127–128

We could use an explanation why we're not externalizing or shadowing them.

yaxunl marked 6 inline comments as done.Feb 1 2021, 5:22 PM
yaxunl added inline comments.
clang/lib/CodeGen/CodeGenModule.cpp
101

done

clang/lib/Sema/SemaCUDA.cpp
533–540 ↗(On Diff #319658)

separated to another patch

clang/lib/Sema/SemaDecl.cpp
7247–7250

Before we consider a function scope static variable without explicit device attribute, let's consider the difference between a static variable with explicit device attribute and a global device variable. They are both emitted in device compilation and have shadow variables in host compilation. The only difference is the linkage. A global device variable is supposed to be visible to other compilation units, whereas a static device variable is supposed to be visible to the same compilation unit only. A function scope static variable with device attribute has similar traits: It needs to be emitted in device compilation, and it needs a shadow variable in host compilation in case it needs to be accessed in host code. The only difference is that it is only visible inside the function.

Now let's consider a static var without device attribute in a device function. From sema and codegen point of view, it should have difference from a function scope static var with device attribute. Adding an implicit device attribute would simplify its handling.

Now let's consider a static var without device attribute in a host device function. The following code is valid for both nvcc and cuda-clang:

int __device__ __host__ func(int x) {
  static int a = 1;
  return a + x;
}

This requires the static variable is directly accessible in both device and host compilation. This requires that in device compilation, the static var behaves like a static var with explicit device attribute, whereas in host compilation, the static var behaves like a normal host static var. By adding implicit device attribute, we can clearly distinguish these situations and reuse the sema and codegen logic of device attribute.

clang/test/CodeGenCUDA/func-scope-static-var.cu
55

Yes this will be fixed by the patch for externalizing static var for -fgpu-rdc

88

revised

127–128

added explanation

yaxunl updated this revision to Diff 320644.Feb 1 2021, 5:34 PM
yaxunl marked 6 inline comments as done.

Revised by Artem's comments

tra added inline comments.Feb 2 2021, 11:08 AM
clang/lib/Sema/SemaDecl.cpp
7247–7250

A function scope static variable with device attribute has similar traits: It needs to be emitted in device compilation, and it needs a shadow variable in host compilation in case it needs to be accessed in host code.

This is the part I don't agree with. Can you give me an example how a local variable in a __device__ function can be accessed from the host code?

One can't refer to local static vars from outside of the function and even if the function returns the address, it will make no sense on the host side, because there's no reverse device-address to host shadow registration. I do not think we need host shadow or registration for device-side local statics. What do I miss?

Now let's consider a static var without device attribute in a device function. From sema and codegen point of view, it should have difference from a function scope static var with device attribute. Adding an implicit device attribute would simplify its handling.

I agree that it makes things simpler. What I'm saying is that the simple solution comes with an overhead that's not needed.

 int __device__ __host__ func(int x) {
   static int a = 1;
   return a + x;
}

This requires the static variable is directly accessible in both device and host compilation.
This requires that in device compilation, the static var behaves like a static var with explicit device attribute,
whereas in host compilation, the static var behaves like a normal host static var.

I'm not sure I follow your reasoning. directly accessible in both device and host compilation. would need an equivalent of __managed__ attribute. Regular __device__ variables only allow the variable to have an address on the host side which can then be translated into device-side address by the runtime. The variable is only directly accessible from device.

By adding implicit device attribute, we can clearly distinguish these situations and reuse the sema and codegen logic of device attribute.

While this approach does remove that shadow+registration overhead, it does not give both host and device access to the same variable and it creates more divergence between host and device AST, which I'd prefer to avoid, if possible.

To summarize, we appear to agree on what we want in the end -- a variable accessible on its respective side only w/o overhead of the shadown and registration. What we disagree on is on how to implement it.
Your approach is to add __device__ attibute only during device-side compilation only, which allows using parts of the functionality that happes to do the right thing in the individual compilation at the price of AST divergence.
I think that AST divergence should be avoided and that we should have a uniform way of handling local static vars on both sides.

Also, we'll need to figure out and document how static vars are expected to work in HD functions. Should they be implicitly __managed__? That would be the most intuitively sensible thing, but it's not going to work with CUDA as we don't support __managed__ yet.

We could explicitly say that both host and device have their own instance of the local static variable. It's sort of how it works in practice now, but it's deviating of what a user would expect from a static var. It's probably a more natural fit for CUDA/HIP programming model in general. E.g. consider that we may be running on more than one GPU. In order for a static var to work for all GPUs and the host, it should live on the host and then be memory-mapped on each device. I'm not sure if __managed__ can handle that in principle for CUDA. Each-carries-their own approach is more consistent -- that's how we treat global variables anyways.