This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang/
-
include/clang/AST/
-
clang/
-
AST/
-
ASTContext.h
-
lib/
-
AST/
8/8
ASTContext.cpp
-
Sema/
-
SemaExpr.cpp
-
test/CodeGenCUDA/
-
CodeGenCUDA/
-
constexpr-variables.cu
-
static-device-var-no-rdc.cu

Differential D80858

[CUDA][HIP] Support accessing static device variable in host code for -fno-gpu-rdc
ClosedPublic

Authored by yaxunl on May 29 2020, 9:44 PM.

Download Raw Diff

Details

Reviewers

tra
rjmccall
hliao

Commits

rG45f2a56856e2: [CUDA][HIP] Support accessing static device variable in host code for -fno-gpu…

Summary

nvcc supports accessing file-scope static device variables in host code by host APIs
like cudaMemcpyToSymbol etc.

CUDA/HIP let users access device variables in host code by shadow variables. In host compilation,
clang emits a shadow variable for each device variable, and calls __*RegisterVariable to
register it in init function. The address of the shadow variable and the device side mangled
name of the device variable is passed to __*RegisterVariable. Runtime looks up the symbol
by name in the device binary (so called code object, which is actually elf format) to find the
address of the device variable.

The problem with static device variables is that they have internal linkage, therefore their
name may be changed by the linker if there are multiple symbols with the same name. Also
they end up as local symbols in the elf file, whereas the runtime only look up the global symbols.

To support accessing static device var in host code for -fno-gpu-rdc mode, change the intnernal
linkage to external linkage. The name does not need change since there is only one TU for
-fno-gpu-rdc mode. Also the externalization is done only if the device static var is referenced
by host code.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

yaxunl created this revision.May 29 2020, 9:44 PM

The value is based on llvm::sys::Process::GetRandomNumber(). So unless one provides a build-system-derived uuid for every compilation unit, recompiling identical source will yield an observably different binary.

The distinction between 'unique' and 'random' is significant for anyone depending on repeatable binary output, so this patch should probably rename 'unique' to 'random' everywhere.

In D80858#2067537, @JonChesterfield wrote:

The value is based on llvm::sys::Process::GetRandomNumber(). So unless one provides a build-system-derived uuid for every compilation unit, recompiling identical source will yield an observably different binary.

The distinction between 'unique' and 'random' is significant for anyone depending on repeatable binary output, so this patch should probably rename 'unique' to 'random' everywhere.

+1. Reproducible builds is something we do pay attention to. We've had similar issue with -fcuda-rcd compilation and I think ended up generating the ID by deriving it from the input file name. It works reasonably well in most situations except the cases when the same file is recompiled multiple times with different preprocessor macros set.

In addition to everyone else's concerns about generating this number randomly, it's also inherently less testable.

I'm sure there's something else you could use that's reasonably likely to be unique, like a hash of the input filename or output filename or a combination of the two.

clang/include/clang/Driver/Action.h
216 ↗	(On Diff #267427)	Allowing an arbitrary string might be more adaptable.
clang/lib/AST/ASTContext.cpp
10064	This needs to be a clearer statement of why this is necessary.
10068	Are you sure this doesn't apply to e.g. local statics? Can't you have kernel lambdas, or am I confusing HIP with another language?
clang/lib/CodeGen/CodeGenModule.cpp
1094 ↗	(On Diff #267427)	Please extract this "(device \|\| constant) && file && static" predicate instead of repeating it in three different places.
clang/lib/Driver/Action.cpp
170 ↗	(On Diff #267427)	I'm sure GetRandomNumber can return 0, so this logic is faulty. This also seems like an unfortunate intrusion of HIP-specific semantics on the rest of the driver. Maybe HIP can generate the shared id when it's setting up and cloning the job.

yaxunl marked 10 inline comments as done.Jul 6 2020, 5:50 AM

yaxunl added inline comments.

clang/include/clang/Driver/Action.h
216 ↗	(On Diff #267427)	done
clang/lib/AST/ASTContext.cpp
10064	added comments to explain why
10068	function-scope static var in a device function is only visible to the device function. Host code cannot access it, therefore no need to externalize it.
clang/lib/CodeGen/CodeGenModule.cpp
1094 ↗	(On Diff #267427)	extracted as ASTContext::shouldExternalizeStaticVar
clang/lib/Driver/Action.cpp
170 ↗	(On Diff #267427)	Changed type of ID to string. Now empty ID means disabled. 0 is now allowed as a usual ID. Moved initialization of ID to HIP action builder.

revised by John's comments.

Also added generating compilation unit ID by hashing input file path and options. added option -fuse-cuid=random|hash|none for controlling the method to generate compilation unit ID. Allows -cuid=X to override -fuse-cuid. Enabled this feature for both CUDA and HIP.

tra added inline comments.Jul 6 2020, 2:24 PM

clang/lib/CodeGen/CodeGenModule.cpp
6069 ↗	(On Diff #275685)	I suspect that will have interesting issues if CUID is an arbitrary user-supplied string. We may want to impose some sort of sanity check or filtering on the cuid value. Considering that it's a CC1 flag, it's not a critical problem, but some safeguards would be useful there, too. Should we limit allowed character set?
clang/test/Driver/hip-cuid.hip
35 ↗	(On Diff #275685)	Nit: `abcd` could potentially match the value generated by hash. I'd change it to contain characters other than hex.

JonChesterfield added inline comments.Jul 6 2020, 2:39 PM

clang/lib/AST/ASTContext.cpp
10068	This doesn't sound right. An inline function can return a pointer to a function scope static variable, e.g. to implement a singleton in a header file. I think host code can then access said variable.

rjmccall added inline comments.Jul 9 2020, 2:05 PM

clang/lib/AST/ASTContext.cpp
10068	Right, and IIRC you can declare host device functions as well, which ought to agree on the variable if they agree on globals.
clang/lib/Driver/Action.cpp
170 ↗	(On Diff #267427)	Thanks, that works.

yaxunl marked 9 inline comments as done.Jul 11 2020, 7:55 PM

yaxunl added inline comments.

clang/lib/AST/ASTContext.cpp
10068	As long as we are not accessing the static variable by symbol we do not need externalize it. If a device function returns a pointer to its static variable and somehow passes that pointer to host code, the host code can use it directly by hipMemCpy.
10068	If we have a static variable in a device function, it is only visible in the function and not visible by any host code. We only need externalize it if it needs to be accessed `by symbol` in the host code, however, that is impossible, therefore we do not need externalize it. For static variables in a host device function, the static variables should be different instances on host side and device side. The rationale is that a static variable is per function, whereas a host device function is actually two functions: a host instance and a device instance, which could be totally different by using conditional macros. If it is requested that the static variable in a host device function is one instance, it requires special handling in runtime so that the same variable can be accessed on both device side and host side by common load/store instructions, but that is not the case. Therefore the device side instance of a static variable in a host device function is still only visible to device codes, not visible to host codes. Since it cannot be accessed `by symbol` by host code, it does not needs to be externalized.
clang/lib/CodeGen/CodeGenModule.cpp
6069 ↗	(On Diff #275685)	will only allow alphanumeric and underscore in CUID for simplicity.
clang/test/Driver/hip-cuid.hip
35 ↗	(On Diff #275685)	done

Herald added a subscriber: dang. · View Herald TranscriptJul 11 2020, 7:55 PM

Only allow cuid to be alphanumeric and underscore.

ping

I don't that's proper way to support file-scope static device variables. As we discuss APIs like cudaMemcpyToSymol, that's a runtime API instead of driver API. The later needs to specify the module (or code object) id in addition to a symbol name and won't have the conflict issues. For the runtime API, all named device variables (static or not) are identified at the host side by their host stub variables. That stub variable is used to register the corresponding device variables associated with a module id to unique identify that variables across all TUs. As long as we look up device variables using their host stub variable pointers, they are uniquely identified already. The runtime implementation needs to find the module id and the variable symbol from the pointer of its host stub variable. It's not the compiler job to fabricate name uniquely across TUs.

This revision now requires changes to proceed.Jul 23 2020, 7:09 AM

In D80858#2169295, @hliao wrote:

I don't that's proper way to support file-scope static device variables. As we discuss APIs like cudaMemcpyToSymol, that's a runtime API instead of driver API. The later needs to specify the module (or code object) id in addition to a symbol name and won't have the conflict issues. For the runtime API, all named device variables (static or not) are identified at the host side by their host stub variables. That stub variable is used to register the corresponding device variables associated with a module id to unique identify that variables across all TUs. As long as we look up device variables using their host stub variable pointers, they are uniquely identified already. The runtime implementation needs to find the module id and the variable symbol from the pointer of its host stub variable. It's not the compiler job to fabricate name uniquely across TUs.

The problem is that even though the static variable is registered through __hipRigisterVariable, the runtime still relies on looking up symbol name to get the address of the device variable. That's why we need to externalize the static variable.

Another reason is that we need to support it in rdc mode, where different TU can have static var with the same name.

In D80858#2169399, @yaxunl wrote:

In D80858#2169295, @hliao wrote:

I don't that's proper way to support file-scope static device variables. As we discuss APIs like cudaMemcpyToSymol, that's a runtime API instead of driver API. The later needs to specify the module (or code object) id in addition to a symbol name and won't have the conflict issues. For the runtime API, all named device variables (static or not) are identified at the host side by their host stub variables. That stub variable is used to register the corresponding device variables associated with a module id to unique identify that variables across all TUs. As long as we look up device variables using their host stub variable pointers, they are uniquely identified already. The runtime implementation needs to find the module id and the variable symbol from the pointer of its host stub variable. It's not the compiler job to fabricate name uniquely across TUs.

The problem is that even though the static variable is registered through __hipRigisterVariable, the runtime still relies on looking up symbol name to get the address of the device variable. That's why we need to externalize the static variable.

If so, the runtime should be fixed as the variable name. I remembered I fixed the global one so that each one is uniquely identified by module id plus the name. For runtime APIs, all host-side references to device variables should look through the host stub variables instead of its name. If runtime or API doesn't follow that, we should fix them instead of asking the compiler to do the favor.

In D80858#2169534, @yaxunl wrote:

Another reason is that we need to support it in rdc mode, where different TU can have static var with the same name.

That's an issue of our current RDC support through LLVM IR instead of native code. The name conflicts are introduced as we link all TUs into a single module at IR level. The frontend should not be changed to support that.

In D80858#2170311, @hliao wrote:

The problem is that even though the static variable is registered through __hipRigisterVariable, the runtime still relies on looking up symbol name to get the address of the device variable. That's why we need to externalize the static variable.

If so, the runtime should be fixed as the variable name. I remembered I fixed the global one so that each one is uniquely identified by module id plus the name. For runtime APIs, all host-side references to device variables should look through the host stub variables instead of its name. If runtime or API doesn't follow that, we should fix them instead of asking the compiler to do the favor.

For runtime APIs, we do reference device variables through host stub variable instead of its name. However, how does runtime implements that internally?

In host compilation, clang emits call of __hipRegisterVariable(shadow_var, device_var_name) in init functions.

runtime builds a map of each shadow var to a device var name. then runtime looks up device var name in code object to get the real address of a device var.

Note: __hipRegisterVariable does not really associate a shadow var with the address of a device var, since in host compilation there is no way to know the address of a device var. It only associates a shadow var with the name of a device var.

So eventually, runtime still needs to look up the device var symbol in code objects. Since ROCm runtime does not look up local symbols, it cannot find the static device var in code objects, unless we externalize it.

In D80858#2170328, @hliao wrote:

In D80858#2169534, @yaxunl wrote:

Another reason is that we need to support it in rdc mode, where different TU can have static var with the same name.

That's an issue of our current RDC support through LLVM IR instead of native code. The name conflicts are introduced as we link all TUs into a single module at IR level. The frontend should not be changed to support that.

I am not sure if ISA level linking would help for this. My understanding is that the linker will rename the internal symbols having the same name from different objects. It does not matter if it is llvm-link or native linker. Once they are renamed we can no longer find them by name.

I don't see a reason why we cannot support accessing static device variable in FE. We have requests for this feature from different users.

Would it work if we generate a globally unique visible aliases for the static vars and use the alias' name to register device-side entities without changing their visibility?

In D80858#2170533, @yaxunl wrote:

In D80858#2170328, @hliao wrote:

In D80858#2169534, @yaxunl wrote:

Another reason is that we need to support it in rdc mode, where different TU can have static var with the same name.

That's an issue of our current RDC support through LLVM IR instead of native code. The name conflicts are introduced as we link all TUs into a single module at IR level. The frontend should not be changed to support that.

I am not sure if ISA level linking would help for this. My understanding is that the linker will rename the internal symbols having the same name from different objects. It does not matter if it is llvm-link or native linker. Once they are renamed we can no longer find them by name.

I don't see a reason why we cannot support accessing static device variable in FE. We have requests for this feature from different users.

the register API also has a module ID parameter. With that, the runtime should be able to establish the map from the pointer of that host stub variable to its device module and the name within that module. For non-RDC case, that's already enough for the runtime to find the correct device variable.

But, as static internal linkage is defined, it makes that identifier only visible to a single TU and not visible across TU. To be host, I think we still need to define that behaves in the context of CUDA or HIP as it obviously breaks the definition by making that static visible between host and device TUs in the restrict definition.

In D80858#2170547, @tra wrote:

Would it work if we generate a globally unique visible aliases for the static vars and use the alias' name to register device-side entities without changing their visibility?

We still need to define how a static device variable should be visible on the host side. How that behaves in the context of relocatable code generation as well as anonymous namespaces. Also, in the context of CUDA runtime/driver API, if a device variable is addressable on the host side through the runtime API, it should be addressable through the driver API as well. However, the naming will be a big challenge.

In D80858#2170604, @hliao wrote:

In D80858#2170547, @tra wrote:

Would it work if we generate a globally unique visible aliases for the static vars and use the alias' name to register device-side entities without changing their visibility?

We still need to define how a static device variable should be visible on the host side.

As far as the host is concerned, if a 'static' variable is within the same TU it (or, rather, its shadow) should be accessible.

How that behaves in the context of relocatable code generation as well as anonymous namespaces.

AFAICT it would just work -- each TU only sees static things within that TU on both host and device sides. Visible module-unique aliases would allow host-side binary to get address of an otherwise non-visible variables. I expect that will work with RDC -- because we didn't change the visibility of the symbols themselves, they behave according to the regular linking rules. The aliases are also globally unique, so they should present no issues either.

Also, in the context of CUDA runtime/driver API, if a device variable is addressable on the host side through the runtime API, it should be addressable through the driver API as well. However, the naming will be a big challenge.

I'm not convinced that I completely agree with this assertion. It's reasonable for visible symbols (though even then there's a question of how do you figure out a properly mangled name for that symbol), but expecting it to work for non-visible symbols would push it too far, IMO.
It will also have issues with RDC as the name resulution will become ambiguous. Compiler does have necessary info to resolve that ambiguity, runtime support library, too, but the driver does not. We'd have to provide the driver with the additional info to let it map host-side symbol info to its device counterpart. It's doable, but I don't think it's particularly useful in practice. If something needs to be accessed via the driver API, it would be reasonable to expect it to be a public symbol.

In D80858#2170668, @tra wrote:

In D80858#2170604, @hliao wrote:

In D80858#2170547, @tra wrote:

Would it work if we generate a globally unique visible aliases for the static vars and use the alias' name to register device-side entities without changing their visibility?

We still need to define how a static device variable should be visible on the host side.

As far as the host is concerned, if a 'static' variable is within the same TU it (or, rather, its shadow) should be accessible.

How that behaves in the context of relocatable code generation as well as anonymous namespaces.

AFAICT it would just work -- each TU only sees static things within that TU on both host and device sides. Visible module-unique aliases would allow host-side binary to get address of an otherwise non-visible variables. I expect that will work with RDC -- because we didn't change the visibility of the symbols themselves, they behave according to the regular linking rules. The aliases are also globally unique, so they should present no issues either.

The problem is not whether we have solution to tell them but when we need to add that. Not all static device variables need to be visible to the host side. Externalizing them adds the overhead for the linker and may pose additional restrictions on aggressive optimizations. Do we have to support every ambiguous usage in the burden of the compiler change?

Also, in the context of CUDA runtime/driver API, if a device variable is addressable on the host side through the runtime API, it should be addressable through the driver API as well. However, the naming will be a big challenge.

I'm not convinced that I completely agree with this assertion. It's reasonable for visible symbols (though even then there's a question of how do you figure out a properly mangled name for that symbol), but expecting it to work for non-visible symbols would push it too far, IMO.

Won't this patch just makes invisible variables "visible"?

It will also have issues with RDC as the name resulution will become ambiguous. Compiler does have necessary info to resolve that ambiguity, runtime support library, too, but the driver does not. We'd have to provide the driver with the additional info to let it map host-side symbol info to its device counterpart. It's doable, but I don't think it's particularly useful in practice. If something needs to be accessed via the driver API, it would be reasonable to expect it to be a public symbol.

In D80858#2170781, @hliao wrote:

The problem is not whether we have solution to tell them but when we need to add that. Not all static device variables need to be visible to the host side. Externalizing them adds the overhead for the linker and may pose additional restrictions on aggressive optimizations. Do we have to support every ambiguous usage in the burden of the compiler change?

It's a multi-part problem.
The key request here is, IIUIC, to make TU-local variables visible within the TU on both host and the device.
If and how it should be implemented is up for the discusstion.
For me the request is reasonable and it would be good to have it as it would make the code behave according to general expectations of how TU-local variables behave -- "if it's in the same source file, it should be accessible from the functions in that file".

I'm mostly discussing the 'how' part and we obviously don't have all the answers yet.

I agree, that adding the 'handles' for all non-visible variables is suboptimal and it may prove to be too big of a burden. That said, it does not look like an outright showstopper either. Ideally we only need those handles for the items that are being referred to by the host. Most likely most of them will not be. The problem is that it's the host-side compilation which knows which symbols will be needed, but it's the device-side compilation where we would have to generate aliases and, generally speaking, device-side compilation may not be able to ever tell which symbols will be needed by the host. I don't have a good solution for that. Ideally we need to have exact same AST on both sides and we can't guarantee that with the current implementation of HD functions and the use of preprocessor macros.

Also, in the context of CUDA runtime/driver API, if a device variable is addressable on the host side through the runtime API, it should be addressable through the driver API as well. However, the naming will be a big challenge.

I'm not convinced that I completely agree with this assertion. It's reasonable for visible symbols (though even then there's a question of how do you figure out a properly mangled name for that symbol), but expecting it to work for non-visible symbols would push it too far, IMO.

Won't this patch just makes invisible variables "visible"?

Yes, but there will be no name conflicts, so for all practical purposes they don't change the end result of linking.

In D80858#2170821, @tra wrote:

In D80858#2170781, @hliao wrote:

The problem is not whether we have solution to tell them but when we need to add that. Not all static device variables need to be visible to the host side. Externalizing them adds the overhead for the linker and may pose additional restrictions on aggressive optimizations. Do we have to support every ambiguous usage in the burden of the compiler change?

It's a multi-part problem.
The key request here is, IIUIC, to make TU-local variables visible within the TU on both host and the device.
If and how it should be implemented is up for the discusstion.
For me the request is reasonable and it would be good to have it as it would make the code behave according to general expectations of how TU-local variables behave -- "if it's in the same source file, it should be accessible from the functions in that file".

We may try to solve the issue without RDC firstly, where we don't need to change that static variable name (if the runtime maintains the device binaries correctly.) We only need to ensure the linker won't remove their symbols.

I'm mostly discussing the 'how' part and we obviously don't have all the answers yet.

I agree, that adding the 'handles' for all non-visible variables is suboptimal and it may prove to be too big of a burden. That said, it does not look like an outright showstopper either. Ideally we only need those handles for the items that are being referred to by the host. Most likely most of them will not be. The problem is that it's the host-side compilation which knows which symbols will be needed, but it's the device-side compilation where we would have to generate aliases and, generally speaking, device-side compilation may not be able to ever tell which symbols will be needed by the host. I don't have a good solution for that. Ideally we need to have exact same AST on both sides and we can't guarantee that with the current implementation of HD functions and the use of preprocessor macros.

Also, in the context of CUDA runtime/driver API, if a device variable is addressable on the host side through the runtime API, it should be addressable through the driver API as well. However, the naming will be a big challenge.

I'm not convinced that I completely agree with this assertion. It's reasonable for visible symbols (though even then there's a question of how do you figure out a properly mangled name for that symbol), but expecting it to work for non-visible symbols would push it too far, IMO.

Won't this patch just makes invisible variables "visible"?

Yes, but there will be no name conflicts, so for all practical purposes they don't change the end result of linking.

In D80858#2171266, @hliao wrote:

We may try to solve the issue without RDC firstly, where we don't need to change that static variable name (if the runtime maintains the device binaries correctly.) We only need to ensure the linker won't remove their symbols.

This is essentially externalizing it in linker. However, it may not work.

Consider a static device var was accessed in host code through hipMemCpyToSymbol or hipMemCpyFromSymbol. In device code, this static var may be initialized through host code, or its final value may be read by host code after kernel execution. The existence of these operations mean that this static device variable is actually having external linkage, instead of internal linkage, since it is accessed by external modules. Fail to reflect this truth in IR will cause optimization passes to make incorrect assumptions about this variable and perform incorrect optimizations on it. e.g. the llvm passes can assume the value in this static var never changes, or its final value will not be used, then the llvm passes may simply remove it. Marking it as used will not solve the issue, since the llvm passes may still assume its value never changes after initialization, whereas in reality it may be changed by hipMemCpyToSymbol before kernel execution.

It's a good point. Perhaps this is one of the cases where we should *not* follow nvcc.
We can't have our cake (preserve static behavior) and eat it (treat it as non-static in case something on the host side may decide to use an API which uses symbol names). Something's got to give. While we could make it work in some cases, I don't think we can make it work consistently.
I think it would be reasonable to restrict APIs that access symbols by name to be applicable to visible symbols only.

In D80858#2177159, @tra wrote:

It's a good point. Perhaps this is one of the cases where we should *not* follow nvcc.
We can't have our cake (preserve static behavior) and eat it (treat it as non-static in case something on the host side may decide to use an API which uses symbol names). Something's got to give. While we could make it work in some cases, I don't think we can make it work consistently.
I think it would be reasonable to restrict APIs that access symbols by name to be applicable to visible symbols only.

Agree, we only need to support accessing static device var by shadow var (runtime API), which is sufficient for most apps.

In D80858#2177104, @yaxunl wrote:

In D80858#2171266, @hliao wrote:

We may try to solve the issue without RDC firstly, where we don't need to change that static variable name (if the runtime maintains the device binaries correctly.) We only need to ensure the linker won't remove their symbols.

This is essentially externalizing it in linker. However, it may not work.

Consider a static device var was accessed in host code through hipMemCpyToSymbol or hipMemCpyFromSymbol. In device code, this static var may be initialized through host code, or its final value may be read by host code after kernel execution. The existence of these operations mean that this static device variable is actually having external linkage, instead of internal linkage, since it is accessed by external modules. Fail to reflect this truth in IR will cause optimization passes to make incorrect assumptions about this variable and perform incorrect optimizations on it. e.g. the llvm passes can assume the value in this static var never changes, or its final value will not be used, then the llvm passes may simply remove it. Marking it as used will not solve the issue, since the llvm passes may still assume its value never changes after initialization, whereas in reality it may be changed by hipMemCpyToSymbol before kernel execution.

That's what I mean by "externalizing" that variable. In fact, nvcc doesn't the same thing and generate global symbol in PTX.

To get file-scope static device variables support, we have to 2 separate issues:

externalize variable to ensure backend/linker preserve their symbols for runtime to query. The major concern is that we'd better check whether a static device variable is referenced using host APIs. Otherwise, aggressive IPO optimizations may not be able to be applied. As those are file-scope static variables, roughly checking their reference on the host side should be doable.
rename device static variables. Such renaming is only necessary when RDC is enabled. Without RDC, there would be conflicting names and renaming is unnecessary. For renaming, relying on developer supplied CUID is not reliable and we should extend name mangler to include source name, digest, or whatever automatically managed by the compiler. If we support static variable access on the host side, we should not rely on developer supplied option to function correctly. Also, the mangled name should be demangled into a meaningful name if possible. I remembered GCC has special support for names from anonymous namespaces if they have conflicting names. We may follow the similar scheme.

In D80858#2177159, @tra wrote:

It's a good point. Perhaps this is one of the cases where we should *not* follow nvcc.
We can't have our cake (preserve static behavior) and eat it (treat it as non-static in case something on the host side may decide to use an API which uses symbol names). Something's got to give. While we could make it work in some cases, I don't think we can make it work consistently.
I think it would be reasonable to restrict APIs that access symbols by name to be applicable to visible symbols only.

In fact, 'nvcc' does that by globalizing that static device variable.

jdoerfert added a subscriber: jdoerfert.Jul 28 2020, 6:58 AM

I think Sam's approach is reasonable.

The ability to supply CUID to sub-compilations is useful by itself and should probably be split into a separate patch as it's largely independent of the externalization of file-scope static vars.

revised for -fno-gpu-rdc case by Michael's comments.

yaxunl mentioned this in D85223: [CUDA][HIP] Support accessing static device variable in host code for -fgpu-rdc.Aug 4 2020, 9:56 AM

yaxunl added a child revision: D85223: [CUDA][HIP] Support accessing static device variable in host code for -fgpu-rdc.

What is expected to happen to device statics in anonymous name spaces? It may be worth adding them to the tests.

LGTM otherwise.

In D80858#2193970, @tra wrote:

What is expected to happen to device statics in anonymous name spaces? It may be worth adding them to the tests.

LGTM otherwise.

static device var in anonymous name space will have external linkage if it is referenced by host code. will add a test when committing.

This revision was not accepted when it landed; it landed in state Needs Review.Aug 5 2020, 4:58 AM

Closed by commit rG45f2a56856e2: [CUDA][HIP] Support accessing static device variable in host code for -fno-gpu… (authored by yaxunl). · Explain Why

This revision was automatically updated to reflect the committed changes.

yaxunl added a commit: rG45f2a56856e2: [CUDA][HIP] Support accessing static device variable in host code for -fno-gpu….

Herald added a project: Restricted Project. · View Herald TranscriptAug 5 2020, 4:58 AM

Sam, just a FYI that the patch has a couple of unintended consequences.

We now end up with various things instantiated as device-side constant objects when they were not before, when we compile with -std=c++17 (especially with libc++):
https://godbolt.org/z/KbTM9M

That in turn sometimes pulls in other thins that may not exist on the device. E.g. in one case we've ended up with a PTX error caused by unresolved reference to strlen() (via some char_traits functions) .
The other potential issue is that we increase use of __constant__ and there's only 64K of it, so additional use pushed total use over the limit in few cases.

So far all failures can be attributed to questionable user code, but when we may need to figure out how to avoid emitting unused __constant__ data.

In D80858#2207699, @tra wrote:

Sam, just a FYI that the patch has a couple of unintended consequences.

We now end up with various things instantiated as device-side constant objects when they were not before, when we compile with -std=c++17 (especially with libc++):
https://godbolt.org/z/KbTM9M

That in turn sometimes pulls in other thins that may not exist on the device. E.g. in one case we've ended up with a PTX error caused by unresolved reference to strlen() (via some char_traits functions) .
The other potential issue is that we increase use of __constant__ and there's only 64K of it, so additional use pushed total use over the limit in few cases.

So far all failures can be attributed to questionable user code, but when we may need to figure out how to avoid emitting unused __constant__ data.

This is because we have some implicit constant device variables, e.g.

static _LIBCPP_CONSTEXPR const int __libcpp_polling_count = 64;

then they are referenced by some host functions, e.g.

template<class _Fn, class _BFn>
_LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY
bool __libcpp_thread_poll_with_backoff(
  _Fn && __f, _BFn && __bf, chrono::nanoseconds __max_elapsed = chrono::nanoseconds::zero())
{
    auto const __start = chrono::high_resolution_clock::now();
    for(int __count = 0;;) {
      if(__f())
        return true; // _Fn completion means success
      if(__count < __libcpp_polling_count) {
        __count += 1;
        continue;
      }

when __libcpp_polling_count has internal linkage, it can be eliminated by optimization pass. When it is externalized, it can no longer be eliminated.

We can restrict externalization to constant variables with explicit 'constant' attributes only, which should fix this issue.

We can restrict externalization to constant variables with explicit 'constant' attributes only, which should fix this issue.

SGTM. If it does not have explicit device-side attribute, it's never going to need to cross host/device boundary. I guess this applies to vars with __device__ attribute, too.

In D80858#2207898, @tra wrote:

We can restrict externalization to constant variables with explicit 'constant' attributes only, which should fix this issue.

SGTM. If it does not have explicit device-side attribute, it's never going to need to cross host/device boundary. I guess this applies to vars with __device__ attribute, too.

I think we only add implicit __constant__ attributes for variables, not __device__ attributes.

The fix is here https://reviews.llvm.org/D85686

The fix is here https://reviews.llvm.org/D85686

Thank you.

Revision Contents

Path

Size

clang/

include/

clang/

AST/

ASTContext.h

7 lines

lib/

AST/

ASTContext.cpp

19 lines

Sema/

SemaExpr.cpp

19 lines

test/

CodeGenCUDA/

constexpr-variables.cu

6 lines

static-device-var-no-rdc.cu

94 lines

Diff 283202

clang/include/clang/AST/ASTContext.h

Show All 37 Lines
#include "clang/Basic/PartialDiagnostic.h"		#include "clang/Basic/PartialDiagnostic.h"
#include "clang/Basic/SanitizerBlacklist.h"		#include "clang/Basic/SanitizerBlacklist.h"
#include "clang/Basic/SourceLocation.h"		#include "clang/Basic/SourceLocation.h"
#include "clang/Basic/Specifiers.h"		#include "clang/Basic/Specifiers.h"
#include "clang/Basic/XRayLists.h"		#include "clang/Basic/XRayLists.h"
#include "llvm/ADT/APSInt.h"		#include "llvm/ADT/APSInt.h"
#include "llvm/ADT/ArrayRef.h"		#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
		#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/FoldingSet.h"		#include "llvm/ADT/FoldingSet.h"
#include "llvm/ADT/IntrusiveRefCntPtr.h"		#include "llvm/ADT/IntrusiveRefCntPtr.h"
#include "llvm/ADT/MapVector.h"		#include "llvm/ADT/MapVector.h"
#include "llvm/ADT/None.h"		#include "llvm/ADT/None.h"
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"
#include "llvm/ADT/PointerIntPair.h"		#include "llvm/ADT/PointerIntPair.h"
#include "llvm/ADT/PointerUnion.h"		#include "llvm/ADT/PointerUnion.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
▲ Show 20 Lines • Show All 940 Lines • ▼ Show 20 Lines	#include "clang/Basic/AArch64SVEACLETypes.def"

// Decl used to help define __builtin_va_list for some targets.		// Decl used to help define __builtin_va_list for some targets.
// The decl is built when constructing 'BuiltinVaListDecl'.		// The decl is built when constructing 'BuiltinVaListDecl'.
mutable Decl *VaListTagDecl = nullptr;		mutable Decl *VaListTagDecl = nullptr;

// Implicitly-declared type 'struct _GUID'.		// Implicitly-declared type 'struct _GUID'.
mutable TagDecl *MSGuidTagDecl = nullptr;		mutable TagDecl *MSGuidTagDecl = nullptr;

		/// Keep track of CUDA/HIP static device variables referenced by host code.
		llvm::DenseSet<const VarDecl *> CUDAStaticDeviceVarReferencedByHost;

ASTContext(LangOptions &LOpts, SourceManager &SM, IdentifierTable &idents,		ASTContext(LangOptions &LOpts, SourceManager &SM, IdentifierTable &idents,
SelectorTable &sels, Builtin::Context &builtins);		SelectorTable &sels, Builtin::Context &builtins);
ASTContext(const ASTContext &) = delete;		ASTContext(const ASTContext &) = delete;
ASTContext &operator=(const ASTContext &) = delete;		ASTContext &operator=(const ASTContext &) = delete;
~ASTContext();		~ASTContext();

/// Attach an external AST source to the AST context.		/// Attach an external AST source to the AST context.
///		///
▲ Show 20 Lines • Show All 2,015 Lines • ▼ Show 20 Lines	SectionInfo(DeclaratorDecl *Decl,
SectionFlags(SectionFlags) {}		SectionFlags(SectionFlags) {}
};		};

llvm::StringMap<SectionInfo> SectionInfos;		llvm::StringMap<SectionInfo> SectionInfos;

/// Return a new OMPTraitInfo object owned by this context.		/// Return a new OMPTraitInfo object owned by this context.
OMPTraitInfo &getNewOMPTraitInfo();		OMPTraitInfo &getNewOMPTraitInfo();

		/// Whether a C++ static variable should be externalized.
		bool shouldExternalizeStaticVar(const Decl *D) const;

private:		private:
/// All OMPTraitInfo objects live in this collection, one per		/// All OMPTraitInfo objects live in this collection, one per
/// `pragma omp [begin] declare variant` directive.		/// `pragma omp [begin] declare variant` directive.
SmallVector<std::unique_ptr<OMPTraitInfo>, 4> OMPTraitInfoVector;		SmallVector<std::unique_ptr<OMPTraitInfo>, 4> OMPTraitInfoVector;
};		};

/// Insertion operator for diagnostics.		/// Insertion operator for diagnostics.
const DiagnosticBuilder &operator<<(const DiagnosticBuilder &DB,		const DiagnosticBuilder &operator<<(const DiagnosticBuilder &DB,
▲ Show 20 Lines • Show All 114 Lines • Show Last 20 Lines

clang/lib/AST/ASTContext.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 10,055 Lines • ▼ Show 20 Lines	case 'A':
// we want this argument to be a char*&; for x86-64, we want		// we want this argument to be a char*&; for x86-64, we want
// it to be a __va_list_tag*.		// it to be a __va_list_tag*.
Type = Context.getBuiltinVaListType();		Type = Context.getBuiltinVaListType();
assert(!Type.isNull() && "builtin va list type not initialized!");		assert(!Type.isNull() && "builtin va list type not initialized!");
if (Type->isArrayType())		if (Type->isArrayType())
Type = Context.getArrayDecayedType(Type);		Type = Context.getArrayDecayedType(Type);
else		else
Type = Context.getLValueReferenceType(Type);		Type = Context.getLValueReferenceType(Type);
break;		break;
		rjmccallUnsubmitted Done Reply Inline Actions This needs to be a clearer statement of why this is necessary. rjmccall: This needs to be a clearer statement of why this is necessary.
		yaxunlAuthorUnsubmitted Done Reply Inline Actions added comments to explain why yaxunl: added comments to explain why
case 'q': {		case 'q': {
char *End;		char *End;
unsigned NumElements = strtoul(Str, &End, 10);		unsigned NumElements = strtoul(Str, &End, 10);
assert(End != Str && "Missing vector size");		assert(End != Str && "Missing vector size");
		rjmccallUnsubmitted Done Reply Inline Actions Are you sure this doesn't apply to e.g. local statics? Can't you have kernel lambdas, or am I confusing HIP with another language? rjmccall: Are you sure this doesn't apply to e.g. local statics? Can't you have kernel lambdas, or am I…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions function-scope static var in a device function is only visible to the device function. Host code cannot access it, therefore no need to externalize it. yaxunl: function-scope static var in a device function is only visible to the device function. Host…
		JonChesterfieldUnsubmitted Done Reply Inline Actions This doesn't sound right. An inline function can return a pointer to a function scope static variable, e.g. to implement a singleton in a header file. I think host code can then access said variable. JonChesterfield: This doesn't sound right. An inline function can return a pointer to a function scope static…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions As long as we are not accessing the static variable by symbol we do not need externalize it. If a device function returns a pointer to its static variable and somehow passes that pointer to host code, the host code can use it directly by hipMemCpy. yaxunl: As long as we are not accessing the static variable by symbol we do not need externalize it.
		rjmccallUnsubmitted Done Reply Inline Actions Right, and IIRC you can declare host device functions as well, which ought to agree on the variable if they agree on globals. rjmccall: Right, and IIRC you can declare __host__ __device__ functions as well, which ought to agree on…
		yaxunlAuthorUnsubmitted Done Reply Inline Actions If we have a static variable in a device function, it is only visible in the function and not visible by any host code. We only need externalize it if it needs to be accessed `by symbol` in the host code, however, that is impossible, therefore we do not need externalize it. For static variables in a host device function, the static variables should be different instances on host side and device side. The rationale is that a static variable is per function, whereas a host device function is actually two functions: a host instance and a device instance, which could be totally different by using conditional macros. If it is requested that the static variable in a host device function is one instance, it requires special handling in runtime so that the same variable can be accessed on both device side and host side by common load/store instructions, but that is not the case. Therefore the device side instance of a static variable in a host device function is still only visible to device codes, not visible to host codes. Since it cannot be accessed `by symbol` by host code, it does not needs to be externalized. yaxunl: If we have a static variable in a device function, it is only visible in the function and not…
Str = End;		Str = End;

QualType ElementType = DecodeTypeFromStr(Str, Context, Error,		QualType ElementType = DecodeTypeFromStr(Str, Context, Error,
RequiresICE, false);		RequiresICE, false);
assert(!RequiresICE && "Can't require vector ICE");		assert(!RequiresICE && "Can't require vector ICE");

Type = Context.getScalableVectorType(ElementType, NumElements);		Type = Context.getScalableVectorType(ElementType, NumElements);
break;		break;
▲ Show 20 Lines • Show All 243 Lines • ▼ Show 20 Lines	static GVALinkage adjustGVALinkageForAttributes(const ASTContext &Context,
// See http://msdn.microsoft.com/en-us/library/xa0d9ste.aspx		// See http://msdn.microsoft.com/en-us/library/xa0d9ste.aspx
// dllexport/dllimport on inline functions.		// dllexport/dllimport on inline functions.
if (D->hasAttr<DLLImportAttr>()) {		if (D->hasAttr<DLLImportAttr>()) {
if (L == GVA_DiscardableODR \|\| L == GVA_StrongODR)		if (L == GVA_DiscardableODR \|\| L == GVA_StrongODR)
return GVA_AvailableExternally;		return GVA_AvailableExternally;
} else if (D->hasAttr<DLLExportAttr>()) {		} else if (D->hasAttr<DLLExportAttr>()) {
if (L == GVA_DiscardableODR)		if (L == GVA_DiscardableODR)
return GVA_StrongODR;		return GVA_StrongODR;
} else if (Context.getLangOpts().CUDA && Context.getLangOpts().CUDAIsDevice &&		} else if (Context.getLangOpts().CUDA && Context.getLangOpts().CUDAIsDevice) {
D->hasAttr<CUDAGlobalAttr>()) {
// Device-side functions with __global__ attribute must always be		// Device-side functions with __global__ attribute must always be
// visible externally so they can be launched from host.		// visible externally so they can be launched from host.
if (L == GVA_DiscardableODR \|\| L == GVA_Internal)		if (D->hasAttr<CUDAGlobalAttr>() &&
		(L == GVA_DiscardableODR \|\| L == GVA_Internal))
return GVA_StrongODR;		return GVA_StrongODR;
		// Single source offloading languages like CUDA/HIP need to be able to
		// access static device variables from host code of the same compilation
		// unit. This is done by externalizing the static variable.
		if (Context.shouldExternalizeStaticVar(D))
		return GVA_StrongExternal;
}		}
return L;		return L;
}		}

/// Adjust the GVALinkage for a declaration based on what an external AST source		/// Adjust the GVALinkage for a declaration based on what an external AST source
/// knows about whether there can be other definitions of this declaration.		/// knows about whether there can be other definitions of this declaration.
static GVALinkage		static GVALinkage
adjustGVALinkageForExternalDefinitionKind(const ASTContext &Ctx, const Decl *D,		adjustGVALinkageForExternalDefinitionKind(const ASTContext &Ctx, const Decl *D,
▲ Show 20 Lines • Show All 838 Lines • ▼ Show 20 Lines

const DiagnosticBuilder &		const DiagnosticBuilder &
clang::operator<<(const DiagnosticBuilder &DB,		clang::operator<<(const DiagnosticBuilder &DB,
const ASTContext::SectionInfo &Section) {		const ASTContext::SectionInfo &Section) {
if (Section.Decl)		if (Section.Decl)
return DB << Section.Decl;		return DB << Section.Decl;
return DB << "a prior #pragma section";		return DB << "a prior #pragma section";
}		}

		bool ASTContext::shouldExternalizeStaticVar(const Decl *D) const {
		return !getLangOpts().GPURelocatableDeviceCode &&
		(D->hasAttr<CUDADeviceAttr>() \|\| D->hasAttr<CUDAConstantAttr>()) &&
		isa<VarDecl>(D) && cast<VarDecl>(D)->isFileVarDecl() &&
		cast<VarDecl>(D)->getStorageClass() == SC_Static &&
		CUDAStaticDeviceVarReferencedByHost.count(cast<VarDecl>(D));
		}

clang/lib/Sema/SemaExpr.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 17,858 Lines • ▼ Show 20 Lines	static void DoMarkVarDeclReferenced(Sema &SemaRef, SourceLocation Loc,
assert((!E \|\| isa<DeclRefExpr>(E) \|\| isa<MemberExpr>(E) \|\|		assert((!E \|\| isa<DeclRefExpr>(E) \|\| isa<MemberExpr>(E) \|\|
isa<FunctionParmPackExpr>(E)) &&		isa<FunctionParmPackExpr>(E)) &&
"Invalid Expr argument to DoMarkVarDeclReferenced");		"Invalid Expr argument to DoMarkVarDeclReferenced");
Var->setReferenced();		Var->setReferenced();

if (Var->isInvalidDecl())		if (Var->isInvalidDecl())
return;		return;

		// Record a CUDA/HIP static device/constant variable if it is referenced
		// by host code. This is done conservatively, when the variable is referenced
		// in any of the following contexts:
		// - a non-function context
		// - a host function
		// - a host device function
		// This also requires the reference of the static device/constant variable by
		// host code to be visible in the device compilation for the compiler to be
		// able to externalize the static device/constant variable.
		if ((Var->hasAttr<CUDADeviceAttr>() \|\| Var->hasAttr<CUDAConstantAttr>()) &&
		Var->isFileVarDecl() && Var->getStorageClass() == SC_Static) {
		auto *CurContext = SemaRef.CurContext;
		if (!CurContext \|\| !isa<FunctionDecl>(CurContext) \|\|
		cast<FunctionDecl>(CurContext)->hasAttr<CUDAHostAttr>() \|\|
		(!cast<FunctionDecl>(CurContext)->hasAttr<CUDADeviceAttr>() &&
		!cast<FunctionDecl>(CurContext)->hasAttr<CUDAGlobalAttr>()))
		SemaRef.getASTContext().CUDAStaticDeviceVarReferencedByHost.insert(Var);
		}

auto *MSI = Var->getMemberSpecializationInfo();		auto *MSI = Var->getMemberSpecializationInfo();
TemplateSpecializationKind TSK = MSI ? MSI->getTemplateSpecializationKind()		TemplateSpecializationKind TSK = MSI ? MSI->getTemplateSpecializationKind()
: Var->getTemplateSpecializationKind();		: Var->getTemplateSpecializationKind();

OdrUseContext OdrUse = isOdrUseContext(SemaRef);		OdrUseContext OdrUse = isOdrUseContext(SemaRef);
bool UsableInConstantExpr =		bool UsableInConstantExpr =
Var->mightBeUsableInConstantExpressions(SemaRef.Context);		Var->mightBeUsableInConstantExpressions(SemaRef.Context);

▲ Show 20 Lines • Show All 1,319 Lines • Show Last 20 Lines

clang/test/CodeGenCUDA/constexpr-variables.cu

Show All 13 Lines	// COM: @_ZN1BL1bE = internal {{.*}}constant i32 9
constexpr int b = 9;		constexpr int b = 9;
}		}
__constant__ const int &use_B_b = B::b;		__constant__ const int &use_B_b = B::b;

struct Q {		struct Q {
// CXX14: @_ZN1Q2k2E = {{.*}}externally_initialized constant i32 6		// CXX14: @_ZN1Q2k2E = {{.*}}externally_initialized constant i32 6
// CXX17: @_ZN1Q2k2E = internal {{.*}}constant i32 6		// CXX17: @_ZN1Q2k2E = internal {{.*}}constant i32 6
// CXX14: @_ZN1Q2k1E = available_externally {{.*}}constant i32 5		// CXX14: @_ZN1Q2k1E = available_externally {{.*}}constant i32 5
// CXX17: @_ZN1Q2k1E = linkonce_odr {{.*}}constant i32 5		// CXX17: @_ZN1Q2k1E = {{.*}} externally_initialized constant i32 5
static constexpr int k1 = 5;		static constexpr int k1 = 5;
static constexpr int k2 = 6;		static constexpr int k2 = 6;
};		};
constexpr int Q::k2;		constexpr int Q::k2;

__constant__ const int &use_Q_k1 = Q::k1;		__constant__ const int &use_Q_k1 = Q::k1;
__constant__ const int &use_Q_k2 = Q::k2;		__constant__ const int &use_Q_k2 = Q::k2;

template<typename T> struct X {		template<typename T> struct X {
// CXX14: @_ZN1XIiE1aE = available_externally {{.*}}constant i32 123		// CXX14: @_ZN1XIiE1aE = available_externally {{.*}}constant i32 123
// CXX17: @_ZN1XIiE1aE = linkonce_odr {{.*}}constant i32 123		// CXX17: @_ZN1XIiE1aE = {{.*}}externally_initialized constant i32 123
static constexpr int a = 123;		static constexpr int a = 123;
};		};
__constant__ const int &use_X_a = X<int>::a;		__constant__ const int &use_X_a = X<int>::a;

template <typename T, T a, T b> struct A {		template <typename T, T a, T b> struct A {
// CXX14: @_ZN1AIiLi1ELi2EE1xE = available_externally {{.*}}constant i32 2		// CXX14: @_ZN1AIiLi1ELi2EE1xE = available_externally {{.*}}constant i32 2
// CXX17: @_ZN1AIiLi1ELi2EE1xE = linkonce_odr {{.*}}constant i32 2		// CXX17: @_ZN1AIiLi1ELi2EE1xE = {{.*}}externally_initialized constant i32 2
constexpr static T x = a * b;		constexpr static T x = a * b;
};		};
__constant__ const int &y = A<int, 1, 2>::x;		__constant__ const int &y = A<int, 1, 2>::x;

clang/test/CodeGenCUDA/static-device-var-no-rdc.cu

This file was added.

				// REQUIRES: x86-registered-target
				// REQUIRES: amdgpu-registered-target

				// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -fcuda-is-device \
				// RUN: -emit-llvm -o - -x hip %s \| FileCheck \
				// RUN: -check-prefixes=DEV %s

				// RUN: %clang_cc1 -triple x86_64-gnu-linux \
				// RUN: -emit-llvm -o - -x hip %s \| FileCheck \
				// RUN: -check-prefixes=HOST %s

				#include "Inputs/cuda.h"

				// Test function scope static device variable, which should not be externalized.
				// DEV-DAG: @_ZZ6kernelPiPPKiE1w = internal addrspace(4) constant i32 1

				// Check a static device variable referenced by host function is externalized.
				// DEV-DAG: @_ZL1x = addrspace(1) externally_initialized global i32 0
				// HOST-DAG: @_ZL1x = internal global i32 undef
				// HOST-DAG: @[[DEVNAMEX:[0-9]+]] = {{.*}}c"_ZL1x\00"

				static __device__ int x;

				// Check a static device variables referenced only by device functions and kernels
				// is not externalized.
				// DEV-DAG: @_ZL2x2 = internal addrspace(1) global i32 0
				static __device__ int x2;

				// Check a static device variable referenced by host device function is externalized.
				// DEV-DAG: @_ZL2x3 = addrspace(1) externally_initialized global i32 0
				static __device__ int x3;

				// Check a static device variable referenced in file scope is externalized.
				// DEV-DAG: @_ZL2x4 = addrspace(1) externally_initialized global i32 0
				static __device__ int x4;
				int& x4_ref = x4;

				// Check a static device variable in anonymous namespace.
				// DEV-DAG: @_ZN12_GLOBAL__N_12x5E = addrspace(1) externally_initialized global i32 0
				namespace {
				static __device__ int x5;
				}

				// Check a static constant variable referenced by host is externalized.
				// DEV-DAG: @_ZL1y = addrspace(4) externally_initialized global i32 0
				// HOST-DAG: @_ZL1y = internal global i32 undef
				// HOST-DAG: @[[DEVNAMEY:[0-9]+]] = {{.*}}c"_ZL1y\00"

				static __constant__ int y;

				// Test static host variable, which should not be externalized nor registered.
				// HOST-DAG: @_ZL1z = internal global i32 0
				// DEV-NOT: @_ZL1z
				static int z;

				// Test static device variable in inline function, which should not be
				// externalized nor registered.
				// DEV-DAG: @_ZZ6devfunPPKiE1p = linkonce_odr addrspace(4) constant i32 2, comdat

				inline __device__ void devfun(const int ** b) {
				const static int p = 2;
				b[0] = &p;
				b[1] = &x2;
				}

				__global__ void kernel(int a, const int *b) {
				const static int w = 1;
				a[0] = x;
				a[1] = y;
				a[2] = x2;
				a[3] = x3;
				a[4] = x4;
				a[5] = x5;
				b[0] = &w;
				devfun(b);
				}

				__host__ __device__ void hdf(int *a) {
				a[0] = x3;
				}

				int* getDeviceSymbol(int *x);

				void foo(int *a) {
				getDeviceSymbol(&x);
				getDeviceSymbol(&x5);
				getDeviceSymbol(&y);
				z = 123;
				}

				// HOST: __hipRegisterVar({{.}}@_ZL1x {{.}}@[[DEVNAMEX]]
				// HOST: __hipRegisterVar({{.}}@_ZL1y {{.}}@[[DEVNAMEY]]
				// HOST-NOT: __hipRegisterVar({{.*}}@_ZZ6kernelPiPPKiE1w
				// HOST-NOT: __hipRegisterVar({{.*}}@_ZZ6devfunPPKiE1p