This is an archive of the discontinued LLVM Phabricator instance.

[CUDA][HIP] Support accessing static device variable in host code for -fno-gpu-rdc
ClosedPublic

Authored by yaxunl on May 29 2020, 9:44 PM.

Details

Summary

nvcc supports accessing file-scope static device variables in host code by host APIs
like cudaMemcpyToSymbol etc.

CUDA/HIP let users access device variables in host code by shadow variables. In host compilation,
clang emits a shadow variable for each device variable, and calls __*RegisterVariable to
register it in init function. The address of the shadow variable and the device side mangled
name of the device variable is passed to __*RegisterVariable. Runtime looks up the symbol
by name in the device binary (so called code object, which is actually elf format) to find the
address of the device variable.

The problem with static device variables is that they have internal linkage, therefore their
name may be changed by the linker if there are multiple symbols with the same name. Also
they end up as local symbols in the elf file, whereas the runtime only look up the global symbols.

To support accessing static device var in host code for -fno-gpu-rdc mode, change the intnernal
linkage to external linkage. The name does not need change since there is only one TU for
-fno-gpu-rdc mode. Also the externalization is done only if the device static var is referenced
by host code.

Diff Detail

Event Timeline

yaxunl created this revision.May 29 2020, 9:44 PM

The value is based on llvm::sys::Process::GetRandomNumber(). So unless one provides a build-system-derived uuid for every compilation unit, recompiling identical source will yield an observably different binary.

The distinction between 'unique' and 'random' is significant for anyone depending on repeatable binary output, so this patch should probably rename 'unique' to 'random' everywhere.

tra added a comment.Jun 1 2020, 3:50 PM

The value is based on llvm::sys::Process::GetRandomNumber(). So unless one provides a build-system-derived uuid for every compilation unit, recompiling identical source will yield an observably different binary.

The distinction between 'unique' and 'random' is significant for anyone depending on repeatable binary output, so this patch should probably rename 'unique' to 'random' everywhere.

+1. Reproducible builds is something we do pay attention to. We've had similar issue with -fcuda-rcd compilation and I think ended up generating the ID by deriving it from the input file name. It works reasonably well in most situations except the cases when the same file is recompiled multiple times with different preprocessor macros set.

In addition to everyone else's concerns about generating this number randomly, it's also inherently less testable.

I'm sure there's something else you could use that's reasonably likely to be unique, like a hash of the input filename or output filename or a combination of the two.

clang/include/clang/Driver/Action.h
217

Allowing an arbitrary string might be more adaptable.

clang/lib/AST/ASTContext.cpp
10064

This needs to be a clearer statement of why this is necessary.

10068

Are you sure this doesn't apply to e.g. local statics? Can't you have kernel lambdas, or am I confusing HIP with another language?

clang/lib/CodeGen/CodeGenModule.cpp
1098

Please extract this "(device || constant) && file && static" predicate instead of repeating it in three different places.

clang/lib/Driver/Action.cpp
171

I'm sure GetRandomNumber can return 0, so this logic is faulty.

This also seems like an unfortunate intrusion of HIP-specific semantics on the rest of the driver. Maybe HIP can generate the shared id when it's setting up and cloning the job.

yaxunl marked 10 inline comments as done.Jul 6 2020, 5:50 AM
yaxunl added inline comments.
clang/include/clang/Driver/Action.h
217

done

clang/lib/AST/ASTContext.cpp
10064

added comments to explain why

10068

function-scope static var in a device function is only visible to the device function. Host code cannot access it, therefore no need to externalize it.

clang/lib/CodeGen/CodeGenModule.cpp
1098

extracted as ASTContext::shouldExternalizeStaticVar

clang/lib/Driver/Action.cpp
171

Changed type of ID to string. Now empty ID means disabled. 0 is now allowed as a usual ID.

Moved initialization of ID to HIP action builder.

yaxunl updated this revision to Diff 275685.Jul 6 2020, 6:02 AM
yaxunl marked 5 inline comments as done.
yaxunl retitled this revision from [HIP] Support accessing static device variable in host code to [CUDA][HIP] Support accessing static device variable in host code.
yaxunl edited the summary of this revision. (Show Details)

revised by John's comments.

Also added generating compilation unit ID by hashing input file path and options. added option -fuse-cuid=random|hash|none for controlling the method to generate compilation unit ID. Allows -cuid=X to override -fuse-cuid. Enabled this feature for both CUDA and HIP.

tra added inline comments.Jul 6 2020, 2:24 PM
clang/lib/CodeGen/CodeGenModule.cpp
6069

I suspect that will have interesting issues if CUID is an arbitrary user-supplied string. We may want to impose some sort of sanity check or filtering on the cuid value. Considering that it's a CC1 flag, it's not a critical problem, but some safeguards would be useful there, too. Should we limit allowed character set?

clang/test/Driver/hip-cuid.hip
35

Nit: abcd could potentially match the value generated by hash. I'd change it to contain characters other than hex.

JonChesterfield added inline comments.Jul 6 2020, 2:39 PM
clang/lib/AST/ASTContext.cpp
10068

This doesn't sound right. An inline function can return a pointer to a function scope static variable, e.g. to implement a singleton in a header file. I think host code can then access said variable.

rjmccall added inline comments.Jul 9 2020, 2:05 PM
clang/lib/AST/ASTContext.cpp
10068

Right, and IIRC you can declare host device functions as well, which ought to agree on the variable if they agree on globals.

clang/lib/Driver/Action.cpp
171

Thanks, that works.

yaxunl marked 9 inline comments as done.Jul 11 2020, 7:55 PM
yaxunl added inline comments.
clang/lib/AST/ASTContext.cpp
10068

As long as we are not accessing the static variable by symbol we do not need externalize it.

If a device function returns a pointer to its static variable and somehow passes that pointer to host code, the host code can use it directly by hipMemCpy.

10068

If we have a static variable in a device function, it is only visible in the function and not visible by any host code. We only need externalize it if it needs to be accessed by symbol in the host code, however, that is impossible, therefore we do not need externalize it.

For static variables in a host device function, the static variables should be different instances on host side and device side. The rationale is that a static variable is per function, whereas a host device function is actually two functions: a host instance and a device instance, which could be totally different by using conditional macros. If it is requested that the static variable in a host device function is one instance, it requires special handling in runtime so that the same variable can be accessed on both device side and host side by common load/store instructions, but that is not the case. Therefore the device side instance of a static variable in a host device function is still only visible to device codes, not visible to host codes. Since it cannot be accessed by symbol by host code, it does not needs to be externalized.

clang/lib/CodeGen/CodeGenModule.cpp
6069

will only allow alphanumeric and underscore in CUID for simplicity.

clang/test/Driver/hip-cuid.hip
35

done

yaxunl updated this revision to Diff 277272.Jul 11 2020, 7:58 PM
yaxunl marked 3 inline comments as done.

Only allow cuid to be alphanumeric and underscore.

hliao requested changes to this revision.Jul 23 2020, 7:09 AM
hliao added a subscriber: hliao.

I don't that's proper way to support file-scope static device variables. As we discuss APIs like cudaMemcpyToSymol, that's a runtime API instead of driver API. The later needs to specify the module (or code object) id in addition to a symbol name and won't have the conflict issues. For the runtime API, all named device variables (static or not) are identified at the host side by their host stub variables. That stub variable is used to register the corresponding device variables associated with a module id to unique identify that variables across all TUs. As long as we look up device variables using their host stub variable pointers, they are uniquely identified already. The runtime implementation needs to find the module id and the variable symbol from the pointer of its host stub variable. It's not the compiler job to fabricate name uniquely across TUs.

This revision now requires changes to proceed.Jul 23 2020, 7:09 AM

I don't that's proper way to support file-scope static device variables. As we discuss APIs like cudaMemcpyToSymol, that's a runtime API instead of driver API. The later needs to specify the module (or code object) id in addition to a symbol name and won't have the conflict issues. For the runtime API, all named device variables (static or not) are identified at the host side by their host stub variables. That stub variable is used to register the corresponding device variables associated with a module id to unique identify that variables across all TUs. As long as we look up device variables using their host stub variable pointers, they are uniquely identified already. The runtime implementation needs to find the module id and the variable symbol from the pointer of its host stub variable. It's not the compiler job to fabricate name uniquely across TUs.

The problem is that even though the static variable is registered through __hipRigisterVariable, the runtime still relies on looking up symbol name to get the address of the device variable. That's why we need to externalize the static variable.

Another reason is that we need to support it in rdc mode, where different TU can have static var with the same name.

I don't that's proper way to support file-scope static device variables. As we discuss APIs like cudaMemcpyToSymol, that's a runtime API instead of driver API. The later needs to specify the module (or code object) id in addition to a symbol name and won't have the conflict issues. For the runtime API, all named device variables (static or not) are identified at the host side by their host stub variables. That stub variable is used to register the corresponding device variables associated with a module id to unique identify that variables across all TUs. As long as we look up device variables using their host stub variable pointers, they are uniquely identified already. The runtime implementation needs to find the module id and the variable symbol from the pointer of its host stub variable. It's not the compiler job to fabricate name uniquely across TUs.

The problem is that even though the static variable is registered through __hipRigisterVariable, the runtime still relies on looking up symbol name to get the address of the device variable. That's why we need to externalize the static variable.

If so, the runtime should be fixed as the variable name. I remembered I fixed the global one so that each one is uniquely identified by module id plus the name. For runtime APIs, all host-side references to device variables should look through the host stub variables instead of its name. If runtime or API doesn't follow that, we should fix them instead of asking the compiler to do the favor.

Another reason is that we need to support it in rdc mode, where different TU can have static var with the same name.

That's an issue of our current RDC support through LLVM IR instead of native code. The name conflicts are introduced as we link all TUs into a single module at IR level. The frontend should not be changed to support that.

The problem is that even though the static variable is registered through __hipRigisterVariable, the runtime still relies on looking up symbol name to get the address of the device variable. That's why we need to externalize the static variable.

If so, the runtime should be fixed as the variable name. I remembered I fixed the global one so that each one is uniquely identified by module id plus the name. For runtime APIs, all host-side references to device variables should look through the host stub variables instead of its name. If runtime or API doesn't follow that, we should fix them instead of asking the compiler to do the favor.

For runtime APIs, we do reference device variables through host stub variable instead of its name. However, how does runtime implements that internally?

In host compilation, clang emits call of __hipRegisterVariable(shadow_var, device_var_name) in init functions.

runtime builds a map of each shadow var to a device var name. then runtime looks up device var name in code object to get the real address of a device var.

Note: __hipRegisterVariable does not really associate a shadow var with the address of a device var, since in host compilation there is no way to know the address of a device var. It only associates a shadow var with the name of a device var.

So eventually, runtime still needs to look up the device var symbol in code objects. Since ROCm runtime does not look up local symbols, it cannot find the static device var in code objects, unless we externalize it.

Another reason is that we need to support it in rdc mode, where different TU can have static var with the same name.

That's an issue of our current RDC support through LLVM IR instead of native code. The name conflicts are introduced as we link all TUs into a single module at IR level. The frontend should not be changed to support that.

I am not sure if ISA level linking would help for this. My understanding is that the linker will rename the internal symbols having the same name from different objects. It does not matter if it is llvm-link or native linker. Once they are renamed we can no longer find them by name.

I don't see a reason why we cannot support accessing static device variable in FE. We have requests for this feature from different users.

tra added a comment.Jul 23 2020, 1:02 PM

Would it work if we generate a globally unique visible aliases for the static vars and use the alias' name to register device-side entities without changing their visibility?

hliao added a comment.Jul 23 2020, 1:27 PM

Another reason is that we need to support it in rdc mode, where different TU can have static var with the same name.

That's an issue of our current RDC support through LLVM IR instead of native code. The name conflicts are introduced as we link all TUs into a single module at IR level. The frontend should not be changed to support that.

I am not sure if ISA level linking would help for this. My understanding is that the linker will rename the internal symbols having the same name from different objects. It does not matter if it is llvm-link or native linker. Once they are renamed we can no longer find them by name.

I don't see a reason why we cannot support accessing static device variable in FE. We have requests for this feature from different users.

the register API also has a module ID parameter. With that, the runtime should be able to establish the map from the pointer of that host stub variable to its device module and the name within that module. For non-RDC case, that's already enough for the runtime to find the correct device variable.

But, as static internal linkage is defined, it makes that identifier only visible to a single TU and not visible across TU. To be host, I think we still need to define that behaves in the context of CUDA or HIP as it obviously breaks the definition by making that static visible between host and device TUs in the restrict definition.

hliao added a comment.Jul 23 2020, 1:30 PM
In D80858#2170547, @tra wrote:

Would it work if we generate a globally unique visible aliases for the static vars and use the alias' name to register device-side entities without changing their visibility?

We still need to define how a static device variable should be visible on the host side. How that behaves in the context of relocatable code generation as well as anonymous namespaces. Also, in the context of CUDA runtime/driver API, if a device variable is addressable on the host side through the runtime API, it should be addressable through the driver API as well. However, the naming will be a big challenge.

tra added a comment.Jul 23 2020, 1:50 PM
In D80858#2170547, @tra wrote:

Would it work if we generate a globally unique visible aliases for the static vars and use the alias' name to register device-side entities without changing their visibility?

We still need to define how a static device variable should be visible on the host side.

As far as the host is concerned, if a 'static' variable is within the same TU it (or, rather, its shadow) should be accessible.

How that behaves in the context of relocatable code generation as well as anonymous namespaces.

AFAICT it would just work -- each TU only sees static things within that TU on both host and device sides. Visible module-unique aliases would allow host-side binary to get address of an otherwise non-visible variables. I expect that will work with RDC -- because we didn't change the visibility of the symbols themselves, they behave according to the regular linking rules. The aliases are also globally unique, so they should present no issues either.

Also, in the context of CUDA runtime/driver API, if a device variable is addressable on the host side through the runtime API, it should be addressable through the driver API as well. However, the naming will be a big challenge.

I'm not convinced that I completely agree with this assertion. It's reasonable for visible symbols (though even then there's a question of how do you figure out a properly mangled name for that symbol), but expecting it to work for non-visible symbols would push it too far, IMO.
It will also have issues with RDC as the name resulution will become ambiguous. Compiler does have necessary info to resolve that ambiguity, runtime support library, too, but the driver does not. We'd have to provide the driver with the additional info to let it map host-side symbol info to its device counterpart. It's doable, but I don't think it's particularly useful in practice. If something needs to be accessed via the driver API, it would be reasonable to expect it to be a public symbol.

hliao added a comment.Jul 23 2020, 2:35 PM
In D80858#2170668, @tra wrote:
In D80858#2170547, @tra wrote:

Would it work if we generate a globally unique visible aliases for the static vars and use the alias' name to register device-side entities without changing their visibility?

We still need to define how a static device variable should be visible on the host side.

As far as the host is concerned, if a 'static' variable is within the same TU it (or, rather, its shadow) should be accessible.

How that behaves in the context of relocatable code generation as well as anonymous namespaces.

AFAICT it would just work -- each TU only sees static things within that TU on both host and device sides. Visible module-unique aliases would allow host-side binary to get address of an otherwise non-visible variables. I expect that will work with RDC -- because we didn't change the visibility of the symbols themselves, they behave according to the regular linking rules. The aliases are also globally unique, so they should present no issues either.

The problem is not whether we have solution to tell them but when we need to add that. Not all static device variables need to be visible to the host side. Externalizing them adds the overhead for the linker and may pose additional restrictions on aggressive optimizations. Do we have to support every ambiguous usage in the burden of the compiler change?

Also, in the context of CUDA runtime/driver API, if a device variable is addressable on the host side through the runtime API, it should be addressable through the driver API as well. However, the naming will be a big challenge.

I'm not convinced that I completely agree with this assertion. It's reasonable for visible symbols (though even then there's a question of how do you figure out a properly mangled name for that symbol), but expecting it to work for non-visible symbols would push it too far, IMO.

Won't this patch just makes invisible variables "visible"?

It will also have issues with RDC as the name resulution will become ambiguous. Compiler does have necessary info to resolve that ambiguity, runtime support library, too, but the driver does not. We'd have to provide the driver with the additional info to let it map host-side symbol info to its device counterpart. It's doable, but I don't think it's particularly useful in practice. If something needs to be accessed via the driver API, it would be reasonable to expect it to be a public symbol.

tra added a comment.Jul 23 2020, 2:59 PM

The problem is not whether we have solution to tell them but when we need to add that. Not all static device variables need to be visible to the host side. Externalizing them adds the overhead for the linker and may pose additional restrictions on aggressive optimizations. Do we have to support every ambiguous usage in the burden of the compiler change?

It's a multi-part problem.
The key request here is, IIUIC, to make TU-local variables visible within the TU on both host and the device.
If and how it should be implemented is up for the discusstion.
For me the request is reasonable and it would be good to have it as it would make the code behave according to general expectations of how TU-local variables behave -- "if it's in the same source file, it should be accessible from the functions in that file".

I'm mostly discussing the 'how' part and we obviously don't have all the answers yet.

I agree, that adding the 'handles' for all non-visible variables is suboptimal and it may prove to be too big of a burden. That said, it does not look like an outright showstopper either. Ideally we only need those handles for the items that are being referred to by the host. Most likely most of them will not be. The problem is that it's the host-side compilation which knows which symbols will be needed, but it's the device-side compilation where we would have to generate aliases and, generally speaking, device-side compilation may not be able to ever tell which symbols will be needed by the host. I don't have a good solution for that. Ideally we need to have exact same AST on both sides and we can't guarantee that with the current implementation of HD functions and the use of preprocessor macros.

Also, in the context of CUDA runtime/driver API, if a device variable is addressable on the host side through the runtime API, it should be addressable through the driver API as well. However, the naming will be a big challenge.

I'm not convinced that I completely agree with this assertion. It's reasonable for visible symbols (though even then there's a question of how do you figure out a properly mangled name for that symbol), but expecting it to work for non-visible symbols would push it too far, IMO.

Won't this patch just makes invisible variables "visible"?

Yes, but there will be no name conflicts, so for all practical purposes they don't change the end result of linking.

hliao added a comment.Jul 23 2020, 8:58 PM
In D80858#2170821, @tra wrote:

The problem is not whether we have solution to tell them but when we need to add that. Not all static device variables need to be visible to the host side. Externalizing them adds the overhead for the linker and may pose additional restrictions on aggressive optimizations. Do we have to support every ambiguous usage in the burden of the compiler change?

It's a multi-part problem.
The key request here is, IIUIC, to make TU-local variables visible within the TU on both host and the device.
If and how it should be implemented is up for the discusstion.
For me the request is reasonable and it would be good to have it as it would make the code behave according to general expectations of how TU-local variables behave -- "if it's in the same source file, it should be accessible from the functions in that file".

We may try to solve the issue without RDC firstly, where we don't need to change that static variable name (if the runtime maintains the device binaries correctly.) We only need to ensure the linker won't remove their symbols.

I'm mostly discussing the 'how' part and we obviously don't have all the answers yet.

I agree, that adding the 'handles' for all non-visible variables is suboptimal and it may prove to be too big of a burden. That said, it does not look like an outright showstopper either. Ideally we only need those handles for the items that are being referred to by the host. Most likely most of them will not be. The problem is that it's the host-side compilation which knows which symbols will be needed, but it's the device-side compilation where we would have to generate aliases and, generally speaking, device-side compilation may not be able to ever tell which symbols will be needed by the host. I don't have a good solution for that. Ideally we need to have exact same AST on both sides and we can't guarantee that with the current implementation of HD functions and the use of preprocessor macros.

Also, in the context of CUDA runtime/driver API, if a device variable is addressable on the host side through the runtime API, it should be addressable through the driver API as well. However, the naming will be a big challenge.

I'm not convinced that I completely agree with this assertion. It's reasonable for visible symbols (though even then there's a question of how do you figure out a properly mangled name for that symbol), but expecting it to work for non-visible symbols would push it too far, IMO.

Won't this patch just makes invisible variables "visible"?

Yes, but there will be no name conflicts, so for all practical purposes they don't change the end result of linking.

We may try to solve the issue without RDC firstly, where we don't need to change that static variable name (if the runtime maintains the device binaries correctly.) We only need to ensure the linker won't remove their symbols.

This is essentially externalizing it in linker. However, it may not work.

Consider a static device var was accessed in host code through hipMemCpyToSymbol or hipMemCpyFromSymbol. In device code, this static var may be initialized through host code, or its final value may be read by host code after kernel execution. The existence of these operations mean that this static device variable is actually having external linkage, instead of internal linkage, since it is accessed by external modules. Fail to reflect this truth in IR will cause optimization passes to make incorrect assumptions about this variable and perform incorrect optimizations on it. e.g. the llvm passes can assume the value in this static var never changes, or its final value will not be used, then the llvm passes may simply remove it. Marking it as used will not solve the issue, since the llvm passes may still assume its value never changes after initialization, whereas in reality it may be changed by hipMemCpyToSymbol before kernel execution.

tra added a comment.Jul 27 2020, 3:15 PM

It's a good point. Perhaps this is one of the cases where we should *not* follow nvcc.
We can't have our cake (preserve static behavior) and eat it (treat it as non-static in case something on the host side may decide to use an API which uses symbol names). Something's got to give. While we could make it work in some cases, I don't think we can make it work consistently.
I think it would be reasonable to restrict APIs that access symbols by name to be applicable to visible symbols only.

In D80858#2177159, @tra wrote:

It's a good point. Perhaps this is one of the cases where we should *not* follow nvcc.
We can't have our cake (preserve static behavior) and eat it (treat it as non-static in case something on the host side may decide to use an API which uses symbol names). Something's got to give. While we could make it work in some cases, I don't think we can make it work consistently.
I think it would be reasonable to restrict APIs that access symbols by name to be applicable to visible symbols only.

Agree, we only need to support accessing static device var by shadow var (runtime API), which is sufficient for most apps.

We may try to solve the issue without RDC firstly, where we don't need to change that static variable name (if the runtime maintains the device binaries correctly.) We only need to ensure the linker won't remove their symbols.

This is essentially externalizing it in linker. However, it may not work.

Consider a static device var was accessed in host code through hipMemCpyToSymbol or hipMemCpyFromSymbol. In device code, this static var may be initialized through host code, or its final value may be read by host code after kernel execution. The existence of these operations mean that this static device variable is actually having external linkage, instead of internal linkage, since it is accessed by external modules. Fail to reflect this truth in IR will cause optimization passes to make incorrect assumptions about this variable and perform incorrect optimizations on it. e.g. the llvm passes can assume the value in this static var never changes, or its final value will not be used, then the llvm passes may simply remove it. Marking it as used will not solve the issue, since the llvm passes may still assume its value never changes after initialization, whereas in reality it may be changed by hipMemCpyToSymbol before kernel execution.

That's what I mean by "externalizing" that variable. In fact, nvcc doesn't the same thing and generate global symbol in PTX.

To get file-scope static device variables support, we have to 2 separate issues:

  • externalize variable to ensure backend/linker preserve their symbols for runtime to query. The major concern is that we'd better check whether a static device variable is referenced using host APIs. Otherwise, aggressive IPO optimizations may not be able to be applied. As those are file-scope static variables, roughly checking their reference on the host side should be doable.
  • rename device static variables. Such renaming is only necessary when RDC is enabled. Without RDC, there would be conflicting names and renaming is unnecessary. For renaming, relying on developer supplied CUID is not reliable and we should extend name mangler to include source name, digest, or whatever automatically managed by the compiler. If we support static variable access on the host side, we should not rely on developer supplied option to function correctly. Also, the mangled name should be demangled into a meaningful name if possible. I remembered GCC has special support for names from anonymous namespaces if they have conflicting names. We may follow the similar scheme.
In D80858#2177159, @tra wrote:

It's a good point. Perhaps this is one of the cases where we should *not* follow nvcc.
We can't have our cake (preserve static behavior) and eat it (treat it as non-static in case something on the host side may decide to use an API which uses symbol names). Something's got to give. While we could make it work in some cases, I don't think we can make it work consistently.
I think it would be reasonable to restrict APIs that access symbols by name to be applicable to visible symbols only.

In fact, 'nvcc' does that by globalizing that static device variable.

tra added a comment.Jul 28 2020, 11:44 AM

I think Sam's approach is reasonable.

The ability to supply CUID to sub-compilations is useful by itself and should probably be split into a separate patch as it's largely independent of the externalization of file-scope static vars.

yaxunl updated this revision to Diff 282952.Aug 4 2020, 9:50 AM
yaxunl retitled this revision from [CUDA][HIP] Support accessing static device variable in host code to [CUDA][HIP] Support accessing static device variable in host code for -fno-gpu-rdc.
yaxunl edited the summary of this revision. (Show Details)

revised for -fno-gpu-rdc case by Michael's comments.

tra accepted this revision.Aug 4 2020, 11:02 AM

What is expected to happen to device statics in anonymous name spaces? It may be worth adding them to the tests.

LGTM otherwise.

yaxunl added a comment.Aug 4 2020, 9:32 PM
In D80858#2193970, @tra wrote:

What is expected to happen to device statics in anonymous name spaces? It may be worth adding them to the tests.

LGTM otherwise.

static device var in anonymous name space will have external linkage if it is referenced by host code. will add a test when committing.

This revision was not accepted when it landed; it landed in state Needs Review.Aug 5 2020, 4:58 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptAug 5 2020, 4:58 AM
tra added a comment.Aug 10 2020, 11:45 AM

Sam, just a FYI that the patch has a couple of unintended consequences.

We now end up with various things instantiated as device-side constant objects when they were not before, when we compile with -std=c++17 (especially with libc++):
https://godbolt.org/z/KbTM9M

That in turn sometimes pulls in other thins that may not exist on the device. E.g. in one case we've ended up with a PTX error caused by unresolved reference to strlen() (via some char_traits functions) .
The other potential issue is that we increase use of __constant__ and there's only 64K of it, so additional use pushed total use over the limit in few cases.

So far all failures can be attributed to questionable user code, but when we may need to figure out how to avoid emitting unused __constant__ data.

In D80858#2207699, @tra wrote:

Sam, just a FYI that the patch has a couple of unintended consequences.

We now end up with various things instantiated as device-side constant objects when they were not before, when we compile with -std=c++17 (especially with libc++):
https://godbolt.org/z/KbTM9M

That in turn sometimes pulls in other thins that may not exist on the device. E.g. in one case we've ended up with a PTX error caused by unresolved reference to strlen() (via some char_traits functions) .
The other potential issue is that we increase use of __constant__ and there's only 64K of it, so additional use pushed total use over the limit in few cases.

So far all failures can be attributed to questionable user code, but when we may need to figure out how to avoid emitting unused __constant__ data.

This is because we have some implicit constant device variables, e.g.

static _LIBCPP_CONSTEXPR const int __libcpp_polling_count = 64;

then they are referenced by some host functions, e.g.

template<class _Fn, class _BFn>
_LIBCPP_AVAILABILITY_SYNC _LIBCPP_INLINE_VISIBILITY
bool __libcpp_thread_poll_with_backoff(
  _Fn && __f, _BFn && __bf, chrono::nanoseconds __max_elapsed = chrono::nanoseconds::zero())
{
    auto const __start = chrono::high_resolution_clock::now();
    for(int __count = 0;;) {
      if(__f())
        return true; // _Fn completion means success
      if(__count < __libcpp_polling_count) {
        __count += 1;
        continue;
      }

when __libcpp_polling_count has internal linkage, it can be eliminated by optimization pass. When it is externalized, it can no longer be eliminated.

We can restrict externalization to constant variables with explicit 'constant' attributes only, which should fix this issue.

tra added a comment.Aug 10 2020, 12:40 PM

We can restrict externalization to constant variables with explicit 'constant' attributes only, which should fix this issue.

SGTM. If it does not have explicit device-side attribute, it's never going to need to cross host/device boundary. I guess this applies to vars with __device__ attribute, too.

In D80858#2207898, @tra wrote:

We can restrict externalization to constant variables with explicit 'constant' attributes only, which should fix this issue.

SGTM. If it does not have explicit device-side attribute, it's never going to need to cross host/device boundary. I guess this applies to vars with __device__ attribute, too.

I think we only add implicit __constant__ attributes for variables, not __device__ attributes.

The fix is here https://reviews.llvm.org/D85686