Page MenuHomePhabricator

[CUDA][HIP] Support accessing static device variable in host code for -fgpu-rdc
ClosedPublic

Authored by yaxunl on Aug 4 2020, 9:56 AM.

Details

Summary

This is separated from https://reviews.llvm.org/D80858

For -fgpu-rdc mode, static device vars in different TU's may have the same name.
To support accessing file-scope static device variables in host code, we need to give them
a distinct name and external linkage. This can be done by postfixing each static device variable with
a distinct CUID (Compilation Unit ID) hash.

Since the static device variables have different name across compilation units, now we let
them have external linkage so that they can be looked up by the runtime.

Diff Detail

Event Timeline

yaxunl created this revision.Aug 4 2020, 9:56 AM

I concede that making the variables external, and trying to give them unique names, does work around static variables not working. I believe static variables are subjected to more aggressive optimisation than external ones but the effect might not be significant.

This "works" in cuda today because the loader ignores the local annotation when accessing the variable. There is some probably unintended behaviour when multiple static variables have the same name in that the first one wins.

The corresponding change to the hsa loader is trivial. Why is making the symbols external, with the associated complexity in picking non-conflicting names, considered better than changing the loader?

I concede that making the variables external, and trying to give them unique names, does work around static variables not working. I believe static variables are subjected to more aggressive optimisation than external ones but the effect might not be significant.

This "works" in cuda today because the loader ignores the local annotation when accessing the variable. There is some probably unintended behaviour when multiple static variables have the same name in that the first one wins.

The corresponding change to the hsa loader is trivial. Why is making the symbols external, with the associated complexity in picking non-conflicting names, considered better than changing the loader?

Three reasons:

  1. The loader would like to look up dynsym only, which conforms better to the standard dynamic linker behavior and is more efficient than looking up all symbols.
  1. There could be symbols with the same name from different compilation units and they end up as local symbols with the same name in the binary. How does the loader know which is which.
  1. If a device symbol is static but actually accessed by the host code in the same compilation unit, the device symbol has de facto external linkage since it is truly accessed by some one out side of the device object (this is due to the unfortunate fact that a single source file ends up with a host object and a device object even though they are supposed to be the same compilation unit). Keeping the device symbol with internal linkage will cause the compiler over optimize the device code.
tra added a comment.Jan 19 2021, 11:23 AM

I'd propose splitting the patch into two. One with the addition of CUID and the other that changes the way we havdle static vars.
CUID is useful on its own and is relatively uncontroversial.

Externalizing static vars is a more interesting issue and I'm not sure what's the best way to handle it yet. On one hand it is necessary for visibility across host/device, on the other, externalizing all static vars will almost always have negative effect as very few of the static vars actually need this. As already pointed out in the #if 0 section of the patch, ideally we should externalize only the vars that need it. Generally speaking, I do not think we will be able to do that, because with -fgpu-rdc it may be used from the host code in some other TU.

We may need to explicitly annotate such the static variables that need to be visible on both sides and only apply externalization to the variables annotated this way. E.g. require them to be __host__ __device__.

WDYT?

In D85223#2507518, @tra wrote:

I'd propose splitting the patch into two. One with the addition of CUID and the other that changes the way we havdle static vars.
CUID is useful on its own and is relatively uncontroversial.

Externalizing static vars is a more interesting issue and I'm not sure what's the best way to handle it yet. On one hand it is necessary for visibility across host/device, on the other, externalizing all static vars will almost always have negative effect as very few of the static vars actually need this. As already pointed out in the #if 0 section of the patch, ideally we should externalize only the vars that need it. Generally speaking, I do not think we will be able to do that, because with -fgpu-rdc it may be used from the host code in some other TU.

We may need to explicitly annotate such the static variables that need to be visible on both sides and only apply externalization to the variables annotated this way. E.g. require them to be __host__ __device__.

WDYT?

Agree that CUID may be useful for other situations. Will separate it to another review.

yaxunl updated this revision to Diff 317712.Jan 19 2021, 3:24 PM
yaxunl edited the summary of this revision. (Show Details)

separate CUID patch.

tra added inline comments.Jan 20 2021, 11:42 AM
clang/lib/AST/ASTContext.cpp
11446–11447

!(getLangOpts().GPURelocatableDeviceCode && getLangOpts().CUID.empty()).

Maybe this should be broken down into something easier to read.

// Applies only to -fgpu-rdc or when we were given a CUID
if (!getLangOpts().GPURelocatableDeviceCode || !getLangOpts().CUID.empty()))
    return false;
// .. only file-scope static vars...
auto *VD = dyn_cast<VarDecl>(D);
if (!(VD && VD->isFileVarDecl() && VD->getStorageClass() == SC_Static))
    return false;
// .. with explicit __device__ or __constant__ attributes.
return ((D->hasAttr<CUDADeviceAttr>() && !D->getAttr<CUDADeviceAttr>()->isImplicit()) ||
            (D->hasAttr<CUDAConstantAttr>() &&!D->getAttr<CUDAConstantAttr>()->isImplicit()));
11446–11447

BTW, does this mean that we'll externalize & uniquify the vars even w/o -fgpu-rdc if CUID is given?

IMO -fgpu-rdc should remain the flag to control whether externalization is needed.
CUID controls the value of a unique suffix, if we need it, but should not automatically enable externalization.

clang/lib/CodeGen/CodeGenModule.cpp
2865–2866

Is this code needed?

yaxunl marked 3 inline comments as done.Sun, Feb 7, 8:57 PM
yaxunl added inline comments.
clang/lib/AST/ASTContext.cpp
11446–11447

done

11446–11447

mayExternalizeStaticVar returns true does not mean the static var must be externalized. mayExternalizeStaticVar only indicates the static var may be externalized. It is used to enable checking whether this var is used by host code.

For -fno-gpu-rdc, we only externalize a static variable if it is referenced by host code. If a static var is referenced by host code, -fno-gpu-rdc will change its linkage to external, but does not need to make the symbol unique because each TU ends up as a different device binary.

clang/lib/CodeGen/CodeGenModule.cpp
2865–2866

this code is not needed. removed.

yaxunl updated this revision to Diff 322021.Sun, Feb 7, 8:59 PM
yaxunl marked 3 inline comments as done.
yaxunl edited the summary of this revision. (Show Details)

Revised by Artem's comments. Use CUID hash as postfix for static variable name.

tra accepted this revision.Tue, Feb 9, 10:31 AM

LGTM with new test nits.

@JonChesterfield -- are you OK with the patch?

clang/test/CodeGenCUDA/device-var-linkage.cu
40

It should probably be a regex after HASH:, not the hash value itself.

clang/test/CodeGenCUDA/managed-var.cu
54

Same here.

clang/test/CodeGenCUDA/static-device-var-rdc.cu
35

ditto.

clang/test/SemaCUDA/static-device-var.cu
11

A comment explaining what we're testing would be helpful. no-diagnostics gives no clues about what is it we're looking for here.

15–23

So, this verifies that we're allowed to use static local vars in device code. A comment would be useful.

24–38

And this verifies that global static vars can be referenced from both host and device.
I'd also add a negative test with static int host_only; and would verify that we still don't allow accessing it from the device.

This revision is now accepted and ready to land.Tue, Feb 9, 10:31 AM
JonChesterfield accepted this revision.Tue, Feb 9, 10:38 AM

This works around the limitations of the binary format nvptx and amdgpu are using in the compiler. It's the wrong place in the stack to fix it - we could introduce another symbol table in the binary to capture the per-tu-between-arch scoping.

However, if we later reach consensus on what to do in the elf instead, we can still do that. In particular, embedding an elf for one arch in a named section of an elf for a host arch is crude. This workaround seems acceptable in the meantime.

tra added a comment.Tue, Feb 9, 10:59 AM

What breaks existing abstractions is that we produce N ELF objects from a single TU and the meaning of static becomes fuzzy. On one hand, we don't want that static symbol to be visible across objects on the same target, at the same time we do want it to be visible across host/device objects compiled from the same TU. ELF does not have a way to express it. Making such symbols visible with an unique suffix seems to be a reasonable tradeoff. We probably have more options available for AMDGPU. E.g. as you've suggested, give runtime extra clues about referencing these variables across host/device boundary without resorting to making them externally visible. However, we don't have that flexibility for NVPTX.

yaxunl marked 6 inline comments as done.Tue, Feb 9, 12:11 PM

This works around the limitations of the binary format nvptx and amdgpu are using in the compiler. It's the wrong place in the stack to fix it - we could introduce another symbol table in the binary to capture the per-tu-between-arch scoping.

However, if we later reach consensus on what to do in the elf instead, we can still do that. In particular, embedding an elf for one arch in a named section of an elf for a host arch is crude. This workaround seems acceptable in the meantime.

Yes we should revisit this if there is a better solution.

clang/test/CodeGenCUDA/device-var-linkage.cu
40

will do

clang/test/CodeGenCUDA/managed-var.cu
54

will do

clang/test/CodeGenCUDA/static-device-var-rdc.cu
35

will do

clang/test/SemaCUDA/static-device-var.cu
11

will do

15–23

will do

24–38

will do

This revision was automatically updated to reflect the committed changes.
yaxunl marked 6 inline comments as done.
Herald added a project: Restricted Project. · View Herald TranscriptWed, Feb 24, 3:41 PM