This is an archive of the discontinued LLVM Phabricator instance.

[NVPTX] Apply global var demotion to private symbols
ClosedPublic

Authored by qcolombet on Jul 5 2023, 6:22 AM.

Details

Summary

When emitting the assembly we perform some late global variables demotion.
Prior to this patch, this optimization was only performed on variables with
the internal linkage whereas any local global variable can be demoted.

Fix that by using hasLocalLinkage instead of hasInternalLinkage.

Without this change, global variables with the private linkage wouldn't
be demoted.

Diff Detail

Event Timeline

qcolombet created this revision.Jul 5 2023, 6:22 AM
qcolombet requested review of this revision.Jul 5 2023, 6:22 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 5 2023, 6:22 AM
Herald added a subscriber: wangpc. · View Herald Transcript
mehdi_amini accepted this revision.Jul 6 2023, 10:47 AM
This revision is now accepted and ready to land.Jul 6 2023, 10:47 AM
guraypp accepted this revision.Jul 7 2023, 2:06 AM

Looks great! This work also solves my issue in D154074

tra added a comment.EditedJul 10 2023, 2:14 PM

This may be potentially problematic for CUDA as we may need to refer to the global variable from the host side. E.g. via cudaMemcpyFromSymbol/cudaMemcpyToSymbol.

This may be potentially problematic for CUDA as we may need to refer to the global variable from the host side.

What about if we add shared memory address space check? The shared memory isn't addressable from the host. Something like below:

if (!gv->hasInternalLinkage() || (!gv->hasLocalLinkage() && Pty->getAddressSpace() != ADDRESS_SPACE_SHARED))
tra added a comment.Jul 10 2023, 2:26 PM

What about if we add shared memory address space check? The shared memory isn't addressable from the host. Something like below:

if (!gv->hasInternalLinkage() || (!gv->hasLocalLinkage() && Pty->getAddressSpace() != ADDRESS_SPACE_SHARED))

For shared variables demotion should be fine, though I'm not sure what it would buy us. Shared variables, even if they get demoted would still be allocated statically, so effectively it only affects the symbol visibility without saving any resources, which would be the case for non-shared variables.

tra added a comment.Jul 10 2023, 2:33 PM

Ah, never mind. We're only dealing with shared variables here. The patch is fine.

tra accepted this revision.Jul 10 2023, 2:34 PM

Ah, never mind. We're only dealing with shared variables here. The patch is fine.

Awesome thanks!

This may be potentially problematic for CUDA as we may need to refer to the global variable from the host side. E.g. via cudaMemcpyFromSymbol/cudaMemcpyToSymbol.

If something is to be referenced from outside a LLVM module, shouldn't we just avoid making it internal in the first place? Neither internal nor private are supposed to be visible outside the current module I believe.

tra added a comment.Jul 10 2023, 5:06 PM

This may be potentially problematic for CUDA as we may need to refer to the global variable from the host side. E.g. via cudaMemcpyFromSymbol/cudaMemcpyToSymbol.

If something is to be referenced from outside a LLVM module, shouldn't we just avoid making it internal in the first place? Neither internal nor private are supposed to be visible outside the current module I believe.

Please ignore, I missed that this demotion only applies to the shared variables which can't be accessed from the host that way.

jdoerfert added a subscriber: jdoerfert.EditedJul 10 2023, 6:06 PM

I don't think this is (and was) correct, it does not account for capturing.

We have a pass that does this in the middle end (https://llvm.org/doxygen/GlobalOpt_8cpp_source.html#l01359), which reminds me of https://discourse.llvm.org/t/rfc-cleaning-up-the-nvidia-and-potentially-amd-gpu-backend/71680.

I'm a bit confused @jdoerfert : are there two different uses of the word "demoted" here? Seems like you're referring to a transformation that turns a global into an alloca, whereas here is it just about "demoting" the linkage type of the global?

jdoerfert added a comment.EditedJul 10 2023, 8:57 PM

I'm a bit confused @jdoerfert : are there two different uses of the word "demoted" here? Seems like you're referring to a transformation that turns a global into an alloca, whereas here is it just about "demoting" the linkage type of the global?

I do not see the linkage being changed, if so, to what? Local (=private or internal) is already as "good" as it gets.

Given the test result, and the comment ("Find out if a global variable can be demoted to local scope."), it looks to me as if the scope is changed, thus where the variable is "declared/visible".
That said, I was expecting (w/o knowing for sure), that a variable scoped in a function cannot be accessed by other functions. If that is wrong, the first part of my comment is mood.
I still believe we don't want these special casings in our backends, but that is for later. If the scope doesn't preclude access, this is fine.

EDIT:
I looked at the PTX spec online but could not figure it out right away. One can always try to escape the address, recurse, check that the address is still the same and that outside accesses are visible as they should be.

Ah I see what you mean! Scratch what I wrote above...

wangpc removed a subscriber: wangpc.Jul 11 2023, 12:31 AM

Hi @jdoerfert,

I'm a bit confused @jdoerfert : are there two different uses of the word "demoted" here? Seems like you're referring to a transformation that turns a global into an alloca, whereas here is it just about "demoting" the linkage type of the global?

I do not see the linkage being changed, if so, to what? Local (=private or internal) is already as "good" as it gets.

Given the test result, and the comment ("Find out if a global variable can be demoted to local scope."), it looks to me as if the scope is changed, thus where the variable is "declared/visible".
That said, I was expecting (w/o knowing for sure), that a variable scoped in a function cannot be accessed by other functions. If that is wrong, the first part of my comment is mood.

I think a variable scoped in a function can be accessed by other functions. Meaning, the symbol may not be visible, but the address pointed by this symbol can be accessed (i.e., it is okay to escape), because the allocations are handled at a higher level (the driver).
@guraypp do you have more details on this?

I still believe we don't want these special casings in our backends, but that is for later. If the scope doesn't preclude access, this is fine.

@guraypp could you explain why this optimization is useful?
I don't remember. I would expect the visibility doesn't matter as long as the (meta)data around what needs to be allocated before a kernel launch is accurate, but maybe this info is derived from the symbols visibility?

EDIT:
I looked at the PTX spec online but could not figure it out right away. One can always try to escape the address, recurse, check that the address is still the same and that outside accesses are visible as they should be.

Looking at how usedInOneFunc is implemented, you're right that the address could escape. That being said, I don't think this is an issue since the memory allocation is handled at the kernel level. Anyhow, this is orthogonal to this patch IMHO.

Cheers,
-Quentin

Hi @jdoerfert,

I'm a bit confused @jdoerfert : are there two different uses of the word "demoted" here? Seems like you're referring to a transformation that turns a global into an alloca, whereas here is it just about "demoting" the linkage type of the global?

I do not see the linkage being changed, if so, to what? Local (=private or internal) is already as "good" as it gets.

Given the test result, and the comment ("Find out if a global variable can be demoted to local scope."), it looks to me as if the scope is changed, thus where the variable is "declared/visible".
That said, I was expecting (w/o knowing for sure), that a variable scoped in a function cannot be accessed by other functions. If that is wrong, the first part of my comment is mood.

I think a variable scoped in a function can be accessed by other functions. Meaning, the symbol may not be visible, but the address pointed by this symbol can be accessed (i.e., it is okay to escape), because the allocations are handled at a higher level (the driver).
@guraypp do you have more details on this?

Spoke with @guraypp offline and he confirms that was the case.

I still believe we don't want these special casings in our backends, but that is for later. If the scope doesn't preclude access, this is fine.

@guraypp could you explain why this optimization is useful?
I don't remember. I would expect the visibility doesn't matter as long as the (meta)data around what needs to be allocated before a kernel launch is accurate, but maybe this info is derived from the symbols visibility?

@guraypp checked on Hopper and the driver seems to do the right allocation with or without the demotion. (Meaning the demotion doesn't seem that useful anymore.)
That being said, it may still be useful for older driver.
Anyhow, scope doesn't preclude access so I think @jdoerfert, you're fine with the patch, right?

EDIT:
I looked at the PTX spec online but could not figure it out right away. One can always try to escape the address, recurse, check that the address is still the same and that outside accesses are visible as they should be.

Looking at how usedInOneFunc is implemented, you're right that the address could escape. That being said, I don't think this is an issue since the memory allocation is handled at the kernel level. Anyhow, this is orthogonal to this patch IMHO.

Cheers,
-Quentin

Ping @jdoerfert

Does the patch work for you with the additional context?
The TL;DR is the semantic is correct but more recent drivers may not need that optimization anymore.

jdoerfert accepted this revision.Jul 27 2023, 12:23 PM

Ping @jdoerfert

Does the patch work for you with the additional context?
The TL;DR is the semantic is correct but more recent drivers may not need that optimization anymore.

SG, thx.

This revision was landed with ongoing or failed builds.Aug 9 2023, 5:44 AM
This revision was automatically updated to reflect the committed changes.