Need to emit the reference in the global address space, and the
referenced static variable should be cast to the generic address
space to be compatible with NVPTX.
Required to avoid early optimization of the static variables.
The cast was never the problem. The problem is that the $ref lives in the global address space while the global value might be in the shared one.
D101030 gives a way out here by not creating $ref variables if there is no host version of a global. This makes sense because we don't need to copy to the global ever.
So, users can have static globals in shared memory by ensuring they are not host accessible. We should probably emit an error if they try to declare a global in shared memory
that has a host version.
Long story short, this cast won't help.
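To make the rule described above concrete, here is a minimal sketch of the decision as I understand it. This is not the actual Clang code: `AddrSpace`, `GlobalInfo`, `RefAction`, and `refAction` are hypothetical names invented for illustration, and the `Diagnose` case only models the error suggested above, which does not exist yet as far as this thread says.

```cpp
// Hypothetical model of the $ref decision; not Clang's real data structures.
enum class AddrSpace { Global, Shared };
enum class RefAction { EmitRef, SkipRef, Diagnose };

struct GlobalInfo {
  AddrSpace space;     // address space of the device copy of the variable
  bool hasHostVersion; // true if the variable is also accessible from the host
};

// Decide what to do about the $ref variable for one static global.
RefAction refAction(const GlobalInfo &g) {
  // No host version: nothing is ever copied to the global, so no $ref is needed.
  if (!g.hasHostVersion)
    return RefAction::SkipRef;
  // Host version and shared memory: a $ref in global memory would have to
  // point at a shared-memory symbol, so this models the suggested error.
  if (g.space == AddrSpace::Shared)
    return RefAction::Diagnose;
  // Host version and global memory: emit the $ref.
  return RefAction::EmitRef;
}
```

Under this model, a shared-memory static without a host version never gets a $ref, so the cast never comes into play for it.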
I checked that this cast fixes a problem with ptxas and shared/global memory refs. I'm not saying that this is the best solution, but it is better to have a fix until the final solution lands. Plus, it may fix some other potential issues until the $refs go away. We may need a new attribute to avoid early optimization of the internal vars.
I don't understand why the cast makes ptxas happy. Can you include a test in the runtime so we can make sure that it stays that way and is not some ptxas artifact? After all, we still point to a shared memory symbol from global memory, which doesn't make much sense.
Here is the related info: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#initializers. Take a look at the examples there:
.const .u32 foo = 42;
...
.global .u32 p2 = generic(foo); // generic address of foo
After the patch, we have this:
.shared .align 4 .u32 _ZL1X;
.global .align 8 .u64 __ZL1X$ref = generic(_ZL1X);
which is similar to the example.