Page MenuHomePhabricator

[HIP] Allow non-incomplete array type for extern shared var
Needs ReviewPublic

Authored by yaxunl on Feb 4 2020, 10:22 AM.

Details

Reviewers
tra
rsmith

Diff Detail

Event Timeline

yaxunl created this revision.Feb 4 2020, 10:22 AM
tra added a comment.Feb 4 2020, 10:56 AM

A better description for the change would be helpful.

For what it's worth, NVCC accepts 'all of the above' for extern shared. https://godbolt.org/z/8cBsXv Whether that makes sense or not is another question.
IIRC, extern __shared__ can, effectively only be a pointer to an opaque type because at compile time we neither know the address, nor do we know the size of the memory it will point to. I believe that was the reason why we've limited accepted types to size-less arrays only. Otherwise, we end up with situations where on the source level we declare an object (I.e. nobody expects a pointer to be involved), but end up failing with invalid memory access because no memory was ever allocated for it. Incomplete array seems to be a reasonable trade-off.

What's your proposed use case for this change? Does extern __shared__ work in HIP the same way it works in CUDA?

clang/test/CodeGenCUDA/extern-shared.cu
2–3

I'd add a CUDA test run, too to issuestrate what we expect CUDA to handle.

In D73979#1857485, @tra wrote:

A better description for the change would be helpful.

For what it's worth, NVCC accepts 'all of the above' for extern shared. https://godbolt.org/z/8cBsXv Whether that makes sense or not is another question.

IIRC, `extern __shared__` can, effectively only be a pointer to an opaque type because at compile time we neither know the address, nor do we know the size of the memory it will point to. I believe that was the reason why we've limited accepted types to size-less arrays only. Otherwise, we end up with situations where on the source level we declare an object (I.e. nobody expects a pointer to be involved), but end up failing with invalid memory access because no memory was ever allocated for it. Incomplete array seems to be a reasonable trade-off.

What's your proposed use case for this change? Does extern __shared__ work in HIP the same way it works in CUDA?

The shared memory is divided into two parts: static part and dynamic part. AMDGPU backends calculates static usage of all shared vars for a kernel and put that info in code object. When runtime launches a kernel, it checks static shared memory required by the kernel, add the dynamic shared memory requirement specified by triple chevron and allocates the sum of static and dynamic requirements.

AMDGPU backend assumes static shared memory takes the lower memory address and the dynamic shared memory starts at the boundary between static shared memory and dynamic memory address.

AMDGPU backend lowers all external uninitialized shared vars to the size of all static shared var usage, i.e. the starting address of the dynamic part of shared memory.

Based on CUDA usage of extern shared var (https://devblogs.nvidia.com/using-shared-memory-cuda-cc/), it seems CUDA also assumes all extern shared vars have the same address, therefore HIP and CUDA have similar behavior.

tra added a comment.Feb 4 2020, 12:12 PM

Based on CUDA usage of extern shared var (https://devblogs.nvidia.com/using-shared-memory-cuda-cc/), it seems CUDA also assumes all extern shared vars have the same address, therefore HIP and CUDA have similar behavior.

Yes. Because of the CUDA also assumes all extern shared vars have the same address I think that not allowing additional types for extern __shared__ makes sense for HIP, too. I'd rather not give users more ways to do a wrong thing.

Can you elaborate on why you want to allow this feature? While it would be convenient for someone who uses exactly *one* such extern object, in practice the most common use case is that users declare a single extern __shared__ array to serve as the memory pool and then manually allocate chunks within it and assign the addresses to appropriately typed pointers. I guess they could define an extern __shared__ struct with fields representing the objects, but that seems sort of pointless considering that the only reason to use extern __shared__ is to allocate shared memory dynamically.

In general, the concept of extern __shared__ with *all* such extern items occupying the same space is broken by design. It's not composable (every function using one needs to coordinate with every other function doig the same). It introduces failure modes not obvious from the source code (access an object, fail with invalid memory access). It does not fit the conventional meaning of what extern something means in C++ (different objects have different addresses). IMO, it should not have existed and the shared memory/pointer should've been exposed via explicit API. I.e. CUDA could've used the same mechanism which provides threads with threadIdx and blockIdx.

As things stand right now, extern __shared__ is something I want gone, not added more features to. AFAICT, the limitations clang places on it right now have not been an issue for the CUDA code we compile.

Is there a pressing need for this feature for HIP? Perhaps it would make more sense to introduce a more sensible API and port existing HIP code to use it.

WDYT?

In D73979#1857664, @tra wrote:

Based on CUDA usage of extern shared var (https://devblogs.nvidia.com/using-shared-memory-cuda-cc/), it seems CUDA also assumes all extern shared vars have the same address, therefore HIP and CUDA have similar behavior.

Yes. Because of the CUDA also assumes all extern shared vars have the same address I think that not allowing additional types for extern __shared__ makes sense for HIP, too. I'd rather not give users more ways to do a wrong thing.

Can you elaborate on why you want to allow this feature? While it would be convenient for someone who uses exactly *one* such extern object, in practice the most common use case is that users declare a single extern __shared__ array to serve as the memory pool and then manually allocate chunks within it and assign the addresses to appropriately typed pointers. I guess they could define an extern __shared__ struct with fields representing the objects, but that seems sort of pointless considering that the only reason to use extern __shared__ is to allocate shared memory dynamically.

In general, the concept of extern __shared__ with *all* such extern items occupying the same space is broken by design. It's not composable (every function using one needs to coordinate with every other function doig the same). It introduces failure modes not obvious from the source code (access an object, fail with invalid memory access). It does not fit the conventional meaning of what extern something means in C++ (different objects have different addresses). IMO, it should not have existed and the shared memory/pointer should've been exposed via explicit API. I.e. CUDA could've used the same mechanism which provides threads with threadIdx and blockIdx.

As things stand right now, extern __shared__ is something I want gone, not added more features to. AFAICT, the limitations clang places on it right now have not been an issue for the CUDA code we compile.

Is there a pressing need for this feature for HIP? Perhaps it would make more sense to introduce a more sensible API and port existing HIP code to use it.

WDYT?

All extern shared vars are sharing the same address, however, they may be used as different types in different functions.

For example,

__device__ int foo() {
  extern __shared__ int a;
  for (...) a+=...;
  return a;
}

__device__ double bar(int x) {
  extern __shared__ double b[10];
  for(...) b[x]+=...;
  return b[0];
}

__global__ void k() {
  foo();
  //...
  bar();
}

In one function foo, users need to use the shared memory as an int. In another function, users need to use the shared memory as a double array. Users just need to make sure they request sufficient dynamic shared memory in triple chevron to be greater than the max dynamic shared memory usage. Users do not need to pass values in extern shared var between functions. They just treat it as an uninitialized variable. Forbidding different types for extern shared variable does not add any benefit, just forcing users to work around the limitation and resulting in less readable code.

BTW this is requested by HIP users, who have similar code for CUDA and HIP. They found it surprised that nvcc allows it but hip-clang does not.

tra added a subscriber: rsmith.Feb 4 2020, 2:02 PM

All extern shared vars are sharing the same address, however, they may be used as different types in different functions.

For example,

__device__ int foo() {
  extern __shared__ int a;
  for (...) a+=...;
  return a;
}

__device__ double bar(int x) {
  extern __shared__ double b[10];
  for(...) b[x]+=...;
  return b[0];
}

__global__ void k() {
  foo();
  //...
  bar();
}

I do agree that it is possible to use extern __shared__ given enough care. My point is that as a feature it is ill-designed, very easy to misuse and creates more problems than it's worth, especially in non-trivial code.
As an illustration, what if bar needs to call foo() and foo lives in a header file somewhere else? Whoever implements foo must make sure that nothing else in the transitive call chain uses extern __shared__. That's hard to guarantee in practice and it's very easy to introduce new dependencies without even being aware of them. I.e. via an intermediate function which is not aware that the caller and callee have this restriction.

There are no compiler checks to warn you about it, you you will only know about the problem when you encounter data corruption at runtime and in machine learning applications that may go unnoticed for a long time.

In one function foo, users need to use the shared memory as an int. In another function, users need to use the shared memory as a double array. Users just need to make sure they request sufficient dynamic shared memory in triple chevron to be greater than the max dynamic shared memory usage.

Again, that requires complete knowledge of who uses this construct. Without compiler's help that's hard to guarantee outside of simple use cases. I can not imagine using it as is in something like thrust or eigen. In fact, thrust does provide extern_shared_ptr specifically to serve the same kind of API that I proposed.

Users do not need to pass values in extern shared var between functions. They just treat it as an uninitialized variable.

I'm OK with that, but they *do* need to make it explicit that they are dealing with externally allocated memory and they do need to pass something to identify which chunk of that memory they operate upon. It may be the pointer, or it may be an offset to be used relative to an extern __shared__ base.

Forbidding different types for extern shared variable does not add any benefit, just forcing users to work around the limitation and resulting in less readable code.

That's where we disagree. I believe that in this case it would be a net benefit to be even more restrictive with extern __shared__ than we are right now and get users to explicitly treat extern __shared__ as externally allocated memory.

It does not take all that much code to make it work and it does result in more robust code. E.g. your example can be rewritten like this:

// sprinkle static_casts as necessary.
void *get_shmem(size_t offset){
  extern __shared__ char shmem[];
  return  &shmem[offset];
}

__device__ int foo(int *a) {
  for (...) *a+=...;
  return *a;
}

__device__ double bar(double *bx) {
  for(...) *bx+=...;
  return *bx;
}

__global__ void k() {
  foo(get_shmem(0));
  //...
  bar(get_shmem(0));
}

Net benefits that I see:

  • functions are composable now -- one can call foo from bar and vice versa and ensure they don't step on each other's toes.
  • it's clear that they do operate on the same buffer, when called from k -- arguably that's the place where it matters.
  • It's easy to change if you need them to work on different sub-buffers.
  • foo/bar are not limited to working on shared memory only
  • foo/bar can execute in diverged branches, if given non-overlapping buffers. The original example would potentially fail in interesting ways if k does something like this:
__global__ void k() {
  if (threadIdx.x < 16)
     foo();
  else 
     bar();
}

It's much easier not to open this can of works than clean it up afterwards when you grow more users that depend on it.
I don't think it should be enabled for CUDA, and don't think that it would be a good idea for HIP, either.
Perhaps we need a third opinion from someone with a broader perspective.

@rsmith -- do you have an opinion on what should be done with a tactically useful, but strategically unsound features in general and this CUDA-specific oddity specifically?

tra added a reviewer: rsmith.Feb 4 2020, 2:02 PM

BTW this is requested by HIP users, who have similar code for CUDA and HIP. They found it surprised that nvcc allows it but hip-clang does not.

I think I'm one of the HIP users here, but the above change is not what I was hoping for.

I'd like:

__shared__ int x;
__shared__ int y;
__device__ void foo()
{
  assert(&x != &y);
  x = 2 * y;
}

to compile and behave as it does on cuda, i.e. the 'x' variable gets allocated in shared memory for each kernel which accesses it, and so does 'y'.

The 'extern shared' feature where nvcc builds a union out of all the things it sees and the user indexes into it at runtime is totally unappealing. That cuda uses the 'extern' keyword to opt into this magic union also seems undesirable.

BTW this is requested by HIP users, who have similar code for CUDA and HIP. They found it surprised that nvcc allows it but hip-clang does not.

I think I'm one of the HIP users here, but the above change is not what I was hoping for.

I'd like:

__shared__ int x;
__shared__ int y;
__device__ void foo()
{
  assert(&x != &y);
  x = 2 * y;
}

to compile and behave as it does on cuda, i.e. the 'x' variable gets allocated in shared memory for each kernel which accesses it, and so does 'y'.

The 'extern shared' feature where nvcc builds a union out of all the things it sees and the user indexes into it at runtime is totally unappealing. That cuda uses the 'extern' keyword to opt into this magic union also seems undesirable.

Clang emits correct IR for this code. If you use x and y in a kernel directly, amdgcn backend can generate correct ISA where &x and &y are different. What the backend does is to accumulate all shared memory and get the total shared memory usage and assign address to different shared variables. Therefore x and y get different addresses.

Currently amdgcn backend emits a diagnostic message if shared variable is used in non-kernel function:

https://github.com/llvm/llvm-project/blob/6085593c128e91fd7db998c5441ebe120c7e4f04/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp#L1232

https://github.com/llvm/llvm-project/blob/3fda1fde8f7bdf3b90d8700f5a386f63409b4313/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp#L1952

This is unreasonable if the backend is able to calculate total shared memory usage, so this is a bug. With this bug fixed, you should be able to use shared variables in device functions.