This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
clang/
-
lib/Sema/
-
Sema/
-
SemaDeclAttr.cpp
-
test/
-
CodeGenCUDA/
1
extern-shared.cu
-
SemaCUDA/
-
extern-shared.cu

Differential D73979

[HIP] Allow non-incomplete array type for extern shared var
Needs ReviewPublic

Authored by yaxunl on Feb 4 2020, 10:22 AM.

Download Raw Diff

Details

Reviewers

tra
rsmith

Diff Detail

Event Timeline

yaxunl created this revision.Feb 4 2020, 10:22 AM

A better description for the change would be helpful.

For what it's worth, NVCC accepts 'all of the above' for extern shared. https://godbolt.org/z/8cBsXv Whether that makes sense or not is another question.
IIRC, extern __shared__ can, effectively only be a pointer to an opaque type because at compile time we neither know the address, nor do we know the size of the memory it will point to. I believe that was the reason why we've limited accepted types to size-less arrays only. Otherwise, we end up with situations where on the source level we declare an object (I.e. nobody expects a pointer to be involved), but end up failing with invalid memory access because no memory was ever allocated for it. Incomplete array seems to be a reasonable trade-off.

What's your proposed use case for this change? Does extern __shared__ work in HIP the same way it works in CUDA?

clang/test/CodeGenCUDA/extern-shared.cu
2–3	I'd add a CUDA test run, too to issuestrate what we expect CUDA to handle.

In D73979#1857485, @tra wrote:

A better description for the change would be helpful.

For what it's worth, NVCC accepts 'all of the above' for extern shared. https://godbolt.org/z/8cBsXv Whether that makes sense or not is another question.
IIRC, extern __shared__ can, effectively only be a pointer to an opaque type because at compile time we neither know the address, nor do we know the size of the memory it will point to. I believe that was the reason why we've limited accepted types to size-less arrays only. Otherwise, we end up with situations where on the source level we declare an object (I.e. nobody expects a pointer to be involved), but end up failing with invalid memory access because no memory was ever allocated for it. Incomplete array seems to be a reasonable trade-off.

What's your proposed use case for this change? Does extern __shared__ work in HIP the same way it works in CUDA?

The shared memory is divided into two parts: static part and dynamic part. AMDGPU backends calculates static usage of all shared vars for a kernel and put that info in code object. When runtime launches a kernel, it checks static shared memory required by the kernel, add the dynamic shared memory requirement specified by triple chevron and allocates the sum of static and dynamic requirements.

AMDGPU backend assumes static shared memory takes the lower memory address and the dynamic shared memory starts at the boundary between static shared memory and dynamic memory address.

AMDGPU backend lowers all external uninitialized shared vars to the size of all static shared var usage, i.e. the starting address of the dynamic part of shared memory.

Based on CUDA usage of extern shared var (https://devblogs.nvidia.com/using-shared-memory-cuda-cc/), it seems CUDA also assumes all extern shared vars have the same address, therefore HIP and CUDA have similar behavior.

In D73979#1857536, @yaxunl wrote:

Based on CUDA usage of extern shared var (https://devblogs.nvidia.com/using-shared-memory-cuda-cc/), it seems CUDA also assumes all extern shared vars have the same address, therefore HIP and CUDA have similar behavior.

Yes. Because of the CUDA also assumes all extern shared vars have the same address I think that not allowing additional types for extern __shared__ makes sense for HIP, too. I'd rather not give users more ways to do a wrong thing.

Can you elaborate on why you want to allow this feature? While it would be convenient for someone who uses exactly *one* such extern object, in practice the most common use case is that users declare a single extern __shared__ array to serve as the memory pool and then manually allocate chunks within it and assign the addresses to appropriately typed pointers. I guess they could define an extern __shared__ struct with fields representing the objects, but that seems sort of pointless considering that the only reason to use extern __shared__ is to allocate shared memory dynamically.

In general, the concept of extern __shared__ with *all* such extern items occupying the same space is broken by design. It's not composable (every function using one needs to coordinate with every other function doig the same). It introduces failure modes not obvious from the source code (access an object, fail with invalid memory access). It does not fit the conventional meaning of what extern something means in C++ (different objects have different addresses). IMO, it should not have existed and the shared memory/pointer should've been exposed via explicit API. I.e. CUDA could've used the same mechanism which provides threads with threadIdx and blockIdx.

As things stand right now, extern __shared__ is something I want gone, not added more features to. AFAICT, the limitations clang places on it right now have not been an issue for the CUDA code we compile.

Is there a pressing need for this feature for HIP? Perhaps it would make more sense to introduce a more sensible API and port existing HIP code to use it.

WDYT?

In D73979#1857664, @tra wrote:

In D73979#1857536, @yaxunl wrote:

Based on CUDA usage of extern shared var (https://devblogs.nvidia.com/using-shared-memory-cuda-cc/), it seems CUDA also assumes all extern shared vars have the same address, therefore HIP and CUDA have similar behavior.

Yes. Because of the CUDA also assumes all extern shared vars have the same address I think that not allowing additional types for extern __shared__ makes sense for HIP, too. I'd rather not give users more ways to do a wrong thing.

Can you elaborate on why you want to allow this feature? While it would be convenient for someone who uses exactly *one* such extern object, in practice the most common use case is that users declare a single extern __shared__ array to serve as the memory pool and then manually allocate chunks within it and assign the addresses to appropriately typed pointers. I guess they could define an extern __shared__ struct with fields representing the objects, but that seems sort of pointless considering that the only reason to use extern __shared__ is to allocate shared memory dynamically.

In general, the concept of extern __shared__ with *all* such extern items occupying the same space is broken by design. It's not composable (every function using one needs to coordinate with every other function doig the same). It introduces failure modes not obvious from the source code (access an object, fail with invalid memory access). It does not fit the conventional meaning of what extern something means in C++ (different objects have different addresses). IMO, it should not have existed and the shared memory/pointer should've been exposed via explicit API. I.e. CUDA could've used the same mechanism which provides threads with threadIdx and blockIdx.

As things stand right now, extern __shared__ is something I want gone, not added more features to. AFAICT, the limitations clang places on it right now have not been an issue for the CUDA code we compile.

Is there a pressing need for this feature for HIP? Perhaps it would make more sense to introduce a more sensible API and port existing HIP code to use it.

WDYT?

All extern shared vars are sharing the same address, however, they may be used as different types in different functions.

For example,

__device__ int foo() {
  extern __shared__ int a;
  for (...) a+=...;
  return a;
}

__device__ double bar(int x) {
  extern __shared__ double b[10];
  for(...) b[x]+=...;
  return b[0];
}

__global__ void k() {
  foo();
  //...
  bar();
}

In one function foo, users need to use the shared memory as an int. In another function, users need to use the shared memory as a double array. Users just need to make sure they request sufficient dynamic shared memory in triple chevron to be greater than the max dynamic shared memory usage. Users do not need to pass values in extern shared var between functions. They just treat it as an uninitialized variable. Forbidding different types for extern shared variable does not add any benefit, just forcing users to work around the limitation and resulting in less readable code.

BTW this is requested by HIP users, who have similar code for CUDA and HIP. They found it surprised that nvcc allows it but hip-clang does not.

In D73979#1857728, @yaxunl wrote:
All extern shared vars are sharing the same address, however, they may be used as different types in different functions.

For example,
__device__ int foo() {
  extern __shared__ int a;
  for (...) a+=...;
  return a;
}

__device__ double bar(int x) {
  extern __shared__ double b[10];
  for(...) b[x]+=...;
  return b[0];
}

__global__ void k() {
  foo();
  //...
  bar();
}

I do agree that it is possible to use extern __shared__ given enough care. My point is that as a feature it is ill-designed, very easy to misuse and creates more problems than it's worth, especially in non-trivial code.
As an illustration, what if bar needs to call foo() and foo lives in a header file somewhere else? Whoever implements foo must make sure that nothing else in the transitive call chain uses extern __shared__. That's hard to guarantee in practice and it's very easy to introduce new dependencies without even being aware of them. I.e. via an intermediate function which is not aware that the caller and callee have this restriction.

There are no compiler checks to warn you about it, you you will only know about the problem when you encounter data corruption at runtime and in machine learning applications that may go unnoticed for a long time.

In one function foo, users need to use the shared memory as an int. In another function, users need to use the shared memory as a double array. Users just need to make sure they request sufficient dynamic shared memory in triple chevron to be greater than the max dynamic shared memory usage.

Again, that requires complete knowledge of who uses this construct. Without compiler's help that's hard to guarantee outside of simple use cases. I can not imagine using it as is in something like thrust or eigen. In fact, thrust does provide extern_shared_ptr specifically to serve the same kind of API that I proposed.

Users do not need to pass values in extern shared var between functions. They just treat it as an uninitialized variable.

I'm OK with that, but they *do* need to make it explicit that they are dealing with externally allocated memory and they do need to pass something to identify which chunk of that memory they operate upon. It may be the pointer, or it may be an offset to be used relative to an extern __shared__ base.

Forbidding different types for extern shared variable does not add any benefit, just forcing users to work around the limitation and resulting in less readable code.

That's where we disagree. I believe that in this case it would be a net benefit to be even more restrictive with extern __shared__ than we are right now and get users to explicitly treat extern __shared__ as externally allocated memory.

It does not take all that much code to make it work and it does result in more robust code. E.g. your example can be rewritten like this:

// sprinkle static_casts as necessary.
void *get_shmem(size_t offset){
  extern __shared__ char shmem[];
  return  &shmem[offset];
}

__device__ int foo(int *a) {
  for (...) *a+=...;
  return *a;
}

__device__ double bar(double *bx) {
  for(...) *bx+=...;
  return *bx;
}

__global__ void k() {
  foo(get_shmem(0));
  //...
  bar(get_shmem(0));
}

Net benefits that I see:

functions are composable now -- one can call foo from bar and vice versa and ensure they don't step on each other's toes.
it's clear that they do operate on the same buffer, when called from k -- arguably that's the place where it matters.
It's easy to change if you need them to work on different sub-buffers.
foo/bar are not limited to working on shared memory only
foo/bar can execute in diverged branches, if given non-overlapping buffers. The original example would potentially fail in interesting ways if k does something like this:

__global__ void k() {
  if (threadIdx.x < 16)
     foo();
  else 
     bar();
}

It's much easier not to open this can of works than clean it up afterwards when you grow more users that depend on it.
I don't think it should be enabled for CUDA, and don't think that it would be a good idea for HIP, either.
Perhaps we need a third opinion from someone with a broader perspective.

@rsmith -- do you have an opinion on what should be done with a tactically useful, but strategically unsound features in general and this CUDA-specific oddity specifically?

tra added a reviewer: rsmith.Feb 4 2020, 2:02 PM

JonChesterfield added a subscriber: JonChesterfield.Mar 5 2020, 8:40 AM

saiislam added a subscriber: saiislam.Mar 5 2020, 8:41 AM

In D73979#1857736, @yaxunl wrote:

BTW this is requested by HIP users, who have similar code for CUDA and HIP. They found it surprised that nvcc allows it but hip-clang does not.

I think I'm one of the HIP users here, but the above change is not what I was hoping for.

I'd like:

__shared__ int x;
__shared__ int y;
__device__ void foo()
{
  assert(&x != &y);
  x = 2 * y;
}

to compile and behave as it does on cuda, i.e. the 'x' variable gets allocated in shared memory for each kernel which accesses it, and so does 'y'.

The 'extern shared' feature where nvcc builds a union out of all the things it sees and the user indexes into it at runtime is totally unappealing. That cuda uses the 'extern' keyword to opt into this magic union also seems undesirable.

In D73979#1907965, @JonChesterfield wrote:
In D73979#1857736, @yaxunl wrote:

BTW this is requested by HIP users, who have similar code for CUDA and HIP. They found it surprised that nvcc allows it but hip-clang does not.

I think I'm one of the HIP users here, but the above change is not what I was hoping for.

I'd like:
__shared__ int x;
__shared__ int y;
__device__ void foo()
{
  assert(&x != &y);
  x = 2 * y;
}
to compile and behave as it does on cuda, i.e. the 'x' variable gets allocated in shared memory for each kernel which accesses it, and so does 'y'.

The 'extern shared' feature where nvcc builds a union out of all the things it sees and the user indexes into it at runtime is totally unappealing. That cuda uses the 'extern' keyword to opt into this magic union also seems undesirable.

Clang emits correct IR for this code. If you use x and y in a kernel directly, amdgcn backend can generate correct ISA where &x and &y are different. What the backend does is to accumulate all shared memory and get the total shared memory usage and assign address to different shared variables. Therefore x and y get different addresses.

Currently amdgcn backend emits a diagnostic message if shared variable is used in non-kernel function:

https://github.com/llvm/llvm-project/blob/6085593c128e91fd7db998c5441ebe120c7e4f04/llvm/lib/Target/AMDGPU/AMDGPUISelLowering.cpp#L1232

https://github.com/llvm/llvm-project/blob/3fda1fde8f7bdf3b90d8700f5a386f63409b4313/llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp#L1952

This is unreasonable if the backend is able to calculate total shared memory usage, so this is a bug. With this bug fixed, you should be able to use shared variables in device functions.

Revision Contents

Path

Size

clang/

lib/

Sema/

SemaDeclAttr.cpp

2 lines

test/

CodeGenCUDA/

extern-shared.cu

35 lines

SemaCUDA/

extern-shared.cu

5 lines

Diff 242368

clang/lib/Sema/SemaDeclAttr.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

	Show First 20 Lines • Show All 4,291 Lines • ▼ Show 20 Lines

	static void handleSharedAttr(Sema &S, Decl *D, const ParsedAttr &AL) {			static void handleSharedAttr(Sema &S, Decl *D, const ParsedAttr &AL) {
	if (checkAttrMutualExclusion<CUDAConstantAttr>(S, D, AL))			if (checkAttrMutualExclusion<CUDAConstantAttr>(S, D, AL))
	return;			return;
	const auto *VD = cast<VarDecl>(D);			const auto *VD = cast<VarDecl>(D);
	// extern __shared__ is only allowed on arrays with no length (e.g.			// extern __shared__ is only allowed on arrays with no length (e.g.
	// "int x[]").			// "int x[]").
	if (!S.getLangOpts().GPURelocatableDeviceCode && VD->hasExternalStorage() &&			if (!S.getLangOpts().GPURelocatableDeviceCode && VD->hasExternalStorage() &&
	!isa<IncompleteArrayType>(VD->getType())) {			!isa<IncompleteArrayType>(VD->getType()) && !S.getLangOpts().HIP) {
	S.Diag(AL.getLoc(), diag::err_cuda_extern_shared) << VD;			S.Diag(AL.getLoc(), diag::err_cuda_extern_shared) << VD;
	return;			return;
	}			}
	if (S.getLangOpts().CUDA && VD->hasLocalStorage() &&			if (S.getLangOpts().CUDA && VD->hasLocalStorage() &&
	S.CUDADiagIfHostCode(AL.getLoc(), diag::err_cuda_host_shared)			S.CUDADiagIfHostCode(AL.getLoc(), diag::err_cuda_host_shared)
	<< S.CurrentCUDATarget())			<< S.CurrentCUDATarget())
	return;			return;
	D->addAttr(::new (S.Context) CUDASharedAttr(S.Context, AL));			D->addAttr(::new (S.Context) CUDASharedAttr(S.Context, AL));
	▲ Show 20 Lines • Show All 3,534 Lines • Show Last 20 Lines

clang/test/CodeGenCUDA/extern-shared.cu

This file was added.

				// REQUIRES: amdgpu-registered-target
				// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa -emit-llvm -o - -fcuda-is-device \
				// RUN: -target-cpu gfx906 -x hip %s \| FileCheck %s
				traUnsubmitted Not Done Reply Inline Actions I'd add a CUDA test run, too to issuestrate what we expect CUDA to handle. tra: I'd add a CUDA test run, too to issuestrate what we expect CUDA to handle.

				#include "Inputs/cuda.h"

				// CHECK: @global = external addrspace(3) global i32, align 4
				// CHECK: @global_arr = external addrspace(3) global [0 x i32], align 4
				// CHECK: @global_arr1 = external addrspace(3) global [1 x i32], align 4
				// CHECK: @global_ptr = external addrspace(3) global i32*, align 8
				// CHECK: @x = external addrspace(3) global i32, align 4
				// CHECK: @arr = external addrspace(3) global [0 x i32], align 4
				// CHECK: @arr1 = external addrspace(3) global [1 x i32], align 4
				// CHECK: @ptr = external addrspace(3) global i32*, align 8

				extern __shared__ int global;
				extern __shared__ int global_arr[];
				extern __shared__ int global_arr1[1];
				extern __shared__ int* global_ptr;

				__global__ void foo() {
				extern __shared__ int x;
				extern __shared__ int arr[];
				extern __shared__ int arr1[1];
				extern __shared__ int* ptr;
				global = 1;
				global_arr[0] = 1;
				global_arr1[0] = 1;
				*global_ptr = 1;
				x = 1;
				arr[0] = 1;
				arr1[0] = 1;
				*ptr = 1;
				}

clang/test/SemaCUDA/extern-shared.cu

	// RUN: %clang_cc1 -fsyntax-only -Wundefined-internal -verify %s			// RUN: %clang_cc1 -fsyntax-only -Wundefined-internal -verify %s
	// RUN: %clang_cc1 -fsyntax-only -Wundefined-internal -fcuda-is-device -verify %s			// RUN: %clang_cc1 -fsyntax-only -Wundefined-internal -fcuda-is-device -verify %s

				// RUN: %clang_cc1 -fsyntax-only -Wundefined-internal -verify=hip -x hip %s
				// RUN: %clang_cc1 -fsyntax-only -Wundefined-internal -fcuda-is-device \
				// RUN: -verify=hip -x hip %s

	// RUN: %clang_cc1 -fsyntax-only -Wundefined-internal -fgpu-rdc -verify=rdc %s			// RUN: %clang_cc1 -fsyntax-only -Wundefined-internal -fgpu-rdc -verify=rdc %s
	// RUN: %clang_cc1 -fsyntax-only -Wundefined-internal -fcuda-is-device -fgpu-rdc -verify=rdc %s			// RUN: %clang_cc1 -fsyntax-only -Wundefined-internal -fcuda-is-device -fgpu-rdc -verify=rdc %s

	// Most of these declarations are fine in separate compilation mode.			// Most of these declarations are fine in separate compilation mode.

				// hip-no-diagnostics
	#include "Inputs/cuda.h"			#include "Inputs/cuda.h"

	__device__ void foo() {			__device__ void foo() {
	extern __shared__ int x; // expected-error {{__shared__ variable 'x' cannot be 'extern'}}			extern __shared__ int x; // expected-error {{__shared__ variable 'x' cannot be 'extern'}}
	extern __shared__ int arr[]; // ok			extern __shared__ int arr[]; // ok
	extern __shared__ int arr0[0]; // expected-error {{__shared__ variable 'arr0' cannot be 'extern'}}			extern __shared__ int arr0[0]; // expected-error {{__shared__ variable 'arr0' cannot be 'extern'}}
	extern __shared__ int arr1[1]; // expected-error {{__shared__ variable 'arr1' cannot be 'extern'}}			extern __shared__ int arr1[1]; // expected-error {{__shared__ variable 'arr1' cannot be 'extern'}}
	extern __shared__ int* ptr ; // expected-error {{__shared__ variable 'ptr' cannot be 'extern'}}			extern __shared__ int* ptr ; // expected-error {{__shared__ variable 'ptr' cannot be 'extern'}}
	Show All 27 Lines