This is an archive of the discontinued LLVM Phabricator instance.

Differential D118084

[CUDA, NVPTX] Pass byval aggregates directly
Changes Planned · Public

Authored by tra on Jan 24 2022, 3:37 PM.

Details

Reviewers

jdoerfert
yaxunl

Summary

Changes the NVPTX ABI to pass aggregates directly. Only clang-generated IR is
affected. The change does not affect the ABI or change function signatures in the
generated PTX.

Discussion: https://llvm.discourse.group/t/nvptx-calling-convention-for-aggregate-arguments-passed-by-value

Currently the NVPTX ABI passes aggregate values indirectly, as a byval pointer. When
we need to pass a *value*, LLVM has to store it in an alloca so that it has a
pointer to pass on. This is a double whammy for NVPTX. LLVM often fails to
eliminate that alloca (usually SROA considers such a pointer escaped and gives
up), and that is a noticeable hit on performance. When we lower IR to PTX, the
argument is actually passed by copy, so we end up having to do more work just to
get the value loaded back from the alloca. So, we do more work for less
performance. Switching to passing aggregates directly allows us to generate
better code.
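
To make the two conventions concrete, here is a small C++ sketch of the kind of function affected. The struct mirrors `float4_s` from `clang/test/CodeGen/nvptx-abi.c`; the function `sum4` and the helper `self_check` are hypothetical names introduced only for illustration. The IR signatures in the comments are adapted from the test expectations in this revision, not taken from a fresh compiler run.

```cpp
#include <cassert>

// Mirrors float4_s from clang/test/CodeGen/nvptx-abi.c.
struct float4_s {
  float x, y, z, w;
};

// A by-value aggregate parameter. Per the test expectations in this
// revision, clang lowers such a parameter for NVPTX roughly as:
//   before: %struct.float4_s* noundef byval(%struct.float4_s) align 4 %v
//   after:  %struct.float4_s %v
// (sum4 is a hypothetical example function, not part of the patch.)
float sum4(float4_s v) {
  return v.x + v.y + v.z + v.w;
}

// Hypothetical helper that exercises the function on the host.
void self_check() {
  float4_s v{1.0f, 2.0f, 3.0f, 4.0f};
  assert(sum4(v) == 10.0f);
}
```

The source code is identical under both conventions; only the IR-level parameter type differs.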

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

tra created this revision. · Jan 24 2022, 3:37 PM

Herald added subscribers: asavonic, bixia. · Jan 24 2022, 3:37 PM

tra requested review of this revision. · Jan 24 2022, 3:37 PM

Herald added a project: Restricted Project. · Jan 24 2022, 3:37 PM

The RFC discussion concluded this should be fine wrt. the interoperability use cases we want to support.
Code change looks good but I have one question.

clang/lib/CodeGen/TargetInfo.cpp
7183	Nit: Maybe a note that this effectively disables passing values via `byval`.
7193	When is this ever hit and should we not disable byval here too while we are at it?

Getting rid of byval helps getting rid of locals in quite a few places, but runs into a new problem. 😕

Looks like this change does have unexpected side-effects.
When we need to dynamically index into a struct passed directly, there's no easy way to do it as extractvalue only supports constant indices.
With byval aggregates LLVM uses GEP which does allow using dynamic indexing.
The alloca would only show up after the nvptx-lower-args pass, and by that time the IR would often be simple enough to eliminate it.
Now, clang generates a local copy early on and indexes into it dynamically with GEP. LLVM then fails to eliminate the local copy because SROA cannot deal with dynamic indices, and that in turn prevents IR optimizations that were possible without the alloca.
https://github.com/llvm/llvm-project/issues/51734
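
As a rough source-level illustration of the problem described above, consider the sketch below. The names `Vec4`, `first`, `pick`, and `self_check` are hypothetical; the comments restate the reasoning from this comment and are not verified against compiler output.

```cpp
#include <cassert>

// Hypothetical aggregate with an array member.
struct Vec4 {
  float v[4];
};

// Constant index: on a directly passed aggregate this can be expressed
// with extractvalue and a fixed index, so no memory copy is needed.
float first(Vec4 s) {
  return s.v[0];
}

// Runtime index: extractvalue only accepts constant indices, so the value
// has to be placed in memory (an alloca) and addressed with a GEP.
float pick(Vec4 s, int i) {
  return s.v[i];
}

// Hypothetical helper that exercises both functions on the host.
void self_check() {
  Vec4 s{{1.0f, 2.0f, 3.0f, 4.0f}};
  assert(first(s) == 1.0f);
  assert(pick(s, 2) == 3.0f);
}
```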

That's rather unfortunate. This regression is serious enough to be a showstopper for my use case.

clang/lib/CodeGen/TargetInfo.cpp
7193	Basically it's saying "pass as byval pointer if it's an int that's larger than what we can lower". Yes, I think passing such integer directly would make sense. We may hit this if clang wants to pass `__i128` (do larger int types exist in clang?). I think (some of) this may be a leftover from the days when we didn't support i128 in CUDA/NVPTX. I think we do now.
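
The condition under discussion can be restated as a standalone predicate. This is only a sketch: `passedIndirectly` and `self_check` are hypothetical names, `numBits` stands in for `EIT->getNumBits()`, and `hasInt128` stands in for `getContext().getTargetInfo().hasInt128Type()`.

```cpp
#include <cassert>

// Hypothetical restatement of the BitIntType check in classifyArgumentType:
// returns true when the integer would still be passed as a byval pointer.
constexpr bool passedIndirectly(unsigned numBits, bool hasInt128) {
  return numBits > 128 || (!hasInt128 && numBits > 64);
}

// Hypothetical helper that spot-checks the boundaries of the condition.
void self_check() {
  assert(!passedIndirectly(128, true));   // fits in i128 when it is supported
  assert(passedIndirectly(129, true));    // wider than 128 bits
  assert(passedIndirectly(65, false));    // wider than 64 bits, no i128
  assert(!passedIndirectly(64, false));   // fits in i64
}
```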

@lebedev.ri wanted to teach SROA how to deal with dynamic indices before, IIRC. It seems to be generally useful. This patch can wait till then?

clang/lib/CodeGen/TargetInfo.cpp
7193	We have larger types, I somewhat doubt using them will work properly everywhere though.

In D118084#3271073, @jdoerfert wrote:

@lebedev.ri wanted to teach SROA how to deal with dynamic indices before, IIRC. It seems to be generally useful.

Interesting. I'd like to hear more.

This patch can wait till then?

Yes.

In D118084#3271110, @tra wrote:

In D118084#3271073, @jdoerfert wrote:

@lebedev.ri wanted to teach SROA how to deal with dynamic indices before, IIRC. It seems to be generally useful.

Interesting. I'd like to hear more.

I guess I, too, would like to hear more about the problem.
My last idea was about allowing splitting

struct {
  int a;
  int b[2];
} a;

into

// not in a struct anymore!
int a;
int b[2];

But given just the int b[2]; I'm not sure what can be done.

This patch can wait till then?

Yes.

In D118084#3272154, @lebedev.ri wrote:
My last idea was about allowing splitting
struct {
  int a;
  int b[2];
} a;
into
// not in a struct anymore!
int a;
int b[2];

This looks like it's a somewhat different problem.

In my case this is what bites me: https://godbolt.org/z/417fMMn6c
It's a variant of this issue: https://github.com/llvm/llvm-project/issues/51734

I have a WIP patch that converts a GEP with a dynamic index with a known range of values into a series of comparisons and fixed-index GEPs. I guess I'll need to get it sorted out first.
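
The WIP patch itself is not shown in this thread and operates on LLVM IR. The sketch below is only a source-level analogue of the idea, under the assumption that the index is known to lie in [0, 3]; all names (`Vec4`, `pick_dynamic`, `pick_expanded`, `self_check`) are hypothetical.

```cpp
#include <cassert>

// Hypothetical aggregate with an array member.
struct Vec4 {
  float v[4];
};

// Straightforward version: a single access with a runtime index.
float pick_dynamic(const Vec4 &s, int i) {
  return s.v[i];
}

// Expanded version: because the index is assumed to be in [0, 3], the access
// is rewritten as fixed-index reads selected by comparisons.
float pick_expanded(const Vec4 &s, int i) {
  float r = s.v[0];
  r = (i == 1) ? s.v[1] : r;
  r = (i == 2) ? s.v[2] : r;
  r = (i == 3) ? s.v[3] : r;
  return r;
}

// Hypothetical helper: both versions agree for every index in range.
void self_check() {
  Vec4 s{{1.0f, 2.0f, 3.0f, 4.0f}};
  for (int i = 0; i < 4; ++i)
    assert(pick_expanded(s, i) == pick_dynamic(s, i));
}
```

Every access in the expanded version uses a constant index, which is the form the earlier comments say SROA can handle.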

Harbormaster completed remote builds in B145353: Diff 402701. · Jan 26 2022, 12:16 PM

gehre added a subscriber: gehre. · Feb 2 2022, 11:47 PM

kovdan01 added a subscriber: kovdan01. · Mar 14 2022, 4:11 AM

Herald added a project: Restricted Project. · Mar 14 2022, 4:11 AM

kovdan01 mentioned this in D120129: [NVPTX] Enhance vectorization of ld.param & st.param. · Mar 17 2022, 3:18 PM

Revision Contents

Path                                                    Size
clang/lib/CodeGen/TargetInfo.cpp                        5 lines
clang/test/CodeGen/nvptx-abi.c                          10 lines
clang/test/CodeGenCUDA/kernel-args-alignment.cu         2 lines
clang/test/CodeGenCUDA/kernel-args.cu                   8 lines
clang/test/OpenMP/nvptx_unsupported_type_codegen.cpp    5 lines

Diff 402701

clang/lib/CodeGen/TargetInfo.cpp

[7,174 lines not shown]
   if (isAggregateTypeForABI(Ty)) {
     if (getContext().getLangOpts().CUDAIsDevice) {
       if (Ty->isCUDADeviceBuiltinSurfaceType())
         return ABIArgInfo::getDirect(
             CGInfo.getCUDADeviceBuiltinSurfaceDeviceType());
       if (Ty->isCUDADeviceBuiltinTextureType())
         return ABIArgInfo::getDirect(
             CGInfo.getCUDADeviceBuiltinTextureDeviceType());
     }
-    return getNaturalAlignIndirect(Ty, /* byval */ true);
+    // We want to pass whole aggregate value as one argument.
+    auto AI = ABIArgInfo::getDirect();
+    AI.setCanBeFlattened(false);
+    return AI;
   }

   if (const auto *EIT = Ty->getAs<BitIntType>()) {
     if ((EIT->getNumBits() > 128) ||
         (!getContext().getTargetInfo().hasInt128Type() &&
          EIT->getNumBits() > 64))
       return getNaturalAlignIndirect(Ty, /* byval */ true);
   }

   return (isPromotableIntegerTypeForABI(Ty) ? ABIArgInfo::getExtend(Ty)
                                             : ABIArgInfo::getDirect());
 }

 void NVPTXABIInfo::computeInfo(CGFunctionInfo &FI) const {
   if (!getCXXABI().classifyReturnType(FI))
[4,305 lines not shown]

Inline comments:
7183	jdoerfert: Nit: Maybe a note that this effectively disables passing values via `byval`.
7193	jdoerfert: When is this ever hit and should we not disable byval here too while we are at it?
7193	tra: Basically it's saying "pass as byval pointer if it's an int that's larger than what we can lower". Yes, I think passing such integer directly would make sense. We may hit this if clang wants to pass `__i128` (do larger int types exist in clang?). I think (some of) this may be a leftover from the days when we didn't support i128 in CUDA/NVPTX. I think we do now.
7193	jdoerfert: We have larger types, I somewhat doubt using them will work properly everywhere though.

clang/test/CodeGen/nvptx-abi.c

[15 lines not shown]
 // CHECK-LABEL: @bar
 // CHECK: call %struct.float4_s @my_function
   ret = my_function();
   return ret.x;
 }

 void foo(float4_t x) {
 // CHECK-LABEL: @foo
-// CHECK: %struct.float4_s* noundef byval(%struct.float4_s) align 4 %x
+// CHECK: (%struct.float4_s %x
 }

 void fooN(float4_t x, float4_t y, float4_t z) {
 // CHECK-LABEL: @fooN
-// CHECK: %struct.float4_s* noundef byval(%struct.float4_s) align 4 %x
-// CHECK: %struct.float4_s* noundef byval(%struct.float4_s) align 4 %y
-// CHECK: %struct.float4_s* noundef byval(%struct.float4_s) align 4 %z
+// CHECK-SAME: %struct.float4_s %x
+// CHECK-SAME: %struct.float4_s %y
+// CHECK-SAME: %struct.float4_s %z
 }

 typedef struct nested_s {
   unsigned long long x;
   float z[64];
   float4_t t;
 } nested_t;

 void baz(nested_t x) {
 // CHECK-LABEL: @baz
-// CHECK: %struct.nested_s* noundef byval(%struct.nested_s) align 8 %x)
+// CHECK: (%struct.nested_s %x
 }

clang/test/CodeGenCUDA/kernel-args-alignment.cu

[30 lines not shown]
 // 1. offset 0, width 1
 // 2. offset 8 (because alignof(S) == 8), width 16
 // 3. offset 24, width 8
 // HOST-OLD: call i32 @cudaSetupArgument({{[^,]*}}, i64 1, i64 0)
 // HOST-OLD: call i32 @cudaSetupArgument({{[^,]*}}, i64 16, i64 8)
 // HOST-OLD: call i32 @cudaSetupArgument({{[^,]*}}, i64 8, i64 24)

 // DEVICE-LABEL: @_Z6kernelc1SPi
-// DEVICE-SAME: i8{{[^,]*}}, %struct.S* noundef byval(%struct.S) align 8{{[^,]*}}, i32*
+// DEVICE-SAME: i8{{[^,]*}}, %struct.S %{{[^,]*}}, i32*
 __global__ void kernel(char a, S s, int *b) {}

clang/test/CodeGenCUDA/kernel-args.cu

 // RUN: %clang_cc1 -x hip -triple amdgcn-amd-amdhsa -fcuda-is-device \
 // RUN: -emit-llvm %s -o - | FileCheck -check-prefix=AMDGCN %s
 // RUN: %clang_cc1 -x cuda -triple nvptx64-nvidia-cuda- -fcuda-is-device \
 // RUN: -emit-llvm %s -o - | FileCheck -check-prefix=NVPTX %s
 #include "Inputs/cuda.h"

 struct A {
   int a[32];
   float *p;
 };

 // AMDGCN: define{{.*}} amdgpu_kernel void @_Z6kernel1A(%struct.A addrspace(4)* byref(%struct.A) align 8 %{{.+}})
-// NVPTX: define{{.*}} void @_Z6kernel1A(%struct.A* noundef byval(%struct.A) align 8 %x)
+// NVPTX: define{{.*}} void @_Z6kernel1A(%struct.A %x
 __global__ void kernel(A x) {
 }

 class Kernel {
 public:
   // AMDGCN: define{{.*}} amdgpu_kernel void @_ZN6Kernel12memberKernelE1A(%struct.A addrspace(4)* byref(%struct.A) align 8 %{{.+}})
-  // NVPTX: define{{.*}} void @_ZN6Kernel12memberKernelE1A(%struct.A* noundef byval(%struct.A) align 8 %x)
+  // NVPTX: define{{.*}} void @_ZN6Kernel12memberKernelE1A(%struct.A %x
   static __global__ void memberKernel(A x){}
   template<typename T> static __global__ void templateMemberKernel(T x) {}
 };


 template <typename T>
 __global__ void templateKernel(T x) {}

 void launch(void*);

 void test() {
   Kernel K;
   // AMDGCN: define{{.*}} amdgpu_kernel void @_Z14templateKernelI1AEvT_(%struct.A addrspace(4)* byref(%struct.A) align 8 %{{.+}}
-  // NVPTX: define{{.*}} void @_Z14templateKernelI1AEvT_(%struct.A* noundef byval(%struct.A) align 8 %x)
+  // NVPTX: define{{.*}} void @_Z14templateKernelI1AEvT_(%struct.A %x
   launch((void*)templateKernel<A>);

   // AMDGCN: define{{.*}} amdgpu_kernel void @_ZN6Kernel20templateMemberKernelI1AEEvT_(%struct.A addrspace(4)* byref(%struct.A) align 8 %{{.+}}
-  // NVPTX: define{{.*}} void @_ZN6Kernel20templateMemberKernelI1AEEvT_(%struct.A* noundef byval(%struct.A) align 8 %x)
+  // NVPTX: define{{.*}} void @_ZN6Kernel20templateMemberKernelI1AEEvT_(%struct.A %x
   launch((void*)Kernel::templateMemberKernel<A>);
 }

clang/test/OpenMP/nvptx_unsupported_type_codegen.cpp

[28 lines not shown]
 struct T1 {
   char c;
   T1() : a(12), f(15) {}
   T1 &operator+(T1 &b) { f += b.a; return *this;}
 };

 #pragma omp declare target
 T a = T();
 T f = a;
-// CHECK: define{{ protected | }}void @{{.+}}foo{{.+}}([[T]]* noundef byval([[T]]) align {{.+}})
+// CHECK: define{{ protected | }}void @{{.+}}foo{{.+}}([[T]] %{{.+}})
 void foo(T a = T()) {
   return;
 }
 // CHECK: define{{ protected | }}[6 x i64] @{{.+}}bar{{.+}}()
 T bar() {
 // CHECK: bitcast [[T]]* %{{.+}} to [6 x i64]*
 // CHECK-NEXT: load [6 x i64], [6 x i64]* %{{.+}},
 // CHECK-NEXT: ret [6 x i64]
   return T();
 }
 // CHECK: define{{ protected | }}void @{{.+}}baz{{.+}}()
 void baz() {
 // CHECK: call [6 x i64] @{{.+}}bar{{.+}}()
 // CHECK-NEXT: bitcast [[T]]* %{{.+}} to [6 x i64]*
 // CHECK-NEXT: store [6 x i64] %{{.+}}, [6 x i64]* %{{.+}},
   T t = bar();
 }
 T1 a1 = T1();
 T1 f1 = a1;
-// CHECK: define{{ protected | }}void @{{.+}}foo1{{.+}}([[T1]]* noundef byval([[T1]]) align {{.+}})
+// CHECK: define{{ protected | }}void @{{.+}}foo1{{.+}}([[T1]] %{{.+}})
 void foo1(T1 a = T1()) {
   return;
 }
 // CHECK: define{{ protected | }}[[T1]] @{{.+}}bar1{{.+}}()
 T1 bar1() {
 // CHECK: load [[T1]], [[T1]]*
 // CHECK-NEXT: ret [[T1]]
   return T1();
 }
 // CHECK: define{{ protected | }}void @{{.+}}baz1{{.+}}()
 void baz1() {
 // CHECK: call [[T1]] @{{.+}}bar1{{.+}}()
   T1 t = bar1();
 }
 #pragma omp end declare target