This is an archive of the discontinued LLVM Phabricator instance.


Differential D154822

[clang] Support '-fgpu-default-stream=per-thread' for NVIDIA CUDA
Closed, Public

Authored by boxu-zhang on Jul 10 2023, 2:08 AM.


Details

Reviewers

rjmccall
tra
jansvoboda11

Commits

rGf05b58a9468c: [clang] Support '-fgpu-default-stream=per-thread' for NVIDIA CUDA

Summary

I'm using clang to compile CUDA code and found that clang doesn't support the per-thread default stream option for NVIDIA CUDA. I don't know whether there is another solution.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

boxu-zhang created this revision. Jul 10 2023, 2:08 AM

Herald added a project: Restricted Project. Jul 10 2023, 2:08 AM

Herald added subscribers: mattd, yaxunl.

boxu-zhang requested review of this revision. Jul 10 2023, 2:08 AM

Herald added a subscriber: cfe-commits.

boxu-zhang edited the summary of this revision. Jul 10 2023, 2:20 AM

boxu-zhang added reviewers: jansvoboda11, rjmccall.

Harbormaster completed remote builds in B244072: Diff 538546. Jul 10 2023, 2:36 AM

Looking at CUDA headers, it appears that changing only compiler-generated-glue may be insufficient. A lot of other CUDA API calls need to be changed to _ptsz variant and for that we need to have CUDA_API_PER_THREAD_DEFAULT_STREAM defined.
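To illustrate the concern in the comment above, here is a minimal sketch of how a header can redirect an API name to a per-thread variant when a macro is defined. The names `demoMemcpyAsync` and `demoMemcpyAsync_ptsz` are made up for this example and are not the real CUDA declarations, and the macro is defined inside the snippet only so the remapping is visible. It is meant to show why changing the compiler-generated launch code alone would not affect calls that go through such headers.

```cpp
#include <string>

// Normally supplied by the compiler or a -D flag; defined here only so the
// sketch exercises the per-thread path.
#define CUDA_API_PER_THREAD_DEFAULT_STREAM

// Hypothetical stand-ins for a legacy entry point and its per-thread variant.
inline std::string demoMemcpyAsync() { return "legacy default stream"; }
inline std::string demoMemcpyAsync_ptsz() { return "per-thread default stream"; }

// Header-style remapping: once the macro is defined, calls spelled with the
// plain name are redirected to the _ptsz variant by the preprocessor.
#if defined(CUDA_API_PER_THREAD_DEFAULT_STREAM)
#define demoMemcpyAsync demoMemcpyAsync_ptsz
#endif

// User code keeps using the plain name and picks up the variant implicitly.
inline std::string copyOnDefaultStream() { return demoMemcpyAsync(); }
```

With the macro removed, `copyOnDefaultStream()` would call the legacy function instead, which is the mismatch the comment warns about.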

jansvoboda11 resigned from this revision. Jul 10 2023, 9:33 AM

Append 'CUDA_API_PER_THREAD_DEFAULT_STREAM' as a defined macro

In D154822#4485700, @tra wrote:

Looking at CUDA headers, it appears that changing only compiler-generated-glue may be insufficient. A lot of other CUDA API calls need to be changed to _ptsz variant and for that we need to have CUDA_API_PER_THREAD_DEFAULT_STREAM defined.

CUDA_API_PER_THREAD_DEFAULT_STREAM is defined now.

Harbormaster completed remote builds in B244328: Diff 538894. Jul 10 2023, 8:18 PM

Another thing I don't understand is why the libcxx CI failed. Does anyone know the reason for this message?
"
Running global pre-checkout hook
Preparing working directory
Running global post-checkout hook
Running commands
$ trap 'kill -- $$' INT TERM QUIT; libcxx/utils/ci/generate-buildkite-pipeline | buildkite-agent pipeline upload
2023-07-11 02:36:45 INFO Reading pipeline config from STDIN
2023-07-11 02:36:46 INFO Updating BUILDKITE_COMMIT to "a297905cd83911c8a03f060cb9d96bc99aae3f8c"
2023-07-11 02:36:46 FATAL Pipeline parsing of "(stdin)" failed (Expected identifier to start with a letter, got ')
🚨 Error: The command exited with status 1
user command error: exit status 1
"

Add component 'clang' in commit message

Harbormaster completed remote builds in B244334: Diff 538904. Jul 10 2023, 9:20 PM

tra accepted this revision. Jul 11 2023, 11:50 AM

This revision is now accepted and ready to land. Jul 11 2023, 11:50 AM

I don't have permission to push to the main branch. Can anyone push this?

Can anyone push this?

I can help with this. How do you want your commit to be attributed? The patch currently has boxu.zhang <boxu.zhang@hotmail.com>. Do you want it to be changed to something else?

In D154822#4494287, @tra wrote:

Can anyone push this?

I can help with this. How do you want your commit to be attributed? The patch currently has boxu.zhang <boxu.zhang@hotmail.com>. Do you want it to be changed to something else?

Just use that; it's my real name and email address. Thanks for your help. :)

This revision was landed with ongoing or failed builds. Jul 13 2023, 4:55 PM

Closed by commit rGf05b58a9468c: [clang] Support '-fgpu-default-stream=per-thread' for NVIDIA CUDA (authored by boxu-zhang, committed by tra).

This revision was automatically updated to reflect the committed changes.

tra added a commit: rGf05b58a9468c: [clang] Support '-fgpu-default-stream=per-thread' for NVIDIA CUDA.

Revision Contents

Path                                       Size
clang/lib/CodeGen/CGCUDANV.cpp             10 lines
clang/lib/Frontend/InitPreprocessor.cpp    3 lines
clang/test/CodeGenCUDA/Inputs/cuda.h       4 lines
clang/test/CodeGenCUDA/kernel-call.cu      4 lines

Diff 540228

clang/lib/CodeGen/CGCUDANV.cpp

@@ void CGNVCUDARuntime::emitDeviceStubBodyNew(CodeGenFunction &CGF, @@
   // void **args, size_t sharedMem,
   // cudaStream_t stream);
   // hipError_t hipLaunchKernel[_spt](const void *func, dim3 gridDim,
   // dim3 blockDim, void **args,
   // size_t sharedMem, hipStream_t stream);
   TranslationUnitDecl *TUDecl = CGM.getContext().getTranslationUnitDecl();
   DeclContext *DC = TranslationUnitDecl::castToDeclContext(TUDecl);
   std::string KernelLaunchAPI = "LaunchKernel";
-  if (CGF.getLangOpts().HIP && CGF.getLangOpts().GPUDefaultStream ==
-                                   LangOptions::GPUDefaultStreamKind::PerThread)
-    KernelLaunchAPI = KernelLaunchAPI + "_spt";
+  if (CGF.getLangOpts().GPUDefaultStream ==
+      LangOptions::GPUDefaultStreamKind::PerThread) {
+    if (CGF.getLangOpts().HIP)
+      KernelLaunchAPI = KernelLaunchAPI + "_spt";
+    else if (CGF.getLangOpts().CUDA)
+      KernelLaunchAPI = KernelLaunchAPI + "_ptsz";
+  }
   auto LaunchKernelName = addPrefixToName(KernelLaunchAPI);
   IdentifierInfo &cudaLaunchKernelII =
       CGM.getContext().Idents.get(LaunchKernelName);
   FunctionDecl *cudaLaunchKernelFD = nullptr;
   for (auto *Result : DC->lookup(&cudaLaunchKernelII)) {
     if (FunctionDecl *FD = dyn_cast<FunctionDecl>(Result))
       cudaLaunchKernelFD = FD;
   }
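The suffix selection in the diff above can be sketched as a standalone function. This is only an illustration: `DemoLangOpts` and `demoLaunchKernelName` are hypothetical names, and the "hip"/"cuda" prefix is an assumption inferred from the call names the test expects (`hipLaunchKernel_spt`, `cudaLaunchKernel_ptsz`), not a copy of `addPrefixToName`.

```cpp
#include <string>

// Hypothetical stand-in for the language options consulted in the patch.
struct DemoLangOpts {
  bool HIP = false;
  bool CUDA = false;
  bool PerThreadDefaultStream = false;
};

// Mirrors the suffix selection in the diff: "_spt" for HIP, "_ptsz" for CUDA,
// and no suffix when the default stream is not per-thread.
std::string demoLaunchKernelName(const DemoLangOpts &Opts) {
  std::string KernelLaunchAPI = "LaunchKernel";
  if (Opts.PerThreadDefaultStream) {
    if (Opts.HIP)
      KernelLaunchAPI += "_spt";
    else if (Opts.CUDA)
      KernelLaunchAPI += "_ptsz";
  }
  // Assumption: the runtime prefix is "hip" or "cuda", as the expected call
  // names in the test suggest.
  return (Opts.HIP ? "hip" : "cuda") + KernelLaunchAPI;
}
```

Before the patch, the per-thread check was combined with the HIP check, so a CUDA compilation would always fall through to the unsuffixed name.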

clang/lib/Frontend/InitPreprocessor.cpp

@@ static void InitializeStandardPredefinedMacros(const TargetInfo &TI, @@
   // Not "standard" per se, but available even with the -undef flag.
   if (LangOpts.AsmPreprocessor)
     Builder.defineMacro("__ASSEMBLER__");
   if (LangOpts.CUDA) {
     if (LangOpts.GPURelocatableDeviceCode)
       Builder.defineMacro("__CLANG_RDC__");
     if (!LangOpts.HIP)
       Builder.defineMacro("__CUDA__");
+    if (LangOpts.GPUDefaultStream ==
+        LangOptions::GPUDefaultStreamKind::PerThread)
+      Builder.defineMacro("CUDA_API_PER_THREAD_DEFAULT_STREAM");
   }
   if (LangOpts.HIP) {
     Builder.defineMacro("__HIP__");
     Builder.defineMacro("__HIPCC__");
     Builder.defineMacro("__HIP_MEMORY_SCOPE_SINGLETHREAD", "1");
     Builder.defineMacro("__HIP_MEMORY_SCOPE_WAVEFRONT", "2");
     Builder.defineMacro("__HIP_MEMORY_SCOPE_WORKGROUP", "3");
     Builder.defineMacro("__HIP_MEMORY_SCOPE_AGENT", "4");
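As a rough model of the CUDA block in the hunk above, the following sketch collects the macro names that block would define for a given set of options. `DemoOpts` and `demoPredefinedMacros` are hypothetical and only mirror the structure shown in the diff. They do not use clang's `MacroBuilder`, and the HIP-specific macros are left out.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the handful of options the hunk consults.
struct DemoOpts {
  bool CUDA = false;
  bool HIP = false;
  bool RelocatableDeviceCode = false;
  bool PerThreadDefaultStream = false;
};

// Returns the macro names the CUDA block of the hunk would define, in order.
std::vector<std::string> demoPredefinedMacros(const DemoOpts &Opts) {
  std::vector<std::string> Macros;
  if (Opts.CUDA) {
    if (Opts.RelocatableDeviceCode)
      Macros.push_back("__CLANG_RDC__");
    if (!Opts.HIP)
      Macros.push_back("__CUDA__");
    // The line added by the patch: expose the per-thread choice to headers.
    if (Opts.PerThreadDefaultStream)
      Macros.push_back("CUDA_API_PER_THREAD_DEFAULT_STREAM");
  }
  return Macros;
}
```

Note that the new check sits inside the `LangOpts.CUDA` block but outside the `!LangOpts.HIP` branch, so the sketch keeps it independent of the `HIP` flag as well.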

clang/test/CodeGenCUDA/Inputs/cuda.h

@@ extern "C" int cudaConfigureCall(dim3 gridSize, dim3 blockSize, @@
                                  size_t sharedSize = 0,
                                  cudaStream_t stream = 0);
 extern "C" int __cudaPushCallConfiguration(dim3 gridSize, dim3 blockSize,
                                            size_t sharedSize = 0,
                                            cudaStream_t stream = 0);
 extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 gridDim,
                                         dim3 blockDim, void **args,
                                         size_t sharedMem, cudaStream_t stream);
+extern "C" cudaError_t cudaLaunchKernel_ptsz(const void *func, dim3 gridDim,
+                                        dim3 blockDim, void **args,
+                                        size_t sharedMem, cudaStream_t stream);
+
 #endif

 extern "C" __device__ int printf(const char*, ...);

clang/test/CodeGenCUDA/kernel-call.cu

 // RUN: %clang_cc1 -target-sdk-version=8.0 -emit-llvm %s -o - \
 // RUN: | FileCheck %s --check-prefixes=CUDA-OLD,CHECK
 // RUN: %clang_cc1 -target-sdk-version=9.2 -emit-llvm %s -o - \
 // RUN: | FileCheck %s --check-prefixes=CUDA-NEW,CHECK
+// RUN: %clang_cc1 -target-sdk-version=9.2 -emit-llvm %s -o - \
+// RUN: -fgpu-default-stream=per-thread -DCUDA_API_PER_THREAD_DEFAULT_STREAM \
+// RUN: | FileCheck %s --check-prefixes=CUDA-PTH,CHECK
 // RUN: %clang_cc1 -x hip -emit-llvm %s -o - \
 // RUN: | FileCheck %s --check-prefixes=HIP-OLD,CHECK
 // RUN: %clang_cc1 -fhip-new-launch-api -x hip -emit-llvm %s -o - \
 // RUN: | FileCheck %s --check-prefixes=HIP-NEW,LEGACY,CHECK
 // RUN: %clang_cc1 -fhip-new-launch-api -x hip -emit-llvm %s -o - \
 // RUN: -fgpu-default-stream=legacy \
 // RUN: | FileCheck %s --check-prefixes=HIP-NEW,LEGACY,CHECK
 // RUN: %clang_cc1 -fhip-new-launch-api -x hip -emit-llvm %s -o - \
 // RUN: -fgpu-default-stream=per-thread -DHIP_API_PER_THREAD_DEFAULT_STREAM \
 // RUN: | FileCheck %s --check-prefixes=HIP-NEW,PTH,CHECK

 #include "Inputs/cuda.h"

 // CHECK-LABEL: define{{.*}}g1
 // HIP-OLD: call{{.*}}hipSetupArgument
 // HIP-OLD: call{{.*}}hipLaunchByPtr
 // HIP-NEW: call{{.*}}__hipPopCallConfiguration
 // LEGACY: call{{.*}}hipLaunchKernel
 // PTH: call{{.*}}hipLaunchKernel_spt
 // CUDA-OLD: call{{.*}}cudaSetupArgument
 // CUDA-OLD: call{{.*}}cudaLaunch
 // CUDA-NEW: call{{.*}}__cudaPopCallConfiguration
 // CUDA-NEW: call{{.*}}cudaLaunchKernel
+// CUDA-PTH: call{{.*}}cudaLaunchKernel_ptsz
 __global__ void g1(int x) {}

 // CHECK-LABEL: define{{.*}}main
 int main(void) {
 // HIP-OLD: call{{.*}}hipConfigureCall
 // HIP-NEW: call{{.*}}__hipPushCallConfiguration
 // CUDA-OLD: call{{.*}}cudaConfigureCall
 // CUDA-NEW: call{{.*}}__cudaPushCallConfiguration
 // CHECK: icmp
 // CHECK: br
 // CHECK: call{{.*}}g1
 g1<<<1, 1>>>(42);
 }
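
For readers unfamiliar with the check lines in the test above, a directive such as `call{{.*}}cudaLaunchKernel_ptsz` behaves roughly like a regular-expression search within a line of output. The sketch below approximates that with `std::regex_search`. `demoLineMatches` is a hypothetical helper, and the IR-like strings in the usage note are invented for illustration rather than taken from real compiler output.

```cpp
#include <regex>
#include <string>

// Returns true if Pattern (an ECMAScript regex) matches somewhere in Line.
// This only approximates how one check directive is matched against a line;
// it does not model check ordering or label handling.
bool demoLineMatches(const std::string &Line, const std::string &Pattern) {
  return std::regex_search(Line, std::regex(Pattern));
}
```

For example, `demoLineMatches("%r = call i32 @cudaLaunchKernel_ptsz(ptr %f)", "call.*cudaLaunchKernel_ptsz")` is true, while the same pattern does not match a line that calls plain `@cudaLaunchKernel(`. The unsuffixed pattern `call.*cudaLaunchKernel` matches both lines, which is presumably why the new run line uses its own `CUDA-PTH` prefix.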