This is an archive of the discontinued LLVM Phabricator instance.

clang/lib/CodeGen/TargetInfo.cpp
8078	The magic value of 256 should be defined as a constant or macro somewhere -- you're using it in multiple places. Alternatively, always set LangOpts.GPUMaxThreadsPerBlock to something and skip figuring out the default everywhere else.
clang/test/CodeGenCUDA/amdgpu-kernel-attrs.cu
19	Is this the attribute that `__launch_bounds__()` expands to in HIP? If launch_bounds is a separate attribute, then, I guess, it should be tested, too.

In D71221#1791802, @tra wrote:

What's the use case for this flag?

If a kernel is launched with a block size greater than the default max block size, an explicit launch bound is required.

Different projects have different block size usages.

If a project mostly uses block size 1024, it prefers to use 1024 as the default max block size to avoid adding explicit launch bounds to each kernel.

If a project mostly uses block size 256, it prefers to use 256 as the default max block size.

Another situation is early development, when people prefer a default max block size that works for every block size they might launch with. They can simply set the max block size to 1024 with this option. Later on, they can add launch bounds and choose a different max block size.

On the other hand, we cannot simply make 1024 the default max block size, since a large set of existing projects assumes a default max block size of 256. Changing the default to 1024 would cause performance degradation in those projects. Adding this option also provides backward compatibility in case we want to change the default max block size.
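To make the trade-off concrete, here is a minimal C++ sketch of the selection rule discussed above. It is not the Clang implementation: `KernelBounds` and `flatWorkGroupSizeAttr` are hypothetical names introduced only for illustration. The sketch assumes that an explicit bound on a kernel always wins, and that a kernel without one gets `1,<default>`, where the default is 256 unless `--gpu-max-threads-per-block=n` overrides it.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of the bounds attached to one kernel.
// hasExplicit is false when the kernel carries no launch bound.
struct KernelBounds {
  bool hasExplicit = false;
  unsigned min = 0;
  unsigned max = 0;
};

// Hypothetical helper (not a Clang API): builds the value of the
// "amdgpu-flat-work-group-size" attribute for one kernel.
std::string flatWorkGroupSizeAttr(const KernelBounds &bounds,
                                  unsigned defaultMaxThreadsPerBlock) {
  if (bounds.hasExplicit) {
    // An explicit bound on the kernel takes precedence over the default.
    assert(bounds.min <= bounds.max && "Min must be less than or equal Max");
    return std::to_string(bounds.min) + "," + std::to_string(bounds.max);
  }
  // No explicit bound: use the project-wide default maximum.
  return "1," + std::to_string(defaultMaxThreadsPerBlock);
}
```

Under these assumptions, a project that mostly launches 1024-thread blocks would pass 1024 as the default once, instead of annotating every kernel.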

clang/lib/CodeGen/TargetInfo.cpp
8078	The default value of LangOpts.GPUMaxThreadsPerBlock tends to be target dependent. I think we should probably add TargetInfo.getDefaultMaxThreadsPerBlock() and use it to set the default value for LangOpts.GPUMaxThreadsPerBlock.
clang/test/CodeGenCUDA/amdgpu-kernel-attrs.cu
19	yes.

tra added inline comments. Dec 20 2019, 10:30 AM

clang/lib/CodeGen/TargetInfo.cpp
8078	That could be an option. I just want to have an unambiguous source for that number. BTW, does the value need to be hardcoded for OpenCL? I think it would make sense for --gpu-max-threads-per-block=n to control the value for OpenCL, too. Then you would always get the value from LangOpts.GPUMaxThreadsPerBlock and will have only one place where you set the default, which would be OK until we make the default target-specific.

Revised per Artem's comments.

ping

tra accepted this revision. Jan 6 2020, 8:49 AM

This revision is now accepted and ready to land. Jan 6 2020, 8:49 AM

Closed by commit rG9f2d8b5c0cdb: [HIP] Add option --gpu-max-threads-per-block=n (authored by yaxunl). Jan 7 2020, 8:21 AM

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. Jan 7 2020, 8:21 AM

Revision Contents

Path                                            Size
clang/include/clang/Basic/LangOptions.def       1 line
clang/include/clang/Driver/Options.td           3 lines
clang/lib/CodeGen/TargetInfo.cpp                7 lines
clang/lib/Driver/ToolChains/HIP.cpp             8 lines
clang/lib/Frontend/CompilerInvocation.cpp       6 lines
clang/test/CodeGenCUDA/amdgpu-kernel-attrs.cu   22 lines
clang/test/Driver/hip-options.hip               10 lines

Diff 236597

clang/include/clang/Basic/LangOptions.def

[221 unchanged lines not shown]
 LANGOPT(RenderScript , 1, 0, "RenderScript")

 LANGOPT(CUDAIsDevice , 1, 0, "compiling for CUDA device")
 LANGOPT(CUDAAllowVariadicFunctions, 1, 0, "allowing variadic functions in CUDA device code")
 LANGOPT(CUDAHostDeviceConstexpr, 1, 1, "treating unattributed constexpr functions as __host__ __device__")
 LANGOPT(CUDADeviceApproxTranscendentals, 1, 0, "using approximate transcendental functions")
 LANGOPT(GPURelocatableDeviceCode, 1, 0, "generate relocatable device code")
 LANGOPT(GPUAllowDeviceInit, 1, 0, "allowing device side global init functions for HIP")
+LANGOPT(GPUMaxThreadsPerBlock, 32, 256, "default max threads per block for kernel launch bounds for HIP")

 LANGOPT(SYCLIsDevice , 1, 0, "Generate code for SYCL device")

 LANGOPT(HIPUseNewLaunchAPI, 1, 0, "Use new kernel launching API for HIP")

 LANGOPT(SizedDeallocation , 1, 0, "sized deallocation")
 LANGOPT(AlignedAllocation , 1, 0, "aligned allocation")
 LANGOPT(AlignedAllocationUnavailable, 1, 0, "aligned allocation functions are unavailable")
[117 unchanged lines not shown]

clang/include/clang/Driver/Options.td

[600 unchanged lines not shown]
 def fhip_dump_offload_linker_script : Flag<["-"], "fhip-dump-offload-linker-script">,
   Group<f_Group>, Flags<[NoArgumentUnused, HelpHidden]>;
 def fhip_new_launch_api : Flag<["-"], "fhip-new-launch-api">,
   Flags<[CC1Option]>, HelpText<"Use new kernel launching API for HIP.">;
 def fno_hip_new_launch_api : Flag<["-"], "fno-hip-new-launch-api">;
 def fgpu_allow_device_init : Flag<["-"], "fgpu-allow-device-init">,
   Flags<[CC1Option]>, HelpText<"Allow device side init function in HIP">;
 def fno_gpu_allow_device_init : Flag<["-"], "fno-gpu-allow-device-init">;
+def gpu_max_threads_per_block_EQ : Joined<["--"], "gpu-max-threads-per-block=">,
+  Flags<[CC1Option]>,
+  HelpText<"Default max threads per block for kernel launch bounds for HIP">;
 def libomptarget_nvptx_path_EQ : Joined<["--"], "libomptarget-nvptx-path=">, Group<i_Group>,
   HelpText<"Path to libomptarget-nvptx libraries">;
 def dD : Flag<["-"], "dD">, Group<d_Group>, Flags<[CC1Option]>,
   HelpText<"Print macro definitions in -E mode in addition to normal output">;
 def dI : Flag<["-"], "dI">, Group<d_Group>, Flags<[CC1Option]>,
   HelpText<"Print include directives in -E mode in addition to normal output">;
 def dM : Flag<["-"], "dM">, Group<d_Group>, Flags<[CC1Option]>,
   HelpText<"Print macro definitions in -E mode instead of normal output">;
[2,757 unchanged lines not shown]

clang/lib/CodeGen/TargetInfo.cpp

[8,066 unchanged lines not shown; enclosing scope: `if (ReqdWGS || FlatWGS) {`]
     if (Min != 0) {
       assert(Min <= Max && "Min must be less than or equal Max");

       std::string AttrVal = llvm::utostr(Min) + "," + llvm::utostr(Max);
       F->addFnAttr("amdgpu-flat-work-group-size", AttrVal);
     } else
       assert(Max == 0 && "Max must be zero");
   } else if (IsOpenCLKernel || IsHIPKernel) {
-    // By default, restrict the maximum size to 256.
-    F->addFnAttr("amdgpu-flat-work-group-size", "1,256");
+    // By default, restrict the maximum size to a value specified by
+    // --gpu-max-threads-per-block=n or its default value.
+    std::string AttrVal =
+        std::string("1,") + llvm::utostr(M.getLangOpts().GPUMaxThreadsPerBlock);
+    F->addFnAttr("amdgpu-flat-work-group-size", AttrVal);
   }

   if (const auto *Attr = FD->getAttr<AMDGPUWavesPerEUAttr>()) {
     unsigned Min =
         Attr->getMin()->EvaluateKnownConstInt(M.getContext()).getExtValue();
     unsigned Max = Attr->getMax() ? Attr->getMax()
                                         ->EvaluateKnownConstInt(M.getContext())
                                         .getExtValue()
[1,985 unchanged lines not shown]

clang/lib/Driver/ToolChains/HIP.cpp

[301 unchanged lines not shown; enclosing scope: `void HIPToolChain::addClangTargetOptions(`]
   if (DriverArgs.hasFlag(options::OPT_fcuda_approx_transcendentals,
                          options::OPT_fno_cuda_approx_transcendentals, false))
     CC1Args.push_back("-fcuda-approx-transcendentals");

   if (DriverArgs.hasFlag(options::OPT_fgpu_rdc, options::OPT_fno_gpu_rdc,
                          false))
     CC1Args.push_back("-fgpu-rdc");

+  StringRef MaxThreadsPerBlock =
+      DriverArgs.getLastArgValue(options::OPT_gpu_max_threads_per_block_EQ);
+  if (!MaxThreadsPerBlock.empty()) {
+    std::string ArgStr =
+        std::string("--gpu-max-threads-per-block=") + MaxThreadsPerBlock.str();
+    CC1Args.push_back(DriverArgs.MakeArgStringRef(ArgStr));
+  }
+
   if (DriverArgs.hasFlag(options::OPT_fgpu_allow_device_init,
                          options::OPT_fno_gpu_allow_device_init, false))
     CC1Args.push_back("-fgpu-allow-device-init");

   CC1Args.push_back("-fcuda-allow-variadic-functions");

   // Default to "hidden" visibility, as object level linking will not be
   // supported for the foreseeable future.
[152 unchanged lines not shown]

clang/lib/Frontend/CompilerInvocation.cpp

[2,553 unchanged lines not shown; context: `#include "clang/Basic/LangStandards.def"`]
   if (Args.hasArg(OPT_fgpu_allow_device_init)) {
     if (Opts.HIP)
       Opts.GPUAllowDeviceInit = 1;
     else
       Diags.Report(diag::warn_ignored_hip_only_option)
           << Args.getLastArg(OPT_fgpu_allow_device_init)->getAsString(Args);
   }
   Opts.HIPUseNewLaunchAPI = Args.hasArg(OPT_fhip_new_launch_api);
+  if (Opts.HIP)
+    Opts.GPUMaxThreadsPerBlock = getLastArgIntValue(
+        Args, OPT_gpu_max_threads_per_block_EQ, Opts.GPUMaxThreadsPerBlock);
+  else if (Args.hasArg(OPT_gpu_max_threads_per_block_EQ))
+    Diags.Report(diag::warn_ignored_hip_only_option)
+        << Args.getLastArg(OPT_gpu_max_threads_per_block_EQ)->getAsString(Args);

   if (Opts.ObjC) {
     if (Arg *arg = Args.getLastArg(OPT_fobjc_runtime_EQ)) {
       StringRef value = arg->getValue();
       if (Opts.ObjCRuntime.tryParse(value))
         Diags.Report(diag::err_drv_unknown_objc_runtime) << value;
     }

[1,188 unchanged lines not shown]
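The frontend change above reads the option with a "last occurrence wins, otherwise keep the default" rule. The following standalone C++ sketch models only that rule. `lastGpuMaxThreadsPerBlock` is a hypothetical name, not the Clang helper, and its handling of a malformed value (falling back to the default) is an assumption made for this illustration. The HIP-only check and the warning for other languages are not modeled.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the frontend lookup: returns the value of the
// last "--gpu-max-threads-per-block=<n>" argument, or defaultValue when the
// option is absent. Falling back to defaultValue for a malformed value is an
// assumption of this sketch.
unsigned lastGpuMaxThreadsPerBlock(const std::vector<std::string> &args,
                                   unsigned defaultValue) {
  const std::string prefix = "--gpu-max-threads-per-block=";
  const std::string *last = nullptr;
  for (const std::string &arg : args) {
    // Remember the most recent argument that starts with the prefix.
    if (arg.compare(0, prefix.size(), prefix) == 0)
      last = &arg;
  }
  if (!last)
    return defaultValue;

  const std::string value = last->substr(prefix.size());
  // Accept only short, all-digit values so std::stoul cannot throw.
  const bool allDigits =
      !value.empty() && value.size() <= 9 &&
      value.find_first_not_of("0123456789") == std::string::npos;
  return allDigits ? static_cast<unsigned>(std::stoul(value)) : defaultValue;
}
```

Keeping the default as a single parameter mirrors the review request for one unambiguous source of the number 256: callers pass it in once rather than repeating the literal.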

clang/test/CodeGenCUDA/amdgpu-kernel-attrs.cu

 // RUN: %clang_cc1 -triple amdgcn-amd-amdhsa \
-// RUN: -fcuda-is-device -emit-llvm -o - %s | FileCheck %s
+// RUN: -fcuda-is-device -emit-llvm -o - -x hip %s \
+// RUN: | FileCheck -check-prefixes=CHECK,DEFAULT %s
+// RUN: %clang_cc1 -triple amdgcn-amd-amdhsa --gpu-max-threads-per-block=1024 \
+// RUN: -fcuda-is-device -emit-llvm -o - -x hip %s \
+// RUN: | FileCheck -check-prefixes=CHECK,MAX1024 %s
 // RUN: %clang_cc1 -triple nvptx \
 // RUN: -fcuda-is-device -emit-llvm -o - %s | FileCheck %s \
 // RUN: -check-prefix=NAMD
 // RUN: %clang_cc1 -triple x86_64-unknown-linux-gnu -emit-llvm \
-// RUN: -verify -o - %s | FileCheck -check-prefix=NAMD %s
+// RUN: -verify -o - -x hip %s | FileCheck -check-prefix=NAMD %s

 #include "Inputs/cuda.h"

+__global__ void flat_work_group_size_default() {
+// CHECK: define amdgpu_kernel void @_Z28flat_work_group_size_defaultv() [[FLAT_WORK_GROUP_SIZE_DEFAULT:#[0-9]+]]
+}
+
 __attribute__((amdgpu_flat_work_group_size(32, 64))) // expected-no-diagnostics
 __global__ void flat_work_group_size_32_64() {
 // CHECK: define amdgpu_kernel void @_Z26flat_work_group_size_32_64v() [[FLAT_WORK_GROUP_SIZE_32_64:#[0-9]+]]
 }
 __attribute__((amdgpu_waves_per_eu(2))) // expected-no-diagnostics
 __global__ void waves_per_eu_2() {
 // CHECK: define amdgpu_kernel void @_Z14waves_per_eu_2v() [[WAVES_PER_EU_2:#[0-9]+]]
 }
 __attribute__((amdgpu_num_sgpr(32))) // expected-no-diagnostics
 __global__ void num_sgpr_32() {
 // CHECK: define amdgpu_kernel void @_Z11num_sgpr_32v() [[NUM_SGPR_32:#[0-9]+]]
 }
 __attribute__((amdgpu_num_vgpr(64))) // expected-no-diagnostics
 __global__ void num_vgpr_64() {
 // CHECK: define amdgpu_kernel void @_Z11num_vgpr_64v() [[NUM_VGPR_64:#[0-9]+]]
 }

 // Make sure this is silently accepted on other targets.
 // NAMD-NOT: "amdgpu-flat-work-group-size"
 // NAMD-NOT: "amdgpu-waves-per-eu"
 // NAMD-NOT: "amdgpu-num-vgpr"
 // NAMD-NOT: "amdgpu-num-sgpr"

-// CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="32,64"
-// CHECK-DAG: attributes [[WAVES_PER_EU_2]] = { convergent noinline nounwind optnone "amdgpu-waves-per-eu"="2"
-// CHECK-DAG: attributes [[NUM_SGPR_32]] = { convergent noinline nounwind optnone "amdgpu-num-sgpr"="32"
-// CHECK-DAG: attributes [[NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-num-vgpr"="64"
+// DEFAULT-DAG: attributes [[FLAT_WORK_GROUP_SIZE_DEFAULT]] = {{.*}}"amdgpu-flat-work-group-size"="1,256"
+// MAX1024-DAG: attributes [[FLAT_WORK_GROUP_SIZE_DEFAULT]] = {{.*}}"amdgpu-flat-work-group-size"="1,1024"
+// CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64]] = {{.*}}"amdgpu-flat-work-group-size"="32,64"
+// CHECK-DAG: attributes [[WAVES_PER_EU_2]] = {{.*}}"amdgpu-waves-per-eu"="2"
+// CHECK-DAG: attributes [[NUM_SGPR_32]] = {{.*}}"amdgpu-num-sgpr"="32"
+// CHECK-DAG: attributes [[NUM_VGPR_64]] = {{.*}}"amdgpu-num-vgpr"="64"

clang/test/Driver/hip-options.hip

This file was added.

+// REQUIRES: clang-driver
+// REQUIRES: x86-registered-target
+// REQUIRES: amdgpu-registered-target
+
+// RUN: %clang -### -x hip --gpu-max-threads-per-block=1024 %s 2>&1 | FileCheck %s
+
+// Check that there are commands for both host- and device-side compilations.
+//
+// CHECK: clang{{.*}}" "-cc1" {{.*}} "-fcuda-is-device"
+// CHECK-SAME: "--gpu-max-threads-per-block=1024"

[HIP] Add option --gpu-max-threads-per-block=n (Closed, Public)
