This is an archive of the discontinued LLVM Phabricator instance.

clang/AMDGPU: Apply workgroup related attributes to all functions
Abandoned · Public

Authored by arsenm on Oct 16 2020, 11:51 AM.


Details

Reviewers

yaxunl
rampitec
b-sumner

Summary

When the default flat work group size is 256, it should also apply to
callable functions.

Event Timeline

arsenm created this revision. Oct 16 2020, 11:51 AM

Herald added subscribers: kerbowa, t-tye, tpr and 4 others. Oct 16 2020, 11:51 AM

arsenm requested review of this revision. Oct 16 2020, 11:51 AM

Herald added a subscriber: wdng. Oct 16 2020, 11:51 AM

What if a device function is called by kernels with different work group sizes, will caller's work group size override callee's work group size?

In D89582#2335574, @yaxunl wrote:

> What if a device function is called by kernels with different work group sizes, will caller's work group size override callee's work group size?

It's user error to call a function with a larger range than the caller

In D89582#2335619, @arsenm wrote:

> It's user error to call a function with a larger range than the caller

The problem is that the user can override the default on a kernel with the attribute, but cannot do so on a function. So a module can be compiled with a default smaller than the size requested on one of its kernels.

If the default were the maximum of 1024 and could only be overridden with the --gpu-max-threads-per-block option, this would not be a problem. The description of the option says otherwise, though:

LANGOPT(GPUMaxThreadsPerBlock, 32, 256, "default max threads per block for kernel launch bounds for HIP")

That is, it describes a "default", so it should be perfectly legal to set a higher limit on a specific kernel. If the option said that it restricts the maximum, it would be legal to apply it to functions as well.
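The mismatch described above can be sketched in a few lines of C++. This is only an illustration: the `FlatWorkGroupSize` struct and the `calleeCoversCaller` helper are hypothetical names, and treating "safe to call" as range containment is an assumption drawn from the comments in this thread, not a check that clang or the backend performs.

```cpp
#include <cassert>

// Hypothetical model of an "amdgpu-flat-work-group-size"="Min,Max" range.
struct FlatWorkGroupSize {
  unsigned Min;
  unsigned Max;
};

// Assumption: a callee compiled for range `Callee` is only safe to call from
// a kernel whose range `Caller` lies entirely inside the callee's range.
bool calleeCoversCaller(FlatWorkGroupSize Caller, FlatWorkGroupSize Callee) {
  assert(Caller.Min <= Caller.Max && Callee.Min <= Callee.Max);
  return Callee.Min <= Caller.Min && Caller.Max <= Callee.Max;
}
```

With a module default of 1,256 applied to a device function, a kernel that keeps the default is covered, while a kernel whose attribute raises its maximum to 1024 is not. That second case is the one the patch would turn into a miscompile risk.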

In D89582#2335671, @rampitec wrote:

> The problem is that the user can override the default on a kernel with the attribute, but cannot do so on a function. So a module can be compiled with a default smaller than the size requested on one of its kernels.
>
> If the default were the maximum of 1024 and could only be overridden with the --gpu-max-threads-per-block option, this would not be a problem. The description of the option says otherwise, though:
>
> LANGOPT(GPUMaxThreadsPerBlock, 32, 256, "default max threads per block for kernel launch bounds for HIP")
>
> That is, it describes a "default", so it should be perfectly legal to set a higher limit on a specific kernel. If the option said that it restricts the maximum, it would be legal to apply it to functions as well.

The current backend default ends up greatly restricting the registers used in functions, which increases spilling.

In D89582#2335704, @arsenm wrote:

> The current backend default ends up greatly restricting the registers used in functions, which increases spilling.

I know the problem, but it would be better to use AMDGPUPropagateAttributes for this. It will clone functions if needed.
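The idea of propagating a kernel's bound to its callees and cloning on conflict can be illustrated with a toy model. This is a sketch only, not the AMDGPUPropagateAttributes pass: the `Kernel` struct, the `propagateWithCloning` function and the `.clone.N` naming are all hypothetical, and only direct callees are considered.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy kernel description: its maximum flat work-group size and the device
// functions it calls directly.
struct Kernel {
  std::string Name;
  unsigned MaxWorkGroupSize;
  std::vector<std::string> Callees;
};

using CalleeAtSize = std::pair<std::string, unsigned>;

// Maps (callee, inherited max size) to the name of the copy compiled for that
// size. The first size seen keeps the original name; every further distinct
// size gets its own clone.
std::map<CalleeAtSize, std::string>
propagateWithCloning(const std::vector<Kernel> &Kernels) {
  std::map<CalleeAtSize, std::string> Versions;
  std::map<std::string, unsigned> VersionCount;
  for (const Kernel &K : Kernels) {
    assert(K.MaxWorkGroupSize != 0 && "kernel must have a bound");
    for (const std::string &Callee : K.Callees) {
      CalleeAtSize Key(Callee, K.MaxWorkGroupSize);
      if (Versions.count(Key))
        continue; // a copy with this bound already exists; reuse it
      unsigned N = VersionCount[Callee]++;
      Versions[Key] =
          N == 0 ? Callee : Callee + ".clone." + std::to_string(N);
    }
  }
  return Versions;
}
```

In this model, a helper called from two kernels with a maximum of 256 and one kernel with a maximum of 1024 ends up with two copies, one per bound, so neither caller is forced onto the other's register budget.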

rampitec added a reviewer: b-sumner. Oct 19 2020, 8:18 AM

arsenm abandoned this revision. Jan 12 2021, 12:09 PM

Revision Contents

Path                                             Size
clang/lib/CodeGen/TargetInfo.cpp                 4 lines
clang/test/CodeGenCUDA/amdgpu-kernel-attrs.cu    8 lines
clang/test/CodeGenOpenCL/amdgpu-attrs.cl         2 lines

Diff 298704

clang/lib/CodeGen/TargetInfo.cpp


@@ ... @@ void AMDGPUTargetCodeGenInfo::setTargetAttributes(
   const bool IsOpenCLKernel = M.getLangOpts().OpenCL &&
                               FD->hasAttr<OpenCLKernelAttr>();
   const bool IsHIPKernel = M.getLangOpts().HIP &&
                            FD->hasAttr<CUDAGlobalAttr>();
   if ((IsOpenCLKernel || IsHIPKernel) &&
       (M.getTriple().getOS() == llvm::Triple::AMDHSA))
     F->addFnAttr("amdgpu-implicitarg-num-bytes", "56");

-  if (IsHIPKernel)
+  if (M.getLangOpts().HIP)
     F->addFnAttr("uniform-work-group-size", "true");


   const auto *FlatWGS = FD->getAttr<AMDGPUFlatWorkGroupSizeAttr>();
   if (ReqdWGS || FlatWGS) {
     unsigned Min = 0;
     unsigned Max = 0;
     if (FlatWGS) {
   [9 unchanged lines collapsed]

     if (Min != 0) {
       assert(Min <= Max && "Min must be less than or equal Max");

       std::string AttrVal = llvm::utostr(Min) + "," + llvm::utostr(Max);
       F->addFnAttr("amdgpu-flat-work-group-size", AttrVal);
     } else
       assert(Max == 0 && "Max must be zero");
-  } else if (IsOpenCLKernel || IsHIPKernel) {
+  } else {
     // By default, restrict the maximum size to a value specified by
     // --gpu-max-threads-per-block=n or its default value.
     std::string AttrVal =
         std::string("1,") + llvm::utostr(M.getLangOpts().GPUMaxThreadsPerBlock);
     F->addFnAttr("amdgpu-flat-work-group-size", AttrVal);
   }

   if (const auto *Attr = FD->getAttr<AMDGPUWavesPerEUAttr>()) {
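The effect of the change to the final `else` branch can be summarised with a small stand-alone model. This is a simplified sketch, not the clang code: `FunctionInfo`, `flatWorkGroupSizeAttr` and the `ApplyToAllFunctions` switch are hypothetical stand-ins for the declaration attributes and for the before/after behaviour of the patch, and an empty string stands for "no attribute emitted".

```cpp
#include <cassert>
#include <string>

// Hypothetical, flattened view of what the real code reads from the
// declaration: whether it is a kernel and whether it carries an explicit
// work-group size attribute.
struct FunctionInfo {
  bool IsKernel;
  bool HasExplicitRange;
  unsigned Min;
  unsigned Max;
};

// Returns the value of "amdgpu-flat-work-group-size", or "" if none is set.
// ApplyToAllFunctions = false models the code before the patch,
// ApplyToAllFunctions = true models the code after it.
std::string flatWorkGroupSizeAttr(const FunctionInfo &F, unsigned DefaultMax,
                                  bool ApplyToAllFunctions) {
  if (F.HasExplicitRange) {
    assert(F.Min <= F.Max && "Min must be less than or equal Max");
    return std::to_string(F.Min) + "," + std::to_string(F.Max);
  }
  // Default: 1 up to --gpu-max-threads-per-block (256 unless overridden).
  if (F.IsKernel || ApplyToAllFunctions)
    return "1," + std::to_string(DefaultMax);
  return "";
}
```

Before the patch a plain device function gets no attribute and falls back to the backend default; after it, the function receives the same `1,256` (or `1,1024`) range as a kernel without an explicit attribute, which is what the updated tests check.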

clang/test/CodeGenCUDA/amdgpu-kernel-attrs.cu

@@ ... @@
 [10 unchanged lines collapsed]
 // RUN: -verify -o - -x hip %s | FileCheck -check-prefix=NAMD %s

 #include "Inputs/cuda.h"

 __global__ void flat_work_group_size_default() {
 // CHECK: define amdgpu_kernel void @_Z28flat_work_group_size_defaultv() [[FLAT_WORK_GROUP_SIZE_DEFAULT:#[0-9]+]]
 }

+__device__ void func_flat_work_group_size_default() {
+// CHECK: define void @_Z33func_flat_work_group_size_defaultv() [[FLAT_WORK_GROUP_SIZE_DEFAULT_FUNC:#[0-9]+]]
+}
+
 __attribute__((amdgpu_flat_work_group_size(32, 64))) // expected-no-diagnostics
 __global__ void flat_work_group_size_32_64() {
 // CHECK: define amdgpu_kernel void @_Z26flat_work_group_size_32_64v() [[FLAT_WORK_GROUP_SIZE_32_64:#[0-9]+]]
 }
 __attribute__((amdgpu_waves_per_eu(2))) // expected-no-diagnostics
 __global__ void waves_per_eu_2() {
 // CHECK: define amdgpu_kernel void @_Z14waves_per_eu_2v() [[WAVES_PER_EU_2:#[0-9]+]]
 }
 __attribute__((amdgpu_num_sgpr(32))) // expected-no-diagnostics
 __global__ void num_sgpr_32() {
 // CHECK: define amdgpu_kernel void @_Z11num_sgpr_32v() [[NUM_SGPR_32:#[0-9]+]]
 }
 __attribute__((amdgpu_num_vgpr(64))) // expected-no-diagnostics
 __global__ void num_vgpr_64() {
 // CHECK: define amdgpu_kernel void @_Z11num_vgpr_64v() [[NUM_VGPR_64:#[0-9]+]]
 }

 // Make sure this is silently accepted on other targets.
 // NAMD-NOT: "amdgpu-flat-work-group-size"
 // NAMD-NOT: "amdgpu-waves-per-eu"
 // NAMD-NOT: "amdgpu-num-vgpr"
 // NAMD-NOT: "amdgpu-num-sgpr"

 // DEFAULT-DAG: attributes [[FLAT_WORK_GROUP_SIZE_DEFAULT]] = {{.*}}"amdgpu-flat-work-group-size"="1,256"{{.*}}"uniform-work-group-size"="true"
+// DEFAULT-DAG: attributes [[FLAT_WORK_GROUP_SIZE_DEFAULT_FUNC]] = {{.*}}"amdgpu-flat-work-group-size"="1,256"{{.*}}"uniform-work-group-size"="true"

 // MAX1024-DAG: attributes [[FLAT_WORK_GROUP_SIZE_DEFAULT]] = {{.*}}"amdgpu-flat-work-group-size"="1,1024"
+// MAX1024-DAG: attributes [[FLAT_WORK_GROUP_SIZE_DEFAULT_FUNC]] = {{.*}}"amdgpu-flat-work-group-size"="1,1024"

 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64]] = {{.*}}"amdgpu-flat-work-group-size"="32,64"
 // CHECK-DAG: attributes [[WAVES_PER_EU_2]] = {{.*}}"amdgpu-waves-per-eu"="2"
 // CHECK-DAG: attributes [[NUM_SGPR_32]] = {{.*}}"amdgpu-num-sgpr"="32"
 // CHECK-DAG: attributes [[NUM_VGPR_64]] = {{.*}}"amdgpu-num-vgpr"="64"

clang/test/CodeGenOpenCL/amdgpu-attrs.cl

@@ ... @@
 [184 unchanged lines collapsed]
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_NUM_SGPR_32]] = {{.*}} "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-waves-per-eu"="2"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_NUM_VGPR_64]] = {{.*}} "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-vgpr"="64" "amdgpu-waves-per-eu"="2"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_4_NUM_SGPR_32]] = {{.*}} "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-waves-per-eu"="2,4"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_4_NUM_VGPR_64]] = {{.*}} "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-vgpr"="64" "amdgpu-waves-per-eu"="2,4"

 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_NUM_SGPR_32_NUM_VGPR_64]] = {{.*}} "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-num-vgpr"="64" "amdgpu-waves-per-eu"="2"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_4_NUM_SGPR_32_NUM_VGPR_64]] = {{.*}} "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-num-vgpr"="64" "amdgpu-waves-per-eu"="2,4"

-// CHECK-DAG: attributes [[A_FUNCTION]] = {{.*}}
+// CHECK-DAG: attributes [[A_FUNCTION]] = {{.*}} "amdgpu-flat-work-group-size"="1,256"
 // CHECK-DAG: attributes [[DEFAULT_KERNEL_ATTRS]] = {{.*}} "amdgpu-flat-work-group-size"="1,256" "amdgpu-implicitarg-num-bytes"="56"