This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Always emit amdgpu-flat-work-group-size
Closed Public

Authored by arsenm on May 31 2019, 9:05 AM.


Details

Reviewers

t-tye
b-sumner
rampitec
kzhuravl

Summary

The backend default maximum should be the hardware maximum, so the
frontend should set the implementation defined default maximum.
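
The selection logic this revision adds to the frontend can be modeled in isolation. The following is a minimal standalone sketch, not the clang code itself: `Range` and `pickFlatWorkGroupSize` are hypothetical names, an empty string stands for "emit no attribute", and the "1,256" fallback for OpenCL kernels is the value from the final diff.

```cpp
#include <string>

// Hypothetical stand-in for bounds derived from an explicit source attribute
// (amdgpu_flat_work_group_size or reqd_work_group_size).
struct Range {
  unsigned Min;
  unsigned Max;
};

// Returns the value to attach as "amdgpu-flat-work-group-size", or an empty
// string when no attribute should be emitted.
std::string pickFlatWorkGroupSize(const Range *Explicit, bool IsOpenCLKernel) {
  if (Explicit) {
    // Explicit bounds win; a zero minimum means "nothing to emit".
    if (Explicit->Min != 0)
      return std::to_string(Explicit->Min) + "," +
             std::to_string(Explicit->Max);
    return "";
  }
  // No explicit bounds: OpenCL kernels get the frontend default.
  return IsOpenCLKernel ? "1,256" : "";
}
```

Under these assumptions, a kernel with explicit bounds of 32 and 64 yields "32,64", an unannotated OpenCL kernel yields "1,256", and a non-kernel function yields no attribute.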

Diff Detail

Event Timeline

arsenm created this revision. May 31 2019, 9:05 AM

Herald added subscribers: tpr, dstuttard, yaxunl and 4 others. May 31 2019, 9:05 AM

b-sumner added inline comments. May 31 2019, 9:56 AM

lib/CodeGen/TargetInfo.cpp
7947	Theoretically, shouldn't the minimum be 1?

arsenm marked an inline comment as done. May 31 2019, 10:32 AM

arsenm added inline comments.

lib/CodeGen/TargetInfo.cpp
7947	That's what I thought, but the backend is defaulting to 2 * wave size now

yaxunl added inline comments. May 31 2019, 1:59 PM

lib/CodeGen/TargetInfo.cpp
7947	I don't get it. This attribute indicates the possible workgroup size range this kernel may be run with, right? It only depends on how the user executes the kernel. How is it related to backend defaults?

arsenm marked an inline comment as done. Jun 3 2019, 6:54 AM

arsenm added inline comments.

lib/CodeGen/TargetInfo.cpp
7947	The backend currently assumes 128,256 by default as the bounds. I want to make this a frontend decision, and make the backend assumption the most conservative default
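
To illustrate the split described in this comment, the sketch below models a consumer that reads the attribute string and falls back to its own default only when the attribute is absent or malformed. It is a standalone illustration, not LLVM's implementation: `WGRange`, `parseUnsigned`, and `parseFlatWorkGroupSize` are hypothetical names, and the default range is whatever the caller passes in.

```cpp
#include <string>

struct WGRange {
  unsigned Min;
  unsigned Max;
};

// Parses a short run of decimal digits; rejects empty or non-digit input.
static bool parseUnsigned(const std::string &S, unsigned &Out) {
  if (S.empty() || S.size() > 9) // 9 digits always fit in 32 bits
    return false;
  unsigned V = 0;
  for (char C : S) {
    if (C < '0' || C > '9')
      return false;
    V = V * 10 + static_cast<unsigned>(C - '0');
  }
  Out = V;
  return true;
}

// Parses "min,max". Returns Default when Attr is empty, malformed,
// has a zero minimum, or has min > max.
WGRange parseFlatWorkGroupSize(const std::string &Attr, WGRange Default) {
  const std::string::size_type Comma = Attr.find(',');
  if (Comma == std::string::npos)
    return Default;
  unsigned Min = 0, Max = 0;
  if (!parseUnsigned(Attr.substr(0, Comma), Min) ||
      !parseUnsigned(Attr.substr(Comma + 1), Max))
    return Default;
  if (Min == 0 || Min > Max)
    return Default;
  return WGRange{Min, Max};
}
```

With the frontend always emitting the attribute for OpenCL kernels, the `Default` argument in a model like this is only reached for IR from producers that do not set it, which is why it can be made as conservative as the hardware allows.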

ping

My concern is that this is essentially forcing users to add the amdgpu_flat_work_group_size attribute to all kernels that are executed outside of (128,256). Potentially this can cause lots of regressions for existing OpenCL apps. I am not sure if it is feasible to force all OpenCL apps to make this change. Should we do some tests before making this change?
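
What "executed outside of" a range means can be made concrete with a small check. This is a sketch under the assumption that the flat work-group size is the product of the three launch dimensions; `FlatRange`, `flatSize`, and `launchFitsRange` are hypothetical helpers, not part of any runtime API.

```cpp
struct FlatRange {
  unsigned Min;
  unsigned Max;
};

// Flat work-group size: total number of work-items in one work-group.
unsigned long long flatSize(unsigned X, unsigned Y, unsigned Z) {
  return 1ULL * X * Y * Z;
}

// True when a launch with local size X x Y x Z falls inside the range the
// kernel was compiled for.
bool launchFitsRange(unsigned X, unsigned Y, unsigned Z, FlatRange R) {
  const unsigned long long N = flatSize(X, Y, Z);
  return N >= R.Min && N <= R.Max;
}
```

For example, a 64x1x1 launch is outside (128,256) but inside (1,256), while a 32x32x1 launch is outside both.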

We need to communicate with anyone generating IR to ensure this is being generated before we change the default. clang is only one of those generators. This change will also need to be documented in the usage document.

In D62739#1536390, @yaxunl wrote:

My concern is that this is essentially forcing users to add the amdgpu_flat_work_group_size attribute to all kernels that are executed outside of (128,256). Potentially this can cause lots of regressions for existing OpenCL apps. I am not sure if it is feasible to force all OpenCL apps to make this change. Should we do some tests before making this change?

This is already the case. This is just moving it to the frontend. There's no user observable change from this patch

In D62739#1536428, @b-sumner wrote:

We need to communicate with anyone generating IR to ensure this is being generated before we change the default. clang is only one of those generators. This change will also need to be documented in the usage document.

The planned change is to make the backend more conservative, so it shouldn't break other frontends

In D62739#1543437, @arsenm wrote:

In D62739#1536390, @yaxunl wrote:

My concern is that this is essentially forcing users to add the amdgpu_flat_work_group_size attribute to all kernels that are executed outside of (128,256). Potentially this can cause lots of regressions for existing OpenCL apps. I am not sure if it is feasible to force all OpenCL apps to make this change. Should we do some tests before making this change?

This is already the case. This is just moving it to the frontend. There's no user observable change from this patch

I do want to reduce the minimum bound to 1 at least (which is the default for graphics shaders), but that's a separate change

In D62739#1543438, @arsenm wrote:

In D62739#1536428, @b-sumner wrote:

We need to communicate with anyone generating IR to ensure this is being generated before we change the default. clang is only one of those generators. This change will also need to be documented in the usage document.

The planned change is to make the backend more conservative, so it shouldn't break other frontends

It may not break other frontends, but could cause substantial performance regressions. At a minimum the summary should clearly mention this possibility.

In D62739#1543578, @b-sumner wrote:

In D62739#1543438, @arsenm wrote:

In D62739#1536428, @b-sumner wrote:

We need to communicate with anyone generating IR to ensure this is being generated before we change the default. clang is only one of those generators. This change will also need to be documented in the usage document.

The planned change is to make the backend more conservative, so it shouldn't break other frontends

It may not break other frontends, but could cause substantial performance regressions. At a minimum the summary should clearly mention this possibility.

The summary of the future backend patch will

Rebase

ping

rampitec added inline comments. Aug 22 2019, 12:02 PM

lib/CodeGen/TargetInfo.cpp
7947	I would agree that minimum should be 1. Backend's choice is also unclear, but emitting a false attribute is a separate issue.

Make lower bound 1, although this is a behavior change

LGTM

This revision is now accepted and ready to land. Aug 27 2019, 8:38 AM

r370101

Revision Contents

Path	Size
lib/CodeGen/TargetInfo.cpp	10 lines
test/CodeGenOpenCL/amdgpu-attrs.cl	35 lines

Diff 217301

lib/CodeGen/TargetInfo.cpp

[7,906 lines not shown] void AMDGPUTargetCodeGenInfo::setTargetAttributes(
   if (!FD)
     return;

   llvm::Function *F = cast<llvm::Function>(GV);

   const auto *ReqdWGS = M.getLangOpts().OpenCL ?
     FD->getAttr<ReqdWorkGroupSizeAttr>() : nullptr;

-  if (((M.getLangOpts().OpenCL && FD->hasAttr<OpenCLKernelAttr>()) ||
+  const bool IsOpenCLKernel = M.getLangOpts().OpenCL &&
+                              FD->hasAttr<OpenCLKernelAttr>();
+  if ((IsOpenCLKernel ||
        (M.getLangOpts().HIP && FD->hasAttr<CUDAGlobalAttr>())) &&
       (M.getTriple().getOS() == llvm::Triple::AMDHSA))
     F->addFnAttr("amdgpu-implicitarg-num-bytes", "56");

   const auto *FlatWGS = FD->getAttr<AMDGPUFlatWorkGroupSizeAttr>();
   if (ReqdWGS || FlatWGS) {
     unsigned Min = 0;
     unsigned Max = 0;
     if (FlatWGS) {
[9 lines not shown]

     if (Min != 0) {
       assert(Min <= Max && "Min must be less than or equal Max");

       std::string AttrVal = llvm::utostr(Min) + "," + llvm::utostr(Max);
       F->addFnAttr("amdgpu-flat-work-group-size", AttrVal);
     } else
       assert(Max == 0 && "Max must be zero");
+  } else if (IsOpenCLKernel) {
+    // By default, restrict the maximum size to 256.
+    F->addFnAttr("amdgpu-flat-work-group-size", "1,256");
   }

   if (const auto *Attr = FD->getAttr<AMDGPUWavesPerEUAttr>()) {
     unsigned Min =
       Attr->getMin()->EvaluateKnownConstInt(M.getContext()).getExtValue();
     unsigned Max = Attr->getMax() ? Attr->getMax()
                                         ->EvaluateKnownConstInt(M.getContext())
                                         .getExtValue()
[1,983 lines not shown]

test/CodeGenOpenCL/amdgpu-attrs.cl

[137 lines not shown]
 kernel void reqd_work_group_size_32_2_1_flat_work_group_size_16_128() {
 // CHECK: define amdgpu_kernel void @reqd_work_group_size_32_2_1_flat_work_group_size_16_128() [[FLAT_WORK_GROUP_SIZE_16_128:#[0-9]+]]
 }

 void a_function() {
 // CHECK: define void @a_function() [[A_FUNCTION:#[0-9]+]]
 }

+kernel void default_kernel() {
+// CHECK: define amdgpu_kernel void @default_kernel() [[DEFAULT_KERNEL_ATTRS:#[0-9]+]]
+}
+

 // Make sure this is silently accepted on other targets.
 // X86-NOT: "amdgpu-flat-work-group-size"
 // X86-NOT: "amdgpu-waves-per-eu"
 // X86-NOT: "amdgpu-num-vgpr"
 // X86-NOT: "amdgpu-num-sgpr"
 // X86-NOT: "amdgpu-implicitarg-num-bytes"
 // NONAMDHSA-NOT: "amdgpu-implicitarg-num-bytes"

 // CHECK-NOT: "amdgpu-flat-work-group-size"="0,0"
 // CHECK-NOT: "amdgpu-waves-per-eu"="0"
 // CHECK-NOT: "amdgpu-waves-per-eu"="0,0"
 // CHECK-NOT: "amdgpu-num-sgpr"="0"
 // CHECK-NOT: "amdgpu-num-vgpr"="0"

 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_64_64]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="64,64" "amdgpu-implicitarg-num-bytes"="56"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_16_128]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="16,128" "amdgpu-implicitarg-num-bytes"="56"
-// CHECK-DAG: attributes [[WAVES_PER_EU_2]] = { convergent noinline nounwind optnone "amdgpu-implicitarg-num-bytes"="56" "amdgpu-waves-per-eu"="2"
-// CHECK-DAG: attributes [[WAVES_PER_EU_2_4]] = { convergent noinline nounwind optnone "amdgpu-implicitarg-num-bytes"="56" "amdgpu-waves-per-eu"="2,4"
-// CHECK-DAG: attributes [[NUM_SGPR_32]] = { convergent noinline nounwind optnone "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32"
-// CHECK-DAG: attributes [[NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-vgpr"="64"
+// CHECK-DAG: attributes [[WAVES_PER_EU_2]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="1,256" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-waves-per-eu"="2"
+// CHECK-DAG: attributes [[WAVES_PER_EU_2_4]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="1,256" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-waves-per-eu"="2,4"
+// CHECK-DAG: attributes [[NUM_SGPR_32]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="1,256" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32"
+// CHECK-DAG: attributes [[NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="1,256" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-vgpr"="64"

 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-waves-per-eu"="2"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_4]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-waves-per-eu"="2,4"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_NUM_SGPR_32]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-vgpr"="64"
-// CHECK-DAG: attributes [[WAVES_PER_EU_2_NUM_SGPR_32]] = { convergent noinline nounwind optnone "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-waves-per-eu"="2"
-// CHECK-DAG: attributes [[WAVES_PER_EU_2_NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-vgpr"="64" "amdgpu-waves-per-eu"="2"
-// CHECK-DAG: attributes [[WAVES_PER_EU_2_4_NUM_SGPR_32]] = { convergent noinline nounwind optnone "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-waves-per-eu"="2,4"
-// CHECK-DAG: attributes [[WAVES_PER_EU_2_4_NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-vgpr"="64" "amdgpu-waves-per-eu"="2,4"
-// CHECK-DAG: attributes [[NUM_SGPR_32_NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-num-vgpr"="64"
+// CHECK-DAG: attributes [[WAVES_PER_EU_2_NUM_SGPR_32]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="1,256" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-waves-per-eu"="2"
+// CHECK-DAG: attributes [[WAVES_PER_EU_2_NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="1,256" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-vgpr"="64" "amdgpu-waves-per-eu"="2"
+// CHECK-DAG: attributes [[WAVES_PER_EU_2_4_NUM_SGPR_32]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="1,256" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-waves-per-eu"="2,4"
+// CHECK-DAG: attributes [[WAVES_PER_EU_2_4_NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="1,256" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-vgpr"="64" "amdgpu-waves-per-eu"="2,4"
+// CHECK-DAG: attributes [[NUM_SGPR_32_NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="1,256" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-num-vgpr"="64"

 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_NUM_SGPR_32]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-waves-per-eu"="2"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-vgpr"="64" "amdgpu-waves-per-eu"="2"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_4_NUM_SGPR_32]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-waves-per-eu"="2,4"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_4_NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-vgpr"="64" "amdgpu-waves-per-eu"="2,4"

 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_NUM_SGPR_32_NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-num-vgpr"="64" "amdgpu-waves-per-eu"="2"
 // CHECK-DAG: attributes [[FLAT_WORK_GROUP_SIZE_32_64_WAVES_PER_EU_2_4_NUM_SGPR_32_NUM_VGPR_64]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="32,64" "amdgpu-implicitarg-num-bytes"="56" "amdgpu-num-sgpr"="32" "amdgpu-num-vgpr"="64" "amdgpu-waves-per-eu"="2,4"

 // CHECK-DAG: attributes [[A_FUNCTION]] = { convergent noinline nounwind optnone "correctly-rounded-divide-sqrt-fp-math"="false"
+// CHECK-DAG: attributes [[DEFAULT_KERNEL_ATTRS]] = { convergent noinline nounwind optnone "amdgpu-flat-work-group-size"="1,256" "amdgpu-implicitarg-num-bytes"="56"