This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Switch backend default max workgroup size to 1024
ClosedPublic

Authored by arsenm on Oct 30 2019, 10:50 PM.


Details

Reviewers

kzhuravl
rampitec
yaxunl
b-sumner

Summary

Previously the backend defaulted to a maximum of 256, not the maximum
supported size of 1024. Using a maximum lower than the hardware maximum
requires language runtimes to enforce this limit for correctness, which
no language has done correctly. Switch the default to the conservatively
correct maximum, and force frontends to opt in to the more optimal
default maximum of 256.

I don't really understand why the changes in occupancy-levels.ll
increased the computed occupancy, which I expected to decrease. I'm
not sure if these tests should be forcing the old maximum.
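
To make the change in the first paragraph concrete, here is a minimal sketch of the default selection before and after this revision. It is not the backend code: `CallConv`, `oldDefault`, `newDefault` and the constant `MaxFlatWorkGroupSize` are hypothetical names, and the value 1024 is an assumption taken from the revision title (the review discussion notes that the real `getMaxFlatWorkGroupSize()` depended on D66812 at the time).

```cpp
#include <algorithm>
#include <cassert>
#include <utility>

// Hypothetical stand-in for the calling conventions the backend distinguishes.
enum class CallConv { Kernel, Graphics, Other };

// Assumed hardware maximum, taken from the revision title.
constexpr unsigned MaxFlatWorkGroupSize = 1024;

// Before: compute kernels got their own case with a maximum of
// max(4 * wavefront size, 256).
std::pair<unsigned, unsigned> oldDefault(CallConv CC, unsigned WaveSize) {
  assert(WaveSize == 32 || WaveSize == 64);
  switch (CC) {
  case CallConv::Kernel:
    return {WaveSize * 2, std::max(WaveSize * 4, 256u)};
  case CallConv::Graphics:
    return {1u, WaveSize};
  default:
    return {1u, 16 * WaveSize};
  }
}

// After: the kernel case is gone, so kernels take the default branch,
// which now returns the assumed hardware maximum.
std::pair<unsigned, unsigned> newDefault(CallConv CC, unsigned WaveSize) {
  assert(WaveSize == 32 || WaveSize == 64);
  switch (CC) {
  case CallConv::Graphics:
    return {1u, WaveSize};
  default:
    return {1u, MaxFlatWorkGroupSize};
  }
}
```

Under these assumptions, a wave64 kernel with no attribute moves from a (128, 256) default to (1, 1024); a frontend that wants the old limit has to request it explicitly.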

Diff Detail

Event Timeline

arsenm created this revision. Oct 30 2019, 10:50 PM

Herald added a project: Restricted Project. Oct 30 2019, 10:50 PM

Herald added subscribers: hiraditya, t-tye, tpr and 5 others.

I do not think that deliberately introducing performance regression is a good way to force FE to do anything.

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
354	It currently returns 2048, not 1024 as far as I can see.
llvm/test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props-v3.ll
15	And given that getMaxFlatWorkGroupSize() returns 2048, I do not understand how it works.
llvm/test/CodeGen/AMDGPU/occupancy-levels.ll
265	This needs to be investigated first I believe. There must be some wrong logic somewhere with LDS accounting for occupancy.

In D69654#1729022, @rampitec wrote:

I do not think that deliberately introducing performance regression is a good way to force FE to do anything.

clang already emits the clamp to 256 if unspecified. Bugs from not being correct by default have come up many times.

arsenm marked an inline comment as done. Oct 31 2019, 11:04 AM

arsenm added inline comments.

llvm/test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props-v3.ll
15	D66812 changes this to 1024

In D69654#1729035, @arsenm wrote:

In D69654#1729022, @rampitec wrote:

I do not think that deliberately introducing performance regression is a good way to force FE to do anything.

clang already emits the clamp to 256 if unspecified. Bugs from not being correct by default have come up many times.

OK, thanks.

llvm/test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props-v3.ll
15	Ok

LGTM

llvm/test/CodeGen/AMDGPU/occupancy-levels.ll
265	I have recollected the logic behind LDS sharing and occupancy calculations. The increase seems to be right because LDS allocation is per WG, and with a bigger WG you have fewer of them.
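
The reasoning in the comment above can be checked with a small model. This is a simplified sketch, not the backend's actual code: the formula and all constants (64 KiB of LDS per compute unit, 10 waves per execution unit, 40 waves per compute unit, at most 16 workgroups, wave64) are assumptions chosen to resemble a GFX9-like target, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Assumed GFX9-like limits; illustrative values, not read from the backend.
constexpr unsigned LocalMemoryBytes = 65536; // LDS per compute unit
constexpr unsigned MaxWavesPerEU = 10;       // waves per execution unit
constexpr unsigned MaxWavesPerCU = 40;       // waves per compute unit
constexpr unsigned MaxWorkGroupsCap = 16;    // workgroups per compute unit
constexpr unsigned WaveSize = 64;

// Larger workgroups need more waves each, so fewer of them fit on a
// compute unit.
unsigned maxWorkGroupsPerCU(unsigned FlatWorkGroupSize) {
  assert(FlatWorkGroupSize > 0);
  unsigned WavesPerWG = (FlatWorkGroupSize + WaveSize - 1) / WaveSize;
  unsigned Groups = std::min(MaxWavesPerCU / WavesPerWG, MaxWorkGroupsCap);
  return std::max(Groups, 1u);
}

// LDS is allocated once per workgroup, so the LDS budget that bounds the
// wave count grows as the number of resident workgroups shrinks.
unsigned occupancyWithLDS(unsigned LdsBytes, unsigned MaxFlatWorkGroupSize) {
  unsigned Groups = maxWorkGroupsPerCU(MaxFlatWorkGroupSize);
  unsigned Limit = LocalMemoryBytes * MaxWavesPerEU / Groups;
  unsigned Waves = Limit / std::max(LdsBytes, 1u);
  return std::max(std::min(Waves, MaxWavesPerEU), 1u);
}
```

In this model a kernel using 6556 bytes of LDS gets occupancy 9 with a 256-item maximum and 10 with a 1024-item maximum, and a kernel using 13112 bytes goes from 4 to 10, which is the direction of the GFX9 changes in occupancy-levels.ll.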

This revision is now accepted and ready to land. Nov 12 2019, 12:13 PM

4b472139513ba460595804f8113497844b41fbcc

Revision Contents

Path                                                                Size
llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp                          8 lines
llvm/test/CodeGen/AMDGPU/amdgpu.private-memory.ll                   5 lines
llvm/test/CodeGen/AMDGPU/array-ptr-calc-i32.ll                      2 lines
llvm/test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props-v3.ll       19 lines
llvm/test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props.ll          26 lines
llvm/test/CodeGen/AMDGPU/lower-range-metadata-intrinsic-call.ll     2 lines
llvm/test/CodeGen/AMDGPU/occupancy-levels.ll                        10 lines
llvm/test/CodeGen/AMDGPU/private-memory-r600.ll                     2 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-addrspacecast.ll            2 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-icmp.ll              2 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-phi.ll               2 lines
llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-select.ll            2 lines

Diff 227222

llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp

[...]
 AMDGPUSubtarget::getOccupancyWithLocalMemSize(const MachineFunction &MF) const {
   const auto *MFI = MF.getInfo<SIMachineFunctionInfo>();
   return getOccupancyWithLocalMemSize(MFI->getLDSSize(), MF.getFunction());
 }

 std::pair<unsigned, unsigned>
 AMDGPUSubtarget::getDefaultFlatWorkGroupSize(CallingConv::ID CC) const {
   switch (CC) {
-  case CallingConv::AMDGPU_CS:
-  case CallingConv::AMDGPU_KERNEL:
-  case CallingConv::SPIR_KERNEL:
-    return std::make_pair(getWavefrontSize() * 2,
-                          std::max(getWavefrontSize() * 4, 256u));
   case CallingConv::AMDGPU_VS:
   case CallingConv::AMDGPU_LS:
   case CallingConv::AMDGPU_HS:
   case CallingConv::AMDGPU_ES:
   case CallingConv::AMDGPU_GS:
   case CallingConv::AMDGPU_PS:
     return std::make_pair(1, getWavefrontSize());
   default:
-    return std::make_pair(1, 16 * getWavefrontSize());
+    return std::make_pair(1u, getMaxFlatWorkGroupSize());
     rampitec (inline comment): It currently returns 2048, not 1024 as far as I can see.
   }
 }

 std::pair<unsigned, unsigned> AMDGPUSubtarget::getFlatWorkGroupSizes(
     const Function &F) const {
-  // FIXME: 1024 if function.
   // Default minimum/maximum flat work group sizes.
   std::pair<unsigned, unsigned> Default =
     getDefaultFlatWorkGroupSize(F.getCallingConv());

   // Requested minimum/maximum flat work group sizes.
   std::pair<unsigned, unsigned> Requested = AMDGPU::getIntegerPairAttribute(
     F, "amdgpu-flat-work-group-size", Default);

[...]
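
The hunk above reads the "amdgpu-flat-work-group-size" attribute and falls back to the computed default when nothing is requested. A minimal sketch of that pattern follows, assuming the attribute value is a "min,max" string as in the updated tests ("1,256"). `parseFlatWorkGroupSize` is a hypothetical helper, not `AMDGPU::getIntegerPairAttribute`, and its validation rules (reject a zero minimum or a minimum above the maximum) are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <string>
#include <utility>

// Hypothetical helper: parse a "min,max" attribute value and fall back to
// Default when the value is empty, has no comma, or cannot be parsed.
std::pair<unsigned, unsigned>
parseFlatWorkGroupSize(const std::string &Attr,
                       std::pair<unsigned, unsigned> Default) {
  if (Attr.empty())
    return Default;
  std::size_t Comma = Attr.find(',');
  if (Comma == std::string::npos)
    return Default;
  try {
    unsigned long Min = std::stoul(Attr.substr(0, Comma));
    unsigned long Max = std::stoul(Attr.substr(Comma + 1));
    // Assumed sanity check: an empty or inverted range is ignored.
    if (Min == 0 || Min > Max)
      return Default;
    return {static_cast<unsigned>(Min), static_cast<unsigned>(Max)};
  } catch (const std::exception &) {
    // std::stoul throws on input it cannot convert.
    return Default;
  }
}
```

With a default of (1, 1024), a kernel carrying "1,256" keeps the old limit, while a kernel without the attribute now gets the full range. This is why the test files below that depend on the 256 limit add the attribute explicitly.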

llvm/test/CodeGen/AMDGPU/amdgpu.private-memory.ll

[...] define amdgpu_kernel void @ptrtoint(i32 addrspace(1)* %out, i32 %a, i32 %b) #0 {
   %tmp5 = load i32, i32 addrspace(5)* %tmp4
   store i32 %tmp5, i32 addrspace(1)* %out
   ret void
 }

 ; OPT-LABEL: @pointer_typed_alloca(
 ; OPT: getelementptr inbounds [256 x i32 addrspace(1)*], [256 x i32 addrspace(1)*] addrspace(3)* @pointer_typed_alloca.A.addr, i32 0, i32 %{{[0-9]+}}
 ; OPT: load i32 addrspace(1)*, i32 addrspace(1)* addrspace(3)* %{{[0-9]+}}, align 4
-define amdgpu_kernel void @pointer_typed_alloca(i32 addrspace(1)* %A) {
+define amdgpu_kernel void @pointer_typed_alloca(i32 addrspace(1)* %A) #1 {
 entry:
   %A.addr = alloca i32 addrspace(1)*, align 4, addrspace(5)
   store i32 addrspace(1)* %A, i32 addrspace(1)* addrspace(5)* %A.addr, align 4
   %ld0 = load i32 addrspace(1)*, i32 addrspace(1)* addrspace(5)* %A.addr, align 4
   %arrayidx = getelementptr inbounds i32, i32 addrspace(1)* %ld0, i32 0
   store i32 1, i32 addrspace(1)* %arrayidx, align 4
   %ld1 = load i32 addrspace(1)*, i32 addrspace(1)* addrspace(5)* %A.addr, align 4
   %arrayidx1 = getelementptr inbounds i32, i32 addrspace(1)* %ld1, i32 1
[...]
 entry:
   %tmp = alloca [1 x i32], addrspace(5)
   store [1 x i32] [i32 0], [1 x i32] addrspace(5)* %tmp
   %load = load [1 x i32], [1 x i32] addrspace(5)* %tmp
   store [1 x i32] %load, [1 x i32] addrspace(1)* %out
   ret void
 }

-attributes #0 = { nounwind "amdgpu-waves-per-eu"="1,2" }
+attributes #0 = { nounwind "amdgpu-waves-per-eu"="1,2" "amdgpu-flat-work-group-size"="1,256" }
+attributes #1 = { nounwind "amdgpu-flat-work-group-size"="1,256" }

 ; HSAOPT: !0 = !{}
 ; HSAOPT: !1 = !{i32 0, i32 257}
 ; HSAOPT: !2 = !{i32 0, i32 256}

 ; NOHSAOPT: !0 = !{i32 0, i32 257}
 ; NOHSAOPT: !1 = !{i32 0, i32 256}

llvm/test/CodeGen/AMDGPU/array-ptr-calc-i32.ll

[...] define amdgpu_kernel void @test_private_array_ptr_calc(i32 addrspace(1)* noalias %out, i32 addrspace(1)* noalias %inA, i32 addrspace(1)* noalias %inB) #0 {
   ; Dummy call
   call void @llvm.amdgcn.s.barrier()
   %reload = load i32, i32 addrspace(5)* %alloca_ptr, align 4, !range !0
   %out_ptr = getelementptr inbounds i32, i32 addrspace(1)* %out, i32 %tid
   store i32 %reload, i32 addrspace(1)* %out_ptr, align 4
   ret void
 }

-attributes #0 = { nounwind "amdgpu-waves-per-eu"="1,1" }
+attributes #0 = { nounwind "amdgpu-waves-per-eu"="1,1" "amdgpu-flat-work-group-size"="1,256" }
 attributes #1 = { nounwind readnone }
 attributes #2 = { nounwind convergent }

 !0 = !{i32 0, i32 65536 }

llvm/test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props-v3.ll

 ; RUN: llc -mattr=+code-object-v3 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx700 -enable-misched=0 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX700 --check-prefix=WAVE64 --check-prefix=NOTES %s
 ; RUN: llc -mattr=+code-object-v3 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx803 -enable-misched=0 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX803 --check-prefix=WAVE64 --check-prefix=NOTES %s
 ; RUN: llc -mattr=+code-object-v3 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -enable-misched=0 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX900 --check-prefix=WAVE64 --check-prefix=NOTES %s
 ; RUN: llc -mattr=+code-object-v3 -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1010 -enable-misched=0 -filetype=obj -o - < %s | llvm-readobj -elf-output-style=GNU -notes | FileCheck --check-prefix=CHECK --check-prefix=GFX1010 --check-prefix=WAVE32 --check-prefix=NOTES %s

 @var = addrspace(1) global float 0.0

 ; CHECK: ---
 ; CHECK: amdhsa.kernels:

 ; CHECK: - .args:
 ; CHECK: .group_segment_fixed_size: 0
 ; CHECK: .kernarg_segment_align: 8
 ; CHECK: .kernarg_segment_size: 24
-; CHECK: .max_flat_workgroup_size: 256
+; CHECK: .max_flat_workgroup_size: 1024
     rampitec (inline comment): And given that getMaxFlatWorkGroupSize() returns 2048, I do not understand how it works.
     arsenm (inline comment): D66812 changes this to 1024
     rampitec (inline comment): Ok
 ; CHECK: .name: test
 ; CHECK: .private_segment_fixed_size: 0
 ; WAVE64: .sgpr_count: 8
 ; WAVE32: .sgpr_count: 10
 ; CHECK: .symbol: test.kd
 ; CHECK: .vgpr_count: 6
 ; WAVE64: .wavefront_size: 64
 ; WAVE32: .wavefront_size: 32
 define amdgpu_kernel void @test(
     half addrspace(1)* %r,
     half addrspace(1)* %a,
     half addrspace(1)* %b) {
 entry:
   %a.val = load half, half addrspace(1)* %a
   %b.val = load half, half addrspace(1)* %b
   %r.val = fadd half %a.val, %b.val
   store half %r.val, half addrspace(1)* %r
   ret void
 }

+; CHECK: - .args:
+; CHECK: .max_flat_workgroup_size: 256
+define amdgpu_kernel void @test_max_flat_workgroup_size(
+    half addrspace(1)* %r,
+    half addrspace(1)* %a,
+    half addrspace(1)* %b) #2 {
+entry:
+  %a.val = load half, half addrspace(1)* %a
+  %b.val = load half, half addrspace(1)* %b
+  %r.val = fadd half %a.val, %b.val
+  store half %r.val, half addrspace(1)* %r
+  ret void
+}
+
 ; CHECK: .name: num_spilled_sgprs
 ; GFX700: .sgpr_spill_count: 40
 ; GFX803: .sgpr_spill_count: 24
 ; GFX900: .sgpr_spill_count: 24
 ; GFX1010: .sgpr_spill_count: 24
 ; CHECK: .symbol: num_spilled_sgprs.kd
 define amdgpu_kernel void @num_spilled_sgprs(
     i32 addrspace(1)* %out0, i32 addrspace(1)* %out1, [8 x i32],
[...]
 }

 ; CHECK: amdhsa.version:
 ; CHECK-NEXT: - 1
 ; CHECK-NEXT: - 0

 attributes #0 = { "amdgpu-num-sgpr"="14" }
 attributes #1 = { "amdgpu-num-vgpr"="20" }
+attributes #2 = { "amdgpu-flat-work-group-size"="1,256" }

llvm/test/CodeGen/AMDGPU/hsa-metadata-kernel-code-props.ll

[...]
 ; CHECK: CodeProps:
 ; CHECK: KernargSegmentSize: 24
 ; CHECK: GroupSegmentFixedSize: 0
 ; CHECK: PrivateSegmentFixedSize: 0
 ; CHECK: KernargSegmentAlign: 8
 ; CHECK: WavefrontSize: 64
 ; CHECK: NumSGPRs: 8
 ; CHECK: NumVGPRs: 6
-; CHECK: MaxFlatWorkGroupSize: 256
+; CHECK: MaxFlatWorkGroupSize: 1024
 define amdgpu_kernel void @test(
     half addrspace(1)* %r,
     half addrspace(1)* %a,
     half addrspace(1)* %b) {
 entry:
   %a.val = load half, half addrspace(1)* %a
   %b.val = load half, half addrspace(1)* %b
   %r.val = fadd half %a.val, %b.val
   store half %r.val, half addrspace(1)* %r
   ret void
 }

+; CHECK-LABEL: - Name: test_max_flat_workgroup_size
+; CHECK: SymbolName: 'test_max_flat_workgroup_size@kd'
+; CHECK: CodeProps:
+; CHECK: KernargSegmentSize: 24
+; CHECK: GroupSegmentFixedSize: 0
+; CHECK: PrivateSegmentFixedSize: 0
+; CHECK: KernargSegmentAlign: 8
+; CHECK: WavefrontSize: 64
+; CHECK: NumSGPRs: 8
+; CHECK: NumVGPRs: 6
+; CHECK: MaxFlatWorkGroupSize: 256
+define amdgpu_kernel void @test_max_flat_workgroup_size(
+    half addrspace(1)* %r,
+    half addrspace(1)* %a,
+    half addrspace(1)* %b) #2 {
+entry:
+  %a.val = load half, half addrspace(1)* %a
+  %b.val = load half, half addrspace(1)* %b
+  %r.val = fadd half %a.val, %b.val
+  store half %r.val, half addrspace(1)* %r
+  ret void
+}
+
 ; CHECK-LABEL: - Name: num_spilled_sgprs
 ; CHECK: SymbolName: 'num_spilled_sgprs@kd'
 ; CHECK: CodeProps:
 ; GFX700: NumSpilledSGPRs: 40
 ; GFX803: NumSpilledSGPRs: 24
 ; GFX900: NumSpilledSGPRs: 24
 define amdgpu_kernel void @num_spilled_sgprs(
     i32 addrspace(1)* %out0, i32 addrspace(1)* %out1, [8 x i32],
[...] define amdgpu_kernel void @num_spilled_vgprs() #1 {
   store volatile float %val29, float addrspace(1)* @var
   store volatile float %val30, float addrspace(1)* @var

   ret void
 }

 attributes #0 = { "amdgpu-num-sgpr"="14" }
 attributes #1 = { "amdgpu-num-vgpr"="20" }
+attributes #2 = { "amdgpu-flat-work-group-size"="1,256" }

llvm/test/CodeGen/AMDGPU/lower-range-metadata-intrinsic-call.ll

[...]
 entry:
   %and = and i32 %id, 255
   store i32 %and, i32 addrspace(1)* %out, align 4
   ret void
 }


 declare i32 @llvm.amdgcn.workitem.id.x() #1

-attributes #0 = { norecurse nounwind }
+attributes #0 = { norecurse nounwind "amdgpu-flat-work-group-size"="1,256" }
 attributes #1 = { nounwind readnone }

 !0 = !{i32 0, i32 1024}
 !1 = !{i32 0, i32 1023}

llvm/test/CodeGen/AMDGPU/occupancy-levels.ll

[...]
 @lds6552 = internal addrspace(3) global [6552 x i8] undef, align 4
 define amdgpu_kernel void @used_lds_6552() {
   %p = bitcast [6552 x i8] addrspace(3)* @lds6552 to i8 addrspace(3)*
   store volatile i8 1, i8 addrspace(3)* %p
   ret void
 }

 ; GCN-LABEL: {{^}}used_lds_6556:
-; GFX9: ; Occupancy: 9
+; GFX9: ; Occupancy: 10
     rampitec (inline comment): This needs to be investigated first I believe. There must be some wrong logic somewhere with LDS accounting for occupancy.
     rampitec (inline comment): I have recollected the logic behind LDS sharing and occupancy calculations. The increase seems to be right because LDS allocation is per WG, and with a bigger WG you have fewer of them.
-; GFX1010W64: ; Occupancy: 19
+; GFX1010W64: ; Occupancy: 20
 ; GFX1010W32: ; Occupancy: 20
 @lds6556 = internal addrspace(3) global [6556 x i8] undef, align 4
 define amdgpu_kernel void @used_lds_6556() {
   %p = bitcast [6556 x i8] addrspace(3)* @lds6556 to i8 addrspace(3)*
   store volatile i8 1, i8 addrspace(3)* %p
   ret void
 }

 ; GCN-LABEL: {{^}}used_lds_13112:
-; GFX9: ; Occupancy: 4
+; GFX9: ; Occupancy: 10
-; GFX1010W64: ; Occupancy: 9
+; GFX1010W64: ; Occupancy: 20
-; GFX1010W32: ; Occupancy: 19
+; GFX1010W32: ; Occupancy: 20
 @lds13112 = internal addrspace(3) global [13112 x i8] undef, align 4
 define amdgpu_kernel void @used_lds_13112() {
   %p = bitcast [13112 x i8] addrspace(3)* @lds13112 to i8 addrspace(3)*
   store volatile i8 1, i8 addrspace(3)* %p
   ret void
 }

 attributes #0 = { "amdgpu-waves-per-eu"="2,3" }
 attributes #1 = { "amdgpu-waves-per-eu"="18,18" }
 attributes #2 = { "amdgpu-waves-per-eu"="19,19" }

llvm/test/CodeGen/AMDGPU/private-memory-r600.ll

[...] define amdgpu_kernel void @ptrtoint(i32 addrspace(1)* %out, i32 %a, i32 %b) #0 {
   %tmp5 = load i32, i32 addrspace(5)* %tmp4
   store i32 %tmp5, i32 addrspace(1)* %out
   ret void
 }

 ; OPT: !0 = !{i32 0, i32 257}
 ; OPT: !1 = !{i32 0, i32 256}

-attributes #0 = { nounwind "amdgpu-waves-per-eu"="1,2" }
+attributes #0 = { nounwind "amdgpu-waves-per-eu"="1,2" "amdgpu-flat-work-group-size"="1,256" }

llvm/test/CodeGen/AMDGPU/promote-alloca-addrspacecast.ll

[...]
 entry:
   %tmp = bitcast [1 x i32]* %data to half*
   %tmp1 = addrspacecast half* %tmp to half addrspace(4)*
   %tmp2 = bitcast half addrspace(4)* %tmp1 to <2 x i16> addrspace(4)*
   %tmp3 = load <2 x i16>, <2 x i16> addrspace(4)* %tmp2, align 2
   %tmp4 = bitcast <2 x i16> %tmp3 to <2 x half>
   ret void
 }

-attributes #0 = { nounwind }
+attributes #0 = { nounwind "amdgpu-flat-work-group-size"="1,256" }

llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-icmp.ll

[...] define amdgpu_kernel void @lds_promoted_alloca_icmp_unknown_ptr(i32 addrspace(1)* %out, i32 %a, i32 %b) #0 {
   %cmp = icmp eq i32* %ptr0, %ptr1
   %zext = zext i1 %cmp to i32
   store volatile i32 %zext, i32 addrspace(1)* %out
   ret void
 }

 declare i32* @get_unknown_pointer() #0

-attributes #0 = { nounwind "amdgpu-waves-per-eu"="1,1" }
+attributes #0 = { nounwind "amdgpu-waves-per-eu"="1,1" "amdgpu-flat-work-group-size"="1,256" }

llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-phi.ll

[...] for.body: ; preds = %for.body, %for.body.preheader
   %incdec.ptr = getelementptr inbounds i32, i32* %p.08, i32 1
   %inc = add nuw nsw i32 %i.09, 1
   %cmp = icmp eq i32* %incdec.ptr, %call
   br i1 %cmp, label %for.cond.cleanup.loopexit, label %for.body
 }

 declare i32* @get_unknown_pointer() #0

-attributes #0 = { nounwind "amdgpu-waves-per-eu"="1,1" }
+attributes #0 = { nounwind "amdgpu-waves-per-eu"="1,1" "amdgpu-flat-work-group-size"="1,256" }

llvm/test/CodeGen/AMDGPU/promote-alloca-to-lds-select.ll

[...] bb:
   %tmp2 = icmp eq i32 %arg1, 0
   %tmp3 = select i1 %tmp2, double addrspace(5)* null, double addrspace(5)* %tmp
   store double 1.000000e+00, double addrspace(5)* %tmp3, align 8
   %tmp4 = load double, double addrspace(5)* %tmp, align 8
   store double %tmp4, double addrspace(1)* %arg
   ret void
 }

-attributes #0 = { norecurse nounwind "amdgpu-waves-per-eu"="1,1" }
+attributes #0 = { norecurse nounwind "amdgpu-waves-per-eu"="1,1" "amdgpu-flat-work-group-size"="1,256" }
 attributes #1 = { norecurse nounwind }
