This is an archive of the discontinued LLVM Phabricator instance.

Differential D20728

AMDGPU: Disable AMDGPUPromoteAlloca pass for shader calling conventions.
Closed, Public

Authored by bnieuwenhuizen on May 27 2016, 5:54 AM.


Details

Reviewers

tstellarAMD
arsenm

Commits

rGbef1ceb8154f: AMDGPU: Disable AMDGPUPromoteAlloca pass for shader calling conventions.
rL275779: AMDGPU: Disable AMDGPUPromoteAlloca pass for shader calling conventions.

Summary

The work-item ID intrinsics are not available for the shader
calling conventions. Even if we did hook them up, most
shader stages have extra restrictions on the amount
of available LDS.

Diff Detail

Repository: rL LLVM

Event Timeline

bnieuwenhuizen updated this revision to Diff 58774. May 27 2016, 5:54 AM

bnieuwenhuizen retitled this revision to AMDGPU: Disable AMDGPU for shader calling conventions.

bnieuwenhuizen updated this object.

bnieuwenhuizen added reviewers: tstellarAMD, arsenm.

bnieuwenhuizen added a subscriber: llvm-commits.

Herald added subscribers: kzhuravl, arsenm. May 27 2016, 5:54 AM

The title needs to be fixed. Apart from that, LGTM.

bnieuwenhuizen retitled this revision from AMDGPU: Disable AMDGPU for shader calling conventions. to AMDGPU: Disable AMDGPUPromoteAlloca pass for shader calling conventions. May 31 2016, 3:19 AM

This at least needs a comment for why. What are the additional restrictions? This seems like it wouldn't be hard to fix.

Also, tests.

The problem is that Mesa currently lies about which data lives in LDS, in particular for tessellation shader inputs and outputs. A proper fix would first need to take care of that.

In general it's unclear to me how Mesa and LLVM should communicate about how LDS is used. Can we get a guarantee e.g. that LLVM-reserved LDS memory will always be placed after any LDS object in the initial IR?

I vote to fix the current issue for now (though yes, tests sound like a good idea :)) while we figure out the LDS allocation issues.

Rather than skipping the entire pass, can we just disable the promotion to LDS? Lowering alloca to vectors is still useful.

In D20728#445653, @nhaehnle wrote:

In general it's unclear to me how Mesa and LLVM should communicate about how LDS is used. Can we get a guarantee e.g. that LLVM-reserved LDS memory will always be placed after any LDS object in the initial IR?

LDS objects are allocated in the reverse of their use order in the function, so the LLVM-reserved memory could end up either before or after them. This needs to be fixed for other reasons too, but it is a pain.

Added a test and a comment, and moved the check so that only the promotion to LDS is disabled.

An example of extra LDS usage conditions in other stages is the PS needing implicit LDS
space for interpolation inputs. Furthermore, some shader stages have no LDS_SIZE register, and I am
not sure whether we can allocate LDS in all cases (e.g. the VS has no dedicated field, but can
we allocate using the LS LDS_SIZE even if the LS does not run?)

This patch LGTM.

Could someone commit this?

It should rebase cleanly and still pass the testsuite.

Closed by commit rL275779: AMDGPU: Disable AMDGPUPromoteAlloca pass for shader calling conventions. (authored by nha). Jul 18 2016, 2:10 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                        Size
llvm/trunk/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp        6 lines
llvm/trunk/test/CodeGen/AMDGPU/promote-alloca-shaders.ll    29 lines

Diff 64286

llvm/trunk/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp

(643 unchanged lines not shown)
 void AMDGPUPromoteAlloca::handleAlloca(AllocaInst &I) {
 ...
   if (tryPromoteAllocaToVector(&I)) {
     DEBUG(dbgs() << " alloca is not a candidate for vectorization.\n");
     return;
   }
 
   const Function &ContainingFunction = *I.getParent()->getParent();
 
+  // Don't promote the alloca to LDS for shader calling conventions as the work
+  // item ID intrinsics are not supported for these calling conventions.
+  // Furthermore not all LDS is available for some of the stages.
+  if (AMDGPU::isShader(ContainingFunction.getCallingConv()))
+    return;
+
   // FIXME: We should also try to get this value from the reqd_work_group_size
   // function attribute if it is available.
   unsigned WorkGroupSize = AMDGPU::getMaximumWorkGroupSize(ContainingFunction);
 
   const DataLayout &DL = Mod->getDataLayout();
 
   unsigned Align = I.getAlignment();
   if (Align == 0)
(195 unchanged lines not shown)

llvm/trunk/test/CodeGen/AMDGPU/promote-alloca-shaders.ll

; RUN: opt -S -mtriple=amdgcn-unknown-unknown -amdgpu-promote-alloca < %s | FileCheck -check-prefix=IR %s
; RUN: llc -march=amdgcn -mcpu=tonga < %s | FileCheck -check-prefix=ASM %s

; IR-LABEL: define amdgpu_vs void @promote_alloca_shaders(i32 addrspace(1)* inreg %out, i32 addrspace(1)* inreg %in) #0 {
; IR: alloca [5 x i32]
; ASM-LABEL: {{^}}promote_alloca_shaders:
; ASM: ; LDSByteSize: 0 bytes/workgroup (compile time only)

define amdgpu_vs void @promote_alloca_shaders(i32 addrspace(1)* inreg %out, i32 addrspace(1)* inreg %in) #0 {
entry:
  %stack = alloca [5 x i32], align 4
  %tmp0 = load i32, i32 addrspace(1)* %in, align 4
  %arrayidx1 = getelementptr inbounds [5 x i32], [5 x i32]* %stack, i32 0, i32 %tmp0
  store i32 4, i32* %arrayidx1, align 4
  %arrayidx2 = getelementptr inbounds i32, i32 addrspace(1)* %in, i32 1
  %tmp1 = load i32, i32 addrspace(1)* %arrayidx2, align 4
  %arrayidx3 = getelementptr inbounds [5 x i32], [5 x i32]* %stack, i32 0, i32 %tmp1
  store i32 5, i32* %arrayidx3, align 4
  %arrayidx4 = getelementptr inbounds [5 x i32], [5 x i32]* %stack, i32 0, i32 0
  %tmp2 = load i32, i32* %arrayidx4, align 4
  store i32 %tmp2, i32 addrspace(1)* %out, align 4
  %arrayidx5 = getelementptr inbounds [5 x i32], [5 x i32]* %stack, i32 0, i32 1
  %tmp3 = load i32, i32* %arrayidx5
  %arrayidx6 = getelementptr inbounds i32, i32 addrspace(1)* %out, i32 1
  store i32 %tmp3, i32 addrspace(1)* %arrayidx6
  ret void
}

attributes #0 = { nounwind "amdgpu-max-work-group-size"="64" }