This is an archive of the discontinued LLVM Phabricator instance.

Differential D71192

AMDGPU: Fix AMDGPUUnifyDivergentExitNodes with no normal returns
Closed · Public

Authored by cwabbott on Dec 9 2019, 3:13 AM.

Details

Reviewers

arsenm
nhaehnle
critson

Commits

rGce06d50756e9: AMDGPU: Fix AMDGPUUnifyDivergentExitNodes with no normal returns
rG323bfde20c5f: AMDGPU: Fix AMDGPUUnifyDivergentExitNodes with no normal returns

Summary

The code was assuming in a few places that if there was only one exit
from the function that it was a normal return, which is invalid. It
could be an infinite loop, in which case we still need to insert the
usual fake edge so that the null export happens. This fixes shaders that
end with an infinite loop that discards.
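
The changed early-exit condition can be illustrated with a small standalone model. This is only a sketch and does not use LLVM: `ExitSummary`, `shouldUnifyBefore`, `shouldUnifyAfter`, and `demoOnlyKill` are hypothetical names, and it assumes `InsertExport` is set when the pass has added a fake edge out of an infinite loop in a shader that still needs its null export.

```cpp
#include <cassert>

// Hypothetical summary of what the pass has collected for one function.
struct ExitSummary {
  unsigned NumReturningBlocks; // return blocks, including any fake return
  bool InsertExport;           // assumed: a null export still has to be added
};

// Model of the old check: a single returning block is always treated as
// an existing normal return, so nothing is done.
bool shouldUnifyBefore(const ExitSummary &S) {
  if (S.NumReturningBlocks == 0)
    return false; // no blocks return
  if (S.NumReturningBlocks == 1)
    return false; // assumed to already be a single normal return
  return true;
}

// Model of the new check: a single returning block is only left alone
// when no export has to be inserted.
bool shouldUnifyAfter(const ExitSummary &S) {
  if (S.NumReturningBlocks == 0)
    return false; // no blocks return
  if (S.NumReturningBlocks == 1 && !S.InsertExport)
    return false; // already a single normal return block
  return true;
}

// Usage: the shape of a shader that only discards inside an infinite loop
// (one fake return, export required).
void demoOnlyKill() {
  ExitSummary OnlyKill{1, true};
  assert(!shouldUnifyBefore(OnlyKill)); // old check bails out early
  assert(shouldUnifyAfter(OnlyKill));   // new check goes on to unify
}
```

In this model, a function whose only exit is the fake return from an infinite loop is no longer mistaken for one that already has a single normal return.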

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 42104
Build 42488: arc lint + arc unit

Event Timeline

cwabbott created this revision. Dec 9 2019, 3:13 AM

Herald added a project: Restricted Project. Dec 9 2019, 3:13 AM

Herald added subscribers: llvm-commits, hiraditya, t-tye and 6 others.

cwabbott added a parent revision: D70781: AMDGPU: Fix handling of infinite loops in fragment shaders. Dec 9 2019, 3:14 AM

Harbormaster completed remote builds in B42104: Diff 232793. Dec 9 2019, 3:15 AM

LGTM

This revision is now accepted and ready to land. Dec 9 2019, 11:34 PM

Closed by commit rG323bfde20c5f: AMDGPU: Fix AMDGPUUnifyDivergentExitNodes with no normal returns (authored by cwabbott). Jan 29 2020, 6:35 AM

This revision was automatically updated to reflect the committed changes.

Herald added a subscriber: kerbowa. Jan 29 2020, 6:35 AM

cwabbott reopened this revision. Jan 29 2020, 9:09 AM

This revision is now accepted and ready to land. Jan 29 2020, 9:09 AM

It seems I forgot to test this against the entire testsuite, which made a bunch
of buildbots unhappy. This version fixes the issues:

- Skip the entire thing for non-pixel shaders that only end with an infinite
  loop. This was causing slightly different assembly for four other tests, due to
  the insertion of the fake branch, and it generates extra work for normal
  shaders that's pointless.
- Update the expected output in update-phi.ll to account for the null export.
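
One way to picture the skip described above, as a standalone sketch rather than the committed code: `shouldSkipPass`, `demoSkip`, and the parameters are hypothetical, and the sketch assumes the skip is keyed on the number of post-dominator-tree roots (as in the `PDT.getRoots().size() <= 1` check removed by the first diff) together with whether the function is a pixel shader.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: returns true if the pass can leave the function alone.
// NumExitRoots stands in for PDT.getRoots().size(); IsPixelShader stands in
// for a calling-convention check on the function.
bool shouldSkipPass(std::size_t NumExitRoots, bool IsPixelShader) {
  // With at most one exit there is nothing to unify. The exception is a
  // pixel shader, whose single exit may be an infinite loop that still
  // needs a fake edge and a null export.
  return NumExitRoots <= 1 && !IsPixelShader;
}

// Usage: a non-pixel shader ending in an infinite loop is skipped,
// while the same shape in a pixel shader is still processed.
void demoSkip() {
  assert(shouldSkipPass(1, /*IsPixelShader=*/false));
  assert(!shouldSkipPass(1, /*IsPixelShader=*/true));
}
```

Under these assumptions, functions with several exits are never skipped at this point, whatever their shader type.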

Harbormaster completed remote builds in B45260: Diff 241187. Jan 29 2020, 9:12 AM

arsenm accepted this revision. Jan 29 2020, 9:29 AM

LGTM. Did you manage to sort out your commit access issues?

In D71192#1848832, @nhaehnle wrote:

LGTM. Did you manage to sort out your commit access issues?

Yes, I just got added yesterday.

Closed by commit rGce06d50756e9: AMDGPU: Fix AMDGPUUnifyDivergentExitNodes with no normal returns (authored by cwabbott). Jan 30 2020, 2:08 AM

This revision was automatically updated to reflect the committed changes.

Hi @cwabbott,
This commit causes a GPU hang on amdvlk in one test, due to the missing "done" bit on the normal export. I think in the attached case the patch incorrectly classifies the export as being in an infinite loop.

Attachment: before_AMDGPUUnifyDivergentExitNodes.ll (2 KB)

In D71192#1886101, @piotr wrote:

Hi @cwabbott,
This commit causes a GPU hang on amdvlk in one test, due to the missing "done" bit on the normal export. I think in the attached case the patch incorrectly classifies the export as being in an infinite loop.
Attachment: before_AMDGPUUnifyDivergentExitNodes.ll (2 KB)

Hi @piotr,

I think what's happening is that the "normal" return, that already has an export before it, is uniformly reached (i.e. only reached via uniform branches). Normally this means that we don't have to do anything with it, so the pass ignores it, but then we remove the "done" bit even though the final dummy export, which is supposed to replace it, isn't reached when returning normally. There are two ways I can see to solve it:

1. When InsertExport is true, unify *all* the return blocks, even the uniformly-reached ones. For this shader, we would still remove the done bit on the original export, but we'd replace the return with a branch to the UnifiedReturnBlock with the dummy export.
2. Do the above, but when we can prove that all the existing exports are post-dominated by the normal, uniformly-reached returns, we can avoid clearing their done bits or unifying them.

The second option would prevent the regression in code quality, but it would be more complicated. We've deliberately made this as simple as possible already, because it only kicks in when there are infinite loops (that might have kills in them) and this seems pretty uncommon in real-world shaders. Is this coming from a test or an actual game?
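
The first option can be sketched with a toy model that does not use LLVM. `ReturnBlock`, `blocksToUnify`, `demoOptionOne`, and the block names are hypothetical; the sketch only assumes that, without `InsertExport`, uniformly-reached returns are left out of the unified set, as the explanation above says.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a block that ends in a return.
struct ReturnBlock {
  std::string Name;
  bool UniformlyReached; // only reached via uniform branches
};

// Selects the return blocks that get rewired to the unified return block.
// Option 1: when an export must be inserted, take every return block,
// including the uniformly-reached ones, so that each path ends at the
// dummy export that carries the "done" bit.
std::vector<std::string> blocksToUnify(const std::vector<ReturnBlock> &Returns,
                                       bool InsertExport) {
  std::vector<std::string> Selected;
  for (const ReturnBlock &RB : Returns) {
    if (InsertExport || !RB.UniformlyReached)
      Selected.push_back(RB.Name);
  }
  return Selected;
}

// Usage: one uniformly-reached normal return plus one fake return that
// comes from an infinite loop.
void demoOptionOne() {
  std::vector<ReturnBlock> Returns{{"end", true}, {"DummyReturnBlock", false}};
  assert(blocksToUnify(Returns, /*InsertExport=*/false).size() == 1);
  assert(blocksToUnify(Returns, /*InsertExport=*/true).size() == 2);
}
```

The trade-off noted in the comment shows up here as well: with `InsertExport` set, the uniformly-reached `end` block is unified even though it would otherwise be left alone.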

The reported example comes from one of the graphicsfuzz CTS tests, so I think it is fine to treat it as an unrealistic corner case where performance is not critical.

piotr mentioned this in D76364: [AMDGPU] Fix AMDGPUUnifyDivergentExitNodes. Mar 18 2020, 7:36 AM

piotr mentioned this in rGd1a7bfca7436: [AMDGPU] Fix AMDGPUUnifyDivergentExitNodes. Mar 18 2020, 9:15 AM

ruiling mentioned this in D105608: [NFC][AMDGPU] autogenerate kill-infinite-loop.ll checks. Jul 7 2021, 11:30 PM

ruiling mentioned this in D105610: [AMDGPU] Don't handle export done when unify exit nodes. Jul 7 2021, 11:36 PM

ruiling mentioned this in rGd9b9fdd91bb4: [AMDGPU] Don't handle export done when unify exit nodes. Jul 14 2021, 12:07 AM

Revision Contents

Path                                                       Size
llvm/lib/Target/AMDGPU/AMDGPUUnifyDivergentExitNodes.cpp   4 lines
llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll             16 lines

Diff 232793

llvm/lib/Target/AMDGPU/AMDGPUUnifyDivergentExitNodes.cpp

[189 lines not shown]
   for (BasicBlock *BB : ReturningBlocks) {
     simplifyCFG(BB, TTI, {2});
   }

   return NewRetBlock;
 }

 bool AMDGPUUnifyDivergentExitNodes::runOnFunction(Function &F) {
   auto &PDT = getAnalysis<PostDominatorTreeWrapperPass>().getPostDomTree();
-  if (PDT.getRoots().size() <= 1)
-    return false;

   LegacyDivergenceAnalysis &DA = getAnalysis<LegacyDivergenceAnalysis>();

   // Loop over all of the blocks in a function, tracking all of the blocks that
   // return.
   SmallVector<BasicBlock *, 4> ReturningBlocks;
   SmallVector<BasicBlock *, 4> UnreachableBlocks;

[108 lines not shown]
     if (!ReturningBlocks.empty()) {
       ReturningBlocks.push_back(UnreachableBlock);
     }
   }

   // Now handle return blocks.
   if (ReturningBlocks.empty())
     return false; // No blocks return

-  if (ReturningBlocks.size() == 1)
+  if (ReturningBlocks.size() == 1 && !InsertExport)
     return false; // Already has a single return block

   const TargetTransformInfo &TTI
     = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);

   unifyReturnBlockSet(F, ReturningBlocks, InsertExport, TTI, "UnifiedReturnBlock");
   return true;
 }

llvm/test/CodeGen/AMDGPU/kill-infinite-loop.ll

[39 lines not shown]
 loop:
   call void @llvm.amdgcn.kill(i1 false) #3
   br label %loop

 end:
   call void @llvm.amdgcn.exp.compr.v2i16(i32 0, i32 5, <2 x i16> < i16 0, i16 0 >, <2 x i16> < i16 0, i16 0 >, i1 true, i1 true) #3
   ret void
 }

+; test the case where there's only a kill in an infinite loop
+; CHECK-LABEL: only_kill
+; CHECK: exp null off, off, off, off done vm
+; CHECK-NEXT: s_endpgm
+; SIInsertSkips inserts an extra null export here, but it should be harmless.
+; CHECK: exp null off, off, off, off done vm
+; CHECK-NEXT: s_endpgm
+define amdgpu_ps void @only_kill() #0 {
+main_body:
+  br label %loop
+
+loop:
+  call void @llvm.amdgcn.kill(i1 false) #3
+  br label %loop
+}
+
 ; In case there's an epilog, we shouldn't have to do this.
 ; CHECK-LABEL: return_nonvoid
 ; CHECK-NOT: exp null off, off, off, off done vm
 define amdgpu_ps float @return_nonvoid(float %0) #0 {
 main_body:
   %cmp = fcmp olt float %0, 1.000000e+01
   br i1 %cmp, label %end, label %loop

[13 lines not shown]