This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Do not annotate an else branch if there is a kill
ClosedPublic

Authored by critson on Feb 24 2021, 5:09 PM.

Details

Reviewers

arsenm
piotr

Commits

rGf08dadd242fd: [AMDGPU] Do not annotate an else branch if there is a kill

Summary

As llvm.amdgcn.kill is lowered to a terminator it can cause
else branch annotations to end up in the wrong block.
Do not annotate conditionals as else branches where there is
a kill to avoid this.
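
The guard described in the summary can be modelled outside LLVM. What follows is a minimal sketch under stated assumptions, not the committed code: `Inst`, `Block`, `blockHasKill` and `shouldAnnotateElse` are hypothetical names invented for this illustration, and `LooksLikeElse` stands in for whatever else detection the pass already performs.

```cpp
#include <vector>

// Hypothetical stand-ins for illustration only; these are not LLVM classes.
struct Inst {
  bool IsKill = false; // models a call to llvm.amdgcn.kill
};

struct Block {
  std::vector<Inst> Insts;
  bool LooksLikeElse = false; // models the pass's existing else detection
};

// Linear scan with early exit: true if any instruction in the block is a kill.
bool blockHasKill(const Block &BB) {
  for (const Inst &I : BB.Insts)
    if (I.IsKill)
      return true;
  return false;
}

// Treat the block as an else branch only when it contains no kill.
bool shouldAnnotateElse(const Block &BB) {
  return BB.LooksLikeElse && !blockHasKill(BB);
}
```

In this model a block that would otherwise be annotated as an else branch is left alone as soon as it contains a kill, so it falls through to the ordinary close-and-reopen handling.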

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

critson created this revision. Feb 24 2021, 5:09 PM

Herald added subscribers: kerbowa, jfb, hiraditya and 7 others. Feb 24 2021, 5:09 PM

critson requested review of this revision. Feb 24 2021, 5:09 PM

Herald added a project: Restricted Project. Feb 24 2021, 5:09 PM

Herald added subscribers: llvm-commits, wdng.

With respect to the searching that this adds: I ran this on 14069 graphics pipelines, and across them the search was invoked 13735 times.
On average 5.5 instructions were visited per search, and the worst case was 33 instructions.
So I think the cost of doing this is low.

I am still in two minds about whether to do this or stop lowering kills to block terminators, but this is the smaller change.
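
Figures such as the invocation count, mean and worst case quoted above could be gathered with instrumentation along the following lines. This is only a sketch under assumptions: `SearchStats` and `hasKillCounted` are hypothetical names, and a block is modelled as a `std::vector<bool>` in which `true` marks a kill. It does not reproduce how the measurements in this review were actually taken.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical accumulator for search-cost statistics.
struct SearchStats {
  std::size_t Searches = 0;  // number of scans performed
  std::size_t Visited = 0;   // total instructions visited over all scans
  std::size_t WorstCase = 0; // most instructions visited by a single scan

  double mean() const {
    return Searches ? static_cast<double>(Visited) / Searches : 0.0;
  }
};

// Scan a modelled block for a kill, stopping at the first hit, and record
// how many instructions were visited.
bool hasKillCounted(const std::vector<bool> &Block, SearchStats &Stats) {
  std::size_t Seen = 0;
  bool Found = false;
  for (bool IsKill : Block) {
    ++Seen;
    if (IsKill) {
      Found = true;
      break;
    }
  }
  ++Stats.Searches;
  Stats.Visited += Seen;
  Stats.WorstCase = std::max(Stats.WorstCase, Seen);
  return Found;
}
```

Because the scan exits at the first kill, the worst case is bounded by the size of the largest block that reaches the check.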

Harbormaster completed remote builds in B90716: Diff 326248. Feb 24 2021, 8:32 PM

Alternatively, you could create the set of blocks containing a kill in advance - by looking at function declarations in the module, and if a kill is found then checking its users to find blocks. Whether that would be better depends on how often hasKill() is called, though.
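
The precomputation suggested above can be sketched in a simplified model. Everything here is an assumption made for illustration: `CallSite`, `collectKillBlocks` and `hasKillCached` are hypothetical names, integer ids stand in for basic blocks, and the `KillUsers` vector stands in for walking the users of the kill declaration in the module.

```cpp
#include <unordered_set>
#include <vector>

// Hypothetical model: each call site of the kill intrinsic records the id of
// the block that contains it.
struct CallSite {
  int ParentBlockId;
};

// Build the set of blocks containing a kill once, up front.
std::unordered_set<int>
collectKillBlocks(const std::vector<CallSite> &KillUsers) {
  std::unordered_set<int> Blocks;
  for (const CallSite &CS : KillUsers)
    Blocks.insert(CS.ParentBlockId);
  return Blocks;
}

// Set lookup replacing a per-block instruction scan.
bool hasKillCached(const std::unordered_set<int> &KillBlocks, int BlockId) {
  return KillBlocks.count(BlockId) != 0;
}
```

The trade-off is the one the comment names: the set costs one pass over the kill call sites, which only pays off if the per-block query is issued often enough.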

llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
187	Bail out if calling convention is not CallingConv::AMDGPU_PS ?

What about kills in transitive successors? I really don't like how kill isn't considered a terminator.

Thank you Piotr and Matt for your comments.

Since I don't really like kills being terminators either, I am going to investigate changing that before pushing this further.
I will return to this if changing kill behaviour seems intractable.

llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp
187	Kills are valid in other calling conventions.

In D97427#2587455, @critson wrote:

Thank you Piotr and Matt for your comments.

Since I don't really like kills being terminators either, I am going to investigate changing that before pushing this further.

No, no. The exact opposite. My problem is the IR kill intrinsic is not a terminator. In the MIR exec modifications should be terminators as that's the only real mechanism for ensuring correct spill placement around them

In D97427#2587457, @arsenm wrote:

In D97427#2587455, @critson wrote:

Since I don't really like kills being terminators either, I am going to investigate changing that before pushing this further.

No, no. The exact opposite. My problem is the IR kill intrinsic is not a terminator. In the MIR exec modifications should be terminators as that's the only real mechanism for ensuring correct spill placement around them

Ah sorry, I did indeed misread your "isn't" as "is".
Can intrinsics even be terminators?
Or are you suggesting that we should split the block at the kill in the IR level very early in the backend?

In D97427#2587522, @critson wrote:

In D97427#2587457, @arsenm wrote:

In D97427#2587455, @critson wrote:

Since I don't really like kills being terminators either, I am going to investigate changing that before pushing this further.

No, no. The exact opposite. My problem is the IR kill intrinsic is not a terminator. In the MIR exec modifications should be terminators as that's the only real mechanism for ensuring correct spill placement around them

Ah sorry, I did indeed misread your "isn't" as "is".
Can intrinsics even be terminators?

Almost. Recently the callbr instruction was added. I was thinking about using it for kills, but I haven't thought about this too carefully

Or are you suggesting that we should split the block at the kill in the IR level very early in the backend?

My initial thought is splitting blocks in the IR wouldn't be helpful. IR transforms are more likely to glue these blocks right back together again. Overall I'm more interested in moving towards the wave transform control flow lowering rather than thinking about how to improve the current IR pass flow

In D97427#2587524, @arsenm wrote:

In D97427#2587522, @critson wrote:

Or are you suggesting that we should split the block at the kill in the IR level very early in the backend?

My initial thought is splitting blocks in the IR wouldn't be helpful. IR transforms are more likely to glue these blocks right back together again. Overall I'm more interested in moving towards the wave transform control flow lowering rather than thinking about how to improve the current IR pass flow

I agree that focusing our energy on wave transform control flow is the way to go.
However, this is a legitimate bug causing Piglet test failures for Mesa so it would be good to fix it.

Returning to your question about "transitive successors".
I do not think this is an issue -- the problem is with else being put in the same block as a kill, causing that kill to be executed as if it were part of the preceding if-branch.
If we place an else in a transitive successor then that implies the kill was genuinely part of the if-branch, which is fine.

Ping

arsenm accepted this revision. Mar 10 2021, 5:54 AM

This revision is now accepted and ready to land. Mar 10 2021, 5:54 AM

In D97427#2599162, @critson wrote:

Returning to your question about "transitive successors".
I do not think this is an issue -- the problem is with else being put in the same block as a kill, causing that kill to be executed as if it were part of the preceding if-branch.
If we place an else in a transitive successor then that implies the kill was genuinely part of the if-branch, which is fine.

You could theoretically have a split block that doesn't actually change control flow (i.e. one that just unconditionally branches to another block that only has the one predecessor). That's unlikely to happen in practice since those will get put back together again. Given that at this point we're just sort of waiting for this scheme to die, I wouldn't worry too much about this edge case

Closed by commit rGf08dadd242fd: [AMDGPU] Do not annotate an else branch if there is a kill (authored by critson). Mar 11 2021, 7:02 PM

This revision was automatically updated to reflect the committed changes.

critson added a commit: rGf08dadd242fd: [AMDGPU] Do not annotate an else branch if there is a kill.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp	13 lines
llvm/test/CodeGen/AMDGPU/si-annotate-cf-kill.ll	130 lines

Diff 330126

llvm/lib/Target/AMDGPU/SIAnnotateControlFlow.cpp

[65 lines not shown] class SIAnnotateControlFlow : public FunctionPass {
   bool isTopOfStack(BasicBlock *BB);

   Value *popSaved();

   void push(BasicBlock *BB, Value *Saved);

   bool isElse(PHINode *Phi);

+  bool hasKill(const BasicBlock *BB);
+
   void eraseIfUnused(PHINode *Phi);

   void openIf(BranchInst *Term);

   void insertElse(BranchInst *Term);

   Value *
   handleLoopCondition(Value *Cond, PHINode *Broken, llvm::Loop *L,
[94 lines not shown] if (Phi->getIncomingBlock(i) == IDom) {
       if (Phi->getIncomingValue(i) != BoolFalse)
         return false;

     }
   }
   return true;
 }

+bool SIAnnotateControlFlow::hasKill(const BasicBlock *BB) {
+  for (const Instruction &I : *BB) {
Inline comment (piotr, Not Done): Bail out if calling convention is not CallingConv::AMDGPU_PS ?
Inline comment (critson, Done): Kills are valid in other calling conventions.
+    if (const CallInst *CI = dyn_cast<CallInst>(&I))
+      if (CI->getIntrinsicID() == Intrinsic::amdgcn_kill)
+        return true;
+  }
+  return false;
+}
+
 // Erase "Phi" if it is not used any more
 void SIAnnotateControlFlow::eraseIfUnused(PHINode *Phi) {
   if (RecursivelyDeleteDeadPHINode(Phi)) {
     LLVM_DEBUG(dbgs() << "Erased unused condition phi\n");
   }
 }

 /// Open a new "If" block
[142 lines not shown] if (I.nodeVisited(Term->getSuccessor(1))) {

       if (DT->dominates(Term->getSuccessor(1), BB))
         handleLoop(Term);
       continue;
     }

     if (isTopOfStack(BB)) {
       PHINode *Phi = dyn_cast<PHINode>(Term->getCondition());
-      if (Phi && Phi->getParent() == BB && isElse(Phi)) {
+      if (Phi && Phi->getParent() == BB && isElse(Phi) && !hasKill(BB)) {
         insertElse(Term);
         eraseIfUnused(Phi);
         continue;
       }

       closeControlFlow(BB);
     }

[15 lines not shown]

llvm/test/CodeGen/AMDGPU/si-annotate-cf-kill.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -march=amdgcn -mcpu=verde -verify-machineinstrs | FileCheck --check-prefix=SI %s
				; RUN: llc < %s -march=amdgcn -mcpu=tonga -mattr=-flat-for-global -verify-machineinstrs | FileCheck --check-prefix=FLAT %s

				define amdgpu_ps float @uniform_kill(float %a, i32 %b, float %c) {
				; SI-LABEL: uniform_kill:
				; SI: ; %bb.0: ; %entry
				; SI-NEXT: v_cvt_i32_f32_e32 v0, v0
				; SI-NEXT: s_mov_b64 s[0:1], exec
				; SI-NEXT: s_mov_b64 s[2:3], -1
				; SI-NEXT: v_or_b32_e32 v0, v1, v0
				; SI-NEXT: v_and_b32_e32 v0, 1, v0
				; SI-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0
				; SI-NEXT: s_and_saveexec_b64 s[4:5], vcc
				; SI-NEXT: ; %bb.1: ; %if1
				; SI-NEXT: s_xor_b64 s[2:3], exec, -1
				; SI-NEXT: ; %bb.2: ; %endif1
				; SI-NEXT: s_or_b64 exec, exec, s[4:5]
				; SI-NEXT: s_wqm_b64 s[4:5], s[2:3]
				; SI-NEXT: s_xor_b64 s[4:5], s[4:5], exec
				; SI-NEXT: s_andn2_b64 s[0:1], s[0:1], s[4:5]
				; SI-NEXT: s_cbranch_scc0 BB0_6
				; SI-NEXT: ; %bb.3: ; %endif1
				; SI-NEXT: s_and_b64 exec, exec, s[0:1]
				; SI-NEXT: v_mov_b32_e32 v0, 0
				; SI-NEXT: s_and_saveexec_b64 s[0:1], s[2:3]
				; SI-NEXT: s_cbranch_execz BB0_5
				; SI-NEXT: ; %bb.4: ; %if2
				; SI-NEXT: s_mov_b32 s3, 0
				; SI-NEXT: v_add_f32_e32 v0, 1.0, v2
				; SI-NEXT: s_load_dwordx4 s[4:7], s[2:3], 0x0
				; SI-NEXT: v_cvt_i32_f32_e32 v0, v0
				; SI-NEXT: s_waitcnt lgkmcnt(0)
				; SI-NEXT: buffer_atomic_swap v0, off, s[4:7], 0 offset:4 glc
				; SI-NEXT: s_waitcnt vmcnt(0) expcnt(0)
				; SI-NEXT: v_cvt_f32_i32_e32 v0, v0
				; SI-NEXT: BB0_5: ; %endif2
				; SI-NEXT: s_or_b64 exec, exec, s[0:1]
				; SI-NEXT: s_branch BB0_7
				; SI-NEXT: BB0_6:
				; SI-NEXT: s_mov_b64 exec, 0
				; SI-NEXT: exp null off, off, off, off done vm
				; SI-NEXT: s_endpgm
				; SI-NEXT: BB0_7:
				;
				; FLAT-LABEL: uniform_kill:
				; FLAT: ; %bb.0: ; %entry
				; FLAT-NEXT: v_cvt_i32_f32_e32 v0, v0
				; FLAT-NEXT: s_mov_b64 s[0:1], exec
				; FLAT-NEXT: s_mov_b64 s[2:3], -1
				; FLAT-NEXT: v_or_b32_e32 v0, v1, v0
				; FLAT-NEXT: v_and_b32_e32 v0, 1, v0
				; FLAT-NEXT: v_cmp_eq_u32_e32 vcc, 1, v0
				; FLAT-NEXT: s_and_saveexec_b64 s[4:5], vcc
				; FLAT-NEXT: ; %bb.1: ; %if1
				; FLAT-NEXT: s_xor_b64 s[2:3], exec, -1
				; FLAT-NEXT: ; %bb.2: ; %endif1
				; FLAT-NEXT: s_or_b64 exec, exec, s[4:5]
				; FLAT-NEXT: s_wqm_b64 s[4:5], s[2:3]
				; FLAT-NEXT: s_xor_b64 s[4:5], s[4:5], exec
				; FLAT-NEXT: s_andn2_b64 s[0:1], s[0:1], s[4:5]
				; FLAT-NEXT: s_cbranch_scc0 BB0_6
				; FLAT-NEXT: ; %bb.3: ; %endif1
				; FLAT-NEXT: s_and_b64 exec, exec, s[0:1]
				; FLAT-NEXT: v_mov_b32_e32 v0, 0
				; FLAT-NEXT: s_and_saveexec_b64 s[0:1], s[2:3]
				; FLAT-NEXT: s_cbranch_execz BB0_5
				; FLAT-NEXT: ; %bb.4: ; %if2
				; FLAT-NEXT: s_mov_b32 s3, 0
				; FLAT-NEXT: v_add_f32_e32 v0, 1.0, v2
				; FLAT-NEXT: s_load_dwordx4 s[4:7], s[2:3], 0x0
				; FLAT-NEXT: v_cvt_i32_f32_e32 v0, v0
				; FLAT-NEXT: s_waitcnt lgkmcnt(0)
				; FLAT-NEXT: buffer_atomic_swap v0, off, s[4:7], 0 offset:4 glc
				; FLAT-NEXT: s_waitcnt vmcnt(0)
				; FLAT-NEXT: v_cvt_f32_i32_e32 v0, v0
				; FLAT-NEXT: BB0_5: ; %endif2
				; FLAT-NEXT: s_or_b64 exec, exec, s[0:1]
				; FLAT-NEXT: s_branch BB0_7
				; FLAT-NEXT: BB0_6:
				; FLAT-NEXT: s_mov_b64 exec, 0
				; FLAT-NEXT: exp null off, off, off, off done vm
				; FLAT-NEXT: s_endpgm
				; FLAT-NEXT: BB0_7:
				entry:
				%.1 = fptosi float %a to i32
				%.2 = or i32 %b, %.1
				%.3 = and i32 %.2, 1
				%.not = icmp eq i32 %.3, 0
				br i1 %.not, label %endif1, label %if1

				if1:
				br i1 false, label %if3, label %endif1

				if3:
				br label %endif1

				endif1:
				%.0 = phi i1 [ false, %if3 ], [ false, %if1 ], [ true, %entry ]
				%.4 = call i1 @llvm.amdgcn.wqm.vote(i1 %.0)
				; This kill must be uniformly executed
				call void @llvm.amdgcn.kill(i1 %.4)
				%.test0 = fadd nsz arcp float %c, 1.0
				%.test1 = fptosi float %.test0 to i32
				br i1 %.0, label %if2, label %endif2

				if2:
				%.5 = getelementptr inbounds <4 x i32>, <4 x i32> addrspace(6)* undef, i32 31, !amdgpu.uniform !0
				%.6 = load <4 x i32>, <4 x i32> addrspace(6)* %.5, align 16, !invariant.load !0
				%.7 = call i32 @llvm.amdgcn.raw.buffer.atomic.swap.i32(i32 %.test1, <4 x i32> %.6, i32 4, i32 0, i32 0)
				%.8 = sitofp i32 %.7 to float
				br label %endif2

				endif2:
				%.9 = phi float [ %.8, %if2 ], [ 0.0, %endif1 ]
				ret float %.9
				}


				declare i32 @llvm.amdgcn.raw.buffer.atomic.swap.i32(i32, <4 x i32>, i32, i32, i32 immarg) #2
				declare i1 @llvm.amdgcn.wqm.vote(i1) #3
				declare void @llvm.amdgcn.kill(i1) #4
				declare float @llvm.amdgcn.wqm.f32(float) #1

				attributes #1 = { nounwind readnone speculatable willreturn }
				attributes #2 = { nounwind willreturn }
				attributes #3 = { convergent nounwind readnone willreturn }
				attributes #4 = { nounwind }

				!0 = !{}